December 22, 2021: Incorrect player uptime metrics

Incident Description:

The AWS outage on 12/14 resulted in a backlog of player data that needed to be processed. This caused a significant drop in reported network uptime rates, as well as alert failures, between 12/14 and 12/22.

On 12/23, a backend bug was introduced, which also resulted in a backlog of player data that needed to be processed. This caused a significant drop in reported network uptime rates, as well as alert failures, between 12/23 and 12/29.

As a result, uptime data is corrupt from 12/14 to 12/29. During this period, players operated normally and there was no impact on ad delivery.

Product Impacted: 

  • Cortex - uptime reports, alerts

Resolution:

To remediate this issue, we increased our server capacity to process the backlogged data. As of 12/30, the queue is completely cleared, so uptime metrics and alerts are functioning normally. You will, however, see lower-than-usual uptime metrics if you pull historical reports for dates between 12/14 and 12/29.

The Cortex engineering team is working to backfill the uptime data to correct the metrics between 12/14 and 12/29. An alert will be published to the Fleet UI once this is resolved.

Our engineering team is taking steps to improve our monitoring system to catch these types of issues earlier. 
