December 22, 2021: Incorrect player uptime metrics

Incident Description:

The AWS outage on 12/14 created a backlog of unprocessed player data. This backlog caused a significant drop in reported network uptime rates and caused alert failures between 12/14 and 12/22.

On 12/23, a backend bug was introduced that created a second backlog of unprocessed player data. This again caused a significant drop in reported network uptime rates and caused alert failures between 12/23 and 12/29.

As a result, uptime data was inaccurate from 12/15 to 12/29. Throughout this period, players operated normally and there was no impact on ad delivery.

Product Impacted: 

  • Cortex - uptime reports, alerts

Resolution:

To remediate this issue, we increased our server capacity to process the backlogged data. As of 12/30, the queue is completely cleared, so uptime metrics and alerts are functioning normally. You will, however, see lower than usual uptime metrics if you pull historical reports for dates between 12/14 and 12/29.

On 1/25, the Cortex engineering team backfilled the uptime data to correct the metrics for 12/14-12/29. An alert was published to the Fleet UI.

Our engineering team is taking steps to improve our monitoring system to catch these types of issues earlier. 
