Incident Description:
The AWS outage on 12/14 resulted in a backlog of player data that needed to be processed. This caused a significant drop in network uptime rates and caused alert failures between 12/14 and 12/22.
On 12/23 a backend bug was introduced which also resulted in a backlog of player data that needed to be processed. This caused a significant drop in network uptime rates and caused alert failures between 12/23 and 12/29.
Therefore uptime data was corrupt from 12/15-12/29. During the outage, players operated normally and there was no impact on ad delivery.
Product Impacted:
- Cortex - uptime reports, alerts
Resolution:
To remediate this issue, we increased our server capacity to process the backlogged data. As of 12/30, the queue is completely cleared so uptime metrics and alerts are functioning normally. You will, however, see lower than usual uptime metrics if you pull historical reports between 12/14 and 12/29.
The Cortex engineering team backlogged the uptime data to correct the metrics between 12/14-12/29 on 1/25. An alert was published to the Fleet UI.
Our engineering team is taking steps to improve our monitoring system to catch these types of issues earlier.