S3 Outage Post Mortem

Background

On 2/28/17 the S3 service had an outage, as documented on the AWS status page:

  • 11:37 AM PST We can confirm high error rates for requests made to S3 in the US-EAST-1 Region. We've identified the issue and are working to restore normal operations.
  • 12:54 PM PST We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour.
  • 1:13 PM PST S3 Object retrieval, listing and deletion are fully recovered now. We are still working to recover normal operations for adding new objects to S3.

More details on the outage can be found here: https://aws.amazon.com/message/41926/

The outage affected our ability to deliver creative and to record Proofs of Play (PoPs) in reporting, as detailed in the following sections.

Creative Delivery

Vistar Media is responsible for the transcoding and storage of creative assets used during ad serving. Currently, we store all assets in an S3 bucket and point the DNS record assets.vistarmedia.com to the underlying bucket URL. As a result of this outage, requests to assets.vistarmedia.com failed. We were unable to move the assets to a different provider because we could not access S3 to retrieve them.

For media owners using Cortex, the Cortex client was able to use cached assets it keeps for 6 hours. After the cache entries expired, the Cortex screens were unable to display ads, but continued to function normally, displaying fallback content.
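The cache-then-fallback behavior described above can be illustrated with a minimal sketch. This is not the Cortex client's actual code: the names `AssetCache` and `choose_content`, and the order of checks (cache, then network, then fallback), are assumptions made for illustration. Only the 6-hour lifetime comes from this article.

```python
from typing import Callable, Dict, Optional, Tuple

CACHE_TTL_SECONDS = 6 * 60 * 60  # 6 hours, as stated in this article


class AssetCache:
    """Hypothetical in-memory cache keyed by asset URL."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[bytes, float]] = {}

    def put(self, url: str, data: bytes, now: float) -> None:
        self._entries[url] = (data, now)

    def get(self, url: str, now: float) -> Optional[bytes]:
        entry = self._entries.get(url)
        if entry is None:
            return None
        data, stored_at = entry
        if now - stored_at >= CACHE_TTL_SECONDS:
            return None  # entry has expired
        return data


def choose_content(
    cache: AssetCache,
    url: str,
    now: float,
    fetch: Callable[[str], bytes],
    fallback: bytes,
) -> bytes:
    """Return a fresh cached asset, else fetch it, else use fallback content."""
    cached = cache.get(url, now)
    if cached is not None:
        return cached
    try:
        data = fetch(url)
    except Exception:
        # Asset host unreachable (for example, during a storage outage).
        return fallback
    cache.put(url, data, now)
    return data
```

Under these assumptions, a screen keeps showing an ad for up to 6 hours after the asset host becomes unreachable, then shows fallback content until fetches succeed again.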

We will begin storing assets in a redundant location outside of S3. In the event of a future S3 outage we will be able to change the assets.vistarmedia.com DNS record to fall back to the backup storage. When this change is made, media owners will be notified via our release notes, and the details will be appended to this article as well.

Reporting

When we receive a Proof of Play, we log a spend record to disk in an append-only file. When the log file reaches a certain age or size, we upload it to S3, emit an SQS message so the reporting system can ingest the file, and create a new log file for incoming messages.
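The rotation rule ("a certain age or size") can be sketched as follows. The threshold values and the names `SpendLog` and `should_rotate` are hypothetical. The article does not state the real limits or how the ad server implements the check.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the real values are not given in this article.
MAX_AGE_SECONDS = 300
MAX_SIZE_BYTES = 64 * 1024 * 1024


@dataclass
class SpendLog:
    """Tracks the metadata needed to decide when to rotate a log file."""

    opened_at: float  # epoch seconds when the file was created
    size_bytes: int = 0

    def append(self, record: bytes) -> None:
        # A real implementation would also write the record to disk.
        self.size_bytes += len(record)


def should_rotate(log: SpendLog, now: float) -> bool:
    """Rotate when the file is old enough or large enough, whichever comes first."""
    age = now - log.opened_at
    return age >= MAX_AGE_SECONDS or log.size_bytes >= MAX_SIZE_BYTES
```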

During the S3 outage, the log upload was failing. The underlying issue was that the logging code deleted the file from disk regardless of whether the S3 upload succeeded. Had the file been kept on disk, we could have manually loaded it into reporting after the outage was over. This issue has been fixed.
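The principle behind the fix, deleting the local file only after a confirmed upload, can be sketched as below. This illustrates the idea only and is not the actual patch. `ship_log`, `upload`, and `notify` are hypothetical names, with `upload` standing in for the S3 upload and `notify` for the SQS message.

```python
import os
from typing import Callable


def ship_log(
    path: str,
    upload: Callable[[str], None],
    notify: Callable[[str], None],
) -> bool:
    """Upload a rotated log, then notify and delete it only if the upload succeeded."""
    try:
        upload(path)
    except Exception:
        # Upload failed: keep the file on disk so it can be retried
        # or loaded manually once storage is available again.
        return False
    notify(path)
    # Reached only after both the upload and the notification succeeded.
    os.remove(path)
    return True
```

If `notify` raises, the exception propagates and the file also stays on disk, so no step after a failure can discard spend records.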

The secondary issue was that we could not hotfix this because our deploys depend on S3. We had to manually deploy a patched WAR to all ad servers, which was time-consuming and error-prone. We have scheduled work to force the ad server to fall back to house campaigns during an outage.

The log files we were able to save due to the patched WAR have been manually loaded into reporting. However, we have a hole in our reporting data for the period between the outage start and when the patched ad servers went live.

Our budgeting/pacing system tracks impressions and cost separately from our reporting system and was not affected by the S3 outage. This means the recorded spend for each campaign will be higher from the ad server’s perspective than what appears in reporting and the pacing emails. Affected campaigns may stop spending earlier than reporting or the pacing emails would indicate. For media owners using ad serving for their direct campaigns, we will provide a list of affected line items, along with the potential impact in terms of lost impressions in reporting. Media owners can then decide how best to handle the situation (e.g. ignore if negligible, or increase line item budgets to account for lost spend).

Exchange

For exchange media owners whose inventory was targeted during the outage, there may be ad plays that were not recorded in reporting. The potential loss on a per media owner basis was negligible. We have decided not to credit media owners as the amounts are very low, and there is no way for us to know the correct amount with certainty. Also, buyers will not be charged for any ad plays that were not recorded in reporting. Of course, media owners should reach out if there are questions or concerns.
