February 12th Incident: Post Mortem — Librato Blog


Now that we’ve had a chance to fully investigate, document, and mitigate the service degradation we experienced in mid-February, we’ve compiled a detailed post mortem. The intent is to document what happened, how it led to the loss of access to historical data between Feb 12th and Feb 15th, and what we are doing to prevent this type of incident in the future.


During a planned upgrade of our storage tier on February 12th we ran into some unanticipated issues. To reduce load on the existing storage tier while we diagnosed the situation, and to minimize disruption to the Librato Metrics service, we spun up a second storage tier. The service remained accessible during this time, but access to historical data was suspended while we resolved the issues. Access to nearly all of the historical data was ultimately restored, with the exception of some data from Feb 12th to Feb 15th, 2012. We’ve identified the root causes that led to this incident and have taken steps to make certain that similar problems do not happen again. We treat data integrity as our utmost concern and consider the loss of any data unacceptable. As such, we did not bill any of our users for use of the service between Feb 12th and Feb 15th.


On Friday, February 10th, 2012, several of our internal metrics indicated a subtle deterioration in the quality of our Metrics service. Our performance metrics showed longer than expected response times in the Metrics API and delays in generating historical data. After investigation we determined that we needed to upgrade the instances in the storage tier to handle the growth of the service. We scheduled the upgrade for Sunday, February 12th.

During the upgrade we noticed delays in transitioning instances in the storage tier. To reduce load on the service, at 17:00 UTC on Feb 12th we temporarily disabled the batch process that summarizes the raw incoming metrics data into the historical data we present at 60 sec, 15 min and 60 min rollups. Even with historical summarization no longer adding load, we continued to experience high load across the storage tier from the incoming raw data. We later determined that at this point in the upgrade we mistakenly removed some of the existing instances from our scalable storage tier before they had completely replicated to new instances. This human error led to an even greater spike in load as the storage tier worked to repair itself from the redundant data available.

At this point we decided it was best to provision an entirely new storage tier and immediately transition to it in order to restore consistent service for our users. This would also move all load off the original storage tier and give us maximum capacity to repair it. After a transitional window of 3 hours, during which we pushed all new data to both the original and new storage tiers, we switched all user access over to the new storage tier with no interruption to users' access to the last 3 hours of monitoring data. Around 3:00 UTC on Feb 15th we restarted batch historical data summarization on the new storage tier. At that time user access was restored to all newly summarized data.

Even with the reduced load on the original storage tier, we were still having difficulty migrating historical data from it to the new storage tier, because the tier was still attempting to correct the inconsistency introduced by the human error. To restore access as soon as possible we turned our attention to restoring data from one of our regular storage tier backups, taken before the start of the upgrade on Feb 12th. At 5:00 UTC on Feb 16th we began importing the most recent backup into the new storage tier. This backup contained all of the historical data prior to Feb 12th, and by 12:00 UTC on Feb 16th we had successfully imported it. From 12:00 to 21:00 UTC on Feb 16th we continued to work on extracting from the original storage tier the raw data (Feb 12th 17:00 UTC to Feb 15th 3:00 UTC) that the summarization process would need to bring our historical data back to 100% coverage, closing the gap between the imported backup and the launch of the new storage tier. Unfortunately we were unable to migrate this data to the new storage tier.
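
To make the rollup step described above concrete, here is a minimal sketch of summarizing raw samples into 60 sec, 15 min and 60 min buckets. It is an illustration only, not our production batch process: the function names are hypothetical, and averaging is an assumed summary statistic.

```python
from collections import defaultdict
from statistics import mean

# Rollup resolutions named in the post, in seconds: 60 sec, 15 min, 60 min.
ROLLUP_SECONDS = (60, 900, 3600)


def rollup(samples, resolution):
    """Average raw (unix_timestamp, value) samples into fixed-width buckets.

    Averaging is an assumption; the real batch process may compute
    other summary statistics.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of the bucket it falls in.
        bucket_start = timestamp - (timestamp % resolution)
        buckets[bucket_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


def rollup_all(samples):
    """Produce one summarized series per resolution (hypothetical helper)."""
    return {resolution: rollup(samples, resolution) for resolution in ROLLUP_SECONDS}
```

Because each rollup is derived purely from raw samples, a paused summarization job can in principle be rerun later over the same window, which is why recovering the raw data for the gap mattered.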


At no time during the incident did users lose access to real-time metrics. While we were able to maintain the availability of the service during the upgrade and restore all historical data prior to Feb 12th 17:00 UTC, there are some gaps in historical data during the window of time from Feb 12th 17:00 UTC until Feb 15th 3:00 UTC.


Loss of any data is unacceptable and we take full responsibility. The loss was not caused by any particular technology or tool, but was the result of human error during an insufficiently vetted upgrade process, undertaken because of an urgent need for more capacity. During the post mortem we identified the steps that led to this critical outage and clarified the steps that must be followed correctly in the future. Future modifications to our storage tier will be outlined in advance, with contingency plans in place to prevent a repeat of this sort of problem. We now have the ability to spin up redundant storage tiers and simultaneously write new data to both of them. This incident also confirmed the importance of our automated backups of the entire storage tier, and we’re increasing the frequency and coverage of those backups.
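
As a rough illustration of writing new data to two storage tiers at once, the sketch below fans each write out to every tier and reports which ones failed. `InMemoryTier` and `DualWriter` are hypothetical names used only for this example; they are not our actual storage client, and a real tier would need retries and consistency checks.

```python
class InMemoryTier:
    """Hypothetical stand-in for a storage tier, used only for illustration."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def write(self, metric, timestamp, value):
        self.rows.append((metric, timestamp, value))


class DualWriter:
    """Send each write to every tier and report which tiers failed."""

    def __init__(self, tiers):
        self.tiers = list(tiers)

    def write(self, metric, timestamp, value):
        failed = []
        for tier in self.tiers:
            try:
                tier.write(metric, timestamp, value)
            except Exception:
                # Keep writing to the remaining tiers; the caller decides
                # whether to retry or alert on the failed ones.
                failed.append(tier.name)
        return failed
```

Returning the list of failed tiers, rather than raising on the first error, keeps a struggling tier from blocking writes to the healthy one.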

Finally, we use Metrics to track a variety of metrics from our web, application and storage tiers. It was the metrics we track, such as those from our historical rollup batch process, that alerted us to the initial problem with the storage tier before it started seriously affecting users. Since the incident we have identified and begun tracking additional metrics in our storage tier that would have alerted us much earlier to a growing trend, before it became the time-critical problem that resulted in the urgent scale-up.
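
One simple way to catch a growing trend like this is to compare a metric's recent average against its earlier baseline. The sketch below is a hypothetical example of that idea, not the alerting we actually run; the function name, `window`, and `threshold` are illustrative assumptions.

```python
from statistics import mean


def is_trending_up(values, window=5, threshold=1.5):
    """Return True when the recent mean exceeds the earlier mean by `threshold` times.

    `window` and `threshold` are illustrative defaults, not values from the post.
    """
    if len(values) < 2 * window:
        return False  # Not enough history to compare two windows.
    baseline = mean(values[:-window])  # Everything before the recent window.
    recent = mean(values[-window:])    # The most recent `window` samples.
    return baseline > 0 and recent > threshold * baseline
```

For example, feeding it a series of storage-tier response times would flag the point where recent latency runs well above its historical level, ahead of any user-visible degradation.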