Unplanned Maintenance: Historical Data — Librato Blog

Unplanned Maintenance: Historical Data

During a planned in-place capacity upgrade of our horizontally scalable data store, we ran into some subtle and unanticipated issues. As a result, we need to ease read pressure until the switchover has stabilized, so we're temporarily disabling all read access to the summarized (1m, 15m, 60m) data.

We're still working out an estimate for the duration of this impact, and we will update this space and our Twitter account (@librato) as that estimate firms up and our engineers make progress towards resolving the issue. We're truly sorry for the inconvenience this has caused and are taking steps to ensure that it won't happen on future upgrades. Once a final resolution is in place, we'll put together a more detailed post-mortem.

UPDATE (Feb. 15, 05:40:14 UTC) :

We've now completely isolated the affected storage instances, and standard summarizations (1m, 15m, 60m) on new measurements are taking place. Access to these new summarizations will be restored tomorrow. In parallel, we will start importing all previously existing historical data. Once the import is finished, the incident will be resolved. We'll provide a more concrete estimate of the import's duration tomorrow.

UPDATE (Feb. 15, 18:11:14 UTC) :

Access is now available for all data and summarizations from Feb. 15, 3:15:00 UTC onwards. We're making progress on importing older data and will update as it becomes available.

UPDATE (Feb. 16, 19:25:34 UTC) :

Access is now available for all historically summarized data up until Sunday, Feb. 12, 17:00:00 UTC. Some metrics additionally have summarized data available through roughly Feb. 13, 19:00:00 UTC, but which metrics have it is non-deterministic. We're now making a final attempt to retrieve summarized data for the interval from Feb. 12, 17:00:00 UTC to Feb. 15, 3:00:00 UTC.

RESOLVED (Feb. 17, 17:00:00 UTC) :

Attempts to recover summarized data from the affected interval (Feb. 12 to Feb. 15) have been unsuccessful. Once our internal review is complete, we'll post a detailed post-mortem describing the root cause of this incident, the steps taken to mitigate the damage, and the changes we are making to ensure it won't happen again. Data integrity has always been our utmost concern, and we consider loss of any kind unacceptable. As a token of our contrition, we will not be billing for any Metrics usage during the affected interval.