“Monitoring systems need to be more available and scalable than the systems being monitored” - Adrian Cockcroft
On February 28th, 2017, many companies felt the effects of a failure in components of the S3 object storage service offered by Amazon Web Services (AWS). Since, according to the Synergy Research Group, AWS owns more than 40% of the public cloud market, this outage potentially affected a large number of systems hosted in the cloud.
Because Librato is built on AWS and relies on services like S3, or services dependent on S3, we were not immune to the S3 downtime either. However, because our infrastructure is built with the Cockcroft quote above in mind, we were able to weather the storm with minimal disruption to our customers, and absolutely no loss of customer data.
At Librato we are heavy users of EBS. For the longest time, we chose not to rely on EBS because of past issues. However, in the past year or so we’ve come to embrace EBS heavily, as we discussed publicly in our talk at AWS re:Invent 2017. Unfortunately, since EBS is backed by S3, the problems on February 28th affected a number of our systems.
Fortunately, Librato is designed to handle failure relatively gracefully. Our system of record, Cassandra, is deployed across multiple availability zones in AWS, so we can lose a number of systems and still continue to accept and retrieve customer data with no adverse effects. This is exactly what happened during this disruption. We did get a number of notices about file system problems on our nodes, but because of the way we are deployed, we were never in danger of losing any data, and our metrics and monitoring pipeline continued to operate normally.
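To make the availability argument concrete, here is a minimal sketch of the arithmetic behind surviving the loss of a zone. It assumes one replica per availability zone, a replication factor of three, and majority (quorum) reads and writes. None of these settings are stated above; they are illustrative assumptions, and the zone names are hypothetical.

```python
from typing import Dict, Set


def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def survives(replicas_per_zone: Dict[str, int], failed_zones: Set[str]) -> bool:
    """Return True if a quorum of replicas is still reachable.

    replicas_per_zone maps a zone name to the number of replicas placed there.
    The placement is an assumption for illustration, not a documented setup.
    """
    total = sum(replicas_per_zone.values())
    alive = sum(
        count
        for zone, count in replicas_per_zone.items()
        if zone not in failed_zones
    )
    return alive >= quorum(total)


# Hypothetical placement: one replica in each of three zones.
PLACEMENT = {"zone-a": 1, "zone-b": 1, "zone-c": 1}
```

Under these assumptions, losing any single zone leaves two of three replicas reachable, which still meets the quorum of two. Losing two zones would not.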
The most impacted feature was our ability to provide Librato Snapshots to our customers. Snapshots are point-in-time views of a chart that let customers share charts in email, Slack rooms, etc. This capability depends on S3’s static website hosting, as we need a place to store and serve the snapshotted images.
When the failure occurred, we were temporarily unable to write snapshots to our regular S3 buckets and make them available to our customers. “We’ll just make a bucket in another region and use that!” we thought, and quickly set about making the necessary changes. Unfortunately, while most APIs in other regions worked fine during the disruption, the S3 API (as it is a global service) did not. We were unable to create a bucket in another region and were therefore stuck.
Then we remembered that we already had a bucket built just for snapshots in another region. It is part of our extended internal monitoring systems, which give us complete visibility into the Librato infrastructure at all times. We quickly changed our configuration to take advantage of this existing bucket, and minutes later we were happily serving Librato snapshots for all our customers once again.
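The workaround amounts to writing to a bucket that already exists in another region when the primary one is unavailable. A minimal sketch of that idea follows, assuming a list of pre-created buckets and an injected `put` function standing in for the real storage client. The bucket names, `write_with_fallback`, and `make_fake_put` are all hypothetical; during the incident the switch was a manual configuration change, not automated code like this.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

# Hypothetical bucket names, for illustration only.
PRIMARY_BUCKET = "snapshots-primary"
STANDBY_BUCKET = "snapshots-standby-other-region"

PutFn = Callable[[str, str, bytes], None]


class AllBucketsFailed(Exception):
    """Raised when no candidate bucket accepted the write."""


def write_with_fallback(
    key: str, data: bytes, buckets: Sequence[str], put: PutFn
) -> str:
    """Try each pre-created bucket in order; return the one that worked."""
    errors: List[Tuple[str, Exception]] = []
    for bucket in buckets:
        try:
            put(bucket, key, data)
            return bucket
        except Exception as exc:  # a real client would catch narrower errors
            errors.append((bucket, exc))
    raise AllBucketsFailed(errors)


def make_fake_put(
    unavailable: Set[str], store: Dict[Tuple[str, str], bytes]
) -> PutFn:
    """Build an in-memory stand-in for a storage client, for demonstration."""

    def put(bucket: str, key: str, data: bytes) -> None:
        if bucket in unavailable:
            raise ConnectionError(f"{bucket} is unavailable")
        store[(bucket, key)] = data

    return put
```

The important design point is that every bucket in the list must exist before the outage, since creating one during the disruption was not possible.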
The last issue was not directly related to the S3 outage itself, but to the number of alerts we were generating for our customers as a result of it. One of the databases in our alerting system, which typically holds on the order of single-digit megabytes, soon began storing gigabytes of data. Eventually it overran its storage limits, and we quickly fixed the problem by adding storage.
At Librato, we pride ourselves on being more available and scalable than the systems that we monitor. We were very pleased that we were able to maintain a high level of service availability for our customers while many components of the infrastructure we rely on were experiencing significant distress. We have already created extra S3 buckets so that, if a similar outage occurs, we will not need to repurpose an existing bucket again.
Our ability to rapidly adapt to changing situations is one of the reasons that Librato runs on AWS. Our ability to maintain and operate available and scalable infrastructure is one of the reasons our customers rely on us to help them operate their businesses. We’re thankful every day for the faith our customers place in us to do just that.