Our “Tales of the Ops Team” blog series explores the tools and techniques used by modern metrics-driven operations teams. We’ll get introspective by sharing our own operations adventures, and hopefully stories from other world-class operations folks. With any luck, we’ll learn a lot, get inspired to level-up our Ops chops, and have a bunch of fun.
A few days ago our production API experienced a small glitch -- not the kind of thing that provokes a thorough postmortem, just a momentary network issue of the sort that briefly degrades service.
Glitches like this are to be expected, but when you’re running distributed applications, and especially multi-tenant SaaS, these little hiccups are sometimes the harbingers of disaster -- they cannot be allowed to persist, and should be detected and investigated as quickly as possible. Our glitch that particular morning didn’t grow into anything disastrous, but we thought it might make a nice introduction to this series, one that gives you a feel for what problem detection and diagnosis look like at Librato.
A Culture of ChatOps
About half of our engineers are remote, so unsurprisingly we rely heavily on ChatOps for everything from diabolical plotting to sportsball banter. Because it’s already where we tend to “be” as a company, we’ve put some work into integrating our persistent chat tool with many of the other tools we use. It should be no surprise then, that our first hint something was wrong came by way of chat:
Dr. Manhattan, a special-purpose account used for tools integration, is the means by which our various third-party service providers feed us notifications, and -- like his namesake -- he can talk to all of them at the same time. In this paste he’s letting us know that he’s received two notifications from our own alerting feature. The first alert means that our API is taking longer than normal to look up metric names stored in memcached, while the second indicates that our metrics API is responding slowly in general to HTTP POSTs.
Thanks to our Campfire integration, Dr. M. is also able to tell us the names of the hosts that are breaking the threshold, and their current values. This is alarming enough, but these alerts are quickly followed by more. First comes a notification from our log processing system.
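The mechanics here are simple: the alerting service POSTs a webhook payload to the bot, which renders it as a one-line message in channel. A minimal sketch of that rendering step might look like the following -- note that the payload shape, metric name, and link are invented for illustration, not Librato's actual webhook format:

```python
# Hypothetical sketch: turn an alert webhook payload into the kind of
# one-line channel message Dr. M. posts. The payload fields used here
# (name, threshold, measurements, link) are assumptions for illustration.

def format_alert(payload):
    """Render an alert payload as a single chat line."""
    # One "host=value" pair per measurement that crossed the threshold
    hosts = ", ".join(
        f"{m['host']}={m['value']}" for m in payload["measurements"]
    )
    return (f"ALERT {payload['name']}: threshold {payload['threshold']} "
            f"crossed on {hosts} -- {payload['link']}")

msg = format_alert({
    "name": "api.memcached.get_names.latency",
    "threshold": 50,
    "measurements": [{"host": "api-7", "value": 4100},
                     {"host": "api-11", "value": 3800}],
    "link": "https://metrics.example.com/alerts/1234",
})
```

Keeping the host names and current values in the message itself is what lets everyone triage from the scrollback without clicking through.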
We’ve configured rsyslog on our AWS hosts to send a subset of our Nginx logs to a third-party log management service (Papertrail). About the same time those metrics crossed threshold, Papertrail noticed some HTTP 502 errors in our logs, and is sending them to Dr. M., who is listing them in channel. Some of these lines indicate that a small number of requests are failing to post. Not good.
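For the curious, the rsyslog side of that pipeline is just a file-tailing input and a forwarding rule. This is an illustrative sketch only -- the file path, tag, hostname, and port are placeholders, not our actual configuration:

```
# /etc/rsyslog.d/papertrail.conf (illustrative -- paths, tag, host, and
# port are placeholders, not our production config)

# Tail the Nginx access log and tag the lines for filtering
module(load="imfile")
input(type="imfile" File="/var/log/nginx/access.log" Tag="nginx:")

# Forward only the nginx-tagged lines over TCP to the remote endpoint
if $syslogtag == 'nginx:' then @@logs.papertrailapp.com:12345
```

The remote service then runs the saved searches (e.g. for " 502 ") that trigger the notifications Dr. M. relays.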
More trouble follows, including several more alerts relating to our API response time, as well as notifications from our alert escalation service and our exception reporting service -- the latter of which indicates that some of our users’ sessions might be failing with I/O errors.
In every case, the notifying entity provides us a link that we can use to get more info. For example, clicking the link in the first notification from our alerting feature yielded the following graph in our metrics UI:
Sure enough, there are two obvious outliers here, indicating that two machines (out of the dozen or so API hosts) were three orders of magnitude slower than their peers at returning metric names from memcached.
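What the eye does on that graph -- spotting the hosts sitting far above the fleet baseline -- is easy to express programmatically too. Here’s a small illustrative sketch; the host names, latencies, and threshold factor are all made up:

```python
# Illustrative sketch: flag hosts whose latency sits far above the fleet
# median, the way the two outliers stand out on the graph. Host names
# and latencies (in ms) are invented sample data.
from statistics import median

def latency_outliers(latencies, factor=10.0):
    """Return hosts whose latency exceeds `factor` times the fleet median."""
    baseline = median(latencies.values())
    return sorted(h for h, v in latencies.items() if v > factor * baseline)

fleet = {"api-1": 0.9, "api-2": 1.1, "api-3": 1.0, "api-4": 1.2,
         "api-5": 980.0, "api-6": 1.0, "api-7": 1250.0}
slow = latency_outliers(fleet)   # -> ["api-5", "api-7"]
```

Using the median rather than the mean keeps the baseline honest even when a couple of hosts are three orders of magnitude off.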
Our engineers take a moment to assimilate the barrage of bad news that was just dumped into the channel...
… and then they dig in to figure out exactly what is going on. Having machine notifications inline with human conversation is a huge win for us. Being able to react directly to what we’re seeing in channel, without context-switching between our phones and workstations, makes us a more productive team -- everyone is literally on the same page all the time. We win again when the troubleshooting we do as a team is automatically documented, and anyone who joins the channel mid-crisis gets all the context from the scrollback.
We initially suspected that our message queue might be the culprit, but we were able to quickly check the queue latency graph and eliminate that possibility without wasting time poking at the queue directly. Then we noticed some aberrant system-level stats on the two hosts that broke threshold in the initial alert.
Our metrics tool puts all of our production metrics in one place, so it’s trivial to correlate metrics from one end of the stack to the other. Using a combination of application-layer and systems-level graphs, we were able to verify that the problem was in fact some sort of short-lived network partition that isolated those two particular AWS nodes. The data included the graph below, which depicts our total number of healthy AWS nodes (this data comes from CloudWatch, via our turn-key AWS integration feature).
The “dip” you see indicates that two of our nodes went dark for a few seconds (too short a length of time for them to have bounced). We noted their names and will keep them under observation for a few days, but at that point all we could do was glare in Amazon’s general direction and call “all clear”.
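Spotting that dip amounts to finding short runs where the healthy-host count drops below the steady-state fleet size. A hypothetical sketch, with invented sample data standing in for the CloudWatch series:

```python
# Hypothetical sketch of detecting the "dip" in a healthy-host series:
# runs of samples below the expected fleet size. The sample values
# below are invented, not our actual CloudWatch data.

def find_dips(samples, expected):
    """Return (start_index, end_index, hosts_missing) for each run of
    samples below the expected healthy-host count."""
    dips, start = [], None
    for i, count in enumerate(samples):
        if count < expected and start is None:
            start = i                       # dip begins
        elif count >= expected and start is not None:
            dips.append((start, i - 1, expected - min(samples[start:i])))
            start = None                    # dip ends
    if start is not None:                   # series ends mid-dip
        dips.append((start, len(samples) - 1,
                     expected - min(samples[start:])))
    return dips

healthy = [12, 12, 12, 10, 10, 12, 12, 12]
dips = find_dips(healthy, expected=12)   # -> [(3, 4, 2)]
```

A dip of two hosts for two samples matches what we saw: two nodes briefly unreachable, too briefly to have bounced.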
Communicate to Document
We love ChatOps so much, that sometimes -- late at night, when nobody is around -- you can catch our engineers talking to themselves in channel:
This happened not long ago, late on a Sunday night. Our on-call Ops engineer was paged via our escalation service about a database problem. At the top you can see both the escalation alert and our engineer’s acknowledgement, shortly before he joins the channel. As he troubleshoots the issue, he narrates his discoveries and pastes interesting tidbits, including log snippets and graphs of interesting metrics. In this way, he documents the entire incident from detection to resolution for the other engineers who will see it when they join the channel the next morning. Using ChatOps to document incidents has become an invaluable practice for us. Important customer interactions, feature ideas, code deploys, and sales stats are also communicated asynchronously, company-wide, via ChatOps. It is our portal, wiki, and watercooler.
Another practice we would find hard to live without is sharing graphs back and forth in-channel, the way Peter has done above. It’s a handy way to communicate what you’re seeing to other engineers while simultaneously documenting it for later. We’d come to rely so heavily on copy/pasting graphs to each other that we added a snapshot feature to our metrics tool, so our engineers can share the graph they’re looking at in our UI directly to a named chatroom in our chat tool, no copy/paste required.
In this particular incident, Peter tracks the issue down to a read-replica failure on a particular database node, and decides to replace the node with a fresh instance. First he deletes the faulty host and waits for things to normalize...
… and then he brings up the new node and monitors dashboards until he’s convinced everything is copacetic.
Throughout his monologue, Peter is using the Snapshots feature I mentioned earlier: he clicks a graph in our UI and hits the snapshot button to send it to Dr. M., who in turn pastes it into the channel for everyone to see. Snapshots are generally preferable to copy/pasting because they give everyone in channel a static PNG of the graph as well as a link back to the live graph in our UI. Those of us who use the Propane client for Campfire even get live graphs in channel instead of boring PNGs.
ChatOps delivers on the promise of remote presence in a way the presence protocols never did. It's a natural estuary for stuff that's going on right now; information just can't help but find its way there, and having arrived, it is captured for posterity. ChatOps is asynchronous but timely, brain-dead simple yet infinitely flexible. It automatically documents our internal operations in a way that is transparent and repeatable, and somehow manages to make time and space irrelevant.
Further, because our monitoring is woven into every layer of the stack, and heavily metrics-driven, data is always available to inform our decisions. We spend less time troubleshooting than we would if we chose to rely on more siloed, legacy techniques. Instrumentation helps us to quickly validate or disprove our hunches, focusing our attention always in the direction of root cause.
We hope you’ve enjoyed this first peek under our operations hood and we’re looking forward to sharing more with you in the posts to come. Be sure to tune-in next time when our chat-bot, twke, says: