Metrics-Driven Development - Part II — Librato Blog

Metrics-Driven Development - Part II


What is MDD? A Quick Recap

In our first blog post about Metrics-Driven Development (MDD), we identified a set of practices that define MDD in the abstract. Simply stated, MDD’s goal is to insert monitoring into the development process by making it trivial for any engineer to instrument the software they create. Correctly implemented, MDD makes it easy to monitor the behavior of services, processes, and worker threads in the wild.

In this article, we explore MDD in practice. Follow along as we recount one of our own production adventures that began not too long ago, when we were experiencing a bout of the classic “website is slow” problem.

Inside Our Own MDD World

In an MDD shop, our first task is to make the metrics platform the preferred engineering interface for discovering and responding to production incidents.  To do this we need to emphasize the use of real work metrics like queue sizes and service latency over system-centric metrics like CPU usage. Using instrumentation libraries to measure and time the interactions between our various web services, we can tell at a glance if something is slow, and more importantly, we can set thresholds so we’re notified about latent connections and snowballing queues before they become service outages.

For example, when our website had a bout of inexplicable slowness earlier this year, our engineers were alerted by notifications like these:


The first alert in the screenshot above tells us that our API is having trouble storing inbound data, and the second informs us of a MySQL instance with an abnormal number of persistent queries (so many in fact, that it’s stopped accepting new connections).

Although the mechanisms by which we handle (transport, store, visualize, analyze, and alert on) metrics of this kind are standardized, we make no attempt to dictate the toolchain used to collect metrics initially. Any engineer can add code or collection agents to measure the metrics that are important to them in whatever way makes the most sense. Our engineers glean important operational metrics from collection agents, from instrumentation running within their apps, and from parsing system logs. In practice we don’t relegate monitoring to one particular team; It's something every engineer embraces as a means to track and reason about the behavior of the systems for which they are directly responsible.

The Initial Assessment

In this particular incident, because the API depends on the MySQL instance, our engineers fixated on the latter as a likely cause of the slowness. Right away, they noticed several interesting correlations that gave them a solid notion of the kind of trouble it was causing.

The MySQL read-replica latency presented as a large singular event.  Our MySQL node, as seen in the graph below,  was generally unresponsive (even to its own read-replicas) and therefore was probably overburdened in some way:

... the MySQL read-replica latency; evidence that our MySQL node was generally unresponsive (even to it's own read-replicas) and therefore probably was overburdened in some way…

We saw the latency spike mirrored in our front-end web server response time, where our normally single-digit millisecond response time peaked at 60ms.  This implied that the web servers were still functional, but they were blocking on I/O, waiting for an answer from the DataBase instance:


Our Elastic Load Balancer (ELB) latency, also skyrocketed (above 20 seconds)—more evidence that this problem was affecting our customers, and prompting us to engage our customer-communication protocols:

... The Elastic Load Balancer (ELB) latency, which skyrocketed above 20 seconds—proving that this problem was also affecting our customers, and prompting us to engage our customer-communication protocols...

Perhaps most distressing, the spike was also apparent in the graph below, which depicts the number of HTTP 200's returned by our web servers. This problem was resulting in time-outs (not just slowness), and some of our customers were getting error responses from the API:

... and the number of HTTP 200's returned by our web servers—proof that the problem was resulting in time-outs (not just slowness). This indicated that some of our customers were getting error responses from the API.

Understanding Cause

The problem cleared itself in a few minutes, but given that it was customer-facing and highly abnormal, our engineering team moved to temporarily reinforce production by adding MySQL read-replicas, and then began looking for root cause.

One important revelation was the graph below that showed a slow 4-week trend of increased write operations to the DB without an accompanying increase in network traffic. We already knew the issue wasn't accompanied by abnormal customer traffic, but the long-term increase in writes was curious. What was the DB writing if not customer data?


It turned out that one of our customers was deleting a huge amount of data every night, and this amount had slowly been growing over time. It didn't correlate with increased traffic, because the customer merely had to send the equivalent of Delete \ *, and our DB was stuck doing the work. These deletes were in fact going on during the time of the incident. Had the write-ops finally crossed a threshold that the DB couldn't handle? We didn't think so, but this was certainly a contributing factor to the incident.

We also noticed an explosion of errors from a third-party service provider that coincided with the start of the incident. These errors indicated that the service provider was throttling our requests, which would have caused a burdensome amount of DB traffic due to query retransmission.

Because these errors persisted after the incident, while the end-user deletes did not, we concluded that the DB was knocked over by the combination of the two problems at the same time, and would return with the next large delete job. To fix it, we'd need to figure out what was causing the errors, but what had caused the errors to suddenly appear?

To answer this question, our engineers turned to our deployment annotation stream. Annotations display as vertical lines in our instruments. They document aperiodic events like batch-jobs, system reboots, and code deployments. Because we are a continuous integration shop, we often deploy code to production (today there were 41 code deployments), so this process is highly automated. Every time new code is pushed to production, an annotation is created on an annotation stream that can be overlaid on any graph.

To search for correlations, we brought up the MySQL latency graph, and overlaid every deployment annotation stream, turning them off or on, one at a time:


This yielded a possible correlation between a particular code deployment to one of our internal services (called “cloudsnarf”).  We proceeded to compare the annotation of this particular deployment to several other metrics associated with the latency problem, and in every case, we discovered strong correlations:


Interestingly, when we overlaid this annotation stream on the graph that depicted the throttling errors from our third party API service, we also found a strong correlation between the two:


From this we can intuit that this particular code deploy caused both an increase in throttling errors, and an increase in latency and load on the MySQL instance. Rolling back this deploy, as shown below, verified the correlation beyond a shadow of a doubt, and convinced us that this deployment had caused our problem.  What, specifically, was problematic inside this deployment remained a mystery.  



The task of debugging something like this—a code deployment into a distributed system that triggers backpressure from a third party vendor—is an unenviable one. It's difficult to test against distributed systems because you can't easily model them locally, so you lose time making small, incremental changes and trying to observe their effect on production without breaking anything.

The problem is one of variable isolation. Presented with a problem, an engineer will form several ideas about its cause, and each of these ideas represents a set of variables that must be isolated, tested, and eliminated. Testing and eliminating variables can be difficult with distributed systems because the problem may be endemic to the relationship between multiple nodes (for example, a protocol that causes message loops), or, like our throttling errors, the problem might only appear under very specific production workloads that are difficult to produce anywhere else.  Many traditional debugging tools like printf statements and tracing tools degenerate the signal to noise ratio or are simply useless or irrelevant in this context.

This is an area where MDD really shines. Because our engineers have a reliable, real-time metrics platform at their disposal to observe the production environment, they can add temporary, special-purpose instrumentation to expose the variables they want to test and eliminate. Additionally, in the context of debugging, historical data is invaluable at identifying changes production behavior, as well as how those changes align with deployed releases.

In this case for example, the engineer suspected the problem was related to a query that might be returning an overabundance of output.  To test his suspicion, the engineer added instrumentation to the app to track the number of sources returned, the number of requests generated, and the number of write operations created against the DB.

He then began shipping new versions of the app to production, each with a slightly different take on the dubious query in an effort to test its effect on those three metrics.  He would, for example, re-craft the query in a way that he expected would increase the number of returned requests, and then observe the result of that change in production. When he encountered an unexpected result, he continued to focus his attention in that direction.

This effort resulted in the graph below, where the engineer’s changes are represented by vertical annotations, and the metrics he added to the production environment are represented by the three different data streams visible in the graph:


Nothing beats a real-time telemetry stream for understanding the effect your changes have on ephemeral production entities.  With quality, real time feedback that they can control, our engineers are able to test multiple variables at the same time and instantly see the effect of each test across all variables. At the same time, a document is created that details the engineers debugging efforts in a way that can be easily interpreted by the rest of the team.

This debugging session convinced us that the query was in fact the problem, and that it was returning an unexpected number of sources, which was burdening our MySQL server.  Having isolated the problem to the number of returned sources, our engineer added some more instrumentation to examine the source data.


Our problem turned out to be a perpetual overwrite bug.  The influx of returned sources from the query was due to the fact that the sources were wiped on every query (instead of being cached). Fixing this query drastically reduced the number of returned sources as seen in the graph below:


It might seem obvious that monitoring is just another aspect of engineering, but in many IT shops today, monitoring is something an operations team bolts on after the product has been built.  It is an expensive burden that might inform us about a few types of simplistic outages if we aren’t already ignoring it because of alert fatigue and false positives.  

MDD reunites monitoring with software engineering. In an MDD shop, monitoring via instrumentation helps engineers design, reason about, and observe the behavior of the applications they create. It’s a powerful development tool whose feedback drives application design and construction as the product moves from an MVP to a scalable, correct system. Further, because they rely on it so heavily, the engineers take an interest in the quality of the monitoring system. Finally, by encouraging engineers to reason about the metrics they need to emit as part of the development process, MDD informs design decisions and results in services and applications that are more transparent, and lend themselves to being monitored.

The benefits of metrics driven development aren’t isolated to application design and debugging. Because every team relies on the same metrics platform, it is constantly being iteratively improved by various types of engineers to answer the questions that are important to them, as well as to remove metrics that are stale or distracting.  The engineer who solved our query problem, for example, included some additional instrumentation that we could use in the future to monitor for query bombs like the one that caused this problem. In this way MDD tends to keep the monitoring system relevant -- the metrics platform remains useful because everyone relies on it daily.

MDD Works

We hope you’ve enjoyed our journey into the practice of Metrics Driven Development. We appreciate the opportunity to share one of our own experiences as a means to illustrate how MDD can help you detect problems, perform post-mortems and improve the design of your distributed applications.

Sign up for a free trial and get started with MDD in your organization.