Metrics-Driven Development — Librato Blog

Metrics-Driven Development

During our recent GigaOm webinar, Monitoring and Metrics for Web-Scale Applications, we talked about something we refer to as metrics-driven development (MDD). At Librato, we believe MDD is equal parts engineering technique and cultural process. It separates the notion of monitoring from its traditional position of exclusivity as an operations thing and places it more appropriately next to its peers as an engineering process. Provided access to real-time production metrics relevant to them individually, both software engineers and operations engineers can validate hypotheses, assess problems, implement solutions, and improve future designs.

While we’re neither the first to speak about MDD, nor the first to write about it as a concept, we found lacking a treatment of the abstract principles underpinning it as a practice. The goal of this post is to define MDD as a series of best-practice axioms.

Suggested Prerequisites

Before diving in, we should note that MDD as described here is a fundamentally iterative process that begins with a high-level view of your system’s overall health and then drills down into relevant subsections as you follow the data. While the tenets discussed herein will improve any development process, they are especially powerful in those that employ Continuous Integration, Continuous Delivery, and Feature Flagging.

In aggregate, these practices ensure that changes to code do not cause functional regressions, get newly tested code into production with minimal delay, and exert sophisticated control over the set of users exposed to new functionality. In such an environment, MDD enables you to use real-time metrics to drive rapid, precise, and granular iterations.


The following principles comprise the core tenets of MDD as we see them today. A deep exploration of any one would warrant a standalone post, so for now we’ve focused on a series of high-level treatments. If you’re able to implement an approximation of each, then you’re well on your way to practicing Metrics-Driven Development!


To support both rapid iteration and the highest level of precision, developers must be able to add specific, custom instrumentation inline with application code with minimal effort (e.g. incrementing a counter or timing a critical section of code). The instrumentation library must not impact performance in any meaningful way, so that it can be used to measure latency-sensitive code paths. General-purpose tools such as monitoring agents that track system metrics (e.g. CPU/memory usage) or middleware that tracks coarse-grained metrics (e.g. response latency across all request types) may provide useful supporting context, but cannot on their own provide the required level of detail and control.

# Timing a potentially latent operation with custom instrumentation
statsd.time("api.measures.source_match_build.time") do
  metric_source_match_build(active, explicit, wildcards, negates)
end

Given the ability to create targeted instrumentation directly in the application code itself, instrumentation becomes a required deliverable for every new feature. When we write new code, we implicitly form a hypothesis about its behavior in production; in MDD developers include instrumentation in their code as a means of proving or disproving these assumptions as soon as their code is deployed into production.
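To make "instrumentation as a deliverable" concrete, here is a minimal sketch of a feature that ships with its own counter and timer. The InMemoryMetrics class and the build_source_match method are hypothetical stand-ins invented for illustration; in practice you would call your real metrics client, as in the statsd.time example above.

```ruby
# Hypothetical in-memory recorder standing in for a real metrics client.
class InMemoryMetrics
  attr_reader :counters, :timings

  def initialize
    @counters = Hash.new(0)
    @timings = Hash.new { |hash, key| hash[key] = [] }
  end

  # Add `by` to the named counter.
  def increment(name, by = 1)
    @counters[name] += by
  end

  # Run the block, record its wall-clock duration in milliseconds,
  # and return the block's own result.
  def time(name)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0
    @timings[name] << elapsed_ms
  end
end

# Hypothetical feature code: the instrumentation is written with the feature,
# so the hypothesis "this step is cheap" can be checked as soon as it deploys.
def build_source_match(metrics, sources)
  metrics.increment("api.measures.source_match_build.calls")
  metrics.time("api.measures.source_match_build.time") do
    sources.map(&:downcase).uniq
  end
end

metrics = InMemoryMetrics.new
result = build_source_match(metrics, %w[Web1 web1 DB1])
```

The timer returns the block's result unchanged, so wrapping a critical section does not alter the calling code.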

Single Source of Truth

Metrics resulting from the custom instrumentation as well as those sourced from external providers (e.g. system monitoring agents, APM middlewares, etc) must be stored in a common repository, in a common format, with a common interface for visualization/alerting/analysis and readily accessible to everyone on the team.

Developers/operators can then easily correlate metrics up and down all layers of the stack (e.g. business, application, vm, system, and network) or across multiple services that interact to comprise a user-level transaction (e.g. ingress, queuing, business logic, or database).
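As an illustration of what "a common format" buys you, the sketch below normalizes two differently shaped payloads into one record type and then queries across layers. The payload shapes, the Measurement struct, and the from_agent/from_app helpers are all assumptions made up for this example, not the format of any particular agent or service.

```ruby
# One record shape for every metric, regardless of where it came from.
Measurement = Struct.new(:name, :source, :time, :value)

# Hypothetical payload from a system monitoring agent.
def from_agent(payload)
  payload[:samples].map do |metric, value|
    Measurement.new("system.#{metric}", payload[:host], payload[:ts], value.to_f)
  end
end

# Hypothetical event emitted by custom application instrumentation.
def from_app(event)
  Measurement.new(event["metric"], event["host"], event["epoch"], event["ms"].to_f)
end

repository = []
repository.concat(from_agent({ host: "web1", ts: 1_000, samples: { cpu: 0.92, mem: 0.40 } }))
repository << from_app({ "metric" => "api.latency", "host" => "web1", "epoch" => 1_000, "ms" => 850 })

# Correlate across layers: everything recorded for one host at one timestamp.
at_spike = repository.select { |m| m.source == "web1" && m.time == 1_000 }
names = at_spike.map(&:name).sort
```

Because every measurement shares the same fields, the application latency and the system metrics can be pulled up with a single query.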

The metrics platform must be so timely, comprehensive, and intuitive to use that everyone instinctively relies on it as their preferred resource to reason about the production environment. Directly shelling into a production server instance for the purposes of characterizing behavior should be a rare and exceptional occurrence, and considered an indicator of potentially incomplete instrumentation.

Developers Curate Visualizations and Alerts

Developers produce code, inclusive of instrumentation, to meet both functional and performance requirements. Charts and dashboards are required to visually inspect how well a service is meeting those requirements at a given point in time, as well as to highlight emergent conditions that may jeopardize its ability to do so in the future. Alerts are required both to verify that these top-level requirements are continuously met and to detect regressions.

The developer who writes the code possesses the most accurate mental model of how the code operates, as well as its potential degradation modes, e.g. dependencies on the performance of external services or susceptibility to garbage collection pressure. It follows that the developer is the ideal candidate to maximize the signal-to-noise ratio of visualizations and alerts (including the runbooks that detail remediation). As code is continuously refactored and augmented, the developer is best equipped to tune dashboards and alerts to match the shifting landscape.

A developer-curated dashboard tracking key metrics at each layer of the stack for a production service at Librato.

Alert on What You See

When an alert triggers, we use metrics to iterate over possible causes and, once the cause is isolated, to confirm that remediation was successful. It’s critical therefore that alerts are triggered off of the same dataset used to visually inspect and correlate metrics. Disparate systems introduce both potential impedance mismatches in instrumentation and potential confusion when mentally mapping the output of one system onto the other. Any lack of certainty adds stress and increases the likelihood of human error while responding to an incident.

Back-testing an alert's configuration against production data to tune its sensitivity.
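Back-testing can be pictured as replaying stored samples through the alert's rule before turning it on. The sketch below assumes a simple rule (fire once a value has stayed above a threshold for a number of consecutive samples) and invented latency data; the backtest method and its parameters are hypothetical and do not describe how any particular alerting product evaluates conditions.

```ruby
# Returns the indexes at which the alert would have fired: the sample that
# completes a run of `duration` consecutive values above `threshold`.
def backtest(series, threshold:, duration:)
  fired = []
  run = 0
  series.each_with_index do |value, index|
    run = value > threshold ? run + 1 : 0
    fired << index if run == duration
  end
  fired
end

# Invented historical latency samples, in milliseconds.
latency_ms = [120, 130, 510, 125, 520, 530, 540, 135, 128, 505, 515, 122]

noisy = backtest(latency_ms, threshold: 500, duration: 1) # fires on any single spike
tuned = backtest(latency_ms, threshold: 500, duration: 2) # requires a sustained breach
```

Comparing the two results shows how many pages each setting would have produced over the same history, which is the kind of evidence that makes tuning sensitivity less of a guess.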

Show me the Graph

Its virtues aside, MDD is no silver bullet, and bugs will surely still make it into production. It’s on these occasions, however, that perhaps its most powerful benefits shine through. When a team is used to validating theories through direct observation of production behavior, no discussion of the contributing factors to an incident starts without data. Developers and operators quickly internalize that any theory not based on production data is idle speculation at best and dangerously misleading at worst. Competing hypotheses are rapidly posed and disproved until the metrics indicate that one is valid. This may require the creation of new instrumentation if one of the factors in question was not previously under observation. When a remediation is deployed, its effect is validated through the same instrumentation.

Correlating spikes in latency to deploys using Librato's annotation feature.
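The same correlation can be sketched in code: given timestamped latency samples and deploy annotations, find the most recent deploy shortly before each spike. The Deploy struct, the deploys_before_spikes helper, the labels, and the data are hypothetical and only illustrate the idea; they are not the annotation API.

```ruby
Deploy = Struct.new(:time, :label)

# For each [time, value] sample above `threshold`, find the latest deploy that
# happened at most `window` seconds before it.
# Returns a hash of spike time => deploy label (nil when no deploy is nearby).
def deploys_before_spikes(samples, deploys, threshold:, window:)
  samples.select { |_, value| value > threshold }.to_h do |time, _|
    candidate = deploys.select { |d| d.time <= time && time - d.time <= window }
                       .max_by(&:time)
    [time, candidate&.label]
  end
end

# Invented data: [seconds, latency in ms] pairs and two deploy annotations.
samples = [[100, 120], [160, 125], [220, 900], [280, 130], [700, 950]]
deploys = [Deploy.new(200, "deploy-a"), Deploy.new(400, "deploy-b")]

suspects = deploys_before_spikes(samples, deploys, threshold: 500, window: 60)
```

A spike with no nearby deploy is just as informative: it rules out one hypothesis and points the investigation elsewhere.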

Don’t Measure Everything (YAGNI)

When engineers first experience the rush of informing their decisions through instrumentation and metrics, a subtle anti-pattern that often emerges is excessive instrumentation, e.g. “MONITOR ALL THE THINGS!”. A good way to verify whether something requires instrumentation is to ask what system hypothesis you’re testing. As an example, while you should instrument all interactions with external services (to ensure that they are meeting the expected SLAs), you should not wrap every single method call in your application to increment a counter. This is a conflation of performance management with performance profiling, and there are no system hypotheses to validate about whether function dispatch works correctly. Scenarios like this often occur when an engineer attempts to anticipate all the instrumentation they might ever need, usually because “monitoring” is considered a one-off work item to be scheduled and completed.

The iterative approach of MDD, by comparison, enables us to continuously add (and perhaps more importantly remove!) instrumentation as needed. Because instrumentation is code, there is a cost to maintain it. The more instrumentation that exists, therefore, the higher the cognitive overhead of, for example, on-boarding new developers. Unneeded instrumentation also places additional load on the metrics pipeline itself, and can make individual metrics more difficult to locate and interact with.


By this point you may have observed some parallels to the broadly implemented Test-Driven Development (TDD), and this is no coincidence. It’s considered obvious that developers should reflect on what their functional requirements are, and how they should test them both initially and against future regressions. The same should be said for implementing production systems: when you write code, you are materializing assumptions about the way the pieces fit together. Maybe process A requires that a queue depth never exceeds K elements, or maybe process B requires that service S responds with a 99th percentile latency below T.
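Assumptions like these can be written down as explicit checks over observed metrics, much as a unit test writes down a functional requirement. The sketch below is one possible shape under stated assumptions: it uses a nearest-rank percentile, and the percentile and assumptions_hold? helpers, the parameter names, and the sample data are all hypothetical.

```ruby
# Nearest-rank percentile: the smallest observed value with at least `pct`
# percent of the samples at or below it. `pct` is an integer from 1 to 100.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# True when the queue depth never exceeded K (max_depth) and the 99th
# percentile latency stayed below T (max_p99_ms).
def assumptions_hold?(queue_depths:, latencies_ms:, max_depth:, max_p99_ms:)
  queue_depths.max <= max_depth && percentile(latencies_ms, 99) < max_p99_ms
end

# Invented observations for one evaluation window.
queue_depths = [3, 7, 5]
latencies_ms = (1..100).to_a

healthy = assumptions_hold?(queue_depths: queue_depths, latencies_ms: latencies_ms,
                            max_depth: 10, max_p99_ms: 100)
regressed = assumptions_hold?(queue_depths: queue_depths, latencies_ms: latencies_ms,
                              max_depth: 5, max_p99_ms: 100)
```

In practice a check like this would live in an alert definition rather than in application code, so that it is evaluated continuously against production data.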

MDD means that you now test these initial assumptions about system behavior and detect emergent regressions as the system inevitably evolves to handle a workload it wasn't designed for. It reduces the risk that change poses to system performance while simultaneously increasing the frequency with which you can iterate through development and deployment cycles. We hope this post helps to further the discussion on MDD and the best practices for implementing it on your team. If you have any thoughts on its content or additional axioms we should add, please reach out!

Special thanks to Dave Josephsen, Ben Sigelman, Coda Hale, Mike Heffner, and Matt Sanders for their review and thoughtful comments on drafts of this post.