Thin-Slicing MDD: How We Choose Effective Operational Metrics — Librato Blog

Once upon a time I had a job selling clothes I couldn’t afford in a very upscale Los Angeles fashion boutique that no longer exists. Because it was considered necessary that we actually wear the designer clothing we couldn’t afford in order to more effectively sell it, my fellow salespeople and I had free access to a large volume of ridiculously expensive threads.

I was just a tourist, but for my fashion-conscious co-workers, the experience was transformative. Not one of them would ever work again in a job that didn’t provide them some means of dressing fabulously.

I could never empathize until now, because working for Librato has given me unlimited access to the best general-purpose metrics processing and visualization system in the world today. It changes the way you think and work as an engineer. I’m pretty sure I’m ruined for life, and I don’t think I’m alone in that: since I’ve been with Librato, the few engineers who have moved on now work for shops that use Librato.

Monitor Everything?

Given the conventional wisdom that you should monitor everything, you might expect our metrics to number in the millions, but we actually track around 60 thousand metrics total across all of our services.

We’ve written at length about our choice and use of metrics inside Librato. When you move past the novelty of being able to measure the things you build, and into the realm of relying on those measurements day to day, you get a little finicky about the metrics you choose. A core principle of Metrics-Driven Development is that you should choose only the metrics that test your hypotheses about your own systems, but it’s one thing to talk about how we choose metrics, and another thing entirely to just show you the metrics we choose.

Dashboard Diving

So I thought it might be fun to thin-slice it, as they say: pick just one metric out of the 60k metrics we’ve chosen, and talk about what it is, why it’s important to us, and how we’re measuring it. Because we live Metrics-Driven Development at Librato, it shouldn’t matter which metric we choose to examine in this article. So I’ll choose one at random, using this handy-dandy shellbrato-based one-liner:

  paginate listMetrics | jq '.metrics[].name' | sed -ne "${RANDOM}p"

And the winner is: jackdaw.memcached.get.metric.time

Written in Java, and named after a pirate ship in a video game (which was in turn named after the bird, I think), Jackdaw is our fast-path API service. When you emit metrics to our API, you’re talking to Jackdaw. Jackdaw accepts API POSTs, does a bit of mangling on the metrics contained in them, and pushes the result into a Kafka queue, where our Storm topologies can pick them up and begin computing and persisting them.

You’re probably aware that Memcached is a distributed, in-memory, key-value store. Many of our internal services rely on Memcached to avoid expensive database lookups and other kinds of inter-service communication.

Our Cassandra-based persistence layer is very well tuned. We like to be able to predict things like the number of bytes a new metric will use, but this can be difficult in a multi-tenant environment, where a user can create a variable-length metric name. So part of the ‘mangling’ I mentioned above involves translating several kinds of user-provided names into unique ID numbers that use a predictable amount of space in the persistence layer.

So in Jackdaw’s case, we’re using Memcached to avoid making database queries to translate things like metric names into metric IDs. And specifically in the case of jackdaw.memcached.get.metric.time, we’re measuring the amount of time it takes for Jackdaw to successfully query memcached for a metric ID, given its name.
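The pattern here is plain cache-aside: try the cache first, fall back to the database on a miss, then populate the cache. Here’s a toy sketch of the idea in Java (class and member names are illustrative, and in-memory maps stand in for memcached and the database; this is not Jackdaw’s actual code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Cache-aside lookup of metric name -> metric ID. The maps below are
// stand-ins for memcached and the database; names are illustrative.
class MetricIdCache {
    private final Map<String, Long> memcached = new ConcurrentHashMap<>();
    private final Map<String, Long> database = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong(1);

    long idForName(String metricName) {
        Long id = memcached.get(metricName);          // fast path: cache hit
        if (id == null) {
            // slow path: consult the "database", minting an ID if the
            // metric name has never been seen before
            id = database.computeIfAbsent(metricName,
                    n -> nextId.getAndIncrement());
            memcached.put(metricName, id);            // repopulate the cache
        }
        return id;
    }
}
```

The fixed-width ID is what actually lands in the persistence layer, so the variable-length name only has to be resolved on a cache miss.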

Measure The Space Between Your Services

Measuring inter-service latency and queue response times is extremely common at Librato, because the best way to understand blocking outages in a microservices-based architecture is very often to measure the space between your services. In fact, we use similar metrics to independently track the latency of every type of interaction between Jackdaw and Memcached. Here’s the list:

  jackdaw.memcached.get.alertcondition.time
  jackdaw.memcached.set.alertcondition.time
  jackdaw.memcached.get.limit.time
  jackdaw.memcached.set.limit.time
  jackdaw.memcached.get.metric.time
  jackdaw.memcached.set.metric.time
  jackdaw.memcached.get.source.time
  jackdaw.memcached.set.source.time

We take these measurements from inside Jackdaw using instrumentation code. This reflects both another core principle of MDD (monitoring is code) and another common pattern at Librato. Instrumentation is by far our biggest source of telemetry data: more of our monitoring is performed by code we write into our services than by any combination of external monitoring systems we employ.

Instrumentation as Code

For services written in Java, we use DropWizard Metrics (formerly known as Coda Hale Metrics). Metrics libraries like this generally provide the programmer with a set of objects representing various ways to measure the operational parameters of your code. In this case we’re timing interactions, so we can use the Timer object provided by DropWizard Metrics. Omitting the preliminary setup (importing the metrics library, creating the registry and the timer, etc.), the code we’re using to time jackdaw.memcached.get.metric.time looks like this:

Timer.Context t = getTime.time(); // start the timer
try {
    T item = client.get(key, transcoder); // the memcached get we're timing
} finally {
    t.stop(); // stop the timer, even if the get throws
}

Percentiles to Filter Outliers

I ran a search against our API to find instruments that include jackdaw.memcached.get.metric and was momentarily surprised when I found this instrument called “Jackdaw Cache p99”.

The p99 in the title implies that this instrument is not displaying our timer directly, but rather the 99th percentile of our timer values. Graphing a percentile like this gives you a line below which 99% of your measurements fall. It’s another common technique employed at Librato because it’s an easy way to filter outliers from a data stream.
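To make that concrete, here’s a tiny sketch of the arithmetic using the simple nearest-rank method (real metrics libraries use fancier estimators, such as reservoir-sampled histograms):

```java
import java.util.Arrays;

class Percentile {
    // Nearest-rank percentile: sort the samples, then take the value
    // at rank ceil(pct * n), 1-indexed.
    static double p(double[] samples, double pct) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // 99 well-behaved 1 ms queries, plus one 100 ms outlier
        double[] latencies = new double[100];
        Arrays.fill(latencies, 1.0);
        latencies[42] = 100.0;

        System.out.println(Percentile.p(latencies, 0.99)); // p99: 1.0
        System.out.println(Percentile.p(latencies, 1.00)); // max: 100.0
    }
}
```

The single outlier dominates the max but never touches the p99 line, which is exactly why a percentile makes a calmer signal than a raw maximum.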

It’s not uncommon when working with queues and caches for a single query on a single thread on a single instance to go a little sideways. Here for example I happened to capture a misbehaving memcached query from Jackdaw taking over 100 ms to execute (the top area).

This sort of thing isn’t generally a problem because it’s isolated to a single query, and Jackdaw will quickly time out and re-attempt it. Meanwhile we see the 99th percentile (bottom area) is unaffected. In other words, 99% of all the memcached queries happening across the other threads and hosts are returning within acceptable bounds. Instead of trying to track down every outlier, we focus our attention on continuing to drop the 99th percentile latency over time.

Looking in our metrics interface I find that each of the metrics I listed above actually has five additional versions, one each for p75, p95, p98, p99, and p999 (the appearance of p999 metrics is, by the way, a decent litmus test for a perfectionist in your midst).

In total, that means we’re tracking 40 metrics related to latency in the interaction between Jackdaw and Memcached: a get and a set metric for each of the four types of cache data, times five percentiles. In this we see yet another Librato pattern: our tendency to measure deeply the few thousand things we really care about, rather than casting a wide net across a million things we don’t.

Local Aggregation with Statsd

So how are these percentiles being computed? Going back to the code I find that these metrics are being emitted from Jackdaw into a Statsd process running locally on each server instance. Specifically, we use statsite, an implementation of statsd in C with built-in Librato support. Statsite is configured to aggregate these values at the host-level, and emit percentiles for each of these values into Librato every 10 seconds. We run them locally to avoid losing our measurements to the UDP singularity.
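The wire protocol involved is refreshingly simple: a statsd timing sample is a one-line UDP datagram of the form name:value|ms. A minimal emitter might look like the sketch below (localhost:8125 is the conventional statsd port, an assumption rather than Librato’s actual configuration):

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

class StatsdTimer {
    // Build a statsd timing line, e.g. "some.metric.time:12.3|ms"
    static String timingLine(String metric, double millis) {
        return metric + ":" + millis + "|ms";
    }

    // Fire-and-forget UDP send to the aggregator on this host.
    static void send(String line) throws Exception {
        byte[] payload = line.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getLoopbackAddress(), 8125));
        }
    }

    public static void main(String[] args) throws Exception {
        send(timingLine("jackdaw.memcached.get.metric.time", 12.3));
    }
}
```

Because the datagram is fire-and-forget, a dead aggregator never blocks the service being measured; the cost of that choice is the lossiness alluded to above, which is one reason running statsite on the same host is attractive.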

Composites for Constructive Laziness

This Jackdaw Cache p99 instrument that I’ve found draws a line for each type of query on each type of cache data. When I initially saw the instrument, I assumed each of these metrics had been individually added to it, but clicking through, instead of eight different metrics assigned to the instrument, I find the following composite metric definition:

map({metric:"jackdaw.memcached.*.time.99th"}, mean(s("&", "*")))

As defined in our composite metrics language specification, the map function takes a set of metrics and applies a function to each metric in the set. In this case the metric set is

  {metric:"jackdaw.memcached.*.time.99th"}

This typeglob equates to the eight metrics we’ve been looking at (get and set on alertcondition, metric, limit, and source). To each metric in this set, the map function will compute:

mean(s("&", "*"))

So, for each of those metrics, this composite averages together the values from every source emitting it (get.alertcondition on servers 1, 2, and 3, for example) and plots the result. The "&" in the s() function stands in for the current metric name handed down by map, and the "*" matches every source.
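In spirit, that per-metric step is nothing more exotic than a point-wise mean across sources. A toy version of the aggregation (real composites operate on whole time-aligned series; this averages a single aligned sample per source):

```java
import java.util.Map;

class CompositeMean {
    // Average one sample of a metric across every source emitting it,
    // as mean(s("&", "*")) does for each metric the map() matches.
    static double meanAcrossSources(Map<String, Double> samplesBySource) {
        return samplesBySource.values().stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // e.g. jackdaw.memcached.get.alertcondition.time.99th on three servers
        Map<String, Double> sample = Map.of(
                "server1", 2.0,
                "server2", 4.0,
                "server3", 6.0);
        System.out.println(meanAcrossSources(sample)); // plots 4.0
    }
}
```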

Composites are a relatively new feature in our UI, but this pattern is clever, and increasingly common.

By using a composite here instead of hard-coding individual metrics, the engineer who crafted it is future-proofing this instrument against new data types. If, for example, Jackdaw is modified in the future to cache UserName data, this instrument will slurp up the new cache-data metrics, as long as their names match the pattern jackdaw.memcached.*.time.99th.
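The matching behaves like a shell glob over metric names. Here’s a small sketch of that idea (a simplification of the real matcher, assuming * greedily matches any run of characters, dots included):

```java
import java.util.regex.Pattern;

class MetricGlob {
    // Convert a metric wildcard like "jackdaw.memcached.*.time.99th"
    // into a regex, quoting the literal chunks so dots stay literal,
    // and letting each * match anything.
    static boolean matches(String glob, String metricName) {
        StringBuilder regex = new StringBuilder();
        for (String part : glob.split("\\*", -1)) {
            if (regex.length() > 0) regex.append(".*");
            regex.append(Pattern.quote(part));
        }
        return Pattern.matches(regex.toString(), metricName);
    }
}
```

Under this scheme a hypothetical new jackdaw.memcached.get.username.time.99th metric would be swept into the instrument automatically.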

Well, I hope you’ve enjoyed dashboard diving with me, and that it’s given you some insight into how our unlimited access to a world-class metrics system has taught us to select and use the most effective operational metrics.
