New Year's Resolutions from the Trenches of Real Time Monitoring — Librato Blog

This is the year.

This coming year you mean?

Yes. 2014 is the year. This year I will get our monitoring stuff under control once and for all.


I will. I'll start by eliminating all false positives, every last one. In 2014 there will not be a single instance of an Ops person waking up in the middle of the night for a non-issue. My pocket will not continuously buzz with a never-ending cacophony of a thousand useless pages per day. In 2014 every alert will be meaningful, actionable, and targeted to exactly the correct individuals, and will take into account the on-call rotation schedule. In 2014 I will be free.

Awesome, we’re re-architecting our alerting too.

No, I'm serious. In 2014 I will create some sort of interface or service that makes it possible to acknowledge every alert at the device that receives it. There will be some sort of unique ID on every alert, and the service will use that ID to identify the alert in the alerting system. It'll be great: I'll hack our ChatOps bot so that people can ACK alerts from XMPP, and I'll hack the SMTP gateways so that if you reply to an email or SMS alert in a certain way, you can ACK the alert. We'll set up a policy that requires that all alerts be acknowledged -- there will be, for every alert, a business requirement and a response in 2014.

You don’t require ACKs now? Well, third-party integration is hard to keep up with.

In 2014, I'll make the dashboards readable again by eliminating everything that's been red on the dashboard for more than 24 hours. It will be a reckoning. I will be relentless, and unforgiving, and the deleted configuration will cascade like an avalanche from my hands to /dev/null. And when the dust clears, we'll be able to come in every morning and have a reasonable view of everything important that's down or having trouble. I'll even add an alert history sidebar to the dashboard, so people can see all the alerts that have been sent in the last 12 hours! Every Busdev passing in the hallway will happen to glance upon our dashboard and be enlightened in 2014 (I meant to do it last year, but I've been busy).

Real Time Monitoring. What can we say?  It does take time.

And another thing: in 2014, I'm going to devops our visualization efforts. I'll tear down the silos and standardize us all on a single tool and a single location for all of our graphs. No more of this couple hundred RRDs over here and umpteen thousand Whisper DBs over there. And don't get me started on those network meatheads who still insist on MRTG. I will build a time-series paradise! It will be horizontally scalable and redundant and fast and secure and wicked snazzy. And everyone who would ever wish to make a graph will be happy -- no, honored -- to use my system.

Well FWIW we actually did build a....

They will shed their meager tools like the chains they are, and run to me with the gleam of delight and salvation in their eyes. And when it is built, I will wield standardization like a scythe. Every app will be instrumented and have a dashboard. Oh, and there will be some means of easily adding annotations, you know, so we can correlate non-periodic events on our time-series graphs. This I vow: all will know where to go to see the data in 2014.

You don’t have a dashboard per app?

Are you kidding me? We have 12 monitoring systems across 5 teams, all of which overlap in about 70% of their functionality. In 2013, 3 of our teams ran independent monitoring budgets, any one of which should have been enough for us all. We just need to build some common infrastructure, you know? Decide on common interfaces. Focus on making monitoring a forethought. We can do it in 2014, I’m sure of it.

We hear you, and hey,


You're not alone.

Sound familiar? Share your goals with us in the comments section; we’d love to hear what you will be working on in 2014. And most importantly, how can we help?