Welcome back to our Tales of the Ops Team series. In our last installment, we learned through Confessions of a Chatbot how to make critical information available to all involved with Ops problems. Today, we look at how we fix such problems with a little help from the right monitoring approach, and an awesome graph.
Monitor Everything as a Metric
One of the reasons I wanted to get a job at a monitoring and metrics company (I’ve been here a month and a half) was that I’d get to learn from experts in the field. Metrics and monitoring have always been a passion of mine. Getting to learn from people at Librato was a great opportunity to further that passion, and to work side by side with other monitoring aficionados.
Since I started at Librato, I’ve had many conversations with @josephruscio about what we can monitor and how to do it. There are things that I’m used to monitoring with a Nagios alert that don’t translate neatly into how we do such things on the Librato platform, where all alerting is done on a metric. Joe’s approach, in which everything is a signal, reminds me of what I expect to find in How to Measure Anything, which I hope to find time to read sometime soon.
Last week, we were talking about the impact of the latest round of OpenSSL vulnerabilities, and what that would mean for the AWS Elastic Load Balancers. We expected there to be some kind of noticeable change. Sure enough, on Thursday we got an alert in our Slack rooms, via the Librato integration, that the IP addresses of our ELBs had changed:
I asked the people on my team how we had detected that, given that we monitor everything as a metric, and they showed me this:
I was amazed. It was easy to see exactly when the IP addresses on the ELB had changed, and it was obvious that we were now being served from new IP addresses. We have a Python script that iterates through all of our ELBs and looks up the IP addresses associated with each. For each IP returned, we give it a value of 1 and store it.
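To make the idea concrete, here is a minimal sketch of what such a script could look like. It is not our actual script: the metric name `elb.ip.present`, the function names, and the use of plain DNS resolution to find each ELB’s addresses are all assumptions for illustration. It only builds the list of measurements (one gauge per IP, value 1, with the IP as the source); submitting them to Librato is left out.

```python
import socket
from typing import Callable, Dict, Iterable, List


def resolve_ips(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses a DNS name resolves to."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def build_gauges(
    elb_hostnames: Iterable[str],
    resolver: Callable[[str], List[str]] = resolve_ips,
    metric_name: str = "elb.ip.present",  # hypothetical metric name
) -> List[Dict[str, object]]:
    """Build one gauge measurement per IP address, using the IP as the source."""
    gauges = []
    for hostname in elb_hostnames:
        for ip in resolver(hostname):
            # The value is always 1: the signal is whether the source reports at all.
            gauges.append({"name": metric_name, "value": 1, "source": ip})
    return gauges
```

Run on a schedule (say, once a minute), a script like this produces a steady stream of 1s per IP. When an ELB moves to new addresses, the old sources simply go quiet and new ones appear, which is exactly what shows up on the graph.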
Next, we have an alert that says that if any of the IPs stops reporting for 5 minutes, post to the Slack room.
Because each IP address is submitted as a “source” in Librato, when any one of them stops reporting for 5 minutes, that triggers our alert. Because the alert is posted in the chat rooms, both our support staff and our ops staff know to be on the lookout for any unexpected reports from customers or unexpected behavior from our systems (which were unaffected).
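The alert itself is configured on the Librato platform rather than written as code, but the condition it checks is simple enough to sketch. The snippet below is only an illustration of the “stopped reporting” logic, with hypothetical names; it assumes we track the timestamp of the last measurement seen for each source.

```python
from typing import Dict, List

STALE_AFTER_SECONDS = 5 * 60  # the 5-minute window from the alert


def stale_sources(
    last_seen: Dict[str, float],
    now: float,
    threshold: float = STALE_AFTER_SECONDS,
) -> List[str]:
    """Return the sources whose most recent measurement is at least `threshold` seconds old."""
    return sorted(
        source for source, timestamp in last_seen.items()
        if now - timestamp >= threshold
    )
```

For example, if `10.0.0.1` last reported at time 1000 and `10.0.0.2` at time 1290, then at time 1300 only `10.0.0.1` has been silent for a full five minutes, so it is the only source that would trigger the alert.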
I’ve been in my new role running Operations at Librato for six weeks or so, and every day is a new adventure to learn about metrics and monitoring on a whole new level. At Librato, we choose our metrics well, and for me, this particular problem was a great example of how that approach presents a rather unconventional but super effective way to solve a common problem.