3.15569e7 Seconds of Monitoring — Librato Blog

3.15569e7 Seconds of Monitoring

Well, internet, that’s another lap around the Sun in the record books. If my rapidly expanding holiday midsection and the influx of season-finale episodes in my Hulu queue were no indication, I could still tell it’s the end of the year by the flood of requests in my inbox for humorous-with-a-touch-of-wistful, retrospective content.

Looking back on 2014 I realize that this is no great challenge.  Honestly, I spent most of 2014 in awe of the fantastic engineering going on around me. In fact, so much great work went on this year my real challenge will be in figuring out where to begin, and doing justice to this amazing community that I'm so lucky to be a member of.  So where to begin?

DevOps and TDD Died

Various presentations and articles this year proclaimed the death of both DevOps and TDD, suggesting that the confusion surrounding both terms has crossed Hanlon’s border from well-intentioned misinformation to overtly malicious appropriation. I, for one, am relieved that we can stop arguing about what the terms mean, since both can now safely be considered effectively meaningless.

If this strikes you as unfortunate or sad, you may take solace in waxing prolific (in your neck-beardiest tone) about the heady early days of the revolution when these terms meant something, and how their use has been co-opted and diluted by the media to defend practices orthogonal to the spirit of the original intent. You may also rejoice in never again having to hear the phrase: "... since before devops was even a thing".

This will not, of course, have any meaningful effect on the awesomeness of the DevOpsDays community or the fantastic work their community members have been doing since before DevOps was even a thing.

SREcon and Canary.IO Were Born

Speaking of DevOps and Awesomeness, 2014 saw the birth of both SREcon and canary.io. SREcon, a USENIX conference for Site-Reliability Engineers and people who want to play Cards Against Humanity with Site-Reliability Engineers, sold out in a matter of weeks and was not only an unqualified success, but a darn good time.

Canary, the brainchild of Michael Gorsuch and Brandon Tindle, is a distributed HTTP measurement network, built with open source components, that produces open data. You can add your own site to the list, and Canary will monitor it and provide the data via a publicly accessible API. Despite the rather morbid connotation of its moniker, Canary is an awesome idea that I hope to see blossom in the coming years.

The Cloud Glitched (again). Netflix Didn’t (again).

It was a good year to be webops-ing Netflix, where 2014 brought the open-sourcing of Atlas, Netflix's internal telemetry system, as well as a litany of high-profile web-operations success stories. These included surviving the biggest of the AWS blips in 2014: the Xen-zero-day-induced reboots, which took down 218 Netflix Cassandra nodes but had no discernible impact on their streaming services.

Netflix's Engineering team has become infamous for its chaotic proclivity, even going so far as to create a simian army to constantly and unceremoniously disrupt and sabotage the efforts of honest and well-meaning infrastructure components. Together with their army, Netflix Chaos engineers wasted no time in 2014 striking down everything from individual AWS instances through network devices, DNS services, and even entire data center regions, all in the name of service-level reliability.

Indeed, if your parental repertoire includes threatening "reincarnation as a cockroach", I think you might consider swapping out cockroach for "Netflix Instance", because I'm pretty sure the latter is the more frequently disrespected, abused, and ignominiously terminated of the two. If you haven't visited their blog in a while you're doing yourself a disservice. Keep it up, Netflix. I'm on the edge of my seat.

Brendan Gregg Was Ported to Linux

Speaking of Netflix, I was delighted to discover at LISA that among their many other successes this year they've actually managed to port Brendan Gregg to Linux. After several years of guilty procrastination, in want of a few spare hours I could spend checking out some of Brendan's DTrace scripts on Illumos, you can imagine how excited I was to see that he's recreated, via Linux-native instrumentation, a large percentage of his DTrace tooling.

This really is a fantastic development, and it’s wonderful to see Gregg’s trademark layer-cake diagrams, with lots of arrows and subsystems and tools, that now say Linux on top instead of some other OS that (for many complex reasons (that mostly reduce to “laziness”)) I don’t typically run. Anyway, I’m really looking forward to spending 2015 in guilty procrastination for want of a few spare hours I can set aside to check out Brendan's new work on Linux.

The Monitoringosphere was (Still) Unable to Ignore Nagios

2014 brought all sorts of improvements to everyone's favorite tool to hate and use anyway. The recently released Nagios Core 4.0 includes a persistent worker-process model (yay, less worker forking!) and a socket interface, greatly improving its performance and interoperability respectively. This year Nagios Enterprises even released a proprietary log analysis solution built off the ELK stack. Current users of other lavish commercial log analysis products who are contemplating the switch but are concerned about performance are no doubt comforted by the fact that their current bill should buy several dozen (if not several dozen score) Nagios Log servers.

A lot of interesting Nagios-related work ensued throughout the year. Notable examples include several tools showcased at Velocity NY, like Etsy's nagios-herald for adding context to alerts, and substantial improvements to the venerable alert-processing tool Flapjack. Graphios released a 2.0 version this year which vastly simplifies the endeavor of connecting Nagios to systems like Graphite, StatsD, and Librato.

Engineers Worked on Tools

InfluxDB, Heka, OpenTSDB, and Grafana were all born in the years preceding 2014, but all saw massive improvement this year. Grafana in particular won the hearts of the community at this year's Monitorama conference, as evidenced by the several-minute-long standing ovation and shower of rose petals and silver coins that followed Torkel Ödegaard's Grafana presentation. For his part, Torkel has asked that whoever has been sending the life-sized solid-chocolate statues of himself every week please stop (his HOA has begun to complain).

If you’re an OpenTSDB user in want of a means of firing alerts, you might want to check out Bosun, a domain-specific language for describing patterns to alert on data sourced from OpenTSDB. In his excellent LISA presentation, Kyle Brandt showed how Bosun helps Stack Exchange model extremely specific failure criteria like combinations of metrics that remain n-standard deviations away from a historical mean for a defined period of time.
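To make that last failure criterion a bit more concrete, here is a minimal Python sketch of the idea. To be clear, this is not Bosun's expression language and not Stack Exchange's actual rule set; the function names (`sustained_deviation`, `should_alert`), the sample data, and the 3-sigma default are all made up for illustration. It assumes each metric arrives as a list of historical samples plus a list of samples from the recent window you care about.

```python
from statistics import mean, stdev


def sustained_deviation(history, recent, n_sigma=3.0):
    """Return True if every sample in `recent` sits more than `n_sigma`
    standard deviations away from the mean of `history`."""
    if len(history) < 2 or not recent:
        return False  # not enough data to say anything useful
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation of the baseline
    threshold = n_sigma * sigma
    # The deviation has to hold for the whole window, not just one spike.
    return all(abs(x - mu) > threshold for x in recent)


def should_alert(metrics, n_sigma=3.0):
    """Fire only when *every* metric in the combination is misbehaving.

    `metrics` maps a (hypothetical) metric name to a (history, recent) pair.
    """
    return all(
        sustained_deviation(history, recent, n_sigma)
        for history, recent in metrics.values()
    )


if __name__ == "__main__":
    baseline = [10, 11, 9, 10, 10, 11, 9, 10]
    metrics = {
        "web.latency_ms": (baseline, [20, 21, 19]),
        "web.error_rate": (baseline, [25, 24, 26]),
    }
    print(should_alert(metrics))
```

The point of checking the whole recent window, rather than the latest sample, is that a single spike never pages anyone; only a deviation that persists across the window (and across every metric in the combination) does.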

Campfire Got Deprecated and Everybody Tried Out Slack

2014 bore witness to the nearly simultaneous release of SlackHQ and the deprecation of Campfire along with most of the rest of 37signals. If you've somehow managed to avoid using it, Slack, with its seamless multi-team support, private messaging, gif savviness, and solid API, is basically fantastic. Within a few weeks of its release I found myself in seven different Slack teams as various IRC, Facebook groups and other social media collectives collapsed in its wake like buildings before a cinematic nuclear explosion. And having spent the last several days hacking on Slack chatbots in Go when I should have been writing this article, I can personally attest that its websockets API is basically Rad, and a joy to work with.

So Many Crazy Great Talks

There were too many awesome talks this year to mention, but here are a few of my favorites: I enjoyed all of @aphyr’s talks this year, but his Strange Loop talk was one of my all-time favorites (it’s rare to see, in someone so young, such a perfect balance of comedy, science, engineering and weightlifting). Doug and Jane Ireton’s talk at Velocity about encouraging girls in IT was fantastic. Mikey Dickerson’s Velocity keynote, “One Year After healthcare.gov”, is required watching for anyone who doesn’t think management is “real work”. And finally, Pete Cheslock’s lightning talk about project-management failure, wrapped in a metaphor about the Swedish ship the Vasa, was insightful, hilarious, and so... so heart-wrenchingly true.

Meanwhile at Librato...

2014 was a year of pretty drastic change for me, having gone from a basically traditional Ops role to a work-from-home Silicon-Valley-Startup Engineering/Evangelist rolish thing.  Having spent a year shipping, blogging and talking for Librato, I can honestly say I’m having a blast and I’ve joined a fantastic team of amazing people who are just completely knocking it out of the park. 

Seriously though, if Sisyphus pushed like Librato, he would have been off that hill in two and a half hours. Since I joined almost exactly a year ago, every major customer-facing piece of our infrastructure has seen extraordinary improvement, and we’ve added several new features to boot. Highlights include:

  • Our Alerting platform, which was overhauled into our stream-processing infrastructure and given massive UI and functional upgrades like graphical threshold previews, alert annotations, and enhanced trigger criteria.
  • Our new composite metrics DSL, which allows our users to create new metrics by combining and transforming existing metrics. 
  • Spaces, our evolutionary new user-interface, which is currently in private beta, is the most powerful and elegant time-series analysis tool in the world today.

Some other things that happened this year at Librato: 

  • Integrated with VictorOps, Heka, and Nagios, and launched a turn-key collectd integration
  • Fed pizza to tons of talented and fantastic engineers at our meetup
  • Lots of conferencing! (12 conference talks to be exact)
  • We hired a slew of talented and amazing people, including the awesomely huggable Jason Dixon
  • Knocked down a wall and doubled the size of our office

If you manage, wrangle, or otherwise work with systems, cloud platforms, distributed applications, oil-wells, brew-kegs, strawberry fields, or even crock-pots, you should totally sign up for a free trial and let us show you what it means to add insight to your endeavors.