Tales of an Ops Team: Fire in the Hole — Librato Blog


Welcome back to our Tales of an Ops Team series. In our last installment, we introduced our Chatops town crier, Dr. Manhattan, and looked at some of the ways he helps us collaborate on production problems at Librato. In this follow-up article, we'll expand the scope a little and introduce you to another non-corporeal entity that we rely on for our day-to-day interaction with production.

Introducing Twke

Meet Twke, our resident chatbot. Twke was named after Buck Rogers’ loyal ambuquad sidekick, and he has a critically important job at Librato as the primary abstraction layer between our fallible human engineers and the ephemeral production infrastructure those engineers so lovingly tend.

Like Dr. M, Twke can notify us of interesting events; in the screenshot above, for example, he's letting us know that the Ops folks are kicking off some shell scripts. Unlike Dr. M, however, Twke can also carry out our commands and effect change in production on our behalf.

Twke spends a lot of time orchestrating the automation behind our deployment system. We ship around 50 deployments per day, relying on GitHub and Travis CI to integrate and regression-test our various software projects. Those tools do a great job of organizing our projects, enabling communication, and helping us find and eradicate bugs, but Continuous Integration (CI) also requires that we automate the deployment process in a way that leverages our existing tools, abstracts the application-specific details, and provides an organization-wide methodology for launching and documenting production deployments. To do CI correctly, you really need the proverbial deploy button.

Twke ship!

Some shops build custom systems to implement the deploy button, like Etsy's Deployinator, but for us, Chatops has proven to be an ideal solution for software deploys. The two seem made for each other. This, for example, is how we tell Twke to ship a new version of our metrics product to production:

Twke is a modular Ruby program, built on top of the scamp Campfire-bot creation framework. His functionality can be extended with plugins (also written in Ruby). In fact, all of the underlying deploy logic I'm about to show you is implemented in a plugin called Squirrel, while the job control functionality is handled by the job_control plugin. When someone tells Twke to ship something, the first thing he does is let us know that he’s on task:

Twke then parses out the command (in this case ship), and attempts to match it to a plugin that has registered to perform that type of command. In our case the Squirrel plugin handles the ship command, so Twke passes the command along with any subsequent options to Squirrel.
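
The dispatch step can be sketched in a few lines of Ruby. This is only an illustration of the idea, not Twke's actual code: the `Bot` class and its `register` and `dispatch` methods are hypothetical names, and the real bot relies on scamp's own message matching, which is not shown here.

```ruby
# Hypothetical sketch of command-to-plugin dispatch; not Twke's real code.
class Bot
  def initialize
    @handlers = {} # command word => callable that handles it
  end

  # A plugin registers the command word it wants to handle.
  def register(command, &handler)
    @handlers[command] = handler
  end

  # Split a chat message into the command and its options, then hand
  # the options to whichever plugin registered for that command.
  def dispatch(message)
    command, *options = message.split
    handler = @handlers[command]
    return "Sorry, I don't know how to '#{command}'" unless handler

    handler.call(options)
  end
end

bot = Bot.new
bot.register("ship") { |opts| "squirrel: shipping #{opts.join(' ')}" }

puts bot.dispatch("ship metrics master production")
puts bot.dispatch("dance")
```

Keeping the registry as a plain hash means a new plugin only has to call `register`; the bot itself never needs to know what any command does.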

Squirrel is privy to a lot of architectural minutiae. It knows what it takes to deploy the given branch of the given application to the given environment, and whatever isn’t codified in it, it knows how to discover. The first part of Squirrel’s job is to figure out what it's going to take to make the requested deployment happen, and then translate that knowledge into the correct set of environment variables and Capistrano tasks required to perform the deployment. Once Squirrel has a handle on it, it lets us know that it’s getting started.
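
As a rough sketch of that translation step, the snippet below turns a ship request into a set of environment variables plus a Capistrano command line. Everything specific in it is an assumption made for illustration: the app table, the variable names (`APP`, `BRANCH`, `DEPLOY_ENV`), and the task lists are invented, and the discovery of details that are not hard-coded is left out entirely.

```ruby
# Hypothetical sketch: app names, variable names, and task lists are invented.
APPS = {
  "metrics" => { tasks: ["deploy"] },
  "collector" => { tasks: ["deploy", "deploy:restart"] }
}.freeze

DeployPlan = Struct.new(:env, :command)

# Translate "ship <app> <branch> <environment>" into what a deploy needs:
# environment variables and the Capistrano tasks to run.
def plan_deploy(app, branch, environment)
  config = APPS.fetch(app) { raise ArgumentError, "unknown app: #{app}" }

  env = { "APP" => app, "BRANCH" => branch, "DEPLOY_ENV" => environment }
  DeployPlan.new(env, ["cap", environment, *config[:tasks]])
end

plan = plan_deploy("metrics", "master", "production")
puts plan.env.inspect
puts plan.command.join(" ")
```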

The final part of Squirrel's job is to launch the commands necessary to carry out the production deployments, and capture their output. As I alluded to above, we use Capistrano to carry out the automation tasks necessary to remotely deploy code to production. The Capistrano commands Squirrel chooses to run differ depending mostly on the language in which the application is written. Applications written in interpreted languages can be run right out of git, while applications that require compilation cannot. For the latter type we find it handy to repurpose the artifacts of our Travis jobs. Once the magic is complete, and the branch is deployed to production, Twke lets us know that Squirrel is finished and has gone back to doing the sorts of things you’d expect squirrels to do.
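
Launching a command and capturing its output might look something like the following. It uses `Open3.capture2e` from Ruby's standard library; the `run_job` helper and `JobResult` struct are hypothetical names, and a small Ruby one-liner stands in for the real `cap` invocation so the sketch can run anywhere.

```ruby
require "open3"
require "rbconfig"

# Hypothetical sketch of launching a deploy command and capturing its output.
JobResult = Struct.new(:output, :success)

def run_job(env, command)
  # capture2e merges stdout and stderr into a single captured string.
  output, status = Open3.capture2e(env, *command)
  JobResult.new(output, status.success?)
end

# Stand-in for a real `cap ...` command line.
fake_deploy = [RbConfig.ruby, "-e", 'puts "deploying #{ENV["BRANCH"]}"']
result = run_job({ "BRANCH" => "master" }, fake_deploy)

puts result.output
puts(result.success ? "squirrel finished" : "sad trombone")
```

Passing the command as an array rather than a single string avoids going through a shell, so branch names coming from chat are never interpreted as shell syntax.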

Sometimes the magic smoke escapes, and Squirrel, for whatever reason, can't get it done. When this happens, Twke lets us know by showing us the error, and playing a sad trombone in our honor.

When the malcontented rationalists among us (who claim not to believe in magic) want to peek behind the curtain, they enlist Twke's help by way of the jobs command. You can list currently running jobs with jobs list, kill a job-gone-bad with jobs kill, and get the output from a job that has just finished with jobs out like so:

If we don't want the whole log, we can just get the last 20 lines of a job's output with jobs tail. Working our deployments through a Chatops bot like this automatically creates a log of our production interactions that everyone knows how to access. It's a great way to keep everyone informed of what's going on, and keep us from stepping on each other's toes. For example, Twke is smart enough to know when we're trying to deploy a branch that already exists in production, and refuses to run two deploys on the same app at the same time.
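
The bookkeeping behind commands like jobs list, jobs out, and jobs tail, along with the one-deploy-per-app rule, can be sketched as below. This is an assumption about how such a plugin could be structured, not a description of job_control itself: `JobControl`, `DeployInProgress`, and the method names are all hypothetical, and jobs kill is omitted.

```ruby
# Hypothetical sketch of job bookkeeping; not the real job_control plugin.
class JobControl
  class DeployInProgress < StandardError; end

  Job = Struct.new(:id, :app, :lines, :running)

  def initialize
    @jobs = {}
    @next_id = 1
  end

  # Refuse to start a second deploy while one is running for the same app.
  def start(app)
    if @jobs.each_value.any? { |job| job.app == app && job.running }
      raise DeployInProgress, "#{app} already has a deploy in progress"
    end

    job = Job.new(@next_id, app, [], true)
    @jobs[job.id] = job
    @next_id += 1
    job
  end

  def finish(id)
    @jobs.fetch(id).running = false
  end

  # IDs of jobs that are still running (`jobs list`).
  def list
    @jobs.values.select(&:running).map(&:id)
  end

  # Full output of a job (`jobs out`).
  def out(id)
    @jobs.fetch(id).lines.join("\n")
  end

  # Only the last n lines of a job's output (`jobs tail`).
  def tail(id, n = 20)
    @jobs.fetch(id).lines.last(n).join("\n")
  end
end

jobs = JobControl.new
job = jobs.start("metrics")
30.times { |i| job.lines << "line #{i + 1}" }
puts jobs.tail(job.id)
```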

Feature rollouts

We also rely on Twke and his Rollout plugin as the primary interface to our feature-flagging system. Rollout is a central authority that you can use to track which users and groups of users have access to new or experimental features in an application. We flag experimental features in our codebase using branching statements like the one below (written in pseudocode):

if this user is flagged as a “Beta-SDK” person:
    ok cool, run shiny_new_feature()
else:
    bummer, do_it_the_old_way()

The line that asks if the user is flagged is actually a function call that results in a request to Rollout. Rollout receives the request, checks its database of users, groups and the features with which they are flagged, and responds whether the user is flagged with the feature. In the first part of the screenshot below, Paul asks Twke to ship a new snapshot feature to staging, and then, while he’s waiting for the deploy to finish, he asks Twke which users are currently flagged for the feature in staging:
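
In Ruby, the flag check described above might look roughly like this. The `FlagStore` class is a hypothetical in-memory stand-in for the Rollout service (the real check is a request to Rollout, and groups are left out here), and the feature name `beta-sdk` is only an example.

```ruby
require "set"

# Hypothetical in-memory stand-in for the Rollout service.
class FlagStore
  def initialize
    @users = Hash.new { |hash, feature| hash[feature] = Set.new }
  end

  def activate_user(feature, user_id)
    @users[feature] << user_id
  end

  # Is this user flagged for this feature?
  def active?(feature, user_id)
    @users.key?(feature) && @users[feature].include?(user_id)
  end

  # Which users are currently flagged for this feature?
  def users_for(feature)
    @users.key?(feature) ? @users[feature].to_a : []
  end
end

def shiny_new_feature
  "new code path"
end

def do_it_the_old_way
  "old code path"
end

# The branching statement from the application's point of view.
def handle_request(flags, user_id)
  if flags.active?("beta-sdk", user_id)
    shiny_new_feature
  else
    do_it_the_old_way
  end
end

flags = FlagStore.new
flags.activate_user("beta-sdk", 42)
puts handle_request(flags, 42)
puts handle_request(flags, 7)
```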

Twke lets Paul know that, in fact, nobody is flagged for the new feature, so Paul goes ahead and enables (or, if you prefer, rolls-out) the feature to every ID in the staging environment.

Through Twke, we can also manage group memberships, and add or remove flags from individual users or groups. We can even tell Twke to flag a percentage of randomly chosen users, which we find useful for simultaneously running new code paths along with old ones to both vet new code and measure it against the existing solution. We also use rollout percentages to incrementally expose new features to increasing amounts of production load over the course of several hours or even a day rather than inundating a new feature with the full production workload all at once.
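
One common way to implement a percentage rollout is to hash each user into a bucket from 0 to 99 and enable the feature for buckets below the chosen percentage. Note that this sketch swaps deterministic hashing in for truly random selection, and it is an assumption rather than a description of how Rollout picks users. The `in_rollout?` helper and the `snapshots` feature name are hypothetical.

```ruby
require "zlib"

# Deterministic percentage bucketing (an assumed technique, not necessarily
# what Rollout does): hash feature name plus user ID into a bucket 0..99.
def in_rollout?(feature, user_id, percentage)
  bucket = Zlib.crc32("#{feature}:#{user_id}") % 100
  bucket < percentage
end

users = (1..10_000).to_a
enabled = users.count { |id| in_rollout?("snapshots", id, 20) }
puts "#{enabled} of #{users.size} users flagged"
```

Because a user's bucket never changes, raising the percentage only adds users; nobody who already has the feature loses it as the rollout ramps up.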

Once Paul is satisfied with his feature in staging, he decides to roll out the new functionality to production, but instead of enabling every user at once, he proceeds a bit more conservatively by announcing his intentions and flagging 20% of our customer base. Everything will probably go fine, but he's prepared to help field the support burden if things go sideways.

Chatops helps you get cuddly

Together, Twke and Dr. M help us focus on our core engineering competencies by making production interactions simple, transparent, repeatable, and ultimately safe. They also manage to keep everyone in the company, regardless of their job description, informed of a very broadly scoped collection of important events, without negatively impacting anyone's productivity. For example, our Chatops bots keep us up to date on our trouble-ticketing and support systems:

They inform us of GitHub commits and Travis results:

They warn us about Leader elections and other sorts of political unrest in production:

They help us track Customer signups and important onboarding stats like how long it took a new user to send their first metrics, and what user-agent they employed:

They let us know when database schema changes happen:

By way of a Zapier integration to Google Calendar, they even make sure we don’t miss any meetings:

Without Twke and Dr. M, our organizational awareness and esprit de corps would be impaired. Chatops isn't just an eventual-consistency system for human beings; it's also something of an empathy injection protocol. If we were to deliver these messages individually via email or some other means, not only would they represent an unintelligible maelstrom of notifications from dozens of unrelated systems that we would inevitably begin to ignore, but they would also be processed individually by human beings isolated from each other by the context of their own endeavors.

When, however, the same messages are delivered via a Chatops bot, they're processed in a community context where everyone can react to them, discuss them, resolve them, and learn about how other people respond to them. By embedding these messages solidly in the context of our daily organizational rhythm, Chatops makes notifications like these not just manageable, but inclusive and educational. In the chatroom, our problems become a unifying force rather than hopefully someone else's problem.

We hope you've enjoyed our second article in the "Tales of an Ops Team" series. Be sure to join us next time when Dr. M says: