When ChatOps Goes From Cool to Critical — Librato Blog

Ben Odom


Hi, I’m Ben, an Operations Engineer with Librato. I am very excited about ChatOps as a force for positive change at companies both small and large. If I could work solely on ChatOps and related integrations for the next several years, I would be a happy camper. When I first heard Dave Josephsen’s talk at Velocity NY 2014 and got a glimpse of Librato’s ChatOps practices, I was intrigued. When I joined the Librato team later that year and dove head-first into helping them migrate their ChatOps platform from Campfire to Slack, I was gobsmacked by how tightly integrated ChatOps was with the daily workflow of engineers.

Since I joined the company just over a year ago, our reliance on ChatOps has continued to grow as our customer base grows, our infrastructure scales, and the need for us to leverage automation increases. This caused me to stop and reflect on the following: ChatOps is no longer a nice-to-have on our team; it is no longer a convenience. It is, in fact, a critical piece of our operations. The realization led us as an operations team to pause and consider how we were designing, operating and supporting our ChatOps platform.

Before I share some of the highlights of our improvements, I would like to ask: how do you, the person implementing the platform, know when your own ChatOps operation has moved from “Cool” to “Critical”? I will review the progression of a typical ChatOps implementation to identify common patterns and show how you can predict and prepare for upcoming needs.

Illustrating a Typical ChatOps Adventure

How did ChatOps feel for you and your team when you first began prototyping it? Did you start by downloading Lita, Hubot, or Err, and any and all public plugins available? How long did it take you to disable the Google Image Search plugin? Did your team start using Slack, and begin experimenting with slash commands? What useful features did you first take advantage of in your ChatOps experimentation? Some common ChatOps features that teams use when they are first getting started include:

  • Communicating emotions (gifs, pugbombs, etc.)

  • Data lookups (weather, users, dig, etc.)

  • Webhook Service Integration (CI, GitHub, alerting, etc.)

Congratulations! You have reached ChatOps level: Cool. 

Yes, your ChatOps prototype has turned into something very useful and beloved by your team. The Giphy lookups are flowing, GitHub notifications are coming in, and you and your team start having ideas about more useful pieces of automation, such as deploying your code from chat commands.

You realize quickly that you will need to create some custom code and configuration to make this automation happen. Now you have likely entered the world of bot and/or bot plugin development. You begin to build a deployment plugin that will allow users to initiate code deployments within your infrastructure. You work with your team members to understand which applications they want to ship into which environments, and what parameters they want to be able to specify when shipping. Internet hugs and high-fives abound as your team is using its chat platform in their daily workflow to review code, receive alerts, ship code, and more.

Pre-built integrations with your chat provider are a quick and easy way to increase adoption of a ChatOps platform.

All is well with the world. Your team has not only fully embraced the functionality of your ChatOps bots, plugins, and integrations, but they couldn’t live without them. You kick back in your chair, start dreaming up your first ChatOps conference talk, and… WAIT. They can’t live without ChatOps? What if it breaks? Are there well-known manual ways to deploy, now that you’ve been enhancing the deployment process in ChatOps? Are there disaster recovery options for the bot and the automation node(s)? Just as you’re thinking through these options, the Director of Engineering stops by to let you know that they plan on adding two new environments in geographically dispersed datacenters, and they are hoping that the ChatOps deploy automation will work there also.

Uh-oh. You have reached ChatOps level: Critical.

Congratulations! You have now enhanced your ChatOps platform to the point where it is a vital piece of your overall platform, and one that must be designed, operated, and automated just like any other valuable application in your portfolio. What concerns should you take into consideration when planning and preparing your ChatOps platform for this type of growth? I can offer some insight into the steps we at Librato have already taken and are taking now to ensure our critical ChatOps platform is flexible enough to meet current and future needs.

Lessons Learned from Growing our ChatOps Platform

At Librato, we experienced a lot of change and growth in the past year, which began to strain the capabilities of our ChatOps platform design. We needed to migrate our chat provider from Campfire to Slack, our team was growing, our number of custom plugins was growing, and the flexibility of our ChatOps platform needed to increase. I will discuss a few of the lessons we learned as we addressed these issues, and the architectural concepts we considered as part of each lesson.

You are likely to be writing a non-trivial amount of code for ChatOps. If you are planning on fully automating meaningful tasks through ChatOps, such as deployments and infrastructure changes, community chatbot plugins are not likely to get you where you want to be. For the most part, these plugins deal with discrete services, from a global view, and without standardization of parameters or design between plugins. Non-standardization isn’t a big problem when you’re using three plugins, but once you expand to 10 or 15 plugins, having standard chat interfaces matters. Given that you will be writing plenty of code for this automation, choose a starting point that is well suited to the team(s) that you plan on having contribute to the codebase. Lita (Ruby), Hubot (JavaScript/CoffeeScript), and Err (Python) are examples of OSS bot platforms that are good starting points for ChatOps bot plugin development.

Custom ChatBot plugins provide more in-depth interactions with your applications and infrastructure.

Consider putting your business logic in an API. If you are building your ChatOps platform like any other custom software project, it may make sense to build features into a core API, and to have bot plugins consume the API as clients. In this way, you open up the possibility for other clients to participate in your automation strategy (such as CLI clients running from a server). Furthermore, designing your platform in this way will make migrating between chatbots simpler, and will ease the scaling and management of your ChatOps service.
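
To make that layering concrete, here is a minimal sketch in Python. It assumes a hypothetical `DeployService` acting as the core API, with two thin clients calling it: a chat handler and a CLI handler. None of these names come from our actual platform; they are invented purely for illustration, and the "deploy" step only records the request.

```python
from dataclasses import dataclass, field


@dataclass
class DeployRequest:
    app: str
    environment: str
    requested_by: str


@dataclass
class DeployService:
    """Hypothetical core API: deployment rules live here, not in the bot."""
    allowed_environments: tuple = ("staging", "production")
    history: list = field(default_factory=list)

    def deploy(self, request: DeployRequest) -> str:
        if request.environment not in self.allowed_environments:
            raise ValueError(f"unknown environment: {request.environment}")
        # A real service would trigger the deployment here; this sketch only records it.
        self.history.append(request)
        return (f"deploying {request.app} to {request.environment} "
                f"for {request.requested_by}")


def handle_chat_message(service: DeployService, user: str, text: str) -> str:
    """Chat client: parses 'deploy <app> to <environment>' and calls the API."""
    words = text.split()
    if len(words) != 4 or words[0] != "deploy" or words[2] != "to":
        return "usage: deploy <app> to <environment>"
    try:
        return service.deploy(DeployRequest(words[1], words[3], user))
    except ValueError as error:
        return f"error: {error}"


def handle_cli(service: DeployService, argv: list, user: str = "cli") -> str:
    """CLI client: takes [app, environment] and calls the same API."""
    if len(argv) != 2:
        return "usage: deploy <app> <environment>"
    app, environment = argv
    try:
        return service.deploy(DeployRequest(app, environment, user))
    except ValueError as error:
        return f"error: {error}"
```

Because both handlers only translate input into a `DeployRequest`, swapping the chat provider or bot framework in this sketch would mean rewriting `handle_chat_message` and nothing else.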

Plan ahead for the appropriate level of security. It is a myth that ChatOps is not secure. Like any piece of software, it is as secure or insecure as it is designed and implemented. In fact, ChatOps can be a great asset to InfoSec/Audit teams because it provides a common stream to perform, control, and audit activity within a team. Furthermore, compliance-obligated organizations such as Box.com, who had compliance requirements including PCI, ISO, HIPAA, FINRA, and FedRAMP, have provided public examples of how teams can properly secure ChatOps in these environments. Find a common-sense approach that balances usability, good practices, and your organization’s compliance requirements.
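
As one illustration of the "perform, control, and audit" idea, the sketch below checks a user's role before a command runs and records every attempt, allowed or not. The role map, the command names, and the `authorize_and_audit` function are all hypothetical; a real setup would more likely pull roles from an identity provider and write to durable storage rather than a list.

```python
from datetime import datetime, timezone

# Hypothetical role map; invented for this example.
ROLES = {"alice": {"deployer"}, "bob": {"viewer"}}

# Hypothetical mapping of chat commands to the role each one requires.
REQUIRED_ROLE = {"deploy": "deployer", "status": "viewer"}


def authorize_and_audit(user, command, audit_log, now=None):
    """Return True if `user` may run `command`; always append an audit entry."""
    required = REQUIRED_ROLE.get(command)
    # Unknown commands and unknown users are denied by default.
    allowed = required is not None and required in ROLES.get(user, set())
    audit_log.append({
        "time": (now or datetime.now(timezone.utc)).isoformat(),
        "user": user,
        "command": command,
        "allowed": allowed,
    })
    return allowed
```

Denied attempts are logged as well as successful ones, which is usually what an audit team wants to see.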

Plan for the level of availability your team requires out of ChatOps. Because you have identified that ChatOps is an important platform, you should take the time to do some basic risk analysis. What if Slack/HipChat is unavailable? What if the host(s) running your chatbot become unavailable? What is an acceptable MTTR for ChatOps components? Set your levels of redundancy, scale, and disaster recovery planning based on a sensible analysis of these risks against your team’s requirements.

Testing: it’s for your ChatOps plugins, too. Breaking your ChatOps functions is no fun for your team and is disruptive to operations. To quote Jimmy Cuadra, “... any plugins you write for your robot should be as thoroughly tested as any other program you would write.” Take advantage of the testing patterns established by your bot platform and make sure plugins are being tested properly, with your repos having CI tests run against them just like any other application on your team. Both Lita and Err have built-in support for testing your custom plugins.
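
The built-in test helpers differ between bot platforms, so the sketch below deliberately avoids any of them and uses only Python's standard `unittest` module. It assumes a hypothetical `parse_deploy_command` function that a plugin might call before acting on a message; keeping parsing in a plain function like this is one way to make plugin logic testable without a running bot.

```python
import unittest


def parse_deploy_command(text):
    """Parse 'deploy <app> to <environment>' into a dict, or return None."""
    words = text.strip().split()
    if len(words) != 4 or words[0] != "deploy" or words[2] != "to":
        return None
    return {"app": words[1], "environment": words[3]}


class ParseDeployCommandTest(unittest.TestCase):
    def test_valid_command(self):
        self.assertEqual(
            parse_deploy_command("deploy api to staging"),
            {"app": "api", "environment": "staging"},
        )

    def test_extra_whitespace_is_ignored(self):
        self.assertEqual(
            parse_deploy_command("  deploy   api to   staging "),
            {"app": "api", "environment": "staging"},
        )

    def test_malformed_command_is_rejected(self):
        self.assertIsNone(parse_deploy_command("deploy api staging"))
        self.assertIsNone(parse_deploy_command("please deploy api to staging"))
```

Saved in a test file, this can be run with `python -m unittest` locally or as a CI step, the same way you would run tests for any other application.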

Lastly, build an internal community around ChatOps. Make it as easy as possible for others on your team to contribute new features and enhance existing ones. At Librato, we periodically host internal hack days, where we can participate as individuals or on ad-hoc teams to quickly create prototypes of new features of our own choosing. Several new features of our ChatOps platform have been created out of these hack days. Find a way that works best for your team to meet periodically and get feedback about how ChatOps is working for them, and how it can be enhanced to improve your team’s ability to deliver value to customers.

These are a few of the considerations we at Librato have found to be important as we have continued to evolve our own ChatOps practice.

If you’d like to work on interesting problems such as evolving ChatOps tools and features, we’d love to talk with you.