When an alarm wakes me in the wee hours, my first instinct is to find a laptop and attack the problem with every chainsaw at my disposal. Don’t do this. Don’t use a chainsaw while you’re still mostly asleep. Alerts that call or page a human are requests for human judgement, not chainsaws. If the problem were simple enough for the automation to handle, there would be no need to wake up a human, so the most important thing we can do is make sure we understand the context. Look at the past. Look at the other graphs in the dashboard, the logs, and any recent alerts.
Chat First, Chainsaw Later
Document your OODA loops as you work the problem. Your automation probably already put a few graphs and logs in the chat room, but the first few loops are likely to end with a Decision that you need more Observations and Orientation before taking Action:
Send snapshots from related metrics, paste links to your logs, and talk to yourself in the room about what you’re seeing until you come up with a hypothesis (Decision) and an Action that will improve the situation and test your hypothesis. Document both of those in the room and return to Observing and Orienting until you can confirm the results.
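If it helps to keep those room notes consistent at 3 a.m., here is a minimal sketch of one way to tag each note with its OODA phase and a UTC timestamp before pasting it into chat. The `OodaNote` class, the `format_note` helper, and the line format are all hypothetical names invented for this illustration, not part of any chat tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# The four OODA phases; anything else is rejected so typos surface early.
PHASES = ("observe", "orient", "decide", "act")


@dataclass(frozen=True)
class OodaNote:
    """One note for the chat room (hypothetical structure)."""
    phase: str
    text: str
    at: datetime  # should be timezone-aware

    def __post_init__(self):
        if self.phase not in PHASES:
            raise ValueError(f"unknown OODA phase: {self.phase!r}")


def format_note(note: OodaNote) -> str:
    """Render a note as a single chat line with a UTC timestamp."""
    stamp = note.at.astimezone(timezone.utc).strftime("%H:%M:%SZ")
    return f"[{stamp}] {note.phase.upper()}: {note.text}"


if __name__ == "__main__":
    # Illustrative note only; the wording is made up.
    note = OodaNote("decide", "need more graphs before acting",
                    datetime.now(timezone.utc))
    print(format_note(note))
```

Tagging the phase explicitly makes it easier to see, when scrolling back, which lines were hypotheses and which were actual changes to the system.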
Get Help Early and Get Together on a Call
If you’ve been working the problem for a while but the situation isn’t improving, it’s time to wake up someone else.
Given the choice, I’ll always pick 30 minutes of questions at 1 a.m. over three hours of firefighting at 2 a.m.
Talk them through the events using the backlog in the chat room as an outline.
Audio is best when you’re discussing complex issues and need to make sure everyone is on the same page.
Keep Your Adrenaline in Check
If you’re starting to feel the need to skip past the Observe, Orient, and Decide parts of OODA and go right to Act, repeat the Hadfield quotation to yourself: “There is no problem so bad that you can’t make it worse.” When there’s a break in the action, use it to keep the adrenaline and cortisol levels down with humor in the room and on the call. The @devops_borat account no longer tweets, but there are some real classics in there.
Pull back from the adrenaline-induced tunnel vision by thinking about how this is going to make for an awesome story eventually, and try to identify amusing graphs that can be tweeted later.
Stop Once You’ve Stabilized
Once you’ve stabilized the system and updated your status page, stop making changes. While you’re still awake and watching the dashboards to make sure the system is truly stable, file tickets to improve the detection and remediation, but resist the impulse to “clean up”. Remember “There is no problem so bad that you can’t make it worse”? Even with human-friendly automation, it’s easy to make a mistake in the dark of the night, so limit your actions to those that will improve stability. Save the cleanup for tomorrow when you’re writing notes on the event and helping with the postmortem.
Be Prepared to Work on an Internal Postmortem
The chat room can be a form of documentation, but it will need to be distilled into tickets, code, and lessons learned. When you’re asked about last night’s event, be ready to explain what happened.
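As a starting point for that distillation, here is a rough sketch that pulls a decision-and-action outline out of a chat backlog. It assumes, purely for illustration, that notes were written as `[HH:MM:SSZ] PHASE: text` lines; that format and the `outline` function are hypothetical, and a real chat export will need its own parsing.

```python
import re

# Assumed line shape: "[02:03:04Z] ACT: restarted worker" (hypothetical format).
LINE = re.compile(
    r"^\[(?P<stamp>\d{2}:\d{2}:\d{2}Z)\] (?P<phase>[A-Z]+): (?P<text>.*)$"
)


def outline(chat_lines, keep=("DECIDE", "ACT")):
    """Return (stamp, phase, text) tuples for lines whose phase is in `keep`.

    Lines that do not match the assumed format (general chatter, jokes,
    pasted links) are skipped rather than treated as errors.
    """
    events = []
    for line in chat_lines:
        match = LINE.match(line.strip())
        if match and match.group("phase") in keep:
            events.append(
                (match.group("stamp"), match.group("phase"), match.group("text"))
            )
    return events


if __name__ == "__main__":
    # Illustrative backlog only; the content is made up.
    backlog = [
        "[01:02:00Z] OBSERVE: queue depth rising",
        "anyone else awake?",
        "[01:10:00Z] DECIDE: suspect stuck consumer",
        "[01:12:00Z] ACT: restarted consumer",
    ]
    for stamp, phase, text in outline(backlog):
        print(stamp, phase, text)
```

The outline is only a skeleton for the postmortem; the observations and dead ends in between still belong in the written notes.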
When an alert wakes you up, remember that it needs you, not your chainsaw. It needs your judgement and your ability to reason. Get help early and get it often. Adrenaline is not your friend, but it will be your companion, so make sure you keep it in check. Last but not least, keep in mind that there is no problem so bad that you can’t make it worse, so save cleanups or improvements for later.
Want to join our awesome OPS team where humor prevails? http://librato.jobs