Next Generation Alerting: Now in Public Beta




Alerting Reimagined

I'm excited to announce that our new alerting system has just been released into public beta!  If you asked me to describe a monitoring panacea in five words or less, "alert on what you draw" would probably be my answer. We have long been frustrated by the dichotomy between the service commitments we make as operations engineers and our monitoring capabilities. In Ops, our SLAs might, for example, commit us to processing 95% of all requests within 250 ms, while our alerting systems seem able only to notify us about full disks and overloaded CPUs. Meanwhile, the systems that collect and graph performance data seem to consider alerting a separate problem, so we are left with monitoring data that confirms our guarantees but nothing that alerts us when they break. We want to alert on what we draw. To the extent that we can collect, process, analyze, and alert on the same unified data stream, we all win at monitoring. 

Alerting isn’t just a feature. It is a keystone of monitoring infrastructure, and an infamously difficult thing to do well. When it goes wrong, it has a profoundly negative impact on the productivity and happiness of the engineers who rely on it. Speaking for the engineers at Librato, we’ve all been on the receiving end of poorly designed alerting systems, so when we decided to overhaul our alerting system from the ground up, we knew how critically important it was to nail it. We had a golden opportunity to create something great: something that could improve our customers’ quality of life and provide a foundation upon which they could build awesome applications. 

Our redesign resulted in a fundamentally new vision of what alerting could be in our metrics platform: a sophisticated, first-class component. We wanted to be able to interact with alerts in the same way we interact with metrics, instruments, or dashboards. For the past few months we have been quietly acting on that vision, replacing and expanding services, working and reworking implementations, and building the infrastructure we needed to make it a reality. 

Better Alerting Criteria

We wanted our new system to empower our customers to combine their technical expertise with expressive criteria to craft alerts that precisely model their production problems. We wanted to act on customer feedback and implement the features that would help our customers the most. We wanted to build the underpinnings, like alert triggering and threshold options, in a way that would encourage, and even instigate, growth and innovation going forward. And we wanted a lightweight interface that makes it quick and easy for our customers to manage their alerts. 

Get alerted when all data points in a given duration exceed a threshold

Our time-windowed alerts feature allows you to specify a duration of time within which every data point must exceed a threshold for the alert to fire. Time-windowed alerts can prevent false positives for metrics that are “bursty” in nature and sometimes innocuously spike above the threshold for a short period of time. 
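
To make this concrete, here is a minimal sketch of what a time-windowed alert definition might look like as an API payload. The field names used below (conditions, type, metric_name, threshold, duration) are assumptions for illustration, not the documented schema.

    # A minimal sketch of a time-windowed alert definition. The field names
    # ("conditions", "type", "metric_name", "threshold", "duration") are
    # illustrative assumptions, not the documented schema.
    import json

    time_windowed_alert = {
        "name": "api.latency.sustained",
        "conditions": [{
            "type": "above",              # fire only if the value exceeds the threshold...
            "metric_name": "api.request.latency",
            "threshold": 250,             # ...of 250 ms...
            "duration": 300,              # ...for every data point across a 5-minute window
        }],
    }

    print(json.dumps(time_windowed_alert, indent=2))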

Get alerted when a source stops reporting

Our new alerting system supports alerting on a metric/source pair that stops reporting for a configurable period of time. If, for example, one of your hosts stops reporting data because it has crashed, our alerting system can detect it and alert you.  This feature complements the power of our metrics platform while widening its scope to encompass areas like system availability monitoring. 
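
A stops-reporting check might be expressed as just another condition in an alert definition. The sketch below assumes an "absent" condition type and a source pattern; both are illustrative rather than the documented schema.

    # A sketch of a "stops reporting" condition: fire if a metric/source pair
    # sends no data for a configurable period. The "absent" type, the "source"
    # wildcard syntax, and the other field names are illustrative assumptions.
    dead_host_alert = {
        "name": "collectd.heartbeat.missing",
        "conditions": [{
            "type": "absent",
            "metric_name": "collectd.cpu.idle",
            "source": "web-*",       # pattern intended to match a group of hosts
            "duration": 600,         # no data for 10 minutes => fire
        }],
    }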

Get alerted when multiple conditions are met

The new alerting interface allows you to specify multiple conditions that must all be met before an alert fires. You could use this to specify that a machine must have both zero available RAM and 15 MB of utilized swap space before alerting on memory problems, or that a message queue is both over-utilized and causing latency before alerting on queue trouble.  Any number of conditions can be associated with an alert. 
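
Here is a sketch of that memory example, using the same illustrative payload shape as above; the field names remain assumptions.

    # A sketch of the memory example above: the alert fires only when BOTH
    # conditions hold. Field names and condition types are illustrative assumptions.
    memory_pressure_alert = {
        "name": "host.memory.pressure",
        "conditions": [
            {
                "type": "below",
                "metric_name": "memory.available_bytes",
                "threshold": 1,                   # effectively zero available RAM
            },
            {
                "type": "above",
                "metric_name": "swap.used_bytes",
                "threshold": 15 * 1024 * 1024,    # 15 MB of utilized swap
            },
        ],
    }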

Assign any number of alerts to a metric

Legacy alerts were basically thresholds assigned to individual metrics, but our new alerts are first-class entities; they stand on their own, so any number of alerts may refer to the same metric, and every alert can use multiple metrics as trigger criteria.

Get alerted on selected summary statistics

If you do high-resolution monitoring and send us pre-computed metric summarization values, like sum, count, min, and max, you may now craft alerts that use any or all of these values as trigger criteria.
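
As a sketch, a condition might name the summary statistic it should evaluate; the summary_function field below is an assumed name, shown only to illustrate the idea.

    # A sketch of alerting on a specific summary statistic of a pre-aggregated
    # measurement. "summary_function" and its values are assumed names.
    worst_case_latency_alert = {
        "name": "api.latency.worst_case",
        "conditions": [{
            "type": "above",
            "metric_name": "api.request.latency",
            "summary_function": "max",   # trigger on the per-interval maximum, not the average
            "threshold": 1000,
        }],
    }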

Minimize annoying repeat notifications

To protect you from notification fatigue, we’ve implemented re-arm timers, allowing you to specify a minimum time between notifications while the trigger criteria continue to be met.
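
A sketch of how a re-arm timer might be attached to an alert definition, with rearm_seconds as an assumed field name:

    # A sketch of a re-arm timer: once this alert notifies, it stays quiet for
    # at least 30 minutes even if the condition remains true. "rearm_seconds"
    # is an assumed field name.
    noisy_alert = {
        "name": "queue.depth.high",
        "conditions": [{"type": "above", "metric_name": "queue.depth", "threshold": 10000}],
        "rearm_seconds": 1800,
    }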

Better Notifications

Speaking of notifications, we've added a runbook_url parameter to our notifications, so you can associate every alert with a link the recipient can click to get detailed information about the alert. We use this to great effect at Librato by including a runbook.md file in the private git repositories for the services we maintain internally. 

When an alert is triggered, the on-call ops engineer can follow the runbook URL embedded in the notification straight to the GitHub runbook for that service, which contains timely intel from other engineers about recent modifications to the service and/or troubleshooting advice.
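
In payload terms, the runbook link might hang off the alert as an attribute; the attributes and runbook_url names below, and the example repository URL, are assumptions for illustration.

    # A sketch of attaching a runbook link to an alert so it is embedded in
    # every notification. The "attributes"/"runbook_url" names and the example
    # repository URL are illustrative assumptions.
    queue_alert = {
        "name": "billing.queue.stalled",
        "conditions": [{"type": "above", "metric_name": "billing.queue.age", "threshold": 600}],
        "attributes": {
            "runbook_url": "https://github.com/example-org/billing-service/blob/master/runbook.md",
        },
    }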

Our new alerting system also allows you to link alerts with multiple notification services, so you can emit alerts via email to pagers, via web services to your favorite escalation partner, or both at the same time. We've also modified our notification messages to identify the individual sources that actually triggered the alert, so you can quickly narrow your focus to the aberrant sources where the problem resides.
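
A sketch of wiring one alert to several notification services at once; the services field and the numeric IDs are assumptions standing in for services you would have configured separately.

    # A sketch of routing one alert to several notification services at once.
    # The "services" field and the IDs are illustrative assumptions; in practice
    # they would reference notification services you have already configured.
    escalating_alert = {
        "name": "api.errors.spike",
        "conditions": [{"type": "above", "metric_name": "api.errors", "threshold": 50}],
        "services": [101, 102],   # e.g. an email service and an escalation webhook
    }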

Alerts as first-class citizens

Alerts are now a full-blown user-interface entity along with Metrics, Instruments, and Dashboards, and have taken their proper place beside their brethren in the top nav bar.  You can now search for alerts by name and list all alerts via our web interface. 

We remain committed to building all of our services on a foundation of REST APIs, and alerts in our new alerting system are fully API accessible. You can list, create, and modify your alerts directly via our REST API. Like all of our other services, our own alert configuration interface uses the API internally to interact with your alerts.
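
For instance, here is a minimal sketch of listing and creating alerts from Python. The endpoint URL, payload fields, and basic-auth scheme shown are assumptions for illustration; the API documentation remains the authoritative reference.

    # A minimal sketch of listing and creating alerts over the REST API from
    # Python. The endpoint, payload fields, and auth scheme are assumptions
    # for illustration; consult the API documentation for the real contract.
    import requests

    API_URL = "https://metrics-api.librato.com/v1/alerts"    # assumed endpoint
    AUTH = ("you@example.com", "YOUR_API_TOKEN")             # assumed basic-auth credentials

    # List existing alerts.
    resp = requests.get(API_URL, auth=AUTH)
    resp.raise_for_status()
    for alert in resp.json().get("alerts", []):
        print(alert["name"])

    # Create a new alert, reusing the illustrative fields from the sketches above.
    payload = {
        "name": "api.latency.sustained",
        "conditions": [{
            "type": "above",
            "metric_name": "api.request.latency",
            "threshold": 250,
            "duration": 300,
        }],
        "rearm_seconds": 1800,
    }
    resp = requests.post(API_URL, json=payload, auth=AUTH)
    resp.raise_for_status()
    print("created alert id:", resp.json().get("id"))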

Transitioning Legacy Alerts

Your pre-existing alerts will be stored separately from any new alerts you set up after today. Legacy alerts remain functional in the new system, and are still available via the metric page. We’ll be migrating your legacy alerts to the new system in the near future.

Alert names now follow the same system naming conventions as metrics. In the future we intend to auto-generate event data from your alerts and use it to provide a searchable event stream in the UI, as well as other cool stuff, like alerts that generate annotations.

More to come

We think you’ll find our new alerting interface empowering because it lets you reason about alerts in a different way. Instead of configuring alerts in the context of an individual metric, our new alerts are stand-alone, first-class entities that can use any subset of your metrics data as trigger criteria for event notification. They can be made to fire when your data collectors stop reporting, and to require a configurable number of consecutive aberrant data samples before they fire. 

We’re also very excited about the potential extensibility our new alerting platform represents. This overhaul is really just the first step in a long list of improvements we’ll be making to alerting in the near future, so we designed it to be modular and flexible. If you like what you see, or have a feature you’d like to see implemented, we’d appreciate your feedback; let us know.