Collector Highlight Series: Graphios — Librato Blog




Graphios is the fifth installment in our Collector Highlight Series. We started the series with an overview of the patterns employed by the various data collection tools available, and followed it with posts on StatsD, collectd, and Heka.

What Is It?

Graphios is the easiest way to connect Nagios to systems like Librato and Graphite. If you need to emit metric data from Nagios to one of the newer graphing tools like InfluxDB or OpenTSDB, Graphios (via StatsD) is the only option other than coding up something yourself.

Graphios works as a glue layer between Nagios and various time-series databases. It uses the host_perfdata and service_perfdata hooks (defined in your nagios.cfg) to capture metrics in your perfdata log, then formats and sends them to systems like Librato, StatsD, and collectd. Graphios uses a modular backend system to support various metrics storage systems via a series of plugins.

[Figure: Graphios workflow diagram]

How Does it Work?

Although its initial configuration may seem complex, Graphios is conceptually a very simple special-purpose log parser. It runs as a daemon, waking up on a configurable interval, checking for new performance data logs exported by Nagios, and processing them.

Graphios has a modular back-end model which allows it to write to multiple metrics systems. When Graphios finds a new performance data file, it parses metrics out of it, computes appropriate metric names for the enabled back-end plugins, and then emits the metrics to each back-end metrics system as required.
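To make that cycle concrete, here is a minimal sketch of a single wake-up pass. It is not Graphios' actual code: process_spool, parse_file, and the send callables are hypothetical names made up for illustration, and the sketch assumes the rule described later in this post, that a file is deleted only once every back-end has accepted its metrics.

```python
import os
import tempfile


def process_spool(spool_dir, parse_file, backends):
    """Run one wake-up cycle; return the names of the files that were deleted.

    parse_file(path) returns a list of metrics (hypothetical parser).
    backends maps a back-end name to a send(metrics) callable that
    returns True on success (hypothetical plugin interface).
    """
    deleted = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if not os.path.isfile(path):
            continue
        metrics = parse_file(path)
        # Call every back end, even if an earlier one failed.
        results = [send(metrics) for send in backends.values()]
        if all(results):
            os.remove(path)  # every back end accepted the data
            deleted.append(name)
    return deleted


# Demo with a throwaway spool directory and stand-in callables.
spool = tempfile.mkdtemp()
with open(os.path.join(spool, "service-perfdata.1418637947"), "w") as f:
    f.write("placeholder\n")

flaky = {"good": lambda metrics: True, "bad": lambda metrics: False}
print(process_spool(spool, lambda path: ["metric"], flaky))    # file is kept
healthy = {"good": lambda metrics: True}
print(process_spool(spool, lambda path: ["metric"], healthy))  # file is removed
```

A daemon would simply call a function like this in a loop and sleep for the configured interval between passes.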

How Do I Install It?

Graphios is a Python program, so the easiest way to install it is with pip:

pip install graphios

It's also pretty easy to install Graphios manually. First, get the most recent version from git with:

git clone https://github.com/shawn-sterling/graphios.git

Then copy the various files to the appropriate locations:

mkdir -p /etc/graphios
cp graphios*.py /usr/local/bin
cp graphios.cfg /etc/graphios/

 

Configuration Requirements:

To get Graphios up and running, you'll need to manually configure three things:

  • The Nagios config files that deal with host and service checks
  • The nagios.cfg
  • The graphios.cfg file

If you installed with pip, the setup.py script will attempt to detect your nagios.cfg and automatically add a commented-out configuration block to it. The setup script does a pretty good job of creating this configuration on all but the most bespoke Nagios setups. All you should need to do is uncomment the new block and restart Nagios. Given the configuration flexibility of the Nagios Solar-System, however, it's possible that you'll need to intervene and modify the nagios.cfg yourself.

Designing and Configuring a Metric Name Schema

Nagios is a stand-alone monolithic system in that it assumes its check-command output will never be exported; that no system but Nagios will ever need to process it. So Nagios services generally have very simple names like PING or LOAD. Because all of the necessary context is inside the Nagios UI, it should be obvious to the operator what those names refer to.

Graphing systems like Graphite, however, are not monolithic; they're designed to work alongside other monitoring systems and data collectors. They therefore assume that all data is sourced externally (everything comes from some other monitoring system), and as a result they use dot-delimited, hierarchical metric names like Nagios.dc4.dbserver12.LOAD. In Graphite, a name like LOAD doesn't tell the operator anything about what system the metric refers to, much less how it was collected.

To be an effective glue layer, Graphios gives you a simple, transparent means to translate Nagios' simple, monolithic service names into context-rich hierarchical names that can be used by systems like Librato and Graphite. Specifically, Graphios can read metric prefixes and suffixes out of your Nagios service and host definitions using custom attributes. For example, a typical Nagios service description, excluding the minutiae normally packed into upper-level templates, looks like this:

define service{
   use                  generic-service
   host_name            box1,box2,box3
   service_description  SSH
   check_command        check_ssh
}

The output of the check_ssh plugin looks like this:

SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) | time=0.009549s;;;0.000000;10.000000

Everything after the pipe is performance data; these are the metrics Graphios exports. In this case, we have a single metric called “time”, which measures the response time of the ssh port (here, the ssh port responded to the check_ssh plugin in 0.009549 seconds). Graphios automatically prefixes the metric name with the host name, so without doing anything at all, our metric name becomes:

box1.time
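For readers who haven't looked closely at perfdata before, here is a rough sketch of how a line like the one above can be pulled apart. It follows the standard Nagios plugin perfdata layout (label=value[UOM];warn;crit;min;max) and is not Graphios' own parser; PERF_RE, parse_perfdata, and split_check_output are illustrative names, and the pattern only handles plain decimal values.

```python
import re

# label=value[UOM];[warn];[crit];[min];[max], one token per metric.
PERF_RE = re.compile(
    r"(?P<label>'[^']+'|[^\s=]+)="
    r"(?P<value>-?\d+(?:\.\d+)?)"
    r"(?P<uom>[^;\s]*)"
    r"(?:;(?P<warn>[^;\s]*))?"
    r"(?:;(?P<crit>[^;\s]*))?"
    r"(?:;(?P<min>[^;\s]*))?"
    r"(?:;(?P<max>[^;\s]*))?"
)


def parse_perfdata(text):
    """Return one dict per metric found in a perfdata string."""
    metrics = []
    for match in PERF_RE.finditer(text):
        fields = match.groupdict()
        fields["label"] = fields["label"].strip("'")  # labels may be quoted
        fields["value"] = float(fields["value"])
        metrics.append(fields)
    return metrics


def split_check_output(output):
    """Split plugin output at the first pipe into (status text, metrics)."""
    status, _, perf = output.partition("|")
    return status.strip(), parse_perfdata(perf)


line = ("SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) | "
        "time=0.009549s;;;0.000000;10.000000")
status, metrics = split_check_output(line)
print(status)
print(metrics)
```

Thresholds that are absent from the plugin output come back as empty strings or None, so a caller would need to check for both before using them.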

As we’ve already observed above, box1.time isn't a particularly meaningful metric name, so we can tell Graphios to put some additional context in front of this metric name by inserting a _graphiteprefix custom attribute into the service definition like so:

define service{
   use                     generic-service
   host_name               box1,box2,box3
   service_description     SSH
   check_command           check_ssh
   _graphiteprefix         nagios.dc1
}

Graphios will now prepend this prefix to the metric name, making it:

nagios.dc1.box1.time

This is a little bit better, but we can insert some additional context about the service between the host name and the metric name using a _graphitepostfix custom attribute in our service configuration like so:

define service{
   use                  generic-service
   host_name            box1,box2,box3
   service_description  SSH
   check_command        check_ssh
   _graphiteprefix      nagios.dc1
   _graphitepostfix     sshd.rt
}

Graphios will now insert sshd.rt between the host and metric name, making it:

nagios.dc1.box1.sshd.rt.time

Now we have a pretty decent metric name for use with systems like Graphite and StatsD.
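The naming rule walked through above boils down to joining four pieces with dots. The sketch below reproduces the three names from this section; build_metric_name is a made-up helper for illustration, not a function from Graphios, and it assumes that empty pieces are simply skipped.

```python
def build_metric_name(hostname, label, prefix="", postfix=""):
    """Join prefix, host, postfix and perfdata label with dots.

    Empty pieces are skipped, so a service without custom attributes
    falls back to "<host>.<label>".
    """
    parts = [prefix, hostname, postfix, label]
    return ".".join(part for part in parts if part)


print(build_metric_name("box1", "time"))
print(build_metric_name("box1", "time", prefix="nagios.dc1"))
print(build_metric_name("box1", "time", prefix="nagios.dc1", postfix="sshd.rt"))
```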

Librato and the Source Dimension Exception

In Librato, metrics are first-class citizens, a very nice feature that enables the use of dynamic dashboards and templates. Graphios supports Librato's source dimension by submitting host names like box1 to Librato as the source, rather than hard-coding them into the metric name. Continuing with our example, the metric name submitted to Librato becomes:

nagios.dc1.sshd.rt.time

Any graph displaying this metric would then automatically depict one line for each source (three lines total, one each for box1, box2, and box3).
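Here is a small sketch of what that difference looks like in data terms. The librato_gauge helper is hypothetical, and the field names (name, source, value, measure_time) are an assumption about the shape of a Librato gauge measurement rather than a copy of what the Graphios plugin sends; the point is only that the host moves out of the name and into its own field.

```python
def librato_gauge(hostname, label, value, timestamp, prefix="", postfix=""):
    """Build one measurement with the host as the source, not part of the name.

    The dict keys are an assumed payload shape, used here for illustration.
    """
    name = ".".join(part for part in (prefix, postfix, label) if part)
    return {
        "name": name,
        "source": hostname,
        "value": value,
        "measure_time": timestamp,
    }


gauges = [
    librato_gauge(host, "time", 0.009549, 1418637938,
                  prefix="nagios.dc1", postfix="sshd.rt")
    for host in ("box1", "box2", "box3")
]
for gauge in gauges:
    print(gauge["name"], gauge["source"])
```

All three measurements share one metric name and differ only in their source, which is why a single graph can draw one line per host.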

Configuring the Nagios Perfdata Hooks

Next, we need to configure Nagios to export performance data to a log file in a format that Graphios can understand.  If you installed Graphios using pip, check the bottom of your nagios.cfg file for a block of configuration that begins with:

# ###### AUTO-GENERATED GRAPHIOS CONFIGS

If you aren't already using Nagios perfdata hooks for something else (i.e., if your currently running Nagios configuration contains process_performance_data=0), you can simply uncomment this configuration block and restart Nagios. If you're already using Nagios perfdata hooks for something like PNP4Nagios, or one of the other RRDTool-based graphing systems, chances are you can safely run both Graphios and your current toolset at the same time. Refer to the Graphios documentation for instructions on how to set this up.

If you didn't install with pip, or if you don't see a block of Graphios config at the bottom of your nagios.cfg, then you'll need to manually enable performance data, and configure it for Graphios.  Refer to the Graphios documentation for instructions on how to set this up.

Once you've configured Nagios to emit performance data, restart the Nagios daemon and verify that it's writing a log file to the Graphios spool directory (named by the service_perfdata_file attribute in your nagios.cfg) with a name like service-perfdata.1418637947. The file should contain lines that look like this:

DATATYPE::SERVICEPERFDATA
TIMET::1418637938
HOSTNAME::box1
SERVICEDESC::SSH
SERVICEPERFDATA::time=0.066863s;;;0.000000;10.000000
SERVICECHECKCOMMAND::check_ssh
HOSTSTATE::U
HOSTSTATETYPE::HARD
SERVICESTATE::OK
SERVICESTATETYPE::HARD
GRAPHITEPREFIX::nagios.dc1
GRAPHITEPOSTFIX::sshd.rt
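If you want to sanity-check a perfdata log by hand, a few lines of Python are enough to turn a record into a dictionary. This sketch assumes each record is a single line of tab-separated KEY::value fields, which depends on the perfdata template in your nagios.cfg; parse_perfdata_record is an illustrative name, not part of Graphios.

```python
def parse_perfdata_record(line):
    """Split one record of tab-separated KEY::value fields into a dict."""
    record = {}
    for field in line.rstrip("\n").split("\t"):
        key, sep, value = field.partition("::")
        if sep:  # skip anything that is not KEY::value
            record[key.strip()] = value.strip()
    return record


sample = "\t".join([
    "DATATYPE::SERVICEPERFDATA",
    "TIMET::1418637938",
    "HOSTNAME::box1",
    "SERVICEDESC::SSH",
    "SERVICEPERFDATA::time=0.066863s;;;0.000000;10.000000",
    "GRAPHITEPREFIX::nagios.dc1",
])
record = parse_perfdata_record(sample)
print(record["HOSTNAME"], record["SERVICEPERFDATA"])
```

If HOSTNAME or SERVICEPERFDATA is missing from the result, the template in nagios.cfg is the first place to look.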

Configuring Graphios  

Graphios installs its config file in /etc/graphios/graphios.cfg by default. This file is very well commented and, by and large, self-explanatory. There is a global configuration section, and one section for each backend plugin that Graphios can write to. Plugins are generally enabled by setting their enable line to True and configuring the required attributes for the plugin. Here, for example, is a working configuration for Librato:

enable_librato=True
librato_email = dave@librato.com
librato_token = ecb79ff8a82areallylonggreatbigstringtokenything6b8cb77e8b5bb98934cf7
librato_whitelist=["load","rta","swap"]

The whitelist attribute bears mentioning since, without it, Graphios emits all performance data measured by Nagios to Librato, which could become expensive. As configured above, only metrics matching the regular expressions "load", "rta", and "swap" will be emitted to Librato. Here's a working configuration for StatsD:

enable_statsd=True
statsd_servers = 192.168.1.87:8125
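Coming back to the Librato whitelist for a moment, it can help to test your patterns against real metric names before turning the plugin on. The sketch below assumes a metric passes if any pattern matches anywhere in its name; that is an assumption about the matching semantics, not a statement of how Graphios implements it, and filter_whitelist and the sample names are made up for illustration.

```python
import re


def filter_whitelist(metric_names, whitelist):
    """Keep names that match at least one whitelist regular expression."""
    patterns = [re.compile(pattern) for pattern in whitelist]
    return [
        name for name in metric_names
        if any(pattern.search(name) for pattern in patterns)
    ]


names = [
    "nagios.dc1.load.load1",
    "nagios.dc1.ping.rta",
    "nagios.dc1.sshd.rt.time",
    "nagios.dc1.swap.swap",
]
kept = filter_whitelist(names, ["load", "rta", "swap"])
print(kept)
```

Note that under this assumption the SSH metric from earlier would not get through, because "sshd.rt.time" contains "rt" but not "rta"; a pattern such as "sshd" would have to be added to the whitelist.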

You may enable multiple backend plugins (Librato AND StatsD), and even multiple comma-separated instances of the same backend plugin (four different StatsD servers, and a carbon server), and Graphios will happily parse out and emit your Nagios metrics to each back-end system. At this point, you can run Graphios from the command line and see if everything works as expected:

graphios.py --verbose

Daemonizing

Graphios ships with init scripts for Debian and RPM-based systems. These were installed automatically if you installed with pip on a compatible system. Check the Graphios documentation for more information on starting and running Graphios.
 

Hints, Tradeoffs, and Gotchas

If Graphios successfully sends all of the metrics in a file to every back-end, it deletes the log file. Otherwise, it keeps the performance data log and tries again the next time it wakes up. This is generally robust behavior operationally, but it’s possible to end up with performance-data logs that are never deleted and that bog down Graphios by forcing it to re-process the same files over and over.

You can safely delete stale files (here, anything last modified more than 60 minutes ago) from the spool directory manually using, for example,

cd /var/nagios/spool/graphios && find . -type f -mmin +60 -delete

...or by using the nerf attribute in the graphios.cfg to label certain back-ends as non-critical.

For example, if you set nerf_statsd=true in the graphios.cfg, and Graphios fails to send some data to StatsD, Graphios will delete the performance data file anyway, and that data will simply never be sent. This is often an acceptable trade-off for performance metrics sourced from Nagios, which are usually collected on the order of minutes anyway.
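The delete-or-retry decision, including the nerf exception, can be summarized in a couple of lines. This is a sketch of the rule as described in this section, not Graphios' implementation; should_delete and its arguments are hypothetical.

```python
def should_delete(results, nerfed):
    """Decide whether a perfdata file can be removed after a send attempt.

    results maps a back-end name to True (sent) or False (failed).
    nerfed is the set of back ends marked non-critical; their failures
    do not keep the file around.
    """
    return all(ok or name in nerfed for name, ok in results.items())


attempt = {"librato": True, "statsd": False}
print(should_delete(attempt, nerfed=set()))       # StatsD failure blocks deletion
print(should_delete(attempt, nerfed={"statsd"}))  # StatsD is nerfed, so delete
```

A failure in a back-end that is not nerfed still keeps the file for the next pass, so nerfing StatsD does nothing to protect you from a Librato outage.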
 

Get Started!

If you're running Nagios today, and you're still trapped in the RRDTool era, you owe it to yourself to install Graphios and experience the future of scalable metrics analysis systems like Graphite, InfluxDB, and OpenTSDB. Better yet, set up a free account with Librato, and let us unlock the potential buried in your Nagios performance data.
