Cronotations — Librato Blog

Cronotations



Annotations for Aperiodic Events

Our annotations feature is well suited to documenting what we call 'aperiodic' events: things that happen every so often, at irregular intervals. They aren't metrics per se, because we don’t measure them like time-series values. With our annotations API, we can track these events, along with their duration, by representing them as vertical lines in a chart.


Individual annotation events are organized into streams. You can add an annotation stream to any chart in the same way you add metrics in our UI, and all of the individual events associated with that stream will be overlaid on the chart.

One example of an aperiodic event is an alert. In fact, if you have any alerts configured on your account, Librato automatically provides you a librato.alerts annotation stream, and updates it every time one of your alerts fires.

Another example of an aperiodic event is a cron job. Cron jobs make excellent targets for annotations because they can adversely affect the performance of systems and applications in unexpected ways. After we had a problem with an fstrim cron job that caused some troubling latency in our Cassandra rings, I began to experiment with an easy way to automatically annotate the cron jobs running on our systems.

Finding a hook

We use Ubuntu LTS in production, which relies on the venerable Vixie-Cron daemon for cron jobs. The hard part is getting in front of the cron jobs that are added automatically by the system. That is, sometimes cron jobs are installed by the package manager, and in that case, we don’t necessarily know they were added. So we want a means to capture these sorts of jobs and send annotations about them to the Librato API whenever they run, without any sort of manual intervention on our part.

The cron jobs I’m talking about (the ones automatically added by the various package managers) are found in /etc/cron.d, /etc/cron.daily, /etc/cron.weekly, and similar directories. They’re executed by a program called *run-parts*, which cron relies on to manage execution. So I decided to experiment with wrapping the system's run-parts binary with this shell script.

#!/bin/bash
RP='/bin/run-parts-orig' #where's the original run-parts?
SHELLBRATO='/opt/shellbrato/shellbrato.sh' #where's shellbrato?
LBCREDS='/home/dave/librato_creds' #where are the librato creds?
#what invoked us? (tail end of the parent's command line)
PARENT=$(ps -ocommand= -p "${PPID}" | awk -F/ '{print $NF}' | awk '{print $1}')
START=$(date +%s)
OUT=$("${RP}" "$@") #run the real run-parts, passing our arguments through intact
EXIT=${?}
END=$(date +%s)
if echo "${PARENT}" | grep -qi 'cron' #make sure cron is our parent process
then
  if [ -f "${LBCREDS}" ] #make sure we have creds
  then
    source "${LBCREDS}"
    if [ -f "${SHELLBRATO}" ] #make sure we have shellbrato
    then
      source "${SHELLBRATO}"
      TITLE=$(echo "${OUT}" | head -n1)
      O=$(sendAnnotation "Cron-Runs||${TITLE}||${START}||${END}") #ok send the annotation
    fi
  fi
fi
#if any of that^ fails, fall through to working like run-parts would anyway
[ -z "${OUT}" ] || echo "${OUT}" #stay silent when the jobs were silent
exit ${EXIT}

The general strategy is for this shell script to pretend to be /bin/run-parts. It’ll use our Librato shell client library, Shellbrato, to emit annotations about the various cron jobs it’s tasked to run.

Replacing run-parts without replacing run-parts

Run-parts is a general-purpose tool, and cron isn’t necessarily the only thing using it. Other system components (like Xorg, to run /etc/X11/Xsession) use it as well. We don’t want to annotate things that aren’t cron jobs, so our annotation should only fire when cron is our parent.

PARENT=$(ps -ocommand= -p $PPID | awk -F/ '{print $NF}' | awk '{print $1}')

We ensure this by capturing the parent process ID with bash's *PPID* variable, and checking it against the process table to make sure we’re being invoked by cron. The PPID variable is a bash-ism, which is the main reason my *sha-bang* specifies /bin/bash instead of /bin/sh.
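
To make the parent check easier to poke at outside of cron, it can be pulled apart into small functions. This is only a sketch: the names command_tail, looks_like_cron, and parent_is_cron are hypothetical helpers of mine, not part of the wrapper or of shellbrato. The two awk steps are kept in the same order as in the wrapper.

```shell
#!/bin/bash
# Sketch only: command_tail, looks_like_cron and parent_is_cron are
# hypothetical helper names, not part of the wrapper or of shellbrato.

command_tail() {
  # Same two awk steps as the wrapper: take the text after the last '/',
  # then keep the first word of what remains.
  printf '%s\n' "$1" | awk -F/ '{print $NF}' | awk '{print $1}'
}

looks_like_cron() {
  # Case-insensitive substring match, as in the wrapper's grep.
  printf '%s\n' "$1" | grep -qi 'cron'
}

parent_is_cron() {
  # Look up our parent's command line in the process table.
  looks_like_cron "$(command_tail "$(ps -ocommand= -p "${PPID}")")"
}
```

With these loaded, parent_is_cron returns success only when the tail of the parent's command line contains "cron", so running it from an interactive shell is a quick way to confirm that the check does not fire there.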

Credentials

It’s always advisable to avoid hard-coding credentials into shell scripts. Many shops (ours included) use a config-synchronization service like Hiera, ZK, or Augeas to store sensitive environmental information like credentials. For the purpose of this article, I've simulated this by placing our Librato credentials in a separate file and sourcing the file if it exists.
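
As a rough illustration of the "source it if it exists" idea, here is a sketch of a loader that also checks the file actually set something. The variable names LBUSER and LBTOKEN are my assumption about what the client library expects (check the shellbrato README for the real names), and load_creds is a hypothetical helper.

```shell
#!/bin/bash
# Sketch only. A credentials file is assumed to contain plain assignments:
#   LBUSER=ops@example.com
#   LBTOKEN=abc123
# LBUSER and LBTOKEN are assumed names; load_creds is a hypothetical helper.

load_creds() {
  local file="$1"
  # Fall through quietly if the file is missing or unreadable.
  [ -f "${file}" ] && [ -r "${file}" ] || return 1
  # '.' is the portable spelling of bash's 'source'.
  . "${file}"
  # Succeed only if both values ended up non-empty.
  [ -n "${LBUSER:-}" ] && [ -n "${LBTOKEN:-}" ]
}
```

Something like load_creds /home/dave/librato_creds || exit 0 would then skip the annotation step without disturbing the job itself.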

sendAnnotation

The *sendAnnotation* function requires 3 arguments, but accepts 4. The arguments are passed as a double-pipe-separated string, which is a standard syntax for the shellbrato library. The first argument is the name of an annotation stream. If you provide a name that doesn't exist, the stream will automatically be created for you by our API.

The second argument is the title. The script automatically sets this to the first line of output returned to STDOUT by the job. Since Cron executes run-parts with the --report switch, the title will be set to the name of the script being executed by cron.

The third argument is a start-time in epoch seconds, which is automatically captured by our run-parts wrapper before the job is passed off to the real run-parts for execution.

The fourth, optional argument is an end-time in epoch seconds, which is automatically captured by our run-parts wrapper after the job has been executed by the real run-parts binary.

If you just pass a start-time to the *sendAnnotation* function, you'll get a single vertical line in your annotation stream like the one pictured below.

If, however, you pass both a start and end time, you'll get a nice filled-in area annotation like this one.

Our run-parts wrapper always passes both a start and an end time, but if the cron job starts and finishes within the same second, these two values will be the same, and you’ll wind up with a single line. Otherwise, you’ll get a duration that represents the length of time it took a particular cron job to run.
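
The double-pipe argument string described above is easy to build by hand, but a tiny helper keeps the three-versus-four argument rule in one place. This is a sketch; annotation_args is a hypothetical name of mine, and only the string format comes from the wrapper script.

```shell
#!/bin/bash
# Sketch only: annotation_args is a hypothetical helper that assembles the
# double-pipe-separated string the wrapper passes to sendAnnotation.

annotation_args() {
  # Usage: annotation_args STREAM TITLE START [END]
  [ "$#" -ge 3 ] || return 1
  if [ -n "${4:-}" ]; then
    # Start and end: drawn as a filled-in area when the two differ.
    printf '%s||%s||%s||%s\n' "$1" "$2" "$3" "$4"
  else
    # Start only: drawn as a single vertical line.
    printf '%s||%s||%s\n' "$1" "$2" "$3"
  fi
}
```

After sourcing shellbrato, a call would look something like sendAnnotation "$(annotation_args Cron-Runs "${TITLE}" "${START}" "${END}")".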

Getting things going

Installing shellbrato is as easy as a git clone.

cd /opt/ && git clone https://github.com/djosephsen/shellbrato.git

With shellbrato in place, and our run-parts script copied over, we can stop cron.

service cron stop

Then we rename the system-provided run-parts binary, and link in our replacement.

mv /bin/run-parts /bin/run-parts-orig && ln -s /bin/run-parts-alt /bin/run-parts
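
If you'd rather script the swap (and be able to undo it), something like the following could work. It's a sketch under a few assumptions: install_wrapper and uninstall_wrapper are hypothetical names, and the directory is a parameter so the functions can be tried in a scratch directory before pointing them at /bin.

```shell
#!/bin/bash
# Sketch only: install_wrapper and uninstall_wrapper are hypothetical helpers.
# The bin directory is a parameter so this can be rehearsed in a scratch dir.

install_wrapper() {
  local bin_dir="$1" wrapper="$2"
  # Refuse to run twice: a second run would clobber the saved original.
  [ ! -e "${bin_dir}/run-parts-orig" ] || return 1
  mv "${bin_dir}/run-parts" "${bin_dir}/run-parts-orig" &&
    ln -s "${wrapper}" "${bin_dir}/run-parts"
}

uninstall_wrapper() {
  local bin_dir="$1"
  # Only undo when run-parts is our symlink and the original is still there.
  [ -L "${bin_dir}/run-parts" ] && [ -e "${bin_dir}/run-parts-orig" ] || return 1
  rm "${bin_dir}/run-parts" &&
    mv "${bin_dir}/run-parts-orig" "${bin_dir}/run-parts"
}
```

On a real system this would be called as root with cron stopped, for example install_wrapper /bin /bin/run-parts-alt.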

Now we can exec run-parts manually to test things out (make sure you don’t call it on a real cron folder unless you actually want to execute those scripts outside of their normal schedule!).

run-parts --report /etc/cron.testerly
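
A throwaway directory like /etc/cron.testerly needs at least one job in it. Here's a sketch of how I'd populate one; make_test_cron_dir and the hello-job script are hypothetical names. The job prints something so that --report has a script name to show, and it sleeps for two seconds so the start and end times land in different seconds.

```shell
#!/bin/bash
# Sketch only: make_test_cron_dir and hello-job are hypothetical names.

make_test_cron_dir() {
  local dir="$1"
  mkdir -p "${dir}" || return 1
  # Keep the file name to letters, digits, '_' and '-' so run-parts picks it up.
  cat > "${dir}/hello-job" <<'EOF'
#!/bin/sh
# Produce output so run-parts --report prints this script's name,
# then sleep so the annotation gets a visible duration.
echo "hello from a test job"
sleep 2
EOF
  chmod 755 "${dir}/hello-job"
}
```

After make_test_cron_dir /etc/cron.testerly (as root), the run-parts --report command above has something harmless to execute.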

And then invoke Cron manually to make extra sure.

/usr/sbin/cron -f -L 8

Once we're confident everything looks ok, we can re-launch cron as normal.

service cron start

I like this approach because it documents the execution times and durations of system jobs scheduled in cron with a minimum of effort and (re)configuration. It's also easily installed and automated in whatever configuration management system you're currently using, and it encourages the use of the /etc/cron.d directory site-wide, which is, in my opinion, a much nicer interface than individual user crontabs.

Finally, I think you'll find the ability to easily correlate your system cron jobs against metrics from any layer in the stack very helpful for troubleshooting, even if only to eliminate things like backups and batch jobs as possible culprits. Read up on our annotations feature, and get a handle on your aperiodic events today.
