Over 64,000 neighborhoods across the U.S. use Nextdoor as their private social network, helping them with everything from finding a babysitter to organizing a neighborhood watch group. Online safety is critical for such a network, and supporting a large user base that must pass multiple identity verifications takes complex engineering. We spoke to Greg James, engineering lead at Nextdoor, to hear how he and his teams meet Nextdoor’s unique engineering challenges.
Tell us about your environment
We are very agile here: we build more and more services all the time, for the most part in Go, with a wide variety of third-party integrations. About half of Nextdoor’s team are engineers, and I run a number of teams on various projects in various domains. We work on externally and internally facing developer and system tools, some of the main features of the website, etc. To give you an example, one of my teams just launched a feature that lets users run polls on the website.
How did you start using Librato?
When I joined, Nextdoor was already successfully using Librato. We send you as many metrics as we can through different systems. We use collectd and StatsD to send key metrics, for example the number of page views and latency, as well as metrics from third-party integrations.
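As an illustration of what sending a metric through StatsD involves, here is a minimal sketch of the plain-text format StatsD accepts over UDP (`<name>:<value>|<type>`, where `c` is a counter and `ms` is a timer). The metric names, host, and port are assumptions chosen for the example; the interview does not describe Nextdoor's actual metric names or setup.

```python
import socket


def format_metric(name, value, metric_type):
    """Build one StatsD datagram in the form <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; 8125 is the conventional StatsD port (assumed here)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, for illustration only.
page_view = format_metric("web.page_views", 1, "c")          # counter increment
latency = format_metric("web.request_latency", 182, "ms")    # timer, in milliseconds

if __name__ == "__main__":
    send_metric(page_view)
    send_metric(latency)
```

Because the transport is UDP, the application does not wait for a reply, which is one reason this style of instrumentation is cheap to add to request-handling code.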
Soon after joining the company, I realized that our database metrics were not where they should be. I was immediately able to use Librato to fix the problem. My team and I brought those database metrics down by reviewing the Librato database dashboard every day.
How has Librato helped you with monitoring?
I have worked with many monitoring and graphing systems before, so I know the bar is very high. I typically measure a monitoring tool against Ganglia’s better features, and Librato really wins here. In fact, in our current suite of tools, Librato is the only one that lets us change a graph's minimum and maximum to get the granularity we are looking for. If there is an event with a large spike in the data, most graphing systems I have worked with will rescale the graph so that the spike fits. This effectively makes all of the other data worthless because the rest of the line becomes flat. Librato simply gives you better visibility.
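The effect of a spike on an autoscaled graph can be shown with a few lines of arithmetic. This is a sketch with made-up latency samples, not a description of how Librato or any other tool renders charts: each value is mapped to a position between 0 and 1 on the y-axis, first with the maximum set to the largest sample and then with a fixed maximum of 200.

```python
def to_chart_fraction(values, y_min, y_max):
    """Map each value to a 0..1 position on the y-axis, clipping anything outside the range."""
    span = y_max - y_min
    return [min(max((v - y_min) / span, 0.0), 1.0) for v in values]


# Made-up latency samples (milliseconds) with a single large spike.
latency_ms = [100, 120, 110, 5000, 105, 115]

# Autoscaled: the axis stretches to fit the spike.
auto = to_chart_fraction(latency_ms, 0, max(latency_ms))

# Fixed maximum: the spike is clipped at the top of the chart.
fixed = to_chart_fraction(latency_ms, 0, 200)

normal_auto = [f for v, f in zip(latency_ms, auto) if v < 1000]
normal_fixed = [f for v, f in zip(latency_ms, fixed) if v < 1000]

if __name__ == "__main__":
    print("autoscaled spread:", max(normal_auto) - min(normal_auto))
    print("fixed-axis spread:", max(normal_fixed) - min(normal_fixed))
```

With the autoscaled axis, the normal samples all land within the bottom few percent of the chart, so they look like a flat line. With the fixed maximum they spread across roughly a tenth of the chart height, and the spike still shows up as a point pinned to the top.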
Librato is perfect for my investigative and sharing use case: my team and I always start with the dashboard, looking for a spike or a valley that needs investigation, then drill in to find the root cause of the problem. Every single time we do a release of any of our layers, we open up our standard Librato dashboards. They give us the overall health of the systems being released to, as well as the general performance of key architectural components.
What are some of your favorite features?
Librato’s timeshift is the easiest way to understand the importance of viewing a metric day over day, week over week. I look at latency, and I look at database metrics – I actually embedded both of these things into a single dashboard view, because it is so informative.
I also love composite metrics. They are an advanced tool that provides so much power through mathematical transformations. Last but not least, it is a huge help that Librato contacts us when our metrics blow up from a standard to an abnormal number; that saves us a lot of time and energy. In fact, Librato's excellent absent and threshold alerting has saved our engineering team many times.
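For readers unfamiliar with the two alert types mentioned here, the following is a minimal sketch of the general idea, assuming a simple list of timestamped samples. A threshold alert fires when the latest value crosses a limit, and an absent alert fires when no data has arrived for some time. The function name, parameters, and sample values are hypothetical and do not represent Librato's implementation or API.

```python
def check_alert(samples, now, threshold, absent_after):
    """Classify a metric stream as "absent", "threshold", or "ok".

    samples: list of (unix_timestamp, value) tuples, oldest first.
    now: current unix timestamp.
    threshold: fire when the latest value is above this number.
    absent_after: fire when the latest sample is older than this many seconds.
    """
    if not samples or now - samples[-1][0] > absent_after:
        return "absent"      # the metric has stopped reporting
    if samples[-1][1] > threshold:
        return "threshold"   # the metric is reporting an abnormal value
    return "ok"


# Made-up samples: (timestamp, query latency in ms).
healthy = [(1000, 40), (1060, 42)]
spiking = [(1000, 40), (1060, 900)]

if __name__ == "__main__":
    print(check_alert(healthy, now=1070, threshold=500, absent_after=300))
    print(check_alert(spiking, now=1070, threshold=500, absent_after=300))
    print(check_alert(healthy, now=2000, threshold=500, absent_after=300))
```

Checking for absence first matters: a metric that has gone silent would otherwise be judged on a stale value and could be reported as healthy.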
If you were to give a word of advice on monitoring...
You don’t have to do it all! Implementing our own monitoring system would have been far more expensive in terms of hardware and manpower than using Librato. Think scale, too – a solution like Librato scales really well...and we get truly superior support from the monitoring pros.
To learn how Librato can help your engineering team, sign up for a free trial today.