Update June 13, 2011: Canonical has since released a new Ubuntu 10.04 LTS AMI and AKI that fix this problem. The new kernel version is '2.6.32-316-ec2' and we have not been able to reproduce the original problem with this kernel. We recommend everyone update their provisioning scripts to use this release.
This post highlights a reported bug in the Ubuntu 10.04 LTS Xen kernel that can lead to process lockups when running on instances backed by Intel Nehalem CPUs. The popular Amazon EC2 Xen cloud deploys many of these Nehalem CPUs. We don’t regularly devote entire blog posts to a single bug, but we felt it was prudent to draw attention to this little-known issue after observing its effects first-hand. We have also noticed that a number of Silverline customers are impacted by this but are seemingly unaware of the issue. Other tech companies have also attributed problems in their own backend systems to this bug.
This post walks through how we noticed anomalies in our customers’ Librato Silverline application monitoring graphs and were finally able to identify the root cause. We describe the bug in question, how it can affect application workloads and monitoring solutions, and the available workarounds. We also visualize the side effects in our time-series graphs.
At Librato we are devoted to providing the best application monitoring and management solution possible. To ensure the best quality product we periodically verify that the monitoring data our customers are receiving is valid. Recently we have noticed that for particular customers their application CPU load values are clearly bogus -- either they are astronomically large (equivalent to consuming about a million cores simultaneously) or are similarly large negative values. We tried numerous times to reproduce the exact issue on our own servers, but did not have any luck.
We had not made much headway on this issue until the other week when we were performing a regularly scheduled update of our Silverline metrics ingress tier. We prefer to eat our own dog food whenever possible, so the backend servers that collect all Silverline monitoring data are in fact Silverlined themselves. These backend servers run on m1.large instances in EC2. The following image shows a normal Silverline CPU utilization graph for this tier.
When we spun up a new server in this tier the other week we noticed almost immediately that we had the same giant spikes in our CPU loads as many of our customers were seeing. In the following graph you can see the CPU utilization spikes in the ‘unassigned’ application tag. You can’t tell on this graph, but the value at those green vertical stripes was in the billions of CPU percentage -- clearly bogus for a mere two-core machine.
We also noticed that the CPU times output by `ps` were completely bogus for several processes on the box. The CPU usage percentage is in the 3rd column and the accrued CPU time is in the 10th.
root 4 338661012 0.0 0 0 ? S 22:20 17179869:11 [ksoftirqd/0]
root 195 338661012 0.0 0 0 ? S 22:20 17179869:11 [kjournald]
ubuntu 862 188130172 0.0 79100 1644 ? S 22:24 17179869:11 sshd: ubuntu@pts/0
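A quick way to spot rows like these is to flag any process whose reported CPU percentage is negative or exceeds 100% times the number of cores. The following is a minimal Python sketch of that heuristic, not part of Silverline. It assumes the input lines use the same column layout as the output above, and the function name `implausible_cpu_rows` is our own invention for illustration.

```python
def implausible_cpu_rows(ps_lines, ncpus):
    """Return (pid, pcpu, command) for rows whose %CPU cannot be real.

    Assumes the column layout shown above:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    limit = 100.0 * ncpus  # a process cannot use more than every core at once
    flagged = []
    for line in ps_lines:
        fields = line.split(None, 10)  # keep the command (11th field) intact
        if len(fields) < 11:
            continue  # not a full process row
        try:
            pcpu = float(fields[2])
        except ValueError:
            continue  # header row or unexpected format
        if pcpu < 0 or pcpu > limit:
            flagged.append((int(fields[1]), pcpu, fields[10]))
    return flagged
```

On a two-core machine, all three rows shown above would be flagged, while a process reporting 150% would not.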
During conversations with other tech companies we learned of an issue when running the Ubuntu 10.04 LTS release on certain Amazon EC2 servers -- the same environment as our backend servers. The issue appeared to be triggered when launching the Ubuntu 10.04 LTS release on hypervisors running on Intel Xeon Series 55xx (Nehalem) CPUs. For example, some Cassandra users were reporting that nodes would completely freeze up for extended periods of time. We found that we only saw the large CPU spikes in our backend system CPU graphs when we had launched an E5507-backed instance.
The problem appears to be a bug in the 2.6.32 kernel used in Ubuntu 10.04. The bug report states that the Time Stamp Counter (TSC) is unstable and hence unreliable when the Ubuntu 10.04 2.6.32-312 kernel is booted on EC2 instances backed by the Intel Nehalem CPU. On the other hand, instances backed by the Intel E5430 CPU were fine. The problem, as pointed out in the bug report, is likely due to a typo introduced during the backport of a Xen patch set. The typo causes the OS to consider the TSC continuous in time, which would be correct if it were running on a physical Nehalem CPU, but not when running within a VM on a shared host. The observed result is the large inaccuracies we were seeing in the accrued CPU accounting times for processes. These inaccuracies obviously show up in our Silverline CPU monitoring, but they can also lead to processes locking up for long durations as the OS scheduler tries to enforce fair-share scheduling policies.
We can visualize this problem by plotting the CPU usage over time. We wrote a simple test program that outputs its accrued CPU time in microseconds every two seconds. We ran two simultaneous instances of this test on EC2 m1.large servers -- one instance per CPU -- and plotted the CPU accounting times for each process.
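The original test program is not reproduced in this post. A test along the same lines could be sketched in Python as follows, assuming Python 3.7 or later for `time.process_time_ns`. The function names are hypothetical, and pinning each copy to its own CPU is left out.

```python
import time


def cpu_time_us():
    """Return this process's accrued CPU time (user + system) in microseconds."""
    return time.process_time_ns() // 1000


def burn_and_sample(samples, interval_s=2.0):
    """Spin the CPU and yield the accrued CPU time once per interval."""
    for _ in range(samples):
        deadline = time.monotonic() + interval_s
        while time.monotonic() < deadline:
            pass  # busy loop, so CPU time should grow roughly linearly
        yield cpu_time_us()


def main(samples=30):
    """Print one CPU-time reading per line, ready for plotting."""
    for value in burn_and_sample(samples):
        print(value, flush=True)
```

Calling `main()` in two separate processes at the same time and plotting each output against the sample index gives one line per process. On a healthy kernel the two lines should climb at a similar, steady rate.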
This graph shows the results of running this test on Ubuntu 10.04 on an EC2 instance backed by the Intel E5430 CPU. Notice the CPU times for each process increase linearly and fairly consistently since each process is performing the same workload.
The following graph illustrates the same test, but this time it’s running on an instance backed by the Intel E5507 CPU. Notice that the process labeled “cpu1” has accrued 520 msecs of CPU time at the point of measurement -- about what we would expect. On the other hand, the process labeled “cpu0” shows about 18.4 billion seconds of accrued CPU time. This clearly bogus value illustrates the miscalculation of CPU times that can occur with the Ubuntu 10.04 kernel on Intel Nehalem instances.
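One observation about the 18.4 billion second figure: it is very close to 2^64 nanoseconds expressed in seconds. That is what we would expect to see if a small negative time delta were stored in an unsigned 64-bit nanosecond counter. This is our own inference and is not taken from the bug report. The sketch below only demonstrates the arithmetic, and the assumption of a 64-bit nanosecond counter is ours.

```python
NS_PER_SECOND = 10**9
U64_MODULUS = 2**64  # assumption: CPU time is kept in an unsigned 64-bit counter


def wrapped_seconds(delta_ns):
    """Reinterpret a signed nanosecond delta as unsigned 64-bit, in seconds."""
    return (delta_ns % U64_MODULUS) / NS_PER_SECOND


# A normal reading such as 520 msecs is unchanged by the reinterpretation.
print(wrapped_seconds(520_000_000))

# A delta of -1 ns wraps to roughly 1.84e10 seconds, about 18.4 billion.
print(wrapped_seconds(-1))
```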
Given that Ubuntu is a popular cloud distribution with easily available AMIs, many EC2 users have chosen to use Ubuntu as their cloud OS. Similarly, since 10.04 is the most recent LTS (Long Term Support) release, it is a popular version for cloud users looking for both a recent OS release and a longer supported release. We can’t tell which OS distribution our Silverline customers are using or which particular CPU they are running on. However, we can tell that within the last week, at least ten servers and over fifteen applications appear to be reporting similar bogus CPU utilization numbers. It’s unclear to us whether any of these customers’ applications have been negatively impacted by this or whether their CPU graphs are the only noticeable impact. It is quite possible, though, that some of these customers have experienced occasional process lockups but have not been able to attribute those problems to any particular cause. At the time of this blog post we are sending an email out to all customers we believe are affected to make sure they are aware of this issue.
It would be fair to surmise that there are a large number of non-Silverline users that are also running systems in this same environment. Given there’s little published data on this, many of these users are likely unaware of the significance or impact of this problem on their systems.
Hopefully the bug will be fixed in an upcoming kernel patch for Ubuntu 10.04 and users relying on these AMIs can simply update their kernel. In the meantime users need to address this issue directly. We believe this is part of responsibly owning your entire stack.
There are a number of approaches users can take to avoid being impacted by this:
Update to a newer Ubuntu release, for example, Ubuntu 10.10. Since Ubuntu 10.04, the Xen patches are better integrated into the kernel avoiding the requirement to backport them to 2.6.32. Users have reported that the original process lockups don’t occur with the Ubuntu 10.10 images.
For users with environments currently dependent on Ubuntu 10.04 (we still have some ourselves), we have modified our OPS scripts to throw out instances that boot with the Nehalem CPUs and reprovision until we get an E5430 machine. We have noticed that in some AZs we see more Nehalems than in others, which likely points to AZs with more recent hardware deployments. Obviously this approach is not sustainable as a whole as more users seek out the older E5430 CPUs and Amazon further invests in the Nehalem architecture, so we are actively working to migrate our 10.04 systems to 10.10.
For advanced users, building a custom 2.6.32 kernel that contains the patchset from the bug report is an option. There are also some custom kernels and AMIs in this bug report that users have reported success with.
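The reprovisioning workaround above depends on detecting the CPU model at boot. Our actual OPS scripts are not shown here. The following Python sketch illustrates one way such a check might look, assuming the model string is read from /proc/cpuinfo and that affected hosts report a Xeon 55xx part such as the E5507. The pattern and function names are hypothetical and may need adjusting for other Nehalem models.

```python
import re

# Assumption: affected hosts report a Xeon 55xx model number such as "E5507".
NEHALEM_55XX = re.compile(r"\b[ELXW]55\d{2}\b")


def cpu_model(cpuinfo_text):
    """Return the first 'model name' value from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            return value.strip()
    return None  # no model name line found


def should_reprovision(cpuinfo_text):
    """True if the instance appears to be backed by a Xeon 55xx CPU."""
    model = cpu_model(cpuinfo_text)
    return bool(model and NEHALEM_55XX.search(model))
```

A boot script could read /proc/cpuinfo, pass its contents to `should_reprovision`, and terminate and relaunch the instance when the result is true.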
We of course recommend Silverline’s application monitoring to proactively monitor your application workloads.