Tuesday 20 March 2012

HBase Performance Evaluation continued - The Smoking Gun

Update: See also part 1 and part 3.

In my last post I described my initial foray into testing our HBase cluster performance using the PerformanceEvaluation class.  I wasn't happy with our conclusions, which could largely be summed up as "we're not sure what's wrong, but it seems slow".  So in the grand tradition of people with itches that wouldn't go away, I kept scratching.  Everything that follows is based on testing with PerformanceEvaluation (the jar patched as in the last post) using a 300M row table built with PerformanceEvaluation sequentialWrite 300, and tested with PerformanceEvaluation scan 300.  I ran the scan test 3 times, so you should see 3 distinct bursts of activity in the charts.  And to recap our hardware setup - we have 3 regionservers and a separate master.

The first unsettling ganglia metric that kept me digging was of ethernet bytes_in and bytes_out.  I'll recreate those here:
Figure 1 - bytes_in (MB/s) with single gigabit ethernet
Figure 2 - bytes_out (MB/s) with single gigabit ethernet
This is a single gigabit ethernet link, and that is borne out by the Max values, which are near the theoretical max of 120 megabytes per second (MB/s), and as you can see they certainly aren't sitting at 120 constantly in the way one would expect if ethernet was our bottleneck. But that bytes_in graph sure looks like it's hitting some kind of limit - steep ramp up, relatively constant transfer rate, then steep ramp down.  So I decided to try channel bonding to effectively use 2 gigabit links on each server "bonded" together.  It turns out that's not quite as trivial as it may seem, but in the end I got it working following the Dell/Cloudera configuration suggestion of mode 6 / balance-alb, which is described more in the Linux kernel bonding documentation. The results are shown in the following graphs:

Figure 3 - bytes_in (MB/s) with dual (bonded) gigabit ethernet
Figure 4 - bytes_out (MB/s) with dual (bonded) gigabit ethernet
Unfortunately the scale has changed (ganglia ain't perfect) but you can see we're now hovering a bit higher than the single ethernet case, and at some points swinging significantly above the single ethernet limit (a few gusts up to 150).  So it would seem that even though the single gigabit graphs didn't look like they were hitting theoretical maximum, they were definitely limited.  Not by much though - maybe 10-15%?  As they say, the proof is in the pudding, so what was the final throughput for the two tests?

EthernetThroughput (rows/s)

Looks like 5%. Certainly not the doubling I was hoping for! Something else must be the limiting factor then, but what?  Well, what does CPU say?
Figure 5 - cpu_idle with dual (bonded) gigabit ethernet
That CPU is saying, "Feed me Seymour!".  If the CPU is hungry, what about the load average?
Figure 6 - one minute load average with dual (bonded) gigabit ethernet
These machines have 8 hyper-threaded cores, so effectively 16 cpus. The load average looks like it's keeping the CPUs fed - hovering around 20. But hang on - the load average isn't all about CPU - it's about work that the whole machine needs to do - including all the io subsystems. So if it ain't network, and it ain't CPU then it's gotta be disks. Let's look at how much time the CPU spent waiting on io (which in this case basically means "waiting on disks"):
Figure 7 - percentage cpu spent in wait_io state, with dual (bonded) gigabit ethernet
From a pure gut perspective, that seems kind of high.  But on closer inspection, that graph also looks familiar - it's basically the same as Figure 6 - the load average. So what can we deduce?  When ethernet is removed as a limiting factor the run queue is filled with processes that all cause an increase in wait_io - which means our processes would all finish faster if they didn't have to wait for io.  The limiting factor must, then, be the disks.

Recap (TL;DR)
Ethernet looked suspiciously capped, but performance (and ethernet usage) only increased slightly when the cap was lifted. Closer inspection of the resulting load average and CPU usage showed that the limiting factor was in fact the disks.  

Solution?  Replacing the disks might help a little - they're 3 year old 2.5" SATA, but they are 7.2k rpm, and a more modern 10k 2.5" would be marginally faster.  The real benefit would probably be more disks - the current wisdom on the mailing lists and in the documentation is to maximize the ratio of disk spindles to cpu cores and I think that's borne out by these results.  In the coming weeks we're going to build a new cluster of 6 machines with 12 disks each and I very much look forward to testing my theories once they're up and running.  Stay tuned!