Master (c1n4): HDFS NameNode, Hadoop JobTracker, HBase Master, and Zookeeper
Zookeeper (c1n1): Zookeeper for this cluster, master for our other cluster
Slaves (c4n1..c4n6): HDFS DataNode, Hadoop TaskTracker, HBase RegionServer (6 GB heap)
Hardware:
c1n*: 1x Intel Xeon X3363 @ 2.83GHz (quad-core), 8GB RAM, 2x 500GB SATA 5.4K
c4n*: Dell R720XD, 2x Intel Xeon E5-2640 @ 2.50GHz (6-core), 64GB RAM, 12x 1TB SATA 7.2K
Obviously the new machines come with faster everything and lots more RAM, so first I bonded two ethernet ports and then ran the tests again to see how much we had improved:
Figure 1: Scan performance of new cluster (2x 1gig ethernet)
Figure 2: bytes_in (MB/s) with dual gigabit ethernet
Figure 3: bytes_out (MB/s) with dual gigabit ethernet
Unfortunately our master switch is currently full, so we don't have the six extra ports needed to test a triple bond - but given our past experience I'm reasonably confident it would change Figure 1 such that scan performance keeps increasing with the number of mappers until it hits some disk I/O contention limit.
Hang on, what about data locality?
But there's a bigger question here - why are we using so much network bandwidth in the first place? Why stress about major compactions and data locality when locality doesn't seem to get used? Therein lies the rub - PerformanceEvaluation can't take advantage of data locality. Tim wrote about the tremendous importance of TableInputFormat in getting maximum scan performance out of MapReduce, and PerformanceEvaluation doesn't use it. Instead it randomly assigns each mapper a block of ids to scan, meaning that at best one in six mappers (in our setup) will coincidentally have local data to read, while the rest pull their data across the network. This isn't a bug in PerformanceEvaluation, per se, because it was written to emulate the tests that Google ran in their seminal BigTable paper rather than to act as a true benchmark for scanning performance. But if you're new to this stuff (as I was) it sure can be confusing. When we switched to scanning our real data using TableInputFormat, our throughput jumped from the 1M records/sec we got with PerformanceEvaluation to 2M records/sec.
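To make the contrast concrete, here is a minimal sketch of a scan job wired up through TableMapReduceUtil, which configures TableInputFormat under the hood. The table name "occurrence" and the row-counting mapper are illustrative assumptions, not our actual job:

// Sketch only: a full-table scan MapReduce job using TableInputFormat via
// TableMapReduceUtil. Table name and mapper logic are hypothetical.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanLocalityExample {

  // Just counts rows; real jobs would do per-record work here.
  static class RowCountMapper extends TableMapper<ImmutableBytesWritable, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      context.getCounter("scan", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "scan-locality-example");
    job.setJarByClass(ScanLocalityExample.class);

    Scan scan = new Scan();
    scan.setCaching(1000);       // fetch rows in big batches per RPC
    scan.setCacheBlocks(false);  // don't pollute the block cache during a full scan

    // One map task per region of the (hypothetical) "occurrence" table.
    TableMapReduceUtil.initTableMapperJob(
        "occurrence", scan, RowCountMapper.class,
        ImmutableBytesWritable.class, LongWritable.class, job);

    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The key point is that TableInputFormat creates one split per region and reports the hosting RegionServer as that split's preferred location, so the JobTracker can schedule each map task next to its data - which is exactly what PerformanceEvaluation's random assignment of id blocks cannot do.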
Conclusions
While we learned a lot from using PerformanceEvaluation to test our clusters - it helped uncover misconfigurations and taught us how to fine-tune lots of parameters - it is not a good tool for benchmarking scan performance. As Tim wrote, scans across our real occurrence data (~370M records) using TableInputFormat finish in 3 minutes - for our needs that is excellent, and it means we're happy with our cluster upgrade. Stay tuned for news about the occurrence download service that Lars and I are writing to take advantage of all this speed :)
Hi,
Considering your region servers have 64GB of RAM, why is only 6GB of heap allocated to the RegionServers?
In addition, if you don't mind me asking, could you detail the heap distribution for both the master and the slaves?
Thanks,
Young
The machines that host the RegionServers are also hosting an HDFS DataNode and lots of Hadoop map and reduce slots. As it happens, in my testing the size of the heap didn't really matter for read performance. In a more recent blog post about our quest to improve write performance, where heap is paramount, I talk more about heap settings: http://gbif.blogspot.dk/2012/07/optimizing-writes-in-hbase.html
I would be curious whether that performance holds when your node has 4 TB of data on it. Cassandra, I know, recommends only having 1TB of data storage per node, as compactions start taking way too long moving data around. I wonder if HBase has any issues like this, since all its data is sorted as well, right? Whereas Cassandra data is not sorted between nodes at all, but rather is random on disk.
Thanks,
Dean