We're getting serious in our Hadoop adoption. The first process (our so called "rollover") is now in production and it uses Hadoop, Hive, Oozie and various other parts of the Hadoop ecosystem.
Our next step is evaluating HBase and its performance on our (small and aging) cluster. To do that properly and to fix a rather embarrassing situation we first had to get proper monitoring up and running for our cluster. So far we've only had Cacti stats for OS level things (CPU, I/O, etc.) but we were missing actual Hadoop statistics.
So we've now set up Ganglia at GBIF and the best news is it's public and using the very latest Ganglia 3.3 which was released only a few days ago in February 2012. The setup was relatively painless. Ganglia was just nice to work with. To get monitoring of HBase working we had to apply HBASE-4854 because it's not included in our Hadoop distribution (CDH3u2). Thanks to Lars George for the hint.
So we can happily report that Ganglia 3.3 works perfect with the GangliaContext31 from Hadoop.
Now all that's left is learning what most of these stats mean and then trying to extract useful information from them. Any hints are more than welcome and the HBase community already offered to help. Thank you very much!
For anyone interested in a few notes and details about how we set Ganglia up keep on reading.
First we had to configure iptables. This accepts all multicast packets for UDP and IGMP (it took me a while to figure out this was missing...):
iptables -A INPUT -p igmp -d 188.8.131.52/4 -j ACCEPT iptables -A INPUT -p udp -d 184.108.40.206/4 -j ACCEPT
We've split up our 18 machines in two Hadoop clusters so we wanted those separate in Ganglia as well. The only way I could get this to run is to have both clusters join different multicast addresses (mcast_join in gmond.conf). We're doing this using Puppet.
In gmetad.conf we then include one machine from each cluster:
data_source "hadoop-1" c1n1.gbif.org data_source "hadoop-2" c1n2.gbif.org
In multicast mode all machines that are listening on the same multicast address know about all state from all machines in that "ring". So what gmetad has to do is to connect to one of the machines from each ring via TCP (not multicast, that's what the tcp_accept_channel is for in gmond.conf).