Over the last few years a number of new technologies have emerged (inspired largely by Google) to help wrangle Big Data. Things like Hadoop, HBase, Hive, Lucene, Solr and a host of others are becoming the "buzzwords" for handling the type of data that we at the secretariat are working with. As a number of our previous posts here have shown, the GBIF dev team is wholeheartedly embracing these new technologies, and we recently went to the Berlin Buzzwords conference (as a group) to get a sense of how the broader community is using these tools.
My particular interest is in HBase, a style of database that can handle "millions of columns and billions of rows". Since we're optimistic about continued growth in the number of occurrence records indexed by GBIF, it's not unreasonable to plan for 1 billion (10^9) indexed records in the medium term. Our current MySQL solution has held up reasonably well so far (it is now closing in on 300 million indexed records), but it certainly won't handle an ever-growing future.
I'm now in the process of evaluating HBase's ability to respond to the kinds of queries we need to support, particularly downloads of large datasets corresponding to queries in the data portal. As in most databases, schema design is quite important in HBase, as is the selection of a "primary key" format for each table. A number of the talks at Berlin Buzzwords addressed these issues, and I was very happy to hear some of the core contributors to HBase conclude that figuring out the right setup for any particular problem is far from trivial. Notable among the presenters were Jean-Daniel Cryans from StumbleUpon (a fellow Canadian, woot!) and Jonathan Gray from Facebook (with luck their slides will be up at the Buzzwords slides page soon). Jonathan's presentation especially gives a sense of what HBase is capable of, given the truly huge amount of data Facebook drives through it (all of Facebook's messaging is held in HBase).
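To make the "primary key" point a bit more concrete: HBase stores rows sorted by the raw bytes of their key, so the key format decides which queries turn into a cheap contiguous scan. The following is only a minimal sketch of one possible format, not the schema we have settled on. It assumes numeric dataset and occurrence ids, and the names `make_row_key`, `parse_row_key` and `dataset_scan_range` are made up for illustration. It uses plain Python and no HBase client.

```python
import struct
from typing import Tuple

# Hypothetical key layout (an assumption for illustration only):
#   4-byte dataset id + 8-byte occurrence id, both unsigned and big-endian.
# Fixed-width big-endian integers compare byte-by-byte in the same order
# as the numbers themselves, which matches how HBase orders row keys.
KEY_FORMAT = ">IQ"


def make_row_key(dataset_id: int, occurrence_id: int) -> bytes:
    """Build a 12-byte composite row key."""
    return struct.pack(KEY_FORMAT, dataset_id, occurrence_id)


def parse_row_key(key: bytes) -> Tuple[int, int]:
    """Split a composite row key back into (dataset_id, occurrence_id)."""
    return struct.unpack(KEY_FORMAT, key)


def dataset_scan_range(dataset_id: int) -> Tuple[bytes, bytes]:
    """Return (start, stop) key bounds covering every record of one dataset.

    The stop bound is exclusive. This simple version does not handle the
    largest possible 4-byte dataset id, since dataset_id + 1 would overflow.
    """
    start = struct.pack(">I", dataset_id)
    stop = struct.pack(">I", dataset_id + 1)
    return start, stop


if __name__ == "__main__":
    keys = [make_row_key(2, 5), make_row_key(1, 300), make_row_key(1, 7)]
    for key in sorted(keys):
        print(parse_row_key(key))
```

With a layout like this, all records of a dataset sit next to each other, so "download everything in dataset X" becomes a single range scan between the two bounds. The trade-off is that queries along other dimensions (taxon or country, say) get no help from the key and need their own tables or indexes, which is exactly why picking the format is not trivial.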
Apart from learning a number of new techniques and approaches to developing with HBase, more than anything I'm excited to dive into the details knowing such a strong and supportive community is out there to help me when I get stuck. You can follow along with my testing and deliberations on the wiki page for our occurrence record project.