Oliver here - I'm one of the new developers at GBIF, having started in October 2010. With no previous experience in biology or biological classification, you can bet it's been a steep learning curve in my time here, but at the same time it's very nice to be learning about a domain that's real, valuable and permanent, rather than yet another fleeting e-commerce, money-trading or "social media" application!
One of the features of GBIF's Data Portal is allowing searching of primary occurrence data via a backbone taxonomy. For example, let's say you're interested in snow leopards and would like to plot all current and historical occurrences of this elusive cat on a world map. Let's further say that David Attenborough suggested to you that the snow leopard's scientific name is "Panthera uncia". You would ask the data portal for all records about Panthera uncia and expect to see all occurrences of snow leopards. Unfortunately biologists aren't agreed on how to classify the snow leopard - some argue that it belongs in the genus Panthera, while others argue that it should belong to its own genus, Uncia, and naturally the GBIF network has records under both names. You would just like to see all of those records and never mind the details - and that's just the tip of the iceberg when it comes to building a backbone taxonomy to match the 260 million+ occurrence records in the GBIF network.
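To make the snow leopard example concrete, here is a minimal sketch of the idea of resolving every name to one accepted name before collecting records. It is only an illustration, not how the portal is implemented: the class SynonymLookup, its methods and the record ids are all made up, and treating "Uncia uncia" as the accepted name is an arbitrary assumption for the demo.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: not the GBIF portal's actual data model.
final class SynonymLookup {
    // Maps a synonym to the name we treat as accepted.
    private final Map<String, String> acceptedBySynonym = new HashMap<>();
    // Occurrence record ids keyed by the name exactly as it appears on the record.
    // A TreeMap keeps iteration order predictable for the demo.
    private final Map<String, List<String>> recordsByName = new TreeMap<>();

    void addSynonym(String synonym, String acceptedName) {
        acceptedBySynonym.put(synonym, acceptedName);
    }

    void addOccurrence(String nameOnRecord, String recordId) {
        recordsByName.computeIfAbsent(nameOnRecord, k -> new ArrayList<>()).add(recordId);
    }

    // A name with no registered synonym entry is treated as its own accepted name.
    String resolve(String name) {
        return acceptedBySynonym.getOrDefault(name, name);
    }

    // Returns every record whose name resolves to the same accepted name as the query.
    List<String> occurrencesFor(String queryName) {
        String accepted = resolve(queryName);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : recordsByName.entrySet()) {
            if (resolve(entry.getKey()).equals(accepted)) {
                result.addAll(entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SynonymLookup lookup = new SynonymLookup();
        // Assumption for the demo: "Uncia uncia" is the accepted name.
        lookup.addSynonym("Panthera uncia", "Uncia uncia");
        lookup.addOccurrence("Panthera uncia", "rec-1");
        lookup.addOccurrence("Uncia uncia", "rec-2");
        lookup.addOccurrence("Panthera leo", "rec-3");
        System.out.println(lookup.occurrencesFor("Panthera uncia"));
    }
}
```

Whichever of the two names you ask for, the query is resolved first, so records filed under either name come back together.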
Indeed, the backbone taxonomy (we call it our "Nub Taxonomy") in use by the current data portal has been one of the biggest sources of criticism of the GBIF data portal - it doesn't cover enough of the names in our occurrence records, and it doesn't handle the tricky stuff (as above) as well as it should. One of the reasons for that is that the current backbone taxonomy was built on the Catalogue of Life 2007 and a similar-vintage International Plant Names Index (IPNI), and then augmented with the classifications from any unmatched occurrence records. This has led to a classification hierarchy which is less reliable than we (and the GBIF network) would like.
Markus Döring is the GBIF software team's taxonomy expert and he has employed a new strategy for building an improved Nub Taxonomy by building it exclusively on well-known and respected taxonomies already out there - things like the most recent Catalogue of Life, IPNI, and more, but without using the classifications as given in the occurrence data. After the Nub Taxonomy is built, the occurrence records then need to be matched to it. As the first step to integrating the new Nub Taxonomy into the data portal, my job in the last little while has been to build a searchable index of all the names in our Nub Taxonomy and a web service that can accept a scientific name (from an occurrence record) and match it to the index, while understanding the implications of homonyms and synonyms, as well as tolerating misspellings. And of course, make it fast :)
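As a rough picture of what "tolerating misspellings" means, here is a small sketch that matches an incoming name against a list of known names using plain Levenshtein edit distance. This is not the actual web service and it doesn't use Lucene: the class FuzzyNameMatcher, its methods and the distance threshold are assumptions for illustration, it scans the whole list on every query (fine for three names, hopeless for 8 million), and it ignores homonyms and synonyms entirely.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical illustration only: brute-force fuzzy matching, not the real service.
final class FuzzyNameMatcher {
    private final List<String> names;

    FuzzyNameMatcher(Collection<String> names) {
        this.names = new ArrayList<>(names);
    }

    // Lower-cases and collapses whitespace so trivial differences don't count as edits.
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Classic Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest known name within maxDistance edits, if there is one.
    Optional<String> match(String query, int maxDistance) {
        String q = normalize(query);
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String name : names) {
            int d = editDistance(q, normalize(name));
            if (d < bestDistance) {
                bestDistance = d;
                best = name;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        FuzzyNameMatcher matcher = new FuzzyNameMatcher(
                java.util.Arrays.asList("Panthera uncia", "Panthera leo", "Uncia uncia"));
        // "unica" is a transposition of "uncia", which costs two edits here.
        System.out.println(matcher.match("Panthera unica", 2));
    }
}
```

A real index avoids comparing the query against every name, which is exactly the kind of work a search library does for you.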
Since what we're talking about here is string matching with a tolerance for messy input (e.g. spelling mistakes, different violations of nomenclatural rules), the place to start is Lucene. Our Nub Taxonomy has about 8 million unique names, and our 260 million occurrence records also comprise roughly 8 million unique names. Our use case is somewhat out of the ordinary for Lucene in that we can build the index once and after that it becomes read-only until the next update of our Nub Taxonomy (e.g. to reflect an update in the Catalogue of Life), and it only takes a few minutes to build the index, so it's not all that important for it to be persistent. That means we can optimize for search speed and not worry so much about indexing performance. Lucene has just the index storage implementation for this need - RAMDirectory.
For the most part this worked just fine, but no matter how hard I hit the index, I couldn't get CPU usage to 100% - the best I could do was about 80%. I found that very irksome and spent some time testing different Directory implementations, web service stacks, and everything in between. None of the other Directory implementations (all file based in some way) showed any improvement, nor did eliminating the web stack. Finally, by attaching a profiler to the Tomcat instance running the web service while running with RAMDirectory, we were able to see thread blocking increase in proportion to the number of requesting threads. That led us to the Lucene source code, where we found a synchronized block that we deemed the culprit. With the cause at least found, I decided not to waste time trying to fix the problem for what would be nominal gain, and instead decided to use two Tomcat installations and load balance between them with Apache.
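The "build once, then read-only" property is worth a quick illustration. The sketch below is not Lucene and not our service: ReadOnlyNameIndex, its methods and the ids are invented for the example. It only shows the general Java idea that a structure which is never modified after construction can be queried by many threads at once without a synchronized block, which is the kind of contention the profiler pointed us at.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: a tiny in-memory, read-only name index.
final class ReadOnlyNameIndex {
    // Copied once in the constructor and never modified afterwards, so
    // concurrent reads need no locking (the final field gives safe publication).
    private final Map<String, Integer> idByName;

    ReadOnlyNameIndex(Map<String, Integer> source) {
        this.idByName = Collections.unmodifiableMap(new HashMap<>(source));
    }

    // Returns the id for a name, or null if the name is unknown.
    Integer lookup(String name) {
        return idByName.get(name);
    }

    // Runs the same queries on several threads and returns the total number of hits.
    static int countHits(ReadOnlyNameIndex index, List<String> queries, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(() -> {
                    int hits = 0;
                    for (String q : queries) {
                        if (index.lookup(q) != null) {
                            hits++;
                        }
                    }
                    return hits;
                }));
            }
            int total = 0;
            for (Future<Integer> f : futures) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("lookup task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> names = new HashMap<>();
        names.put("Panthera uncia", 1);
        names.put("Uncia uncia", 2);
        ReadOnlyNameIndex index = new ReadOnlyNameIndex(names);
        List<String> queries = java.util.Arrays.asList("Panthera uncia", "Uncia uncia", "Felis catus");
        System.out.println(countHits(index, queries, 4));
    }
}
```

Rebuilding after a taxonomy update would then mean constructing a fresh index and swapping the reference, rather than mutating the one being queried.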
With the Tomcats running on quite powerful machines we are now seeing approximately 1000 lookups/sec (including a bunch of business logic beyond the Lucene lookup), which we think is pretty good, and sufficient for our purposes.
This is all being used from within our Oozie-orchestrated Hive/Hadoop workflow (which Lars will talk more about soon), but once we're confident that it's behaving properly and stably we will also offer this web service (or something similar) for public consumption. More importantly, the new Nub Taxonomy will be available in the GBIF data portal very soon, and with it we hope to have eliminated most of the problems people have found with our current backbone taxonomy.