Monday, 18 April 2011

Lucene for searching names in our new common taxonomy

Oliver here - I'm one of the new developers at GBIF, having started in October, 2010. With no previous experience in biology or biological classification you can bet it's been a steep learning curve in my time here, but at the same time it's very nice to be learning about a domain that's real, valuable and permanent, rather than yet another fleeting e-commerce, money-trading or "social media" application!

One of the features of GBIF's Data Portal is searching primary occurrence data via a backbone taxonomy. For example, let's say you're interested in snow leopards and would like to plot all current and historical occurrences of this elusive cat on a world map. Let's further say that David Attenborough suggested to you that the snow leopard's scientific name is "Panthera uncia". You would ask the data portal for all records about Panthera uncia and expect to see all occurrences of snow leopards. Unfortunately biologists aren't agreed on how to classify the snow leopard - some argue that it belongs in the genus Panthera, while others argue that it should belong to its own genus, Uncia, and naturally the GBIF network has records under both names. You would just like to see all of those records and never mind the details - and that's just the tip of the iceberg when it comes to building a backbone taxonomy to match the 260 million+ occurrence records in the GBIF network.

Indeed, the backbone taxonomy (we call it our "Nub Taxonomy") in use by the current data portal has been one of the biggest sources of criticism of the GBIF data portal - it doesn't cover enough of the names in our occurrence records, and it doesn't handle the tricky stuff (as above) as well as it should. One of the reasons is that the current backbone taxonomy was built from the Catalogue of Life 2007 and an International Plant Names Index (IPNI) of similar vintage, and then augmented with the classifications from any unmatched occurrence records. This has led to a classification hierarchy which is less reliable than we (and the GBIF network) would like.

Markus Döring is the GBIF software team's taxonomy expert and he has employed a new strategy for building an improved Nub Taxonomy by building it exclusively on well-known and respected taxonomies already out there - things like the most recent Catalogue of Life, IPNI, and more, but without using the classifications as given in the occurrence data. After the Nub Taxonomy is built, the occurrence records then need to be matched to it. As the first step to integrating the new Nub Taxonomy into the data portal, my job in the last little while has been to build a searchable index of all the names in our Nub Taxonomy and a web service that can accept a scientific name (from an occurrence record) and match it to the index, while understanding the implications of homonyms and synonyms, as well as tolerating misspellings. And of course, make it fast :)
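To make the matching task a little more concrete, here is a minimal Python sketch of the idea, not the actual service (which is built on Lucene). It assumes a tiny hand-made index in which both snow leopard names point to the same taxon key, and it uses plain Levenshtein edit distance as a stand-in for fuzzy matching. The names `NAME_INDEX`, `match_name` and the taxon keys are all hypothetical, and homonyms are not handled at all.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical mini index: normalized name -> taxon key (keys are made up).
# Both snow leopard names share one key, so synonyms resolve to the same taxon.
NAME_INDEX = {
    "panthera uncia": 101,
    "uncia uncia": 101,
    "panthera leo": 102,
    "puma concolor": 103,
}


def normalize(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def match_name(raw_name: str, max_distance: int = 2):
    """Return the taxon key for raw_name, tolerating small misspellings."""
    query = normalize(raw_name)
    if query in NAME_INDEX:  # exact hit, no fuzzy search needed
        return NAME_INDEX[query]
    best = min(NAME_INDEX, key=lambda candidate: edit_distance(query, candidate))
    if edit_distance(query, best) <= max_distance:
        return NAME_INDEX[best]
    return None  # nothing close enough
```

In this toy setup, `match_name("Uncia uncia")` and the misspelled `match_name("Pantera uncia")` both resolve to the same key, while an unrelated name such as "Felis catus" returns `None`. A linear scan like this would of course be far too slow for 8 million names, which is where a real index comes in.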

Since what we're talking about here is string matching with a tolerance for messy input (e.g. spelling mistakes and various violations of nomenclatural rules), the place to start is Lucene. Our Nub Taxonomy has about 8 million unique names, and our 260 million occurrence records also contain roughly 8 million unique names. Our use case is somewhat out of the ordinary for Lucene in that we can build the index once and after that it becomes read-only until the next update of our Nub Taxonomy (e.g. to reflect an update in the Catalogue of Life), and it only takes a few minutes to build the index, so it's not all that important for it to be persistent. That means we can optimize for search speed and not worry so much about indexing performance. Lucene has just the index storage implementation for this need - RAMDirectory.

For the most part this worked just fine, but no matter how hard I hit the index, I couldn't get CPU usage to 100% - the best I could do was about 80%. I found that very irksome and spent some time testing different Directory implementations, web service stacks, and everything in between. None of the other Directory implementations (all file based in some way) showed any improvement, nor did eliminating the web stack. Finally, by attaching a profiler to the Tomcat instance running the web service with RAMDirectory, we were able to see thread blocking increasing in proportion to the number of requesting threads. That led us to the Lucene source code, where we found a synchronized block that we deemed the culprit.

With the cause at least found, I decided not to waste time trying to fix the problem for what would be nominal gain, and instead used two Tomcat installations and load balanced between them with Apache. With the Tomcats running on quite powerful machines we are now seeing approximately 1000 lookups/sec (including a bunch of business logic beyond the Lucene lookup), which we think is pretty good, and sufficient for our purposes.
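The reasoning behind the two-Tomcat setup can be pictured with a small Python sketch. This is only an in-process analogy under stated assumptions: each hypothetical `LookupBackend` stands in for one Tomcat instance with its own copy of the index and its own lock (playing the role of the synchronized block), and `RoundRobinBalancer` stands in for Apache. None of these names come from our code base, and the index contents are made up.

```python
import itertools
import threading


class LookupBackend:
    """Stand-in for one Tomcat instance: own index copy, own lock."""

    def __init__(self, name, index):
        self.name = name
        self._index = dict(index)      # each instance holds its own copy
        self._lock = threading.Lock()  # plays the role of the synchronized block
        self.served = 0

    def lookup(self, scientific_name):
        # Requests sent to this backend queue up on this backend's lock only.
        with self._lock:
            self.served += 1
            return self._index.get(scientific_name.lower())


class RoundRobinBalancer:
    """Stand-in for the Apache load balancer: alternates between backends."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))
        self._cycle_lock = threading.Lock()  # guards only the cheap pick step

    def lookup(self, scientific_name):
        with self._cycle_lock:
            backend = next(self._cycle)
        return backend.lookup(scientific_name)


# Hypothetical index contents and instance names.
INDEX = {"panthera uncia": 101, "uncia uncia": 101}
backends = [LookupBackend("tomcat-a", INDEX), LookupBackend("tomcat-b", INDEX)]
balancer = RoundRobinBalancer(backends)

results = [balancer.lookup("Panthera uncia") for _ in range(10)]
```

After the ten lookups each backend has served five of them, so only half of the requests ever compete for any one lock. The trade-off is holding the index in memory twice, which is acceptable here because the index is read-only and quick to rebuild.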

This is all being used from within our Oozie-orchestrated Hive/Hadoop workflow (which Lars will talk more about soon), but once we're confident that it's behaving properly and stably we will also offer this web service (or something similar) for public consumption. More importantly, the new Nub Taxonomy will be available in the GBIF data portal very soon, and with it we hope to have eliminated most of the problems people have found with our current backbone taxonomy.

4 comments:

  1. Hi, thanks for this post, very informative !

    I'm trying to figure out how the taxonomy-occurrences link works... If I understand correctly:

    1) the nub taxonomy is created by aggregating several different trustworthy sources. Technically, I suppose it's a tree-like structure with some refinements, such as synonym information.

    2) For every occurrence, your webservice is called (with the scientific name as a parameter). Then a "taxonomic id" is returned, and this id is associated to the occurrence (foreign key concept of a RDBMS) ?

    3) When the taxonomic references are refreshed, the occurrence-taxonomic links are dropped and that cycle is restarted.

    Am I more or less right? Or is that link not persistent, for example is it created on the fly by querying your web service every time an occurrence is accessed?

    Another question: what's the use of having this web service publicly available? Getting a "GBIF taxon id" to associate it with data that is not published in the GBIF network? Do you have other use cases? Are these taxonomic entries versioned (for example to have a stable ID and at the same time allow updates)?

    Thanks in advance, I'm trying to get a better view of your data architecture :)

    Nicolas

  2. Exactly right Nico. Note this is NOT LIVE yet. The current live processing still assembles as it goes. We will of course open the web service to the public when it is launched and stable, so others can discover and link to the GBIF taxon IDs.

    We need to store the link as it takes some tens of milliseconds to do the fuzzy matching and we can't sustain the throughput to do this at query time. It is particularly important as you go to higher ranks of taxonomy - think how many occurrences align to Animalia/Chordata/Aves (100m+).

    Others wish to link as well. Think about EOL (http://eol.org) who link to our maps from species pages. We could do the link in real time, but as we get high traffic (think millions of page views a day) it will cripple the service. Hence you store and cache it.

    Hope this helps,
    Tim

  3. Hi all.
    Thanks for the post...and for the job.
    I searched the GBIF catalog and found no match. Is the GBIF development team also planning to deliver a web service that searches by common names (or substrings) as input parameters, like the ITIS web service searchForAnyMatch? Is it planned for a milestone? Or does it maybe already exist?
    Regards.

  4. Hi...
    Thanks for the post. I am currently working on taxonomy creation and Lucene-based search for a text-organising application. This blog has given me some useful insight.
