Friday, 29 April 2011

The evolution of the GBIF Registry

The GBIF Registry has evolved over time to become an important tool in GBIF's day-to-day work. Before going further, a basic understanding of the GBIF Network model is needed. GBIF is a decentralised network made up of several kinds of entities that are related to each other. At the top level there are GBIF Participant Nodes, which typically are countries or thematic networks that coordinate their domain. These Nodes endorse one or more Organisations or Institutions inside their domain, and each Organisation owns one or more Resources exposed through the GBIF Network. Each Resource is in turn typically associated with a Technical Access Point, the URL through which its data are accessed. There are also other entities such as IPT Installations, which are deployed inside specific organisations but are not resources themselves; they publish Resources that might be owned by other organisations. A quick view of GBIF's network model:


Not long ago, this complexity was modelled using a Universal Description, Discovery and Integration (UDDI) system. This system served a purpose at the time, despite its limited set of data structure types (businessEntity, businessService, bindingTemplate, tModel). A BusinessEntity was associated with an Organisation/Institution, a BusinessService with a Resource, and a BindingTemplate with the technical access point used to access the data of that specific resource. A tModel was used to associate the BusinessEntity (Organisation) with a specific Node inside the GBIF Network. A quick view of how the network information was kept in this Registry:



The main disadvantages (for our purposes) of the UDDI specification were:

  1. Lack of contact information at the BusinessService (Resource) level (contacts can only be added at the BusinessEntity (Organisation) level)
  2. Lack of more descriptive metadata on Organisations and Resources (fields such as the organisation's address, homepage or phone; all of this could be provided through a complex use of UDDI's capabilities, but that would make it unnecessarily complex for third-party tools to extract)
  3. Limited to a fixed specification and a fixed API (although the available UDDI client libraries are quite straightforward to use)
  4. A general-purpose specification, not easily adapted to modelling the complexity of GBIF's network
  5. Our software dated back to the beginning of the past decade (Systinet WASP UDDI)
  6. Third-party consumers needed to know how to speak UDDI
In 2009, we tried to overcome some of the Registry's limitations with a "UDDI on steroids" approach, which still consisted of a UDDI system (jUDDI in our case) plus an external database holding extra data (e.g. Resource contact information, the organisation's address, homepage or phone, etc.). The main advantage was the creation of our own APIs, so that third-party tool developers who wanted to consume the GBIF network information no longer needed to know the nuts and bolts of the UDDI specs. We offered the community a simple API with proper documentation, and we dealt with the inner workings of it all.

Further along in this evolution, the Registry took the next step: we removed the UDDI component altogether and were left with only a database, which gave us complete freedom to model the network. We now had a system that could hold any kind of entity in the Network (Nodes, Organisations, Resources, Technical Installations) and any relation among them. Along with this new approach came the web application (http://gbrds.gbif.org) and a far better API, which offers the data in XML or JSON format; a minimal example of consuming it is sketched after the feature list below. These APIs are easy to follow and well documented (http://code.google.com/p/gbif-registry/wiki/TableOfContents). Among the new features:

  1. Create any kind of entity
  2. Create any kind of relation among them
  3. More detailed metadata (for entities and contacts)
  4. Ability to tag entities
  5. Individual credentials for each Institution/Organisation, allowing them to add new resources or delete existing ones under their own Organisation (currently only available through the APIs or via admin management)
  6. Enhanced maintenance features (for admins)
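
As a quick illustration of how a third-party tool might consume the Registry, the sketch below fetches the list of registered organisations as JSON over plain HTTP. The exact resource path shown is an assumption on my part; the API documentation linked above is the authoritative reference.

    // Minimal sketch of consuming the Registry API from Java.
    // The endpoint path below is an assumption; check the API docs for the real resource URLs.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RegistryClient {
      public static void main(String[] args) throws Exception {
        URL url = new URL("http://gbrds.gbif.org/registry/organisation.json");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestProperty("Accept", "application/json");
        BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream(), "UTF-8"));
        StringBuilder json = new StringBuilder();
        for (String line; (line = in.readLine()) != null; ) {
          json.append(line).append('\n');
        }
        in.close();
        System.out.println(json);  // hand the payload to your favourite JSON parser
      }
    }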

[Evolution of GBIF's Registry]
Development is still ongoing and many exciting features are expected in the future. The status of development can be checked out here.

Wednesday, 27 April 2011

OAI-PMH Harvesting at GBIF

GBIF is my first experience in the bioinformatics world; my first assignment was developing an OAI-PMH harvester. This post introduces the OAI-PMH protocol and how we gather XML documents from different sources; in a follow-up post I'll give an introduction to the index that we have built using those documents.


The main goal of this project was to develop the infrastructure needed across the GBIF network to support the management and delivery of metadata that will enable potential end users to discover which datasets are available, and to evaluate the appropriateness of such datasets for particular purposes. In the GBIF context, resources are datasets, loosely defined as collections of related data, the granularity of which is determined by the data custodian/provider.



OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) is a platform-independent framework for both metadata publishers and metadata consumers. The most important concepts of the protocol are (some example requests follow the list):
• Metadata: provides information on such aspects as the ‘who, what, where, when and how’ pertaining to a resource. For the producer, metadata are used to document data in order to inform users of their characteristics, while for the consumer, metadata are used to both discover data and assess their appropriateness for particular needs ('fitness for purpose’).
• Repository: an accessible server that is able to process the protocol verbs.
• Unique identifier: is an unambiguous identifier of an item (document/record) inside the repository.
• Record: is metadata expressed in a specific format.
• Metadata-prefix: specifies the metadata format in OAI-PMH requests issued to the repository (EML 2.1.0, Dublin Core, etc.)
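
To make these concepts concrete, the protocol is exercised with plain HTTP GET requests against a repository's base URL; the base URL and identifier below are placeholders:

    http://repository.example.org/oai?verb=Identify
    http://repository.example.org/oai?verb=ListMetadataFormats
    http://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2011-01-01
    http://repository.example.org/oai?verb=GetRecord&identifier=oai:repository.example.org:123&metadataPrefix=eml

Every repository must support the Dublin Core prefix (oai_dc); other prefixes, such as one for EML, depend on the formats a given repository is configured to serve.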

GBIF-Metadata Network Topology

The metadata catalogue will primarily be used as the central catalogue in the GBIF Data Portal for the global GBIF network, which, in turn, will broker information to wider initiatives such as EuroGEOSS, OBIS, etc. Such initiatives are essentially OAI-PMH service providers that will be contacted by the GBIF metadata harvester.

The GBIF metadata catalogue service undertakes both harvesting and serving roles: it aggregates metadata from other OAI-PMH repositories and serves metadata via OAI-PMH to other harvesting services. The harvested metadata are stored in a local file system. The system can apply XSLT transformations to create a new document based on the content of an existing one (e.g., transforming an EML document into an ISO19139 one).
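
The transformation step itself needs nothing exotic; a sketch along the following lines, using the standard javax.xml.transform API with placeholder file names, is enough to turn a harvested EML document into an ISO19139 one, given a suitable stylesheet:

    // Sketch of applying a configured XSLT stylesheet to a harvested document.
    // File names and the stylesheet are placeholders, not the project's actual configuration.
    import java.io.File;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class MetadataTransform {
      public static void main(String[] args) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new File("eml-to-iso19139.xsl")));
        t.transform(new StreamSource(new File("store/eml/record-1.xml")),
                    new StreamResult(new File("store/iso19139/record-1.xml")));
      }
    }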

OAI-PMH Harvester
The harvester is a standalone Java application that makes extensive use of the open source project "OAIHarvester2", which supports OAI-PMH v1.1 and v2.0. The source code of this project was not modified but extended to handle the harvested XML payload. The payload is delivered as a single file of aggregated XML documents (one per metadata resource).
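
In essence the harvesting side boils down to issuing ListRecords requests and following resumption tokens until the repository is drained. The sketch below shows that loop in plain Java; it is a simplified illustration of the protocol flow, not the OAIHarvester2 code, and the base URL is a placeholder:

    // Simplified ListRecords/resumptionToken loop (illustration only).
    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLEncoder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class SimpleHarvestLoop {
      public static void main(String[] args) throws Exception {
        String base = "http://repository.example.org/oai";
        String request = base + "?verb=ListRecords&metadataPrefix=oai_dc";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        while (request != null) {
          InputStream in = new URL(request).openStream();
          Document doc = dbf.newDocumentBuilder().parse(in);
          in.close();
          // ... hand the <record> elements over to the payload handler here ...
          NodeList tokens = doc.getElementsByTagNameNS(
              "http://www.openarchives.org/OAI/2.0/", "resumptionToken");
          String token = tokens.getLength() > 0 ? tokens.item(0).getTextContent().trim() : "";
          request = token.isEmpty() ? null
              : base + "?verb=ListRecords&resumptionToken=" + URLEncoder.encode(token, "UTF-8");
        }
      }
    }
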
The serving (repository) component was implemented by modifying the OAICat (http://www.oclc.org/research/activities/oaicat/default.htm) web application. The main changes, made to achieve our specific objectives, are:

• Dynamic loading of the file store. The default behaviour of the server is to load the file list at server start-up. Since the harvester can modify the file store, our server reloads the file list every time a ListIdentifiers or ListRecords verb is requested.
• Support for multiple XSL transformations per input format. The reference implementation only supports one transformation; in our implementation an input document can be published in multiple formats. For example, an EML document can be published as Dublin Core and DIF, provided an XSL transformation is configured for each output format.

More detail about this project is available at the google-code project site: http://code.google.com/p/gbif-metadata/. In a future post I'll explain how the information gathered by the harvester was used to build a search index using Solr, and how a web application uses that index to let end users search the metadata.

Wednesday, 20 April 2011

Cleanup of occurrence records

Lars here. Like Oliver, I started at GBIF in October 2010 and have no biology background either, so my first task was to set up the infrastructure Tim mentioned before - but I've already written about that (at length).

To continue the series of blog posts that was started by Oliver, and in no particular order, I'll talk about what we are doing to process the incoming data - which is the task I was given after the Hadoop setup was done.

During our rollover we process occurrence records - millions of them, about 270 million at the moment, and we expect this number to grow significantly over the next few months and years. It is only natural that there is bound to be bad data in there, for reasons ranging from simple typos to misconfigured publishing tools and transfer errors.

The more we know about the domain and the data, the more we are able to fix. Any input on how we could do better in this part of our processing is appreciated.

For fields like kingdom, phylum, country name or basis of record we do a simple lookup in a dictionary of common mistakes and replace them with the proper versions. Other fields like class, order, family, genus and author have far too many distinct values for us to prepare a dictionary of all the possible errors and their corrections. That is why we only apply a few safe clean-up procedures there (e.g. removing blacklisted names or invalid characters).
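
To illustrate the idea (a simplified sketch, not our actual dictionaries or code, and with made-up entries), the lookup amounts to normalising the incoming value and mapping known bad spellings to the accepted form:

    // Simplified sketch of a dictionary-based cleanup for a controlled field
    // such as basis of record. The entries are invented for illustration.
    import java.util.HashMap;
    import java.util.Map;

    public class BasisOfRecordInterpreter {
      private static final Map<String, String> DICTIONARY = new HashMap<String, String>();
      static {
        DICTIONARY.put("SPECIMEN", "PreservedSpecimen");
        DICTIONARY.put("PRESERVED SPECIMEN", "PreservedSpecimen");
        DICTIONARY.put("OBSERVATION", "HumanObservation");
        DICTIONARY.put("OBSERVACION", "HumanObservation");
      }

      /** Returns the interpreted value, or null if the verbatim value is unknown. */
      public static String interpret(String verbatim) {
        if (verbatim == null) {
          return null;
        }
        String key = verbatim.trim().replaceAll("\\s+", " ").toUpperCase();
        return DICTIONARY.get(key);
      }

      public static void main(String[] args) {
        System.out.println(interpret(" preserved   specimen "));  // PreservedSpecimen
      }
    }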

Scientific names are additionally parsed by the NameParser from the ECAT project, which does all kinds of fancy magic to try to infer a correct name. Altitudes, depths and coordinates get treated as well, by looking at common unit markers and errors we've seen in the past.

And last but not least, we also try to make the most of the dates we get. As everyone who has ever dealt with date strings knows, this can be one of the hardest topics in an internationalised environment. In theory our input data consists of three nicely formatted fields: year, month and day. In reality though, a lot of dates end up entirely in the year field. We've seen all kinds of delimiters ("/" and "-" being among the most common), abbreviations ("Mar"), database export fragments ("1978.0", because it was a floating point column in the source database), missing data and more.

Additionally, we obviously have to deal with different date formats. Is "01/02/02" the first of February or the second of January? In most cases we can only guess.
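
To give a flavour of the approach (a much simplified sketch, not our production routines), one strategy is to try a list of known patterns in order and to give up rather than guess when none of them match unambiguously:

    // Much simplified date interpretation sketch: try a few known patterns in
    // order and give up (rather than guess) if none of them match.
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class DateInterpreter {
      private static final String[] PATTERNS = {
          "yyyy-MM-dd", "yyyy/MM/dd", "dd.MM.yyyy", "yyyy-MM", "yyyy"
      };

      public static Date interpret(String verbatim) {
        if (verbatim == null) {
          return null;
        }
        String value = verbatim.trim().replaceAll("\\.0$", "");  // e.g. "1978.0" -> "1978"
        for (String pattern : PATTERNS) {
          SimpleDateFormat sdf = new SimpleDateFormat(pattern);
          sdf.setLenient(false);  // reject impossible dates such as 2011-02-31
          try {
            return sdf.parse(value);
          } catch (ParseException e) {
            // try the next pattern
          }
        }
        return null;  // ambiguous or unparsable: better no date than a wrong one
      }

      public static void main(String[] args) {
        System.out.println(interpret("1978.0"));
      }
    }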

Having said that: We've rewritten large parts of the date handling routines and are continuing to improve them as we know that this is an important part of our data. Feedback on how we're doing here is greatly appreciated!

I'm really hoping to have a chance to compile a few statistics about our incoming data quality once we've tested all of this in production.

Monday, 18 April 2011

Reworking the Portal processing

The GBIF Data Portal has provided a gateway to discover and access the content shared through the GBIF network for some years, without major change.  As the amount of data has grown, GBIF have scaled vertically (i.e. scaled up) to maintain performance levels; this is becoming unmanageable with the current processing routines due to the number of SQL statements issued against the database.  As GBIF content grows, the indexing infrastructure must change to scale out accordingly.

I have been monitoring and evaluating alternative technologies for some time and a few months ago GBIF initiated the redevelopment of the processing routines.  This current area of work does not increase functionality offered through the portal (that will be addressed following this infrastructural work) but rather aims to:
  • Reduce the latency between a record changing on the publisher side, and being reflected in the index
  • Reduce the amount of (wo)man-hours needed to coax through a successful processing run
  • Improve the quality assurance by:
    • Reworking all the date and time handling
    • Using dictionaries (vocabularies) for the interpretation of fields such as Basis of Record
    • Integrating checklists (taxonomic, nomenclatural and thematic) shared through the GBIF ECAT Programme to improve the taxonomic services and the backbone ("nub") taxonomy
  • Provide a robust framework for future development
  • Allow the infrastructure to grow predictably with content and demand growth
Things have progressed significantly since my early investigations, and GBIF are developing using the following technologies:
  • Apache Hadoop: a distributed file system and cluster processing using the MapReduce framework
  • Sqoop: a utility to synchronise between relational databases and Hadoop
  • Hive: a data warehouse infrastructure built on top of Hadoop, developed and open-sourced by Facebook. Hive gives SQL capabilities on Hadoop; full table scans on GBIF occurrence records drop from hours to minutes (a small example follows below)
  • Oozie: an open-source workflow/coordination service to manage data processing jobs for Hadoop, developed and then open-sourced by Yahoo!
[GBIF are researching HBase, the Hadoop database, to allow an increase in the richness of the indexed content; this will be the subject of future blog posts. See the project site.]
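
For a sense of what the Hive layer looks like from a developer's perspective, the sketch below runs a HiveQL aggregation over occurrence data via JDBC. The driver class and connection URL are assumptions matching the Hive releases of this era, and the table and column names are made up, so treat it as an illustration rather than a recipe:

    // Sketch of a HiveQL query over occurrence data through Hive's JDBC interface.
    // Driver class, URL, table and column names are assumptions for illustration.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveScanExample {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // a full-table aggregation that is painful as row-by-row SQL,
        // but a single MapReduce job when expressed in Hive
        ResultSet rs = stmt.executeQuery(
            "SELECT kingdom, count(*) FROM raw_occurrence_record GROUP BY kingdom");
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
      }
    }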

The processing workflow looks like the following (click for full size):

The Oozie workflow is still being developed, but the workflow definition can be found here.

Lucene for searching names in our new common taxonomy

Oliver here - I'm one of the new developers at GBIF, having started in October 2010. With no previous experience in biology or biological classification you can bet it's been a steep learning curve, but at the same time it's very nice to be learning about a domain that's real, valuable and permanent, rather than yet another fleeting e-commerce, money-trading or "social media" application!

One of the features of GBIF's Data Portal is searching of primary occurrence data via a backbone taxonomy. For example, let's say you're interested in snow leopards and would like to plot all current and historical occurrences of this elusive cat on a world map. Let's further say that Richard Attenborough suggested to you that the snow leopard's scientific name is "Panthera uncia". You would ask the Data Portal for all records about Panthera uncia and expect to see all occurrences of snow leopards. Unfortunately biologists don't agree on how to classify the snow leopard - some argue that it belongs in the genus Panthera, while others argue that it should have its own genus, Uncia - and naturally the GBIF network has records under both names. You would just like to see all of those records and never mind the details - and that's just the tip of the iceberg when it comes to building a backbone taxonomy to match the 260 million+ occurrence records in the GBIF network.

Indeed, the backbone taxonomy (we call it our "Nub Taxonomy") in use by the current Data Portal has been one of its biggest sources of criticism - it doesn't cover enough of the names in our occurrence records, and it doesn't handle the tricky cases (as above) as well as it should. One of the reasons is that the current backbone taxonomy was built from the Catalogue of Life 2007 and a similar vintage of the International Plant Names Index (IPNI), and then augmented with the classifications from any unmatched occurrence records. This has led to a classification hierarchy that is less reliable than we (and the GBIF network) would like.

Markus Döring is the GBIF software team's taxonomy expert, and he has employed a new strategy for an improved Nub Taxonomy: building it exclusively on well-known and respected taxonomies already out there - things like the most recent Catalogue of Life, IPNI and more - but without using the classifications given in the occurrence data. After the Nub Taxonomy is built, the occurrence records then need to be matched to it. As the first step to integrating the new Nub Taxonomy into the Data Portal, my job in the last little while has been to build a searchable index of all the names in our Nub Taxonomy, and a web service that can accept a scientific name (from an occurrence record) and match it against the index, while understanding the implications of homonyms and synonyms, as well as tolerating misspellings. And of course, make it fast :)

Since what we're talking about here is string matching with a tolerance for messy input (e.g. spelling mistakes, various violations of nomenclatural rules), the place to start is Lucene. Our Nub Taxonomy has about 8 million unique names, and our 260 million occurrence records also contain roughly 8 million unique names. Our use case is somewhat out of the ordinary for Lucene in that we build the index once and after that it is read-only until the next update of our Nub Taxonomy (e.g. to reflect an update of the Catalogue of Life), and it only takes a few minutes to build, so persistence isn't important. That means we can optimise for search speed and not worry so much about indexing performance. Lucene has just the index storage implementation for this need - RAMDirectory. For the most part this worked just fine, but no matter how hard I hit the index, I couldn't get CPU usage to 100% - the best I could do was about 80%. I found that very irksome and spent some time testing different Directory implementations, web service stacks, and everything in between. None of the other Directory implementations (all file based in some way) showed any improvement, nor did eliminating the web stack. Finally, by attaching a profiler to the Tomcat instance running the web service while using RAMDirectory, we were able to see thread blocking increase in proportion to the number of requesting threads. That led us to the Lucene source code, where we found a synchronized() block that we deemed the culprit. With the cause at least found, I decided not to waste time trying to fix the problem for what would be a nominal gain, but instead to run two Tomcat installations and load balance between them with Apache. With the Tomcats running on quite powerful machines we now see approximately 1000 lookups/sec (including a bunch of business logic beyond the Lucene lookup), which we think is pretty good, and sufficient for our purposes.
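
For readers who haven't used Lucene with an in-memory index before, the sketch below shows the basic pattern - build a RAMDirectory once, then serve fuzzy lookups against it. It is a minimal illustration assuming the Lucene 3.x API of the time, not our actual indexing code, and the field names are invented:

    // Minimal sketch of an in-memory Lucene name index with fuzzy lookup
    // (Lucene 3.x API assumed; field names are invented for illustration).
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class NubNameIndex {
      public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();  // the whole index lives in memory
        IndexWriterConfig cfg = new IndexWriterConfig(
            Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
        IndexWriter writer = new IndexWriter(dir, cfg);

        Document doc = new Document();
        doc.add(new Field("canonical", "Panthera uncia", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("nubKey", "12345", Field.Store.YES, Field.Index.NO));
        writer.addDocument(doc);
        writer.close();  // from here on the index is read-only

        IndexSearcher searcher = new IndexSearcher(dir);
        // FuzzyQuery tolerates small misspellings such as "Panthera unica"
        TopDocs hits = searcher.search(new FuzzyQuery(new Term("canonical", "Panthera unica")), 10);
        System.out.println("matches: " + hits.totalHits);
        searcher.close();
      }
    }

In production there is of course more to it (homonym and synonym handling plus the surrounding business logic), but this read-only, in-memory pattern is the heart of why the lookups are fast.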

This is all being used from within our Oozie-orchestrated Hive/Hadoop workflow (which Lars will talk more about soon), but once we're confident that it's behaving properly and stably we will also offer this web service (or something similar) for public consumption. More importantly, the new Nub Taxonomy will be available in the GBIF Data Portal very soon, and with it we hope to have eliminated most of the problems people have found with our current backbone taxonomy.

Friday, 15 April 2011

The first drafts of the Data Publishing Manuals are available for feedback

Since Darwin Core was officially ratified by Biodiversity Information Standards (TDWG) in November 2009, a few tools have been developed by GBIFS to leverage the standard data format, the Darwin Core Archive, to facilitate data mobilisation. These tools include the Darwin Core Archive Assistant, the GBIF Spreadsheet Processor and several validators that users can use to produce standard-compliant files for data exchange or publishing purposes. The IPT has also recently been upgraded to version 2, fully supporting the publishing of metadata, occurrence data and taxonomic data using Darwin Core Archives.

Accompanying these development efforts, a suite of documents has also been prepared to instruct users not only in the usage of the individual software tools, but in how to make data available through the GBIF Network. Given the range of tool options in the biodiversity information world, we organised these materials according to the kind of content users want to publish, and present a document map for users to follow. So, if you go to the Informatics section of the GBIF web site, there are pages called "publishing" under "Discovery/Metadata", "Primary Data" and "Name Services". Maps there are ready to guide you through publishing your data. Every node in the maps is clickable and leads to the individual manuals.

The intention of using a map as a guide is to suggest a route by which users can gain a basic understanding of data publishing before they start working with the software tools, so that readers are neither handed a pile of documents without knowing where to start, nor put off by the prospect of reading them all before beginning. We also try to keep each manual as compact as possible, with an emphasis on steps rather than theory.

In addition to users with a biodiversity background, we'd like to invite developers to evaluate these draft materials too. Any comments are welcome, especially on whether these manuals help in explaining the data publishing workflow to the users you serve.

Wednesday, 13 April 2011

Can IPT2 handle big datasets now?

One of IPT1's most serious problems was its inability to handle large datasets. For example, a dataset with only half a million records (relatively small compared to some of the biggest in the GBIF network) caused the application to slow down to such a degree that even the most patient users were throwing their hands up in dismay.

Anyway, I wanted to see for myself whether the IPT's problems with large datasets have been overcome in the newest version: IPT2.

Here's what I did to run the test: First, I connected to a MySQL database and used a "select * from … limit …" query to define my source data, totalling 24 million records (the same number of records as a large dataset coming from Sweden). Next, I mapped 17 columns to Darwin Core occurrence terms, and once this was done I was able to start the publication of a Darwin Core Archive (DwC-A). The publication took just under 50 minutes to finish, processing approximately 500,000 records per minute. Take a look at the screenshot below, taken after the successful publication. It is important to note that this test was run on a Tomcat server with only 256MB of memory. In fact, special care was taken during the design of IPT2 to ensure it could still run on older hardware without a lot of memory - which is one of the reasons why IPT2 is not as feature-rich as IPT1 was.


So just how does IPT2 handle 24 million records coming from a database while running on a system with so little memory? The answer is that instead of loading all records at once, they are retrieved in small result sets of about 1,000 records each, which are streamed straight to disk as they arrive. The final DwC-A generated was 3.61GB in size, so some disk space is obviously needed too.
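
As a rough illustration of the streaming idea (a simplified sketch, not the IPT's actual code; the connection details, table and column names are made up), the export can page through the source query in small batches and append each batch to the archive file, so memory use stays flat regardless of dataset size:

    // Simplified sketch of streaming a large source query to disk in small batches.
    // Connection URL, credentials, table and column names are placeholders.
    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class StreamingExport {
      public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
            "jdbc:mysql://localhost/specimens", "user", "password");
        BufferedWriter out = new BufferedWriter(new FileWriter("occurrence.txt"));
        int pageSize = 1000;
        for (int offset = 0; ; offset += pageSize) {
          PreparedStatement ps = con.prepareStatement(
              "SELECT id, scientific_name, event_date FROM occurrence LIMIT ? OFFSET ?");
          ps.setInt(1, pageSize);
          ps.setInt(2, offset);
          ResultSet rs = ps.executeQuery();
          int rows = 0;
          while (rs.next()) {
            out.write(rs.getString(1) + "\t" + rs.getString(2) + "\t" + rs.getString(3));
            out.newLine();
            rows++;
          }
          rs.close();
          ps.close();
          if (rows < pageSize) {
            break;  // last page reached
          }
        }
        out.close();
        con.close();
      }
    }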

In conclusion, I feel that IPT2 has successfully overcome its previous problems handling large datasets, and I hope other adopters will now give it a shot themselves.

Monday, 11 April 2011

The GBIF Development Team

Recently the GBIF development group have been asked to communicate more about the work being carried out in the secretariat.  To quote one message:
"IMHO, simply making all these discussions public via a basic mailing list could help people like me ... have a better awareness of what's going on... We could add our comments / identify possible drawbacks / make some "scalability tests"... In fact I'm really eager to participate to this process" (developer in Belgium)
To kick things off, we plan to make better use of this blog and have set a target of posting 2-3 times a week.  This is a technical blog, so the anticipated audience includes developers, database administrators and those interested in following the details of GBIF software development.  We have always welcomed external contributors to this blog, and we invite any developers working on publishing content through the GBIF network, or developing tools that make use of content discoverable and accessible through GBIF, to write posts.

Today we are pleased to welcome Jan Legind to the team. He will be working as a data administrator to help improve the frequency of the network crawling (harvesting) and indexing processes, and will work closely with data publishers to help improve the quality and quantity of the content accessible through GBIF.

The GBIF development group has expanded in the past 6 months, so I'll introduce the whole team working in the secretariat and contracted to GBIF:

• Developers (in order of appearance in the team): Kyle Braak, José Cuadra, Markus Döring (contracted in Germany), Daniel Amariles & Hectór Tobón (contracted at CIAT in Colombia), Federico Méndez, Lars Francke and Oliver Meyn
• Systems architect: Tim Robertson
• Systems analyst: Andrea Hahn
• Informatics liaison: Burke (Chih-Jen) Ko
• Systems admins: Ciprian Vizitiu & Andrei Cenja
• Data administrator: Jan Legind

The current focus of work at GBIF includes the following major activities:
• Developing and rolling out the Integrated Publishing Toolkit.
• Integrating the checklist (taxonomic, nomenclatural and thematic) content into the current Data portal.
• Developing a processing framework to automate the steps needed to apply quality control and index content for discovery through the Data portal, specifically:
  • Reducing the time taken and the complexity involved in initiating a rollover of the content behind the index
  • Reworking all quality control (geographic, taxonomic and temporal)
  • Automating the process
• Initiating a redesign of the data portal user interface to provide richer discovery and integration across dataset metadata, checklists and primary biodiversity data.
• Reducing the time between publishing content onto the network and its discovery through the Data portal. This includes providing specific support to those who are experiencing problems with large datasets in particular, and assisting in migration to the Darwin Core Archive format.
• Technical and user documentation of the publishing options available.
Let the blogging begin.

[Please use the #gbif hashtag on Twitter]