Wednesday 27 April 2011

OAI-PMH Harvesting at GBIF

GBIF has been my first experience in the bioinformatics world, and my first assignment was developing an OAI-PMH harvester. This post introduces the OAI-PMH protocol and explains how we gather XML documents from different sources. In the next post I'll give an introduction to the index that we have built using those documents.


The main goal of this project was to develop the infrastructure needed across the GBIF network to support the management and delivery of metadata, so that potential end users can discover which datasets are available and evaluate the appropriateness of those datasets for particular purposes. In the GBIF context, resources are datasets, loosely defined as collections of related data, the granularity of which is determined by the data custodian/provider.



OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) is a platform-independent framework for both metadata publishers and metadata consumers. The most important concepts of this protocol are:
• Metadata: information on the ‘who, what, where, when and how’ of a resource. For the producer, metadata document the data in order to inform users of their characteristics; for the consumer, metadata are used both to discover data and to assess their appropriateness for particular needs (‘fitness for purpose’).
• Repository: an accessible server that is able to process the protocol verbs.
• Unique identifier: an unambiguous identifier of an item (document/record) inside the repository.
• Record: metadata expressed in a specific format.
• Metadata-prefix: specifies the metadata format in OAI-PMH requests issued to the repository (EML 2.1.0, Dublin Core, etc.).
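
To make these concepts concrete, here is a minimal sketch of how OAI-PMH requests are formed: the verb, the unique identifier and the metadata prefix all travel as query parameters to the repository. The base URL, the record identifier and the "eml" prefix are assumptions for illustration (prefix names are defined by each repository); "oai_dc" is the Dublin Core prefix defined by the protocol.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "http://example.org/oai"


def build_request(verb, **arguments):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    query.update(arguments)
    return BASE_URL + "?" + urlencode(query)


# List the headers of all records available in a (hypothetical) "eml" format.
list_url = build_request("ListIdentifiers", metadataPrefix="eml")

# Fetch a single record, by its unique identifier, as Dublin Core.
get_url = build_request(
    "GetRecord",
    identifier="oai:example.org:dataset-1",  # hypothetical identifier
    metadataPrefix="oai_dc",
)

print(list_url)
print(get_url)
```

The same record can be requested under different metadata prefixes, which is how one repository serves several formats for the same item.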

GBIF-Metadata Network Topology

The metadata catalogue will primarily be used as the central catalogue in the GBIF Data Portal for the global GBIF network, which, in turn, will broker information to wider initiatives such as EuroGEOSS, OBIS, etc. Such initiatives are essentially OAI-PMH service providers that will be contacted by the GBIF metadata harvester.

The GBIF metadata catalogue service undertakes both harvesting and serving roles: aggregating metadata from other OAI-PMH repositories and serving metadata via OAI-PMH to other harvesting services. The harvested metadata are stored in a local file system. The system can apply an XSLT transformation to create a new document based on the content of an existing one (e.g., transforming an EML document into an ISO 19139 one).

OAI-PMH Harvester
The harvester is a standalone Java application. It makes extensive use of the open source project “OAIHarvester2”, which supports OAI-PMH v1.1 and v2.0. The source code of that project was not modified but extended to handle the harvested XML payload. The payload is delivered as a single file of aggregated XML documents (one per metadata resource).
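
As a rough illustration of what handling the payload involves, the sketch below splits an aggregated response into one metadata document per record. It is written in Python for brevity and is not the harvester's actual Java code. It assumes the payload looks like a standard OAI-PMH 2.0 ListRecords response; the sample identifiers, the `dataset` element and its namespace are made up.

```python
import xml.etree.ElementTree as ET

OAI_URI = "http://www.openarchives.org/OAI/2.0/"
OAI_NS = {"oai": OAI_URI}


def split_records(payload):
    """Map each record identifier to its metadata document, serialized as a string."""
    root = ET.fromstring(payload)
    documents = {}
    for record in root.iter("{%s}record" % OAI_URI):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
        metadata = record.find("oai:metadata", OAI_NS)
        # Skip records without a metadata document (e.g. deleted records).
        if identifier is None or metadata is None or len(metadata) == 0:
            continue
        # The first child of <metadata> is the document in the requested format.
        documents[identifier] = ET.tostring(metadata[0], encoding="unicode")
    return documents


# Made-up sample payload with two records.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata><dataset xmlns="http://example.org/meta"><title>Birds</title></dataset></metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
      <metadata><dataset xmlns="http://example.org/meta"><title>Plants</title></dataset></metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

docs = split_records(SAMPLE)
print(sorted(docs))
```

Each value in `docs` could then be written to its own file in the local file store, keyed by the record identifier.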
The serving component was implemented by modifying the OAICat (http://www.oclc.org/research/activities/oaicat/default.htm) web application. The main changes, made to achieve specific objectives, are:

• Dynamic loading of the file store. The default behaviour of the server is to load the file list at server start-up. Since the harvester can modify the file store, the server now reloads the file list every time a ListIdentifiers or ListRecords verb is requested.
• Support for multiple XSL transformations per input format. The reference implementation supports only one transformation. In our implementation an input document can be published in multiple formats; for example, an EML document can be published as Dublin Core and DIF, provided that an XSL transformation is configured for each output format.
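
The two changes can be sketched roughly as follows. This is an illustration in Python, not the modified OAICat code: the `FileStore` class, the `TRANSFORMS` table, the stylesheet file names and the ".xml" naming convention are all assumptions made up for the example.

```python
import os
import tempfile

# Hypothetical configuration: each input format maps to the output formats
# for which a stylesheet is configured (file names are illustrative).
TRANSFORMS = {
    "eml": {"oai_dc": "eml2dc.xsl", "dif": "eml2dif.xsl"},
}


def available_formats(input_format):
    """Return the input format plus every output format with a configured stylesheet."""
    return [input_format] + sorted(TRANSFORMS.get(input_format, {}))


class FileStore:
    """A directory of harvested documents, one '<identifier>.xml' file per record."""

    def __init__(self, directory):
        self.directory = directory

    def list_identifiers(self):
        # Re-scan the directory on every call instead of caching at start-up,
        # so files added by the harvester show up without a server restart.
        return sorted(
            name[:-4] for name in os.listdir(self.directory) if name.endswith(".xml")
        )


with tempfile.TemporaryDirectory() as tmp:
    store = FileStore(tmp)
    before = store.list_identifiers()
    # Simulate the harvester dropping a new document into the store.
    open(os.path.join(tmp, "dataset-1.xml"), "w").close()
    after = store.list_identifiers()

print(before, after)
print(available_formats("eml"))
```

Re-scanning on every request trades some I/O for freshness, which seems a reasonable choice when the harvester and the server share the same file store.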

More detail about this project is available at the Google Code project site: http://code.google.com/p/gbif-metadata/. In a future post I’ll explain how the information gathered by the harvester was used to build a search index using Solr, and how a web application uses that index to let end users search the metadata.
