Thursday 19 January 2012

BioCASe now producing DarwinCore Archives

Guest post from Jörg Holetschek, Botanic Garden and Botanical Museum Berlin-Dahlem.

For years, occurrence data has been shared with GBIF through web services. Data publishers have used one of the existing provider software packages (DiGIR, BioCASe or TAPIR Link) to expose their data as a DiGIR-, BioCASe- or TAPIR-compliant web service. Biodiversity networks such as GBIF used harvesters to crawl and index the records published by these services. This approach works fine for small and medium-sized datasets, but runs into difficulties when record numbers hit the millions: harvesting can take days and puts a heavy load on both the publisher and the crawler.

To overcome this, GBIF recently introduced DarwinCore Archives, which store all the information of a published dataset in a single file. Because GBIF ingests this file directly, the time-consuming back-and-forth communication between data provider and harvester is eliminated, which speeds up the process and reduces the load on both sides. GBIF’s IPT makes it easy to create such DarwinCore Archives and is a good option for providers that have already used the DarwinCore standard in the past or that want to share rather slim observation data.
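
To make the "single file" idea concrete, here is a minimal sketch of what such an archive looks like from the consumer's side: a zip file holding a meta.xml descriptor plus a delimited data file. It builds a tiny archive in memory and reads it back using only the Python standard library. The file name occurrence.txt, the two sample records and the helper names build_archive and read_archive are invented for illustration; this is not the code GBIF or the IPT uses, and a real archive would typically also carry dataset metadata.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"   # namespace of the meta.xml descriptor
DWC_TERMS = "http://rs.tdwg.org/dwc/terms/"    # namespace of the DarwinCore terms

# Descriptor telling a consumer where the data file is and what each column means.
META_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="{DWC_TEXT_NS}">
  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="{DWC_TERMS}Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="{DWC_TERMS}scientificName"/>
    <field index="2" term="{DWC_TERMS}country"/>
  </core>
</archive>
"""

# Two made-up records, purely for illustration.
OCCURRENCE_TXT = (
    "id\tscientificName\tcountry\n"
    "B-100001\tBellis perennis L.\tGermany\n"
    "B-100002\tQuercus robur L.\tGermany\n"
)


def build_archive() -> bytes:
    """Pack the descriptor and the data file into one zip, held in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("meta.xml", META_XML)
        zf.writestr("occurrence.txt", OCCURRENCE_TXT)
    return buf.getvalue()


def read_archive(data: bytes) -> list[dict]:
    """Read the core data file of an archive, guided by its meta.xml."""
    ns = {"d": DWC_TEXT_NS}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        core = ET.fromstring(zf.read("meta.xml")).find("d:core", ns)
        location = core.find("d:files/d:location", ns).text
        # meta.xml stores the delimiter as an escaped string such as "\t".
        raw_delimiter = core.get("fieldsTerminatedBy", "\\t")
        delimiter = raw_delimiter.encode().decode("unicode_escape")
        skip = int(core.get("ignoreHeaderLines", "0"))
        # Map column index -> short term name (last path segment of the term URI).
        columns = {int(core.find("d:id", ns).get("index")): "id"}
        for field in core.findall("d:field", ns):
            columns[int(field.get("index"))] = field.get("term").rsplit("/", 1)[-1]
        text = zf.read(location).decode(core.get("encoding", "UTF-8"))
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))[skip:]
    return [{name: row[i] for i, name in sorted(columns.items())} for row in rows]


if __name__ == "__main__":
    for record in read_archive(build_archive()):
        print(record)
```

The point of the sketch is that a consumer needs a single download and one local pass over the file, rather than many paged requests against a web service.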

However, sixty-two of GBIF’s data publishers are currently using BioCASe. In contrast to DarwinCore, BioCASe and its associated data standard ABCD are targeted mainly at rich data originating from specimens of natural history collections (although ABCD can be used for any type of occurrence data, including observations). Many of the BioCASe data providers also share their data with special interest networks such as GeoCASe, the DNA-Bank Network, or the EDIT Specimen network, all of which rely on BioCASe web services. Switching to the IPT and the associated DarwinCore standard is not an option for them.

For this reason, we decided to extend the BioCASe Provider Software with a feature to create DarwinCore Archives. This allows providers to continue using the rich ABCD schema (or one of its extensions) for the specific networks they’re connected to while using DarwinCore Archives to share their data with GBIF. To combine the richness of ABCD with the efficiency of downloadable archives, we created a hybrid, the so-called ABCD Archive, which can be used instead of the BioCASe web service for harvesting purposes (see figure below).




The first step, creating the ABCD archive, is implemented natively in the Provider Software in Python. The second step, transforming the ABCD archive into one or several DarwinCore archives, is done by Pentaho Data Integration, an open-source library also known as Kettle. In the current version 3.0, the transformation step is a stand-alone command-line application that can be downloaded separately; ultimately, it will be bundled with the Provider Software and integrated into the user interface.
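
The second step is essentially a flattening of nested ABCD records into DarwinCore rows. The sketch below illustrates that idea in plain Python; it is not how the BioCASe transformation works, since that is done with Kettle and covers far more of the schema. It assumes ABCD 2.06-style XML as input. The sample record is made up, and the four-entry FIELD_MAP and the helpers abcd_to_rows and rows_to_tsv are hypothetical, chosen only to show a few plausible ABCD-to-DarwinCore correspondences.

```python
import csv
import io
import xml.etree.ElementTree as ET

ABCD_NS = {"abcd": "http://www.tdwg.org/schemas/abcd/2.06"}

# A single made-up specimen record in simplified ABCD 2.06-style XML.
SAMPLE_ABCD = """<abcd:DataSets xmlns:abcd="http://www.tdwg.org/schemas/abcd/2.06">
  <abcd:DataSet>
    <abcd:Units>
      <abcd:Unit>
        <abcd:SourceInstitutionID>B</abcd:SourceInstitutionID>
        <abcd:SourceID>Herbarium</abcd:SourceID>
        <abcd:UnitID>B-100001</abcd:UnitID>
        <abcd:Identifications>
          <abcd:Identification>
            <abcd:Result><abcd:TaxonIdentified><abcd:ScientificName>
              <abcd:FullScientificNameString>Bellis perennis L.</abcd:FullScientificNameString>
            </abcd:ScientificName></abcd:TaxonIdentified></abcd:Result>
          </abcd:Identification>
        </abcd:Identifications>
      </abcd:Unit>
    </abcd:Units>
  </abcd:DataSet>
</abcd:DataSets>"""

# Hypothetical mapping: DarwinCore term -> path relative to an ABCD Unit.
FIELD_MAP = {
    "institutionCode": "abcd:SourceInstitutionID",
    "collectionCode": "abcd:SourceID",
    "catalogNumber": "abcd:UnitID",
    "scientificName": ".//abcd:FullScientificNameString",
}


def abcd_to_rows(xml_text: str) -> list[dict]:
    """Flatten every ABCD Unit into one flat dict keyed by DarwinCore term."""
    root = ET.fromstring(xml_text)
    rows = []
    for unit in root.iterfind(".//abcd:Unit", ABCD_NS):
        row = {}
        for term, path in FIELD_MAP.items():
            # Missing elements become empty strings instead of raising.
            value = unit.findtext(path, default="", namespaces=ABCD_NS)
            row[term] = value.strip()
        rows.append(row)
    return rows


def rows_to_tsv(rows: list[dict]) -> str:
    """Serialise the flat rows as a tab-separated data file with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(FIELD_MAP), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(rows_to_tsv(abcd_to_rows(SAMPLE_ABCD)))
```

A flat mapping like this necessarily drops whatever has no DarwinCore counterpart, which is why providers keep the ABCD archive alongside the DarwinCore one.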

The latest version of the Provider Software and the DarwinCore Creator can be downloaded from the BioCASe website. Detailed documentation of the new archiving features can be found in the PyWrapper Wiki. The wiki also hosts a sample ABCD archive and a sample DarwinCore archive created by BioCASe.