Friday, 17 July 2009

Darwin Core Archive Reader, part 1

After writing software that produces Darwin Core Archives (DwC-A), I thought it was time to introduce a little client library that reads DwC-A and makes them very simple to consume. There are a couple of resources available for creating DwC-A:
  1. The Darwin Core Text Guidelines, pretty much the specification
  2. The ECAT Checklist Format, which gives some best practices and focuses on how to encode taxonomic data as DwC-A
  3. The IPT produces DwC-A and has a rich web interface but is limited to ~1 million records
  4. The IPT-lite produces large DwC-A datasets very fast, but lacks any visualisation of the data itself. It's a great way of quickly creating DwC-A and hosting it online
The DwC Archive reader library is under active development and currently being used for indexing checklists. The project is hosted as part of the GBIF Indexing and Harvesting Toolkit, where a jar with all dependencies is hosted for download too. You can use this jar in the terminal to inspect archives like this:
$ java -jar DwcA-reader-1.0-SNAPSHOT.jar hershkovitz.txt

Opening archive: /Users/markus/Desktop/hershkovitz/hershkovitz.txt
Core file(s) found: [/Users/markus/Desktop/hershkovitz/hershkovitz.txt]
Core row type:
Core identifier column: 0
Cannot locate term dwc:kingdom
Cannot locate term dwc:family
Number of extensions 0
Archive contains 3249 core records.
The reader can handle a simple CSV or tab file, or you can point it at a DwC archive folder with a meta.xml descriptor and several data files. The iterator(s) allow you to walk the core records of the archive while conveniently retrieving all extension records at the same time. You can also tell it to show some core records by passing a limit and offset like this from the command line:
$ java -jar DwcA-reader-1.0-SNAPSHOT.jar hershkovitz 10 25
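For the archive-folder case, the meta.xml descriptor tells the reader how to interpret the data files. A rough sketch of what a minimal descriptor can look like is shown below; the file name taxa.txt, the column layout and the chosen rowType are made-up placeholders, so check the Darwin Core Text Guidelines for the authoritative element and attribute names:

```xml
<archive xmlns="http://rs.tdwg.org/dwc/text/">
 <!-- the single core file of the star scheme -->
 <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n"
       ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Taxon">
  <files>
   <location>taxa.txt</location>
  </files>
  <!-- column 0 holds the record identifier linking extensions to the core -->
  <id index="0"/>
  <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  <field index="2" term="http://rs.tdwg.org/dwc/terms/kingdom"/>
 </core>
</archive>
```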
A simple dummy source code example showing how to use this reader in your code:
package org.gbif.dwc.text;

import java.io.File;
import java.io.IOException;
import java.util.Iterator;

import org.gbif.dwc.model.DarwinCoreRecord;
import org.gbif.dwc.model.ExtensionRecord;
import org.gbif.dwc.model.StarRecord;
import org.gbif.dwc.terms.DwcTerm;

public class UsageExample {

 public static void main(String[] args) throws IOException, UnsupportedArchiveException {
  // opens csv files with headers or dwc-a directories with a meta.xml descriptor
  Archive arch = ArchiveFactory.openArchive(new File("pontaurus.txt"));

  // does scientific name exist?
  if (!arch.getCore().hasTerm(DwcTerm.scientificName)) {
   System.out.println("This application requires dwc-a with scientific names");
   return;
  }

  // loop over core darwin core records
  Iterator<DarwinCoreRecord> iter = arch.iteratorDwc();
  DarwinCoreRecord dwc;
  while (iter.hasNext()) {
   dwc = iter.next();
   System.out.println(dwc.getScientificName());
  }

  // loop over star records, i.e. core with all linked extension records
  for (StarRecord rec : arch) {
   // print the scientific name
   System.out.println(" - " + rec.value(DwcTerm.scientificName));
   for (ExtensionRecord erec : rec) {
    // does this extension have long/lat?
    if (erec.dataFile().hasTerm(DwcTerm.decimalLongitude) && erec.dataFile().hasTerm(DwcTerm.decimalLatitude)) {
     System.out.println("Georeferenced: " + erec.value(DwcTerm.decimalLongitude) + "," + erec.value(DwcTerm.decimalLatitude));
    }
   }
  }
 }
}


  1. Just wondering why it closes with
    Looks pretty cool tho.

  2. When trying to read an archive I received the following error:

    UnsupportedArchiveException: org.dom4j.DocumentException: Error on line 2 of document: Content is not allowed in prolog.

    Investigating, I discovered that I was mistakenly trying to read an archive with a meta.xml file that specified it was encoded in UTF-8, but in reality was encoded using 'UTF-16 Little Endian'. The reason the error was being thrown is that the byte order mark declaring the 'endianness', that is whether it is Little Endian or Big Endian, appears at the beginning of the text file and disrupts the XML parser.

    To solve the problem, the same meta.xml file was resaved as UTF-8 and the archive read anew.
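    This kind of encoding mismatch can be sniffed out before handing the file to an XML parser. A small sketch (the class and method names here are my own, not part of the reader library) that inspects the first bytes of a file for a byte order mark:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BomCheck {

  // Returns the charset implied by a leading byte order mark, or null if there is none.
  public static String detectBom(byte[] b) {
    if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
      return "UTF-8";
    }
    if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
      return "UTF-16LE";
    }
    if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
      return "UTF-16BE";
    }
    return null;
  }

  public static void main(String[] args) throws IOException {
    // read just the first three bytes of the file given as argument
    byte[] head = new byte[3];
    try (InputStream in = new FileInputStream(args[0])) {
      in.read(head);
    }
    String bom = detectBom(head);
    System.out.println(bom == null ? "no BOM found" : "BOM found: " + bom);
  }
}
```

    A UTF-16LE meta.xml would report its BOM here, flagging the mismatch with the declared UTF-8 encoding before the parser ever sees the file.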