Thursday 11 June 2015

Simplified Downloads

Since its re-launch in 2013 gbif.org has supported the downloading of occurrence data using an arbitrary query with the download file provided as a Darwin Core Archive file whose internal content is described here. This format contains comprehensive and self-explanatory information, which makes it suitable to be referenced in external resources. However, in cases where people only need the occurrence data in its simplest form the DwC-A format presents an additional complexity that can make it hard to use the data. Because of that we now support a new download format: a zip file that only contains a single file with the most common fields/terms used, where each column is separated by the TAB character. This makes things much easier when it comes to importing the data into tools such as Microsoft Excel, geographic information systems and relational databases. The current download functionality was extended to allow the selection of the desired format:

From this point the functionality remains the same: eventually you will receive an email containing a hyperlink where the file can be downloaded.

Technical Architecture

The simplified download format was implemented following the technical requirement that new formats should be supported in the near future with minimal impact to the formats supported at a specific moment. In general, occurrence downloads are implemented using two different sets of technologies depending on the estimated size of the download in number of records; a threshold of 200,000 records is set to define when a download is small (< 200K) and big (>200K), where history shows a vast majority of “small” downloads. The following chart summarizes the key technologies that enables occurrence downloads:

Download workflow

Occurrence downloads are automated using a workflow engine called Oozie, it coordinates the required steps to produce a single download file. In summary the workflow proceeds as follows:
  1. Initially, Apache Solr is contacted to determine the number of records that the download file will contain.
  2. Big or small?
    1.  If the amount of records is less than 200,000 (it is small download), Apache Solr is queried to iterate over the results; the detail of each occurrence record is fetched from HBase since it’s the official storage of occurrence records. Individual downloads are produced by a multi-threaded application implemented using the Akka framework; the Apache Zookeeper and Curator frameworks are used to limit the amount of threads that can be running at the same time (it avoids a thread explosion in the machines that run the download workflow).
    2. If the amount of records is greater than 200,000 (it is a big download), Apache Hive is used to retrieve the occurrence data from an HDFS table. To avoid overloading of HBase we create that HDFS table as a daily snapshot of the occurrence data stored in HBase.
  3. Finally the occurrence records are collected and organized in the requested output format (DwC-A or Simple).
Note: the details of how this is implemented can be consulted in the Github project: https://github.com/gbif/occurrence/tree/master/occurrence-download.

Conclusion

Reducing both the number of columns and the size (number of bytes) in our downloads has been one of our most requested features, and we hope this makes using the GBIF data easier for everyone.