Monday 28 October 2013

The new (real-time) GBIF Registry has gone live

For the last 4 years, GBIF has operated the GBRDS registry with its own web application at http://gbrds.gbif.org.  Previously, when a dataset was registered in the GBRDS registry (for example using an IPT), it was not visible in the portal until the next rollover took place, which could be several weeks later.

In October, GBIF launched its new portal at www.gbif.org.  During the launch we indicated that real-time data management would start in November.  We are excited to inform you that today we took the first step towards making this a reality, by enabling live operation of the new GBIF registry.
What does this mean for you?
  • any dataset registered through GBIF (using an IPT, web services, or manually by liaison with the Secretariat) will be visible in the portal immediately because the portal and new registry are fully integrated 
  • the GBRDS web application (http://gbrds.gbif.org) is no longer visible, since the new portal displays all the appropriate information
  • the GBRDS sandbox registry web application (http://gbrdsdev.gbif.org) is no longer visible, but a new registry sandbox has been set up to support IPT installations running in test mode
Please note that the new registry supports the same web service API that the GBRDS previously offered, so existing tools and services built on the GBRDS API (such as the IPT) will continue to work uninterrupted.
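In practice that means a client that talks to the old web service endpoints needs no code changes at all. A minimal sketch in Java of such a client (the endpoint path below is a hypothetical placeholder, not a documented one; consult the registry documentation for the actual endpoints):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch only: the URL path is a hypothetical placeholder for one of the
// GBRDS-style web service endpoints, which keep responding as before.
public class RegistryClientSketch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://gbrds.gbif.org/registry/resource.json");  // hypothetical path
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);  // raw response, exactly as before the switch
        }
        in.close();
    }
}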
As you may have noticed, occurrence data crawling has been temporarily suspended since the middle of September to prepare for launching real-time data management.  We aim to resume occurrence data crawling in the first week of November, meaning that updates to the index will be visible immediately afterwards.
On behalf of the GBIF development team, I thank you for your patience during this transition period, and hope you are looking forward to real-time data management as much as we are.

Thursday 24 October 2013

GBIF Backbone in GitHub

For a long time I have wanted to experiment with using GitHub as a tool to browse and manage the GBIF backbone taxonomy. Encouraged by similar sentiments from Rod Page, I thought it would be nice to use git to keep track of versions and to allow external parties to fork parts of the taxonomic tree and push back changes if desired. To top it off there is the great GitHub Treeslider to browse the taxonomy, so why not give it a try?

A GitHub filesystem taxonomy

I decided to export each taxon in the backbone as a folder named after its canonical name, containing 2 files:

  1. README.md, a simple Markdown file that is rendered by GitHub and shows the basic attributes of the taxon
  2. data.json, a complete JSON representation of the taxon as it is exposed via the new GBIF species API
The filesystem mirrors the taxonomic classification and taxon folders are nested accordingly; for example, the species Amanita arctica ends up in a nested path roughly like this:
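Fungi/
  Basidiomycota/
    Agaricomycetes/
      Agaricales/
        Amanitaceae/
          Amanita/
            Amanita arctica/
              README.md
              data.json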

This is just a first experimental step. The readme could be improved a lot to render more content in a human-friendly way, and more data could be included in the json file, such as common names and synonyms.
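To make that concrete, today's data.json essentially mirrors the species API record. A trimmed, indicative example (field names modeled loosely on the species API; values are illustrative):

{
  "scientificName": "Amanita arctica",
  "canonicalName": "Amanita arctica",
  "rank": "SPECIES",
  "kingdom": "Fungi",
  "family": "Amanitaceae",
  "genus": "Amanita"
}

Common names and synonyms would then simply become additional arrays in this file.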

Getting data into GitHub

It didn't take much to write a small NubGitExporter.java class that exports the GBIF backbone into the filesystem as described above. The export of the entire taxonomy, with its currently 4.4 million taxa including synonyms, took about one hour on a MacBook Pro laptop.
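The core of the exporter is just a depth-first walk over the taxonomy that writes the two files into each folder. A minimal sketch of the idea, not the actual NubGitExporter code (TaxonService and Taxon are hypothetical stand-ins for the real species API client classes):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch of the export idea; TaxonService and Taxon are hypothetical
// stand-ins for the real GBIF species API client.
public class NubGitExporterSketch {

    interface TaxonService {
        List<Taxon> children(int taxonKey);  // direct children in the backbone
        String json(Taxon taxon);            // full record, written as data.json
        String markdown(Taxon taxon);        // basic attributes, written as README.md
    }

    static class Taxon {
        final int key;
        final String canonicalName;
        Taxon(int key, String canonicalName) {
            this.key = key;
            this.canonicalName = canonicalName;
        }
    }

    private final TaxonService service;

    public NubGitExporterSketch(TaxonService service) {
        this.service = service;
    }

    /** Creates one folder per taxon and recurses into the classification. */
    public void export(Taxon taxon, Path parentDir) throws IOException {
        Path dir = parentDir.resolve(taxon.canonicalName);
        Files.createDirectories(dir);
        Files.write(dir.resolve("README.md"),
                service.markdown(taxon).getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("data.json"),
                service.json(taxon).getBytes(StandardCharsets.UTF_8));
        for (Taxon child : service.children(taxon.key)) {
            export(child, dir);  // children become nested folders
        }
    }
}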
One hour seemed acceptable, but then I tried to add the generated files to git, and that's when I started to have doubts. After waiting half a day for git to add the files to my local index, I decided to kill the process and start by adding only the smaller kingdoms first, excluding animals and plants. That left about 335,000 folders and 670,000 files to be added to git. Adding these to my local git still took several hours, and committing and finally pushing them onto the GitHub server took yet another 2 hours.

Delta compression using up to 8 threads.
Compressing objects: 100% (1010487/1010487), done.
Writing objects: 100% (1010494/1010494), 173.51 MiB | 461 KiB/s, done.
Total 1010494 (delta 405506), reused 0 (delta 0)
To https://github.com/mdoering/backbone.git

After those files were added to the index, committing a simple change to the main README file took 15 minutes. Although I like the general idea and the pretty user interface, I fear that GitHub, and even git itself, are not made to host a repository of millions of files and folders.

First GitHub impressions

Browsing taxa in GitHub is surprisingly responsive. The fungal genus Amanita contains 746 species, but it loads very quickly. In that regard the GitHub tree browser is much nicer to use than the one on the new GBIF species pages, which of course shows much more information. The rendered readme file is not ideal as it sits at the very bottom of the page, but showing information to humans that way is nice - and Markdown could also be parsed by machines quite easily if we adopt a simple format, for example: for every property, create a heading with that name and put the content into the following paragraph(s).
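As a sketch of that convention (the property names are just illustrative), a species readme could look like:

# Amanita arctica

## Rank
species

## Taxonomic status
accepted

## Classification
Fungi > Basidiomycota > Agaricomycetes > Agaricales > Amanitaceae

A machine could then treat every second-level heading as a property name and the paragraph below it as its value.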

The Amanita example also reveals a bug in the exporter class when dealing with synonyms (the Amanita readme contains the synonym information) and with infraspecific taxa. For example, Amanita muscaria contains some weird form information which is erroneously mapped to the species. This obviously should be fixed.

The GitHub browser sorts all files alphabetically. When ranks are mixed (we skip intermediate unknown ranks in the backbone), as can be seen in the Fungi kingdom, sorting by rank first is desirable. We could enable this by naming the taxon folders accordingly, prefixing them with a marker that sorts alphabetically in rank order.
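For example, a zero-padded rank index as prefix (just one possible scheme; the genus name below is a made-up placeholder for a genus attached directly to the kingdom) would make a mixed-rank listing sort sensibly:

02_phylum_Ascomycota
02_phylum_Basidiomycota
02_phylum_Chytridiomycota
06_genus_Placeholderia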

I have not had the time yet to try versioning branches of the tree and see how usable that is. I suspect git performance will be really slow, but that might not be a blocker if we only version larger groups and rarely push & pull.