Thursday 24 October 2013

GBIF Backbone in GitHub

For a long time I have wanted to experiment with using GitHub as a tool to browse and manage the GBIF backbone taxonomy. Encouraged by similar sentiments from Rod Page, it would be nice to use git to keep track of versions and to allow external parties to fork parts of the taxonomic tree and push back changes if desired. To top it off, there is the great GitHub Treeslider for browsing the taxonomy, so why not give it a try?

A GitHub filesystem taxonomy

I decided to export each taxon in the backbone as a folder, named after its canonical name, containing two files:

  1. README.md, a simple markdown file that gets rendered by GitHub and shows the basic attributes of the taxon
  2. data.json, a complete JSON representation of the taxon as it is exposed via the new GBIF species API
The filesystem represents the taxonomic classification, with taxon folders nested accordingly. For example, the species Amanita arctica is represented like this (an illustrative sketch; the intermediate folders follow its backbone classification):
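    life/
      Fungi/
        Basidiomycota/
          Agaricomycetes/
            Agaricales/
              Amanitaceae/
                Amanita/
                  Amanita arctica/
                    README.md
                    data.json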

This is just a first experimental step. The README could be improved a lot to render more content in a human-friendly way, and more data, such as common names and synonyms, could be included in the JSON file.

Getting data into GitHub

It didn't take much to write a small NubGitExporter.java class that exports the GBIF backbone into the filesystem as described above. The export of the entire taxonomy, with its current 4.4 million taxa including synonyms, took about one hour on a MacBook Pro laptop.
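The core of the export is straightforward. Below is a minimal sketch of the per-taxon write step, assuming the taxon's name and the pre-rendered file contents are already at hand (the class and method names are hypothetical, not the actual NubGitExporter code):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class TaxonFolderWriter {

      // Writes one taxon as a folder below its parent taxon's folder,
      // containing the two files README.md and data.json.
      void writeTaxon(Path parentDir, String canonicalName,
                      String readmeMarkdown, String jsonData) throws IOException {
        Path dir = parentDir.resolve(canonicalName);
        Files.createDirectories(dir);
        Files.write(dir.resolve("README.md"), readmeMarkdown.getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("data.json"), jsonData.getBytes(StandardCharsets.UTF_8));
      }
    }

Recursing depth-first over the backbone classification and calling this once per taxon yields the tree shown above.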
One hour seemed not bad, but then I tried to add the generated files to git, and that is when I started to doubt. After waiting half a day for git to add the files to my local index, I killed the process and started over with only the smaller kingdoms, excluding animals and plants. That still left about 335,000 folders and 670,000 files to be added to git. Adding these to my local index took several hours; committing and finally pushing them to the GitHub server took yet another two hours.
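For the record, this was nothing fancier than the standard git workflow, something like the following (the commit message is made up):

    git add .
    git commit -m "export GBIF backbone"
    git push origin master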

Delta compression using up to 8 threads.
Compressing objects: 100% (1010487/1010487), done.
Writing objects: 100% (1010494/1010494), 173.51 MiB | 461 KiB/s, done.
Total 1010494 (delta 405506), reused 0 (delta 0)
To https://github.com/mdoering/backbone.git

After those files were in the index, committing a simple change to the main README file took 15 minutes. Although I like the general idea and the pretty user interface, I fear GitHub, and even git itself, are not made to host a repository of millions of files and folders.

First GitHub impressions

Browsing taxa in GitHub is surprisingly responsive. The fungus genus Amanita contains 746 species, but it loads very quickly. In that regard the GitHub browser is much nicer to use than the taxonomy browser on the new GBIF species pages, which of course shows much more information. The rendered README file is not ideally placed, as it sits at the very bottom of the page, but showing information to humans that way is nice. Markdown could also be parsed by machines quite easily if we adopt a simple format: for example, for every property create a heading with that name and put the content into the following paragraph(s).
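Such a machine-parseable README could look something like this (a purely hypothetical sketch, not what the exporter currently writes):

    # Amanita arctica

    ## Rank

    species

    ## Status

    accepted

A parser would then read each second-level heading as a property name and the following paragraph(s) as its value.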

The Amanita example also reveals a bug in the exporter class in dealing with synonyms (the Amanita README contains the synonym information) and with infraspecific taxa. For example, Amanita muscaria contains some weird form information that is erroneously mapped to the species. This obviously should be fixed.

The GitHub browser sorts all files alphabetically. When ranks are mixed (we skip intermediate unknown ranks in the backbone), for example in the Fungi kingdom, sorting by rank first would be desirable. We could enable this by naming the taxon folders accordingly, prefixing them with a marker that sorts alphabetically in rank order.
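For example, with (hypothetical) numeric rank prefixes, the children of Fungi, which mix phyla with directly attached genera such as Agonium, would sort by rank before name:

    1_phylum_Ascomycota/
    1_phylum_Basidiomycota/
    5_genus_Agonium/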

I have not had time yet to try versioning branches of the tree and see how usable that is. I suspect git performance will be really slow, but that might not be a blocker if we only version larger groups and rarely push and pull.

9 comments:

  1. Nice! GitHub search on the repository is also really fast: https://github.com/mdoering/backbone/search?q=amanita

  2. Certainly a really good idea, but as you say GitHub / git may not be up to it:

    git clone https://github.com/mdoering/backbone.git
    Cloning into 'backbone'...
    remote: Counting objects: 1013692, done.
    remote: Compressing objects: 100% (607519/607519), done.
    remote: fatal: unable to read e053f6784afd230a23a36f4a180b621ce34b41e6
    remote: aborting due to possible repository corruption on the remote side.
    fatal: early EOF
    fatal: index-pack failed

    Trying again ;)

  3. The JSON is also nice to use, especially for client apps: https://raw.github.com/mdoering/backbone/master/life/Fungi/Agonium/data.json
    The JSON data should probably know about the file paths so that one can navigate the tree.
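    For example, with (hypothetical) path fields added by the exporter:

      {
        "canonicalName": "Agonium",
        "rank": "GENUS",
        "path": "life/Fungi/Agonium",
        "parentPath": "life/Fungi"
      }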

  4. I have done a similar thing, trying to create folders for taxa and syncing them using Dropbox. While I got it to work for the Catalogue of Life, I found the inode count on the filesystem was the limitation. I concluded a git/svn-type protocol backed by a database was likely to be a better solution.

  5. Wonderful! Here is a nice article on using GitHub as a CDN, http://code.lancepollard.com/github-as-a-cdn/, covering access to files via the file layer or the application layer. I think this is an interesting idea, and I could see hooks keeping a database on our end synced with this GitHub repo in our mongo/elasticsearch structure, especially since it is already in JSON format. Here are some other topics that might be worth looking over so this idea is not killed by GitHub:
    https://help.github.com/articles/what-is-my-disk-quota
    https://help.github.com/articles/post-receive-hooks
    https://help.github.com/articles/distributing-large-binaries
    https://help.github.com/articles/what-are-the-limits-for-viewing-content-and-diffs-in-my-repository

  6. Added plants today, but only the README and no JSON. Animals will take some days...

  7. What if you considered breaking this out into higher taxonomy plus specific areas, like the Catalogue of Life does with the GSDs? A sub-GitHub project per area. After all, it will be small parts of the tree that experts work on. That might help alleviate the inode issues...

    Replies
    1. This seems a logical thing to do; it would also make it more manageable for people doing bulk annotations and/or manual checking. There could be a repository for the higher-level classification, then repos for each lower branch (the level might vary among taxa: mammals could be one repo, insects would need to be split up).

      A challenge for the GitHub model is the dependencies between taxa. The nested-folders approach captures the hierarchy nicely, but not things like synonymy, although that could be stored as lists of names associated with each node (folder). Compare this with storing Darwin Core Archive files for occurrence data (see http://iphylo.blogspot.co.uk/2013/11/annotating-and-cleaning-gbif-data.html). Each row in an occurrence data set is basically independent, so we are simply editing a table of data (and could use cool tools like Open Refine).

      Another question is whether to edit the GBIF backbone or the individual checklists (each of which could be in GitHub), so that the original providers can get access to the edits. This would work if the process of going from checklists to backbone was automated and reproducible, and if it were easy to identify the reason for any relationship in the backbone. In other words, if the backbone duplicates species because one checklist lacks synonym data, we could edit that checklist and fix the problem (almost) at source.

      Alternatively, GBIF could maintain a "fix" file that adds missing synonym relationships. This file, plus the other checklists, could generate the backbone. This is maybe closer to what the Open Tree of Life folks are doing with "OTT", where they add patches to a large text document.

  8. I'm getting a bit nervous about the idea of spreading the backbone across many repositories and not being able to track changes nicely when taxa move between them. For fungi, chromista, bacteria, algae and even protista, for example, this is very likely.

    It might be better to follow our old approach: build on published checklists, accept comments that make it into a patch file, and feed those and unfound names back to the sources. We actually already have a simple checklist of internal "patches" to the backbone, which ranks high on the priority list of backbone sources: https://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-checklists/provisional_new_names.txt

    It was just never used externally, and not much internally either. How about moving this file into GitHub and allowing external edits? Or, even better, providing a folder with various patch files that all get applied, which would keep each file's contents restricted to an individual taxonomic group? That is pretty much what OTT does, and it appears to work fine for them.

    ReplyDelete