Thursday 24 October 2013

GBIF Backbone in GitHub

For a long time I have wanted to experiment with using GitHub as a tool to browse and manage the GBIF backbone taxonomy. Encouraged by similar sentiments from Rod Page, it would be nice to use git to keep track of versions and to allow external parties to fork parts of the taxonomic tree and push back changes if desired. To top it off, there is the great GitHub Treeslider for browsing the taxonomy, so why not give it a try?

A GitHub filesystem taxonomy

I decided to export each taxon in the backbone as a folder, named after its canonical name, containing two files:

  1. README.md, a simple markdown file that gets rendered by GitHub and shows the basic attributes of the taxon
  2. data.json, a complete JSON representation of the taxon as it is exposed via the new GBIF species API
The filesystem represents the taxonomic classification, with taxon folders nested accordingly. For example, the species Amanita arctica is represented like this (an illustrative sketch; the intermediate folders follow its backbone classification):
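    life/
      Fungi/
        Basidiomycota/
          Agaricomycetes/
            Agaricales/
              Amanitaceae/
                Amanita/
                  Amanita arctica/
                    README.md
                    data.json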

This is just a first experimental step. The README could be improved a lot to render more content in a human-friendly way, and more data, such as common names and synonyms, could be included in the JSON file.

Getting data into GitHub

It didn't take much to write a small NubGitExporter.java class that exports the GBIF backbone into the filesystem as described above. The export of the entire taxonomy, with its current 4.4 million taxa including synonyms, took about one hour on a MacBook Pro laptop.
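The core of the export is straightforward. Below is a minimal sketch of the per-taxon write step, assuming the taxon's name and the pre-rendered file contents are already at hand (the class and method names are hypothetical, not the actual NubGitExporter code):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class TaxonFolderWriter {

      // Writes one taxon as a folder below its parent taxon's folder,
      // containing the two files README.md and data.json.
      void writeTaxon(Path parentDir, String canonicalName,
                      String readmeMarkdown, String jsonData) throws IOException {
        Path dir = parentDir.resolve(canonicalName);
        Files.createDirectories(dir);
        Files.write(dir.resolve("README.md"), readmeMarkdown.getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("data.json"), jsonData.getBytes(StandardCharsets.UTF_8));
      }
    }

Recursing depth-first over the backbone classification and calling this once per taxon yields the tree shown above.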
One hour seemed not bad, but then I tried to add the generated files to git, and that is when I started to doubt. After waiting half a day for git to add the files to my local index, I killed the process and started over with only the smaller kingdoms, excluding animals and plants. That still left about 335,000 folders and 670,000 files to be added to git. Adding these to my local index took several hours; committing and finally pushing them to the GitHub server took yet another two hours.
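For the record, this was nothing fancier than the standard git workflow, something like the following (the commit message is made up):

    git add .
    git commit -m "export GBIF backbone"
    git push origin master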

Delta compression using up to 8 threads.
Compressing objects: 100% (1010487/1010487), done.
Writing objects: 100% (1010494/1010494), 173.51 MiB | 461 KiB/s, done.
Total 1010494 (delta 405506), reused 0 (delta 0)
To https://github.com/mdoering/backbone.git

After those files were in the index, committing a simple change to the main README file took 15 minutes. Although I like the general idea and the pretty user interface, I fear GitHub, and even git itself, are not made to host a repository of millions of files and folders.

First GitHub impressions

Browsing taxa in GitHub is surprisingly responsive. The fungus genus Amanita contains 746 species, but it loads very quickly. In that regard the GitHub browser is much nicer to use than the taxonomy browser on the new GBIF species pages, which of course shows much more information. The rendered README file is not ideally placed, as it sits at the very bottom of the page, but showing information to humans that way is nice. Markdown could also be parsed by machines quite easily if we adopt a simple format: for example, for every property create a heading with that name and put the content into the following paragraph(s).
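Such a machine-parseable README could look something like this (a purely hypothetical sketch, not what the exporter currently writes):

    # Amanita arctica

    ## Rank

    species

    ## Status

    accepted

A parser would then read each second-level heading as a property name and the following paragraph(s) as its value.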

The Amanita example also reveals a bug in the exporter class in dealing with synonyms (the Amanita README contains the synonym information) and with infraspecific taxa. For example, Amanita muscaria contains some weird form information that is erroneously mapped to the species. This obviously should be fixed.

The GitHub browser sorts all files alphabetically. When ranks are mixed (we skip intermediate unknown ranks in the backbone), for example in the Fungi kingdom, sorting by rank first would be desirable. We could enable this by naming the taxon folders accordingly, prefixing them with a marker that sorts alphabetically in rank order.
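For example, with (hypothetical) numeric rank prefixes, the children of Fungi, which mix phyla with directly attached genera such as Agonium, would sort by rank before name:

    1_phylum_Ascomycota/
    1_phylum_Basidiomycota/
    5_genus_Agonium/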

I have not had time yet to try versioning branches of the tree and see how usable that is. I suspect git performance will be really slow, but that might not be a blocker if we only version larger groups and rarely push and pull.

9 comments:

  1. Nice! GitHub search on the repository is also really fast: https://github.com/mdoering/backbone/search?q=amanita

  2. Certainly a really good idea, but as you say GitHub / git may not be up to it:

    git clone https://github.com/mdoering/backbone.git
    Cloning into 'backbone'...
    remote: Counting objects: 1013692, done.
    remote: Compressing objects: 100% (607519/607519), done.
    remote: fatal: unable to read e053f6784afd230a23a36f4a180b621ce34b41e6
    remote: aborting due to possible repository corruption on the remote side.
    fatal: early EOF
    fatal: index-pack failed

    Trying again ;)

  3. The JSON is also nice to use, especially for client apps: https://raw.github.com/mdoering/backbone/master/life/Fungi/Agonium/data.json
    The JSON data should probably know about the file paths so that one can navigate the tree.
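    For example, with (hypothetical) path fields added by the exporter:

      {
        "canonicalName": "Agonium",
        "rank": "GENUS",
        "path": "life/Fungi/Agonium",
        "parentPath": "life/Fungi"
      }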

  4. I have done a similar thing, trying to create folders for taxa and syncing them using Dropbox. While I got it to work for the Catalogue of Life, I found the inode count on the filesystem was the limitation. I concluded a git/svn-type protocol backed by a database was likely to be a better solution.

  5. Wonderful! Here is a nice article on using GitHub as a CDN, http://code.lancepollard.com/github-as-a-cdn/, covering access to files via the file layer or the application layer. I think this is an interesting idea, and I could see hooks keeping a database on our end synced with this GitHub repo in our mongo/elasticsearch structure, especially since it is already in JSON format. Here are some other topics that might be worth looking over so this idea is not killed by GitHub:
    https://help.github.com/articles/what-is-my-disk-quota
    https://help.github.com/articles/post-receive-hooks
    https://help.github.com/articles/distributing-large-binaries
    https://help.github.com/articles/what-are-the-limits-for-viewing-content-and-diffs-in-my-repository

  6. Added plants today, but only the README and no JSON. Animals will take some days...

  7. What if you considered breaking this out into higher taxonomy plus specific areas, like the Catalogue of Life does with the GSDs? A sub-GitHub project per area. After all, it will be small parts of the tree that experts work on. That might help alleviate the inode issues...

    Replies
    1. This seems a logical thing to do; it would also make it more manageable for people doing bulk annotations and/or manual checking. There could be a repository for the higher-level classification, then repos for each lower branch (the level might vary among taxa: mammals could be one repo, insects would need to be split up).

      A challenge for the GitHub model is the dependencies between taxa. The nested-folders approach captures the hierarchy nicely, but not things like synonymy, although that could be stored as lists of names associated with each node (folder). Compare this with storing Darwin Core Archive files for occurrence data (see http://iphylo.blogspot.co.uk/2013/11/annotating-and-cleaning-gbif-data.html). Each row in an occurrence data set is basically independent, so we are simply editing a table of data (and could use cool tools like Open Refine).

      Another question is whether to edit the GBIF backbone or the individual checklists (each of which could be in GitHub), so that the original providers can get access to the edits. This would work if the process of going from checklists to backbone was automated and reproducible, and if it were easy to identify the reason for any relationship in the backbone. In other words, if the backbone duplicates species because one checklist lacks synonym data, we could edit that checklist and fix the problem (almost) at source.

      Alternatively, GBIF could maintain a "fix" file that adds missing synonym relationships. This file, plus the other checklists, could generate the backbone. This is maybe closer to what the Open Tree of Life folks are doing with "OTT", where they add patches to a large text document.

  8. I'm getting a bit nervous about the idea of spreading the backbone across many repositories and not being able to track changes nicely when taxa move between them. For fungi, chromista, bacteria, algae and even protista, for example, this is very likely.

    It might be better to follow our old approach: build on published checklists, accept comments that make it into a patch file, and feed those and unfound names back to the sources. We actually already have a simple checklist of internal "patches" to the backbone, which ranks high on the priority list of backbone sources: https://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-checklists/provisional_new_names.txt

    It was just never used externally, and not much internally either. How about moving this file into GitHub and allowing external edits? Or, even better, providing a folder with various patch files that all get applied, which would keep each file's contents restricted to an individual taxonomic group? That is pretty much what OTT does, and it appears to work fine for them.

    ReplyDelete