Monday, 11 April 2011

The GBIF Development Team

Recently the GBIF development group has been asked to communicate more about the work being carried out in the secretariat. To quote one message:
"IMHO, simply making all these discussions public via a basic mailing list could help people like me ... have a better awareness of what's going on... We could add our comments / identify possible drawbacks / make some "scalability tests"... In fact I'm really eager to participate to this process" (developer in Belgium)
To kick things off, we plan to make better use of this blog and have set a target of posting 2-3 times a week. This is a technical blog, so the anticipated audience includes developers, database administrators and those interested in following the details of GBIF software development. We have always welcomed external contributors to this blog and invite any developers who are publishing content through the GBIF network, or developing tools that make use of content discoverable and accessible through GBIF, to write posts.

Today we are pleased to welcome Jan Legind to the team. Jan will be working as a data administrator to help improve the frequency of the network crawling (harvesting) and indexing processes, and will work closely with the data publishers to help improve the quality and quantity of content accessible through GBIF.

The GBIF development group has expanded in the past 6 months, so I'll introduce the whole team working in the secretariat and contracted to GBIF:

  • Developers (in order of appearance in the team): Kyle Braak, José Cuadra, Markus Döring (contracted in Germany), Daniel Amariles & Héctor Tobón (contracted at CIAT in Colombia), Federico Méndez, Lars Francke and Oliver Meyn
  • Systems architect: Tim Robertson
  • Systems analyst: Andrea Hahn
  • Informatics liaison: Burke (Chih-Jen) Ko
  • Systems admins: Ciprian Vizitiu & Andrei Cenja
  • Data administrator: Jan Legind

The current focus of work at GBIF includes the following major activities:
  • Developing and rolling out the Integrated Publishing Toolkit.
  • Integrating the checklist (taxonomic, nomenclatural and thematic) content into the current Data portal.
  • Developing a processing framework to automate the steps needed to apply quality control and index content for discovery through the Data portal.
    • Specifically, reducing the time and complexity involved in initiating a rollover of the content behind the index
    • Reworking all quality control (geographic, taxonomic and temporal) 
    • Automating the process
  • Initiating a redesign of the Data portal user interface to provide richer discovery and integration across dataset metadata, checklists and primary biodiversity data.
  • Reducing the time between publishing content onto the network and discovery through the Data portal. This includes providing specific support to those experiencing problems, with large datasets in particular, and assisting in migration to the DarwinCore-Archive format.
  • Writing technical and user documentation for the available publishing options.
Let the blogging begin.

[Please use the #gbif hashtag on Twitter]


  1. Thanks for this, Tim. I'm very pleased to see my comments followed by concrete actions!

    I know it will require effort from already busy people to post 2-3 times a week as targeted, but I'm convinced it will be very effective in the long term!

    It will probably take a few weeks or months of posting before there is a critical mass of active readers and commenters, but it's worth it IMHO.

    There would be huge long-term benefits for the GBIF community (and secretariat) if this could become one of the favorite web pages of any GBIF-concerned developer in the coming months.

  2. Nice!

    Hope to see a post on how GBIF "integrates taxonomic" information.

    And the possible alternatives that the Development Team has considered to address this problem.

  3. @niconoe
    Thanks for the comments. Please remember that others can post here too about their development work contributing to GBIF. We are "testing the waters" right now in terms of communication channels, so we appreciate the guidance.

    That is in the pipeline for the coming days/weeks, but will likely be more than one post. The short answer is that we are working on a whole new processing structure, which involves normalizing as many "authoritative" sources as possible and then applying a rule-based assembly into a "backbone" classification structure. For Kingdom->Family we will try to default to the Catalogue of Life 2011, and then augment it below family where necessary. With that backbone in place, we are working on a web service that you can call with kingdom,phylum...scientific_name; it will apply rule-based matching to select the most appropriate classification and return the identifiers for those taxa. Much of the rule-based approach relies on definitive homonym lists. The main difference between this new strategy and the one live in the portal is that the portal was assembled from occurrence classifications (which are often quite 'dirty') and therefore became less reliable as a management classification hierarchy. This will be better explained in coming blog posts...

  4. Neat! Great to see you guys blogging again! Having that kind of biodiversity informatics experience, try-outs, tutorials and early announcements online is a tremendous resource, as demonstrated by iPhylo.
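
To make the rule-based matching described in comment 3 a little more concrete, here is a minimal sketch in Python. It is not GBIF's web service: the names Taxon, BACKBONE, HOMONYMS and match_taxon, the identifiers, and the two rules themselves are all hypothetical, chosen only to illustrate how a homonym list can force a lookup to consult the higher classification before returning an identifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Taxon:
    taxon_id: int          # made-up identifier, for illustration only
    kingdom: str
    family: str
    scientific_name: str


# Hypothetical, tiny "backbone". Oenanthe is a genus name used for both
# birds (wheatears) and plants (water dropworts).
BACKBONE = [
    Taxon(1, "Animalia", "Muscicapidae", "Oenanthe"),
    Taxon(2, "Plantae", "Apiaceae", "Oenanthe"),
    Taxon(3, "Animalia", "Felidae", "Puma concolor"),
]

# Hypothetical homonym list: names that must never be matched on name alone.
HOMONYMS = {"oenanthe"}


def match_taxon(scientific_name: str, kingdom: Optional[str] = None) -> Optional[int]:
    """Return the backbone identifier for a name, or None if no safe match exists."""
    name = scientific_name.strip().lower()
    candidates = [t for t in BACKBONE if t.scientific_name.lower() == name]
    if not candidates:
        return None
    # Rule 1: a known homonym needs the kingdom to disambiguate.
    if name in HOMONYMS:
        if kingdom is None:
            return None
        wanted = kingdom.strip().lower()
        candidates = [t for t in candidates if t.kingdom.lower() == wanted]
    # Rule 2: only return an identifier when exactly one candidate remains.
    return candidates[0].taxon_id if len(candidates) == 1 else None


if __name__ == "__main__":
    print(match_taxon("Oenanthe", kingdom="Plantae"))
    print(match_taxon("Oenanthe"))
    print(match_taxon("Puma concolor"))
```

In this sketch an ambiguous name returns no identifier rather than a guess, which is one way to keep the kind of 'dirty' classifications mentioned above from leaking into the backbone.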