Monday 8 August 2016

GBIF Backbone - August 2016 Update

GBIF has just put a new backbone taxonomy into production! Since our last update of the GBIF Backbone we have received various feedback and gained insight into potential code improvements. Here is a quick summary of what has changed in this August 2016 version.

Important code changes:

  • much less eager basionym detection, resulting in fewer algorithmically assigned synonyms and removing many false synonyms, especially in plants
  • detection and merging of orthographic variants of species names through gender stemming, tolerance of doubled consonants, handling of author transliterations, and merging of hybrid names
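
To give a feel for what merging orthographic variants involves, here is a minimal Python sketch. It is not the backbone build's actual code: the function names, the list of gender endings and the consonant-collapsing rule are simplifying assumptions made for illustration, and author transliterations and hybrid names are left out entirely.

```python
import re

# Hypothetical, simplified list of Latin gender endings to strip.
GENDER_ENDINGS = ("us", "um", "is", "a", "e")

# Matches any doubled consonant, e.g. "tt" or "ll".
DOUBLE_CONSONANT = re.compile(r"([b-df-hj-np-tv-z])\1")


def normalize_epithet(epithet):
    """Reduce a species epithet to a crude comparison key."""
    key = epithet.lower()
    # Treat doubled consonants as single ones (littoralis ~ litoralis).
    key = DOUBLE_CONSONANT.sub(r"\1", key)
    # Strip one gender ending, keeping a stem of at least 3 characters.
    for ending in GENDER_ENDINGS:
        stem = key[: -len(ending)]
        if key.endswith(ending) and len(stem) >= 3:
            return stem
    return key


def are_variants(name_a, name_b):
    """True if two binomials share a genus and a normalized epithet."""
    genus_a, epithet_a = name_a.split()[:2]
    genus_b, epithet_b = name_b.split()[:2]
    return genus_a == genus_b and (
        normalize_epithet(epithet_a) == normalize_epithet(epithet_b)
    )
```

With these rules, `are_variants("Aus sylvaticus", "Aus sylvatica")` is true (the genus "Aus" is a placeholder), while names in different genera are never merged. A production implementation would need far more careful stemming than this.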

All issues fixed in the source code that generates the backbone can be found here; many of them link to the user feedback that prompted them: http://dev.gbif.org/issues/browse/POR-3029

New sources

The following new sources have been incorporated into the August backbone:
  • major new version of The Paleobiology Database, contributing 2,315 new families, 11,390 genera and 131,958 species names to the backbone. It also feeds many isExtinct and livingPeriod values into the backbone for fossil taxa
  • thousands of new Plazi articles with 1,883 genera, 28,725 species and 1,935 infraspecific names. We only use names at genus rank and below from Plazi, and exclude any synonyms until we are confident they are all correctly marked up
  • added Artsnavnebasen source, contributing 3,640 new genera and 29,751 species names to the backbone
  • added International Cichorieae Network source, contributing 190 new Asteraceae genera, 1,415 species and 3,427 infraspecies names to the backbone
A total of 39 sources were used in this backbone build.

Backbone impact

The new backbone has a total of 5,307,978 names of which it treats 2,525,274 species names as accepted (previously 2,420,842 out of 5,208,172). More backbone metrics are available through our portal and in more detail through our API.
  • 187,854 deleted names, mostly due to the removal of orthographic variants
  • 279,404 new names 
    • Unknown: 165 families; 743 genera; 785 species; 14 infraspecific
    • Animalia: 13 orders; 1,649 families; 10,171 genera; 125,478 species; 4,398 infraspecific
    • Archaea: 2 genera; 3 species
    • Bacteria: 1 family; 33 genera; 544 species; 36 infraspecific
    • Chromista: 38 families; 412 genera; 5,594 species; 295 infraspecific
    • Fungi: 1 family; 691 genera; 11,127 species; 2,039 infraspecific
    • Plantae: 50 families; 666 genera; 82,672 species; 14,725 infraspecific
    • Protozoa: 1 class; 1 order; 4 families; 38 genera; 349 species; 24 infraspecific
    • Viruses: 1 family; 982 genera; 6,311 species
A very large and detailed log of the backbone build is also available.

The largest taxonomic groups in the backbone, those exceeding 3% of all accepted species, are shown in the following diagram:



The Catalogue of Life, as the largest single primary source, contributes 59.8% of all names (previously 60.9%). A breakdown by backbone constituents is now also available as a species search facet. For example, this shows the breakdown for all accepted plant species in the backbone:
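
A comparable breakdown can presumably be requested from the API as a facet count. The Python sketch below only builds the request URL and does not send it. The endpoint, the backbone dataset key, the Plantae taxon key and the parameter names are assumptions based on the public GBIF species search API and should be checked against the current API documentation; the function name is made up for this example.

```python
from urllib.parse import urlencode

# Assumed values; verify against the GBIF API documentation.
API_BASE = "https://api.gbif.org/v1/species/search"
BACKBONE_DATASET_KEY = "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c"
PLANTAE_KEY = 6


def constituent_facet_url(higher_taxon_key=PLANTAE_KEY):
    """Build a search URL that facets accepted species by constituent."""
    params = {
        "datasetKey": BACKBONE_DATASET_KEY,
        "highertaxonKey": higher_taxon_key,
        "rank": "SPECIES",
        "status": "ACCEPTED",
        "facet": "constituentKey",
        "limit": 0,  # only the facet counts are of interest, not the records
    }
    return f"{API_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    print(constituent_facet_url())
```

Passing a different `higher_taxon_key` would restrict the breakdown to another group.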


Occurrence impact

With a new backbone we have reprocessed all of our 642 million occurrences. The larger changes were:
  • Fixed various old/new world distributions of incorrectly synonymized species
  • Reduced the number of virus records from 157,492 down to just 5,348 records. Most of the affected occurrences were actually Lepidoptera, e.g. the common peacock butterfly, which had formerly been mismatched because no classification was given with the name.
Some more metrics on backbone names in our occurrences. There are:
  • 216,699 distinct genera in GBIF occurrences. That is 55% of all 396,990 genera in the backbone
  • 1,226,668 accepted species in GBIF occurrences. That is 50% of all 2,420,842 backbone species
  • 2,059,961 distinct names in GBIF occurrences. That is 39% of all 5,208,172 names in the backbone
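
The shares above are simple ratios and can be recomputed from the counts in the list. The short Python sketch below does that with one decimal place; the whole-number percentages quoted in the list can differ by a point depending on how they were rounded. The `share` helper and the dictionary labels are names chosen for this example.

```python
def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# (count seen in occurrences, count in the backbone), as quoted in the list.
metrics = {
    "genera": (216_699, 396_990),
    "accepted species": (1_226_668, 2_420_842),
    "names": (2_059_961, 5_208_172),
}

if __name__ == "__main__":
    for label, (in_occurrences, in_backbone) in metrics.items():
        print(f"{label}: {share(in_occurrences, in_backbone):.1f}%")
```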
The distribution of the major taxonomic groups exceeding 3%, i.e. having at least 36,800 species, is shown in this last diagram: