The taxonomy employed by GBIF for organising all occurrences into a
consistent view has remained unchanged since 2013. We have been working on
a replacement for some time and are pleased to introduce a preview in this
post. The work is rather complex and tries to establish an automated
process to build a new backbone which we aim to run on a regular, probably
quarterly basis. We would like to release the new taxonomy rather soon and
improve the backbone iteratively. Large regressions should be avoided
initially, but it is quite hard to evaluate all the changes between 2 large
taxonomies with 4 - 5 million names each. We are therefore seeking feedback
and help to discover oddities of the new backbone.
Relevance & Challenges
Every occurrence record in GBIF is matched to a taxon in the backbone.
Because occurrence records in GBIF cover the whole tree of life and names
may come from all possible, often outdated, taxonomies, it is important to
have the broadest coverage of names possible. We also deal with fossil
names, extinct taxa and (due to advanced digital publishing) even names
that have just been described a week before the data is indexed at
GBIF.
The Taxonomic Backbone provides a single classification and a synonymy
that we use to inform our systems when creating maps, providing metrics or
even when you do a plain occurrence search. It is also used to crosslink
names between different checklist datasets.
The Origins
The very first taxonomy that GBIF used was based on the Catalogue of
Life. As this only included around half the names we found in GBIF
occurrences, all other cleaned occurrence names were merged into the GBIF
backbone. As the backbone grew we never deleted names and increasingly
faced more and more redundant names with slightly different
classifications. It was time for a different procedure.
The Current Backbone
The current version of the backbone was built in July 2013. It is
largely based on the Catalogue of Life from 2012 and has folded in names
from
39 further taxonomic sources.
It was built using an automated process that made use of selected checklists from
the GBIF ChecklistBank in a prioritised order. The Catalogue of Life was
still the starting point and provided the higher classification down to
orders.
The
Interim
Register of Marine and Nonmarine Genera was used as the single
reference list for generic homonyms. Otherwise only a single version of any
name was allowed to exist in the backbone, even where the authorship
differed.
Current issues
We kept track of
nearly 150
reported issues. Some of the main issues showing up regularly that we
wanted to address were:
- Enable an automated build
process so we can use the latest Catalogue of Life and other
sources to capture newly described or currently missing names
- It was impossible to have synonyms using the same
canonical name but with different authors. This means Poa pubescens was
always considered a synonym of Poa pratensis L. when in fact
Poa pubescens R.Br. is considered a
synonym of Eragrostis pubescens (R.Br.) Steud.
- Some families contain far too many accepted species and hardly any
synonyms. Especially for plants the Catalogue of Life was surprisingly
sparsely populated and we heavily relied on IPNI names. For example the
family Cactaceae has
12.062 accepted species in GBIF while The Plant List recognizes
just 2.233.
- Many accepted names are based on the same basionym. For example the
current backbone considers both Sulcorebutia breviflora
Backeb. and Weingartia breviflora
(Backeb.) Hentzschel & K.Augustin as accepted taxa.
- Relying purely on IRMNG for homonyms meant that homonyms which were
not found in IRMNG were conflated. On the other hand there are many
genera in IRMNG - and thus in the backbone - that are hardly used
anywhere, creating confusion and many empty genera without any species
in our backbone.
The New Backbone
The new backbone is available for
preview
in our test environment. In order to review the new backbone and
compare it to the
previous
version we provide a few tools with a different focus:
-
Stable ID report: We have joined the old and new
backbone names to each other and compared their identifiers. When joining on
the full scientific name there is still an issue with changing
identifiers which we are still investigating.
-
Tree Diffs: For comparing the higher
classification we used a
tool from Rod Page to diff the
tree down to families. There are surprisingly many changes, but
all of them stem from evolution in the Catalogue of Life or the
changed Algae classification.
-
Nub Browser: For comparing actual species and also
reviewing the impact of the changed taxonomy on the GBIF
occurrences, we developed a new Backbone
Browser sitting on top of our existing API (Google Chrome only). Our test
environment has a complete copy of the current GBIF occurrence
index which we have reprocessed to use the new backbone. This also
includes all maps and metrics
which we show in the new browser.
Family
Asparagaceae
as seen in the nub browser:
Red numbers next to names indicate taxa that have fewer occurrences
using the new backbone, while green numbers indicate an increase. This is
also seen in the tree maps of the children by occurrences. The genus
Campylandra J.G. Baker, 1875 is dark red with zero occurrences because the
species in that genus were moved into the genus Rhodea in the latest
Catalog of Life.
Species
Asparagus
asparagoides as seen in the nub browser:
The details view shows all synonyms, the basionym and also a list of
homonyms from the new backbone.
Sources
We manually curate a
list of priority ordered checklist datasets that we use to build the
taxonomy. Three datasets are treated in a slightly special way:
-
GBIF Backbone Patch: a small dataset we manually curate at GBIF
to override any other list. We mainly use the dataset to add
missing names reported by users.
-
Catalogue of Life: The Catalogue of Life provides the entire
higher classification above families with the exception of algaes.
-
GBIF Algae Classification: With the withdrawal of Algaebase the
current Catalogue of Life is lacking any algae taxonomy. To allow
other sources to at least provide genus and species names for algae
we have created a new dataset that just provides an algae
classification down to families. This classification fits right
into the empty phyla of the Catalogue of Life.
The GBIF portal now also lists
the source datasets that contributed to the GBIF Backbone and the
number of names that were used as primary references.
Other Improvements
As well as fixing the main issues listed above, there is another
frequently occurring situation that we have improved. Many occurrences
could not be matched to a backbone species because the name existed
multiple times as an accepted taxon. In the new backbone, only one version
of a name is ever considered to be accepted. All others now are flagged as
doubtful. That resolves many issues which prevented a species match because
of name ambiguity. For example there are many occurrences of
Hyacinthoides hispanica in Britain which only show up in the new
backbone (
old /
new occurrence,
old /
new match). This is best seen in the
map comparison
of the nub browser, try to swipe the map!
Known problems
We are aware of some problems with the new backbone which we like to
address in the
next
stage. Two of these issues we consider as candidates for blocking the
release of the new backbone:
Species matching
service ignores authorship
As we better keep different authors apart the backbone now contains a
lot more species names which just differ by their authorship. The current
algorithm only keeps one of these names as the accepted name from the most
trusted source (e.g. CoL) and treats the other as doubtful if they are not
already treated as synonyms.
The problem currently is that the species matching service we use to
align occurrences to the backbone does
not deal with authorship.
Therefore we have some cases where occurrences are attached to a doubtful
name or even split across some of the “homonyms”.
There are nearly 166.832 species names with different authorship
existing in the new backbone, accounting for 98.977.961 occurrences.
Too eager basionym merging
The same epithet is sometimes used by the same author for different
names in the same family. This currently leads to an
overly eager basionym
grouping with less accepted names.
As these names are still in the backbone and occurrences can be matched
to them this is currently not considered a blocker.