Wednesday 6 April 2016

Updating the GBIF Backbone

The taxonomy employed by GBIF for organising all occurrences into a consistent view has remained unchanged since 2013. We have been working on a replacement for some time and are pleased to introduce a preview in this post. The work is rather complex and tries to establish an automated process to build a new backbone which we aim to run on a regular, probably quarterly basis. We would like to release the new taxonomy rather soon and improve the backbone iteratively. Large regressions should be avoided initially, but it is quite hard to evaluate all the changes between 2 large taxonomies with 4 - 5 million names each. We are therefore seeking feedback and help to discover oddities of the new backbone.

Relevance & Challenges

Every occurrence record in GBIF is matched to a taxon in the backbone. Because occurrence records in GBIF cover the whole tree of life and names may come from all possible, often outdated, taxonomies, it is important to have the broadest coverage of names possible. We also deal with fossil names, extinct taxa and (due to advanced digital publishing) even names that have just been described a week before the data is indexed at GBIF.
The Taxonomic Backbone provides a single classification and a synonymy that we use to inform our systems when creating maps, providing metrics or even when you do a plain occurrence search. It is also used to crosslink names between different checklist datasets.

The Origins

The very first taxonomy that GBIF used was based on the Catalogue of Life. As this only included around half the names we found in GBIF occurrences, all other cleaned occurrence names were merged into the GBIF backbone. As the backbone grew we never deleted names, and we faced more and more redundant names with slightly different classifications. It was time for a different procedure.

The Current Backbone

The current version of the backbone was built in July 2013. It is largely based on the Catalogue of Life from 2012 and has folded in names from 39 further taxonomic sources. It was built using an automated process that made use of selected checklists from the GBIF ChecklistBank in a prioritised order. The Catalogue of Life was still the starting point and provided the higher classification down to orders. The Interim Register of Marine and Nonmarine Genera was used as the single reference list for generic homonyms. Otherwise only a single version of any name was allowed to exist in the backbone, even where the authorship differed.

Current issues

We kept track of nearly 150 reported issues. Some of the main issues showing up regularly that we wanted to address were:
  • Enable an automated build process so we can use the latest Catalogue of Life and other sources to capture newly described or currently missing names
  • It was impossible to have synonyms using the same canonical name but with different authors. This means Poa pubescens was always considered a synonym of Poa pratensis L. when in fact Poa pubescens R.Br. is considered a synonym of Eragrostis pubescens (R.Br.) Steud.
  • Some families contain far too many accepted species and hardly any synonyms. Especially for plants the Catalogue of Life was surprisingly sparsely populated and we relied heavily on IPNI names. For example the family Cactaceae has 12,062 accepted species in GBIF while The Plant List recognizes just 2,233.
  • Many accepted names are based on the same basionym. For example the current backbone considers both Sulcorebutia breviflora Backeb. and Weingartia breviflora (Backeb.) Hentzschel & K.Augustin as accepted taxa.
  • Relying purely on IRMNG for homonyms meant that homonyms which were not found in IRMNG were conflated. On the other hand there are many genera in IRMNG - and thus in the backbone - that are hardly used anywhere, creating confusion and many empty genera without any species in our backbone.
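The Poa pubescens case above can be illustrated with a minimal sketch. It is not the actual backbone code; the function names are hypothetical, and "<other author>" is a placeholder because this post does not name the author of the second Poa pubescens.

```python
# Old behaviour: one entry per canonical name, so authorship is ignored.
old_synonyms = {"Poa pubescens": "Poa pratensis L."}

# New behaviour: key on (canonical name, authorship).
# "<other author>" is a placeholder, not a real authorship string.
new_synonyms = {
    ("Poa pubescens", "<other author>"): "Poa pratensis L.",
    ("Poa pubescens", "R.Br."): "Eragrostis pubescens (R.Br.) Steud.",
}


def resolve_old(canonical, authorship):
    """Look up the accepted name while ignoring authorship."""
    return old_synonyms.get(canonical)


def resolve_new(canonical, authorship):
    """Look up the accepted name using canonical name and authorship."""
    return new_synonyms.get((canonical, authorship))
```

With the old lookup, Poa pubescens R.Br. ends up under Poa pratensis L.; with the authorship-aware key it resolves to Eragrostis pubescens (R.Br.) Steud.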

The New Backbone

The new backbone is available for preview in our test environment. In order to review the new backbone and compare it to the previous version we provide a few tools with a different focus:
  • Stable ID report: We have joined the old and new backbone names to each other and compared their identifiers. When joining on the full scientific name, some identifiers still change; we are investigating this issue.
  • Tree Diffs: For comparing the higher classification we used a tool from Rod Page to diff the tree down to families. There are surprisingly many changes, but all of them stem from evolution in the Catalogue of Life or the changed Algae classification.
  • Nub Browser: For comparing actual species and also reviewing the impact of the changed taxonomy on the GBIF occurrences, we developed a new Backbone Browser sitting on top of our existing API (Google Chrome only). Our test environment has a complete copy of the current GBIF occurrence index which we have reprocessed to use the new backbone. This also includes all maps and metrics which we show in the new browser.
Family Asparagaceae as seen in the nub browser:
Red numbers next to names indicate taxa that have fewer occurrences using the new backbone, while green numbers indicate an increase. This is also seen in the tree maps of the children by occurrences. The genus Campylandra J.G. Baker, 1875 is dark red with zero occurrences because the species in that genus were moved into the genus Rohdea in the latest Catalogue of Life.

Species Asparagus asparagoides as seen in the nub browser:
The details view shows all synonyms, the basionym and also a list of homonyms from the new backbone.
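
The idea behind the stable ID report mentioned above (join old and new names, then compare identifiers) can be sketched roughly as follows. This is only an illustration: the function name and the identifiers in the usage example are made up, and the real report runs against the full backbones.

```python
def compare_ids(old, new):
    """Join two {scientific name: identifier} mappings on the name
    and sort every name into one of four buckets."""
    report = {"stable": [], "changed": [], "removed": [], "added": []}
    for name in sorted(old.keys() | new.keys()):
        if name not in new:
            report["removed"].append(name)
        elif name not in old:
            report["added"].append(name)
        elif old[name] == new[name]:
            report["stable"].append(name)
        else:
            report["changed"].append(name)
    return report


# Hypothetical identifiers, for illustration only.
old_backbone = {"Poa pratensis L.": 1, "Sulcorebutia breviflora Backeb.": 2}
new_backbone = {
    "Poa pratensis L.": 1,
    "Sulcorebutia breviflora Backeb.": 20,
    "Oxalis acetosella L.": 3,
}
report = compare_ids(old_backbone, new_backbone)
```

Names in the "changed" bucket are the ones worth a closer look, since existing links to those identifiers would break.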

Sources

We manually curate a list of priority ordered checklist datasets that we use to build the taxonomy. Three datasets are treated in a slightly special way:
  1. GBIF Backbone Patch: a small dataset we manually curate at GBIF to override any other list. We mainly use the dataset to add missing names reported by users.
  2. Catalogue of Life: The Catalogue of Life provides the entire higher classification above families with the exception of algae.
  3. GBIF Algae Classification: With the withdrawal of Algaebase the current Catalogue of Life is lacking any algae taxonomy. To allow other sources to at least provide genus and species names for algae we have created a new dataset that just provides an algae classification down to families. This classification fits right into the empty phyla of the Catalogue of Life.
The GBIF portal now also lists the source datasets that contributed to the GBIF Backbone and the number of names that were used as primary references.
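
As a rough sketch of what "priority ordered" means here: sources are visited in order and the first one to supply a name becomes its primary reference. The function and the "Aus" names below are hypothetical, and the real build also has to deal with classification, synonymy and authorship, none of which is shown.

```python
def build_backbone(sources):
    """sources: list of (dataset name, list of names) in priority order.
    Returns {name: dataset that first supplied it}."""
    backbone = {}
    for dataset, names in sources:
        for name in names:
            # setdefault keeps the entry from the higher-priority source.
            backbone.setdefault(name, dataset)
    return backbone


# Placeholder names; the patch dataset comes first so it overrides the rest.
sources = [
    ("GBIF Backbone Patch", ["Aus bus"]),
    ("Catalogue of Life", ["Aus bus", "Aus cus"]),
    ("Some other checklist", ["Aus cus", "Aus dus"]),
]
backbone = build_backbone(sources)
```

Counting the values of the resulting mapping gives the number of names each dataset contributed as primary reference.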

Other Improvements

As well as fixing the main issues listed above, there is another frequently occurring situation that we have improved. Many occurrences could not be matched to a backbone species because the name existed multiple times as an accepted taxon. In the new backbone, only one version of a name is ever considered to be accepted. All others now are flagged as doubtful. That resolves many issues which prevented a species match because of name ambiguity. For example there are many occurrences of Hyacinthoides hispanica in Britain which only show up in the new backbone (old / new occurrence, old / new match). This is best seen in the map comparison of the nub browser, try to swipe the map!
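
The rule described above (one accepted version per name, all others doubtful) could look roughly like this. It is a simplified sketch under our own assumptions: the record fields, the numeric priority (lower means more trusted) and the placeholder authorships are all hypothetical.

```python
from collections import defaultdict


def flag_doubtful(records):
    """Keep one ACCEPTED record per canonical name, the one from the
    most trusted source, and flag the rest as DOUBTFUL.
    Modifies the records in place and returns them."""
    groups = defaultdict(list)
    for record in records:
        if record["status"] == "ACCEPTED":
            groups[record["canonical"]].append(record)
    for group in groups.values():
        group.sort(key=lambda r: r["priority"])
        for record in group[1:]:
            record["status"] = "DOUBTFUL"
    return records


# "Author A" and "Author B" are placeholders, not real authorships.
records = [
    {"canonical": "Hyacinthoides hispanica", "authorship": "Author B",
     "status": "ACCEPTED", "priority": 5},
    {"canonical": "Hyacinthoides hispanica", "authorship": "Author A",
     "status": "ACCEPTED", "priority": 1},
]
flag_doubtful(records)
```

After the call only the record from the source with priority 1 is still accepted, so a match on the bare name is no longer ambiguous.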

Known problems

We are aware of some problems with the new backbone which we would like to address in the next stage. Two of these issues we consider as candidates for blocking the release of the new backbone:
Species matching service ignores authorship
Because we now keep different authors apart better, the backbone contains many more species names that differ only in their authorship. The current algorithm keeps only one of these names as the accepted name, taken from the most trusted source (e.g. CoL), and treats the others as doubtful if they are not already treated as synonyms.
The problem currently is that the species matching service we use to align occurrences to the backbone does not deal with authorship. Therefore we have some cases where occurrences are attached to a doubtful name or even split across some of the “homonyms”.
There are 166,832 species names with different authorship in the new backbone, accounting for 98,977,961 occurrences.
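To make the problem concrete, here is one way an authorship-aware match could work. This is purely a sketch of the idea, not how the matching service is or will be implemented; the function, the record fields and the "Aus bus" names are hypothetical.

```python
def match(canonical, authorship, candidates):
    """Pick a backbone record for an occurrence name.
    Prefer a candidate with the same authorship; without one,
    fall back to the accepted record for that canonical name."""
    same_name = [c for c in candidates if c["canonical"] == canonical]
    if authorship is not None:
        for candidate in same_name:
            if candidate["authorship"] == authorship:
                return candidate
    for candidate in same_name:
        if candidate["status"] == "ACCEPTED":
            return candidate
    return same_name[0] if same_name else None


# Placeholder names and authorships.
candidates = [
    {"canonical": "Aus bus", "authorship": "Author A", "status": "DOUBTFUL"},
    {"canonical": "Aus bus", "authorship": "Author B", "status": "ACCEPTED"},
]
```

An occurrence that carries "Author A" would then stay with that name, while an occurrence without any authorship would go to the accepted record instead of a doubtful one.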
Too eager basionym merging
The same epithet is sometimes used by the same author for different names in the same family. This currently leads to overly eager basionym grouping and thus fewer accepted names.
As these names are still in the backbone and occurrences can be matched to them this is currently not considered a blocker.
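
A simplified sketch of such a grouping key shows why this happens. We assume here that names are grouped within a family by epithet plus original author, where authors in parentheses are taken as the original authors; the helper name is hypothetical and the real grouping is more involved.

```python
import re


def basionym_key(epithet, authorship):
    """Build a grouping key from the epithet and the original author.
    Authors in leading parentheses are the authors of the basionym."""
    in_parentheses = re.match(r"\((.+?)\)", authorship)
    original_author = in_parentheses.group(1) if in_parentheses else authorship
    return (epithet, original_author)


# The two Cactaceae names mentioned earlier in this post share one key.
key_a = basionym_key("breviflora", "Backeb.")
key_b = basionym_key("breviflora", "(Backeb.) Hentzschel & K.Augustin")
```

Both names map to the same key, which is the intended effect. The flip side is that two genuinely different names also share a key whenever the same author used the same epithet twice in one family, and that is the overly eager merging described above.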

25 comments:

  1. Hi Markus,

    A couple of IRMNG-related comments, if I may:...

    1. Regarding the lack of algal coverage in the present Catalogue of Life, you could get just about all the relevant genera from IRMNG (if you have not already done so), both extant and fossil, since IRMNG is pretty complete in this respect (genera supplied by Index Nominum Genericorum and other sources in the main, not AlgaeBase). Some of the family assignments may not be 100% resolved or correct, but maybe that can be addressed as a separate issue.

    2. IRMNG is presently in transition between CSIRO (location for last 10 years) and VLIZ, Belgium where it will eventually be under new editorship and direction. So if GBIF has particular needs it would be a good time to articulate them to the new custodians - for example the continuation of the IRMNG homonyms list, which I have tended to regard as a somewhat minor spinoff in the big picture of IRMNG ongoing population but would seem to be more significant to your present uses.

    Hope the above is of some interest,

    Regards - Tony

  2. Looking very much forward to the launch of the new backbone. Do you have a timeframe? (though probably difficult to say with the blockers you mention). What I really wanted to know is whether inclusion of new names lists is on halt currently? E.g. Denmark for a while has had a dataset with new row-beetle names (http://doi.org/10.11646/zootaxa.3893.1.2) sitting on GBIF, still without the names being recognised :-) What do you recommend we do?

    Replies
    1. Isabel, we are trying to go live with the new backbone by the end of April. It will be a new edition of the current one in preview here, with some of the discovered bugs fixed. As I hope the post illustrates, the GBIF backbone has been unchanged since 2013 and not a single new name has entered since then. But with the new one out that's changing. It still needs a tiny manual configuration change to include any new checklist as a source.

      The only dataset matching your DOI that I can see is an occurrence dataset in GBIF: http://www.gbif.org/dataset/1515ac55-d024-4ed6-9785-b90625706f59
      We do not include names from occurrences. If you could publish that paper as a checklist with names at its heart then we can easily add you, even to the April edition if it's by the end of this week :)

    2. I see the occurrences don't have any extensions attached. In that case you could republish your entire dataset as a checklist and attach occurrences as an extension

    3. Thanks Markus, sounds like a good idea, I will look into it.

  3. Thanks Tony, we use your latest IRMNG export and have included all the genera. So that should help a lot to fill the algae gap together with the higher replacement classification I did. I haven't verified yet how well the IRMNG genera actually fit into that classification, should be interesting.

    The IRMNG homonym list we don't use in this algorithm anymore, so there is no need to treat it specially. That said, the well-curated list of genera (embedded in the regular, full IRMNG) is still of very high value for us. Most importantly, it would be great if VLIZ were able to automatically generate a new dwc archive on a regular basis, or just whenever changes have been incorporated if that is rather rare. Maybe even using the IPT which they make heavy use of already: http://ipt.vliz.be/

  4. Hello Markus
    I'm interested to know why Chromista and Protozoa are being included as kingdoms in the algae classification. Any particular reason?

    Replies
    1. You can read more about the "algae" dataset here: https://github.com/gbif/algae

      Basically it tries to fill in the gaps of the CoL hierarchy due to the removal of Algaebase. The phyla Ochrophyta, Haptophyta & Cryptophyta for Chromista and Euglenozoa, Metamonada & Loukozoa for Protozoa.
      Metamonada & Loukozoa are not really algae, but are another gap identified in the CoL tree which we wanted to fill.

      If you have reasons to believe we should follow a different approach or are missing key taxa please let us know.
      It is important that any classification fits into the current CoL tree, which we need to extend, and follows the spirit of their management classification described here:

      http://www.catalogueoflife.org/col/info/hierarchy
      http://www.catalogueoflife.org/annual-checklist/2009/info_hierarchy.php
      http://www.catalogueoflife.org/annual-checklist/2009/show_database_details.php?database_name=AlgaeBase

  5. Here is a very simple dwc archive of the preview backbone: https://dl.dropboxusercontent.com/u/457027/nub.txt.zip

  6. I had a look at the genus Oxalis, because it is one of the few that I know a little about. Two things struck me. Firstly, varieties are often given synonym rank; the example I spotted was Oxalis corniculata var. atropurpurea. This is a widely accepted variety, so this surprised me. Secondly, Oxalis stricta is listed as a synonym of O. dillenii. This surprised me because, although there have been many name changes of these two species, O. dillenii was described after O. stricta, so the name O. stricta would always have priority. I imagine these issues come from the source databases, rather than the way you constructed the backbone. The problems seem to stem from the Synonymic Checklists of the Vascular Plants of the World, but I could not dig any deeper to find out how these names got to where they are.

    Replies
    1. Thanks Quentin. The variety synonyms are indeed synonyms because the source taxonomies refer to them as synonyms. In the case of Oxalis corniculata var. atropurpurea it is ITIS: http://www.gbif-uat.org/species/102292804

      Oxalis stricta is treated as a synonym by the Catalogue of Life which again refers to World Plants: Synonymic Checklists of the Vascular Plants of the World: http://www.gbif-uat.org/species/116689209

      Would you think plants are treated better in a different source such as The Plant List?
      http://www.theplantlist.org/tpl1.1/record/kew-2394269

    2. Oxalis acetosella is another improved example. We only had "Oxalis acetosella auct. non L." treated as a synonym for O. montanum which all the GBIF occurrences of O. acetosella were matched to: http://www.gbif.org/species/2891761
      This was clearly wrong, e.g. this record: http://www.gbif.org/occurrence/1236911939/verbatim

      Now we correctly have Oxalis acetosella L. in the backbone http://www.gbif-uat.org/species/8235501 and that above record is linked to it: http://www.gbif-uat.org/occurrence/1236911939/verbatim

    3. For these examples The Plant List is a better source, but of course this is a very small sample. It's frustrating that you can't get deeper into the Synonymic Checklists of the Vascular Plants to find out how they reached their conclusion. The fact it has O. stricta as a synonym doesn't build confidence. However, this is not really a reflection on the GBIF backbone.

    4. The problem seems to be Hassler M. (2016). World Plants: Synonymic Checklists of the Vascular Plants of the World, which apparently is not a published list, except through the COL. This list has no other online presence and doesn't give the provenance of who the taxonomic authority was. I find this disappointing given how much time and effort has been put into different taxonomic names lists. It's certainly not GBIF's fault, but I don't think botanists can be proud of the taxonomic name infrastructure we have.

  7. Can't wait for the new backbone to come online, as we have been using the name match service intensively the last couple of weeks to be able to merge taxa from different sources on their gbif acceptedKey. Hyacinthoides hispanica is one of the taxa that is currently an issue for us.

    Replies
    1. Thanks Peter. H. hispanica is available in the backbone: http://api.gbif-uat.org/v1/species/match?name=Hyacinthoides%20hispanica&kingdom=plantae

      If you are aware of blockers please let us know!

  8. Hi Markus,

    If you have not seen it, there is a version of Ruggiero et al.'s FALO classification down to family available for download at http://ggi.eol.org/downloads - it is labelled "GGI Family Data". Again if you have not already done so, I would suggest doing a difference report so you can see which families are in FALO and not in the latest GBIF nub, and vice versa, which may give you some discrepancies worth chasing.

    I had a quick eyeball of the algae list without exhaustive checking, and noted a few items out of place or superfluous, I will send you the things I spotted via separate email.

    I was wondering if you are including Chlorarachnion (Chlorarachniophyceae etc.) in the list under algae or elsewhere (it's a green zooflagellate treated under the botanical Code) - did not see it - here is the treatment from AlgaeBase: http://www.algaebase.org/search/species/detail/?species_id=59340. It is currently missing in CoL because previously supplied by AlgaeBase.

    Hope the above is helpful,

    Regards - Tony

    Replies
    1. Thanks Tony, this is useful. Especially since we have about 1500 families without any classification yet! And happy to receive any comments on the algae classification. As we now plan to rebuild the backbone every couple of months, we can finally include feedback each time.

    2. There are 17 homonym families in the FALO classification - any idea if these are intended to be present?

      Cepolidae
      Lumbricidae
      Odontopharyngidae
      Heterocheilidae
      Lutodrilidae
      Urostylidae
      Sagittariidae
      Chilodontidae
      Megascolecidae
      Hydridae
      Poraniidae
      Glossoscolecidae
      Cepheidae

    3. We had the Chlorarachniales in our previous backbone under Protozoa:
      http://www.gbif.org/species/678

      As CoL still treats Cercozoa as Protozoa I will add them also there in the Algae dataset. Thanks Tony, that was missing!

    4. Hi Markus,

      I have researched/checked the duplicate/potential homonym families from FALO that you list, and my comments are inserted below, prefixed "* Tony:". The list is split into 2 parts on account of the character limit for replies. Hope this helps. Regards - Tony

      ----------------------------
      Cepolidae (1): Mollusca > Gastropoda > Heterobranchia > Pulmonata > Stylommatophora
      Cepolidae (2): Chordata > Vertebrata > Gnathostomata > Osteichthyes > Actinopterygii > Neopterygii > Teleostei

      * Tony note: Cepolidae Ihering, 1909 (Mollusca) is currently an unreplaced junior homonym of Cepolidae Rafinesque, 1815 (Actinopterygii), needs replacement name and/or referral to ICZN for resolution.

      Lumbricidae (1): Oligochaeta > Tubificida
      Lumbricidae (2): Oligochaeta > Metagynophora > Opistophora

      * Tony note: I believe this is a duplicate entry (same family, different placement). Family authorship is Claus, 1876.

      Odontopharyngidae (1): Nematoda > Chromadorea > Plectia > Rhabditica > Diplogasterida
      Odontopharyngidae (2): Nematoda > Chromadorea > Plectia > Rhabditica > Rhabditida

      * Tony note: I believe this is a duplicate entry (same family, different placement). Family authorship is Micoletzky, 1922.

      Heterocheilidae (1): Insecta > Pterygota > Neoptera > Holometabola > Diptera
      Heterocheilidae (2): Nematoda > Chromadorea > Plectia > Rhabditica > Spirurida

      * Tony note: Heterocheilidae McAlpine, 1991 (Diptera) is a junior homonym of Heterocheilidae Railliet & Henry, 1915 (Nematoda), former is given as syn. of Helcomyzidae in WoRMS (2010 version) although incorrectly listed as available/valid in www.diptera.org

      Lutodrilidae (1): Annelida > Clitellata > Oligochaeta > Metagynophora > Opistophora
      Lutodrilidae (2): Annelida > Clitellata > Oligochaeta > Metagynophora > Opistophora

      * Tony note: I believe this is a duplicate entry (same family, different placement). Family authorship is McMahan, 1976.

      Urostylidae (1): Ciliophora > Intramacronucleata > Spirotrichia > Spirotrichea > Stichotrichia > Urostylida
      Urostylidae (2): Insecta > Pterygota > Neoptera > Paraneoptera > Hemiptera

      * Tony note: Urostylidae Bütschli, 1889 (Ciliophora) is a junior homonym of Urostylidae Dallas, 1851 (Hemiptera), however the homonymy has been removed by emendation of Urostylidae Dallas, 1851 (Hemiptera) to Urostylididae by Berger et al. 2001.

      Sagittariidae (1): Ciliophora > Intramacronucleata > Ventrata > Colpodea > Cyrtolophosidida
      Sagittariidae (2): Chordata > Vertebrata > Gnathostomata > Tetrapoda > Aves > Neornithes > Neognathae > Neoaves > Accipitriformes

      * Tony note: Sagittariidae Grandori & Grandori, 1935 (Ciliophora) is currently an unreplaced junior homonym of Sagittariidae Finsch & Hartlaub, 1870 (Aves), needs replacement name and/or referral to ICZN for resolution.

    5. Chilodontidae (1): Mollusca > Gastropoda > Vetigastropoda
      Chilodontidae (2): Chordata > Vertebrata > Gnathostomata > Osteichthyes > Actinopterygii > Neopterygii > Teleostei > Characiformes

      * Tony note: Chilodontidae Wenz, 1938 (Mollusca) is currently an unreplaced junior homonym of Chilodontidae Eigenmann, 1912 (Actinopterygii), needs replacement name and/or referral to ICZN for resolution.

      Megascolecidae (1): Annelida > Clitellata > Oligochaeta > Tubificida
      Megascolecidae (2): Annelida > Clitellata > Oligochaeta > Metagynophora > Opistophora

      * Tony note: I believe this is a duplicate entry (same family, different placement). Family authorship is Rosa, 1891.

      Hydridae (1): Cnidaria > Medusozoa > Hydrozoa > Hydroidolina > Anthoathecata
      Hydridae (2): Mollusca > Bivalvia > Autobranchia > Heteroconchia > Unionida

      * Tony note: Hydridae Dana, 1846 is a valid family in Cnidaria. "Hydridae" in Mollusca (Unionida) appears to be a misspelling of Hyriidae Swainson, 1840 (type genus is Hyria Lamarck, 1819).

      Poraniidae (1): Echinodermata > Asterozoa > Asteroidea > Valvatacea
      Poraniidae (2): Echinodermata > Asterozoa > Asteroidea > Valvatacea > Valvatida

      * Tony note: I believe this is a duplicate entry (same family, different placement). Family authorship is Perrier, 1893.

      Glossoscolecidae (1): Annelida > Clitellata > Oligochaeta > Tubificida
      Glossoscolecidae (2): Annelida > Clitellata > Oligochaeta > Metagynophora > Opistophora

      * Tony note: I believe this is a duplicate entry (same family, different placement). (Family authorship not yet traced)

      Cepheidae (1): Cnidaria > Medusozoa > Scyphozoa > Discomedusae > Rhizostomeae
      Cepheidae (2): Chelicerata > Arachnida > Acari > Acariformes > Sarcoptiformes

      * Tony note: Cepheidae Berlese, 1896 (Acariformes) is a junior homonym of Cepheidae L. Agassiz, 1862 (Cnidaria); former is given as synonym of Compactozetidae Luxton, 1988 in WoRMS (2013).

  9. I also had a quick look at the new backbone (https://dl.dropboxusercontent.com/u/457027/nub.txt.zip) and identified a few potentially problematic names -

    1. 250 taxon names with "taxonRank=SPECIES" but which in fact belong to genus (e.g. 7348906, 7350813, 8232585, etc.)
    2. 140 taxon names with "taxonRank=VARIETY|FORM" but for which intraspecific epithet is "null" (e.g. 7407832, 8181923 , 8189733, etc.)

    In addition to that, it seems that there are many taxon names marked as "*SYNONYM|MISAPPLIED" (7,756) or "DOUBTFUL" (4,836) for which there are direct children in the dataset. Is this how GBIF nub taxonomy has been and will be handling such cases or is it something that should ideally be fixed (e.g. parent should be changed to equivalent ACCEPTED name but this has not been done yet)?

    And finally, I can see that you are also using Index Fungorum as one of the underlying taxonomic sources. Do you use the same 2011 version (http://www.gbif.org/dataset/bf3db7c9-5e5d-4fd0-bd5b-94539eaf9598) or will there be an updated one? Or have there been any discussions about including MycoBank as well?

    Overall, nice (and enormous load of) work you have done. Very useful service and happy to use it!

    Replies
    1. Thanks!

      The null epithet is an issue that should be fixed in the new version we built last weekend to go live with today or tomorrow. It was reported here: http://dev.gbif.org/issues/browse/POR-3069
      I will follow up on mismatching ranks & epithets in this issue: http://dev.gbif.org/issues/browse/POR-3081

      For synonyms there should never be a direct child in the taxonomy. I have created a new issue http://dev.gbif.org/issues/browse/POR-3080 and will investigate that, thanks!

      The Index Fungorum version is unfortunately still the old one, yes. We are in touch with both MycoBank and IF now at Kew and hope to get a newer version at some point. But progress has been rather slow so we stuck with the old version for now.

  10. The new backbone to go live as a simple single CSV file Darwin Core archive: http://rs.gbif.org/datasets/backbone/backbone-2016-04-13.csv.gz
