Monday 22 July 2013

Validating scientific names with the forthcoming GBIF Portal web service API

This guest post was written by Gaurav Vaidya, Victoria Tersigni and Robert Guralnick, and is being cross-posted to the VertNet Blog. David Bloom and John Wieczorek read through drafts of this post and improved it tremendously.

[Image: A whale named Physeter macrocephalus / Physeter catodon / Physeter macrocephalus (photograph by Gabriel Barathieu, reused under CC-BY-SA from the Wikimedia Commons)]

Validating scientific names is one of the hardest parts of cleaning up a biodiversity dataset: as taxonomists' understanding of species boundaries changes, the names attached to them can be synonymized, moved between genera or even have their Latin grammar corrected (it's Porphyrio martinicus, not Porphyrio martinica). Different taxonomists may disagree on what to call a species, whether a particular set of populations makes up a species, subspecies or species complex, or even which of several published names corresponds to our modern understanding of that species, such as the dispute over whether the sperm whale is really Physeter catodon Linnaeus, 1758, or Physeter macrocephalus Linnaeus, 1758.

A good way to validate scientific names is to match them against a taxonomic checklist: a publication that describes the taxonomy of a particular taxonomic group in a particular geographical region. It is up to the taxonomists who write such treatises to catalogue all the synonyms that have ever been used for the names in their checklist, and to identify a single accepted name for each taxon they recognize. While these checklists are themselves evolving over time and sometimes contradict each other, they serve as essential points of reference in an ever-changing taxonomic landscape.

Over a hundred digitized checklists have been assembled by the Global Biodiversity Information Facility (GBIF) and will be indexed in the forthcoming GBIF Portal, currently in development and testing. This collection includes large, global checklists, such as the Catalogue of Life and the International Plant Names Index, alongside smaller, more focussed checklists, such as a checklist of 383 species of seed plants found in the Singhalila National Park in India and the 87 species of moss bug recorded in the Coleorrhyncha Species File. Many of these checklists can be downloaded as Darwin Core Archive files, an important format for working with and exchanging biodiversity data.

So how can we match names against these databases? OpenRefine (the recently renamed Google Refine) is a popular data cleaning tool, with features that make it easy to clean up many different types of data. Javier Otegui has written a tutorial on cleaning biodiversity data in OpenRefine, and last year Rod Page provided tools and a step-by-step guide to reconciling scientific names, establishing OpenRefine as an essential tool for biodiversity data and scientific name cleanup.

[Image: Linnaeus' original description of Felis Tigris. From an 1894 republication of Linnaeus' Systema Naturae, 10th edition, digitized by the Biodiversity Heritage Library.]

We extended Rod's work by building a reconciliation service against the forthcoming GBIF web services API. We wanted to see if we could use one of the GBIF Portal's biggest strengths -- the large number of checklists it has indexed -- to identify names recognized in similar ways by different checklists. Searching through multiple checklists containing possible synonyms and accepted names increases the odds of finding an obscure or recently created name; and if the same name is recognized by a number of checklists, this may signify a well-known synonymy -- for example, two of the Portal checklists recognize that the species Linnaeus named Felis tigris is the same one that is known as Panthera tigris today.

To do this, we wrote a new OpenRefine reconciliation service that searches for a queried name in all the checklists on the GBIF Portal. It then clusters names using four criteria and counts how often a particular name has the same:
  • scientific name (for example, "Felis tigris"),
  • authority ("Linnaeus, 1758"),
  • accepted name ("Panthera tigris"), and
  • kingdom ("Animalia").
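
To make the clustering step concrete, here is a minimal Python sketch of the idea. It is not the service's actual code: the NameRecord structure, the checklist labels and the five sample records are all invented for illustration, and a missing value is simply represented as None.

```python
from collections import Counter
from typing import NamedTuple, Optional


class NameRecord(NamedTuple):
    """One checklist's interpretation of a name (hypothetical structure)."""
    checklist: str
    scientific_name: str
    authority: Optional[str]
    accepted_name: Optional[str]
    kingdom: Optional[str]


def cluster_records(records):
    """Group records on the four criteria and count the records in each group.

    Returns a list of ((name, authority, accepted name, kingdom), count)
    pairs, largest cluster first.
    """
    counts = Counter(
        (r.scientific_name, r.authority, r.accepted_name, r.kingdom)
        for r in records
    )
    # most_common() returns the clusters sorted by descending count.
    return counts.most_common()


# Invented sample data: five checklists that all contain "Felis tigris".
records = [
    NameRecord("checklist A", "Felis tigris", "Linnaeus, 1758", "Panthera tigris", "Animalia"),
    NameRecord("checklist B", "Felis tigris", "Linnaeus, 1758", "Panthera tigris", "Animalia"),
    NameRecord("checklist C", "Felis tigris", None, None, "Animalia"),
    NameRecord("checklist D", "Felis tigris", None, None, "Metazoa"),
    NameRecord("checklist E", "Felis tigris", "Linnaeus, 1758", None, "Animalia"),
]

clusters = cluster_records(records)
for key, count in clusters:
    print(count, key)
```

In this sketch two records agree on all four criteria and so form one cluster of size two, while the other three each end up in a cluster of their own.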

Once you do a reconciliation through our new service, your results will look like this:

Since OpenRefine limits the number of results it shows for any reconciliation, we know only that at least five checklists in the GBIF Portal matched the name "Felis tigris". Of these,
  1. Two checklists consider Felis tigris Linnaeus, 1758 to be a junior synonym of Panthera tigris (Linnaeus, 1758). Names are always sorted by the number of checklists that contain that interpretation, so this interpretation -- as it happens, the correct one -- is at the top of the list.
  2. The remaining checklists all consider Felis tigris to be an accepted name in its own right. They contain mutually inconsistent information: one places this species in the kingdom Animalia, another in the kingdom Metazoa, and the third contains both a kingdom and a taxonomic authority. You can click on each name to find out more details.
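
For readers curious how a ranked list like this might be handed back to OpenRefine, the following Python sketch turns counted clusters into a reconciliation-style response. It rests on several assumptions: the field names (id, name, type, score, match) follow our understanding of OpenRefine's reconciliation protocol, while the identifier scheme, the type label, the display format, the limit of five and the use of the checklist count as the score are hypothetical choices made for this example, not a description of the real service.

```python
import json


def to_reconciliation_result(clusters, limit=5):
    """Convert ((name, authority, accepted, kingdom), count) pairs into a
    reconciliation-style response, most widely shared interpretation first."""
    # sorted() is stable, so clusters with equal counts keep their input order.
    ranked = sorted(clusters, key=lambda pair: pair[1], reverse=True)
    candidates = []
    for index, ((name, authority, accepted, kingdom), count) in enumerate(ranked[:limit]):
        label = name if authority is None else f"{name} {authority}"
        if accepted is not None:
            label += f" [accepted name: {accepted}]"
        if kingdom is not None:
            label += f" ({kingdom})"
        candidates.append({
            "id": f"cluster-{index}",      # hypothetical identifier scheme
            "name": label,
            "type": ["scientific-name"],   # hypothetical type label
            "score": count,                # assumption: score = number of checklists
            "match": False,                # leave the final decision to the user
        })
    return {"result": candidates}


# Invented clusters, shaped like the interpretations described above.
example_clusters = [
    (("Felis tigris", None, None, "Animalia"), 1),
    (("Felis tigris", "Linnaeus, 1758", "Panthera tigris", "Animalia"), 2),
    (("Felis tigris", None, None, "Metazoa"), 1),
]

response = to_reconciliation_result(example_clusters)
print(json.dumps(response, indent=2))
```

Leaving "match" as False for every candidate is a deliberate choice in this sketch: it keeps the service from auto-accepting a name and leaves the judgement with the person cleaning the data.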

Using our reconciliation service, you can immediately see how many checklists agree on the most important details of the name match, and whether a name should be replaced with an accepted name. The same name may also be spelled identically under different nomenclatural codes: for example, does "Ficus" refer to the genus Ficus Röding, 1798 or the genus Ficus L.? If you know that the former is in kingdom Animalia while the latter is in Plantae, it becomes easier to figure out the right match for your dataset.
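
As a small illustration of using the kingdom to separate homonyms, the Python sketch below filters candidate matches by an expected kingdom. The dictionary layout and the function name are hypothetical; only the two Ficus names and their kingdoms come from the example above.

```python
def filter_by_kingdom(candidates, kingdom):
    """Keep only the candidate matches placed in the given kingdom."""
    return [c for c in candidates if c.get("kingdom") == kingdom]


# The two homonyms mentioned above, in a hypothetical dictionary layout.
candidates = [
    {"name": "Ficus Röding, 1798", "kingdom": "Animalia"},
    {"name": "Ficus L.", "kingdom": "Plantae"},
]

# For a dataset of plants, only the botanical genus should remain.
plant_matches = filter_by_kingdom(candidates, "Plantae")
print(plant_matches)
```

Candidates with no kingdom at all are dropped by this filter, so in practice you may prefer to review those by hand rather than discard them.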

We've designed a complete workflow around our reconciliation service, starting with ITIS as a very fast first step to catch the most well recognized names, and ending with EOL's fuzzy matching search as a final step to look for incorrectly spelled names. For VertNet's 2013 Biodiversity Informatics Training Workshop, we wrote two tutorials that walk you through our workflow:

If you're already familiar with OpenRefine, you can add the reconciliation service with the URL:
Give it a try, and let us know if it helps you reconcile names faster!

The Map of Life project is continuing to work on improving OpenRefine for taxonomic use in a project we call TaxRefine. If you have suggestions for features you'd like to see, please let us know! You can leave a comment on this blog post, or add an issue to our issue tracker on GitHub.