Thursday, 22 June 2017

GBIF Name Parser

The GBIF name parser has been a fundamental library for GBIF to parse a scientific name string into a structured representation of a name. It has been refined over many years based on actual name strings encountered in the GBIF occurrence and checklist indices. Over the years the major design goals have not changed much and can be summarised as follows:
  • extract canonical, code relevant name parts
    • populate only the ParsedName class of the GBIF API
    • ignore any superflous name parts irrelevant to the code, e.g. species authorships in infraspecific names, infrageneric placements of species or superflous infraspecific parts in quadrinomials
  • deal with a wide variety of names that the ParsedName class can represent
    • cultivar names
    • bacterial strains & candidate names
    • virus names
    • named hybrids
    • taxon concept references, sensu latu/strictu or aggregates
    • legacy ranks
  • extract notes often found in names:
    • nomenclatural remarks
    • determination notes like aff. 
    • partially determined species, e.g. only down to the genus: Abies spec.
  • in case author parsing is impossible, fallback to parsing just the canonical name without authors
  • allow slightly imperfect names not strictly well formed according to the rules
  • classify names according to our NameType enumeration
Compared to gnparser these are slightly different goals explaining some of the behavior explained in the recent paper from Dmitry Mozzherin 2017. As that paper explains the GBIF name parser is based on regular expressions, some of them even recursive. This is not the reason why we do not support hybrid formulas though. Hybrid formulas (e.g. Quercus robur x Q. macrocarpa) as opposed to named hybrids (e.g. Quercus x turneri) are a variable combination of names and thus are very different to the Linnean names represented by a ParsedName. For name matching, backbone building and many more problems hybrid formulas are incompatible and we instead decided to deal with hybrid formulas just as with other unparsable viruses or OTU names that do not follow the neat structure of Linnean names. We simply keep the entire string as it was, classify it with a NameType and do not further parse it.

GBIF exposes the name parser through the GBIF JSON API, here are some examples for illustration:
Authorships are not (yet) parsed into a list of individual authors. This has been done internally already and it is something we are likely to expose in the future. Currently the authorship is parsed into four pieces, the authorship and year for the combination and basionym.

gnparser in GBIF

The GNA name parser is a great parser for well formed names. It has slightly different goals, but since it is available for the JVM we have wrapped it to support the GBIF NameParser interface producing ParsedName instances. Wrapping the Scala based gnaparser was not as trivial as we had thought due to its different parsing output and the mismatch of Scala and Java collections, but working against the NameParser interface you can finally select your parser of choice.

The authorship semantics for original names are also slightly different between the two parsers. Again some examples to illustrate the difference:
Azalea schlippenbachii (Maxim.) Kuntze
Both parsers show the same semantics:
GBIF:
"authorship": "Kuntze",
"bracketAuthorship": "Maxim.",

GNA:
"value": "(Maxim.) Kuntze",
"basionym_authorship": {
  "authors": ["Maxim."]
},
"combination_authorship": {
  "authors": ["Kuntze"]
}
Rhododendron schlippenbachii Maxim.
The GBIF parser places the author into “authorship” as the author of the very combination.
The gnparser places the author into the basionym authorship instead even though it is not surrounded by brackets.
As the parser cannot know if the name actually is a basionym, i.e. there indeed exists a subsequenct recombination, this was slightly unexpected
and the GBIF NameParser wrapper had to swap the authorship to populate a ParsedName in such cases:
GBIF:
"authorship": "Maxim.",

GNA:
"basionym_authorship": {
  "authors": ["Maxim."]
}
Puma concolor (Linnaeus, 1771)
Both parsers show the same semantics:
GBIF:
"bracketAuthorship": "Linnaeus",
"bracketYear": "1771",

GNA:
"basionym_authorship": {
  "authors": ["Linnaeus"],
  "year": {
    "value": "1771"
  }
}
Ex authors are not parsed in GBIF (yet). They do cause a little issue as the sequence of authors varies between botany and zoology. In botany, the author of the earlier name precedes the later, valid one while in zoology this is reversed. GNA follows the zoological model, even though far more usages of ex-authors can be found in botanical names.

Uninomials are also treated differently. GBIF uses a single property genusOrAbove for both the genus part of a binomial, a standalone genus or a uninomal of a rank above genus. GNA places genera from a binomial into the genus field, but uses uninomial for a standalone genus name.

Performance

We are still comparing gnparser with the GBIF name parser, but initial tests using gnparser-0.4.0 to parse 1380 names from our unit tests suggests the GBIF parser is up to twice as fast running in a Java environment. There is an overhead of wrapping the gnparser for the ParsedName result class. But even if we just parse the names and do not convert the Scala result into a ParsedName it takes up 75% more time:
  
Total time parsing 1380 names
MacBookPro 2017, Java8, single thread:

  GBIF: 1331ms
  GNA : 2596ms
  GNA-: 2323ms # without wrapper

This contradicts the results presented in the gnaparser paper, but might be related to the selection of names or running the parser in different environments.

Future

We are working with GNA to improve both parsers and align them more. With slightly different goals it might be hard to fully merge the two projects, but we will try to unify the efforts as much as we can. For the GBIF name parser we will be adding parsed author and ex author teams in the near future. This is needed to do author comparisons for better name matching in the GBIF backbone building (where it already exists) and the Catalogue of Life.