Developer Blog: GBIF Name Parser

The GBIF name parser has been a fundamental library for GBIF to parse a scientific name string into a structured representation of a name. It has been refined over many years based on actual name strings encountered in the GBIF occurrence and checklist indices. Over the years the major design goals have not changed much and can be summarised as follows:

- extract canonical, code relevant name parts
  - populate only the ParsedName class of the GBIF API
  - ignore any superfluous name parts irrelevant to the code, e.g. species authorships in infraspecific names, infrageneric placements of species or superfluous infraspecific parts in quadrinomials
- deal with a wide variety of names that the ParsedName class can represent
  - cultivar names
  - bacterial strains & candidate names
  - virus names
  - named hybrids
  - taxon concept references, sensu lato/stricto or aggregates
  - legacy ranks
- extract notes often found in names
  - nomenclatural remarks
  - determination notes like aff.
  - partially determined species, e.g. only down to the genus: Abies spec.
- in case author parsing is impossible, fall back to parsing just the canonical name without authors
- allow slightly imperfect names not strictly well formed according to the rules
- classify names according to our NameType enumeration

Compared to gnparser, these goals are slightly different, which explains some of the behavior described in the recent paper by Dmitry Mozzherin (2017). As that paper explains, the GBIF name parser is based on regular expressions, some of them even recursive. This is not the reason why we do not support hybrid formulas, though. Hybrid formulas (e.g. Quercus robur x Q. macrocarpa), as opposed to named hybrids (e.g. Quercus x turneri), are a variable combination of names and are thus very different from the Linnean names represented by a ParsedName. Hybrid formulas are incompatible with name matching, backbone building and many other tasks, so we decided to treat them like other unparsable names, such as virus or OTU names, that do not follow the neat structure of Linnean names. We simply keep the entire string as it was, classify it with a NameType and do not parse it any further.
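To illustrate the distinction between named hybrids and hybrid formulas, here is a minimal sketch in Python. It is not the logic of the GBIF parser, which relies on much more elaborate regular expressions. The function name, the returned labels and the two heuristics are assumptions made for this example only.

```python
def classify_hybrid(name: str) -> str:
    """Return a hypothetical label: 'formula', 'named' or 'none'.

    Heuristic only: a hybrid marker that follows a complete binomial, or
    that is followed by a capitalised token, is taken to join two names.
    """
    tokens = name.split()
    markers = [i for i, tok in enumerate(tokens) if tok in ("x", "×")]
    if not markers:
        return "none"
    for i in markers:
        # At least genus + epithet precede the marker.
        follows_binomial = i >= 2
        # The marker is followed by something that looks like a genus.
        next_is_capitalised = (
            i >= 1 and i + 1 < len(tokens) and tokens[i + 1][:1].isupper()
        )
        if follows_binomial or next_is_capitalised:
            return "formula"
    return "named"


if __name__ == "__main__":
    for example in ("Quercus x turneri", "Quercus robur x Q. macrocarpa"):
        print(example, "->", classify_hybrid(example))
```

A string labelled as a formula would be kept verbatim and only classified, whereas a named hybrid can still be parsed into the usual name parts.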

GBIF exposes the name parser through the GBIF JSON API. Here are some examples for illustration:

variety: Serjania meridionalis Cambess. var. o’donelli F.A. Barkley
basionym: Carex scirpoidea Michx. subsp. convoluta (Kük.) D.A.Dunlop
cultivar: Stephanandra incisa (Thunb.) Zabel cv. ‘Crispa’
subgenus: Polana (Bulbusana) vana DeLong & Freytag 1972
named hybrid: Quercus x turneri
hybrid formula: Quercus robur x Q. macrocarpa
virus: Choristoneura rosaceana entomopoxvirus
indetermined: Abies spec.
uncertain determination: Rasbora aff. elegans
nomenclatural remark: Iridaea undulosa var. papillosa Bory de Saint-Vincent, nom. nud.
taxon concept: Achillea millefolium sec. Greuter 2009
serovar: Salmonella enterica serovar Typhimurium
bacterial strain: Yersinia pestis biovar orientalis str. IP674
legacy rank: Potamon (Pontipotamon) ibericum tauricum natio trojensis Pretzmann, 1983
sensu lato: Taraxacum erythrospermum s.l.
placeholder: Asteraceae incertae sedis
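Any of these strings can be sent to the parser over HTTP. The following Python sketch builds the request URL and, optionally, fetches the result. The endpoint path `https://api.gbif.org/v1/parser/name` with a `name` query parameter is an assumption here and should be checked against the current GBIF API documentation; the helper names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the GBIF name parser web service.
PARSER_ENDPOINT = "https://api.gbif.org/v1/parser/name"


def parser_url(name: str) -> str:
    """Build the GET URL for a single name string (URL-encoded)."""
    return f"{PARSER_ENDPOINT}?{urlencode({'name': name})}"


def fetch_parsed(name: str):
    """Call the service and decode the JSON response (needs network access)."""
    with urlopen(parser_url(name)) as response:
        return json.load(response)


if __name__ == "__main__":
    print(parser_url("Rasbora aff. elegans"))
```

Only `parser_url` is free of side effects; `fetch_parsed` performs a real network request and its response structure is not shown here.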

Authorships are not (yet) parsed into a list of individual authors. This has already been done internally and is something we are likely to expose in the future. Currently the authorship is parsed into four pieces: the authorship and the year for both the combination and the basionym.

gnparser in GBIF

The GNA name parser (gnparser) is a great parser for well-formed names. It has slightly different goals, but since it is available for the JVM we have wrapped it to support the GBIF NameParser interface, producing ParsedName instances. Wrapping the Scala-based gnparser was not as trivial as we had thought, due to its different parsing output and the mismatch between Scala and Java collections, but by working against the NameParser interface you can finally select your parser of choice.

The authorship semantics for original names are also slightly different between the two parsers. Again some examples to illustrate the difference:
Azalea schlippenbachii (Maxim.) Kuntze
Both parsers show the same semantics:

GBIF:
"authorship": "Kuntze",
"bracketAuthorship": "Maxim.",

GNA:
"value": "(Maxim.) Kuntze",
"basionym_authorship": {
  "authors": ["Maxim."]
},
"combination_authorship": {
  "authors": ["Kuntze"]
}

Rhododendron schlippenbachii Maxim.
The GBIF parser places the author into “authorship”, as the author of the combination itself.
The gnparser instead places the author into the basionym authorship, even though it is not surrounded by brackets.
As the parser cannot know whether the name actually is a basionym, i.e. whether a subsequent recombination indeed exists, this was slightly unexpected,
and the GBIF NameParser wrapper had to swap the authorship to populate a ParsedName in such cases:

GBIF:
"authorship": "Maxim.",

GNA:
"basionym_authorship": {
  "authors": ["Maxim."]
}
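The swap can be sketched as follows. This is not the actual wrapper code, which is written for the JVM; it is a simplified Python illustration working on plain dictionaries shaped like the snippets above. It assumes that `value` holds the verbatim authorship string, and the function names are hypothetical.

```python
def _fill(target: dict, team: dict, author_key: str, year_key: str) -> None:
    """Copy authors and optional year of one team into GBIF-style keys."""
    target[author_key] = ", ".join(team.get("authors", []))
    year = team.get("year", {}).get("value")
    if year:
        target[year_key] = year


def to_gbif_authorship(gna: dict) -> dict:
    """Map gnparser-style authorship onto GBIF-style fields (sketch)."""
    basionym = gna.get("basionym_authorship")
    combination = gna.get("combination_authorship")
    # Assumption: "value" is the verbatim authorship, e.g. "(Maxim.) Kuntze".
    in_brackets = gna.get("value", "").lstrip().startswith("(")

    result: dict = {}
    if basionym and not combination and not in_brackets:
        # A lone author without brackets is the author of the combination.
        _fill(result, basionym, "authorship", "year")
        return result
    if basionym:
        _fill(result, basionym, "bracketAuthorship", "bracketYear")
    if combination:
        _fill(result, combination, "authorship", "year")
    return result


if __name__ == "__main__":
    print(to_gbif_authorship(
        {"value": "Maxim.", "basionym_authorship": {"authors": ["Maxim."]}}
    ))
```

Joining several authors with a comma is a simplification; the real parsers keep the original author formatting.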

Puma concolor (Linnaeus, 1771)
Both parsers show the same semantics:

GBIF:
"bracketAuthorship": "Linnaeus",
"bracketYear": "1771",

GNA:
"basionym_authorship": {
  "authors": ["Linnaeus"],
  "year": {
    "value": "1771"
  }
}

Ex authors are not parsed in GBIF (yet). They cause a small problem, as the order of the authors differs between botany and zoology. In botany, the author of the earlier name precedes the later, valid one, while in zoology this is reversed. GNA follows the zoological model, even though far more usages of ex authors can be found in botanical names.
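The difference in ordering can be made explicit with a small Python sketch. It is neither GBIF nor GNA code; the function, the dictionary keys and the author names "Smith" and "Jones" are placeholders invented for this example.

```python
def split_ex_authors(authorship: str, code: str) -> dict:
    """Split 'A ex B' into the earlier (ex) author and the valid author.

    code is either "botanical" or "zoological" (hypothetical values).
    """
    if " ex " not in authorship:
        return {"ex": None, "valid": authorship}
    first, second = (part.strip() for part in authorship.split(" ex ", 1))
    if code == "botanical":
        # Botany: earlier author first, validly publishing author second.
        return {"ex": first, "valid": second}
    if code == "zoological":
        # Zoology: the order is reversed.
        return {"ex": second, "valid": first}
    raise ValueError(f"unknown nomenclatural code: {code}")


if __name__ == "__main__":
    print(split_ex_authors("Smith ex Jones", "botanical"))
    print(split_ex_authors("Smith ex Jones", "zoological"))
```

The same string therefore yields opposite roles depending on the nomenclatural code, which is why a parser has to pick one model or be told the code.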

Uninomials are also treated differently. GBIF uses a single property genusOrAbove for the genus part of a binomial, a standalone genus, or a uninomial of a rank above genus. GNA places genera from a binomial into the genus field, but uses uninomial for a standalone genus name.
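A wrapper has to bridge this difference. The Python sketch below assumes a simplified, flat dictionary with `genus` and `uninomial` keys; the real gnparser output is more deeply nested, so this shape is an assumption made for illustration only.

```python
from typing import Optional


def genus_or_above(gna: dict) -> Optional[str]:
    """Collapse gnparser-style 'genus' / 'uninomial' into one value (sketch)."""
    # A binomial carries "genus"; a standalone name carries "uninomial".
    return gna.get("genus") or gna.get("uninomial")


if __name__ == "__main__":
    print(genus_or_above({"genus": "Puma", "specific_epithet": "concolor"}))
    print(genus_or_above({"uninomial": "Asteraceae"}))
```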

Performance

We are still comparing gnparser with the GBIF name parser, but initial tests using gnparser-0.4.0 to parse 1380 names from our unit tests suggest that the GBIF parser is up to twice as fast when running in a Java environment. Wrapping gnparser to produce the ParsedName result class adds some overhead, but even if we just parse the names and do not convert the Scala result into a ParsedName, gnparser takes about 75% more time:

  
Total time parsing 1380 names
MacBookPro 2017, Java8, single thread:

  GBIF: 1331ms
  GNA : 2596ms
  GNA-: 2323ms # without wrapper

This contradicts the results presented in the gnparser paper, but might be related to the selection of names or to running the parsers in different environments.
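The relative figures quoted above follow directly from the three timings. As a quick check, this short Python sketch derives the slowdown factors and the throughput in names per second; the variable and function names are chosen for this example only.

```python
NAMES = 1380
TIMINGS_MS = {"GBIF": 1331, "GNA": 2596, "GNA-": 2323}  # "GNA-": without wrapper


def slowdown(parser: str, baseline: str = "GBIF") -> float:
    """How many times longer a parser took compared to the baseline."""
    return TIMINGS_MS[parser] / TIMINGS_MS[baseline]


def names_per_second(parser: str) -> float:
    """Throughput over the whole run of 1380 names."""
    return NAMES / (TIMINGS_MS[parser] / 1000)


if __name__ == "__main__":
    for parser in TIMINGS_MS:
        print(
            f"{parser:5s} {slowdown(parser):.2f}x "
            f"{names_per_second(parser):.0f} names/s"
        )
```

The wrapped gnparser takes roughly 1.95 times as long as the GBIF parser, and roughly 1.75 times as long without the wrapper, which corresponds to the "about 75% more time" stated above.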

Future

We are working with GNA to improve both parsers and align them more. With slightly different goals it might be hard to fully merge the two projects, but we will try to unify the efforts as much as we can. For the GBIF name parser we will be adding parsed author and ex author teams in the near future. This is needed to do author comparisons for better name matching in the GBIF backbone building (where it already exists) and the Catalogue of Life.

4 comments:

dimus, 8 July 2017 at 11:09
Performance wise --- might the number of names? We used 1M names for performance tests
dimus, 10 July 2017 at 19:14
For benchmarking we used 1 000 000 instead of 1000, because JVM takes time to load, and because about any parser is fast enough for 1k of names. Both GBIF and GN parsers are way way faster than 1000 or 500 names/sec

Thursday, 22 June 2017
