The GBIF name parser has been a fundamental library for GBIF to parse a scientific name string into a structured representation of a name. It has been refined over many years based on actual name strings encountered in the GBIF occurrence and checklist indices. Over the years the major design goals have not changed much and can be summarised as follows:
- extract canonical, code relevant name parts
- populate only the ParsedName class of the GBIF API
- ignore any superfluous name parts irrelevant to the code, e.g. species authorships in infraspecific names, infrageneric placements of species or superfluous infraspecific parts in quadrinomials
- deal with a wide variety of names that the ParsedName class can represent:
  - cultivar names
  - bacterial strains & candidate names
  - virus names
  - named hybrids
  - taxon concept references, sensu lato/stricto or aggregates
  - legacy ranks
- extract notes often found in names:
  - nomenclatural remarks
  - determination notes like aff.
  - partially determined species, e.g. only down to the genus: Abies spec.
- in case author parsing is impossible, fall back to parsing just the canonical name without authors
- allow slightly imperfect names not strictly well formed according to the rules
- classify names according to our NameType enumeration
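The fallback goal in the list above can be illustrated with a deliberately tiny sketch. It is not the GBIF implementation: the class `FallbackParseSketch`, its `Result` holder and both regular expressions are hypothetical and only cover simple binomials, whereas the real parser handles far more cases. The idea is simply to try a strict pattern that includes the authorship first and, if that fails, to keep only the canonical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy illustration of "fall back to the canonical name"; not the GBIF parser. */
final class FallbackParseSketch {

    /** Hypothetical result holder, not the ParsedName class of the GBIF API. */
    static final class Result {
        final String canonical;
        final String authorship; // null when the authorship could not be parsed

        Result(String canonical, String authorship) {
            this.canonical = canonical;
            this.authorship = authorship;
        }
    }

    // Strict pattern: a binomial followed by one simple author token, e.g. "Abies alba Mill."
    private static final Pattern WITH_AUTHOR =
        Pattern.compile("^([A-Z][a-z]+ [a-z]+) ([A-Z][A-Za-z.]*)$");

    // Lenient pattern: only the leading binomial, whatever follows is ignored
    private static final Pattern CANONICAL_ONLY =
        Pattern.compile("^([A-Z][a-z]+ [a-z]+)\\b.*$");

    static Result parse(String name) {
        String trimmed = name.trim();
        Matcher strict = WITH_AUTHOR.matcher(trimmed);
        if (strict.matches()) {
            return new Result(strict.group(1), strict.group(2));
        }
        // Author parsing failed: keep the canonical name and drop the rest
        Matcher lenient = CANONICAL_ONLY.matcher(trimmed);
        if (lenient.matches()) {
            return new Result(lenient.group(1), null);
        }
        return null; // not even a canonical binomial could be found
    }
}
```

With these toy patterns, `parse("Abies alba Mill.")` yields both the canonical name and the author, while a string with an unreadable authorship still yields `Abies alba` with a null authorship.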
GBIF exposes the name parser through the GBIF JSON API. Here are some examples for illustration:
- variety: Serjania meridionalis Cambess. var. o’donelli F.A. Barkley
- basionym: Carex scirpoidea Michx. subsp. convoluta (Kük.) D.A.Dunlop
- cultivar: Stephanandra incisa (Thunb.) Zabel cv. ‘Crispa’
- subgenus: Polana (Bulbusana) vana DeLong & Freytag 1972
- named hybrid: Quercus x turneri
- hybrid formula: Quercus robur x Q. macrocarpa
- virus: Choristoneura rosaceana entomopoxvirus
- indeterminate: Abies spec.
- uncertain determination: Rasbora aff. elegans
- nomenclatural remark: Iridaea undulosa var. papillosa Bory de Saint-Vincent, nom. nud.
- taxon concept: Achillea millefolium sec. Greuter 2009
- serovar: Salmonella enterica serovar Typhimurium
- bacterial strain: Yersinia pestis biovar orientalis str. IP674
- legacy rank: Potamon (Pontipotamon) ibericum tauricum natio trojensis Pretzmann, 1983
- sensu lato: Taraxacum erythrospermum s.l.
- placeholder: Asteraceae incertae sedis
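To try one of these names against the JSON API, the request URL has to carry the name as an encoded query parameter. The sketch below only builds that URL. It assumes the public endpoint `https://api.gbif.org/v1/parser/name` with a `name` parameter; the class and method names are made up for illustration, so check the API documentation before relying on the path.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds a request URL for the name parser web service (endpoint path is an assumption). */
final class ParserUrlSketch {

    // Assumed public endpoint; verify against the GBIF API documentation
    static final String ENDPOINT = "https://api.gbif.org/v1/parser/name";

    static String parserUrl(String scientificName) {
        try {
            // Form-encode the name so spaces, brackets and ampersands survive in the query string
            return ENDPOINT + "?name=" + URLEncoder.encode(scientificName, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `parserUrl("Abies spec.")` returns `https://api.gbif.org/v1/parser/name?name=Abies+spec.`, which can be opened in a browser or fetched with any HTTP client.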
gnparser in GBIF
The GNA name parser (gnparser) is a great parser for well-formed names. It has slightly different goals, but since it is available for the JVM we have wrapped it to support the GBIF NameParser interface, producing ParsedName instances. Wrapping the Scala-based gnparser was not as trivial as we had thought, due to its different parsing output and the mismatch between Scala and Java collections, but by working against the NameParser interface you can now select your parser of choice.

The authorship semantics for original names are also slightly different between the two parsers. Again, some examples to illustrate the difference:
Azalea schlippenbachii (Maxim.) Kuntze
Both parsers show the same semantics:
GBIF: "authorship": "Kuntze", "bracketAuthorship": "Maxim."
GNA: "value": "(Maxim.) Kuntze", "basionym_authorship": { "authors": ["Maxim."] }, "combination_authorship": { "authors": ["Kuntze"] }

Rhododendron schlippenbachii Maxim.
The GBIF parser places the author into “authorship” as the author of the combination itself. The gnparser places the author into the basionym authorship instead, even though it is not surrounded by brackets. As the parser cannot know whether the name actually is a basionym, i.e. whether a subsequent recombination indeed exists, this was slightly unexpected, and the GBIF NameParser wrapper had to swap the authorship to populate a ParsedName in such cases:
GBIF: "authorship": "Maxim."
GNA: "basionym_authorship": { "authors": ["Maxim."] }

Puma concolor (Linnaeus, 1771)
Both parsers show the same semantics:
GBIF: "bracketAuthorship": "Linnaeus", "bracketYear": "1771"
GNA: "basionym_authorship": { "authors": ["Linnaeus"], "year": { "value": "1771" } }

Ex authors are not parsed in GBIF (yet). They cause a small problem because the order of authors differs between botany and zoology. In botany, the author of the earlier name precedes the later, valid one, while in zoology this is reversed. GNA follows the zoological model, even though far more usages of ex-authors can be found in botanical names.
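The authorship swap described for Rhododendron schlippenbachii Maxim. can be sketched as follows. This is not the actual wrapper code: `GnaAuthorship` and `GbifAuthorship` are hypothetical, simplified stand-ins for the gnparser output and for the author fields of a ParsedName, and the rule "no brackets in the authorship value and no combination authors" is an assumption about how such a case could be detected.

```java
import java.util.List;

/** Sketch of mapping gnparser authorship semantics onto the GBIF fields; all types are hypothetical. */
final class AuthorshipSwapSketch {

    /** Simplified view of the gnparser authorship output. */
    static final class GnaAuthorship {
        final String value;                    // verbatim authorship, e.g. "(Maxim.) Kuntze"
        final List<String> basionymAuthors;
        final List<String> combinationAuthors;

        GnaAuthorship(String value, List<String> basionymAuthors, List<String> combinationAuthors) {
            this.value = value;
            this.basionymAuthors = basionymAuthors;
            this.combinationAuthors = combinationAuthors;
        }
    }

    /** Stand-in for the authorship and bracketAuthorship properties of a ParsedName. */
    static final class GbifAuthorship {
        final String authorship;
        final String bracketAuthorship;

        GbifAuthorship(String authorship, String bracketAuthorship) {
            this.authorship = authorship;
            this.bracketAuthorship = bracketAuthorship;
        }
    }

    static GbifAuthorship toGbif(GnaAuthorship gna) {
        String basionym = join(gna.basionymAuthors);
        String combination = join(gna.combinationAuthors);
        boolean bracketed = gna.value.trim().startsWith("(");
        if (!bracketed && combination == null) {
            // Assumed rule: an unbracketed "basionym" author is the author of the name as given
            return new GbifAuthorship(basionym, null);
        }
        // Otherwise the semantics agree: brackets hold the basionym author
        return new GbifAuthorship(combination, basionym);
    }

    private static String join(List<String> authors) {
        return authors.isEmpty() ? null : String.join(" & ", authors);
    }
}
```

Under these assumptions the three examples map to `authorship=Kuntze, bracketAuthorship=Maxim.` for the Azalea name, `authorship=Maxim.` for the Rhododendron name, and `bracketAuthorship=Linnaeus` for Puma concolor.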
Uninomials are also treated differently. GBIF uses a single property, genusOrAbove, for the genus part of a binomial, a standalone genus and a uninomial of a rank above genus. GNA places genera from a binomial into the genus field, but uses uninomial for a standalone genus name.
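For a wrapper this difference boils down to a one-line mapping. The helper below is a hypothetical illustration that takes the two gnparser fields as plain strings; it is not taken from the wrapper itself.

```java
/** Sketch: collapse the gnparser genus and uninomial fields into one genusOrAbove value. */
final class GenusOrAboveSketch {

    static String genusOrAbove(String gnaGenus, String gnaUninomial) {
        // A binomial carries a genus; a standalone genus or higher-rank name arrives as uninomial
        return gnaGenus != null ? gnaGenus : gnaUninomial;
    }
}
```

So `genusOrAbove("Puma", null)` and `genusOrAbove(null, "Asteraceae")` both end up in the same GBIF property.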
Performance
We are still comparing gnparser with the GBIF name parser, but initial tests using gnparser-0.4.0 to parse 1380 names from our unit tests suggest the GBIF parser is up to twice as fast when running in a Java environment. Wrapping the gnparser result into the ParsedName class adds some overhead, but even if we just parse the names and do not convert the Scala result into a ParsedName, gnparser takes about 75% more time.
Total time parsing 1380 names (MacBook Pro 2017, Java 8, single thread):
GBIF: 1331ms
GNA: 2596ms
GNA-: 2323ms # without wrapper
This contradicts the results presented in the gnparser paper, but the difference might be related to the selection of names or to running the parser in different environments.
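A rough single-threaded comparison of this kind can be reproduced with a harness like the one below. It is a sketch, not the code used for the numbers above: the parser is passed in as a plain `Function`, so either implementation can be plugged in, and it ignores JIT warm-up, which a proper benchmark framework would handle. The second method shows how the relative differences quoted in the text follow from the totals.

```java
import java.util.List;
import java.util.function.Function;

/** Minimal wall-clock timing sketch for comparing two name parsers. */
final class ParseTimingSketch {

    /** Parses every name once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(List<String> names, Function<String, ?> parser) {
        long start = System.nanoTime();
        for (String name : names) {
            try {
                parser.apply(name);
            } catch (RuntimeException e) {
                // Unparsable names still count towards the total time
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Extra time of the other parser relative to the baseline, e.g. 0.75 for 75% more. */
    static double relativeOverhead(long baselineMs, long otherMs) {
        return (otherMs - baselineMs) / (double) baselineMs;
    }
}
```

Applied to the totals above, `relativeOverhead(1331, 2323)` is roughly 0.75 for gnparser without the wrapper, and `relativeOverhead(1331, 2596)` is roughly 0.95 with it.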