The GBIF name parser has been a fundamental library for GBIF to parse a scientific name string into a structured representation of a name. It has been refined over many years based on actual name strings encountered in the GBIF occurrence and checklist indices. Over the years the major design goals have not changed much and can be summarised as follows:
- extract canonical, code-relevant name parts
- populate only the ParsedName class of the GBIF API
- ignore any superfluous name parts irrelevant to the code, e.g. species authorships in infraspecific names, infrageneric placements of species or superfluous infraspecific parts in quadrinomials
- deal with a wide variety of names that the ParsedName class can represent:
  - cultivar names
  - bacterial strains & candidate names
  - virus names
  - named hybrids
  - taxon concept references, sensu lato/stricto or aggregates
  - legacy ranks
- extract notes often found in names:
  - nomenclatural remarks
  - determination notes like aff.
  - partially determined species, e.g. only down to the genus: Abies spec.
- in case author parsing is impossible, fall back to parsing just the canonical name without authors
- allow slightly imperfect names not strictly well formed according to the rules
- classify names according to our NameType enumeration
GBIF exposes the name parser through the GBIF JSON API. Here are some examples for illustration:
- variety: Serjania meridionalis Cambess. var. o’donelli F.A. Barkley
- basionym: Carex scirpoidea Michx. subsp. convoluta (Kük.) D.A.Dunlop
- cultivar: Stephanandra incisa (Thunb.) Zabel cv. ‘Crispa’
- subgenus: Polana (Bulbusana) vana DeLong & Freytag 1972
- named hybrid: Quercus x turneri
- hybrid formula: Quercus robur x Q. macrocarpa
- virus: Choristoneura rosaceana entomopoxvirus
- indeterminate: Abies spec.
- uncertain determination: Rasbora aff. elegans
- nomenclatural remark: Iridaea undulosa var. papillosa Bory de Saint-Vincent, nom. nud.
- taxon concept: Achillea millefolium sec. Greuter 2009
- serovar: Salmonella enterica serovar Typhimurium
- bacterial strain: Yersinia pestis biovar orientalis str. IP674
- legacy rank: Potamon (Pontipotamon) ibericum tauricum natio trojensis Pretzmann, 1983
- sensu lato: Taraxacum erythrospermum s.l.
- placeholder: Asteraceae incertae sedis
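To try such name strings yourself, a small client along the following lines may help. It is only a sketch: the endpoint URL and the `name` query parameter are assumptions about the public GBIF API and should be checked against the API documentation, and the helper names `parser_url` and `parse_name` are made up for this illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public name parser service lives at this URL.
PARSER_URL = "https://api.gbif.org/v1/parser/name"


def parser_url(name: str) -> str:
    """Build the GET URL that asks the service to parse one name string."""
    return PARSER_URL + "?" + urlencode({"name": name})


def parse_name(name: str, timeout: float = 10.0):
    """Fetch and decode the JSON response for one name (needs network access)."""
    with urlopen(parser_url(name), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the request URLs; call parse_name() to hit the service.
    for example in ["Quercus x turneri", "Rasbora aff. elegans", "Abies spec."]:
        print(parser_url(example))
```

Building the URL separately from fetching it keeps the encoding step easy to inspect without any network access.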
gnparser in GBIF
The GNA name parser is a great parser for well-formed names. It has slightly different goals, but since it is available for the JVM we have wrapped it to support the GBIF NameParser interface, producing ParsedName instances. Wrapping the Scala-based gnparser was not as trivial as we had thought, due to its different parsing output and the mismatch between Scala and Java collections, but by working against the NameParser interface you can now select your parser of choice.
The authorship semantics for original names are also slightly different between the two parsers. Again, some examples to illustrate the difference:
Azalea schlippenbachii (Maxim.) Kuntze
Both parsers show the same semantics:
GBIF: "authorship": "Kuntze", "bracketAuthorship": "Maxim."
GNA: "value": "(Maxim.) Kuntze", "basionym_authorship": { "authors": ["Maxim."] }, "combination_authorship": { "authors": ["Kuntze"] }
Rhododendron schlippenbachii Maxim.
The GBIF parser places the author into “authorship” as the author of the very combination.
The gnparser places the author into the basionym authorship instead even though it is not surrounded by brackets.
As the parser cannot know whether the name actually is a basionym, i.e. whether a subsequent recombination indeed exists, this was slightly unexpected, and the GBIF NameParser wrapper had to swap the authorship to populate a ParsedName in such cases:
GBIF: "authorship": "Maxim."
GNA: "basionym_authorship": { "authors": ["Maxim."] }
Puma concolor (Linnaeus, 1771)
Both parsers show the same semantics:
GBIF: "bracketAuthorship": "Linnaeus", "bracketYear": "1771"
GNA: "basionym_authorship": { "authors": ["Linnaeus"], "year": { "value": "1771" } }
Ex authors are not parsed in GBIF (yet). They do cause a small problem, as the sequence of authors varies between botany and zoology. In botany, the author of the earlier name precedes the later, valid one, while in zoology this is reversed. GNA follows the zoological model, even though far more usages of ex-authors can be found in botanical names.
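The authorship handling shown for the three example names above can be sketched in a few lines. This is not the actual wrapper code, only an illustration under stated assumptions: the GNA result is modelled as a plain dict shaped like the snippets above, the `value` field is assumed to hold the verbatim authorship string (it is only shown for the first example), and joining several authors with " & " is an arbitrary choice. The function names are hypothetical.

```python
def _fill(result: dict, part: dict, author_key: str, year_key: str) -> None:
    """Copy authors and year from one GNA authorship part into GBIF-style keys."""
    authors = part.get("authors", [])
    if authors:
        # Assumption: multiple authors are joined with " & ".
        result[author_key] = " & ".join(authors)
    year = part.get("year", {}).get("value")
    if year:
        result[year_key] = year


def to_gbif_authorship(gna: dict) -> dict:
    """Map a GNA-style authorship dict to GBIF-style ParsedName fields."""
    result = {}
    basionym = gna.get("basionym_authorship")
    combination = gna.get("combination_authorship")
    # Assumption: "value" is the verbatim authorship, so a leading "(" means brackets.
    bracketed = gna.get("value", "").lstrip().startswith("(")

    if combination:
        _fill(result, combination, "authorship", "year")
    if basionym:
        if combination or bracketed:
            _fill(result, basionym, "bracketAuthorship", "bracketYear")
        else:
            # The swap: an unbracketed basionym author becomes the plain authorship.
            _fill(result, basionym, "authorship", "year")
    return result


if __name__ == "__main__":
    print(to_gbif_authorship({"value": "Maxim.",
                              "basionym_authorship": {"authors": ["Maxim."]}}))
```

The only case that needs special treatment is a basionym authorship without brackets and without a combination authorship; everything else maps straight across.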
Uninomials are also treated differently. GBIF uses a single property, genusOrAbove, for the genus part of a binomial, a standalone genus, or a uninomial of a rank above genus. GNA places the genus of a binomial into the genus field, but uses uninomial for a standalone genus name.
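Collapsing the two GNA fields into the single GBIF property could look roughly like this. The sketch assumes a simplified, flat dict with plain string values for `genus` and `uninomial`; the real gnparser output is richer, and the function name is hypothetical.

```python
from typing import Optional


def genus_or_above(gna: dict) -> Optional[str]:
    """Return what would go into GBIF's genusOrAbove for a simplified GNA result."""
    # A binomial carries "genus"; a standalone name carries "uninomial".
    return gna.get("genus") or gna.get("uninomial")


if __name__ == "__main__":
    print(genus_or_above({"genus": "Abies", "specific_epithet": "alba"}))
    print(genus_or_above({"uninomial": "Asteraceae"}))
```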
Performance
We are still comparing gnparser with the GBIF name parser, but initial tests using gnparser-0.4.0 to parse 1380 names from our unit tests suggest the GBIF parser is up to twice as fast when running in a Java environment. Wrapping the gnparser result into the ParsedName class adds some overhead, but even if we just parse the names and do not convert the Scala result into a ParsedName, it takes about 75% more time:
Total time parsing 1380 names (MacBookPro 2017, Java8, single thread):
GBIF: 1331ms
GNA : 2596ms
GNA-: 2323ms # without wrapper
This contradicts the results presented in the gnparser paper, but might be related to the selection of names or to running the parser in different environments.
Performance-wise, might it be the number of names? We used 1M names for performance tests.
For benchmarking we used 1 000 000 names instead of 1000, because the JVM takes time to load, and because just about any parser is fast enough for 1k names. Both the GBIF and GN parsers are way faster than 1000 or 500 names/sec.
The parsing time excludes JVM startup, and the parser instance is created before time is measured. This is just the parsing time. It was not measured with a proper performance framework, but still. The parsing times depend heavily on the name being parsed. Large authorships with 20 or more authors can slow down the GBIF parser, for example. The test set of 1380 names explicitly tries not to be just simple binomials.
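The measurement approach described here, with the parser set up first and only the parse calls timed, can be sketched as follows. This is a generic illustration rather than the code used for the numbers in this post: `time_parsing` and `toy_parse` are hypothetical names, and `toy_parse` merely stands in for a real parser.

```python
import time
from typing import Callable, Iterable


def time_parsing(parse: Callable[[str], object], names: Iterable[str], repeat: int = 1) -> float:
    """Return wall-clock milliseconds spent parsing `names` `repeat` times."""
    names = list(names)  # materialise the input before the clock starts
    start = time.perf_counter()
    for _ in range(repeat):
        for name in names:
            parse(name)
    return (time.perf_counter() - start) * 1000.0


def toy_parse(name: str) -> list:
    """Hypothetical stand-in for a real parser: splits on whitespace."""
    return name.split()


if __name__ == "__main__":
    sample = ["Abies alba Mill.", "Puma concolor (Linnaeus, 1771)"]
    for repeat in (1, 10, 100):
        elapsed = time_parsing(toy_parse, sample, repeat)
        print(f"{len(sample) * repeat} names: {elapsed:.3f} ms")
```

A proper benchmark framework would add warm-up rounds and repeated measurements; this sketch deliberately does neither.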
I have run more tests and the GNA parser becomes faster the more names are parsed.
Repeating the same 1380 names 1x, 10x and 100x:
GBIF - total time parsing 1380 names: 1128 ms
GNA - total time parsing 1380 names: 1666 ms
GBIF - total time parsing 13800 names: 13612 ms
GNA - total time parsing 13800 names: 5023 ms
GBIF - total time parsing 138000 names: 148381 ms
GNA - total time parsing 138000 names: 27333 ms
I thought there might be some caching involved in the GNA parser, as the GBIF parsing time is rather linear.
So I've tried 1380 names 1x, 10x and 100x with random binomials including an author (e.g. Zpc aafoax Iiv; Aioaeuzoai eaemeau Oeovzmboular):
GBIF - total time parsing 1380 names: 333 ms
GNA - total time parsing 1380 names: 1516 ms
GBIF - total time parsing 13800 names: 2266 ms
GNA - total time parsing 13800 names: 3511 ms
GBIF - total time parsing 138000 names: 18715 ms
GNA - total time parsing 138000 names: 16223 ms
As you can see, GBIF becomes a lot faster with these simple names. Neither parser scales linearly anymore: GBIF gets slightly faster per name, but the GNA one gets a lot faster. It takes only about 10x more time for 100x the names.
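The scaling can be read directly off the random-binomial timings listed above. The short calculation below only restates those reported numbers; the helper names are made up for this illustration.

```python
# Milliseconds per run for the random binomials, copied from the figures above.
timings = {
    "GBIF": {1380: 333, 13800: 2266, 138000: 18715},
    "GNA": {1380: 1516, 13800: 3511, 138000: 16223},
}


def scaling(ms_by_count: dict) -> float:
    """Time of the largest run divided by the time of the smallest run."""
    smallest, largest = min(ms_by_count), max(ms_by_count)
    return ms_by_count[largest] / ms_by_count[smallest]


def ms_per_name(ms_by_count: dict) -> dict:
    """Average milliseconds spent per name for each run size."""
    return {count: ms / count for count, ms in ms_by_count.items()}


if __name__ == "__main__":
    for parser, runs in timings.items():
        print(parser, f"time factor for 100x names: {scaling(runs):.1f}")
        for count, per_name in ms_per_name(runs).items():
            print(f"  {count} names: {per_name:.3f} ms/name")
```

For 100x the names, the GBIF time grows by a factor of roughly 56 and the GNA time by a factor of roughly 10.7.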
I finally tried a million random names and added also an authorship year, e.g. Uouixuu eeuouao Vxgoea, 1806
GBIF - total time parsing 1000000 names: 188939 ms
GNA - total time parsing 1000000 names: 113468 ms
As I said, this does not use a test framework, and JVM garbage collection, for example, can happen at any time. Still, this is interesting behavior: the relative performance of the two parsers depends on the number of names to be parsed and also on the kind of names.
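Random test names of the kind used above (a capitalised genus, a lowercase epithet, a capitalised author and optionally a year) could be generated along these lines. This is a sketch and not the generator used for the tests: the word lengths, the uniform letter distribution, the year range and the fixed seed are all assumptions, and the function names are hypothetical.

```python
import random
import string


def random_word(rng: random.Random, min_len: int = 3, max_len: int = 12) -> str:
    """Return a random lowercase word; the length bounds are arbitrary."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def random_name(rng: random.Random, with_year: bool = False) -> str:
    """Build 'Genus epithet Author', optionally followed by ', year'."""
    genus = random_word(rng).capitalize()
    epithet = random_word(rng)
    author = random_word(rng).capitalize()
    name = f"{genus} {epithet} {author}"
    if with_year:
        # Assumption: any year between 1758 and 2017 is acceptable.
        name += f", {rng.randint(1758, 2017)}"
    return name


def random_names(count: int, with_year: bool = False, seed: int = 42) -> list:
    """Generate a reproducible list of random names."""
    rng = random.Random(seed)
    return [random_name(rng, with_year) for _ in range(count)]


if __name__ == "__main__":
    for name in random_names(3, with_year=True):
        print(name)
```

A fixed seed keeps the generated set identical between runs, so both parsers can be timed on exactly the same input.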