Thursday, 22 June 2017

GBIF Name Parser

The GBIF name parser has been a fundamental library for GBIF to parse a scientific name string into a structured representation of a name. It has been refined over many years based on actual name strings encountered in the GBIF occurrence and checklist indices. Over the years the major design goals have not changed much and can be summarised as follows:
  • extract canonical, code relevant name parts
    • populate only the ParsedName class of the GBIF API
    • ignore any superfluous name parts irrelevant to the code, e.g. species authorships in infraspecific names, infrageneric placements of species or superfluous infraspecific parts in quadrinomials
  • deal with a wide variety of names that the ParsedName class can represent
    • cultivar names
    • bacterial strains & candidate names
    • virus names
    • named hybrids
    • taxon concept references, sensu lato/stricto or aggregates
    • legacy ranks
  • extract notes often found in names:
    • nomenclatural remarks
    • determination notes like aff. 
    • partially determined species, e.g. only down to the genus: Abies spec.
  • in case author parsing is impossible, fall back to parsing just the canonical name without authors
  • allow slightly imperfect names not strictly well formed according to the rules
  • classify names according to our NameType enumeration
Compared to gnparser these are slightly different goals, which explains some of the behavior described in the recent paper from Dmitry Mozzherin 2017. As that paper explains, the GBIF name parser is based on regular expressions, some of them even recursive. This is not the reason why we do not support hybrid formulas, though. Hybrid formulas (e.g. Quercus robur x Q. macrocarpa), as opposed to named hybrids (e.g. Quercus x turneri), are a variable combination of names and thus are very different from the Linnean names represented by a ParsedName. Hybrid formulas are incompatible with name matching, backbone building and many other tasks, so we decided to treat them just like unparsable virus or OTU names that do not follow the neat structure of Linnean names: we simply keep the entire string as it was, classify it with a NameType and do not parse it further.
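To make the distinction between the two kinds of hybrids concrete, here is a minimal sketch in Python. The two regular expressions and the returned labels are assumptions made up for this illustration. They are not the expressions used by the GBIF name parser and not values of the NameType enumeration.

```python
import re

# Illustrative patterns only; the real GBIF parser uses far more elaborate expressions.
# A hybrid marker followed by a capitalised word means a second name starts: a formula.
FORMULA = re.compile(r"\s[x×]\s+[A-Z]")
# A marker at the very start of the name or in front of an epithet: a named hybrid.
# "x" must be followed by whitespace so epithets such as "xalapensis" are not hit.
NAMED = re.compile(r"(?:^|\s)(?:x\s+|×\s*)[A-Za-z]")


def classify_hybrid(name):
    """Return a hypothetical label, not a value of the GBIF NameType enumeration."""
    if FORMULA.search(name):
        return "hybrid formula"  # kept as the entire string, not parsed further
    if NAMED.search(name):
        return "named hybrid"    # still fits the structure of a ParsedName
    return "no hybrid"
```

With these patterns "Quercus robur x Q. macrocarpa" is labelled a hybrid formula and "Quercus x turneri" a named hybrid, while an epithet that merely starts with an x is left alone.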

GBIF exposes the name parser through the GBIF JSON API. Here are some examples for illustration:
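As a rough sketch, a request URL for the parser can be built as shown here. The endpoint path is an assumption based on the public GBIF API and should be checked against the current API documentation. The helper name parser_url is made up for this example, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumed endpoint of the GBIF name parser; verify against the GBIF API docs.
GBIF_PARSER_ENDPOINT = "https://api.gbif.org/v1/parser/name"


def parser_url(name):
    """Build a GET URL that asks the parser for a single scientific name."""
    # urlencode escapes spaces, brackets and other reserved characters.
    return GBIF_PARSER_ENDPOINT + "?" + urlencode({"name": name})


if __name__ == "__main__":
    print(parser_url("Abies alba Mill."))
```

Opening such a URL in a browser or with any HTTP client should return the parsed name as JSON.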
Authorships are not (yet) parsed into a list of individual authors. This has been done internally already and it is something we are likely to expose in the future. Currently the authorship is parsed into four pieces: the authorship and the year, each for the combination and for the basionym.
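The four pieces can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the GBIF implementation: the regular expressions only handle one optional bracketed part followed by one optional unbracketed part, and the key "year" for the combination year is an assumed name.

```python
import re

# Optional "(basionym part)" followed by an optional combination part.
AUTHORSHIP = re.compile(r"^\s*(?:\((?P<bracket>[^()]*)\))?\s*(?P<comb>[^()]*)$")
# Author text with an optional trailing four-digit year, e.g. "Linnaeus, 1771".
AUTHOR_YEAR = re.compile(r"^(?P<author>.*?)(?:,?\s*(?P<year>\d{4}))?$")


def _author_and_year(text):
    match = AUTHOR_YEAR.match(text.strip())
    author = match.group("author").strip() or None
    return author, match.group("year")


def split_authorship(text):
    """Split an authorship string into the four pieces described above."""
    match = AUTHORSHIP.match(text)
    if match is None:  # e.g. nested brackets, which this sketch does not cover
        return None
    bracket_author, bracket_year = _author_and_year(match.group("bracket") or "")
    author, year = _author_and_year(match.group("comb") or "")
    return {
        "authorship": author,
        "year": year,  # assumed key name for the combination year
        "bracketAuthorship": bracket_author,
        "bracketYear": bracket_year,
    }
```

For "(Maxim.) Kuntze" this yields Kuntze as the authorship and Maxim. as the bracket authorship, with both years empty.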

gnparser in GBIF

The GNA name parser is a great parser for well formed names. It has slightly different goals, but since it is available for the JVM we have wrapped it to support the GBIF NameParser interface producing ParsedName instances. Wrapping the Scala-based gnparser was not as trivial as we had thought due to its different parsing output and the mismatch of Scala and Java collections, but working against the NameParser interface you can finally select your parser of choice.

The authorship semantics for original names are also slightly different between the two parsers. Again some examples to illustrate the difference:
Azalea schlippenbachii (Maxim.) Kuntze
Both parsers show the same semantics:
GBIF:
"authorship": "Kuntze",
"bracketAuthorship": "Maxim.",

GNA:
"value": "(Maxim.) Kuntze",
"basionym_authorship": {
  "authors": ["Maxim."]
},
"combination_authorship": {
  "authors": ["Kuntze"]
}
Rhododendron schlippenbachii Maxim.
The GBIF parser places the author into “authorship” as the author of the very combination.
The gnparser places the author into the basionym authorship instead even though it is not surrounded by brackets.
As the parser cannot know whether the name actually is a basionym, i.e. whether a subsequent recombination indeed exists, this was slightly unexpected,
and the GBIF NameParser wrapper had to swap the authorship to populate a ParsedName in such cases:
GBIF:
"authorship": "Maxim.",

GNA:
"basionym_authorship": {
  "authors": ["Maxim."]
}
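The swap can be sketched in Python as follows. This is only an illustration of the idea, not the actual wrapper code (which is written in Java). It assumes the GNA authorship arrives as a plain dictionary shaped like the fragments above and that its "value" field is always present, so that brackets can be detected. The "year" key for the combination year is likewise an assumption.

```python
def _authors(team):
    # Simplification: join a team with ", "; real formatting rules differ.
    return ", ".join(team["authors"])


def _year(team):
    return team.get("year", {}).get("value")


def to_gbif_authorship(gna):
    """Map a GNA-style authorship dict to GBIF-style authorship fields."""
    basionym = gna.get("basionym_authorship")
    combination = gna.get("combination_authorship")
    bracketed = gna.get("value", "").lstrip().startswith("(")

    # An unbracketed, lone "basionym" author is the author of the combination.
    if basionym and not combination and not bracketed:
        combination, basionym = basionym, None

    result = {}
    if combination:
        result["authorship"] = _authors(combination)
        if _year(combination):
            result["year"] = _year(combination)
    if basionym:
        result["bracketAuthorship"] = _authors(basionym)
        if _year(basionym):
            result["bracketYear"] = _year(basionym)
    return result
```

A bracketed lone author such as (Linnaeus, 1771) is left in the bracket fields, so only the unbracketed case is swapped.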
Puma concolor (Linnaeus, 1771)
Both parsers show the same semantics:
GBIF:
"bracketAuthorship": "Linnaeus",
"bracketYear": "1771",

GNA:
"basionym_authorship": {
  "authors": ["Linnaeus"],
  "year": {
    "value": "1771"
  }
}
Ex authors are not parsed in GBIF (yet). They do cause a little issue as the sequence of authors varies between botany and zoology. In botany, the author of the earlier name precedes the later, valid one while in zoology this is reversed. GNA follows the zoological model, even though far more usages of ex-authors can be found in botanical names.
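The different ordering can be shown with a few lines of Python. This is a hypothetical sketch: the function, its botanical flag and the result keys "author" and "exAuthor" are invented for illustration and do not exist in either parser.

```python
def split_ex_authors(authorship, botanical):
    """Split "A ex B" into the valid author and the ex author.

    botanical=True  : the earlier author precedes the later, valid one.
    botanical=False : zoological order, i.e. the reverse.
    """
    before, separator, after = authorship.partition(" ex ")
    if not separator:  # no ex author present
        return {"author": authorship.strip(), "exAuthor": None}
    if botanical:
        return {"author": after.strip(), "exAuthor": before.strip()}
    return {"author": before.strip(), "exAuthor": after.strip()}
```

The same string "Smith ex Jones" (a made-up authorship) therefore yields Jones as the valid author under the botanical reading and Smith under the zoological one, which is why a parser has to pick one model or know the nomenclatural code.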

Uninomials are also treated differently. GBIF uses a single property genusOrAbove for the genus part of a binomial, a standalone genus and a uninomial of a rank above genus. GNA places genera from a binomial into the genus field, but uses uninomial for a standalone genus name.
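In a wrapper this difference boils down to a small mapping, sketched here in Python. The flat dictionary with "genus" and "uninomial" keys is an assumed simplification of the GNA output, which is nested in reality.

```python
def genus_or_above(gna_fields):
    """Collapse the GNA genus/uninomial distinction into one GBIF-style value."""
    # A binomial carries "genus"; a standalone name carries "uninomial".
    return gna_fields.get("genus") or gna_fields.get("uninomial")
```

Both {"genus": "Puma"} and {"uninomial": "Puma"} end up as the same genusOrAbove value.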

Performance

We are still comparing gnparser with the GBIF name parser, but initial tests using gnparser-0.4.0 to parse 1380 names from our unit tests suggest that the GBIF parser is up to twice as fast when running in a Java environment. Wrapping gnparser to produce the ParsedName result class adds some overhead, but even if we just parse the names and do not convert the Scala result into a ParsedName, gnparser takes about 75% more time:
  
Total time parsing 1380 names
MacBookPro 2017, Java8, single thread:

  GBIF: 1331ms
  GNA : 2596ms
  GNA-: 2323ms # without wrapper

This contradicts the results presented in the gnparser paper, but might be related to the selection of names or running the parser in different environments.
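The ratios quoted above follow directly from the three measurements. A short Python sketch of the arithmetic, using only the numbers reported in this post (the function names are made up):

```python
# Total parsing times in milliseconds for 1380 names, as reported above.
TIMINGS_MS = {"GBIF": 1331, "GNA": 2596, "GNA-": 2323}
NAME_COUNT = 1380


def relative_time(parser):
    """Time of the given parser relative to the GBIF parser."""
    return TIMINGS_MS[parser] / TIMINGS_MS["GBIF"]


def names_per_second(parser):
    """Throughput derived from the total parsing time."""
    return NAME_COUNT / (TIMINGS_MS[parser] / 1000)


if __name__ == "__main__":
    for parser in TIMINGS_MS:
        print(parser, round(relative_time(parser), 2), round(names_per_second(parser)))
```

The wrapped gnparser comes out at roughly 1.95 times the GBIF time and the unwrapped one at roughly 1.75 times, which is where "up to twice as fast" and "about 75% more time" come from.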

Future

We are working with GNA to improve both parsers and align them more. With slightly different goals it might be hard to fully merge the two projects, but we will try to unify the efforts as much as we can. For the GBIF name parser we will be adding parsed author and ex author teams in the near future. This is needed to do author comparisons for better name matching in the GBIF backbone building (where it already exists) and the Catalogue of Life.

4 comments:

  1. Performance-wise, might the number of names matter? We used 1M names for performance tests.

  2. For benchmarking we used 1 000 000 names instead of 1000, because the JVM takes time to load, and because just about any parser is fast enough for 1k names. Both GBIF and GN parsers are way, way faster than 1000 or 500 names/sec.

    Replies
    1. the parsing time is without JVM startup and the parser instance is created before time is measured. This is just the parsing time. Not done perfectly with a proper performance framework, but still. The parsing times highly depend on the name being parsed. Large authorships with 20 or more authors can slow down the GBIF parser for example. The test set of 1380 names explicitly tries not to be just simple binomials.

    2. I have run more tests and the GNA parser becomes faster the more names are parsed.
      Repeating the same 1380 names 1x, 10x and 100x:

      GBIF - total time parsing 1380 names: 1128 ms
      GNA - total time parsing 1380 names: 1666 ms

      GBIF - total time parsing 13800 names: 13612 ms
      GNA - total time parsing 13800 names: 5023 ms

      GBIF - total time parsing 138000 names: 148381 ms
      GNA - total time parsing 138000 names: 27333 ms


      I thought there might be some caching involved in the GNA parser, as the GBIF parsing time is rather linear.
      So I've tried 1380 1x, 10x, 100x with random binomials including an author (e.g. Zpc aafoax Iiv; Aioaeuzoai eaemeau Oeovzmboular)

      GBIF - total time parsing 1380 names: 333 ms
      GNA - total time parsing 1380 names: 1516 ms

      GBIF - total time parsing 13800 names: 2266 ms
      GNA - total time parsing 13800 names: 3511 ms

      GBIF - total time parsing 138000 names: 18715 ms
      GNA - total time parsing 138000 names: 16223 ms


      As you can see GBIF becomes a lot faster with these simple names. Both parsers are not linear anymore, GBIF gets slightly faster but the GNA one gets a lot faster. It takes only 10x more time for 100x names.

      I finally tried a million random names and added also an authorship year, e.g. Uouixuu eeuouao Vxgoea, 1806

      GBIF - total time parsing 1000000 names: 188939 ms
      GNA - total time parsing 1000000 names: 113468 ms


      As I said this does not use a test framework and e.g. JVM garbage collection can happen anytime. Still interesting behavior and the performance between both depends on the number of names to be parsed and also the kind of names.
