Comments on Developer Blog: GBIF Name Parser

I have run more tests and the GNA parser becomes f...

2017-07-19T17:42:06.918+02:00

I have run more tests and the GNA parser becomes faster the more names are parsed.
Repeating the same 1380 names 1x, 10x and 100x:

GBIF - total time parsing 1380 names: 1128 ms
GNA - total time parsing 1380 names: 1666 ms

GBIF - total time parsing 13800 names: 13612 ms
GNA - total time parsing 13800 names: 5023 ms

GBIF - total time parsing 138000 names: 148381 ms
GNA - total time parsing 138000 names: 27333 ms

I thought there might be some caching involved in the GNA parser, as the GBIF parsing time is roughly linear.
So I tried 1380 names 1x, 10x and 100x again, this time with random binomials including an author (e.g. Zpc aafoax Iiv; Aioaeuzoai eaemeau Oeovzmboular):

GBIF - total time parsing 1380 names: 333 ms
GNA - total time parsing 1380 names: 1516 ms

GBIF - total time parsing 13800 names: 2266 ms
GNA - total time parsing 13800 names: 3511 ms

GBIF - total time parsing 138000 names: 18715 ms
GNA - total time parsing 138000 names: 16223 ms

As you can see, GBIF becomes a lot faster with these simple names. Neither parser scales linearly anymore: GBIF gets slightly faster per name, but the GNA one gets a lot faster, taking only about 10x more time for 100x the names.
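The scaling claim can be checked directly from the random-binomial figures above. This is a minimal sketch: the timings are copied from this comment, while the helper names `per_name_ms` and `slowdown` are made up for illustration.

```python
# Timings in ms, copied from the random-binomial runs in this comment.
timings_ms = {
    "GBIF": {1380: 333, 13800: 2266, 138000: 18715},
    "GNA": {1380: 1516, 13800: 3511, 138000: 16223},
}


def per_name_ms(total_ms, n_names):
    """Average cost of parsing one name, in milliseconds."""
    return total_ms / n_names


def slowdown(runs):
    """Time of the largest run divided by the time of the smallest run."""
    sizes = sorted(runs)
    return runs[sizes[-1]] / runs[sizes[0]]


for parser, runs in timings_ms.items():
    costs = ", ".join(f"{per_name_ms(ms, n):.3f}" for n, ms in sorted(runs.items()))
    print(f"{parser}: {costs} ms/name; 100x the names took {slowdown(runs):.1f}x the time")
```

A perfectly linear parser would show a factor of 100 here. GNA comes out at roughly 10.7 and GBIF at roughly 56.2, which matches the description above.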

Finally, I tried a million random names and also added an authorship year, e.g. Uouixuu eeuouao Vxgoea, 1806:

GBIF - total time parsing 1000000 names: 188939 ms
GNA - total time parsing 1000000 names: 113468 ms
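Random names of the shape used in these runs (genus, epithet, author, optionally a year) could be generated roughly like this. The generator actually used for the tests is not shown, so everything here is an assumption: the function names, the uniform letter distribution, the word lengths and the 1758 to 2017 year range.

```python
import random
import string


def random_word(rng, min_len=3, max_len=12):
    """Lowercase nonsense word; the length bounds are an arbitrary assumption."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def random_name(rng, with_year=False):
    """Build 'Genus epithet Author', optionally followed by ', year'."""
    genus = random_word(rng).capitalize()
    epithet = random_word(rng)
    author = random_word(rng).capitalize()
    name = f"{genus} {epithet} {author}"
    if with_year:
        # Year range is an assumption, not taken from the original tests.
        name += f", {rng.randint(1758, 2017)}"
    return name


rng = random.Random(42)  # fixed seed so a benchmark run is repeatable
for _ in range(3):
    print(random_name(rng, with_year=True))
```

Seeding the generator keeps the input identical between runs, so both parsers can be fed exactly the same names.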

As I said, this does not use a test framework, and e.g. JVM garbage collection can happen at any time. Still, it is interesting behavior: the relative performance of the two parsers depends on the number of names to be parsed and also on the kind of names.

the parsing time is without JVM startup and the pa...

2017-07-19T15:42:30.982+02:00

The parsing time excludes JVM startup, and the parser instance is created before timing starts, so this is just the parsing time. It is not measured with a proper performance framework, but it still gives an indication. The parsing times depend heavily on the name being parsed. Large authorships with 20 or more authors can slow down the GBIF parser, for example. The test set of 1380 names explicitly tries not to be just simple binomials.
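The measurement approach described here (set everything up first, then time only the parse loop) can be sketched as follows. The real tests ran on the JVM against the GBIF and GNA parsers; this Python version only illustrates the structure. `toy_parse` is a hypothetical stand-in, not either parser, and the optional warm-up pass is an added assumption rather than part of the original test.

```python
import time


def time_parsing(parse, names, warmup_rounds=0):
    """Return the wall-clock time in ms spent parsing `names`, excluding setup."""
    for _ in range(warmup_rounds):  # optional untimed passes
        for name in names:
            parse(name)
    start = time.perf_counter()
    for name in names:
        parse(name)
    return (time.perf_counter() - start) * 1000.0


def toy_parse(name):
    """Hypothetical stand-in parser: splits 'Genus epithet Authorship'."""
    parts = name.split(" ", 2)
    return {
        "genus": parts[0],
        "epithet": parts[1] if len(parts) > 1 else None,
        "authorship": parts[2] if len(parts) > 2 else None,
    }


# The input list is built before the clock starts, like the parser instance above.
names = ["Zpc aafoax Iiv", "Aioaeuzoai eaemeau Oeovzmboular"] * 690
elapsed = time_parsing(toy_parse, names, warmup_rounds=1)
print(f"total time parsing {len(names)} names: {elapsed:.0f} ms")
```

A single timed pass like this is still exposed to garbage collection and other noise, so repeated runs or a dedicated benchmarking framework would give more reliable numbers.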

For benchmarking we used 1 000 000 instead of 1000...

2017-07-10T19:14:33.470+02:00

For benchmarking we used 1 000 000 names instead of 1000, because the JVM takes time to load, and because just about any parser is fast enough for 1k names. Both the GBIF and GN parsers are far faster than 1000 or 500 names/sec.

Performance wise --- might the number of names? We...

2017-07-08T11:09:55.614+02:00

Performance-wise, might it be the number of names? We used 1M names for performance tests.