Developer Blog: Improving the GBIF Backbone matching

In GBIF occurrence records are matched to a taxon in a backbone taxonomy using the species match API. This is important to reduce spelling variations and create consistent metrics and searches according to a single classification and synonymy.

Over the past years we have been alerted to various bad matches. Most of the reported issues refer to a false fuzzy match for a name missing in our backbone.

In order to improve the taxonomic classification of occurrence records, we are undertaking 2 activities. The first is to improve the algorithms we use to fuzzily match names, and the second will be to improve the algorithms used to assembled the backbone taxonomy itself. Here I explain some of the work currently underway to tackle the former, which is visible on the test environment.

1.Name parsing of undetermined species

In occurrences we see many names with a partly undetermined name such as Lucanus spec. Erroneously these rank markers have been treated as real species epithets and together with fuzzy matching resulted in poor results.

Examples

Xysticus sp. used to wrongly match Xysticus spiethi while it now just matches the genus Xysticus.
Triodia sp. used to match the family Poaceae while it now matches the genus

2. Damerau–Levenshtein distance algorithm

For scoring fuzzy matches we have so far applied the Jaro Winkler distance which is often used for matching person names. It tends to allow for rather fuzzy matches at the end of long strings. This is desirable for scientific names, but the allowed fuzziness was too big and we decided to revert to the classical and more predictable Damerau–Levenshtein distance. This reduces false positive fuzzy matches considerably even though we lost a few good matches at the same time.

Examples

Xyris kralii Wand. used to match to Xyris harleyi but now just matches to the genus Xyris L. as the species is missing from our backbone.
Zea mays subsp. parviglumis var. huehuet Iltis & Doebley used to match Zea mays var. hirta while it now just hits the species Zea mays L.

Matching results

The distinct, verbatim classifications of 528 million records were passed through the original and the new fuzzy matching algorithms - this included 10.5 million distinct classifications in total. The results show that 428 thousand classifications (4%), representing 5,323,758 occurrence records produce a different match. So far we have taken a random subsample of the records which change, and manually inspected the results - we can hardly spot any degression or wrong matches.

We have published the complete matching comparison as well as the subset of changed records at Zenodo as tab delimited files:

Dataset 1: All classification matches (10.5 million)

Dataset 2: Changed matches (428 thousand)

The schema of the files have 3 column families each with the scientificName, GBIF taxonKey and the higher DwC classification terms for every match record (verbatim prefixed with v_ , old matching with an _old suffix and the new matching results with plain terms, e.g. v_scientificName, scientificName_old, scientificName).

We are glad to receive any feedback on further improvements or bad matching results we need to fix in the next iteration of work. Please get in touch with Markus Döring, mdoering@gbif.org.

Appendix

Create distinct occurrence names table

CREATE TABLE markus.names AS 
SELECT count(*) as numocc, count(distinct datasetKey) as numdatasets, v_scientificName, v_kingdom, v_phylum, v_class, v_order_ as v_order, v_family, v_genus, v_subgenus, v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification 
FROM prod_b.occurrence_hdfs 
GROUP BY v_scientificName, v_kingdom, v_phylum, v_class, v_order_, v_family, v_genus, v_subgenus, v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification 
ORDER BY v_scientificName, numocc DESC

Lookup taxonkey with both old & new lookup

CREATE TABLE markus.name_matches AS
SELECT 
  n.numocc, 
  n.numdatasets, 
  n.v_scientificName, 
  n.v_kingdom, 
  n.v_phylum, 
  n.v_class, 
  n.v_order, 
  n.v_family, 
  n.v_genus, 
  n.v_subgenus, 
  n.v_specificEpithet, 
  n.v_infraspecificEpithet, 
  n.v_scientificNameAuthorship, 
  n.v_taxonrank, 
  n.v_higherClassification, 

  prod.taxonKey as taxonKey_old,
  prod.scientificName as scientificName_old,
  prod.rank as rank_old,
  prod.status as status_old,
  prod.matchType as matchType_old,
  prod.confidence as confidence_old,
  prod.kingdomKey as kingdomKey_old,
  prod.phylumKey as phylumKey_old,
  prod.classKey as classKey_old,
  prod.orderKey as orderKey_old,
  prod.familyKey as familyKey_old,
  prod.genusKey as genusKey_old,
  prod.speciesKey as speciesKey_old,
  prod.kingdom as kingdom_old,
  prod.phylum as phylum_old,
  prod.class_ as class_old,
  prod.order_ as order_old,
  prod.family as family_old,
  prod.genus as genus_old,
  prod.species as species_old,

  uat.taxonKey as taxonKey,
  uat.scientificName as scientificName,
  uat.rank as rank,
  uat.status as status,
  uat.matchType as matchType,
  uat.confidence as confidence,
  uat.kingdomKey as kingdomKey,
  uat.phylumKey as phylumKey,
  uat.classKey as classKey,
  uat.orderKey as orderKey,
  uat.familyKey as familyKey,
  uat.genusKey as genusKey,
  uat.speciesKey as speciesKey,
  uat.kingdom as kingdom,
  uat.phylum as phylum,
  uat.class_ as class_,
  uat.order_ as order_,
  uat.family as family,
  uat.genus as genus,
  uat.species as species

FROM (
  SELECT 
    numocc, 
    numdatasets, 
    v_scientificName, 
    v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_subgenus, 
    v_specificEpithet, 
    v_infraspecificEpithet, 
    v_scientificNameAuthorship, 
    v_taxonrank, 
    v_higherClassification, 
    match('PROD', v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) prod, 
    match('UAT', v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) uat
  FROM markus.names
) n;

Hive exports

CREATE TABLE markus.matches_changed 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' NULL DEFINED AS '' AS 
SELECT * from markus.name_matches 
WHERE taxonKey!=taxonKey_old;

CREATE TABLE markus.matches_all 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' NULL DEFINED AS '' AS 
SELECT * from markus.name_matches;

Monday, 30 March 2015

Improving the GBIF Backbone matching

1.Name parsing of undetermined species

2. Damerau–Levenshtein distance algorithm

Matching results

Appendix

Create distinct occurrence names table

Lookup taxonkey with both old & new lookup

Hive exports

No comments:

Post a Comment

About this blog

Twitter

Contributors

Our favourite blogs

Blog Archive

Tags

Followers

Monday, 30 March 2015

Improving the GBIF Backbone matching

1.Name parsing of undetermined species

2. Damerau–Levenshtein distance algorithm

Matching results

Appendix

Create distinct occurrence names table

Lookup taxonkey with both old & new lookup

Hive exports

No comments:

Post a Comment

About this blog

Twitter

Contributors

Our favourite blogs

Blog Archive

Subscribe To

Tags

Followers