Wednesday, 6 April 2016

Updating the GBIF Backbone

The taxonomy employed by GBIF for organising all occurrences into a consistent view has remained unchanged since 2013. We have been working on a replacement for some time and are pleased to introduce a preview in this post. The work is rather complex and tries to establish an automated process to build a new backbone which we aim to run on a regular, probably quarterly basis. We would like to release the new taxonomy rather soon and improve the backbone iteratively. Large regressions should be avoided initially, but it is quite hard to evaluate all the changes between 2 large taxonomies with 4 - 5 million names each. We are therefore seeking feedback and help to discover oddities of the new backbone.

Relevance & Challenges

Every occurrence record in GBIF is matched to a taxon in the backbone. Because occurrence records in GBIF cover the whole tree of life and names may come from all possible, often outdated, taxonomies, it is important to have the broadest coverage of names possible. We also deal with fossil names, extinct taxa and (due to advanced digital publishing) even names that have just been described a week before the data is indexed at GBIF.
The Taxonomic Backbone provides a single classification and a synonymy that we use to inform our systems when creating maps, providing metrics or even when you do a plain occurrence search. It is also used to crosslink names between different checklist datasets.

The Origins

The very first taxonomy that GBIF used was based on the Catalogue of Life. As this only included around half the names we found in GBIF occurrences, all other cleaned occurrence names were merged into the GBIF backbone. As the backbone grew we never deleted names and increasingly faced more and more redundant names with slightly different classifications. It was time for a different procedure.

The Current Backbone

The current version of the backbone was built in July 2013. It is largely based on the Catalogue of Life from 2012 and has folded in names from 39 further taxonomic sources. It was built using an automated process that made use of selected checklists from the GBIF ChecklistBank in a prioritised order. The Catalogue of Life was still the starting point and provided the higher classification down to orders. The Interim Register of Marine and Nonmarine Genera was used as the single reference list for generic homonyms. Otherwise only a single version of any name was allowed to exist in the backbone, even where the authorship differed.

Current issues

We kept track of nearly 150 reported issues. Some of the main issues showing up regularly that we wanted to address were:
  • Enable an automated build process so we can use the latest Catalogue of Life and other sources to capture newly described or currently missing names
  • It was impossible to have synonyms using the same canonical name but with different authors. This means Poa pubescens was always considered a synonym of Poa pratensis L. when in fact Poa pubescens R.Br. is considered a synonym of Eragrostis pubescens (R.Br.) Steud.
  • Some families contain far too many accepted species and hardly any synonyms. Especially for plants the Catalogue of Life was surprisingly sparsely populated and we heavily relied on IPNI names. For example the family Cactaceae has 12.062 accepted species in GBIF while The Plant List recognizes just 2.233.
  • Many accepted names are based on the same basionym. For example the current backbone considers both Sulcorebutia breviflora Backeb. and Weingartia breviflora (Backeb.) Hentzschel & K.Augustin as accepted taxa.
  • Relying purely on IRMNG for homonyms meant that homonyms which were not found in IRMNG were conflated. On the other hand there are many genera in IRMNG - and thus in the backbone - that are hardly used anywhere, creating confusion and many empty genera without any species in our backbone.

The New Backbone

The new backbone is available for preview in our test environment. In order to review the new backbone and compare it to the previous version we provide a few tools with a different focus:
  • Stable ID report: We have joined the old and new backbone names to each other and compared their identifiers. When joining on the full scientific name there is still an issue with changing identifiers which we are still investigating.
  • Tree Diffs: For comparing the higher classification we used a tool from Rod Page to diff the tree down to families. There are surprisingly many changes, but all of them stem from evolution in the Catalogue of Life or the changed Algae classification.
  • Nub Browser: For comparing actual species and also reviewing the impact of the changed taxonomy on the GBIF occurrences, we developed a new Backbone Browser sitting on top of our existing API (Google Chrome only). Our test environment has a complete copy of the current GBIF occurrence index which we have reprocessed to use the new backbone. This also includes all maps and metrics which we show in the new browser.
Family Asparagaceae as seen in the nub browser:
Red numbers next to names indicate taxa that have fewer occurrences using the new backbone, while green numbers indicate an increase. This is also seen in the tree maps of the children by occurrences. The genus Campylandra J.G. Baker, 1875 is dark red with zero occurrences because the species in that genus were moved into the genus Rhodea in the latest Catalog of Life.

Species Asparagus asparagoides as seen in the nub browser:
The details view shows all synonyms, the basionym and also a list of homonyms from the new backbone.


We manually curate a list of priority ordered checklist datasets that we use to build the taxonomy. Three datasets are treated in a slightly special way:
  1. GBIF Backbone Patch: a small dataset we manually curate at GBIF to override any other list. We mainly use the dataset to add missing names reported by users.
  2. Catalogue of Life: The Catalogue of Life provides the entire higher classification above families with the exception of algaes.
  3. GBIF Algae Classification: With the withdrawal of Algaebase the current Catalogue of Life is lacking any algae taxonomy. To allow other sources to at least provide genus and species names for algae we have created a new dataset that just provides an algae classification down to families. This classification fits right into the empty phyla of the Catalogue of Life.
The GBIF portal now also lists the source datasets that contributed to the GBIF Backbone and the number of names that were used as primary references.

Other Improvements

As well as fixing the main issues listed above, there is another frequently occurring situation that we have improved. Many occurrences could not be matched to a backbone species because the name existed multiple times as an accepted taxon. In the new backbone, only one version of a name is ever considered to be accepted. All others now are flagged as doubtful. That resolves many issues which prevented a species match because of name ambiguity. For example there are many occurrences of Hyacinthoides hispanica in Britain which only show up in the new backbone (old / new occurrence, old / new match). This is best seen in the map comparison of the nub browser, try to swipe the map!

Known problems

We are aware of some problems with the new backbone which we like to address in the next stage. Two of these issues we consider as candidates for blocking the release of the new backbone:
Species matching service ignores authorship
As we better keep different authors apart the backbone now contains a lot more species names which just differ by their authorship. The current algorithm only keeps one of these names as the accepted name from the most trusted source (e.g. CoL) and treats the other as doubtful if they are not already treated as synonyms.
The problem currently is that the species matching service we use to align occurrences to the backbone does not deal with authorship. Therefore we have some cases where occurrences are attached to a doubtful name or even split across some of the “homonyms”.
There are nearly 166.832 species names with different authorship existing in the new backbone, accounting for 98.977.961 occurrences.
Too eager basionym merging
The same epithet is sometimes used by the same author for different names in the same family. This currently leads to an overly eager basionym grouping with less accepted names.
As these names are still in the backbone and occurrences can be matched to them this is currently not considered a blocker.

Thursday, 25 February 2016

Reprojecting coordinates according to their geodetic datum

For a long time Darwin Core has a term to declare the exact geodetic datum used for the given coordinate. Quite a few data publishers in GBIF have used dwc:geodeticDatum for some time to publish the datum of their location coordinates.

Until now GBIF has treated all coordinates as if they were in WGS84, the widespread global standard datum used by the Global Positioning System (GPS). Accordingly locations given in a different datum, for example NAD27 or AGD66, were displaced on GBIF maps a little. This so called “datum shift” is not dramatic, but can be up to a few hundred metres depending on the location and datum. The Univeristy of Colorado has a nice visualization of the impact.

At GBIF we interpret the geodeticDatum and reproject all coordinates as good as we can into the single datum WGS84. In order to do this there are two main steps that need to be done: parse and interpret the given verbatim geodetic datum and then do the actual transformation based on the known geodetic parameters.

Parsing geodeticDatum

As usual GBIF receives a lot of noise when reading the dwc:geodeticDatum. After removing the obvious bad values, e.g. introduced by bad mappings done by the publisher, we still ended up with over 300 different values for some datum. Most commonly simple names or abbreviations are used, e.g. NAD27, WGS72, ED50, TOKYO. In some cases we also see proper EPSG codes coming in, e.g. EPSG:4326 which is the EPSG code for WGS84. As EPSG is a widespread and complete reference dataset of geodetic parameters, supported by many java libraries, we decided to add a new DatumParser to our parser library that directly returns EPSG integer codes for datum values. That way we can lookup geodetic parameters easily in the following transformation step. In addition to parse any given EPSG:xyz code directly it also understands most datums found in the GBIF network based on a simple dictionary file which we manually curate.

Even though EPSG codes are well maintained, very complete and supported by most software opaque integer codes have a harder time to get used than meaningful short names. Maybe a lesson we should keep in mind when debating about identifiers elsewhere.

Our recommendation to publishers is to use the EPSG codes if you know them, otherwise stick to the simple well known names. A good place to search for EPSG codes is


Once we have a decimal coordinate and a well known geodetic source datum the transformation itself is rather straight forward. We use geotools to do the work. The first step in the transformation is to instantiate a CoordinateReferenceSystem (CRS) using the parsed EPSG code of the geodeticDatum. A CRS combines a datum with a coordinate system, in our case this always a 2 dimensional system with the prime meridian in Greenwich and longitude values increasing East, latitude values North.

As EPSG codes can refer to both, just a plain datum and also a complete spatial reference system, we need to take this into account when building the CRS like this:

 private CoordinateReferenceSystem parseCRS(String datum) {
    CoordinateReferenceSystem crs = null;
    // the GBIF DatumParser in use
    ParseResult<Integer> epsgCode = PARSER.parse(datum);
    if (epsgCode.isSuccessful()) {
      final String code = "EPSG:" + epsgCode.getPayload();
      // first try to create a full fledged CRS from the given code
      try {
        crs = CRS.decode(code);

      } catch (FactoryException e) {
        // that didn't work, maybe it is *just* a datum
        try {
          GeodeticDatum dat = DATUM_FACTORY.createGeodeticDatum(code);
      // build a CRS using the standard 2-dim Greenwich coordinate system
          crs = new DefaultGeographicCRS(dat, DefaultEllipsoidalCS.GEODETIC_2D);

        } catch (FactoryException e1) {
          // also not a datum, no further ideas, log error
"No CRS or DATUM for given datum code >>{}<<: {}", datum, e1.getMessage());
    return crs;

Once we have a CRS instance we can create a specific WGS84 transformation and apply it to our coordinate:

public ParseResult<LatLng> reproject(double lat, double lon, String datum) {
   CoordinateReferenceSystem crs = parseCRS(datum);
   MathTransform transform = CRS.findMathTransform(crs, DefaultGeographicCRS.WGS84, true);
   // different CRS may swap the x/y axis for lat lon, so check first:
   double[] srcPt;
   double[] dstPt = new double[3];
   if (CRS.getAxisOrder(crs) == CRS.AxisOrder.NORTH_EAST) {
     // lat lon
     srcPt = new double[] {lat, lon, 0};
   } else {
     // lon lat
     srcPt = new double[] {lon, lat, 0};

   transform.transform(srcPt, 0, dstPt, 0, 1);
   return ParseResult.success(ParseResult.CONFIDENCE.DEFINITE, new LatLng(dstPt[1], dstPt[0]), issues);

The actual projection code does a bit more of null and exception handling which I have removed here for simplicity.

As you can see above we also have to watch out for spatial reference systems that use a different axis ordering. Luckily geotools knows all about that and provides a very simple way to test for it.

Issue flags

As with most of our processing we flag records when problems or assumed behavior occurs. In the case of the geodetic datum processing we keep track of 5 distinct issues which are available as GBIF portal occurrence search filters:

  • COORDINATE_REPROJECTION_FAILED: A CRS was instantiated, but the transformation failed for some reason.
  • GEODETIC_DATUM_INVALID: The datum parser was unable to return an EPSG code for the given datum string.
  • COORDINATE_REPROJECTION_SUSPICIOUS: The reprojection resulted in a datum shift larger than 0.1 degrees.
  • GEODETIC_DATUM_ASSUMED_WGS84: No datum was given or the given datum was not understood. In that case the original coordinates remain untouched.
  • COORDINATE_REPROJECTED: The coordinate was successfully transformed and differs now from the verbatim one given.

Thursday, 11 June 2015

Simplified Downloads

Since its re-launch in 2013 has supported the downloading of occurrence data using an arbitrary query with the download file provided as a Darwin Core Archive file whose internal content is described here. This format contains comprehensive and self-explanatory information, which makes it suitable to be referenced in external resources. However, in cases where people only need the occurrence data in its simplest form the DwC-A format presents an additional complexity that can make it hard to use the data. Because of that we now support a new download format: a zip file that only contains a single file with the most common fields/terms used, where each column is separated by the TAB character. This makes things much easier when it comes to importing the data into tools such as Microsoft Excel, geographic information systems and relational databases. The current download functionality was extended to allow the selection of the desired format:

From this point the functionality remains the same: eventually you will receive an email containing a hyperlink where the file can be downloaded.

Technical Architecture

The simplified download format was implemented following the technical requirement that new formats should be supported in the near future with minimal impact to the formats supported at a specific moment. In general, occurrence downloads are implemented using two different sets of technologies depending on the estimated size of the download in number of records; a threshold of 200,000 records is set to define when a download is small (< 200K) and big (>200K), where history shows a vast majority of “small” downloads. The following chart summarizes the key technologies that enables occurrence downloads:

Download workflow

Occurrence downloads are automated using a workflow engine called Oozie, it coordinates the required steps to produce a single download file. In summary the workflow proceeds as follows:
  1. Initially, Apache Solr is contacted to determine the number of records that the download file will contain.
  2. Big or small?
    1.  If the amount of records is less than 200,000 (it is small download), Apache Solr is queried to iterate over the results; the detail of each occurrence record is fetched from HBase since it’s the official storage of occurrence records. Individual downloads are produced by a multi-threaded application implemented using the Akka framework; the Apache Zookeeper and Curator frameworks are used to limit the amount of threads that can be running at the same time (it avoids a thread explosion in the machines that run the download workflow).
    2. If the amount of records is greater than 200,000 (it is a big download), Apache Hive is used to retrieve the occurrence data from an HDFS table. To avoid overloading of HBase we create that HDFS table as a daily snapshot of the occurrence data stored in HBase.
  3. Finally the occurrence records are collected and organized in the requested output format (DwC-A or Simple).
Note: the details of how this is implemented can be consulted in the Github project:


Reducing both the number of columns and the size (number of bytes) in our downloads has been one of our most requested features, and we hope this makes using the GBIF data easier for everyone.

Friday, 29 May 2015

Don't fill your HDFS disks (upgrading to CDH 5.4.2)

Just a short post on the dangers of filling your HDFS disks. It's a warning you'll hear at conferences and in best practices blog posts like this one, but usually with only a vague consequence of "bad things will happen". We upgraded from CDH 5.2.0 to CDH 5.4.2 this past weekend and learned the hard way: bad things will happen.

The Machine Configuration

The upgrade went fine in our dev cluster (which has almost no data in HDFS) so we weren't expecting problems in production. Our production cluster is of course slightly different than our (much smaller) dev cluster. In production we have 3 masters, where one holds the NameNode and another holds the SecondaryNameNode (we're not yet using a High Availability setup, but it's in the plan). We have 12 DataNodes where each one has 13 disks dedicated to HDFS storage: 12 are 1TB and one is 512GB. They are formatted with 0% reserved blocks for root. The machines are evenly split into two racks.

Pre Upgrade Status

We were at about 75% total HDFS usage with only a few percent difference between machines. We were configured to use Round Robin block placement (dfs.datanode.fsdataset.volume.choosing.policy) with 10GB reserved for non-hdfs use (dfs.datanode.du.reserved), which are the defaults in CDH manager. Each of the 1TB disks was around 700GB used (of 932GB usable), and the 512 GB disks were all at their limit: 456GB used (of 466GB usable). That left only the configured 10GB free for non-hdfs use on the small disks. Our disks are mounted in the pattern /mnt/disk_a, /mnt/disk_b and so on, with /mnt/disk_m as the small disk. We're using the free version of CDHM so we can't do rolling upgrades, meaning this upgrade would bringing everything down. And because our cluster is getting full (> 80% usage is another rumoured "bad things" threshold) we have reduced one class of data (user's occurrence downloads) to a replication factor of 2 (from the default of 3). This is considered somewhere between naughty and criminal, and you'll see why below.

Upgrade Time

We followed the recommended procedure and did the oozie, hive, and CDH manager backups, downloaded the latest parcels, and pressed the big Update button. Everything appeared to be going fine until HDFS tried to start up again, where the symptom was that it was taking a really long time (several minutes, after which the CDHM upgrade process finally gave up saying the DataNodes weren't making contact). Looking at the NameNode logs we see that it was performing a "Block Pool Upgrade", which took btw 90 and 120 seconds for each of our ~700GB disks. Here's an excerpt of where it worked without problems:

2015-05-23 20:18:53,715 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /mnt/disk_a/dfs/dn/in_use.lock acquired by nodename
2015-05-23 20:18:53,811 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-2033573672-
2015-05-23 20:18:53,811 INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled for /mnt/disk_a/dfs/dn/current/BP-2033573672-
2015-05-23 20:18:53,823 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrading block pool storage directory /mnt/disk_a/dfs/dn/current/BP-2033573672-
   old LV = -56; old CTime = 1416737045694.
   new LV = -56; new CTime = 1432405112136
2015-05-23 20:20:33,565 INFO org.apache.hadoop.hdfs.server.common.Storage: HardLinkStats: 59768 Directories, including 53157 Empty Directories, 0 single Link operations, 6611 multi-Link operations, linking 22536 files, total 22536 linkable files.  Also physically copied 0 other files.
2015-05-23 20:20:33,609 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrade of block pool BP-2033573672- at /mnt/disk_a/dfs/dn/current/BP-2033573672- is complete

That upgrade time happens sequentially for each disk, so even the though the machines were upgrading in parallel, we were still looking at ~30 minutes of downtime for the whole cluster. As if that wasn't sufficiently worrying, then we finally get to disk_m, our nearly full 512G disk:

2015-05-23 20:53:05,814 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /mnt/disk_m/dfs/dn/in_use.lock acquired by nodename
2015-05-23 20:53:05,869 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-2033573672-
2015-05-23 20:53:05,870 INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled for /mnt/disk_m/
2015-05-23 20:53:05,886 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrading block pool storage directory /mnt/disk_m/
   old LV = -56; old CTime = 1416737045694.
   new LV = -56; new CTime = 1432405112136
2015-05-23 20:54:12,469 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to analyze storage directories for block pool BP-2033573672- Cannot create directory /mnt/disk_m/
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocks(
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.linkAllBlocks(
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doUpgrade(
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(
2015-05-23 20:54:12,476 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage for block pool: BP-2033573672- : Cannot create directory /mnt/disk_m/dfs/dn/current/BP-2033573672-

The somewhat misleading "Cannot create directory" is not a file permission problem but rather a disk full problem. During this block pool upgrade some temporary space is needed for rewriting metadata, and that space is apparently more than the 10G that was available to "non-HDFS" (which we've concluded means "not HDFS storage files, but everything else is fair game"). Because some space is available to start the upgrade, it begins, but then when it exhausts the disk it fails, and This Kills The DataNode. It does clean up after itself, but prevents the DataNode from starting, meaning our cluster was on its knees and in no danger of standing up.

So the problem was lack of free space, which on 10 of our 12 machines we were able to solve by wiping temporary files from the colocated yarn directory. Those 10 machines were then able to upgrade their disk_m and started up. We still had two nodes down and unfortunately they were in different racks, so that meant we had a big pile of our replication factor 2 files missing blocks (the default HDFS block replication policy places the second and subsequent copies on a different rack from the first copy).

While digging around in the different properties that we thought could affect our disks and HDFS behaviour we were also restarting the failing DataNodes regularly. At some point the log message changed to:

WARN org.apache.hadoop.hdfs.server.common.Storage: /mnt/disk_m/dfs/dn/in_use.lock (No space left on device)

After that message the DataNode started, but with disk_m marked as a failed volume. We're not sure why this happened but presume that after one of our failures it didn't clean up it's temp files on disk_m and then on subsequent restarts found the disk completely full and (rightly) considered it unusable and tried to carry on. With the final two DataNodes up we had almost all of our cluster, minus the two failed volumes. There were only 35 corrupted files (missing blocks) left after they came up. These were files set to replication factor 2, and by bad luck had both copies of some of their blocks on the failed disk_m (one from rack1, one from rack2).

It would not have been the end of the world to just delete the corrupted user downloads (they were all over a year old) but on principle, it would not be The Right Thing To Do.

On inodes and hardlinks

The normal directory structure of the dfs dir in a DataNode is /dfs/dn/current/<blockpool name>/current/finalized and within finalized are a whole series of directories to fan out the various blocks that the volume contains. During the block pool upgrade a copy of 'finalized' is made called previous.tmp. It's not a normal copy however - it uses hardlinks in order to avoid duplicating all of the data (which obviously wouldn't work). The copy is needed during the upgrade and is removed afterwards. Since our upgrade failed halfway through we had both directories and had no choice but to move the entire /dfs directory off of /disk_m to a temporary disk and complete the upgrade there. We first tried a copy (use cp -a to preserve hardlinks) to a mounted NFS share. The copy looked fine but on startup the DataNode didn't understand the mounted drive ("drive not formatted"). Then we tried copying to a USB drive plugged into the machine and that ultimately worked (despite feeling decidedly un-Yahoo). Once the USB drive was upgraded and online in the cluster, replication took over and copied all of its blocks to new homes on /rack2. We then unmounted the USB drive, wiped both /disk_m's and then let replication balance out again. Final result: no lost blocks.


With the cluster happy again we made a few changes to hopefully ensure this doesn't happen again:
  • dfs.datanode.du.reserved:25GB this guarantees 25GB free on each volume (up from 10GB) and should be enough to allow a future upgrade to happen
  • dfs.datanode.fsdataset.volume.choosing.policy:AvailableSpace 
  • dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction:1.0 together these two direct new blocks to disks that have more free space, thereby leaving our now full /disk_m alone


This was one small taste of what can go wrong with filling heterogenous disks in an HDFS cluster. We're sure there are worse dangers lurking on the full-disk horizon, so hopefully you've learned from our pain and will give yourself some breathing room when things start to fill up. Also, don't use a replication factor of less than 3 if there's anyway you can help it.

Monday, 30 March 2015

Improving the GBIF Backbone matching

In GBIF occurrence records are matched to a taxon in a backbone taxonomy using the species match API. This is important to reduce spelling variations and create consistent metrics and searches according to a single classification and synonymy.

Over the past years we have been alerted to various bad matches. Most of the reported issues refer to a false fuzzy match for a name missing in our backbone.

In order to improve the taxonomic classification of occurrence records, we are undertaking 2 activities.  The first is to improve the algorithms we use to fuzzily match names, and the second will be to improve the algorithms used to assembled the backbone taxonomy itself.  Here I explain some of the work currently underway to tackle the former, which is visible on the test environment.

1.Name parsing of undetermined species

In occurrences we see many names with a partly undetermined name such as Lucanus spec. Erroneously these rank markers have been treated as real species epithets and together with fuzzy matching resulted in poor results.

  • Xysticus sp. used to wrongly match Xysticus spiethi while it now just matches the genus Xysticus.
  • Triodia sp. used to match the family Poaceae while it now matches the genus

2. Damerau–Levenshtein distance algorithm

For scoring fuzzy matches we have so far applied the Jaro Winkler distance which is often used for matching person names. It tends to allow for rather fuzzy matches at the end of long strings. This is desirable for scientific names, but the allowed fuzziness was too big and we decided to revert to the classical and more predictable Damerau–Levenshtein distance. This reduces false positive fuzzy matches considerably even though we lost a few good matches at the same time.


Matching results

The distinct, verbatim classifications of 528 million records were passed through the original and the new fuzzy matching algorithms - this included 10.5 million distinct classifications in total.  The results show that 428 thousand classifications (4%), representing 5,323,758 occurrence records produce a different match. So far we have taken a random subsample of the records which change, and manually inspected the results - we can hardly spot any degression or wrong matches.

We have published the complete matching comparison as well as the subset of changed records at Zenodo as tab delimited files:

The schema of the files have 3 column families each with the scientificName, GBIF taxonKey and the higher DwC classification terms for every match record (verbatim prefixed with v_ , old matching with an _old suffix and the new matching results with plain terms, e.g. v_scientificName, scientificName_old, scientificName).

We are glad to receive any feedback on further improvements or bad matching results we need to fix in the next iteration of work. Please get in touch with Markus Döring,


Create distinct occurrence names table

CREATE TABLE markus.names AS 
SELECT count(*) as numocc, count(distinct datasetKey) as numdatasets, v_scientificName, v_kingdom, v_phylum, v_class, v_order_ as v_order, v_family, v_genus, v_subgenus, v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification 
FROM prod_b.occurrence_hdfs 
GROUP BY v_scientificName, v_kingdom, v_phylum, v_class, v_order_, v_family, v_genus, v_subgenus, v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification 
ORDER BY v_scientificName, numocc DESC

Lookup taxonkey with both old & new lookup

CREATE TABLE markus.name_matches AS

  prod.taxonKey as taxonKey_old,
  prod.scientificName as scientificName_old,
  prod.rank as rank_old,
  prod.status as status_old,
  prod.matchType as matchType_old,
  prod.confidence as confidence_old,
  prod.kingdomKey as kingdomKey_old,
  prod.phylumKey as phylumKey_old,
  prod.classKey as classKey_old,
  prod.orderKey as orderKey_old,
  prod.familyKey as familyKey_old,
  prod.genusKey as genusKey_old,
  prod.speciesKey as speciesKey_old,
  prod.kingdom as kingdom_old,
  prod.phylum as phylum_old,
  prod.class_ as class_old,
  prod.order_ as order_old, as family_old,
  prod.genus as genus_old,
  prod.species as species_old,

  uat.taxonKey as taxonKey,
  uat.scientificName as scientificName,
  uat.rank as rank,
  uat.status as status,
  uat.matchType as matchType,
  uat.confidence as confidence,
  uat.kingdomKey as kingdomKey,
  uat.phylumKey as phylumKey,
  uat.classKey as classKey,
  uat.orderKey as orderKey,
  uat.familyKey as familyKey,
  uat.genusKey as genusKey,
  uat.speciesKey as speciesKey,
  uat.kingdom as kingdom,
  uat.phylum as phylum,
  uat.class_ as class_,
  uat.order_ as order_, as family,
  uat.genus as genus,
  uat.species as species

    v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_subgenus, 
    match('PROD', v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) prod, 
    match('UAT', v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) uat
  FROM markus.names
) n;

Hive exports

CREATE TABLE markus.matches_changed 
SELECT * from markus.name_matches 
WHERE taxonKey!=taxonKey_old;

CREATE TABLE markus.matches_all 
SELECT * from markus.name_matches;

Friday, 27 March 2015

IPT v2.2 – Making data citable through DataCite

GBIF is pleased to release IPT 2.2, now capable of automatically connecting with either DataCite or EZID to assign DOIs to datasets. This new feature makes biodiversity data easier to access on the Web and facilitates tracking its re-use.

DataCite integration explained

DataCite specialises in assigning DOIs to datasets. It was established in 2009 with three fundamental goals(1):
  1. Establish easier access to research data on the Internet
  2. Increase acceptance of research data as citable contributions to the scholarly record
  3. Support research data archiving to permit results to be verified and re-purposed for future study

Wednesday, 26 November 2014

Upgrading our cluster from CDH4 to CDH5

A little over a year ago we wrote about upgrading from CDH3 to CDH4 and now the time had come to upgrade from CDH4 to CDH5. The short version: upgrading the cluster itself was easy, but getting our applications to work with the new classpaths, especially MapReduce v2 (YARN), was painful.

The Cluster

Our cluster has grown since the last upgrade (now 12 slaves and 3 masters), and we no longer had the luxury of splitting the machines to build a new cluster from scratch. So this was an in-place upgrade, using CDH Manager.

Upgrade CDH Manager

The first step was upgrading to CDH Manager 5.2 (from our existing 4.8). The Cloudera documentation is excellent so I don't need to repeat it here. What we did find was that the management service now requests significantly more RAM for it's monitoring services (minimum "happy" config of 14GB), to the point where our existing masters were overwhelmed. As a stop gap we've added a 4th old machine to the "masters" group, used exclusively for the management service. In the longer term we'll replace the 4 masters with 3 new machines that have enough resources. 

Upgrade Cluster Members

Again the Cloudera documentation is excellent but I'll just add a bit. The upgrade process will now ask if a JAVA jdk should be installed (an improvement over the old behaviour of just installing one anyway). That means we could finally remove the Oracle JDK 6 rpms that have been lying around on the machines. For some reason the Host Inspector still complains about OpenJDK 7 vs Oracle 7 but we've happily been running on OpenJDK 7 since early 2014, and so far so good with CDH5 as well. After the upgrade wizard finished we had to tweak memory settings throughout the cluster, including setting the "Memory Overcommit Validation Threshold" to 0.99, up from its (very conservative) default of 0.8. Cloudera has another nice blog post on figuring out memory settings for YARN. Additionally Hue's configuration required some attention because after the upgrade it had forgotten where Zookeeper and the HBase Thrift server were. All in all quite painless.

The Gotchas

Getting our software to work with CDH5 was definitely not painless. All of our problems stemmed from conflicting versions of jars, due either to changes in CDH dependencies, or in changes to how a user classpath is specified as having priority over that of Yarn/HBase/Oozie. Additionally it took some time to wrap our heads around the new artifact packaging used by YARN and HBase. Note that we also use Maven for dependency management.

We're not alone in our suffering at the hands of mismatched Guava versions (e.g. HADOOP-10101HDFS-7040), but suffer we did. We resorted to specifying version 14.0.1 in any of our code that touches Hadoop and more importantly HBase, and exclude any higher version guavas from our dependencies. This meant downgrading some actual code that was using guava 15, but was the easiest path to getting a working system.

We have many dependencies on Jackson 1.9 and 2+ throughout our code, so downgrading to match HBase's shipped 1.8.8 was not an option. It meant figuring out the classpath precedence rules described below, and solving the problems (like logging) that doing so introduced.

Logging in Java is a horrible mess, and with the number of intermingled projects required to make application software run on a Hadoop/HBase cluster it's not surprise that getting logging to work was brutal. We code to the SLF4J API and use Logback as our implementation of choice. The Hadoop world uses a mix of Java Commons Logging, java.util.logging, and log4j. We thought that meant we'd be clear if we used the same SLF4J API (1.7.5) and used the bridges (log4j-over-slf4j, jcl-over-slf4j, and jul-to-slf4j), which has worked for us up to now. <montage>Angry men smash things angrily over the course of days</montage> Turns out, there's a bug in the 1.7.5 implementation of log4j-over-slf4j, which blows up as we described over at YARN-2875. Short version - use 1.7.6+ in client code that attempts to use YARN and log4j-over-slf4j.

The crux of our problems was having our classpath loaded after the Hadoop classpath had been loaded, meaning old versions of our dependencies were loaded first. The new, surprisingly hard to find parameter that tells YARN to load your classpath first is "mapreduce.job.user.classpath.first". YARN also quizzically claims that the parameter is deprecated, but.. works for me.

Convincing Oozie to load our classpath involved another montage of angry faces. It uses the same parameter as YARN, but with a prefix, so what you want is "oozie.launcher.mapreduce.job.user.classpath.first". We had been loading the old parameter "mapreduce.task.classpath.user.precedence" in each action in the workflow using the <job-xml> tag to load the configs from a file called hive-default.xml. We then encountered two problems: 
  1. Note the name - we used hive-default.xml instead of hive-site.xml because of a bug in Oozie (discussed here and here). That was fixed in the CDH5.2 Oozie, but we didn't get the memo. Now the file is called hive-site.xml and contains our specific configs and is again being picked up. BUT:
  2. Adding oozie.launcher.mapreduce.job.user.classpath.first to hive-site.xml doesn't work! As we wrote up in Oozie bug OOZIE-2066 this parameter has to be specified for each action, at the action level, in the workflow.xml. Repeating the example workaround from the bug report:
 <action name="run-test">  
  <ok to="end" />  
  <error to="kill" />  

New Packaging Woes

We build our jars using a combination of jar-with-dependencies and the shade plugin, but in both cases it means all our dependencies are built in. The problems come when a downstream, transitive dependency loads a different (typically older) version of one of the jars we've bundled in our main jar. This happens a lot with the Hadoop and HBase artifacts, especially when it comes to MR1 and logging.

Example exclusions

hbase-server (needed to run MapReduce over HBase):

hbase-testing-util (needed to run mini clusters):


hadoop-client (removing logging):

Beyond just sorting conflicting dependencies, we also encountered a problem that presented as "No FileSystem for scheme: file". It turns out we had projects bringing in both hadoop-common and hadoop-hdfs, and so we were getting only one of the META-INF/services files in the final jar.  Thus we could not use the FileSystem to read local files (like jars for the class path) and also from HDFS.  The fix was to include the org.apache.hadoop.fs.FileSystem in our project explicitly:

Finally we had to stop the TableMapReduceUtil from bringing in it’s own dependent jars, which brought in yet more conflicting jars - this appears to be a change in the default behaviour, where dependent jars are now being brought in by default in the shorter versions of initTableMapper:


As you can see the client side of the upgrade was beset on all sides by the iniquities of jars, packaging and old dependencies. It seems strange that upgrading Guava is considered a no-no, major breaking change by these projects, yet discussions about removing HBaseTablePool are proceeding apace and will definitely break many projects (including any of ours that touch HBase). While we're ultimately pleased that everything now works, and looking forward to benefiting from the performance improvements and new features of CDH5, it wasn't a great trip. Hopefully our experience will help others migrate more smoothly.