Thursday 25 February 2016

Reprojecting coordinates according to their geodetic datum

For a long time Darwin Core has a term to declare the exact geodetic datum used for the given coordinate. Quite a few data publishers in GBIF have used dwc:geodeticDatum for some time to publish the datum of their location coordinates.

Until now GBIF has treated all coordinates as if they were in WGS84, the widespread global standard datum used by the Global Positioning System (GPS). Accordingly locations given in a different datum, for example NAD27 or AGD66, were displaced on GBIF maps a little. This so called “datum shift” is not dramatic, but can be up to a few hundred metres depending on the location and datum. The Univeristy of Colorado has a nice visualization of the impact.

At GBIF we interpret the geodeticDatum and reproject all coordinates as good as we can into the single datum WGS84. In order to do this there are two main steps that need to be done: parse and interpret the given verbatim geodetic datum and then do the actual transformation based on the known geodetic parameters.

Parsing geodeticDatum

As usual GBIF receives a lot of noise when reading the dwc:geodeticDatum. After removing the obvious bad values, e.g. introduced by bad mappings done by the publisher, we still ended up with over 300 different values for some datum. Most commonly simple names or abbreviations are used, e.g. NAD27, WGS72, ED50, TOKYO. In some cases we also see proper EPSG http://www.epsg.org/ codes coming in, e.g. EPSG:4326 which is the EPSG code for WGS84. As EPSG is a widespread and complete reference dataset of geodetic parameters, supported by many java libraries, we decided to add a new DatumParser to our parser library that directly returns EPSG integer codes for datum values. That way we can lookup geodetic parameters easily in the following transformation step. In addition to parse any given EPSG:xyz code directly it also understands most datums found in the GBIF network based on a simple dictionary file which we manually curate.

Even though EPSG codes are well maintained, very complete and supported by most software opaque integer codes have a harder time to get used than meaningful short names. Maybe a lesson we should keep in mind when debating about identifiers elsewhere.

Our recommendation to publishers is to use the EPSG codes if you know them, otherwise stick to the simple well known names. A good place to search for EPSG codes is http://epsg.io/.

Transformation

Once we have a decimal coordinate and a well known geodetic source datum the transformation itself is rather straight forward. We use geotools to do the work. The first step in the transformation is to instantiate a CoordinateReferenceSystem (CRS) using the parsed EPSG code of the geodeticDatum. A CRS combines a datum with a coordinate system, in our case this always a 2 dimensional system with the prime meridian in Greenwich and longitude values increasing East, latitude values North.

As EPSG codes can refer to both, just a plain datum and also a complete spatial reference system, we need to take this into account when building the CRS like this:

 private CoordinateReferenceSystem parseCRS(String datum) {
    CoordinateReferenceSystem crs = null;
    // the GBIF DatumParser in use
    ParseResult<Integer> epsgCode = PARSER.parse(datum);
    if (epsgCode.isSuccessful()) {
      final String code = "EPSG:" + epsgCode.getPayload();
      // first try to create a full fledged CRS from the given code
      try {
        crs = CRS.decode(code);

      } catch (FactoryException e) {
        // that didn't work, maybe it is *just* a datum
        try {
          GeodeticDatum dat = DATUM_FACTORY.createGeodeticDatum(code);
      // build a CRS using the standard 2-dim Greenwich coordinate system
          crs = new DefaultGeographicCRS(dat, DefaultEllipsoidalCS.GEODETIC_2D);

        } catch (FactoryException e1) {
          // also not a datum, no further ideas, log error
          LOG.info("No CRS or DATUM for given datum code >>{}<<: {}", datum, e1.getMessage());
        }
      }
    }
    return crs;
  }

Once we have a CRS instance we can create a specific WGS84 transformation and apply it to our coordinate:

public ParseResult<LatLng> reproject(double lat, double lon, String datum) {
   CoordinateReferenceSystem crs = parseCRS(datum);
   MathTransform transform = CRS.findMathTransform(crs, DefaultGeographicCRS.WGS84, true);
   // different CRS may swap the x/y axis for lat lon, so check first:
   double[] srcPt;
   double[] dstPt = new double[3];
   if (CRS.getAxisOrder(crs) == CRS.AxisOrder.NORTH_EAST) {
     // lat lon
     srcPt = new double[] {lat, lon, 0};
   } else {
     // lon lat
     srcPt = new double[] {lon, lat, 0};
   }

   transform.transform(srcPt, 0, dstPt, 0, 1);
   return ParseResult.success(ParseResult.CONFIDENCE.DEFINITE, new LatLng(dstPt[1], dstPt[0]), issues);
  }

The actual projection code does a bit more of null and exception handling which I have removed here for simplicity.

As you can see above we also have to watch out for spatial reference systems that use a different axis ordering. Luckily geotools knows all about that and provides a very simple way to test for it.

Issue flags

As with most of our processing we flag records when problems or assumed behavior occurs. In the case of the geodetic datum processing we keep track of 5 distinct issues which are available as GBIF portal occurrence search filters:

  • COORDINATE_REPROJECTION_FAILED: A CRS was instantiated, but the transformation failed for some reason.
  • GEODETIC_DATUM_INVALID: The datum parser was unable to return an EPSG code for the given datum string.
  • COORDINATE_REPROJECTION_SUSPICIOUS: The reprojection resulted in a datum shift larger than 0.1 degrees.
  • GEODETIC_DATUM_ASSUMED_WGS84: No datum was given or the given datum was not understood. In that case the original coordinates remain untouched.
  • COORDINATE_REPROJECTED: The coordinate was successfully transformed and differs now from the verbatim one given.