tag:blogger.com,1999:blog-23266248135333830622024-03-16T08:07:52.102+01:00Developer BlogTim Robertsonhttp://www.blogger.com/profile/07889700598656669041noreply@blogger.comBlogger88125tag:blogger.com,1999:blog-2326624813533383062.post-21557016410765665752018-12-04T11:44:00.001+01:002018-12-04T11:44:28.627+01:00Goodbye developer blog, hello data-blog!<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-SpM1WaccSF8/XAZZQVyfFuI/AAAAAAAAAmo/qPTlEC_D4XEe8xRiLwB9SQeVX0qh_MPtwCLcBGAs/s1600/logo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="191" data-original-width="443" height="137" src="https://3.bp.blogspot.com/-SpM1WaccSF8/XAZZQVyfFuI/AAAAAAAAAmo/qPTlEC_D4XEe8xRiLwB9SQeVX0qh_MPtwCLcBGAs/s320/logo.png" width="320" /></a></div>
<br />
<br />
<div style="text-align: center;">
<b><span style="font-size: large;">GBIF has a new blog!</span></b></div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: center;">
<span style="font-size: x-large;"><a href="https://data-blog.gbif.org/">https://data-blog.gbif.org/</a></span></div>
<br />
<div>
<br /></div>
<h2>
What is it?</h2>
A place for GBIF staff and guest bloggers to contribute:<br /><ul>
<li>Statistics </li>
<li>Graphs </li>
<li>Tutorials </li>
<li>Ideas </li>
<li>Opinions </li>
</ul>
<h2>
Who can contribute?</h2>
If you would like to contribute, you can contact jwaller@gbif.org. <b>Guest blogs are very welcome</b>.<br /><h2>
How can I write a post?</h2>
There is a short tutorial on the <a href="https://github.com/gbif/data-blog">blog's GitHub repository</a>.<br /><h2>
What about the developer blog?</h2>
The developer blog will remain up as an archive, but there are no plans to actively post new content here. <div>
<br /></div>
<div>
<br /></div>
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>John Wallerhttp://www.blogger.com/profile/01361374410016999446noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-69588886026350912892018-07-27T11:41:00.003+02:002018-07-29T09:39:23.691+02:00How popular is your favorite species? <br />
<link crossorigin="anonymous" href="https://use.fontawesome.com/releases/v5.2.0/css/all.css" integrity="sha384-hWVjflwFxL6sNzntih27bfxkr27PmbbK/iSvJ+a4+0owXq79v+lsFkW54bOGbiDQ" rel="stylesheet"></link><br />
<br />
<div><iframe align="right" height="800" src="http://178.128.167.105/shiny/gbifDownloadTrends//?_inputs_&selectInput=%5B%22Animalia%20total%3A%20208011%22%2C%22Bacteria%20total%3A%20219%22%2C%22Fungi%20total%3A%205839%22%2C%22Plantae%20total%3A%20196763%22%5D" style="border: none;" width="100%"><br />
</iframe><br />
</div><br />
<h2>How to use</h2><div>Use the box to the left to <b>type in</b> the species you are interested in.<br />
Make sure to use a <b>scientific name:</b><br />
<ul><li><b>Aves</b> instead of <b>birds</b></li>
<li><b>Plantae</b> instead of <b>plants</b></li>
<li><b>Anura </b> instead of <b>frogs</b></li>
</ul></div><h2>Explanation of tool</h2>This tool plots downloads through time for species or other taxonomic groups with <b>more than 25 downloads</b> at GBIF. Downloads at GBIF most often occur through the <a href="https://www.gbif.org/occurrence/search">web interface</a>. In a <a href="http://gbif.blogspot.com/2018/06/occurrence-downloads-occurrences-at.html">previous post</a>, we saw that most users download data from GBIF by filtering on scientific name (aka Taxon Key). Since the GBIF index currently sits at over <b>1 billion records</b> (a 400+ GB CSV), most users will simply filter by their taxonomic group of interest and then generate a download.<br />
<h2>How to bookmark a result?</h2>If you would like to bookmark a result or graph to share with others, you can visit the app page directly: <a href="http://178.128.167.105/shiny/gbifDownloadTrends/">app link</a>. On this page the state of the app is saved inside the URL. You can also save a JPG by clicking on the hamburger menu icon <i class="fas fa-bars"></i> in the top right. <br />
<h2>What counts as a download?</h2>For the graphs above, I decided that it would be more meaningful to roll up downloads <strong>below</strong> the queried taxonomic level.<br />
<ul><li>If a user downloaded 5 different bird species at once, this would count as <strong>1 download</strong> for Aves and <strong>1 download</strong> for each of the species downloaded.</li>
<li>If a user <strong>only typed in Aves</strong> in the <a href="https://www.gbif.org/occurrence/search?taxon_key=212">occurrence download interface</a> and not any other species, this would count as only 1 download for Aves and <strong>0 downloads for all bird species</strong>.</li>
<li>Similarly, if a user only typed the order <strong>Passeriformes</strong> into the search, this would count as 1 download for <strong>Passeriformes</strong> and 1 download for <strong>Aves </strong>(and 1 download for Animalia etc.) but <strong>0 downloads</strong> for all the species, families, and genera within Passeriformes.</li>
</ul>It is possible, but not as easy, to get data from GBIF <b>without generating a download</b>. In fact, users can stream data using the GBIF occurrence API without ever generating a download: currently users can “download” 200k-long chunks of occurrence data by paging through the API. If someone got their data using the API in this way, we currently have no way to track it. Presumably, the vast majority of users are getting their data directly through the <a href="https://www.gbif.org/occurrence/search">web interface</a>.<br />
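As a concrete illustration, here is a minimal Python sketch of that kind of API-based streaming, paging through the public occurrence search endpoint (<code>/v1/occurrence/search</code> with its <code>taxonKey</code>, <code>limit</code> and <code>offset</code> parameters). The helper names are mine, not part of any GBIF client library.

```python
# Sketch: stream GBIF occurrences via the search API without a download.
# The endpoint and its parameters are real; the helpers are illustrative.
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.gbif.org/v1/occurrence/search"

def page_url(taxon_key, offset=0, limit=300):
    """Build the URL for one page of results (300 records per page)."""
    return API_BASE + "?" + urlencode(
        {"taxonKey": taxon_key, "limit": limit, "offset": offset})

def stream_occurrences(taxon_key, max_records=600):
    """Yield occurrence records page by page, never creating a download."""
    offset = 0
    while offset < max_records:
        with urllib.request.urlopen(page_url(taxon_key, offset)) as resp:
            page = json.load(resp)
        yield from page["results"]
        if page.get("endOfRecords"):
            break
        offset += page["limit"]

# First page of bird occurrences (Aves has taxonKey 212):
print(page_url(212))
```

Fetches made this way never show up in the download statistics discussed above, which is exactly the tracking gap the post describes.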
<br />
For more technical details on this tool, you can visit my personal blog:<br />
<a href="http://www.johnwalleranalytics.org/2018/07/06/gbif-download-trends/">http://www.johnwalleranalytics.org/2018/07/06/gbif-download-trends/</a><br />
<br />
<br />
<br />
<br />
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>John Wallerhttp://www.blogger.com/profile/01361374410016999446noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-60754839534365396702018-06-28T16:22:00.000+02:002018-06-28T16:23:50.144+02:00Occurrence DownloadsOccurrences at GBIF are often downloaded through the <a href="https://www.gbif.org/occurrence/search?q=">web interface</a>, or through the API (via rgbif etc.). Users can place various filters on the data in order to limit the number of records returned. As the occurrence index is currently a 447 GB CSV, most users want to use a filter.<h2 id="totalmonthlydownloads">
Total monthly downloads</h2>
Here I plot the total monthly downloads for various popular filters. For the past few years, GBIF has been averaging around <b>10k downloads</b> per month.<br />
<br />
Two peaks in total downloads stand out:<br />
<ul>
<li>Mar 2014</li>
<li>Sep 2016</li>
</ul>
The <strong>Sep 2016</strong> peak seems to be explained by high <strong>DATASET_KEY</strong> downloads. Both the <strong>Mar 2014</strong> and <strong>Sep 2016</strong> peaks are well explained by the <strong>top users</strong>. Top users in this graph means all the downloads generated by the <strong>top 3 most active users</strong> on GBIF. These users generate downloads in the thousands, and their downloads are most likely automated and generated internally. <br />
<br />
One interesting detail is that while <strong>No Filter Used</strong> is not chosen very often, it accounts for more than <strong>500 billion</strong> occurrence records downloaded. <br />
<br />
Finally, if we look at the <strong>number of unique users</strong> (un-select everything else to see it in isolation), we see that <strong>the number of individuals making downloads on GBIF has been increasing steadily</strong> with some interesting cyclical patterns. The graph below is <b>interactive: you can see different data views by clicking on the names. </b><br />
<br />
<iframe src="https://jhnwllr.github.io/charts/monthlyDownloads.html" style="border: none; height: 500px; width: 100%;"></iframe><br />
<h2 id="typesoffilters">
Popular filters explained</h2>
There are many ways that a user can filter data. The types and combinations of filters are almost limitless. Below I describe some of the <strong>most common</strong> filters:<br />
<br />
<strong>1. TAXON_KEY</strong><br />
<br />
This is one of the most common filters users place on the GBIF occurrence index. Users can either choose <strong>one</strong> or <strong>many</strong> taxon names to filter the data, and users can choose any taxon rank they want (species, genus, family, kingdom etc.).<br />
<br />
<strong>2. COUNTRY</strong><br />
<br />
Here users can return records only from a certain country. This is the country the user searched for and <strong>not</strong> where the user is searching from.<br />
<br />
<strong>3. HAS_GEOSPATIAL_ISSUE</strong><br />
<br />
Here users can specify that they want occurrence records <strong>without interpreted geospatial errors</strong>.<br />
<br />
<strong>4. HAS_COORDINATE</strong><br />
<br />
Here users can specify that they want occurrence records that <strong>have coordinates</strong>.<br />
<br />
<strong>5. No Filter</strong><br />
<br />
Finally, a surprising number of users apply no filter at all and instead request to download the <strong>entire occurrence index</strong>. In the overwhelming majority of cases, we have to assume these users have done this <strong>by mistake</strong>.<br />
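To make the most common case concrete, here is a minimal Python sketch of how a TAXON_KEY filter is expressed as a JSON predicate for the occurrence download API (<code>POST /v1/occurrence/download/request</code>, which requires a GBIF account). The predicate shape follows the documented API; the helper function and username are illustrative.

```python
# Sketch: a TAXON_KEY download request as a JSON predicate.
# The predicate structure follows the GBIF download API docs;
# the helper and "my_gbif_username" are hypothetical.
import json

def taxon_key_request(creator, taxon_key):
    """Build a download request filtering occurrences to one taxon key."""
    return {
        "creator": creator,
        "predicate": {"type": "equals", "key": "TAXON_KEY", "value": str(taxon_key)},
    }

# A download of all Aves occurrences (taxonKey 212):
req = taxon_key_request("my_gbif_username", 212)
print(json.dumps(req["predicate"]))
```

COUNTRY, HAS_COORDINATE and the other filters described above take the same predicate form with a different <code>key</code>, and predicates can be combined with <code>and</code>/<code>or</code> nodes.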
<br />
You can read more about downloads at GBIF here:<br />
<a href="http://www.johnwalleranalytics.org/2018/05/30/gbif-download-statistics/">http://www.johnwalleranalytics.org/2018/05/30/gbif-download-statistics/</a><br />
<br /><div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>John Wallerhttp://www.blogger.com/profile/01361374410016999446noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-15230506798908378402017-06-22T07:38:00.000+02:002017-06-22T09:39:40.954+02:00GBIF Name Parser<div class="tr_bq">
The <a href="https://github.com/gbif/name-parser">GBIF name parser</a> has been a fundamental library for GBIF to parse a scientific name string into a structured representation of a name. It has been refined over many years based on actual name strings encountered in the GBIF occurrence and checklist indices. Over the years the major design goals have not changed much and can be summarised as follows:</div>
<ul>
<li>extract canonical, code relevant name parts</li>
<ul>
<li>populate only the <a href="https://github.com/gbif/gbif-api/blob/master/src/main/java/org/gbif/api/model/checklistbank/ParsedName.java">ParsedName</a> class of the GBIF API</li>
<li>ignore any superfluous name parts irrelevant to the code, e.g. species authorships in infraspecific names, infrageneric placements of species or superfluous infraspecific parts in quadrinomials</li>
</ul>
<li>deal with a wide variety of names that the ParsedName class can represent</li>
<ul>
<li>cultivar names</li>
<li>bacterial strains & candidate names</li>
<li>virus names</li>
<li>named hybrids</li>
<li>taxon concept references, sensu lato/stricto or aggregates</li>
<li>legacy ranks</li>
</ul>
<li>extract notes often found in names:</li>
<ul>
<li>nomenclatural remarks</li>
<li>determination notes like aff. </li>
<li>partially determined species, e.g. only down to the genus: <i>Abies</i> spec.</li>
</ul>
<li>in case author parsing is impossible, fall back to parsing just the canonical name without authors</li>
<li>allow slightly imperfect names not strictly well formed according to the rules</li>
<li>classify names according to our <a href="https://github.com/gbif/gbif-api/blob/master/src/main/java/org/gbif/api/vocabulary/NameType.java#L23">NameType</a> enumeration</li>
</ul>
Compared to <a href="https://github.com/GlobalNamesArchitecture/gnparser">gnparser</a> these are slightly different goals, which explains some of the behavior described in the recent paper from <a href="https://dx.doi.org/10.1186%2Fs12859-017-1663-3">Dmitry Mozzherin 2017</a>. As that paper explains, the GBIF name parser is based on regular expressions, some of them even recursive. This is not the reason why we do not support hybrid formulas, though. Hybrid formulas (e.g. <i>Quercus robur</i> x <i>Q. macrocarpa</i>), as opposed to named hybrids (e.g. <i>Quercus</i> x <i>turneri</i>), are a variable combination of names and thus very different from the Linnean names represented by a ParsedName. For name matching, backbone building and many other tasks hybrid formulas are incompatible, so we decided to treat them just as we treat other unparsable names, such as viruses or OTU names, that do not follow the neat structure of Linnean names. We simply keep the entire string as it was, classify it with a NameType and do not parse it further.<br />
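That design decision can be sketched in a few lines of Python. This is a toy illustration, not the actual Java implementation, and the hybrid-detection heuristic is deliberately crude: a formula combines two names, so the token after the hybrid marker is itself a (possibly abbreviated) genus name, whereas a named hybrid continues with a lowercase epithet.

```python
# Toy sketch of the behavior described above: hybrid formulas are kept
# verbatim and only classified, while named hybrids and ordinary names
# go on to normal parsing. Names and heuristic are illustrative only.
def classify(name):
    """Return (name_type, parsed); parsed is None for unparsable names."""
    parts = name.split()
    if "x" in parts or "\u00d7" in parts:
        marker = parts.index("x") if "x" in parts else parts.index("\u00d7")
        after = parts[marker + 1] if marker + 1 < len(parts) else ""
        if after[:1].isupper():
            # hybrid formula: keep the whole string, classify, don't parse
            return ("HYBRID_FORMULA", None)
        # named hybrid: parsed like an ordinary Linnean name
        return ("SCIENTIFIC", {"genusOrAbove": parts[0]})
    return ("SCIENTIFIC", {"genusOrAbove": parts[0]})

print(classify("Quercus robur x Q. macrocarpa"))
print(classify("Quercus x turneri"))
```

The real parser uses a NameType enumeration with many more categories (virus, OTU, placeholder, etc.), but the principle is the same: classification without decomposition for anything that is not a Linnean name.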
<br />
GBIF exposes the name parser through the <a href="http://www.gbif.org/developer/species#parser">GBIF JSON API</a>, here are some examples for illustration:<br />
<ul>
<li>variety <a href="http://api.gbif.org/v1/parser/name?name=Serjania%20meridionalis%20Cambess.%20var.%20o%27donelli%20F.A.%20Barkley">Serjania meridionalis Cambess. var. o’donelli F.A. Barkley</a></li>
<li>basionym <a href="http://api.gbif.org/v1/parser/name?name=Carex%20scirpoidea%20Michx.%20subsp.%20convoluta%20(K%C3%BCk.)%20D.A.Dunlop">Carex scirpoidea Michx. subsp. convoluta (Kük.) D.A.Dunlop</a></li>
<li>cultivar <a href="http://api.gbif.org/v1/parser/name?name=Stephanandra%20incisa%20(Thunb.)%20Zabel%20cv.%20%27Crispa%27">Stephanandra incisa (Thunb.) Zabel cv. ‘Crispa’</a></li>
<li>subgenus <a href="http://api.gbif.org/v1/parser/name?name=Polana%20(Bulbusana)%20vana%20DeLong%20%26%20Freytag%201972">Polana (Bulbusana) vana DeLong & Freytag 1972</a></li>
<li>named hybrid <a href="http://api.gbif.org/v1/parser/name?name=Quercus%20x%20turneri">Quercus x turneri</a></li>
<li>hybrid formula <a href="http://api.gbif.org/v1/parser/name?name=Quercus%20robur%20x%20Q.%20macrocarpa">Quercus robur x Q. macrocarpa</a></li>
<li>virus <a href="http://api.gbif.org/v1/parser/name?name=Choristoneura%20rosaceana%20entomopoxvirus">Choristoneura rosaceana entomopoxvirus</a></li>
<li>indetermined <a href="http://api.gbif.org/v1/parser/name?name=Abies%20spec.">Abies spec.</a></li>
<li>uncertain determination <a href="http://api.gbif.org/v1/parser/name?name=Rasbora%20aff.%20elegans">Rasbora aff. elegans</a></li>
<li>nomenclatural remark <a href="http://api.gbif.org/v1/parser/name?name=Iridaea%20undulosa%20var.%20papillosa%20Bory%20de%20Saint-Vincent,%20nom.%20nud.">Iridaea undulosa var. papillosa Bory de Saint-Vincent, nom. nud.</a></li>
<li>taxon concept <a href="http://api.gbif.org/v1/parser/name?name=Achillea%20millefolium%20sec.%20Greuter%202009">Achillea millefolium sec. Greuter 2009</a></li>
<li>serovar <a href="http://api.gbif.org/v1/parser/name?name=Salmonella%20enterica%20serovar%20Typhimurium">Salmonella enterica serovar Typhimurium</a></li>
<li>bacterial strain <a href="http://api.gbif.org/v1/parser/name?name=Yersinia%20pestis%20biovar%20orientalis%20str.%20IP674">Yersinia pestis biovar orientalis str. IP674</a></li>
<li>legacy rank <a href="http://api.gbif.org/v1/parser/name?name=Potamon%20(Pontipotamon)%20ibericum%20tauricum%20natio%20trojensis%20Pretzmann,%201983">Potamon (Pontipotamon) ibericum tauricum natio trojensis Pretzmann, 1983</a></li>
<li>sensu lato <a href="http://api.gbif.org/v1/parser/name?name=Taraxacum%20erythrospermum%20s.l.">Taraxacum erythrospermum s.l.</a></li>
<li>placeholder <a href="http://api.gbif.org/v1/parser/name?name=Asteraceae%20incertae%20sedis">Asteraceae incertae sedis</a></li>
</ul>
Authorships are not (yet) parsed into a list of individual authors. This has been done internally already and it is something we are likely to expose in the future. Currently the authorship is parsed into four pieces: the authorship and year for both the combination and the basionym.<br />
<h3>
gnparser in GBIF</h3>
The GNA name parser is a great parser for well-formed names. It has slightly different goals, but since it is available for the JVM we have <a href="https://github.com/gbif/name-parser/blob/master/name-parser-gna/src/main/java/org/gbif/nameparser/GNANameParser.java#L22">wrapped it to support the GBIF NameParser</a> interface producing ParsedName instances. Wrapping the Scala-based gnparser was not as trivial as we had thought due to its different parsing output and the mismatch of Scala and Java collections, but working against the NameParser interface you can finally select your parser of choice.<br />
<br />
The <b>authorship semantics</b> for original names are also slightly different between the two parsers. Again some examples to illustrate the difference:<br />
<strong>Azalea schlippenbachii (Maxim.) Kuntze</strong><br />
Both parsers show the same semantics:<br />
<pre>GBIF:
"authorship": "Kuntze",
"bracketAuthorship": "Maxim.",
GNA:
"value": "(Maxim.) Kuntze",
"basionym_authorship": {
"authors": ["Maxim."]
},
"combination_authorship": {
"authors": ["Kuntze"]
}
</pre>
<strong>Rhododendron schlippenbachii Maxim.</strong><br />
The GBIF parser places the author into “authorship” as the author of the very combination.<br />
The gnparser places the author into the basionym authorship instead even though it is not surrounded by brackets.<br />
As the parser cannot know if the name actually is a basionym, i.e. whether there indeed exists a subsequent recombination, this was slightly unexpected, and the GBIF NameParser wrapper had to swap the authorship to populate a ParsedName in such cases:<br />
<pre>GBIF:
"authorship": "Maxim.",
GNA:
"basionym_authorship": {
"authors": ["Maxim."]
}
</pre>
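That swap can be sketched roughly as follows. The field names follow the JSON examples in this post; the function itself is hypothetical Python, not the actual Java wrapper.

```python
# Sketch of the authorship swap described above: when gnparser puts an
# un-bracketed author into basionym_authorship, the wrapper treats it
# as the combination author of the ParsedName instead.
def to_parsed_name(gna_authorship, original_name):
    """Map gnparser authorship JSON onto ParsedName-style fields."""
    basionym = gna_authorship.get("basionym_authorship")
    combination = gna_authorship.get("combination_authorship")
    if basionym and not combination and "(" not in original_name:
        # no brackets in the name string: swap into the combination author
        return {"authorship": " ".join(basionym["authors"])}
    result = {}
    if combination:
        result["authorship"] = " ".join(combination["authors"])
    if basionym and "(" in original_name:
        result["bracketAuthorship"] = " ".join(basionym["authors"])
    return result

print(to_parsed_name({"basionym_authorship": {"authors": ["Maxim."]}},
                     "Rhododendron schlippenbachii Maxim."))
```

For bracketed names like <i>Azalea schlippenbachii</i> (Maxim.) Kuntze, both fields survive unchanged, matching the first example above.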
<strong>Puma concolor (Linnaeus, 1771)</strong><br />
Both parsers show the same semantics:<br />
<pre>GBIF:
"bracketAuthorship": "Linnaeus",
"bracketYear": "1771",
GNA:
"basionym_authorship": {
"authors": ["Linnaeus"],
"year": {
"value": "1771"
}
}
</pre>
<b>Ex authors</b> are not parsed in GBIF (yet). They do cause a little issue as the sequence of authors varies between botany and zoology. In botany, the author of the earlier name precedes the later, valid one, while in zoology this is reversed. GNA follows the zoological model, even though far more usages of ex-authors can be found in botanical names.<br />
<br />
<b>Uninomials</b> are also treated differently. GBIF uses a single property genusOrAbove for the genus part of a binomial, a standalone genus, and a uninomial of a rank above genus. GNA places genera from a binomial into the genus field, but uses uninomial for a standalone genus name.<br />
<h3>
Performance</h3>
We are still comparing gnparser with the GBIF name parser, but <a href="https://github.com/gbif/name-parser/blob/master/name-parser-comparison/src/main/java/org/gbif/nameparser/ParserComparison.java">initial tests</a> using gnparser-0.4.0 to parse <a href="https://github.com/gbif/name-parser/blob/master/name-parser-comparison/src/main/resources/names.txt">1380 names</a> from our unit tests suggest the GBIF parser is up to twice as fast running in a Java environment. There is an overhead of wrapping the gnparser for the ParsedName result class, but even if we just parse the names and do not convert the Scala result into a ParsedName, gnparser takes 75% more time:<br />
<pre>
Total time parsing 1380 names
MacBookPro 2017, Java8, single thread:
GBIF: 1331ms
GNA : 2596ms
GNA-: 2323ms # without wrapper
</pre>
<br />
This contradicts the results presented in the gnparser paper, but might be related to the selection of names or to running the parsers in different environments.<br />
<h3>
Future</h3>
We are working with GNA to improve both parsers and align them more. With slightly different goals it might be hard to fully merge the two projects, but we will try to unify the efforts as much as we can. For the GBIF name parser we will be adding parsed author and ex author teams in the near future. This is needed to do author comparisons for better name matching in the GBIF backbone building (where it already exists) and the Catalogue of Life.<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com4tag:blogger.com,1999:blog-2326624813533383062.post-63361383319249414552017-02-27T14:52:00.003+01:002017-02-27T14:52:52.824+01:00GBIF Backbone - February 2017 UpdateWe are happy to announce that a new <a href="http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c">GBIF Backbone</a> just went live, available also as an improved <a href="http://rs.gbif.org/datasets/backbone/2017-02-13/backbone.zip">Darwin Core Archive for download</a>. Here are some facts highlighting the important changes.<br />
<h2>
New source datasets</h2>
Apart from continuously updated sources like the <a href="http://www.gbif.org/dataset/7ddf754f-d193-4cc9-b351-99906754a03b">Catalog of Life</a> or <a href="http://www.gbif.org/dataset/2d59e5db-57ad-41ff-97d6-11f5fb264527">WoRMS</a>, here are the new datasets we used as a source to build the backbone.<br />
<br />
<ul>
<li>New <a href="http://www.gbif.org/dataset/6cfd67d6-4f9b-400b-8549-1933ac27936f">Type specimen checklist</a> listing all distinct names of <a href="http://www.gbif.org/occurrence/search?TYPE_STATUS=*">type specimens found in GBIF occurrences</a> contributing 252,410 new species and 57,410 infraspecific names.</li>
<li><a href="http://www.gbif.org/dataset/b9a214b7-c368-4d22-aa53-b1fc16a1210a">ZooBank</a> joined GBIF and was added as a nomenclator with 175,775 names, contributing 3460 new generic and 39,695 new species names.</li>
<li>Added <a href="http://www.gbif.org/species/8770992">phylum Myzozoa</a> with 136 families under kingdom Chromista to <a href="https://github.com/gbif/algae/commit/afccc623414b7ff2be715bcce1e64fc1aa97ca86">GBIF Algae Classification</a> to fill the <a href="https://github.com/gbif/checklistbank/issues/12">classification gap for Dinoflagellates</a></li>
<li>A tiny new dataset listing species named after <a href="http://www.gbif.org/dataset/00e791be-36ae-40ee-8165-0b2cb0b8c84f">famous people</a>, which are often found in the news</li>
</ul>
<div>
<br /></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-xRzL5VVkPNs/WLQKIYQjKGI/AAAAAAAAEMk/WTxm_E987800qMpI-HMt2yBPaMQIcvYeACLcB/s1600/nub-sources.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="376" src="https://4.bp.blogspot.com/-xRzL5VVkPNs/WLQKIYQjKGI/AAAAAAAAEMk/WTxm_E987800qMpI-HMt2yBPaMQIcvYeACLcB/s640/nub-sources.png" width="640" /></a></div>
<br />
<div style="text-align: center;">
The <a href="https://github.com/gbif/checklistbank/blob/77f8a4b5ccd90cda59243978565c6a05820ead1a/checklistbank-nub/nub-sources.tsv">43 sources</a> used in this backbone build</div>
<br />
<h2>
Code changes</h2>
<br />
<ul>
<li>Merging of duplicate taxa across kingdoms, especially with taxa from the incertae sedis kingdom. Examples:</li>
<ul>
<li><a href="https://demo.gbif.org/species/8592581">Dictyodora Weiss, 1884</a> and <a href="https://demo.gbif.org/species/4897359">Dictyodora C.E. Weiss, 1884</a></li>
<li><a href="https://demo.gbif.org/species/8486131">Barilium Norman, 2010</a> and <a href="https://demo.gbif.org/species/7455976">Barilium</a></li>
</ul>
<li>Exclude genus & species synonyms for taxa at a higher rank: <a href="http://dev.gbif.org/issues/browse/POR-3169">http://dev.gbif.org/issues/browse/POR-3169</a></li>
<li><a href="http://dev.gbif.org/issues/browse/PF-2600">Restrict name normalisation</a> with double letters to bi/trinomials. Finally the fish <a href="http://dev.gbif.org/issues/browse/PF-2611">Lota lota</a> is a fish again. Examples of other previously wrongly conflated families that have been reported:</li>
<ul>
<li><a href="http://www.gbif-uat.org/species/9639">Lotidae</a> & <a href="http://www.gbif-uat.org/species/6553">Lottiidae</a></li>
<li><a href="http://www.gbif.org/species/4237">Belidae</a> & <a href="http://www.gbif.org/species/2775">Belliidae</a></li>
<li><a href="http://www.gbif-uat.org/species/9125">Lauridae</a> & <a href="http://www.gbif-uat.org/species/3243623">Lauriidae</a></li>
</ul>
<li>Stable identifier for <a href="http://dev.gbif.org/issues/browse/POR-3031">pro parte taxa</a> in the backbone.</li>
</ul>
<br />
<br />
All other fixed issues in the source code that generates the backbone can be found in our <a href="http://dev.gbif.org/issues/browse/POR-3208">Jira epic</a><br />
and <a href="https://github.com/gbif/checklistbank/milestone/2?closed=1">github milestone</a>.<br />
<h2>
Backbone impact</h2>
The new backbone has a total of 5,887,500 names of which it treats 2,818,534 species names as accepted (up from 5,307,978 and 2,525,274 respectively).<br />
More <a href="http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c/stats">backbone metrics</a> are available through our portal and in more detail through our <a href="http://api.gbif.org/v1/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c/metrics">API</a>.<br />
<br />
<br />
<ul>
<li>105,296 <a href="http://rs.gbif.org/datasets/backbone/2017-02-13/deleted.txt.gz">deleted names</a>, many of them previous erroneous duplicates</li>
<li>685,853 <a href="http://rs.gbif.org/datasets/backbone/2017-02-13/created.txt.gz">new names</a></li>
<ul>
<li>Animalia: 164 families; 6,616 genera; 257,196 species; 87,660 infraspecific</li>
<li>Archaea: 2 families; 6 genera; 48 species</li>
<li>Bacteria: 27 families; 225 genera; 2,470 species; 615 infraspecific</li>
<li>Chromista: 2 phyla; 13 classes; 58 orders; 54 families; 767 genera; 12,124 species; 2,953 infraspecific</li>
<li>Fungi: 2 families; 269 genera; 8,703 species; 2,993 infraspecific</li>
<li>Plantae: 3 families; 795 genera; 63,617 species; 33,282 infraspecific</li>
<li>Protozoa: 4 families; 65 genera; 1,412 species; 280 infraspecific</li>
<li>Viruses: 8 families; 1,227 genera; 8,488 species</li>
<li>Unknown: 4 families; 2,708 genera; 13,076 species; 2,237 infraspecific</li>
</ul>
</ul>
<br />
A very large and <a href="http://rs.gbif.org/datasets/backbone/2017-02-13/clb-nub.log.gz">detailed log</a> of the backbone build is also available.<br />
<br />
The largest taxonomic groups in the backbone, each exceeding 3% of all accepted species, are shown in the following diagram:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-xeo5Ad3hlOE/WLQP5K0k0WI/AAAAAAAAEM0/rIo0T-iCwNQ-9NvGk_-x31Jt3T8Mt94kwCLcB/s1600/backbone%2Bspecies.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="498" src="https://3.bp.blogspot.com/-xeo5Ad3hlOE/WLQP5K0k0WI/AAAAAAAAEM0/rIo0T-iCwNQ-9NvGk_-x31Jt3T8Mt94kwCLcB/s640/backbone%2Bspecies.png" width="640" /></a></div>
<br />
All contributors to the backbone arranged by number of names the source serves as the primary reference:<br />
<br />
<ul>
<li>3,330,535 <a href="http://www.gbif.org/dataset/7ddf754f-d193-4cc9-b351-99906754a03b">Catalogue of Life</a> </li>
<li>685,831 <a href="http://www.gbif.org/dataset/0938172b-2086-439c-a1dd-c21cb0109ed5">Interim Register of Marine and Nonmarine Genera</a> </li>
<li>312,746 <a href="http://www.gbif.org/dataset/2d59e5db-57ad-41ff-97d6-11f5fb264527">World Register of Marine Species</a> </li>
<li>309,820 <a href="http://www.gbif.org/dataset/6cfd67d6-4f9b-400b-8549-1933ac27936f">GBIF Type Specimen Names</a> </li>
<li>285,859 <a href="http://www.gbif.org/dataset/d9a4eedb-e985-4456-ad46-3df8472e00e8">The Plant List with literature</a> </li>
<li>140,937 <a href="http://www.gbif.org/dataset/90d9e8a6-0ce1-472d-b682-3451095dbc5a">Fauna Europaea</a> </li>
<li>136,981 <a href="http://www.gbif.org/dataset/bf3db7c9-5e5d-4fd0-bd5b-94539eaf9598">Index Fungorum</a> </li>
<li>126,960 <a href="http://www.gbif.org/dataset/c33ce2f2-c3cc-43a5-a380-fe4526d63650">The Paleobiology Database</a> </li>
<li>114,089 <a href="http://www.gbif.org/dataset/046bbc50-cae2-47ff-aa43-729fbf53f7c5">International Plant Names Index</a> </li>
<li>53,848 <a href="http://www.gbif.org/dataset/9ca92552-f23a-41a8-a140-01abaa31c931">Integrated Taxonomic Information System ITIS</a> </li>
<li>44,732 <a href="http://www.gbif.org/dataset/b9a214b7-c368-4d22-aa53-b1fc16a1210a">ZooBank</a> </li>
<li>30,482 <a href="http://www.gbif.org/dataset/66dd0960-2d7d-46ee-a491-87b9adcfe7b1">GRIN Taxonomy</a> </li>
<li>29,267 <a href="http://www.gbif.org/publisher/7ce8aef0-9e92-11dc-8738-b8a03c50a862">Plazi</a> </li>
<li>25,749 <a href="http://www.gbif.org/dataset/a6c6cead-b5ce-4a4e-8cf5-1542ba708dec">Artsnavnebasen</a> </li>
<li>24,996 <a href="http://www.gbif.org/dataset/65c9103f-2fbf-414b-9b0b-e47ca96c5df2">Afromoths</a> </li>
<li>15,007 <a href="http://www.gbif.org/publisher/47a779a6-a230-4edd-b787-19c3d2c80ab5">Species Files</a> </li>
<li>13,818 <a href="http://www.gbif.org/dataset/aacd816d-662c-49d2-ad1a-97e66e2a2908">Brazilian Flora 2020 project</a> </li>
<li>8,923 <a href="http://www.gbif.org/dataset/de8934f4-a136-481c-a87a-b0b202b80a31">Dyntaxa</a></li>
<li>6,807 <a href="http://www.gbif.org/publisher/0674aea0-a7e1-11d8-9534-b8a03c50a862">DiversityTaxonNames Lists</a> </li>
<li>5,696 <a href="http://www.gbif.org/dataset/80b4b440-eaca-4860-aadf-d0dfdd3e856e">Official Lists and Indexes of Names in Zoology</a> </li>
<li>5,317 <a href="http://www.gbif.org/dataset/52a423d2-0486-4e77-bcee-6350d708d6ff">Prokaryotic Nomenclature Up-to-date</a> </li>
<li>4,617 <a href="http://www.gbif.org/dataset/ded724e7-3fde-49c5-bfa3-03b4045c4c5f">International Cichorieae Network ICN</a></li>
<li>4,611 <a href="http://www.gbif.org/dataset/da38f103-4410-43d1-b716-ea6b1b92bbac">Catalogue of Afrotropical Bees</a> </li>
<li>4,416 <a href="http://www.gbif.org/dataset/3f8a1297-3259-4700-91fc-acc4170b27ce">Database of Vascular Plants of Canada</a> </li>
<li>4,312 <a href="http://www.gbif.org/dataset/e01b0cbb-a10a-420c-b5f3-a3b20cc266ad">ICTV Master Species List</a> </li>
<li>3,874 <a href="http://www.gbif.org/dataset/47f16512-bf31-410f-b272-d151c996b2f6">The Clements Checklist</a> </li>
<li>2,702 <a href="http://www.gbif.org/dataset/7a9bccd4-32fc-420e-a73b-352b92267571">Checklist of Beetles Coleoptera of Canada and Alaska</a> </li>
<li>1,198 <a href="http://www.gbif.org/dataset/c696e5ee-9088-4d11-bdae-ab88daffab78">IOC World Bird List, v6.3</a></li>
<li>1,087 <a href="http://www.gbif.org/dataset/7ea21580-4f06-469d-995b-3f713fdcc37c">GBIF Algae Classification</a> </li>
<li>578 <a href="http://www.gbif.org/dataset/8dc469b3-8e61-4f6f-b9db-c70dbbc8858c">ION Taxonomic Hierarchy</a> </li>
<li>272 <a href="http://www.gbif.org/dataset/672aca30-f1b5-43d3-8a2b-c1606125fa1b">Mammal Species of the World</a> </li>
<li>144 <a href="http://www.gbif.org/dataset/daacce49-b206-469b-8dc2-2257719f3afa">GBIF Backbone Patch</a> </li>
<li>39 <a href="http://www.gbif.org/dataset/00e791be-36ae-40ee-8165-0b2cb0b8c84f">Species named after famous people</a> </li>
<li>36 <a href="http://www.gbif.org/dataset/bd25fbf7-278f-41d6-bc17-9f08f2632f70">True Fruit Flies Diptera, Tephritidae of the Afrotropical Region</a> </li>
<li>7 <a href="http://www.gbif.org/dataset/6e4c3b6f-0126-4c5f-bd63-fe6ffd3b29fa">Backbone Family Classification Patch</a> </li>
<li>7 <a href="http://www.gbif.org/dataset/0e61f8fe-7d25-4f81-ada7-d970bbb2c6d6">TAXREF</a> </li>
</ul>
<br />
<h2>
Occurrence impact</h2>
With a new backbone we have reprocessed all of our <a href="http://www.gbif.org/occurrence">712 million occurrences</a>.<br />
<br />
The distribution of the major taxonomic groups exceeding 3%, i.e. having a minimum of 36,800 species, is shown in this last diagram:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-aV97u_id8Io/WLQP5HW0GiI/AAAAAAAAEM4/9R1ziV62SpoJD-ygjWifwVoQQuNRnclywCEw/s1600/occ%2Bspecies.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="498" src="https://1.bp.blogspot.com/-aV97u_id8Io/WLQP5HW0GiI/AAAAAAAAEM4/9R1ziV62SpoJD-ygjWifwVoQQuNRnclywCEw/s640/occ%2Bspecies.png" width="640" /></a></div>
<br />
The 1,226,520 accepted species in GBIF occurrences (140 fewer than before) represent 44% of all accepted backbone species.<br />
<br /><div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-29092170879153683482017-01-25T17:43:00.000+01:002017-01-25T17:43:18.501+01:00Sampling-event standard takes flight on the wings of butterflies<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
Data collected from systematic monitoring schemes is highly valuable. That's because harvesting species data from a given set of sites repeatedly over time using a well-defined sampling effort opens the door to key ecological analyses including phenology, population trends, changes in community structure and other metrics related to a range of Essential Biodiversity Variables (<a href="http://geobon.org/essential-biodiversity-variables/ebv-classes-2/" target="_blank">EBVs</a>).<br />
<br />
A couple of years ago there was no reliable way to universally standardize data from systematic monitoring schemes. This meant that researchers using this kind of data first needed to spend a lot of time deciphering it. Their job became even more complicated when trying to integrate data from various heterogeneous sources, each storing its data in different formats, units, etc.<br />
<br />
Today, the situation looks much better thanks to a massive collaboration between <a href="http://www.gbif.org/" target="_blank">GBIF</a>, <a href="http://www.eubon.eu/show/partners_2735/" target="_blank">EU BON partners</a> and the wider biodiversity community whose aim was to enable sharing of "sampling-event datasets". <br />
<br />
Indeed, one of the most successful outcomes from this collaboration has been the development of a standardized format for systematic butterfly monitoring schemes.<br />
<br />
The format has been developed in close collaboration with the EU BON partners Israel Pe'er (<a href="http://www.gluecad-bio.com/face/gluecad_en.html" target="_blank">GlueCAD- Biodiversity IT</a>) and his son, Dr. Guy Pe'er, (<a href="https://www.ufz.de/index.php?en=38961" target="_blank">UFZ</a>), who works with systematic monitoring data. The format can be adapted to many other types of systematic monitoring, for many taxonomic groups, as it ensures the following important conditions for researchers are met:<br />
<ul style="text-align: left;">
<li>all visits to a given site are known, including those with no sightings, as this allows for analyses of species phenology, etc.</li>
<li>the range of species being recorded during sampling is explicit, as this allows for true absence to be determined.</li>
<li>the location hierarchies can be specified (e.g. the location is a fixed transect or subsection of a transect), as this allows users to group observations by location.</li>
<li>enough detailed information about the sampling effort and sampling area (e.g. units of measurement) is captured, as this allows users to calculate density or convert between units of abundance.</li>
</ul>
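To see why these conditions matter in practice, here is a minimal Python sketch. The sample data and column values are invented for illustration, not drawn from the BMS-IL dataset, though the field names borrow real Darwin Core terms (eventID, sampleSizeValue, scientificName, individualCount). It shows how recording every visit, including those with no sightings, lets users derive true absences and per-area densities:

```python
# Illustrative sketch (not the official format): complete visit lists plus an
# explicit recording scope make absences and densities derivable.
events = [  # one row per monitoring visit (Event core)
    {"eventID": "T1-2016-04", "eventDate": "2016-04-10", "sampleSizeValue": 500.0},  # plot area in m2
    {"eventID": "T1-2016-05", "eventDate": "2016-05-08", "sampleSizeValue": 500.0},  # visit with no sightings
]

occurrences = [  # observations linked back to events
    {"eventID": "T1-2016-04", "scientificName": "Pieris rapae", "individualCount": 12},
]

# The range of species recorded by the scheme, made explicit (condition 2 above).
recorded_species = {"Pieris rapae", "Vanessa atalanta"}

def absences(event_id):
    """Species in the scheme's scope that were NOT seen during this visit."""
    seen = {o["scientificName"] for o in occurrences if o["eventID"] == event_id}
    return recorded_species - seen

def density(event_id, species):
    """Individuals per square metre for one species on one visit."""
    event = next(e for e in events if e["eventID"] == event_id)
    count = sum(o["individualCount"] for o in occurrences
                if o["eventID"] == event_id and o["scientificName"] == species)
    return count / event["sampleSizeValue"]
```

Because the empty May visit is present in the event table, the absence of both species on that date is data, not a gap.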
The Israeli Butterfly Systematic Monitoring Scheme (BMS-IL) dataset has already been published openly using this format. I'd like to invite everyone to explore this exemplar dataset from either the <a href="http://cloud.gbif.org/eubon/resource?r=butterflies-monitoring-scheme-il" target="_blank">EU BON IPT</a> or via <a href="http://www.gbif.org/dataset/647ae6f8-8e26-4189-b448-02b45b7ad884" target="_blank">GBIF.org</a>. <br />
<div>
<br />
In the future, I hope that <a href="http://geobon.org/products/reports-papers/geo-bon-technical-reports/" target="_blank">GEO BON's Guidelines for Standardized Global Butterfly Monitoring</a> will incorporate a new recommendation that all monitoring programs use this standardized format for sharing their data. Without a doubt this will make researchers' jobs easier when integrating data from several butterfly monitoring programs for their analyses. It will also enable integrating the data with standardized sampling-event data from other disciplines. <br />
<br />
Ideally, making the data openly available in a standardized format also leads to new collaboration. So far, BMS-IL data has been used to assess trends in the abundance and phenology of Israel's butterflies for the benefit of conservation and climate-change research, for example. I would like to encourage you to reach out to Israel and Guy Pe'er if you have any novel ideas on how to reuse their newly standardized data and help unlock its full potential.</div>
</div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/16423423909368777750noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-53212023993294788722017-01-12T17:40:00.000+01:002017-01-24T16:50:07.910+01:00IPT v2.3.3 - Your repository for standardized biodiversity data<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
GBIF is pleased to announce the release of IPT v2.3.3, now available for download from the <a href="http://www.gbif.org/ipt" target="_blank">IPT website</a>. <br />
<br />
This version looks and feels the same as 2.3.2 but is much more robust and secure. I'd like to recommend that all existing IPT installations be upgraded as soon as possible following the instructions listed in the <a href="https://github.com/gbif/ipt/wiki/IPTReleaseNotes233.wiki" target="_blank">release notes</a>.<br />
<br />
Additionally, a couple of new strategic features have been added to the tool to enhance its potential. A description of these new features follows below. <br />
<br />
<h3 style="text-align: left;">
Improved dataset homepage </h3>
<br />
Compared with general-purpose repositories such as <a href="http://datadryad.org/" target="_blank">Dryad</a> or <a href="https://figshare.com/" target="_blank">Figshare</a>, the IPT ensures that uploaded biodiversity data gets disseminated in a standardized format (Darwin Core Archive - DwC-A), facilitating wider reuse and enabling the data to be indexed by aggregators such as GBIF.org.<br />
<br />
Interoperability comes at a small cost though, as depositors choosing to use the IPT must overcome a learning curve in understanding how to map their data to the Darwin Core standard. <br />
<br />
To make this easier for depositors, a <a href="http://www.gbif.org/newsroom/news/new-darwin-core-spreadsheet-templates" target="_blank">new set of Darwin Core Excel templates</a> has recently been released. These new templates provide a simpler solution for capturing, formatting and uploading data to the IPT.<br />
<br />
Similarly, users of the standardized data need to understand how to unpack a DwC-A and make sense of the data inside. <br />
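For the curious, a DwC-A is simply a zip file containing data tables plus a meta.xml descriptor that says which file is the core and what each column means. The sketch below builds a deliberately simplified toy archive (real descriptors carry more attributes, such as field indexes and term URIs) and then locates and reads the core file using only the Python standard library:

```python
# Simplified sketch of unpacking a Darwin Core Archive: read meta.xml,
# find the core file's location, and return its rows.
import io
import zipfile
import xml.etree.ElementTree as ET

META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
  </core>
</archive>"""

# Build a toy archive in memory so the example is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("meta.xml", META)
    z.writestr("occurrence.txt", "occurrenceID\tscientificName\n1\tPieris rapae\n")

def core_rows(archive_bytes):
    """Locate the core data file via meta.xml and return its rows."""
    ns = {"dwc": "http://rs.tdwg.org/dwc/text/"}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as z:
        meta = ET.fromstring(z.read("meta.xml"))
        location = meta.find("dwc:core/dwc:files/dwc:location", ns).text
        return z.read(location).decode().splitlines()

rows = core_rows(buf.getvalue())
```

The first row holds the column headers; every following row is one record of the core table.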
<br />
<div style="text-align: right;">
</div>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/--LKHvtfk9vg/WHOGB4kSXaI/AAAAAAAASXU/2picxTBNkBMsrWaMVEEemIJbBgtmuxZIQCEw/s1600/Screen%2BShot%2B2017-01-09%2Bat%2B13.45.27.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img alt="" border="0" height="134" src="https://1.bp.blogspot.com/--LKHvtfk9vg/WHOGB4kSXaI/AAAAAAAASXU/2picxTBNkBMsrWaMVEEemIJbBgtmuxZIQCEw/s320/Screen%2BShot%2B2017-01-09%2Bat%2B13.45.27.png" title="Data Records section from RLS Global Reef Fish Dataset" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Data Records section - RLS Global Reef Fish Dataset<br />
<a href="http://doi.org/10.15468/qjgwba">doi:10.15468/qjgwba</a> </td></tr>
</tbody></table>
To make this process easier for users, a new Data Records section has been added to the dataset homepage. It explains what the DwC-A format is, with a graphic illustration showing the number of records in each file contained within the archive. <br />
<br />
Overall this advancement will strengthen the IPT as a data repository, which is already capable of <a href="http://gbif.blogspot.dk/2015/03/ipt-v22.html" target="_blank">assigning DOIs to datasets</a> to make them discoverable and citable. <br />
<br />
<h3 style="text-align: left;">
Translation into Russian </h3>
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-R1_1Fd_aB6c/WHN2XOv0jnI/AAAAAAAASW4/VFwai6Q-iVEaPqMg6o_pEmTvPMr2QYQUACLcB/s1600/Screen%2BShot%2B2017-01-09%2Bat%2B12.36.25.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img alt="" border="0" height="203" src="https://1.bp.blogspot.com/-R1_1Fd_aB6c/WHN2XOv0jnI/AAAAAAAASW4/VFwai6Q-iVEaPqMg6o_pEmTvPMr2QYQUACLcB/s400/Screen%2BShot%2B2017-01-09%2Bat%2B12.36.25.png" title="Map of IPT installations focused on Russian speaking countries" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://www.gbif.org/ipt/stats" target="_blank">Map of IPT installations in Russia - January 2017</a> </td></tr>
</tbody></table>
<a href="http://www.gbif.org/ipt/stats" target="_blank">Installed in 52 countries</a> around the world, use of the IPT heavily is underrepresented across Russian speaking countries. Therefore to extend the IPT's reach in these areas, the user interface has been fully translated into Russian by a team of volunteer translators with the largest contribution made by Ivan Chadin from the Komi Science Centre of the Ural Branch of the Russian Academy of Sciences.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-RZLdGPK-sZs/WHN4aurFJUI/AAAAAAAASXE/At7ZEN1okJ8_RmVeWYHd9KkoTuLhF0G_wCLcB/s1600/Screen%2BShot%2B2017-01-09%2Bat%2B12.45.18.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="" border="0" height="200" src="https://2.bp.blogspot.com/-RZLdGPK-sZs/WHN4aurFJUI/AAAAAAAASXE/At7ZEN1okJ8_RmVeWYHd9KkoTuLhF0G_wCLcB/s400/Screen%2BShot%2B2017-01-09%2Bat%2B12.45.18.png" title="Map of data published by Russia" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://www.gbif.org/country/RU/publishing" target="_blank">Map of data published by Russia - January 2017</a></td></tr>
</tbody></table>
At the time of writing there were already 18 datasets from Russia published by 5 IPTs installed across Pushchino, Moscow, St Petersburg and the Komi Republic. It will be exciting to watch this number grow over time in part thanks to this enormous volunteer contribution.<br />
<br />
<br />
<br />
<h3 style="text-align: left;">
Acknowledgements</h3>
<br />
Once again I'd like to recognize all the volunteer translators that contributed their time and expertise to making this new version available in seven different languages:<br />
<ul style="text-align: left;">
<li>Sophie Pamerlon (GBIF France) - Updating French translation</li>
<li>Yukiko Yamazaki (GBIF Japan (JBIF)) - Updating Japanese translation</li>
<li>Daniel Lins (Universidade de São Paulo, Research Center on Biodiversity and Computing - BioComp) - Updating Portuguese translation</li>
<li>Néstor Beltrán (Colombian Biodiversity Information System (SiB Colombia)) - Updating Spanish translation</li>
<li>Ivan Chadin (Institute of Biology of Komi Scientific Centre of the Ural
Branch of the Russian Academy of Sciences), Max Shashkov (Institute of
Physicochemical and Biological Problems in Soil Science, Russian Academy
of Science) and Artyom Leostrin (Komarov Botanical Institute of the Russian Academy of Sciences (Saint-Petersburg)) - Adding Russian translation </li>
</ul>
I'd also like to recognize a few volunteers that helped make significant improvements to the IPT codebase:<br />
<ul style="text-align: left;">
<li>Bruno P. Kinoshita (National Institute of Water and Atmospheric Research (NIWA)) - Fixed <a href="https://github.com/gbif/ipt/issues/1241" target="_blank">issue #1241</a>, ensuring the IPT can be installed on a server behind a proxy</li>
<li>Pieter Provoost (UNESCO) - Fixed <a href="https://github.com/gbif/ipt/issues/1248" target="_blank">issue #1248</a>, improving the IPT's RSS feed</li>
<li>Tadj Youssouf (Security researcher, fb.com/oc3f.dz) - Helped address a cross site scripting issue</li>
</ul>
Although the core development of the IPT happens at the GBIF Secretariat, the coding, documentation, and internationalization are a community effort and everyone is welcome to join in.<br />
<br />
I look forward to seeing the IPT's community of volunteers and users continue to grow and hope you can unlock the full potential of this publishing tool and repository. </div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/16423423909368777750noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-11651976054108289102016-08-08T15:14:00.000+02:002016-08-08T15:14:47.052+02:00GBIF Backbone - August 2016 UpdateGBIF has just put a new backbone taxonomy into production! Since our last update of the GBIF Backbone we have received various feedback and gained insight into potential code improvements. Here is a quick summary of what has changed in this August 2016 version.<br />
<h4>
Important code changes:</h4>
<ul>
<li>much less eager basionym detection, resulting in fewer algorithmically assigned synonyms and removing many false synonyms, especially in plants</li>
<li>detection and merging of orthographic variants of species names, by applying gender stemming, allowing for doubled consonants, handling author transliterations and merging hybrid names</li>
</ul>
<br />
All issues fixed in the source code that generates a new backbone can be found here; many of them link back to user-reported feedback: <a href="http://dev.gbif.org/issues/browse/POR-3029">http://dev.gbif.org/issues/browse/POR-3029</a><br />
<h4>
New sources</h4>
The following new sources have been incorporated into the August backbone:<br />
<ul>
<li>major new version of <a href="http://www.gbif.org/dataset/c33ce2f2-c3cc-43a5-a380-fe4526d63650">The Paleobiology Database</a>, contributing 2,315 new families, 11,390 genera and 131,958 species names to the backbone. It feeds many isExtinct and livingPeriod values into the backbone for fossil taxa</li>
<li>thousands of new <a href="http://www.gbif.org/publisher/7ce8aef0-9e92-11dc-8738-b8a03c50a862">Plazi articles</a> with 1,883 genera, 28,725 species and 1,935 infraspecific names. We only use genus names and below from Plazi, excluding any synonyms until we are confident they are all correctly marked up</li>
<li>added <a href="http://www.gbif.org/dataset/a6c6cead-b5ce-4a4e-8cf5-1542ba708dec">Artsnavnebasen</a> source, contributing 3,640 new genera and 29,751 species names to the backbone</li>
<li>added <a href="http://www.gbif.org/dataset/ded724e7-3fde-49c5-bfa3-03b4045c4c5f">International Cichorieae Network</a> source, contributing 190 new Asteraceae genera; 1,415 species and 3,427 infraspecies names to the backbone</li>
</ul>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-yAZAPLuD4hU/V5pHM_dhduI/AAAAAAAAEKk/s7DcP01lkSgWl2ESkECBFtjxOaIZn-_yACLcB/s1600/nub%2Bsource%2Bchanges.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="437" src="https://1.bp.blogspot.com/-yAZAPLuD4hU/V5pHM_dhduI/AAAAAAAAEKk/s7DcP01lkSgWl2ESkECBFtjxOaIZn-_yACLcB/s640/nub%2Bsource%2Bchanges.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small; text-align: start;">The </span><a href="https://github.com/gbif/checklistbank/blob/master/checklistbank-nub/nub-sources.tsv" style="font-size: medium; text-align: start;">39 sources</a><span style="font-size: small; text-align: start;"> used in this backbone build</span></td></tr>
</tbody></table>
<h4>
Backbone impact</h4>
<div>
The new backbone has a total of 5,307,978 names, of which 2,525,274 species names are treated as accepted (previously 2,420,842 out of 5,208,172). More <a href="http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c/stats">backbone metrics</a> are available through our portal and in more detail through our <a href="http://api.gbif.org/v1/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c/metrics">API</a>.</div>
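The metrics linked above are machine-readable too. Here is a small Python sketch of querying that metrics endpoint with the standard library; the network call is kept in its own function so the URL construction stands on its own, and the exact fields in the JSON response (counts broken down by rank, kingdom, etc.) are best checked against the live service:

```python
# Sketch of fetching checklist metrics from the GBIF API (stdlib only).
import json
import urllib.request

BACKBONE_KEY = "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c"  # GBIF Backbone Taxonomy

def metrics_url(dataset_key):
    """Metrics endpoint for a checklist dataset, as linked in the post."""
    return "http://api.gbif.org/v1/dataset/%s/metrics" % dataset_key

def fetch_metrics(dataset_key):
    """Fetch the metrics as a dict; requires network access."""
    with urllib.request.urlopen(metrics_url(dataset_key)) as resp:
        return json.load(resp)
```

Calling `fetch_metrics(BACKBONE_KEY)` returns the same numbers shown on the portal's stats page, ready for scripted comparisons between backbone builds.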
<ul>
<li><a href="http://rs.gbif.org/datasets/backbone/2016-07-25/deleted.txt.gz">187,854 deleted names</a>, mostly due to the removal of orthographic variants</li>
<li><a href="http://rs.gbif.org/datasets/backbone/2016-07-25/created.txt.gz">279,404 new names</a> </li>
<ul>
<li><u>Unknown</u>: 165 families; 743 genera; 785 species; 14 infraspecific</li>
<li><u>Animalia</u>: 13 orders; 1,649 families; 10,171 genera; 125,478 species; 4,398 infraspecific</li>
<li><u>Archaea</u>: 2 genera; 3 species</li>
<li><u>Bacteria</u>: 1 family; 33 genera; 544 species; 36 infraspecific</li>
<li><u>Chromista</u>: 38 families; 412 genera; 5,594 species; 295 infraspecific</li>
<li><u>Fungi</u>: 1 family; 691 genera; 11,127 species; 2,039 infraspecific</li>
<li><u>Plantae</u>: 50 families; 666 genera; 82,672 species; 14,725 infraspecific</li>
<li><u>Protozoa</u>: 1 class; 1 order; 4 families; 38 genera; 349 species; 24 infraspecific</li>
<li><u>Viruses</u>: 1 family; 982 genera; 6,311 species</li>
</ul>
</ul>
A very large and detailed <a href="http://rs.gbif.org/datasets/backbone/2016-07-25/clb-nub.log.gz">log of the backbone build</a> is also available.<br />
<br />
The largest taxonomic groups in the backbone, each exceeding 3% of all accepted species, are shown in the following diagram:<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-KahdGo8Y_7k/V6iFEy9janI/AAAAAAAAELY/cUfvYn37kFEguj0fJkR717RwVKjjNAM4gCLcB/s1600/backbonegroups.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="578" src="https://2.bp.blogspot.com/-KahdGo8Y_7k/V6iFEy9janI/AAAAAAAAELY/cUfvYn37kFEguj0fJkR717RwVKjjNAM4gCLcB/s640/backbonegroups.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<br />
The Catalogue of Life, the largest single primary source, contributes 59.8% of all names (previously 60.9%). A breakdown by <a href="http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c/constituents">backbone constituents</a> is now also available as a species search facet. For example, this shows the <a href="http://www.gbif.org/species/search?dataset_key=d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&rank=SPECIES&highertaxon_key=6">breakdown for all accepted plant species</a> in the backbone:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-iTo_4STIi3o/V6hZzQXKfaI/AAAAAAAAEK0/KoxShyeA1e84i3kq7phQQMDfw_IAQ7bgACLcB/s1600/Screen%2BShot%2B2016-08-08%2Bat%2B12.05.57.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="464" src="https://3.bp.blogspot.com/-iTo_4STIi3o/V6hZzQXKfaI/AAAAAAAAEK0/KoxShyeA1e84i3kq7phQQMDfw_IAQ7bgACLcB/s640/Screen%2BShot%2B2016-08-08%2Bat%2B12.05.57.png" width="640" /></a></div>
<br />
<h4>
Occurrence impact</h4>
With a new backbone we have reprocessed all of our 642 million occurrences. The larger changes were:<br />
<ul>
<li>Fixed various old/new world distributions of incorrectly synonymized species</li>
<li>Reduced the number of <a href="http://www.gbif.org/species/8">virus records</a> from 157,492 down to just 5,348 records. Most occurrences were Lepidoptera, e.g. the common <a href="http://www.gbif.org/species/5881450">peacock butterfly</a>, which had formerly been mismatched because no classification was given with the name.</li>
</ul>
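The peacock butterfly case illustrates why names are best matched together with their classification. As an illustration, the sketch below assembles requests to GBIF's species match service with and without a kingdom hint; the species name and parameters here are only examples, and actually issuing the request requires network access:

```python
# Sketch of building species-match requests with classification context.
import urllib.parse

MATCH_ENDPOINT = "http://api.gbif.org/v1/species/match"

def match_query(name, **classification):
    """Build a match URL; classification (e.g. kingdom=...) disambiguates homonyms."""
    params = {"name": name}
    params.update(classification)
    return MATCH_ENDPOINT + "?" + urllib.parse.urlencode(sorted(params.items()))

# A bare name risks landing on a homonym from the wrong group:
ambiguous = match_query("Aglais io")
# Supplying the kingdom steers the match toward the intended animal name:
contextual = match_query("Aglais io", kingdom="Animalia")
```

Publishers who include kingdom, family, and other higher ranks alongside their scientific names give the matching service exactly this kind of context.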
<div>
Some more metrics of backbone names in our occurrences:</div>
<div>
<ul>
<li>216,699 distinct genera in GBIF occurrences, which is 55% of all 396,990 genera in the backbone</li>
<li>1,226,668 accepted species in GBIF occurrences, which is 50% of all 2,420,842 backbone species</li>
<li>2,059,961 distinct names in GBIF occurrences, which is 39% of all 5,208,172 names in the backbone</li>
</ul>
<div>
The distribution of the major taxonomic groups exceeding 3% (i.e. those with a minimum of 36,800 species) is shown in this last diagram:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-dWLOf2QH2Os/V6iFFICq5hI/AAAAAAAAELc/0KcOjSAQYTYEzBHNLftWK-E_l2TabBucwCLcB/s1600/occurrencegroups.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="602" src="https://3.bp.blogspot.com/-dWLOf2QH2Os/V6iFFICq5hI/AAAAAAAAELc/0KcOjSAQYTYEzBHNLftWK-E_l2TabBucwCLcB/s640/occurrencegroups.png" width="640" /></a></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<br /></div>
</div>
<div>
</div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-79871755293128001602016-07-20T17:02:00.000+02:002017-01-24T16:53:13.808+01:00Probably Turboveg's best-kept secret<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="font-family: Trebuchet MS, sans-serif;"><a href="http://www.synbiosys.alterra.nl/turboveg/">Turboveg</a> is one of the most widely used software programs used to manage vegetation data. Probably its best-kept secret is that it can export vegetation data in Darwin Core Archive (DwC-A) format, which is a standard format that enables its quick and easy integration with other resources on <a href="http://www.gbif.org/">GBIF.org</a>. Turboveg v2 converts vegetation data into species occurrence data packaged as a DwC-A. Now thanks to an 8 month long collaboration between GBIF and Stephan Hennekens (Turboveg's developer), v3 will convert vegetation data into sampling event data packaged as a DwC-A - a much more faithful and useful representation of the data.</span><h4 style="text-align: left;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;"> Turboveg</span></span></h4>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><span style="font-family: Trebuchet MS, sans-serif;"><a href="http://www.synbiosys.alterra.nl/turboveg/" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="" border="0" height="243" src="https://3.bp.blogspot.com/-2SayE6te0tE/V4318_c567I/AAAAAAAARV8/m3H8gRTRueI5nyYw8o8jWsaSFjx988QrACLcB/s400/TV3-exportDwca.png" title="Screenshot of Turboveg v3 prototype" width="400" /></a></span></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Trebuchet MS, sans-serif;">Screenshot of Turboveg v3 prototype</span></td></tr>
</tbody></table>
<span style="font-family: Trebuchet MS, sans-serif;"><a href="http://www.synbiosys.alterra.nl/turboveg/">Turboveg</a> is an easy to install and easy to use Windows program for storing, managing, visualizing and exporting vegetation data (relevés). A relevé is a list of the plants in a delimited plot of vegetation, with information on species cover and on substrate and other abiotic features in order to make as complete as possible description in terms of plant community composition and structure. <br /><br />Today there are about 1500 users of the software worldwide managing more than 1,5 million relevés. Turboveg can export relevés in various file formats, which is useful to enable further analysis. Support for exporting relevés as species occurrence data packaged as a Darwin Core Archive (DwC-A) was added to v2 in 2011. Guidance on how to use this feature can be found in the <a href="http://www.synbiosys.alterra.nl/turboveg/help/Index.html?idh_export_darwincore.htm">Turboveg User Manual</a>. <br /><br />Version 3, due to be released in 2017, will export relevés as sampling event data packaged as DwC-A - a format that more accurately reflects the original data.</span><h4 style="text-align: left;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;"> <span style="font-family: "trebuchet ms" , sans-serif;">S</span>ampling event data</span></span></h4>
<span style="font-family: Trebuchet MS, sans-serif;">Sampling event data derive from environmental, ecological, and natural resource investigations that follow standardized protocols for measuring and observing biodiversity. This is in contrast to opportunistic observation and collection data, which today form a significant proportion of openly accessible biodiversity data. A good example of sampling data is data coming from vegetation sampling events using the Braun-Blanquet protocol. Because the sampling methodology and sampling units are precisely described the resulting data is comparable and thus better suited for measuring trends in habitat change and climate change.</span><h4 style="text-align: left;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;"><span style="font-family: "trebuchet ms" , sans-serif;">S</span>ampling event data model</span></span></h4>
<span style="font-family: Trebuchet MS, sans-serif;">A data model provides the details of the structure of the data. Previously sampling event data couldn't be modelled in a standardized way in Darwin Core due to the complexity of encoding the underlying protocols. Over the past two years, however, GBIF has been working with EU BON and the wider bioinformatics community to develop a data model for sharing sampling event data. In March 2015 TDWG, the international body responsible for maintaining standards for the exchange of biological data, ratified changes that enabled support for modelling sampling event data.</span><div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span><div class="separator" style="clear: both; text-align: center;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;"><a href="https://1.bp.blogspot.com/-iC_gLwcA8bY/V43xJFoDzqI/AAAAAAAARVc/meeu9HDHtBQQHQGP4-ihmD1ZMwDn3umpgCLcB/s1600/dm1.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="" border="0" height="200" src="https://1.bp.blogspot.com/-iC_gLwcA8bY/V43xJFoDzqI/AAAAAAAARVc/meeu9HDHtBQQHQGP4-ihmD1ZMwDn3umpgCLcB/s200/dm1.png" title="Sampling event data model" width="134" /></a></span></span></div>
<span style="font-family: Trebuchet MS, sans-serif;"><div>
In summary, the de facto data model for sampling event data in Darwin Core consists of three tables: Sampling event, Measurements or Facts and Species occurrences. </div>
<div>
<br /></div>
<div>
A Sampling event can be associated with many Species occurrences, while a Species occurrence can only be associated with one Sampling event. Similarly, a Sampling event can be associated with many Measurements or Facts. In this way a Sampling event has a one-to-many relationship to both Species occurrences and Measurements or Facts. </div>
<br />Note additional tables of information can also be added to a Sampling event, such as Multimedia (e.g. to record images of the plot). More information about this preferred data model for sampling event data can be found in the <a href="http://links.gbif.org/ipt-sample-data-primer">IPT Sampling Event Data Primer</a>.</span></div>
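As a concrete illustration of this one-to-many model, here is a minimal Python sketch. The sample rows are hypothetical, but the field names follow real Darwin Core terms (eventID, samplingProtocol, measurementType, etc.), with extension rows pointing back at the event through its eventID:

```python
# Sketch of the three-table sampling-event model: one Event core row,
# many Occurrence and MeasurementOrFact rows linked by eventID.
event = {"eventID": "plot7-v1", "eventDate": "2016-06-01",
         "samplingProtocol": "Braun-Blanquet"}

occurrences = [
    {"eventID": "plot7-v1", "scientificName": "Calluna vulgaris"},
    {"eventID": "plot7-v1", "scientificName": "Erica tetralix"},
]

measurements = [
    {"eventID": "plot7-v1", "measurementType": "plot area",
     "measurementValue": "4", "measurementUnit": "m2"},
    {"eventID": "plot7-v1", "measurementType": "slope",
     "measurementValue": "5", "measurementUnit": "degrees"},
]

def rows_for(event_id, table):
    """All extension rows linked to one sampling event (the one-to-many join)."""
    return [row for row in table if row["eventID"] == event_id]
```

One event thus fans out to any number of occurrences and measurements, while each extension row belongs to exactly one event.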
<div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span><div style="text-align: left;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: "trebuchet ms" , sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;"><br /></span></span></span></div>
<h4 style="text-align: left;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;"><span style="font-family: "trebuchet ms" , sans-serif;">S</span>ampling event data model for vegetation plot data </span></span></h4>
<span style="font-family: Trebuchet MS, sans-serif;">Vegetation surveys or relevés produce a wealth of information on species cover and on substrate and other abiotic features in the plot. Species cover can be measured using dozens of different vegetation abundance scales such as the Braun-Blanquet scale or Londo decimal scale to name a couple. To standardize how this information is stored, a custom Relevé table is used instead of the Measurements or Facts table.</span><span style="font-family: Trebuchet MS, sans-serif;"><span style="font-family: "Trebuchet MS",sans-serif;"><br /></span>
</span><br />
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;"><a href="https://2.bp.blogspot.com/-8tjZhOXIzU8/V43zQRXvZrI/AAAAAAAARVs/C5WEEEW0ErISc9SAmA5312VmUhvW_ikfwCLcB/s1600/dm2_2.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="" border="0" height="200" src="https://2.bp.blogspot.com/-8tjZhOXIzU8/V43zQRXvZrI/AAAAAAAARVs/C5WEEEW0ErISc9SAmA5312VmUhvW_ikfwCLcB/s200/dm2_2.png" title="Sampling event data model for vegetation data" width="133" /></a></span></span></div>
<span style="font-family: Trebuchet MS, sans-serif;">This data model for vegetation plot data in Darwin Core consists of three tables: Sampling event, Relevé and Species Occurrence.</span></div>
<div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span><div>
<span style="font-family: Trebuchet MS, sans-serif;">A Sampling event can be associated with only one Relevé. The Relevé consists of the most common relevé measurements covering all vegetation layers. Note for each measurement the unit and precision is explicitly defined. A Sampling event can also be associated with many Species occurrences, however, each Species occurrence should specify the vegetation layer where it was found hence the same species can be found within multiple vegetation layers. In this way the vegetation composition can be described for each layer within the plot.</span></div>
<div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Trebuchet MS, sans-serif;">Note that at the time of writing the Darwin Core standard doesn't have the terminology for storing vegetation layers. Therefore a <a href="https://github.com/tdwg/dwc/issues/125">formal proposal</a> has been made to add the new term "layer" to Darwin Core. To standardise how this new term is populated, a <a href="http://rs.gbif.org/vocabulary/gbif/vegetation_layer.xml">custom vocabulary for vegetation layers</a> has also been produced.</span></div>
<div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span><br />
<h4 style="text-align: left;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;">Example DwC-A export by Turboveg: Dutch Vegetation Database (LVD)</span></span></h4>
<span style="font-family: Trebuchet MS, sans-serif;">Fortunately, the <a href="http://cloud.gbif.org/eubon/resource?r=lvd&v=1.6">Dutch Vegetation Database (LVD)</a> has recently been republished using the new sampling event format and can thus serve as an exemplar dataset. LVD is a substantial dataset published by <a href="http://www.wageningenur.nl/nl/Expertises-Dienstverlening/Onderzoeksinstituten/Alterra.htm">Alterra</a> (a major Dutch research institute) that covers all plant communities in the Netherlands with more than 85 years of vegetation recording for some habitats. The latest version of this dataset has more than 650 thousand relevés associated with almost 12 million species occurrences. <br /><br />Alterra uses Turboveg v3 to manage this dataset and export it in the standardized DwC-A format. It is important to note that special care is taken by the software to protect sensitive species: the locations of plots in which red-list species have been observed are obfuscated to 5x5 km squares. Furthermore, the software converts all coverage values to the same unit (e.g. species coverage values are converted into percentage coverage) to make the data easier to use and integrate with other sources.</span><h4 style="text-align: left;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;">Sampling event data on GBIF.org: Dutch Vegetation Database (LVD)</span></span></h4>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><span style="font-family: Trebuchet MS, sans-serif;"><a href="https://2.bp.blogspot.com/-frN3LcyvwYE/V434Ai7lhtI/AAAAAAAARWI/I3-1ojq9E3gQWNUOJW976IHAWH6MP-opQCLcB/s1600/GBIF-LVD-map.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="" border="0" height="251" src="https://2.bp.blogspot.com/-frN3LcyvwYE/V434Ai7lhtI/AAAAAAAARWI/I3-1ojq9E3gQWNUOJW976IHAWH6MP-opQCLcB/s400/GBIF-LVD-map.png" title="GBIF.org map of LVD georeferenced data" width="400" /></a></span></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Trebuchet MS, sans-serif;">GBIF.org map of LVD georeferenced data</span></td></tr>
</tbody></table>
<span style="font-family: Trebuchet MS, sans-serif;">All versions of LVD are imported to the <a href="http://cloud.gbif.org/eubon/resource?r=lvd">EU BON IPT</a> where they get archived and published through <a href="http://www.gbif.org/">GBIF.org</a>. </span><div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Trebuchet MS, sans-serif;">The 8 month long collaboration between GBIF and Stephan Hennekens culminated in the latest version of LVD being indexed into GBIF.org <a href="http://www.gbif.org/dataset/740df67d-5663-41a2-9d12-33ec33876c47">here</a>. A special and grateful thanks is owed to Stephan for all his hard work to make this happen.</span></div>
<div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Trebuchet MS, sans-serif;">Over the next couple of years GBIF will continue working on enhancing the indexing and discovery of sampling event datasets (e.g. showing events' plots/transects on a map, filtering events by sampling protocol, indexing relevés, etc.). Meanwhile, once Turboveg v3 is released in 2017, users will be able to export their relevés into this new standardized format, which represents their data much more faithfully.</span></div>
</div>
</div>
</div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/16423423909368777750noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-18310265663071562752016-04-06T10:08:00.000+02:002016-04-06T17:01:22.509+02:00Updating the GBIF BackboneThe taxonomy employed by GBIF for organising all occurrences into a
consistent view has remained unchanged since 2013. We have been working on
a replacement for some time and are pleased to introduce a preview in this
post. The work is rather complex and tries to establish an automated
process to build a new backbone which we aim to run on a regular, probably
quarterly basis. We would like to release the new taxonomy rather soon and
improve the backbone iteratively. Large regressions should be avoided
initially, but it is quite hard to evaluate all the changes between two large
taxonomies of 4-5 million names each. We are therefore seeking feedback
and help in discovering oddities in the new backbone.<br />
<h3>
Relevance & Challenges</h3>
Every occurrence record in GBIF is matched to a taxon in the backbone.
Because occurrence records in GBIF cover the whole tree of life and names
may come from all possible, often outdated, taxonomies, it is important to
have the broadest coverage of names possible. We also deal with fossil
names, extinct taxa and (due to advanced digital publishing) even names
that have just been described a week before the data is indexed at
GBIF.<br />
The Taxonomic Backbone provides a single classification and a synonymy
that we use to inform our systems when creating maps, providing metrics or
even when you do a plain occurrence search. It is also used to crosslink
names between different checklist datasets.<br />
<h3>
The Origins</h3>
The very first taxonomy that GBIF used was based on the Catalogue of
Life. As this only included around half the names we found in GBIF
occurrences, all other cleaned occurrence names were merged into the GBIF
backbone. As the backbone grew we never deleted names, and we increasingly
faced redundant names with slightly different
classifications. It was time for a different procedure.<br />
<h3>
The Current Backbone</h3>
The current version of the backbone was built in July 2013. It is
largely based on the Catalogue of Life from 2012 and has folded in names
from <a href="https://github.com/mdoering/backbone-preview/blob/master/nub-live/sources.md">39 further taxonomic sources</a>.
It was built using an automated process that made use of selected checklists from
the GBIF ChecklistBank in a prioritised order. The Catalogue of Life was
still the starting point and provided the higher classification down to
orders.
The <a href="http://www.gbif-uat.org/dataset/714c64e3-2dc1-4bb7-91e4-54be5af4da12">Interim
Register of Marine and Nonmarine Genera</a> was used as the single
reference list for generic homonyms. Otherwise only a single version of any
name was allowed to exist in the backbone, even where the authorship
differed.<br />
<h4>
Current issues</h4>
We kept track of <a href="http://dev.gbif.org/issues/issues/?jql=labels%20%3D%20nub">nearly 150
reported issues</a>. Some of the main issues showing up regularly that we
wanted to address were:<br />
<ul>
<li>Enable an <a href="http://dev.gbif.org/issues/browse/POR-2467">automated build
process</a> so we can use the latest Catalogue of Life and other
sources to capture newly described or currently missing names
</li>
<li>It was impossible to have <a href="http://dev.gbif.org/issues/browse/POR-353">synonyms using the same
canonical name but with different authors</a>. This means <a href="http://www.gbif.org/species/4113236"><em>Poa pubescens</em></a> was
always considered a synonym of <em>Poa pratensis</em> L. when in fact
<em>Poa pubescens</em> R.Br. is considered a
synonym of <em>Eragrostis pubescens</em> (R.Br.) Steud.
</li>
<li>Some families contain far too many accepted species and hardly any
synonyms. Especially for plants the Catalogue of Life was surprisingly
sparsely populated and we heavily relied on IPNI names. For example the
family <a href="http://dev.gbif.org/issues/browse/POR-1389"><em>Cactaceae</em> has
12,062 accepted species</a> in GBIF while The Plant List recognizes
just 2,233.
</li>
<li>Many accepted names are based on the same basionym. For example the
current backbone considers both <a href="http://www.gbif.org/species/7283318"><em>Sulcorebutia breviflora</em>
Backeb.</a> and <a href="http://www.gbif.org/species/7281391"><em>Weingartia breviflora</em>
(Backeb.) Hentzschel & K.Augustin</a> as accepted taxa.
</li>
<li>Relying purely on IRMNG for homonyms meant that homonyms which were
not found in IRMNG were conflated. On the other hand, there are many
genera in IRMNG (and thus in the backbone) that are hardly used
anywhere, creating confusion and leaving many empty genera without any
species in our backbone.</li>
</ul>
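The authorship issue above can be illustrated with a tiny sketch (plain Java, invented names and authors; GBIF's real name parsing is far more sophisticated):

```java
// Illustrates, with invented example names, why matching on canonical
// names alone conflates homonyms, while including authorship keeps them apart.
class HomonymDemo {
  // crude split: first two words form the canonical name, the rest is authorship
  static String canonicalOf(String scientificName) {
    String[] parts = scientificName.split(" ", 3);
    return parts[0] + " " + parts[1];
  }

  // old behaviour: names match if their canonical forms match
  static boolean sameByCanonical(String a, String b) {
    return canonicalOf(a).equals(canonicalOf(b));
  }

  // new behaviour: the full name including authorship must match
  static boolean sameWithAuthorship(String a, String b) {
    return a.equals(b);
  }

  public static void main(String[] args) {
    // two hypothetical homonyms published by different authors
    String n1 = "Genus species Smith";
    String n2 = "Genus species Jones";
    System.out.println(sameByCanonical(n1, n2));    // true: conflated
    System.out.println(sameWithAuthorship(n1, n2)); // false: kept apart
  }
}
```

Canonical-only matching would treat the two names as one taxon, which is exactly the <em>Poa pubescens</em> problem described above.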
<h3>
The New Backbone</h3>
The new backbone is available for <a href="http://www.gbif-uat.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c">preview
in our test environment</a>. In order to review the new backbone and
compare it to the <a href="http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c">previous
version</a> we provide a few tools with a different focus:<br />
<ul>
<li>
<strong>Stable ID report</strong>: We have joined the old and new
backbone names to each other and <a href="https://github.com/mdoering/backbone-preview/blob/master/nub/stable-ids.md">compared their identifiers</a>. When joining on
the full scientific name there is still an issue with changing
identifiers, which we are investigating.
</li>
<li>
<strong>Tree Diffs</strong>: For comparing the higher
classification we used a <a href="http://iphylo.blogspot.dk/2015/12/visualising-difference-between-two.html">
tool from Rod Page</a> to <a href="http://mdoering.github.io/backbone-preview/families.html">diff the
tree down to families</a>. There are surprisingly many changes, but
all of them stem from evolution in the Catalogue of Life or the
changed Algae classification.
</li>
<li>
<strong>Nub Browser</strong>: For comparing actual species and also
reviewing the impact of the changed taxonomy on the GBIF
occurrences, we developed a <a href="http://mdoering.github.io/nub-browser/app/#/">new Backbone
Browser</a> sitting on top of our existing API (Google Chrome only). Our test
environment has a complete copy of the current GBIF occurrence
index which we have reprocessed to use the new backbone. This also
includes all maps and <a href="http://mdoering.github.io/nub-browser/app/#/metrics">metrics</a>
which we show in the new browser.
</li>
</ul>
Family <a href="http://mdoering.github.io/nub-browser/app/#/taxon/7683"><em>Asparagaceae</em></a>
as seen in the nub browser:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-Ot4eGWB1YR4/VwPRmqyBBVI/AAAAAAAAEJ4/4kLPyjgcXBQCwnm82cZJ0UTb83ik_hFUA/s1600/Asparagaceae.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="261" src="https://1.bp.blogspot.com/-Ot4eGWB1YR4/VwPRmqyBBVI/AAAAAAAAEJ4/4kLPyjgcXBQCwnm82cZJ0UTb83ik_hFUA/s320/Asparagaceae.png" width="320" /></a>
</div>
Red numbers next to names indicate taxa that have fewer occurrences
using the new backbone, while green numbers indicate an increase. This is
also visible in the tree maps of the children by occurrence count. The genus
<em>Campylandra</em> J.G. Baker, 1875 is dark red with zero occurrences because the
species in that genus were moved into the genus <em>Rhodea</em> in the latest
Catalogue of Life.<br />
<br />
Species <a href="http://mdoering.github.io/nub-browser/app/#/taxon/2768367"><em>Asparagus
asparagoides</em></a> as seen in the nub browser:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-_mv67R2iyZA/VwPRse48jHI/AAAAAAAAEJ8/58Cu_6fY3kogM4Thp6JrpTfQtL9trhyXA/s1600/Asparagus_asparagoides.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="171" src="https://3.bp.blogspot.com/-_mv67R2iyZA/VwPRse48jHI/AAAAAAAAEJ8/58Cu_6fY3kogM4Thp6JrpTfQtL9trhyXA/s320/Asparagus_asparagoides.png" width="320" /></a>
</div>
The details view shows all synonyms, the basionym and also a list of
homonyms from the new backbone.<br />
<h4>
Sources</h4>
We manually curate a <a href="https://github.com/gbif/checklistbank/blob/master/checklistbank-nub/nub-sources.tsv">
list of priority-ordered checklist datasets</a> that we use to build the
taxonomy. Three datasets are treated in a slightly special way:<br />
<ol>
<li>
<a href="http://www.gbif-uat.org/dataset/daacce49-b206-469b-8dc2-2257719f3afa">
GBIF Backbone Patch</a>: a small dataset we manually curate at GBIF
to override any other list. We mainly use the dataset to add
missing names reported by users.
</li>
<li>
<a href="http://www.gbif-uat.org/dataset/7ddf754f-d193-4cc9-b351-99906754a03b">
Catalogue of Life</a>: The Catalogue of Life provides the entire
higher classification above families, with the exception of algae.
</li>
<li>
<a href="http://www.gbif-uat.org/dataset/7ea21580-4f06-469d-995b-3f713fdcc37c">
GBIF Algae Classification</a>: With the withdrawal of AlgaeBase, the
current Catalogue of Life lacks any algae taxonomy. To allow
other sources to at least provide genus and species names for algae
we have created a new dataset that just provides an algae
classification down to families. This classification fits right
into the empty phyla of the Catalogue of Life.
</li>
</ol>
The GBIF portal now also lists <a href="http://www.gbif-uat.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c/constituents">
the source datasets that contributed to the GBIF Backbone</a> and the
number of names that were used as primary references.<br />
<h4>
Other Improvements</h4>
As well as fixing the main issues listed above, there is another
frequently occurring situation that we have improved. Many occurrences
could not be matched to a backbone species because the name existed
multiple times as an accepted taxon. In the new backbone, only one version
of a name is ever considered to be accepted; all others are now flagged as
doubtful. That resolves many cases in which name ambiguity prevented a
species match. For example, there are many occurrences of
<em>Hyacinthoides hispanica</em> in Britain which only show up in the new
backbone (<a href="http://www.gbif.org/occurrence/795765755">old</a> /
<a href="http://www.gbif-uat.org/occurrence/795765755">new</a> occurrence,
<a href="http://api.gbif.org/v1/species/match?verbose=true&kingdom=plantae&name=Hyacinthoides%20hispanica">
old</a> / <a href="http://api.gbif-uat.org/v1/species/match?verbose=true&kingdom=plantae&name=Hyacinthoides%20hispanica">
new</a> match). This is best seen in the <a href="http://mdoering.github.io/nub-browser/app/#/taxon/5304257">map comparison
of the nub browser</a>, try to swipe the map!<br />
<h4>
Known problems</h4>
We are aware of some problems with the new backbone which we would like to
address in the <a href="http://dev.gbif.org/issues/browse/POR-3029">next
stage</a>. We consider two of these issues candidates for blocking the
release of the new backbone:<br />
<h5>
Species matching
service ignores authorship</h5>
Because we now keep different authors apart more strictly, the backbone
contains many more species names that differ only in their authorship. The
current algorithm keeps just one of these names, from the most trusted
source (e.g. CoL), as the accepted name and treats the others as doubtful
if they are not already treated as synonyms.<br />
The problem currently is that the species matching service we use to
align occurrences to the backbone does <a href="http://dev.gbif.org/issues/browse/POR-2768">not deal with authorship</a>.
Therefore we have some cases where occurrences are attached to a doubtful
name or even split across some of the “homonyms”.<br />
There are 166,832 species names with different authorship
in the new backbone, accounting for 98,977,961 occurrences.<br />
<h5>
Too eager basionym merging</h5>
The same epithet is sometimes used by the same author for different
names in the same family. This currently leads to an <a href="http://dev.gbif.org/issues/browse/POR-2989">overly eager basionym
grouping</a> with fewer accepted names.<br />
As these names are still in the backbone and occurrences can still be matched
to them, this is not currently considered a blocker.<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com25tag:blogger.com,1999:blog-2326624813533383062.post-39280565035711513612016-02-25T21:22:00.000+01:002016-02-25T21:23:42.557+01:00Reprojecting coordinates according to their geodetic datum<!DOCTYPE html>
<html>
<body>
<p>For a long time Darwin Core has had a term for declaring the exact geodetic datum of a given coordinate.
Quite a few data publishers in GBIF have used <a href="http://rs.tdwg.org/dwc/terms/index.htm#geodeticDatum">dwc:geodeticDatum</a> for some time to publish the datum of their location coordinates.</p>
<p>Until now GBIF has treated all coordinates as if they were in <a href="http://en.wikipedia.org/wiki/World_Geodetic_System">WGS84</a>, the widespread global standard datum used by the Global Positioning System (GPS). Accordingly, locations given in a different datum, for example NAD27 or AGD66, were slightly displaced on GBIF maps. This so-called “datum shift” is not dramatic, but it can amount to a few hundred metres depending on the location and datum. The University of Colorado has a nice <a href="http://www.colorado.edu/geography/gcraft/notes/datum/datum_f.html">visualization of the impact</a>.</p>
<p>At GBIF we now interpret the geodeticDatum and reproject all coordinates as well as we can into the single datum WGS84. This involves two main steps: parsing and interpreting the given verbatim geodetic datum, and then performing the actual transformation based on the known geodetic parameters.</p>
<h4 id="parsinggeodeticdatum">Parsing geodeticDatum</h4>
<p>As usual GBIF receives a lot of noise when reading dwc:geodeticDatum. After removing the obvious bad values, e.g. those introduced by bad mappings done by the publisher, we still ended up with over 300 different values for the datum. Most commonly simple names or abbreviations are used, e.g. NAD27, WGS72, ED50, TOKYO. In some cases we also see proper <a href="http://www.epsg.org/">EPSG</a> codes coming in, e.g. EPSG:4326, which is the EPSG code for WGS84. As EPSG is a widespread and complete reference dataset of geodetic parameters, supported by many Java libraries, we decided to add a new <a href="https://github.com/gbif/parsers/blob/master/src/main/java/org/gbif/common/parsers/geospatial/DatumParser.java">DatumParser</a> to our parser library that directly returns EPSG integer codes for datum values. That way we can look up geodetic parameters easily in the subsequent transformation step. In addition to parsing any given EPSG:xyz code directly, it also understands most datums found in the GBIF network, based on a simple <a href="https://github.com/gbif/parsers/blob/master/src/main/resources/dictionaries/parse/datum.txt">dictionary file</a> which we manually curate.</p>
<p>Even though EPSG codes are well maintained, very complete and supported by most software, opaque integer codes are adopted less readily than meaningful short names. Perhaps a lesson to keep in mind when debating identifiers elsewhere.</p>
<p>Our recommendation to publishers is to use EPSG codes if you know them; otherwise stick to the simple, well-known names. A good place to search for EPSG codes is <a href="http://epsg.io/">http://epsg.io/</a>.</p>
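As an illustration of the parsing step, a heavily simplified, hypothetical stand-in for the DatumParser might look like this (the dictionary entries and normalisation are illustrative, not GBIF's actual implementation):

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleDatumParser {
  // tiny curated dictionary mapping well-known datum names to EPSG codes
  private static final Map<String, Integer> DICTIONARY = new HashMap<>();
  static {
    DICTIONARY.put("WGS84", 4326);
    DICTIONARY.put("WGS1984", 4326);
    DICTIONARY.put("NAD27", 4267);
    DICTIONARY.put("NAD83", 4269);
    DICTIONARY.put("ED50", 4230);
    DICTIONARY.put("TOKYO", 4301);
  }
  private static final Pattern EPSG = Pattern.compile("EPSG:\\s*(\\d+)");

  /** Returns the EPSG code for a verbatim datum string, or null if unknown. */
  static Integer parse(String verbatim) {
    if (verbatim == null) return null;
    String norm = verbatim.trim().toUpperCase(Locale.ROOT);
    Matcher m = EPSG.matcher(norm);
    if (m.matches()) return Integer.valueOf(m.group(1));
    // strip spaces and punctuation so "WGS 84" matches the "WGS84" entry
    return DICTIONARY.get(norm.replaceAll("[^A-Z0-9]", ""));
  }

  public static void main(String[] args) {
    System.out.println(parse("EPSG:4326")); // 4326
    System.out.println(parse("nad27"));     // 4267
  }
}
```

The real parser follows the same pattern: pass EPSG codes straight through, normalise everything else and look it up in a manually curated dictionary.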
<h4 id="transformation">Transformation</h4>
<p>Once we have a decimal coordinate and a well-known geodetic source datum, the transformation itself is rather straightforward. We use <a href="http://www.geotools.org/">geotools</a> to do the work. The first step is to instantiate a CoordinateReferenceSystem (CRS) using the parsed EPSG code of the geodeticDatum. A CRS combines a datum with a coordinate system; in our case this is always a two-dimensional system with the prime meridian at Greenwich, longitude values increasing east and latitude values north.</p>
<p>As EPSG codes can refer to either a plain datum or a complete spatial reference system, we need to take this into account when building the CRS:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"><code> private CoordinateReferenceSystem parseCRS(String datum) {
CoordinateReferenceSystem crs = null;
// the GBIF DatumParser in use
ParseResult&lt;Integer&gt; epsgCode = PARSER.parse(datum);
if (epsgCode.isSuccessful()) {
final String code = "EPSG:" + epsgCode.getPayload();
// first try to create a full fledged CRS from the given code
try {
crs = CRS.decode(code);
} catch (FactoryException e) {
// that didn't work, maybe it is *just* a datum
try {
GeodeticDatum dat = DATUM_FACTORY.createGeodeticDatum(code);
// build a CRS using the standard 2-dim Greenwich coordinate system
crs = new DefaultGeographicCRS(dat, DefaultEllipsoidalCS.GEODETIC_2D);
} catch (FactoryException e1) {
// also not a datum, no further ideas, log error
LOG.info("No CRS or DATUM for given datum code >>{}<<: {}", datum, e1.getMessage());
}
}
}
return crs;
}
</code></pre>
<p>Once we have a CRS instance we can create a specific WGS84 transformation and apply it to our coordinate:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"><code class="java">public ParseResult&lt;LatLng&gt; reproject(double lat, double lon, String datum) {
CoordinateReferenceSystem crs = parseCRS(datum);
MathTransform transform = CRS.findMathTransform(crs, DefaultGeographicCRS.WGS84, true);
// different CRS may swap the x/y axis for lat lon, so check first:
double[] srcPt;
double[] dstPt = new double[3];
if (CRS.getAxisOrder(crs) == CRS.AxisOrder.NORTH_EAST) {
// lat lon
srcPt = new double[] {lat, lon, 0};
} else {
// lon lat
srcPt = new double[] {lon, lat, 0};
}
transform.transform(srcPt, 0, dstPt, 0, 1);
return ParseResult.success(ParseResult.CONFIDENCE.DEFINITE, new LatLng(dstPt[1], dstPt[0]), issues);
}
</code></pre>
<p>The actual <a href="https://github.com/gbif/occurrence/blob/master/occurrence-processor/src/main/java/org/gbif/occurrence/processor/interpreting/util/Wgs84Projection.java#L61">projection code</a> does a bit more null and exception handling, which I have removed here for simplicity.</p>
<p>As you can see above we also have to watch out for spatial reference systems that use a different axis ordering. Luckily geotools knows all about that and provides a very simple way to test for it. </p>
<h4 id="issueflags">Issue flags</h4>
<p>As with most of our processing, we flag records when problems occur or assumptions are made. For the geodetic datum processing we keep track of 5 distinct issues, which are available as <a href="http://www.gbif-uat.org/occurrence/search?ISSUE=COORDINATE_REPROJECTION_FAILED&ISSUE=GEODETIC_DATUM_INVALID&ISSUE=COORDINATE_REPROJECTION_SUSPICIOUS&ISSUE=GEODETIC_DATUM_ASSUMED_WGS84&ISSUE=COORDINATE_REPROJECTED">GBIF portal occurrence search filters</a>:</p>
<ul>
<li>COORDINATE_REPROJECTION_FAILED: A CRS was instantiated, but the transformation failed for some reason.</li>
<li>GEODETIC_DATUM_INVALID: The datum parser was unable to return an EPSG code for the given datum string.</li>
<li>COORDINATE_REPROJECTION_SUSPICIOUS: The reprojection resulted in a datum shift larger than 0.1 degrees.</li>
<li>GEODETIC_DATUM_ASSUMED_WGS84: No datum was given or the given datum was not understood. In that case the original coordinates remain untouched.</li>
<li>COORDINATE_REPROJECTED: The coordinate was successfully transformed and differs now from the verbatim one given.</li>
</ul>
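A minimal sketch of how such flags could be derived from the verbatim and reprojected coordinates (illustrative only; the real processing covers more cases, but the 0.1-degree threshold is the one described above):

```java
import java.util.ArrayList;
import java.util.List;

class ReprojectionFlags {
  static final double SUSPICIOUS_SHIFT_DEGREES = 0.1;

  /** Derives issue flags by comparing verbatim and reprojected coordinates. */
  static List<String> flag(double lat, double lon, double newLat, double newLon) {
    List<String> issues = new ArrayList<>();
    double shift = Math.max(Math.abs(lat - newLat), Math.abs(lon - newLon));
    if (shift > 0) {
      issues.add("COORDINATE_REPROJECTED");
    }
    if (shift > SUSPICIOUS_SHIFT_DEGREES) {
      issues.add("COORDINATE_REPROJECTION_SUSPICIOUS");
    }
    return issues;
  }

  public static void main(String[] args) {
    // a typical datum shift of a few tens of metres: reprojected, not suspicious
    System.out.println(flag(38.8977, -77.0365, 38.8978, -77.0362));
  }
}
```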
</body>
</html><div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-9091424326316397452015-06-11T17:06:00.000+02:002015-06-12T09:29:40.096+02:00Simplified Downloads<div style="text-align: justify;">
Since its re-launch in 2013 <a href="http://www.gbif.org/" target="_blank">gbif.org</a> has supported downloading occurrence data matching an arbitrary query, with the download provided as a <a href="http://rs.tdwg.org/dwc/" target="_blank">Darwin Core Archive</a> file whose internal content is described <a href="http://www.gbif.org/faq/datause" target="_blank">here</a>. This format contains comprehensive and self-explanatory information, which makes it suitable for referencing in external resources. However, for people who only need the occurrence data in its simplest form, the <a href="http://rs.tdwg.org/dwc/" target="_blank">DwC-A</a> format presents additional complexity that can make the data hard to use. For that reason we now support a new download format: a zip file containing a single tab-delimited file with the most commonly used fields/terms. This makes it much easier to import the data into tools such as Microsoft Excel, geographic information systems and relational databases. The download functionality has been extended to allow selection of the desired format:</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-AbAeglZJSro/VXjLYFqv1WI/AAAAAAAAAZE/xdbKFBeSkzI/s1600/Screen%2BShot%2B2015-06-10%2Bat%2B18.19.42.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-AbAeglZJSro/VXjLYFqv1WI/AAAAAAAAAZE/xdbKFBeSkzI/s320/Screen%2BShot%2B2015-06-10%2Bat%2B18.19.42.png" /></a></div>
<div style="text-align: justify;">
From this point the functionality remains the same: eventually you will receive an email containing a hyperlink where the file can be downloaded.</div>
<h2>
Technical Architecture</h2>
The simplified download format was implemented under the technical requirement that further formats can be added in the near future with minimal impact on the formats already supported. In general, occurrence downloads are implemented using two different sets of technologies depending on the estimated size of the download in number of records: downloads below a threshold of 200,000 records are considered small and those above it big, and history shows the vast majority of downloads are “small”. The following chart summarizes the key technologies that enable occurrence downloads:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-csKM37rv3TI/VXjM_45ohoI/AAAAAAAAAZQ/5ILGlNPlSiY/s1600/Screen%2BShot%2B2015-06-11%2Bat%2B01.48.22.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-csKM37rv3TI/VXjM_45ohoI/AAAAAAAAAZQ/5ILGlNPlSiY/s320/Screen%2BShot%2B2015-06-11%2Bat%2B01.48.22.png" /></a></div>
<h2>
Download workflow</h2>
Occurrence downloads are automated using a workflow engine called <a href="http://oozie.apache.org/" target="_blank">Oozie</a>, which coordinates the required steps to produce a single download file. In summary, the workflow proceeds as follows: <br />
<ol>
<li>Initially, <a href="http://lucene.apache.org/solr/" target="_blank">Apache Solr</a> is contacted to determine the number of records that the download file will contain.</li>
<li>Big or small?</li>
<ol>
<li> If the amount of records is less than 200,000 (a small download), <a href="http://lucene.apache.org/solr/" target="_blank">Apache Solr</a> is queried to iterate over the results; the detail of each occurrence record is fetched from <a href="http://hbase.apache.org/" target="_blank">HBase</a> since it’s the official storage of occurrence records. Individual downloads are produced by a multi-threaded application implemented using the <a href="http://akka.io/" target="_blank">Akka</a> framework; the Apache <a href="https://zookeeper.apache.org/" target="_blank">Zookeeper</a> and <a href="http://curator.apache.org/" target="_blank">Curator</a> frameworks are used to limit the number of threads that can run at the same time (this avoids a thread explosion on the machines that run the download workflow).</li>
<li>If the amount of records is greater than 200,000 (a big download), <a href="https://hive.apache.org/" target="_blank">Apache Hive</a> is used to retrieve the occurrence data from an <a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html" target="_blank">HDFS</a> table. To avoid overloading <a href="http://hbase.apache.org/" target="_blank">HBase</a>, we create that HDFS table as a daily snapshot of the occurrence data stored in <a href="http://hbase.apache.org/" target="_blank">HBase</a>.</li>
</ol>
<li>Finally the occurrence records are collected and organized in the requested output format (DwC-A or Simple).</li>
</ol>
Note: the details of the implementation can be found in the GitHub project: <a href="https://github.com/gbif/occurrence/tree/master/occurrence-download">https://github.com/gbif/occurrence/tree/master/occurrence-download</a>.<br />
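For reference, choosing one format or the other from a client boils down to a single field in the download request body. A hedged sketch (the SIMPLE_CSV/DWCA format names and the /v1/occurrence/download/request endpoint reflect our understanding of the public GBIF API; authentication and HTTP plumbing are omitted):

```java
class DownloadRequestDemo {
  /** Builds the JSON body for an occurrence download request. */
  static String requestBody(String creator, String format, String predicateJson) {
    return "{"
        + "\"creator\":\"" + creator + "\","
        + "\"format\":\"" + format + "\","
        + "\"predicate\":" + predicateJson
        + "}";
  }

  public static void main(String[] args) {
    // illustrative predicate: all Danish occurrences
    String predicate =
        "{\"type\":\"equals\",\"key\":\"COUNTRY\",\"value\":\"DK\"}";
    // POST this body to /v1/occurrence/download/request with HTTP Basic auth
    System.out.println(requestBody("someUser", "SIMPLE_CSV", predicate));
  }
}
```

Swapping "SIMPLE_CSV" for "DWCA" would request the classic Darwin Core Archive instead.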
<br />
<h2>
Conclusion</h2>
<div>
Reducing both the number of columns and the size (number of bytes) in our downloads has been one of our most requested features, and we hope this makes using the GBIF data easier for everyone.</div>
<br />
<br /><div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Fede Méndezhttp://www.blogger.com/profile/11707904250426427540noreply@blogger.com3tag:blogger.com,1999:blog-2326624813533383062.post-68988103071582104572015-05-29T16:34:00.000+02:002015-08-25T16:01:13.198+02:00Don't fill your HDFS disks (upgrading to CDH 5.4.2)Just a short post on the dangers of filling your HDFS disks. It's a warning you'll hear at conferences and in best practices blog posts like this one, but usually with only a vague consequence of "bad things will happen". We upgraded from CDH 5.2.0 to CDH 5.4.2 this past weekend and learned the hard way: bad things will happen.<br />
<br />
<h4>
The Machine Configuration</h4>
<div>
The upgrade went fine in our dev cluster (which has almost no data in HDFS) so we weren't expecting problems in production. Our production cluster is of course slightly different from our (much smaller) dev cluster. In production we have 3 masters, where one holds the NameNode and another holds the SecondaryNameNode (we're not yet using a High Availability setup, but it's in the plan). We have 12 DataNodes, each with 13 disks dedicated to HDFS storage: 12 of 1TB and one of 512GB. They are formatted with 0% reserved blocks for root. The machines are evenly split into two racks.</div>
<div>
<br /></div>
<h4>
Pre Upgrade Status</h4>
<div>
We were at about 75% total HDFS usage with only a few percent difference between machines. We were configured to use Round Robin block placement (<span style="font-family: Courier New, Courier, monospace;">dfs.datanode.fsdataset.volume.choosing.policy</span>) with 10GB reserved for non-hdfs use (<span style="font-family: Courier New, Courier, monospace;">dfs.datanode.du.reserved</span>), which are the defaults in CDH manager. Each of the 1TB disks was around 700GB used (of 932GB usable), and the 512GB disks were all at their limit: 456GB used (of 466GB usable). That left only the configured 10GB free for non-hdfs use on the small disks. Our disks are mounted in the pattern /mnt/disk_a, /mnt/disk_b and so on, with /mnt/disk_m as the small disk. We're using the free version of CDHM so we can't do rolling upgrades, meaning this upgrade would bring everything down. And because our cluster is getting full (> 80% usage is another rumoured "bad things" threshold) we have reduced one class of data (users' <a href="http://www.gbif.org/occurrence/search" target="_blank">occurrence downloads</a>) to a replication factor of 2 (from the default of 3). This is considered somewhere between naughty and criminal, and you'll see why below.</div>
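Why mixed disk sizes hurt under round-robin placement is easy to see with a small simulation (illustrative only, not the actual HDFS policy code; the capacities mirror our twelve 932GB plus one 466GB usable layout):

```java
class RoundRobinFill {
  /** Distributes 1GB blocks round-robin over disks, skipping full ones. */
  static long[] fill(long[] capacityGb, long totalGb) {
    long[] used = new long[capacityGb.length];
    long remaining = totalGb;
    while (remaining > 0) {
      boolean placed = false;
      for (int i = 0; i < capacityGb.length && remaining > 0; i++) {
        if (used[i] < capacityGb[i]) {
          used[i]++;
          remaining--;
          placed = true;
        }
      }
      if (!placed) break; // every disk is full
    }
    return used;
  }

  public static void main(String[] args) {
    long[] caps = new long[13];
    for (int i = 0; i < 12; i++) caps[i] = 932; // twelve 1TB disks (usable GB)
    caps[12] = 466;                             // one 512GB disk
    long[] used = fill(caps, 8856);             // roughly 75% overall usage
    // the small disk is completely full while the large ones sit around 75%
    System.out.println(used[12] + "/" + caps[12] + " vs " + used[0] + "/" + caps[0]);
  }
}
```

Because round-robin hands each disk the same number of blocks, the smallest disk hits its ceiling long before the cluster as a whole is full, which is exactly the state we were in before the upgrade.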
<div>
<br /></div>
<h4>
Upgrade Time</h4>
<div>
We followed the <a href="http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_earlier_cdh5_upgrade.html" target="_blank">recommended procedure</a> and did the Oozie, Hive, and CDH manager backups, downloaded the latest parcels, and pressed the big Update button. Everything appeared to be going fine until HDFS tried to start up again, when it took a really long time (several minutes, after which the CDHM upgrade process finally gave up saying the DataNodes weren't making contact). Looking at the DataNode logs we see that it was performing a "Block Pool Upgrade", which took between 90 and 120 seconds for each of our ~700GB disks. Here's an excerpt of where it worked without problems:</div>
<div>
<br /></div>
<div>
<!--?xml version="1.0" encoding="UTF-8" standalone="no"?-->
<br />
<div>
<span style="font-size: 11px;"><span style="font-family: Courier New, Courier, monospace;">2015-05-23 20:18:53,715 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /mnt/disk_a/dfs/dn/in_use.lock acquired by nodename <a href="mailto:27117@c4n1.gbif.org">27117@c4n1.gbif.org</a><br />2015-05-23 20:18:53,811 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-2033573672-130.226.238.178-1367832131535<br />2015-05-23 20:18:53,811 INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled for /mnt/disk_a/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535<br />2015-05-23 20:18:53,823 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrading block pool storage directory /mnt/disk_a/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535.<br /> old LV = -56; old CTime = 1416737045694.<br /> new LV = -56; new CTime = 1432405112136<br />2015-05-23 20:20:33,565 INFO org.apache.hadoop.hdfs.server.common.Storage: HardLinkStats: 59768 Directories, including 53157 Empty Directories, 0 single Link operations, 6611 multi-Link operations, linking 22536 files, total 22536 linkable files. Also physically copied 0 other files.</span></span></div>
<div>
<span style="font-size: 11px;"><span style="font-family: Courier New, Courier, monospace;">2015-05-23 20:20:33,609 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrade of block pool BP-2033573672-130.226.238.178-1367832131535 at /mnt/disk_a/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535 is complete</span></span></div>
</div>
<div>
<br /></div>
<div>
That upgrade time happens sequentially for each disk, so even though the machines were upgrading in parallel, we were still looking at ~30 minutes of downtime for the whole cluster. As if that wasn't sufficiently worrying, we then finally got to disk_m, our nearly full 512GB disk:</div>
<div>
<br /></div>
<div>
<!--?xml version="1.0" encoding="UTF-8" standalone="no"?-->
<br />
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"><span style="font-stretch: normal;">2015-05-23 20:53:05,814 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /mnt/disk_m/</span><span style="font-stretch: normal;">dfs/dn/in_use.lock acquired by nodename <a href="mailto:12424@c4n1.gbif.org">12424@c4n1.gbif.org</a><br />2015-05-23 20:53:05,869 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-2033573672-130.226.238.178-1367832131535<br />2015-05-23 20:53:05,870 INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled for /mnt/disk_m/</span><span style="font-stretch: normal;">dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535<br />2015-05-23 20:53:05,886 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrading block pool storage directory /mnt/disk_m/</span><span style="font-stretch: normal;">dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535.<br /> old LV = -56; old CTime = 1416737045694.<br /> new LV = -56; new CTime = 1432405112136<br />2015-05-23 20:54:12,469 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to analyze storage directories for block pool BP-2033573672-130.226.238.178-1367832131535<br />java.io.IOException: Cannot create directory /mnt/disk_m/</span>dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535/current/finalized/subdir91/subdir168<br /> at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(DataStorage.java:1259)<br /> at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(DataStorage.java:1296)<br /> at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(DataStorage.java:1296)<br /> at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocks(DataStorage.java:1023)<br /> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.linkAllBlocks(BlockPoolSliceStorage.java:647)<br /> at 
org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doUpgrade(BlockPoolSliceStorage.java:456)<br /> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:390)<br /> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:171)<br /> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:214)<br /> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:242)<br /> at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:396)<br /> at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)<br /> at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1397)<br /> at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1362)<br /> at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)<br /> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:227)<br /> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:839)<br /> at java.lang.Thread.run(Thread.java:745)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">2015-05-23 20:54:12,476 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage for block pool: BP-2033573672-130.226.238.178-1367832131535 : Cannot create directory /mnt/disk_m/<span style="font-stretch: normal;">dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535/current/finalized/subdir91/subdir168</span></span></div>
</div>
<div>
<br /></div>
<div>
The somewhat misleading "Cannot create directory" is not a file permission problem but rather a disk full problem. During this block pool upgrade some temporary space is needed for rewriting metadata, and that space is apparently more than the 10GB that was available to "non-HDFS" (which we've concluded means "not HDFS storage files, but everything else is fair game"). Because <i>some</i> space is available to start the upgrade, it begins, but when it exhausts the disk it fails, and <b>This Kills The DataNode</b>. It does clean up after itself, but prevents the DataNode from starting, meaning our cluster was on its knees and in no danger of standing up.</div>
<div>
<br /></div>
<div>
So the problem was lack of free space, which on 10 of our 12 machines we were able to solve by wiping temporary files from the co-located YARN directory. Those 10 machines were then able to upgrade their disk_m and started up. We still had two nodes down, and unfortunately they were in different racks, which meant a big pile of our replication-factor-2 files were missing blocks (the default HDFS block placement policy puts the second and subsequent copies on a different rack from the first copy).</div>
<div>
<br /></div>
<div>
While digging around in the different properties that we thought could affect our disks and HDFS behaviour we were also restarting the failing DataNodes regularly. At some point the log message changed to:</div>
<div>
<br /></div>
<div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">WARN org.apache.hadoop.hdfs.server.common.Storage: java.io.FileNotFoundException: /mnt/disk_m/dfs/dn/in_use.lock (No space left on device)</span></div>
<div class="p1">
<br /></div>
<div class="p1">
After that message the DataNode started, but with disk_m marked as a failed volume. We're not sure why this happened, but presume that after one of our failures it didn't clean up its temp files on disk_m, and on subsequent restarts found the disk completely full, (rightly) considered it unusable and tried to carry on. With the final two DataNodes up we had almost all of our cluster, minus the two failed volumes. There were only 35 corrupted files (missing blocks) left after they came up. These were files set to replication factor 2 that by bad luck had both copies of some of their blocks on the failed disk_m volumes (one in rack1, one in rack2).</div>
<div class="p1">
<br /></div>
<div class="p1">
It would not have been the end of the world to just delete the corrupted user downloads (they were all over a year old) but on principle, it would not be The Right Thing To Do.</div>
<div class="p1">
<br /></div>
<h4>
On inodes and hardlinks</h4>
<div class="p1">
The normal directory structure of the dfs dir in a DataNode is /dfs/dn/current/&lt;blockpool name&gt;/current/finalized, and within finalized is a whole series of directories to fan out the various blocks that the volume contains. During the block pool upgrade a copy of 'finalized' is made, called previous.tmp. It's not a normal copy, however: it uses <a href="http://en.wikipedia.org/wiki/Hard_link" target="_blank">hardlinks</a> in order to avoid duplicating all of the data (which obviously wouldn't work). The copy is needed during the upgrade and is removed afterwards. Since our upgrade failed halfway through we had both directories and had no choice but to move the entire /dfs directory off of /disk_m to a temporary disk and complete the upgrade there. We first tried a copy (use cp -a to preserve hardlinks) to a mounted NFS share. The copy looked fine but on startup the DataNode didn't understand the mounted drive ("drive not formatted"). Then we tried copying to a USB drive plugged into the machine and that ultimately worked (despite feeling <a href="http://www.aosabook.org/en/hdfs.html" target="_blank">decidedly un-Yahoo</a>). Once the USB drive was upgraded and online in the cluster, replication took over and copied all of its blocks to new homes in rack2. We then unmounted the USB drive, wiped both /disk_m's and let replication balance out again. Final result: no lost blocks.</div>
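The hardlink trick is easy to see in miniature. This is an illustrative Python sketch (the file names are made up, not real HDFS paths), showing why previous.tmp costs almost no space: both names point at the same inode, so no block data is duplicated.

```python
import os
import tempfile

# A hardlink is a second directory entry for the same inode, which is how
# previous.tmp can "copy" a near-full block directory without duplicating
# any block data -- only directory entries are created.
with tempfile.TemporaryDirectory() as d:
    block = os.path.join(d, "blk_1073741825")  # a pretend HDFS block file
    with open(block, "wb") as f:
        f.write(b"\x00" * 1024)

    link = os.path.join(d, "previous.tmp_blk")  # the "upgrade copy"
    os.link(block, link)                        # hardlink, not a data copy

    a, b = os.stat(block), os.stat(link)
    print(a.st_ino == b.st_ino)  # True: both names point at one inode
    print(a.st_nlink)            # 2: the inode now has two names
```

This is also why a plain copy to NFS failed for us: the hardlink relationship (and the exact on-disk layout the DataNode expects) has to survive the move, which is what cp -a is for.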
<div class="p1">
<br /></div>
<h4>
Mitigation</h4>
<div class="p1">
With the cluster happy again we made a few changes to hopefully ensure this doesn't happen again:</div>
<div class="p1">
</div>
<ul>
<li><span style="font-family: Courier New, Courier, monospace;">dfs.datanode.du.reserved:25GB</span> this guarantees 25GB free on each volume (up from 10GB) and should be enough to allow a future upgrade to happen</li>
<li><span style="font-family: Courier New, Courier, monospace;">dfs.datanode.fsdataset.volume.choosing.policy:AvailableSpace </span></li>
<li><span style="font-family: Courier New, Courier, monospace;">dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction:1.0 </span>together these two direct new blocks to disks that have more free space, thereby leaving our now full /disk_m alone</li>
</ul>
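As a sanity check before the next upgrade, the headroom these settings are meant to preserve can be verified from the OS. A minimal Python sketch; the /mnt/disk_* mount points and the 25GB figure mirror our setup above, so adapt both to your own layout:

```python
import os
import shutil

# Mirrors dfs.datanode.du.reserved = 25GB from the mitigation list above.
RESERVED_BYTES = 25 * 1024 ** 3

def volumes_below_reserve(mounts, reserved=RESERVED_BYTES):
    """Return (mount, free_bytes) pairs for volumes whose free space is under the reserve."""
    risky = []
    for mount in mounts:
        if not os.path.exists(mount):
            continue  # this node doesn't have that volume; skip it
        free = shutil.disk_usage(mount).free
        if free < reserved:
            risky.append((mount, free))
    return risky

# The /mnt/disk_a .. /mnt/disk_m layout described above:
mounts = ["/mnt/disk_%s" % c for c in "abcdefghijklm"]
for mount, free in volumes_below_reserve(mounts):
    print("%s has only %.1f GB free" % (mount, free / 1024 ** 3))
```

Anything this flags would have hit exactly the "Cannot create directory" failure we saw on disk_m.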
<h4>
Conclusion</h4>
<div>
This was one small taste of what can go wrong when filling heterogeneous disks in an HDFS cluster. We're sure there are worse dangers lurking on the full-disk horizon, so hopefully you've learned from our pain and will give yourself some breathing room when things start to fill up. Also, don't use a replication factor of less than 3 if there's any way you can help it.</div>
<br />
<div class="p1">
<br /></div>
<div class="p1">
<br /></div>
</div>
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>Oliver Meynhttp://www.blogger.com/profile/04706642473308341930noreply@blogger.com1tag:blogger.com,1999:blog-2326624813533383062.post-79071782052625656962015-03-30T22:30:00.000+02:002015-03-31T18:29:52.967+02:00Improving the GBIF Backbone matchingIn GBIF, <a href="http://www.gbif.org/occurrence">occurrence records</a> are matched to a taxon in a <a href="http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c">backbone taxonomy</a> using the <a href="http://www.gbif.org/developer/species#searching">species match API</a>. This is important to reduce spelling variations and to create consistent metrics and searches according to a single classification and synonymy.<br />
<br />
Over the past years we have been alerted to <a href="http://dev.gbif.org/issues/issues?jql=labels%20%3D%20speciesmatch">various bad matches</a>. Most of the reported issues refer to a false fuzzy match for a name missing in our backbone.<br />
<br />
In order to improve the taxonomic classification of occurrence records, we are undertaking two activities. The first is to improve the algorithms we use to fuzzily match names, and the second will be to improve the algorithms used to assemble the backbone taxonomy itself. Here I explain some of the work currently underway to tackle the former, which is visible in the test environment.<br />
<h2 id="1name-parsing-of-undetermined-species">
1. Name parsing of undetermined species</h2>
In occurrence records we see many partly undetermined names such as <em>Lucanus spec.</em> These rank markers were erroneously treated as real species epithets, which together with fuzzy matching produced poor results.<br />
<strong><br /></strong>
<strong>Examples</strong><br />
<ul>
<li><a href="http://www.gbif.org/occurrence/164267402/verbatim"><em>Xysticus</em> sp.</a> used to wrongly match <em>Xysticus spiethi</em> while it now just matches the genus <em>Xysticus</em>.</li>
<li><a href="http://www.gbif.org/occurrence/1061576151/verbatim"><em>Triodia</em> sp.</a> used to match the family Poaceae while it now matches the genus <em>Triodia</em>.</li>
</ul>
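The idea can be sketched in a few lines. This is a toy illustration, not the actual GBIF name parser, and the marker list is a small made-up subset of what the real parser recognises:

```python
# A toy subset of rank markers indicating an undetermined species;
# the real GBIF name parser recognises many more forms.
MARKERS = {"sp", "spec", "cf", "aff"}

def strip_rank_marker(name):
    """Drop a trailing rank marker so the name matches at genus rank
    instead of being fuzzily matched as a species epithet."""
    parts = name.strip().split()
    if len(parts) > 1 and parts[-1].rstrip(".").lower() in MARKERS:
        return " ".join(parts[:-1])
    return name

print(strip_rank_marker("Xysticus sp."))   # Xysticus
print(strip_rank_marker("Lucanus spec."))  # Lucanus
print(strip_rank_marker("Zea mays"))       # Zea mays (a real epithet, untouched)
```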
<h2 id="2-dameraulevenshtein-distance-algorithm">
2. Damerau–Levenshtein distance algorithm</h2>
For scoring fuzzy matches we have so far applied the <a href="http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance">Jaro–Winkler distance</a>, which is often used for matching person names. It tends to allow rather fuzzy matches at the end of long strings. This is desirable for scientific names, but the allowed fuzziness was too great, and we decided to revert to the classical and more predictable <a href="http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance">Damerau–Levenshtein distance</a>. This reduces false positive fuzzy matches considerably, even though we lost a few good matches at the same time.<br />
<strong><br /></strong>
<strong>Examples</strong><br />
<ul>
<li><a href="http://www.gbif.org/occurrence/1037140379/verbatim"><em>Xyris kralii</em> Wand.</a> used to match to <em>Xyris harleyi</em> but now just matches to the genus <em>Xyris L.</em> as the species is missing from our backbone.</li>
<li><a href="http://www.gbif.org/occurrence/144904719/verbatim"><em>Zea mays</em> subsp. <em>parviglumis</em> var. <em>huehuet</em> Iltis & Doebley</a> used to match <em>Zea mays</em> var. <em>hirta</em> while it now just hits the species <em>Zea mays</em> L.</li>
</ul>
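For reference, the optimal string alignment variant of the Damerau–Levenshtein distance can be sketched in a few lines of Python. This is an illustration of the metric only, not the GBIF matcher itself, which uses the distance as one input to its match scoring:

```python
def damerau_levenshtein(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions, plus transpositions of adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # deleting i characters from a
    for j in range(len(b) + 1):
        d[0][j] = j  # inserting j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("hirta", "hitra"))     # 1: one adjacent swap
print(damerau_levenshtein("kitten", "sitting"))  # 3: the classic example
```

Counting an adjacent transposition as a single edit suits common typing slips in scientific names, while staying far more predictable than Jaro–Winkler's length-weighted similarity.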
<h3 id="matching-results">
Matching results</h3>
<div class="p1">
The distinct, verbatim classifications of 528 million occurrence records (10.5 million distinct classifications in total) were passed through both the original and the new fuzzy matching algorithms. The results show that 428 thousand classifications (4%), representing 5,323,758 occurrence records, produced a different match. So far we have taken a random subsample of the changed records and manually inspected the results; we can hardly spot any regressions or wrong matches.</div>
<div class="p2">
<br /></div>
<div class="p1">
We have published the complete matching comparison as well as the subset of changed records at <a href="https://zenodo.org/record/16491">Zenodo</a> as tab delimited files:</div>
<div class="p2">
<br /></div>
<div class="p1">
Dataset 1: <a href="https://zenodo.org/deposit/26044/file/?file_id=23bc2f5e-f883-410e-ae2d-bd718ccb2b40">All classification matches (10.5 million)</a></div>
<div class="p1">
Dataset 2: <a href="https://zenodo.org/deposit/26044/file/?file_id=bbed9d39-ecb5-44cc-949e-a9a6068dc166">Changed matches (428 thousand)</a></div>
<div class="p2">
<br /></div>
<div class="p1">
The files share a schema with three groups of columns, each containing the scientificName, the GBIF taxonKey and the higher DwC classification terms for every match record: verbatim terms are prefixed with v_, the old matching results carry an _old suffix, and the new matching results use the plain terms (e.g. v_scientificName, scientificName_old, scientificName).</div>
<div class="p2">
<br /></div>
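If you post-process the dumps, the header can be split mechanically by that convention. A small illustrative Python sketch (the column list here is a subset, not the full schema):

```python
def split_column_families(columns):
    """Group column names by the convention above: v_ prefix for verbatim
    terms, _old suffix for the old matching, plain names for the new."""
    families = {"verbatim": [], "old": [], "new": []}
    for col in columns:
        if col.startswith("v_"):
            families["verbatim"].append(col)
        elif col.endswith("_old"):
            families["old"].append(col)
        else:
            families["new"].append(col)
    return families

# A subset of the header, for illustration:
cols = ["v_scientificName", "scientificName_old", "scientificName",
        "v_kingdom", "taxonKey_old", "taxonKey"]
print(split_column_families(cols))
```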
<br />
<div class="p1">
We are glad to receive any feedback on further improvements or bad matching results we need to fix in the next iteration of work. Please get in touch with Markus Döring, <a href="mailto:mdoering@gbif.org"><span class="s1">mdoering@gbif.org</span></a>.</div>
<h3 id="appendix">
Appendix</h3>
<h2 id="create-distinct-occurrence-names-table">
Create distinct occurrence names table</h2>
<pre class="prettyprint"><code class=" hljs sql"><span class="hljs-operator"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> markus.<span class="hljs-keyword">names</span> <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-aggregate">count</span>(*) <span class="hljs-keyword">as</span> numocc, <span class="hljs-aggregate">count</span>(<span class="hljs-keyword">distinct</span> datasetKey) <span class="hljs-keyword">as</span> numdatasets, v_scientificName, v_kingdom, v_phylum, v_class, v_order_ <span class="hljs-keyword">as</span> v_order, v_family, v_genus, v_subgenus, v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification
<span class="hljs-keyword">FROM</span> prod_b.occurrence_hdfs
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> v_scientificName, v_kingdom, v_phylum, v_class, v_order_, v_family, v_genus, v_subgenus, v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> v_scientificName, numocc <span class="hljs-keyword">DESC</span></span></code></pre>
<h2 id="lookup-taxonkey-with-both-old-new-lookup">
Lookup taxonkey with both old & new lookup</h2>
<pre class="prettyprint"><code class=" hljs sql"><span class="hljs-operator"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> markus.name_matches <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span>
n.numocc,
n.numdatasets,
n.v_scientificName,
n.v_kingdom,
n.v_phylum,
n.v_class,
n.v_order,
n.v_family,
n.v_genus,
n.v_subgenus,
n.v_specificEpithet,
n.v_infraspecificEpithet,
n.v_scientificNameAuthorship,
n.v_taxonrank,
n.v_higherClassification,
prod.taxonKey <span class="hljs-keyword">as</span> taxonKey_old,
prod.scientificName <span class="hljs-keyword">as</span> scientificName_old,
prod.rank <span class="hljs-keyword">as</span> rank_old,
prod.status <span class="hljs-keyword">as</span> status_old,
prod.matchType <span class="hljs-keyword">as</span> matchType_old,
prod.confidence <span class="hljs-keyword">as</span> confidence_old,
prod.kingdomKey <span class="hljs-keyword">as</span> kingdomKey_old,
prod.phylumKey <span class="hljs-keyword">as</span> phylumKey_old,
prod.classKey <span class="hljs-keyword">as</span> classKey_old,
prod.orderKey <span class="hljs-keyword">as</span> orderKey_old,
prod.familyKey <span class="hljs-keyword">as</span> familyKey_old,
prod.genusKey <span class="hljs-keyword">as</span> genusKey_old,
prod.speciesKey <span class="hljs-keyword">as</span> speciesKey_old,
prod.kingdom <span class="hljs-keyword">as</span> kingdom_old,
prod.phylum <span class="hljs-keyword">as</span> phylum_old,
prod.class_ <span class="hljs-keyword">as</span> class_old,
prod.order_ <span class="hljs-keyword">as</span> order_old,
prod.family <span class="hljs-keyword">as</span> family_old,
prod.genus <span class="hljs-keyword">as</span> genus_old,
prod.species <span class="hljs-keyword">as</span> species_old,
uat.taxonKey <span class="hljs-keyword">as</span> taxonKey,
uat.scientificName <span class="hljs-keyword">as</span> scientificName,
uat.rank <span class="hljs-keyword">as</span> rank,
uat.status <span class="hljs-keyword">as</span> status,
uat.matchType <span class="hljs-keyword">as</span> matchType,
uat.confidence <span class="hljs-keyword">as</span> confidence,
uat.kingdomKey <span class="hljs-keyword">as</span> kingdomKey,
uat.phylumKey <span class="hljs-keyword">as</span> phylumKey,
uat.classKey <span class="hljs-keyword">as</span> classKey,
uat.orderKey <span class="hljs-keyword">as</span> orderKey,
uat.familyKey <span class="hljs-keyword">as</span> familyKey,
uat.genusKey <span class="hljs-keyword">as</span> genusKey,
uat.speciesKey <span class="hljs-keyword">as</span> speciesKey,
uat.kingdom <span class="hljs-keyword">as</span> kingdom,
uat.phylum <span class="hljs-keyword">as</span> phylum,
uat.class_ <span class="hljs-keyword">as</span> class_,
uat.order_ <span class="hljs-keyword">as</span> order_,
uat.family <span class="hljs-keyword">as</span> family,
uat.genus <span class="hljs-keyword">as</span> genus,
uat.species <span class="hljs-keyword">as</span> species
<span class="hljs-keyword">FROM</span> (
<span class="hljs-keyword">SELECT</span>
numocc,
numdatasets,
v_scientificName,
v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_subgenus,
v_specificEpithet,
v_infraspecificEpithet,
v_scientificNameAuthorship,
v_taxonrank,
v_higherClassification,
<span class="hljs-keyword">match</span>(<span class="hljs-string">'PROD'</span>, v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) prod,
<span class="hljs-keyword">match</span>(<span class="hljs-string">'UAT'</span>, v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) uat
<span class="hljs-keyword">FROM</span> markus.<span class="hljs-keyword">names</span>
) n;</span></code></pre>
<h2 id="hive-exports">
Hive exports</h2>
<pre class="prettyprint"><code class=" hljs sql"><span class="hljs-operator"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> markus.matches_changed
<span class="hljs-keyword">ROW</span> FORMAT DELIMITED FIELDS TERMINATED <span class="hljs-keyword">BY</span> <span class="hljs-string">'\t'</span> LINES TERMINATED <span class="hljs-keyword">BY</span> <span class="hljs-string">'\n'</span> <span class="hljs-keyword">NULL</span> DEFINED <span class="hljs-keyword">AS</span> <span class="hljs-string">''</span> <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">from</span> markus.name_matches
<span class="hljs-keyword">WHERE</span> taxonKey != taxonKey_old;</span></code></pre>
<pre class="prettyprint"><code class=" hljs sql"><span class="hljs-operator"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> markus.matches_all
<span class="hljs-keyword">ROW</span> FORMAT DELIMITED FIELDS TERMINATED <span class="hljs-keyword">BY</span> <span class="hljs-string">'\t'</span> LINES TERMINATED <span class="hljs-keyword">BY</span> <span class="hljs-string">'\n'</span> <span class="hljs-keyword">NULL</span> DEFINED <span class="hljs-keyword">AS</span> <span class="hljs-string">''</span> <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">from</span> markus.name_matches</span></code>;</pre>
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-39467163761760944712015-03-27T13:55:00.000+01:002017-01-24T16:50:07.905+01:00IPT v2.2 – Making data citable through DataCite<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="p1">
<span style="font-family: Times, Times New Roman, serif;">GBIF is pleased to release <a href="http://www.gbif.org/ipt"><span class="s1">IPT 2.2</span></a>, now capable of automatically connecting with either <a href="https://www.datacite.org/"><span class="s1">DataCite</span></a> or <a href="http://ezid.cdlib.org/" target="_blank">EZID</a> to assign DOIs to datasets. This new feature makes biodiversity data easier to access on the Web and facilitates tracking its re-use.</span><br />
<br />
<h3 style="text-align: left;">
<span style="font-family: Times, Times New Roman, serif;">DataCite integration explained</span></h3>
<span style="font-family: Times, 'Times New Roman', serif;">DataCite specialises in assigning DOIs to datasets. It was established in 2009 with three fundamental goals<span style="font-size: xx-small;">(1)</span>:</span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-TpjTdrwdPzw/VRG20469uPI/AAAAAAAAOgw/9e_MQulhE0I/s1600/datacite-logo-web.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-TpjTdrwdPzw/VRG20469uPI/AAAAAAAAOgw/9e_MQulhE0I/s1600/datacite-logo-web.png" /> </a> </div>
<ol class="ol1">
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">Establish easier access to research data on the Internet</span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">Increase acceptance of research data as citable contributions to the scholarly record</span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">Support research data archiving to permit results to be verified and re-purposed for future study</span></li>
</ol>
<div style="text-align: left;">
<span style="font-family: Times, Times New Roman, serif;">EZID is hosted by the <a href="http://www.cdlib.org/" target="_blank">California Digital Library</a> (a founding member of DataCite) and adds <a href="http://www.cdlib.org/uc3/ezid/" target="_blank">services</a> on top of the DataCite DOI infrastructure such as their own easy-to-use <a href="http://ezid.cdlib.org/doc/apidoc.html" target="_blank">programming interface</a>.</span></div>
<div style="text-align: left;">
<span style="font-family: Times, 'Times New Roman', serif;"><br /></span></div>
<div style="text-align: left;">
<span style="font-family: Times, 'Times New Roman', serif;">To integrate with DataCite and further these three goals for biodiversity data, IPT version 2.2 introduces the following new features:</span></div>
<ul class="ul1">
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">DOIs can be assigned to datasets thereby making them persistently resolvable </span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">A new DOI can be assigned to a dataset each time it undergoes scientifically significant changes, which is recommended best practice<span style="font-size: xx-small;">(1)</span> and part of the IPT's new <a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2Versioning"><span class="s1">versioning policy</span></a></span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">Citations can be automatically generated for datasets in a <a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2Citation"><span class="s1">standard format</span></a> which includes the DOI and dataset version number</span></li>
<div style="text-align: right;">
</div>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">A <a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes#Version_history"><span class="s1">version history</span></a> is kept for each dataset, allowing researchers to easily track changes and access/download all previous versions</span></li>
</ul>
<div class="p1">
<span style="font-family: Times, 'Times New Roman', serif;">To take advantage of these optional new features, there are two basic requirements: </span></div>
<ol class="ol1" style="text-align: left;">
<li><span style="font-family: Times, 'Times New Roman', serif;">The IPT must be configured with either a DataCite or EZID account. GBIF participants interested in a DataCite account should contact the <a href="mailto:helpdesk@gbif.org" target="_blank">GBIF Helpdesk</a> directly. General information about getting a DataCite account can be found </span><a href="https://www.datacite.org/join-datacite" style="font-family: Times, 'Times New Roman', serif;"><span class="s1">here</span></a><span style="font-family: Times, 'Times New Roman', serif;">; information about getting an EZID account can be found </span><a href="http://ezid.cdlib.org/home/pricing" style="font-family: Times, 'Times New Roman', serif;"><span class="s1">here</span></a><span style="font-family: Times, 'Times New Roman', serif;">. </span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">The IPT should be always on and accessible to ensure that assigned DOIs continue to be resolvable. </span></li>
</ol>
<div class="p1">
<span style="font-family: Times, Times New Roman, serif;">Once publishers make their data citable through DataCite they can expect the following benefits:</span></div>
<div style="text-align: left;">
</div>
<ul style="text-align: left;">
<li><span style="font-family: Times, Times New Roman, serif;">Their datasets will be globally discoverable through the <a href="http://search.datacite.org/ui"><span class="s1">DataCite Metadata Search tool</span></a> and the Thomson Reuters <a href="http://wokinfo.com/products_tools/multidisciplinary/dci/"><span class="s1">Data Citation Index</span></a> (part of the <a href="http://thomsonreuters.com/en/products-services/scholarly-scientific-research/scholarly-search-and-discovery/web-of-science.html" target="_blank">Web of Science</a>) thanks to a <a href="http://thomsonreuters.com/en/press-releases/2014/thomson-reuters-collaborates-with-datacite-to-expand-discovery-of-research-data.html"><span class="s1">collaboration</span></a> with DataCite formalised in 2014</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">They can find out exactly who cited their dataset via the <span class="s1"><a href="http://wokinfo.com/products_tools/multidisciplinary/dci/">Data Citation Index</a></span>, and better understand the impact their dataset has had within the scholarly research and policy making communities</span></li>
</ul>
<div class="p1">
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: Times, Times New Roman, serif;"></span></div>
<span style="font-family: Times, 'Times New Roman', serif;"><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-S4fWCWFb1UE/VRF-UHfx58I/AAAAAAAAOgU/fnSiBSuQWW0/s1600/IPTManageResourceMetadataBasicMetadata.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="http://2.bp.blogspot.com/-S4fWCWFb1UE/VRF-UHfx58I/AAAAAAAAOgU/fnSiBSuQWW0/s1600/IPTManageResourceMetadataBasicMetadata.png" height="204" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Sample basic metadata page, IPT 2.2</td></tr>
</tbody></table>
</span><br />
<h3>
<span style="font-family: Times, 'Times New Roman', serif;">Other new features</span></h3>
<br />
<span style="font-family: Times, 'Times New Roman', serif;">The IPT 2.2 also introduces a simple way of licensing datasets </span><span style="font-family: Times, 'Times New Roman', serif;">under one of three machine-readable waivers or licences: </span><a href="http://creativecommons.org/publicdomain/zero/1.0/legalcode" style="font-family: Times, 'Times New Roman', serif;"><span class="s1">CC0 v1.0</span></a><span style="font-family: Times, 'Times New Roman', serif;">, </span><a href="http://creativecommons.org/licenses/by/4.0/legalcode" style="font-family: Times, 'Times New Roman', serif;"><span class="s1">CC-BY v4.0</span></a><span style="font-family: Times, 'Times New Roman', serif;">, or </span><a href="http://creativecommons.org/licenses/by-nc/4.0/legalcode" style="font-family: Times, 'Times New Roman', serif;"><span class="s1">CC-BY-NC v4.0</span></a><span style="font-family: Times, 'Times New Roman', serif;">. These waivers or CC licenses are "something that the creators of works can understand, their users can understand, and even the Web itself can understand."<span style="font-size: xx-small;">(2) </span>You can read more about GBIF's new licensing policy </span><span class="s1" style="font-family: Times, 'Times New Roman', serif;"><a href="http://www.gbif.org/terms/licences" style="font-family: Times, 'Times New Roman', serif;">here</a>.</span></div>
<div class="p1">
<span style="font-family: Times, 'Times New Roman', serif;"><br /></span></div>
<ul class="ul1">
</ul>
<div class="p1">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-left: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-SUJDc6CGGS4/VRF0uj31mWI/AAAAAAAAOfY/bSC2IbbBE1U/s1600/IPTManageResourceOverview.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://1.bp.blogspot.com/-SUJDc6CGGS4/VRF0uj31mWI/AAAAAAAAOfY/bSC2IbbBE1U/s1600/IPTManageResourceOverview.png" height="316" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Sample resource overview page, IPT 2.2</td></tr>
</tbody></table>
<span style="font-family: Times, 'Times New Roman', serif;"><br /></span>
<span style="font-family: Times, 'Times New Roman', serif;">Whether an IPT is DOI-turbocharged or not, there are a number of other new benefits in this release:</span><br />
<ul class="ul1">
<li class="li1"><span style="font-family: Times, Times New Roman, serif;"><b>basisOfRecord validation</b> for occurrence datasets</span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">The ability to <b>preview source mappings</b> prior to publication</span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">The ability to <b>preview resource metadata</b> prior to publication</span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">A suite of new metadata fields such as <b>ORCIDs</b> for contacts</span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">An enhanced user interface including a new and <b>improved resource homepage</b></span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;"><b>Additional context help</b> to guide users, especially first-time users</span></li>
</ul>
<span style="font-family: Times, Times New Roman, serif;"><br /></span>
<br />
<h3 style="text-align: left;">
<span style="font-family: Times, Times New Roman, serif;">Acknowledgements</span><span style="font-family: Times, 'Times New Roman', serif;"> </span></h3>
<br />
<span style="font-family: Times, Times New Roman, serif;">Thanks to the hard work and dedication of the team of contributors, version 2.2 has been fully translated into French, Japanese, Portuguese, and Spanish. Because so many new features went into this version, the amount of text requiring translation was enormous. The following translators deserve a huge thanks, merci, arigato, </span><span style="font-family: Times, 'Times New Roman', serif;">obrigado, and </span><span style="font-family: Times, 'Times New Roman', serif;">gracias:</span></div>
<div style="text-align: left;">
</div>
<ul style="text-align: left;">
<li><span style="font-family: Times, Times New Roman, serif;">Sophie Pamerlon, Marie-Elise Lecoq (<a href="http://www.gbif.fr/"><span class="s1">GBIF France</span></a>) - Updating French translation</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">Yukiko Yamazaki (<a href="http://www.gbif.jp/" target="_blank">GBIF Japan (JBIF)</a>) - Updating Japanese translation</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">Allan Koch Veiga, Etienne Americo Cartolano, Daniel Lins, and Antonio Mauro Saraiva (<span class="s1"><a href="http://www.biocomp.org.br/" target="_blank">Universidade de São Paulo, Research Center on Biodiversity and Computing - BioComp</a></span>) - Updating Portuguese translation</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">Dairo Escobar, Nestor Beltran, and Daniel Amariles (<a href="http://www.sibcolombia.net/web/sib/home"><span class="s1">Colombian Biodiversity Information System (SiB Colombia)</span></a>) - Updating Spanish Translation</span></li>
</ul>
<span style="font-family: Times, 'Times New Roman', serif;">Lastly, a special thanks must go out to David Shorthouse from </span><a href="http://www.canadensys.net/" style="font-family: Times, 'Times New Roman', serif;"><span class="s1">Canadensys</span></a><span style="font-family: Times, 'Times New Roman', serif;"> for his guidance and help. Canadensys has been assigning DOIs to datasets it serves via its IPT since 2012, as described </span><a href="http://www.canadensys.net/2012/link-love-dois-for-darwin-core-archives" style="font-family: Times, 'Times New Roman', serif;"><span class="s1">here</span></a><span style="font-family: Times, 'Times New Roman', serif;">, and has provided invaluable assistance throughout development. </span><br />
<div class="p1">
<span style="font-family: Times, 'Times New Roman', serif;"><br /></span></div>
<div class="p1">
<span style="font-family: Times, 'Times New Roman', serif;">On behalf of the GBIF development team, I really hope you enjoy using this new version, and hope that you will be able to take advantage of all its exciting new features.</span><br />
<span style="font-family: Times, 'Times New Roman', serif;"><br /></span>
<br />
<h3 style="text-align: left;">
<span style="font-family: Times, Times New Roman, serif;">Footnotes</span></h3>
<div>
<ol style="text-align: left;">
<li><span style="font-family: Times, Times New Roman, serif;">http://schema.datacite.org/meta/kernel-3/doc/DataCite-MetadataKernel_v3.1.pdf</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">https://creativecommons.org/licenses/</span></li>
</ol>
</div>
</div>
</div>
<!-- Blogger automated replacement: "https://images-blogger-opensocial.googleusercontent.com/gadgets/proxy?url=http%3A%2F%2F3.bp.blogspot.com%2F-TpjTdrwdPzw%2FVRG20469uPI%2FAAAAAAAAOgw%2F9e_MQulhE0I%2Fs1600%2Fdatacite-logo-web.png&container=blogger&gadget=a&rewriteMime=image%2F*" with "https://3.bp.blogspot.com/-TpjTdrwdPzw/VRG20469uPI/AAAAAAAAOgw/9e_MQulhE0I/s1600/datacite-logo-web.png" --><div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/16423423909368777750noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-47612997473017365562014-11-26T11:41:00.000+01:002015-08-25T16:01:32.378+02:00Upgrading our cluster from CDH4 to CDH5A little over a year ago we wrote about <a href="http://gbif.blogspot.dk/2013/05/migrating-our-hadoop-cluster-from-cdh3.html" target="_blank">upgrading from CDH3 to CDH4</a> and now the time had come to upgrade from CDH4 to <a href="http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html" target="_blank">CDH5</a>. The short version: upgrading the cluster itself was easy, but getting our applications to work with the new classpaths, especially MapReduce v2 (YARN), was painful.<br />
<br />
<h3>
The Cluster</h3>
<div>
Our cluster has grown since the last upgrade (now 12 slaves and 3 masters), and we no longer had the luxury of splitting the machines to build a new cluster from scratch. So this was an in-place upgrade, using CDH Manager.</div>
<div>
<br /></div>
<h2>
Upgrade CDH Manager</h2>
<div>
The first step was upgrading to CDH Manager 5.2 (from our existing 4.8). The <a href="http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_ag_upgrade_cm4_to_cm5.html" target="_blank">Cloudera documentation</a> is excellent so I don't need to repeat it here. What we did find was that the management service now requests significantly more RAM for its monitoring services (minimum "happy" config of 14GB), to the point where our existing masters were overwhelmed. As a stopgap we've added a 4th old machine to the "masters" group, used exclusively for the management service. In the longer term we'll replace the 4 masters with 3 new machines that have enough resources. </div>
<div>
<br /></div>
<h2>
Upgrade Cluster Members</h2>
<div>
Again the <a href="http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_mc_upgrade_tocdh5_using_parcels.html" target="_blank">Cloudera documentation</a> is excellent but I'll just add a bit. The upgrade process will now ask if a Java JDK should be installed (an improvement over the old behaviour of just installing one anyway). That means we could finally remove the Oracle JDK 6 rpms that have been lying around on the machines. For some reason the Host Inspector still complains about OpenJDK 7 vs Oracle 7 but we've happily been running on OpenJDK 7 since early 2014, and so far so good with CDH5 as well. After the upgrade wizard finished we had to tweak memory settings throughout the cluster, including setting the "Memory Overcommit Validation Threshold" to 0.99, up from its (very conservative) default of 0.8. Cloudera has another nice blog post on <a href="http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/" target="_blank">figuring out memory settings for YARN</a>. Additionally Hue's configuration required some attention because after the upgrade it had forgotten where Zookeeper and the HBase Thrift server were. All in all quite painless.</div>
<div>
<br /></div>
<h3>
The Gotchas</h3>
<div>
Getting our software to work with CDH5 was definitely not painless. All of our problems stemmed from conflicting versions of jars, due either to changes in CDH dependencies or to changes in how a user classpath is given priority over that of YARN/HBase/Oozie. Additionally it took some time to wrap our heads around the new artifact packaging used by YARN and HBase. Note that we also use Maven for dependency management.</div>
<div>
<br /></div>
<b>Guava</b><br />
<div>
We're not alone in our suffering at the hands of mismatched Guava versions (e.g. <a href="https://issues.apache.org/jira/browse/HADOOP-10101" target="_blank">HADOOP-10101</a>, <a href="https://issues.apache.org/jira/browse/HDFS-7040" target="_blank">HDFS-7040</a>), but suffer we did. We resorted to specifying version 14.0.1 in any of our code that touches Hadoop and more importantly HBase, and exclude any higher version guavas from our dependencies. This meant downgrading some actual code that was using guava 15, but was the easiest path to getting a working system.</div>
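As an illustration of the pinning approach (a sketch, not our actual pom), a single Guava version can be forced across a Maven build with a <span style="font-family: Courier New, Courier, monospace;">dependencyManagement</span> entry:

```xml
<!-- Sketch: force Guava 14.0.1 for all transitive dependencies -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>14.0.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Managed versions win over whatever transitive dependencies declare, which avoids scattering per-dependency exclusions, though explicit exclusions (as linked below under "Example exclusions") make the intent more visible.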
<div>
<br /></div>
<b>Jackson</b><br />
<div>
We have many dependencies on Jackson 1.9 and 2+ throughout our code, so downgrading to match HBase's shipped 1.8.8 was not an option. It meant figuring out the classpath precedence rules described below, and solving the problems (like logging) that doing so introduced.</div>
<div>
<br /></div>
<b>Logging</b><br />
<div>
Logging in Java is a horrible mess, and with the number of intermingled projects required to make application software run on a Hadoop/HBase cluster it's no surprise that getting logging to work was brutal. We code to the SLF4J API and use Logback as our implementation of choice. The Hadoop world uses a mix of Java Commons Logging, java.util.logging, and log4j. We thought that meant we'd be clear if we used the same SLF4J API (1.7.5) and used the bridges (log4j-over-slf4j, jcl-over-slf4j, and jul-to-slf4j), which has worked for us up to now. <montage>Angry men smash things angrily over the course of days</montage> Turns out, there's a bug in the 1.7.5 implementation of log4j-over-slf4j, which blows up as we described over at <a href="https://issues.apache.org/jira/browse/YARN-2875" target="_blank">YARN-2875</a>. Short version - use 1.7.6+ in client code that attempts to use YARN and log4j-over-slf4j.</div>
<div>
<br /></div>
<div>
<b>YARN</b></div>
<div>
The crux of our problems was having our classpath loaded after the Hadoop classpath, meaning old versions of our dependencies were loaded first. The new, surprisingly hard to find parameter that tells YARN to load your classpath first is "<span style="font-family: Courier New, Courier, monospace;"><b>mapreduce.job.user.classpath.first</b></span>". YARN also quizzically claims that the parameter is deprecated, but it works for me.</div>
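For reference, the property is a plain job configuration setting, so it can go in the job's config XML or be set programmatically before submission (a sketch, adapt to your own job setup):

```xml
<!-- Sketch: tell YARN to put the user classpath ahead of Hadoop's -->
<property>
  <name>mapreduce.job.user.classpath.first</name>
  <value>true</value>
</property>
```

The equivalent in client code is a one-liner on the Hadoop <span style="font-family: Courier New, Courier, monospace;">Configuration</span> object before the job is submitted.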
<div>
<br /></div>
<div>
<b>Oozie</b></div>
<div>
Convincing Oozie to load our classpath involved another montage of angry faces. It uses the same parameter as YARN, but with a prefix, so what you want is "<b><span style="font-family: Courier New, Courier, monospace;">oozie.launcher.mapreduce.job.user.classpath.first</span></b>". We had been loading the old parameter "<span style="font-family: Courier New, Courier, monospace;"><b>mapreduce.task.classpath.user.precedence</b></span>" in each action in the workflow using the <span style="font-family: Courier New, Courier, monospace;"><job-xml></span><span style="font-family: inherit;"> tag to load the configs from a file called</span><span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> </span><span style="font-family: Courier New, Courier, monospace;">hive-default.xml</span><span style="font-family: inherit;">. We then encountered two problems: </span></div>
<div>
<ol>
<li><span style="font-family: inherit;">Note the name - we used </span><span style="font-family: Courier New, Courier, monospace;">hive-default.xml</span><span style="font-family: inherit;"> instead of </span><span style="font-family: Courier New, Courier, monospace;">hive-site.xml</span><span style="font-family: inherit;"> because of a bug in Oozie (discussed <a href="https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/RW5WmSTzbLo" target="_blank">here</a> and <a href="https://groups.google.com/a/cloudera.org/forum/#!msg/cdh-user/y66j12jb1ig/tODJGmJ2BawJ" target="_blank">here</a>). That was fixed in the CDH5.2 Oozie, but we didn't get the memo. Now the file is called </span><span style="font-family: Courier New, Courier, monospace;">hive-site.xml </span><span style="font-family: inherit;">and contains our specific configs and is again being picked up. BUT:</span></li>
<li>Adding <span style="font-family: Courier New, Courier, monospace; font-weight: bold;">oozie.launcher.mapreduce.job.user.classpath.first</span><span style="font-family: inherit;"> to <span style="font-family: 'Courier New', Courier, monospace;">hive-site.xml</span> doesn't work! As we wrote up in Oozie bug <a href="https://issues.apache.org/jira/browse/OOZIE-2066" target="_blank">OOZIE-2066</a> this parameter has to be specified for each action, at the action level, in the workflow.xml. Repeating the example workaround from the bug report:</span></li>
</ol>
</div>
<pre style="background-image: URL(http://2.bp.blogspot.com/_z5ltvMQPaa8/SjJXr_U2YBI/AAAAAAAAAAM/46OqEP32CJ8/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> <action name="run-test">
<java>
<job-tracker>c1n2.gbif.org:8032</job-tracker>
<name-node>hdfs://c1n1.gbif.org:8020</name-node>
<configuration>
<property>
<name>oozie.launcher.mapreduce.job.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>test.CPTest</main-class>
</java>
<ok to="end" />
<error to="kill" />
</action>
</code></pre>
<div>
<br />
<br />
<h3>
<b>New Packaging Woes</b></h3>
<br />
We build our jars using a combination of jar-with-dependencies and the shade plugin, but in both cases it means all our dependencies are built in. The problems come when a downstream, transitive dependency loads a different (typically older) version of one of the jars we've bundled in our main jar. This happens a lot with the Hadoop and HBase artifacts, especially when it comes to MR1 and logging.<br />
<br />
<b>Example exclusions</b><br />
<br />
hbase-server (needed to run MapReduce over HBase): <a href="https://github.com/gbif/datacube/blob/master/pom.xml#L268">https://github.com/gbif/datacube/blob/master/pom.xml#L268</a><br />
<br />
hbase-testing-util (needed to run mini clusters): <a href="https://github.com/gbif/datacube/blob/master/pom.xml#L302">https://github.com/gbif/datacube/blob/master/pom.xml#L302</a><br />
<br />
hbase-client: <a href="https://github.com/gbif/metrics/blob/master/pom.xml#L226">https://github.com/gbif/metrics/blob/master/pom.xml#L226</a><br />
<br />
hadoop-client (removing logging): <a href="https://github.com/gbif/metrics/blob/master/pom.xml#L327">https://github.com/gbif/metrics/blob/master/pom.xml#L327</a><br />
<br />
<br />
Beyond just sorting conflicting dependencies, we also encountered a problem that presented as "<span style="font-family: Courier New, Courier, monospace;">No FileSystem for scheme: file"</span>. It turns out we had projects bringing in both hadoop-common and hadoop-hdfs, and so we were getting only one of the META-INF/services files in the final jar. Thus we could not use the FileSystem to read local files (like jars for the class path) and also from HDFS. The fix was to include the <span style="font-family: Courier New, Courier, monospace;">org.apache.hadoop.fs.FileSystem</span> in our project explicitly: <a href="https://github.com/gbif/metrics/blob/master/cube/src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem">https://github.com/gbif/metrics/blob/master/cube/src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem</a><br />
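As an aside, if you build with the shade plugin, its <span style="font-family: Courier New, Courier, monospace;">ServicesResourceTransformer</span> merges the <span style="font-family: Courier New, Courier, monospace;">META-INF/services</span> entries from all dependencies automatically, which avoids maintaining the merged file by hand (a sketch of the plugin configuration, not what our linked project does):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <!-- Concatenate META-INF/services files instead of keeping only one -->
      <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
    </transformers>
  </configuration>
</plugin>
```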
<br />
Finally we had to stop the TableMapReduceUtil from bringing in its own dependent jars, which brought in yet more conflicting jars - this appears to be a change in the default behaviour, where dependent jars are now being brought in by default in the shorter versions of <span style="font-family: Courier New, Courier, monospace;">initTableMapper</span>:<br />
<a href="https://github.com/gbif/metrics/blob/master/cube/src/main/java/org/gbif/metrics/cube/occurrence/backfill/BackfillCallback.java#L37">https://github.com/gbif/metrics/blob/master/cube/src/main/java/org/gbif/metrics/cube/occurrence/backfill/BackfillCallback.java#L37</a><br />
<br />
<h3>
Conclusion</h3>
</div>
<div>
As you can see the client side of the upgrade was beset on all sides by the iniquities of jars, packaging and old dependencies. It seems strange that upgrading Guava is considered a no-no and a major breaking change by these projects, yet <a href="https://issues.apache.org/jira/browse/HBASE-9117" target="_blank">discussions about removing HTablePool</a> are proceeding apace and will definitely break many projects (including any of ours that touch HBase). While we're ultimately pleased that everything now works, and are looking forward to benefiting from the performance improvements and new features of CDH5, it wasn't a great trip. Hopefully our experience will help others migrate more smoothly.</div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Oliver Meynhttp://www.blogger.com/profile/04706642473308341930noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-15824824213077674082014-05-06T12:06:00.000+02:002014-05-06T12:06:21.109+02:00Multimedia in GBIFWe are happy to announce another long awaited improvement to the GBIF portal. Our portal test environment now shows multimedia items and their metadata associated with occurrences. As of today we have nearly <a href="http://www.gbif-uat.org/occurrence/search?MEDIA_TYPE=Sound&MEDIA_TYPE=StillImage&MEDIA_TYPE=MovingImage">700 thousand occurrences with multimedia</a> indexed. Based on the Dublin Core type vocabulary we distinguish between images, videos and sound files. As has been requested by many people the media type is available as a new filter in the occurrence search and subsequently in downloads. For example you can now easily <a href="http://www.gbif-uat.org/occurrence/search?TAXON_KEY=212&MEDIA_TYPE=Sound">find all audio recordings of birds</a>.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-hHrOPCfJ01k/U2ioucIaS8I/AAAAAAAAECQ/AlV3WCTrmjg/s1600/Screen+Shot+2014-05-06+at+11.17.21.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="http://2.bp.blogspot.com/-hHrOPCfJ01k/U2ioucIaS8I/AAAAAAAAECQ/AlV3WCTrmjg/s1600/Screen+Shot+2014-05-06+at+11.17.21.png" height="320" width="297" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">UAM:Mamm:11470 - Eumetopias jubatus - skull</td></tr>
</tbody></table>
If you follow the link to the <a href="http://www.gbif-uat.org/occurrence/779863593#media">details page</a> of any of those records you can see that sound files show up as simple links to the media file. We do the same for video files and currently do not have plans to embed any media player in our portal. This is different from images, which are shown in a dedicated gallery you might already have encountered on species pages. On the left you can see an example of a <a href="http://www.gbif-uat.org/occurrence/784732286">skull specimen with multiple images</a>.<br />
<span style="text-align: center;"><br /></span>
<span style="text-align: center;">When requested for the first time, GBIF transiently caches the original images and processes them into various standard sizes and formats suitable for the use in the portal.</span><br />
<br />
<br />
<h3>
Publishing multimedia metadata</h3>
GBIF indexes multimedia metadata published in different ways within the GBIF network. Whether it comes as a simple URL in an additional Darwin Core field, as multiple items expressed in ABCD XML, or through a dedicated multimedia extension in Darwin Core archives, the difference usually lies in the expressiveness of the metadata.<br />
<h4>
Simple Darwin Core</h4>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-FUnigeu6Ubs/U2Dg7LzaIJI/AAAAAAAAECA/OO2MLbIXWvw/s1600/Screen+Shot+2014-04-30+at+13.38.55.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="http://1.bp.blogspot.com/-FUnigeu6Ubs/U2Dg7LzaIJI/AAAAAAAAECA/OO2MLbIXWvw/s1600/Screen+Shot+2014-04-30+at+13.38.55.png" height="243" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Melocactus intortus record in iNaturalist</td></tr>
</tbody></table>
Whenever we spot the term <a href="http://rs.tdwg.org/dwc/terms/index.htm#associatedMedia">dwc:associatedMedia</a> in xml or Darwin Core archives as part of a simple, flat occurrence record we try to extract URLs to media items. As the term officially allows for concatenated lists of URLs we try common delimiters such as comma, semicolon or the pipe symbol. An example of multiple, concatenated image URLs can be found in <a href="http://www.gbif-uat.org/occurrence/891030819#images">iNaturalist</a>:<br />
<br />
As you can see on the right, every extracted link is regarded as a separate media item, as there is no standard way to detect that two links refer to the same item. In the example above every image has a link to the actual image file and another one to the respective html page where its metadata is presented. There is also no way to specify additional metadata about a link. As a consequence, images based on dwc:associatedMedia do not have a title, license or any further information. The verbatim data for that record, before we extract image links, can be seen here: <a href="http://www.gbif-uat.org/occurrence/891030819/verbatim">http://www.gbif-uat.org/occurrence/891030819/verbatim</a><br />
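The extraction step can be sketched roughly as follows (illustrative Python, not GBIF's actual indexing code; the function name is hypothetical, and the delimiters are the ones named above):

```python
import re

def extract_media_urls(associated_media):
    """Split a dwc:associatedMedia value on common delimiters (comma,
    semicolon, pipe) and keep only candidates that look like URLs."""
    if not associated_media:
        return []
    candidates = re.split(r"[,;|]", associated_media)
    return [c.strip() for c in candidates
            if c.strip().lower().startswith(("http://", "https://"))]
```

Note that any candidate that is not recognisably a URL is silently dropped, which mirrors why non-file links (such as html pages) carry no usable metadata here.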
<h4>
Darwin Core archive multimedia extension</h4>
By having a <a href="http://rs.gbif.org/extension/gbif/1.0/multimedia.xml">dedicated extension</a> for media items many media items per core occurrence record can be published in a structured way. This is the GBIF recommended way to publish multimedia as it gives you most control over your metadata. Note that the same extension can also be used to publish multimedia for species in <a href="http://www.gbif.org/dataset/search?type=CHECKLIST">checklist datasets</a>. This extension, based entirely on existing Dublin Core terms, allows you to specify the following information about a media item, all of which will make it into the GBIF portal if provided:<br />
<br />
<ul>
<li> <b>dc:type</b>, the kind of media item based on the DCMI Type Vocabulary: StillImage, MovingImage or Sound</li>
<li> <b>dc:format</b>, MIME type of the multimedia object's format </li>
<li> <b>dc:identifier</b>, the public URL that identifies and locates the media file directly, not the html page it might be shown on</li>
<li> <b>dc:references</b>, the URL of an html webpage that shows the media item or its metadata. It is recommended to provide this url even if a media file exists as it will be used for linking out</li>
<li> <b>dc:title</b>, the media item's title</li>
<li> <b>dc:description</b>, a textual description of the content of the media item</li>
<li> <b>dc:created</b>, the date and time this media item was taken</li>
<li> <b>dc:creator</b>, the person that took the image, recorded the video or sound</li>
<li> <b>dc:contributor</b>, any contributor in addition to the creator that helped in recording the media item</li>
<li> <b>dc:publisher</b>, the name of an entity responsible for making the image available</li>
<li> <b>dc:audience</b>, a class or description for whom the image is intended or useful</li>
<li> <b>dc:source</b>, a reference to the source the media item was derived or taken from. For example a book from which an image was scanned or the original provider of a photo/graphic, such as photography agencies</li>
<li> <b>dc:license</b>, license for this media object. If possible declare it as CC0 to ensure greatest use</li>
<li> <b>dc:rightsHolder</b>, the person or organization owning or managing rights over the media item</li>
</ul>
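To make this concrete, declaring the extension in a Darwin Core archive's meta.xml might look roughly like this (a sketch only; the file name and column indices are illustrative, and you should check the extension definition linked above for the exact rowType and term URIs):

```xml
<!-- Sketch: multimedia extension declaration in a DwC-A meta.xml -->
<extension rowType="http://rs.gbif.org/terms/1.0/Multimedia"
           encoding="UTF-8" fieldsTerminatedBy="\t"
           linesTerminatedBy="\n" ignoreHeaderLines="1">
  <files><location>multimedia.txt</location></files>
  <coreid index="0"/>
  <field index="1" term="http://purl.org/dc/terms/type"/>
  <field index="2" term="http://purl.org/dc/terms/identifier"/>
  <field index="3" term="http://purl.org/dc/terms/title"/>
  <field index="4" term="http://purl.org/dc/terms/license"/>
</extension>
```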
<h4>
Access to Biological Collections Data</h4>
As usual we also provide a binding from the <a href="http://www.tdwg.org/activities/abcd/">TDWG ABCD standard</a> (versions 1.2 and 2.06) mostly used with the BioCASE software.<br />
<br />
From <i>ABCD 1.2</i> we extract media information based on the UnitDigitalImage subelements. In particular information about the file URL (ImageURI), the description (Comment) and the license (TermsOfUse).<br />
<br />
In <i>ABCD 2.06</i> we use the unit MultiMediaObject subelements instead. Here there are distinct file and webpage URLs (FileURI, ProductURI), the description (Comment), the license (License/Text, TermsOfUseStatements) and also an indication of the mime type (Format). The <a href="http://www.gbif-uat.org/occurrence/779863593">bird sound example</a> from above comes in as ABCD 2.06 via the <a href="http://www.gbif-uat.org/dataset/b7ec1bf8-819b-11e2-bad2-00145eb45e9a">Animal Sound Archive dataset</a>. You can see the original details of that ABCD record in its <a href="http://www.gbif-uat.org/occurrence/779863593/fragment">raw XML fragment</a>. There are also <a href="http://www.gbif-uat.org/occurrence/773646053#images">fossil images</a> available through ABCD.<br />
<br />
Missing from both ABCD versions are media title, creator and created elements.<br />
<br />
<h3>
Media type interpretation</h3>
We derive the media type from either an explicitly given dc:type, the mime type found in dc:format, or the media file suffix. In the case of dwc:associatedMedia found in simple Darwin Core we can only rely on the file URL to interpret the kind of media item. If that URL points to an html page instead of an actual static media file with a well-known suffix, the media type remains unknown.<br />
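The precedence just described can be sketched like this (illustrative Python, not the actual interpreter; the suffix and MIME tables are abbreviated examples):

```python
# Precedence: explicit dc:type, then dc:format (MIME), then file suffix;
# otherwise the media type stays unknown (e.g. a URL to an html page).
SUFFIXES = {".jpg": "StillImage", ".png": "StillImage",
            ".mp4": "MovingImage", ".mp3": "Sound", ".wav": "Sound"}

def interpret_media_type(dc_type=None, mime=None, url=None):
    if dc_type in ("StillImage", "MovingImage", "Sound"):
        return dc_type
    if mime:
        prefix = mime.split("/")[0]
        return {"image": "StillImage", "video": "MovingImage",
                "audio": "Sound"}.get(prefix)
    if url:
        for suffix, media_type in SUFFIXES.items():
            if url.lower().endswith(suffix):
                return media_type
    return None  # type remains unknown
```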
<br />
<h3>
Production deployment</h3>
We hope you like this new feature and we are eager to get it into production in the coming weeks. This is the first iteration of this work, and like all GBIF developments we welcome any feedback.<br />
<div>
<br /></div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com2tag:blogger.com,1999:blog-2326624813533383062.post-43907556914639216732014-04-23T12:22:00.000+02:002014-04-24T15:39:29.648+02:00IPT v2.1 – Promoting the use of stable occurrenceIDs<div dir="ltr" style="text-align: left;" trbidi="on">
<div>
<br /></div>
GBIF is pleased to announce the release of the <a href="http://www.gbif.org/ipt" target="_blank">IPT 2.1</a> with the following key changes:<br />
<ul style="text-align: left;">
<li>Stricter controls for the Darwin Core occurrenceID to improve stability of record level identifiers network wide</li>
<li>Ability to support Microsoft Excel spreadsheets natively</li>
<li>Japanese translation thanks to Dr. Yukiko Yamazaki from the National Institute of Genetics (NIG) in Japan</li>
</ul>
With this update, GBIF continues to refine and improve the IPT based on feedback from users, while carrying out a number of deliverables that are part of the <a href="http://www.gbif.org/resources/2970" target="_blank">GBIF Work Programme for 2014-16</a>.<br />
<br />
The most significant new feature that has been added in this release is the ability to validate that each record within an Occurrence dataset has a unique identifier. If any missing or duplicate identifiers are found, publishing fails, and the problem records are logged in the publication report.<br />
<br />
This new feature will support data publishers who use the Darwin Core term <a href="http://rs.tdwg.org/dwc/terms/#occurrenceID" target="_blank">occurrenceID</a> to uniquely identify their occurrence records. The change is intended to make it easier to link to records as they propagate throughout the network, simplifying the mechanism to cross reference databases and potentially help towards tracking use.<br />
<br />
Previously, GBIF asked publishers to use the three Darwin Core terms <a href="http://rs.tdwg.org/dwc/terms/#institutionCode" target="_blank">institutionCode</a>, <a href="http://rs.tdwg.org/dwc/terms/#collectionCode" target="_blank">collectionCode</a>, and <a href="http://rs.tdwg.org/dwc/terms/#catalogNumber" target="_blank">catalogNumber</a> to uniquely identify their occurrence records. This triplet-style identifier will continue to be accepted; however, it is notoriously unstable, since the codes are prone to change and in many cases are meaningless for datasets originating outside the museum collections community. For this reason, GBIF is adopting the recommendation of the IPT user community and now advises using occurrenceID instead. <br />
<br />
Best practices for creating an occurrenceID are that they (a) must be unique within the dataset, (b) should remain stable over time, and (c) should be globally unique wherever possible. By taking advantage of the IPT’s built-in identifier validation, publishers will automatically satisfy the first condition.<br />
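The uniqueness check the IPT performs before publishing can be sketched like this (a simplified illustration, not the IPT's actual Java implementation; the function name is ours):

```python
from collections import Counter

def validate_occurrence_ids(rows):
    """Return (missing, duplicates) for the occurrenceID column:
    1-based row numbers with no identifier, and identifier values
    that occur more than once. Publishing would fail if either
    list is non-empty, and both would be logged in the report."""
    ids = [r.get("occurrenceID", "").strip() for r in rows]
    missing = [i for i, v in enumerate(ids, start=1) if not v]
    counts = Counter(v for v in ids if v)
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return missing, duplicates
```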
<br />
Ultimately, GBIF hopes that by transitioning to more widespread use of stable occurrenceIDs, the following goals can be realized:<br />
<ul style="text-align: left;">
<li>GBIF can begin to resolve occurrence records using an occurrenceID. This resolution service could also help check whether identifiers are globally unique or not.</li>
<li>GBIF’s own occurrence identifiers will become inherently more stable as well.</li>
<li>GBIF can sustain more reliable cross-linkages to its records from other databases (e.g. GenBank).</li>
<li>Record-level citation can be made possible, enhancing attribution and the ability to track data usage.</li>
<li>It will be possible to consider tracking annotations and changes to a record over time.</li>
</ul>
If you’re a new or existing publisher, GBIF hopes you’ll agree these goals are worth working towards, and will start using occurrenceIDs. <br />
<br />
The <a href="http://www.gbif.org/ipt" target="_blank">IPT 2.1</a> also includes support for uploading Excel files as data sources.<br />
<br />
Another enhancement is that the interface has been translated into Japanese. GBIF offers its sincere thanks to Dr. Yukiko Yamazaki from the <a href="http://www.nig.ac.jp/english/index.html" target="_blank">National Institute of Genetics (NIG)</a> in Japan for this extraordinary effort.<br />
<br />
In the 11 months since version 2.0.5 was released, a total of 11 enhancements have been added, and 38 bugs have been squashed. So what else has been fixed?<br />
<br />
If you like the IPT’s auto-publishing feature, you will be happy to know that the bug causing the temporary directory to grow until disk space was exhausted has now been fixed. Resources that are configured to auto-publish but fail to publish for whatever reason are now easily identifiable within the resource tables, as shown:<br />
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-2m3hE7IsRX8/U07eEE54fMI/AAAAAAAALzs/GFO-nQUbPb4/s1600/Screen+Shot+2014-04-16+at+9.45.11+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-2m3hE7IsRX8/U07eEE54fMI/AAAAAAAALzs/GFO-nQUbPb4/s1600/Screen+Shot+2014-04-16+at+9.45.11+PM.png" /></a></div>
<div>
If you ever created a data source by connecting directly to a database like MySQL, you may have noticed an error that caused datasets to be truncated unexpectedly upon encountering a row with bad data. Thanks to a patch from Paul Morris (<a href="http://www.huh.harvard.edu/" target="_blank">Harvard University Herbaria</a>), bad rows now get skipped and reported to the user without losing the subsequent rows of data.<br />
<br />
As always we’d like to give special thanks to the other volunteers who contributed to making this version a reality:<br />
<div>
<ul style="text-align: left;">
<li>Marie-Elise Lecoq, and Gallien Labeyrie (<a href="http://www.gbif.fr/" target="_blank">GBIF France</a>) - Updating French translation</li>
<li>Yu-Huang Wang (<a href="http://taibif.tw/" target="_blank">TaiBIF</a>, Taiwan) - Updating Traditional Chinese translation</li>
<li>Nestor Beltran (<a href="http://www.sibcolombia.net/web/sib/home" target="_blank">Colombian Biodiversity Information System (SiB)</a>) - Updating Spanish translation</li>
<li>Etienne Cartolano, Allan Koch Veiga, and Antonio Mauro Saraiva (<a href="http://www.biocomp.org.br/" target="_blank">Universidade de São Paulo, Research Center on Biodiversity and Computing</a>) - Updating Portuguese translation</li>
<li>Carlos Cubillos (<a href="http://www.sibcolombia.net/web/sib/home" target="_blank">Colombian Biodiversity Information System (SiB)</a>) - Contributing style improvements</li>
</ul>
On behalf of the GBIF development team, I can say that we’re really excited to get this new version out to everyone! Happy publishing.<br />
</div>
</div>
</div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/16423423909368777750noreply@blogger.com8tag:blogger.com,1999:blog-2326624813533383062.post-41902075957716270432014-03-04T11:20:00.000+01:002015-08-25T16:01:47.923+02:00Lots of columns with Hive and HBaseWe're in the process of rolling out a long awaited feature here at GBIF, namely the indexing of more fields from <a href="http://rs.tdwg.org/dwc/" target="_blank">Darwin Core</a>. Until the launch of our now HBase-backed occurrence store (in the fall of 2013) we couldn't index more than about 30 or so terms from Darwin Core because we were limited by our MySQL schema. Now that we have HBase we can add as many columns as we like!<br />
<br />
Or so we thought.<br />
<br />
Our occurrence download service gets a lot of use, and naturally we want downloaders to have access to all of the newly indexed fields. Our downloads run as an Oozie workflow that executes a Hive query against an HDFS table (more details in this <a href="http://blog.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/" target="_blank">Cloudera blog</a>). We use an HDFS table to significantly speed up the scan speed of the query: using an HBase-backed Hive table takes something like 4-5x as long. But to generate that HDFS table we need to start from a Hive table that _is_ backed by HBase.<br />
<br />
Here's an example of how to write a Hive table definition for an HBase-backed table:<br />
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">CREATE EXTERNAL TABLE tiny_hive_example (</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> key INT,</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> kingdom STRING,</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> kingdomkey INT</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,o:kingdom#s,o:kingdomKey#b")</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">TBLPROPERTIES(</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> "hbase.table.name" = "tiny_hbase_table",</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> "hbase.table.default.storage.type" = "binary"</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">);</span><br />
<br />
But now that we have something like 600 columns to map to HBase, and have chosen to name our HBase columns just like the DwC terms they represent (e.g. the <a href="http://rs.tdwg.org/dwc/terms/index.htm#basisOfRecord" target="_blank">basis of record</a> term's column name is basisOfRecord), we have a very long "SERDEPROPERTIES" string in our Hive table definition. How long? Well, way more than the 4000-character limit of Hive. For our Hive metastore we use PostgreSQL, and when Hive creates the SERDE_PARAMS table it gives the PARAM_VALUE column a datatype of VARCHAR(4000). Because 4k should be enough for anyone, right? Sigh.<br />
<br />
The solution:<br />
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">alter table "SERDE_PARAMS" alter column "PARAM_VALUE" type text;</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
We did lots of testing to make sure the existing definitions didn't get nuked by this change, and can confirm that the Hive code is not checking that 4000 value either (the value is simply turned into a String: <a href="http://svn.apache.org/repos/asf/hive/trunk/metastore/src/model/package.jdo" target="_blank">the source</a>). Our new super-wide downloads table works, and will be in production soon!<br />
<br /><div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Oliver Meynhttp://www.blogger.com/profile/04706642473308341930noreply@blogger.com4tag:blogger.com,1999:blog-2326624813533383062.post-81975349067876346422013-10-28T12:04:00.000+01:002013-10-28T12:04:15.847+01:00The new (real-time) GBIF Registry has gone live<div dir="ltr" style="text-align: left;" trbidi="on">
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<div>
<span style="font-family: Times, Times New Roman, serif;"><span style="background-color: white; color: #222222;">For the last 4 years, GBIF has operated the GBRDS registry with its own web application at </span><a href="http://gbrds.gbif.org/" style="background-color: white; color: #1155cc;" target="_blank">http://gbrds.gbif.org</a>. Previously, when a dataset was registered in the GBRDS registry (for example using an <a href="http://www.gbif.org/ipt" target="_blank">IPT</a>), it was not visible in the portal until the next rollover took place, often several weeks later. </span></div>
<div>
<span style="background-color: white; color: #222222; font-family: Times, Times New Roman, serif;"><br /></span></div>
<div>
<span style="background-color: white; color: #222222; font-family: Times, Times New Roman, serif;">In October, GBIF launched its new portal on <a href="http://www.gbif.org/" style="color: #1155cc;" target="_blank">www.gbif.org</a>. During the launch we indicated that the real-time data management would be starting up in November. We are excited to inform you that today we made the first step towards making this a reality, by enabling the live operation of the new GBIF registry. </span> </div>
</div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<div>
<span style="background-color: white; color: #222222; font-family: Times, 'Times New Roman', serif;">What does this mean for you?</span></div>
</div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<ul style="text-align: left;">
<li><span style="background-color: white; color: #222222; font-family: Times, 'Times New Roman', serif;">any dataset registered through GBIF (using an </span><a href="http://www.gbif.org/ipt" style="font-family: Times, 'Times New Roman', serif;" target="_blank">IPT</a><span style="background-color: white; color: #222222; font-family: Times, 'Times New Roman', serif;">, web services, or manually by liaison with the Secretariat) will be visible in the portal immediately because the portal and new registry are fully integrated</span> </li>
</ul>
</div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<ul style="text-align: left;">
<li><span style="background-color: white; color: #222222; font-family: Times, 'Times New Roman', serif;">the GBRDS web application (</span><a href="http://gbrds.gbif.org/" style="color: #1155cc; font-family: Times, 'Times New Roman', serif;" target="_blank">http://gbrds.gbif.org</a><span style="background-color: white; color: #222222; font-family: Times, 'Times New Roman', serif;">) is no longer visible</span><span style="font-family: Times, Times New Roman, serif;"><span style="background-color: white; color: #222222;">, </span></span><span style="background-color: white; color: #222222;"><span style="font-family: Times, Times New Roman, serif;">since the new portal displays all the appropriate information</span></span></li>
</ul>
</div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<ul style="text-align: left;">
<li><span style="color: #222222; font-family: Times, 'Times New Roman', serif;">old links to the GBRDS will automatically redirect to their corresponding entry in the new portal. As an example, try </span><a href="http://gbrds.gbif.org/browse/agent?uuid=4fa7b334-ce0d-4e88-aaae-2e0c138d049e" style="font-family: Times, 'Times New Roman', serif;">http://gbrds.gbif.org/browse/agent?uuid=4fa7b334-ce0d-4e88-aaae-2e0c138d049e</a><span style="color: #222222; font-family: Times, 'Times New Roman', serif;"> </span></li>
</ul>
</div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<div style="text-align: left;">
</div>
<div style="text-align: left;">
</div>
<ul style="text-align: left;">
<li><span class="Apple-style-span" style="color: #222222; font-family: Times, Times New Roman, serif;">the GBRDS sandbox registry web application (<a href="http://gbrdsdev.gbif.org/">http://gbrdsdev.gbif.org</a></span><span style="color: #222222; font-family: Times, 'Times New Roman', serif;">) is no longer visible, but a new registry sandbox has been set up to provide for </span><a href="http://www.gbif.org/ipt" style="font-family: Times, 'Times New Roman', serif;" target="_blank">IPT</a><span style="color: #222222; font-family: Times, 'Times New Roman', serif;"> installations running in test mode</span></li>
</ul>
</div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; text-align: left; word-wrap: break-word;">
<span style="background-color: white; color: #222222; font-family: Times, 'Times New Roman', serif;">Please note that the new <a href="http://www.gbif.org/developer/registry">registry API</a> </span><span style="background-color: white; font-family: Times, 'Times New Roman', serif;">supports the same web service API that the GBRDS previously did</span><span style="background-color: white; color: #222222; font-family: Times, 'Times New Roman', serif;">, so existing tools and services built on the GBRDS API (such as the <a href="http://www.gbif.org/ipt" target="_blank">IPT</a>) will continue to work uninterrupted.</span><span style="font-family: Times, 'Times New Roman', serif;"> </span></div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<div>
<span style="font-family: Times, Times New Roman, serif;"><span style="background-color: white; color: #222222;">As you may have noticed, occurrence data crawling has been temporarily suspended since the middle of September to prepare for launching </span><span style="background-color: white; color: #222222;">real-time data management</span><span style="background-color: white; color: #222222;">. </span><span style="color: #222222;">We aim to resume occurrence data crawling in the first week of November, meaning that updates to the index will be visible immediately afterwar</span><span style="background-color: white; color: #222222;">ds. </span> </span></div>
</div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<div>
<span style="text-align: justify;"><span style="font-family: Times, Times New Roman, serif;">On behalf of the GBIF development team, I thank you for your patience during this transition time, and hope you are looking forward to real-time data management as much as we are.</span> </span></div>
</div>
</blockquote>
</div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/16423423909368777750noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-68235470911571230782013-10-24T14:39:00.000+02:002013-10-24T14:41:06.222+02:00GBIF Backbone in GitHub<link href="http://alexgorbatchev.com/pub/sh/current/styles/shThemeDefault.css" rel="stylesheet" type="text/css"></link>
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js" type="text/javascript"></script>
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shAutoloader.js" type="text/javascript"></script>
<script type="text/javascript">
SyntaxHighlighter.autoloader(
'js jscript javascript /js/shBrushJScript.js',
'text plain @shBrushPlain.js',
'py python @shBrushPython.js',
'sql @shBrushSql.js',
'bash shell @shBrushBash.js',
'css @shBrushCss.js',
'java @shBrushJava.js',
'xml xhtml xslt html @shBrushXml.js'
);
SyntaxHighlighter.all();
</script>
<span style="font-family: Verdana, sans-serif;">For a long time I wanted to experiment with using <a href="https://github.com/mdoering/backbone">GitHub</a> as a tool to browse and manage the <a href="http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c">GBIF backbone taxonomy</a>. Encouraged by similar sentiments from <a href="http://iphylo.blogspot.co.uk/2013/04/time-to-put-taxonomy-into-github.html">Rod Page</a>, it would be nice to use git to keep track of versions and allow external parties to fork parts of the taxonomic tree and push back changes if desired. To top it off there is the </span><span style="font-family: Verdana, sans-serif;">great GitHub Treeslider to browse the taxonomy, so why not give it a try?</span><br />
<h3>
<span style="font-family: Verdana, sans-serif;">A GitHub filesystem taxonomy</span></h3>
<span style="font-family: Verdana, sans-serif;">I decided to export each taxon in the backbone as a folder that is named according to the canonical name, containing 2 files:</span><br />
<br />
<ol>
<li><span style="font-family: Courier New, Courier, monospace;"><b>README.md,</b></span><span style="font-family: Verdana, sans-serif;"> a simple markdown file that gets rendered by github and shows the basic attributes of a taxon</span></li>
<li><span style="font-family: Courier New, Courier, monospace;"><b>data.json,</b></span><span style="font-family: Verdana, sans-serif;"> a complete json representation of the taxon as it is exposed via the new <a href="http://www.gbif.org/developer/species">GBIF species API</a></span></li>
</ol>
<span style="font-family: Verdana, sans-serif;">The filesystem represents the taxonomic classification and taxon folders are nested accordingly, for example the species <a href="https://github.com/mdoering/backbone/tree/master/life/Fungi/Basidiomycota/Agaricomycetes/Agaricales/Amanitaceae/Amanita/Amanita%20arctica">Amanita arctica</a> is represented as:</span><br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-dNFtybvZtnE/UmkTfFIwWfI/AAAAAAAAEAc/_AwIyGsxGus/s1600/Screen+Shot+2013-10-24+at+14.32.41.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="343" src="http://3.bp.blogspot.com/-dNFtybvZtnE/UmkTfFIwWfI/AAAAAAAAEAc/_AwIyGsxGus/s400/Screen+Shot+2013-10-24+at+14.32.41.png" width="400" /></a></div>
<br />
<span style="font-family: Verdana, sans-serif;">This is just a first experimental step. One can improve the readme a lot to render more content in a human friendly way and include more data in the json file such as common names and synonyms.</span><br />
<h3>
<span style="font-family: Verdana, sans-serif;">Getting data into GitHub</span></h3>
<div>
<span style="font-family: Verdana, sans-serif;">It didn't take much to write a small <a href="https://code.google.com/p/gbif-ecat/source/browse/checklistbank/trunk/checklistbank-nub/src/main/java/org/gbif/nub/export/NubGitExporter.java">NubGitExporter.java</a> class that exports the GBIF backbone into the filesystem as described above. The export of the entire taxonomy, with its 4.4 million taxa including synonyms, took about one hour on a MacBook Pro laptop. </span>
<div>
<span style="font-family: Verdana, sans-serif;">Not bad, I thought. But then I tried to add the generated files to git, and that's when I started to have doubts. After waiting half a day for git to add the files to my local index, I decided to kill the process and start by adding only the smaller kingdoms, excluding animals and plants. That left about 335,000 folders and 670,000 files to be added to git. Adding these to my local index still took several hours; committing and finally pushing them to the GitHub server took yet another 2 hours.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<pre class="brush: bash" name="code">Delta compression using up to 8 threads.
Compressing objects: 100% (1010487/1010487), done.
Writing objects: 100% (1010494/1010494), 173.51 MiB | 461 KiB/s, done.
Total 1010494 (delta 405506), reused 0 (delta 0)
To https://github.com/mdoering/backbone.git
</pre>
<div>
<span style="font-family: Verdana, sans-serif;"><br /></span><span style="font-family: Verdana, sans-serif;">After those files were added to the index, committing a simple change to the main README file took 15 minutes. Although I like the general idea and the pretty user interface, I fear GitHub, and even <a href="http://stackoverflow.com/questions/984707/what-are-the-file-limits-in-git-number-and-size">git</a> itself, are not made to be a repository of millions of files and folders.</span><br />
<h3>
<span style="font-family: Verdana, sans-serif;">First GitHub impressions</span></h3>
<div>
<span style="font-family: Verdana, sans-serif;">Browsing taxa in GitHub is surprisingly responsive. The fungus genus <a href="https://github.com/mdoering/backbone/tree/master/life/Fungi/Basidiomycota/Agaricomycetes/Agaricales/Amanitaceae/Amanita">Amanita</a> contains 746 species, but it loads very quickly. In that regard the GitHub browser is much nicer to use than the one on the new <a href="http://www.gbif.org/species/2526057">GBIF species pages</a>, which of course shows much more information. The rendered <a href="https://github.com/mdoering/backbone/blob/master/life/Fungi/Basidiomycota/Agaricomycetes/Agaricales/Amanitaceae/Amanita/README.md">readme</a> file is not ideally placed, as it sits at the very bottom of the page, but showing information to humans that way is nice - and markdown could also be parsed by machines quite easily if we adopt a simple format, for example: for every property, create a heading with that name and put the content into the following paragraph(s). </span></div>
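That heading-per-property convention would be trivial to parse back. A quick sketch of the idea (hypothetical function, just to show the format round-trips):

```python
def parse_property_markdown(md):
    """Parse the proposed convention: each '# propertyName' heading is
    followed by paragraph(s) holding that property's value."""
    props, current = {}, None
    for line in md.splitlines():
        if line.startswith("#"):
            current = line.lstrip("#").strip()
            props[current] = []
        elif current and line.strip():
            props[current].append(line.strip())
    # join multi-paragraph values back into a single string
    return {k: " ".join(v) for k, v in props.items()}
```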
<div>
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Verdana, sans-serif;">The Amanita example also reveals a bug in the exporter class when dealing with synonyms (the <a href="https://github.com/mdoering/backbone/blob/master/life/Fungi/Basidiomycota/Agaricomycetes/Agaricales/Amanitaceae/Amanita/README.md">Amanita readme</a> contains the synonym information) and also with infraspecific taxa. For example <a href="https://github.com/mdoering/backbone/tree/master/life/Fungi/Basidiomycota/Agaricomycetes/Agaricales/Amanitaceae/Amanita/Amanita%20muscaria">Amanita muscaria</a> contains some weird form information which is mapped erroneously to the species. This obviously should be fixed.</span></div>
<div>
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Verdana, sans-serif;">The GitHub browser sorts all files alphabetically. When ranks are mixed (we skip intermediate unknown ranks in the backbone), as in the <a href="https://github.com/mdoering/backbone/tree/master/life/Fungi">Fungus kingdom</a>, sorting by rank first would be desirable. We could enable this by naming the taxon folders accordingly, prefixing them with a rank token that sorts alphabetically in rank order.</span>
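One possible prefixing scheme (the prefix letters here are made up purely for illustration, not anything implemented in the exporter):

```python
# Prefixes chosen so that plain A-Z sorting reproduces taxonomic order.
RANK_PREFIX = {"kingdom": "a", "phylum": "b", "class": "c",
               "order": "d", "family": "e", "genus": "f", "species": "g"}

def folder_name(rank, canonical_name):
    """Name a taxon folder so GitHub's alphabetical listing groups
    and orders entries by rank before name."""
    return "%s-%s %s" % (RANK_PREFIX[rank], rank, canonical_name)
```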
<div>
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Verdana, sans-serif;">I have not had the time to try versioning branches of the tree and see how usable that is. I suspect git performance would be really slow, but that might not be a blocker if we only do versioning of larger groups and rarely push &amp; pull.</span>
<div>
<br /></div>
</div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com9tag:blogger.com,1999:blog-2326624813533383062.post-53841166993629313922013-07-22T21:16:00.000+02:002013-07-22T21:16:07.954+02:00Validating scientific names with the forthcoming GBIF Portal web service API<div dir="ltr" id="docs-internal-guid-617a94a6-07b1-98a4-ce0e-860efe6518aa" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<i>This guest post was written by Gaurav Vaidya, </i><i>Victoria Tersigni and Robert Guralnick, and is being cross-posted to the VertNet Blog. David Bloom and John Wieczorek read through drafts of this post and improved it tremendously.</i></div>
<div dir="ltr" id="docs-internal-guid-617a94a6-07b1-98a4-ce0e-860efe6518aa" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right; width: 200px;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-vxZ3Fg0vyzc/Ue1808go0TI/AAAAAAAAAuQ/smjqdcmDpto/s1600/1024px-Mother_and_baby_sperm_whale.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="" border="0" height="179" src="http://2.bp.blogspot.com/-vxZ3Fg0vyzc/Ue1808go0TI/AAAAAAAAAuQ/smjqdcmDpto/s320/1024px-Mother_and_baby_sperm_whale.jpg" title="" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A whale named <i><strike>Physeter macrocephalus</strike> <strike>Physeter catodon</strike> Physeter macrocephalus</i> (photograph by Gabriel Barathieu, reused under CC-BY-SA from <a href="https://commons.wikimedia.org/wiki/File:Mother_and_baby_sperm_whale.jpg">the Wikimedia Commons</a>)</td></tr>
</tbody></table>
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Validating scientific names is one of the hardest parts of cleaning up a biodiversity dataset: as taxonomists' understanding of species boundaries change, the names attached to them can be synonymized, moved between genera or even have their Latin grammar corrected (it's </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Porphyrio martini</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">cus</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, not </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Porphyrio martini</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">ca</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">). 
Different taxonomists may disagree on what to call a species, whether a particular set of populations make up a species, subspecies or species complex, or even which of several published names correspond to our modern understanding of that species, such as </span><a href="http://www.repository.naturalis.nl/record/318605" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">the dispute</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> over whether the sperm whale is really </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Physeter catodon</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> Linnaeus, 1758, or </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Physeter macrocephalus</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> Linnaeus, 1758.</span></div>
<div dir="ltr" id="docs-internal-guid-617a94a6-07b1-98a4-ce0e-860efe6518aa" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div dir="ltr" id="docs-internal-guid-617a94a6-07b1-f108-eed3-15bc258078d0" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">A good way to validate scientific names is to match them against a taxonomic checklist: a publication that describes the taxonomy of a particular taxonomic group in a particular geographical region. It is up to the taxonomists who write such treatises to catalogue all the synonyms that have ever been used for the names in their checklist, and to identify a single accepted name for each taxon they recognize. While these checklists are themselves evolving over time and sometimes contradict each other, they serve as essential points of reference in an ever-changing taxonomic landscape.</span></div>
<div dir="ltr" id="docs-internal-guid-617a94a6-07b1-f108-eed3-15bc258078d0" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div dir="ltr" id="docs-internal-guid-617a94a6-07b2-1506-462c-670a7a7a817b" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Over a hundred digitized checklists have been assembled by the Global Biodiversity Information Facility (GBIF) and will be indexed in the forthcoming </span><a href="http://uat.gbif.org/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">GBIF Portal</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, currently in development and testing. This collection includes large, global checklists, such as the </span><a href="http://uat.gbif.org/dataset/7ddf754f-d193-4cc9-b351-99906754a03b" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Catalogue of Life</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> and the </span><a href="http://uat.gbif.org/dataset/046bbc50-cae2-47ff-aa43-729fbf53f7c5" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">International Plant Names Index</span></a><span style="background-color: transparent; color: black; font-family: Arial; 
font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, alongside smaller, more focused checklists, such as </span><a href="http://uat.gbif.org/dataset/d7f2602e-9f79-45e8-8399-08d0c5e43f5d" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">a checklist of 383 species of seed plants</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> found in the </span><a href="http://en.wikipedia.org/wiki/Singalila_National_Park" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Singalila National Park in India</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> and </span><a href="http://uat.gbif.org/dataset/db93cee5-60d1-4e16-a69e-83dd7080a55e" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">the 87 species of moss bug</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> recorded in the </span><a 
href="http://coleorrhyncha.speciesfile.org/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Coleorrhyncha Species File</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">. Many of these checklists can be downloaded as </span><a href="http://www.gbif.org/informatics/standards-and-tools/publishing-data/data-standards/darwin-core-archives/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Darwin Core Archive</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> files, an important format for working with and exchanging biodiversity data.</span><br />
<br /></div>
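A Darwin Core Archive is, at its simplest, a zip file bundling delimited text files with a meta.xml descriptor that says how to read them. As a rough sketch of the format (the file names, columns and rows here are invented for the example, not taken from any real checklist), the following Python builds a tiny archive in memory and reads the scientific names back out using only the standard library:

```python
import csv
import io
import zipfile

# A minimal meta.xml descriptor pointing at a single core taxon file.
# Real archives carry much more metadata; this is only illustrative.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Taxon" fieldsTerminatedBy="\\t">
    <files><location>taxon.txt</location></files>
  </core>
</archive>"""

# Hypothetical tab-separated checklist rows, with a header line.
TAXON_TXT = ("scientificName\tscientificNameAuthorship\tkingdom\n"
             "Panthera tigris\t(Linnaeus, 1758)\tAnimalia\n"
             "Felis tigris\tLinnaeus, 1758\tAnimalia\n")

def build_archive() -> bytes:
    """Pack meta.xml and taxon.txt into an in-memory zip."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("meta.xml", META_XML)
        zf.writestr("taxon.txt", TAXON_TXT)
    return buf.getvalue()

def read_names(archive_bytes: bytes) -> list:
    """Extract the scientific names from the core taxon file."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        with zf.open("taxon.txt") as fh:
            text = io.TextIOWrapper(fh, encoding="utf-8")
            reader = csv.DictReader(text, delimiter="\t")
            return [row["scientificName"] for row in reader]

print(read_names(build_archive()))
```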
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">So how can we match names against these databases? </span><a href="http://www.openrefine.org/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">OpenRefine</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> (the recently-renamed Google Refine) is a popular data cleaning tool, with features that make it easy to clean up many different types of data. </span><a href="http://about.me/jotegui" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Javier Otegui</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> has written a tutorial on </span><a href="http://bit.ly/BITW13_OpenRefine" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">cleaning biodiversity data in OpenRefine</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; 
vertical-align: baseline;">, and last year </span><a href="http://iphylo.blogspot.com/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Rod Page</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> provided tools and a </span><a href="http://iphylo.blogspot.com/2012/02/using-google-refine-and-taxonomic.html" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">step-by-step guide to reconciling scientific names</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, establishing OpenRefine as an essential tool for biodiversity data and scientific name cleanup.</span><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; height: 278px; margin-left: 1em; text-align: right; width: 267px;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-E5_Ja4jxV8w/Ue19jVYkyPI/AAAAAAAAAuY/CwmaK86aRvw/s1600/Felis+Tigris+in+Syst+Nat+10th+ed.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="200" src="http://4.bp.blogspot.com/-E5_Ja4jxV8w/Ue19jVYkyPI/AAAAAAAAAuY/CwmaK86aRvw/s200/Felis+Tigris+in+Syst+Nat+10th+ed.png" width="190" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Linnaeus' original description of <i>Felis Tigris</i>. From an 1894 republication of Linnaeus' <i>Systema Naturae, 10th edition</i>, <a href="http://biodiversitylibrary.org/page/25033833">digitized by the Biodiversity Heritage Library</a>.</td></tr>
</tbody></table>
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; line-height: 1.15; text-decoration: none; vertical-align: baseline;">We extended Rod's work by building a reconciliation service against </span><a href="http://dev.gbif.org/wiki/display/POR/Webservice+API" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">the forthcoming GBIF web services API</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">.
We wanted to see if we could use one of the GBIF Portal's biggest
strengths -- the large number of checklists it has indexed -- to
identify names recognized in similar ways by different checklists.
Searching through multiple checklists containing possible synonyms and
accepted names increases the odds of finding an obscure or recently
created name; and if the same name is recognized by a number of
checklists, this may signify a well-known synonymy -- for example, two
of the Portal checklists recognize that the species Linnaeus named </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Felis tigris</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> is the same one that is known as </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Panthera tigris </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">today.</span><br />
</div>
<br />
<div dir="ltr" id="docs-internal-guid-617a94a6-07b4-9eeb-5e5f-e3e700cbe6c9" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">To do this, we wrote a new </span><a href="http://refine.taxonomics.org/gbifchecklists/code" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">OpenRefine reconciliation service</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> that searches for a queried name in all the checklists on the GBIF Portal. It then clusters names using four criteria and counts how often a particular name has the same:</span></div>
<ul style="margin-bottom: 0pt; margin-top: 0pt;">
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">scientific name (for example, "</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Felis tigris</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">"),</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">authority ("Linnaeus, 1758"),</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">accepted name ("</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Panthera tigris</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">"), and</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">kingdom ("Animalia").</span></div>
</li>
</ul>
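The clustering step can be pictured as grouping raw checklist hits by that four-part key and counting how many checklists share each interpretation. A minimal sketch, using invented hit records rather than real service output:

```python
from collections import Counter

# Hypothetical hits for "Felis tigris" returned by several checklists.
hits = [
    {"scientificName": "Felis tigris", "authority": "Linnaeus, 1758",
     "acceptedName": "Panthera tigris", "kingdom": "Animalia"},
    {"scientificName": "Felis tigris", "authority": "Linnaeus, 1758",
     "acceptedName": "Panthera tigris", "kingdom": "Animalia"},
    {"scientificName": "Felis tigris", "authority": None,
     "acceptedName": None, "kingdom": "Metazoa"},
]

def cluster(hits):
    """Count identical (name, authority, accepted name, kingdom) interpretations."""
    def key(h):
        return (h["scientificName"], h["authority"],
                h["acceptedName"], h["kingdom"])
    counts = Counter(key(h) for h in hits)
    # Most widely shared interpretation first.
    return counts.most_common()

for interpretation, n in cluster(hits):
    print(n, interpretation)
```

Sorting by how many checklists share an interpretation is what pushes the well-supported synonymy to the top of the candidate list.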
<br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"></span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Once you do a reconciliation through our new service, your results will look like this:</span><br />
<div dir="ltr" id="docs-internal-guid-617a94a6-07b1-f108-eed3-15bc258078d0" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-l-vU_0ve6Lw/Ue1-POoTaVI/AAAAAAAAAug/6gpqQOqjSHg/s1600/Felis+tigris+reconciliation.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="97" src="http://4.bp.blogspot.com/-l-vU_0ve6Lw/Ue1-POoTaVI/AAAAAAAAAug/6gpqQOqjSHg/s320/Felis+tigris+reconciliation.png" width="320" /></a></div>
<br />
<div dir="ltr" id="docs-internal-guid-617a94a6-07b5-545e-228f-32c0c2f8d033" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Since OpenRefine limits the number of results it shows for any reconciliation, we know only that at least five checklists in the GBIF Portal matched the name "Felis tigris". Of these,</span><br />
</div>
<ol style="margin-bottom: 0pt; margin-top: 0pt;">
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: decimal; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Two checklists consider </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Felis tigris </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Linnaeus, 1758 to be a junior synonym of </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Panthera tigris</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> (Linnaeus, 1758). Names are always sorted by the number of checklists that contain that interpretation, so this interpretation -- as it happens, the correct one -- is at the top of the list.</span><br />
</div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: decimal; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">The remaining checklists all consider </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Felis tigris</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> to be an accepted name in its own right. They contain mutually inconsistent information: one places this species in the kingdom Animalia, another in the kingdom Metazoa, and the third contains both a kingdom and a taxonomic authority. You can click on each name to find out more details.</span></div>
</li>
</ol>
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"></span><br />
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Using our reconciliation service, you can immediately see how many checklists agree on the most important details of the name match, and whether a name should be replaced with an accepted name. The same name may also be spelled identically under different nomenclatural codes: for example, does "Ficus" refer to the genus </span><a href="http://en.wikipedia.org/wiki/Ficus_%28gastropod%29" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Ficus </span><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Röding, 1798</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> or the genus </span><a href="http://en.wikipedia.org/wiki/Ficus" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Ficus</span><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;"> L.</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; 
font-weight: normal; text-decoration: none; vertical-align: baseline;">? If you know that the former is in kingdom Animalia while the latter is in Plantae, it becomes easier to figure out the right match for your dataset.</span></div>
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"></span><br />
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">We've designed a complete workflow around our reconciliation service, starting with ITIS as a very fast first step to catch the most widely recognized names, and ending with EOL's fuzzy-matching search as a final step to look for incorrectly spelled names. For VertNet's 2013 </span><a href="http://vertnet.org/about/BITW.php" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Biodiversity Informatics Training Workshop</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, we wrote two tutorials that walk you through our workflow:</span></div>
<br />
<ul style="margin-bottom: 0pt; margin-top: 0pt;">
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<a href="http://bit.ly/bitw2013-taxon-validation-tutorial" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Name validation in OpenRefine</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, using both the new GBIF API reconciliation service as well as Rod Page's reconciliation service for EOL, and</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<a href="http://bit.ly/bitw2013-higher-taxonomy-tutorial" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Higher taxonomy in OpenRefine</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, using the web service APIs provided by GBIF and EOL, as well as OpenRefine's ability to parse JSON.</span></div>
</li>
</ul>
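In outline, the workflow is a fallback chain: try a fast exact matcher first, and only fall through to slower or fuzzier services when the earlier steps fail. The resolver functions below are stand-ins with hard-coded lookup tables, not real ITIS, GBIF or EOL clients, but they show the shape of the chain:

```python
def resolve_itis(name):
    """Stand-in for a fast exact lookup (the ITIS step)."""
    known = {"Panthera tigris": "Panthera tigris"}
    return known.get(name)

def resolve_gbif_checklists(name):
    """Stand-in for reconciliation against many checklists (the GBIF step)."""
    synonyms = {"Felis tigris": "Panthera tigris"}
    return synonyms.get(name)

def resolve_eol_fuzzy(name):
    """Stand-in for a fuzzy, misspelling-tolerant search (the EOL step)."""
    fuzzy = {"Pantera tigris": "Panthera tigris"}
    return fuzzy.get(name)

def resolve(name):
    """Try each resolver in order; return the first accepted name found."""
    for step in (resolve_itis, resolve_gbif_checklists, resolve_eol_fuzzy):
        match = step(name)
        if match is not None:
            return match
    return None  # unresolved: flag the name for manual review

print(resolve("Felis tigris"))
```

Names that fall all the way through the chain are exactly the ones worth a taxonomist's attention.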
<br />
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">If you're already familiar with OpenRefine, you can add the reconciliation service with the URL:</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> </span><a href="http://refine.taxonomics.org/gbifchecklists/reconcile" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Consolas; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">http://refine.taxonomics.org/gbifchecklists/reconcile</span></a><span style="background-color: transparent; color: black; font-family: Consolas; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"></span></div>
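Under the hood, OpenRefine talks to a reconciliation service through a simple JSON protocol: it sends a <code>queries</code> parameter mapping query keys to name queries, and the service returns candidate matches with scores. The sketch below builds such a request URL for our service and parses a canned response; the response body is invented to illustrate the protocol's shape, and the endpoint itself may not remain available indefinitely:

```python
import json
from urllib.parse import urlencode

SERVICE = "http://refine.taxonomics.org/gbifchecklists/reconcile"

def build_request_url(names):
    """Encode reconciliation queries the way OpenRefine does."""
    queries = {"q%d" % i: {"query": n} for i, n in enumerate(names)}
    return SERVICE + "?" + urlencode({"queries": json.dumps(queries)})

def best_matches(response_text):
    """Pick the top-ranked candidate name for each query key."""
    payload = json.loads(response_text)
    return {key: (entry["result"][0]["name"] if entry["result"] else None)
            for key, entry in payload.items()}

# A made-up response in the general shape the reconciliation protocol defines.
sample = json.dumps({
    "q0": {"result": [
        {"id": "example-id", "name": "Felis tigris Linnaeus, 1758",
         "score": 5, "match": True},
    ]},
})

print(build_request_url(["Felis tigris"]))
print(best_matches(sample))
```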
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Give it a try, and let us know if it helps you reconcile names faster!</span></div>
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"></span><br />
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<a href="http://www.mappinglife.org/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">The Map of Life project</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> is continuing to work on improving OpenRefine for taxonomic use in a project we call </span><a href="https://github.com/gaurav/taxrefine" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">TaxRefine</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">. If you have suggestions for features you'd like to see, please let us know! You can leave a comment on this blog post, or </span><a href="https://github.com/gaurav/taxrefine/issues" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">add an issue to our issue tracker on GitHub</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">.</span></div>
<div dir="ltr" id="docs-internal-guid-617a94a6-07b1-98a4-ce0e-860efe6518aa" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-2326624813533383062.post-46981112765552650802013-05-22T15:37:00.000+02:002013-05-22T15:38:43.385+02:00IPT v2.0.5 Released - A melhor versão até o momento!<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<br /></div>
<div class="p1">
<div style="text-align: justify;">
<br class="Apple-interchange-newline" /></div>
<div style="text-align: justify;">
The GBIF Secretariat is proud to release version 2.0.5 of the Integrated Publishing Toolkit (IPT), available for download on the project website <a href="http://code.google.com/p/gbif-providertoolkit/downloads/list"><span class="s1">here</span></a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
As with every release, it's your chance to take advantage of the most requested feature enhancements and bug fixes.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The most notable feature enhancements include:</div>
<ul style="text-align: left;">
<li><span style="text-align: justify;">A resource can now be configured to publish automatically on an interval </span><i style="text-align: justify;">(See "<a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes?tm=6#Published_Release" target="_blank">Automated Publishing</a>" section in User Manual)</i></li>
<li><i style="text-align: justify;"><span style="font-style: normal;">The interface has been translated into Portuguese, </span></i>making the IPT available in five languages: French, Spanish, Traditional Chinese, Portuguese and of course English.</li>
<li style="text-align: justify;">An IPT can be configured to back up each DwC-Archive version published <i>(See "<a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes?tm=6#Configure_IPT_settings" target="_blank">Archival Mode</a>" in User Manual)</i></li>
<li style="text-align: justify;">Each resource version now has a resolvable URL <i>(See "<a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes#Versioned_page" target="_blank">Versioned Page</a>" section in User Manual)</i></li>
</ul>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-6tFmFS6XnBY/UZyMMyHLAzI/AAAAAAAAHkI/nLYRp6Nh7Ss/s1600/Screen+Shot+2013-05-22+at+11.11.47+AM.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto; text-align: justify;"><img border="0" height="220" src="http://1.bp.blogspot.com/-6tFmFS6XnBY/UZyMMyHLAzI/AAAAAAAAHkI/nLYRp6Nh7Ss/s400/Screen+Shot+2013-05-22+at+11.11.47+AM.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Filterable, pageable, and sortable resource overview table in v2.0.5</td></tr>
</tbody></table>
<ul style="text-align: left;">
<li style="text-align: justify;">The order of columns in published DwC-Archives is always the same between versions</li>
<li style="text-align: justify;">Style (CSS) customizations are easier than ever - check out this new guide entitled "<a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2Customization" target="_blank">How to Style Your IPT</a>" for more information</li>
<li style="text-align: justify;"><i><span style="font-style: normal;">Hundreds if not thousands of resources can be handled, now that the resource overview tables are filterable, pageable, and sortable <i>(See "<a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes#Public_Resources_Table" target="_blank">Public Resource Table</a>" section in User Manual)</i> </span></i></li>
</ul>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The most important bug fixes are:</div>
<div>
<ul style="text-align: left;">
<li style="text-align: justify;">Garbled encoding on registration updates has been fixed</li>
<li style="text-align: justify;">The problem uploading DwC-Archives in .gzip format has been fixed</li>
<li style="text-align: justify;">The problem uploading a resource logo has been fixed</li>
</ul>
<div style="text-align: justify;">
</div>
</div>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-xnJrM7Zv-eI/UZyMRZEFlRI/AAAAAAAAHkQ/6h1rgSZGUuA/s1600/Screen+Shot+2013-05-22+at+11.12.32+AM.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto; text-align: justify;"><img border="0" height="244" src="http://1.bp.blogspot.com/-xnJrM7Zv-eI/UZyMRZEFlRI/AAAAAAAAHkQ/6h1rgSZGUuA/s320/Screen+Shot+2013-05-22+at+11.12.32+AM.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The new look in v2.0.5</td></tr>
</tbody></table>
<div style="text-align: justify;">
The changes mentioned above represent just a fraction of the work that has gone into this version. Since version 2.0.4 was released 7 months ago, a total of 45 issues have been addressed. These are detailed in the <span class="s1"><a href="https://code.google.com/p/gbif-providertoolkit/issues/list?can=1&q=milestone%3DRelease2.0.5">issue tracking system</a></span>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
It is great to see so much feedback from the community in the form of issues, especially as the IPT becomes more stable and comprehensive over time. After all, the IPT is a community-driven project, and anyone can contribute patches, translations, or have their say simply by adding or voting on issues. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The single largest community contribution in this version has been the translation into Portuguese done by three volunteers at the <a href="http://www.biocomp.org.br/" target="_blank">Universidade de São Paulo, Research Center on Biodiversity and Computing</a>: Etienne Cartolano, Allan Koch Veiga, and Antonio Mauro Saraiva. With <a href="http://www.gbif.org/communications/news-and-events/showsingle/article/brazil-joins-global-initiative-for-biodiversity-data-access" target="_blank">Brazil recently joining the GBIF network</a>, we hope the Portuguese interface for the IPT will help in publication of the wealth of biodiversity data available from Brazilian institutions. </div>
</div>
<div class="p1">
<div style="text-align: justify;">
<br /></div>
</div>
<div class="p1">
<div style="text-align: justify;">
We’d also like to give special thanks to the other volunteers below:</div>
</div>
<ul class="ul1">
<li class="li1" style="text-align: justify;">Marie-Elise Lecoq (GBIF France<span class="s2">)</span> - Updating French translation</li>
<li class="li1" style="text-align: justify;">Yu-Huang Wang (TaiBIF, Taiwan) - Updating Traditional Chinese translation</li>
<li class="li3" style="text-align: justify;">Dairo Escobar and Daniel Amariles (Colombian Biodiversity Information System (SiB)) - Updating <span class="s3">Spanish translation</span></li>
<li class="li3" style="text-align: justify;">Carlos Cubillos (Colombian Biodiversity Information System (SiB)) - Contributing style improvements</li>
<li class="li3" style="text-align: justify;">Sijmen Cozijnsen (independent contractor working for NLBIF, Netherlands) - Contributing style improvements</li>
</ul>
<div class="p1">
<div style="text-align: justify;">
On behalf of the GBIF development team, I hope you enjoy using the latest version of the IPT. </div>
</div>
</div>
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/16423423909368777750noreply@blogger.com1tag:blogger.com,1999:blog-2326624813533383062.post-70640305360511415342013-05-14T15:34:00.000+02:002015-08-25T16:02:01.748+02:00Migrating our hadoop cluster from CDH3 to CDH4We've written a number of times on the <a href="http://gbif.blogspot.dk/2011/01/setting-up-hadoop-cluster-part-1-manual.html" target="_blank">initial setup</a>, eventual <a href="http://gbif.blogspot.dk/2012/06/faster-hbase-hardware-matters.html" target="_blank">upgrade</a> and continued <a href="http://gbif.blogspot.dk/2012/07/optimizing-writes-in-hbase.html" target="_blank">tuning</a> of our hadoop cluster. Our latest project has been upgrading from CDH3u3 to <a href="http://blog.cloudera.com/blog/2012/02/introducing-cdh4/" target="_blank">CDH4.2.1</a>. Upgrades are almost always disruptive, but we decided it was worth the hassle for a number of reasons:<br />
<br />
<ul>
<li>general performance improvements in the entire Hadoop/HBase stack</li>
<li>continued support from the community/user list (a non-trivial concern - anybody asking questions on the user groups and mailing list about problems with older clusters is invariably asked to update before people are interested in tackling the problem)</li>
<li>multi-threaded compactions (the need for which we concluded <a href="http://gbif.blogspot.dk/2012/07/optimizing-writes-in-hbase.html" target="_blank">in this post</a>)</li>
<li>table-based region balancing (rather than just cluster-wide)</li>
</ul>
<div>
We had been managing our cluster primarily using Puppet, with all the knowledge of how the bits worked together firmly within our dev team. In an effort to make everyone's lives easier, reduce our <a href="http://en.wikipedia.org/wiki/Bus_factor" target="_blank">bus factor</a>, and get the server management back into the hands of our ops team, we've moved to <a href="https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Downloads" target="_blank">CDH Manager</a> to control our CDH installation. That's been going pretty well so far, but we're getting ahead of ourselves...</div>
<div>
<br /></div>
<h3>
The Process</h3>
<div>
We have 6 slave nodes that have a lot of disk capacity, since we spec'd with a goal of lots of spindles, which meant we got lots of space "for free". Rather than upgrading in place, we decided to start fresh with new master & zookeeper nodes, and we calculated that we'd have enough space to pull half the slaves into the new cluster without losing any data. We cleaned up all the tmp files and anything we deemed not worth saving from HBase and hdfs, and started the migration:</div>
<h4>
Reduce the replication factor</h4>
<div>
We reduced the replication factor to 2 on the 6 slave nodes to reduce the disk use:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">hadoop fs -setrep -R 2 /</span></div>
<h4>
Decommission the 3 nodes to move</h4>
<div>
"Decommissioning" is the civilized and safe way to remove nodes from a cluster where there's risk that they contain the only copies of some data in the cluster (they'll block writes but accept reads until all blocks have finished replicating out). To do it, add the names of the target machines to an "excludes" file (one per line) that your hdfs config needs to reference, and then refresh hdfs.</div>
<div>
<br /></div>
<div>
The block in hdfs-site.xml:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><property></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> <name>dfs.hosts.exclude</name></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> <value>/etc/hadoop/conf/excluded_hosts</value></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"></property></span></div>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: inherit;">then run:</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">bin/hadoop dfsadmin -refreshNodes</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: inherit;">and wait for the "under replicated blocks" count on the hdfs admin page to drop to 0 and the decommissioning nodes to move into state "Decommissioned".</span></div>
<h4>
Don't forget HBase</h4>
<div>
<span style="font-family: inherit;">The hdfs datanodes are tidied up now but don't forget to cleanly shutdown the HBase regionservers - run:</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">./bin/graceful_stop.sh HOSTNAME</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;">from within the HBase directory on the host you're shutting down (specifying the real name for HOSTNAME). It will shed its regions and shutdown when tidied up (more details <a href="http://hbase.apache.org/book/node.management.html" target="_blank">here</a>).</span><br />
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;">Now you can shutdown the tasktracker and datanode, and then the machine is ready to be wiped.</span></div>
<h4>
<span style="font-family: inherit;">Build the new cluster</span></h4>
<div>
<span style="font-family: inherit;">We wiped the 3 decommissioned slave nodes and installed the latest version of CentOS (our linux of choice, version 6.4 at time of writing). We also pulled 3 much lesser machines from our other cluster after decommissioning them in the same way, and installed CentOS 6.4 there, too. The 3 lesser machines would form our zookeeper ensemble and master nodes in the new cluster.</span></div>
<h4>
<span style="font-family: inherit;">Enter CDH Manager</span></h4>
<div>
<span style="font-family: inherit;">The folks at <a href="http://www.cloudera.com/content/cloudera/en/home.html" target="_blank">Cloudera</a> have made a free version of their <a href="https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Downloads" target="_blank">CDH Manager app</a> available, and it makes managing a cluster much, much easier. After setting up the 6 machines that would form the basis of our new cluster with just the barebones OS, we were ready to start wielding the manager. We made a small VM to hold the manager app and installed it there. The <a href="http://www.cloudera.com/content/support/en/documentation/manager-free/cloudera-manager-free-v4-latest.html" target="_blank">manager instructions</a> are pretty good, so I won't recreate them here. We had trouble with the key-based install so had to resort to setting identical passwords for root and allowing root ssh access for the duration of the install, but other than that it all went pretty smoothly. We installed in the following configuration (the master machines are the lesser ones described above, and the slaves the more powerful machines).</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<style type="text/css">.nobrtable br { display: none } tr {text-align: center;} tr.alt td {background-color: #eeeeee; color: black;} tr {text-align: center;} caption {caption-side:bottom;}</style>
<br />
<center>
<div class="nobrtable">
<table border="2" bordercolor="#000000" cellpadding="10" cellspacing="0" style="background-color: #dddddd; border-collapse: collapse; width: 70%px;">
<caption>Machine and Role assignments</caption>
<tbody>
<tr style="background-color: #dddddd; color: black; padding-bottom: 4px; padding-top: 5px;">
<th>Machine</th>
<th>Roles</th>
</tr>
<tr class="alt">
<td>master1</td>
<td>HDFS Primary NameNode, Zookeeper Member, HBase Master (secondary)</td>
</tr>
<tr class="alt">
<td>master2</td>
<td>HDFS Secondary NameNode, Zookeeper Member, HBase Master (primary)</td>
</tr>
<tr class="alt">
<td>master3</td>
<td>Hadoop JobTracker, Zookeeper Member, HBase Master (secondary)</td>
</tr>
<tr class="alt">
<td>slave1</td>
<td>HDFS DataNode, Hadoop TaskTracker, HBase Regionserver</td>
</tr>
<tr class="alt">
<td>slave2</td>
<td>HDFS DataNode, Hadoop TaskTracker, HBase Regionserver</td>
</tr>
<tr class="alt">
<td>slave3</td>
<td>HDFS DataNode, Hadoop TaskTracker, HBase Regionserver</td></tr>
</tbody></table>
</div>
</center>
<div>
<h4>
<span style="font-family: inherit;">Copy the data</span></h4>
</div>
<div>
<span style="font-family: inherit;">Now we had two running clusters - our old CDH3u3 cluster (with half its machines removed) and the new, empty CDH 4.2.1 cluster. The trick was how to get data from the old cluster into the new, with our primary concern being the data in HBase. The builtin facility for this sort of thing is called CopyTable, and sounds great, except that it doesn't work across major versions of HBase, so that was out. Next we looked at copying the HFiles directly from the old cluster to the new using the HDFS builtin command </span><span style="font-family: Courier New, Courier, monospace;">distcp</span><span style="font-family: inherit;">. Because we could handle shutting down HBase on the old cluster for the duration of the copy, this, in theory, should work - newer versions of HBase can read the older versions' HFiles and then write the new versions during compactions (and by shutting down we don't run the risk of missing updates from caches that haven't flushed, etc). And in spite of lots of warnings around the net that it wouldn't work, we tried it anyway. And it didn't work :) We managed to get the -ROOT- table up, but it couldn't find .META., and that's where our patience ended. The next, and thankfully successful, attempt was using HBase export, distcp, and HBase import.</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;">On the old cluster we ran:</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">bin/hadoop jar hbase-0.90.4-cdh3u3.jar export table_name /exports/table_name</span></div>
<div>
<br /></div>
<div>
for each of our tables, which produced a bunch of sequence files in the old cluster's HDFS. Those we copied over to the new cluster using HDFS's <span style="font-family: Courier New, Courier, monospace;">distcp</span> command:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">bin/hadoop distcp hftp://old-cluster-namenode:50070/exports/table_name hdfs://master1:8020/imports/table_name</span></div>
<div>
<br /></div>
<div>
which takes advantage of the builtin http-like interface (hftp) that HDFS provides, making the copy process version-agnostic.</div>
<div>
<br /></div>
<div>
Finally on the new cluster we can import the copied sequence files into the new HBase:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">bin/hadoop jar hbase-0.94.2-cdh4.2.1-security.jar import table_name /imports/table_name</span></div>
<div>
<br /></div>
<div>
Make sure the table exists before you import, and because the import is a mapreduce job that does Puts, it would also be wise to presplit any large tables at creation time so that you don't crush your new cluster with lots of hot regions and splitting. Also, one known issue in this version of HBase is a performance regression from version 0.92 to 0.94 (detailed in <a href="https://issues.apache.org/jira/browse/HBASE-7868" target="_blank">HBASE-7868</a>), which you can work around by adding the following to your table definition:</div>
<br />
<span style="font-family: Courier New, Courier, monospace;">DATA_BLOCK_ENCODING => 'FAST_DIFF'</span><br />
<br />
e.g. <span style="font-family: Courier New, Courier, monospace;">create 'test_table', {NAME=>'cf', COMPRESSION=>'SNAPPY', VERSIONS=>1, DATA_BLOCK_ENCODING => 'FAST_DIFF'}</span><br />
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;">As per that linked issue, you should also enable short-circuit reads from the CDH Manager interface.</span></div>
<div>
<br /></div>
<div>
And to complete the copying process, run major compactions on all your tables to ensure the best data locality you can for your regionservers.</div>
<h4>
All systems go</h4>
<div>
After running checks on the copied data, and updating our software to talk to CDH4, we were happy that our new cluster was behaving as expected. To get back to our normal performance levels we then shutdown the remaining machines in the CDH3u3 cluster, wiped and installed the latest OS, and then told CDH Manager to install on them. A few minutes later we had all our M/R slots back, as well as our regionservers. We ran the HBase balancer to evenly spread out the regions, ran another major compaction on our tables to force data-locality, and we were back in business!</div>
<div>
<br /></div>
<div>
<br /></div>
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>Oliver Meynhttp://www.blogger.com/profile/04706642473308341930noreply@blogger.com2tag:blogger.com,1999:blog-2326624813533383062.post-49501115907887246402013-02-08T12:45:00.000+01:002013-02-12T10:51:38.852+01:00Data cleaning: Using MySQL to identify XML breaking charactersSometimes publishers have problems with data resources that contain control characters that will break the XML response if they are included.<br />
<br />
Publishers that share datasets through the DiGIR and TAPIR protocols are especially vulnerable to text fields that contain polluted data. Information about locality (http://rs.tdwg.org/dwc/terms/index.htm#locality) is often quite rich and can be copied from diverse sources, possibly entering the database table without having been through a verification or cleaning process. The locality string can be copy/pasted from a file into the locality column, the data can be mass loaded with LOAD DATA INFILE, or it can be bulk inserted; each of these methods carries a risk that unintended characters enter the table.<br />
<br />
Even if you have time and are meticulous, you could miss certain control characters because they are invisible to the naked eye. So what are publishers (some with limited resources) going to do to ferret out these XML-breaking characters? Assuming that you have access to the MySQL database itself, you can identify these pesky control characters in a few basic steps: create a small table, insert some hexadecimal values into it (sounds much harder than it is), and finally run a query that picks out these ‘illegal’ characters from the table that you specify.<br />
<br />
We start out with creating a table to hold the values for the problematic characters so that we can use them in a query:<br />
<br />
<blockquote>CREATE TABLE control_char (<br />
id int(4) NOT NULL AUTO_INCREMENT,<br />
hex_val CHAR(2),<br />
PRIMARY KEY(id) <br />
) DEFAULT CHARACTER SET = utf8;</blockquote><br />
The DEFAULT CHARACTER SET declaration forces UTF-8 compliance, which the string functions used later require.<br />
We then populate the table with these hex values that represent control characters:<br />
<br />
<blockquote>INSERT INTO control_char (hex_val)<br />
VALUES<br />
('00'),('01'),('02'),('03'),('04'),('05'),('06'),('07'),('08'),('09'),('0a'),('0b'),('0c'),('0d'),('0e'),('0f'),<br />
('10'),('11'),('12'),('13'),('14'),('15'),('16'),('17'),('18'),('19'),('1a'),('1b'),('1c'),('1d'),('1e'),('1f')<br />
;</blockquote><br />
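If you would rather not type out the 32 value pairs by hand, they can be generated. Here is a small sketch (Python, purely illustrative and not part of the original workflow; the table and column names match the SQL above):

```python
# Illustrative only: build the INSERT statement for the C0 control
# range 0x00-0x1f instead of typing the hex pairs by hand.
pairs = [format(i, "02x") for i in range(0x20)]            # '00' .. '1f'
values = ",".join("('{}')".format(p) for p in pairs)
sql = "INSERT INTO control_char (hex_val)\nVALUES\n{};".format(values)
print(sql)
```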
You can read more about these values here: <a href="http://en.wikipedia.org/wiki/C0_and_C1_control_codes">http://en.wikipedia.org/wiki/C0_and_C1_control_codes</a> <br />
<br />
At this point you may ask why the control_char table is not a temporary table, as you might not want it to be a permanent feature in the database. The reason is, sadly, that MySQL has a long-standing bug that prevents a temporary table from being referenced more than once (<a href="http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html">http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html</a>), and we have to reference it more than once, as you will see later.<br />
<br />
Now on to the main query – these declarations test the table and column that you specify against the control_char table:<br />
<blockquote>SELECT t1.* FROM scinames_harvested t1, control_char<br />
WHERE LOCATE(control_char.hex_val, HEX(t1.scientific_name)) MOD 2 != 0;</blockquote><br />
The query references two tables; one is a table of roughly 5000 records containing a record primary key, scientific_name and some other columns. Some of the scientific name strings are polluted with characters that we want to get rid of. The second table contains the control characters.<br />
The way we ensure that the LOCATE function tests for value pairs two steps at a time is by using the modulo operator MOD. Remember we want to look through the scientific_name char string after it has been converted to hexadecimal values (HEX) that consist of value pairs. We don’t want to test across value pairs! LOCATE returns the 1-based position of the first match (or 0 if there is none), so MOD 2 != 0 keeps only matches at odd positions, i.e. matches aligned to a whole hex pair.<br />
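To make the pair-alignment idea concrete, here is the same detection logic outside of MySQL, as a Python sketch (mine, not from the original post). It mirrors HEX() plus LOCATE(...) MOD 2 != 0, including the limitation that only the first occurrence of each pair is checked:

```python
# Hex pairs for the C0 control characters, same values as the control_char table.
CONTROL_HEX = [format(i, "02x") for i in range(0x20)]  # '00' .. '1f'

def has_control_char(value: str) -> bool:
    """True if the UTF-8 hex encoding of `value` contains a C0 control
    character aligned on a byte (hex-pair) boundary."""
    hexed = value.encode("utf-8").hex()
    for pair in CONTROL_HEX:
        pos = hexed.find(pair)  # LOCATE() is 1-based; find() is 0-based
        # LOCATE(...) MOD 2 != 0 means an odd 1-based position, which is an
        # even 0-based offset, i.e. the pair lines up with a whole byte.
        if pos != -1 and pos % 2 == 0:
            return True
    return False

print(has_control_char("Adelotus brevis\x0b"))  # True: ends in a vertical tab
print(has_control_char("Adelotus brevis"))      # False: clean string
```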
<br />
Running the query, in this instance, gives me five records with characters that are not kosher:<br />
<br />
<a href="http://2.bp.blogspot.com/-TDfCxJC6iGo/URTktXdaW8I/AAAAAAAAAIw/79AeXAEqLlU/s1600/control_char.png" imageanchor="1" style=""><img border="0" height="151" width="281" src="http://2.bp.blogspot.com/-TDfCxJC6iGo/URTktXdaW8I/AAAAAAAAAIw/79AeXAEqLlU/s400/control_char.png" /></a><br />
<br />
This is pretty neat if the alternative is eyeballing each and every record.<br />
Note that I cannot guarantee that this will properly process every character from the UTF-8 Latin-1 supplement <a href="http://en.wikipedia.org/wiki/C1_Controls_and_Latin-1_Supplement">http://en.wikipedia.org/wiki/C1_Controls_and_Latin-1_Supplement</a> <br />
<br />
If you want to create a test table and try out the queries above, this UPDATE query template will change the string into something containing control characters:<br />
<blockquote>UPDATE your_table SET your_column = CONCAT('Adelotus brevis', X'0B') WHERE id = 12345;</blockquote>In the CONCAT call the second argument looks funny, but you have to remember that the X in front of '0B' tells MySQL that a hex value is coming. In this case it is a line-tabulation character: <a href="http://www.fileformat.info/info/unicode/char/000b/index.htm">www.fileformat.info/info/unicode/char/000b/index.htm</a>. This part can be edited to other values for test purposes. Naturally the CONCAT function can take any number of strings for concatenation. <br />
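The queries above identify the polluted records; actually removing the control characters is then a separate step. A minimal cleanup sketch (again Python and illustrative only; how you write the cleaned value back with an UPDATE depends on your setup):

```python
import re

# Matches the C0 control characters U+0000..U+001F. Note this range also
# covers tab (0x09) and newline (0x0a); narrow it if those are legitimate
# in your data.
C0_CONTROLS = re.compile(r"[\x00-\x1f]")

def strip_control_chars(value: str) -> str:
    """Remove C0 control characters from a string."""
    return C0_CONTROLS.sub("", value)

print(strip_control_chars("Adelotus brevis\x0b"))  # prints "Adelotus brevis"
```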
<br />
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>Jan K. Legindhttp://www.blogger.com/profile/11185887314419707389noreply@blogger.com0