Monday 28 October 2013

The new (real-time) GBIF Registry has gone live

For the last 4 years, GBIF has operated the GBRDS registry with its own web application on http://gbrds.gbif.org. Previously, when a dataset was registered in the GBRDS registry (for example using an IPT), it was not visible in the portal until the next rollover took place, often several weeks later.

In October, GBIF launched its new portal on www.gbif.org. During the launch we indicated that real-time data management would be starting up in November. We are excited to inform you that today we took the first step towards making this a reality by enabling live operation of the new GBIF registry.
What does this mean for you?
  • any dataset registered through GBIF (using an IPT, web services, or manually by liaison with the Secretariat) will be visible in the portal immediately because the portal and new registry are fully integrated 
  • the GBRDS web application (http://gbrds.gbif.org) is no longer visible, since the new portal displays all the appropriate information
  • the GBRDS sandbox registry web application (http://gbrdsdev.gbif.org) is no longer visible, but a new registry sandbox has been set up to support IPT installations running in test mode
Please note that the new registry supports the same web service API that the GBRDS previously did, so existing tools and services built on the GBRDS API (such as the IPT) will continue to work uninterrupted.
As you may have noticed, occurrence data crawling has been temporarily suspended since the middle of September to prepare for launching real-time data management. We aim to resume occurrence data crawling in the first week of November, meaning that updates to the index will be visible immediately afterwards.
On behalf of the GBIF development team, I thank you for your patience during this transition time, and hope you are looking forward to real-time data management as much as we are. 

Thursday 24 October 2013

GBIF Backbone in GitHub

For a long time I have wanted to experiment with using GitHub as a tool to browse and manage the GBIF backbone taxonomy. Encouraged by similar sentiments from Rod Page, I thought it would be nice to use git to keep track of versions and to allow external parties to fork parts of the taxonomic tree and push back changes if desired. To top it off there is the great GitHub Treeslider to browse the taxonomy, so why not give it a try?

A GitHub filesystem taxonomy

I decided to export each taxon in the backbone as a folder named after its canonical name, containing two files:

  1. README.md, a simple markdown file that gets rendered by GitHub and shows the basic attributes of a taxon
  2. data.json, a complete JSON representation of the taxon as it is exposed via the new GBIF species API
The filesystem represents the taxonomic classification, and taxon folders are nested accordingly; for example, the species Amanita arctica is represented as:
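(a sketch of the resulting layout; the intermediate ranks follow the backbone classification and are shown here for illustration)

Fungi/
  Basidiomycota/
    Agaricomycetes/
      Agaricales/
        Amanitaceae/
          Amanita/
            Amanita arctica/
              README.md
              data.json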

This is just a first experimental step. One could improve the readme a lot to render more content in a human-friendly way, and include more data in the JSON file such as common names and synonyms.

Getting data into GitHub

It didn't take much to write a small NubGitExporter.java class that exports the GBIF backbone into the filesystem as described above. The export of the entire taxonomy, with its current 4.4 million taxa including synonyms, took about one hour on a MacBook Pro laptop.
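For the curious, the core of such an export boils down to something like the following self-contained sketch (an illustration only, not the actual NubGitExporter code; class, folder and value names are made up):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Writes one taxon folder containing README.md and data.json.
// The real exporter fills the values from the GBIF species API and repeats this
// for every child taxon so that the folder tree mirrors the classification.
public class TaxonFolderSketch {

  static void writeTaxon(Path parentDir, String canonicalName, String rank, String json) throws IOException {
    Path dir = parentDir.resolve(canonicalName);   // one folder per taxon, named by its canonical name
    Files.createDirectories(dir);

    // README.md: a few human-readable attributes rendered by GitHub
    String readme = "# " + canonicalName + "\n\n## Rank\n\n" + rank + "\n";
    Files.write(dir.resolve("README.md"), readme.getBytes(StandardCharsets.UTF_8));

    // data.json: the full record as returned by the species API
    Files.write(dir.resolve("data.json"), json.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws IOException {
    Path amanita = Paths.get("Fungi", "Basidiomycota", "Agaricomycetes", "Agaricales", "Amanitaceae", "Amanita");
    writeTaxon(amanita, "Amanita arctica", "SPECIES",
        "{\"canonicalName\": \"Amanita arctica\", \"rank\": \"SPECIES\"}");
  }
}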
One hour was not bad, I thought, but then I tried to add the generated files into git, and that's when I started to doubt. After waiting half a day for git to add the files to my local index I decided to kill the process and start by only adding the smaller kingdoms first, excluding animals and plants. That left about 335,000 folders and 670,000 files to be added to git. Adding these to my local git still took several hours; committing and finally pushing them onto the GitHub server took yet another 2 hours.

Delta compression using up to 8 threads.
Compressing objects: 100% (1010487/1010487), done.
Writing objects: 100% (1010494/1010494), 173.51 MiB | 461 KiB/s, done.
Total 1010494 (delta 405506), reused 0 (delta 0)
To https://github.com/mdoering/backbone.git

After those files were added to the index, a simple change to the main README file took 15 minutes to commit. Although I like the general idea and the pretty user interface, I fear GitHub, and even git itself, are not made to be a repository of millions of files and folders.

First GitHub impressions

Browsing taxa in GitHub is surprisingly responsive. The fungal genus Amanita contains 746 species, but it loads very quickly. In that regard the GitHub tree browser is much nicer to use than the one on the new GBIF species pages, which of course shows much more information. The rendered readme file is not ideal as it sits at the very bottom of the page, but showing information to humans that way is nice - and the markdown could also be parsed by machines quite easily if we adopt a simple format, for example creating a heading for every property and putting its content into the following paragraph(s), as sketched below.
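A readme following that convention might look like this (the property names and values are just examples):

# Amanita arctica

## Rank

SPECIES

## Kingdom

Fungi

## Taxonomic status

ACCEPTED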

The Amanita example also reveals a bug in the exporter class when dealing with synonyms (the Amanita readme contains the synonym information) and also with infraspecific taxa. For example, Amanita muscaria contains some weird form information that is erroneously mapped to the species. This obviously should be fixed.

The GitHub browser sorts all files alphabetically. When ranks are mixed (we skip intermediate unknown ranks in the backbone) - see for example the Fungi kingdom - sorting by rank first would be desirable. We could enable this by naming the taxon folders accordingly, prefixing each with a rank token that sorts alphabetically in the correct order.

I have not had the time to try versioning branches of the tree and see how usable that is. I suspect git performance will be really slow, but that might not be a blocker if we only version larger groups and rarely push & pull.

Monday 22 July 2013

Validating scientific names with the forthcoming GBIF Portal web service API

This guest post was written by Gaurav Vaidya, Victoria Tersigni and Robert Guralnick, and is being cross-posted to the VertNet Blog. David Bloom and John Wieczorek read through drafts of this post and improved it tremendously.

A whale variously known as Physeter catodon and Physeter macrocephalus (photograph by Gabriel Barathieu, reused under CC-BY-SA from the Wikimedia Commons)
Validating scientific names is one of the hardest parts of cleaning up a biodiversity dataset: as taxonomists' understanding of species boundaries changes, the names attached to them can be synonymized, moved between genera, or even have their Latin grammar corrected (it's Porphyrio martinicus, not Porphyrio martinica). Different taxonomists may disagree on what to call a species, whether a particular set of populations makes up a species, subspecies or species complex, or even which of several published names corresponds to our modern understanding of that species, such as the dispute over whether the sperm whale is really Physeter catodon Linnaeus, 1758, or Physeter macrocephalus Linnaeus, 1758.

A good way to validate scientific names is to match them against a taxonomic checklist: a publication that describes the taxonomy of a particular taxonomic group in a particular geographical region. It is up to the taxonomists who write such treatises to catalogue all the synonyms that have ever been used for the names in their checklist, and to identify a single accepted name for each taxon they recognize. While these checklists are themselves evolving over time and sometimes contradict each other, they serve as essential points of reference in an ever-changing taxonomic landscape.

Over a hundred digitized checklists have been assembled by the Global Biodiversity Information Facility (GBIF) and will be indexed in the forthcoming GBIF Portal, currently in development and testing. This collection includes large, global checklists, such as the Catalogue of Life and the International Plant Names Index, alongside smaller, more focussed checklists, such as a checklist of 383 species of seed plants found in the Singhalila National Park in India and the 87 species of moss bug recorded in the Coleorrhyncha Species File. Many of these checklists can be downloaded as Darwin Core Archive files, an important format for working with and exchanging biodiversity data.

So how can we match names against these databases? OpenRefine (the recently-renamed Google Refine) is a popular data cleaning tool, with features that make it easy to clean up many different types of data. Javier Otegui has written a tutorial on cleaning biodiversity data in OpenRefine, and last year Rod Page provided tools and a step-by-step guide to reconciling scientific names, establishing OpenRefine as an essential tool for biodiversity data and scientific name cleanup.

Linnaeus' original description of Felis Tigris. From an 1894 republication of Linnaeus' Systema Naturae, 10th edition, digitized by the Biodiversity Heritage Library.
We extended Rod's work by building a reconciliation service against the forthcoming GBIF web services API. We wanted to see if we could use one of the GBIF Portal's biggest strengths -- the large number of checklists it has indexed -- to identify names recognized in similar ways by different checklists. Searching through multiple checklists containing possible synonyms and accepted names increases the odds of finding an obscure or recently created name; and if the same name is recognized by a number of checklists, this may signify a well-known synonymy -- for example, two of the Portal checklists recognize that the species Linnaeus named Felis tigris is the same one that is known as Panthera tigris today.

To do this, we wrote a new OpenRefine reconciliation service that searches for a queried name in all the checklists on the GBIF Portal. It then clusters names using four criteria and counts how often a particular name has the same:
  • scientific name (for example, "Felis tigris"),
  • authority ("Linnaeus, 1758"),
  • accepted name ("Panthera tigris"), and
  • kingdom ("Animalia").

Once you do a reconciliation through our new service, your results will look like this:


Since OpenRefine limits the number of results it shows for any reconciliation, we know only that at least five checklists in the GBIF Portal matched the name "Felis tigris". Of these,
  1. Two checklists consider Felis tigris Linnaeus, 1758 to be a junior synonym of Panthera tigris (Linnaeus, 1758). Names are always sorted by the number of checklists that contain that interpretation, so this interpretation -- as it happens, the correct one -- is at the top of the list.
  2. The remaining checklists all consider Felis tigris to be an accepted name in its own right. They contain mutually inconsistent information: one places this species in the kingdom Animalia, another in the kingdom Metazoa, and the third contains both a kingdom and a taxonomic authority. You can click on each name to find out more details.

Using our reconciliation service, you can immediately see how many checklists agree on the most important details of the name match, and whether a name should be replaced with an accepted name. The same name may also be spelled identically under different nomenclatural codes: for example, does "Ficus" refer to the genus Ficus Röding, 1798 or the genus Ficus L.? If you know that the former is in kingdom Animalia while the latter is in Plantae, it becomes easier to figure out the right match for your dataset.

We've designed a complete workflow around our reconciliation service, starting with ITIS as a very fast first step to catch the most widely recognized names, and ending with EOL's fuzzy-matching search as a final step to look for incorrectly spelled names. For VertNet's 2013 Biodiversity Informatics Training Workshop, we wrote two tutorials that walk you through our workflow:


If you're already familiar with OpenRefine, you can add the reconciliation service with the URL:
Give it a try, and let us know if it helps you reconcile names faster!

The Map of Life project is continuing to work on improving OpenRefine for taxonomic use in a project we call TaxRefine. If you have suggestions for features you'd like to see, please let us know! You can leave a comment on this blog post, or add an issue to our issue tracker on GitHub.

Wednesday 22 May 2013

IPT v2.0.5 Released - A melhor versão até o momento! (The best version yet!)



The GBIF Secretariat is proud to release version 2.0.5 of the Integrated Publishing Toolkit (IPT), available for download on the project website here.

As with every release, it's your chance to take advantage of the most requested feature enhancements and bug fixes.

The most notable feature enhancements include:
  • A resource can now be configured to publish automatically on an interval (See "Automated Publishing" section in User Manual)
  • The interface has been translated into Portuguese, making the IPT available in five languages: French, Spanish, Traditional Chinese, Portuguese and of course English.
  • An IPT can be configured to back up each DwC-Archive version published (See "Archival Mode" in User Manual)
  • Each resource version now has a resolvable URL (See "Versioned Page" section in User Manual)
  • The order of columns in published DwC-Archives is always the same between versions
  • Style (CSS) customizations are easier than ever - check out this new guide entitled "How to Style Your IPT" for more information
  • Hundreds if not thousands of resources can be handled, now that the resource overview tables are filterable, pageable, and sortable (See "Public Resource Table" section in User Manual) 

The most important bug fixes are:
  • Garbled encoding on registration updates has been fixed
  • The problem uploading DwC-Archives in .gzip format has been fixed
  • The problem uploading a resource logo has been fixed
The new look in v2.0.5
The changes mentioned above represent just a fraction of the work that has gone into this version. Since version 2.0.4 was released 7 months ago, a total of 45 issues have been addressed. These are detailed in the issue tracking system.

It is great to see so much feedback from the community in the form of issues, especially as the IPT becomes more stable and comprehensive over time. After all, the IPT is a community-driven project and anyone can contribute patches, translations, or have their say simply by adding or voting on issues.

The single largest community contribution in this version has been the translation into Portuguese done by three volunteers at the Universidade de São Paulo, Research Center on Biodiversity and Computing: Etienne Cartolano, Allan Koch Veiga, and Antonio Mauro Saraiva. With Brazil recently joining the GBIF network, we hope the Portuguese interface for the IPT will help in the publication of the wealth of biodiversity data available from Brazilian institutions.

We’d also like to give special thanks to the other volunteers below:
  • Marie-Elise Lecoq (GBIF France) - Updating French translation
  • Yu-Huang Wang (TaiBIF, Taiwan) - Updating Traditional Chinese translation
  • Dairo Escobar and Daniel Amariles (Colombian Biodiversity Information System (SiB)) - Updating Spanish translation
  • Carlos Cubillos (Colombian Biodiversity Information System (SiB)) - Contributing style improvements
  • Sijmen Cozijnsen (independent contractor working for NLBIF, Netherlands) - Contributing style improvements
On behalf of the GBIF development team, I hope you enjoy using the latest version of the IPT. 

Tuesday 14 May 2013

Migrating our hadoop cluster from CDH3 to CDH4

We've written a number of times on the initial setup, eventual upgrade and continued tuning of our hadoop cluster. Our latest project has been upgrading from CDH3u3 to CDH4.2.1. Upgrades are almost always disruptive, but we decided it was worth the hassle for a number of reasons:

  • general performance improvements in the entire Hadoop/HBase stack
  • continued support from the community/user list (a non-trivial concern - anybody asking questions on the user groups and mailing lists about problems with older clusters is invariably asked to upgrade before people are interested in tackling the problem)
  • multi-threaded compactions (a need we identified in this post)
  • table-based region balancing (rather than just cluster-wide)
We had been managing our cluster primarily using Puppet, with all the knowledge of how the bits worked together firmly within our dev team. In an effort to make everyone's lives easier, reduce our bus factor, and get the server management back into the hands of our ops team, we've moved to CDH Manager to control our CDH installation. That's been going pretty well so far, but we're getting ahead of ourselves...

The Process

We have 6 slave nodes with a lot of disk capacity, since we spec'd them with a goal of lots of spindles, which meant we got lots of space "for free". Rather than upgrading in place, we decided to start fresh with new master & zookeeper nodes, and we calculated that we'd have enough space to pull half the slaves into the new cluster without losing any data. We cleaned up all the tmp files and anything we deemed not worth saving from HBase and hdfs, and started the migration:

Reduce the replication factor

We reduced the replication factor to 2 on the 6 slave nodes to reduce the disk use:

hadoop fs -setrep -R 2 /

Decommission the 3 nodes to move

"Decommissioning" is the civilized and safe way to remove nodes from a cluster where there's risk that they contain the only copies of some data in the cluster (they'll block writes but accept reads until all blocks have finished replicating out). To do it add the names of the target machines to an "excludes" file (one per line) that your hdfs config needs to reference, and then refresh hdfs.

The block in hdfs-site.xml:

<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/excluded_hosts</value>
</property>

then run:

bin/hadoop dfsadmin -refreshNodes

and wait for the "under replicated blocks" count on the hdfs admin page to drop to 0 and the decommissioning nodes to move into state "Decommissioned".
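If you prefer the command line over the web UI, the same information is available in the datanode report (read-only, so safe to run at any time):

bin/hadoop dfsadmin -report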

Don't forget HBase

The hdfs datanodes are tidied up now, but don't forget to cleanly shut down the HBase regionservers - run:

./bin/graceful_stop.sh HOSTNAME

from within the HBase directory on the host you're shutting down (specifying the real name for HOSTNAME). It will shed its regions and shut down when tidied up (more details here).

Now you can shut down the tasktracker and datanode, and then the machine is ready to be wiped.

Build the new cluster

We wiped the 3 decommissioned slave nodes and installed the latest version of CentOS (our linux of choice, version 6.4 at time of writing). We also pulled 3 much lesser machines from our other cluster after decommissioning them in the same way, and installed CentOS 6.4 there, too. The 3 lesser machines would form our zookeeper ensemble and master nodes in the new cluster.

Enter CDH Manager

The folks at Cloudera have made a free version of their CDH Manager app available, and it makes managing a cluster much, much easier. After setting up the 6 machines that would form the basis of our new cluster with just the barebones OS, we were ready to start wielding the manager. We made a small VM to hold the manager app and installed it there. The manager instructions are pretty good, so I won't recreate them here. We had trouble with the key-based install, so we had to resort to setting identical passwords for root and allowing root ssh access for the duration of the install, but other than that it all went pretty smoothly. We installed in the following configuration (the master machines are the lesser ones described above, and the slaves the more powerful machines).



Machine and Role assignments

Machine   Roles
master1   HDFS Primary NameNode, Zookeeper Member, HBase Master (secondary)
master2   HDFS Secondary NameNode, Zookeeper Member, HBase Master (primary)
master3   Hadoop JobTracker, Zookeeper Member, HBase Master (secondary)
slave1    HDFS DataNode, Hadoop TaskTracker, HBase Regionserver
slave2    HDFS DataNode, Hadoop TaskTracker, HBase Regionserver
slave3    HDFS DataNode, Hadoop TaskTracker, HBase Regionserver

Copy the data

Now we had two running clusters - our old CDH3u3 cluster (with half its machines removed) and the new, empty CDH 4.2.1 cluster. The trick was how to get data from the old cluster into the new, with our primary concern being the data in HBase. The built-in facility for this sort of thing is called CopyTable, and it sounds great, except that it doesn't work across major versions of HBase, so that was out. Next we looked at copying the HFiles directly from the old cluster to the new using the HDFS built-in command distcp. Because we could handle shutting down HBase on the old cluster for the duration of the copy, this should in theory work - newer versions of HBase can read the older versions' HFiles and then write the new versions during compactions (and by shutting down we don't run the risk of missing updates from caches that haven't flushed, etc). And in spite of lots of warnings around the net that it wouldn't work, we tried it anyway. And it didn't work :) We managed to get the -ROOT- table up, but it couldn't find .META., and that's where our patience ended. The next, and thankfully successful, attempt was using HBase export, distcp, and HBase import.

On the old cluster we ran:

bin/hadoop jar hbase-0.90.4-cdh3u3.jar export table_name /exports/table_name

for each of our tables, which produced a bunch of sequence files in the old cluster's HDFS. Those we copied over to the new cluster using HDFS's distcp command:

bin/hadoop distcp hftp://old-cluster-namenode:50070/exports/table_name hdfs://master1:8020/imports/table_name

which takes advantage of the built-in HTTP-like interface (hftp) that HDFS provides, making the copy process version-agnostic.

Finally on the new cluster we can import the copied sequence files into the new HBase:

bin/hadoop jar hbase-0.94.2-cdh4.2.1-security.jar import table_name /imports/table_name

Make sure the table exists before you import, and because the import is a mapreduce job that does Puts, it would also be wise to presplit any large tables at creation time (see the example after the next snippet) so that you don't crush your new cluster with lots of hot regions and splitting. Also, one known issue in this version of HBase is a performance regression from version 0.92 to 0.94 (detailed in HBASE-7868), which you can work around by adding the following to your table definition:

DATA_BLOCK_ENCODING => 'FAST_DIFF'

e.g. create 'test_table', {NAME=>'cf', COMPRESSION=>'SNAPPY', VERSIONS=>1, DATA_BLOCK_ENCODING => 'FAST_DIFF'}
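On the pre-splitting advice above, split points can be supplied at table creation time in the HBase shell; the split keys below are placeholders and would need to match your own row key distribution:

create 'test_table', {NAME=>'cf', COMPRESSION=>'SNAPPY', VERSIONS=>1, DATA_BLOCK_ENCODING => 'FAST_DIFF'}, {SPLITS => ['a', 'h', 'p', 'x']}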

As per that linked issue, you should also enable short-circuit reads from the CDH Manager interface.

And to complete the copying process, run major compactions on all your tables to ensure the best data locality you can for your regionservers.
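In the HBase shell that is a one-liner per table:

major_compact 'table_name'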

All systems go

After running checks on the copied data and updating our software to talk to CDH4, we were happy that our new cluster was behaving as expected. To get back to our normal performance levels we then shut down the remaining machines in the CDH3u3 cluster, wiped them and installed the latest OS, and then told CDH Manager to install on them. A few minutes later we had all our M/R slots back, as well as our regionservers. We ran the HBase balancer to evenly spread out the regions, ran another major compaction on our tables to force data locality, and we were back in business!


Friday 8 February 2013

Data cleaning: Using MySQL to identify XML breaking characters

Sometimes publishers have problems with data resources that contain control characters that will break the XML response if included. Identifying these characters and removing them can be a daunting task, especially if the dataset contains thousands of records.

Publishers that share datasets through the DiGIR and TAPIR protocols are especially vulnerable to text fields that contain polluted data. Information about locality (http://rs.tdwg.org/dwc/terms/index.htm#locality) is often quite rich and can be copied from diverse sources, thereby entering the database table possibly without having been through a verification or cleaning process. The locality string can be copy/pasted from a file into the locality column, or the data can be mass loaded with LOAD DATA INFILE, or it can be bulk inserted - each of these methods carries a risk that unintended characters enter the table.

Even if you have time and are meticulous, you could miss certain control characters because they are invisible to the naked eye. So what are publishers - some with limited resources - going to do to ferret out these XML-breaking characters? Assuming that you have access to the MySQL database itself, you can identify these pesky control characters in a few basic steps: create a small table, insert some hexadecimal values into it (sounds much harder than it is), and finally run a query that picks out the 'illegal' characters from the table that you specify.

We start out with creating a table to hold the values for the problematic characters so that we can use them in a query:

CREATE TABLE control_char (
id int(4) NOT NULL AUTO_INCREMENT,
hex_val CHAR(2),
PRIMARY KEY(id)
) DEFAULT CHARACTER SET = utf8;

The DEFAULT CHARACTER SET declaration forces UTF-8 compliance, which the string matching used later requires.
We then populate the table with these hex values that represent control characters:

INSERT INTO control_char (hex_val)
VALUES
('00'),('01'),('02'),('03'),('04'),('05'),('06'),('07'),('08'),('09'),('0a'),('0b'),('0c'),('0d'),('0e'),('0f'),
('10'),('11'),('12'),('13'),('14'),('15'),('16'),('17'),('18'),('19'),('1a'),('1b'),('1c'),('1d'),('1e'),('1f')
;

You can read more about these values here: http://en.wikipedia.org/wiki/C0_and_C1_control_codes

At this point you may ask why the control_char table is not a temporary table, as you might not want it to be a permanent feature in the database. The reason is, sadly, that MySQL has a long-standing bug that prevents a temporary table from being referenced more than once in a query (http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html), and we have to reference it more than once, as you will see later.

Now on to the main query – these declarations test the table and column that you specify against the control_char table:
SELECT t1.* FROM scinames_harvested t1, control_char
WHERE LOCATE(control_char.hex_val , HEX(t1.scientific_name)) MOD 2 != 0;

The query references two tables; one is a table of roughly 5000 records containing a record primary key, scientific_name and some other columns. Some of the scientific name strings are polluted with characters that we want to get rid of. The second table contains the control characters.
The MOD keyword is how we make sure LOCATE only counts matches that line up with the start of a hexadecimal pair. Remember that we search the scientific_name string after HEX has converted it into hexadecimal digits that come in pairs (two digits per byte). LOCATE returns the position of a match, and only matches at odd positions start on a pair boundary; matches at even positions span two adjacent pairs, and a return value of 0 means nothing was found - both of these are excluded by the MOD 2 != 0 test.
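A quick illustration with made-up strings shows why the odd/even test matters:

-- 'Ab' is hex 4162: the match for '16' starts at position 2 (even), i.e. it merely
-- spans the boundary between the pairs 41 and 62, so it would be excluded
SELECT LOCATE('16', HEX('Ab'));

-- a real control character 0x16 starts at position 3 (odd) and would be flagged
SELECT LOCATE('16', HEX(CONCAT('A', X'16')));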

Running the query, in this instance, gives me five records with characters that are not kosher:



This is pretty neat if the alternative is eyeballing each and every record.
Note that I cannot guarantee that this will properly process every character from the UTF-8 Latin-1 supplement http://en.wikipedia.org/wiki/C1_Controls_and_Latin-1_Supplement

If you want to create a test table and try out the queries above, this UPDATE query template will change the string into something containing control characters:
UPDATE your_table SET your_column = CONCAT('Adelotus brevis', X'0B') WHERE id = 12345;
In the CONCAT declaration the second part looks funny, but you have to remember that the X in front of '0B' tells MySQL that a hex value is coming. In this case it is a line-tabulation (vertical tab) character: www.fileformat.info/info/unicode/char/000b/index.htm. This part can be edited to other values for test purposes. Naturally the CONCAT function can take any number of strings for concatenation.
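Once a polluted record has been identified, the same hex notation can be used to strip the offending character again; a template along these lines should work (table, column, character value and id are placeholders to adjust to your own data):

UPDATE your_table SET your_column = REPLACE(your_column, X'0B', '') WHERE id = 12345;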