Monday, 25 July 2011

Customizing the IPT

One of my responsibilities as the Biodiversity Informatics Manager for Canadensys is to develop a data portal giving access to all the biodiversity information published by the participants of our network. A huge portion of this task can now be done with the GBIF Integrated Publishing Toolkit version 2 or IPT. The IPT allows you to host biodiversity resources, manage their data and metadata, and register them with GBIF so they can appear on the GBIF data portal - all of which are goals we want to achieve. Best of all, most of the management can be done by the collection managers themselves.

I have tested the IPT thoroughly and I am convinced the GBIF development team has done an excellent job creating a stable tool I can trust. This post explains how I have customized our IPT installation to integrate it with our other Canadensys websites.


Background

Our Canadensys community portal is powered by WordPress (MySQL, PHP), while our data portal - which before the IPT installation only consisted of the Database of Vascular Plants of Canada (VASCAN) - is a Tomcat application. We use different technologies because we want the most appropriate technology for each website. WordPress (or Drupal for that matter) is an excellent and easy-to-use CMS, perfect for our community portal, but not suitable for a custom-made checklist website like VASCAN. To the user, however, both websites look the same:


We do this by using the same HTML markup and CSS for both websites. If you want to learn HTML and CSS, w3schools provides excellent tutorials.

The HTML markup defines the elements on a page (e.g. header, menu, content, sidebar, footer) and the CSS styles those elements (e.g. their position and color). The CSS is typically stored as one file (e.g. style.css) which is referenced in the <head> section of a page. For dynamic websites, the HTML is typically stored as different files, one for each section of a page (e.g. header.php, sidebar.php). Those files are combined into one page by the server when a page is requested. That way, changing a common element on all pages of a website (e.g. the header) can be done by changing just one file.

All of this also applies to the IPT. Here's how the IPT looks without CSS:


Attempt 1 - Editing the CSS and logo

My first attempt at customizing the IPT, at the Experts Workshop in Copenhagen, was to change only the CSS and logo, which you can find in the /styles folder of your IPT installation:

/styles/main.css
/styles/logo.jpg

In 15 minutes, my IPT was Canadensys red and had a custom logo:


Attempt 2 - Editing the FreeMarker files

Even though my IPT now had its own branding, it was still noticeably different from the other Canadensys websites. The only way I could change that was by editing the HTML as well. Luckily, the sections I wanted to change were all stored as FreeMarker files in the /inc folder:

/WEB-INF/pages/inc/header.ftl - the <head> section
/WEB-INF/pages/inc/menu.ftl - the header, menu and sidebar
/WEB-INF/pages/inc/footer.ftl - the footer
/WEB-INF/pages/inc/header_setup.ftl - the header during installation

I incorporated the HTML structure I use for the VASCAN website into menu.ftl (including the header, menu, container and sidebar), making sure I did not break any of the IPT functionality.

I started doing the same with main.css by replacing chunks of now unused IPT CSS with CSS I copied over from VASCAN, but I quickly realized that this wasn't the best option. Doing so would result in two CSS files - one for VASCAN and one for the IPT - even though both web applications live under the same domain name and share a lot of CSS. It would be easier if I only had to maintain a single stylesheet, used by both applications.

Attempt 3 - One styles folder for the data portal

I created a /common/styles folder under ROOT, where I placed my single common data portal stylesheet: /common/styles/common.css. This would be the CSS file I could use for IPT and VASCAN. I did the same for my favicon: /common/images/favicon.png.

I added a reference to both files in the header.ftl of my IPT (and VASCAN):

<link rel="stylesheet" type="text/css" href="${baseURL}/styles/main.css">
<link rel="stylesheet" type="text/css" href="http://data.canadensys.net/common/styles/common.css">
<link rel="shortcut icon" href="http://data.canadensys.net/common/images/favicon.png">

As you can see on the first line, I kept the reference to the default IPT stylesheet: ${baseURL}/styles/main.css (it's perfectly fine to reference more than one CSS file). This is where I keep all the unaltered (= default) IPT CSS. In fact, I'm not removing anything from the default IPT stylesheet; I'm only commenting out the CSS that is unused or conflicting:

/* Unused or conflicting CSS */

The advantage of doing so is that I can now easily compare this commented file with the stylesheet of any new IPT version and spot the changes.

After all of this was done, my IPT looked like this:


My IPT is now sporting the Canadensys header, footer and sidebar (visible only when editing a resource), making it indistinguishable from the other Canadensys websites. It also uses a more readable font size (13.5px) and a fluid width.

Closing remarks

I have (re)designed quite a lot of websites, and very often I have been so frustrated with the HTML and CSS that I just started over from scratch. I didn't have that option here, and it wasn't necessary either. I would like to thank the GBIF development team for creating such an easily customizable tool, with logical HTML and CSS. As a reminder, the whole customization was done by editing only five files (links show default files):

/styles/main.css (custom file)
/WEB-INF/pages/inc/header.ftl
/WEB-INF/pages/inc/menu.ftl
/WEB-INF/pages/inc/footer.ftl
/WEB-INF/pages/inc/header_setup.ftl

Important: Remember that installing a new IPT version will overwrite all the customized files, so make sure to back them up! I will try to figure out a way to reapply my customization automatically after an update and write about that experience in a follow-up post. In the meantime, I hope this post will help others customize their own IPT.

Monday, 18 July 2011

Working with Scientific Names

Dealing with scientific names is a regular and important part of our work at GBIF. Scientific names are highly structured strings with a syntax governed by a nomenclatural code; unfortunately there are different codes for botany, zoology, bacteria, viruses and even cultivar names. When dealing with names we often do not know which code or classification a name belongs to, so we need a representation that is as code-agnostic as possible. GBIF came up with a structured representation which is a compromise focusing on the most common names, primarily botanical and zoological names, which are quite similar in their basic form.

The ParsedName class

Our ParsedName class provides us with the following core properties:

    genusOrAbove
    infraGeneric
    specificEpithet
    rankMarker
    infraSpecificEpithet
    authorship
    year
    bracketAuthorship
    bracketYear
These allow us to represent regular names properly. For example Agalinis purpurea var. borealis (Berg.) Peterson 1987 is represented as:

    genusOrAbove=Agalinis
    specificEpithet=purpurea
    rankMarker=var.
    infraSpecificEpithet=borealis
    authorship=Peterson
    year=1987
    bracketAuthorship=Berg.
or the botanical section Maxillaria sect. Multiflorae Christenson as:

    genusOrAbove=Maxillaria
    infraGeneric=Multiflorae
    rankMarker=sect.
    authorship=Christenson
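To make the mapping from name string to properties concrete, here is a minimal sketch of how the Agalinis example might be populated in code. The setter names are assumptions derived from the property names above; the real ParsedName class in ecat-common may differ:

    // hypothetical sketch only - setter names assumed from the property list above
    ParsedName pn = new ParsedName();
    pn.setGenusOrAbove("Agalinis");
    pn.setSpecificEpithet("purpurea");
    pn.setRankMarker("var.");
    pn.setInfraSpecificEpithet("borealis");
    pn.setAuthorship("Peterson");
    pn.setYear("1987");
    pn.setBracketAuthorship("Berg.");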
Especially in botany you often encounter names with authorships for both the species and some infraspecific rank, or names citing more than one infraspecific rank. These names are not formed according to the rules or recommendations of the respective codes, and we ignore the superfluous parts. For example Agalinis purpurea (L.) Briton var. borealis (Berg.) Peterson 1987 is represented exactly the same as Agalinis purpurea var. borealis above. In the case of four-part names like Senecio fuchsii C.C.Gmel. subsp. fuchsii var. expansus (Boiss. & Heldr.) Hayek only the lowest infraspecific rank is preserved:

    genusOrAbove=Senecio
    specificEpithet=fuchsii
    rankMarker=var.
    infraSpecificEpithet=expansus
    authorship=Hayek
    bracketAuthorship=Boiss. & Heldr.
Hybrid names are evil. They come in two flavors: named hybrids and hybrid formulas.

Named hybrids are not so bad: they simply prefix a name part with the multiplication sign ×, the hybrid marker, or prefix the rank marker of infraspecific names with notho. Strictly speaking this symbol is not part of the genus or epithet. To represent these notho taxa our ParsedName class contains a property called nothoRank that keeps track of the rank or name part that needs to be marked with the hybrid sign. For example the named hybrid Pyrocrataegus ×willei L.L.Daniel is represented as:

    genusOrAbove=Pyrocrataegus
    specificEpithet=willei
    authorship=L.L.Daniel
    nothoRank=species
Hybrid formulas such as Agrostis stolonifera L. × Polypogon monspeliensis (L.) Desf., Asplenium rhizophyllum × ruta-muraria or Mentha aquatica L. × M. arvensis L. × M. spicata L. cannot be represented by our class. Hybrid formulas can in theory combine any number of names or name parts, so it's hard to deal with them. Luckily they are not very common and we can afford to live with a complete string representation in those cases.

Yet another "extension" to the botanical code are cultivar names, i.e. names for plants in horticulture. Cultivar names are regular botanical names followed by a cultivar name, usually in English and given in single quotes, for example Cryptomeria japonica 'Elegans'. To keep track of this we have an additional cultivar property, so that the name becomes:

    genusOrAbove=Cryptomeria
    specificEpithet=japonica
    cultivar=Elegans
In taxonomic works you often find additional information in a name detailing the taxonomic concept: the sec reference, most often prefixed by sensu or sec., for example Achillea millefolium sec. Greuter 2009 or Achillea millefolium sensu lato. In nomenclatural works one frequently encounters nomenclatural notes about the name, such as nom. illeg. or nomen nudum. Both pieces of information are held in our ParsedName class; for example Solanum bifidum Vell. ex Dunal, nomen nudum becomes:

    genusOrAbove=Solanum
    specificEpithet=bifidum
    authorship=Vell. ex Dunal
    nomStatus=nomen nudum

Reconstructing name strings

The ParsedName class provides us with some common methods to build a name string. In many cases you don't want the complete name with all its details, so we offer some popular name string types out of the box, plus a flexible string builder that you can tell explicitly which parts to include. The most important methods are:

canonicalName(): builds the canonical name sensu stricto, with nothing but the three name parts at most (genus, species, infraspecific). No rank, hybrid markers or authorship information is included.

fullName(): builds the full name with all the details that exist.

canonicalSpeciesName(): builds the canonical binomial for names at species rank or below, ignoring infraspecific information.
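To illustrate, given a ParsedName pn holding the Agalinis example from above, I would expect roughly the following; the exact output formatting is my reading of the descriptions, not verified library output:

    // pn represents Agalinis purpurea var. borealis (Berg.) Peterson 1987
    pn.fullName();             // "Agalinis purpurea var. borealis (Berg.) Peterson 1987"
    pn.canonicalName();        // "Agalinis purpurea borealis" - no rank marker or authorship
    pn.canonicalSpeciesName(); // "Agalinis purpurea" - the binomial only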

The name parser

We decided at GBIF that sharing the complete name string is more reliable than trusting already parsed names. But parsing names by hand is a very tedious enterprise, so we needed to develop a parser that can handle the vast majority of the names we encounter. After a short experimental phase with BNF and other grammars to automatically build a parser, we went back to the start and built something based on good old regular expressions and plain Java code. The parser has now evolved for nearly two years and it might be the best unit-tested class we have ever written at GBIF. It is interesting to take a look at the range of names we use for testing, and also at the tests themselves, to make sure it works as expected.

Parsing names

Using the NameParser in code is trivial. Once you create a parser instance, all you need to do is call the parser.parse(String name) method to get your ParsedName object. As authorships are the hardest, i.e. most variable, part of a name, we have actually implemented two parsers internally: one that tries to parse the complete string, and a fallback that ignores authorships and only extracts the canonical name. The authorsParsed flag on a ParsedName instance tells you if the simpler fallback parser has been used. If a name cannot be parsed at all, an UnparsableException is thrown. This is also the case for viral names and hybrid formulas, as the ParsedName class cannot represent these names. The exception itself has an enumerated property that you can use to find out whether it was caused by a virus name, a hybrid formula or something else. As of today, of the 10,114,724 unique name strings we have indexed, only 116,000 couldn't be parsed, and those are mostly hybrid formulas.
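A minimal usage sketch; the accessor name isAuthorsParsed() and the way the exception exposes its enumerated property are assumptions based on the description above, so check the ecat-common source for the exact names:

    NameParser parser = new NameParser();
    try {
      ParsedName pn = parser.parse("Agalinis purpurea var. borealis (Berg.) Peterson 1987");
      if (!pn.isAuthorsParsed()) {
        // the fallback parser ran: only the canonical name parts are filled in
      }
      System.out.println(pn.fullName());
    } catch (UnparsableException e) {
      // thrown for viral names, hybrid formulas and other unparsable strings;
      // the exception's enumerated property tells you which kind it was
    }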

Normalisation

Apart from the parse method, the name parser also exposes a normalisation method that normalises whitespace, commas, brackets and hybrid markers found in name strings. The parser uses this method internally before the actual parsing takes place. The string is trimmed and only single whitespace is allowed; spaces before commas are removed, while a space after a comma is enforced. Similarly, whitespace is added before opening brackets but removed inside them. Instead of the proper multiplication sign ×, a plain letter x followed by whitespace is often used as the hybrid marker; this is also replaced by the method.
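In code this might look as follows - I'm assuming the method is exposed as normalize(String), so check the ecat-common source for the exact name and signature:

    // hypothetical input with doubled whitespace and a plain "x" as hybrid marker
    String normalized = parser.normalize("Mentha aquatica L. x M.  spicata L.");
    // expected per the rules above: single spaces only, and the plain "x "
    // replaced by the hybrid sign: "Mentha aquatica L. × M. spicata L."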

Parsing Webservices

GBIF is offering a free webservice API to parse names using our name parser. It returns the parsed results as JSON and accepts single names as well as batches of names. For larger input data you have to use a POST request (GET requests are restricted in length), but for a few names a simple GET request with the names URL-encoded in the parameter "names" is also accepted. Multiple names can be concatenated with the pipe symbol |. To parse the two names Symphoricarpos albus (L.) S.F.Blake cv. 'Turesson' and Stagonospora polyspora M.T. Lucas & Sousa da Camara 1934 the parser service call looks like this:

http://ecat-dev.gbif.org/ws/parser?names=Symphoricarpos%20albus%20(L.)%20S.F.Blake%20cv.%20'Turesson'|Stagonospora%20polyspora%20M.T.%20Lucas%20%26%20Sousa%20da%20Camara%201934

For manual use we also provide a simple web client to this service, with a form to enter names to be parsed; it also accepts file uploads with one name per line. It is available as part of our tools collection at http://tools.gbif.org/nameparser/.
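A minimal Java client for the GET variant might look like this (error handling omitted; the endpoint and the "names" parameter are as documented above):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;

    public class NameParserClient {
      public static void main(String[] args) throws Exception {
        String names = URLEncoder.encode(
            "Symphoricarpos albus (L.) S.F.Blake cv. 'Turesson'", "UTF-8");
        URL ws = new URL("http://ecat-dev.gbif.org/ws/parser?names=" + names);
        BufferedReader in = new BufferedReader(new InputStreamReader(ws.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line); // the JSON result for the parsed name(s)
        }
        in.close();
      }
    }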

Source Code

All the code is part of a small Java library that we call ecat-common. It is freely available under the Apache 2 license, as is most of our GBIF work, and you are invited to use it. You can browse the code at http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/, download the latest jar from our Maven repository, or include it in your Maven dependencies like this:
    <repositories>
      <repository>
        <id>gbif-all</id>
        <url>http://repository.gbif.org/content/groups/gbif</url>
      </repository>
    </repositories>
    <dependencies>
      <dependency>
        <groupId>org.gbif</groupId>
        <artifactId>ecat-common</artifactId>
        <version>1.5.1-SNAPSHOT</version>
      </dependency>
    </dependencies>

Friday, 8 July 2011

Are you the keymaster?


As I mentioned previously, I'm starting work on evaluating HBase for our occurrence record needs. In the last little while that has meant coming up with a key structure and/or schema that optimizes reads for one major use case of the GBIF data portal - a user request to download an entire record set, including raw records as well as interpreted ones. The most common form of this request looks like "Give me all records for <taxon>", e.g. "Give me all records for Family Felidae".

So far I'm concentrating more on lookup and retrieval than on writing or data storage optimization, so the schema I'm using is two column families - one for verbatim columns, one for interpreted - for a total of about 70 columns. What we need to figure out is which key to use for HTable's single indexed column, the row key. For all these examples we assume we know the backbone taxonomy id of the taxon concept in question (i.e. Family Felidae is id 123456).

Option 1
Key: native record's unique id

Query style: The simplest way of finding all records that belong to Family Felidae is to scan all of them and check each against the Family column from the interpreted column family. The code looks like this:

    // Uses org.apache.hadoop.hbase.HBaseConfiguration, org.apache.hadoop.hbase.util.Bytes
    // and the client classes HTable, Scan, ResultScanner and Result
    // (the same imports apply to the snippets below).
    HTable table = new HTable(HBaseConfiguration.create(), tableName);
    byte[] cf = Bytes.toBytes(colFam);
    byte[] colName = Bytes.toBytes(col);
    byte[] value = Bytes.toBytes(val);

    Scan scan = new Scan();
    ResultScanner scanner = table.getScanner(scan);

    for (Result result : scanner) {
      byte[] testVal = result.getValue(cf, colName);
      // guard against rows that don't have the column at all
      if (testVal != null && Bytes.compareTo(testVal, value) == 0) {
        // process the matching record here
      }
    }

Because this means transferring all columns of every row to the client before checking if it's even a record we want, it's incredibly wasteful and therefore very slow.  It's a Bad Idea.

Option 2
Key: native record's unique id

Query style: HBase provides a SingleColumnValueFilter that executes our equality check on the server side, thereby saving the transfer of unwanted rows to the client.  Here's the code:

    HTable table = new HTable(HBaseConfiguration.create(), tableName);
    byte[] cf = Bytes.toBytes(colFam);
    byte[] colName = Bytes.toBytes(col);
    byte[] value = Bytes.toBytes(val);

    // SingleColumnValueFilter and CompareFilter come from org.apache.hadoop.hbase.filter
    SingleColumnValueFilter valFilter = new SingleColumnValueFilter(cf, colName, CompareFilter.CompareOp.EQUAL, value);
    // drop rows that lack the column entirely, rather than passing them through
    valFilter.setFilterIfMissing(true);

    Scan scan = new Scan();
    scan.setFilter(valFilter);
    ResultScanner scanner = table.getScanner(scan);
    // iterate over the scanner as in Option 1 - only matching rows reach the client


This is about as good as it gets until we start getting clever :)

Option 3
Key: concatenation of nub-taxonomy "left" with native record's unique id

Query style: We know that a taxonomy is a tree, and our backbone taxonomy is a well-behaved (i.e. true) tree. We can use nested sets to make our "get all children of node x" query much faster, which Markus realized some time ago, and so he thoughtfully included the left and right calculation as part of the backbone taxonomy creation. Individual occurrences of the same taxon share the same backbone taxonomy id, as well as its left and right. One property of nested sets not mentioned in the Wikipedia article is that when the records are ordered by their lefts, the query "give me all records where left is between parent left and parent right" becomes "give me all rows starting with parent left and ending with parent right", which in HBase terms is much more efficient, since we're doing a sequential read from disk without any seeking. So we build the key as leftId_uniqueId and query as follows (note that startRow is inclusive and stopRow is exclusive, and we want exclusive on both ends):


    HTable table = new HTable(HBaseConfiguration.create(), tableName);
    Scan scan = new Scan();
    // left and right are the parent taxon's nested set values;
    // adding 1 to the start makes both ends of the range exclusive
    scan.setStartRow(Bytes.toBytes((left + 1) + "_"));
    scan.setStopRow(Bytes.toBytes(right + "_"));

    ResultScanner scanner = table.getScanner(scan);

Which looks pretty good, and is in fact about 40% faster than Option 2 (on average - it depends on the size of the query result).  But on closer inspection, there's a problem.  By concatenating the left and unique ids with an underscore as separator, we've created a String, and now HBase is doing its usual lexicographical ordering, which means our rows aren't ordered as we'd hoped.  For example, this is the ordering we expect:

1_1234
2_3458
3_3298
4_9378
5_3435
10_5439
100_9763

but because these are strings, HBase orders them as:

1_1234
10_5439
100_9763
2_3458
3_3298
4_9378
5_3435

There isn't much we can do here but filter on the client side.  For every key we can extract the left portion, convert it to a Long, and compare it to our range, discarding those that don't match.  It sounds ugly, and it is, but it doesn't add anything appreciable to the processing time, so it would work.
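Here's a sketch of that client-side check, reusing the scanner from the Option 3 snippet (left and right are again the parent's nested set values):

    for (Result result : scanner) {
      String key = Bytes.toString(result.getRow());
      long rowLeft = Long.parseLong(key.substring(0, key.indexOf('_')));
      // keep only rows whose left really falls inside the parent's range
      if (rowLeft > left && rowLeft < right) {
        // a true descendant of our taxon - process it
      }
    }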

Except that there's a more fundamental problem - if we embed the left in our primary key, it only takes one node added to the backbone taxonomy to force an update in half of all the lefts (on average) which means all of our primary keys get rewritten.  At 300 million records and growing, that's not an option.

Option 4
Key: native record's unique id
Secondary index: left to list of unique ids

Query style: Following on from Option 3, we can build a second table to serve as a secondary index.  We use the left as a numeric key (which gives us automatic, correct ordering) and write each corresponding unique occurrence id as a new column in the row.  Then we can do a proper range query on the lefts and generate a distinct Get for each distinct id.  Unfortunately building that index is quite slow - it is still building as I write this - so I haven't been able to test the lookups yet.
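For what it's worth, here is how I expect the lookup side to work once the index is built. This is a sketch under the assumption that each index row key is the left stored as an 8-byte long (so byte order matches numeric order) and that the occurrence ids are column qualifiers in a single family; the table and family names here are made up for illustration:

    // assumes left and right are longs, matching the 8-byte keys in the index
    HTable index = new HTable(HBaseConfiguration.create(), "occurrence_index"); // hypothetical name
    Scan scan = new Scan(Bytes.toBytes(left + 1), Bytes.toBytes(right)); // exclusive on both ends
    for (Result row : index.getScanner(scan)) {
      // every column qualifier in the "ids" family is one occurrence id
      for (byte[] occurrenceId : row.getFamilyMap(Bytes.toBytes("ids")).keySet()) {
        Get get = new Get(occurrenceId);
        Result occurrence = occurrenceTable.get(get); // fetch the full record from the main table
      }
    }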

For those keeping score at home: I'm using HBase 0.89 (from CDH3b4), which doesn't have built-in secondary indexes (as 0.19 and 0.20 did).

I'll write more when I've learned more, and welcome any tips or suggestions you might have to aid in my quest!