tag:blogger.com,1999:blog-23266248135333830622024-03-16T08:07:52.102+01:00Developer BlogTim Robertsonhttp://www.blogger.com/profile/07889700598656669041noreply@blogger.comBlogger88125tag:blogger.com,1999:blog-2326624813533383062.post-21557016410765665752018-12-04T11:44:00.001+01:002018-12-04T11:44:28.627+01:00Goodbye developer blog, hello data-blog!<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-SpM1WaccSF8/XAZZQVyfFuI/AAAAAAAAAmo/qPTlEC_D4XEe8xRiLwB9SQeVX0qh_MPtwCLcBGAs/s1600/logo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="191" data-original-width="443" height="137" src="https://3.bp.blogspot.com/-SpM1WaccSF8/XAZZQVyfFuI/AAAAAAAAAmo/qPTlEC_D4XEe8xRiLwB9SQeVX0qh_MPtwCLcBGAs/s320/logo.png" width="320" /></a></div>
<br />
<br />
<div style="text-align: center;">
<b><span style="font-size: large;">GBIF has a new blog!</span></b></div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: center;">
<span style="font-size: x-large;"><a href="https://data-blog.gbif.org/">https://data-blog.gbif.org/</a></span></div>
<br />
<div>
<br /></div>
<h2>
What is it?</h2>
A place for GBIF staff and guest bloggers to contribute:<br /><ul>
<li>Statistics </li>
<li>Graphs </li>
<li>Tutorials </li>
<li>Ideas </li>
<li>Opinions </li>
</ul>
<h2>
Who can contribute?</h2>
If you would like to contribute, you can contact jwaller@gbif.org. <b>Guest blogs are very welcome</b>.<br /><h2>
How can I write a post?</h2>
There is a short tutorial on the <a href="https://github.com/gbif/data-blog">blog's GitHub repository</a>.<br /><h2>
What about the developer blog?</h2>
The developer blog will remain up as an archive, but there are no plans to actively post new content here. <div>
<br /></div>
<div>
<br /></div>
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>John Wallerhttp://www.blogger.com/profile/01361374410016999446noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-69588886026350912892018-07-27T11:41:00.003+02:002018-07-29T09:39:23.691+02:00How popular is your favorite species? <br />
<link crossorigin="anonymous" href="https://use.fontawesome.com/releases/v5.2.0/css/all.css" integrity="sha384-hWVjflwFxL6sNzntih27bfxkr27PmbbK/iSvJ+a4+0owXq79v+lsFkW54bOGbiDQ" rel="stylesheet"></link><br />
<br />
<div><iframe align="right" height="800" src="http://178.128.167.105/shiny/gbifDownloadTrends//?_inputs_&selectInput=%5B%22Animalia%20total%3A%20208011%22%2C%22Bacteria%20total%3A%20219%22%2C%22Fungi%20total%3A%205839%22%2C%22Plantae%20total%3A%20196763%22%5D" style="border: none;" width="100%"><br />
</iframe><br />
</div><br />
<h2>How to use</h2><div>Use the box to the left to <b>type in</b> the species you are interested in.<br />
Make sure to use a <b>scientific name:</b><br />
<ul><li><b>Aves</b> instead of <b>birds</b></li>
<li><b>Plantae</b> instead of <b>plants</b></li>
<li><b>Anura </b> instead of <b>frogs</b></li>
</ul></div><h2>Explanation of tool</h2>This tool plots downloads through time for species or other taxonomic groups with <b>more than 25 downloads</b> at GBIF. Downloads at GBIF most often occur through the <a href="https://www.gbif.org/occurrence/search">web interface</a>. In a <a href="http://gbif.blogspot.com/2018/06/occurrence-downloads-occurrences-at.html">previous post</a>, we saw that most users download data from GBIF by filtering on scientific name (aka Taxon Key). Since the GBIF index currently sits at over <b>1 billion records</b> (a 400+ GB CSV), most users will simply filter by their taxonomic group of interest and then generate a download.<br />
<h2>How to bookmark a result?</h2>If you would like to bookmark a result or graph to share with others, you can visit the app page directly: <a href="http://178.128.167.105/shiny/gbifDownloadTrends/">app link</a>. On this page the state of the app is saved inside the URL. You can also save a JPG by clicking on the hamburger menu icon <i class="fas fa-bars"></i> in the top right. <br />
<h2>What counts as a download?</h2>For the graphs above, I decided that it would be more meaningful to roll up downloads <strong>below</strong> the queried taxonomic level.<br />
<ul><li>If a user downloaded 5 different bird species at once, this would count as <strong>1 download</strong> for Aves and <strong>1 download</strong> for each of the species downloaded.</li>
<li>If a user <strong>only typed in Aves</strong> in the <a href="https://www.gbif.org/occurrence/search?taxon_key=212">occurrence download interface</a> and not any other species, this would count as only 1 download for Aves and <strong>0 downloads for all bird species</strong>.</li>
<li>Similarly, if a user only typed the order <strong>Passeriformes</strong> into the search, this would count as 1 download for <strong>Passeriformes</strong> and 1 download for <strong>Aves </strong>(and 1 download for Animalia etc.) but <strong>0 downloads</strong> for all the species, families, and genera within Passeriformes.</li>
</ul>It is possible, but not as easy, to get data from GBIF <b>without generating a download</b>. In fact, users can stream data using the GBIF occurrence API without ever generating a download: currently users can “download” 200k-long chunks of occurrence data by paging through the API. If someone got their data using the API in this way, we currently have no way to track it. Presumably, the vast majority of users are getting their data directly through the <a href="https://www.gbif.org/occurrence/search">web interface</a>.<br />
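As a concrete illustration, here is a minimal Python sketch of that kind of API-based streaming, paging through the public occurrence search endpoint (<code>/v1/occurrence/search</code> with its <code>taxonKey</code>, <code>limit</code> and <code>offset</code> parameters). The helper names are mine, not part of any GBIF client library.

```python
# Sketch: stream GBIF occurrences via the search API without a download.
# The endpoint and its parameters are real; the helpers are illustrative.
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.gbif.org/v1/occurrence/search"

def page_url(taxon_key, offset=0, limit=300):
    """Build the URL for one page of results (300 records per page)."""
    return API_BASE + "?" + urlencode(
        {"taxonKey": taxon_key, "limit": limit, "offset": offset})

def stream_occurrences(taxon_key, max_records=600):
    """Yield occurrence records page by page, never creating a download."""
    offset = 0
    while offset < max_records:
        with urllib.request.urlopen(page_url(taxon_key, offset)) as resp:
            page = json.load(resp)
        yield from page["results"]
        if page.get("endOfRecords"):
            break
        offset += page["limit"]

# First page of bird occurrences (Aves has taxonKey 212):
print(page_url(212))
```

Fetches made this way never show up in the download statistics discussed above, which is exactly the tracking gap the post describes.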
<br />
For more technical details on this tool, you can visit my personal blog:<br />
<a href="http://www.johnwalleranalytics.org/2018/07/06/gbif-download-trends/">http://www.johnwalleranalytics.org/2018/07/06/gbif-download-trends/</a><br />
<br />
<br />
<br />
<br />
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>John Wallerhttp://www.blogger.com/profile/01361374410016999446noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-60754839534365396702018-06-28T16:22:00.000+02:002018-06-28T16:23:50.144+02:00Occurrence DownloadsOccurrences at GBIF are often downloaded through the <a href="https://www.gbif.org/occurrence/search?q=">web interface</a>, or through the API (via rgbif etc.). Users can place various filters on the data in order to limit the number of records returned. As the occurrence index is currently a 447 GB CSV, most users want to use a filter.<h2 id="totalmonthlydownloads">
Total monthly downloads</h2>
Here I plot the total monthly downloads for various popular filters. For the past few years, GBIF has been averaging around <b>10k downloads</b> per month.<br />
<br />
Two peaks in total downloads stand out:<br />
<ul>
<li>Mar 2014</li>
<li>Sep 2016</li>
</ul>
The <strong>Sep 2016</strong> peak seems to be explained by high <strong>DATASET_KEY</strong> downloads. Both the <strong>Mar 2014</strong> and <strong>Sep 2016</strong> peaks are well explained by the <strong>top users</strong>. Top users in this graph means all the downloads generated by the <strong>top 3 most active users</strong> on GBIF. These users generate downloads in the thousands, and their downloads are most likely automated and generated internally. <br />
<br />
One interesting detail is that while <strong>No Filter Used</strong> is not chosen very often, it accounts for more than <strong>500 billion</strong> occurrence records downloaded. <br />
<br />
Finally, if we look at the <strong>number of unique users</strong> (un-select everything else to see it in isolation), we see that <strong>the number of individuals making downloads on GBIF has been increasing steadily</strong> with some interesting cyclical patterns. The graph below is <b>interactive: you can see different data views by clicking on the names. </b><br />
<br />
<iframe src="https://jhnwllr.github.io/charts/monthlyDownloads.html" style="border: none; height: 500px; width: 100%;"></iframe><br />
<h2 id="typesoffilters">
Popular filters explained</h2>
There are many ways that a user can filter data. The types and combinations of filters are almost limitless. Below I describe some of the <strong>most common</strong> filters:<br />
<br />
<strong>1. TAXON_KEY</strong><br />
<br />
This is one of the most common filters users place on the GBIF occurrence index. Users can either choose <strong>one</strong> or <strong>many</strong> taxon names to filter the data, and users can choose any taxon rank they want (species, genus, family, kingdom etc.).<br />
<br />
<strong>2. COUNTRY</strong><br />
<br />
Here users can return records only from a certain country. This is the country the user searched for and <strong>not</strong> where the user is searching from.<br />
<br />
<strong>3. HAS_GEOSPATIAL_ISSUE</strong><br />
<br />
Here users can specify that they want occurrence records <strong>without interpreted geospatial errors</strong>.<br />
<br />
<strong>4. HAS_COORDINATE</strong><br />
<br />
Here users can specify that they want occurrence records that <strong>have coordinates</strong>.<br />
<br />
<strong>5. No Filter</strong><br />
<br />
Finally, a surprising number of users apply no filter at all and instead request to download the <strong>entire occurrence index</strong>. In the overwhelming majority of cases, we have to assume these users have done this <strong>by mistake</strong>.<br />
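To make the most common case concrete, here is a minimal Python sketch of how a TAXON_KEY filter is expressed as a JSON predicate for the occurrence download API (<code>POST /v1/occurrence/download/request</code>, which requires a GBIF account). The predicate shape follows the documented API; the helper function and username are illustrative.

```python
# Sketch: a TAXON_KEY download request as a JSON predicate.
# The predicate structure follows the GBIF download API docs;
# the helper and "my_gbif_username" are hypothetical.
import json

def taxon_key_request(creator, taxon_key):
    """Build a download request filtering occurrences to one taxon key."""
    return {
        "creator": creator,
        "predicate": {"type": "equals", "key": "TAXON_KEY", "value": str(taxon_key)},
    }

# A download of all Aves occurrences (taxonKey 212):
req = taxon_key_request("my_gbif_username", 212)
print(json.dumps(req["predicate"]))
```

COUNTRY, HAS_COORDINATE and the other filters described above take the same predicate form with a different <code>key</code>, and predicates can be combined with <code>and</code>/<code>or</code> nodes.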
<br />
You can read more about downloads at GBIF here:<br />
<a href="http://www.johnwalleranalytics.org/2018/05/30/gbif-download-statistics/">http://www.johnwalleranalytics.org/2018/05/30/gbif-download-statistics/</a><br />
<br /><div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>John Wallerhttp://www.blogger.com/profile/01361374410016999446noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-15230506798908378402017-06-22T07:38:00.000+02:002017-06-22T09:39:40.954+02:00GBIF Name Parser<div class="tr_bq">
The <a href="https://github.com/gbif/name-parser">GBIF name parser</a> has been a fundamental library for GBIF to parse a scientific name string into a structured representation of a name. It has been refined over many years based on actual name strings encountered in the GBIF occurrence and checklist indices. Over the years the major design goals have not changed much and can be summarised as follows:</div>
<ul>
<li>extract canonical, code relevant name parts</li>
<ul>
<li>populate only the <a href="https://github.com/gbif/gbif-api/blob/master/src/main/java/org/gbif/api/model/checklistbank/ParsedName.java">ParsedName</a> class of the GBIF API</li>
<li>ignore any superfluous name parts irrelevant to the code, e.g. species authorships in infraspecific names, infrageneric placements of species or superfluous infraspecific parts in quadrinomials</li>
</ul>
<li>deal with a wide variety of names that the ParsedName class can represent</li>
<ul>
<li>cultivar names</li>
<li>bacterial strains & candidate names</li>
<li>virus names</li>
<li>named hybrids</li>
<li>taxon concept references, sensu lato/stricto or aggregates</li>
<li>legacy ranks</li>
</ul>
<li>extract notes often found in names:</li>
<ul>
<li>nomenclatural remarks</li>
<li>determination notes like aff. </li>
<li>partially determined species, e.g. only down to the genus: <i>Abies</i> spec.</li>
</ul>
<li>in case author parsing is impossible, fall back to parsing just the canonical name without authors</li>
<li>allow slightly imperfect names not strictly well formed according to the rules</li>
<li>classify names according to our <a href="https://github.com/gbif/gbif-api/blob/master/src/main/java/org/gbif/api/vocabulary/NameType.java#L23">NameType</a> enumeration</li>
</ul>
Compared to <a href="https://github.com/GlobalNamesArchitecture/gnparser">gnparser</a> these are slightly different goals, which explains some of the behavior described in the recent paper from <a href="https://dx.doi.org/10.1186%2Fs12859-017-1663-3">Dmitry Mozzherin 2017</a>. As that paper explains, the GBIF name parser is based on regular expressions, some of them even recursive. This is not the reason why we do not support hybrid formulas, though. Hybrid formulas (e.g. <i>Quercus robur</i> x <i>Q. macrocarpa</i>), as opposed to named hybrids (e.g. <i>Quercus</i> x <i>turneri</i>), are a variable combination of names and thus very different from the Linnean names represented by a ParsedName. For name matching, backbone building and many other tasks hybrid formulas are incompatible, so we decided to treat them just as we treat other unparsable names, such as viruses or OTU names, that do not follow the neat structure of Linnean names. We simply keep the entire string as it was, classify it with a NameType and do not parse it further.<br />
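That design decision can be sketched in a few lines of Python. This is a toy illustration, not the actual Java implementation, and the hybrid-detection heuristic is deliberately crude: a formula combines two names, so the token after the hybrid marker is itself a (possibly abbreviated) genus name, whereas a named hybrid continues with a lowercase epithet.

```python
# Toy sketch of the behavior described above: hybrid formulas are kept
# verbatim and only classified, while named hybrids and ordinary names
# go on to normal parsing. Names and heuristic are illustrative only.
def classify(name):
    """Return (name_type, parsed); parsed is None for unparsable names."""
    parts = name.split()
    if "x" in parts or "\u00d7" in parts:
        marker = parts.index("x") if "x" in parts else parts.index("\u00d7")
        after = parts[marker + 1] if marker + 1 < len(parts) else ""
        if after[:1].isupper():
            # hybrid formula: keep the whole string, classify, don't parse
            return ("HYBRID_FORMULA", None)
        # named hybrid: parsed like an ordinary Linnean name
        return ("SCIENTIFIC", {"genusOrAbove": parts[0]})
    return ("SCIENTIFIC", {"genusOrAbove": parts[0]})

print(classify("Quercus robur x Q. macrocarpa"))
print(classify("Quercus x turneri"))
```

The real parser uses a NameType enumeration with many more categories (virus, OTU, placeholder, etc.), but the principle is the same: classification without decomposition for anything that is not a Linnean name.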
<br />
GBIF exposes the name parser through the <a href="http://www.gbif.org/developer/species#parser">GBIF JSON API</a>, here are some examples for illustration:<br />
<ul>
<li>variety <a href="http://api.gbif.org/v1/parser/name?name=Serjania%20meridionalis%20Cambess.%20var.%20o%27donelli%20F.A.%20Barkley">Serjania meridionalis Cambess. var. o’donelli F.A. Barkley</a></li>
<li>basionym <a href="http://api.gbif.org/v1/parser/name?name=Carex%20scirpoidea%20Michx.%20subsp.%20convoluta%20(K%C3%BCk.)%20D.A.Dunlop">Carex scirpoidea Michx. subsp. convoluta (Kük.) D.A.Dunlop</a></li>
<li>cultivar <a href="http://api.gbif.org/v1/parser/name?name=Stephanandra%20incisa%20(Thunb.)%20Zabel%20cv.%20%27Crispa%27">Stephanandra incisa (Thunb.) Zabel cv. ‘Crispa’</a></li>
<li>subgenus <a href="http://api.gbif.org/v1/parser/name?name=Polana%20(Bulbusana)%20vana%20DeLong%20%26%20Freytag%201972">Polana (Bulbusana) vana DeLong & Freytag 1972</a></li>
<li>named hybrid <a href="http://api.gbif.org/v1/parser/name?name=Quercus%20x%20turneri">Quercus x turneri</a></li>
<li>hybrid formula <a href="http://api.gbif.org/v1/parser/name?name=Quercus%20robur%20x%20Q.%20macrocarpa">Quercus robur x Q. macrocarpa</a></li>
<li>virus <a href="http://api.gbif.org/v1/parser/name?name=Choristoneura%20rosaceana%20entomopoxvirus">Choristoneura rosaceana entomopoxvirus</a></li>
<li>indetermined <a href="http://api.gbif.org/v1/parser/name?name=Abies%20spec.">Abies spec.</a></li>
<li>uncertain determination <a href="http://api.gbif.org/v1/parser/name?name=Rasbora%20aff.%20elegans">Rasbora aff. elegans</a></li>
<li>nomenclatural remark <a href="http://api.gbif.org/v1/parser/name?name=Iridaea%20undulosa%20var.%20papillosa%20Bory%20de%20Saint-Vincent,%20nom.%20nud.">Iridaea undulosa var. papillosa Bory de Saint-Vincent, nom. nud.</a></li>
<li>taxon concept <a href="http://api.gbif.org/v1/parser/name?name=Achillea%20millefolium%20sec.%20Greuter%202009">Achillea millefolium sec. Greuter 2009</a></li>
<li>serovar <a href="http://api.gbif.org/v1/parser/name?name=Salmonella%20enterica%20serovar%20Typhimurium">Salmonella enterica serovar Typhimurium</a></li>
<li>bacterial strain <a href="http://api.gbif.org/v1/parser/name?name=Yersinia%20pestis%20biovar%20orientalis%20str.%20IP674">Yersinia pestis biovar orientalis str. IP674</a></li>
<li>legacy rank <a href="http://api.gbif.org/v1/parser/name?name=Potamon%20(Pontipotamon)%20ibericum%20tauricum%20natio%20trojensis%20Pretzmann,%201983">Potamon (Pontipotamon) ibericum tauricum natio trojensis Pretzmann, 1983</a></li>
<li>sensu lato <a href="http://api.gbif.org/v1/parser/name?name=Taraxacum%20erythrospermum%20s.l.">Taraxacum erythrospermum s.l.</a></li>
<li>placeholder <a href="http://api.gbif.org/v1/parser/name?name=Asteraceae%20incertae%20sedis">Asteraceae incertae sedis</a></li>
</ul>
Authorships are not (yet) parsed into a list of individual authors. This has been done internally already and it is something we are likely to expose in the future. Currently the authorship is parsed into four pieces: the authorship and year for both the combination and the basionym.<br />
<h3>
gnparser in GBIF</h3>
The GNA name parser is a great parser for well-formed names. It has slightly different goals, but since it is available for the JVM we have <a href="https://github.com/gbif/name-parser/blob/master/name-parser-gna/src/main/java/org/gbif/nameparser/GNANameParser.java#L22">wrapped it to support the GBIF NameParser</a> interface producing ParsedName instances. Wrapping the Scala-based gnparser was not as trivial as we had thought due to its different parsing output and the mismatch of Scala and Java collections, but working against the NameParser interface you can finally select your parser of choice.<br />
<br />
The <b>authorship semantics</b> for original names are also slightly different between the two parsers. Again some examples to illustrate the difference:<br />
<strong>Azalea schlippenbachii (Maxim.) Kuntze</strong><br />
Both parsers show the same semantics:<br />
<pre>GBIF:
"authorship": "Kuntze",
"bracketAuthorship": "Maxim.",
GNA:
"value": "(Maxim.) Kuntze",
"basionym_authorship": {
"authors": ["Maxim."]
},
"combination_authorship": {
"authors": ["Kuntze"]
}
</pre>
<strong>Rhododendron schlippenbachii Maxim.</strong><br />
The GBIF parser places the author into “authorship” as the author of the very combination.<br />
The gnparser places the author into the basionym authorship instead even though it is not surrounded by brackets.<br />
As the parser cannot know if the name actually is a basionym, i.e. whether there indeed exists a subsequent recombination, this was slightly unexpected, and the GBIF NameParser wrapper had to swap the authorship to populate a ParsedName in such cases:<br />
<pre>GBIF:
"authorship": "Maxim.",
GNA:
"basionym_authorship": {
"authors": ["Maxim."]
}
</pre>
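That swap can be sketched roughly as follows. The field names follow the JSON examples in this post; the function itself is hypothetical Python, not the actual Java wrapper.

```python
# Sketch of the authorship swap described above: when gnparser puts an
# un-bracketed author into basionym_authorship, the wrapper treats it
# as the combination author of the ParsedName instead.
def to_parsed_name(gna_authorship, original_name):
    """Map gnparser authorship JSON onto ParsedName-style fields."""
    basionym = gna_authorship.get("basionym_authorship")
    combination = gna_authorship.get("combination_authorship")
    if basionym and not combination and "(" not in original_name:
        # no brackets in the name string: swap into the combination author
        return {"authorship": " ".join(basionym["authors"])}
    result = {}
    if combination:
        result["authorship"] = " ".join(combination["authors"])
    if basionym and "(" in original_name:
        result["bracketAuthorship"] = " ".join(basionym["authors"])
    return result

print(to_parsed_name({"basionym_authorship": {"authors": ["Maxim."]}},
                     "Rhododendron schlippenbachii Maxim."))
```

For bracketed names like <i>Azalea schlippenbachii</i> (Maxim.) Kuntze, both fields survive unchanged, matching the first example above.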
<strong>Puma concolor (Linnaeus, 1771)</strong><br />
Both parsers show the same semantics:<br />
<pre>GBIF:
"bracketAuthorship": "Linnaeus",
"bracketYear": "1771",
GNA:
"basionym_authorship": {
"authors": ["Linnaeus"],
"year": {
"value": "1771"
}
}
</pre>
<b>Ex authors</b> are not parsed in GBIF (yet). They do cause a little issue as the sequence of authors varies between botany and zoology. In botany, the author of the earlier name precedes the later, valid one, while in zoology this is reversed. GNA follows the zoological model, even though far more usages of ex-authors can be found in botanical names.<br />
<br />
<b>Uninomials</b> are also treated differently. GBIF uses a single property genusOrAbove for the genus part of a binomial, a standalone genus, and a uninomial of a rank above genus. GNA places genera from a binomial into the genus field, but uses uninomial for a standalone genus name.<br />
<h3>
Performance</h3>
We are still comparing gnparser with the GBIF name parser, but <a href="https://github.com/gbif/name-parser/blob/master/name-parser-comparison/src/main/java/org/gbif/nameparser/ParserComparison.java">initial tests</a> using gnparser-0.4.0 to parse <a href="https://github.com/gbif/name-parser/blob/master/name-parser-comparison/src/main/resources/names.txt">1380 names</a> from our unit tests suggest the GBIF parser is up to twice as fast running in a Java environment. There is an overhead of wrapping the gnparser for the ParsedName result class, but even if we just parse the names and do not convert the Scala result into a ParsedName, gnparser takes 75% more time:<br />
<pre>
Total time parsing 1380 names
MacBookPro 2017, Java8, single thread:
GBIF: 1331ms
GNA : 2596ms
GNA-: 2323ms # without wrapper
</pre>
<br />
This contradicts the results presented in the gnparser paper, but might be related to the selection of names or to running the parsers in different environments.<br />
<h3>
Future</h3>
We are working with GNA to improve both parsers and align them more. With slightly different goals it might be hard to fully merge the two projects, but we will try to unify the efforts as much as we can. For the GBIF name parser we will be adding parsed author and ex author teams in the near future. This is needed to do author comparisons for better name matching in the GBIF backbone building (where it already exists) and the Catalogue of Life.<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com4tag:blogger.com,1999:blog-2326624813533383062.post-63361383319249414552017-02-27T14:52:00.003+01:002017-02-27T14:52:52.824+01:00GBIF Backbone - February 2017 UpdateWe are happy to announce that a new <a href="http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c">GBIF Backbone</a> just went live, available also as an improved <a href="http://rs.gbif.org/datasets/backbone/2017-02-13/backbone.zip">Darwin Core Archive for download</a>. Here are some facts highlighting the important changes.<br />
<h2>
New source datasets</h2>
Apart from continuously updated sources like the <a href="http://www.gbif.org/dataset/7ddf754f-d193-4cc9-b351-99906754a03b">Catalog of Life</a> or <a href="http://www.gbif.org/dataset/2d59e5db-57ad-41ff-97d6-11f5fb264527">WoRMS</a>, here are the new datasets we used as a source to build the backbone.<br />
<br />
<ul>
<li>New <a href="http://www.gbif.org/dataset/6cfd67d6-4f9b-400b-8549-1933ac27936f">Type specimen checklist</a> listing all distinct names of <a href="http://www.gbif.org/occurrence/search?TYPE_STATUS=*">type specimens found in GBIF occurrences</a> contributing 252,410 new species and 57,410 infraspecific names.</li>
<li><a href="http://www.gbif.org/dataset/b9a214b7-c368-4d22-aa53-b1fc16a1210a">ZooBank</a> joined GBIF and was added as a nomenclator with 175,775 names, contributing 3460 new generic and 39,695 new species names.</li>
<li>Added <a href="http://www.gbif.org/species/8770992">phylum Myzozoa</a> with 136 families under kingdom Chromista to <a href="https://github.com/gbif/algae/commit/afccc623414b7ff2be715bcce1e64fc1aa97ca86">GBIF Algae Classification</a> to fill the <a href="https://github.com/gbif/checklistbank/issues/12">classification gap for Dinoflagellates</a></li>
<li>A tiny new dataset listing species named after <a href="http://www.gbif.org/dataset/00e791be-36ae-40ee-8165-0b2cb0b8c84f">famous people</a>, which are often found in the news</li>
</ul>
<div>
<br /></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-xRzL5VVkPNs/WLQKIYQjKGI/AAAAAAAAEMk/WTxm_E987800qMpI-HMt2yBPaMQIcvYeACLcB/s1600/nub-sources.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="376" src="https://4.bp.blogspot.com/-xRzL5VVkPNs/WLQKIYQjKGI/AAAAAAAAEMk/WTxm_E987800qMpI-HMt2yBPaMQIcvYeACLcB/s640/nub-sources.png" width="640" /></a></div>
<br />
<div style="text-align: center;">
The <a href="https://github.com/gbif/checklistbank/blob/77f8a4b5ccd90cda59243978565c6a05820ead1a/checklistbank-nub/nub-sources.tsv">43 sources</a> used in this backbone build</div>
<br />
<h2>
Code changes</h2>
<br />
<ul>
<li>Merging of duplicate taxa across kingdoms, especially with taxa from the incertae sedis kingdom. Examples:</li>
<ul>
<li><a href="https://demo.gbif.org/species/8592581">Dictyodora Weiss, 1884</a> and <a href="https://demo.gbif.org/species/4897359">Dictyodora C.E. Weiss, 1884</a></li>
<li><a href="https://demo.gbif.org/species/8486131">Barilium Norman, 2010</a> and <a href="https://demo.gbif.org/species/7455976">Barilium</a></li>
</ul>
<li>Exclude genus & species synonyms for taxa at a higher rank: <a href="http://dev.gbif.org/issues/browse/POR-3169">http://dev.gbif.org/issues/browse/POR-3169</a></li>
<li><a href="http://dev.gbif.org/issues/browse/PF-2600">Restrict name normalisation</a> with double letters to bi/trinomials. Finally the fish <a href="http://dev.gbif.org/issues/browse/PF-2611">Lota lota</a> is a fish again. Examples of other previously wrongly conflated families that have been reported:</li>
<ul>
<li><a href="http://www.gbif-uat.org/species/9639">Lotidae</a> & <a href="http://www.gbif-uat.org/species/6553">Lottiidae</a></li>
<li><a href="http://www.gbif.org/species/4237">Belidae</a> & <a href="http://www.gbif.org/species/2775">Belliidae</a></li>
<li><a href="http://www.gbif-uat.org/species/9125">Lauridae</a> & <a href="http://www.gbif-uat.org/species/3243623">Lauriidae</a></li>
</ul>
<li>Stable identifier for <a href="http://dev.gbif.org/issues/browse/POR-3031">pro parte taxa</a> in the backbone.</li>
</ul>
<br />
<br />
All other fixed issues in the source code that generates the backbone can be found in our <a href="http://dev.gbif.org/issues/browse/POR-3208">Jira epic</a><br />
and <a href="https://github.com/gbif/checklistbank/milestone/2?closed=1">github milestone</a>.<br />
<h2>
Backbone impact</h2>
The new backbone has a total of 5,887,500 names of which it treats 2,818,534 species names as accepted (up from 5,307,978 and 2,525,274 respectively).<br />
More <a href="http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c/stats">backbone metrics</a> are available through our portal and in more detail through our <a href="http://api.gbif.org/v1/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c/metrics">API</a>.<br />
<br />
<br />
<ul>
<li>105,296 <a href="http://rs.gbif.org/datasets/backbone/2017-02-13/deleted.txt.gz">deleted names</a>, many of them previous erroneous duplicates</li>
<li>685,853 <a href="http://rs.gbif.org/datasets/backbone/2017-02-13/created.txt.gz">new names</a></li>
<ul>
<li>Animalia: 164 families; 6,616 genera; 257,196 species; 87,660 infraspecific</li>
<li>Archaea: 2 families; 6 genera; 48 species</li>
<li>Bacteria: 27 families; 225 genera; 2,470 species; 615 infraspecific</li>
<li>Chromista: 2 phyla; 13 classes; 58 orders; 54 families; 767 genera; 12,124 species; 2,953 infraspecific</li>
<li>Fungi: 2 families; 269 genera; 8,703 species; 2,993 infraspecific</li>
<li>Plantae: 3 families; 795 genera; 63,617 species; 33,282 infraspecific</li>
<li>Protozoa: 4 families; 65 genera; 1,412 species; 280 infraspecific</li>
<li>Viruses: 8 families; 1,227 genera; 8,488 species</li>
<li>Unknown: 4 families; 2,708 genera; 13,076 species; 2,237 infraspecific</li>
</ul>
</ul>
<br />
A very large and <a href="http://rs.gbif.org/datasets/backbone/2017-02-13/clb-nub.log.gz">detailed log</a> of the backbone build is also available.<br />
<br />
The largest taxonomic groups in the backbone, each exceeding 3% of all accepted species, are shown in the following diagram:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-xeo5Ad3hlOE/WLQP5K0k0WI/AAAAAAAAEM0/rIo0T-iCwNQ-9NvGk_-x31Jt3T8Mt94kwCLcB/s1600/backbone%2Bspecies.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="498" src="https://3.bp.blogspot.com/-xeo5Ad3hlOE/WLQP5K0k0WI/AAAAAAAAEM0/rIo0T-iCwNQ-9NvGk_-x31Jt3T8Mt94kwCLcB/s640/backbone%2Bspecies.png" width="640" /></a></div>
<br />
All contributors to the backbone arranged by number of names the source serves as the primary reference:<br />
<br />
<ul>
<li>3,330,535 <a href="http://www.gbif.org/dataset/7ddf754f-d193-4cc9-b351-99906754a03b">Catalogue of Life</a> </li>
<li>685,831 <a href="http://www.gbif.org/dataset/0938172b-2086-439c-a1dd-c21cb0109ed5">Interim Register of Marine and Nonmarine Genera</a> </li>
<li>312,746 <a href="http://www.gbif.org/dataset/2d59e5db-57ad-41ff-97d6-11f5fb264527">World Register of Marine Species</a> </li>
<li>309,820 <a href="http://www.gbif.org/dataset/6cfd67d6-4f9b-400b-8549-1933ac27936f">GBIF Type Specimen Names</a> </li>
<li>285,859 <a href="http://www.gbif.org/dataset/d9a4eedb-e985-4456-ad46-3df8472e00e8">The Plant List with literature</a> </li>
<li>140,937 <a href="http://www.gbif.org/dataset/90d9e8a6-0ce1-472d-b682-3451095dbc5a">Fauna Europaea</a> </li>
<li>136,981 <a href="http://www.gbif.org/dataset/bf3db7c9-5e5d-4fd0-bd5b-94539eaf9598">Index Fungorum</a> </li>
<li>126,960 <a href="http://www.gbif.org/dataset/c33ce2f2-c3cc-43a5-a380-fe4526d63650">The Paleobiology Database</a> </li>
<li>114,089 <a href="http://www.gbif.org/dataset/046bbc50-cae2-47ff-aa43-729fbf53f7c5">International Plant Names Index</a> </li>
<li>53,848 <a href="http://www.gbif.org/dataset/9ca92552-f23a-41a8-a140-01abaa31c931">Integrated Taxonomic Information System ITIS</a> </li>
<li>44,732 <a href="http://www.gbif.org/dataset/b9a214b7-c368-4d22-aa53-b1fc16a1210a">ZooBank</a> </li>
<li>30,482 <a href="http://www.gbif.org/dataset/66dd0960-2d7d-46ee-a491-87b9adcfe7b1">GRIN Taxonomy</a> </li>
<li>29,267 <a href="http://www.gbif.org/publisher/7ce8aef0-9e92-11dc-8738-b8a03c50a862">Plazi</a> </li>
<li>25,749 <a href="http://www.gbif.org/dataset/a6c6cead-b5ce-4a4e-8cf5-1542ba708dec">Artsnavnebasen</a> </li>
<li>24,996 <a href="http://www.gbif.org/dataset/65c9103f-2fbf-414b-9b0b-e47ca96c5df2">Afromoths</a> </li>
<li>15,007 <a href="http://www.gbif.org/publisher/47a779a6-a230-4edd-b787-19c3d2c80ab5">Species Files</a> </li>
<li>13,818 <a href="http://www.gbif.org/dataset/aacd816d-662c-49d2-ad1a-97e66e2a2908">Brazilian Flora 2020 project</a> </li>
<li>8,923 <a href="http://www.gbif.org/dataset/de8934f4-a136-481c-a87a-b0b202b80a31">Dyntaxa</a></li>
<li>6,807 <a href="http://www.gbif.org/publisher/0674aea0-a7e1-11d8-9534-b8a03c50a862">DiversityTaxonNames Lists</a> </li>
<li>5,696 <a href="http://www.gbif.org/dataset/80b4b440-eaca-4860-aadf-d0dfdd3e856e">Official Lists and Indexes of Names in Zoology</a> </li>
<li>5,317 <a href="http://www.gbif.org/dataset/52a423d2-0486-4e77-bcee-6350d708d6ff">Prokaryotic Nomenclature Up-to-date</a> </li>
<li>4,617 <a href="http://www.gbif.org/dataset/ded724e7-3fde-49c5-bfa3-03b4045c4c5f">International Cichorieae Network ICN</a></li>
<li>4,611 <a href="http://www.gbif.org/dataset/da38f103-4410-43d1-b716-ea6b1b92bbac">Catalogue of Afrotropical Bees</a> </li>
<li>4,416 <a href="http://www.gbif.org/dataset/3f8a1297-3259-4700-91fc-acc4170b27ce">Database of Vascular Plants of Canada</a> </li>
<li>4,312 <a href="http://www.gbif.org/dataset/e01b0cbb-a10a-420c-b5f3-a3b20cc266ad">ICTV Master Species List</a> </li>
<li>3,874 <a href="http://www.gbif.org/dataset/47f16512-bf31-410f-b272-d151c996b2f6">The Clements Checklist</a> </li>
<li>2,702 <a href="http://www.gbif.org/dataset/7a9bccd4-32fc-420e-a73b-352b92267571">Checklist of Beetles Coleoptera of Canada and Alaska</a> </li>
<li>1,198 <a href="http://www.gbif.org/dataset/c696e5ee-9088-4d11-bdae-ab88daffab78">IOC World Bird List, v6.3</a></li>
<li>1,087 <a href="http://www.gbif.org/dataset/7ea21580-4f06-469d-995b-3f713fdcc37c">GBIF Algae Classification</a> </li>
<li>578 <a href="http://www.gbif.org/dataset/8dc469b3-8e61-4f6f-b9db-c70dbbc8858c">ION Taxonomic Hierarchy</a> </li>
<li>272 <a href="http://www.gbif.org/dataset/672aca30-f1b5-43d3-8a2b-c1606125fa1b">Mammal Species of the World</a> </li>
<li>144 <a href="http://www.gbif.org/dataset/daacce49-b206-469b-8dc2-2257719f3afa">GBIF Backbone Patch</a> </li>
<li>39 <a href="http://www.gbif.org/dataset/00e791be-36ae-40ee-8165-0b2cb0b8c84f">Species named after famous people</a> </li>
<li>36 <a href="http://www.gbif.org/dataset/bd25fbf7-278f-41d6-bc17-9f08f2632f70">True Fruit Flies Diptera, Tephritidae of the Afrotropical Region</a> </li>
<li>7 <a href="http://www.gbif.org/dataset/6e4c3b6f-0126-4c5f-bd63-fe6ffd3b29fa">Backbone Family Classification Patch</a> </li>
<li>7 <a href="http://www.gbif.org/dataset/0e61f8fe-7d25-4f81-ada7-d970bbb2c6d6">TAXREF</a> </li>
</ul>
<br />
<h2>
Occurrence impact</h2>
With a new backbone we have reprocessed all of our <a href="http://www.gbif.org/occurrence">712 million occurrences</a>.<br />
<br />
The distribution of the major taxonomic groups exceeding 3%, i.e. having a minimum of 36,800 species, is shown in this last diagram:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-aV97u_id8Io/WLQP5HW0GiI/AAAAAAAAEM4/9R1ziV62SpoJD-ygjWifwVoQQuNRnclywCEw/s1600/occ%2Bspecies.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="498" src="https://1.bp.blogspot.com/-aV97u_id8Io/WLQP5HW0GiI/AAAAAAAAEM4/9R1ziV62SpoJD-ygjWifwVoQQuNRnclywCEw/s640/occ%2Bspecies.png" width="640" /></a></div>
<br />
The 1,226,520 accepted species in GBIF occurrences (140 fewer than before) represent 44% of all accepted backbone species.<br />
<br /><div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-29092170879153683482017-01-25T17:43:00.000+01:002017-01-25T17:43:18.501+01:00Sampling-event standard takes flight on the wings of butterflies<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
Data collected from systematic monitoring schemes is highly valuable. That's because harvesting species data from a given set of sites repeatedly over time using a well-defined sampling effort opens the door to key ecological analyses including phenology, population trends, changes in community structure and other metrics related to a range of Essential Biodiversity Variables (<a href="http://geobon.org/essential-biodiversity-variables/ebv-classes-2/" target="_blank">EBVs</a>).<br />
<br />
A couple of years ago there was no reliable way to universally standardize data from systematic monitoring schemes. This meant that researchers using this kind of data first needed to spend a lot of time deciphering it. Their job became even more complicated when trying to integrate data from various heterogeneous sources, each storing its data in different formats, units, etc.<br />
<br />
Today, the situation looks much better thanks to a massive collaboration between <a href="http://www.gbif.org/" target="_blank">GBIF</a>, <a href="http://www.eubon.eu/show/partners_2735/" target="_blank">EU BON partners</a> and the wider biodiversity community whose aim was to enable sharing of "sampling-event datasets". <br />
<br />
Indeed, one of the most successful outcomes from this collaboration has been the development of a standardized format for systematic butterfly monitoring schemes.<br />
<br />
The format has been developed in close collaboration with the EU BON partners Israel Pe'er (<a href="http://www.gluecad-bio.com/face/gluecad_en.html" target="_blank">GlueCAD- Biodiversity IT</a>) and his son, Dr. Guy Pe'er, (<a href="https://www.ufz.de/index.php?en=38961" target="_blank">UFZ</a>), who works with systematic monitoring data. The format can be adapted to many other types of systematic monitoring, for many taxonomic groups, as it ensures the following important conditions for researchers are met:<br />
<ul style="text-align: left;">
<li>all visits to a given site are known, including those with no sightings, as this allows for analyses of species phenology, etc.</li>
<li>the range of species being recorded during sampling is explicit, as this allows for true absence to be determined.</li>
<li>the location hierarchies can be specified (e.g. the location is a fixed transect or subsection of a transect), as this allows users to group observations by location.</li>
<li>enough detailed information about the sampling effort and sampling area (e.g. units of measurement) is captured, as this allows users to calculate density or convert between units of abundance.</li>
</ul>
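To see why these conditions matter in practice, here is a minimal Python sketch. The sample data and column values are invented for illustration, not drawn from the BMS-IL dataset, though the field names borrow real Darwin Core terms (eventID, sampleSizeValue, scientificName, individualCount). It shows how recording every visit, including those with no sightings, lets users derive true absences and per-area densities:

```python
# Illustrative sketch (not the official format): complete visit lists plus an
# explicit recording scope make absences and densities derivable.
events = [  # one row per monitoring visit (Event core)
    {"eventID": "T1-2016-04", "eventDate": "2016-04-10", "sampleSizeValue": 500.0},  # plot area in m2
    {"eventID": "T1-2016-05", "eventDate": "2016-05-08", "sampleSizeValue": 500.0},  # visit with no sightings
]

occurrences = [  # observations linked back to events
    {"eventID": "T1-2016-04", "scientificName": "Pieris rapae", "individualCount": 12},
]

# The range of species recorded by the scheme, made explicit (condition 2 above).
recorded_species = {"Pieris rapae", "Vanessa atalanta"}

def absences(event_id):
    """Species in the scheme's scope that were NOT seen during this visit."""
    seen = {o["scientificName"] for o in occurrences if o["eventID"] == event_id}
    return recorded_species - seen

def density(event_id, species):
    """Individuals per square metre for one species on one visit."""
    event = next(e for e in events if e["eventID"] == event_id)
    count = sum(o["individualCount"] for o in occurrences
                if o["eventID"] == event_id and o["scientificName"] == species)
    return count / event["sampleSizeValue"]
```

Because the empty May visit is present in the event table, the absence of both species on that date is data, not a gap.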
The Israeli Butterfly Systematic Monitoring Scheme (BMS-IL) dataset has already been published openly using this format. I'd like to invite everyone to explore this exemplar dataset from either the <a href="http://cloud.gbif.org/eubon/resource?r=butterflies-monitoring-scheme-il" target="_blank">EU BON IPT</a> or via <a href="http://www.gbif.org/dataset/647ae6f8-8e26-4189-b448-02b45b7ad884" target="_blank">GBIF.org</a>. <br />
<div>
<br />
In the future, I hope that <a href="http://geobon.org/products/reports-papers/geo-bon-technical-reports/" target="_blank">GEO BON's Guidelines for Standardized Global Butterfly Monitoring</a> will incorporate a new recommendation that all monitoring programs use this standardized format for sharing their data. Without a doubt this will make researchers' jobs easier when integrating data from several butterfly monitoring programs for their analyses. It will also enable integrating the data with standardized sampling-event data from other disciplines. <br />
<br />
Ideally, making the data openly available in a standardized format also leads to new collaboration. So far, BMS-IL data has been used to assess trends in the abundance and phenology of Israel's butterflies for the benefit of conservation and climate-change research, for example. I would like to encourage you to reach out to Israel and Guy Pe'er if you have any novel ideas on how to reuse their newly standardized data and help unlock its full potential.</div>
</div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/16423423909368777750noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-53212023993294788722017-01-12T17:40:00.000+01:002017-01-24T16:50:07.910+01:00IPT v2.3.3 - Your repository for standardized biodiversity data<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
GBIF is pleased to announce the release of IPT v2.3.3, now available for download from the <a href="http://www.gbif.org/ipt" target="_blank">IPT website</a>. <br />
<br />
This version looks and feels the same as 2.3.2 but is much more robust and secure. I'd like to recommend that all existing IPT installations be upgraded as soon as possible following the instructions listed in the <a href="https://github.com/gbif/ipt/wiki/IPTReleaseNotes233.wiki" target="_blank">release notes</a>.<br />
<br />
Additionally, a couple of new strategic features have been added to the tool to enhance its potential. A description of these new features follows below. <br />
<br />
<h3 style="text-align: left;">
Improved dataset homepage </h3>
<br />
Compared with general-purpose repositories such as <a href="http://datadryad.org/" target="_blank">Dryad</a> or <a href="https://figshare.com/" target="_blank">Figshare</a>, the IPT ensures that uploaded biodiversity data gets disseminated in a standardized format (Darwin Core Archive - DwC-A), facilitating wider reuse and enabling the data to be indexed by aggregators such as GBIF.org.<br />
<br />
Interoperability comes at a small cost though, as depositors choosing to use the IPT must overcome a learning curve in understanding how to map their data to the Darwin Core standard. <br />
<br />
To make this easier for depositors, a <a href="http://www.gbif.org/newsroom/news/new-darwin-core-spreadsheet-templates" target="_blank">new set of Darwin Core Excel templates</a> has recently been released. These new templates provide a simpler solution for capturing, formatting and uploading data to the IPT.<br />
<br />
Similarly, users of the standardized data need to understand how to unpack a DwC-A and make sense of the data inside. <br />
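For the curious, a DwC-A is simply a zip file containing data tables plus a meta.xml descriptor that says which file is the core and what each column means. The sketch below builds a deliberately simplified toy archive (real descriptors carry more attributes, such as field indexes and term URIs) and then locates and reads the core file using only the Python standard library:

```python
# Simplified sketch of unpacking a Darwin Core Archive: read meta.xml,
# find the core file's location, and return its rows.
import io
import zipfile
import xml.etree.ElementTree as ET

META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
  </core>
</archive>"""

# Build a toy archive in memory so the example is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("meta.xml", META)
    z.writestr("occurrence.txt", "occurrenceID\tscientificName\n1\tPieris rapae\n")

def core_rows(archive_bytes):
    """Locate the core data file via meta.xml and return its rows."""
    ns = {"dwc": "http://rs.tdwg.org/dwc/text/"}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as z:
        meta = ET.fromstring(z.read("meta.xml"))
        location = meta.find("dwc:core/dwc:files/dwc:location", ns).text
        return z.read(location).decode().splitlines()

rows = core_rows(buf.getvalue())
```

The first row holds the column headers; every following row is one record of the core table.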
<br />
<div style="text-align: right;">
</div>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/--LKHvtfk9vg/WHOGB4kSXaI/AAAAAAAASXU/2picxTBNkBMsrWaMVEEemIJbBgtmuxZIQCEw/s1600/Screen%2BShot%2B2017-01-09%2Bat%2B13.45.27.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img alt="" border="0" height="134" src="https://1.bp.blogspot.com/--LKHvtfk9vg/WHOGB4kSXaI/AAAAAAAASXU/2picxTBNkBMsrWaMVEEemIJbBgtmuxZIQCEw/s320/Screen%2BShot%2B2017-01-09%2Bat%2B13.45.27.png" title="Data Records section from RLS Global Reef Fish Dataset" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Data Records section - RLS Global Reef Fish Dataset<br />
<a href="http://doi.org/10.15468/qjgwba">doi:10.15468/qjgwba</a> </td></tr>
</tbody></table>
To make this process easier for users, a new Data Records section has been added to the dataset homepage. It explains what the DwC-A format is, with a graphic illustration showing the number of records in each file contained within the archive. <br />
<br />
Overall this advancement will strengthen the IPT as a data repository, which is already capable of <a href="http://gbif.blogspot.dk/2015/03/ipt-v22.html" target="_blank">assigning DOIs to datasets</a> to make them discoverable and citable. <br />
<br />
<h3 style="text-align: left;">
Translation into Russian </h3>
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-R1_1Fd_aB6c/WHN2XOv0jnI/AAAAAAAASW4/VFwai6Q-iVEaPqMg6o_pEmTvPMr2QYQUACLcB/s1600/Screen%2BShot%2B2017-01-09%2Bat%2B12.36.25.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img alt="" border="0" height="203" src="https://1.bp.blogspot.com/-R1_1Fd_aB6c/WHN2XOv0jnI/AAAAAAAASW4/VFwai6Q-iVEaPqMg6o_pEmTvPMr2QYQUACLcB/s400/Screen%2BShot%2B2017-01-09%2Bat%2B12.36.25.png" title="Map of IPT installations focused on Russian speaking countries" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://www.gbif.org/ipt/stats" target="_blank">Map of IPT installations in Russia - January 2017</a> </td></tr>
</tbody></table>
<a href="http://www.gbif.org/ipt/stats" target="_blank">Installed in 52 countries</a> around the world, use of the IPT heavily is underrepresented across Russian speaking countries. Therefore to extend the IPT's reach in these areas, the user interface has been fully translated into Russian by a team of volunteer translators with the largest contribution made by Ivan Chadin from the Komi Science Centre of the Ural Branch of the Russian Academy of Sciences.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-RZLdGPK-sZs/WHN4aurFJUI/AAAAAAAASXE/At7ZEN1okJ8_RmVeWYHd9KkoTuLhF0G_wCLcB/s1600/Screen%2BShot%2B2017-01-09%2Bat%2B12.45.18.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="" border="0" height="200" src="https://2.bp.blogspot.com/-RZLdGPK-sZs/WHN4aurFJUI/AAAAAAAASXE/At7ZEN1okJ8_RmVeWYHd9KkoTuLhF0G_wCLcB/s400/Screen%2BShot%2B2017-01-09%2Bat%2B12.45.18.png" title="Map of data published by Russia" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://www.gbif.org/country/RU/publishing" target="_blank">Map of data published by Russia - January 2017</a></td></tr>
</tbody></table>
At the time of writing there were already 18 datasets from Russia published by 5 IPTs installed across Pushchino, Moscow, St Petersburg and the Komi Republic. It will be exciting to watch this number grow over time in part thanks to this enormous volunteer contribution.<br />
<br />
<br />
<br />
<h3 style="text-align: left;">
Acknowledgements</h3>
<br />
Once again I'd like to recognize all the volunteer translators that contributed their time and expertise to making this new version available in seven different languages:<br />
<ul style="text-align: left;">
<li>Sophie Pamerlon (GBIF France) - Updating French translation</li>
<li>Yukiko Yamazaki (GBIF Japan (JBIF)) - Updating Japanese translation</li>
<li>Daniel Lins (Universidade de São Paulo, Research Center on Biodiversity and Computing - BioComp) - Updating Portuguese translation</li>
<li>Néstor Beltrán (Colombian Biodiversity Information System (SiB Colombia)) - Updating Spanish translation</li>
<li>Ivan Chadin (Institute of Biology of Komi Scientific Centre of the Ural
Branch of the Russian Academy of Sciences), Max Shashkov (Institute of
Physicochemical and Biological Problems in Soil Science, Russian Academy
of Science) and Artyom Leostrin (Komarov Botanical Institute of the Russian Academy of Sciences (Saint-Petersburg)) - Adding Russian translation </li>
</ul>
I'd also like to recognize a few volunteers that helped make significant improvements to the IPT codebase:<br />
<ul style="text-align: left;">
<li>Bruno P. Kinoshita (National Institute of Water and Atmospheric Research (NIWA)) - Fixed <a href="https://github.com/gbif/ipt/issues/1241" target="_blank">issue #1241</a>, ensuring the IPT can be installed on a server behind a proxy</li>
<li>Pieter Provoost (UNESCO) - Fixed <a href="https://github.com/gbif/ipt/issues/1248" target="_blank">issue #1248</a>, improving the IPT's RSS feed</li>
<li>Tadj Youssouf (Security researcher, fb.com/oc3f.dz) - Helped address a cross site scripting issue</li>
</ul>
Although the core development of the IPT happens at the GBIF Secretariat, the coding, documentation, and internationalization are a community effort and everyone is welcome to join in.<br />
<br />
I look forward to seeing the IPT's community of volunteers and users continue to grow and hope you can unlock the full potential of this publishing tool and repository. </div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/16423423909368777750noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-11651976054108289102016-08-08T15:14:00.000+02:002016-08-08T15:14:47.052+02:00GBIF Backbone - August 2016 UpdateGBIF has just put a new backbone taxonomy into production! Since our last update of the GBIF Backbone we have received various feedback and gained insight into potential code improvements. Here is a quick summary of what has changed in this August 2016 version.<br />
<h4>
Important code changes:</h4>
<ul>
<li>much less eager basionym detection, resulting in fewer algorithmically assigned synonyms and removing many false synonyms, especially in plants</li>
<li>detection and merging of orthographic variants of species names, by applying gender stemming, allowing for doubled consonants, handling author transliterations and merging hybrid names</li>
</ul>
<br />
All issues fixed in the source code that generates a new backbone can be found here; many of them link back to user-reported feedback: <a href="http://dev.gbif.org/issues/browse/POR-3029">http://dev.gbif.org/issues/browse/POR-3029</a><br />
<h4>
New sources</h4>
The following new sources have been incorporated into the August backbone:<br />
<ul>
<li>major new version of <a href="http://www.gbif.org/dataset/c33ce2f2-c3cc-43a5-a380-fe4526d63650">The Paleobiology Database</a>, contributing 2,315 new families, 11,390 genera and 131,958 species names to the backbone. It feeds many isExtinct and livingPeriod values into the backbone for fossil taxa</li>
<li>thousands of new <a href="http://www.gbif.org/publisher/7ce8aef0-9e92-11dc-8738-b8a03c50a862">Plazi articles</a> with 1,883 genera, 28,725 species and 1,935 infraspecific names. We only use genus names and below from Plazi, excluding any synonyms until we are confident they are all correctly marked up</li>
<li>added <a href="http://www.gbif.org/dataset/a6c6cead-b5ce-4a4e-8cf5-1542ba708dec">Artsnavnebasen</a> source, contributing 3,640 new genera and 29,751 species names to the backbone</li>
<li>added <a href="http://www.gbif.org/dataset/ded724e7-3fde-49c5-bfa3-03b4045c4c5f">International Cichorieae Network</a> source, contributing 190 new Asteraceae genera; 1,415 species and 3,427 infraspecies names to the backbone</li>
</ul>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-yAZAPLuD4hU/V5pHM_dhduI/AAAAAAAAEKk/s7DcP01lkSgWl2ESkECBFtjxOaIZn-_yACLcB/s1600/nub%2Bsource%2Bchanges.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="437" src="https://1.bp.blogspot.com/-yAZAPLuD4hU/V5pHM_dhduI/AAAAAAAAEKk/s7DcP01lkSgWl2ESkECBFtjxOaIZn-_yACLcB/s640/nub%2Bsource%2Bchanges.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small; text-align: start;">The </span><a href="https://github.com/gbif/checklistbank/blob/master/checklistbank-nub/nub-sources.tsv" style="font-size: medium; text-align: start;">39 sources</a><span style="font-size: small; text-align: start;"> used in this backbone build</span></td></tr>
</tbody></table>
<h4>
Backbone impact</h4>
<div>
The new backbone has a total of 5,307,978 names, of which 2,525,274 species names are treated as accepted (previously 2,420,842 out of 5,208,172). More <a href="http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c/stats">backbone metrics</a> are available through our portal and in more detail through our <a href="http://api.gbif.org/v1/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c/metrics">API</a>.</div>
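The metrics linked above are machine-readable too. Here is a small Python sketch of querying that metrics endpoint with the standard library; the network call is kept in its own function so the URL construction stands on its own, and the exact fields in the JSON response (counts broken down by rank, kingdom, etc.) are best checked against the live service:

```python
# Sketch of fetching checklist metrics from the GBIF API (stdlib only).
import json
import urllib.request

BACKBONE_KEY = "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c"  # GBIF Backbone Taxonomy

def metrics_url(dataset_key):
    """Metrics endpoint for a checklist dataset, as linked in the post."""
    return "http://api.gbif.org/v1/dataset/%s/metrics" % dataset_key

def fetch_metrics(dataset_key):
    """Fetch the metrics as a dict; requires network access."""
    with urllib.request.urlopen(metrics_url(dataset_key)) as resp:
        return json.load(resp)
```

Calling `fetch_metrics(BACKBONE_KEY)` returns the same numbers shown on the portal's stats page, ready for scripted comparisons between backbone builds.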
<ul>
<li><a href="http://rs.gbif.org/datasets/backbone/2016-07-25/deleted.txt.gz">187,854 deleted names</a>, mostly due to the removal of orthographic variants</li>
<li><a href="http://rs.gbif.org/datasets/backbone/2016-07-25/created.txt.gz">279,404 new names</a> </li>
<ul>
<li><u>Unknown</u>: 165 families; 743 genera; 785 species; 14 infraspecific</li>
<li><u>Animalia</u>: 13 orders; 1,649 families; 10,171 genera; 125,478 species; 4,398 infraspecific</li>
<li><u>Archaea</u>: 2 genera; 3 species</li>
<li><u>Bacteria</u>: 1 family; 33 genera; 544 species; 36 infraspecific</li>
<li><u>Chromista</u>: 38 families; 412 genera; 5,594 species; 295 infraspecific</li>
<li><u>Fungi</u>: 1 family; 691 genera; 11,127 species; 2,039 infraspecific</li>
<li><u>Plantae</u>: 50 families; 666 genera; 82,672 species; 14,725 infraspecific</li>
<li><u>Protozoa</u>: 1 class; 1 order; 4 families; 38 genera; 349 species; 24 infraspecific</li>
<li><u>Viruses</u>: 1 family; 982 genera; 6,311 species</li>
</ul>
</ul>
A very large and detailed <a href="http://rs.gbif.org/datasets/backbone/2016-07-25/clb-nub.log.gz">log of the backbone build</a> is also available.<br />
<br />
The largest taxonomic groups in the backbone, each exceeding 3% of all accepted species, are shown in the following diagram:<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-KahdGo8Y_7k/V6iFEy9janI/AAAAAAAAELY/cUfvYn37kFEguj0fJkR717RwVKjjNAM4gCLcB/s1600/backbonegroups.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="578" src="https://2.bp.blogspot.com/-KahdGo8Y_7k/V6iFEy9janI/AAAAAAAAELY/cUfvYn37kFEguj0fJkR717RwVKjjNAM4gCLcB/s640/backbonegroups.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<br />
The Catalogue of Life, the largest single primary source, contributes 59.8% of all names (previously 60.9%). A breakdown by <a href="http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c/constituents">backbone constituents</a> is now also available as a species search facet. For example, this shows the <a href="http://www.gbif.org/species/search?dataset_key=d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&rank=SPECIES&highertaxon_key=6">breakdown for all accepted plant species</a> in the backbone:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-iTo_4STIi3o/V6hZzQXKfaI/AAAAAAAAEK0/KoxShyeA1e84i3kq7phQQMDfw_IAQ7bgACLcB/s1600/Screen%2BShot%2B2016-08-08%2Bat%2B12.05.57.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="464" src="https://3.bp.blogspot.com/-iTo_4STIi3o/V6hZzQXKfaI/AAAAAAAAEK0/KoxShyeA1e84i3kq7phQQMDfw_IAQ7bgACLcB/s640/Screen%2BShot%2B2016-08-08%2Bat%2B12.05.57.png" width="640" /></a></div>
<br />
<h4>
Occurrence impact</h4>
With a new backbone we have reprocessed all of our 642 million occurrences. The larger changes were:<br />
<ul>
<li>Fixed various old/new world distributions of incorrectly synonymized species</li>
<li>Reduced the number of <a href="http://www.gbif.org/species/8">virus records</a> from 157,492 down to just 5,348 records. Most occurrences were Lepidoptera, e.g. the common <a href="http://www.gbif.org/species/5881450">peacock butterfly</a>, which had formerly been mismatched because no classification was given with the name.</li>
</ul>
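The peacock butterfly case illustrates why names are best matched together with their classification. As an illustration, the sketch below assembles requests to GBIF's species match service with and without a kingdom hint; the species name and parameters here are only examples, and actually issuing the request requires network access:

```python
# Sketch of building species-match requests with classification context.
import urllib.parse

MATCH_ENDPOINT = "http://api.gbif.org/v1/species/match"

def match_query(name, **classification):
    """Build a match URL; classification (e.g. kingdom=...) disambiguates homonyms."""
    params = {"name": name}
    params.update(classification)
    return MATCH_ENDPOINT + "?" + urllib.parse.urlencode(sorted(params.items()))

# A bare name risks landing on a homonym from the wrong group:
ambiguous = match_query("Aglais io")
# Supplying the kingdom steers the match toward the intended animal name:
contextual = match_query("Aglais io", kingdom="Animalia")
```

Publishers who include kingdom, family, and other higher ranks alongside their scientific names give the matching service exactly this kind of context.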
<div>
Some more metrics of backbone names in our occurrences:</div>
<div>
<ul>
<li>216,699 distinct genera in GBIF occurrences, which is 55% of all 396,990 genera in the backbone</li>
<li>1,226,668 accepted species in GBIF occurrences, which is 50% of all 2,420,842 backbone species</li>
<li>2,059,961 distinct names in GBIF occurrences, which is 39% of all 5,208,172 names in the backbone</li>
</ul>
<div>
The distribution of the major taxonomic groups exceeding 3% (i.e. those with a minimum of 36,800 species) is shown in this last diagram:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-dWLOf2QH2Os/V6iFFICq5hI/AAAAAAAAELc/0KcOjSAQYTYEzBHNLftWK-E_l2TabBucwCLcB/s1600/occurrencegroups.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="602" src="https://3.bp.blogspot.com/-dWLOf2QH2Os/V6iFFICq5hI/AAAAAAAAELc/0KcOjSAQYTYEzBHNLftWK-E_l2TabBucwCLcB/s640/occurrencegroups.png" width="640" /></a></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<br /></div>
</div>
<div>
</div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-79871755293128001602016-07-20T17:02:00.000+02:002017-01-24T16:53:13.808+01:00Probably Turboveg's best-kept secret<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="font-family: Trebuchet MS, sans-serif;"><a href="http://www.synbiosys.alterra.nl/turboveg/">Turboveg</a> is one of the most widely used software programs used to manage vegetation data. Probably its best-kept secret is that it can export vegetation data in Darwin Core Archive (DwC-A) format, which is a standard format that enables its quick and easy integration with other resources on <a href="http://www.gbif.org/">GBIF.org</a>. Turboveg v2 converts vegetation data into species occurrence data packaged as a DwC-A. Now thanks to an 8 month long collaboration between GBIF and Stephan Hennekens (Turboveg's developer), v3 will convert vegetation data into sampling event data packaged as a DwC-A - a much more faithful and useful representation of the data.</span><h4 style="text-align: left;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;"> Turboveg</span></span></h4>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><span style="font-family: Trebuchet MS, sans-serif;"><a href="http://www.synbiosys.alterra.nl/turboveg/" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="" border="0" height="243" src="https://3.bp.blogspot.com/-2SayE6te0tE/V4318_c567I/AAAAAAAARV8/m3H8gRTRueI5nyYw8o8jWsaSFjx988QrACLcB/s400/TV3-exportDwca.png" title="Screenshot of Turboveg v3 prototype" width="400" /></a></span></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Trebuchet MS, sans-serif;">Screenshot of Turboveg v3 prototype</span></td></tr>
</tbody></table>
<span style="font-family: Trebuchet MS, sans-serif;"><a href="http://www.synbiosys.alterra.nl/turboveg/">Turboveg</a> is an easy to install and easy to use Windows program for storing, managing, visualizing and exporting vegetation data (relevés). A relevé is a list of the plants in a delimited plot of vegetation, with information on species cover and on substrate and other abiotic features in order to make as complete as possible description in terms of plant community composition and structure. <br /><br />Today there are about 1500 users of the software worldwide managing more than 1,5 million relevés. Turboveg can export relevés in various file formats, which is useful to enable further analysis. Support for exporting relevés as species occurrence data packaged as a Darwin Core Archive (DwC-A) was added to v2 in 2011. Guidance on how to use this feature can be found in the <a href="http://www.synbiosys.alterra.nl/turboveg/help/Index.html?idh_export_darwincore.htm">Turboveg User Manual</a>. <br /><br />Version 3, due to be released in 2017, will export relevés as sampling event data packaged as DwC-A - a format that more accurately reflects the original data.</span><h4 style="text-align: left;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;"> <span style="font-family: "trebuchet ms" , sans-serif;">S</span>ampling event data</span></span></h4>
<span style="font-family: Trebuchet MS, sans-serif;">Sampling event data derive from environmental, ecological, and natural resource investigations that follow standardized protocols for measuring and observing biodiversity. This is in contrast to opportunistic observation and collection data, which today form a significant proportion of openly accessible biodiversity data. A good example of sampling data is data coming from vegetation sampling events using the Braun-Blanquet protocol. Because the sampling methodology and sampling units are precisely described the resulting data is comparable and thus better suited for measuring trends in habitat change and climate change.</span><h4 style="text-align: left;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;"><span style="font-family: "trebuchet ms" , sans-serif;">S</span>ampling event data model</span></span></h4>
<span style="font-family: Trebuchet MS, sans-serif;">A data model provides the details of the structure of the data. Previously sampling event data couldn't be modelled in a standardized way in Darwin Core due to the complexity of encoding the underlying protocols. Over the past two years, however, GBIF has been working with EU BON and the wider bioinformatics community to develop a data model for sharing sampling event data. In March 2015 TDWG, the international body responsible for maintaining standards for the exchange of biological data, ratified changes that enabled support for modelling sampling event data.</span><div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span><div class="separator" style="clear: both; text-align: center;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;"><a href="https://1.bp.blogspot.com/-iC_gLwcA8bY/V43xJFoDzqI/AAAAAAAARVc/meeu9HDHtBQQHQGP4-ihmD1ZMwDn3umpgCLcB/s1600/dm1.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="" border="0" height="200" src="https://1.bp.blogspot.com/-iC_gLwcA8bY/V43xJFoDzqI/AAAAAAAARVc/meeu9HDHtBQQHQGP4-ihmD1ZMwDn3umpgCLcB/s200/dm1.png" title="Sampling event data model" width="134" /></a></span></span></div>
<span style="font-family: Trebuchet MS, sans-serif;"><div>
In summary, the de facto data model for sampling event data in Darwin Core consists of three tables: Sampling event, Measurements or Facts and Species occurrences. </div>
<div>
<br /></div>
<div>
A Sampling event can be associated with many Species occurrences, while a Species occurrence can only be associated with one Sampling event. Similarly, a Sampling event can be associated with many Measurements or Facts. In this way a Sampling event has a one-to-many relationship to both Species occurrences and Measurements or Facts. </div>
<br />Note additional tables of information can also be added to a Sampling event, such as Multimedia (e.g. to record images of the plot). More information about this preferred data model for sampling event data can be found in the <a href="http://links.gbif.org/ipt-sample-data-primer">IPT Sampling Event Data Primer</a>.</span></div>
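As a concrete illustration of this one-to-many model, here is a minimal Python sketch. The sample rows are hypothetical, but the field names follow real Darwin Core terms (eventID, samplingProtocol, measurementType, etc.), with extension rows pointing back at the event through its eventID:

```python
# Sketch of the three-table sampling-event model: one Event core row,
# many Occurrence and MeasurementOrFact rows linked by eventID.
event = {"eventID": "plot7-v1", "eventDate": "2016-06-01",
         "samplingProtocol": "Braun-Blanquet"}

occurrences = [
    {"eventID": "plot7-v1", "scientificName": "Calluna vulgaris"},
    {"eventID": "plot7-v1", "scientificName": "Erica tetralix"},
]

measurements = [
    {"eventID": "plot7-v1", "measurementType": "plot area",
     "measurementValue": "4", "measurementUnit": "m2"},
    {"eventID": "plot7-v1", "measurementType": "slope",
     "measurementValue": "5", "measurementUnit": "degrees"},
]

def rows_for(event_id, table):
    """All extension rows linked to one sampling event (the one-to-many join)."""
    return [row for row in table if row["eventID"] == event_id]
```

One event thus fans out to any number of occurrences and measurements, while each extension row belongs to exactly one event.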
<div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span><div style="text-align: left;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: "trebuchet ms" , sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;"><br /></span></span></span></div>
<h4 style="text-align: left;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;"><span style="font-family: "trebuchet ms" , sans-serif;">S</span>ampling event data model for vegetation plot data </span></span></h4>
<span style="font-family: Trebuchet MS, sans-serif;">Vegetation surveys or relevés produce a wealth of information on species cover and on substrate and other abiotic features in the plot. Species cover can be measured using dozens of different vegetation abundance scales such as the Braun-Blanquet scale or Londo decimal scale to name a couple. To standardize how this information is stored, a custom Relevé table is used instead of the Measurements or Facts table.</span><span style="font-family: Trebuchet MS, sans-serif;"><span style="font-family: "Trebuchet MS",sans-serif;"><br /></span>
</span><br />
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;"><a href="https://2.bp.blogspot.com/-8tjZhOXIzU8/V43zQRXvZrI/AAAAAAAARVs/C5WEEEW0ErISc9SAmA5312VmUhvW_ikfwCLcB/s1600/dm2_2.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="" border="0" height="200" src="https://2.bp.blogspot.com/-8tjZhOXIzU8/V43zQRXvZrI/AAAAAAAARVs/C5WEEEW0ErISc9SAmA5312VmUhvW_ikfwCLcB/s200/dm2_2.png" title="Sampling event data model for vegetation data" width="133" /></a></span></span></div>
<span style="font-family: Trebuchet MS, sans-serif;">This data model for vegetation plot data in Darwin Core consists of three tables: Sampling event, Relevé and Species Occurrence.</span></div>
<div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span><div>
<span style="font-family: Trebuchet MS, sans-serif;">A Sampling event can be associated with only one Relevé. The Relevé consists of the most common relevé measurements covering all vegetation layers. Note for each measurement the unit and precision is explicitly defined. A Sampling event can also be associated with many Species occurrences, however, each Species occurrence should specify the vegetation layer where it was found hence the same species can be found within multiple vegetation layers. In this way the vegetation composition can be described for each layer within the plot.</span></div>
<div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Trebuchet MS, sans-serif;">Note that at the time of writing the Darwin Core standard doesn't have the terminology for storing vegetation layers. Therefore a <a href="https://github.com/tdwg/dwc/issues/125">formal proposal</a> has been made to add the new term "layer" to Darwin Core. To standardise how this new term is populated, a <a href="http://rs.gbif.org/vocabulary/gbif/vegetation_layer.xml">custom vocabulary for vegetation layers</a> has also been produced.</span></div>
<div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span><br />
<h4 style="text-align: left;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;">Example DwC-A export by Turboveg: Dutch Vegetation Database (LVD)</span></span></h4>
<span style="font-family: Trebuchet MS, sans-serif;">Fortunately, the <a href="http://cloud.gbif.org/eubon/resource?r=lvd&v=1.6">Dutch Vegetation Database (LVD)</a> has recently been republished using the new sampling event format and can thus serve as an exemplar dataset. LVD is a substantial dataset published by <a href="http://www.wageningenur.nl/nl/Expertises-Dienstverlening/Onderzoeksinstituten/Alterra.htm">Alterra</a> (a major Dutch research institute) that covers all plant communities in the Netherlands with more than 85 years of vegetation recording for some habitats. The latest version of this dataset has more than 650 thousand relevés associated with almost 12 million species occurrences. <br /><br />Alterra uses Turboveg v3 to manage this dataset and export it in the standardized DwC-A format. It is important to note that special care is taken by the software to protect sensitive species: the locations of plots in which red-list species have been observed are obfuscated to 5x5 km squares. Furthermore, the software converts all coverage values to the same unit (e.g. species coverage values are converted into percentage coverage) to make the data easier to use and integrate with other sources.</span><h4 style="text-align: left;">
<span style="font-family: "Trebuchet MS",sans-serif;"><span style="font-family: Trebuchet MS, sans-serif;">Sampling event data on GBIF.org: Dutch Vegetation Database (LVD)</span></span></h4>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><span style="font-family: Trebuchet MS, sans-serif;"><a href="https://2.bp.blogspot.com/-frN3LcyvwYE/V434Ai7lhtI/AAAAAAAARWI/I3-1ojq9E3gQWNUOJW976IHAWH6MP-opQCLcB/s1600/GBIF-LVD-map.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="" border="0" height="251" src="https://2.bp.blogspot.com/-frN3LcyvwYE/V434Ai7lhtI/AAAAAAAARWI/I3-1ojq9E3gQWNUOJW976IHAWH6MP-opQCLcB/s400/GBIF-LVD-map.png" title="GBIF.org map of LVD georeferenced data" width="400" /></a></span></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-family: Trebuchet MS, sans-serif;">GBIF.org map of LVD georeferenced data</span></td></tr>
</tbody></table>
<span style="font-family: Trebuchet MS, sans-serif;">All versions of LVD are imported to the <a href="http://cloud.gbif.org/eubon/resource?r=lvd">EU BON IPT</a> where they get archived and published through <a href="http://www.gbif.org/">GBIF.org</a>. </span><div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Trebuchet MS, sans-serif;">The 8 month long collaboration between GBIF and Stephan Hennekens culminated in the latest version of LVD being indexed into GBIF.org <a href="http://www.gbif.org/dataset/740df67d-5663-41a2-9d12-33ec33876c47">here</a>. A special and grateful thanks is owed to Stephan for all his hard work to make this happen.</span></div>
<div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Trebuchet MS, sans-serif;">Over the next couple of years GBIF will continue working on enhancing the indexing and discovery of sampling event datasets (e.g. showing events' plots/transects on a map, filtering events by sampling protocol, indexing relevés, etc.). Meanwhile, once Turboveg v3 is released in 2017, users will be able to export their relevés into this new standardized format, which represents their data much more faithfully.</span></div>
</div>
</div>
</div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/16423423909368777750noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-18310265663071562752016-04-06T10:08:00.000+02:002016-04-06T17:01:22.509+02:00Updating the GBIF BackboneThe taxonomy employed by GBIF for organising all occurrences into a
consistent view has remained unchanged since 2013. We have been working on
a replacement for some time and are pleased to introduce a preview in this
post. The work is rather complex and tries to establish an automated
process to build a new backbone which we aim to run on a regular, probably
quarterly basis. We would like to release the new taxonomy rather soon and
improve the backbone iteratively. Large regressions should be avoided
initially, but it is quite hard to evaluate all the changes between two large
taxonomies of 4-5 million names each. We are therefore seeking feedback
and help in discovering oddities in the new backbone.<br />
<h3>
Relevance & Challenges</h3>
Every occurrence record in GBIF is matched to a taxon in the backbone.
Because occurrence records in GBIF cover the whole tree of life and names
may come from all possible, often outdated, taxonomies, it is important to
have the broadest coverage of names possible. We also deal with fossil
names, extinct taxa and (due to advanced digital publishing) even names
that have just been described a week before the data is indexed at
GBIF.<br />
The Taxonomic Backbone provides a single classification and a synonymy
that we use to inform our systems when creating maps, providing metrics or
even when you do a plain occurrence search. It is also used to crosslink
names between different checklist datasets.<br />
<h3>
The Origins</h3>
The very first taxonomy that GBIF used was based on the Catalogue of
Life. As this only included around half the names we found in GBIF
occurrences, all other cleaned occurrence names were merged into the GBIF
backbone. As the backbone grew we never deleted names, and we increasingly
faced redundant names with slightly different
classifications. It was time for a different procedure.<br />
<h3>
The Current Backbone</h3>
The current version of the backbone was built in July 2013. It is
largely based on the Catalogue of Life from 2012 and has folded in names
from <a href="https://github.com/mdoering/backbone-preview/blob/master/nub-live/sources.md">39 further taxonomic sources</a>.
It was built using an automated process that made use of selected checklists from
the GBIF ChecklistBank in a prioritised order. The Catalogue of Life was
still the starting point and provided the higher classification down to
orders.
The <a href="http://www.gbif-uat.org/dataset/714c64e3-2dc1-4bb7-91e4-54be5af4da12">Interim
Register of Marine and Nonmarine Genera</a> was used as the single
reference list for generic homonyms. Otherwise only a single version of any
name was allowed to exist in the backbone, even where the authorship
differed.<br />
<h4>
Current issues</h4>
We kept track of <a href="http://dev.gbif.org/issues/issues/?jql=labels%20%3D%20nub">nearly 150
reported issues</a>. Some of the main issues showing up regularly that we
wanted to address were:<br />
<ul>
<li>Enable an <a href="http://dev.gbif.org/issues/browse/POR-2467">automated build
process</a> so we can use the latest Catalogue of Life and other
sources to capture newly described or currently missing names
</li>
<li>It was impossible to have <a href="http://dev.gbif.org/issues/browse/POR-353">synonyms using the same
canonical name but with different authors</a>. This means <a href="http://www.gbif.org/species/4113236"><em>Poa pubescens</em></a> was
always considered a synonym of <em>Poa pratensis</em> L. when in fact
<em>Poa pubescens</em> R.Br. is considered a
synonym of <em>Eragrostis pubescens</em> (R.Br.) Steud.
</li>
<li>Some families contain far too many accepted species and hardly any
synonyms. Especially for plants the Catalogue of Life was surprisingly
sparsely populated and we heavily relied on IPNI names. For example the
family <a href="http://dev.gbif.org/issues/browse/POR-1389"><em>Cactaceae</em> has
12,062 accepted species</a> in GBIF while The Plant List recognizes
just 2,233.
</li>
<li>Many accepted names are based on the same basionym. For example the
current backbone considers both <a href="http://www.gbif.org/species/7283318"><em>Sulcorebutia breviflora</em>
Backeb.</a> and <a href="http://www.gbif.org/species/7281391"><em>Weingartia breviflora</em>
(Backeb.) Hentzschel & K.Augustin</a> as accepted taxa.
</li>
<li>Relying purely on IRMNG for homonyms meant that homonyms which were
not found in IRMNG were conflated. On the other hand, there are many
genera in IRMNG (and thus in the backbone) that are hardly used
anywhere, creating confusion and leaving many empty genera without any
species in our backbone.</li>
</ul>
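The authorship issue above can be illustrated with a tiny sketch (plain Java, invented names and authors; GBIF's real name parsing is far more sophisticated):

```java
// Illustrates, with invented example names, why matching on canonical
// names alone conflates homonyms, while including authorship keeps them apart.
class HomonymDemo {
  // crude split: first two words form the canonical name, the rest is authorship
  static String canonicalOf(String scientificName) {
    String[] parts = scientificName.split(" ", 3);
    return parts[0] + " " + parts[1];
  }

  // old behaviour: names match if their canonical forms match
  static boolean sameByCanonical(String a, String b) {
    return canonicalOf(a).equals(canonicalOf(b));
  }

  // new behaviour: the full name including authorship must match
  static boolean sameWithAuthorship(String a, String b) {
    return a.equals(b);
  }

  public static void main(String[] args) {
    // two hypothetical homonyms published by different authors
    String n1 = "Genus species Smith";
    String n2 = "Genus species Jones";
    System.out.println(sameByCanonical(n1, n2));    // true: conflated
    System.out.println(sameWithAuthorship(n1, n2)); // false: kept apart
  }
}
```

Canonical-only matching would treat the two names as one taxon, which is exactly the <em>Poa pubescens</em> problem described above.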
<h3>
The New Backbone</h3>
The new backbone is available for <a href="http://www.gbif-uat.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c">preview
in our test environment</a>. In order to review the new backbone and
compare it to the <a href="http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c">previous
version</a> we provide a few tools with a different focus:<br />
<ul>
<li>
<strong>Stable ID report</strong>: We have joined the old and new
backbone names to each other and <a href="https://github.com/mdoering/backbone-preview/blob/master/nub/stable-ids.md">compared their identifiers</a>. When joining on
the full scientific name there is still an issue with changing
identifiers, which we are investigating.
</li>
<li>
<strong>Tree Diffs</strong>: For comparing the higher
classification we used a <a href="http://iphylo.blogspot.dk/2015/12/visualising-difference-between-two.html">
tool from Rod Page</a> to <a href="http://mdoering.github.io/backbone-preview/families.html">diff the
tree down to families</a>. There are surprisingly many changes, but
all of them stem from evolution in the Catalogue of Life or the
changed Algae classification.
</li>
<li>
<strong>Nub Browser</strong>: For comparing actual species and also
reviewing the impact of the changed taxonomy on the GBIF
occurrences, we developed a <a href="http://mdoering.github.io/nub-browser/app/#/">new Backbone
Browser</a> sitting on top of our existing API (Google Chrome only). Our test
environment has a complete copy of the current GBIF occurrence
index which we have reprocessed to use the new backbone. This also
includes all maps and <a href="http://mdoering.github.io/nub-browser/app/#/metrics">metrics</a>
which we show in the new browser.
</li>
</ul>
Family <a href="http://mdoering.github.io/nub-browser/app/#/taxon/7683"><em>Asparagaceae</em></a>
as seen in the nub browser:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-Ot4eGWB1YR4/VwPRmqyBBVI/AAAAAAAAEJ4/4kLPyjgcXBQCwnm82cZJ0UTb83ik_hFUA/s1600/Asparagaceae.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="261" src="https://1.bp.blogspot.com/-Ot4eGWB1YR4/VwPRmqyBBVI/AAAAAAAAEJ4/4kLPyjgcXBQCwnm82cZJ0UTb83ik_hFUA/s320/Asparagaceae.png" width="320" /></a>
</div>
Red numbers next to names indicate taxa that have fewer occurrences
using the new backbone, while green numbers indicate an increase. This is
also visible in the tree maps of the children by occurrence count. The genus
<em>Campylandra</em> J.G. Baker, 1875 is dark red with zero occurrences because the
species in that genus were moved into the genus <em>Rhodea</em> in the latest
Catalogue of Life.<br />
<br />
Species <a href="http://mdoering.github.io/nub-browser/app/#/taxon/2768367"><em>Asparagus
asparagoides</em></a> as seen in the nub browser:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-_mv67R2iyZA/VwPRse48jHI/AAAAAAAAEJ8/58Cu_6fY3kogM4Thp6JrpTfQtL9trhyXA/s1600/Asparagus_asparagoides.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="171" src="https://3.bp.blogspot.com/-_mv67R2iyZA/VwPRse48jHI/AAAAAAAAEJ8/58Cu_6fY3kogM4Thp6JrpTfQtL9trhyXA/s320/Asparagus_asparagoides.png" width="320" /></a>
</div>
The details view shows all synonyms, the basionym and also a list of
homonyms from the new backbone.<br />
<h4>
Sources</h4>
We manually curate a <a href="https://github.com/gbif/checklistbank/blob/master/checklistbank-nub/nub-sources.tsv">
list of priority-ordered checklist datasets</a> that we use to build the
taxonomy. Three datasets are treated in a slightly special way:<br />
<ol>
<li>
<a href="http://www.gbif-uat.org/dataset/daacce49-b206-469b-8dc2-2257719f3afa">
GBIF Backbone Patch</a>: a small dataset we manually curate at GBIF
to override any other list. We mainly use the dataset to add
missing names reported by users.
</li>
<li>
<a href="http://www.gbif-uat.org/dataset/7ddf754f-d193-4cc9-b351-99906754a03b">
Catalogue of Life</a>: The Catalogue of Life provides the entire
higher classification above families, with the exception of algae.
</li>
<li>
<a href="http://www.gbif-uat.org/dataset/7ea21580-4f06-469d-995b-3f713fdcc37c">
GBIF Algae Classification</a>: With the withdrawal of AlgaeBase, the
current Catalogue of Life lacks any algae taxonomy. To allow
other sources to at least provide genus and species names for algae
we have created a new dataset that just provides an algae
classification down to families. This classification fits right
into the empty phyla of the Catalogue of Life.
</li>
</ol>
The GBIF portal now also lists <a href="http://www.gbif-uat.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c/constituents">
the source datasets that contributed to the GBIF Backbone</a> and the
number of names that were used as primary references.<br />
<h4>
Other Improvements</h4>
As well as fixing the main issues listed above, there is another
frequently occurring situation that we have improved. Many occurrences
could not be matched to a backbone species because the name existed
multiple times as an accepted taxon. In the new backbone, only one version
of a name is ever considered to be accepted; all others are now flagged as
doubtful. That resolves many cases in which name ambiguity prevented a
species match. For example, there are many occurrences of
<em>Hyacinthoides hispanica</em> in Britain which only show up in the new
backbone (<a href="http://www.gbif.org/occurrence/795765755">old</a> /
<a href="http://www.gbif-uat.org/occurrence/795765755">new</a> occurrence,
<a href="http://api.gbif.org/v1/species/match?verbose=true&kingdom=plantae&name=Hyacinthoides%20hispanica">
old</a> / <a href="http://api.gbif-uat.org/v1/species/match?verbose=true&kingdom=plantae&name=Hyacinthoides%20hispanica">
new</a> match). This is best seen in the <a href="http://mdoering.github.io/nub-browser/app/#/taxon/5304257">map comparison
of the nub browser</a>, try to swipe the map!<br />
<h4>
Known problems</h4>
We are aware of some problems with the new backbone which we would like to
address in the <a href="http://dev.gbif.org/issues/browse/POR-3029">next
stage</a>. We consider two of these issues candidates for blocking the
release of the new backbone:<br />
<h5>
Species matching
service ignores authorship</h5>
Because we now keep different authors apart more strictly, the backbone
contains many more species names that differ only in their authorship. The
current algorithm keeps just one of these names, from the most trusted
source (e.g. CoL), as the accepted name and treats the others as doubtful
if they are not already treated as synonyms.<br />
The problem currently is that the species matching service we use to
align occurrences to the backbone does <a href="http://dev.gbif.org/issues/browse/POR-2768">not deal with authorship</a>.
Therefore we have some cases where occurrences are attached to a doubtful
name or even split across some of the “homonyms”.<br />
There are 166,832 species names with different authorship
in the new backbone, accounting for 98,977,961 occurrences.<br />
<h5>
Too eager basionym merging</h5>
The same epithet is sometimes used by the same author for different
names in the same family. This currently leads to an <a href="http://dev.gbif.org/issues/browse/POR-2989">overly eager basionym
grouping</a> with fewer accepted names.<br />
As these names are still in the backbone and occurrences can still be matched
to them, this is not currently considered a blocker.<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com25tag:blogger.com,1999:blog-2326624813533383062.post-39280565035711513612016-02-25T21:22:00.000+01:002016-02-25T21:23:42.557+01:00Reprojecting coordinates according to their geodetic datum<!DOCTYPE html>
<html>
<body>
<p>For a long time Darwin Core has had a term for declaring the exact geodetic datum of a given coordinate.
Quite a few data publishers in GBIF have used <a href="http://rs.tdwg.org/dwc/terms/index.htm#geodeticDatum">dwc:geodeticDatum</a> for some time to publish the datum of their location coordinates.</p>
<p>Until now GBIF has treated all coordinates as if they were in <a href="http://en.wikipedia.org/wiki/World_Geodetic_System">WGS84</a>, the widespread global standard datum used by the Global Positioning System (GPS). Accordingly, locations given in a different datum, for example NAD27 or AGD66, were slightly displaced on GBIF maps. This so-called “datum shift” is not dramatic, but it can amount to a few hundred metres depending on the location and datum. The University of Colorado has a nice <a href="http://www.colorado.edu/geography/gcraft/notes/datum/datum_f.html">visualization of the impact</a>.</p>
<p>At GBIF we now interpret the geodeticDatum and reproject all coordinates as well as we can into the single datum WGS84. This involves two main steps: parsing and interpreting the given verbatim geodetic datum, and then performing the actual transformation based on the known geodetic parameters.</p>
<h4 id="parsinggeodeticdatum">Parsing geodeticDatum</h4>
<p>As usual GBIF receives a lot of noise when reading dwc:geodeticDatum. After removing the obvious bad values, e.g. those introduced by bad mappings done by the publisher, we still ended up with over 300 different values for the datum. Most commonly simple names or abbreviations are used, e.g. NAD27, WGS72, ED50, TOKYO. In some cases we also see proper <a href="http://www.epsg.org/">EPSG</a> codes coming in, e.g. EPSG:4326, which is the EPSG code for WGS84. As EPSG is a widespread and complete reference dataset of geodetic parameters, supported by many Java libraries, we decided to add a new <a href="https://github.com/gbif/parsers/blob/master/src/main/java/org/gbif/common/parsers/geospatial/DatumParser.java">DatumParser</a> to our parser library that directly returns EPSG integer codes for datum values. That way we can look up geodetic parameters easily in the subsequent transformation step. In addition to parsing any given EPSG:xyz code directly, it also understands most datums found in the GBIF network, based on a simple <a href="https://github.com/gbif/parsers/blob/master/src/main/resources/dictionaries/parse/datum.txt">dictionary file</a> which we manually curate.</p>
<p>Even though EPSG codes are well maintained, very complete and supported by most software, opaque integer codes are adopted less readily than meaningful short names. Perhaps a lesson to keep in mind when debating identifiers elsewhere.</p>
<p>Our recommendation to publishers is to use EPSG codes if you know them; otherwise stick to the simple, well-known names. A good place to search for EPSG codes is <a href="http://epsg.io/">http://epsg.io/</a>.</p>
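As an illustration of the parsing step, a heavily simplified, hypothetical stand-in for the DatumParser might look like this (the dictionary entries and normalisation are illustrative, not GBIF's actual implementation):

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleDatumParser {
  // tiny curated dictionary mapping well-known datum names to EPSG codes
  private static final Map<String, Integer> DICTIONARY = new HashMap<>();
  static {
    DICTIONARY.put("WGS84", 4326);
    DICTIONARY.put("WGS1984", 4326);
    DICTIONARY.put("NAD27", 4267);
    DICTIONARY.put("NAD83", 4269);
    DICTIONARY.put("ED50", 4230);
    DICTIONARY.put("TOKYO", 4301);
  }
  private static final Pattern EPSG = Pattern.compile("EPSG:\\s*(\\d+)");

  /** Returns the EPSG code for a verbatim datum string, or null if unknown. */
  static Integer parse(String verbatim) {
    if (verbatim == null) return null;
    String norm = verbatim.trim().toUpperCase(Locale.ROOT);
    Matcher m = EPSG.matcher(norm);
    if (m.matches()) return Integer.valueOf(m.group(1));
    // strip spaces and punctuation so "WGS 84" matches the "WGS84" entry
    return DICTIONARY.get(norm.replaceAll("[^A-Z0-9]", ""));
  }

  public static void main(String[] args) {
    System.out.println(parse("EPSG:4326")); // 4326
    System.out.println(parse("nad27"));     // 4267
  }
}
```

The real parser follows the same pattern: pass EPSG codes straight through, normalise everything else and look it up in a manually curated dictionary.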
<h4 id="transformation">Transformation</h4>
<p>Once we have a decimal coordinate and a well-known geodetic source datum, the transformation itself is rather straightforward. We use <a href="http://www.geotools.org/">geotools</a> to do the work. The first step is to instantiate a CoordinateReferenceSystem (CRS) using the parsed EPSG code of the geodeticDatum. A CRS combines a datum with a coordinate system; in our case this is always a two-dimensional system with the prime meridian at Greenwich, longitude values increasing east and latitude values north.</p>
<p>As EPSG codes can refer to either a plain datum or a complete spatial reference system, we need to take this into account when building the CRS:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"><code> private CoordinateReferenceSystem parseCRS(String datum) {
CoordinateReferenceSystem crs = null;
// the GBIF DatumParser in use
ParseResult&lt;Integer&gt; epsgCode = PARSER.parse(datum);
if (epsgCode.isSuccessful()) {
final String code = "EPSG:" + epsgCode.getPayload();
// first try to create a full fledged CRS from the given code
try {
crs = CRS.decode(code);
} catch (FactoryException e) {
// that didn't work, maybe it is *just* a datum
try {
GeodeticDatum dat = DATUM_FACTORY.createGeodeticDatum(code);
// build a CRS using the standard 2-dim Greenwich coordinate system
crs = new DefaultGeographicCRS(dat, DefaultEllipsoidalCS.GEODETIC_2D);
} catch (FactoryException e1) {
// also not a datum, no further ideas, log error
LOG.info("No CRS or DATUM for given datum code >>{}<<: {}", datum, e1.getMessage());
}
}
}
return crs;
}
</code></pre>
<p>Once we have a CRS instance we can create a specific WGS84 transformation and apply it to our coordinate:</p>
<pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"><code class="java">public ParseResult&lt;LatLng&gt; reproject(double lat, double lon, String datum) {
CoordinateReferenceSystem crs = parseCRS(datum);
MathTransform transform = CRS.findMathTransform(crs, DefaultGeographicCRS.WGS84, true);
// different CRS may swap the x/y axis for lat lon, so check first:
double[] srcPt;
double[] dstPt = new double[3];
if (CRS.getAxisOrder(crs) == CRS.AxisOrder.NORTH_EAST) {
// lat lon
srcPt = new double[] {lat, lon, 0};
} else {
// lon lat
srcPt = new double[] {lon, lat, 0};
}
transform.transform(srcPt, 0, dstPt, 0, 1);
return ParseResult.success(ParseResult.CONFIDENCE.DEFINITE, new LatLng(dstPt[1], dstPt[0]), issues);
}
</code></pre>
<p>The actual <a href="https://github.com/gbif/occurrence/blob/master/occurrence-processor/src/main/java/org/gbif/occurrence/processor/interpreting/util/Wgs84Projection.java#L61">projection code</a> does a bit more null and exception handling, which I have removed here for simplicity.</p>
<p>As you can see above we also have to watch out for spatial reference systems that use a different axis ordering. Luckily geotools knows all about that and provides a very simple way to test for it. </p>
<h4 id="issueflags">Issue flags</h4>
<p>As with most of our processing, we flag records when problems occur or assumptions are made. For the geodetic datum processing we keep track of 5 distinct issues, which are available as <a href="http://www.gbif-uat.org/occurrence/search?ISSUE=COORDINATE_REPROJECTION_FAILED&ISSUE=GEODETIC_DATUM_INVALID&ISSUE=COORDINATE_REPROJECTION_SUSPICIOUS&ISSUE=GEODETIC_DATUM_ASSUMED_WGS84&ISSUE=COORDINATE_REPROJECTED">GBIF portal occurrence search filters</a>:</p>
<ul>
<li>COORDINATE_REPROJECTION_FAILED: A CRS was instantiated, but the transformation failed for some reason.</li>
<li>GEODETIC_DATUM_INVALID: The datum parser was unable to return an EPSG code for the given datum string.</li>
<li>COORDINATE_REPROJECTION_SUSPICIOUS: The reprojection resulted in a datum shift larger than 0.1 degrees.</li>
<li>GEODETIC_DATUM_ASSUMED_WGS84: No datum was given or the given datum was not understood. In that case the original coordinates remain untouched.</li>
<li>COORDINATE_REPROJECTED: The coordinate was successfully transformed and differs now from the verbatim one given.</li>
</ul>
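A minimal sketch of how such flags could be derived from the verbatim and reprojected coordinates (illustrative only; the real processing covers more cases, but the 0.1-degree threshold is the one described above):

```java
import java.util.ArrayList;
import java.util.List;

class ReprojectionFlags {
  static final double SUSPICIOUS_SHIFT_DEGREES = 0.1;

  /** Derives issue flags by comparing verbatim and reprojected coordinates. */
  static List<String> flag(double lat, double lon, double newLat, double newLon) {
    List<String> issues = new ArrayList<>();
    double shift = Math.max(Math.abs(lat - newLat), Math.abs(lon - newLon));
    if (shift > 0) {
      issues.add("COORDINATE_REPROJECTED");
    }
    if (shift > SUSPICIOUS_SHIFT_DEGREES) {
      issues.add("COORDINATE_REPROJECTION_SUSPICIOUS");
    }
    return issues;
  }

  public static void main(String[] args) {
    // a typical datum shift of a few tens of metres: reprojected, not suspicious
    System.out.println(flag(38.8977, -77.0365, 38.8978, -77.0362));
  }
}
```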
</body>
</html><div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-9091424326316397452015-06-11T17:06:00.000+02:002015-06-12T09:29:40.096+02:00Simplified Downloads<div style="text-align: justify;">
Since its re-launch in 2013 <a href="http://www.gbif.org/" target="_blank">gbif.org</a> has supported downloading occurrence data matching an arbitrary query, with the download provided as a <a href="http://rs.tdwg.org/dwc/" target="_blank">Darwin Core Archive</a> file whose internal content is described <a href="http://www.gbif.org/faq/datause" target="_blank">here</a>. This format contains comprehensive and self-explanatory information, which makes it suitable for referencing in external resources. However, for people who only need the occurrence data in its simplest form, the <a href="http://rs.tdwg.org/dwc/" target="_blank">DwC-A</a> format presents additional complexity that can make the data hard to use. For that reason we now support a new download format: a zip file containing a single tab-delimited file with the most commonly used fields/terms. This makes it much easier to import the data into tools such as Microsoft Excel, geographic information systems and relational databases. The download functionality has been extended to allow selection of the desired format:</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-AbAeglZJSro/VXjLYFqv1WI/AAAAAAAAAZE/xdbKFBeSkzI/s1600/Screen%2BShot%2B2015-06-10%2Bat%2B18.19.42.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-AbAeglZJSro/VXjLYFqv1WI/AAAAAAAAAZE/xdbKFBeSkzI/s320/Screen%2BShot%2B2015-06-10%2Bat%2B18.19.42.png" /></a></div>
<div style="text-align: justify;">
From this point the functionality remains the same: eventually you will receive an email containing a hyperlink where the file can be downloaded.</div>
<h2>
Technical Architecture</h2>
The simplified download format was implemented under the technical requirement that further formats can be added in the near future with minimal impact on the formats already supported. In general, occurrence downloads are implemented using two different sets of technologies depending on the estimated size of the download in number of records: downloads below a threshold of 200,000 records are considered small and those above it big, and history shows the vast majority of downloads are “small”. The following chart summarizes the key technologies that enable occurrence downloads:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-csKM37rv3TI/VXjM_45ohoI/AAAAAAAAAZQ/5ILGlNPlSiY/s1600/Screen%2BShot%2B2015-06-11%2Bat%2B01.48.22.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-csKM37rv3TI/VXjM_45ohoI/AAAAAAAAAZQ/5ILGlNPlSiY/s320/Screen%2BShot%2B2015-06-11%2Bat%2B01.48.22.png" /></a></div>
<h2>
Download workflow</h2>
Occurrence downloads are automated using a workflow engine called <a href="http://oozie.apache.org/" target="_blank">Oozie</a>, which coordinates the required steps to produce a single download file. In summary, the workflow proceeds as follows: <br />
<ol>
<li>Initially, <a href="http://lucene.apache.org/solr/" target="_blank">Apache Solr</a> is contacted to determine the number of records that the download file will contain.</li>
<li>Big or small?</li>
<ol>
<li> If the amount of records is less than 200,000 (a small download), <a href="http://lucene.apache.org/solr/" target="_blank">Apache Solr</a> is queried to iterate over the results; the detail of each occurrence record is fetched from <a href="http://hbase.apache.org/" target="_blank">HBase</a> since it’s the official storage of occurrence records. Individual downloads are produced by a multi-threaded application implemented using the <a href="http://akka.io/" target="_blank">Akka</a> framework; the Apache <a href="https://zookeeper.apache.org/" target="_blank">Zookeeper</a> and <a href="http://curator.apache.org/" target="_blank">Curator</a> frameworks are used to limit the number of threads that can run at the same time (this avoids a thread explosion on the machines that run the download workflow).</li>
<li>If the amount of records is greater than 200,000 (a big download), <a href="https://hive.apache.org/" target="_blank">Apache Hive</a> is used to retrieve the occurrence data from an <a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html" target="_blank">HDFS</a> table. To avoid overloading <a href="http://hbase.apache.org/" target="_blank">HBase</a>, we create that HDFS table as a daily snapshot of the occurrence data stored in <a href="http://hbase.apache.org/" target="_blank">HBase</a>.</li>
</ol>
<li>Finally the occurrence records are collected and organized in the requested output format (DwC-A or Simple).</li>
</ol>
Note: the details of the implementation can be found in the GitHub project: <a href="https://github.com/gbif/occurrence/tree/master/occurrence-download">https://github.com/gbif/occurrence/tree/master/occurrence-download</a>.<br />
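For reference, choosing one format or the other from a client boils down to a single field in the download request body. A hedged sketch (the SIMPLE_CSV/DWCA format names and the /v1/occurrence/download/request endpoint reflect our understanding of the public GBIF API; authentication and HTTP plumbing are omitted):

```java
class DownloadRequestDemo {
  /** Builds the JSON body for an occurrence download request. */
  static String requestBody(String creator, String format, String predicateJson) {
    return "{"
        + "\"creator\":\"" + creator + "\","
        + "\"format\":\"" + format + "\","
        + "\"predicate\":" + predicateJson
        + "}";
  }

  public static void main(String[] args) {
    // illustrative predicate: all Danish occurrences
    String predicate =
        "{\"type\":\"equals\",\"key\":\"COUNTRY\",\"value\":\"DK\"}";
    // POST this body to /v1/occurrence/download/request with HTTP Basic auth
    System.out.println(requestBody("someUser", "SIMPLE_CSV", predicate));
  }
}
```

Swapping "SIMPLE_CSV" for "DWCA" would request the classic Darwin Core Archive instead.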
<br />
<h2>
Conclusion</h2>
<div>
Reducing both the number of columns and the size (number of bytes) in our downloads has been one of our most requested features, and we hope this makes using the GBIF data easier for everyone.</div>
<br />
<br /><div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Fede Méndezhttp://www.blogger.com/profile/11707904250426427540noreply@blogger.com3tag:blogger.com,1999:blog-2326624813533383062.post-68988103071582104572015-05-29T16:34:00.000+02:002015-08-25T16:01:13.198+02:00Don't fill your HDFS disks (upgrading to CDH 5.4.2)Just a short post on the dangers of filling your HDFS disks. It's a warning you'll hear at conferences and in best practices blog posts like this one, but usually with only a vague consequence of "bad things will happen". We upgraded from CDH 5.2.0 to CDH 5.4.2 this past weekend and learned the hard way: bad things will happen.<br />
<br />
<h4>
The Machine Configuration</h4>
<div>
The upgrade went fine in our dev cluster (which has almost no data in HDFS) so we weren't expecting problems in production. Our production cluster is of course slightly different from our (much smaller) dev cluster. In production we have 3 masters, where one holds the NameNode and another holds the SecondaryNameNode (we're not yet using a High Availability setup, but it's in the plan). We have 12 DataNodes, each with 13 disks dedicated to HDFS storage: 12 of 1TB and one of 512GB. They are formatted with 0% reserved blocks for root. The machines are evenly split into two racks.</div>
<div>
<br /></div>
<h4>
Pre Upgrade Status</h4>
<div>
We were at about 75% total HDFS usage with only a few percent difference between machines. We were configured to use Round Robin block placement (<span style="font-family: Courier New, Courier, monospace;">dfs.datanode.fsdataset.volume.choosing.policy</span>) with 10GB reserved for non-hdfs use (<span style="font-family: Courier New, Courier, monospace;">dfs.datanode.du.reserved</span>), which are the defaults in CDH manager. Each of the 1TB disks was around 700GB used (of 932GB usable), and the 512GB disks were all at their limit: 456GB used (of 466GB usable). That left only the configured 10GB free for non-hdfs use on the small disks. Our disks are mounted in the pattern /mnt/disk_a, /mnt/disk_b and so on, with /mnt/disk_m as the small disk. We're using the free version of CDHM so we can't do rolling upgrades, meaning this upgrade would bring everything down. And because our cluster is getting full (> 80% usage is another rumoured "bad things" threshold) we have reduced one class of data (users' <a href="http://www.gbif.org/occurrence/search" target="_blank">occurrence downloads</a>) to a replication factor of 2 (from the default of 3). This is considered somewhere between naughty and criminal, and you'll see why below.</div>
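Why mixed disk sizes hurt under round-robin placement is easy to see with a small simulation (illustrative only, not the actual HDFS policy code; the capacities mirror our twelve 932GB plus one 466GB usable layout):

```java
class RoundRobinFill {
  /** Distributes 1GB blocks round-robin over disks, skipping full ones. */
  static long[] fill(long[] capacityGb, long totalGb) {
    long[] used = new long[capacityGb.length];
    long remaining = totalGb;
    while (remaining > 0) {
      boolean placed = false;
      for (int i = 0; i < capacityGb.length && remaining > 0; i++) {
        if (used[i] < capacityGb[i]) {
          used[i]++;
          remaining--;
          placed = true;
        }
      }
      if (!placed) break; // every disk is full
    }
    return used;
  }

  public static void main(String[] args) {
    long[] caps = new long[13];
    for (int i = 0; i < 12; i++) caps[i] = 932; // twelve 1TB disks (usable GB)
    caps[12] = 466;                             // one 512GB disk
    long[] used = fill(caps, 8856);             // roughly 75% overall usage
    // the small disk is completely full while the large ones sit around 75%
    System.out.println(used[12] + "/" + caps[12] + " vs " + used[0] + "/" + caps[0]);
  }
}
```

Because round-robin hands each disk the same number of blocks, the smallest disk hits its ceiling long before the cluster as a whole is full, which is exactly the state we were in before the upgrade.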
<div>
<br /></div>
<h4>
Upgrade Time</h4>
<div>
We followed the <a href="http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_earlier_cdh5_upgrade.html" target="_blank">recommended procedure</a> and did the Oozie, Hive, and CDH manager backups, downloaded the latest parcels, and pressed the big Update button. Everything appeared to be going fine until HDFS tried to start up again, when it took a really long time (several minutes, after which the CDHM upgrade process finally gave up saying the DataNodes weren't making contact). Looking at the DataNode logs we see that it was performing a "Block Pool Upgrade", which took between 90 and 120 seconds for each of our ~700GB disks. Here's an excerpt of where it worked without problems:</div>
<div>
<br /></div>
<div>
<!--?xml version="1.0" encoding="UTF-8" standalone="no"?-->
<br />
<div>
<span style="font-size: 11px;"><span style="font-family: Courier New, Courier, monospace;">2015-05-23 20:18:53,715 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /mnt/disk_a/dfs/dn/in_use.lock acquired by nodename <a href="mailto:27117@c4n1.gbif.org">27117@c4n1.gbif.org</a><br />2015-05-23 20:18:53,811 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-2033573672-130.226.238.178-1367832131535<br />2015-05-23 20:18:53,811 INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled for /mnt/disk_a/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535<br />2015-05-23 20:18:53,823 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrading block pool storage directory /mnt/disk_a/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535.<br /> old LV = -56; old CTime = 1416737045694.<br /> new LV = -56; new CTime = 1432405112136<br />2015-05-23 20:20:33,565 INFO org.apache.hadoop.hdfs.server.common.Storage: HardLinkStats: 59768 Directories, including 53157 Empty Directories, 0 single Link operations, 6611 multi-Link operations, linking 22536 files, total 22536 linkable files. Also physically copied 0 other files.</span></span></div>
<div>
<span style="font-size: 11px;"><span style="font-family: Courier New, Courier, monospace;">2015-05-23 20:20:33,609 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrade of block pool BP-2033573672-130.226.238.178-1367832131535 at /mnt/disk_a/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535 is complete</span></span></div>
</div>
<div>
<br /></div>
<div>
That upgrade time happens sequentially for each disk, so even though the machines were upgrading in parallel, we were still looking at ~30 minutes of downtime for the whole cluster. As if that wasn't sufficiently worrying, we then finally got to disk_m, our nearly full 512GB disk:</div>
<div>
<br /></div>
<div>
<!--?xml version="1.0" encoding="UTF-8" standalone="no"?-->
<br />
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"><span style="font-stretch: normal;">2015-05-23 20:53:05,814 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /mnt/disk_m/</span><span style="font-stretch: normal;">dfs/dn/in_use.lock acquired by nodename <a href="mailto:12424@c4n1.gbif.org">12424@c4n1.gbif.org</a><br />2015-05-23 20:53:05,869 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-2033573672-130.226.238.178-1367832131535<br />2015-05-23 20:53:05,870 INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled for /mnt/disk_m/</span><span style="font-stretch: normal;">dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535<br />2015-05-23 20:53:05,886 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrading block pool storage directory /mnt/disk_m/</span><span style="font-stretch: normal;">dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535.<br /> old LV = -56; old CTime = 1416737045694.<br /> new LV = -56; new CTime = 1432405112136<br />2015-05-23 20:54:12,469 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to analyze storage directories for block pool BP-2033573672-130.226.238.178-1367832131535<br />java.io.IOException: Cannot create directory /mnt/disk_m/</span>dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535/current/finalized/subdir91/subdir168<br /> at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(DataStorage.java:1259)<br /> at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(DataStorage.java:1296)<br /> at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(DataStorage.java:1296)<br /> at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocks(DataStorage.java:1023)<br /> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.linkAllBlocks(BlockPoolSliceStorage.java:647)<br /> at 
org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doUpgrade(BlockPoolSliceStorage.java:456)<br /> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:390)<br /> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:171)<br /> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:214)<br /> at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:242)<br /> at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:396)<br /> at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)<br /> at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1397)<br /> at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1362)<br /> at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)<br /> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:227)<br /> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:839)<br /> at java.lang.Thread.run(Thread.java:745)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">2015-05-23 20:54:12,476 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage for block pool: BP-2033573672-130.226.238.178-1367832131535 : Cannot create directory /mnt/disk_m/<span style="font-stretch: normal;">dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535/current/finalized/subdir91/subdir168</span></span></div>
</div>
<div>
<br /></div>
<div>
The somewhat misleading "Cannot create directory" is not a file permission problem but rather a disk full problem. During this block pool upgrade some temporary space is needed for rewriting metadata, and that space is apparently more than the 10GB that was available to "non-HDFS" (which we've concluded means "not HDFS storage files, but everything else is fair game"). Because <i>some</i> space is available to start the upgrade, it begins, but when it exhausts the disk it fails, and <b>This Kills The DataNode</b>. It does clean up after itself, but prevents the DataNode from starting, meaning our cluster was on its knees and in no danger of standing up.</div>
<div>
<br /></div>
<div>
So the problem was lack of free space, which on 10 of our 12 machines we were able to solve by wiping temporary files from the co-located YARN directory. Those 10 machines were then able to upgrade their disk_m and started up. We still had two nodes down, and unfortunately they were in different racks, which meant a big pile of our replication-factor-2 files were missing blocks (the default HDFS block placement policy puts the second and subsequent copies on a different rack from the first copy).</div>
<div>
<br /></div>
<div>
While digging around in the different properties that we thought could affect our disks and HDFS behaviour we were also restarting the failing DataNodes regularly. At some point the log message changed to:</div>
<div>
<br /></div>
<div>
<div class="p1">
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">WARN org.apache.hadoop.hdfs.server.common.Storage: java.io.FileNotFoundException: /mnt/disk_m/dfs/dn/in_use.lock (No space left on device)</span></div>
<div class="p1">
<br /></div>
<div class="p1">
After that message the DataNode started, but with disk_m marked as a failed volume. We're not sure why this happened, but presume that after one of our failures it didn't clean up its temp files on disk_m, and on subsequent restarts found the disk completely full, (rightly) considered it unusable and tried to carry on. With the final two DataNodes up we had almost all of our cluster, minus the two failed volumes. There were only 35 corrupted files (missing blocks) left after they came up. These were files set to replication factor 2 that by bad luck had both copies of some of their blocks on the failed disk_m volumes (one in rack1, one in rack2).</div>
<div class="p1">
<br /></div>
<div class="p1">
It would not have been the end of the world to just delete the corrupted user downloads (they were all over a year old) but on principle, it would not be The Right Thing To Do.</div>
<div class="p1">
<br /></div>
<h4>
On inodes and hardlinks</h4>
<div class="p1">
The normal directory structure of the dfs dir in a DataNode is /dfs/dn/current/&lt;blockpool name&gt;/current/finalized, and within finalized is a whole series of directories to fan out the various blocks that the volume contains. During the block pool upgrade a copy of 'finalized' is made, called previous.tmp. It's not a normal copy, however: it uses <a href="http://en.wikipedia.org/wiki/Hard_link" target="_blank">hardlinks</a> in order to avoid duplicating all of the data (which obviously wouldn't work). The copy is needed during the upgrade and is removed afterwards. Since our upgrade failed halfway through we had both directories and had no choice but to move the entire /dfs directory off of /disk_m to a temporary disk and complete the upgrade there. We first tried a copy (use cp -a to preserve hardlinks) to a mounted NFS share. The copy looked fine but on startup the DataNode didn't understand the mounted drive ("drive not formatted"). Then we tried copying to a USB drive plugged into the machine and that ultimately worked (despite feeling <a href="http://www.aosabook.org/en/hdfs.html" target="_blank">decidedly un-Yahoo</a>). Once the USB drive was upgraded and online in the cluster, replication took over and copied all of its blocks to new homes in rack2. We then unmounted the USB drive, wiped both /disk_m's and let replication balance out again. Final result: no lost blocks.</div>
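The hardlink trick is easy to see in miniature. This is an illustrative Python sketch (the file names are made up, not real HDFS paths), showing why previous.tmp costs almost no space: both names point at the same inode, so no block data is duplicated.

```python
import os
import tempfile

# A hardlink is a second directory entry for the same inode, which is how
# previous.tmp can "copy" a near-full block directory without duplicating
# any block data -- only directory entries are created.
with tempfile.TemporaryDirectory() as d:
    block = os.path.join(d, "blk_1073741825")  # a pretend HDFS block file
    with open(block, "wb") as f:
        f.write(b"\x00" * 1024)

    link = os.path.join(d, "previous.tmp_blk")  # the "upgrade copy"
    os.link(block, link)                        # hardlink, not a data copy

    a, b = os.stat(block), os.stat(link)
    print(a.st_ino == b.st_ino)  # True: both names point at one inode
    print(a.st_nlink)            # 2: the inode now has two names
```

This is also why a plain copy to NFS failed for us: the hardlink relationship (and the exact on-disk layout the DataNode expects) has to survive the move, which is what cp -a is for.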
<div class="p1">
<br /></div>
<h4>
Mitigation</h4>
<div class="p1">
With the cluster happy again we made a few changes to hopefully ensure this doesn't happen again:</div>
<div class="p1">
</div>
<ul>
<li><span style="font-family: Courier New, Courier, monospace;">dfs.datanode.du.reserved:25GB</span> this guarantees 25GB free on each volume (up from 10GB) and should be enough to allow a future upgrade to happen</li>
<li><span style="font-family: Courier New, Courier, monospace;">dfs.datanode.fsdataset.volume.choosing.policy:AvailableSpace </span></li>
<li><span style="font-family: Courier New, Courier, monospace;">dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction:1.0 </span>together these two direct new blocks to disks that have more free space, thereby leaving our now full /disk_m alone</li>
</ul>
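As a sanity check before the next upgrade, the headroom these settings are meant to preserve can be verified from the OS. A minimal Python sketch; the /mnt/disk_* mount points and the 25GB figure mirror our setup above, so adapt both to your own layout:

```python
import os
import shutil

# Mirrors dfs.datanode.du.reserved = 25GB from the mitigation list above.
RESERVED_BYTES = 25 * 1024 ** 3

def volumes_below_reserve(mounts, reserved=RESERVED_BYTES):
    """Return (mount, free_bytes) pairs for volumes whose free space is under the reserve."""
    risky = []
    for mount in mounts:
        if not os.path.exists(mount):
            continue  # this node doesn't have that volume; skip it
        free = shutil.disk_usage(mount).free
        if free < reserved:
            risky.append((mount, free))
    return risky

# The /mnt/disk_a .. /mnt/disk_m layout described above:
mounts = ["/mnt/disk_%s" % c for c in "abcdefghijklm"]
for mount, free in volumes_below_reserve(mounts):
    print("%s has only %.1f GB free" % (mount, free / 1024 ** 3))
```

Anything this flags would have hit exactly the "Cannot create directory" failure we saw on disk_m.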
<h4>
Conclusion</h4>
<div>
This was one small taste of what can go wrong when filling heterogeneous disks in an HDFS cluster. We're sure there are worse dangers lurking on the full-disk horizon, so hopefully you've learned from our pain and will give yourself some breathing room when things start to fill up. Also, don't use a replication factor of less than 3 if there's any way you can help it.</div>
<br />
<div class="p1">
<br /></div>
<div class="p1">
<br /></div>
</div>
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>Oliver Meynhttp://www.blogger.com/profile/04706642473308341930noreply@blogger.com1tag:blogger.com,1999:blog-2326624813533383062.post-79071782052625656962015-03-30T22:30:00.000+02:002015-03-31T18:29:52.967+02:00Improving the GBIF Backbone matchingIn GBIF, <a href="http://www.gbif.org/occurrence">occurrence records</a> are matched to a taxon in a <a href="http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c">backbone taxonomy</a> using the <a href="http://www.gbif.org/developer/species#searching">species match API</a>. This is important to reduce spelling variations and to create consistent metrics and searches according to a single classification and synonymy.<br />
<br />
Over the past years we have been alerted to <a href="http://dev.gbif.org/issues/issues?jql=labels%20%3D%20speciesmatch">various bad matches</a>. Most of the reported issues refer to a false fuzzy match for a name missing in our backbone.<br />
<br />
In order to improve the taxonomic classification of occurrence records, we are undertaking two activities. The first is to improve the algorithms we use to fuzzily match names, and the second will be to improve the algorithms used to assemble the backbone taxonomy itself. Here I explain some of the work currently underway to tackle the former, which is visible in the test environment.<br />
<h2 id="1name-parsing-of-undetermined-species">
1. Name parsing of undetermined species</h2>
In occurrence records we see many partly undetermined names such as <em>Lucanus spec.</em> These rank markers were erroneously treated as real species epithets, which together with fuzzy matching produced poor results.<br />
<strong><br /></strong>
<strong>Examples</strong><br />
<ul>
<li><a href="http://www.gbif.org/occurrence/164267402/verbatim"><em>Xysticus</em> sp.</a> used to wrongly match <em>Xysticus spiethi</em> while it now just matches the genus <em>Xysticus</em>.</li>
<li><a href="http://www.gbif.org/occurrence/1061576151/verbatim"><em>Triodia</em> sp.</a> used to match the family Poaceae while it now matches the genus <em>Triodia</em>.</li>
</ul>
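The idea can be sketched in a few lines. This is a toy illustration, not the actual GBIF name parser, and the marker list is a small made-up subset of what the real parser recognises:

```python
# A toy subset of rank markers indicating an undetermined species;
# the real GBIF name parser recognises many more forms.
MARKERS = {"sp", "spec", "cf", "aff"}

def strip_rank_marker(name):
    """Drop a trailing rank marker so the name matches at genus rank
    instead of being fuzzily matched as a species epithet."""
    parts = name.strip().split()
    if len(parts) > 1 and parts[-1].rstrip(".").lower() in MARKERS:
        return " ".join(parts[:-1])
    return name

print(strip_rank_marker("Xysticus sp."))   # Xysticus
print(strip_rank_marker("Lucanus spec."))  # Lucanus
print(strip_rank_marker("Zea mays"))       # Zea mays (a real epithet, untouched)
```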
<h2 id="2-dameraulevenshtein-distance-algorithm">
2. Damerau–Levenshtein distance algorithm</h2>
For scoring fuzzy matches we have so far applied the <a href="http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance">Jaro–Winkler distance</a>, which is often used for matching person names. It tends to allow rather fuzzy matches at the end of long strings. This is desirable for scientific names, but the allowed fuzziness was too great, and we decided to revert to the classical and more predictable <a href="http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance">Damerau–Levenshtein distance</a>. This reduces false positive fuzzy matches considerably, even though we lost a few good matches at the same time.<br />
<strong><br /></strong>
<strong>Examples</strong><br />
<ul>
<li><a href="http://www.gbif.org/occurrence/1037140379/verbatim"><em>Xyris kralii</em> Wand.</a> used to match to <em>Xyris harleyi</em> but now just matches to the genus <em>Xyris L.</em> as the species is missing from our backbone.</li>
<li><a href="http://www.gbif.org/occurrence/144904719/verbatim"><em>Zea mays</em> subsp. <em>parviglumis</em> var. <em>huehuet</em> Iltis & Doebley</a> used to match <em>Zea mays</em> var. <em>hirta</em> while it now just hits the species <em>Zea mays</em> L.</li>
</ul>
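For reference, the optimal string alignment variant of the Damerau–Levenshtein distance can be sketched in a few lines of Python. This is an illustration of the metric only, not the GBIF matcher itself, which uses the distance as one input to its match scoring:

```python
def damerau_levenshtein(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions, plus transpositions of adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # deleting i characters from a
    for j in range(len(b) + 1):
        d[0][j] = j  # inserting j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("hirta", "hitra"))     # 1: one adjacent swap
print(damerau_levenshtein("kitten", "sitting"))  # 3: the classic example
```

Counting an adjacent transposition as a single edit suits common typing slips in scientific names, while staying far more predictable than Jaro–Winkler's length-weighted similarity.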
<h3 id="matching-results">
Matching results</h3>
<div class="p1">
The distinct, verbatim classifications of 528 million occurrence records (10.5 million distinct classifications in total) were passed through both the original and the new fuzzy matching algorithms. The results show that 428 thousand classifications (4%), representing 5,323,758 occurrence records, produced a different match. So far we have taken a random subsample of the changed records and manually inspected the results; we can hardly spot any regressions or wrong matches.</div>
<div class="p2">
<br /></div>
<div class="p1">
We have published the complete matching comparison as well as the subset of changed records at <a href="https://zenodo.org/record/16491">Zenodo</a> as tab delimited files:</div>
<div class="p2">
<br /></div>
<div class="p1">
Dataset 1: <a href="https://zenodo.org/deposit/26044/file/?file_id=23bc2f5e-f883-410e-ae2d-bd718ccb2b40">All classification matches (10.5 million)</a></div>
<div class="p1">
Dataset 2: <a href="https://zenodo.org/deposit/26044/file/?file_id=bbed9d39-ecb5-44cc-949e-a9a6068dc166">Changed matches (428 thousand)</a></div>
<div class="p2">
<br /></div>
<div class="p1">
The files share a schema with three groups of columns, each containing the scientificName, the GBIF taxonKey and the higher DwC classification terms for every match record: verbatim terms are prefixed with v_, the old matching results carry an _old suffix, and the new matching results use the plain terms (e.g. v_scientificName, scientificName_old, scientificName).</div>
<div class="p2">
<br /></div>
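If you post-process the dumps, the header can be split mechanically by that convention. A small illustrative Python sketch (the column list here is a subset, not the full schema):

```python
def split_column_families(columns):
    """Group column names by the convention above: v_ prefix for verbatim
    terms, _old suffix for the old matching, plain names for the new."""
    families = {"verbatim": [], "old": [], "new": []}
    for col in columns:
        if col.startswith("v_"):
            families["verbatim"].append(col)
        elif col.endswith("_old"):
            families["old"].append(col)
        else:
            families["new"].append(col)
    return families

# A subset of the header, for illustration:
cols = ["v_scientificName", "scientificName_old", "scientificName",
        "v_kingdom", "taxonKey_old", "taxonKey"]
print(split_column_families(cols))
```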
<br />
<div class="p1">
We are glad to receive any feedback on further improvements or bad matching results we need to fix in the next iteration of work. Please get in touch with Markus Döring, <a href="mailto:mdoering@gbif.org"><span class="s1">mdoering@gbif.org</span></a>.</div>
<h3 id="appendix">
Appendix</h3>
<h2 id="create-distinct-occurrence-names-table">
Create distinct occurrence names table</h2>
<pre class="prettyprint"><code class=" hljs sql"><span class="hljs-operator"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> markus.<span class="hljs-keyword">names</span> <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-aggregate">count</span>(*) <span class="hljs-keyword">as</span> numocc, <span class="hljs-aggregate">count</span>(<span class="hljs-keyword">distinct</span> datasetKey) <span class="hljs-keyword">as</span> numdatasets, v_scientificName, v_kingdom, v_phylum, v_class, v_order_ <span class="hljs-keyword">as</span> v_order, v_family, v_genus, v_subgenus, v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification
<span class="hljs-keyword">FROM</span> prod_b.occurrence_hdfs
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> v_scientificName, v_kingdom, v_phylum, v_class, v_order_, v_family, v_genus, v_subgenus, v_specificEpithet, v_infraspecificEpithet, v_scientificNameAuthorship, v_taxonrank, v_higherClassification
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> v_scientificName, numocc <span class="hljs-keyword">DESC</span></span></code></pre>
<h2 id="lookup-taxonkey-with-both-old-new-lookup">
Lookup taxonkey with both old & new lookup</h2>
<pre class="prettyprint"><code class=" hljs sql"><span class="hljs-operator"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> markus.name_matches <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span>
n.numocc,
n.numdatasets,
n.v_scientificName,
n.v_kingdom,
n.v_phylum,
n.v_class,
n.v_order,
n.v_family,
n.v_genus,
n.v_subgenus,
n.v_specificEpithet,
n.v_infraspecificEpithet,
n.v_scientificNameAuthorship,
n.v_taxonrank,
n.v_higherClassification,
prod.taxonKey <span class="hljs-keyword">as</span> taxonKey_old,
prod.scientificName <span class="hljs-keyword">as</span> scientificName_old,
prod.rank <span class="hljs-keyword">as</span> rank_old,
prod.status <span class="hljs-keyword">as</span> status_old,
prod.matchType <span class="hljs-keyword">as</span> matchType_old,
prod.confidence <span class="hljs-keyword">as</span> confidence_old,
prod.kingdomKey <span class="hljs-keyword">as</span> kingdomKey_old,
prod.phylumKey <span class="hljs-keyword">as</span> phylumKey_old,
prod.classKey <span class="hljs-keyword">as</span> classKey_old,
prod.orderKey <span class="hljs-keyword">as</span> orderKey_old,
prod.familyKey <span class="hljs-keyword">as</span> familyKey_old,
prod.genusKey <span class="hljs-keyword">as</span> genusKey_old,
prod.speciesKey <span class="hljs-keyword">as</span> speciesKey_old,
prod.kingdom <span class="hljs-keyword">as</span> kingdom_old,
prod.phylum <span class="hljs-keyword">as</span> phylum_old,
prod.class_ <span class="hljs-keyword">as</span> class_old,
prod.order_ <span class="hljs-keyword">as</span> order_old,
prod.family <span class="hljs-keyword">as</span> family_old,
prod.genus <span class="hljs-keyword">as</span> genus_old,
prod.species <span class="hljs-keyword">as</span> species_old,
uat.taxonKey <span class="hljs-keyword">as</span> taxonKey,
uat.scientificName <span class="hljs-keyword">as</span> scientificName,
uat.rank <span class="hljs-keyword">as</span> rank,
uat.status <span class="hljs-keyword">as</span> status,
uat.matchType <span class="hljs-keyword">as</span> matchType,
uat.confidence <span class="hljs-keyword">as</span> confidence,
uat.kingdomKey <span class="hljs-keyword">as</span> kingdomKey,
uat.phylumKey <span class="hljs-keyword">as</span> phylumKey,
uat.classKey <span class="hljs-keyword">as</span> classKey,
uat.orderKey <span class="hljs-keyword">as</span> orderKey,
uat.familyKey <span class="hljs-keyword">as</span> familyKey,
uat.genusKey <span class="hljs-keyword">as</span> genusKey,
uat.speciesKey <span class="hljs-keyword">as</span> speciesKey,
uat.kingdom <span class="hljs-keyword">as</span> kingdom,
uat.phylum <span class="hljs-keyword">as</span> phylum,
uat.class_ <span class="hljs-keyword">as</span> class_,
uat.order_ <span class="hljs-keyword">as</span> order_,
uat.family <span class="hljs-keyword">as</span> family,
uat.genus <span class="hljs-keyword">as</span> genus,
uat.species <span class="hljs-keyword">as</span> species
<span class="hljs-keyword">FROM</span> (
<span class="hljs-keyword">SELECT</span>
numocc,
numdatasets,
v_scientificName,
v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_subgenus,
v_specificEpithet,
v_infraspecificEpithet,
v_scientificNameAuthorship,
v_taxonrank,
v_higherClassification,
<span class="hljs-keyword">match</span>(<span class="hljs-string">'PROD'</span>, v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) prod,
<span class="hljs-keyword">match</span>(<span class="hljs-string">'UAT'</span>, v_kingdom, v_phylum, v_class, v_order, v_family, v_genus, v_scientificName, v_specificEpithet, v_infraspecificEpithet) uat
<span class="hljs-keyword">FROM</span> markus.<span class="hljs-keyword">names</span>
) n;</span></code></pre>
<h2 id="hive-exports">
Hive exports</h2>
<pre class="prettyprint"><code class=" hljs sql"><span class="hljs-operator"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> markus.matches_changed
<span class="hljs-keyword">ROW</span> FORMAT DELIMITED FIELDS TERMINATED <span class="hljs-keyword">BY</span> <span class="hljs-string">'\t'</span> LINES TERMINATED <span class="hljs-keyword">BY</span> <span class="hljs-string">'\n'</span> <span class="hljs-keyword">NULL</span> DEFINED <span class="hljs-keyword">AS</span> <span class="hljs-string">''</span> <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">from</span> markus.name_matches
<span class="hljs-keyword">WHERE</span> taxonKey != taxonKey_old;</span></code></pre>
<pre class="prettyprint"><code class=" hljs sql"><span class="hljs-operator"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> markus.matches_all
<span class="hljs-keyword">ROW</span> FORMAT DELIMITED FIELDS TERMINATED <span class="hljs-keyword">BY</span> <span class="hljs-string">'\t'</span> LINES TERMINATED <span class="hljs-keyword">BY</span> <span class="hljs-string">'\n'</span> <span class="hljs-keyword">NULL</span> DEFINED <span class="hljs-keyword">AS</span> <span class="hljs-string">''</span> <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">from</span> markus.name_matches</span></code>;</pre>
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-39467163761760944712015-03-27T13:55:00.000+01:002017-01-24T16:50:07.905+01:00IPT v2.2 – Making data citable through DataCite<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="p1">
<span style="font-family: Times, Times New Roman, serif;">GBIF is pleased to release <a href="http://www.gbif.org/ipt"><span class="s1">IPT 2.2</span></a>, now capable of automatically connecting with either <a href="https://www.datacite.org/"><span class="s1">DataCite</span></a> or <a href="http://ezid.cdlib.org/" target="_blank">EZID</a> to assign DOIs to datasets. This new feature makes biodiversity data easier to access on the Web and facilitates tracking its re-use.</span><br />
<br />
<h3 style="text-align: left;">
<span style="font-family: Times, Times New Roman, serif;">DataCite integration explained</span></h3>
<span style="font-family: Times, 'Times New Roman', serif;">DataCite specialises in assigning DOIs to datasets. It was established in 2009 with three fundamental goals<span style="font-size: xx-small;">(1)</span>:</span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-TpjTdrwdPzw/VRG20469uPI/AAAAAAAAOgw/9e_MQulhE0I/s1600/datacite-logo-web.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-TpjTdrwdPzw/VRG20469uPI/AAAAAAAAOgw/9e_MQulhE0I/s1600/datacite-logo-web.png" /> </a> </div>
<ol class="ol1">
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">Establish easier access to research data on the Internet</span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">Increase acceptance of research data as citable contributions to the scholarly record</span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">Support research data archiving to permit results to be verified and re-purposed for future study</span></li>
</ol>
<div style="text-align: left;">
<span style="font-family: Times, Times New Roman, serif;">EZID is hosted by the <a href="http://www.cdlib.org/" target="_blank">California Digital Library</a> (a founding member of DataCite) and adds <a href="http://www.cdlib.org/uc3/ezid/" target="_blank">services</a> on top of the DataCite DOI infrastructure such as their own easy-to-use <a href="http://ezid.cdlib.org/doc/apidoc.html" target="_blank">programming interface</a>.</span></div>
<div style="text-align: left;">
<span style="font-family: Times, 'Times New Roman', serif;"><br /></span></div>
<div style="text-align: left;">
<span style="font-family: Times, 'Times New Roman', serif;">To integrate with DataCite and further these three goals for biodiversity data, IPT version 2.2 introduces the following new features:</span></div>
<ul class="ul1">
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">DOIs can be assigned to datasets thereby making them persistently resolvable </span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">A new DOI can be assigned to a dataset each time it undergoes scientifically significant changes, which is recommended best practice<span style="font-size: xx-small;">(1)</span> and part of the IPT's new <a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2Versioning"><span class="s1">versioning policy</span></a></span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">Citations can be automatically generated for datasets in a <a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2Citation"><span class="s1">standard format</span></a> which includes the DOI and dataset version number</span></li>
<div style="text-align: right;">
</div>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">A <a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes#Version_history"><span class="s1">version history</span></a> is kept for each dataset, allowing researchers to easily track changes and access/download all previous versions</span></li>
</ul>
<div class="p1">
<span style="font-family: Times, 'Times New Roman', serif;">To take advantage of these optional new features, there are two basic requirements: </span></div>
<ol class="ol1" style="text-align: left;">
<li><span style="font-family: Times, 'Times New Roman', serif;">The IPT must be configured with either a DataCite or EZID account. GBIF participants interested in a DataCite account should contact the <a href="mailto:helpdesk@gbif.org" target="_blank">GBIF Helpdesk</a> directly. General information about getting a DataCite account can be found </span><a href="https://www.datacite.org/join-datacite" style="font-family: Times, 'Times New Roman', serif;"><span class="s1">here</span></a><span style="font-family: Times, 'Times New Roman', serif;">; information about getting an EZID account can be found </span><a href="http://ezid.cdlib.org/home/pricing" style="font-family: Times, 'Times New Roman', serif;"><span class="s1">here</span></a><span style="font-family: Times, 'Times New Roman', serif;">. </span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">The IPT should be always on and accessible to ensure that assigned DOIs continue to be resolvable. </span></li>
</ol>
<div class="p1">
<span style="font-family: Times, Times New Roman, serif;">Once publishers make their data citable through DataCite they can expect the following benefits:</span></div>
<div style="text-align: left;">
</div>
<ul style="text-align: left;">
<li><span style="font-family: Times, Times New Roman, serif;">Their datasets will be globally discoverable through the <a href="http://search.datacite.org/ui"><span class="s1">DataCite Metadata Search tool</span></a> and the Thomson Reuters <a href="http://wokinfo.com/products_tools/multidisciplinary/dci/"><span class="s1">Data Citation Index</span></a> (part of the <a href="http://thomsonreuters.com/en/products-services/scholarly-scientific-research/scholarly-search-and-discovery/web-of-science.html" target="_blank">Web of Science</a>) thanks to a <a href="http://thomsonreuters.com/en/press-releases/2014/thomson-reuters-collaborates-with-datacite-to-expand-discovery-of-research-data.html"><span class="s1">collaboration</span></a> with DataCite formalised in 2014</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">They can find out exactly who cited their dataset via the <span class="s1"><a href="http://wokinfo.com/products_tools/multidisciplinary/dci/">Data Citation Index</a></span>, and better understand the impact their dataset has had within the scholarly research and policy making communities</span></li>
</ul>
<div class="p1">
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: Times, Times New Roman, serif;"></span></div>
<span style="font-family: Times, 'Times New Roman', serif;"><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-S4fWCWFb1UE/VRF-UHfx58I/AAAAAAAAOgU/fnSiBSuQWW0/s1600/IPTManageResourceMetadataBasicMetadata.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="http://2.bp.blogspot.com/-S4fWCWFb1UE/VRF-UHfx58I/AAAAAAAAOgU/fnSiBSuQWW0/s1600/IPTManageResourceMetadataBasicMetadata.png" height="204" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Sample basic metadata page, IPT 2.2</td></tr>
</tbody></table>
</span><br />
<h3>
<span style="font-family: Times, 'Times New Roman', serif;">Other new features</span></h3>
<br />
<span style="font-family: Times, 'Times New Roman', serif;">The IPT 2.2 also introduces a simple way of licensing datasets </span><span style="font-family: Times, 'Times New Roman', serif;">under one of three machine-readable waivers or licences: </span><a href="http://creativecommons.org/publicdomain/zero/1.0/legalcode" style="font-family: Times, 'Times New Roman', serif;"><span class="s1">CC0 v1.0</span></a><span style="font-family: Times, 'Times New Roman', serif;">, </span><a href="http://creativecommons.org/licenses/by/4.0/legalcode" style="font-family: Times, 'Times New Roman', serif;"><span class="s1">CC-BY v4.0</span></a><span style="font-family: Times, 'Times New Roman', serif;">, or </span><a href="http://creativecommons.org/licenses/by-nc/4.0/legalcode" style="font-family: Times, 'Times New Roman', serif;"><span class="s1">CC-BY-NC v4.0</span></a><span style="font-family: Times, 'Times New Roman', serif;">. These waivers or CC licenses are "something that the creators of works can understand, their users can understand, and even the Web itself can understand."<span style="font-size: xx-small;">(2) </span>You can read more about GBIF's new licensing policy </span><span class="s1" style="font-family: Times, 'Times New Roman', serif;"><a href="http://www.gbif.org/terms/licences" style="font-family: Times, 'Times New Roman', serif;">here</a>.</span></div>
<div class="p1">
<span style="font-family: Times, 'Times New Roman', serif;"><br /></span></div>
<ul class="ul1">
</ul>
<div class="p1">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-left: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-SUJDc6CGGS4/VRF0uj31mWI/AAAAAAAAOfY/bSC2IbbBE1U/s1600/IPTManageResourceOverview.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://1.bp.blogspot.com/-SUJDc6CGGS4/VRF0uj31mWI/AAAAAAAAOfY/bSC2IbbBE1U/s1600/IPTManageResourceOverview.png" height="316" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Sample resource overview page, IPT 2.2</td></tr>
</tbody></table>
<span style="font-family: Times, 'Times New Roman', serif;"><br /></span>
<span style="font-family: Times, 'Times New Roman', serif;">Whether an IPT is DOI-turbocharged or not, there are a number of other new benefits in this release:</span><br />
<ul class="ul1">
<li class="li1"><span style="font-family: Times, Times New Roman, serif;"><b>basisOfRecord validation</b> for occurrence datasets</span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">The ability to <b>preview source mappings</b> prior to publication</span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">The ability to <b>preview resource metadata</b> prior to publication</span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">A suite of new metadata fields such as <b>ORCIDs</b> for contacts</span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;">An enhanced user interface including a new and <b>improved resource homepage</b></span></li>
<li class="li1"><span style="font-family: Times, Times New Roman, serif;"><b>Additional context help</b> to guide users, especially first-time users</span></li>
</ul>
<span style="font-family: Times, Times New Roman, serif;"><br /></span>
<br />
<h3 style="text-align: left;">
<span style="font-family: Times, Times New Roman, serif;">Acknowledgements</span><span style="font-family: Times, 'Times New Roman', serif;"> </span></h3>
<br />
<span style="font-family: Times, Times New Roman, serif;">Thanks to the hard work and dedication of the team of contributors, version 2.2 has been fully translated into French, Japanese, Portuguese, and Spanish. Because so many new features went into this version, the amount of text requiring translation was enormous. The following translators deserve a huge thanks, merci, arigato, </span><span style="font-family: Times, 'Times New Roman', serif;">obrigado, and </span><span style="font-family: Times, 'Times New Roman', serif;">gracias:</span></div>
<div style="text-align: left;">
</div>
<ul style="text-align: left;">
<li><span style="font-family: Times, Times New Roman, serif;">Sophie Pamerlon, Marie-Elise Lecoq (<a href="http://www.gbif.fr/"><span class="s1">GBIF France</span></a>) - Updating French translation</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">Yukiko Yamazaki (<a href="http://www.gbif.jp/" target="_blank">GBIF Japan (JBIF)</a>) - Updating Japanese translation</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">Allan Koch Veiga, Etienne Americo Cartolano, Daniel Lins, and Antonio Mauro Saraiva (<span class="s1"><a href="http://www.biocomp.org.br/" target="_blank">Universidade de São Paulo, Research Center on Biodiversity and Computing - BioComp</a></span>) - Updating Portuguese translation</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">Dairo Escobar, Nestor Beltran, and Daniel Amariles (<a href="http://www.sibcolombia.net/web/sib/home"><span class="s1">Colombian Biodiversity Information System (SiB Colombia)</span></a>) - Updating Spanish Translation</span></li>
</ul>
<span style="font-family: Times, 'Times New Roman', serif;">Lastly, a special thanks must go out to David Shorthouse from </span><a href="http://www.canadensys.net/" style="font-family: Times, 'Times New Roman', serif;"><span class="s1">Canadensys</span></a><span style="font-family: Times, 'Times New Roman', serif;"> for his guidance and help. Canadensys has been assigning DOIs to datasets it serves via its IPT since 2012, as described </span><a href="http://www.canadensys.net/2012/link-love-dois-for-darwin-core-archives" style="font-family: Times, 'Times New Roman', serif;"><span class="s1">here</span></a><span style="font-family: Times, 'Times New Roman', serif;">, and has provided invaluable assistance throughout development. </span><br />
<div class="p1">
<span style="font-family: Times, 'Times New Roman', serif;"><br /></span></div>
<div class="p1">
<span style="font-family: Times, 'Times New Roman', serif;">On behalf of the GBIF development team, I really hope you enjoy using this new version, and hope that you will be able to take advantage of all its exciting new features.</span><br />
<span style="font-family: Times, 'Times New Roman', serif;"><br /></span>
<br />
<h3 style="text-align: left;">
<span style="font-family: Times, Times New Roman, serif;">Footnotes</span></h3>
<div>
<ol style="text-align: left;">
<li><span style="font-family: Times, Times New Roman, serif;">http://schema.datacite.org/meta/kernel-3/doc/DataCite-MetadataKernel_v3.1.pdf</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">https://creativecommons.org/licenses/</span></li>
</ol>
</div>
</div>
</div>
<!-- Blogger automated replacement: "https://images-blogger-opensocial.googleusercontent.com/gadgets/proxy?url=http%3A%2F%2F3.bp.blogspot.com%2F-TpjTdrwdPzw%2FVRG20469uPI%2FAAAAAAAAOgw%2F9e_MQulhE0I%2Fs1600%2Fdatacite-logo-web.png&container=blogger&gadget=a&rewriteMime=image%2F*" with "https://3.bp.blogspot.com/-TpjTdrwdPzw/VRG20469uPI/AAAAAAAAOgw/9e_MQulhE0I/s1600/datacite-logo-web.png" --><div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/16423423909368777750noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-47612997473017365562014-11-26T11:41:00.000+01:002015-08-25T16:01:32.378+02:00Upgrading our cluster from CDH4 to CDH5A little over a year ago we wrote about <a href="http://gbif.blogspot.dk/2013/05/migrating-our-hadoop-cluster-from-cdh3.html" target="_blank">upgrading from CDH3 to CDH4</a> and now the time had come to upgrade from CDH4 to <a href="http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html" target="_blank">CDH5</a>. The short version: upgrading the cluster itself was easy, but getting our applications to work with the new classpaths, especially MapReduce v2 (YARN), was painful.<br />
<br />
<h3>
The Cluster</h3>
<div>
Our cluster has grown since the last upgrade (now 12 slaves and 3 masters), and we no longer had the luxury of splitting the machines to build a new cluster from scratch. So this was an in-place upgrade, using CDH Manager.</div>
<div>
<br /></div>
<h2>
Upgrade CDH Manager</h2>
<div>
The first step was upgrading to CDH Manager 5.2 (from our existing 4.8). The <a href="http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_ag_upgrade_cm4_to_cm5.html" target="_blank">Cloudera documentation</a> is excellent so I don't need to repeat it here. What we did find was that the management service now requests significantly more RAM for its monitoring services (minimum "happy" config of 14GB), to the point where our existing masters were overwhelmed. As a stopgap we've added a 4th old machine to the "masters" group, used exclusively for the management service. In the longer term we'll replace the 4 masters with 3 new machines that have enough resources. </div>
<div>
<br /></div>
<h2>
Upgrade Cluster Members</h2>
<div>
Again the <a href="http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_mc_upgrade_tocdh5_using_parcels.html" target="_blank">Cloudera documentation</a> is excellent but I'll just add a bit. The upgrade process will now ask if a Java JDK should be installed (an improvement over the old behaviour of just installing one anyway). That means we could finally remove the Oracle JDK 6 rpms that have been lying around on the machines. For some reason the Host Inspector still complains about OpenJDK 7 vs Oracle 7 but we've happily been running on OpenJDK 7 since early 2014, and so far so good with CDH5 as well. After the upgrade wizard finished we had to tweak memory settings throughout the cluster, including setting the "Memory Overcommit Validation Threshold" to 0.99, up from its (very conservative) default of 0.8. Cloudera has another nice blog post on <a href="http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/" target="_blank">figuring out memory settings for YARN</a>. Additionally Hue's configuration required some attention because after the upgrade it had forgotten where Zookeeper and the HBase Thrift server were. All in all quite painless.</div>
<div>
<br /></div>
<h3>
The Gotchas</h3>
<div>
Getting our software to work with CDH5 was definitely not painless. All of our problems stemmed from conflicting versions of jars, due either to changes in CDH dependencies or to changes in how a user classpath is given priority over that of YARN/HBase/Oozie. Additionally it took some time to wrap our heads around the new artifact packaging used by YARN and HBase. Note that we also use Maven for dependency management.</div>
<div>
<br /></div>
<b>Guava</b><br />
<div>
We're not alone in our suffering at the hands of mismatched Guava versions (e.g. <a href="https://issues.apache.org/jira/browse/HADOOP-10101" target="_blank">HADOOP-10101</a>, <a href="https://issues.apache.org/jira/browse/HDFS-7040" target="_blank">HDFS-7040</a>), but suffer we did. We resorted to specifying version 14.0.1 in any of our code that touches Hadoop and more importantly HBase, and exclude any higher version guavas from our dependencies. This meant downgrading some actual code that was using guava 15, but was the easiest path to getting a working system.</div>
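As an illustration of the pinning approach (a sketch, not our actual pom), a single Guava version can be forced across a Maven build with a <span style="font-family: Courier New, Courier, monospace;">dependencyManagement</span> entry:

```xml
<!-- Sketch: force Guava 14.0.1 for all transitive dependencies -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>14.0.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Managed versions win over whatever transitive dependencies declare, which avoids scattering per-dependency exclusions, though explicit exclusions (as linked below under "Example exclusions") make the intent more visible.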
<div>
<br /></div>
<b>Jackson</b><br />
<div>
We have many dependencies on Jackson 1.9 and 2+ throughout our code, so downgrading to match HBase's shipped 1.8.8 was not an option. It meant figuring out the classpath precedence rules described below, and solving the problems (like logging) that doing so introduced.</div>
<div>
<br /></div>
<b>Logging</b><br />
<div>
Logging in Java is a horrible mess, and with the number of intermingled projects required to make application software run on a Hadoop/HBase cluster it's no surprise that getting logging to work was brutal. We code to the SLF4J API and use Logback as our implementation of choice. The Hadoop world uses a mix of Java Commons Logging, java.util.logging, and log4j. We thought that meant we'd be clear if we used the same SLF4J API (1.7.5) and used the bridges (log4j-over-slf4j, jcl-over-slf4j, and jul-to-slf4j), which has worked for us up to now. <montage>Angry men smash things angrily over the course of days</montage> Turns out, there's a bug in the 1.7.5 implementation of log4j-over-slf4j, which blows up as we described over at <a href="https://issues.apache.org/jira/browse/YARN-2875" target="_blank">YARN-2875</a>. Short version - use 1.7.6+ in client code that attempts to use YARN and log4j-over-slf4j.</div>
<div>
<br /></div>
<div>
<b>YARN</b></div>
<div>
The crux of our problems was having our classpath loaded after the Hadoop classpath, meaning old versions of our dependencies were loaded first. The new, surprisingly hard to find parameter that tells YARN to load your classpath first is "<span style="font-family: Courier New, Courier, monospace;"><b>mapreduce.job.user.classpath.first</b></span>". YARN also quizzically claims that the parameter is deprecated, but it works for me.</div>
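For reference, the property is a plain job configuration setting, so it can go in the job's config XML or be set programmatically before submission (a sketch, adapt to your own job setup):

```xml
<!-- Sketch: tell YARN to put the user classpath ahead of Hadoop's -->
<property>
  <name>mapreduce.job.user.classpath.first</name>
  <value>true</value>
</property>
```

The equivalent in client code is a one-liner on the Hadoop <span style="font-family: Courier New, Courier, monospace;">Configuration</span> object before the job is submitted.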
<div>
<br /></div>
<div>
<b>Oozie</b></div>
<div>
Convincing Oozie to load our classpath involved another montage of angry faces. It uses the same parameter as YARN, but with a prefix, so what you want is "<b><span style="font-family: Courier New, Courier, monospace;">oozie.launcher.mapreduce.job.user.classpath.first</span></b>". We had been loading the old parameter "<span style="font-family: Courier New, Courier, monospace;"><b>mapreduce.task.classpath.user.precedence</b></span>" in each action in the workflow using the <span style="font-family: Courier New, Courier, monospace;"><job-xml></span><span style="font-family: inherit;"> tag to load the configs from a file called</span><span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> </span><span style="font-family: Courier New, Courier, monospace;">hive-default.xml</span><span style="font-family: inherit;">. We then encountered two problems: </span></div>
<div>
<ol>
<li><span style="font-family: inherit;">Note the name - we used </span><span style="font-family: Courier New, Courier, monospace;">hive-default.xml</span><span style="font-family: inherit;"> instead of </span><span style="font-family: Courier New, Courier, monospace;">hive-site.xml</span><span style="font-family: inherit;"> because of a bug in Oozie (discussed <a href="https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/RW5WmSTzbLo" target="_blank">here</a> and <a href="https://groups.google.com/a/cloudera.org/forum/#!msg/cdh-user/y66j12jb1ig/tODJGmJ2BawJ" target="_blank">here</a>). That was fixed in the CDH5.2 Oozie, but we didn't get the memo. Now the file is called </span><span style="font-family: Courier New, Courier, monospace;">hive-site.xml </span><span style="font-family: inherit;">and contains our specific configs and is again being picked up. BUT:</span></li>
<li>Adding <span style="font-family: Courier New, Courier, monospace; font-weight: bold;">oozie.launcher.mapreduce.job.user.classpath.first</span><span style="font-family: inherit;"> to <span style="font-family: 'Courier New', Courier, monospace;">hive-site.xml</span> doesn't work! As we wrote up in Oozie bug <a href="https://issues.apache.org/jira/browse/OOZIE-2066" target="_blank">OOZIE-2066</a> this parameter has to be specified for each action, at the action level, in the workflow.xml. Repeating the example workaround from the bug report:</span></li>
</ol>
</div>
<pre style="background-image: URL(http://2.bp.blogspot.com/_z5ltvMQPaa8/SjJXr_U2YBI/AAAAAAAAAAM/46OqEP32CJ8/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> <action name="run-test">
<java>
<job-tracker>c1n2.gbif.org:8032</job-tracker>
<name-node>hdfs://c1n1.gbif.org:8020</name-node>
<configuration>
<property>
<name>oozie.launcher.mapreduce.job.user.classpath.first</name>
<value>true</value>
</property>
</configuration>
<main-class>test.CPTest</main-class>
</java>
<ok to="end" />
<error to="kill" />
</action>
</code></pre>
<div>
<br />
<br />
<h3>
<b>New Packaging Woes</b></h3>
<br />
We build our jars using a combination of jar-with-dependencies and the shade plugin, but in both cases it means all our dependencies are built in. The problems come when a downstream, transitive dependency loads a different (typically older) version of one of the jars we've bundled in our main jar. This happens a lot with the Hadoop and HBase artifacts, especially when it comes to MR1 and logging.<br />
<br />
<b>Example exclusions</b><br />
<br />
hbase-server (needed to run MapReduce over HBase): <a href="https://github.com/gbif/datacube/blob/master/pom.xml#L268">https://github.com/gbif/datacube/blob/master/pom.xml#L268</a><br />
<br />
hbase-testing-util (needed to run mini clusters): <a href="https://github.com/gbif/datacube/blob/master/pom.xml#L302">https://github.com/gbif/datacube/blob/master/pom.xml#L302</a><br />
<br />
hbase-client: <a href="https://github.com/gbif/metrics/blob/master/pom.xml#L226">https://github.com/gbif/metrics/blob/master/pom.xml#L226</a><br />
<br />
hadoop-client (removing logging): <a href="https://github.com/gbif/metrics/blob/master/pom.xml#L327">https://github.com/gbif/metrics/blob/master/pom.xml#L327</a><br />
<br />
<br />
Beyond just sorting conflicting dependencies, we also encountered a problem that presented as "<span style="font-family: Courier New, Courier, monospace;">No FileSystem for scheme: file"</span>. It turns out we had projects bringing in both hadoop-common and hadoop-hdfs, and so we were getting only one of the META-INF/services files in the final jar. Thus we could not use the FileSystem to read local files (like jars for the class path) and also from HDFS. The fix was to include the <span style="font-family: Courier New, Courier, monospace;">org.apache.hadoop.fs.FileSystem</span> in our project explicitly: <a href="https://github.com/gbif/metrics/blob/master/cube/src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem">https://github.com/gbif/metrics/blob/master/cube/src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem</a><br />
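As an aside, if you build with the shade plugin, its <span style="font-family: Courier New, Courier, monospace;">ServicesResourceTransformer</span> merges the <span style="font-family: Courier New, Courier, monospace;">META-INF/services</span> entries from all dependencies automatically, which avoids maintaining the merged file by hand (a sketch of the plugin configuration, not what our linked project does):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <!-- Concatenate META-INF/services files instead of keeping only one -->
      <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
    </transformers>
  </configuration>
</plugin>
```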
<br />
Finally we had to stop the TableMapReduceUtil from bringing in its own dependent jars, which brought in yet more conflicting jars - this appears to be a change in the default behaviour, where dependent jars are now being brought in by default in the shorter versions of <span style="font-family: Courier New, Courier, monospace;">initTableMapper</span>:<br />
<a href="https://github.com/gbif/metrics/blob/master/cube/src/main/java/org/gbif/metrics/cube/occurrence/backfill/BackfillCallback.java#L37">https://github.com/gbif/metrics/blob/master/cube/src/main/java/org/gbif/metrics/cube/occurrence/backfill/BackfillCallback.java#L37</a><br />
<br />
<h3>
Conclusion</h3>
</div>
<div>
As you can see the client side of the upgrade was beset on all sides by the iniquities of jars, packaging and old dependencies. It seems strange that upgrading Guava is considered a no-no and a major breaking change by these projects, yet <a href="https://issues.apache.org/jira/browse/HBASE-9117" target="_blank">discussions about removing HTablePool</a> are proceeding apace and will definitely break many projects (including any of ours that touch HBase). While we're ultimately pleased that everything now works, and are looking forward to benefiting from the performance improvements and new features of CDH5, it wasn't a great trip. Hopefully our experience will help others migrate more smoothly.</div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Oliver Meynhttp://www.blogger.com/profile/04706642473308341930noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-15824824213077674082014-05-06T12:06:00.000+02:002014-05-06T12:06:21.109+02:00Multimedia in GBIFWe are happy to announce another long awaited improvement to the GBIF portal. Our portal test environment now shows multimedia items and their metadata associated with occurrences. As of today we have nearly <a href="http://www.gbif-uat.org/occurrence/search?MEDIA_TYPE=Sound&MEDIA_TYPE=StillImage&MEDIA_TYPE=MovingImage">700 thousand occurrences with multimedia</a> indexed. Based on the Dublin Core type vocabulary we distinguish between images, videos and sound files. As has been requested by many people the media type is available as a new filter in the occurrence search and subsequently in downloads. For example you can now easily <a href="http://www.gbif-uat.org/occurrence/search?TAXON_KEY=212&MEDIA_TYPE=Sound">find all audio recordings of birds</a>.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-hHrOPCfJ01k/U2ioucIaS8I/AAAAAAAAECQ/AlV3WCTrmjg/s1600/Screen+Shot+2014-05-06+at+11.17.21.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="http://2.bp.blogspot.com/-hHrOPCfJ01k/U2ioucIaS8I/AAAAAAAAECQ/AlV3WCTrmjg/s1600/Screen+Shot+2014-05-06+at+11.17.21.png" height="320" width="297" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">UAM:Mamm:11470 - Eumetopias jubatus - skull</td></tr>
</tbody></table>
If you follow the link to the <a href="http://www.gbif-uat.org/occurrence/779863593#media">details page</a> of any of those records you can see that sound files show up as simple links to the media file. We do the same for video files and currently do not have plans to embed any media player in our portal. This is different from images, which are shown in a dedicated gallery you might already have encountered on species pages. On the left you can see an example of a <a href="http://www.gbif-uat.org/occurrence/784732286">skull specimen with multiple images</a>.<br />
<span style="text-align: center;"><br /></span>
<span style="text-align: center;">When requested for the first time, GBIF transiently caches the original images and processes them into various standard sizes and formats suitable for the use in the portal.</span><br />
<br />
<br />
<h3>
Publishing multimedia metadata</h3>
GBIF indexes multimedia metadata published in different ways within the GBIF network. Whether it comes as a simple URL in an additional Darwin Core field, as multiple items expressed in ABCD XML, or through a dedicated multimedia extension in Darwin Core archives, the difference usually lies in the expressiveness of the metadata.<br />
<h4>
Simple Darwin Core</h4>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-FUnigeu6Ubs/U2Dg7LzaIJI/AAAAAAAAECA/OO2MLbIXWvw/s1600/Screen+Shot+2014-04-30+at+13.38.55.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="http://1.bp.blogspot.com/-FUnigeu6Ubs/U2Dg7LzaIJI/AAAAAAAAECA/OO2MLbIXWvw/s1600/Screen+Shot+2014-04-30+at+13.38.55.png" height="243" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Melocactus intortus record in iNaturalist</td></tr>
</tbody></table>
Whenever we spot the term <a href="http://rs.tdwg.org/dwc/terms/index.htm#associatedMedia">dwc:associatedMedia</a> in xml or Darwin Core archives as part of a simple, flat occurrence record we try to extract URLs to media items. As the term officially allows for concatenated lists of URLs we try common delimiters such as comma, semicolon or the pipe symbol. An example of multiple, concatenated image URLs can be found in <a href="http://www.gbif-uat.org/occurrence/891030819#images">iNaturalist</a>:<br />
<br />
As you can see on the right, every extracted link is regarded as a separate media item, as there is no standard way to detect that two links refer to the same item. In the example above every image has a link to the actual image file and another one to the respective html page where its metadata is presented. There is also no way to specify additional metadata about a link. As a consequence, images based on dwc:associatedMedia do not have a title, license or any further information. The verbatim data for that record, before we extract image links, can be seen here: <a href="http://www.gbif-uat.org/occurrence/891030819/verbatim">http://www.gbif-uat.org/occurrence/891030819/verbatim</a><br />
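The extraction step can be sketched roughly as follows (illustrative Python, not GBIF's actual indexing code; the function name is hypothetical, and the delimiters are the ones named above):

```python
import re

def extract_media_urls(associated_media):
    """Split a dwc:associatedMedia value on common delimiters (comma,
    semicolon, pipe) and keep only candidates that look like URLs."""
    if not associated_media:
        return []
    candidates = re.split(r"[,;|]", associated_media)
    return [c.strip() for c in candidates
            if c.strip().lower().startswith(("http://", "https://"))]
```

Note that any candidate that is not recognisably a URL is silently dropped, which mirrors why non-file links (such as html pages) carry no usable metadata here.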
<h4>
Darwin Core archive multimedia extension</h4>
By having a <a href="http://rs.gbif.org/extension/gbif/1.0/multimedia.xml">dedicated extension</a> for media items many media items per core occurrence record can be published in a structured way. This is the GBIF recommended way to publish multimedia as it gives you most control over your metadata. Note that the same extension can also be used to publish multimedia for species in <a href="http://www.gbif.org/dataset/search?type=CHECKLIST">checklist datasets</a>. This extension, based entirely on existing Dublin Core terms, allows you to specify the following information about a media item, all of which will make it into the GBIF portal if provided:<br />
<br />
<ul>
<li> <b>dc:type</b>, the kind of media item based on the DCMI Type Vocabulary: StillImage, MovingImage or Sound</li>
<li> <b>dc:format</b>, MIME type of the multimedia object's format </li>
<li> <b>dc:identifier</b>, the public URL that identifies and locates the media file directly, not the html page it might be shown on</li>
<li> <b>dc:references</b>, the URL of an html webpage that shows the media item or its metadata. It is recommended to provide this url even if a media file exists as it will be used for linking out</li>
<li> <b>dc:title</b>, the media item's title</li>
<li> <b>dc:description</b>, a textual description of the content of the media item</li>
<li> <b>dc:created</b>, the date and time this media item was taken</li>
<li> <b>dc:creator</b>, the person that took the image, recorded the video or sound</li>
<li> <b>dc:contributor</b>, any contributor in addition to the creator that helped in recording the media item</li>
<li> <b>dc:publisher</b>, the name of an entity responsible for making the image available</li>
<li> <b>dc:audience</b>, a class or description for whom the image is intended or useful</li>
<li> <b>dc:source</b>, a reference to the source the media item was derived or taken from. For example a book from which an image was scanned or the original provider of a photo/graphic, such as photography agencies</li>
<li> <b>dc:license</b>, license for this media object. If possible declare it as CC0 to ensure greatest use</li>
<li> <b>dc:rightsHolder</b>, the person or organization owning or managing rights over the media item</li>
</ul>
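To make this concrete, declaring the extension in a Darwin Core archive's meta.xml might look roughly like this (a sketch only; the file name and column indices are illustrative, and you should check the extension definition linked above for the exact rowType and term URIs):

```xml
<!-- Sketch: multimedia extension declaration in a DwC-A meta.xml -->
<extension rowType="http://rs.gbif.org/terms/1.0/Multimedia"
           encoding="UTF-8" fieldsTerminatedBy="\t"
           linesTerminatedBy="\n" ignoreHeaderLines="1">
  <files><location>multimedia.txt</location></files>
  <coreid index="0"/>
  <field index="1" term="http://purl.org/dc/terms/type"/>
  <field index="2" term="http://purl.org/dc/terms/identifier"/>
  <field index="3" term="http://purl.org/dc/terms/title"/>
  <field index="4" term="http://purl.org/dc/terms/license"/>
</extension>
```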
<h4>
Access to Biological Collections Data</h4>
As usual we also provide a binding from the <a href="http://www.tdwg.org/activities/abcd/">TDWG ABCD standard</a> (versions 1.2 and 2.06) mostly used with the BioCASE software.<br />
<br />
From <i>ABCD 1.2</i> we extract media information based on the UnitDigitalImage subelements. In particular information about the file URL (ImageURI), the description (Comment) and the license (TermsOfUse).<br />
<br />
In <i>ABCD 2.06</i> we use the unit MultiMediaObject subelements instead. Here there are distinct file and webpage URLs (FileURI, ProductURI), the description (Comment), the license (License/Text, TermsOfUseStatements) and also an indication of the mime type (Format). The <a href="http://www.gbif-uat.org/occurrence/779863593">bird sound example</a> from above comes in as ABCD 2.06 via the <a href="http://www.gbif-uat.org/dataset/b7ec1bf8-819b-11e2-bad2-00145eb45e9a">Animal Sound Archive dataset</a>. You can see the original details of that ABCD record in its <a href="http://www.gbif-uat.org/occurrence/779863593/fragment">raw XML fragment</a>. There are also <a href="http://www.gbif-uat.org/occurrence/773646053#images">fossil images</a> available through ABCD.<br />
<br />
Missing from both ABCD versions are media title, creator and created elements.<br />
<br />
<h3>
Media type interpretation</h3>
We derive the media type from either an explicitly given dc:type, the mime type found in dc:format, or the media file suffix. In the case of dwc:associatedMedia found in simple Darwin Core we can only rely on the file URL to interpret the kind of media item. If that URL points to an html page instead of an actual static media file with a well-known suffix, the media type remains unknown.<br />
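The precedence just described can be sketched like this (illustrative Python, not the actual interpreter; the suffix and MIME tables are abbreviated examples):

```python
# Precedence: explicit dc:type, then dc:format (MIME), then file suffix;
# otherwise the media type stays unknown (e.g. a URL to an html page).
SUFFIXES = {".jpg": "StillImage", ".png": "StillImage",
            ".mp4": "MovingImage", ".mp3": "Sound", ".wav": "Sound"}

def interpret_media_type(dc_type=None, mime=None, url=None):
    if dc_type in ("StillImage", "MovingImage", "Sound"):
        return dc_type
    if mime:
        prefix = mime.split("/")[0]
        return {"image": "StillImage", "video": "MovingImage",
                "audio": "Sound"}.get(prefix)
    if url:
        for suffix, media_type in SUFFIXES.items():
            if url.lower().endswith(suffix):
                return media_type
    return None  # type remains unknown
```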
<br />
<h3>
Production deployment</h3>
We hope you like this new feature and we are eager to get it into production in the coming weeks. This is the first iteration of this work, and like all GBIF developments we welcome any feedback.<br />
<div>
<br /></div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com2tag:blogger.com,1999:blog-2326624813533383062.post-43907556914639216732014-04-23T12:22:00.000+02:002014-04-24T15:39:29.648+02:00IPT v2.1 – Promoting the use of stable occurrenceIDs<div dir="ltr" style="text-align: left;" trbidi="on">
<div>
<br /></div>
GBIF is pleased to announce the release of the <a href="http://www.gbif.org/ipt" target="_blank">IPT 2.1</a> with the following key changes:<br />
<ul style="text-align: left;">
<li>Stricter controls for the Darwin Core occurrenceID to improve stability of record level identifiers network wide</li>
<li>Ability to support Microsoft Excel spreadsheets natively</li>
<li>Japanese translation thanks to Dr. Yukiko Yamazaki from the National Institute of Genetics (NIG) in Japan</li>
</ul>
With this update, GBIF continues to refine and improve the IPT based on feedback from users, while carrying out a number of deliverables that are part of the <a href="http://www.gbif.org/resources/2970" target="_blank">GBIF Work Programme for 2014-16</a>.<br />
<br />
The most significant new feature that has been added in this release is the ability to validate that each record within an Occurrence dataset has a unique identifier. If any missing or duplicate identifiers are found, publishing fails, and the problem records are logged in the publication report.<br />
<br />
This new feature will support data publishers who use the Darwin Core term <a href="http://rs.tdwg.org/dwc/terms/#occurrenceID" target="_blank">occurrenceID</a> to uniquely identify their occurrence records. The change is intended to make it easier to link to records as they propagate throughout the network, simplifying the mechanism to cross reference databases and potentially help towards tracking use.<br />
<br />
Previously, GBIF asked publishers to use the three Darwin Core terms <a href="http://rs.tdwg.org/dwc/terms/#institutionCode" target="_blank">institutionCode</a>, <a href="http://rs.tdwg.org/dwc/terms/#collectionCode" target="_blank">collectionCode</a>, and <a href="http://rs.tdwg.org/dwc/terms/#catalogNumber" target="_blank">catalogNumber</a> to uniquely identify their occurrence records. This triplet-style identifier will continue to be accepted; however, it is notoriously unstable, since the codes are prone to change and in many cases are meaningless for datasets originating outside the museum collections community. For this reason, GBIF is adopting the recommendation of the IPT user community and now advises using occurrenceID instead. <br />
<br />
Best practices for creating an occurrenceID are that they (a) must be unique within the dataset, (b) should remain stable over time, and (c) should be globally unique wherever possible. By taking advantage of the IPT’s built-in identifier validation, publishers will automatically satisfy the first condition.<br />
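The uniqueness check the IPT performs before publishing can be sketched like this (a simplified illustration, not the IPT's actual Java implementation; the function name is ours):

```python
from collections import Counter

def validate_occurrence_ids(rows):
    """Return (missing, duplicates) for the occurrenceID column:
    1-based row numbers with no identifier, and identifier values
    that occur more than once. Publishing would fail if either
    list is non-empty, and both would be logged in the report."""
    ids = [r.get("occurrenceID", "").strip() for r in rows]
    missing = [i for i, v in enumerate(ids, start=1) if not v]
    counts = Counter(v for v in ids if v)
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return missing, duplicates
```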
<br />
Ultimately, GBIF hopes that by transitioning to more widespread use of stable occurrenceIDs, the following goals can be realized:<br />
<ul style="text-align: left;">
<li>GBIF can begin to resolve occurrence records using an occurrenceID. This resolution service could also help check whether identifiers are globally unique or not.</li>
<li>GBIF’s own occurrence identifiers will become inherently more stable as well.</li>
<li>GBIF can sustain more reliable cross-linkages to its records from other databases (e.g. GenBank).</li>
<li>Record-level citation can be made possible, enhancing attribution and the ability to track data usage.</li>
<li>It will be possible to consider tracking annotations and changes to a record over time.</li>
</ul>
If you’re a new or existing publisher, GBIF hopes you’ll agree these goals are worth working towards, and will start using occurrenceIDs. <br />
<br />
The <a href="http://www.gbif.org/ipt" target="_blank">IPT 2.1</a> also includes support for uploading Excel files as data sources.<br />
<br />
Another enhancement is that the interface has been translated into Japanese. GBIF offers its sincere thanks to Dr. Yukiko Yamazaki from the <a href="http://www.nig.ac.jp/english/index.html" target="_blank">National Institute of Genetics (NIG)</a> in Japan for this extraordinary effort.<br />
<br />
In the 11 months since version 2.0.5 was released, a total of 11 enhancements have been added, and 38 bugs have been squashed. So what else has been fixed?<br />
<br />
If you like the IPT’s auto-publishing feature, you will be happy to know that the bug causing the temporary directory to grow until disk space was exhausted has now been fixed. Resources that are configured to auto-publish but fail to publish for whatever reason are now easily identifiable within the resource tables, as shown:<br />
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-2m3hE7IsRX8/U07eEE54fMI/AAAAAAAALzs/GFO-nQUbPb4/s1600/Screen+Shot+2014-04-16+at+9.45.11+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-2m3hE7IsRX8/U07eEE54fMI/AAAAAAAALzs/GFO-nQUbPb4/s1600/Screen+Shot+2014-04-16+at+9.45.11+PM.png" /></a></div>
<div>
If you ever created a data source by connecting directly to a database like MySQL, you may have noticed an error that caused datasets to be truncated unexpectedly upon encountering a row with bad data. Thanks to a patch from Paul Morris (<a href="http://www.huh.harvard.edu/" target="_blank">Harvard University Herbaria</a>), bad rows now get skipped and reported to the user without losing the subsequent rows of data.<br />
<br />
As always we’d like to give special thanks to the other volunteers who contributed to making this version a reality:<br />
<div>
<ul style="text-align: left;">
<li>Marie-Elise Lecoq, and Gallien Labeyrie (<a href="http://www.gbif.fr/" target="_blank">GBIF France</a>) - Updating French translation</li>
<li>Yu-Huang Wang (<a href="http://taibif.tw/" target="_blank">TaiBIF</a>, Taiwan) - Updating Traditional Chinese translation</li>
<li>Nestor Beltran (<a href="http://www.sibcolombia.net/web/sib/home" target="_blank">Colombian Biodiversity Information System (SiB)</a>) - Updating Spanish translation</li>
<li>Etienne Cartolano, Allan Koch Veiga, and Antonio Mauro Saraiva (<a href="http://www.biocomp.org.br/" target="_blank">Universidade de São Paulo, Research Center on Biodiversity and Computing</a>) - Updating Portuguese translation</li>
<li>Carlos Cubillos (<a href="http://www.sibcolombia.net/web/sib/home" target="_blank">Colombian Biodiversity Information System (SiB)</a>) - Contributing style improvements</li>
</ul>
On behalf of the GBIF development team, I can say that we’re really excited to get this new version out to everyone! Happy publishing.<br />
</div>
</div>
</div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/16423423909368777750noreply@blogger.com8tag:blogger.com,1999:blog-2326624813533383062.post-41902075957716270432014-03-04T11:20:00.000+01:002015-08-25T16:01:47.923+02:00Lots of columns with Hive and HBaseWe're in the process of rolling out a long awaited feature here at GBIF, namely the indexing of more fields from <a href="http://rs.tdwg.org/dwc/" target="_blank">Darwin Core</a>. Until the launch of our now HBase-backed occurrence store (in the fall of 2013) we couldn't index more than about 30 or so terms from Darwin Core because we were limited by our MySQL schema. Now that we have HBase we can add as many columns as we like!<br />
<br />
Or so we thought.<br />
<br />
Our occurrence download service gets a lot of use, and naturally we want downloaders to have access to all of the newly indexed fields. Our downloads run as an Oozie workflow that executes a Hive query against an HDFS table (more details in this <a href="http://blog.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/" target="_blank">Cloudera blog</a>). We use an HDFS table to significantly speed up the scan speed of the query: using an HBase-backed Hive table takes something like 4-5x as long. But to generate that HDFS table we need to start from a Hive table that _is_ backed by HBase.<br />
<br />
Here's an example of how to write a Hive table definition for an HBase-backed table:<br />
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">CREATE EXTERNAL TABLE tiny_hive_example (</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> key INT,</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> kingdom STRING,</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> kingdomkey INT</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">)</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,o:kingdom#s,o:kingdomKey#b")</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">TBLPROPERTIES(</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> "hbase.table.name" = "tiny_hbase_table",</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> "hbase.table.default.storage.type" = "binary"</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">);</span><br />
<br />
But now that we have something like 600 columns to map to HBase, and have chosen to name our HBase columns just like the DwC terms they represent (e.g. the <a href="http://rs.tdwg.org/dwc/terms/index.htm#basisOfRecord" target="_blank">basis of record</a> term's column name is basisOfRecord), we have a very long "SERDEPROPERTIES" string in our Hive table definition. How long? Well, way more than the 4000-character limit of Hive. For our Hive metastore we use PostgreSQL, and when Hive creates the SERDE_PARAMS table it gives the PARAM_VALUE column a datatype of VARCHAR(4000). Because 4k should be enough for anyone, right? Sigh.<br />
<br />
The solution:<br />
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">alter table "SERDE_PARAMS" alter column "PARAM_VALUE" type text;</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
We did lots of testing to make sure the existing definitions didn't get nuked by this change, and can confirm that the Hive code is not checking that 4000 value either (the value is simply turned into a String: <a href="http://svn.apache.org/repos/asf/hive/trunk/metastore/src/model/package.jdo" target="_blank">the source</a>). Our new super-wide downloads table works, and will be in production soon!<br />
<br /><div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Oliver Meynhttp://www.blogger.com/profile/04706642473308341930noreply@blogger.com4tag:blogger.com,1999:blog-2326624813533383062.post-81975349067876346422013-10-28T12:04:00.000+01:002013-10-28T12:04:15.847+01:00The new (real-time) GBIF Registry has gone live<div dir="ltr" style="text-align: left;" trbidi="on">
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<div>
<span style="font-family: Times, Times New Roman, serif;"><span style="background-color: white; color: #222222;">For the last 4 years, GBIF has operated the GBRDS registry with its own web application at </span><a href="http://gbrds.gbif.org/" style="background-color: white; color: #1155cc;" target="_blank">http://gbrds.gbif.org</a>. Previously, when a dataset was registered in the GBRDS registry (for example using an <a href="http://www.gbif.org/ipt" target="_blank">IPT</a>), it was not visible in the portal until the next rollover took place, often several weeks later. </span></div>
<div>
<span style="background-color: white; color: #222222; font-family: Times, Times New Roman, serif;"><br /></span></div>
<div>
<span style="background-color: white; color: #222222; font-family: Times, Times New Roman, serif;">In October, GBIF launched its new portal on <a href="http://www.gbif.org/" style="color: #1155cc;" target="_blank">www.gbif.org</a>. During the launch we indicated that the real-time data management would be starting up in November. We are excited to inform you that today we made the first step towards making this a reality, by enabling the live operation of the new GBIF registry. </span> </div>
</div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<div>
<span style="background-color: white; color: #222222; font-family: Times, 'Times New Roman', serif;">What does this mean for you?</span></div>
</div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<ul style="text-align: left;">
<li><span style="background-color: white; color: #222222; font-family: Times, 'Times New Roman', serif;">any dataset registered through GBIF (using an </span><a href="http://www.gbif.org/ipt" style="font-family: Times, 'Times New Roman', serif;" target="_blank">IPT</a><span style="background-color: white; color: #222222; font-family: Times, 'Times New Roman', serif;">, web services, or manually by liaison with the Secretariat) will be visible in the portal immediately because the portal and new registry are fully integrated</span> </li>
</ul>
</div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<ul style="text-align: left;">
<li><span style="background-color: white; color: #222222; font-family: Times, 'Times New Roman', serif;">the GBRDS web application (</span><a href="http://gbrds.gbif.org/" style="color: #1155cc; font-family: Times, 'Times New Roman', serif;" target="_blank">http://gbrds.gbif.org</a><span style="background-color: white; color: #222222; font-family: Times, 'Times New Roman', serif;">) is no longer visible</span><span style="font-family: Times, Times New Roman, serif;"><span style="background-color: white; color: #222222;">, </span></span><span style="background-color: white; color: #222222;"><span style="font-family: Times, Times New Roman, serif;">since the new portal displays all the appropriate information</span></span></li>
</ul>
</div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<ul style="text-align: left;">
<li><span style="color: #222222; font-family: Times, 'Times New Roman', serif;">old links to the GBRDS will automatically redirect to their corresponding entry in the new portal. As an example, try </span><a href="http://gbrds.gbif.org/browse/agent?uuid=4fa7b334-ce0d-4e88-aaae-2e0c138d049e" style="font-family: Times, 'Times New Roman', serif;">http://gbrds.gbif.org/browse/agent?uuid=4fa7b334-ce0d-4e88-aaae-2e0c138d049e</a><span style="color: #222222; font-family: Times, 'Times New Roman', serif;"> </span></li>
</ul>
</div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<div style="text-align: left;">
</div>
<div style="text-align: left;">
</div>
<ul style="text-align: left;">
<li><span class="Apple-style-span" style="color: #222222; font-family: Times, Times New Roman, serif;">the GBRDS sandbox registry web application (<a href="http://gbrdsdev.gbif.org/">http://gbrdsdev.gbif.org</a></span><span style="color: #222222; font-family: Times, 'Times New Roman', serif;">) is no longer visible, but a new registry sandbox has been set up to provide for </span><a href="http://www.gbif.org/ipt" style="font-family: Times, 'Times New Roman', serif;" target="_blank">IPT</a><span style="color: #222222; font-family: Times, 'Times New Roman', serif;"> installations running in test mode</span></li>
</ul>
</div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; text-align: left; word-wrap: break-word;">
<span style="background-color: white; color: #222222; font-family: Times, 'Times New Roman', serif;">Please note that the new <a href="http://www.gbif.org/developer/registry">registry API</a> </span><span style="background-color: white; font-family: Times, 'Times New Roman', serif;">supports the same web service API that the GBRDS previously did</span><span style="background-color: white; color: #222222; font-family: Times, 'Times New Roman', serif;">, so existing tools and services built on the GBRDS API (such as the <a href="http://www.gbif.org/ipt" target="_blank">IPT</a>) will continue to work uninterrupted.</span><span style="font-family: Times, 'Times New Roman', serif;"> </span></div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<div>
<span style="font-family: Times, Times New Roman, serif;"><span style="background-color: white; color: #222222;">As you may have noticed, occurrence data crawling has been temporarily suspended since the middle of September to prepare for launching </span><span style="background-color: white; color: #222222;">real-time data management</span><span style="background-color: white; color: #222222;">. </span><span style="color: #222222;">We aim to resume occurrence data crawling in the first week of November, meaning that updates to the index will be visible immediately afterwar</span><span style="background-color: white; color: #222222;">ds. </span> </span></div>
</div>
</blockquote>
<blockquote style="text-align: left;" type="cite">
<div style="-webkit-line-break: after-white-space; -webkit-nbsp-mode: space; word-wrap: break-word;">
<div>
<span style="text-align: justify;"><span style="font-family: Times, Times New Roman, serif;">On behalf of the GBIF development team, I thank you for your patience during this transition time, and hope you are looking forward to real-time data management as much as we are.</span> </span></div>
</div>
</blockquote>
</div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/16423423909368777750noreply@blogger.com0tag:blogger.com,1999:blog-2326624813533383062.post-68235470911571230782013-10-24T14:39:00.000+02:002013-10-24T14:41:06.222+02:00GBIF Backbone in GitHub<link href="http://alexgorbatchev.com/pub/sh/current/styles/shThemeDefault.css" rel="stylesheet" type="text/css"></link>
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js" type="text/javascript"></script>
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shAutoloader.js" type="text/javascript"></script>
<script type="text/javascript">
SyntaxHighlighter.autoloader(
'js jscript javascript /js/shBrushJScript.js',
'text plain @shBrushPlain.js',
'py python @shBrushPython.js',
'sql @shBrushSql.js',
'bash shell @shBrushBash.js',
'css @shBrushCss.js',
'java @shBrushJava.js',
'xml xhtml xslt html @shBrushXml.js'
);
SyntaxHighlighter.all();
</script>
<span style="font-family: Verdana, sans-serif;">For a long time I wanted to experiment with using <a href="https://github.com/mdoering/backbone">GitHub</a> as a tool to browse and manage the <a href="http://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c">GBIF backbone taxonomy</a>. Encouraged by similar sentiments from <a href="http://iphylo.blogspot.co.uk/2013/04/time-to-put-taxonomy-into-github.html">Rod Page</a>, it would be nice to use git to keep track of versions and allow external parties to fork parts of the taxonomic tree and push back changes if desired. To top it off there is the </span><span style="font-family: Verdana, sans-serif;">great GitHub Treeslider to browse the taxonomy, so why not give it a try?</span><br />
<h3>
<span style="font-family: Verdana, sans-serif;">A GitHub filesystem taxonomy</span></h3>
<span style="font-family: Verdana, sans-serif;">I decided to export each taxon in the backbone as a folder that is named according to the canonical name, containing 2 files:</span><br />
<br />
<ol>
<li><span style="font-family: Courier New, Courier, monospace;"><b>README.md,</b></span><span style="font-family: Verdana, sans-serif;"> a simple markdown file that gets rendered by github and shows the basic attributes of a taxon</span></li>
<li><span style="font-family: Courier New, Courier, monospace;"><b>data.json,</b></span><span style="font-family: Verdana, sans-serif;"> a complete json representation of the taxon as it is exposed via the new <a href="http://www.gbif.org/developer/species">GBIF species API</a></span></li>
</ol>
<span style="font-family: Verdana, sans-serif;">The filesystem represents the taxonomic classification and taxon folders are nested accordingly, for example the species <a href="https://github.com/mdoering/backbone/tree/master/life/Fungi/Basidiomycota/Agaricomycetes/Agaricales/Amanitaceae/Amanita/Amanita%20arctica">Amanita arctica</a> is represented as:</span><br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-dNFtybvZtnE/UmkTfFIwWfI/AAAAAAAAEAc/_AwIyGsxGus/s1600/Screen+Shot+2013-10-24+at+14.32.41.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="343" src="http://3.bp.blogspot.com/-dNFtybvZtnE/UmkTfFIwWfI/AAAAAAAAEAc/_AwIyGsxGus/s400/Screen+Shot+2013-10-24+at+14.32.41.png" width="400" /></a></div>
<br />
<span style="font-family: Verdana, sans-serif;">This is just a first experimental step. One can improve the readme a lot to render more content in a human friendly way and include more data in the json file such as common names and synonyms.</span><br />
<h3>
<span style="font-family: Verdana, sans-serif;">Getting data into GitHub</span></h3>
<div>
<span style="font-family: Verdana, sans-serif;">It didn't take much to write a small <a href="https://code.google.com/p/gbif-ecat/source/browse/checklistbank/trunk/checklistbank-nub/src/main/java/org/gbif/nub/export/NubGitExporter.java">NubGitExporter.java</a> class that exports the GBIF backbone into the filesystem as described above. The export of the entire taxonomy, with its 4.4 million taxa including synonyms, took about one hour on a MacBook Pro laptop. </span>
<div>
<span style="font-family: Verdana, sans-serif;">Not bad, I thought. But then I tried to add the generated files to git, and that's when I started to have doubts. After waiting half a day for git to add the files to my local index, I decided to kill the process and start by adding only the smaller kingdoms, excluding animals and plants. That left about 335,000 folders and 670,000 files to be added to git. Adding these to my local index still took several hours; committing and finally pushing them to the GitHub server took yet another 2 hours.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<pre class="brush: bash" name="code">Delta compression using up to 8 threads.
Compressing objects: 100% (1010487/1010487), done.
Writing objects: 100% (1010494/1010494), 173.51 MiB | 461 KiB/s, done.
Total 1010494 (delta 405506), reused 0 (delta 0)
To https://github.com/mdoering/backbone.git
</pre>
<div>
<span style="font-family: Verdana, sans-serif;"><br /></span><span style="font-family: Verdana, sans-serif;">After those files were added to the index, committing a simple change to the main README file took 15 minutes. Although I like the general idea and the pretty user interface, I fear GitHub, and even <a href="http://stackoverflow.com/questions/984707/what-are-the-file-limits-in-git-number-and-size">git</a> itself, are not made to be a repository of millions of files and folders.</span><br />
<h3>
<span style="font-family: Verdana, sans-serif;">First GitHub impressions</span></h3>
<div>
<span style="font-family: Verdana, sans-serif;">Browsing taxa in GitHub is surprisingly responsive. The fungus genus <a href="https://github.com/mdoering/backbone/tree/master/life/Fungi/Basidiomycota/Agaricomycetes/Agaricales/Amanitaceae/Amanita">Amanita</a> contains 746 species, but it loads very quickly. In that regard the GitHub browser is much nicer to use than the one on the new <a href="http://www.gbif.org/species/2526057">GBIF species pages</a>, which of course shows much more information. The rendered <a href="https://github.com/mdoering/backbone/blob/master/life/Fungi/Basidiomycota/Agaricomycetes/Agaricales/Amanitaceae/Amanita/README.md">readme</a> file is not ideally placed, as it sits at the very bottom of the page, but showing information to humans that way is nice - and markdown could also be parsed by machines quite easily if we adopt a simple format, for example: for every property, create a heading with that name and put the content into the following paragraph(s). </span></div>
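That heading-per-property convention would be trivial to parse back. A quick sketch of the idea (hypothetical function, just to show the format round-trips):

```python
def parse_property_markdown(md):
    """Parse the proposed convention: each '# propertyName' heading is
    followed by paragraph(s) holding that property's value."""
    props, current = {}, None
    for line in md.splitlines():
        if line.startswith("#"):
            current = line.lstrip("#").strip()
            props[current] = []
        elif current and line.strip():
            props[current].append(line.strip())
    # join multi-paragraph values back into a single string
    return {k: " ".join(v) for k, v in props.items()}
```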
<div>
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Verdana, sans-serif;">The Amanita example also reveals a bug in the exporter class when dealing with synonyms (the <a href="https://github.com/mdoering/backbone/blob/master/life/Fungi/Basidiomycota/Agaricomycetes/Agaricales/Amanitaceae/Amanita/README.md">Amanita readme</a> contains the synonym information) and also with infraspecific taxa. For example <a href="https://github.com/mdoering/backbone/tree/master/life/Fungi/Basidiomycota/Agaricomycetes/Agaricales/Amanitaceae/Amanita/Amanita%20muscaria">Amanita muscaria</a> contains some weird form information which is mapped erroneously to the species. This obviously should be fixed.</span></div>
<div>
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Verdana, sans-serif;">The GitHub browser sorts all files alphabetically. When ranks are mixed (we skip intermediate unknown ranks in the backbone), as in the <a href="https://github.com/mdoering/backbone/tree/master/life/Fungi">Fungus kingdom</a>, sorting by rank first would be desirable. We could enable this by naming the taxon folders accordingly, prefixing them with a rank token that sorts alphabetically in rank order.</span>
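One possible prefixing scheme (the prefix letters here are made up purely for illustration, not anything implemented in the exporter):

```python
# Prefixes chosen so that plain A-Z sorting reproduces taxonomic order.
RANK_PREFIX = {"kingdom": "a", "phylum": "b", "class": "c",
               "order": "d", "family": "e", "genus": "f", "species": "g"}

def folder_name(rank, canonical_name):
    """Name a taxon folder so GitHub's alphabetical listing groups
    and orders entries by rank before name."""
    return "%s-%s %s" % (RANK_PREFIX[rank], rank, canonical_name)
```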
<div>
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<div>
<span style="font-family: Verdana, sans-serif;">I have not had the time to try versioning branches of the tree and see how usable that is. I suspect git performance would be really slow, but that might not be a blocker if we only do versioning of larger groups and rarely push &amp; pull.</span>
<div>
<br /></div>
</div>
<div class="blogger-post-footer">-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/02525336976753861766noreply@blogger.com9tag:blogger.com,1999:blog-2326624813533383062.post-53841166993629313922013-07-22T21:16:00.000+02:002013-07-22T21:16:07.954+02:00Validating scientific names with the forthcoming GBIF Portal web service API<div dir="ltr" id="docs-internal-guid-617a94a6-07b1-98a4-ce0e-860efe6518aa" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<i>This guest post was written by Gaurav Vaidya, </i><i>Victoria Tersigni and Robert Guralnick, and is being cross-posted to the VertNet Blog. David Bloom and John Wieczorek read through drafts of this post and improved it tremendously.</i></div>
<div dir="ltr" id="docs-internal-guid-617a94a6-07b1-98a4-ce0e-860efe6518aa" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right; width: 200px;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-vxZ3Fg0vyzc/Ue1808go0TI/AAAAAAAAAuQ/smjqdcmDpto/s1600/1024px-Mother_and_baby_sperm_whale.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="" border="0" height="179" src="http://2.bp.blogspot.com/-vxZ3Fg0vyzc/Ue1808go0TI/AAAAAAAAAuQ/smjqdcmDpto/s320/1024px-Mother_and_baby_sperm_whale.jpg" title="" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A whale named <i><strike>Physeter macrocephalus</strike> <strike>Physeter catodon</strike> Physeter macrocephalus</i> (photograph by Gabriel Barathieu, reused under CC-BY-SA from <a href="https://commons.wikimedia.org/wiki/File:Mother_and_baby_sperm_whale.jpg">the Wikimedia Commons</a>)</td></tr>
</tbody></table>
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Validating scientific names is one of the hardest parts of cleaning up a biodiversity dataset: as taxonomists' understanding of species boundaries change, the names attached to them can be synonymized, moved between genera or even have their Latin grammar corrected (it's </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Porphyrio martini</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">cus</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, not </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Porphyrio martini</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">ca</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">). 
Different taxonomists may disagree on what to call a species, whether a particular set of populations make up a species, subspecies or species complex, or even which of several published names correspond to our modern understanding of that species, such as </span><a href="http://www.repository.naturalis.nl/record/318605" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">the dispute</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> over whether the sperm whale is really </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Physeter catodon</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> Linnaeus, 1758, or </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Physeter macrocephalus</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> Linnaeus, 1758.</span></div>
<div dir="ltr" id="docs-internal-guid-617a94a6-07b1-98a4-ce0e-860efe6518aa" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div dir="ltr" id="docs-internal-guid-617a94a6-07b1-f108-eed3-15bc258078d0" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">A good way to validate scientific names is to match them against a taxonomic checklist: a publication that describes the taxonomy of a particular taxonomic group in a particular geographical region. It is up to the taxonomists who write such treatises to catalogue all the synonyms that have ever been used for the names in their checklist, and to identify a single accepted name for each taxon they recognize. While these checklists are themselves evolving over time and sometimes contradict each other, they serve as essential points of reference in an ever-changing taxonomic landscape.</span></div>
<div dir="ltr" id="docs-internal-guid-617a94a6-07b1-f108-eed3-15bc258078d0" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div dir="ltr" id="docs-internal-guid-617a94a6-07b2-1506-462c-670a7a7a817b" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Over a hundred digitized checklists have been assembled by the Global Biodiversity Information Facility (GBIF) and will be indexed in the forthcoming </span><a href="http://uat.gbif.org/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">GBIF Portal</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, currently in development and testing. This collection includes large, global checklists, such as the </span><a href="http://uat.gbif.org/dataset/7ddf754f-d193-4cc9-b351-99906754a03b" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Catalogue of Life</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> and the </span><a href="http://uat.gbif.org/dataset/046bbc50-cae2-47ff-aa43-729fbf53f7c5" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">International Plant Names Index</span></a><span style="background-color: transparent; color: black; font-family: Arial; 
font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, alongside smaller, more focused checklists, such as </span><a href="http://uat.gbif.org/dataset/d7f2602e-9f79-45e8-8399-08d0c5e43f5d" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">a checklist of 383 species of seed plants</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> found in the </span><a href="http://en.wikipedia.org/wiki/Singalila_National_Park" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Singalila National Park in India</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> and </span><a href="http://uat.gbif.org/dataset/db93cee5-60d1-4e16-a69e-83dd7080a55e" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">the 87 species of moss bug</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> recorded in the </span><a 
href="http://coleorrhyncha.speciesfile.org/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Coleorrhyncha Species File</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">. Many of these checklists can be downloaded as </span><a href="http://www.gbif.org/informatics/standards-and-tools/publishing-data/data-standards/darwin-core-archives/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Darwin Core Archive</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> files, an important format for working with and exchanging biodiversity data.</span><br />
<br /></div>
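A Darwin Core Archive is, at its simplest, a zip file bundling delimited text files with a meta.xml descriptor that says how to read them. As a rough sketch of the format (the file names, columns and rows here are invented for the example, not taken from any real checklist), the following Python builds a tiny archive in memory and reads the scientific names back out using only the standard library:

```python
import csv
import io
import zipfile

# A minimal meta.xml descriptor pointing at a single core taxon file.
# Real archives carry much more metadata; this is only illustrative.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Taxon" fieldsTerminatedBy="\\t">
    <files><location>taxon.txt</location></files>
  </core>
</archive>"""

# Hypothetical tab-separated checklist rows, with a header line.
TAXON_TXT = ("scientificName\tscientificNameAuthorship\tkingdom\n"
             "Panthera tigris\t(Linnaeus, 1758)\tAnimalia\n"
             "Felis tigris\tLinnaeus, 1758\tAnimalia\n")

def build_archive() -> bytes:
    """Pack meta.xml and taxon.txt into an in-memory zip."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("meta.xml", META_XML)
        zf.writestr("taxon.txt", TAXON_TXT)
    return buf.getvalue()

def read_names(archive_bytes: bytes) -> list:
    """Extract the scientific names from the core taxon file."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        with zf.open("taxon.txt") as fh:
            text = io.TextIOWrapper(fh, encoding="utf-8")
            reader = csv.DictReader(text, delimiter="\t")
            return [row["scientificName"] for row in reader]

print(read_names(build_archive()))
```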
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">So how can we match names against these databases? </span><a href="http://www.openrefine.org/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">OpenRefine</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> (the recently-renamed Google Refine) is a popular data cleaning tool, with features that make it easy to clean up many different types of data. </span><a href="http://about.me/jotegui" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Javier Otegui</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> has written a tutorial on </span><a href="http://bit.ly/BITW13_OpenRefine" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">cleaning biodiversity data in OpenRefine</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; 
vertical-align: baseline;">, and last year </span><a href="http://iphylo.blogspot.com/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Rod Page</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> provided tools and a </span><a href="http://iphylo.blogspot.com/2012/02/using-google-refine-and-taxonomic.html" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">step-by-step guide to reconciling scientific names</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, establishing OpenRefine as an essential tool for biodiversity data and scientific name cleanup.</span><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; height: 278px; margin-left: 1em; text-align: right; width: 267px;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-E5_Ja4jxV8w/Ue19jVYkyPI/AAAAAAAAAuY/CwmaK86aRvw/s1600/Felis+Tigris+in+Syst+Nat+10th+ed.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="200" src="http://4.bp.blogspot.com/-E5_Ja4jxV8w/Ue19jVYkyPI/AAAAAAAAAuY/CwmaK86aRvw/s200/Felis+Tigris+in+Syst+Nat+10th+ed.png" width="190" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Linnaeus' original description of <i>Felis Tigris</i>. From an 1894 republication of Linnaeus' <i>Systema Naturae, 10th edition</i>, <a href="http://biodiversitylibrary.org/page/25033833">digitized by the Biodiversity Heritage Library</a>.</td></tr>
</tbody></table>
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; line-height: 1.15; text-decoration: none; vertical-align: baseline;">We extended Rod's work by building a reconciliation service against </span><a href="http://dev.gbif.org/wiki/display/POR/Webservice+API" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">the forthcoming GBIF web services API</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">.
We wanted to see if we could use one of the GBIF Portal's biggest
strengths -- the large number of checklists it has indexed -- to
identify names recognized in similar ways by different checklists.
Searching through multiple checklists containing possible synonyms and
accepted names increases the odds of finding an obscure or recently
created name; and if the same name is recognized by a number of
checklists, this may signify a well-known synonymy -- for example, two
of the Portal checklists recognize that the species Linnaeus named </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Felis tigris</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> is the same one that is known as </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Panthera tigris </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">today.</span><br />
</div>
<br />
<div dir="ltr" id="docs-internal-guid-617a94a6-07b4-9eeb-5e5f-e3e700cbe6c9" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">To do this, we wrote a new </span><a href="http://refine.taxonomics.org/gbifchecklists/code" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">OpenRefine reconciliation service</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> that searches for a queried name in all the checklists on the GBIF Portal. It then clusters names using four criteria and counts how often a particular name has the same:</span></div>
<ul style="margin-bottom: 0pt; margin-top: 0pt;">
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">scientific name (for example, "</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Felis tigris</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">"),</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">authority ("Linnaeus, 1758"),</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">accepted name ("</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Panthera tigris</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">"), and</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">kingdom ("Animalia").</span></div>
</li>
</ul>
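The clustering step can be pictured as grouping raw checklist hits by that four-part key and counting how many checklists share each interpretation. A minimal sketch, using invented hit records rather than real service output:

```python
from collections import Counter

# Hypothetical hits for "Felis tigris" returned by several checklists.
hits = [
    {"scientificName": "Felis tigris", "authority": "Linnaeus, 1758",
     "acceptedName": "Panthera tigris", "kingdom": "Animalia"},
    {"scientificName": "Felis tigris", "authority": "Linnaeus, 1758",
     "acceptedName": "Panthera tigris", "kingdom": "Animalia"},
    {"scientificName": "Felis tigris", "authority": None,
     "acceptedName": None, "kingdom": "Metazoa"},
]

def cluster(hits):
    """Count identical (name, authority, accepted name, kingdom) interpretations."""
    def key(h):
        return (h["scientificName"], h["authority"],
                h["acceptedName"], h["kingdom"])
    counts = Counter(key(h) for h in hits)
    # Most widely shared interpretation first.
    return counts.most_common()

for interpretation, n in cluster(hits):
    print(n, interpretation)
```

Sorting by how many checklists share an interpretation is what pushes the well-supported synonymy to the top of the candidate list.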
<br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"></span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Once you do a reconciliation through our new service, your results will look like this:</span><br />
<div dir="ltr" id="docs-internal-guid-617a94a6-07b1-f108-eed3-15bc258078d0" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-l-vU_0ve6Lw/Ue1-POoTaVI/AAAAAAAAAug/6gpqQOqjSHg/s1600/Felis+tigris+reconciliation.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="97" src="http://4.bp.blogspot.com/-l-vU_0ve6Lw/Ue1-POoTaVI/AAAAAAAAAug/6gpqQOqjSHg/s320/Felis+tigris+reconciliation.png" width="320" /></a></div>
<br />
<div dir="ltr" id="docs-internal-guid-617a94a6-07b5-545e-228f-32c0c2f8d033" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Since OpenRefine limits the number of results it shows for any reconciliation, we know only that at least five checklists in the GBIF Portal matched the name "Felis tigris". Of these,</span><br />
</div>
<ol style="margin-bottom: 0pt; margin-top: 0pt;">
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: decimal; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Two checklists consider </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Felis tigris </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Linnaeus, 1758 to be a junior synonym of </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Panthera tigris</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> (Linnaeus, 1758). Names are always sorted by the number of checklists that contain that interpretation, so this interpretation -- as it happens, the correct one -- is at the top of the list.</span><br />
</div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: decimal; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">The remaining checklists all consider </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Felis tigris</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> to be an accepted name in its own right. They contain mutually inconsistent information: one places this species in the kingdom Animalia, another in the kingdom Metazoa, and the third contains both a kingdom and a taxonomic authority. You can click on each name to find out more details.</span></div>
</li>
</ol>
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"></span><br />
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Using our reconciliation service, you can immediately see how many checklists agree on the most important details of the name match, and whether a name should be replaced with an accepted name. The same name may also be spelled identically under different nomenclatural codes: for example, does "Ficus" refer to the genus </span><a href="http://en.wikipedia.org/wiki/Ficus_%28gastropod%29" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Ficus </span><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Röding, 1798</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> or the genus </span><a href="http://en.wikipedia.org/wiki/Ficus" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Ficus</span><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;"> L.</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; 
font-weight: normal; text-decoration: none; vertical-align: baseline;">? If you know that the former is in kingdom Animalia while the latter is in Plantae, it becomes easier to figure out the right match for your dataset.</span></div>
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"></span><br />
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">We've designed a complete workflow around our reconciliation service, starting with ITIS as a very fast first step to catch the most widely recognized names, and ending with EOL's fuzzy-matching search as a final step to look for incorrectly spelled names. For VertNet's 2013 </span><a href="http://vertnet.org/about/BITW.php" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Biodiversity Informatics Training Workshop</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, we wrote two tutorials that walk you through our workflow:</span></div>
<br />
<ul style="margin-bottom: 0pt; margin-top: 0pt;">
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<a href="http://bit.ly/bitw2013-taxon-validation-tutorial" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Name validation in OpenRefine</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, using both the new GBIF API reconciliation service as well as Rod Page's reconciliation service for EOL, and</span></div>
</li>
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<a href="http://bit.ly/bitw2013-higher-taxonomy-tutorial" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">Higher taxonomy in OpenRefine</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, using the web service APIs provided by GBIF and EOL, as well as OpenRefine's ability to parse JSON.</span></div>
</li>
</ul>
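In outline, the workflow is a fallback chain: try a fast exact matcher first, and only fall through to slower or fuzzier services when the earlier steps fail. The resolver functions below are stand-ins with hard-coded lookup tables, not real ITIS, GBIF or EOL clients, but they show the shape of the chain:

```python
def resolve_itis(name):
    """Stand-in for a fast exact lookup (the ITIS step)."""
    known = {"Panthera tigris": "Panthera tigris"}
    return known.get(name)

def resolve_gbif_checklists(name):
    """Stand-in for reconciliation against many checklists (the GBIF step)."""
    synonyms = {"Felis tigris": "Panthera tigris"}
    return synonyms.get(name)

def resolve_eol_fuzzy(name):
    """Stand-in for a fuzzy, misspelling-tolerant search (the EOL step)."""
    fuzzy = {"Pantera tigris": "Panthera tigris"}
    return fuzzy.get(name)

def resolve(name):
    """Try each resolver in order; return the first accepted name found."""
    for step in (resolve_itis, resolve_gbif_checklists, resolve_eol_fuzzy):
        match = step(name)
        if match is not None:
            return match
    return None  # unresolved: flag the name for manual review

print(resolve("Felis tigris"))
```

Names that fall all the way through the chain are exactly the ones worth a taxonomist's attention.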
<br />
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">If you're already familiar with OpenRefine, you can add the reconciliation service with the URL:</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> </span><a href="http://refine.taxonomics.org/gbifchecklists/reconcile" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Consolas; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">http://refine.taxonomics.org/gbifchecklists/reconcile</span></a><span style="background-color: transparent; color: black; font-family: Consolas; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"></span></div>
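Under the hood, OpenRefine talks to a reconciliation service through a simple JSON protocol: it sends a <code>queries</code> parameter mapping query keys to name queries, and the service returns candidate matches with scores. The sketch below builds such a request URL for our service and parses a canned response; the response body is invented to illustrate the protocol's shape, and the endpoint itself may not remain available indefinitely:

```python
import json
from urllib.parse import urlencode

SERVICE = "http://refine.taxonomics.org/gbifchecklists/reconcile"

def build_request_url(names):
    """Encode reconciliation queries the way OpenRefine does."""
    queries = {"q%d" % i: {"query": n} for i, n in enumerate(names)}
    return SERVICE + "?" + urlencode({"queries": json.dumps(queries)})

def best_matches(response_text):
    """Pick the top-ranked candidate name for each query key."""
    payload = json.loads(response_text)
    return {key: (entry["result"][0]["name"] if entry["result"] else None)
            for key, entry in payload.items()}

# A made-up response in the general shape the reconciliation protocol defines.
sample = json.dumps({
    "q0": {"result": [
        {"id": "example-id", "name": "Felis tigris Linnaeus, 1758",
         "score": 5, "match": True},
    ]},
})

print(build_request_url(["Felis tigris"]))
print(best_matches(sample))
```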
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Give it a try, and let us know if it helps you reconcile names faster!</span></div>
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"></span><br />
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<a href="http://www.mappinglife.org/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">The Map of Life project</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> is continuing to work on improving OpenRefine for taxonomic use in a project we call </span><a href="https://github.com/gaurav/taxrefine" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">TaxRefine</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">. If you have suggestions for features you'd like to see, please let us know! You can leave a comment on this blog post, or </span><a href="https://github.com/gaurav/taxrefine/issues" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">add an issue to our issue tracker on GitHub</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">.</span></div>
<div dir="ltr" id="docs-internal-guid-617a94a6-07b1-98a4-ce0e-860efe6518aa" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-2326624813533383062.post-46981112765552650802013-05-22T15:37:00.000+02:002013-05-22T15:38:43.385+02:00IPT v2.0.5 Released - A melhor versão até o momento!<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<br /></div>
<div class="p1">
<div style="text-align: justify;">
<br class="Apple-interchange-newline" /></div>
<div style="text-align: justify;">
The GBIF Secretariat is proud to release version 2.0.5 of the Integrated Publishing Toolkit (IPT), available for download on the project website <a href="http://code.google.com/p/gbif-providertoolkit/downloads/list"><span class="s1">here</span></a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
As with every release, it's your chance to take advantage of the most requested feature enhancements and bug fixes.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The most notable feature enhancements include:</div>
<ul style="text-align: left;">
<li><span style="text-align: justify;">A resource can now be configured to publish automatically on an interval </span><i style="text-align: justify;">(See "<a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes?tm=6#Published_Release" target="_blank">Automated Publishing</a>" section in User Manual)</i></li>
<li><i style="text-align: justify;"><span style="font-style: normal;">The interface has been translated into Portuguese, </span></i>making the IPT available in five languages: French, Spanish, Traditional Chinese, Portuguese and of course English.</li>
<li style="text-align: justify;">An IPT can be configured to back up each DwC-Archive version published <i>(See "<a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes?tm=6#Configure_IPT_settings" target="_blank">Archival Mode</a>" in User Manual)</i></li>
<li style="text-align: justify;">Each resource version now has a resolvable URL <i>(See "<a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes#Versioned_page" target="_blank">Versioned Page</a>" section in User Manual)</i></li>
</ul>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-6tFmFS6XnBY/UZyMMyHLAzI/AAAAAAAAHkI/nLYRp6Nh7Ss/s1600/Screen+Shot+2013-05-22+at+11.11.47+AM.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto; text-align: justify;"><img border="0" height="220" src="http://1.bp.blogspot.com/-6tFmFS6XnBY/UZyMMyHLAzI/AAAAAAAAHkI/nLYRp6Nh7Ss/s400/Screen+Shot+2013-05-22+at+11.11.47+AM.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Filterable, pageable, and sortable resource overview table in v2.0.5</td></tr>
</tbody></table>
<ul style="text-align: left;">
<li style="text-align: justify;">The order of columns in published DwC-Archives is always the same between versions</li>
<li style="text-align: justify;">Style (CSS) customizations are easier than ever - check out this new guide entitled "<a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2Customization" target="_blank">How to Style Your IPT</a>" for more information</li>
<li style="text-align: justify;"><i><span style="font-style: normal;">Hundreds if not thousands of resources can be handled, now that the resource overview tables are filterable, pageable, and sortable <i>(See "<a href="https://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes#Public_Resources_Table" target="_blank">Public Resource Table</a>" section in User Manual)</i> </span></i></li>
</ul>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The most important bug fixes are:</div>
<div>
<ul style="text-align: left;">
<li style="text-align: justify;">Garbled encoding on registration updates has been fixed</li>
<li style="text-align: justify;">The problem uploading DwC-Archives in .gzip format has been fixed</li>
<li style="text-align: justify;">The problem uploading a resource logo has been fixed</li>
</ul>
<div style="text-align: justify;">
</div>
</div>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-xnJrM7Zv-eI/UZyMRZEFlRI/AAAAAAAAHkQ/6h1rgSZGUuA/s1600/Screen+Shot+2013-05-22+at+11.12.32+AM.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto; text-align: justify;"><img border="0" height="244" src="http://1.bp.blogspot.com/-xnJrM7Zv-eI/UZyMRZEFlRI/AAAAAAAAHkQ/6h1rgSZGUuA/s320/Screen+Shot+2013-05-22+at+11.12.32+AM.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The new look in v2.0.5</td></tr>
</tbody></table>
<div style="text-align: justify;">
The changes mentioned above represent just a fraction of the work that has gone into this version. Since version 2.0.4 was released 7 months ago, a total of 45 issues have been addressed. These are detailed in the <span class="s1"><a href="https://code.google.com/p/gbif-providertoolkit/issues/list?can=1&q=milestone%3DRelease2.0.5">issue tracking system</a></span>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
It is great to see so much feedback from the community in the form of issues, especially as the IPT becomes more stable and comprehensive over time. After all, the IPT is a community-driven project, and anyone can contribute patches, translations, or have their say simply by adding or voting on issues. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The single largest community contribution in this version has been the translation into Portuguese done by three volunteers at the <a href="http://www.biocomp.org.br/" target="_blank">Universidade de São Paulo, Research Center on Biodiversity and Computing</a>: Etienne Cartolano, Allan Koch Veiga, and Antonio Mauro Saraiva. With <a href="http://www.gbif.org/communications/news-and-events/showsingle/article/brazil-joins-global-initiative-for-biodiversity-data-access" target="_blank">Brazil recently joining the GBIF network</a>, we hope the Portuguese interface for the IPT will help in publication of the wealth of biodiversity data available from Brazilian institutions. </div>
</div>
<div class="p1">
<div style="text-align: justify;">
<br /></div>
</div>
<div class="p1">
<div style="text-align: justify;">
We’d also like to give special thanks to the other volunteers below:</div>
</div>
<ul class="ul1">
<li class="li1" style="text-align: justify;">Marie-Elise Lecoq (GBIF France<span class="s2">)</span> - Updating French translation</li>
<li class="li1" style="text-align: justify;">Yu-Huang Wang (TaiBIF, Taiwan) - Updating Traditional Chinese translation</li>
<li class="li3" style="text-align: justify;">Dairo Escobar and Daniel Amariles (Colombian Biodiversity Information System (SiB)) - Updating <span class="s3">Spanish translation</span></li>
<li class="li3" style="text-align: justify;">Carlos Cubillos (Colombian Biodiversity Information System (SiB)) - Contributing style improvements</li>
<li class="li3" style="text-align: justify;">Sijmen Cozijnsen (independent contractor working for NLBIF, Netherlands) - Contributing style improvements</li>
</ul>
<div class="p1">
<div style="text-align: justify;">
On behalf of the GBIF development team, I hope you enjoy using the latest version of the IPT. </div>
</div>
</div>
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>Anonymoushttp://www.blogger.com/profile/16423423909368777750noreply@blogger.com1tag:blogger.com,1999:blog-2326624813533383062.post-70640305360511415342013-05-14T15:34:00.000+02:002015-08-25T16:02:01.748+02:00Migrating our hadoop cluster from CDH3 to CDH4We've written a number of times on the <a href="http://gbif.blogspot.dk/2011/01/setting-up-hadoop-cluster-part-1-manual.html" target="_blank">initial setup</a>, eventual <a href="http://gbif.blogspot.dk/2012/06/faster-hbase-hardware-matters.html" target="_blank">upgrade</a> and continued <a href="http://gbif.blogspot.dk/2012/07/optimizing-writes-in-hbase.html" target="_blank">tuning</a> of our hadoop cluster. Our latest project has been upgrading from CDH3u3 to <a href="http://blog.cloudera.com/blog/2012/02/introducing-cdh4/" target="_blank">CDH4.2.1</a>. Upgrades are almost always disruptive, but we decided it was worth the hassle for a number of reasons:<br />
<br />
<ul>
<li>general performance improvements in the entire Hadoop/HBase stack</li>
<li>continued support from the community/user list (a non-trivial concern - anybody asking questions on the user groups and mailing list about problems with older clusters is invariably asked to update before people are interested in tackling the problem)</li>
<li>multi-threaded compactions (the need for which we concluded <a href="http://gbif.blogspot.dk/2012/07/optimizing-writes-in-hbase.html" target="_blank">in this post</a>)</li>
<li>table-based region balancing (rather than just cluster-wide)</li>
</ul>
<div>
We had been managing our cluster primarily using Puppet, with all the knowledge of how the bits worked together firmly within our dev team. In an effort to make everyone's lives easier, reduce our <a href="http://en.wikipedia.org/wiki/Bus_factor" target="_blank">bus factor</a>, and get the server management back into the hands of our ops team, we've moved to <a href="https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Downloads" target="_blank">CDH Manager</a> to control our CDH installation. That's been going pretty well so far, but we're getting ahead of ourselves...</div>
<div>
<br /></div>
<h3>
The Process</h3>
<div>
We have 6 slave nodes that have a lot of disk capacity, since we spec'd with a goal of lots of spindles, which meant we got lots of space "for free". Rather than upgrading in place, we decided to start fresh with new master & zookeeper nodes, and we calculated that we'd have enough space to pull half the slaves into the new cluster without losing any data. We cleaned up all the tmp files and anything we deemed not worth saving from HBase and hdfs, and started the migration:</div>
<h4>
Reduce the replication factor</h4>
<div>
We reduced the replication factor to 2 on the 6 slave nodes to reduce the disk use:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">hadoop fs -setrep -R 2 /</span></div>
<h4>
Decommission the 3 nodes to move</h4>
<div>
"Decommissioning" is the civilized and safe way to remove nodes from a cluster where there's risk that they contain the only copies of some data in the cluster (they'll block writes but accept reads until all blocks have finished replicating out). To do it, add the names of the target machines to an "excludes" file (one per line) that your hdfs config needs to reference, and then refresh hdfs.</div>
<div>
<br /></div>
<div>
The block in hdfs-site.xml:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><property></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> <name>dfs.hosts.exclude</name></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> <value>/etc/hadoop/conf/excluded_hosts</value></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"></property></span></div>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: inherit;">then run:</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">bin/hadoop dfsadmin -refreshNodes</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: inherit;">and wait for the "under replicated blocks" count on the hdfs admin page to drop to 0 and the decommissioning nodes to move into state "Decommissioned".</span></div>
<h4>
Don't forget HBase</h4>
<div>
<span style="font-family: inherit;">The hdfs datanodes are tidied up now but don't forget to cleanly shutdown the HBase regionservers - run:</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">./bin/graceful_stop.sh HOSTNAME</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;">from within the HBase directory on the host you're shutting down (specifying the real name for HOSTNAME). It will shed its regions and shutdown when tidied up (more details <a href="http://hbase.apache.org/book/node.management.html" target="_blank">here</a>).</span><br />
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;">Now you can shutdown the tasktracker and datanode, and then the machine is ready to be wiped.</span></div>
<h4>
<span style="font-family: inherit;">Build the new cluster</span></h4>
<div>
<span style="font-family: inherit;">We wiped the 3 decommissioned slave nodes and installed the latest version of CentOS (our linux of choice, version 6.4 at time of writing). We also pulled 3 much lesser machines from our other cluster after decommissioning them in the same way, and installed CentOS 6.4 there, too. The 3 lesser machines would form our zookeeper ensemble and master nodes in the new cluster.</span></div>
<h4>
<span style="font-family: inherit;">Enter CDH Manager</span></h4>
<div>
<span style="font-family: inherit;">The folks at <a href="http://www.cloudera.com/content/cloudera/en/home.html" target="_blank">Cloudera</a> have made a free version of their <a href="https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Downloads" target="_blank">CDH Manager app</a> available, and it makes managing a cluster much, much easier. After setting up the 6 machines that would form the basis of our new cluster with just the barebones OS, we were ready to start wielding the manager. We made a small VM to hold the manager app and installed it there. The <a href="http://www.cloudera.com/content/support/en/documentation/manager-free/cloudera-manager-free-v4-latest.html" target="_blank">manager instructions</a> are pretty good, so I won't recreate them here. We had trouble with the key-based install so had to resort to setting identical passwords for root and allowing root ssh access for the duration of the install, but other than that it all went pretty smoothly. We installed in the following configuration (the master machines are the lesser ones described above, and the slaves the more powerful machines).</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<style type="text/css">.nobrtable br { display: none } tr {text-align: center;} tr.alt td {background-color: #eeeeee; color: black;} tr {text-align: center;} caption {caption-side:bottom;}</style>
<br />
<center>
<div class="nobrtable">
<table border="2" bordercolor="#000000" cellpadding="10" cellspacing="0" style="background-color: #dddddd; border-collapse: collapse; width: 70%px;">
<caption>Machine and Role assignments</caption>
<tbody>
<tr style="background-color: #dddddd; color: black; padding-bottom: 4px; padding-top: 5px;">
<th>Machine</th>
<th>Roles</th>
</tr>
<tr class="alt">
<td>master1</td>
<td>HDFS Primary NameNode, Zookeeper Member, HBase Master (secondary)</td>
</tr>
<tr class="alt">
<td>master2</td>
<td>HDFS Secondary NameNode, Zookeeper Member, HBase Master (primary)</td>
</tr>
<tr class="alt">
<td>master3</td>
<td>Hadoop JobTracker, Zookeeper Member, HBase Master (secondary)</td>
</tr>
<tr class="alt">
<td>slave1</td>
<td>HDFS DataNode, Hadoop TaskTracker, HBase Regionserver</td>
</tr>
<tr class="alt">
<td>slave2</td>
<td>HDFS DataNode, Hadoop TaskTracker, HBase Regionserver</td>
</tr>
<tr class="alt">
<td>slave3</td>
<td>HDFS DataNode, Hadoop TaskTracker, HBase Regionserver</td></tr>
</tbody></table>
</div>
</center>
<div>
<h4>
<span style="font-family: inherit;">Copy the data</span></h4>
</div>
<div>
<span style="font-family: inherit;">Now we had two running clusters - our old CDH3u3 cluster (with half its machines removed) and the new, empty CDH 4.2.1 cluster. The trick was how to get data from the old cluster into the new, with our primary concern being the data in HBase. The builtin facility for this sort of thing is called CopyTable, and sounds great, except that it doesn't work across major versions of HBase, so that was out. Next we looked at copying the HFiles directly from the old cluster to the new using the HDFS builtin command </span><span style="font-family: Courier New, Courier, monospace;">distcp</span><span style="font-family: inherit;">. Because we could handle shutting down HBase on the old cluster for the duration of the copy, this, in theory, should work - newer versions of HBase can read the older versions' HFiles and then write the new versions during compactions (and by shutting down we don't run the risk of missing updates from caches that haven't flushed, etc). And in spite of lots of warnings around the net that it wouldn't work, we tried it anyway. And it didn't work :) We managed to get the -ROOT- table up, but it couldn't find .META., and that's where our patience ended. The next, and thankfully successful, attempt was using HBase export, distcp, and HBase import.</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;">On the old cluster we ran:</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">bin/hadoop jar hbase-0.90.4-cdh3u3.jar export table_name /exports/table_name</span></div>
<div>
<br /></div>
<div>
for each of our tables, which produced a bunch of sequence files in the old cluster's HDFS. Those we copied over to the new cluster using HDFS's <span style="font-family: Courier New, Courier, monospace;">distcp</span> command:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">bin/hadoop distcp hftp://old-cluster-namenode:50070/exports/table_name hdfs://master1:8020/imports/table_name</span></div>
<div>
<br /></div>
<div>
which takes advantage of the builtin http-like interface (hftp) that HDFS provides, making the copy process version-agnostic.</div>
<div>
<br /></div>
<div>
Finally on the new cluster we can import the copied sequence files into the new HBase:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">bin/hadoop jar hbase-0.94.2-cdh4.2.1-security.jar import table_name /imports/table_name</span></div>
<div>
<br /></div>
<div>
Make sure the table exists before you import, and because the import is a mapreduce job that does Puts, it would also be wise to presplit any large tables at creation time so that you don't crush your new cluster with lots of hot regions and splitting. Also, one known issue in this version of HBase is a performance regression from version 0.92 to 0.94 (detailed in <a href="https://issues.apache.org/jira/browse/HBASE-7868" target="_blank">HBASE-7868</a>), which you can work around by adding the following to your table definition:</div>
<br />
<span style="font-family: Courier New, Courier, monospace;">DATA_BLOCK_ENCODING => 'FAST_DIFF'</span><br />
<br />
e.g. <span style="font-family: Courier New, Courier, monospace;">create 'test_table', {NAME=>'cf', COMPRESSION=>'SNAPPY', VERSIONS=>1, DATA_BLOCK_ENCODING => 'FAST_DIFF'}</span><br />
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;">As per that linked issue, you should also enable short-circuit reads from the CDH Manager interface.</span></div>
<div>
<br /></div>
<div>
And to complete the copying process, run major compactions on all your tables to ensure the best data locality you can for your regionservers.</div>
<h4>
All systems go</h4>
<div>
After running checks on the copied data, and updating our software to talk to CDH4, we were happy that our new cluster was behaving as expected. To get back to our normal performance levels we then shutdown the remaining machines in the CDH3u3 cluster, wiped and installed the latest OS, and then told CDH Manager to install on them. A few minutes later we had all our M/R slots back, as well as our regionservers. We ran the HBase balancer to evenly spread out the regions, ran another major compaction on our tables to force data-locality, and we were back in business!</div>
<div>
<br /></div>
<div>
<br /></div>
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>Oliver Meynhttp://www.blogger.com/profile/04706642473308341930noreply@blogger.com2tag:blogger.com,1999:blog-2326624813533383062.post-49501115907887246402013-02-08T12:45:00.000+01:002013-02-12T10:51:38.852+01:00Data cleaning: Using MySQL to identify XML breaking charactersSometimes publishers have problems with data resources that contain control characters that will break the XML response if they are included.<br />
<br />
Publishers that share datasets through the DiGIR and TAPIR protocols are especially vulnerable to text fields that contain polluted data. Information about locality (http://rs.tdwg.org/dwc/terms/index.htm#locality) is often quite rich and can be copied from diverse sources, possibly entering the database table without having been through a verification or cleaning process. The locality string can be copy/pasted from a file into the locality column, the data can be mass loaded with LOAD DATA INFILE, or it can be bulk inserted; each of these methods carries a risk that unintended characters enter the table.<br />
<br />
Even if you have time and are meticulous, you could miss certain control characters because they are invisible to the naked eye. So what are publishers (some with limited resources) going to do to ferret out these XML-breaking characters? Assuming that you have access to the MySQL database itself, you can identify these pesky control characters in a few basic steps: create a small table, insert some hexadecimal values into it (sounds much harder than it is), and finally run a query that picks out these ‘illegal’ characters from the table that you specify.<br />
<br />
We start out with creating a table to hold the values for the problematic characters so that we can use them in a query:<br />
<br />
<blockquote>CREATE TABLE control_char (<br />
id int(4) NOT NULL AUTO_INCREMENT,<br />
hex_val CHAR(2),<br />
PRIMARY KEY(id) <br />
) DEFAULT CHARACTER SET = utf8;</blockquote><br />
The DEFAULT CHARACTER SET declaration forces UTF-8 compliance, which the string functions used later require.<br />
We then populate the table with these hex values that represent control characters:<br />
<br />
<blockquote>INSERT INTO control_char (hex_val)<br />
VALUES<br />
('00'),('01'),('02'),('03'),('04'),('05'),('06'),('07'),('08'),('09'),('0a'),('0b'),('0c'),('0d'),('0e'),('0f'),<br />
('10'),('11'),('12'),('13'),('14'),('15'),('16'),('17'),('18'),('19'),('1a'),('1b'),('1c'),('1d'),('1e'),('1f')<br />
;</blockquote><br />
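If you would rather not type out the 32 value pairs by hand, they can be generated. Here is a small sketch (Python, purely illustrative and not part of the original workflow; the table and column names match the SQL above):

```python
# Illustrative only: build the INSERT statement for the C0 control
# range 0x00-0x1f instead of typing the hex pairs by hand.
pairs = [format(i, "02x") for i in range(0x20)]            # '00' .. '1f'
values = ",".join("('{}')".format(p) for p in pairs)
sql = "INSERT INTO control_char (hex_val)\nVALUES\n{};".format(values)
print(sql)
```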
You can read more about these values here: <a href="http://en.wikipedia.org/wiki/C0_and_C1_control_codes">http://en.wikipedia.org/wiki/C0_and_C1_control_codes</a> <br />
<br />
At this point you may ask why the control_char table is not a temporary table, as you might not want it to be a permanent feature in the database. The reason is, sadly, that MySQL has a long-standing bug that prevents a temporary table from being referenced more than once (<a href="http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html">http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html</a>), and we have to reference it more than once, as you will see later.<br />
<br />
Now on to the main query – these declarations test the table and column that you specify against the control_char table:<br />
<blockquote>SELECT t1.* FROM scinames_harvested t1, control_char<br />
WHERE LOCATE(control_char.hex_val, HEX(t1.scientific_name)) MOD 2 != 0;</blockquote><br />
The query references two tables; one is a table of roughly 5000 records containing a record primary key, scientific_name and some other columns. Some of the scientific name strings are polluted with characters that we want to get rid of. The second table contains the control characters.<br />
The way we ensure that the LOCATE function tests for value pairs two steps at a time is by using the modulo operator MOD. Remember we want to look through the scientific_name char string after it has been converted to hexadecimal values (HEX) that consist of value pairs. We don’t want to test across value pairs! LOCATE returns the 1-based position of the first match (or 0 if there is none), so MOD 2 != 0 keeps only matches at odd positions, i.e. matches aligned to a whole hex pair.<br />
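To make the pair-alignment idea concrete, here is the same detection logic outside of MySQL, as a Python sketch (mine, not from the original post). It mirrors HEX() plus LOCATE(...) MOD 2 != 0, including the limitation that only the first occurrence of each pair is checked:

```python
# Hex pairs for the C0 control characters, same values as the control_char table.
CONTROL_HEX = [format(i, "02x") for i in range(0x20)]  # '00' .. '1f'

def has_control_char(value: str) -> bool:
    """True if the UTF-8 hex encoding of `value` contains a C0 control
    character aligned on a byte (hex-pair) boundary."""
    hexed = value.encode("utf-8").hex()
    for pair in CONTROL_HEX:
        pos = hexed.find(pair)  # LOCATE() is 1-based; find() is 0-based
        # LOCATE(...) MOD 2 != 0 means an odd 1-based position, which is an
        # even 0-based offset, i.e. the pair lines up with a whole byte.
        if pos != -1 and pos % 2 == 0:
            return True
    return False

print(has_control_char("Adelotus brevis\x0b"))  # True: ends in a vertical tab
print(has_control_char("Adelotus brevis"))      # False: clean string
```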
<br />
Running the query, in this instance, gives me five records with characters that are not kosher:<br />
<br />
<a href="http://2.bp.blogspot.com/-TDfCxJC6iGo/URTktXdaW8I/AAAAAAAAAIw/79AeXAEqLlU/s1600/control_char.png" imageanchor="1" style=""><img border="0" height="151" width="281" src="http://2.bp.blogspot.com/-TDfCxJC6iGo/URTktXdaW8I/AAAAAAAAAIw/79AeXAEqLlU/s400/control_char.png" /></a><br />
<br />
This is pretty neat if the alternative is eyeballing each and every record.<br />
Note that I cannot guarantee that this will properly process every character from the UTF-8 Latin-1 supplement <a href="http://en.wikipedia.org/wiki/C1_Controls_and_Latin-1_Supplement">http://en.wikipedia.org/wiki/C1_Controls_and_Latin-1_Supplement</a> <br />
<br />
If you want to create a test table and try out the queries above, this UPDATE query template will change the string into something containing control characters:<br />
<blockquote>UPDATE your_table SET your_column = CONCAT('Adelotus brevis', X'0B') WHERE id = 12345;</blockquote>In the CONCAT call the second argument looks funny, but you have to remember that the X in front of '0B' tells MySQL that a hex value is coming. In this case it is a line-tabulation character: <a href="http://www.fileformat.info/info/unicode/char/000b/index.htm">www.fileformat.info/info/unicode/char/000b/index.htm</a>. This part can be edited to other values for test purposes. Naturally the CONCAT function can take any number of strings for concatenation. <br />
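The queries above identify the polluted records; actually removing the control characters is then a separate step. A minimal cleanup sketch (again Python and illustrative only; how you write the cleaned value back with an UPDATE depends on your setup):

```python
import re

# Matches the C0 control characters U+0000..U+001F. Note this range also
# covers tab (0x09) and newline (0x0a); narrow it if those are legitimate
# in your data.
C0_CONTROLS = re.compile(r"[\x00-\x1f]")

def strip_control_chars(value: str) -> str:
    """Remove C0 control characters from a string."""
    return C0_CONTROLS.sub("", value)

print(strip_control_chars("Adelotus brevis\x0b"))  # prints "Adelotus brevis"
```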
<br />
<div class="blogger-post-footer">-------
All blog items represent the author's own ideas, and should not be considered GBIF or Institutional policy.</div>Jan K. Legindhttp://www.blogger.com/profile/11185887314419707389noreply@blogger.com0