<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-2326624813533383062</id><updated>2012-02-29T15:30:47.864+01:00</updated><category term='xml'/><category term='DwC-archive'/><category term='nub'/><category term='pywrapper'/><category term='xsl'/><category term='scala'/><category term='eml'/><category term='names'/><category term='postgres'/><category term='CSS'/><category term='Canadensys'/><category term='registry'/><category term='lucene'/><category term='lift'/><category term='HIT'/><category term='wallboard'/><category term='IPT'/><category term='sql'/><category term='character sets'/><category term='iso19139'/><category term='dc'/><category term='PMH'/><category term='OAI'/><category term='BioCASe'/><category term='harvester'/><category term='dublin core'/><category term='OAI-PMH'/><category term='ABCD'/><category term='GBIF'/><category term='taxonomy'/><category term='customization'/><title type='text'>Developer Blog</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Tim Robertson</name><uri>http://www.blogger.com/profile/07889700598656669041</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://1.bp.blogspot.com/_5wZ2Fic5QtA/STxJva6Zq2I/AAAAAAAAABc/g2I1cG9fztw/S220/tim.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>52</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-120585784179699951</id><published>2012-02-28T17:09:00.000+01:00</published><updated>2012-02-28T17:09:07.319+01:00</updated><title type='text'>Performance Evaluation of HBase</title><content type='html'>In the last post Lars talked about&amp;nbsp;&lt;a href="http://gbif.blogspot.com/2012/02/monitoring-hadoop-and-hbase.html"&gt;setting up Ganglia&lt;/a&gt;&amp;nbsp;for monitoring our Hadoop and HBase installations. &amp;nbsp;That was in preparation for giving HBase a solid testing run to assess its suitability for hosting our index of occurrence records. &amp;nbsp;One of the important features in our new Data Portal will be the "Download" function that lets people download occurrences matching some search criteria and currently that process is a very manual and labour intensive one, so automating it will be a big help to us. &amp;nbsp;Using HBase it would be implemented as a full table scan, and that's why I've spent some time testing our scan performance.&amp;nbsp;Anyone who has been down this road will probably have encountered the myriad opinions on what will improve performance (some of them conflicting) along with the seemingly endless parameters that can be tuned in a given cluster. &amp;nbsp;The overall result of that kind of research is: "You gotta do it yourself". &amp;nbsp;So here we go.&lt;br /&gt;
&lt;br /&gt;
Because we hope to get some feedback from the HBase community on our results&amp;nbsp;(ideally comparisons to other running clusters)&amp;nbsp;we decided to use the PerformanceEvaluation class that ships with HBase. &amp;nbsp;Unfortunately it's not completely bug free so I've had to patch it to work in the way we would like. &amp;nbsp;From a stock cdh3u3 HBase I patched&amp;nbsp;&lt;a href="https://issues.apache.org/jira/browse/HBASE-5401"&gt;HBASE-5401&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href="https://issues.apache.org/jira/browse/HBASE-4688"&gt;HBASE-4688&lt;/a&gt;, and had a good, hard think about&amp;nbsp;&lt;a href="https://issues.apache.org/jira/browse/HBASE-5402"&gt;HBASE-5402&lt;/a&gt;. &amp;nbsp;There are a bunch of things I learned about the PerformanceEvaluation (PE) class in the last few weeks, and the implications of 5402 are some of them. &amp;nbsp;If anyone's interested I'll blog about those another time, but for now suffice it to say that in my final suite of tests I went with writing tables using "sequentialWrite" (which obviates the difficulties in 5402) and doing scans with the "scan" test.&lt;br /&gt;
&lt;br /&gt;
There are many variables that can be changed in an HBase/Hadoop cluster setup but you have to start somewhere so here's our basic config that remained unchanged throughout the test (you can see more at its &lt;a href="http://dev.gbif.org/ganglia/?c=hadoop-2&amp;amp;m=load_one&amp;amp;r=hour&amp;amp;s=by%20name&amp;amp;hc=4&amp;amp;mc=2"&gt;ganglia page&lt;/a&gt;):&lt;br /&gt;
&lt;br /&gt;
Master (c1n2): HDFS NameNode, Hadoop JobTracker, HBase Master, and Zookeeper&lt;br /&gt;
Slaves (c2n1, c2n2, c2n3): HDFS DataNode, Hadoop TaskTracker, HBase RegionServer (6 GB heap)&lt;br /&gt;
&lt;br /&gt;
Hardware:&lt;br /&gt;
&lt;strong&gt;c1n2&lt;/strong&gt;: 1x&lt;strong&gt;&amp;nbsp;&lt;/strong&gt;Intel(R) Xeon(R) X3363 @ 2.83GHz (quad), 8GB RAM, 2x500G SATA 5.4K&lt;br /&gt;
&lt;b&gt;c2n*&lt;/b&gt;: 2x Intel(R) Xeon(R) E5630 @ 2.53GHz (quad), 24GB RAM, 6x250G SATA 7.2K&lt;br /&gt;
&lt;br /&gt;
&lt;div&gt;
The parameters I varied are:&lt;/div&gt;
&lt;div&gt;
&lt;ul&gt;
&lt;li&gt;number of rows in the test table (50M, 100M, 200M)&lt;/li&gt;
&lt;li&gt;number of mappers per server (8, 10, 12, 14, 20 - more than 20 and we ran out of RAM)&lt;/li&gt;
&lt;li&gt;snappy compression (on/off)&lt;/li&gt;
&lt;li&gt;scanner caching (size)&lt;/li&gt;
&lt;li&gt;block caching (on/off)&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
&lt;b&gt;Test Methodology&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
The test methodology was to start with an empty HDFS (replication factor of 2), an empty HBase and no other tasks running on the machines. &amp;nbsp;Then build a table using a command line like&amp;nbsp;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 100&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: inherit;"&gt;and then wait for the split/compaction cycle to complete, and then run a major compaction to encourage data locality. &amp;nbsp;In tests using compression I would construct an empty table by hand, specifying compression=&amp;gt;'SNAPPY' for the info column family before running the sequentialWrite. I'd also wait for the region balancer to run, ensuring that all regionservers had an equal number of regions (+/- 1). Once that was complete I would run a scan test with a command like:&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;hbase org.apache.hadoop.hbase.PerformanceEvaluation scan 100&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: inherit;"&gt;and note the real time taken as reported by the JobTracker as well as the total time spent in mappers as reported by PE itself. &amp;nbsp;I ran each scan test 3 times unless the first 2 tests were very close, in which case I sometimes skipped the 3rd run. Note also that PE uses 10 byte keys and 1000 byte values by default.&amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Two things were immediately obvious in my caching tests, which I'll talk about briefly before getting into the results proper.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;Scanner Caching&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;The HBase API Scan class takes a setting called &lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;setCaching(int caching)&lt;/span&gt; which tells the scanner how many rows to fetch at a time from the server. &amp;nbsp;This defaults to 1, which is optimal for single gets, but is decidedly suboptimal for big scans like ours. Unfortunately PE doesn't allow for explicitly setting this value on the command line and so I took the advice of the API Javadoc and passed in a value of 1000 for the property &lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;hbase.client.scanner.caching&lt;/span&gt; in the hbase-site.xml file that forms the configuration for the MapReduce job that is PE. &amp;nbsp;It would seem, though, that this property is not being honoured by HBase and so my initial tests all ran with a cache size of 1, producing poor results. &amp;nbsp;I subsequently hard-coded the &lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;setCaching(1000)&lt;/span&gt; on the scans themselves in the PE and saw significant improvements. Those differences are visible in the results, below.&lt;/div&gt;
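&lt;div&gt;
&lt;br /&gt;To see why the caching value matters so much, here is a rough Python sketch of the arithmetic. It assumes a simplified model in which every fetch from the server returns up to "caching" rows, so a full scan needs about rows / caching fetches. The function name scanner_round_trips is made up for this illustration and is not part of HBase or PE.&lt;/div&gt;

```python
def scanner_round_trips(total_rows, caching):
    """Estimate the number of fetches a full scan needs.

    Simplified model (an assumption): each fetch returns up to
    `caching` rows, so the scan needs ceil(total_rows / caching) fetches.
    """
    # Integer ceiling division, avoids floating point rounding.
    return -(-total_rows // caching)


if __name__ == "__main__":
    rows = 100_000_000  # roughly the size of the 100M row test table
    for caching in (1, 100, 1000):
        print(caching, scanner_round_trips(rows, caching))
```

&lt;div&gt;
Under this model a 100M row scan goes from 100 million fetches at the default of 1 down to 100,000 at a caching value of 1000. Real scans have other costs too, so treat this only as a way to reason about the order of magnitude.&lt;/div&gt;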
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;Block Caching&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
When I say block caching I mean HBase blocks - not HDFS blocks. &amp;nbsp;See &lt;a href="http://hbase.apache.org/book.html#block.cache"&gt;the HBase guide&lt;/a&gt;&amp;nbsp;or Lars George's excellent &lt;a href="http://shop.oreilly.com/product/0636920014348.do"&gt;HBase: The Definitive Guide&lt;/a&gt;&amp;nbsp;for many more details. You'll see many recommendations from different sources to turn block caching off during scans but it's not immediately obvious why that should be. &amp;nbsp;It turns out that it has no performance implications for the scan itself, but will significantly impact any random reads (gets) that may be relying on recently loaded blocks. &amp;nbsp;If the scan is using the block cache then it will load the cache with a block, read the block, and then load the next one, ignoring the first one for the rest of the scan. &amp;nbsp;The earlier loaded blocks will be evicted quickly (as new blocks are loaded) leading to "cache churn". &amp;nbsp;So I tested the block cache on and off and concluded that there was no difference in a dedicated scan test (no other activity in the cluster) and so for the rest of my tests I left the block cache on. &amp;nbsp;In order to test this behaviour I again had to hard-code PE with&amp;nbsp;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;setCacheBlocks(false)&lt;/span&gt; (it defaults to true).&lt;/div&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;b&gt;The Results&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
The full spreadsheet is available for download from our &lt;a href="http://gbif-occurrencestore.googlecode.com/files/GBIF_PE_scan_tests.xlsx"&gt;google code site&lt;/a&gt;. &amp;nbsp;I've just included the summary chart here, because it pretty much tells the story. &amp;nbsp;Of note: the y-axis is Records / second which I calculate as the total records scanned divided by the total time for the job as reported by the JobTracker. &amp;nbsp;This doesn't, therefore, take any of the PE generated per-mapper counts/times into account (which are still somewhat mysterious to me).&lt;br /&gt;
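&lt;br /&gt;
As a minimal sketch of that calculation in Python: total records divided by the job time from the JobTracker, averaged over repeated runs. The function names and the example numbers are hypothetical and are not taken from the spreadsheet.&lt;br /&gt;

```python
from statistics import mean


def records_per_second(total_records, job_seconds):
    """Wall-clock throughput: records scanned over total job time."""
    return total_records / job_seconds


def mean_throughput(total_records, run_times):
    """Average the throughput over repeated runs of the same scan."""
    return mean(records_per_second(total_records, t) for t in run_times)


if __name__ == "__main__":
    # Hypothetical job times in seconds for three runs of one scan test.
    runs = [400.0, 410.0, 395.0]
    print(round(mean_throughput(100_000_000, runs)))
```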
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-m5ih3ufEHt0/T0z657LStfI/AAAAAAAAAB0/e1WycShilaQ/s1600/hbase_scan_pe.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="420" src="http://1.bp.blogspot.com/-m5ih3ufEHt0/T0z657LStfI/AAAAAAAAAB0/e1WycShilaQ/s640/hbase_scan_pe.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;/div&gt;
&lt;br /&gt;
Conclusions (within our obviously limited test environment):&lt;br /&gt;
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;with bigger datasets, more mappers are better; with smaller datasets, somewhat fewer are better&lt;/li&gt;
&lt;li&gt;compression slows things down. &amp;nbsp;This is probably the most surprising (and contentious) finding. &amp;nbsp;We would expect performance to improve with compression (at worst staying the same) because there is much less data to transfer when the files are compressed, and I/O is typically the limiting factor in HBase performance. &amp;nbsp;I kept an eye on the CPUs with Ganglia during the uncompressed tests and, sure enough, they spent roughly twice as much time in io_wait as during the compressed tests.&lt;/li&gt;
&lt;li&gt;ganglia revealed no obvious bottlenecks during the scan. &amp;nbsp;You can look at &lt;a href="http://dev.gbif.org/ganglia/?c=hadoop-2&amp;amp;m=load_one&amp;amp;r=hour&amp;amp;s=by%20name&amp;amp;hc=4&amp;amp;mc=2"&gt;our ganglia charts&lt;/a&gt; from the past few weeks and apart from a few purposefully degenerate tests there was no obvious component that was limiting things - cpus never went above 60% (io_wait never above 30%), and RAM and disks "looked ok". &amp;nbsp;If anyone has insight here we'd &lt;i&gt;really&lt;/i&gt; appreciate it.&lt;/li&gt;
&lt;li&gt;total variability across all these tests was only about 20%. &amp;nbsp;Either all tests are hitting the same, as yet unseen, hardware or configuration limit, or there really is no silver bullet to performance increases - tuning is an incremental thing.&lt;/li&gt;
&lt;li&gt;there are still strange things going on :) &amp;nbsp;I can't explain the fact that in the compressed test performance increases from 50 to 100M records, and then decreases dramatically with 200M records.&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
We'd love feedback on these results, both in terms of how we can improve our testing and in interpreting the results, as well as any numbers you may have from your clusters when running PE. &amp;nbsp;Is 300k records per second even in the ballpark? &amp;nbsp;It's crazy that these data don't exist out there already, so hopefully this post helps alleviate that somewhat. &amp;nbsp;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The good news is that we'll be getting new cluster hardware soon, and then I think the next step is to load our real occurrences data into a table and run similar scanning tests against it to optimize those scans. &amp;nbsp;Of course then we'll start writing new records at the same time, and then things will get.... interesting. &amp;nbsp;Should be fun :)&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-120585784179699951?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/120585784179699951/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2012/02/performance-evaluation-of-hbase.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/120585784179699951'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/120585784179699951'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2012/02/performance-evaluation-of-hbase.html' title='Performance Evaluation of HBase'/><author><name>Oliver Meyn</name><uri>http://www.blogger.com/profile/04706642473308341930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/-6_sfRK7-oU0/TaWanB_bQ3I/AAAAAAAAAAM/RcPMC0E9E3k/s220/oliver_amory_profile.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-m5ih3ufEHt0/T0z657LStfI/AAAAAAAAAB0/e1WycShilaQ/s72-c/hbase_scan_pe.png' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-7819512969170502121</id><published>2012-02-08T15:57:00.000+01:00</published><updated>2012-02-08T16:04:43.029+01:00</updated><title type='text'>Monitoring Hadoop and HBase</title><content type='html'>&lt;br /&gt;
We're getting serious in our Hadoop adoption. The first process (our so called "rollover") is now in production and it uses Hadoop, Hive, Oozie and various other parts of the Hadoop ecosystem.&lt;br /&gt;
&lt;br /&gt;
Our next step is evaluating HBase and its performance on our (small and aging) cluster. To do that properly and to fix a rather&amp;nbsp;embarrassing&amp;nbsp;situation we first had to get proper monitoring up and running for our cluster. So far we've only had &lt;a href="http://www.cacti.net/"&gt;Cacti&lt;/a&gt; stats for OS level things (CPU, I/O, etc.) but we were missing actual Hadoop statistics.&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-EG9VLEdHZGQ/TzKC2XrrjdI/AAAAAAAABMM/05xctqwUpaA/s1600/ganglia1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="264" src="http://1.bp.blogspot.com/-EG9VLEdHZGQ/TzKC2XrrjdI/AAAAAAAABMM/05xctqwUpaA/s320/ganglia1.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So we've now set up Ganglia at GBIF and the best news is it's &lt;a href="http://dev.gbif.org/ganglia/"&gt;public&lt;/a&gt;&amp;nbsp;and using the very latest Ganglia 3.3 which was released only a few days ago in February 2012. The setup was relatively painless. Ganglia was just nice to work with. To get monitoring of HBase working we had to apply &lt;a href="https://issues.apache.org/jira/browse/HBASE-4854"&gt;HBASE-4854&lt;/a&gt;&amp;nbsp;because it's not included in our Hadoop distribution (CDH3u2). Thanks to &lt;a href="https://plus.google.com/102889878040727939162/about"&gt;Lars George&lt;/a&gt; for the hint.&lt;br /&gt;
&lt;br /&gt;
So we can happily report that Ganglia 3.3 works perfectly with the GangliaContext31 from Hadoop.&lt;br /&gt;
&lt;br /&gt;
Now all that's left is learning what most of these stats mean and then trying to extract useful information from them. Any hints are more than welcome and the HBase community already offered to help. Thank you very much!&lt;br /&gt;
&lt;br /&gt;
For anyone interested in a few notes and details about how we set Ganglia up keep on reading.&lt;br /&gt;
&lt;br /&gt;
&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;
First we had to configure &lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;iptables&lt;/span&gt;. This accepts all multicast packets for UDP and IGMP (it took me a while to figure out &lt;i&gt;this&lt;/i&gt; was missing...):&lt;br /&gt;
&lt;pre class="brush:shell"&gt;iptables -A INPUT -p igmp -d 224.0.0.0/4 -j ACCEPT
iptables -A INPUT -p udp -d 224.0.0.0/4 -j ACCEPT&lt;/pre&gt;
&lt;br /&gt;
We've split our 18 machines into two Hadoop clusters, so we wanted them kept separate in Ganglia as well. The only way I could get this to run was to have the two clusters join different multicast addresses (&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;mcast_join&lt;/span&gt; in &lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;gmond.conf&lt;/span&gt;). We're doing this using &lt;a href="http://code.google.com/p/gbif-common-resources/source/browse/cluster-puppet/modules/ganglia/templates/gmond.conf"&gt;Puppet&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
In gmetad.conf we then include one machine from each cluster:&lt;br /&gt;
&lt;pre class="brush:shell"&gt;data_source "hadoop-1" c1n1.gbif.org
data_source "hadoop-2" c1n2.gbif.org
&lt;/pre&gt;
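&lt;br /&gt;
The data_source lines follow a simple pattern, so they can be generated from a mapping of cluster name to one member host. Here is a minimal Python sketch; the helper name gmetad_data_sources is hypothetical and is not taken from our Puppet modules.&lt;br /&gt;

```python
def gmetad_data_sources(clusters):
    """Build gmetad.conf data_source lines.

    `clusters` maps a Ganglia cluster name to one host from that
    cluster's multicast ring, which gmetad then polls over TCP.
    """
    return [
        'data_source "{}" {}'.format(name, host)
        for name, host in clusters.items()
    ]


lines = gmetad_data_sources({
    "hadoop-1": "c1n1.gbif.org",
    "hadoop-2": "c1n2.gbif.org",
})

if __name__ == "__main__":
    print("\n".join(lines))
```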
&lt;br /&gt;
In multicast mode all machines that are listening on the same multicast address know about all state from all machines in that "ring". So what gmetad has to do is to connect to one of the machines from each ring via TCP (not multicast, that's what the &lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;tcp_accept_channel&lt;/span&gt; is for in gmond.conf).&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-7819512969170502121?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/7819512969170502121/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2012/02/monitoring-hadoop-and-hbase.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/7819512969170502121'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/7819512969170502121'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2012/02/monitoring-hadoop-and-hbase.html' title='Monitoring Hadoop and HBase'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-EG9VLEdHZGQ/TzKC2XrrjdI/AAAAAAAABMM/05xctqwUpaA/s72-c/ganglia1.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-2019628916510389833</id><published>2012-01-19T14:08:00.001+01:00</published><updated>2012-01-19T14:33:27.739+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ABCD'/><category scheme='http://www.blogger.com/atom/ns#' term='IPT'/><category scheme='http://www.blogger.com/atom/ns#' term='BioCASe'/><category scheme='http://www.blogger.com/atom/ns#' term='DwC-archive'/><title type='text'>BioCASe now producing DarwinCore 
Archives</title><content type='html'>&lt;i&gt;Guest post from Jörg Holetschek, Botanic Garden and Botanical Museum Berlin-Dahlem.&lt;/i&gt;
&lt;br/&gt;&lt;br/&gt;
The traditional way of sharing occurrence data with GBIF has been web-service-based for years. Data publishers have used one of the existing provider software packages (&lt;a href="http://digir.net/" target="_blank"&gt;DiGIR&lt;/a&gt;, &lt;a href="http://www.biocase.org/products/provider_software" target="_blank"&gt;BioCASe&lt;/a&gt; or &lt;a href="http://wiki.tdwg.org/twiki/bin/view/TAPIR/TapirLink" target="_blank"&gt;TAPIR Link&lt;/a&gt;) to expose their data as a DiGIR-, BioCASe- or TAPIR-compliant web service. Biodiversity networks such as GBIF used harvesters to crawl and index the records published by these services, an approach that works fine for small and medium-sized datasets, but runs into difficulties when record numbers hit the millions: Harvesting can take days and puts a heavy load on both the publisher and the crawler.&lt;br /&gt;
&lt;br /&gt;
To overcome this, GBIF recently introduced DarwinCore Archives for storing all information of a dataset to be published in a single file. GBIF directly ingesting this file eliminates the time-consuming back-and-forth communication between data provider and harvester, speeding up the process and reducing load for both sides. GBIF’s &lt;a href="http://code.google.com/p/gbif-providertoolkit/" target="_blank"&gt;IPT &lt;/a&gt;allows easy creation of such DarwinCore Archives and is a good option for providers that have already used the DarwinCore standard in the past or that want to share rather slim observation data.&lt;br /&gt;
&lt;br /&gt;
However, sixty-two of GBIF’s data publishers are currently using BioCASe. In contrast to DarwinCore, BioCASe and its associated data standard ABCD are targeted mainly at rich data originating from specimens of natural history collections (even though it can be used for any type of occurrence data, including observations). Many of the BioCASe data providers also share their data with special interest networks such as &lt;a href="http://www.geocase.eu/" target="_blank"&gt;GeoCASe&lt;/a&gt;, the &lt;a href="http://www.dnabank-network.org/" target="_blank"&gt;DNA-Bank Network&lt;/a&gt;, or the &lt;a href="http://search.biocase.org/edit" target="_blank"&gt;EDIT Specimen network&lt;/a&gt;, all of them relying on BioCASe web services. Switching to the IPT and the associated DarwinCore standard is not an option for them.&lt;br /&gt;
&lt;br /&gt;
For this reason, we decided to extend the BioCASe Provider Software with a feature to create DarwinCore Archives. This allows providers to continue using the rich ABCD schema (or one of its extensions) for the specific networks they’re connected to while using DarwinCore Archives to share their data with GBIF. In order to combine the richness of ABCD with the efficiency of downloadable archives, we created a hybrid in-between, the so called ABCD Archives, which can be used instead of the BioCASe web service for harvesting purposes (see figure below).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-dQ4TKJ-mx0Q/TxRnnJsLweI/AAAAAAAABMM/7AQvDbqqjjw/s1600/DwC+Archive+Creation.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-dQ4TKJ-mx0Q/TxRnnJsLweI/AAAAAAAABMM/7AQvDbqqjjw/s1600/DwC+Archive+Creation.PNG" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
The first step – creating the ABCD archive – is implemented natively in the Provider Software in Python. The second step – transforming the ABCD archive into one or several DarwinCore archives – is done by Pentaho Data Integration, an open source library also known as &lt;a href="http://kettle.pentaho.com/" target="_blank"&gt;Kettle&lt;/a&gt;. In the current version 3.0, the transformation step is a stand-alone command-line application that can be downloaded separately; ultimately, it will be bundled with the Provider Software and integrated into the user interface.&lt;br /&gt;
&lt;br /&gt;
The latest version of the Provider Software and the DarwinCore Creator can be downloaded from the &lt;a href="http://www.biocase.org/products/provider_software/index.shtml#download" target="_blank"&gt;BioCASe website&lt;/a&gt;. Detailed documentation of the new archiving features can be found in the &lt;a href="http://wiki.bgbm.org/bps/index.php/Archiving" target="_blank"&gt;PyWrapper Wiki&lt;/a&gt;. The wiki also hosts a &lt;a href="http://wiki.bgbm.org/bps/uploads/bps/f/fe/AlgenEngelsSmall_ABCD_2.06.zip"&gt;sample ABCD archive&lt;/a&gt; and a &lt;a href="http://wiki.bgbm.org/bps/uploads/bps/b/b3/Desmidiaceae_Engels.zip"&gt;sample DarwinCore archive&lt;/a&gt; created by BioCASe.&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-2019628916510389833?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/2019628916510389833/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2012/01/biocase-now-producing-darwincore.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/2019628916510389833'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/2019628916510389833'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2012/01/biocase-now-producing-darwincore.html' title='BioCASe now producing DarwinCore Archives'/><author><name>Jörg Holetschek</name><uri>http://www.blogger.com/profile/05073954788894074991</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='21' src='http://2.bp.blogspot.com/-glVU0Ii1JtE/TwwvXK1v3cI/AAAAAAAABLY/CmaaWeX5fQw/s220/tasmanien1.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-dQ4TKJ-mx0Q/TxRnnJsLweI/AAAAAAAABMM/7AQvDbqqjjw/s72-c/DwC+Archive+Creation.PNG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-5004690468990811172</id><published>2011-12-08T16:47:00.001+01:00</published><updated>2011-12-08T17:12:38.443+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='IPT'/><category scheme='http://www.blogger.com/atom/ns#' term='customization'/><category scheme='http://www.blogger.com/atom/ns#' term='Canadensys'/><category 
scheme='http://www.blogger.com/atom/ns#' term='CSS'/><title type='text'>Updating a customized IPT</title><content type='html'>&lt;p&gt;&lt;em&gt;This post originally appeared on the &lt;a href="http://www.canadensys.net/2011/updating-a-customized-ipt"&gt;Canadensys blog&lt;/a&gt; and is a follow-up of the post &lt;a href="http://gbif.blogspot.com/2011/07/customizing-ipt.html"&gt;Customizing the IPT&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;&lt;p&gt;As mentioned at the very end of my post about &lt;a href="http://gbif.blogspot.com/2011/07/customizing-ipt.html"&gt;customizing the IPT&lt;/a&gt;, I face a problem when I want to install a new version of the &lt;a href="http://code.google.com/p/gbif-providertoolkit/"&gt;GBIF Integrated Publishing Toolkit&lt;/a&gt;: installing it will overwrite all my customized files! Luckily Tim Robertson gave me a hint on how to solve this: a shell script to reapply my customization.&lt;/p&gt;&lt;p&gt;Here's how it works (for Mac and Linux systems only):&lt;/p&gt;&lt;h4&gt;Comparing the customized files with the default files&lt;/h4&gt;&lt;p&gt;First of all, I need to compare my customized files with the files from the new IPT. They might have changed to include new functionalities or fix bugs. So, I installed the newest version of IPT on my &lt;em&gt;localhost&lt;/em&gt;, opened the default files and compared them with my files. Although there are tools to compare files, I mostly did this manually. The biggest change in version 2.0.3 was the addition of localization, for which I'm using a different UI, so I had to tweak some things here and there. It took me about 3 hours until I was satisfied with the new customized IPT version on my &lt;em&gt;localhost&lt;/em&gt;.&lt;/p&gt;&lt;p&gt;I also subscribed to the RSS of the &lt;a href="http://code.google.com/p/gbif-providertoolkit/"&gt;IPT Google Code website&lt;/a&gt;, to be notified of any changes in the code of "my" files, but I was just using this as a heads-up for coming changes. 
It is more efficient to change everything at once, when a stable version of IPT is out.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://code.google.com/feeds/p/gbif-providertoolkit/svnchanges/basic?path=/trunk/gbif-ipt/src/main/webapp/WEB-INF/pages/inc"&gt;RSS subscription&lt;/a&gt; for any changes in &lt;a href="http://code.google.com/p/gbif-providertoolkit/source/browse/#svn%2Ftrunk%2Fgbif-ipt%2Fsrc%2Fmain%2Fwebapp%2FWEB-INF%2Fpages%2Finc"&gt;/webapp/WEB-INF/pages/inc&lt;/a&gt;, which contains most of my customized files&lt;/li&gt;
&lt;li&gt;&lt;a href="http://code.google.com/feeds/p/gbif-providertoolkit/svnchanges/basic?path=/trunk/gbif-ipt/src/main/webapp/styles/main.css"&gt;RSS subscription&lt;/a&gt; for any changes in &lt;a href="http://code.google.com/p/gbif-providertoolkit/source/browse/trunk/gbif-ipt/src/main/webapp/styles/main.css"&gt;/webapp/styles/main.css&lt;/a&gt;, where I'm commenting out a lot of stuff so my CSS can kick in.&lt;/li&gt;
&lt;/ul&gt;&lt;h4&gt;Setting up a file structure&lt;/h4&gt;&lt;p&gt;This is how we've organized the files on our server. I've created a folder called &lt;em&gt;ipt-customization&lt;/em&gt;, which contains all my customized files. That way, they can never be overwritten by a new IPT installation, which gets deployed in &lt;em&gt;webapps&lt;/em&gt;. The folder also contains the scripts to apply and revert the customization and a folder to back up the default files currently used by IPT.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;ipt-data&lt;/li&gt;
&lt;li&gt;webapps&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;ipt&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;ipt-customization&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;backup-default&lt;/li&gt;
&lt;li&gt;apply-customization.sh&lt;/li&gt;
&lt;li&gt;revert-customization.sh&lt;/li&gt;
&lt;li&gt;header.ftl&lt;/li&gt;
&lt;li&gt;header_setup.ftl&lt;/li&gt;
&lt;li&gt;menu.ftl&lt;/li&gt;
&lt;li&gt;footer.ftl&lt;/li&gt;
&lt;li&gt;main.css&lt;/li&gt;
&lt;li&gt;custom.js&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;h4&gt;Creating the shell script&lt;/h4&gt;&lt;p&gt;The &lt;strong&gt;&lt;em&gt;apply-customization.sh&lt;/em&gt;&lt;/strong&gt; script works in two steps:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Back up the default files by copying them from IPT to the folder &lt;em&gt;backup-default&lt;/em&gt;. Because the copy uses &lt;em&gt;cp -i&lt;/em&gt;, the script asks before overwriting any previously backed up files. This matters if I run the script several times: in that case I do not want to overwrite the backups with the already customized files.&lt;/li&gt;
&lt;li&gt;Overwrite the files currently used by IPT with the customized files, by copying them from my &lt;em&gt;ipt-customization&lt;/em&gt; folder to the correct folder in IPT&lt;/li&gt;
&lt;/ol&gt;&lt;code&gt;# backup files of new IPT installation&lt;br /&gt;
cp -i ../webapps/ipt/WEB-INF/pages/inc/footer.ftl ../ipt-customization/backup-default/&lt;br /&gt;
cp -i ../webapps/ipt/WEB-INF/pages/inc/header_setup.ftl ../ipt-customization/backup-default/&lt;br /&gt;
cp -i ../webapps/ipt/WEB-INF/pages/inc/header.ftl ../ipt-customization/backup-default/&lt;br /&gt;
cp -i ../webapps/ipt/WEB-INF/pages/inc/menu.ftl ../ipt-customization/backup-default/&lt;br /&gt;
cp -i ../webapps/ipt/styles/main.css ../ipt-customization/backup-default/&lt;br /&gt;
&lt;br /&gt;
# apply customization&lt;br /&gt;
cp footer.ftl ../webapps/ipt/WEB-INF/pages/inc/&lt;br /&gt;
cp header_setup.ftl ../webapps/ipt/WEB-INF/pages/inc/&lt;br /&gt;
cp header.ftl ../webapps/ipt/WEB-INF/pages/inc/&lt;br /&gt;
cp menu.ftl ../webapps/ipt/WEB-INF/pages/inc/&lt;br /&gt;
cp main.css ../webapps/ipt/styles/&lt;br /&gt;
cp custom.js ../webapps/ipt/js/&lt;/code&gt;&lt;br /&gt;
&lt;p&gt;I also created a script, &lt;strong&gt;&lt;em&gt;revert-customization.sh&lt;/em&gt;&lt;/strong&gt;, to revert the customization to the default IPT in case something is broken. It copies the backed up files back to IPT and removes &lt;em&gt;custom.js&lt;/em&gt;:&lt;/p&gt;&lt;code&gt;# revert customization&lt;br /&gt;
cp backup-default/footer.ftl ../webapps/ipt/WEB-INF/pages/inc/&lt;br /&gt;
cp backup-default/header_setup.ftl ../webapps/ipt/WEB-INF/pages/inc/&lt;br /&gt;
cp backup-default/header.ftl ../webapps/ipt/WEB-INF/pages/inc/&lt;br /&gt;
cp backup-default/menu.ftl ../webapps/ipt/WEB-INF/pages/inc/&lt;br /&gt;
cp backup-default/main.css ../webapps/ipt/styles/&lt;br /&gt;
rm ../webapps/ipt/js/custom.js&lt;/code&gt;&lt;br /&gt;
&lt;h4&gt;Running the script&lt;/h4&gt;&lt;p&gt;From the command line, I log in to my server, navigate to the folder &lt;em&gt;ipt-customization&lt;/em&gt; and make my script executable:&lt;/p&gt;&lt;code&gt;chmod +x apply-customization.sh&lt;/code&gt;&lt;br /&gt;
&lt;p&gt;I only have to do this the first time I want to use my script. From then on I can use:&lt;/p&gt;&lt;code&gt;sh ./apply-customization.sh&lt;/code&gt;&lt;br /&gt;
&lt;p&gt;To execute the script and customize my &lt;a href="http://data.canadensys.net/ipt/"&gt;new version of IPT&lt;/a&gt;!&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-5004690468990811172?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/5004690468990811172/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/12/updating-customized-ipt.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/5004690468990811172'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/5004690468990811172'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/12/updating-customized-ipt.html' title='Updating a customized IPT'/><author><name>Peter Desmet</name><uri>http://www.blogger.com/profile/18072937114922733628</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/-4IKyMNyiCSo/TihR4QTSlgI/AAAAAAAAFqs/ja8HqGmmb_U/s220/peter_300.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-7345480087409665564</id><published>2011-12-05T14:28:00.000+01:00</published><updated>2011-12-05T14:28:30.605+01:00</updated><title type='text'>Bug fixing in the GBIF Data Portal</title><content type='html'>Despite our current efforts to develop a &lt;a href="http://gbif.blogspot.com/2011/09/portal-v2-there-will-be-cake.html" target="_blank"&gt;new Portal v2&lt;/a&gt;,&amp;nbsp;our current data portal at &lt;a href="http://data.gbif.org/" target="_blank"&gt;data.gbif.org&lt;/a&gt;&amp;nbsp;has not been left unattended. 
Bug fixes are made periodically in response to feedback sent&amp;nbsp;to us by our user community. To keep our community informed, this post summarizes the most important fixes and enhancements made in recent months:&lt;br /&gt;
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;The data portal's main page now shows the total number of occurrence records with coordinates, along with the total count of records (non-georeferenced and georeferenced).&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;Decimal coordinate searches were not working properly. When a user refined a coordinate search with decimal values, the data portal returned an erroneous count of occurrence records. The issue has been fixed; details are &lt;a href="http://code.google.com/p/gbif-dataportal/issues/detail?id=103" target="_blank"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;Any feedback e-mail message sent from an &lt;a href="http://data.gbif.org/occurrences/320620918/" target="_blank"&gt;occurrence&lt;/a&gt; or a &lt;a href="http://data.gbif.org/species/2435099" target="_blank"&gt;taxon page&lt;/a&gt; now includes the original sender's email address in the CC field. Previously the&amp;nbsp;sender's email address was not included in the feedback email, which represented a problem when the receiver replied to the email, but the sender&amp;nbsp;never knew about the reply.&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href="http://data.gbif.org/ws/rest/taxon" target="_blank"&gt;Taxon Web Service's&lt;/a&gt;&amp;nbsp;GET operation was returning errors when trying to request some specific taxons.&amp;nbsp;The problem was detected and fixed.&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;On an &lt;a href="http://data.gbif.org/occurrences/352549467/" target="_blank"&gt;occurrence detail page&lt;/a&gt;, when retrieving the original record from a data publisher,&amp;nbsp;and the source data came from a &lt;a href="http://www.gbif.org/informatics/standards-and-tools/publishing-data/data-standards/darwin-core-archives/" target="_blank"&gt;Darwin Core Archive&lt;/a&gt;,&amp;nbsp;it was not possible to retrieve a single record due to the single-file nature of a DwC Archive. (As opposed to a DiGIR request, in which&amp;nbsp;you could extract just a &lt;a href="http://data.gbif.org/occurrences/352549467/rawProviderMessage/" target="_blank"&gt;single record&lt;/a&gt;). A fix was introduced so that the user can decide if he/she wants to download the complete archive (&lt;a href="http://data.gbif.org/occurrences/411128350/rawProviderMessage/" target="_blank"&gt;see an example&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;When using the data portal's&amp;nbsp;&lt;a href="http://data.gbif.org/ws/" target="_blank"&gt;Web Services&lt;/a&gt;&amp;nbsp;to produce KML output, Google Earth had problems opening the file&amp;nbsp;for visualization when the generated KML contained HTML elements (a common problem when HTML is embedded in XML). A &lt;a href="http://code.google.com/apis/kml/documentation/kml_tut.html#descriptive_html" target="_blank"&gt;small fix&lt;/a&gt;&amp;nbsp;was introduced&amp;nbsp;to escape the conflicting HTML inside the KML output.&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;Other small GUI enhancements were also made.&lt;/li&gt;
&lt;/ul&gt;
&lt;br /&gt;
&lt;a href="http://code.google.com/p/gbif-dataportal/updates/list" target="_blank"&gt;Updates to the data portal's codebase is now done seldomly&lt;/a&gt;, but our goal is to fix any major issues that our user community reports. If you ever encounter problems, please don't hesitate to contact us at &lt;a href="mailto:portal@gbif.org"&gt;portal@gbif.org&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-7345480087409665564?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/7345480087409665564/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/12/bug-fixing-in-gbif-data-portal.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/7345480087409665564'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/7345480087409665564'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/12/bug-fixing-in-gbif-data-portal.html' title='Bug fixing in the GBIF Data Portal'/><author><name>Jose Cuadra</name><uri>http://www.blogger.com/profile/00591450269169657407</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-6362661361722544998</id><published>2011-11-09T09:44:00.000+01:00</published><updated>2011-11-09T09:44:10.141+01:00</updated><title type='text'>Important Quality Boost for GBIF Data Portal</title><content type='html'>&lt;br /&gt;
&lt;h3 style="font-family: Arial, Helvetica, sans-serif; font-size: 12px;"&gt;
Improvements speed processing, “clean” name and location data, enable checklist publishing.&lt;/h3&gt;
&lt;div class="news-single-date" style="font-family: Arial, Helvetica, sans-serif; margin-bottom: 10px;"&gt;
&lt;span class="Apple-style-span" style="font-size: 12px; line-height: 18px;"&gt;&lt;i&gt;[This is a reposting from the &lt;a href="http://www.gbif.org/communications/news-and-events/showsingle/article/important-quality-boost-for-gbif-data-portal/"&gt;GBIF news site&lt;/a&gt;]&lt;/i&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class="news-single-date" style="font-family: Arial, Helvetica, sans-serif; font-size: 10px; margin-bottom: 10px;"&gt;
&lt;span class="Apple-style-span" style="font-size: 12px; line-height: 18px;"&gt;A major upgrade to enhance the quality and usability of data accessible through the GBIF Data Portal has gone live.&lt;/span&gt;&lt;/div&gt;
&lt;div style="font-family: Arial, Helvetica, sans-serif; font-size: 12px; line-height: 1.5; margin-bottom: 20px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;
&lt;br /&gt;The enhancements are the result of a year’s work by developers at the Copenhagen-based GBIF Secretariat, in collaboration with colleagues throughout the worldwide network.&lt;br /&gt;&lt;br /&gt;They respond to a range of issues including the need for quicker ‘turnaround’ time between entering new data and their appearance on the portal; filtering out inaccurate or incorrect locations and names for species occurrences; and enabling species checklists to be indexed as datasets accessible through the portal.&lt;br /&gt;&lt;br /&gt;After a testing period, the changes now apply to the more than 312 million biodiversity data records currently indexed from some 8,500 datasets and 340 publishers worldwide.&lt;/div&gt;
&lt;div style="font-family: Arial, Helvetica, sans-serif; font-size: 12px; line-height: 1.5; margin-bottom: 20px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;
Key improvements include:&lt;br /&gt;&lt;br /&gt;•&amp;nbsp;&amp;nbsp;&amp;nbsp; processing time for data has fallen from 3-4 days to around 36 hours, paving the way for more frequent ‘rollovers’ or index updates;&lt;br /&gt;•&amp;nbsp;&amp;nbsp;&amp;nbsp; the ‘backbone taxonomy’ used by the GBIF Portal has been reworked with up-to-date checklists and taxonomic catalogues such as the&amp;nbsp;&lt;a class="external-link-new-window" href="http://www.catalogueoflife.org/" style="color: navy; text-decoration: none;" target="_blank" title="Opens external link in new window"&gt;Catalogue of Life 2011&lt;/a&gt;, improving search and download;&lt;br /&gt;•&amp;nbsp;&amp;nbsp;&amp;nbsp; checklists describing species in particular geographic locations, taxonomic groups or thematic categories (eg. invasives) can now be published using a standard set of terms called the Global Names Architecture (GNA) Profile (&lt;a class="external-link-new-window" href="http://www.gbif.org/communications/news-and-events/showsingle/article/important-quality-boost-for-gbif-data-portal/orc/?doc_id=2869&amp;amp;l=en" style="color: navy; text-decoration: none;" target="_blank" title="Opens external link in new window"&gt;see GNA guidelines&lt;/a&gt;) and thus become indexed and accessible via the Data Portal;&lt;br /&gt;•&amp;nbsp;&amp;nbsp;&amp;nbsp; automated interpretation of the coordinates, country location and scientific names used in published records has been improved to screen out inaccuracies – for example, ensuring that records identified as coming from a particular country are shown as occurring within the borders and territorial waters of that country; and&lt;br /&gt;•&amp;nbsp;&amp;nbsp;&amp;nbsp; a mechanism using the&amp;nbsp;&lt;a class="external-link-new-window" href="http://hadoop.apache.org/" style="color: navy; text-decoration: none;" target="_blank" title="Opens external link in new window"&gt;Hadoop open-source software system&lt;/a&gt;&amp;nbsp;has been introduced to ensure 
that the Data Portal is able to cope with anticipated future growth in the volume of data.&lt;/div&gt;
&lt;div style="font-family: Arial, Helvetica, sans-serif; font-size: 12px; line-height: 1.5; margin-bottom: 20px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;
The algorithms and dictionaries developed to improve interpretation of data published through the GBIF Data Portal are intended for future re-use by the wider biodiversity informatics community.&lt;br /&gt;&lt;br /&gt;Commenting on the release of these substantive Data Portal improvements, GBIF Executive Secretary Nicholas King said: “These changes represent a major step forward in the usefulness of GBIF to science and society.&lt;br /&gt;&lt;br /&gt;“They are a direct response to the feedback we have had from the data publishing and user communities, and will enable an even greater return on the long-term investment made over the past decade by GBIF Participant countries.”&lt;/div&gt;
&lt;div style="font-family: Arial, Helvetica, sans-serif; font-size: 12px; line-height: 1.5; margin-bottom: 20px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;
&lt;b&gt;IPT&amp;nbsp;v.2.0.3 launched&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The GBIF Secretariat has also issued a new release of the Integrated Publishing Toolkit (IPT), which enables biodiversity data updates to be ‘harvested’ automatically from databases published to the Internet.&lt;br /&gt;&lt;br /&gt;IPT version 2.0.3 addressed 76 reported issues from the previous version, and includes translations into French and Spanish.&lt;br /&gt;&lt;br /&gt;Instructions on installing the new version are available&amp;nbsp;&lt;a class="external-link-new-window" href="http://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes" style="color: navy; text-decoration: none;" target="_blank" title="Opens external link in new window"&gt;here&lt;/a&gt;.&amp;nbsp;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-6362661361722544998?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/6362661361722544998/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/11/important-quality-boost-for-gbif-data.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/6362661361722544998'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/6362661361722544998'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/11/important-quality-boost-for-gbif-data.html' title='Important Quality Boost for GBIF Data Portal'/><author><name>Tim Robertson</name><uri>http://www.blogger.com/profile/07889700598656669041</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://1.bp.blogspot.com/_5wZ2Fic5QtA/STxJva6Zq2I/AAAAAAAAABc/g2I1cG9fztw/S220/tim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-895468098142974368</id><published>2011-10-21T11:52:00.000+02:00</published><updated>2011-10-21T11:52:02.603+02:00</updated><title type='text'>Integration tests with DBUnit</title><content type='html'>&lt;span class="Apple-style-span" style="font-size: large;"&gt;Database driven JUnit tests&lt;/span&gt;
&lt;br /&gt;
As part of our migration to a solid, general testing framework we are now using &lt;a href="http://www.dbunit.org/"&gt;DbUnit&lt;/a&gt; for integration tests of our database service layer with JUnit (on top of Liquibase for the DDL).
&lt;br /&gt;
&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-size: large;"&gt;Creating a DbUnit test file&lt;/span&gt;
&lt;br /&gt;
As it can be painful to maintain a relational test dataset with many tables, I decided to dump a small, existing Postgres database into the DbUnit XML structure, namely &lt;a href="http://www.dbunit.org/apidocs/org/dbunit/dataset/xml/FlatXmlDataSet.html"&gt;FlatXML&lt;/a&gt;.
It turned out to be less simple than I had hoped. 
&lt;br /&gt;
&lt;br /&gt;
First I created a simple &lt;a href="http://code.google.com/p/gbif-ecat/source/browse/trunk/checklistbank-mybatis-service/src/test/java/org/gbif/checklistbank/dbunit/TestFileGenerator.java"&gt;exporter script&lt;/a&gt; in Java that dumps the entire DB into XML. Simple.
&lt;br /&gt;
&lt;br /&gt;
The first problem I stumbled across was a column named "order", which caused a SQL exception because it is a reserved word and was not quoted. It turns out DbUnit needs to be configured for specific databases, so I ended up using three configuration settings to both dump and read the files. 
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Use Postgres specific types&lt;/li&gt;
&lt;li&gt;Double quote column and table names&lt;/li&gt;
&lt;li&gt;Enable case sensitive table &amp;amp; column names (now that we use quoted names, Postgres becomes case sensitive)&lt;/li&gt;
&lt;/ol&gt;
After that, reading in the DbUnit test file started out fine, but then hit a weird &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;NullPointerException&lt;/span&gt; that left me puzzled. After trying various settings I finally found a log warning that some columns might not be detected properly by DbUnit: by default it only inspects the first XML record of each table, and any column that is null there (and therefore missing from the XML) is subsequently ignored. Luckily, since version 2.4.7 of DbUnit you can tell the &lt;i&gt;builder&lt;/i&gt; that reads in the test files to scan all records in memory first, a feature known as &lt;i&gt;column sensing&lt;/i&gt;. 
That got me a long way, but ultimately I hit a much harder issue: relational integrity.
&lt;br /&gt;
&lt;br /&gt;
The classic way to avoid integrity checks during inserts (including DbUnit) is simply to temporarily disable all foreign key constraints. On some databases this is simple. For example in MySQL you can simply execute &lt;i&gt;SET FOREIGN_KEY_CHECKS=0&lt;/i&gt; in your db connection. In H2 there is an equivalent of &lt;i&gt;SET REFERENTIAL_INTEGRITY FALSE&lt;/i&gt;. Unfortunately there is nothing like that in PostgreSQL. You will have to disable all constraints individually and then painfully recreate them. In our case these were nearly a hundred constraints and I didn't want to go down that route.
&lt;br /&gt;
&lt;br /&gt;
The latest DbUnit comes with a nice &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;DatabaseSequenceFilter&lt;/span&gt; that automatically sorts the tables being dumped in an order that respects the constraints. That worked very well for all constraints across tables, but of course it cannot sort the individual records within a table that contains a self reference, for example the taxonomy table, which has an adjacency list via &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;parent_fk&lt;/span&gt;. Luckily I had only one table like this, and it already included &lt;a href="http://de.wikipedia.org/wiki/Nested_Sets"&gt;nested set&lt;/a&gt; indices (lft, rgt) that allowed me to sort the records in a parent-first order. This required a custom SQL query though, so I ended up dumping the entire database with all tables using the filter, separately exporting that one table with the custom SQL, and then manually copying the result into the complete XML dump file. Voilà, finally a &lt;a href="http://code.google.com/p/gbif-ecat/source/browse/trunk/checklistbank-mybatis-service/src/test/resources/dbunit/squirrels-full.xml"&gt;working DbUnit test file&lt;/a&gt;!
&lt;br /&gt;
&lt;br /&gt;
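To illustrate why the nested set indices make a parent-first order possible, here is a minimal sketch. It is not the custom SQL used for the dump; the class ParentFirstSketch, the Row type and its field names are made up for this example. It only relies on the nested set property that a parent's lft value is smaller than the lft of every descendant, so sorting by lft (the equivalent of an ORDER BY on that column) lists each parent before its children.
&lt;br /&gt;
&lt;pre&gt;
```java
import java.util.Arrays;

// Hypothetical sketch: Row and its fields are illustrative names, not the real schema.
final class ParentFirstSketch {

  // One record of a self-referencing table with nested set bounds.
  static final class Row {
    final int id;
    final int parentId; // 0 marks the root in this sketch
    final int lft;
    final int rgt;

    Row(int id, int parentId, int lft, int rgt) {
      this.id = id;
      this.parentId = parentId;
      this.lft = lft;
      this.rgt = rgt;
    }
  }

  // A parent's lft value is smaller than the lft of each of its descendants,
  // so ascending lft order always lists a parent before its children.
  static int[] parentFirstIds(Row[] rows) {
    Row[] sorted = rows.clone();
    Arrays.sort(sorted, (a, b) -> Integer.compare(a.lft, b.lft));
    return Arrays.stream(sorted).mapToInt(r -> r.id).toArray();
  }

  public static void main(String[] args) {
    Row[] rows = {
      new Row(3, 2, 3, 4),
      new Row(1, 0, 1, 8),
      new Row(4, 1, 6, 7),
      new Row(2, 1, 2, 5)
    };
    // prints [1, 2, 3, 4]
    System.out.println(Arrays.toString(parentFirstIds(rows)));
  }
}
```
&lt;/pre&gt;
Inserting the records in this order means every parent_fk already points at an existing row, so the self reference never violates the foreign key.
&lt;br /&gt;
&lt;br /&gt;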
&lt;span class="Apple-style-span" style="font-size: large;"&gt;DatabaseDrivenTest for MyBatis Services&lt;/span&gt;
&lt;br /&gt;
In order to load the test data into a test DB for every JUnit test we decided to use a JUnit Rule definition that is executed before each test. 
The class responsible for most of the magic is &lt;a href="http://code.google.com/p/gbif-ecat/source/browse/trunk/checklistbank-mybatis-service/src/test/java/org/gbif/checklistbank/service/mybatis/DatabaseDrivenTest.java"&gt;DatabaseDrivenTest&lt;/a&gt; which is parameterized for the specific MyBatis Service to be tested. It is generic and can be used with any database system. The subclass &lt;a href="http://code.google.com/p/gbif-ecat/source/browse/trunk/checklistbank-mybatis-service/src/test/java/org/gbif/checklistbank/service/mybatis/DatabaseDrivenChecklistBankTest.java"&gt;DatabaseDrivenChecklistBankTest&amp;lt;T&amp;gt;&lt;/a&gt; then adds the database specific configurations and can be used as a Rule within the individual tests.
&lt;br /&gt;
&lt;br /&gt;
A simple &amp;amp; clean &lt;a href="http://code.google.com/p/gbif-ecat/source/browse/trunk/checklistbank-mybatis-service/src/test/java/org/gbif/checklistbank/service/mybatis/ReferenceServiceMyBatisTest.java"&gt;integration test example&lt;/a&gt; now looks like this:
&lt;br /&gt;
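&lt;pre&gt;import static org.junit.Assert.assertEquals;

import org.junit.Rule;
import org.junit.Test;

// Imports of the project's own Reference, ReferenceService and
// DatabaseDrivenChecklistBankTest classes are omitted here.
public class ReferenceServiceMyBatisTest {

  // The Rule loads squirrels-full.xml into the test database before each test
  @Rule
  public DatabaseDrivenChecklistBankTest&amp;lt;ReferenceService&amp;gt; ddt =
    new DatabaseDrivenChecklistBankTest&amp;lt;ReferenceService&amp;gt;(ReferenceService.class, "squirrels-full.xml");

  @Test
  public void testGet() {
    Reference ref = ddt.getService().get(37);
    assertEquals("Wilson, D. E. ; Reeder, D. M. Mammal Species of the World", ref.getCitation());
    assertEquals(100000025, ref.getUsageKey());
  }
}&lt;/pre&gt;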
&lt;br /&gt;
Isn't that gorgeous? We only need to pass the DbUnit test file and the service class to be tested to the JUnit Rule, and then we only need to bother with testing the service results!
No additional setting up or tearing down is needed.&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-895468098142974368?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/895468098142974368/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/10/integration-tests-with-dbunit.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/895468098142974368'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/895468098142974368'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/10/integration-tests-with-dbunit.html' title='Integration tests with DBUnit'/><author><name>Markus Döring</name><uri>https://profiles.google.com/114975314573163797395</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-6270053401575046266</id><published>2011-10-20T16:06:00.000+02:00</published><updated>2011-10-20T16:06:15.270+02:00</updated><title type='text'>GBIF Portal: Geographic interpretations</title><content type='html'>The &lt;a href="http://gbif.blogspot.com/2011/04/reworking-portal-processing.html"&gt;new portal processing&lt;/a&gt; is about to go into production, and during testing I was drawing some metrics on the revised geographic interpretation. &amp;nbsp;It is a simple issue, but many records have coordinates that contradict the country that the record claims to be in. 
&amp;nbsp;Some illustrations of this were previously &lt;a href="http://gbif.blogspot.com/2011/05/here-be-dragons-mapping-occurrence-data.html"&gt;shared by Oliver&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
The challenge here is twofold. &amp;nbsp;Firstly, we see many variations in the &lt;a href="http://rs.tdwg.org/dwc/terms/index.htm#country"&gt;country name&lt;/a&gt;,&amp;nbsp;each of which needs to be interpreted. &amp;nbsp;Some examples for Argentina are given below (there are hundreds of variations per country):&lt;br /&gt;
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;Argent.&lt;/li&gt;
&lt;li&gt;Argentina&lt;/li&gt;
&lt;li&gt;Argentiana&lt;/li&gt;
&lt;li&gt;N Argentina&lt;/li&gt;
&lt;li&gt;N. Argentina&lt;/li&gt;
&lt;li&gt;ARGENTINA&lt;/li&gt;
&lt;li&gt;ARGENTINIA&lt;/li&gt;
&lt;li&gt;ARGENTINNIA&lt;/li&gt;
&lt;li&gt;"ARGENTINIA"&lt;/li&gt;
&lt;li&gt;""ARGENTINIA""&lt;/li&gt;
&lt;li&gt;etc etc&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
We have abstracted the parsing code into a &lt;a href="http://code.google.com/p/gbif-common-resources/source/browse/#svn%2Fgbif-parsers"&gt;separate Java library&lt;/a&gt; which makes use of basic algorithms and dictionary files to help interpret the results. &amp;nbsp;This library might be useful for other tools requiring similar interpretation, or data cleaning efforts, and will be maintained over time as it will be in use in several GBIF tools.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
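&lt;div&gt;
To give a feel for dictionary-based interpretation, here is a minimal sketch. It is not the API of the parser library linked above: the class CountryNameSketch, its clean and interpret methods, the cleaning rules and the tiny in-memory dictionary are all assumptions made for illustration. The idea is to normalize the raw value first (strip quotes, unify case, drop a leading compass qualifier and a trailing period) and then look the result up in a dictionary of known variants.&lt;/div&gt;
&lt;pre&gt;
```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical sketch; this is NOT the API of the GBIF parser library.
final class CountryNameSketch {

  // Illustrative dictionary of cleaned spellings mapped to ISO 3166-1 alpha-2 codes.
  // A real dictionary file would hold hundreds of variants per country.
  private static final Properties DICTIONARY = new Properties();

  static {
    String[] variants = {"ARGENTINA", "ARGENT", "ARGENTIANA", "ARGENTINIA", "ARGENTINNIA"};
    for (String variant : variants) {
      DICTIONARY.setProperty(variant, "AR");
    }
  }

  // Normalizes a raw value so that trivially different spellings collapse together.
  static String clean(String raw) {
    // Remove stray double quotes, surrounding whitespace and case differences.
    String s = raw.replace("\"", "").trim().toUpperCase(Locale.ROOT);
    // Drop a leading compass qualifier such as "N" or "N." (north).
    s = s.replaceFirst("^[NSEW]\\.?\\s+", "");
    // Drop a trailing abbreviation period, as in "Argent."
    s = s.replaceFirst("\\.$", "");
    return s;
  }

  // Returns the country code for a raw value, or null when the value is unknown.
  static String interpret(String raw) {
    if (raw == null) {
      return null;
    }
    return DICTIONARY.getProperty(clean(raw));
  }

  public static void main(String[] args) {
    // prints AR
    System.out.println(interpret("N. Argentina"));
  }
}
```
&lt;/pre&gt;
&lt;div&gt;
Misspellings such as "Argentiana" are not repaired by the cleaning step; they resolve only because they are listed in the dictionary, which is why the dictionary files need to be maintained over time.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;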
&lt;div&gt;
The second challenge is that we need to determine if the point falls within the country. &amp;nbsp;There is always room for improvement in this area, such as understanding changes over time, but due to the &lt;a href="http://gbif.blogspot.com/2011/05/here-be-dragons-mapping-occurrence-data.html"&gt;huge volume of outliers&lt;/a&gt; when using the raw data, a check like this is required. &amp;nbsp;&lt;a href="http://code.google.com/p/gbif-common-resources/source/browse/#svn%2Fgeocode"&gt;Our implementation&lt;/a&gt; is a very basic reverse geocoding RESTful web service that takes a latitude and longitude, and returns the proposed country and some basic information such as the title. &amp;nbsp;Operating the service requires &lt;a href="http://postgis.refractions.net/"&gt;PostGIS&lt;/a&gt;&amp;nbsp;and a Java server like &lt;a href="http://tomcat.apache.org/"&gt;Apache Tomcat&lt;/a&gt;. &amp;nbsp;Currently we make use of freely available terrestrial shapefiles and marine exclusive economic zones. &amp;nbsp;It would be trivial to expand the service to use more shapefiles for other purposes, and this is expected to happen over time. &amp;nbsp;Currently the GBIF service is an internal-only processing service, but it is expected to be released for public use in the coming months.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
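&lt;div&gt;
The core of such a check is a point-in-polygon test. In the service described above that work is done by PostGIS; the sketch below only illustrates the idea, using java.awt.geom.Path2D from the JDK as a stand-in. The class CountryCheckSketch, its methods and the rectangular "country" are made up for this example, and the planar geometry ignores complications such as the antimeridian.&lt;/div&gt;
&lt;pre&gt;
```java
import java.awt.geom.Path2D;

// Hypothetical illustration only; the GBIF service delegates this test to PostGIS.
final class CountryCheckSketch {

  // Builds a closed polygon from an array of {longitude, latitude} pairs.
  static Path2D.Double polygon(double[][] lonLat) {
    Path2D.Double path = new Path2D.Double();
    boolean first = true;
    for (double[] point : lonLat) {
      if (first) {
        path.moveTo(point[0], point[1]);
        first = false;
      } else {
        path.lineTo(point[0], point[1]);
      }
    }
    path.closePath();
    return path;
  }

  // True when the coordinate lies inside the shape (x = longitude, y = latitude).
  static boolean contains(Path2D.Double shape, double latitude, double longitude) {
    return shape.contains(longitude, latitude);
  }

  public static void main(String[] args) {
    // A made-up rectangular country spanning 10W to 10E and 5S to 5N.
    Path2D.Double box = polygon(new double[][] {
      {-10.0, -5.0}, {10.0, -5.0}, {10.0, 5.0}, {-10.0, 5.0}
    });
    // prints true
    System.out.println(contains(box, 0.0, 0.0));
    // prints false
    System.out.println(contains(box, 20.0, 0.0));
  }
}
```
&lt;/pre&gt;
&lt;div&gt;
A record whose coordinates fall outside the shape of the country it claims to be in can then be flagged rather than counted as georeferenced for that country.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;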
&lt;div&gt;
Improving the country name interpretation and using a more accurate geospatial verification service than before will help improve data reporting at the national level using the GBIF portal, as the figures below indicate.&lt;br /&gt;
&lt;br /&gt;
&lt;div style="text-align: center;"&gt;
&lt;table border="1" cellpadding="3"&gt;
 &lt;tbody&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;# Records&lt;/th&gt;
  &lt;th&gt;# Georeferenced&lt;/th&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td rowspan="2" valign="top"&gt;Argentina&lt;/td&gt;
  &lt;td&gt;Previously&lt;/td&gt;
  &lt;td&gt;665,284&lt;/td&gt;
  &lt;td&gt;284,012&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Now&lt;/td&gt;
  &lt;td&gt;680,344&lt;/td&gt;
  &lt;td&gt;303,889&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td rowspan="2" valign="top"&gt;United States&lt;/td&gt;
  &lt;td&gt;Previously&lt;/td&gt;
  &lt;td&gt;79,432,986&lt;/td&gt;
  &lt;td&gt;68,900,415&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Now&lt;/td&gt;
  &lt;td&gt;81,483,086&lt;/td&gt;
  &lt;td&gt;70,588,182&lt;/td&gt;
 &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;div&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-6270053401575046266?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/6270053401575046266/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/10/gbif-portal-geographic-interpretations.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/6270053401575046266'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/6270053401575046266'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/10/gbif-portal-geographic-interpretations.html' title='GBIF Portal: Geographic interpretations'/><author><name>Tim Robertson</name><uri>http://www.blogger.com/profile/07889700598656669041</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://1.bp.blogspot.com/_5wZ2Fic5QtA/STxJva6Zq2I/AAAAAAAAABc/g2I1cG9fztw/S220/tim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-7421557074920909309</id><published>2011-10-07T10:52:00.000+02:00</published><updated>2011-10-07T10:52:48.147+02:00</updated><title type='text'>Group synergy</title><content type='html'>During the last few weeks we have been intensively designing and implementing what would come to be the new data portal. Oliver described nicely the new stage our team has entered in his last blog post &lt;a href="http://gbif.blogspot.com/2011/09/portal-v2-there-will-be-cake.html"&gt;Portal v2 - There will be cake&lt;/a&gt;. 
In my opinion, this has truly been a group experience, as we have decided to change the way we work. Normally each of us would have worked on a different component and we would have tried to integrate everything later, but this time we are all focusing on one subcomponent and driving our efforts into it. From my point of view, the main advantage is that we reduce the&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Bus_factor"&gt;Bus Factor&lt;/a&gt;&amp;nbsp;risk that we, as a small group of developers, are quite exposed to. Communication within the team has increased as we are all on the same page now.&lt;br /&gt;
&lt;br /&gt;
As a general overview, portal v2 will consist of different subcomponents (or sub-projects) that need to interact with each other to produce a consolidated "view" for the user. Currently we have 3 sub-projects on our plate: &lt;a href="http://code.google.com/p/gbif-ecat/"&gt;Checklist Bank&lt;/a&gt;, &lt;a href="http://code.google.com/p/gbif-registry/"&gt;Registry&lt;/a&gt;, and &lt;a href="http://code.google.com/p/gbif-occurrencestore"&gt;Occurrence Store&lt;/a&gt;.&amp;nbsp;Our plan is to have an API (exposed through web services) that offers all the data from these projects that the portal needs to consume. The portal will then use a simple web service client to communicate with this API.&lt;br /&gt;
&lt;br /&gt;
Currently we are working on the &lt;a href="http://code.google.com/p/gbif-ecat/"&gt;Checklist Bank&lt;/a&gt; sub-project. As Oliver pointed out in his previous post, some members of our team are more familiar with certain sub-projects than others, and the checklist one is no exception. So for many of us, including me, it has been a learning experience. We have started development following very strict guidelines on API design and code conventions (which we document internally for our own use). Even decisions that a single developer would sometimes take in seconds are placed under group scrutiny so that we all stay on the same track. We have committed ourselves to applying the best coding practices.&lt;br /&gt;
&lt;br /&gt;
Specifically on the checklist sub-project, we have come up with a preliminary API. &amp;nbsp;&lt;b&gt;Please note this API won't be exposed to the public as it is. It is subject to change as we try to refine it.&amp;nbsp;&lt;/b&gt;It is just nice to show to the outside world what we have been working on.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-9nrxjcRHKRU/To6tgt0npCI/AAAAAAAAI4Q/1OCH_kF9WSk/s1600/clbapi.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="604" src="http://4.bp.blogspot.com/-9nrxjcRHKRU/To6tgt0npCI/AAAAAAAAI4Q/1OCH_kF9WSk/s640/clbapi.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
I personally think these are exciting times inside GBIF and that the final product of all this effort will be a great tool that benefits the community in big ways. Expect more from us!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-7421557074920909309?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/7421557074920909309/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/10/group-synergy.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/7421557074920909309'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/7421557074920909309'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/10/group-synergy.html' title='Group synergy'/><author><name>Jose Cuadra</name><uri>http://www.blogger.com/profile/00591450269169657407</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-9nrxjcRHKRU/To6tgt0npCI/AAAAAAAAI4Q/1OCH_kF9WSk/s72-c/clbapi.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-6030355909359886528</id><published>2011-09-27T10:59:00.000+02:00</published><updated>2011-09-27T11:00:00.137+02:00</updated><title type='text'>Portal v2 - There will be cake</title><content type='html'>The current &lt;a href="http://data.gbif.org/"&gt;GBIF data portal&lt;/a&gt; was started in 2007 to provide access to the network's biodiversity data - at the time that meant a &lt;a href="http://en.wikipedia.org/wiki/Federated_search"&gt;federated search&lt;/a&gt; across 220 providers and 76 million 
occurrence records.  While that approach has served us well over the years, there are many features that have been requested for the portal that weren't addressable in the current architecture.  Combined with the fact that we're now well over 300 million occurrence records, with millions of new taxonomic records to boot, it becomes clear that a new portal is needed.  After a long consultation process with the wider community the initial requirements of a new portal have been determined, and I'm pleased to report that work has officially started on its design and development.

&lt;br /&gt;
&lt;br /&gt;
For the last 6 months or so the development team has been working on improving our rollover process, registry improvements, IPT development, and disparate other tasks.  The new portal marks an important milestone in our team development as we're now all working on the portal, with as little distraction from other projects as we can manage.  Obviously we're still fixing critical bugs and responding to data requests, etc, but all of us focusing on the same general task has already shown dividends in the conversations coming out of our daily scrums.  Everyone being on the same page really does help.
&lt;br /&gt;
&lt;br /&gt;
And yes, we've been using daily stand-up meetings that we call "scrums" for several months, but the new portal marks the start of our first proper attempt at &lt;a href="http://en.wikipedia.org/wiki/Agile_software_development"&gt;agile software development&lt;/a&gt;, including the proper use of &lt;a href="http://en.wikipedia.org/wiki/Scrum_(development)"&gt;scrum&lt;/a&gt;.  Most of our team has had some experience with parts of agile techniques, so we're combining the best practices that everyone has had to make the best system for us.  Obviously the ideal of interchangeable people with no single expert in a given domain is rather hard for us when Tim, Markus, Kyle and Jose have worked on these things for so long and people like Lars, Federico and I are still relatively new (even though we're celebrating our one year anniversaries at GBIF in the next weeks!), but we're trying hard to have non-experts working with experts to share the knowledge.
&lt;br /&gt;
&lt;br /&gt;
In terms of managing the process, I (Oliver) am acting as Scrum Master and project lead.  Andrea Hahn has worked hard at gathering our initial requirements, turning them into stories, and leading the wireframing of the new portal.  As such she'll be acting as a Stakeholder to the project and help us set priorities.  As the underlying infrastructure gets built and the process continues I'm sure we'll be involving more people in the prioritization process, but for now our plates are certainly full with "plumbing". At Tim's suggestion we're using &lt;a href="http://basecamphq.com/"&gt;Basecamp&lt;/a&gt; to manage our backlog, active stories, and sprints, following the example from &lt;a href="http://highnotes.posterous.com/how-to-do-scrumxp-with-basecamp"&gt;these guys&lt;/a&gt;.  Our first kickoff revealed some weaknesses in mapping Basecamp to agile, and the lack of a physical storyboard makes it hard to see the big picture, but we'll start with this and re-evaluate in a little while - certainly it's more important to get the process started and determine our actual needs rather than playing with different tools in some kind of abstract evaluation process.  Once we've ironed out the process and settled on our tools we'll also make them more visible to the outside world.
&lt;br /&gt;
&lt;br /&gt;
We're only now coming up on the end of our first two-week sprint, so it will take a few more iterations to really get into the flow, but so far so good, and I'll report back on our experience in a future post.
&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;(If you didn't get it, apologies for the &lt;a href="http://en.wikipedia.org/wiki/Portal_(video_game)"&gt;cake reference&lt;/a&gt;)
&lt;/span&gt;&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-6030355909359886528?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/6030355909359886528/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/09/portal-v2-there-will-be-cake.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/6030355909359886528'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/6030355909359886528'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/09/portal-v2-there-will-be-cake.html' title='Portal v2 - There will be cake'/><author><name>Oliver Meyn</name><uri>http://www.blogger.com/profile/04706642473308341930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/-6_sfRK7-oU0/TaWanB_bQ3I/AAAAAAAAAAM/RcPMC0E9E3k/s220/oliver_amory_profile.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-919861230614743940</id><published>2011-09-15T09:08:00.000+02:00</published><updated>2011-09-15T09:08:01.860+02:00</updated><title type='text'>VertNet and the GBIF Integrated Publishing Toolkit</title><content type='html'>(A guest post from our friends at VertNet, cross-posted from the &lt;a href="http://blog.vertnet.org/"&gt;VertNet blog&lt;/a&gt;)

        &lt;p&gt;This week we’d like to discuss the current and future roles of the GBIF &lt;a href="http://www.gbif.org/informatics/infrastructure/publishing/" target="_blank"&gt;Integrated Publishing Toolkit&lt;/a&gt; (IPT) in VertNet. IPT is a Java-based web application that allows a user to publish and share biodiversity data sets from a server. Here are some of the things IPT can do:&lt;/p&gt; 
&lt;p&gt;     &lt;img src="http://media.tumblr.com/tumblr_lrhcjgmVNN1ql3zjs.jpg" align="text-top" alt="GBIF IPT Logo Image"/&gt;&lt;/p&gt; 
&lt;ol&gt;&lt;li&gt;Create Darwin Core Archives. In our &lt;a href="http://blog.vertnet.org/post/9893042082/publishing-data-first-thoughts" target="_blank"&gt;post about data publishing&lt;/a&gt; last week, we wrote about Darwin Core being the “language of choice” for VertNet. IPT allows publishers to create Darwin Core data records from either files or databases and to export them in zipped archive files that contain exactly what is needed by VertNet for uploading.&lt;/li&gt; 
&lt;/ol&gt;&lt;ol start="2"&gt;&lt;li&gt;Make data available for efficient indexing by GBIF. VertNet has an agreement with its data publishers that, by participating, they will also publish data through GBIF. GBIF keeps our registry of data providers and uses this registry to find and update data periodically from the original sources to make it available through the GBIF &lt;a href="http://data.gbif.org/" target="_blank"&gt;data portal&lt;/a&gt;. IPT gives data publishers an easy means of keeping their data up-to-date with GBIF.&lt;/li&gt; 
&lt;/ol&gt;&lt;p&gt;IPT can help with the data publishing process in other ways as well:&lt;/p&gt; 
&lt;ul&gt;&lt;li&gt;standardizing terms&lt;/li&gt; 
&lt;li&gt;validating records before they get published&lt;/li&gt; 
&lt;li&gt;adding default values for fields that aren’t in the original data&lt;/li&gt; 
&lt;/ul&gt;&lt;p&gt;To get a better understanding of the capabilities, take a look at the &lt;a href="http://code.google.com/p/gbif-providertoolkit/wiki/IPT2ManualNotes" target="_blank"&gt;IPT User Manual&lt;/a&gt;.&lt;br/&gt;&lt;br/&gt;Why are we using IPT? &lt;br/&gt;&lt;br/&gt;VertNet has a long waiting list of organizations (65 to date) that have expressed interest in making their data publicly accessible through VertNet. In the past, these institutions would have needed their own server and specialized software (&lt;a href="http://digir.net/" target="_blank"&gt;DiGIR&lt;/a&gt;) for publishing to the separate vertebrate networks. We’d rather not require any of these participants to buy servers if they don’t have to. As an interim solution, we’re using the IPT to make data available online while we build VertNet. We have installed, at the University of Kansas &lt;a href="http://biodiversity.ku.edu/" target="_blank"&gt;Biodiversity Institute&lt;/a&gt;, an &lt;a href="http://vertnet.nhm.ku.edu:8080/ipt/" target="_blank"&gt;IPT&lt;/a&gt; that can act as a host for as many collections as are interested. The service is shared, yet organizations can maintain their own identity and data securely within this hosted IPT. This is a big win for us at VertNet, because there will be fewer servers to maintain and we can get more collections involved more quickly.&lt;br/&gt;&lt;br/&gt;Going forward&amp;#8230;&lt;br/&gt;&lt;br/&gt;Well before completion, VertNet will support &lt;a href="http://blog.vertnet.org/post/9893042082/publishing-data-first-thoughts" target="_blank"&gt;simple and sustainable publishing&lt;/a&gt; by uploading records from text files in Simple Darwin Core form. Because of this, the IPT will not be a required component of data publishing for VertNet. 
Rather, we see IPT as a great tool to facilitate the creation of Darwin Core Archives, which we will be able to use to upload data to VertNet.&lt;br/&gt;&lt;br/&gt;Interested in publishing now with IPT?&lt;br/&gt;&lt;br/&gt;We currently have two institutions sharing their collections with VertNet and GBIF through the &lt;a href="http://vertnet.nhm.ku.edu:8080/ipt/" target="_blank"&gt;VertNet IPT&lt;/a&gt; and we’re in the process of working with several more.&lt;/p&gt; 
&lt;p&gt;So, if you are or would like to be a vertebrate data publisher and would like to make your data accessible as Darwin Core Archives sooner rather than later, VertNet’s IPT might be the solution for you!  Learn more about the &lt;a href="http://vertnet.org/publishers/join.php" target="_blank"&gt;process&lt;/a&gt; on the VertNet web site or email &lt;a href="mailto:larussell@vertnet.org" target="_blank"&gt;Laura Russell&lt;/a&gt; and &lt;a href="mailto:dbloom@vertnet.org" target="_blank"&gt;Dave Bloom&lt;/a&gt;.&lt;/p&gt; 
  
&lt;p&gt;&lt;em&gt;Posted by Laura Russell, VertNet Programmer; John Wieczorek, Information Architect; and Aaron Steele, Information Architect&lt;/em&gt;&lt;/p&gt;
&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-919861230614743940?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/919861230614743940/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/09/vertnet-and-gbif-integrated-publishing.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/919861230614743940'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/919861230614743940'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/09/vertnet-and-gbif-integrated-publishing.html' title='VertNet and the GBIF Integrated Publishing Toolkit'/><author><name>Oliver Meyn</name><uri>http://www.blogger.com/profile/04706642473308341930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/-6_sfRK7-oU0/TaWanB_bQ3I/AAAAAAAAAAM/RcPMC0E9E3k/s220/oliver_amory_profile.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-8591018617989050411</id><published>2011-08-26T15:03:00.004+02:00</published><updated>2011-08-26T15:22:11.704+02:00</updated><title type='text'>Darwin Core Archives for Species Checklists</title><content type='html'>GBIF has long had an ambition for supporting the sharing of annotated species checklists through the network.   Realising this ambition has been frustrated by the lack of a data exchange standard of sufficient scope and simplicity as to promote publication of this type of resource.   
In 2009, the Darwin Core standard was formally ratified by TDWG (Biodiversity Information Standards). The addition of new terms, and a means of expressing these terms in a simplified and extensible text-based format, paved the way for the development of a data exchange profile for species checklists known as the Global Names Architecture (GNA) Profile. Species checklists published in this format can be zipped into single, portable 'archive' files.&lt;div&gt;
&lt;/div&gt;&lt;div&gt;Here I introduce two example archives that illustrate the flexible scope of the format. The first represents a very simple species checklist while the second is a more richly documented taxonomic catalogue.  The contents of any file can be viewed by clicking on the file icon or filename.   A complete list of terms used in sharing checklists can be found &lt;a href="http://tools.gbif.org/resource_browser/"&gt; here.&lt;/a&gt;&lt;/div&gt;&lt;div&gt;

&lt;table&gt;&lt;caption align="TOP"&gt;Example 1: &lt;b&gt;U.S. National Arboretum Checklist&lt;/b&gt;&lt;/caption&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td colspan="2"&gt;This checklist represents the simplest checklist archive.  It consists of a document that describes the checklist and a second file with the checklist data itself.   The checklist data consist of two columns.  Note that because the column headers match the standard Darwin Core term names, no additional mapping document is needed.&lt;/td&gt;&lt;/tr&gt;&lt;tr valign="middle"&gt;&lt;td align="center"&gt;
&lt;a href="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/eml.xml"&gt;
&lt;img width="80" src="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/xml.png" /&gt;
EML.xml&lt;/a&gt;
&lt;/td&gt;&lt;td&gt;The checklist is documented using an Ecological Metadata Language (EML) document. &lt;/td&gt;&lt;/tr&gt;&lt;tr valign="middle"&gt;&lt;td align="center"&gt;
&lt;a href="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/Checklist.txt"&gt;
&lt;img width="80" src="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/txt.png" /&gt;Checklist.txt&lt;/a&gt;
&lt;/td&gt;&lt;td&gt;The checklist itself is kept in this simple text file.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
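&lt;div&gt;&lt;br /&gt;As a rough illustration of the "no mapping document needed" point in Example 1, here is a minimal Python sketch that reads a delimited checklist and reports any column headers that are not Darwin Core term names. The term list is only a small illustrative subset, and the sample rows and the tab delimiter are assumptions made for the example, not the contents of the actual Checklist.txt.&lt;/div&gt;

```python
import csv
import io

# Illustrative subset of Darwin Core term names; the standard defines many more.
KNOWN_TERMS = {"taxonID", "scientificName", "taxonRank", "kingdom", "family", "vernacularName"}

# Hypothetical two-column checklist (tab-delimited); not the real Checklist.txt.
SAMPLE = "scientificName\ttaxonRank\nAcer rubrum L.\tspecies\nQuercus alba L.\tspecies\n"


def read_checklist(text):
    """Return (rows, unmapped): rows keyed by header, plus headers that are not known terms."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # Headers that match a term name can be interpreted without a mapping file.
    unmapped = [name for name in reader.fieldnames if name not in KNOWN_TERMS]
    rows = list(reader)
    return rows, unmapped


rows, unmapped = read_checklist(SAMPLE)
print(len(rows), "records; headers needing a mapping:", unmapped)
```

&lt;div&gt;If the list of unmapped headers comes back empty, every column can be interpreted directly from its header.&lt;/div&gt;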
&lt;/div&gt;&lt;div&gt;

&lt;table&gt;&lt;caption align="TOP"&gt;Example 2: &lt;b&gt;Catalog of Living Whales&lt;/b&gt;&lt;/caption&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td colspan="2"&gt;
This checklist represents an annotated species checklist.  In addition to the core species list ('whales.tab') there are numerous other data types consisting of Darwin Core extensions that conform to the GNA Profile.  The archive contains a resource map file ('meta.xml') that describes the files in the archive, and an EML metadata document that describes the catalog itself.   A common identifier, taxonID, links the data in the extension files to the data records in the core species checklist ('whales.tab').
&lt;/td&gt;&lt;/tr&gt;&lt;tr valign="middle"&gt;&lt;td align="center"&gt;
&lt;a href="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/whales/eml.xml"&gt;
&lt;img width="80" src="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/xml.png" /&gt;
EML.xml&lt;/a&gt;
&lt;/td&gt;&lt;td&gt;The checklist is documented using an Ecological Metadata Language (EML) document.  It includes a title,  contacts,  citation information and more.&lt;/td&gt;&lt;/tr&gt;&lt;tr valign="middle"&gt;&lt;td align="center"&gt;
&lt;a href="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/whales/whales.tab"&gt;
&lt;img width="80" src="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/txt.png" /&gt;whales.tab&lt;/a&gt;&lt;/td&gt;&lt;td&gt;The checklist itself is kept in this tab-delimited file.&lt;/td&gt;&lt;/tr&gt;&lt;tr valign="middle"&gt;&lt;td align="center"&gt;
&lt;a href="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/whales/meta.xml"&gt;
&lt;img width="80" src="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/xml.png" /&gt;
meta.xml&lt;/a&gt;
&lt;/td&gt;&lt;td&gt;The files in the archive are described in this resource map file. &lt;/td&gt;&lt;/tr&gt;&lt;tr valign="middle"&gt;&lt;td align="center"&gt;
&lt;a href="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/whales/distribution.tab"&gt;
&lt;img width="80" src="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/txt.png" /&gt;distribution.tab&lt;/a&gt;
&lt;/td&gt;&lt;td&gt;Distribution information conforming to the GNA Distribution extension is stored in this file.&lt;/td&gt;&lt;/tr&gt;&lt;tr valign="middle"&gt;&lt;td align="center"&gt;
&lt;a href="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/whales/references.tab"&gt;
&lt;img width="80" src="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/txt.png" /&gt;references.tab&lt;/a&gt;
&lt;/td&gt;&lt;td&gt;Bibliographic references are stored in this file and linked to 'whales.tab' via the taxonID.&lt;/td&gt;&lt;/tr&gt;&lt;tr valign="middle"&gt;&lt;td align="center"&gt;
&lt;a href="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/whales/types.tab"&gt;
&lt;img width="80" src="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/txt.png" /&gt;types.tab&lt;/a&gt;
&lt;/td&gt;&lt;td&gt;Type specimen details are contained in this file.&lt;/td&gt;&lt;/tr&gt;&lt;tr valign="middle"&gt;&lt;td align="center"&gt;
&lt;a href="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/whales/vernaculars.tab"&gt;
&lt;img width="80" src="http://www.gbif.org/fileadmin/Images/Informatics/Architecture/txt.png" /&gt;vernaculars.tab&lt;/a&gt;
&lt;/td&gt;&lt;td&gt;Common name information that conforms to the GNA Vernacular Extension is stored in this file.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-8591018617989050411?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/8591018617989050411/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/08/darwin-core-archives-for-species.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/8591018617989050411'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/8591018617989050411'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/08/darwin-core-archives-for-species.html' title='Darwin Core Archives for Species Checklists'/><author><name>David Remsen</name><uri>http://www.blogger.com/profile/07274081646623362154</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://3.bp.blogspot.com/_xN1XbTgolpU/SSWDYDDjhVI/AAAAAAAAAAM/XQdQU0DDfIE/S220/Photo+38.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-7249923992376497153</id><published>2011-08-21T22:42:00.008+02:00</published><updated>2011-08-26T22:29:10.334+02:00</updated><title type='text'>Configuring Drupal and some modules for ticketing emails</title><content type='html'>&lt;p&gt;We at the Secretariat receive enquiries via helpdesk[at]gbif[dot]org, portal[at]gbif[dot]org and info[at]gbif[dot]org, everyday, or I would say, almost every hour. Some of them are provider-specific questions that need special attention from staff, while some others are FAQs. 
We have been thinking about how to manage questions and issues better. By adding a little structure to the collaborative workflow, we can:&lt;/p&gt;

&lt;p&gt;1. Make sure questions are answered satisfactorily;&lt;br/&gt;
2. Estimate how many man-hours have been spent, and evaluate performance;&lt;br/&gt;
3. Improve the efficiency of helpdesk activities.&lt;/p&gt;

&lt;p&gt;To achieve this, we need software that meets these requirements:&lt;br/&gt;
1. Case management for incoming emails;&lt;br/&gt;
2. It should be possible to complete a Q&amp;amp;A cycle using email alone. Web forms are nice to have but not necessary in the beginning;&lt;br/&gt;
3. Easily configured knowledge base articles;&lt;br/&gt;
4. Graphical reports that show helpdesk performance;&lt;br/&gt;
5. Automatic escalation of case status.&lt;/p&gt;

&lt;p&gt;We looked for options on the Open Source Help Desk List. Most of the promising choices are tailored to the software development cycle, while others are commercial packages or services that are indeed designed for enterprise help desk needs. While evaluating a few of those packages, I also found that Drupal plus a few modules gives us a solution that meets our needs almost out of the box. The result is quite convincing, and I can imagine the transition won't require much learning from my colleagues.&lt;/p&gt;
&lt;p&gt;Here is the recipe.&lt;/p&gt;

&lt;span style="font-weight:bold;"&gt;Materials and methods:&lt;/span&gt;
&lt;p&gt;1. A mail server. All right, I admit this is not easy if you're not a system administrator. We use Dovecot to provide IMAP access to emails.&lt;br/&gt;
2. A Drupal installation. Installation instructions are here. As a wimp I chose version 6.&lt;br/&gt;
3. The Support module. Downloadable at &lt;a href="http://drupal.org/project/support"&gt;http://drupal.org/project/support&lt;/a&gt;.&lt;br/&gt; The related modules support_deadline, support_fields, support_timer, support_views, support_token, and support_nag also fit our purposes.&lt;br/&gt;
4. The CCK module. Downloadable at http://drupal.org/project/cck.&lt;br/&gt;
5. The Views module. Downloadable at http://drupal.org/project/views.&lt;br/&gt;
6. The Google Chart module. Downloadable at http://drupal.org/project/chart. Not "charts", which is a different module.&lt;br/&gt;
7. The Date module. Downloadable at http://drupal.org/project/date.&lt;br/&gt;
8. The Admin Menu module, for your administrative pleasure. Downloadable at http://drupal.org/project/admin_menu.&lt;br/&gt;
9. The Views Calc module, required by Support modules. Downloadable at http://drupal.org/project/views_calc.&lt;br/&gt;
10. Download all necessary modules to the [drupalroot]/sites/all/modules directory. Enable them at [baseURL]/admin/build/modules.&lt;br/&gt;
11. You should see a "support ticketing system" menu by now. You need to:
&lt;div&gt;&lt;ol&gt;&lt;li&gt;Add an email client with an email account you set up on the Dovecot mail server;&lt;/li&gt;&lt;li&gt;Change the email template at [baseURL]/admin/support/settings/mail, if you wish;&lt;/li&gt;&lt;li&gt;Go through the general settings of the ticketing system at [baseURL]/admin/support/settings.&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;
12. Send some test emails to the test email address.&lt;br/&gt;
13. Visit [baseURL]/admin/support/clients/1/fetch and check that the system retrieves the emails and creates tickets successfully.&lt;/p&gt;
&lt;a href="http://3.bp.blogspot.com/-qXHnKjeMF64/TlFvAmyVKHI/AAAAAAAAACE/Fj6lfhYbCxM/s1600/ticket%2Blist.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 391px;" src="http://3.bp.blogspot.com/-qXHnKjeMF64/TlFvAmyVKHI/AAAAAAAAACE/Fj6lfhYbCxM/s400/ticket%2Blist.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5643413864274274418" /&gt;&lt;/a&gt;
&lt;p&gt;14. After more test emails have been sent to the address and fetched, you can visit [baseURL]/admin/support/charts to view the charts.&lt;/p&gt;
&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/-Wok13EMYPDI/TlFvPuSmZLI/AAAAAAAAACM/0gZ177q7uv8/s1600/charts.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 106px; height: 400px;" src="http://4.bp.blogspot.com/-Wok13EMYPDI/TlFvPuSmZLI/AAAAAAAAACM/0gZ177q7uv8/s400/charts.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5643414123986707634" /&gt;&lt;/a&gt;
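&lt;p&gt;Sending the test emails from step 12 is easy to script. The sketch below only builds a few numbered messages with Python's standard email library, so it can be run without a mail server. Both addresses are hypothetical placeholders, and the SMTP host in the trailing comment is an assumption about your setup.&lt;/p&gt;

```python
from email.message import EmailMessage

# Hypothetical addresses; replace with the mailbox that the Support client polls.
HELPDESK_ADDRESS = "helpdesk-test@example.org"
SENDER_ADDRESS = "tester@example.org"


def build_test_messages(count):
    """Create numbered test emails; each one should become one ticket once fetched."""
    messages = []
    for number in range(1, count + 1):
        message = EmailMessage()
        message["From"] = SENDER_ADDRESS
        message["To"] = HELPDESK_ADDRESS
        message["Subject"] = f"Test ticket {number}"
        message.set_content(f"This is test enquiry number {number}.")
        messages.append(message)
    return messages


messages = build_test_messages(3)
for message in messages:
    print(message["Subject"])

# To actually send them, hand each message to smtplib, for example:
#   import smtplib
#   with smtplib.SMTP("localhost") as server:  # assumed SMTP host
#       server.send_message(message)
```

&lt;p&gt;Numbering the subjects makes it easy to check on the fetch page that every message ended up as a ticket.&lt;/p&gt;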
&lt;p&gt;These are just some facets of my explorations so far. Some details are not covered, like permissions in Drupal. A newcomer would probably need a crash course in Drupal to get started, but after that things will be easier and faster.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-7249923992376497153?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/7249923992376497153/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/08/configuring-drupal-and-some-modules-for.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/7249923992376497153'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/7249923992376497153'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/08/configuring-drupal-and-some-modules-for.html' title='Configuring Drupal and some modules for ticketing emails'/><author><name>Burke Chih-Jen Ko</name><uri>http://www.blogger.com/profile/09806308970203169452</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/-JHV15rSIJlw/Td_9T-7V2iI/AAAAAAAAABI/TamywweE4I4/s220/P1282909r_icon.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-qXHnKjeMF64/TlFvAmyVKHI/AAAAAAAAACE/Fj6lfhYbCxM/s72-c/ticket%2Blist.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-5062633761741279574</id><published>2011-08-12T14:29:00.001+02:00</published><updated>2011-08-12T14:29:59.786+02:00</updated><title type='text'>Using C3P0 with MyBatis</title><content type='html'>&lt;span class="Apple-style-span" style="font-size: large;"&gt;The problem&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
In our &lt;i&gt;rollover&lt;/i&gt;&amp;nbsp;process, which turns our raw harvested data into the interpreted occurrences you can see on our &lt;a href="http://data.gbif.org/welcome.htm"&gt;portal&lt;/a&gt;,&amp;nbsp;we now have a step that calls a Web Service to turn geographical coordinates into country names. We use this to enrich and validate the incoming data.&lt;br /&gt;
&lt;br /&gt;
This step in our process usually took about three to four hours, but last week it stopped working altogether, without any changes to the Web Service or the input data.&lt;br /&gt;
&lt;br /&gt;
We've spent a lot of time trying to find the problem, and while we still can't say for sure what the exact cause is or was, we've found a fix that works for us and that also lets us make some assumptions about the cause of the failure.&lt;br /&gt;
&lt;br /&gt;
The Web Service is a project called&amp;nbsp;&lt;a href="http://code.google.com/p/gbif-occurrencestore/source/browse/#svn%2Ftrunk%2Fgeocode-ws"&gt;geocode-ws&lt;/a&gt;. It is a very simple project that uses &lt;a href="http://mybatis.org/"&gt;MyBatis&lt;/a&gt; to call a &lt;a href="http://www.postgresql.org/"&gt;PostgreSQL&lt;/a&gt;&amp;nbsp;(8.4.2) &amp;amp;&amp;nbsp;&lt;a href="http://postgis.refractions.net/"&gt;PostGIS&lt;/a&gt;&amp;nbsp;(1.4.0) database, which does the GISy work of finding matches.&lt;br /&gt;
&lt;br /&gt;
Our process started out fine. The first few million calls to the Web Service returned reasonably fast, but towards the end the process slowed down until it came almost to a complete stop, with response times of over 10 minutes. That's when our Hadoop map tasks timed out and failed.&lt;br /&gt;
&lt;br /&gt;
With hindsight we should have come to our final conclusion much earlier but it took us a while.&lt;br /&gt;
&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-size: large;"&gt;Looking for the problem&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
We looked at the PostgreSQL configuration and tweaked it a lot. We added a lot more logging and made sure to log any long-running statements (using the&amp;nbsp;&lt;a href="http://www.postgresql.org/docs/8.4/static/runtime-config-logging.html"&gt;log_min_duration_statement&lt;/a&gt;&amp;nbsp;option). We also checked that our memory settings were sensible and that we weren't running out of memory; iostat and vmstat output as well as our Cacti monitoring confirmed that we weren't. PostgreSQL didn't seem to be the problem.&lt;br /&gt;
&lt;br /&gt;
We also looked at the OS configuration itself as well as the connectivity between our Hadoop cluster and this Tomcat and PostgreSQL server, but couldn't find the problem there either.&lt;br /&gt;
&lt;br /&gt;
Then we began improving our Web Service and implemented a JMX MBean to get more detailed information about the process. While our changes should have improved the code base, they didn't fix the problem. Finally we enabled GC logging on our Tomcat server (something we should have done much earlier and will probably do by default for our servers in the future). We hadn't done it earlier because the Web Service had never shown any symptoms of a memory leak and we hadn't changed anything there. It hadn't even been restarted in a while.&lt;br /&gt;
&lt;br /&gt;
But as it turned out, the problem was garbage collection. Unfortunately I can't provide pretty graphs because I've overwritten the GC logs, but it was very easy to see (using the awesome&amp;nbsp;&lt;a href="http://www.tagtraum.com/gcviewer.html"&gt;GCViewer&lt;/a&gt;) a typical pattern: minor collections did not reclaim all space and memory usage kept growing, up to the point where almost no memory could be reclaimed and most of the time was spent in garbage collection. We had found the problem! This explained our timeouts.&lt;br /&gt;
&lt;br /&gt;
It still doesn't explain what was leaking though. Having already spent that much time on it, we quickly gave up trying to find the root cause. We suspect some kind of interaction between the MyBatis connection pool, the PostgreSQL JDBC driver and our configuration.&lt;br /&gt;
&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-size: large;"&gt;Our workaround (the MyBatis &amp;amp; C3P0 part)&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
So we looked around for other connection pools to use with MyBatis, but unfortunately we couldn't find a ready-made implementation. There are implementations in the MyBatis-Guice project, but they can only be used with the annotation-based configuration and we're using XML.&lt;br /&gt;
&lt;br /&gt;
We ended up writing our &lt;a href="http://code.google.com/p/gbif-common-resources/source/browse/#svn%2Fmybatis-c3p0%2Ftrunk"&gt;own implementation&lt;/a&gt;&amp;nbsp;of a &lt;a href="http://sourceforge.net/projects/c3p0/"&gt;C3P0&lt;/a&gt; DataSourceFactory&amp;nbsp;and it turned out to be very very easy: It is just &lt;a href="http://code.google.com/p/gbif-common-resources/source/browse/mybatis-c3p0/trunk/src/main/java/org/gbif/common/mybatis/C3P0DataSourceFactory.java"&gt;one class&lt;/a&gt;&amp;nbsp;(JavaDoc &lt;a href="http://sites.gbif.org/common-resources/mybatis-c3p0/apidocs/org/gbif/common/mybatis/C3P0DataSourceFactory.html"&gt;here&lt;/a&gt;) with one line of code in it.&lt;br /&gt;
&lt;br /&gt;
Switching to C3P0 not only solved our apparent memory leak but also improved performance by a factor of two to three. We haven't had a problem with our setup since.&lt;br /&gt;
&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-size: large;"&gt;Conclusion&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
We didn't have the time to find the real problem, but we found a solution that works for us. I suspect that, had we gone about this more systematically, we might have found the problem a lot sooner and perhaps identified the real reason for it. Some lessons we took away:&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;Enable GC logging!&lt;/li&gt;
&lt;li&gt;Enable JMX for Tomcat and set up your applications with useful metrics and logging&lt;/li&gt;
&lt;li&gt;Even though the use of profilers is heavily disputed, they can often help. We've found &lt;a href="http://www.yourkit.com/"&gt;YourKit&lt;/a&gt;&amp;nbsp;to be excellent&lt;/li&gt;
&lt;li&gt;Try to follow a logical route, change only one thing at a time, mock things to find a problem&lt;/li&gt;
&lt;li&gt;Monitor and graph your systems&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-5062633761741279574?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/5062633761741279574/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/08/using-c3p0-with-mybatis.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/5062633761741279574'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/5062633761741279574'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/08/using-c3p0-with-mybatis.html' title='Using C3P0 with MyBatis'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-5842470106688313880</id><published>2011-08-03T08:35:00.000+02:00</published><updated>2011-08-03T08:35:49.288+02:00</updated><title type='text'>Indexing occurrences data - using Lucene technology</title><content type='html'>The &lt;a href="http://www.gbif.org/"&gt;GBIF&lt;/a&gt; Occurrence Index collects, stores and parses data gathered from different sources to provide a fast and accurate access to biodiversity occurrence data. The purpose of having a GBIF Index is optimize speed, relevance and performance of search functionalities that will be implemented by the new GBIF portal architecture. 
&lt;br /&gt;
&lt;br /&gt;
Currently, the GBIF Data Portal provides search functionality backed by a semi-denormalized relational database index, which allows users to find occurrence information by specifying filters that refine the results. That design was envisioned to support the use cases of the current &lt;a href="http://data.gbif.org/"&gt;GBIF Data Portal &lt;/a&gt;(a Web application). The next generation of the GBIF platform must meet a new set of requirements, and it is possible that the current index will not be able to support them. The most relevant of those requirements are: scheduling of batch exports, full text search, real-time faceted search and, probably, new schemas for sharing data with other biodiversity networks.
&lt;br /&gt;
&lt;br /&gt;
To implement this new Occurrence Index, several technologies are being evaluated. Each was taken into consideration for specific features that make it an attractive option:
&lt;br /&gt;
&lt;table border="1"&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;Technology&lt;/td&gt;&lt;td&gt;Description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="http://www.postgresql.org/"&gt;PostgreSQL&lt;/a&gt;&lt;/td&gt;&lt;td&gt;This relational database offers several features worth evaluating: query optimization for JOIN-like queries, a flexible key-value store, partial indices and multicolumn indices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="http://lucene.apache.org/java/docs/index.html"&gt;Lucene Index&lt;/a&gt;&lt;/td&gt;&lt;td&gt;At least four options are available for this implementation: pure Lucene Index, Katta, Apache Solr and ElasticSearch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="http://www.mysql.com/"&gt;MySQL&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;This is the current implementation of the index; an evaluation could help determine whether this technology can support the new use cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="http://en.wikipedia.org/wiki/NoSQL"&gt;Key-value systems&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Several schema-less data stores are available: CouchDB, MongoDB, PostgreSQL hstore. The main concern about these technologies is their ability to handle a large number of records&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
This post shows some preliminary results from the evaluation of Lucene-based indices, specifically plain Lucene, &lt;a href="http://katta.sourceforge.net/"&gt;Katta&lt;/a&gt;, &lt;a href="http://lucene.apache.org/solr/"&gt;Apache Solr&lt;/a&gt; and &lt;a href="http://www.elasticsearch.org/"&gt;ElasticSearch&lt;/a&gt;. The analysis keeps two concerns apart (this post covers only index creation):&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Index creation&lt;/b&gt;: how the index is created, split (into shards) and merged if necessary.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Index use&lt;/b&gt;: how the index performs in terms of usability (queries and search patterns), performance (response time) and throughput.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
Index Creation&lt;/h3&gt;
Three scenarios were considered for the index creation phase:
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;b&gt;Single process - n indices&lt;/b&gt;: a single process creates n shards, with the input data split evenly; an &lt;a href="http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/IndexWriter.html"&gt;IndexWriter&lt;/a&gt; is created for each shard. The case n = 1 is considered part of this scenario. The number of shards is a parameter defined by the user and equals the number of shards expected at the end of the process.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;N threads - n indices&lt;/b&gt;: the &lt;a href="http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/IndexWriter.html"&gt;IndexWriter&lt;/a&gt; is a thread-safe class, so it can be shared by several threads to create a single index. The number of shards is defined by the user and is used internally to define the number of IndexWriters.
&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Distributed index creation&lt;/b&gt;: the index is created by splitting the input data into N shards; each shard is assigned to one process, which holds a single &lt;a href="http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/IndexWriter.html"&gt;IndexWriter&lt;/a&gt; responsible for creating that shard's index.
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
Single process index creation&lt;/h4&gt;
The process followed for this scenario is pretty straightforward: 
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;
 The input is a row delimited file and each column is separated by a special character ('/001' in our case).&lt;/li&gt;
&lt;li&gt;The number of shards given as input defines the number of &lt;a href="http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/IndexWriter.html"&gt;IndexWriters&lt;/a&gt; (only one IndexWriter can be open on a Lucene index at a time).&lt;/li&gt;
&lt;li&gt;Each row represents a Lucene document and is stored using one of the available index writers.&lt;/li&gt;
&lt;li&gt;If multiple indices were created, they are merged into a single index in the final step (using &lt;a href="http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/IndexWriter.html#addIndexesNoOptimize%28org.apache.lucene.store.Directory...%29"&gt;IndexWriter.addIndexesNoOptimize&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
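The parsing and distribution steps above can be sketched in plain Java, without any Lucene calls. This is only a minimal sketch under assumptions, not the actual indexer: the class ShardRouter, the round-robin rule (one possible way to split rows evenly) and the sample rows are invented for illustration, and the column separator is assumed to be the control character U+0001. In the real process each shard number would correspond to one IndexWriter.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the original indexer: it only illustrates
// how rows could be parsed and spread evenly over n shards.
final class ShardRouter {

  // Column separator; assumed here to be the control character U+0001.
  static final String SEPARATOR = "\u0001";

  // Splits one input row into its column values.
  // The limit -1 keeps trailing empty columns instead of dropping them.
  static String[] parseRow(String row) {
    return row.split(SEPARATOR, -1);
  }

  // Round-robin assignment: row i goes to shard (i mod numShards).
  static int shardFor(long rowNumber, int numShards) {
    return (int) (rowNumber % numShards);
  }

  public static void main(String[] args) {
    // Made-up sample rows with three columns each.
    String[] rows = {
      "1\u0001Puma concolor\u0001CR",
      "2\u0001Panthera onca\u0001",
      "3\u0001Felis catus\u0001DK"
    };
    int numShards = 2;
    long rowNumber = 0;
    for (String row : rows) {
      int shard = shardFor(rowNumber++, numShards);
      System.out.println("shard " + shard + ": " + Arrays.toString(parseRow(row)));
    }
  }
}
```

With round-robin assignment the shard sizes differ by at most one row, which matches the "split evenly" requirement without needing to know the total number of rows in advance.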
This process was tested using a file of 100 million records. The entire process took 9200821 milliseconds (about 9200.8 seconds = 153.3 minutes = 2.56 hours) to finish.

Some optimizations were implemented for this process; the same optimizations were also applied in the multithreaded scenario:
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;Re-use the IndexWriter in multiple threads&lt;/li&gt;
&lt;li&gt;Re-use the org.apache.lucene.document.Document and org.apache.lucene.document.Field instances. The Lucene fields are created once in a static block and their values are changed for each new row; the same document instance is then added to the index. The intention is to avoid creating objects that would immediately have to be garbage collected.
&lt;pre class="brush:java"&gt;  //Initialization
  static {
    for (int i = 0; i &amp;lt; accFieldsValues.length; i++) {
      fields[i] = new Field(accFieldsValues[i].name(), "", Store.YES, Index.ANALYZED);
    }
  }
 ...
 //Sets the field value
 fields[fieldsCount].setValue(stringTokenizer.nextToken());
 ...
 //Adds the same document instance with different values
 indexWriter.addDocument(doc);
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;The compound file format is turned off; this speeds up indexing, at the cost of more files being open at the same time.&lt;/li&gt;
&lt;li&gt;IndexWriter.autocommit is set to false: since the index is not searched while it is being created, this feature can be disabled.&lt;/li&gt;
&lt;li&gt;
Flushing is triggered by RAM usage, and the RAM buffer is made as large as possible:
&lt;pre class="brush:java"&gt; 
LogByteSizeMergePolicy logByteSizeMergePolicy = new LogByteSizeMergePolicy();
logByteSizeMergePolicy.setMergeFactor(mergeFactor);
...
indexWriterConfig.setRAMBufferSizeMB(bufferSize);
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;Every N documents an entry is written to the log to report the overall progress.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
Multithreaded Index Creation&lt;/h4&gt;
In terms of the optimizations applied, this scenario is very similar to the "Single process" scenario. However, the process is very different in terms of its steps and the resulting index:
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;The input is a row delimited file and each column is separated by a special character ('/001' in our case).
&lt;/li&gt;
&lt;li&gt;The number of rows of the input file is known and is passed as input parameter.&lt;/li&gt;
&lt;li&gt;The input file is split evenly into intermediate files; each file is assigned to a thread, which reads it to create a Lucene index.&lt;/li&gt;
&lt;li&gt;The intermediate files are deleted after each index is created.&lt;/li&gt;
&lt;li&gt;
Depending on the number of shards desired, the indices are merged into a smaller set of indices.&lt;/li&gt;
&lt;/ul&gt;
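The even split described in the list above can be sketched as follows. This is only an illustration under assumptions, not the code used for the test: the class RowSlicer and its slice method are hypothetical, and the sketch only computes the row ranges each thread would receive instead of writing intermediate files.

```java
// Hypothetical helper: computes the row range each thread would receive
// when a file with a known number of rows is split evenly.
final class RowSlicer {

  // Returns one {firstRow (inclusive), lastRow (exclusive)} pair per slice.
  // Slice sizes differ by at most one row.
  static long[][] slice(long totalRows, int slices) {
    long[][] ranges = new long[slices][2];
    for (int i = 0; i != slices; i++) {
      ranges[i][0] = i * totalRows / slices;
      ranges[i][1] = (i + 1) * totalRows / slices;
    }
    return ranges;
  }

  public static void main(String[] args) {
    // Small example: 10 rows over 3 threads.
    for (long[] range : slice(10, 3)) {
      System.out.println("rows [" + range[0] + ", " + range[1] + ")");
    }
  }
}
```

Knowing the total number of rows up front is what makes this possible; it is the reason the row count is passed in as an input parameter.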
This process was run using a pool of 50 threads and an input file with 100 million rows. The execution time is detailed in the following table:
&lt;br /&gt;
&lt;table border="1"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Phase&lt;/td&gt;
&lt;td&gt;Time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slicing (splitting the input file and distributing it to the threads)&lt;/td&gt;&lt;td&gt;1045948 ms = 17.43 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Indices creation&lt;/td&gt;&lt;td&gt;6890988 ms = 114.85 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total time&lt;/td&gt;
&lt;td&gt;132.28 minutes = 2.20 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
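As a sanity check on the timings reported so far, the following minimal Java sketch redoes the unit conversions and compares the two runs. The class and method names are made up for this illustration; the millisecond values are the ones quoted in this post, and the speed-up figure is simply derived from them.

```java
// Hypothetical helper that redoes the time conversions quoted in the post.
final class IndexTimings {

  static double toMinutes(long millis) {
    return millis / 60000.0;
  }

  static double toHours(long millis) {
    return millis / 3600000.0;
  }

  public static void main(String[] args) {
    long singleProcess = 9200821L; // single process run, in ms
    long slicing = 1045948L;       // multithreaded run: slicing phase, in ms
    long indexing = 6890988L;      // multithreaded run: indices creation, in ms
    long multithreaded = slicing + indexing;

    System.out.printf("single process: %.2f h%n", toHours(singleProcess));
    System.out.printf("slicing:        %.2f min%n", toMinutes(slicing));
    System.out.printf("indexing:       %.2f min%n", toMinutes(indexing));
    System.out.printf("multithreaded:  %.2f h%n", toHours(multithreaded));
    // Ratio of the two total run times.
    System.out.printf("speed-up:       %.2f%n", (double) singleProcess / multithreaded);
  }
}
```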
&lt;h4&gt;
Katta for Index Creation&lt;/h4&gt;
&lt;a href="http://katta.sourceforge.net/"&gt;Katta&lt;/a&gt; is a distributed store for indices that currently supports two types of index: Lucene and Hadoop MapFiles. It uses &lt;a href="http://wiki.apache.org/hadoop/ZooKeeper"&gt;ZooKeeper&lt;/a&gt; to coordinate index creation, replication and search across the nodes.

&lt;br /&gt;
&lt;h5&gt;
Main relevant features&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;A Katta index is basically a folder containing sub-folders (shards)&lt;/li&gt;
&lt;li&gt;The client-node communication is implemented using &lt;a href="http://www.lexemetech.com/2008/07/rpc-and-serialization-with-hadoop.html"&gt;HadoopRPC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Supports distributed scoring; each search query requires two network round trips: the first gets the document frequencies from all shards and the second performs the query.&lt;/li&gt;
&lt;li&gt;Provides functionality to merge indices (though this is not a very complex task to implement using the standard Lucene libraries)&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;
Relevant issues found&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;Documentation is not extensive and lacks necessary detail.&lt;/li&gt;
&lt;li&gt;The community is small and development activity is very low: the last commit was made on 2009-04-2.&lt;/li&gt;
&lt;li&gt;Doesn't provide any help in creating the indices; index sharding must be done before importing them into Katta.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;
Test environment configuration &lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;A cluster of 4 Katta nodes in 2 servers was used.&lt;/li&gt;
&lt;li&gt;The index was split into 8 shards.&lt;/li&gt;
&lt;li&gt;The master configuration is replicated in each node using passphraseless ssh access between master and nodes.&lt;/li&gt;
&lt;li&gt;The ZooKeeper server was embedded into the Katta master node. 
     &lt;pre&gt; katta.zk.properties (file) ==&amp;gt;zookeeper.embedded=true 
     &lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;Each node contains 2 shards, each shard is replicated in 2 nodes.&lt;/li&gt;
&lt;li&gt;The sharded Lucene index contains 100 million documents and was stored in the Hadoop distributed file system.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;
Index creation&lt;/h5&gt;
Since Katta doesn't provide any functionality to create a Lucene index from scratch, the index was built using a multithreaded application and the shards were copied into Hadoop DFS. Then, the sharded index was imported into Katta using the command line: 
&lt;br /&gt;
&lt;pre&gt;bin/katta addIndex occurrence hdfs://namenode:port/occurrence/shardedindex/ 2&lt;/pre&gt;
("2" means a replication factor of 2). Importing an index into Katta is just a matter of copying the files from Hadoop and updating the index status in the ZooKeeper server, so index creation is external to Katta.

In a future post the "Distributed Index Creation" scenario will be analyzed, as well as ElasticSearch and Solr for index creation...&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-5842470106688313880?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/5842470106688313880/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/08/indexing-occurrences-data-using-lucene.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/5842470106688313880'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/5842470106688313880'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/08/indexing-occurrences-data-using-lucene.html' title='Indexing occurrences data - using Lucene technology'/><author><name>Fede Méndez</name><uri>http://www.blogger.com/profile/11707904250426427540</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-451428525567105295</id><published>2011-07-25T10:01:00.001+02:00</published><updated>2011-12-08T17:06:35.832+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='IPT'/><category scheme='http://www.blogger.com/atom/ns#' term='customization'/><category scheme='http://www.blogger.com/atom/ns#' term='Canadensys'/><category scheme='http://www.blogger.com/atom/ns#' term='CSS'/><title type='text'>Customizing the IPT</title><content type='html'>&lt;p&gt;One of my responsibilities as the Biodiversity Informatics Manager for &lt;a 
href="http://www.canadensys.net/"&gt;Canadensys&lt;/a&gt; is to develop a data portal giving access to all the biodiversity information published by the participants of our network. A huge portion of this task can now be done with the &lt;a href="http://code.google.com/p/gbif-providertoolkit/"&gt;GBIF Integrated Publishing Toolkit version 2&lt;/a&gt; or IPT. The IPT allows to host biodiversity resources, manage their data and metadata, and register them with GBIF so they can appear on the &lt;a href="http://data.gbif.org/"&gt;GBIF data portal&lt;/a&gt;, which are all targets we want to achieve. Best of all, most management can be done by the collection managers themselves.&lt;/p&gt;&lt;p&gt;I have tested the IPT thoroughly and I am convinced the GBIF development team has done an excellent job creating a stable tool I can trust. This post explains how I have customized &lt;a href="http://data.canadensys.net/ipt"&gt;our IPT installation&lt;/a&gt; to integrate it with our other Canadensys websites.&lt;/p&gt;&lt;a href="http://1.bp.blogspot.com/-DI0DYZOSSQY/TimGOjg_EgI/AAAAAAAAFsY/qVjcly6EtvE/s1600/ipt.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 291px;" src="http://1.bp.blogspot.com/-DI0DYZOSSQY/TimGOjg_EgI/AAAAAAAAFsY/qVjcly6EtvE/s400/ipt.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5632180393613398530" /&gt;&lt;/a&gt;&lt;br /&gt;
&lt;h4&gt;Background&lt;/h4&gt;&lt;p&gt;Our &lt;a href="http://www.canadensys.net/"&gt;Canadensys community portal&lt;/a&gt; is powered by WordPress (MySQL, PHP), while our data portal - which before the IPT installation only consisted of the &lt;a href="http://data.canadensys.net/vascan"&gt;Database of Vascular Plants of Canada (VASCAN)&lt;/a&gt; - is a Tomcat application. We are using different technologies because we want to use the most adequate technology for a certain website. &lt;a href="http://wordpress.org/"&gt;WordPress&lt;/a&gt; (or &lt;a href="http://drupal.org/"&gt;Drupal&lt;/a&gt; for that matter) is an excellent and easy-to-use &lt;a href="http://en.wikipedia.org/wiki/Content_management_system"&gt;CMS&lt;/a&gt;, perfect for our community portal, but not suitable for a custom made checklist website like VASCAN. To the user however, both websites look the same:&lt;/p&gt;&lt;a href="http://3.bp.blogspot.com/-aRzZHBeK8dU/TimGrB0TOkI/AAAAAAAAFsg/V3D7SeeDNdI/s1600/canadensys-community.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 258px;" src="http://3.bp.blogspot.com/-aRzZHBeK8dU/TimGrB0TOkI/AAAAAAAAFsg/V3D7SeeDNdI/s400/canadensys-community.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5632180882783812162" /&gt;&lt;/a&gt; &lt;a href="http://4.bp.blogspot.com/-7x_plNQR6Hc/TimGuqi2ITI/AAAAAAAAFso/U42y2m6u78c/s1600/canadensys-data.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 258px;" src="http://4.bp.blogspot.com/-7x_plNQR6Hc/TimGuqi2ITI/AAAAAAAAFso/U42y2m6u78c/s400/canadensys-data.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5632180945256063282" /&gt;&lt;/a&gt;&lt;br /&gt;
&lt;p&gt;We do this by using the same HTML markup and &lt;a href="http://en.wikipedia.org/wiki/Cascading_Style_Sheets"&gt;CSS&lt;/a&gt; for both websites. If you want to learn &lt;a href="http://www.w3schools.com/html/default.asp"&gt;HTML&lt;/a&gt; and &lt;a href="http://www.w3schools.com/css/default.asp"&gt;CSS&lt;/a&gt;, &lt;a href="http://www.w3schools.com/"&gt;w3schools&lt;/a&gt; provides excellent tutorials.&lt;/p&gt;&lt;p&gt;The HTML markup defines elements on a page (e.g. header, menu, content, sidebar, footer) and the CSS stylizes those elements (e.g. their position and color). The CSS is typically stored as one file (e.g. &lt;a href="http://www.canadensys.net/wp-content/themes/canadensys/style.css"&gt;style.css&lt;/a&gt;) which is referenced in the &amp;lt;head&amp;gt; section of a page. For dynamic websites, the HTML is typically stored as different files, one for each section of a page (e.g. header.php, sidebar.php). Those files are combined as one page by the server if a page is requested. That way, changing a common element on all pages of a website (e.g. the header) can be done by changing just one file.&lt;/p&gt;&lt;p&gt;All of this also applies to the IPT. Here's how the IPT looks like without CSS:&lt;/p&gt;&lt;a href="http://4.bp.blogspot.com/-J62y3EtgEjY/TinKkAVU7QI/AAAAAAAAFtI/_wrGTelMHqc/s1600/ipt-no-css.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 265px;" src="http://4.bp.blogspot.com/-J62y3EtgEjY/TinKkAVU7QI/AAAAAAAAFtI/_wrGTelMHqc/s400/ipt-no-css.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5632255528917069058" /&gt;&lt;/a&gt;&lt;br /&gt;
&lt;h4&gt;Attempt 1 - Editing the CSS and logo&lt;/h4&gt;&lt;p&gt;My first attempt at customizing the IPT was at the &lt;a href="http://www.gbif.org/participation/training/events/training-event-details/?eventid=113"&gt;Experts Workshop&lt;/a&gt; in Copenhagen, by changing the CSS and logo only, which you can find in the &lt;a href="http://code.google.com/p/gbif-providertoolkit/source/browse/#svn%2Ftrunk%2Fgbif-ipt%2Fsrc%2Fmain%2Fwebapp%2Fstyles"&gt;/styles&lt;/a&gt; folder of your IPT installation:&lt;/p&gt;&lt;code&gt;&lt;a href="http://code.google.com/p/gbif-providertoolkit/source/browse/trunk/gbif-ipt/src/main/webapp/styles/main.css"&gt;/styles/main.css&lt;/a&gt;&lt;br /&gt;
&lt;a href="http://code.google.com/p/gbif-providertoolkit/source/browse/trunk/gbif-ipt/src/main/webapp/styles/logo.jpg"&gt;/styles/logo.jpg&lt;/a&gt;&lt;/code&gt;&lt;br /&gt;
&lt;p&gt;In 15 minutes, my IPT was Canadensys red and had a custom logo:&lt;/p&gt;&lt;a href="http://1.bp.blogspot.com/-r-IDNpooe0M/TiiK408DMcI/AAAAAAAAFsQ/MnydpQzfGUc/s1600/ipt-css.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 279px;" src="http://1.bp.blogspot.com/-r-IDNpooe0M/TiiK408DMcI/AAAAAAAAFsQ/MnydpQzfGUc/s400/ipt-css.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5631904042914623938" /&gt;&lt;/a&gt;&lt;br /&gt;
&lt;h4&gt;Attempt 2 - Editing the FreeMarker files&lt;/h4&gt;&lt;p&gt;Even though my IPT now had its own branding, it was still noticeably different from the other Canadensys websites. The only way I could change that, was by editing the HTML as well. Luckily, the sections I wanted to change were all stored as FreeMarker files in the &lt;a href="http://code.google.com/p/gbif-providertoolkit/source/browse/trunk/gbif-ipt/src/main/webapp/#webapp%2FWEB-INF%2Fpages%2Finc"&gt;/inc&lt;/a&gt; folder:&lt;/p&gt;&lt;code&gt;&lt;a href="http://code.google.com/p/gbif-providertoolkit/source/browse/trunk/gbif-ipt/src/main/webapp/WEB-INF/pages/inc/header.ftl"&gt;/WEB-INF/pages/inc/header.ftl&lt;/a&gt; - the &amp;lt;head&amp;gt; section&lt;br /&gt;
&lt;a href="http://code.google.com/p/gbif-providertoolkit/source/browse/trunk/gbif-ipt/src/main/webapp/WEB-INF/pages/inc/menu.ftl"&gt;/WEB-INF/pages/inc/menu.ftl&lt;/a&gt; - the header, menu and sidebar&lt;br /&gt;
&lt;a href="http://code.google.com/p/gbif-providertoolkit/source/browse/trunk/gbif-ipt/src/main/webapp/WEB-INF/pages/inc/footer.ftl"&gt;/WEB-INF/pages/inc/footer.ftl&lt;/a&gt; - the footer&lt;br /&gt;
&lt;a href="http://code.google.com/p/gbif-providertoolkit/source/browse/trunk/gbif-ipt/src/main/webapp/WEB-INF/pages/inc/header_setup.ftl"&gt;/WEB-INF/pages/inc/header_setup.ftl&lt;/a&gt; - the header during installation&lt;/code&gt;&lt;br /&gt;
&lt;p&gt;I incorporated the HTML structure I use for the VASCAN website into menu.ftl (including the header, menu, container and sidebar), making sure I did not break any of the IPT functionality.&lt;/p&gt;&lt;p&gt;I started doing the same with main.css by replacing chunks of now unused IPT CSS with CSS I copied over from VASCAN, but I quickly realized that this wasn't the best option. Doing so would result in 2 CSS files: one for VASCAN and one for IPT, even though both web applications are under the same &lt;a href="http://data.canadensys.net/"&gt;domain name&lt;/a&gt; with a lot of shared CSS. It would be easier if I only had to maintain a single stylesheet, used by both applications.&lt;/p&gt;&lt;h4&gt;Attempt 3 - One styles folder for the data portal&lt;/h4&gt;&lt;p&gt;I created a /common/styles folder under ROOT, where I placed my single common data portal stylesheet: &lt;a href="http://data.canadensys.net/common/styles/common.css"&gt;/common/styles/common.css&lt;/a&gt;. This would be the CSS file I could use for IPT and VASCAN. I did the same for my &lt;a href="http://en.wikipedia.org/wiki/Favicon"&gt;favicon&lt;/a&gt;: &lt;a href="http://data.canadensys.net/common/images/favicon.png"&gt;/common/images/favicon.png&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;I added a reference to both files in the header.ftl of my IPT (and VASCAN):&lt;/p&gt;&lt;code&gt;&amp;lt;link rel="stylesheet" type="text/css" href="${baseURL}/styles/main.css"&amp;gt;&lt;br /&gt;
&amp;lt;link rel="stylesheet" type="text/css" href="http://data.canadensys.net/common/styles/common.css"&amp;gt;&lt;br /&gt;
&amp;lt;link rel="shortcut icon" href="http://data.canadensys.net/common/images/favicon.png"&amp;gt;&lt;/code&gt;&lt;br /&gt;
&lt;p&gt;As you can see on the first line, I kept the reference to the default IPT stylesheet: &lt;a href="http://data.canadensys.net/ipt/styles/main.css"&gt;${baseURL}/styles/main.css&lt;/a&gt; (it's perfectly fine to reference more than one CSS file). This is where I would keep all the unaltered (=default) IPT CSS. In fact, I'm not removing anything from the default IPT stylesheet, I'm only commenting out the CSS that is unused or conflicting:&lt;/p&gt;&lt;code&gt;/* Unused or conflicting CSS */&lt;/code&gt;&lt;br /&gt;
&lt;p&gt;The advantage of doing so is that I can now easily compare this commented file with changes in the stylesheet of any new IPT version.&lt;/p&gt;&lt;p&gt;After I had done everything, my IPT looked like &lt;a href="http://data.canadensys.net/ipt"&gt;this&lt;/a&gt;:&lt;/p&gt;&lt;a href="http://3.bp.blogspot.com/-4dW3foHFn5M/TinDdmqppCI/AAAAAAAAFs4/LrpSbjGDLfY/s1600/ipt-canadensys.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 291px;" src="http://3.bp.blogspot.com/-4dW3foHFn5M/TinDdmqppCI/AAAAAAAAFs4/LrpSbjGDLfY/s400/ipt-canadensys.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5632247722366575650" /&gt;&lt;/a&gt; &lt;a href="http://2.bp.blogspot.com/-jvynZyJu1_c/TinFjhcAVgI/AAAAAAAAFtA/enwyHpzAxEw/s1600/ipt-canadensys-edit.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 279px;" src="http://2.bp.blogspot.com/-jvynZyJu1_c/TinFjhcAVgI/AAAAAAAAFtA/enwyHpzAxEw/s400/ipt-canadensys-edit.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5632250023065441794" /&gt;&lt;/a&gt;&lt;br /&gt;
&lt;p&gt;My IPT is now sporting the Canadensys header, footer and sidebar (only visible when editing a resource), making it indistinguishable from the other Canadensys websites. It is also using a more readable font-size (13.5px) and a fluid width.&lt;/p&gt;&lt;h4&gt;Closing remarks&lt;/h4&gt;&lt;p&gt;I have (re)designed quite a lot of websites, and very often I have been so frustrated with the HTML and CSS that I just started over from scratch. I didn't have that option here and it wasn't necessary either. I would like to thank the GBIF development team for creating such an easily customizable tool, with logical HTML and CSS. As a reminder, the whole customization has been done by editing only 5 files (links show default files):&lt;/p&gt;&lt;code&gt;&lt;a href="http://code.google.com/p/gbif-providertoolkit/source/browse/trunk/gbif-ipt/src/main/webapp/styles/main.css"&gt;/styles/main.css&lt;/a&gt; (&lt;a href="http://data.canadensys.net/ipt/styles/main.css"&gt;custom file&lt;/a&gt;)&lt;br /&gt;
&lt;a href="http://code.google.com/p/gbif-providertoolkit/source/browse/trunk/gbif-ipt/src/main/webapp/WEB-INF/pages/inc/header.ftl"&gt;/WEB-INF/pages/inc/header.ftl&lt;/a&gt;&lt;br /&gt;
&lt;a href="http://code.google.com/p/gbif-providertoolkit/source/browse/trunk/gbif-ipt/src/main/webapp/WEB-INF/pages/inc/menu.ftl"&gt;/WEB-INF/pages/inc/menu.ftl&lt;/a&gt;&lt;br /&gt;
&lt;a href="http://code.google.com/p/gbif-providertoolkit/source/browse/trunk/gbif-ipt/src/main/webapp/WEB-INF/pages/inc/footer.ftl"&gt;/WEB-INF/pages/inc/footer.ftl&lt;/a&gt;&lt;br /&gt;
&lt;a href="http://code.google.com/p/gbif-providertoolkit/source/browse/trunk/gbif-ipt/src/main/webapp/WEB-INF/pages/inc/header_setup.ftl"&gt;/WEB-INF/pages/inc/header_setup.ftl&lt;/a&gt;&lt;/code&gt;&lt;br /&gt;
&lt;p&gt;&lt;span style="color:red;"&gt;Important&lt;/span&gt;: Remember that installing a new IPT version will overwrite all the customized files, so make sure to back them up! I will try to figure out a way to reapply my customization automatically after an update and post about that experience in a follow-up post. In the meantime, I hope that this post will help others in the customization of their IPT.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-451428525567105295?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/451428525567105295/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/07/customizing-ipt.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/451428525567105295'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/451428525567105295'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/07/customizing-ipt.html' title='Customizing the IPT'/><author><name>Peter Desmet</name><uri>http://www.blogger.com/profile/18072937114922733628</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/-4IKyMNyiCSo/TihR4QTSlgI/AAAAAAAAFqs/ja8HqGmmb_U/s220/peter_300.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-DI0DYZOSSQY/TimGOjg_EgI/AAAAAAAAFsY/qVjcly6EtvE/s72-c/ipt.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-7032650993216123601</id><published>2011-07-18T11:37:00.015+02:00</published><updated>2011-09-22T20:25:49.425+02:00</updated><title type='text'>Working with Scientific Names</title><content type='html'>&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;Dealing with scientific names is an important regular part of our work at GBIF.
Scientific names are highly structured strings with a syntax governed by a nomenclatural code. Unfortunately there are different ones for &lt;/span&gt;&lt;a href="http://ibot.sav.sk/icbn/main.htm"&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;botany&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;, &lt;/span&gt;&lt;a href="http://www.iczn.org/iczn/index.jsp"&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;zoology&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;, &lt;/span&gt;&lt;a href="http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=icnb.TOC"&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;bacteria&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;, &lt;/span&gt;&lt;a href="http://www.ICTVonline.org/codeOfVirusClassification_2002.asp"&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;virus&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt; and even &lt;/span&gt;&lt;a href="http://www.ishs.org/sci/icracpco.htm"&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;cultivar&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt; names.

When dealing with names we often do not know which code or classification they belong to, so we need a representation that is as code-agnostic as possible. GBIF came up with a structured representation which is a compromise focusing on the most common names, primarily botanical and zoological names, which are quite similar in their basic form.

&lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;
&lt;/span&gt;&lt;h3&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;The ParsedName class&lt;/span&gt;&lt;/h3&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;Our ParsedName class provides us with the following core properties:
&lt;/span&gt;&lt;pre&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt; genusOrAbove
infraGeneric
specificEpithet
rankMarker
infraSpecificEpithet
authorship
year
bracketAuthorship
bracketYear&lt;/span&gt;&lt;/pre&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;These allow us to represent regular names properly.
For example &lt;i&gt;Agalinis purpurea var. borealis (Berg.) Peterson 1987&lt;/i&gt; is represented as&lt;/span&gt;&lt;pre&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt; genusOrAbove=Agalinis
specificEpithet=purpurea
rankMarker=var.
infraSpecificEpithet=borealis
authorship=Peterson
year=1987
bracketAuthorship=Berg.&lt;/span&gt;&lt;/pre&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;or the botanical section &lt;i&gt;Maxillaria sect. Multiflorae Christenson&lt;/i&gt; as&lt;/span&gt;&lt;pre&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt; genusOrAbove=Maxillaria
infraGeneric=Multiflorae
rankMarker=sect.
authorship=Christenson&lt;/span&gt;&lt;/pre&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;Especially in botany you often encounter names with authorships for both the species and some infraspecific rank, or names citing more than one infraspecific rank. These names are not formed based on rules or recommendations from the respective codes and we ignore those superfluous parts.
For example &lt;i&gt;Agalinis purpurea (L.) Briton var. borealis (Berg.) Peterson 1987&lt;/i&gt; is represented exactly the same as &lt;i&gt;Agalinis purpurea var. borealis&lt;/i&gt; above. In the case of four-part names like &lt;i&gt;Senecio fuchsii C.C.Gmel. subsp. fuchsii var. expansus (Boiss. &amp;amp; Heldr.) Hayek&lt;/i&gt; only the lowest infraspecific rank is preserved:&lt;/span&gt;&lt;pre&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt; genusOrAbove=Senecio
specificEpithet=fuchsii
rankMarker=var.
infraSpecificEpithet=expansus
authorship=Hayek
bracketAuthorship=Boiss. &amp;amp; Heldr.&lt;/span&gt;&lt;/pre&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;Hybrid names are evil. They come in two flavors, named hybrids and hybrid formulas.&lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;
Named hybrids are not so bad and simply prefix a name part with the multiplication sign ×, the hybrid marker, or prefix the rank marker of infraspecific names with &lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;notho&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;. Strictly speaking, this symbol is not part of the genus or epithet. To represent these notho taxa our ParsedName class contains a property called nothoRank that keeps the rank or part of the name that needs to be marked with the hybrid sign. For example the named hybrid &lt;i&gt;Pyrocrataegus ×willei L.L.Daniel&lt;/i&gt; is represented as&lt;/span&gt;&lt;pre&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt; genusOrAbove=Pyrocrataegus
specificEpithet=willei
authorship=L.L.Daniel
nothoRank=species&lt;/span&gt;&lt;/pre&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;Hybrid formulas such as &lt;i&gt;Agrostis stolonifera L. × Polypogon monspeliensis (L.) Desf.&lt;/i&gt;, &lt;i&gt;Asplenium rhizophyllum × ruta-muraria&lt;/i&gt; or &lt;i&gt;Mentha aquatica L. × M. arvensis L. × M. spicata L.&lt;/i&gt; cannot be represented by our class. In theory a hybrid formula can combine any number of names or name parts, so it's hard to deal with them. Luckily they are not very common and we can afford to live with a complete string representation in those cases.

Yet another "extension" to the botanical code are cultivar names, i.e. names for plants in horticulture. Cultivar names are regular botanical names followed by a cultivar name, usually in English, given in single quotes, for example &lt;i&gt;Cryptomeria japonica 'Elegans'&lt;/i&gt;. To keep track of this we have an additional cultivar property, so that:&lt;/span&gt;&lt;pre&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt; genusOrAbove=Cryptomeria
specificEpithet=japonica
cultivar=Elegans&lt;/span&gt;&lt;/pre&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;In taxonomic works you often have additional information in a name that details the taxonomic concept: the sec reference, most often prefixed by sensu or sec. For example &lt;i&gt;Achillea millefolium sec. Greuter 2009&lt;/i&gt; or &lt;i&gt;Achillea millefolium sensu lato&lt;/i&gt;. In nomenclatural works one frequently encounters nomenclatural notes about the name such as nom.illeg. or nomen nudum.

Both pieces of information are held in our ParsedName class, for example &lt;i&gt;Solanum bifidum Vell. ex Dunal, nomen nudum&lt;/i&gt; becomes&lt;/span&gt;&lt;pre&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt; genusOrAbove=Solanum
specificEpithet=bifidum
authorship=Vell. ex Dunal
nomStatus=nomen nudum&lt;/span&gt;&lt;/pre&gt;&lt;h4&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;Reconstructing name strings&lt;/span&gt;&lt;/h4&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;The ParsedName class provides us with some common methods to build a name string. In many cases you don't want the complete name with all its details, so we offer some popular name string types out of the box and a flexible string builder that lets you specify explicitly which parts to include. The most important methods are:

&lt;/span&gt;&lt;strong&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;canonicalName()&lt;/span&gt;&lt;/strong&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;: builds the canonical name sensu stricto with at most the three name parts (genus, species, infraspecific) and nothing else. No rank markers, hybrid markers or authorship information are included.

&lt;/span&gt;&lt;strong&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;fullName()&lt;/span&gt;&lt;/strong&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;: builds the full name with all details that exist.

&lt;/span&gt;&lt;strong&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;canonicalSpeciesName()&lt;/span&gt;&lt;/strong&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;: builds the canonical binomial in case of species or below, ignoring infraspecific information
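
The way these builders combine the name parts can be sketched in a few lines of Java. This is only a minimal illustration: the class NameSketch and its join helper are hypothetical and are not the actual ParsedName implementation from ecat-common; only the method names canonicalName() and canonicalSpeciesName() are taken from the description above.
&lt;/span&gt;&lt;pre&gt;
```java
// Hypothetical stand-in for a parsed name; not the ecat-common ParsedName class.
class NameSketch {
    String genusOrAbove;
    String specificEpithet;
    String infraSpecificEpithet;

    NameSketch(String genusOrAbove, String specificEpithet, String infraSpecificEpithet) {
        this.genusOrAbove = genusOrAbove;
        this.specificEpithet = specificEpithet;
        this.infraSpecificEpithet = infraSpecificEpithet;
    }

    // Joins the non-null parts with single spaces.
    private static String join(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            if (part == null) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(part);
        }
        return sb.toString();
    }

    // At most three name parts; no rank marker, hybrid marker or authorship.
    String canonicalName() {
        return join(genusOrAbove, specificEpithet, infraSpecificEpithet);
    }

    // The binomial only, dropping any infraspecific epithet.
    String canonicalSpeciesName() {
        return join(genusOrAbove, specificEpithet);
    }

    public static void main(String[] args) {
        NameSketch n = new NameSketch("Agalinis", "purpurea", "borealis");
        System.out.println(n.canonicalName());
        System.out.println(n.canonicalSpeciesName());
    }
}
```
&lt;/pre&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;For Agalinis purpurea var. borealis the sketch yields the three-part name from canonicalName() and only the binomial from canonicalSpeciesName().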

&lt;/span&gt;&lt;h3&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;
&lt;/span&gt;&lt;/h3&gt;&lt;h3&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;The name parser&lt;/span&gt;&lt;/h3&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;We decided at GBIF that sharing the complete name string is more reliable than trusting already parsed names. But parsing names by hand is a very tedious enterprise, so we needed to develop a parser that can handle the vast majority of all names that we encounter. After a short experimental phase with BNF and other grammars to automatically build a parser, we decided to go back to the start and build something based on good old regular expressions and plain Java code. The parser has now evolved for nearly 2 years and it might be the best unit-tested class we have ever written at GBIF. It is interesting to take a look at the range of &lt;/span&gt;&lt;a href="http://gbif-ecat.googlecode.com/svn/trunk/ecat-common/src/test/resources/scientific_names.txt"&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;names we use for testing&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt; and also &lt;/span&gt;&lt;a href="http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/test/java/org/gbif/ecat/parser/NameParserTest.java"&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;the tests themselves&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt; to make sure it's working as expected.

&lt;/span&gt;&lt;h4&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;Parsing names&lt;/span&gt;&lt;/h4&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;Using the NameParser in code is trivial. Once you create a parser instance all you need to do is call the parser.parse(String name) method to get your ParsedName object. As authorships are the hardest, i.e. most variable, part of a name, we have actually implemented two parsers internally: one that tries to parse the complete string and a fallback that ignores authorships and only extracts the canonical name. The authorsParsed flag on a ParsedName instance tells you if the simpler fallback parser has been used.

If a name cannot be parsed at all an UnparsableException is thrown. This is also the case for viral names and hybrid formulas, as the ParsedName class cannot treat these names. The exception itself actually has an enumerated property that you can use to know if the exception has been caused by a virus, hybrid or other name.

As of today, of the 10,114,724 unique name strings that we have indexed, only 116,000 (about 1.1%) couldn't be parsed, and these are mostly hybrid formulas.

&lt;/span&gt;&lt;h4&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;Normalisation&lt;/span&gt;&lt;/h4&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;Apart from the parse method the name parser also exposes a normalisation method to normalise any whitespace, commas, brackets and hybrid markers found in name strings. The parser uses this method internally before the actual parsing takes place. The string is trimmed, runs of whitespace are collapsed to a single space, spaces before commas are removed and a single space is enforced after each comma. Similarly, whitespace is added before opening brackets but removed inside them. Instead of the proper multiplication sign for hybrids, often a simple &lt;span class="Apple-style-span"   style="  -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; font-family:Times;font-size:medium;"&gt;×&lt;/span&gt; followed by whitespace is used, which is also replaced by this method.
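
The whitespace, comma and bracket rules can be illustrated with a few regular expression replacements. This is a rough sketch under our own assumptions, not the ecat-common code: the class NormaliseSketch and its patterns are hypothetical, and the hybrid marker handling is left out.
&lt;/span&gt;&lt;pre&gt;
```java
// Hypothetical sketch of the whitespace rules; not the ecat-common implementation.
class NormaliseSketch {
    static String normalise(String name) {
        String s = name.trim();
        s = s.replaceAll("\\s+", " ");     // collapse runs of whitespace to one space
        s = s.replaceAll(" ?, ?", ", ");    // no space before a comma, exactly one after
        s = s.replaceAll(" ?\\( ?", " ("); // one space before "(", none directly inside
        s = s.replaceAll(" \\)", ")");     // no space directly before ")"
        return s.trim();
    }

    public static void main(String[] args) {
        System.out.println(normalise("  Agalinis   purpurea( L. ) Briton ,1987 "));
    }
}
```
&lt;/pre&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;The order of the replacements matters in this sketch: whitespace is collapsed first so that the later patterns only ever have to deal with single spaces.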

&lt;/span&gt;&lt;h4&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;Parsing Webservices&lt;/span&gt;&lt;/h4&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;GBIF offers a free &lt;/span&gt;&lt;a href="http://tools.gbif.org/nameparser/api.do"&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;webservice API to parse names&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt; using our name parser. The service returns the parsed results as JSON and accepts single names as well as batches of names. For larger input data you have to use a POST request (GET requests are restricted in length), but for a few names a simple GET request with the names URL-encoded in the parameter "names" is also accepted. Multiple names can be concatenated with the pipe | symbol.

To parse the two names Symphoricarpos albus (L.) S.F.Blake cv. 'Turesson' and Stagonospora polyspora M.T. Lucas &amp;amp; Sousa da Camara 1934 the parser service call looks like this:
&lt;/span&gt;&lt;a href="http://ecat-dev.gbif.org/ws/parser?names=Symphoricarpos%20albus%20(L.)%20S.F.Blake%20cv.%20'Turesson'|Stagonospora%20polyspora%20M.T.%20Lucas%20%26%20Sousa%20da%20Camara%201934"&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;http://ecat-dev.gbif.org/ws/parser?names=Symphoricarpos%20albus%20(L.)%20S.F.Blake%20cv.%20'Turesson'|Stagonospora%20polyspora%20M.T.%20Lucas%20%26%20Sousa%20da%20Camara%201934&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;
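
Building such a GET URL in code could look roughly like the sketch below. The helper class ParserUrlSketch is hypothetical; only the endpoint, the "names" parameter and the pipe separator are taken from the example above. Note that java.net.URLEncoder writes spaces as + rather than %20 and also encodes the pipe itself; that the service accepts both forms is an assumption here.
&lt;/span&gt;&lt;pre&gt;
```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; only the endpoint, the "names" parameter and the pipe
// separator come from the post.
class ParserUrlSketch {
    static final String ENDPOINT = "http://ecat-dev.gbif.org/ws/parser";

    static String buildUrl(String... names) {
        // Multiple names are concatenated with the pipe symbol before encoding.
        String joined = String.join("|", names);
        try {
            return ENDPOINT + "?names=" + URLEncoder.encode(joined, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("Agalinis purpurea var. borealis", "Cryptomeria japonica 'Elegans'"));
    }
}
```
&lt;/pre&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;For larger batches the same joined string would go into the body of a POST request instead.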

For manual use we also provide a simple web client for this service, with a form to enter names to be parsed; it also accepts uploaded files with one name per line. It is available as part of our tools collection at &lt;/span&gt;&lt;a href="http://tools.gbif.org/nameparser/"&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;http://tools.gbif.org/nameparser/&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;.

&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;
&lt;/span&gt;&lt;h3&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;Source Code&lt;/span&gt;&lt;/h3&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;All the code is part of a small Java library that we call ecat-common.
It is freely available under Apache 2 licensing, like most of our GBIF work, and you are invited to use our code at &lt;/span&gt;&lt;a href="http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/"&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;, download the latest jar from our Maven repository or include it in your Maven dependencies like this:
&lt;/span&gt;

&lt;pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;padding:0px;color:#000000;text-align:left;line-height:20px;"&gt;&lt;code style="color:#000000;word-wrap:normal;"&gt;&amp;lt;repositories&amp;gt;
  &amp;lt;repository&amp;gt;
    &amp;lt;id&amp;gt;gbif-all&amp;lt;/id&amp;gt;
    &amp;lt;url&amp;gt;http://repository.gbif.org/content/groups/gbif&amp;lt;/url&amp;gt;
  &amp;lt;/repository&amp;gt;
&amp;lt;/repositories&amp;gt;
&amp;lt;dependencies&amp;gt;
  &amp;lt;dependency&amp;gt;
    &amp;lt;groupId&amp;gt;org.gbif&amp;lt;/groupId&amp;gt;
    &amp;lt;artifactId&amp;gt;ecat-common&amp;lt;/artifactId&amp;gt;
    &amp;lt;version&amp;gt;1.5.1-SNAPSHOT&amp;lt;/version&amp;gt;
  &amp;lt;/dependency&amp;gt;
&amp;lt;/dependencies&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-7032650993216123601?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/7032650993216123601/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/07/working-with-scientific-names.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/7032650993216123601'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/7032650993216123601'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/07/working-with-scientific-names.html' title='Working with Scientific Names'/><author><name>Markus Döring</name><uri>https://profiles.google.com/114975314573163797395</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-5529178058385406557</id><published>2011-07-08T11:44:00.013+02:00</published><updated>2011-07-08T11:52:24.750+02:00</updated><title type='text'>Are you the keymaster?</title><content type='html'>&lt;br /&gt;
As I &lt;a href="http://gbif.blogspot.com/2011/06/buzzword-compliance.html"&gt;mentioned previously&lt;/a&gt; I'm starting work on evaluating &lt;a href="http://hbase.apache.org/"&gt;HBase&lt;/a&gt; for our occurrence record needs.&amp;nbsp; In the last little while that has meant coming up with a key structure and/or schema that optimizes reads for one major use case of the &lt;a href="http://data.gbif.org/"&gt;GBIF data portal&lt;/a&gt; - a user request to download an entire record set, including raw records as well as interpreted.&amp;nbsp; The most common form of this request looks like "Give me all records for &amp;lt;taxonomic rank&amp;gt; &amp;lt;name&amp;gt;", eg "Give me all records for Family Felidae".&lt;br /&gt;
&lt;br /&gt;
So far I'm concentrating more on the lookup and retrieval rather than writing or data storage optimization, so the schema I'm using is two column families, one for verbatim columns, one for interpreted (for a total of about 70 columns).&amp;nbsp; The question of which key to use for HTable's single indexed column is what we need to figure out.&amp;nbsp; For all these examples we assume we know the backbone taxonomy id of the taxon concept in question (ie Family Felidae is id 123456).&lt;br /&gt;
&lt;br /&gt;
&lt;span style="font-size: small;"&gt;&lt;b&gt;Option 1&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
Key: native record's unique id&lt;br /&gt;
&lt;br /&gt;
Query style: The simplest way of finding all records that belong to Family Felidae is to scan all of them, and check against the Family column from the interpreted column family.&amp;nbsp; The code looks like this:&lt;br /&gt;
&lt;br /&gt;
&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;span style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; HTable table = new HTable(HBaseConfiguration.create(), tableName);&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; byte[] cf = Bytes.toBytes(colFam);&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; byte[] colName = Bytes.toBytes(col);&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; byte[] value = Bytes.toBytes(val);&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; Scan scan = new Scan();&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; ResultScanner scanner = table.getScanner(scan);&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; for (Result result : scanner) {&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; byte[] testVal = result.getValue(cf, colName);&lt;br /&gt;
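&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; if (testVal != null &amp;amp;&amp;amp; Bytes.compareTo(testVal, value) == 0) doSomething(result); // doSomething is a placeholder; getValue returns null if the column is missing&lt;/span&gt;&lt;br /&gt;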
&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;
Because this means transferring all columns of every row to the client before checking if it's even a record we want, it's incredibly wasteful and therefore very slow.&amp;nbsp; It's a Bad Idea.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Option 2&lt;/b&gt;&lt;br /&gt;
Key: native record's unique id&lt;br /&gt;
&lt;br /&gt;
Query style: HBase provides a SingleColumnValueFilter that executes our equality check on the server side, thereby saving the transfer of unwanted columns to the client.&amp;nbsp; Here's the code:&lt;br /&gt;
&lt;br /&gt;
&lt;span style="font-size: small;"&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; HTable table = new HTable(HBaseConfiguration.create(), tableName);&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; byte[] cf = Bytes.toBytes(colFam);&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; byte[] colName = Bytes.toBytes(col);&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; byte[] value = Bytes.toBytes(val);&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; SingleColumnValueFilter valFilter = new SingleColumnValueFilter(cf, colName, CompareFilter.CompareOp.EQUAL, value);&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; valFilter.setFilterIfMissing(true);&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Scan scan = new Scan();&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier 
New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;/span&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; scan.setFilter(valFilter);&lt;/span&gt;&lt;br style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;" /&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ResultScanner scanner = table.getScanner(scan);&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
This is about as good as it gets until we start getting clever :)&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Option 3&lt;/b&gt;&lt;br /&gt;
Key: concatenation of nub-taxonomy "left" with native record's unique id&lt;br /&gt;
&lt;br /&gt;
Query style:&amp;nbsp; We know that a taxonomy is a tree, and our backbone taxonomy is a well behaved (ie true) tree.&amp;nbsp; We can use &lt;a href="http://en.wikipedia.org/wiki/Nested_set_model"&gt;nested sets&lt;/a&gt; to make our "get all children of node x" query much faster, which Markus realized some time ago, and so thoughtfully included the left and right calculation as part of the backbone taxonomy creation.&amp;nbsp; Individual occurrences of the same taxon will share the same backbone taxonomy id, as well as the left and right.&amp;nbsp; One property of nested sets not mentioned in the Wikipedia article is that when the records are ordered by their lefts, the query of "give me all records where left is between parent left and parent right" becomes "give me all rows starting with parent left and ending with parent right", which in HBase terms is much more efficient since we're doing a sequential read from disk without any seeking.&amp;nbsp; So we build the key as leftId_uniqueId, and query as follows (note that startRow is inclusive and stopRow is exclusive, and we want exclusive on both ends):&lt;br /&gt;
&lt;br /&gt;
&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;br /&gt;
&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; HTable table = new HTable(HBaseConfiguration.create(), tableName);&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; Scan scan = new Scan();&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; scan.setStartRow(Bytes.toBytes((left + 1) + "_"));&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; scan.setStopRow(Bytes.toBytes(right + "_"));&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; ResultScanner scanner = table.getScanner(scan);&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;
Which looks pretty good, and is in fact about 40% faster than Option 2 (on average - depends on the size of the query result).&amp;nbsp; But on closer inspection, there's a problem.&amp;nbsp; By concatenating the left and unique ids with an underscore as separator, we've created a String, and now HBase is doing its usual lexicographical ordering, which means our rows aren't ordered as we'd hoped.&amp;nbsp; For example, this is the ordering we expect:&lt;br /&gt;
&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;br /&gt;
&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: small;"&gt;1_1234&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: small;"&gt;2_3458&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: small;"&gt;3_3298&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: small;"&gt;4_9378&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: small;"&gt;5_3435&lt;/span&gt;&lt;/div&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;10_5439 &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-size: small;"&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;100_9763&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
but because these are strings, HBase orders them as:&lt;br /&gt;
&lt;br /&gt;
&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: small;"&gt;1_1234&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;10_5439&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: small;"&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;100_9763 &lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: small;"&gt;&lt;br /&gt;
&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: small;"&gt;2_3458&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: small;"&gt;3_3298&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: small;"&gt;4_9378&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: small;"&gt;5_3435&lt;/span&gt;&lt;br /&gt;
&lt;/div&gt;&lt;br /&gt;
There isn't much we can do here but filter on the client side.&amp;nbsp; For every key, we can extract the left portion, convert it to a Long, and compare it to our range, discarding those that don't match.&amp;nbsp; It sounds ugly, and it is, but it doesn't add anything appreciable to the processing time, so it would work.&lt;br /&gt;
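&lt;br /&gt;
A minimal sketch of that client-side filter, assuming keys of the form left_uniqueId as in the listings above. The class and method names are hypothetical (this is not the code used in the tests), and the inclusive bounds are an assumption made for illustration:&lt;br /&gt;

```java
import java.util.Arrays;

/** Hypothetical helper: keeps only row keys whose numeric "left" prefix is in range. */
final class LeftKeyFilter {

    /** Parses the part before the first underscore, e.g. "10_5439" gives 10. */
    static long leftOf(String rowKey) {
        return Long.parseLong(rowKey.substring(0, rowKey.indexOf('_')));
    }

    /** True when left lies between lo and hi, both bounds inclusive (an assumption). */
    static boolean inRange(long left, long lo, long hi) {
        return left >= lo ? hi >= left : false;
    }

    /** Discards every key whose left falls outside the requested range. */
    static String[] filterByLeft(String[] rowKeys, long lo, long hi) {
        return Arrays.stream(rowKeys)
                .filter(key -> inRange(leftOf(key), lo, hi))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Keys in the lexicographic order a scan over string keys would return them.
        String[] scanned = {"1_1234", "10_5439", "100_9763", "2_3458", "3_3298", "4_9378", "5_3435"};
        // Keep only the rows whose left is between 2 and 10.
        System.out.println(String.join(", ", filterByLeft(scanned, 2, 10)));
    }
}
```

The parse and compare per key is cheap, which fits the observation that the filter adds nothing appreciable to the processing time.&lt;br /&gt;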
&lt;br /&gt;
Except that there's a more fundamental problem: if we embed the left in our primary key, it only takes one node added to the backbone taxonomy to force an update in half of all the lefts (on average), which means all of our primary keys get rewritten.&amp;nbsp; At 300 million records and growing, that's not an option.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;span style="font-size: small;"&gt;Option 4&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;
Key: native record's unique id&lt;br /&gt;
Secondary index: left to list of unique ids&lt;br /&gt;
&lt;br /&gt;
Query style: Following on from Option 3, we can build a second table that will serve as a secondary index.&amp;nbsp; We use the left as a numeric key (which gives us automatic, correct ordering) and write each corresponding unique occurrence id as a new column in the row.&amp;nbsp; Then we can do a proper range query on the lefts, and generate a Get for each distinct id.&amp;nbsp; Unfortunately building that index is quite slow, and it is still building as I write this, so I haven't been able to test the lookups yet.&amp;nbsp; &lt;br /&gt;
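&lt;br /&gt;
To illustrate why a numeric key orders correctly where the string key did not, here is a minimal sketch. It assumes the key is stored as 8 big-endian bytes (the layout that, as far as I know, HBase's Bytes.toBytes(long) produces) and that keys are compared byte by byte as unsigned values. It uses java.nio.ByteBuffer as a stand-in so that it runs without HBase, and all names in it are hypothetical:&lt;br /&gt;

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical demo: fixed-width big-endian keys sort in numeric order. Needs Java 9 or later. */
final class NumericKeyOrder {

    /** Encodes a non-negative long as 8 big-endian bytes (ByteBuffer's default byte order). */
    static byte[] toKey(long left) {
        return ByteBuffer.allocate(Long.BYTES).putLong(left).array();
    }

    /** Decodes a key produced by toKey. */
    static long fromKey(byte[] key) {
        return ByteBuffer.wrap(key).getLong();
    }

    /** Sorts the lefts by their encoded keys, comparing the bytes as unsigned values. */
    static long[] sortedByEncodedKey(long[] lefts) {
        byte[][] keys = Arrays.stream(lefts)
                .mapToObj(NumericKeyOrder::toKey)
                .toArray(byte[][]::new);
        Arrays.sort(keys, (a, b) -> Arrays.compareUnsigned(a, b));
        return Arrays.stream(keys).mapToLong(NumericKeyOrder::fromKey).toArray();
    }

    public static void main(String[] args) {
        long[] lefts = {100, 2, 10, 1, 5};
        // Byte-wise ordering of the encoded keys matches numeric ordering of the lefts.
        System.out.println(Arrays.toString(sortedByEncodedKey(lefts)));
    }
}
```

This only holds for non-negative values, because the sign bit of a negative long would sort it after every positive one under unsigned comparison.&lt;br /&gt;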
&lt;br /&gt;
For those keeping score at home, I'm using HBase 0.89 (from CDH3b4), which doesn't have built-in secondary indexes (which 0.19 and 0.20 did).&lt;br /&gt;
&lt;br /&gt;
I'll write more when I've learned more, and welcome any tips or suggestions you might have to aid in my quest!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-5529178058385406557?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/5529178058385406557/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/07/are-you-keymaster.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/5529178058385406557'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/5529178058385406557'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/07/are-you-keymaster.html' title='Are you the keymaster?'/><author><name>Oliver Meyn</name><uri>http://www.blogger.com/profile/04706642473308341930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/-6_sfRK7-oU0/TaWanB_bQ3I/AAAAAAAAAAM/RcPMC0E9E3k/s220/oliver_amory_profile.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-1069052424541806088</id><published>2011-06-30T15:14:00.014+02:00</published><updated>2011-08-12T09:26:26.706+02:00</updated><title type='text'>The organisational structure and the endorsement process - if you're an IPT administrator</title><content type='html'>&lt;p&gt;During the &lt;a href="http://community.gbif.org/pg/groups/3529/gbif-ipt-helpdesk-and-training-experts/"&gt;Expert Workshop&lt;/a&gt; last week in Copenhagen, we had a session talking about configuring IPT to reflect different organisational structures. 
I think that part is worth explaining in a blog post here, since some of our readers would like to help deploy the IPT in the GBIF Network. It usually starts with questions like these:&lt;/p&gt;

&lt;p&gt;Why am I asked for the password of the organisation I choose when registering an IPT? Why am I asked again when I want to add another organisation?&lt;/p&gt;

&lt;p&gt;The short answer is that having the organisation's password shows that you have that organisation's permission, and that the organisation is aware that you're registering an IPT against it.&lt;/p&gt;

&lt;p&gt;So, why is this the way of registering an organisation?&lt;/p&gt;

&lt;h4&gt;The organisational structure&lt;/h4&gt;

&lt;p&gt;Remember that the GBIF Network is not only a common pool for sharing biodiversity data; to form such a pool, it is also a social network in which biodiversity data publishers interact. The IPT, which serves as the technical skeleton of the network, needs to be tied to the organisational structure in order to properly accredit institutions or individuals. It does so by reflecting the relationships among organisations, hosted resources, IPTs and endorsing GBIF nodes/participants. These relationships can be seen in &lt;a href="http://gbrds.gbif.org/index"&gt;the GBIF Registry&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Take &lt;a href="http://gbrds.gbif.org/browse/agent?uuid=a957a663-2f17-415f-b1c8-5cf6398df8ed"&gt;VertNet Hosting&lt;/a&gt; as an example. VertNet Hosting is an IPT installation that hosts data resources authored by the users of that IPT. In the second half of its page on the Registry, you see:&lt;/p&gt;

&lt;pre&gt;VertNet Hosting publishes &lt;a href="http://gbrds.gbif.org/browse/agent?uuid=29d54fb6-ed46-4a75-9697-55b73e63beed"&gt;uafmc_fish&lt;/a&gt;&lt;br/&gt;
&lt;a href="http://gbrds.gbif.org/browse/agent?uuid=b554c320-0560-11d8-b851-b8a03c50a862" title="GBRDS-Registry"&gt;University of Kansas Biodiversity Research Center&lt;/a&gt; has technical installation &lt;a href="http://gbrds.gbif.org/browse/agent?uuid=a957a663-2f17-415f-b1c8-5cf6398df8ed" title="GBRDS-Registry"&gt;VertNet Hosting&lt;/a&gt;&lt;/pre&gt;

&lt;p&gt;This means the IPT has a public resource called &lt;em&gt;uafmc_fish&lt;/em&gt;, which has been registered with the GBIF Network, and the IPT itself is registered against &lt;em&gt;University of Kansas Biodiversity Research Center&lt;/em&gt;, which is the IPT's hosting organisation.&lt;/p&gt;

&lt;p&gt;If you click the &lt;em&gt;uafmc_fish&lt;/em&gt; link on the Registry page, it says:&lt;/p&gt;

&lt;pre&gt;&lt;a href="http://gbrds.gbif.org/browse/agent?uuid=9a367b8c-22dd-402d-9161-d3c64c6d6a94" title="GBRDS-Registry"&gt;University of Arkansas Collections Facility, UAFMC&lt;/a&gt; has data resource &lt;a href="http://gbrds.gbif.org/browse/agent?uuid=29d54fb6-ed46-4a75-9697-55b73e63beed" title="GBRDS-Registry"&gt;uafmc_fish&lt;/a&gt;&lt;/pre&gt;

&lt;p&gt;That means the uafmc_fish resource has been registered against University of Arkansas Collections Facility, UAFMC, which must have been added to the VertNet Hosting IPT so that users could choose it in addition to University of Kansas Biodiversity Research Center.&lt;/p&gt;

&lt;p&gt;So the administrator of the VertNet Hosting IPT, in this case Ms. Laura Russell, was asked for two different passwords: one to register her IPT and one to add the other organisation.&lt;/p&gt;

&lt;p&gt;These activities can be summarised in this graphic:&lt;/p&gt;

&lt;a href="http://1.bp.blogspot.com/-AUqqa7RDlfI/Tgx8kp74uQI/AAAAAAAAABs/j8Ro1H4jN_Q/s1600/vertnet_concept.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5624007003853076738" src="http://1.bp.blogspot.com/-AUqqa7RDlfI/Tgx8kp74uQI/AAAAAAAAABs/j8Ro1H4jN_Q/s320/vertnet_concept.png" style="cursor: hand; cursor: pointer; display: block; height: 320px; margin: 0px auto 10px; text-align: center; width: 267px;" /&gt;&lt;/a&gt;

&lt;p&gt;In other words, with the VertNet Hosting IPT registered against University of Kansas Biodiversity Research Center, uafmc_fish can be hosted by the IPT but tied to University of Arkansas Collections Facility, UAFMC. This means an organisation doesn't necessarily need the capacity to install an IPT in order to have its resources published: any other IPT can, with the agreement of the hosting organisations, host resources for others. This is why the password is required.&lt;/p&gt;

&lt;p&gt;The relationships among these units are then reflected in the Registry as:&lt;/p&gt;

&lt;a href="http://3.bp.blogspot.com/-ALDHXHFgjrY/Tgx9GR9Tm0I/AAAAAAAAAB0/_mZrEfuatiE/s1600/vertnet_registry.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5624007581532134210" src="http://3.bp.blogspot.com/-ALDHXHFgjrY/Tgx9GR9Tm0I/AAAAAAAAAB0/_mZrEfuatiE/s320/vertnet_registry.png" style="cursor: hand; cursor: pointer; display: block; height: 320px; margin: 0px auto 10px; text-align: center; width: 269px;" /&gt;&lt;/a&gt;

&lt;p&gt;By this design, the relationships and accreditation within the GBIF Network are maintained.&lt;/p&gt;

&lt;p&gt;Now, beyond these graphics, if you click &lt;a href="http://gbrds.gbif.org/browse/agent?uuid=b554c320-0560-11d8-b851-b8a03c50a862" title="GBRDS-Registry"&gt;University of Kansas Biodiversity Research Center&lt;/a&gt; and &lt;a href="http://gbrds.gbif.org/browse/agent?uuid=9a367b8c-22dd-402d-9161-d3c64c6d6a94" title="GBRDS-Registry"&gt;University of Arkansas Collections Facility, UAFMC&lt;/a&gt; in the Registry, you'll see these two statements, one on each page:&lt;/p&gt;

&lt;pre&gt;
&lt;a href="http://gbrds.gbif.org/browse/agent?uuid=8618c64a-93e0-4300-b546-7249e5148ed2" title="GBRDS-Registry"&gt;USA&lt;/a&gt; endorses &lt;a href="http://gbrds.gbif.org/browse/agent?uuid=b554c320-0560-11d8-b851-b8a03c50a862" title="GBRDS-Registry"&gt;University of Kansas Biodiversity Research Center&lt;/a&gt;&lt;br/&gt;
&lt;a href="http://gbrds.gbif.org/browse/agent?uuid=8618c64a-93e0-4300-b546-7249e5148ed2" title="GBRDS-Registry"&gt;USA&lt;/a&gt; endorses &lt;a href="http://gbrds.gbif.org/browse/agent?uuid=9a367b8c-22dd-402d-9161-d3c64c6d6a94" title="GBRDS-Registry"&gt;University of Arkansas Collections Facility, UAFMC&lt;/a&gt;&lt;/pre&gt;

&lt;p&gt;That means these two organisations are endorsed by the USA, which is a member of GBIF. Because they are endorsed, they are available in the IPT's organisation drop-down list.&lt;/p&gt;

&lt;h4&gt;The endorsement process&lt;/h4&gt;

&lt;p&gt;Chances are, you're looking for your organisation in that list and you are pretty sure it's not there. What should you do?&lt;/p&gt;

&lt;a href="http://4.bp.blogspot.com/-WODFgeld8CU/Tgx9PCM1HJI/AAAAAAAAAB8/I0zLv7c_mvY/s1600/theEndorsement.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5624007731921099922" src="http://4.bp.blogspot.com/-WODFgeld8CU/Tgx9PCM1HJI/AAAAAAAAAB8/I0zLv7c_mvY/s320/theEndorsement.png" style="cursor: hand; cursor: pointer; display: block; height: 143px; margin: 0px auto 10px; text-align: center; width: 320px;" /&gt;&lt;/a&gt;

&lt;p&gt;Whether you're registering an IPT or one of your users is requesting an organisation that is not available yet, you should talk to people at the administrative level and seek endorsement by a GBIF member. Normally, &amp;#x278A; you or a representative of your institution writes to helpdesk@gbif.org, providing some background information and at least one technical contact, and &amp;#x278B; the helpdesk looks for an appropriate node to endorse you. Upon &amp;#x278C; positive feedback from an endorsing node, &amp;#x278D; the helpdesk will inform you that your organisation is available and give you its password.&lt;/p&gt;

&lt;p&gt;This process is administrative because we rely on the social level to ensure responsibility for the registered IPT and its published resources. It also ensures that accreditation goes to the correct persons or organisations.&lt;/p&gt;

&lt;p&gt;Hope this helps IPT experts.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-1069052424541806088?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/1069052424541806088/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/06/organisational-structure-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/1069052424541806088'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/1069052424541806088'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/06/organisational-structure-and.html' title='The organisational structure and the endorsement process - if you&apos;re an IPT administrator'/><author><name>Burke Chih-Jen Ko</name><uri>http://www.blogger.com/profile/09806308970203169452</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/-JHV15rSIJlw/Td_9T-7V2iI/AAAAAAAAABI/TamywweE4I4/s220/P1282909r_icon.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-AUqqa7RDlfI/Tgx8kp74uQI/AAAAAAAAABs/j8Ro1H4jN_Q/s72-c/vertnet_concept.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-6239126323617194222</id><published>2011-06-24T21:10:00.002+02:00</published><updated>2011-06-24T21:19:53.191+02:00</updated><title type='text'>Synchronizing occurrence records</title><content type='html'>&lt;br /&gt;
&lt;div class="MsoNormal"&gt;
&lt;span lang="ES-TRAD"&gt;This
post should be read as a follow-up to&amp;nbsp;Tim’s post about &lt;/span&gt;&lt;span lang="ES-TRAD"&gt;&lt;a href="http://gbif.blogspot.com/2011/05/decoupling-components.html"&gt;Decoupling Components&lt;/a&gt;&lt;/span&gt;&lt;span lang="ES-TRAD"&gt;,
as it takes for granted some information given there. &lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;span lang="ES-TRAD"&gt;During
the last week, I’ve been learning&lt;/span&gt;/working with some
technologies that are related to the decoupling of components we want to
accomplish. &amp;nbsp;Specifically, I’ve
been working with the &lt;a href="http://gbif.blogspot.com/2011/05/decoupling-components.html"&gt;Synchronizer&lt;/a&gt;&amp;nbsp;component of the event driven architecture Tim described. &lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
Right now, the synchronizer takes the responses from the
resources and gets those responses into the occurrence store (MySQL as of today, but not final). But there is
more to it: the responses from the resources typically come from DiGIR, TAPIR
and BioCASe providers, which render their responses in XML. So how does
all this data end up in the occurrence store&lt;span lang="ES-TRAD"&gt;? Well, fortunately my colleague Oliver Meyn
wrote a &lt;/span&gt;&lt;span lang="ES-TRAD"&gt;&lt;a href="http://code.google.com/p/gbif-occurrencestore/source/browse/#svn%2Ftrunk%2Foccurrence-parser"&gt;very useful library&lt;/a&gt;&lt;/span&gt;&lt;span lang="ES-TRAD"&gt; to unmarshall all these XML chunks into nice and simple objects, so on my side
I just have to worry about calling all those getter methods. Also, the synchronizer
acts as a listener on a message queue, which stores all the resource responses
that need to be handled. All the queue’s nuts &amp;amp; bolts were worked out by Tim
and Federico Méndez. So yes, it has been a nice collaboration among many
developers inside the Secretariat and it’s always nice to have this kind of head
start from your colleagues :) &lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;span lang="ES-TRAD"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;span lang="ES-TRAD"&gt;So,
getting back to my duties, I have to take all these objects and start
populating the occurrence target store taking some precautions (e.g.: not
inserting duplicated occurrence records, checking that some mandatory fields
are not null and other requirements). &lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;span lang="ES-TRAD"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;span lang="ES-TRAD"&gt;For
now, it’s in development mode, but I have managed to run some tests and
extract some metrics that show current performance and definitely leave room for improvement. For the tests, the message queue is first loaded with some responses that need to be handled,
and afterwards I execute the synchronizer, which starts populating the occurrence
store. All these tests are done on my MacBook Pro, so response times will definitely
improve on a better box. So here are the metrics:&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;span lang="ES-TRAD"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;span lang="ES-TRAD"&gt;&lt;b&gt;Environment: &lt;/b&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;MacBook
Pro 2.4 GHz Core2 Duo (4GB Memory)&lt;/li&gt;
&lt;li&gt;Mac OS
X 10.5.8 (Leopard)&lt;/li&gt;
&lt;li&gt;Message
Queue &amp;amp; MySQL DB reside on different machines, but same intranet.&lt;/li&gt;
&lt;li&gt;&lt;span lang="ES-TRAD"&gt;&lt;i&gt;Threads&lt;/i&gt;&lt;/span&gt;&lt;span lang="ES-TRAD"&gt;: the synchronizer spawns 5 threads to process the queue elements.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span lang="ES-TRAD"&gt;&lt;i&gt;Message
queue:&lt;/i&gt;&lt;/span&gt;&lt;span lang="ES-TRAD"&gt; loaded with 552 responses (some responses are just empty to emulate a real world scenario).&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span lang="ES-TRAD"&gt;&lt;i&gt;Records in total: &lt;/i&gt;&lt;b&gt;70,326&lt;/b&gt;&lt;/span&gt;&lt;span lang="ES-TRAD"&gt;
occurrence records across all responses&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;br /&gt;
&lt;div class="MsoNormal"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;span lang="ES-TRAD"&gt;&lt;o:p&gt;&lt;b&gt;Results Test 1 (without filtering out records):&lt;/b&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Extracting
responses from queue&lt;/li&gt;
&lt;li&gt;Unmarshalling&lt;/li&gt;
&lt;li&gt;Inserting into a MySQL DB&lt;/li&gt;
&lt;li&gt;&lt;b&gt;202022
milliseconds (3 min, 22 secs)&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="MsoNormal"&gt;
&lt;span lang="ES-TRAD"&gt;&lt;o:p&gt;&lt;br /&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;span lang="ES-TRAD"&gt;&lt;o:p&gt;&lt;b&gt;Results Test 2 (filtering out records):&lt;/b&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Extracting
from queue&lt;/li&gt;
&lt;li&gt;Unmarshalling&lt;/li&gt;
&lt;li&gt;Filtering out records (duplicates, mandatory
fields, etc)&lt;/li&gt;
&lt;li&gt;Inserting into MySQL DB&lt;/li&gt;
&lt;li&gt;&lt;b&gt;over 30 minutes... (big FAIL)&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
So, as you can see, there is &lt;b&gt;MUCH&lt;/b&gt; room for improvement. As I have only just joined this project, I need to start down the long and tedious road of debugging why the difference is so huge; obviously the filtering process needs a lot of work. Some obvious solutions come to mind, such as increasing the number of threads and reducing memory consumption, along with other less obvious ones. &amp;nbsp;I will try to keep you readers posted about this, hopefully with some more inspiring metrics, and for sure on a better box.&lt;br /&gt;
&lt;div class="MsoNormal"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
I hope to report further improvements later. See you then.&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="MsoNormal"&gt;
&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-6239126323617194222?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/6239126323617194222/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/06/synchronizing-occurrence-records.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/6239126323617194222'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/6239126323617194222'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/06/synchronizing-occurrence-records.html' title='Synchronizing occurrence records'/><author><name>Jose Cuadra</name><uri>http://www.blogger.com/profile/00591450269169657407</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-2647672092664194180</id><published>2011-06-20T15:20:00.021+02:00</published><updated>2011-06-21T14:24:41.211+02:00</updated><title type='text'>Querying Solr using a pure AJAX application</title><content type='html'>This is the third (and final) post related to the &lt;a href="http://code.google.com/p/gbif-metadata/"&gt;GBIF Metacatalogue Project&lt;/a&gt;. The first 2 were dedicated to explain how the data is harvested and how that information is stored in Apache Solr. Those post can be consulted in:
&lt;ul&gt;&lt;li&gt;&lt;a href="http://gbif.blogspot.com/2011/04/oia-pmh-harvesting-at-gbif.html"&gt;OAI-PMH Harvesting at GBIF&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://gbif.blogspot.com/2011/05/indexing-bio-diversity-metadata-using.html"&gt;Indexing bio-diversity metadata using Solr: schema, import-handlers, custom-transformers&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

One of the nicest features of &lt;a href="http://lucene.apache.org/solr/"&gt;Solr&lt;/a&gt; is that most of its functionality is exposed via a REST API. This API can be used for different operations, such as deleting documents, posting new documents and, more importantly, querying the index. When the index is self-contained (i.e. it doesn't depend on external services or storage to return valuable information), a very thin client application without any mediator is a viable option. In general terms, a "mediator" is a layer that handles the communication between the user interface and Solr; in some cases that layer also manipulates the information before sending it to the user interface. The &lt;a href="http://code.google.com/p/gbif-metadata/"&gt;Metadata Web application&lt;/a&gt; is a perfect example of the scenario just described: it's basically an independent store of documents that can be used to provide free-text and structured search. All the information is collected from several sources and then stored in the Solr index; even the full XML documents are stored there.


&lt;a href="https://github.com/evolvingweb/ajax-solr/"&gt;AJAX Solr&lt;/a&gt; is a JavaScript framework that facilitates querying a Solr server and displaying the results in a web application. For the remote communication it only requires a way of sending requests to Solr; in our case, we used jQuery. In AJAX Solr the information is displayed to the end user by widgets such as facet widgets, result lists, cloud widgets, etc.

&lt;h3&gt;Widgets&lt;/h3&gt;
In this context, widgets are user interface components whose functionality doesn't depend on other widgets; each one has a specific responsibility.
All the communication (between Solr and the UI) is handled by the &lt;a href="http://evolvingweb.github.com/ajax-solr/docs/symbols/AjaxSolr.Manager.html"&gt;Manager&lt;/a&gt;, whose main responsibility is to send the requests and communicate the responses to the widgets. The image below shows some example widgets:

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/-QDpiWebBxxk/TgBmFA_MFGI/AAAAAAAAABQ/06EY1i53r7E/s1600/metacataloguewidgets.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 152px;" src="http://3.bp.blogspot.com/-QDpiWebBxxk/TgBmFA_MFGI/AAAAAAAAABQ/06EY1i53r7E/s400/metacataloguewidgets.png" alt="" id="BLOGGER_PHOTO_ID_5620604571308790882" border="0" /&gt;&lt;/a&gt;

From the implementation point of view, the code below shows how the manager is created and how widgets are attached to it:
&lt;pre class="brush:javascript"&gt;
$(function () {
  Manager = new AjaxSolr.Manager({
    solrUrl: solrServerUrl
  });
  /* Adds listener widgets to the Manager */
  Manager.addWidget(new AjaxSolr.ResultWidget({
    id: 'result',
    target: '#docs' // &amp;lt;-- element where the result will be displayed
  }));
  Manager.addWidget(new AjaxSolr.PagerWidget({
    id: 'pager',
    target: '#pager',
    prevLabel: '&amp;lt;',
    nextLabel: '&amp;gt;... &lt;/pre&gt;

Then, we "simply" add the desired params to perform the query:
&lt;pre class="brush:javascript"&gt;
  Manager.store.addByValue('facet.field', 'providerExact');
  Manager.store.addByValue('facet.date', 'endDate');
  Manager.store.addByValue('q', '*:*');
  Manager.doRequest();
&lt;/pre&gt;

&lt;h3&gt;Other libraries&lt;/h3&gt;
Some other components were used to provide a better user experience:
&lt;ul&gt;&lt;li&gt;jQuery/jQuery UI (http://jquery.com/): AJAX Solr requires a library to implement the AJAX requests, and jQuery was chosen for this purpose. Additionally, several jQuery UI widgets are used extensively for a richer user experience.&lt;/li&gt;&lt;li&gt;SyntaxHighlighter (http://alexgorbatchev.com/SyntaxHighlighter/): a code syntax highlighter written in JavaScript. This component is used for displaying the XML view of a metadata document.&lt;/li&gt;&lt;/ul&gt;

The prototype application is available &lt;a href="http://metadata.gbif.org/catalogue"&gt;here&lt;/a&gt;. It is 99.9% free of server-side code: there's only one line of code with a server dependency, and it sets the Solr server URL:
&lt;pre class="brush:javascript"&gt;
var solrServerUrl = &amp;lt;%="'" + config.getServletContext().getInitParameter("solrServerUrl") + "'"%&amp;gt;;
&lt;/pre&gt;
However, that line can easily be modified to deploy the same application on a web server technology other than a Servlet container (Tomcat, Jetty, etc.).&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-2647672092664194180?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/2647672092664194180/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/06/querying-solr-using-pure-ajax.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/2647672092664194180'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/2647672092664194180'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/06/querying-solr-using-pure-ajax.html' title='Querying Solr using a pure AJAX application'/><author><name>Fede Méndez</name><uri>http://www.blogger.com/profile/11707904250426427540</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-QDpiWebBxxk/TgBmFA_MFGI/AAAAAAAAABQ/06EY1i53r7E/s72-c/metacataloguewidgets.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-8064693161207384254</id><published>2011-06-17T14:28:00.000+02:00</published><updated>2011-06-17T14:28:48.934+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='lift'/><category scheme='http://www.blogger.com/atom/ns#' term='wallboard'/><category scheme='http://www.blogger.com/atom/ns#' term='GBIF'/><category scheme='http://www.blogger.com/atom/ns#' term='scala'/><title type='text'>Simple wallboard 
display with Scala and Lift at GBIF</title><content type='html'>This week we hit &lt;a href="https://twitter.com/#!/GBIF/status/80650711150505984"&gt;300 million&lt;/a&gt; indexed occurrence records. As you can see in the &lt;a href="http://yfrog.com/h429250726j"&gt;picture&lt;/a&gt;&amp;nbsp;we have got a monitor set up that shows us our current record count. It started as an idea a few weeks ago but while at the Berlin Buzzwords conference (we were at about 298 million then) I decided it was time to do something about it.&lt;br /&gt;
&lt;br /&gt;
I've been playing around with &lt;a href="http://www.scala-lang.org/"&gt;Scala&lt;/a&gt;&amp;nbsp;a bit in the last few months so this was a good opportunity to try &lt;a href="http://liftweb.net/"&gt;Lift&lt;/a&gt;,&amp;nbsp;a web framework written in Scala. In the end it turns out that very little code was needed to create an auto-updating counter. There are three components:&lt;br /&gt;
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;We've got a &lt;a href="http://code.google.com/p/gbif-common-resources/source/browse/gbif-wallboard/trunk/src/main/scala/org/gbif/provider/DBUpdater.scala"&gt;DBUpdater&lt;/a&gt; object that uses Lift's &lt;a href="http://scala-tools.org/mvnsites/liftweb-2.3/net/liftweb/util/Schedule.html"&gt;Schedule&lt;/a&gt;&amp;nbsp;(used to be called ActorPing which caused some confusion for me) to update its internal count of raw occurrence records every ten seconds. The beauty is that there is just one instance of this no matter how many clients are looking at the webpage.&lt;/li&gt;
&lt;li&gt;The second part is a class that acts as a &lt;a href="http://en.wikipedia.org/wiki/Comet_(programming)"&gt;Comet&lt;/a&gt; adaptor called &lt;a href="http://code.google.com/p/gbif-common-resources/source/browse/gbif-wallboard/trunk/src/main/scala/org/gbif/comet/RawOccurrenceRecordCount.scala"&gt;RawOccurrenceRecordCount&lt;/a&gt;&amp;nbsp;which waits for updates from the DBUpdater and passes these on to the clients.&lt;/li&gt;
&lt;li&gt;The last part is the &lt;a href="http://code.google.com/p/gbif-common-resources/source/browse/gbif-wallboard/trunk/src/main/scala/bootstrap/liftweb/Boot.scala"&gt;Bootstrap code&lt;/a&gt;&amp;nbsp;that schedules the first update of the DBUpdater and sets up the database connection and other stuff.&lt;/li&gt;
&lt;/ul&gt;
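&lt;div&gt;
For readers who don't know Lift, the "one updater, many clients" idea can be sketched in plain Java. This is not the wallboard code and uses none of Lift's APIs: a ScheduledExecutorService stands in for Schedule, and the CountSource and CountListener interfaces are hypothetical stand-ins for the database query and the Comet actors.&lt;/div&gt;
&lt;pre class="brush:java"&gt;import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class WallboardSketch {
  /** Hypothetical stand-in for a Comet actor that pushes counts to a browser. */
  interface CountListener {
    void countChanged(long count);
  }

  /** Hypothetical stand-in for the database query that counts records. */
  interface CountSource {
    long currentCount();
  }

  /** One shared updater, however many listeners are registered. */
  static class Updater implements Runnable {
    private final CountSource source;
    private final CountListener[] listeners;

    Updater(CountSource source, CountListener[] listeners) {
      this.source = source;
      this.listeners = listeners;
    }

    public void run() {
      long count = source.currentCount(); // a single query per tick
      for (CountListener listener : listeners) {
        listener.countChanged(count);
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    CountSource source = new CountSource() {
      public long currentCount() { return 300000000L; }
    };
    CountListener printer = new CountListener() {
      public void countChanged(long count) { System.out.println("Records: " + count); }
    };
    Updater updater = new Updater(source, new CountListener[] {printer, printer});

    // Refresh every ten seconds, as the wallboard does
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(updater, 0, 10, TimeUnit.SECONDS);
    Thread.sleep(100); // let the first tick run, then stop the demo
    scheduler.shutdown();
  }
}&lt;/pre&gt;
&lt;div&gt;
The point of the design is that the database is queried once per tick regardless of how many listeners are attached.&lt;/div&gt;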
&lt;div&gt;
Getting to this point took quite some time, though. I have to say that the documentation for Lift is lacking, especially when it comes to explaining the basic concepts to beginners like me (I've read &lt;a href="http://simply.liftweb.net/"&gt;Simply Lift&lt;/a&gt;, bits and pieces of the &lt;a href="http://www.assembla.com/wiki/show/liftweb"&gt;Wiki&lt;/a&gt;&amp;nbsp;and am halfway through &lt;a href="http://exploring.liftweb.net/"&gt;Exploring Lift&lt;/a&gt;). I'm really looking forward to &lt;a href="http://www.manning.com/perrett/"&gt;Lift in Action&lt;/a&gt;&amp;nbsp;and hope it serves as a better introduction than the currently available documentation.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
That said, I liked the end product very much and hope to extend it to show more stats on our &lt;a href="http://code.google.com/p/gbif-common-resources/source/browse/gbif-wallboard/#gbif-wallboard%2Ftrunk"&gt;wallboard&lt;/a&gt; display. So far I haven't managed to call JavaScript functions from my Comet Actor, so that's next on my list. Ideas for the wallboard are piling up and I hope to be able to continue building it in Lift and Scala.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-8064693161207384254?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/8064693161207384254/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/06/simple-wallboard-display-with-scala-and.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/8064693161207384254'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/8064693161207384254'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/06/simple-wallboard-display-with-scala-and.html' title='Simple wallboard display with Scala and Lift at GBIF'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-3912669359266348344</id><published>2011-06-15T12:05:00.080+02:00</published><updated>2011-06-15T16:31:09.719+02:00</updated><title type='text'>Buzzword compliance</title><content type='html'>Over the last few years a number of new technologies have emerged (inspired largely by Google) to help wrangle Big Data.&amp;nbsp; Things like Hadoop, HBase, Hive, Lucene, Solr and a host of others are becoming the "buzzwords" for handling the type of data that we at the secretariat are working with. 
As a number of our previous posts here have shown, the GBIF dev team is wholeheartedly embracing these new technologies, and we recently went to the &lt;a href="http://berlinbuzzwords.de/"&gt;Berlin Buzzwords&lt;/a&gt; conference (as a group) to get a sense of how the broader community is using these tools.&lt;br /&gt;&lt;br /&gt;
My particular interest is in &lt;a href="http://hbase.apache.org/"&gt;HBase&lt;/a&gt;, which is a style of database that can handle "millions of columns and billions of rows".&amp;nbsp; Since we're optimistic about the continued growth of the number of occurrence records indexed by GBIF, it's not unreasonable to think about 1 billion (10^9) indexed records in the medium term. While our current MySQL solution has held up reasonably well so far (now closing in on 300 million indexed records), it certainly won't handle an ever-growing future.&lt;br /&gt;
&lt;br /&gt;
I'm now in the process of evaluating HBase's ability to respond to the kinds of queries we need to support, particularly downloads of large datasets corresponding to queries in the &lt;a href="http://data.gbif.org/"&gt;data portal&lt;/a&gt;.&amp;nbsp; As in most databases, schema design is quite important in HBase, as is the selection of a "primary key" format for each table.&amp;nbsp; A number of the talks at Berlin Buzzwords addressed these issues, and I was very happy to hear from some of the core contributors to HBase, and to hear their conclusion that figuring out the right setup for any particular problem is far from trivial.&amp;nbsp; Notable among the presenters were Jean-Daniel Cryans from StumbleUpon (a fellow Canadian, woot!) and Jonathan Gray from Facebook (with luck their slides will be up at the &lt;a href="http://berlinbuzzwords.de/node/748"&gt;Buzzwords slides page&lt;/a&gt; soon).&amp;nbsp; Jonathan's presentation especially gives a sense of what HBase is capable of, given the truly huge amount of data Facebook drives through it (all of Facebook's messaging is held in HBase).&lt;br /&gt;
&lt;br /&gt;
Apart from learning a number of new techniques and approaches to developing with HBase, more than anything I'm excited to dive into the details knowing such a strong and supportive community is out there to help me when I get stuck.&amp;nbsp; You can follow along with my testing and deliberations on the &lt;a href="http://code.google.com/p/gbif-occurrencestore/wiki/HBaseSchema"&gt;wiki page&lt;/a&gt; for our occurrence record project.&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-3912669359266348344?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/3912669359266348344/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/06/buzzword-compliance.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3912669359266348344'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3912669359266348344'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/06/buzzword-compliance.html' title='Buzzword compliance'/><author><name>Oliver Meyn</name><uri>http://www.blogger.com/profile/04706642473308341930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/-6_sfRK7-oU0/TaWanB_bQ3I/AAAAAAAAAAM/RcPMC0E9E3k/s220/oliver_amory_profile.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-5293009589918724529</id><published>2011-06-09T22:59:00.011+02:00</published><updated>2011-06-09T23:35:35.469+02:00</updated><title type='text'>Getting started with Avro RPC</title><content type='html'>&lt;a href="http://avro.apache.org/"&gt;Apache Avro&lt;/a&gt; is a data exchange format started by &lt;a href="http://www.linkedin.com/in/cutting"&gt;Doug Cutting&lt;/a&gt; of &lt;a href="http://lucene.apache.org/java/docs/index.html"&gt;Lucene&lt;/a&gt; and &lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; fame.  
A good introduction to Avro &lt;a href="http://www.cloudera.com/blog/2010/03/avro-1-3-0/"&gt;is on the Cloudera blog&lt;/a&gt;, so this post does not attempt to be one.  &lt;br /&gt;
&lt;br /&gt;
Avro is surprisingly difficult to get into, as it lacks even the most basic "getting started" documentation for a newcomer to the project.  This post serves as a reminder to myself of what I did, and will hopefully help others get a hello world working quickly.  If people find it useful, let's fill it out and submit it to the Avro wiki!&lt;br /&gt;
&lt;br /&gt;
Prerequisites: knowledge of &lt;a href="http://maven.apache.org/"&gt;Apache Maven&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
Start by adding the Avro Maven plugin to the POM.  This is needed to compile the Avro schema and protocol definitions into Java classes.&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush:xml"&gt;&amp;lt;plugin&amp;gt;
  &amp;lt;groupId&amp;gt;org.apache.avro&amp;lt;/groupId&amp;gt;
  &amp;lt;artifactId&amp;gt;avro-maven-plugin&amp;lt;/artifactId&amp;gt;
  &amp;lt;version&amp;gt;1.5.1&amp;lt;/version&amp;gt;
  &amp;lt;executions&amp;gt;
    &amp;lt;execution&amp;gt;
      &amp;lt;id&amp;gt;schemas&amp;lt;/id&amp;gt;
      &amp;lt;phase&amp;gt;generate-sources&amp;lt;/phase&amp;gt;
      &amp;lt;goals&amp;gt;
        &amp;lt;goal&amp;gt;schema&amp;lt;/goal&amp;gt;
        &amp;lt;goal&amp;gt;protocol&amp;lt;/goal&amp;gt;
        &amp;lt;goal&amp;gt;idl-protocol&amp;lt;/goal&amp;gt;
      &amp;lt;/goals&amp;gt;
      &amp;lt;configuration&amp;gt;
        &amp;lt;excludes&amp;gt;
          &amp;lt;exclude&amp;gt;**/mapred/tether/**&amp;lt;/exclude&amp;gt;
        &amp;lt;/excludes&amp;gt;
        &amp;lt;sourceDirectory&amp;gt;${project.basedir}/src/main/avro/&amp;lt;/sourceDirectory&amp;gt;
        &amp;lt;outputDirectory&amp;gt;${project.basedir}/src/main/java/&amp;lt;/outputDirectory&amp;gt;
        &amp;lt;testSourceDirectory&amp;gt;${project.basedir}/src/test/avro/&amp;lt;/testSourceDirectory&amp;gt;
        &amp;lt;testOutputDirectory&amp;gt;${project.basedir}/src/test/java/&amp;lt;/testOutputDirectory&amp;gt;
      &amp;lt;/configuration&amp;gt;
    &amp;lt;/execution&amp;gt;
  &amp;lt;/executions&amp;gt;
&amp;lt;/plugin&amp;gt;&lt;/pre&gt;&lt;br /&gt;
Now add the dependencies on Avro and Avro IPC (inter-process communication):&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush:xml"&gt;&amp;lt;dependency&amp;gt;
  &amp;lt;groupId&amp;gt;org.apache.avro&amp;lt;/groupId&amp;gt;
  &amp;lt;artifactId&amp;gt;avro&amp;lt;/artifactId&amp;gt;
  &amp;lt;version&amp;gt;1.5.1&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;
&amp;lt;dependency&amp;gt;
  &amp;lt;groupId&amp;gt;org.apache.avro&amp;lt;/groupId&amp;gt;
  &amp;lt;artifactId&amp;gt;avro-ipc&amp;lt;/artifactId&amp;gt;
  &amp;lt;version&amp;gt;1.5.1&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;&lt;/pre&gt;&lt;br /&gt;
Now we create the Avro Protocol file, which defines the RPC exchange.  This file is stored in /src/main/avro/nublookup.avpr and looks like so:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush:js"&gt;{"namespace": "org.gbif.ecat.ws",
 "protocol": "NubLookup",
 "types": [
     {"name": "Request", "type": "record",
      "fields": [
        {"name": "kingdom", "type": ["string", "null"]},
        {"name": "phylum", "type": ["string", "null"]},
        {"name": "class", "type": ["string", "null"]},
        {"name": "order", "type": ["string", "null"]},
        {"name": "family", "type": ["string", "null"]},
        {"name": "genus", "type": ["string", "null"]},
        {"name": "name", "type": ["string", "null"]}
      ]
     },
     {"name": "Response", "type": "record",
      "fields": [
        {"name": "kingdomId", "type": ["int", "null"]},
        {"name": "phylumId", "type": ["int", "null"]},
        {"name": "classId", "type": ["int", "null"]},
        {"name": "orderId", "type": ["int", "null"]},
        {"name": "familyId", "type": ["int", "null"]},
        {"name": "genusId", "type": ["int", "null"]},
        {"name": "nameId", "type": ["int", "null"]}
      ]
     }  
 ],
 "messages": {
     "send": {
         "request": [{"name": "request", "type": "Request"}],
         "response": "Response"
     }
 }
}&lt;/pre&gt;&lt;br /&gt;
This protocol defines an interface called NubLookup with a single message, send, that takes a Request and returns a Response.  Simple stuff.&lt;br /&gt;
&lt;br /&gt;
From the command line issue a compile:&lt;br /&gt;
&lt;pre class="brush:shell"&gt;$mvn compile&lt;/pre&gt;This will generate into src/main/java and the package I declared in the .avpr file (org.gbif.ecat.ws in my case).&lt;br /&gt;
&lt;br /&gt;
Now we can test it using a simple Netty server, which is included in the Avro IPC dependency:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush:java"&gt;public class Test {
  private static NettyServer server;
  
  // A mock implementation
  public static class NubLookupImpl implements NubLookup {
    public Response send(Request request) throws AvroRemoteException {
      Response r = new Response();
      r.kingdomId=100;
      return r;
    }
  }
  
  public static void main(String[] args) throws IOException {
    server = new NettyServer(new SpecificResponder(
        NubLookup.class, 
        new NubLookupImpl()), 
        new InetSocketAddress(7001)); 

      NettyTransceiver client = new NettyTransceiver(
          new InetSocketAddress(server.getPort()));
      
      NubLookup proxy = (NubLookup) SpecificRequestor.getClient(NubLookup.class, client);
      
      Request req = new Request();
      req.name = new Utf8("Puma");
      System.out.println("Result: " + proxy.send(req).kingdomId);

      client.close();
      server.close();
  }
}&lt;/pre&gt;&lt;br /&gt;
I am evaluating Avro to provide the high performance RPC chatter for lookup services while we process the content for the portal.  I'll blog later about the performance compared to the &lt;a href="http://jersey.java.net/"&gt;Jersey REST&lt;/a&gt; implementation currently running.&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-5293009589918724529?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/5293009589918724529/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/06/getting-started-with-avro-rpc.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/5293009589918724529'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/5293009589918724529'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/06/getting-started-with-avro-rpc.html' title='Getting started with Avro RPC'/><author><name>Tim Robertson</name><uri>http://www.blogger.com/profile/07889700598656669041</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://1.bp.blogspot.com/_5wZ2Fic5QtA/STxJva6Zq2I/AAAAAAAAABc/g2I1cG9fztw/S220/tim.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-9141780243834013928</id><published>2011-06-03T19:05:00.003+02:00</published><updated>2011-06-06T15:03:28.407+02:00</updated><title type='text'>MySQL: A speed-up of over 9000 times using partitioning</title><content type='html'>I wanted to write about a MySQL performance optimization using partitioning as I recently applied it to the Harvesting and Indexing Toolkit’s (HIT) log table. The log table was already using a composite index (indexes on multiple columns), but as this table grew bigger and bigger (&amp;gt;50 million records) queries were being answered at a turtle’s pace.

To set things up, imagine that in the HIT application there is a log page that allows the user to tail the latest log messages in almost real time. Behind the scenes, the application is querying the log table every few seconds for the most recent logs, and the effect is a running view of the logs. The tail query used looks like this:

&lt;pre&gt;mysql&amp;gt; select * from log_event where id &amp;gt;= 'latest id' and datasource_id = 'datasource_id' and level &amp;gt;= 'log level' order by id desc;&lt;/pre&gt;

In effect this query asks: “give me the latest logs for the datasource with id X having at least a certain log level”.

Partitioning basically divides a table into different portions that are stored and can be queried separately. The benefit is that if a query only has to hit a small portion instead of the whole table, it can be answered faster.

There are different ways to partition tables in MySQL, and you can read about them all in the MySQL reference manual. I first experimented with Key partitioning on the table ID. Unfortunately, because different logs for a datasource could be spread across different partitions, the tail query would have to hit all partitions. To check how many partitions the query hits, I used the following query:

&lt;pre&gt;mysql&amp;gt; explain partitions select * from … ;&lt;/pre&gt;

This resulted in an even slower response than without partitioning, so Tim thought about it from a different angle. He discovered a nice solution using Range partitioning by datasource ID instead. This way the table would get divided into ranges of datasources that are contiguous but not overlapping. A range size of 1000 was used, so the 1st partition would contain all logs for datasources with IDs between 0 – 999, the 2nd partition would contain all logs for datasources with IDs between 1000 – 1999 and so on. Part of the command used to apply Tim’s partitioning strategy (having 36 partitions) is displayed below:

&lt;pre&gt;
ALTER TABLE log_event
ADD PRIMARY KEY(id,bio_datasource_fk)
PARTITION BY RANGE (bio_datasource_fk) (
    PARTITION p0 VALUES LESS THAN (1000),
    PARTITION p1 VALUES LESS THAN (2000),
    .
    .
    PARTITION p36 VALUES LESS THAN MAXVALUE
);
&lt;/pre&gt;

Checking how many partitions the tail query would hit, I confirmed that it only ever uses a single partition. The result was impressive, and initial tests resulted in a speed-up of over 9000 times!
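
As a rough illustration of why the tail query touches only one partition, the mapping from datasource ID to partition can be sketched in plain Java. This is only a model of the range scheme above, not MySQL code: the class and method names are made up, and it assumes that the elided partitions continue in steps of 1000 up to the catch-all p36 and that IDs are non-negative.

&lt;pre&gt;
public class PartitionLookup {
  static final int RANGE_SIZE = 1000;    // datasource IDs per partition
  static final int LAST_PARTITION = 36;  // p36 is the MAXVALUE catch-all

  /** Returns the name of the partition that holds logs for this datasource. */
  static String partitionFor(int datasourceId) {
    int index = Math.min(datasourceId / RANGE_SIZE, LAST_PARTITION);
    return "p" + index;
  }

  public static void main(String[] args) {
    System.out.println(partitionFor(999));    // p0
    System.out.println(partitionFor(1000));   // p1
    System.out.println(partitionFor(123456)); // p36
  }
}
&lt;/pre&gt;

Because the tail query fixes the datasource ID, every row it needs lives in the single partition computed this way, which is what the partition check confirmed.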

Note that the primary key must include every column used in the partitioning expression. Because we were partitioning on the datasource id, this field had to be added to the primary key before partitioning would work. An index on the id was also added to further optimize the query - why not, right?

The speed-up might be dramatic now, but as more log messages get written to a partition and it starts to swell, I envisage having to either delete old logs or repartition the table again using smaller range sizes in order to sustain good performance. There is a trade-off between the number of partitions and performance, so some tweaking is needed in every case I guess. Lastly, I’ll reiterate that improper partitioning can actually make things worse. Perhaps it could work for you too, but please apply with caution.&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-9141780243834013928?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/9141780243834013928/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/06/mysql-speed-up-of-over-9000-times-using.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/9141780243834013928'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/9141780243834013928'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/06/mysql-speed-up-of-over-9000-times-using.html' title='MySQL: A speed-up of over 9000 times using partitioning'/><author><name>Kyle Braak</name><uri>http://www.blogger.com/profile/16423423909368777750</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-6484385941947680406</id><published>2011-06-01T19:55:00.010+02:00</published><updated>2011-06-03T10:07:23.963+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sql'/><category scheme='http://www.blogger.com/atom/ns#' term='postgres'/><title type='text'>Ordered updates with Postgres</title><content type='html'>&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;When updating a postgres table you sometimes want the update to happen in a specific order. 
For example, I found myself in a situation where I wanted to assign new sequential ids to records in the alphabetical order given by a text string column. &lt;/span&gt;
&lt;br/&gt;&lt;br/&gt;

&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;With postgres 8.4 the solution using an &lt;/span&gt;&lt;a href="http://www.pelagodesign.com/blog/2007/07/23/postgresql-update-a-table-using-order-by/"&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;updateable, ordered view&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt; didn't work (anymore?). After experimenting a little I found that clustering a table according to the desired order is a simple solution that works exactly as hoped for. &lt;a href="http://www.postgresql.org/docs/8.4/interactive/sql-cluster.html"&gt;Clustering&lt;/a&gt; changes the actual order of the table data instead of only adding a new index. And apparently postgres uses this &lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt;native&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:verdana;"&gt; order for updates.&lt;/span&gt;&lt;div&gt;
&lt;/div&gt;
&lt;br/&gt;
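&lt;div&gt;To make the goal concrete, the numbering that the statements below aim for can be sketched in plain Java. This is only an illustration, not Postgres code: the class and method names are made up, Java's default string ordering may differ from the database collation, and it assumes the update visits rows in the clustered order, as observed above.&lt;/div&gt;
&lt;pre class="brush:java"&gt;import java.util.Arrays;

public class OrderedIds {
  /** Returns "id=name" pairs, numbering the names alphabetically after lastValue. */
  static String[] assignIds(String[] names, int lastValue) {
    String[] sorted = names.clone();
    Arrays.sort(sorted);              // mirrors CLUSTER idupd USING idupd_idx
    String[] result = new String[sorted.length];
    int next = lastValue;             // mirrors setval('idupd_seq', 100)
    int i = 0;
    for (String name : sorted) {
      next++;                         // mirrors nextval('idupd_seq')
      result[i++] = next + "=" + name;
    }
    return result;
  }

  public static void main(String[] args) {
    String[] rows = {"Puma", "Abies", "Felis"};
    // prints [101=Abies, 102=Felis, 103=Puma]
    System.out.println(Arrays.toString(assignIds(rows, 100)));
  }
}&lt;/pre&gt;
&lt;br/&gt;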

&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;CREATE TABLE idupd (id int, name varchar(128));&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;CREATE INDEX idupd_idx ON idupd (name);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;CLUSTER idupd USING idupd_idx;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;CREATE SEQUENCE idupd_seq;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;SELECT setval('&lt;/span&gt;&lt;span class="Apple-style-span"  style=" ;font-family:'courier new';"&gt;idupd_seq', 100&lt;/span&gt;&lt;span class="Apple-style-span"  style=" ;font-family:'courier new';"&gt;);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;UPDATE idupd set id=nextval('idupd_seq');&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;
&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-6484385941947680406?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/6484385941947680406/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/06/ordered-updates-with-postgres.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/6484385941947680406'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/6484385941947680406'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/06/ordered-updates-with-postgres.html' title='Ordered updates with Postgres'/><author><name>Markus Döring</name><uri>https://profiles.google.com/114975314573163797395</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-5068104976611655310</id><published>2011-05-30T16:20:00.001+02:00</published><updated>2011-05-30T16:28:55.883+02:00</updated><title type='text'>Decoupling components</title><content type='html'>Recent blog posts have introduced some of the &lt;a href="http://gbif.blogspot.com/2011/05/2011-registry-refactoring.html"&gt;registry&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href="http://gbif.blogspot.com/2011/04/reworking-portal-processing.html"&gt;portal processing&lt;/a&gt;&amp;nbsp;work under development at GBIF. 
&amp;nbsp;Here I'd like to introduce some of the research underway to improve the overall processing workflows by identifying well-defined components and removing unnecessary dependencies between them. &amp;nbsp;The aim is to improve the robustness, reliability and throughput of the data indexing performed for the portal.&lt;br /&gt;
&lt;br /&gt;
Key to the GBIF portal is the crawling, processing and indexing of the content shared through the GBIF network, which is currently performed by the &lt;a href="http://code.google.com/p/gbif-indexingtoolkit/"&gt;Harvesting and Indexing Toolkit (HIT)&lt;/a&gt;. &amp;nbsp;Today the HIT operates largely as follows:&lt;br /&gt;
&lt;ol&gt;&lt;li&gt;Synchronise with the registry to discover the technical endpoints&lt;/li&gt;
&lt;li&gt;Allow the administrator to schedule the harvesting and processing of an endpoint, as follows:&lt;/li&gt;
&lt;ol&gt;&lt;li&gt;Initiate a metadata request to discover the datasets at the endpoint&lt;/li&gt;
&lt;li&gt;For each resource initiate a request for the inventory of distinct scientific names&lt;/li&gt;
&lt;li&gt;Process the names into ranges&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Harvest the records by name range&lt;/li&gt;
&lt;li&gt;Process the harvested responses into tab delimited files&lt;/li&gt;
&lt;li&gt;Synchronise the tab delimited files with the database "verbatim" tables&lt;/li&gt;
&lt;li&gt;Process the "verbatim" tables into interpreted tables&lt;/li&gt;
&lt;/ol&gt;&lt;/ol&gt;Logically, the HIT can be depicted as follows:&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-8oMrGHXWMcs/TeOV_Wiuk3I/AAAAAAAAAD0/INwySCxNrzE/s1600/hit-today.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="310" src="http://3.bp.blogspot.com/-8oMrGHXWMcs/TeOV_Wiuk3I/AAAAAAAAAD0/INwySCxNrzE/s320/hit-today.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;Some of the limitations in this model include:&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;&lt;li&gt;The tight coupling between the HIT and the target DB means we need to stop the harvesting whenever we perform very expensive processing on the database&lt;/li&gt;
&lt;li&gt;Changes to the user interface for the HIT require the&amp;nbsp;harvester&amp;nbsp;to be stopped&lt;/li&gt;
&lt;li&gt;The user interface console is driven by the same machine that is crawling, meaning the UI becomes unresponsive periodically.&lt;/li&gt;
&lt;li&gt;The tight coupling between the HIT and the target DB precludes storing the data in multiple datastores (which we currently want to do as we investigate enriching the occurrence store)&lt;/li&gt;
&lt;/ol&gt;&lt;br /&gt;
The HIT can be separated into the following distinct concerns:&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;&lt;li&gt;An administration console to allow the scheduling, oversight and diagnostics of crawlers&lt;/li&gt;
&lt;li&gt;Crawlers that harvest the content&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Synchronisers that interpret and persist the content into the target datastores&amp;nbsp;&amp;nbsp;&lt;/li&gt;
&lt;/ol&gt;&lt;br /&gt;
An&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Event-driven_architecture"&gt;event driven architecture&lt;/a&gt;&amp;nbsp;would allow this separation and overcome the current limitations. &amp;nbsp;In this model, components can be deployed independently, and message each other&amp;nbsp;through a queue&amp;nbsp;when significant events occur. &amp;nbsp;Subscribers to the queue determine what action, if any, to take on a per message basis. &amp;nbsp;The architecture under research is shown below:&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-9S308scseM4/TeOeyQS6b0I/AAAAAAAAAD4/aPYyATkmQcg/s1600/hit-new.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://4.bp.blogspot.com/-9S308scseM4/TeOeyQS6b0I/AAAAAAAAAD4/aPYyATkmQcg/s320/hit-new.png" width="270" /&gt;&lt;/a&gt;&lt;/div&gt;In this depiction, the following sequence of events would occur:&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;&lt;li&gt;Through the Administration console, the administrator schedules the crawling of a resource. &amp;nbsp;&lt;/li&gt;
&lt;li&gt;The scheduler broadcasts to the queue that the resource is to be crawled&amp;nbsp;rather than spawning a crawler directly. &amp;nbsp;&lt;/li&gt;
&lt;li&gt;When capacity allows, a crawler will act on this event and crawl the resource, storing to the filesystem as it goes. &amp;nbsp;On each response message, the crawler will broadcast that the response is to be handled.&lt;/li&gt;
&lt;li&gt;Synchronizers will act on the new response messages and store them in the occurrence target stores. &amp;nbsp;In the above depiction, there are actually 2 target stores, each of which would act on the message indicating there is new data to synchronise.&lt;/li&gt;
&lt;/ol&gt;&lt;div&gt;This architecture would offer significant improvements over the existing setup. &amp;nbsp;The crawlers would only ever need to stop for bug fixes to the crawlers themselves. &amp;nbsp;Different target stores can be researched independently of the crawling codebase. &amp;nbsp;The user interface for the scheduling can be developed and redeployed without interrupting the crawling. &amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;
&lt;/div&gt;&lt;br /&gt;
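The sequence of events above can be sketched in a few lines of plain Java. This is a toy model under stated assumptions, not the architecture's implementation: the Broker is a synchronous, in-memory stand-in for a real message queue, and the class names, event names and store names are all hypothetical.&lt;br /&gt;
&lt;pre class="brush:java"&gt;public class EventQueueSketch {
  /** A component that reacts to messages and ignores events it does not care about. */
  interface Subscriber {
    void onMessage(String event, String resource, Broker broker);
  }

  /** Minimal synchronous stand-in for a message queue (hypothetical). */
  static class Broker {
    private final Subscriber[] subscribers;
    final StringBuilder log = new StringBuilder();

    Broker(Subscriber[] subscribers) {
      this.subscribers = subscribers;
    }

    void publish(String event, String resource) {
      for (Subscriber subscriber : subscribers) {
        subscriber.onMessage(event, resource, this);
      }
    }
  }

  /** Crawls when asked to, then announces that a response is ready. */
  static class Crawler implements Subscriber {
    public void onMessage(String event, String resource, Broker broker) {
      if (event.equals("CRAWL_REQUESTED")) {
        broker.log.append("crawled ").append(resource).append('\n');
        broker.publish("RESPONSE_STORED", resource);
      }
    }
  }

  /** Persists new responses into one target store. */
  static class Synchroniser implements Subscriber {
    private final String store;

    Synchroniser(String store) {
      this.store = store;
    }

    public void onMessage(String event, String resource, Broker broker) {
      if (event.equals("RESPONSE_STORED")) {
        broker.log.append(store).append(" synchronised ").append(resource).append('\n');
      }
    }
  }

  /** The scheduler only broadcasts a request; it never calls a crawler directly. */
  static String run(String resource) {
    Broker broker = new Broker(new Subscriber[] {
        new Crawler(), new Synchroniser("store-1"), new Synchroniser("store-2")});
    broker.publish("CRAWL_REQUESTED", resource);
    return broker.log.toString();
  }

  public static void main(String[] args) {
    System.out.print(run("resource-42"));
  }
}&lt;/pre&gt;
In this sketch a second target store is added by registering another subscriber, without touching the crawler, which is the decoupling described above.&lt;br /&gt;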
As an aside, during this exercise we are also investigating improvements in the following:&lt;br /&gt;
&lt;ol&gt;&lt;li&gt;The HIT (today) performs the metadata request, but does NOT update the registry with the datasets that are discovered, only the data portal. &amp;nbsp;The GBIF registry is "dataset aware" for the datasets served through the&amp;nbsp;&lt;a href="http://code.google.com/p/gbif-providertoolkit/"&gt;Integrated Publishing Toolkit&lt;/a&gt;&amp;nbsp;and ultimately we intend the registry to be able to reconcile the multiple identifiers associated with a dataset. &amp;nbsp;For example, it should be possible in the future to synchronise with the likes of the&amp;nbsp;&lt;a href="http://www.biodiversitycollectionsindex.org/"&gt;Biodiversity Collections Index&lt;/a&gt;,&amp;nbsp;which is a dataset-level registry.&lt;/li&gt;
&lt;li&gt;The harvesting procedure is rather complex, with many points of failure; it involves inventories of scientific names, processing into ranges of names and a harvest based on the name ranges. &amp;nbsp;Early tests suggest that a simpler approach of discrete name ranges [Aaa-Aaz, Aba-Abz ... Zza-Zzz] yields better results.&lt;/li&gt;
&lt;/ol&gt;&lt;div&gt;Watch this space for results of this investigation...&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-5068104976611655310?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/5068104976611655310/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/05/decoupling-components.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/5068104976611655310'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/5068104976611655310'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/05/decoupling-components.html' title='Decoupling components'/><author><name>Tim Robertson</name><uri>http://www.blogger.com/profile/07889700598656669041</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://1.bp.blogspot.com/_5wZ2Fic5QtA/STxJva6Zq2I/AAAAAAAAABc/g2I1cG9fztw/S220/tim.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-8oMrGHXWMcs/TeOV_Wiuk3I/AAAAAAAAAD0/INwySCxNrzE/s72-c/hit-today.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-5491441057910366937</id><published>2011-05-27T15:31:00.007+02:00</published><updated>2011-05-28T07:38:22.399+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='pywrapper'/><category scheme='http://www.blogger.com/atom/ns#' term='HIT'/><category scheme='http://www.blogger.com/atom/ns#' term='character sets'/><title type='text'>The Phantom Records Menace</title><content type='html'>For a data administrator, 
going to the web test interface of a data publisher can be incredibly useful when one needs to compare the data that was collected using the &lt;a href="http://code.google.com/p/gbif-indexingtoolkit/"&gt;Harvesting and Indexing Toolkit: HIT&lt;/a&gt; with what is available from the publisher. In a perfect world the transfer of records would happen without a glitch, but when we end up with fewer (or more!) records than we asked for, the search/test interfaces can be a real help (for instance the &lt;a href="http://www.biocase.org/products/protocols/index.shtml"&gt;PyWrapper&lt;/a&gt; querying utilities).&lt;p&gt;

Sometimes GBIF will index a resource that, for no apparent reason, turns in fewer records than expected from the line count that the HIT performs automatically. In this particular case there also appear to be several identical records; the HIT makes us aware of this by warning that there are multiple records with the same “holy triplet”: institution code, collection code and catalogue number.
&lt;p&gt;
Now what happens when a request goes out for this name range: Abies alba Mill. - Achillea millefolium L., followed by a request for Achillea millefolium agg. - Acinos arvensis (Lam.) Dandy? Those of you with good eyesight will have spotted that the request asks for Achillea millefolium L. before Achillea millefolium agg. This is because this particular instance or configuration of pywrapper returns a name range that is sorted according to the character values you find in &lt;a href="http://en.wikipedia.org/wiki/UTF-8"&gt;UTF-8&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Ascii"&gt;ASCII&lt;/a&gt;/&lt;a href="http://en.wikipedia.org/wiki/Latin_1"&gt;Latin-1&lt;/a&gt;, which order all upper-case characters before the lower-case ones.
Whether this is an artifact of the underlying database system, of the pywrapper itself, or even of a specific version of the wrapper is not yet known, but the scenario exists today and consumers should be aware of it.
The HIT then builds requests based on this name range, and if the requests by chance divide between “Achillea millefolium L.” and “Achillea millefolium agg.” you will receive overlapping responses (that is, two responses that contain parts of each other’s records), because the response is not based on a &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/charset-binary-op.html"&gt;BINARY&lt;/a&gt; select statement and therefore returns the records sorted alphabetically without giving precedence to upper-case letters. This behavior can be replicated by going to the pywrapper interface and searching these name ranges. Fortunately the HIT removes the redundant records during the synchronizing process. However, the record count is based on the line count at the point where the records are received from the access point. This is why the record count in the HIT is inflated and, as you can see, this kind of error can be a bit difficult to spot.&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-5491441057910366937?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/5491441057910366937/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/05/phantom-records-menace.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/5491441057910366937'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/5491441057910366937'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/05/phantom-records-menace.html' title='The Phantom Records Menace'/><author><name>Jan K. Legind</name><uri>http://www.blogger.com/profile/11185887314419707389</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-2934437820826438174</id><published>2011-05-23T16:50:00.003+02:00</published><updated>2011-05-24T08:11:40.134+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='registry'/><title type='text'>2011 GBIF Registry Refactoring</title><content type='html'>For the past couple of months, I have been working closely with another GBIF developer (and also fellow blog writer) Federico Mendez, 
on development tasks for GBIF's Registry application. This post provides an overview of that work.&lt;br /&gt;
&lt;br /&gt;
First, I would like to explain the nuts and bolts of the current Registry application (the one online), and then
the additions/modifications it has "suffered" during 2011 (these modifications have not yet been deployed).
As stated in the &lt;a href="http://gbif.blogspot.com/2011/04/evolution-of-gbif-registry.html"&gt;The evolution of the GBIF Registry&lt;/a&gt; blog post, in 2010 the 
Registry entered a new stage in its development by moving to a single DB, &amp;nbsp;an enhanced &lt;a href="http://code.google.com/p/gbif-registry/wiki/TableOfContents?tm=6#Available_APIs"&gt;web service API&lt;/a&gt;, 
and a &lt;a href="http://gbrds.gbif.org/"&gt;web user interface&lt;/a&gt;. On top of this, an admin-only web interface was created so that we could do internal
curation of the data inside the Secretariat.&lt;br /&gt;
&lt;br /&gt;
&lt;a href="http://www.hibernate.org/"&gt;Hibernate's framework&lt;/a&gt; was chosen as the preferred persistence framework and the &lt;a href="http://java.sun.com/blueprints/corej2eepatterns/Patterns/DataAccessObject.html"&gt;Data-Access-Object (DAO)&lt;/a&gt;&amp;nbsp;classes were coded with the&amp;nbsp;&lt;a href="http://docs.jboss.org/hibernate/core/3.3/reference/en/html/queryhql.html"&gt;HQL&lt;/a&gt; necessary to
provide an interface to the Hibernate persistence mechanism. The Business tier consisted of several Manager classes that relied on the DAOs to get the required data. These Managers also were the ones responsible for populating the Data-Transfer-Objects (DTOs) so that they could be passed to the Presentation tier. This last tier made use of plain Java Server Pages (JSPs), along with JQuery, Ajax, CSS among others.

Then, at the start of 2011, a decision was made to improve several aspects of the application's underlying implementation:&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Use of &lt;a href="http://www.mybatis.org/"&gt;MyBatis data mapper framework&lt;/a&gt;. This involved walking away from Hibernate's Object-Relational Mapping (ORM) approach. Our use of Hibernate involved HQL, adding an extra latency
component when converting HQL to SQL, but in MyBatis we use direct SQL mapped statements making it quicker to access the DB. (I will share some benchmarking on my next blog post, to justify this remark)&lt;/li&gt;
&lt;br /&gt;
&lt;li&gt;We found that the DTO pattern was somewhat overkill for an application that didn't have much complexity at the 
model level. We could trim some code complexity by passing the model objects straight to the presentation tier. 
So we did, and all DTOFactories &amp;amp; DTO objects are gone.&amp;nbsp;&lt;/li&gt;
&lt;br /&gt;
&lt;li&gt;Several codebase improvements were introduced, mainly by Federico, cutting out a large number of lines of code and making it easier to add new
functionality with less effort (e.g. through heavy use of Java's generics).&amp;nbsp;&lt;/li&gt;
&lt;br /&gt;
&lt;li&gt;At the web service level, the &lt;a href="http://struts.apache.org/2.x/docs/rest-plugin.html"&gt;Struts2 Rest plugin&lt;/a&gt; was replaced by the&amp;nbsp;&lt;a href="http://jersey.java.net/"&gt;Jersey library&lt;/a&gt;. I personally found the Struts2 Rest plugin poorly documented (a year ago), so
the Registry's use of it was rather ad hoc. My next blog post will include more of the reasoning behind this decision.&lt;/li&gt;
&lt;br /&gt;
&lt;li&gt;We now make use of the &lt;span id="goog_2068611743"&gt;&lt;/span&gt;&lt;a href="http://code.google.com/p/google-guice/"&gt;Guice&amp;nbsp;dependency injection framework&lt;/a&gt;. Previously, we
relied on Spring for this. Injections are also made through annotations now; with Spring we were using XML-based injection.&amp;nbsp;&lt;/li&gt;
&lt;br /&gt;
&lt;li&gt;The Registry project is now divided into different libraries. In particular:&amp;nbsp;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://code.google.com/p/gbif-registry/source/browse/#svn%2Ftrunk%2Fregistry-core"&gt;registry-core&lt;/a&gt;: Business &amp;amp; persistence logic&lt;/li&gt;
&lt;li&gt;&lt;a href="http://code.google.com/p/gbif-registry/source/browse/#svn%2Ftrunk%2Fregistry-web"&gt;registry-web&lt;/a&gt;: All related to the web application (Struts2)&lt;/li&gt;
&lt;li&gt;&lt;a href="http://code.google.com/p/gbif-registry/source/browse/#svn%2Ftrunk%2Fregistry-ws"&gt;registry-ws&lt;/a&gt;: All the web service stuff&lt;/li&gt;
&lt;li&gt;There are also some libraries Federico has created to manage the interaction between the Registry and all technical installations (DiGIR, Tapir, BioCase, etc) of those publishers sharing data with GBIF. These are extremely important libraries as they are the ones who keep the Registry up to date.&lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;
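&lt;div&gt;
To illustrate the first point in the list above: a MyBatis mapped statement is plain SQL declared in an XML mapper file. The sketch below is purely illustrative; the namespace, statement id, result type, table and column names are hypothetical and are not taken from the Registry codebase.&lt;/div&gt;
&lt;pre class="brush:xml"&gt;
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN"
  "http://mybatis.org/dtd/mybatis-3-mapper.dtd"&gt;
&lt;!-- hypothetical mapper: every name below is an assumption made for illustration --&gt;
&lt;mapper namespace="org.example.registry.OrganisationMapper"&gt;
  &lt;!-- the SQL is used as written; #{id} is bound as a prepared statement parameter --&gt;
  &lt;select id="getOrganisation" parameterType="int" resultType="org.example.registry.Organisation"&gt;
    SELECT id, name, description
    FROM organisation
    WHERE id = #{id}
  &lt;/select&gt;
&lt;/mapper&gt;
&lt;/pre&gt;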
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-ESQFM9PUMV4/Tdpw9WsG6HI/AAAAAAAAIlg/VPuRx2NucH8/s1600/registryArchitecture.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://2.bp.blogspot.com/-ESQFM9PUMV4/Tdpw9WsG6HI/AAAAAAAAIlg/VPuRx2NucH8/s400/registryArchitecture.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div style="text-align: center;"&gt;
&lt;i&gt;(2011 refactoring)&lt;/i&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
I must emphasize again that these changes are not yet deployed. This is an ongoing project, but if you are interested in seeing the progress being made, please feel free to visit the &lt;a href="http://code.google.com/p/gbif-registry/"&gt;project's site&lt;/a&gt;. Also, these changes won't affect the current web services API or the DB structure; they merely improve the underlying codebase.&amp;nbsp;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-2934437820826438174?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/2934437820826438174/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/05/2011-registry-refactoring.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/2934437820826438174'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/2934437820826438174'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/05/2011-registry-refactoring.html' title='2011 GBIF Registry Refactoring'/><author><name>Jose Cuadra</name><uri>http://www.blogger.com/profile/00591450269169657407</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-ESQFM9PUMV4/Tdpw9WsG6HI/AAAAAAAAIlg/VPuRx2NucH8/s72-c/registryArchitecture.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-8582127028479737352</id><published>2011-05-19T20:14:00.007+02:00</published><updated>2011-05-19T22:19:13.643+02:00</updated><title type='text'>Indexing bio-diversity metadata using Solr: schema, import-handlers, custom-transformers</title><content type='html'>This post is the second part of &lt;a href="http://gbif.blogspot.com/2011/04/oia-pmh-harvesting-at-gbif.html"&gt;OAI-PMH Harvesting at GBIF&lt;/a&gt;. 
That post explained how different OAI-PMH services are harvested. This post introduces the overall architecture of the index built from the information gathered from those services.

Let's start by explaining why we needed a metadata index at GBIF: one of our main requirements was to allow end users to search for datasets. To enable this, the system provides two main search functions: Full Text Search and Advanced Search.

For both functions the system displays a list of datasets containing the following information: title, provider, description (abstract) and a hyperlink to view the full metadata document in the original format (DIF, EML, etc.) provided by the source; all of that information was collected by the harvester. Search results had to be displayed with, among others, two specific features: highlighting of the text that matched the search criteria, and grouping/filtering of the results by facets: providers, dates and OAI-PMH services. To provide these search features we could not rely on the capabilities of a database alone, so we decided to build an index designed to support the search requirements. An index is like a single-table database without any support for relational queries; its only purpose is to support search, and it is not the primary source of data. The structure of the index is de-normalized and contains just the data that needs to be searched.

The index was implemented using &lt;a href="http://lucene.apache.org/solr/"&gt;Solr&lt;/a&gt;, an open source enterprise search platform. Its features include search result highlighting, faceted navigation, query spell correction, auto-suggest queries and “more like this” for finding similar documents.

The metadata application stores a subset of the information available in the metadata documents as Solr fields, and a special field (fullText) stores the whole XML document to enable full text search. The schema fields are:
&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;id&lt;/span&gt;: full file name is used for this field.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;title&lt;/span&gt;: title of the dataset, &lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;provider&lt;/span&gt;: provider of the dataset&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;providerExact&lt;/span&gt;: same as the previous field, but uses String data type for facets and exact match search&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;description&lt;/span&gt;: description or abstract of the dataset&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;beginDate&lt;/span&gt;: begin date of the temporal coverage of dataset&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;endDate&lt;/span&gt;: end date of the temporal coverage of dataset, when the input format only supports one dataset date, beginDate and endDate will contain the same value&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;westBoundingCoordinate&lt;/span&gt;: Geographic west coordinate&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;eastBoundingCoordinate&lt;/span&gt;: Geographic east coordinate&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;northBoundingCoordinate&lt;/span&gt;: Geographic north coordinate&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;southBoundingCoordinate&lt;/span&gt;: Geographic south coordinate&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;fullText&lt;/span&gt;: The complete text of the XML metadata document&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;externalUrl&lt;/span&gt;: Url containing specific information about the dataset; in the case of the &lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;serverId&lt;/span&gt;: Id of the source OAI-PMH Service; this information is taken from the file system structure and is used for the facets search.&lt;/li&gt;&lt;/ul&gt;
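
As a rough illustration, a few of these fields might be declared in Solr's schema.xml along the following lines. This is only a sketch: the field type names (text, string, date) and the indexed/stored settings are assumptions that depend on the fieldType definitions of the actual schema, and the copyField rule is just one possible way of filling providerExact from provider.
&lt;pre class="brush:xml"&gt;
&lt;!-- sketch only: type names and indexed/stored flags are assumptions --&gt;
&lt;field name="id" type="string" indexed="true" stored="true" required="true"/&gt;
&lt;field name="title" type="text" indexed="true" stored="true"/&gt;
&lt;field name="provider" type="text" indexed="true" stored="true"/&gt;
&lt;!-- untokenized copy of provider, used for facets and exact matches --&gt;
&lt;field name="providerExact" type="string" indexed="true" stored="true"/&gt;
&lt;field name="beginDate" type="date" indexed="true" stored="true"/&gt;
&lt;field name="endDate" type="date" indexed="true" stored="true"/&gt;
&lt;field name="fullText" type="text" indexed="true" stored="true"/&gt;
&lt;copyField source="provider" dest="providerExact"/&gt;
&lt;/pre&gt;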

The XML documents gathered by the harvester are imported into Solr using &lt;a href="http://wiki.apache.org/solr/DataImportHandler"&gt;data import handlers&lt;/a&gt;, one for each input format (EML, DIF, etc.). As an example, the following data import handler is used to index &lt;a href="http://dublincore.org/documents/dces/"&gt;Dublin Core XML files&lt;/a&gt;:
&lt;pre class="brush:xml"&gt;
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;dataConfig&gt;
 &lt;dataSource name="dcFileReader" type="FileDataSource" /&gt;
 &lt;document&gt;
  &lt;entity name="dcdataset" rootEntity="false" dataSource="null"
   processor="FileListEntityProcessor" fileName="^.*\.xml$" recursive="true"
   baseDir="${dataimporter.request.dcBaseDir}"&gt;
   &lt;entity name="dcdatasett" dataSource="dcFileReader" rootEntity="true" stream="false"
    url="${dcdataset.fileAbsolutePath}" processor="XPathEntityProcessor"
    transformer="org.gbif.solr.handler.dataimport.ListDateFormatTransformer,RegexTransformer,LogTransformer,TemplateTransformer,org.gbif.solr.handler.dataimport.ServerInfoTransformer"    
    forEach="/dc" logTemplate="processing ${dcdataset.fileAbsolutePath}" logLevel="info"&gt;
    &lt;field column="id" template="${dcdataset.fileAbsolutePath}" /&gt;
    &lt;field column="title" xpath="/dc/title"/&gt;    
    &lt;field column="provider" xpath="/dc/creator"/&gt;
    &lt;field column="description" xpath="/dc/description"/&gt;
                &lt;field column="serverId" template="${dcdataset.fileAbsolutePath}"/&gt;
    &lt;field column="serverName" template="${dcdataset.fileAbsolutePath}"/&gt;
    &lt;field column="beginDate"   listDateTimeFormat="yyyy-MM-dd" selectedDatePosition="1" separator="-" lastDay="false" xpath="/dc/date"/&gt; 
    &lt;field column="endDate"     listDateTimeFormat="yyyy-MM-dd" selectedDatePosition="2" separator="-" lastDay="true" xpath="/dc/date"/&gt;  
    &lt;entity processor="PlainTextEntityProcessor" name="x" url="${dcdataset.fileAbsolutePath}" dataSource="dcFileReader"&gt;
       &lt;!-- copies the text to a field called 'fullText' in Solr--&gt;
      &lt;field column="plainText" name="fullText"/&gt;
    &lt;/entity&gt;      
   &lt;/entity&gt;
  &lt;/entity&gt;
 &lt;/document&gt;
&lt;/dataConfig&gt;
&lt;/pre&gt;

The data import handlers are implemented using the following features available in Solr, plus one custom transformer:
&lt;ul&gt;&lt;li&gt;FileDataSource: allows fetching content from files on disk.&lt;/li&gt;&lt;li&gt;FileListEntityProcessor: an entity processor used to enumerate the list of files.&lt;/li&gt;&lt;li&gt;XPathEntityProcessor: used to index the XML files, it allows defining of Xpath expressions to retrieve specific elements.&lt;/li&gt;&lt;li&gt;PlainTextEntityProcessor: reads all content from the data source into a single field; this processor is used to import the whole XML file into one field.&lt;/li&gt;&lt;li&gt;DateFormatTransformer: parses date/time strings into java.util.Date instances; it is used for the date fields.&lt;/li&gt;&lt;li&gt;RegexTransformer: helps in extracting or manipulating values from fields (from the source) using Regular Expressions.&lt;/li&gt;&lt;li&gt;TemplateTransformer: used to overwrite or modify any existing Solr field or to create new Solr fields; it is used to create the id field.&lt;/li&gt;&lt;li&gt;org.gbif.solr.handler.dataimport.ListDateFormatTransformer: this is a custom transformer to handle non-standard date formats that are common in input dates; it can handle dates with formats like: 12-2010, 09-1988, and (1998)-(2000). 
It has three important attributes: i) separator that defines the character/string to be used as separator between year and month fields, ii) lastDay to define if the date to be used with a particular year value (e.g., 1998) should be the first or the last day of the year: if the year is being interpreted as a beginDate, then the value is set to yyyy-01-01 and lastDay is set to false; if the year is interpreted as an endDate then the value is set to yyyy-12-31 and the lastDay value is set to true, iii) selectedDatePosition to define which date is being processed when a range of dates is present in the input field; for example:&lt;/li&gt;&lt;/ul&gt;
&lt;pre class="brush:xml"&gt;
&lt;field column="beginDate" listDateTimeFormat="yyyy-MM-dd" selectedDatePosition="1" separator="_" lastDay="false" xpath="/dc/date"/&gt;
&lt;/pre&gt;
imports the “/dc/date” value into the begin date using “_” as the separator; selectedDatePosition="1" states that the date to be processed is the first one in the range of dates, and lastDay is thus set to false. The implementation of this custom handler is available on the &lt;a href="http://code.google.com/p/gbif-metadata/source/browse/trunk/metacatalog/src/main/java/org/gbif/solr/handler/dataimport/ListDateFormatTransformer.java"&gt;Google Code site&lt;/a&gt;.
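
By analogy, the end of the same date range could be imported with a field definition like the one below. This is a sketch that simply mirrors the example above: it assumes the same “_” separator and uses only attributes already described in this post.
&lt;pre class="brush:xml"&gt;
&lt;!-- picks the second date of the range; a bare year resolves to yyyy-12-31 --&gt;
&lt;field column="endDate" listDateTimeFormat="yyyy-MM-dd" selectedDatePosition="2" separator="_" lastDay="true" xpath="/dc/date"/&gt;
&lt;/pre&gt;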

The web interface can be visited at this &lt;a href="http://metadata.gbif.org/catalogue/"&gt;url&lt;/a&gt;. In a future post I'll explain how this user interface was implemented using a simple Ajax framework.&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-8582127028479737352?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/8582127028479737352/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/05/indexing-bio-diversity-metadata-using.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/8582127028479737352'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/8582127028479737352'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/05/indexing-bio-diversity-metadata-using.html' title='Indexing bio-diversity metadata using Solr: schema, import-handlers, custom-transformers'/><author><name>Fede Méndez</name><uri>http://www.blogger.com/profile/11707904250426427540</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-628152096754085767</id><published>2011-05-17T13:40:00.000+02:00</published><updated>2011-05-17T13:40:06.555+02:00</updated><title type='text'>Software quality control at GBIF</title><content type='html'>We've not only set up &lt;a href="http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html"&gt;Hadoop&lt;/a&gt;&amp;nbsp;here at GBIF but also introduced a few other new things. 
With the growing software development team we've felt the need to put some control measures in place to guarantee the quality of our software and to make the development process more transparent both for us at GBIF and hopefully for other interested parties as well.&lt;br /&gt;
&lt;br /&gt;
GBIF projects have always been open source and hosted at their Google Code sites (e.g. &lt;a href="http://code.google.com/p/gbif-occurrencestore/"&gt;GBIF Occurrencestore&lt;/a&gt;&amp;nbsp;or the &lt;a href="http://code.google.com/p/gbif-providertoolkit/"&gt;IPT&lt;/a&gt;). So in theory it was always possible for everyone to check every commit and review it. We have now also set up a &lt;a href="http://jenkins-ci.org/"&gt;Jenkins&lt;/a&gt; server that does continuous integration for us: every time a change is made to one of our projects, the project is checked out and a full build is run, including all tests, code quality measurements (I'm going to get back to those later), web site creation (e.g. Javadocs) and publishing of the results to our &lt;a href="http://maven.apache.org/"&gt;Maven&lt;/a&gt; repository.&lt;br /&gt;
&lt;br /&gt;
This is the first step in our new process. Every commit is checked in this way, and as a result we've had great success improving the stability of our builds. Our Jenkins server is publicly visible at the URL &lt;a href="http://hudson.gbif.org/"&gt;http://hudson.gbif.org&lt;/a&gt; (background on the Hudson name in the URL can be found on &lt;a href="http://en.wikipedia.org/wiki/Jenkins_(software)#Hudson"&gt;Wikipedia&lt;/a&gt;).&lt;br /&gt;
&lt;br /&gt;
As part of the process Jenkins also calls a code quality server called &lt;a href="http://www.sonarsource.org/"&gt;Sonar&lt;/a&gt;. Our Sonar instance is &lt;a href="http://sonar.gbif.org/"&gt;public&lt;/a&gt; as well. Take a look at &lt;a href="http://sonar.gbif.org/dashboard/index/1498"&gt;the metrics&lt;/a&gt; for the IPT, for example. You'll see a lot of information about our code, good and bad. We're not yet using this information extensively, but we are looking into which metrics are useful so that we can incorporate them more closely into our development process. One example is a set of &lt;a href="http://rs.gbif.org/conventions/"&gt;Coding Conventions&lt;/a&gt;&amp;nbsp;to make the code consistent and easier to understand for everybody.&lt;br /&gt;
&lt;br /&gt;
Once the build has finished the Sonar stage the results of the build are pushed to our &lt;a href="http://repository.gbif.org/index.html#welcome"&gt;Maven repository&lt;/a&gt;&amp;nbsp;(which is running a &lt;a href="http://nexus.sonatype.org/"&gt;Nexus&lt;/a&gt; server). That means we now have up to date SNAPSHOT builds of all our projects available (to use in our and your projects).&lt;br /&gt;
&lt;br /&gt;
At the moment we don't have many code contributions to our projects from outside GBIF, but we hope that by making our development process more transparent we can encourage others to take a look as well.&lt;br /&gt;
&lt;br /&gt;
We're always open to suggestions, questions and comments about our code base.&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-628152096754085767?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/628152096754085767/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/05/software-quality-control-at-gbif.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/628152096754085767'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/628152096754085767'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/05/software-quality-control-at-gbif.html' title='Software quality control at GBIF'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-1725712586796028750</id><published>2011-05-16T08:51:00.000+02:00</published><updated>2011-05-16T08:51:58.452+02:00</updated><title type='text'>Here be dragons - mapping occurrence data</title><content type='html'>One of the most compelling ways of viewing GBIF data is on a map.&amp;nbsp; While name lists and detailed text are useful if you know what you're looking for, a map can give you the overview you need to start honing your search.&amp;nbsp; I've always liked playing with maps in web applications and recently I had the chance to add the functionality to our new Hadoop/Hive processing that answers the question "what 
species occurrence records exist in country x?".&lt;br /&gt;
&lt;br /&gt;
Approximately 82% of the GBIF occurrence records have latitude and longitude recorded, but these often contain errors: typically typos, and often one or both of latitude and longitude are reversed.&amp;nbsp; Map 1, below, plots all of the verbatim (i.e. completely unprocessed) records that have a latitude and longitude and claim to be in the USA.&amp;nbsp; Note the common mistakes, which result in glaring errors: reversed longitude produces the near-perfect mirror over China; reversed latitude produces a faint image over the Pacific off the coast of Chile; reversing both produces an even fainter image off Australia; setting 0 for lat or long produces telltale straight lines over the Prime Meridian and the equator.&lt;br /&gt;
&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-G30xWk-ZGvE/Tcuslv8VbDI/AAAAAAAAAA4/g9xAkUYAm98/s1600/us-verbatim.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="320" id=":current_picnik_image" src="http://4.bp.blogspot.com/-G30xWk-ZGvE/Tcuslv8VbDI/AAAAAAAAAA4/g9xAkUYAm98/s640/us-verbatim.png" width="640" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;td style="text-align: center;"&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Map 1: Verbatim (unprocessed) occurrence data coordinates for the USA&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;One of the goals of the GBIF Secretariat is to help publishers improve their data, and identifying and reporting back these types of problems is one way of doing that.&amp;nbsp; Of course the current &lt;a href="http://data.gbif.org/"&gt;GBIF data portal&lt;/a&gt; attempts to filter these records before displaying them.&amp;nbsp; The current system verifies that given coordinates fall within the country they claim by overlaying a 1 degree grid on the world map, and identifying each of those grid points as belonging to one or more countries.&amp;nbsp; This overlay is curated by hand, and is therefore error prone, and its maintenance is time consuming.&lt;br /&gt;
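&lt;br /&gt;
To make the grid approach concrete, here is a minimal Python sketch of that kind of lookup. It is not the portal code: the cell contents and the function names (cell_for, countries_at, claim_is_plausible) are invented for illustration, and a real overlay would cover the whole world.&lt;br /&gt;
&lt;pre&gt;
```python
import math

# Hypothetical hand-curated overlay: each 1-degree cell, keyed by its
# south-west corner (lat, lon), lists the countries it may belong to.
GRID = {
    (48, -124): {"US", "CA"},  # illustrative cell straddling a border
    (40, -75): {"US"},
    (55, 12): {"DK", "SE"},
}


def cell_for(lat, lon):
    """Return the south-west corner of the 1-degree cell containing the point."""
    return (math.floor(lat), math.floor(lon))


def countries_at(lat, lon):
    """Countries the overlay assigns to the point's cell (empty set if unknown)."""
    return GRID.get(cell_for(lat, lon), set())


def claim_is_plausible(lat, lon, claimed_country):
    """True when the claimed country is listed for the cell containing the point."""
    return claimed_country in countries_at(lat, lon)
```
&lt;/pre&gt;
With this toy overlay, claim_is_plausible(40.4, -74.5, "US") is accepted, while the same record with its longitude sign flipped, (40.4, 74.5), falls in a cell that does not list the US and is rejected.&lt;br /&gt;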
&lt;br /&gt;
The results of doing a lookup against the overlay are shown in Map 2, where a number of bugs in the processing are still visible: parts of the mirror over China remain; none of the coastal waters that are legally US territory (i.e. the &lt;a href="http://en.wikipedia.org/wiki/Exclusive_Economic_Zone"&gt;Exclusive Economic Zone&lt;/a&gt; of 200 nautical miles off shore) are shown; the Aleutian Islands off the coast of Alaska are not shown; and some spots around the world are allowed through, including 0,0 and a few seemingly at random.&lt;br /&gt;
&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-RA_9dbm32bI/TcusxFky4XI/AAAAAAAAABA/xYHv7i5Ge1A/s1600/us-portal.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="320" src="http://3.bp.blogspot.com/-RA_9dbm32bI/TcusxFky4XI/AAAAAAAAABA/xYHv7i5Ge1A/s640/us-portal.png" width="640" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Map 2: Results of current data portal processing for occurrences in the USA&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;My work, then, was to build new processing into our Hive/Hadoop processing workflow that addresses these problems and produces a map that is as close to error free as possible.&amp;nbsp; The starting point is a webservice that can answer the question "In what country (including coastal waters) does this lat/long pair fall?".&amp;nbsp; This is clearly a GIS problem, and in GIS-speak this is a reverse geocode, and something that &lt;a href="http://postgis.refractions.net/"&gt;PostGIS&lt;/a&gt; is well equipped to provide.&amp;nbsp; Because country definitions and borders change semi-regularly, it seemed wisest to use a trusted source of country boundaries (shapefiles) that we could replace whenever needed.&amp;nbsp; Similarly we needed the boundaries of Exclusive Economic Zones to cover coastal waters. The political boundaries come from &lt;a href="http://www.naturalearthdata.com/"&gt;Natural Earth&lt;/a&gt;, and the EEZ boundaries shapefile comes from the &lt;a href="http://vliz.be/vmdcdata/marbound/"&gt;VLIZ Maritime Boundaries Geodatabase.&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
While this is not an especially difficult query to formulate, a word to the wise: if you're doing this kind of reverse geocode lookup, remember to build your query so that the distance test only runs on geometries whose bounding box overlaps the point, like so&lt;br /&gt;
&lt;blockquote style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: x-small;"&gt;where the_geom &amp;amp;&amp;amp; ST_GeomFromText(#{point}, 4326) and distance(the_geom, geomfromtext(#{point}, 4326)) &amp;lt; 0.001&lt;/span&gt;&lt;/blockquote&gt;This buys an order of magnitude improvement in query response time!&lt;br /&gt;
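&lt;br /&gt;
The reason the bounding-box test (the &amp;amp;&amp;amp; operator in the query above) helps can be shown in a few lines of plain Python. This is only a sketch under stated assumptions: the function names and the toy polygons are invented here, nothing is taken from PostGIS or from our webservice, and a real database would additionally use a spatial index for the cheap test.&lt;br /&gt;
&lt;pre&gt;
```python
import math


def bbox(polygon):
    """Bounding box (min_x, min_y, max_x, max_y) of a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def bbox_contains(box, point):
    """Cheap test: does the point fall inside the bounding box?"""
    min_x, min_y, max_x, max_y = box
    x, y = point
    return max_x >= x >= min_x and max_y >= y >= min_y


def point_in_polygon(polygon, point):
    """Ray-casting test for a simple polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside


def segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    px, py = p
    ax, ay = a
    dx, dy = b[0] - ax, b[1] - ay
    length_sq = dx * dx + dy * dy
    t = 0.0
    if length_sq != 0:
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance(polygon, point):
    """Expensive test: 0 inside the polygon, else distance to the nearest edge."""
    if point_in_polygon(polygon, point):
        return 0.0
    n = len(polygon)
    return min(segment_distance(point, polygon[i], polygon[(i + 1) % n])
               for i in range(n))


def reverse_geocode(countries, point, tol=0.001):
    """countries maps a name to (box, polygon).

    Returns the matching names and how many expensive distance tests ran.
    """
    matches, exact_tests = [], 0
    for name, (box, polygon) in countries.items():
        if not bbox_contains(box, point):  # stands in for the bounding-box operator
            continue
        exact_tests += 1
        if tol > distance(polygon, point):
            matches.append(name)
    return matches, exact_tests
```
&lt;/pre&gt;
For a point inside one of many polygons, only the polygons whose box contains the point ever reach the expensive distance test, which is where the saving comes from.&lt;br /&gt;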
&lt;br /&gt;
With a thin webservice wrapper from &lt;a href="http://jersey.java.net/"&gt;Jersey&lt;/a&gt;, we have the GIS pieces built.&amp;nbsp; We opted for a webservice approach so that we can eventually expose this quality control utility externally.&amp;nbsp; Since we process in Hadoop, we experienced huge stress on this web service - we were DDOS'ing ourselves.&amp;nbsp; I mentioned a similar approach in my &lt;a href="http://gbif.blogspot.com/2011/04/lucene-for-searching-names-in-our-new.html"&gt;last entry&lt;/a&gt;, where we alleviated the problem with load balancing across multiple machines.&amp;nbsp; And in case anyone is wondering why we didn't just use Google's reverse-geocoding webservice, the answer is twofold - first, it violates their terms of use, and second, even if we were allowed, they impose a rate limit on how many queries you can send over time, and that would have brought our workflow to its knees.&lt;br /&gt;
&lt;br /&gt;
The last piece of the puzzle is adding the call to the webservice from a &lt;a href="http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF"&gt;Hive UDF&lt;/a&gt; and adding it to our workflow, which is reasonably straightforward.&amp;nbsp; The result of the new processing is shown in Map 3, where the problems of Map 2 are all addressed.&lt;br /&gt;
&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-IIrsEyAuKjI/Tcuswu18zAI/AAAAAAAAAA8/-h3zvnih0wY/s1600/us-new.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="320" src="http://1.bp.blogspot.com/-IIrsEyAuKjI/Tcuswu18zAI/AAAAAAAAAA8/-h3zvnih0wY/s640/us-new.png" width="640" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Map 3: Results of new processing workflow for occurrences in the USA&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;These maps and the mapping cleanup processing will replace the existing maps and processing in our data portal later this year, hopefully in as little as a few months.&lt;br /&gt;
&lt;br /&gt;
You can find the source of the reverse-geocode webservice at the Google code site for the &lt;a href="http://code.google.com/p/gbif-occurrencestore/source/browse/#svn%2Ftrunk%2Foccurrence-spatial"&gt;occurrence-spatial project&lt;/a&gt;.&amp;nbsp; Similarly you can browse the source of the &lt;a href="http://code.google.com/p/gbif-occurrencestore/source/browse/#svn%2Ftrunk%2Foozie-apps%2Frollover"&gt;Hadoop/Hive workflow&lt;/a&gt; and the &lt;a href="http://code.google.com/p/gbif-occurrencestore/source/browse/#svn%2Ftrunk%2Foccurrence-store%2Fsrc%2Fmain%2Fjava%2Forg%2Fgbif%2Foccurrencestore%253Fstate%253Dclosed"&gt;Hive UDFs&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-1725712586796028750?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/1725712586796028750/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/05/here-be-dragons-mapping-occurrence-data.html#comment-form' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/1725712586796028750'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/1725712586796028750'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/05/here-be-dragons-mapping-occurrence-data.html' title='Here be dragons - mapping occurrence data'/><author><name>Oliver Meyn</name><uri>http://www.blogger.com/profile/04706642473308341930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/-6_sfRK7-oU0/TaWanB_bQ3I/AAAAAAAAAAM/RcPMC0E9E3k/s220/oliver_amory_profile.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-G30xWk-ZGvE/Tcuslv8VbDI/AAAAAAAAAA4/g9xAkUYAm98/s72-c/us-verbatim.png' height='72' width='72'/><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-2926352699049042674</id><published>2011-05-11T23:37:00.002+02:00</published><updated>2011-05-13T22:31:33.656+02:00</updated><title type='text'>The GBIF Spreadsheet Processor - an easy option to publish data</title><content type='html'>&lt;p&gt;Most data publishers in the GBIF Network use software wrappers to make data available on the web. 
To set up those tools, an institution or an individual usually needs a certain degree of technical capacity, which raises the threshold for publishing biodiversity data.&lt;/p&gt;

&lt;p&gt;Imagine an entomologist who deals with collections and monographs every day and who uses a PC only for Word or Excel. S/he has no student to help, but is keen to share the data before s/he retires. What is s/he going to do?&lt;/p&gt;

&lt;p&gt;One of our tools is built to support this kind of scenario: &lt;a href="http://tools.gbif.org/spreadsheet-processor/"&gt;the GBIF Darwin Core Archive Spreadsheet Processor&lt;/a&gt;, which we usually just call "the Spreadsheet Processor."&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/-9zfbmegjHx4/Tcr_3cERDQI/AAAAAAAAAA0/P7-KrKvN27I/s1600/Screen%2Bshot%2B2011-05-11%2Bat%2B10.17.13%2BPM.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 191px;" src="http://3.bp.blogspot.com/-9zfbmegjHx4/Tcr_3cERDQI/AAAAAAAAAA0/P7-KrKvN27I/s320/Screen%2Bshot%2B2011-05-11%2Bat%2B10.17.13%2BPM.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5605574014107979010" /&gt;&lt;/a&gt;

&lt;p&gt;The Spreadsheet Processor is a web application that lets one:
&lt;ol&gt;
&lt;li&gt;Use templates provided on the web site;&lt;/li&gt;
&lt;li&gt;Fill in and upload (or email) the xls file;&lt;/li&gt;
&lt;li&gt;Get a Darwin Core Archive file as the result.&lt;/li&gt;
&lt;/ol&gt;
&lt;/p&gt;
&lt;p&gt;This is a pretty straightforward way to prepare data for publishing, because the learning curve is flat for users who already know how to use Excel and how to upload a file to a web site.&lt;/p&gt;

&lt;p&gt;When the spreadsheet template is uploaded to the page, the web app first parses the values in the metadata sheet to generate an eml.xml, and then the occurrence or checklist sheet to generate a meta.xml and a csv file. These files are then collected and zipped according to the &lt;a href="http://rs.tdwg.org/dwc/terms/guides/text/index.htm"&gt;Darwin Core Archive standard&lt;/a&gt;, ready to download.&lt;/p&gt;
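
&lt;p&gt;To give a feel for what ends up in the zip, here is a small Python sketch that assembles an archive of that shape from rows already read out of a spreadsheet. It is not the Spreadsheet Processor's code: the function names are made up for this example, the eml.xml is a bare placeholder rather than valid EML, and the column names are assumed to be Darwin Core term names with the record identifier in the first column.&lt;/p&gt;
&lt;pre&gt;
```python
import csv
import io
import zipfile
import xml.etree.ElementTree as ET

DWC = "http://rs.tdwg.org/dwc/terms/"


def build_meta(columns, data_file="occurrence.csv"):
    """Describe the CSV layout: column 0 is the record id, the rest are terms."""
    archive = ET.Element("archive", {
        "xmlns": "http://rs.tdwg.org/dwc/text/",
        "metadata": "eml.xml",
    })
    core = ET.SubElement(archive, "core", {
        "rowType": DWC + "Occurrence",
        "encoding": "UTF-8",
        "fieldsTerminatedBy": ",",
        "linesTerminatedBy": "\\n",
        "fieldsEnclosedBy": '"',
        "ignoreHeaderLines": "1",
    })
    files = ET.SubElement(core, "files")
    ET.SubElement(files, "location").text = data_file
    ET.SubElement(core, "id", {"index": "0"})
    for index, term in enumerate(columns[1:], start=1):
        ET.SubElement(core, "field", {"index": str(index), "term": DWC + term})
    return archive


def build_archive(title, columns, rows):
    """Return the bytes of a zip holding eml.xml, meta.xml and occurrence.csv."""
    # Placeholder metadata only; real EML needs its namespaces and required elements.
    eml = ET.Element("eml")
    ET.SubElement(ET.SubElement(eml, "dataset"), "title").text = title

    table = io.StringIO()
    writer = csv.writer(table, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("eml.xml", ET.tostring(eml, encoding="unicode"))
        archive.writestr("meta.xml",
                         ET.tostring(build_meta(columns), encoding="unicode"))
        archive.writestr("occurrence.csv", table.getvalue())
    return buffer.getvalue()
```
&lt;/pre&gt;
&lt;p&gt;Calling build_archive("Beetles", ["occurrenceID", "scientificName", "country"], rows) returns the zipped bytes, which can be written to a file ending in .zip. The meta.xml is what tells a consumer which column holds which term.&lt;/p&gt;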

&lt;p&gt;With a DwC-A file, the data is in a standardized format and ready to be published. In the example scenario above, the entomologist can either share the file only among colleagues or send it to the nearest GBIF node that hosts an IPT. Since the IPT can digest a DwC-A file and publish it, the entomologist doesn't need to know how to use the IPT. To update the data, s/he can revise the spreadsheet, create a new DwC-A and send it to the node again.&lt;/p&gt;

&lt;p&gt;P.S. &lt;a href="http://www.gbif.org/orc/?doc_id=2824&amp;l=en"&gt;This manual&lt;/a&gt; explains how to publish and register data in DwC-A format.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-2926352699049042674?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/2926352699049042674/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/05/gbif-spreadsheet-processor-easy-option.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/2926352699049042674'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/2926352699049042674'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/05/gbif-spreadsheet-processor-easy-option.html' title='The GBIF Spreadsheet Processor - an easy option to publish data'/><author><name>Burke Chih-Jen Ko</name><uri>http://www.blogger.com/profile/09806308970203169452</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/-JHV15rSIJlw/Td_9T-7V2iI/AAAAAAAAABI/TamywweE4I4/s220/P1282909r_icon.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-9zfbmegjHx4/Tcr_3cERDQI/AAAAAAAAAA0/P7-KrKvN27I/s72-c/Screen%2Bshot%2B2011-05-11%2Bat%2B10.17.13%2BPM.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-2359533585293752660</id><published>2011-05-10T09:39:00.018+02:00</published><updated>2011-05-10T12:00:02.891+02:00</updated><title type='text'>Reworking the HIT, after reworking the Portal processing</title><content type='html'>If GBIF reworks the Portal processing, then what would be the knock-on 
effect on the &lt;a href="http://code.google.com/p/gbif-indexingtoolkit/"&gt;Harvesting and Indexing Toolkit (HIT)&lt;/a&gt;? This post talks a little about the future of the &lt;a href="http://code.google.com/p/gbif-indexingtoolkit/"&gt;HIT&lt;/a&gt;, and very little about the new &lt;a href="http://gbif.blogspot.com/2011/04/reworking-portal-processing.html"&gt;Portal processing&lt;/a&gt; (saved for later posts).&lt;br /&gt;
&lt;br /&gt;
&lt;div&gt;
&lt;div&gt;
To provide some background, the &lt;a href="http://code.google.com/p/gbif-indexingtoolkit/"&gt;HIT&lt;/a&gt; has three major responsibilities:&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;harvesting specimen and occurrence data from data publishers,&lt;/li&gt;
&lt;li&gt;writing that data in its raw form to the database, and&amp;nbsp;&lt;/li&gt;
&lt;li&gt;transforming raw data into its processed form by running quality assurance routines (such as date and terrestrial point validation) and tying it to the backbone "nub" taxonomy.&lt;/li&gt;
&lt;/ol&gt;
&lt;br /&gt;
When it is complete, the new Portal processing is going to take over step 3. In the new processing, data will be extracted from the MySQL database into &lt;a href="http://hbase.apache.org/"&gt;HBase&lt;/a&gt; (using &lt;a href="http://www.cloudera.com/downloads/sqoop/"&gt;sqoop&lt;/a&gt;), where quality assurance routines can be run much more quickly. Running outside of the MySQL database means that there won't be any more competition between steps 2 and 3, with step 3 constantly locking the raw data table in order to run its routines. That will mean the &lt;a href="http://code.google.com/p/gbif-indexingtoolkit/"&gt;HIT&lt;/a&gt; will be able to write raw data uninterrupted to the database.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;/div&gt;
&lt;div&gt;
Lately the &lt;a href="http://code.google.com/p/gbif-indexingtoolkit/"&gt;HIT&lt;/a&gt; has been struggling to process large datasets. For example, a dataset with 12 million records, processed 10,000 records at a time, would lock the raw table for 10 minutes while scanning through the more than 280 million raw records in order to generate its record set. No raw data can be written during that time, which brings the massively parallel application to its knees. Perhaps now you can understand why the rework of the Portal processing is so urgently needed.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
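&lt;div&gt;
As a rough back-of-the-envelope check of what those figures add up to, the following Python snippet totals the lock time for one such dataset. It assumes that the 10 minute lock is taken once for every batch of 10,000 records; that reading of the numbers, like the function name, is an assumption made for this illustration.&lt;br /&gt;
&lt;pre&gt;
```python
import math


def total_lock_minutes(dataset_records, batch_size, lock_minutes_per_batch):
    """Total time the raw table stays locked if every batch takes the same lock."""
    batches = math.ceil(dataset_records / batch_size)
    return batches * lock_minutes_per_batch


minutes = total_lock_minutes(12_000_000, 10_000, 10)
print(minutes, "minutes, or about", round(minutes / 60 / 24, 1), "days")
```
&lt;/pre&gt;
Under that assumption the raw table would be locked for 1,200 batches of 10 minutes each, or roughly eight days in total for a single dataset.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;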
&lt;div&gt;
&lt;/div&gt;
&lt;div&gt;
For the few adopters of the &lt;a href="http://code.google.com/p/gbif-indexingtoolkit/"&gt;HIT&lt;/a&gt; who will still require the application with its current functionality, please rest assured that the project will maintain a separate trimmed-down version when the time comes to adapt it. It will always remain an open-source application that anyone in the community can customize for their own needs.&lt;/div&gt;
&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-2359533585293752660?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/2359533585293752660/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/05/reworking-hit-after-reworking-portal.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/2359533585293752660'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/2359533585293752660'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/05/reworking-hit-after-reworking-portal.html' title='Reworking the HIT, after reworking the Portal processing'/><author><name>Kyle Braak</name><uri>http://www.blogger.com/profile/16423423909368777750</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-6079865179765844165</id><published>2011-05-06T16:02:00.001+02:00</published><updated>2011-05-06T16:07:39.569+02:00</updated><title type='text'>Improving Hive join performance using Oozie</title><content type='html'>In the&amp;nbsp;&lt;a href="http://gbif.blogspot.com/2011/04/reworking-portal-processing.html"&gt;portal processing&lt;/a&gt;&amp;nbsp;we are making use of&amp;nbsp;&lt;a href="http://wiki.apache.org/hadoop/Hive"&gt;Apache Hive&lt;/a&gt;&amp;nbsp;to provide SQL capabilities and&amp;nbsp;&lt;a href="http://yahoo.github.com/oozie/"&gt;Yahoo!'s Oozie&lt;/a&gt;&amp;nbsp;to provide a 
workflow engine. &amp;nbsp;In this post I explain how we are making use of forks to improve the join performance of Hive, by further parallelizing the join beyond what Hive provides natively.&lt;br /&gt;
&lt;blockquote&gt;&lt;i&gt;Please note that this was adopted using Hive version 0.5 but in Hive 0.7 there are&amp;nbsp;&lt;/i&gt;&lt;a href="http://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919"&gt;&lt;i&gt;significant improvements to joins&lt;/i&gt;&lt;/a&gt;&lt;/blockquote&gt;For the purposes of this explanation, let's consider the following simple example, where a table of verbatim values is being processed into four tables in a star schema:&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-z3nafGKcm-4/TcPyi-6t46I/AAAAAAAAADw/w6Nl06o1UXs/s1600/hivePerf.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="332" src="http://2.bp.blogspot.com/-z3nafGKcm-4/TcPyi-6t46I/AAAAAAAAADw/w6Nl06o1UXs/s400/hivePerf.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;To generate the leaves of the star, we have three simple queries (making use of a &lt;a href="http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF"&gt;simple UDF&lt;/a&gt;&amp;nbsp;to produce the increment IDs):&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;CREATE TABLE institution_code AS&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;SELECT rowSequence(), institution_code&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;FROM verbatim_record&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;GROUP BY institution_code;&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;
&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;CREATE TABLE collection_code AS&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;SELECT rowSequence(), collection_code&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;FROM verbatim_record&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;GROUP BY collection_code;&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;
&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;CREATE TABLE catalogue_number AS&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;SELECT rowSequence(), catalogue_number&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;FROM verbatim_record&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;GROUP BY catalogue_number;&lt;/span&gt;&lt;/blockquote&gt;&lt;br /&gt;
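For readers less familiar with SQL, the effect of one of these leaf queries can be mimicked in a few lines of Python. This is only an illustration: build_leaf is a made-up name, the ids are assumed to start at 1, and the values are sorted purely to make the example deterministic, whereas Hive gives no such ordering guarantee.&lt;br /&gt;
&lt;pre&gt;
```python
from itertools import count


def build_leaf(records, column):
    """Mimic SELECT rowSequence(), column ... GROUP BY column.

    Returns a dict giving one sequential id to each distinct value.
    """
    ids = count(1)
    distinct_values = sorted({record[column] for record in records})
    return {value: next(ids) for value in distinct_values}
```
&lt;/pre&gt;
Given three verbatim records whose institution_code values are "NHM", "KU" and "NHM", build_leaf(records, "institution_code") yields two rows: one for "KU" and one for "NHM".&lt;br /&gt;
&lt;br /&gt;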
To build the core of the star the simple approach is to issue the following SQL:&lt;br /&gt;
&lt;br /&gt;
&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;CREATE TABLE parsed_content AS&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;SELECT v.id AS id, ic.id AS institution_code_id,&amp;nbsp;&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;cc.id AS collection_code_id, cn.id AS catalogue_number_id&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;FROM verbatim_record v&amp;nbsp;&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp;JOIN institution_code ic ON v.institution_code=ic.institution_code&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp;JOIN collection_code cc ON v.collection_code=cc.collection_code&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp;JOIN catalogue_number cn ON v.catalogue_number=cn.catalogue_number;&lt;/span&gt;&lt;/blockquote&gt;&lt;br /&gt;
The important point is that the JOIN is across three different keys, which results in a query plan with three sequential MR jobs and a very large intermediate result set that is ultimately passed through the final Reduce in the Hive plan.&lt;br /&gt;
&lt;br /&gt;
By using Oozie (see the bottom of this post for pseudo workflow config), we are able to produce three temporary join tables, in a parallel fork, and then do a single join to bring it all back together.&lt;br /&gt;
&lt;br /&gt;
&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;# parallel join 1&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;CREATE TABLE t1 AS&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;SELECT v.id AS id, ic.id AS institution_code_id&amp;nbsp;&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;FROM verbatim_record v JOIN institution_code ic ON v.institution_code=ic.institution_code;&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;
&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;# parallel join 2&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;CREATE TABLE t2 AS&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;SELECT v.id AS id, cc.id AS collection_code_id&amp;nbsp;&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;FROM verbatim_record v JOIN collection_code cc ON v.collection_code=cc.collection_code;&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;
&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;# parallel join 3&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;CREATE TABLE t3 AS&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;SELECT v.id AS id, cn.id AS catalogue_number_id&amp;nbsp;&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;FROM verbatim_record v JOIN catalogue_number cn ON v.catalogue_number=cn.catalogue_number;&lt;/span&gt;&lt;/blockquote&gt;&lt;br /&gt;
&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;CREATE TABLE parsed_content AS&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;SELECT v.id AS id, t1.institution_code_id,&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;t2.collection_code_id, t3.catalogue_number_id&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;FROM verbatim_record v&amp;nbsp;&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp;JOIN t1 ON v.id=t1.id&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp;JOIN t2 ON v.id=t2.id&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp;JOIN t3 ON v.id=t3.id;&lt;/span&gt;&lt;/blockquote&gt;&lt;br /&gt;
Because we have built the join tables in parallel, and the final query joins every table on the same key (the record id), Hive compiles it to a single MapReduce job, and it runs much quicker.&lt;br /&gt;
&lt;br /&gt;
In reality our tables are far more complex, and we use a &lt;a href="http://wiki.apache.org/hadoop/Hive/LanguageManual/Joins"&gt;map-side join&lt;/a&gt; for institution_code since it is small. Even so, on &lt;a href="http://code.google.com/p/gbif-occurrencestore/wiki/ClusterConfig"&gt;our small cluster&lt;/a&gt; and with the following table sizes, the time to compute these tables dropped from &lt;b&gt;&lt;u&gt;&lt;span class="Apple-style-span" style="color: red;"&gt;several hours to 40 minutes&lt;/span&gt;&lt;/u&gt;&lt;/b&gt;.&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;verbatim_record: 284 million&lt;/li&gt;
&lt;li&gt;collection_code: 1.5 million&lt;/li&gt;
&lt;li&gt;catalogue_number: 199 million&lt;/li&gt;
&lt;li&gt;institution_code: 8 thousand&lt;/li&gt;
&lt;/ul&gt;&lt;div&gt;All of this work &lt;a href="http://code.google.com/p/gbif-occurrencestore/source/browse/trunk/oozie-apps/rollover/workflow.xml"&gt;can be found here&lt;/a&gt;.&amp;nbsp;&lt;/div&gt;&lt;br /&gt;
Pseudo workflow config for this:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush:xml"&gt;&lt;fork name="generate_join_tables"&gt;
  &lt;path start="create_ic"&gt;&lt;/path&gt;
  &lt;path start="create_cc"&gt;&lt;/path&gt;
  &lt;path start="create_cn"&gt;&lt;/path&gt;
&lt;/fork&gt;
  
&lt;action name="create_ic"&gt;
  &lt;hive xmlns="uri:oozie:hive-action:0.1"&gt;
    &lt;script&gt;
hive-scripts/create_ic.q
&lt;/script&gt;
    ...    
  &lt;/hive&gt;
  &lt;ok to="join_parsed"&gt;&lt;/ok&gt;
  &lt;error to="failure"&gt;&lt;/error&gt;
&lt;/action&gt;
  
&lt;action name="create_cc"&gt;
  &lt;hive xmlns="uri:oozie:hive-action:0.1"&gt;
    &lt;script&gt;
hive-scripts/create_cc.q
&lt;/script&gt;
    ...    
  &lt;/hive&gt;
  &lt;ok to="join_parsed"&gt;&lt;/ok&gt;
  &lt;error to="failure"&gt;&lt;/error&gt;
&lt;/action&gt;
  
&lt;action name="create_cn"&gt;
  &lt;hive xmlns="uri:oozie:hive-action:0.1"&gt;
    &lt;script&gt;
hive-scripts/create_cn.q
&lt;/script&gt;
    ...    
  &lt;/hive&gt;
  &lt;ok to="join_parsed"&gt;&lt;/ok&gt;
  &lt;error to="failure"&gt;&lt;/error&gt;
&lt;/action&gt;
 
&lt;join name="join_parsed" to="create_parsed"/&gt;

&lt;action name="create_parsed"&gt;
  &lt;hive xmlns="uri:oozie:hive-action:0.1"&gt;
    &lt;script&gt;
hive-scripts/create_parsed.q
&lt;/script&gt;
    ...    
  &lt;/hive&gt;
  &lt;ok to="end"&gt;&lt;/ok&gt;
  &lt;error to="failure"&gt;&lt;/error&gt;
&lt;/action&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-6079865179765844165?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/6079865179765844165/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/05/improving-hive-join-performance-using.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/6079865179765844165'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/6079865179765844165'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/05/improving-hive-join-performance-using.html' title='Improving Hive join performance using Oozie'/><author><name>Tim Robertson</name><uri>http://www.blogger.com/profile/07889700598656669041</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://1.bp.blogspot.com/_5wZ2Fic5QtA/STxJva6Zq2I/AAAAAAAAABc/g2I1cG9fztw/S220/tim.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-z3nafGKcm-4/TcPyi-6t46I/AAAAAAAAADw/w6Nl06o1UXs/s72-c/hivePerf.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-3263575298104393421</id><published>2011-05-04T14:48:00.005+02:00</published><updated>2011-05-04T16:12:45.063+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GBIF'/><category scheme='http://www.blogger.com/atom/ns#' term='DwC-archive'/><title type='text'>Line terminating characters breaking Darwin Core Archive</title><content type='html'>Hi, I am Jan K. 
Legind, the new data administrator at the GBIF Secretariat. One of my responsibilities is to ensure that datasets from publishers get indexed so that the data can be made available through the GBIF Portal. I am a historian by training and worked with archival data collection and testing before joining GBIF.  &lt;p class="MsoNormal"&gt;Recently I have been bug hunting in a large dataset (a Darwin Core Archive) that looked fine at a casual glance on the publisher side. On hitting the parser, however, several records were rejected because the records themselves contained line terminating characters (hex value 0A). Each such character cuts a record in two. The parser sees the head of the record as a line with too few columns and drops it, leaving an empty line. The tail end then looks like the start of a new record, which also has too few columns, so it is dropped as well and replaced with a second empty line.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;Here is an example of a record that would be replaced by an empty line:&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/-wLvc4ofHK5Y/TcFLxTjHvaI/AAAAAAAAAIE/vMWlw28ceps/s1600/line_terminating.jpg"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 675px; height: 92px;" src="http://4.bp.blogspot.com/-wLvc4ofHK5Y/TcFLxTjHvaI/AAAAAAAAAIE/vMWlw28ceps/s400/line_terminating.jpg" alt="" id="BLOGGER_PHOTO_ID_5602842721858862498" border="0" /&gt;&lt;/a&gt;
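&lt;p class="MsoNormal"&gt;The effect can be sketched in a few lines of Python. This is only an illustration, not the GBIF parser: it assumes a tab-delimited data file with a known number of columns, and the function name find_broken_lines and the sample rows are hypothetical.&lt;/p&gt;

```python
def find_broken_lines(text, expected_columns, delimiter="\t"):
    """Return (line_number, column_count) for every non-empty physical
    line whose column count differs from expected_columns."""
    broken = []
    for number, line in enumerate(text.split("\n"), start=1):
        if line == "":
            continue  # ignore the empty string after the final newline
        count = len(line.split(delimiter))
        if count != expected_columns:
            broken.append((number, count))
    return broken


# Hypothetical three-column file with three records. The second record
# carries a raw line feed (0x0A) inside its middle field, so a parser
# that splits on line feeds sees it as two short lines.
sample = (
    "1\tPuma concolor\tseen at dusk\n"
    "2\tLynx\nlynx\tfound near the river\n"
    "3\tFelis catus\tindoors\n"
)

print(find_broken_lines(sample, expected_columns=3))
```

&lt;p class="MsoNormal"&gt;In this sample, three records become four physical lines, and the two halves of the bisected record are the ones reported with too few columns.&lt;/p&gt;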

&lt;p class="MsoNormal"&gt;The line terminating characters seem to have been escaped, but without achieving the desired result. A secondary effect of this error is that the record count is miscalculated: the parser merely counts lines, and therefore ends up with a larger number than the publisher expected (remember that each line terminating character breaks the data file by producing two lines with an incorrect number of columns). Incidentally, this can explain why we sometimes harvest MORE than 100% of the target records.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;By using the Integrated Publishing Toolkit (IPT), illegal characters can be avoided, and publishers will benefit from a faster transition to data appearing live in the GBIF Portal. http://www.gbif.org/informatics/infrastructure/publishing/&lt;/p&gt;  &lt;p class="MsoNormal"&gt;Fortunately I am working with the publisher’s team on ironing out the bumps in this resource, so we can get the data published fast and prevent future errors of this sort.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-3263575298104393421?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/3263575298104393421/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/05/line-terminating-characters-breaking.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3263575298104393421'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3263575298104393421'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/05/line-terminating-characters-breaking.html' title='Line terminating characters breaking Darwin Core Archive'/><author><name>Jan K. Legind</name><uri>http://www.blogger.com/profile/11185887314419707389</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-wLvc4ofHK5Y/TcFLxTjHvaI/AAAAAAAAAIE/vMWlw28ceps/s72-c/line_terminating.jpg' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-726832421834655671</id><published>2011-05-02T11:16:00.007+02:00</published><updated>2011-05-02T11:28:36.454+02:00</updated><title type='text'>GBIF Data Portal</title><content type='html'>The current &lt;a href="http://data.gbif.org/"&gt;GBIF Data Portal&lt;/a&gt; was  designed and implemented in 2005/2006, around the time I first joined  the GBIF Secretariat in Copenhagen. 
I am not a developer myself, but I have been involved with the Data Portal for a long time, so I thought I would take the opportunity to summarise some of the components discussed in other posts here, looking at them from the perspective of the Data Portal.&lt;br&gt;&lt;br&gt;


The GBIF Data Portal has been in operation more or less in its current form since mid 2007. Since it was designed, the Portal's focus has been on providing discovery of and access to primary species occurrence data (specimens in museums, observations in the field, culture strains and others). Since the launch, bug fixes and some minor changes have been made, but development stopped due to new priorities. We did receive a lot of input on data content and functionality, though, both from data publishers and data users, and also through a number of reports and analyses.&lt;br&gt;&lt;br&gt;

Towards the  end of 2010, a new development phase started, initiating version 2 of  the GBIF Data Portal. This was the time to start taking care of all the  known shortcomings and improvement requests, e.g. a more robust and  reliable backbone taxonomy, improvement of data quality, better  attribution of contributors, and others. However, this is not just a  matter of adding some data or changing the user interface: a lot of  those points first require considerable reworking of internal processing  and workflows between the Data Portal and related components, blogged  about in other contributions here:

&lt;ul&gt;&lt;li&gt;quicker indexing and more  frequent rollovers (publication cycles) from the non-public indexing  database to the public web portal can only be achieved through a  complete re-working of the rollover processing workflow.&lt;/li&gt;&lt;li&gt;a  reliable taxonomic backbone required a review and re-implementation of  name parsing routines, integrating lookup services, and following that, a  complete regeneration of the taxonomic backbone&lt;/li&gt;&lt;li&gt;the demand for  better attribution of data owners and service providers can only be met  after having moved on to a new registry, better modelling the GBIF  network structure, players and interactions. This is especially the case  where datasets are aggregated or hosted, and both the owning and the  service providing institution need to receive proper credit for their  contributions&lt;/li&gt;&lt;li&gt;extended and improved metadata are needed to  assess suitability of a dataset for specific applications (e.g.  modelling), and to allow discovery of collections that are not digitised  or not published&lt;/li&gt;&lt;/ul&gt;In 2011, GBIF Data Portal development focuses  on consolidating and integrating these re-worked components, and on  including both names (checklist) and metadata sources into the search  functionality. The implied changes on the Portal user interface side are  quite fundamental. With other known and future requirements on user  interface functionality, the time has now come to replace the old Portal  code base. At present, we are working with an &lt;a href="http://www.vizzuality.com/"&gt;external team&lt;/a&gt;  to develop wireframes for key Portal pages, based on functionality  requests from GBIFS regarding the integration of the new data areas and  following evaluation of a number of sources (task group reports,  reviews, participant reports etc). 
Those wireframes will aid further discussions on functionality starting from July, and will also form the basis for implementation in 2011 and after. Once there is a public version available to look at, we will give an update.&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-726832421834655671?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/726832421834655671/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/05/gbif-data-portal.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/726832421834655671'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/726832421834655671'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/05/gbif-data-portal.html' title='GBIF Data Portal'/><author><name>Andrea Hahn</name><uri>http://www.blogger.com/profile/03720231445578387602</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-3246096268057131928</id><published>2011-04-29T14:20:00.003+02:00</published><updated>2011-04-29T17:02:58.513+02:00</updated><title type='text'>The evolution of the GBIF Registry</title><content type='html'>The GBIF Registry has evolved through time to become an important tool in GBIF's day to day work. But before going into this post, a basic understanding of the GBIF Network model should be provided. GBIF is a decentralised network that has several network entities that are related in some way between each other. 
At the top level, there are &lt;b&gt;&lt;a href="http://www.gbif.org/participation/participant-nodes/who-we-are/"&gt;GBIF Participant Nodes&lt;/a&gt;, &lt;/b&gt;which typically are countries or thematic networks that coordinate their domain. These Nodes &lt;i&gt;endorse&lt;/i&gt; one or more &lt;b&gt;Organisations or Institutions &lt;/b&gt;inside their domain&lt;b&gt;, &lt;/b&gt;and each Organisation &lt;i&gt;possesses&lt;/i&gt; one or more &lt;b&gt;Resources&lt;/b&gt; exposed through the GBIF Network. Each Resource is also typically associated with a &lt;b&gt;Technical Access Point&lt;/b&gt;, which is the URL used to access its data. There are also other entities such as &lt;b&gt;&lt;a href="http://code.google.com/p/gbif-providertoolkit/"&gt;IPT Installations&lt;/a&gt; &lt;/b&gt;which are deployed inside specific organisations, but are not resources by themselves. They &lt;i&gt;publish &lt;/i&gt;Resources that might be owned by other organisations. A quick view of GBIF's network model is shown below:&lt;br /&gt;
&lt;br /&gt;
&lt;div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-KkOK8EgW3Xs/TbrSs0P2e_I/AAAAAAAAIf8/JtA8DOw6y-g/s1600/01gbifmodel.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 300px;" src="http://2.bp.blogspot.com/-KkOK8EgW3Xs/TbrSs0P2e_I/AAAAAAAAIf8/JtA8DOw6y-g/s400/01gbifmodel.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5601020753969839090" /&gt;&lt;/a&gt;&lt;br /&gt;
&lt;div&gt;
Not long ago, this complexity was modeled using a &lt;a href="http://en.wikipedia.org/wiki/Universal_Description_Discovery_and_Integration"&gt;Universal Description, Discovery and Integration&lt;/a&gt; (UDDI) system. This system served its purpose at the time, despite its limited set of data structure types (e.g. businessEntity, businessService, bindingTemplate, tModel). A &lt;i&gt;BusinessEntity&lt;/i&gt; was associated with an Organisation/Institution, a &lt;i&gt;BusinessService&lt;/i&gt; was associated with a Resource and a &lt;i&gt;BindingTemplate&lt;/i&gt; was associated with the technical access point used to access the data from that specific resource. A tModel was used to associate the BusinessEntity (Organisation) with a specific Node inside the GBIF Network. The diagram below gives a quick view of how the network information was kept in this Registry:

&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;a href="http://2.bp.blogspot.com/-eJg7G4bunR8/TbqXwjCvx_I/AAAAAAAAIfg/9j1BRvcGvf8/s1600/02uddimodel.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5600955946884909042" src="http://2.bp.blogspot.com/-eJg7G4bunR8/TbqXwjCvx_I/AAAAAAAAIfg/9j1BRvcGvf8/s400/02uddimodel.png" style="cursor: hand; cursor: pointer; display: block; height: 157px; margin: 0px auto 10px; text-align: center; width: 367px;" /&gt;&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
The main disadvantages (for our concerns) of the UDDI Specification:&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Its lack of contact information at the BusinessService(Resource) level (contacts can only be added at the BusinessEntity(Organisation) level)&lt;/li&gt;
&lt;li&gt;Lack of more descriptive metadata on Organisations and Resources (fields such as the address, homepage or phone number of the organisation are missing). You could provide all of this information through a complex use of UDDI's capabilities, but that would make it unnecessarily complex for third-party tools to extract it.&lt;/li&gt;
&lt;li&gt;Limited to a &lt;a href="http://www.uddi.org/pubs/uddi_v3.htm"&gt;fixed specification&lt;/a&gt; and to a fixed API (although, the UDDI client libraries available are quite straightforward to use)&lt;/li&gt;
&lt;li&gt;General purpose specification, not easily adaptable for modeling the complexity of GBIF's network.&lt;/li&gt;
&lt;li&gt;Our software dated back to the beginning of the past decade (&lt;a href="http://www2.sys-con.com/itsg/virtualcd/webservices/archives/0401/barbash/index.html"&gt;Systinet WASP UDDI&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Third party consumers will need to know how to talk UDDI&lt;/li&gt;
&lt;/ol&gt;
In 2009, we tried to overcome some of our Registry's limitations with a "UDDI on steroids" approach, which still consisted of a UDDI system (&lt;a href="http://juddi.apache.org/"&gt;jUDDI&lt;/a&gt; in our case) plus an external database holding some extra data (e.g. Resource contact information, an organisation's address, homepage or phone number, etc.). The main advantage was the creation of our own APIs, so that third-party tool developers who wanted to consume GBIF's network information didn't need to know the nuts and bolts of the UDDI specs anymore. We offered the community a simple API with proper documentation, and we dealt with the inner workings of it all.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Further in this evolution, our Registry took the next step and we removed the UDDI component and were left only with a DB which gave us complete freedom to model the network. We now had a system on hand which offered the possibility to create any kind of entities on the Network (Nodes, Organisations, Resources, Technical Installations) and any relation among them. Along with this new approach, came the web application (&lt;a href="http://gbrds.gbif.org/"&gt;http://gbrds.gbif.org&lt;/a&gt;) and a far better API which offered the possibility to consume the data in XML or JSON format. These APIs are easy to follow and are well documented (&lt;a href="http://code.google.com/p/gbif-registry/wiki/TableOfContents"&gt;http://code.google.com/p/gbif-registry/wiki/TableOfContents&lt;/a&gt;). Among the new features:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Create any kind of entities&lt;/li&gt;
&lt;li&gt;Create any kind of relation among them&lt;/li&gt;
&lt;li&gt;More detailed metadata (for entities and contacts)&lt;/li&gt;
&lt;li&gt;Ability to tag entities&lt;/li&gt;
&lt;li&gt;Individual credentials for each Institution/Organisation to provide the ability to add new or delete existing resources under their own Organisations (this is currently only available through the APIs or via admin management)&lt;/li&gt;
&lt;li&gt;Enhanced maintenance features (for admins)&lt;/li&gt;
&lt;/ol&gt;
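To make the "any entity, any relation" idea a little more concrete, here is a minimal Python sketch of such a model. It is not the Registry's actual schema or API: the class names, keys, relation labels and the JSON layout are all assumptions made up for illustration.&lt;br /&gt;

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Entity:
    """Any network entity; 'kind' is free text (Node, Organisation, Resource...)."""
    key: str
    kind: str
    name: str
    tags: list = field(default_factory=list)


@dataclass
class Relation:
    """A typed, directed link between two entity keys."""
    source: str
    relation: str
    target: str


class Registry:
    """Hypothetical in-memory registry: entities plus arbitrary relations."""

    def __init__(self):
        self.entities = {}
        self.relations = []

    def add(self, entity):
        self.entities[entity.key] = entity
        return entity

    def relate(self, source, relation, target):
        # Both ends must already be registered.
        if source not in self.entities or target not in self.entities:
            raise KeyError("unknown entity key")
        self.relations.append(Relation(source, relation, target))

    def related(self, source, relation):
        """Return the entities that 'source' points to via 'relation'."""
        return [self.entities[r.target] for r in self.relations
                if r.source == source and r.relation == relation]

    def to_json(self):
        return json.dumps({
            "entities": [asdict(e) for e in self.entities.values()],
            "relations": [asdict(r) for r in self.relations],
        }, indent=2)


registry = Registry()
registry.add(Entity("node-1", "Node", "Example Node"))
registry.add(Entity("org-1", "Organisation", "Example Museum", tags=["herbarium"]))
registry.add(Entity("res-1", "Resource", "Example Specimen Dataset"))
registry.relate("node-1", "endorses", "org-1")
registry.relate("org-1", "possesses", "res-1")
print([e.name for e in registry.related("org-1", "possesses")])
print(registry.to_json())
```

Because relations are plain data rather than fixed structures, adding a new kind of entity or link needs no schema change, which is the freedom the UDDI data types could not offer.&lt;br /&gt;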
&lt;/div&gt;
&lt;div&gt;
&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/-degPpmJzWTM/TbrS5NOybYI/AAAAAAAAIgE/HGbp2u09IIg/s1600/03evolution.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 300px;" src="http://3.bp.blogspot.com/-degPpmJzWTM/TbrS5NOybYI/AAAAAAAAIgE/HGbp2u09IIg/s400/03evolution.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5601020966834695554" /&gt;&lt;/a&gt;&lt;br /&gt;
&lt;div style="text-align: center;"&gt;
&lt;i&gt;[Evolution of GBIF's Registry]&lt;/i&gt;&lt;/div&gt;
Development is &lt;a href="http://code.google.com/p/gbif-registry/"&gt;still ongoing&lt;/a&gt; and many exciting features are expected in the future. The status of development can be checked out &lt;a href="http://code.google.com/p/gbif-registry/"&gt;here&lt;/a&gt;.&lt;/div&gt;
&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-3246096268057131928?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/3246096268057131928/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/04/evolution-of-gbif-registry.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3246096268057131928'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3246096268057131928'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/04/evolution-of-gbif-registry.html' title='The evolution of the GBIF Registry'/><author><name>Jose Cuadra</name><uri>http://www.blogger.com/profile/00591450269169657407</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-KkOK8EgW3Xs/TbrSs0P2e_I/AAAAAAAAIf8/JtA8DOw6y-g/s72-c/01gbifmodel.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-1170890136300056784</id><published>2011-04-27T07:00:00.014+02:00</published><updated>2011-04-27T19:01:40.721+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='PMH'/><category scheme='http://www.blogger.com/atom/ns#' term='OAI'/><category scheme='http://www.blogger.com/atom/ns#' term='iso19139'/><category scheme='http://www.blogger.com/atom/ns#' term='dublin core'/><category scheme='http://www.blogger.com/atom/ns#' 
term='OAI-PMH'/><category scheme='http://www.blogger.com/atom/ns#' term='xml'/><category scheme='http://www.blogger.com/atom/ns#' term='harvester'/><category scheme='http://www.blogger.com/atom/ns#' term='xsl'/><category scheme='http://www.blogger.com/atom/ns#' term='eml'/><category scheme='http://www.blogger.com/atom/ns#' term='GBIF'/><category scheme='http://www.blogger.com/atom/ns#' term='dc'/><title type='text'>OAI-PMH Harvesting at GBIF</title><content type='html'>&lt;span style="font-family:verdana;"&gt;GBIF has been my first experience in the bio-informatics world; my first assignment was developing an OAI-PMH harvester. This post will introduce the OAI-PMH protocol and explain how we are gathering XML documents from different sources. In the next post I'll give an introduction to the index that we have built using those documents.  &lt;/span&gt;&lt;br/&gt;

&lt;span style="font-family:verdana;"&gt;The main goal for this project was develop the infrastructure needed across the GBIF network to support the management and delivery of metadata that will enable potential end users to discover which datasets are available, and, to evaluate the appropriateness of such datasets for particular purposes. In the GBIF context, resources are datasets, loosely defined as collections of related data, the granularity of which is determined by the data custodian/provider.  &lt;/span&gt;&lt;br/&gt;
&lt;br/&gt;

&lt;span style="font-weight: bold;font-family:verdana;" &gt;OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)&lt;/span&gt;&lt;span style="font-family:verdana;"&gt; is a platform independent framework for metadata publishers and metadata consumers as well. The most important concepts of this protocol are:&lt;/span&gt;&lt;br/&gt;
&lt;span style="font-family:verdana;"&gt;•    Metadata: provides information on such aspects as the ‘who, what, where, when and how’ pertaining to a resource. For the producer, metadata are used to document data in order to inform users of their characteristics, while for the consumer, metadata are used to both discover data and assess their appropriateness for particular needs ('fitness for purpose’).&lt;/span&gt;&lt;br/&gt;
&lt;span style="font-family:verdana;"&gt;•    Repository: an accessible server that is able to process the protocol verbs.&lt;/span&gt;&lt;br/&gt;
&lt;span style="font-family:verdana;"&gt;•    Unique identifier: is an unambiguous identifier of an item (document/record) inside the repository.&lt;/span&gt;&lt;br/&gt;
&lt;span style="font-family:verdana;"&gt;•    Record: is metadata expressed in a specific format.&lt;/span&gt;&lt;br/&gt;
&lt;span style="font-family:verdana;"&gt;•    Metadata-prefix: specifies the metadata format in OAI-PMH requests issued to the repository (EML 2.1.0, Dublin Core, etc.)&lt;/span&gt;&lt;br/&gt;
&lt;br/&gt;
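As a rough illustration of how these concepts fit together, the Python sketch below builds the GET URLs a harvester would send to a repository. The six verbs and the oai_dc prefix come from the OAI-PMH v2.0 specification; the endpoint, the record identifier and the "eml" prefix are hypothetical placeholders, and this is not the code of our harvester.&lt;br/&gt;

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# The six verbs defined by OAI-PMH v2.0.
VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord"}


def build_request(base_url, verb, **arguments):
    """Build the GET URL for a single OAI-PMH request."""
    if verb not in VERBS:
        raise ValueError("not an OAI-PMH verb: " + verb)
    query = {"verb": verb}
    # Skip arguments that were not supplied.
    query.update({k: v for k, v in arguments.items() if v is not None})
    return base_url + "?" + urlencode(query)


BASE_URL = "http://repository.example.org/oai"  # hypothetical repository

# Harvest every record in Dublin Core, the one format all repositories support.
list_url = build_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

# Fetch one record by its unique identifier in another (assumed) format.
record_url = build_request(BASE_URL, "GetRecord",
                           identifier="oai:example.org:dataset-1",
                           metadataPrefix="eml")

print(list_url)
print(record_url)
```

The repository answers each request with an XML document, and the metadata-prefix decides which format the records inside it are expressed in.&lt;br/&gt;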
&lt;span style="font-weight: bold;font-family:verdana;" &gt;GBIF-Metadata Network Topology &lt;/span&gt;&lt;br/&gt;
&lt;span style="font-family:verdana;"&gt;The metadata catalogue will primarily be used as the central catalogue in the GBIF Data Portal for the global GBIF network, which, in turn, will broker information to wider initiatives such as EuroGEOSS, OBIS, etc. Such initiatives are basically OAI-PMH service providers that will be contacted by GBIF metadata harvester.&lt;/span&gt;&lt;br/&gt;

&lt;a style="font-family: verdana;" href="http://gbif-metadata.googlecode.com/files/MetadataTopBlog.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 828px; height: 586px;" src="http://gbif-metadata.googlecode.com/files/MetadataTopBlog.png" alt="" border="0" /&gt;&lt;/a&gt;&lt;br/&gt;
&lt;span style="font-family:verdana;"&gt;The GBIF metadata catalogue service undertakes both harvesting and serving roles; aggregating metadata from other OAI-PMH repositories and serving metadata via OAI-PMH to other harvesting services. The harvested metadata are stored in a local file system. The system can apply XSLT transformation to create a new document based of the content of the existing one (e.g., transforming an EML document to an ISO19139 one).&lt;/span&gt;&lt;br/&gt;
&lt;br/&gt;

&lt;span style="font-weight: bold;font-family:verdana;" &gt;OAI-PMH Harvester&lt;/span&gt;&lt;br/&gt;
&lt;span style="font-family:verdana;"&gt;The harvester is a standalone Java application, it makes extensive use of the open source project “OAIHarvester2” which supports OAI-PMH v1.1 and v2.0. The source code of this project was not modified but extended to handle the harvested XML payload. The payload is delivered as a single file of aggregated xml documents (one per metadata resource).&lt;br/&gt;
This component was implemented by modifying the OAICat (http://www.oclc.org/research/activities/oaicat/default.htm) web application. The main changes, made to achieve specific objectives, are:&lt;/span&gt;&lt;br/&gt;
&lt;span style="font-family:verdana;"&gt;•    Dynamic load of file store. The default behaviour of the server is to load the file list at the server start-up. Since the harvester can modify the file store, the server loads the file list every time a ListIdentifiers or ListRecords verb is requested.&lt;/span&gt;&lt;br/&gt;
&lt;span style="font-family:verdana;"&gt;•    Support multiple XSL transformations for an input format. The reference implementation only supports one transformation, in our implementation an input document can be published using multiple formats; for example: an EML document can be published using Dublin Core and DIF, if a XSL transformation is configured for each output format.&lt;/span&gt;&lt;br/&gt;
&lt;br/&gt;
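The two changes above can be sketched in a few lines of Python. This is only an illustration of the idea, not the modified OAICat code (which is Java): the stylesheet file names, the prefixes and the ".xml" naming convention of the file store are assumptions.&lt;br/&gt;

```python
import os
import tempfile

# (input format, output metadataPrefix) -> stylesheet. All names are hypothetical.
STYLESHEETS = {
    ("eml", "oai_dc"): "eml2dc.xsl",
    ("eml", "dif"): "eml2dif.xsl",
    ("eml", "iso19139"): "eml2iso19139.xsl",
}


def output_formats(input_format):
    """Every prefix a document of this input format can be published under."""
    derived = sorted(out for (src, out) in STYLESHEETS if src == input_format)
    return [input_format] + derived


def stylesheet_for(input_format, prefix):
    """Return the stylesheet to apply, or None when the native format is requested."""
    if prefix == input_format:
        return None
    return STYLESHEETS[(input_format, prefix)]


def list_identifiers(store_dir):
    """Re-read the file store on every call so newly harvested files show up."""
    return sorted(name[:-4] for name in os.listdir(store_dir)
                  if name.endswith(".xml"))


with tempfile.TemporaryDirectory() as store:
    before = list_identifiers(store)
    # Simulate the harvester dropping a new document into the store.
    with open(os.path.join(store, "dataset-1.xml"), "w") as handle:
        handle.write("placeholder content")
    after = list_identifiers(store)

print(output_formats("eml"))
print(before, after)
```

Listing the directory on each request trades a little I/O for never serving a stale file list, which matters when a separate harvester process writes to the same store.&lt;br/&gt;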
&lt;span style="font-family:verdana;"&gt;More detail about this project is available at the google-code project site: http://code.google.com/p/gbif-metadata/. In a future post I’ll explain how the information gathered by the harvester was used to build a search index using Solr and how a  Web application uses the Index to enable end-users the search of metadata.&lt;/span&gt;&lt;br/&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-1170890136300056784?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/1170890136300056784/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/04/oia-pmh-harvesting-at-gbif.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/1170890136300056784'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/1170890136300056784'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/04/oia-pmh-harvesting-at-gbif.html' title='OAI-PMH Harvesting at GBIF'/><author><name>Fede Méndez</name><uri>http://www.blogger.com/profile/11707904250426427540</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-7323235008317670990</id><published>2011-04-20T13:19:00.001+02:00</published><updated>2011-04-20T13:19:36.243+02:00</updated><title type='text'>Cleanup of occurrence records</title><content type='html'>Lars here, like Oliver I've started here in October 2010 and have no biology background either so my first step here at GBIF was to set up the infrastructure&amp;nbsp;&lt;a href="http://gbif.blogspot.com/2011/04/reworking-portal-processing.html"&gt;Tim&lt;/a&gt;&amp;nbsp;was mentioning before, but I've written about that &lt;a href="http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html"&gt;already&lt;/a&gt; (at 
length).&lt;br /&gt;
&lt;br /&gt;
To continue the series of blog posts that was started by &lt;a href="http://gbif.blogspot.com/2011/04/lucene-for-searching-names-in-our-new.html"&gt;Oliver&lt;/a&gt;, and&amp;nbsp;in no particular order, I'll talk about what we are doing to process the incoming data - which is the task I was given after the Hadoop setup was done.&lt;br /&gt;
&lt;br /&gt;
During our rollover we're processing Occurrence records. Millions of them: about 270 million at the moment, and we expect this to grow significantly over the next few months and years. It is only natural that there is bound to be bad data in there for various reasons. These might be anything from simple typos to misconfigured publishing tools and transfer errors.&lt;br /&gt;
&lt;br /&gt;
The more we know about the domain and the data the more we are obviously able to fix. Any input is appreciated on how we could do better on this part of our processing.&lt;br /&gt;
&lt;br /&gt;
For fields like &lt;a href="http://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-store/src/main/resources/dictionaries/parse/kingdoms.txt"&gt;&lt;i&gt;kingdom&lt;/i&gt;&lt;/a&gt;, &lt;a href="http://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-store/src/main/resources/dictionaries/parse/phyla.txt" style="font-style: italic;"&gt;phylum&lt;/a&gt;,&amp;nbsp;&lt;a href="http://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-store/src/main/resources/dictionaries/parse/countryName.txt"&gt;country name&lt;/a&gt; or &lt;i&gt;&lt;a href="http://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-store/src/main/resources/dictionaries/parse/basisOfRecord.txt"&gt;basis of record&lt;/a&gt;&lt;/i&gt;&amp;nbsp;we do a simple lookup in a dictionary to look for common mistakes and replace those with the proper versions. Other fields like &lt;i&gt;class&lt;/i&gt;, &lt;i&gt;order&lt;/i&gt;, &lt;i&gt;family&lt;/i&gt;, &lt;i&gt;genus&lt;/i&gt;&amp;nbsp;and &lt;i&gt;author&lt;/i&gt; have way too many distinct values for us to prepare a dictionary with all the possible errors and their correct forms. That is why we only apply a few safe &lt;a href="http://sites.gbif.org/occurrencestore/occurrence-store/apidocs/org/gbif/occurrencestore/utils/ClassificationUtils.html#clean(java.lang.String)"&gt;clean&lt;/a&gt; up procedures here (e.g. remove &lt;a href="http://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-store/src/main/resources/blacklistedNames.txt"&gt;blacklisted names&lt;/a&gt; or invalid characters).&lt;br /&gt;
&lt;br /&gt;
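To show the difference between the two approaches, here is a small Python sketch: a dictionary lookup for a low-cardinality field and a conservative clean-up for everything else. The dictionary entries, the blacklist and the exact character rules are invented for the example; the real ones live in the files and classes linked above.&lt;br /&gt;

```python
import re

# Hypothetical entries: known variants and misspellings mapped to the proper form.
KINGDOMS = {
    "animalia": "Animalia",
    "animal": "Animalia",
    "anamalia": "Animalia",
    "plantae": "Plantae",
    "plants": "Plantae",
    "fungi": "Fungi",
}
# Hypothetical blacklist of values that carry no information.
BLACKLIST = {"unknown", "none", "null", "n/a"}


def normalise_key(raw):
    """Lower-case and collapse whitespace so lookups ignore trivial variation."""
    return re.sub(r"\s+", " ", raw).strip().lower()


def interpret_kingdom(raw):
    """Dictionary lookup: the canonical value, or None if unrecognised."""
    if raw is None:
        return None
    return KINGDOMS.get(normalise_key(raw))


def clean_name(raw):
    """Conservative clean-up for high-cardinality fields such as genus."""
    if raw is None or normalise_key(raw) in BLACKLIST:
        return None
    # Keep letters, whitespace, dots and hyphens; drop everything else.
    cleaned = re.sub(r"[^\w\s.\-]|[\d_]", "", raw)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned or None


print(interpret_kingdom("  ANAMALIA "))
print(interpret_kingdom("Protozoa"))
print(clean_name(" Panthera?? "))
print(clean_name("N/A"))
```

The lookup can fix real mistakes because every bad value was listed by hand, whereas the clean-up only removes what is certainly noise and never tries to guess a correct name.&lt;br /&gt;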
Scientific names are additionally parsed by the &lt;a href="http://sites.gbif.org/ecat/ecat-common/apidocs/org/gbif/ecat/parser/NameParser.html"&gt;NameParser&lt;/a&gt; in the &lt;a href="http://code.google.com/p/gbif-ecat/"&gt;ECAT&lt;/a&gt; project which does all kinds of fancy magic to try to infer a correct name. &lt;i&gt;Altitudes&lt;/i&gt;, &lt;i&gt;depths&lt;/i&gt; and &lt;i&gt;coordinates&lt;/i&gt; get &lt;a href="http://sites.gbif.org/occurrencestore/occurrence-store/apidocs/org/gbif/occurrencestore/utils/parse/geospatial/GeospatialParseUtils.html"&gt;treatment&lt;/a&gt; as well by looking at common unit markers and errors we've seen in the past.&lt;br /&gt;
&lt;br /&gt;
And last but not least we also try to make the most of the &lt;i&gt;dates&lt;/i&gt; we get. As everyone who has ever dealt with date strings knows, this can be one of the hardest topics in an internationalized environment. In theory our input data consists of three nicely formatted fields: &lt;i&gt;year&lt;/i&gt;, &lt;i&gt;month&lt;/i&gt; and &lt;i&gt;day&lt;/i&gt;. In reality, though, a lot of dates arrive entirely in the &lt;i&gt;year&lt;/i&gt; field. We've got all kinds of delimiters (with "/" and "-" being among the most common ones), abbreviations ("Mar"), database export fragments ("1978.0" because it was a floating point variable in the database), missing data and more.&lt;br /&gt;
&lt;br /&gt;
Additionally we obviously have to deal with different date formats. Is "01/02/02" the first of February or the second of January? In most cases we can only guess.&lt;br /&gt;
&lt;br /&gt;
Having said that: We've rewritten large parts of the &lt;a href="http://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-store/src/main/java/org/gbif/occurrencestore/utils/parse/date/DateParseUtils.java"&gt;date handling routines&lt;/a&gt; and are continuing to improve them as we know that this is an important part of our data. Feedback on how we're doing here is greatly appreciated!&lt;br /&gt;
&lt;br /&gt;
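For readers who want a feel for the problem, here is a much simplified Python sketch of this kind of date interpretation. It is not the DateParseUtils code linked above: the function names, the day-first default and the decision to refuse two-digit years are assumptions of this example.&lt;br /&gt;

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}


def to_int(raw):
    """Parse fragments such as '1978', ' 1978.0 ' or 'Mar'; None if unusable."""
    if raw is None:
        return None
    text = raw.strip().lower()
    if text.isalpha() and text[:3] in MONTHS:
        return MONTHS[text[:3]]
    if re.fullmatch(r"\d+(\.0+)?", text):
        return int(float(text))
    return None


def split_day_month(first, second, day_first):
    """Decide which number is the day; flag the guess when both could be months."""
    if first > 12:
        return first, second, False
    if second > 12:
        return second, first, False
    day, month = (first, second) if day_first else (second, first)
    return day, month, first != second


def parse_date(year, month=None, day=None, day_first=True):
    """Return ((year, month, day) or None, ambiguous flag)."""
    fields = [to_int(year), to_int(month), to_int(day)]
    ambiguous = False
    parts = re.split(r"[/-]", (year or "").strip())
    if len(parts) == 3 and month is None and day is None:
        # The whole date was squeezed into the year field.
        numbers = [to_int(part) for part in parts]
        if None in numbers:
            return None, False
        if len(parts[0].strip()) == 4:        # e.g. 1978-03-12
            fields = numbers
        elif len(parts[2].strip()) == 4:      # e.g. 12/03/1978
            d, m, ambiguous = split_day_month(numbers[0], numbers[1], day_first)
            fields = [numbers[2], m, d]
        else:
            return None, True                 # two-digit year: refuse to guess
    y, m, d = fields
    if y is None:
        return None, ambiguous
    if m is not None and m not in range(1, 13):
        return None, ambiguous
    if d is not None and d not in range(1, 32):
        return None, ambiguous
    if m is not None and d is not None:
        try:
            date(y, m, d)                     # rejects e.g. 30 February
        except ValueError:
            return None, ambiguous
    return (y, m, d), ambiguous


print(parse_date("1978.0"))
print(parse_date("1978", "Mar", "12"))
print(parse_date("01/02/2002"))
print(parse_date("01/02/02"))
```

Returning an explicit ambiguity flag instead of silently picking an order keeps the guess visible, so downstream code can decide whether a guessed date is good enough.&lt;br /&gt;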
I'm really hoping to have a chance to compile a few statistics about our incoming data quality once we've tested all of this in production.&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-7323235008317670990?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/7323235008317670990/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/04/cleanup-of-occurrence-records.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/7323235008317670990'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/7323235008317670990'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/04/cleanup-of-occurrence-records.html' title='Cleanup of occurrence records'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-1402463844058807931</id><published>2011-04-18T15:49:00.002+02:00</published><updated>2011-04-19T09:35:52.257+02:00</updated><title type='text'>Reworking the Portal processing</title><content type='html'>The&amp;nbsp;&lt;a href="http://data.gbif.org/"&gt;GBIF Data Portal&lt;/a&gt;&amp;nbsp;has provided a gateway to discover and access the content shared through the GBIF network for some years, without major change. &amp;nbsp;As the amount of data has grown, GBIF have &lt;a href="http://en.wikipedia.org/wiki/Scalability#Scale_vertically_.28scale_up.29"&gt;scaled vertically (e.g. 
scaling up)&lt;/a&gt;&amp;nbsp;to maintain performance levels; this is becoming unmanageable with the current processing routines due to the amount of SQL statements issued against the database. &amp;nbsp;As GBIF content grows, the indexing infrastructure must change to &lt;a href="http://en.wikipedia.org/wiki/Scalability#Scale_horizontally_.28scale_out.29"&gt;scale out&lt;/a&gt;&amp;nbsp;accordingly.&lt;br /&gt;
&lt;br /&gt;
I have been monitoring and evaluating alternative technologies &lt;a href="http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html"&gt;for some time&lt;/a&gt;&amp;nbsp;and a few months ago GBIF initiated the redevelopment of the processing routines. &amp;nbsp;This current area of work does not increase functionality offered through the portal (that will be addressed following this infrastructural work) but rather aims to:&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Reduce the latency between a record changing on the publisher side, and being reflected in the index&lt;/li&gt;
&lt;li&gt;Reduce the amount of (wo)man-hours needed to coax through a successful processing run&lt;/li&gt;
&lt;li&gt;Improve the quality assurance by inclusion of &amp;nbsp; &amp;nbsp;&lt;/li&gt;
&lt;ul&gt;&lt;li&gt;Checking that terrestrial point locations fall within the stated country using&amp;nbsp;&lt;a href="http://www.naturalearthdata.com/"&gt;shapefiles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Checking coastal waters using&amp;nbsp;&lt;a href="http://www.vliz.be/vmdcdata/marbound/"&gt;Exclusive Economic Zones&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;li&gt;Rework all the date and time handling&lt;/li&gt;
&lt;li&gt;Use dictionaries (vocabularies) for interpretation of fields such as Basis of Record&lt;/li&gt;
&lt;li&gt;Integrate checklists (taxonomic, nomenclatural and thematic) shared through the&amp;nbsp;&lt;a href="http://www.gbif.org/informatics/name-services/"&gt;GBIF ECAT Programme&lt;/a&gt;&amp;nbsp;to improve the taxonomic services, and the&amp;nbsp;&lt;a href="http://gbif.blogspot.com/2011/04/lucene-for-searching-names-in-our-new.html"&gt;backbone ("nub") taxonomy&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Provide a robust framework for future development&lt;/li&gt;
&lt;li&gt;Allow the infrastructure to grow predictably with content and demand growth&lt;/li&gt;
&lt;/ul&gt;Things have progressed significantly since my early investigations, and GBIF are developing using the following technologies:&lt;br /&gt;
&lt;ul&gt;&lt;ul&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://hadoop.apache.org/"&gt;Apache Hadoop&lt;/a&gt;: A distributed file system, and cluster processing using the Map Reduce framework&lt;/li&gt;
&lt;ul&gt;&lt;li&gt;GBIF are using the&amp;nbsp;&lt;a href="http://www.cloudera.com/hadoop/"&gt;Cloudera distribution of Hadoop&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;li&gt;&lt;a href="http://www.cloudera.com/downloads/sqoop/"&gt;Sqoop&lt;/a&gt;: A utility to synchronize between relational databases and Hadoop&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://wiki.apache.org/hadoop/Hive"&gt;Hive&lt;/a&gt;: A data warehouse infrastructure built on top of Hadoop, and developed and open-sourced by&amp;nbsp;&lt;a href="http://www.royans.net/arch/hive-facebook/"&gt;Facebook&lt;/a&gt;. &amp;nbsp;Hive gives SQL capabilities on Hadoop. &amp;nbsp;[Full table scans on GBIF occurrence records reduce from hours to minutes]&lt;/li&gt;
&lt;li&gt;&lt;a href="http://yahoo.github.com/oozie/"&gt;Oozie&lt;/a&gt;: An open-source workflow/coordination service to manage data processing jobs for Hadoop, developed then open-sourced by&amp;nbsp;&lt;a href="http://developer.yahoo.com/hadoop/"&gt;Yahoo!&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div&gt;[GBIF are researching the use of &lt;a href="http://hbase.apache.org/"&gt;HBase&lt;/a&gt;, the Hadoop database, to allow richer indexed content; this will be the subject of future blog posts. &amp;nbsp;See the &lt;a href="http://code.google.com/p/gbif-occurrencestore/"&gt;project site&lt;/a&gt;]&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div&gt;The processing workflow looks like the following (click for full size):&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-oCOvYEjOBbE/Taw9UYcFSNI/AAAAAAAAADk/7g1rilfYkds/s1600/oozie.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="424" src="http://3.bp.blogspot.com/-oCOvYEjOBbE/Taw9UYcFSNI/AAAAAAAAADk/7g1rilfYkds/s640/oozie.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div&gt;The Oozie workflow is still being developed, but the workflow definition&amp;nbsp;&lt;a href="http://code.google.com/p/gbif-occurrencestore/source/browse/trunk/oozie-apps/rollover/workflow.xml"&gt;can be found here&lt;/a&gt;.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-1402463844058807931?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/1402463844058807931/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/04/reworking-portal-processing.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/1402463844058807931'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/1402463844058807931'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/04/reworking-portal-processing.html' title='Reworking the Portal processing'/><author><name>Tim Robertson</name><uri>http://www.blogger.com/profile/07889700598656669041</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://1.bp.blogspot.com/_5wZ2Fic5QtA/STxJva6Zq2I/AAAAAAAAABc/g2I1cG9fztw/S220/tim.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-oCOvYEjOBbE/Taw9UYcFSNI/AAAAAAAAADk/7g1rilfYkds/s72-c/oozie.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-6177494919622965712</id><published>2011-04-18T02:00:00.000+02:00</published><updated>2011-04-18T13:45:01.138+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='nub'/><category scheme='http://www.blogger.com/atom/ns#' term='lucene'/><category scheme='http://www.blogger.com/atom/ns#' term='taxonomy'/><category scheme='http://www.blogger.com/atom/ns#' term='names'/><title type='text'>Lucene 
for searching names in our new common taxonomy</title><content type='html'>&lt;p&gt;Oliver here - I'm one of the new developers at GBIF, having started in October, 2010. With no previous experience in biology or biological classification you can bet it's been a steep learning curve in my time here, but at the same time it's very nice to be learning about a domain that's real, valuable and permanent, rather than yet another fleeting e-commerce, money-trading or "social media" application!&lt;/p&gt;&lt;p&gt;One of the features of GBIF's &lt;a href="http://data.gbif.org/"&gt;Data Portal&lt;/a&gt; is the ability to search primary occurrence data via a backbone taxonomy.  For example let's say you're interested in snow leopards and would like to plot all current and historical occurrences of this elusive cat on a world map.  Let's further say that David Attenborough suggested to you that the snow leopard's scientific name is "Panthera uncia".  You would ask the data portal for all records about Panthera uncia and expect to see all occurrences of snow leopards.  Unfortunately biologists aren't agreed on how to classify the snow leopard - some argue that it belongs in the genus Panthera, while others argue that it should belong to its own genus, Uncia, and naturally the GBIF network has records under both names.  You would just like to see all of those records and never mind the details - and that's just the tip of the iceberg when it comes to building a backbone taxonomy to match the 260 million+ occurrence records in the GBIF network. &lt;/p&gt;&lt;p&gt;Indeed, the backbone taxonomy (we call it our "Nub Taxonomy") in use by the current data portal has been one of the biggest sources of criticism of the GBIF data portal - it doesn't cover enough of the names in our occurrence records, and it doesn't handle the tricky stuff (as above) as well as it should. 
One of the reasons for that is that the current backbone taxonomy was built on the Catalogue of Life 2007 and an International Plant Names Index (IPNI) of similar vintage, and then augmented with the classifications from any unmatched occurrence records.  This has led to a classification hierarchy which is less reliable than we (and the GBIF network) would like.&lt;/p&gt;&lt;p&gt;Markus Döring is the GBIF software team's taxonomy expert and he has employed a new strategy for building an improved Nub Taxonomy by building it exclusively on well-known and respected taxonomies already out there - things like the most recent Catalogue of Life, IPNI, and more, but without using the classifications as given in the occurrence data.  After the Nub Taxonomy is built, the occurrence records then need to be matched to it.  As the first step to integrating the new Nub Taxonomy into the data portal, my job in the last little while has been to build a searchable index of all the names in our Nub Taxonomy and a web service that can accept a scientific name (from an occurrence record) and match it to the index, while understanding the implications of homonyms and synonyms, as well as tolerating misspellings.  And of course, make it fast :)&lt;/p&gt;&lt;p&gt;Since what we're talking about here is string matching with a tolerance for messy input (e.g. spelling mistakes, different violations of nomenclatural rules) the place to start is &lt;a href="http://lucene.apache.org/"&gt;Lucene&lt;/a&gt;.  Our Nub Taxonomy has about 8 million unique names, and our 260 million occurrence records also contain roughly 8 million unique names.  Our use case is somewhat out of the ordinary for Lucene in that we can build the index once and after that it becomes read-only until the next update of our Nub Taxonomy (e.g. to reflect an update in the Catalogue of Life), and it only takes a few minutes to build the index, so it's not all that important for it to be persistent.  
That means we can optimize for search speed and not worry so much about indexing performance.  Lucene has just the index storage implementation for this need - &lt;a href="http://lucene.apache.org/java/3_1_0/api/all/org/apache/lucene/store/RAMDirectory.html"&gt;RAMDirectory&lt;/a&gt;.  For the most part this worked just fine, but no matter how hard I hit the index, I couldn't get CPU usage to 100% - the best I could do was about 80%.  I found that very irksome and spent some time testing different Directory implementations, web service stacks, and everything in between.  None of the other Directory implementations (all file based in some way) showed any improvements, nor did eliminating the web stack.  Finally, by attaching a profiler to the Tomcat instance running the web service with RAMDirectory, we were able to see thread blocking increasing in proportion to the number of requesting threads.  That led us to the Lucene source code where we found a synchronized block that we deemed the culprit.  With the cause at least found I decided not to waste time trying to fix the problem for what would be nominal gain, but instead decided to use two Tomcat installations and load balance between them with Apache.  With the Tomcats running on quite powerful machines we are now seeing approximately 1000 lookups/sec (including a bunch of business logic beyond the Lucene lookup), which we think is pretty good, and sufficient for our purposes.&lt;/p&gt;&lt;p&gt;This is all being used from within our Oozie orchestrated Hive/Hadoop workflow (which Lars will talk more about soon) but once we're confident that it's behaving properly and stably we will also offer this web service (or something similar) for public consumption.  
More importantly the new Nub Taxonomy will be available in the GBIF data portal very soon and with it we hope to have eliminated most of the problems people have found with our current backbone taxonomy.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-6177494919622965712?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/6177494919622965712/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/04/lucene-for-searching-names-in-our-new.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/6177494919622965712'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/6177494919622965712'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/04/lucene-for-searching-names-in-our-new.html' title='Lucene for searching names in our new common taxonomy'/><author><name>Oliver Meyn</name><uri>http://www.blogger.com/profile/04706642473308341930</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/-6_sfRK7-oU0/TaWanB_bQ3I/AAAAAAAAAAM/RcPMC0E9E3k/s220/oliver_amory_profile.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-3258977646105403465</id><published>2011-04-15T17:26:00.014+02:00</published><updated>2011-04-16T12:12:17.518+02:00</updated><title type='text'>The first drafts of the Data Publishing Manuals are available for feedbacks</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-QE1bex1gbL4/TakulN8Wq4I/AAAAAAAAADY/Ijb0xgCKNt0/s1600/13_DocumentMapOccurrence_web_200.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 
1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-QE1bex1gbL4/TakulN8Wq4I/AAAAAAAAADY/Ijb0xgCKNt0/s1600/13_DocumentMapOccurrence_web_200.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;Since Darwin Core was officially ratified by Biodiversity Information Standards (TDWG) in November 2009, GBIFS has developed a few tools that leverage the standard data format, the Darwin Core Archive, to facilitate data mobilisation. These tools include the &lt;a href="http://tools.gbif.org/dwca-assistant/"&gt;Darwin Core Archive Assistant&lt;/a&gt;, the &lt;a href="http://tools.gbif.org/spreadsheet-processor/"&gt;GBIF Spreadsheet Processor&lt;/a&gt; and some &lt;a href="http://tools.gbif.org/dwca-validator/"&gt;validators&lt;/a&gt; that users can use to produce standard-compliant files for data exchange or publishing. Also, the IPT &lt;a href="http://www.gbif.org/communications/news-and-events/showsingle/article/gbif-release-the-integrated-publishing-toolkit-v20/"&gt;was recently upgraded to version 2&lt;/a&gt; and now fully supports publishing metadata, occurrence data and taxonomic data using Darwin Core Archives.&lt;/p&gt;
&lt;p&gt;Accompanying these development efforts, a suite of documents has also been prepared to instruct users not only on the usage of the individual software tools, but also on how to make data available within the GBIF Network. Given the range of tool options we have in the biodiversity information world, we organised these materials according to the kind of content that users want to publish, and present a document map for users to follow. So, if you go to the &lt;a href="http://www.gbif.org/informatics/"&gt;Informatics section&lt;/a&gt; of the GBIF web site, there are pages called "publishing" under "&lt;a href="http://www.gbif.org/informatics/discoverymetadata/publishing/"&gt;Discovery/Metadata&lt;/a&gt;," "&lt;a href="http://www.gbif.org/informatics/primary-data/publishing/"&gt;Primary Data&lt;/a&gt;" and "&lt;a href="http://www.gbif.org/informatics/name-services/publishing/"&gt;Name Services&lt;/a&gt;." The maps are there to guide you through publishing your data. Every node in the maps is clickable and will lead you to the individual manuals.&lt;/p&gt;
&lt;p&gt;The intention of using a map as a guide is to suggest a route that gives users a basic understanding of data publishing before they play with the software tools, so that readers are not handed a bunch of documents without knowing where to start, or put off by having to read all of them before starting. We also try to keep each manual as compact as possible, with the emphasis on steps rather than just theory.&lt;/p&gt;
&lt;p&gt;In addition to users with a biodiversity background, we'd like to invite developers to evaluate these draft materials, too. Any comments are welcome, especially on whether these manuals help in explaining the data publishing workflow to the users you serve.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-3258977646105403465?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/3258977646105403465/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/04/first-drafts-of-data-publishing-manuals.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3258977646105403465'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3258977646105403465'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/04/first-drafts-of-data-publishing-manuals.html' title='The first drafts of the Data Publishing Manuals are available for feedbacks'/><author><name>Burke Chih-Jen Ko</name><uri>http://www.blogger.com/profile/09806308970203169452</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/-JHV15rSIJlw/Td_9T-7V2iI/AAAAAAAAABI/TamywweE4I4/s220/P1282909r_icon.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-QE1bex1gbL4/TakulN8Wq4I/AAAAAAAAADY/Ijb0xgCKNt0/s72-c/13_DocumentMapOccurrence_web_200.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-3798198801908285803</id><published>2011-04-13T10:27:00.018+02:00</published><updated>2011-04-18T15:50:21.859+02:00</updated><title type='text'>Can IPT2 handle big datasets now?</title><content type='html'>One of IPT1's most serious problems was its inability to handle large datasets. 
For example, a dataset with only half a million records (relatively small compared to some of the biggest in the GBIF network) caused the application to slow down to such a degree that even the most patient users were throwing their hands up in dismay.&lt;br /&gt;
Anyway, I wanted to see for myself whether the IPT’s problems with large datasets have been overcome in the newest version: IPT2.&lt;br /&gt;
&lt;br /&gt;
Here’s what I did to run the test: First, I connected to a MySQL database and used a “select * from … limit …” query to define my source data totalling 24 million records (the same number of records as a large dataset coming from Sweden). Next, I mapped 17 columns to Darwin Core occurrence terms and once this was done I was able to start the publication of a Darwin Core Archive (DwC-A). The publication took just under 50 minutes to finish, processing approximately 500,000 records per minute. Take a look at the screenshot below, which was taken after the successful publication. It is important to note that this test was run on a Tomcat server with only 256MB of memory. In fact, special care was taken during the design of IPT2 to ensure it could still run on older hardware that didn’t have a lot of memory. It’s worth noting that this is one of the reasons why IPT2 is not as feature rich as the IPT1 was.&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-874rFo6vMFQ/TakvRg0Fk_I/AAAAAAAAADc/YsIZxP8TVMk/s1600/IPT.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="388" src="http://3.bp.blogspot.com/-874rFo6vMFQ/TakvRg0Fk_I/AAAAAAAAADc/YsIZxP8TVMk/s640/IPT.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
So just how does the IPT2 handle 24 million records coming from a database while running on a system with so little memory? The answer is that instead of all records being returned at once, they are retrieved in small result sets of only about 1000 records each. These result sets are then streamed to a file and immediately written to disk. The final DwC-A generated was 3.61GB in size, so some disk space is obviously needed too.&lt;br /&gt;
&lt;br /&gt;
In conclusion, I feel that the IPT2 has successfully overcome its previous problems handling large datasets. I hope other adopters will now give it a shot themselves.&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-3798198801908285803?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/3798198801908285803/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/04/can-ipt2-handle-big-datasets-now.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3798198801908285803'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3798198801908285803'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/04/can-ipt2-handle-big-datasets-now.html' title='Can IPT2 handle big datasets now?'/><author><name>Kyle Braak</name><uri>http://www.blogger.com/profile/16423423909368777750</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-874rFo6vMFQ/TakvRg0Fk_I/AAAAAAAAADc/YsIZxP8TVMk/s72-c/IPT.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-3367198889388220808</id><published>2011-04-11T11:30:00.001+02:00</published><updated>2011-04-11T16:20:05.634+02:00</updated><title type='text'>The GBIF Development Team</title><content type='html'>Recently the GBIF development group have been asked to communicate more on the work being carried out in the secretariat. &amp;nbsp;To quote one message: &lt;br /&gt;
&lt;blockquote&gt;"&lt;i&gt;IMHO, simply making all these discussions public via a basic mailing list could help people like me ... have a better awareness of what's going on... We could add our comments&amp;nbsp;/ identify possible drawbacks / make some "scalability tests"...&amp;nbsp;In fact I'm really eager to participate to this process&lt;/i&gt;"&lt;span class="Apple-style-span" style="color: #999999;"&gt; (developer in Belgium)&lt;/span&gt;&lt;/blockquote&gt;To kick things off, we plan to make better use of this blog and have set a target of&lt;b&gt; &lt;u&gt;posting 2-3 times a week&lt;/u&gt;&lt;/b&gt;. &amp;nbsp;This is a technical blog, so the anticipated audience include developers, database administrators and those interested in following details of the GBIF software development. &amp;nbsp;We have always welcomed external contributers to this blog and invite any developers working on publishing content through the GBIF network, or developing tools that make use of content discoverable and accessible through GBIF to write posts. &lt;br /&gt;
&lt;br /&gt;
Today we are pleased to welcome &lt;b&gt;Jan Legind&lt;/b&gt; to the team as a data administrator, helping to improve the frequency of the network crawling (harvesting) and the indexing processes. &amp;nbsp;Jan will be working closely with the data publishers to help improve the quality and quantity of content accessible through GBIF.&lt;br /&gt;
&lt;br /&gt;
The GBIF development group has expanded in the past 6 months, so I'll introduce the whole team working in the secretariat and contracted to GBIF:&lt;br /&gt;
&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Developers (in order of appearance in the team): &lt;b&gt;Kyle Braak&lt;/b&gt;, &lt;b&gt;José Cuadra&lt;/b&gt;, &lt;b&gt;Markus Döring&lt;/b&gt; (contracted in Germany), &lt;b&gt;Daniel Amariles&lt;/b&gt; &amp;amp; &lt;b&gt;Hectór Tobón&lt;/b&gt; (contracted at CIAT in Colombia), &lt;b&gt;Federico Méndez&lt;/b&gt;, &lt;b&gt;Lars Francke&lt;/b&gt; and &lt;b&gt;Oliver Meyn&lt;/b&gt;.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Systems architect: &lt;b&gt;Tim Robertson&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Systems analyst: &lt;b&gt;Andrea Hahn&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Informatics liaison: &lt;b&gt;Burke (Chih-Jen) Ko&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Systems admins: &lt;b&gt;Ciprian Vizitiu&lt;/b&gt; &amp;amp; &lt;b&gt;Andrei Cenja&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;Data administrator: &lt;b&gt;Jan Legind&lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;br /&gt;
&lt;div&gt;The current focus of work at GBIF includes the following major activities:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Developing and rolling out the&amp;nbsp;&lt;a href="http://code.google.com/p/gbif-providertoolkit/"&gt;Integrated Publishing Toolkit&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Integrating the checklist&amp;nbsp;(taxonomic, nomenclatural and thematic)&amp;nbsp;content into the current&amp;nbsp;&lt;a href="http://data.gbif.org/"&gt;Data portal&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Developing a processing framework to automate the steps needed to apply quality control and index content for discovery through the&amp;nbsp;&lt;a href="http://data.gbif.org/"&gt;Data portal&lt;/a&gt;.&lt;/li&gt;
&lt;ul&gt;&lt;li&gt;Specifically, reducing the time taken and the complexity involved in initiating a&amp;nbsp;&lt;i&gt;rollover&lt;/i&gt;&amp;nbsp;of the content behind the index&lt;/li&gt;
&lt;li&gt;Reworking all quality control (geographic, taxonomic and temporal)&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Automating the process&lt;/li&gt;
&lt;/ul&gt;&lt;li&gt;Initiating a redesign of the data portal user interface to provide richer discovery and integration across dataset metadata, checklists and primary biodiversity data.&lt;/li&gt;
&lt;li&gt;Reducing the time between publishing content onto the network and discovery through the&amp;nbsp;&lt;a href="http://data.gbif.org/"&gt;Data portal&lt;/a&gt;. &amp;nbsp;This includes providing specific support to those who are experiencing problems with large datasets in particular, and assisting in migration to the DarwinCore-Archive format.&lt;/li&gt;
&lt;li&gt;Technical and user documentation of the publishing options available&lt;/li&gt;
&lt;/ul&gt;&lt;div&gt;Let the blogging begin.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div&gt;[Please use #gbif in twitter hashtags]&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-3367198889388220808?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/3367198889388220808/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/04/gbif-development-team.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3367198889388220808'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3367198889388220808'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/04/gbif-development-team.html' title='The GBIF Development Team'/><author><name>Tim Robertson</name><uri>http://www.blogger.com/profile/07889700598656669041</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://1.bp.blogspot.com/_5wZ2Fic5QtA/STxJva6Zq2I/AAAAAAAAABc/g2I1cG9fztw/S220/tim.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-3385874402411941837</id><published>2011-01-26T16:20:00.000+01:00</published><updated>2011-01-26T16:20:54.606+01:00</updated><title type='text'>Setting up a Hadoop cluster - Part 1: Manual Installation</title><content type='html'>&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Introduction&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
In the last few months I was tasked several times with setting up Hadoop clusters. Those weren't huge - two to thirteen machines - but from what I read and hear this is a common use case, especially for companies just starting with Hadoop or setting up a first small test cluster.  While there is a huge amount of documentation in the form of official documentation, blog posts, articles and books, most of it stops just where it gets interesting: dealing with all the stuff you really have to do to set up a cluster, cleaning logs, maintaining the system, knowing what and how to tune etc.&lt;br /&gt;
&lt;br /&gt;
I'll try to describe all the hoops we had to jump through and all the steps involved to get our Hadoop cluster up and running. Probably trivial stuff for experienced Sysadmins, but if you're a Developer and find yourself in the "Devops" role all of a sudden I hope it is useful to you.&lt;br /&gt;
&lt;br /&gt;
While working at&amp;nbsp;&lt;a href="http://www.gbif.org/"&gt;GBIF&lt;/a&gt; I was asked to set up a Hadoop cluster on 15 existing and 3 new machines. So the first interesting thing about this setup is that it is a heterogeneous environment: three different configurations at the moment. This is where our first goal came from: we wanted some kind of automated configuration management. We needed to try different cluster configurations and we needed to be able to shift roles around the cluster without having to do a lot of manual work on each machine. We decided to use a tool called &lt;a href="http://www.puppetlabs.com/"&gt;Puppet&lt;/a&gt; for this task.&lt;br /&gt;
&lt;br /&gt;
While Hadoop is not currently in production at GBIF there are mid- to long-term plans to switch parts of our infrastructure to various components of the HStack. These include MapReduce jobs with Hive and perhaps Pig (there is already strong knowledge of SQL here), and storing large amounts of raw data in HBase (~500 million records by next year) to be processed asynchronously and indexed in a Lucene/Solr solution, possibly using something like Katta to distribute indexes. For good measure we also have fairly complex geographic calculations and map-tile rendering that could be done on Hadoop. So we have those 18 machines and no real clue how they'll be used and which services we'd need in the end.&lt;br /&gt;
&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Environment&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
As mentioned before we have three different server configurations. We've put those machines in three logical clusters &lt;em&gt;c1&lt;/em&gt;, &lt;em&gt;c2&lt;/em&gt; and &lt;em&gt;c3&lt;/em&gt; and simply number the nodes within each (our master, for example, is currently running on &lt;em&gt;c1n1&lt;/em&gt;):
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;c1&lt;/em&gt; (10 machines): Intel(R) Xeon(R) CPU X3363 @ 2.83GHz, 2x6MB (quad), 8 GB RAM, 2 x 500GB SATA 7.2K&lt;/li&gt;
&lt;li&gt;&lt;em&gt;c2&lt;/em&gt; (3 machines): 2 x Intel(R) Xeon(R) CPU E5630 @ 2.53GHz (quad), 24 GB RAM, 6 x 250 GB SATA 5.4K&lt;/li&gt;
&lt;li&gt;&lt;em&gt;c3&lt;/em&gt; (5 machines): Intel(R) Xeon(R) CPU X3220 @ 2.40GHz (quad), 4 GB RAM, 2 x 160 GB SATA 7.2K&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.centos.org/"&gt;CentOS&lt;/a&gt; 5.5&lt;/li&gt;
&lt;li&gt;The machines are in different racks but connected to only one switch&lt;/li&gt;
&lt;/ul&gt;
We realize that this is a very heterogeneous cluster configuration. We also realize that some people highly discourage use of old machines or machines with little RAM but the &lt;em&gt;c1&lt;/em&gt; and &lt;em&gt;c3&lt;/em&gt; clusters were old unused machines and this way they still serve a purpose and we've had no problems so far using this setup.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Goal&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
These were the goals we set out to achieve on our cluster and these are also all the things I'll try to describe in this or a following post:
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;Puppet for setting up the services and configuring machine state&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.cloudera.com/display/DOC/Hadoop+Installation"&gt;CDH3&lt;/a&gt; (Beta 3)
&lt;ul&gt;
&lt;li&gt;Hadoop HDFS + MapReduce incl. Hadoop LZO&lt;/li&gt;
&lt;li&gt;Hue&lt;/li&gt;
&lt;li&gt;Zookeeper&lt;/li&gt;
&lt;li&gt;HBase&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Easily distributable packages for Hadoop, Hive and Pig to be used by the employees to access the cluster from their own workstations&lt;/li&gt;
&lt;li&gt;Benchmarks &amp;amp; Optimizations&lt;/li&gt;
&lt;/ul&gt;
Be warned: this is going to be a very long post, and unfortunately it is the nature of these things that some of the information is bound to be outdated pretty quickly, so let me know if something has changed and I'll alter the post.
&lt;br /&gt;
&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Manual Installation&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
Before we use Puppet to do everything automatically I will show how it can be done manually. I think it is important to know all the steps in case something goes wrong or you decide not to use Puppet at all. When I talk about "the server" I always mean "all servers in your cluster" except when noted otherwise. I highly recommend not skipping this part even if you want to use Puppet.
&lt;br /&gt;
&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-size: large;"&gt;Operating System&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
For now I'll just assume a vanilla CentOS 5.5 installation. There's nothing special you need. I recommend installing just the bare minimum; everything else can be installed later. A few words though about things you might want to do: your servers probably have multiple disks. You shouldn't use any RAID or LVM on any of your slaves (i.e. DataNodes/TaskTrackers). Just use a JBOD configuration. In our cluster all disks are in a simple structure:
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/mnt/disk1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/mnt/disk2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;
There are also two tweaks for your slaves you can do:
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;Mount your data disks with&amp;nbsp;&lt;code&gt;noatime&lt;/code&gt; (e.g. &lt;code&gt;/dev/sdc1 /mnt/disk3 ext3 defaults,noatime 1 2&lt;/code&gt;, which by the way implies &lt;code&gt;nodiratime&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;By default a certain number of blocks are reserved on ext file systems (I'm not familiar with others); check by running &lt;code&gt;tune2fs -l /dev/sdc1&lt;/code&gt; and looking at the &lt;em&gt;Reserved block count&lt;/em&gt;. This is useful on system disks, where it lets critical processes still write some data when the disk is full, but on our data disks it is wasted space. By default 5% of each file system is reserved for this. I recommend lowering it to 1% by running &lt;code&gt;tune2fs -m 1 &amp;lt;device&amp;gt;&lt;/code&gt; on all your data disks (e.g. &lt;code&gt;tune2fs -m 1 /dev/sdc1&lt;/code&gt;), which frees up quite a bit of disk space. You can also set it to 0% if you want, though I went with 1% for our cluster. Keep the default setting for your system disks though!&lt;/li&gt;
&lt;/ul&gt;
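Put together, the two tweaks might look roughly like this for a single data disk. This is only a sketch: it assumes the example device &lt;code&gt;/dev/sdc1&lt;/code&gt; mounted on &lt;code&gt;/mnt/disk3&lt;/code&gt; from above and that &lt;code&gt;/etc/fstab&lt;/code&gt; already carries the &lt;code&gt;noatime&lt;/code&gt; option, so substitute your own devices and mount points:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;# Lower the reserved blocks to 1% on a data disk (never on a system disk)
tune2fs -m 1 /dev/sdc1
# Check the result by looking at the reserved block count
tune2fs -l /dev/sdc1 | grep -i "reserved block count"
# Pick up the noatime option without rebooting
mount -o remount,noatime /mnt/disk3&lt;/pre&gt;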
On your NameNode however use any means you feel&amp;nbsp;necessary&amp;nbsp;to secure your data. You know your requirements better than I do. Use RAID and/or LVM however you like. We don't have any special resources so our NameNode is running on one of our regular servers at the moment. We might change that in the future.
&lt;br /&gt;
&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-size: large;"&gt;A note on Cloudera's Package system &amp;amp; naming&lt;/span&gt;
&lt;br /&gt;
&lt;br /&gt;
Cloudera provides the various components of Hadoop in different packages but they follow a simple structure: There is one &lt;code&gt;hadoop-0.20&lt;/code&gt; package which contains all the jars, config files, directories, etc. needed for all the roles. And then there are packages like &lt;code&gt;hadoop-0.20-namenode&lt;/code&gt; which are only a few kilobytes and contain only the&amp;nbsp;appropriate&amp;nbsp;start and stop scripts for the role in question.
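&lt;br /&gt;
If you want to see this split for yourself, the standard &lt;code&gt;yum&lt;/code&gt; and &lt;code&gt;rpm&lt;/code&gt; query commands can help. A small sketch, assuming the Cloudera repository described below has already been added and, for the second command, that the role package is already installed:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;# Show the description and size of a role package
yum info hadoop-0.20-namenode
# List the files shipped by the role package once it is installed
rpm -ql hadoop-0.20-namenode&lt;/pre&gt;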
&lt;br /&gt;
&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-size: large;"&gt;1. Common Requirements&lt;/span&gt;
&lt;br /&gt;
&lt;br /&gt;
Most of the commands in this guide need to be executed as &lt;code&gt;root&lt;/code&gt;. I've chosen the easy route here and just logged in as &lt;code&gt;root&lt;/code&gt;. If you're operating as a non-privileged user remember to use &lt;code&gt;su&lt;/code&gt;, &lt;code&gt;sudo&lt;/code&gt; or any other means to ensure you have the proper rights.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Repository&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.cloudera.com/display/DOC/CDH3+Installation"&gt;Cloudera documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
As all the packages we're going to install are provided by Cloudera we need to add their repository to our cluster:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;curl http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo &amp;gt; /etc/yum.repos.d/cloudera-cdh3.repo&lt;/pre&gt;

&lt;b&gt;Java installation&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.cloudera.com/display/DOC/Java+Development+Kit+Installation"&gt;Cloudera documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.oracle.com/technetwork/java/javase/downloads/index.html"&gt;Java downloads&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;We're using JDK 6 Update 23&lt;/li&gt;
&lt;/ul&gt;
You have to download the JDK from Oracle's website yourself as license issues prevent it from being added to the repositories. Choose the correct system (probably Linux x64) and make sure to download the file ending in &lt;code&gt;-rpm.bin&lt;/code&gt; (e.g. &lt;code&gt;jdk-6u23-linux-x64-rpm.bin&lt;/code&gt;). You might have to do this from a client machine because you need a browser that works with the Oracle site. So on any one machine execute the following:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;unzip jdk-6u23-linux-x64-rpm.bin&lt;/pre&gt;
You should now have a bunch of .rpm files but you only need one of them:&amp;nbsp;&lt;code&gt;jdk-6u23-linux-amd64.rpm&lt;/code&gt;. Copy this file to your servers and install it as root using rpm:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;rpm -Uvh ./jdk-6u23-linux-amd64.rpm&lt;/pre&gt;

&lt;b&gt;Time&lt;/b&gt;
&lt;br /&gt;
While not a hard requirement, it makes a lot of things easier if the clocks on your servers are synchronized. I added this part at the last minute because we just realized that &lt;code&gt;ntpd&lt;/code&gt; had been disabled by accident on three of our machines (c2), which caused us some problems. It is worth taking a look at the clocks now and setting up &lt;code&gt;ntp&lt;/code&gt; properly before you start.
&lt;br /&gt;&lt;br /&gt;

&lt;b&gt;DNS&lt;/b&gt;
&lt;br /&gt;
It doesn't matter if you use a DNS server or hosts files or any other means for the servers to find each other. But make sure this works! Do it now! Even if you think everything's set up correctly. Another thing that you should check is if the local hostname resolves to the public IP address. If you're using a DNS server you can use &lt;code&gt;dig&lt;/code&gt; to test this but that doesn't take into account the&amp;nbsp;&lt;code&gt;/etc/hosts&lt;/code&gt; file so here is a simple test to see if it is correct:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;ping -c 1 `hostname`&lt;/pre&gt;
This should resolve to the public IP and not to &lt;code&gt;127.0.0.1&lt;/code&gt;.
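&lt;br /&gt;&lt;br /&gt;
If you'd rather not read the ping output by eye, the same check can be scripted. This is only a sketch: it assumes &lt;code&gt;getent&lt;/code&gt; and &lt;code&gt;awk&lt;/code&gt; are available (they are on a stock CentOS install), and &lt;code&gt;is_loopback&lt;/code&gt; is a helper name made up for this example. &lt;code&gt;getent hosts&lt;/code&gt; goes through the normal system resolver, so unlike &lt;code&gt;dig&lt;/code&gt; it takes &lt;code&gt;/etc/hosts&lt;/code&gt; into account.

```shell
# Succeeds if the given address is a loopback address.
is_loopback() {
  case "$1" in
    127.*|::1) return 0 ;;
    *) return 1 ;;
  esac
}

# Resolve the local hostname through the system resolver (this honours
# /etc/hosts as well as DNS) and keep only the first address returned.
addr=$(getent hosts "$(hostname)" | awk '{ print $1; exit }' || true)

if [ -z "$addr" ]; then
  echo "hostname does not resolve at all"
elif is_loopback "$addr"; then
  echo "hostname resolves to loopback ($addr): fix /etc/hosts or DNS"
else
  echo "hostname resolves to $addr"
fi
```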
&lt;br /&gt;&lt;br /&gt;

&lt;b&gt;Firewall&lt;/b&gt;
&lt;br /&gt;
Hadoop uses a lot of ports for its internal and external&amp;nbsp;communications. We've just allowed all traffic between the servers in the cluster and clients. But if you don't want to do that you can also selectively open the required ports. I try to mention them but they can all be changed in the configuration files. I might also miss some due to our config so I'd be glad if someone could point those out to me.
&lt;br /&gt;&lt;br /&gt;

&lt;b&gt;Packages&lt;/b&gt;
&lt;br /&gt;
We're going to use lzo compression, the Hadoop native libraries as well as hue so there are a few common dependencies on all machines in the cluster which can be easily installed:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
yum install -y lzo hue-plugins hadoop-0.20-native&lt;/pre&gt;

&lt;b&gt;Directories&lt;/b&gt;
&lt;br /&gt;
We also need some directories later on so we can just create them now:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;mkdir &amp;lt;data disk&amp;gt;/hadoop
chown root:hadoop &amp;lt;data disk&amp;gt;/hadoop&lt;/pre&gt;
Cloudera uses the &lt;a href="http://linux.die.net/man/8/alternatives"&gt;alternatives&lt;/a&gt; system to manage configuration. In &lt;code&gt;/etc/hadoop/conf&lt;/code&gt; is the currently activated configuration. Look at the contents of &lt;code&gt;/etc/hadoop&lt;/code&gt; and you'll find all the installed configurations. At the moment there is only a &lt;code&gt;conf.empty&lt;/code&gt; directory which we'll use as our starting point:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;cp -R /etc/hadoop/conf.empty /etc/hadoop/conf.cluster&lt;/pre&gt;
Now feel free to edit the configuration files in &lt;code&gt;/etc/hadoop/conf.cluster&lt;/code&gt; but we'll go through them as well later in this post. The last step is to activate this configuration:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;/usr/sbin/alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster 50&lt;/pre&gt;

&lt;b&gt;LZO&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://code.google.com/p/hadoop-gpl-compression/"&gt;hadoop-gpl-compression project&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Todd Lipcon's &lt;a href="https://github.com/toddlipcon/hadoop-lzo"&gt;hadoop-lzo&lt;/a&gt; &amp;amp; Kevin Weil's &lt;a href="https://github.com/kevinweil/hadoop-lzo"&gt;hadoop-lzo&lt;/a&gt; projects&lt;/li&gt;
&lt;/ul&gt;
Due to licensing issues the LZO bindings for Hadoop cannot be distributed the same way as the rest of the packages, so this once again involves a few manual steps. After these bindings were removed from Hadoop itself a few versions ago they moved to the hadoop-gpl-compression project on Google Code, which (as far as I know) still works but hasn't seen any development for over a year. Thankfully Twitter's Kevin Weil and Cloudera's Todd Lipcon have picked up the project and maintained it. They regularly sync their GitHub repositories so both should have almost the same code. I'm going to use Todd's version here as it should be better synced with CDH releases. You have to download the code from the repository, build the native libraries as well as the jar file, and distribute those files on your cluster. You only need to build on one machine, which ideally should run the same OS version as the servers in your cluster. When you're finished you can just copy the result to all servers. We're using version 0.4.9, so we use this to download and build:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
yum install -y lzo-devel
wget --no-check-certificate https://github.com/toddlipcon/hadoop-lzo/tarball/0.4.9
tar xvfz toddlipcon-hadoop-lzo-0.4.9-0-g0e70051.tar.gz
wget http://www.apache.org/dist/ant/binaries/apache-ant-1.8.2-bin.tar.bz2
tar jxvf apache-ant-1.8.2-bin.tar.bz2
cd toddlipcon-hadoop-lzo-0e70051
JAVA_HOME=/usr/java/latest/ BUILD_REVISION="0.4.9" ../apache-ant-1.8.2/bin/ant tar&lt;/pre&gt;
The ant version that comes with CentOS 5.5 didn't work for me, which is why I downloaded a newer one. This should leave you with a &lt;code&gt;hadoop-lzo-0.4.9.tar.gz&lt;/code&gt; file in the build directory which you can extract to get all the necessary files for your servers:
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;hadoop-lzo-0.4.9.jar&lt;/code&gt; needs to be copied into &lt;code&gt;/usr/lib/hadoop/lib&lt;/code&gt; on each server&lt;/li&gt;
&lt;li&gt;&lt;code&gt;lib/native/Linux-amd64-64&lt;/code&gt; needs to be copied into &lt;code&gt;/usr/lib/hadoop/lib/native&lt;/code&gt; on each server&lt;/li&gt;
&lt;/ul&gt;

&lt;b&gt;cron &amp;amp; log cleaning&lt;/b&gt;
&lt;br /&gt;
We've had a problem with unintentional debug logs filling up our hard drives. The investigations that followed that incident resulted in a &lt;a href="http://www.cloudera.com/blog/2010/11/hadoop-log-location-and-retention/"&gt;blog post&lt;/a&gt; by &lt;a href="http://www.larsgeorge.com/"&gt;Lars George&lt;/a&gt; explaining all the log files Hadoop writes. It is a worthwhile read.

Hadoop writes tons of logs in various processes and phases and you should make sure that these don't fill up your hard drives. There are two cases in the current CDH3b3 where you have to intervene manually:
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;Hadoop daemon logs&lt;/li&gt;
&lt;li&gt;Job XML files on the JobTracker&lt;/li&gt;
&lt;/ul&gt;
Hadoop uses a &lt;a href="http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/DailyRollingFileAppender.html"&gt;&lt;code&gt;DailyRollingFileAppender&lt;/code&gt;&lt;/a&gt; which unfortunately doesn't have a &lt;a href="http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/RollingFileAppender.html#maxBackupIndex"&gt;&lt;code&gt;maxBackupIndex&lt;/code&gt;&lt;/a&gt; setting like the &lt;a href="http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/RollingFileAppender.html"&gt;&lt;code&gt;RollingFileAppender&lt;/code&gt;&lt;/a&gt;. So either change the appender or manually clean up logs after a few days. We chose the second path and added a very simple cron job to run daily:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;find /var/log/hadoop/ -type f -mtime +14 -name "hadoop-hadoop-*" -delete&lt;/pre&gt;
This job deletes log files once they are more than 14 days old.  We'll take care of the Job XML files in a similar way at the JobTracker.
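&lt;br /&gt;
&lt;br /&gt;
Since the same pattern comes up again for the JobTracker, the &lt;code&gt;find&lt;/code&gt; call can be wrapped in a small function. This is a sketch rather than the exact script we run: &lt;code&gt;clean_old_files&lt;/code&gt; is a name made up for this example, and saving it as an executable script under &lt;code&gt;/etc/cron.daily/&lt;/code&gt; is just one assumed way of scheduling it.

```shell
# Delete regular files under DIR whose name matches PATTERN and which were
# last modified more than DAYS days ago.
# Usage: clean_old_files DIR DAYS PATTERN
clean_old_files() {
  dir=$1; days=$2; pattern=$3
  # Do nothing if the directory does not exist on this machine.
  [ -d "$dir" ] || return 0
  # The trailing slash makes find descend into DIR even if it is a symlink.
  find "$dir/" -type f -mtime +"$days" -name "$pattern" -delete
}

# Same effect as the cron job above.
clean_old_files /var/log/hadoop 14 "hadoop-hadoop-*"
```

Quoting the pattern matters: without the quotes the shell would expand the wildcard itself before &lt;code&gt;find&lt;/code&gt; ever sees it.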
&lt;br /&gt;
&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-size: large;"&gt;2. HDFS&lt;/span&gt;
&lt;br /&gt;
&lt;br /&gt;
One property needs to be set for both the NameNode and the DataNodes in the file &lt;code&gt;/etc/hadoop/conf/core-site.xml&lt;/code&gt;:&amp;nbsp;&lt;code&gt;fs.default.name&lt;/code&gt;. So just add this and replace &lt;code&gt;$namenode&lt;/code&gt; with the IP or name of your NameNode:
&lt;br /&gt;
&lt;pre class="brush:xml"&gt;&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;fs.default.name&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;hdfs://$namenode:8020&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/pre&gt;
&lt;br/&gt;
&lt;b&gt;2.1 NameNode&lt;/b&gt;
&lt;br/&gt;
Installing the NameNode is straightforward:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;yum install -y hadoop-0.20-namenode&lt;/pre&gt;
This installs the startup scripts for the NameNode. The core package was already installed in the previous step. Now we need to change the configuration, create some directories and&amp;nbsp;format the NameNode.&lt;br/&gt;
&lt;br/&gt;
In &lt;code&gt;/etc/hadoop/conf/hdfs-site.xml&lt;/code&gt; add the &lt;code&gt;dfs.name.dir&lt;/code&gt; property which &lt;em&gt;"determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy."&lt;/em&gt; We mentioned before that we're using a JBOD configuration. We do this even for our NameNode. So in our case the NameNode has two disks mounted at&amp;nbsp;&lt;code&gt;/mnt/disk1&lt;/code&gt; and&amp;nbsp;&lt;code&gt;/mnt/disk2&lt;/code&gt; but you might want to write to just one location if you use RAID. As it says in the documentation the NameNode will write to&amp;nbsp;each&amp;nbsp;of the locations. You can write to a third location: A NFS mount which serves as a backup. Our configuration looks like this:
&lt;br /&gt;
&lt;pre class="brush:xml"&gt;&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;dfs.name.dir&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;/mnt/disk1/hadoop/dfs/name,/mnt/disk2/hadoop/dfs/name&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/pre&gt;
Make sure to create the &lt;code&gt;dfs&lt;/code&gt; directories before starting the NameNode. They need to belong to &lt;code&gt;hdfs:hadoop&lt;/code&gt;.&amp;nbsp;Formatting the NameNode is all that's left:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;su hdfs -c "/usr/bin/hadoop namenode -format"&lt;/pre&gt;
Once you've done all that you can enable the service so it will be started upon system boot and start the NameNode:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;chkconfig hadoop-0.20-namenode on
service hadoop-0.20-namenode start&lt;/pre&gt;
You should be able to see the web interface on your namenode at port 50070 now.  Ports that need to be opened to clients on the NameNode are 50070 (web interface, 50470 if you enabled SSL) and 8020 (for HDFS command line interaction). Only port 8020 needs to be enabled for all other servers in the cluster.&lt;br/&gt;
&lt;br/&gt;
We also use a cron job to run the HDFS Balancer every evening:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;/usr/lib/hadoop-0.20/bin/start-balancer.sh -threshold 5&lt;/pre&gt;
&lt;br/&gt;
&lt;b&gt;2.2 DataNodes&lt;/b&gt;
&lt;br/&gt;
The DataNodes handle all the data by storing it and serving it to clients. You can run a DataNode on your NameNode and especially for small- or test clusters this is often done but as soon as you have more than three to five machines or rely on your cluster for production use you should use a dedicated NameNode. Setting the DataNodes up is easy though after all our preparations. We need to set the property &lt;code&gt;dfs.data.dir&lt;/code&gt; in the file &lt;code&gt;/etc/hadoop/conf/hdfs-site.xml&lt;/code&gt;. It &lt;em&gt;"determines where on the local filesystem an DFS data node should store its blocks.  If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored."&lt;/em&gt; These are the directories where the real data bytes of HDFS will be written to. If you specify multiple directories the DataNode will write to them in turn which gives good performance when reading the data.&lt;br/&gt;
&lt;br/&gt;
This is an example of what we are using:
&lt;br /&gt;
&lt;pre class="brush:xml"&gt;&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;dfs.data.dir&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;/mnt/disk1/hadoop/dfs/data,/mnt/disk2/hadoop/dfs/data&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/pre&gt;
Make sure to create the &lt;code&gt;dfs&lt;/code&gt; directories before starting the DataNodes. They need to belong to &lt;code&gt;hdfs:hadoop&lt;/code&gt;. When that's done you just need to install the DataNode, activate the startup scripts and start it:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;yum install -y hadoop-0.20-datanode
chkconfig hadoop-0.20-datanode on
service hadoop-0.20-datanode start&lt;/pre&gt;
Your DataNode should be up and running and if you have configured it correctly should also have connected to the NameNode and be visible in the web interface in the &lt;em&gt;Live Nodes&lt;/em&gt; list and the configured capacity should go up.  Ports that need to be opened to clients are 50075 (web interface, 50475 if you enabled SSL) and 50010 (for data transfer). For the cluster you need to open ports 50010 and 50020.
&lt;br /&gt;
&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-size: large;"&gt;3. MapReduce&lt;/span&gt;
&lt;br /&gt;
&lt;br /&gt;
MapReduce is split in two parts as well: A JobTracker and multiple TaskTrackers. For small-ish clusters the NameNode and the JobTracker can run on the same server but depending on your usage and available memory you might need to run them on separate servers. We have 18 servers, 17 slaves and 1 master (with NameNode, JobTracker and other services) which isn't a problem so far. We need three properties set on all servers (in &lt;code&gt;mapred-site.xml&lt;/code&gt;) to get started.
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;mapred.job.tracker&lt;/code&gt;: &lt;em&gt;"The host and port that the MapReduce job tracker runs at.  If 'local', then jobs are run in-process as a single map and reduce task."&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;This just points to your JobTracker. There is no default port for this in Hadoop 0.20 but 8021 is often used.&lt;/li&gt;
&lt;li&gt;Our value (replace &lt;code&gt;$jobtracker&lt;/code&gt; with the name or IP of your designated JobTracker): &lt;code&gt;$jobtracker:8021&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mapred.local.dir&lt;/code&gt;: &lt;em&gt;"The local directory where MapReduce stores intermediate data files.  May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored."&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;As it says, this is a local directory where MapReduce stores intermediate data, and we spread it out over all our disks.&lt;/li&gt;
&lt;li&gt;Our value: &lt;code&gt;/mnt/disk1/hadoop/mapreduce,/mnt/disk2/hadoop/mapreduce&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Create the directories on each server with the owner &lt;code&gt;mapred:hadoop&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mapred.system.dir&lt;/code&gt;: &lt;em&gt;"The shared directory where MapReduce stores control files." &lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;This is a path in HDFS where MapReduce stores stuff&lt;/li&gt;
&lt;li&gt;Our value: &lt;code&gt;/hadoop/mapreduce/system&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;dfs.permissions&lt;/code&gt; are on you need to create this directory in HDFS. Execute this command on any server in your cluster: &lt;code&gt;su hdfs -c "/usr/bin/hadoop fs -mkdir /hadoop/mapreduce &amp;amp;&amp;amp; /usr/bin/hadoop fs -chown mapred:hadoop /hadoop/mapreduce"&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
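&lt;br/&gt;
Properties like &lt;code&gt;mapred.local.dir&lt;/code&gt; (and &lt;code&gt;dfs.data.dir&lt;/code&gt; before it) take one directory per disk, joined by commas. With more than a couple of disks it is less error-prone to generate that value than to type it. A minimal sketch; &lt;code&gt;dir_list&lt;/code&gt; is a helper name made up for this example and the mount points are the ones from our configuration:

```shell
# Print a comma-separated list with SUBDIR appended to every mount point.
# Usage: dir_list SUBDIR MOUNTPOINT...
dir_list() {
  subdir=$1; shift
  list=""
  for disk in "$@"; do
    # Add a comma only when the list already has an entry.
    list="${list:+$list,}$disk/$subdir"
  done
  echo "$list"
}

dir_list hadoop/mapreduce /mnt/disk1 /mnt/disk2
# prints: /mnt/disk1/hadoop/mapreduce,/mnt/disk2/hadoop/mapreduce
```

The same list of mount points can then drive the loop that creates the directories and hands them to &lt;code&gt;mapred:hadoop&lt;/code&gt;.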
&lt;br/&gt;
&lt;b&gt;3.1 JobTracker&lt;/b&gt;
&lt;br/&gt;
The JobTracker is very easy to set up and start:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;yum install -y hadoop-0.20-jobtracker
chkconfig hadoop-0.20-jobtracker on
service hadoop-0.20-jobtracker start&lt;/pre&gt;
The web interface should now be available at port 50030 on your JobTracker.  Ports 50030 (web interface) and 8021 (not well defined in Hadoop 0.20 but if you followed my configuration this is correct) need to be opened to clients. Only 8021 is&amp;nbsp;necessary&amp;nbsp;for the TaskTrackers.&lt;br/&gt;
&lt;br/&gt;
If the JobTracker is restarted some old files will not be cleaned up. That's why we added another small cronjob to run daily:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;find /var/log/hadoop/ -type f -mtime +3 -name "job_*_conf.xml" -delete&lt;/pre&gt;
&lt;br/&gt;
&lt;b&gt;3.2 TaskTracker&lt;/b&gt;
&lt;br/&gt;
The TaskTrackers are as easy to install as the JobTracker:
&lt;br /&gt;
&lt;pre class="brush:shell"&gt;yum install -y hadoop-0.20-tasktracker
chkconfig hadoop-0.20-tasktracker on
service hadoop-0.20-tasktracker start&lt;/pre&gt;
The TaskTracker should now be up and running and visible in the JobTracker's Nodes list.  Only port 50060 needs to be opened to clients for a minimalistic web interface. Other than that no other ports are needed as TaskTrackers check in at the JobTracker&amp;nbsp;regularly&amp;nbsp;(heartbeat) and get assigned Tasks at the same time.
&lt;br /&gt;
&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-size: large;"&gt;4. Configuration&lt;/span&gt;
&lt;br /&gt;
&lt;br /&gt;
I'll discuss a few configuration properties here that range from "necessary to change" to "nice to know about". I'll mention the following things for each property:
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;The default value,&lt;/li&gt;
&lt;li&gt;the value we use for our cluster at GBIF if it differs from the default,&lt;/li&gt;
&lt;li&gt;some of the defaults are quite old and have never been changed so I might mention a value I deem safe to use for everybody,&lt;/li&gt;
&lt;li&gt;if we set the property to final so it can't be overridden by clients (we set a lot of the parameters to final for purely documentary reasons, even those that can't be overwritten in the first place),&lt;/li&gt;
&lt;li&gt;if the property has been renamed or deprecated in Hadoop 0.21,&lt;/li&gt;
&lt;li&gt;and if this property is required in a client configuration file or only on the cluster (if I don't mention it, it's not needed on clients).&lt;/li&gt;
&lt;/ul&gt;
Here are the default configuration files for Hadoop 0.20.2 and 0.21:
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;core-default.xml: &lt;a href="https://github.com/apache/hadoop-common/blob/release-0.20.2/src/core/core-default.xml"&gt;0.20.2&lt;/a&gt;, &lt;a href="https://github.com/apache/hadoop-common/blob/release-0.21.0/src/java/core-default.xml"&gt;0.21&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;hdfs-default.xml: &lt;a href="https://github.com/apache/hadoop-common/blob/release-0.20.2/src/hdfs/hdfs-default.xml"&gt;0.20.2&lt;/a&gt;, &lt;a href="https://github.com/apache/hadoop-hdfs/blob/release-0.21.0/src/java/hdfs-default.xml"&gt;0.21&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;mapred-default.xml: &lt;a href="https://github.com/apache/hadoop-common/blob/release-0.20.2/src/mapred/mapred-default.xml"&gt;0.20.2&lt;/a&gt;, &lt;a href="https://github.com/apache/hadoop-mapreduce/blob/release-0.21.0/src/java/mapred-default.xml"&gt;0.21&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
And I know that there are some duplications to the section above but I want to keep this Configuration section as a reference.
&lt;br /&gt;
&lt;h3&gt;&lt;code&gt;core-site.xml&lt;/code&gt;&lt;/h3&gt;
&lt;b&gt;&lt;code&gt;fs.default.name&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;file:///&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;hdfs://$namenode:8020&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;fs.defaultFS&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;Needed on the clients&lt;/li&gt;
&lt;/ul&gt;
This specifies the default file system. It defaults to your local file system, which is why it needs to be set to an HDFS address. This is important for client configuration as well, so your local configuration file should include this element.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;hadoop.tmp.dir&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;/tmp/hadoop-${user.name}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;CDH3 Default: &lt;code&gt;/var/lib/hadoop-0.20/cache/${user.name}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: Left it at the CDH3 default&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
As mentioned in the default file this is mainly a base for other temporary directories. If all other configuration options are set correctly there shouldn't be too much data in here.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;fs.trash.interval&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;10080&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
Hadoop has a Trash feature where files removed using the command line tools are moved to a .Trash folder in the user's home folder. If set to 0 this feature is disabled, but if set to a non-zero value this is the number of minutes between Trash cleaner runs. As we have a lot of users in our system using Hadoop for the first time we chose a safe value here.
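&lt;br /&gt;
&lt;br /&gt;
The value is given in minutes, which makes it a bit hard to read. Our &lt;code&gt;10080&lt;/code&gt; is simply seven days; the conversion as a quick sketch (&lt;code&gt;days_to_minutes&lt;/code&gt; is a helper name made up for this example):

```shell
# Convert a number of days into minutes, the unit fs.trash.interval uses.
days_to_minutes() {
  echo $(( $1 * 24 * 60 ))
}

days_to_minutes 7   # prints 10080, the value we use
days_to_minutes 1   # prints 1440
```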
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;fs.checkpoint.dir&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;${hadoop.tmp.dir}/dfs/namesecondary&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;/mnt/disk1/hadoop/dfs/namesecondary,/mnt/disk2/hadoop/dfs/namesecondary&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
The secondary NameNode stores its images to merge here. If it is a comma separated list the data is replicated to all these locations on the local disks.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;io.file.buffer.size&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;4096&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Safe: &lt;code&gt;65536&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;131072&lt;/code&gt; (32 * 4096)&lt;/li&gt;
&lt;li&gt;Can be overwritten by clients&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
This is used for buffers all over the place to copy, store and write data to. It should be a multiple of 4096 and it should be safe to use 65536 today but we use double that. The performance gain is not enormous but there have been blog posts in the past measuring the impact and it was positive. We've also done our own tests and saw a small performance gain. If you use HBase be careful not to set this too high.
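&lt;br /&gt;
&lt;br /&gt;
If you experiment with other values, a quick sanity check that a candidate really is a multiple of 4096 can save a typo from ending up in the config. A small sketch; &lt;code&gt;is_valid_buffer_size&lt;/code&gt; is a helper name made up for this example and &lt;code&gt;100000&lt;/code&gt; is only there as a deliberately bad value:

```shell
# Succeeds if SIZE is a positive multiple of 4096.
is_valid_buffer_size() {
  [ "$1" -gt 0 ] || return 1
  [ $(( $1 % 4096 )) -eq 0 ]
}

for size in 4096 65536 131072 100000; do
  if is_valid_buffer_size $size; then
    echo "$size: ok ($(( size / 4096 )) x 4096)"
  else
    echo "$size: not a multiple of 4096"
  fi
done
```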
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;io.compression.codecs&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
This lists all installed compression codecs. If you followed my manual you've got to add two more to the default list of codecs: &lt;code&gt;LzoCodec&lt;/code&gt; and &lt;code&gt;LzopCodec&lt;/code&gt;.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;io.compression.codec.lzo.class&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;We: &lt;code&gt;com.hadoop.compression.lzo.LzopCodec&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
I have actually no idea why this setting is needed as I couldn't find any reference where it is actually used in the code but I didn't look very hard so I might be wrong. All I know is that the documentation mentions that this property needs to be set.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;webinterface.private.actions&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;false&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;true&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
By setting this to &lt;code&gt;true&lt;/code&gt; the web interfaces for the JobTracker and NameNode gain some advanced options like killing a job. It makes life a lot easier while still in development or evaluation. But you probably should set this to false once you rely on your Hadoop cluster for production use.
&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;&lt;code&gt;hdfs-site.xml&lt;/code&gt;&lt;/h3&gt;
&lt;b&gt;&lt;code&gt;dfs.name.dir&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;${hadoop.tmp.dir}/dfs/name&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;/mnt/disk1/hadoop/dfs/name,/mnt/disk2/hadoop/dfs/name&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
This is an important setting, which is why I've already mentioned it above. The NameNode stores its metadata in these directories, replicating all information to every one of them. One of them could be a mount on a remote disk (e.g. NFS) to have a backup.
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;dfs.data.dir&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;${hadoop.tmp.dir}/dfs/data&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;/mnt/disk1/hadoop/dfs/data,/mnt/disk2/hadoop/dfs/data&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
This is another important setting as explained above. It differs from &lt;code&gt;dfs.name.dir&lt;/code&gt; in that the data is not replicated to all disks but distributed among all those locations. The DataNodes save the actual data in these locations, so more space is better. The easiest thing is to use dedicated disks for this. If you store anything other than Hadoop data on the disks make sure to set &lt;code&gt;dfs.datanode.du.reserved&lt;/code&gt; (see below).
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;dfs.namenode.handler.count&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;10&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;20&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Safe: 10-20&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
The number of threads the NameNode uses to serve requests. This depends highly on your usage and size of your cluster. We've tried a bunch of different values and settled on 20 without seeing any notable differences. &lt;code&gt;nnbench&lt;/code&gt; is probably a good tool to benchmark this. If you've got a large cluster or many file operations (create or delete) you can try upping this value.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;dfs.datanode.handler.count&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;3&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;5&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Safe: 5-10&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
The number of threads DataNodes use. I can't tell what a good value is for large clusters but the &lt;code&gt;TestDFSIO&lt;/code&gt; benchmark seems like a good test to run to find a good value here. Just play around. We've tried a bunch of different values up to 20 and didn't see a difference so we chose a value slightly larger than the default.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;dfs.datanode.du.reserved&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: Left the default&lt;/li&gt;
&lt;/ul&gt;
This many bytes will be left free on the volumes used by the DataNodes (see &lt;code&gt;dfs.data.dir&lt;/code&gt;). As our drives are dedicated to Hadoop we left this at 0 but if the drives host other stuff as well set this to an appropriate value.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;dfs.permissions&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;true&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;true&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;dfs.permissions.enabled&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
This enables permission checking in HDFS. Unless you use Secure Hadoop (which we don't, so I don't cover it here) it is still easy for anyone to read, write and delete anything on the cluster as there is no authentication of users. So this is purely a safety measure to avoid messing with the wrong data by accident.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;dfs.replication&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;3&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;3&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Can be used in the client configuration&lt;/li&gt;
&lt;/ul&gt;
This is the default replication level used for new files in HDFS. If you change this value later on, no existing files will be changed (that can be done on the command line though). Every file in HDFS can have a different replication level. This just sets the default.
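&lt;br /&gt;
&lt;br /&gt;
Changing the replication level of files that already exist is done with &lt;code&gt;hadoop fs -setrep&lt;/code&gt;. The sketch below only prints the command so you can review it first; &lt;code&gt;setrep_cmd&lt;/code&gt; is a helper name made up for this example and the path is hypothetical.

```shell
# Print (not run) the command that sets the replication level of everything
# below PATH to LEVEL. -R applies the change recursively.
# Usage: setrep_cmd LEVEL PATH
setrep_cmd() {
  echo "hadoop fs -setrep -R $1 $2"
}

# Hypothetical example: keep only two copies of some scratch data.
setrep_cmd 2 /user/example/scratch
# prints: hadoop fs -setrep -R 2 /user/example/scratch
```

Run the printed command as a user that is allowed to modify the files in question.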
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;dfs.block.size&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;67108864&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;134217728&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;dfs.blocksize&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;Can be used in the client configuration&lt;/li&gt;
&lt;/ul&gt;
This setting applies on a per-file basis and is only used for new files. Files saved to HDFS are split into blocks at most this large (64 MB by default). This has multiple implications. The more blocks you have, the more load there is on your NameNode. So if you have many files that are larger than the block size you might set this larger. If your files are mostly smaller than this you waste no space: all files only take as much space as they actually have data (this is unlike other file systems where a file takes up at least one block no matter how large it really is). So NameNode load (and its memory requirements) is one factor. The deciding factor for us to set this higher by default is that a lot of our calculations in MapReduce are very fast and Mappers finish quickly. As one Mapper usually processes one block and Mappers take a while to set up, we chose a higher block size so that each Mapper has more data to process.&lt;br/&gt;
&lt;br/&gt;
This can be set on a per file basis so you really have to find your own perfect value, perhaps even on a per dataset basis.
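&lt;br /&gt;
&lt;br /&gt;
To see what a given block size means for one of your datasets, count the blocks (and therefore, roughly, the Mappers) it results in. A minimal sketch; the 10 GB file size is a made-up example and &lt;code&gt;blocks_for&lt;/code&gt; is a helper name chosen for this illustration:

```shell
# Number of HDFS blocks needed for a file of SIZE bytes with blocks of
# BLOCKSIZE bytes (division rounded up).
# Usage: blocks_for SIZE BLOCKSIZE
blocks_for() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

file_size=$(( 10 * 1024 * 1024 * 1024 ))   # hypothetical 10 GB input file

blocks_for $file_size 67108864    # 64 MB default: prints 160
blocks_for $file_size 134217728   # our 128 MB:    prints 80
```

Doubling the block size halves the number of blocks the NameNode has to track for this file and, with one Mapper per block, halves the number of Mappers that need to be set up.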
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;dfs.balance.bandwidthPerSec&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;1048576&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;2097152&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;dfs.datanode.balance.bandwidthPerSec&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
This property configures the number of bytes per second (default is 1 MB/s) that a DFS balancing operation can use per DataNode. The default is pretty low so we doubled it. We don't use a lot of bandwidth in our cluster at the moment so this is not a problem, but the right value depends on your use case. The higher this number the faster balancing operations will complete. We run balancing every night as a cron job so we want it to be finished by morning.
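&lt;br /&gt;
&lt;br /&gt;
As a rough back-of-the-envelope check, this sketch estimates how much data a single DataNode can move in one nightly balancer run. The eight-hour window is an assumption for illustration only, and the function name is made up.

```python
def bytes_moved(bandwidth_per_sec, hours):
    """Upper bound on the bytes one DataNode can move in the given time."""
    return bandwidth_per_sec * hours * 3600


NIGHT_HOURS = 8  # hypothetical length of the nightly balancing window

# 1048576 is the default, 2097152 is the value we use.
for bandwidth in (1048576, 2097152):
    gib = bytes_moved(bandwidth, NIGHT_HOURS) / 1024 ** 3
    print(bandwidth, gib, "GiB per DataNode")
```

If the amount of data you expect to rebalance per node is well above this bound, the balancer will not finish overnight at that setting.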
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;dfs.hosts&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: no default set&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;/etc/hadoop/conf/allowed_hosts&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
This file has to contain one name per line. Every name is the name of a DataNode that is allowed to connect to the NameNode. This prevents accidents like what happened to me: I test everything in Virtual Machines so I started a bunch of them, deployed the live config and forgot to change the NameNode so all of a sudden a bunch of Virtual Machines joined our HDFS cluster and blocks began replicating there.... So it is a good thing to explicitly list all allowed hosts in this file.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;dfs.support.append&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;false&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;true&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;As far as I know this option has been removed in Hadoop 0.21 and is enabled by default&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
This option has quite a history. To make it short: If you're using CDH3 set this to true, otherwise leave it false. You want/need this on true if you plan to use HBase.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;dfs.datanode.max.xcievers&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;256&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Safe: &lt;code&gt;1024&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;2048&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Yes, this is misspelt in Hadoop and it hasn't been fixed in Hadoop 0.21.&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
This is the maximum number of threads a DataNode may use (for example for file access to the local file system). There used to be bugs in Hadoop so that the default was a bit too low and needed to be set higher. Even today it is worth setting it higher, and doing so carries little risk, especially if you're using HBase.
&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;&lt;code&gt;mapred-site.xml&lt;/code&gt;&lt;/h3&gt;
HDFS is pretty straightforward to configure and benchmark. MapReduce is more of a black art unfortunately. I'll describe the MapReduce process here because it is important to understand where all the properties come in so you can safely change their values and tweak the performance. In my first draft of this post I wrote that I won't go into much detail on the internals of the MapReduce process. (Un-)fortunately this wasn't as easy as I thought and it has grown into a full-blown explanation of everything I know. It is very possible that something's wrong here so please correct me if you see something that is off. And if you're not interested in how this works just skip to the descriptions of the properties themselves.&lt;br /&gt;
&lt;br /&gt;
All of this is valid for Hadoop 0.20.2+737 (the CDH version). I know that some things have changed in Hadoop 0.21 but that's left for another time.
&lt;br /&gt;
&lt;h4&gt;The Map side&lt;/h4&gt;
While a Map is running it is collecting output records in an in-memory buffer called &lt;code&gt;MapOutputBuffer&lt;/code&gt;. If there are no reducers a &lt;code&gt;DirectMapOutputCollector&lt;/code&gt; is used instead, which makes most of the rest irrelevant as it writes immediately to disk. The total size of this in-memory buffer is set by the &lt;code&gt;io.sort.mb&lt;/code&gt; property and defaults to &lt;em&gt;100 MB&lt;/em&gt; (which is converted to a byte value using a bit shift operation [&lt;code&gt;100 &amp;lt;&amp;lt; 20 = 104857600&lt;/code&gt;]). Out of these &lt;em&gt;100 MB&lt;/em&gt;, the fraction given by &lt;code&gt;io.sort.record.percent&lt;/code&gt; is reserved for tracking record boundaries. This property defaults to &lt;em&gt;0.05&lt;/em&gt; (i.e. &lt;em&gt;5%&lt;/em&gt; which means &lt;em&gt;5 MB&lt;/em&gt; in the default case). Each record to track takes &lt;em&gt;16 bytes&lt;/em&gt; (4 integers of 4 bytes each) of memory which means the buffer can track &lt;em&gt;327680&lt;/em&gt; map output records with the default settings. The rest of the memory (&lt;em&gt;104857600 bytes - (16 bytes * 327680) = 99614720 bytes&lt;/em&gt;) is used to store the actual bytes to be collected (in the default case this will be &lt;em&gt;95 MB&lt;/em&gt;). While Map outputs are collected they are stored in the remaining memory and their location in the in-memory buffer is tracked as well. Once one of these two buffers reaches a threshold specified by &lt;code&gt;io.sort.spill.percent&lt;/code&gt;, which defaults to &lt;em&gt;0.8&lt;/em&gt; (i.e. &lt;em&gt;80%&lt;/em&gt;), the buffer is flushed to disk:
&lt;br /&gt;
&lt;pre class="brush:plain"&gt;0.8 * 99614720 = 79691776
0.8 * 327680 = 262144&lt;/pre&gt;
Look in the log output of your Maps and you'll see these three lines at the beginning of every log:
&lt;br /&gt;
&lt;pre class="brush:plain"&gt;2010-12-05 01:33:04,912 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2010-12-05 01:33:04,996 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2010-12-05 01:33:04,996 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680&lt;/pre&gt;
You should recognize these numbers!&lt;br /&gt;
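&lt;br /&gt;
If you want to predict these log values for other settings, the arithmetic described above can be sketched like this. It is my own reconstruction of the calculation, not the Hadoop code: the function and key names are made up, the percentages are passed as strings so that &lt;code&gt;Fraction&lt;/code&gt; keeps them exact, and multiplying by &lt;code&gt;2 ** 20&lt;/code&gt; stands in for the bit shift.

```python
from fractions import Fraction

RECORD_METADATA_BYTES = 16  # 4 integers of 4 bytes each per tracked record


def map_buffer_layout(io_sort_mb=100, record_percent="0.05", spill_percent="0.8"):
    """Split the map output buffer into its record and data parts."""
    total_bytes = io_sort_mb * 2 ** 20
    # Memory reserved for record metadata, rounded down to whole records.
    record_bytes = int(total_bytes * Fraction(record_percent))
    record_bytes -= record_bytes % RECORD_METADATA_BYTES
    record_capacity = record_bytes // RECORD_METADATA_BYTES
    data_bytes = total_bytes - record_bytes
    spill = Fraction(spill_percent)
    return {
        "data_buffer": data_bytes,
        "data_spill_threshold": int(data_bytes * spill),
        "record_buffer": record_capacity,
        "record_spill_threshold": int(record_capacity * spill),
    }


print(map_buffer_layout())
```

With the defaults this reproduces the four numbers from the log lines above. Real clusters may differ by a few bytes for other settings because Hadoop does this arithmetic with floating point numbers.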
&lt;br /&gt;
Now while the Map is running you might see log lines like these:
&lt;br /&gt;
&lt;pre class="brush:plain"&gt;2010-12-05 01:33:08,823 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-12-05 01:33:08,823 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 19361312; bufvoid = 99614720
2010-12-05 01:33:08,823 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
2010-12-05 01:33:09,558 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0&lt;/pre&gt;
This means we've reached the maximum number of records we can track even though our buffer is still pretty empty (&lt;em&gt;99614720 -&amp;nbsp;19361312 bytes&lt;/em&gt; still free). If however your buffer is the cause of your spill you'll see a line like this:
&lt;br /&gt;
&lt;pre class="brush:plain"&gt;2010-12-05 01:33:08,823 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full = true&lt;/pre&gt;
All of this spilling to disk is done in a separate thread so that the Map can continue running. That's also the reason why the spill begins early (when the buffer is only &lt;em&gt;80%&lt;/em&gt; full) so it doesn't fill up before a spill is finished. If one single Map output is too large to fit into the in memory buffer a single spill is done for this one value. A spill actually consists of one file per partition, meaning one file per Reducer.&lt;br /&gt;
&lt;br /&gt;
After a Map task has finished there may be multiple spills on the TaskTracker. Those files have to be merged into one single sorted file per partition which is then fetched by the Reducers. The property &lt;code&gt;io.sort.factor&lt;/code&gt; says how many of those spill files will be merged into one file at a time. The lower the number is the more passes will be required to arrive at the goal. The default is very low and raising it to &lt;em&gt;100&lt;/em&gt; has been considered (and in fact looking at the code it sometimes is set to &lt;em&gt;100&lt;/em&gt; by default). This property can make a pretty huge difference if your Mappers output a lot of data. Not much memory is needed for this property but the larger it is the more open files there will be so make sure to set this to a reasonable value. To find such a value you should run a few MapReduce jobs that you'd expect to see in production use and carefully monitor the log files.&lt;br /&gt;
&lt;br /&gt;
Watch out for log messages like these:
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Merging &amp;lt;numSegments&amp;gt; sorted segments&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Down to the last merge-pass, with &amp;lt;numSegments&amp;gt;&amp;nbsp;segments left of total size: &amp;lt;totalBytes&amp;gt; bytes&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Merging &amp;lt;segmentsToMerge.size()&amp;gt;&amp;nbsp;intermediate segments out of a total of &amp;lt;totalSegments&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
This is the process on the Map side where this factor is used. If your Mappers only have one spill file all of this doesn't matter. So if you try to benchmark this make sure to use a job with a lot of Map output data. If you only see a line like "&lt;code&gt;Finished spill 0&lt;/code&gt;" but none of the above you're only producing one spill file which doesn't require any merging or further sorting. This is the ideal situation and you should try to get the number of spilled records/files as low as possible.
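&lt;br /&gt;
&lt;br /&gt;
To illustrate why &lt;code&gt;io.sort.factor&lt;/code&gt; matters, here is a deliberately simplified model of the merge. It assumes that every intermediate merge reads exactly &lt;code&gt;io.sort.factor&lt;/code&gt; files and writes one, and it counts how many such merges are needed before a single final merge can read everything that is left. Hadoop's real merger picks its segments more cleverly, so treat this as a sketch and not as the actual algorithm. The 100 spill files are a made-up example.

```python
def intermediate_merges(num_spills, sort_factor):
    """Merges needed until at most sort_factor files are left.

    Each intermediate merge turns sort_factor files into one, so the
    number of files drops by sort_factor - 1 every time.
    """
    excess = max(0, num_spills - sort_factor)
    return -(-excess // (sort_factor - 1))  # ceiling division


NUM_SPILLS = 100  # hypothetical number of spill files from one Map task

for factor in (10, 100):
    print(factor, intermediate_merges(NUM_SPILLS, factor))
```

In this model 100 spill files need 10 intermediate merges at the default factor of 10 and none at all with a factor of 100. Every intermediate merge means reading and rewriting data on disk once more.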
&lt;br /&gt;
&lt;h4&gt;The Reduce side&lt;/h4&gt;
The reduce phase has three different steps: Copy, Sort (which should really be called Merge) and Reduce.&lt;br /&gt;
&lt;br /&gt;
During the Copy phase the Reducer tries to fetch the output of the Maps from the TaskTrackers and store it on the Reducer either in memory or on disk. The property &lt;code&gt;mapred.reduce.parallel.copies&lt;/code&gt; (which defaults to &lt;em&gt;5&lt;/em&gt;) defines how many Threads are started per Reduce task to fetch Map output from the TaskTrackers.&lt;br /&gt;
&lt;br /&gt;
Here's an example log from the beginning of a Reducer log:
&lt;pre class="brush:plain"&gt;2010-12-05 01:53:03,846 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=334063200, MaxSingleShuffleLimit=83515800
2010-12-05 01:53:03,879 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012031527_0021_r_000103_0 Need another 1870 map output(s) where 0 is already in progress
2010-12-05 01:53:03,880 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012031527_0021_r_000103_0 Thread started: Thread for merging on-disk files
2010-12-05 01:53:03,880 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012031527_0021_r_000103_0 Thread waiting: Thread for merging on-disk files
2010-12-05 01:53:03,880 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012031527_0021_r_000103_0 Thread started: Thread for merging in memory files
2010-12-05 01:53:03,881 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012031527_0021_r_000103_0 Thread started: Thread for polling Map Completion Events&lt;/pre&gt;
You can see two things in these log lines. First of all the &lt;code&gt;ShuffleRamManager&lt;/code&gt; is started and afterwards you see that this Reducer needs to fetch 1870 map outputs (meaning we had 1870 Mappers). The map output is fetched and shuffled into memory (that's what the &lt;code&gt;ShuffleRamManager&lt;/code&gt; is for). You can control its&amp;nbsp;behavior&amp;nbsp;using the &lt;code&gt;mapred.job.shuffle.input.buffer.percent&lt;/code&gt; property (default is &lt;em&gt;0.7&lt;/em&gt;). &lt;a href="http://download.oracle.com/javase/6/docs/api/java/lang/Runtime.html#maxMemory()"&gt;Runtime.getRuntime().maxMemory()&lt;/a&gt; is used to get the available memory which unfortunately returns slightly &lt;a href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4686462"&gt;incorrect&lt;/a&gt; values so be careful when setting this. We'll get back to the last four lines later.&lt;br /&gt;
&lt;br /&gt;
Our child tasks are running with &lt;code&gt;-Xmx512m&lt;/code&gt; (536870912 bytes)&amp;nbsp;so 70% of that should be &lt;em&gt;375809638 bytes&lt;/em&gt; but the &lt;code&gt;ShuffleRamManager&lt;/code&gt; reports &lt;em&gt;334063200&lt;/em&gt;. No big deal, just be aware of it. There's a hardcoded limit of 25% of the buffer that a single map output may not surpass. If it is larger than that it will be written to disk (see the MaxSingleShuffleLimit value above: 334063200 * 0.25 = 83515800).&lt;br /&gt;
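&lt;br /&gt;
The two limits can be recomputed with a few lines. This is a sketch of the arithmetic described above with function names of my own choosing. The percentage is passed as a string so that &lt;code&gt;Fraction&lt;/code&gt; keeps it exact, whereas Hadoop itself uses floating point numbers.

```python
from fractions import Fraction


def shuffle_memory_limit(max_memory_bytes, input_buffer_percent="0.7"):
    """Share of the heap that may be used to hold fetched map outputs."""
    return int(max_memory_bytes * Fraction(input_buffer_percent))


def max_single_shuffle_limit(memory_limit):
    """Largest map output kept in memory: a fixed 25% of the limit."""
    return memory_limit // 4


print(shuffle_memory_limit(536870912))      # what -Xmx512m suggests
print(max_single_shuffle_limit(334063200))  # based on the reported limit
```

The first value is the one you would expect from &lt;code&gt;-Xmx512m&lt;/code&gt;, the second one matches the &lt;code&gt;MaxSingleShuffleLimit&lt;/code&gt; from the log because it starts from the &lt;code&gt;MemoryLimit&lt;/code&gt; the JVM actually reported.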
&lt;br /&gt;
Now that everything's set up the copiers will start their work and fetch the output. You'll see a bunch of log lines like these:
&lt;pre class="brush:plain"&gt;2010-12-05 01:53:11,114 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_201012031527_0021_m_000011_0, compressed len: 454055, decompressed len: 454051
2010-12-05 01:53:11,114 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 454051 bytes (454055 raw bytes) into RAM from attempt_201012031527_0021_m_000011_0
2010-12-05 01:53:11,133 INFO org.apache.hadoop.mapred.ReduceTask: Read 454051 bytes from map-output for attempt_201012031527_0021_m_000011_0
2010-12-05 01:53:11,133 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_201012031527_0021_m_000011_0 -&amp;gt; (70, 6) from c1n7.gbif.org&lt;/pre&gt;
In the first line you see that a map output was successfully copied and it could read the size of the data from the headers. The next line is actually what we've talked about earlier: The map output will now be decompressed (if it was compressed) and saved into memory using the &lt;code&gt;ShuffleRamManager&lt;/code&gt;. The third line acknowledges that this succeeded. And the last line is information for a &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3647"&gt;bug&lt;/a&gt; and should have been removed already according to a comment in the source code.&lt;br /&gt;
&lt;br /&gt;
If for whatever reason the map output doesn't fit into memory you will see a similar log line to the second one above but "&lt;code&gt;RAM&lt;/code&gt;" will be replaced by "&lt;code&gt;Local-FS&lt;/code&gt;" and the fourth line will be missing. You obviously want as much data in memory as possible so shuffling on to the Local-FS is a warning sign or at least a sign of possible optimizations.&lt;br /&gt;
&lt;br /&gt;
While all this goes on until all map outputs have been fetched there are two threads (Thread for merging on-disk files and Thread for merging in memory files) waiting for some conditions until they become active. The conditions are as follows:
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;The used memory in the in-memory buffer is above &lt;code&gt;mapred.job.shuffle.merge.percent&lt;/code&gt; (default is 66%, in our example that would mean 334063200 * 0.66 = 220481712 bytes) &lt;em&gt;and&lt;/em&gt; there are at least two map outputs in the buffer&lt;/li&gt;
&lt;li&gt;or there are more than &lt;code&gt;mapred.inmem.merge.threshold&lt;/code&gt; (defaults to 1000) map outputs in the in-memory buffer, independent of the size&lt;/li&gt;
&lt;li&gt;or if there are at least &lt;code&gt;io.sort.factor&lt;/code&gt; * 2 - 1 files on &lt;em&gt;disk&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
When one of the first two conditions triggers you'll see something like this:
&lt;br /&gt;
&lt;pre class="brush:plain"&gt;2010-12-05 01:53:42,106 INFO org.apache.hadoop.mapred.ReduceTask: Initiating in-memory merge with 501 segments...
2010-12-05 01:53:42,114 INFO org.apache.hadoop.mapred.Merger: Merging 501 sorted segments
...
2010-12-05 01:53:46,492 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012031527_0021_r_000103_0 Merge of the 501 files in-memory complete. Local file is /mnt/disk1/hadoop/mapreduce/local/taskTracker/lfrancke/jobcache/job_201012031527_0021/attempt_201012031527_0021_r_000103_0/output/map_1.out of size 220545981
&lt;/pre&gt;
This could actually trigger the third condition as it writes a new file to disk. When that happens you'll see something like this:
&lt;br /&gt;
&lt;pre class="brush:plain"&gt;2010-12-10 14:28:23,289 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012101346_0001_r_000012_0We have  19 map outputs on disk. Triggering merge of 10 files&lt;/pre&gt;
The &lt;code&gt;io.sort.factor&lt;/code&gt; was set to the default of 10. 10 (out of the 19) files will be merged into one, leaving 10 on disk (i.e. &lt;code&gt;io.sort.factor&lt;/code&gt;).&lt;br /&gt;
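&lt;br /&gt;
The three trigger conditions listed above can be written down as a small check. This is a sketch of my reading of them and not the Hadoop code. The function name and the labels are made up, the default &lt;code&gt;memory_limit&lt;/code&gt; is the value from our example log, and the exact behavior at the boundaries is an assumption.

```python
from fractions import Fraction


def merge_triggers(used_bytes, in_memory_outputs, on_disk_files,
                   memory_limit=334063200, merge_percent="0.66",
                   inmem_threshold=1000, sort_factor=10):
    """Return the names of the merges whose trigger condition holds."""
    triggers = []
    size_threshold = memory_limit * Fraction(merge_percent)
    if used_bytes > size_threshold and in_memory_outputs >= 2:
        triggers.append("in-memory merge (buffer usage)")
    if in_memory_outputs > inmem_threshold:
        triggers.append("in-memory merge (output count)")
    # The sample log shows 19 files starting a merge with a factor of 10,
    # so this sketch treats 2 * factor - 1 files as sufficient.
    if on_disk_files >= 2 * sort_factor - 1:
        triggers.append("on-disk merge")
    return triggers


print(merge_triggers(220545981, 501, 3))
print(merge_triggers(0, 0, 19))
```

The first call mirrors the in-memory merge from the example (501 segments, a bit more than 66% of the buffer in use), the second one the on-disk merge with 19 files.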
&lt;br /&gt;
Both of these (the in-memory and the on-disk merge, the latter is also called &lt;em&gt;Interleaved on-disk merge&lt;/em&gt;) will produce a new single output file and write it to disk. All of this is only going on as long as map outputs are still fetched. When that's finished we wait for running merges to finish but won't start any new ones in these threads:
&lt;pre class="brush:plain"&gt;2010-12-05 01:59:10,598 INFO org.apache.hadoop.mapred.ReduceTask: GetMapEventsThread exiting
2010-12-05 01:59:10,599 INFO org.apache.hadoop.mapred.ReduceTask: getMapsEventsThread joined.
2010-12-05 01:59:10,599 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram manager
2010-12-05 01:59:10,599 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 3 files left.
2010-12-05 01:59:10,599 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 314 files left.&lt;/pre&gt;
As you can see from the timestamps no merges were running in our case so everything just shut down. During the copy phase we finished a total of three in-memory merges, which is why we currently have three files on the disk. 314 more map outputs are still in the in-memory buffer. This concludes the Copy phase and the Sort phase begins:
&lt;pre class="brush:plain"&gt;2010-12-05 01:59:10,605 INFO org.apache.hadoop.mapred.Merger: Merging 314 sorted segments
2010-12-05 01:59:10,605 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 314 segments left of total size: 127512782 bytes
2010-12-05 01:59:13,903 INFO org.apache.hadoop.mapred.ReduceTask: Merged 314 segments, 127512782 bytes to disk to satisfy reduce memory limit
2010-12-05 01:59:13,904 INFO org.apache.hadoop.mapred.ReduceTask: Merging 4 files, 788519164 bytes from disk
2010-12-05 01:59:13,905 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
2010-12-05 01:59:13,905 INFO org.apache.hadoop.mapred.Merger: Merging 4 sorted segments
2010-12-05 01:59:14,493 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 4 segments left of total size: 788519148 bytes&lt;/pre&gt;
There are two things happening here. First of all the remaining 314 files that are still in memory are merged into one file on the disk (the first three lines). So now there are four files on the disk. These four files are merged into one.&lt;br /&gt;
&lt;br /&gt;
There is an option &lt;code&gt;mapred.job.reduce.input.buffer.percent&lt;/code&gt; which is set to 0 by default which allows the Reducer to keep some map output files in memory. The following is a snippet with this property set to 0.7:
&lt;br /&gt;
&lt;pre class="brush:plain"&gt;2010-12-05 23:11:55,657 INFO org.apache.hadoop.mapred.ReduceTask: Merging 3 files, 661137901 bytes from disk
2010-12-05 23:11:55,660 INFO org.apache.hadoop.mapred.ReduceTask: Merging 312 segments, 127381881 bytes from memory into reduce
2010-12-05 23:11:55,661 INFO org.apache.hadoop.mapred.Merger: Merging 3 sorted segments
2010-12-05 23:11:55,688 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 661137889 bytes
2010-12-05 23:11:55,688 INFO org.apache.hadoop.mapred.Merger: Merging 313 sorted segments
2010-12-05 23:11:55,689 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 313 segments left of total size: 788519778 bytes&lt;/pre&gt;
You can see that instead of merging the 312 segments from memory to disk they are kept in memory while the three files on disk are merged into one and all of the resulting 313 segments are streamed into the reducer.&lt;br /&gt;
&lt;br /&gt;
There seems to be a bug in Hadoop though. I'm not 100% sure about this one so any insight would be appreciated. When the following conditions are true segments from the memory don't seem to be written to disk even if they should be according to the configuration:
&lt;ul&gt;
&lt;li&gt;There are segments in memory that should be written to disk before the reduce task begins according to &lt;code&gt;mapred.job.reduce.input.buffer.percent&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;and&lt;/em&gt; there are more files on disk than &lt;code&gt;io.sort.factor&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
If this happens you see this:
&lt;br /&gt;
&lt;pre class="brush:plain"&gt;2010-12-10 16:39:40,671 INFO org.apache.hadoop.mapred.ReduceTask: Keeping 14 segments, 18888592 bytes in memory for intermediate, on-disk merge
2010-12-10 16:39:40,673 INFO org.apache.hadoop.mapred.ReduceTask: Merging 10 files, 4143441520 bytes from disk
2010-12-10 16:39:40,674 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
2010-12-10 16:39:40,674 INFO org.apache.hadoop.mapred.Merger: Merging 24 sorted segments
2010-12-10 16:39:40,859 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 24 segments left of total size: 4143441480 bytes&lt;/pre&gt;
So the steps being done in the Sort phase are the following:
&lt;ol&gt;
&lt;li&gt;Merge all segments (= map outputs) that are still in memory and don't fit into the memory specified by &lt;code&gt;mapred.job.reduce.input.buffer.percent&lt;/code&gt; into one file on disk &lt;em&gt;if&lt;/em&gt; there are fewer than &lt;code&gt;io.sort.factor&lt;/code&gt; files on disk so we end up with at most &lt;code&gt;io.sort.factor&lt;/code&gt; files on the disk after this step. If there are already &lt;code&gt;io.sort.factor&lt;/code&gt; or more files on disk but there are map outputs that need to be written out of memory, keep them in memory for now.
&lt;ol&gt;
&lt;li&gt;In the first case you'll see a log message like this: &lt;code&gt;Merged 314 segments, 127512782 bytes to disk to satisfy reduce memory limit&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;In the second case you'll see this: &lt;code&gt;Keeping 14 segments, 18888592 bytes in memory for intermediate, on-disk merge&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;All files on disk and all remaining files in memory that need to be merged (case 1.b) are determined. You'll see a log message like this: "&lt;code&gt;Merging 4 files, 788519164 bytes from disk&lt;/code&gt;".&lt;/li&gt;
&lt;li&gt;All files that remain in memory during the Reduce phase are determined: "&lt;code&gt;Merging 312 segments, 127381881 bytes from memory into reduce&lt;/code&gt;".&lt;/li&gt;
&lt;li&gt;All files (on disk + in-memory) from step 2. are merged together using &lt;code&gt;io.sort.factor&lt;/code&gt; as the merge factor, which means that there might be intermediate merges to disk.&lt;/li&gt;
&lt;li&gt;Merge all remaining in-memory (from step 3.) and on-disk files (from step 4.) into one stream to be read by the Reducer. This is done in a streaming fashion without writing new data to disk and just returning an Iterator to the Reduce phase.&lt;/li&gt;
&lt;/ol&gt;
This Iterator is given to the Reducer and so the Reduce phase starts.&lt;br /&gt;
&lt;br /&gt;
Well, it turned out to be a rather detailed description of the process which is helpful to understand the configuration properties available to you. See below for a detailed list of all the relevant properties:
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;io.sort.factor&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;10&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;100&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Safe: 20-100&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.task.io.sort.factor&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;Can be used in the client configuration&lt;/li&gt;
&lt;/ul&gt;
I've explained pretty thoroughly what this parameter does so I won't go into detail here. The whole situation with &lt;code&gt;io.sort.factor&lt;/code&gt; and &lt;code&gt;io.sort.mb&lt;/code&gt; is not ideal but as long as they are the options we have and the defaults are very low it is pretty safe to change them to a more reasonable value. It is worthwhile to take a look at your logs and search for the lines mentioned in the explanation above. This can be set on a per-job basis and for jobs that run frequently it's worth finding a good job-specific value.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;io.sort.mb&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;100&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.task.io.sort.mb&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;Can be used in the client configuration&lt;/li&gt;
&lt;/ul&gt;
You can adjust the amount of memory used in the Mappers to collect Map outputs with this parameter. This parameter obviously depends heavily on the amount of memory you have available in total for your child VMs and on the memory requirements of your tasks. Your goal should be to minimize the amount of spilling that has to be performed as explained above and to utilize the available memory as well as possible. If your Map tasks don't need a lot of memory themselves you can use almost all available memory here. The default settings allocate 200 MB for child VMs and half of that is used for the output buffer so your Map tasks have about 100 MB available by default.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;io.sort.record.percent&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;0.05&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;This has been removed in favor of &lt;a href="https://issues.apache.org/jira/browse/MAPREDUCE-64"&gt;automatic configuration&lt;/a&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;Can be used in the client configuration&lt;/li&gt;
&lt;/ul&gt;
The output buffer on the map side is split in two parts. One stores the actual bytes of the output data and the other one stores 16 bytes of metadata per output. This property specifies how much memory of the buffer (&lt;code&gt;io.sort.mb&lt;/code&gt;) is used for tracking the metadata. The default is 5% and is often too low for jobs whose map tasks emit many small records. Look for lines indicating whether a spill to disk occurs because of &lt;code&gt;record full = true&lt;/code&gt;. If this happens try to increase this value. This is another property which is very specific to the jobs you're running so it might need tuning for each and every job.

Thankfully this mechanism has been replaced in Hadoop 0.21.
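&lt;br /&gt;
&lt;br /&gt;
Until then, a reasonable starting value can be estimated from your own logs. The idea in this sketch is my own derivation from the buffer layout described above: both parts of the buffer fill up at the same time when the metadata share equals 16 / (16 + average record size). The function name is made up and the two numbers are taken from the spill log shown earlier (&lt;code&gt;bufend&lt;/code&gt; and &lt;code&gt;kvend&lt;/code&gt;).

```python
METADATA_BYTES_PER_RECORD = 16


def balanced_record_percent(avg_record_bytes):
    """Metadata share at which record and data buffer fill up together."""
    return METADATA_BYTES_PER_RECORD / (METADATA_BYTES_PER_RECORD + avg_record_bytes)


# From the example spill: 19361312 bytes of data for 262144 records.
avg_record_bytes = 19361312 / 262144
print(round(avg_record_bytes, 2))
print(round(balanced_record_percent(avg_record_bytes), 3))
```

For the job in the example this suggests a value of roughly 0.18 instead of the default 0.05. Treat it as a starting point for your own tests and not as a guaranteed optimum.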
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;io.sort.spill.percent&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;0.8&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.map.sort.spill.percent&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;Can be used in the client configuration&lt;/li&gt;
&lt;/ul&gt;
This property just configures when the data from the map output buffer will be written (spilled) to disk. The spilling process runs in a separate thread and output continues to be collected while it is running, so it is important to start spilling before the buffer is completely full; otherwise the map task has to pause until space becomes available.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.job.tracker&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;local&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;&amp;lt;jobtracker&amp;gt;:8021&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.jobtracker.address&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;Needed on the clients&lt;/li&gt;
&lt;/ul&gt;
This lets the client know where to find the JobTracker and it lets the JobTracker know which port to bind to.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.local.dir&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;${hadoop.tmp.dir}/mapred/local&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;/mnt/disk1/hadoop/mapreduce/local,/mnt/disk2/hadoop/mapreduce/local&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.cluster.local.dir&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;/ul&gt;
This lets the MapReduce servers know where to store intermediate files. This may be a comma-separated list of directories to spread the load. Make sure there's enough space here for all your intermediate files. We share the same disks for MapReduce and HDFS.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.system.dir&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;${hadoop.tmp.dir}/mapred/system&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;/hadoop/mapred/system&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.jobtracker.system.dir&lt;/code&gt;&amp;nbsp;in Hadoop 0.21&lt;/li&gt;
&lt;/ul&gt;
This is a folder in the &lt;code&gt;defaultFS&lt;/code&gt; where MapReduce stores some control files. In our case that would be a directory in HDFS. If you have &lt;code&gt;dfs.permissions&lt;/code&gt; enabled (which it is by default) make sure that this directory exists and is owned by mapred:hadoop.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.temp.dir&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;${hadoop.tmp.dir}/mapred/temp&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;/tmp/mapreduce&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.cluster.temp.dir&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;/ul&gt;
This is a folder to store temporary files in. It is hardly used, if at all. If I understand the description correctly this is supposed to be in HDFS but I'm not entirely sure from reading the source code. So we set this to a directory that exists on the local filesystem as well as in HDFS.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.map.tasks&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.job.maps&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;/ul&gt;
It is important to realize that this is just a hint for MapReduce as to the number of Maps it should use. In most cases this value is ignored and the actual number of Maps depends on the input data and is generated automatically. For those rare cases where this value is used we set it to about 90% of our map slot capacity. This can be set client-side per job so if you have a job that relies on this property you had better set it there to an appropriate value.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.reduce.tasks&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.job.reduces&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;/ul&gt;
This is different from the property for map tasks in that it is often not possible to calculate a "native" or optimal number of reduce tasks for a job. With this property you can specify the number of reduce tasks to start for a given job. The default is very low. The description suggests setting this to 99% of the cluster capacity so that all reduces finish in one wave. This is sensible when you use the default scheduler but as soon as multiple jobs run in parallel it's hard to guarantee that all reduces of one job finish in one wave. We're constantly playing around with this and currently have this at about 50% of our capacity.&lt;br /&gt;
&lt;br /&gt;
This too can be specified on a per-job basis.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.jobtracker.taskScheduler&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;org.apache.hadoop.mapred.JobQueueTaskScheduler&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;org.apache.hadoop.mapred.FairScheduler&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.jobtracker.taskscheduler&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;/ul&gt;
With the default configuration all jobs are placed in a priority FIFO queue and submitted one after the other. This is fine for testing but it doesn't utilize the available resources very well. This property allows you to change the scheduler used. These are the available schedulers in CDH3b3:
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;JobQueueTaskScheduler&lt;/li&gt;
&lt;li&gt;&lt;a href="http://archive.cloudera.com/cdh/3/hadoop/fair_scheduler.html"&gt;FairScheduler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://archive.cloudera.com/cdh/3/hadoop/capacity_scheduler.html"&gt;CapacityScheduler&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
Depending on the scheduler you decide to use there may be additional properties which I'm not going to mention here. Have a look at the dedicated documentation.
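&lt;br /&gt;
&lt;br /&gt;
As a rough sketch, and assuming the property lives in &lt;code&gt;mapred-site.xml&lt;/code&gt; on the JobTracker, switching to the FairScheduler and marking the setting as final could look like this:
&lt;pre class="brush:xml"&gt;&amp;lt;!-- Sketch only: the scheduler class is the one named in the list above --&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapred.jobtracker.taskScheduler&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;org.apache.hadoop.mapred.FairScheduler&amp;lt;/value&amp;gt;
  &amp;lt;!-- final: job configurations cannot override this value --&amp;gt;
  &amp;lt;final&amp;gt;true&amp;lt;/final&amp;gt;
&amp;lt;/property&amp;gt;
&lt;/pre&gt;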
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.reduce.parallel.copies&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;5&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: ~20-50&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.reduce.shuffle.parallelcopies&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;/ul&gt;
The reduce tasks have to fetch the map outputs from the remote servers, one output from each map, of which there may be thousands. This option lets you parallelize the copy process. Tuning this value is very worthwhile: in our first tests this property gave us one of the best performance increases of all properties. We increased it in steps of 5 and looked very carefully at the logs and our monitoring system to find a value that works for us. We've not yet finished this process but values between 20 and 50 seem to mostly work without problems.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.tasktracker.map.tasks.maximum&lt;/code&gt; &amp;amp; &lt;code&gt;mapred.tasktracker.reduce.tasks.maximum&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: 2&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.tasktracker.map.tasks.maximum&lt;/code&gt; &amp;amp; &lt;code&gt;mapreduce.tasktracker.reduce.tasks.maximum&lt;/code&gt;&amp;nbsp;in Hadoop 0.21&lt;/li&gt;
&lt;/ul&gt;
This setting is very important and we've yet to find values that we are comfortable with. This setting can be different on each TaskTracker and defines how many map or reduce task "slots" there are on a specific TaskTracker. You need to set these to values that don't overload your servers while still fully utilizing them. You also need to make sure that there's enough memory for all tasks and services running on a server (see mapred.child.java.opts).

By setting this property to different values depending on your server configuration you can easily use heterogeneous hardware in your cluster. Each distinct hardware configuration will have these properties set to different values.

A general rule from the &lt;a href="http://oreilly.com/catalog/0636920010388"&gt;Hadoop: The Definitive Guide&lt;/a&gt; book says that these properties can be set to &lt;code&gt;number of cores - 1&lt;/code&gt;. We've tried various settings now but found the load on the servers to be very high with those settings, so we'll have to do more benchmarking.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.child.java.opts&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;-Xmx200m&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Can be used in the client configuration&lt;/li&gt;
&lt;/ul&gt;
These are the options given to each child JVM started (map- and reduce tasks). The default just sets the maximum memory to 200 MB. This can be set on the client to pass options needed for a specific job. GC logging for example can be enabled as well. This isn't configurable on a per TaskTracker basis so you have to make sure that every machine in your cluster fulfills the requirements. Available memory needs to be at least &lt;code&gt;(mapred.tasktracker.map.tasks.maximum + mapred.tasktracker.reduce.tasks.maximum) * Xmx&lt;/code&gt;.
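&lt;br /&gt;
&lt;br /&gt;
To make the memory rule concrete, here is a small sketch with made-up numbers (4 map slots, 2 reduce slots and a 512 MB heap are assumptions for illustration, not our settings):
&lt;pre class="brush:xml"&gt;&amp;lt;!-- Hypothetical values, for illustration only --&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapred.tasktracker.map.tasks.maximum&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;4&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapred.tasktracker.reduce.tasks.maximum&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;2&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapred.child.java.opts&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;-Xmx512m&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;!-- Heap needed for task JVMs: (4 + 2) * 512 MB = 3072 MB,
     on top of whatever the other services on the machine use --&amp;gt;
&lt;/pre&gt;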
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.inmem.merge.threshold&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;1000&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.reduce.merge.inmem.threshold&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;Can be used in the client configuration&lt;/li&gt;
&lt;/ul&gt;
I've explained the effect of this property in the MapReduce description above but to reiterate: The reduce side fetches map outputs to memory. Once the memory is full or this many map outputs are in memory they are merged together to one file on the disk. This can be set on a per job basis but as a default we've disabled this behavior and just flush to disk when the memory is full. This seems to have been better for all our jobs so far but it's definitely a property to look out for.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.job.shuffle.merge.percent&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;0.66&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.reduce.shuffle.merge.percent&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;Can be used in the client configuration&lt;/li&gt;
&lt;/ul&gt;
Once the memory buffer in the copy (shuffle) phase of the reduce task is this full a background thread will start to merge all map outputs collected in memory so far and write them to a single file on disk. This is similar to what's happening on the map side. In the default configuration the &lt;code&gt;mapred.inmem.merge.threshold&lt;/code&gt; parameter might actually trigger a merge before this value is hit. We haven't yet played around with this property, but you'd have to be careful not to set it so high that the copy processes have to wait for the buffer to be emptied again. That could be a huge performance hit.&lt;br /&gt;
&lt;br /&gt;
It would be nice if Hadoop's logging told us how full the buffer is at the moment a merge finishes.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.job.shuffle.input.buffer.percent&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;0.7&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.reduce.shuffle.input.buffer.percent&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;Can be used in the client configuration&lt;/li&gt;
&lt;/ul&gt;
This is the share of the total available memory (specified by &lt;code&gt;mapred.child.java.opts&lt;/code&gt;) that's allocated for collecting map outputs in memory on the reduce side. This is another parameter we haven't played around with, but my guess would be that it can easily be set a little bit higher.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.job.reduce.input.buffer.percent&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;0.0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.reduce.input.buffer.percent&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;Can be used in the client configuration&lt;/li&gt;
&lt;/ul&gt;
Usually map outputs would be written to disk when the sort phase (on the reduce side) ends. If you have reduce tasks that don't need a lot of memory themselves you can set this to a higher value so that map outputs up to this amount of memory (in percent of the total available memory) aren't written to disk but kept in memory. This is obviously faster than an intermediate spill to disk. It should be considered on a per-job basis.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.map.tasks.speculative.execution&lt;/code&gt; &amp;amp; &lt;code&gt;mapred.reduce.tasks.speculative.execution&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;true&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;false&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We will set this to final once we're in production&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.map.speculative&lt;/code&gt; &amp;amp; &lt;code&gt;mapreduce.reduce.speculative&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;Can be used in the client configuration&lt;/li&gt;
&lt;/ul&gt;
Speculative execution starts multiple instances of certain map or reduce tasks when it detects certain circumstances (like an unusually slow task or node) to avoid waiting too long for stragglers. This sounds like a good idea and we've got it enabled at the moment, but when we go to production this will probably be disabled: it uses valuable cluster resources that mostly go to waste, and while one job may finish faster all the others have to wait longer.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.job.reuse.jvm.num.tasks&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.job.jvm.numtasks&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;Can be used in the client configuration&lt;/li&gt;
&lt;/ul&gt;
Child JVMs are spawned for the map and reduce tasks. This parameter lets you reuse these JVMs for multiple tasks. The default value creates a new JVM for each task, which has some overhead (the book says about one second per JVM). We've played around with it a bit and it can make things faster, but you've got to be careful with memory leaks and shared state, so you should be sure that your jobs can handle this. If you have a performance-critical job you can experiment with this, but we've had some OutOfMemory errors when using it so we're conservative at the moment. If you set it to &lt;code&gt;-1&lt;/code&gt; a JVM will never be destroyed.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;tasktracker.http.threads&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;40&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;80&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.tasktracker.http.threads&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;/ul&gt;
The map output is fetched by the reducers from the TaskTrackers via HTTP. This property lets you adjust the number of threads that serve those requests. When we upped the parallel copies we got some fetch-failure errors, so we slowly increased this value. Those two parameters need to be carefully tuned together. 80 seemed to cause no problems for us so we stuck with it for now. You have to restart your TaskTrackers after changing this value.
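&lt;br /&gt;
&lt;br /&gt;
Since the two properties are tuned together, a sketch of the pair might look like the following. The value 80 is the one mentioned above; 20 is simply the lower end of the range we tested, chosen here for illustration:
&lt;pre class="brush:xml"&gt;&amp;lt;!-- Copier threads per reduce task (illustrative value from the 20-50 range) --&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapred.reduce.parallel.copies&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;20&amp;lt;/value&amp;gt;
  &amp;lt;final&amp;gt;true&amp;lt;/final&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;!-- HTTP threads on each TaskTracker that serve map output to the copiers --&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;tasktracker.http.threads&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;80&amp;lt;/value&amp;gt;
  &amp;lt;final&amp;gt;true&amp;lt;/final&amp;gt;
&amp;lt;/property&amp;gt;
&lt;/pre&gt;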
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.compress.map.output&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;false&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;true&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.map.output.compress&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;Can be used in the client configuration&lt;/li&gt;
&lt;/ul&gt;
Turning this on will compress the output of your Mappers using SequenceFile compression. Depending on the codec you choose this computation may be CPU intensive and result in varying degrees of compression. We've benchmarked jobs of different sizes with this intermediate compression enabled and disabled, and while some of them took slightly longer than before it is still good to enable it. The cost isn't too high and a lot less intermediate data is generated. Less I/O is generally good, especially if multiple jobs are running.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.map.output.compression.codec&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: &lt;code&gt;org.apache.hadoop.io.compress.DefaultCodec&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;com.hadoop.compression.lzo.LzoCodec&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.map.output.compress.codec&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;Can be used in the client configuration&lt;/li&gt;
&lt;/ul&gt;
With this property you specify the compression codec to use for map output compression. So far we've only tried LZO. This choice was based on the experience of others and on the general properties of the algorithm: it is very fast but sacrifices a bit of compression efficiency for its speed. We plan to test the other algorithms as well.
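&lt;br /&gt;
&lt;br /&gt;
Put together with the previous property, a minimal sketch of our intermediate compression settings could look like this (it assumes the LZO codec classes and native libraries are already installed on every node, which is not covered here):
&lt;pre class="brush:xml"&gt;&amp;lt;!-- Compress map output before it is written and shuffled --&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapred.compress.map.output&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;!-- Codec used for that compression --&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapred.map.output.compression.codec&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;com.hadoop.compression.lzo.LzoCodec&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&lt;/pre&gt;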
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;&lt;code&gt;mapred.hosts&lt;/code&gt;&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;Default: no default set&lt;/li&gt;
&lt;li&gt;We: &lt;code&gt;/etc/hadoop/conf/allowed_hosts&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Renamed to &lt;code&gt;mapreduce.jobtracker.hosts.filename&lt;/code&gt; in Hadoop 0.21&lt;/li&gt;
&lt;li&gt;We set this to final&lt;/li&gt;
&lt;/ul&gt;
This is the same as &lt;code&gt;dfs.hosts&lt;/code&gt; except that it specifies which TaskTrackers are allowed to get work from the JobTracker. Both files have the same format so it's quite common for them to be the same file.
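&lt;br /&gt;
&lt;br /&gt;
As a sketch, the corresponding entry (using the path from the list above and marked as final) might look like this:
&lt;pre class="brush:xml"&gt;&amp;lt;!-- Only TaskTrackers listed in this file may connect to the JobTracker --&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapred.hosts&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;/etc/hadoop/conf/allowed_hosts&amp;lt;/value&amp;gt;
  &amp;lt;final&amp;gt;true&amp;lt;/final&amp;gt;
&amp;lt;/property&amp;gt;
&lt;/pre&gt;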
&lt;br /&gt;
&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Conclusion&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
After setting up all these parameters the way you like them you should have a fully functional but basic Hadoop cluster running. You can submit jobs, use HDFS etc. But there are a few more things that we can do like installing Hive, Hue, Pig, Sqoop, etc. We've also yet to cover Puppet. All of this is hopefully forthcoming in more blog posts in the future.&lt;br/&gt;
&lt;br/&gt;
We'd also like to hear from other users of Hadoop, HBase &amp;amp; Co. in Scandinavia (people or companies) who would be interested in a Hadoop Meetup. We're located in Copenhagen. Contact us if you're interested.&lt;br/&gt;&lt;br/&gt;
If you have any questions or spot any problems or mistakes please let me know in the comments or by &lt;a href="mailto:lars.francke@gmail.com"&gt;mail&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-3385874402411941837?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/3385874402411941837/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3385874402411941837'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3385874402411941837'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html' title='Setting up a Hadoop cluster - Part 1: Manual Installation'/><author><name>Lars Francke</name><uri>https://profiles.google.com/109602018791202712990</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-754437864856581275</id><published>2010-05-12T22:33:00.006+02:00</published><updated>2010-05-13T17:49:19.692+02:00</updated><title type='text'>How To Configure GeoServer For The GBIF Data Portal</title><content type='html'>We are Daniel Amariles and Hector Tobon, we work for the &lt;a href="http://www.ciat.cgiar.org/"&gt;International Center for Tropical Agriculture&lt;/a&gt;, based in Cali, Colombia, and we are currently correcting some bugs on the GBIF Data Portal.&lt;br /&gt;
&lt;br /&gt;
A few days ago we had to install Geoserver in order to deal with some map issues from the GBIF Data Portal. We want to take this opportunity to document our experience (installation and configuration) so that anybody inside the community can benefit from this knowledge. This post refers to a customised installation of GeoServer for the GBIF Data Portal.&lt;br /&gt;
&lt;hr /&gt;&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://geoserver.org//download/attachments/19005441/global.logo?version=1" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://geoserver.org//download/attachments/19005441/global.logo?version=1" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
&lt;span style="font-size: large;"&gt;&lt;b&gt;Installing &lt;a href="http://geoserver.org/display/GEOS/What+is+Geoserver"&gt;GeoServer&lt;/a&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp;There are different ways to install GeoServer:&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;&amp;nbsp;Install GeoServer as a WAR in Apache Tomcat&lt;/li&gt;
&lt;li&gt;&amp;nbsp;Install GeoServer as a binary file (OS independent)&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;&lt;b&gt;Configuring system variables&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
Configure the following variables according to your system configuration.&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;JAVA_HOME=&lt;i&gt;Your JAVA installation path&lt;/i&gt;&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;JAVA_OPTS="-XX:PermSize=512M -Xmx1g -Djava.awt.headless=true -Dcom.sun.management.jmxremote"&lt;/code&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Install GeoServer as a WAR in Apache tomcat (recommended)&lt;/b&gt;&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Get the Apache Tomcat from &lt;a href="http://tomcat.apache.org/"&gt;http://tomcat.apache.org/&lt;/a&gt; (Binary Distributions - Core) &lt;br /&gt;
In this case we are going to use &lt;a href="http://apache.multihomed.net/tomcat/tomcat-6/v6.0.26/bin/apache-tomcat-6.0.26.zip"&gt;Apache Tomcat Zipped Binary Distribution 6.0.26&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&amp;nbsp;Get the GeoServer WAR file from &lt;a href="http://geoserver.org/display/GEOS/Stable"&gt;http://geoserver.org/display/GEOS/Stable&lt;/a&gt; (Web Archive Format)&lt;br /&gt;
In this case we are going to use &lt;a href="http://downloads.sourceforge.net/geoserver/geoserver-2.0.1-war.zip"&gt;GeoServer 2.0.1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Unzip the Tomcat file to a location of your choice and copy the GeoServer WAR file to &lt;i&gt;$TOMCAT/webapps/geoserver.war&lt;/i&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Start Tomcat by running &lt;i&gt;$TOMCAT/bin/startup.sh&lt;/i&gt; or &lt;i&gt;$TOMCAT/bin/startup.bat&lt;/i&gt; according to your OS. &lt;/li&gt;
&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Go to URL &lt;a href="http://localhost:8080/geoserver"&gt;http://localhost:8080/geoserver&lt;/a&gt; and sign in using the default username: &lt;i&gt;admin &lt;/i&gt;password: &lt;i&gt;geoserver&lt;/i&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;b&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; How to change the port&lt;/b&gt;&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Shut down Tomcat by running &lt;i&gt;$TOMCAT/bin/shutdown.sh&lt;/i&gt; or &lt;i&gt;$TOMCAT/bin/shutdown.bat&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;Open file &lt;i&gt;$TOMCAT/conf/server.xml&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;Find the line &lt;code&gt;&amp;lt;Connector port="8080" protocol="HTTP/1.1" ...&lt;/code&gt; and change the default port to the one you want.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Start Tomcat&amp;nbsp;&lt;/li&gt;
&lt;/ul&gt;&lt;b&gt;Install GeoServer as Binary (OS independent)&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
Note: If you install GeoServer as Binary it will be run by a Jetty server rather than Tomcat.&amp;nbsp; &lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Get the bin folder from &lt;a href="http://geoserver.org/display/GEOS/Stable"&gt;http://geoserver.org/display/GEOS/Stable&lt;/a&gt; (Binary - OS independent)&lt;br /&gt;
In this case we are going to use GeoServer 2.0.1&lt;/li&gt;
&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Unzip the folder and place it in a location of your choice.&lt;/li&gt;
&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Run&lt;i&gt; $GEOSERVER/bin/startup.bat&lt;/i&gt; or &lt;i&gt;$GEOSERVER/bin/startup.sh&lt;/i&gt; according to your OS.&lt;/li&gt;
&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&amp;nbsp;Go to URL &lt;a href="http://localhost:8080/geoserver"&gt;http://localhost:8080/geoserver&lt;/a&gt; and sign in using the default username: admin password: geoserver&lt;/li&gt;
&lt;/ul&gt;&lt;b&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; How to change the port &lt;/b&gt;&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Shut down the Jetty server by running &lt;br /&gt;
&lt;i&gt;$GEOSERVER/bin/shutdown.sh&lt;/i&gt; or &lt;i&gt;$GEOSERVER/bin/shutdown.bat&amp;nbsp;&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;Open file &lt;i&gt;$GEOSERVER/etc/jetty.xml&lt;/i&gt;&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Find the line &lt;code&gt;&amp;lt;Set name="port"&amp;gt;8081&amp;lt;/Set&amp;gt;&lt;/code&gt; and change the port to the one you want&lt;/li&gt;
&lt;li&gt;Start Jetty&lt;/li&gt;
&lt;/ul&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;Resources &amp;amp; libraries&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Shapefile&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
Copy the files corresponding to the country shape file&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;i&gt;&lt;a href="http://ogc.gbif.org/data/data/shapefiles/country.dbf"&gt;country.dbf&lt;/a&gt;  &lt;a href="http://ogc.gbif.org/data/data/shapefiles/country.fix"&gt;country.fix&lt;/a&gt;  &lt;a href="http://ogc.gbif.org/data/data/shapefiles/country.ORG.dbf"&gt;country.ORG.dbf&lt;/a&gt; &lt;a href="http://ogc.gbif.org/data/data/shapefiles/country.prj"&gt;country.prj&lt;/a&gt;  &lt;a href="http://ogc.gbif.org/data/data/shapefiles/country.qix"&gt;country.qix&lt;/a&gt;  &lt;a href="http://ogc.gbif.org/data/data/shapefiles/country.shp"&gt;country.shp&lt;/a&gt;  &lt;a href="http://ogc.gbif.org/data/data/shapefiles/country.shx"&gt;country.shx&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
to the shapefiles directory of the GeoServer &lt;i&gt;$GEOSERVER/data_dir/data/shapefiles&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;The gt-ala-tab library&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;Copy the library&amp;nbsp; &lt;img border="0" src="http://code.google.com/hosting/images/paperclip.gif" /&gt;&lt;a href="http://gbif-dataportal.googlecode.com/issues/attachment?aid=3210348668769350056&amp;amp;name=gt-ala-tab-1.0-SNAPSHOT.jar&amp;amp;token=077070ac22e492c1e9f4b15d25b265bd"&gt;gt-ala-tab-1.0-SNAPSHOT.jar&lt;/a&gt; to the libraries directory of the GeoServer &lt;i&gt;$GEOSERVER\geoserver\WEB-INF\lib&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;span style="font-size: large;"&gt;&lt;b&gt;Workspace&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
Create the Workspace named gbif with URI &lt;a href="http://www.gbif.org/"&gt;http://www.gbif.org&lt;/a&gt; and set it as default workspace&lt;br /&gt;
&lt;br /&gt;
&lt;span style="font-size: large;"&gt;&lt;b&gt;Stores&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;country stores&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
Add three stores of type &lt;i&gt;Shapefile&lt;/i&gt; named &lt;i&gt;country_borders&lt;/i&gt;, &lt;i&gt;country_fill&lt;/i&gt; and &lt;i&gt;country_names&lt;/i&gt;, each with the following parameters:&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Workspace: &lt;i&gt;gbif&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;URL: file:data/shapefiles/country.shp &lt;/li&gt;
&lt;/ul&gt;&lt;b&gt;tab_density store&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
Add the store &lt;i&gt;tab_density&lt;/i&gt; of type &lt;i&gt;Tab Url DataStore&lt;/i&gt; with the following parameters:&lt;br /&gt;
&lt;ul&gt;&lt;li&gt; Workspace: &lt;i&gt;gbif&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;minx: -180&lt;/li&gt;
&lt;li&gt;miny: -90&lt;/li&gt;
&lt;li&gt;maxx: 180&lt;/li&gt;
&lt;li&gt;maxy: 90 &lt;/li&gt;
&lt;/ul&gt;&lt;b&gt;Styles&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
Add the following SLD styles. Click on the validate button to verify the style is a valid SLD document.&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;b&gt;country_borders style&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush:xml"&gt;&amp;lt;?xml version="1.0" encoding="ISO-8859-1"?&amp;gt; 
    &amp;lt;StyledLayerDescriptor version="1.0.0" xmlns="http://www.opengis.net/sld" xmlns:ogc="http://www.opengis.net/ogc" 
      xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation="http://www.opengis.net/sld http://schemas.opengis.net/sld/1.0.0/StyledLayerDescriptor.xsd"&amp;gt; 
      &amp;lt;NamedLayer&amp;gt; 
        &amp;lt;Name&amp;gt;Countries Borders&amp;lt;/Name&amp;gt; 
        &amp;lt;UserStyle&amp;gt; 
          &amp;lt;Title&amp;gt;Countries Borders&amp;lt;/Title&amp;gt; 
          &amp;lt;Abstract&amp;gt;A style that just draws a 1 pixel stroke around each countries borders&amp;lt;/Abstract&amp;gt; 
          &amp;lt;FeatureTypeStyle&amp;gt; 
            &amp;lt;Rule&amp;gt; 
              &amp;lt;Title&amp;gt;Polygon&amp;lt;/Title&amp;gt; 
              &amp;lt;PolygonSymbolizer&amp;gt; 
                &amp;lt;Stroke&amp;gt; 
                  &amp;lt;CssParameter name="stroke"&amp;gt;#006600&amp;lt;/CssParameter&amp;gt; 
                  &amp;lt;CssParameter name="stroke-width"&amp;gt;1&amp;lt;/CssParameter&amp;gt; 
                &amp;lt;/Stroke&amp;gt; 
              &amp;lt;/PolygonSymbolizer&amp;gt; 
            &amp;lt;/Rule&amp;gt; 
          &amp;lt;/FeatureTypeStyle&amp;gt; 
        &amp;lt;/UserStyle&amp;gt; 
      &amp;lt;/NamedLayer&amp;gt; 
    &amp;lt;/StyledLayerDescriptor&amp;gt;&amp;nbsp;&lt;/pre&gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;b&gt;country_fill style&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush:xml"&gt;&amp;lt;?xml version="1.0" encoding="ISO-8859-1"?&amp;gt; 
    &amp;lt;StyledLayerDescriptor version="1.0.0" xmlns="http://www.opengis.net/sld" xmlns:ogc="http://www.opengis.net/ogc" 
      xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation="http://www.opengis.net/sld http://schemas.opengis.net/sld/1.0.0/StyledLayerDescriptor.xsd"&amp;gt; 
      &amp;lt;NamedLayer&amp;gt; 
        &amp;lt;Name&amp;gt;Country Polygons Fill&amp;lt;/Name&amp;gt; 
        &amp;lt;UserStyle&amp;gt; 
          &amp;lt;Title&amp;gt;Country Polygons style&amp;lt;/Title&amp;gt; 
          &amp;lt;Abstract&amp;gt;A style that fills the countries polygons with a specific color&amp;lt;/Abstract&amp;gt; 
          &amp;lt;FeatureTypeStyle&amp;gt; 
            &amp;lt;Rule&amp;gt; 
              &amp;lt;Title&amp;gt;Polygon&amp;lt;/Title&amp;gt; 
              &amp;lt;PolygonSymbolizer&amp;gt; 
                &amp;lt;Fill&amp;gt; 
                  &amp;lt;CssParameter name="fill"&amp;gt;#003333&amp;lt;/CssParameter&amp;gt; 
                  &amp;lt;CssParameter name="fill-opacity"&amp;gt;&amp;lt;ogc:Literal&amp;gt;1&amp;lt;/ogc:Literal&amp;gt;&amp;lt;/CssParameter&amp;gt;               
                &amp;lt;/Fill&amp;gt; 
              &amp;lt;/PolygonSymbolizer&amp;gt; 
            &amp;lt;/Rule&amp;gt; 
          &amp;lt;/FeatureTypeStyle&amp;gt; 
        &amp;lt;/UserStyle&amp;gt; 
      &amp;lt;/NamedLayer&amp;gt; 
    &amp;lt;/StyledLayerDescriptor&amp;gt;
&lt;/pre&gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;b&gt;country_names style&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush:xml"&gt;&amp;lt;?xml version="1.0" encoding="ISO-8859-1"?&amp;gt; 
&amp;lt;StyledLayerDescriptor version="1.0.0" xmlns="http://www.opengis.net/sld" xmlns:ogc="http://www.opengis.net/ogc" 
  xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xsi:schemaLocation="http://www.opengis.net/sld http://schemas.opengis.net/sld/1.0.0/StyledLayerDescriptor.xsd"&amp;gt; 
  &amp;lt;NamedLayer&amp;gt; 
    &amp;lt;Name&amp;gt;Country Names&amp;lt;/Name&amp;gt; 
    &amp;lt;UserStyle&amp;gt; 
      &amp;lt;Title&amp;gt;Country Names&amp;lt;/Title&amp;gt; 
      &amp;lt;Abstract&amp;gt;Style that renders the names of the countries on the map&amp;lt;/Abstract&amp;gt; 
      &amp;lt;FeatureTypeStyle&amp;gt; 
        &amp;lt;Rule&amp;gt;
          &amp;lt;TextSymbolizer&amp;gt;
            &amp;lt;Label&amp;gt;
              &amp;lt;ogc:PropertyName&amp;gt;CNTRY_NAME&amp;lt;/ogc:PropertyName&amp;gt;
            &amp;lt;/Label&amp;gt;
            &amp;lt;Font&amp;gt;
              &amp;lt;CssParameter name="font-family"&amp;gt;&amp;lt;ogc:Literal&amp;gt;Lucida Sans&amp;lt;/ogc:Literal&amp;gt;&amp;lt;/CssParameter&amp;gt;
              &amp;lt;CssParameter name="font-style"&amp;gt;&amp;lt;ogc:Literal&amp;gt;normal&amp;lt;/ogc:Literal&amp;gt;&amp;lt;/CssParameter&amp;gt;
              &amp;lt;CssParameter name="font-size"&amp;gt;&amp;lt;ogc:Literal&amp;gt;10.0&amp;lt;/ogc:Literal&amp;gt;&amp;lt;/CssParameter&amp;gt;
              &amp;lt;CssParameter name="font-weight"&amp;gt;&amp;lt;ogc:Literal&amp;gt;bold&amp;lt;/ogc:Literal&amp;gt;&amp;lt;/CssParameter&amp;gt;
            &amp;lt;/Font&amp;gt;
            &amp;lt;LabelPlacement&amp;gt;
              &amp;lt;PointPlacement&amp;gt;
                &amp;lt;AnchorPoint&amp;gt;
                  &amp;lt;AnchorPointX&amp;gt;0.5&amp;lt;/AnchorPointX&amp;gt;
                  &amp;lt;AnchorPointY&amp;gt;0.5&amp;lt;/AnchorPointY&amp;gt;
                &amp;lt;/AnchorPoint&amp;gt;
              &amp;lt;/PointPlacement&amp;gt;
            &amp;lt;/LabelPlacement&amp;gt;
            &amp;lt;Fill&amp;gt;
              &amp;lt;CssParameter name="fill"&amp;gt;#6e8686&amp;lt;/CssParameter&amp;gt;
            &amp;lt;/Fill&amp;gt;
          &amp;lt;/TextSymbolizer&amp;gt;
        &amp;lt;/Rule&amp;gt;
      &amp;lt;/FeatureTypeStyle&amp;gt; 
    &amp;lt;/UserStyle&amp;gt; 
  &amp;lt;/NamedLayer&amp;gt; 
&amp;lt;/StyledLayerDescriptor&amp;gt;
&lt;/pre&gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;b&gt;density_layer style&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush:xml"&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt; 
&amp;lt;StyledLayerDescriptor version="1.0.0" 
 xsi:schemaLocation="http://www.opengis.net/sld StyledLayerDescriptor.xsd" 
 xmlns="http://www.opengis.net/sld" xmlns:ogc="http://www.opengis.net/ogc" 
 xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"&amp;gt; 
 &amp;lt;NamedLayer&amp;gt; 
   &amp;lt;Name&amp;gt;densityLayer&amp;lt;/Name&amp;gt; 
   &amp;lt;UserStyle&amp;gt; 
     &amp;lt;FeatureTypeStyle&amp;gt; 
       &amp;lt;!-- If it is a point data, render as such --&amp;gt; 
       &amp;lt;Rule&amp;gt; 
         &amp;lt;ogc:Filter&amp;gt; 
           &amp;lt;ogc:PropertyIsEqualTo&amp;gt; 
             &amp;lt;ogc:Function name="geometryType"&amp;gt; 
               &amp;lt;ogc:PropertyName&amp;gt;geom&amp;lt;/ogc:PropertyName&amp;gt; 
             &amp;lt;/ogc:Function&amp;gt; 
             &amp;lt;ogc:Literal&amp;gt;Point&amp;lt;/ogc:Literal&amp;gt; 
           &amp;lt;/ogc:PropertyIsEqualTo&amp;gt; 
         &amp;lt;/ogc:Filter&amp;gt; 
         &amp;lt;PointSymbolizer&amp;gt; 
           &amp;lt;Graphic&amp;gt; 
             &amp;lt;Mark&amp;gt; 
               &amp;lt;WellKnownName&amp;gt;circle&amp;lt;/WellKnownName&amp;gt;
               &amp;lt;Fill&amp;gt; 
                 &amp;lt;CssParameter name="fill"&amp;gt;#cc0000&amp;lt;/CssParameter&amp;gt; 
                 &amp;lt;CssParameter name="fill-opacity"&amp;gt;1.0&amp;lt;/CssParameter&amp;gt; 
               &amp;lt;/Fill&amp;gt; 
             &amp;lt;/Mark&amp;gt; 
             &amp;lt;Size&amp;gt;6&amp;lt;/Size&amp;gt; 
           &amp;lt;/Graphic&amp;gt; 
         &amp;lt;/PointSymbolizer&amp;gt; 
       &amp;lt;/Rule&amp;gt; 
       &amp;lt;!-- 1-9 --&amp;gt; 
       &amp;lt;Rule&amp;gt; 
         &amp;lt;ogc:Filter&amp;gt; 
           &amp;lt;ogc:And&amp;gt; 
             &amp;lt;ogc:PropertyIsGreaterThanOrEqualTo&amp;gt; 
               &amp;lt;ogc:PropertyName&amp;gt;count&amp;lt;/ogc:PropertyName&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;1&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/ogc:PropertyIsGreaterThanOrEqualTo&amp;gt; 
             &amp;lt;ogc:PropertyIsLessThan&amp;gt; 
               &amp;lt;ogc:PropertyName&amp;gt;count&amp;lt;/ogc:PropertyName&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;10&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/ogc:PropertyIsLessThan&amp;gt; 
           &amp;lt;/ogc:And&amp;gt; 
         &amp;lt;/ogc:Filter&amp;gt; 
         &amp;lt;PolygonSymbolizer&amp;gt; 
           &amp;lt;Fill&amp;gt; 
             &amp;lt;CssParameter name="fill"&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;#ffff00&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/CssParameter&amp;gt; 
             &amp;lt;CssParameter name="fill-opacity"&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;1&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/CssParameter&amp;gt; 
           &amp;lt;/Fill&amp;gt; 
          &amp;lt;Stroke&amp;gt; 
            &amp;lt;CssParameter name="stroke"&amp;gt;#ffff00&amp;lt;/CssParameter&amp;gt; 
            &amp;lt;CssParameter name="stroke-width"&amp;gt;0&amp;lt;/CssParameter&amp;gt; 
          &amp;lt;/Stroke&amp;gt; 
         &amp;lt;/PolygonSymbolizer&amp;gt; 
       &amp;lt;/Rule&amp;gt; 
       &amp;lt;!-- 10-99 --&amp;gt; 
       &amp;lt;Rule&amp;gt; 
         &amp;lt;ogc:Filter&amp;gt; 
           &amp;lt;ogc:And&amp;gt; 
             &amp;lt;ogc:PropertyIsGreaterThanOrEqualTo&amp;gt; 
               &amp;lt;ogc:PropertyName&amp;gt;count&amp;lt;/ogc:PropertyName&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;10&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/ogc:PropertyIsGreaterThanOrEqualTo&amp;gt; 
             &amp;lt;ogc:PropertyIsLessThan&amp;gt; 
               &amp;lt;ogc:PropertyName&amp;gt;count&amp;lt;/ogc:PropertyName&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;100&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/ogc:PropertyIsLessThan&amp;gt; 
           &amp;lt;/ogc:And&amp;gt; 
         &amp;lt;/ogc:Filter&amp;gt; 
         &amp;lt;PolygonSymbolizer&amp;gt; 
           &amp;lt;Fill&amp;gt; 
             &amp;lt;CssParameter name="fill"&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;#ffcc00&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/CssParameter&amp;gt; 
             &amp;lt;CssParameter name="fill-opacity"&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;1&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/CssParameter&amp;gt; 
           &amp;lt;/Fill&amp;gt; 
          &amp;lt;Stroke&amp;gt; 
            &amp;lt;CssParameter name="stroke"&amp;gt;#ffcc00&amp;lt;/CssParameter&amp;gt; 
            &amp;lt;CssParameter name="stroke-width"&amp;gt;0&amp;lt;/CssParameter&amp;gt; 
          &amp;lt;/Stroke&amp;gt; 
         &amp;lt;/PolygonSymbolizer&amp;gt; 
       &amp;lt;/Rule&amp;gt; 
       &amp;lt;!-- 100-999 --&amp;gt; 
       &amp;lt;Rule&amp;gt; 
         &amp;lt;ogc:Filter&amp;gt; 
           &amp;lt;ogc:And&amp;gt; 
             &amp;lt;ogc:PropertyIsGreaterThanOrEqualTo&amp;gt; 
               &amp;lt;ogc:PropertyName&amp;gt;count&amp;lt;/ogc:PropertyName&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;100&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/ogc:PropertyIsGreaterThanOrEqualTo&amp;gt; 
             &amp;lt;ogc:PropertyIsLessThan&amp;gt; 
               &amp;lt;ogc:PropertyName&amp;gt;count&amp;lt;/ogc:PropertyName&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;1000&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/ogc:PropertyIsLessThan&amp;gt; 
           &amp;lt;/ogc:And&amp;gt; 
         &amp;lt;/ogc:Filter&amp;gt; 
         &amp;lt;PolygonSymbolizer&amp;gt; 
           &amp;lt;Fill&amp;gt; 
             &amp;lt;CssParameter name="fill"&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;#ff9900&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/CssParameter&amp;gt; 
             &amp;lt;CssParameter name="fill-opacity"&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;1&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/CssParameter&amp;gt; 
           &amp;lt;/Fill&amp;gt; 
          &amp;lt;Stroke&amp;gt; 
            &amp;lt;CssParameter name="stroke"&amp;gt;#ff9900&amp;lt;/CssParameter&amp;gt; 
            &amp;lt;CssParameter name="stroke-width"&amp;gt;0&amp;lt;/CssParameter&amp;gt; 
          &amp;lt;/Stroke&amp;gt; 
         &amp;lt;/PolygonSymbolizer&amp;gt; 
       &amp;lt;/Rule&amp;gt; 
       &amp;lt;!-- 1000-9999 --&amp;gt; 
       &amp;lt;Rule&amp;gt; 
         &amp;lt;ogc:Filter&amp;gt; 
           &amp;lt;ogc:And&amp;gt; 
             &amp;lt;ogc:PropertyIsGreaterThanOrEqualTo&amp;gt; 
               &amp;lt;ogc:PropertyName&amp;gt;count&amp;lt;/ogc:PropertyName&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;1000&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/ogc:PropertyIsGreaterThanOrEqualTo&amp;gt; 
             &amp;lt;ogc:PropertyIsLessThan&amp;gt; 
               &amp;lt;ogc:PropertyName&amp;gt;count&amp;lt;/ogc:PropertyName&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;10000&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/ogc:PropertyIsLessThan&amp;gt; 
           &amp;lt;/ogc:And&amp;gt; 
         &amp;lt;/ogc:Filter&amp;gt; 
         &amp;lt;PolygonSymbolizer&amp;gt; 
           &amp;lt;Fill&amp;gt; 
             &amp;lt;CssParameter name="fill"&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;#ff6600&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/CssParameter&amp;gt; 
             &amp;lt;CssParameter name="fill-opacity"&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;1&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/CssParameter&amp;gt; 
           &amp;lt;/Fill&amp;gt; 
          &amp;lt;Stroke&amp;gt; 
            &amp;lt;CssParameter name="stroke"&amp;gt;#ff6600&amp;lt;/CssParameter&amp;gt; 
            &amp;lt;CssParameter name="stroke-width"&amp;gt;0&amp;lt;/CssParameter&amp;gt; 
          &amp;lt;/Stroke&amp;gt; 
         &amp;lt;/PolygonSymbolizer&amp;gt; 
       &amp;lt;/Rule&amp;gt; 
       &amp;lt;!-- 10000-99999 --&amp;gt; 
       &amp;lt;Rule&amp;gt; 
         &amp;lt;ogc:Filter&amp;gt; 
           &amp;lt;ogc:And&amp;gt; 
             &amp;lt;ogc:PropertyIsGreaterThanOrEqualTo&amp;gt; 
               &amp;lt;ogc:PropertyName&amp;gt;count&amp;lt;/ogc:PropertyName&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;10000&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/ogc:PropertyIsGreaterThanOrEqualTo&amp;gt; 
             &amp;lt;ogc:PropertyIsLessThan&amp;gt; 
               &amp;lt;ogc:PropertyName&amp;gt;count&amp;lt;/ogc:PropertyName&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;100000&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/ogc:PropertyIsLessThan&amp;gt; 
           &amp;lt;/ogc:And&amp;gt; 
         &amp;lt;/ogc:Filter&amp;gt; 
         &amp;lt;PolygonSymbolizer&amp;gt; 
           &amp;lt;Fill&amp;gt; 
             &amp;lt;CssParameter name="fill"&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;#ff3300&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/CssParameter&amp;gt; 
             &amp;lt;CssParameter name="fill-opacity"&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;1&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/CssParameter&amp;gt; 
           &amp;lt;/Fill&amp;gt; 
          &amp;lt;Stroke&amp;gt; 
            &amp;lt;CssParameter name="stroke"&amp;gt;#ff3300&amp;lt;/CssParameter&amp;gt; 
            &amp;lt;CssParameter name="stroke-width"&amp;gt;0&amp;lt;/CssParameter&amp;gt; 
          &amp;lt;/Stroke&amp;gt; 
         &amp;lt;/PolygonSymbolizer&amp;gt; 
       &amp;lt;/Rule&amp;gt; 
       &amp;lt;!-- 100000+ --&amp;gt; 
       &amp;lt;Rule&amp;gt; 
         &amp;lt;ogc:Filter&amp;gt; 
           &amp;lt;ogc:PropertyIsGreaterThanOrEqualTo&amp;gt; 
             &amp;lt;ogc:PropertyName&amp;gt;count&amp;lt;/ogc:PropertyName&amp;gt; 
             &amp;lt;ogc:Literal&amp;gt;100000&amp;lt;/ogc:Literal&amp;gt; 
           &amp;lt;/ogc:PropertyIsGreaterThanOrEqualTo&amp;gt; 
         &amp;lt;/ogc:Filter&amp;gt; 
         &amp;lt;PolygonSymbolizer&amp;gt; 
           &amp;lt;Fill&amp;gt; 
             &amp;lt;CssParameter name="fill"&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;#cc0000&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/CssParameter&amp;gt; 
             &amp;lt;CssParameter name="fill-opacity"&amp;gt; 
               &amp;lt;ogc:Literal&amp;gt;1&amp;lt;/ogc:Literal&amp;gt; 
             &amp;lt;/CssParameter&amp;gt; 
           &amp;lt;/Fill&amp;gt; 
          &amp;lt;Stroke&amp;gt; 
            &amp;lt;CssParameter name="stroke"&amp;gt;#cc0000&amp;lt;/CssParameter&amp;gt; 
            &amp;lt;CssParameter name="stroke-width"&amp;gt;0&amp;lt;/CssParameter&amp;gt; 
          &amp;lt;/Stroke&amp;gt; 
         &amp;lt;/PolygonSymbolizer&amp;gt; 
       &amp;lt;/Rule&amp;gt; 
     &amp;lt;/FeatureTypeStyle&amp;gt; 
   &amp;lt;/UserStyle&amp;gt; 
 &amp;lt;/NamedLayer&amp;gt; 
&amp;lt;/StyledLayerDescriptor&amp;gt; 
&lt;/pre&gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;b&gt;country_borders_black style&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush:xml"&gt;&amp;lt;?xml version="1.0" encoding="ISO-8859-1"?&amp;gt;
&amp;lt;StyledLayerDescriptor version="1.0.0" xmlns="http://www.opengis.net/sld" xmlns:ogc="http://www.opengis.net/ogc"
 xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.opengis.net/sld http://schemas.opengis.net/sld/1.0.0/StyledLayerDescriptor.xsd"&amp;gt;
 &amp;lt;NamedLayer&amp;gt;
   &amp;lt;Name&amp;gt;Countries Borders&amp;lt;/Name&amp;gt;
   &amp;lt;UserStyle&amp;gt;
     &amp;lt;Title&amp;gt;Countries Borders&amp;lt;/Title&amp;gt;
     &amp;lt;Abstract&amp;gt;A style that just draws a 1 pixel stroke around each country's borders&amp;lt;/Abstract&amp;gt;
     &amp;lt;FeatureTypeStyle&amp;gt;
       &amp;lt;Rule&amp;gt;
         &amp;lt;Title&amp;gt;Polygon&amp;lt;/Title&amp;gt;
         &amp;lt;PolygonSymbolizer&amp;gt;
           &amp;lt;Stroke&amp;gt;
             &amp;lt;CssParameter name="stroke"&amp;gt;#000000&amp;lt;/CssParameter&amp;gt;
             &amp;lt;CssParameter name="stroke-width"&amp;gt;1&amp;lt;/CssParameter&amp;gt;
           &amp;lt;/Stroke&amp;gt;
         &amp;lt;/PolygonSymbolizer&amp;gt;
       &amp;lt;/Rule&amp;gt;
     &amp;lt;/FeatureTypeStyle&amp;gt;
   &amp;lt;/UserStyle&amp;gt;
 &amp;lt;/NamedLayer&amp;gt;
&amp;lt;/StyledLayerDescriptor&amp;gt;
&lt;/pre&gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;b&gt;country_names_black style&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush:xml"&gt;&amp;lt;?xml version="1.0" encoding="ISO-8859-1"?&amp;gt;
&amp;lt;StyledLayerDescriptor version="1.0.0" xmlns="http://www.opengis.net/sld" xmlns:ogc="http://www.opengis.net/ogc"
 xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.opengis.net/sld http://schemas.opengis.net/sld/1.0.0/StyledLayerDescriptor.xsd"&amp;gt;
 &amp;lt;NamedLayer&amp;gt;
   &amp;lt;Name&amp;gt;Country Names&amp;lt;/Name&amp;gt;
   &amp;lt;UserStyle&amp;gt;
     &amp;lt;Title&amp;gt;Country Names&amp;lt;/Title&amp;gt;
     &amp;lt;Abstract&amp;gt;Style that renders the names of the countries on the map&amp;lt;/Abstract&amp;gt;
     &amp;lt;FeatureTypeStyle&amp;gt;
       &amp;lt;Rule&amp;gt;
         &amp;lt;TextSymbolizer&amp;gt;
           &amp;lt;Label&amp;gt;
             &amp;lt;ogc:PropertyName&amp;gt;CNTRY_NAME&amp;lt;/ogc:PropertyName&amp;gt;
           &amp;lt;/Label&amp;gt;
           &amp;lt;Font&amp;gt;
             &amp;lt;CssParameter name="font-family"&amp;gt;&amp;lt;ogc:Literal&amp;gt;Lucida Sans&amp;lt;/ogc:Literal&amp;gt;&amp;lt;/CssParameter&amp;gt;
             &amp;lt;CssParameter name="font-style"&amp;gt;&amp;lt;ogc:Literal&amp;gt;normal&amp;lt;/ogc:Literal&amp;gt;&amp;lt;/CssParameter&amp;gt;
             &amp;lt;CssParameter name="font-size"&amp;gt;&amp;lt;ogc:Literal&amp;gt;13.0&amp;lt;/ogc:Literal&amp;gt;&amp;lt;/CssParameter&amp;gt;
             &amp;lt;CssParameter name="font-weight"&amp;gt;&amp;lt;ogc:Literal&amp;gt;bold&amp;lt;/ogc:Literal&amp;gt;&amp;lt;/CssParameter&amp;gt;
           &amp;lt;/Font&amp;gt;
           &amp;lt;LabelPlacement&amp;gt;
             &amp;lt;PointPlacement&amp;gt;
               &amp;lt;AnchorPoint&amp;gt;
                 &amp;lt;AnchorPointX&amp;gt;0.5&amp;lt;/AnchorPointX&amp;gt;
                 &amp;lt;AnchorPointY&amp;gt;0.5&amp;lt;/AnchorPointY&amp;gt;
               &amp;lt;/AnchorPoint&amp;gt;
             &amp;lt;/PointPlacement&amp;gt;
           &amp;lt;/LabelPlacement&amp;gt;
           &amp;lt;Fill&amp;gt;
             &amp;lt;CssParameter name="fill"&amp;gt;#6666FF&amp;lt;/CssParameter&amp;gt;
           &amp;lt;/Fill&amp;gt;
         &amp;lt;/TextSymbolizer&amp;gt;
       &amp;lt;/Rule&amp;gt;
     &amp;lt;/FeatureTypeStyle&amp;gt;
   &amp;lt;/UserStyle&amp;gt;
 &amp;lt;/NamedLayer&amp;gt;
&amp;lt;/StyledLayerDescriptor&amp;gt;
&lt;/pre&gt;&lt;br /&gt;
&lt;span style="font-size: large;"&gt;&lt;b&gt;Layers&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
Add the following layers in the Layers menu.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Adding the country layers&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
Add the layers from the stores &lt;i&gt;gbif:country_borders&lt;/i&gt;, &lt;i&gt;gbif:country_fill&lt;/i&gt; and &lt;i&gt;gbif:country_names&lt;/i&gt;, naming them &lt;i&gt;country_borders&lt;/i&gt;, &lt;i&gt;country_fill&lt;/i&gt; and &lt;i&gt;country_names&lt;/i&gt; respectively, with the following parameters:&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Data tab&lt;/b&gt;&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Coordinate Reference Systems: The Native SRS is &lt;i&gt;UNKNOWN &lt;/i&gt;and the Declared SRS is &lt;i&gt;EPSG:4326&lt;/i&gt;.&lt;/li&gt;
&lt;li&gt;Bounding Boxes: In both Native Bounding Box and Lat/Lon Bounding Box, set the following:&lt;/li&gt;
&lt;/ul&gt;&lt;table style="margin-left: auto; margin-right: auto; text-align: left;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;Min X&lt;/td&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;Min Y&lt;/td&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;Max X&lt;/td&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;Max Y&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;-180&lt;/td&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;-90&lt;/td&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;180&lt;/td&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;83.623&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;
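For reference, the Data tab settings above roughly correspond to a feature type definition like the following, shown in the XML form used by GeoServer's REST configuration API. This is only a sketch for the &lt;i&gt;country_borders&lt;/i&gt; layer: it assumes a GeoServer 2.x catalog and assumes the "Force declared" SRS handling, neither of which is stated in this post, so compare it with what your own installation produces before relying on it.&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="brush:xml"&gt;&amp;lt;featureType&amp;gt;
  &amp;lt;!-- Sketch only: element names assume GeoServer 2.x --&amp;gt;
  &amp;lt;name&amp;gt;country_borders&amp;lt;/name&amp;gt;
  &amp;lt;!-- Declared SRS --&amp;gt;
  &amp;lt;srs&amp;gt;EPSG:4326&amp;lt;/srs&amp;gt;
  &amp;lt;!-- Assumption: the native SRS is unknown, so the declared one is forced --&amp;gt;
  &amp;lt;projectionPolicy&amp;gt;FORCE_DECLARED&amp;lt;/projectionPolicy&amp;gt;
  &amp;lt;nativeBoundingBox&amp;gt;
    &amp;lt;minx&amp;gt;-180&amp;lt;/minx&amp;gt;
    &amp;lt;maxx&amp;gt;180&amp;lt;/maxx&amp;gt;
    &amp;lt;miny&amp;gt;-90&amp;lt;/miny&amp;gt;
    &amp;lt;maxy&amp;gt;83.623&amp;lt;/maxy&amp;gt;
  &amp;lt;/nativeBoundingBox&amp;gt;
  &amp;lt;latLonBoundingBox&amp;gt;
    &amp;lt;minx&amp;gt;-180&amp;lt;/minx&amp;gt;
    &amp;lt;maxx&amp;gt;180&amp;lt;/maxx&amp;gt;
    &amp;lt;miny&amp;gt;-90&amp;lt;/miny&amp;gt;
    &amp;lt;maxy&amp;gt;83.623&amp;lt;/maxy&amp;gt;
    &amp;lt;crs&amp;gt;EPSG:4326&amp;lt;/crs&amp;gt;
  &amp;lt;/latLonBoundingBox&amp;gt;
&amp;lt;/featureType&amp;gt;
&lt;/pre&gt;&lt;br /&gt;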
&lt;b&gt;Publishing tab&lt;/b&gt;&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Default Style: Set the Default Style to the one matching the respective layer name.&lt;/li&gt;
&lt;/ul&gt;&lt;b&gt;Adding the tabDensityLayer layer&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
Add the layer &lt;i&gt;tabDensityLayer&lt;/i&gt; from the store &lt;i&gt;gbif:tab_density&lt;/i&gt;, with the following parameters:&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Data tab&lt;/b&gt;&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Coordinate Reference Systems: The Declared SRS is &lt;i&gt;EPSG:4326&lt;/i&gt;.&lt;/li&gt;
&lt;li&gt;Bounding Boxes: In both Native Bounding Box and Lat/Lon Bounding Box, set the following:&lt;/li&gt;
&lt;/ul&gt;&lt;table style="margin-left: auto; margin-right: auto; text-align: left;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;Min X&lt;/td&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;Min Y&lt;/td&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;Max X&lt;/td&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;Max Y&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;-180&lt;/td&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;-90&lt;/td&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;180&lt;/td&gt;&lt;td style="border: 1px solid rgb(170, 170, 170); padding: 5px;"&gt;90&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;
&lt;b&gt;Publishing tab&lt;/b&gt;&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Default Style: Set the Default Style to the one matching the respective layer name.&lt;/li&gt;
&lt;/ul&gt;&lt;b&gt;Test the layers&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
To test whether the country layers are working, open GeoWebCache (GWC), follow the link &lt;i&gt;"A list of all the layers and automatic demos"&lt;/i&gt; and check the behavior of the country layers; each should render what its name suggests.&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-754437864856581275?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/754437864856581275/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2010/05/how-to-configure-geoserver.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/754437864856581275'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/754437864856581275'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2010/05/how-to-configure-geoserver.html' title='How To Configure GeoServer For The GBIF Data Portal'/><author><name>Héctor Tobón</name><uri>http://www.blogger.com/profile/01325691279912891308</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_-hRtiexeGnY/Srvd98tVQNI/AAAAAAAAARM/BM0kBUanQn0/S220/Copia+de+Copia+de+P1060417.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-7336675637008476826</id><published>2009-10-05T16:23:00.004+02:00</published><updated>2009-10-05T16:32:37.941+02:00</updated><title type='text'>Struts 2.1.6 and GUICE</title><content type='html'>&lt;a href="http://code.google.com/p/google-guice/"&gt;Google GUICE&lt;/a&gt; is a great lightweight dependency injection framework that comes with a plugin for struts2. 
Using Guice 2.0 and &lt;a href="http://struts.apache.org/2.1.6/"&gt;Struts 2.1.6&lt;/a&gt; on Tomcat 6, I ran into a dependency problem caused by the different xwork and ognl jars pulled in by Struts and by the struts2 plugin from Guice:

&lt;pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"&gt;&lt;code&gt;[INFO] +- org.apache.struts:struts2-core:jar:2.1.6:compile
[INFO] &amp;#124;  +- com.opensymphony:xwork:jar:2.1.2:compile
[INFO] &amp;#124;  &amp;#124;  \- org.springframework:spring-test:jar:2.5.6:test (scope managed from compile)
[INFO] &amp;#124;  +- opensymphony:ognl:jar:2.6.11:compile
[INFO] &amp;#124;  \- commons-fileupload:commons-fileupload:jar:1.2.1:compile
...
[INFO] +- com.google.inject.extensions:guice-struts2-plugin:jar:2.0:compile
[INFO] &amp;#124;  \- opensymphony:xwork:jar:2.0.0:compile
[INFO] &amp;#124;     \- ognl:ognl:jar:2.6.9:compile

&lt;/code&gt;&lt;/pre&gt;

This can easily be resolved by excluding the older XWork and OGNL libraries from the Guice plugin in your POM:

&lt;pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"&gt;&lt;code&gt;    &amp;lt;dependency&amp;gt;
      &amp;lt;groupId&amp;gt;com.google.inject.extensions&amp;lt;/groupId&amp;gt;
      &amp;lt;artifactId&amp;gt;guice-struts2-plugin&amp;lt;/artifactId&amp;gt;
      &amp;lt;version&amp;gt;2.0&amp;lt;/version&amp;gt;
      &amp;lt;exclusions&amp;gt;
        &amp;lt;exclusion&amp;gt;
          &amp;lt;groupId&amp;gt;opensymphony&amp;lt;/groupId&amp;gt;
          &amp;lt;artifactId&amp;gt;xwork&amp;lt;/artifactId&amp;gt;
        &amp;lt;/exclusion&amp;gt;
        &amp;lt;exclusion&amp;gt;
          &amp;lt;groupId&amp;gt;ognl&amp;lt;/groupId&amp;gt;
          &amp;lt;artifactId&amp;gt;ognl&amp;lt;/artifactId&amp;gt;
        &amp;lt;/exclusion&amp;gt;
      &amp;lt;/exclusions&amp;gt;
    &amp;lt;/dependency&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-7336675637008476826?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/7336675637008476826/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2009/10/struts-216-and-guice.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/7336675637008476826'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/7336675637008476826'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2009/10/struts-216-and-guice.html' title='Struts 2.1.6 and GUICE'/><author><name>Markus Döring</name><uri>https://profiles.google.com/114975314573163797395</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-3859571145431959770</id><published>2009-07-17T23:37:00.006+02:00</published><updated>2010-02-24T18:22:55.266+01:00</updated><title type='text'>Darwin Core Archive Reader part1</title><content type='html'>After writing software that produces Darwin Core Archives (DwC-A), I thought it is time to introduce a little client library that can read DwC-A and makes it very simple to consume it. To create DwC-A there are a couple of resources available:
&lt;ol&gt;
&lt;li&gt; &lt;a href="http://rs.tdwg.org/dwc/terms/guides/text/index.htm"&gt; The Darwin Core Text Guidelines&lt;/a&gt;, pretty much the specification&lt;/li&gt;
&lt;li&gt;The &lt;a href="http://code.google.com/p/gbif-ecat/wiki/ChecklistFormat"&gt;ECAT Checklist Format&lt;/a&gt; gives some best practices and focuses on how to encode taxonomic data as DwC-A&lt;/li&gt;
&lt;li&gt;&lt;a href="http://code.google.com/p/gbif-providertoolkit/wiki/UserManualChapter1"&gt;The IPT&lt;/a&gt; produces DwC-A and has a rich web interface but is limited to ~1 million records&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ipt-lite.gbif.org"&gt;The IPT-lite&lt;/a&gt; produces large DwC-A datasets very fast, but lacks any visualisation of the data itself. It's a great way of quickly creating DwC-A and hosting it online&lt;/li&gt;
&lt;/ol&gt;

The DwC Archive reader library is under active development and currently being used for indexing checklists.
The project is hosted as part of the &lt;a href="http://code.google.com/p/gbif-indexingtoolkit/source/browse/trunk/dwc-archive"&gt;GBIF Indexing and Harvesting Toolkit&lt;/a&gt;, where a jar with all dependencies is hosted for &lt;a href="http://code.google.com/p/gbif-indexingtoolkit/downloads/list"&gt;download&lt;/a&gt; too.

You can use this jar in the terminal to inspect archives like this:
&lt;pre&gt;
$ java -jar DwcA-reader-1.0-SNAPSHOT.jar hershkovitz.txt

Opening archive: /Users/markus/Desktop/hershkovitz/hershkovitz.txt
Core file(s) found: [/Users/markus/Desktop/hershkovitz/hershkovitz.txt]
Core row type:
Core identifier column: 0
Cannot locate term dwc:kingdom
Cannot locate term dwc:family
Number of extensions 0
Archive contains 3249 core records.
&lt;/pre&gt;

The reader can handle a simple CSV or tab-delimited file, or you can point it to a DwC archive folder with a meta.xml descriptor and several data files. The iterator(s) let you walk the core records of the archive while conveniently retrieving all extension records at the same time. You can also tell it to show some core records by passing a limit and offset on the command line like this:

&lt;pre&gt;
java -jar DwcA-reader-1.0-SNAPSHOT.jar hershkovitz 10 25
&lt;/pre&gt;
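
A DwC archive folder of the kind mentioned above is described by its meta.xml. As a rough sketch following the Darwin Core Text Guidelines, a descriptor for a checklist with one extension might look like the following. The file names taxa.txt and vernacular.txt and the column order are made up for illustration, and the exact attributes accepted may differ between versions of the guidelines:

&lt;pre class="brush:xml"&gt;
&amp;lt;archive xmlns="http://rs.tdwg.org/dwc/text/"&amp;gt;
  &amp;lt;!-- core data file: one row per taxon, the first column holds the record id --&amp;gt;
  &amp;lt;core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Taxon"&amp;gt;
    &amp;lt;files&amp;gt;
      &amp;lt;!-- hypothetical file name --&amp;gt;
      &amp;lt;location&amp;gt;taxa.txt&amp;lt;/location&amp;gt;
    &amp;lt;/files&amp;gt;
    &amp;lt;id index="0"/&amp;gt;
    &amp;lt;field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/&amp;gt;
    &amp;lt;field index="2" term="http://rs.tdwg.org/dwc/terms/kingdom"/&amp;gt;
    &amp;lt;field index="3" term="http://rs.tdwg.org/dwc/terms/family"/&amp;gt;
  &amp;lt;/core&amp;gt;
  &amp;lt;!-- extension file: each row points back to a core record through coreid --&amp;gt;
  &amp;lt;extension encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" ignoreHeaderLines="1" rowType="http://rs.gbif.org/terms/1.0/VernacularName"&amp;gt;
    &amp;lt;files&amp;gt;
      &amp;lt;!-- hypothetical file name --&amp;gt;
      &amp;lt;location&amp;gt;vernacular.txt&amp;lt;/location&amp;gt;
    &amp;lt;/files&amp;gt;
    &amp;lt;coreid index="0"/&amp;gt;
    &amp;lt;field index="1" term="http://rs.tdwg.org/dwc/terms/vernacularName"/&amp;gt;
    &amp;lt;field index="2" term="http://purl.org/dc/terms/language"/&amp;gt;
  &amp;lt;/extension&amp;gt;
&amp;lt;/archive&amp;gt;
&lt;/pre&gt;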

A simple example of how to use the reader in your own code is shown here:

&lt;sh params="brush: java"&gt;
&lt;pre class="brush:java"&gt;
package org.gbif.dwc.text;

import java.io.File;
import java.io.IOException;
import java.util.Iterator;

import org.gbif.dwc.model.DarwinCoreRecord;
import org.gbif.dwc.model.ExtensionRecord;
import org.gbif.dwc.model.StarRecord;
import org.gbif.dwc.terms.DwcTerm;

public class UsageExample {

 public static void main(String[] args) throws IOException, UnsupportedArchiveException {
  // opens csv files with headers or dwc-a directories with a meta.xml descriptor
  Archive arch = ArchiveFactory.openArchive(new File("pontaurus.txt"));

  // does scientific name exist?
  if (!arch.getCore().hasTerm(DwcTerm.scientificName)){
   System.out.println("This application requires dwc-a with scientific names");
   System.exit(1);
  }

  // loop over core darwin core records
  Iterator&amp;lt;DarwinCoreRecord&amp;gt; iter = arch.iteratorDwc();
  DarwinCoreRecord dwc;
  while(iter.hasNext()){
   dwc = iter.next();
   System.out.println(dwc);
  }

  // loop over star records. i.e. core with all linked extension records
  for (StarRecord rec : arch){
   // print core ID + scientific name
   System.out.println(rec.id()+" - "+rec.value(DwcTerm.scientificName));
   for (ExtensionRecord erec : rec){
    // does this extension have Long/lat?
    if (erec.dataFile().hasTerm(DwcTerm.decimalLongitude) &amp;amp;&amp;amp; erec.dataFile().hasTerm(DwcTerm.decimalLatitude)){
     System.out.println("Georeferenced: " + erec.value(DwcTerm.decimalLongitude)+","+erec.value(DwcTerm.decimalLatitude));
    }
   }
  }
 }
}
&lt;/pre&gt;
&lt;/sh&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-3859571145431959770?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/3859571145431959770/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2009/07/darwin-core-archive-reader-part1.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3859571145431959770'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/3859571145431959770'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2009/07/darwin-core-archive-reader-part1.html' title='Darwin Core Archive Reader part1'/><author><name>Markus Döring</name><uri>https://profiles.google.com/114975314573163797395</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-600971155805471316</id><published>2009-06-23T15:15:00.017+02:00</published><updated>2009-07-02T14:35:43.781+02:00</updated><title type='text'>Profiling memory usage of various String collections</title><content type='html'>&lt;p&gt;
Wanting to know the memory footprint and performance of different Java options for keeping simple string lookups in memory,
I profiled different java.util collection classes filled with the same list of strings to see how much their memory usage differs. I then loaded the same data into two Lucene in-memory indices using Lucene's RAMDirectory. Finally, I also evaluated an embedded H2 database, both file-based and in-memory.

The data loaded were 1,573,345 scientific name strings, the longest being about 150 characters. The original uncompressed text file is 31.6MB (8.1MB zipped). To also test ID lookup for java.util.Map and the Lucene KVP index, the row number of each name was used as its ID.

The machine I used for testing was a Mac Pro (8-core 3GHz, 5GB RAM) running 64-bit Mac OS X and Java 6 with 2GB of memory (-Xmx2g). Here are the shortened results, obtained using System.currentTimeMillis() and JProfiler inspecting deep object copies in heap dumps (a seriously memory-intensive thing in itself, which in some cases like Lucene and H2 even crashed):
&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
Text file: 31.6 MB
Zipped: 8.1 MB

java.util.HashSet&amp;lt;String&amp;gt;
# 264MB
# contains test of 10.000 x 12 terms took 28 msecs
# uses HashMap internally...

java.util.TreeSet&amp;lt;String&amp;gt;
# 256MB
# contains test of 10.000 x 12 terms took 98 msecs
# uses TreeMap internally...

java.util.HashMap&amp;lt;String,Integer&amp;gt;
# 300MB
# contains (key) test of 10.000 x 12 terms took 21 msecs
# includes Integer values as opposed to above Set

java.util.TreeMap&amp;lt;String,Integer&amp;gt;
# 292MB
# contains (key) test of 10.000 x 12 terms took 93 msecs
# includes Integer values as opposed to above Set

java.util.ArrayList&amp;lt;String&amp;gt;
# 172MB
# contains test of 100 x 12 terms took 43257 msecs

java.util.LinkedList&amp;lt;String&amp;gt;
# 220MB
# contains test of 100 x 12 terms took 50564 msecs

String[] array
# 172MB


org.apache.commons.collections.map.HashedMap
# 384 MB
# contains (key) test of 10000 x 12 terms took 23 msecs
# no generics support !


javolution.util.FastList
# 220 MB
# contains test of 100 x 12 terms took 58886 msecs

javolution.util.FastMap
# 396 MB
# contains (key) test of 10000 x 12 terms took 18 msecs

javolution.util.FastSet
# 276MB
# contains test of 10000 x 12 terms took 10 msecs

javolution.util.FastTable
# 172MB
# contains test of 100 x 12 terms took 50961 msecs


gnu.trove.THashMap
# 331MB
# contains (key) test of 10000 x 12 terms took 65 msecs

gnu.trove.THashSet
# 185 MB
# contains test of 10000 x 12 terms took 47 msecs


com.google.common.collect.ImmutableSet
# 331MB
# contains test of 10000 x 12 terms took 19 msecs

com.google.common.collect.ImmutableMap
# 404MB
# contains (key) test of 10000 x 12 terms took 32 msecs


Lucene KVP index
# 94MB
# lucene 100 x 12 TermQueries took 58 msecs
# lucene 10000 x 12 TermQueries took 590 msecs
# in memory Index building a key value index with each term being a document and storing the value as a document field
# overhead of using an IndexSearcher to just the RAMDirectory is minimal
# 1573345 records loaded into Lucene KVP index in 13453 msecs

Lucene term index
# 44MB
# lucene 100 x 12 TermQueries took 32 msecs
# lucene 10000 x 12 TermQueries took 369 msecs
# in memory Index storing only the pure term index
# 1573345 records loaded into simple Lucene term index in 10222 msecs

H2 file based
# 33MB (connection object, not during querying)
# sql equals test of 10000 x 12 terms took 634 msecs
# file based db with 1 table, 1 indexed varchar(255) column
# 1573345 records loaded into H2 file database in 245025 msecs

H2 in memory
# ???MB (JProfiler requires &amp;gt;3gig memory)
# sql equals test of 10000 x 12 terms took 1024 msecs
# in memory db with 1 table, 1 indexed varchar(255) column
# 1573345 records loaded into H2 in memory database in 18760 msecs

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;
Lucene came out with the least memory footprint of only 44MB, gnu.trove does best in terms of memory for classic sets, while javolution outperformes anyone else on speed. But their footprint is even slightly larger than the java.utilHashSet.

The performance of H2 is pretty similar to that of the Lucene KVP index. Building the H2 file db took more than 10 times longer than building the in-memory one, so for regular updates and writes the in-memory version offers a lot more performance, but reads are a surprise. The file-based version was about 1.6 times as fast as the in-memory one (634 vs 1024 msecs)!

gnu.trove contains a lot of specialised classes that hold primitives as keys in sets or maps, for example. If one is using int or long keys, these should give a much smaller footprint, but I haven't tested this as I am currently interested in Strings.
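
To illustrate why primitive keys should help, here is a minimal sketch using only the standard library. It is not Trove's API, and the class name PrimitiveLongSet is hypothetical. It keeps long keys unboxed in one sorted array (8 bytes per key) and answers contains with a binary search, whereas a java.util.HashSet of Long needs a boxed Long object plus a hash map entry for every key.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a primitive-keyed set; this is NOT Trove's API. */
public class PrimitiveLongSet {
    // Keys are stored unboxed in one sorted array: no per-entry objects.
    private final long[] sortedKeys;

    public PrimitiveLongSet(long[] keys) {
        // Drop duplicates and sort once so that lookups can use binary search.
        this.sortedKeys = Arrays.stream(keys).distinct().sorted().toArray();
    }

    /** Returns true if the key is present (binary search over the sorted keys). */
    public boolean contains(long key) {
        return Arrays.binarySearch(sortedKeys, key) >= 0;
    }

    /** Number of distinct keys held. */
    public int size() {
        return sortedKeys.length;
    }

    public static void main(String[] args) {
        PrimitiveLongSet ids = new PrimitiveLongSet(new long[] {42L, 7L, 1573345L, 7L});
        System.out.println(ids.contains(7L)); // true
        System.out.println(ids.contains(8L)); // false
        System.out.println(ids.size());       // 3
    }
}
```

Lookups in this sketch are O(log n) rather than the O(1) of a hash set, so it only illustrates the memory side of the trade-off, not the lookup speeds measured above.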

For those interested, &lt;a href="http://code.google.com/p/gbif-ecat/source/browse/trunk/lookup-benchmarks/src/main/java/org/gbif/ecat/benchmark/ProfileKvpStores.java"&gt;here is the source code&lt;/a&gt; that created the objects and did the time measurements (most tests are commented out in this particular revision).
&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-600971155805471316?l=gbif.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gbif.blogspot.com/feeds/600971155805471316/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://gbif.blogspot.com/2009/06/profiling-memory-usage-of-various.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/600971155805471316'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2326624813533383062/posts/default/600971155805471316'/><link rel='alternate' type='text/html' href='http://gbif.blogspot.com/2009/06/profiling-memory-usage-of-various.html' title='Profiling memory usage of various String collections'/><author><name>Markus Döring</name><uri>https://profiles.google.com/114975314573163797395</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2326624813533383062.post-3122517240828619364</id><published>2009-05-12T16:24:00.027+02:00</published><updated>2010-05-07T14:48:45.607+02:00</updated><title type='text'>Deploying the portal web application</title><content type='html'>&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;For building the web application&lt;/span&gt;
&lt;p&gt;
The steps for building and deploying the portal web application are as follows:
&lt;p&gt;
1) Download the source code at:
&lt;a href="http://code.google.com/p/gbif-dataportal/source/checkout"&gt;http://code.google.com/p/gbif-dataportal/source/checkout&lt;/a&gt;


The modules needed are:
&lt;ul&gt;&lt;li&gt;portal-core&lt;/li&gt;&lt;li&gt;portal-index&lt;/li&gt;&lt;li&gt;portal-service&lt;/li&gt;&lt;li&gt;portal-web&lt;/li&gt;&lt;/ul&gt;For instructions on how to check out these modules from SVN to your machine, please see &lt;a href="http://code.google.com/p/gbif-dataportal/source/checkout"&gt;http://code.google.com/p/gbif-dataportal/source/checkout&lt;/a&gt;.
&lt;p&gt;
2) Once the modules have been saved to your machine, you need to build them. There is a script in the &lt;span style="font-weight: bold;"&gt;portal-web &lt;/span&gt;module that automatically builds the whole project and downloads all the dependencies (libraries) from the repositories.
&lt;p&gt;
Script location: portal-web/first-build-all.sh&lt;div&gt;
&lt;p&gt;
&lt;/div&gt;&lt;div&gt;
&lt;/div&gt;&lt;div&gt;
&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;For building the database&lt;/span&gt;&lt;/div&gt;&lt;div&gt;
&lt;/div&gt;&lt;div&gt;&lt;p&gt;1) In the &lt;span class="Apple-style-span" style="font-weight: bold;"&gt;portal-core &lt;/span&gt;project, there is a file at &lt;span class="Apple-style-span" style="font-style: italic;"&gt;db/portal.ddl&lt;/span&gt; that builds the initial structure of the portal's index DB. &lt;/div&gt;
&lt;pre class="brush:xml"&gt;
mysql&gt; create database portal;
&lt;/pre&gt;
&lt;pre class="brush:xml"&gt;
mysql -u [username] -p [database] &lt; /PATH_TO_FILE/portal.ddl;
&lt;/pre&gt;
&lt;p&gt;
&lt;div&gt;2) To populate the database with the minimum data required, load the file at &lt;span class="Apple-style-span" style="font-style: italic;"&gt;db/initPortal.data&lt;/span&gt;:
&lt;p&gt;
&lt;pre class="brush:xml"&gt;
mysql -u [username] -p [database] &lt; /PATH_TO_FILE/initPortal.data;
&lt;/pre&gt;
&lt;/div&gt;&lt;div&gt;
&lt;blockquote&gt;&lt;/blockquote&gt;
&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;-------
All blog items represent the authors own ideas, and should not be considered GBIF or Institutional policy.&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2326624813533383062-3122517240828619364?l=gbif.blogspot.com' alt='' /&gt;&lt
