Tuesday 27 September 2011

Portal v2 - There will be cake

The current GBIF data portal was launched in 2007 to provide access to the network's biodiversity data - at the time that meant a federated search across 220 providers and 76 million occurrence records. While that approach has served us well over the years, many features requested for the portal simply aren't addressable in the current architecture. Combined with the fact that we're now well over 300 million occurrence records, with millions of taxonomic records to boot, it became clear that a new portal was needed. After a long consultation process with the wider community, the initial requirements for the new portal have been determined, and I'm pleased to report that work has officially started on its design and development.

For the last six months or so the development team has been working on improving our rollover process, registry improvements, IPT development, and other disparate tasks. The new portal marks an important milestone for the team, as we're now all working on the portal with as little distraction from other projects as we can manage. Obviously we're still fixing critical bugs, responding to data requests and so on, but having all of us focused on the same general task has already paid dividends in the conversations coming out of our daily scrums. Everyone being on the same page really does help.

And yes, we've been holding daily stand-up meetings that we call "scrums" for several months, but the new portal marks our first proper attempt at agile software development, including the proper use of scrum. Most of the team has some experience with parts of the agile toolkit, so we're combining the practices that have worked best for each of us into a system that fits our team. Obviously the ideal of interchangeable people, with no single expert in a given domain, is hard to reach when Tim, Markus, Kyle and Jose have worked on these things for so long and people like Lars, Federico and I are still relatively new (even though we're celebrating our one-year anniversaries at GBIF in the coming weeks!), but we're trying hard to pair non-experts with experts to share the knowledge.

In terms of managing the process, I (Oliver) am acting as Scrum Master and project lead. Andrea Hahn has worked hard at gathering our initial requirements, turning them into stories, and leading the wireframing of the new portal. As such she'll act as a stakeholder in the project and help us set priorities. As the underlying infrastructure gets built and the process continues I'm sure we'll involve more people in prioritization, but for now our plates are certainly full with "plumbing". At Tim's suggestion we're using Basecamp to manage our backlog, active stories, and sprints, following the example from these guys. Our first kickoff revealed some weaknesses in mapping Basecamp to agile, and the lack of a physical storyboard makes it hard to see the big picture, but we'll start with this and re-evaluate in a little while - it's more important to get the process started and learn our actual needs than to compare tools in some abstract evaluation exercise. Once we've ironed out the process and settled on our tools, we'll also make them more visible to the outside world.

We're only now coming up on the end of our first two-week sprint, so it will take a few more iterations to really get into the flow. So far so good, though, and I'll report back on our experience in a future post.

(If you didn't get it, apologies for the cake reference)

Thursday 15 September 2011

VertNet and the GBIF Integrated Publishing Toolkit

(A guest post from our friends at VertNet, cross-posted from the VertNet blog)

This week we’d like to discuss the current and future roles of the GBIF Integrated Publishing Toolkit (IPT) in VertNet. IPT is a Java-based web application that allows a user to publish and share biodiversity data sets from a server. Here are some of the things IPT can do:


  1. Create Darwin Core Archives. In our post about data publishing last week, we wrote about Darwin Core being the “language of choice” for VertNet. IPT allows publishers to create Darwin Core data records from either files or databases and to export them in zipped archive files that contain exactly what is needed by VertNet for uploading.
  2. Make data available for efficient indexing by GBIF. VertNet has an agreement with its data publishers that, by participating, they will also publish data through GBIF. GBIF keeps our registry of data providers and uses this registry to find and update data periodically from the original sources to make it available through the GBIF data portal. IPT gives data publishers an easy means of keeping their data up-to-date with GBIF.
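
To make the first point concrete: a Darwin Core Archive is just a zip file bundling a data table with a meta.xml descriptor that maps each column to a Darwin Core term. Here is a minimal sketch in Python that builds one; the record values are invented for illustration, and the file and term names follow the Darwin Core text conventions:

```python
# Build a minimal Darwin Core Archive: a zip holding a tab-delimited
# occurrence table plus the meta.xml that describes its columns.
import zipfile

META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
  </core>
</archive>
"""

# One header row plus one invented occurrence record.
DATA = ("occurrenceID\tscientificName\tbasisOfRecord\n"
        "urn:example:1\tPuma concolor\tPreservedSpecimen\n")

with zipfile.ZipFile("dwca.zip", "w") as zf:
    zf.writestr("meta.xml", META_XML)
    zf.writestr("occurrence.txt", DATA)
```

The resulting dwca.zip is the kind of self-describing package the IPT produces and GBIF's indexing can consume.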

IPT can help with the data publishing process in other ways as well:

  • standardizing terms
  • validating records before they get published
  • adding default values for fields that aren’t in the original data
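
To illustrate the last two points, here is a hypothetical sketch of how a publishing tool might fill in a publisher-wide default value and reject records missing required terms. This is not the IPT's actual implementation, just the idea in a few lines; the default value and required-term list are assumptions:

```python
# Assumed publisher-wide default for a field absent from the source data.
DEFAULTS = {"basisOfRecord": "PreservedSpecimen"}
# Terms every record must carry to be publishable (illustrative choice).
REQUIRED = ("occurrenceID", "scientificName")

def prepare(record):
    """Return a publishable copy of the record, or None if it fails validation."""
    out = {**DEFAULTS, **record}            # source values win over defaults
    if not all(out.get(term) for term in REQUIRED):
        return None                         # validation: required terms present
    return out

ok = prepare({"occurrenceID": "1", "scientificName": "Puma concolor"})
bad = prepare({"occurrenceID": "2"})        # no scientificName -> rejected
```

Here `ok` comes back with basisOfRecord filled in from the default, while `bad` is None.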

To get a better understanding of the capabilities, take a look at the IPT User Manual.

Why are we using IPT?

VertNet has a long waiting list of organizations (65 to date) that have expressed interest in making their data publicly accessible through VertNet. In the past, these institutions would have needed their own server and specialized software (DiGIR) to publish to the separate vertebrate networks. We'd rather not require participants to buy servers if they don't have to. As an interim solution while we build VertNet, we're using the IPT to make data available online. We have installed an IPT at the University of Kansas Biodiversity Institute that can act as a host for as many collections as are interested. The service is shared, yet organizations maintain their own identity and data securely within the hosted IPT. This is a big win for us at VertNet: there are fewer servers to maintain, and we can get more collections involved more quickly.

Going forward…

Well before completion, VertNet will support simple and sustainable publishing by uploading records from text files in Simple Darwin Core form. Because of this, the IPT will not be a required component of data publishing for VertNet. Rather, we see IPT as a great tool to facilitate the creation of Darwin Core Archives, which we will be able to use to upload data to VertNet.
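
A Simple Darwin Core text file is just a delimited table whose column headers are Darwin Core term names - no meta.xml required. As a sketch of what such an upload file could look like (the records and column choice are invented for illustration):

```python
# Parse a Simple Darwin Core text file: plain CSV with Darwin Core
# term names as headers. Contents here are illustrative only.
import csv
import io

SIMPLE_DWC = """occurrenceID,scientificName,country
urn:example:1,Puma concolor,United States
urn:example:2,Lynx rufus,Mexico
"""

# Each row becomes a dict keyed by Darwin Core term name.
records = list(csv.DictReader(io.StringIO(SIMPLE_DWC)))
```

Parsing yields two records, each a mapping from term name to value, ready for upload.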

Interested in publishing now with IPT?

We currently have two institutions sharing their collections with VertNet and GBIF through the VertNet IPT and we’re in the process of working with several more.

So, if you are or would like to be a vertebrate data publisher and would like to make your data accessible as Darwin Core Archives sooner rather than later, VertNet’s IPT might be the solution for you! Learn more about the process on the VertNet web site or email Laura Russell and Dave Bloom.

Posted by Laura Russell, VertNet Programmer; John Wieczorek, Information Architect; and Aaron Steele, Information Architect