Friday 24 June 2011

Synchronizing occurrence records

This post should be read in the line of Tim’s post about Decoupling Components, as it takes for granted some information written there.

During the last week, I’ve been learning/working with some technologies that are related to the decoupling of components we want to accomplish.  Specifically, I’ve been working with the Synchronizer component of the event driven architecture Tim described.

Right now, the synchronizer takes the responses from the resources and gets those responses into the occurrence store (MySQL as of today, but not final). But it has more to it: The responses from the resources come typically from DiGIR, TAPIR and BioCASe providers which render their responses into XML format. So how does all this data ends up in the occurrence store? Well, fortunately my colleague Oliver Meyn wrote a very useful library to unmarshall all these XML chunks into  nice and simple objects, so on my side I just have to worry about calling all those getter methods. Also, the synchronizer acts as a listener to a message queue , queue that will store all the resource responses that need to be handled. All the queue’s nuts & bolts were worked out by Tim and Federico Méndez. So yes, it has been a nice collaboration from many developers inside the Secretariat and it’s always nice to have this kind of head start from your colleagues :)

So, getting back to my duties, I have to take all these objects and start populating the occurrence target store taking some precautions (e.g.: not inserting duplicated occurrence records, checking that some mandatory fields are not null and other requirements).

For now, it’s in development mode, but I have managed to make some tests and extract some metrics that show current performance and definitely leaves room for improvement. For the tests, first the message queue is loaded with some responses that need to be attended and afterwards I execute the synchronizer which starts populating the occurrence store. All these tests are done on my MacBook Pro, so definetely response times will improve on a better box. So here are the metrics:

  • MacBook Pro 2.4 GHz Core2 Duo (4GB Memory)
  • Mac OS X 10.5.8 (Leopard)
  • Message Queue & MySQL DB reside on different machines, but same intranet.
  • Threads: synchronizer spawns 5 threads to attend the queue elements.
  • Message queue: loaded with 552 responses (some responses are just empty to emulate a real world scenario).
  • Records in total: 70,326 occurrence records in total in all responses

Results Test 1 (without filtering out records):
  • Extracting responses from queue
  • Unmarshalling
  • Inserting into a MySQL DB
  • 202022 milliseconds (3 min, 22 secs)

Results Test 2 (filtering out records):
  • Extracting from queue
  • Unmarshalling
  • Filtering out records (duplicates, mandatory fields, etc)
  • Inserting into MySQL DB
  • over 30 minutes... (big FAIL)
So, as you see there is MUCH room for improvement. As I have just joined this project in particular, I need to start the long and tedious road of debugging why the huge difference, obviously the filtering out process needs huge improvement. Obvious solutions come to mind: increasing threads, improve memory consumption and other not so obvious solutions.  I will try to keep you readers posted about this, and hopefully some more inspiring metrics, and for sure in a better box.

I hope to communicate further improvements later, see you for now.


  1. For uniqueness checks, a technique I've found useful is to set up the schema so the database will do your checking for you. Instead of reading to see if there's an existing record, just write, and handle the exception if it turns out to be a duplicate.

    The other thing I'd say is that I think it's unlikely you'll be able to get the improvement you want from tweaking. To address a 10x difference in performance, architectural and/or algorithmic improvements are required.

  2. Thanks for the suggestion Mark. Today we found the origin of the problem that was causing such a bad performance. It happened to be some issues on how our synchronizing process was reading from the queue (Federico and Tim spotted the problem), and we might very likely be blogging about this solution, so you are right in writing that the improvement was not going to come from tweaking.

    I'll consider your suggestion for checking duplicates, but if you have any quick reference on this technique I will appreciate it very much.


  3. You would create a unique index in the database. Then just try to insert a record. If it goes in, all is well and you just saved yourself a read. If the uniqueness constraint triggers, you get the same result as if you'd checked for uniqueness before the insert.

  4. Hi Mark,
    Thanks for the suggestions.
    The issue with what you propose for us is actually the key would be a compound key across a MEDIUM INT UNSIGNED and 4 fields of VARCHAR(255). In the table of 300million + records this doesn't perform, and since the DB is UTF8 the key is too long for MyISAM to handle anyway as key lengths are limited in bytes.
    We end up having to create a compound index across a prefix of the VARCHAR(255) field with the highest cardinality, and the INT which performs much better. We can't really benefit from PARITIONING either as that would require the partition filed to join the PK and benchmarking inserts shows it slows significantly.
    The index and lookup work pretty fine, and on a single table can see up to 5k check or put per second.
    There were other issues (relating to the RabbitMQ queue) that existed when Jose wrote this post that clouded the issue - mainly that he was saturating a single client.

    Thanks for the pointers,

  5. If I understand correctly, using the INT + prefix of one VARCHAR field will result in some false positives. Is that right? Where that data matches, but the rest of the record is actually different. In which case, I wonder if indexing on a checksum might be good, in that it would let more index keys stay in memory at once.

  6. Hi Mark,
    It will have false positives from the INDEX but the rest of the criteria are also in the WHERE clause of course so the query returns accurate results. If there were a huge number of records for the given prefix the performance would drop way off, but something like 99.9% of the time it is enough to limit to a single record, and for the other cases it will return a handful at most.
    A checksum would probably allow us to index more than the prefix, so while it still suffers from false positives, would probably allow more to be indexed in a smaller index, with the overhead of the checksum being stored though.
    This is a legacy system we are patching, and in the planned work we expect to use a column oriented store, with no locking, and fast PK lookup. HBase is the leading candidate for this, since it fits so well into the rest of our processing architecture, which is Oozie, Hive, Pig and Hadoop based. I don't think we will likely change the MySQL schema further (especially now that it is giving the throughput we needed), but these ideas are all relevant for future work too. Oliver's latest post on key design for example is very close to this kind of thing.