This post should be read as a follow-up to Tim’s post about Decoupling Components, as it takes for granted some of the information written there.
During the last week, I’ve been learning and working with some of the technologies related to the decoupling of components we want to accomplish. Specifically, I’ve been working on the Synchronizer component of the event-driven architecture Tim described.
Right now, the synchronizer takes the responses from the resources and gets them into the occurrence store (MySQL as of today, but that is not final). But there is more to it: the responses typically come from DiGIR, TAPIR and BioCASe providers, which render their responses as XML. So how does all this data end up in the occurrence store? Well, fortunately my colleague Oliver Meyn wrote a very useful library to unmarshal all these XML chunks into nice and simple objects, so on my side I just have to worry about calling all those getter methods. The synchronizer also acts as a listener on a message queue that stores all the resource responses that need to be handled. All the queue’s nuts and bolts were worked out by Tim and Federico Méndez. So yes, it has been a nice collaboration among several developers inside the Secretariat, and it’s always nice to get this kind of head start from your colleagues :)
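A very rough sketch of what the consumer side looks like conceptually is below; the class, field and queue types are made up for illustration (the real unmarshalling library Oliver wrote and the real queue client are different):

```java
import java.io.StringReader;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

// Illustrative sketch only: the real unmarshalling library and queue client differ.
public class SynchronizerSketch {

  // Stand-in for the simple objects produced from the provider XML.
  @XmlRootElement(name = "record")
  public static class OccurrenceRecord {
    @XmlElement public String institutionCode;
    @XmlElement public String catalogNumber;
    @XmlElement public String scientificName;
  }

  private final BlockingQueue<String> responseQueue; // stand-in for the message queue
  private final ExecutorService workers = Executors.newFixedThreadPool(5); // 5 consumer threads

  public SynchronizerSketch(BlockingQueue<String> responseQueue) {
    this.responseQueue = responseQueue;
  }

  public void start() throws Exception {
    final JAXBContext ctx = JAXBContext.newInstance(OccurrenceRecord.class); // thread-safe, share it
    for (int i = 0; i < 5; i++) {
      workers.submit(new Runnable() {
        public void run() {
          try {
            Unmarshaller unmarshaller = ctx.createUnmarshaller(); // not thread-safe, one per thread
            while (!Thread.currentThread().isInterrupted()) {
              String xml = responseQueue.take(); // block until a response arrives
              OccurrenceRecord record =
                  (OccurrenceRecord) unmarshaller.unmarshal(new StringReader(xml));
              // hand the object off to the occurrence-store writer
              // (duplicate / mandatory-field checks happen there)
            }
          } catch (Exception e) {
            // log and let the thread die; a real consumer would recover and keep consuming
          }
        }
      });
    }
  }
}
```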
So, getting back to my duties, I have to take all these objects and start populating the occurrence target store, taking some precautions along the way (e.g. not inserting duplicate occurrence records, checking that certain mandatory fields are not null, and a few other requirements).
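A minimal sketch of these precautions, assuming a made-up uniqueness key over institution code, collection code and catalogue number (the real checks differ):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the precautions before inserting into the occurrence
// store; the fields used for the "uniqueness key" are illustrative only.
public final class OccurrenceFilter {

  private final Set<String> seenKeys = new HashSet<String>();

  /** Returns true when a record with these fields may be inserted. */
  public boolean accepts(String institutionCode, String collectionCode,
                         String catalogNumber, String scientificName) {
    // mandatory fields must not be null
    if (institutionCode == null || collectionCode == null
        || catalogNumber == null || scientificName == null) {
      return false;
    }
    // skip duplicates already seen in this run
    String key = institutionCode + '|' + collectionCode + '|' + catalogNumber;
    return seenKeys.add(key); // add() returns false if the key was already present
  }
}
```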
For now it’s in development mode, but I have managed to run some tests and extract some metrics that show the current performance, which definitely leaves room for improvement. For the tests, the message queue is first loaded with some responses that need to be attended to, and afterwards I execute the synchronizer, which starts populating the occurrence store. All these tests were run on my MacBook Pro, so response times will definitely improve on a better box. So here are the metrics:
Environment:
- MacBook Pro 2.4 GHz Core2 Duo (4GB Memory)
- Mac OS X 10.5.8 (Leopard)
- Message Queue & MySQL DB reside on different machines, but same intranet.
- Threads: the synchronizer spawns 5 threads to attend to the queue elements.
- Message queue: loaded with 552 responses (some responses are just empty to emulate a real world scenario).
- Records: 70,326 occurrence records in total across all responses
Test 1:
- Extracting responses from the queue
- Unmarshalling
- Inserting into a MySQL DB
- Time: 202,022 milliseconds (3 min, 22 secs)

Test 2:
- Extracting from the queue
- Unmarshalling
- Filtering out records (duplicates, missing mandatory fields, etc.)
- Inserting into the MySQL DB
- Time: over 30 minutes... (big FAIL)
I hope to report further improvements later. That’s all for now.
For uniqueness checks, a technique I've found useful is to set up the schema so the database will do your checking for you. Instead of reading to see if there's an existing record, just write, and handle the exception if it turns out to be a duplicate.
The other thing I'd say is that I think it's unlikely you'll be able to get the improvement you want from tweaking. To address a 10x difference in performance, architectural and/or algorithmic improvements are required.
Thanks for the suggestion, Mark. Today we found the origin of the problem that was causing such bad performance. It turned out to be some issues with how our synchronizing process was reading from the queue (Federico and Tim spotted the problem), and we will very likely be blogging about the solution, so you are right that the improvement was not going to come from tweaking.
I'll consider your suggestion for checking duplicates, but if you have any quick reference on this technique I would appreciate it very much.
thanks.
You would create a unique index in the database. Then just try to insert a record. If it goes in, all is well and you just saved yourself a read. If the uniqueness constraint triggers, you get the same result as if you'd checked for uniqueness before the insert.
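A minimal JDBC sketch of the idea, with made-up table and column names:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

// Sketch of the "just insert and let the unique index complain" technique;
// the table and column names are invented for the example.
public class InsertIfAbsent {

  /** Returns true if the row was inserted, false if it already existed. */
  public static boolean insert(Connection conn, String institutionCode,
                               String catalogNumber) throws Exception {
    // Assumes the table has a unique index, e.g.
    //   CREATE UNIQUE INDEX uq_occurrence ON occurrence (institution_code, catalog_number)
    String sql = "INSERT INTO occurrence (institution_code, catalog_number) VALUES (?, ?)";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
      ps.setString(1, institutionCode);
      ps.setString(2, catalogNumber);
      ps.executeUpdate();
      return true;                 // inserted, no prior read needed
    } catch (SQLIntegrityConstraintViolationException dup) {
      return false;                // the unique index rejected a duplicate
    }
  }
}
```

Depending on the driver you may need to catch a vendor-specific subclass of SQLException instead, but the principle is the same.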
Hi Mark,
Thanks for the suggestions.
The issue with what you propose is that for us the key would be a compound key across a MEDIUMINT UNSIGNED and 4 VARCHAR(255) fields. In a table of 300 million+ records this doesn't perform, and since the DB is UTF-8 the key is too long for MyISAM to handle anyway, as key lengths are limited in bytes.
We end up having to create a compound index across a prefix of the VARCHAR(255) field with the highest cardinality and the INT, which performs much better. We can't really benefit from PARTITIONING either, as that would require the partition field to join the PK, and benchmarking inserts shows it slows things down significantly.
The index and lookup work pretty well, and on a single table we can see up to 5k checks or puts per second.
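To make that concrete, here is a rough sketch of this kind of index and lookup; the table, column names and the 20-character prefix are invented for the example, not our actual schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Rough illustration of a prefix-based compound index and the lookup that uses it.
public class PrefixIndexSketch {

  // Compound index on the INT plus a prefix of the highest-cardinality VARCHAR,
  // keeping the key short enough for MyISAM's byte limit under UTF-8.
  static void createIndex(Connection conn) throws Exception {
    try (Statement st = conn.createStatement()) {
      st.executeUpdate(
          "CREATE INDEX idx_resource_catalog ON occurrence (resource_id, catalog_number(20))");
    }
  }

  // The index only narrows the candidates; the full WHERE clause still compares
  // every field, so false positives from the prefix never reach the caller.
  static boolean exists(Connection conn, int resourceId, String institutionCode,
                        String collectionCode, String catalogNumber) throws Exception {
    String sql = "SELECT 1 FROM occurrence"
        + " WHERE resource_id = ? AND catalog_number = ?"
        + " AND institution_code = ? AND collection_code = ? LIMIT 1";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
      ps.setInt(1, resourceId);
      ps.setString(2, catalogNumber);
      ps.setString(3, institutionCode);
      ps.setString(4, collectionCode);
      try (ResultSet rs = ps.executeQuery()) {
        return rs.next();
      }
    }
  }
}
```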
There were other issues (relating to the RabbitMQ queue) at the time Jose wrote this post that clouded the picture - mainly that he was saturating a single client.
Thanks for the pointers,
Tim
If I understand correctly, using the INT plus a prefix of one VARCHAR field will result in some false positives, where that data matches but the rest of the record is actually different. Is that right? In that case, I wonder if indexing on a checksum might be good, in that it would let more index keys stay in memory at once.
Hi Mark,
It will have false positives from the index, but the rest of the criteria are also in the WHERE clause of course, so the query returns accurate results. If there were a huge number of records for a given prefix the performance would drop way off, but something like 99.9% of the time it is enough to limit to a single record, and for the other cases it will return a handful at most.
A checksum would probably allow us to cover more than the prefix, so while it would still suffer from false positives, it would probably allow more to be indexed in a smaller index, though with the overhead of storing the checksum.
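Purely as an illustration of that idea (MD5 and the field list here are assumptions, not something we have implemented), the checksum could be computed like this and stored in a fixed-length CHAR(32) column with its own index:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Illustrative only: a fixed-length checksum over the identifying fields could be
// indexed instead of a VARCHAR prefix.
public class RecordChecksum {

  /** Hex MD5 over the concatenated identifying fields (32 chars, fits a CHAR(32) column). */
  static String checksum(int resourceId, String institutionCode,
                         String collectionCode, String catalogNumber) throws Exception {
    String joined = resourceId + "|" + institutionCode + "|"
        + collectionCode + "|" + catalogNumber;
    byte[] digest = MessageDigest.getInstance("MD5")
        .digest(joined.getBytes(StandardCharsets.UTF_8));
    return String.format("%032x", new BigInteger(1, digest));
  }
}
```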
This is a legacy system we are patching, and in the planned work we expect to use a column-oriented store with no locking and fast PK lookups. HBase is the leading candidate for this, since it fits so well into the rest of our processing architecture, which is Oozie, Hive, Pig and Hadoop based. I don't think we are likely to change the MySQL schema further (especially now that it is giving the throughput we needed), but these ideas are all relevant for future work too. Oliver's latest post on key design, for example, is very close to this kind of thing.
Cheers,
Tim