I wanted to write about a MySQL performance optimization using partitioning, as I recently applied it to the Harvesting and Indexing Toolkit’s (HIT) log table. The log table already had a composite index (an index spanning multiple columns), but as the table grew bigger and bigger (>50 million records), queries were being answered at a turtle’s pace.
To set things up, imagine that in the HIT application there is a log page that allows the user to tail the latest log messages in almost real time. Behind the scenes, the application is querying the log table every few seconds for the most recent logs, and the effect is a running view of the logs. The tail query used looks like this:
mysql> select * from log_event where id >= 'latest id' and datasource_id = 'datasource_id' and level >= 'log level' order by id desc;
In effect this query asks: “give me the latest logs for the datasource with id X having at least a certain log level”.
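For context, the pre-partitioning table presumably leaned on a composite index shaped something like the one below; the index name and column order are my assumption, not taken from the HIT schema.

-- Hypothetical composite index: equality on the datasource column,
-- then a range scan on id that can also serve the "order by id desc".
ALTER TABLE log_event
ADD INDEX idx_datasource_id (datasource_id, id);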
Partitioning basically divides a table into different portions that are stored and can be queried separately. The benefit is that if a query only has to hit a small portion instead of the whole table, it can be answered faster.
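As a toy illustration (table and values made up, nothing to do with the HIT schema), a query that filters on the partitioning column only needs to open one partition:

-- Toy table split into two ranges on user_id.
CREATE TABLE events (
  id      BIGINT NOT NULL,
  user_id INT    NOT NULL,
  PRIMARY KEY (id, user_id)
)
PARTITION BY RANGE (user_id) (
  PARTITION p0 VALUES LESS THAN (1000),
  PARTITION p1 VALUES LESS THAN MAXVALUE
);

-- user_id = 42 falls in p0, so MySQL can prune p1 entirely.
SELECT * FROM events WHERE user_id = 42;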
There are different ways that you can partition tables in MySQL, and you can read about them all in the MySQL reference manual. I first experimented with KEY partitioning on the table ID. Unfortunately, because the logs for a single datasource could be spread across different partitions, the tail query would have to hit all of them. To check how many partitions a query hits, I used the following query:
mysql> explain partitions select * from … ;
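For the record, that first attempt amounted to something like the sketch below; the partition count and the literal values in the pruning check are illustrative, not the actual ones used.

-- Hypothetical sketch of the first attempt: KEY partitioning on id.
ALTER TABLE log_event
PARTITION BY KEY (id)
PARTITIONS 16;

-- The "partitions" column of the resulting plan lists every
-- partition the tail query touches; with KEY partitioning by id,
-- it listed all of them.
EXPLAIN PARTITIONS
SELECT * FROM log_event
WHERE id >= 1000000 AND datasource_id = 42 AND level >= 3
ORDER BY id DESC;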
This resulted in an even slower response than without partitioning, so Tim thought about it from a different angle. He discovered a nice solution using RANGE partitioning by datasource ID instead. This way the table gets divided into contiguous, non-overlapping ranges of datasource IDs. A range size of 1000 was used, so the 1st partition contains all logs for datasources with IDs 0–999, the 2nd for IDs 1000–1999, and so on. Part of the command used to apply Tim’s partitioning strategy (36 bounded range partitions plus a final catch-all) is displayed below:
ALTER TABLE log_event
ADD PRIMARY KEY (id, bio_datasource_fk)
PARTITION BY RANGE (bio_datasource_fk) (
  PARTITION p0 VALUES LESS THAN (1000),
  PARTITION p1 VALUES LESS THAN (2000),
  ...
  PARTITION p36 VALUES LESS THAN MAXVALUE
);
Checking how many partitions the tail query would hit, I confirmed that it only ever uses a single partition. The result was impressive, and initial tests resulted in a speed-up of over 9000 times!
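If you want to reproduce the checks, something along these lines works; the literal values are again illustrative, and I’m using the column name from the ALTER statement above.

-- With RANGE partitioning on the datasource column, the pruning
-- check should now list a single partition (e.g. p0 for datasource 42).
EXPLAIN PARTITIONS
SELECT * FROM log_event
WHERE id >= 1000000 AND bio_datasource_fk = 42 AND level >= 3
ORDER BY id DESC;

-- Approximate row distribution across the partitions.
SELECT PARTITION_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_NAME = 'log_event';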
It’s important to note that the primary key (and any other unique key) must include all columns used in the partitioning expression. Because we were partitioning on the datasource ID, that column had to be added to the primary key before partitioning would work. An index on the id was also added to further optimize the query (why not, right?).
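In SQL terms (the index name here is made up), that extra index was along the lines of:

-- Without the datasource column in the primary key, MySQL rejects
-- the partitioning with an error along the lines of:
-- "A PRIMARY KEY must include all columns in the table's
-- partitioning function" (error 1503).

-- Secondary index on id to help the "order by id desc" tail query.
ALTER TABLE log_event ADD INDEX idx_log_event_id (id);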
The speed-up might be dramatic now, but as more log messages get written to a partition and it starts to swell, I envisage having to either delete old logs or repartition the table again using smaller range sizes in order to sustain good performance. There is a trade-off between the number of partitions and performance, so some tweaking is needed in every case I guess. Lastly, I’ll reiterate that improper partitioning can actually make things worse. Perhaps it could work for you too, but please apply with caution.
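When that day comes, REORGANIZE PARTITION can split partitions without dropping data, and DROP PARTITION can throw data away almost instantly; the bounds below are illustrative.

-- Split the MAXVALUE catch-all into a bounded range plus a new
-- catch-all; MySQL redistributes the existing rows accordingly.
ALTER TABLE log_event
REORGANIZE PARTITION p36 INTO (
  PARTITION p36 VALUES LESS THAN (37000),
  PARTITION p37 VALUES LESS THAN MAXVALUE
);

-- Note that dropping a partition deletes every log for the
-- datasources in that ID range (not the oldest logs by time).
ALTER TABLE log_event DROP PARTITION p0;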