A downside of durable messaging

February 06, 2013

Overview

Durable messaging can be very fast, as fast as non-durable messaging up to a point.

Limitations of durable messaging

Durable messaging is dependent on the size of your main memory and the speed of your hard drive. A single HDD may sustain as little as 20 MB/s and as much as 60 MB/s. A RAID set of HDDs can support between 100 and 300 MB/s. A SATA SSD can support between 100 and 500 MB/s, and a PCI SSD can support up to 1.5 GB/s.

Case study

Say you have 8 GB of memory, you are writing two million 100-byte messages per second, and you have an HDD which supports 25 MB/s.

This works fine at full speed in bursts, but you reach a point where your disk cache is full. Depending on your OS, the cache can be between 20% and 80% of your main memory size. In my experience, Windows tends to be closer to 20% even if you have plenty of free memory, whereas Linux tends to allow in the region of 30% of your memory in uncommitted writes.

Say you are writing two million 100-byte messages per second, or 200 MB/s, and you have 1600 MB of disk cache (20% of 8 GB). The difference between the rate you are generating data and the rate the disk can write it is 175 MB/s, so in just over 9 seconds you have filled the cache. At this point your performance plummets to the write speed of your disk, which is 25 MB/s. With each message being 100 bytes, you are now writing 250,000 messages per second, or 8x slower.
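The arithmetic above can be sketched in a few lines of Java. This is only a back-of-envelope model, not part of Chronicle: the class and method names are made up for illustration, 1 MB is taken as 1,000,000 bytes, and the cache is assumed to drain at the full disk speed while it fills.

```java
/** Back-of-envelope model of a write burst against a fixed-size disk cache (illustrative names). */
final class BurstModel {

    /** Seconds until the cache fills; infinite if the disk keeps up with the producer. */
    static double secondsToFillCache(double cacheMB, double produceMBps, double diskMBps) {
        double surplusMBps = produceMBps - diskMBps; // data accumulating in the cache
        return surplusMBps <= 0 ? Double.POSITIVE_INFINITY : cacheMB / surplusMBps;
    }

    /** Messages per second at a given bandwidth (1 MB assumed to be 1,000,000 bytes). */
    static double messagesPerSecond(double mbPerSecond, int messageBytes) {
        return mbPerSecond * 1_000_000 / messageBytes;
    }

    public static void main(String[] args) {
        int messageBytes = 100;
        double produceMBps = 2_000_000.0 * messageBytes / 1_000_000; // 200 MB/s
        double diskMBps = 25;   // HDD from the case study
        double cacheMB = 1600;  // disk cache from the case study

        double seconds = secondsToFillCache(cacheMB, produceMBps, diskMBps);
        double sustained = messagesPerSecond(diskMBps, messageBytes);
        double slowdown = messagesPerSecond(produceMBps, messageBytes) / sustained;

        System.out.printf("cache full after %.1f s%n", seconds);
        System.out.printf("sustained rate %.0f msg/s, %.0fx slower%n", sustained, slowdown);
    }
}
```

Plugging in your own cache size and disk speed gives a rough idea of how long a burst you can absorb before throughput drops to the speed of the disk.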

What is the solution?

Keep your micro-bursts to less than you can fit in disk cache e.g. in the above case this would be about 18 million messages.
Increase the amount of memory you have. While memory is cheap and you can buy 32 GB for about £150, all this does is increase the duration of the micro-burst you can support.
Increase the speed of your drive. With SSDs you can support much higher bandwidths. SATA SSDs support up to 500 MB/s, which is higher than Chronicle can typically serialize messages, i.e. more than enough. The downside is the smaller capacity, which reduces the total number of messages you can store. A 500 GB SSD can store 5 billion 100-byte messages. A 6x4 TB RAID-5 set can support a transfer rate of over 200 MB/s, which would be enough for the above case study, and can store 200 billion messages.
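To compare these options, here is a rough sizing sketch in Java. The names are hypothetical and the model is deliberately simple: it assumes decimal units (1 GB = 10^9 bytes), that RAID-5 gives up one drive's worth of space to parity, and, for the larger-memory case, that the disk cache stays at 20% of main memory.

```java
/** Rough sizing of burst length and storage capacity (illustrative names, decimal units). */
final class SizingSketch {

    /** Messages you can write in one burst before the cache fills. */
    static long maxBurstMessages(long cacheBytes, long produceBytesPerSec,
                                 long diskBytesPerSec, int messageBytes) {
        long surplus = produceBytesPerSec - diskBytesPerSec;
        if (surplus <= 0) return Long.MAX_VALUE; // the disk keeps up, no limit from the cache
        double seconds = (double) cacheBytes / surplus;
        return (long) (seconds * produceBytesPerSec / messageBytes);
    }

    /** Usable capacity of a RAID-5 set: one drive's worth of space holds parity. */
    static long raid5UsableBytes(int drives, long bytesPerDrive) {
        return (drives - 1) * bytesPerDrive;
    }

    /** Number of fixed-size messages that fit in a given capacity. */
    static long messagesStored(long capacityBytes, int messageBytes) {
        return capacityBytes / messageBytes;
    }

    public static void main(String[] args) {
        int msg = 100;
        long produce = 200_000_000L; // 200 MB/s
        long hdd = 25_000_000L;      // 25 MB/s

        // Burst size with 1600 MB of cache, and with 6400 MB (assumed 20% of 32 GB).
        System.out.println("burst, 1600 MB cache: " + maxBurstMessages(1_600_000_000L, produce, hdd, msg));
        System.out.println("burst, 6400 MB cache: " + maxBurstMessages(6_400_000_000L, produce, hdd, msg));

        // Total messages stored on a 500 GB SSD versus a 6x4 TB RAID-5 set.
        System.out.println("500 GB SSD:      " + messagesStored(500_000_000_000L, msg));
        System.out.println("6x4 TB RAID-5:   " + messagesStored(raid5UsableBytes(6, 4_000_000_000_000L), msg));
    }
}
```

More memory only makes the burst longer. It does not change the sustained rate, which is still set by the drive.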

Conclusion

If you see any durable messaging solution suddenly slow down under load, you need to look at the size of your buffers and the throughput of your disk sub-system.

Comments

Yeroc, 6 February 2013 at 19:42
I'm confused. How can messages be durable if they're only in I/O cache in main memory?!? I'm certain any product advertising durable messaging queues will be flushing all writes to disk so any main memory disk cache should be irrelevant. The only disk cache size that may be relevant would be battery-backed cache on the disk controller itself.

Corey
Peter Lawrey, 6 February 2013 at 20:49
You are right that the buffer in question depends on the implementation. In the case of Chronicle, if you only need reliability in case of your program dying, you can use it as is. If you need reliability in case of a machine dying or a power failure, it can be replicated to a second machine. You could commit every write to battery-backed disk, which is also supported if you really need it, but the impact on performance is dramatic for very little gain in most cases. In fact I should do a page on how much difference flushing every message to disk makes.

As I have noted before most messaging solutions avoid benchmarking durable messaging, but those that do clearly don't commit writes given the numbers they report. ;)
JP, 7 February 2013 at 03:55
Peter you are correct to doubt those sorts of numbers. The durable write path for a messaging system isn't terribly different from the transaction log for a database. A single disk can only support 1 commit/flush per revolution. The first few generations of SATA SSDs weren't much better, though they are improving. By the way, RAID-5 is fairly terrible at that particular kind of write load, due to the parity calculation overhead. I suspect you'd be better off with either RAID 1 or RAID 0+1. As always, best to test it out, and durability tests may well deserve checking how the system does in a hard power fault.

Vanilla Java