A downside of durable messaging
Overview
Durable messaging can be very fast, as fast as non-durable messaging, up to a point.
Limitations of durable messaging
Durable messaging is dependent on the size of your main memory and the speed of your hard drive. A single HDD can sustain as little as 20 MB/s and as much as 60 MB/s. A RAID set of HDDs can support between 100 and 300 MB/s. A SATA SSD can support between 100 and 500 MB/s, and a PCI SSD can support up to 1.5 GB/s.
Case study
Say you have 8 GB of memory, you are writing two million 100-byte messages per second, and you have an HDD which supports 25 MB/s.
This works fine in bursts, but you reach a point where your disk cache is full. Depending on your OS, the disk cache can be between 20% and 80% of your main memory size. In my experience, Windows tends to be closer to 20% even if you have plenty of free memory, whereas Linux tends to allow in the region of 30% of your memory in uncommitted writes.
Say you are writing two million 100-byte messages per second, or 200 MB/s, and you have 1600 MB of disk cache (20% of 8 GB). The difference between the rate you are generating data and the rate the disk can write it is 175 MB/s, so in just 9 seconds you have filled the cache. At this point your performance plummets to the write speed of your disk, which is 25 MB/s. With each message being 100 bytes, you are now writing 250,000 messages per second, or 8x slower.
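The arithmetic in the case study can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and constants are my own, and it assumes the disk drains the cache at a constant 25 MB/s while the producer writes at a constant rate (1 MB = 10^6 bytes).

```python
def cache_fill_seconds(cache_mb, produce_mb_s, disk_mb_s):
    """Seconds until the disk cache is full, or None if the disk keeps up."""
    backlog_mb_s = produce_mb_s - disk_mb_s  # how fast unwritten data piles up
    if backlog_mb_s <= 0:
        return None
    return cache_mb / backlog_mb_s

# Figures taken from the case study above.
MSG_BYTES = 100
MSGS_PER_SEC = 2_000_000
DISK_MB_S = 25
CACHE_MB = 1600  # 20% of 8 GB

produce_mb_s = MSGS_PER_SEC * MSG_BYTES / 1e6          # 200 MB/s generated
fill_s = cache_fill_seconds(CACHE_MB, produce_mb_s, DISK_MB_S)  # 1600 / 175
sustained_msgs_per_sec = DISK_MB_S * 1e6 / MSG_BYTES   # rate once the cache is full
slowdown = MSGS_PER_SEC / sustained_msgs_per_sec

print(f"cache full after {fill_s:.1f} s")
print(f"sustained rate {sustained_msgs_per_sec:,.0f} msg/s ({slowdown:.0f}x slower)")
```

Changing `DISK_MB_S` or `CACHE_MB` shows how quickly the picture changes with a faster drive or more memory.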
What is the solution?
- Keep your micro-bursts to less than you can fit in disk cache e.g. in the above case this would be about 18 million messages.
- Increase the amount of memory you have. While memory is cheap and you can buy 32 GB for about £150, all this does is increase the duration of the micro-burst you can support.
- Increase the speed of your drive. With SSDs you can support much higher bandwidths. SATA SSDs support up to 500 MB/s, which is higher than Chronicle can typically serialize messages, i.e. more than enough. The downside is that SSDs are smaller, which reduces the total number of messages you can store. A 500 GB SSD can store 5 billion 100-byte messages. A 6x4 TB RAID-5 set can support a transfer rate of over 200 MB/s, which would be enough for the above case study, and can store 200 billion messages.
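The figures in the list above can be reproduced with a rough sketch like the following. The helper names are hypothetical, the 20% cache fraction is the Windows-like figure assumed in the case study, and RAID-5 usable capacity is taken as (drives - 1) x drive size.

```python
def max_burst_messages(cache_mb, produce_mb_s, disk_mb_s, msg_bytes):
    """Messages you can generate before the disk cache fills up."""
    backlog_mb_s = produce_mb_s - disk_mb_s
    if backlog_mb_s <= 0:
        return float("inf")  # the disk keeps up, so the burst never ends
    fill_seconds = cache_mb / backlog_mb_s
    return fill_seconds * produce_mb_s * 1e6 / msg_bytes

def capacity_messages(usable_bytes, msg_bytes):
    """How many fixed-size messages fit in the given usable space."""
    return usable_bytes // msg_bytes

GB = 10**9
TB = 10**12

# 8 GB machine with 20% disk cache: roughly 18 million messages per burst.
burst_8gb = max_burst_messages(1600, 200, 25, 100)
# 32 GB machine with the same 20% assumption: a longer burst, same sustained rate.
burst_32gb = max_burst_messages(6400, 200, 25, 100)

ssd_msgs = capacity_messages(500 * GB, 100)          # 500 GB SSD
raid_msgs = capacity_messages((6 - 1) * 4 * TB, 100) # 6x4 TB RAID-5, one drive of parity

print(f"burst with 8 GB:  {burst_8gb / 1e6:.1f} million messages")
print(f"burst with 32 GB: {burst_32gb / 1e6:.1f} million messages")
print(f"500 GB SSD holds  {ssd_msgs:,} messages")
print(f"RAID-5 set holds  {raid_msgs:,} messages")
```

More memory only stretches the burst; the sustained rate is still set by the drive.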
I'm confused. How can messages be durable if they're only in I/O cache in main memory?!? I'm certain any product advertising durable messaging queues will be flushing all writes to disk so any main memory disk cache should be irrelevant. The only disk cache size that may be relevant would be battery-backed cache on the disk controller itself.
You are right that the buffer in question depends on the implementation. In the case of Chronicle, if you only need reliability in case of your program dying, you can use it as is. If you need reliability in case of a machine dying or a power failure, it can be replicated to a second machine. You could commit every write to battery-backed disk, which is also supported if you really need it, but the impact on performance is dramatic for very little gain in most cases. In fact I should do a page on how much difference flushing every message to disk makes.
As I have noted before most messaging solutions avoid benchmarking durable messaging, but those that do clearly don't commit writes given the numbers they report. ;)
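To get a feel for the difference between leaving writes in the OS cache and committing every message, a rough sketch along these lines can be run on your own hardware. It is not Chronicle's API or write path; it uses plain file writes with Python's `os.fsync`, and the function names are made up for illustration. Note that on some hardware `fsync` may still leave data in the drive's own volatile cache.

```python
import os
import tempfile
import time

def write_messages(path, count, msg, sync_each):
    """Write `count` copies of `msg`; optionally commit after every message."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(msg)
            if sync_each:
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # ask the OS to commit to the device
    return time.perf_counter() - start

def compare(count=1000, msg_bytes=100):
    """Return {label: (elapsed_seconds, file_size)} for both write modes."""
    msg = b"x" * msg_bytes
    results = {}
    with tempfile.TemporaryDirectory() as d:
        for label, sync_each in (("cached", False), ("fsync_each", True)):
            path = os.path.join(d, label + ".dat")
            elapsed = write_messages(path, count, msg, sync_each)
            results[label] = (elapsed, os.path.getsize(path))
    return results

if __name__ == "__main__":
    for label, (elapsed, size) in compare().items():
        print(f"{label}: {size} bytes in {elapsed:.4f} s")
```

The timings depend heavily on the drive, file system and controller cache, so treat any numbers as specific to the machine they were measured on.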
Ok, you seem to use the term durable in a rather loose sense. Perhaps you could edit your blog post to define what you mean by durable at the top of your post. I believe most people think of durable as equating to data being safely written to non-volatile storage. To me, calling something durable when it exists in I/O cache in main memory is not durable. So what you're benchmarking here is not really durable messaging either. As JP noted, durable messaging is pretty similar to transaction logging for a database and comes with the same performance penalties.
Peter, you are correct to doubt those sorts of numbers. The durable write path for a messaging system isn't terribly different from the transaction log for a database. A single disk can only support 1 commit/flush per revolution. The first few generations of SATA SSDs weren't much better, though they are improving. By the way, RAID-5 is fairly terrible at that particular kind of write load, due to the parity calculation overhead. I suspect you'd be better off with either RAID 1 or RAID 0+1. As always, best to test it out, and durability tests may well deserve checking how the system does in a hard power fault.