Chronicle and low latency in Java

Overview

I was watching this excellent presentation by Rolan Kuhn of Typesafe on Introducing Reactive Streams At first glance it appears that it has some similar goals to Chronicle, but as you dig into the details it was clear to me that there was a few key assumptions which were fundamentally different.

Key assumptions

The key assumptions  in the design of Chronicle are
  • low latency is your problem, not throughput. Data comes in micro-bursts which you want to handle as quickly as possible long before the next micro-burst of activity. 
  • you can't pause an exchange/producer if you are busy. (or pausing the end user is not an option)
  • your information is high value, recording every event with detailed timing is valuable. Recording all your events is key to understanding micro-bursts.
  • You want to be able to examine any event which occurred in the past.

Low latency is essential

The key problem Chronicle is design to help you solve is consistent low latency.  It assumes that if your latency is low enough, you don't have a problem with throughput.  Many web based systems are designed for throughput and as long as the latency is not visible to end users, latency is not an issue. For soft real time systems, you need low latency most of the time and a modest worst case latency, much faster than a human can see.

You can't stop the world

Another key assumption is that flow control is not an option.  If you are running slow, you can't say to the exchange and all its users, wait up a second while I catch up.  This means the producer can never be slowed down by a consumer.

Slowing the producer is effectively the same as adding latency, but this latency is easy to hide.  If you wait until your producer timestamps an event, this can make you latencies look better. If you want a more realistic measure is you should use the time stamp the event should have been sent by a producer which is not delayed.

You need to record every thing for replay

Replaying can be useful for testing your application under a range of conditions.  e.g. you can change your application and see not only that it behaves correctly, but behaves in a timely manner.  For quantitative analysis, they will need a set of data to tune their strategies.

Replay an old event in real time.

Instead of taking a copy of event you might want to refer to later, you can instead remember it's index and you can replay that event later on demand.  This saves memory in the heap, or just-in-case copies of data.

Micro-bursts are critical to understanding your system.

The performance of some systems are characterised in terms of transactions per day.  This implies that if no transactions were completed for the first 23 hours and all of them completed in the last hour, you would still perform this many transactions per day.  Often the transactions per day is quoted because its a higher numbers, but in my case having all day to smooth out the work load sounds like a luxury.

Some systems might be characterised in terms of the number of transactions per second.  This may imply that those transactions can start and complete in one second, but not always.  If you have 1000 transactions and one comes in every milli-seconds, you get an even response time.  What I find more interesting is the number of transactions in the busiest second of a day.  This gives you an indication of the flow rate your system should be able to handle.

Examining micro bursts

Consider a system which gets 30 events all in the same 100 micro-seconds and these bursts are 100 milli-seconds apart.   This could appear as a (30 / 0.1) 300 transactions per second which sounds relatively easy if all we need to do is to keep up, but if we want to response as quickly as possible, the short term/burst throughput is 30 in 100 micro-seconds or 300,000 events per second.

In other words, to handle micro-bursts as fast as possible, you will need a systems which can handle throughputs many orders of magnitude higher than you would expect over seconds or minutes or a day. Ideally, the throughput of your systems will be 100x the busiest second of the day.  This is required to handle the busiest 10 milli-seconds in any second without slowing the handling of these bursts of data.

How does Chronicle improves handling of micro-bursts

Low garbage rate

Minimising garbage is key to avoiding GC pauses. To use your L1 and L2 cache efficiently, you need to keep your garbage rates very low.  If you are not using these cache efficiently your application can be 2-5x slower. 

The garbage from Chronicle is low enough that you can process one million events without jstat detecting you have created any garbage.  jstat only displays multiples of 4 KB, and only when a new TLAB is allocated.  Chronicle does create garbage, but it is extremely low. i.e. a few objects per million events processes.

Once you make the GC pauses manageable, or non-existent, you start to see other sources of delay in your system.   Take away the boulders and you start to see the rocks.  Take away the rocks and you start to see the pebbles.

Supports a write everything model.

It is common knowledge that if you leave DEBUG level logging on, it can slow down your application dramatically.  There is a tension between recording everything you might want to know later, and the impact on your application.

Chronicle is designed to be fast enough that you can record everything.  If you replace queues and IPC connections in your system, it can improve the performance and you get "record everything" for free, or even better.

Being able to record everything means you can add trace timings through every stage of your system and then monitor your system, but also drill into the worst 1% delays in your system.  This is not something you can do with a profiler which gives you averages.  

With chronicle you can answer questions such as; which parts of the system were responsible for the slowest 20 events for a day?

Chronicle has minimal interaction with the Operating System.

System calls are slow, and if you can avoid call the OS, you can save significant amounts of latency. 

For example, if you send a message over TCP on loopback, this can add a 10 micro-seconds latency between writing and reading the data.  You can write to a chronicle, which is a plain write to memory, and read from chronicle, which is also a read from memory with a latency of 0.2 micro-seconds. (And as I mentioned before, you get persistence as well)

No need to worry about running out of heap.

A common problem with unbounded queues and this uses an open ended amount of heap.  

Chronicle solves this by not using the heap to store data, but instead using memory mapped files.  This improve memory utilisation by making the data more compact but also means a 1 GB JVM can stream 1 TB of data over a day without worrying about the heap or how much main memory you have.  In this case, an unbounded queue becomes easier to manage.

Conclusion

By being built on different assumptions, Chronicle solves problems other solutions avoid, such as the need for flow control or consuming messages (deleting processed messages).

Chronicle is designed to be used your hardware more efficiently so you don't need a cloud of say 30 servers to handle around one million events per second (as a number of cloud based solutions claim), you can do this event rate with a developer laptop.

Comments

  1. Hello,

    nice text, but could you give some real examples and some code?

    Thank,
    Felipe Albrecht

    ReplyDelete
    Replies
    1. Hi Felipe,
      The source for chornicle is available here https://github.com/OpenHFT/Java-Chronicle
      There is examples in the test area and a stand alone demo here https://github.com/OpenHFT/Java-Chronicle/tree/master/chronicle-demo

      Is there a particular piece of code you might like to see?

      Peter.

      Delete
  2. :) yeah some of the numbers being bragged on in the web make me ask myself "wtf did the other 29 servers do ?". Very few people have an idea of what is really possible on modern hardware + VM's.

    ReplyDelete
  3. After having watched the vid I am sorry to say there are a couple of severe misconceptions.

    * Flow control works in 1:1 communication. In 1:N patterns (required for redundancy/failover and publish/subscribe) flow control means slowing down the sender to the speed of the slowest receiver. So a failing node degrades performance of the whole cluster.

    * If you need to apply flow control on a reactive system, this simply means your system can't handle the load. Its too slow. Instead of backpressuring the whole pipeline up to the input component of the system, one better monitors queue sizes and refuses further input (e.g. disallow more users) at a critical point. This has the same effect as fine grained flow control but does avoid the overhead. Especially in publish/subscribe patterns FlowControl creates A LOT of overhead thereby degrading peak performance significantly.
    Another possibility is to apply adaptive netting at the input side (Last Value Cache) to reduce the load. Peaks can be handled by increasing queue size's/messaging buffer sizes.

    * They are talking directed single sender, singler receiver TCP all the time, on the same time they talk of redundancy (requires 1:N) and publish/subscribe. But flow control patterns lined out do not work for 1:N patterns. The only thing half-decent flow control in 1:N communication is NAK, which in turn requires large sender side queues/buffers.

    * Trying to handle network and local (inprocess) communication with the same patterns does not work, because of network latency. The longer a flow control message latency, the harder it gets to do reasonable flow control, as the flow control message received is from the past, this gets especially messy when publishing to many subscribers (1:N), the publisher literally needs to track per-subscriber latency in order to make meaningful send rate control.

    From my experience when building high volume, soft realtime distributed systems trying to mess with flow control in one-to-many communication patterns just slows down your system incl. spurious "overcompensation" etc. because of mislead flow control. Even TCP doesn't do too well in 1:1 communication patterns.

    Investing that time in optimizing encoding/decoding, enqueing/dequeuing and actor message dispatch will yield a faster system that doesn't require flow control at all (but don't forget that big overload fuse at the entrance).

    ReplyDelete

Post a Comment

Popular posts from this blog

Java is Very Fast, If You Don’t Create Many Objects

System wide unique nanosecond timestamps

Comparing Approaches to Durability in Low Latency Messaging Queues