C++ like Java for low latency
Overview
Previously I wrote an article on C like Java. This is a term I had come across before. However, on reflection I thought C++ like Java was a better term, as you still use OOP practices (which are not C-like), but you put more work into managing and recycling memory yourself. The term I favour now is "low level" Java programming.
Low latency and "natural" Java
Many low latency systems are still programmed in what I call natural Java. In most cases, this is the simplest and most productive approach. The problems really start when you have large heap sizes and low latency requirements, as these don't work well together.
If you have an established program and a rewrite is not reasonable, you need something like Zing, which has a truly concurrent collector. Although you may see worst case pauses of 1-4 ms, this is about as good as a non real time system will get. Red Hat is rumoured to be producing a concurrent collector, but as Oracle has found with G1, writing an effective low latency collector is very difficult. G1 works fine, it just isn't suitable for low latency trading (e.g. with 1 ms pause times) and I am not sure it ever will be.
If a rewrite of key components is an option, especially if it is something you are considering anyway for reasons other than performance, you can get significant performance gains by using lower level programming which looks more like C++ or C. The reason it looks like these languages is that you are using the same tricks which work in those languages, and have been used for a long time, but in Java.
Why do low level programming in Java?
The main reason is integration and ease of support. Low level Java integrates very cleanly with natural Java code. All the Java tools work with it just fine and it is still cross platform (or at least across OSes). Some of the low level techniques are JVM specific, e.g. they work on OpenJDK/HotSpot but not IBM's JVM, or vice versa. They might work, but may not help, e.g. using Unsafe on HotSpot can be significantly faster, but significantly slower on Android.
The other benefit of low level Java programming is that experienced Java developers find it relatively easy to work with. They might find it difficult to write from scratch, but if a library is used which hides the more hairy details, it can be used effectively.
What are some of the considerations with low level Java?
Latency targets
You can consider 1 ms latency as not that fast these days, unless you are under 1 ms 99.9% of the time. The fastest systems in Java are typically around 100 micro-seconds, even below 20 micro-seconds external to the box.
How to handle high throughput
For trading systems, I suggest you make the latency as low as possible, and the throughput is often enough. For example, Chronicle can persist a full OPRA feed at maximum burst as a sustained rate with one thread. This is possible because the latency of Chronicle is very low; a tiny 4 byte message has an average latency of 7 nano-seconds, and will be persisted even if the JVM crashes on the next line. At this point throughput is not such an issue.
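As an illustration, here is a minimal sketch of writing and reading a persisted message. It uses a later Chronicle Queue API than the version this article describes, and the directory name is just a placeholder.

    import net.openhft.chronicle.queue.ChronicleQueue;
    import net.openhft.chronicle.queue.ExcerptAppender;
    import net.openhft.chronicle.queue.ExcerptTailer;

    public class QueueSketch {
        public static void main(String[] args) {
            // The queue is backed by memory mapped files, so a message written
            // here survives even if the JVM crashes on the next line.
            try (ChronicleQueue queue = ChronicleQueue.singleBuilder("queue-dir").build()) {
                ExcerptAppender appender = queue.acquireAppender();
                appender.writeText("NEW_ORDER");        // persisted once this returns

                ExcerptTailer tailer = queue.createTailer();
                System.out.println(tailer.readText());  // prints NEW_ORDER
            }
        }
    }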
For back testing, low latency is not enough. This is because what you want to be able to do is replay months' worth of data in a fraction of a second (ideally). In this case, you need high degrees of parallelism and the ability to replay lots of data in a short period of time. For this I have used memory mapped files which hold pre-canned queries of the data I need. If these files are in a binary format and fit in main memory, they can be accessed repeatedly, across threads, very fast.
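A minimal sketch of this pattern, assuming a pre-canned binary file of fixed size records; the file name and record layout are illustrative only.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class ReplaySketch {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile("ticks.bin", "r");
                 FileChannel ch = raf.getChannel()) {
                // Map the file into memory; repeated scans run at memory speed once
                // the pages are cached. A single mapping is limited to 2 GB, so
                // larger files need several mappings.
                MappedByteBuffer mbb = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                long count = 0;
                while (mbb.remaining() >= 16) {
                    long timestamp = mbb.getLong(); // 8 byte timestamp
                    long price = mbb.getLong();     // 8 byte price field
                    count++;
                }
                System.out.println(count + " records replayed");
            }
        }
    }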
Handling GCs
Garbage collection pauses slow you down, so the fastest approach when building an engine is to not use the brakes. Instead you give yourself a budget: the largest Eden size you can reasonably have. This might be 24 GB of Eden. If it takes all day to fill, you won't collect, not even a minor collection, in a day. Assuming you can stop for a few seconds overnight or at the weekend, you Full GC at a predetermined time and GCs are no longer an issue.
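For example, the young generation can be sized explicitly, and the full collection triggered on a schedule. The flag values below are illustrative, not a recommendation.

    // Illustrative JVM options: a 24 GB young generation inside a 32 GB heap.
    //   java -Xms32g -Xmx32g -Xmn24g MyEngine

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class NightlyGC {
        public static void main(String[] args) {
            ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
            // Trigger the Full GC at a predetermined quiet time, e.g. every 24 hours
            // (this only works if -XX:+DisableExplicitGC is not set).
            ses.scheduleAtFixedRate(System::gc, 24, 24, TimeUnit.HOURS);
        }
    }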
Even then I suggest keeping garbage to a minimum and localising memory access patterns to maximise L1/L2 cache efficiency, which can result in a 2-5x performance improvement. The L3 cache is at least 10x slower than the L1, and if you are not filling your L1 with garbage (as many Java programs literally do) your program runs much faster, and more consistently.
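One common trick is to reuse a mutable buffer instead of allocating a new object per event; this sketch is illustrative.

    public class Reuse {
        // One buffer per thread: no garbage per event, and the same few cache
        // lines stay hot in L1 rather than a stream of short-lived objects.
        private static final ThreadLocal<StringBuilder> BUFFER =
                ThreadLocal.withInitial(StringBuilder::new);

        static StringBuilder describe(long orderId, double price) {
            StringBuilder sb = BUFFER.get();
            sb.setLength(0); // recycle rather than allocate
            return sb.append("order=").append(orderId).append(" price=").append(price);
        }
    }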
Use shared memory for IPC
This has two advantages.
1) You can keep a record of every event, with timings, for the whole day with trivial overhead. Because you can keep much more data, you can reproduce the exact state of the system externally, downstream, or to reproduce a bug. This gives you a massive data driven test base, and you can feed production data through your test system in real time to see how it behaves before running live.
2) It can be much faster than the alternatives. A tiny message can have an average latency of 7 nano-seconds to write and can be visible to another process in under 100 nano-seconds. A sketch of the idea follows.
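As a minimal sketch using only the JDK, two processes map the same file; one writes a value and a flag, the other polls the flag. The file name is illustrative, and a real implementation needs proper memory ordering and message framing, which is what libraries like Chronicle provide.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class ShmWriter {
        public static void main(String[] args) throws Exception {
            try (FileChannel ch = new RandomAccessFile("/dev/shm/ipc.dat", "rw").getChannel()) {
                MappedByteBuffer shm = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
                shm.putLong(8, 42);  // payload first...
                shm.putInt(0, 1);    // ...then the flag a reader process polls with getInt(0)
            }
        }
    }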
Pinning cores
Many people have tried pinning cores and not found much difference. This can be because a) the jitter of the application is so high it hardly matters, or b) they didn't pin to an isolated core. I have found that if you bind to an isolated core, the worst case latencies drop from 2 ms to 10 micro-seconds, but pinning to a non-isolated core doesn't appear to help at all.
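As a sketch, cores can be removed from the general scheduler with the isolcpus kernel boot parameter, and a thread bound to one of them, e.g. with OpenHFT's Java-Thread-Affinity library (the API shown is from a later version, and the core number is illustrative).

    import net.openhft.affinity.AffinityLock;

    public class Pinned {
        public static void main(String[] args) {
            // Assumes the kernel was booted with e.g. isolcpus=3, so nothing else
            // is scheduled on core 3.
            AffinityLock lock = AffinityLock.acquireLock(3);
            try {
                runCriticalLoop(); // hypothetical stand-in for the hot path
            } finally {
                lock.release();
            }
        }
        static void runCriticalLoop() { /* ... */ }
    }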
Busy waiting
You don't want your critical thread to context switch, and one way to avoid this is to never give up the core or CPU. This can be done by busy waiting on a pinned and isolated core. Some things I have done with isolated cores are to:
a) disable hyper threading selectively for the most critical threads, but leave it on for the rest.
b) over-clock the isolated cores more by not over-clocking the "junk" cores. This works by limiting the heat produced by cores that don't need maximum speed anyway. E.g. say you have a hex core: you leave cores 0, 2, 4 not over-clocked, and over-clock cores 1, 3, 5 (without using HT) another 10% more than you could the whole socket. I assume the over-clocked cores are not next to each other on the die, so they are less likely to overheat. A busy waiting sketch follows.
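A minimal busy waiting sketch; the queue is a stand-in for whatever the critical thread consumes, and Thread.onSpinWait() requires Java 9+.

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class BusyWait {
        static volatile boolean running = true;
        static final Queue<Runnable> TASKS = new ConcurrentLinkedQueue<>();

        public static void main(String[] args) {
            // Never parks or sleeps: the thread keeps its (pinned, isolated) core,
            // avoiding context switches and wake-up latency.
            while (running) {
                Runnable task = TASKS.poll();
                if (task != null)
                    task.run();
                else
                    Thread.onSpinWait(); // hint to the CPU; the thread stays on core
            }
        }
    }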
Larger address space collections
A key feature of Chronicle and OpenHFT's Direct Store is that they allow access to 64-bit sizes (technically only 48-bit due to OS/CPU limitations). This means you can manage TB sized data sets as one collection, with many billions of records/excerpts, natively in memory in Java. This avoids even a JNI barrier or a system call slowing down your application.
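The underlying mechanism is long based addressing of native memory, which the JDK's Unsafe exposes. A minimal sketch; sun.misc.Unsafe is an internal API, obtained here via reflection.

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class BigMemory {
        public static void main(String[] args) throws Exception {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            long size = 4L << 30; // 4 GB, well past the 2 GB limit of a ByteBuffer
            long base = unsafe.allocateMemory(size);
            unsafe.putLong(base + 3_000_000_000L, 42L); // 64-bit index, no int limit
            System.out.println(unsafe.getLong(base + 3_000_000_000L));
            unsafe.freeMemory(base);
        }
    }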
Conclusion
If your application spends 90% of its time in 10% of its code, or perhaps 80/20, you can write the 90% of the code in "natural" Java, or use third party libraries which do. However, if you write the critical 10% in low level Java, there are significant performance gains to be made.
Peter,
Interesting read. Thanks for sharing. Can you share some more details on pinning the CPU? How can we have a Java thread use an idle core?