On heap vs off heap memory usage
Overview
I was recently asked about the benefits and wisdom of using off heap memory in Java. The answers may be of interest to others facing the same choices.

Off heap memory is nothing special. The thread stacks, the application code and NIO buffers are all off heap. In fact, C and C++ have only unmanaged memory, as they do not have a managed heap by default. The use of managed memory, or a "heap", in Java is a special feature of the language. Note: Java is not the only language to do this.
new Object() vs object pools vs off heap memory
new Object()
Before Java 5.0, using object pools was very popular, as creating objects was still very expensive. However, from Java 5.0, object allocation and garbage collection were made much cheaper, and developers found they got a performance speed up and a simplification of their code by removing object pools and just creating new objects whenever needed. Before Java 5.0, almost any object pool provided an improvement; from Java 5.0, pooling obviously made sense only for expensive objects, e.g. threads, sockets and database connections.

Object pools
In the low latency space it was still apparent that recycling mutable objects improved performance by reducing pressure on your CPU caches. These objects have to have simple life cycles and a simple structure, but you could see significant improvements in performance and jitter by using them.

Another area where it made sense to use object pools is when loading large amounts of data with many duplicate objects. With a significant reduction in memory usage and a reduction in the number of objects the GC had to manage, you saw a reduction in GC times and an increase in throughput.
These object pools were designed to be more lightweight than, say, using a synchronized HashMap, and so they still helped.
Take this StringInterner class as an example. You pass it a recycled mutable StringBuilder containing the text you want as a String, and it will provide a String which matches. Passing a String would be inefficient, as you would have already created the object; the StringBuilder can be recycled.
Note: this structure has an interesting property: it requires no additional thread safety features, like volatile or synchronized, beyond the minimum guarantees Java provides, i.e. you can see the final fields in a String correctly and only ever read consistent references.
import net.openhft.lang.Maths; // OpenHFT Java-Lang helper class
import org.jetbrains.annotations.NotNull;
import org.jetbrains.annotations.Nullable;

public class StringInterner {
    // A fixed-size, lossy cache: a colliding entry simply overwrites the old one.
    private final String[] interner;
    private final int mask;

    public StringInterner(int capacity) {
        // round the requested capacity up to a power of two, minimum 128
        int n = Maths.nextPower2(capacity, 128);
        interner = new String[n];
        mask = n - 1;
    }

    private static boolean isEqual(@Nullable CharSequence s, @NotNull CharSequence cs) {
        if (s == null) return false;
        if (s.length() != cs.length()) return false;
        for (int i = 0; i < cs.length(); i++)
            if (s.charAt(i) != cs.charAt(i))
                return false;
        return true;
    }

    @NotNull
    public String intern(@NotNull CharSequence cs) {
        long hash = 0;
        for (int i = 0; i < cs.length(); i++)
            hash = 57 * hash + cs.charAt(i);
        int h = (int) Maths.hash(hash) & mask;
        String s = interner[h];
        if (isEqual(s, cs))
            return s;                    // cache hit: reuse the existing String
        String s2 = cs.toString();
        return interner[h] = s2;         // cache miss: replace whatever was there
    }
}
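For example, a minimal usage sketch (the demo class and names here are illustrative, not from the original article):

public class StringInternerDemo {
    public static void main(String[] args) {
        StringInterner interner = new StringInterner(1024);
        StringBuilder sb = new StringBuilder();   // recycled between messages

        for (int i = 0; i < 3; i++) {
            sb.setLength(0);                      // reuse the builder, no new garbage
            sb.append("EURUSD");                  // e.g. text parsed from a network buffer
            String symbol = interner.intern(sb);  // repeated text shares one String
            System.out.println(symbol + " identity=" + System.identityHashCode(symbol));
        }
    }
}

Each iteration prints the same identity hash code, showing that the duplicates were folded into a single String.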
Off heap memory usage
Using off heap memory and using object pools both help reduce GC pauses; this is their only similarity. Object pools are good for short lived mutable objects, objects which are expensive to create, and long lived immutable objects where there is a lot of duplication. Medium lived mutable objects, or complex objects, are more likely to be better left to the GC to handle. However, medium to long lived mutable objects suffer in a number of ways which off heap memory solves.

Off heap memory provides:
- Scalability to large memory sizes e.g. over 1 TB and larger than main memory.
- Little or no impact on GC pause times, as the GC does not have to scan it.
- Sharing between processes, reducing duplication between JVMs, and making it easier to split JVMs.
- Persistence for faster restarts or replaying of production data in test (a sketch of a memory mapped file follows this list).
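As an illustration, here is a minimal sketch of one plain-Java route to persisted, shareable off heap memory: a memory mapped file. The file name and size are illustrative; production code would normally use a library rather than raw buffers.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedCounter {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("/tmp/shared.dat", "rw");
             FileChannel ch = raf.getChannel()) {
            // Map 4 KB of the file into off heap memory. The OS persists the
            // pages, and other processes mapping the same file see the data.
            MappedByteBuffer bb = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            long n = bb.getLong(0);       // survives JVM restarts
            bb.putLong(0, n + 1);         // visible to other processes mapping the file
            System.out.println("restart count: " + (n + 1));
        }
    }
}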
The use of off heap memory gives you more options in terms of how you design your system. The most important improvement is not performance, but determinism.
Off heap and testing
One of the biggest challenges in high performance computing is reproducing obscure bugs and being able to prove you have fixed them. By storing all your input events and data off heap in a persisted way, you can turn your critical systems into a series of complex state machines (or, in simple cases, just one state machine). In this way you get reproducible behaviour and performance between test and production.
A number of investment banks use this technique to replay a system reliably to any event in the day and work out exactly why that event was processed the way it was. More importantly, once you have a fix, you can show that you have fixed the issue which occurred in production, instead of finding an issue and merely hoping it was the same one.
Along with deterministic behaviour comes deterministic performance. In test environments, you can replay the events with realistic timings and show the latency distribution you expect to get in production. Some system jitter can't be reproduced, especially if the hardware is not the same, but you can get pretty close when you take a statistical view. To avoid taking a day to replay a day of data, you can add a threshold, e.g. if the time between events is more than 10 ms, you only wait 10 ms. This can allow you to replay a day of events with realistic timing in under an hour and see whether your changes have improved your latency distribution or not. A sketch of such a capped delay replay loop follows.
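A minimal sketch, assuming each persisted event carries its original timestamp; the Event type, the 10 ms cap and the handler are illustrative:

import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class Replayer {
    // cap on the gap between replayed events (illustrative value)
    static final long MAX_GAP_NS = TimeUnit.MILLISECONDS.toNanos(10);

    // hypothetical view of a persisted input event
    interface Event {
        long timestampNs();
    }

    static void replay(List<? extends Event> events, Consumer<Event> handler)
            throws InterruptedException {
        long prevNs = -1;
        for (Event e : events) {
            if (prevNs >= 0) {
                // preserve the original inter-event gap, but never wait more
                // than the cap, so a quiet day replays in a fraction of the time
                long gapNs = Math.min(e.timestampNs() - prevNs, MAX_GAP_NS);
                TimeUnit.NANOSECONDS.sleep(gapNs);
            }
            handler.accept(e);
            prevNs = e.timestampNs();
        }
    }
}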
By going more low level, don't you lose some of "write once, run anywhere"?
To some degree this is true, but it is far less than you might think. When you work closer to the processor, you are more dependent on how the processor or the OS behaves. Fortunately, most systems use AMD/Intel processors, and even ARM processors are becoming more compatible in terms of the low level guarantees they provide. There are also differences between the OSes, and these techniques tend to work better on Linux than on Windows. However, if you develop on Mac OS X or Windows and use Linux for production, you shouldn't have any issues. This is what we do at Higher Frequency Trading.
What new problems are we creating by using off heap?
Nothing comes for free, and this is the case with off heap. The biggest issue is that your data structures become less natural. You either need a simple data structure which can be mapped directly to off heap memory, or you have a complex data structure which has to be serialized and deserialized to move it off heap. Obviously, serialization has its own headaches and performance hit, so serialized access is much slower than using on heap objects.
In the financial world, most hot tick data structures are flat and simple, full of primitives which map nicely off heap with little overhead (see the sketch below). However, this doesn't apply in all applications; you can have complex nested data structures, e.g. graphs, where you can end up having to cache some objects on heap as well.
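For example, a minimal sketch of a flat, primitives-only record written straight into a direct (off heap) ByteBuffer; the layout and field names are illustrative:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class TickStore {
    // illustrative flat layout: timestamp (8) + price (8) + quantity (4) = 20 bytes
    static final int RECORD_SIZE = 8 + 8 + 4;

    private final ByteBuffer buf;

    public TickStore(int maxTicks) {
        // direct buffers live off heap; native order avoids byte swapping
        buf = ByteBuffer.allocateDirect(maxTicks * RECORD_SIZE)
                        .order(ByteOrder.nativeOrder());
    }

    public void write(int index, long timestampNs, double price, int quantity) {
        int offset = index * RECORD_SIZE;
        buf.putLong(offset, timestampNs);
        buf.putDouble(offset + 8, price);
        buf.putInt(offset + 16, quantity);
    }

    public double readPrice(int index) {
        return buf.getDouble(index * RECORD_SIZE + 8);
    }
}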
Another problem is that you give up a safety net: the JVM normally limits how much of the system you can use, so you don't have to worry so much about overloading it. With off heap, some of those limitations are lifted, and you can use data structures much larger than main memory; you then have to start worrying about what kind of disk sub-system you have. For example, you don't want to be paging to an HDD which does 80 IOPS; instead you are likely to want an SSD with 80,000 IOPS (Input/Output Operations per Second) or better, i.e. 1000x faster.
How does OpenHFT help?
OpenHFT has a number of libraries to hide the fact you are really using native memory to store your data. These data structures are persisted and can be used with little or no garbage. They are used in applications which run all day without a minor collection.

Chronicle Queue - Persisted queue of events. Supports concurrent writers across JVMs on the same machine and concurrent readers across machines. Micro-second latencies and sustained throughputs in the millions of messages per second.
Chronicle Map - Native or persisted storage of a key-value map. Can be shared between JVMs on the same machine, replicated via UDP or TCP, and/or accessed remotely via TCP. Micro-second latencies and sustained read/write rates in the millions of operations per second per machine. (A usage sketch follows this list.)
Thread Affinity - Binding of critical threads to isolated cores or logical CPUs to minimise jitter. Can reduce jitter by a factor of 1000.
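A hedged sketch of Chronicle Map usage, assuming the Chronicle Map 3.x builder API; the file path, sizing hints and types are illustrative, so check the project documentation for the exact methods in your version:

import java.io.File;
import net.openhft.chronicle.map.ChronicleMap;

public class SharedPrices {
    public static void main(String[] args) throws Exception {
        // persisted off heap map, shareable with other JVMs mapping the same file
        try (ChronicleMap<String, String> prices = ChronicleMap
                .of(String.class, String.class)
                .averageKeySize(16)            // illustrative sizing hints
                .averageValueSize(16)
                .entries(1_000_000)
                .createPersistedTo(new File("/dev/shm/prices.dat"))) {
            prices.put("EURUSD", "1.0876");
            System.out.println(prices.get("EURUSD"));
        }
    }
}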
Which API to use?
If you need to record every event -> Chronicle Queue
If you only need the latest result for a unique key -> Chronicle Map
If you care about 20 micro-second jitter -> Thread Affinity
Conclusion
Off heap memory has its challenges, but it also comes with a lot of benefits. Where you see the biggest gain is in comparison with other solutions introduced to achieve scalability: off heap is likely to be simpler and much faster than partitioned/sharded on heap caches, messaging solutions, or out of process databases. By being faster, you may find that some of the tricks you would otherwise need for performance are no longer required, e.g. off heap solutions can support synchronous writes to the OS, instead of having to perform them asynchronously with the risk of data loss.
The biggest gains, however, can be your startup time, giving you a production system which restarts much faster (e.g. mapping in a 1 TB data set can take 10 milli-seconds), and the ease of reproducibility in test: by replaying every event in order, you get the same behaviour every time. This allows you to produce quality systems you can rely on.
Thank you Peter for a great article.
Regarding Chronicle Queue, in its OpenHFT "How it Works" section, it is mentioned that "Chronicle writes data directly into off heap memory which is shared between java processes on the same server".
How can this work when the java processes are on different machines?
Unless you publish the mapped file on a common mount point, but in that case wouldn't the latency the file system introduces be so high that the throughput would be several orders of magnitude lower?
Is there a Chronicle Queue benchmark/demo for such a scenario?
Thanks,
--Amir
When Java processes are on different machines, we use TCP replication.
When we write to an ext4 filesystem, the latency is on average 150 nano-seconds higher than writing to a tmpfs filesystem.
The best examples are the unit tests in the source, as these all work. There is a module called chronicle-demo which contains a couple of demos.
I suggest you check out
the wiki https://github.com/OpenHFT/Chronicle-Queue/blob/master/docs/HowItWorks.md#getting-started
the source https://github.com/OpenHFT/Chronicle-Queue
Thanks Peter for providing a great insight into off-heap.
Q: The Java heap is mostly contiguous (and the JVM also makes an effort to compact it on every GC cycle), while off-heap seems to be largely 'randomly linked'. Can this be an eventual disadvantage?
Yes, you can get fragmentation, but with on heap memory you can be limited to a small portion of the large servers you can buy today.
Some Java systems run on 3 TB machines and the heap doesn't scale so well at this size.
Thanks Peter. That clears up the doubt.
Couldn't resist asking an endianness question related to off-heap. Intel CPUs are LITTLE_ENDIAN; Java is BIG_ENDIAN and so is the network. Usually, ByteOrder.nativeOrder().equals(ByteOrder.BIG_ENDIAN) can provide the info on a working environment's endianness. Q: Would it be advantageous to use a BIG_ENDIAN CPU for the OpenHFT libraries, since they use ByteBuffer.allocateDirect().order(ByteOrder.nativeOrder())? This is to ask whether you can avoid hton() or ntoh() conversions while reading data from sockets on a LITTLE_ENDIAN CPU. By going off-heap, would it shave off some µs/ns?
Java defaults to big endian but supports little endian for ByteBuffer. On heap objects use the native byte ordering, and when you use native ordering off heap, there is no performance benefit either way.
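For illustration, a minimal sketch of selecting native byte order on a direct buffer (the value written is arbitrary):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocateDirect(8);
        System.out.println(bb.order());          // BIG_ENDIAN, the Java default
        bb.order(ByteOrder.nativeOrder());       // LITTLE_ENDIAN on x86
        bb.putLong(0, 0x0102030405060708L);      // stored without a byte swap
        System.out.println(Long.toHexString(bb.getLong(0)));
    }
}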
If you look at the hardware, Intel and AMD can put more transistors on a chip than anyone else for the price they charge, and this makes more difference than anything else. At the end of the day you get more performance for your money from machines which happen to be little endian.
ReplyDeleteJust glancing at the code but it looks that a string could enter your cache and leave ? In which case its lifetime is potentially extended ? i.e. in theory convert a short lived object into a medium lived one if a subsequent string is hashed to a used slot but not equal or have I misread it. If it can do that would it not be better to leave the older string in the cache and just return the new offering (uncached) ie once the cache is full its full, though I can see why its nice to cache new strings I normally prefer die quick or live forever.
In terms of latency, is off-heap access (read and write) better than heap access? I might be wrong, but my impression about off-heap memory is that its only benefits are more memory availability and the unlikely possibility that you might want to use off-heap memory for inter-process communication across JVMs. I think direct buffers are great exactly for this latter reason. But think with me: Instead of allocating a big chunk of off-heap memory, why can't you just allocate a big byte array in the heap that never goes out-of-scope and, for that reason, is never collected by the GC? My point is: unless you need more memory or want to share memory across processes, there is no reason to go off heap.