Why You Should Tune Code Before Your Garbage Collector

Optimising your memory allocations in Java could make far more difference than your choice of Garbage Collector and may even change which is the best garbage collector.

In this post I look at a simple event to response latency benchmark, MarketDataSnapshot to NewOrderSingle at 50K/s for 30 minutes using JLBH to test Chronicle-FIX. The goal is to compare a system which is doing redundant work (in this case logging each message using SLF4J), compared with not logging (Chronicle-FIX records every message internally using Chronicle Queue) and how this changes the choice of Garbage Collector

For the p99 (worst 1 in 100) the choice of Garbage Collector makes a different on par with optimising how loggin is done

However, for the p99.99 (worst 1 in 10,000) optimsing how the logging is done is orders of magnitude more signifciant than the choice of Garbage Collector

Unoptimised Benchmark

This takes the optimised benchmark and adds one SLF4J log line of just the message to be sent. One log line might not sound like much but ding this on every message makes a big difference esp when the rest of the code is written for low latency.

Table 1. RTT latency distribution with slf4j logging
GC option	p99	p99.99
Parallel large Eden, no large pages	16.86	20,480
ZGC, no large pages	12.21	19,694
G1 + COH, no large pages	12.02	20,349
G1, 2 MiB large pages	13.94	20,021
Shenandoah generational, no large pages	12.30	19,235

Note	The p99.99 are thousands of microseconds or 19 to 20 milliseconds.

The p99.99 or worst 1 in 10,000 might sound rare, however at 50K/s that is 5 times per second or 300 times per minute

Based on these results, you might conclude that Shenandoah is a good option, and avoid Parallel GC.

How much difference is the disk IO making

IO is often a significant proportion of delays, and we can see that just moving where the logs are written. In this case to a tmpfs filesystem.

Table 2. RTT latency distribution with slf4j logging to /dev/shm
GC option	p99	p99.99
Parallel large Eden, no large pages	34.50	10,994
ZGC, no large pages	10.86	9,617
G1 + COH, no large pages	10.90	10,043
G1, 2 MiB large pages	10.64	11,682
Shenandoah generational, no large pages	10.32	9,552

While the p99 for parallel GC is possibly an outlier, you still might favour Shenandoah, however just moving where the log is being written makes far more difference.

Note	The p99.99 are thousands of microseconds or 9 to 10 milliseconds.

Remove the redundant logging

Chronicle-FIX already records every message using Chronicle-FIX so the slf4j log is redundant. For low latency coding we use Chronicle-Queue for all recording and almost no logging as it can be the biggest source of delays otherwise. As a policy, we don’t have info level logging after startup. ie. either its an error/warning and on by default, or a debug logging and off by default. ie. make a decision as to whether it’s really needed or not.

Table 3. RTT latency distribution with only Chronicle-Queue recording
GC option	p99	p99.99
Parallel large Eden, no large pages	7.14	8.94
ZGC, no large pages	7.82	9.36
G1 + COH, no large pages	8.30	12.4
G1, 2 MiB large pages	8.59	154.4
Shenandoah generational, no large pages	7.83	233.7

As you would expect, the p99 has dropped about 2 microseconds, however what is significant is that the p99.99 has dropped by three orders of magnitude. Also, the preferred garbage collector has flipped due to the change in work load, all for the sake of one log line. In this case, parallel GC looks best, and Shenandoah came out worst on the p99.99.

Note	The p99.99 are microseconds or more than a 1000x faster than logging to /dev/shm

Conclusion

The choice of the best Garbage Collector and how to tune it depends on your workload. It makes sense to ensure your workload has been optimised first because a) it can make more difference in some cases, b) it can change your preference as well.

Vanilla Java