Why You Should Tune Code Before Your Garbage Collector

Optimising your memory allocations in Java could make far more difference than your choice of Garbage Collector and may even change which is the best garbage collector.

In this post I look at a simple event to response latency benchmark, MarketDataSnapshot to NewOrderSingle at 50K/s for 30 minutes using JLBH to test Chronicle-FIX. The goal is to compare a system which is doing redundant work (in this case logging each message using SLF4J), compared with not logging (Chronicle-FIX records every message internally using Chronicle Queue) and how this changes the choice of Garbage Collector

Unoptimised Benchmark

This takes the optimised benchmark and adds one SLF4J log line of just the message to be sent. One log line might not sound like much but ding this on every message makes a big difference esp when the rest of the code is written for low latency.

Table 1. RTT latency distribution with slf4j logging

GC option

p99

p99.99

Parallel large Eden, no large pages

16.86

20,480

ZGC, no large pages

12.21

19,694

G1 + COH, no large pages

12.02

20,349

G1, 2 MiB large pages

13.94

20,021

Shenandoah generational, no large pages

12.30

19,235

Note
The p99.99 are thousands of microseconds or 19 to 20 milliseconds.

The p99.99 or worst 1 in 10,000 might sound rare, however at 50K/s that is 5 times per second or 300 times per minute

Based on these results, you might conclude that Shenandoah is a good option, and avoid Parallel GC.

How much difference is the disk IO making

IO is often a significant proportion of delays, and we can see that just moving where the logs are written. In this case to a tmpfs filesystem.

Table 2. RTT latency distribution with slf4j logging to /dev/shm

GC option

p99

p99.99

Parallel large Eden, no large pages

34.50

10,994

ZGC, no large pages

10.86

9,617

G1 + COH, no large pages

10.90

10,043

G1, 2 MiB large pages

10.64

11,682

Shenandoah generational, no large pages

10.32

9,552

While the p99 for parallel GC is possibly an outlier, you still might favour Shenandoah, however just moving where the log is being written makes far more difference.

Note
The p99.99 are thousands of microseconds or 9 to 10 milliseconds.

Remove the redundant logging

Chronicle-FIX already records every message using Chronicle-FIX so the slf4j log is redundant. For low latency coding we use Chronicle-Queue for all recording and almost no logging as it can be the biggest source of delays otherwise. As a policy, we don’t have info level logging after startup. ie. either its an error/warning and on by default, or a debug logging and off by default. ie. make a decision as to whether it’s really needed or not.

Table 3. RTT latency distribution with only Chronicle-Queue recording

GC option

p99

p99.99

Parallel large Eden, no large pages

7.14

8.94

ZGC, no large pages

7.82

9.36

G1 + COH, no large pages

8.30

12.4

G1, 2 MiB large pages

8.59

154.4

Shenandoah generational, no large pages

7.83

233.7

As you would expect, the p99 has dropped about 2 microseconds, however what is significant is that the p99.99 has dropped by three orders of magnitude. Also, the preferred garbage collector has flipped due to the change in work load, all for the sake of one log line. In this case, parallel GC looks best, and Shenandoah came out worst on the p99.99.

Note
The p99.99 are microseconds or more than a 1000x faster than logging to /dev/shm

Conclusion

The choice of the best Garbage Collector and how to tune it depends on your workload. It makes sense to ensure your workload has been optimised first because a) it can make more difference in some cases, b) it can change your preference as well.

Comments

Popular posts from this blog

Demystifying Java Object Sizes: Compact Headers, Compressed Oops, and Beyond

Asking multiple AI to optimise the same code

Updated Biography