Why You Should Tune Code Before Your Garbage Collector
Optimising your memory allocations in Java could make far more difference than your choice of Garbage Collector and may even change which is the best garbage collector.
In this post I look at a simple event to response latency benchmark, MarketDataSnapshot to NewOrderSingle at 50K/s for 30 minutes using JLBH to test Chronicle-FIX. The goal is to compare a system which is doing redundant work (in this case logging each message using SLF4J), compared with not logging (Chronicle-FIX records every message internally using Chronicle Queue) and how this changes the choice of Garbage Collector
Unoptimised Benchmark
This takes the optimised benchmark and adds one SLF4J log line of just the message to be sent. One log line might not sound like much but ding this on every message makes a big difference esp when the rest of the code is written for low latency.
GC option |
p99 |
p99.99 |
Parallel large Eden, no large pages |
16.86 |
20,480 |
ZGC, no large pages |
12.21 |
19,694 |
G1 + COH, no large pages |
12.02 |
20,349 |
G1, 2 MiB large pages |
13.94 |
20,021 |
Shenandoah generational, no large pages |
12.30 |
19,235 |
|
Note
|
The p99.99 are thousands of microseconds or 19 to 20 milliseconds. |
The p99.99 or worst 1 in 10,000 might sound rare, however at 50K/s that is 5 times per second or 300 times per minute
Based on these results, you might conclude that Shenandoah is a good option, and avoid Parallel GC.
How much difference is the disk IO making
IO is often a significant proportion of delays, and we can see that just moving where the logs are written. In this case to a tmpfs filesystem.
GC option |
p99 |
p99.99 |
Parallel large Eden, no large pages |
34.50 |
10,994 |
ZGC, no large pages |
10.86 |
9,617 |
G1 + COH, no large pages |
10.90 |
10,043 |
G1, 2 MiB large pages |
10.64 |
11,682 |
Shenandoah generational, no large pages |
10.32 |
9,552 |
While the p99 for parallel GC is possibly an outlier, you still might favour Shenandoah, however just moving where the log is being written makes far more difference.
|
Note
|
The p99.99 are thousands of microseconds or 9 to 10 milliseconds. |
Remove the redundant logging
Chronicle-FIX already records every message using Chronicle-FIX so the slf4j log is redundant. For low latency coding we use Chronicle-Queue for all recording and almost no logging as it can be the biggest source of delays otherwise. As a policy, we don’t have info level logging after startup. ie. either its an error/warning and on by default, or a debug logging and off by default. ie. make a decision as to whether it’s really needed or not.
GC option |
p99 |
p99.99 |
Parallel large Eden, no large pages |
7.14 |
8.94 |
ZGC, no large pages |
7.82 |
9.36 |
G1 + COH, no large pages |
8.30 |
12.4 |
G1, 2 MiB large pages |
8.59 |
154.4 |
Shenandoah generational, no large pages |
7.83 |
233.7 |
As you would expect, the p99 has dropped about 2 microseconds, however what is significant is that the p99.99 has dropped by three orders of magnitude. Also, the preferred garbage collector has flipped due to the change in work load, all for the sake of one log line. In this case, parallel GC looks best, and Shenandoah came out worst on the p99.99.
|
Note
|
The p99.99 are microseconds or more than a 1000x faster than logging to /dev/shm |
Conclusion
The choice of the best Garbage Collector and how to tune it depends on your workload. It makes sense to ensure your workload has been optimised first because a) it can make more difference in some cases, b) it can change your preference as well.
Comments
Post a Comment