Try optimising the memory consumption first

Overview

You would think that if you wanted your application to go faster you would start with CPU profiling.  However, when looking for quick wins, it's the memory profiler I reach for first.

Allocating memory is cheap

Allocating memory has never been cheaper.  Memory itself is cheap: you can get machines with thousands of GBs of memory, and you can buy 16 GB for less than $200.

The memory allocation operation is cheaper than in the past, and it's multi-threaded so it scales reasonably well.

However, memory allocation is not free.  Your CPU cache is a precious resource, especially if you are trying to use multiple threads.  While you can buy 16 GB of main memory easily, you might only have 2 MB of cache per logical CPU.  If you want these CPUs to run independently, you want to spend as much time as possible within the 256 KB L2 cache.

Cache level    Size                              Access time (clock cycles)    Concurrency
1              32 KB data + 32 KB instruction    1                             cores independent
2              256 KB                            3                             cores independent
3              3 MB - 32 MB                      10-20                         sockets independent
Main memory    4 MB - 4 TB                       200+                          each memory region separate

Allocating memory is not linear

Allocating memory on the heap is not linear.  The CPU is very good at doing things in parallel.  This means that if memory bandwidth is not your main bottleneck, the rate at which you produce garbage has less impact than whatever your bottleneck is.  However, if the allocation rate is high enough (and in most Java systems it is high), it will be a serious bottleneck.

You can tell if the allocation rate is a bottleneck if:
  • You are close to the maximum allocation rate of the machine.  Write a small test which creates lots of garbage and measure the allocation rate.  If you are close to this, you have a problem.
  • When you reduce the garbage produced by, say, 10%, the 99th percentile latency of the application becomes 10% faster, and yet the allocation rate hardly drops.  This means your application sped up until it reached the same bottleneck again.
  • You have very long pause times, e.g. into the seconds.  At this point, your memory consumption has a very high impact on your performance, and reducing the memory consumption and allocation rate can improve scalability (how many requests you can process concurrently) as well as reduce your worst case jitter.
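The "small test which creates lots of garbage" from the first point could look something like the sketch below.  It is only a rough, single-threaded estimate, not a proper benchmark: the class name, the `sink` field and the 128-byte array size are assumptions made for illustration, and the figure counts array payload only, not object headers.

```java
/** Rough, single-threaded estimate of how fast this JVM can allocate short-lived objects. */
final class AllocationRateTest {
    // Hypothetical sink: publishing each array to a static field makes it
    // harder for the JIT to optimise the allocation away.
    static byte[] sink;

    /** Allocates arrays of arraySize bytes for roughly durationMillis and returns MB/s. */
    static double measureMBPerSecond(int arraySize, long durationMillis) {
        long durationNanos = durationMillis * 1_000_000L;
        long start = System.nanoTime();
        long allocatedBytes = 0;
        long elapsedNanos;
        do {
            // Check the clock once per batch so the timing overhead stays small.
            for (int i = 0; i < 1_000; i++) {
                sink = new byte[arraySize];
                allocatedBytes += arraySize; // payload only, ignores the object header
            }
            elapsedNanos = System.nanoTime() - start;
        } while (elapsedNanos < durationNanos);
        return (allocatedBytes / 1e6) / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        measureMBPerSecond(128, 1_000); // warm-up so the loop gets JIT-compiled
        double rate = measureMBPerSecond(128, 3_000);
        System.out.printf("~%.0f MB/s allocated by one thread%n", rate);
    }
}
```

To approximate the maximum for the whole machine you would probably run one such loop per core and add up the results, then compare that with the allocation rate your application reports in its GC logs.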

Is there a way to see CPU and memory at the same time?

After reducing the allocation rate, I look at the CPU consumption with memory tracing turned on.  This gives more weight to the memory allocations and gives you a different view from looking at CPU alone.

Only when this combined CPU and memory view looks clean, or at least has no quick wins, do I look at CPU profiling alone.
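If you want a rough combined reading for one piece of code without a profiler, a sketch like the following reports CPU time and bytes allocated side by side.  This is not how a profiler's memory trace works; it is a lightweight substitute that assumes a HotSpot-based JVM, where `com.sun.management.ThreadMXBean` is available.  The names `CpuAndAllocationProbe`, `Cost` and `sink` are hypothetical.

```java
import java.lang.management.ManagementFactory;

/** Measures CPU time and allocated bytes for a task run on the calling thread. */
final class CpuAndAllocationProbe {
    // Hypothetical sink so the example workload's result is not discarded.
    static Object sink;

    static final class Cost {
        final long cpuNanos;
        final long allocatedBytes;

        Cost(long cpuNanos, long allocatedBytes) {
            this.cpuNanos = cpuNanos;
            this.allocatedBytes = allocatedBytes;
        }
    }

    static Cost measure(Runnable task) {
        // HotSpot-specific extension of the standard ThreadMXBean (an assumption
        // about the JVM in use); the cast fails on JVMs that do not provide it.
        com.sun.management.ThreadMXBean threads =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        if (!threads.isThreadAllocatedMemorySupported()) {
            throw new UnsupportedOperationException("per-thread allocation counters unavailable");
        }
        long id = Thread.currentThread().getId();
        long cpuBefore = threads.getCurrentThreadCpuTime();
        long bytesBefore = threads.getThreadAllocatedBytes(id);
        task.run();
        long bytesAfter = threads.getThreadAllocatedBytes(id);
        long cpuAfter = threads.getCurrentThreadCpuTime();
        return new Cost(cpuAfter - cpuBefore, bytesAfter - bytesBefore);
    }

    public static void main(String[] args) {
        Cost cost = measure(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i); // the backing array is regrown repeatedly, creating garbage
            }
            sink = sb.toString();
        });
        System.out.printf("cpu=%d us, allocated=%d KB%n",
                cost.cpuNanos / 1_000, cost.allocatedBytes / 1_024);
    }
}
```

The counters are approximate and only cover the calling thread, so treat the numbers as a hint about where to point the profiler rather than as a measurement.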

Using these techniques as a starting point, my aim is typically to reduce the 99th percentile latency (the worst 1%) by a factor of 10.  However, this approach can also increase the throughput of each thread as well as allow you to run more threads concurrently in an efficient manner.
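To track that 99th percentile before and after a change, you need to record individual latencies and not just an average.  A minimal sketch using the nearest-rank definition is shown here; the class and method names are hypothetical, and a real system would more likely use a histogram library than sort raw samples.

```java
import java.util.Arrays;

/** Nearest-rank percentile over a set of recorded latencies. */
final class LatencyPercentile {
    /** Returns the smallest sample such that at least pct% of samples are at or below it. */
    static long percentile(long[] latenciesNanos, double pct) {
        if (latenciesNanos.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesNanos.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Made-up samples: mostly fast requests with a few slow outliers.
        long[] samples = new long[1_000];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = (i % 100 == 0) ? 5_000_000L : 50_000L + i;
        }
        System.out.printf("50%%: %d ns, 99%%: %d ns, worst: %d ns%n",
                percentile(samples, 50), percentile(samples, 99), percentile(samples, 100));
    }
}
```

Comparing the 99% figure across runs shows whether reducing garbage is actually helping the worst 1% of requests, which an average would hide.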

For more information


The profiler I use is YourKit and the IDE I use is IntelliJ.  An excellent tool for visualising your allocation rate and GC timings is Censum.

We offer Advanced Java Training with hands-on exercises for individuals, as well as Corporate Training, which can be tailored and is more cost effective per person.

Comments

  1. Switched to JMC/Flight Recorder for allocation profiling and never looked back. Much better accuracy, next to no overhead.

  2. The mantra on the hot path is usually: "Best defence, not be there" meaning if you can avoid doing anything on the fast path, you probably should. This applies to construction costs as well as anything.
    I would separate pure allocation from constructing new object as the costs are very different.

