Java Lambdas and Low Latency

Overview

The main question around the use of Lambdas in Java and Low Latency is; Does they produce garbage and is there anything you can do about it? 

Background

I am working on a library which supports different wire protocols.  The idea being that you can describe the data you want to write/read and the wire protocol determines if it uses text with fields like JSon or YAML, text with field numbers like FIX, binary with field names like BSON or a Binary form of YAML, binary with fields name, field numbers or no field meta at all.  The values can be fixed length, variables length and/or self describing data types.

The idea being that it can handle a variety of schema changes or if you can determine the schema is the same e.g. over a TCP session, you can skip all that and just send the data.

Another big idea is using lambdas to support this.

What is the problem with Lambdas

The main issue is the need to avoid significant garbage in low latencies applications.  Notionally, every time you see lambda code this is a new Object.  

Fortunately, Java 8 has significantly improved Escape Analysis. Escape Analysis allows the JVM to replace new Object by unpacking them onto the stack, effectively giving you stack allocation. This feature was available in Java 7 however it rarely eliminated objects.  Note: when you use a profiler it tends to prevent Escape Analysis from working so you can't trust profilers that use code injection as the profiler might say an object is being creation when without the profiler it doesn't create an object.  Flight Recorder does appear to mess with Escape Analysis.

Escape Analysis has always had quirks and it appears that it still does.  For example, if you have an IntConsumer or any other primitive consumer, the allocation of the lambda can be eliminated in Java 8 update 20 - update 40. However, the exception being boolean where this doesn't appear to happen.  Hopefully this will be fixed in a future version.

Another quirk is that the size (after inlining) of the method where the object elimination occurs matters and in relatively modest methods, escape analysis can give up.

A specific case

In my case I have a read method which looks like this

public void readMarshallable(Wire wire) throws StreamCorruptedException {
    wire.read(Fields.I).int32(this::i)
            .read(Fields.J).int32(this::j)
            .read(Fields.K).int32(this::k)
            .read(Fields.L).int32(this::l)
            .read(Fields.M).int32(this::m)
            .read(Fields.N).int32(this::n)
            .read(Fields.O).int32(this::o)
            .read(Fields.P).int32(this::p)
            .read(Fields.Q).int32(this::q)
            .read(Fields.R).int32(this::r)
            .read(Fields.S).int32(this::s)
            .read(Fields.T).int32(this::t)
            .read(Fields.U).int32(this::u)
            .read(Fields.V).int32(this::v)
            .read(Fields.W).int32(this::w)
            .read(Fields.X).int32(this::x)
    ;
}


I am using lambdas for setting the fields  the framework can handle optional, missing or out of order fields.  In the optimal case, the fields are available in the order provided. In the case of a schema change, the order may be different or have a different set of fields.  The use of lambdas allows the framework to handle in order and out of order fields differently.


Using this code, I performed a test, serializing and deserializing the object 10 million times.  I configured the JVM to have an eden size of 10 MB with -Xmn14m -XX:SurvivorRatio=5 The Eden space 5x the two survivor spaces with ratio 5:2. The Eden space is 5/7th of the total young generation i.e. 10 MB.

By having an Eden size of 10 MB and 10 million tests I can estimate the garbage created by counting the number of GCs printed by -verbose:gc  For every GC I get, an average of one byte per test was crated.  When I varied the number of fields serialized and deserialized I got the following result on an Intel i7-3970X.


In this chart you can see that for 1 to 8 fields deserialized i.e. up to 8 lambdas in the same method, there is almost no garbage created i.e. at most one GC. However at 9 or more fields or lambdas, the escape analysis fails and you get garbage being created, increasing linearly with the number of fiedls.

I wouldn't want you to believe that 8 is some magic number.  It is far more likely to be a limit of the size in bytes of the method, though I couldn't find such a command line setting.  The difference occurs when the method grew to 170 bytes.

Is there anything which can be done?  The simplest "fix" turned out to be breaking the code into two methods (possibly more if needed) by deserializing half the fields in one method and half the fields in another, it was able to deserialize 9 to 16 fields without garbage.  This is the "bytes(2)" and "ns (2)" results.  By eliminating garbage the code also runs faster on average.

Note: the time to serialize and deserialize an object with 14 x 32-bit integer was under 100 ns.

Other notes:

When I used a profiler, YourKit in this case, code which produced no garbage started producing garbage as the Escape Analysis failed.

I printed the method inlining and found assert statements in some key methods prevented them from being inlined as it made the methods larger.  I fixed this by creating a sub-class of by main class with assertions on to be created by a factory method when assertions are enabled.  The default class has no assertions and no performance impact.

Before I moved these assertions I could only deserialize 7 fields without triggering garbage.

When I replaced the lambdas with anonymous inner classes, I saw similar object elimination though in most cases if you can use lambda that is preferred.

Conclusion

Java 8 appears to be much smarter at removing garbage produce by very short lived objects.  This means that techniques such as passing lambdas can be an option in Low Latency applications.

EDIT (version 2)

By reading the OpenJDK source I found this option -XX:BCEATraceLevel=3 which prints whether a method has had Escape Analysis applied or not.  I saw the message

Skipping method because: code size (271) exceeds MaxBCEAEstimateSize (150).

So I raised the -XX:MaxBCEAEstimateSize=300 and the method above performs the Escape Analysis for the larger methods I am using.

Comments

  1. Hi Peter - not sure if my first comment was actually posted, repeating - so, with -XX:MaxBCEAEstimateSize=300 are you able to benefit from EA even in cases where you have more lambdas (e.g. more than 8, in your example) in a single method? I can't imagine that wouldn't have side effects on other optimizations in HS, though... Thanks, Patrick

    ReplyDelete
    Replies
    1. The main side effect is longer compilation time for all methods of between 151 and 300 bytes even if it doesn't help much.

      Delete
    2. Maybe worth mentioning, in the context of low latency Java apps, even if not very new anymore: http://martinfowler.com/articles/lmax.html - I'm sure at least part of the techniques presented there are also useful with Java 8. I've saw it presented on infoq by one of the programmers in the project a few years ago, then read Fowler's paper a few weeks ago, so when I read the post above it immediately rang a bell.

      Delete
  2. You seem to have a framework method which follows this pattern:

    MyType int32(IntConsumer c) {
    if(condition) c.accept(producedValue);
    return this;
    }

    It will then be called at UseSite like int32(this::x) which may create a new instance as long as escape analysis doesn’t elide it, because it captures the value of «this». In contrast, method references which do not capture values are implemented as singletons in Oracle’s implementation and do not need escape analysis at all. A little code transformation may help here:

    MyType int32(T context, ObjIntConsumer c) {
    if(condition) c.accept(context, producedValue);
    return this;
    }

    This method still has the same logic but allows an arbitrary object to be passed through. Now it may get invoked at UseSite like int32(this, UseSite::x) and the method reference does not capture this. Using this pattern might be more promising than splitting methods into smaller ones.

    See also http://stackoverflow.com/a/27524543/2711488

    ReplyDelete
    Replies
    1. This is a good idea. I will look at using this to avoid the ambiguity of escape analysis/

      Delete
  3. Valuable information about Lambdas. I'm using webfirmframework and lambda is frequently used in our application because it make better code readable & maintainable. Anyway, thanks for sharing this post.

    ReplyDelete

Post a Comment

Popular posts from this blog

Java is Very Fast, If You Don’t Create Many Objects

System wide unique nanosecond timestamps

Comparing Approaches to Durability in Low Latency Messaging Queues