Trivially Copyable Objects in Java

TL;DR
  • Problem: Java’s standard serialisation can be slow due to scattered object fields and reflection-based overhead.
  • Approach: Emulate C++-style trivially copyable objects by restricting fields to primitives, enabling bulk memory copies.
  • Result: Near C++-like serialisation performance, dramatically reducing latency and improving throughput.
  • Trade-offs: Requires careful design, limited flexibility, and testing for JVM compatibility.
  • Outcome: Low-latency systems with high performance, suitable for financial data feeds, real-time analytics, and other latency-sensitive domains.

Introduction

For low-latency systems, every microsecond has tangible business impact. In high-frequency trading, real-time analytics, and similarly time-sensitive workloads, even minor inefficiencies in serialisation and deserialisation can degrade throughput and responsiveness. The seemingly mundane act of converting objects into bytes and back often becomes a performance bottleneck.

This article explores how we can emulate a C++-like concept of Trivially Copyable Objects within Java to achieve far more efficient serialisation. By ensuring objects contain only fixed-size primitives, we can sidestep the traditional overheads of object graph traversal, reflection, and per-field copying. Instead, we can treat them as contiguous memory blocks, dramatically reducing the time taken to read and write data. We shall also consider how to get very close to this performance using Chronicle Bytes without strictly requiring trivial copyability. This technique blends the low-level efficiency with the safety and familiarity of Java’s ecosystem.

The Core Challenge of Java Serialisation

Most Java object graphs are composed of references linking scattered heap allocations. Serialising such objects typically involves visiting numerous memory locations, reading each field individually, and writing them out one at a time. This is akin to foraging around a warehouse for individual items whenever you need to pack a box. Although this might be fine at moderate data rates, it does not scale well when the volume and frequency of messages skyrocket.

In latency-critical domains—such as algorithmic trading, telemetry processing, or high-throughput messaging—this overhead is costly. Minimising the time to transform data from in-memory objects to a binary format and back can make the difference between meeting a service-level agreement (SLA) and missing it. The objective: reduce the cost of serialisation to something approaching a single memcpy operation.

Emulating Trivially Copyable Objects in Java

C++ defines a Trivially Copyable Object as one that can be safely cloned by copying its memory representation directly (e.g. using memcpy). This guarantees efficient and predictable memory layouts. Java does not expose this concept directly, but we can achieve a similar result.

If an object is composed solely of primitive fields—long, int, double, etc.—with known fixed sizes, it resembles a flat, contiguous block of data. This allows us to perform a bulk copy of its contents in one go. The result is similar to how C++ handles trivially copyable types, but now we are in Java’s safe, garbage-collected environment.

Steps to Achieve a Contiguous Layout

  1. Restrict to Primitives Only: Avoid String, List, or other reference fields. Instead, store identifiers, timestamps, and short textual representations as primitive longs (using converters like @ShortText or @NanoTime) to maintain a fixed-size layout.
  2. Use Fixed-Size Fields: Ensure that all fields have a predictable size. This not only simplifies the copy but also makes the data layout predictable.
  3. Leverage Chronicle Utilities: Chronicle Bytes provides methods like unsafeReadObject() and unsafeWriteObject() to perform bulk memory operations. These bypass the overhead of per-field reflection or loops.

By structuring your objects this way, you remove the CPU overhead of repeatedly checking field offsets and types. Instead, you treat the entire object as a known layout of bytes that can be slurped up or written out in a single shot.

Not every system can fully adopt trivially copyable objects. Some may need reference fields. In such cases, you can still improve performance incrementally:

  • Explicit Serialisation Methods: By implementing custom readMarshallable() and writeMarshallable() methods, you can avoid reflection overhead. This alone can drastically reduce latency, as shown by the benchmarks below.
  • Direct Memory Access: For a moderate increase in complexity, explicitly reading and writing fields by their offsets using Chronicle Bytes lowers overhead even further. While more effort is required, this approach narrows the gap to trivial copyability.
  • Full Trivial Copying: For the absolute best performance, treat the entire object as a single contiguous memory block. Bulk copying here is the key to near C++-style efficiency.

Code Examples

Consider a MarketData DTO with many fields. A straightforward SelfDescribingMarshallable class might rely on reflection or explicit read/write methods:

abstract class MarketData extends SelfDescribingMarshallable {
    @ShortText
    long securityId;

    @NanoTime
    long time;

    int bidQty0, bidQty1, bidQty2, bidQty3;
    int askQty0, askQty1, askQty2, askQty3;

    double bidPrice0, bidPrice1, bidPrice2, bidPrice3;
    double askPrice0, askPrice1, askPrice2, askPrice3;

    // Getters and setters omitted for clarity
}

Default approaches often rely on reflection or self-describing fields. While convenient, this is not the fastest method. Explicitly coding each field read and write is faster:

public final class ExplicitMarketData extends MarketData {

    @Override
    public void readMarshallable(BytesIn bytes) {
        securityId = bytes.readLong();
        time = bytes.readLong();
        bidQty0 = bytes.readDouble();
        // ... repeated for all fields
    }

    @Override
    public void writeMarshallable(BytesOut bytes) {
        bytes.writeLong(securityId);
        bytes.writeLong(time);
        // ... repeated for all fields
    }
}

For even lower overhead, we might write a DirectMarketData class that manually calculates offsets:

public final class DirectMarketData extends MarketData {
    @Override
    public void readMarshallable(BytesIn bytes) {
        BytesStore<?, ?> bs = ((Bytes<?>) bytes).bytesStore();
        long position = bytes.readPosition();
        // generated by GitHub Copilot
        bytes.readSkip(112);
        securityId = bs.readLong(position+0);
        time = bs.readLong(position+8);
        // ... repeated for all fields
    }

    @Override
    public void writeMarshallable(BytesOut bytes) {
        BytesStore<?, ?> bs = ((Bytes<?>) bytes).bytesStore();
        long position = bytes.writePosition();

        // generated by GitHub Copilot
        bytes.writeSkip(112);
        bs.writeLong(position+0, securityId);
        bs.writeLong(position+8, time);
        // ... repeated for all fields
    }
}

Finally, a TriviallyCopyableMarketData class uses Chronicle’s unsafeReadObject() and unsafeWriteObject() methods to perform a single bulk copy:

public final class TriviallyCopyableMarketData extends MarketData {
    static final int START =
        triviallyCopyableStart(TriviallyCopyableMarketData.class);
    static final int LENGTH =
        triviallyCopyableLength(TriviallyCopyableMarketData.class);

    @Override
    public void readMarshallable(BytesIn bytes) {
        bytes.unsafeReadObject(this, START, LENGTH);
    }

    @Override
    public void writeMarshallable(BytesOut bytes) {
        bytes.unsafeWriteObject(this, START, LENGTH);
    }
}

These methods bypass iterative per-field copying. Instead, they use knowledge of the object’s layout to copy memory in one go.

The Benchmark Results

Running benchmarks on a high-end CPU (e.g. a Ryzen 5950X) shows the progressive improvements:

Benchmark                              Mode  Cnt     Score    Error  Units
BenchmarkRunner.defaultWriteRead       avgt   25  1204.359 ± 72.394  ns/op
BenchmarkRunner.defaultBytesWriteRead  avgt   25   375.479 ±  6.066  ns/op
BenchmarkRunner.explicitWriteRead      avgt   25    45.769 ±  0.661  ns/op
BenchmarkRunner.directWriteRead        avgt   25    27.303 ±  0.867  ns/op
BenchmarkRunner.trivialWriteRead       avgt   25    25.568 ±  0.228  ns/op

Here, trivialWriteRead approaches raw memory copy speeds, slashing overhead by more than an order of magnitude compared to default approaches. The directWriteRead is very close in terms of performance but isn't impacted by layout changes in the JVM.

Considerations and Caveats

  1. JVM Stability: While typically stable, relying on certain low-level assumptions may differ slightly between JVM versions or distributions. Test carefully if you need cross-JVM compatibility.
  2. Loss of Flexibility: Restricting fields to primitives means losing some convenience. Often, you can mitigate this by mapping strings or enumerations to integers, or converting short texts via @ShortText, and timestamps with @NanoTime.
  3. Schema Evolution: Changes to object structures require coordination. Both sender and receiver must remain compatible. Use versioning strategies and robust integration tests.
  4. Nearly Trivial Without Going Fully Trivial: If you cannot fully restrict yourself to primitives, consider direct copying of at least the performance-critical parts of the data, and handle the rest with explicit methods.
  5. Leverage Chronicle’s Tooling: Chronicle Bytes and Queue provide the building blocks. While they add complexity, the performance pay-off justifies it in latency-critical systems.

Key Points

  • Treating objects as contiguous blocks of primitive fields significantly reduces serialisation overhead.
  • Moving from self-describing, reflective approaches to explicit field reads/writes yields large gains.
  • Using direct memory offsets or bulk copying is yet more efficient, approaching C++-like speeds.
  • While not free of trade-offs, trivial copyability offers a compelling pattern for systems where latency and throughput trump convenience.

Try It Yourself

Why not measure the impact on your own workload? The benchmark harness is available here:

BenchmarkRunner.java on GitHub

Run it with JMH to see if trivial copyability can enhance your system’s performance. Experiment with different layouts, measure the impact, and adopt the approach incrementally.

This Article Is Based On...

This article is an update of two articles by Per Minborg How to Get C++ Speed in Java Serialisation and Did You Know the Fastest Way of Serializing a Java Field is not Serializing it at All? It builds on the original concepts and benchmarks, providing a fresh perspective on achieving ultra-low-latency Java systems.

Conclusion

Java may not natively support trivially copyable objects, but we can still achieve near C++-like serialisation speeds by restructuring data and using Chronicle’s low-level operations. By experimenting with these techniques and applying them judiciously, developers can build ultra-low-latency Java systems that confidently handle high-throughput workloads. If you have been searching for that extra edge in performance, give trivial copyability—or its direct-copy variants—a try. It might just be the key to unlocking new levels of efficiency.

Resources

Comments

Popular posts from this blog

Java is Very Fast, If You Don’t Create Many Objects

System wide unique nanosecond timestamps

Comparing Approaches to Durability in Low Latency Messaging Queues