Memory alignment in C, C++ and Java

Overview

You might assume that reducing the size of a struct or class saves the same amount of memory.    However due to memory alignment in memory allocators it can make no difference, or perhaps more difference than you might expect.    This is because the the amount of memory reserved is usually a multiple of the memory alignment. For Java and C this can 8 or 16 bytes.

Size of memory reserved

These tests were performed in a 64-bit C program (gcc 4.5.2) and a 64 JVM (Oracle Java 7) In Java, direct memory is largely a wrapper for malloc and free.
BytesC malloc() reservedJava ByteBuffer.allocateDirect()
0 to 2432 bytes32 bytes + a ByteBuffer
25 to 4048 bytes48 bytes + a ByteBuffer
41 to 5664 bytes64 bytes + a ByteBuffer
57 to 7280 bytes80 bytes + a ByteBuffer

Constructing objects is a similar story

Number of fieldsC class of int
(heap/stack)
C class of void *
(heap/stack)
 Java class 
 with int 
Java class
with Object references
132/16 bytes32/16 bytes16 bytes16 bytes
232/16 bytes32/16 bytes24 bytes24 bytes
332/16 bytes32/32 bytes24 bytes24 bytes
432/16 bytes48/32 bytes32 bytes32 bytes
532/32 bytes48/48 bytes32 bytes32 bytes
632/32 bytes64/48 bytes40 bytes40 bytes
748/32 bytes64/64 bytes40 bytes40 bytes
848/32 bytes80/64 bytes48 bytes48 bytes

Using a C struct/class on the stack is more efficient than the other approaches for a number of reasons, two of them being that there is no memory management header and no additional pointer/reference (not shown in the table).

Sun/Oracle and OpenJDK 6 and 7 JVMs will use 32-bit references and the 8 byte memory aligned and to support up to 32 GB (8 * 4 G)    Most JVMs are less than 32 GB in size making this a useful optimisation.   Note: Part of the reason that JVMs are usually 1 to 4 GB in size is that the worst case Full GC time is typically 1 second per GB of heap and a 30 second full CG time is to long for most applications.   The typical way around this full GC time problem is to keep the working size of the heap to a few GB and use an external database or heapless "direct" and "memory mapped" memory.

Another solution for Java is to use a pause less concurrent collector such as that provided by Azul.   They claim excellent scalability beyond 40 GB of heap, but don't openly list their costs ;)

Why does this matter?

Say you have a class like this

class MyClass {
    int num;
    short value;
}

In C, how much memory is saved by changing num to a short or how much more is consumed of with make it long long.    The answer is likely to be none at all (unless you have an array of these) In Java, it could make a difference as the alignment size is different. Conversely, if the C class/struct is 16 or 17 bytes, it can make the size on the stack be 16 or 32 bytes. Similarly being 24 or 25 bytes can make the malloc'ed size used 32 or 48 bytes long.

The code

MemoryAlignment/main.cpp and MemoryAlignment.java

Comments

  1. Actually, in my brief experiments stack->stack copies can be slower than heap->heap copies.

    https://gist.github.com/1102088

    ReplyDelete
  2. @Chad Brewbaker, Did you come to an conclusions as to why this might be so?

    On my PC the stack to stack copy took 26.3, 25.7, 26.5 seconds and the heap copy took 27.1, 26.7, 26.8 seconds.

    So there could be a small difference but in my case its the other way.

    ReplyDelete
  3. I would venture to guess that memcpy() being such a basic function is heavily optimized with inline assembler. Perhaps they found a performance tweak, but for some reason it is not being used for stack->stack copies. I would have to take a look at Apple's version of glibc memcpy()and compare it to the glibc you are using.

    http://www.gnu.org/s/libc/

    ReplyDelete
  4. memcpy.c from glibc-2.14

    As it is doing paging this might be a thing with the Mach kernel on OSX boxen.

    ----
    #include
    #include
    #include

    #undef memcpy

    void *
    memcpy (dstpp, srcpp, len)
    void *dstpp;
    const void *srcpp;
    size_t len;
    {
    unsigned long int dstp = (long int) dstpp;
    unsigned long int srcp = (long int) srcpp;

    /* Copy from the beginning to the end. */

    /* If there not too few bytes to copy, use word copy. */
    if (len >= OP_T_THRES)
    {
    /* Copy just a few bytes to make DSTP aligned. */
    len -= (-dstp) % OPSIZ;
    BYTE_COPY_FWD (dstp, srcp, (-dstp) % OPSIZ);

    /* Copy whole pages from SRCP to DSTP by virtual address manipulation,
    as much as possible. */

    PAGE_COPY_FWD_MAYBE (dstp, srcp, len, len);

    /* Copy from SRCP to DSTP taking advantage of the known alignment of
    DSTP. Number of bytes remaining is put in the third argument,
    i.e. in LEN. This number may vary from machine to machine. */

    WORD_COPY_FWD (dstp, srcp, len, len);

    /* Fall out and copy the tail. */
    }

    /* There are just a few bytes to copy. Use byte memory operations. */
    BYTE_COPY_FWD (dstp, srcp, len);

    return dstpp;
    }
    libc_hidden_builtin_def (memcpy)

    ReplyDelete
  5. the result varies a lot if you use a different memory allocator, for example tcmalloc. On 64bit linux, heap with one int uses 8 bytes (due to 64bit system, I guess). Heap with two int also uses 8 bytes.

    ReplyDelete
  6. @Derek Li, Are you saying that it can allocate memory without any per allocation memory overhead? Thats impressive. I assume it has to keep track of allocated/free memory somewhere.

    ReplyDelete
  7. This comment has been removed by the author.

    ReplyDelete
  8. Another benefit of TCMalloc is space-efficient representation of small objects. For example, N 8-byte objects can be allocated while using space approximately 8N * 1.01 bytes. I.e., a one-percent space overhead. ptmalloc2 uses a four-byte header for each object and (I think) rounds up the size to a multiple of 8 bytes and ends up using 16N bytes.

    copied from
    http://goog-perftools.sourceforge.net/doc/tcmalloc.html

    ReplyDelete

Post a Comment

Popular posts from this blog

Java is Very Fast, If You Don’t Create Many Objects

System wide unique nanosecond timestamps

Comparing Approaches to Durability in Low Latency Messaging Queues