Memory alignment in C, C++ and Java


You might assume that reducing the size of a struct or class saves the same amount of memory. However, due to memory alignment in memory allocators, it can make no difference, or it can make more difference than you might expect. This is because the amount of memory reserved is usually a multiple of the memory alignment. For Java and C this can be 8 or 16 bytes.

Size of memory reserved

These tests were performed with a 64-bit C program (gcc 4.5.2) and a 64-bit JVM (Oracle Java 7). In Java, direct memory is largely a wrapper for malloc and free.
Bytes       C malloc() reserved   Java ByteBuffer.allocateDirect()
0 to 24     32 bytes              32 bytes + a ByteBuffer
25 to 40    48 bytes              48 bytes + a ByteBuffer
41 to 56    64 bytes              64 bytes + a ByteBuffer
57 to 72    80 bytes              80 bytes + a ByteBuffer

Constructing objects is a similar story

Number of fields   C class of int   C class of void *   Java class with int   Java class with Object references
1                  32/16 bytes      32/16 bytes         16 bytes              16 bytes
2                  32/16 bytes      32/16 bytes         24 bytes              24 bytes
3                  32/16 bytes      32/32 bytes         24 bytes              24 bytes
4                  32/16 bytes      48/32 bytes         32 bytes              32 bytes
5                  32/32 bytes      48/48 bytes         32 bytes              32 bytes
6                  32/32 bytes      64/48 bytes         40 bytes              40 bytes
7                  48/32 bytes      64/64 bytes         40 bytes              40 bytes
8                  48/32 bytes      80/64 bytes         48 bytes              48 bytes

Using a C struct/class on the stack is more efficient than the other approaches for a number of reasons, two of them being that there is no memory management header and no additional pointer/reference (not shown in the table).

Sun/Oracle and OpenJDK 6 and 7 JVMs use 32-bit references with 8-byte memory alignment to support heaps of up to 32 GB (8 × 4 G). Most JVM heaps are less than 32 GB in size, making this a useful optimisation. Note: part of the reason that JVM heaps are usually 1 to 4 GB in size is that the worst-case full GC time is typically 1 second per GB of heap, and a 30-second full GC pause is too long for most applications. The typical way around this full GC time problem is to keep the working size of the heap to a few GB and use an external database or heapless "direct" and "memory mapped" memory.

Another solution for Java is to use a pauseless concurrent collector such as that provided by Azul. They claim excellent scalability beyond 40 GB of heap, but don't openly list their costs ;)

Why does this matter?

Say you have a class like this

class MyClass {
    int num;
    short value;
};

In C, how much memory is saved by changing num to a short, or how much more is consumed by making it a long long? The answer is likely to be none at all (unless you have an array of these). In Java, it could make a difference, as the alignment size is different. Conversely, if the C class/struct is 16 or 17 bytes, that can make the size on the stack 16 or 32 bytes. Similarly, being 24 or 25 bytes can make the malloc'ed size used 32 or 48 bytes long.

The code

MemoryAlignment/main.cpp


  1. Actually, in my brief experiments stack->stack copies can be slower than heap->heap copies.

  2. @Chad Brewbaker, Did you come to any conclusions as to why this might be so?

    On my PC the stack to stack copy took 26.3, 25.7, 26.5 seconds and the heap copy took 27.1, 26.7, 26.8 seconds.

    So there could be a small difference, but in my case it's the other way.

  3. I would venture to guess that memcpy(), being such a basic function, is heavily optimized with inline assembler. Perhaps they found a performance tweak, but for some reason it is not being used for stack->stack copies. I would have to take a look at Apple's version of glibc memcpy() and compare it to the glibc you are using.

  4. memcpy.c from glibc-2.14

    As it is doing paging this might be a thing with the Mach kernel on OSX boxen.


    #undef memcpy

    void *
    memcpy (dstpp, srcpp, len)
         void *dstpp;
         const void *srcpp;
         size_t len;
    {
      unsigned long int dstp = (long int) dstpp;
      unsigned long int srcp = (long int) srcpp;

      /* Copy from the beginning to the end. */

      /* If there not too few bytes to copy, use word copy. */
      if (len >= OP_T_THRES)
        {
          /* Copy just a few bytes to make DSTP aligned. */
          len -= (-dstp) % OPSIZ;
          BYTE_COPY_FWD (dstp, srcp, (-dstp) % OPSIZ);

          /* Copy whole pages from SRCP to DSTP by virtual address
             manipulation, as much as possible. */

          PAGE_COPY_FWD_MAYBE (dstp, srcp, len, len);

          /* Copy from SRCP to DSTP taking advantage of the known alignment
             of DSTP. Number of bytes remaining is put in the third argument,
             i.e. in LEN. This number may vary from machine to machine. */

          WORD_COPY_FWD (dstp, srcp, len, len);

          /* Fall out and copy the tail. */
        }

      /* There are just a few bytes to copy. Use byte memory operations. */
      BYTE_COPY_FWD (dstp, srcp, len);

      return dstpp;
    }
    libc_hidden_builtin_def (memcpy)

  5. The result varies a lot if you use a different memory allocator, for example tcmalloc. On 64-bit Linux, heap with one int uses 8 bytes (due to the 64-bit system, I guess). Heap with two ints also uses 8 bytes.

  6. @Derek Li, Are you saying that it can allocate memory without any per-allocation memory overhead? That's impressive. I assume it has to keep track of allocated/free memory somewhere.

  7. This comment has been removed by the author.

  8. Another benefit of TCMalloc is space-efficient representation of small objects. For example, N 8-byte objects can be allocated while using space approximately 8N * 1.01 bytes. I.e., a one-percent space overhead. ptmalloc2 uses a four-byte header for each object and (I think) rounds up the size to a multiple of 8 bytes and ends up using 16N bytes.

    copied from

