Memory alignment in C, C++ and Java
Overview
You might assume that reducing the size of a struct or class saves a corresponding amount of memory. However, because of memory alignment in memory allocators, it can make no difference, or it can make more difference than you might expect. This is because the amount of memory reserved is usually a multiple of the allocator's alignment, which for Java and C is typically 8 or 16 bytes.
Size of memory reserved
These tests were performed in a 64-bit C program (gcc 4.5.2) and a 64-bit JVM (Oracle Java 7). In Java, direct memory is largely a wrapper for malloc and free.
Bytes | C malloc() reserved | Java ByteBuffer.allocateDirect() |
---|---|---|
0 to 24 | 32 bytes | 32 bytes + a ByteBuffer |
25 to 40 | 48 bytes | 48 bytes + a ByteBuffer |
41 to 56 | 64 bytes | 64 bytes + a ByteBuffer |
57 to 72 | 80 bytes | 80 bytes + a ByteBuffer |
The next table shows the memory used for a class/struct with a given number of fields, either int fields or pointer/reference fields.
Number of fields | C class of int fields (heap/stack) | C class of void * fields (heap/stack) | Java class with int fields | Java class with Object references |
---|---|---|---|---|
1 | 32/16 bytes | 32/16 bytes | 16 bytes | 16 bytes |
2 | 32/16 bytes | 32/16 bytes | 24 bytes | 24 bytes |
3 | 32/16 bytes | 32/32 bytes | 24 bytes | 24 bytes |
4 | 32/16 bytes | 48/32 bytes | 32 bytes | 32 bytes |
5 | 32/32 bytes | 48/48 bytes | 32 bytes | 32 bytes |
6 | 32/32 bytes | 64/48 bytes | 40 bytes | 40 bytes |
7 | 48/32 bytes | 64/64 bytes | 40 bytes | 40 bytes |
8 | 48/32 bytes | 80/64 bytes | 48 bytes | 48 bytes |
Why does this matter?
Say you have a class like this:

```cpp
class MyClass { int num; short value; };
```

In C, how much memory is saved by changing num to a short, and how much more is consumed by making it a long long? The answer is likely to be none at all (unless you have an array of these). In Java it could make a difference, as the alignment size is different. Conversely, if the C class/struct is 16 or 17 bytes, its size on the stack can be 16 or 32 bytes. Similarly, being 24 or 25 bytes can make the malloc'ed size used 32 or 48 bytes long.
Actually, in my brief experiments stack->stack copies can be slower than heap->heap copies.
https://gist.github.com/1102088
@Chad Brewbaker, Did you come to any conclusions as to why this might be so?
On my PC the stack to stack copy took 26.3, 25.7, 26.5 seconds and the heap copy took 27.1, 26.7, 26.8 seconds.
So there could be a small difference, but in my case it's the other way around.
I would venture to guess that memcpy(), being such a basic function, is heavily optimized with inline assembler. Perhaps they found a performance tweak, but for some reason it is not being used for stack->stack copies. I would have to take a look at Apple's version of glibc memcpy() and compare it to the glibc you are using.
http://www.gnu.org/s/libc/
memcpy.c from glibc-2.14
As it is doing paging, this might be a thing with the Mach kernel on OSX boxen.
```c
#include <string.h>
#include <memcopy.h>
#include <pagecopy.h>

#undef memcpy

void *
memcpy (dstpp, srcpp, len)
     void *dstpp;
     const void *srcpp;
     size_t len;
{
  unsigned long int dstp = (long int) dstpp;
  unsigned long int srcp = (long int) srcpp;

  /* Copy from the beginning to the end.  */

  /* If there not too few bytes to copy, use word copy.  */
  if (len >= OP_T_THRES)
    {
      /* Copy just a few bytes to make DSTP aligned.  */
      len -= (-dstp) % OPSIZ;
      BYTE_COPY_FWD (dstp, srcp, (-dstp) % OPSIZ);

      /* Copy whole pages from SRCP to DSTP by virtual address manipulation,
         as much as possible.  */
      PAGE_COPY_FWD_MAYBE (dstp, srcp, len, len);

      /* Copy from SRCP to DSTP taking advantage of the known alignment of
         DSTP.  Number of bytes remaining is put in the third argument,
         i.e. in LEN.  This number may vary from machine to machine.  */
      WORD_COPY_FWD (dstp, srcp, len, len);

      /* Fall out and copy the tail.  */
    }

  /* There are just a few bytes to copy.  Use byte memory operations.  */
  BYTE_COPY_FWD (dstp, srcp, len);

  return dstpp;
}
libc_hidden_builtin_def (memcpy)
```
The result varies a lot if you use a different memory allocator, for example tcmalloc. On 64-bit Linux, a heap allocation with one int uses 8 bytes (due to the 64-bit system, I guess). A heap allocation with two ints also uses 8 bytes.
@Derek Li, Are you saying that it can allocate memory without any per-allocation memory overhead? That's impressive. I assume it has to keep track of allocated/free memory somewhere.
Another benefit of TCMalloc is space-efficient representation of small objects. For example, N 8-byte objects can be allocated while using space approximately 8N * 1.01 bytes. I.e., a one-percent space overhead. ptmalloc2 uses a four-byte header for each object and (I think) rounds up the size to a multiple of 8 bytes and ends up using 16N bytes.
ReplyDeletecopied from
http://goog-perftools.sourceforge.net/doc/tcmalloc.html