Reading/writing GC-less memory
Overview
How you access data can make a significant difference to speed. Whether you unroll loops by hand or let the JIT do it for you can also make a difference to performance. I have included C++ and Java tests doing the same thing for comparison.
Tests
The following tests compare different approaches to storing 16 GB of data. In each case I compared:
- allocation, write, read and total GC times
- byte[] (the smallest primitive) and long[] (the largest primitive)
- plain arrays, direct ByteBuffer and Unsafe
- JIT-optimised loops and loops hand-unrolled four times
| Store | Memory | Element | Unrolled | Allocate | Write | Read | GC time |
|---|---|---|---|---|---|---|---|
| C++ char[] | native | 8-bit char | no | 31 μs | 12.0 s | 8.7 s | N/A |
| C++ char[] | native | 8-bit char | yes | 5 μs | 8.8 s | 6.6 s | N/A |
| C++ long long[] | native | 64-bit int | no | 11 μs | 4.6 s | 1.4 s | N/A |
| C++ long long[] | native | 64-bit int | yes | 12 μs | 4.2 s | 1.2 s | N/A |
| byte[] | heap | byte | no | 4.9 s | 20.7/7.8 s | 7.4 s | 51 ms |
| byte[] | heap | byte | yes | 4.9 s | 7.1 s | 8.5 s | 44 ms |
| long[] | heap | long | no | 4.7 s | 1.6 s | 1.5 s | 37 ms |
| long[] | heap | long | yes | 4.7 s | 1.5 s | 1.4 s | 45 ms |
| ByteBuffer | direct | byte | no | 4.8 s | 18.1/10.0 s | 14.0 s | 6.1 ms |
| ByteBuffer | direct | byte | yes | 4.8 s | 12.2/10.0 s | 16.7 s | 6.1 ms |
| ByteBuffer | direct | long | no | 4.7 s | 6.0/3.9 s | 2.4 s | 6.1 ms |
| ByteBuffer | direct | long | yes | 4.6 s | 4.7/2.3 s | 7.9 s | 6.1 ms |
| Unsafe | direct | byte | no | 10 μs | 18.2 s | 13.8 s | 6.0 ms |
| Unsafe | direct | byte | yes | 10 μs | 8.7 s | 8.3 s | 6.0 ms |
| Unsafe | direct | long | no | 10 μs | 5.2 s | 1.9 s | 6.0 ms |
| Unsafe | direct | long | yes | 10 μs | 4.2 s | 1.3 s | 6.0 ms |
In each case, this is the time taken to perform 8-bit byte or 64-bit long operations across 16 GB of data in the structure indicated. In C++ and with Unsafe, a single array/block of memory was used. For the Java array and ByteBuffer tests, multiple objects were used to make up the same total amount of space.
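The full benchmark sources are linked under "The code" below. As a rough illustration of the kind of write/read passes being timed for the GC-less stores, here is a minimal sketch using sun.misc.Unsafe and a direct ByteBuffer. The class name, sizes and loop bodies here are illustrative only, not the actual test code.

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import sun.misc.Unsafe;

public class GclessSketch {
    // Illustrative size only; the article's tests cover 16 GB in total.
    static final long SIZE = 1L << 30; // 1 GB in this sketch

    public static void main(String[] args) throws Exception {
        // Unsafe: one large block outside the heap, invisible to the GC.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long address = unsafe.allocateMemory(SIZE);
        for (long i = 0; i < SIZE; i += 8)
            unsafe.putLong(address + i, i);          // write pass
        long sum = 0;
        for (long i = 0; i < SIZE; i += 8)
            sum += unsafe.getLong(address + i);      // read pass
        unsafe.freeMemory(address);

        // Direct ByteBuffer: also off heap, but each buffer is limited to
        // under 2 GB, so a 16 GB test needs several buffers.
        ByteBuffer bb = ByteBuffer.allocateDirect(1 << 30).order(ByteOrder.nativeOrder());
        for (int i = 0; i < bb.capacity(); i += 8)
            bb.putLong(i, i);                        // write pass
        long sum2 = 0;
        for (int i = 0; i < bb.capacity(); i += 8)
            sum2 += bb.getLong(i);                   // read pass

        System.out.println(sum + " " + sum2);
    }
}
```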
C++ test configuration
All tests were performed with gcc 4.5.2 on Ubuntu 11.04, compiled with -O2.
Java test configuration
All tests were performed with Java 6 update 26 and Java 7 update 0, on a fast PC with 24 GB of memory. Where two timings are shown they are the Java 6/Java 7 values; where there is one value, both were the same. All tests were run with the options -mx23g -XX:MaxDirectMemorySize=20g -verbosegc
Curiosity
For me the most curious result was the performance of long[], which was very fast in Java, faster even than using C++ or Unsafe directly.
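To illustrate what "hand unrolled four times" means in the table above, here is a minimal sketch of a plain write loop over a long[] and the same loop unrolled by hand. The names and sizes are illustrative; the actual test code is linked below.

```java
// Illustrative only: a plain write loop over a long[] that the JIT can
// optimise itself, and the same loop hand-unrolled four times.
public class UnrollSketch {
    static void write(long[] data) {
        for (int i = 0; i < data.length; i++)
            data[i] = i;
    }

    static void writeUnrolled4(long[] data) {
        // assumes data.length is a multiple of 4, for brevity
        for (int i = 0; i < data.length; i += 4) {
            data[i] = i;
            data[i + 1] = i + 1;
            data[i + 2] = i + 2;
            data[i + 3] = i + 3;
        }
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 20]; // small size for illustration
        write(data);
        writeUnrolled4(data);
    }
}
```

As the table shows, hand unrolling made little difference for long[]; the plain loop is already well optimised by the JIT.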
The code
C++ tests - memorytest/main.cpp
Java tests - MemoryTest.java
Hi Peter,
Thanks for all the good analysis and numbers you provide in your blog... Can you also write a post on the performance of "copyOnWrite" collections in Java?
@Subhash, A good suggestion.
Not sure why the Java heap byte[] reading time would be similar to C++.
How about a microbenchmark accessing a 1024-byte array 1 million times; would it have a similar latency to accessing a large array of 1 GB of bytes?
Under the hood, baload can be JITed into several calls, per the HotSpot C++ source code.
Not sure how byte[] -> baload will be JITed, especially how many times arrayOopDesc::base_offset_in_bytes will be called, and whether the JIT can compile *HeapWordSize into << 3 for cases where HeapWordSize = 8.
```cpp
void TemplateTable::baload() {
  transition(itos, itos);
  __ pop_ptr(rdx);
  // eax: index
  // rdx: array
  index_check(rdx, rax); // kills rbx
  __ load_signed_byte(rax,
                      Address(rdx, rax,
                              Address::times_1,
                              arrayOopDesc::base_offset_in_bytes(T_BYTE)));
}
```