Low Latency Slides
Last weekend was LJC Open Conference #4, and like many people I got a lot out of it. My talk was up first, which meant I could relax for the rest of the day. Here are the slides.

Note: the message size was 16 longs, or 128 bytes. This makes a difference at higher throughputs.

In answer to @aix's questions:
On slide 12, what are we measuring here: elapsed time from the message hitting the buffer to the "echo" reply showing up on the client socket?

I add a timestamp to each message as it is written. When each message is read, I compare the timestamp with the current time. I sort the results and take the middle value (the 50%tile) as the typical timing, and the worst 0.01% as the 99.99%tile.
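The mechanics above can be sketched in plain Java. This is a minimal illustration, not the code from the talk: the class and method names are mine, and System.nanoTime() stands in for whichever clock source is used.

```java
import java.util.Arrays;

// Sketch of the timing approach: each message carries the sender's
// timestamp; on receipt the latency is recorded, and percentiles are
// taken from the sorted samples.
public class Latencies {
    private final long[] samples;
    private int count;

    public Latencies(int capacity) {
        samples = new long[capacity];
    }

    // Called when a message is read: compare its embedded timestamp
    // with the current time and record the difference.
    public void record(long sendTimeNanos, long receiveTimeNanos) {
        samples[count++] = receiveTimeNanos - sendTimeNanos;
    }

    // Sort the samples and report a percentile, e.g. 0.50 for the
    // typical (median) latency or 0.9999 for the 99.99%tile.
    public long percentile(double fraction) {
        long[] sorted = Arrays.copyOf(samples, count);
        Arrays.sort(sorted);
        int index = (int) Math.ceil(fraction * (count - 1));
        return sorted[index];
    }

    public static void main(String[] args) {
        Latencies latencies = new Latencies(1000);
        for (int i = 0; i < 1000; i++) {
            long send = System.nanoTime();
            // ... the message would be written, echoed back and read here ...
            long receive = System.nanoTime();
            latencies.record(send, receive);
        }
        System.out.println("50%tile:    " + latencies.percentile(0.50) + " ns");
        System.out.println("99.99%tile: " + latencies.percentile(0.9999) + " ns");
    }
}
```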
What's causing the large difference between "nanoTime(), Normal" and "RDTSC, Normal" in the bottom half of the slide (2M/s)?

The reason for reading the TSC directly with RDTSC (9 ns) is that System.nanoTime() is quite slow on Centos 5.7 (180 ns), and the latter is a system call which may disturb the cache. At a modest message rate of 200K/s (one message every 5,000 ns) the difference is minor; however, as the message rate increases to 2M/s (one message every 500 ns) the added latency becomes significant. It is not possible to send more than 5M messages per second using System.nanoTime(), whereas with RDTSC I got up to 12M/s. Without timing each message at all, I got a throughput of 17M/s.

In answer to @Matt's questions:
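You can estimate the per-call cost of System.nanoTime() on your own machine with a sketch like the one below. The 180 ns figure above came from Centos 5.7; on other systems the call can be much cheaper, so treat the numbers as machine-specific.

```java
// Rough sketch: call System.nanoTime() in a tight loop and divide the
// elapsed time by the number of calls to estimate its per-call cost.
public class NanoTimeCost {
    public static double averageCostNanos(int calls) {
        long start = System.nanoTime();
        long blackhole = 0;
        for (int i = 0; i < calls; i++)
            blackhole += System.nanoTime();   // the call being measured
        long end = System.nanoTime();
        if (blackhole == 42)                   // defeat dead-code elimination
            System.out.println();
        return (end - start) / (double) calls;
    }

    public static void main(String[] args) {
        // Warm up so the JIT compiles the loop before measuring.
        averageCostNanos(1_000_000);
        System.out.printf("System.nanoTime() costs ~%.0f ns per call%n",
                averageCostNanos(1_000_000));
    }
}
```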
Is that a safe way to call TSC on a multi-core machine? Is it consistent across cores?

Reading the TSC with RDTSC on a multi-core system is safe, as there is one counter per socket. However, reading the TSC on a multi-socket system is not safe: there will be a difference, and possibly a drift, between sockets. This is not a problem if you have only one socket, or if you do all your timing on one socket, e.g. you have used thread affinity and know the timings will all be taken on the same socket.
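One simple way to follow that advice is to take both the start and end timestamps on the same thread, so even a per-socket counter gives consistent readings. A minimal sketch, assuming the timing thread is separately pinned to one core (e.g. with taskset or a thread-affinity library; the pinning itself is not shown):

```java
// Sketch: both timestamps are taken on the calling thread, so a
// per-socket counter never gets compared across sockets. Pinning that
// thread to one core/socket is assumed to be done externally.
public class SingleThreadTimer {
    public static long timeOnThisThread(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long taken = timeOnThisThread(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++)
                sum += i;
        });
        System.out.println("Took " + taken + " ns");
    }
}
```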
Don't you need some sort of serialisation in the instruction stream to avoid out-of-order behaviour?

This is a potential problem. However, the total time taken with JNI is around 9 ns, which is more than 40 instructions. This is longer than the CPU pipeline can re-order (typically around 32 instructions). If the instruction were embedded the way some Unsafe methods are, it could be re-ordered with Java instructions. However, provided this re-ordering is not random, the difference is likely to be a bias of much less than 10 ns. If the re-ordering were random, it could add up to 10 ns of jitter. Given I am timing latencies to an accuracy of 100 ns, I decided it wasn't a problem for me. It could be a problem if you want 10 ns accuracy.
On the "reproducible results" theme, to be completely anal, you need to repeat the test n times (20-30?) and examine the distribution of results.

What I do is repeatedly run the test for 5 seconds and print the results. Each run consists of a minimum of one million individual messages. This is repeated 30 times, and I print an aggregate distribution. I compare the individual distributions with the aggregate to see if they are "close". I will try to document how the tests are done in more detail when I am happy with the reproducibility of the 99.99%tile values. :|
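The repeated-run procedure can be sketched as follows. The run count and sample sizes here are toy values, not the one-million-message, 5-second runs described above, and the "operation under test" is left as a placeholder.

```java
import java.util.Arrays;

// Sketch: run the test many times, keep each run's samples, and compare
// each run's median with the median of the aggregate to judge whether
// the individual distributions are "close" to the aggregate.
public class RepeatedRuns {
    public static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    public static void main(String[] args) {
        int runs = 30, samplesPerRun = 1000;   // toy sizes
        long[][] perRun = new long[runs][samplesPerRun];
        long[] aggregate = new long[runs * samplesPerRun];

        for (int r = 0; r < runs; r++) {
            for (int i = 0; i < samplesPerRun; i++) {
                long start = System.nanoTime();
                // ... the operation under test would run here ...
                long latency = System.nanoTime() - start;
                perRun[r][i] = latency;
                aggregate[r * samplesPerRun + i] = latency;
            }
        }

        System.out.println("aggregate median: " + median(aggregate) + " ns");
        for (int r = 0; r < runs; r++)
            System.out.println("run " + r + " median: " + median(perRun[r]) + " ns");
    }
}
```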