How much difference can thread affinity make?

Overview

In the past, when I have run performance tests using thread affinity, at best it didn't appear to make much difference.

I recently developed a library for using thread affinity in a more controlled manner, and another library in which threads work together in a tightly coupled manner. The latter is important because affinity appears to make the most difference when you have tightly coupled threads.
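
As a rough illustration of what the more controlled use of affinity looks like in code, the sketch below pins the current thread to a reserved CPU. It is based on the AffinityLock API of the later open-sourced version of the affinity library (package net.openhft.affinity); the names are an assumption and may not match the version used for these tests.

    import net.openhft.affinity.AffinityLock;

    public class PinnedWorker {
        public static void main(String[] args) {
            // Reserve a CPU for this thread. Package and class names here are
            // assumed from the open-sourced descendant of the affinity library
            // and may not match the version used for these tests.
            AffinityLock lock = AffinityLock.acquireLock();
            try {
                System.out.println("Pinned to CPU " + lock.cpuId());
                // ... run the latency-critical work here while pinned ...
            } finally {
                lock.release(); // give the CPU back when finished
            }
        }
    }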

Results

The results indicate that
  • System.nanoTime() impacts performance at this level on CentOS 6.2, as it did on CentOS 5.7. Using a JNI wrapper for RDTSC improved the timings (see the sketch after this list). From other tests I have done, it didn't appear to make as much difference on Ubuntu 11.
  • Using thread affinity without isolating CPUs improved latencies.
  • Using thread affinity on isolated CPUs didn't improve latencies much, but it did improve throughput.
  • Using hyper-threading on a high-performance system can be a practical option for key threads where a drop of about 20% in throughput is acceptable.
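
To give a feel for why the clock source matters at this scale, the micro-benchmark below measures the cost of System.nanoTime() itself; when round trips are a few hundred nanoseconds, a timer call costing tens of nanoseconds is a significant fraction of what is being measured. This is a minimal sketch, not the timer comparison used in these tests.

    public class NanoTimeCost {
        public static void main(String[] args) {
            int runs = 10000000;
            long blackhole = 0;
            long start = System.nanoTime();
            for (int i = 0; i < runs; i++)
                blackhole += System.nanoTime(); // accumulate so the call isn't optimised away
            long time = System.nanoTime() - start;
            System.out.printf("System.nanoTime() took ~%.1f ns per call (checksum %d)%n",
                    (double) time / runs, blackhole % 10);
        }
    }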

The test

I am testing my Java Chronicle, which is a wrapper for memory-mapped files. These files are shared between threads without locking or system calls. This situation highlights the difference between a stable thread assignment and leaving scheduling to the OS.
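
The sketch below shows the underlying JDK mechanism Chronicle builds on: mapping a file into memory so data can be exchanged through plain memory reads and writes rather than a system call per message. It is not the Chronicle API, and the file name is arbitrary.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedFileDemo {
        public static void main(String[] args) throws Exception {
            RandomAccessFile raf = new RandomAccessFile("/tmp/demo.dat", "rw");
            try {
                // Map 4 KB of the file; the put/get below are ordinary memory
                // accesses, not read()/write() system calls per message.
                MappedByteBuffer map = raf.getChannel()
                        .map(FileChannel.MapMode.READ_WRITE, 0, 4096);
                map.putLong(0, System.nanoTime());
                System.out.println("read back: " + map.getLong(0));
            } finally {
                raf.close();
            }
        }
    }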

I perform the same test with and without thread affinity to compare how much difference it can make.

The test itself involves sending a message to a second thread, which passes the message back. The message contains a nanosecond timestamp, which is used to measure latency.
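
The sketch below shows the shape of that measurement with a plain in-memory ping-pong between two threads; it is not the Chronicle-based test code, just an illustration of how a round-trip time is taken from an echoed nanosecond timestamp.

    import java.util.concurrent.atomic.AtomicLong;

    public class PingPongLatency {
        static final AtomicLong ping = new AtomicLong(), pong = new AtomicLong();

        public static void main(String[] args) {
            Thread echoer = new Thread(new Runnable() {
                public void run() {
                    long last = 0;
                    while (!Thread.currentThread().isInterrupted()) {
                        long t = ping.get();
                        if (t != last) {      // busy-spin until a new timestamp arrives
                            pong.lazySet(t);  // echo it straight back
                            last = t;
                        }
                    }
                }
            });
            echoer.setDaemon(true);
            echoer.start();

            int runs = 1000000;
            long totalRtt = 0;
            for (int i = 0; i < runs; i++) {
                long start = System.nanoTime();
                ping.lazySet(start);          // "send" the timestamp
                while (pong.get() != start) ; // wait for the echo
                totalRtt += System.nanoTime() - start;
            }
            System.out.printf("average RTT: %.0f ns%n", (double) totalRtt / runs);
            echoer.interrupt();
        }
    }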

The results

No thread affinity

The average RTT latency was 335 ns. The 50/99 / 99.9/99.99%tile latencies were 370/520 / 3,560/21,870. There were 15 delays over 100 μs
The average RTT latency was 332 ns. The 50/99 / 99.9/99.99%tile latencies were 370/410 / 3,460/21,660. There were 65 delays over 100 μs
The average RTT latency was 336 ns. The 50/99 / 99.9/99.99%tile latencies were 370/490 / 3,540/21,620. There were 34 delays over 100 μs

Took 10.770 seconds to write/read 200,000,000 entries, rate was 18.6 M entries/sec
Took 11.269 seconds to write/read 200,000,000 entries, rate was 17.7 M entries/sec
Took 12.230 seconds to write/read 200,000,000 entries, rate was 16.4 M entries/sec

Thread affinity using the JNI + RDTSC-based nano-timer

The average RTT latency was 171 ns. The 50/99 / 99.9/99.99%tile latencies were 160/190 / 2,680/22,060. There were 6 delays over 100 μs
The average RTT latency was 175 ns. The 50/99 / 99.9/99.99%tile latencies were 160/190 / 2,640/21,270. There were 8 delays over 100 μs
The average RTT latency was 174 ns. The 50/99 / 99.9/99.99%tile latencies were 160/190 / 2,630/21,270. There were 5 delays over 100 μs

Took 11.052 seconds to write/read 200,000,000 entries, rate was 18.1 M entries/sec
Took 12.399 seconds to write/read 200,000,000 entries, rate was 16.1 M entries/sec
Took 13.068 seconds to write/read 200,000,000 entries, rate was 15.3 M entries/sec

Thread affinity and RDTSC using isolated CPUs (with isolcpus=2,3,6,7 in grub.conf)
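
For reference, isolating CPUs on CentOS 6 means appending the isolcpus parameter to the kernel line in /boot/grub/grub.conf and rebooting. A rough example is below; the kernel version and other parameters are illustrative, and only isolcpus=2,3,6,7 is the setting used here.

    # /boot/grub/grub.conf (CentOS 6, GRUB legacy) - illustrative kernel line
    kernel /vmlinuz-2.6.32-220.el6.x86_64 ro root=/dev/mapper/vg_root-lv_root rhgb quiet isolcpus=2,3,6,7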

The average RTT latency was 175 ns. The 50/99 / 99.9/99.99%tile latencies were 160/180 / 2,380/20,450. There were 2 delays over 100 μs
The average RTT latency was 173 ns. The 50/99 / 99.9/99.99%tile latencies were 160/190 / 2,450/20,480. There were 2 delays over 100 μs
The average RTT latency was 175 ns. The 50/99 / 99.9/99.99%tile latencies were 160/190 / 2,400/20,540. There were 4 delays over 100 μs

Took 9.533 seconds to write/read 200,000,000 entries, rate was 21.0 M entries/sec
Took 9.584 seconds to write/read 200,000,000 entries, rate was 20.9 M entries/sec
Took 9.632 seconds to write/read 200,000,000 entries, rate was 20.8 M entries/sec

Using thread affinity with both threads sharing the same core (hyper-threading)
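
Which logical CPUs are hyper-threaded siblings of the same core can be read from sysfs; the output below is illustrative of a typical 4-core/8-thread i7, not captured from the test machine.

    # Logical CPUs sharing a core with cpu3 (illustrative output)
    $ cat /sys/devices/system/cpu/cpu3/topology/thread_siblings_list
    3,7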

The average RTT latency was 172 ns. The 50/99 / 99.9/99.99%tile latencies were 160/180 / 2,420/20,840. There were 2 delays over 100 μs
The average RTT latency was 176 ns. The 50/99 / 99.9/99.99%tile latencies were 160/190 / 2,440/21,060. There were 2 delays over 100 μs
The average RTT latency was 176 ns. The 50/99 / 99.9/99.99%tile latencies were 160/190 / 2,480/20,800. There were 4 delays over 100 μs

Took 12.194 seconds to write/read 200,000,000 entries, rate was 16.4 M entries/sec
Took 12.302 seconds to write/read 200,000,000 entries, rate was 16.3 M entries/sec
Took 12.076 seconds to write/read 200,000,000 entries, rate was 16.6 M entries/sec

The system

These tests were performed using Java 7 update 3 on a 4.6 GHz i7-2600 with CentOS 6.2, 16 GB of memory, and a fast 240 GB SSD drive. No JVM options were set.

Latency test code


Throughput test code

Comments

  1. How do you define throughput and latency for your test? Wouldn't less latency lead to more throughput? Can you explain that concept in mere mortal terms? Thanks!
