Micro jitter, busy waiting and binding CPUs

July 01, 2013

Performance profiling a new machine

When I work on a new machine, I like to get an understanding of its limitations. In this post I am looking at the jitter on the machine and the impact of busy waiting for a new PC I built this weekend. The specs for the machine are interesting but not the purpose of the post. Nevertheless, they are:

i7-3970X six core running at 4.5 GHz (with HT turned on)
32 GB of PC-1600 memory
An OCZ RevoDrive 3, PCI SSD (actual write bandwidth of 600 MB/s)
Ubuntu 13.04

Note: the OCZ RevoDrive is not officially supported on Linux, but is much cheaper than their models which are.

Tests for jitter

My micro jitter sampler looks at interrupts to a running thread. It is similar to jHiccup, but instead of measuring how delayed a thread is in waking up, it measures how delayed a thread gets once it has started running. Surprisingly, how you run your threads affects the sort of delays they will see once they wake up.
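To make the idea concrete, here is a minimal sketch of such a sampler. It is not the code used to produce the charts in this post; the class name, the power-of-two bucketing and the 2 micro-second threshold are assumptions made for illustration. The thread spins on System.nanoTime() and treats any unusually large gap between two consecutive readings as an interruption.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a micro jitter sampler; not the code behind the charts. */
public class JitterSampler {

    /**
     * Spins on System.nanoTime() for runNanos and counts every gap between
     * consecutive readings of at least thresholdNanos, bucketed by power of two.
     */
    public static Map<Long, Integer> sample(long runNanos, long thresholdNanos) {
        Map<Long, Integer> histogram = new TreeMap<>();
        long start = System.nanoTime();
        long last = start;
        long now;
        while ((now = System.nanoTime()) - start < runNanos) {
            long gap = now - last;
            if (gap >= thresholdNanos) {
                histogram.merge(bucketFor(gap), 1, Integer::sum);
                // re-read the clock so the bookkeeping is not counted as jitter
                now = System.nanoTime();
            }
            last = now;
        }
        return histogram;
    }

    /** Rounds a gap down to the nearest power of two, e.g. 3000 ns becomes 2048 ns. */
    public static long bucketFor(long gapNanos) {
        return Long.highestOneBit(gapNanos);
    }

    public static void main(String[] args) {
        // sample for one second, reporting gaps of 2 micro-seconds or more (assumed threshold)
        sample(1_000_000_000L, 2_000L)
                .forEach((bucket, count) -> System.out.println(">= " + bucket + " ns: " + count));
    }
}
```

The map update allocates, which can itself add jitter, so a serious sampler would more likely record into a preallocated array.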

This chart is a bit dense. It shows how many interrupts of each duration occurred per CPU hour on average (each test was run for more than two clock hours). The raw data is available HERE

The interesting differences are in how binding to an isolated CPU and/or busy waiting threads are handled by the OS.

Busy waiting

In the case of busy waiting, binding to an isolated core really helps reduce the higher latency gaps.

These tests were run at the same time. The only difference is that the "bound" thread was bound to an "isolcpus" CPU where the other CPU for that core was isolated as well. i.e. the whole core was isolated.
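The isolcpus setting is a kernel boot parameter, so on Linux it shows up in /proc/cmdline. As a rough sketch of how a program could check which CPUs are isolated before binding to one, the hypothetical helper below parses that parameter. It assumes the simple list form (single ids and a-b ranges, such as isolcpus=5,11) and skips any non-numeric entries.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: lists the CPUs named in the isolcpus= kernel parameter. */
public class IsolatedCpus {

    /** Parses a kernel command line and returns the isolated CPU ids in ascending order. */
    public static Set<Integer> parse(String cmdline) {
        Set<Integer> cpus = new TreeSet<>();
        for (String arg : cmdline.trim().split("\\s+")) {
            if (!arg.startsWith("isolcpus=")) continue;
            for (String part : arg.substring("isolcpus=".length()).split(",")) {
                // skip empty or non-numeric entries; only "n" and "a-b" are handled
                if (part.isEmpty() || !Character.isDigit(part.charAt(0))) continue;
                String[] range = part.split("-");
                int from = Integer.parseInt(range[0]);
                int to = Integer.parseInt(range[range.length - 1]);
                for (int cpu = from; cpu <= to; cpu++) cpus.add(cpu);
            }
        }
        return cpus;
    }

    public static void main(String[] args) throws Exception {
        // Linux only: the running kernel's boot parameters
        String cmdline = new String(
                Files.readAllBytes(Paths.get("/proc/cmdline")), StandardCharsets.UTF_8);
        System.out.println("Isolated CPUs: " + parse(cmdline));
    }
}
```

Which CPU ids share a physical core depends on the machine, so the ids in the example are only placeholders.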

Fairly busy - 50%

In this case, the thread alternated between sampling for 1 milli-second and sleeping for 1 milli-second.

The unbound 50% busy thread had far fewer delays of 2 micro-seconds, but significantly more of the longer delays.

Slightly busy - 10%

In this test, the sampler runs for 0.111 milli-seconds and sleeps for 1 milli-second. Even in this case, binding to an isolated CPU makes a difference.
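The busy percentages follow from the run and sleep times: 1 ms of sampling plus 1 ms of sleep is 50% busy, and 0.111 ms plus 1 ms is about 10% busy. The sketch below shows that arithmetic together with one way such a spin-then-sleep loop could be written. Using LockSupport.parkNanos for the sleep is an assumption; the actual test may have slept differently.

```java
import java.util.concurrent.locks.LockSupport;

/** Hypothetical sketch of a partly busy thread: spin for a while, then sleep. */
public class DutyCycle {

    /** Fraction of each cycle spent running, e.g. 1 ms busy + 1 ms asleep gives 0.5. */
    public static double busyFraction(long busyNanos, long sleepNanos) {
        return (double) busyNanos / (busyNanos + sleepNanos);
    }

    /** Runs the given number of spin-then-sleep cycles; returns clock readings taken while busy. */
    public static long run(int cycles, long busyNanos, long sleepNanos) {
        long readings = 0;
        for (int i = 0; i < cycles; i++) {
            long start = System.nanoTime();
            while (System.nanoTime() - start < busyNanos) {
                readings++; // a real sampler would record gaps between readings here
            }
            LockSupport.parkNanos(sleepNanos); // gives up the CPU
        }
        return readings;
    }

    public static void main(String[] args) {
        System.out.println("50% case: " + busyFraction(1_000_000L, 1_000_000L));
        System.out.println("10% case: " + busyFraction(111_000L, 1_000_000L));
        System.out.println("readings: " + run(100, 111_000L, 1_000_000L));
    }
}
```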

Binding but not isolating - 10%

In this case, the bound thread was not isolated. It was bound to a CPU where the core wasn't free and it was not isolated either. It would appear that binding alone makes little difference compared to unbound in this test.

Comparing bound and isolated threads

Something I have seen before, but still find a little strange, is that if you give up the CPU, your thread performs badly when it wakes up. Previously I had put this down to the caches not being warmed up, but the code is very short and accesses very little memory, so a cold cache is still possible but unlikely. The peak at 20 micro-seconds, at one million per hour, could be due to a delay which happens on every fourth wake up. At 4.5 GHz, 20 micro-seconds is about 90,000 clock cycles, which seems like a lot for a cache miss.
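The 90,000 figure is just the delay multiplied by the clock rate of this machine. A small sketch of that conversion, with a hypothetical helper name:

```java
/** Hypothetical helper: converts a delay into clock cycles at a given clock rate. */
public class CycleCount {

    /** 1 GHz is 1,000 cycles per micro-second, so cycles = micros * GHz * 1,000. */
    public static long cycles(double delayMicros, double clockGHz) {
        return Math.round(delayMicros * clockGHz * 1_000);
    }

    public static void main(String[] args) {
        // 20 micro-seconds at 4.5 GHz
        System.out.println(cycles(20, 4.5)); // prints 90000
    }
}
```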

Comparing unbound threads

This chart suggests it really helps to be greedy with the CPU, even if you are not bound. Busy threads get interrupted less. It is hard to say whether 50% busy is better than 10% busy. It might be, but a longer test would be needed (I would say the difference is within the margin of error).

Conclusion

Using thread affinity without isolating the CPU doesn't appear to help much on this system. I suspect this is true of other versions of Linux and even Windows.

Where affinity and isolation help, it may still make sense to busy wait, as it appears the scheduler will interrupt the thread less often if you do.

Comments

Thierry Abaléa, 5 July 2013 at 21:26
Also taking into consideration the results of your previous post "How much difference can thread affinity make" (http://vanillajava.blogspot.fr/2012/02/how-much-difference-can-thread-affinity.html), the conclusion is that thread affinity without CPU isolation improved latency, and thread affinity with CPU isolation improved jitter and throughput. Am I right?
Peter Lawrey, 5 July 2013 at 22:07
I wrote the post in response to the view that pinning CPUs is for the weak or shouldn't be needed.

@Thierry I suspect the real impact will depend on what your application is doing. In the test above, there is a very big difference from using an isolated CPU, but not so much from just binding to a CPU which was not isolated. The test is not realistic at all. What the test does suggest is that the OS looks at what the program appears to be doing and behaves differently. The isolated/bound threads which don't busy wait get more delays in the 14-40 micro-second range.

In truth, you need to test this for your application and your OS on your hardware. The point of the post is to show that using an isolated CPU can make a huge difference, at least in some cases, and it is well worth considering if sub-milli-second latency is important to you.
Thierry Abaléa, 5 July 2013 at 22:46
Thanks for the clarification.

It appears that even applications that don't require sub-milli-second latency can benefit from CPU isolation. They can scale at a lower cost. Quora uses this technique for its web servers: http://engineering.quora.com/Web-Server-Architecture-at-Quora

Vanilla Java