Overview
Locking a critical thread to a CPU can improve throughput and reduce latency. It can make a big difference to 99th, 99.9th and 99.99th percentile latencies.
Unfortunately there are no standard calls in the JDK to do this, so I wrote a simple library that lets you manage CPU assignments and see how CPUs have been assigned.
What does it look like running?
In the following example, there are four threads: main, reader, writer and engine. The main thread releases its CPU before the engine thread starts, so they end up using the same CPU.
Estimated clock frequency was 3400 MHz
Assigning cpu 7 to Thread[main,5,main]
Assigning cpu 6 to Thread[reader,5,main]
Assigning cpu 3 to Thread[writer,5,main]
Releasing cpu 7 from Thread[main,5,main]
Assigning cpu 7 to Thread[engine,5,main]
The assignment of CPUs is
0: General use CPU
1: General use CPU
2: Reserved for this application
3: Thread[writer,5,main] alive=true
4: General use CPU
5: General use CPU
6: Thread[reader,5,main] alive=true
7: Thread[engine,5,main] alive=true
Releasing cpu 6 from Thread[reader,5,main]
Releasing cpu 3 from Thread[writer,5,main]
Releasing cpu 7 from Thread[engine,5,main]
The code which produces this looks like
public class AffinitySupportMain {
    public static void main(String... args) {
        // Bind the main thread to a reserved CPU while it starts the workers.
        AffinityLock al = AffinityLock.acquireLock();
        try {
            new Thread(new SleepRunnable(), "reader").start();
            new Thread(new SleepRunnable(), "writer").start();
            new Thread(new SleepRunnable(), "engine").start();
        } finally {
            // Give the CPU back so another thread can be assigned to it.
            al.release();
        }
        System.out.println("\nThe assignment of CPUs is\n" + AffinityLock.dumpLocks());
    }

    private static class SleepRunnable implements Runnable {
        public void run() {
            // Each worker acquires its own CPU for as long as it runs.
            AffinityLock al = AffinityLock.acquireLock();
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                // Restore the interrupt flag rather than swallowing it.
                Thread.currentThread().interrupt();
            } finally {
                al.release();
            }
        }
    }
}
This library has been tested on Linux, and I believe it can be adapted to Windows and other OSes.
The Java Thread Affinity Library
very nice.
@Hal cheers. If you look at my previous presentation, you can see how much difference it can make. In one test sending 128-byte messages, throughput was 12 M/s without thread affinity and 17 M/s with it. The 99.9% latency (worst 0.1%) without affinity was off the chart, but with affinity it's only a few times the typical latency.
Looks like black magic, could you explain it?
AffinityLock uses AffinitySupport (and native calls) only once, and that call is getAffinity.
It calls getAffinity once to determine which threads are generally available to all processes. By default this is all of them, but if you have isolated specific CPUs, e.g. using isolcpus=, it will detect which ones and use these to assign critical threads to.
If you haven't done this, or you want to run multiple processes like this, you can specify which threads a process can reserve. This can be set with -Daffinity.reserved=CC or whatever hex mask indicates which CPUs can be used. I suggest allocating whole cores (all the threads for a core) to a process, as I suspect, but haven't tested, that this would perform best.
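To make the hex mask concrete, here is a minimal sketch of how a value such as CC maps to CPU ids, assuming the usual convention that bit n of the mask stands for CPU n. The class and method names are hypothetical and are not part of the library; the code only decodes the mask and makes no affinity calls.

```java
import java.util.ArrayList;
import java.util.List;

class ReservedMaskDemo {
    // Hypothetical helper (not the library's API): returns the ids of the
    // CPUs whose bits are set in a hexadecimal mask, lowest id first.
    static List<Integer> cpusFromHexMask(String hexMask) {
        long mask = Long.parseUnsignedLong(hexMask, 16);
        List<Integer> cpus = new ArrayList<>();
        for (int cpu = 0; cpu < Long.SIZE; cpu++) {
            if ((mask & (1L << cpu)) != 0) {
                cpus.add(cpu);
            }
        }
        return cpus;
    }

    public static void main(String[] args) {
        // 0xCC is binary 11001100, so bits 2, 3, 6 and 7 are set.
        System.out.println(cpusFromHexMask("CC")); // prints [2, 3, 6, 7]
    }
}
```

On an 8-CPU machine a mask of CC would therefore leave CPUs 0, 1, 4 and 5 for general use, which is consistent with the dump in the post, although whether that run used this setting or isolcpus= is not stated.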
To allocate a thread to a CPU, it sets the thread affinity to only one CPU, assuming no other thread will be assigned to it. It can't prevent other threads or interrupts from using the CPU; however, you can configure the OS (Linux definitely, and Windows I would expect) not to use certain CPUs unless specifically scheduled to do so.
I have started a page https://github.com/peter-lawrey/Java-Thread-Affinity/wiki/How-it-works
If you have more questions, please let me know so I can add more detail.
Interesting and useful library. I've used something similar for stable benchmarking, but not so well-designed.
One question -- it seems you use JNI only for making several syscalls. Maybe you could use JNA for this -- then your lib would not require a C compiler to build, and it would be easier to port to other platforms...
>It can't prevent other threads or interrupts from using the CPU; however, you can configure the OS (Linux definitely, and Windows I would expect) not to use certain CPUs unless specifically scheduled to do so.
Do you mean that if I assign my specific thread to core #2, your library still does not prevent some other thread in my application (one of the threads which was not explicitly assigned to a specific core) from using core #2 occasionally?
Have you thought about the ability to make some thread the "exclusive owner" of a specific core? It is a useful ability, since even if I assign every thread I explicitly create to a core, there are still threads which the core Java libs use themselves, and some internal threads, which can still contend with my threads.
I suppose such an ability would be quite easy to implement -- sched_setaffinity, AFAIK, can set affinity for the full process, including all its threads. So you can assign the process to a specific subset of the available cores and exclude some cores, and those excluded cores can then be given to specific threads for exclusive use.
Sure, even such a method does not prevent other processes from contending for those cores. But it seems quite reasonable for the library to take responsibility for everything inside the Java process, but not touch anything outside...
What do you think about it?
For the Thread Affinity alone JNA could be used in theory. I tried it and with my limited knowledge it didn't work (the set affinity didn't do anything)
For the RDTSC call, I believe JNA would be too slow to make this useful. If anyone can get the affinity via JNA working I would be happy to include it as an option.
As to exclusive use of a core by a thread, the way I solve this is to prevent the core being used unless it is specifically assigned. In Linux this requires the use of isolcpus= in grub.conf and setting irqbalance not to use those cores. (See the documentation for more details.) I don't know what the equivalent is in Windows, but I would be surprised if you can't control this.
>For the Thread Affinity alone JNA could be used in theory. I tried it and with my limited knowledge it didn't work (the set affinity didn't do anything)
Well, my own ThreadAffinity class (which I gave you in the stackoverflow discussion) uses JNA to call sched_setaffinity() -- and from my tests it does its work well. At least, first, sched_setaffinity() is consistent with a subsequent sched_getcpu(). Next, if I assign all threads to a single core, top really shows only 1 core consumed, although if I spread the threads over separate cores, top lists up to 1600% usage.
To me it seems like my way of calling sched_* does its job well. Here is the link, if you lost it: http://subversion.assembla.com/svn/Behemoth/Tests/JAVA/test/src/main/java/test/threads/ThreadAffinity.java
>In Linux this requires the use of isolcpus= in grub.conf and setting irqbalance not to use those cores.
Yes, such system-wide things must be done via system-wide configuration. In any case, I suppose it is not good design to allow the application itself to change such global settings for its own needs.
But the ability to do complete thread affinity management _inside_ the application's threads seems to me a good and useful option. For now (AFAIU) your library requires explicit core assignment for each thread. But there are many threads inside a Java application which it is hard to explicitly obtain a reference to, in order to assign them somewhere. So, in the current design, if I want to get some core exclusively for my thread, I really can't do that. Not only system-wide, but even within my application -- GC threads, timers, reference queue cleaners, finalizers, and other stuff will contend with my thread.
I suppose the way to solve this problem is: it seems you can use sched_setaffinity on a per-process basis -- assigning all the process's threads to a subset of cores. Then you can explicitly assign specific thread(s) to the free cores.
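The split proposed here comes down to simple mask arithmetic. The following is only a sketch of that arithmetic, assuming an 8-CPU machine and a reserved mask of CC as in the earlier example; the class and method names are hypothetical, and nothing here calls sched_setaffinity or changes any real affinity.

```java
class CpuMaskSplitDemo {
    // Mask with one bit set per CPU; valid for cpuCount from 1 to 63.
    static long allCpusMask(int cpuCount) {
        return (1L << cpuCount) - 1;
    }

    // CPUs left for the process's ordinary threads: all CPUs minus the reserved ones.
    static long sharedMask(long allMask, long reservedMask) {
        return allMask & ~reservedMask;
    }

    // Mask that pins one critical thread to exactly one CPU.
    static long singleCpuMask(int cpu) {
        return 1L << cpu;
    }

    public static void main(String[] args) {
        long all = allCpusMask(8);            // 0xFF, CPUs 0..7
        long reserved = 0xCCL;                // CPUs 2, 3, 6, 7 kept for critical threads
        long shared = sharedMask(all, reserved);
        System.out.println(Long.toHexString(shared));           // prints 33 (CPUs 0, 1, 4, 5)
        System.out.println(Long.toHexString(singleCpuMask(3))); // prints 8 (CPU 3 only)
    }
}
```

Under this scheme the shared mask would be applied to every thread that is not explicitly pinned, and a single-CPU mask to each critical thread; how reliably the first step reaches JVM-internal threads is exactly the open question raised in this comment.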
I like the design idea behind your library -- not just making a wrapper around a system call, but creating a high-level abstraction for giving some block of code a "core assignment". It will be great if this library grows further, since we definitely need some abstraction for affinity management in Java.
I believe Java RTS had some Thread affinity features, but I haven't used it.
As far as I know, RTS trades performance for predictability -- and it pays a lot of performance :). So, even if it didn't cost money -- it is not an option for high performance apps.
@BegemoT Based on your code, I have added support for JNA. If the native library is not available it will use the JNA library instead.
You're amazingly fast :) I'll check the updates soon. I think it'll be better to continue on the github issue tracker?
Agreed. I hope you don't mind me mentioning your help in the code. ;)
@BegemoT You may find this issue relevant to your code https://github.com/peter-lawrey/Java-Thread-Affinity/issues/4
Yes, thank you. I've spent much time trying to find a reliable way to work with errno...
@Peter As for the JRTS affinity API -- I googled a little, and it seems that this "API" is quite small. Here http://docs.oracle.com/javase/realtime/doc_2.2u1/release/JavaRTSTechInfo.html#cpusets you can see that it is as simple as binding some kinds of threads (NoHeap realtime threads or just realtime threads) to a specific subset of cores. And such binding can be done only on startup -- so it is not very far from a simple taskset -c . And more -- it works only on Linux :(
@BegemoT This library shouldn't have those limitations. I have checked that you can change affinity on the fly, and it should work on Solaris, Mac OS and Windows.
@Peter "this library" == JavaThreadAffinity? I'm not so optimistic -- today I did some digging about porting the lib to Windows/Mac, and it made me a little sad. For Mac OS, there is no way of specifying affinity at all -- Mac OS starting from Leopard only allows you to give the scheduler a hint about thread placement, in the form of "this group of threads would like to be as close together as possible, since they want to share L2 caches". I found no way to assign a thread to a core (prevent relocation from it), and no way to keep a thread off a core (prevent relocation onto it).
@BegemoT, I had assumed that since sched_setaffinity is a POSIX call, Mac OS would support it. Perhaps it has the call but treats it as a hint or ignores it.
The reason for using thread affinity is that when there are two or three tightly coupled threads, you can see a 10 - 30% improvement in throughput and a 10-fold improvement in worst-case latencies, e.g. the worst 0.1%.
Here http://stephendoyle.blogspot.com/2008/04/thread-affinity-on-os-x.html and here http://hints.macworld.com/article.php?story=20071130131843289 people have had the same troubles with affinity on Darwin as I've met. So it seems we can't port the lib to my laptop :(
I am still unable to set thread affinity. There are some exceptions raised while setting affinity. It's not working on Windows XP.