Why we shouldn't use more threads than we need to

Overview

There is a common argument that because we have lots of cores, and will have even more in the future, we have to use them. We just need to find the best ways to use them. But just because we can doesn't mean we should.

What is our goal?

Good reasons to use multiple threads are:
  • the performance of using one thread is not enough.
  • you have profiled your application to ensure there is no low-hanging fruit.
  • multiple threads improve the throughput, latency or consistency.
At this point you should add a thread only when you know it gets you closer to your goal.

A bad reason to use multiple threads

Just because we can use more threads doesn't mean we should. Multiple threads
  • add complexity to the code.
  • are not the only way to speed up an application. Your L1 cache is 10-20x faster than your L3 cache, and if you can spend more time in your L1 cache by optimising your memory usage and access, you can gain more performance than by using every CPU in your socket.
  • can introduce subtle, rarely seen bugs which just wouldn't be there in single-threaded code.
  • add synchronization, and encourage more use of immutable objects instead of recycling mutable ones.
  • tend to lead to much worse jitter and worst-case performance, even if the typical performance is better.
In short, multi-threading is more likely to slow down a program than speed it up unless some thought is put into it. Two CPUs can be twice as fast at best, but can easily be ten times slower if you are not careful, i.e. you have more to lose than you can gain.

A simple example of this is calculating Fibonacci numbers. These are very easy to describe recursively, and the recursive form lends itself to creating lots of threads. Thus calculating Fibonacci numbers is often used as an example of how to use lots of threads. What such examples often don't mention is that the number of threads you create is proportional to the answer, i.e. it grows exponentially. This means that while iterating in one loop/thread takes about 4 ms to compute fib(69), the multi-threaded version will try to create hundreds of trillions of threads (fib(69) is 117,669,030,460,994) and will not finish in any reasonable time, if it doesn't crash first.
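To make the Fibonacci comparison concrete, here is a minimal sketch in Java. It does not start any threads. It computes fib(n) in a single loop, and separately counts how many calls the naive recursion fib(n) = fib(n-1) + fib(n-2) would make, which is the number of threads or tasks you would create if every call got its own. The class and method names (FibThreadCount, fibIterative, naiveTaskCount) are made up for this illustration, and it assumes fib(1) = fib(2) = 1.

```java
public class FibThreadCount {

    /** Iterative Fibonacci with fib(1) = fib(2) = 1. Fits in a long for n <= 92. */
    static long fibIterative(int n) {
        long a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            long next = a + b;
            a = b;
            b = next;
        }
        return a;
    }

    /**
     * Number of calls made by the naive recursion
     * fib(n) = fib(n-1) + fib(n-2), with n <= 2 as the base case.
     * If each call ran in its own thread or task, this is how many
     * you would create. Computed with a loop, not by actually recursing.
     */
    static long naiveTaskCount(int n) {
        if (n <= 2) {
            return 1;
        }
        long countNMinus2 = 1; // calls needed for fib(1)
        long countNMinus1 = 1; // calls needed for fib(2)
        for (int i = 3; i <= n; i++) {
            // one call for fib(i) itself, plus everything below it
            long current = 1 + countNMinus1 + countNMinus2;
            countNMinus2 = countNMinus1;
            countNMinus1 = current;
        }
        return countNMinus1;
    }

    public static void main(String[] args) {
        int n = 69;
        System.out.println("fib(" + n + ") = " + fibIterative(n));
        System.out.println("tasks a naive recursive version would create = "
                + naiveTaskCount(n));
    }
}
```

The call count works out to 2 * fib(n) - 1, so it grows at the same exponential rate as the answer itself. The single loop needs 69 additions.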

But if I have CPUs idle I am wasting them.

If you want to use every CPU, just write a busy-waiting thread for every CPU and you are done: every CPU is at 100%.

Say you want to travel from A to B. Sometimes you can take one street, and sometimes taking four streets is faster. But there are twenty streets near A and B, and you should go up and down all twenty streets because otherwise there is no point in them being there, right!?

Conclusion

If you are focused on engineering your system for ease of development and maintainability, you want the simplest solution which will solve your problem. If that means you don't use 100% of your network bandwidth, or 100% of your disk space, or 100% of your memory, or 100% of your CPUs, perhaps that is a good thing.

Comments

  1. To clarify, it's not simply that your L3 cache is slower, but rather that the socket-to-socket communication using the QPI bridge to other sockets' L3 caches is expensive. L2 cache misses into the socket-local L3 can still be relatively quick.

  2. I wrote about the same thing here. Threads are evil! http://mentablog.soliveirajr.com/2013/02/inter-socket-communication-with-less-than-2-microseconds-latency/

  3. It would be very interesting if you could show the bad implementation of Fibonacci that you are talking about (the one that will crash the VM). I believe it does not mean that there is no "good" or correct parallel implementation that will run faster on several CPUs than a single-threaded one?

    Is it generally true that if a computation can be made parallel and does not require state sharing/synchronization, then it can be divided and executed on several CPUs? Of course, you might batch a number of computations on every CPU to optimize the use of cache, instead of executing a single operation in one thread.

  4. @Jen You can only get efficient parallelism if there is inherently a degree of parallelism in the algorithm you are using. With the standard Fibonacci formula, you calculate the next value from the previous two values. This can be used to waste a lot of CPU, but inherently this approach is single threaded. There is an even faster, but much more complex, formula for calculating large Fibonacci numbers, and this could have some degree of parallelism.
