Why Concurrency examples are confusing

Overview

I was recently at a talk were the audience was asked who used concurrency and who found it confusing. Nearly all who use concurrency found it confusing.

This was a little surprising to me, and I was left wondering what the reason might be.

I have been thinking about writing a book, so I have been watching a number of web talks and reading a number of books which talk about concurrency and I think I know at least one reason.

Bad Examples

The problem is; in a talk or a book you need a simple, easily understood example. Unfortunately, simple and easily understood examples almost always make very BAD examples for concurrency.

What makes them bad examples is that concurrency code comes with an overhead both to the developer in terms of complexity of the code and in CPU performance. Simple examples are likely to have far more overhead than the work done making the performance of the concurrent example far worse than just writing it single threaded and also the code more confusing.

Account transfer example

A common example used to illustrate how to avoid dead lock between two objects, is the account transfer problem. The problem is that to transfer between two accounts, you must lock both accounts (assuming you lock individual accounts). However, unless you lock them in the same order, you can get a deadlock where two threads hold the one of the accounts the other thread needs.

Took an average of 30.7 ns to transfer money with a lock on each Account.

Using synchronized improves performance synchronized is marginally faster for un-contended locks. (i.e. the same threadkeep acquiring the lock)

Took an average of 28.9 ns to transfer money with a lock on each Account.

This is the speed of using a thread safe account with one thread. In theory, this should scale very well and using two threads would halve this time. You might get good scalability up to the number of cores you have.

But the awkward question is, what if we didn't have one lock per account and just had one global lock. The code would be simpler, but not as scalable. But what is the cost of scalability?

What if I used just one global lock

Took an average of 9.7 ns to transfer money with a one global lock.

This means we would need to have at least 7 threads trying to transfer money to be as fast as using one global lock. (Assuming we achieved the theoretically best scalability) Given a real application has a lot other things to do, it is highly unlikely that you will average more than 7 thread in this section of code.

What if I just make the code single threaded?

Took an average of 1.8 ns to transfer money, single threaded

This implies that we need to average over 56 cores (not just threads) in the account transfer block of code, to break even. In this unlikely scenario, I would suggest you need to rethink your application. e.g. if you are working for an Uber-Bank performing one billion transfers per second, why are you using just one machine to do it?

A Really Bad Example

A pet hate of mine is using the Fibonacci series in examples of concurrency. This example has the advantage of being easy to define recursively and translate the code to being concurrent. It also an amazingly, staggeringly BAD idea.

This is run on a 4.6 GHz i7 2600K with 16 GB of memory. The code stops when there isn't enough resources to start any more threads.

Fibonacci 2 was 2, took 1,665 us, Time ratio=832.5 
Fibonacci 3 was 3, took 123 us, Time ratio=41.0 
Fibonacci 4 was 5, took 126 us, Time ratio=25.2 
Fibonacci 5 was 8, took 154 us, Time ratio=19.3 
Fibonacci 6 was 13, took 291 us, Time ratio=22.4 
Fibonacci 7 was 21, took 359 us, Time ratio=17.1 
Fibonacci 8 was 34, took 646 us, Time ratio=19.0 
Fibonacci 9 was 55, took 544 us, Time ratio=9.9 
Fibonacci 10 was 89, took 471 us, Time ratio=5.3 
Fibonacci 11 was 144, took 1,417 us, Time ratio=9.8 
Fibonacci 12 was 233, took 3,516 us, Time ratio=15.1 
Fibonacci 13 was 377, took 3,120 us, Time ratio=8.3 
Fibonacci 14 was 610, took 2,143 us, Time ratio=3.5 
Fibonacci 15 was 987, took 7,949 us, Time ratio=8.1 
Fibonacci 16 was 1,597, took 17,539 us, Time ratio=11.0 
Fibonacci 17 was 2,584, took 7,639 us, Time ratio=3.0 
Fibonacci 18 was 4,181, took 28,766 us, Time ratio=6.9 
Fibonacci 19 was 6,765, took 28,838 us, Time ratio=4.3 
Fibonacci 20 was 10,946, took 18,460 us, Time ratio=1.7 
Fibonacci 21 was 17,711, took 85,533 us, Time ratio=4.8 
Fibonacci 22 was 28,657, took 67,250 us, Time ratio=2.3 
Fibonacci 23 was 46,368, took 194,481 us, Time ratio=4.2 
Fibonacci 24 was 75,025, took 122,891 us, Time ratio=1.6 
Fibonacci 25 was 121,393, took 283,110 us, Time ratio=2.3 
Fibonacci 26 was 196,418, took 403,611 us, Time ratio=2.1 
Fibonacci 27 was 317,811, took 2,476,695 us, Time ratio=7.8 
Fibonacci 28 was 514,229, took 7,661,490 us, Time ratio=14.9 
Exception in thread "main" java.lang.AssertionError: java.util.concurrent.ExecutionException: java.lang.AssertionError: java.util.concurrent.ExecutionException: java.lang.AssertionError: java.util.concurrent.ExecutionException: java.lang.AssertionError: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: unable to create new native thread

The time ratio is the ratio of the time taken in micro-seconds and the solution. This is interesting because the time taken is proportional to the final value, something which rises exponentially. Unfortunately it also consumes more resources, exponentially.

Now compare this with the single threaded looping solution with -XX:+PrintCompilation turned on.

     30    1             java.lang.String::hashCode (67 bytes)
     36    2             sun.nio.cs.UTF_8$Encoder::encode (361 bytes)
Fibonacci 2 was 2, took 1 us
Fibonacci 3 was 3, took 1 us
Fibonacci 4 was 5, took 1 us
Fibonacci 5 was 8, took 1 us
Fibonacci 6 was 13, took 1 us
Fibonacci 7 was 21, took 1 us
Fibonacci 8 was 34, took 0 us
Fibonacci 9 was 55, took 0 us
Fibonacci 10 was 89, took 1 us
Fibonacci 11 was 144, took 1 us
Fibonacci 12 was 233, took 1 us
Fibonacci 13 was 377, took 1 us
Fibonacci 14 was 610, took      42    3             java.lang.String::indexOf (87 bytes)
1 us
Fibonacci 15 was 987, took 1 us
Fibonacci 16 was 1,597, took 1 us
Fibonacci 17 was 2,584, took 0 us
     42    4             java.lang.String::charAt (33 bytes)
Fibonacci 18 was 4,181, took 1 us
Fibonacci 19 was 6,765, took 1 us
Fibonacci 20 was 10,946, took 1 us
Fibonacci 21 was 17,711, took 1 us
Fibonacci 22 was 28,657, took 1 us
Fibonacci 23 was 46,368, took 1 us
Fibonacci 24 was 75,025, took 1 us
Fibonacci 25 was 121,393, took 0 us
Fibonacci 26 was 196,418, took 1 us
Fibonacci 27 was 317,811, took 1 us
Fibonacci 28 was 514,229, took 1 us
Fibonacci 29 was 832,040, took 1 us
Fibonacci 30 was 1,346,269, took 0 us
Fibonacci 31 was 2,178,309, took 1 us
Fibonacci 32 was 3,524,578, took 1 us
Fibonacci 33 was 5,702,887, took 1 us
Fibonacci 34 was 9,227,465, took 1 us
Fibonacci 35 was 14,930,352, took 1 us
Fibonacci 36 was 24,157,817, took 0 us
Fibonacci 37 was 39,088,169, took 1 us
Fibonacci 38 was 63,245,986, took 1 us
Fibonacci 39 was 102,334,155, took 1 us
Fibonacci 40 was 165,580,141, took 3 us
Fibonacci 41 was 267,914,296, took 0 us
Fibonacci 42 was 433,494,437, took 1 us
Fibonacci 43 was 701,408,733, took 1 us
Fibonacci 44 was 1,134,903,170, took 1 us
Fibonacci 45 was 1,836,311,903, took 1 us
Fibonacci 46 was 2,971,215,073, took 1 us
Fibonacci 47 was 4,807,526,976, took 1 us
Fibonacci 48 was 7,778,742,049, took 1 us
Fibonacci 49 was 12,586,269,025, took 1 us
Fibonacci 50 was 20,365,011,074, took 1 us
Fibonacci 51 was 32,951,280,099, took 1 us
Fibonacci 52 was 53,316,291,173, took 1 us
Fibonacci 53 was 86,267,571,272, took 1 us
Fibonacci 54 was 139,583,862,445, took 1 us
Fibonacci 55 was 225,851,433,717, took 1 us
Fibonacci 56 was 365,435,296,162, took 1 us
Fibonacci 57 was 591,286,729,879, took 1 us
Fibonacci 58 was 956,722,026,041, took 1 us
Fibonacci 59 was 1,548,008,755,920, took 1 us
Fibonacci 60 was 2,504,730,781,961, took 1 us
Fibonacci 61 was 4,052,739,537,881, took 1 us
Fibonacci 62 was 6,557,470,319,842, took 2 us
Fibonacci 63 was 10,610,209,857,723, took 2 us
Fibonacci 64 was 17,167,680,177,565, took 2 us
Fibonacci 65 was 27,777,890,035,288, took 1 us
Fibonacci 66 was 44,945,570,212,853, took 1 us
Fibonacci 67 was 72,723,460,248,141, took 1 us
Fibonacci 68 was 117,669,030,460,994, took 2 us
Fibonacci 69 was 190,392,490,709,135, took 2 us
Fibonacci 70 was 308,061,521,170,129, took 1 us
Fibonacci 71 was 498,454,011,879,264, took 1 us
Fibonacci 72 was 806,515,533,049,393, took 1 us
Fibonacci 73 was 1,304,969,544,928,657, took 1 us
Fibonacci 74 was 2,111,485,077,978,050, took 1 us
Fibonacci 75 was 3,416,454,622,906,707, took 1 us
Fibonacci 76 was 5,527,939,700,884,757, took 1 us
Fibonacci      49    5             java.util.regex.Matcher::77reset (83 bytes)
 was 8,944,394,323,791,464, took 1 us
Fibonacci 78 was 14,472,334,024,676,221, took 1 us
Fibonacci 79 was 23,416,728,348,467,685, took 1 us
Fibonacci 80 was 37,889,062,373,143,906, took 2 us
Fibonacci 81 was 61,305,790,721,611,591, took 3 us
Fibonacci 82 was 99,194,853,094,755,497, took 2 us
Fibonacci 83 was 160,500,643,816,367,088, took 1 us
Fibonacci 84 was 259,695,496,911,122,585, took 2 us
Fibonacci 85 was 420,196,140,727,489,673, took 2 us
Fibonacci 86 was 679,891,637,638,612,258, took 1 us
Fibonacci 87 was 1,100,087,778,366,101,931, took 2 us
Fibonacci 88 was 1,779,979,416,004,714,189, took 2 us
Fibonacci 89 was 2,880,067,194,370,816,120, took 1 us
Fibonacci 90 was 4,660,046,610,375,530,309, took 1 us
Fibonacci 91 was 7,540,113,804,746,346,429, took 2 us

Note: the method calculating Fibonacci values is not even compiled and is still running in the relatively slow interpreter. Yet it barely uses more than 2 micro-seconds. When compiled to native code it takes around 40 nano-seconds.

If the multi-threaded Fibonacci took 2 micro-second per result value, implying "Fibonacci 91" might take around 180 million years. Unfortunately, the number of threads required is also proportional to the result value, requiring in the order of a trillion trillion threads.

Conclusion

When considering multi-threading a section of an application, you should take a single threaded, lock free implementation as a base line. If you can't make the multi-threaded, thread safe implementation faster, don't make it multi-threaded.

Put another way, if you are using concurrency to improve performance, make sure you are dealing with a problem which can be made concurrent and a solution which is faster than using one thread.

The Code

For Accounts, look at the ThreadedAccountMain, ThreadedAccountLockMain, ThreadedAccountOneLockMain and UnthreadedAccountMain.

For Fibonaccii, look at the ThreadedFibonacciMain and UnthreadedFibonacciMain

Threads code

I would like to thank Roger Lindsjö for his suggestions on DZone.

Comments

Stefan Zobel7 November 2011 at 21:02
I hope you are going to write that book ;)
Peter Lawrey7 November 2011 at 21:32
@Stefan, Cheers. :)
Anonymous8 November 2011 at 02:34
+1 for a book.
Javier8 November 2011 at 08:10
The book is a great news. Do you need some help to change from "thinking" to "writing"?
Peter Lawrey8 November 2011 at 08:37
@tokyocoder, @Javier, Encouragement is worth a lot ;) I have been speaking with a good publisher. The only concern is that it a heck of a lot of work for very little money. I fully intend to start a book, and the first chapter I am working on is Concurrency. I will be giving a talk at the London Java Community Open Conference http://www.meetup.com/Londonjavacommunity/events/39646352/

The topic is about how simple you can make a high performance system, assuming you want more than fast enough e.g. 100K transactions per second, rather than the fastest possible. say 10M transactions per second.
Chad Brewbaker8 November 2011 at 22:47
IMHO the best concurrency example with a real world use case was Tim Bray's WideFinder contest: http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder
Peter Lawrey9 November 2011 at 09:04
@Chad, I am not so sure. Many people argue that an efficient implementation or WideFinder is IO bound on a single file or disk. Provided your application processes the data as fast as it can be read, a single threaded implementation will be just as fast as a multi-threaded one (possibly faster, as at least one example states). Using multiple threads is more useful if you have a less efficient language which can't process the file as fast as it can be read.
Sandeep10 November 2011 at 04:29
As per my experience, the problem is that in real world we have so many things to care for in addition to multithreading only: Inheritance, Cloning, Atomicity, Loose Coupling etc.

All these concepts at a single point makes it complex and add to that limited exposure to multithread programming.

Multithreading Interview Questions

Vanilla Java