Hardware Transactional Memory in Java, or why synchronized will be cool again.

Overview 

Hardware Transactional Memory has the potential to allow multiple threads to speculatively access the same data structure at the same time and let the cache coherence protocol determine whether a conflict occurred.  HTM aims to give you the scalability of fine-grained locking, the simplicity of coarse-grained locking, and performance close to no locking at all.  If supported by the JVM, a program or library written with coarse-grained locks could scale to many more cores with little change.

While adding support for this to C and C++ is non-trivial, adding support to the native code generated by the JVM could be done without having to change the byte code.

In short, this could allow many threads to speculatively execute synchronized blocks for the same lock concurrently, even concurrent writes, with the processor working out whether this was a problem and repeating the block until it is not.

What is Hardware Transactional Memory and how much will it cost? 

Hardware Transactional Memory has been around for some time, but until recently it hasn't been mainstream.  With Intel introducing support in some of their 4th generation i3/i5/i7 (Haswell) processors and their E3-1200 v3 family (up to 4 cores, one socket at the moment), it is becoming widely available in new Intel-based machines.  It may be later this year, or next year, before we see the larger core counts at which HTM would make a real difference. AFAIK, AMD plans to add this functionality soon.

BTW, Azul's Vega systems have had this technology for almost ten years, and I would expect Azul to be best placed to implement this in the JVM first.

The hardware you will buy, and perhaps already have, will do this. Many new models of laptop have Haswell processors, as these have significantly improved power consumption.

How might it work? 

synchronized blocks are often used in Java on a just-in-case basis.  To simplify the code, these locks are often much coarser than is optimal, e.g. Hashtable locks the whole object/Map for any operation, whereas ConcurrentHashMap uses fine-grained locking.  Fine-grained locking is much harder to get right and thus more error prone.  The goal of Hardware Transactional Memory is to let you keep coarse-grained locking but get the benefit of fine-grained locking.  This is particularly useful for code where re-optimising it is not practical.

Example

private final Map<String, PooledObject> map = new HashMap<>(); 

public synchronized PooledObject acquireObject(String key) {
    PooledObject object = map.get(key);
    if (object == null)
        map.put(key, object = new PooledObject());
    return object;
}
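To make the example concrete, here is a minimal self-contained sketch (PooledObject is reduced to a hypothetical empty placeholder class) showing what the coarse lock guarantees: one instance per key, at the cost of serialising every call, reads included.

```java
import java.util.HashMap;
import java.util.Map;

public class Pool {
    static class PooledObject {}  // placeholder for illustration

    private final Map<String, PooledObject> map = new HashMap<>();

    // One coarse lock guards both the read and the write path.
    public synchronized PooledObject acquireObject(String key) {
        PooledObject object = map.get(key);
        if (object == null)
            map.put(key, object = new PooledObject());
        return object;
    }

    public static void main(String[] args) {
        Pool pool = new Pool();
        // Same key yields the same instance; different keys differ.
        System.out.println(pool.acquireObject("a") == pool.acquireObject("a")); // true
        System.out.println(pool.acquireObject("a") == pool.acquireObject("b")); // false
    }
}
```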

You might expect that the common case:

  • only reads the map.
  • updates the map, but in different places, e.g. different keys.
  • rarely attempts to update the same key in two threads at once.

What you would like is:

  • concurrent execution between threads.
  • very small overhead compared to code without locking.
  • the CPU or JVM to do all the work to optimise this, i.e. you don't have to change the code.

Without HTM, the synchronized block needs to obtain a lock and enforce serialized access even though the majority case is a read operation.
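For comparison, this is roughly what a hand-written fine-grained version looks like today using ConcurrentHashMap (again with a placeholder PooledObject); the promise of HTM is to approach this scalability without rewriting the coarse-grained original.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FinePool {
    static class PooledObject {}  // placeholder for illustration

    private final Map<String, PooledObject> map = new ConcurrentHashMap<>();

    // No coarse lock: ConcurrentHashMap only locks the bin being
    // updated, and plain reads proceed without locking at all.
    public PooledObject acquireObject(String key) {
        return map.computeIfAbsent(key, k -> new PooledObject());
    }

    public static void main(String[] args) {
        FinePool pool = new FinePool();
        System.out.println(pool.acquireObject("a") == pool.acquireObject("a")); // true
    }
}
```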

With HTM, the byte code could be turned into pseudo code like this:

public PooledObject acquireObject(String key) {
    int code;
    do {
        xbegin();
        PooledObject object = map.get(key);
        if (object == null)
            map.put(key, object = new PooledObject());
        code = xend();
        if (code == DONE)
            return object;
    } while (code == RETRYABLE);
    // take corrective action such as
    // obtain a normal lock and repeat
}

The XEND instruction demarcates the end of the speculative region.  At this point, the read-set and write-set of the transaction are checked to see whether any cache line accessed was modified by another CPU/thread.   If not, the changes made are committed.  Otherwise, any changes are dropped and the loop can be repeated.

Note: rolling back the transaction means reverting the changes; this could even mean rolling back the object creation where it doesn't have significant side effects.  If it does have side effects, there is an XABORT instruction which can explicitly abort the transaction so that the fall-back code is run.
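As a sketch, assuming the same hypothetical xbegin/xabort/xend intrinsics as the pseudo code above, a block with an unrecoverable side effect might abort explicitly so the fall-back path runs:

```
xbegin();
Event e = queue.poll();
if (e.requiresIO())     // I/O cannot be rolled back by the cache,
    xabort();           // so abort explicitly; the fall-back path
                        // (e.g. a normal lock) handles this case
process(e);
xend();
```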

Compare And Swap is limited to 64 bits; what is the limit of these transactions? 

The limit is the number of cache lines you can store in your L1 cache, which is up to 32 KB.  If you have hyper-threading, this might be halved, i.e. 16 KB. Also, the L1 cache is 8-way associative, so in the worst case nine cache lines which hash to the same bucket could cause the transaction to fail (fewer with hyper-threading).  Nevertheless, it is much larger and far more flexible than CAS on 64 bits or 2CAS on 128 bits.
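For contrast, the 64-bit CAS the post compares against operates on exactly one word, e.g. via java.util.concurrent.atomic:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasDemo {
    public static void main(String[] args) {
        AtomicLong counter = new AtomicLong(41);
        // CAS atomically updates a single 64-bit value; an HTM
        // transaction can cover many cache lines, up to ~32 KB.
        boolean swapped = counter.compareAndSet(41, 42);
        System.out.println(swapped + " " + counter.get());
    }
}
```

prints `true 42`.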

Writing this transactional locking structure with a fall-back adds boilerplate and duplicated code in a language like C.

Conclusion 

The beauty of this pattern is that it can be applied to Java code which has already been compiled and is available in open source libraries.  Unlike C code, which would need significant rework to exploit this functionality, Java programs could take advantage of HTM without recompilation.  All we would need is a change in the JVM.

Notes (some corrections/clarifications on what I said earlier)

For me; "cool" technology is one I believe generates wide interest, even if there isn't proven wide usefulness. I believe implementing this in a mainstream JVM will challenge what even experienced developers "know" about multi-threaded programming.

While Intel TSX is available in some Haswell processors, it is not available in all of them.  You should check the processor on Intel's ARK and look for "Intel TSX-NI: Yes".

It has been noted that this may not make much difference for well-tuned code.  Intel's designer for TSX, Ravi Rajwar, presented at QCon SF 2012 on Going Under the Hood with Intel's Next Generation Microarchitecture Codename Haswell on the Mechanical Sympathy track.  Page 29 suggests to me that fine-grained code will scale well across cores anyway and will not gain as much.  Where TSX may help is coarse-grained locking. 

For more technical details I suggest you read Gil Tene's post on the mechanical sympathy group.  He has more first-hand experience of tuning a JVM to support HTM than anyone I have met.

References

Speculative Locking: Breaking the Scale Barrier (JAOO 2005) by Azul's Gil Tene.
Early Experience with a Commercial Hardware Transactional Memory Implementation (October 2009) by Sun Microsystems' David Dice, Yossi Lev, Mark Moir, Daniel Nussbaum, Marek Olszewski.
Benchmarks : Haswell's TSX and Memory Transaction Throughput (HLE and RTM) from SiSoftware
Fun with Intel® Transactional Synchronization Extensions from Intel
Transactional memory support: the speculative_spin_mutex from Intel
Making Sense of the Intel Haswell Transactional Synchronization eXtensions by Johan De Gelas

Comments

  1. Good article. The only sensible question remaining is: 'is Oracle willing to enhance the JVM to support HTM and WHEN will it be available?'

    1. AFAIK, Oracle plans to support CAS for 128 bit values in Java 9. So a little support will come soon, but that support might not be much for some time.


  2. I just read your blog post. Pretty amazing.

    > Hardware Transaction Memory has the potential to allow multiple threads to speculatively access the same data structure at the same time and let the cache coherence protocol determine if a conflict occurred.

    When you say "let the coherence protocol determine" are you saying that we could potentially take Java code that explicitly is written in the .java

    synchronized(operand_object_in_HTM_custody) {

    /* operate on the HTM operand in coarse/pessimistic (e.g. SERIALIZABLE) full isolation */
    }

    and potentially expect that an HTM-savvy run-time could re-write the original byte-code to generate destination byte-code that could realize a (much preferred) finer-grained/optimistic (e.g. READ_COMMITTED) isolation capability ... BUT ALSO ... with the guarantee that - should the cache coherency protocol discover that the optimistic isolation was betrayed (i.e. encountered a conflict) - the ultimate transactional outcome would "roll back/recover" capability to ensure that the outcome would be exactly as safe as if the coarse-grained pessimistic (original synchronized block) was never re-written?

    If I understand this correctly, this is pretty special.

    > Without HTM, the synchronized block needs to obtain a lock and enforce serialized access even though the majority case is a read operation.

    So it is coded at isolation=SERIALIZABLE, but if possible we can optimistically re-write to attempt finer run-time isolation. Again, are we saying this exactly? ==> Write fully synchronized code, let the run-time optimistically re-write your generated byte-code to take advantage of HTM capabilities, and, in any case that the HTM optimistic code encountered run-time conflict - the HTM will coherently roll back automatically .... leaving you exactly as safe as if only the original (non-HTM savvy) byte code executed.

    Nice. Real nice.


    1. The byte code doesn't need to be changed. Instead the native code generated would be changed so that it attempts to perform the operation speculatively first, if it fails it will have to fall back to normal locking.

      This is a significant shift in what the CPU allows the programmer to do. At the end of a transaction it can detect a conflict with another thread and back out the writes made. While the size of the transaction is not unlimited it uses the L1 cache to store the transaction i.e. up to 32 KB.

    2. Thanks Peter.

      Two more questions:


      1. If byte-code remains unchanged, what needs to be done from the JVM (if anything) to affect the HTM cache protocol to re-write the machine code to realize a (100% SAFE/ROLL-BACK READY) finer-grained locking isolation?

      2. Can any known HTM product today *literally* re-write machine-code that exhibits isolation=SERIALIZABLE (as would be generated from the .java use of synchronized) into machine code that optimistically attempts to render an isolation=READ_COMMITTED capability (and stands ready to 100% safely auto roll back to isolation=SERIALIZABLE should any un-resolvable DIRTY_READ risk (or other conflict) be encountered at any commit time)?

  3. Interesting post. However I have some doubt whether this will be a silver bullet. E.g. it won't be sufficient to only check for concurrent changes on the same cache line, but also to check for concurrent reads in order to determine if a transaction can be committed. (You probably know the famous endless loop, when reading from a hashmap concurrent to a "put").

    e.g.
    Thread A: sync { int size = this.size; [some loop] }
    Thread B: sync { [grow array]. this.size = newSize }

    so the transaction of thread B must be cancelled if a read access to the modified "size" value has been done concurrently in another thread, isn't it?
    This means, fine grained locking still will be necessary to get decent performance in many cases.

    1. Just as I read it, of course it's vice versa: it's cheaper to cancel transaction A once a read value is modified concurrently :-)

  4. FWIW, here are some perspectives about HTM and STM from Cliff Click and Doug Lea (April 2013):

    https://www.youtube.com/watch?v=khFDJi3Xxn0&t=47m54s

  5. For reference, I uploaded an ancient but detailed slide deck of mine on the subject, titled "Speculative Locking: Breaking the Scale Barrier (JAOO 2005)" (http://www.slideshare.net/slideshow/embed_code/30727191). With Haswell based commodity servers likely coming next year, this subject will be reopened and much-discussed.

    Having built up quite a bit of practical, real-world experience with synchronized blocks and HTM (through three generations of Vega hardware and supporting JVMs), I think we'll see mixed results in unintentionally-written cases, but to Peter's point, and especially for mechanically sympathetic developers out there, internal use of synchronized blocks when a JVM auto-backs-them-up with HTM can be an interesting new capability.

  6. BTW, one key comment: AFAIK HTM is only there in Haswell based processors. The E5-2600 V2s are Ivy Bridge processors. We'll probably need to wait for the V3 (or whatever they call it) to see this in servers.

  7. @Gil or Peter: so this is like "CAS" for large memory regions (not just 64 bits).

    Questions: what happens if the modified memory doesn't fit in the L1 cache? If we just map synchronized to HTM, I expect that there would be many instances where the number of modified elements is just too large (because people write large synchronized blocks and Java has an indirect nature of representing data which is far from cache friendly).

    Also, what happens if the thread is context-switched out while "in transaction" and the next process running on the same core trashes the L1 cache?

    1. The "2CAS" is for 128 bits. If the transaction doesn't fit in the L1 cache, the transaction aborts. Some updates are large, and even simple updates can cause it to fail, e.g. a modCount prevents concurrent writers.

      I imagine a context switch is an abort, or you could prevent interrupts for short portions of code.

  8. Can an HTM re-write of a strict .java synchronized code block ever lead to producing a run-time machine code re-write whose concurrent access scheme exhibits anything other than isolation=SERIALIZABLE?

