 
Lock-Free, Wait-Free and Multi-core Programming
Roger Deran, boilerbay.com
Fast, Efficient Concurrent Maps: AirMap

Lock-Free and Wait-Free Data Structures: Overview
- The Java Maps that use lock-free techniques
- Graphical performance of Map data structures
- The consensus number concept
- The ubiquitous ‘CAS’ primitive
- Implementing AtomicInteger using CAS
- Implementing Java ConcurrentSkipListMap
- Volatile variables: vital but confusing
- Memory barriers: so esoteric, we buy mutually free beer

Lock-Free and Wait-Free Data Structures
- For multiple threads sharing data
- Fast
  - Extreme concurrency with many cores active
  - Extreme performance: no expensive wait queues
  - Extremely low latency (wait-free)
- Constructed from very powerful, simple primitives
- The algorithms are difficult, so usually use canned ones
- Active research on these precious techniques

Lock-Free and Wait-Free Data Structures
- Can implement fast locks with wait queues
  - Mutexes, RW locks, Semaphores, Condition Variables
- Can implement fast Atomics
  - Integers, Longs, Booleans, References
- Can implement multi-core data structures
  - HashMaps or Sets, Tree Maps or Sets, Queues, Lists, Stacks

Lock-Free and Wait-Free Data Structures
- Lock-Free
  - Not fair between threads
  - Always has a retry loop
  - Guarantees progress of some thread, but not which one
  - Not a spin lock! Spins can almost stall the whole system
- Wait-Free: beats Lock-Free
  - Fair between threads
  - Every thread is guaranteed to make progress in finite time
  - Rely on GC for unique ids; can generate much garbage
  - More difficult in C, C++, boost::lockfree (the ‘ABA’ problem)

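The retry loop that marks lock-free code can be illustrated with a classic Treiber stack. This is a minimal sketch, not code from the talk: the class name `LockFreeStack` and its methods are made up for illustration. It assumes a garbage-collected runtime, so each push allocates a fresh node and the ‘ABA’ problem mentioned above does not arise in this form.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration: a minimal lock-free (Treiber) stack.
final class LockFreeStack<T> {
    private static final class Node<T> {
        final T item;
        final Node<T> next;
        Node(T item, Node<T> next) { this.item = item; this.next = next; }
    }

    // The only shared mutable state: the top of the stack.
    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T item) {
        for (;;) {                                   // retry loop
            Node<T> oldHead = head.get();
            Node<T> newHead = new Node<>(item, oldHead);
            // Succeeds only if no other thread changed head in between.
            if (head.compareAndSet(oldHead, newHead)) return;
        }
    }

    // Returns the most recently pushed item, or null if the stack is empty.
    T pop() {
        for (;;) {                                   // retry loop
            Node<T> oldHead = head.get();
            if (oldHead == null) return null;
            if (head.compareAndSet(oldHead, oldHead.next)) return oldHead.item;
        }
    }
}
```

A failed CAS here means some other thread's CAS succeeded, so the system as a whole always makes progress even though one particular thread may retry many times. That is the lock-free guarantee, and also why it is not fair.
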
The standard Java Map Classes
The Concurrent* classes are Lock-Free and AirMap is Mostly Lock-Free
AirMap is a 90% faster, 50% more capacity, 7x faster iteration, multi-core ConcurrentNavigableMap from boilerbay.com

Map Feature Comparison

Feature                   ConcurrentSkipListMap  ConcurrentHashMap  HashMap  TreeMap  AirMap
put/get/remove            yes                    yes                yes      yes      yes
ordered access            yes                    -                  -        yes      yes
thread safe               yes                    yes                -        -        yes
most memory efficient     -                      -                  -        -        yes
fastest multicore access  -                      -                  -        -        yes

Lock-Free Map Random Cumulative Put
(Chart: decreasing exponential speed with Map size)

Lock-Free Map Concurrent Random Put
(Chart)

Map Concurrent Random Access, Mixed
(Chart: 4 threads put, 4 threads get)

Lock-Free 8-Thread Remove Speed
(Chart: JVM size versus time shows GC efficiency)

Lock-Free One-Thread Iterator Speed
(Chart)

Lock-Free One-Thread Iterator Speed
(Chart: log scale shows the entire spectrum)

Map Entry Size vs Map Size
(Chart: size of a basic key/value entry in bytes, given log Map size)

Consensus Number
Any given concurrency primitive has one: how many threads can it synchronize?
- Consensus 1: Surprisingly, memory is weak
  - Atomic read or write to memory. Dekker’s Algorithm
- Consensus 2: Another surprise, many are weak
  - Queues, test-and-set, swap, getAndAdd, stacks
- Consensus infinity: A few vital, powerful primitives
  - Augmented queue, like socket poll
  - Compare And Set (“CAS”) type instruction
  - Load-Link and Store-Conditional instruction pair

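To see why CAS sits at consensus infinity, consider a one-shot agreement object: any number of threads each propose a value, and all of them must come away with the same winner. The sketch below is an illustration only, and the class name `CasConsensus` is hypothetical. It assumes proposals are non-null, since null stands for "undecided".

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration: one-shot consensus for any number of threads.
final class CasConsensus<T> {
    // null means "no decision yet"; proposals are assumed to be non-null.
    private final AtomicReference<T> decision = new AtomicReference<>();

    // Every caller proposes a value; every caller returns the same winner.
    T decide(T proposal) {
        // Only the first CAS from null can succeed; later ones fail harmlessly.
        decision.compareAndSet(null, proposal);
        return decision.get();
    }
}
```

There is no loop, so each caller finishes in a bounded number of steps regardless of how many threads take part. Plain reads and writes cannot do this even for two threads, which is what "consensus 1" expresses.
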
The Ubiquitous CAS: Compare and Set
Atomic, with infinite consensus number

Pseudo-code, normally one instruction:

    boolean compareAndSet(ValueType* p, ValueType expectedValue, ValueType newValue) { … }

Java implementation invokes secret native code:

    class AtomicInteger {
        public final boolean compareAndSet(int expect, int update) {
            return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
        }
        …
    }

The Ubiquitous CAS: Compare and Set
- Definition: Atomically change a given memory location to a given new value if it has a given expected value, and return true iff the change took place.
- Consensus infinity is expensive.
  - The cache line (on older processors, the whole memory bus) is locked against all other cores: slow
- x86, x64 instruction (with lock prefix byte for SMP):
    LOCK; CMPXCHG ptr, expected, new
  (schematic: the real instruction takes the expected value implicitly in the accumulator register)
- Can implement primitives with lower consensus numbers, like AtomicInteger.getAndIncrement()

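The definition above can be spelled out as a small reference model. This is only a sketch of the semantics: the class `ModelCasInt` is hypothetical, and it uses a lock to get atomicity, which is exactly what a real CAS instruction avoids.

```java
// Hypothetical reference model of CAS semantics.
// A real CAS is a single hardware instruction, not a synchronized method.
final class ModelCasInt {
    private int value;

    ModelCasInt(int initial) { value = initial; }

    // Stores newValue only if the current value equals expected.
    // Returns true iff the store took place.
    synchronized boolean compareAndSet(int expected, int newValue) {
        if (value != expected) return false;
        value = newValue;
        return true;
    }

    synchronized int get() { return value; }
}
```

Starting from 5, `compareAndSet(5, 6)` succeeds and a following `compareAndSet(5, 7)` fails, because the location no longer holds the expected value.
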
AtomicInteger from Java library source code: lock-free (has retry loop)

    /**
     * Atomically increments by one the current value.
     *
     * @return the previous value
     */
    public final int getAndIncrement() {
        for (;;) {
            int current = get();
            int next = current + 1;
            if (compareAndSet(current, next))
                return current;
        }
    }

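The same read, compute, CAS, retry pattern works for any read-modify-write operation, not just increment. As a sketch, here is an atomic "raise to at least" built the same way; the class `CasLoops` and the method `getAndMax` are hypothetical names, not part of the Java library.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper showing the generic CAS retry pattern.
final class CasLoops {
    // Atomically sets target to max(current, candidate); returns the previous value.
    static int getAndMax(AtomicInteger target, int candidate) {
        for (;;) {
            int current = target.get();
            int next = Math.max(current, candidate);
            // If another thread changed the value since get(), retry with the fresh value.
            if (target.compareAndSet(current, next))
                return current;
        }
    }
}
```

Only the line that computes `next` differs from getAndIncrement; the loop structure is the whole technique.
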
ConcurrentSkipListMap: leaf node structure from Java source code

    static final class Node<K,V> {
        final K key;
        volatile Object value;
        volatile Node<K,V> next;
    }

ConcurrentSkipListMap from Java source code comments

     * Here's the sequence of events for a deletion of node n with
     * predecessor b and successor f, initially:
     *
     *        +------+       +------+      +------+
     *   ...  |   b  |------>|   n  |----->|   f  | ...
     *        +------+       +------+      +------+
     *
     * 1. CAS n's value field from non-null to null.
     *    From this point on, no public operations encountering
     *    the node consider this mapping to exist. However, other
     *    ongoing insertions and deletions might still modify
     *    n's next pointer.

ConcurrentSkipListMap from Java source code comments

     * 2. CAS n's next pointer to point to a new marker node.
     *    From this point on, no other nodes can be appended to n.
     *    which avoids deletion errors in CAS-based linked lists.
     *
     *        +------+       +------+      +------+       +------+
     *   ...  |   b  |------>|   n  |----->|marker|------>|   f  | ...
     *        +------+       +------+      +------+       +------+
     *

ConcurrentSkipListMap from Java source code comments

     * 3. CAS b's next pointer over both n and its marker.
     *    From this point on, no new traversals will encounter n,
     *    and it can eventually be GCed.
     *        +------+                                    +------+
     *   ...  |   b  |----------------------------------->|   f  | ...
     *        +------+                                    +------+
     *
     * A failure at step 1 leads to simple retry due to a lost race
     * with another operation. Steps 2-3 can fail because some other
     * thread noticed during a traversal a node with null value and
     * helped out by marking and/or unlinking. This helping-out
     * ensures that no thread can become stuck waiting for progress of
     * the deleting thread. The use of marker nodes slightly
     * complicates helping-out code because traversals must track
     * consistent reads of up to four nodes (b, n, marker, f), not
     * just (b, n, f), although the next field of a marker is
     * immutable, and once a next field is CAS'ed to point to a
     * marker, it never again changes, so this requires less care.

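The three CAS steps can be traced in a stripped-down sketch. This is not the JDK code: `MarkerDeleteSketch`, its `Node` class and the `marker` flag are hypothetical simplifications, AtomicReference fields stand in for the volatile fields, and the helping-out and retry logic described in the comments is left out entirely.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified sketch of the three-step deletion; no helping, no retries.
final class MarkerDeleteSketch {
    static final class Node {
        final String key;
        final boolean marker;                 // simplification: explicit marker flag
        final AtomicReference<Object> value;
        final AtomicReference<Node> next;

        Node(String key, Object value, Node next) { this(key, value, next, false); }

        private Node(String key, Object value, Node next, boolean marker) {
            this.key = key;
            this.marker = marker;
            this.value = new AtomicReference<>(value);
            this.next = new AtomicReference<>(next);
        }

        static Node markerBefore(Node successor) {
            return new Node(null, null, successor, true);
        }
    }

    // Deletes n, whose predecessor is b. Returns true iff this call won step 1.
    static boolean delete(Node b, Node n) {
        // Step 1: logical deletion. CAS n's value from non-null to null.
        Object v = n.value.get();
        if (v == null || !n.value.compareAndSet(v, null)) return false;

        // Step 2: CAS a marker in after n so nothing more can be appended to n.
        Node f = n.next.get();
        if (f == null || !f.marker) {
            n.next.compareAndSet(f, Node.markerBefore(f));
        }

        // Step 3: CAS b's next over both n and its marker.
        // If step 2 or this CAS lost a race, the sketch just leaves the unlinking
        // undone; the real implementation relies on later traversals to help.
        Node m = n.next.get();
        if (m != null && m.marker) {
            b.next.compareAndSet(n, m.next.get());
        }
        return true;
    }
}
```

Run single-threaded on a chain b, n, f, the call `delete(b, n)` leaves b pointing directly at f, while n keeps a null value and a marker as its successor until the garbage collector reclaims it.
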
Volatile Variables
Vital, little understood. We consider Java ‘volatile’ here.
Necessary for inter-thread visibility (also in C#)

    class MyClass {
        // only one thread necessarily sees this
        int i;
        // vi can be seen by any thread
        volatile int vi;
        // Java array elements are not volatile!
        volatile int[] va = new int[SIZE];
        // only the reference is volatile
        volatile ArrayList val = new ArrayList();
        // synchronized loads, stores all variables
        public synchronized void set(int newI) { i = newI; }
        …
    }

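A small sketch of both points, visibility of a volatile field and the array-element caveat. The class `VolatileFlagDemo` is hypothetical. It uses java.util.concurrent.atomic.AtomicIntegerArray, whose get and set have volatile semantics per element, as one way around "array elements are not volatile".

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical demo: a volatile stop flag plus an array with volatile element access.
final class VolatileFlagDemo {
    // Without volatile, the worker might never observe the write to stop.
    private volatile boolean stop;

    // Each element read/write through get/set has volatile semantics.
    private final AtomicIntegerArray slots = new AtomicIntegerArray(4);

    // Starts a worker that spins until stop is set, then publishes 42 into slot 0.
    int runOnce() {
        Thread worker = new Thread(() -> {
            while (!stop) {
                // busy-wait; the volatile read is re-done on every iteration
            }
            slots.set(0, 42);
        });
        worker.start();
        stop = true;                      // volatile write, visible to the worker
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return slots.get(0);
    }
}
```

If `stop` were a plain field, the JIT compiler would be free to hoist the read out of the loop and the worker could spin forever.
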
Volatile Variables
Vital, little understood.
- Some architectures re-order loads/stores to memory!
  - ‘As if’ no change to the code, but slower.
- Ensure loads and stores reach memory for inter-thread visibility (except in C and C++, where it’s only for I/O)
- Locks and synchronized blocks do too, but they are slower and not lock-free.
- Not Atomic!
  - myVolatile++ by two threads may lose a count.
  - Use AtomicInteger instead.
- Generally much faster than CAS, atomics, locks.
  - Very fast, or free (on x86, a load is free in hardware)
- Consensus number 1

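The "Not Atomic!" point can be tried out with a small sketch. The class `LostUpdateDemo` is hypothetical. Several threads increment a volatile int and an AtomicInteger side by side. The atomic counter always ends at the exact total; the volatile one may end lower, by an amount that varies from run to run.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: volatile++ can lose counts, AtomicInteger cannot.
final class LostUpdateDemo {
    static volatile int volatileCount;
    static final AtomicInteger atomicCount = new AtomicInteger();

    // Runs `threads` workers, each doing `perThread` increments of both counters.
    static void run(int threads, int perThread) {
        volatileCount = 0;
        atomicCount.set(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    volatileCount++;               // read, add, write: three separate steps
                    atomicCount.getAndIncrement(); // one atomic read-modify-write
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

After `LostUpdateDemo.run(4, 100000)`, `atomicCount.get()` is 400000, while `volatileCount` is at most 400000 and typically less on a multi-core machine.
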