Lock-Free, Wait-Free and Multi-core Programming
Roger Deran
boilerbay.com
Fast, Efficient Concurrent Maps
AirMap
Lock-Free and Wait-Free Data Structures
Overview
The Java Maps that use Lock-Free techniques
Graphical performance of Map data structures
Consensus number concept
The Ubiquitous ‘CAS’ primitive
Implementing AtomicInteger using CAS
Implementing Java ConcurrentSkipListMap
Volatile variables: vital but confusing
Memory Barriers: so esoteric, we buy mutually free beer
Lock-Free and Wait-Free Data Structures
For multiple threads sharing data
Fast
Extreme concurrency with many cores active
Extreme performance: no expensive wait queues
Extremely low latency (wait-free)
Constructed from very powerful, simple primitives
Algorithms difficult, so usually use canned ones
Active research on these precious techniques
Lock-Free and Wait-Free Data Structures
Can implement fast locks with wait queues
Mutexes, RW locks, Semaphores, Condition Variables
Can implement fast Atomics
Integers, Longs, Booleans, References
Can implement multi-core data structures
HashMaps or Sets, Tree Maps or Sets, Queues, Lists, Stacks
Lock-Free and Wait-Free Data Structures
Lock-Free
Not fair between threads
Always has a retry loop
Guarantees progress of some thread but not which one
Not a spin lock! Spins can almost stall the whole system
Wait-Free: beats Lock-Free
Fair between threads
Every thread is guaranteed to make progress in finite time
Rely on GC for unique ids, can generate much garbage
More difficult in C, C++, boost::lockfree (the ‘ABA’ problem)
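The retry loop and the reliance on GC for unique ids can be illustrated with a minimal lock-free stack sketch. This is not taken from the slides; the class and method names are illustrative. Each push allocates a fresh node, so a reference can never be recycled while another thread still holds it, which is what sidesteps the ‘ABA’ problem in Java.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Minimal lock-free stack sketch; names are illustrative, not from the slides. */
class LockFreeStack<T> {
    private static final class Node<T> {
        final T item;
        final Node<T> next;
        Node(T item, Node<T> next) { this.item = item; this.next = next; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T item) {
        for (;;) {                                        // retry loop
            Node<T> oldHead = head.get();
            Node<T> newHead = new Node<>(item, oldHead);  // fresh object: unique identity
            if (head.compareAndSet(oldHead, newHead)) return;
            // CAS failed, so some other thread made progress; try again
        }
    }

    /** Returns the top item, or null if the stack is empty. */
    T pop() {
        for (;;) {
            Node<T> oldHead = head.get();
            if (oldHead == null) return null;
            if (head.compareAndSet(oldHead, oldHead.next)) return oldHead.item;
        }
    }
}
```

A failed CAS here always means another thread succeeded, so the system as a whole progresses even though an individual thread may retry many times.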
The standard Java Map Classes
The Concurrent* are Lock-Free and AirMap is Mostly Lock-Free
AirMap is a 90% faster 50% more capacity 7x faster Iteration Multi-core ConcurrentNavigableMap from boilerbay.com
Map Feature Comparison
Feature                   HashMap  TreeMap  ConcurrentHashMap  ConcurrentSkipListMap  AirMap
put/get/remove            yes      yes      yes                yes                    yes
ordered access            -        yes      -                  yes                    yes
thread safe               -        -        yes                yes                    yes
most memory efficient     -        -        -                  -                      yes
fastest multicore access  -        -        -                  -                      yes
Lock-Free Map Random Cumulative Put
Decreasing exponential speed with Map size
Lock-Free Map Concurrent Random Put
Map Concurrent Random Access Mixed
4 thread put 4 thread get
Lock-Free 8-Thread Remove Speed
JVM size versus time shows GC efficiency
Lock-Free One-Thread Iterator Speed
Lock-Free One-Thread Iterator Speed
Log scale shows the entire spectrum
Map Entry size vs Map size
Size of basic Key/Value entry in bytes given log Map size
Consensus Number
Any given concurrency primitive has one
How many Threads can be synchronized?
Consensus 1: Surprisingly, memory is weak
Atomic read or write to memory. Dekker’s Algorithm
Consensus 2: Another surprise: many are weak
Queues, test-and-set, swap, getAndAdd, stacks
Consensus infinity: A few vital powerful primitives
Augmented queue, like socket poll
Compare And Set “CAS” type instruction
Load-Link and Store-Conditional instruction pair
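Why CAS has an infinite consensus number can be sketched in a few lines: any number of threads propose a value, exactly one CAS from the initial null succeeds, and every thread then reads the same winner. The class below is an illustrative sketch, not code from the slides, and it assumes proposals are non-null.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Sketch: any number of threads agree on one value with a single CAS. Names are illustrative. */
class CasConsensus<T> {
    private final AtomicReference<T> decided = new AtomicReference<>(null);

    /** Each thread proposes a non-null value; every caller gets the same winning value back. */
    T decide(T proposal) {
        decided.compareAndSet(null, proposal); // only the first CAS succeeds
        return decided.get();                  // everyone reads the winner
    }
}
```

No such construction exists for plain reads and writes, or for swap and test-and-set beyond two threads, which is the point of the consensus hierarchy.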
The Ubiquitous CAS
Compare and Set
Atomic, Infinite consensus number
Pseudo-code, normally one instruction:
    boolean compareAndSet(ValueType *p, ValueType expectedValue, ValueType newValue) { … }
Java implementation invokes secret native code:
    class AtomicInteger {
        public final boolean compareAndSet(int expect, int update) {
            return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
        }
        …
    }
The Ubiquitous CAS
Compare and Set
Definition: Atomically change a given memory location to a given new value if it has a given expected value, and return true iff the change took place.
Consensus infinity is expensive.
Memory bus (or, on newer cores, the cache line) is locked for all cores: slow
x86, x64 instruction (with lock prefix byte for SMP):
LOCK; CMPXCHG ptr, expected, new
Can implement primitives with lower consensus numbers like AtomicInteger.getAndIncrement()
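As a rough illustration of building lower-consensus primitives from CAS, here is a sketch of swap and a test-and-set style flag on top of AtomicInteger.compareAndSet. The class and method names are hypothetical; the JDK already ships AtomicInteger.getAndSet, so this is only to show the pattern.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: consensus-2 primitives built from CAS. Names are illustrative. */
class CasDerived {
    /** Atomic swap: stores newValue and returns the previous value. */
    static int getAndSet(AtomicInteger cell, int newValue) {
        for (;;) {
            int current = cell.get();
            if (cell.compareAndSet(current, newValue)) return current;
        }
    }

    /** Test-and-set on a 0/1 flag: returns true if this caller changed it from 0 to 1. */
    static boolean testAndSet(AtomicInteger flag) {
        return flag.compareAndSet(0, 1);
    }
}
```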
AtomicInteger
from Java library source code.
lock-free (has retry loop)
    /**
     * Atomically increments by one the current value.
     *
     * @return the previous value
     */
    public final int getAndIncrement() {
        for (;;) {
            int current = get();
            int next = current + 1;
            if (compareAndSet(current, next))
                return current;
        }
    }
ConcurrentSkipListMap
Leaf node structure from Java source code
    static final class Node<K,V> {
        final K key;
        volatile Object value;
        volatile Node<K,V> next;
    }
ConcurrentSkipListMap
from Java source code comments
 * Here's the sequence of events for a deletion of node n with
 * predecessor b and successor f, initially:
 *
 *        +------+       +------+      +------+
 *   ...  |   b  |------>|   n  |----->|   f  | ...
 *        +------+       +------+      +------+
 *
 * 1. CAS n's value field from non-null to null.
 *    From this point on, no public operations encountering
 *    the node consider this mapping to exist. However, other
 *    ongoing insertions and deletions might still modify
 *    n's next pointer.
ConcurrentSkipListMap
from source code comments
 * 2. CAS n's next pointer to point to a new marker node.
 *    From this point on, no other nodes can be appended to n.
 *    which avoids deletion errors in CAS-based linked lists.
 *
 *        +------+       +------+      +------+       +------+
 *   ...  |   b  |------>|   n  |----->|marker|------>|   f  | ...
 *        +------+       +------+      +------+       +------+
 *
ConcurrentSkipListMap
from Java source code comments
 * 3. CAS b's next pointer over both n and its marker.
 *    From this point on, no new traversals will encounter n,
 *    and it can eventually be GCed.
 *        +------+                                    +------+
 *   ...  |   b  |----------------------------------->|   f  | ...
 *        +------+                                    +------+
 *
 * A failure at step 1 leads to simple retry due to a lost race
 * with another operation. Steps 2-3 can fail because some other
 * thread noticed during a traversal a node with null value and
 * helped out by marking and/or unlinking. This helping-out
 * ensures that no thread can become stuck waiting for progress of
 * the deleting thread. The use of marker nodes slightly
 * complicates helping-out code because traversals must track
 * consistent reads of up to four nodes (b, n, marker, f), not
 * just (b, n, f), although the next field of a marker is
 * immutable, and once a next field is CAS'ed to point to a
 * marker, it never again changes, so this requires less care.
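The three steps can be sketched on a bare singly linked list. This is a simplified illustration, not the JDK code: it uses AtomicReference fields instead of volatile fields with field updaters, a hypothetical isMarker flag instead of the JDK's marker encoding, and it omits traversal, insertion and the skip-list index levels.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Sketch of the three-step CAS deletion; not the JDK implementation. Names are illustrative. */
class MarkerDeleteSketch {
    static final class Node {
        final String key;
        final boolean isMarker;                 // assumption: a flag marks marker nodes
        final AtomicReference<Object> value;
        final AtomicReference<Node> next;

        Node(String key, Object value, Node next) {
            this.key = key;
            this.isMarker = false;
            this.value = new AtomicReference<>(value);
            this.next = new AtomicReference<>(next);
        }

        /** Marker node: no key or value; its next field is never changed after construction. */
        Node(Node next) {
            this.key = null;
            this.isMarker = true;
            this.value = new AtomicReference<>(null);
            this.next = new AtomicReference<>(next);
        }
    }

    /** Returns true if this call won step 1, i.e. logically removed n. */
    static boolean delete(Node b, Node n) {
        Object v = n.value.get();
        if (v == null || !n.value.compareAndSet(v, null)) {
            return false;                       // step 1 lost the race
        }
        helpDelete(b, n);
        return true;
    }

    /** Steps 2 and 3; any thread that sees n with a null value may call this to help. */
    static void helpDelete(Node b, Node n) {
        for (;;) {                              // step 2: make n's successor a marker
            Node f = n.next.get();
            if (f != null && f.isMarker) break; // already marked by a helper
            if (n.next.compareAndSet(f, new Node(f))) break;
        }
        Node marker = n.next.get();
        // step 3: swing b past n and its marker; a failed CAS means someone else unlinked n
        b.next.compareAndSet(n, marker.next.get());
    }
}
```

Because helpDelete only ever moves the list toward the final state, it is safe for several threads to run it on the same node at once, which mirrors the helping-out described in the comment.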
Volatile Variables
Vital, little understood.
We consider Java ‘volatile’ here
Necessary for inter-thread visibility (also in C#)
    class MyClass {
        // only one thread necessarily sees this
        int i;
        // vi can be seen by any thread
        volatile int vi;
        // Java array elements are not volatile!
        volatile int[] va = new int[SIZE];
        // only the reference is volatile
        volatile ArrayList val = new ArrayList();
        // synchronized loads, stores all variables
        public synchronized void set(int newI) { i = newI; }
        …
    }
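A common use of inter-thread visibility is a stop flag. The sketch below is illustrative (the class name and the 10 ms delay are arbitrary choices): with the volatile modifier the worker is guaranteed to see the write; without it, the JIT could legally hoist the read out of the loop and the worker might never stop.

```java
/** Sketch: a volatile stop flag. Names and timings are illustrative. */
class StopFlagDemo {
    private volatile boolean stop = false;   // volatile: the worker is guaranteed to see the write
    private long iterations = 0;             // only written by the worker thread

    /** Starts a worker, lets it run briefly, asks it to stop, and returns its iteration count. */
    long runBriefly() {
        Thread worker = new Thread(() -> {
            while (!stop) {                  // volatile load on every pass
                iterations++;
            }
        });
        worker.start();
        try {
            Thread.sleep(10);
            stop = true;                     // volatile store
            worker.join();                   // join also makes 'iterations' visible here
        } catch (InterruptedException e) {
            stop = true;
            throw new IllegalStateException(e);
        }
        return iterations;
    }
}
```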
Volatile Variables
Vital, little understood.
Some architectures re-order loads/stores to memory!
‘As if’ no change to the code but slower.
Ensure loads and stores reach memory for inter-thread visibility (except for C, C++, where it’s only for I/O)
Locks and synchronized blocks do too, but they are slower and not lock-free.
Not Atomic!
myVolatile++ by two threads may lose a count.
Use AtomicInteger instead.
Generally much faster than CAS, atomics, locks.
Very fast, or free. (on x86, load is free on hardware)
Consensus number 1
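The lost-count hazard can be observed with a small experiment. This is an illustrative sketch (class and method names are made up): two threads increment a volatile int and an AtomicInteger the same number of times. The atomic total is always exact; the volatile total may come out lower, though how often that happens depends on the machine.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: volatile++ is visible but not atomic. Names are illustrative. */
class CounterRace {
    static volatile int volatileCount = 0;
    static final AtomicInteger atomicCount = new AtomicInteger();

    /** Two threads each increment both counters n times; returns {volatile total, atomic total}. */
    static int[] race(int n) {
        volatileCount = 0;
        atomicCount.set(0);
        Runnable work = () -> {
            for (int i = 0; i < n; i++) {
                volatileCount++;                 // load, add, store: an update can be lost
                atomicCount.getAndIncrement();   // atomic read-modify-write: never loses a count
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return new int[] { volatileCount, atomicCount.get() };
    }
}
```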
Volatile Reordering
Some architectures re-order loads/stores to memory!
No reordering of volatile loads/stores to memory: program order is followed
By ahead-of-time compiler (javac)
By just-in-time compiler (the JVM)
By core (all necessary implied or explicit ‘memory barriers’)
Mixed with Non-volatiles:
Non-volatile loads and stores can mix together in any way.
Non-volatile ops can ‘float’ below a volatile load (‘acquire’)
Non-volatile ops can ‘float’ above a volatile store (‘release’)
http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
http://g.oswego.edu/dl/jmm/cookbook.html (Doug Lea - Java)
https://gcc.gnu.org/onlinedocs/gcc-4.4.0/gcc/Atomic-Builtins.html
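The acquire/release rules are what make the usual "write data, then set a volatile flag" idiom safe. A minimal sketch, with hypothetical names and a -1 sentinel chosen only for illustration:

```java
/** Sketch: publishing a plain field through a volatile flag. Names are illustrative. */
class Publication {
    private int payload;                 // plain (non-volatile) field
    private volatile boolean ready;      // volatile flag

    void publish(int value) {
        payload = value;                 // plain store: may not move below the volatile store
        ready = true;                    // volatile store ('release')
    }

    /** Returns the payload if it has been published, or -1 if the flag is not yet set. */
    int tryRead() {
        if (ready) {                     // volatile load ('acquire')
            return payload;              // plain load: may not move above the volatile load
        }
        return -1;
    }
}
```

A reader that sees ready == true is therefore guaranteed to see the payload written before it, even though payload itself is not volatile.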
Java Volatile Variables
Vital, little understood.
Broken before Java 1.5 fixed the memory model
Array elements are always non-volatile
Primitives or references can be volatile
Defined by ‘happens-before’ binary relation.
Seems almost nobody understands it this way.
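When volatile semantics are needed per array element, the standard library's java.util.concurrent.atomic.AtomicIntegerArray is one option. The sketch below contrasts it with a volatile array reference; the class and method names are illustrative.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Sketch: volatile array reference versus per-element volatile access. Names are illustrative. */
class VolatileElements {
    // 'volatile' covers only the reference, not slots[i]
    volatile int[] slots = new int[4];

    // AtomicIntegerArray gives volatile (and CAS) semantics per element
    final AtomicIntegerArray atomicSlots = new AtomicIntegerArray(4);

    void writeBoth(int index, int value) {
        slots[index] = value;            // plain element store
        atomicSlots.set(index, value);   // volatile element store
    }

    boolean casSlot(int index, int expect, int update) {
        return atomicSlots.compareAndSet(index, expect, update);
    }
}
```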
C# Volatile Variables
Vital, little understood.
volatile variables like Java
System.Threading.Volatile.Read(ref var)
System.Threading.Volatile.Write(ref var, value)
Volatile variable load/store just implies volatile read/write as above
Can be applied to array elements, unlike Java
C, C++ Volatile Variables
Vital, little understood.
C, C++ already has a volatile keyword like const (dangerous)
Forces access to occur, in program order
Intended only for memory mapped I/O!
Not for threading but may work, e.g. in Microsoft, probably gcc
No re-ordering control of non-volatiles at all!
Strongly discouraged in the Linux kernel! Use kernel native barriers and locks.
gcc:
Full sw-only barrier: asm volatile("" ::: "memory");
Full sw/hw barrier: gcc 4.4.0 and above: __sync_synchronize()
Volatile load/store: asm volatile(hw_specific_instruction)
C++11:
Nice sw-only barrier: atomic_signal_fence(std::memory_order)
Nice hw/sw barrier: atomic_thread_fence(std::memory_order)
Atomic variables: std::atomic<..>
Hardware Memory Barriers
prevent core from swapping its loads/stores to memory
Four conceptual primitive kinds of barriers, can be combined:
load-store: default in Sparc TSO, x86 (on ARM, POWER only between dependent accesses)
load-load: default in Sparc TSO, x86 (on ARM, POWER only between dependent accesses)
store-store: default in Sparc TSO, x86
store-load: default in none. Slow
Memory barriers affect all other loads or stores by the generating core, and not just a memory location of the load or store in the instruction.
E.g. a load-store barrier prevents any earlier load from being swapped with any later store by that core.
A volatile load causes a succeeding load-store and load-load, i.e. ‘acquire’ barrier. Other stores can ‘float’ down.
A volatile store causes a preceding load-store and store-store, i.e. ‘release’ barrier. (plus a ‘store-load’ in x86 OpenJDK).
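The need for the expensive store-load barrier shows up in the classic "store buffering" test, the core of Dekker-style algorithms: each thread stores to its own variable and then loads the other's. The sketch below is illustrative (names are made up). With volatile x and y, the Java memory model forbids both threads reading 0; with plain fields that outcome is allowed, and on x86 it can actually occur because a store may still sit in the store buffer when the later load executes.

```java
/** Sketch: store-buffering litmus test with volatile variables. Names are illustrative. */
class StoreLoadLitmus {
    static volatile int x, y;   // volatile: each store is ordered before the later load
    static int r1, r2;          // results, read only after join()

    /** Runs the two-thread test once; returns true if both threads read 0. */
    static boolean bothSawZero() {
        x = 0;
        y = 0;
        Thread t1 = new Thread(() -> { x = 1; r1 = y; });
        Thread t2 = new Thread(() -> { y = 1; r2 = x; });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return r1 == 0 && r2 == 0;
    }

    /** Counts forbidden outcomes over many runs; with volatile x and y this should be 0. */
    static int countViolations(int runs) {
        int violations = 0;
        for (int i = 0; i < runs; i++) {
            if (bothSawZero()) violations++;
        }
        return violations;
    }
}
```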
Reordering Comparison
Increasing Processor Relaxed Memory Ordering Levels
Total store ordering. Can switch Store-Load
Sparc: TSO ‘total store ordering’ mode
AMD: x86, x64 instruction set architecture
Intel: x86, x64
Partial ordering. Can switch Store-Store and Store-Load
Sparc PSO (obsolete)
Full reordering: Can switch everything including atomic load/store
Sparc: RMO (obsolete)
ARM v7 or later: depends on implementation architecture?
POWER
IA-64 (Intel Itanium)
MIPS: hw implementation environment dependent
Full reordering plus dependent loads reordered
DEC Alpha
Memory Barrier Instructions
from OpenJDK orderAccess.hpp
//                       sparc RMO             ia64             x86
// ---------------------------------------------------------------------
// fence                 membar #LoadStore |   mf               lock addl 0,(sp)
//                              #StoreStore |
//                              #LoadLoad |
//                              #StoreLoad
//
// release               membar #LoadStore |   st.rel [sp]=r0   movl $0,<dummy>
//                              #StoreStore
//                       st %g0,[]
//
// acquire               ld [%sp],%g0          ld.acq <r>=[sp]  movl (sp),<r>
//                       membar #LoadLoad |
//                              #LoadStore
//
// release_store         membar #LoadStore |   st.rel           <store>
//                              #StoreStore
//                       st
//
// store_fence           st                    st               lock xchg
//                       fence                 mf
//
// load_acquire          ld                    ld.acq           <load>
//                       membar #LoadLoad |
//                              #LoadStore
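From Java code, similar operations are reachable through java.lang.invoke.VarHandle (Java 9 and later). The sketch below treats setRelease, getAcquire and VarHandle.fullFence() as rough analogues of release_store, load_acquire and fence in the table; that mapping is an assumption for illustration, not something stated in orderAccess.hpp, and the class and field names are made up.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

/** Sketch: acquire/release access to a plain field via VarHandle. Names are illustrative. */
class AcquireReleaseBox {
    private int data;                    // plain field, published via 'ready'
    private boolean ready;               // accessed only through READY below

    private static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup()
                    .findVarHandle(AcquireReleaseBox.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void put(int value) {
        data = value;
        READY.setRelease(this, true);    // roughly 'release_store'
    }

    /** Returns the data if it has been published, else -1. */
    int poll() {
        if ((boolean) READY.getAcquire(this)) {   // roughly 'load_acquire'
            return data;
        }
        return -1;
    }

    static void fullBarrier() {
        VarHandle.fullFence();           // roughly 'fence': orders all four load/store pairs
    }
}
```

On x86 the acquire and release accesses typically compile to ordinary loads and stores, matching the cheap right-hand column of the table, while the full fence needs a locked instruction.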