Lock-Free, Wait-Free and Multi-core Programming Roger Deran - - PowerPoint PPT Presentation

SLIDE 1

Lock-Free, Wait-Free and Multi-core Programming

Roger Deran boilerbay.com Fast, Efficient Concurrent Maps AirMap

SLIDE 2

Lock-Free and Wait-Free Data Structures

• Overview
• The Java Maps that use Lock-Free techniques
• Graphical performance of Map data structures
• Consensus number concept
• The Ubiquitous ‘CAS’ primitive
• Implementing AtomicInteger using CAS
• Implementing Java ConcurrentSkipListMap
• Volatile variables – vital but confusing
• Memory Barriers
    • so esoteric, we buy mutually free beer

SLIDE 3

Lock-Free and Wait-Free Data Structures

• For multiple threads sharing data
• Fast
    • Extreme concurrency with many cores active
    • Extreme performance – no expensive wait queues
    • Extremely low latency (wait-free)
• Constructed from very powerful, simple primitives
• Algorithms difficult, so usually use canned ones
• Active research on these precious techniques

SLIDE 4

Lock-Free and Wait-Free Data Structures

• Can implement fast locks with wait queues
    • Mutexes, RW locks, Semaphores, Condition Variables
• Can implement fast Atomics
    • Integers, Longs, Booleans, References
• Can implement multi-core data structures
    • HashMaps or Sets, Tree Maps or Sets, Queues, Lists, Stacks

SLIDE 5

Lock-Free and Wait-Free Data Structures

• Lock-Free
    • Not fair between threads
    • Always has a retry loop
    • Guarantees progress of some thread but not which one
    • Not a spin lock! Spins can almost stall the whole system
• Wait-Free – beats Lock-Free
    • Fair between threads
    • Every thread is guaranteed to make progress in finite time
• Rely on GC for unique ids, can generate much garbage
• More difficult in C, C++, boost::lockfree (the ‘ABA’ problem)
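The lock-free properties listed above can be seen in a minimal stack sketch. This is the classic Treiber stack, chosen here only as an illustration; the class name LockFreeStack and its methods are hypothetical and do not come from the slides. Every operation is a retry loop around one CAS, some thread always succeeds, and each push allocates a fresh node, so the garbage collector supplies the unique identities that keep the ‘ABA’ problem away.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration class (Treiber stack); not taken from the slides.
class LockFreeStack<T> {
    // Immutable node: a fresh object per push, so its identity is unique
    // for as long as any thread still holds a reference to it.
    private static final class Node<T> {
        final T item;
        final Node<T> next;
        Node(T item, Node<T> next) { this.item = item; this.next = next; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>();

    void push(T item) {
        for (;;) {                                    // the retry loop
            Node<T> oldTop = top.get();
            Node<T> newTop = new Node<>(item, oldTop);
            if (top.compareAndSet(oldTop, newTop)) return;
            // CAS failed: some other thread made progress, so try again
        }
    }

    // Returns null when the stack is empty.
    T pop() {
        for (;;) {
            Node<T> oldTop = top.get();
            if (oldTop == null) return null;
            if (top.compareAndSet(oldTop, oldTop.next)) return oldTop.item;
        }
    }

    public static void main(String[] args) {
        LockFreeStack<Integer> s = new LockFreeStack<>();
        s.push(1);
        s.push(2);
        System.out.println(s.pop()); // 2
        System.out.println(s.pop()); // 1
    }
}
```

Nothing here is fair: an unlucky thread can lose the CAS race repeatedly, which is exactly the gap between lock-free and wait-free.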

SLIDE 6

The standard Java Map Classes

The Concurrent* are Lock-Free and AirMap is Mostly Lock-Free

AirMap is a 90% faster, 50% more capacity, 7x faster iteration, multi-core ConcurrentNavigableMap from boilerbay.com

SLIDE 7

Map Feature Comparison

Feature                    HashMap  TreeMap  ConcurrentHashMap  ConcurrentSkipListMap  AirMap
put/get/remove             ●        ●        ●                  ●                      ●
ordered access             -        ●        -                  ●                      ●
thread safe                -        -        ●                  ●                      ●
most memory efficient      -        -        -                  -                      ●
fastest multicore access   -        -        -                  -                      ●
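A small sketch of what "ordered access" plus "thread safe" means in practice, using the JDK's ConcurrentSkipListMap through the ConcurrentNavigableMap interface. The class name OrderedConcurrentMapDemo and the key ranges are made up for illustration.

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical demo class; not taken from the slides.
class OrderedConcurrentMapDemo {
    // Two threads insert disjoint key ranges into one shared map, with no locks.
    static ConcurrentNavigableMap<Integer, String> fill() {
        ConcurrentNavigableMap<Integer, String> map = new ConcurrentSkipListMap<>();
        Thread evens = new Thread(() -> {
            for (int i = 0; i < 1000; i += 2) map.put(i, "even");
        });
        Thread odds = new Thread(() -> {
            for (int i = 1; i < 1000; i += 2) map.put(i, "odd");
        });
        evens.start();
        odds.start();
        try {
            evens.join();
            odds.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return map;
    }

    public static void main(String[] args) {
        ConcurrentNavigableMap<Integer, String> map = fill();
        System.out.println(map.size());             // 1000
        System.out.println(map.firstKey());         // 0
        System.out.println(map.lastKey());          // 999
        System.out.println(map.headMap(10).size()); // 10 (keys 0..9)
    }
}
```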

SLIDE 8

Lock-Free Map Random Cumulative Put

Decreasing exponential speed with Map size

SLIDE 9

Lock-Free Map Concurrent Random Put

SLIDE 10

Map Concurrent Random Access Mixed

4 thread put 4 thread get

SLIDE 11

Lock-Free 8-Thread Remove Speed

JVM size versus time shows GC efficiency

SLIDE 12

Lock-Free One-Thread Iterator Speed

SLIDE 13

Lock-Free One-Thread Iterator Speed

Log scale shows the entire spectrum

SLIDE 14

Map Entry size vs Map size

Size of basic Key/Value entry in bytes given log Map size

SLIDE 15

Consensus Number

Any given concurrency primitive has one
How many Threads can be synchronized?

• Consensus 1: Surprisingly, memory is weak
    • Atomic read or write to memory. Dekker’s Algorithm
• Consensus 2: Another surprise – many are weak
    • Queues, test-and-set, swap, getAndAdd, stacks
• Consensus infinity: A few vital powerful primitives
    • Augmented queue – like socket poll
    • Compare And Set “CAS” type instruction
    • Load-Link and Store-Conditional instruction pair
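Why CAS sits at consensus infinity can be sketched in a few lines: any number of threads propose a value, exactly one CAS from the initial null succeeds, and every thread then reads the same winner. The class CasConsensus and its method names are assumptions made for this illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration class; not taken from the slides.
class CasConsensus<T> {
    private final AtomicReference<T> decision = new AtomicReference<>();

    // Each thread proposes a non-null value; every caller returns the same winner.
    T decide(T proposal) {
        decision.compareAndSet(null, proposal); // only the first CAS succeeds
        return decision.get();
    }

    // Starts n threads that each propose their own index, and records
    // the value each thread decided on.
    static int[] runThreads(int n) {
        CasConsensus<Integer> consensus = new CasConsensus<>();
        int[] decided = new int[n];
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int id = i;
            threads[i] = new Thread(() -> decided[id] = consensus.decide(id));
            threads[i].start();
        }
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return decided;
    }

    public static void main(String[] args) {
        int[] decided = runThreads(8);
        // All eight entries are equal; which index wins varies from run to run.
        for (int d : decided) System.out.print(d + " ");
        System.out.println();
    }
}
```

There is no loop in decide(), so this particular protocol is also wait-free: every thread finishes in a bounded number of steps regardless of what the others do.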

SLIDE 16

The Ubiquitous CAS

Compare and Set
Atomic, Infinite consensus number

Pseudo-code, normally one instruction:

    boolean compareAndSet(ValueType *p, ValueType expectedValue, ValueType newValue) { … }

Java implementation invokes secret native code:

    class AtomicInteger {
        public final boolean compareAndSet(int expect, int update) {
            return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
        }
        …
    }

SLIDE 17

The Ubiquitous CAS

Compare and Set

• Definition: Atomically change a given memory location to a given new value if it has a given expected value, and return true iff the change took place.
• Consensus infinity is expensive.
    • Memory bus (or, on newer cores, the affected cache line) is locked for all cores: slow
    • x86, x64 instruction (with lock prefix byte for SMP):
          LOCK; CMPXCHG ptr, expected, new
• Can implement primitives with lower consensus numbers like AtomicInteger.getAndIncrement()
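The definition above can be written out as a reference model. In this sketch a lock stands in for the atomicity that the hardware instruction provides, so it shows the meaning of CAS, not how it is implemented; the class name CasSemantics is hypothetical. The main method checks the model against the real AtomicInteger.compareAndSet.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reference model of CAS semantics; not how the JVM implements it.
class CasSemantics {
    private int value;

    // Change value to newValue only if it currently equals expected;
    // return true iff the change took place.
    synchronized boolean compareAndSet(int expected, int newValue) {
        if (value != expected) return false;
        value = newValue;
        return true;
    }

    synchronized int get() { return value; }

    public static void main(String[] args) {
        CasSemantics model = new CasSemantics();
        AtomicInteger real = new AtomicInteger();

        System.out.println(model.compareAndSet(0, 5)); // true: value was 0
        System.out.println(real.compareAndSet(0, 5));  // true

        System.out.println(model.compareAndSet(0, 7)); // false: value is now 5
        System.out.println(real.compareAndSet(0, 7));  // false

        System.out.println(model.get() + " " + real.get()); // 5 5
    }
}
```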

SLIDE 18

AtomicInteger

From Java library source code. Lock-free (has retry loop).

    /**
     * Atomically increments by one the current value.
     *
     * @return the previous value
     */
    public final int getAndIncrement() {
        for (;;) {
            int current = get();
            int next = current + 1;
            if (compareAndSet(current, next))
                return current;
        }
    }
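The same retry loop works for any pure function of the current value, not just "+ 1". The sketch below generalizes it and uses it to build an atomic maximum; CasLoop, getAndUpdate and getAndMax are names invented for this example. Since Java 8, AtomicInteger ships its own getAndUpdate with this shape.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntUnaryOperator;

// Hypothetical helper class; not taken from the slides.
class CasLoop {
    // Same shape as getAndIncrement(), with "+ 1" replaced by an arbitrary
    // side-effect-free function. The function may run more than once.
    static int getAndUpdate(AtomicInteger a, IntUnaryOperator f) {
        for (;;) {
            int current = a.get();
            int next = f.applyAsInt(current);
            if (a.compareAndSet(current, next)) return current;
        }
    }

    // Atomically raises the stored value to at least candidate;
    // returns the previous value.
    static int getAndMax(AtomicInteger a, int candidate) {
        return getAndUpdate(a, v -> Math.max(v, candidate));
    }

    public static void main(String[] args) {
        AtomicInteger a = new AtomicInteger(3);
        System.out.println(getAndMax(a, 10)); // 3
        System.out.println(a.get());          // 10
        System.out.println(getAndMax(a, 4));  // 10
        System.out.println(a.get());          // 10
    }
}
```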

SLIDE 19

ConcurrentSkipListMap

Leaf node structure from Java source code

    static final class Node<K,V> {
        final K key;
        volatile Object value;
        volatile Node<K,V> next;
    }

SLIDE 20

ConcurrentSkipListMap

from Java source code comments

     * Here's the sequence of events for a deletion of node n with
     * predecessor b and successor f, initially:
     *
     *        +------+       +------+      +------+
     *   ...  |   b  |------>|   n  |----->|   f  | ...
     *        +------+       +------+      +------+
     *
     * 1. CAS n's value field from non-null to null.
     *    From this point on, no public operations encountering
     *    the node consider this mapping to exist. However, other
     *    ongoing insertions and deletions might still modify
     *    n's next pointer.

SLIDE 21

ConcurrentSkipListMap

from source code comments

     * 2. CAS n's next pointer to point to a new marker node.
     *    From this point on, no other nodes can be appended to n.
     *    which avoids deletion errors in CAS-based linked lists.
     *
     *        +------+       +------+      +------+       +------+
     *   ...  |   b  |------>|   n  |----->|marker|------>|   f  | ...
     *        +------+       +------+      +------+       +------+
     *

SLIDE 22

ConcurrentSkipListMap

from Java source code comments

     * 3. CAS b's next pointer over both n and its marker.
     *    From this point on, no new traversals will encounter n,
     *    and it can eventually be GCed.
     *        +------+                                    +------+
     *   ...  |   b  |----------------------------------->|   f  | ...
     *        +------+                                    +------+
     *
     * A failure at step 1 leads to simple retry due to a lost race
     * with another operation. Steps 2-3 can fail because some other
     * thread noticed during a traversal a node with null value and
     * helped out by marking and/or unlinking. This helping-out
     * ensures that no thread can become stuck waiting for progress of
     * the deleting thread. The use of marker nodes slightly
     * complicates helping-out code because traversals must track
     * consistent reads of up to four nodes (b, n, marker, f), not
     * just (b, n, f), although the next field of a marker is
     * immutable, and once a next field is CAS'ed to point to a
     * marker, it never again changes, so this requires less care.
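The three steps can be sketched on a bare singly linked list. This is a heavily simplified illustration, not the JDK code: the class MarkerDeleteSketch, the isMarker flag and the use of AtomicReference fields are all assumptions made here, and the traversal, helping-out and re-traversal logic of the real map is left out.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified sketch of the three-step deletion; not the JDK code.
class MarkerDeleteSketch {
    static final class Node {
        final String key;                      // null for a marker
        final boolean isMarker;
        final AtomicReference<Object> value;   // null means logically deleted
        final AtomicReference<Node> next;      // never changed on a marker

        Node(String key, Object value, Node next, boolean isMarker) {
            this.key = key;
            this.isMarker = isMarker;
            this.value = new AtomicReference<>(value);
            this.next = new AtomicReference<>(next);
        }
    }

    // Deletes n, whose predecessor is b. Returns false if n was already
    // deleted or step 1 lost a race.
    static boolean delete(Node b, Node n) {
        // Step 1: CAS n's value from non-null to null (logical deletion).
        Object v = n.value.get();
        if (v == null || !n.value.compareAndSet(v, null)) return false;

        // Step 2: CAS a marker after n so nothing can be appended to n.
        for (;;) {
            Node f = n.next.get();
            if (f != null && f.isMarker) break;   // someone else already marked
            if (n.next.compareAndSet(f, new Node(null, null, f, true))) break;
        }

        // Step 3: CAS b's next pointer over both n and its marker.
        // If this fails, the real map re-traverses; the sketch just moves on.
        Node marker = n.next.get();
        b.next.compareAndSet(n, marker.next.get());
        return true;
    }

    public static void main(String[] args) {
        Node f = new Node("c", "3", null, false);
        Node n = new Node("b", "2", f, false);
        Node b = new Node("a", "1", n, false);
        System.out.println(delete(b, n));     // true
        System.out.println(b.next.get().key); // c
    }
}
```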

SLIDE 23

Volatile Variables

Vital, little understood. We consider Java ‘volatile’ here
Necessary for inter-thread visibility (also in C#)

    class MyClass {
        // only one thread necessarily sees this
        int i;
        // vi can be seen by any thread
        volatile int vi;
        // Java array elements are not volatile!
        volatile int[] va = new int[SIZE];
        // only the reference is volatile
        volatile ArrayList val = new ArrayList();
        // synchronized loads, stores all variables
        public synchronized void set(int newI) { i = newI; }
        …
    }
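The usual first use of volatile is a stop flag. In this sketch (class and method names are made up) the worker re-reads the volatile field on every pass, so it is guaranteed to see the write from the other thread. Without volatile the JIT would be allowed to hoist the read out of the loop, and the worker might never stop.

```java
// Hypothetical demo class; not taken from the slides.
class VolatileFlagDemo {
    private volatile boolean stop;   // written by one thread, read by another
    private long iterations;         // touched only by the worker thread

    // Lets a worker spin for the given time, then asks it to stop.
    // Returns true if the worker noticed the flag and exited.
    boolean stopsAfter(long millis) {
        Thread worker = new Thread(() -> {
            while (!stop) {          // volatile read on every pass
                iterations++;
            }
        });
        worker.start();
        try {
            Thread.sleep(millis);
            stop = true;             // volatile write: visible to the worker
            worker.join(5000);       // generous timeout for the demo
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println(new VolatileFlagDemo().stopsAfter(20)); // true
    }
}
```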

SLIDE 24

Volatile Variables

Vital, little understood.
Some architectures re-order loads/stores to memory!

• ‘As if’ no change to the code but slower.
• Ensure loads and stores reach memory for inter-thread visibility (except for C, C++ it’s only for I/O)
• Locks and synchronized blocks do too, but they are slower and not lock-free.
• Not Atomic!
    • myVolatile++ by two threads may lose a count.
    • Use AtomicInteger instead.
• Generally much faster than CAS, atomics, locks.
    • Very fast, or free. (on x86, load is free on hardware)
• Consensus number 1
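The "Not Atomic!" point is easy to demonstrate. In this sketch (names are hypothetical) two threads bump a volatile int and an AtomicInteger side by side. The atomic counter always ends at exactly twice the per-thread count; the volatile counter usually ends lower, although any single run may happen to lose nothing.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not taken from the slides.
class LostUpdateDemo {
    static volatile int volatileCount;
    static final AtomicInteger atomicCount = new AtomicInteger();

    // Two threads each perform perThread increments on both counters.
    static void run(int perThread) {
        volatileCount = 0;
        atomicCount.set(0);
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                volatileCount++;               // load, add, store: three separate steps
                atomicCount.getAndIncrement(); // CAS retry loop: never loses a count
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        run(100_000);
        System.out.println("atomic:   " + atomicCount.get()); // atomic:   200000
        System.out.println("volatile: " + volatileCount);     // at most 200000, usually less
    }
}
```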

SLIDE 25

Volatile Reordering

Some architectures re-order loads/stores to memory!

No reordering of volatile loads/stores to memory : program order is followed

By ahead-of-time compiler (javac)

By just-in-time compiler (the JVM)

By core (all necessary implied or explicit ‘memory barriers’)

Mixed with Non-volatiles:

Non-volatile loads and stores can mix together in any way.

Non-volatile ops can ‘float’ below a volatile load (‘acquire’)

Non-volatile ops can ‘float’ above a volatile store (‘release’)

http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
http://g.oswego.edu/dl/jmm/cookbook.html (Doug Lea - Java)
https://gcc.gnu.org/onlinedocs/gcc-4.4.0/gcc/Atomic-Builtins.html
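The acquire/release rules above are what make the common "publish through a volatile flag" idiom safe. In this sketch (class and method names are invented) the plain store to payload comes before the volatile store and cannot sink below it, and the plain load of payload comes after the volatile load and cannot rise above it, so a reader that sees ready == true also sees the payload.

```java
// Hypothetical demo class; not taken from the slides.
class Publication {
    private int payload;             // plain, non-volatile field
    private volatile boolean ready;  // volatile guard

    void publish(int value) {
        payload = value;   // earlier plain store: stays above the volatile store
        ready = true;      // volatile store ('release')
    }

    // Spins until a value has been published, then returns it.
    int await() {
        while (!ready) {   // volatile load ('acquire')
            Thread.yield();
        }
        return payload;    // later plain load: stays below the volatile load
    }

    // Publishes value from the calling thread and returns what a reader thread saw.
    static int roundTrip(int value) {
        Publication p = new Publication();
        int[] seen = new int[1];
        Thread reader = new Thread(() -> seen[0] = p.await());
        reader.start();
        p.publish(value);
        try {
            reader.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42)); // 42
    }
}
```

If ready were a plain field, the reader could legally spin forever or return 0.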

SLIDE 26

Java Volatile Variables

Vital, little understood.

• Broken before Java 1.5 fixed the memory model
• Array elements are always non-volatile
• Primitives or references can be volatile
• Defined by ‘happens-before’ binary relation.
    • Seems almost nobody understands it this way.

SLIDE 27

C# Volatile Variables

Vital, little understood.

• volatile variables like Java
• System.Threading.Volatile.Read(var)
• System.Threading.Volatile.Write(var)
• Volatile variable load/store just implies volatile read/write as above
• Can be applied to array elements, unlike Java

SLIDE 28

C, C++ Volatile Variables

Vital, little understood.

C, C++ already has a volatile keyword like const (dangerous)
    Forces access to occur, in program order
    Intended only for memory mapped I/O!
    Not for threading but may work, e.g. in Microsoft, probably gcc
    No re-ordering control of non-volatiles at all!
    Illegal in Linux kernel! Use kernel native barriers and locks.

gcc:
    Full sw-only barrier: asm volatile("" ::: "memory");
    Full sw/hw barrier: gcc 4.4.0 and above: __sync_synchronize()
    Volatile load/store: asm volatile(hw_specific_instruction)

C++11:
    Nice sw-only barrier: atomic_signal_fence(std::memory_order)
    Nice hw/sw barrier: atomic_thread_fence(std::memory_order)
    Atomic variables: std::atomic<..>

SLIDE 29

Hardware Memory Barriers

prevent core from swapping its loads/stores to memory

• Four conceptual primitive kinds of barriers, can be combined:
    • load-store: default in Sparc TSO, x86, ARM, POWER
    • load-load: default in Sparc TSO, x86, ARM, POWER
    • store-store: default in Sparc TSO, x86
    • store-load: default in none. Slow
• Memory barriers affect all other loads or stores by the generating core, and not just a memory location of the load or store in the instruction.
• E.g. a load-store barrier prevents any earlier load from being swapped with any later store by that core.
• A volatile load causes a succeeding load-store and load-load, i.e. ‘acquire’ barrier. Other stores can ‘float’ down.
• A volatile store causes a preceding load-store and store-store, i.e. ‘release’ barrier. (plus a ‘store-load’ in x86 OpenJDK).
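Java 9 and later expose the acquire and release halves directly through java.lang.invoke.VarHandle, which postdates the JDK sources quoted in these slides. The sketch below is an assumption-laden illustration (the class AcquireReleaseSketch and its fields are invented): setRelease gives the ‘release’ ordering and getAcquire the ‘acquire’ ordering described above, without the extra store-load barrier that a full volatile store implies.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical sketch, assuming Java 9+; not taken from the slides.
class AcquireReleaseSketch {
    private int payload;
    private boolean ready;   // plain field, accessed only through READY

    private static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup()
                    .findVarHandle(AcquireReleaseSketch.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void publish(int value) {
        payload = value;
        READY.setRelease(this, true);   // release: earlier accesses stay above this store
    }

    int await() {
        for (;;) {
            boolean seen = (boolean) READY.getAcquire(this); // acquire: later accesses stay below
            if (seen) return payload;
            Thread.yield();
        }
    }

    // Publishes value from the calling thread and returns what a reader thread saw.
    static int roundTrip(int value) {
        AcquireReleaseSketch s = new AcquireReleaseSketch();
        int[] result = new int[1];
        Thread reader = new Thread(() -> result[0] = s.await());
        reader.start();
        s.publish(value);
        try {
            reader.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(99)); // 99
    }
}
```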

SLIDE 30

Reordering Comparison

Increasing Processor Relaxed Memory Ordering Levels

Total Store reordering. Can switch Store-Load
    Sparc: TSO ‘total store ordering’ mode
    AMD: x86, x64 instruction set architecture
    Intel: x86, x64

Partial ordering. Can switch Store-Store and Store-Load
    Sparc: PSO (obsolete)

Full reordering: Can switch everything including atomic load/store
    Sparc: RMO (obsolete)
    ARM v7 or later: depends on implementation architecture?
    POWER
    IA-64 (Intel Itanium)
    MIPS: hw implementation environment dependent

Full reordering plus dependent loads reordered
    DEC Alpha

SLIDE 31

Memory Barrier Instructions

from OpenJDK orderAccess.hpp

    //                  sparc RMO             ia64              x86
    // ---------------------------------------------------------------------
    // fence            membar #LoadStore |   mf                lock addl 0,(sp)
    //                         #StoreStore |
    //                         #LoadLoad |
    //                         #StoreLoad
    //
    // release          membar #LoadStore |   st.rel [sp]=r0    movl $0,<dummy>
    //                         #StoreStore
    //                  st %g0,[]
    //
    // acquire          ld [%sp],%g0          ld.acq <r>=[sp]   movl (sp),<r>
    //                  membar #LoadLoad |
    //                         #LoadStore
    //
    // release_store    membar #LoadStore |   st.rel            <store>
    //                         #StoreStore
    //                  st
    //
    // store_fence      st                    st                lock xchg
    //                  fence                 mf
    //
    // load_acquire     ld                    ld.acq            <load>
    //                  membar #LoadLoad |
    //                         #LoadStore
