Transactional Memory
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Moore's Law
Transistor count still rising
Clock speed flattening sharply
[Figure: a CPU and its memory]
[Figure: a shared-memory multiprocessor, with per-processor caches connected by a bus to shared memory]
[Figure: the same organization, all on the same chip (Sun T2000 Niagara)]
Traditional uniprocessor speedup: the same user code runs 1.8x, 3.6x, then 7x faster over time, thanks to Moore's law.
Multicore speedup: 1.8x, 3.6x, 7x? Unfortunately, not so simple…
Multicore speedup: 1.8x, 2x, 2.9x. Parallelization and synchronization require great care…
Speedup = 1-thread execution time / n-thread execution time
You buy a 10-core machine … How close to a 10-fold speedup, if your application is:
60% concurrent, 40% sequential?
80% concurrent, 20% sequential?
90% concurrent, 10% sequential?
99% concurrent, 1% sequential?
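The questions above can be answered with Amdahl's law: if a fraction p of the work is concurrent and runs on n cores, the speedup is 1 / ((1 - p) + p / n). The following is a minimal sketch; the class and method names are illustrative and not from the slides.

```java
/** Amdahl's law for the 10-core questions above (illustrative names). */
final class Amdahl {
    /** Speedup on n cores when a fraction p (0..1) of the work is concurrent. */
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        double[] concurrentFractions = {0.60, 0.80, 0.90, 0.99};
        for (double p : concurrentFractions) {
            System.out.printf("%.0f%% concurrent: %.2fx%n", p * 100, speedup(p, 10));
        }
    }
}
```

On 10 cores this gives roughly 2.17x, 3.57x, 5.26x, and 9.17x: even a 10% sequential part costs almost half of the ideal speedup.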
This course is about the parts that are hard to make concurrent … but still have a big influence on speedup!
Coarse-grained locking: easily made correct … but not scalable.
Fine-grained locking: can be very tricky …
If a thread holding a lock is delayed … No one else can make progress
The relation between lock data and object data exists only in the programmer's mind.
Actual comment from the Linux kernel (hat tip: Bradley Kuszmaul):
/*
 * When a locked buffer is visible to the I/O layer
 * BH_Launder is set. This means before unlocking
 * we must clear BH_Launder,mb() on alpha and then
 * clear BH_Lock, so no reader can see BH_Launder set
 * on an unlocked buffer and then risk to deadlock.
 */
Double-ended queue: enq(x) at one end, enq(y) at the other.
No interference if the ends are "far apart".
Interference is OK if the queue is small.
A clean solution is a publishable result [Michael & Scott, PODC 96].
Transfer an item from one queue to another.
Must be atomic: no duplicate or missing items.
Lock source, lock target, then unlock source & target.
Methods cannot provide internal synchronization.
Objects must expose locking protocols to clients.
Clients must devise and follow protocols.
Abstraction broken!
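A minimal sketch of the lock-based transfer described above, to show why the abstraction breaks: the queue has to expose its lock, and every client has to agree on a locking protocol. The `rank` field and the rule of locking the lower rank first are assumptions added here to avoid deadlock; the slides only say "lock source, lock target".

```java
import java.util.ArrayDeque;
import java.util.concurrent.locks.ReentrantLock;

/** A queue that must expose its lock so that clients can compose operations. */
final class LockedQueue<T> {
    final ReentrantLock lock = new ReentrantLock(); // exposed to clients
    final int rank;                                 // hypothetical global lock order
    private final ArrayDeque<T> items = new ArrayDeque<>();

    LockedQueue(int rank) { this.rank = rank; }

    void enq(T x) { lock.lock(); try { items.addLast(x); } finally { lock.unlock(); } }

    /** Returns the oldest item, or null if the queue is empty. */
    T deq() { lock.lock(); try { return items.pollFirst(); } finally { lock.unlock(); } }

    int size() { lock.lock(); try { return items.size(); } finally { lock.unlock(); } }
}

final class Transfer {
    /** Moves one item from source to target; returns false if source was empty. */
    static <T> boolean transfer(LockedQueue<T> source, LockedQueue<T> target) {
        // Clients must agree on a lock order, or two opposite transfers can deadlock.
        LockedQueue<T> first = source.rank <= target.rank ? source : target;
        LockedQueue<T> second = (first == source) ? target : source;
        first.lock.lock();
        second.lock.lock();
        try {
            T x = source.deq();       // reentrant: this thread already holds source.lock
            if (x == null) return false;
            target.enq(x);
            return true;
        } finally {
            second.lock.unlock();
            first.lock.unlock();
        }
    }
}
```

Nothing in `LockedQueue` enforces the ordering rule; it lives only in the client code, which is exactly the problem the slide points out.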
If the buffer is empty, wait for an item to show up.
Two empty buffers: wait for either?
Much modern programming practice is inadequate for the multicore world.
Agenda:
Replace locking with a transactional API
Design languages and libraries
Implement efficient run-times
Overview:
Transactional Memory
Hardware Transactional Memory
Hybrid Transactional Memory
Software Transactional Memory
Research Questions
A transaction is a block of code that is:
Atomic: appears to happen instantaneously
Serializable: all transactions appear to happen one at a time
Commit: takes effect (atomically)
Abort: has no effect (typically restarted)
atomic {
  x.remove(3);
  y.add(3);
}

atomic {
  y = null;
}
public void LeftEnq(item x) {
  Qnode q = new Qnode(x);
  q.left = left;
  left.right = q;
  left = q;
}

The same method with its body enclosed in an atomic block:

public void LeftEnq(item x) {
  atomic {
    Qnode q = new Qnode(x);
    q.left = left;
    left.right = q;
    left = q;
  }
}
Not always this simple! Conditional waits? False conflicts? Resource limits? Better problems to have …
public void Transfer(Queue<T> q1, Queue<T> q2) {
  atomic {
    T x = q1.deq();
    q2.enq(x);
  }
}
Trivial or what?
public T LeftDeq() {
  atomic {
    if (left == null)
      retry;    // roll back transaction and restart when something changes
    …
  }
}
atomic {
  x = q1.deq();
} orElse {
  x = q2.deq();
}
Run the 1st method. If it retries … run the 2nd method. If it retries … the entire statement retries.
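The control flow of retry and orElse can be imitated in plain Java, as a rough sketch only. Here `retry` is modeled as an unchecked exception and `orElse` as a combinator; all names are hypothetical. Unlike a real transactional memory, nothing is rolled back and nothing is thread-safe, so each alternative must not have side effects before it decides to retry.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

final class OrElseSketch {
    /** Thrown by an alternative to signal "retry": it cannot proceed yet. */
    static final class Retry extends RuntimeException {
        Retry() { super(null, null, false, false); } // control flow only, no stack trace
    }

    /** Runs first; if it retries, runs second; if that retries too, the whole statement retries. */
    static <T> T orElse(Supplier<T> first, Supplier<T> second) {
        try {
            return first.get();
        } catch (Retry r) {
            return second.get(); // a Retry thrown here propagates to the caller
        }
    }

    /** Dequeue that retries when the queue is empty. */
    static <T> T deq(ArrayDeque<T> q) {
        T x = q.pollFirst();
        if (x == null) throw new Retry();
        return x;
    }
}
```

A caller would wrap `orElse` in a loop that waits until one of the queues changes before trying again; a transactional run-time does that waiting automatically.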
Exploit standard “cache coherence” Detect synchronization conflicts … Invalidate cached copies of data.
[Figure sequence: processors with caches, connected by a bus to memory]
Random access memory (10s of cycles)
Shared bus
Per-processor caches
A processor issues "load x" on the bus; memory answers "Got it!" and the data is cached.
A second processor issues "load x"; the cache that holds the data answers "Got it", and both caches now hold copies.
To modify the data, a cache broadcasts "invalidate x".
This cache acquires write permission.
Other caches lose read permission.
Memory provides data only if not present in any cache, so no need to change it now (expensive).
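The load and invalidate steps can be traced with a toy model of a single memory location x. This is a sketch with hypothetical names and a simplified three-state scheme; in particular, writing the value back to memory when a modified copy is shared again is an assumption of this model, not something the slides specify.

```java
import java.util.Arrays;

/** Toy model of invalidation-based coherence for one memory location x. */
final class CoherenceSketch {
    enum State { INVALID, SHARED, MODIFIED }

    final State[] state;   // state of x in each cache
    final int[] cached;    // cached copy of x in each cache
    int memory;            // x in memory (may be stale while some cache is MODIFIED)

    CoherenceSketch(int caches, int initial) {
        state = new State[caches];
        Arrays.fill(state, State.INVALID);
        cached = new int[caches];
        memory = initial;
    }

    /** load x: a local hit, else served by a cache that holds x, else by memory. */
    int load(int me) {
        if (state[me] != State.INVALID) return cached[me];
        int value = memory;
        for (int other = 0; other < state.length; other++) {
            if (other != me && state[other] != State.INVALID) {
                value = cached[other];                 // another cache answers "got it"
                if (state[other] == State.MODIFIED) {  // assumed: owner writes back and downgrades
                    memory = value;
                    state[other] = State.SHARED;
                }
                break;
            }
        }
        cached[me] = value;
        state[me] = State.SHARED;
        return value;
    }

    /** store x: broadcast "invalidate x", then write locally without touching memory. */
    void store(int me, int value) {
        for (int other = 0; other < state.length; other++) {
            if (other != me) state[other] = State.INVALID; // others lose read permission
        }
        cached[me] = value;
        state[me] = State.MODIFIED;                        // this cache has write permission
    }
}
```

After `store`, memory still holds the old value, which matches the remark that memory does not need to be changed right away.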
[Figure sequence: caches connected by an interconnect to memory; cache entries are tagged T or D as transactions run, and one transaction is marked committed]
At commit point …
No cache conflicts? We win.
Mark transactional cache entries …
Was: read-only, now: valid
Was: modified, now: dirty (will be written back)
That's (almost) everything!
IBM's Blue Gene/Q, System Z, and Power8
Intel's Haswell TSX extensions
if (_xbegin() == _XBEGIN_STARTED) {   // start a speculative transaction
  // if you see _XBEGIN_STARTED, you are inside a transaction
  speculative code
  _xend();
} else {
  // if you see anything else, your transaction aborted
  abort handler   // you could retry the transaction, or take an alternative path
}
unsigned status = _xbegin();           // keep the status so the abort cause can be tested
if (status == _XBEGIN_STARTED) {
  speculative code                     // speculative code can call _xabort()
  _xend();
} else if (status & _XABORT_EXPLICIT) {
  aborted by user code
} else if (status & _XABORT_CONFLICT) {
  read-write conflict                  // synchronization conflict
} else if (status & _XABORT_CAPACITY) {
  cache overflow                       // read/write set too big (maybe don't retry)
} else {
  …
}
Transaction aborts if data set
Transaction aborts on timer interrupt
Many other reasons: TLB miss, illegal instruction, page fault …
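Because a transaction can abort for many reasons, the abort handler usually needs a policy for when to retry. The following is a sketch of one such policy, written in Java only to keep the examples in one language. The constants mirror the bit positions of the `_XABORT_*` flags in Intel's `<immintrin.h>`; the class, the `Action` enum, and the specific decisions are assumptions for illustration.

```java
/** Hypothetical retry policy over a TSX-style abort status word. */
final class AbortPolicy {
    // Bit positions mirror the _XABORT_* flags in Intel's <immintrin.h>.
    static final int XABORT_EXPLICIT = 1 << 0;
    static final int XABORT_RETRY    = 1 << 1;
    static final int XABORT_CONFLICT = 1 << 2;
    static final int XABORT_CAPACITY = 1 << 3;

    enum Action { RETRY, FALL_BACK }

    static Action decide(int status, int attemptsLeft) {
        if (attemptsLeft <= 0) return Action.FALL_BACK;
        if ((status & XABORT_CAPACITY) != 0) return Action.FALL_BACK; // set too big: retrying rarely helps
        if ((status & XABORT_EXPLICIT) != 0) return Action.FALL_BACK; // user code asked for the abort
        if ((status & (XABORT_CONFLICT | XABORT_RETRY)) != 0) return Action.RETRY; // often transient
        return Action.FALL_BACK; // interrupts, page faults, and similar causes may leave no flag set
    }
}
```

The fallback action is typically the lock-based path shown next.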
if (_xbegin() == _XBEGIN_STARTED) {
  read lock state                  // reading lock ensures that transaction will abort if another thread acquires lock
  if (lock taken) _xabort(0xff);   // abort if another thread has acquired lock (the 8-bit abort code is arbitrary)
  work;
  _xend();
} else {
  lock->lock();                    // (aborting concurrent speculative transactions)
  work;
  lock->unlock();
}
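The same "speculate first, fall back to the lock" shape exists in software in Java's `StampedLock`, which offers optimistic reads. This is an analogy, not hardware transactional memory: only reads are speculative, and the class below is a hypothetical example.

```java
import java.util.concurrent.locks.StampedLock;

/** Speculate first, fall back to the lock: a software analogue of the pattern above. */
final class OptimisticPoint {
    private final StampedLock lock = new StampedLock();
    private double x, y;

    void move(double dx, double dy) {
        long stamp = lock.writeLock();
        try { x += dx; y += dy; } finally { lock.unlockWrite(stamp); }
    }

    double distanceFromOrigin() {
        long stamp = lock.tryOptimisticRead();   // speculative read: no lock acquired
        double cx = x, cy = y;
        if (!lock.validate(stamp)) {             // a writer interfered: speculation failed
            stamp = lock.readLock();             // fallback path: really acquire the lock
            try { cx = x; cy = y; } finally { lock.unlockRead(stamp); }
        }
        return Math.sqrt(cx * cx + cy * cy);
    }
}
```

As with the hardware version, the speculative path must observe the lock state, here through `validate`, so that a concurrent writer forces the fallback.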
<HLE acquire prefix> lock();     // first time around, read lock and execute speculatively
do work;
<HLE release prefix> unlock();
If speculation fails, no more Mr. Nice Guy: acquire the lock.
[Figure: timelines contrasting locks (lock transfer latencies, serialized execution) with lock elision]
[Figure sequence: a linked list a, b, c, d and a remove(b) call; a read transaction traverses the list with no locks acquired]
Too short? Missed opportunity.
Too far? Transaction aborts, work lost.
On success: limit = limit + 1
On failure: limit = limit / 2
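The success and failure rule above is an additive-increase, multiplicative-decrease scheme. A minimal sketch follows; the class name and the lower bound `floor` are assumptions added so that the limit never reaches zero.

```java
/** Additive-increase, multiplicative-decrease limit, following the rule above. */
final class AdaptiveLimit {
    private int limit;
    private final int floor; // hypothetical lower bound, not in the slides

    AdaptiveLimit(int initial, int floor) {
        this.floor = floor;
        this.limit = Math.max(initial, floor);
    }

    int get() { return limit; }

    void onSuccess() { limit = limit + 1; }                  // committed: try going a little farther

    void onFailure() { limit = Math.max(floor, limit / 2); } // aborted: back off quickly
}
```

Starting from 8, one success followed by one failure moves the limit to 9 and then to 4.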
Node* teleport(Node* start, T v) {
  int retries = RETRY_THRESHOLD;
  while (--retries) {
    int distance = 0;
    if (_xbegin() == _XBEGIN_STARTED) {
      traverse up to teleportLimit nodes
      move lock
      _xend();
      teleportLimit++;                     // success: try going farther next time
      return pred;                         // last node reached
    } else {
      teleportLimit = teleportLimit / 2;   // abort: back off
    }
  }
}
STMs come in different forms: lock-free and lock-based.
But didn't you just say that locks are evil? For applications, yes! For run-time systems written by experts, maybe not …
Each transaction keeps:
Read set: locations and values read
Write set: locations and values written
Changes installed at commit
Conflicts detected at commit
One lock per memory location: too many locks!
Lock striping: each lock covers many memory locations.
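Lock striping can be sketched as a fixed array of locks indexed by a hash of the address. The class name and the hash (dropping three alignment bits, then taking the remainder) are illustrative assumptions; the slides do not specify a mapping.

```java
import java.util.concurrent.locks.ReentrantLock;

/** A fixed pool of locks shared by all memory locations. */
final class StripedLocks {
    private final ReentrantLock[] stripes;

    StripedLocks(int count) {
        stripes = new ReentrantLock[count];
        for (int i = 0; i < count; i++) stripes[i] = new ReentrantLock();
    }

    /** Maps an address to a stripe index (hypothetical hash: drop 8-byte alignment bits). */
    int stripeOf(long address) {
        return (int) Long.remainderUnsigned(address >>> 3, stripes.length);
    }

    /** The lock that covers this address; distinct addresses may share it. */
    ReentrantLock lockFor(long address) {
        return stripes[stripeOf(address)];
    }
}
```

Two addresses that map to the same stripe conflict even if they are unrelated, which is the price paid for keeping the number of locks small.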
To read memory …
Check unlocked
Add address, values, and versions to read set
To write memory …
Add address, new values and versions to write set
To commit …
Acquire write locks
Compare version #s
Install new values
Increment version #s
Release locks
A human zombie is dead but acts as if it were alive …
A zombie transaction is one that will certainly abort, but continues to run …
Why do we care?
Invariant: x = 2y
Transaction A reads x = 2.
Transaction B writes x ← 4, y ← 2 and commits.
Transaction A reads y = 2.
This transaction is a zombie, doomed to die, but still running! Who cares?
Transaction A computes z ← 1/(x-y). Oh, no! It divides by zero and crashes the system!
The property that every transaction sees a consistent state is called … opacity.
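The interleaving above can be replayed deterministically in a single thread. This sketch only traces the values a zombie would see; the class and method names are hypothetical, and no real transactions are involved.

```java
/** Replays the zombie interleaving for the invariant x == 2 * y. */
final class ZombieDemo {
    int x = 2, y = 1;                 // initial state satisfies the invariant

    /** Returns x - y as computed from one old read and one new read. */
    int differenceSeenByZombie() {
        int readX = x;                // the reader sees x = 2
        x = 4; y = 2;                 // another transaction commits x <- 4, y <- 2
        int readY = y;                // the reader sees y = 2: a mix of old and new state
        return readX - readY;         // 0, although x - y is never 0 in any committed state
    }
}
```

Dividing by this result would fail, even though every committed state satisfies the invariant.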
Introduce a global version clock:
Incremented by (some) writers
Read by everyone
Guarantees opacity
Version numbers are not really timestamps, but it is useful to pretend.
To run a transaction:
Copy clock to rv
Run speculative transaction as before …
Lock write set …
Increment global clock
Validate read set …
Commit & release locks
Read-only transactions?
Check that version numbers are less than or equal to the cached clock
Check that variables read are unlocked
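The steps above can be put together in a small lock-based STM with a global version clock. This is a sketch under stated assumptions, not the implementation from the book: all names are hypothetical, cells hold a single `int`, and a transaction that fails to get a write lock simply aborts instead of waiting.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

/** Sketch of a lock-based STM with a global version clock (hypothetical names). */
final class TinyStm {
    static final AtomicLong clock = new AtomicLong();      // global version clock

    /** One transactional memory location with its own version number and lock. */
    static final class Cell {
        volatile int value;
        volatile long version;
        final ReentrantLock lock = new ReentrantLock();
        Cell(int initial) { value = initial; }
    }

    /** Signals that the current transaction must abort. */
    static final class Abort extends RuntimeException {}

    static final class Tx {
        final long rv = clock.get();                        // copy clock to rv
        final Set<Cell> readSet = new HashSet<>();
        final Map<Cell, Integer> writeSet = new LinkedHashMap<>();

        int read(Cell c) {
            Integer pending = writeSet.get(c);
            if (pending != null) return pending;            // read our own write
            long before = c.version;
            int v = c.value;
            // Check unlocked, unchanged, and no newer than our copy of the clock.
            if (c.lock.isLocked() || c.version != before || before > rv) throw new Abort();
            readSet.add(c);
            return v;
        }

        void write(Cell c, int v) { writeSet.put(c, v); }   // deferred until commit

        void commit() {
            if (writeSet.isEmpty()) return;                 // read-only: reads were checked as they ran
            List<Cell> held = new ArrayList<>();
            try {
                for (Cell c : writeSet.keySet()) {          // lock write set
                    if (!c.lock.tryLock()) throw new Abort();
                    held.add(c);
                }
                long wv = clock.incrementAndGet();          // increment global clock
                for (Cell c : readSet) {                    // validate read set
                    boolean lockedByOther = c.lock.isLocked() && !c.lock.isHeldByCurrentThread();
                    if (lockedByOther || c.version > rv) throw new Abort();
                }
                for (Map.Entry<Cell, Integer> e : writeSet.entrySet()) {
                    e.getKey().value = e.getValue();        // install new values
                    e.getKey().version = wv;                // and new version numbers
                }
            } finally {
                for (Cell c : held) c.lock.unlock();        // release locks
            }
        }
    }

    /** Runs body as a transaction, restarting it whenever it aborts. */
    static <T> T atomic(Function<Tx, T> body) {
        while (true) {
            Tx tx = new Tx();
            try {
                T result = body.apply(tx);
                tx.commit();
                return result;
            } catch (Abort a) {
                // aborted: the transaction had no effect, so run it again
            }
        }
    }
}
```

A transfer between two cells would be written as `TinyStm.atomic(tx -> { tx.write(a, tx.read(a) - 3); tx.write(b, tx.read(b) + 3); return null; })`. Because every read is checked against `rv`, a transaction never acts on a mix of old and new values; it aborts at the read instead.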
Managed languages (Java, C#, …): easy to control interactions between transactional & non-transactional threads
C, C++, …: hard to control interactions between transactional & non-transactional threads
Modify private copies & install on commit: commit requires work, consistency easier
Modify in place, roll back on abort: makes commit efficient, consistency harder
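The second choice, modifying in place and rolling back on abort, is usually built on an undo log. A minimal single-threaded sketch follows, with hypothetical names and an `int[]` standing in for memory.

```java
import java.util.ArrayDeque;

/** Modify in place and remember how to undo each write. */
final class UndoLog {
    private final ArrayDeque<Runnable> undo = new ArrayDeque<>();

    void write(int[] memory, int index, int value) {
        int old = memory[index];
        undo.push(() -> memory[index] = old); // remember how to restore the old value
        memory[index] = value;                // modify in place
    }

    void commit() { undo.clear(); }           // commit is cheap: just forget the log

    void abort() {                            // roll back, newest write first
        while (!undo.isEmpty()) undo.pop().run();
    }
}
```

Commit only discards the log, which is why this choice makes commit efficient; the cost is that other threads can observe uncommitted values unless something else prevents it, which is why consistency is harder.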
Eager: detect before conflict arises; a "contention manager" module resolves
Lazy: detect on commit/abort
Mixed: eager write/write, lazy read/write …
… conflicts?
… and who rolls back?
… work but formal work in infancy
– Oldest? – Most work? – Non-waiting?
Judgment of Solomon
Revocable I/O: provide transaction-safe libraries, undoable file system/DB calls
Irrevocable actions: opening cash drawer, firing missile
– If transaction tries I/O, switch to irrevocable mode.
– Requires serial execution
– In irrevocable transactions
int i = 0;
try {
  atomic {
    i++;
    node = new Node();   // throws OutOfMemoryException!
  }
} catch (Exception e) {
  print(i);              // what is printed?
}
If an exception escapes an atomic block:
Abort the transaction: preserves invariants, safer
Commit the transaction: like locking semantics; what if the exception object refers to values modified in the transaction?
atomic void foo() {
  bar();
}

atomic void bar() {
  …
}
– Who knew that cosine() contained a transaction?
– If child aborts, so does parent
– If child aborts, partial rollback of child only
Locks and transactions complement one another
TM can improve memory management, both automatic and explicit.
TM restructures in-memory databases
New research in energy-efficient synchronization
GPUs and accelerators need synchronization
TM can simplify operating system kernels, device drivers, security …
Transaction-friendly data structures
Hat tip: Jeremy Kemp
176
Art of Multiprocessor Programming 177
178