Relaxed Data Structures
Dan Alistarh IST Austria & ETH Zurich
...but first, we're hiring! IST Austria is a young institute dedicated to basic research and graduate education, located near Vienna, Austria, and fully English-speaking.
Clock rate and #cores over the past 45 years.
Why concurrency? To get speedup on newer hardware. Scaling: more threads should imply more useful work.
Is this problem inherent for some data structures?
[Figure: Throughput of a Concurrent Packet Processing Queue — throughput (events/second) vs. number of threads, 10–70 threads.]
(< $1000 / machine vs. > $10000 / machine)
Theorem: Given n threads, any deterministic, strongly ordered data structure has executions in which a processor takes time linear in n to return.
[Ellen, Hendler, Shavit, SICOMP 2013] [Alistarh, Aspnes, Gilbert, Guerraoui, JACM 2014]
How can we circumvent this?
Theory ↔ Software ↔ Hardware
New Notions of Progress / Correctness! Theorem: Given n threads, any deterministic, strongly ordered data structure has an execution in which a processor takes time linear in n to return.
[Alistarh, Aspnes, Gilbert, Guerraoui, JACM 2014]
New Data Structure Designs!
Example: Lock-free counter

Memory location R;
unsigned fetch_and_inc() {
    unsigned val;
    do { val = Read(R); }
    while (!Bool_CAS(&R, val, val + 1));
    return val;
}
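As a runnable companion to the pseudocode above — a Python sketch, not the talk's code: Python has no hardware CAS, so a small lock-protected `compare_and_swap` (our name) stands in for `Bool_CAS`.

```python
# Emulating the CAS-based fetch-and-inc with Python threads.
import threading

class CasCell:
    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # emulates the atomicity of hardware CAS

    def read(self):
        return self._value

    def compare_and_swap(self, old, new):
        # Atomically: if the cell still holds `old`, install `new`.
        with self._guard:
            if self._value == old:
                self._value = new
                return True
            return False

R = CasCell(0)

def fetch_and_inc():
    # Retry loop: read, then try to install value + 1; repeat on failure.
    while True:
        val = R.read()
        if R.compare_and_swap(val, val + 1):
            return val

threads = [threading.Thread(target=lambda: [fetch_and_inc() for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# 4 threads x 1000 increments each: no update is lost despite retries.
```

The retry loop makes the counter lock-free: some thread always makes progress, even though any individual CAS may fail.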
[Diagram, Thread 0 and Thread 1: an operation's execution splits into a preamble followed by a scan-and-validate phase reading val and ending in CAS(R, old, new) — success.]
Theory: threads could starve in optimistic lock-free implementations. Practice: this doesn’t happen. Threads don’t starve.
Use more complex wait-free algorithms.
Memory location R;
int fetch_and_increment() {
    int val, new_val;
    do {
        val = Read(R);
        new_val = val + 1;
    } while (!Compare&Swap(&R, val, new_val));
    return val;
}
Example: Lock-free counter. [Animation: counter value R is 1; two threads both read val = 1; one CAS succeeds and sets R to 2, the other must retry.]
[Histogram: Lock-Free Stack, 16 threads — percentage of operations vs. number of iterations before an operation succeeds.]
[Histogram: try distribution — number of operations vs. number of tries — for Counter, Queue, and SkipList inserts; 16 threads, 50% mutations.]
Proof intuition: compare against a worst-case scheduler, which assigns probability 1 to the thread picked by its strategy and 0 to all others; a stochastic scheduler cannot concentrate on one thread forever.
Theorem: Under any stochastic scheduler, any lock-free algorithm is wait-free with probability 1. [Alistarh, Censor-Hillel, Shavit, STOC14/JACM16]
Theorem: Under any stochastic scheduler, any bounded lock-free algorithm is wait-free, with probability 1.
Minimal progress → maximal progress: Deadlock-Free → Starvation-Free (blocking); Lock-Free (non-blocking) → Wait-Free.
Disclaimer: We do not claim that the scheduler is uniform generally. We only use this as a lower bound for its long-run behavior.
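The claim can be illustrated with a small simulation — a sketch of our own, assuming a uniform stochastic scheduler: at every step a uniformly random thread takes the next step of its CAS retry loop, and no thread starves.

```python
# Simulating n threads running a CAS-based counter under a uniform
# stochastic scheduler; every thread completes all of its operations.
import random

def simulate(n_threads, ops_per_thread, seed=1):
    rng = random.Random(seed)
    R = 0                              # the shared counter
    phase = ['read'] * n_threads       # next step: 'read' or 'cas'
    snap = [0] * n_threads             # value each thread last read
    done = [0] * n_threads             # completed operations per thread
    steps = [0] * n_threads            # steps taken per thread
    while min(done) < ops_per_thread:
        i = rng.randrange(n_threads)   # uniform scheduler picks a thread
        if done[i] == ops_per_thread:
            continue                   # finished threads take no more steps
        steps[i] += 1
        if phase[i] == 'read':
            snap[i] = R
            phase[i] = 'cas'
        else:
            # CAS attempt: succeeds iff R is unchanged since the read.
            if R == snap[i]:
                R += 1
                done[i] += 1
            phase[i] = 'read'
    return R, done, steps

R, done, steps = simulate(n_threads=8, ops_per_thread=100)
# All 8 threads finish their 100 operations: lock-free behaves wait-free here.
```

An adversarial scheduler could starve a thread forever by always running it just before another thread's CAS; the random scheduler cannot sustain that pattern.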
[Diagram: an operation = a preamble of q steps, then a scan-and-validate phase of s steps ending in CAS(R, old, new) — success.]
Step complexity vs. system latency (= throughput⁻¹).
Example: Lock-free counter

Memory location R;
unsigned fetch_and_inc() {
    unsigned val;
    do { val = Read(R); }
    while (!Bool_CAS(&R, val, val + 1));
    return val;
}
[Diagram: n threads, each repeatedly executing READ(R) followed by CAS(R, old, old + 1); in each round of attempts, exactly one CAS succeeds.]
[Timeline: an interleaving of Read and CAS steps by threads P1–P4; most steps are useless retries, since only one CAS per round makes progress.]
Moral of the story: the average latency of the system is O(√n), and by symmetry, the average step complexity of a counter operation is also O(√n).
[Diagram: each thread repeatedly executes READ(R) then CAS(R, old, old + 1) until success.]
Theorem: Under a uniform stochastic scheduler, the step complexity is O(#Preamble + √n · #ScanValidate), and the system latency is O(#Preamble + √n · #ScanValidate) — the same bound for a single thread and for the whole system (but in different time references).
[Diagram: an operation = a preamble of q steps, then a scan-and-validate phase of s steps ending in CAS(R, old, new) — success.]
Answers/clarifications: d.alistarh@gmail.com. Full analysis: “Lock-Free Algorithms under Stochastic Schedulers,” PODC 2015.
[Figure: Michael-Scott Queue throughput — throughput vs. number of threads, 10–70 threads.]
[Figure: Michael-Scott Queue throughput vs. number of threads, 10–70 threads; the curve flattens at a “saturated” throughput.]
Where is this difference coming from?
{ if (Read( Head ) == Top_Node ) then Write( Head , Next_Node ) else Start from step 1 again! }
[Diagram: Michael-Scott queue nodes Node1–Node4 with val/ptr fields and Head and Tail pointers; the critical interval for the dequeue lies between reading Head and the CAS on Head.]
{ if (Read( Head ) == Top_Node ) then Write( Head , Next_Node ) else Start from step 1 again! }
[Diagram: successive CASes move Head from Node1 toward Node2 while Tail points past Node3 and Node4; the critical interval lies between the read of Head and the CAS.]
Directory-based cache coherence (Intel, AMD): [Diagram: Core 0 and Core 1 each execute Read(R) then CAS(R, old, new); attempts fail as the cache line bounces between cores.] We waste time because ownership of R circulates without useful work!
[Timeline: a sequence of ownership transfers of R alternating with CAS attempts; many transfers occur per successful attempt.]
Directory-based cache coherence (Intel, AMD), with leases: [Diagram: Core 0 holds R for a lease interval T, completing Read(R) and a successful CAS(R, old, new) before Resp(R) passes the line to Core 1, whose request is delayed.] Each transfer of R now results in at least one useful operation (a successful CAS).
Directory-based cache coherence (Intel, AMD), with leases: [Diagram: Core 0 holds R for a lease interval T, but its CAS(R, old, new) fails; Core 1 is delayed anyway.] In this case, we have simply delayed the whole system by T, without additional progress.
[Figure: Michael-Scott Queue throughput, NO_LEASE vs. SINGLE_LEASE, 10–70 threads; leasing gives up to a 4.5× improvement.]
1. Top_Node = Lease&Read( Head )
2. Next_Node = Read( Top_Node.ptr )
3. ATOMIC { if (Read( Head ) == Top_Node) then Write&Release( Head, Next_Node ) else Release and goto 1 }
[Figure: energy for the Michael-Scott queue (nJ/operation) vs. #threads, 10–70 threads, NO_LEASE vs. SINGLE_LEASE.]
[Figures: lock-free stack throughput and priority queue throughput (#ops/second vs. #threads, 10–70 threads), NO_LEASE vs. WITH_LEASE.]
Blocking counter

void blocking_inc(int* R) {
    acquire( _lock );
    int val = Read( R );
    Write( R, val + 1 );
    release( _lock );
}

void acquire(int* _lock) {
    while ( !CAS(_lock, UNLOCKED, LOCKED) ) ;
}
void release(int* _lock) {
    *_lock = UNLOCKED;
}
Can we avoid the wasted coherence messages?
[Diagram, directory-based cache (Intel, AMD): Core 0 and Core 1 exchange Req(R, EX) / Resp(R) coherence messages while one core runs Acquire(L), Write(L), Release(L); the other core's retrying CAS(L) attempts steal the line and delay the lock holder.]
Fix: simply lease the lock (hold the cache line for a lease interval T) on acquire!
[Figure: lock-based counter throughput vs. #threads, 10–70 threads, for TTAS_NO_LEASE, TTAS_WITH_LEASE, CLH, and HTICKET locks.]
[Figure: parallel PageRank completion time (ns, lower is better) on 2–32 threads, NO_LEASE vs. WITH_LEASE; leasing is up to 9.5× faster.]
[Figure: TL2 throughput vs. #threads, 10–70 threads, for NO_LEASE, SINGLE_LEASE, and DOUBLE_LEASE.]
Can we scale beyond bottlenecks? Let’s Relax!
Example: Basic lock-free counter

Memory location R;
unsigned fetch_and_inc() {
    unsigned val;
    do { val = Read(R); }
    while (!Bool_CAS(&R, val, val + 1));
    return val;
}

If only the increment matters (no return value), drop the return:

Memory location R;
void increment() {
    unsigned val;
    do { val = Read(R); }
    while (!Bool_CAS(&R, val, val + 1));
}
Example: Basic relaxed counter

Memory location R;
Local value V[i]; // one per thread, initially 0
void increment() {
    V[i] = V[i] + 1;
    if (V[i] % 2 == 1) return; // buffer every other increment locally
    unsigned val;
    do { val = Read(R); }
    while (!Bool_CAS(&R, val, val + 2)); // publish two increments at once
}
[Diagram: a priority queue of tasks with priorities 1, 3, 4, 5, 7, 8, 11, 15, 18.]
Methods: Search(key), Insert/Delete(k, v), DeleteMin()
We are looking for a fast concurrent priority queue — extremely useful, both in theory and practice:
[Sanders97], [Lotan&Shavit00], [Sundell&Tsigas07], [Basin et al. 11], [Linden&Jonsson13], [Lenhart et al. 14], [Wimmer et al.14], [Alistarh et al. 14], [Rihani et al. 15]
Known solutions do not perform well: DeleteMin is highly contended. Every thread wants the top element!
[Figure: throughput (M operations/s) vs. number of threads (4–28) of state-of-the-art concurrent PQs — New, HaS, SaT (from [Linden & Jonsson 2015]).]
Classical heap-based implementation: [Diagram: a binary min-heap over elements 1–9.] All operations must access the root! Cache invalidation, failed synchronization — in sum: no scalability!
[Diagram: a relaxed priority queue of tasks with priorities 1, 2, 3, 4, 5, 8; DeleteMin may return a task near the minimum (e.g. 2 or 4) rather than the exact minimum.]
Methods: as before, but DeleteMin is approximate.
Still useful, both in theory and practice: we're now looking for a fast relaxed concurrent PQ.
The fact that we are running in parallel already implies that we’re accepting out-of-order execution of tasks!
[Diagram: tasks with priorities 1, 3, 4, 5, 7, 8, 11, 15, 18 being executed in parallel.]
The application already has to deal with some relaxation!
[Diagram: a skiplist with head H, tail T, and keys 1, 3, 4, 5, 9, …. Search(5) descends from the top level, narrowing the interval [H, 9] → [H, 9] → [1, 9] → [5, 9], then stops.]
P processors ⇒ O(P) relaxation.
procedure Spray(): start near the head at height ≈ log P; at each level, jump forward a random number of steps, then descend. [Diagram: two example spray walks for starting height 4 — jump, stay, jump, jump.] Spray and pray?
✓ The maximum value returned by a Spray has rank Õ(P).
✓ For all x, Pr(x hit) = Õ(1/P).
✓ If x > y is returned by some Spray, then Pr(y hit) = Ω̃(1/P).
(Pr(x hit) = the probability that a spray returns the value at index x; P = number of processors.)
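A simplified, runnable sketch of the spray walk — the jump-length parameters here are illustrative choices of ours, not the SprayList's exact tuning:

```python
# Spray sketch: starting at height ~log2(p), at each level jump forward a
# uniformly random number of positions (up to 2^level here), then descend.
# The returned rank is therefore bounded by sum(2^level) = 2p - 2.
import math, random

def spray(num_elements, p, rng):
    height = max(1, int(math.log2(p)))
    pos = 0
    for level in range(height, 0, -1):
        pos += rng.randrange(0, 2 ** level + 1)  # illustrative jump bound
    return min(pos, num_elements - 1)

rng = random.Random(0)
p = 64
hits = [spray(10_000, p, rng) for _ in range(5000)]
# Every returned rank stays within the first O(p) positions of a
# 10,000-element list, so contention on the exact minimum disappears.
```

The point of the random starting offsets and jumps is that concurrent DeleteMin calls land on *different* small-rank elements instead of all fighting over rank 1.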
✓ The maximum value returned by a Spray has rank Õ(P). Proof sketch: the spray starts at height h ≈ log P, and the jump lengths at levels h, h−1, …, 1 sum to at most 1 + 2 + 4 + … + P = O(P); up to polylogarithmic factors, the walk cannot travel past rank Õ(P).
If a Spray would return a dummy element, it instead restarts (otherwise: no guarantees, an incorrect execution).
In many practical settings (discrete-event simulation, shortest paths), priority inversions are not expensive.
[T.Henzinger et al. 11, Rihani et al.14, Nguyen et al. 14]
[Figure: throughput (MOps/s) vs. threads (7–56) for MultiQ c=2, MultiQ HT c=2, MultiQ c=4, SprayList, Linden, and Lotan.]
Looks good, but does it actually guarantee anything? Can we improve it?
[Diagram: Insert and Remove operations.] Relaxes correctness: not a strict PQ. Optimistic about progress (probabilistic termination).
What is the average rank removed over a sequence of steps? [Diagram: queues Q1–Q4.] WLOG, elements are consecutive integers. Cost = rank of the removed element among the remaining elements — intuitively, the average distance from optimal. Example: Cost(2) = 2, Cost(4) = 3, Cost(1) = 1.
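The rank-cost metric is easy to make concrete — a small Python sketch of ours reproducing the slide's example:

```python
# Rank cost of a removal sequence: the cost of removing x is the rank of x
# among the elements still present, so an exact DeleteMin always costs 1.
def rank_costs(initial_elements, removal_order):
    remaining = sorted(initial_elements)
    costs = []
    for x in removal_order:
        costs.append(remaining.index(x) + 1)  # 1-based rank among remaining
        remaining.remove(x)
    return costs

# The slide's example: removing 2, then 4, then 1 from {1, 2, 3, 4}.
costs = rank_costs([1, 2, 3, 4], [2, 4, 1])
# costs == [2, 3, 1]: 2 was second-smallest, 4 was then third-smallest,
# and 1 was the minimum when finally removed.
```

Averaging these costs over a long removal sequence gives the quality measure the analysis bounds.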
Notes:
[Alistarh, Kopinsky, Li, Nadiradze, PODC 2017]
This would work if inserts were round-robin:
The reduction does not hold in general. Intuitively, height and top priority are not well correlated.
[Diagram: queue heads at ranks R+1, R+2, R+3.] Hard case: over time, we'll eventually get arbitrary distributions. We have to prove that the algorithm gets out of those reasonably fast.
[Diagram: when a queue's head is removed, its label advances; in expectation, the increment is n.]
Problem: the behavior at a step is highly correlated with what happened in previous steps.
[Diagram: n = 4 queues with exponentially distributed labels — 1.8, 5.9, 10.2, 13.2; 4.7, 7.3, 12.5, 16.8; 2.2, 3.2, 8.3, 15.2; 5.1, 9.5, 11.7, 14.2.] Idea: the exponential distribution is memoryless; label increments have expected value n.
[Diagram: the same four queues; each removal advances the head's label by an exponential with expected increment n = 4.]
[Diagram: bins 1–5.] The probability that the ith label (or rank) is in bin j is the same in both processes. Easy to see initially — why does it hold later?
[Diagram: remaining queue labels 1.8, 5.9, 10.2, 13.2; 7.3, 12.5; 8.3, 9.5 — mean = 6.725.]
Theorem: For any t > 0, 𝔼[ Σ_{i=1}^{n} exp(Δ_i(t)/n) + Σ_{i=1}^{n} exp(−Δ_i(t)/n) ] = O(n), where Δ_i(t) measures how far queue i's label is from the mean.
Idea: this potential function behaves as a “super-martingale”: as soon as it grows above O(n), it starts decreasing. Generalizes [Peres, Talwar, Wieder, R.S.&A. 2015].
[Diagram: the number of queues with label ≥ mean + kn decreases exponentially in k, and symmetrically for labels ≤ mean − kn; on average, a chosen queue is close to the mean.]
What if we do two choices only β% of the time (one random choice otherwise)?
What if the input distribution is biased?
Still works (within reason). Works really well in practice.
We can use this for relaxed concurrent queues, priority queues, counters.
Example: Relaxed Queue

Vector of Queues Q[m];
void enqueue(element e) {
    tsp = GetTimestamp();
    i = random(0, m - 1);
    Q[i].enqueue(<e, tsp>);
}
element dequeue() {
    i = random(0, m - 1);
    j = random(0, m - 1);
    // pick the better element out of the two choices
    if (Q[j].peek().tsp < Q[i].peek().tsp)
        i = j;
    return Q[i].dequeue();
}
Theory says that the average rank removed is O( m ). We can trade off contention (n / m) versus rank guarantees (m).
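A runnable sketch of the two-choice relaxed queue — a sequential Python simulation of ours (a logical clock stands in for `GetTimestamp()`, and no real threads are involved):

```python
# MultiQueue-style relaxed FIFO: m sequential queues; enqueue into a random
# one; dequeue from whichever of two random non-empty queues has the older
# head timestamp.
import random
from collections import deque

class RelaxedQueue:
    def __init__(self, m, seed=0):
        self.queues = [deque() for _ in range(m)]
        self.rng = random.Random(seed)
        self.clock = 0                 # stands in for GetTimestamp()

    def enqueue(self, e):
        self.clock += 1
        i = self.rng.randrange(len(self.queues))
        self.queues[i].append((self.clock, e))

    def dequeue(self):
        nonempty = [q for q in self.queues if q]
        if not nonempty:
            return None
        a = self.rng.choice(nonempty)
        b = self.rng.choice(nonempty)
        better = a if a[0][0] <= b[0][0] else b  # older head timestamp wins
        return better.popleft()[1]

q = RelaxedQueue(m=8)
for x in range(100):
    q.enqueue(x)
out = [q.dequeue() for _ in range(100)]
# Every element comes out exactly once; the order is only *approximately*
# FIFO, which is exactly the relaxation being traded for scalability.
```

With m larger than the thread count, two threads rarely touch the same sub-queue, so contention drops by roughly n/m while the average rank error stays O(m).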
Example: Relaxed Counter

Vector of Counters C[m];
int read() {
    i = random(0, m - 1);
    return C[i] * m; // scale one sample by the number of counters
}
void increment() {
    i = random(0, m - 1);
    j = random(0, m - 1);
    // pick the lower counter out of the two choices
    if (C[j] < C[i]) i = j;
    C[i].increment();
}
Theory says that the average distance from the true value is O( m ). We can again trade off contention (n / m) versus rank guarantees (m).
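The same pattern as a runnable Python sketch (ours, sequential; each cell is a plain integer rather than a concurrent counter):

```python
# Two-choice relaxed counter: m cells; each increment bumps the smaller of
# two randomly chosen cells, keeping the cells tightly balanced; a read
# samples one cell and scales by m, so it is only approximate.
import random

class RelaxedCounter:
    def __init__(self, m, seed=0):
        self.cells = [0] * m
        self.rng = random.Random(seed)

    def increment(self):
        i = self.rng.randrange(len(self.cells))
        j = self.rng.randrange(len(self.cells))
        if self.cells[j] < self.cells[i]:
            i = j                      # two choices: bump the lower cell
        self.cells[i] += 1

    def read(self):
        i = self.rng.randrange(len(self.cells))
        return self.cells[i] * len(self.cells)

c = RelaxedCounter(m=16)
for _ in range(16_000):
    c.increment()
total = sum(c.cells)
# The total is conserved exactly; the two-choice rule keeps individual
# cells very close to total/m, so read() stays close to the truth.
```

Increments spread over m cells cut contention by roughly n/m; the two-choice rule is what keeps a single-cell read a good estimate.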
Task: minimize f(x) = Σ_{i=1}^{m} loss(x, ex_i), where loss is a notion of “quality” (e.g., squared distance) over examples ex_1, …, ex_m — e.g., classification. Solved via an optimization procedure.
[Diagram: Node1 holds dataset partition 1 and optimizes f1(x) = Σ_{i=1}^{m/2} loss(x, ex_i); Node2 holds dataset partition 2 and optimizes f2(x) = Σ_{i=m/2+1}^{m} loss(x, ex_i); the nodes synchronize.]
Let g̃(x_t) = the gradient at a randomly chosen data point, so that 𝔼[g̃(x_t)] = ∇f(x_t), with the variance bound 𝔼‖g̃(x) − ∇f(x)‖² ≤ σ².
SGD iterates x_{t+1} = x_t − η_t g̃(x_t); full gradient descent would iterate x_{t+1} = x_t − η_t ∇f(x_t).
Theorem [standard]: Given f convex and R² = ‖x_0 − x*‖², if we run SGD for T = O(R²σ²/ε²) iterations, the expected error is at most ε.
Example: Sequential SGD

Vector x[d], initially random
void SGD-Converge(float eps) {
    do {
        <e, label> = randomly chosen data point
        gradient = ComputeGradient(x, e, label)
        for i from 1 to d:
            x[i] = x[i] - η * gradient[i]
        error = ComputeLoss(x, training_data)
    } while (error > eps)
}
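A runnable toy version of the sequential SGD loop — the dataset, the step size `eta`, and the least-squares loss are illustrative choices of ours:

```python
# Sequential SGD on a synthetic, noiseless least-squares problem.
import random

random.seed(0)
d = 3
true_x = [1.0, -2.0, 0.5]
data = []
for _ in range(500):
    e = [random.uniform(-1, 1) for _ in range(d)]
    label = sum(a * b for a, b in zip(e, true_x))   # noiseless label
    data.append((e, label))

def loss(x):
    return sum((sum(a * b for a, b in zip(e, x)) - y) ** 2
               for e, y in data) / len(data)

x = [random.uniform(-1, 1) for _ in range(d)]
initial_error = loss(x)
eta = 0.1
for _ in range(5000):
    e, y = random.choice(data)                 # randomly chosen data point
    pred = sum(a * b for a, b in zip(e, x))
    grad = [2 * (pred - y) * a for a in e]     # gradient of the squared loss
    x = [xi - eta * gi for xi, gi in zip(x, grad)]
error = loss(x)
# The error drops far below its initial value: x converges toward true_x.
```

Because the labels are noiseless and realizable, even a constant step size drives the loss essentially to zero here; the σ² variance term in the theorem only bites when the optimum does not fit every data point exactly.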
Example: Naïve Concurrent SGD (one global lock)

Shared: Vector x[d], initially random
Lock L // for the model
void SGD-Converge(float eps) {
    do {
        <e, label> = randomly chosen data point
        L.lock()
        gradient = ComputeGradient(x, e, label)
        for i from 1 to d:
            x[i] = x[i] - η * gradient[i]
        error = ComputeLoss(x, training_data)
        L.unlock()
    } while (error > eps)
}
Example: Naïve Concurrent SGD (per-component locks)

Shared: Vector x[d], initially random
Lock array L[d] // one per model component
void SGD-Converge(float eps) {
    do {
        <e, label> = randomly chosen data point
        gradient = ComputeGradient(x, e, label)
        for i from 1 to d:
            L[i].lock()
            x[i] = x[i] - η * gradient[i]
            L[i].unlock()
        error = ComputeLoss(x, training_data)
    } while (error > eps)
}
Example: Hogwild SGD — exactly the sequential SGD code, run by all threads with no locks

Vector x[d], initially random
void SGD-Converge(float eps) {
    do {
        <e, label> = randomly chosen data point
        gradient = ComputeGradient(x, e, label)
        for i from 1 to d:
            x[i] = x[i] - η * gradient[i]
        error = ComputeLoss(x, training_data)
    } while (error > eps)
}
The algorithm is OK even without any locks! This is very non-trivial to prove, and we won't do it here: initially by [Niu et al., NIPS 2011]; analysis improved by [Duchi et al., NIPS 2015] and [De Sa et al., NIPS 2015]. The convergence rate is quadratic in the maximum delay.
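The Hogwild structure can be sketched with Python threads — noting the caveat that CPython's GIL serializes bytecodes, so this illustrates the lock-free *structure* (racy component-wise updates on a shared model), not true hardware-level races; the data and step size are our illustrative choices:

```python
# Hogwild-style SGD sketch: 4 threads update the shared model x with no
# lock around it; stale reads and interleaved writes are tolerated.
import random, threading

random.seed(1)
d, n_points = 4, 400
true_x = [0.5, -1.0, 2.0, 0.0]
points = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n_points)]
data = [(e, sum(a * b for a, b in zip(e, true_x))) for e in points]

x = [0.0] * d          # shared model, updated without any lock

def worker(steps, seed):
    rng = random.Random(seed)
    for _ in range(steps):
        e, y = rng.choice(data)
        pred = sum(a * b for a, b in zip(e, x))   # possibly stale read
        g = [2 * (pred - y) * a for a in e]
        for i in range(d):
            x[i] -= 0.05 * g[i]                   # racy per-component write

threads = [threading.Thread(target=worker, args=(3000, s)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()

final_loss = sum((sum(a * b for a, b in zip(e, x)) - y) ** 2
                 for e, y in data) / n_points
# Despite the races, the shared model still converges close to true_x.
```

The gradients here are dense, so every update touches every component; the cited analyses show the sparser the updates, the closer lock-free SGD tracks the sequential algorithm.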
[Figure, four panels: speedup vs. cores (2–10) at sparsity levels (a) pnz = .005, (b) pnz = .01, (c) pnz = .2, (d) pnz = 1, comparing linear speedup, without locking, and with locking.]
High-contention producer-consumer: average order deviation of various queue algorithms (40 threads; lower is better):
FC 1.8 · WF 9.9 · MS 66.2 · LB 25.0 · LCRQ 15.6 · TS-atomic 20.4 · TS-CAS 17.6 · TS-hardware 24.7 · TS-interval 16.7 · TS-stutter 19.2 · CTS 8.8 · RTS 20.8 · 1RR DQ 13.8 · 2RR DQ 22.9 · 1RA DQ 2924.0 · k-FIFO 47.0
Some (strongly ordered) data structures are hard to scale.
How do we specify and prove them correct? What new data structures are out there? How do they interact with existing applications?
Thanks to the conferences, and the PODC/DISC community.