Relaxed Data Structures. Dan Alistarh, IST Austria & ETH Zurich.



SLIDE 1

Relaxed Data Structures

Dan Alistarh IST Austria & ETH Zurich

SLIDE 2

...but first, we’re hiring!

  • Young institute dedicated to basic research and graduate education

  • Located near Vienna, Austria
  • Fully English-speaking
  • Graduate School
  • 1+3 years PhD Program
  • Full-time positions with competitive salary
  • Internships (2018): email d.alistarh@gmail.com
  • PhD & Postdoc Positions
  • Projects:
  • Concurrent Data Structures
  • Distributed Machine Learning
  • Molecular Computation
SLIDE 3

Clock rate and #cores over the past 45 years.

Why Concurrent Data Structures?

To get speedup on newer hardware. Scaling: more threads should imply more useful work.

SLIDE 4

The Problem with Concurrency

Is this problem inherent for some data structures?

[Figure: Throughput of a Concurrent Packet Processing Queue; throughput (events/second) vs. number of threads (10 to 70), on a < $1000 machine and a > $10000 machine]

SLIDE 5

Inherent Sequential Bottlenecks

Data structures with strong ordering semantics

  • Stacks, Queues, Priority Queues, Exact Counters

This is important because of Amdahl’s Law

  • Assume the single-threaded computation takes 7 days
  • The inherently sequential component (e.g., the queue) accounts for 15%, about 1 day
  • Then the maximum speedup is < 7x, even with an infinite number of threads
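The Amdahl's Law bullet above can be checked numerically; a minimal sketch (the function name `amdahl_speedup` is ours, not from the talk):

```python
def amdahl_speedup(seq_fraction, threads):
    """Amdahl's Law: speedup with sequential fraction s on t threads."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / threads)

# The slide's example: 1 day sequential out of 7 days total, so s = 1/7.
s = 1.0 / 7.0
print(amdahl_speedup(s, 64))   # 6.4x on 64 threads
print(1.0 / s)                 # limit with infinitely many threads: 7x
```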

Theorem: Given n threads, any deterministic, strongly ordered data structure has executions in which a processor takes linear in n time to return.

[Ellen, Hendler, Shavit, SICOMP 2013] [Alistarh, Aspnes, Gilbert, Guerraoui, JACM 2014]

SLIDE 6

Today’s Class

How can we circumvent this?

Theory ↔ Software ↔ Hardware

New Notions of Progress / Correctness! Theorem: Given n threads, any deterministic, strongly ordered data structure has an execution in which a processor takes linear in n time to return.

[Alistarh, Aspnes, Gilbert, Guerraoui, JACM 2014]

New Data Structure Designs!

SLIDE 7

Lock-Free Data Structures

  • Based on atomic instructions (CAS, Fetch&Inc, etc.)
  • Blocking of one thread doesn’t stop the whole system
  • Implementations: HashTables, Lists, B-Trees, Queues, Stacks, SkipLists, etc.
  • Known to scale well for many data structures

Memory location R;

unsigned fetch_and_inc( ) {
    unsigned val;
    do {
        val = Read( R );
    } while ( !Bool_CAS( &R, val, val + 1 ) );
    return val;
}

Example: Lock-free counter

[Diagram: operation structure: a preamble, then a scan & validate loop ending in CAS( R, old, new ) until success]
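The counter above can be modeled in Python. This is only a sketch: Python exposes no hardware CAS, so the `AtomicRef` class below (names ours) simulates one with a lock standing in for the hardware's atomicity guarantee; the point is the optimistic read-then-CAS retry structure, and that no increments are lost.

```python
import threading

class AtomicRef:
    """A memory word with compare-and-swap; the lock models hardware atomicity."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def cas(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def fetch_and_inc(r):
    # Optimistic retry loop: read, then CAS; retry if another thread won.
    while True:
        val = r.read()
        if r.cas(val, val + 1):
            return val

R = AtomicRef(0)

def worker():
    for _ in range(1000):
        fetch_and_inc(R)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(R.read())  # 8000: no increments are lost
```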

SLIDE 8

The Lock-Free Paradox


Theory: threads could starve in optimistic lock-free implementations. Practice: this doesn’t happen; threads don’t starve.

Should we then use more complex wait-free algorithms?

Memory location R;

int fetch_and_increment( ) {
    int val, new_val;
    do {
        val = Read( R );
        new_val = val + 1;
    } while ( !Compare&Swap( &R, val, new_val ) );
    return val;
}

Example: Lock-free counter. [Diagram: Thread 0 and Thread 1 both read val = 1 from counter R; one CAS succeeds, setting R = 2; the other fails and retries]

SLIDE 9

Starvation?

[Figure: Lock-Free Stack, 16 threads; percentage of operations vs. number of iterations before an operation succeeds, for a Counter and a Queue]

[Figure: Try distribution, SkipList inserts, 16 threads, 50% mutations; number of operations vs. number of tries]

Why?

SLIDE 10

Part 1: Understanding Lock-free Progress

  • 1. We focus on contended workloads
  • 2. We focus on the scheduler
  • Sequence of accesses to shared data
  • Not adversarial, but relaxed
  • Stochastic model
  • 3. We focus on long-term behavior
  • How long does an operation take to complete on average?
  • Are there operations that never complete?

How does the “scheduler” behave in the long run?

SLIDE 11
  • Complex combination of
  • Input (workload)
  • Code
  • Hardware
  • Single-variable contention (Intel™)

A simplified view of “the scheduler”

1 3 4 1 …

SLIDE 12

The Scheduler

  • Pick random time t
  • What’s the probability that pi is scheduled?
  • Scheduler:
  • Either chooses a request from the pool in each “step,” or leaves the variable with the current owner
  • The Schedule:
  • Under contention, a sequence of thread ids, e.g.: 2, 1, 4, 5, 2, 3, ….
  • Sequential access to contended data item
  • Stochastic Scheduler:
  • Every thread can be scheduled in each step, with probability > 0.
SLIDE 13

Examples

  • Assume n processes
  • The uniform stochastic scheduler:
  • θ = 1 / n
  • Each process gets scheduled uniformly
  • A standard adversary:
  • Take any adversarial strategy
  • The distribution gives probability 1 to the process picked by the strategy, 0 to all others

  • Not stochastic
  • Quantum-based schedulers
  • Stochastic if quantum length not fixed, but random variable
  • E.g.: [1, 1, 1], [3], [4, 4, 4, 4], [2, 2], [1], [4, 4], …
  • Common for OS scheduling
SLIDE 14

Lock-Free Algorithms and Stochastic Schedulers

  • Lock-Free
  • There’s a time bound B for the system to complete some new operation
  • Wait-Free
  • There’s a (local) time bound for each operation to complete

Proof intuition:

  • Given any time t, if some thread p is scheduled for B consecutive time steps, it has to complete some new operation
  • There’s a non-zero probability that the scheduler might decide to schedule thread p B steps in a row.
  • By the “Infinite Monkey Theorem,” this will eventually occur.
  • Hence, with probability 1, every operation eventually succeeds

Theorem: Under any stochastic scheduler, any lock-free algorithm is wait-free with probability 1. [Alistarh, Censor-Hillel, Shavit, STOC14/JACM16]

SLIDE 15

Comments

  • Practically, not that insightful
  • The probability that an operation succeeds could be as low as (1/n)^n
  • Does not necessarily hold if the scheduler is not stochastic
  • For instance, on NUMA systems, scheduler can be non-stochastic

Theorem: Under any stochastic scheduler, any bounded lock-free algorithm is wait-free, with probability 1.

                 Minimal Progress      Maximal Progress
Blocking         Deadlock-Free         Starvation-Free
Non-blocking     Lock-Free             Wait-Free

SLIDE 16

The Story So Far

  • The Goal
  • Lock-Free Algorithms in Practice
  • The Stochastic Scheduler Model
  • Lock-Free ≈ Wait-Free (in Theory)
  • Performance Upper Bounds
  • A general class of lock-free algorithms
  • Uniform stochastic scheduler

Disclaimer: We do not claim that the scheduler is uniform generally. We only use this as a lower bound for its long-run behavior.

SLIDE 17

Single-CAS Universal

  • Can implement any object lock-free

(Herlihy’s Universal Construction)

  • Blueprint for many efficient implementations

(Treiber Stack, Counters)

[Diagram: SCU(q, s): a preamble of q steps, then a scan & validate loop of s steps ending in CAS( R, old, new ) until success]

What is the average number of steps a process takes until completing a method call? What is the average number of steps the system takes until completing a method call?

Step Complexity; System Latency = Throughput⁻¹

SLIDE 18

Special Case: The Counter

Memory location R;

unsigned fetch_and_inc( ) {
    unsigned val;
    do {
        val = Read( R );
    } while ( !Bool_CAS( &R, val, val + 1 ) );
    return val;
}

Example: Lock-free counter

[Diagram: two threads each do READ( R ) then CAS( R, old, old + 1 ) until success]

  • Example schedule: 1, 2, 2, 1

Assuming a uniform stochastic scheduler and n threads, what is the average step complexity?

SLIDE 19

Part 2: Step Complexity Analysis

[Diagram: several threads each loop READ( R ) then CAS( R, old, old + 1 ) until success]

Example schedule: n, 2, 1, …

In each step, we pick an element from 1 to n randomly. How many steps (in expectation) before an element is chosen twice?
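The question above can be answered empirically; a seeded toy simulation (names ours) matches the birthday-problem prediction of roughly sqrt(pi * n / 2) ≈ 1.25 √n steps:

```python
import math
import random

def steps_until_repeat(n, rng):
    """Draw uniformly from n values; count draws until some value repeats."""
    seen = set()
    steps = 0
    while True:
        steps += 1
        x = rng.randrange(n)
        if x in seen:
            return steps
        seen.add(x)

rng = random.Random(42)
n = 10_000
trials = 2_000
avg = sum(steps_until_repeat(n, rng) for _ in range(trials)) / trials
print(avg, math.sqrt(math.pi * n / 2))  # both around 125
```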

SLIDE 20

The Birthday Problem

  • n = 365 days in a year
  • k people in a room
  • What is the probability that there are two with the same birthday?
  • Pr[ no birthday collision ] = (1 − 1/n)(1 − 2/n) … (1 − (k−1)/n)
  • Approximation: e^y ≈ 1 + y (for y close to 0)
  • Pr[ no birthday collision ] ≈ e^(−k(k−1)/2n)
  • This is constant for k ≈ √n
  • Moral of the story:
  • 1. Two people in this room probably share birthdays
  • 2. After ~√n steps are scheduled, some thread wins
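The product formula above is easy to evaluate directly (function name ours):

```python
import math

def prob_no_collision(n, k):
    """Pr[no collision] = (1 - 1/n)(1 - 2/n)...(1 - (k-1)/n)."""
    p = 1.0
    for i in range(1, k):
        p *= 1.0 - i / n
    return p

print(prob_no_collision(365, 23))       # ~0.493: 23 people already suffice
print(math.exp(-23 * 22 / (2 * 365)))   # the e^(-k(k-1)/2n) approximation
```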
SLIDE 21

The Execution: A Sequential View

[Timeline: P2: Read; P4: CAS; P2: CAS (useless); P1: Read; P4: Read; P1: CAS; P3: Read]

Moral of the story:

  • 1. After ~√n steps are scheduled, some thread wins
  • 2. That thread’s CAS will cause ~√n other threads to fail

Example schedule: 2, 1, 4, ..., 2, 4, 1, 3

Average latency of the system is O(√n) (this is tight).

By symmetry, average step complexity for a counter operation is O(√n).
SLIDE 22

Warning: Not Formally Correct

  • 1. We have assumed a uniform initial configuration
  • 2. A process which fails a CAS will have to pay an extra step
  • 3. We have only given upper bounds on the number of steps
  • But √n is indeed the tight bound here
  • 4. Latency ↔ Step Complexity argued only by symmetry
  • Formally, by Markov chain lifting


SLIDE 23

The Full Result

  • √n trials / operation is the price of contention
  • A thread win causes ~√n others to fail their validation
  • Worst-case was unbounded!
  • Algorithms in SCU are fair:
  • Processes complete operations at the same “rate” as the system (but in different time references)

Theorem: Under a uniform stochastic scheduler, we have:

Step complexity is O( #Preamble + √n · #Loop ).

System latency is O( #Preamble + √n · #Loop ).

SLIDE 24

Extra Questions:

  • What happens if the probability distribution is not uniform?
  • P = (p1, p2, …, pn), with pi > 0 for all i.
  • Intuitively, should latency increase or decrease?
  • What happens if a thread needs to take 3 steps to succeed?
  • Restarts if someone else wins.
  • What happens for general k steps to succeed and an arbitrary distribution?

Answers/Clarifications: d.alistarh@gmail.com Full analysis: “Lock-Free Algorithms under Stochastic Schedulers” PODC15

SLIDE 25

The Story So Far

  • The Goal
  • Lock-Free Algorithms in Practice
  • Lock-Free ≈ Wait-Free (in Theory)
  • Performance Upper Bounds
  • How much do we lose in practice because of contention?
SLIDE 26

Why does this graph look so sad?

[Figure: Michael-Scott Queue throughput vs. number of threads (10 to 70)]

SLIDE 27

Why does this graph look so sad?

[Figure: Michael-Scott Queue throughput vs. number of threads, with the “saturated” throughput level marked]

Where is this difference coming from?

SLIDE 28

Lock-Free Concurrent Queue Example

  • Dequeue Operation
  • 1. Top_Node = Read( Head )
  • 2. Next_Node = Read( Top_Node.ptr )
  • 3. ATOMIC {
        if ( Read( Head ) == Top_Node )
            Write( Head, Next_Node )
        else
            start from step 1 again!
    }

[Diagram: queue nodes Node1 → Node2 → Node3 → Node4, each a <val, ptr> pair, with Head and Tail pointers; the dequeue CAS moves Head from Node1 to Node2; the span between reading Head and the CAS is the critical interval]

SLIDE 29

Lock-Free Concurrent Queue Example

  • Consider two threads: Thread1, Thread2
  • Dequeue Operation
  • 1. Top_Node = Read( Head )
  • 2. Next_Node = Read( Top_Node.ptr )
  • 3. ATOMIC {
        if ( Read( Head ) == Top_Node )
            Write( Head, Next_Node )
        else
            start from step 1 again!
    }

  • Let Thread2 perform the atomic update first
  • Thread1 tries to update later and fails
  • Under high contention, only one in ~√n accesses will succeed!

[Diagram: Thread1 and Thread2 both attempt the CAS on Head over nodes Node1 → Node2 → Node3 → Node4 during the critical interval; only one succeeds]

SLIDE 30

Part 3: What Happens at the Hardware Level?

[Diagram: directory-based cache (Intel, AMD); Core 0 and Core 1 each perform Read( R ) then CAS( R, old, new ); one CAS fails]

We waste time because ownership of R circulates without useful work!

SLIDE 31

The Execution: A Pragmatic View

[Timeline: ownership transfer, CAS attempt, ownership transfer, ownership transfer, CAS attempt, CAS attempt, ownership transfer, CAS attempt; most transfers are useless]

New Lease/Release Operation: Each ownership transfer should result in useful work!

SLIDE 32

Lease-Release: Every Transfer Should Be Useful

[Diagram: directory-based cache (Intel, AMD); each core leases R for an interval T, completes Read( R ) and a successful CAS( R, old, new ), and the other core’s request is delayed until release]

Each transfer of R results in at least one useful operation!

SLIDE 33

Lease-Release: The Bad Case

[Diagram: the lease interval T expires before the CAS( R, old, new ) completes; the CAS fails while the other core was delayed]

In this case, we have simply delayed the whole system by T, without additional progress.

SLIDE 34

Lease/Release, More Precisely

  • Programmer can lease a variable for bounded time
  • void RequestLease( void* address, int data_size );
  • void ReleaseLease( void* address, int data_size );
  • Performance penalty if lease expires before operation completion
  • Usually occurs < 5% of the time
  • Lease-Release handled by L1 Cache Controllers
  • Max lease time in the order of T = 1000 cycles
  • Implemented in the MIT Graphite Simulator
  • Private L1, Shared L2 Cache hierarchy
  • Directory based MSI Cache Coherence Protocol


SLIDE 35

Lock-Free Queue with Lease-Release

(Simulated in Graphite)

[Figure: Michael-Scott Queue throughput vs. number of threads, NO_LEASE vs. SINGLE_LEASE; up to 4.5X improvement]

  • Dequeue Operation

1. Top_Node = Lease&Read( Head )
2. Next_Node = Read( Top_Node.ptr )
3. ATOMIC {
       if ( Read( Head ) == Top_Node )
           Write&Release( Head, Next_Node )
       else
           Release and goto 1
   }

[Figure: Energy for the Michael-Scott Queue (nJ / operation) vs. number of threads, NO_LEASE vs. SINGLE_LEASE]

SLIDE 36

Lock-Free Stack Throughput

[Figure: Lock-Free Stack throughput (ops/second) vs. number of threads, NO_LEASE vs. WITH_LEASE]

[Figure: Priority Queue throughput (ops/second) vs. number of threads, NO_LEASE vs. WITH_LEASE]

  • Treiber Stack
  • Lotan-Shavit skiplist-based priority queue
SLIDE 37

How about Lock-Based Implementations?

void blocking_inc( int* R ) {
    acquire( _lock );
    int val = Read( R );
    Write( R, val + 1 );
    release( _lock );
}

Blocking counter

void acquire( int* _lock ) {
    while ( !CAS( _lock, UNLOCKED, LOCKED ) )
        ; // spin
}

void release( int* _lock ) {
    *_lock = UNLOCKED;
}

SLIDE 38

What Happens at the Protocol Level?

Can we avoid the wasted coherence messages?

[Diagram: directory-based cache (Intel, AMD); cores exchange Req( R, EX ) / Resp( R ) messages while one core holds lock L and the other retries CAS( L ); the holder’s Write( L ) and Release( L ) are delayed by the retrying core’s coherence requests]

Simply lease the lock (for a lease interval T) on acquire!

SLIDE 39

Lock-Based Counter with Lease-Release

[Figure: lock-based counter throughput vs. number of threads, for TTAS_NO_LEASE, TTAS_WITH_LEASE, CLH, HTICKET]

SLIDE 40

Lock-based PageRank

  • The CRONO Graph benchmark implementation
  • Lease the lock before acquiring it
  • Release before giving it up

[Figure: Parallel PageRank running time (completion time in ns; lower is better) vs. number of threads, NO_LEASE vs. WITH_LEASE; up to 9.5X improvement]

SLIDE 41

Extensions

  • Hardware Implementation
  • Protocol provably correct
  • Directory structure does not change
  • “Minor” protocol changes
  • Multiple Concurrent Leases
  • Request leases in sorted order
  • Many data structures don’t need this
  • Works well with transactions

[Figure: TL2 throughput vs. number of threads, for NO_LEASE, SINGLE_LEASE, DOUBLE_LEASE]

SLIDE 42

Notes

  • 1. Lease/Release Builds on Two Powerful Ideas
  • Hardware Queues [iQOLB: Rajwar, Kaegi, Goodman; HPCA 2000]
  • Transient Blocking Synchronization [Shalev, Shavit; Sun Tech Report]
  • 2. Estimates loss of performance because of hardware overheads
  • As we’ve seen, it can be non-trivial
  • 3. Allows the programmer to enforce optimism at the hardware level
  • We lease in the hope that we’ll commit before expiration
  • We pay extra cost if we’re wrong!


SLIDE 43

The High-Level View

  • The Problem with Concurrency
  • Inherent Bottlenecks lead to meltdowns
  • Why?
  • Contention hurts optimistic patterns, quantifiably so
  • Lease/Release:
  • We can now scale bottlenecks
  • Optimism enforced at the hardware level

Can we scale beyond bottlenecks? Let’s Relax!

SLIDE 44

Relaxed Data Structures II: Relaxed Semantics

SLIDE 45

The High-Level View

  • The Problem with Concurrency
  • Inherent Bottlenecks lead to meltdowns
  • Why?
  • Contention hurts optimistic patterns, quantifiably so
  • Lease/Release:
  • We can now scale bottlenecks
  • Optimism enforced at the hardware level

Can we scale beyond bottlenecks? Let’s Relax!

SLIDE 46

Example 1: Relaxed Shared Counter

  • Shared Counter:
  • Read : returns counter value
  • Increment: adds 1 to the counter value

2 1 3 …

Memory location R;

unsigned fetch_and_inc( ) {
    unsigned val;
    do {
        val = Read( R );
    } while ( !Bool_CAS( &R, val, val + 1 ) );
    return val;
}

Example: Basic lock-free counter

SLIDE 47

Example 1: Relaxed Shared Counter

  • Shared Counter:
  • Read : returns counter value
  • Increment: adds 1 to the counter value

Memory location R;

void increment( ) {
    unsigned val;
    do {
        val = Read( R );
    } while ( !Bool_CAS( &R, val, val + 1 ) );
}

Example: Basic lock-free counter

SLIDE 48

Example 1: Relaxed Shared Counter

  • Shared Counter:
  • Idea: Save updates locally to reduce contention
  • Read : returns approximatecounter value
  • Increment: adds 1 to the counter value (sort of)
Memory location R;
Local value V[i]; // one per thread, initially 0

void increment( ) {
    V[i] = V[i] + 1;
    if ( V[i] % 2 == 1 ) return;   // buffer odd increments locally
    unsigned val;
    do {
        val = Read( R );
    } while ( !Bool_CAS( &R, val, val + 2 ) );
}

Example: Basic relaxed counter

SLIDE 49

Example 1: Relaxed Shared Counter

  • Why do this?
  • Fewer updates
  • Less contention
  • More performance
  • Why not?
  • Well, it’s not a counter
  • What does this guarantee?
  • The value returned is at most n behind the “true” value
  • It is always smaller than the true value
  • Is this a good idea?
  • Depends if the application accepts such semantics
  • Also, this only divides update contention by 2

Memory location R;
Local value V[i]; // one per thread, initially 0

void increment( ) {
    V[i] = V[i] + 1;
    if ( V[i] % 2 == 1 ) return;
    unsigned val;
    do {
        val = Read( R );
    } while ( !Bool_CAS( &R, val, val + 2 ) );
}

Example: Basic relaxed counter
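A Python sketch of this relaxed counter (class and method names ours; a lock again stands in for the CAS loop). It checks the guarantee stated above: the shared value trails the true count by at most one buffered increment per thread.

```python
import threading

class RelaxedCounter:
    """Buffer every other increment locally; push +2 on alternate calls."""
    def __init__(self, num_threads):
        self.shared = 0
        self.lock = threading.Lock()   # stands in for the CAS retry loop
        self.local = [0] * num_threads

    def increment(self, tid):
        self.local[tid] += 1
        if self.local[tid] % 2 == 1:
            return                     # buffered locally, no shared update
        with self.lock:
            self.shared += 2

    def read(self):
        return self.shared

c = RelaxedCounter(4)
for tid in range(4):
    for _ in range(101):               # odd count: one increment stays buffered
        c.increment(tid)
true_total = 4 * 101
print(true_total - c.read())           # 4: at most one lag per thread
```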

SLIDE 50

This Class

Algorithms & Data Structures with Relaxed Semantics

  • 1. Concurrent Priority Queues
  • 2. Concurrent Counters
  • 3. Concurrent Stochastic Gradient Descent
  • Algorithms & Guarantees
  • Applications
SLIDE 51

Concurrent Priority Queues

[Diagram: priority queue holding <priority, task> pairs with priorities 1, 3, 4, 5, 7, 8, 11, 15, 18]

Methods:

  • Get Top Task
  • Insert a Task
  • Search for Task

Priority Queue <key, value>: Search(key), Insert/Delete(k, v), DeleteMin()

Extremely useful, both in theory and practice:

  • Graph Algorithms (e.g., Shortest Paths)
  • Operating System Scheduling
  • Time-Based Simulations

We are looking for a fast concurrent Priority Queue.

SLIDE 52

Fast Concurrent Priority Queues

Lots of work on the topic:

[Sanders97], [Lotan&Shavit00], [Sundell&Tsigas07], [Basin et al. 11], [Linden&Jonsson13], [Lenhart et al. 14], [Wimmer et al.14], [Alistarh et al. 14], [Rihani et al. 15]

Known solutions do not perform well: DeleteMin is highly contended. Every thread wants the top element!

[Figure: throughput (M operations/s) vs. number of threads (4 to 28) for the New, HaS, and SaT implementations]

Throughput of state-of-the-art concurrent PQs (from [Linden&Jonsson 2015]).

SLIDE 53

Sequential Solution: Use a Heap

Classical heap-based implementation

[Diagram: binary min-heap with root 1, then 2 and 8, then 7, 5, 3, 9, then 6]

The Problem:

All operations must access the root: cache invalidation, failed synchronization. In sum: no scalability!

SLIDE 54

Relaxed Concurrent PQ

[Diagram: relaxed priority queue holding <priority, task> pairs with priorities 1, 3, 5, 4, 2, 8]

Methods:

  • ApproxDeleteMin
  • Insert
  • Search

Still useful, both in theory and practice:

  • We can often pay for priority inversions via extra work
  • E.g. parallel Dijkstra’s: multiple relaxations of the same node

We’re now looking for a fast relaxed concurrent PQ.

SLIDE 55

A Philosophical Point

The fact that we are running in parallel already implies that we’re accepting out-of-order execution of tasks!

[Diagram: tasks with priorities 1, 3, 4, 5, 7, 8, 11, 15, 18 spread across parallel threads]

The application already has to deal with some relaxation!

SLIDE 56

Trivial Concurrent Solution

head

  • Linked list, sorted by priority
  • Insert/Delete/Search done via list operations
  • Problem: does not scale!

H 1 3 4 5 9

T

SLIDE 57

Less Trivial Solution: SkipList

head

  • Linked list, sorted by priority
  • Each node has random “height” (geometrically distributed with parameter ½)
  • Elements at the same height form their own lists

H 1 3 4 5 9

T

SLIDE 58
  • Linked list, sorted by priority
  • Each node has random “height” (geometrically distributed with parameter ½)
  • Elements at the same height form their own lists
  • Average time Search, Insert, Delete logarithmic, work concurrently [Pugh98, Fraser04]

H 1 3 4 5 9

T

head tail

Concurrent Solution: the SkipList [Pugh90]

[Diagram: Search( 5 ) descends through nodes [H, 9] → [H, 9] → [1, 9] → [5, 9], then stops]

SLIDE 59
  • I. Lotan and N. Shavit. Skiplist-Based Concurrent Priority Queues. 2000.
  • DeleteMin: simply remove the smallest element from the bottom list
  • All processors compete for smallest element
  • Still does not scale!

head tail

The SkipList as a PQ

SLIDE 60
  • We want to choose an item at random with ‘good’ guarantees
  • Minimize loss of exactness by only choosing items near the front of the list
  • Minimize contention by keeping collision probability low

The Idea: Relax!

P processors O(P) relaxation

SLIDE 61

DeleteMin: The Spray [Alistarh, Kopinsky, Li, Shavit, PPoPP 2015]

procedure Spray( )

  • Start at height H = log P
  • At each skiplist level, flip a coin to stay or jump forward
  • Repeat for each level from log P down to 1 (the bottom)
  • As if removing a random priority element near the head

Two example sprays for starting height 4: jump, stay, jump, jump

SLIDE 62

The Spray Operation

int spray( ) {
    cur <- head;
    i <- log n;
    while ( i > 0 ) {
        repeat( rand(0, 1) ) {      // jump forward 0 or 1 nodes at this level
            cur <- cur->next[i];
        }
        i <- i - 1;
    }
    return cur->val;
}

Parameters in red can be tuned!

Spray and pray?
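A seeded toy simulation of the spray walk (all names ours). It samples geometric node heights, starts at height log2(P), and takes 0 or 1 forward jumps per level, mirroring the tunable parameters above; returned ranks stay near the head of a 100,000-element list.

```python
import math
import random

def node_heights(n, rng):
    """Skiplist node heights: geometric with parameter 1/2."""
    heights = []
    for _ in range(n):
        h = 1
        while rng.random() < 0.5:
            h += 1
        heights.append(h)
    return heights

def spray(heights, P, rng):
    """Walk down from height log2(P), taking rand(0, 1) forward jumps per
    level along nodes of at least that height; return the final rank."""
    H = max(1, int(math.log2(P)))
    pos = -1                                # head sentinel, before rank 0
    for level in range(H, 0, -1):
        for _ in range(rng.randint(0, 1)):  # 0 or 1 jumps (tunable)
            j = pos + 1
            while j < len(heights) and heights[j] < level:
                j += 1                      # skip nodes below this level
            if j < len(heights):
                pos = j
    return max(pos, 0)

rng = random.Random(1)
heights = node_heights(100_000, rng)
ranks = [spray(heights, 64, rng) for _ in range(1000)]
print(max(ranks), sum(ranks) / len(ranks))  # small compared to 100,000
```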

SLIDE 63

✓ Maximum value returned by Spray has rank Õ(P)

  • Sprays aren’t too wide

✓ For all x, Pr(x hit) = Õ(1/P)

  • Sprays don’t cluster too much

✓ If x > y is returned by some Spray, then Pr(y hit) = Ω̃(1/P)

  • Elements do not starve in the list

Pr(x hit) = probability that a spray returns the value at index x

SprayList Probabilistic Guarantees

SLIDE 64

✓ Maximum value returned by Spray has rank Õ(P)

  • Sprays aren’t too wide
  • Step 1: How many elements lie between two consecutive nodes at height h?
  • We need the second node to flip Heads h times in a row!
  • Pr[ h Heads flips in a row ] = (1/2)^h
  • So the expected distance between two such elements is 2^h

Analysis: Max Spray Length

SLIDE 65

✓ Maximum value returned by Spray has rank Õ(P)

  • Sprays aren’t too wide
  • Step 2: How long does a spray stay at each height?
  • With probability 1/2 you stay, with probability 1/2 you go down
  • So in expectation you do 1/2 jumps at each level
  • Your expected horizontal travel at height H is (1/2) · 2^H = 2^(H−1)

Analysis: Max Spray Length

SLIDE 66

✓ Maximum value returned by Spray has rank Õ(P)

  • Sprays aren’t too wide
  • Step 3: What’s your total horizontal travel?

Σ_{H=1}^{log P} 2^(H−1) = (1/2)( 1 + 2 + 4 + … + P ) ≈ P

  • All of this was done in expectation
  • We lose log P factors if we want the result to hold with high probability

Analysis: Max Spray Length

SLIDE 67

Problem

First element almost never chosen!

SLIDE 68

Small Tweak

✓Pad the front of the list with ‘dummy’ elements

If a Spray would return a dummy element, it instead restarts

SLIDE 69
  • Alternating Inserts (on random keys) and DeleteMin operations
  • Exact algorithms have negative scaling after 8 threads
  • SprayList competitive with the random remover

(no guarantees, incorrect execution)

In many practical settings (discrete-event simulation, shortest paths), priority inversions are not expensive.

One Benchmark

SLIDE 70

Part 2: The MultiQueue Strategy

[T.Henzinger et al. 11, Rihani et al.14, Nguyen et al. 14]

  • Given: n sequential priority queues, each protected by a lock
  • Insert: pick a random queue, try-lock, and insert into it
  • Remove: pick two queues at random, try-lock and remove the better element
  • If locking fails, retry

[Figure: throughput (MOps/s) vs. number of threads (7 to 56) for MultiQ c=2, MultiQ HT c=2, MultiQ c=4, Spraylist, Linden, Lotan]

Looks good, but does it actually guarantee anything? Can we improve it?

Relaxes correctness: not a strict PQ. Optimistic about progress (probabilistic termination).
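A seeded toy simulation of this process (names ours): keys go into random queues, removals take the smaller of two random tops, and the cost of a removal is the rank of the removed key among all keys still present. The average cost stays on the order of the number of queues:

```python
import heapq
import random

def multiqueue_avg_cost(n_queues, n_ops, rng):
    queues = [[] for _ in range(n_queues)]
    live = []        # every key currently stored, across all queues
    costs = []
    for op in range(n_ops):
        key = rng.random()
        heapq.heappush(queues[rng.randrange(n_queues)], key)
        live.append(key)
        if op % 2 == 1:                         # alternate insert / remove
            i = rng.randrange(n_queues)
            j = rng.randrange(n_queues)
            cands = [q for q in (queues[i], queues[j]) if q]
            if not cands:
                continue
            q = min(cands, key=lambda q: q[0])  # smaller top of the two
            removed = heapq.heappop(q)
            costs.append(sorted(live).index(removed) + 1)
            live.remove(removed)
    return sum(costs) / len(costs)

avg = multiqueue_avg_cost(8, 4000, random.Random(0))
print(avg)   # average rank: a small multiple of the number of queues
```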

SLIDE 71

The Random Process

Q1: 1, 6, 10, 13 | Q2: 4, 7, 12, 16 | Q3: 2, 3, 8, 15 | Q4: 5, 9, 11, 14

What is the average rank removed over a sequence of steps?

WLOG, elements are consecutive integers.

  • 1. Insert elements uniformly at random
  • 2. Remove using two choices

Cost = rank of the removed element among the remaining elements. Cost(2) = 2, Cost(4) = 3, Cost(1) = 1. Intuitively, the average distance from optimal.

SLIDE 72

Notes:

  • Cost does not depend on t
  • Bounds are tight, and in some sense the best we could expect (for such a process)
  • The single-choice removal strategy diverges as we increase t

The Result

Theorem: Given n queues, for any time t > 0, the cost at t is O(n) in expectation, and O( n log n ) w.h.p.

[Alistarh, Kopinsky, Li, Nadiradze, PODC 2017]

SLIDE 73

Analytic Approach (1)

Theorem: Given n queues, for any t > 0, the cost at t is O(n) in expectation, and O( n log n ) w.h.p.

  • Strategy 1: reduction to “power of two-choices” analysis? [Azar et al., SICOMP 99]
  • Would work if we could equate queue size with its top label.

This would work if inserts were round-robin:

  • Idea: keep “virtual bins” tracking elements removed from each queue
  • We always insert into the less loaded virtual bin (standard two-choice allocation)

Q1: 1, 5, 9 | Q2: 2, 6, 10 | Q3: 3, 7, 11 | Q4: 4, 8, 12

The reduction does not hold in general. Intuitively, height and top priority are not well correlated.

SLIDE 74
  • Strategy 2: some simple sort of induction
  • The initial cost distribution is nice; can we prove it always stays nice?
  • No: pick arbitrary rank value R

Analytic Approach (2)

Theorem: For any t > 0, the cost at t is O(n) in expectation, and O( n log n ) w.h.p.

[Diagram: an adversarial configuration: one queue holds 1, 2, …, R, while the others hold only elements R+1, R+2, R+3, …]

Hard case: over time, we’ll eventually get arbitrary distributions. We have to prove that the algorithm gets out of those reasonably fast.

SLIDE 75

  • Strategy idea: characterize what’s going on step-by-step

Analytic Approach

Theorem: For any t > 0, the cost at t is O(n) in expectation, and O( n log n ) w.h.p.

[Diagram: queue tops mid-execution, several of them unknown (“?”)]

In expectation, the increment is n.

Problem: the behavior at a step is highly correlated with what happened in previous steps.

SLIDE 76

The Actual Argument

Theorem: For any t > 0, the cost at t is O(n) in expectation, and O( n log n ) w.h.p.

  • Step 1: analyze a different, uncorrelated exponential process
  • Crucially, has the same rank distribution!
  • Step 2: characterize the value distribution in the exponential process
  • “Potential argument”
  • Step 3: characterize rank distribution of exponential process
  • Average rank is O(n)
SLIDE 77
  • Insert: pick a random queue
  • Add exponentially distributed increment with mean n into it
  • Remove: pick two queues at random, remove the lower label

Step 1: The exponential process

Q1: 1.8, 5.9, 10.2, 13.2 | Q2: 4.7, 7.3, 12.5, 16.8 | Q3: 2.2, 3.2, 8.3, 15.2 | Q4: 5.1, 9.5, 11.7, 14.2

Lemma: The distribution of removed ranks is the same in the discrete process and in the exponential process.

Idea: the exponential is memoryless. Expected value n

SLIDE 78
  • Insert: pick a random queue
  • Add exponentially distributed increment with mean n to its current label
  • Remove: pick two queues at random, remove the lower label

Step 1: The exponential process

Q1: 1.8, 5.9, 10.2, 13.2 | Q2: 4.7, 7.3, 12.5, 16.8 | Q3: 2.2, 3.2, 8.3, 15.2 | Q4: 5.1, 9.5, 11.7, 14.2

Lemma: The distribution of rank values is the same in the discrete process and in the exponential process.

Expected increment n = 4

The probability that the ith label (or rank) is in bin j is the same in both processes.

Easy to see initially; why later?

SLIDE 79
  • Idea: focus on the deviation of the top values from their mean at any time t
  • Let Δ_j(t) = the difference between the top value of queue j and the mean of the top values at t

Step 2: Analyzing the exponential process

[Diagram: four queues with top values 1.8, 7.3, 8.3, 9.5; mean of the tops = 6.725; deviations Δ1, Δ2, Δ3, Δ4]

Theorem: For any t > 0, E[ Σ_{j=1}^{n} exp( Δ_j(t)/n ) + Σ_{j=1}^{n} exp( −Δ_j(t)/n ) ] = O(n).

Idea: this potential function behaves as a “super-martingale”: as soon as it grows above O(n), it starts decreasing. Generalizes [Peres, Talwar, Wieder, R.S.&A. 15].

SLIDE 80

Step 3: What does all this have to do with ranks?

Theorem: For any t > 0, the rank cost at t is O(n) in expectation. Further, the rank cost is O( n log n ) w.h.p.

[Diagram: queue top labels bucketed by distance from the mean: ≥ mean + n, ≥ mean + 2n, …, ≥ mean + kn on one side, and ≤ mean − n, ≤ mean − 2n, …, ≤ mean − kn on the other; the number of queues in each bucket decreases exponentially. On average, a chosen queue sits near the mean.]

SLIDE 81

Applications

What if we do two choices only β% of the time?

(one random choice otherwise)

Theorem: For any t > 0, the cost at t is O( n / (β² log β) ) in expectation, and O( n log n / β ) w.h.p.

What if the input distribution is biased?

Still works (within reason). Works really well in practice.

We can use this for relaxed concurrent queues, priority queues, counters.

SLIDE 82

Experiments

SLIDE 83

Application 1: Timestamped Queues

  • Shared: m queues (for instance, one per processor)
  • Enqueue: inserts <node, tsp> pair into random queue
  • Dequeue: removes element with better tsp out of two random choices

Vector of Queues Q[m];

void enqueue( element e ) {
    tsp = GetTimestamp();
    i = random(0, m - 1);
    Q[i].enqueue( <e, tsp> );
}

element dequeue( ) {
    i = random(0, m - 1);
    j = random(0, m - 1);
    // pick the better element out of two choices
    if ( Q[j].peek().tsp < Q[i].peek().tsp )
        i = j;
    return Q[i].dequeue( );
}

Example: Relaxed Queue

Theory says that the average rank removed is O( m ). We can trade off contention (n / m) versus rank guarantees (m).
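The pseudocode above can be made concrete. The Python sketch below is my own rendering (the class name, the lock-ordering detail, and the shared counter standing in for GetTimestamp() are assumptions): each sub-queue has its own lock, and dequeue acquires its two locks in index order to avoid deadlock.

```python
import random
import threading
from collections import deque
from itertools import count

class RelaxedQueue:
    """Two-choice relaxed FIFO: m sub-queues, one lock per sub-queue."""

    def __init__(self, m=8):
        self.m = m
        self.queues = [deque() for _ in range(m)]
        self.locks = [threading.Lock() for _ in range(m)]
        self.clock = count()                  # stand-in for GetTimestamp()

    def enqueue(self, e):
        tsp = next(self.clock)
        i = random.randrange(self.m)
        with self.locks[i]:
            self.queues[i].append((tsp, e))

    def dequeue(self):
        i, j = random.randrange(self.m), random.randrange(self.m)
        lo, hi = min(i, j), max(i, j)
        # acquire the two locks in index order to avoid deadlock
        with self.locks[lo]:
            if lo != hi:
                self.locks[hi].acquire()
            try:
                qi, qj = self.queues[i], self.queues[j]
                if qj and (not qi or qj[0][0] < qi[0][0]):
                    i = j                      # take the older head
                q = self.queues[i]
                return q.popleft()[1] if q else None  # None: retry later
            finally:
                if lo != hi:
                    self.locks[hi].release()
```

A caller treats a `None` result as “both sampled sub-queues were empty, retry”; every enqueued element is still dequeued exactly once.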

slide-84
SLIDE 84

Application 2: Approximate Timestamps

  • Shared: m counters (for instance, one per processor)
  • Read: pick a random index and read from it
  • Increment: pick two counters at random, and increment the smaller one

Vector of Counters C[m];

int read( ) {
    i = random(0, m - 1);
    return C[i] * m;
}

void increment( ) {
    i = random(0, m - 1);
    j = random(0, m - 1);
    // pick lower counter out of two choices
    if ( C[j] < C[i] )
        i = j;
    C[i].increment();
}

Example: Relaxed Counter

Theory says that the average distance from the true value is O( m ). We can again trade off contention (n / m) versus rank guarantees (m).
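A runnable version of this counter is short. The sketch below (class name, seed, and parameters are illustrative assumptions) keeps m sub-counters, increments the lower of two random choices, and reads one random sub-counter scaled by m.

```python
import random

class RelaxedCounter:
    """Approximate counter: m sub-counters; increments use two choices,
    reads sample one sub-counter and scale it by m."""

    def __init__(self, m=16, seed=3):
        self.m = m
        self.c = [0] * m
        self.rng = random.Random(seed)

    def increment(self):
        i, j = self.rng.randrange(self.m), self.rng.randrange(self.m)
        if self.c[j] < self.c[i]:      # pick the lower of two sub-counters
            i = j
        self.c[i] += 1

    def read(self):
        # one random sub-counter, scaled up by the number of sub-counters
        return self.c[self.rng.randrange(self.m)] * self.m

counter = RelaxedCounter()
for _ in range(10000):
    counter.increment()
```

After 10,000 increments the sub-counters stay tightly balanced, so any scaled read lands close to the true total, matching the O(m) average-distance guarantee.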

slide-85
SLIDE 85

Empirical Tests: Two Choices

# running 64 counters for 10K steps, making two choices for increment
pq.timerExperiment( 64, 10000, 2 )

slide-86
SLIDE 86

Empirical Tests: A Single Choice

slide-87
SLIDE 87

Part 3: A Little Bit of Machine Learning (the ultimate relaxation)

slide-88
SLIDE 88

Machine Learning in 1 Slide

Task (e.g., classification) + Data → argmin_x f(x), where

f(x) = Σ_{i=1}^{m} loss( x, e_i )

and loss is a notion of “quality,” e.g., squared distance. Solved via an optimization procedure.

slide-89
SLIDE 89

Distributed Machine Learning in 1 Slide

Node1 Node2

Dataset Partition 1 Dataset Partition 2

Synchronization

argmin_x f(x) = f1(x) + f2(x), where

f1(x) = Σ_{i=1}^{m/2} loss( x, e_i )        f2(x) = Σ_{i=m/2+1}^{m} loss( x, e_i )

slide-90
SLIDE 90

Background

  • Gradient descent (GD): x_{t+1} = x_t − η_t ∇f( x_t )
  • Stochastic gradient descent (SGD): x_{t+1} = x_t − η_t g̃( x_t )

Let g̃( x_t ) = the gradient at a randomly chosen data point, with E[ g̃( x_t ) ] = ∇f( x_t ).

  • Let E || g̃(x) − ∇f(x) ||² ≤ σ² (variance bound)

Theorem [standard]: Given f convex, and R² = ||x_0 − x*||², if we run SGD for T = O( R²σ² / ε² ) iterations, then

E[ f( (1/T) Σ_{t=0}^{T−1} x_t ) ] − f( x* ) ≤ ε.

slide-91
SLIDE 91

What does this actually look like?

(Figure: the model is a vector x_t, with example entries 0.5, 0.1, −0.1, ...)

Vector x[d], initially random

void SGD-Converge( float eps ) {
    do {
        <e, label> = randomly chosen data point
        gradient = ComputeGradient( x, e, label )
        for i from 1 to d:
            x[i] = x[i] - η * gradient[i]
        error = ComputeLoss( x, training_data )
    } while ( error > eps )
}

Example: Sequential SGD

(Figure: the updated model x_{t+1}, with example entries 0.6, ..., −0.2.)
  • It is common in practice to batch together several examples
  • Apply the gradient with respect to all of them
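The pseudocode can be made runnable. A minimal Python sketch, assuming a squared loss over a small noiseless linear-regression dataset (the loss, the step size η = 0.05, the seed, and the data are all illustrative, not from the talk):

```python
import random

def compute_gradient(x, e, label):
    """Gradient of the squared loss (x . e - label)^2 at one data point.
    (The slides leave the loss unspecified; squared loss is an assumption.)"""
    pred = sum(xi * ei for xi, ei in zip(x, e))
    return [2 * (pred - label) * ei for ei in e]

def compute_loss(x, data):
    return sum((sum(xi * ei for xi, ei in zip(x, e)) - label) ** 2
               for e, label in data) / len(data)

def sgd_converge(data, d, eps=1e-6, eta=0.05, seed=0):
    rng = random.Random(seed)
    x = [rng.random() for _ in range(d)]          # initially random
    while compute_loss(x, data) > eps:
        e, label = rng.choice(data)               # randomly chosen data point
        g = compute_gradient(x, e, label)
        for i in range(d):
            x[i] -= eta * g[i]                    # the SGD update step
    return x

# noiseless linear labels generated by x* = (2, -1)
points = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0),
          ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
model = sgd_converge(points, d=2)
```

Because the data are noiseless and span the space, the loop drives the loss below `eps` and the model to the generating vector.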
slide-92
SLIDE 92

Concurrent SGD: Naïve Implementation


Shared: Vector x[d], initially random
Lock L  // for the model

void SGD-Converge( float eps ) {
    do {
        <e, label> = randomly chosen data point
        L.lock()
        gradient = ComputeGradient( x, e, label )
        for i from 1 to d:
            x[i] = x[i] - η * gradient[i]
        error = ComputeLoss( x, training_data )
        L.unlock()
    } while ( error > eps )
}

Example: Naïve Concurrent SGD

This has practically no parallelism!

slide-93
SLIDE 93

Concurrent SGD: Fine-Grained


Shared: Vector x[d], initially random
Lock array L[d]  // one per model component

void SGD-Converge( float eps ) {
    do {
        <e, label> = randomly chosen data point
        gradient = ComputeGradient( x, e, label )
        for i from 1 to d:
            L[i].lock()
            x[i] = x[i] - η * gradient[i]
            L[i].unlock()
        error = ComputeLoss( x, training_data )
    } while ( error > eps )
}

Example: Fine-Grained Concurrent SGD

This assumes that the reading step and the loss computation are fine with seeing partial updates. Proved correct by [Agarwal & Duchi, NIPS 11].
slide-94
SLIDE 94

Hogwild! [Niu et al., NIPS 2011]

Vector x[d], initially random

void SGD-Converge( float eps ) {
    do {
        <e, label> = randomly chosen data point
        gradient = ComputeGradient( x, e, label )
        for i from 1 to d:
            x[i] = x[i] - η * gradient[i]
        error = ComputeLoss( x, training_data )
    } while ( error > eps )
}

Example: Hogwild SGD = Sequential SGD

The algorithm is OK running without any locks! This is very non-trivial to prove, and we won’t do it here: initially proved by [Niu et al., NIPS 2011]; analysis improved by [Duchi et al., NIPS 2015] and [De Sa et al., NIPS 2015]. The convergence rate is quadratic in the maximum delay.

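A Hogwild!-style run can be sketched in a few lines of Python. Everything in this toy is an assumption (the separable objective, the per-coordinate updates mimicking sparse gradients, the step size, and the thread count), and CPython's GIL means the threads interleave rather than truly run in parallel, so this illustrates the lock-free logic rather than the speedup.

```python
import random
import threading

# Shared model updated by several threads with no locks at all.
# Toy objective f(x) = sum_i (x_i - t_i)^2: each step touches one random
# coordinate, mimicking the sparse updates Hogwild! targets.
target = [1.0, -2.0, 3.0, 0.5]
x = [0.0] * len(target)

def worker(steps, seed):
    rng = random.Random(seed)
    for _ in range(steps):
        i = rng.randrange(len(x))
        g = 2.0 * (x[i] - target[i])    # gradient of the i-th term
        x[i] -= 0.1 * g                 # racy read-modify-write, no lock

threads = [threading.Thread(target=worker, args=(5000, s)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Despite the races, every update contracts its coordinate toward the target, so the shared model still converges, which is the qualitative point of the Hogwild! analysis.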
slide-95
SLIDE 95

By the way:

slide-96
SLIDE 96

Experiments

(Figure: four speedup-vs-cores plots, 2 to 10 cores, for sparsity levels (a) pnz = .005, (b) pnz = .01, (c) pnz = .2, (d) pnz = 1; each panel compares running without locking against running with locking, with linear speedup shown for reference.)

slide-97
SLIDE 97

What have we learned?

  • Relaxation can lead to scaling
  • We’re just removing bottlenecks
  • Application semantics are critical
  • Some applications are perfectly fine with relaxation
  • Others aren’t
  • E.g. SGD vs. dressing for work

Average order deviation (lower is better): FC 1.8, WF 9.9, MS 66.2, LB 25.0, LCRQ 15.6, TS-atomic 20.4, TS-CAS 17.6, TS-hardware 24.7, TS-interval 16.7, TS-stutter 19.2, CTS 8.8, RTS 20.8, 1RR DQ 13.8, 2RR DQ 22.9, 1RA DQ 2924.0, k-FIFO 47.0.

(a) High-contention producer-consumer

Order deviation of various queue algorithms (40 threads).

slide-98
SLIDE 98

The Last Slide

Some (strongly ordered) data structures are hard to scale.

Many of the data structures of our childhood are changing: relaxed semantics, optimistic progress guarantees.

How do we specify and prove them correct? What new data structures are out there? How do they interact with existing applications?

slide-99
SLIDE 99

Workshop Announcement

  • Theory & Practice in Concurrent Data Structures
  • Co-located with DISC 2017 (Vienna)
  • Overall goals
  • Fostering collaboration between practically-minded conferences (PPoPP, SOSP, etc.) and the PODC/DISC community

  • New challenges in concurrent data structure design
  • Precise goals
  • Better benchmarks for concurrent data structures
  • Real applications and practical issues (e.g. memory management)
  • Usefulness of relaxed designs