
  1. Relaxed Data Structures. Dan Alistarh, IST Austria & ETH Zurich

  2. ...but first, we’re hiring!
  • Young institute dedicated to basic research and graduate education
  • Located near Vienna, Austria
  • Fully English-speaking
  • Graduate school: 1+3 year PhD program
  • Full-time positions with competitive salary
  • Internships (2018): email d.alistarh@gmail.com
  • PhD & postdoc positions
  • Projects: concurrent data structures, distributed machine learning, molecular computation

  3. Why Concurrent Data Structures?
  [Figure: clock rate and #cores over the past 45 years.]
  To get speedup on newer hardware. Scaling: more threads should imply more useful work.

  4. The Problem with Concurrency
  [Figure: throughput (events/second, up to ~6×10^6) of a concurrent packet-processing queue vs. number of threads (0–70), comparing a >$10,000 machine with a <$1,000 machine.]
  Is this problem inherent for some data structures?

  5. Inherent Sequential Bottlenecks
  Data structures with strong ordering semantics:
  • Stacks, queues, priority queues, exact counters
  Theorem: Given n threads, any deterministic, strongly ordered data structure has executions in which a processor takes time linear in n to return.
  [Ellen, Hendler, Shavit, SICOMP 2013] [Alistarh, Aspnes, Gilbert, Guerraoui, JACM 2014]
  This is important because of Amdahl’s Law:
  • Assume the single-threaded computation takes 7 days
  • The inherently sequential component (e.g., a queue) takes 15% ≈ 1 day
  • Then the maximum speedup is < 7x, even with an infinite number of threads
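The Amdahl's Law arithmetic above is easy to check directly. A minimal sketch (the function name `amdahl_speedup` is ours, not from the talk):

```c
/* Amdahl's Law: if a fraction s of the work is inherently sequential,
 * the speedup with p threads is 1 / (s + (1 - s) / p).
 * With s = 1/7 (1 day out of 7, ~15%), the speedup stays below
 * 1/s = 7x no matter how many threads we add. */
double amdahl_speedup(double seq_fraction, double threads) {
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / threads);
}
```

For example, with s = 1/7 and 8 threads the speedup is exactly 4x, and the limit as the thread count grows is 7x.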

  6. Today’s Class
  Theorem: Given n threads, any deterministic, strongly ordered data structure has an execution in which a processor takes time linear in n to return. [Alistarh, Aspnes, Gilbert, Guerraoui, JACM 2014]
  How can we circumvent this?
  Theory ↔ Software ↔ Hardware
  New notions of progress/correctness! New data structure designs!

  7. Lock-Free Data Structures
  • Based on atomic instructions (CAS, Fetch&Inc, etc.)
  • Blocking of one thread doesn’t stop the whole system
  • Implementations: hash tables, lists, B-trees, queues, stacks, skip lists, etc.
  • Known to scale well for many data structures
  Example: lock-free counter (preamble, then scan & validate with CAS(R, old, new) until success):
      Memory location R;
      unsigned fetch-and-inc( ) {
          unsigned val = 0;
          do {
              val = Read( R );
          } while (!Bool_CAS( &R, val, val + 1 ));
          return val;
      }
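The slide's pseudocode compiles almost verbatim with C11 atomics; a sketch, assuming `Bool_CAS` maps to `atomic_compare_exchange_weak`:

```c
#include <stdatomic.h>

/* Lock-free counter: the slide's scan-and-validate loop in C11.
 * Read(R) becomes atomic_load; Bool_CAS(&R, val, val + 1) becomes
 * atomic_compare_exchange_weak, which retries whenever another
 * thread changed R between our read and our CAS. */
atomic_uint R;

unsigned fetch_and_inc(void) {
    unsigned val;
    do {
        val = atomic_load(&R);                        /* Read(R) */
    } while (!atomic_compare_exchange_weak(&R, &val, val + 1));
    return val;             /* value observed before our increment */
}
```

Note that the weak variant may fail spuriously; the retry loop absorbs such failures, which is why it is the idiomatic choice here.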

  8. The Lock-Free Paradox
  Example: lock-free counter, with two threads racing to increment R:
      Memory location R;
      int fetch-and-increment( ) {
          int val, new_val;
          do {
              val = Read( R );
              new_val = val + 1;
          } while (!Compare&Swap( &R, val, new_val ));
          return val;
      }
  Theory: threads could starve in optimistic lock-free implementations; use more complex wait-free algorithms.
  Practice: this doesn’t happen. Threads don’t starve.

  9. Starvation?
  [Figure: lock-free stack, 16 threads. Distribution of the number of iterations (1–36) before an operation succeeds, as a percentage of operations.]
  [Figure: try distribution, skip-list inserts, 16 threads, 50% mutations. Number of operations vs. number of tries (1–6), for a queue, a skip list, and a counter.]
  Why?

  10. Part 1: Understanding Lock-Free Progress
  1. We focus on contended workloads
  2. We focus on the scheduler
     • Sequence of accesses to shared data
     • Not adversarial, but relaxed
     • Stochastic model
  3. We focus on long-term behavior
     • How long does an operation take to complete on average?
     • Are there operations that never complete?
  How does the “scheduler” behave in the long run?

  11. A simplified view of “the scheduler”
  • Complex combination of:
    • Input (workload)
    • Code
    • Hardware
  • Single-variable contention (Intel TM)
  [Diagram: threads contending on a single variable; access sequence 1, 3, 4, 1, …]

  12. The Scheduler
  • The Schedule: under contention, a sequence of thread ids, e.g.: 2, 1, 4, 5, 2, 3, …
    • Sequential access to the contended data item
  • The Scheduler: in each “step,” either chooses a request from the pool or leaves the variable with the current owner
  • Stochastic Scheduler: every thread can be scheduled in each step with probability > 0
    • Pick a random time t: what’s the probability that p_i is scheduled?

  13. Examples
  Assume n processes.
  • The uniform stochastic scheduler: θ = 1/n; each process gets scheduled uniformly
  • A standard adversary: take any adversarial strategy; the distribution gives probability 1 to the process picked by the strategy, 0 to all others. Not stochastic.
  • Quantum-based schedulers: stochastic if the quantum length is not fixed but a random variable
    • E.g.: [1, 1, 1], [3], [4, 4, 4, 4], [2, 2], [1], [4, 4], …
    • Common for OS scheduling
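A quantum-based scheduler like the last bullet can be sketched as a generator of (thread, quantum-length) pairs: a uniformly chosen thread runs for a random number of consecutive steps. This is our own illustration, not code from the talk; the PRNG and function names are invented:

```c
#include <stdint.h>

/* Small deterministic xorshift64 PRNG, for reproducibility. */
static uint64_t qs_state = 2463534242ULL;
static uint64_t qs_rand(void) {
    qs_state ^= qs_state << 13;
    qs_state ^= qs_state >> 7;
    qs_state ^= qs_state << 17;
    return qs_state;
}

/* Emit the next quantum: thread id in [1, n] runs for a random
 * quantum of 1..max_q consecutive steps, producing schedules like
 * [1, 1, 1], [3], [4, 4, 4, 4], ... as on the slide. */
void next_quantum(int n, int max_q, int *thread, int *len) {
    *thread = (int)(qs_rand() % (uint64_t)n) + 1;
    *len    = (int)(qs_rand() % (uint64_t)max_q) + 1;
}

/* Helper: check that a generated quantum is well-formed. */
int quantum_in_range(int n, int max_q) {
    int t, l;
    next_quantum(n, max_q, &t, &l);
    return t >= 1 && t <= n && l >= 1 && l <= max_q;
}
```

Because every thread is chosen with probability 1/n at each quantum boundary, this scheduler is stochastic in the sense defined on the previous slide.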

  14. Lock-Free Algorithms and Stochastic Schedulers
  • Lock-free: there is a time bound B for the system to complete some new operation
  • Wait-free: there is a (local) time bound for each operation to complete
  Theorem: Under any stochastic scheduler, any lock-free algorithm is wait-free with probability 1. [Alistarh, Censor-Hillel, Shavit, STOC14/JACM16]
  Proof intuition:
  • Given any time t, if some thread p is scheduled for B consecutive time steps, it has to complete some new operation
  • There is a non-zero probability that the scheduler decides to schedule thread p for B steps in a row
  • By the “Infinite Monkey Theorem,” this will eventually occur
  • Hence, with probability 1, every operation eventually succeeds
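The "Infinite Monkey" step can be made quantitative with a toy experiment: under a uniform scheduler over n threads, a fixed thread is scheduled B times in a row after roughly n + n² + … + n^B steps in expectation (for the pattern-waiting time of a length-B run with per-step probability 1/n). The simulation below is our own sketch, not from the talk:

```c
#include <stdint.h>

/* Deterministic xorshift64 PRNG for reproducibility. */
static uint64_t rng_state = 88172645463325252ULL;
static uint64_t xorshift64(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

/* Steps until thread 0 is scheduled B consecutive times
 * under a uniform scheduler over n threads. */
long steps_until_run(int n, int B) {
    long steps = 0;
    int run = 0;
    while (run < B) {
        int thread = (int)(xorshift64() % (uint64_t)n);
        run = (thread == 0) ? run + 1 : 0;
        steps++;
    }
    return steps;
}

/* Average waiting time over many trials. */
double avg_steps_until_run(int n, int B, int trials) {
    double total = 0.0;
    for (int t = 0; t < trials; t++)
        total += (double)steps_until_run(n, B);
    return total / trials;
}
```

For n = 3 and B = 3 the expected waiting time is 3 + 9 + 27 = 39 steps: finite, so the run occurs with probability 1, but already exponential in B, which previews why the theorem is of limited practical comfort.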

  15. Comments
  Theorem: Under any stochastic scheduler, any bounded lock-free algorithm is wait-free with probability 1.
  Progress hierarchy:
      Minimal progress            Maximal progress
      Deadlock-free               Starvation-free
      Lock-free (non-blocking)    Wait-free
  • Practically, not that insightful: the probability that an operation succeeds could be as low as (1/n)^n
  • Does not necessarily hold if the scheduler is not stochastic; for instance, on NUMA systems the scheduler can be non-stochastic

  16. The Story So Far
  • The goal
  • Lock-free algorithms in practice
  • The stochastic scheduler model
  • Lock-free ≈ wait-free (in theory)
  • Next: performance upper bounds
    • A general class of lock-free algorithms
    • Uniform stochastic scheduler
  Disclaimer: We do not claim that the scheduler is generally uniform. We only use this as a lower bound for its long-run behavior.

  17. Single-CAS Universal
  • Can implement any object lock-free (Herlihy’s Universal Construction)
  • Blueprint for many efficient implementations (Treiber stack, counters)
  [Diagram: threads repeatedly issue CAS(R, old, new) on a shared location until one succeeds.]
  Step complexity: what is the average number of steps a process takes until completing a method call?
  System latency: what is the average number of steps the system takes until completing a method call? System latency = throughput^-1.

  18. Special Case: The Counter
  Example: lock-free counter
      Memory location R;
      unsigned fetch-and-inc( ) {
          unsigned val = 0;
          do {
              val = Read( R );
          } while (!Bool_CAS( &R, val, val + 1 ));
          return val;
      }
  Example schedule: 1, 2, 2, 1 (each thread does Read(R), then attempts CAS(R, old, old + 1); interleaved attempts mean some CASes fail).
  Assuming a uniform stochastic scheduler and n threads, what is the average step complexity?

  19. Part 2: Step Complexity Analysis
  [Diagram: schedule n, 2, 1, …; each thread performs Read(R) followed by CAS(R, old, old + 1) until success.]
  In each step, we pick an element from 1 to n uniformly at random. How many steps (in expectation) before some element is chosen twice?

  20. The Birthday Problem
  • n = 365 days in a year, k people in a room
  • What is the probability that two of them share a birthday?
  • Pr[no birthday collision] = (1 − 1/n)(1 − 2/n) ··· (1 − (k − 1)/n)
  • Approximation: e^x ≈ 1 + x (for x close to 0)
  • So Pr[no birthday collision] ≈ e^(−k(k−1)/(2n))
  • This is constant for k ≈ √n
  Moral of the story:
  1. Two people in this room probably share a birthday
  2. After ~√n steps are scheduled, some thread wins
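The ~√n claim is easy to sanity-check empirically: draw uniformly from {1, …, n} and count draws until the first repeat. For n = 365, theory predicts an average of about √(πn/2) + 2/3 ≈ 24.6. The experiment below is our own sketch, not from the talk:

```c
#include <stdint.h>
#include <string.h>

#define BD_N 365   /* "days in a year" */

/* Deterministic xorshift64 PRNG for reproducibility. */
static uint64_t bd_state = 1442695040888963407ULL;
static uint64_t bd_rand(void) {
    bd_state ^= bd_state << 13;
    bd_state ^= bd_state >> 7;
    bd_state ^= bd_state << 17;
    return bd_state;
}

/* One trial: uniform draws from {0, ..., BD_N-1} until a repeat,
 * counting the draw that causes the collision. */
static int draws_until_collision(void) {
    char seen[BD_N];
    memset(seen, 0, sizeof seen);
    int draws = 0;
    for (;;) {
        int d = (int)(bd_rand() % BD_N);
        draws++;
        if (seen[d]) return draws;
        seen[d] = 1;
    }
}

/* Average first-collision time over many trials; ~24.6 for n = 365. */
double avg_collision_time(int trials) {
    double total = 0.0;
    for (int t = 0; t < trials; t++)
        total += draws_until_collision();
    return total / trials;
}
```

In the scheduler setting, each draw is a scheduled thread, and the first collision is the first thread to get its second step, i.e. the thread whose CAS can succeed.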

  21. The Execution: A Sequential View
  Schedule: 2, 1, 4, ..., 2, 4, 1, 3
  [Timeline: interleaved steps P2: Read, P1: Read, P4: Read, P1: CAS, P2: CAS, P4: CAS, P3: Read; once the first CAS succeeds, the pending CAS attempts of the other readers are useless.]
  Moral of the story:
  1. After ~√n steps are scheduled, some thread wins
  2. That thread’s CAS will cause ~√n other threads to fail
  Average latency of the system is O(√n) (this is tight).
  By symmetry, average step complexity for a counter operation is O(√n).
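The read-then-CAS dynamics above can also be simulated end to end: pick a uniformly random thread each step; it either reads the current counter version or attempts a CAS that succeeds only if no one else succeeded since its read. The observed steps per completed operation then grow like √n rather than n. This is our own experiment; the modeling and constants are assumptions, not from the talk:

```c
#include <stdint.h>
#include <stdlib.h>

/* Deterministic xorshift64 PRNG for reproducibility. */
static uint64_t sc_state = 6364136223846793005ULL;
static uint64_t sc_rand(void) {
    sc_state ^= sc_state << 13;
    sc_state ^= sc_state >> 7;
    sc_state ^= sc_state << 17;
    return sc_state;
}

/* Simulate n threads running the counter's Read/CAS loop under a
 * uniform scheduler for `steps` steps.  version[i] is the counter
 * version thread i last read, or -1 if it must (re)read.
 * Returns average system latency: steps per completed operation. */
double system_latency(int n, long steps) {
    long *version = malloc(sizeof(long) * (size_t)n);
    for (int i = 0; i < n; i++) version[i] = -1;
    long counter = 0, completed = 0;
    for (long s = 0; s < steps; s++) {
        int p = (int)(sc_rand() % (uint64_t)n);
        if (version[p] < 0) {
            version[p] = counter;          /* Read(R) */
        } else if (version[p] == counter) {
            counter++;                     /* CAS succeeds: still current */
            completed++;
            version[p] = -1;
        } else {
            version[p] = -1;               /* CAS fails: stale, must re-read */
        }
    }
    free(version);
    return (double)steps / (double)completed;
}
```

Comparing n = 100 with n = 10,000 shows latency growing by roughly a factor of 10, i.e. √100, consistent with the O(√n) bound; a linear bottleneck would grow by a factor of 100.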

  22. Warning: Not Formally Correct
  1. We have assumed a uniform initial configuration
  2. A process that fails a CAS has to pay an extra step before it can re-read
  3. We have only given upper bounds on the number of steps, but √n is indeed the tight bound here
  4. Latency ↔ step complexity is argued only by symmetry; formally, it follows by Markov chain lifting
