Relaxed Data Structures
Dan Alistarh IST Austria & ETH Zurich
...but first, we're hiring! IST Austria is a young institute dedicated to basic research and graduate education, located near Vienna, Austria, and fully English-speaking.
Clock rate and #cores over the past 45 years.
Why concurrency? To get speedup on newer hardware. Scaling: more threads should imply more useful work.
Is this problem inherent for some data structures?
[Figure: Throughput of a Concurrent Packet Processing Queue — throughput (events/second) vs. number of threads, 10–70 threads.]
(< $1000 / machine vs. > $10000 / machine)
Theorem: Given n threads, any deterministic, strongly ordered data structure has executions in which a processor takes time linear in n to return.
[Ellen, Hendler, Shavit, SICOMP 2013] [Alistarh, Aspnes, Gilbert, Guerraoui, JACM 2014]
How can we circumvent this?
Theory ↔ Software ↔ Hardware
New Notions of Progress / Correctness! Theorem: Given n threads, any deterministic, strongly ordered data structure has an execution in which a processor takes time linear in n to return.
[Alistarh, Aspnes, Gilbert, Guerraoui, JACM 2014]
New Data Structure Designs!
Example: Lock-free counter

Memory location R;
unsigned fetch_and_inc() {
    unsigned val;
    do { val = Read(R); }
    while (!Bool_CAS(&R, val, val + 1));
    return val;
}
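As a runnable companion to the pseudocode above — a Python sketch, not the talk's code: Python has no hardware CAS, so a small lock-protected `compare_and_swap` (our name) stands in for `Bool_CAS`.

```python
# Emulating the CAS-based fetch-and-inc with Python threads.
import threading

class CasCell:
    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # emulates the atomicity of hardware CAS

    def read(self):
        return self._value

    def compare_and_swap(self, old, new):
        # Atomically: if the cell still holds `old`, install `new`.
        with self._guard:
            if self._value == old:
                self._value = new
                return True
            return False

R = CasCell(0)

def fetch_and_inc():
    # Retry loop: read, then try to install value + 1; repeat on failure.
    while True:
        val = R.read()
        if R.compare_and_swap(val, val + 1):
            return val

threads = [threading.Thread(target=lambda: [fetch_and_inc() for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# 4 threads x 1000 increments each: no update is lost despite retries.
```

The retry loop makes the counter lock-free: some thread always makes progress, even though any individual CAS may fail.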
[Diagram, Thread 0 and Thread 1: an operation's execution splits into a preamble followed by a scan-and-validate phase reading val and ending in CAS(R, old, new) — success.]
Theory: threads could starve in optimistic lock-free implementations. Practice: this doesn’t happen. Threads don’t starve.
Use more complex wait-free algorithms.
Memory location R;
int fetch_and_increment() {
    int val, new_val;
    do {
        val = Read(R);
        new_val = val + 1;
    } while (!Compare&Swap(&R, val, new_val));
    return val;
}
Example: Lock-free counter. [Animation: counter value R is 1; two threads both read val = 1; one CAS succeeds and sets R to 2, the other must retry.]
[Histogram: Lock-Free Stack, 16 threads — percentage of operations vs. number of iterations before an operation succeeds.]
[Histogram: try distribution — number of operations vs. number of tries — for Counter, Queue, and SkipList inserts; 16 threads, 50% mutations.]
Proof intuition: compare against a worst-case scheduler, which assigns probability 1 to the thread picked by its strategy and 0 to all others; a stochastic scheduler cannot concentrate on one thread forever.
Theorem: Under any stochastic scheduler, any lock-free algorithm is wait-free with probability 1. [Alistarh, Censor-Hillel, Shavit, STOC14/JACM16]
Theorem: Under any stochastic scheduler, any bounded lock-free algorithm is wait-free, with probability 1.
Minimal progress → maximal progress: Deadlock-Free → Starvation-Free (blocking); Lock-Free (non-blocking) → Wait-Free.
Disclaimer: We do not claim that the scheduler is uniform generally. We only use this as a lower bound for its long-run behavior.
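The claim can be illustrated with a small simulation — a sketch of our own, assuming a uniform stochastic scheduler: at every step a uniformly random thread takes the next step of its CAS retry loop, and no thread starves.

```python
# Simulating n threads running a CAS-based counter under a uniform
# stochastic scheduler; every thread completes all of its operations.
import random

def simulate(n_threads, ops_per_thread, seed=1):
    rng = random.Random(seed)
    R = 0                              # the shared counter
    phase = ['read'] * n_threads       # next step: 'read' or 'cas'
    snap = [0] * n_threads             # value each thread last read
    done = [0] * n_threads             # completed operations per thread
    steps = [0] * n_threads            # steps taken per thread
    while min(done) < ops_per_thread:
        i = rng.randrange(n_threads)   # uniform scheduler picks a thread
        if done[i] == ops_per_thread:
            continue                   # finished threads take no more steps
        steps[i] += 1
        if phase[i] == 'read':
            snap[i] = R
            phase[i] = 'cas'
        else:
            # CAS attempt: succeeds iff R is unchanged since the read.
            if R == snap[i]:
                R += 1
                done[i] += 1
            phase[i] = 'read'
    return R, done, steps

R, done, steps = simulate(n_threads=8, ops_per_thread=100)
# All 8 threads finish their 100 operations: lock-free behaves wait-free here.
```

An adversarial scheduler could starve a thread forever by always running it just before another thread's CAS; the random scheduler cannot sustain that pattern.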
[Diagram: an operation = a preamble of q steps, then a scan-and-validate phase of s steps ending in CAS(R, old, new) — success.]
Step complexity vs. system latency (= throughput⁻¹).
Example: Lock-free counter

Memory location R;
unsigned fetch_and_inc() {
    unsigned val;
    do { val = Read(R); }
    while (!Bool_CAS(&R, val, val + 1));
    return val;
}
[Diagram: n threads, each repeatedly executing READ(R) followed by CAS(R, old, old + 1); in each round of attempts, exactly one CAS succeeds.]
[Timeline: an interleaving of Read and CAS steps by threads P1–P4; most steps are useless retries, since only one CAS per round makes progress.]
Moral of the story: the average latency of the system is O(√n), and by symmetry, the average step complexity of a counter operation is also O(√n).
[Diagram: each thread repeatedly executes READ(R) then CAS(R, old, old + 1) until success.]
Theorem: Under a uniform stochastic scheduler, the step complexity is O(#Preamble + √n · #ScanValidate), and the system latency is O(#Preamble + √n · #ScanValidate) — the same bound for a single thread and for the whole system (but in different time references).
[Diagram: an operation = a preamble of q steps, then a scan-and-validate phase of s steps ending in CAS(R, old, new) — success.]
Answers/clarifications: d.alistarh@gmail.com. Full analysis: “Lock-Free Algorithms under Stochastic Schedulers,” PODC 2015.
[Figure: Michael-Scott Queue throughput — throughput vs. number of threads, 10–70 threads.]
[Figure: Michael-Scott Queue throughput vs. number of threads, 10–70 threads; the curve flattens at a “saturated” throughput.]
Where is this difference coming from?
{ if (Read( Head ) == Top_Node ) then Write( Head , Next_Node ) else Start from step 1 again! }
[Diagram: Michael-Scott queue nodes Node1–Node4 with val/ptr fields and Head and Tail pointers; the critical interval for the dequeue lies between reading Head and the CAS on Head.]
{ if (Read( Head ) == Top_Node ) then Write( Head , Next_Node ) else Start from step 1 again! }
[Diagram: successive CASes move Head from Node1 toward Node2 while Tail points past Node3 and Node4; the critical interval lies between the read of Head and the CAS.]
Directory-based cache coherence (Intel, AMD): [Diagram: Core 0 and Core 1 each execute Read(R) then CAS(R, old, new); attempts fail as the cache line bounces between cores.] We waste time because ownership of R circulates without useful work!
[Timeline: a sequence of ownership transfers of R alternating with CAS attempts; many transfers occur per successful attempt.]
Directory-based cache coherence (Intel, AMD), with leases: [Diagram: Core 0 holds R for a lease interval T, completing Read(R) and a successful CAS(R, old, new) before Resp(R) passes the line to Core 1, whose request is delayed.] Each transfer of R now results in at least one useful operation (a successful CAS).
Directory-based cache coherence (Intel, AMD), with leases: [Diagram: Core 0 holds R for a lease interval T, but its CAS(R, old, new) fails; Core 1 is delayed anyway.] In this case, we have simply delayed the whole system by T, without additional progress.
[Figure: Michael-Scott Queue throughput, NO_LEASE vs. SINGLE_LEASE, 10–70 threads; leasing gives up to a 4.5× improvement.]
1. Top_Node = Lease&Read( Head )
2. Next_Node = Read( Top_Node.ptr )
3. ATOMIC { if (Read( Head ) == Top_Node) then Write&Release( Head, Next_Node ) else Release and goto 1 }
[Figure: energy for the Michael-Scott queue (nJ/operation) vs. #threads, 10–70 threads, NO_LEASE vs. SINGLE_LEASE.]
[Figures: lock-free stack throughput and priority queue throughput (#ops/second vs. #threads, 10–70 threads), NO_LEASE vs. WITH_LEASE.]
Blocking counter

void blocking_inc(int* R) {
    acquire( _lock );
    int val = Read( R );
    Write( R, val + 1 );
    release( _lock );
}

void acquire(int* _lock) {
    while ( !CAS(_lock, UNLOCKED, LOCKED) ) ;
}
void release(int* _lock) {
    *_lock = UNLOCKED;
}
Can we avoid the wasted coherence messages?
[Diagram, directory-based cache (Intel, AMD): Core 0 and Core 1 exchange Req(R, EX) / Resp(R) coherence messages while one core runs Acquire(L), Write(L), Release(L); the other core's retrying CAS(L) attempts steal the line and delay the lock holder.]
Fix: simply lease the lock (hold the cache line for a lease interval T) on acquire!
[Figure: lock-based counter throughput vs. #threads, 10–70 threads, for TTAS_NO_LEASE, TTAS_WITH_LEASE, CLH, and HTICKET locks.]
[Figure: parallel PageRank completion time (ns, lower is better) on 2–32 threads, NO_LEASE vs. WITH_LEASE; leasing is up to 9.5× faster.]
[Figure: TL2 throughput vs. #threads, 10–70 threads, for NO_LEASE, SINGLE_LEASE, and DOUBLE_LEASE.]
Can we scale beyond bottlenecks? Let’s Relax!
Example: Basic lock-free counter

Memory location R;
unsigned fetch_and_inc() {
    unsigned val;
    do { val = Read(R); }
    while (!Bool_CAS(&R, val, val + 1));
    return val;
}

If only the increment matters (no return value), drop the return:

Memory location R;
void increment() {
    unsigned val;
    do { val = Read(R); }
    while (!Bool_CAS(&R, val, val + 1));
}
Example: Basic relaxed counter

Memory location R;
Local value V[i]; // one per thread, initially 0
void increment() {
    V[i] = V[i] + 1;
    if (V[i] % 2 == 1) return; // buffer every other increment locally
    unsigned val;
    do { val = Read(R); }
    while (!Bool_CAS(&R, val, val + 2)); // publish two increments at once
}
[Diagram: a priority queue of tasks with priorities 1, 3, 4, 5, 7, 8, 11, 15, 18.]
Methods: Search(key), Insert/Delete(k, v), DeleteMin()
We are looking for a fast concurrent priority queue — extremely useful, both in theory and practice:
[Sanders97], [Lotan&Shavit00], [Sundell&Tsigas07], [Basin et al. 11], [Linden&Jonsson13], [Lenhart et al. 14], [Wimmer et al.14], [Alistarh et al. 14], [Rihani et al. 15]
Known solutions do not perform well: DeleteMin is highly contended. Every thread wants the top element!
[Figure: throughput (M operations/s) vs. number of threads (4–28) of state-of-the-art concurrent PQs — New, HaS, SaT (from [Linden & Jonsson 2015]).]
Classical heap-based implementation: [Diagram: a binary min-heap over elements 1–9.] All operations must access the root! Cache invalidation, failed synchronization — in sum: no scalability!
[Diagram: a relaxed priority queue of tasks with priorities 1, 2, 3, 4, 5, 8; DeleteMin may return a task near the minimum (e.g. 2 or 4) rather than the exact minimum.]
Methods: as before, but DeleteMin is approximate.
Still useful, both in theory and practice: we're now looking for a fast relaxed concurrent PQ.
The fact that we are running in parallel already implies that we’re accepting out-of-order execution of tasks!
[Diagram: tasks with priorities 1, 3, 4, 5, 7, 8, 11, 15, 18 being executed in parallel.]
The application already has to deal with some relaxation!
[Diagram: a skiplist with head H, tail T, and keys 1, 3, 4, 5, 9, …. Search(5) descends from the top level, narrowing the interval [H, 9] → [H, 9] → [1, 9] → [5, 9], then stops.]
P processors ⇒ O(P) relaxation.
procedure Spray(): start near the head at height ≈ log P; at each level, jump forward a random number of steps, then descend. [Diagram: two example spray walks for starting height 4 — jump, stay, jump, jump.] Spray and pray?
✓ The maximum value returned by a Spray has rank Õ(P).
✓ For all x, Pr(x hit) = Õ(1/P).
✓ If x > y is returned by some Spray, then Pr(y hit) = Ω̃(1/P).
(Pr(x hit) = the probability that a spray returns the value at index x; P = number of processors.)
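A simplified, runnable sketch of the spray walk — the jump-length parameters here are illustrative choices of ours, not the SprayList's exact tuning:

```python
# Spray sketch: starting at height ~log2(p), at each level jump forward a
# uniformly random number of positions (up to 2^level here), then descend.
# The returned rank is therefore bounded by sum(2^level) = 2p - 2.
import math, random

def spray(num_elements, p, rng):
    height = max(1, int(math.log2(p)))
    pos = 0
    for level in range(height, 0, -1):
        pos += rng.randrange(0, 2 ** level + 1)  # illustrative jump bound
    return min(pos, num_elements - 1)

rng = random.Random(0)
p = 64
hits = [spray(10_000, p, rng) for _ in range(5000)]
# Every returned rank stays within the first O(p) positions of a
# 10,000-element list, so contention on the exact minimum disappears.
```

The point of the random starting offsets and jumps is that concurrent DeleteMin calls land on *different* small-rank elements instead of all fighting over rank 1.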
✓ The maximum value returned by a Spray has rank Õ(P). Proof sketch: the spray starts at height h ≈ log P, and the jump lengths at levels h, h−1, …, 1 sum to at most 1 + 2 + 4 + … + P = O(P); up to polylogarithmic factors, the walk cannot travel past rank Õ(P).
If a Spray would return a dummy element, it instead restarts (otherwise: no guarantees, an incorrect execution).
In many practical settings (discrete-event simulation, shortest paths), priority inversions are not expensive.
[T.Henzinger et al. 11, Rihani et al.14, Nguyen et al. 14]
[Figure: throughput (MOps/s) vs. threads (7–56) for MultiQ c=2, MultiQ HT c=2, MultiQ c=4, SprayList, Linden, and Lotan.]
Looks good, but does it actually guarantee anything? Can we improve it?
[Diagram: Insert and Remove operations.] Relaxes correctness: not a strict PQ. Optimistic about progress (probabilistic termination).
What is the average rank removed over a sequence of steps? [Diagram: queues Q1–Q4.] WLOG, elements are consecutive integers. Cost = rank of the removed element among the remaining elements — intuitively, the average distance from optimal. Example: Cost(2) = 2, Cost(4) = 3, Cost(1) = 1.
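The rank-cost metric is easy to make concrete — a small Python sketch of ours reproducing the slide's example:

```python
# Rank cost of a removal sequence: the cost of removing x is the rank of x
# among the elements still present, so an exact DeleteMin always costs 1.
def rank_costs(initial_elements, removal_order):
    remaining = sorted(initial_elements)
    costs = []
    for x in removal_order:
        costs.append(remaining.index(x) + 1)  # 1-based rank among remaining
        remaining.remove(x)
    return costs

# The slide's example: removing 2, then 4, then 1 from {1, 2, 3, 4}.
costs = rank_costs([1, 2, 3, 4], [2, 4, 1])
# costs == [2, 3, 1]: 2 was second-smallest, 4 was then third-smallest,
# and 1 was the minimum when finally removed.
```

Averaging these costs over a long removal sequence gives the quality measure the analysis bounds.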
Notes:
[Alistarh, Kopinsky, Li, Nadiradze, PODC 2017]
This would work if inserts were round-robin:
The reduction does not hold in general. Intuitively, height and top priority are not well correlated.
[Diagram: queue heads at ranks R+1, R+2, R+3.] Hard case: over time, we'll eventually get arbitrary distributions. We have to prove that the algorithm gets out of those reasonably fast.
[Diagram: when a queue's head is removed, its label advances; in expectation, the increment is n.]
Problem: the behavior at a step is highly correlated with what happened in previous steps.
[Diagram: n = 4 queues with exponentially distributed labels — 1.8, 5.9, 10.2, 13.2; 4.7, 7.3, 12.5, 16.8; 2.2, 3.2, 8.3, 15.2; 5.1, 9.5, 11.7, 14.2.] Idea: the exponential distribution is memoryless; label increments have expected value n.
[Diagram: the same four queues; each removal advances the head's label by an exponential with expected increment n = 4.]
[Diagram: bins 1–5.] The probability that the ith label (or rank) is in bin j is the same in both processes. Easy to see initially — why does it hold later?
[Diagram: remaining queue labels 1.8, 5.9, 10.2, 13.2; 7.3, 12.5; 8.3, 9.5 — mean = 6.725.]
Theorem: For any t > 0, 𝔼[ Σ_{i=1}^{n} exp(Δ_i(t)/n) + Σ_{i=1}^{n} exp(−Δ_i(t)/n) ] = O(n), where Δ_i(t) measures how far queue i's label is from the mean.
Idea: this potential function behaves as a “super-martingale”: as soon as it grows above O(n), it starts decreasing. Generalizes [Peres, Talwar, Wieder, R.S.&A. 2015].
[Diagram: the number of queues with label ≥ mean + kn decreases exponentially in k, and symmetrically for labels ≤ mean − kn; on average, a chosen queue is close to the mean.]
What if we do two choices only β% of the time (one random choice otherwise)?
What if the input distribution is biased?
Still works (within reason). Works really well in practice.
We can use this for relaxed concurrent queues, priority queues, counters.
Example: Relaxed Queue

Vector of Queues Q[m];
void enqueue(element e) {
    tsp = GetTimestamp();
    i = random(0, m - 1);
    Q[i].enqueue(<e, tsp>);
}
element dequeue() {
    i = random(0, m - 1);
    j = random(0, m - 1);
    // pick the better element out of the two choices
    if (Q[j].peek().tsp < Q[i].peek().tsp)
        i = j;
    return Q[i].dequeue();
}
Theory says that the average rank removed is O( m ). We can trade off contention (n / m) versus rank guarantees (m).
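A runnable sketch of the two-choice relaxed queue — a sequential Python simulation of ours (a logical clock stands in for `GetTimestamp()`, and no real threads are involved):

```python
# MultiQueue-style relaxed FIFO: m sequential queues; enqueue into a random
# one; dequeue from whichever of two random non-empty queues has the older
# head timestamp.
import random
from collections import deque

class RelaxedQueue:
    def __init__(self, m, seed=0):
        self.queues = [deque() for _ in range(m)]
        self.rng = random.Random(seed)
        self.clock = 0                 # stands in for GetTimestamp()

    def enqueue(self, e):
        self.clock += 1
        i = self.rng.randrange(len(self.queues))
        self.queues[i].append((self.clock, e))

    def dequeue(self):
        nonempty = [q for q in self.queues if q]
        if not nonempty:
            return None
        a = self.rng.choice(nonempty)
        b = self.rng.choice(nonempty)
        better = a if a[0][0] <= b[0][0] else b  # older head timestamp wins
        return better.popleft()[1]

q = RelaxedQueue(m=8)
for x in range(100):
    q.enqueue(x)
out = [q.dequeue() for _ in range(100)]
# Every element comes out exactly once; the order is only *approximately*
# FIFO, which is exactly the relaxation being traded for scalability.
```

With m larger than the thread count, two threads rarely touch the same sub-queue, so contention drops by roughly n/m while the average rank error stays O(m).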
Example: Relaxed Counter

Vector of Counters C[m];
int read() {
    i = random(0, m - 1);
    return C[i] * m; // scale one sample by the number of counters
}
void increment() {
    i = random(0, m - 1);
    j = random(0, m - 1);
    // pick the lower counter out of the two choices
    if (C[j] < C[i]) i = j;
    C[i].increment();
}
Theory says that the average distance from the true value is O( m ). We can again trade off contention (n / m) versus rank guarantees (m).
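The same pattern as a runnable Python sketch (ours, sequential; each cell is a plain integer rather than a concurrent counter):

```python
# Two-choice relaxed counter: m cells; each increment bumps the smaller of
# two randomly chosen cells, keeping the cells tightly balanced; a read
# samples one cell and scales by m, so it is only approximate.
import random

class RelaxedCounter:
    def __init__(self, m, seed=0):
        self.cells = [0] * m
        self.rng = random.Random(seed)

    def increment(self):
        i = self.rng.randrange(len(self.cells))
        j = self.rng.randrange(len(self.cells))
        if self.cells[j] < self.cells[i]:
            i = j                      # two choices: bump the lower cell
        self.cells[i] += 1

    def read(self):
        i = self.rng.randrange(len(self.cells))
        return self.cells[i] * len(self.cells)

c = RelaxedCounter(m=16)
for _ in range(16_000):
    c.increment()
total = sum(c.cells)
# The total is conserved exactly; the two-choice rule keeps individual
# cells very close to total/m, so read() stays close to the truth.
```

Increments spread over m cells cut contention by roughly n/m; the two-choice rule is what keeps a single-cell read a good estimate.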
Task: minimize f(x) = Σ_{i=1}^{m} loss(x, ex_i), where loss is a notion of “quality” (e.g., squared distance) over examples ex_1, …, ex_m — e.g., classification. Solved via an optimization procedure.
[Diagram: Node1 holds dataset partition 1 and optimizes f1(x) = Σ_{i=1}^{m/2} loss(x, ex_i); Node2 holds dataset partition 2 and optimizes f2(x) = Σ_{i=m/2+1}^{m} loss(x, ex_i); the nodes synchronize.]
Let g̃(x_t) = the gradient at a randomly chosen data point, so that 𝔼[g̃(x_t)] = ∇f(x_t), with the variance bound 𝔼‖g̃(x) − ∇f(x)‖² ≤ σ².
SGD iterates x_{t+1} = x_t − η_t g̃(x_t); full gradient descent would iterate x_{t+1} = x_t − η_t ∇f(x_t).
Theorem [standard]: Given f convex and R² = ‖x_0 − x*‖², if we run SGD for T = O(R²σ²/ε²) iterations, the expected error is at most ε.
Example: Sequential SGD

Vector x[d], initially random
void SGD-Converge(float eps) {
    do {
        <e, label> = randomly chosen data point
        gradient = ComputeGradient(x, e, label)
        for i from 1 to d:
            x[i] = x[i] - η * gradient[i]
        error = ComputeLoss(x, training_data)
    } while (error > eps)
}
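A runnable toy version of the sequential SGD loop — the dataset, the step size `eta`, and the least-squares loss are illustrative choices of ours:

```python
# Sequential SGD on a synthetic, noiseless least-squares problem.
import random

random.seed(0)
d = 3
true_x = [1.0, -2.0, 0.5]
data = []
for _ in range(500):
    e = [random.uniform(-1, 1) for _ in range(d)]
    label = sum(a * b for a, b in zip(e, true_x))   # noiseless label
    data.append((e, label))

def loss(x):
    return sum((sum(a * b for a, b in zip(e, x)) - y) ** 2
               for e, y in data) / len(data)

x = [random.uniform(-1, 1) for _ in range(d)]
initial_error = loss(x)
eta = 0.1
for _ in range(5000):
    e, y = random.choice(data)                 # randomly chosen data point
    pred = sum(a * b for a, b in zip(e, x))
    grad = [2 * (pred - y) * a for a in e]     # gradient of the squared loss
    x = [xi - eta * gi for xi, gi in zip(x, grad)]
error = loss(x)
# The error drops far below its initial value: x converges toward true_x.
```

Because the labels are noiseless and realizable, even a constant step size drives the loss essentially to zero here; the σ² variance term in the theorem only bites when the optimum does not fit every data point exactly.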
Example: Naïve Concurrent SGD (one global lock)

Shared: Vector x[d], initially random
Lock L // for the model
void SGD-Converge(float eps) {
    do {
        <e, label> = randomly chosen data point
        L.lock()
        gradient = ComputeGradient(x, e, label)
        for i from 1 to d:
            x[i] = x[i] - η * gradient[i]
        error = ComputeLoss(x, training_data)
        L.unlock()
    } while (error > eps)
}
Example: Naïve Concurrent SGD (per-component locks)

Shared: Vector x[d], initially random
Lock array L[d] // one per model component
void SGD-Converge(float eps) {
    do {
        <e, label> = randomly chosen data point
        gradient = ComputeGradient(x, e, label)
        for i from 1 to d:
            L[i].lock()
            x[i] = x[i] - η * gradient[i]
            L[i].unlock()
        error = ComputeLoss(x, training_data)
    } while (error > eps)
}
Example: Hogwild SGD — exactly the sequential SGD code, run by all threads with no locks

Vector x[d], initially random
void SGD-Converge(float eps) {
    do {
        <e, label> = randomly chosen data point
        gradient = ComputeGradient(x, e, label)
        for i from 1 to d:
            x[i] = x[i] - η * gradient[i]
        error = ComputeLoss(x, training_data)
    } while (error > eps)
}
The algorithm is OK even without any locks! This is very non-trivial to prove, and we won't do it here: initially by [Niu et al., NIPS 2011]; analysis improved by [Duchi et al., NIPS 2015] and [De Sa et al., NIPS 2015]. The convergence rate is quadratic in the maximum delay.
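The Hogwild structure can be sketched with Python threads — noting the caveat that CPython's GIL serializes bytecodes, so this illustrates the lock-free *structure* (racy component-wise updates on a shared model), not true hardware-level races; the data and step size are our illustrative choices:

```python
# Hogwild-style SGD sketch: 4 threads update the shared model x with no
# lock around it; stale reads and interleaved writes are tolerated.
import random, threading

random.seed(1)
d, n_points = 4, 400
true_x = [0.5, -1.0, 2.0, 0.0]
points = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n_points)]
data = [(e, sum(a * b for a, b in zip(e, true_x))) for e in points]

x = [0.0] * d          # shared model, updated without any lock

def worker(steps, seed):
    rng = random.Random(seed)
    for _ in range(steps):
        e, y = rng.choice(data)
        pred = sum(a * b for a, b in zip(e, x))   # possibly stale read
        g = [2 * (pred - y) * a for a in e]
        for i in range(d):
            x[i] -= 0.05 * g[i]                   # racy per-component write

threads = [threading.Thread(target=worker, args=(3000, s)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()

final_loss = sum((sum(a * b for a, b in zip(e, x)) - y) ** 2
                 for e, y in data) / n_points
# Despite the races, the shared model still converges close to true_x.
```

The gradients here are dense, so every update touches every component; the cited analyses show the sparser the updates, the closer lock-free SGD tracks the sequential algorithm.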
[Figure, four panels: speedup vs. cores (2–10) at sparsity levels (a) pnz = .005, (b) pnz = .01, (c) pnz = .2, (d) pnz = 1, comparing linear speedup, without locking, and with locking.]
High-contention producer-consumer: average order deviation of various queue algorithms (40 threads; lower is better):
FC 1.8 · WF 9.9 · MS 66.2 · LB 25.0 · LCRQ 15.6 · TS-atomic 20.4 · TS-CAS 17.6 · TS-hardware 24.7 · TS-interval 16.7 · TS-stutter 19.2 · CTS 8.8 · RTS 20.8 · 1RR DQ 13.8 · 2RR DQ 22.9 · 1RA DQ 2924.0 · k-FIFO 47.0
Some (strongly ordered) data structures are hard to scale.
How do we specify and prove them correct? What new data structures are out there? How do they interact with existing applications?
Thanks to the conferences, and the PODC/DISC community.