
slide-1
SLIDE 1

A Portable Lock-free Bounded Queue

Peter Pirkelbauer Reed Milewicz Juan Felipe Gonzalez

Computer and Information Sciences University of Alabama at Birmingham

Pirkelbauer et al. (UAB) ICA3PP December 14, 2016 1 / 30

slide-2
SLIDE 2

Outline

1. Circular Bounded Queue
2. Mutual Exclusion
3. Lock-free objects
4. Lockfree Circular Bounded Queue


slide-3
SLIDE 3

Circular Bounded Queue

Elements

int head = 0;
int tail = 0;
int buf[N];

bool enq(int elem);
std::pair<int, bool> deq();
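As a minimal single-threaded sketch of how these elements could fit together (an illustration, not the authors' implementation): `head` and `tail` are kept as ever-increasing counters and reduced modulo `N` only when indexing the buffer. The capacity `N = 4`, the use of `std::size_t` for the counters, the function bodies, and the `demo` helper are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

constexpr std::size_t N = 4;  // assumed capacity, chosen for illustration

std::size_t head = 0;         // position of the next element to dequeue
std::size_t tail = 0;         // position of the next free slot
int buf[N];

// Returns false when the queue is full.
bool enq(int elem) {
  if (tail == head + N) return false;
  buf[tail % N] = elem;
  ++tail;
  return true;
}

// Returns {value, true} on success and {-1, false} when the queue is empty.
std::pair<int, bool> deq() {
  if (head == tail) return {-1, false};
  int v = buf[head % N];
  ++head;
  return {v, true};
}

// Usage: fill the queue, observe that it is full, then drain it in FIFO order.
void demo() {
  for (int i = 0; i < static_cast<int>(N); ++i) {
    bool ok = enq(i);
    assert(ok);
  }
  assert(!enq(99));
  for (int i = 0; i < static_cast<int>(N); ++i) {
    std::pair<int, bool> r = deq();
    assert(r.second && r.first == i);
  }
  assert(!deq().second);
}
```

Keeping the counters unbounded makes the full test (`tail == head + N`) and the empty test (`head == tail`) distinguishable without sacrificing a buffer slot.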


slide-9
SLIDE 9

Synchronization Mechanisms


slide-10
SLIDE 10

Mutual Exclusion Locks (Mutex)

Definition
A concurrency control mechanism that allows at most one thread to be inside a critical section at a time.

Example

lock(mutex);
shared memory operations; // critical section
unlock(mutex);
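In C++11 the same pattern is commonly written with `std::mutex` and `std::lock_guard`. The following is a minimal sketch; the shared `counter` and the functions `add` and `run` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

std::mutex mtx;
long counter = 0;  // shared state protected by mtx

// Increments the shared counter `times` times, one critical section per increment.
void add(long times) {
  for (long i = 0; i < times; ++i) {
    std::lock_guard<std::mutex> guard(mtx);  // lock(mutex)
    ++counter;                               // critical section
  }                                          // unlock(mutex) when guard goes out of scope
}

// Starts nthreads threads that each call add(times) and returns the final counter.
long run(int nthreads, long times) {
  assert(nthreads > 0);
  counter = 0;
  std::vector<std::thread> threads;
  for (int t = 0; t < nthreads; ++t) threads.emplace_back(add, times);
  for (auto& t : threads) t.join();
  return counter;
}
```

Because every increment happens inside the critical section, no update is lost regardless of how the threads interleave.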


slide-11
SLIDE 11

Circular Bounded Queue - Single Mutex

bool enqueue(val)

bool succ = false;
lock(mutex);
// proceed only if the data structure is not full
if (tail != N + head) {
  // insert the element
  buf[tail % N] = val;
  // update the tail
  ++tail;
  succ = true;
}
unlock(mutex);
return succ;

pair<int, bool> dequeue()

pair<int, bool> res(-1, false);
lock(mutex);
// proceed only if the data structure is not empty
if (head != tail) {
  // read the element
  res.first = buf[head % N];
  res.second = true;
  // update the head
  ++head;
}
unlock(mutex);
return res;


slide-12
SLIDE 12

Problems with Mutual Exclusion Locks

Priority Inversion
Deadlock
Livelock
Diminished Parallelism
Termination safety

Sojourner Rover (’97)

Source: astr.ua.edu


slide-13
SLIDE 13

Lock-free objects


slide-14
SLIDE 14

Lock-free objects

Definition
An object is lock-free if it guarantees that at least one of the contending threads makes progress in a finite number of steps.


slide-15
SLIDE 15

Lockfree Primitives

Key insight
Utilize atomic operations to manipulate the data

Read-Modify-Write Operations
on x86: compare-and-swap (CAS)
on ARM, PowerPC, Alpha: load-linked / store-conditional (LL/SC)


slide-16
SLIDE 16

Semantics of Compare-and-swap

C++ Interface

bool atomic<T>::compare_exchange_strong(T& oldval, T newval)

Definition

// executes atomically
bool atomic<T>::compare_exchange_strong(T& oldval, T newval) {
  if (oldval == *this) {
    *this = newval;
    return true;
  }
  oldval = *this;
  return false;
}
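The definition above can be exercised directly with `std::atomic<int>`. The sketch below first shows a successful and a failed exchange, then the usual retry loop built on top of it. The function names `cas_demo` and `atomic_double` are hypothetical and exist only for this example.

```cpp
#include <atomic>
#include <cassert>

// Walks through one successful and one failed compare-and-swap.
void cas_demo() {
  std::atomic<int> a{5};

  int expected = 5;
  bool ok = a.compare_exchange_strong(expected, 7);  // 5 == 5: stores 7
  assert(ok && a.load() == 7);

  expected = 5;                                      // stale value
  ok = a.compare_exchange_strong(expected, 9);       // 5 != 7: nothing stored
  // On failure, expected is overwritten with the current value.
  assert(!ok && expected == 7 && a.load() == 7);
}

// Typical retry loop: atomically doubles the value and returns the result.
int atomic_double(std::atomic<int>& a) {
  int old = a.load();
  while (!a.compare_exchange_strong(old, old * 2)) {
    // old now holds the value another thread wrote; retry with it
  }
  return old * 2;
}
```

The retry loop is lock-free in the sense defined earlier: a thread's exchange only fails because another thread's update succeeded in the meantime.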


slide-17
SLIDE 17

Circular Bounded Queue - Hybrid

enqueue(val)

bool succ = false;
lock(mutex);
// proceed only if the data structure is not full
if (tail != N + head) {
  // insert the element
  buf[tail % N] = val;
  // update the tail
  ++tail;
  succ = true;
}
unlock(mutex);
return succ;

pair<int, bool> dequeue()

pair<int, bool> res(-1, false);
size_t oldhead = head;
while (oldhead != tail) {
  // store the element away
  res.first = buf[oldhead % N];
  // test if successful; on failure, CAS reloads oldhead
  if (head.CAS(oldhead, oldhead+1)) {
    res.second = true;
    break;
  }
}
return res;


slide-19
SLIDE 19

Lockfree Circular Bounded Queue


slide-20
SLIDE 20

Circular Bounded Queue - Unique empty values

Problem
enqueue needs to update tail and buf[tail].

Our Solution
Unique empty values decouple updates
Distinguish special values (1 bit)
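One way to picture the 1-bit distinction is to reserve the lowest bit of every slot as an "empty" tag and to let the remaining bits of an empty slot carry the queue position the slot is waiting for. The encoding below is a hypothetical illustration and not necessarily the layout the authors use; the bodies of `emptyVal`, `isEmpty`, `encode`, and `decode` are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using slot_t = std::uint64_t;

// Empty marker for a queue position: tag bit set, position in the upper bits.
inline slot_t emptyVal(std::size_t pos) {
  return (static_cast<slot_t>(pos) << 1) | 1u;
}

// True if the slot content is an empty marker rather than a real value.
inline bool isEmpty(slot_t s) { return (s & 1u) != 0; }

// Real values are stored with the tag bit clear.
inline slot_t encode(std::uint32_t v) { return static_cast<slot_t>(v) << 1; }

// Recovers the stored value; only valid for non-empty slots.
inline std::uint32_t decode(slot_t s) {
  assert(!isEmpty(s));
  return static_cast<std::uint32_t>(s >> 1);
}
```

Because each position has its own empty marker, a single CAS on the slot tells an enqueuer whether the slot is still free for that exact position, so the slot and `tail` no longer have to change in one step.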


slide-22
SLIDE 22

Circular Bounded Queue - Nonblocking Enqueue

enqueue(T)

size_t pos = tail;
while (pos < head) {
  atomic<T>& e = buf[idx(pos)].val;
  ++pos;
  value_type empty = emptyVal(pos);
  bool succ = e.CAS(empty, val);
  if (succ) {
    update_counter(tail, pos);
    return true;
  }
}
return false;
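The enqueue loop can be turned into a compilable sketch under a few stated assumptions. The capacity test is written here as `pos < head + N`, matching the earlier full test `tail != N + head`; the helper bodies (`idx`, `emptyVal`, `update_counter`, `init`), the capacity `N = 8`, and the 1-bit encoding are hypothetical; all atomics use the default sequentially consistent ordering rather than relaxed orderings; and the dequeue side is omitted entirely.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t N = 8;  // assumed capacity
using value_type = std::uint64_t;

struct Slot { std::atomic<value_type> val; };

std::atomic<std::size_t> head{0};
std::atomic<std::size_t> tail{0};
Slot buf[N];

inline std::size_t idx(std::size_t pos) { return pos % N; }

// Hypothetical encoding: low bit set marks "empty", upper bits hold the position.
inline value_type emptyVal(std::size_t pos) {
  return (static_cast<value_type>(pos) << 1) | 1u;
}

// Resets the counters and gives each slot the empty value the loop expects.
void init() {
  head.store(0);
  tail.store(0);
  for (std::size_t p = 0; p < N; ++p) buf[idx(p)].val.store(emptyVal(p + 1));
}

// Advances ctr to pos unless another thread has already moved it further.
void update_counter(std::atomic<std::size_t>& ctr, std::size_t pos) {
  std::size_t cur = ctr.load();
  while (cur < pos && !ctr.compare_exchange_weak(cur, pos)) {
  }
}

// val must have its low bit clear so it cannot be mistaken for an empty marker.
bool enqueue(value_type val) {
  assert((val & 1u) == 0);
  std::size_t pos = tail.load();
  while (pos < head.load() + N) {  // assumed capacity test
    std::atomic<value_type>& e = buf[idx(pos)].val;
    ++pos;
    value_type empty = emptyVal(pos);
    if (e.compare_exchange_strong(empty, val)) {
      update_counter(tail, pos);   // tail may lag; it never moves backwards
      return true;
    }
    // CAS failed: another enqueuer took this slot, so try the next position
  }
  return false;
}
```

The slot CAS alone decides which enqueuer owns a position; `tail` is only a hint that is advanced afterwards, which is what decouples the two updates.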


slide-24
SLIDE 24

Circular Bounded Queue - Nonblocking Dequeue

Problem
With unique empty values, dequeue needs to update two locations (head and buf[head]).

Our Solution
Use a descriptor to describe the work
Other threads help interrupted threads


slide-26
SLIDE 26

Circular Bounded Queue - Valid Dequeue


slide-31
SLIDE 31

Circular Bounded Queue - Invalid Dequeue


slide-36
SLIDE 36

Circular Bounded Queue - Helping

Problem with Helping
The original thread and the helping thread have the same codepath (bottleneck).

Our Solution
Delay helping and try to dequeue from a later position.


slide-38
SLIDE 38

Circular Bounded Queue - Helping


slide-42
SLIDE 42

Implementation Overview

Implemented for the C++ relaxed memory model
Model checked with CDSChecker
  All two-thread cases with two operations each
  Some three-thread cases with two operations each
  Some four-thread cases with one operation each


slide-43
SLIDE 43

Evaluation

Three architecture families
  Snapdragon 410 (ARM)
  IBM Power8
  Intel x86

40M operations
  Buffer size is 1024 elements
  Buffer is half-full at the beginning
  Each thread alternates enq and deq
  Each thread executes 40M / |Threads| operations
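The workload described above could be driven by a harness along the following lines. This is a sketch, not the authors' benchmark code: `MutexQueue` is a hypothetical single-mutex stand-in for the queue under test, and `run_benchmark` and its parameters are assumed names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Stand-in queue guarded by one mutex; any queue with enq/deq could be substituted.
class MutexQueue {
 public:
  explicit MutexQueue(std::size_t cap) : buf_(cap) {}
  bool enq(int v) {
    std::lock_guard<std::mutex> g(m_);
    if (tail_ == head_ + buf_.size()) return false;  // full
    buf_[tail_ % buf_.size()] = v;
    ++tail_;
    return true;
  }
  std::pair<int, bool> deq() {
    std::lock_guard<std::mutex> g(m_);
    if (head_ == tail_) return {-1, false};          // empty
    int v = buf_[head_ % buf_.size()];
    ++head_;
    return {v, true};
  }
  std::size_t size() {
    std::lock_guard<std::mutex> g(m_);
    return tail_ - head_;
  }
 private:
  std::mutex m_;
  std::vector<int> buf_;
  std::size_t head_ = 0, tail_ = 0;
};

// Splits total_ops evenly across nthreads; each thread alternates enq and deq.
// Returns the elapsed wall-clock time in seconds.
double run_benchmark(MutexQueue& q, int nthreads, long total_ops) {
  assert(nthreads > 0);
  const long per_thread = total_ops / nthreads;
  const auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (int t = 0; t < nthreads; ++t) {
    threads.emplace_back([&q, per_thread] {
      for (long i = 0; i < per_thread; ++i) {
        if (i % 2 == 0) q.enq(static_cast<int>(i));
        else q.deq();
      }
    });
  }
  for (auto& t : threads) t.join();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

Starting half-full and alternating operations keeps the queue away from the full and empty boundaries, so the measurement reflects contention on the queue rather than failed operations.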


slide-44
SLIDE 44

Power Architecture

IBM Power8; 20 cores; 8 threads per core; 3.4 GHz; xlc -O2


slide-45
SLIDE 45

x86 Architecture

Intel E5-2660 cores; 1 thread per core (HT disabled); 2.6 GHz; icc -O2


slide-46
SLIDE 46

ARM Architecture

Snapdragon 410; 4 cores; 1 thread per core; 1.2 GHz; clang++


slide-47
SLIDE 47

Conclusion and Future Work

Portable bounded queue for C++11
Lock-free programming offers better progress guarantees
Performance varies with architecture

Future Work
  Explore back-off schemes


slide-48
SLIDE 48

Thank you!


slide-49
SLIDE 49

References

Mike Jones: What really happened on Mars?
Maurice Herlihy and Jeannette Wing: Linearizability: A Correctness Condition for Concurrent Objects, 1992.
Maurice Herlihy and Nir Shavit: The Art of Multiprocessor Programming, Morgan-Kaufmann, 2008.
Paul McKenney: Memory Barriers: a Hardware View for Software Hackers.
The C++ Programming Language, Draft Standard, 2011.
Keir Fraser and Tim Harris: Concurrent programming without locks. 2007.
Philippas Tsigas and Yi Zhang: A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems. SPAA’01.
Dmitry Vyukov: Bounded MPMC queue. http://www.1024cores.net, retrieved on May 21, 2016.
Steven Feldman and Damian Dechev: A wait-free multi-producer multi-consumer ring buffer. SAC 2015.
Brian Norris and Brian Demsky: CDSchecker: Checking concurrent data structures written with C/C++ atomics. OOPSLA ’13.


slide-50
SLIDE 50

Consistency Models (Compiler+Hardware)

int X = 0, Y = 0;

Thread 1

store(X, 1); r1 = load(Y);

Thread 2

store(Y, 1); r2 = load(X);

Is (r1 == 0 && r2 == 0) a possible outcome?
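The question can be explored with C++11 atomics. With sequentially consistent operations the outcome `r1 == 0 && r2 == 0` is forbidden. With relaxed (or release/acquire) orderings the language allows it, and it can show up on hardware with store buffers, including x86. The sketch below makes the memory order a parameter; `both_zero_once` and `count_both_zero` are hypothetical helper names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> X{0}, Y{0};

// Runs the two-thread test once with the given memory orders.
// Returns true if r1 == 0 && r2 == 0 was observed.
bool both_zero_once(std::memory_order store_order, std::memory_order load_order) {
  X.store(0);
  Y.store(0);
  int r1 = -1, r2 = -1;
  std::thread t1([&] { X.store(1, store_order); r1 = Y.load(load_order); });
  std::thread t2([&] { Y.store(1, store_order); r2 = X.load(load_order); });
  t1.join();
  t2.join();
  return r1 == 0 && r2 == 0;
}

// Counts how often the outcome appears over a number of runs.
int count_both_zero(int runs, std::memory_order store_order,
                    std::memory_order load_order) {
  assert(runs > 0);
  int hits = 0;
  for (int i = 0; i < runs; ++i) {
    if (both_zero_once(store_order, load_order)) ++hits;
  }
  return hits;
}
```

Calling `count_both_zero(n, std::memory_order_seq_cst, std::memory_order_seq_cst)` always yields zero. With `std::memory_order_relaxed` the count may be non-zero, depending on the compiler, the hardware, and timing, so a run that reports zero does not prove the outcome is impossible.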


slide-51
SLIDE 51

Hardware Memory Model

Architectures, Weak to Strong
Alpha
ARM, PowerPC
x86, SPARC (TSO)
Dual 386 CPU

Source: http://www.linuxjournal.com
