Fast, less-complicated, lock-free Data Structures
Ulrich Drepper, ulrich.drepper@gs.com

Accelerate Code
● Not (much) through new hardware
● Split into independent pieces
● Splitting comes at a cost
  ● Marshaling between stages
  ● Increased latency for pipeline
● Realistically: Parallelization needed!

Parallelization
● Alternatives
  ● Multi-process
  or
  ● Multi-thread
● Error prone
● High level of parallelization needed
● Keep cost of parallelization ($O_P$) low

Extended “Amdahl's Law”:

$$S = \frac{1}{(1 - P) + \frac{P}{N}\,(1 + O_P)}$$

[Plot: S for 1 to 16 on the x-axis, P = 0.6; y-axis 0 to 2.5]
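To get a feel for the numbers, the formula can be evaluated directly. The sketch below reads the slide's formula as S = 1 / ((1 − P) + (P/N)(1 + O_P)) and assumes the usual Amdahl meanings: P is the parallelizable fraction, N the number of parallel units, and O_P the relative cost of parallelization. The function name `extended_amdahl` is made up for this example.

```c
#include <assert.h>

/* Speedup under the extended Amdahl's Law, read from the slide as
 *   S = 1 / ((1 - P) + (P / N) * (1 + O_P))
 * p        parallelizable fraction, 0..1 (assumed meaning of P)
 * n        number of parallel units, >= 1 (assumed meaning of N)
 * overhead relative cost of parallelization, >= 0 (O_P) */
double extended_amdahl(double p, int n, double overhead)
{
    assert(p >= 0.0 && p <= 1.0);
    assert(n >= 1);
    assert(overhead >= 0.0);
    return 1.0 / ((1.0 - p) + (p / n) * (1.0 + overhead));
}
```

With P = 0.6 and N = 16 this gives about 2.29 when O_P = 0 and about 2.19 when O_P = 0.5, so even with 16 units the serial 40% keeps the speedup below 2.5.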

Parallelization
● Collaboration through shared memory
● Synchronized access
● Synchronized access to data structures
● Atomic data structures (mostly based on Compare-And-Swap)

    bool __sync_bool_compare_and_swap(TYPE *ptr, TYPE oldval, TYPE newval)
    {
      /* Semantics only: the whole body executes as one atomic step. */
      if (*ptr != oldval)
        return false;
      *ptr = newval;
      return true;
    }
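CAS is typically used in a retry loop: read the current value, compute the new one, and try to install it, starting over if another thread changed the location in between. A minimal sketch using the GCC builtin named on the slide; the helper `atomic_fetch_max` is a hypothetical name chosen for this example, not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Atomically raise *ptr to at least `candidate`.
 * Returns the value observed before any update. */
long atomic_fetch_max(long *ptr, long candidate)
{
    assert(ptr != NULL);
    long seen;
    do {
        seen = __atomic_load_n(ptr, __ATOMIC_RELAXED);
        if (seen >= candidate)
            return seen;            /* nothing to do, no write needed */
        /* Install candidate only if *ptr still equals `seen`;
         * otherwise another thread won the race and we retry. */
    } while (!__sync_bool_compare_and_swap(ptr, seen, candidate));
    return seen;
}
```

No thread ever blocks here: a failed CAS means some other thread made progress.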

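Lock-Free Data Structures
[Table: columns LIFO, FIFO, Single Linked, Double Linked, Hash; rows grouped into No Priority and Priority, each with the cases 1:1, 1:N, N:1, M:N. Entries per row, in the order they appear:]
● No Priority, 1:1: CAS, CAS
● No Priority, 1:N: CAS
● No Priority, N:1: CAS, CAS
● No Priority, M:N: CAS
● Priority, 1:1: CAS, CAS
● Priority, 1:N: (no entry)
● Priority, N:1: CAS, CAS
● Priority, M:N: (no entry)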
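As a rough illustration of what a CAS-only LIFO can look like, here is a minimal sketch of a lock-free stack with a push and a "detach everything" pop. The types `struct node` and `struct lifo` and both function names are assumptions made for this example; the slides do not show an implementation.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
    int value;
};

struct lifo {
    struct node *head;   /* NULL when empty */
};

/* Push: link the node in front of the current head, then publish it with CAS. */
void lifo_push(struct lifo *s, struct node *n)
{
    assert(s != NULL && n != NULL);
    struct node *old;
    do {
        old = __atomic_load_n(&s->head, __ATOMIC_RELAXED);
        n->next = old;
    } while (!__sync_bool_compare_and_swap(&s->head, old, n));
}

/* Detach the whole list at once; the caller then owns every node. */
struct node *lifo_pop_all(struct lifo *s)
{
    assert(s != NULL);
    struct node *old;
    do {
        old = __atomic_load_n(&s->head, __ATOMIC_RELAXED);
    } while (!__sync_bool_compare_and_swap(&s->head, old, (struct node *)NULL));
    return old;
}
```

Detaching the whole list instead of popping a single node is a deliberate choice: a single-node pop built on plain CAS is exposed to the ABA problem when nodes are reused, while push and pop-all are not.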

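x86 Special
[Table: same layout as the previous slide (columns LIFO, FIFO, Single Linked, Double Linked, Hash; rows No Priority and Priority, each with 1:1, 1:N, N:1, M:N). DWCAS = Double-wide CAS. Entries per row, in the order they appear:]
● No Priority, 1:1: CAS, CAS
● No Priority, 1:N: CAS, DWCAS
● No Priority, N:1: CAS, CAS
● No Priority, M:N: CAS, DWCAS
● Priority, 1:1: CAS, CAS
● Priority, 1:N: (no entry)
● Priority, N:1: CAS, CAS
● Priority, M:N: (no entry)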

Extended CAS
● Wider, more complicated CAS not the answer

“DCAS is not a Silver Bullet for Nonblocking Algorithm Design”, Doherty, Detlefs, Groves, Flood, Luchangco, Martin, Moir, Shavit, Steele, SPAA '04, 2004

Locking
● Bane of Programming
● Interface design: explicit or implicit locking?
● Often unnecessary overhead
● Composability problem
● AB-BA locking problem

    void move(dbllist<T> &target, dbllist<T>::it &prev,
              dbllist<T> &source, dbllist<T>::it &elem);

How to implement internal locking?
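The AB-BA problem shows up as soon as `move` locks both lists internally: one thread moving from A to B and another moving from B to A can each hold one lock and wait forever for the other. One common workaround, sketched here in C with a hypothetical `locked_list` type standing in for `dbllist<T>`, is to always take the two locks in a fixed order (here: by address). This is an illustration of the problem, not the approach the talk goes on to advocate.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for dbllist<T>: a lock plus an element count. */
struct locked_list {
    pthread_mutex_t lock;
    int length;
};

/* Take both locks in address order so that two threads can never
 * hold them in opposite (AB vs. BA) order. */
static void lock_both(struct locked_list *a, struct locked_list *b)
{
    assert(a != NULL && b != NULL);
    if (a == b) {
        pthread_mutex_lock(&a->lock);
        return;
    }
    if ((uintptr_t)a > (uintptr_t)b) {
        struct locked_list *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(&a->lock);
    pthread_mutex_lock(&b->lock);
}

static void unlock_both(struct locked_list *a, struct locked_list *b)
{
    pthread_mutex_unlock(&a->lock);
    if (a != b)
        pthread_mutex_unlock(&b->lock);
}

/* Move one element from source to target (counts only, for brevity). */
void move_one(struct locked_list *target, struct locked_list *source)
{
    lock_both(target, source);
    if (target != source && source->length > 0) {
        source->length--;
        target->length++;
    }
    unlock_both(target, source);
}
```

The ordering trick works for two lists known up front, but it does not compose: a caller that already holds some other lock can still deadlock, which is the composability problem listed on the slide.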

Locking and Latency
● Yes, there are spinlocks
● Fairer/more power efficient locking requires sleep
● Sleep requires wakeup
[Diagram labels: Lock, Detect Collision, Enter Kernel, Delay, Wakeup Signal, Wake, Exit Kernel, Resume Lock Operation; the span is marked Latency]

Way Forward
Two complementary approaches
● Improve implementation of locking to
  ● Reduce contention
  ● Reduce cost of the operation
● Replace concept of locking

Way Forward
Two complementary approaches
● Improve implementation of locking to
  ● Reduce contention
  ● Reduce cost of the operation
  → Hardware Lock Elision (HLE)
● Replace concept of locking
  → Transactional Memory (TM)

Increase Parallelism
● Reduce lock contention
● Avoid “optimizations” like reader-writer locks
● Enable more code to be parallelized
[Plot: curves for P = 0.6 and P = 0.8, x-axis 1 to 16, y-axis 0 to 4]

Running Example

Locking Hash Tables
● Designed for concurrent accesses
● In practice mostly read accesses
● Even write accesses likely will not conflict
● Locking is overkill
[Diagram: Thread 1 and Thread 2 accessing Separate Memory Locations]

Hash Table With Locking

Mutually Exclusive Access
[Flowchart]
● CAS(mutex, 0, 1): Mutex == 0? Set 1?
  ● No: Delay, then try again
  ● Yes: continue
● Read Table Entry
● Update Table Entry
● Store 0 in Mutex
● Wake

Mutually Exclusive Access
[Same flowchart, annotated with the memory each step touches]
● Mutex == 0? Set 1? and Store 0 in Mutex: Mutex Memory
● Read Table Entry and Update Table Entry: Hash Table Memory

Mutually Exclusive Access
[Same flowchart]
● Net Effect On Mutex: Nothing (it goes from 0 to 1 and back to 0)
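The flowchart can be sketched in a few lines of C. This is only an illustration under simplifying assumptions: the "Delay" step is a busy-wait instead of a sleep, the "Wake" step is left as a comment, and the names `table_mutex`, `locked_table`, and `table_add` are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 8

static int table_mutex;                /* 0 = free, 1 = held */
static int locked_table[TABLE_SIZE];

static void table_lock(void)
{
    /* "Mutex == 0? Set 1?" as one atomic CAS; on failure, delay and retry. */
    while (!__sync_bool_compare_and_swap(&table_mutex, 0, 1))
        ;   /* a real implementation would sleep here */
}

static void table_unlock(void)
{
    __sync_lock_release(&table_mutex); /* "Store 0 in Mutex" */
    /* "Wake": a real implementation would wake a sleeping waiter here */
}

/* Read and update one table entry under the lock; returns the new value. */
int table_add(size_t slot, int delta)
{
    assert(slot < TABLE_SIZE);
    table_lock();
    int updated = locked_table[slot] + delta;  /* Read Table Entry */
    locked_table[slot] = updated;              /* Update Table Entry */
    table_unlock();
    return updated;
}
```

After every call the mutex is back at 0, which is the "net effect: nothing" observation: the two writes to the mutex exist only to keep other threads out.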

Hardware Lock Elision

With Lock Elision
● What if '1' is not written?
[Flowchart as before: Mutex == 0? Set 1? (No: Delay; Yes: Read Table Entry), Update Table Entry, Store 0 in Mutex, Wake]
● Thread 1 and Thread 2 both see Mutex == 0 and both reach the table entry
● No Mutual Exclusion!

No Mutual Exclusion
● Bad
● But only if
  ● Concurrent access to same memory location
  ● At least one of the accesses is write

Alternative
● Detect Collisions!

Intel HLE

x86 code for Hash Table

Thread 1:

    lock cmpxchg %ebx, mut
    jne 2f
    mov table+2, %edx
    mov $0, mut
    call wake

Thread 2:

    lock cmpxchg %ebx, mut
    jne 2f
    mov $4, table+5
    mov $0, mut
    call wake

[Diagram: L1 Data Cache; Main Memory with Hash Table (42) and Mutex (0)]

New in Intel HLE
● New Instruction Prefixes (compatible)

Thread 1:

    xacquire lock cmpxchg %ebx, mut
    jne 2f
    mov table+2, %edx
    xrelease mov $0, mut
    call wake

Thread 2:

    xacquire lock cmpxchg %ebx, mut
    jne 2f
    mov $4, table+5
    xrelease mov $0, mut
    call wake

[Diagram: Transaction Flag and Lock Cache added; Hash Table (42), Mutex (0)]
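From C, GCC exposes these prefixes through two x86-specific memory-order flags, `__ATOMIC_HLE_ACQUIRE` and `__ATOMIC_HLE_RELEASE`, which are OR-ed into the memory order of an `__atomic` builtin. The sketch below is an assumption-laden illustration: it uses an atomic exchange rather than the slide's `cmpxchg`, spins instead of sleeping, and falls back to a plain lock where the flags are not defined. The names `hle_mutex`, `hle_table`, `hle_table_store`, and `hle_table_load` are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Use GCC's x86 HLE flags when available; otherwise behave as a plain lock. */
#if defined(__ATOMIC_HLE_ACQUIRE) && defined(__ATOMIC_HLE_RELEASE)
#define ELIDE_ACQUIRE __ATOMIC_HLE_ACQUIRE
#define ELIDE_RELEASE __ATOMIC_HLE_RELEASE
#else
#define ELIDE_ACQUIRE 0
#define ELIDE_RELEASE 0
#endif

#define HLE_TABLE_SIZE 8

static int hle_mutex;                  /* 0 = free, 1 = held */
static int hle_table[HLE_TABLE_SIZE];

static void cpu_relax(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __builtin_ia32_pause();            /* spin-wait hint */
#endif
}

static void hle_lock(void)
{
    /* With the HLE flag, the write of 1 is emitted with an xacquire prefix. */
    while (__atomic_exchange_n(&hle_mutex, 1, __ATOMIC_ACQUIRE | ELIDE_ACQUIRE))
        cpu_relax();
}

static void hle_unlock(void)
{
    /* With the HLE flag, the store of 0 is emitted with an xrelease prefix. */
    __atomic_store_n(&hle_mutex, 0, __ATOMIC_RELEASE | ELIDE_RELEASE);
}

void hle_table_store(size_t slot, int value)
{
    assert(slot < HLE_TABLE_SIZE);
    hle_lock();
    hle_table[slot] = value;
    hle_unlock();
}

int hle_table_load(size_t slot)
{
    assert(slot < HLE_TABLE_SIZE);
    hle_lock();
    int value = hle_table[slot];
    hle_unlock();
    return value;
}
```

The source code is the same as for an ordinary lock, which matches the slide's "(compatible)" remark: on a processor that does not elide, the prefixed instructions simply take and release the lock.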

Successful Concurrent Use

No Collision

Thread 1:

    xacquire lock cmpxchg %ebx, mut
    jne 2f
    mov table+2, %edx
    xrelease mov $0, mut
    call wake

Thread 2:

    xacquire lock cmpxchg %ebx, mut
    jne 2f
    mov $4, table+5
    xrelease mov $0, mut
    call wake

Animation steps (T marks a transactional entry):
● Start: Hash Table holds 42, Mutex holds 0
● Thread 1, xacquire lock cmpxchg: Mutex entry T 1, Old: 0
● Thread 1, mov table+2, %edx: entry T 42
● Thread 2, xacquire lock cmpxchg: Mutex entry T 1, Old: 0
● Thread 2, mov $4, table+5: entry T 4
● Thread 1, xrelease mov $0, mut: ✓ (the store restores the old value 0)
● Thread 1's entries lose the T flag (42; Mutex entry 1 becomes 0); Thread 2, xrelease mov $0, mut: ✓
● Thread 2's entries lose the T flag (4; Mutex entry 1 becomes 0); the Hash Table now also holds 4

Unsuccessful Concurrent Use

With Collision

Thread 1:

    xacquire lock cmpxchg %ebx, mut
    jne 2f
    mov table+2, %edx
    xrelease mov $0, mut
    call wake

Thread 2:

    xacquire lock cmpxchg %ebx, mut
    jne 2f
    mov $4, table+2
    xrelease mov $0, mut
    call wake

Animation steps (T marks a transactional entry):
● Start: Hash Table holds 42, Mutex holds 0
● Thread 1, xacquire lock cmpxchg: Mutex entry T 1, Old: 0
● Thread 1, mov table+2, %edx: entry T 42
● Thread 2, xacquire lock cmpxchg: Mutex entry T 1, Old: 0
● Thread 2, mov $4, table+2: entry T 4, the same location Thread 1 has read
● Collision: the slides flag Thread 1's xacquire and its T 42 entry, then its T 1 Mutex entry
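The two scenarios differ only in the address Thread 2 writes: table+5 in the first, table+2 in the second. The rule from the "No Mutual Exclusion" slide (same memory location, at least one write) can be written as a small check over the accesses each thread has made. This is a toy model that compares exact offsets as the slides do; the `struct access` type and the function `sets_collide` are invented for the example and say nothing about how the hardware tracks accesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One memory access made inside an elided critical section. */
struct access {
    size_t offset;    /* offset into the table, as in table+2 */
    bool is_write;
};

/* Two access sets collide if any pair touches the same location
 * and at least one access of that pair is a write. */
bool sets_collide(const struct access *a, size_t na,
                  const struct access *b, size_t nb)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            if (a[i].offset == b[j].offset &&
                (a[i].is_write || b[j].is_write))
                return true;
    return false;
}
```

Thread 1's read of table+2 against Thread 2's write of table+5 yields no collision, while the same read against a write of table+2 does; two reads of the same location never collide.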
