Fast, less-complicated, lock-free Data Structures, Ulrich Drepper

SLIDE 1

Fast, less-complicated, lock-free Data Structures Ulrich Drepper

ulrich.drepper@gs.com

SLIDE 2

Accelerate Code

  • Not (much) through new hardware
  • Split into independent pieces
  • Splitting comes at a cost
  • Marshaling between stages
  • Increased latency for pipeline
  • Realistically:

Parallelization needed!

SLIDE 3

Parallelization

  • Alternatives
  • Multi-process
  • Multi-thread
  • Error prone
  • High level of parallelization needed
  • Keep cost of parallelization (O_P) low

Extended “Amdahl's Law”:

$S = \frac{1}{(1-P) + \frac{P}{N}\,(1 + O_P)}$

[Plot: speedup for P = 0.6; x-axis 1 to 16, y-axis 0.5 to 2.5]
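
To make the curve concrete, the extended formula can be evaluated directly. The following is a minimal sketch, assuming O_P is a relative overhead applied to the parallel part of the work; the function name `speedup` is illustrative and does not come from the talk.

```cpp
#include <cassert>

// Extended Amdahl's law as written on the slide:
//   S = 1 / ((1 - P) + (P / N) * (1 + O_P))
// P   : fraction of the work that can run in parallel (0..1)
// N   : number of execution units
// O_P : relative overhead added by parallelization (assumed >= 0)
double speedup(double P, int N, double O_P) {
    assert(N >= 1 && P >= 0.0 && P <= 1.0 && O_P >= 0.0);
    return 1.0 / ((1.0 - P) + (P / N) * (1.0 + O_P));
}
```

With P = 0.6, N = 16 and no overhead this gives 1 / 0.4375, roughly 2.29; no value of N can push the result past 1 / (1 − P) = 2.5, and any overhead O_P > 0 lowers it further.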

SLIDE 4

Parallelization

  • Collaboration through shared memory
  • Synchronized access
  • Synchronized access to data structures
  • Atomic data structures

(mostly based on Compare-And-Swap)

bool __sync_bool_compare_and_swap(TYPE *ptr, TYPE oldval, TYPE newval)
{
  /* Semantics only: the hardware performs these steps as one atomic operation. */
  if (*ptr != oldval)
    return false;
  *ptr = newval;
  return true;
}
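
As an illustration of a CAS-based atomic data structure, here is a minimal sketch of a LIFO whose `push` is a CAS retry loop. It uses `std::atomic` instead of the `__sync` builtin, and the names `Node`, `LifoStack`, `drain` and `lifo_example` are hypothetical. Removal is done by detaching the whole list with one atomic exchange, which sidesteps the ABA issues a single-element pop would have to deal with.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical node and stack types, for illustration only.
struct Node {
    int value;
    Node* next;
};

class LifoStack {
public:
    // Lock-free push: link the new node to the current head, then CAS it in.
    void push(int value) {
        Node* n = new Node{value, head_.load(std::memory_order_relaxed)};
        // On failure compare_exchange_weak stores the current head into
        // n->next, so the loop simply retries with the fresh value.
        while (!head_.compare_exchange_weak(n->next, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }

    // Detach all nodes at once and return their values, newest first.
    std::vector<int> drain() {
        Node* n = head_.exchange(nullptr, std::memory_order_acquire);
        std::vector<int> out;
        while (n != nullptr) {
            out.push_back(n->value);
            Node* next = n->next;
            delete n;
            n = next;
        }
        return out;
    }

    ~LifoStack() { drain(); }

private:
    std::atomic<Node*> head_{nullptr};
};

// Usage example: values come back in last-in, first-out order.
inline void lifo_example() {
    LifoStack s;
    s.push(1);
    s.push(2);
    std::vector<int> v = s.drain();
    assert(v.size() == 2 && v[0] == 2 && v[1] == 1);
}
```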

SLIDE 5

Lock-Free Data Structures

                    LIFO   FIFO    Hash   Single Linked   Double Linked
No Priority   1:1   CAS    CAS     -      -               -
              1:N   CAS    -       -      -               -
              N:1   CAS    CAS     -      -               -
              M:N   CAS    -       -      -               -
Priority      1:1   CAS    CAS     -      -               -
              1:N   -      -       -      -               -
              N:1   CAS    CAS     -      -               -
              M:N   -      -       -      -               -

(- = no entry on the slide)

SLIDE 6

x86 Special

Double-wide CAS

                    LIFO   FIFO    Hash   Single Linked   Double Linked
No Priority   1:1   CAS    CAS     -      -               -
              1:N   CAS    DWCAS   -      -               -
              N:1   CAS    CAS     -      -               -
              M:N   CAS    DWCAS   -      -               -
Priority      1:1   CAS    CAS     -      -               -
              1:N   -      -       -      -               -
              N:1   CAS    CAS     -      -               -
              M:N   -      -       -      -               -

(- = no entry on the slide)

SLIDE 7

Extended CAS

  • Wider, more complicated CAS not the answer

“DCAS is not a Silver Bullet for Nonblocking Algorithm Design”, Doherty, Detlefs, Groves, Flood, Luchangco, Martin, Moir, Shavit, Steele, SPAA '04, 2004

SLIDE 8

Locking

  • Bane of Programming
  • Interface design: explicit or implicit locking?
  • Often unnecessary overhead
  • Composability problem
  • AB-BA locking problem

void move(dbllist<T> &target, dbllist<T>::it &prev, dbllist<T> &source, dbllist<T>::it &elem);

How to implement internal locking?
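
One common way to sidestep the AB-BA problem when an operation needs the locks of two containers is to acquire both through a deadlock-avoiding primitive. The sketch below is only an assumption about how such a `move` could look: `LockedList` and `move_front` are hypothetical, the list is a plain `std::list` with an internal mutex, and positions are passed as an index rather than the iterators of the slide's `dbllist<T>`.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <mutex>

// Hypothetical list type with an internal lock, for illustration only.
template <typename T>
struct LockedList {
    std::mutex m;
    std::list<T> items;
};

// Moves the element at position `index` of `source` to the front of `target`.
// std::lock acquires both mutexes with a deadlock-avoidance algorithm, so
// move_front(a, b, ...) and move_front(b, a, ...) cannot block each other
// in the AB-BA pattern.
template <typename T>
bool move_front(LockedList<T>& target, LockedList<T>& source, std::size_t index) {
    assert(&target != &source);  // locking the same mutex twice is not allowed
    std::lock(target.m, source.m);
    std::lock_guard<std::mutex> g1(target.m, std::adopt_lock);
    std::lock_guard<std::mutex> g2(source.m, std::adopt_lock);
    if (index >= source.items.size())
        return false;
    auto it = std::next(source.items.begin(), static_cast<std::ptrdiff_t>(index));
    target.items.splice(target.items.begin(), source.items, it);
    return true;
}
```

This keeps the locking internal to the operation, but it only works because `move_front` knows about both lists; it does not compose with other operations that take one lock at a time.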

SLIDE 9

Locking and Latency

  • Yes, there are spinlocks
  • Fairer/more power efficient locking requires sleep
  • Sleep requires wakeup

[Timeline diagram; labels: Detect Lock Collision, Delay, Wake, Resume Lock Operation, Enter Kernel, Exit Kernel, Latency, Wakeup Signal]

SLIDE 10

Way Forward

Two complementary approaches

  • Improve implementation of locking to
  • Reduce contention
  • Reduce cost of the operation
  • Replace concept of locking

SLIDE 11

Way Forward

Two complementary approaches

  • Improve implementation of locking to
  • Reduce contention
  • Reduce cost of the operation
  • Replace concept of locking

Hardware Lock Elision (HLE)
Transactional Memory (TM)

SLIDE 12

Increase Parallelism

  • Reduce lock contention
  • Avoid “optimizations” like reader-writer locks
  • Enable more code to be parallelized

[Plot: curves for P = 0.6 and P = 0.8; x-axis 1 to 16, y-axis 0.5 to 4]

SLIDE 13

Running Example

SLIDE 14

Locking Hash Tables

  • Designed for concurrent accesses
  • In practice mostly read accesses
  • Even write accesses likely will not conflict
  • Locking is overkill

[Diagram: Thread 1 and Thread 2 access separate memory locations]

SLIDE 15

Hash Table With locking

SLIDE 16

Mutually Exclusive Access

[Flowchart: Mutex == 0? (No: Delay, then retry) → Set 1? using CAS(mutex, 0, 1) (No: Delay, then retry) → Read Table Entry → Update Table Entry → Store 0 in Mutex → Wake]
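
The flowchart can be sketched in code as follows. This is a simplified illustration, not the implementation behind the slides: the names `mutex_word`, `table`, `lock`, `unlock` and `update_entry` are made up, the "Delay" box is approximated by yielding the CPU, and the "Wake" box is left out because nothing sleeps in the kernel here.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names; the slide only shows the flowchart.
std::atomic<int> mutex_word{0};  // 0 = free, 1 = held
int table[8] = {0};

void lock() {
    int expected = 0;
    // CAS(mutex, 0, 1): succeeds only if the mutex is currently 0.
    while (!mutex_word.compare_exchange_weak(expected, 1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
        expected = 0;               // the failed CAS stored the observed value here
        std::this_thread::yield();  // "Delay" box; a real mutex would sleep
    }
}

void unlock() {
    assert(mutex_word.load(std::memory_order_relaxed) == 1);
    mutex_word.store(0, std::memory_order_release);  // "Store 0 in Mutex"
    // "Wake" box omitted: no thread is sleeping in this sketch.
}

// Read and update one table entry under the lock; returns the new value.
int update_entry(int index, int delta) {
    lock();
    int value = table[index] + delta;  // "Read Table Entry"
    table[index] = value;              // "Update Table Entry"
    unlock();
    return value;
}
```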

SLIDE 17

Mutually Exclusive Access

[Flowchart as on slide 16, annotated with two regions: Mutex Memory and Hash Tab Memory]

SLIDE 18

Mutually Exclusive Access

[Flowchart as on slide 16]

Net Effect On Mutex: Nothing

SLIDE 19

Hardware Lock Elision

SLIDE 20

With Lock Elision

[Flowchart as on slide 16]

What if '1' is not written?

SLIDE 21

With Lock Elision

[Flowchart as on slide 16, with Thread 1 and Thread 2 both shown]

SLIDE 22

With Lock Elision

[Flowchart as on slide 16, with Thread 1 and Thread 2 both shown]

No Mutual Exclusion!

SLIDE 23

No Mutual Exclusion

[Flowchart as on slide 16, with Thread 1 and Thread 2 both shown]

  • Bad
  • But only if
  • Concurrent access to same memory location
  • At least one of the accesses is write
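
The two conditions can be written down as a small predicate. This is only a model of the rule stated on the slide, with made-up names (`Access`, `conflicts`); the `granularity` parameter is an assumption added here to show how the notion of "same memory location" depends on the unit in which collisions are tracked.

```cpp
#include <cassert>
#include <cstdint>

// One memory access of a thread: where, and whether it writes.
struct Access {
    std::uintptr_t address;
    bool is_write;
};

// True when the two accesses collide under the slide's rule:
// same location (at the given granularity) and at least one write.
bool conflicts(Access a, Access b, std::uintptr_t granularity) {
    assert(granularity > 0);
    bool same_location = (a.address / granularity) == (b.address / granularity);
    return same_location && (a.is_write || b.is_write);
}
```

With a granularity of 1, a read at offset 2 and a write at offset 5 do not conflict, while a read and a write at offset 2 do, and two reads never conflict. Raising the granularity to an assumed cache-line size of 64 makes offsets 2 and 5 collide as well.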

SLIDE 24

Alternative

[Flowchart as on slide 16, with Thread 1 and Thread 2 both shown]

Detect Collisions!

SLIDE 25

Intel HLE

SLIDE 26

x86 code for Hash Table

Thread 1:

    lock cmpxchg %ebx, mut
    jne 2f
    mov table+2, %edx
    mov $0, mut
    call wake

Thread 2:

    lock cmpxchg %ebx, mut
    jne 2f
    mov $4, table+5
    mov $0, mut
    call wake

[Diagram labels: L1 Data Cache, Main Memory; main memory holds the Hash Table (value 42) and the Mutex]

SLIDE 27

New in Intel HLE

Thread 1:

    xacquire lock cmpxchg %ebx, mut
    jne 2f
    mov table+2, %edx
    xrelease mov $0, mut
    call wake

Thread 2:

    xacquire lock cmpxchg %ebx, mut
    jne 2f
    mov $4, table+5
    xrelease mov $0, mut
    call wake

[Diagram labels: Hash Table (value 42), Mutex, Lock Cache, Transaction Flag]

New Instruction Prefixes (compatible)
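
From C or C++ the prefixes do not have to be written by hand: GCC on x86 accepts the hint bits `__ATOMIC_HLE_ACQUIRE` and `__ATOMIC_HLE_RELEASE` in the memory-order argument of its `__atomic` builtins. The sketch below assumes a GCC-compatible compiler and falls back to plain atomics where those macros are not defined. It acquires the lock with an exchange rather than the `cmpxchg` shown on the slide, and the names `store_entry` and `hle_table` are illustrative.

```cpp
#include <cassert>

// GCC on x86 defines these HLE hint bits; elsewhere use plain atomics.
#ifdef __ATOMIC_HLE_ACQUIRE
#define LOCK_ACQUIRE (__ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE)
#define LOCK_RELEASE (__ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE)
#else
#define LOCK_ACQUIRE __ATOMIC_ACQUIRE
#define LOCK_RELEASE __ATOMIC_RELEASE
#endif

static int mut = 0;              // 0 = free, 1 = held, as on the slide
static int hle_table[8] = {0};   // stand-in for the hash table

void store_entry(int index, int value) {
    assert(index >= 0 && index < 8);
    // With the HLE hint this is intended to become an xacquire-prefixed
    // locked instruction; without it, an ordinary atomic exchange.
    while (__atomic_exchange_n(&mut, 1, LOCK_ACQUIRE)) {
        // Lock is held by someone else: spin until it is released.
    }
    hle_table[index] = value;
    // With the HLE hint this is intended to become an xrelease-prefixed store.
    __atomic_store_n(&mut, 0, LOCK_RELEASE);
}
```

Because the prefixes are ignored by processors without HLE, the same binary behaves as an ordinary spinlock there, which is the compatibility property the slide points out.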

SLIDE 28

Successful Concurrent Use

SLIDE 29

No Collision

Thread 1:

    xacquire lock cmpxchg %ebx, mut
    jne 2f
    mov table+2, %edx
    xrelease mov $0, mut
    call wake

Thread 2:

    xacquire lock cmpxchg %ebx, mut
    jne 2f
    mov $4, table+5
    xrelease mov $0, mut
    call wake

Main memory: Hash Table (42), Mutex

SLIDES 30-36

No Collision (animation frames; the code for both threads is as on slide 29)

Slide   Thread 1              Thread 2              Main memory (Hash Table)
30      T 1; Old: 0           -                     42
31      T 42, T 1; Old: 0     -                     42
32      T 42, T 1; Old: 0     T 1; Old: 0           42
33      T 42, T 1; Old: 0     T 4, T 1; Old: 0      42
34      T 42, T 1; Old: 0     T 4, T 1; Old: 0      42
35      42, 1, 0; Old: 0      T 4, T 1; Old: 0      42
36      42                    4, 1, 0; Old: 0       42, 4

(T and Old appear to correspond to the Transaction Flag and the Lock Cache introduced on slide 27.)

SLIDE 37

Unsuccessful Concurrent Use

SLIDE 38

With Collision

Thread 1:

    xacquire lock cmpxchg %ebx, mut
    jne 2f
    mov table+2, %edx
    xrelease mov $0, mut
    call wake

Thread 2:

    xacquire lock cmpxchg %ebx, mut
    jne 2f
    mov $4, table+2
    xrelease mov $0, mut
    call wake

Main memory: Hash Table (42), Mutex

SLIDES 39-48

With Collision (animation frames; the code for both threads is as on slide 38)

Slide   Thread 1              Thread 2              Main memory          Note
39      T 1; Old: 0           -                     Hash Table 42
40      T 42, T 1; Old: 0     -                     Hash Table 42
41      T 42, T 1; Old: 0     T 1; Old: 0           Hash Table 42
42      T 42, T 1; Old: 0     T 4, T 1; Old: 0      Hash Table 42
43      T 42, T 1; Old: 0     T 4, T 1; Old: 0      Hash Table 42
44      42, 1, 0; Old: 0      T 4, T 1; Old: 0      Hash Table 42        Restart
45      42, XX                4, 1, X, 0; Old: 0    Hash Table 4
46      42, XX, 1             4, X                  Hash Table 4, Mutex 1
47      4, 1                  4, X                  Hash Table 4, Mutex 1
48      4                     4, X                  Hash Table 4

(T and Old appear to correspond to the Transaction Flag and the Lock Cache introduced on slide 27; X marks entries crossed out on the slide.)

SLIDE 49

Benefits for HLE

SLIDE 50

Lock-Free with HLE

                    LIFO   FIFO    Hash   Single Linked   Double Linked
No Priority   1:1   CAS    CAS     HLE*   HLE**           HLE**
              1:N   CAS    DWCAS   HLE*   HLE**           HLE**
              N:1   CAS    CAS     HLE*   HLE**           HLE**
              M:N   CAS    DWCAS   HLE*   HLE**           HLE**
Priority      1:1   CAS    CAS     HLE*   HLE**           HLE**
              1:N   HLE    HLE     HLE*   HLE**           HLE**
              N:1   CAS    CAS     HLE*   HLE**           HLE**
              M:N   HLE    HLE     HLE*   HLE**           HLE**

SLIDE 51

Lock-Free with HLE (table as on slide 50)

*  = reasonable high limit for internal or external hashing
** = list length limited by cache size

SLIDE 52

Problems

SLIDE 53

Problems with HLE

  • Granularity: cache lines
  • Wrong conflicts through false sharing
  • Possible slowdown
  • Compact data structures needed
  • Abort/restart policy not architected
  • Usable only for simple operations
  • No system calls, page faults, ...
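
To illustrate the cache-line granularity issue, the sketch below compares a packed element type with one padded to an assumed line size of 64 bytes. The names `PackedSlot`, `PaddedSlot`, `kCacheLine` and `share_line` are hypothetical, and the real line size depends on the processor.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

struct PackedSlot { int value; };                      // many slots per line
struct alignas(kCacheLine) PaddedSlot { int value; };  // one slot per line

// True when array elements i and j of the given element size start in the
// same cache line (array assumed to start on a line boundary).
bool share_line(std::size_t i, std::size_t j, std::size_t elem_size) {
    assert(elem_size > 0);
    return (i * elem_size) / kCacheLine == (j * elem_size) / kCacheLine;
}
```

Entries 2 and 5 of a `PackedSlot` array share a line, so elided critical sections touching them would be reported as conflicting even though they use different entries. Padding removes that false conflict at the price of a larger footprint, which works against the slide's point that compact data structures are needed, so the trade-off has to be judged per data structure.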

SLIDE 54

Transactional Memory

SLIDE 55

History

  • Available for some time

Wait-Free Synchronization, Maurice Herlihy, ACM Transactions on Programming Languages and Systems, 1991

SLIDE 56

Software TM

void insert(node *p) {
  guard g(lock);
  node **prev = &list;
  node *l = list;
  while (l && l->val < p->val) {
    prev = &l->next;
    l = l->next;
  }
  p->next = l;
  *prev = p;
}

void insert(node *p) {
  tm_atomic {
    node **prev = &list;
    node *l = list;
    while (l && l->val < p->val) {
      prev = &l->next;
      l = l->next;
    }
    p->next = l;
    *prev = p;
  }
}

Support added to C and C++

SLIDE 57

Software TM

void insert(node *p) {
  tm_atomic {
    node **prev = &list;
    node *l = list;
    while (l && l->val < p->val) {
      prev = &l->next;
      l = l->next;
    }
    p->next = l;
    *prev = p;
  }
}

  • No locking needed
  • Concurrency enabled
  • Exception-safe
  • Transparent use of hardware TM support through compiler mode

SLIDE 58

Intel RTM

SLIDE 59

Thread-Unsafe List

(insert() with tm_atomic, as on slide 57)

    mov list(%rip),%rax
    mov $list,%edx
    test %rax,%rax
    je 1f
    mov 0x8(%rdi),%ecx
    jmp 2f
3:  mov %rax,%rdx
    mov (%rax),%rax
    test %rax,%rax
    je 1f
2:  cmp %ecx,0x8(%rax)
    jl 3b
1:  mov %rax,(%rdi)
    mov %rdi,(%rdx)
    ret

SLIDE 60

Thread-Unsafe List

void insert(node *p) {
  node **prev = &list;
  node *l = list;
  while (l && l->val < p->val) {
    prev = &l->next;
    l = l->next;
  }
  p->next = l;
  *prev = p;
}

(assembly as on slide 59)

SLIDE 61

Thread-Safe List

(insert() with tm_atomic, as on slide 57)

    movl $MAX, cnt(%rsp)
0:  xbegin .Labort
    mov list(%rip),%rax
    mov $list,%edx
    test %rax,%rax
    je 1f
    mov 0x8(%rdi),%ecx
    jmp 2f
3:  mov %rax,%rdx
    mov (%rax),%rax
    test %rax,%rax
    je 1f
2:  cmp %ecx,0x8(%rax)
    jl 3b
1:  mov %rax,(%rdi)
    mov %rdi,(%rdx)
    xend
    ret

Restart

SLIDE 62

Thread-Safe List

(assembly as on slide 61, from movl $MAX, cnt(%rsp) through xend and ret)

Restart

.Labort:
    test $2, %rax
    jz .Ltrylocking
    decl cnt(%rsp)
    jne 0b
.Ltrylocking:
    ...
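
The abort handler implements a bounded-retry policy: retry while the abort status has bit 1 set (the `test $2` check) and the counter has not run out, otherwise take the locking path. The sketch below models only that control flow in portable C++. The transactional attempt is passed in as a callable instead of executing `xbegin`, so it runs on any machine; the names `run_with_fallback`, `Path`, `kCommitted` and `kAbortRetry` are made up for this illustration.

```cpp
#include <cassert>
#include <functional>

// Result of one hypothetical transactional attempt. kCommitted stands in for
// "xbegin ... xend ran to completion"; any other value is an abort status.
constexpr unsigned kCommitted = ~0u;
constexpr unsigned kAbortRetry = 1u << 1;  // the bit checked by `test $2, %rax`

enum class Path { Transaction, LockFallback };

// Runs `attempt` at most `max_tries` times (the cnt(%rsp) counter), then
// `fallback` (the .Ltrylocking path). Gives up early if the retry bit is clear.
inline Path run_with_fallback(int max_tries,
                              const std::function<unsigned()>& attempt,
                              const std::function<void()>& fallback) {
    assert(max_tries > 0);
    for (int cnt = max_tries; cnt > 0; --cnt) {
        unsigned status = attempt();
        if (status == kCommitted)
            return Path::Transaction;
        if ((status & kAbortRetry) == 0)
            break;  // abort status says a retry is unlikely to succeed
    }
    fallback();
    return Path::LockFallback;
}
```

In a real RTM build the `attempt` callable would wrap the transactional region and `fallback` would take a conventional lock; how many retries are worthwhile is a tuning question the slides leave open.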

SLIDE 63

Composition

tm_atomic {
  if ((i = find(l1.begin(), l1.end(), val)) != l1.end()) {
    l2.push_front(*i);
    l1.erase(i);
  }
}

xbegin                    reference count: 1
  find:        xbegin     reference count: 2
               ...
               xend       reference count: 1
  push_front:  xbegin     reference count: 2
               ...
               xend       reference count: 1
  erase:       xbegin     reference count: 2
               ...
               xend       reference count: 1
xend                      reference count: 0   COMMIT!
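
The reference counting in the trace can be modeled with a simple depth counter: every begin increments it, every end decrements it, and only the end that brings it back to zero commits. The class below is a toy model of that bookkeeping with a hypothetical name (`NestingCounter`); it performs no actual transactional execution.

```cpp
#include <cassert>

// Toy model of nested transactions: only the outermost end commits.
class NestingCounter {
public:
    void begin() { ++depth_; }  // xbegin: reference count + 1

    // xend: reference count - 1; returns true when this end commits.
    bool end() {
        assert(depth_ > 0);
        return --depth_ == 0;
    }

    int depth() const { return depth_; }

private:
    int depth_ = 0;
};
```

Replaying the trace: the outer begin raises the count to 1, each of find, push_front and erase raises it to 2 and drops it back to 1 without committing, and the final end reaches 0 and commits everything at once.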

SLIDE 64

Problems

SLIDE 65

Problems with RTM

  • Cache line granularity
  • False sharing can lead to unpredictable aborts
  • Composability limited
  • No system calls etc.
  • Limited TM compiler knowledge so far
  • More knowledge about libraries and more function annotation needed

  • No experience with abort handlers yet
  • STM fallback solution very slow

SLIDE 66

Summary

SLIDE 67

Summary

  • Increase level of parallelization through HLE
  • Opportunistic execution
  • Fully backward compatible
  • No more need for reader/writer locks
  • Solve composition with RTM
  • Available through language extensions
  • Problems
  • How to handle cache line-granularity?

SLIDE 68

Questions?