SLIDE 1

Read-Log-Update A Lightweight Synchronization Mechanism for Concurrent Programming

Paper Reading Group Alexander Matveev Nir Shavit Pascal Felber Patrick Marlier Presents: Maksym Planeta 24.09.2015

SLIDE 2

Table of Contents

Introduction RLU design Evaluation Conclusion

SLIDE 4

Motivation

What is wrong with RCU?

◮ Complex to use for a writer;
◮ optimized for a low number of writers;
◮ high delays in synchronize_rcu.

SLIDE 7

Contributions

RCU + STM = RLU.

◮ Update several objects with a single counter increment;
  traverse doubly linked lists in both directions!
◮ Stay compatible with RCU.

SLIDE 8

RCU recap

[Diagram] Figure 2. Concurrent search and removal with the RCU-based linked list (nodes a → b → c): reader T1 brackets its search(c) with rcu_read_lock() and rcu_read_unlock(), following pointers with rcu_dereference(b) and rcu_dereference(c); writer T2 runs remove(b), waits out the grace period in synchronize_rcu(), and only then reclaims the node with kfree(n).

SLIDE 9

Single point manipulation

static inline void __list_add_rcu(struct list_head *new,
                struct list_head *prev, struct list_head *next)
{
        new->next = next;
        new->prev = prev;
        rcu_assign_pointer(list_next_rcu(prev), new);
        next->prev = new;
}

SLIDE 10

RLU style

/* ... some important code that we consider later ... */
/* Update references */
rlu_assign_ptr(&(new->next), next);
rlu_assign_ptr(&(prev->next), new);
/* Commit */
rlu_reader_unlock();

SLIDE 11

Table of Contents

Introduction RLU design Evaluation Conclusion

SLIDE 12

Basic idea

1. All operations read the global clock when they start;
2. the clock is used to dereference shared objects;
3. write operations write to a log (an RCU-style copy of the object);
4. increment the global clock to commit a write (corresponds to swapping pointers in RCU);
5. wait for old readers to finish (corresponds to synchronize_rcu);
6. write back objects from the log (corresponds to RCU memory reclamation).
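The six steps above can be sketched in C as a single-threaded toy model. All names (g_clock, ctx_t, deref, ...) are ours, not the authors' API; the waiting of step 5 is elided and there is no real concurrency, so this only illustrates the clock/log bookkeeping:

```c
#include <assert.h>
#include <limits.h>

typedef struct { int val; int locked; } obj_t;

typedef struct {
    unsigned long l_clock;  /* step 1: clock read at operation start */
    unsigned long w_clock;  /* commit clock; ULONG_MAX = stealing off */
    obj_t log_copy;         /* step 3: write log holds an RCU-style copy */
    obj_t *log_orig;        /* object the copy shadows */
} ctx_t;

static unsigned long g_clock = 22;

/* step 1: every operation records the global clock on start */
static void op_start(ctx_t *c) { c->l_clock = g_clock; }

/* step 2: dereference decides between memory and a writer's log */
static obj_t *deref(ctx_t *c, ctx_t *writer, obj_t *o) {
    if (o->locked && writer && c->l_clock >= writer->w_clock)
        return &writer->log_copy;  /* steal the committed copy */
    return o;
}

/* step 3: a writer locks the object and updates a private copy */
static void log_write(ctx_t *c, obj_t *o, int v) {
    o->locked = 1;
    c->log_copy = *o;
    c->log_copy.val = v;
    c->log_orig = o;
}

/* steps 4-6: publish commit clock, tick, (wait), write back */
static void commit(ctx_t *c) {
    c->w_clock = g_clock + 1;    /* step 4: new readers will steal */
    g_clock++;
    /* step 5: wait for readers with l_clock < w_clock (elided here) */
    *c->log_orig = c->log_copy;  /* step 6: write back from the log */
    c->log_orig->locked = 0;
    c->w_clock = ULONG_MAX;
}
```

A reader that started before the commit keeps seeing the old value in memory; once the write-back has run, everyone sees the new one.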

slide-13
SLIDE 13

Read-read example

[Diagram] g-clock = 22. Threads T1, T2, T3 each start with w-clock = ∞ and an empty w-log; objects O1, O2, O3 are unlocked. T1 reads g-clock (l-clock ← 22) and reads O1 directly, since it is not locked; T2 likewise reads g-clock (l-clock ← 22) and reads O1 and O2 directly; T3 is idle.

SLIDE 14

Write-read example

[Diagram] g-clock = 22. T2 is a writer: it logs O2 and O3 (taking their locks) and updates both copies in its w-log, while its w-clock stays ∞ until commit. A reader that now reads O2 finds it locked by T2 and applies the steal rule: if l-clock ≥ T2.w-clock, steal the new copy from T2's w-log; else read O2 from memory. With T2.w-clock = ∞, concurrent readers (l-clock = 22) keep reading the old O2.
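The steal check from the diagram can be written as a pure function. Here l_clock is the reader's recorded clock and t2_w_clock is the writer's commit clock, with ULONG_MAX standing in for infinity; the names are ours:

```c
#include <assert.h>
#include <limits.h>

typedef struct { int val; } node_t;

static const node_t *read_locked_obj(unsigned long l_clock,
                                     unsigned long t2_w_clock,
                                     const node_t *obj,
                                     const node_t *t2_log_copy)
{
    if (l_clock >= t2_w_clock)
        return t2_log_copy;  /* steal the new copy from T2's w-log */
    return obj;              /* else read the object from memory */
}
```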

SLIDE 15

Read-write-steal example

[Diagram] T2 commits: it sets w-clock ← 23, advances g-clock ← 23, waits for old readers (those with l-clock < 23, so it must wait for T1, whose l-clock is 22), and then writes back its w-log. Meanwhile T3 starts, reads g-clock (l-clock ← 23), finds O2 locked by T2, and since l-clock ≥ T2.w-clock it steals the new copy from T2's w-log.

SLIDE 16

Real list add

int rlu_list_add(rlu_thread_data_t *self, list_t *list, val_t val)
{
        node_t *prev, *next, *node;
        val_t v;
restart:
        rlu_reader_lock();
        /* Find right place ... */
        if (!rlu_try_lock(self, &prev) ||
            !rlu_try_lock(self, &next)) {
                rlu_abort(self);
                goto restart;
        }
        node = rlu_new_node();
        node->val = val;
        rlu_assign_ptr(&(node->next), next);
        rlu_assign_ptr(&(prev->next), node);
        rlu_reader_unlock();
}

SLIDE 19

Reader lock

 1: function RLU_READER_LOCK(ctx)
 2:     ctx.is-writer ← false
 3:     ctx.run-cnt ← ctx.run-cnt + 1     ⊲ Set active
 4:     memory fence
 5:     ctx.local-clock ← global-clock    ⊲ Record global clock

 6: function RLU_READER_UNLOCK(ctx)
 7:     ctx.run-cnt ← ctx.run-cnt + 1     ⊲ Set inactive
 8:     if ctx.is-writer then
 9:         RLU_COMMIT_WRITE_LOG(ctx)     ⊲ Write updates
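A toy C rendition of this pseudocode, under our own names: run-cnt parity marks an active reader (odd = inside a critical section), which is what the writer's synchronize step would poll. The commit call is stubbed out with a counter:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool is_writer;
    unsigned long run_cnt;
    unsigned long local_clock;
} rlu_ctx_t;

static unsigned long global_clock = 22;
static int committed;  /* stands in for RLU_COMMIT_WRITE_LOG */

static void rlu_reader_lock(rlu_ctx_t *ctx)
{
    ctx->is_writer = false;
    ctx->run_cnt++;                           /* set active */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);  /* memory fence */
    ctx->local_clock = global_clock;          /* record global clock */
}

static void rlu_reader_unlock(rlu_ctx_t *ctx)
{
    ctx->run_cnt++;                           /* set inactive */
    if (ctx->is_writer)
        committed++;  /* RLU_COMMIT_WRITE_LOG(ctx) would run here */
}
```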

SLIDE 20

Memory commit

44: function RLU_COMMIT_WRITE_LOG(ctx)
45:     ctx.write-clock ← global-clock + 1  ⊲ Enable stealing
46:     FETCH_AND_ADD(global-clock, 1)      ⊲ Advance clock
47:     RLU_SYNCHRONIZE(ctx)                ⊲ Drain readers
48:     RLU_WRITEBACK_WRITE_LOG(ctx)        ⊲ Safe to write back
49:     RLU_UNLOCK_WRITE_LOG(ctx)
50:     ctx.write-clock ← ∞                 ⊲ Disable stealing
51:     RLU_SWAP_WRITE_LOGS(ctx)            ⊲ Quiesce write-log
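The ordering of lines 45-46 is the key point: write-clock is published as global-clock + 1 *before* the clock ticks, so no reader can observe the new clock value without stealing already being enabled. A minimal sketch (names and the choice of GCC atomic builtin are ours):

```c
#include <assert.h>
#include <limits.h>

static unsigned long global_clock = 22;
static unsigned long write_clock = ULONG_MAX;

static void commit_prologue(void)
{
    write_clock = global_clock + 1;                          /* line 45 */
    __atomic_fetch_add(&global_clock, 1, __ATOMIC_SEQ_CST);  /* line 46 */
    /* lines 47-51: synchronize, write back, unlock, reset, swap logs */
}
```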

SLIDE 21

Pointer dereference

10: function RLU_DEREFERENCE(ctx, obj)
11:     ptr-copy ← GET_COPY(obj)           ⊲ Get copy pointer
12:     if IS_UNLOCKED(ptr-copy) then      ⊲ Is free?
13:         return obj                     ⊲ Yes ⇒ return object
14:     if IS_COPY(ptr-copy) then          ⊲ Already a copy?
15:         return obj                     ⊲ Yes ⇒ return object
16:     thr-id ← GET_THREAD_ID(ptr-copy)
17:     if thr-id = ctx.thr-id then        ⊲ Locked by us?
18:         return ptr-copy                ⊲ Yes ⇒ return copy
19:     other-ctx ← GET_CTX(thr-id)        ⊲ No ⇒ check for steal
20:     if other-ctx.write-clock ≤ ctx.local-clock then
21:         return ptr-copy                ⊲ Stealing ⇒ return copy
22:     return obj                         ⊲ No stealing ⇒ return object
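A simplified C rendition of this decision chain. Instead of the paper's tagged copy pointer we use explicit header fields, and GET_CTX is reduced to an array of write-clocks; all struct and function names are ours:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

typedef struct obj {
    struct obj *copy;   /* NULL = unlocked, else the writer's log copy */
    int is_copy;        /* set on log copies themselves */
    int locker_id;      /* thread that holds the lock */
    int val;
} obj_t;

typedef struct { int thr_id; unsigned long local_clock; } rlu_ctx_t;

/* stands in for GET_CTX(thr-id).write-clock; ULONG_MAX = infinity */
static unsigned long write_clocks[4] = {
    ULONG_MAX, ULONG_MAX, ULONG_MAX, ULONG_MAX
};

static obj_t *rlu_dereference(rlu_ctx_t *ctx, obj_t *obj)
{
    if (obj->copy == NULL)                  /* is free? ⇒ object */
        return obj;
    if (obj->is_copy)                       /* already a copy? ⇒ object */
        return obj;
    if (obj->locker_id == ctx->thr_id)      /* locked by us? ⇒ our copy */
        return obj->copy;
    if (write_clocks[obj->locker_id] <= ctx->local_clock)
        return obj->copy;                   /* stealing ⇒ return copy */
    return obj;                             /* no stealing ⇒ object */
}
```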

SLIDE 22

RLU Deferring

1. On commit, do not increment the global clock or execute RLU sync;
2. instead, save the write-log and create a new log for the next writer;
3. synchronize only when a writer tries to lock an object that is already locked.
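A toy model of deferring, under our own names: a commit just parks its write-log instead of synchronizing, and the expensive sync runs only when a later writer hits a lock held by a parked log:

```c
#include <assert.h>

static int parked_logs;  /* write-logs saved at commit, not yet synced */
static int syncs;        /* how often RLU sync actually ran */

static void commit_deferred(void)
{
    parked_logs++;  /* save the log, hand a fresh log to the next writer */
}

static int try_lock(int held_by_parked_log)
{
    if (held_by_parked_log) {
        syncs++;         /* now synchronize and write back ... */
        parked_logs = 0; /* ... all parked logs at once */
    }
    return 1;
}
```

Three commits cost zero syncs; only the eventual lock conflict pays for one.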

SLIDE 23

RLU Deferring advantages

1. Fewer RLU synchronize calls;
2. less contention on the global clock;
3. less stealing, hence fewer cache misses.

SLIDE 24

Table of Contents

Introduction RLU design Evaluation Conclusion

SLIDE 25

Linked lists

[Diagram] Figure 4. Throughput (operations/µs vs. number of threads) for a user-space linked list (1,000 nodes) with 2% (left), 20% (middle), and 40% (right) updates: RCU, RLU, Harris (leaky), Harris (HP).

SLIDE 26

Hash table

[Diagram] Figure 5. Throughput (operations/µs vs. number of threads) for a user-space hash table (1,000 buckets of 100 nodes) with 2% (left), 20% (middle), and 40% (right) updates: RCU, RLU, RLU (defer), Harris (leaky), Harris (HP).

SLIDE 27

Resizable Hash table

[Diagram] Figure 6. Throughput for the resizable hash table (64K items, 8-16K buckets): RCU and RLU, each with 8K, 16K, and resizing 8-16K buckets.

SLIDE 28

Update only stress test (hash table)

[Diagram] Figure 7. Throughput for the stress test on a hash table (10,000 buckets of 1 node) with 100% updates and a single item per bucket: RCU, RLU, RLU (defer).

SLIDE 29

Citrus Search Tree (throughput)

[Diagram] Throughput (operations/µs) for the Citrus tree (100,000 nodes), 1-80 threads: RCU vs. RLU at 10%, 20%, and 40% updates.

SLIDE 30

Citrus Search Tree (statistics)

[Diagram] Per-writer statistics for RLU on the Citrus tree at 10%, 20%, and 40% updates, 1-80 threads: write-back quiescence, syncs per writer (%), read copies per writer (%), and sync requests per writer (%).

SLIDE 31

Kernel space doubly linked lists

[Diagram] Figure 9. Throughput for kernel doubly linked lists (list_* APIs, 512 nodes) with 0.1% (left) and 1% (right) updates: RCU, RCU (fixed), RLU.

SLIDE 32

Kernel space single linked lists

[Diagram] Figure 10. Throughput for linked lists (1,000 nodes) running in the kernel with 2% (left), 20% (middle), and 40% (right) updates: RCU, RLU.

SLIDE 33

Kernel space hash tables

[Diagram] Figure 11. Throughput for hash tables (1,000 buckets of 100 nodes) running in the kernel with 2% (left), 20% (middle), and 40% (right) updates: RCU, RLU, RLU (defer).

SLIDE 34

Kernel-space Torture Tests

They even tried this! RLU successfully passed all tests within the implemented functionality.

SLIDE 35

Kyoto Cache DB

It was advertised in the abstract and finally here it is:

[Diagram] Figure 12. Throughput for the original and RLU versions of the Kyoto Cache DB (Kyoto Cabinet, 1 GB in-memory DB) with 2% (left) and 10% (right) updates: reader-writer lock, ingress-egress lock, RLU.

SLIDE 36

Table of Contents

Introduction RLU design Evaluation Conclusion

SLIDE 37

Conclusion

◮ Performance similar to RCU: sometimes better, sometimes worse;
◮ easier programming interface;
◮ compatible with RCU;
◮ good both in user and kernel space;
◮ extensively benchmarked.