SLIDE 1
Read-Log-Update: A Lightweight Synchronization Mechanism for Concurrent Programming
Alexander Matveev, Nir Shavit, Pascal Felber, Patrick Marlier
Presented by: Maksym Planeta, Paper Reading Group, 24.09.2015
SLIDE 2
SLIDE 3
Table of Contents
Introduction
RLU design
Evaluation
Conclusion
SLIDE 4
Motivation
What is bad about RCU?
◮ Complex to use for writers;
◮ Optimized for a low number of writers;
◮ High delays in synchronize_rcu.
SLIDE 5
Contributions
RCU + STM = RLU.
◮ Update several objects with a single counter increment;
SLIDE 6
Contributions
RCU + STM = RLU.
◮ Update several objects with a single counter increment;
Traverse doubly linked lists in both directions!
SLIDE 7
Contributions
RCU + STM = RLU.
◮ Update several objects with a single counter increment;
Traverse doubly linked lists in both directions!
◮ Stay compatible with RCU
SLIDE 8
RCU recap
[Diagram: reader T1 and writer T2 operate concurrently on the linked list a → b → c. T1 runs search(c): rcu_read_lock(), rcu_dereference(b), rcu_dereference(c), rcu_read_unlock(). T2 concurrently runs remove(b), then synchronize_rcu() to wait out the grace period, and finally kfree() on the unlinked node.]
Figure 2. Concurrent search and removal with the RCU-based linked list.
SLIDE 9
Single point manipulation
static inline void __list_add_rcu(struct list_head *new,
                struct list_head *prev, struct list_head *next)
{
        new->next = next;
        new->prev = prev;
        rcu_assign_pointer(list_next_rcu(prev), new);
        next->prev = new;
}
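The key point of the listing above is ordering: the new node is fully initialized before `rcu_assign_pointer` publishes it, so a concurrent reader can never see a half-built node. A minimal userspace sketch of this publish step, modeling `rcu_assign_pointer` as a C11 release store on a singly linked list (names like `publish_after` are illustrative, not kernel API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
    int val;
    struct node *_Atomic next;
};

static void publish_after(struct node *prev, struct node *new_node, int val)
{
    /* 1. Initialize the node before anyone can see it. */
    new_node->val = val;
    atomic_store_explicit(&new_node->next,
        atomic_load_explicit(&prev->next, memory_order_relaxed),
        memory_order_relaxed);
    /* 2. Publish: the release store orders the writes above before
     *    the pointer becomes visible to concurrent readers. */
    atomic_store_explicit(&prev->next, new_node, memory_order_release);
}
```

On the reader side, the matching acquire load (what `rcu_dereference` provides) guarantees that a reader following the published pointer sees the initialized fields.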
SLIDE 10
RLU style
/* ... some important code that we consider later ... */
/* Update references */
rlu_assign_ptr(&(new->next), next);
rlu_assign_ptr(&(prev->next), new);
/* Commit */
rlu_reader_unlock();
SLIDE 11
Table of Contents
Introduction
RLU design
Evaluation
Conclusion
SLIDE 12
Basic idea
- 1. All operations read the global clock when they start;
- 2. The clock is used to dereference shared objects;
- 3. Write operations write to a log (an RCU-style copy of an object);
- 4. Increment the global clock to commit a write (corresponds to swapping pointers in RCU);
- 5. Wait for old readers to finish (corresponds to synchronize_rcu);
- 6. Write back objects from the log (corresponds to RCU memory reclamation).
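The clock discipline in steps 1, 4, and 5 can be sketched as a toy, single-threaded model. All names (`op_start`, `reader_steals`, `op_commit`) are illustrative, not the paper's API:

```c
#include <assert.h>

static int g_clock = 22;       /* the global clock */

#define W_INF (1 << 30)        /* stands in for "infinity" */

struct op {
    int l_clock;   /* clock read at operation start       */
    int w_clock;   /* commit clock; "infinity" while idle */
};

static void op_start(struct op *o)
{
    o->l_clock = g_clock;      /* step 1: record the global clock */
    o->w_clock = W_INF;
}

/* A reader may see a writer's log copies only if the writer has
 * already committed at or before the reader's starting clock. */
static int reader_steals(const struct op *rd, const struct op *wr)
{
    return rd->l_clock >= wr->w_clock;
}

static void op_commit(struct op *o)
{
    o->w_clock = g_clock + 1;  /* enable stealing for new readers  */
    g_clock += 1;              /* step 4: advance the global clock */
    /* step 5: wait for readers with l_clock < w_clock, then
     * step 6: write back the log (elided in this toy model). */
}
```

A reader that started before the commit keeps reading the old objects; a reader that starts afterwards steals the new copies. This is what makes a multi-object update appear atomic at a single clock increment.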
SLIDE 13
Read-read example
[Diagram: g-clock = 22; threads T1–T3 each hold an l-clock, a w-clock (∞ while not committing), and an empty w-log; objects O1–O3 in memory are all unlocked. Steps: ➊ T1 and T2 read g-clock (l-clock ← 22); ➋ T1 reads O1 and T2 reads O1 (not locked); ➌ T2 reads O2 (not locked). Unlocked objects are read directly, with no synchronization.]
SLIDE 14
Write-read example
[Diagram: g-clock = 22; T1 and T2 both started with l-clock = 22. Writer T2 ➍ logs O2 (and locks it), ➎ updates O2 in its w-log, ➏ logs O3 (and locks it), ➐ updates O3 in its w-log; O2 and O3 are now marked locked by T2. Reader T1 ➌ reads O2, finds it locked by T2, and checks: if l-clock ≥ T2.w-clock → steal the new copy from T2's w-log, else → read O2. Since T2 has not committed yet (T2.w-clock = ∞), T1 reads the original O2.]
SLIDE 15
Read-write-steal example
[Diagram: writer T2 commits ➑: 1) w-clock ← 23; 2) g-clock ← 23; 3) wait for old readers with l-clock < 23 (here T1, still at 22); 4) write back the w-log and release the locks. Meanwhile a new reader T3 reads g-clock (l-clock ← 23), ➊ reads O2 (locked by T2), and ➋, since l-clock ≥ T2.w-clock, steals the new copy from T2's w-log.]
SLIDE 16
Real list add
int rlu_list_add(rlu_thread_data_t *self, list_t *list, val_t val)
{
        node_t *prev, *next, *node;
        val_t v;
restart:
        rlu_reader_lock();
        /* Find right place ... */
        if (!rlu_try_lock(self, &prev) ||
            !rlu_try_lock(self, &next)) {
                rlu_abort(self);
                goto restart;
        }
        node = rlu_new_node();
        node->val = val;
        rlu_assign_ptr(&(node->next), next);
        rlu_assign_ptr(&(prev->next), node);
        rlu_reader_unlock();
}
SLIDE 19
Reader lock
 1: function RLU_READER_LOCK(ctx)
 2:   ctx.is-writer ← false
 3:   ctx.run-cnt ← ctx.run-cnt + 1      ⊲ Set active
 4:   memory fence
 5:   ctx.local-clock ← global-clock     ⊲ Record global clock

 6: function RLU_READER_UNLOCK(ctx)
 7:   ctx.run-cnt ← ctx.run-cnt + 1      ⊲ Set inactive
 8:   if ctx.is-writer then
 9:     RLU_COMMIT_WRITE_LOG(ctx)        ⊲ Write updates
SLIDE 20
Memory commit
44: function RLU_COMMIT_WRITE_LOG(ctx)
45:   ctx.write-clock ← global-clock + 1   ⊲ Enable stealing
46:   FETCH_AND_ADD(global-clock, 1)       ⊲ Advance clock
47:   RLU_SYNCHRONIZE(ctx)                 ⊲ Drain readers
48:   RLU_WRITEBACK_WRITE_LOG(ctx)         ⊲ Safe to write back
49:   RLU_UNLOCK_WRITE_LOG(ctx)
50:   ctx.write-clock ← ∞                  ⊲ Disable stealing
51:   RLU_SWAP_WRITE_LOGS(ctx)             ⊲ Quiesce write-log
SLIDE 21
Pointer dereference
10: function RLU_DEREFERENCE(ctx, obj)
11:   ptr-copy ← GET_COPY(obj)            ⊲ Get copy pointer
12:   if IS_UNLOCKED(ptr-copy) then       ⊲ Is free?
13:     return obj                        ⊲ Yes ⇒ return object
14:   if IS_COPY(ptr-copy) then           ⊲ Already a copy?
15:     return obj                        ⊲ Yes ⇒ return object
16:   thr-id ← GET_THREAD_ID(ptr-copy)
17:   if thr-id = ctx.thr-id then         ⊲ Locked by us?
18:     return ptr-copy                   ⊲ Yes ⇒ return copy
19:   other-ctx ← GET_CTX(thr-id)         ⊲ No ⇒ check for steal
20:   if other-ctx.write-clock ≤ ctx.local-clock then
21:     return ptr-copy                   ⊲ Stealing ⇒ return copy
22:   return obj                          ⊲ No stealing ⇒ return object
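This decision tree can be modeled compactly in C, assuming a per-object header that points at the write-log copy and records the owner (all names hypothetical; the IS_COPY case, where the argument is itself a log copy, is omitted for brevity):

```c
#include <assert.h>
#include <stddef.h>

#define W_INF (1 << 30)   /* stands in for "infinity" */

struct ctx {
    int thr_id;
    int l_clock;          /* clock recorded at operation start */
    int w_clock;          /* commit clock; W_INF while idle    */
};

struct obj {
    int val;
    struct obj *copy;     /* NULL ⇒ unlocked; else the w-log copy     */
    struct ctx *owner;    /* locking thread (valid when copy != NULL) */
};

static struct obj *rlu_deref(struct ctx *self, struct obj *o)
{
    if (o->copy == NULL)
        return o;                          /* unlocked ⇒ the object    */
    if (o->owner == self)
        return o->copy;                    /* locked by us ⇒ our copy  */
    if (o->owner->w_clock <= self->l_clock)
        return o->copy;                    /* committed ⇒ steal copy   */
    return o;                              /* not committed ⇒ original */
}
```

Note that dereferencing never blocks: the reader either takes the original or steals a committed copy, based purely on a clock comparison.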
SLIDE 22
RLU Deferring
- 1. On commit, do not increment the global clock or execute RLU sync;
- 2. Instead, save the write-log and create a new log for the next writer;
- 3. Synchronize only when a writer tries to lock an object that is already locked.
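The payoff of these three steps can be shown with a toy counter model (names illustrative, not the paper's API): writers that hit no lock conflict just park their write-log, and one full sync later pays for all of them.

```c
#include <assert.h>

#define MAX_DEFERRED 16

static int deferred_logs;   /* parked, not-yet-written-back logs */
static int sync_calls;      /* full RLU syncs actually paid      */

static void writer_commit(int conflict)
{
    if (conflict || deferred_logs == MAX_DEFERRED) {
        /* Conflict (or log space exhausted): one clock increment and
         * one sync processes every parked log plus our own. */
        sync_calls += 1;
        deferred_logs = 0;
    } else {
        /* No conflict: defer, keep the locks, skip the sync. */
        deferred_logs += 1;
    }
}
```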
SLIDE 23
RLU Deferring advantages
- 1. Fewer RLU synchronize calls;
- 2. Less contention on the global clock;
- 3. Less stealing, hence fewer cache misses.
SLIDE 24
Table of Contents
Introduction
RLU design
Evaluation
Conclusion
SLIDE 25
Linked lists
[Plot: user-space linked list (1,000 nodes), operations/µs vs. number of threads, comparing RCU, RLU, Harris (leaky), and Harris (HP).]
Figure 4. Throughput for linked lists with 2% (left), 20% (middle), and 40% (right) updates.
SLIDE 26
Hash table
[Plot: user-space hash table (1,000 buckets of 100 nodes), operations/µs vs. number of threads, comparing RCU, RLU, RLU (defer), Harris (leaky), and Harris (HP).]
Figure 5. Throughput for hash tables with 2% (left), 20% (middle), and 40% (right) updates.
SLIDE 27
Resizable Hash table
[Plot: resizable hash table (64K items, 8–16K buckets), operations/µs vs. number of threads, comparing RCU and RLU with 8K, 16K, and 8–16K buckets.]
Figure 6. Throughput for the resizable hash table.
SLIDE 28
Update only stress test (hash table)
[Plot: hash table (10,000 buckets of 1 node), 100% updates, operations/µs vs. number of threads, comparing RCU, RLU, and RLU (defer).]
Figure 7. Throughput for the stress test on a hash table with 100% updates and a single item per bucket.
SLIDE 29
Citrus Search Tree (throughput)
[Plot: Citrus tree (100,000 nodes), operations/µs vs. number of threads (up to 80), comparing RCU and RLU with 10%, 20%, and 40% updates.]
SLIDE 30
Citrus Search Tree (statistics)
[Plots: four panels vs. number of threads (up to 80), for RLU with 10%, 20%, and 40% updates: write-back quiescence, syncs per writer (%), read copies per writer (%), and sync requests per writer (%).]
SLIDE 31
Kernel space doubly linked lists
[Plot: kernel doubly linked list (512 nodes), operations/µs vs. number of threads, comparing RCU, RCU (fixed), and RLU.]
Figure 9. Throughput for kernel doubly linked lists (list_* APIs) with 0.1% (left) and 1% (right) updates.
SLIDE 32
Kernel space single linked lists
[Plot: kernel linked list (1,000 nodes), operations/µs vs. number of threads, comparing RCU and RLU.]
Figure 10. Throughput for linked lists running in the kernel with 2% (left), 20% (middle), and 40% (right) updates.
SLIDE 33
Kernel space hash tables
[Plot: kernel hash table (1,000 buckets of 100 nodes), operations/µs vs. number of threads, comparing RCU, RLU, and RLU (defer).]
Figure 11. Throughput for hash tables running in the kernel with 2% (left), 20% (middle), and 40% (right) updates.
SLIDE 34
Kernel-space Torture Tests
They even tried this! RLU successfully passed all torture tests within the implemented functionality.
SLIDE 35
Kyoto Cache DB
It was advertised in the abstract, and finally here it is:
[Plot: Kyoto Cabinet (1 GB in-memory DB), operations/µs vs. number of threads with 2% and 10% updates, comparing a reader-writer lock, an ingress-egress lock, and RLU.]
Figure 12. Throughput for the original and RLU versions of the Kyoto Cache DB.
SLIDE 36
Table of Contents
Introduction
RLU design
Evaluation
Conclusion
SLIDE 37