  1. Read-Log-Update: A Lightweight Synchronization Mechanism for Concurrent Programming. Alexander Matveev, Nir Shavit, Pascal Felber, Patrick Marlier. Paper Reading Group, presented by Maksym Planeta, 24.09.2015

  2. Table of Contents Introduction RLU design Evaluation Conclusion

  4. Motivation. What is wrong with RCU? ◮ Complex to use for writers; ◮ optimized for workloads with few writers; ◮ high delays in synchronize_rcu.

  7. Contributions. RCU + STM = RLU. ◮ Update several objects with a single counter increment; ◮ traverse doubly linked lists in both directions; ◮ stay compatible with RCU.

  8. RCU recap. [Figure 2. Concurrent search and removal with the RCU-based linked list: T1 runs search(c) — rcu_read_lock(), rcu_dereference(b), rcu_dereference(c), rcu_read_unlock(); T2 runs remove(b) — unlink b, synchronize_rcu(), wait out the grace period, then kfree() the removed node.]

  9. Single-pointer manipulation (RCU list insert):

     static inline void list_add_rcu(struct list_head *new,
                                     struct list_head *prev,
                                     struct list_head *next)
     {
             new->next = next;
             new->prev = prev;
             rcu_assign_pointer(list_next_rcu(prev), new);
             next->prev = new;
     }

  10. RLU style:

      /* ... some important code that we consider later ... */
      /* Update references */
      rlu_assign_ptr(&(new->next), next);
      rlu_assign_ptr(&(prev->next), new);
      /* Commit */
      rlu_reader_unlock();

  11. Table of Contents Introduction RLU design Evaluation Conclusion

  12. Basic idea:
      1. All operations read the global clock when they start;
      2. the clock is used to dereference shared objects;
      3. write operations write to a log (an RCU-style copy of the object);
      4. increment the global clock to commit a write (corresponds to swapping pointers in RCU);
      5. wait for old readers to finish (corresponds to synchronize_rcu);
      6. write back objects from the log (corresponds to RCU memory reclamation).

  13. Read-read example. [Diagram: g-clock = 22. T1 and T3 each read the g-clock into their l-clock (22), then read O1, O2, and O3. No object is locked, so both readers dereference the actual objects in memory.]

  14. Write-read example. [Diagram: T2 logs O2 and O3 into its w-log (locking them) and updates the copies in the w-log. A reader that hits a locked object checks: if its l-clock ≥ T2.w-clock, it steals the new copy from T2's w-log; otherwise it reads the old object in memory.]

  15. Read-write-steal example. [Diagram: T2 commits O2: 1) sets its w-clock to 23, 2) advances g-clock to 23, 3) waits for readers with l-clock < 23 (here T1). A new reader T3 records l-clock = 23; finding O2 locked by T2 and l-clock ≥ T2.w-clock, it steals the new copy from T2's w-log. 4) Once T1 is done, T2 writes back its w-log.]

  16. Real list add:

      void rlu_list_add(rlu_thread_data_t *self,
                        list_t *list, val_t val)
      {
              node_t *prev, *next, *new;
              val_t v;
      restart:
              rlu_reader_lock();
              /* Find right place ... */
              if (!rlu_try_lock(self, &prev) ||
                  !rlu_try_lock(self, &next)) {
                      rlu_abort(self);
                      goto restart;
              }
              new = rlu_new_node();
              new->val = val;
              rlu_assign_ptr(&(new->next), next);
              rlu_assign_ptr(&(prev->next), new);
              rlu_reader_unlock();
      }

  19. Reader lock:

      function RLU_READER_LOCK(ctx)
          ctx.is-writer ← false
          ctx.run-cnt ← ctx.run-cnt + 1        ⊲ Set active
          memory fence
          ctx.local-clock ← global-clock       ⊲ Record global clock

      function RLU_READER_UNLOCK(ctx)
          ctx.run-cnt ← ctx.run-cnt + 1        ⊲ Set inactive
          if ctx.is-writer then
              RLU_COMMIT_WRITE_LOG(ctx)        ⊲ Write updates

  20. Memory commit:

      function RLU_COMMIT_WRITE_LOG(ctx)
          ctx.write-clock ← global-clock + 1   ⊲ Enable stealing
          FETCH_AND_ADD(global-clock, 1)       ⊲ Advance clock
          RLU_SYNCHRONIZE(ctx)                 ⊲ Drain readers
          RLU_WRITEBACK_WRITE_LOG(ctx)         ⊲ Safe to write back
          RLU_UNLOCK_WRITE_LOG(ctx)
          ctx.write-clock ← ∞                  ⊲ Disable stealing
          RLU_SWAP_WRITE_LOGS(ctx)             ⊲ Quiesce write-log

  21. Pointer dereference:

      function RLU_DEREFERENCE(ctx, obj)
          ptr-copy ← GET_COPY(obj)             ⊲ Get copy pointer
          if IS_UNLOCKED(ptr-copy) then        ⊲ Is free?
              return obj                       ⊲ Yes ⇒ return object
          if IS_COPY(ptr-copy) then            ⊲ Already a copy?
              return obj                       ⊲ Yes ⇒ return object
          thr-id ← GET_THREAD_ID(ptr-copy)
          if thr-id = ctx.thr-id then          ⊲ Locked by us?
              return ptr-copy                  ⊲ Yes ⇒ return copy
          other-ctx ← GET_CTX(thr-id)          ⊲ No ⇒ check for steal
          if other-ctx.write-clock ≤ ctx.local-clock then
              return ptr-copy                  ⊲ Stealing ⇒ return copy
          return obj                           ⊲ No stealing ⇒ return object

  22. RLU deferring: 1. On commit, do not increment the global clock or execute RLU synchronize; 2. instead, save the write-log and create a new log for the next write; 3. synchronize only when a writer tries to lock an object that is already locked.

  23. RLU deferring advantages: 1. Fewer RLU synchronize calls; 2. less contention on the global clock; 3. less stealing, hence fewer cache misses.

  24. Table of Contents Introduction RLU design Evaluation Conclusion

  25. Linked lists. [Figure 4. Throughput (operations/µs) for a user-space linked list (1,000 nodes) with 2% (left), 20% (middle), and 40% (right) updates, 4-16 threads; compared: RCU, Harris, Harris (HP), RLU (leaky).]

  26. Hash table. [Figure 5. Throughput (operations/µs) for a user-space hash table (1,000 buckets of 100 nodes) with 2% (left), 20% (middle), and 40% (right) updates, 4-16 threads; compared: RCU, Harris, Harris (HP), RLU (leaky), RLU (defer).]

  27. Resizable hash table. [Figure 6. Throughput (operations/µs) for a resizable hash table (64K items, 8-16K buckets), 1-14 threads; variants: RCU 8K, RCU 16K, RCU 8-16K, RLU 8K, RLU 16K, RLU 8-16K.]

  28. Update-only stress test (hash table). [Figure 7. Throughput (operations/µs) for the stress test on a hash table (10,000 buckets of 1 node) with 100% updates and a single item per bucket, 1-16 threads; compared: RCU, RLU, RLU (defer).]

  29. Citrus search tree (throughput). [Diagram: throughput (operations/µs) for a Citrus tree (100,000 nodes), 1-80 threads, with 10%, 20%, and 40% updates; compared: RCU vs. RLU at each update rate.]
