Read-Copy-Update for HelenOS
Martin Děcký, FOSDEM 2014, February 2nd 2014
decky@d3s.mff.cuni.cz
http://d3s.mff.cuni.cz
Charles University in Prague, Faculty of Mathematics and Physics

Introduction
HelenOS
  Microkernel multiserver operating system
  Relying on asynchronous IPC mechanism
    Major motivation for scalable concurrent algorithms and data structures
Martin Děcký
  Researcher in computer science (operating systems)
  Not an expert on concurrent algorithms
    But very lucky to be able to cooperate with hugely talented people in this area
→ The state of the shared communication facilities needs to be protected by explicit synchronization means
← In order to counterweight the overhead of the communication by doing
← In order to avoid limiting the achievable degree of concurrency
For accessing shared data structures
Mutual exclusion synchronization
  Temporal separation of scheduling entities
  Typical means
    Disabling preemption, Dekker's algorithm, direct use of atomic test-and-set operations, etc.
  Typical mechanisms
    Locks, semaphores, condition variables, etc.
  [+] Relatively intuitive semantics, well-known characteristics
  [-] Overhead, restriction of concurrency, deadlocks
Non-blocking synchronization
  Replace temporal separation by sophisticated means that guarantee logical consistency
  Typical means
    Atomic writes, direct use of atomic read-modify-write operations, etc.
  Typical mechanisms
    Transactional memory, hazard pointers, Read-Copy-Update, etc.
  [+] Reasonable (almost no) overhead and restriction of concurrency in favorable cases, guarantee of progress
  [-] Less intuitive semantics, sometimes non-trivial characteristics, non-favorable cases, livelocks
Wait-freedom
  Guaranteed system-wide progress and starvation-freedom (all operations are finitely bounded)
  Wait-free algorithms always exist [1], but the performance of general methods is usually inferior to blocking algorithms
  Wait-free queue by Kogan & Petrank [2]
Lock-freedom
  Guaranteed system-wide progress, but individual threads can starve
  Four phases: data operation, assisting obstruction, aborting obstruction, waiting
Obstruction-freedom
  Guaranteed single-thread progress if isolated for a bounded time (obstructing threads need to be suspended)
Individual instance of usage
  E.g. non-blocking list implementation using atomic pointer writes
Generic reusable pattern
  E.g. non-blocking list implementation using Read-Copy-Update
Non-blocking synchronization mechanism
  Targeting synchronization of read-mostly pointer-based data structures with immutable values
    Favorable case: R/W ratio of ~ 10:1 (but even 1:1 is achievable)
  Unlimited number of readers without blocking (not waiting for other readers or writers)
  Little overhead on the reader side (smaller than taking an uncontended lock)
  Readers have to tolerate "stale" data and late updates
  Readers have to observe "safe" access patterns
  Synchronization among writers out of scope of the mechanism
  Optional provisions for asynchronous reclamation
Read-side critical section
  Delimited by read_lock() and read_unlock() (non-blocking)
  Protected data can be referenced only inside the critical section
  Safe access() methods for reading pointers
    Avoiding unsafe compiler optimizations (reloading the pointer)
    Not necessary for reading values
Quiescent state (a thread outside a critical section)
Grace period (all threads pass through a quiescent state)
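The terms above can be made concrete with a small model. The following is only a sketch under stated assumptions: the `Reader` class, its `read_section()` helper and the `grace_period_elapsed()` function are hypothetical names invented for illustration, not part of any HelenOS interface, and real implementations also need memory barriers that plain Python does not express.

```python
from contextlib import contextmanager


class Reader:
    """Hypothetical model of one reader thread (not a HelenOS API)."""

    def __init__(self):
        self.nesting = 0  # depth of nested read-side critical sections

    @contextmanager
    def read_section(self):
        # read_lock(): entering never blocks, it only bumps a counter.
        self.nesting += 1
        try:
            yield
        finally:
            # read_unlock(): leaving the outermost section makes the
            # reader quiescent again.
            self.nesting -= 1

    @property
    def quiescent(self):
        return self.nesting == 0


def grace_period_elapsed(seen_quiescent, readers):
    """A grace period is over once every reader was seen quiescent."""
    return all(reader in seen_quiescent for reader in readers)
```

A reader that nests two sections stays non-quiescent until the outer one is left, which is why only the outermost unlock matters for grace period detection.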
Synchronous write-side update
  Atomically unlinking an old element
  Calling synchronize()
    Blocks until a grace period elapses (all readers pass a quiescent state, no longer referencing the unlinked data)
    Possibility to reclaim or free the unlinked data
  Inserting a new element using safe assign()
    Avoiding unsafe compiler optimizations and store reordering on weakly ordered architectures
[Figure: list head -> v0 -> v1]
I. Atomic pointer update to remove the element with v0 from the list
II. Blocking on synchronize()
During the grace period, preexisting readers can still access the "stale" element with v0
[Figure: list head -> v2 -> v1]
III. No reader can reference the element with v0 anymore, so it can be reclaimed
A new element with v2 can be atomically inserted
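The three steps can be replayed on a toy list. This is a minimal single-threaded sketch, assuming a single writer; `RcuList`, `active_readers` and `reclaim_and_insert()` are hypothetical names, and `synchronize()` here only reports whether readers are still present instead of blocking as a real implementation would. The model is also stricter than real RCU, because it waits for every active reader rather than only for those that predate the unlink.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


class RcuList:
    """Toy single-writer list; grace periods are modeled explicitly."""

    def __init__(self, values):
        self.head = None
        for v in reversed(values):
            self.head = Node(v, self.head)
        self.active_readers = set()  # readers inside a critical section
        self.reclaimed = []          # values of elements that were "freed"

    def values(self):
        out, node = [], self.head
        while node is not None:
            out.append(node.value)
            node = node.next
        return out

    def unlink_head(self):
        # Step I: a single pointer store removes the first element.
        old = self.head
        self.head = old.next
        return old

    def synchronize(self):
        # Step II: report whether all readers have left (no blocking here).
        return not self.active_readers

    def reclaim_and_insert(self, old, new_value):
        # Step III: only valid once the grace period has elapsed.
        assert self.synchronize(), "grace period still in progress"
        self.reclaimed.append(old.value)
        self.head = Node(new_value, self.head)
```

The unlinked element keeps its `next` pointer, so a reader that already holds a reference to it can finish its traversal during the grace period.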
Asynchronous write-side update
  Using call()
    Non-blocking operation registering a callback
    Callback is executed after a grace period elapses
  Using barrier()
    Waiting for all queued asynchronous callbacks
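A rough model of the asynchronous interface is sketched below. It is an illustration only: `CallbackQueue` and `end_grace_period()` are hypothetical, and the model drives grace periods by hand, whereas a real implementation has a detector that ends them and a `barrier()` that sleeps.

```python
from collections import deque


class CallbackQueue:
    """Toy model of deferred reclamation; names are illustrative only."""

    def __init__(self):
        self.current_gp = 0
        self.pending = deque()  # (grace period to wait for, callback)

    def call(self, callback):
        # Non-blocking: remember the callback together with the grace
        # period that has to elapse before it may run.
        self.pending.append((self.current_gp + 1, callback))

    def end_grace_period(self):
        # Invoked by whatever detects grace periods in a real system.
        self.current_gp += 1
        while self.pending and self.pending[0][0] <= self.current_gp:
            _, callback = self.pending.popleft()
            callback()

    def barrier(self):
        # Return only after every queued callback has run.
        while self.pending:
            self.end_grace_period()
```

Because `call()` never waits, it is the form of update that can be used from contexts that must not block.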
Cornerstone of any RCU algorithm
Implicit trade-off between precision and overhead
  Any extension of a grace period is also a grace period
  Long (imprecise) grace periods
    Blocking synchronous writers for a longer time
    Increasing memory usage due to unreclaimed elements
  Short (precise) grace periods
    Increasing overhead on the reader side (need for memory barriers, atomic
  Usual compromise
    Identifying naturally occurring quiescent states for the given RCU algorithm
      Context switches, exceptions (timer ticks), etc.
Foundation for a scalable concurrent data structure
Developing a microkernel-specific RCU algorithm
  Specific requirements, constraints and use cases
  Last well-known RCU implementation for a microkernel in 2003 (K42)
AP-RCU
  Non-intrusive, portable RCU algorithm
  Developed and implemented by Andrej Podzimek for UTS (OpenSolaris) [3] [4]
AH-RCU
  Inspired by AP-RCU and several other RCU algorithms
  Developed and implemented by Adam Hraška for SPARTAN (HelenOS) [7]
  Foundation for the Concurrent Hash Table in HelenOS [8]
  Additional variants (preemptible AP-RCU, user space RCU)
The RCU algorithm must not impose design concepts of legacy systems on HelenOS
  E.g. a specific way in which the timer interrupt handler is implemented
The kernel space RCU algorithm must support
  Read-side critical sections in interrupt and exception handlers
  Asynchronous reclamation (call()) in interrupt and exception handlers
  Read-side critical sections with preemption enabled (not affecting scheduling latency)
Concurrent Hash Table implementation
  Growing and shrinking
  Interrupt and non-maskable interrupt tolerant
Suitable for a global page hash table
  Concurrent reads with low overhead
  Concurrent inserts and deletes
Basic characteristics
  Kernel space algorithm
  Read-side critical sections are preemptible (without loss of performance)
    Multiple read-side critical sections within a time slice
    Expensive operations when a thread was preempted do not do much harm
  Support for asynchronous reclamation in interrupt and exception handlers
  No reliance on a periodic timer
Grace period detection
  Test if all CPUs passed a quiescent state
    Sending an interprocessor interrupt (IPI) to each CPU
      If the interrupt handler detects a nesting count of 0, it issues a memory barrier (representing a natural quiescent state)
    Avoid sending IPI if context switch is detected
  Detect any preempted readers holding up the current grace period
    Sleep and wait for the last preempted reader holding up the grace period to wake the detector thread
Advantages
  Low overhead and preemptible read-side critical section, suitable for exception handlers
  No regular sampling
Disadvantages
  Polling CPUs using interprocessor interrupts might be disruptive in large systems
Basic characteristics
  Inspired by Triplett's relativistic hash table [5] and Michael's lock-free lists [6]
  Hash collisions resolved using separate RCU-protected bucket lists
  Buckets organized as lock-free lists without hazard pointers
    RCU still protects against accessing invalid pointers and the ABA problem
  Concurrent lookups and concurrent modifications
    Tolerance for nested concurrent modifications from interrupt and exception handlers
  Growing and shrinking using background resizing by a factor of 2
    Concurrent with lookups and updates
    Requires four grace periods
  Deferred element freeing using RCU call()
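Resizing by a factor of 2 has a convenient property: every element of an old bucket lands in one of exactly two new buckets. The sketch below shows only that sequential arithmetic, assuming buckets are plain lists of keys selected by `hash(key) % size`; the helper names are hypothetical, and none of the concurrency (lock-free bucket lists, the four grace periods) of the actual hash table is modeled.

```python
def bucket_index(key, size):
    """Hypothetical bucket selection: hash modulo table size."""
    return hash(key) % size


def grow(buckets):
    """Return a table twice as large; buckets are plain lists of keys."""
    new_size = 2 * len(buckets)
    new_buckets = [[] for _ in range(new_size)]
    for bucket in buckets:
        for key in bucket:
            new_buckets[bucket_index(key, new_size)].append(key)
    return new_buckets


def split_targets(old_index, old_size):
    """The only two new buckets that old bucket `old_index` can feed."""
    return (old_index, old_index + old_size)
```

For a table of 4 buckets, the keys of old bucket 1 can only move to new buckets 1 and 5, which is what makes it possible to split one bucket at a time.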
read_lock():
    disable_preemption()
    check_qs()
    cpu.nesting_cnt++

read_unlock():
    cpu.nesting_cnt--
    check_qs()
    enable_preemption()

check_qs():
    if (cpu.nesting_cnt == 0) {
        if (cpu.last_seen_gp != cur_gp) {
            gp = cur_gp
            memory_barrier()
            cpu.last_seen_gp = gp
        }
    }
Note: Writer forces a context switch on CPUs where no read-side critical section was observed for a while.
Note: Except memory_barrier()
The first reader to notice the start of a new grace period
Once all CPUs announce a quiescent state or perform a context switch (a naturally occurring quiescent state due to disabled preemption), the grace period ends.
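The interplay of cur_gp and last_seen_gp in the read-side pseudocode can be traced with a single-threaded model. This is a sketch under simplifying assumptions: preemption control and memory barriers are left out, the `Rcu` and `Cpu` classes with `start_grace_period()` and `grace_period_ended()` are hypothetical helpers, and a context switch is modeled as directly announcing a quiescent state.

```python
class Cpu:
    def __init__(self):
        self.nesting_cnt = 0
        self.last_seen_gp = 0


class Rcu:
    """Single-threaded model; barriers and preemption are omitted."""

    def __init__(self, n_cpus):
        self.cur_gp = 0
        self.cpus = [Cpu() for _ in range(n_cpus)]

    def check_qs(self, cpu):
        # Only the outermost lock/unlock announces a quiescent state.
        if cpu.nesting_cnt == 0 and cpu.last_seen_gp != self.cur_gp:
            cpu.last_seen_gp = self.cur_gp

    def read_lock(self, cpu):
        self.check_qs(cpu)
        cpu.nesting_cnt += 1

    def read_unlock(self, cpu):
        cpu.nesting_cnt -= 1
        self.check_qs(cpu)

    def context_switch(self, cpu):
        # Readers run with preemption disabled, so a context switch can
        # only happen outside a critical section.
        assert cpu.nesting_cnt == 0
        cpu.last_seen_gp = self.cur_gp

    def start_grace_period(self):
        self.cur_gp += 1

    def grace_period_ended(self):
        return all(c.last_seen_gp == self.cur_gp for c in self.cpus)
```

A CPU that was already inside a critical section when the grace period started holds it open until its outermost read_unlock(); nested lock and unlock calls in between change nothing.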
read_lock():
    disable_preemption()
    if (thread.nesting_cnt == 0)
        record_qs()
    thread.nesting_cnt++
    enable_preemption()

read_unlock():
    disable_preemption()
    thread.nesting_cnt--
    if (thread.nesting_cnt == 0) {
        record_qs()
        if ((thread.was_preempted) || (cpu.is_delaying_gp))
            signal_unlock()
    }
    enable_preemption()

record_qs():
    if (cpu.last_seen_gp != cur_gp) {
        gp = cur_gp
        memory_barrier()
        cpu.last_seen_gp = gp
    }

signal_unlock():
    if (atomic_exchange(cpu.is_delaying_gp, false) == true)
        remaining_readers_semaphore.up()
    if (atomic_exchange(thread.was_preempted, false) == true) {
        preempt_mutex.lock()
        preempted_list.remove(thread)
        if ((is_empty(cpu.cur_preempted)) && (preempted_blocking_gp))
            remaining_readers_semaphore.up()
        preempt_mutex.unlock()
    }
read_lock():
    thread.nesting_cnt++
    compiler_barrier()

read_unlock():
    compiler_barrier()
    thread.nesting_cnt--
    if (thread.nesting_cnt == was_preempted)
        preempted_unlock()

preempted_unlock():
    // avoid race between thread and interrupt handler
    if (atomic_exchange(thread.nesting_cnt, 0) == was_preempted) {
        preempt_lock.lock()
        preempted_list.remove(thread)
        if ((is_empty(cpu.cur_preempted)) && (detection_waiting))
            detection_semaphore.up()  // notify the detector thread about the grace period
        preempt_lock.unlock()
    }

Note: Except preempted_unlock()
synchronize():
    memory_barrier()
    mutex.lock()
    cur_gp++  // start a new grace period
    reader_cpus = []
    // gather CPUs potentially in read-side CS
    for each cpu in cpus {
        if ((!cpu.idle) && (cpu.last_seen_gp != cur_gp)) {
            cpu.last_ctx_switch_cnt = cpu.ctx_switch_cnt
            reader_cpus += cpu
        }
    }
    wait(1 ms)  // longest acceptable grace period duration (tunable)
    for each cpu in reader_cpus {
        // enforce a quiescent state
        if ((!cpu.idle) && (cpu.last_seen_gp != cur_gp) &&
            (cpu.last_ctx_switch_cnt == cpu.ctx_switch_cnt))
            cpu.ctx_switch_force_wait()
    }
    mutex.unlock()
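The two loops of synchronize() boil down to a filter over per-CPU state, which the following sketch isolates. It is an illustration only: `CpuState`, `gather_reader_cpus()` and `cpus_to_force()` are hypothetical names, the 1 ms wait is represented by the caller mutating the state between the two calls, and no locking or forced context switch is actually performed.

```python
from dataclasses import dataclass


@dataclass
class CpuState:
    idle: bool
    last_seen_gp: int
    ctx_switch_cnt: int


def gather_reader_cpus(cpus, cur_gp):
    """First loop: snapshot CPUs that might be in a read-side CS."""
    return {i: c.ctx_switch_cnt for i, c in enumerate(cpus)
            if not c.idle and c.last_seen_gp != cur_gp}


def cpus_to_force(cpus, cur_gp, snapshot):
    """Second loop: CPUs that still need a forced context switch."""
    return [i for i, saved_cnt in snapshot.items()
            if not cpus[i].idle
            and cpus[i].last_seen_gp != cur_gp
            and cpus[i].ctx_switch_cnt == saved_cnt]
```

Idle CPUs and CPUs that have already seen the new grace period are skipped in the first pass, and a CPU that context-switched or announced a quiescent state during the wait is skipped in the second, so only CPUs that showed no sign of progress are disturbed.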
detector_thread:
    forever {
        wait_for_callbacks()
        // run callbacks added before the current grace period
        execute_callbacks()
        // push callbacks registered since last processing to the queue
        advance_callbacks()
        wait_for_gp_end()
    }
wait_for_gp_end():
    gp_mutex.lock()
    if (completed_gp != cur_gp) {
        // a grace period is already in progress
        wait_for_gp_end_signal()
        goto out
    } else {
        // start a new grace period
        preempt_lock.lock()
        cur_gp++
        preempt_lock.unlock()
    }
    gp_mutex.unlock()
    wait_for_readers()
    gp_mutex.lock()
    completed_gp = cur_gp
out:
    gp_mutex.unlock()
[Figure: list traversals per second (y-axis) against number of threads, 1 to 5 (x-axis); series: ideal, ah-rcu, pap-rcu, spinlock]
Read-side critical section scalability: Traversal of a five-element list
The list is protected as a whole; it is only read, never modified.
Write-side overhead: Different ratios of updates vs. lookups
Five-element list, four threads running in parallel. Updates are always synchronized by a spinlock.
[Figure: operations per second (y-axis) against % (x-axis, 10 to 100); series: ah-rcu + spinlock, pap-rcu + spinlock, spinlock]
Read-side scalability vs. write-side overhead: Crossover point
Data points from the previous figure with a low fraction of updates are discarded.
[Figure: operations per second (y-axis) against % (x-axis, 5 to 100); series: ah-rcu + spinlock, pap-rcu + spinlock, spinlock]
Concurrent hash table lookup scalability
128 buckets, average load factor of 4 elements per bucket, 50 % of lookups for hitting keys, 50 % of lookups for missing keys (each thread used a separate list). The resize condition was checked, but never executed.
[Figure: lookups per second (y-axis) against number of threads, 1 to 5 (x-axis); series: cht + ah-rcu, ht + bucket spinlocks, ht + global spinlock]
Concurrent hash table update overhead: Different ratios of concurrent updates vs. lookups
Four threads running in parallel.
[Figure: operations per second (y-axis) against % (x-axis, 10 to 80); series: cht + ah-rcu, ht + bucket spinlocks, ht + global spinlock]
Novel scalable algorithms
  Preemptible AP-RCU for HelenOS
  Preemptible AH-RCU for HelenOS
  Resizable Concurrent Hash Table for HelenOS
    Suitable as a basic data structure for asynchronous HelenOS IPC
    Suitable for other kernel uses (e.g. global page table)
Thorough evaluation
  Promising behavior
[1] Herlihy M. P.: Impossibility and universality results for wait-free synchronization, in Proceedings of 7th Annual ACM Symposium on Principles of Distributed Computing, ACM, 1988
[2] Kogan A., Petrank E.: Wait-free queues with multiple enqueuers and dequeuers, in Proceedings of 16th ACM Symposium on Principles and Practice of Parallel Programming, ACM, 2011
[3] Podzimek A., Děcký M., Bulej L., Tůma P.: A Non-Intrusive Read-Copy-Update for UTS, in Proceedings of 18th IEEE International Conference on Parallel and Distributed Systems, IEEE, 2012, http://d3s.mff.cuni.cz/publications/download/PodzimekDeckyBulejTuma-ICPADS-2012.pdf
[4] http://d3s.mff.cuni.cz/software/rcu/rcu.patch
[5] Triplett J., McKenney P. E., Walpole J.: Resizable, scalable, concurrent hash tables via relativistic programming, in Proceedings of the 2011 USENIX Annual Technical Conference, ACM, 2011
[6] Michael M. M.: High performance dynamic lock-free hash tables and list-based sets, in Proceedings of 14th Annual ACM Symposium on Parallel Algorithms and Architectures, ACM, 2002
[7] Hraška A.: Read-Copy-Update for HelenOS, master thesis, Charles University in Prague, 2013, http://www.helenos.org/doc/theses/ah-thesis.pdf
[8] https://code.launchpad.net/~adam-hraska+lp/helenos/cht-bench