An HTM-Based Update-side Synchronization for RCU on NUMA systems



slide-1
SLIDE 1

An HTM-Based Update-side Synchronization for RCU on NUMA systems

SeongJae Park, Paul E. McKenney, Laurent Dufour, Heon Y. Yeom

slide-2
SLIDE 2

Disclaimer

  • This work was done prior to the first author joining Amazon and while the second author was at IBM
  • The views expressed herein are those of the authors; they do not reflect the views of their employers

slide-3
SLIDE 3

The World Is In NUMA/Multi-CPU Era

  • More than a decade ago, the world moved into the multi-CPU era
  • Nowadays, huge NUMA systems running hundreds of threads are common
  • Efficient synchronization primitives are the key to performance and scalability

https://www.karlrupp.net/wp-content/uploads/2015/06/35years.png

slide-4
SLIDE 4

RCU: Read-Copy Update

  • A synchronization mechanism for read-mostly workloads
  • Provides almost ideal performance and scalability for reads

https://static.lwn.net/images/ns/kernel/rcu/GracePeriodGood.png

slide-5
SLIDE 5

RCU-protected Linked List: Reading Items

[Diagram: linked list A → B → C, accessed by readers and updaters]

Readers do nothing special except announce the start and completion of each read. They just traverse the list.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.

slide-6
SLIDE 6

RCU-protected Linked List: Deletion of B

[Diagram: linked list A → B → C; an updater unlinks B while readers traverse the list]

An updater:

    lock(update_lock);
    a->next = c;
    unlock(update_lock);

lock() is required to avoid races between concurrent updates. Using this global lock becomes the scalability bottleneck.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.

slide-7
SLIDE 7

RCU-protected Linked List: Deletion of B

[Diagram: linked list A → B → C after the updater has unlinked B; pre-existing readers and new readers are shown separately]

An updater:

    lock(update_lock);
    a->next = c;
    unlock(update_lock);

lock() is required to avoid races between concurrent updates. Using this global lock becomes the scalability bottleneck.

Now there are pre-existing readers and new readers.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.

slide-8
SLIDE 8

RCU-protected Linked List: Deletion of B

[Diagram: linked list A, B, C with pre-existing readers, new readers, and the updater]

The updater waits until the pre-existing readers complete.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.

slide-9
SLIDE 9

RCU-protected Linked List: Deletion of B

[Diagram: linked list A, B, C with new readers and the updater]

Now nobody can see B.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.

slide-10
SLIDE 10

RCU-protected Linked List: Deletion of B

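[Diagram: linked list A, B, C with new readers and the updater]

It is now safe to reuse B:

    free(B);

This scheme is called QSBR (Quiescent State Based Reclamation).

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.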
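The deletion steps on the preceding slides (unlink B under the global update lock, wait for pre-existing readers, then free B) can be sketched as a toy user-space QSBR. This is a minimal illustration under stated assumptions, not the kernel's RCU: the fixed reader-slot table, the epoch counter, and every function name here (quiescent_state, reader_offline, wait_for_readers, delete_next) are hypothetical, and a simple spinlock stands in for update_lock.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_READERS 8 /* hypothetical fixed number of reader slots */

struct node {
    int key;
    struct node *_Atomic next;
};

static atomic_flag update_lock = ATOMIC_FLAG_INIT; /* the global updater lock */
static atomic_ulong global_epoch = 1;
/* Per-reader epoch; 0 means the slot is unused or the reader is offline. */
static atomic_ulong reader_epoch[MAX_READERS];

static void spin_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set(l))
        sched_yield();
}

static void spin_unlock(atomic_flag *l)
{
    atomic_flag_clear(l);
}

/* Reader: announce that it holds no references (also used to go online). */
static void quiescent_state(int slot)
{
    assert(slot >= 0 && slot < MAX_READERS);
    atomic_store(&reader_epoch[slot], atomic_load(&global_epoch));
}

/* Reader: stop participating, so updaters never wait for this slot. */
static void reader_offline(int slot)
{
    assert(slot >= 0 && slot < MAX_READERS);
    atomic_store(&reader_epoch[slot], 0);
}

/* Reader: plain traversal, with no locks and no version checks. */
static bool list_contains(struct node *head, int key)
{
    for (struct node *n = head; n != NULL; n = atomic_load(&n->next))
        if (n->key == key)
            return true;
    return false;
}

/* Updater: wait until every online reader has passed a quiescent state. */
static void wait_for_readers(void)
{
    unsigned long target = atomic_fetch_add(&global_epoch, 1) + 1;
    for (int i = 0; i < MAX_READERS; i++) {
        unsigned long seen;
        while ((seen = atomic_load(&reader_epoch[i])) != 0 && seen < target)
            sched_yield();
    }
}

/* Updater: unlink and free the successor of prev (B in A -> B -> C). */
static bool delete_next(struct node *prev)
{
    spin_lock(&update_lock);
    struct node *victim = atomic_load(&prev->next);
    if (victim == NULL) {
        spin_unlock(&update_lock);
        return false;
    }
    atomic_store(&prev->next, atomic_load(&victim->next)); /* a->next = c */
    spin_unlock(&update_lock);

    wait_for_readers(); /* pre-existing readers may still see victim */
    free(victim);       /* nobody can see it any more */
    return true;
}
```

In this sketch a reader thread has to call quiescent_state() between traversals, or reader_offline() before it blocks; otherwise an updater in wait_for_readers() spins forever. Note that every updater still serializes on the single update_lock, which is the bottleneck the slides point out.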

slide-11
SLIDE 11

Lack of RCU-centric update-side synchronization

  • Intended design

○ Let users select or implement the update-side synchronization mechanism that suits them best

  • However, many users simply use global locking

○ Simple to apply, but it causes a scalability problem
○ To mitigate this problem, several RCU extensions have been proposed

slide-12
SLIDE 12

Read-Log-Update (RLU)

  • Published in SOSP’15[1]
  • Adopts a software transactional memory (STM) like logging mechanism

[1] Matveev, Alexander, et al. "Read-log-update: a lightweight synchronization mechanism for concurrent programming." Proceedings of the 25th Symposium on Operating Systems Principles. ACM, 2015.

slide-13
SLIDE 13

RLU-protected Linked List: Reading Items

[Diagram: linked list A → B → C, accessed by readers and updaters]

RLU readers must find the proper version of each item, in addition to announcing their start and completion.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.

slide-14
SLIDE 14

RLU-protected Linked List: Deletion of B

[Diagram: linked list A → B → C, plus a new version A’ created by the updater]

An updater:

    rlu_lock();
    create new version A’;
    rlu_unlock();

RLU-lock critical sections are similar to STM transactions: if one conflicts with another, it aborts.

RLU readers must find the proper version of each item, in addition to announcing their start and completion.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.

slide-15
SLIDE 15

RLU-protected Linked List: Deletion of B

[Diagram: linked list A → B → C and the new version A’; pre-existing readers and new readers are shown separately. A reader says: "Oh, this is not the version for me!"]

An updater:

    rlu_lock();
    create new version A’;
    rlu_unlock();

Reader-updater conflicts are avoided because readers search for the valid version by themselves.

RLU-lock critical sections are similar to STM transactions: if one conflicts with another, it aborts.

RLU readers must find the proper version of each item, in addition to announcing their start and completion.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.

slide-16
SLIDE 16

RLU-protected Linked List: Deletion of B

[Diagram: linked list A → B → C and the new version A’; pre-existing readers and new readers are shown separately. A reader says: "Oh, this is not the version for me!"]

The updater waits until the pre-existing readers complete.

RLU readers must find the proper version of each item, in addition to announcing their start and completion.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.

slide-17
SLIDE 17

RLU-protected Linked List: Deletion of B

[Diagram: A’ has taken A’s place in the list; A and B are detached]

The updater swaps A and A’. Readers can now access A’ and C without referencing A, so it is safe to reuse A and B.

RLU readers must find the proper version of each item, in addition to announcing their start and completion.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.

slide-18
SLIDE 18

RLU-protected Linked List: Deletion of B

[Diagram: linked list A’ → C; A and B are detached]

An updater:

    free(A);
    free(B);

RLU readers must find the proper version of each item, in addition to announcing their start and completion.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.

slide-19
SLIDE 19

RCU-HTM

  • Published in PACT’17[1]
  • Encapsulates each update in an HTM transaction

[1] Siakavaras, Dimitrios, et al. "RCU-HTM: combining RCU with HTM to implement highly efficient concurrent binary search trees." Parallel Architectures and Compilation Techniques (PACT), 2017 26th International Conference on. IEEE, 2017.

slide-20
SLIDE 20

RCU-HTM-protected Linked List: Reading Items

[Diagram: linked list A → B → C, accessed by readers and updaters]

Readers do nothing special except announce the start and completion of each read. They just traverse the list.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.

slide-21
SLIDE 21

RCU-HTM-protected Linked List: Deletion of B

[Diagram: linked list A → B → C; an updater unlinks B; pre-existing readers and new readers are shown separately]

An updater:

    begin_htm_trx();
    a->next = c;
    commit_htm_trx();

Data updates are encapsulated in an HTM transaction; HTM guarantees consistency and scalability.

Everything else is the same as in QSBR: wait until it is safe, then deallocate.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.
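One way the begin_htm_trx() / commit_htm_trx() pair from the slide could be realized is the common lock-elision pattern shown below: try a hardware transaction a few times, and fall back to an ordinary lock if it keeps aborting. This is a hedged sketch, not the RCU-HTM authors' code. It assumes Intel RTM intrinsics (used only when the compiler defines __RTM__, for example with -mrtm); the retry budget, the fallback lock, and the function signatures are assumptions made for illustration. Without RTM support the sketch always takes the fallback lock.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#ifdef __RTM__
#include <immintrin.h> /* _xbegin, _xend, _xabort */
#endif

#define MAX_HTM_RETRIES 8 /* hypothetical retry budget */

struct node {
    int key;
    struct node *_Atomic next;
};

/* Taken only when HTM is unavailable or keeps aborting. */
static atomic_bool fallback_lock;

/* Returns true if the caller is now inside a hardware transaction,
 * false if it holds the fallback lock instead. */
static bool begin_htm_trx(void)
{
#ifdef __RTM__
    for (int i = 0; i < MAX_HTM_RETRIES; i++) {
        if (_xbegin() == _XBEGIN_STARTED) {
            /* Reading the lock puts it in the read set, so a later
             * fallback acquisition aborts this transaction. */
            if (!atomic_load_explicit(&fallback_lock, memory_order_relaxed))
                return true;
            _xabort(0xff);
        }
    }
#endif
    bool expected = false;
    while (!atomic_compare_exchange_weak(&fallback_lock, &expected, true)) {
        expected = false;
        sched_yield();
    }
    return false;
}

static void commit_htm_trx(bool in_trx)
{
#ifdef __RTM__
    if (in_trx) {
        _xend();
        return;
    }
#endif
    (void)in_trx;
    atomic_store(&fallback_lock, false);
}

/* Unlink the successor of prev and return it; the caller must wait for
 * pre-existing readers (as in QSBR) before reusing the returned node. */
static struct node *unlink_next(struct node *prev)
{
    assert(prev != NULL);
    bool in_trx = begin_htm_trx();
    struct node *victim =
        atomic_load_explicit(&prev->next, memory_order_relaxed);
    if (victim != NULL) {
        struct node *succ =
            atomic_load_explicit(&victim->next, memory_order_relaxed);
        atomic_store_explicit(&prev->next, succ, memory_order_release);
    }
    commit_htm_trx(in_trx);
    return victim;
}
```

Because the transaction's working set includes the list nodes themselves, concurrent readers touching the same cache lines can abort it, which is the effect the later slides describe on NUMA machines.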

slide-22
SLIDE 22

Will Those Scale On NUMA Machines?

  • Neither RLU nor RCU-HTM had been evaluated on a large NUMA machine

○ RLU was evaluated on a single-socket machine using 16 threads
○ RCU-HTM was evaluated on a single-socket machine using 44 threads

  • Server: 4 sockets, 18 cores per socket, hyper-threaded (144 h/w threads in total)

○ Every following evaluation uses this server

  • Workload: random reads, inserts, and deletes on kernel-space linked lists

○ The linked lists are protected by RCU, RLU, and RCU-HTM, respectively
○ 256 initial items pre-loaded (sufficient to scale with 144 threads)
○ Measure operations per second while varying the number of threads and the update rate

slide-23
SLIDE 23

Unexpectedly Poor Scalability Revealed

  • RLU imposes significant overhead on reads
  • With updates, RLU and RCU-HTM degrade once multiple NUMA nodes are used

slide-24
SLIDE 24

Root Causes and Implications of the Results

  • RLU’s read overhead apparently comes from searching for the valid version

○ Read-mostly, performance-sensitive workloads would not choose RLU over RCU!

  • The NUMA-oblivious designs of RLU and RCU-HTM degrade update scalability
  • For RCU-HTM, the amplification of HTM aborts on NUMA hurts performance

○ Long inter-node latency makes transactions run longer and thus easier to abort
○ The dominant readers conflict with the HTM transactions of update threads and abort them

  • The benefit of HTM is clear; we need NUMA-aware use of HTM for read-mostly workloads

          Read                                      Update on single NUMA node             Update on multiple NUMA nodes
RCU       Almost ideal                              Bad (global locking)                   Bad (global locking)
RLU       Far from ideal (version check overhead)   Good                                   Bad (NUMA oblivious)
RCU-HTM   Almost ideal                              Best (no software locking overhead)    Horrible (HTM abort amplification)

slide-25
SLIDE 25

Our Design Principles for New RCU Extension

We design a new RCU extension called RCX, following our principles

slide-26
SLIDE 26

Our Design Principles for New RCU Extension

We design a new RCU extension called RCX, following our principles

1. Do fine-grained update-side synchronization

Principle   #1   #2   #3   #4    #5
RCU         X    O    X    N/A   N/A
RLU         O    X    X    N/A   N/A
RCU-HTM     O    O    O    X     X

slide-27
SLIDE 27

Our Design Principles for New RCU Extension

We design a new RCU extension called RCX, following our principles

1. Do fine-grained update-side synchronization
2. Use the pure RCU read mechanism for ideal read performance and scalability

Principle   #1   #2   #3   #4    #5
RCU         X    O    X    N/A   N/A
RLU         O    X    X    N/A   N/A
RCU-HTM     O    O    O    X     X

slide-28
SLIDE 28

Our Design Principles for New RCU Extension

We design a new RCU extension called RCX, following our principles

1. Do fine-grained update-side synchronization
2. Use the pure RCU read mechanism for ideal read performance and scalability
3. Use HTM; only HTM provides H/W-oriented high performance

Principle   #1   #2   #3   #4    #5
RCU         X    O    X    N/A   N/A
RLU         O    X    X    N/A   N/A
RCU-HTM     O    O    O    X     X

slide-29
SLIDE 29

Our Design Principles for New RCU Extension

We design a new RCU extension called RCX, following our principles

1. Do fine-grained update-side synchronization
2. Use the pure RCU read mechanism for ideal read performance and scalability
3. Use HTM; only HTM provides H/W-oriented high performance
4. Access only NUMA-local data objects within an HTM transaction

a. Otherwise, abort rates increase exponentially

Principle   #1   #2   #3   #4    #5
RCU         X    O    X    N/A   N/A
RLU         O    X    X    N/A   N/A
RCU-HTM     O    O    O    X     X

slide-30
SLIDE 30

Our Design Principles for New RCU Extension

We design a new RCU extension called RCX, following our principles

1. Do fine-grained update-side synchronization
2. Use the pure RCU read mechanism for ideal read performance and scalability
3. Use HTM; only HTM provides H/W-oriented high performance
4. Access only NUMA-local data objects within an HTM transaction

a. Otherwise, abort rates increase exponentially

5. Isolate the HTM working set from the dominant readers

a. Otherwise, the readers abort HTM transactions

Principle   #1   #2   #3   #4    #5
RCU         X    O    X    N/A   N/A
RLU         O    X    X    N/A   N/A
RCU-HTM     O    O    O    X     X

slide-31
SLIDE 31

RCX Interface

[Diagram: linked list A → B → C, accessed by readers and updaters]

Readers do nothing special except announce the start and completion of each read. They just traverse the list.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.

slide-32
SLIDE 32

RCX Interface

[Diagram: linked list A → B → C, accessed by readers and updaters]

An updater:

    rcx_lock(A,B,C);
    a->next = c;
    rcx_unlock(A,B,C);

In RCX, update critical sections must specify the items to update.

Readers do nothing special except announce the start and completion of each read. They just traverse the list.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.

slide-33
SLIDE 33

RCX Interface

[Diagram: linked list A → B → C, accessed by readers and updaters]

An updater:

    rcx_lock(A,B,C);
    a->next = c;
    rcx_unlock(A,B,C);

In RCX, update critical sections must specify the items to update. Everything else is the same as in QSBR: wait until it is safe, then deallocate.

Readers do nothing special except announce the start and completion of each read. They just traverse the list.

Legend: an arrow from X to Y means X can see Y; an arrow from A to B means B is A’s next item.

slide-34
SLIDE 34

rcx_lock() in Detail

[Diagram: RCX-protected objects, accessed by the CPUs (CPU 0, CPU 1, ..., CPU m) of each NUMA node]

slide-35
SLIDE 35

rcx_lock() in Detail

  • Embed node-local locks and a global lock in each object

[Diagram: RCX-protected objects, each with a global lock and one local lock per NUMA node (node 0, node 1, ...), accessed by the CPUs (CPU 0, CPU 1, ..., CPU m) of each node]

slide-36
SLIDE 36

rcx_lock() in Detail

  • Embed node-local locks and a global lock in each object
  • Updaters first acquire the per-node local lock using HTM

[Diagram: RCX-protected objects, each with a global lock and one local lock per NUMA node (node 0, node 1, ...), accessed by the CPUs (CPU 0, CPU 1, ..., CPU m) of each node]

slide-37
SLIDE 37

rcx_lock() in Detail

  • Embed node-local locks and a global lock in each object
  • Updaters first acquire the per-node local lock using HTM
  • Then, they commit the transaction and acquire the global lock as a spinlock

[Diagram: RCX-protected objects, each with a global lock and one local lock per NUMA node (node 0, node 1, ...), accessed by the CPUs (CPU 0, CPU 1, ..., CPU m) of each node]

slide-38
SLIDE 38

rcx_lock() in Detail

  • Embed node-local locks and a global lock in each object
  • Updaters first acquire the per-node local lock using HTM
  • Then, they commit the transaction and acquire the global lock as a spinlock
  • Updaters that have acquired both locks can update the items

[Diagram: RCX-protected objects, each with a global lock and one local lock per NUMA node (node 0, node 1, ...), accessed by the CPUs (CPU 0, CPU 1, ..., CPU m) of each node]
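The two-level acquisition described above can be sketched in portable C as follows. This is an illustration under loud assumptions, not the RCX implementation: the HTM step is replaced by an all-or-nothing compare-and-swap loop over the caller's node-local locks, the locks are plain fields embedded in a hypothetical struct rcx_obj, and the array-plus-node-id signature of rcx_lock() differs from the rcx_lock(A,B,C) form on the slides.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_NODES 4 /* assumed number of NUMA nodes */

struct rcx_obj {
    atomic_bool local_lock[NR_NODES]; /* one per NUMA node */
    atomic_bool global_lock;          /* arbitrates between nodes */
    struct rcx_obj *next;
};

/* Stand-in for the HTM step: take this node's local lock on every
 * object, or none of them. RCX does this inside one HTM transaction. */
static bool try_local_locks(struct rcx_obj **objs, size_t n, int node)
{
    for (size_t i = 0; i < n; i++) {
        bool expected = false;
        if (!atomic_compare_exchange_strong(&objs[i]->local_lock[node],
                                            &expected, true)) {
            while (i-- > 0) /* roll back what we already took */
                atomic_store(&objs[i]->local_lock[node], false);
            return false;
        }
    }
    return true;
}

/* Assumes callers pass distinct objects in a consistent order (for
 * example list order), so the spinning second step cannot deadlock. */
static void rcx_lock(struct rcx_obj **objs, size_t n, int node)
{
    assert(node >= 0 && node < NR_NODES);
    /* Step 1: compete only with updaters on the same NUMA node. */
    while (!try_local_locks(objs, n, node))
        sched_yield();
    /* Step 2: spin for the global locks against other nodes' winners. */
    for (size_t i = 0; i < n; i++) {
        bool expected = false;
        while (!atomic_compare_exchange_weak(&objs[i]->global_lock,
                                             &expected, true)) {
            expected = false;
            sched_yield();
        }
    }
}

static void rcx_unlock(struct rcx_obj **objs, size_t n, int node)
{
    assert(node >= 0 && node < NR_NODES);
    for (size_t i = n; i-- > 0; ) {
        atomic_store(&objs[i]->global_lock, false);
        atomic_store(&objs[i]->local_lock[node], false);
    }
}
```

With this shape, at most one updater per node reaches the global spinlock of a given object, so cross-node contention on that lock is bounded by the number of NUMA nodes rather than the number of threads.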

slide-39
SLIDE 39

RCX and The Principles

slide-40
SLIDE 40

RCX and The Principles

  • Do fine-grained update-side synchronization

○ Compete only with threads accessing the same objects

Principle   #1   #2   #3   #4    #5
RCU         X    O    X    N/A   N/A
RLU         O    X    X    N/A   N/A
RCU-HTM     O    O    O    X     X
RCX         O    O    O    O     O

slide-41
SLIDE 41

RCX and The Principles

  • Do fine-grained update-side synchronization

○ Compete only with threads accessing the same objects

  • Use the pure RCU read mechanism

Principle   #1   #2   #3   #4    #5
RCU         X    O    X    N/A   N/A
RLU         O    X    X    N/A   N/A
RCU-HTM     O    O    O    X     X
RCX         O    O    O    O     O

slide-42
SLIDE 42

RCX and The Principles

  • Do fine-grained update-side synchronization

○ Compete only with threads accessing the same objects

  • Use the pure RCU read mechanism
  • Use HTM

Principle   #1   #2   #3   #4    #5
RCU         X    O    X    N/A   N/A
RLU         O    X    X    N/A   N/A
RCU-HTM     O    O    O    X     X
RCX         O    O    O    O     O

slide-43
SLIDE 43

RCX and The Principles

  • Do fine-grained update-side synchronization

○ Compete only with threads accessing the same objects

  • Use the pure RCU read mechanism
  • Use HTM
  • Access only NUMA-local data objects within an HTM transaction

Principle   #1   #2   #3   #4    #5
RCU         X    O    X    N/A   N/A
RLU         O    X    X    N/A   N/A
RCU-HTM     O    O    O    X     X
RCX         O    O    O    O     O

slide-44
SLIDE 44

RCX and The Principles

  • Do fine-grained update-side synchronization

○ Compete only with threads accessing the same objects

  • Use the pure RCU read mechanism
  • Use HTM
  • Access only NUMA-local data objects within an HTM transaction
  • Isolate the working set of HTM from the dominant readers

○ HTM in RCX touches only the local locks, which are invisible to readers

Principle   #1   #2   #3   #4    #5
RCU         X    O    X    N/A   N/A
RLU         O    X    X    N/A   N/A
RCU-HTM     O    O    O    X     X
RCX         O    O    O    O     O
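One plausible way to keep the lock words that HTM touches away from the cache lines readers traverse is to pad them onto separate lines, as sketched below. The slides do not show RCX's actual memory layout, so everything here is an assumption: the 64-byte cache line, the four nodes, and the struct and field names are hypothetical. The sketch also ignores the memory-efficiency optimization that the slides say is described in the paper.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache-line size in bytes */
#define NR_NODES   4  /* assumed number of NUMA nodes */

/* One lock word per cache line, so two nodes never share a line. */
struct padded_lock {
    alignas(CACHE_LINE) atomic_bool locked;
};

struct rcx_node {
    /* Read-side fields: all that a traversing reader touches. */
    int key;
    struct rcx_node *next;
    /* Update-side fields: each starts on a fresh cache line. */
    struct padded_lock local_lock[NR_NODES];
    struct padded_lock global_lock;
};

/* True if two byte offsets fall on the same cache line, assuming the
 * enclosing object itself is cache-line aligned. */
static bool same_cache_line(size_t off_a, size_t off_b)
{
    return off_a / CACHE_LINE == off_b / CACHE_LINE;
}

/* Compile-time checks of the layout assumptions. */
static_assert(sizeof(struct padded_lock) == CACHE_LINE,
              "each lock should occupy exactly one cache line");
static_assert(offsetof(struct rcx_node, local_lock) % CACHE_LINE == 0,
              "locks should start on a cache-line boundary");
```

Objects of this type need cache-line-aligned storage (static storage, or aligned_alloc with CACHE_LINE alignment) for the separation to hold at run time. The cost is one cache line per lock, which is why a real implementation would likely want something more compact.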

slide-45
SLIDE 45

Evaluations

slide-46
SLIDE 46

RCU Variants-Protected Linked Lists

  • RCX performs best for both read-only and update-mixed workloads
  • Results are similar with hash tables

[Figure: performance plots; x-axis: number of threads]

slide-47
SLIDE 47

Macro Benchmarks

  • We further applied RCX to systems having scalability problems

○ Virtual memory management system of Linux
○ In-memory DBMS

slide-48
SLIDE 48

RCU-protected VMA-tree

  • Linux protects each VMA-tree with a global reader-writer lock (mmap_sem)
  • Two similar RCU-based approaches have been proposed: RCUVM[1] and SPF[2]
  • However, VMA-tree update-intensive workloads receive no benefit from them
  • We further apply RCX on top of SPF and call it RCXVM

[1] Clements, Austin T., M. Frans Kaashoek, and Nickolai Zeldovich. "Scalable address spaces using RCU balanced trees." ACM SIGPLAN Notices 47.4 (2012): 199-210.
[2] Hussein, N. "Another attempt at speculative page-fault handling." https://lwn.net/Articles/730531/, 2017.

slide-49
SLIDE 49

Virtual Memory Scalability Evaluation Result

  • RCXVM further improves Metis and Ebizzy

[Figure: throughput vs. number of threads for Metis, Psearchy, and Ebizzy]

slide-50
SLIDE 50

Virtual Memory Scalability Evaluation Result

  • RCXVM further improves Metis and Ebizzy

○ Metis: Up to 24.03x of Original, 2.10x of SPF (144 threads)

[Figure: throughput vs. number of threads for Metis, Psearchy, and Ebizzy]

slide-51
SLIDE 51

Virtual Memory Scalability Evaluation Result

  • RCXVM further improves Metis and Ebizzy

○ Metis: Up to 24.03x of Original, 2.10x of SPF (144 threads)
○ Ebizzy: Up to 5.60x of Original (72 threads), 2.23x of SPF (36 threads)

[Figure: throughput vs. number of threads for Metis, Psearchy, and Ebizzy]

slide-52
SLIDE 52

Virtual Memory Scalability Evaluation Result

  • RCXVM further improves Metis and Ebizzy

○ Metis: Up to 24.03x of Original, 2.10x of SPF (144 threads)
○ Ebizzy: Up to 5.60x of Original (72 threads), 2.23x of SPF (36 threads)

  • Psearchy, and Ebizzy with many threads, show no benefit

○ The bottleneck (TLB flushes) is outside what RCXVM covers

[Figure: throughput vs. number of threads for Metis, Psearchy, and Ebizzy]

slide-53
SLIDE 53

In-memory DBMS Scalability

  • Kyoto CacheDB uses a global reader-writer lock; we implement two variants that replace it with fine-grained RCU and RCX, respectively
  • In an evaluation with 20 million records, RCX shows improvements

○ Up to 17.28x of Original and 1.3x of fine-grained RCU with 10% updates

[Figure: two panels, read-only and 10% updates; x-axis: number of threads]

slide-55
SLIDE 55

Conclusion

  • RCX achieves the best update performance and scalability while preserving almost ideal reads, owing to its NUMA-aware use of HTM
  • More details and additional material are in the paper

○ Detailed investigations of state-of-the-art approaches, including an HMCS lock and RCX variants
○ Optimization of RCX for memory efficiency, and HTM implementation details

  • The source code is available: https://github.com/rcx-sync

          Read                                      Single node update                     Multiple NUMA node update
RCU       Almost ideal                              Bad (global locking)                   Bad (global locking)
RLU       Far from ideal (version check overhead)   Good                                   Bad (NUMA oblivious)
RCU-HTM   Almost ideal                              Best (no software locking overhead)    Horrible (HTM abort amplification)
RCX       Almost ideal                              Best                                   Best

slide-56
SLIDE 56

Thank You