
A Scalable Ordering Primitive for Multicore Machines



  1. A Scalable Ordering Primitive for Multicore Machines
     Sanidhya Kashyap, Changwoo Min, Kangnyeon Kim, Taesoo Kim

  2. Era of multicore machines

  3. Scope of multicore machines
     Huge hardware thread parallelism. How are operations executed correctly? Ordering, which becomes a scalability bottleneck ...

  4. Example: Read Log Update (RLU)
     ● Extension of RCU
     ● Modifies objects in a thread’s local log
     ● Clock maintains correct snapshot (old vs new)
     ● Frees objects via epoch-based reclamation

  5. Read Log Update (RLU) operation
     [Diagram: objects A, B, C, D, E, each with an RLU header. Writer P has a per-thread log (a buffer to store copies) holding B’ and D’. The global clock is 22. Reader Q reads the clock on start, so Q’s local clock is 22.]

  6. RLU commit operation
     1. P updates clocks
     2. P executes RCU-epoch
        Waits for Q to finish
     Logical clock maintains correctness/ordering
     Maintained via atomic instructions → FAA/CAS
     [Diagram: global clock (22), then (23); write clock (∞), then (23). P’s log holds B’, D’, and C’ over objects A, B, C, D, E. Q’s local clock is 22, so Q will read only old objects.]
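The clock arithmetic on the two RLU slides above can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the RLU implementation: the class and function names (`GlobalClock`, `Thread`, `reader_start`, `commit_begin`, `read`) are hypothetical, and `fetch_and_add` is an ordinary method standing in for the atomic FAA that the slide mentions.

```python
import math

INF = math.inf  # write clock value meaning "no commit in progress"


class GlobalClock:
    """Shared logical clock. Real RLU bumps it with an atomic FAA;
    this single-threaded stand-in only models the arithmetic."""

    def __init__(self, start):
        self.value = start

    def fetch_and_add(self, n=1):
        old = self.value
        self.value += n
        return old


class Thread:
    def __init__(self):
        self.local_clock = 0
        self.write_clock = INF
        self.log = {}  # object name -> private copy


def reader_start(thread, clock):
    # "Read on start": the reader snapshots the global clock.
    thread.local_clock = clock.value


def commit_begin(writer, clock):
    # Step 1 of the commit: publish the write clock, then bump the global clock.
    writer.write_clock = clock.value + 1
    clock.fetch_and_add(1)


def read(reader, name, memory, writer):
    # A reader sees the writer's copy only if it started at or after the commit.
    if name in writer.log and writer.write_clock <= reader.local_clock:
        return writer.log[name]
    return memory[name]


clock = GlobalClock(22)
memory = {name: name for name in "ABCDE"}
p, q, late = Thread(), Thread(), Thread()

reader_start(q, clock)             # Q's local clock becomes 22
p.log = {"B": "B'", "D": "D'"}
commit_begin(p, clock)             # write clock 23, global clock 23
reader_start(late, clock)          # a later reader gets local clock 23

print(read(q, "B", memory, p))     # B  (Q reads only old objects)
print(read(late, "B", memory, p))  # B'
```

Every commit goes through the one shared `GlobalClock`, which is the cache line that the atomic instructions contend on.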

  7. Issue with logical clock
     ● RLU suffers from global clock contention
       – Cache-line contention due to atomic instructions
       – Possible to circumvent with our approach
     How can we achieve ordering with minimal timestamping overhead?
     [Figure: throughput (Ops/usec) vs. #cores for Atomic and Ordo; panels: Phi (0 to 256 cores) and ARM (0 to 96 cores).]

  8. Our proposed ordering primitive: Ordo
     ● Exposes a monotonically increasing clock
       – Current hardware already provides one
       – rdtscp (x86), cntvct (ARM), stick (SPARC)
     ● Relies on a per-core invariant hardware clock
       – Monotonically increases with constant skew regardless of dynamic frequency and voltage scaling

  9. Challenges with Ordo
     ● Comparing two clocks
       – Clocks are not synchronized
       – Cores receive the RESET signal at varying times
     ● Application:
       – Modifying algorithms to use Ordo
       – Able to compare between two timestamps

  10. Embracing the invariant clocks
      ● Measure a global uncertainty window
        – Ensure a new timestamp once a window is over
        – Provides a notion of a globally synchronized clock
      ● The measured offset MUST satisfy the invariant: the measured offset is greater than the physical offset
        – Physical offset: offset due to the RESET signal
        – Measured offset: physical offset + one-way delay

  11. Calculating global uncertainty window: ORDO_BOUNDARY
      ● Add one-way delay latency on each path
        1) Calculate C1 timestamp
        2) Notify C2 via memory
        3) Get C2 timestamp
        4) Repeat steps 1-3 to get the minimum
      [Diagram: timeline for C1 → C2 with T(C1): 0 and T(C2): 20, giving a measured offset of 20.]

  12. Calculating global uncertainty window: ORDO_BOUNDARY
      ● Add one-way delay latency on each path
      ● Repeat prior steps in the opposite direction
      ● Do not know which clock is ahead of the other
      [Diagram: timeline for C2 → C1 with T(C2): 50 and T(C1): 80, giving a measured offset of 30.]

  13. Calculating global uncertainty window: ORDO_BOUNDARY
      ● Repeat steps for each pair of cores from C1 to Cn
      ● The maximum offset is the ORDO_BOUNDARY
      [Diagram: C1 → C2 measures 20 (T(C1): 0, T(C2): 20); C2 → C1 measures 30 (T(C2): 50, T(C1): 80); the offset for the pair C1 ←→ C2 is 30.]
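The measurement procedure on the three slides above can be mimicked with simulated clocks. The sketch below is an assumption-laden model, not the authors' measurement code: `skew` holds each core's hypothetical physical offset, `min_delay` and `jitter` model the one-way notification latency, and the function names are made up for illustration. With C1 ahead of C2 by 5 and a one-way delay of 25 it reproduces the 20 and 30 shown in the diagrams.

```python
import itertools
import random


def one_way_offset(skew, src, dst, runs, rng, min_delay, jitter):
    """Smallest (dst timestamp - src timestamp) seen over `runs` tries."""
    best = None
    for run in range(runs):
        true_time = 1000 * run                      # hidden "real" time
        t_src = true_time + skew[src]               # 1) src reads its clock
        delay = min_delay + rng.randint(0, jitter)  # 2) notify dst via memory
        t_dst = true_time + delay + skew[dst]       # 3) dst reads its clock
        delta = t_dst - t_src
        best = delta if best is None else min(best, delta)  # 4) keep the minimum
    return best


def ordo_boundary(skew, runs=100, min_delay=25, jitter=0, seed=0):
    """Maximum measured offset over every ordered pair of cores."""
    rng = random.Random(seed)
    boundary = 0
    for a, b in itertools.combinations(range(len(skew)), 2):
        forward = one_way_offset(skew, a, b, runs, rng, min_delay, jitter)
        backward = one_way_offset(skew, b, a, runs, rng, min_delay, jitter)
        # We do not know which clock is ahead, so take the larger direction.
        boundary = max(boundary, forward, backward)
    return boundary


# C1 is ahead of C2 by 5; the one-way delay is 25.
print(ordo_boundary([5, 0]))  # 30
```

In this model each pair measures |skew difference| + smallest observed delay, so the result always exceeds the physical offset, which is the invariant the earlier slide requires.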

  14. Ordo application
      ● Applicable to any timestamp-based algorithm
      ● Expose Ordo API for these algorithms
        – get_time(): Current hardware timestamp
        – cmp_time(t1, t2): Compare two timestamps; the result is uncertain if |t1 - t2| < ORDO_BOUNDARY
        – new_time(t): Return t_new > (t + ORDO_BOUNDARY)
      ● Catch: Algorithms should handle uncertainty
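A minimal sketch of the three calls, under loud assumptions: Python cannot read the per-core hardware counter directly, so `time.perf_counter_ns()` stands in for it, and `ORDO_BOUNDARY` is set to 276 ns, the maximum Xeon offset reported later in the talk, purely as an illustrative value. The return convention of `cmp_time` (1, -1, 0) is also an assumption of this sketch.

```python
import time

# Hypothetical value in nanoseconds; a real deployment measures it per machine.
ORDO_BOUNDARY = 276


def get_time():
    """Stand-in for the invariant hardware clock (assumption: OS monotonic clock)."""
    return time.perf_counter_ns()


def cmp_time(t1, t2):
    """Return 1 if t1 is certainly after t2, -1 if certainly before, 0 if uncertain."""
    if t1 > t2 + ORDO_BOUNDARY:
        return 1
    if t1 + ORDO_BOUNDARY < t2:
        return -1
    return 0


def new_time(t):
    """Spin until the clock is certainly past t, then return that timestamp."""
    while True:
        now = get_time()
        if cmp_time(now, t) == 1:
            return now


start = get_time()
later = new_time(start)
print(cmp_time(later, start))  # 1
```

This version treats a difference of exactly ORDO_BOUNDARY as uncertain, which is slightly more conservative than the strict inequality on the slide.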

  15. Algorithms with Ordo handling uncertainty
      ● Physical to logical timestamping:
        – Rely on cmp_time() to compare two timestamps
        – Either defer or revert if comparison is uncertain
        – Use new_time() to guarantee new time
      ● Physical timestamping:
        – Use new_time() to access the global clock
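One way to picture "revert if comparison is uncertain" is a reader that accepts a value only when the last write is certainly older than its own read timestamp. The sketch below is hypothetical throughout: `check_read` is not taken from any of the modified algorithms, the clock is an injected tick source so the run is deterministic, and `ORDO_BOUNDARY = 30` simply reuses the offset shown in the RLU example slides.

```python
import itertools

ORDO_BOUNDARY = 30  # hypothetical; the offset used in the RLU example slides


def cmp_time(t1, t2):
    """1: t1 certainly after t2, -1: certainly before, 0: uncertain."""
    if t1 > t2 + ORDO_BOUNDARY:
        return 1
    if t1 + ORDO_BOUNDARY < t2:
        return -1
    return 0


def new_time(t, clock):
    """Keep sampling `clock` until the timestamp is certainly after t."""
    now = clock()
    while cmp_time(now, t) != 1:
        now = clock()
    return now


def check_read(read_ts, last_write_ts):
    """Accept the read only if the write is certainly older; otherwise revert."""
    if cmp_time(last_write_ts, read_ts) == -1:
        return "valid"
    return "revert"  # covers both a newer write and an uncertain comparison


ticks = itertools.count(100, 10)  # fake clock: 100, 110, 120, ...


def clock():
    return next(ticks)


print(check_read(200, 100))  # valid
print(check_read(200, 190))  # revert (within the uncertainty window)
print(check_read(200, 300))  # revert (write is certainly newer)
print(new_time(100, clock))  # 140
```

Deferring instead of reverting would mean waiting, for example with `new_time`, until the comparison can no longer be uncertain; which choice is right depends on the algorithm being modified.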

  16. Read Log Update with Ordo: RLU(Ordo) operation
      [Diagram: objects A, B, C, D, E. Writer P has local clock (22) and a log holding B’ and D’. The global offset is 30. Reader Q reads its core’s clock (50) on start, so Q’s local clock is 50.]

  17. RLU(Ordo) commit operation
      1. P updates own clock
      2. P executes RCU-epoch
         Waits for Q to finish
      [Diagram: write clock (∞), then (150); global offset (30). P’s log holds B’, D’, and C’ over objects A, B, C, D, E. Q’s local clock is 50, so Q will read only old objects.]

  18. Algorithms modified with Ordo
      ● RLU
      ● Transactional Locking (TL2) in STM
      ● Database concurrency control: OCC, MVCC
      ● Oplog used in Linux forking functionality
      See our paper.

  19. Evaluation
      ● Questions:
        – Measured global offset (ORDO_BOUNDARY)
        – Maximum scalability of Ordo
        – Ordo’s impact on algorithms
      ● Machine configurations:
        – 240-core, 8-socket Intel Xeon machine (Xeon)
        – 256-core Intel Xeon Phi (Phi)
        – 96-core, 2-socket ARM machine (ARM)
        – 32-core, 8-socket AMD machine (AMD)

  20. Offset between clocks
      ● Empirically measured offset after reboots
      ● ORDO_BOUNDARY is the maximum offset

      Machine          Minimum (ns)   Maximum (ns)
      Intel Xeon       70             276
      Intel Xeon Phi   90             270
      ARM              100            1,100
      AMD              93             203

  21. Timestamping with Ordo
      ● Ordo relies on hardware timestamping
      ● 17.4 – 285.5x faster than atomic increments
      [Figure: Ops/usec/core vs. #core for Atomic and Ordo timestamping; panels: Xeon, Phi, ARM, AMD.]

  22. Scaling RLU with Ordo
      ● RLU(Ordo) is 2.1x faster on average
      ● Still suffers from object copy and its locking
      [Figure: Ops/usec vs. #core for RLU 2% and RLU(Ordo) 2%; panels: Xeon, Phi, ARM, AMD.]

  23. Discussion and limitations
      ● Simplifies the design and understanding of algorithms
      ● Not a panacea
        – Applicable when the clock is a point of contention
      ● No skew consideration
      ● Thread ID-based timestamp comparison has its limitations

  24. Conclusion
      ● Ordo is a scalable timestamping primitive
        – Relies on invariant hardware clocks
      ● Exposes a time-based API to the user
      ● Applied Ordo to five concurrent algorithms
      ● Improves the scalability of algorithms by up to 39.7x across architectures

  25. Backup Slides

  26. Offset between clocks
      ● Clocks are not synchronized
        – 8th socket in Xeon and 2nd socket in ARM
        – Results remain consistent even after reboots and when measured after a period of time
      [Figure: offset between clocks for each pair of cores (# core on both axes); panels: ARM and Xeon.]

  27. Sensitivity of ORDO_BOUNDARY
      ● Varying ORDO_BOUNDARY from 1/8x – 8x
      ● Cycles increase from 32.2 to 18K on the Xeon machine
      [Figure: normalized throughput (0.92 to 1.08) for 1-core, 1-socket, and 8-sockets.]

  28. Physical timestamping: Oplog
      ● Improves Exim performance by 1.9x at 240 cores
      [Figure: Messages/sec (0k to 120k) vs. #core (30 to 240) for Stock and Oplog(Ordo).]

  29. Scaling database concurrency control
      ● Improves OCC and MVCC by 4.1–39.7x for read-only (YCSB)
      ● OCC(Ordo) is 1.24x faster than Tictoc and Silo (TPC-C)
      [Figure: Txns/usec vs. # core for OCC, MVCC, OCC (Ordo), and MVCC (Ordo); panels: Xeon, Phi, ARM, AMD.]

  30. Cannot use clock synchronization protocols
      ● No information on minimum bounds on message delivery between/among clocks
      ● Protocols introduce various errors
      ● Can lead to mis-synchronized clocks
        – Larger or smaller than the actual physical offset
      Leads to incorrect implementations of concurrent algorithms
