A Scalable Ordering Primitive for Multicore Machines (PowerPoint PPT Presentation)



slide-1
SLIDE 1

A Scalable Ordering Primitive for Multicore Machines

Sanidhya Kashyap Changwoo Min Kangnyeon Kim Taesoo Kim

slide-2
SLIDE 2

2

Era of multicore machines

slide-3
SLIDE 3

3

Scope of multicore machines

Huge hardware thread parallelism

...

How are operations executed correctly? Ordering, which becomes a scalability bottleneck

slide-4
SLIDE 4

4

Example: Read Log Update (RLU)

  • Extension of RCU
  • Modifies objects in a thread’s local log
  • Clock maintains a correct snapshot (old vs. new)
  • Frees objects via epoch-based reclamation
slide-5
SLIDE 5

[Figure: Read Log Update (RLU) operation, threads P and Q. A per-thread log/buffer stores object copies; each object carries an RLU header. Global Clock: (22); P's Local Clock: (22), read on start.]

slide-6
SLIDE 6

[Figure: RLU commit operation, threads P and Q. P's Write Clock goes from (∞) to (23) and the Global Clock from (22) to (23); Q's Local Clock stays at (22), so Q will read only old objects.]

  1. P updates the clocks
  2. P executes an RCU-epoch: waits for Q to finish

The logical clock maintains correctness/ordering; it is maintained via atomic instructions (FAA/CAS)

slide-7
SLIDE 7

7

Issue with logical clock

  • RLU suffers from global clock contention

– Cache-line contention due to atomic instructions
– Possible to circumvent with our approach

[Plot: throughput (Ops/usec) vs. #cores on Phi (up to 256 cores) and ARM (up to 96 cores), comparing Atomic and Ordo clocks]

How can we achieve ordering with minimal timestamping overhead?

slide-8
SLIDE 8

8

Our proposed ordering primitive: Ordo

  • Exposes a monotonically increasing clock

– Current hardware already provides one: rdtscp (x86), cntvct (ARM), stick (SPARC)

  • Relies on a per-core invariant hardware clock

– Monotonically increases with a constant skew, regardless of dynamic frequency and voltage scaling
slide-9
SLIDE 9

9

Challenges with Ordo

  • Comparing two clocks

– Clocks are not synchronized
– Cores receive the RESET signal at varying times

  • Application:

– Modifying algorithms to use Ordo
– Must still be able to compare two timestamps

slide-10
SLIDE 10

10

Embracing the invariant clocks

  • Measure a global uncertainty window

– Ensure a new timestamp once a window is over
– Provides a notion of a globally synchronized clock

  • The measured offset MUST satisfy the invariant:

The measured offset is greater than the physical offset

– Physical offset: offset due to the RESET signal
– Measured offset: physical offset + one-way delay

slide-11
SLIDE 11

11

Calculating global uncertainty window: ORDO_BOUNDARY

  • Add the one-way delay latency on each path

[Figure: timeline for cores C1 and C2; measured offset C1 → C2: 20]

1) Calculate C1's timestamp
2) Notify C2 via memory
3) Get C2's timestamp
4) Repeat steps 1-3 to get the minimum

slide-12
SLIDE 12

12

Calculating global uncertainty window: ORDO_BOUNDARY

  • Add the one-way delay latency on each path

[Figure: timeline for C1 and C2 with T(C1): 8 and T(C2): 5; measured offset C2 → C1: 30]

  • Repeat the prior steps in the opposite direction
  • We do not know which clock is ahead of the other

slide-13
SLIDE 13

13

Calculating global uncertainty window: ORDO_BOUNDARY

  • Repeat the steps for each pair of cores from C1 to Cn
  • The maximum offset is the ORDO_BOUNDARY

[Figure: timeline for C1 and C2 with T(C1): 8 and T(C2): 5; measured offsets C1 → C2: 20 and C2 → C1: 30, so the pairwise offset C1 ↔ C2 is 30]

slide-14
SLIDE 14

14

Ordo application

  • Applicable to any timestamp-based algorithm
  • Expose the Ordo API for these algorithms

– get_time(): current hardware timestamp
– cmp_time(t1, t2): compare two timestamps; the comparison is uncertain if |t1 - t2| < ORDO_BOUNDARY
– new_time(t): return tnew > (t + ORDO_BOUNDARY)

  • Catch: algorithms should handle uncertainty
slide-15
SLIDE 15

15

Algorithms with Ordo: handling uncertainty

  • Physical to logical timestamping:

– Rely on cmp_time() to compare two timestamps
– Either defer or revert if the comparison is uncertain
– Use new_time() to guarantee a new time

  • Physical timestamping:

– Use new_time() to access the global clock

slide-16
SLIDE 16

[Figure: Read Log Update (RLUOrdo) operation, threads P and Q. Per-thread logs hold object copies; Q's core clock is (50), P's local clock is (22), read on start; the global offset is (30).]

slide-17
SLIDE 17

[Figure: RLUOrdo commit operation, threads P and Q. P's Write Clock goes from (∞) to (150); P's Local Clock is (50); the global offset is (30). Q will read only old objects.]

  1. P updates its own clock
  2. P executes an RCU-epoch: waits for Q to finish

slide-18
SLIDE 18

18

Algorithms modified with Ordo

  • RLU
  • Transactional Locking (TL2) in STM
  • Database concurrency control: OCC, MVCC
  • Oplog used in Linux forking functionality

See our paper

slide-19
SLIDE 19

19

Evaluation

  • Questions:

– Measured global offset (ORDO_BOUNDARY)
– Maximum scalability of Ordo
– Ordo's impact on algorithms

  • Machine configurations:

– 240-core, 8-socket Intel Xeon machine (Xeon)
– 256-core Intel Xeon Phi (Phi)
– 96-core, 2-socket ARM machine (ARM)
– 32-core, 8-socket AMD machine (AMD)

slide-20
SLIDE 20

20

Machine           Minimum (ns)   Maximum (ns)
Intel Xeon              70            276
Intel Xeon Phi          90            270
ARM                    100          1,100
AMD                     93            203

Offset between clocks

  • Empirically measured offsets after reboots
  • ORDO_BOUNDARY is the maximum offset
slide-21
SLIDE 21

21

Timestamping with Ordo

  • Ordo relies on hardware timestamping
  • 17.4x to 285.5x faster than atomic increments

[Plot: per-core timestamping throughput (Ops/usec/core) vs. #cores on Xeon, Phi, ARM, and AMD, comparing Atomic vs. Ordo]

slide-22
SLIDE 22

22

Scaling RLU with Ordo

  • RLUOrdo is 2.1x faster on average
  • Still suffers from object copying and its locking

[Plot: RLU vs. RLU(Ordo) throughput (Ops/usec) at 2% updates vs. #cores on Xeon, Phi, ARM, and AMD]

slide-23
SLIDE 23

23

Discussion and limitations

  • Simplifies the design and understanding of algorithms

  • Not a panacea

– Applicable only when the clock is a source of contention

  • Clock skew is not taken into account
  • Thread-ID-based timestamp comparison has its limitations

slide-24
SLIDE 24

24

Conclusion

  • Ordo is a scalable timestamping primitive

– Relies on invariant hardware clocks

  • Exposes a time-based API to the user
  • Applied Ordo to five concurrent algorithms
  • Improves the scalability of algorithms by up to 39.7x across architectures

slide-25
SLIDE 25

Backup Slides

slide-26
SLIDE 26

26

Offset between clocks

  • Clocks are not synchronized

– The 8th socket on Xeon and the 2nd socket on ARM stand out
– Results remain consistent even after reboots and when measured again after a period of time

[Plot: offset between clocks (ns) vs. #cores on Xeon and ARM]

slide-27
SLIDE 27

27

Sensitivity of ORDO_BOUNDARY

  • Varying ORDO_BOUNDARY from 1/8x to 8x
  • The boundary in cycles increases from 32.2 to 18K on the Xeon machine

[Plot: normalized throughput (0.92-1.08) for 1 core, 1 socket, and 8 sockets]

slide-28
SLIDE 28

28

Physical timestamping: Oplog

  • Improves Exim performance by 1.9x at 240 cores

[Plot: Exim throughput (messages/sec) vs. #cores up to 240, Stock vs. Oplog(Ordo)]

slide-29
SLIDE 29

29

Scaling database concurrency control

  • Improves OCC and MVCC by 4.1–39.7x for read-only (YCSB)
  • OCCOrdo is 1.24x faster than TicToc and Silo (TPC-C)

[Plot: transaction throughput (Txns/usec) vs. #cores for OCC, OCC(Ordo), MVCC, and MVCC(Ordo) on Xeon, Phi, ARM, and AMD]

slide-30
SLIDE 30

30

Cannot use clock synchronization protocols

  • No information on the minimum bounds on message delivery between clocks

  • Protocols introduce various errors
  • Can lead to mis-synchronized clocks

– Larger or smaller than the actual physical offset

This leads to incorrect implementations of concurrent algorithms