A Scalable Ordering Primitive for Multicore Machines (PowerPoint PPT Presentation)



slide-1
SLIDE 1

A Scalable Ordering Primitive for Multicore Machines

Sanidhya Kashyap Changwoo Min Kangnyeon Kim Taesoo Kim

slide-2
SLIDE 2

2

Era of multicore machines

slide-3
SLIDE 3

3

Scope of multicore machines

Huge hardware thread parallelism

...

How are operations executed correctly? Ordering, which becomes a scalability bottleneck

slide-4
SLIDE 4

4

Example: Read Log Update (RLU)

  • Extension of RCU
  • Modifies objects in a thread’s local log
  • Clock maintains a correct snapshot (old vs. new)
  • Frees objects via epoch-based reclamation
slide-5
SLIDE 5

[Figure: Read Log Update (RLU) operation, threads P and Q. A per-thread log/buffer stores object copies; each object carries an RLU header. Global Clock: (22); P's Local Clock: (22), read on start.]

slide-6
SLIDE 6

[Figure: RLU commit operation, threads P and Q. P's Write Clock goes from (∞) to (23) and the Global Clock from (22) to (23); Q's Local Clock stays at (22), so Q will read only old objects.]

  1. P updates the clocks
  2. P executes an RCU-epoch: waits for Q to finish

The logical clock maintains correctness/ordering; it is maintained via atomic instructions (FAA/CAS)

slide-7
SLIDE 7

7

Issue with logical clock

  • RLU suffers from global clock contention

– Cache-line contention due to atomic instructions
– Possible to circumvent with our approach

[Plot: throughput (Ops/usec) vs. #cores on Phi (up to 256 cores) and ARM (up to 96 cores), comparing Atomic and Ordo clocks]

How can we achieve ordering with minimal timestamping overhead?

slide-8
SLIDE 8

8

Our proposed ordering primitive: Ordo

  • Exposes a monotonically increasing clock

– Current hardware already provides one: rdtscp (x86), cntvct (ARM), stick (SPARC)

  • Relies on a per-core invariant hardware clock

– Monotonically increases with a constant skew, regardless of dynamic frequency and voltage scaling
slide-9
SLIDE 9

9

Challenges with Ordo

  • Comparing two clocks

– Clocks are not synchronized
– Cores receive the RESET signal at varying times

  • Application:

– Modifying algorithms to use Ordo
– Must still be able to compare two timestamps

slide-10
SLIDE 10

10

Embracing the invariant clocks

  • Measure a global uncertainty window

– Ensure a new timestamp once a window is over
– Provides a notion of a globally synchronized clock

  • The measured offset MUST satisfy the invariant:

The measured offset is greater than the physical offset

– Physical offset: offset due to the RESET signal
– Measured offset: physical offset + one-way delay

slide-11
SLIDE 11

11

Calculating global uncertainty window: ORDO_BOUNDARY

  • Add the one-way delay latency on each path

[Figure: timeline for cores C1 and C2; measured offset C1 → C2: 20]

1) Calculate C1's timestamp
2) Notify C2 via memory
3) Get C2's timestamp
4) Repeat steps 1-3 to get the minimum

slide-12
SLIDE 12

12

Calculating global uncertainty window: ORDO_BOUNDARY

  • Add the one-way delay latency on each path

[Figure: timeline for C1 and C2 with T(C1): 8 and T(C2): 5; measured offset C2 → C1: 30]

  • Repeat the prior steps in the opposite direction
  • We do not know which clock is ahead of the other

slide-13
SLIDE 13

13

Calculating global uncertainty window: ORDO_BOUNDARY

  • Repeat the steps for each pair of cores from C1 to Cn
  • The maximum offset is the ORDO_BOUNDARY

[Figure: timeline for C1 and C2 with T(C1): 8 and T(C2): 5; measured offsets C1 → C2: 20 and C2 → C1: 30, so the pairwise offset C1 ↔ C2 is 30]

slide-14
SLIDE 14

14

Ordo application

  • Applicable to any timestamp-based algorithm
  • Expose the Ordo API for these algorithms

– get_time(): current hardware timestamp
– cmp_time(t1, t2): compare two timestamps; the comparison is uncertain if |t1 - t2| < ORDO_BOUNDARY
– new_time(t): return tnew > (t + ORDO_BOUNDARY)

  • Catch: algorithms should handle uncertainty
slide-15
SLIDE 15

15

Algorithms with Ordo: handling uncertainty

  • Physical to logical timestamping:

– Rely on cmp_time() to compare two timestamps
– Either defer or revert if the comparison is uncertain
– Use new_time() to guarantee a new time

  • Physical timestamping:

– Use new_time() to access the global clock

slide-16
SLIDE 16

[Figure: Read Log Update (RLUOrdo) operation, threads P and Q. Per-thread logs hold object copies; Q's core clock is (50), P's local clock is (22), read on start; the global offset is (30).]

slide-17
SLIDE 17

[Figure: RLUOrdo commit operation, threads P and Q. P's Write Clock goes from (∞) to (150); P's Local Clock is (50); the global offset is (30). Q will read only old objects.]

  1. P updates its own clock
  2. P executes an RCU-epoch: waits for Q to finish

slide-18
SLIDE 18

18

Algorithms modified with Ordo

  • RLU
  • Transactional Locking (TL2) in STM
  • Database concurrency control: OCC, MVCC
  • Oplog used in Linux forking functionality

See our paper

slide-19
SLIDE 19

19

Evaluation

  • Questions:

– Measured global offset (ORDO_BOUNDARY)
– Maximum scalability of Ordo
– Ordo's impact on algorithms

  • Machine configurations:

– 240-core, 8-socket Intel Xeon machine (Xeon)
– 256-core Intel Xeon Phi (Phi)
– 96-core, 2-socket ARM machine (ARM)
– 32-core, 8-socket AMD machine (AMD)

slide-20
SLIDE 20

20

Machine           Minimum (ns)   Maximum (ns)
Intel Xeon              70            276
Intel Xeon Phi          90            270
ARM                    100          1,100
AMD                     93            203

Offset between clocks

  • Empirically measured offsets after reboots
  • ORDO_BOUNDARY is the maximum offset
slide-21
SLIDE 21

21

Timestamping with Ordo

  • Ordo relies on hardware timestamping
  • 17.4x to 285.5x faster than atomic increments

[Plot: per-core timestamping throughput (Ops/usec/core) vs. #cores on Xeon, Phi, ARM, and AMD, comparing Atomic vs. Ordo]

slide-22
SLIDE 22

22

Scaling RLU with Ordo

  • RLUOrdo is 2.1x faster on average
  • Still suffers from object copying and its locking

[Plot: RLU vs. RLU(Ordo) throughput (Ops/usec) at 2% updates vs. #cores on Xeon, Phi, ARM, and AMD]

slide-23
SLIDE 23

23

Discussion and limitations

  • Simplifies the design and understanding of algorithms

  • Not a panacea

– Applicable only when the clock is a source of contention

  • Clock skew is not taken into account
  • Thread-ID-based timestamp comparison has its limitations

slide-24
SLIDE 24

24

Conclusion

  • Ordo is a scalable timestamping primitive

– Relies on invariant hardware clocks

  • Exposes a time-based API to the user
  • Applied Ordo to five concurrent algorithms
  • Improves the scalability of algorithms by up to 39.7x across architectures

slide-25
SLIDE 25

Backup Slides

slide-26
SLIDE 26

26

Offset between clocks

  • Clocks are not synchronized

– The 8th socket on Xeon and the 2nd socket on ARM stand out
– Results remain consistent even after reboots and when measured again after a period of time

[Plot: offset between clocks (ns) vs. #cores on Xeon and ARM]

slide-27
SLIDE 27

27

Sensitivity of ORDO_BOUNDARY

  • Varying ORDO_BOUNDARY from 1/8x to 8x
  • The boundary in cycles increases from 32.2 to 18K on the Xeon machine

[Plot: normalized throughput (0.92-1.08) for 1 core, 1 socket, and 8 sockets]

slide-28
SLIDE 28

28

Physical timestamping: Oplog

  • Improves Exim performance by 1.9x at 240 cores

[Plot: Exim throughput (messages/sec) vs. #cores up to 240, Stock vs. Oplog(Ordo)]

slide-29
SLIDE 29

29

Scaling database concurrency control

  • Improves OCC and MVCC by 4.1–39.7x for read-only (YCSB)
  • OCCOrdo is 1.24x faster than TicToc and Silo (TPC-C)

[Plot: transaction throughput (Txns/usec) vs. #cores for OCC, OCC(Ordo), MVCC, and MVCC(Ordo) on Xeon, Phi, ARM, and AMD]

slide-30
SLIDE 30

30

Cannot use clock synchronization protocols

  • No information on the minimum bounds on message delivery between clocks

  • Protocols introduce various errors
  • Can lead to mis-synchronized clocks

– Larger or smaller than the actual physical offset

This leads to incorrect implementations of concurrent algorithms