A Scalable Ordering Primitive for Multicore Machines
Sanidhya Kashyap Changwoo Min Kangnyeon Kim Taesoo Kim
Era of multicore machines
Scope of multicore machines
Huge hardware thread parallelism
How are operations executed correctly?
Ordering, which becomes a scalability bottleneck
Example: Read Log Update (RLU)
[Figure: Read Log Update (RLU) operation. Threads P and Q access objects A, B, C, D, E; a per-thread log/buffer stores copies (B’, D’). Labels: RLU header, global clock (22), local clock (22) read on start, write clock (∞).]
[Figure: RLU commit operation. The log holds copies B’, C’, D’. P updates clocks and waits for Q to finish. Labels: global clock (23), local clock (22), write clock (23). Q will read only old objects.]
Logical clock maintains correctness/ordering
– Maintained via atomic instructions (FAA/CAS)
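The shared logical clock described above can be sketched as follows. This is a minimal illustration, not the RLU implementation: `LogicalClock` and `committer` are hypothetical names, and a `threading.Lock` stands in for the hardware fetch-and-add (FAA) instruction, which Python does not expose. The point it shows is that every committing thread must update the same shared counter.

```python
import threading


class LogicalClock:
    """Shared counter; the lock is a stand-in for a hardware atomic (assumption)."""

    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def fetch_and_add(self, delta=1):
        # Return the old value and advance the clock, as FAA would.
        with self._lock:
            old = self._value
            self._value += delta
            return old


clock = LogicalClock(start=22)  # 22 mirrors the global clock value on the slide


def committer(n_commits):
    # Every commit touches the same shared clock: this is the contended spot.
    for _ in range(n_commits):
        clock.fetch_and_add()


threads = [threading.Thread(target=committer, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_value = clock.read()
```

All four threads serialize on one counter, which is the single point of contention that the following slides address.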
Issue with logical clock
– Cache-line contention due to atomic instructions
– Possible to circumvent with our approach
[Figure: Ops/usec vs. #cores on Phi and ARM, comparing Atomic and Ordo]
How can we achieve ordering with minimal timestamping overhead?
Our proposed ordering primitive: Ordo
– Current hardware already provides invariant clocks: rdtscp (x86), cntvct (ARM), stick (SPARC)
– Monotonically increases with constant skew regardless
Challenges with Ordo
– Clocks are not synchronized
– Cores receive RESET signal at varying times
– Modifying algorithms to use Ordo
– Able to compare between two timestamps
Embracing the invariant clocks
– Ensure a new timestamp once a window is over
– Provides a notion of globally synchronized clock
Measured offset is greater than the physical offset
– Physical offset: offset due to RESET signal
– Measured offset: physical offset + one-way delay
Calculating global uncertainty window: ORDO_BOUNDARY
[Figure: cores C1 and C2 on a common timeline, with timestamps T(C1) and T(C2). Measured offset C1 → C2: 20]
1) Calculate C1 timestamp
2) Notify C2 via memory
3) Get C2 timestamp
4) Repeat steps 1-3 to get the minimum
[Figure: the same measurement in the reverse direction, with T(C1): 8 and T(C2): 5. Measured offset C2 → C1: 30]
is ahead of the other
[Figure: both directions combined, with T(C1): 8 and T(C2): 5. C1 → C2: 20 and C2 → C1: 30 give C1 ↔ C2: 30]
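The measurement steps above can be sketched as a small simulation. This is an illustration under stated assumptions, not the authors' code: `measured_offset` and `ordo_boundary` are hypothetical names, the per-core skews and one-way delays are made-up inputs, and taking the larger of the two directions (then the largest over all pairs) is a reading of the C1 ↔ C2: 30 result on the slide. A real measurement would pin threads to cores and read the hardware clock.

```python
from itertools import combinations


def measured_offset(skew_src, skew_dst, delays):
    """Minimum over runs of (destination timestamp - source timestamp).

    Each run: the source reads its clock, notifies the destination via
    memory (which takes `delay` time units), then the destination reads
    its own clock. `skew_*` is each core's physical offset (assumed input).
    """
    samples = []
    for delay in delays:
        true_time = 1000  # arbitrary instant at which the source reads its clock
        t_src = true_time + skew_src
        t_dst = true_time + delay + skew_dst
        samples.append(t_dst - t_src)
    return min(samples)


def ordo_boundary(skews, delays):
    """Largest measured offset over every ordered pair of cores."""
    boundary = 0
    for a, b in combinations(range(len(skews)), 2):
        a_to_b = measured_offset(skews[a], skews[b], delays)
        b_to_a = measured_offset(skews[b], skews[a], delays)
        boundary = max(boundary, a_to_b, b_to_a)
    return boundary


# Two cores: C1 is 5 units ahead of C2; the fastest one-way delay is 25.
skews = [5, 0]
delays = [40, 25, 31]
c1_to_c2 = measured_offset(skews[0], skews[1], delays)  # 25 - 5 = 20
c2_to_c1 = measured_offset(skews[1], skews[0], delays)  # 25 + 5 = 30
boundary = ordo_boundary(skews, delays)                 # max(20, 30) = 30
```

With these inputs the two directions come out as 20 and 30, matching the numbers on the slides, and the larger one (30) exceeds the physical offset of 5 because it includes the one-way delay.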
Ordo application
– get_time(): Current hardware timestamp
– cmp_time(t1, t2): Compare two timestamps; the result is uncertain if |t1 - t2| < ORDO_BOUNDARY
– new_time(t): Return tnew > (t + ORDO_BOUNDARY)
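The three calls above can be sketched as follows. This is a minimal sketch under assumptions, not the actual Ordo implementation: `time.perf_counter_ns()` stands in for the invariant hardware clock (rdtscp, cntvct, stick), the `ORDO_BOUNDARY` value is an assumed constant in nanoseconds, and the return convention of `cmp_time` (1, -1, 0 for uncertain) is chosen here for illustration.

```python
import time

# Assumed boundary in nanoseconds; a real value would be measured per machine.
ORDO_BOUNDARY = 276


def get_time():
    """Current timestamp; perf_counter_ns is a stand-in for the hardware clock."""
    return time.perf_counter_ns()


def cmp_time(t1, t2):
    """Return 1 if t1 is certainly after t2, -1 if certainly before, 0 if uncertain.

    A gap of exactly ORDO_BOUNDARY is also treated as uncertain here,
    which is a conservative choice made for this sketch.
    """
    if t1 > t2 + ORDO_BOUNDARY:
        return 1
    if t2 > t1 + ORDO_BOUNDARY:
        return -1
    return 0


def new_time(t):
    """Spin until the clock reads a value greater than t + ORDO_BOUNDARY."""
    while True:
        now = get_time()
        if now > t + ORDO_BOUNDARY:
            return now


t_old = get_time()
t_new = new_time(t_old)
```

Because `new_time` only returns once the uncertainty window has passed, `cmp_time(t_new, t_old)` is always certain, regardless of which core produced `t_old`.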
Algorithms with Ordo handling uncertainty
– Rely on cmp_time() to compare two timestamps
– Either defer or revert if comparison is uncertain
– Use new_time() to guarantee new time
– Use new_time() to access the global clock
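The "revert if comparison is uncertain" rule above can be sketched with a hypothetical commit check. The function `try_commit`, its arguments, and the string results are illustrative names only and do not come from the slides; `ORDO_BOUNDARY` is set to 30 to match the global offset used in the slides' example.

```python
ORDO_BOUNDARY = 30  # assumed, in clock ticks


def cmp_time(t1, t2):
    """1 if t1 is certainly after t2, -1 if certainly before, 0 if uncertain."""
    if t1 > t2 + ORDO_BOUNDARY:
        return 1
    if t2 > t1 + ORDO_BOUNDARY:
        return -1
    return 0


def try_commit(start_ts, conflicting_write_ts):
    """Hypothetical validation step that reverts when ordering is uncertain."""
    order = cmp_time(start_ts, conflicting_write_ts)
    if order == 1:
        return "commit"  # the write is certainly older than our start
    if order == -1:
        return "abort"   # the write certainly happened after we started
    return "retry"       # uncertain: revert and try again with a fresh timestamp


certain_old = try_commit(100, 50)
certain_new = try_commit(100, 150)
uncertain = try_commit(100, 90)
```

Timestamps 100 and 90 are only 10 ticks apart, inside the 30-tick window, so the order of the two events cannot be decided and the operation is reverted rather than guessed.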
[Figure: Read Log Update (RLU(Ordo)) operation. Threads P and Q access objects A, B, C, D, E; the log holds copies B’ and D’. Labels: Q's core clock (50), local clock (50) read on start, P's local clock (22), write clock (∞), global offset (30).]
[Figure: RLU(Ordo) commit operation. The log holds copies B’, C’, D’. P updates its own clock, executes the RCU-epoch, and waits for Q to finish. Labels: local clock (50), write clock (150), global offset (30). Q will read only old objects.]
Algorithms modified with Ordo
See our paper
Evaluation
– Measured global offset (ORDO_BOUNDARY)
– Maximum scalability of Ordo
– Ordo’s impact on algorithms

– 240-core, 8-socket Intel Xeon machine (Xeon)
– 256-core Intel Xeon Phi (Phi)
– 96-core, 2-socket ARM machine (ARM)
– 32-core, 8-socket AMD machine (AMD)
Offset between clocks
Machine         Minimum (ns)  Maximum (ns)
Intel Xeon      70            276
Intel Xeon Phi  90            270
ARM             100           1,100
AMD             93            203
Timestamping with Ordo
[Figure: Ops/usec/core vs. #core on Xeon, Phi, ARM, and AMD, comparing Atomic and Ordo]
Scaling RLU with Ordo
[Figure: Ops/usec vs. #core on Xeon, Phi, ARM, and AMD, comparing RLU 2% and RLU(Ordo) 2%]
Discussion and limitations
algorithms
– Applicable when clock is contentious
limitation
Conclusion
– Relies on invariant hardware clocks
most 39.7x across architectures
Backup Slides
Offset between clocks
– 8th socket in Xeon and 2nd socket in ARM
– Results remain consistent even after reboots and when measured again after a period of time
[Figure: offset between clocks for each pair of cores (# core on both axes), Xeon and ARM]
Sensitivity of ORDO_BOUNDARY
[Figure: normalized throughput for the 1-core, 1-socket, and 8-sockets settings]
Physical timestamping: Oplog
[Figure: Messages/sec vs. #core, comparing Stock and Oplog(Ordo)]
Scaling database concurrency control
[Figure: Txns/usec vs. # core on Xeon, Phi, ARM, and AMD, comparing OCC, OCC (Ordo), MVCC, and MVCC (Ordo)]
Cannot use clock synchronization protocols
message delivery between/among clocks
– Larger or smaller than the actual physical offset
Lead to incorrect implementation of concurrent algorithms