slide-1
SLIDE 1

Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems

Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve
University of Illinois @ Urbana-Champaign
hetero@cs.illinois.edu
Paper available at: http://rsim.cs.illinois.edu/pubs.html

slide-2
SLIDE 2

No Formal Specification for Relaxed Atomics

C++17 "specification" for relaxed atomics:

  • Races that don't order other accesses
  • Implementations should ensure no “out-of-thin-air” values are computed that circularly depend on their own computation

“C++ (relaxed) atomics were the worst idea ever. I just spent days (and days) trying to get something to work. … My example only has 2 addresses and 4 accesses, it shouldn’t be this hard. Can you help?”

  • Email from employee at major research lab

Formal specification for relaxed atomics is a longstanding problem
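As an illustration of the "out-of-thin-air" wording above, here is a minimal sketch of the classic load-buffering litmus test, which happens to have 2 addresses and 4 accesses. The names `LoadBufferingResult` and `run_load_buffering` are hypothetical and do not come from the talk. Each thread copies whatever it loads into the other location, so a result such as `r1 == r2 == 42` would be a value that circularly depends on its own computation.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical result type for this sketch.
struct LoadBufferingResult {
    int r1;
    int r2;
};

// Load-buffering litmus test using only relaxed atomics.
LoadBufferingResult run_load_buffering() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r1 = 0;
    int r2 = 0;

    std::thread t1([&] {
        r1 = x.load(std::memory_order_relaxed);   // access 1
        y.store(r1, std::memory_order_relaxed);   // access 2
    });
    std::thread t2([&] {
        r2 = y.load(std::memory_order_relaxed);   // access 3
        x.store(r2, std::memory_order_relaxed);   // access 4
    });
    t1.join();  // join() makes r1 and r2 safely visible to the caller
    t2.join();
    return {r1, r2};
}
```

Whatever the interleaving, both registers end up equal: the only values ever stored are copies of what the other thread loaded. The informal C++ wording asks implementations to make sure that common value is 0 rather than something invented.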

slide-3
SLIDE 3
Why Use Relaxed Atomics?

  • But generally use simple, SW-based coherence

– Cost of staying away from relaxed atomics too high!

[Figure: speedup chart, axis 0X to 20X; labeled values 27X, 99X, 28X]

slide-4
SLIDE 4
Our Approach

  • Previous work

– Goal: formal semantics for all possible relaxed atomics uses
– No widely accepted formal semantics after ~15 years of effort

  • Insight: analyze how real codes use relaxed atomics

– What are common uses of relaxed atomics?
– Why do they work?
– Can we formalize semantics for them?

slide-5
SLIDE 5

Contributions

  • Identified common uses of relaxed atomics

– Work queues, event counters, ref counters, seqlocks, …

  • Data-race-free-relaxed (DRFrlx) memory model:

– SC-centric semantics + efficiency

  • Evaluated benefits of using relaxed atomics

– Up to 53% fewer cycles (33% avg), 40% less energy (20% avg)

Everyone can safely use RAts

slide-6
SLIDE 6

Outline

  • Motivation
  • Background
  • Data-race-free-relaxed
  • Results
  • Conclusion


slide-7
SLIDE 7

Atomics Background

  • Default: Data-race-free-0 (DRF0) [ISCA ’90]

– Identify all races as synchronization accesses (C++: atomics)
– All atomics order data accesses
– Atomics order other atomics

Ensures SC semantics if no data races

// each thread
for i = 0:n
  …
  ADD R4, A[i], R1
  ADD R5, B[i], R1
  …
synch (atomic)
synch (atomic)
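The DRF0 discipline above can be sketched in C++, where atomics default to sequentially consistent ordering. In this minimal, hypothetical example (the function `drf0_message_passing` and its variables are not from the talk), the only race is on the flag `ready`, so it is declared atomic. The accesses to `data` are ordinary and are ordered by the atomic.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Returns the sum the consumer observes after synchronizing on `ready`.
int drf0_message_passing(int n) {
    std::vector<int> data(n, 0);      // ordinary (non-atomic) data accesses
    std::atomic<bool> ready{false};   // the racy location is marked atomic
    int sum = 0;

    std::thread producer([&] {
        for (int i = 0; i < n; ++i) {
            data[i] = i + 1;          // data writes
        }
        ready.store(true);            // synchronization (seq_cst by default)
    });
    std::thread consumer([&] {
        while (!ready.load()) {       // synchronization (seq_cst by default)
            std::this_thread::yield();
        }
        for (int i = 0; i < n; ++i) {
            sum += data[i];           // data reads, ordered after the flag
        }
    });
    producer.join();
    consumer.join();
    return sum;
}
```

Because the program has no data races, every execution behaves like some sequentially consistent interleaving, so for `n = 100` the consumer always sees the full sum.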

slide-8
SLIDE 8

Atomics Background (Cont.)

  • Default: Data-race-free-0 (DRF0) [ISCA ’90]

– All atomics order data accesses
– Atomics order other atomics

Ensures SC semantics if no data races

  • Data-race-free-1 (DRF1): unpaired atomics [TPDS ’93]

+ Unpaired atomics do not order data accesses
– Atomics order other atomics

Ensures SC semantics if no data races

  • Relaxed atomics [PLDI ’08]

+ Do not order data or other atomics

But can violate SC and have no formal specification
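To make "can violate SC" concrete, here is a minimal sketch of the store-buffering litmus test with relaxed atomics. The names `StoreBufferingResult` and `run_store_buffering` are hypothetical. Under sequential consistency at least one thread must observe the other's store, so `r1 == 0 && r2 == 0` is impossible. With `std::memory_order_relaxed` the C++ memory model permits that outcome, though whether it actually shows up depends on the hardware and compiler.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical result type for this sketch.
struct StoreBufferingResult {
    int r1;
    int r2;
};

// Store-buffering litmus test: each thread stores to one location and then
// loads the other, with no ordering between the two relaxed accesses.
StoreBufferingResult run_store_buffering() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r1 = -1;
    int r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return {r1, r2};  // (0, 0) is allowed here but not under SC
}
```

Replacing both memory orders with the default (sequentially consistent) ordering rules out the `(0, 0)` result, at the cost of ordering every atomic against the others.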

slide-9
SLIDE 9

Outline

  • Motivation
  • Background
  • Data-race-free-relaxed
  • Results
  • Conclusion


slide-10
SLIDE 10

Identifying Relaxed Atomic Use Cases

  • Our Approach

– What are common uses of relaxed atomics?
– Why do they work?
– Can we formalize semantics for them?

  • Contacted vendors, developers, and researchers

How do relaxed atomics work in Event Counters?

slide-11
SLIDE 11

Event Counter

  • Threads concurrently update counters

– Each thread reads part of a data array and updates its counter

[Figure labels: Accel, four L1 caches, L2 cache, Counters; values shown: … 0 0 0 0 0 0 and 1 1 1 1 1 1]

slide-12
SLIDE 12

Event Counter (Cont.)

  • Threads concurrently update counters

– Each thread reads part of a data array and updates its counter
– Increments race, so they have to use atomics

[Figure labels: Accel, four L1 caches, L2 cache, Counters; values shown: … 2 1 3 2 1 1 and 1 1 1 1 1 1]

slide-13
SLIDE 13

Event Counter (Cont.)

  • Threads concurrently update counters

– Each thread reads part of a data array and updates its counter
– Increments race, so they have to use atomics

[Figure labels: Accel, four L1 caches, L2 cache, Counters; values shown: 7 1 9 1 5 3]

Commutative increments: order does not affect the final result

How to formalize?
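The event counter pattern can be sketched in C++ as a histogram updated with relaxed `fetch_add`. This is an illustrative CPU-thread version, not the accelerator code from the talk. The function `count_events`, its parameters, and the binning rule `data[i] % num_bins` are assumptions made for the example. It also assumes non-negative inputs and `num_threads > 0`.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each worker reads its slice of `data` and increments the matching counter.
std::vector<int> count_events(const std::vector<int>& data,
                              std::size_t num_bins,
                              std::size_t num_threads) {
    std::vector<std::atomic<int>> counters(num_bins);
    for (auto& c : counters) {
        c.store(0, std::memory_order_relaxed);  // explicit zero for pre-C++20
    }

    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(t * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&counters, &data, begin, end, num_bins] {
            for (std::size_t i = begin; i < end; ++i) {
                const std::size_t bin =
                    static_cast<std::size_t>(data[i]) % num_bins;
                // Increments race, so they are atomic. The returned old value
                // is discarded, so no intermediate count is ever observed.
                counters[bin].fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();  // joining orders all increments before the final reads
    }

    std::vector<int> result;
    for (auto& c : counters) {
        result.push_back(c.load(std::memory_order_relaxed));
    }
    return result;
}
```

The increments can land in any order, but the counts read after the joins are the same in every execution, which is the property the slide calls out.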

slide-14
SLIDE 14

Incorporating Commutativity Into DRFrlx

  • New relaxed atomic category: commutative
  • Formalism (Intuition):

– Accesses are commutative
– Intermediate values must not be observed

Final result is always SC

slide-15
SLIDE 15

Commutative Definitions for an SC Execution

  • Commutativity

– Two accesses to a memory location M are commutative if they:

  • Can be performed in any order and
  • Yield the same final result for M

  • X and Y form a commutative race iff:

1. X and Y form a race,
2. At least one of X and Y is distinguished as commutative, and
3. Either:

  • X and Y are not commutative, or
  • The value loaded by either is used by another instruction in its thread
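The commutativity definition can be checked on small examples with a sequential sketch. Here an access is modeled, purely as an assumption for illustration, as a function from the old value of M to its new value. `Access`, `commute_from`, and the three sample accesses are hypothetical names. The check covers one initial value only, and it does not model the second part of condition 3, whether a loaded value is used later in its thread.

```cpp
#include <cassert>
#include <functional>

// Hypothetical model of an access to location M: old value -> new value.
using Access = std::function<int(int)>;

// True if applying the two accesses in either order leaves M with the same
// final value, starting from `initial`.
bool commute_from(int initial, const Access& first, const Access& second) {
    const int first_then_second = second(first(initial));
    const int second_then_first = first(second(initial));
    return first_then_second == second_then_first;
}

// Sample accesses: two increments and one plain store.
const Access add_one = [](int m) { return m + 1; };
const Access add_five = [](int m) { return m + 5; };
const Access store_seven = [](int) { return 7; };
```

Two increments commute: from 0, both orders give 6. An increment and a store do not: store-then-add gives 8, add-then-store gives 7. That is why the event counter's increments can be marked commutative while a racing store to the same counter would form a commutative race.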
slide-16
SLIDE 16

Commutative Program and Model Definitions

  • DRFrlx Program

– A program is DRFrlx iff for every SC execution of the program:

  • There are no data races or commutative races in the execution

  • DRFrlx Model

– A system obeys DRFrlx iff:

  • The result of every execution of a DRFrlx program is the result of an SC execution of the program

What about other use cases?

slide-17
SLIDE 17

Incorporating Other Use Cases Into DRFrlx

Semantics                            Category       Use Case
SC                                   Unpaired       Work Queues
SC                                   Non-Ordering   Flags
Final result always SC               Commutative    Event Counters
SC                                   Speculative    Seqlocks
SC-centric: non-SC parts isolated    Quantum        Ref Counters, Split Counters

slide-18
SLIDE 18

Outline

  • Motivation
  • Background
  • Data-race-free-relaxed
  • Results
  • Conclusion


slide-19
SLIDE 19

Evaluation Methodology

  • 1 CPU core + 15 GPU compute units (CU)

– Each node has private L1, scratchpad, tile of shared L2

  • Simulation Environment

– GEMS, Simics, Garnet, GPGPU-Sim, GPUWattch, McPAT

  • Study DRF0, DRF1, DRFrlx w/ GPU & DeNovoA coherence
  • Workloads

– Microbenchmarks for each use case
– Benchmarks with biggest RAts speedups on discrete GPU

  • UTS, PageRank (PR), Betweenness Centrality (BC)
slide-20
SLIDE 20

Relaxed Atomics Applications – Execution Time

[Figure: execution time (axis 0% to 100%, one value labeled 104) per benchmark (PR-1 to PR-4, UTS, BC-1 to BC-4, AVG) for GD0, GD1, GDR, DD0, DD1, and DDR]

GD0 = GPU coherence + DRF0
GD1 = GPU coherence + DRF1
GDR = GPU coherence + DRFrlx
DD0 = DeNovoA coherence + DRF0
DD1 = DeNovoA coherence + DRF1
DDR = DeNovoA coherence + DRFrlx

slide-21
SLIDE 21

Relaxed Atomics Applications – Execution Time

[Figure: execution time (axis 0% to 100%) per benchmark (PR-1 to PR-4, UTS, BC-1 to BC-4, AVG) for GD0, GD1, GDR, DD0, DD1, and DDR]

Relaxed atomics reduce cycles by up to ~50%
DRF1 increases data reuse (21% avg vs. GD0)
DRFrlx overlaps atomics (15% avg vs. GD1)

slide-22
SLIDE 22

Relaxed Atomics Applications – Execution Time

[Figure: execution time (axis 0% to 100%, one value labeled 104) per benchmark (PR-1 to PR-4, UTS, BC-1 to BC-4, AVG) for GD0, GD1, GDR, DD0, DD1, and DDR]

Relaxed atomics reduce cycles by up to ~50%
DeNovoA increases reuse over GPU: 10% avg. for DRFrlx

slide-23
SLIDE 23

Relaxed Atomics Applications – Energy

[Figure: energy (axis 0% to 100%, one value labeled 104) per benchmark (PR-1 to PR-4, UTS, BC-1 to BC-4, AVG), broken down into N/W, L2 $, L1 D$, Scratch, and GPU Core+]

Energy similar to execution time trends
DeNovoA’s reuse reduces energy over GPU: 29% avg. for DRFrlx

slide-24
SLIDE 24

Conclusion

  • Cost of avoiding relaxed atomics too high
  • Difficult to use correctly: no formal specification
  • Insight: Analyze how real codes use relaxed atomics

DRFrlx: SC-centric semantics + efficiency

Everyone can safely use RAts