
SLIDE 1

Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems

Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve
University of Illinois @ Urbana-Champaign
hetero@cs.illinois.edu
Paper available at: http://rsim.cs.illinois.edu/pubs.html

SLIDE 2

No Formal Specification for Relaxed Atomics

C++17 "specification" for relaxed atomics

– Races that don't order other accesses
– Implementations should ensure no “out-of-thin-air” values are computed that circularly depend on their own computation

“C++ (relaxed) atomics were the worst idea ever. I just spent days (and days) trying to get something to work. … My example only has 2 addresses and 4 accesses, it shouldn’t be this hard. Can you help?”

– Email from employee at major research lab

Formal specification for relaxed atomics is a longstanding problem

SLIDE 3

Why Use Relaxed Atomics?

But generally use simple, SW-based coherence

– Cost of staying away from relaxed atomics too high!

[Figure: bar chart of speedup (axis 0X to 20X); bars labeled 27X, 99X, and 28X]

SLIDE 4

Our Approach

Previous work

– Goal: formal semantics for all possible relaxed atomics uses
– No widely accepted formal semantics after ~15 years of effort

Insight: analyze how real codes use relaxed atomics

– What are common uses of relaxed atomics?
– Why do they work?
– Can we formalize semantics for them?

SLIDE 5

Contributions

Identified common uses of relaxed atomics

– Work queues, event counters, ref counters, seqlocks, …

Data-race-free-relaxed (DRFrlx) memory model:

– SC-centric semantics + efficiency

Evaluated benefits of using relaxed atomics

– Up to 53% fewer cycles (33% avg), 40% less energy (20% avg)

Everyone can safely use RAts

SLIDE 6

Outline

Motivation
Background
Data-race-free-relaxed
Results
Conclusion


SLIDE 7

Atomics Background

Default: Data-race-free-0 (DRF0) [ISCA ‘90]

– Identify all races as synchronization accesses (C++: atomics)
– All atomics order data accesses
– Atomics order other atomics

Ensures SC semantics if no data races

// each thread
for i = 0:n
  …
  ADD R4, A[i], R1
  ADD R5, B[i], R1
  …
synch (atomic)
synch (atomic)
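A minimal C++ sketch of the DRF0 guarantee, assuming a two-thread message-passing pattern that is not taken from the slides. The names `pass_message`, `payload`, and `ready` are hypothetical. The only race is on `ready`, which is an atomic with the default (sequentially consistent) ordering, so it orders the plain accesses to `payload`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example: 'payload' is ordinary data, 'ready' is the
// synchronization (atomic) access that orders it.
int pass_message() {
    int payload = 0;                 // plain (non-atomic) data
    std::atomic<bool> ready{false};  // atomic; seq_cst by default
    int observed = -1;

    std::thread producer([&] {
        payload = 42;       // data write
        ready.store(true);  // seq_cst store: ordered after the write above
    });
    std::thread consumer([&] {
        while (!ready.load()) {         // seq_cst load
            std::this_thread::yield();  // wait for the producer
        }
        observed = payload;  // ordered after the load, so it reads 42
    });

    producer.join();
    consumer.join();
    assert(observed == 42);  // no data race, so the result is SC
    return observed;
}
```

Calling `pass_message()` returns 42 on every run, because the program has no data races and the atomics order the data accesses.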

SLIDE 8

Atomics Background (Cont.)

Default: Data-race-free-0 (DRF0) [ISCA ‘90]

– All atomics order data accesses
– Atomics order other atomics

Ensures SC semantics if no data races

Data-race-free-1 (DRF1): unpaired atomics [TPDS ‘93]

+ Unpaired atomics do not order data accesses
– Atomics order other atomics

Ensures SC semantics if no data races

Relaxed atomics [PLDI ‘08]

+ Do not order data or other atomics

But can violate SC and no formal specification
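To make "can violate SC" concrete, here is a hedged sketch of the classic store-buffering litmus test written with C++ `std::memory_order_relaxed`. The function name `store_buffering` is hypothetical and the test is not from the slides. Under SC at least one thread must read 1; with relaxed atomics the C++ memory model also permits both loads to return 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Store-buffering litmus test. Returns (r1, r2).
// SC allows (0,1), (1,0), (1,1); relaxed atomics additionally allow (0,0).
std::pair<int, int> store_buffering() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r1 = -1;
    int r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);  // not ordered after the store
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);  // not ordered after the store
    });

    t1.join();
    t2.join();
    assert((r1 == 0 || r1 == 1) && (r2 == 0 || r2 == 1));
    return {r1, r2};
}
```

Whether the non-SC outcome (0, 0) is ever observed depends on the compiler and hardware; a single run proves nothing either way.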


SLIDE 9

Outline

Motivation
Background
Data-race-free-relaxed
Results
Conclusion


SLIDE 10

Identifying Relaxed Atomic Use Cases

Our Approach

– What are common uses of relaxed atomics?
– Why do they work?
– Can we formalize semantics for them?

Contacted vendors, developers, and researchers

How do relaxed atomics work in Event Counters?

SLIDE 11

Event Counter

Threads concurrently update counters

– Each thread reads part of a data array and updates its counter

[Figure: accelerator (Accel) with four L1 caches and an L2 cache; counters read 0 0 0 0 0 0]

SLIDE 12

Event Counter (Cont.)

Threads concurrently update counters

– Each thread reads part of a data array and updates its counter
– Increments race, so have to use atomics

[Figure: accelerator (Accel) with four L1 caches and an L2 cache; counters read 2 1 3 2 1 1]

SLIDE 13

Event Counter (Cont.)

Threads concurrently update counters

– Each thread reads part of a data array and updates its counter
– Increments race, so have to use atomics

[Figure: accelerator (Accel) with four L1 caches and an L2 cache; counters read 7 1 9 1 5 3]

Commutative increments: order does not affect final result

How to formalize?
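The event-counter pattern can be sketched in C++ as follows. This is an illustrative CPU-only version, not the accelerator code from the talk; `count_events` and its parameters are hypothetical names. Each thread scans its slice of the data and increments a shared counter with a relaxed `fetch_add` whose return value is discarded, so no thread observes an intermediate count.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: counts how many elements of 'data' fall in each bin.
std::vector<int> count_events(const std::vector<unsigned>& data,
                              std::size_t num_bins, std::size_t num_threads) {
    assert(num_bins > 0 && num_threads > 0);
    std::vector<std::atomic<int>> counters(num_bins);
    for (auto& c : counters) c.store(0, std::memory_order_relaxed);

    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i) {
                // Increments race, so they are atomic; relaxed is enough
                // because the returned (intermediate) value is never used.
                counters[data[i] % num_bins].fetch_add(
                    1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();  // join orders the final reads below

    std::vector<int> result;
    for (auto& c : counters) result.push_back(c.load(std::memory_order_relaxed));
    return result;
}
```

Because the increments commute, the final counts are the same for every interleaving: `count_events({0, 1, 1, 2, 2, 2}, 3, 4)` yields the counts 1, 2, 3.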

SLIDE 14

Incorporating Commutativity Into DRFrlx

New relaxed atomic category: commutative

Formalism (Intuition):

– Accesses are commutative
– Intermediate values must not be observed

Final result is always SC

SLIDE 15

Commutative Definitions for an SC Execution

Commutativity

– Two accesses to a memory location M are commutative if they:

  Can be performed in any order and
  Yield the same final result for M

X and Y form a commutative race iff:

1. X and Y form a race,
2. At least one of X and Y is distinguished as commutative, &
3. X and Y are:

  Not commutative or
  Value loaded by either is used by another instr. in its thread
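Two hedged C++ sketches of accesses that would meet condition 3 above, if the programmer had distinguished them as commutative. C++ has no such annotation, so `std::memory_order_relaxed` stands in for it here, and the function names are hypothetical. In `take_tickets` the value loaded by each increment is used by its thread; in `store_vs_add` a store races with an increment and the two do not commute.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Loaded value is used: which thread gets which ticket depends on the
// order of the increments, so intermediate values are observed.
std::vector<int> take_tickets(int num_threads) {
    assert(num_threads > 0);
    std::atomic<int> next{0};
    std::vector<int> tickets(num_threads, -1);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&next, &tickets, t] {
            tickets[t] = next.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return tickets;
}

// Not commutative: store-then-add leaves 6, add-then-store leaves 5.
int store_vs_add() {
    std::atomic<int> x{0};
    std::thread t1([&x] { x.store(5, std::memory_order_relaxed); });
    std::thread t2([&x] { x.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    return x.load();
}
```

Both functions are well-defined C++ (the accesses are atomic), but their results depend on the interleaving, which is what the commutative-race definition is meant to rule out.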

SLIDE 16

Commutative Program and Model Definitions

DRFrlx Program

– A program is DRFrlx iff for every SC execution of the program:

  No data races or commutative races in the execution

DRFrlx Model

– A system obeys DRFrlx iff:

  The result of every execution of a DRFrlx program is the result of an SC execution of the program

What about other use cases?

SLIDE 17

Incorporating Other Use Cases Into DRFrlx

Category: Unpaired, Non-Ordering, Commutative, Speculative, Quantum
Use Case: Work Queues, Flags, Seqlocks, Event Counters, Ref Counters, Split Counters
Semantics: SC; Final result always SC; SC-centric: non-SC parts isolated

SLIDE 18

Outline

Motivation
Background
Data-race-free-relaxed
Results
Conclusion


SLIDE 19

Evaluation Methodology

1 CPU core + 15 GPU compute units (CU)

– Each node has private L1, scratchpad, tile of shared L2

Simulation Environment

– GEMS, Simics, Garnet, GPGPU-Sim, GPUWattch, McPAT

Study DRF0, DRF1, DRFrlx w/ GPU & DeNovoA coherence

Workloads

– Microbenchmarks for each use case
– Benchmarks with biggest RAts speedups on discrete GPU: UTS, PageRank (PR), Betweenness Centrality (BC)

SLIDE 20

Relaxed Atomics Applications – Execution Time

[Figure: bar chart of execution time (axis 0% to 100%, one bar labeled 104) for PR-1, PR-2, PR-3, PR-4, UTS, BC-1, BC-2, BC-3, BC-4, and AVG; bars per workload: GD0, GD1, GDR, DD0, DD1, DDR]

GD0 = GPU coherence + DRF0
GD1 = GPU coherence + DRF1
GDR = GPU coherence + DRFrlx
DD0 = DeNovoA coherence + DRF0
DD1 = DeNovoA coherence + DRF1
DDR = DeNovoA coherence + DRFrlx

SLIDE 21

Relaxed Atomics Applications – Execution Time

[Figure: bar chart of execution time (axis 0% to 100%) for PR-1, PR-2, PR-3, PR-4, UTS, BC-1, BC-2, BC-3, BC-4, and AVG; bars per workload: GD0, GD1, GDR, DD0, DD1, DDR]

Relaxed atomics reduce cycles up to ~50%
DRF1 increases data reuse (21% avg vs. GD0)
DRFrlx overlaps atomics (15% avg vs. GD1)

SLIDE 22

Relaxed Atomics Applications – Execution Time

[Figure: bar chart of execution time (axis 0% to 100%, one bar labeled 104) for PR-1, PR-2, PR-3, PR-4, UTS, BC-1, BC-2, BC-3, BC-4, and AVG; bars per workload: GD0, GD1, GDR, DD0, DD1, DDR]

Relaxed atomics reduce cycles up to ~50%
DeNovoA increases reuse over GPU: 10% avg. for DRFrlx

SLIDE 23

Relaxed Atomics Applications – Energy

[Figure: stacked bar chart of energy (axis 0% to 100%, one bar labeled 104) for PR-1, PR-2, PR-3, PR-4, UTS, BC-1, BC-2, BC-3, BC-4, and AVG; components: N/W, L2 $, L1 D$, Scratch, GPU Core+]

Energy similar to execution time trends
DeNovoA’s reuse reduces energy over GPU: 29% avg. for DRFrlx

SLIDE 24

Conclusion


DRFrlx: SC-centric semantics + efficiency

Everyone can safely use RAts

Cost of avoiding relaxed atomics too high
Difficult to use correctly: no formal specification
Insight: Analyze how real codes use relaxed atomics