Compiler Construction of Idempotent Regions and Applications in Architecture Design - PowerPoint PPT Presentation




SLIDE 1

Compiler Construction of Idempotent Regions and Applications in Architecture Design

Marc de Kruijf

Advisor: Karthikeyan Sankaralingam

PhD Defense 07/20/2012

SLIDE 2

Example

2

int sum(int *array, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += array[i];
  return x;
}

source code

SLIDE 3

Example

3

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

assembly code, annotated on the slide with faults, exceptions, and mis-speculations (e.g. a bad load) striking mid-execution

SLIDE 4

Example

4

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

assembly code

SLIDE 5

R0 and R1 are unmodified:

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

Example

5

assembly code

just re-execute!

convention: use checkpoints/buffers

SLIDE 6

It’s Idempotent!

6

idempoh… what…?

int sum(int *data, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += data[i];
  return x;
}


SLIDE 7

Thesis

7

idempotent regions ALL THE TIME

specifically…

– using compiler analysis (intra-procedural)
– transparent; no programmer intervention
– hardware/software co-design: software analysis, hardware execution

the thing that I am defending

SLIDE 8

Thesis

8

preliminary exam (11/2010)
– idempotence: concept and simple empirical analysis
– compiler: preliminary design & partial implementation
– architecture: some area and power savings…?

defense (07/2012)
– idempotence: formalization and detailed empirical analysis
– compiler: complete design, source code release*
– architecture: compelling benefits (various)

* http://research.cs.wisc.edu/vertical/iCompiler

SLIDE 9

Contributions & Findings

9

a summary

contribution areas
– idempotence: models and analysis framework
– compiler: design, implementation, and evaluation
– architecture: design and evaluation

findings
– potentially large idempotent regions exist in applications
– for compilation, larger is better
  – small regions (5-15 instructions): 10-15% overheads
  – large regions (50+ instructions): 0-2% overheads
– enables efficient exception and hardware fault recovery

SLIDE 10

10

Overview

❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation

SLIDE 11

Idempotence Models

11

idempotence: what does it mean?

DEFINITION:
(1) a region is idempotent iff its re-execution has no side-effects
(2) a region is idempotent iff it preserves its inputs

OK, but what does it mean to preserve an input?
MODEL A: An input is a variable that is live-in to the region. A region preserves an input if the input is not overwritten.
four models (next slides): A, B, C, & D

SLIDE 12

Idempotence Model A

12

R1 = R2 + R3
if R1 > 0 (true/false branch)
    mem[R4] = R1
SP = SP - 16

a starting point

Live-ins: {all registers}, {all memory} \ {R1}

?? = mem[R4] … ?

SLIDE 13

Idempotence Model A

13

R1 = R2 + R3
if R1 > 0 (true/false branch)
    mem[R4] = R1
SP = SP - 16

a starting point

Live-ins: {all registers}, {all memory}

SLIDE 14

?? = mem[R4] … ?

Idempotence Model A

14

R1 = R2 + R3
if R1 > 0 (true/false branch)
    mem[R4] = R1
SP = SP - 16

a starting point

Live-ins: {all registers}, {all memory} \ {mem[R4]}

SLIDE 15

?? = mem[R4] … ?

Idempotence Model B

15

R1 = R2 + R3
if R1 > 0 (true/false branch)
    mem[R4] = R1
SP = SP - 16

live-in but dynamically dead at time of write – OK to overwrite if control flow invariable

varying control flow assumptions

SLIDE 16

Idempotence Model C

16

R1 = R2 + R3
if R1 > 0 (true/false branch)
    mem[R4] = R1
SP = SP - 16

allow final instruction to overwrite input (to include otherwise ineligible instructions)

varying sequencing assumptions

SLIDE 17

Idempotence Model D

17

R1 = R2 + R3
if R1 > 0 (true/false branch)
    mem[R4] = R1
SP = SP - 16

may be concurrently read in another thread – consider as input

varying isolation assumptions

SLIDE 18

Idempotence Models

18

an idempotence taxonomy
– sequencing axis (Model C)
– control axis (Model B)
– isolation axis (Model D)

SLIDE 19

what are the implications?

19

SLIDE 20

Empirical Analysis

20

methodology

measurement
– dynamic region size (path length), subject to axis constraints
– x86 dynamic instruction count (using PIN)

benchmarks

– SPEC 2006, PARSEC, and Parboil suites

experimental configurations

– unconstrained: ideal upper bound (Model C)
– oblivious: actual in normal compiled code (Model C)
– X-constrained: ideal upper bound constrained by axis X

SLIDE 21

Empirical Analysis

21

oblivious vs. unconstrained: average region size*
[bar chart, log scale 1-1000: SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL]
oblivious: 5.2 / unconstrained: 160.2
*geometrically averaged across suites

SLIDE 22

Empirical Analysis

22

control axis sensitivity: average region size*
[bar chart, log scale 1-1000: SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL]
oblivious: 5.2 / unconstrained: 160.2 / control-constrained: 40.1
*geometrically averaged across suites

SLIDE 23

Empirical Analysis

23

isolation axis sensitivity: average region size*
[bar chart, log scale 1-1000: SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL]
oblivious: 5.2 / unconstrained: 160.2 / control-constrained: 40.1 / isolation-constrained: 27.4
*geometrically averaged across suites

SLIDE 24

Empirical Analysis

24

sequencing axis sensitivity: non-idempotent instructions*
[bar chart, 0%-2%: SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL]
overall: 0.19%
*geometrically averaged across suites

SLIDE 25

Idempotence Models

25

a summary

a spectrum of idempotence models
– significant opportunity: region sizes of 100+ instructions possible
– 4x reduction constraining the control axis
– 1.5x reduction constraining the isolation axis

two models going forward
– architectural idempotence & contextual idempotence
– both are effectively the ideal case (Model C)
– architectural idempotence: invariable control always
– contextual idempotence: variable control w.r.t. locals

SLIDE 26

26

Overview

❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation

SLIDE 27

Compiler Design

27

choose your own adventure

ANALYZE: identify semantic clobber antidependences
PARTITION: cut semantic clobber antidependences
CODE GEN: preserve semantic idempotence

SLIDE 28

Compiler Evaluation

28

preamble

WHAT DO YOU MEAN:

PERFORMANCE OVERHEADS?

ANALYZE: identify semantic clobber antidependences
PARTITION: cut semantic clobber antidependences
CODE GEN: preserve semantic idempotence

SLIDE 29

Compiler Evaluation

29

preamble

region size vs. overhead: register pressure
  • preserve input values in registers; spill other values (if needed), or
  • spill input values to stack; allocate other values to registers
SLIDE 30

Compiler Evaluation

30

compiler implementation

– LLVM, support for both x86 and ARM

methodology

measurements
– performance overhead: dynamic instruction count (for x86, using PIN; for ARM, using gem5, just for the ISA comparison at the end)
– region size: instructions between boundaries (path length); x86 only, using PIN

benchmarks

– SPEC 2006, PARSEC, and Parboil suites

SLIDE 31

Results, Take 1/3

31

initial results – overhead
performance overhead: percentage increase in x86 dynamic instruction count, geometrically averaged across suites
[bar chart, 0%-20%: SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL]
overall: 13.1%

SLIDE 32

Results, Take 1/3

32

analysis of trade-offs: region size vs. overhead
[curve: overhead (0%-50%) vs. region size (1 to 1,000,000 instructions); register pressure dominates at small sizes]
YOU ARE HERE: 10+ instructions (typically 10-30)

SLIDE 33

Results, Take 1/3

33

analysis of trade-offs: region size vs. overhead
[curve: register pressure falls as regions grow; detection latency costs rise]

SLIDE 34

Results, Take 2/3

34

minimizing register pressure
performance overhead
[bar chart, 0%-20%: SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL]
before: 13.1% / after: 11.1%

SLIDE 35

Results, Take 2/3

35

analysis of trade-offs: region size vs. overhead
[curve: register pressure vs. detection latency and re-execution time as region size grows]

SLIDE 36

Big Regions

36

how do we get there?

Problem #1: aliasing analysis
– no flow-sensitive analysis in LLVM; really hurts loops

Problem #2: loop optimizations
– boundaries in loops are bad for everyone
– loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help

Problem #3: large array structures
– awareness of array access patterns can help

Problem #4: intra-procedural scope
– limited scope aggravates all effects listed above

SLIDE 37

Big Regions

37

how do we get there?

solutions can be automated
– a lot of work… what would be the gain?

ad hoc for now
– consider PARSEC and Parboil suites as a case study
– aliasing annotations
– manual loop refactoring, scalarization, etc.
– partitioning algorithm refinements (application-specific)
– inlining annotations

SLIDE 38

Results, Take 3/3

38

big regions
performance overhead
[bar chart, 0%-14%: PARSEC, Parboil, OVERALL]
before: 13.1% / after: 0.06%

SLIDE 39

Results, Take 3/3

39

50+ instructions is good enough
[curve: overhead (0%-50%) vs. region size (1 to 1,000,000 instructions); register pressure shrinks with larger regions; 50+ instructions suffices even when mis-optimized]

SLIDE 40

ISA Sensitivity

40

you might be curious: does the ISA matter?

(1) two-address (e.g. x86) vs. three-address (e.g. ARM)
(2) register-memory (e.g. x86) vs. register-register (e.g. ARM)
(3) number of available registers

the short version

– impact of (1) & (2) not significant (+/- 2% overall), and even less significant as regions grow larger
– impact of (3): to get the same performance with idempotence, increase registers by 0% (large regions) to ~60% (small regions)

SLIDE 41

Compiler Design & Evaluation

41

a summary

design and implementation
– static analysis algorithms: modular and perform well
– code-gen algorithms: modular and perform well
– LLVM implementation, source code available*

findings
– pressure-related performance overheads range from 0% (large regions) to ~15% (small regions)
– greatest opportunity: loop-intensive applications
– ISA effects are insignificant

* http://research.cs.wisc.edu/vertical/iCompiler

SLIDE 42

42

Overview

❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation

SLIDE 43

Architecture Recovery: It’s Real

43

safety first / speed first (safety second)

SLIDE 44

44

[pipeline diagram: stages 1 Fetch, 2 Decode, 3 Execute, 4 Write-back, drawn with many sharp turns]

Architecture Recovery: It’s Real

lots of sharp turns

[pipeline diagram: 1 Fetch → 2 Decode → 3 Execute → 4 Write-back]

closer to the truth

SLIDE 45

45

[pipeline diagrams overlaid; an exception surfaces at write-back – !!!]

Architecture Recovery: It’s Real

lots of interaction

too late!

SLIDE 46

46

Architecture Recovery: It’s Real

bad stuff can happen

mis-speculation

(a) branch mis-prediction, (b) memory re-ordering, (c) transaction violation, etc.

hardware faults

(d) wear-out fault, (e) particle strike, (f) voltage spike, etc.

exceptions

(g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc.

SLIDE 47

47

Architecture Recovery: It’s Real

bad stuff can happen

mis-speculation

(a) branch mis-prediction, (b) memory re-ordering, (c) transaction violation, etc.

hardware faults

(d) wear-out fault, (e) particle strike, (f) voltage spike, etc.

exceptions

(g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc.

register pressure, detection latency, re-execution time

SLIDE 48

48

Architecture Recovery: It’s Real

bad stuff can happen

mis-speculation

(a) branch mis-prediction, (b) memory re-ordering, (c) transaction violation, etc.

hardware faults

(d) wear-out fault, (e) particle strike, (f) voltage spike, etc.

exceptions

(g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc.

SLIDE 49

49

Architecture Recovery: It’s Real

bad stuff can happen

hardware faults exceptions

(d) wear-out fault, (e) particle strike, (f) voltage spike, etc.
(g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc.

integrated GPU, low-power CPU, high-reliability systems

SLIDE 50

GPU Exception Support

50

SLIDE 51

51

GPU Exception Support

why would we want it?

GPU/CPU integration
– unified address space: support for demand paging
– numerous secondary benefits as well…

SLIDE 52

52

GPU Exception Support

why is it hard?

the CPU solution: pipeline registers, buffers

SLIDE 53

53

GPU Exception Support

why is it hard?

CPU: 10s of registers/core
GPU: 10s of registers/thread × 32 threads/warp × 48 warps per “core” = 10,000s of registers/core

SLIDE 54

54

GPU Exception Support

idempotence on GPUs

GPUs hit the sweet spot
(1) extractably large regions (low compiler overheads)
(2) detection latencies long or hard to bound (large is good)
(3) exceptions are infrequent (low re-execution overheads)

(trade-offs: register pressure, detection latency, re-execution time)

SLIDE 55

55

GPU Exception Support

idempotence on GPUs

GPU design topics
– compiler flow
– hardware support
– exception live-lock
– bonus: fast context switching

GPUs hit the sweet spot
(1) extractably large regions (low compiler overheads)
(2) detection latencies long or hard to bound (large is good)
(3) exceptions are infrequent (low re-execution overheads)

SLIDE 56

GPU Exception Support

56

evaluation methodology

compiler
– LLVM targeting ARM

benchmarks
– Parboil GPU benchmarks for CPUs, modified

simulation
– gem5 for ARM: simple dual-issue in-order (e.g. Fermi)
– 10-cycle page fault detection latency

measurement
– performance overhead in execution cycles

SLIDE 57

GPU Exception Support

57

evaluation results
performance overhead
[bar chart, 0.0%-1.5%: cutcp, fft, histo, mri-q, sad, tpacf, gmean]
gmean: 0.54%

SLIDE 58

CPU Exception Support

58

SLIDE 59

59

CPU Exception Support

why is it a problem?

the CPU solution: pipeline registers, buffers

SLIDE 60

60

CPU Exception Support

why is it a problem?

[pipeline diagram, before: Fetch → Decode, Rename & Issue → functional units (Integer ×2, Multiply, Load/Store, Branch, IEEE FP), register file, bypass, replay queue, flush & replay control]

Before

SLIDE 61

61

CPU Exception Support

why is it a problem?

[pipeline diagram, after: Fetch → Decode & Issue → functional units (Integer ×2, Branch, Multiply, Load/Store, FP), register file]

After

SLIDE 62

62

CPU Exception Support

idempotence on CPUs

CPU design simplification
– in ARM Cortex-A8 (dual-issue in-order) can remove:
  – bypass / staging register file, replay queue
  – rename pipeline stage
  – IEEE-compliant floating point unit
  – pipeline flush for exceptions and replays
  – all associated control logic

leaner hardware
– bonus: cheap (but modest) OoO issue

SLIDE 63

CPU Exception Support

63

evaluation methodology

compiler
– LLVM targeting ARM, minimize pressure (take 2/3)

benchmarks
– SPEC 2006 & PARSEC suites (unmodified)

simulation
– gem5 for ARM: aggressive dual-issue in-order (e.g. A8)
– stall on potential in-flight exception

measurement
– performance overhead in execution cycles

SLIDE 64

CPU Exception Support

64

evaluation results
performance overhead
[bar chart, 0%-14%: SPEC INT, SPEC FP, PARSEC, OVERALL]
overall: 9.1%

SLIDE 65

Hardware Fault Tolerance

65

SLIDE 66

66

Hardware Fault Tolerance

what is the opportunity?

reliability trends
– CMOS reliability is a growing problem
– future CMOS alternatives are no better

architecture trends
– hardware power and complexity are at a premium
– desire for simple hardware + efficient recovery

application trends
– emerging workloads consist of large idempotent regions
– increasing levels of software abstraction

SLIDE 67

67

Hardware Fault Tolerance

design topics

hardware organizations
– homogeneous: idempotence everywhere
– statically heterogeneous: e.g. accelerators
– dynamically heterogeneous: adaptive cores

fault detection capability
– fine-grained in hardware (e.g. Argus, MICRO ’07), or
– fine-grained in software (e.g. instruction/region DMR)

fault model (aka ISA semantics)
– similar to pipeline-based (e.g. ROB) recovery

SLIDE 68

Hardware Fault Tolerance

68

evaluation methodology

compiler
– LLVM targeting ARM (compiled to minimize pressure)

benchmarks
– SPEC 2006, PARSEC, and Parboil suites (unmodified)

simulation
– gem5 for ARM: simple dual-issue in-order
– DMR detection; compare against checkpoint/log and TMR

measurement
– performance overhead in execution cycles

SLIDE 69

Hardware Fault Tolerance

69

evaluation results

performance overhead
[bar chart, 0%-35%: idempotence 9.1%, checkpoint/log 22.2%, TMR 29.3%]

SLIDE 70

70

Overview

❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation

SLIDE 71

71

RELATED WORK CONCLUSIONS

SLIDE 72

Conclusions

72

idempotence: not good for everything
– small regions are expensive: preserving register state is difficult with limited flexibility
– large regions are cheap: preserving register state is easy with the amortization effect; preserving memory state is mostly “for free”

idempotence: synergistic with modern trends
– programmability (for GPUs)
– low power (for everyone)
– high-level software, efficient recovery (for everyone)

SLIDE 73

73

The End

SLIDE 74

74

Back-Up: Chronology

MapReduce for CELL

Time

SELSE ’09: Synergy
ISCA ’10: Relax
MICRO ’11: Idempotent Processors
PLDI ’12: Static Analysis and Compiler Design
ISCA ’12: iGPU
DSN ’10: TS model
CGO ??: Code Gen
TACO ??: Models
prelim → defense

SLIDE 75

Choose Your Own Adventure Slides

75

SLIDE 76

Idempotence Analysis

76

[operation-sequence diagram omitted from transcript]

is this idempotent? Yes

SLIDE 77

Idempotence Analysis

77

[operation-sequence diagram omitted from transcript]

how about this? No

SLIDE 78

Idempotence Analysis

78

[operation-sequence diagram omitted from transcript]

maybe this? Yes

SLIDE 79

Idempotence Analysis

79

operation sequence

dependence chain → idempotent?
– write → Yes
– read, write → No
– write, read, write → Yes

it’s all about the data dependences

SLIDE 80

Idempotence Analysis

80

operation sequence

dependence chain → idempotent?
– write, read → Yes
– read, write → No (CLOBBER ANTIDEPENDENCE)
– write, read, write → Yes

it’s all about the data dependences

clobber antidependence: an antidependence with an exposed read

SLIDE 81

Semantic Idempotence

81

two types of program state

(1) local (“pseudoregister”) state: can be renamed to remove clobber antidependences*; does not semantically constrain idempotence

(2) non-local (“memory”) state: cannot “rename” to avoid clobber antidependences; semantically constrains idempotence

semantic idempotence = no non-local clobber antidependences
preserve local state by renaming and careful allocation

SLIDE 82

Region Partitioning Algorithm

82

steps one, two, and three

Step 1: transform function – remove artificial dependences, remove non-clobbers
Step 2: construct regions around antidependences – cut all non-local antidependences in the CFG
Step 3: refine for correctness & performance – account for loops, optimize for dynamic behavior

SLIDE 83

Step 1: Transform

83

Transformation 1: SSA for pseudoregister antidependences

not one, but two transformations

But we still have a problem: clobber antidependences depend on region boundaries, which depend on region identification – a circular dependence

SLIDE 84

Step 1: Transform

84

Transformation 1: SSA for pseudoregister antidependences

not one, but two transformations

Transformation 2: Scalar replacement of memory variables

before: [x] = a; b = [x]; [x] = c;
after:  [x] = a; b = a; [x] = c;

non-clobber antidependences… GONE!

SLIDE 85

Step 1: Transform

85

Transformation 1: SSA for pseudoregister antidependences

not one, but two transformations

Transformation 2: Scalar replacement of memory variables

clobber antidependences depend on region boundaries, which depend on region identification

SLIDE 86

Region Partitioning Algorithm

86

steps one, two, and three

Step 1: transform function – remove artificial dependences, remove non-clobbers
Step 2: construct regions around antidependences – cut all non-local antidependences in the CFG
Step 3: refine for correctness & performance – account for loops, optimize for dynamic behavior

SLIDE 87

Step 2: Cut the CFG

87

cut, cut, cut…

construct regions by “cutting” non-local antidependences

SLIDE 88

Step 2: Cut the CFG

88

rough sketch

[curve: region size vs. overhead; sources of overhead; optimal region size?]
larger is (generally) better: large regions amortize the cost of input preservation

but where to cut…?

SLIDE 89

Step 2: Cut the CFG

89

but where to cut…?

goal: the minimum set of cuts that cuts all antidependence paths
intuition: minimum cuts → fewest regions → largest regions
approach: a series of reductions:
  minimum vertex multi-cut (NP-complete)
  → minimum hitting set among paths
  → minimum hitting set among “dominating nodes”
details omitted

SLIDE 90

Region Partitioning Algorithm

90

steps one, two, and three

Step 1: transform function – remove artificial dependences, remove non-clobbers
Step 2: construct regions around antidependences – cut all non-local antidependences in the CFG
Step 3: refine for correctness & performance – account for loops, optimize for dynamic behavior

SLIDE 91

Step 3: Loop-Related Refinements

91

loops affect correctness and performance

correctness: not all local antidependences are removed by SSA…
– loop-carried antidependences may clobber
– depends on boundary placement; handled as a post-pass

performance: loops tend to execute multiple times…
– to maximize region size, place cuts outside of the loop
– algorithm modified to prefer cuts outside of loops

details omitted

SLIDE 92

Code Generation Algorithms

92

idempotence preservation

background & concepts: live intervals, region intervals, and shadow intervals
compiling for architectural idempotence: invariable control flow upon re-execution
compiling for contextual idempotence: potentially variable control flow upon re-execution

SLIDE 93

Code Generation Algorithms

live intervals and region intervals

x = ...
... = f(x)
y = ...

93

[interval diagram: region boundaries, the region interval, and x’s live interval]

SLIDE 94

Code Generation Algorithms

shadow intervals

94

shadow interval: the interval over which a variable must not be overwritten specifically to preserve idempotence; different for architectural and contextual idempotence

SLIDE 95

Code Generation Algorithms

for contextual idempotence

x = ...
... = f(x)
y = ...

95

[interval diagram: region boundaries, x’s shadow interval, x’s live interval]

SLIDE 96

Code Generation Algorithms

for architectural idempotence

x = ...
... = f(x)
y = ...

96

[interval diagram: region boundaries, x’s shadow interval, x’s live interval]

SLIDE 97

Code Generation Algorithms

for architectural idempotence

x = ...
... = f(x)
y = ...

97

[interval diagram: region boundaries, x’s shadow interval, x’s live interval, y’s live interval]

SLIDE 98

Big Regions

98

Re: Problem #2 (cuts in loops are bad)

C code: for (i = 0; i < X; i++) { ... }

CFG + SSA:
i0 = φ(0, i1)
i1 = i0 + 1
if (i1 < X)

SLIDE 99

Big Regions

99

Re: Problem #2 (cuts in loops are bad)

C code: for (i = 0; i < X; i++) { ... }

machine code:
R0 = 0
R0 = R0 + 1
if (R0 < X)

SLIDE 100

Big Regions

100

Re: Problem #2 (cuts in loops are bad)

C code: for (i = 0; i < X; i++) { ... }

machine code:
R0 = 0
R0 = R0 + 1
if (R0 < X)

SLIDE 101

Big Regions

101

Re: Problem #2 (cuts in loops are bad)

C code: for (i = 0; i < X; i++) { ... }

machine code:
R1 = 0
R0 = R1
R1 = R0 + 1
if (R1 < X)

– “redundant” copy
– extra boundary (pressure)

SLIDE 102

Big Regions

102

Re: Problem #3 (array access patterns)

before: [x] = a; b = [x]; [x] = c;
after:  [x] = a; b = a; [x] = c;

non-clobber antidependences… GONE!

algorithm makes this simplifying assumption: cheap for scalars, expensive for arrays

SLIDE 103

Big Regions

103

Re: Problem #3 (array access patterns)

not really practical for large arrays
– but if we don’t do it, non-clobber antidependences remain
– solution: handle potential non-clobbers in a post-pass (the same way we deal with loop clobbers in static analysis)

// initialize:
int array[100];
memset(&array, 0, 100 * 4);
// accumulate:
for (...)
    array[i] += foo(i);

SLIDE 104

Big Regions

104

Benchmark     | Problems               | Size Before | Size After
blackscholes  | ALIASING, SCOPE        | 78.9        | >10,000,000
canneal       | SCOPE                  | 35.3        | 187.3
fluidanimate  | ARRAYS, LOOPS, SCOPE   | 9.4         | >10,000,000
streamcluster | ALIASING               | 120.7       | 4,928
swaptions     | ALIASING, ARRAYS       | 10.8        | 211,000
cutcp         | LOOPS                  | 21.9        | 612.4
fft           | ALIASING               | 24.7        | 2,450
histo         | ARRAYS, SCOPE          | 4.4         | 4,640,000
mri-q         | –                      | 22,100      | 22,100
sad           | ALIASING               | 51.3        | 90,000
tpacf         | ARRAYS, SCOPE          | 30.2        | 107,000

results: sizes

SLIDE 105

Big Regions

105

Benchmark     | Problems               | Overhead Before | Overhead After
blackscholes  | ALIASING, SCOPE        | 2.93%           | 0.05%
canneal       | SCOPE                  | 5.31%           | 1.33%
fluidanimate  | ARRAYS, LOOPS, SCOPE   | 26.67%          | 0.62%
streamcluster | ALIASING               | 13.62%          | 0.00%
swaptions     | ALIASING, ARRAYS       | 17.67%          | 0.00%
cutcp         | LOOPS                  | 6.344%          | 0.01%
fft           | ALIASING               | 11.12%          | 0.00%
histo         | ARRAYS, SCOPE          | 23.53%          | 0.00%
mri-q         | –                      | 0.00%           | 0.00%
sad           | ALIASING               | 4.17%           | 0.00%
tpacf         | ARRAYS, SCOPE          | 12.36%          | 0.02%

results: overheads

SLIDE 106

Big Regions

106

problem labels

Problem #1: aliasing analysis (ALIASING)
– no flow-sensitive analysis in LLVM; really hurts loops

Problem #2: loop optimizations (LOOPS)
– boundaries in loops are bad for everyone
– loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help

Problem #3: large array structures (ARRAYS)
– awareness of array access patterns can help

Problem #4: intra-procedural scope (SCOPE)
– limited scope aggravates all effects listed above

SLIDE 107

ISA Sensitivity

107

x86-64 vs. ARMv7

percentage overhead, same configuration as take 1/3
[bar chart, 5%-20%: x86-64 vs. ARMv7 across SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL]

SLIDE 108

ISA Sensitivity

108

general purpose register (GPR) sensitivity

[bar chart, 2%-14%: take 2/3 configuration with 14, 12, and 10 GPRs]
percentage overhead; ARMv7, 16 GPR baseline; data as geometric mean across SPEC INT

SLIDE 109

ISA Sensitivity

109

more registers isn’t always enough

C code:
x = 0;
if (y > 0) x = 1;
z = x + y;

machine code:
R0 = 0
if (R1 > 0) R0 = 1
R2 = R0 + R1

SLIDE 110

ISA Sensitivity

110

more registers isn’t always enough

C code:
x = 0;
if (y > 0) x = 1;
z = x + y;

machine code:
R0 = 0
if (R1 > 0)
  R3 = 1
else
  R3 = R0
R2 = R3 + R1

SLIDE 111

111

GPU Exception Support

compiler flow & hardware support

compiler: kernel source → source code compiler → IR → device code generator (partitioning, preservation) → idempotent device code

hardware: core (fetch, decode, functional units, general purpose registers, RPCs, L1, TLB) + L2 cache

SLIDE 112

112

GPU Exception Support

exception live-lock and fast context switching

bonus: fast context switching
– boundary locations are configurable at compile time
– observation 1: save/restore only live state
– observation 2: place boundaries to minimize liveness

exception live-lock
– multiple recurring exceptions can cause live-lock
– detection: save PC and compare
– recovery: single-stepped re-execution or re-compilation

SLIDE 113

113

CPU Exception Support

design simplification

idempotence enables OoO retirement
– simplifies result bypassing
– simplifies exception support for long-latency instructions
– simplifies scheduling of variable-latency instructions

OoO issue?

SLIDE 114

114

CPU Exception Support

design simplification

what about branch prediction, etc.?

high re-execution costs; live-lock issues
(trade-offs: register pressure, detection latency, re-execution time)

!!!

region placement to minimize re-execution...?

SLIDE 115

CPU Exception Support

115

minimizing branch re-execution cost
percentage overhead
[bar chart, 5%-25%: SPEC INT, SPEC FP, PARSEC, OVERALL]
take 2/3: 9.1% / cut at branch: 18.1%

SLIDE 116

116

Hardware Fault Tolerance

fault semantics

hardware fault model (fault semantics)
– side-effects are temporally contained to region execution
– side-effects are spatially contained to target resources
– control flow is legal (follows static CFG edges)

SLIDE 117

Related Work

117

on idempotence

Very Related               | Year | Domain
Sentinel Scheduling        | 1992 | Speculative memory re-ordering
Reference Idempotency      | 2006 | Reducing speculative storage
Restart Markers            | 2006 | Virtual memory in vector machines
Encore                     | 2011 | Hardware fault recovery

Somewhat Related           | Year | Domain
Multi-Instruction Retry    | 1995 | Branch and hardware fault recovery
Atomic Heap Transactions   | 1999 | Atomic memory allocation

SLIDE 118

118

Related Work

on idempotence

what’s new?

– idempotence model classification and analysis
– first work to decompose entire programs
– static analysis in terms of clobber (anti-)dependences
– static analysis and code generation algorithms
– overhead analysis: detection, pressure, re-execution
– comprehensive (and general) compiler implementation
– comprehensive compiler evaluation
– a spectrum of architecture designs & applications