Compiler Construction of Idempotent Regions and Applications in Architecture Design
Marc de Kruijf
Advisor: Karthikeyan Sankaralingam
PhD Defense 07/20/2012
int sum(int *data, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += data[i];
  return x;
}
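A minimal sketch of why this function works as a running example, assuming the usual definition that a region is idempotent when re-executing it from its entry produces the same result. `sum` only reads `data` and `len` and writes the local `x`, so it never overwrites its own inputs. The helper `sum_rerun_matches` and its test values are illustrative and do not come from the slides.

```c
#include <assert.h>

/* Example function from the slide: reads data[] and len, writes only the local x. */
int sum(int *data, int len) {
    int x = 0;
    for (int i = 0; i < len; ++i)
        x += data[i];
    return x;
}

/* Hypothetical check (not from the slides): run the region twice from its
 * entry and report whether both executions agree. */
int sum_rerun_matches(void) {
    int data[] = {1, 2, 3, 4};
    int first = sum(data, 4);
    int second = sum(data, 4);   /* re-execution from the region entry */
    return first == 10 && second == first;
}
```

Because the inputs survive the first execution unchanged, the second execution sees exactly the same state as the first.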
* http://research.cs.wisc.edu/vertical/iCompiler
false true
{all registers}
{all memory} \{R1}
false true
{all registers}
{all memory}
false true
{all registers}
{all memory} \{mem[R4]}
false true
live-in but dynamically dead at time of write – OK to overwrite if control flow invariable
false true
allow final instruction to overwrite input (to include otherwise ineligible instructions)
false true
may be concurrently read in another thread – consider as input
(Model C) (Model B) (Model D)
[Figure: average region size* (log scale, 1 to 1000) for SPEC INT, SPEC FP, PARSEC, Parboil, and OVERALL; series: unconstrained; labeled values: 5.2, 160.2]
*geometrically averaged across suites
[Figure: average region size* (log scale, 1 to 1000) for SPEC INT, SPEC FP, PARSEC, Parboil, and OVERALL; series: unconstrained, control-constrained; labeled values: 160.2, 5.2, 40.1]
*geometrically averaged across suites
[Figure: average region size* (log scale, 1 to 1000) for SPEC INT, SPEC FP, PARSEC, Parboil, and OVERALL; series: unconstrained, control-constrained, isolation-constrained; labeled values: 160.2, 5.2, 40.1, 27.4]
*geometrically averaged across suites
[Figure: non-idempotent instructions* (0% to 2%) for SPEC INT, SPEC FP, PARSEC, Parboil, and OVERALL; labeled value: 0.19%]
*geometrically averaged across suites
PARTITION ANALYZE CODE GEN COMPILER EVALUATION
[Figure: performance overhead (0% to 20%) for SPEC INT, SPEC FP, PARSEC, Parboil, and OVERALL, measured as percentage increase in x86 dynamic instruction count and geometrically averaged across suites; labeled value: 13.1%]
[Figure: plot with y-axis 0% to 50% and log-scale x-axis 1 to 1,000,000; annotations: "(typically 10-30 instructions)", "10+ instructions"]
[Figure: performance overhead (0% to 20%) for SPEC INT, SPEC FP, PARSEC, Parboil, and OVERALL; labeled values: 13.1% (before), 11.1% (after)]
[Figure: performance overhead (0% to 14%) for PARSEC, Parboil, and OVERALL; labeled values: 13.1% (before), 0.06% (after)]
[Figure: plot with y-axis 0% to 50% and log-scale x-axis 1 to 1,000,000; annotation: "50+ instructions (mis-optimized)"]
[Figure: two instructions in a four-stage pipeline (1 Fetch, 2 Decode, 3 Execute, 4 Write-back); an event marked "!!!" is flagged as "too late!"]
mis-speculation
(a) branch mis-prediction, (b) memory re-ordering, (c) transaction violation, etc.
hardware faults
(d) wear-out fault, (e) particle strike, (f) voltage spike, etc.
exceptions
(g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc.
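A rough sketch of how events like these could be handled by re-execution, assuming a region that never overwrites its own inputs. Everything here is hypothetical and not taken from the slides: `scale_region` stands in for an arbitrary idempotent region, and the `injected_faults` counter stands in for whatever detection mechanism the hardware provides.

```c
#include <assert.h>

/* Hypothetical idempotent region: reads in[], writes only out[]. */
static void scale_region(const int *in, int *out, int len) {
    for (int i = 0; i < len; ++i)
        out[i] = 2 * in[i];
}

/* Recovery by re-execution: run the region, and if a (simulated) fault is
 * detected afterwards, jump back to the region entry and run it again.
 * Returns the number of attempts that were needed. */
int run_with_retry(const int *in, int *out, int len, int injected_faults) {
    int attempts = 0;
    for (;;) {
        scale_region(in, out, len);
        ++attempts;
        if (injected_faults > 0) {
            --injected_faults;
            if (len > 0)
                out[0] = -1;   /* simulate a corrupted result detected late */
            continue;          /* recover: re-execute from the region entry */
        }
        return attempts;
    }
}
```

No checkpoint of `in` is needed in this sketch, because the region leaves its inputs intact and a retry simply recomputes every output.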
hardware faults
(d) wear-out fault, (e) particle strike, (f) voltage spike, etc.
exceptions
(g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc.
integrated GPU, low-power CPU, high-reliability systems
DETAILS
[Figure: performance overhead (0.0% to 1.5%) for cutcp, fft, histo, mri-q, sad, tpacf, and gmean; labeled value: 0.54%]
[Figure: processor block diagram; labels: Fetch; Decode, Rename, & Issue; RF; Integer; Integer; Branch; Multiply; Load/Store; FP; IEEE FP; Bypass; Replay queue; Flush?; Replay?]
[Figure: processor block diagram; labels: Fetch; Decode & Issue; RF; Integer; Integer; Branch; Multiply; Load/Store; FP]
DETAILS
[Figure: performance overhead (0% to 14%) for SPEC INT, SPEC FP, PARSEC, and OVERALL; labeled value: 9.1%]
FAULT MODEL
[Figure: performance overhead (0% to 35%); idempotence 9.1%, checkpoint/log 22.2%, TMR 29.3%]
RELATED WORK CONCLUSIONS
MapReduce for CELL
SELSE ’09: Synergy
ISCA ’10: Relax
DSN ’10: TS model
MICRO ’11: Idempotent Processors
PLDI ’12: Static Analysis and Compiler Design
ISCA ’12: iGPU
CGO ??: Code Gen
TACO ??: Models
prelim defense
clobber antidependences region boundaries region identification
depends on
sources of overhead
rough sketch
region boundaries region interval x’s live interval
region boundaries x’s shadow interval x’s live interval
region boundaries x’s shadow interval x’s live interval y’s live interval
non-clobber antidependences… GONE!
// initialize:
int array[100];
memset(array, 0, sizeof(array));
// accumulate:
for (...) array[i] += foo(i);
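The initialize/accumulate snippet above can be turned into a small experiment. This is a sketch under stated assumptions, not the author's code: `foo` is a hypothetical stand-in, the loop bound `N` is assumed to be the array length, and the two `rerun_*` helpers exist only to compare re-execution behavior. Taken alone, the accumulate step reads `array[i]` and then overwrites it, so re-running it changes the result. When the initialization is part of the same region, every element is written before it is read and re-running is harmless.

```c
#include <assert.h>
#include <string.h>

#define N 100

/* Hypothetical stand-in for the slide's foo(i). */
static int foo(int i) { return i + 1; }

/* Accumulate only: reads array[i] and then overwrites it, so the region
 * clobbers a value it takes as input. */
static void accumulate(int *array) {
    for (int i = 0; i < N; ++i)
        array[i] += foo(i);
}

/* Initialize and accumulate together: array is written before it is read,
 * so the region has no input left to clobber. */
static void init_and_accumulate(int *array) {
    memset(array, 0, N * sizeof(int));
    accumulate(array);
}

/* Returns array[0] after running only the accumulate step `runs` times. */
int rerun_accumulate_only(int runs) {
    int array[N];
    memset(array, 0, sizeof(array));
    for (int r = 0; r < runs; ++r)
        accumulate(array);
    return array[0];
}

/* Returns array[0] after running the combined region `runs` times. */
int rerun_init_and_accumulate(int runs) {
    int array[N];
    memset(array, 0x7f, sizeof(array));   /* arbitrary prior contents */
    for (int r = 0; r < runs; ++r)
        init_and_accumulate(array);
    return array[0];
}
```

Running the accumulate step twice doubles `array[0]`, while running the combined region twice leaves it at the same value as a single run.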
Benchmark      Problems               Size Before   Size After
blackscholes   ALIASING, SCOPE        78.9          >10,000,000
canneal        SCOPE                  35.3          187.3
fluidanimate   ARRAYS, LOOPS, SCOPE   9.4           >10,000,000
streamcluster  ALIASING               120.7         4,928
swaptions      ALIASING, ARRAYS       10.8          211,000
cutcp          LOOPS                  21.9          612.4
fft            ALIASING               24.7          2,450
histo          ARRAYS, SCOPE          4.4           4,640,000
mri-q          –                      22,100        22,100
sad            ALIASING               51.3          90,000
tpacf          ARRAYS, SCOPE          30.2          107,000
Benchmark      Problems               Overhead Before   Overhead After
blackscholes   ALIASING, SCOPE
canneal        SCOPE                  5.31%             1.33%
fluidanimate   ARRAYS, LOOPS, SCOPE   26.67%
streamcluster  ALIASING               13.62%            0.00%
swaptions      ALIASING, ARRAYS       17.67%            0.00%
cutcp          LOOPS                  6.344%
fft            ALIASING               11.12%            0.00%
histo          ARRAYS, SCOPE          23.53%            0.00%
mri-q          –                      0.00%             0.00%
sad            ALIASING               4.17%             0.00%
tpacf          ARRAYS, SCOPE          12.36%
[Figure: percentage overhead (5 to 20) for SPEC INT, SPEC FP, PARSEC, Parboil, and OVERALL; series: x86-64, ARMv7; note: same configuration as take 1/3]
[Figure: percentage overhead (2 to 14) for 14-GPR, 12-GPR, and 10-GPR; ARMv7, 16 GPR baseline; data as geometric mean across SPEC INT; note: take 2/3]
[Figure: compiler and hardware block diagram; labels: kernel source, source code compiler, IR, device code generator, partitioning, preservation, idempotent device code, core, fetch, decode, FU, general purpose registers, RPCs, L1, TLB, L2 cache]
!!!
[Figure: percentage overhead (5 to 25) for SPEC INT, SPEC FP, PARSEC, and OVERALL; labeled values: 9.1%, 18.1%; annotations: "take 2/3", "cut at branch"]
Very Related               Year   Domain
Sentinel Scheduling        1992   Speculative memory re-ordering
Reference Idempotency      2006   Reducing speculative storage
Restart Markers            2006   Virtual memory in vector machines
Encore                     2011   Hardware fault recovery

Somewhat Related           Year   Domain
Multi-Instruction Retry    1995   Branch and hardware fault recovery
Atomic Heap Transactions   1999   Atomic memory allocation