An example of a research compiler
Simone Campanoni simonec@eecs.northwestern.edu
Sequential programs are not accelerating like they used to
[Figure: performance (log scale). Labels: core frequency scaling, multicore era, sequential program, performance gap.]
[Figure: timeline (axis: Time), built up over several slides. First a sequence of work() blocks, then repeated blocks of c=f(c), d=f(d), work().]
[Campanoni et al, CGO 2012, Campanoni et al, DAC 2012, Campanoni et al, IEEE Micro 2012]
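The repeated c=f(c), d=f(d), work() blocks can be read as iterations of one loop spread over several threads. The sketch below is one way to illustrate that reading, not the HELIX implementation: it assumes that c=f(c) and d=f(d) are loop-carried segments that must run in iteration order, and that work() is independent. The bodies of `f` and `work`, the `trace` list, the event arrays, and the round-robin assignment of iterations to threads are all hypothetical choices made for the example.

```python
import threading

def f(value):
    # Hypothetical loop-carried update standing in for f on the slides.
    return 3 * value + 1

def work(i):
    # Hypothetical independent work; the slides only show work().
    return i * i

def run_sequential(n):
    # Reference version: one thread runs every iteration in order.
    c, d, trace = 0, 1, []
    for i in range(n):
        c = f(c)
        d = f(d)
        trace.append((c, d, work(i)))
    return trace

def run_helix_style(n, num_threads=3):
    # Assumption: iterations are dealt round-robin to threads.
    state = {"c": 0, "d": 1}
    trace = [None] * n
    # c_ready[i] is set once iteration i may enter the c segment
    # (iteration i-1 has left it); likewise d_ready for the d segment.
    c_ready = [threading.Event() for _ in range(n + 1)]
    d_ready = [threading.Event() for _ in range(n + 1)]
    c_ready[0].set()
    d_ready[0].set()

    def worker(tid):
        for i in range(tid, n, num_threads):
            c_ready[i].wait()          # enter segment c in iteration order
            c = f(state["c"])
            state["c"] = c
            c_ready[i + 1].set()       # let iteration i+1 enter segment c
            d_ready[i].wait()          # enter segment d in iteration order
            d = f(state["d"])
            state["d"] = d
            d_ready[i + 1].set()       # let iteration i+1 enter segment d
            trace[i] = (c, d, work(i)) # work() needs no ordering

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trace
```

Under these assumptions, `run_helix_style(n)` records the same per-iteration values as `run_sequential(n)`, because each segment is entered strictly in iteration order while the work() calls of different iterations are free to overlap.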
4-core Intel Nehalem
HELIX [CGO 2012, DAC 2012, IEEE Micro 2012]
HELIX-RC [ISCA 2014]
HELIX-UP [CGO 2015]
Communication
[Figure: SPEC CPU Int benchmarks]
[Figure: memory operations, built up over several slides]
… Load Y …
Store X, 1; Store Y, 1
Store X, 1; Load X
With Wait 0 and Signal 0 added:
… Wait 0; Load Y …
Store X, 1; Wait 0; Store Y, 1; Signal 0
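The Wait 0 and Signal 0 operations can be illustrated with a small two-thread sketch. It assumes that the two instruction sequences belong to two consecutive loop iterations running on different threads, that Wait 0 blocks until the previous iteration has executed Signal 0, and that the first iteration's wait is satisfied from the start. The semaphores, the `memory` dictionary, and the function names are hypothetical stand-ins, not part of any HELIX interface.

```python
import threading

def run_wait_signal():
    memory = {"X": 0, "Y": 0}
    loaded = {}
    # One semaphore per iteration for segment 0.
    # Assumption: iteration 0 may enter the segment immediately.
    seg0 = [threading.Semaphore(1), threading.Semaphore(0)]

    def iteration0():
        memory["X"] = 1            # Store X, 1
        seg0[0].acquire()          # Wait 0
        memory["Y"] = 1            # Store Y, 1
        seg0[1].release()          # Signal 0

    def iteration1():
        seg0[1].acquire()          # Wait 0
        loaded["Y"] = memory["Y"]  # Load Y

    # Start the loading iteration first to show that it still
    # observes the store made by the earlier iteration.
    t1 = threading.Thread(target=iteration1)
    t0 = threading.Thread(target=iteration0)
    t1.start()
    t0.start()
    t0.join()
    t1.join()
    return loaded["Y"]
```

In this sketch the load of Y cannot run before the store to Y, whichever thread is scheduled first, because the second iteration's Wait 0 only returns after the first iteration's Signal 0.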
[Figure: 4 cores. Platforms: Nehalem, Bulldozer, Haswell. Values shown: 1.68, 2.77, 2.31, 1.61, 1.19. Annotations: 79% accuracy, 78% accuracy.]
[Figure: instruction sequences Inst 1, Inst 2 and Inst 3, Inst 4, repeated across several slides. Label: Speedup.]
Code region 1: apply relaxing transformation 3
Code region 2: apply relaxing transformation 5
Nehalem: 6 cores, 2 threads per core
HELIX
Increasing inaccuracies in the data dependence graph (DDG) lead to lower performance
Increasing communication latency