An example of a research compiler - Simone Campanoni (PowerPoint presentation)



SLIDE 1

An example of a research compiler

Simone Campanoni simonec@eecs.northwestern.edu

SLIDE 2

Sequential programs are not accelerating like they used to

[Chart: performance (log scale) of a sequential program running on a platform, 1992-2004; core frequency scaling ends, and a performance gap opens in the multicore era]

SLIDE 3

Single application: Not enough explicit parallelism

  • Developing parallel code is hard
  • Sequentially-designed code is still ubiquitous

Multiple applications: Only a few CPU-intensive applications running concurrently in client devices

Multicores are underutilized

SLIDE 4

Parallelizing compiler: Exploit unused cores to accelerate sequential programs

SLIDE 5

[Chart: speedups for numerical vs. non-numerical programs; non-numerical programs still need to be parallelized]

SLIDE 6

99% of time is spent in loops

Parallelize loops to parallelize a program

[Diagram: program execution time is concentrated in outermost loops]

SLIDE 7

DOALL parallelism

[Diagram: iterations 0, 1, and 2 each run work() on a different core at the same time; no cross-iteration dependences]
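The DOALL pattern can be sketched in a few lines. This is an illustrative sketch, not HELIX's code generation; `work` is a hypothetical loop body that depends only on its own iteration index.

```python
from concurrent.futures import ThreadPoolExecutor

def work(i):
    # hypothetical loop body: no cross-iteration dependences
    return i * i

def doall(n_iters, n_workers=3):
    # DOALL: every iteration is independent, so workers can execute
    # iterations 0, 1, 2, ... concurrently, in any order
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(work, range(n_iters)))

print(doall(6))  # [0, 1, 4, 9, 16, 25]
```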

SLIDE 8

DOACROSS parallelism

[Diagram: each iteration's sequential segments c=f(c) and d=f(d) execute in iteration order, while the work() parallel segments overlap in time across cores]
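A minimal sketch of DOACROSS-style synchronization, assuming a toy f(x) = x + 1 and the two sequential segments from the diagram. The Wait i / Signal i pairs are modeled with one condition variable per segment; this is an illustration, not the HELIX runtime.

```python
import threading

class SeqSegment:
    """One sequential segment: iteration i may enter only after
    iteration i-1 has signaled (the Wait i / Signal i pairs)."""
    def __init__(self):
        self.turn = 0
        self.cv = threading.Condition()

    def wait(self, i):
        with self.cv:
            while self.turn != i:
                self.cv.wait()

    def signal(self, i):
        with self.cv:
            self.turn = i + 1
            self.cv.notify_all()

def f(x):
    return x + 1  # toy stand-in for the slides' f

state = {"c": 0, "d": 0}       # loop-carried shared data
seg_c, seg_d = SeqSegment(), SeqSegment()

def iteration(i):
    # sequential segments execute in iteration order...
    seg_c.wait(i); state["c"] = f(state["c"]); seg_c.signal(i)
    seg_d.wait(i); state["d"] = f(state["d"]); seg_d.signal(i)
    # ...while work(), the parallel segment, may overlap across cores

threads = [threading.Thread(target=iteration, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(state)  # {'c': 4, 'd': 4}
```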

SLIDE 9

HELIX: DOACROSS for multicore

[Campanoni et al., CGO 2012; Campanoni et al., DAC 2012; Campanoni et al., IEEE Micro 2012]

[Diagram: iterations are distributed round-robin across cores, each running c=f(c), d=f(d), work()]

SLIDE 10

HELIX: DOACROSS for multicore

[Diagram: animation step of the round-robin schedule]

SLIDE 11

HELIX: DOACROSS for multicore

[Diagram: each sequential segment i is bracketed by Wait i and Signal i (Seq. Segment 0, Seq. Segment 1), so c=f(c) and d=f(d) run in iteration order while work() overlaps]

SLIDE 12

HELIX: DOACROSS for multicore

[Diagram: animation step]

SLIDE 13

HELIX: DOACROSS for multicore

[Diagram: animation step]

SLIDE 14

99% of time is spent in loops

Parallelize loops to parallelize a program

[Diagram: execution time split between innermost and outermost loops]

SLIDE 15

Parallelize loops to parallelize a program

[Table: innermost vs. outermost loops compared on coverage, communication, and ease of analysis; HELIX targets a middle ground]

SLIDE 16

HELIX: DOACROSS for multicore

[Chart: SPEC INT speedups on a 4-core Intel Nehalem; HELIX vs. the ICC, Microsoft Visual Studio, and conventional DOACROSS baselines]

Small Loop Parallelism: between innermost loops (ease of analysis, little coverage) and outermost loops (high coverage, heavy communication), HELIX targets the loops in between; HELIX-RC and HELIX-UP build on it

SLIDE 17

Outline

  • Small Loop Parallelism and HELIX [CGO 2012, DAC 2012, IEEE Micro 2012]
  • HELIX-RC: Architecture/Compiler Co-Design [ISCA 2014]
  • HELIX-UP: Unleash Parallelization [CGO 2015]

SLIDE 18

SLP challenge: short loop iterations

[Histogram: duration of loop iterations in clock cycles, SPEC CPU Int benchmarks]

SLIDE 19

SLP challenge: short loop iterations

[Histogram: duration of loop iterations in clock cycles, SPEC CPU Int benchmarks; 90 cycles highlighted]

SLIDE 20

SLP challenge: short loop iterations

[Histogram: iteration durations compared against the adjacent-core communication latency]

SLIDE 21

A compiler-architecture co-design to efficiently execute short iterations

Compiler

  • Identify latency-critical code in each small loop: the code that generates shared data
  • Expose that information to the architecture

Architecture: Ring Cache

  • Reduce the communication latency on the critical path

slide-22
SLIDE 22

… Load Y …

  • Iter. 1

22

Light-weight enhancement of today’s multicore architecture

Core 0 Core 1 Core 3 Core 2 DL1 DL1 DL1 DL1 Last level cache Ring node Ring node Ring node Ring node

Store X, 1 Store Y, 1

  • Iter. 0

Store Y, 1

  • Iter. 2

Store Y, 1

  • Iter. 3

Store X, 1 Load X

75 – 260 cycles!

SLIDE 23

Light-weight enhancement of today's multicore architecture

[Diagram: with the ring nodes in place, iteration 0 on Core 0 runs Store X, 1; Wait 0; Store Y, 1; Signal 0, while iteration 1 on Core 1 runs Wait 0; Load Y, receiving the shared value from the adjacent ring node]
SLIDE 24

[Chart: the ring cache achieves a 98% hit rate]

SLIDE 25

The importance of HELIX-RC

[Chart: speedups for non-numerical and numerical programs]

SLIDE 26

The importance of HELIX-RC

[Chart: animation step of the same comparison]

SLIDE 27

Outline

  • Small Loop Parallelism and HELIX [CGO 2012, DAC 2012, IEEE Micro 2012]
  • HELIX-RC: Architecture/Compiler Co-Design [ISCA 2014]
  • HELIX-UP: Unleash Parallelization [CGO 2015]

SLIDE 28

HELIX and its limitations

[Diagram: threads 0-3 forward loop-carried data from iteration to iteration]

Performance:

  • Lower than you would like
  • Inconsistent across architectures
  • Sensitive to dependence analysis accuracy

What can we do to improve it?

[Chart: 4-core speedups of 1.68, 2.77, and 2.31 on Nehalem, Bulldozer, and Haswell; speedups of 1.61 and 1.19 at 79% and 78% dependence-analysis accuracy]

SLIDE 29

Opportunity: relax program semantics

  • Some workloads tolerate output distortion
  • Output distortion is workload-dependent

SLIDE 30

Relaxing transformations remove performance bottlenecks

  • Sequential bottleneck

[Diagram: threads 1-3 each execute Inst 1-4; removing the dependence (Dep) that forms the sequential segment yields speedup]

SLIDE 31

Relaxing transformations remove performance bottlenecks

  • Sequential bottleneck
  • Communication bottleneck
  • Data locality bottleneck

SLIDE 32

Relaxing transformations remove performance bottlenecks

[Diagram: a spectrum from no relaxing transformation (no output distortion, baseline performance) through relaxing transformations 1, 2, …, k (maximum output distortion, maximum performance)]

SLIDE 33

Design space of HELIX-UP

[Diagram: a configuration assigns a relaxing transformation to each code region, e.g. transformation 3 to code region 1 and transformation 5 to code region 2, and is evaluated on performance, energy saved, and output distortion]

1) User provides output distortion limits
2) System finds the best configuration
3) Run parallelized code with that configuration
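Step 2 above can be pictured as a search over per-region transformation choices. The sketch below uses made-up profile numbers (speedup, distortion) per transformation and a simple exhaustive search; it illustrates the idea, not HELIX-UP's actual algorithm.

```python
from itertools import product

# hypothetical profile data: (speedup, distortion) per transformation,
# per code region; index 0 is "no relaxing transformation"
profile = {
    "region1": [(1.0, 0.0), (1.4, 0.02), (1.9, 0.08)],
    "region2": [(1.0, 0.0), (1.2, 0.01), (1.6, 0.05)],
}

def best_configuration(profile, max_distortion):
    regions = list(profile)
    best, best_speedup = None, 0.0
    for choice in product(*(range(len(profile[r])) for r in regions)):
        speedup, distortion = 1.0, 0.0
        for r, t in zip(regions, choice):
            s, d = profile[r][t]
            speedup *= s       # assume speedups compose multiplicatively
            distortion += d    # assume distortions add up
        if distortion <= max_distortion and speedup > best_speedup:
            best, best_speedup = dict(zip(regions, choice)), speedup
    return best, best_speedup

config, speedup = best_configuration(profile, max_distortion=0.05)
print(config)  # {'region1': 1, 'region2': 1}
```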

SLIDE 34

Pruning the design space

Empirical observation: transforming a code region affects only the loop it belongs to.
With 50 loops, 2 code regions per loop, and 2 transformations per code region:
complete space = 2^100 configurations; pruned space = 50 * 2^2 = 200.
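The pruning arithmetic, spelled out:

```python
# 50 loops x 2 code regions = 100 regions, 2 transformations each:
# the complete space is 2**100 configurations. Since a transformation
# only affects its own loop, each loop can be searched independently:
# 50 loops x 2**2 per-loop configurations = 200.
loops, regions_per_loop, transforms_per_region = 50, 2, 2

complete = transforms_per_region ** (loops * regions_per_loop)
pruned = loops * transforms_per_region ** regions_per_loop

print(pruned)  # 200
```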

How well does HELIX-UP perform?

SLIDE 35

HELIX: no relaxing transformations

[Chart: baseline HELIX speedups on a Nehalem with 6 cores, 2 threads per core]

SLIDE 36

HELIX-UP unblocks extra parallelism with small output distortions

[Chart: HELIX vs. HELIX-UP speedups on a Nehalem with 6 cores, 2 threads per core]

SLIDE 37

Performance/distortion tradeoff

[Chart: performance vs. output distortion (%) for 256.bzip2, with HELIX as the zero-distortion reference]

SLIDE 38

Run-time code tuning

  • Static HELIX-UP decides how to transform the code based on profile data averaged over inputs
  • The runtime reacts to transient bottlenecks by adjusting the code accordingly
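One way to picture the runtime's reaction to transient bottlenecks (a hypothetical policy, not the HELIX-UP implementation): escalate to a more aggressive relaxation level while iteration latency exceeds a threshold, and back off when it recovers.

```python
def tune(latencies, threshold=100, levels=3):
    """Return the relaxation level chosen for each observed iteration."""
    level = 0           # 0 = the static HELIX-UP choice
    trace = []
    for lat in latencies:
        if lat > threshold and level < levels - 1:
            level += 1  # transient bottleneck: relax more aggressively
        elif lat <= threshold and level > 0:
            level -= 1  # bottleneck passed: restore semantics
        trace.append(level)
    return trace

print(tune([50, 150, 160, 80, 60]))  # [0, 1, 2, 1, 0]
```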

SLIDE 39

Adapting code at run time unlocks more parallelism

[Chart: performance vs. output distortion (%) for 256.bzip2 with run-time tuning, HELIX shown as reference]

SLIDE 40

HELIX-UP improves more than just performance

  • Robustness to DDG inaccuracies
  • Consistent performance across platforms

SLIDE 41

Relaxing transformations make HELIX-UP robust to DDG inaccuracies

[Chart: 256.bzip2, HELIX vs. HELIX-UP; increasing DDG inaccuracies lowers HELIX performance but has no impact on HELIX-UP]

SLIDE 42

Relaxed transformations for consistent performance

[Chart: performance under increasing communication latency]

SLIDE 43

Small Loop Parallelism and HELIX

  • Parallelism hides in small loops

HELIX-RC: Architecture/Compiler Co-Design

  • Irregular programs require low latency

HELIX-UP: Unleash Parallelization

  • Tolerating distortions boosts parallelization

SLIDE 44

Thank you!
