
slide-1
SLIDE 1

Intel Labs Haskell Research Compiler Hai (Paul) Liu

with Neal Glew, Leaf Petersen, Todd A. Anderson. Intel Labs. September 28, 2016

slide-4
SLIDE 4

Intel Labs Haskell Research Compiler

An alternative Haskell compiler that:

  • uses GHC as frontend;
  • does whole program compilation;
  • achieves overall performance parity with GHC+LLVM;
  • is significantly better for some programs;
  • does automatic SIMD vectorization for Intel CPUs.

2 September 28, 2016

slide-7
SLIDE 7

However...

The backend of HRC was not originally designed for Haskell. We re-purposed an existing functional-language (FL) compiler and runtime that:

  • has a set of different design decisions;
  • makes an interesting comparison to GHC.

Known Limitations:

  • No lightweight threads, sparks, or STM (easy);
  • No exception re-throw for thunks (fixable);
  • No asynchronous exceptions (hard).

3 September 28, 2016

slide-8
SLIDE 8

Functionality

HRC is highly compatible with GHC:

  • Modified GHC and base libraries to handle differences;
  • Modified Vector library to use initializing writes;
  • Modified Cabal to compile for HRC;
  • Little or no modifications to most other libraries.

4 September 28, 2016

slide-9
SLIDE 9

Compilation Process

5 September 28, 2016

slide-10
SLIDE 10

Inside HRC Pipeline

  • CoreHs: functional, non-strict
  • ANormStrict: A-normal form (ANF), strict, explicit thunk/eval
  • MIL: CFG/SSA based, high-level object model
  • Pillar: inspired by C--, a variant of C

6 September 28, 2016

slide-11
SLIDE 11

Comparison to GHC

Both compilers share GHC's frontend: desugaring, type analysis, and Core-to-Core transformation. Below that, the pipelines diverge:

  • High-level IR: GHC uses STG (functional, object memory model, optimized for currying and thunks); HRC uses MIL (CFG/SSA based, object memory model, conventional).
  • Low-level IR: GHC uses Cmm (CFG blocks, low-level types); HRC uses Pillar (C types, C calling convention and custom calling convention, meta and GC support).
  • Backend: GHC emits LLVM bitcode (via LLVM) or generates assembly directly (NCG); HRC emits portable C code, compiled to assembly by the Intel C/C++ Compiler.
  • Runtime and GC: GHC's runtime is optimized for currying and thunks; HRC's runtime is conventional.

7 September 28, 2016

slide-12
SLIDE 12

MIL

  • High-level object model with low-level control flow
  • Leveraging immutability and memory safety
  • Data flow and control flow analysis
  • Inter- and intra-procedural optimizations
  • Representation/contification/loop/thunk optimizations
  • SIMD auto vectorization
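The value of thunk optimization can be seen even at the source level. The sketch below is plain Haskell, not HRC's IR: a lazy left fold builds a chain of suspended additions, while a strict one keeps the accumulator evaluated. MIL-level thunk optimizations aim to perform this kind of elimination automatically.

```haskell
import Data.List (foldl')

-- foldl suspends each (+) in a thunk; foldl' forces the accumulator
-- at every step, so no thunk chain is built.
sumLazy, sumStrict :: [Int] -> Int
sumLazy   = foldl  (+) 0
sumStrict = foldl' (+) 0

main :: IO ()
main = print (sumLazy [1 .. 100] == sumStrict [1 .. 100])
```

Both folds compute 5050; they differ only in whether intermediate sums live as thunks or as evaluated machine integers.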

8 September 28, 2016

slide-15
SLIDE 15

Immutable Array

GHC creates an immutable array by:

  • creating a mutable array;
  • writing to it;
  • freezing the result.

HRC separates array creation and initialization, with primitives to:

  • create immutable array
  • do initializing writes
  • read

Programmers must ensure:

  • an array field is written to before it is read;
  • a field is never written to more than once.
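For contrast, GHC's create/write/freeze idiom is available today through the standard array libraries. A minimal sketch (the names `squares` and `arr` are ours, not from the slides):

```haskell
import Data.Array.ST (newArray, writeArray, runSTUArray)
import Data.Array.Unboxed (UArray, elems)

-- GHC-style immutable array construction: allocate a mutable array,
-- fill it with writes, then freeze it into an immutable UArray.
squares :: Int -> UArray Int Int
squares n = runSTUArray $ do
  arr <- newArray (0, n - 1) 0                         -- create mutable array
  mapM_ (\i -> writeArray arr i (i * i)) [0 .. n - 1]  -- write each slot
  return arr                                           -- runSTUArray freezes it

main :: IO ()
main = print (elems (squares 5))  -- prints [0,1,4,9,16]
```

HRC's initializing writes skip the mutable intermediate entirely, at the cost of the programmer obligations listed above.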

9 September 28, 2016

slide-16
SLIDE 16

Walkthrough by Example

odd, even :: Int → Bool

odd 0 = False
odd n = even (n − 1)

even 0 = True
even n = odd (n − 1)
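The same definitions as a self-contained runnable module (primed names avoid clashing with the Prelude's odd/even):

```haskell
-- The walkthrough example from the slides, runnable as-is.
odd', even' :: Int -> Bool
odd' 0 = False
odd' n = even' (n - 1)

even' 0 = True
even' n = odd' (n - 1)

main :: IO ()
main = print (map even' [0, 1, 7, 42])  -- prints [True,False,False,True]
```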

10 September 28, 2016

slide-17
SLIDE 17

GHC Core

even :: (Int → Bool) =
  \ (n :: Int) → %case n %of (_ :: Int)
    { I# (n1 :: Int#) → %case n1 %of (n2 :: Int#)
        { (0 :: Int#) → True;
          %_ → odd (I# (n2 − (1 :: Int#))) } };

odd :: (Int → Bool) =
  \ (m :: Int) → %case m %of (_ :: Int)
    { I# (m1 :: Int#) → %case m1 %of (m2 :: Int#)
        { (0 :: Int#) → False;
          %_ → even (I# (m2 − (1 :: Int#))) } };

11 September 28, 2016

slide-22
SLIDE 22

ANormStrict

even = %thunk
  %let f = \n →
    %let n0 = %eval n %in
    %case n0 of {
      I# n1 → %case n1 of {
        0 → %eval True;
        _ → %let u = %let v = %let w = 1 %in n1 − w
                      i = %eval I#
                  %in i v
                 t = %thunk u
                 g = %eval odd
             %in g t } }
  %in f;

12 September 28, 2016

slide-25
SLIDE 25

ANormStrict (closure converted)

even = %thunk <I#, True, odd;>
  %let f = \<I#, True, odd; n> →
    %let n0 = %eval n %in
    %case n0 of {
      I# n1 → %case n1 of {
        0 → %eval True;
        _ → %let u = %let v = %let w = 1 %in n1 − w
                      i = %eval I#
                  %in i v
                 t = %thunk <u;> u
                 g = %eval odd
             %in g t } }
  %in f;

13 September 28, 2016

slide-27
SLIDE 27

MIL

f_code = Code(CcClosure(lv I#, lv True, lv odd); n) {
  L0():  n0 ← Eval(n) ⇒ L1
  L1():  Case tagof(n0) { U32(0) ⇒ L2() }
  L2():  n1 = SumProj(n0.U32(0).0)
         Case n1 { S32(0) ⇒ L8()  Default ⇒ L3() }
  L3():  v = SInt32Minus(n1, 1)
         i ← Eval(lv I#) ⇒ L4
  L4():  u ← CallClos(i)(v) ⇒ L5
  L5():  t = ThunkMkVal(u)
         g ← Eval(lv odd) ⇒ L6
  L6():  c ← CallClos(g)(t) ⇒ L7
  L7():  Goto L10(c)
  L8():  b ← Eval(lv True) ⇒ L9
  L9():  Goto L10(b)
  L10(r): Return(r)
}

14 September 28, 2016

slide-34
SLIDE 34

MIL (simplified)

f_code = Code(CcClosure(lv I#, lv True, lv odd); n) {
  L0():  n0 ← Eval(n) ⇒ L1
  L1():  n1 = SumProj(n0.U32(0).0)
         e = SInt32Eq(n1, 0)
         Case e { True ⇒ L6()  False ⇒ L2() }
  L2():  v = SInt32Minus(n1, 1)
         i ← Eval(lv I#) ⇒ L3
  L3():  u ← CallClos(i)(v) ⇒ L4
  L4():  t = ThunkMkVal(u)
         g ← Eval(lv odd) ⇒ L5
  L5():  CallClos(g)(t) |=
  L6():  Return(gv True)
}

15 September 28, 2016

slide-38
SLIDE 38

MIL (after Rep optimization)

f_code = Code(CcCode; n) {
  L0():  even = SInt32Eq(n, 0)
         Case even { True ⇒ L4()  False ⇒ L1() }
  L1():  v = SInt32Minus(n, 1)
         odd = SInt32Eq(v, 0)
         Case odd { True ⇒ L3()  False ⇒ L2() }
  L2():  w = SInt32Minus(v, 1)
         Call(f_code)(w) |=
  L3():  Return(gv False)
  L4():  Return(gv True)
}

16 September 28, 2016

slide-41
SLIDE 41

MIL (after contification)

f_code = Code(CcCode; n) {
  L0():  Goto L1(n)
  L1(u): even = SInt32Eq(u, 0)
         Case even { True ⇒ L5()  False ⇒ L2() }
  L2():  v = SInt32Minus(u, 1)
         odd = SInt32Eq(v, 0)
         Case odd { True ⇒ L4()  False ⇒ L3() }
  L3():  w = SInt32Minus(v, 1)
         Goto L1(w)
  L4():  Return(gv False)
  L5():  Return(gv True)
}

17 September 28, 2016

slide-42
SLIDE 42

MIL (after arithmetic simplification)

f_code = Code(CcCode; n) {
  L0():  Goto L1(n)
  L1(u): even = SInt32Eq(u, 0)
         Case even { True ⇒ L5()  False ⇒ L2() }
  L2():  odd = SInt32Eq(u, 1)
         Case odd { True ⇒ L4()  False ⇒ L3() }
  L3():  w = SInt32Minus(u, 2)
         Goto L1(w)
  L4():  Return(gv False)
  L5():  Return(gv True)
}
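The fully optimized MIL corresponds to this source-level loop (our rendering, not compiler output): the thunks, boxed Ints, and mutual recursion are all gone, leaving one self-tail-recursive loop that subtracts 2.

```haskell
-- Source-level equivalent of the final MIL for `even`:
-- a single tail-recursive loop, no thunks or boxes.
evenLoop :: Int -> Bool
evenLoop u
  | u == 0    = True
  | u == 1    = False
  | otherwise = evenLoop (u - 2)

main :: IO ()
main = print (map evenLoop [0, 1, 8, 9])  -- prints [True,False,True,False]
```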

18 September 28, 2016

slide-43
SLIDE 43

Benchmarking

40+ benchmark programs:

  • a mixed set of programs from the nofib benchmark suite;
  • performance oriented programs using array libraries.

Sequential performance is measured by compiling and running on a 2.7 GHz Xeon machine (32-bit Windows) with:

  • standard GHC 7.6.1
  • GHC 7.6.1 + LLVM 2.9
  • HRC with modified GHC 7.6.1 + Intel C/C++ compiler.

19 September 28, 2016

slide-44
SLIDE 44

Benchmark Result

HRC is at parity with GHC+LLVM, which is 10% faster than GHC.

[Bar chart: GHC+LLVM and HRC kernel execution times relative to GHC for bspt, circsim, constraints, rsa, boyer, fulsom, multiplier, compress, chess, wheel-sieve2, gamteb, fem, cacheprof, linear, fibheaps, life, galois_raytrace, binarytrees, mandel2, rewrite, bernouilli, prime, clausify, wheel-sieve1, meteor, digits-of-e1, finlay, smallpt, mandelbrot, sobel, queens, pic, blur, nbody-vector, nbody-repa, convolution, blackscholes, mmult, vectorise-sum, nbody, dot-product, vectorise-add, 1d-convolution, and the geometric mean.]

Figure: Kernel Execution Time Relative to GHC (smaller is better)

20 September 28, 2016

slide-45
SLIDE 45

Benchmark Result (selected)

Performance-oriented programs with a numeric computation kernel using arrays.


Figure: Kernel Execution Time Relative to GHC (smaller is better)

21 September 28, 2016

slide-46
SLIDE 46

Performance

For GHC:

  • GHC is better at executing lazy code and curried functions.
  • GHC’s object representation and GC are better suited to typical Haskell programs.

For HRC:

  • HRC focuses on optimizing strict programs with hot loops.
  • HRC benefits significantly from the elimination of thunks, boxes and branches.

22 September 28, 2016

slide-47
SLIDE 47

Takeaways

  • Reusing GHC is a big win (Core: easy; libraries: a bit of work).
  • Novel MIL design choices:
      • low-level control flow with a high-level object model;
      • immutable arrays with initializing writes.
  • Eliminating thunks is critical to performance.
  • There is a penalty for not using a specialized runtime.
  • Compilation through C has overhead, but it is not as significant.

23 September 28, 2016

slide-48
SLIDE 48

References

  • Automatic SIMD Vectorization for Haskell. Leaf Petersen, Dominic Orchard and Neal Glew. ICFP’13.
  • Measuring the Haskell Gap. Leaf Petersen, Todd A. Anderson, Hai Liu and Neal Glew. Presented at IFL’13.
  • A Multivalued Language with a Dependent Type System. Neal Glew, Tim Sweeney and Leaf Petersen. DTP’13.
  • Pillar: A Parallel Implementation Language. Anderson et al. LCPC’07.
  • Optimizations in a private nursery-based garbage collector. Todd A. Anderson. ISMM’10.

24 September 28, 2016