Automatic SIMD vectorization for Haskell — Leaf Petersen, Dominic Orchard, Neal Glew (PowerPoint presentation)


SLIDE 1

Automatic SIMD vectorization for Haskell

Leaf Petersen, Dominic Orchard, Neal Glew ICFP 2013 - Boston, MA, USA

Work done at Intel Labs

SLIDE 2

SIMD

  • Trend towards parallel architectures (multi-core & instruction-level parallelism)
  • SIMD: fine-grained data parallelism
  • Vector registers: e.g. 128-bit wide (4 × integers)
  • Vector instructions: e.g. vadd r1, r2

  r1 = ⟨A0, A1, A2, A3⟩
  r2 = ⟨B0, B1, B2, B3⟩

  vadd r1, r2  ⇢  r1 = ⟨A0+B0, A1+B1, A2+B2, A3+B3⟩
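The lane-wise semantics of vadd can be sketched in plain Haskell (an illustrative model; `Vec4` and `vadd` are hypothetical names, not the compiler's representation):

```haskell
-- A 4-wide vector register modelled as a list of lanes (illustrative
-- sketch only, not HRC's actual representation).
type Vec4 = [Int]

-- vadd adds lane-wise, like the hardware instruction:
-- vadd ⟨A0,A1,A2,A3⟩ ⟨B0,B1,B2,B3⟩ = ⟨A0+B0, A1+B1, A2+B2, A3+B3⟩
vadd :: Vec4 -> Vec4 -> Vec4
vadd = zipWith (+)
```

For example, `vadd [1,2,3,4] [10,20,30,40]` yields `[11,22,33,44]`.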

SLIDE 3

Exploiting SIMD

  • Need to preserve semantics
  • Imperative: (undecidable) dependency & effects analysis
  • Functional: much easier

  i = 0 … n :  S1 S2 S3 S4 S5        ....   (scalar loop)
      ⇣ vectorize
  i = 0 … n :  S1v S2v S3v S4v S5v   ....   (vector loop)

SLIDE 4

Intel Labs Haskell Research Compiler (HRC)

Haskell → (GHC) → GHC Core → (translation) → MIL → (optimisation + vectorization) → (translation) → C / Pillar

In GHC, good things happen, e.g., type-checking, simplification, fusion, etc. The stages from GHC Core onwards are HRC [1].

[1] Liu, Glew, Petersen, Anderson. The Intel Labs Haskell Research Compiler. Haskell ’13, ACM.

SLIDE 5

Take home messages

  • FP optimisation is not exhausted: there is still low-hanging fruit to be had
  • Vectorization is low-hanging and a big win:
  • up to 6.5x speedups for HRC + vectorization
  • many standard techniques + a few extras

SLIDE 6

MIL

  • Aimed at compiling high-performance functional code
  • Block-structured (CFG with loops), SSA form
  • Strict + explicit thunks
  • Distinguishes mutable and immutable data
  • Vector primitives (numerical and array ops)

SLIDE 7

Prior to vectorization

  • Core ➝ MIL: closure conversion, explicit thunks
  • Optimisations (general simplifier, unboxing, representation opts.)
  • Contification [1]
  • converts (many) uses of tail recursion into loops

[1] M. Fluet and S. Weeks. Contification using dominators. ICFP 2001, ACM.

SLIDE 8

Vectorization

  • Targets inner-most loops that are:
  • reductions over immutable arrays
  • initialising writes


SLIDE 9

Initialising writes

  • Allocate then initialize an (immutable) array
  • Two invariants:
  • Reading an array element always follows the initializing write of that element
  • Each element may be initialized only once
  • Modified GHC libraries to generate initializing writes rather than mutation
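The allocate-then-initialize discipline can be sketched with the standard `array` package (a hypothetical model with an invented `squares` example; the paper instead modifies the Data.Vector internals to emit initializing writes):

```haskell
import Data.Array.ST (newArray_, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, elems)

-- Sketch of allocate-then-initialize: each element receives exactly one
-- initializing write, and the array is only frozen into an immutable
-- UArray (by runSTUArray) after all writes have happened.
squares :: Int -> UArray Int Int
squares n = runSTUArray $ do
  a <- newArray_ (0, n - 1)                          -- allocate, uninitialized
  mapM_ (\i -> writeArray a i (i * i)) [0 .. n - 1]  -- one write per element
  return a
```

For example, `elems (squares 5)` yields `[0,1,4,9,16]`.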

SLIDE 10

Mutability in the library (Data.Vector), used for the immutable vector too:

  unstreamRMax :: (PrimMonad m, MVector v a)
               => MStream m a -> Int -> m (v (PrimState m) a)
  unstreamRMax s n = do
    v <- INTERNAL_CHECK(checkLength) "unstreamRMax" n
         $ unsafeNew n
    let put i x = do
          let i' = i - 1
          INTERNAL_CHECK(checkIndex) "unstreamRMax" i' n
            $ unsafeWrite v i' x
          return i'
    i <- MStream.foldM' put n s
    return $ INTERNAL_CHECK(checkSlice) "unstreamRMax" i (n-i) n
           $ unsafeSlice i (n-i) v

SLIDE 11

Transformed to immutability + initializing writes:

  unstreamRPrimMmax :: (PrimMonad m, Vector v a)
                    => Int -> MStream m a -> m (v a)
  unstreamRPrimMmax n s = do
    v <- INTERNAL_CHECK(checkLength) "unstreamRPrimMmax" n
         $ unsafeCreate n
    let put i x = do
          let i' = i - 1
          INTERNAL_CHECK(checkIndex) "unstreamMmax" i' n
            $ unsafeInitElem v i' x
          return i'
    i <- MStream.foldM' put n s
    v' <- basicUnsafeInited v
    return $ INTERNAL_CHECK(checkSlice) "unstreamRPrimMmax" i (n-i) n
           $ unsafeSlice i (n-i) v'

SLIDE 12

Start with a serial loop, processing the elements one at a time:

  Serial:  .... .... .... .... .... ....

SLIDE 13

Vectorize: the serial loop is split into an entry block, a vector loop over most of the elements, a scalar cleanup loop for the leftovers, and an exit block:

  Entry → Vector: .... .... .... → Cleanup: .... .... .... → Exit

SLIDE 14

Vectorization

  • Transform each instruction
  • ... depending on properties of instruction/arguments
  • Let dead-code elimination handle clean up

  x = ....   ⟶ vectorize ⟶   xv = ....      vector version
                               xs = ....      first value (scalar)
                               xl = ....      last value (scalar)
                             ( xb = ....      basis vector )

SLIDE 15

Vectorization (simple)

  • Pointwise operations, promote all constants

  y = +(x, 4)   ⟶   yv = ⟨+⟩(xv, ⟨4, 4, 4, 4⟩)
                     yl = yv ! 3

  ⟨+⟩ is the pointwise op; ⟨4, 4, 4, 4⟩ is the vector-promoted constant;
  yl, a projection, is the last value of y, needed if y is live-out.

SLIDE 16

Vectorization (simple)

  • Pointwise operations, promote all constants

  y = x[z]    ⟶   yv = ⟨xv⟩[⟨zv⟩]     general array load on vectors: “gather”
  x[z] = y    ⟶   ⟨xv⟩[⟨zv⟩] = yv    general array store on vectors: “scatter”

  The gather is equivalent to:

  yv = ⟨xv0[zv0], xv1[zv1], xv2[zv2], xv3[zv3]⟩
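The gather above can be modelled on lists, with one (array, index) pair per lane (an illustrative sketch; `gather` is a hypothetical name and the names follow the slide, not MIL syntax):

```haskell
-- List model of a 4-wide gather: lane k reads its own array xv_k at its
-- own index zv_k, giving ⟨xv0[zv0], xv1[zv1], xv2[zv2], xv3[zv3]⟩.
gather :: [[a]] -> [Int] -> [a]
gather xv zv = zipWith (!!) xv zv
```

For example, `gather [[1,2],[3,4],[5,6],[7,8]] [0,1,0,1]` yields `[1,4,5,8]`.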

SLIDE 17

Why is this (sometimes) naive?

  • General vector array loads/stores are not widely supported
  • Specialised versions are often faster
  • Dependency between the scalar and vector versions
  • Does too much work (non-optimal)
  • HRC instead tracks “induction variables” and uses this information to optimise

SLIDE 18

Induction variables

  • Base induction variable: a loop-carried variable with a constant step
  • Derived induction variable: an induction variable + constant or × constant

  L1(x):
    ....
    x’ = +(x, 1)
    if ... goto L1(x’) else goto Lend(...)

  x is a base induction variable (step 1); x’ is a derived induction variable.
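The closed form that vectorization exploits can be sketched in Haskell (illustrative; `ivLanes` is a hypothetical name): a base induction variable with step 1 starting at scalar value xs takes the values xs + 0, xs + 1, xs + 2, xs + 3 across a 4-wide chunk.

```haskell
-- Closed-form view behind vectorizing a base induction variable with
-- step 1: the four lane values are the scalar start xs plus the basis
-- vector ⟨0, 1, 2, 3⟩ (illustrative sketch).
ivLanes :: Int -> [Int]
ivLanes xs = zipWith (+) (replicate 4 xs) [0, 1, 2, 3]
```

For example, `ivLanes 10` yields `[10,11,12,13]`.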

SLIDE 19

Vectorizing base I.V.s, e.g. x:

  L1(x):
    ....
    x’ = +(x, 1)
    if ... goto L1(x’) else goto Lend(...)

HRC:

  xb = ⟨0, 1, 2, 3⟩                  basis vector
  xv = ⟨+⟩(xb, ⟨xs, xs, xs, xs⟩)
  xl = xs + 3                         xs plus the last basis value

SLIDE 20

Vectorizing derived I.V.s, e.g. x’:

  L1(x):
    ....
    x’ = +(x, 1)
    if ... goto L1(x’) else goto Lend(...)

Simple:

  xb = ⟨0, 1, 2, 3⟩
  xv = ⟨+⟩(xb, ⟨xs, xs, xs, xs⟩)
  x’v = ⟨+⟩(xv, ⟨1, 1, 1, 1⟩)
  x’l = xl + 1

  2 vector ops + 1 promotion; x’v depends on xv

HRC:

  x’b = xb
  x’s = xs + 1
  x’v = ⟨+⟩(x’b, ⟨x’s, x’s, x’s, x’s⟩)

  1 vector op + 1 scalar op + 1 promotion; removes the dependence on xv

SLIDE 21

Vectorizing induction variables

“Naturality” of promotion:

  ⟨f⟩ . (promote × promote) ≡ promote . f
  i.e.  ⟨f⟩(⟨a, a, a, a⟩, ⟨b, b, b, b⟩) ≡ ⟨f(a,b), f(a,b), f(a,b), f(a,b)⟩

  x’v = ⟨+⟩(xv, ⟨1, 1, 1, 1⟩)                              [simple approach]
      = ⟨+⟩(⟨+⟩(xb, ⟨xs, xs, xs, xs⟩), ⟨1, 1, 1, 1⟩)
      = ⟨+⟩(xb, ⟨+⟩(⟨xs, xs, xs, xs⟩, ⟨1, 1, 1, 1⟩))       (assoc)
      = ⟨+⟩(xb, ⟨xs + 1, xs + 1, xs + 1, xs + 1⟩)          (naturality)
      = ⟨+⟩(xb, ⟨x’s, x’s, x’s, x’s⟩)                      (simplify)
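The naturality law is easy to check on a 4-lane list model (a sketch; `promote` and `vlift2` are hypothetical names standing in for promotion and ⟨f⟩):

```haskell
-- 4-lane list model of promotion and lifting: `promote` copies a scalar
-- into every lane; `vlift2 f` applies f lane-wise, i.e. plays ⟨f⟩.
promote :: a -> [a]
promote = replicate 4

vlift2 :: (a -> b -> c) -> [a] -> [b] -> [c]
vlift2 = zipWith

-- Naturality: lifting f over promoted arguments equals promoting f a b.
naturalityHolds :: Bool
naturalityHolds = vlift2 (+) (promote 3) (promote 1) == promote (4 :: Int)
```

Here `naturalityHolds` evaluates to `True`: both sides are `[4,4,4,4]`.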

SLIDE 22

Vectorizing loads/stores

  • Specialised gathers and scatters for unit strides provide higher performance

  y = x[z]   ⟶   yv = xs[⟨zs:⟩]     specialised (contiguous) array load
  (when x is loop invariant and z is a unit-stride induction variable)
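On the list model, the unit-stride case collapses the general gather into a contiguous slice (an illustrative sketch; `loadContig` is a hypothetical name):

```haskell
-- Unit-stride load: when the index vector is ⟨z, z+1, z+2, z+3⟩ and all
-- lanes read the same array x, the gather is just a contiguous
-- 4-element slice of x starting at offset z.
loadContig :: [a] -> Int -> [a]
loadContig x z = take 4 (drop z x)
```

For example, `loadContig [0,10,20,30,40,50] 1` yields `[10,20,30,40]`, the same result as gathering x at indices 1, 2, 3, 4 lane by lane.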

SLIDE 23

Results

Test framework: Intel Xeon with 256-bit AVX registers

SLIDE 24

SIMD Vectorisation Performance — HRC SIMD speedup over no SIMD (higher is better):

  Benchmark         Speedup
  Vector Add        1.05
  Vector Sum        1.84
  Dot Product       2.00
  Matrix Multiply   1.34
  Nbody             2.80
  1D Convolution    6.69
  2D Convolution    3.52
  Blur              3.02

SLIDE 25

Conclusions: what is important?

  • Purity at the top level
  • Fusion

  (we already knew these)

  • Understanding effects at the implementation level
  • Use program properties (induction vars)
  • Keep dependencies between scalars and vectors separate
SLIDE 26
Conclusions

  • SIMD was straightforward to add to HRC, with very good results
  • Future work:
  • Masked instructions
  • Vectorised allocations
  • Alignment

We told ‘em we could do parallelism!

SLIDE 27

Backup Slides


SLIDE 28

Pillar

  • C-like language
  • Managed memory with garbage-collection
  • Tail calls
  • Continuations
  • Compiles to C

[1] Anderson, Glew, Guo, Lewis, Liu, Liu, Petersen, Rajagopalan, Stichnoth, Wu, and Zhang. Pillar: A parallel implementation language. Languages and Compilers for Parallel Computing, Springer-Verlag, 2008


SLIDE 29

Per-benchmark results for GHC, GHC LLVM, and Intel HRC SIMD (same scale as the previous chart; higher is better):

  Benchmark         GHC    GHC LLVM   Intel SIMD (HRC)
  Vector Add        0.15   0.15       1.05
  Vector Sum        0.72   0.73       1.84
  Dot Product       0.24   0.35       2.00
  N Body Repa       0.34   0.97       4.37
  Matrix Multiply   0.35   0.98       1.34
  1D Convolution    0.02   0.02       6.68
  2D Convolution    0.34   0.58       5.03
  Blur              0.45   0.86       3.02