Automatic SIMD vectorization for Haskell
Leaf Petersen, Dominic Orchard, Neal Glew ICFP 2013 - Boston, MA, USA
Work done at Intel Labs
SIMD

Trend towards parallel architectures (multi-core & instruction-level parallelism).

SIMD:
r1 = ⟨A0, A1, A2, A3⟩
r2 = ⟨B0, B1, B2, B3⟩

vadd r1, r2  ⇢  r1 = ⟨A0+B0, A1+B1, A2+B2, A3+B3⟩
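The effect of a lane-wise vadd can be modelled in plain Haskell, treating a 4-wide register as a 4-element list (a sketch; vadd4 is an illustrative name, not a compiler primitive):

```haskell
-- Model a 4-wide SIMD register as a 4-element list (illustrative only):
-- vadd4 plays the role of "vadd r1, r2", adding the registers lane-wise.
vadd4 :: Num a => [a] -> [a] -> [a]
vadd4 = zipWith (+)

main :: IO ()
main = print (vadd4 [1, 2, 3, 4] [10, 20, 30, 40 :: Int])
```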
Vectorization transforms a scalar loop into a vector loop:

  for i = 0 .. n:  S1; S2; S3; S4; S5          (one iteration at a time)
      ⇣ vectorize
  for i = 0 .. n:  S1v; S2v; S3v; S4v; S5v     (four iterations at a time)
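The loop transformation can be sketched in Haskell as strip-mining: run the body four iterations at a time, then finish the leftover iterations serially (sumStripMined is an illustrative name, not part of HRC):

```haskell
-- Sketch of strip-mining: consume four elements per "vector" step,
-- then handle the remaining (< 4) elements in a scalar cleanup loop.
sumStripMined :: [Int] -> Int
sumStripMined = go 0
  where
    go acc (a:b:c:d:rest) = go (acc + a + b + c + d) rest  -- vectorized step
    go acc rest           = acc + sum rest                 -- scalar cleanup

main :: IO ()
main = print (sumStripMined [1 .. 10])  -- same result as sum [1 .. 10]
```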
Compilation pipeline of HRC, the Intel Labs Haskell Research Compiler [1]:

  Haskell
    ↓  GHC frontend (good things happen, e.g. type-checking, simplification, fusion, etc.)
  GHC Core
    ↓  translation
  MIL  ⟵ vectorization happens here
    ↓  translation
  C / Pillar

[1] Liu, Glew, Petersen, Anderson. The Intel Labs Haskell Research Compiler. Haskell ’13, ACM.
fruit to be had
functional code
representation opts.)
[1] M. Fluet and S. Weeks. Contification using dominators. ICFP 2001, ACM.
the initializing write of the element
initializing writes rather than mutation
unstreamRMax :: (PrimMonad m, MVector v a)
             => MStream m a -> Int -> m (v (PrimState m) a)
unstreamRMax s n = do
  v <- INTERNAL_CHECK(checkLength) "unstreamRMax" n
       $ unsafeNew n
  let put i x = do
        let i' = i - 1
        INTERNAL_CHECK(checkIndex) "unstreamRMax" i' n
          $ unsafeWrite v i' x
        return i'
  i <- MStream.foldM' put n s
  return $ INTERNAL_CHECK(checkSlice) "unstreamRMax" i (n-i) n
         $ unsafeSlice i (n-i) v
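A pure list model of what unstreamRMax computes (a sketch; the real code fills a mutable vector right to left with initializing writes, and unstreamRModel is a hypothetical name):

```haskell
-- Elements are written at indices n-1, n-2, ..., so the returned slice
-- [i, n) holds the first (at most n) stream elements in reverse order,
-- mirroring unsafeSlice i (n-i) v.
unstreamRModel :: Int -> [a] -> [a]
unstreamRModel n s = reverse (take n s)

main :: IO ()
main = print (unstreamRModel 4 "abc")
```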
(used for the immutable vector too)
unstreamRPrimMmax :: (PrimMonad m, Vector v a)
                  => Int -> MStream m a -> m (v a)
unstreamRPrimMmax n s = do
  v <- INTERNAL_CHECK(checkLength) "unstreamRPrimMmax" n
       $ unsafeCreate n
  let put i x = do
        let i' = i - 1
        INTERNAL_CHECK(checkIndex) "unstreamRPrimMmax" i' n
          $ unsafeInitElem v i' x
        return i'
  i <- MStream.foldM' put n s
  v' <- basicUnsafeInited v
  return $ INTERNAL_CHECK(checkSlice) "unstreamRPrimMmax" i (n-i) n
         $ unsafeSlice i (n-i) v'
Serial:      .... .... .... .... .... ....

Vectorized:  Entry → vector loop (.... .... ....) → cleanup loop (.... .... ....) → Exit
For each variable x in the loop, the vectorizer maintains:

  xv = ....    vector version
  xs = ....    first value (scalar)
  xl = ....    last value (scalar)
  (xb = ....   basis vector)
x = ....
y = +(x, 4)
    ⇣ vectorize
yv = ⟨+⟩(xv, ⟨4, 4, 4, 4⟩)    pointwise op applied to a vector-promoted constant
yl = yv ! 3                   projection: last value of y, needed if y is live-out
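The rewrite above can be sketched with 4-element lists: promote lifts a scalar to every lane, vplus plays the role of ⟨+⟩, and indexing lane 3 is the projection recovering y's last value (all names illustrative):

```haskell
promote :: a -> [a]            -- scalar ⇢ ⟨a, a, a, a⟩
promote = replicate 4

vplus :: Num a => [a] -> [a] -> [a]   -- pointwise ⟨+⟩
vplus = zipWith (+)

main :: IO ()
main = do
  let xv = [10, 11, 12, 13 :: Int]   -- vector version of x
      yv = vplus xv (promote 4)      -- yv = ⟨+⟩(xv, ⟨4, 4, 4, 4⟩)
      yl = yv !! 3                   -- yl = yv ! 3 (last value of y)
  print (yv, yl)
```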
y = x[z]   ⇢   yv = ⟨xv⟩[⟨zv⟩]     general array load on vectors (“gather”)
               equivalent to: yv = ⟨xv0[zv0], xv1[zv1], xv2[zv2], xv3[zv3]⟩

x[z] = y   ⇢   ⟨xv⟩[⟨zv⟩] = yv     general array store on vectors (“scatter”)
supported, and uses this information to optimise
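A list sketch of gather and scatter semantics: gather reads the array at each lane's index; scatter writes each lane's value to its index (illustrative names, quadratic list implementation for clarity only):

```haskell
-- gather: yv_i = x[zv_i] for each lane
gather :: [a] -> [Int] -> [a]
gather x = map (x !!)

-- scatter: write yv_i to index zv_i for each lane
scatter :: [a] -> [Int] -> [a] -> [a]
scatter x zv yv = foldl write x (zip zv yv)
  where write arr (z, y) = take z arr ++ [y] ++ drop (z + 1) arr

main :: IO ()
main = do
  print (gather [100, 101, 102, 103, 104 :: Int] [4, 0, 3, 3])
  print (scatter [0, 0, 0, 0, 0 :: Int] [1, 3] [7, 9])
```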
A base induction variable and a derived one:

  L1(x): ....
         x’ = +(x, 1)
         if ... goto L1(x’) else goto Lend(...)

x is a base induction variable (step 1); x’ is a derived induction variable.

HRC vectorizes the base induction variable using a basis vector:

  xb = ⟨0, 1, 2, 3⟩                  basis vector
  xv = ⟨+⟩(xb, ⟨xs, xs, xs, xs⟩)
  xl = xs + 3
For the derived induction variable x’ of the same loop, two strategies:

Simple:
  x’v = ⟨+⟩(xv, ⟨1, 1, 1, 1⟩)
  x’l = xl + 1
  (2 vector ops + 1 promotion)

HRC (given xb = ⟨0, 1, 2, 3⟩ and xv = ⟨+⟩(xb, ⟨xs, xs, xs, xs⟩)):
  x’b = xb                               same basis vector
  x’s = xs + 1                           last basis value stepped
  x’v = ⟨+⟩(x’b, ⟨x’s, x’s, x’s, x’s⟩)
  (1 vector op + 1 scalar op + 1 promotion; removes the dependence on xv)
“Naturality” of promotion:

  ⟨f⟩ ∘ (promote × promote) ≡ promote ∘ f
  i.e.  ⟨f⟩(⟨a, a, a, a⟩, ⟨b, b, b, b⟩) ≡ ⟨f(a,b), f(a,b), f(a,b), f(a,b)⟩

The simple form of x’v rewrites into the HRC form:

  x’v = ⟨+⟩(xv, ⟨1, 1, 1, 1⟩)                             [simple approach]
      = ⟨+⟩(⟨+⟩(xb, ⟨xs, xs, xs, xs⟩), ⟨1, 1, 1, 1⟩)     (unfold xv)
      = ⟨+⟩(xb, ⟨+⟩(⟨xs, xs, xs, xs⟩, ⟨1, 1, 1, 1⟩))     (assoc)
      = ⟨+⟩(xb, ⟨xs + 1, xs + 1, xs + 1, xs + 1⟩)        (naturality)
      = ⟨+⟩(xb, ⟨x’s, x’s, x’s, x’s⟩)                    (simplify)
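The derivation can be checked concretely: both formulations of x’v compute the same vector (a sketch with 4-element lists; promote and vplus are illustrative names):

```haskell
promote :: a -> [a]                   -- scalar ⇢ ⟨a, a, a, a⟩
promote = replicate 4

vplus :: Num a => [a] -> [a] -> [a]   -- pointwise ⟨+⟩
vplus = zipWith (+)

main :: IO ()
main = do
  let xb     = [0, 1, 2, 3 :: Int]           -- basis vector
      xs     = 10                            -- first scalar value of x
      xv     = vplus xb (promote xs)
      simple = vplus xv (promote 1)          -- x'v = ⟨+⟩(xv, ⟨1,1,1,1⟩)
      hrc    = vplus xb (promote (xs + 1))   -- x'v = ⟨+⟩(xb, ⟨x's, x's, x's, x's⟩)
  print (simple == hrc, simple)
```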
Unit strides provide higher performance. When x is loop invariant and z is a unit-stride induction variable:

  y = x[z]   ⇢   yv = xs[⟨zs:⟩]     specialised (contiguous) array load
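With a unit-stride z whose first value is zs, the four lane indices are zs, zs+1, zs+2, zs+3, so the gather collapses into one contiguous load (a list sketch; contiguousLoad is an illustrative name):

```haskell
-- Load 4 consecutive elements starting at index zs: the contiguous
-- specialisation of a gather with indices ⟨zs, zs+1, zs+2, zs+3⟩.
contiguousLoad :: Int -> [a] -> [a]
contiguousLoad zs x = take 4 (drop zs x)

main :: IO ()
main = print (contiguousLoad 2 [100, 101, 102, 103, 104, 105, 106 :: Int])
```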
Test framework: Intel Xeon with 256-bit AVX registers.
SIMD Vectorisation Performance (HRC SIMD; speedup over no SIMD, higher is better):

  Benchmark          Speedup
  Vector Add         1.05
  Vector Sum         1.84
  Dot Product        2.00
  Matrix Multiply    1.34
  Nbody              2.80
  1D Convolution     6.69
  2D Convolution     3.52
  Blur               3.02
we already knew these
with very good results
We told ‘em we could do parallelism!
[1] Anderson, Glew, Guo, Lewis, Liu, Liu, Petersen, Rajagopalan, Stichnoth, Wu, and Zhang. Pillar: A parallel implementation language. Languages and Compilers for Parallel Computing, Springer-Verlag, 2008
Per-benchmark comparison (higher is better):

  Benchmark          GHC     GHC LLVM   Intel SIMD
  Vector Add         0.15    0.15       1.05
  Vector Sum         0.72    0.73       1.84
  Dot Product        0.24    0.35       2.00
  N Body Repa        0.34    0.97       4.37
  Matrix Multiply    0.35    0.98       1.34
  1D Convolution     0.02    0.02       6.68
  2D Convolution     0.34    0.58       5.03
  Blur               0.45    0.86       3.02