Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures



slide-1
SLIDE 1

Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures

Tom Henretty1, Kevin Stock1, Louis-Noël Pouchet1, Franz Franchetti2, J. Ramanujam3, P. Sadayappan1

1 The Ohio State University 2 Carnegie Mellon University 3 Louisiana State University

March 29, 2011

ETAPS CC’11

Saarbrücken, Germany

slide-2
SLIDE 2

Outline: CC’11

Outline

1. Introduction
2. Vectorization of Stencils
3. Stream Alignment Conflict
4. Data Layout Transformation
5. Compiler Framework
6. Experimental Results
7. Conclusion

OSU / CMU / LSU 2

slide-3
SLIDE 3

Introduction: CC’11

Short-Vector SIMD

◮ Perform identical computation on small chunks of data
  ◮ Operations are independent
  ◮ Vector size: from 2 to 64
  ◮ Packing operations to form a vector (shuffle, extract, ...)

◮ Low latency, multiple SIMD units per CPU

◮ Maximal speedup equals the vector size

◮ Ubiquitous feature on modern processors
  ◮ x86 – SSE, AVX
  ◮ Power – VMX / VSX
  ◮ ARM – NEON
  ◮ Cell SPU

slide-4
SLIDE 4

Introduction: CC’11

A Brief on Stencil Computations

◮ Typically: iterative update of a structured (fixed) grid

◮ Compute a point from the values of its neighboring points
  ◮ Same grid / multiple grids

◮ Numerous application domains use stencils
  ◮ Finite difference methods for solving PDEs
  ◮ Image processing
  ◮ Computational electromagnetics, CFD, numerical relativity, ...

◮ Domain-specific languages for stencils (FEniCS, RNPL, ...)

slide-5
SLIDE 5

Vectorization of Stencils: CC’11

Stencil Example

    for (t = 0; t < TMAX; ++t)
      for (i = 1; i < N-1; ++i)
        for (j = 1; j < M-1; ++j)
          a[i][j] = b[i+1][j] + b[i][j-1] + b[i][j] + b[i][j+1] + b[i-1][j];

(a) 5-point stencil C code

[Figure (b): arrays a, b, and stencil detail; the grid has N rows (index i) and M columns (index j)]

slide-6
SLIDE 6

Vectorization of Stencils: CC’11

Vectorization of Stencil Computation

◮ Two “main” types of stencils
  ◮ Jacobi-like: out-of-place update (the output array is distinct from the input array)
  ◮ Seidel-like: in-place update

◮ Loop transformations expose tiling possibilities, and at least one inner-most parallel loop

◮ Auto-vectorization successful (ICC, GCC)...

◮ ...But SIMD speedup is far from optimal!

slide-7
SLIDE 7

Vectorization of Stencils: CC’11

Performance Consideration

(a) Stencil code

    for (t = 0; t < T; ++t) {
      for (i = 0; i < N; ++i)
        for (j = 1; j < N+1; ++j)
    S1:   C[i][j] = A[i][j] + A[i][j-1];
      for (i = 0; i < N; ++i)
        for (j = 1; j < N+1; ++j)
    S2:   A[i][j] = C[i][j] + C[i][j-1];
    }

Performance: AMD Phenom 1.2 GFlop/s, Core2 3.5 GFlop/s, Core i7 4.1 GFlop/s

(b) Non-stencil code

    for (t = 0; t < T; ++t) {
      for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
    S3:   C[i][j] = A[i][j] + B[i][j];
      for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
    S4:   A[i][j] = C[i][j] + B[i][j];
    }

Performance: AMD Phenom 1.9 GFlop/s, Core2 6.0 GFlop/s, Core i7 6.7 GFlop/s

Stencil code (a) has much lower performance than the non-stencil code (b) despite accessing 50% fewer data elements

slide-8
SLIDE 8

Stream Alignment Conflict: CC’11

Stream Alignment Conflict

    for (i = 0; i < H; i++)
      for (j = 0; j < W - 1; j++)
        A[i][j] = B[i][j] + B[i][j+1];

MEMORY CONTENTS

    A: ...
    B: A B C D | E F G H | I J K L | M N O P

VECTOR REGISTERS

    xmm1: I J K L
    xmm2: M N O P
    xmm3: J K L M

x86 ASSEMBLY

    movaps  B(...), %xmm1
    movaps  16+B(...), %xmm2
    movaps  %xmm2, %xmm3
    palignr $4, %xmm1, %xmm3
    ;; Register state here
    addps   %xmm1, %xmm3
    movaps  %xmm3, A(...)

◮ Load and shuffle:
  ◮ Load [I,J,K,L] and [M,N,O,P]
  ◮ Shuffle to create [J,K,L,M]

◮ Multiple unaligned loads
  ◮ Load [I,J,K,L] and [J,K,L,M]
  ◮ Not possible on architectures with alignment constraints
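The load-and-shuffle sequence above can be modelled in portable scalar C. This is only a sketch: the `vec4` type and the helpers `load_aligned`, `align_right`, `add4`, and `stencil_block` are hypothetical stand-ins for a 4-float vector register, `movaps`, `palignr`, and `addps`; they are not real intrinsics, and a vector length of 4 is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4  /* assumed vector length: four floats per register */

/* Hypothetical scalar stand-in for one vector register. */
typedef struct { float lane[VLEN]; } vec4;

/* Models an aligned vector load (movaps) of VLEN consecutive floats. */
static vec4 load_aligned(const float *p) {
    vec4 v;
    for (int k = 0; k < VLEN; ++k) v.lane[k] = p[k];
    return v;
}

/* Models the shuffle (palignr): view lo followed by hi as one 2*VLEN
   sequence and extract VLEN lanes starting at position `shift`. */
static vec4 align_right(vec4 lo, vec4 hi, int shift) {
    assert(shift >= 0 && shift <= VLEN);
    vec4 v;
    for (int k = 0; k < VLEN; ++k) {
        int src = k + shift;
        v.lane[k] = (src < VLEN) ? lo.lane[src] : hi.lane[src - VLEN];
    }
    return v;
}

/* Models a lane-wise vector add (addps). */
static vec4 add4(vec4 x, vec4 y) {
    vec4 v;
    for (int k = 0; k < VLEN; ++k) v.lane[k] = x.lane[k] + y.lane[k];
    return v;
}

/* Computes a[j] = b[j] + b[j+1] for the VLEN elements starting at j0.
   Reads b[j0 .. j0 + 2*VLEN - 1]: two aligned loads plus one shuffle. */
static void stencil_block(float *a, const float *b, size_t j0) {
    vec4 x1 = load_aligned(b + j0);         /* [I,J,K,L] */
    vec4 x2 = load_aligned(b + j0 + VLEN);  /* [M,N,O,P] */
    vec4 x3 = align_right(x1, x2, 1);       /* [J,K,L,M] */
    vec4 sum = add4(x1, x3);
    for (int k = 0; k < VLEN; ++k) a[j0 + k] = sum.lane[k];
}
```

The extra `align_right` step is the cost that the stream alignment conflict adds to every output vector; the alternative is a second, unaligned load of `b + j0 + 1`.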

slide-9
SLIDE 9

Data Layout Transformation: CC’11

Overview of the Solution

◮ Stream Alignment Conflict: adjacent elements in memory map to adjacent vector slots

◮ Key idea: break this property, so that both operands sit in the same vector slot

◮ Achieved through Data Layout Transformation
  ◮ No shuffle needed
  ◮ No extra unaligned load
  ◮ But not trivial to achieve!

slide-10
SLIDE 10

Data Layout Transformation: CC’11

Data Layout Transformation Example

(a) Original Layout (N = 24 elements)

    A B C D E F G H I J K L M N O P Q R S T U V W X

(b) Dimension Lifted (V rows of N/V elements; V = 4 here)

    A B C D E F
    G H I J K L
    M N O P Q R
    S T U V W X

(c) Transposed (N/V rows of V elements)

    A G M S
    B H N T
    C I O U
    D J P V
    E K Q W
    F L R X

(d) Transformed Layout

    A G M S B H N T C I O U D J P V E K Q W F L R X

    for (i = 1; i < 23; ++i)   /* interior points of the 24-element array */
      B[i] = (A[i-1] + A[i] + A[i+1]) / 3;
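The index arithmetic behind this example can be sketched in C as follows. This is a minimal illustration, not the authors' implementation: the names `dlt_index`, `dlt_forward`, and `dlt_inverse` are hypothetical, elements are `char` so that the letters A to X can be used directly, and the array length is assumed to be a multiple of the vector length.

```c
#include <assert.h>
#include <stddef.h>

/* Position of original element i after dimension-lift-and-transpose.
   n: number of elements, v: vector length (assumption: n % v == 0).
   The array is viewed as v chunks of n/v elements (dimension lifting),
   and that v x (n/v) matrix is then stored transposed. */
static size_t dlt_index(size_t i, size_t n, size_t v) {
    size_t rows = n / v;               /* elements per chunk */
    return (i % rows) * v + i / rows;  /* (row, lane) -> linear index */
}

/* Copy src (original layout) into dst (transformed layout). */
static void dlt_forward(const char *src, char *dst, size_t n, size_t v) {
    assert(v > 0 && n % v == 0);
    for (size_t i = 0; i < n; ++i)
        dst[dlt_index(i, n, v)] = src[i];
}

/* Copy src (transformed layout) back into dst (original layout). */
static void dlt_inverse(const char *src, char *dst, size_t n, size_t v) {
    assert(v > 0 && n % v == 0);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[dlt_index(i, n, v)];
}
```

With n = 24 and v = 4, `dlt_forward` maps ABCDEFGHIJKLMNOPQRSTUVWX to AGMSBHNTCIOUDJPVEKQWFLRX, which matches layout (d). Elements that were adjacent in memory (A and B, for example) now sit v positions apart, in the same slot of consecutive vectors.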


slide-11
SLIDE 11

Data Layout Transformation: CC’11

Handling Boundaries

[Figure: boundary handling on the transformed layout (rows A G M S, B H N T, ..., F L R X). Panels: "Original Array Y"; "Shuffle Opposite Boundaries of Array Y" (the boundary rows A G M S and F L R X shifted by one vector slot); "Compute Boundaries of Array Z" (G' M' S' and F' L' R'); "Compute Steady State of Array Z".]
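The split between steady state and boundaries can be sketched in scalar C for the 3-point average of the previous slide. This is an illustration under stated assumptions, not the generated code from the paper: the function name `jacobi1d_dlt` is hypothetical, `y` and `z` are assumed to already be in the transformed layout (n/v rows of v lanes), and n is assumed to be a multiple of v with at least two rows.

```c
#include <assert.h>
#include <stddef.h>

/* 3-point average on arrays stored in the transformed layout.
   y: input, z: output, both viewed as rows x v with rows = n / v.
   The two elements that correspond to original indices 0 and n-1
   are left untouched, as in the original loop over interior points. */
static void jacobi1d_dlt(const float *y, float *z, size_t n, size_t v) {
    assert(v > 0 && n % v == 0 && n / v >= 2);
    size_t rows = n / v;

    /* Steady state: both neighbours sit in the same lane c of the
       adjacent rows, so the loop over c is a plain aligned vector
       operation with no shuffle. */
    for (size_t r = 1; r + 1 < rows; ++r)
        for (size_t c = 0; c < v; ++c)
            z[r * v + c] = (y[(r - 1) * v + c] + y[r * v + c]
                            + y[(r + 1) * v + c]) / 3.0f;

    /* First row: the left neighbour is in the last row, one lane to
       the left. Lane 0 is original index 0 and is skipped. */
    for (size_t c = 1; c < v; ++c)
        z[c] = (y[(rows - 1) * v + (c - 1)] + y[c] + y[v + c]) / 3.0f;

    /* Last row: the right neighbour is in the first row, one lane to
       the right. Lane v-1 is original index n-1 and is skipped. */
    for (size_t c = 0; c + 1 < v; ++c)
        z[(rows - 1) * v + c] = (y[(rows - 2) * v + c]
                                 + y[(rows - 1) * v + c]
                                 + y[c + 1]) / 3.0f;
}
```

Only the two boundary rows need the lane shift that corresponds to "Shuffle Opposite Boundaries"; every other row is computed without any data reorganization.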

slide-12
SLIDE 12

Data Layout Transformation: CC’11

Higher-dimensional Stencils

[Figure: a 2D 5-point stencil in (a) the original layout and (b) the transformed layout. Operands are labelled c, n, s, w, e (presumably center, north, south, west, east), each with vector slots 0 to 3.]

slide-13
SLIDE 13

Compiler Framework: CC’11

Overview of Code Generation Algorithm

1. Detect arrays/statements that suffer from SAC
2. Perform Dimension-Lift-and-Transpose of those arrays
3. Generate vector code for the inner loop
  ◮ Ghost cell copy-in and copy-out code
  ◮ Boundary code
  ◮ Steady state code

slide-14
SLIDE 14

Compiler Framework: CC’11

Detection of Stream Alignment Conflict

◮ Standard compiler framework operating on array subscript functions

◮ Main idea: detect cross-iteration reuse

◮ Robust to stream offset via iteration shifting
  ◮ Minimize the reuse distance
  ◮ Some alignment conflicts are artificial and fixed with stream realignment

◮ Requires the window of the stencil to be constant
  ◮ The window size is used to compute the number of ghost cells
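A heavily simplified version of this check can be sketched in C. It is not the analysis described in the paper: it assumes every reference in the innermost loop has the form X[...][j + d] with a constant offset d along the vectorized index j, and it ignores iteration shifting and stream realignment. The type `stream_ref` and the functions `stencil_window` and `has_alignment_conflict` are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one array reference in the innermost loop:
   X[...][j + inner_offset], with j the vectorized loop index. */
typedef struct {
    int array_id;      /* identifies the referenced array */
    int inner_offset;  /* constant offset along the vectorized dimension */
} stream_ref;

/* Spread (max - min) of the innermost offsets over all references to
   array_id. Returns 0 if the array is not referenced or if all of its
   references use the same offset. */
static int stencil_window(const stream_ref *refs, size_t nrefs, int array_id) {
    assert(refs != NULL || nrefs == 0);
    bool seen = false;
    int lo = 0, hi = 0;
    for (size_t k = 0; k < nrefs; ++k) {
        if (refs[k].array_id != array_id) continue;
        int d = refs[k].inner_offset;
        if (!seen) { lo = hi = d; seen = true; }
        else {
            if (d < lo) lo = d;
            if (d > hi) hi = d;
        }
    }
    return hi - lo;
}

/* In this simplified model, an array has a conflict when the same
   element is reached through references with different innermost
   offsets, i.e. it is needed in more than one vector slot. */
static bool has_alignment_conflict(const stream_ref *refs, size_t nrefs,
                                   int array_id) {
    return stencil_window(refs, nrefs, array_id) > 0;
}
```

For the 5-point stencil of slide 5, the references to `b` use offsets -1, 0, and +1 along j, so this model reports a conflict with a window of 2, while `a` has none. For the non-stencil code of slide 7 every array is reported conflict-free.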

slide-15
SLIDE 15

Experimental Results: CC’11

Experimental Setup

◮ Experiments run on 3 architectures (x86):
  ◮ Intel Core2 Quad (Kentsfield): SAC resolved with low-performance shuffles
  ◮ AMD Phenom (K10): SAC resolved with average-performance shuffles
  ◮ Intel Core i7 (Nehalem): SAC resolved with fast redundant loads

◮ Data is L1-resident
  ◮ Assume tiling was performed beforehand if necessary

◮ Tested compiler: Intel ICC 11.1

slide-16
SLIDE 16

Experimental Results: CC’11

Three Code Variants Evaluated

1. Ref: reference code
  ◮ Straightforward C implementation
  ◮ Always auto-vectorized by the compiler
2. DLT: basic layout transformed
  ◮ Straightforward C implementation with DLT arrays
  ◮ Always auto-vectorized by the compiler
3. DLTi: intrinsics + layout transformed
  ◮ C implementation with DLT arrays and SSE vector intrinsics

slide-17
SLIDE 17

Experimental Results: CC’11

Single Precision Results

[Figure: Single Precision DLT Results, L1 Cache Resident. Bar chart of GFlop/s (axis 0 to 16) by benchmark and microarchitecture. Benchmarks: J-1D, J-2D-5pt, J-2D-9pt, J-3D, Heattut-3D, FDTD-2D, Rician-2D. Microarchitectures: Phenom, Core2Quad, Core i7. Series: Ref., DLT, DLTi.]

slide-18
SLIDE 18

Experimental Results: CC’11

Double Precision Results

[Figure: Double Precision DLT Results, L1 Cache Resident. Bar chart of GFlop/s (axis up to 8) by benchmark and microarchitecture. Benchmarks: J-1D, J-2D-5pt, J-2D-9pt, J-3D, Heattut-3D, FDTD-2D, Rician-2D. Microarchitectures: Phenom, Core2Quad, Core i7. Series: Ref., DLT, DLTi.]

slide-19
SLIDE 19

Experimental Results: CC’11

Summary of Experiments

◮ Performance improvement matches the shuffle/unaligned load costs

◮ Tested higher-dimensional stencils show less improvement:
  ◮ more intra-stencil dependences
  ◮ higher cache pressure

◮ Manual check of the ASM showed no shuffle, no redundant load instructions

slide-20
SLIDE 20

Conclusion: CC’11

Conclusion

◮ Stream Alignment Conflict is the performance bottleneck for auto-vectorized stencils

◮ Impact varies with micro-architecture characteristics, but is always significant

◮ A data layout transformation can solve this problem

◮ Strong performance improvement observed

◮ Manual vectorization still beats automatic vectorization