Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures


  1. Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures
     Tom Henretty (1), Kevin Stock (1), Louis-Noël Pouchet (1), Franz Franchetti (2), J. Ramanujam (3), P. Sadayappan (1)
     (1) The Ohio State University, (2) Carnegie Mellon University, (3) Louisiana State University
     March 29, 2011, ETAPS CC’11, Saarbrücken, Germany

  2. Outline
     1. Introduction
     2. Vectorization of Stencils
     3. Stream Alignment Conflict
     4. Data Layout Transformation
     5. Compiler Framework
     6. Experimental Results
     7. Conclusion
     OSU / CMU / LSU

  3. Introduction: Short-Vector SIMD
     ◮ Perform identical computation on small chunks of data
     ◮ Operations are independent
     ◮ Vector size: from 2 to 64
     ◮ Packing operations to form a vector (shuffle, extract, ...)
     ◮ Low latency, multiple SIMD units per CPU
     ◮ Maximal speedup equals the vector size
     ◮ Ubiquitous feature on modern processors
       ◮ x86 – SSE, AVX
       ◮ Power – VMX / VSX
       ◮ ARM – NEON
       ◮ Cell SPU

  4. Introduction: A Brief on Stencil Computations
     ◮ Typically: iterative update of a structured (fixed) grid
       ◮ Compute a point from the values of its neighboring points
       ◮ Same grid / multiple grids
     ◮ Numerous application domains use stencils
       ◮ Finite difference methods for solving PDEs
       ◮ Image processing
       ◮ Computational electromagnetics, CFD, numerical relativity, ...
     ◮ Domain-Specific Languages for Stencils (FEniCS, RNPL, ...)

  5. Vectorization of Stencils: Stencil Example
     (a) 5 point stencil C code

         for (t = 0; t < TMAX; ++t)
           for (i = 1; i < N-1; ++i)
             for (j = 1; j < M-1; ++j)
               a[i][j] = b[i+1][j] + b[i][j-1] + b[i][j] + b[i][j+1] + b[i-1][j];

     (b) Arrays a, b, and stencil detail [figure: N x M arrays indexed by i and j]

  6. Vectorization of Stencils: Vectorization of Stencil Computation
     ◮ Two “main” types of stencils
       ◮ Jacobi-like: the output array is distinct from the input array (out-of-place update)
       ◮ Seidel-like: in-place update
     ◮ Loop transformations expose tiling possibilities, and at least one inner-most parallel loop
     ◮ Auto-vectorization successful (ICC, GCC)...
     ◮ ...But SIMD speedup is far from optimal!

  7. Vectorization of Stencils: Performance Consideration
     (a) Stencil code

         for (t = 0; t < T; ++t) {
           for (i = 0; i < N; ++i)
             for (j = 1; j < N+1; ++j)
         S1:   C[i][j] = A[i][j] + A[i][j-1];
           for (i = 0; i < N; ++i)
             for (j = 1; j < N+1; ++j)
         S2:   A[i][j] = C[i][j] + C[i][j-1];
         }

     (b) Non-Stencil code

         for (t = 0; t < T; ++t) {
           for (i = 0; i < N; ++i)
             for (j = 0; j < N; ++j)
         S3:   C[i][j] = A[i][j] + B[i][j];
           for (i = 0; i < N; ++i)
             for (j = 0; j < N; ++j)
         S4:   A[i][j] = C[i][j] + B[i][j];
         }

     Performance (GFlop/s)   (a) Stencil   (b) Non-Stencil
     AMD Phenom              1.2           1.9
     Core2                   3.5           6.0
     Core i7                 4.1           6.7

     Stencil code (a) has much lower performance than the non-stencil code (b) despite accessing 50% fewer data elements

  8. Stream Alignment Conflict: Stream Alignment Conflict

         for (i = 0; i < H; i++)
           for (j = 0; j < W - 1; j++)
             A[i][j] = B[i][j] + B[i][j+1];

     x86 ASSEMBLY

         movaps B(...), %xmm1
         movaps 16+B(...), %xmm2
         movaps %xmm2, %xmm3
         palignr $4, %xmm1, %xmm3
         ;; Register state here
         addps %xmm1, %xmm3
         movaps %xmm3, A(...)

     VECTOR REGISTERS
         xmm1: I J K L
         xmm2: M N O P
         xmm3: J K L M

     MEMORY CONTENTS
         A: A B C D E F G H ...
         B: I J K L M N O P ...

     ◮ Load and shuffle:
       ◮ Load [I,J,K,L] and [M,N,O,P]
       ◮ Shuffle to create [J,K,L,M]
     ◮ Multiple unaligned loads
       ◮ Load [I,J,K,L] and [J,K,L,M]
       ◮ Not possible on architectures with alignment constraints

  9. Data Layout Transformation: Overview of the Solution
     ◮ Stream Alignment Conflict: adjacent elements in memory map to adjacent vector slots
     ◮ Key idea: break this property, so that both operands fall in the same vector slot
     ◮ Achieved through Data Layout Transformation
       ◮ No shuffle needed
       ◮ No extra unaligned load
       ◮ But not trivial to achieve!

  10. Data Layout Transformation: Data Layout Transformation Example

          for (i = 1; i < 23; ++i)
            B[i] = (A[i-1] + A[i] + A[i+1]) / 3;

      (a) Original Layout (N = 24 elements, vector slots 0 1 2 3 repeating)
          A B C D | E F G H | I J K L | M N O P | Q R S T | U V W X

      (b) Dimension Lifted (V = 4 rows of N/V = 6 elements)
          A B C D E F
          G H I J K L
          M N O P Q R
          S T U V W X

      (c) Transposed (N/V = 6 rows of V = 4 elements)
          A G M S
          B H N T
          C I O U
          D J P V
          E K Q W
          F L R X

      (d) Transformed Layout (vector slots 0 1 2 3 repeating)
          A G M S | B H N T | C I O U | D J P V | E K Q W | F L R X
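
The four panels above can be read as a single index mapping. The following is a minimal sketch in plain C, assuming the array length `n` is a multiple of the vector length `v`; the function names `dlt_forward` and `dlt_inverse` are hypothetical and not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Dimension-lift-and-transpose: view the n-element array as a
   v x (n/v) row-major matrix, transpose it, and store the result
   contiguously. Names and signature are illustrative assumptions. */
void dlt_forward(const float *src, float *dst, size_t n, size_t v) {
    assert(v > 0 && n % v == 0);
    size_t rows = n / v;  /* number of v-wide vectors after the transform */
    for (size_t i = 0; i < n; ++i)
        dst[(i % rows) * v + (i / rows)] = src[i];
}

/* Inverse mapping: restore the original element order. */
void dlt_inverse(const float *src, float *dst, size_t n, size_t v) {
    assert(v > 0 && n % v == 0);
    size_t rows = n / v;
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[(i % rows) * v + (i / rows)];
}
```

With `n = 24` and `v = 4`, and the letters A to X numbered 0 to 23, the first transformed vector holds elements 0, 6, 12, 18, which corresponds to A G M S in panel (d).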

  11. Data Layout Transformation: Handling Boundaries
      [Figure: boundary handling when computing array Z from the transformed array Y (rows A G M S, B H N T, C I O U, D J P V, E K Q W, F L R X). Panels: Original Array Y; Shuffle Opposite Boundaries of Array Y (F L R from the last row, G M S from the first row); Compute Steady State of Array Z; Compute Boundaries of Array Z.]
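
The split between steady state and boundaries can be sketched in scalar C for the 3-point stencil of the previous slide. This is an illustration under stated assumptions, not the generated code from the talk: `V = 4`, the helper names `dlt_index` and `stencil3_dlt` are hypothetical, and the inner loops over `k` stand in for single vector operations.

```c
#include <assert.h>
#include <stddef.h>

enum { V = 4 };  /* assumed vector length (4 floats, as with SSE) */

/* Position of original element i in the transformed layout;
   rows = n / V is the number of V-wide vectors. */
size_t dlt_index(size_t i, size_t rows) {
    return (i % rows) * V + i / rows;
}

/* b[i] = (a[i-1] + a[i] + a[i+1]) / 3 for 0 < i < n-1, with both
   arrays stored in the transformed layout (at, bt). */
void stencil3_dlt(const float *at, float *bt, size_t rows) {
    assert(rows >= 2);
    const float *first = at;                  /* vector row 0        */
    const float *last  = at + (rows - 1) * V; /* vector row rows - 1 */

    /* Steady state: all three operands sit in the same lane k. */
    for (size_t r = 1; r + 1 < rows; ++r)
        for (size_t k = 0; k < V; ++k)
            bt[r * V + k] = (at[(r - 1) * V + k] + at[r * V + k]
                             + at[(r + 1) * V + k]) / 3.0f;

    /* First row: the left neighbours are the last row shifted by one
       lane. Lane 0 is original element 0 (domain edge), so it is skipped. */
    for (size_t k = 1; k < V; ++k)
        bt[k] = (last[k - 1] + first[k] + at[V + k]) / 3.0f;

    /* Last row: the right neighbours are the first row shifted by one
       lane. Lane V-1 is original element n-1, also skipped. */
    for (size_t k = 0; k + 1 < V; ++k)
        bt[(rows - 1) * V + k] = (at[(rows - 2) * V + k] + last[k]
                                  + first[k + 1]) / 3.0f;
}
```

Only the first and last vector rows need operands taken from a different lane; every other row is computed from aligned rows with no lane movement, which is the property the transformation is designed to provide.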

  12. Data Layout Transformation: Higher-dimensional Stencils
      (a) Original Layout (rows 0 to 2)
          row 0: n0 n1 n2 n3
          row 1: w0 c0 e0 w1 c1 e1 w2 c2 e2 w3 c3 e3
          row 2: s0 s1 s2 s3

      (b) Transformed Layout (rows 0 to 2), each neighbor group forms one vector
          n0 n1 n2 n3
          w0 w1 w2 w3 | c0 c1 c2 c3 | e0 e1 e2 e3
          s0 s1 s2 s3
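
One way to read the higher-dimensional figure is that the transformation is applied along the fastest-varying dimension only. The sketch below assumes a row-major array with rows of `m` elements, `m` a multiple of the vector length `v`; the function name `dlt2d_offset` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j) of a row-major array with rows of m elements,
   after applying dimension-lift-and-transpose to each row separately.
   v is the assumed vector length. */
size_t dlt2d_offset(size_t i, size_t j, size_t m, size_t v) {
    assert(v > 0 && m % v == 0 && j < m);
    size_t cols = m / v;  /* number of v-wide vectors per row */
    return i * m + (j % cols) * v + j / cols;
}
```

For `m = 12` and `v = 4`, the point (1, 4) lands at offset 17, its west and east neighbors at 17 - 4 and 17 + 4, and its north and south neighbors at 17 - 12 and 17 + 12. All five offsets differ by multiples of `v`, so the five operands occupy the same vector slot.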

  13. Compiler Framework: Overview of Code Generation Algorithm
      1. Detect arrays/statements that suffer from a Stream Alignment Conflict (SAC)
      2. Perform Dimension-Lift-and-Transpose of those arrays
      3. Generate vector code for the inner loop
         ◮ Ghost cell copy-in and copy-out code
         ◮ Boundary code
         ◮ Steady state code

  14. Compiler Framework: Detection of Stream Alignment Conflict
      ◮ Standard compiler framework operating on array subscript functions
      ◮ Main idea: detect cross-iteration reuse
      ◮ Robust to stream offset via iteration shifting
      ◮ Minimize the reuse distance
      ◮ Some alignment conflicts are artificial and can be fixed with stream realignment
      ◮ Requires the window of the stencil to be constant
      ◮ The window size is used to compute the number of ghost cells
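
As a rough illustration of the reuse idea only, the sketch below flags an array that is read at two different constant offsets in the innermost subscript of one statement, and reports the width of that window. It is a deliberately simplified proxy, not the framework described in the talk: it ignores iteration shifting, stream realignment, and multi-statement analysis, and the names `struct ref`, `window_width`, and `has_conflict` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One read reference in the innermost loop. */
struct ref {
    int array_id;  /* hypothetical identifier of the array being read */
    int offset;    /* constant added to the innermost loop index      */
};

/* Width of the window (max offset - min offset) over all references
   to array_id; 0 when the array is read at a single offset or not at all. */
int window_width(const struct ref *refs, size_t n, int array_id) {
    int lo = 0, hi = 0, seen = 0;
    for (size_t k = 0; k < n; ++k) {
        if (refs[k].array_id != array_id)
            continue;
        if (!seen) {
            lo = hi = refs[k].offset;
            seen = 1;
        } else {
            if (refs[k].offset < lo) lo = refs[k].offset;
            if (refs[k].offset > hi) hi = refs[k].offset;
        }
    }
    return seen ? hi - lo : 0;
}

/* Simplified test: the same array is read at different offsets, so
   an element is reused across iterations of the innermost loop. */
int has_conflict(const struct ref *refs, size_t n, int array_id) {
    assert(refs != NULL || n == 0);
    return window_width(refs, n, array_id) > 0;
}
```

Applied to slide 7, statement S1 reads `A[i][j]` and `A[i][j-1]` (offsets 0 and -1, window width 1) and is flagged, while S3 reads `A[i][j]` and `B[i][j]` (one offset per array) and is not.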

  15. Experimental Results: Experimental Setup
      ◮ Experiments run on 3 architectures (x86):
        ◮ Intel Core2 Quad (Kentsfield): SAC resolved with low-performance shuffles
        ◮ AMD Phenom (K10): SAC resolved with average-performance shuffles
        ◮ Intel Core i7 (Nehalem): SAC resolved with fast redundant loads
      ◮ Data is L1-resident
        ◮ Assume tiling was performed beforehand if necessary
      ◮ Tested compiler: Intel ICC 11.1

  16. Experimental Results: Three Code Variants Evaluated
      1. Ref: reference code
         ◮ Straightforward C implementation
         ◮ Always auto-vectorized by the compiler
      2. DLT: basic layout transformed
         ◮ Straightforward C implementation with DLT arrays
         ◮ Always auto-vectorized by the compiler
      3. DLTi: intrinsics + layout transformed
         ◮ C implementation with DLT arrays and SSE vector intrinsics

  17. Experimental Results: Single Precision Results
      [Figure: Single Precision DLT Results, L1 Cache Resident. Bar chart of Gflop/s (axis 0 to 16) for Ref., DLT, and DLTi. X axis: Benchmark / Microarchitecture, with Phenom, Core2Quad, and Core i7 for each of J-1D, J-2D-5pt, J-2D-9pt, J-3D, Heatttut-3D, FDTD-2D, Rician-2D.]

  18. Experimental Results: Double Precision Results
      [Figure: Double Precision DLT Results, L1 Cache Resident. Bar chart of Gflop/s (axis 0 to 8) for Ref., DLT, and DLTi. X axis: Benchmark / Microarchitecture, with Phenom, Core2Quad, and Core i7 for each of J-1D, J-2D-5pt, J-2D-9pt, J-3D, Heatttut-3D, FDTD-2D, Rician-2D.]

  19. Experimental Results: Summary of Experiments
      ◮ Performance improvement matches the shuffle/unaligned load costs
      ◮ Tested higher-dimensional stencils show less improvement:
        ◮ more intra-stencil dependences
        ◮ higher cache pressure
      ◮ Manual check of the ASM showed no shuffle, no redundant load instructions

  20. Conclusion: Conclusion
      ◮ Stream Alignment Conflict is the performance bottleneck for auto-vectorized stencils
      ◮ Impact varies with micro-architecture characteristics, but is always significant
      ◮ A data layout transformation can solve this problem
      ◮ Strong performance improvement observed
      ◮ Manual vectorization still beats automatic vectorization
