Optimization of High-Order Stencils* Kevin Stock - PowerPoint PPT Presentation

Optimization of High-Order Stencils*
Kevin Stock, Ohio State University
Martin Kong, Ohio State University
Louis-Noel Pouchet, Ohio State University
Fabrice Rastello, INRIA, Grenoble
J. (Ram) Ramanujam, Louisiana State University
Saday Sadayappan, Ohio State University
* Funded in part by NSF and DOE

DSL Technology for Exascale Computing (D-TEC)
Lead PI: Daniel J. Quinlan, Lawrence Livermore National Laboratory
Deputy PI: Saman Amarasinghe, MIT
Co-PIs and Institutions:
• Armando Solar-Lezama, Adam Chlipala, Srinivas Devadas, Una-May O'Reilly, Nir Shavit, Youssef Marzouk @ Massachusetts Institute of Technology
• John Mellor-Crummey & Vivek Sarkar @ Rice University
• Vijay Saraswat & David Grove @ IBM Watson
• P. Sadayappan & Atanas Rountev @ Ohio State University
• Ras Bodik @ University of California at Berkeley
• Craig Rasmussen @ University of Oregon
• Phil Colella @ Lawrence Berkeley National Laboratory
• Scott Baden @ University of California at San Diego
Sample program shown on the slide:
    int aFunction(int a, int b) { int c=b; return a; }
    main() {
      int a,b,c,d,e;
      int i=4;
      for (i=0;i<10;i++) {
        int j=55;
        c=i+j;
        c=aFunction(i,c);
        a=aFunction(a+1,b);
      }
      #pragma SliceTarget a;
      return 0;
    }
[Chart: vertical axis 0% to 100%, horizontal axis 1, 4, 8, 12, 16]

Stencil Example
    for (i=1; i<N-1; i++)
      for (j=1; j<N-1; j++)
        A[i][j] = c*(B[i][j-1] + B[i][j] + B[i][j+1] + B[i-1][j] + B[i+1][j]);
• Stencil computations sweep over a regular multi-dimensional array
• Computation for each element of result array A uses a fixed set of neighboring elements from input array B
• Ample inherent parallelism, but performance is often far below the processor's peak: without temporal blocking, it is limited by memory bandwidth

Optimizing High-Order Stencils
• High-order stencils: weighted averaging over a wider neighborhood
• "Box" w x w stencil => O(w^2) flops per site (w = 2*k+1)
• Operational intensity: O(w^2)
    for (i=k; i<n-k; i++)
      for (j=k; j<n-k; j++)
        // ii,jj loops fully unrolled
        for (ii=-k; ii<=k; ii++)
          for (jj=-k; jj<=k; jj++)
            B[i][j] += W[k+ii][k+jj]*A[i+ii][j+jj];
[Figures: stencil footprints for k=1, w=2k+1 = 3 and for k=2, w=2k+1 = 5]
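The O(w^2) growth can be made concrete with a little arithmetic. The sketch below counts one multiply and one add per weight, as in the loop nest on the slide. The helper names and the bytes-per-site parameter are illustrative assumptions, not part of the presentation; the actual memory traffic depends on cache behavior.

```c
#include <assert.h>

/* Hypothetical helper: flops per output site of a w x w box stencil,
   counting w*w multiplies plus w*w adds (w = 2*k + 1). */
int box_flops_per_site(int k)
{
    assert(k >= 0);
    int w = 2 * k + 1;
    return 2 * w * w;
}

/* Hypothetical helper: flops per byte, given an ASSUMED amount of
   memory traffic per site (the caller picks the traffic model). */
double box_intensity(int k, double bytes_per_site)
{
    assert(bytes_per_site > 0.0);
    return box_flops_per_site(k) / bytes_per_site;
}
```

For example, under an assumed 16 bytes of traffic per site, k=1 gives 18 flops per site and k=2 gives 50, so the intensity nearly triples when going from a 3x3 to a 5x5 stencil.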

High-Order Stencils: Baseline Performance
    for (i=k; i<n-k; i++)
      for (j=k; j<n-k; j++)
        // ii,jj loops fully unrolled
        for (ii=-k; ii<=k; ii++)
          for (jj=-k; jj<=k; jj++)
            B[i][j] += W[k+ii][k+jj]*A[i+ii][j+jj];
• Multi-core performance improves from 3x3 to 5x5 stencil but drops for higher-order stencils
• Problem:
  – Not memory-bandwidth bound
  – But too many register loads due to insufficient registers to get reuse
[Plots: GFLOP/s vs. k (series: base; k = 1 to 6) and Memory Ops / FLOP vs. k (series: base-loads, base-stores; k = 1 to 6)]

Standard Gather-Gather Stencil
• w reads from IN
• 0 reads from OUT
• 1 write to OUT
• w^2 − w + 1 registers
• Six points reused between iterations
[Figure: sweep direction (innermost loop)]

Reordering Stencil Operations
• A w x w "box" stencil computation over an M x N domain can be viewed as a total of w^2 MN edges
• Standard stencil execution groups all edges incoming into a site for consecutive execution
• Regrouping of edges forms a different "stencil" pattern, with different #loads/stores and register pressure
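As a minimal sketch of what regrouping edges means, the two functions below execute the same set of multiply-add edges in two orders: grouped by output site (gather) and grouped by input site (scatter). This is a simplified illustration, not the hybrid gather-scatter or scatter-gather patterns from the slides; the function names and the flat row-major array layout are assumptions made for the example.

```c
#include <assert.h>

/* Gather order: all edges INTO one output site run consecutively. */
void box_gather(int n, int k, const double *W, const double *A, double *B)
{
    int w = 2 * k + 1;
    assert(k >= 0 && n >= w);
    for (int i = k; i < n - k; i++)
        for (int j = k; j < n - k; j++)
            for (int ii = -k; ii <= k; ii++)
                for (int jj = -k; jj <= k; jj++)
                    B[i*n + j] += W[(k+ii)*w + (k+jj)] * A[(i+ii)*n + (j+jj)];
}

/* Scatter order: all edges OUT OF one input site run consecutively.
   Input site (p,q) feeds output site (p-ii, q-jj) with weight W[k+ii][k+jj],
   which is the same edge as in box_gather, just visited in another order. */
void box_scatter(int n, int k, const double *W, const double *A, double *B)
{
    int w = 2 * k + 1;
    assert(k >= 0 && n >= w);
    for (int p = 0; p < n; p++)
        for (int q = 0; q < n; q++)
            for (int ii = -k; ii <= k; ii++)
                for (int jj = -k; jj <= k; jj++) {
                    int i = p - ii, j = q - jj;
                    /* only interior output sites are updated */
                    if (i >= k && i < n - k && j >= k && j < n - k)
                        B[i*n + j] += W[(k+ii)*w + (k+jj)] * A[p*n + q];
                }
}
```

Both orders rely on the associativity of the accumulation, so with floating-point data the results can differ in the last bits; with integer-valued inputs they match exactly.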

Gather-Scatter
• 1 read from IN
• w − 1 reads from OUT
• w writes to OUT
• w + 1 registers

Scatter-Gather
• w reads from IN
• 0 reads from OUT
• 1 write to OUT
• w + 1 registers

Performance: Core i7-4770K (4 cores)

Stencils on Vector-SIMD Processors
• Fundamental source of inefficiency with stencil codes on current short-vector SIMD ISAs (e.g. SSE, AVX …)
  – Concurrent operations on contiguous elements
  – Each data element is reused in different "slots" of a vector register
  – Redundant loads or shuffle ops needed
• Data layout transformation based on matching computational characteristics of stencils to vector-SIMD architecture characteristics
    for (i=0; i<H; ++i)
      for (j=0; j<W; ++j)
        z[i][j] += y[i][j] + y[i][j+1];
Data in memory:
    z[i][j]: a b c d e f g h i j k l
    y[i][j]: m n o p q r s t u v w x
Vector registers (slots 0 to 3):
    VR0: a b c d
    VR1: m n o p
    VR2: n o p q
    VR3:
    VR4:
Inefficiency: each element of y is loaded twice
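The redundant loads can be seen in a portable emulation of one row of the loop. The sketch below uses a plain struct as a stand-in for a 4-wide vector register; `vec4`, `vload`, `vadd`, `vstore`, and `row_update` are hypothetical names for this illustration, not real intrinsics. It assumes the row width is a multiple of 4 and that y holds one extra element so that y[j+1] stays in bounds.

```c
#include <assert.h>

#define VLEN 4                      /* assumed vector width in elements */

typedef struct { float lane[VLEN]; } vec4;   /* stand-in for a SIMD register */

vec4 vload(const float *p)
{
    vec4 v;
    for (int l = 0; l < VLEN; l++) v.lane[l] = p[l];
    return v;
}

vec4 vadd(vec4 a, vec4 b)
{
    vec4 r;
    for (int l = 0; l < VLEN; l++) r.lane[l] = a.lane[l] + b.lane[l];
    return r;
}

void vstore(float *p, vec4 v)
{
    for (int l = 0; l < VLEN; l++) p[l] = v.lane[l];
}

/* One row of z[j] += y[j] + y[j+1] for j in [0, w).
   Assumes w is a multiple of VLEN and y holds w + 1 elements. */
void row_update(int w, float *z, const float *y)
{
    assert(w > 0 && w % VLEN == 0);
    for (int j = 0; j < w; j += VLEN) {
        vec4 vz = vload(z + j);       /* a b c d */
        vec4 v1 = vload(y + j);       /* m n o p */
        vec4 v2 = vload(y + j + 1);   /* n o p q: three elements already in v1 */
        vstore(z + j, vadd(vz, vadd(v1, v2)));
    }
}
```

The second load of y overlaps the first in three of its four elements. On real hardware that overlap costs either an extra unaligned load or a shuffle of already loaded registers.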

Data Layout Transformation
    for (j = 1; j < W-1; ++j)
      z[j] = y[j-1] + y[j] + y[j+1];
[Figure: y[0:23] is transformed into yt[0:23]. (a) 1D vector in memory <=> (b) 2D logical view of same data; (c) transposed 2D array moves interacting elements into the same slot of different vectors <=> (d) new 1D layout after transformation]
• Boundaries need special handling (details in CC 2011 paper)
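The four panels can be sketched as an index remapping. The code below views y as a V x (n/V) row-major matrix and stores its transpose, so that y[j-1], y[j], and y[j+1] end up in the same lane of three consecutive vectors. The function names, the vector length V = 4, and the 4 x 6 view of 24 elements are assumptions for illustration. The boundary columns, which the slide says need special handling, are skipped here.

```c
#include <assert.h>

/* Hypothetical helper: view y[0:n) as a V x (n/V) row-major matrix
   and write its transpose to yt (new 1D layout). */
void dlt(int n, int V, const float *y, float *yt)
{
    assert(V > 0 && n % V == 0);
    int M = n / V;                       /* columns of the 2D logical view */
    for (int r = 0; r < V; r++)
        for (int c = 0; c < M; c++)
            yt[c*V + r] = y[r*M + c];
}

/* z[j] = y[j-1] + y[j] + y[j+1] computed on transposed data.
   Only interior columns 1..M-2 are handled; boundary columns are left
   untouched because they need neighbors from a different lane. */
void stencil3_dlt(int n, int V, const float *yt, float *zt)
{
    assert(V > 0 && n % V == 0);
    int M = n / V;
    for (int c = 1; c < M - 1; c++)
        for (int r = 0; r < V; r++)      /* r indexes lanes of one vector */
            zt[c*V + r] = yt[(c-1)*V + r] + yt[c*V + r] + yt[(c+1)*V + r];
}
```

After the transformation, the inner loop over r touches V contiguous elements from each of three aligned vectors, so a vectorized version would need no overlapping loads or shuffles for interior columns.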

Performance: Stencil Rate

Summary
• High-order stencils have high operational intensity: they are not limited by memory bandwidth, but face a different challenge: excessive register loads
• Solution via associative reordering of accumulation operations to change the "stencil" pattern
• Details in PLDI'14 paper
• Ongoing/Future Work
  – Incorporation into DSL for AMR Shift Calculus (Colella)
