Iterator-Based Optimization of Imperfectly-Nested Loops

DANIEL FESHBACH, MARY GLASER, MICHELLE STROUT, DAVID WONNACOTT

Overview
▶ Motivation: Approaches to Performance Tuning
▶ Quick overview of Polyhedral Model
▶ Quick review of Chapel Iterators
▶ Detailed Discussion of Deriche Image Processing Example
▶ Details of Nussinov are in paper (and past work)
▶ Details of FFT may be in future paper (we hope)

Basic Approaches to Code Optimization
▶ Performance tuning of compute-intensive code...
▶ Compiler-writer: this is a compiler problem, fix compiler
    ▶ Replace % operation with bit-mask, or hoist out of loop
    ▶ Perform loop tiling to improve memory performance
    ▶ Perform loop skewing to ensure loop tiling is legal
    ▶ Also introduce vector instructions
    ▶ Then, update compiler to tile for multicore systems
    ▶ Then, write another compiler for distributed memory
    ▶ Then, write another compiler for GPGPU's

    // Example (benchmark, simplified Physics)
    // iterative Jacobi stencil
    // Repeatedly update each A[i,j], based on
    // previous values of it and neighbors
    for t in 0..T-1 do
      for x in 1..N-2 do
        for y in 1..N-2 do
          A[(t+1)%2, x, y] =
            (A[t%2,x-1,y] + A[t%2,x,y-1] +
             A[t%2,x  ,y] + A[t%2,x,y+1] +
             A[t%2,x+1,y]) / 5;
    // note: t%2 stores two time steps
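
To make the "skewing makes tiling legal" bullet concrete, here is a minimal sketch in Python rather than the Chapel of the slides. It uses a simplified 1D, 3-point Jacobi stencil and plain rectangular tiles over a skewed space coordinate s = x + t. The function names, the tile sizes, and the 1D simplification are all assumptions for illustration and do not come from the presentation.

```python
def jacobi_1d(a0, T):
    """Reference: T sweeps of a 3-point Jacobi stencil using two time-step buffers."""
    n = len(a0)
    A = [list(a0), list(a0)]  # A[t % 2] holds time step t; boundary cells stay fixed
    for t in range(T):
        for x in range(1, n - 1):
            A[(t + 1) % 2][x] = (A[t % 2][x - 1] + A[t % 2][x] + A[t % 2][x + 1]) / 3
    return A[T % 2]


def jacobi_1d_skewed_tiled(a0, T, tile_t=4, tile_s=8):
    """Same computation, tiled after skewing the space loop by t (s = x + t).

    In (t, x) coordinates the dependences have distances (1, -1), (1, 0), (1, 1),
    so rectangular tiles would be illegal. In (t, s) coordinates the distances
    become (1, 0), (1, 1), (1, 2): no component is negative, so tiles can run
    in lexicographic order.
    """
    n = len(a0)
    A = [list(a0), list(a0)]
    s_max = (n - 2) + (T - 1)  # largest skewed coordinate that is ever used
    for tt in range(0, T, tile_t):              # tile origin along time
        for ss in range(1, s_max + 1, tile_s):  # tile origin along skewed space
            for t in range(tt, min(tt + tile_t, T)):
                lo = max(ss, 1 + t)                   # clip tile to x >= 1
                hi = min(ss + tile_s - 1, n - 2 + t)  # clip tile to x <= n-2
                for s in range(lo, hi + 1):
                    x = s - t  # undo the skew to get the array index
                    A[(t + 1) % 2][x] = (
                        A[t % 2][x - 1] + A[t % 2][x] + A[t % 2][x + 1]) / 3
    return A[T % 2]
```

Each point performs the same arithmetic in both versions, so the results should match exactly. Only the order in which the points are visited changes.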

Basic Approaches to Code Optimization
▶ Performance tuning of compute-intensive code...
▶ Compiler-writer: this is a compiler problem, fix compiler
▶ Physicist: this is a coding problem, give to grad student

    // Example ( actual code is more complex )
    // iterative Jacobi stencil
    // Repeatedly update each A[i,j], based on
    // previous values of it and neighbors
    for t in 0..T-1 do
      for x in 1..N-2 do
        for y in 1..N-2 do
          A[(t+1)%2, x, y] =
            (A[t%2,x-1,y] + A[t%2,x,y-1] +
             A[t%2,x  ,y] + A[t%2,x,y+1] +
             A[t%2,x+1,y]) / 5;
    // note: t%2 stores two time steps

Basic Approaches to Code Optimization
▶ Performance tuning of compute-intensive code...
▶ Compiler-writer: this is a compiler problem, fix compiler
▶ Physicist: this is a coding problem, give to grad student
    ▶ Grad student replaces or hoists %

    // Example ( actual code is more complex )
    // iterative Jacobi stencil
    // Repeatedly update each A[x,y], based on
    // previous values of it and neighbors
    for t in 0..T-1 do
      for x in 1..N-2 do
        for y in 1..N-2 do
          // t&1 == t%2 and 1-(t&1) == (t+1)%2 for t >= 0
          A[1-(t&1), x, y] =
            (A[t&1,x-1,y] + A[t&1,x,y-1] +
             A[t&1,x  ,y] + A[t&1,x,y+1] +
             A[t&1,x+1,y]) / 5;
    // note: t&1 selects between the two stored time steps
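
The "replaces or hoists %" step can be sketched as follows, again in Python rather than Chapel. The baseline recomputes the buffer parity in the innermost loop; the second version computes it once per time step with a bit-mask. The function names and the list-of-lists grid layout are assumptions made for this illustration, and the grid is assumed to be square.

```python
def jacobi_2d(grid, T):
    """Baseline: buffer parity recomputed with % inside the innermost loop."""
    n = len(grid)  # assumes a square n x n grid
    A = [[row[:] for row in grid], [row[:] for row in grid]]
    for t in range(T):
        for x in range(1, n - 1):
            for y in range(1, n - 1):
                A[(t + 1) % 2][x][y] = (
                    A[t % 2][x - 1][y] + A[t % 2][x][y - 1] +
                    A[t % 2][x][y] + A[t % 2][x][y + 1] +
                    A[t % 2][x + 1][y]) / 5
    return A[T % 2]


def jacobi_2d_hoisted(grid, T):
    """Same stencil; parity is computed once per time step with a bit-mask."""
    n = len(grid)
    A = [[row[:] for row in grid], [row[:] for row in grid]]
    for t in range(T):
        read = t & 1      # equals t % 2 for t >= 0
        write = 1 - read  # equals (t + 1) % 2
        src, dst = A[read], A[write]
        for x in range(1, n - 1):
            for y in range(1, n - 1):
                dst[x][y] = (src[x - 1][y] + src[x][y - 1] +
                             src[x][y] + src[x][y + 1] +
                             src[x + 1][y]) / 5
    return A[T & 1]
```

Both functions add the five neighbours in the same order, so they should return identical grids for any non-negative T.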

Basic Approaches to Code Optimization
▶ Performance tuning of compute-intensive code...
▶ Compiler-writer: this is a compiler problem, fix compiler
▶ Physicist: this is a coding problem, give to grad student
    ▶ Grad student replaces or hoists %
    ▶ Grad student may have heard of loop tiling, may try it
    ▶ Grad student spends nights reading about vectorization

    // Example ( actual code is more complex )
    // iterative Jacobi stencil
    // Repeatedly update each A[i,j], based on
    // previous values of it and neighbors
    // Loop over tile wavefronts.
    for kt in ceild(3,tau) .. floord(3*T,tau) {
      // The next two loops iterate within a tile wavefront.
      // Assumes a square iteration space.
      var k1_lb: int = floord(3*L+2+(kt-2)*tau, tau*3);
      var k1_ub: int = floord(3*U+(kt+2)*tau-2, tau*3);
      var k2_lb: int = floord((2*kt-2)*tau-3*U+2, tau*3);
      var k2_ub: int = floord((2+2*kt)*tau-3*L-2, tau*3);
      // Loops over tile coordinates within a parallel wavefront of tiles.
      forall k1 in k1_lb .. k1_ub {
        for x in k2_lb .. k2_ub {
          var k2 = x-k1;
          // Loop over time within a tile.
          for t in max(1,floord(kt*tau,3)) .. min(T,floord((3+kt)*tau-3,3)) {
            var write = t & 1;     // equivalent to t mod 2
            var read = 1 - write;
            // Loops over the spatial dimensions within each tile.
            for i in max(L,max((kt-k1-k2)*tau-t, 2*t-(2+k1+k2)*tau+2)) ..
                     min(U,min((1+kt-k1-k2)*tau-t-1, 2*t-(k1+k2)*tau)) {
              for j in max(L,max(tau*k1-t,t-i-(1+k2)*tau+1)) ..
                       min(U,min((1+k1)*tau-t-1,t-i-k2*tau)) {
                // i and j are the spatial indices here (x is the tile index k1+k2)
                A[write, i, j] =
                  (A[read,i-1,j] + A[read,i,j-1] +
                   A[read,i  ,j] + A[read,i,j+1] +
                   A[read,i+1,j]) / 5;
              }
            }
          }
        }
      }
    }
    // note: indexing by t mod 2 stores only two time steps
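
The tiled code calls `ceild` and `floord` without defining them. A minimal Python sketch of what they probably compute follows. It assumes the usual convention in generated polyhedral loop code: integer division rounded toward positive or negative infinity, with a positive divisor. The `wavefront_range` helper is hypothetical and only mirrors the bounds of the outermost `kt` loop above.

```python
def floord(a, b):
    """Floor of a / b for integers with b > 0 (rounds toward negative infinity)."""
    return a // b  # Python's // already floors, also for negative a


def ceild(a, b):
    """Ceiling of a / b for integers with b > 0 (rounds toward positive infinity)."""
    return -((-a) // b)


def wavefront_range(T, tau):
    """Hypothetical helper: the kt values visited by the outermost tiled loop."""
    return range(ceild(3, tau), floord(3 * T, tau) + 1)
```

The rounding direction matters because the tile bounds go negative: rounding toward zero, as integer division does in C, would put some iterations in the wrong tile or drop them.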
