SLIDE 1

Automatic Parallelization: Parallelism and Tiling

Roshan Dathathri

Department of Computer Science and Automation Indian Institute of Science roshan@csa.iisc.ernet.in

June 25, 2013

Roshan Dathathri (CSA, IISc) Parallelism and Tiling June 25, 2013 1 / 30

SLIDE 5

Goals of program transformations/optimizations

Increase performance:
- Execute less code, e.g., Loop Invariant Code Motion
- Execute more efficient code, e.g., Algebraic Reassociation
- Utilize memory efficiently, e.g., Loop Tiling
- Parallelize execution

Reduce memory footprint
Reduce energy usage

Today: Source code transformations
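One of the optimizations named above, Loop Invariant Code Motion, can be sketched in a few lines. This is my own illustration (the function names are hypothetical, not from the slides): an expression that does not change across iterations is hoisted out of the loop.

```c
#include <assert.h>

/* Before LICM: scale / n is recomputed on every iteration even though
 * neither scale nor n changes inside the loop. */
void scale_naive(double *a, int n, double scale) {
    for (int i = 0; i < n; i++)
        a[i] *= scale / n;          /* loop-invariant expression */
}

/* After LICM: the invariant expression is computed once, outside the loop. */
void scale_hoisted(double *a, int n, double scale) {
    double factor = scale / n;      /* hoisted out of the loop */
    for (int i = 0; i < n; i++)
        a[i] *= factor;
}

/* Helper for checking: returns 1 if both versions produce the same result. */
int licm_versions_agree(void) {
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double b[4] = {1.0, 2.0, 3.0, 4.0};
    scale_naive(a, 4, 2.0);
    scale_hoisted(b, 4, 2.0);
    for (int i = 0; i < 4; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}
```

The transformed version performs one division instead of n, without changing the result.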

SLIDE 6

Memory Hierarchy

SLIDE 7

Data Locality

Same memory location or related memory locations being frequently accessed. Different classes of locality:

- Spatial locality
- Temporal locality
- Group locality

SLIDE 9

Spatial locality

Elements close by (in space/memory) tend to be referenced soon, e.g., c[i][j] in the code below:

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

The innermost dimension of the array should vary the fastest, by a constant stride
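To make the traversal-order point concrete, here is a sketch of my own (with an illustrative size N = 8) of the same multiplication in two loop orders. In row-major C arrays, the i-k-j order walks both c[i][j] and b[k][j] along rows with stride 1 in the innermost loop, while the i-j-k order strides down a column of b.

```c
#include <assert.h>

#define N 8   /* small illustrative size */

/* i-j-k order: in the innermost k loop, b[k][j] strides down a column,
 * touching a new cache line on almost every access. */
void matmul_ijk(double c[N][N], double a[N][N], double b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* i-k-j order: the innermost j loop walks c[i][j] and b[k][j] along rows,
 * stride 1 in row-major layout, so spatial reuse is exploited. */
void matmul_ikj(double c[N][N], double a[N][N], double b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Helper for checking: both orders add the same terms in the same k order
 * per element, so the results are bitwise identical. */
int matmul_orders_agree(void) {
    double a[N][N], b[N][N], c1[N][N], c2[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (double)(i + j);
            b[i][j] = (double)(i - j);
            c1[i][j] = 0.0;
            c2[i][j] = 0.0;
        }
    matmul_ijk(c1, a, b);
    matmul_ikj(c2, a, b);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (c1[i][j] != c2[i][j]) return 0;
    return 1;
}
```

Both orders compute the same product; only the memory-access pattern differs.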

SLIDE 10

Which code exploits spatial reuse of c[i][j] better?

Snippet 1:

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Snippet 2:

for (k = 0; k < N; k++) {
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Table: Matrix multiplication code

SLIDE 12

Temporal locality

Same element tends to be referenced soon, e.g., c[i][j] in the code below:

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Temporal reuse exists when the rank of an access function is less than the dimensionality of the loop nest
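The rank condition can be spelled out for c[i][j] in this three-deep nest (a worked step, not on the slide): the access function maps (i, j, k) to (i, j), so its matrix has rank 2, less than the nest dimensionality 3.

```latex
F_c\begin{pmatrix} i \\ j \\ k \end{pmatrix}
  = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}
    \begin{pmatrix} i \\ j \\ k \end{pmatrix}
  = \begin{pmatrix} i \\ j \end{pmatrix},
\qquad \operatorname{rank}(F_c) = 2 < 3
```

The null space of F_c is spanned by (0, 0, 1): iterations that differ only in k touch the same element of c, so the k loop carries the temporal reuse.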

SLIDE 13

Which code exploits temporal reuse of c[i][j] better?

Snippet 1:

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Snippet 2:

for (k = 0; k < N; k++) {
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Table: Matrix multiplication code

SLIDE 14

Group locality

Multiple accesses of the same array tend to reference the same element soon, e.g., a[i+1], a[i], a[i-1] in the code below:

for (t = 0; t < T-1; t++) {
    for (i = 1; i < N+1; i++) {
        temp[i] = 0.125 * (a[i+1] - 2.0 * a[i] + a[i-1]);
    }
    for (i = 1; i < N+1; i++) {
        a[i] = temp[i];
    }
}

SLIDE 15

Loop Tiling/Blocking

- Executing the iteration space in blocks, block after block
- The most important of all loop transformations
- Crucial for locality and parallelism

SLIDE 16

Example – Tiling

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Original code

Figure: Locality in i, j, k dimensions

SLIDE 17

Example – Tiling

// inter-tile iterators
for (iT = 0; iT < N; iT += B) {
    for (jT = 0; jT < N; jT += B) {
        for (kT = 0; kT < N; kT += B) {
            // intra-tile iterators
            for (i = iT; i < iT + B; i++) {
                for (j = jT; j < jT + B; j++) {
                    for (k = kT; k < kT + B; k++) {
                        c[i][j] += a[i][k] * b[k][j];
                    }
                }
            }
        }
    }
}

Tiled code with tile size B × B × B
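The tiled code above assumes B divides N evenly. A variant of my own (using a flat array and a MIN macro I introduce here, not from the slides) clamps each intra-tile loop to the array bound, so ragged edge tiles stay correct for any B:

```c
#include <assert.h>

#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* Tiled matmul that stays correct when the tile size bsize does not
 * divide n: each intra-tile loop is clamped to the array bound. */
void matmul_tiled(int n, int bsize, double *c, const double *a, const double *b) {
    for (int iT = 0; iT < n; iT += bsize)
        for (int jT = 0; jT < n; jT += bsize)
            for (int kT = 0; kT < n; kT += bsize)
                for (int i = iT; i < MIN(iT + bsize, n); i++)
                    for (int j = jT; j < MIN(jT + bsize, n); j++)
                        for (int k = kT; k < MIN(kT + bsize, n); k++)
                            c[i*n + j] += a[i*n + k] * b[k*n + j];
}

/* Helper for checking: tiled and untiled versions add the same terms in
 * the same k order per element, so results match exactly. n = 5, B = 2
 * deliberately exercises the ragged-edge case. */
int tiled_matches_untiled(void) {
    enum { n = 5 };
    double a[n*n], b[n*n], c1[n*n], c2[n*n];
    for (int i = 0; i < n*n; i++) {
        a[i] = (double)(i % 7);
        b[i] = (double)(i % 3) - 1.0;
        c1[i] = 0.0;
        c2[i] = 0.0;
    }
    for (int i = 0; i < n; i++)        /* untiled reference */
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c1[i*n + j] += a[i*n + k] * b[k*n + j];
    matmul_tiled(n, 2, c2, a, b);
    for (int i = 0; i < n*n; i++)
        if (c1[i] != c2[i]) return 0;
    return 1;
}
```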

Figure: Exploiting reuse in i, j, k dimensions (tile boundaries marked)

SLIDE 18

Tiling for Data Locality

- Tiling for caches: data touched by a tile should fit in faster memory
- Improves data reuse: allows reuse in multiple directions
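As a back-of-the-envelope sizing (my numbers, not from the slides): for the B × B × B matrix-multiply tile, roughly three B × B blocks of doubles (one each of a, b, c) are live at once, so for an assumed 32 KB L1 cache:

```latex
3 \cdot B^2 \cdot 8 \;\le\; 32 \times 1024
\;\Longrightarrow\; B^2 \le 1365
\;\Longrightarrow\; B \le 36
```

Under these assumptions a power-of-two tile size such as B = 32 would be a natural choice.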

SLIDE 19

Validity of Tiling

- A tile is a piece of computation that can execute atomically, in its entirety
- It should be possible to construct a total order on the set of all tiles

SLIDE 20

Example – Validity of Tiling

Figure: Dependences (1,0), (1,1), (1,-1) in the (t, i) iteration space

for (t = 0; t < T; t++) {
    for (i = 2; i < N-1; i++) {
        a[t][i] += 0.333 * (a[t-1][i] + a[t-1][i-1] + a[t-1][i+1]);
    }
}

Original code

SLIDE 22

Example – Validity of Tiling

Figure: Dependences (1,0), (1,1), (1,-1) in the (t, i) iteration space

for (t1 = 0; t1 <= T-1; t1++) {
    for (t2 = t1+2; t2 <= t1+N-2; t2++) {
        a[t1][-t1+t2] += 0.333 * (a[t1-1][-t1+t2] + a[t1-1][-t1+t2-1] + a[t1-1][-t1+t2+1]);
    }
}

Skewed code
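The skewing that produces this code can be written out as a worked step consistent with the loop bounds above: t2 = t + i, and applying the transformation to the three dependences makes every component non-negative, so rectangular tiling of (t1, t2) becomes valid.

```latex
\begin{pmatrix} t_1 \\ t_2 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}
  \begin{pmatrix} t \\ i \end{pmatrix},
\qquad
(1,0) \mapsto (1,1), \quad
(1,1) \mapsto (1,2), \quad
(1,-1) \mapsto (1,0)
```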

SLIDE 27

Validity of Tiling

With distance vectors and tiling along the original dimensions, all dependence components along the dimensions being tiled should be non-negative.

With dependence polyhedron D (distance vectors as columns), valid tiling hyperplanes h satisfy h · D ≥ 0. For the dependences (1,0), (1,1), (1,-1), the hyperplanes (1,0) and (1,1) are valid:

\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}
\cdot
\begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & -1 \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 0 \end{pmatrix}

Consider dependences (1,0,1), (1,-2,0), (0,1,0), (0,0,1): what kind of tiling is valid? The hyperplanes (1,0,0), (2,1,0), (0,0,1) satisfy the condition:

\begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\cdot
\begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & -2 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 0 & 0 \\ 2 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}

Every entry of each product is non-negative, so the tiling is valid.
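The validity check is mechanical; here is a small sketch of my own that tests h · d ≥ 0 against the dependences (1,0,1), (1,-2,0), (0,1,0), (0,0,1) from this slide (the function names are mine):

```c
#include <assert.h>

/* A tiling hyperplane h is valid iff h . d >= 0 for every dependence
 * distance vector d, i.e., no dependence component along h is negative. */
int hyperplane_is_valid(const int h[3], const int deps[][3], int ndeps) {
    for (int d = 0; d < ndeps; d++) {
        int dot = 0;
        for (int k = 0; k < 3; k++)
            dot += h[k] * deps[d][k];
        if (dot < 0) return 0;
    }
    return 1;
}

/* The dependences from the slide: (1,0,1), (1,-2,0), (0,1,0), (0,0,1). */
static const int slide_deps[4][3] = {
    {1, 0, 1}, {1, -2, 0}, {0, 1, 0}, {0, 0, 1}
};

/* Convenience wrapper over the slide's dependences. */
int check_h(int a, int b, int c) {
    const int h[3] = {a, b, c};
    return hyperplane_is_valid(h, slide_deps, 4);
}
```

For example, (0,1,0) is rejected because its dot product with (1,-2,0) is -2.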

SLIDE 28

Example

Figure: No 2-D tiling possible (dependences in the (i, j) iteration space)

SLIDE 29

Different kinds of parallelism

- Outer parallelism / communication-free parallelism
- Inner parallelism
- Pipelined parallelism
- Reduction parallelism
- SIMD (Single Instruction Multiple Data) parallelism, or vectorization
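Two of these kinds can be sketched with OpenMP-style pragmas (my illustration, not from the slides; the pragmas are no-ops when OpenMP is disabled, so the code also runs sequentially):

```c
#include <assert.h>

/* Outer parallelism: iterations of the outermost loop are independent,
 * so they can be distributed across threads with no communication. */
void outer_parallel_scale(int n, double *out, const double *in) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = 2.0 * in[i];
}

/* Reduction parallelism: partial sums are computed in parallel and
 * combined, exploiting the associativity of + (up to FP rounding). */
double reduction_sum(int n, const double *in) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += in[i];
    return s;
}

/* Helper for checking: returns 1 if both sketches behave as expected. */
int parallelism_demo_ok(void) {
    double in[4] = {1.0, 2.0, 3.0, 4.0};
    double out[4];
    outer_parallel_scale(4, out, in);
    if (out[0] != 2.0 || out[3] != 8.0) return 0;
    if (reduction_sum(4, in) != 10.0) return 0;
    return 1;
}
```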

SLIDE 32

Outer parallelism (loops)

Figure: Outer parallel loop i

SLIDE 34

Inner parallelism (loops)

Figure: Inner parallel loop j

SLIDE 36

Pipelined parallelism (loops)

Figure: Pipelined parallel loop

SLIDE 37

Coarse-grained pipelined parallelism

Figure: Tiling an iteration space (axes φ1, φ2; instances of statements S1 and S2, with one tile marked)

SLIDE 39

Tiling for Parallelism

- Achieves coarse-grained parallelization
- Reduces the frequency of synchronization, improving the computation-to-communication ratio
- Can reduce the volume of communication

How does the tile size affect parallelism?
- Larger tiles -> less frequent synchronization, but more load imbalance
- Smaller tiles -> more frequent synchronization, but less load imbalance
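The ratio claim can be quantified with a rough model (my assumption of a cubic tile in a 3-d iteration space, not from the slides): work grows with the tile's volume while communicated data grows with its surface, so

```latex
\frac{\text{computation}}{\text{communication}}
\;\sim\; \frac{B^3}{B^2} \;=\; B
```

which is why larger tiles synchronize and communicate less often, at the cost of coarser load balancing.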

SLIDE 41

Tiling is directly related to parallelism

Figure: Dependences (1,0), (1,1), (1,-1) in the (t, i) iteration space

Tiling is valid -> Parallelism (at least pipelined parallelism) exists

SLIDE 43

References

- PLUTO - An automatic parallelizer and locality optimizer for multicores: http://pluto-compiler.sourceforge.net/
- PoCC - The Polyhedral Compiler Collection: http://www.cse.ohio-state.edu/~pouchet/software/pocc/

SLIDE 44

Complexity of architectures and input code

Figure: Two axes of difficulty. Architecture complexity: general-purpose multicores, clusters, distributed-memory machines, GPUs, heterogeneous platforms. Code complexity: dense matrices with regular control flow, sparse codes, and irregular control flow over graphs, trees, hash tables, sets, and lists.
