CS553 Lecture: Tiling

Tiling: A Data Locality Optimizing Algorithm

Previously
– Kelly & Pugh transformation framework
– Affine space partitions for parallelism

Today
– “Unroll and Jam” and Tiling
– Specifying tiling in the Kelly and Pugh transformation framework
– Status of code generation for tiling

Loop Unrolling

Motivation
– Reduces loop overhead
– Improves effectiveness of other transformations
  – Code scheduling
  – CSE

The Transformation
– Make n copies of the loop: n is the unrolling factor
– Adjust loop bounds accordingly

Loop Unrolling (cont)

Example

Original loop:
    do i=1,n
      A(i) = B(i) + C(i)
    enddo

Unrolled by 2:
    do i=1,n-1 by 2
      A(i) = B(i) + C(i)
      A(i+1) = B(i+1) + C(i+1)
    enddo
    if (i=n)
      A(i) = B(i) + C(i)

Details
– When is loop unrolling legal?
– Handle end cases with a cloned copy of the loop
  – Enter this special case if the remaining number of iterations is less than the unrolling factor

Loop Balance

Problem
– We’d like to produce loops with the right balance of memory operations and floating point operations
– The ideal balance is machine-dependent
  – e.g. How many load-store units are connected to the L1 cache?
  – e.g. How many functional units are provided?

Example
    do j = 1,2*n
      do i = 1,m
        A(j) = A(j) + B(i)
      enddo
    enddo

– The inner loop has 1 memory operation per iteration and 1 floating point operation per iteration
– If our target machine can only support 1 memory operation for every two floating point operations, this loop will be memory bound

What can we do?
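The unrolling example and its end case can be written in C roughly as follows. This is a minimal sketch rather than code from the lecture: the function names add_rolled and add_unrolled2 are hypothetical, the arrays are 0-based instead of 1-based, and the unrolling factor is fixed at 2.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop: one element per iteration. */
void add_rolled(double *a, const double *b, const double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Same computation unrolled by 2 (hypothetical name).
 * The main loop runs while two full iterations remain; the
 * trailing "if" is the cloned end case for an odd n. */
void add_unrolled2(double *a, const double *b, const double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
    }
    if (i < n)                  /* fewer iterations left than the unrolling factor */
        a[i] = b[i] + c[i];
}
```

With an unrolling factor larger than 2, the end case would become a short remainder loop instead of a single "if".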

Unroll and Jam

Idea
– Restructure loops so that loaded values are used many times per iteration

Unroll and Jam
– Unroll the outer loop some number of times
– Fuse (Jam) the resulting inner loops

Example

Original loop:
    do j = 1,2*n
      do i = 1,m
        A(j) = A(j) + B(i)
      enddo
    enddo

Unroll the Outer Loop:
    do j = 1,2*n by 2
      do i = 1,m
        A(j) = A(j) + B(i)
      enddo
      do i = 1,m
        A(j+1) = A(j+1) + B(i)
      enddo
    enddo

Unroll and Jam Example (cont)

Unroll the Outer Loop:
    do j = 1,2*n by 2
      do i = 1,m
        A(j) = A(j) + B(i)
      enddo
      do i = 1,m
        A(j+1) = A(j+1) + B(i)
      enddo
    enddo

Jam the inner loops:
    do j = 1,2*n by 2
      do i = 1,m
        A(j) = A(j) + B(i)
        A(j+1) = A(j+1) + B(i)
      enddo
    enddo

– The inner loop has 1 load per iteration and 2 floating point operations per iteration
– We reuse the loaded value of B(i)
– The Loop Balance matches the machine balance

Unroll and Jam (cont)

Legality
– When is Unroll and Jam legal?

Disadvantages
– What limits the degree of unrolling?

Tiling

A non-unimodular transformation that ...
– groups iteration points into tiles that are executed atomically
– can improve spatial and temporal data locality
– can expose larger granularities of parallelism

Implementing tiling
– how can we specify tiling?
– when is tiling legal?
– how do we generate tiled code?

[Figure: iteration space, axes i and j]

    do ii = 1,6 by 2
      do jj = 1,5 by 2
        do i = ii, ii+2-1
          do j = jj, min(jj+2-1,5)
            A(i,j) = ...
          enddo
        enddo
      enddo
    enddo
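The unroll-and-jam example can be sketched in C as below. The function names accumulate and accumulate_unroll_jam are hypothetical, the arrays are 0-based, and the sketch assumes that a (length 2*n) and b (length m) do not overlap. Whether a[j] and a[j + 1] actually stay in registers across the inner loop, as the slide's count of one load per iteration implies, depends on the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop nest: one load of b[i] feeds one addition. */
void accumulate(double *a, const double *b, size_t n, size_t m)
{
    assert(a != NULL && b != NULL);
    for (size_t j = 0; j < 2 * n; j++)
        for (size_t i = 0; i < m; i++)
            a[j] = a[j] + b[i];
}

/* Outer loop unrolled by 2, inner loops jammed (fused).
 * Assumes a and b do not alias. Each b[i] is loaded once and
 * used for two additions. The trip count 2*n is even, so no
 * end case is needed. */
void accumulate_unroll_jam(double *a, const double *b, size_t n, size_t m)
{
    assert(a != NULL && b != NULL);
    for (size_t j = 0; j < 2 * n; j += 2)
        for (size_t i = 0; i < m; i++) {
            double bi = b[i];          /* loaded once, reused below */
            a[j]     = a[j]     + bi;
            a[j + 1] = a[j + 1] + bi;
        }
}
```

Each a[j] still receives the values of b in the same order as before, so under these assumptions both versions produce identical floating point results.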

Specifying Tiling

Rectangular tiling
– tile size vector
– tile offset

Possible Transformation Mappings
– creating a tile space
– keeping tile iterators in original iteration space

[Figures: iteration space with axes i and j; tile space with axes i’ and j’]

Legality of Tiling

A legal rectangular tiling
– each tile executed atomically
– no dependence cycles between tiles
– Check legality by verifying that transformed data dependences are lexicographically positive

Fully permutable loops
– rectangular tiling is legal on fully permutable loops

Code Generation for Tiling

[Figure: iteration space, axes i and j]

    do ii = 1,6 by 2
      do jj = 1,5 by 2
        do i = ii, ii+2-1
          do j = jj, min(jj+2-1,5)
            A(i,j) = ...
          enddo
        enddo
      enddo
    enddo

Fixed-size Tiles
– Omega library
– Cloog
– for rectangular space and tiles, straightforward

Parameterized tile sizes
– Parameterized tiled loops for free, PLDI 2007
– TLOG - A Tiled Loop Generator, http://www.cs.colostate.edu/~ln/TLOG/

Overview of decoupled approach
– find polyhedron that may contain any loop origins
– generate code that traverses that polyhedron
– post process the code to start at tile origins and step by tile size
– generate loops over points in tile to stay within original iteration space and within tile

Unroll and Jam IS Tiling (followed by inner loop unrolling)

Original Loop:
    do j = 1,2*n
      do i = 1,m
        A(j) = A(j) + B(i)
      enddo
    enddo

After Tiling:
    do jj = 1,2*n by 2
      do i = 1,m
        do j = jj, jj+2-1
          A(j) = A(j) + B(i)
        enddo
      enddo
    enddo

After Unroll and Jam:
    do jj = 1,2*n by 2
      do i = 1,m
        A(jj) = A(jj) + B(i)
        A(jj+1) = A(jj+1) + B(i)
      enddo
    enddo
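For a rectangular iteration space and rectangular tiles, the tiled loop nest can be written directly. The C sketch below generalizes the 6 x 5 example with 2 x 2 tiles to arbitrary extents and tile sizes. It is an illustration only, not output of Omega, Cloog, or TLOG: the names visit_tiled and min_sz are hypothetical, indices are 0-based, and the loop body just records the order in which points are visited.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Visit an ni x nj iteration space in ti x tj tiles (hypothetical helper).
 * order[i * nj + j] receives the position of point (i, j) in the tiled
 * execution order; the return value is the number of points visited.
 * The min() bounds clip partial tiles at the edge of the space. */
long visit_tiled(long *order, size_t ni, size_t nj, size_t ti, size_t tj)
{
    assert(order != NULL && ti > 0 && tj > 0);
    long step = 0;
    for (size_t ii = 0; ii < ni; ii += ti)              /* tile origins in i */
        for (size_t jj = 0; jj < nj; jj += tj)          /* tile origins in j */
            for (size_t i = ii; i < min_sz(ii + ti, ni); i++)
                for (size_t j = jj; j < min_sz(jj + tj, nj); j++)
                    order[i * nj + j] = step++;
    return step;
}
```

Calling visit_tiled(order, 6, 5, 2, 2) corresponds to the example loop nest: all four points of the first tile are numbered before any point of the next tile, and the last column of tiles is only one point wide because of the min bound.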

Concepts

Unroll and Jam is the same as Tiling with the inner loop unrolled

Tiling can improve ...
– loop balance
– spatial locality
– data locality
– computation to communication ratio

Implementing tiling
– specification
– checking legality
– code generation

Next Time

Lecture
– Run-time reordering transformations

Suggested Exercises
– After array expansion of the scalar T, is it legal to tile the three loops in Figure 11.23? Write the tiled code for a block size of your choice.
