
Stencil Codes in One Shot (spcl.inf.ethz.ch, @spcl_eth)

Tobias Gysi, Tobias Grosser, and Torsten Hoefler. Absinthe: Learning an Analytical Performance Model to Fuse and Tile Stencil Codes in One Shot.


  1. Absinthe: Learning an Analytical Performance Model to Fuse and Tile Stencil Codes in One Shot
     Tobias Gysi, Tobias Grosser, and Torsten Hoefler

  2. COSMO Atmospheric Model
     • Regional atmospheric model used by 7 national weather services
     • Implements many different stencil programs

  3. Optimizing the Fastwaves Kernel from the COSMO Atmospheric Model
     [Figure: model prediction [ms] plotted against measured execution time [ms] for the nine stencils (0 to 8) of the kernel. Variant labels: unfused, tiled, absinthe, auto-tuning. Tile-size labels: 64x64x1, 64x4x1, 64x4x3, 64x4x4, 64x4x5. Annotated times [ms]: 0.58, 0.62, 0.67, 0.73, 0.94, 1.08. The auto-tuning variant is annotated with -6.5%.]
     Reference: Michael Baldauf, Axel Seifert, Jochen Förstner, Detlev Majewski, Matthias Raschendorfer, and Thorsten Reinhardt, Operational Convective-Scale Numerical Weather Prediction with the COSMO Model: Description and Sensitivities. 2011.

  4. Stencil Programs Execute Multiple Stencils in Sequence

         for (int y = ybeg; y < yend; y++)
           for (int x = xbeg; x < xend; x++)
             A(x,y) = I(x,y) + I(x-1,y) + I(x+1,y);

         for (int y = ybeg; y < yend; y++)
           for (int x = xbeg; x < xend; x++)
             B(x,y) = A(x,y+1) + A(x,y);

     • element-wise computation
     • position independent access pattern
     Memory accesses: 1 load / 1 store for the first stencil, 2 loads / 1 store for the second.
     [Figure: the iteration domain in the x/y plane, bounded by xbeg, xend, ybeg, and yend.]

  5. Loop Tiling and Loop Fusion

         for (int idx = 0; idx < 4; ++idx) {
           int xbeg = tiles[idx].xbeg;
           int xend = tiles[idx].xend;
           int ybeg = tiles[idx].ybeg;
           int yend = tiles[idx].yend;
           Buffer A(xbeg, xend, ybeg, yend+1);
           for (int y = ybeg; y < yend+1; ++y)
             for (int x = xbeg; x < xend; ++x)
               A(x,y) = I(x,y) + I(x-1,y) + I(x+1,y);
           for (int y = ybeg; y < yend; y++)
             for (int x = xbeg; x < xend; x++)
               B(x,y) = A(x,y+1) + A(x,y);
         }

     Memory accesses: 1 load / 0 stores for the first stencil, 0 loads / 1 store for the second.
     [Figure: the x/y domain split into four tiles, idx = 0 to idx = 3, each with its own xbeg, xend, ybeg, and yend.]
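
The fused and tiled loop nest on this slide should compute the same B as the two separate loops on the previous slide, provided the first stencil is evaluated for one extra row per tile. The following is a minimal sketch of that check, not the authors' generated code. The `Grid` type, its one-cell halo, and the helper `same_result` are hypothetical names introduced here for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical 2-D array with a one-cell halo, so x-1, x+1 and y+1 stay in bounds.
struct Grid {
    int nx, ny;
    std::vector<double> data;
    Grid(int nx_, int ny_) : nx(nx_), ny(ny_), data((nx_ + 2) * (ny_ + 2), 0.0) {}
    double& operator()(int x, int y) { return data[(y + 1) * (nx + 2) + (x + 1)]; }
};

// Unfused reference: stencil A over the whole domain, then stencil B.
Grid run_unfused(Grid& I) {
    Grid A(I.nx, I.ny), B(I.nx, I.ny);
    for (int y = 0; y < I.ny + 1; ++y)  // one extra row, read by B as A(x, y+1)
        for (int x = 0; x < I.nx; ++x)
            A(x, y) = I(x, y) + I(x - 1, y) + I(x + 1, y);
    for (int y = 0; y < I.ny; ++y)
        for (int x = 0; x < I.nx; ++x)
            B(x, y) = A(x, y + 1) + A(x, y);
    return B;
}

// Fused and tiled: per tile, A lives in a small local buffer and is never stored globally.
Grid run_fused(Grid& I, int tx, int ty) {
    assert(tx > 0 && ty > 0);
    Grid B(I.nx, I.ny);
    for (int ybeg = 0; ybeg < I.ny; ybeg += ty)
        for (int xbeg = 0; xbeg < I.nx; xbeg += tx) {
            int xend = std::min(xbeg + tx, I.nx), yend = std::min(ybeg + ty, I.ny);
            int w = xend - xbeg;
            std::vector<double> A(w * (yend - ybeg + 1));  // tile rows plus one extra row
            for (int y = ybeg; y < yend + 1; ++y)
                for (int x = xbeg; x < xend; ++x)
                    A[(y - ybeg) * w + (x - xbeg)] = I(x, y) + I(x - 1, y) + I(x + 1, y);
            for (int y = ybeg; y < yend; ++y)
                for (int x = xbeg; x < xend; ++x)
                    B(x, y) = A[(y + 1 - ybeg) * w + (x - xbeg)] + A[(y - ybeg) * w + (x - xbeg)];
        }
    return B;
}

// Fills the input (halo included) with small integers and compares both versions exactly.
bool same_result(int nx, int ny, int tx, int ty) {
    Grid I(nx, ny);
    for (int y = -1; y <= ny; ++y)
        for (int x = -1; x <= nx; ++x)
            I(x, y) = (x * 31 + y * 17 + 48) % 23;
    return run_unfused(I).data == run_fused(I, tx, ty).data;
}
```

Calling `same_result(8, 8, 4, 4)` reproduces the four-tile layout of the slide. The extra row per tile is recomputed by the neighbouring tile, which is the redundant work that fusion trades for fewer loads and stores.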

  6. Architecture Overview
     1. model learner: benchmarks the target system and produces the learned parameters of $t(p, b) = Pp + Bb$
     2. optimizer: an ILP solver selects the code transformations
     3. code generator: applies the code transformations and emits fast code

  7. Performance Model Ideas
     • execution time of innermost loop: scalar peel loops and vectorized loop body
     • memory accesses dominate the execution time: slow memory (L3 cache/DDR) and fast memory (L1 cache)

  8. Performance Model Design
     • linear cost functions for peel and body cost
         $t = Pp + Bb$
     • slow and fast memory
         $t = \max(P_1 p_1, P_2 p_2) + \max(B_1 b_1, B_2 b_2)$
     • model the entire program (stencils 0 to 8)
         $t = \sum_{i=0..8} t_i$
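
The model on this slide can be evaluated directly once the costs and access counts are known. The sketch below assumes the reading that peel cost P, peel count p, body cost B, and body count b exist once per memory level, that the slower level dominates each term, and that per-stencil times add up. The type and function names are illustrative and not taken from the Absinthe implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Cost  { double P, B; };  // learned cost per peel access / per body access
struct Count { double p, b; };  // number of peel accesses / body accesses

// One stencil, two memory levels: t = max(P1*p1, P2*p2) + max(B1*b1, B2*b2).
double stencil_time(Cost c1, Count n1, Cost c2, Count n2) {
    return std::max(c1.P * n1.p, c2.P * n2.p) + std::max(c1.B * n1.b, c2.B * n2.b);
}

// Access counts of one stencil at both memory levels (hypothetical container).
struct StencilCounts { Count level1, level2; };

// Whole program: the sum of the per-stencil estimates.
double program_time(Cost c1, Cost c2, const std::vector<StencilCounts>& stencils) {
    assert(!stencils.empty());
    double t = 0.0;
    for (const StencilCounts& s : stencils)
        t += stencil_time(c1, s.level1, c2, s.level2);
    return t;
}
```

Because the counts p and b depend on the tile sizes and fusion choices, the optimizer has to express these products and maxima with linear constraints, which is where the ILP formulation on the later slides comes in.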

  9. Evaluating the Fast Memory Model
     Example: $n^x = 2$, $n^y = 2$, stencil with 3 loads / 1 store. [Figure: tiled domain numbered 0 to 3, labelled with $D^x$, $D^y$, and $e^y$.]
     • # cache accesses
         $p_f = (3 + 1)\, n^x (D^y + e^y n^y)$
         $b_f = (3 + 1)(D^x D^y + D^x e^y n^y)$
     • estimated execution time
         $t = P_f p_f + B_f b_f$
       learn the model parameters $P_f$, $B_f$
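
A small worked instance may help with the access-count formulas. The sketch assumes the counts read as peel count p_f = (loads + stores) · n^x · (D^y + e^y · n^y) and body count b_f = (loads + stores) · D^x · (D^y + e^y · n^y), with n the number of tiles per dimension, D the domain size, and e^y the extra rows per tile in y. The domain size and halo used in the usage note are made-up values, since the slide does not state them.

```cpp
#include <cassert>

struct FastCounts { long p, b; };  // peel and body access counts for fast memory

// accesses = loads + stores of the stencil (3 + 1 on the slide).
FastCounts fast_counts(int accesses, long nx, long ny, long Dx, long Dy, long ey) {
    assert(accesses > 0 && nx > 0 && ny > 0 && Dx > 0 && Dy > 0 && ey >= 0);
    long rows = Dy + ey * ny;      // domain rows plus the extra rows of every tile in y
    return {accesses * nx * rows,  // p_f
            accesses * Dx * rows}; // b_f
}

// t = P_f * p_f + B_f * b_f with learned parameters P_f and B_f.
double estimated_time(double Pf, double Bf, FastCounts c) {
    return Pf * c.p + Bf * c.b;
}
```

With the slide's n^x = n^y = 2 and 3 + 1 accesses, and the assumed values D^x = D^y = 64 and e^y = 1, `fast_counts(4, 2, 2, 64, 64, 1)` gives p_f = 528 and b_f = 16896.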

  10. Learning the Fast Memory Model
      [Figure: three-point benchmark stencils that read the input array at offsets k-1, k, and k+1 and write the output array at k, run from fast memory for p = 12, p = 16, and p = 20; plot of execution time [ms] (0.00 to 0.10) against x (20 to 80), one curve per value of p.]
      $P_f, B_f = \operatorname{argmin}_{(P,B) \in \mathbb{R}} \sum_{r \in [0,R]} \lvert P p_r + B b_r - t_r \rvert$
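
Fitting the two parameters amounts to a regression of measured times on the peel and body counts. The sketch below is a stand-in, not the authors' learner: it uses ordinary least squares with a closed-form solution of the 2x2 normal equations, which may differ from the loss on the slide. `Sample`, `Fit`, and `fit_least_squares` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Sample { double p, b, t; };  // peel count, body count, measured time of one run
struct Fit    { double P, B; };     // fitted peel cost and body cost

// Least-squares fit of t ~ P*p + B*b (no intercept) via the normal equations.
Fit fit_least_squares(const std::vector<Sample>& runs) {
    double spp = 0, sbb = 0, spb = 0, spt = 0, sbt = 0;
    for (const Sample& s : runs) {
        spp += s.p * s.p;
        sbb += s.b * s.b;
        spb += s.p * s.b;
        spt += s.p * s.t;
        sbt += s.b * s.t;
    }
    double det = spp * sbb - spb * spb;
    assert(std::fabs(det) > 1e-12);  // needs at least two linearly independent runs
    return {(spt * sbb - spb * sbt) / det,   // P by Cramer's rule
            (spp * sbt - spb * spt) / det};  // B by Cramer's rule
}
```

Varying both the tile width and the peel length across benchmark runs, as the plot does with x and p, keeps the two columns of the regression from becoming collinear.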

  11. Linear Multiplication of Bounded Integer Variables
      • the binary product $p = xb$ given the upper bound $X$
          limit range: $0 \le p \le x$
          force result: $p - Xb \le 0$ and $p - x - Xb \ge -X$
          (for $b = 0$ this leaves $p \ge 0$, $p \le 0$; for $b = 1$ it leaves $p \ge x$, $p \le x$)
      • the integer product $p = xy$ given the upper bounds $X$ and $Y$
          binary representation: $y = \sum_{i=0}^{\lfloor \log_2(Y) \rfloor} 2^i y_i$
          sum binary products: $p = \sum_{i=0}^{\lfloor \log_2(Y) \rfloor} 2^i x y_i$
      Source: https://blog.adamfurmanek.pl/2015/09/26/ilp-part-6/
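
The linearization can be checked by brute force for small bounds. The sketch below assumes the constraints read 0 ≤ p ≤ x, p − Xb ≤ 0, and p − x − Xb ≥ −X for a binary b and 0 ≤ x ≤ X, and that an integer y is split into bits y_i so that the product becomes a weighted sum of binary products. The function names are illustrative.

```cpp
#include <cassert>

// True if (p, x, b) satisfies the linear constraints that encode p = x*b.
bool satisfies_binary_product(int p, int x, int b, int X) {
    return 0 <= p && p <= x          // limit range
        && p - X * b <= 0            // b = 0 forces p <= 0
        && p - x - X * b >= -X;      // b = 1 forces p >= x
}

// Enumerates all x in [0, X], b in {0, 1}, and candidate p values:
// the constraints must hold exactly when p equals x*b.
bool linearization_is_exact(int X) {
    assert(X >= 0);
    for (int x = 0; x <= X; ++x)
        for (int b = 0; b <= 1; ++b)
            for (int p = -1; p <= X + 1; ++p)
                if (satisfies_binary_product(p, x, b, X) != (p == x * b))
                    return false;
    return true;
}

// Integer product via the binary representation of y with upper bound Y:
// p = sum over i of 2^i * (x * y_i), where each x * y_i is a binary product.
int product_via_bits(int x, int y, int Y) {
    assert(Y >= 1 && 0 <= y && y <= Y);
    int p = 0;
    for (int i = 0; (1 << i) <= Y; ++i) {  // i = 0 .. floor(log2(Y))
        int yi = (y >> i) & 1;
        p += (1 << i) * (x * yi);
    }
    return p;
}
```

In an ILP each `x * yi` term would be replaced by an auxiliary variable constrained as in `satisfies_binary_product`, so a product of two bounded integers costs about log2(Y) extra variables.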

  12. Comparison to Auto-tuning, Heuristics, Hand-tuned, and Random Variants
      [Figure: three panels (fastwaves, diffusion, advection), each plotting measured time [ms] against estimated time [ms], with markers labelled min, max, hand, absinthe, and auto-tuning. Auto-tuning annotations in panel order: -6.5%, -0.8%, -3.4%. One max marker is annotated with 74.0%.]

  13. Comparison to Halide and Polymage
      [Figure: bar chart of execution time [ms] (0 to 20) for fastwaves, advection, and diffusion, with one bar each for Absinthe, Halide, and Polymage. Bar annotations: 1x (three times), 1.06x, 1.29x, 1.4x, 1.66x, 2.03x, 3.7x.]
      References:
      R. T. Mullapudi, A. Adams, D. Sharlet, J. Ragan-Kelley, and K. Fatahalian, Automatically scheduling Halide image processing pipelines. 2016.
      A. Jangda and U. Bondhugula, An effective fusion and tile size model for optimizing image processing pipelines. 2018.

  14. Conclusions
      • loop fusion and loop tiling
      • learned performance model
      • integer linear programming
      • close to auto-tuning

  15. Backup Slides

  16. Model the Space of Possible Code Transformations
      Stencils 0 to 8, with tile sizes 64x4x3 and 64x4x5.
      • fusion choices: $g_0 = g_1 = g_2 = g_3 = g_4 = g_5 = 0$, $g_6 = g_7 = g_8 = 1$
          $0 \le g_{i+1} - g_i \le 1 \quad \forall i \in [0,7]$
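
The fusion constraint says that consecutive stencils either stay in the same group or start the next one. A minimal sketch of a validity check follows, assuming the group index of stencil i is stored in a vector g. The function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Checks 0 <= g[i+1] - g[i] <= 1 for all consecutive stencils.
bool valid_fusion(const std::vector<int>& g) {
    for (std::size_t i = 0; i + 1 < g.size(); ++i) {
        int step = g[i + 1] - g[i];
        if (step < 0 || step > 1) return false;
    }
    return true;
}

// Number of fused groups in a valid assignment.
int group_count(const std::vector<int>& g) {
    assert(!g.empty() && valid_fusion(g));
    return g.back() - g.front() + 1;
}
```

For the assignment on the slide, `valid_fusion({0, 0, 0, 0, 0, 0, 1, 1, 1})` holds and `group_count` returns 2, matching the two tile sizes shown.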

  17. Model the Space of Possible Code Transformations
      Stencils 0 to 8, with tile sizes 64x4x3 and 64x4x5.
      • tile sizes:
          $n_0^x = 1$, $n_5^x = 1$, $n_6^x = 1$, $n_8^x = 1$
          $n_0^y = 16$, $n_5^y = 16$, $n_6^y = 16$, $n_8^y = 16$
          $n_0^z = 20$, $n_5^z = 20$, $n_6^z = 12$, $n_8^z = 12$
      • equality constraints
      • bounds: $1 \le n_i^x \le D^x$, $1 \le n_i^y \le D^y$, $1 \le n_i^z \le D^z \quad \forall i \in [0,8]$

  18. Limit the Cache Utilization
      Stencils 0, 1, 2 with $F_{02} = 6$, $F_{12} = 5$, $F_{22} = 4$:
          $f_2 \ge F_{22}$
          $f_2 + F_{12}(g_2 - g_1) \ge F_{12}$
          $f_2 + F_{02}(g_2 - g_0) \ge F_{02}$
          $C n_2^x n_2^y n_2^z - f_2 \ge 0$
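
One way to read these constraints: the footprint f of stencil k must cover F[j][k] for every earlier stencil j in the same fusion group, and the capacity C multiplied by the tile counts must cover f. The sketch below works under that assumption. The matrix layout `F[j][k]`, the function names, and the unit of C are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Smallest f_k allowed by f_k + F[j][k] * (g[k] - g[j]) >= F[j][k] for all j <= k.
// For j in the same group (g[k] == g[j]) this demands f_k >= F[j][k];
// for j in an earlier group the bound drops to zero or below.
int required_footprint(const std::vector<int>& g,
                       const std::vector<std::vector<int>>& F, int k) {
    assert(k >= 0 && k < static_cast<int>(g.size()));
    int f = 0;
    for (int j = 0; j <= k; ++j)
        f = std::max(f, F[j][k] * (1 - (g[k] - g[j])));
    return f;
}

// Capacity check C * n^x * n^y * n^z - f >= 0.
bool fits_in_cache(long f, long C, long nx, long ny, long nz) {
    return C * nx * ny * nz - f >= 0;
}
```

With the slide's values F[0][2] = 6, F[1][2] = 5, and F[2][2] = 4, fusing all three stencils requires a footprint of 6, fusing only stencils 1 and 2 requires 5, and leaving stencil 2 alone requires 4. Fusing more stencils therefore raises the footprint that each tile has to fit.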
