SLIDE 1

spcl.inf.ethz.ch @spcl_eth

TOBIAS GYSI, TOBIAS GROSSER, AND TORSTEN HOEFLER

Absinthe: Learning an Analytical Performance Model to Fuse and Tile Stencil Codes in One Shot

SLIDE 2

COSMO Atmospheric Model

  • Regional atmospheric model used by 7 national weather services
  • Implements many different stencil programs

SLIDE 3

Optimizing the Fastwaves Kernel from the COSMO Atmospheric Model

Figure: fastwaves stencils 1 to 8, model prediction [ms] and measured execution time [ms] per variant

  • unfused, tiled 64x64x1: model prediction 1.08 ms, measured execution time 0.94 ms
  • auto-tuning, tile sizes 64x4x1 and 64x4x4: model prediction 0.73 ms, measured execution time 0.58 ms
  • absinthe, tile sizes 64x4x3 and 64x4x5: model prediction 0.67 ms, measured execution time 0.62 ms
  • the auto-tuned variant is 6.5% faster than the absinthe variant

Michael Baldauf, Axel Seifert, Jochen Förstner, Detlev Majewski, Matthias Raschendorfer, and Thorsten Reinhardt, Operational Convective-Scale Numerical Weather Prediction with the COSMO Model: Description and Sensitivities. 2011.

SLIDE 4

Stencil Programs Execute Multiple Stencils in Sequence

for (int y = ybeg; y < yend; y++)
  for (int x = xbeg; x < xend; x++)
    A(x,y) = I(x,y) + I(x-1,y) + I(x+1,y);

for (int y = ybeg; y < yend; y++)
  for (int x = xbeg; x < xend; x++)
    B(x,y) = A(x,y+1) + A(x,y);

  • element-wise computation
  • position independent access pattern

Figure: iteration domain over x (xbeg to xend) and y (ybeg to yend); annotations: 1 load / 1 store, 2 loads / 1 store

SLIDE 5

Loop Tiling and Loop Fusion

for (int idx = 0; idx < 4; ++idx) {
  int xbeg = tiles[idx].xbeg;
  int xend = tiles[idx].xend;
  int ybeg = tiles[idx].ybeg;
  int yend = tiles[idx].yend;
  Buffer A(xbeg, xend, ybeg, yend+1);
  for (int y = ybeg; y < yend+1; ++y)
    for (int x = xbeg; x < xend; ++x)
      A(x,y) = I(x,y) + I(x-1,y) + I(x+1,y);
  for (int y = ybeg; y < yend; y++)
    for (int x = xbeg; x < xend; x++)
      B(x,y) = A(x,y+1) + A(x,y);
}

Figure: x/y domain split into four tiles (idx = 0 to idx = 3), each spanning xbeg to xend and ybeg to yend; annotations: 1 load / 0 store, 0 loads / 1 store
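
The two loop nests of this slide and the previous one can be compared with a small sketch that runs both and checks that they produce the same B. The `Grid` type, its one-cell halo, the function names `run_unfused` and `run_fused`, and the way tiles are enumerated are assumptions made for illustration; the slides only show the loop bodies. To make the results comparable, the unfused version here also computes one extra row of A, as the fused version does.

```cpp
#include <algorithm>
#include <vector>

// Row-major 2D grid with a one-cell halo on every side (hypothetical helper),
// so that I(x-1,y), I(x+1,y) and A(x,y+1) stay in bounds at the domain edge.
struct Grid {
  int nx, ny;
  std::vector<double> data;
  Grid(int nx_, int ny_) : nx(nx_), ny(ny_), data((nx_ + 2) * (ny_ + 2), 0.0) {}
  double& operator()(int x, int y) { return data[(y + 1) * (nx + 2) + (x + 1)]; }
};

// Unfused: two full sweeps over the domain, A is a full-size temporary.
Grid run_unfused(Grid& I) {
  Grid A(I.nx, I.ny), B(I.nx, I.ny);
  for (int y = 0; y < I.ny + 1; ++y)
    for (int x = 0; x < I.nx; ++x)
      A(x, y) = I(x, y) + I(x - 1, y) + I(x + 1, y);
  for (int y = 0; y < I.ny; ++y)
    for (int x = 0; x < I.nx; ++x)
      B(x, y) = A(x, y + 1) + A(x, y);
  return B;
}

// Fused and tiled: both stencils run per tile, A lives in a tile-local buffer
// with one extra row to serve the A(x,y+1) access.
Grid run_fused(Grid& I, int tile_x, int tile_y) {
  Grid B(I.nx, I.ny);
  for (int ybeg = 0; ybeg < I.ny; ybeg += tile_y)
    for (int xbeg = 0; xbeg < I.nx; xbeg += tile_x) {
      const int xend = std::min(xbeg + tile_x, I.nx);
      const int yend = std::min(ybeg + tile_y, I.ny);
      const int w = xend - xbeg;
      std::vector<double> A(w * (yend - ybeg + 1));
      for (int y = ybeg; y < yend + 1; ++y)
        for (int x = xbeg; x < xend; ++x)
          A[(y - ybeg) * w + (x - xbeg)] = I(x, y) + I(x - 1, y) + I(x + 1, y);
      for (int y = ybeg; y < yend; ++y)
        for (int x = xbeg; x < xend; ++x)
          B(x, y) = A[(y + 1 - ybeg) * w + (x - xbeg)] + A[(y - ybeg) * w + (x - xbeg)];
    }
  return B;
}
```

The fused version recomputes the extra row of A at every tile boundary in y, which is the price paid for keeping A out of slow memory.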

SLIDE 6

Architecture Overview

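Figure: pipeline with three components

  • 1 model learner: benchmarks the target system to obtain the learned parameters
  • 2 optimizer: ILP solver that takes the learned parameters and selects the code transformations
  • 3 code generator: turns the code transformations into fast code

$d(p, b) = P p + B b$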

SLIDE 7

Performance Model Ideas

  • memory accesses dominate the execution time

scalar peel loops vectorized loop body

  • execution time of innermost loop

fast memory (L1 cache) slow memory (L3 cache/DDR)

SLIDE 8

Performance Model Design

  • linear cost functions for peel and body cost

$d = P p + B b$

  • slow and fast memory

$d = \max(P_1 p_1, P_2 p_2) + \max(B_1 b_1, B_2 b_2)$

  • model the entire program

$d = \sum_{i=0}^{8} d_i$
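
The three cost functions above compose into a single evaluation routine. The following is a minimal sketch, assuming peel counts p and body counts b per memory level are already known for every stencil; the struct and function names are hypothetical, and the indices 1 and 2 simply denote the two memory levels without assigning fast or slow to either.

```cpp
#include <algorithm>
#include <vector>

// Peel (p) and body (b) counts of one stencil for memory levels 1 and 2.
struct StencilCounts { double p1, b1, p2, b2; };

// Model parameters: cost per peel (P) and per body access (B) for each level.
struct ModelParams { double P1, B1, P2, B2; };

// d = max(P1*p1, P2*p2) + max(B1*b1, B2*b2) for a single stencil.
double stencil_time(const StencilCounts& s, const ModelParams& m) {
  return std::max(m.P1 * s.p1, m.P2 * s.p2) + std::max(m.B1 * s.b1, m.B2 * s.b2);
}

// d = sum of the per-stencil times over the entire program.
double program_time(const std::vector<StencilCounts>& stencils, const ModelParams& m) {
  double d = 0.0;
  for (const StencilCounts& s : stencils) d += stencil_time(s, m);
  return d;
}
```

Taking the maximum per term means the slower memory level determines each cost, while peel and body costs still add up.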

SLIDE 9

Evaluating the Fast Memory Model

  • # cache accesses
  • estimated execution time

Figure: tiled domain with labels $D_x$, $D_y$, and $e_y$; stencil with 3 loads / 1 store; $n_x = 2$, $n_y = 2$

$p_f = n_x D_y$, $b_f = D_x D_y$

$p_f = n_x (D_y + e_y n_y)$, $b_f = D_x D_y + D_x e_y n_y$

$p_f = (3 + 1) n_x (D_y + e_y n_y)$, $b_f = (3 + 1) (D_x D_y + D_x e_y n_y)$

$d = P_f p_f + B_f b_f$

learn the model parameters $P_f$, $B_f$
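
The access counts of this slide can be evaluated directly. The sketch below assumes the final form of the formulas, p_f = (loads + stores) n_x (D_y + e_y n_y) and b_f = (loads + stores) (D_x D_y + D_x e_y n_y); applying the access factor to the whole body count is an assumption, and the function and struct names are hypothetical.

```cpp
// Peel and body access counts in fast memory for one stencil.
struct FastCounts { long p; long b; };

// Dx, Dy: domain size; ey: extra rows per tile in y; nx, ny: tile counts;
// loads, stores: accesses per stencil evaluation (3 and 1 in the slide example).
FastCounts fast_memory_counts(long Dx, long Dy, long ey, long nx, long ny,
                              long loads, long stores) {
  const long accesses = loads + stores;
  const long peel = nx * (Dy + ey * ny);        // one peel per row and tile column
  const long body = Dx * Dy + Dx * ey * ny;     // domain plus the extra rows
  return {accesses * peel, accesses * body};
}

// Estimated execution time d = Pf * pf + Bf * bf.
double estimate_time(const FastCounts& c, double Pf, double Bf) {
  return Pf * static_cast<double>(c.p) + Bf * static_cast<double>(c.b);
}
```

For example, with D_x = D_y = 8, e_y = 1, and n_x = n_y = 2, a stencil with 3 loads and 1 store gives p_f = 80 and b_f = 320.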

SLIDE 10

Learning the Fast Memory Model

Figure: execution time [ms] (0.00 to 0.10) over x (20 to 80) in fast memory, one series each for p=12, p=16, and p=20

Figure: benchmark stencil, output array = input array at k + k+1 + k-1, shown for p=12, p=16, and p=20

$(P_f, B_f) = \arg\min_{(P, B) \in \mathbb{R}} \sum_{r \in [0, R]} | P p_r + B b_r - d_r |$
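
Fitting the two parameters to benchmark measurements can be sketched as follows. This example swaps in ordinary least squares, solved in closed form through the 2x2 normal equations, as a stand-in for the minimization on the slide; it is not claimed to be the loss the authors use. The `Sample` and `Fit` types and the function name are hypothetical.

```cpp
#include <cmath>
#include <vector>

// One benchmark run: peel count p, body count b, measured execution time d.
struct Sample { double p, b, d; };

// Fitted parameters; ok is false if the samples do not determine P and B.
struct Fit { double P, B; bool ok; };

// Least-squares fit of d ~ P*p + B*b (stand-in for the slide's argmin).
Fit fit_least_squares(const std::vector<Sample>& samples) {
  double spp = 0, spb = 0, sbb = 0, spd = 0, sbd = 0;
  for (const Sample& s : samples) {
    spp += s.p * s.p;
    spb += s.p * s.b;
    sbb += s.b * s.b;
    spd += s.p * s.d;
    sbd += s.b * s.d;
  }
  // Normal equations: [spp spb; spb sbb] * [P; B] = [spd; sbd].
  const double det = spp * sbb - spb * spb;
  if (std::fabs(det) < 1e-12) return {0.0, 0.0, false};
  return {(spd * sbb - spb * sbd) / det, (spp * sbd - spb * spd) / det, true};
}
```

The fit needs runs whose (p, b) pairs are not all proportional to each other, which is one reason to benchmark several peel counts such as p=12, p=16, and p=20.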

SLIDE 11

Linear Multiplication of Bounded Integer Variables

  • the binary product $p = x b$ given the upper bound $X$

limit range: $0 \le p \le x$

force result: $p - X b \le 0$, $p - x - X b \ge -X$

Figure: feasible values of $p$ over $b$; at $b = 0$ the bounds $p \ge 0$ and $p \le 0$ apply, at $b = 1$ the bounds $p \le x$ and $p \ge x$ apply (result $x$)

  • the integer product $p = x y$ given the upper bounds $X$ and $Y$

binary representation: $y = \sum_{i=0}^{\lfloor \log_2(Y) \rfloor} 2^i y_i$

sum binary products: $p = \sum_{i=0}^{\lfloor \log_2(Y) \rfloor} 2^i x y_i$

https://blog.adamfurmanek.pl/2015/09/26/ilp-part-6/
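
The linearization can be checked by brute force without an ILP solver. The sketch below enumerates every integer p that satisfies the constraints 0 <= p <= x, p - X b <= 0, and p - x - X b >= -X for fixed x and b, and then builds the integer product from the bits of y. The function names are hypothetical, and the enumeration stands in for what a solver would do.

```cpp
#include <vector>

// All integers p that satisfy the linear constraints for fixed x and b,
// where 0 <= x <= X and b is 0 or 1.
std::vector<int> feasible_products(int x, int b, int X) {
  std::vector<int> out;
  for (int p = 0; p <= X; ++p) {
    const bool in_range = (0 <= p) && (p <= x);      // limit range
    const bool upper = (p - X * b <= 0);             // forces p = 0 when b = 0
    const bool lower = (p - x - X * b >= -X);        // forces p = x when b = 1
    if (in_range && upper && lower) out.push_back(p);
  }
  return out;
}

// Integer product x * y as a sum of binary products 2^i * x * y_i,
// where y_i are the bits of y. Assumes 0 <= x <= X and y >= 0.
int product_via_bits(int x, int y, int X) {
  int p = 0;
  for (int i = 0; (y >> i) != 0; ++i) {
    const int bit = (y >> i) & 1;
    // The constraints leave exactly one feasible value per binary product.
    p += (1 << i) * feasible_products(x, bit, X).front();
  }
  return p;
}
```

For every x in [0, X] the feasible set contains exactly one value, x * b, which is why the constraints can replace the nonlinear product in the ILP.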

SLIDE 12

Comparison to Auto-tuning, Heuristics, Hand-tuned, and Random Variants

Figure: estimated time [ms] over measured time [ms] for the absinthe, auto-tuning, hand, min, and max variants, one panel per kernel

  • fastwaves: auto-tuning (-6.5%), max (74.0%)
  • diffusion: auto-tuning (-0.8%)
  • advection: auto-tuning (-3.4%)

SLIDE 13

Comparison to Halide and Polymage

  • R. T. Mullapudi, A. Adams, D. Sharlet, J. Ragan-Kelley, and K. Fatahalian, Automatically scheduling halide image processing pipelines. 2016.
  • A. Jangda and U. Bondhugula, An effective fusion and tile size model for optimizing image processing pipelines. 2018.

Figure: execution time [ms] of Absinthe, Halide, and Polymage; bar labels: fastwaves 1.66x, 1.29x, 1x; advection 2.03x, 3.7x, 1x; diffusion 1.4x, 1.06x, 1x

SLIDE 14

Conclusions

  • learned performance model
  • integer linear programming
  • loop fusion and loop tiling
  • close to auto-tuning

SLIDE 15

Backup Slides

SLIDE 16

Model the Space of Possible Code Transformations

Figure: stencils 1 to 8 with tile sizes 64x4x3 and 64x4x5

fusion choices: $g_0 = g_1 = g_2 = g_3 = g_4 = g_5 = 0$, $g_6 = g_7 = g_8 = 1$

$0 \le g_{i+1} - g_i \le 1 \quad \forall i \in [0, 7]$
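
The fusion constraint says that the group index either stays the same or increases by one from one stencil to the next. A minimal sketch of that check, together with a helper that lists the stencil indices of each group, could look like this; both function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Checks 0 <= g[i+1] - g[i] <= 1 for all neighboring stencils.
bool valid_groups(const std::vector<int>& g) {
  for (std::size_t i = 0; i + 1 < g.size(); ++i) {
    const int step = g[i + 1] - g[i];
    if (step < 0 || step > 1) return false;
  }
  return true;
}

// Splits the stencil indices into fused groups: a new group starts
// whenever the group index changes.
std::vector<std::vector<int>> fusion_groups(const std::vector<int>& g) {
  std::vector<std::vector<int>> groups;
  for (std::size_t i = 0; i < g.size(); ++i) {
    if (i == 0 || g[i] != g[i - 1]) groups.emplace_back();
    groups.back().push_back(static_cast<int>(i));
  }
  return groups;
}
```

With the fusion choices of the slide, g = {0, 0, 0, 0, 0, 0, 1, 1, 1}, the helper yields one group with indices 0 to 5 and one with indices 6 to 8.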

SLIDE 17

Model the Space of Possible Code Transformations

Figure: stencils 1 to 8 with tile sizes 64x4x3 and 64x4x5

tile sizes: $n_0^x = 1$, $n_0^y = 16$, $n_0^z = 20$, …, $n_5^x = 1$, $n_5^y = 16$, $n_5^z = 20$, $n_6^x = 1$, $n_6^y = 16$, $n_6^z = 12$, ..., $n_8^x = 1$, $n_8^y = 16$, $n_8^z = 12$

$1 \le n_i^x \le D_x, \quad 1 \le n_i^y \le D_y, \quad 1 \le n_i^z \le D_z \quad \forall i \in [0, 8]$

equality constraints
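
The values 1, 16, 20 and 1, 16, 12 on this slide match the number of tiles per dimension for tile sizes 64x4x3 and 64x4x5 if the domain is 64x64x60. That domain size is an assumption inferred from the numbers, not something stated on the slides. Under that assumption, the tile counts and the bound constraints can be sketched as follows, with hypothetical function names.

```cpp
#include <array>

using Triple = std::array<int, 3>;  // {x, y, z}

// Number of tiles per dimension for a given domain and tile size.
// Returns false if a tile size does not divide the domain evenly;
// remainder tiles are outside the scope of this sketch.
bool tile_counts(const Triple& domain, const Triple& tile, Triple& n) {
  for (int k = 0; k < 3; ++k) {
    if (tile[k] <= 0 || domain[k] % tile[k] != 0) return false;
    n[k] = domain[k] / tile[k];
  }
  return true;
}

// Bound constraints from the slide: 1 <= n <= D in every dimension.
bool within_bounds(const Triple& n, const Triple& domain) {
  for (int k = 0; k < 3; ++k)
    if (n[k] < 1 || n[k] > domain[k]) return false;
  return true;
}
```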

SLIDE 18

Limit the Cache Utilization

$F_{02} = 6$, $F_{12} = 5$, $F_{22} = 4$

$f_2 \ge F_{22}$

$f_2 + F_{12} (g_2 - g_1) \ge F_{12}$

$f_2 + F_{02} (g_2 - g_0) \ge F_{02}$

$C n_2^x n_2^y n_2^z - f_2 \ge 0$

Figure: stencils 1 and 2
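
The footprint constraints only bind for stencils that share a group with stencil 2: if g_2 - g_j is 0 the constraint reduces to f_2 >= F_j2, and if it is 1 or more the constraint holds for any non-negative f_2. The sketch below computes the smallest footprint that satisfies them and checks the capacity constraint. The function names are hypothetical, and C is treated as a plain capacity constant because its unit is not given here.

```cpp
#include <algorithm>
#include <vector>

// Smallest non-negative f_k with f_k + F[j] * (g[k] - g[j]) >= F[j] for all j <= k.
// F[j] holds the footprint value F_jk for the fixed stencil k.
int min_footprint(const std::vector<int>& F, const std::vector<int>& g, int k) {
  int f = 0;
  for (int j = 0; j <= k; ++j)
    f = std::max(f, F[j] * (1 - (g[k] - g[j])));
  return f;
}

// Capacity constraint C * nx * ny * nz - f >= 0.
bool fits_in_cache(double C, int nx, int ny, int nz, int f) {
  return C * nx * ny * nz - f >= 0.0;
}
```

With F = {6, 5, 4}, fusing all three stencils gives a footprint of 6, fusing only the last two gives 5, and leaving stencil 2 in its own group gives 4.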