spcl.inf.ethz.ch @spcl_eth
Tobias Gysi, Tobias Grosser, and Torsten Hoefler
Absinthe: Learning an Analytical Performance Model to Fuse and Tile Stencil Codes in One Shot
COSMO Atmospheric Model
- Regional atmospheric model used by 7 national weather services
- Implements many different stencil programs
Optimizing the Fastwaves Kernel from the COSMO Atmospheric Model
Figure: fastwaves stencil graph (stencils 1 to 8) with model prediction and measured execution time per variant.
- unfused, tiled 64x64x1: model prediction 1.08 ms, measured 0.94 ms
- absinthe (tiles 64x4x3 and 64x4x5): model prediction 0.67 ms, measured 0.62 ms
- auto-tuning (tiles 64x4x1 and 64x4x4): model prediction 0.73 ms, measured 0.58 ms (-6.5% relative to absinthe)
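As a quick consistency check of the -6.5% label, the relative difference can be worked out by hand. This assumes that 0.62 ms and 0.58 ms are the measured execution times of the absinthe and auto-tuning variants, which is how the figure values are read here rather than something the slide states in words.

```latex
\[
  \frac{t_{\text{auto-tuning}} - t_{\text{absinthe}}}{t_{\text{absinthe}}}
  = \frac{0.58 - 0.62}{0.62}
  \approx -0.065 = -6.5\%
\]
```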
Michael Baldauf, Axel Seifert, Jochen Förstner, Detlev Majewski, Matthias Raschendorfer, and Thorsten Reinhardt, Operational Convective-Scale Numerical Weather Prediction with the COSMO Model: Description and Sensitivities. 2011.
Stencil Programs Execute Multiple Stencils in Sequence
for (int y = ybeg; y < yend; y++)
  for (int x = xbeg; x < xend; x++)
    A(x,y) = I(x,y) + I(x-1,y) + I(x+1,y);

for (int y = ybeg; y < yend; y++)
  for (int x = xbeg; x < xend; x++)
    B(x,y) = A(x,y+1) + A(x,y);
- element-wise computation
- position independent access pattern
Figure: iteration domain in x and y, from (xbeg, ybeg) to (xend, yend). Memory traffic annotations: 1 load / 1 store; 2 loads / 1 store.
Loop Tiling and Loop Fusion
for (int idx = 0; idx < 4; ++idx) {
  int xbeg = tiles[idx].xbeg;
  int xend = tiles[idx].xend;
  int ybeg = tiles[idx].ybeg;
  int yend = tiles[idx].yend;
  Buffer A(xbeg, xend, ybeg, yend+1);
  for (int y = ybeg; y < yend+1; ++y)
    for (int x = xbeg; x < xend; ++x)
      A(x,y) = I(x,y) + I(x-1,y) + I(x+1,y);
  for (int y = ybeg; y < yend; y++)
    for (int x = xbeg; x < xend; x++)
      B(x,y) = A(x,y+1) + A(x,y);
}
Figure: domain in x and y split into four tiles (idx = 0 to idx = 3), each with its own xbeg, xend, ybeg, and yend. Memory traffic annotations: 1 load / 0 store; 0 loads / 1 store.
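The effect of fusion and tiling can be checked with a small self-contained sketch that runs both variants and compares the result. The `Grid` and `Tile` types, the one-cell border, the 2x2 tiling, and the function names are assumptions made for this illustration; the slides only show the loop nests. The unfused variant computes one extra row of A so that `A(x, y+1)` is defined at the upper boundary.

```cpp
#include <cassert>
#include <vector>

// 2D array with a one-cell border, so x-1, x+1 and y+1 are always in bounds.
struct Grid {
    int nx, ny;
    std::vector<double> data;
    Grid(int nx_, int ny_) : nx(nx_), ny(ny_), data((nx_ + 2) * (ny_ + 2), 0.0) {}
    double& operator()(int x, int y) { return data[(y + 1) * (nx + 2) + (x + 1)]; }
};

struct Tile { int xbeg, xend, ybeg, yend; };

// Unfused: A is a full array; one extra row keeps A(x, y+1) defined.
Grid run_unfused(Grid& I) {
    Grid A(I.nx, I.ny), B(I.nx, I.ny);
    for (int y = 0; y < I.ny + 1; ++y)
        for (int x = 0; x < I.nx; ++x)
            A(x, y) = I(x, y) + I(x - 1, y) + I(x + 1, y);
    for (int y = 0; y < I.ny; ++y)
        for (int x = 0; x < I.nx; ++x)
            B(x, y) = A(x, y + 1) + A(x, y);
    return B;
}

// Fused and tiled: A shrinks to a per-tile buffer with one halo row in y.
Grid run_fused_tiled(Grid& I, const std::vector<Tile>& tiles) {
    Grid B(I.nx, I.ny);
    for (const Tile& t : tiles) {
        assert(t.xbeg < t.xend && t.ybeg < t.yend);
        const int width = t.xend - t.xbeg;
        std::vector<double> buf(width * (t.yend - t.ybeg + 1));
        auto A = [&](int x, int y) -> double& {
            return buf[(y - t.ybeg) * width + (x - t.xbeg)];
        };
        for (int y = t.ybeg; y < t.yend + 1; ++y)
            for (int x = t.xbeg; x < t.xend; ++x)
                A(x, y) = I(x, y) + I(x - 1, y) + I(x + 1, y);
        for (int y = t.ybeg; y < t.yend; ++y)
            for (int x = t.xbeg; x < t.xend; ++x)
                B(x, y) = A(x, y + 1) + A(x, y);
    }
    return B;
}

// Runs both variants on a 2x2 tiling and reports whether B matches exactly.
bool fused_matches_unfused(int nx, int ny) {
    Grid I(nx, ny);
    for (int y = -1; y <= ny; ++y)
        for (int x = -1; x <= nx; ++x)
            I(x, y) = 3.0 * x + 7.0 * y;
    const int mx = nx / 2, my = ny / 2;
    const std::vector<Tile> tiles = {
        {0, mx, 0, my}, {mx, nx, 0, my}, {0, mx, my, ny}, {mx, nx, my, ny}};
    Grid ref = run_unfused(I);
    Grid out = run_fused_tiled(I, tiles);
    return ref.data == out.data;
}
```

Calling `fused_matches_unfused(64, 64)` exercises the halo row: each tile recomputes the row of A at its upper edge instead of reading it from a neighbouring tile.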
Architecture Overview
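1. model learner: runs a benchmark on the target system and produces the learned parameters
2. optimizer: ILP solver that uses the learned parameters to select the code transformations
3. code generator: applies the code transformations and emits fast code

$t(p, b) = P p + B b$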
Performance Model Ideas
- memory accesses dominate the execution time
- execution time of innermost loop

Figure labels: scalar peel loops, vectorized loop body; fast memory (L1 cache), slow memory (L3 cache/DDR).
Performance Model Design
- linear cost functions for peel and body cost: $t = P p + B b$
- slow and fast memory: $t = \max(P_1 p_1, P_2 p_2) + \max(B_1 b_1, B_2 b_2)$
- model the entire program: $t = \sum_{i=0..8} t_i + \dots$

Figure: stencils 1 to 8 annotated with the cost terms $P_1 p_1$, $P_2 p_2$, $B_1 b_1$, and $B_2 b_2$.
- # cache accesses
- estimated execution time
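The two-level cost function can be written down as a few lines of code. This is a minimal sketch that assumes the slide's formula reads as the sum of a peel term and a body term, each taken as the maximum over the fast and slow memory levels. The `LevelCost` struct and its field names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Learned parameters and access counts for one memory level.
// All names are illustrative; they are not taken from the slides.
struct LevelCost {
    double P, B;  // learned cost per peel access and per body access
    double p, b;  // number of peel accesses and body accesses
};

// Estimated time: the slower memory level determines each of the two terms.
double estimate_time(const LevelCost& fast, const LevelCost& slow) {
    assert(fast.p >= 0 && fast.b >= 0 && slow.p >= 0 && slow.b >= 0);
    return std::max(fast.P * fast.p, slow.P * slow.p) +
           std::max(fast.B * fast.b, slow.B * slow.b);
}
```

With made-up inputs `fast = {2, 1, 10, 100}` and `slow = {5, 3, 2, 20}`, the peel term is max(20, 10) and the body term is max(100, 60), so the estimate is 120 in whatever time unit the parameters use.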
Evaluating the Fast Memory Model
3 loads / 1 store

Figure: tiled domain annotated with $D^x$, $D^y$, $n^y$, and $h^y$; $n^x = 2$, $n^y = 2$.

Access counts, refined step by step:
- $p_f = n^x D^y$, $b_f = D^x D^y$
- $p_f = n^x (D^y + h^y n^y)$, $b_f = D^x D^y + D^x h^y n^y$
- $p_f = (3 + 1)\, n^x (D^y + h^y n^y)$, $b_f = (3 + 1)(D^x D^y + D^x h^y n^y)$

$t = P_f p_f + B_f b_f$

learn the model parameters $P_f$, $B_f$
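The access counts can be evaluated for concrete sizes with a short function. This is a sketch under stated assumptions: the symbols are read as domain extents (`Dx`, `Dy`), tile counts (`nx`, `ny`), and halo rows per tile (`hy`), and every array access of the stencil (loads plus stores) is counted once per iteration. The parameter names and this reading of the symbols are not confirmed by the slides.

```cpp
#include <cassert>

struct AccessCounts {
    long peel;  // accesses attributed to peel loops
    long body;  // accesses attributed to the vectorized loop body
};

// Rows executed: the domain rows plus hy halo rows for each of the ny tiles.
// Peel work scales with the number of innermost loops (nx per row),
// body work scales with the number of elements (Dx per row).
AccessCounts fast_memory_counts(long Dx, long Dy, long nx, long ny, long hy,
                                long loads, long stores) {
    assert(Dx >= 1 && Dy >= 1 && nx >= 1 && ny >= 1 && hy >= 0);
    const long rows = Dy + hy * ny;
    const long accesses = loads + stores;
    return {accesses * nx * rows, accesses * Dx * rows};
}
```

For a hypothetical 64x64 domain split into 2x2 tiles with one halo row and a stencil with 3 loads and 1 store, this gives 66 rows, 528 peel accesses, and 16896 body accesses.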
Learning the Fast Memory Model
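Figure: execution time [ms] (0.00 to 0.10) over x (20 to 80) in fast memory, for p = 12, p = 16, and p = 20.

Benchmark stencil, run for p = 12, p = 16, and p = 20: output array at k = input array at k + k+1 + k-1.

$(P_f, B_f) = \operatorname{argmin}_{(P, B) \in \mathbb{R}}$ of the summed deviation, over all benchmark samples $i$, between the modeled time $P p_i + B b_i$ and the measured time $t_i$.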
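One way to fit the two parameters from benchmark samples is ordinary least squares. Note that this is a swapped-in technique for illustration: the error measure used on the slide is not legible in this transcript. The `Sample` and `Fit` types and the function name are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Sample { double p, b, t; };   // peel count, body count, measured time
struct Fit { double P, B; bool ok; };

// Minimizes sum_i (P * p_i + B * b_i - t_i)^2 via the 2x2 normal equations:
//   P * spp + B * spb = spt
//   P * spb + B * sbb = sbt
Fit fit_least_squares(const std::vector<Sample>& samples) {
    assert(!samples.empty());
    double spp = 0, spb = 0, sbb = 0, spt = 0, sbt = 0;
    for (const Sample& s : samples) {
        spp += s.p * s.p;
        spb += s.p * s.b;
        sbb += s.b * s.b;
        spt += s.p * s.t;
        sbt += s.b * s.t;
    }
    const double det = spp * sbb - spb * spb;
    if (std::fabs(det) < 1e-12) return {0.0, 0.0, false};  // samples are collinear
    return {(spt * sbb - sbt * spb) / det, (sbt * spp - spt * spb) / det, true};
}
```

The fit needs samples whose peel and body counts are not proportional to each other, which is one reason to benchmark several peel sizes and several domain sizes.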
Linear Multiplication of Bounded Integer Variables
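Binary product $p = x b$ with binary $b$ and upper bound $M \ge x$:
- limit range: $0 \le p \le x$
- force result: $p - M b \le 0$ and $p - x - M b \ge -M$
- $b = 0$: $p \ge 0$ and $p \le 0$
- $b = 1$: $p \le x$ and $p \ge x$

Integer product $p = x y$ with upper bound $N \ge y$:
- binary representation: $y = \sum_{i=0}^{\lfloor \log_2(N) \rfloor} 2^i y_i$
- sum binary products: $p = \sum_{i=0}^{\lfloor \log_2(N) \rfloor} 2^i x y_i$

https://blog.adamfurmanek.pl/2015/09/26/ilp-part-6/
- the binary product $p = x b$ given the upper bound $M$
- the integer product $p = x y$ given the upper bounds $M$ and $N$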
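The linearization can be sanity-checked by brute force for small bounds. The sketch below uses the standard big-M constraints for a product of a bounded integer and a binary variable; the bound names `M` and `My` and all function names are assumptions for this example. It checks the constraints directly rather than calling an ILP solver.

```cpp
#include <cassert>

// Constraints for p = x * b with binary b and 0 <= x <= M.
bool satisfies_binary_product(int p, int x, int b, int M) {
    const bool limit_range = (0 <= p) && (p <= x);
    const bool force_zero = (p - M * b <= 0);     // b = 0 forces p <= 0
    const bool force_x = (p - x - M * b >= -M);   // b = 1 forces p >= x
    return limit_range && force_zero && force_x;
}

// Exhaustively checks that the only feasible p is x * b.
bool linearization_is_exact(int M) {
    assert(M >= 1);
    for (int x = 0; x <= M; ++x)
        for (int b = 0; b <= 1; ++b)
            for (int p = -1; p <= M + 1; ++p)
                if (satisfies_binary_product(p, x, b, M) != (p == x * b))
                    return false;
    return true;
}

// Integer product x * y for 0 <= y <= My, built from the bits y_i of y.
int product_via_bits(int x, int y, int My) {
    assert(My >= 1 && y >= 0 && y <= My);
    int product = 0;
    for (int i = 0; (1 << i) <= My; ++i) {  // i = 0 .. floor(log2(My))
        const int yi = (y >> i) & 1;        // binary variable y_i
        const int pi = x * yi;              // in an ILP: a linearized binary product
        product += (1 << i) * pi;
    }
    return product;
}
```

In an ILP each `pi` would be a separate variable constrained as in `satisfies_binary_product`, so a product of two integer variables costs one binary product per bit of the smaller bound.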
Comparison to Auto-tuning, Heuristics, Hand-tuned, and Random Variants
Figure: estimated time [ms] versus measured time [ms] for the absinthe, hand, auto-tuning, min, and max variants, one panel per kernel.
- fastwaves (axes 0.5 to 1.3 ms): auto-tuning (-6.5%), max (74.0%)
- diffusion (axes 0.4 to 1.6 ms): auto-tuning (-0.8%)
- advection (axes 0.3 to 0.8 ms): auto-tuning (-3.4%)
Comparison to Halide and Polymage
- R. T. Mullapudi, A. Adams, D. Sharlet, J. Ragan-Kelley, and K. Fatahalian, Automatically scheduling halide image processing pipelines. 2016.
- A. Jangda and U. Bondhugula, An effective fusion and tile size model for optimizing image processing pipelines. 2018.
Figure: execution time [ms] of Absinthe, Halide, and Polymage for fastwaves, advection, and diffusion. Bar labels relative to the 1x baseline: fastwaves 1.66x and 1.29x; advection 2.03x and 3.7x; diffusion 1.4x and 1.06x.
Conclusions
- learned performance model
- integer linear programming
- loop fusion and loop tiling
- close to auto-tuning
Backup Slides
Model the Space of Possible Code Transformations
Figure: stencils 1 to 8 with tile sizes 64x4x3 and 64x4x5.

fusion choices: $g_0 = g_1 = g_2 = g_3 = g_4 = g_5 = 0$, $g_6 = g_7 = g_8 = 1$

$0 \le g_{i+1} - g_i \le 1 \quad \forall i \in [0, 7]$
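The constraint on consecutive fusion variables can be read as "a stencil either stays in the group of its predecessor or opens the next group". A minimal sketch of that reading follows; the vector `g` of group indices and both function names are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Checks 0 <= g[i+1] - g[i] <= 1 for all consecutive pairs.
bool valid_fusion_choice(const std::vector<int>& g) {
    for (std::size_t i = 0; i + 1 < g.size(); ++i) {
        const int step = g[i + 1] - g[i];
        if (step < 0 || step > 1) return false;
    }
    return true;
}

// Collects the indices of consecutive entries that share a group index.
std::vector<std::vector<int>> fusion_groups(const std::vector<int>& g) {
    assert(valid_fusion_choice(g));
    std::vector<std::vector<int>> groups;
    for (std::size_t i = 0; i < g.size(); ++i) {
        if (i == 0 || g[i] != g[i - 1]) groups.emplace_back();
        groups.back().push_back(static_cast<int>(i));
    }
    return groups;
}
```

For the assignment shown on the slide, indices 0 to 5 end up in the first group and indices 6 to 8 in the second.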
Model the Space of Possible Code Transformations
Figure: stencils 1 to 8 with tile sizes 64x4x3 and 64x4x5.

tile sizes:
- $n_0^x = 1$, $n_0^y = 16$, $n_0^z = 20$
- ...
- $n_5^x = 1$, $n_5^y = 16$, $n_5^z = 20$
- $n_6^x = 1$, $n_6^y = 16$, $n_6^z = 12$
- ...
- $n_8^x = 1$, $n_8^y = 16$, $n_8^z = 12$

$1 \le n_i^x \le D^x$, $1 \le n_i^y \le D^y$, $1 \le n_i^z \le D^z \quad \forall i \in [0, 8]$
equality constraints
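The per-stencil values 1, 16, 20 and 1, 16, 12 are consistent with tile counts for 64x4x3 and 64x4x5 tiles on a 64x64x60 domain. The domain size is an assumption inferred from those numbers; it is not stated on the slide. A short sketch that reproduces the values under that assumption, with hypothetical function names:

```cpp
#include <array>
#include <cassert>

// Number of tiles per dimension, rounding up when a tile does not divide the domain.
std::array<int, 3> tile_counts(const std::array<int, 3>& domain,
                               const std::array<int, 3>& tile) {
    std::array<int, 3> counts{};
    for (int d = 0; d < 3; ++d) {
        assert(tile[d] >= 1 && tile[d] <= domain[d]);
        counts[d] = (domain[d] + tile[d] - 1) / tile[d];
    }
    return counts;
}

// Bound constraints from the slide: every count lies between 1 and the domain extent.
bool within_bounds(const std::array<int, 3>& counts, const std::array<int, 3>& domain) {
    for (int d = 0; d < 3; ++d)
        if (counts[d] < 1 || counts[d] > domain[d]) return false;
    return true;
}
```

Stencils that are fused into one group must share the same counts, which is one plausible reading of the equality constraints mentioned above.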