spcl.inf.ethz.ch @spcl_eth
TOBIAS GYSI, TOBIAS GROSSER, AND TORSTEN HOEFLER
MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures

Stencil Programs

Motivation:
• COSMO is a regional climate model used by 7 national weather services
• The real-world application COSMO contains hundreds of different stencils

Analysis:
• Stencil programs can be represented using directed acyclic graphs
• Nodes and edges correspond to stencils and dependencies, respectively

[Figure: simplified horizontal diffusion example, a graph over the nodes in, lap, fli, flj, and out whose edges are labeled ⊕]

$a \oplus b = \{\, a' + b' \mid a' \in a,\ b' \in b \,\}$
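
The ⊕ definition on this slide can be read as a Minkowski sum of offset sets. The following is a minimal sketch under that reading, assuming each operand is a set of 2-D access offsets; the names `Offset`, `OffsetSet`, and `minkowski_sum` are hypothetical and not part of MODESTO or STELLA.

```cpp
#include <set>
#include <utility>

// A 2-D access offset (i, j) relative to the evaluated grid point.
using Offset = std::pair<int, int>;
using OffsetSet = std::set<Offset>;

// a (+) b = { a' + b' | a' in a, b' in b }
OffsetSet minkowski_sum(const OffsetSet& a, const OffsetSet& b) {
    OffsetSet result;
    for (const Offset& x : a) {
        for (const Offset& y : b) {
            // Duplicate sums collapse because the result is a set.
            result.insert(Offset(x.first + y.first, x.second + y.second));
        }
    }
    return result;
}
```

For instance, combining an assumed two-point access {(0,0), (1,0)} with the five-point pattern {(0,0), (1,0), (-1,0), (0,1), (0,-1)} gives the set of input offsets reached through both stencils.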

TODO: Quickly say what tiling and loop fusion are?

How to Deal with Data Dependencies?
• Consider the horizontal diffusion lap-fli-out dependency chain (i-dimension)

Inter-tile data dependencies:
• Perform halo exchange communication
• Introduce redundant computation

[Figure: tiles 0, 1, and 2 along the i-dimension, plotted against time]

How to Deal with Data Dependencies?
• Consider the horizontal diffusion lap-fli-out dependency chain (i-dimension)

Halo Exchange Parallel (hp):
• Update tiles in parallel
• Perform halo exchange communication

Pros and Cons:
• Avoid redundant computation
• At the cost of additional synchronization

[Figure: tiles along the i-dimension, plotted against time]

How to Deal with Data Dependencies?
• Consider the horizontal diffusion lap-fli-out dependency chain (i-dimension)

Halo Exchange Sequential (hs):
• Update tiles sequentially
• E.g., the innermost loop updates tile-by-tile

Pros and Cons:
• Avoid redundant computation
• At the cost of being sequential

[Figure: tiles along the i-dimension, plotted against time]

How to Deal with Data Dependencies?
• Consider the horizontal diffusion lap-fli-out dependency chain (i-dimension)

Computation on-the-fly (of):
• Compute all dependencies on-the-fly
• «Overlapped tiling»

Pros and Cons:
• Avoid synchronization
• At the cost of redundant computation

[Figure: tiles along the i-dimension, plotted against time]
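
The cost of computation on-the-fly can be made concrete with a little arithmetic. This is a rough sketch for one dimension, assuming a tile of `tile_width` points whose producer stage needs `halo_lo` extra points below and `halo_hi` extra points above the tile; both function names and the halo widths used in the usage note are illustrative assumptions.

```cpp
#include <cassert>

// Points a producer stage evaluates per tile along one dimension when all
// dependencies are computed on the fly (overlapped tiling).
int evaluations_per_tile(int tile_width, int halo_lo, int halo_hi) {
    assert(tile_width > 0 && halo_lo >= 0 && halo_hi >= 0);
    return tile_width + halo_lo + halo_hi;
}

// Fraction of those evaluations that neighboring tiles compute as well,
// ignoring effects at the domain boundary.
double redundancy(int tile_width, int halo_lo, int halo_hi) {
    const int total = evaluations_per_tile(tile_width, halo_lo, halo_hi);
    return static_cast<double>(total - tile_width) / total;
}
```

With a tile width of 6 and one halo point on each side, a quarter of the producer evaluations are redundant; wider tiles reduce that share, which is the trade-off against synchronization that the slide names.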

Case Study: STELLA (STEncil Loop LAnguage)
• STELLA is a stencil DSL used to refactor the dynamical core of COSMO

    // define stencil functors using C++ template metaprogramming:
    struct Lap { ... };
    struct Fli { ... };
    ...

    // stencil assembly
    Stencil stencil;
    StencilCompiler::Build(
        stencil,
        pack_parameters( ... ),
        define_temporaries(
            StencilBuffer<lap, double>(),
            StencilBuffer<fli, double>(),
            ...
        ),
        define_loops(
            define_sweep(
                StencilStage<Lap, IJRange<-1,1,-1,1> >(),
                StencilStage<Fli, IJRange<-1,0,0,0> >(),
                ...
            )
        )
    );

    // stencil execution
    stencil.Apply();

STELLA defines a virtual tiling hierarchy that facilitates platform-independent code generation.

Tiling Hierarchy of STELLA's GPU-Backend

DSL      Tile Size    Strategy                  Memory     Communication
sweep    1 x 1 x 1    halo exchange parallel    registers  scratchpad
sweep    ∞ x ∞ x 1    halo exchange sequential  registers  registers
loop     64 x 4 x 64  computation on-the-fly    GDDR       -
stencil  ∞ x ∞ x ∞    computation on-the-fly    GDDR       -

The rows follow the tiling hierarchy.

Stencil Program Algebra
• Map stencils to the tiling hierarchy using a bracket expression
  [Figure: the graph over lap, fli, flj, and out with the bracket expressions [[lap,fli,flj],[out]] and [lap,fli,flj,out]]
• Enumerate the stencil execution orders that respect the dependencies
  [Figure: the order lap, fli, flj, out]
• Enumerate implementation variants by adding/removing brackets
  [Figure: between "...,lap,fli" and "flj,out,..." the separator is one of "]],[[", "],[", or ","]
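
Enumerating the execution orders that respect the dependencies amounts to listing all topological orders of the stencil graph. Below is a small backtracking sketch of that idea; it is an illustration only, the function names are hypothetical, and the dependency list in the usage note is assumed from the simplified horizontal diffusion example. Only dependencies between the listed stencils should be passed in (inputs such as `in` are left out).

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Order = std::vector<std::string>;
// A dependency (producer, consumer): the producer must run first.
using Dependency = std::pair<std::string, std::string>;

// Extend `prefix` with every unused stencil whose producers are already placed.
void extend_orders(const std::vector<std::string>& stencils,
                   const std::vector<Dependency>& deps,
                   Order& prefix, std::vector<Order>& orders) {
    if (prefix.size() == stencils.size()) {
        orders.push_back(prefix);
        return;
    }
    for (const std::string& s : stencils) {
        if (std::find(prefix.begin(), prefix.end(), s) != prefix.end()) continue;
        bool ready = true;
        for (const Dependency& d : deps) {
            if (d.second == s &&
                std::find(prefix.begin(), prefix.end(), d.first) == prefix.end()) {
                ready = false;  // a producer of s has not run yet
            }
        }
        if (!ready) continue;
        prefix.push_back(s);
        extend_orders(stencils, deps, prefix, orders);
        prefix.pop_back();
    }
}

// All execution orders of `stencils` that respect `deps`.
std::vector<Order> execution_orders(const std::vector<std::string>& stencils,
                                    const std::vector<Dependency>& deps) {
    std::vector<Order> orders;
    Order prefix;
    extend_orders(stencils, deps, prefix, orders);
    return orders;
}
```

For the stencils lap, fli, flj, out with lap feeding fli and flj, and both feeding out, this yields two orders: lap, fli, flj, out and lap, flj, fli, out.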

Machine Model
• Our model considers peak computation and communication throughputs

Note: lateral and vertical communication refer to communication within one tiling hierarchy level and between different tiling hierarchy levels, respectively.

Target machine:
• core 1, core 2, core 3 (30 Gflops each)
• one cache per core (256 kB each), 100 GB/s between each core and its cache, 25 GB/s between neighboring caches
• DDR (8 GB), 10 GB/s

Machine model:
  C = 90 Gflops
  V^1 = 300 GB/s   L^1 = 50 GB/s   M^1 = 256 kB
  V^0 = 10 GB/s    L^0 = 0 GB/s    M^0 = 8 GB

Performance Modeling

Note: remember the performance model parameters $C, V^m, L^1, \dots, L^m$.

[Figure: tiling hierarchy annotated with execution times. Stencils: $t_{lap}, t_{fli}, t_{flj}, t_{out}$ with vertical communication terms $v_{lap}/V^2, v_{fli}/V^2, v_{flj}/V^2, v_{out}/V^2$. Groups: $t_{[lap,fli,flj]}, t_{[out]}$ with $v_{[lap,fli,flj]}/V^1, v_{[out]}/V^1$. Root: $t_{[[lap,fli,flj],[out]]}$.]

• Given a stencil $s$ and the amount of computation $c_s$:
  $t_s = c_s / C$
• Given a group $g$ and the vertical and lateral communication $v_c$ and $l_c^1, \dots, l_c^m$:
  $t_g = \sum_{c \in g.child} \max\left(t_c, \frac{v_c}{V^m}, \frac{l_c^1}{L^1}, \dots, \frac{l_c^m}{L^m}\right)$
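
The two estimates on this slide can be sketched in a few lines. The sketch assumes that the per-child terms of a group are summed, that communication volumes are given in bytes and throughputs in bytes per second, and that a child with no lateral traffic at a level contributes nothing for that level; the `Child` struct and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One child (stencil or nested group) of a group.
struct Child {
    double time;                  // t_c: estimated execution time [s]
    double vertical;              // v_c: vertical communication [bytes]
    std::vector<double> lateral;  // l_c^1..l_c^m: lateral communication [bytes]
};

// t_s = c_s / C
double stencil_time(double flops, double peak_flops) {
    return flops / peak_flops;
}

// t_g: sum over the children of max(t_c, v_c / V, l_c^k / L[k] for every k).
double group_time(const std::vector<Child>& children, double V,
                  const std::vector<double>& L) {
    double total = 0.0;
    for (const Child& c : children) {
        assert(c.lateral.size() == L.size());
        double t = std::max(c.time, c.vertical / V);
        for (std::size_t k = 0; k < L.size(); ++k) {
            // Skip levels without lateral traffic (avoids 0 / 0 when L[k] is 0).
            if (c.lateral[k] > 0.0) t = std::max(t, c.lateral[k] / L[k]);
        }
        total += t;
    }
    return total;
}
```

Taking the maximum per child models computation and communication as overlapping, so each child is bound by its slowest resource.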

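Affine Sets and Maps
• The stencil program analysis is based on (quasi-) affine sets and maps
  $S = \{\, \vec{i} \mid \vec{i} \in \mathbb{Z}^n \wedge (0, \dots, 0) < \vec{i} < (10, \dots, 10) \,\}$
  $M = \{\, \vec{i} \to \vec{j} \mid \vec{i} \in \mathbb{Z}^n,\ \vec{j} \in \mathbb{Z}^n,\ \vec{j} = 2 \cdot \vec{i} \,\}$
• For example, data dependencies are expressed using a named map
  $D_{lap} = \{\, (lap, \vec{i}) \to (in, \vec{i} + \vec{j}) \mid \vec{i} \in \mathbb{Z}^2,\ \vec{j} \in \{(0,0), (1,0), (-1,0), (0,1), (0,-1)\} \,\}$

[Figure: dependency graph over in, lap, fli, flj, and out]

  $D = D_{lap} \cup D_{fli} \cup D_{flj} \cup D_{out}$
  $E = D^{+}(\{(out, \vec{0})\})$

Note: apply the out origin vector to the transitive closure of all dependencies.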

Tiling Maps

[Figure: the (i_0, i_1) iteration space with both axes running from 0 to 5, partitioned into tiles with tile ids (0,0) to (2,2); for example, out (2,5) has tile id (1,2)]

• Define a tiling using a map that associates stencil evaluations to tile ids
  $T_{out} = \{\, (out, i_0, i_1) \to (\lfloor i_0 / 2 \rfloor, \lfloor i_1 / 2 \rfloor) \,\}$
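
The tiling map can be evaluated directly for a single stencil evaluation. The sketch below assumes square tiles of a fixed edge length (2 in the slide); `floor_div` and `tile_id` are hypothetical helper names. Note that C++ integer division truncates toward zero, so a dedicated floor division is needed for negative indices such as halo points.

```cpp
#include <cassert>
#include <utility>

// Floor division for a positive divisor. Plain `a / b` truncates toward zero
// and would map i = -1 to tile 0 instead of tile -1.
int floor_div(int a, int b) {
    assert(b > 0);
    const int q = a / b;
    return (a % b < 0) ? q - 1 : q;
}

// (i0, i1) -> (floor(i0 / tile_size), floor(i1 / tile_size))
std::pair<int, int> tile_id(int i0, int i1, int tile_size = 2) {
    return std::make_pair(floor_div(i0, tile_size), floor_div(i1, tile_size));
}
```

With the default tile size, `tile_id(2, 5)` gives (1, 2), matching the example in the figure.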

Count Stencil Evaluations

[Figure: tiling hierarchy]

• Intersect the range of the tiling map with the base tiling hierarchy origin tile
  $S = \{\, (0, 0, y_0, y_1) \mid y_j \in \mathbb{Z} \,\}$
• Count the number of stencil evaluations associated to the remaining tiles
  $c_{out} = |T_{out} \cap_{ran} S| \cdot \#flops$

Note: count the points in the filtered tiling map using the Barvinok algorithm.

Stencil Program Optimization
• Put it all together (stencil algebra, performance model, stencil analysis)
  1. Optimize the stencil execution order (brute force search)
  2. Optimize the stencil grouping (dynamic programming)

  $\min_{x \in I} t(x) \quad \text{subject to} \quad m(x) \le M$

[Figure: the set I of implementation variants, shown as different groupings of lap, fli, flj, and out]

Dynamic Programming (simplified)

Tiling hierarchy level 1:

lap ⟷ lap   -           -           -
lap ⟷ fli   fli ⟷ fli   -           -
lap ⟷ flj   fli ⟷ flj   flj ⟷ flj   -
lap ⟷ out   fli ⟷ out   flj ⟷ out   out ⟷ out

Tiling hierarchy level 0:

[Figure: partially shown table whose last row reads ..., fli ⟷ out, flj ⟷ out, out ⟷ out; cells in both tables are annotated with the step numbers 1 to 4]
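
The table entries range over contiguous stencil ranges (first ⟷ last), which suggests a dynamic program over prefixes of the execution order. The following is a single-level simplification written for illustration, not the MODESTO algorithm itself: `cost[i][j]` is a hypothetical estimated time for fusing stencils i..j into one group, with infinity standing for groups that would exceed the memory capacity.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

struct Grouping {
    double time;                            // estimated total time
    std::vector<std::size_t> group_starts;  // index of the first stencil of each group
};

// best[end] = min over start < end of best[start] + cost[start][end - 1]
Grouping best_grouping(const std::vector<std::vector<double>>& cost) {
    const std::size_t n = cost.size();
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<double> best(n + 1, inf);
    std::vector<std::size_t> split(n + 1, 0);
    best[0] = 0.0;
    for (std::size_t end = 1; end <= n; ++end) {
        for (std::size_t start = 0; start < end; ++start) {
            const double t = best[start] + cost[start][end - 1];
            if (t < best[end]) {
                best[end] = t;
                split[end] = start;
            }
        }
    }
    // Walk the split points backwards to recover where each group starts.
    Grouping result;
    result.time = best[n];
    for (std::size_t end = n; end > 0; end = split[end]) {
        result.group_starts.insert(result.group_starts.begin(), split[end]);
    }
    return result;
}
```

Only the entries with i ≤ j of `cost` are read. In the actual optimization the cost of a range would come from the performance model, and the procedure would be repeated per tiling hierarchy level.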

Evaluation

[Figure: bar charts for the CPU experiments (i5-3330) and the GPU experiments (Tesla K20c), comparing the no fusion, hand-tuned, and optimized variants on the hd, uv, div, and uv&div benchmarks; bar labels range from 1 to 3.1]

[Figure: scatter plots of m = measured time [ms] against e = estimated time [ms] for both platforms, with fits m ~ 1.5e and m ~ 1.6e]

Conclusions
• We categorize data locality transformations for stencil programs
• We enumerate stencil program implementation variants using an algebra
• Our performance model estimates the stencil program execution time
• Using MODESTO we can automatically tune STELLA stencil programs
  • 2.0-3.1x speedup against naive implementations
  • 1.0-1.8x speedup against expert tuned implementations
