SLIDE 1

PolyMage: High-Performance Compilation for Heterogeneous Stencils

Uday Bondhugula

(with Ravi Teja Mullapudi and Vinay Vasista)
Department of Computer Science and Automation
Indian Institute of Science, Bangalore, India

Apr 15, 2015

Uday Bondhugula, Indian Institute of Science Dagstuhl seminar, Apr 12-17, 2015

SLIDE 2

Domain-Specific Languages

A DSL and compiler for optimizing image processing pipelines

SLIDE 3

Domain-Specific Languages

A DSL and compiler for optimizing image processing pipelines
Too specialized? Need to learn a new language!
A Dodo (highly specialized, but extinct)

SLIDE 4

Domain-Specific Languages

A DSL and compiler for optimizing image processing pipelines
Too specialized? Need to learn a new language!
But DSLs can be embedded in existing languages
Can grow and become more general-purpose
A DSL compiler can “see” across routines, allowing whole-program optimization
Generate optimized code for multiple targets
A Dodo (generalized to adapt)

SLIDE 5

Introduction

Image Processing Pipelines

Graphs of interconnected processing stages

Iin Ix Iy Ixx Ixy Iyy Sxx Syy Sxy det trace harris

Figure: Harris corner detection

SLIDE 6

Introduction

Computation Patterns: Point-wise (g → f)

f(x, y) = wr · g(x, y, 0) + wg · g(x, y, 1) + wb · g(x, y, 2)
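A point-wise stage reads only the same coordinate of its input. A minimal sketch in plain Python (the weights and RGB-tuple layout are illustrative choices, not from the slides):

```python
def to_gray(g, wr=0.299, wg=0.587, wb=0.114):
    # Point-wise: f(x, y) depends only on g(x, y)'s channels — no neighbours are read.
    return [[wr * r + wg * gc + wb * b for (r, gc, b) in row] for row in g]
```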

SLIDE 7

Introduction

Computation Patterns: Stencil (g → f)

f(x, y) = Σ_{σx=−1}^{+1} Σ_{σy=−1}^{+1} g(x + σx, y + σy) · w(σx, σy)
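A direct, unoptimized rendering of this pattern, assuming a 3×3 weight matrix indexed as w[σx+1][σy+1] and computing interior points only (indexing conventions are mine, not PolyMage's):

```python
def stencil_3x3(g, w):
    # f(x, y) = sum over sx, sy in {-1, 0, 1} of g(x+sx, y+sy) * w(sx, sy)
    H, W = len(g), len(g[0])
    return [[sum(g[x + sx][y + sy] * w[sx + 1][sy + 1]
                 for sx in (-1, 0, 1) for sy in (-1, 0, 1))
             for y in range(1, W - 1)]
            for x in range(1, H - 1)]
```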

SLIDE 8

Introduction

Computation Patterns: Downsample (g → f)

f(x, y) = Σ_{σx=−1}^{+1} Σ_{σy=−1}^{+1} g(2x + σx, 2y + σy) · w(σx, σy)
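Same shape as the stencil, but the input is sampled on a stride-2 grid. A sketch (the output range starts at 1 so all reads stay in bounds; indexing conventions are mine, not PolyMage's):

```python
def downsample_3x3(g, w):
    # f(x, y) = sum over sx, sy in {-1, 0, 1} of g(2x+sx, 2y+sy) * w(sx, sy)
    # Output x starts at 1 so that 2x + sx >= 1 even for sx = -1.
    H, W = len(g), len(g[0])
    out_h, out_w = (H - 1) // 2, (W - 1) // 2
    return [[sum(g[2 * x + sx][2 * y + sy] * w[sx + 1][sy + 1]
                 for sx in (-1, 0, 1) for sy in (-1, 0, 1))
             for y in range(1, out_w)]
            for x in range(1, out_h)]
```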

SLIDE 9

Introduction

Computation Patterns: Upsample (g → f)

f(x, y) = Σ_{σx=−1}^{+1} Σ_{σy=−1}^{+1} g((x + σx)/2, (y + σy)/2) · w(σx, σy, x, y)
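Note the weights depend on x and y: the interpolation coefficients change with the parity of the output coordinate. A 1-D sketch with linear-interpolation weights (an illustrative choice of w, not PolyMage's API):

```python
def upsample_1d(g, out_len):
    # f(x) reads g at (x + s) / 2: even x copies a sample, odd x averages the
    # two neighbouring samples (a linear-interpolation choice of weights).
    f = []
    for x in range(out_len):
        if x % 2 == 0:
            f.append(g[x // 2])
        else:
            f.append(0.5 * (g[x // 2] + g[x // 2 + 1]))
    return f
```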

SLIDE 10

Introduction

Example: Pyramid Blending pipeline

Figure: Pyramid blending pipeline — a DAG of downsample (↓) and upsample (↑) stages computing Gaussian and Laplacian pyramid levels, blended with a mask M

Image courtesy: Kyros Kutulakos

SLIDE 11

Introduction

Where are Image Processing Pipelines used?

On images uploaded to social networks like Facebook and Google+
On all camera-enabled devices
Everyday workloads, from data-center to mobile-device scales
Computational photography, computer vision, medical imaging, ...
(Google+ Auto Enhance)

SLIDE 12

Introduction

Naive vs Optimized Implementation

Harris corner detection (16 cores), execution time (ms):

    Seq      Par     Tuned
    354.56   53.91   12.3

Naive implementation in C
Naive parallelization – 7× (OpenMP, vector pragmas, icc)
Manual optimization – 29× (locality, parallelism, vector intrinsics)

Manually optimizing pipelines is hard

SLIDE 13

Introduction

Naive vs Optimized Implementation


Goal: performance levels of manual tuning, without the pain

SLIDE 14

Approach

Our Approach: PolyMage

High-level language (DSL embedded in Python)
– Allows expressing common patterns intuitively
– Enables compiler analysis and optimization
Automatic optimizing code generator
– Uses domain-specific cost models to apply complex combinations of scaling, alignment, tiling, and fusion to optimize for parallelism and locality

SLIDE 15

Approach

Harris Corner Detection

    R, C = Parameter(Int), Parameter(Int)
    I = Image(Float, [R+2, C+2])
    x, y = Variable(), Variable()
    row, col = Interval(0, R+1, 1), Interval(0, C+1, 1)
    c  = Condition(x,'>=',1) & Condition(x,'<=',R) & \
         Condition(y,'>=',1) & Condition(y,'<=',C)
    cb = Condition(x,'>=',2) & Condition(x,'<=',R-1) & \
         Condition(y,'>=',2) & Condition(y,'<=',C-1)
    Iy = Function(varDom=([x,y],[row,col]), Float)
    Iy.defn = [ Case(c, Stencil(I(x,y), 1.0/12,
                      [[-1, -2, -1],
                       [ 0,  0,  0],
                       [ 1,  2,  1]])) ]
    Ix = Function(varDom=([x,y],[row,col]), Float)
    Ix.defn = [ Case(c, Stencil(I(x,y), 1.0/12,
                      [[-1, 0, 1],
                       [-2, 0, 2],
                       [-1, 0, 1]])) ]
    Ixx = Function(varDom=([x,y],[row,col]), Float)
    Ixx.defn = [ Case(c, Ix(x,y) * Ix(x,y)) ]
    Iyy = Function(varDom=([x,y],[row,col]), Float)
    Iyy.defn = [ Case(c, Iy(x,y) * Iy(x,y)) ]
    Ixy = Function(varDom=([x,y],[row,col]), Float)
    Ixy.defn = [ Case(c, Ix(x,y) * Iy(x,y)) ]
    Sxx = Function(varDom=([x,y],[row,col]), Float)
    Syy = Function(varDom=([x,y],[row,col]), Float)
    Sxy = Function(varDom=([x,y],[row,col]), Float)
    for pair in [(Sxx, Ixx), (Syy, Iyy), (Sxy, Ixy)]:
        pair[0].defn = [ Case(cb, Stencil(pair[1], 1,
                            [[1, 1, 1],
                             [1, 1, 1],
                             [1, 1, 1]])) ]
    det = Function(varDom=([x,y],[row,col]), Float)
    d = Sxx(x,y) * Syy(x,y) - Sxy(x,y) * Sxy(x,y)
    det.defn = [ Case(cb, d) ]
    trace = Function(varDom=([x,y],[row,col]), Float)
    trace.defn = [ Case(cb, Sxx(x,y) + Syy(x,y)) ]
    harris = Function(varDom=([x,y],[row,col]), Float)
    coarsity = det(x,y) - 0.04 * trace(x,y) * trace(x,y)
    harris.defn = [ Case(cb, coarsity) ]

Iin Ix Iy Ixx Ixy Iyy Sxx Syy Sxy det trace harris

SLIDE 18

Compiler

Polyhedral Representation

x

f1 f2 fout

Domains

    x = Variable()
    fin = Image(Float, [18])
    f1 = Function(varDom=([x], [Interval(0, 17, 1)]), Float)
    f1.defn = [ fin(x) + 1 ]
    f2 = Function(varDom=([x], [Interval(1, 16, 1)]), Float)
    f2.defn = [ f1(x-1) + f1(x+1) ]
    fout = Function(varDom=([x], [Interval(2, 15, 1)]), Float)
    fout.defn = [ f2(x-1) + f2(x+1) ]
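What this 1-D pipeline computes can be checked with a few NumPy slices; the input values are a stand-in, only the intervals come from the DSL code above:

```python
import numpy as np

fin = np.arange(18, dtype=float)   # stand-in for the 18-sample input image
f1 = fin + 1.0                     # f1(x) over x in [0, 17]
f2 = f1[:-2] + f1[2:]              # f2(x) = f1(x-1) + f1(x+1), x in [1, 16]
fout = f2[:-2] + f2[2:]            # fout(x) = f2(x-1) + f2(x+1), x in [2, 15]
# Expanding: fout(x) = f1(x-2) + 2*f1(x) + f1(x+2), i.e. 4*x + 4 for this input.
```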

SLIDE 19

Compiler

Polyhedral Representation

x

f1 f2 fout

Dependence vectors

    Function                           Dependence vectors
    fout(x) = f2(x − 1) + f2(x + 1)    (1, 1), (1, −1)
    f2(x)   = f1(x − 1) + f1(x + 1)    (1, 1), (1, −1)
    f1(x)   = fin(x) + 1

SLIDE 20

Compiler

Polyhedral Representation

x

f1 f2 fout

Live-outs

    Function                           Dependence vectors
    fout(x) = f2(x − 1) + f2(x + 1)    (1, 1), (1, −1)
    f2(x)   = f1(x − 1) + f1(x + 1)    (1, 1), (1, −1)
    f1(x)   = fin(x) + 1

SLIDE 21

Compiler

Scheduling Criteria

x

f1 f2 fout

Parallelism Locality Storage

SLIDE 22

Compiler

Scheduling Criteria

x

f1 f2 fout

Default schedule

Parallelism Locality Storage

SLIDE 25

Compiler

Scheduling Criteria

x

f1 f2 fout

Parallelogram tiling

Parallelism Locality Storage

SLIDE 26

Compiler

Scheduling Criteria

x

f1 f2 fout

Overlap tiling

Parallelism Locality Storage Re-computation

SLIDE 27

Compiler

Overlapped Tiling for Heterogeneous Functions

x

f f↓1 f↓2

    Function                                  Schedule
    f↓2(x) = f↓1(2x − 1) + f↓1(2x + 1)        (x) → (2, x)
    f↓1(x) = f(2x − 1) + f(2x + 1) + f(2x)    (x) → (1, x)
    f(x)   = fin(x)                           (x) → (0, x)

Prior approaches for overlapped tiling only consider homogeneous time-iterated stencils

SLIDE 28

Compiler

Overlapped Tiling for Heterogeneous Functions

x

f f↓1 f↓2

    Function                                  Schedule
    f↓2(x) = f↓1(2x − 1) + f↓1(2x + 1)        (x) → (2, x)
    f↓1(x) = f(2x − 1) + f(2x + 1) + f(2x)    (x) → (1, x)
    f(x)   = fin(x)                           (x) → (0, x)

Cannot have a fixed tile shape when dependence vectors are non-constant

SLIDE 29

Compiler

Overlapped Tiling for Heterogeneous Functions

x

f f↓1 f↓2

    Function                                  Schedule
    f↓2(x) = f↓1(2x − 1) + f↓1(2x + 1)        (x) → (2, 4x)
    f↓1(x) = f(2x − 1) + f(2x + 1) + f(2x)    (x) → (1, 2x)
    f(x)   = fin(x)                           (x) → (0, x)

Scaling and aligning the schedules
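A quick way to see why scaling helps: once f↓1(x) is mapped to column 2x and f(y) stays at column y, every read becomes a constant offset in schedule space, so a fixed tile shape becomes possible. A small check for the f → f↓1 stage from the table:

```python
# f_down1(x) reads f(2x-1), f(2x), f(2x+1). With the scaled schedule, f_down1(x)
# sits at column 2x and f(y) at column y, so the column offsets are constant:
offsets = {2 * x - src for x in range(1, 6) for src in (2 * x - 1, 2 * x, 2 * x + 1)}
assert offsets == {-1, 0, 1}   # constant dependence vectors -> fixed tile shape works
```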

SLIDE 30

Compiler

Overlapped Tiling for Heterogeneous Functions

x

f f↓1 f↓2 f↑ fout

    Function                                  Schedule
    fout(x) = f↑(x/2)                         (x) → (4, x)
    f↑(x)   = f↓2(x/2) + f↓2(x/2 + 1)         (x) → (3, 2x)
    f↓2(x)  = f↓1(2x − 1) + f↓1(2x + 1)       (x) → (2, 4x)
    f↓1(x)  = f(2x − 1) + f(2x + 1) + f(2x)   (x) → (1, 2x)
    f(x)    = fin(x)                          (x) → (0, x)

SLIDE 31

Compiler

Overlapped Tiling for Heterogeneous Functions

x

f f↓1 f↓2 f↑ fout

Determining tile shape
– Conservative vs precise bounding faces

SLIDE 34

Compiler

Overlapped Tiling for Heterogeneous Functions

x

f f↓1 f↓2 f↑ fout

Significant reduction in redundant computation

SLIDE 35

Compiler

Overlapped Tiling for Heterogeneous Functions

Tile size τ, overlap O, height h
Trade-off between fusion height and overlap
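The trade-off can be sketched with a simplified 1-D cost model (my own back-of-the-envelope estimate, not PolyMage's actual cost model): with unit-slope dependences, the stage s levels below the tile top recomputes about 2·s extra points per tile, so redundant work grows with fusion height h while the useful work per tile stays h·τ.

```python
def redundant_fraction(tau, h, slope=1):
    # Fraction of tile work that is redundant when h stages are fused with
    # overlapped tiles of base size tau (simplified 1-D model, unit dependences).
    extra = sum(2 * slope * s for s in range(1, h + 1))   # widening per stage
    return extra / (h * tau + extra)
```

For τ = 256 the fraction stays small for modest h but grows roughly linearly with fusion height, which is why the compiler bounds overlap relative to tile size.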

SLIDE 38

Compiler

Overlapped Tiling for Heterogeneous Functions

x

f f↓1 f↓2 f↑ fout

Scratch pads
– Reduction in intermediate storage
– Better locality and reuse
– Privatized for each thread

SLIDE 41

Compiler

Grouping Pipeline Stages for Tiling

Figure: Pyramid blending pipeline — DAG of downsample (↓) and upsample (↑) stages
Image courtesy: Kyros Kutulakos

SLIDE 42

Compiler

Grouping

Iin Ix Iy Ixx Ixy Iyy Sxx Syy Sxy det trace harris

Grouping criteria

– Keep dependences short (alignment, scaling)
– Redundant computation vs reuse (overlap threshold, tile sizes, parameter estimates)
– Exponential number of valid groupings

SLIDE 46

Compiler

Grouping

Iin Ix Iy Ixx Ixy Iyy Sxx Syy Sxy det trace harris

Grouping heuristic

– Greedy iterative algorithm
– Only fuse stages which can be overlap-tiled
– Overlap relative to tile size less than the given threshold
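The heuristic above can be sketched as a greedy merge loop. This is an illustrative reconstruction under stated assumptions, not PolyMage's exact algorithm: `can_overlap_tile` and `overlap_ratio` stand in for the compiler's legality check and its overlap-to-tile-size estimate.

```python
def greedy_group(stages, preds, can_overlap_tile, overlap_ratio, threshold):
    # Greedy grouping sketch: repeatedly merge a stage into a producer's group
    # when the merged group can be overlap-tiled and its estimated
    # overlap-to-tile-size ratio stays below the given threshold.
    group = {s: {s} for s in stages}
    changed = True
    while changed:
        changed = False
        for s in stages:
            for p in preds.get(s, []):
                if group[p] is group[s]:
                    continue          # already in the same group
                merged = group[p] | group[s]
                if can_overlap_tile(merged) and overlap_ratio(merged) < threshold:
                    for m in merged:  # point every member at the merged group
                        group[m] = merged
                    changed = True
    return {frozenset(g) for g in group.values()}
```

With a loose threshold the whole chain fuses into one group; tightening the threshold splits off the stages whose overlap would be too costly.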

SLIDE 52

Compiler

Grouping

Figure: Grouping applied to the pyramid blending pipeline DAG
Image courtesy: Kyros Kutulakos

SLIDE 53

Compiler

Autotuning

Exploring the reuse vs re-computation trade-off
– Tile sizes and overlap threshold determine grouping
– Ideal tile sizes and grouping structure depend on machine characteristics
– Vary tile sizes and overlap threshold: seven tile sizes per dimension and three threshold values
– Small search space (7² × 3 = 147 points for 2-d tiling)
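That search space is small enough to enumerate exhaustively. The concrete tile sizes and threshold values below are illustrative; only the counts (seven per dimension, three thresholds) come from the slide:

```python
from itertools import product

tile_sizes = [8, 16, 32, 64, 128, 256, 512]   # seven candidate sizes per dimension (illustrative)
thresholds = [0.2, 0.4, 0.8]                  # three overlap thresholds (illustrative)

# For 2-d tiling: every (tile_x, tile_y, threshold) combination is a candidate.
configs = list(product(tile_sizes, tile_sizes, thresholds))
assert len(configs) == 7 * 7 * 3              # 147 configurations to time and compare
```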

SLIDE 54

Performance Evaluation

Experimental Setup

Seven representative benchmarks of varying structure and complexity

    Benchmark               Stages  Lines  Image size
    Unsharp Mask            4       16     2048×2048×3
    Bilateral Grid          7       43     2560×1536
    Harris Corner           11      43     6400×6400
    Camera Pipeline         32      86     2528×1920
    Pyramid Blending        44      71     2048×2048×3
    Multiscale Interpolate  49      41     2560×1536×3
    Local Laplacian         99      107    2560×1536×3

Target machine: Intel Xeon E5-2680 (16-core dual-socket NUMA, 2.7 GHz), using Intel C/C++ Compiler v14.0.1

SLIDE 55

Performance Evaluation

Effectiveness of Schedule Transformations

Speedup of grouped and tiled implementations over naively parallelized and vectorized ones

    Benchmark               Speedup
    Unsharp Mask            6.33×
    Bilateral Grid          3.27×
    Harris Corner           2.88×
    Camera Pipeline         1.36×
    Pyramid Blending        2.82×
    Multiscale Interpolate  2.13×
    Local Laplacian         1.57×

(16 threads and vectorization enabled)
PolyMage-optimized code exhibits better scaling and better vectorization efficiency due to better locality

SLIDE 56

Performance Evaluation

Comparison with manually tuned Halide schedules

Speedup of PolyMage over manually tuned Halide [PLDI’13] schedules

    Benchmark               Speedup over Halide
    Unsharp Mask            1.63×
    Bilateral Grid          0.89×
    Harris Corner           2.59×
    Camera Pipeline         1.04×
    Pyramid Blending        4.61×
    Multiscale Interpolate  1.81×
    Local Laplacian         1.54×

(16 threads and vectorization enabled)

SLIDE 57

Performance Evaluation

Comparison with Halide matched schedules

Speedup of PolyMage schedules specified in Halide over manually tuned Halide schedules

    Benchmark               Speedup
    Harris Corner           1.4×
    Pyramid Blending        2.16×
    Multiscale Interpolate  1.75×

(16 threads and vectorization enabled)

SLIDE 59

Performance Evaluation

Results Summary

PolyMage (fully automatic) provides a mean speedup of:
– 2.58× over code optimized/parallelized naively
– 5.39× over Halide/OpenTuner schedules
– 1.75× over Halide manually tuned schedules
Performance better than or comparable to manually optimized implementations
For the camera pipeline, performance comparable to an expert-optimized implementation (FCAM)
– Productivity: 86-line PolyMage input → 732 lines of C++ code
– Performance: only 10% slower than FCAM

SLIDE 60

Performance Evaluation

Conclusions and Acknowledgments

DSL optimization is the way to go

– Customize existing optimization techniques
– Productivity and expressiveness of an ultra-high-level language, with the performance of hand-tuned code

Acknowledgments

Joint work with Ravi Teja Mullapudi and Vinay Vasista
Thanks to Intel Labs, Bangalore for their hardware and software donation

More information at http://mcl.csa.iisc.ernet.in/polymage.html
