SLIDE 1

PolyMage: High-Performance Compilation for Heterogeneous Stencils

Uday Bondhugula

(with Ravi Teja Mullapudi and Vinay Vasista)
Department of Computer Science and Automation
Indian Institute of Science, Bangalore, India

Apr 15, 2015

Uday Bondhugula, Indian Institute of Science Dagstuhl seminar, Apr 12-17, 2015

SLIDE 2

Domain-Specific Languages

A DSL and compiler for optimizing image processing pipelines

SLIDE 3

Domain-Specific Languages

A DSL and compiler for optimizing image processing pipelines
Too specialized? Need to learn a new language!
A Dodo (highly specialized, but extinct)

SLIDE 4

Domain-Specific Languages

A DSL and compiler for optimizing image processing pipelines
Too specialized? Need to learn a new language!
But DSLs can be embedded in existing languages
Can grow and become more general-purpose
A DSL compiler can “see” across routines, allowing whole-program optimization
Generate optimized code for multiple targets
A Dodo (generalized to adapt)

SLIDE 5

Introduction

Image Processing Pipelines

Graphs of interconnected processing stages

Iin Ix Iy Ixx Ixy Iyy Sxx Syy Sxy det trace harris

Figure: Harris corner detection

SLIDE 6

Introduction

Computation Patterns: Point-wise (g → f)

f(x, y) = wr · g(x, y, 0) + wg · g(x, y, 1) + wb · g(x, y, 2)
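A point-wise stage reads only the same coordinate of its input. A minimal sketch in plain Python (the weights and RGB-tuple layout are illustrative choices, not from the slides):

```python
def to_gray(g, wr=0.299, wg=0.587, wb=0.114):
    # Point-wise: f(x, y) depends only on g(x, y)'s channels — no neighbours are read.
    return [[wr * r + wg * gc + wb * b for (r, gc, b) in row] for row in g]
```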

SLIDE 7

Introduction

Computation Patterns: Stencil (g → f)

f(x, y) = Σ_{σx=−1}^{+1} Σ_{σy=−1}^{+1} g(x + σx, y + σy) · w(σx, σy)
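A direct, unoptimized rendering of this pattern, assuming a 3×3 weight matrix indexed as w[σx+1][σy+1] and computing interior points only (indexing conventions are mine, not PolyMage's):

```python
def stencil_3x3(g, w):
    # f(x, y) = sum over sx, sy in {-1, 0, 1} of g(x+sx, y+sy) * w(sx, sy)
    H, W = len(g), len(g[0])
    return [[sum(g[x + sx][y + sy] * w[sx + 1][sy + 1]
                 for sx in (-1, 0, 1) for sy in (-1, 0, 1))
             for y in range(1, W - 1)]
            for x in range(1, H - 1)]
```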

SLIDE 8

Introduction

Computation Patterns: Downsample (g → f)

f(x, y) = Σ_{σx=−1}^{+1} Σ_{σy=−1}^{+1} g(2x + σx, 2y + σy) · w(σx, σy)
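Same shape as the stencil, but the input is sampled on a stride-2 grid. A sketch (the output range starts at 1 so all reads stay in bounds; indexing conventions are mine, not PolyMage's):

```python
def downsample_3x3(g, w):
    # f(x, y) = sum over sx, sy in {-1, 0, 1} of g(2x+sx, 2y+sy) * w(sx, sy)
    # Output x starts at 1 so that 2x + sx >= 1 even for sx = -1.
    H, W = len(g), len(g[0])
    out_h, out_w = (H - 1) // 2, (W - 1) // 2
    return [[sum(g[2 * x + sx][2 * y + sy] * w[sx + 1][sy + 1]
                 for sx in (-1, 0, 1) for sy in (-1, 0, 1))
             for y in range(1, out_w)]
            for x in range(1, out_h)]
```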

SLIDE 9

Introduction

Computation Patterns: Upsample (g → f)

f(x, y) = Σ_{σx=−1}^{+1} Σ_{σy=−1}^{+1} g((x + σx)/2, (y + σy)/2) · w(σx, σy, x, y)
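Note the weights depend on x and y: the interpolation coefficients change with the parity of the output coordinate. A 1-D sketch with linear-interpolation weights (an illustrative choice of w, not PolyMage's API):

```python
def upsample_1d(g, out_len):
    # f(x) reads g at (x + s) / 2: even x copies a sample, odd x averages the
    # two neighbouring samples (a linear-interpolation choice of weights).
    f = []
    for x in range(out_len):
        if x % 2 == 0:
            f.append(g[x // 2])
        else:
            f.append(0.5 * (g[x // 2] + g[x // 2 + 1]))
    return f
```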

SLIDE 10

Introduction

Example: Pyramid Blending pipeline

Figure: Pyramid blending pipeline — a DAG of downsample (↓) and upsample (↑) stages computing Gaussian and Laplacian pyramid levels, blended with a mask M

Image courtesy: Kyros Kutulakos

SLIDE 11

Introduction

Where are Image Processing Pipelines used?

On images uploaded to social networks like Facebook and Google+
On all camera-enabled devices
Everyday workloads, from data-center to mobile-device scales
Computational photography, computer vision, medical imaging, ...
(Google+ Auto Enhance)

SLIDE 12

Introduction

Naive vs Optimized Implementation

Harris corner detection (16 cores), execution time (ms):

    Seq      Par     Tuned
    354.56   53.91   12.3

Naive implementation in C
Naive parallelization – 7× (OpenMP, vector pragmas, icc)
Manual optimization – 29× (locality, parallelism, vector intrinsics)

Manually optimizing pipelines is hard

SLIDE 13

Introduction

Naive vs Optimized Implementation


Goal: performance levels of manual tuning, without the pain

SLIDE 14

Approach

Our Approach: PolyMage

High-level language (DSL embedded in Python)
– Allows expressing common patterns intuitively
– Enables compiler analysis and optimization
Automatic optimizing code generator
– Uses domain-specific cost models to apply complex combinations of scaling, alignment, tiling, and fusion to optimize for parallelism and locality

SLIDE 15

Approach

Harris Corner Detection

    R, C = Parameter(Int), Parameter(Int)
    I = Image(Float, [R+2, C+2])
    x, y = Variable(), Variable()
    row, col = Interval(0, R+1, 1), Interval(0, C+1, 1)
    c  = Condition(x,'>=',1) & Condition(x,'<=',R) & \
         Condition(y,'>=',1) & Condition(y,'<=',C)
    cb = Condition(x,'>=',2) & Condition(x,'<=',R-1) & \
         Condition(y,'>=',2) & Condition(y,'<=',C-1)
    Iy = Function(varDom=([x,y],[row,col]), Float)
    Iy.defn = [ Case(c, Stencil(I(x,y), 1.0/12,
                      [[-1, -2, -1],
                       [ 0,  0,  0],
                       [ 1,  2,  1]])) ]
    Ix = Function(varDom=([x,y],[row,col]), Float)
    Ix.defn = [ Case(c, Stencil(I(x,y), 1.0/12,
                      [[-1, 0, 1],
                       [-2, 0, 2],
                       [-1, 0, 1]])) ]
    Ixx = Function(varDom=([x,y],[row,col]), Float)
    Ixx.defn = [ Case(c, Ix(x,y) * Ix(x,y)) ]
    Iyy = Function(varDom=([x,y],[row,col]), Float)
    Iyy.defn = [ Case(c, Iy(x,y) * Iy(x,y)) ]
    Ixy = Function(varDom=([x,y],[row,col]), Float)
    Ixy.defn = [ Case(c, Ix(x,y) * Iy(x,y)) ]
    Sxx = Function(varDom=([x,y],[row,col]), Float)
    Syy = Function(varDom=([x,y],[row,col]), Float)
    Sxy = Function(varDom=([x,y],[row,col]), Float)
    for pair in [(Sxx, Ixx), (Syy, Iyy), (Sxy, Ixy)]:
        pair[0].defn = [ Case(cb, Stencil(pair[1], 1,
                            [[1, 1, 1],
                             [1, 1, 1],
                             [1, 1, 1]])) ]
    det = Function(varDom=([x,y],[row,col]), Float)
    d = Sxx(x,y) * Syy(x,y) - Sxy(x,y) * Sxy(x,y)
    det.defn = [ Case(cb, d) ]
    trace = Function(varDom=([x,y],[row,col]), Float)
    trace.defn = [ Case(cb, Sxx(x,y) + Syy(x,y)) ]
    harris = Function(varDom=([x,y],[row,col]), Float)
    coarsity = det(x,y) - 0.04 * trace(x,y) * trace(x,y)
    harris.defn = [ Case(cb, coarsity) ]

Iin Ix Iy Ixx Ixy Iyy Sxx Syy Sxy det trace harris

SLIDE 18

Compiler

Polyhedral Representation

x

f1 f2 fout

Domains

    x = Variable()
    fin = Image(Float, [18])
    f1 = Function(varDom=([x], [Interval(0, 17, 1)]), Float)
    f1.defn = [ fin(x) + 1 ]
    f2 = Function(varDom=([x], [Interval(1, 16, 1)]), Float)
    f2.defn = [ f1(x-1) + f1(x+1) ]
    fout = Function(varDom=([x], [Interval(2, 15, 1)]), Float)
    fout.defn = [ f2(x-1) + f2(x+1) ]
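What this 1-D pipeline computes can be checked with a few NumPy slices; the input values are a stand-in, only the intervals come from the DSL code above:

```python
import numpy as np

fin = np.arange(18, dtype=float)   # stand-in for the 18-sample input image
f1 = fin + 1.0                     # f1(x) over x in [0, 17]
f2 = f1[:-2] + f1[2:]              # f2(x) = f1(x-1) + f1(x+1), x in [1, 16]
fout = f2[:-2] + f2[2:]            # fout(x) = f2(x-1) + f2(x+1), x in [2, 15]
# Expanding: fout(x) = f1(x-2) + 2*f1(x) + f1(x+2), i.e. 4*x + 4 for this input.
```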

SLIDE 19

Compiler

Polyhedral Representation

x

f1 f2 fout

Dependence vectors

    Function                           Dependence vectors
    fout(x) = f2(x − 1) + f2(x + 1)    (1, 1), (1, −1)
    f2(x)   = f1(x − 1) + f1(x + 1)    (1, 1), (1, −1)
    f1(x)   = fin(x) + 1

SLIDE 20

Compiler

Polyhedral Representation

x

f1 f2 fout

Live-outs

    Function                           Dependence vectors
    fout(x) = f2(x − 1) + f2(x + 1)    (1, 1), (1, −1)
    f2(x)   = f1(x − 1) + f1(x + 1)    (1, 1), (1, −1)
    f1(x)   = fin(x) + 1

SLIDE 21

Compiler

Scheduling Criteria

x

f1 f2 fout

Parallelism Locality Storage

SLIDE 22

Compiler

Scheduling Criteria

x

f1 f2 fout

Default schedule

Parallelism Locality Storage

SLIDE 25

Compiler

Scheduling Criteria

x

f1 f2 fout

Parallelogram tiling

Parallelism Locality Storage

SLIDE 26

Compiler

Scheduling Criteria

x

f1 f2 fout

Overlap tiling

Parallelism Locality Storage Re-computation

SLIDE 27

Compiler

Overlapped Tiling for Heterogeneous Functions

x

f f↓1 f↓2

    Function                                  Schedule
    f↓2(x) = f↓1(2x − 1) + f↓1(2x + 1)        (x) → (2, x)
    f↓1(x) = f(2x − 1) + f(2x + 1) + f(2x)    (x) → (1, x)
    f(x)   = fin(x)                           (x) → (0, x)

Prior approaches for overlapped tiling only consider homogeneous time-iterated stencils

SLIDE 28

Compiler

Overlapped Tiling for Heterogeneous Functions

x

f f↓1 f↓2

    Function                                  Schedule
    f↓2(x) = f↓1(2x − 1) + f↓1(2x + 1)        (x) → (2, x)
    f↓1(x) = f(2x − 1) + f(2x + 1) + f(2x)    (x) → (1, x)
    f(x)   = fin(x)                           (x) → (0, x)

Cannot have a fixed tile shape when dependence vectors are non-constant

SLIDE 29

Compiler

Overlapped Tiling for Heterogeneous Functions

x

f f↓1 f↓2

    Function                                  Schedule
    f↓2(x) = f↓1(2x − 1) + f↓1(2x + 1)        (x) → (2, 4x)
    f↓1(x) = f(2x − 1) + f(2x + 1) + f(2x)    (x) → (1, 2x)
    f(x)   = fin(x)                           (x) → (0, x)

Scaling and aligning the schedules
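A quick way to see why scaling helps: once f↓1(x) is mapped to column 2x and f(y) stays at column y, every read becomes a constant offset in schedule space, so a fixed tile shape becomes possible. A small check for the f → f↓1 stage from the table:

```python
# f_down1(x) reads f(2x-1), f(2x), f(2x+1). With the scaled schedule, f_down1(x)
# sits at column 2x and f(y) at column y, so the column offsets are constant:
offsets = {2 * x - src for x in range(1, 6) for src in (2 * x - 1, 2 * x, 2 * x + 1)}
assert offsets == {-1, 0, 1}   # constant dependence vectors -> fixed tile shape works
```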

SLIDE 30

Compiler

Overlapped Tiling for Heterogeneous Functions

x

f f↓1 f↓2 f↑ fout

    Function                                  Schedule
    fout(x) = f↑(x/2)                         (x) → (4, x)
    f↑(x)   = f↓2(x/2) + f↓2(x/2 + 1)         (x) → (3, 2x)
    f↓2(x)  = f↓1(2x − 1) + f↓1(2x + 1)       (x) → (2, 4x)
    f↓1(x)  = f(2x − 1) + f(2x + 1) + f(2x)   (x) → (1, 2x)
    f(x)    = fin(x)                          (x) → (0, x)

SLIDE 31

Compiler

Overlapped Tiling for Heterogeneous Functions

x

f f↓1 f↓2 f↑ fout

Determining tile shape
– Conservative vs precise bounding faces

SLIDE 34

Compiler

Overlapped Tiling for Heterogeneous Functions

x

f f↓1 f↓2 f↑ fout

Significant reduction in redundant computation

SLIDE 35

Compiler

Overlapped Tiling for Heterogeneous Functions

Tile size τ, overlap O, height h
Trade-off between fusion height and overlap
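The trade-off can be sketched with a simplified 1-D cost model (my own back-of-the-envelope estimate, not PolyMage's actual cost model): with unit-slope dependences, the stage s levels below the tile top recomputes about 2·s extra points per tile, so redundant work grows with fusion height h while the useful work per tile stays h·τ.

```python
def redundant_fraction(tau, h, slope=1):
    # Fraction of tile work that is redundant when h stages are fused with
    # overlapped tiles of base size tau (simplified 1-D model, unit dependences).
    extra = sum(2 * slope * s for s in range(1, h + 1))   # widening per stage
    return extra / (h * tau + extra)
```

For τ = 256 the fraction stays small for modest h but grows roughly linearly with fusion height, which is why the compiler bounds overlap relative to tile size.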

SLIDE 38

Compiler

Overlapped Tiling for Heterogeneous Functions

x

f f↓1 f↓2 f↑ fout

Scratch pads
– Reduction in intermediate storage
– Better locality and reuse
– Privatized for each thread

SLIDE 41

Compiler

Grouping Pipeline Stages for Tiling

Figure: Pyramid blending pipeline — DAG of downsample (↓) and upsample (↑) stages
Image courtesy: Kyros Kutulakos

SLIDE 42

Compiler

Grouping

Iin Ix Iy Ixx Ixy Iyy Sxx Syy Sxy det trace harris

Grouping criteria

– Keep dependences short (alignment, scaling)
– Redundant computation vs reuse (overlap threshold, tile sizes, parameter estimates)
– Exponential number of valid groupings

SLIDE 46

Compiler

Grouping

Iin Ix Iy Ixx Ixy Iyy Sxx Syy Sxy det trace harris

Grouping heuristic

– Greedy iterative algorithm
– Only fuse stages which can be overlap-tiled
– Overlap relative to tile size less than the given threshold
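The heuristic above can be sketched as a greedy merge loop. This is an illustrative reconstruction under stated assumptions, not PolyMage's exact algorithm: `can_overlap_tile` and `overlap_ratio` stand in for the compiler's legality check and its overlap-to-tile-size estimate.

```python
def greedy_group(stages, preds, can_overlap_tile, overlap_ratio, threshold):
    # Greedy grouping sketch: repeatedly merge a stage into a producer's group
    # when the merged group can be overlap-tiled and its estimated
    # overlap-to-tile-size ratio stays below the given threshold.
    group = {s: {s} for s in stages}
    changed = True
    while changed:
        changed = False
        for s in stages:
            for p in preds.get(s, []):
                if group[p] is group[s]:
                    continue          # already in the same group
                merged = group[p] | group[s]
                if can_overlap_tile(merged) and overlap_ratio(merged) < threshold:
                    for m in merged:  # point every member at the merged group
                        group[m] = merged
                    changed = True
    return {frozenset(g) for g in group.values()}
```

With a loose threshold the whole chain fuses into one group; tightening the threshold splits off the stages whose overlap would be too costly.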

SLIDE 52

Compiler

Grouping

Figure: Grouping applied to the pyramid blending pipeline DAG
Image courtesy: Kyros Kutulakos

SLIDE 53

Compiler

Autotuning

Exploring the reuse vs re-computation trade-off
– Tile sizes and overlap threshold determine grouping
– Ideal tile sizes and grouping structure depend on machine characteristics
– Vary tile sizes and overlap threshold: seven tile sizes per dimension and three threshold values
– Small search space (7² × 3 = 147 points for 2-d tiling)
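That search space is small enough to enumerate exhaustively. The concrete tile sizes and threshold values below are illustrative; only the counts (seven per dimension, three thresholds) come from the slide:

```python
from itertools import product

tile_sizes = [8, 16, 32, 64, 128, 256, 512]   # seven candidate sizes per dimension (illustrative)
thresholds = [0.2, 0.4, 0.8]                  # three overlap thresholds (illustrative)

# For 2-d tiling: every (tile_x, tile_y, threshold) combination is a candidate.
configs = list(product(tile_sizes, tile_sizes, thresholds))
assert len(configs) == 7 * 7 * 3              # 147 configurations to time and compare
```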

SLIDE 54

Performance Evaluation

Experimental Setup

Seven representative benchmarks of varying structure and complexity

    Benchmark               Stages  Lines  Image size
    Unsharp Mask            4       16     2048×2048×3
    Bilateral Grid          7       43     2560×1536
    Harris Corner           11      43     6400×6400
    Camera Pipeline         32      86     2528×1920
    Pyramid Blending        44      71     2048×2048×3
    Multiscale Interpolate  49      41     2560×1536×3
    Local Laplacian         99      107    2560×1536×3

Target machine: Intel Xeon E5-2680 (16-core dual-socket NUMA, 2.7 GHz), using Intel C/C++ Compiler v14.0.1

SLIDE 55

Performance Evaluation

Effectiveness of Schedule Transformations

Speedup of grouped and tiled implementations over naively parallelized and vectorized ones

    Benchmark               Speedup
    Unsharp Mask            6.33×
    Bilateral Grid          3.27×
    Harris Corner           2.88×
    Camera Pipeline         1.36×
    Pyramid Blending        2.82×
    Multiscale Interpolate  2.13×
    Local Laplacian         1.57×

(16 threads and vectorization enabled)
PolyMage-optimized code exhibits better scaling and better vectorization efficiency due to better locality

SLIDE 56

Performance Evaluation

Comparison with manually tuned Halide schedules

Speedup of PolyMage over manually tuned Halide [PLDI’13] schedules

    Benchmark               Speedup over Halide
    Unsharp Mask            1.63×
    Bilateral Grid          0.89×
    Harris Corner           2.59×
    Camera Pipeline         1.04×
    Pyramid Blending        4.61×
    Multiscale Interpolate  1.81×
    Local Laplacian         1.54×

(16 threads and vectorization enabled)

SLIDE 57

Performance Evaluation

Comparison with Halide matched schedules

Speedup of PolyMage schedules specified in Halide over manually tuned Halide schedules

    Benchmark               Speedup
    Harris Corner           1.4×
    Pyramid Blending        2.16×
    Multiscale Interpolate  1.75×

(16 threads and vectorization enabled)

SLIDE 59

Performance Evaluation

Results Summary

PolyMage (fully automatic) provides a mean speedup of:
– 2.58× over code optimized/parallelized naively
– 5.39× over Halide/OpenTuner schedules
– 1.75× over Halide manually tuned schedules
Performance better than or comparable to manually optimized implementations
For the camera pipeline, performance comparable to an expert-optimized implementation (FCAM)
– Productivity: 86-line PolyMage input → 732 lines of C++ code
– Performance: only 10% slower than FCAM

SLIDE 60

Performance Evaluation

Conclusions and Acknowledgments

DSL optimization is the way to go

– Customize existing optimization techniques
– Productivity and expressiveness of an ultra-high-level language, with the performance of hand-tuned code

Acknowledgments

Joint work with Ravi Teja Mullapudi and Vinay Vasista
Thanks to Intel Labs, Bangalore for their hardware and software donation

More information at http://mcl.csa.iisc.ernet.in/polymage.html
