PolyMage: Automatic Optimization for Image Processing Pipelines


SLIDE 1

PolyMage: Automatic Optimization for Image Processing Pipelines

Ravi Teja Mullapudi Vinay Vasista Uday Bondhugula

CSA, Indian Institute of Science

June 27, 2016

SLIDE 2

Table of Contents

1. Image Processing Pipelines
2. Language
3. Compiler
4. Related Work
5. Performance Evaluation

SLIDE 4

Image Processing Pipelines - Data

Cameras and the Internet
  • Instagram: 60 million photos per day. http://instagram.com/press/
  • YouTube: 100 hours of video uploaded every minute. https://www.youtube.com/yt/press/statistics.html

Astronomy
  • Large Synoptic Survey Telescope (LSST): generates 30 TB of image data every night. http://lsst.org/lsst/google

Medical Imaging
  • Human Connectome Project: fMRI data for 68 subjects, 1.873 TB. http://www.humanconnectome.org/

SLIDE 5

Image Processing Pipelines - Computation

Synthesis, enhancement, and analysis of images

Applications: computational photography, computer vision, medical imaging

SLIDE 9

Image Processing Pipelines - Challenges

Need for Speed
  • Real-time processing
  • High resolution
  • Complex algorithms

Modern Architectures
  • Deep memory hierarchies
  • Parallelism
  • Heterogeneity

Libraries
  • OpenCV, CImg, MATLAB
  • Limited optimization
  • Architecture support

Hand Optimization
  • Requires expertise
  • Tedious and error prone
  • Not portable

SLIDE 10

Domain Specific Languages

Productivity, Performance and Portability

  • Decouple algorithms from schedules
  • Support common patterns in the domain
  • High performance compilation
slide-11
SLIDE 11

Image Processing Pipelines - Computation Patterns

Point-wise:
f(x, y) = g(x, y)

Stencil:
f(x, y) = ∑_{σx=−1..+1} ∑_{σy=−1..+1} g(x + σx, y + σy)

SLIDE 12

Image Processing Pipelines - Computation Patterns

Downsample:
f(x, y) = ∑_{σx=−1..+1} ∑_{σy=−1..+1} g(2x + σx, 2y + σy)

Upsample:
f(x, y) = ∑_{σx=−1..+1} ∑_{σy=−1..+1} g((x + σx)/2, (y + σy)/2)
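These windowed patterns can be sketched as ordinary Python over 2-D lists (a minimal illustration of the access patterns, not the PolyMage API; all function names here are made up):

```python
# Pure-Python sketches of the stencil, downsample, and upsample patterns.
# All three reduce a 3x3 window of g; only the access function differs.

def window_sum(g, cx, cy):
    # Sum of g over the 3x3 window centred at (cx, cy).
    return sum(g[cx + sx][cy + sy] for sx in (-1, 0, 1) for sy in (-1, 0, 1))

def stencil(g):
    # f(x, y) = sum over g(x + sx, y + sy)
    h, w = len(g), len(g[0])
    return [[window_sum(g, x, y) for y in range(1, w - 1)]
            for x in range(1, h - 1)]

def downsample(g):
    # f(x, y) = sum over g(2x + sx, 2y + sy): output is half the size.
    h, w = len(g), len(g[0])
    return [[window_sum(g, 2 * x, 2 * y) for y in range(1, w // 2)]
            for x in range(1, h // 2)]

def upsample(g):
    # f(x, y) = sum over g((x + sx) // 2, (y + sy) // 2): twice the size.
    h, w = len(g), len(g[0])
    return [[sum(g[(x + sx) // 2][(y + sy) // 2]
                 for sx in (-1, 0, 1) for sy in (-1, 0, 1))
             for y in range(1, 2 * w - 2)]
            for x in range(1, 2 * h - 2)]
```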

SLIDE 13

Image Processing Pipelines - Computation Patterns

Histogram:
f(g(x)) += 1

Time-iterated:
f(t, x, y) = g(f(t − 1, x, y))
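The last two patterns, sketched the same way (plain Python, illustrative only):

```python
# Histogram: f(g(x)) += 1 -- a data-dependent accumulation.
def histogram(g, bins=256):
    f = [0] * bins
    for row in g:
        for p in row:
            f[p] += 1          # the write location depends on the pixel value
    return f

# Time-iterated: f(t, x, y) = g(f(t-1, x, y)) -- apply g point-wise, t times.
def time_iterated(g, f0, steps):
    f = [row[:] for row in f0]
    for _ in range(steps):
        f = [[g(v) for v in row] for row in f]
    return f
```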

SLIDE 14

PolyMage Framework

DSL Spec → Build stage graph → Static bounds check → Inlining → Polyhedral representation → Default schedule → Alignment → Scaling → Grouping → Schedule transformation → Storage optimization → Code generation

SLIDE 15

Table of Contents

1. Image Processing Pipelines
2. Language
3. Compiler
4. Related Work
5. Performance Evaluation

SLIDE 16

Language Constructs

Parameter, Variable, Image, Interval, Function, Accumulator, Stencil, Condition, Select, Case, Accumulate

N = Parameter(Int)
x = Variable()
I = Image(Float, [N])
c1 = Condition(x, '>=', 1) & Condition(x, '<=', N-2)
c2 = Condition(x, '==', 0) | Condition(x, '==', N-1)
f = Function(varDom = ([x], Interval(0, N-1, 1)), Float)
f.defn = [ Case(c1, Stencil(I(x), 1.0/3, [[1, 1, 1]])),
           Case(c2, 0) ]

f : [0..N−1] → ℝ

f(x) = ∑_{σx=−1..+1} I(x + σx)/3    if 1 ≤ x ≤ N−2
f(x) = 0                            if x = 0 ∨ x = N−1
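The specification above reads directly as plain Python (a sketch of the semantics, not of PolyMage's generated code): interior points take the 3-point average of Case c1, and the two boundary points take the value 0 of Case c2.

```python
def blur1d(I):
    # f over [0..N-1]: boundary (x == 0 or x == N-1) -> 0, interior -> average.
    N = len(I)
    f = [0.0] * N                       # Case c2
    for x in range(1, N - 1):           # Case c1: 1 <= x <= N-2
        f[x] = (I[x - 1] + I[x] + I[x + 1]) / 3.0
    return f
```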

SLIDE 17

Language Constructs

R, C = Parameter(Int), Parameter(Int)
I = Image(UChar, [R, C])
x, y = Variable(), Variable()
row, col = Interval(0, R, 1), Interval(0, C, 1)
bins = Interval(0, 255, 1)
hist = Accumulator(redDom = ([x,y], [row,col]),
                   varDom = ([x], bins), Int)
hist.defn = Accumulate(hist(I(x,y)), 1, Sum)

hist : [0..255] → ℤ
hist(p) = |{(x, y) : I(x, y) = p}|

SLIDE 18

Unsharp Mask

R, C = Parameter(Int), Parameter(Int)
thresh, w = Parameter(Float), Parameter(Float)
x, y, c = Variable(), Variable(), Variable()
I = Image(Float, [3, R+4, C+4])
cr = Interval(0, 2, 1)
xr, xc = Interval(2, R+1, 1), Interval(0, C+3, 1)
yr, yc = Interval(2, R+1, 1), Interval(2, C+1, 1)
blurx = Function(varDom = ([c, x, y], [cr, xr, xc]), Float)
blurx.defn = [ Stencil(I(c, x, y), 1.0/16, [[1, 4, 6, 4, 1]]) ]
blury = Function(varDom = ([c, x, y], [cr, yr, yc]), Float)
blury.defn = [ Stencil(blurx(c, x, y), 1.0/16, [[1], [4], [6], [4], [1]]) ]
sharpen = Function(varDom = ([c, x, y], [cr, yr, yc]), Float)
sharpen.defn = [ I(c, x, y) * (1 + w) - blury(c, x, y) * w ]
masked = Function(varDom = ([c, x, y], [cr, yr, yc]), Float)
diff = Abs(I(c, x, y) - blury(c, x, y))
cond = Condition(diff, '<', thresh)
masked.defn = [ Select(cond, I(c, x, y), sharpen(c, x, y)) ]

[Figure: stage graph Iin → blurx → blury → sharpen → masked]
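The same pipeline can be mimicked on a single channel in plain Python (an illustrative sketch, not the DSL; the cropping by two pixels on each side mirrors the yr, yc intervals above):

```python
# Single-channel unsharp mask with a separable 5-tap binomial blur.
W5 = [1, 4, 6, 4, 1]   # weights sum to 16

def blur_x(img):
    h, w = len(img), len(img[0])
    return [[sum(W5[k] * img[x][y + k - 2] for k in range(5)) / 16.0
             for y in range(2, w - 2)] for x in range(h)]

def blur_y(img):
    h, w = len(img), len(img[0])
    return [[sum(W5[k] * img[x + k - 2][y] for k in range(5)) / 16.0
             for y in range(w)] for x in range(2, h - 2)]

def unsharp(img, w, thresh):
    by = blur_y(blur_x(img))          # blurred image, cropped by 2 per side
    out = []
    for x, row in enumerate(by):
        orow = []
        for y, b in enumerate(row):
            i = img[x + 2][y + 2]     # align original to the cropped blur
            sharp = i * (1 + w) - b * w
            # Select(cond, I, sharpen): keep the original where the
            # difference is below the threshold.
            orow.append(i if abs(i - b) < thresh else sharp)
        out.append(orow)
    return out
```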

SLIDE 19

Harris Corner Detection

R, C = Parameter(Int), Parameter(Int)
I = Image(Float, [R+2, C+2])
x, y = Variable(), Variable()
row, col = Interval(0, R+1, 1), Interval(0, C+1, 1)
c = Condition(x, '>=', 1) & Condition(x, '<=', R) & \
    Condition(y, '>=', 1) & Condition(y, '<=', C)
cb = Condition(x, '>=', 2) & Condition(x, '<=', R-1) & \
     Condition(y, '>=', 2) & Condition(y, '<=', C-1)
Iy = Function(varDom = ([x,y], [row,col]), Float)
Iy.defn = [ Case(c, Stencil(I(x,y), 1.0/12, [[-1, -2, -1],
                                             [ 0,  0,  0],
                                             [ 1,  2,  1]])) ]
Ix = Function(varDom = ([x,y], [row,col]), Float)
Ix.defn = [ Case(c, Stencil(I(x,y), 1.0/12, [[-1, 0, 1],
                                             [-2, 0, 2],
                                             [-1, 0, 1]])) ]
Ixx = Function(varDom = ([x,y], [row,col]), Float)
Ixx.defn = [ Case(c, Ix(x,y) * Ix(x,y)) ]
Iyy = Function(varDom = ([x,y], [row,col]), Float)
Iyy.defn = [ Case(c, Iy(x,y) * Iy(x,y)) ]
Ixy = Function(varDom = ([x,y], [row,col]), Float)
Ixy.defn = [ Case(c, Ix(x,y) * Iy(x,y)) ]
Sxx = Function(varDom = ([x,y], [row,col]), Float)
Syy = Function(varDom = ([x,y], [row,col]), Float)
Sxy = Function(varDom = ([x,y], [row,col]), Float)
for pair in [(Sxx, Ixx), (Syy, Iyy), (Sxy, Ixy)]:
    pair[0].defn = [ Case(cb, Stencil(pair[1], 1, [[1, 1, 1],
                                                   [1, 1, 1],
                                                   [1, 1, 1]])) ]
det = Function(varDom = ([x,y], [row,col]), Float)
d = Sxx(x,y) * Syy(x,y) - Sxy(x,y) * Sxy(x,y)
det.defn = [ Case(cb, d) ]
trace = Function(varDom = ([x,y], [row,col]), Float)
trace.defn = [ Case(cb, Sxx(x,y) + Syy(x,y)) ]
harris = Function(varDom = ([x,y], [row,col]), Float)
coarsity = det(x,y) - 0.04 * trace(x,y) * trace(x,y)
harris.defn = [ Case(cb, coarsity) ]

[Figure: stage graph from Iin through Ix, Iy, Ixx, Ixy, Iyy, Sxx, Syy, Sxy, det, trace to harris]

SLIDE 20

Pyramid Blending

[Figure: pyramid blending dataflow graph: downsampling (↓x, ↓y) builds Gaussian pyramids of the two inputs and the mask M, L nodes form Laplacian levels, X/+ nodes blend the levels, and upsampling (↑x, ↑y) collapses the result]

SLIDE 21

Table of Contents

1. Image Processing Pipelines
2. Language
3. Compiler
4. Related Work
5. Performance Evaluation

SLIDE 22

Compiler - Polyhedral Representation

x = Variable()
fin = Image(Float, [18])
f1 = Function(varDom = ([x], [Interval(0, 17, 1)]), Float)
f1.defn = [ fin(x) + 1 ]
f2 = Function(varDom = ([x], [Interval(1, 16, 1)]), Float)
f2.defn = [ f1(x-1) + f1(x+1) ]
fout = Function(varDom = ([x], [Interval(2, 15, 1)]), Float)
fout.defn = [ f2(x-1) * f2(x+1) ]

[Figure: iteration domains of f1(x), f2(x), fout(x)]

Domains

SLIDE 23

Compiler - Polyhedral Representation

[Figure: dependence vectors between the iteration domains of f1, f2, fout (pipeline of Slide 22)]

Dependence vectors

SLIDE 24

Compiler - Polyhedral Representation

[Figure: live-out points in the iteration domains of f1, f2, fout (pipeline of Slide 22)]

Live-outs

SLIDE 25

Compiler - Polyhedral Representation

Default schedule (pipeline of Slide 22):
f1(x) → (0, x)
f2(x) → (1, x)
fout(x) → (2, x)

SLIDE 26

Compiler - Polyhedral Representation

Skewed schedule:
f1(x) → (0, x)
f2(x) → (1, x + 1)
fout(x) → (2, x + 2)

SLIDE 27

Compiler - Scheduling Criteria

[Figure: execution order of f1(x), f2(x), fout(x) under the default schedule]

Default schedule. Criteria: parallelism, locality, storage

SLIDE 31

Compiler - Scheduling Criteria

[Figure: parallelogram tiling of f1, f2, fout]

Parallelogram tiling. Criteria: parallelism, locality, storage

SLIDE 32

Compiler - Scheduling Criteria

[Figure: split tiling of f1, f2, fout]

Split tiling. Criteria: parallelism, locality, storage

SLIDE 33

Compiler - Scheduling Criteria

[Figure: overlapped tiling of f1, f2, fout]

Overlap tiling. Criteria: parallelism, locality, storage, redundant computation

SLIDE 34

Compiler - Alignment and Scaling

f(x, y) = g(0, x, y) + g(1, x, y) + g(2, x, y)

Default schedules: f(x, y) → (1, x, y, 0); g(0, x, y) → (0, 0, x, y)
Dependence vector is non-constant: (1, x, y − x, −y)

Aligned schedules: f(x, y) → (1, 0, x, y); g(0, x, y) → (0, 0, x, y)
Dependence vector: (1, 0, 0, 0)

Alignment

SLIDE 35

Compiler - Alignment and Scaling

f(x) = g(2x) + g(2x + 1)

Default schedules: f(x) → (1, x); g(x) → (0, x)
Dependence vectors are non-constant: (1, −x), (1, −x − 1)

Scaled schedules: f(x) → (1, 2x); g(x) → (0, x)
Dependence vectors: (1, 0), (1, −1)

Scaling
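The effect of scaling is easy to check numerically. The sketch below (plain Python, illustrative) computes the set of dependence vectors for f(x) = g(2x) + g(2x+1) under both schedules; scaling f by 2 collapses a growing set of distances to the two constant vectors (1, 0) and (1, −1).

```python
# Dependence vector = consumer time point minus producer time point.
def dep_vectors(f_sched, g_sched, accesses, xs):
    vecs = set()
    for x in xs:
        t_f = f_sched(x)
        for a in accesses:                 # f(x) reads g(a(x))
            t_g = g_sched(a(x))
            vecs.add((t_f[0] - t_g[0], t_f[1] - t_g[1]))
    return vecs

accesses = [lambda x: 2 * x, lambda x: 2 * x + 1]   # f(x) = g(2x) + g(2x+1)

# Default schedules: f(x) -> (1, x), g(x) -> (0, x): distances grow with x.
default = dep_vectors(lambda x: (1, x), lambda x: (0, x), accesses, range(4))
# Scaled schedules: f(x) -> (1, 2x), g(x) -> (0, x): two constant vectors.
scaled = dep_vectors(lambda x: (1, 2 * x), lambda x: (0, x), accesses, range(4))
```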

SLIDE 36

Compiler - Overlapped Tiling

[Figure: iteration domains of f, f↓1, f↓2, f↑1, fout]

f(x) = fin(x)
f↓1(x) = f(2x − 1) + f(2x + 1)
f↓2(x) = f↓1(2x − 1) × f↓1(2x + 1)
f↑1(x) = f↓2(x/2) + f↓2(x/2 + 1)
fout(x) = f↑1(x/2)

Scaled schedules:
f(x) → (0, x)
f↓1(x) → (1, 2x)
f↓2(x) → (2, 4x)
f↑1(x) → (3, 2x)
fout(x) → (4, x)

SLIDE 37

Compiler - Overlapped Tiling

[Figure: overlapped tile shapes over the domains of f, f↓1, f↓2, f↑1, fout]

Tile shape
  • Conservative vs precise bounding faces
  • Significant reduction in redundant computation

SLIDE 42

Compiler - Overlapped Tiling

[Figure: an overlapped tile bounded by hyperplanes φl and φr, of height h and width τ]

Tile constraints

Default schedule: fk(i) → (sk), with overlap O = h ∗ (|l| + |r|)

τ ∗ T ≤ φl(sk) ≤ τ ∗ (T + 1) + O − 1  ∧  τ ∗ T ≤ φr(sk) ≤ τ ∗ (T + 1) + O − 1

Tiled schedule: fk(i) → (T, sk)
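A small sketch of what these constraints mean for a 1-D tile (illustrative Python; h, l, and r stand for the group height and the left/right dependence widths that enter the overlap formula):

```python
def tile_extent(T, tau, h, l, r):
    # Overlap added at each tile boundary: O = h * (|l| + |r|).
    O = h * (abs(l) + abs(r))
    lo = tau * T                       # tau * T <= phi(s)
    hi = tau * (T + 1) + O - 1         # phi(s) <= tau * (T + 1) + O - 1
    return lo, hi

# Consecutive tiles recompute exactly O shared points:
lo0, hi0 = tile_extent(0, tau=32, h=2, l=-1, r=1)
lo1, hi1 = tile_extent(1, tau=32, h=2, l=-1, r=1)
```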

SLIDE 43

Compiler - Overlapped Tiling

[Figure: scratch-pad regions over the domains of f, f↓1, f↓2, f↑1, fout]

Scratch pads
  • Storage for intermediate values
  • Reduction in intermediate storage
  • Better locality and reuse
  • Privatized for each thread
  • Only the last level can be live-out

slide-46
SLIDE 46

Compiler - Grouping

[Figure: Harris corner detection stage graph, from Iin to harris]

Fusion criteria
  • Constant dependences (alignment, scaling)
  • Redundant computation vs reuse (overlap, tile sizes, parameter estimates)
  • Live-out constraints

Fusion heuristic
  • Exponential number of valid groupings
  • Greedy iterative approach

SLIDE 50

Compiler - Grouping

[Figure: Harris corner detection stage graph, from Iin to harris]

Algorithm

Input: DAG of stages (S, E); parameter estimates P; tile sizes T; overlap threshold o_thresh
/* Initially, each stage is in a separate group */
 1  G ← ∅
 2  for s ∈ S do
 3      G ← G ∪ {s}
 4  repeat
 5      converge ← true
 6      cand_set ← getSingleChildGroups(G, E)
 7      ord_list ← sortGroupsBySize(cand_set, P)
 8      for each g in ord_list do
 9          child ← getChildGroup(g, E)
10          if hasConstantDependenceVectors(g, child) then
11              o_r ← estimateRelativeOverlap(g, child, T)
12              if o_r < o_thresh then
13                  merge ← g ∪ child
14                  G ← G − g − child
15                  G ← G ∪ merge
16                  converge ← false
17                  break
18  until converge = true
19  return G
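The heuristic condenses into a few lines of Python (a behavioural sketch only; the stage graph, size estimates, and overlap estimation are stand-ins for PolyMage's polyhedral machinery, and the constant-dependence check of line 10 is assumed to hold):

```python
def group_stages(child_of, overlap, o_thresh):
    """child_of: stage -> its consumer stage (None for live-outs).
    overlap(g, child): estimated relative overlap if group g is fused
    into the group of its consumer (a stand-in for line 11)."""
    groups = {frozenset([s]) for s in child_of}          # lines 1-3

    def child_group(g):
        # Group of g's unique consumer, if g has exactly one (line 9).
        outs = {child_of[s] for s in g} - set(g) - {None}
        if len(outs) != 1:
            return None
        out = outs.pop()
        return next(h for h in groups if out in h)

    converge = False
    while not converge:                                  # lines 4-18
        converge = True
        for g in sorted(groups, key=len):                # smallest first
            ch = child_group(g)
            if ch is not None and overlap(g, ch) < o_thresh:
                groups -= {g, ch}                        # lines 13-15
                groups.add(g | ch)
                converge = False
                break
    return groups
```

With a three-stage chain and a low overlap estimate, the loop greedily merges everything into one group.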

SLIDE 56

Compiler - Grouping

[Figure: grouping applied to the pyramid blending dataflow graph]

SLIDE 57

Compiler - Code Generation

void pipe_harris(int C, int R, float *I, float *&harris)
{
    /* Live-out allocation */
    harris = (float *) malloc(sizeof(float) * (2+R) * (2+C));
    #pragma omp parallel for
    for (int Ti = -1; Ti <= R / 32; Ti += 1) {
        /* Scratch pad allocation */
        float Ix[36][260], Iy[36][260];
        float Syy[36][260], Sxy[36][260], Sxx[36][260];
        for (int Tj = -1; Tj <= C / 256; Tj += 1) {
            int lbi = max(1, 32 * Ti);
            int ubi = min(R, 32 * Ti + 35);
            for (int i = lbi; i <= ubi; i += 1) {
                int lbj = max(1, 256 * Tj);
                int ubj = min(C, 256 * Tj + 259);
                #pragma ivdep
                for (int j = lbj; j <= ubj; j += 1) {
                    Iy(-32 * Ti + i, -256 * Tj + j);
                    Ix(-32 * Ti + i, -256 * Tj + j);
                }
            }
            ....
        }
    }
}

OpenMP • Vectorization • Scanning polyhedra

SLIDE 58

Auto Tuning

[Scatter plots: execution time on 1 core vs execution time on 16 cores across tuning candidates, for Camera Pipeline and Pyramid Blending]

Tuning
  • Tile sizes and overlap threshold determine grouping
  • Seven tile sizes for each dimension
  • Three threshold values
  • Small search space (7² × 3 for 2-D tiling)
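The search space is small enough to enumerate exhaustively (the concrete tile-size and threshold values below are made up for illustration; only the counts match the slide):

```python
from itertools import product

tile_sizes = [8, 16, 32, 64, 128, 256, 512]   # seven candidates per dimension
thresholds = [2, 4, 8]                        # three overlap thresholds

# Every (tile_x, tile_y, threshold) combination for 2-D tiling:
configs = list(product(tile_sizes, tile_sizes, thresholds))
```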

SLIDE 59

Table of Contents

1. Image Processing Pipelines
2. Language
3. Compiler
4. Related Work
5. Performance Evaluation

SLIDE 60

Related Work

Polyhedral compilation
  • Decoupled view of computation and schedules
  • Scheduling for affine loop nests
  • Do not target specific domains
  • Overlapped tiling: works for simple time-iterated stencils; PolyMage constructs overlapped tiles differently

Halide
  • Domain-specific language and compiler system
  • Effective for exploring schedules
  • Requires an explicit schedule specification

SLIDE 61

Halide

ImageParam input(UInt(16), 2);
Func blur_x("blur_x"), blur_y("blur_y");
Var x("x"), y("y"), xi("xi"), yi("yi");
// The algorithm
blur_x(x, y) = (input(x, y) + input(x+1, y) + input(x+2, y)) / 3;
blur_y(x, y) = (blur_x(x, y) + blur_x(x, y+1) + blur_x(x, y+2)) / 3;
// How to schedule it
blur_y.split(y, y, yi, 8).parallel(y).vectorize(x, 8);
blur_x.store_at(blur_y, y).compute_at(blur_y, yi).vectorize(x, 8);

Halide Blur Schedule

SLIDE 62

Table of Contents

1. Image Processing Pipelines
2. Language
3. Compiler
4. Related Work
5. Performance Evaluation

SLIDE 63

Experimental Setup

Processor           Intel Xeon E5-2680
Clock               2.7 GHz
Cores / socket      8
Total cores         16
L1 cache / core     32 KB
L2 cache / core     512 KB
L3 cache / socket   20 MB
Compiler            Intel C compiler (icc) 14.0.1
Compiler flags      -O3 -xhost
Linux kernel        3.8.0-38 (64-bit)

SLIDE 64

Evaluation Method

Benchmarks
  • Seven representative benchmarks
  • Varying structure and complexity

Comparison
  • Halide: tuned schedule, matched schedule
  • OpenCV: optimized library calls

SLIDE 65

Multiscale Interpolation

Speedup over PolyMage base (1 core):

Cores                   1      2      4      8     16
PolyMage(opt+vec)     2.24   4.03   6.57   9.82  12.54
PolyMage(opt)         1.28   2.38   3.93   6.18   9.43
PolyMage(base+vec)    1.46   2.57   4.07   5.70   5.88
PolyMage(base)        1.00   1.80   2.94   4.42   5.82
Halide(tuned+vec)     2.14   3.44   5.94   7.25   6.93
Halide(tuned)         1.77   2.99   5.29   7.13   6.92
Halide(matched+vec)   1.28   2.43   4.10   7.10  12.11
Halide(matched)       0.88   1.68   3.19   5.47   8.50

SLIDE 66

Harris Corner Detection

Speedup over PolyMage base (1 core):

Cores                   1      2      4      8     16
PolyMage(opt+vec)     3.74   7.35  12.85  24.02  46.78
PolyMage(opt)         1.12   2.24   4.03   7.64  15.18
PolyMage(base+vec)    2.47   4.31   7.83  12.22  16.22
PolyMage(base)        1.00   1.94   3.47   6.18  10.30
Halide(tuned+vec)     1.64   3.17   6.08  10.17  18.07
Halide(tuned)         0.93   1.84   3.51   6.05  10.30
Halide(matched+vec)   1.87   3.73   7.43  13.65  25.35
Halide(matched)       0.73   1.45   2.91   5.31   9.88

SLIDE 67

Camera Pipeline

Speedup over PolyMage base (1 core):

Cores                   1      2      4      8     16
PolyMage(opt+vec)     2.79   5.49   9.50  18.16  32.37
PolyMage(opt)         0.79   1.57   2.74   5.26  10.28
PolyMage(base+vec)    2.95   5.62   9.58  13.22  24.20
PolyMage(base)        1.00   1.98   3.61   6.50  12.16
Halide(tuned+vec)     4.82   7.30  12.32  21.26  31.28
Halide(tuned)         1.40   2.59   4.71   7.56  14.15
FCam                  2.42   4.83   9.55  17.49  33.75

SLIDE 68

Results Summary

Benchmark              Stages  Image size    Lines   PolyMage (1 / 4 / 16 cores)   OpenCV (1 core)   Speedup over H-tuned (16 cores)
Harris Corner            11    6400×6400       43    233.79 / 68.03 / 18.69             810.24        2.59×*
Pyramid Blending         44    2048×2048×3     71    196.99 / 57.84 / 21.91             197.28        4.61×*
Unsharp Mask              4    2048×2048×3     16    165.40 / 44.92 / 14.85             349.57        1.6×*
Local Laplacian          99    2560×1536×3    107    274.50 / 76.60 / 32.35                –          1.54×
Camera Pipeline          32    2528×1920       86     67.87 / 19.95 /  5.86                –          1.04×
Bilateral Grid            7    2560×1536       43     89.76 / 27.30 /  8.47                –          0.89×
Multiscale Interpol.     49    2560×1536×3     41    101.70 / 34.73 / 18.18                –          1.81×

Mean speedup of 1.27× over tuned Halide schedules
Comparable performance to a highly tuned camera pipeline implementation

SLIDE 69

Conclusion

DSL for high-performance image processing

Optimization techniques
  • Tiling
  • Storage optimization
  • Grouping and fusing

Effectiveness
  • Up to 1.81× better than tuned schedules
  • Matching hand-tuned performance
SLIDE 70

Acknowledgements

Halide, OpenCV, isl, islpy, and cgen
Intel, for their hardware

SLIDE 71

Thank You!

SLIDE 72

Pyramid Blending

Speedup over PolyMage base (1 core):

Cores                   1      2      4      8     16
PolyMage(opt+vec)     1.66   3.20   5.66   9.96  14.95
PolyMage(opt)         1.26   2.42   4.29   7.49  13.37
PolyMage(base+vec)    1.13   2.02   3.25   4.71   5.31
PolyMage(base)        1.00   1.82   2.99   4.55   5.35
Halide(tuned+vec)     0.56   1.00   1.83   2.71   3.24
Halide(tuned)         0.66   1.16   2.08   2.98   3.43
Halide(matched+vec)   1.24   2.12   3.70   5.72   7.00
Halide(matched)       0.76   1.45   2.64   4.31   5.98

SLIDE 73

Bilateral Grid

Speedup over PolyMage base (1 core):

Cores                   1      2      4      8     16
PolyMage(opt+vec)     1.15   2.17   3.77   6.55  12.16
PolyMage(opt)         0.82   1.61   2.73   4.74   8.99
PolyMage(base+vec)    1.65   3.17   3.42   3.56   3.72
PolyMage(base)        1.00   1.97   2.15   2.28   2.42
Halide(tuned+vec)     1.60   2.92   5.40   8.55  13.68
Halide(tuned)         1.13   2.11   4.03   6.72  10.37

SLIDE 74

Local Laplacian Filter

Speedup over PolyMage base (1 core):

Cores                   1      2      4      8     16
PolyMage(opt+vec)     1.62   3.41   5.80   9.41  13.73
PolyMage(opt)         1.02   1.99   3.48   6.10  10.81
PolyMage(base+vec)    1.58   2.93   4.71   6.41   8.74
PolyMage(base)        1.00   1.92   3.30   5.23   7.39
Halide(tuned+vec)     1.04   1.99   3.68   6.18   8.93
Halide(tuned)         0.55   1.07   2.08   3.61   5.71