Automatically Scheduling Halide Image Processing Pipelines


SLIDE 1

Automatically Scheduling Halide Image Processing Pipelines

Ravi Teja Mullapudi (CMU) Andrew Adams (Google) Dillon Sharlet (Google) Jonathan Ragan-Kelley (Stanford) Kayvon Fatahalian (CMU)

SLIDE 2

High demand for efficient image processing

SLIDE 3

Scheduling image processing algorithms

Algorithm description

Var x, y;
Func f, g, h;
g(x, y) = f(x, y) + …
h(x, y) = g(x, y) + …

SLIDES 4-6

Scheduling image processing algorithms

Algorithm description → Schedule (machine mapping) → Implementation

Var x, y;
Func f, g, h;
g(x, y) = f(x, y) + …
h(x, y) = g(x, y) + …

Schedule:
  parallelize y loop
  tile output dims
  vectorize x loop

Few image processing algorithm developers have the skill set to author highly optimized schedules, yet a good schedule can yield a > 10x faster implementation.
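In Halide (a C++-embedded DSL), the algorithm and the schedule are authored separately. Below is a minimal runnable sketch of that separation; the concrete expressions and the tile/vector factors are placeholder assumptions, since the slide elides them with "…".

    // Minimal sketch of Halide's algorithm/schedule separation.
    #include "Halide.h"
    using namespace Halide;

    int main() {
        Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");
        Func f("f"), g("g"), h("h");

        // Algorithm (placeholder expressions for the slide's "…"):
        f(x, y) = cast<float>(x + y);
        g(x, y) = f(x, y) + 1.0f;
        h(x, y) = g(x, y) + 2.0f;

        // Schedule: tile output dims, parallelize, vectorize.
        h.tile(x, y, xo, yo, xi, yi, 64, 64)
         .parallel(yo)
         .vectorize(xi, 8);

        h.realize({512, 512});
        return 0;
    }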

SLIDE 7

Contribution: automatic scheduling of image processing pipelines

Image processing algorithm developers → Scheduling Algorithm → > 10x Faster Implementation

Generates expert-quality schedules in seconds.

SLIDE 8

Why is it challenging to schedule image processing pipelines?

SLIDES 9-11

Algorithm: 3x3 box blur (in → bx → out)

bx(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y)) / 3
out(x, y) = (bx(x, y-1) + bx(x, y) + bx(x, y+1)) / 3
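The two-stage blur maps directly onto Halide code. A minimal runnable sketch follows; the image size, float element type, and one-pixel apron handling are assumptions:

    // Two-stage 3x3 box blur in Halide.
    #include "Halide.h"
    using namespace Halide;

    int main() {
        Buffer<float> in(1026, 1026);  // assumed input with a 1-pixel apron
        Var x("x"), y("y");
        Func bx("bx"), out("out");

        // Horizontal then vertical 3-tap box filter.
        bx(x, y)  = (in(x - 1, y) + in(x, y) + in(x + 1, y)) / 3.0f;
        out(x, y) = (bx(x, y - 1) + bx(x, y) + bx(x, y + 1)) / 3.0f;

        // Realize the interior region so the stencil stays in bounds.
        Buffer<float> result(1024, 1024);
        result.set_min(1, 1);
        out.realize(result);
        return 0;
    }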

SLIDES 12-18

A basic (slow) schedule

compute all pixels of bx, in parallel
compute all pixels of out, in parallel

[Figure: bx is materialized as a full-size intermediate buffer between in and out]

Low performance: bandwidth bound, since the intermediate bx is a large in-memory buffer.
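In Halide syntax, this basic schedule corresponds to computing each producer at the root of the pipeline, with rows processed in parallel. A hedged sketch, applied to the blur code above:

    // Basic (slow) schedule: materialize all of bx before computing out.
    bx.compute_root().parallel(y);  // full-size intermediate buffer
    out.parallel(y);                // out is the pipeline output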

SLIDES 19-26

Tiling to improve data locality

for each 3x3 tile, in parallel:
    compute required pixels of bx
    compute pixels of out in tile

Only the pixels of bx required by each 3x3 output tile are computed, so the intermediate buffer fits in fast on-chip storage.

SLIDES 27-29

Tiling introduces redundant work: pixels of bx near tile boundaries are computed twice by adjacent tiles.

SLIDE 30

Larger tiles reduce redundant work

for each 3x6 tile, in parallel:
    compute required pixels of bx
    compute pixels of out in tile

SLIDES 31-32

Goal: balance parallelism, locality, work

for each 3x6 tile, in parallel:
    compute required pixels of bx
    compute pixels of out in tile
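In Halide, this tiled interleaving is expressed with tile and compute_at. A hedged sketch for the blur code above; the tile and vector sizes are illustrative:

    // Tiled schedule: compute bx per output tile, only where required.
    Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
    out.tile(x, y, xo, yo, xi, yi, 64, 64)
       .parallel(yo)
       .vectorize(xi, 8);
    bx.compute_at(out, xo);  // interleave bx with each tile of out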

SLIDE 33

Represent image processing pipelines as graphs

DAG representation of the two-stage blur pipeline: in → bx → out

SLIDE 34

Real-world pipelines are complex graphs

Local Laplacian filters [Paris et al. 2010, Aubry et al. 2011]: 100 stages.

Google Nexus HDR+ mode: over 2000 stages!

SLIDES 35-38

Key aspects of scheduling:

  • Deciding which stages to interleave for better data locality
  • Picking tile sizes to trade off locality and re-computation
  • Maintaining the ability to execute in parallel

SLIDE 39

An Algorithm for Scheduling Image Processing Pipelines

SLIDES 40-44

Algorithm

Input: DAG of pipeline stages (in → A, B, C, D, E)
Output: optimized schedule

for each 8x128 tile, in parallel:
    compute required pixels of A
    compute pixels of B in tile
for each 8x8 tile, in parallel:
    compute required pixels of C
    compute required pixels of D
    compute pixels of E in tile

The stages are grouped as {A, B} with tile size 8 x 128 and {C, D, E} with tile size 8 x 8.

SLIDE 45

Scheduling the DAG for better locality

Which stages should be grouped together? How should the stages in each group be tiled?

SLIDE 46

When to group stages?

Grouping A and B together can either improve or degrade performance.

Candidate grouping: {A, B} with tile size 3 x 3; C, D, and E remain separate.

for each 3x3 tile, in parallel:
    compute required pixels of A
    compute pixels of B in tile
compute all pixels of C, in parallel
compute all pixels of D, in parallel
compute all pixels of E, in parallel

SLIDES 47-49

Quantifying the cost of a group (e.g., {A, B} with tile size 3 x 3):

Cost = (Number of arithmetic operations) + (Number of memory accesses) x (LOAD COST)
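A toy sketch of the per-tile cost function; the constant below is an assumption for illustration, not the paper's calibrated value:

    // Per-tile cost = arithmetic ops + memory accesses x LOAD COST.
    constexpr long kLoadCost = 40;  // assumed relative cost of a load

    long tileCost(long arith_ops, long mem_accesses) {
        return arith_ops + mem_accesses * kLoadCost;
    }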

SLIDES 50-55

Estimating cost using interval analysis

For a candidate tile size (e.g., 3 x 3), interval analysis bounds the region of A that each tile of B requires, which yields the arithmetic operations and memory accesses per tile.

Cost = Number of tiles x Cost per tile

SLIDES 56-58

Search for best tile sizes

Evaluate the cost of the group {A, B} for candidate tile sizes (e.g., 1 x 6, 6 x 1, 2 x 2) and keep the cheapest.
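The search can be sketched as a loop over candidate tile sizes, reusing tileCost above; the analyze callback standing in for the interval analysis is hypothetical:

    #include <climits>
    #include <functional>
    #include <utility>
    #include <vector>

    using TileSize = std::pair<int, int>;
    struct TileAnalysis { long arith_ops, mem_accesses, num_tiles; };

    // Return the cheapest candidate tile size under the cost model.
    TileSize bestTileSize(const std::vector<TileSize>& candidates,
                          const std::function<TileAnalysis(TileSize)>& analyze) {
        TileSize best = candidates.front();
        long best_cost = LONG_MAX;
        for (TileSize ts : candidates) {
            TileAnalysis a = analyze(ts);  // interval analysis for this tiling
            // Cost = number of tiles x cost per tile.
            long cost = a.num_tiles * tileCost(a.arith_ops, a.mem_accesses);
            if (cost < best_cost) { best_cost = cost; best = ts; }
        }
        return best;
    }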

SLIDE 59

When to group stages?

Benefit(A,B) = Cost(A) + Cost(B) - Cost(A,B)

where each cost is evaluated at its best tile size.

SLIDE 60

Exhaustive search is infeasible

There is an exponential number of possible groupings, e.g. {A,B,C,D,E}; {A} {B,C,D,E}; {A,B} {C,D,E}; {A,B} {C} {D} {E}; …

SLIDES 61-63

Greedy grouping algorithm

Start with every stage in its own group:

compute all pixels of A, in parallel
compute all pixels of B, in parallel
compute all pixels of C, in parallel
compute all pixels of D, in parallel
compute all pixels of E, in parallel

[DAG annotated with per-stage costs: 10, 20, 5, 50, 40]

SLIDE 64

Greedy grouping algorithm: merge the most profitable pair first. Here C and E are grouped with tile size 8 x 8.

[DAG: A, B, D remain separate; {C, E} grouped; updated costs shown]

compute all pixels of A, in parallel
compute all pixels of B, in parallel
compute all pixels of D, in parallel
for each 8x8 tile, in parallel:
    compute required pixels of C
    compute pixels of E in tile

SLIDE 65

Greedy grouping algorithm: next, A and B are grouped with tile size 8 x 128.

[DAG: {A, B} (tile size 8 x 128), D, and {C, E} (tile size 8 x 8); updated costs shown]

for each 8x128 tile, in parallel:
    compute required pixels of A
    compute pixels of B in tile
compute all pixels of D, in parallel
for each 8x8 tile, in parallel:
    compute required pixels of C
    compute pixels of E in tile

SLIDE 66

Greedy grouping algorithm: finally, D joins {C, E}, giving groups {A, B} (tile size 8 x 128) and {C, D, E} (tile size 8 x 8).

for each 8x128 tile, in parallel:
    compute required pixels of A
    compute pixels of B in tile
for each 8x8 tile, in parallel:
    compute required pixels of C
    compute required pixels of D
    compute pixels of E in tile
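The greedy loop itself can be sketched on a toy chain of stages; the stage costs and the fusedCost rule below are made-up numbers, while the real algorithm evaluates merges over the pipeline DAG using the interval-analysis cost model:

    #include <cstdio>
    #include <iterator>
    #include <list>
    #include <string>

    struct Group { std::string name; long cost; };

    // Toy stand-in: fusing saves some loads but re-does part of the producer.
    long fusedCost(const Group& a, const Group& b) {
        return a.cost + b.cost - 8 + a.cost / 4;
    }

    int main() {
        std::list<Group> groups = {
            {"A", 10}, {"B", 20}, {"C", 5}, {"D", 50}, {"E", 40}};
        while (true) {
            auto best = groups.end();
            long best_benefit = 0;
            for (auto it = groups.begin(); std::next(it) != groups.end(); ++it) {
                auto nxt = std::next(it);
                // Benefit(a,b) = Cost(a) + Cost(b) - Cost(a,b)
                long benefit = it->cost + nxt->cost - fusedCost(*it, *nxt);
                if (benefit > best_benefit) { best_benefit = benefit; best = it; }
            }
            if (best == groups.end()) break;  // no profitable merge remains
            auto nxt = std::next(best);
            best->cost = fusedCost(*best, *nxt);
            best->name += "," + nxt->name;
            groups.erase(nxt);
        }
        for (const auto& g : groups)
            std::printf("{%s}: cost %ld\n", g.name.c_str(), g.cost);
        return 0;
    }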

SLIDE 67

Auto scheduler implementation details

for each 8x128 tile, in parallel:
    compute required pixels of A (vectorize)
    compute required pixels of B (unroll x by 4, vectorize)
    compute pixels of D in tile (vectorize)
for each 8x8 tile, in parallel:
    compute required pixels of C (vectorize)
    compute pixels of E in tile (unroll y by 2, vectorize)

  • The auto scheduler applies multi-core parallelism, vectorization, loop reordering, and unrolling, as sketched below
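The directives named above are all real Halide scheduling primitives. A hedged sketch of what the emitted schedule looks like for the second group; the stage names C and E and the factors mirror the pseudocode, and the Funcs are assumed to be defined as before:

    // Auto-scheduler-style directives for the {C, E} group.
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi"), t("t");
    E.tile(x, y, xo, yo, xi, yi, 8, 8)
     .fuse(xo, yo, t)     // flatten the tile loops...
     .parallel(t)         // ...for multi-core parallelism
     .unroll(yi, 2)       // loop unrolling
     .vectorize(xi, 8);   // SIMD vectorization
    C.compute_at(E, t)    // interleave C with each tile of E
     .vectorize(x, 8);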

SLIDE 68

Evaluation

SLIDE 69

Benchmarks of varying complexity and structure

Benchmark                    Stages
Blur                         3
Unsharp mask                 9
Harris corner detection      13
Camera RAW processing        30
Non-local means denoising    13
Max-brightness filter        9
Multi-scale interpolation    52
Local-laplacian filter       103
Synthetic depth-of-field     74
Bilateral filter             8
Histogram equalization       7
VGG-16 deep network eval     64

SLIDE 70

The auto scheduler generates schedules in seconds

Benchmark                    Stages   Compile time (s)
Blur                         3        <1
Unsharp mask                 9        <1
Harris corner detection      13       <1
Camera RAW processing        30       <1
Non-local means denoising    13       <1
Max-brightness filter        9        <1
Multi-scale interpolation    52       2.6
Local-laplacian filter       103      3.9
Synthetic depth-of-field     74       55
Bilateral filter             8        <1
Histogram equalization       7        <1
VGG-16 deep network eval     64       6.9

SLIDES 71-73

Auto scheduler performs comparably to experts

[Bar chart: auto scheduler performance relative to expert schedules on a 6-core Xeon CPU for 14 benchmarks: Bilateral grid, Blur, Camera pipe, Convolution layer, Harris corner, Histogram equal, Mscale interpolate, Lens blur, Local laplacian, Matrix multiply, Max filter, Non-local means, Unsharp mask, VGG-16 evaluation]

On 8 of the 14 benchmarks, performance is within 10% of the experts' or better.

Baseline schedules shown for comparison exploit multi-core and vector parallelism but do no grouping.

SLIDES 74-75

Auto scheduler can save time for experts

[Plots: throughput vs. time spent (minutes) by two expert schedulers, Dillon and Andrew, on Max filter, Non-local means, and Lens blur; the auto scheduler's result is shown for comparison]

SLIDES 76-77

Exploring cost model parameters

[Bar chart: performance relative to experts (6-core Xeon CPU) across the 14 benchmarks for the auto scheduler, 3-day auto tuning, and quick auto tuning]

SLIDE 78

Quad-core ARM performance

[Bar chart: performance relative to expert schedules on a quad-core ARM CPU across the 14 benchmarks]

SLIDES 79-80

K40 GPU performance

[Bar chart: performance relative to expert schedules on an NVIDIA K40 GPU across the 14 benchmarks]

SLIDE 81

Prior work

Optimizing Halide via auto-tuning and stochastic search [Ragan-Kelley 13, Ansel 14]:

  • Compilation time: hours to days
  • Output up to 5-10x slower than hand-tuned implementations

Darkroom [Hegarty 14]:

  • Auto-scheduling, assuming applications are restricted to fixed-size stencils

PolyMage [Mullapudi 15]: polyhedral-based optimization

  • The greedy group-and-tile algorithm was inspired by PolyMage
  • The polyhedral approach cannot analyze non-affine and data-dependent computations

SLIDE 82

Limitations

Restricted space of schedules

  • Does not consider sliding windows and multi-level tiling

No human interaction with the auto scheduler

  • Future direction: enable experts to guide the scheduling process
SLIDE 83

Summary

An algorithm that automatically generates Halide schedules:

  • Competitive with experts
  • Generated in seconds
  • Practical implementation

In the process of being merged into mainline Halide

https://github.com/halide/Halide/tree/auto_scheduler

SLIDE 84

Generalizing the auto scheduler for other DSLs

TensorFlow, Halide, Opt

Abstract analysis and scheduling techniques into components that can be used across languages

SLIDE 85

Thank you

https://github.com/halide/Halide/tree/auto_scheduler