PolyMage: Automatic Optimization for Image Processing Pipelines
Ravi Teja Mullapudi, Vinay Vasista, Uday Bondhugula
CSA, Indian Institute of Science
June 27, 2016
Table of Contents
1 Image Processing Pipelines 2 Language 3 Compiler 4 Related Work 5 Performance Evaluation
Image Processing Pipelines - Data
Cameras and Internet
- Instagram: 60 million photos per day. (http://instagram.com/press/)
- YouTube: 100 hours of video uploaded every minute. (https://www.youtube.com/yt/press/statistics.html)
- Astronomy: the Large Synoptic Survey Telescope (LSST) generates 30 TB of image data every night. (http://lsst.org/lsst/google)
- Medical imaging: the Human Connectome Project, fMRI data for 68 subjects, 1.873 TB. (http://www.humanconnectome.org/)
Image Processing Pipelines - Computation
Synthesis, enhancement, and analysis of images. Applications: computational photography, computer vision, medical imaging.
Image Processing Pipelines - Challenges
- Real-time processing
- High resolution
- Complex algorithms
Need for Speed
- Deep memory hierarchies
- Parallelism
- Heterogeneity
Modern Architectures
- OpenCV, CImg, MATLAB
- Limited optimization
- Architecture support
Libraries
- Requires expertise
- Tedious and error-prone
- Not portable
Hand Optimization
Domain Specific Languages
Productivity, Performance and Portability
- Decouple algorithms from schedules
- Support common patterns in the domain
- High performance compilation
Image Processing Pipelines - Computation Patterns
f (x, y) = g(x, y)
Point-wise
f(x, y) = ∑_{σx=−1}^{+1} ∑_{σy=−1}^{+1} g(x + σx, y + σy)
Stencil
Image Processing Pipelines - Computation Patterns
f(x, y) = ∑_{σx=−1}^{+1} ∑_{σy=−1}^{+1} g(2x + σx, 2y + σy)
Downsample
f(x, y) = ∑_{σx=−1}^{+1} ∑_{σy=−1}^{+1} g((x + σx)/2, (y + σy)/2)
Upsample
Image Processing Pipelines - Computation Patterns
f(g(x)) += 1
Histogram
f (t, x, y) = g(f (t − 1, x, y))
Time-iterated
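These access patterns can be sketched in plain Python on 1-D signals. This is illustrative code, not the PolyMage DSL: the helper names are made up, boundaries are clamped for simplicity, and the point-wise example adds 1 just to show a per-element operation.

```python
# Illustrative pure-Python sketches of the access patterns above (1-D,
# clamped boundaries). These helpers are NOT part of the PolyMage API.
def clamp(i, n):
    return max(0, min(n - 1, i))

def pointwise(g):
    return [v + 1 for v in g]                      # e.g. f(x) = g(x) + 1

def stencil(g):                                    # f(x) = sum of g[x-1..x+1]
    n = len(g)
    return [sum(g[clamp(x + s, n)] for s in (-1, 0, 1)) for x in range(n)]

def downsample(g):                                 # reads g[2x-1..2x+1]
    n = len(g)
    return [sum(g[clamp(2 * x + s, n)] for s in (-1, 0, 1))
            for x in range(n // 2)]

def upsample(g):                                   # reads g[(x-1)//2..(x+1)//2]
    n = len(g)
    return [sum(g[clamp((x + s) // 2, n)] for s in (-1, 0, 1))
            for x in range(2 * n)]

def histogram(g, bins):
    h = [0] * bins
    for v in g:
        h[v] += 1                                  # f(g(x)) += 1
    return h
```

Note how downsample and upsample change the relation between producer and consumer index (2x vs x/2); this is exactly what the compiler's scaling step later normalizes.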
PolyMage Framework
DSL Spec → Build stage graph → Static bounds check → Inlining → Polyhedral representation → Default schedule → Alignment → Scaling → Grouping → Schedule transformation → Storage optimization → Code generation
Table of Contents
1 Image Processing Pipelines 2 Language 3 Compiler 4 Related Work 5 Performance Evaluation
Language Constructs
Parameter, Variable, Image, Interval, Function, Accumulator, Stencil, Condition, Select, Case, Accumulate
N = Parameter(Int)
x = Variable()
I = Image(Float, [N])
c1 = Condition(x, '>=', 1) & Condition(x, '<=', N-2)
c2 = Condition(x, '==', 0) | Condition(x, '==', N-1)
f = Function(varDom = ([x], Interval(0, N-1, 1)), Float)
f.defn = [ Case(c1, Stencil(I(x), 1.0/3, [[1, 1, 1]])),
           Case(c2, 0) ]
f : [0..N−1] → ℝ
f(x) = ∑_{σx=−1}^{+1} I(x + σx)/3   if 1 ≤ x ≤ N−2
     = 0                            if x = 0 ∨ x = N−1
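The function defined above can be rendered in plain Python (this is a sketch of its semantics, not DSL code):

```python
# Plain-Python rendering of the boundary-cased function above:
# a 3-point average on the interior, 0 on the two boundary points.
def three_point_average(I):
    N = len(I)
    f = [0.0] * N                      # f(x) = 0 at x = 0 and x = N-1
    for x in range(1, N - 1):          # interior case: 1 <= x <= N-2
        f[x] = (I[x - 1] + I[x] + I[x + 1]) / 3.0
    return f
```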
Language Constructs
Parameter, Variable, Image, Interval, Function, Accumulator, Stencil, Condition, Select, Case, Accumulate
R, C = Parameter(Int), Parameter(Int)
I = Image(UChar, [R, C])
x, y = Variable(), Variable()
row, col = Interval(0, R, 1), Interval(0, C, 1)
bins = Interval(0, 255, 1)
hist = Accumulator(redDom = ([x, y], [row, col]),
                   varDom = ([x], bins), Int)
hist.defn = Accumulate(hist(I(x, y)), 1, Sum)
hist : [0..255] → ℤ
hist(p) = |{(x, y) : I(x, y) = p}|
Unsharp Mask
R, C = Parameter(Int), Parameter(Int)
thresh, w = Parameter(Float), Parameter(Float)
x, y, c = Variable(), Variable(), Variable()
I = Image(Float, [3, R+4, C+4])
cr = Interval(0, 2, 1)
xr, xc = Interval(2, R+1, 1), Interval(0, C+3, 1)
yr, yc = Interval(2, R+1, 1), Interval(2, C+1, 1)
blurx = Function(varDom = ([c, x, y], [cr, xr, xc]), Float)
blurx.defn = [ Stencil(I(c, x, y), 1.0/16, [[1, 4, 6, 4, 1]]) ]
blury = Function(varDom = ([c, x, y], [cr, yr, yc]), Float)
blury.defn = [ Stencil(blurx(c, x, y), 1.0/16, [[1], [4], [6], [4], [1]]) ]
sharpen = Function(varDom = ([c, x, y], [cr, yr, yc]), Float)
sharpen.defn = [ I(c, x, y) * (1 + w) - blury(c, x, y) * w ]
masked = Function(varDom = ([c, x, y], [cr, yr, yc]), Float)
diff = Abs(I(c, x, y) - blury(c, x, y))
cond = Condition(diff, '<', thresh)
masked.defn = [ Select(cond, I(c, x, y), sharpen(c, x, y)) ]
Stage graph: Iin → blurx → blury → sharpen → masked
Harris Corner Detection
R, C = Parameter(Int), Parameter(Int)
I = Image(Float, [R+2, C+2])
x, y = Variable(), Variable()
row, col = Interval(0, R+1, 1), Interval(0, C+1, 1)
c = Condition(x, '>=', 1) & Condition(x, '<=', R) & \
    Condition(y, '>=', 1) & Condition(y, '<=', C)
cb = Condition(x, '>=', 2) & Condition(x, '<=', R-1) & \
     Condition(y, '>=', 2) & Condition(y, '<=', C-1)
Iy = Function(varDom = ([x, y], [row, col]), Float)
Iy.defn = [ Case(c, Stencil(I(x, y), 1.0/12, [[-1, -2, -1],
                                              [ 0,  0,  0],
                                              [ 1,  2,  1]])) ]
Ix = Function(varDom = ([x, y], [row, col]), Float)
Ix.defn = [ Case(c, Stencil(I(x, y), 1.0/12, [[-1, 0, 1],
                                              [-2, 0, 2],
                                              [-1, 0, 1]])) ]
Ixx = Function(varDom = ([x, y], [row, col]), Float)
Ixx.defn = [ Case(c, Ix(x, y) * Ix(x, y)) ]
Iyy = Function(varDom = ([x, y], [row, col]), Float)
Iyy.defn = [ Case(c, Iy(x, y) * Iy(x, y)) ]
Ixy = Function(varDom = ([x, y], [row, col]), Float)
Ixy.defn = [ Case(c, Ix(x, y) * Iy(x, y)) ]
Sxx = Function(varDom = ([x, y], [row, col]), Float)
Syy = Function(varDom = ([x, y], [row, col]), Float)
Sxy = Function(varDom = ([x, y], [row, col]), Float)
for pair in [(Sxx, Ixx), (Syy, Iyy), (Sxy, Ixy)]:
    pair[0].defn = [ Case(cb, Stencil(pair[1], 1, [[1, 1, 1],
                                                   [1, 1, 1],
                                                   [1, 1, 1]])) ]
det = Function(varDom = ([x, y], [row, col]), Float)
d = Sxx(x, y) * Syy(x, y) - Sxy(x, y) * Sxy(x, y)
det.defn = [ Case(cb, d) ]
trace = Function(varDom = ([x, y], [row, col]), Float)
trace.defn = [ Case(cb, Sxx(x, y) + Syy(x, y)) ]
harris = Function(varDom = ([x, y], [row, col]), Float)
coarsity = det(x, y) - .04 * trace(x, y) * trace(x, y)
harris.defn = [ Case(cb, coarsity) ]
Stage graph: Iin → {Ix, Iy} → {Ixx, Ixy, Iyy} → {Sxx, Sxy, Syy} → {det, trace} → harris
Pyramid Blending
(Figure: pyramid blending stage graph — Gaussian pyramids built by repeated downsampling (↓x, ↓y) of the inputs and the mask (M), Laplacian stages (L) blended per level (X), then upsampled (↑x, ↑y) and collapsed (↑+).)
Table of Contents
1 Image Processing Pipelines 2 Language 3 Compiler 4 Related Work 5 Performance Evaluation
Compiler - Polyhedral Representation
x = Variable()
fin = Image(Float, [18])
f1 = Function(varDom = ([x], [Interval(0, 17, 1)]), Float)
f1.defn = [ fin(x) + 1 ]
f2 = Function(varDom = ([x], [Interval(1, 16, 1)]), Float)
f2.defn = [ f1(x-1) + f1(x+1) ]
fout = Function(varDom = ([x], [Interval(2, 15, 1)]), Float)
fout.defn = [ f2(x-1) * f2(x+1) ]

Domains: f1 over [0, 17], f2 over [1, 16], fout over [2, 15]
Compiler - Polyhedral Representation
Dependence vectors: f2(x) reads f1(x − 1) and f1(x + 1); fout(x) reads f2(x − 1) and f2(x + 1).
Compiler - Polyhedral Representation
Live-outs: fout is the live-out; f1 and f2 are intermediate stages.
Compiler - Polyhedral Representation
Default schedule:
f1(x) → (0, x)
f2(x) → (1, x)
fout(x) → (2, x)
Compiler - Polyhedral Representation
Skewed schedule:
f1(x) → (0, x)
f2(x) → (1, x + 1)
fout(x) → (2, x + 2)
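The skewed schedule can be exercised in a plain-Python sketch (illustrative, not generated code): shifting f2 by 1 and fout by 2 lets all three stages run in a single scan over x, because every value a stage reads has already been produced earlier in the scan.

```python
# The f1/f2/fout pipeline above, executed under the skewed schedule
# f1(x) -> (0, x), f2(x) -> (1, x+1), fout(x) -> (2, x+2):
# one scan over p computes f1[p], f2[p-1], and fout[p-2] together.
fin = list(range(18))
f1, f2, fout = {}, {}, {}
for p in range(18):
    f1[p] = fin[p] + 1                       # f1 defined on [0, 17]
    if 1 <= p - 1 <= 16:
        f2[p - 1] = f1[p - 2] + f1[p]        # f2 on [1, 16], shifted by 1
    if 2 <= p - 2 <= 15:
        fout[p - 2] = f2[p - 3] * f2[p - 1]  # fout on [2, 15], shifted by 2

# Reference: default schedule, one full stage at a time.
r1 = {x: fin[x] + 1 for x in range(18)}
r2 = {x: r1[x - 1] + r1[x + 1] for x in range(1, 17)}
rout = {x: r2[x - 1] * r2[x + 1] for x in range(2, 16)}
assert fout == rout
```

The skew changes only the order in which points are visited; the dependences are preserved, so the results are identical.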
Compiler - Scheduling Criteria
Default schedule — criteria: parallelism, locality, storage
Compiler - Scheduling Criteria
Parallelogram tiling — criteria: parallelism, locality, storage
Compiler - Scheduling Criteria
Split tiling — criteria: parallelism, locality, storage
Compiler - Scheduling Criteria
Overlap tiling — criteria: parallelism, locality, storage; redundant computation
Compiler - Alignment and Scaling
f(x, y) = g(0, x, y) + g(1, x, y) + g(2, x, y)

Default schedules:
f(x, y) → (1, x, y, 0)
g(0, x, y) → (0, 0, x, y)
Dependence vector non-constant: (1, x, y − x, −y)

Aligned schedules:
f(x, y) → (1, 0, x, y)
g(0, x, y) → (0, 0, x, y)
Dependence vector: (1, 0, 0, 0)

Alignment
Compiler - Alignment and Scaling
f(x) = g(2x) + g(2x + 1)

Default schedules:
f(x) → (1, x)
g(x) → (0, x)
Dependence vectors non-constant: (1, −x), (1, −x − 1)

Scaled schedules:
f(x) → (1, 2x)
g(x) → (0, x)
Dependence vectors: (1, 0), (1, −1)

Scaling
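The scaling argument can be checked numerically with a small plain-Python sketch (illustrative; only the second schedule component is modeled):

```python
# Scaling example above: f(x) = g(2x) + g(2x+1).
# Dependence distance = consumer position - producer position along the
# second schedule dimension. Default maps f(x) -> x; scaled maps f(x) -> 2x.
default_dists, scaled_dists = set(), set()
for x in range(0, 8):
    for producer in (2 * x, 2 * x + 1):      # points of g that f(x) reads
        default_dists.add(x - producer)       # varies with x: -x, -x-1
        scaled_dists.add(2 * x - producer)    # constant: 0 or -1
assert len(default_dists) > 2                 # non-constant under default
assert scaled_dists == {0, -1}                # constant under scaling
```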
Compiler - Overlapped Tiling
f(x) = fin(x)
f↓1(x) = f(2x − 1) + f(2x + 1)
f↓2(x) = f↓1(2x − 1) × f↓1(2x + 1)
f↑1(x) = f↓2(x/2) + f↓2(x/2 + 1)
fout(x) = f↑1(x/2)

Schedules:
f(x) → (0, x)
f↓1(x) → (1, 2x)
f↓2(x) → (2, 4x)
f↑1(x) → (3, 2x)
fout(x) → (4, x)
Compiler - Overlapped Tiling
- Conservative vs precise bounding faces
- Significant reduction in redundant computation
Tile shape
Compiler - Overlapped Tiling
(Figure: an overlapped tile of height h with left and right bounding faces φl, φr and tile width τ.)

Default schedule: fk(i) → (sk), overlap O = h ∗ (|l| + |r|)

Tile constraints:
τ ∗ T ≤ φl(sk) ≤ τ ∗ (T + 1) + O − 1  ∧  τ ∗ T ≤ φr(sk) ≤ τ ∗ (T + 1) + O − 1

Tiled schedule: fk(i) → (T, sk)
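The tile constraints can be made concrete with a 1-D pure-Python sketch (illustrative, not generated code): for a chain of h stages each reading g[x−1..x+1] (so |l| = |r| = 1), a tile must be expanded by h points per side — h·(|l|+|r|) overlapping points in total — before its outputs can be computed independently of neighboring tiles.

```python
# 1-D overlapped tiling sketch: h chained 3-point stencil stages.
def blur(g):                         # one stage, clamped at the boundary
    n = len(g)
    return [g[max(0, x - 1)] + g[x] + g[min(n - 1, x + 1)] for x in range(n)]

def pipeline(g, h):                  # untiled reference: h stages in sequence
    for _ in range(h):
        g = blur(g)
    return g

def tiled_pipeline(g, h, tau):
    n, out = len(g), [0] * len(g)
    for t0 in range(0, n, tau):      # one tile of width tau per iteration
        lo = max(0, t0 - h)          # expand by h points per side: the
        hi = min(n, t0 + tau + h)    # overlap, recomputed redundantly
        tile = pipeline(g[lo:hi], h)
        out[t0:t0 + tau] = tile[t0 - lo:t0 - lo + tau]
    return out
```

Each tile redundantly recomputes its overlap region, but tiles become independent and can run in parallel; the outputs match the untiled pipeline exactly.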
Compiler - Overlapped Tiling
- Storage for intermediate values
- Reduction in intermediate storage
- Better locality and reuse
- Privatized for each thread
- Only last level can be live-out
Scratch pads
Compiler - Grouping
Iin Ix Iy Ixx Ixy Iyy Sxx Syy Sxy det trace harris
Fusion criteria
- Constant dependences: alignment, scaling
- Redundant computation vs reuse: overlap, tile sizes, parameter estimates
- Live-out constraints

Fusion heuristic
- Exponential number of valid groupings
- Greedy iterative approach
Compiler - Grouping
Iin Ix Iy Ixx Ixy Iyy Sxx Syy Sxy det trace harris
Input: DAG of stages, (S, E); parameter estimates, P; tile sizes, T; overlap threshold, othresh

/* Initially, each stage is in a separate group */
G ← ∅
for s ∈ S do
    G ← G ∪ {s}
repeat
    converge ← true
    cand_set ← getSingleChildGroups(G, E)
    ord_list ← sortGroupsBySize(cand_set, P)
    for each g in ord_list do
        child ← getChildGroup(g, E)
        if hasConstantDependenceVectors(g, child) then
            or ← estimateRelativeOverlap(g, child, T)
            if or < othresh then
                merge ← g ∪ child
                G ← G − g − child
                G ← G ∪ merge
                converge ← false
                break
until converge = true
return G
Algorithm
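The greedy heuristic can be sketched in runnable Python on a toy DAG. This is a simplified stand-in, not PolyMage's implementation: groups are frozensets of stage names, the cost model is a caller-supplied overlap estimate, and the constant-dependence check is folded into that estimate.

```python
# Illustrative sketch of the greedy grouping heuristic above.
def single_child_groups(G, edges):
    """Groups whose successor stages all lie in exactly one other group."""
    cands = []
    for g in G:
        children = {h for h in G if h is not g and
                    any((u, v) in edges for u in g for v in h)}
        if len(children) == 1:
            cands.append(g)
    return cands

def child_group(g, G, edges):
    return next(h for h in G if h is not g and
                any((u, v) in edges for u in g for v in h))

def group_stages(stages, edges, overlap, othresh):
    G = [frozenset([s]) for s in stages]       # each stage starts alone
    while True:
        converge = True
        for g in sorted(single_child_groups(G, edges), key=len):
            child = child_group(g, G, edges)
            if overlap(g, child) < othresh:    # cheap enough to fuse
                G.remove(g); G.remove(child)
                G.append(g | child)
                converge = False
                break                          # restart with updated groups
        if converge:
            return G

# Toy pipeline a -> b -> c; a constant low overlap estimate fuses everything.
groups = group_stages(['a', 'b', 'c'], {('a', 'b'), ('b', 'c')},
                      overlap=lambda g, h: 0.1, othresh=0.5)
assert groups == [frozenset({'a', 'b', 'c'})]
```

With a high overlap estimate the same call leaves every stage in its own group, mirroring the algorithm's redundant-computation-vs-reuse trade-off.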
Compiler - Grouping
(Figure: the pyramid blending stage graph, with its downsample (↓x, ↓y), Laplacian/blend (L, X, M), and upsample (↑x, ↑y, ↑+) stages partitioned into fused groups.)
Compiler - Code Generation
void pipe_harris(int C, int R, float* I, float*& harris) {
  /* Live-out allocation */
  harris = (float*) malloc(sizeof(float) * (2+R) * (2+C));
  #pragma omp parallel for
  for (int Ti = -1; Ti <= R / 32; Ti += 1) {
    /* Scratch pad allocation */
    float Ix[36][260], Iy[36][260];
    float Syy[36][260], Sxy[36][260], Sxx[36][260];
    for (int Tj = -1; Tj <= C / 256; Tj += 1) {
      int lbi = max(1, 32 * Ti);
      int ubi = min(R, 32 * Ti + 35);
      for (int i = lbi; i <= ubi; i += 1) {
        int lbj = max(1, 256 * Tj);
        int ubj = min(C, 256 * Tj + 259);
        #pragma ivdep
        for (int j = lbj; j <= ubj; j += 1) {
          Iy(-32 * Ti + i, -256 * Tj + j);
          Ix(-32 * Ti + i, -256 * Tj + j);
        }
      }
      /* .... */
    }
  }
}

OpenMP parallelization · Vectorization (ivdep) · Scanning polyhedra
Auto Tuning
(Scatter plots: execution time on 1 core vs. execution time on 16 cores across tuning configurations, for the Camera Pipeline and Pyramid Blending benchmarks.)

- Tile sizes and overlap threshold determine grouping
- Seven tile sizes for each dimension
- Three threshold values
- Small search space (7² ∗ 3 for 2-d tiling)

Tuning: Camera Pipeline, Pyramid Blending
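The search space above is small enough to enumerate exhaustively; a minimal sketch, with illustrative tile-size and threshold values (not PolyMage's actual candidates):

```python
# Enumerate the autotuning space for 2-d tiling: seven candidate tile
# sizes per dimension and three overlap thresholds. Concrete values are
# illustrative placeholders.
from itertools import product

tile_sizes = [8, 16, 32, 64, 128, 256, 512]   # seven per dimension
thresholds = [0.2, 0.5, 1.0]                  # three overlap thresholds

configs = list(product(tile_sizes, tile_sizes, thresholds))
assert len(configs) == 7 * 7 * 3              # 147 configurations to time
```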
Table of Contents
1 Image Processing Pipelines 2 Language 3 Compiler 4 Related Work 5 Performance Evaluation
Related work
Polyhedral compilation
- Decoupled view of computation and schedules
- Scheduling for affine loop nests: does not target specific domains
- Overlapped tiling: works for simple time-iterated stencils; different approach to constructing overlapped tiles

Halide
- Domain specific language and compiler system
- Effective for exploring schedules, but requires an explicit schedule specification
Halide
ImageParam input(UInt(16), 2);
Func blur_x("blur_x"), blur_y("blur_y");
Var x("x"), y("y"), xi("xi"), yi("yi");
// The algorithm
blur_x(x, y) = (input(x, y) + input(x+1, y) + input(x+2, y))/3;
blur_y(x, y) = (blur_x(x, y) + blur_x(x, y+1) + blur_x(x, y+2))/3;
// How to schedule it
blur_y.split(y, y, yi, 8).parallel(y).vectorize(x, 8);
blur_x.store_at(blur_y, y).compute_at(blur_y, yi).vectorize(x, 8);
Halide Blur Schedule
Table of Contents
1 Image Processing Pipelines 2 Language 3 Compiler 4 Related Work 5 Performance Evaluation
Experimental Setup
Processor: Intel Xeon E5-2680
Clock: 2.7 GHz
Cores / socket: 8
Total cores: 16
L1 cache / core: 32 KB
L2 cache / core: 512 KB
L3 cache / socket: 20 MB
Compiler: Intel C compiler (icc) 14.0.1
Compiler flags: -O3 -xhost
Linux kernel: 3.8.0-38 (64-bit)
Evaluation Method
Benchmarks
- Seven representative benchmarks
- Varying structure and complexity

Comparison
- Halide: tuned schedule, matched schedule
- OpenCV: optimized library calls
Multiscale Interpolation
Speedup over PolyMage base (1 core), by number of cores:

Variant              1     2     4     8     16
PolyMage(opt+vec)    2.24  4.03  6.57  9.82  12.54
PolyMage(opt)        1.28  2.38  3.93  6.18   9.43
PolyMage(base+vec)   1.46  2.57  4.07  5.7    5.88
PolyMage(base)       1     1.8   2.94  4.42   5.82
Halide(tuned+vec)    2.14  3.44  5.94  7.25   6.93
Halide(tuned)        1.77  2.99  5.29  7.13   6.92
Halide(matched+vec)  1.28  2.43  4.1   7.1   12.11
Halide(matched)      0.88  1.68  3.19  5.47   8.5
Harris Corner Detection
Speedup over PolyMage base (1 core), by number of cores:

Variant              1     2     4      8      16
PolyMage(opt+vec)    3.74  7.35  12.85  24.02  46.78
PolyMage(opt)        1.12  2.24   4.03   7.64  15.18
PolyMage(base+vec)   2.47  4.31   7.83  12.22  16.22
PolyMage(base)       1     1.94   3.47   6.18  10.3
Halide(tuned+vec)    1.64  3.17   6.08  10.17  18.07
Halide(tuned)        0.93  1.84   3.51   6.05  10.3
Halide(matched+vec)  1.87  3.73   7.43  13.65  25.35
Halide(matched)      0.73  1.45   2.91   5.31   9.88
Camera Pipeline
Speedup over PolyMage base (1 core), by number of cores:

Variant             1     2     4      8      16
PolyMage(opt+vec)   2.79  5.49   9.5   18.16  32.37
PolyMage(opt)       0.79  1.57   2.74   5.26  10.28
PolyMage(base+vec)  2.95  5.62   9.58  13.22  24.2
PolyMage(base)      1     1.98   3.61   6.5   12.16
Halide(tuned+vec)   4.82  7.3   12.32  21.26  31.28
Halide(tuned)       1.4   2.59   4.71   7.56  14.15
FCam                2.42  4.83   9.55  17.49  33.75
Results Summary
Benchmark             Number     Image size    Lines  PolyMage (1 / 4 / 16 cores)  OpenCV    Speedup over
                      of stages                                                    (1 core)  H-tuned (16 cores)
Harris Corner            11      6400 × 6400     43   233.79 / 68.03 / 18.69       810.24    2.59×*
Pyramid Blending         44      2048×2048×3     71   196.99 / 57.84 / 21.91       197.28    4.61×*
Unsharp Mask              4      2048×2048×3     16   165.40 / 44.92 / 14.85       349.57    1.6×*
Local Laplacian          99      2560×1536×3    107   274.50 / 76.60 / 32.35       –         1.54×
Camera Pipeline          32      2528 × 1920     86    67.87 / 19.95 /  5.86       –         1.04×
Bilateral Grid            7      2560 × 1536     43    89.76 / 27.30 /  8.47       –         0.89×
Multiscale Interpol.     49      2560×1536×3     41   101.70 / 34.73 / 18.18       –         1.81×

Mean speedup of 1.27× over tuned Halide schedules.
Comparable performance to a highly tuned camera pipeline implementation.
Conclusion
DSL for high-performance image processing

Optimization techniques
- Tiling
- Storage optimization
- Grouping and fusing
Effectiveness
- Up to 1.81× better than tuned schedules
- Matching hand-tuned performance
Acknowledgements
Halide, OpenCV, isl, islpy and cgen. Intel for their hardware.
Thank You!
Pyramid Blending
Speedup over PolyMage base (1 core), by number of cores:

Variant              1     2     4     8     16
PolyMage(opt+vec)    1.66  3.2   5.66  9.96  14.95
PolyMage(opt)        1.26  2.42  4.29  7.49  13.37
PolyMage(base+vec)   1.13  2.02  3.25  4.71   5.31
PolyMage(base)       1     1.82  2.99  4.55   5.35
Halide(tuned+vec)    0.56  1     1.83  2.71   3.24
Halide(tuned)        0.66  1.16  2.08  2.98   3.43
Halide(matched+vec)  1.24  2.12  3.7   5.72   7
Halide(matched)      0.76  1.45  2.64  4.31   5.98
Bilateral Grid
Speedup over PolyMage base (1 core), by number of cores:

Variant             1     2     4     8     16
PolyMage(opt+vec)   1.15  2.17  3.77  6.55  12.16
PolyMage(opt)       0.82  1.61  2.73  4.74   8.99
PolyMage(base+vec)  1.65  3.17  3.42  3.56   3.72
PolyMage(base)      1     1.97  2.15  2.28   2.42
Halide(tuned+vec)   1.6   2.92  5.4   8.55  13.68
Halide(tuned)       1.13  2.11  4.03  6.72  10.37
Local Laplacian Filter
Speedup over PolyMage base (1 core), by number of cores:

Variant             1     2     4     8     16
PolyMage(opt+vec)   1.62  3.41  5.8   9.41  13.73
PolyMage(opt)       1.02  1.99  3.48  6.1   10.81
PolyMage(base+vec)  1.58  2.93  4.71  6.41   8.74
PolyMage(base)      1     1.92  3.3   5.23   7.39
Halide(tuned+vec)   1.04  1.99  3.68  6.18   8.93
Halide(tuned)       0.55  1.07  2.08  3.61   5.71