Automatically Scheduling Halide Image Processing Pipelines
Ravi Teja Mullapudi (CMU), Andrew Adams (Google), Dillon Sharlet (Google), Jonathan Ragan-Kelley (Stanford), Kayvon Fatahalian (CMU)

High demand for efficient image processing

Scheduling image processing algorithms
Algorithm description (written by image processing algorithm developers):
    Var x, y; Func f, g;
    g(x,y) = f(x,y) + …
    h(x) = g(x,y) + …
Schedule (machine mapping), which yields the implementation:
    parallelize y loop
    tile output dims
    vectorize y loop

Few developers have the skill set to author highly optimized schedules
The algorithm description is written by image processing algorithm developers. The schedule (machine mapping: parallelize y loop, tile output dims, vectorize y loop) determines the implementation, and a well-optimized schedule can make it > 10x faster.

Contribution: automatic scheduling of image processing pipelines
Image processing algorithm developers write only the algorithm description. A scheduling algorithm takes the place of the hand-written schedule: it generates expert-quality schedules in seconds and produces the > 10x faster implementation.

Why is it challenging to schedule image processing pipelines?

Algorithm: 3x3 box blur
Stages: in, bx, out
    bx(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y)) / 3
    out(x, y) = (bx(x, y-1) + bx(x, y) + bx(x, y+1)) / 3

A basic (slow) schedule
    compute all pixels of bx, in parallel
    compute all pixels of out, in parallel
All of bx is stored in an intermediate buffer between the two stages.
[Figure: in, bx, and out buffers with x and y axes]
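As a rough illustration of the basic schedule, the sketch below computes every pixel of bx into a full intermediate buffer before computing any pixel of out. It is plain Python rather than Halide. The function name `blur_breadth_first`, the `inp[y][x]` list-of-rows layout, and the choice to produce output only where the full 3x3 window fits are assumptions made for this example, and the loops run sequentially where the slide runs them in parallel.

```python
def blur_breadth_first(inp):
    """Two-stage 3x3 box blur, one whole stage at a time.

    Assumed layout: inp[y][x]. out[y][x] holds the blur centred on input
    pixel (x + 1, y + 1), so out is 2 pixels smaller in each dimension.
    """
    h, w = len(inp), len(inp[0])
    # Stage 1: horizontal blur of every row. The whole of bx is materialised.
    bx = [[(inp[y][x] + inp[y][x + 1] + inp[y][x + 2]) / 3
           for x in range(w - 2)]
          for y in range(h)]
    # Stage 2: vertical blur, reading three rows of the full bx buffer.
    out = [[(bx[y][x] + bx[y + 1][x] + bx[y + 2][x]) / 3
            for x in range(w - 2)]
           for y in range(h - 2)]
    return out
```

The intermediate buffer bx here is as large as the image itself, which is the buffer the slide marks as the bottleneck.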

Low performance: bandwidth bound
The intermediate buffer bx is a large in-memory buffer.

Tiling to improve data locality
    for each 3x3 tile, in parallel
        compute required pixels of bx
        compute pixels of out in tile
Intermediate buffer: the required pixels of bx for one tile fit in fast on-chip storage.
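The tiled schedule can be sketched the same way. This is plain Python, not Halide, and the name `blur_tiled`, its `tile_w`/`tile_h` parameters, and the `inp[y][x]` layout are assumptions for illustration. Each tile allocates only the rows of bx it needs, and since tiles do not depend on one another the two outer loops could run in parallel; they are sequential here for simplicity.

```python
def blur_tiled(inp, tile_w=3, tile_h=3):
    """Two-stage 3x3 box blur computed tile by tile.

    out[y][x] holds the blur centred on input pixel (x + 1, y + 1).
    """
    h, w = len(inp), len(inp[0])
    out_h, out_w = h - 2, w - 2
    out = [[0.0] * out_w for _ in range(out_h)]
    for ty in range(0, out_h, tile_h):          # tiles are independent
        for tx in range(0, out_w, tile_w):
            th = min(tile_h, out_h - ty)        # clamp partial edge tiles
            tw = min(tile_w, out_w - tx)
            # Required pixels of bx: th + 2 rows, because out reads the
            # row above and the row below each output row.
            bx = [[(inp[ty + j][tx + i] + inp[ty + j][tx + i + 1]
                    + inp[ty + j][tx + i + 2]) / 3
                   for i in range(tw)]
                  for j in range(th + 2)]
            # Pixels of out in this tile, read from the small local buffer.
            for j in range(th):
                for i in range(tw):
                    out[ty + j][tx + i] = (bx[j][i] + bx[j + 1][i]
                                           + bx[j + 2][i]) / 3
    return out
```

Calling it with a tile as large as the whole output degenerates to the untiled schedule, which gives a simple way to check that tiling does not change the result.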

Tiling introduces redundant work
Pixels of bx required by two adjacent tiles are computed twice.

Larger tiles reduce redundant work
    for each 3x6 tile, in parallel
        compute required pixels of bx
        compute pixels in tile of out
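The effect of tile size on redundant work can be checked by counting how many pixels of bx a tiled schedule evaluates. The helper below is a hypothetical counting sketch: the name `bx_pixels_computed` and the (width, height) reading of tile sizes are assumptions, since the slide does not say which dimension of the 3x6 tile is which. Every tile of height th needs th + 2 rows of bx, so taller tiles amortise the two extra rows over more output rows.

```python
def bx_pixels_computed(out_w, out_h, tile_w, tile_h):
    """Count bx pixels evaluated when out is computed in tile_w x tile_h tiles."""
    total = 0
    for ty in range(0, out_h, tile_h):
        th = min(tile_h, out_h - ty)        # clamp partial edge tiles
        for tx in range(0, out_w, tile_w):
            tw = min(tile_w, out_w - tx)
            total += (th + 2) * tw          # th + 2 rows of bx per tile
    return total


# For a 12x12 output, the untiled schedule needs 14 rows of 12 bx pixels.
untiled = bx_pixels_computed(12, 12, 12, 12)   # 168
small = bx_pixels_computed(12, 12, 3, 3)       # 240
taller = bx_pixels_computed(12, 12, 3, 6)      # 192
```

In this example 3x3 tiles evaluate 240 bx pixels against the untiled 168, and doubling the tile height brings the count down to 192.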

Goal: balance parallelism, locality, work

Represent image processing pipelines as graphs
DAG representation of the two-stage blur pipeline: in -> bx -> out
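One minimal way to hold such a graph in code is a mapping from each stage to the stages it reads from. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to recover a valid evaluation order. The dictionary layout and the variable names are choices made for this example, not something the slides prescribe.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it reads from (its producers).
pipeline = {
    "bx": {"in"},
    "out": {"bx"},
}

# A topological order lists every producer before its consumers.
order = list(TopologicalSorter(pipeline).static_order())
```

For the blur pipeline the order is simply in, bx, out. For a graph with hundreds of stages the same structure applies, and only the dictionary grows.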

Real world pipelines are complex graphs
Local Laplacian filters: 100 stages [Paris et al. 2010, Aubry et al. 2011]
Google Nexus HDR+ mode: over 2000 stages!

Key aspects of scheduling
    Deciding which stages to interleave for better data locality
    Picking tile sizes to trade off locality and re-computation
    Maintaining the ability to execute in parallel

An Algorithm for Scheduling Image Processing Pipelines

Algorithm
Input: DAG of pipeline stages (in, A, B, C, D, E)
Output: optimized schedule, with stages grouped as {A, B} and {C, D, E}
    for each 8x128 tile in parallel        (group A,B; tile size: 8 x 128)
        compute required pixels of A
        compute pixels in tile of B
    for each 8x8 tile in parallel          (group C,D,E; tile size: 8 x 8)
        compute required pixels of C
        compute required pixels of D
        compute pixels in tile of E

Scheduling the DAG for better locality
    Which stages should be grouped together?
    How should the stages in each group be tiled?

When to group stages?
    for each 3x3 tile in parallel          (group A,B; tile size: 3 x 3)
        compute required pixels of A
        compute pixels in tile of B
    compute all pixels of C, in parallel
    compute all pixels of D, in parallel
    compute all pixels of E, in parallel
Grouping A and B together can either improve or degrade performance.

Quantifying the cost of a group
    for each 3x3 tile in parallel          (group A,B; tile size: 3 x 3)
        compute required pixels of A
        compute pixels in tile of B
Cost = Cost of arithmetic + Cost of memory
     = (Number of arithmetic operations) + (Number of memory accesses) x (LOAD COST)

Estimating cost using interval analysis
Group A,B (stages in, A, B), tile size: 3 x 3
Cost = (Number of arithmetic operations) + (Number of memory accesses) x (LOAD COST)
The counts are estimated for one tile, so Cost = Number of tiles x Cost per tile
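The per-tile part of this estimate can be sketched as interval arithmetic on stencil footprints: start from one tile of the group's output and grow the region backwards through each stage. The example below substitutes the two-stage blur (in, bx, out) for the slide's stages in, A, B, because the slides do not give the stencils of A and B. The names `FOOTPRINT`, `required_region`, and `area`, and the inclusive (x0, x1, y0, y1) region encoding, are assumptions for this illustration.

```python
# Offsets (dx_min, dx_max, dy_min, dy_max) at which a consumer reads its
# producer, taken from the blur definitions.
FOOTPRINT = {
    ("out", "bx"): (0, 0, -1, 1),   # out(x, y) reads bx(x, y-1 .. y+1)
    ("bx", "in"): (-1, 1, 0, 0),    # bx(x, y) reads in(x-1 .. x+1, y)
}


def required_region(region, footprint):
    """Grow an inclusive region (x0, x1, y0, y1) by a stencil footprint."""
    x0, x1, y0, y1 = region
    dx0, dx1, dy0, dy1 = footprint
    return (x0 + dx0, x1 + dx1, y0 + dy0, y1 + dy1)


def area(region):
    """Number of pixels in an inclusive region."""
    x0, x1, y0, y1 = region
    return (x1 - x0 + 1) * (y1 - y0 + 1)


out_tile = (0, 2, 0, 2)                                          # 3x3 tile of out
bx_region = required_region(out_tile, FOOTPRINT[("out", "bx")])  # 3 wide, 5 tall
in_region = required_region(bx_region, FOOTPRINT[("bx", "in")])  # 5 wide, 5 tall
```

The region areas (9, 15, and 25 pixels here) are the raw counts that a cost model can turn into arithmetic operations and memory accesses per tile.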

Search for best tile sizes
Group A,B (stages in, A, B), tile size: 1 x 6
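A tile-size search can be sketched by evaluating the cost formula for each candidate size and keeping the cheapest. Everything specific in the block below is an assumption for illustration: the constants `LOAD_COST`, `FAST_CAPACITY`, and `MIN_TILES` are hypothetical, the fused blur group stands in for group A,B, only loads of in and stores of out are counted as memory accesses, and the exhaustive loop over candidate sizes is a simple stand-in, not necessarily the search strategy the authors use.

```python
from math import ceil

LOAD_COST = 40        # hypothetical: one memory access costs 40 arithmetic ops
FAST_CAPACITY = 4096  # hypothetical: bx pixels that fit in fast on-chip storage
MIN_TILES = 8         # hypothetical: tiles needed to keep all cores busy


def group_cost(out_w, out_h, tile_w, tile_h):
    """Cost = number of tiles x cost per tile for the fused bx/out group.

    Returns None when the tiling is rejected (intermediate buffer too
    large for fast storage, or too few tiles to run in parallel).
    """
    bx_pixels = tile_w * (tile_h + 2)
    in_pixels = (tile_w + 2) * (tile_h + 2)
    out_pixels = tile_w * tile_h
    num_tiles = ceil(out_w / tile_w) * ceil(out_h / tile_h)
    if bx_pixels > FAST_CAPACITY or num_tiles < MIN_TILES:
        return None
    arithmetic = 3 * bx_pixels + 3 * out_pixels  # 2 adds + 1 divide per pixel
    memory = in_pixels + out_pixels              # loads of in, stores of out
    return num_tiles * (arithmetic + memory * LOAD_COST)


def best_tile(out_w, out_h, sizes):
    """Exhaustively try every (tile_w, tile_h) pair drawn from sizes."""
    best = None
    for tile_h in sizes:
        for tile_w in sizes:
            cost = group_cost(out_w, out_h, tile_w, tile_h)
            if cost is not None and (best is None or cost < best[0]):
                best = (cost, tile_w, tile_h)
    return best


choice = best_tile(1024, 1024, [8, 16, 32, 64, 128, 256])
```

The two rejection tests encode the locality and parallelism constraints, and the cost itself penalises redundant work, so the three goals from the earlier slide all appear in one function.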
