Autotuning Halide schedules with OpenTuner Jonathan Ragan-Kelley (Stanford)

We are surrounded by computational cameras
  Enormous opportunity, demands extreme optimization
  parallelism & locality limit performance and energy
  Camera: 8 Mpixels (96 MB/frame as float)
  CPUs: 15 GFLOP/sec
  GPU: 115 GFLOP/sec
  Required > 40:1 arithmetic intensity
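The 96 MB/frame figure can be checked with a short calculation. This is a minimal sketch, assuming three color channels per pixel and 4-byte floats; the slide states neither, and the helper name `frame_bytes` is made up for illustration. It covers only the frame size, not the 40:1 intensity figure.

```python
# Back-of-the-envelope check of the per-frame size quoted on the slide.
MEGAPIXELS = 8        # from the slide
CHANNELS = 3          # assumption: RGB, not stated on the slide
BYTES_PER_FLOAT = 4   # assumption: 32-bit float


def frame_bytes(megapixels, channels=CHANNELS, bytes_per_value=BYTES_PER_FLOAT):
    """Bytes needed to hold one frame with every channel stored as a float."""
    return megapixels * 1_000_000 * channels * bytes_per_value


frame_mb = frame_bytes(MEGAPIXELS) / 1_000_000  # megabytes per frame
print(frame_mb)
```

Under these assumptions the result matches the 96 MB quoted on the slide.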

A realistic pipeline: Local Laplacian Filters [Paris et al. 2010, Aubry et al. 2011]

[Figure: pipeline graph from input to output, built from COPY, LUT, DOWN, UP, SUB, DDA, and ADD stages repeated across pyramid levels; level sizes w × h, w/2 × h/2, ..., w/128 × h/128]

LUT: look-up table
  O(x, y, k) ← lut(I(x, y) − kσ)
UP: upsample
  T_1(2x, 2y) ← I(x, y)
  T_2 ← T_1 ⊗_x [1 3 3 1]
  O ← T_2 ⊗_y [1 3 3 1]
ADD: addition
  O(x, y) ← I_1(x, y) + I_2(x, y)
DOWN: downsample
  T_1 ← I ⊗_x [1 3 3 1]
  T_2 ← T_1 ⊗_y [1 3 3 1]
  O(x, y) ← T_2(2x, 2y)
SUB: subtraction
  O(x, y) ← I_1(x, y) − I_2(x, y)
DDA: data-dependent access
  k ← floor(I_1(x, y) / σ)
  α ← (I_1(x, y) / σ) − k
  O(x, y) ← (1 − α) I_2(x, y, k) + α I_2(x, y, k + 1)

The algorithm uses 8 pyramid levels
wide, deep, heterogeneous
stencils + stream processing
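The DDA stage is the one whose memory access depends on pixel values, so it may help to see it for a single pixel. The following is a sketch in plain Python, not Halide; the function name `dda` and the list-based representation of I_2 are assumptions for illustration, and the caller is assumed to supply at least k + 2 levels.

```python
import math


def dda(i1, i2_levels, sigma):
    """Data-dependent access for one pixel.

    i1        -- the value I_1(x, y) that selects the level
    i2_levels -- values I_2(x, y, k) for k = 0, 1, 2, ... at the same pixel
    sigma     -- spacing between levels
    """
    k = math.floor(i1 / sigma)      # k <- floor(I_1 / sigma)
    alpha = i1 / sigma - k          # alpha <- I_1 / sigma - k
    # Linear interpolation between the two neighbouring levels.
    return (1 - alpha) * i2_levels[k] + alpha * i2_levels[k + 1]


# Halfway between level 1 and level 2.
print(dda(0.75, [10.0, 20.0, 40.0], sigma=0.5))  # 30.0
```

Because k is computed from the data, which level of I_2 is read cannot be known before the pipeline runs.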

Local Laplacian Filters in Adobe Photoshop Camera Raw / Lightroom
  1500 lines of expert-optimized C++
  multi-threaded, SSE
  3 months of work
  10x faster than reference C
  2x slower than another organization (which they couldn't find)

Halide
a new language & compiler for image processing
  1. Decouple algorithm from schedule
     Algorithm: what is computed
     Schedule: where and when it's computed (we want to autotune this)

The algorithm defines pipelines as pure functions
  Pipeline stages are functions from coordinates to values
  Execution order and storage are unspecified

3x3 blur as a Halide algorithm:

  blurx(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y))/3;
  blury(x, y) = (blurx(x, y-1) + blurx(x, y) + blurx(x, y+1))/3;
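The "functions from coordinates to values" idea can be mimicked in plain Python with closures. This is a sketch, not Halide: `make_blur` and `ramp` are hypothetical names, and it uses floating-point division where a Halide pipeline on integer pixels would follow Halide's own typing rules.

```python
def make_blur(inp):
    """Build blurx and blury as pure functions of (x, y) over the callable `inp`."""

    def blurx(x, y):
        return (inp(x - 1, y) + inp(x, y) + inp(x + 1, y)) / 3

    def blury(x, y):
        return (blurx(x, y - 1) + blurx(x, y) + blurx(x, y + 1)) / 3

    return blurx, blury


def ramp(x, y):
    """Hypothetical input: a ramp in x, defined for every coordinate."""
    return float(x)


blurx, blury = make_blur(ramp)
print(blurx(5, 2), blury(5, 2))  # 5.0 5.0
```

Nothing here says in which order pixels are visited or whether `blurx` values are kept anywhere; evaluating `blury` this way simply recomputes `blurx` on demand.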

Halide
a new language & compiler for image processing
  1. Decouple algorithm from schedule
     Algorithm: what is computed
     Schedule: where and when it's computed
  2. Single, unified model for all schedules
     Simple enough to search, expose to user
     Powerful enough to beat expert-tuned code

The schedule defines intra-stage order, inter-stage interleaving

For each stage:
  1) In what order should we compute its values?
     split, tile, reorder, vectorize, unroll loops
  2) When should we compute its inputs?
     level in loop nest of consumers at which to compute each producer

[Figure: pipeline input, blurx, blury]

Notes overlaid on the slide (partly cut off):
  show pipeline and domain.
  schedule specifies:
  - interleaving (up/down)
  - or
  how we specify choices:
  - or
  - granularity at which to allocate, stor and r
  - granularity at which to interleave computation
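Question 1 is about traversal order only: the same set of coordinates is visited, in a different sequence. The sketch below models that in plain Python; `row_major` and `tiled` are hypothetical helpers, not Halide calls, and the tile sizes are assumed to divide the extents exactly.

```python
def row_major(width, height):
    """Default order: y is the outer loop, x the inner loop."""
    return [(x, y) for y in range(height) for x in range(width)]


def tiled(width, height, tile_w, tile_h):
    """Order after splitting x and y by the tile sizes and moving the
    intra-tile loops (xi, yi) innermost.

    Assumes tile_w divides width and tile_h divides height.
    """
    order = []
    for ty in range(height // tile_h):          # outer y (over tiles)
        for tx in range(width // tile_w):       # outer x (over tiles)
            for yi in range(tile_h):            # inner y (within a tile)
                for xi in range(tile_w):        # inner x (within a tile)
                    order.append((tx * tile_w + xi, ty * tile_h + yi))
    return order


print(tiled(4, 4, 2, 2)[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Both functions cover every pixel exactly once; only the order differs, which is what makes such choices safe to search over.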

Schedule primitives compose to create many organizations

Each organization is rated on three axes: redundant work, locality, parallelism.

Top row (three schedules):

  blur_x.compute_at(blury, x)

  blur_x.compute_at_root()

  blur_x.compute_at(blury, x)
        .store_at_root()

Bottom row (three schedules; the slide's columns are interleaved in this transcript, so the calls are listed in the order they appear):

  blur_x.compute_at(blury, y)
        .store_at_root()
  blur_x.compute_at(blury, x)
        .split(x, x, xi, 8)
  blur_x.compute_at(blury, y)
        .vectorize(x, 4)
        .vectorize(xi, 4)
        .store_at(blury, yi)
        .parallel(x)
        .vectorize(x, 4)
  blur_y.tile(x, y, xi, yi, 8, 8)
        .parallel(y)
  blur_y.split(x, x, xi, 8)
  blur_y.split(y, y, yi, 8)
        .vectorize(xi, 4)
        .vectorize(xi, 4)
        .parallel(y)
        .parallel(x)
        .vectorize(x, 4)

Schedule primitives compose to create many organizations
[Figure: six organizations of the in, blurx, blury pipeline, each rated on redundant work, locality, and parallelism]
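The redundant-work axis can be made concrete by counting `blurx` evaluations under two extreme organizations of the blur. This is a plain-Python sketch, not Halide: computing all of `blurx` up front loosely corresponds to the root-level schedule, and recomputing it at every use loosely corresponds to computing it at the innermost consumer loop. The function names and the test image are hypothetical.

```python
def blurx_at(inp, x, y):
    """One blurx value, computed directly from the input."""
    return (inp(x - 1, y) + inp(x, y) + inp(x + 1, y)) / 3


def blury_root(inp, width, height):
    """Compute every needed blurx value first, store it, then compute blury.

    Returns (output, blurx evaluations, blurx values stored).
    """
    table = {}
    for y in range(-1, height + 1):             # one extra row above and below
        for x in range(width):
            table[(x, y)] = blurx_at(inp, x, y)
    out = {(x, y): (table[(x, y - 1)] + table[(x, y)] + table[(x, y + 1)]) / 3
           for y in range(height) for x in range(width)}
    return out, len(table), len(table)


def blury_inline(inp, width, height):
    """Recompute the three blurx values each output pixel needs; store none."""
    evals = 0
    out = {}
    for y in range(height):
        for x in range(width):
            rows = [blurx_at(inp, x, y + dy) for dy in (-1, 0, 1)]
            evals += 3
            out[(x, y)] = (rows[0] + rows[1] + rows[2]) / 3
    return out, evals, 0


def image(x, y):
    """Hypothetical test input defined for every coordinate."""
    return float((x * 7 + y * 3) % 11)


root = blury_root(image, 8, 8)
inline = blury_inline(image, 8, 8)
print(root[1], root[2])      # 80 80
print(inline[1], inline[2])  # 192 0
```

Both versions produce the same output. The first stores 80 intermediate values and evaluates each once; the second stores nothing but does 192 evaluations, which is the trade-off between redundant work and locality that the figure rates.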

A trivial Halide program

  // The algorithm - no storage, order
  a(x, y) = in(x, y);
  b(x, y) = a(x, y);
  c(x, y) = b(x, y);

  // generated schedule
  a.split(x, x, x0, 4)
   .split(y, y, y1, 16)
   .reorder(y1, x0, y, x)
   .vectorize(y1, 4)
   .compute_at(b, y);
  b.split(x, x, x2, 64)
   .reorder(x2, x, y)
   .reorder_storage(y, x)
   .vectorize(x2, 8)
   .compute_at(c, x4);
  c.split(x, x, x4, 8)
   .split(y, y, y5, 2)
   .reorder(x4, y5, y, x)
   .parallel(x)
   .compute_root();

Schedules are complex
  split
  reorder / reorder_storage
  vectorize / parallel
  compute_at / store_at
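One way to think about a generated schedule is as plain data that a search procedure can mutate and then print as chained calls. The sketch below is an illustration only: the dictionary layout and the `render` helper are assumptions, not how Halide or OpenTuner represent schedules.

```python
def render(func, steps):
    """Render one function's schedule as a chain of calls,
    e.g. a.split(x, x, x0, 4).compute_at(b, y);"""
    calls = "".join(
        ".{}({})".format(name, ", ".join(str(arg) for arg in args))
        for name, args in steps
    )
    return func + calls + ";"


# The schedule from the slide, written as (primitive, arguments) pairs.
schedule = {
    "a": [("split", ("x", "x", "x0", 4)),
          ("split", ("y", "y", "y1", 16)),
          ("reorder", ("y1", "x0", "y", "x")),
          ("vectorize", ("y1", 4)),
          ("compute_at", ("b", "y"))],
    "b": [("split", ("x", "x", "x2", 64)),
          ("reorder", ("x2", "x", "y")),
          ("reorder_storage", ("y", "x")),
          ("vectorize", ("x2", 8)),
          ("compute_at", ("c", "x4"))],
    "c": [("split", ("x", "x", "x4", 8)),
          ("split", ("y", "y", "y5", 2)),
          ("reorder", ("x4", "y5", "y", "x")),
          ("parallel", ("x",)),
          ("compute_root", ())],
}

text = "\n".join(render(func, steps) for func, steps in schedule.items())
print(text)
```

Even for this three-stage copy pipeline the data has fifteen entries with their own arguments, which hints at how quickly the space of possible schedules grows.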
