Autotuning Programs with Algorithmic Choice Jason Ansel MIT - CSAIL - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Introduction PetaBricks OpenTuner Conclusions

Autotuning Programs with Algorithmic Choice

Jason Ansel

MIT - CSAIL

December 18, 2013

slide-4
SLIDE 4

High Performance Search Problem

Performance search space:

[Figure: search space with three axes: parallelism choices, algorithmic choices, and accuracy choices]

  • Exploiting parallelism is necessary but not sufficient
  • Performance is a multi-dimensional search problem
  • Normally done by expert programmers
  • Optimization decisions often change program results

slide-6
SLIDE 6

High Performance Search Problem

Goal of this work

To automate the process of program optimization to create programs that can adapt to changing environments and goals.

  • Language-level solutions for concisely representing algorithmic choice spaces.
  • Processes and compilation techniques to manage and explore these spaces.
  • Autotuning techniques to efficiently solve these search problems.

slide-8
SLIDE 8

Research Covered in This Talk

  • The PetaBricks programming language: algorithmic choice at the language level [PLDI’09]
  • Language-level support for variable accuracy [CGO’11]
  • Automated construction of multigrid V-cycles [SC’09]
  • Code generation and autotuning for heterogeneous CPU/GPU mix of parallel processing units [ASPLOS’13]
  • Solution for input sensitivity based on adaptive overhead-aware classifiers [Under review]
  • OpenTuner: an extensible framework for program autotuning [Under review]
  • Won’t be talking about work in: ASPLOS’09, ASPLOS’12, GECCO’11, IPDPS’09, PLDI’11, and many others

slide-11
SLIDE 11

A Motivating Example for Algorithmic Choice

  • How would you write a fast sorting algorithm?
  • Insertion sort
  • Quick sort
  • Merge sort
  • Radix sort
  • Poly-algorithms
slide-12
SLIDE 12

std::stable_sort

/usr/include/c++/4.5.2/bits/stl_algo.h lines 3350-3367

slide-15
SLIDE 15

Why 15?

  • Why 15?
  • Dates back to at least 2000 (June 2000 SGI release)
  • Still in the current C++ STL shipped with GCC
  • cutoff = 15 has survived 10+ years
  • In the source code for millions [1] of C++ programs
  • There is nothing the compiler can do about it

[1] Any C++ program with “#include <algorithm>”; conservative estimate based on: http://c2.com/cgi/wiki?ProgrammingLanguageUsageStatistics

slide-17
SLIDE 17

Is 15 The Right Number?

  • The best cutoff (CO) changes
  • Depends on competing costs:
  • Cost of computation (< operator, call overhead, etc.)
  • Cost of communication (swaps)
  • Cache behavior (misses, prefetcher, locality)
  • Sorting 100000 doubles with std::stable_sort:
  • CO ≈ 200 optimal on a Phenom 905e (15% speedup)
  • CO ≈ 400 optimal on an Opteron 6168 (15% speedup)
  • CO ≈ 500 optimal on a Xeon E5320 (34% speedup)
  • CO ≈ 700 optimal on a Xeon X5460 (25% speedup)
  • If the best cutoff has changed, perhaps the best algorithm has also changed
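
The cutoff discussed on this slide can be made concrete with a small sketch. The code below is not the libstdc++ implementation; it is a minimal Python merge sort, written for illustration, that hands short runs to insertion sort below a tunable `cutoff`. The function names and the `time_cutoffs` helper are assumptions introduced here. Timings will differ by machine and interpreter, which is the point of the slide.

```python
import random
import time


def insertion_sort(a, lo, hi):
    """Sort a[lo:hi] in place; stable."""
    for i in range(lo + 1, hi):
        x = a[i]
        j = i - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x


def hybrid_merge_sort(a, cutoff=15, lo=0, hi=None):
    """Stable merge sort; runs shorter than `cutoff` go to insertion sort."""
    if hi is None:
        hi = len(a)
    if hi - lo < max(cutoff, 2):
        insertion_sort(a, lo, hi)
        return
    mid = (lo + hi) // 2
    hybrid_merge_sort(a, cutoff, lo, mid)
    hybrid_merge_sort(a, cutoff, mid, hi)
    left, right = a[lo:mid], a[mid:hi]
    i = j = 0
    k = lo
    while i < len(left) and j < len(right):
        if right[j] < left[i]:  # strict '<' keeps equal keys in input order
            a[k] = right[j]
            j += 1
        else:
            a[k] = left[i]
            i += 1
        k += 1
    a[k:hi] = left[i:] + right[j:]


def time_cutoffs(n=20000, cutoffs=(1, 15, 50, 200), seed=0):
    """Return {cutoff: seconds} for sorting the same random doubles."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]
    timings = {}
    for co in cutoffs:
        work = list(data)
        start = time.perf_counter()
        hybrid_merge_sort(work, cutoff=co)
        timings[co] = time.perf_counter() - start
    return timings
```

Calling `time_cutoffs()` on different machines is a quick way to see that the fastest cutoff is not a universal constant.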

slide-18
SLIDE 18

Algorithmic Choice

  • Compiler’s hands are tied; it is stuck with 15
  • Need a better way to represent algorithmic choices
  • PetaBricks is the first language with support for algorithmic choice

slide-20
SLIDE 20

Sort in PetaBricks

Language

function Sort
to out[n]
from in[n]
{
  either {
    InsertionSort(out, in);
  } or {
    QuickSort(out, in);
  } or {
    MergeSort(out, in);
  } or {
    RadixSort(out, in);
  }
}

Representation

Decision tree synthesized by our autotuner

slide-21
SLIDE 21

Decision Trees

Optimized for a Xeon E7340 (8 cores):

[Figure: decision tree with tests N < 600 and N < 1420 selecting among Insertion Sort, Quick Sort, and Merge Sort (2-way)]
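
A decision tree like this one can be read as a size-based dispatcher. The sketch below is plain Python, not PetaBricks output. It assumes the two thresholds map small inputs to insertion sort, mid-sized inputs to quick sort, and everything else to merge sort; the extracted slide text does not preserve the tree's branches, so that mapping is an assumption. `DECISION_TREE`, `select_algorithm`, and `poly_sort` are names invented for the example.

```python
def insertion_sort(items):
    out = list(items)
    for i in range(1, len(out)):
        x = out[i]
        j = i - 1
        while j >= 0 and out[j] > x:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = x
    return out


def quick_sort(items):
    pivot = items[len(items) // 2]
    less = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]
    # Sub-problems re-enter the decision tree, so they may switch algorithm.
    return poly_sort(less) + equal + poly_sort(greater)


def merge_sort(items):
    mid = len(items) // 2
    left, right = poly_sort(items[:mid]), poly_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    return merged + left[i:] + right[j:]


# (exclusive upper bound on N, algorithm); thresholds taken from the slide,
# branch assignment assumed for illustration.
DECISION_TREE = [
    (600, insertion_sort),
    (1420, quick_sort),
    (float("inf"), merge_sort),
]


def select_algorithm(n, tree=DECISION_TREE):
    """Return the first algorithm whose size bound exceeds n."""
    for bound, algorithm in tree:
        if n < bound:
            return algorithm


def poly_sort(items):
    """Poly-algorithm: choose per call, including recursive calls."""
    return select_algorithm(len(items))(items)
```

Because the recursive calls go back through `poly_sort`, a large input starts with merge sort, drops to quick sort as pieces shrink, and finishes with insertion sort, which is the kind of composite algorithm an autotuner can search for.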

slide-22
SLIDE 22

Decision Trees

Optimized for Sun Fire T200 Niagara (8 cores):

[Figure: decision tree with tests N < 1461, N < 2400, and N < 75 selecting among Merge Sort (2-way), Merge Sort (4-way), Merge Sort (8-way), and Merge Sort (16-way)]

slide-23
SLIDE 23

Sort Algorithm Timings [2]

[Figure: time (s) versus input size (250 to 1750) for InsertionSort, QuickSort, MergeSort, RadixSort, and Autotuned]

[2] On an 8-way Xeon E7340 system

slide-25
SLIDE 25

Iteration Order Choices

  • Many other choices related to execution order
  • By rows?
  • By columns?
  • Diagonal? Reverse order? Blocked?
  • Parallel?
  • Choices both within a single (possibly parallel) task and between different tasks
  • This is the main motivation for a new language as opposed to a library
slide-26
SLIDE 26

Synthesized Outer Control Flow

  • PetaBricks programs have synthesized outer control flow
  • Declarative (data-flow-like) outer syntax
  • Imperative inner code
  • Programs start as completely parallel
  • Added dependencies restrict the space of legal executions
  • May only access data explicitly depended on

Parallel loop

X.cell(i) from () { ... }

Sequential loop

X.cell(i) from (X.cell(i-1) left) { ... }

slide-28
SLIDE 28

Matrix Multiply

transform MatrixMultiply
to AB[w, h]
from A[c, h], B[w, c]
{
  AB.cell(x, y) from (A.row(y) a, B.column(x) b) {
    return dot(a, b);
  }

  to (AB.region(x, y, x + 4, y + 4) out)
  from (A.region(0, y, c, y + 4) a,
        B.region(x, 0, x + 4, c) b) {
    // ... compute 4 x 4 block ...
  }
}
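
The two rules in this transform compute the same output in different granularities. As a rough illustration in plain Python (not PetaBricks semantics, and with function names invented here), the sketch below writes one function per rule and leaves the choice between them to the caller, which is the role the autotuner plays. Matrices are lists of rows, so `A` has `h` rows of length `c` and `B` has `c` rows of length `w`.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def multiply_by_cell(A, B):
    """Rule 1: each output cell from one row of A and one column of B."""
    h, c, w = len(A), len(B), len(B[0])
    return [
        [dot(A[y], [B[k][x] for k in range(c)]) for x in range(w)]
        for y in range(h)
    ]


def multiply_by_block(A, B, block=4):
    """Rule 2: each block x block output region from a band of A and B."""
    h, c, w = len(A), len(B), len(B[0])
    AB = [[0] * w for _ in range(h)]
    for y0 in range(0, h, block):
        for x0 in range(0, w, block):
            # Edge blocks are clipped when h or w is not a multiple of block.
            for y in range(y0, min(y0 + block, h)):
                for x in range(x0, min(x0 + block, w)):
                    AB[y][x] = sum(A[y][k] * B[k][x] for k in range(c))
    return AB
```

Both functions return the same matrix; only the iteration structure differs, so either may be faster depending on the platform.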

slide-29
SLIDE 29

Strassen Matrix Multiply

transform Strassen
to AB[n, n]
from A[n, n], B[n, n]
using M1[n/2, n/2], M2[n/2, n/2], M3[n/2, n/2], M4[n/2, n/2],
      M5[n/2, n/2], M6[n/2, n/2], M7[n/2, n/2]
{
  to (M1 m1)
  from (A.region(0, 0, n/2, n/2) a11,
        A.region(n/2, n/2, n, n) a22,
        B.region(0, 0, n/2, n/2) b11,
        B.region(n/2, n/2, n, n) b22)
  using (t1[n/2, n/2], t2[n/2, n/2]) {
    spawn MatrixAdd(t1, a11, a22);
    spawn MatrixAdd(t2, b11, b22);
    sync;
    Strassen(m1, t1, t2);
  }
  ....
  // Compute one quadrant of output with strassen decomposition
  to (AB.region(n/2, 0, n, n/2) c12)
  from (M3 m3, M5 m5) {
    MatrixAdd(c12, m3, m5);
  }
  ....
  // Or, compute element in output directly (same as last slide)
  AB.cell(x, y)
  from (A.row(y) a, B.column(x) b) {
    return dot(a, b);
  }
}

slide-31
SLIDE 31

Variable Accuracy Algorithms

  • Many problems don’t have a single correct answer; optimizations often trade off accuracy and performance.
  • Soft computing
  • DSP algorithms
  • Iterative algorithms
  • Variable accuracy, supported in the PetaBricks language, is a fundamental part of algorithmic choice which enables new classes of programs to be represented.

slide-32
SLIDE 32

K-Means Example

transform kmeans
from Points[n, 2]  // Array of points (each column
                   // stores x and y coordinates)
using Centroids[sqrt(n), 2]
to Assignments[n]
{
  // Rule 1:
  // One possible initial condition: Random
  // set of points
  to (Centroids.column(i) c) from (Points p) {
    c = p.column(rand(0, n))
  }

  // Rule 2:
  // Another initial condition: Centerplus initial
  // centers (kmeans++)
  to (Centroids c) from (Points p) {
    CenterPlus(c, p);
  }

  // Rule 3:
  // The kmeans iterative algorithm
  to (Assignments a) from (Points p, Centroids c) {
    while (true) {
      int change;
      AssignClusters(a, change, p, c, a);
      if (change == 0) return;  // Reached fixed point
      NewClusterLocations(c, p, a);
    }
  }
}

slide-33
SLIDE 33

K-Means Example (Variable Accuracy)

transform kmeans
accuracy_metric kmeansaccuracy
accuracy_variable k
from Points[n, 2]  // Array of points (each column
                   // stores x and y coordinates)
using Centroids[k, 2]
to Assignments[n]
...
  // Rule 3:
  // The kmeans iterative algorithm
  to (Assignments a) from (Points p, Centroids c) {
    for_enough {
      int change;
      AssignClusters(a, change, p, c, a);
      if (change == 0) return;  // Reached fixed point
      NewClusterLocations(c, p, a);
    }
  }
}

transform kmeansaccuracy
from Assignments[n], Points[n, 2]
to Accuracy
{
  Accuracy from (Assignments a, Points p) {
    return sqrt(2*n / SumClusterDistanceSquared(a, p));
  }
}
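
The accuracy metric on this slide is a plain function of the output, so it can be sketched outside PetaBricks. The Python below assumes that `SumClusterDistanceSquared` means the sum of squared distances from each point to the mean of its assigned cluster; the slide does not define it, so that reading, and the choice to return infinity when the sum is zero, are assumptions of this sketch.

```python
import math
from collections import defaultdict


def sum_cluster_distance_squared(assignments, points):
    """Sum of squared distances from each point to its cluster mean."""
    groups = defaultdict(list)
    for cluster, point in zip(assignments, points):
        groups[cluster].append(point)
    total = 0.0
    for members in groups.values():
        cx = sum(p[0] for p in members) / len(members)
        cy = sum(p[1] for p in members) / len(members)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in members)
    return total


def kmeans_accuracy(assignments, points):
    """sqrt(2n / sum of squared distances); larger means tighter clusters."""
    total = sum_cluster_distance_squared(assignments, points)
    if total == 0:
        return math.inf  # assumption: perfect clustering, avoid dividing by 0
    return math.sqrt(2 * len(points) / total)
```

With such a metric, tightening the clusters (for example by raising `k` or iterating longer) raises the reported accuracy, which gives the autotuner a number to trade against running time.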

slide-35
SLIDE 35

Semantics of Variable Accuracy

Running the accuracy metric on the output will return a value that, in expectation, exceeds the accuracy target more than P percent of the time.

  • Expected distribution of accuracy is measured during autotuning time, not at runtime.
  • When fixed accuracy code calls variable accuracy code, an accuracy target must be specified.
  • When variable accuracy code calls code containing variable accuracy components, only the outermost accuracy target will be honored.

slide-36
SLIDE 36

A Brief Multigrid Intro

  • Used to iteratively solve PDEs over a gridded domain
  • Relaxations update points using neighboring values (stencil computations)
  • Restrictions and Interpolations compute new grid with coarser or finer discretization
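
The three multigrid building blocks named above can be sketched in one dimension. The Python below is an illustrative stand-in, not code from the talk: it assumes the model problem -u'' = f on a uniform grid with fixed boundary values, weighted Jacobi as the relaxation, full weighting as the restriction, and linear interpolation. All function names and the `omega` default are choices made for this example.

```python
def relax(u, f, h, sweeps=1, omega=2 / 3):
    """Weighted Jacobi sweeps for -u'' = f; boundary values stay fixed."""
    for _ in range(sweeps):
        new = list(u)
        for i in range(1, len(u) - 1):
            jacobi = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
            new[i] = (1 - omega) * u[i] + omega * jacobi
        u = new
    return u


def restrict(fine):
    """Full weighting: a grid of 2m+1 points becomes m+1 points."""
    m = (len(fine) - 1) // 2
    inner = [
        0.25 * fine[2 * i - 1] + 0.5 * fine[2 * i] + 0.25 * fine[2 * i + 1]
        for i in range(1, m)
    ]
    return [fine[0]] + inner + [fine[-1]]


def interpolate(coarse):
    """Linear interpolation: a grid of m+1 points becomes 2m+1 points."""
    fine = [0.0] * (2 * (len(coarse) - 1) + 1)
    for i, value in enumerate(coarse):
        fine[2 * i] = value
    for i in range(1, len(fine), 2):
        fine[i] = 0.5 * (fine[i - 1] + fine[i + 1])
    return fine
```

A cycle shape is then a schedule of these calls: how many relaxation sweeps to run at each grid size and when to move to a coarser or finer grid.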
slide-38
SLIDE 38

Standard Cycle Shapes

  • Cycle shapes affect accuracy and performance
  • Equation, accuracy target, data, and execution platform affect efficacy of different shapes
  • Entire papers published about new cycle shapes!
  • We fundamentally change the status quo in this domain
  • Define the search space of cycle shapes once
  • Autotune to find a cycle shape tailored to your problem

slide-39
SLIDE 39

Choice Space of Multigrid

slide-40
SLIDE 40

Autotuned V-cycle Shapes

[Figure: autotuned cycle shapes over grid sizes 2048 down to 16, in panels labeled 10^1, 10^3, 10^5, and 10^7]

slide-45
SLIDE 45

Dynamic Programming Technique for Autotuning Multigrid

[Figure: diagram relating grid size i to grid size 2i]

  • Partition accuracy space into discrete levels
  • Base space of candidate algorithms on optimal algorithms from coarser level

slide-46
SLIDE 46

2D Poisson’s Equation (uses Multigrid)

[Figure: 2D Poisson’s equation; speedup (x) versus input size (100 to 1e+06) for accuracy levels 10^9, 10^7, 10^5, 10^3, and 10^1]

slide-47
SLIDE 47

More Variable Accuracy Results

[Figure: six panels of speedup (x) versus input size, one curve per accuracy level]

  • Clustering: accuracy levels 0.95, 0.75, 0.50, 0.20, 0.10, 0.05
  • Bin Packing: accuracy levels 1.01, 1.1, 1.2, 1.3, 1.4
  • Image Compression: accuracy levels 2.0, 1.5, 1.0, 0.8, 0.6, 0.3
  • 3D Helmholtz: accuracy levels 10^9, 10^7, 10^5, 10^3, 10^1
  • 2D Poisson: accuracy levels 10^9, 10^7, 10^5, 10^3, 10^1
  • Preconditioner: accuracy levels 3.0, 2.0, 1.5, 1.0, 0.5, 0.0

slide-48
SLIDE 48

Results on Different Systems

Test Systems

Codename | CPU(s) | Cores | GPU | OpenCL Runtime
Desktop | Core i7 920 @2.67GHz | 4 | NVIDIA Tesla C2070 | CUDA Toolkit 3.2
Server | 4× Xeon X7550 @2GHz | 32 | None | AMD APP SDK 2.5
Laptop | Core i5 2520M @2.5GHz | 2 | AMD Radeon HD 6630M | Xcode 4.2

Benchmarks

Name | # Possible Configs | Generated OpenCL Kernels | Mean Autotuning Time | Testing Input Size
SeparableConv. | 10^1358 | 9 | 3.82 hours | 3520^2
Black-Scholes | 10^130 | 1 | 3.09 hours | 500000
Poisson2D SOR | 10^1358 | 25 | 15.37 hours | 2048^2
Sort | 10^920 | 7 | 3.56 hours | 2^20
Strassen | 10^1509 | 9 | 3.05 hours | 1024^2
SVD | 10^2435 | 8 | 1.79 hours | 256^2
Tridiagonal Solver | 10^1040 | 8 | 5.56 hours | 1024^2

slide-52
SLIDE 52

Separable Convolution (width=7)

[Figure: execution time (normalized, 1.0x to 3.0x) on Desktop, Server, and Laptop for Desktop Config, Server Config, Laptop Config, and Hand-coded OpenCL]

SeparableConv. configurations:

  • Desktop Config: 1D kernel+local memory on GPU
  • Server Config: 1D kernel on OpenCL
  • Laptop Config: 2D kernel+local memory on GPU

slide-53
SLIDE 53

Poisson 2D SOR

[Figure: execution time (normalized, 1.0x to 9.0x) on Desktop, Server, and Laptop for Desktop Config, Server Config, and Laptop Config]

Poisson2D SOR configurations:

  • Desktop Config: Split on CPU followed by compute on GPU
  • Server Config: Split some parts on OpenCL followed by compute on CPU
  • Laptop Config: Split on CPU followed by compute on GPU

slide-54
SLIDE 54

Singular Value Decomposition (SVD)

[Figure: execution time (normalized, 1.0x to 2.0x) on Desktop, Server, and Laptop for Desktop Config, Server Config, and Laptop Config]

SVD configurations:

  • Desktop Config: First phase: task parallelism between CPU/GPU; matrix multiply: 8-way parallel recursive decomposition on CPU, call LAPACK when < 42 × 42
  • Server Config: First phase: all on CPU; matrix multiply: 8-way parallel recursive decomposition on CPU, call LAPACK when < 170 × 170
  • Laptop Config: First phase: all on CPU; matrix multiply: 4-way parallel recursive decomposition on CPU, call LAPACK when < 85 × 85

slide-55
SLIDE 55

Results Takeaways

  • Different configurations are required for best performance on different systems
  • Not just changing block sizes
  • Cannot be easily solved by a simple heuristic
  • Motivates the need for algorithmic choice and autotuning
slide-56
SLIDE 56

Autotuning Challenges

  • Evaluating quality of candidate algorithms is expensive
  • Must run the program (at least once)
  • More expensive for unfit solutions
  • Scales poorly with larger problem sizes
  • Fitness is noisy
  • Randomness from parallel races and system noise
  • Testing each candidate only once often produces a worse algorithm
  • Running many trials is expensive
  • Decision tree structures are complex
  • Not easy to hill-climb
  • We artificially bound them
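
The point about noisy fitness can be illustrated with a toy model. The sketch below does not reproduce the PetaBricks autotuner; it assumes each candidate has a hidden true running time and that every measurement adds Gaussian noise, then compares candidates by the median of repeated trials. `measure`, `pick_best`, and the noise level are all assumptions made for the example.

```python
import random
import statistics


def measure(true_time, rng, noise=0.2):
    """One noisy timing; a stand-in for actually running a candidate."""
    return max(0.0, rng.gauss(true_time, noise))


def pick_best(candidates, trials, rng):
    """Name of the candidate with the lowest median over `trials` runs.

    `candidates` maps a name to its (normally unknown) true running time.
    """
    medians = {
        name: statistics.median(measure(t, rng) for _ in range(trials))
        for name, t in candidates.items()
    }
    return min(medians, key=medians.get)
```

With `trials=1` the slower candidate wins a noticeable share of comparisons; raising `trials` fixes the ranking but multiplies the cost of every evaluation, which is the tension the slide describes.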
slide-57
SLIDE 57

Input Sensitivity

  • Input sensitivity is a major challenge
  • Different algorithms may be better for different inputs
  • Use fast algorithm for easy inputs, slow algorithm for hard inputs
  • Avoid pathological cases
slide-59
SLIDE 59

Input Sensitivity Today

  • Vast majority of programs today use a single algorithm for all inputs
  • This forces design for the “worst case” input
  • Wastes time and resources
  • Related work:
  • Uses hand-written heuristics to adapt to inputs
  • Rectify inputs for security [Long et al.]
  • Our system automatically classifies inputs and runs a program optimized for the type of input being processed
slide-60
SLIDE 60

Input Sensitivity Overview

[Figure: pipeline diagram with Training and Deployment phases; labels: Program, Training Inputs, Feature Extractors, Input Aware Learning, Input Classifier, Input Optimized Programs, Input, Select, Selected Program, Run]

Insights:

  • Feature Priority List
  • Performance Bounds
slide-61
SLIDE 61

Input Features

function Sort
to out[n]
from in[n]
input_feature Sortedness, Duplication
{ ... }

function Sortedness
from in[n]
to sortedness
tunable double level (0.0, 1.0)
{
  int sortedcount = 0;
  int count = 0;
  int step = (int)(level*n);
  for (int i = 0; i+step < n; i += step) {
    if (in[i] <= in[i+step]) {
      // increment for correctly ordered
      // pairs of elements
      sortedcount += 1;
    }
    count += 1;
  }
  if (count > 0)
    sortedness = sortedcount / (double) count;
  else
    sortedness = 0.0;
}

function Duplication
from in[n]
to duplication
{ ... }
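
For readers who want to try the Sortedness feature outside PetaBricks, a direct Python transcription is sketched below. It follows the slide's logic, sampling pairs `step` apart and reporting the fraction that are in order. One deliberate difference is labeled in the code: the step is clamped to at least 1, because a step of 0 would never advance the loop. That guard is an addition of this sketch, not part of the slide.

```python
def sortedness(data, level):
    """Fraction of sampled pairs (i, i + step) that are in order.

    `level` plays the role of the tunable on the slide: a larger level
    means a larger step, fewer samples, and a cheaper but coarser feature.
    """
    n = len(data)
    step = max(1, int(level * n))  # guard added here; not in the slide code
    sorted_count = 0
    count = 0
    i = 0
    while i + step < n:
        if data[i] <= data[i + step]:
            sorted_count += 1
        count += 1
        i += step
    return sorted_count / count if count > 0 else 0.0
```

Because `level` controls how many pairs are sampled, the same feature can be extracted at several costs, which is what lets a classifier weigh a feature's value against the overhead of computing it.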

slide-62
SLIDE 62

Input Space Sampling

[Figure: sampled inputs plotted by Duplication and Sortedness]

slide-68
SLIDE 68

Training

[Figure: training flow; labels: Features, Input Labels, Classifier Constructors (Decision Tree, Max A Priori, Adaptive Tree; numbered 1...m and m+1), Classifier Selector, Input Classifier]

Selection objective considers cost of extracting needed features

slide-69
SLIDE 69

How Many Landmarks Are Enough?

[Figure: two panels. Lost speedup (L) versus size of region (pi) for 2 to 9 configs; speedup versus number of landmarks (10 to 100)]

slide-70
SLIDE 70

Input Adaptation Results

[Figure: six panels of speedup versus number of landmarks (1 to 100): sort, clustering, binpacking, svd, poisson2d, helmholtz3d]

slide-72
SLIDE 72

Related Projects

A small selection of many related projects:

Package | Domain | Search Method
Active Harmony | Runtime System | Nelder-Mead
ATLAS | Dense Linear Algebra | Exhaustive
Code Perforation | Compiler | Exhaustive + Simulated Annealing
Dynamic Knobs | Runtime System | Control Theory
FFTW | Fast Fourier Transform | Exhaustive / Dynamic Prog.
Insieme | Compiler | Differential Evolution
Milepost GCC / cTuning | Compiler | IID Model + Central DB
OSKI | Sparse Linear Algebra | Exhaustive + Heuristic
PATUS | Stencil Computations | Nelder-Mead or Evolutionary
SEEC / Heartbeats | Runtime System | Control Theory
Sepya | Stencil Computations | Random-Restart Gradient Ascent
SPIRAL | DSP Algorithms | Pareto Active Learning

  • Simple techniques (exhaustive, hill climbers, etc.) are popular
  • No single technique is best for all problems
  • Representations are often just integers/floats/booleans
slide-74
SLIDE 74

Limits of Existing Autotuning Projects

  • We believe these factors limit the scope and efficiency of autotuning
  • A hill climber works great for a block size, but completely fails at synthesizing poly-algorithms
  • Many users of autotuning work hard to prune their search spaces to fit their techniques
  • OpenTuner provides extensible representations and ensembles of techniques which can solve more complex autotuning problems

slide-75
SLIDE 75

OpenTuner Overview

OpenTuner: an extensible framework for program autotuning

[Figure: architecture diagram with a Results Database shared by two sides]

  • Search: Search Techniques and Search Driver. Reads: Results. Writes: Desired Results.
  • Measurement: User Defined Measurement Function and Measurement Driver. Reads: Desired Results. Writes: Results.
  • Configuration Manipulator

slide-76
SLIDE 76

OpenTuner Configuration Manipulator Parameters

[Figure: parameter class hierarchy rooted at Parameter, with Primitive and Complex branches; classes shown: Integer, ScaledNumeric, Float, LogInteger, LogFloat, PowerOfTwo, Switch, Enum, Permutation, Schedule, Selector, Boolean]

  • Hierarchical structure of parameters; user-defined parameter types can be added at any point
  • Primitive parameters behave like bounded integers or floats
  • Complex parameters have a set of stochastic mutation operators
  • Technique-specific operators
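
The split between primitive and complex parameters can be sketched with two toy classes. These are not OpenTuner's classes and do not use its API; `IntegerParam`, `PermutationParam`, and their methods are hypothetical names chosen for illustration. The sketch only shows why a bounded integer can be handled numerically while a permutation needs its own stochastic mutation operator.

```python
import random


class IntegerParam:
    """Primitive: behaves like a bounded integer, so numeric search works."""

    def __init__(self, name, lo, hi):
        self.name, self.lo, self.hi = name, lo, hi

    def random_value(self, rng):
        return rng.randint(self.lo, self.hi)

    def clamp(self, value):
        """Map any numeric proposal from a search technique into range."""
        return min(max(int(round(value)), self.lo), self.hi)


class PermutationParam:
    """Complex: has no numeric ordering, so it is changed by mutation."""

    def __init__(self, name, items):
        self.name, self.items = name, list(items)

    def random_value(self, rng):
        return rng.sample(self.items, len(self.items))

    def mutate_swap(self, value, rng):
        """Stochastic mutation operator: swap two random positions."""
        out = list(value)
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
        return out
```

A hill climber can step an `IntegerParam` up or down, but for a `PermutationParam` the only meaningful moves are operators such as `mutate_swap`, which is why the representation has to be extensible.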
slide-81
SLIDE 81

Ensembles of Techniques

[Figure: an AUC Bandit sits above Differential Evolution, Particle Swarm Optimization, and Torczon Hill Climber, which share information through ResultsDB, and asks "Which configuration should we try next?"]

Exploration: allocation of 33% / 33% / 33% across the three techniques

slide-82
SLIDE 82

Ensembles of Techniques

[Figure: an AUC Bandit sits above Differential Evolution, Particle Swarm Optimization, and Torczon Hill Climber, which share information through ResultsDB, and asks "Which configuration should we try next?"]

Exploitation: allocation of 100% / 0% / 0% across the three techniques
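
The shift from exploration to exploitation can be sketched with a small multi-armed bandit. The code below is a simplified stand-in, not OpenTuner's AUC Bandit: it credits a technique with the fraction of its recent uses that produced a new best result, instead of an area-under-curve credit, and adds a standard exploration bonus. The class name, window size, and constant `c` are assumptions of this sketch.

```python
import math
from collections import deque


class SlidingWindowBandit:
    """Pick the next search technique from recent success history."""

    def __init__(self, techniques, window=50, c=0.5):
        self.techniques = list(techniques)
        self.history = deque(maxlen=window)  # (technique, was_new_best)
        self.c = c

    def _score(self, technique):
        uses = [ok for t, ok in self.history if t == technique]
        if not uses:
            return math.inf  # unseen in the window: try it first
        exploit = sum(uses) / len(uses)
        explore = self.c * math.sqrt(
            2 * math.log(len(self.history)) / len(uses)
        )
        return exploit + explore

    def select(self):
        return max(self.techniques, key=self._score)

    def report(self, technique, was_new_best):
        self.history.append((technique, bool(was_new_best)))
```

Early on every technique is tried; once one keeps producing new best results it receives most of the tests, while the exploration bonus and the sliding window keep giving the others an occasional chance.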

slide-83
SLIDE 83

OpenTuner Results

Project | Benchmark | Possible Configurations
GCC/G++ Flags | all | 10^806
Halide | Blur | 10^52
Halide | Wavelet | 10^44
HPL | n/a | 10^9.9
PetaBricks | Poisson | 10^3657
PetaBricks | Sort | 10^90
PetaBricks | Strassen | 10^188
PetaBricks | TriSolve | 10^1559
Stencil | all | 10^6.5
Unitary | n/a | 10^21

slide-85
SLIDE 85

OpenTuner Results: GCC Flags

[Figure: four panels of execution time (seconds) versus autotuning time (seconds, 300 to 1800), comparing -O1, -O2, -O3, and OpenTuner: fft.c (gcc), matrixmultiply.cpp (g++), raytracer.cpp (g++), tsp_ga.cpp (g++)]
slide-86
SLIDE 86

OpenTuner Results: PetaBricks

[Figure: four panels of execution time (seconds) versus autotuning time (seconds), comparing PetaBricks Autotuner and OpenTuner: Poisson 2D, Sort, Strassen, Tridiagonal Solver]
slide-87
SLIDE 87

Conclusions

  • PetaBricks has pushed the limits of what can be done with algorithmic choice
  • Provides performance portability by allowing programs to adapt to their environment
  • Have shown: variable accuracy, multigrid, and input sensitivity
  • Hope that future mainstream programming languages will incorporate algorithmic choice and autotuning
  • OpenTuner can expand the scope of program autotuning for other projects
  • Extensible configuration representation
  • Ensembles of techniques
  • Hope that the field of autotuning will expand to much more complex problems

slide-88
SLIDE 88

Coauthors and Collaborators

  • Saman Amarasinghe
  • Cy Chan
  • Yufei Ding
  • Alan Edelman
  • Sam Fingeret
  • Sanath Jayasena
  • Shoaib Kamil
  • Kevin Kelley
  • Erika Lee
  • Deepak Narayanan
  • Marek Olszewski
  • Una-May O’Reilly
  • Maciej Pacula
  • Phitchaya Mangpo Phothilimthana
  • Jonathan Ragan-Kelley
  • Xipeng Shen
  • Michele Tartara
  • Kalyan Veeramachaneni
  • Yod Watanaprakornku
  • Yee Lok Wong
  • Kevin Wu
  • Minshu Zhan
  • Qin Zhao
slide-89
SLIDE 89

Thanks!

About me:

http://jasonansel.com/ http://opentuner.org/ http://projects.csail.mit.edu/petabricks/