Introduction PetaBricks OpenTuner Conclusions
Autotuning Programs with Algorithmic Choice
Jason Ansel
MIT - CSAIL
December 18, 2013
High Performance Search Problem
Performance search space:
Parallelism choices, Accuracy choices, Algorithmic choices
necessary but not sufficient
multi-dimensional search problem
programmers
change program results
Goal of this work
To automate the process of program optimization to create programs that can adapt to changing environments and goals.
choice spaces.
these spaces.
problems.
Research Covered in This Talk
the language level [PLDI’09]
mix of parallel processing units [ASPLOS’13]
[Under review]
GECCO’11, IPDPS’09, PLDI’11, and many others
A Motivating Example for Algorithmic Choice
std::stable_sort
/usr/include/c++/4.5.2/bits/stl_algo.h lines 3350-3367
Why 15?
[1] Any C++ program with “#include <algorithm>”; conservative estimate based on: http://c2.com/cgi/wiki?ProgrammingLanguageUsageStatistics
Is 15 The Right Number?
also changed
Algorithmic Choice
choice
Sort in PetaBricks
Language
function Sort
to out[n]
from in[n]
{
  either {
    InsertionSort(out, in);
  } or {
    QuickSort(out, in);
  } or {
    MergeSort(out, in);
  } or {
    RadixSort(out, in);
  }
}
⇒
Representation
Decision tree synthesized by our autotuner
Decision Trees
Optimized for a Xeon E7340 (8 cores):
N < 600: Insertion Sort
600 <= N < 1420: Quick Sort
N >= 1420: Merge Sort (2-way)
Optimized for Sun Fire T200 Niagara (8 cores):
N < 75: Merge Sort (16-way)
75 <= N < 1461: Merge Sort (8-way)
1461 <= N < 2400: Merge Sort (4-way)
N >= 2400: Merge Sort (2-way)
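The Xeon E7340 decision tree can be read as a selector that is consulted again at every recursive call, so a large input starts with merge sort and finishes with insertion sort on the small pieces. The following is a minimal Python sketch of that idea, not PetaBricks output: the function names are hypothetical, and only the cutoffs 600 and 1420 come from the slide.

```python
import heapq
import random


def insertion_sort(xs):
    """Return a sorted copy of xs using insertion sort."""
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


def quick_sort(xs, sort):
    """Partition around a pivot, then hand the parts back to the selector."""
    pivot = xs[len(xs) // 2]
    less = [x for x in xs if x < pivot]
    equal = [x for x in xs if x == pivot]
    greater = [x for x in xs if x > pivot]
    return sort(less) + equal + sort(greater)


def merge_sort_2way(xs, sort):
    """Split in two, sort each half through the selector, then merge."""
    mid = len(xs) // 2
    return list(heapq.merge(sort(xs[:mid]), sort(xs[mid:])))


def poly_sort(xs):
    """Selector using the cutoffs shown for the Xeon E7340 tree."""
    n = len(xs)
    if n < 600:
        return insertion_sort(xs)
    if n < 1420:
        return quick_sort(xs, poly_sort)
    return merge_sort_2way(xs, poly_sort)


data = [random.randrange(10**6) for _ in range(5000)]
result = poly_sort(data)
```

On a different machine the same structure would be kept and only the cutoffs and algorithm choices would change, which is what the Niagara tree illustrates.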
Sort Algorithm Timings
[Figure: Time (s) vs. Input Size (250 to 1750) for InsertionSort, QuickSort, MergeSort, RadixSort, and Autotuned, on an 8-way Xeon E7340 system]
Iteration Order Choices
task and between different tasks
Synthesized Outer Control Flow
Parallel loop
... }
Sequential loop
left) { ... }
Matrix Multiply
transform MatrixMultiply
to AB[w, h]
from A[c, h], B[w, c]
{
  ...
    return dot(a, b);
  }

  to (AB.region(x, y, x + 4, y + 4) ...
  from (A.region(0, y, c, y + 4) a,
        ... 0, x + 4, c) b) {
    // ... compute 4 x 4 block ...
  }
}
Strassen Matrix Multiply
transform Strassen
to AB[n, n]
from A[n, n], B[n, n]
using M1[n/2, n/2], M2[n/2, n/2], M3[n/2, n/2], M4[n/2, n/2],
      M5[n/2, n/2], M6[n/2, n/2], M7[n/2, n/2]
{
  to (M1 m1)
  from (A.region(0, 0, n/2, n/2) a11,
        ... n) a22,
        ... 0, n/2, n/2) b11,
        ... n) b22)
  using (t1[n/2, n/2], t2[n/2, n/2])
  {
    spawn MatrixAdd(t1, a11, a22);
    spawn MatrixAdd(t2, b11, b22);
    sync;
    Strassen(m1, t1, t2);
  }
  ...

  // Compute one quadrant ... with strassen decomposition
  to (AB.region(n/2, 0, n, n/2) c12)
  from (M3 m3, M5 m5)
  {
    MatrixAdd(c12, m3, m5);
  }
  ...

  // Or, compute element in ... directly (same as last slide)
  ...
  from (A.row(y) a, B.column(x) b)
  {
    return dot(a, b);
  }
}
Variable Accuracy Algorithms
fundamental part of algorithmic choice which enables new classes of programs to be represented.
K-Means Example
transform kmeans
from Points[n, 2]  // Array of points (each column
                   // stores x and y coordinates)
using Centroids[sqrt(n), 2]
to Assignments[n]
{
  // Rule 1:
  // One possible initial condition: Random
  // set of points
  to (Centroids.column(i) c) from (Points p) {
    c = p.column(rand(0, n))
  }

  // Rule 2:
  // Another initial condition: Centerplus initial
  // centers (kmeans++)
  to (Centroids c) from (Points p) {
    CenterPlus(c, p);
  }

  // Rule 3:
  // The kmeans iterative algorithm
  to (Assignments a) from (Points p, Centroids c) {
    while (true) {
      int change;
      AssignClusters(a, change, p, c, a);
      if (change == 0) return;  // Reached fixed point
      NewClusterLocations(c, p, a);
    }
  }
}
K-Means Example (Variable Accuracy)
transform kmeans
accuracy_metric kmeansaccuracy
accuracy_variable k
from Points[n, 2]  // Array of points (each column
                   // stores x and y coordinates)
using Centroids[k, 2]
to Assignments[n]
...
  // Rule 3:
  // The kmeans iterative algorithm
  to (Assignments a) from (Points p, Centroids c) {
    for_enough {
      int change;
      AssignClusters(a, change, p, c, a);
      if (change == 0) return;  // Reached fixed point
      NewClusterLocations(c, p, a);
    }
  }
}

transform kmeansaccuracy
from Assignments[n], Points[n, 2]
to Accuracy
{
  Accuracy from (Assignments a, Points p) {
    return sqrt(2*n / SumClusterDistanceSquared(a, p));
  }
}
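To make the accuracy metric concrete, here is a small Python sketch of the formula sqrt(2n / SumClusterDistanceSquared(a, p)) from the slide. The helper names are hypothetical, and it assumes each cluster's center is the mean of the points assigned to it, since the slide passes only assignments and points to the metric.

```python
import math


def cluster_centers(assignments, points):
    """Mean of the points in each cluster (assumed definition of a center)."""
    sums = {}
    for (x, y), k in zip(points, assignments):
        sx, sy, count = sums.get(k, (0.0, 0.0, 0))
        sums[k] = (sx + x, sy + y, count + 1)
    return {k: (sx / c, sy / c) for k, (sx, sy, c) in sums.items()}


def sum_cluster_distance_squared(assignments, points):
    """Total squared distance from each point to its cluster center."""
    centers = cluster_centers(assignments, points)
    total = 0.0
    for (x, y), k in zip(points, assignments):
        cx, cy = centers[k]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total


def kmeans_accuracy(assignments, points):
    """Accuracy metric from the slide: sqrt(2n / sum of squared distances)."""
    n = len(points)
    dist = sum_cluster_distance_squared(assignments, points)
    if dist == 0:
        return math.inf  # every point sits exactly on its center
    return math.sqrt(2 * n / dist)


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
two_clusters = kmeans_accuracy([0, 0, 1, 1], points)
one_cluster = kmeans_accuracy([0, 0, 0, 0], points)
```

In this toy data set, splitting the points into two clusters scores higher than forcing them into one, which is why the number of clusters k is a natural accuracy variable for the autotuner to adjust.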
Semantics of Variable Accuracy
Running the accuracy metric on the output will return a value that, in expectation, exceeds the accuracy target more than P percent of the time.
time, not at runtime.
accuracy target must be specified.
accuracy components, only the outer most accuracy target will be honored.
A Brief Multigrid Intro
computations)
Standard Cycle Shapes
performance
and execution platform affect efficacy of different shapes
cycle shapes!
status quo in this domain
shapes once
tailored to your problem
Choice Space of Multigrid
Autotuned V-cycle Shapes
[Figure: autotuned cycle shapes plotted against Grid Size (2048, 1024, 512, 256, 128, 64, 32, 16); panels labeled 10^1, 10^3, 10^5, 10^7]
Dynamic Programming Technique for Autotuning Multigrid
Grid size i
Grid size 2i
from coarser level
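A bottom-up dynamic program of this kind can be sketched in Python as follows. This is an illustration under stated assumptions, not the PetaBricks tuner: the three candidate kinds (direct, iterate, recurse), the function names, and the toy cost model are all hypothetical, and a real autotuner would measure run times instead of computing them from formulas. The point it shows is that the plan for grid size 2i reuses the already tuned plans for grid size i.

```python
import math


def tune_multigrid(sizes, targets, direct_cost, iterate_cost,
                   cycle_overhead, cycles_needed):
    """Tune from the coarsest grid up; sizes are ordered coarsest first.

    Returns {(size, target): (estimated_cost, plan)}.
    """
    best = {}
    for level, size in enumerate(sizes):
        for target in targets:
            options = [
                (direct_cost(size), ("direct",)),
                (iterate_cost(size, target), ("iterate",)),
            ]
            if level > 0:
                coarse = sizes[level - 1]
                # Reuse the tuned result for every accuracy of the coarser level.
                for sub_target in targets:
                    sub_cost, _ = best[(coarse, sub_target)]
                    reps = cycles_needed(target, sub_target)
                    cost = reps * (cycle_overhead(size) + sub_cost)
                    options.append((cost, ("recurse", sub_target, reps)))
            best[(size, target)] = min(options, key=lambda option: option[0])
    return best


# Hypothetical cost model, chosen only to make the example concrete.
def toy_direct(n):
    return n ** 4


def toy_iterate(n, target):
    return target * n ** 3


def toy_overhead(n):
    return 10 * n ** 2


def toy_cycles(target, sub_target):
    return math.ceil(target / sub_target)


plans = tune_multigrid([16, 32, 64], [1, 3, 5],
                       toy_direct, toy_iterate, toy_overhead, toy_cycles)
```

Because each level only looks one level down, the table is filled in a single pass over grid sizes, and the choice at a fine level can mix accuracies from the coarser level.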
2D Poisson’s Equation (uses Multigrid)
[Figure: 2D Poisson's equation. Speedup (x) vs. Input Size (100 to 1e+06) for Accuracy Levels 10^9, 10^7, 10^5, 10^3, 10^1]
More Variable Accuracy Results
[Figures: Speedup (x) vs. Input Size, one panel per benchmark]
Clustering: Accuracy Levels 0.95, 0.75, 0.50, 0.20, 0.10, 0.05
Bin Packing: Accuracy Levels 1.01, 1.1, 1.2, 1.3, 1.4
Image Compression: Accuracy Levels 2.0, 1.5, 1.0, 0.8, 0.6, 0.3
3D Helmholtz: Accuracy Levels 10^9, 10^7, 10^5, 10^3, 10^1
2D Poisson: Accuracy Levels 10^9, 10^7, 10^5, 10^3, 10^1
Preconditioner: Accuracy Levels 3.0, 2.0, 1.5, 1.0, 0.5, 0.0
Results on Different Systems
Test Systems
Codename | CPU(s) | Cores | GPU | OpenCL Runtime
Desktop | Core i7 920 @2.67GHz | 4 | NVIDIA Tesla C2070 | CUDA Toolkit 3.2
Server | 4× Xeon X7550 @2GHz | 32 | None | AMD APP SDK 2.5
Laptop | Core i5 2520M @2.5GHz | 2 | AMD Radeon HD 6630M | Xcode 4.2
Benchmarks
Name | # Possible Configs | Generated OpenCL Kernels | Mean Autotuning Time | Testing Input Size
SeparableConv. | 10^1358 | 9 | 3.82 hours | 3520^2
Black-Scholes | 10^130 | 1 | 3.09 hours | 500000
Poisson2D SOR | 10^1358 | 25 | 15.37 hours | 2048^2
Sort | 10^920 | 7 | 3.56 hours | 2^20
Strassen | 10^1509 | 9 | 3.05 hours | 1024^2
SVD | 10^2435 | 8 | 1.79 hours | 256^2
Tridiagonal Solver | 10^1040 | 8 | 5.56 hours | 1024^2
Separable Convolution (width=7)
[Figure: Execution Time (Normalized), 1.0x to 3.0x, on Desktop, Server, and Laptop, for Desktop Config, Server Config, Laptop Config, and Hand-coded OpenCL]
SeparableConv. configurations:
Desktop Config: 1D kernel+local memory on GPU
Server Config: 1D kernel on OpenCL
Laptop Config: 2D kernel+local memory on GPU
Poisson 2D SOR
[Figure: Execution Time (Normalized), 1.0x to 9.0x, on Desktop, Server, and Laptop, for Desktop Config, Server Config, and Laptop Config]
Poisson2D SOR configurations:
Desktop Config: Split on CPU followed by compute on GPU
Server Config: Split some parts on OpenCL followed by compute on CPU
Laptop Config: Split on CPU followed by compute on GPU
Singular Value Decomposition (SVD)
[Figure: Execution Time (Normalized), 1.0x to 2.0x, on Desktop, Server, and Laptop, for Desktop Config, Server Config, and Laptop Config]
SVD configurations:
Desktop Config: First phase: task parallelism between CPU/GPU; matrix multiply: 8-way parallel recursive decomposition on CPU, call LAPACK when < 42 × 42
Server Config: First phase: all on CPU; matrix multiply: 8-way parallel recursive decomposition on CPU, call LAPACK when < 170 × 170
Laptop Config: First phase: all on CPU; matrix multiply: 4-way parallel recursive decomposition on CPU, call LAPACK when < 85 × 85
Results Takeaways
different systems
Autotuning Challenges
algorithm
Input Sensitivity
inputs
Input Sensitivity Today
inputs
Input Sensitivity Overview
[Diagram labels: Training; Deployment; Program; Training Inputs; Feature Extractors; Input Aware Learning; Input Classifier; Optimized Programs; Input; Select; Selected Program; Run]
Insights:
Input Features
function Sort
to ...
from in[n]
input_feature Sortedness, Duplication
{ ... }

function Sortedness
from in[n]
to sortedness
tunable double level (0.0, 1.0)
{
  int sortedcount = 0;
  int count = 0;
  int step = (int)(level*n);
  for (int i = 0; i + step < n; i += step) {
    if (in[i] <= in[i+step]) {
      // increment for correctly ... pairs ... elements
      sortedcount += 1;
    }
    count += 1;
  }
  if (count > 0)
    sortedness = sortedcount / (double) count;
  else
    sortedness = 0.0;
}

function Duplication
from in[n]
to duplication
{ ... }
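The Sortedness feature extractor translates almost line for line into Python. The sketch below follows the slide's logic: it samples pairs that are `step` apart and reports the fraction that are in order, with `level` controlling how coarse the sampling is. One detail is an assumption added here and is not in the slide: the step is clamped to at least 1 so the loop always advances.

```python
def sortedness(values, level):
    """Fraction of sampled pairs (i, i + step) that are in non-decreasing order.

    `level` in (0.0, 1.0) is the tunable sampling level: a larger level means
    a larger step, so fewer pairs are examined and the feature is cheaper.
    """
    n = len(values)
    step = max(1, int(level * n))  # assumption: clamp so the loop advances
    sorted_count = 0
    count = 0
    i = 0
    while i + step < n:
        if values[i] <= values[i + step]:
            sorted_count += 1
        count += 1
        i += step
    return sorted_count / count if count > 0 else 0.0


ascending = sortedness(list(range(100)), 0.25)
descending = sortedness(list(range(100, 0, -1)), 0.25)
```

Making `level` tunable lets the autotuner trade feature quality against extraction cost, which matters because the feature is computed on every input at deployment time.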
Input Space Sampling
[Figure axes: Duplication, Sortedness]
Training
[Diagram labels: Features; Input Labels; Classifier Constructors (Decision Tree, Max A Priori, Adaptive Tree); classifiers 1...m, m+1; Classifier Selector; Input Classifier]
Selection Objective: Considers cost of extracting needed features
How Many Landmarks Are Enough?
[Figure: Lost speedup (L) vs. Size of region (p_i), one curve each for 2 through 9 configs]
[Figure: Speedup vs. Landmarks (10 to 100)]
Input Adaptation Results
[Figures: Speedup vs. Landmarks (1 to 100), one panel each for sort, clustering, binpacking, svd, poisson2d, helmholtz3d]
Related Projects
A small selection of many related projects:
Package | Domain | Search Method
Active Harmony | Runtime System | Nelder-Mead
ATLAS | Dense Linear Algebra | Exhaustive
Code Perforation | Compiler | Exhaustive + Simulated Annealing
Dynamic Knobs | Runtime System | Control Theory
FFTW | Fast Fourier Transform | Exhaustive / Dynamic Prog.
Insieme | Compiler | Differential Evolution
Milepost GCC / cTuning | Compiler | IID Model + Central DB
OSKI | Sparse Linear Algebra | Exhaustive + Heuristic
PATUS | Stencil Computations | Nelder-Mead or Evolutionary
SEEC / Heartbeats | Runtime System | Control Theory
Sepya | Stencil Computations | Random-Restart Gradient Ascent
SPIRAL | DSP Algorithms | Pareto Active Learning
Limits of Existing Autotuning Projects
autotuning
at synthesizing poly-algorithms
spaces to fit their techniques
problems
OpenTuner Overview
OpenTuner: an extensible framework for program autotuning
[Diagram labels: Results Database; Search (Search Techniques, Search Driver); Measurement (User Defined Measurement Function, Measurement Driver); Configuration Manipulator]
Search: Reads: Results; Writes: Desired Results
Measurement: Reads: Desired Results; Writes: Results
OpenTuner Configuration Manipulator Parameters
Parameter
  Primitive: Integer, Float, ScaledNumeric (LogInteger, LogFloat, PowerOfTwo)
  Complex: Boolean, Switch, Enum, Permutation (Schedule), Selector
types can be added at any point
Ensembles of Techniques
Techniques: Differential Evolution, Particle Swarm Optimization, Torczon Hill Climber
Information sharing through ResultsDB
AUC Bandit: Which configuration should we try next?
Exploration: 33% / 33% / 33%
Exploitation: 100% / 0% / 0%
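The shift from exploration to exploitation can be illustrated with a small bandit over techniques. The Python sketch below is a simplified illustration and not OpenTuner's implementation: it assumes a sliding window of recent trials, an area-under-curve style credit that weights recent successes more heavily, and an exploration bonus that shrinks as a technique is used more often. The class name, window size, and constant `c` are hypothetical.

```python
import math
from collections import deque


class AUCBanditSketch:
    """Chooses which search technique gets to propose the next configuration."""

    def __init__(self, techniques, window=50, c=0.05):
        self.techniques = list(techniques)
        self.history = deque(maxlen=window)  # (technique, found_new_best)
        self.c = c

    def _auc(self, name):
        """Credit in [0, 1]; later uses in the window carry more weight."""
        uses = [hit for tech, hit in self.history if tech == name]
        n = len(uses)
        if n == 0:
            return 0.0
        weighted = sum(i * hit for i, hit in enumerate(uses, start=1))
        return 2.0 * weighted / (n * (n + 1))

    def _score(self, name):
        uses = sum(1 for tech, _ in self.history if tech == name)
        if uses == 0:
            return math.inf  # try every technique at least once
        bonus = self.c * math.sqrt(2.0 * math.log2(len(self.history)) / uses)
        return self._auc(name) + bonus

    def select(self):
        return max(self.techniques, key=self._score)

    def report(self, name, found_new_best):
        self.history.append((name, bool(found_new_best)))


bandit = AUCBanditSketch(["de", "pso", "hill"])
chosen = []
for _ in range(30):
    name = bandit.select()
    chosen.append(name)
    # Toy environment: only "de" ever improves on the best known result.
    bandit.report(name, name == "de")
```

In this toy run each technique is tried once, after which the technique that keeps finding new bests receives nearly all of the trials, mirroring the 100% / 0% / 0% split on the slide.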
OpenTuner Results
Project | Benchmark | Possible Configurations
GCC/G++ Flags | all | 10^806
Halide | Blur | 10^52
Halide | Wavelet | 10^44
HPL | n/a | 10^9.9
PetaBricks | Poisson | 10^3657
PetaBricks | Sort | 10^90
PetaBricks | Strassen | 10^188
PetaBricks | TriSolve | 10^1559
Stencil | all | 10^6.5
Unitary | n/a | 10^21
OpenTuner Results: GCC Flags
[Figures: Execution Time (seconds) vs. Autotuning Time (seconds, 300 to 1800), comparing gcc/g++ -O1, -O2, -O3, and OpenTuner; panels: fft.c, matrixmultiply.cpp, raytracer.cpp, tsp_ga.cpp]
OpenTuner Results: PetaBricks
[Figures: Execution Time (seconds) vs. Autotuning Time (seconds), comparing PetaBricks Autotuner and OpenTuner; panels: Poisson 2D, Sort, Strassen, Tridiagonal Solver]
Conclusions
algorithmic choice
adapt to their environment
incorporate algorithmic choice and autotuning
complex problems
Coauthors and Collaborators
Thanks!
About me:
http://jasonansel.com/
http://opentuner.org/
http://projects.csail.mit.edu/petabricks/