BOAT: Building Auto-Tuners with Structured Bayesian Optimization - PowerPoint PPT Presentation



SLIDE 1

BOAT: Building Auto-Tuners with Structured Bayesian Optimization

Valentin Dalibard, Michael Schaarschmidt, and Eiko Yoneki

Presented by Jesse Mu

SLIDE 5

Parameters in large-scale systems

Coarse ↔ fine spectrum: number of cluster nodes, ML hyperparameters, compiler flags

How to optimize parameters θ? Minimize some cost function f(θ), where cost is runtime, memory, I/O, etc.
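To make the setup concrete, here is a minimal sketch (not from the paper) of auto-tuning as black-box minimization; `run_benchmark`, its parameters, and its toy cost formula are hypothetical stand-ins for deploying and measuring a real system:

```python
def run_benchmark(theta):
    """Hypothetical stand-in for deploying the system with config theta
    and measuring it; a real tuner would run the actual workload."""
    # Toy analytic cost: runtime improves with more nodes but degrades
    # when the JVM eden space is badly sized.
    return 100.0 / theta["num_nodes"] + 0.01 * abs(theta["eden_mb"] - 512)

def f(theta):
    """Cost function f(theta): here, a runtime-like quantity."""
    return run_benchmark(theta)

# Exhaustively score a handful of configurations and keep the cheapest.
configs = [{"eden_mb": e, "num_nodes": n}
           for e in (256, 512, 1024) for n in (2, 4, 8)]
best = min(configs, key=f)
```

The auto-tuning methods on the following slides differ only in how they choose which θ to evaluate next.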

SLIDE 13

Auto-tuning (optimization) in distributed systems

  • Grid search: θ ∈ [1, 2, 3, …]
  • Evolutionary approaches
  • Hill-climbing
  • Bayesian optimization (e.g. Spearmint): fails in high dimensions!
  • Structured Bayesian optimization (this work: BespOke Auto-Tuners)

These generic approaches require 1000s of evaluations of the cost function!
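The evaluation-count problem is easiest to see for grid search: the number of cost-function calls is the full cross-product of per-parameter candidate values, which grows exponentially in the number of parameters. A small sketch:

```python
from itertools import product

def grid(axes):
    """Enumerate every combination of parameter values (grid search)."""
    return list(product(*axes.values()))

# Even a modest space, 10 parameters with 3 candidate values each,
# already demands 3**10 cost-function evaluations.
axes = {f"param_{i}": [1, 2, 3] for i in range(10)}
configs = grid(axes)
print(len(configs))
```

When each evaluation means deploying and benchmarking a distributed system, that many runs is out of the question.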

SLIDE 14

Gaussian Processes

From Carl Rasmussen’s 4F13 lectures http://mlg.eng.cam.ac.uk/teaching/4f13/1718/gp%20and%20data.pdf

[Figure: panels showing the GP prior, the data, and the posterior]
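The prior-vs-posterior picture can be reproduced numerically. The following sketch (plain NumPy, toy data, an assumed RBF kernel) conditions a zero-mean Gaussian process on three observations:

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential (RBF) covariance between 1-D input arrays."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

# Toy observations of an unknown function (the "data" panel).
X = np.array([-2.0, 0.0, 1.5])
y = np.sin(X)

# Posterior: condition the zero-mean GP prior on (X, y).
Xs = np.linspace(-3.0, 3.0, 61)            # prediction grid
K = rbf(X, X) + 1e-8 * np.eye(len(X))      # jitter for numerical stability
alpha = np.linalg.solve(K, y)
mu = rbf(X, Xs).T @ alpha                  # posterior mean on the grid
var = 1.0 - np.sum(rbf(X, Xs) * np.linalg.solve(K, rbf(X, Xs)), axis=0)

# The posterior mean passes through the observations, and the
# predictive variance collapses to ~0 at those points.
mu_at_X = rbf(X, X).T @ alpha
```

This is the surrogate model Bayesian optimization queries instead of the expensive cost function.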

SLIDE 16

e.g. expected improvement over the best observed performance (balances exploration vs. exploitation)
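That acquisition function, expected improvement, has a closed form for a Gaussian prediction. A sketch for the maximization case (pure stdlib; the numbers are toy values):

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form EI of a prediction N(mu, sigma^2) over the best
    performance observed so far (maximization convention)."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf

# Exploitation: a confident prediction just above the incumbent.
ei_exploit = expected_improvement(mu=1.1, sigma=0.01, best=1.0)
# Exploration: an uncertain prediction below the incumbent can still
# score higher, which is exactly how EI balances the two.
ei_explore = expected_improvement(mu=0.9, sigma=0.5, best=1.0)
```

The tuner evaluates the real system at whichever candidate maximizes this score.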

SLIDE 17

Bayesian Optimization

[Figure: the Bayesian optimization loop, with a Gaussian process as the model]

SLIDE 18

Structured Bayesian Optimization (SBO)

[Figure: the optimization loop, with the Gaussian process model highlighted]


SLIDE 20

Structured Bayesian Optimization (SBO)

*Developer-specified, semi-parametric model of performance, built from observed performance plus arbitrary runtime characteristics


SLIDE 24

Probabilistic Models for SBO

[Figure comparing model classes: parametric models are too restrictive, plain Gaussian processes are too generic, semi-parametric models are just right]

SLIDE 28

Semi-parametric models in SBO

  • Specify the parametric component only (the GP component comes for free)
  • e.g. predict GC rate from JVM eden size

Prior: malloc rate ~ Uniform(0, 5000)
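A minimal sketch of that idea (toy data; this is not BOAT's actual API): the developer supplies the parametric form, here GC rate ≈ malloc_rate / eden_size with malloc_rate drawn from the Uniform(0, 5000) prior, so inference only has to pin down one unknown rate; a GP then absorbs whatever the parametric part misses:

```python
import numpy as np

def gc_rate(eden_mb, malloc_rate):
    """Developer's parametric insight: a collection happens roughly every
    eden_mb / malloc_rate seconds, so the GC rate is their ratio."""
    return malloc_rate / eden_mb

# Toy (eden size, observed GC rate) measurements from a few runs.
eden = np.array([128.0, 256.0, 512.0, 1024.0])
observed = np.array([16.1, 7.9, 4.1, 2.0])

# Score malloc-rate candidates drawn from the Uniform(0, 5000) prior.
candidates = np.linspace(1.0, 5000.0, 5000)
errors = [np.sum((gc_rate(eden, m) - observed) ** 2) for m in candidates]
malloc_hat = float(candidates[int(np.argmin(errors))])

# Residuals are what the GP component models on top of the parametric fit.
residual = observed - gc_rate(eden, malloc_hat)
```

Because the parametric form already explains most of the signal, very few observations are needed before the model is useful.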


SLIDE 33

Composing semi-parametric models

Dataflow DAG: inference exploits conditional independence between models
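A sketch of the dataflow idea (hypothetical component forms and constants, not BOAT's API): each node in the DAG predicts one observable runtime quantity, so each component can be fit against its own measurements independently, and end-to-end predictions simply flow through the graph:

```python
def gc_rate_model(eden_mb, malloc_rate=2000.0):
    """Node 1: GC rate from eden size (hypothetical parametric form)."""
    return malloc_rate / eden_mb

def latency_model(gc_rate, base_ms=10.0, pause_ms=50.0):
    """Node 2: request latency from GC rate (hypothetical form):
    base latency inflated by the fraction of time spent in GC pauses."""
    return base_ms * (1.0 + gc_rate * pause_ms / 1000.0)

def predict_latency(config):
    """End-to-end prediction flows through the DAG:
    eden size -> GC rate -> latency."""
    return latency_model(gc_rate_model(config["eden_mb"]))

# A larger eden space means fewer collections, hence lower latency here.
small = predict_latency({"eden_mb": 256})
large = predict_latency({"eden_mb": 1024})
```

Because latency depends on eden size only through the observed GC rate, each node can be trained and debugged against its own intermediate measurements.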

SLIDE 37

SBO: Summary

  • 1. Configuration space (i.e. possible params) (standard)
  • 2. Objective function + runtime measurements (standard)
  • 3. Semi-parametric model of the system (new)

Key: try a generic system before optimizing with structure

SLIDE 40

Evaluation: Cassandra GC

Best params outperform Cassandra defaults by 63%. Existing systems converge, but take 6x longer.

SLIDE 43

Evaluation: Neural Net SGD

Load balancing and worker allocation over 10 machines = 30 params

Default configuration: 9.82 s. OpenTuner: 8.71 s. BOAT: 4.31 s. Existing systems don't converge!

SLIDE 48

Review: overall, a good, unsurprising contribution

  • Theory
    ○ Unsurprising that expert-developed models optimize better!
      ■ Tradeoff: developer hours vs. machine hours
    ○ The Cassandra GC system converges in 2 iterations, so the model is near-perfect! What happens when the parametric model is wrong?
      ■ More detail needed on the tradeoff between the parametric model and a generic GP
      ■ OpenTuner: builds an ensemble of multiple search techniques
  • Implementation
    ○ Cross-validation?
    ○ Key for system adoption: make the interface as high-level as possible
  • Evaluation
    ○ What happens when # params >> 30?
    ○ "DAGModels help debugging"... how?