BOAT: Building Auto-Tuners with Structured Bayesian Optimization - PowerPoint PPT Presentation



SLIDE 1

BOAT: Building Auto-Tuners with Structured Bayesian Optimization

Valentin Dalibard, Michael Schaarschmidt, and Eiko Yoneki

Presented by Jesse Mu

SLIDE 5

Parameters in large-scale systems

Coarse ↔ fine spectrum: number of cluster nodes, ML hyperparameters, compiler flags

How to optimize parameters θ? Minimize some cost function f(θ), where cost is runtime, memory, I/O, etc.
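To make the setup concrete, here is a minimal sketch (not from the paper) of auto-tuning as black-box minimization; `run_benchmark`, its parameters, and its toy cost formula are hypothetical stand-ins for deploying and measuring a real system:

```python
def run_benchmark(theta):
    """Hypothetical stand-in for deploying the system with config theta
    and measuring it; a real tuner would run the actual workload."""
    # Toy analytic cost: runtime improves with more nodes but degrades
    # when the JVM eden space is badly sized.
    return 100.0 / theta["num_nodes"] + 0.01 * abs(theta["eden_mb"] - 512)

def f(theta):
    """Cost function f(theta): here, a runtime-like quantity."""
    return run_benchmark(theta)

# Exhaustively score a handful of configurations and keep the cheapest.
configs = [{"eden_mb": e, "num_nodes": n}
           for e in (256, 512, 1024) for n in (2, 4, 8)]
best = min(configs, key=f)
```

The auto-tuning methods on the following slides differ only in how they choose which θ to evaluate next.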

SLIDE 13

Auto-tuning (optimization) in distributed systems

  • Grid search: θ ∈ [1, 2, 3, …]
  • Evolutionary approaches
  • Hill-climbing
  • Bayesian optimization (e.g. Spearmint): fails in high dimensions!
  • Structured Bayesian optimization (this work: BespOke Auto-Tuners)

These generic approaches require 1000s of evaluations of the cost function!
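The evaluation-count problem is easiest to see for grid search: the number of cost-function calls is the full cross-product of per-parameter candidate values, which grows exponentially in the number of parameters. A small sketch:

```python
from itertools import product

def grid(axes):
    """Enumerate every combination of parameter values (grid search)."""
    return list(product(*axes.values()))

# Even a modest space, 10 parameters with 3 candidate values each,
# already demands 3**10 cost-function evaluations.
axes = {f"param_{i}": [1, 2, 3] for i in range(10)}
configs = grid(axes)
print(len(configs))
```

When each evaluation means deploying and benchmarking a distributed system, that many runs is out of the question.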

SLIDE 14

Gaussian Processes

From Carl Rasmussen’s 4F13 lectures http://mlg.eng.cam.ac.uk/teaching/4f13/1718/gp%20and%20data.pdf

[Figure: panels showing the GP prior, the data, and the posterior]
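The prior-vs-posterior picture can be reproduced numerically. The following sketch (plain NumPy, toy data, an assumed RBF kernel) conditions a zero-mean Gaussian process on three observations:

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential (RBF) covariance between 1-D input arrays."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

# Toy observations of an unknown function (the "data" panel).
X = np.array([-2.0, 0.0, 1.5])
y = np.sin(X)

# Posterior: condition the zero-mean GP prior on (X, y).
Xs = np.linspace(-3.0, 3.0, 61)            # prediction grid
K = rbf(X, X) + 1e-8 * np.eye(len(X))      # jitter for numerical stability
alpha = np.linalg.solve(K, y)
mu = rbf(X, Xs).T @ alpha                  # posterior mean on the grid
var = 1.0 - np.sum(rbf(X, Xs) * np.linalg.solve(K, rbf(X, Xs)), axis=0)

# The posterior mean passes through the observations, and the
# predictive variance collapses to ~0 at those points.
mu_at_X = rbf(X, X).T @ alpha
```

This is the surrogate model Bayesian optimization queries instead of the expensive cost function.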

SLIDE 16

e.g. expected improvement over the best observed performance (balances exploration vs. exploitation)
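That acquisition function, expected improvement, has a closed form for a Gaussian prediction. A sketch for the maximization case (pure stdlib; the numbers are toy values):

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form EI of a prediction N(mu, sigma^2) over the best
    performance observed so far (maximization convention)."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf

# Exploitation: a confident prediction just above the incumbent.
ei_exploit = expected_improvement(mu=1.1, sigma=0.01, best=1.0)
# Exploration: an uncertain prediction below the incumbent can still
# score higher, which is exactly how EI balances the two.
ei_explore = expected_improvement(mu=0.9, sigma=0.5, best=1.0)
```

The tuner evaluates the real system at whichever candidate maximizes this score.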

SLIDE 17

Bayesian Optimization

[Figure: the Bayesian optimization loop, with a Gaussian process as the model]

SLIDE 18

Structured Bayesian Optimization (SBO)

[Figure: the optimization loop, with the Gaussian process model highlighted]


SLIDE 20

Structured Bayesian Optimization (SBO)

*Developer-specified, semi-parametric model of performance, built from observed performance plus arbitrary runtime characteristics


SLIDE 24

Probabilistic Models for SBO

[Figure comparing model classes: parametric models are too restrictive, plain Gaussian processes are too generic, semi-parametric models are just right]

SLIDE 28

Semi-parametric models in SBO

  • Specify the parametric component only (the GP component comes for free)
  • e.g. predict GC rate from JVM eden size

Prior: malloc rate ~ Uniform(0, 5000)
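A minimal sketch of that idea (toy data; this is not BOAT's actual API): the developer supplies the parametric form, here GC rate ≈ malloc_rate / eden_size with malloc_rate drawn from the Uniform(0, 5000) prior, so inference only has to pin down one unknown rate; a GP then absorbs whatever the parametric part misses:

```python
import numpy as np

def gc_rate(eden_mb, malloc_rate):
    """Developer's parametric insight: a collection happens roughly every
    eden_mb / malloc_rate seconds, so the GC rate is their ratio."""
    return malloc_rate / eden_mb

# Toy (eden size, observed GC rate) measurements from a few runs.
eden = np.array([128.0, 256.0, 512.0, 1024.0])
observed = np.array([16.1, 7.9, 4.1, 2.0])

# Score malloc-rate candidates drawn from the Uniform(0, 5000) prior.
candidates = np.linspace(1.0, 5000.0, 5000)
errors = [np.sum((gc_rate(eden, m) - observed) ** 2) for m in candidates]
malloc_hat = float(candidates[int(np.argmin(errors))])

# Residuals are what the GP component models on top of the parametric fit.
residual = observed - gc_rate(eden, malloc_hat)
```

Because the parametric form already explains most of the signal, very few observations are needed before the model is useful.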


SLIDE 33

Composing semi-parametric models

Dataflow DAG: inference exploits conditional independence between models
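A sketch of the dataflow idea (hypothetical component forms and constants, not BOAT's API): each node in the DAG predicts one observable runtime quantity, so each component can be fit against its own measurements independently, and end-to-end predictions simply flow through the graph:

```python
def gc_rate_model(eden_mb, malloc_rate=2000.0):
    """Node 1: GC rate from eden size (hypothetical parametric form)."""
    return malloc_rate / eden_mb

def latency_model(gc_rate, base_ms=10.0, pause_ms=50.0):
    """Node 2: request latency from GC rate (hypothetical form):
    base latency inflated by the fraction of time spent in GC pauses."""
    return base_ms * (1.0 + gc_rate * pause_ms / 1000.0)

def predict_latency(config):
    """End-to-end prediction flows through the DAG:
    eden size -> GC rate -> latency."""
    return latency_model(gc_rate_model(config["eden_mb"]))

# A larger eden space means fewer collections, hence lower latency here.
small = predict_latency({"eden_mb": 256})
large = predict_latency({"eden_mb": 1024})
```

Because latency depends on eden size only through the observed GC rate, each node can be trained and debugged against its own intermediate measurements.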

SLIDE 37

SBO: Summary

  • 1. Configuration space (i.e. possible params) (standard)
  • 2. Objective function + runtime measurements (standard)
  • 3. Semi-parametric model of the system (new)

Key: try a generic system before optimizing with structure

SLIDE 40

Evaluation: Cassandra GC

Best params outperform Cassandra defaults by 63%. Existing systems converge, but take 6x longer.

SLIDE 43

Evaluation: Neural Net SGD

Load balancing and worker allocation over 10 machines = 30 params

Default configuration: 9.82 s. OpenTuner: 8.71 s. BOAT: 4.31 s. Existing systems don't converge!

SLIDE 48

Review: overall, a good, unsurprising contribution

  • Theory
    ○ Unsurprising that expert-developed models optimize better!
      ■ Tradeoff: developer hours vs. machine hours
    ○ The Cassandra GC system converges in 2 iterations, so the model is near-perfect! What happens when the parametric model is wrong?
      ■ More detail needed on the tradeoff between the parametric model and a generic GP
      ■ OpenTuner: builds an ensemble of multiple search techniques
  • Implementation
    ○ Cross-validation?
    ○ Key for system adoption: make the interface as high-level as possible
  • Evaluation
    ○ What happens when # params >> 30?
    ○ "DAGModels help debugging"... how?