SLIDE 1

BAYESIAN GLOBAL OPTIMIZATION

Using Optimal Learning to Tune Deep Learning Pipelines Scott Clark scott@sigopt.com

SLIDE 2

OUTLINE

  • 1. Why is Tuning AI Models Hard?
  • 2. Comparison of Tuning Methods
  • 3. Bayesian Global Optimization
  • 4. Deep Learning Examples
  • 5. Evaluating Optimization Strategies
SLIDE 3

Deep Learning / AI is extremely powerful. Tuning these systems is extremely non-intuitive.

SLIDE 4

https://www.quora.com/What-is-the-most-important-unresolved-problem-in-machine-learning-3

What is the most important unresolved problem in machine learning?

“...we still don't really know why some configurations of deep neural networks work in some cases and not others, let alone having a more or less automatic approach to determining the architectures and the hyperparameters.”

  • Xavier Amatriain, VP Engineering at Quora (former Director of Research at Netflix)

SLIDE 5

Photo: Joe Ross

SLIDE 6

TUNABLE PARAMETERS IN DEEP LEARNING

SLIDE 7

TUNABLE PARAMETERS IN DEEP LEARNING

SLIDE 8

Photo: Tammy Strobel

SLIDE 9

STANDARD METHODS FOR HYPERPARAMETER SEARCH

SLIDE 10

STANDARD TUNING METHODS

[Diagram: a parameter configuration is chosen by Grid Search, Random Search, or Manual Search and fed into the ML / AI model pipeline (training data, cross validation, testing data). Example tunable parameters:]

  • Weights
  • Thresholds
  • Window sizes
  • Transformations
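To make the contrast concrete, here is a minimal Python sketch of grid versus random search over two hypothetical parameters; the parameter names, ranges, and budget are illustrative, not from the slides.

# Grid vs. random search at the same budget; values are illustrative.
import itertools
import random

budget = 9

# Grid search: exhaustive over a fixed 3 x 3 lattice, so each parameter
# is only ever tried at 3 distinct values.
grid_trials = list(itertools.product(
    [0.001, 0.01, 0.1],   # learning rate
    [32, 64, 128],        # window size
))

# Random search: same budget, but every trial draws fresh values, so each
# parameter sees up to 9 distinct values (Bergstra & Bengio, 2012).
random_trials = [
    (10 ** random.uniform(-3, -1), random.randint(32, 128))
    for _ in range(budget)
]

# Manual search: a human picks the next point after inspecting results.
print(grid_trials)
print(random_trials)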

SLIDE 11

OPTIMIZATION FEEDBACK LOOP

[Diagram: a feedback loop. The ML / AI model (training data, cross validation, testing data) reports an objective metric to the REST API; new configurations come back, yielding better results over time.]
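A minimal Python sketch of this loop, with a stand-in Optimizer playing the role of the REST API; the class, parameter, and metric here are illustrative assumptions, not a real client.

import random

class Optimizer:
    # Stand-in for the service behind the REST API.
    def __init__(self, bounds):
        self.bounds = bounds
        self.history = []

    def suggest(self):
        # Returns a new configuration to try (random here; a real service
        # would use the history to pick promising points).
        return {name: random.uniform(lo, hi)
                for name, (lo, hi) in self.bounds.items()}

    def observe(self, config, metric):
        # Reports the objective metric back, closing the loop.
        self.history.append((config, metric))

def objective_metric(config):
    # Stand-in for training the ML / AI model and cross-validating it.
    return -(config["lr"] - 0.01) ** 2

opt = Optimizer({"lr": (1e-4, 1e-1)})
for _ in range(20):
    config = opt.suggest()
    opt.observe(config, objective_metric(config))

print(max(opt.history, key=lambda pair: pair[1]))  # best result so far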

SLIDE 12

BAYESIAN GLOBAL OPTIMIZATION

SLIDE 13

… the challenge of how to collect information as efficiently as possible, primarily for settings where collecting information is time consuming and expensive.

  • Prof. Warren Powell - Princeton

What is the most efficient way to collect information?

  • Prof. Peter Frazier - Cornell

How do we make the most money, as fast as possible?

  • Scott Clark - CEO, SigOpt

OPTIMAL LEARNING

SLIDE 14
  • Optimize objective function
    ○ Loss, Accuracy, Likelihood
  • Given parameters
    ○ Hyperparameters, feature/architecture params
  • Find the best hyperparameters
    ○ Sample the function as few times as possible
    ○ Training on big data is expensive

BAYESIAN GLOBAL OPTIMIZATION
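In symbols (notation assumed here, not taken from the slides), this is derivative-free global optimization of an expensive, noisy black box:

x^{*} = \arg\max_{x \in \mathcal{X}} f(x), \qquad y_i = f(x_i) + \varepsilon_i

where x collects the hyperparameters, \mathcal{X} is the mixed continuous / integer / categorical domain, f is the objective (accuracy, likelihood, negative loss), each noisy observation y_i costs a full training run, and no gradients are available.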

SLIDE 15

SMBO

Sequential Model-Based Optimization

HOW DOES IT WORK?

SLIDE 16
  • 1. Build a Gaussian Process (GP) with the points sampled so far
  • 2. Optimize the fit of the GP (covariance hyperparameters)
  • 3. Find the point(s) of highest Expected Improvement within the parameter domain
  • 4. Return the optimal next point(s) to sample

GP/EI SMBO
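A minimal runnable sketch of one GP/EI iteration using scikit-learn and SciPy; the kernel, toy objective, and candidate grid are illustrative assumptions, not SigOpt's internals.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Stand-in for an expensive model-training run.
    return -(x - 0.3) ** 2

# 1. Points sampled so far.
X = np.array([[0.0], [0.5], [1.0]])
y = objective(X).ravel()

# 2. Fitting the GP maximizes the marginal likelihood over the covariance
#    hyperparameters (length scale, etc.) internally.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X, y)

# 3. Expected Improvement over a grid of candidate points.
candidates = np.linspace(0.0, 1.0, 1001).reshape(-1, 1)
mu, sigma = gp.predict(candidates, return_std=True)
best = y.max()
z = (mu - best) / np.maximum(sigma, 1e-9)
ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

# 4. The next point to sample is the EI maximizer.
x_next = candidates[np.argmax(ei)]
print(x_next)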

SLIDES 17-24

GAUSSIAN PROCESSES

[A sequence of plots: the Gaussian Process posterior over the objective, redrawn as each new point is sampled.]
SLIDE 25

GAUSSIAN PROCESSES

[Panels comparing covariance hyperparameter fits: overfit, good fit, underfit.]
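For reference, the posterior these slides are plotting: given observations y at points X, a kernel k, and noise variance \sigma_n^2, the GP predictive distribution at a new point x is Gaussian with

\mu(x) = k(x, X) \left[ k(X, X) + \sigma_n^2 I \right]^{-1} y

\sigma^2(x) = k(x, x) - k(x, X) \left[ k(X, X) + \sigma_n^2 I \right]^{-1} k(X, x)

The covariance hyperparameters (length scales, noise) drive the overfit / good fit / underfit behavior on slide 25 and are what step 2 of the SMBO loop fits.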

SLIDES 26-31

EXPECTED IMPROVEMENT

[A sequence of plots: the Expected Improvement acquisition function computed from the GP posterior, with the next sample taken at its maximum.]
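For reference, the acquisition function these slides are plotting, written for maximization with f^{*} the best value observed so far and \mu(x), \sigma(x) the GP posterior above:

\mathrm{EI}(x) = \mathbb{E}\left[ \max\left( f(x) - f^{*}, 0 \right) \right] = \left( \mu(x) - f^{*} \right) \Phi(z) + \sigma(x)\, \phi(z), \qquad z = \frac{\mu(x) - f^{*}}{\sigma(x)}

where \Phi and \phi are the standard normal CDF and PDF; the next point sampled is the maximizer of EI.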

SLIDE 32

DEEP LEARNING EXAMPLES

SLIDE 33
  • Classify movie reviews using a CNN in MXNet

SIGOPT + MXNET

SLIDE 34

TEXT CLASSIFICATION PIPELINE

[Diagram: the feedback loop instantiated for text classification. An MXNet model is trained on training text and scored on testing text; validation accuracy flows to the REST API, which returns hyperparameter configurations and feature transformations, yielding better results.]
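A sketch of how this loop might look with SigOpt's Python client as documented at the time; the experiment definition, parameter names, and train_and_score stub are assumptions for illustration, not the exact setup behind the slide.

from sigopt import Connection

def train_and_score(learning_rate, num_filters):
    # Hypothetical stand-in for a full MXNet train + validate run.
    return 0.5

conn = Connection(client_token="YOUR_API_TOKEN")
experiment = conn.experiments().create(
    name="CNN text classifier (MXNet)",
    parameters=[
        dict(name="learning_rate", type="double",
             bounds=dict(min=1e-5, max=1e-1)),
        dict(name="num_filters", type="int",
             bounds=dict(min=50, max=300)),
    ],
)

for _ in range(40):
    suggestion = conn.experiments(experiment.id).suggestions().create()
    accuracy = train_and_score(
        suggestion.assignments["learning_rate"],
        suggestion.assignments["num_filters"],
    )
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id, value=accuracy,
    )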

SLIDE 35

TUNABLE PARAMETERS IN DEEP LEARNING

SLIDE 36
  • Comparison of several RMSProp SGD parametrizations

STOCHASTIC GRADIENT DESCENT
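For context, the textbook RMSProp update whose knobs are being compared (learning rate \alpha, decay \gamma, stability \epsilon); this is the standard form, not necessarily the exact variant on the slide:

v_t = \gamma\, v_{t-1} + (1 - \gamma)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{v_t} + \epsilon}\, g_t

where g_t is the gradient at step t. Small changes to \gamma or \epsilon can change convergence dramatically, which is why these parametrizations are worth tuning.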

SLIDE 37

ARCHITECTURE PARAMETERS

SLIDE 38

[Animated GIF comparing Grid Search, Random Search, and an unlabeled third method.]

TUNING METHODS

SLIDE 39

MULTIPLICATIVE TUNING SPEED UP

SLIDE 40

SPEED UP #1: CPU -> GPU

SLIDE 41

SPEED UP #2: RANDOM/GRID -> SIGOPT

SLIDE 42

CONSISTENTLY BETTER AND FASTER

SLIDE 43
  • Classify house numbers in an image dataset (SVHN)

SIGOPT + TENSORFLOW

SLIDE 44

COMPUTER VISION PIPELINE

[Diagram: the same loop for computer vision. A TensorFlow model is trained on training images and evaluated with cross validation on testing images; accuracy flows to the REST API, which returns hyperparameter configurations and feature transformations.]

SLIDE 45

METRIC OPTIMIZATION

SLIDE 46
  • All convolutional neural network
  • Multiple convolutional and dropout layers
  • Hyperparameter optimization: mixture of domain expertise and grid search (brute force)

SIGOPT + NEON

http://arxiv.org/pdf/1412.6806.pdf

SLIDE 47

COMPARATIVE PERFORMANCE

  • Expert baseline: 0.8995
    ○ (using neon)
  • SigOpt best: 0.9011
    ○ 1.6% reduction in error rate
    ○ No expert time wasted in tuning

SLIDE 48

SIGOPT + NEON

http://arxiv.org/pdf/1512.03385v1.pdf

  • Explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions
  • Variable depth
  • Hyperparameter optimization: mixture of domain expertise and grid search (brute force)
SLIDE 49

COMPARATIVE PERFORMANCE

Standard Method

  • Expert baseline: 0.9339
    ○ (from paper)
  • SigOpt best: 0.9436
    ○ 15% relative error rate reduction
    ○ No expert time wasted in tuning

SLIDE 50

EVALUATING THE OPTIMIZER

SLIDE 51

OUTLINE

  • Metric Definitions
  • Benchmark Suite
  • Eval Infrastructure
  • Visualization Tool
  • Baseline Comparisons
SLIDE 52

What is the best value found after optimization completes?

METRIC: BEST FOUND

            BLUE     RED
BEST_FOUND  0.7225   0.8949

SLIDE 53

How quickly is optimum found? (area under curve)

METRIC: AUC

            BLUE     RED
BEST_FOUND  0.9439   0.9435
AUC         0.8299   0.9358
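A sketch of both metrics computed from a single optimization trace; the trace is made up, and normalizing AUC as the mean of the best-seen-so-far curve is one plausible definition, not necessarily the one behind these numbers.

import numpy as np

def best_found(values):
    # Best objective value seen once optimization completes.
    return np.maximum.accumulate(values)[-1]

def auc(values):
    # Mean of the best-seen-so-far curve: rewards finding good values early.
    return np.maximum.accumulate(values).mean()

trace = np.array([0.61, 0.70, 0.68, 0.83, 0.81, 0.85])
print(best_found(trace), auc(trace))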

SLIDE 54

STOCHASTIC OPTIMIZATION

SLIDE 55
  • Optimization functions from literature
  • ML datasets: LIBSVM, Deep Learning, etc.

BENCHMARK SUITE

TEST FUNCTION TYPE       COUNT
Continuous Params        184
Noisy Observations       188
Parallel Observations    45
Integer Params           34
Categorical Params / ML  47
Failure Observations     30
TOTAL                    489

SLIDE 56
  • On-demand cluster in AWS for parallel eval function optimization
  • A full eval consists of ~20,000 optimizations, taking ~30 min

INFRASTRUCTURE

SLIDE 57

RANKING OPTIMIZERS

  • 1. Mann-Whitney U tests using BEST_FOUND
  • 2. Tied results are then partially ranked using AUC
  • 3. Any remaining ties stay as ties in the final ranking
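A sketch of step 1 with SciPy; the BEST_FOUND samples are made-up numbers from repeated runs of two optimizers on one test function, and the 0.05 threshold is an assumption.

from scipy.stats import mannwhitneyu

optimizer_a = [0.91, 0.89, 0.93, 0.90, 0.92]  # BEST_FOUND across runs
optimizer_b = [0.86, 0.88, 0.85, 0.89, 0.87]

# Two-sided test: are the two BEST_FOUND distributions distinguishable?
stat, p_value = mannwhitneyu(optimizer_a, optimizer_b, alternative="two-sided")

# If not significant, fall back to AUC (step 2), then leave as a tie (step 3).
print(stat, p_value, p_value < 0.05)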

SLIDE 58

RANKING AGGREGATION

  • Aggregate partial rankings across all eval functions using Borda count (sum of methods ranked lower)
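A minimal sketch of the Borda aggregation as described (each method scores the number of methods ranked strictly below it on each eval function); the rankings here are illustrative.

from collections import defaultdict

# Per-function partial rankings: rank 1 is best; ties share a rank.
rankings = [
    {"sigopt": 1, "random": 2, "grid": 3},
    {"sigopt": 1, "grid": 2, "random": 2},  # grid and random tied
]

scores = defaultdict(int)
for ranking in rankings:
    for method, rank in ranking.items():
        # Count methods ranked strictly lower (larger rank value).
        scores[method] += sum(1 for r in ranking.values() if r > rank)

print(dict(scores))  # higher Borda score = better aggregate rank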

SLIDE 59

SHORT RESULTS SUMMARY

SLIDE 60

BASELINE COMPARISONS

SLIDE 61

SIGOPT SERVICE

SLIDE 62

OPTIMIZATION FEEDBACK LOOP

[Diagram: the same feedback loop as slide 11. The ML / AI model (training data, cross validation, testing data) reports an objective metric to the REST API; new configurations come back, yielding better results over time.]

SLIDE 63

SIMPLIFIED OPTIMIZATION

Client Libraries

  • Python
  • Java
  • R
  • Matlab
  • And more...

Framework Integrations

  • TensorFlow
  • scikit-learn
  • xgboost
  • Keras
  • Neon
  • And more...

Live Demo

SLIDE 64

DISTRIBUTED TRAINING

  • SigOpt serves as a distributed scheduler for training models across workers
  • Workers access the SigOpt API for the latest parameters to try for each model
  • Enables easy distributed training of non-distributed algorithms across any number of models
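A sketch of one worker in this setup, again assuming SigOpt's Python client of that era; the shared experiment ID and train_and_score stub are placeholders.

from sigopt import Connection

def train_and_score(assignments):
    # Hypothetical stand-in for training one (non-distributed) model.
    return 0.5

conn = Connection(client_token="YOUR_API_TOKEN")
experiment_id = "SHARED_EXPERIMENT_ID"  # all workers point at one experiment

while True:
    # Each worker independently pulls the next suggested parameters...
    suggestion = conn.experiments(experiment_id).suggestions().create()
    value = train_and_score(suggestion.assignments)
    # ...and reports back, so the service schedules work across workers.
    conn.experiments(experiment_id).observations().create(
        suggestion=suggestion.id, value=value,
    )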
SLIDE 65

COMPARATIVE PERFORMANCE

  • Better Results, Faster and Cheaper

Quickly get the most out of your models with our proven, peer-reviewed ensemble of Bayesian and Global Optimization Methods

    ○ A Stratified Analysis of Bayesian Optimization Methods (ICML 2016)
    ○ Evaluation System for a Bayesian Optimization Service (ICML 2016)
    ○ Interactive Preference Learning of Utility Functions for Multi-Objective Optimization (NIPS 2016)
    ○ And more...

  • Fully Featured

Tune any model in any pipeline

    ○ Scales to 100 continuous, integer, and categorical parameters and many thousands of evaluations
    ○ Parallel tuning support across any number of models
    ○ Simple integrations with many languages and libraries
    ○ Powerful dashboards for introspecting your models and optimization
    ○ Advanced features like multi-objective optimization, failure region support, and more

  • Secure Black Box Optimization

Your data and models never leave your system

SLIDE 66

https://sigopt.com/getstarted

Try it yourself!

SLIDE 67

Questions?

contact@sigopt.com https://sigopt.com @SigOpt