SLIDE 1

BAYESIAN OPTIMIZATION FOR AUTOMATED MODEL SELECTION

Gustavo Malkomes, Chip Schaff, Roman Garnett
Washington University in St. Louis
Probabilistic Scientific Computing, 06.06.2017

SLIDE 2

INTRODUCTION

GP Model selection

SLIDE 3

Problem

  • Gaussian processes (GPs) are powerful models able to express a wide range of structure in nonlinear functions.
  • This power is sometimes a curse, as it can be very difficult to determine appropriate models (e.g., mean/covariance functions) to describe a given dataset.
  • The choice of model can be critical! . . . How would a nonexpert make this choice? (Usually blindly!)
  • Our goal here will be to automatically construct a useful model to explain a given dataset.

SLIDE 5

Simple grammar¹

K → {SE, RQ, LIN, PER, . . . }
K → K ∗ K
K → K + K

[Figure: sample functions drawn from the kernels SE, PER, RQ, and SE+PER]

¹ Duvenaud et al., ICML 2013
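To make the grammar concrete, here is a minimal Python sketch (my own illustration, not the authors' code; BASE_KERNELS and expand are names chosen here) of a simplified one-step expansion that combines a whole expression with each base kernel. The full grammar also rewrites subexpressions.

    from itertools import product

    BASE_KERNELS = ["SE", "RQ", "LIN", "PER"]

    def expand(kernel):
        """One-step grammar neighbors: combine the whole expression
        with each base kernel via + and *."""
        return [f"({kernel} {op} {base})"
                for base, op in product(BASE_KERNELS, ["+", "*"])]

    print(expand("SE"))
    # ['(SE + SE)', '(SE * SE)', '(SE + RQ)', '(SE * RQ)', ...]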

SLIDE 6

The problem

We want to automatically search a space of GP models (i.e., parameterized mean/covariance functions with priors over their parameters) M = {M} to find the best one to explain our data.

SLIDE 7

Objective function

In the Bayesian formalism, given a dataset D, we measure the quality of a model M using the (log) model evidence, which we wish to maximize:

g(M; D) = log ∫ p(y | X, θ, M) p(θ | M) dθ

This integral is intractable, but we can approximate it, e.g., with:

  • Bayesian information criterion (BIC)
  • Laplace approximation
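For instance, the BIC approximation could be sketched as follows in Python (a hedged illustration assuming scikit-learn; log_evidence_bic is a name chosen here, not from the talk):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    def log_evidence_bic(X, y, kernel):
        """BIC-style approximation to the log evidence g(M; D):
        maximized log marginal likelihood minus (k/2) log N."""
        gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-6).fit(X, y)
        log_lik = gp.log_marginal_likelihood(gp.kernel_.theta)  # at the MLE
        k = gp.kernel_.theta.size  # number of hyperparameters
        return log_lik - 0.5 * k * np.log(len(X))

    X = np.random.rand(30, 1)
    y = np.sin(6 * X).ravel()
    print(log_evidence_bic(X, y, RBF()))

Note that the fit call hides the expensive hyperparameter optimization discussed later.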

SLIDE 8

Optimization problem

We may now frame the model search problem as an optimization problem. We seek

M∗ = arg max_{M ∈ M} g(M; D).

SLIDE 9

Previous work: Greedy search²

[Figure: greedy search tree over the grammar, from base kernels SE, RQ, . . . , PER through expansions such as SE+RQ and RQ∗PER to SE+RQ∗PER and RQ∗PER∗PER]

² Duvenaud et al., ICML 2013
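A sketch of this greedy strategy, under my reading of the method (not the authors' code), reusing BASE_KERNELS and expand from the grammar sketch above; score is assumed to evaluate a kernel-expression string, e.g., via a BIC-style approximation:

    def greedy_search(X, y, score, budget=50):
        """Greedy search: start from the base kernels, repeatedly expand
        the incumbent, and keep the best-scoring neighbor."""
        scored = [(score(X, y, k), k) for k in BASE_KERNELS]
        best_score, best = max(scored)
        evaluated = len(BASE_KERNELS)
        while evaluated < budget:
            neighbors = expand(best)
            scored = [(score(X, y, k), k) for k in neighbors]
            evaluated += len(neighbors)
            top_score, top = max(scored)
            if top_score <= best_score:
                break  # no expansion improves the incumbent
            best, best_score = top, top_score
        return best, best_score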

SLIDE 10

OBSTACLES

Why this is a hard problem

SLIDE 11

The objective is nonlinear and nonconvex

  • The mapping from models to evidence is highly complex!
  • Even seemingly “similar” models can offer vastly different explanations of the data.
  • . . . and this similarity depends on the geometry of the data!
  • Imagine a bunch of isolated points. . .

SLIDE 12

The objective is expensive

Even estimating the model evidence is very expensive. Both the BIC and Laplace approximations require finding the MLE/MAP hyperparameters:

θ̂M = arg max_θ log p(y | X, θ, M)

This can easily be O(1000 N³): each likelihood evaluation costs O(N³), and the optimization may require on the order of 1000 of them!

SLIDE 13

The domain is discrete

Another problem is that the space of models is discrete; therefore we can’t compute gradients of the objective.

SLIDE 14

BAYESIAN OPTIMIZATION?

Why not?

SLIDE 15

A case for Bayesian optimization!

We have a

  • nonlinear,
  • gradient-free,
  • expensive,
  • black-box optimization problem. . .

. . . Bayesian optimization!

SLIDE 16

Overview of approach

We are going to model the (log) model evidence function

g : M → ℝ,  g(M; D) = log p(y | X, M),

with a Gaussian process in model space:

p(g) = GP(g; µg, Kg).

(How are we going to construct this??)

SLIDE 17

Overview of approach

Given some observed models and their evidences,

Dg = {(Mi, g(Mi; D))},

we find the posterior p(g | Dg) and derive an acquisition function α(M; Dg) that we maximize to select the next model for investigation. (How are we going to maximize this??)
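Schematically, one such iteration might look like the following Python sketch (all function names here are placeholders, passed in as arguments, not an actual API):

    def boms_step(observed, candidates, fit_gp, acquisition, evaluate_evidence):
        """One iteration: fit the evidence GP to the observed
        {model: evidence} pairs Dg, maximize the acquisition over the
        candidates, and evaluate the chosen model."""
        gp = fit_gp(observed)                     # posterior p(g | Dg)
        best = max(candidates, key=lambda m: acquisition(m, gp))
        observed[best] = evaluate_evidence(best)  # the expensive step
        candidates.discard(best)
        return best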

SLIDE 18

THE EVIDENCE MODEL

SLIDE 19

Evidence model: mean

We need to construct an informative prior over the log model evidence function:

p(g) = GP(g; µg, Kg).

For the mean, we simply take a constant. . . . What about the covariance?

SLIDE 20

The “kernel kernel”

The covariance Kg measures our prior belief about the correlation between the log model evidence evaluated at two kernels. Here we consider two kernels to be “similar” for a given dataset D if they offer similar explanations of the latent function at the observed locations.

SLIDE 21

The “kernel kernel”

A model M induces a prior distribution over latent function values at given locations X:

p(f | X, M) = ∫ p(f | X, θ, M) p(θ | M) dθ

This is an (infinite) mixture of multivariate Gaussians, each of which is a potential explanation of the latent function values f (and thus of the observed data y).

SLIDE 22

The “kernel kernel”

Given input locations X, we suggest two models M and M′ should be similar when the latent explanations p(f | X, M) and p(f | X, M′) are similar; i.e., when they have high overlap.

SLIDE 23

Measuring overlap: Hellinger distance

Omitting many details, we have a solution: the so-called expected Hellinger distance

d̄²H(M, M′; D)

(the expectation is over the hyperparameters of each model).

SLIDE 24

The “kernel kernel”

Now our “kernel kernel” between two models M and M′, given the data D, is defined to be

Kg(M, M′; D, ℓ) = exp(−d̄²H(M, M′; D) / (2ℓ²)).

Crucially, this depends on the data distribution!
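For fixed hyperparameters, the squared Hellinger distance between two multivariate Gaussians has a closed form, so the kernel kernel can be sketched as follows (my own illustration; the method proper averages the distance over hyperparameter samples):

    import numpy as np

    def hellinger2(mu1, S1, mu2, S2):
        """Closed-form squared Hellinger distance between N(mu1, S1)
        and N(mu2, S2): 0 for identical distributions, at most 1."""
        S = 0.5 * (S1 + S2)
        _, ld1 = np.linalg.slogdet(S1)
        _, ld2 = np.linalg.slogdet(S2)
        _, ld = np.linalg.slogdet(S)
        diff = mu1 - mu2
        log_bc = (0.25 * (ld1 + ld2) - 0.5 * ld
                  - 0.125 * diff @ np.linalg.solve(S, diff))
        return 1.0 - np.exp(log_bc)  # 1 - Bhattacharyya coefficient

    def kernel_kernel(mu1, S1, mu2, S2, ell=1.0):
        """Kg(M, M'; D, ell) for fixed hyperparameters."""
        return np.exp(-hellinger2(mu1, S1, mu2, S2) / (2.0 * ell**2))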

SLIDE 25

“Kernel kernel:” Illustration

[Figure: sample functions from SE, RQ, PER, and SE+PER, and the resulting “kernel kernel” Gram matrices over these four models for two datasets]

SLIDE 26

OPTIMIZING THE ACQUISITION FUNCTION

SLIDE 27

Overview of approach

We have defined a model over the model evidence function. We still need to figure out how to maximize the acquisition function (e.g., expected improvement):

M′ = arg max_{M ∈ M} α(M; Dg).
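For reference, a minimal sketch of expected improvement (for maximization) over a discrete candidate set, given posterior means and standard deviations from the evidence GP (array names are mine):

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(mu, sigma, best):
        """EI at candidates with posterior mean mu and standard deviation
        sigma; `best` is the best evidence observed so far."""
        z = (mu - best) / sigma
        return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    # e.g., pick candidates[np.argmax(expected_improvement(mu, sigma, g_best))]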

SLIDE 28

Active set construction

Our idea: dynamically maintain a bag of (∼500) candidate models and optimize α on that smaller set. To construct this set, we will heuristically encourage exploitation and exploration.

SLIDE 29

Active set construction: Exploitation

Exploitation: add models near the best model yet seen.

[Figure: the search tree, highlighting expansions of the incumbent model]

SLIDE 30

Active set construction: Exploration

Exploration: add models generated from (short) random walks from the empty kernel.

[Figure: the search tree, showing random walks through the grammar from the empty kernel]
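Putting the two heuristics together, a minimal sketch of maintaining the bag of candidates (my own reading; random_walk and update_candidates are names chosen here), reusing BASE_KERNELS and expand from the grammar sketch:

    import random

    def random_walk(max_steps=3):
        """Short random walk through the grammar, starting from a base
        kernel (standing in here for the empty kernel)."""
        kernel = random.choice(BASE_KERNELS)
        for _ in range(random.randint(0, max_steps)):
            kernel = random.choice(expand(kernel))
        return kernel

    def update_candidates(candidates, incumbent, size=500):
        """Exploitation: add neighbors of the incumbent model.
        Exploration: top up with random walks until the bag holds `size`."""
        candidates |= set(expand(incumbent))
        while len(candidates) < size:
            candidates.add(random_walk())
        return candidates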

SLIDE 31

EXPERIMENTS

SLIDE 32

Experimental setup

  • We compare our method (Bayesian optimization for model selection, BOMS) against the greedy search method of Duvenaud et al., ICML 2013.
  • Laplace approximation for estimating model evidence.
  • Budget of 50 model evidence computations.

SLIDE 33

Model space: CKS grammar

  • For time-series data, the base kernels were SE, RQ, LIN, and PER.
  • For higher-dimensional data, the base kernels were SEi and RQi (one-dimensional kernels applied to input dimension i).

SLIDE 34

Experimental setup: Details for BOMS

  • First model selected was SE.
  • Acquisition function was expected improvement per second.

SLIDE 35

Results: Time series

[Figure: best log evidence per datum, g(M∗; D)/|D|, vs. iteration for CKS and BOMS on the AIRLINE and MAUNA LOA time series]

SLIDE 36

Results: High-dimensional data

[Figure: best log evidence per datum vs. iteration for CKS and BOMS on the HOUSING and CONCRETE datasets]

SLIDE 37

Notes

  • The overhead of our method in terms of running time is approximately 10%.
  • The vast majority of the time is spent optimizing hyperparameters (random restarts, etc.).
  • We offer some advice, which we adopt here, for automatically selecting reasonable hyperparameter priors for given data.

SLIDE 38

Other options

For Bayesian optimization, one may want to choose another family of kernels, e.g.:

  • Additive decompositions (Kandasamy et al., ICML 2015)
  • Low-dimensional embeddings (Wang et al., IJCAI 2013; Garnett et al., UAI 2014)

Both would also be convenient for other reasons (e.g., easier optimization of the acquisition function).

SLIDE 39

LOOKING FORWARD

SLIDE 40

Looking forward

These results are promising, but the real promise of such methods is in the inner loop of another procedure (e.g., Bayesian optimization or Bayesian quadrature)!

SLIDE 41

Future code snippet?

data = [];
models = [SE];
for i = 1:budget
    % use mixture of models in acquisition function
    x_next = maximize_acquisition(data, models);
    y_next = f(x_next);
    data = [data; x_next, y_next];
    % update bag of models (BOMS)
    models = update_models(data, models);
end

SLIDE 42

THANK YOU!

Questions?