SLIDE 1

BAYESIAN OPTIMIZATION FOR AUTOMATED MODEL SELECTION

Gustavo Malkomes, Chip Schaff, Roman Garnett
Washington University in St. Louis
Probabilistic Scientific Computing, 06.06.2017

SLIDE 2

INTRODUCTION

GP Model selection

SLIDE 3

Problem

  • Gaussian processes (GPs) are powerful models able to express a wide range of structure in nonlinear functions.
  • This power is sometimes a curse, as it can be very difficult to determine appropriate models (e.g., mean/covariance functions) to describe a given dataset.
  • The choice of model can be critical! . . . How would a nonexpert make this choice? (Usually blindly!)
  • Our goal here will be to automatically construct a useful model to explain a given dataset.

SLIDE 5

Simple grammar¹

K → {SE, RQ, LIN, PER, . . . }
K → K ∗ K
K → K + K

[Figure: sample functions drawn from the kernels SE, PER, RQ, and SE+PER]

¹ Duvenaud et al., ICML 2013
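To make the grammar concrete, here is a minimal Python sketch (my own illustration, not the authors' code; BASE_KERNELS and expand are names chosen here) of a simplified one-step expansion that combines a whole expression with each base kernel. The full grammar also rewrites subexpressions.

    from itertools import product

    BASE_KERNELS = ["SE", "RQ", "LIN", "PER"]

    def expand(kernel):
        """One-step grammar neighbors: combine the whole expression
        with each base kernel via + and *."""
        return [f"({kernel} {op} {base})"
                for base, op in product(BASE_KERNELS, ["+", "*"])]

    print(expand("SE"))
    # ['(SE + SE)', '(SE * SE)', '(SE + RQ)', '(SE * RQ)', ...]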

SLIDE 6

The problem

We want to automatically search a space of GP models (i.e., parameterized mean/covariance functions with priors over their parameters) M = {M} to find the best one to explain our data.

SLIDE 7

Objective function

In the Bayesian formalism, given a dataset D, we measure the quality of a model M using the (log) model evidence, which we wish to maximize:

g(M; D) = log ∫ p(y | X, θ, M) p(θ | M) dθ

This integral is intractable, but we can approximate it, e.g., with:

  • Bayesian information criterion (BIC)
  • Laplace approximation
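For instance, the BIC approximation could be sketched as follows in Python (a hedged illustration assuming scikit-learn; log_evidence_bic is a name chosen here, not from the talk):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    def log_evidence_bic(X, y, kernel):
        """BIC-style approximation to the log evidence g(M; D):
        maximized log marginal likelihood minus (k/2) log N."""
        gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-6).fit(X, y)
        log_lik = gp.log_marginal_likelihood(gp.kernel_.theta)  # at the MLE
        k = gp.kernel_.theta.size  # number of hyperparameters
        return log_lik - 0.5 * k * np.log(len(X))

    X = np.random.rand(30, 1)
    y = np.sin(6 * X).ravel()
    print(log_evidence_bic(X, y, RBF()))

Note that the fit call hides the expensive hyperparameter optimization discussed later.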

SLIDE 8

Optimization problem

We may now frame the model search problem as an optimization problem. We seek

M∗ = arg max_{M ∈ M} g(M; D).

SLIDE 9

Previous work: Greedy search²

[Figure: greedy search tree over the grammar, from base kernels SE, RQ, . . . , PER through expansions such as SE+RQ and RQ∗PER to SE+RQ∗PER and RQ∗PER∗PER]

² Duvenaud et al., ICML 2013
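A sketch of this greedy strategy, under my reading of the method (not the authors' code), reusing BASE_KERNELS and expand from the grammar sketch above; score is assumed to evaluate a kernel-expression string, e.g., via a BIC-style approximation:

    def greedy_search(X, y, score, budget=50):
        """Greedy search: start from the base kernels, repeatedly expand
        the incumbent, and keep the best-scoring neighbor."""
        scored = [(score(X, y, k), k) for k in BASE_KERNELS]
        best_score, best = max(scored)
        evaluated = len(BASE_KERNELS)
        while evaluated < budget:
            neighbors = expand(best)
            scored = [(score(X, y, k), k) for k in neighbors]
            evaluated += len(neighbors)
            top_score, top = max(scored)
            if top_score <= best_score:
                break  # no expansion improves the incumbent
            best, best_score = top, top_score
        return best, best_score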

SLIDE 10

OBSTACLES

Why this is a hard problem

SLIDE 11

The objective is nonlinear and nonconvex

  • The mapping from models to evidence is highly complex!
  • Even seemingly “similar” models can offer vastly different explanations of the data.
  • . . . and this similarity depends on the geometry of the data!
  • Imagine a bunch of isolated points. . .

SLIDE 12

The objective is expensive

Even estimating the model evidence is very expensive. Both the BIC and Laplace approximations require finding the MLE/MAP hyperparameters:

θ̂M = arg max_θ log p(y | X, θ, M)

This can easily be O(1000 N³): each likelihood evaluation costs O(N³), and the optimization may require on the order of 1000 of them!

SLIDE 13

The domain is discrete

Another problem is that the space of models is discrete; therefore we can’t compute gradients of the objective.

SLIDE 14

BAYESIAN OPTIMIZATION?

Why not?

SLIDE 15

A case for Bayesian optimization!

We have a

  • nonlinear,
  • gradient-free,
  • expensive,
  • black-box optimization problem. . .

. . . Bayesian optimization!

SLIDE 16

Overview of approach

We are going to model the (log) model evidence function

g : M → ℝ,  g(M; D) = log p(y | X, M),

with a Gaussian process in model space:

p(g) = GP(g; µg, Kg).

(How are we going to construct this??)

SLIDE 17

Overview of approach

Given some observed models and their evidences,

Dg = {(Mi, g(Mi; D))},

we find the posterior p(g | Dg) and derive an acquisition function α(M; Dg) that we maximize to select the next model for investigation. (How are we going to maximize this??)
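Schematically, one such iteration might look like the following Python sketch (all function names here are placeholders, passed in as arguments, not an actual API):

    def boms_step(observed, candidates, fit_gp, acquisition, evaluate_evidence):
        """One iteration: fit the evidence GP to the observed
        {model: evidence} pairs Dg, maximize the acquisition over the
        candidates, and evaluate the chosen model."""
        gp = fit_gp(observed)                     # posterior p(g | Dg)
        best = max(candidates, key=lambda m: acquisition(m, gp))
        observed[best] = evaluate_evidence(best)  # the expensive step
        candidates.discard(best)
        return best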

SLIDE 18

THE EVIDENCE MODEL

SLIDE 19

Evidence model: mean

We need to construct an informative prior over the log model evidence function:

p(g) = GP(g; µg, Kg).

For the mean, we simply take a constant. . . . What about the covariance?

SLIDE 20

The “kernel kernel”

The covariance Kg measures our prior belief about the correlation between the log model evidence evaluated at two kernels. Here we consider two kernels to be “similar” for a given dataset D if they offer similar explanations of the latent function at the observed locations.

SLIDE 21

The “kernel kernel”

A model M induces a prior distribution over latent function values at given locations X:

p(f | X, M) = ∫ p(f | X, θ, M) p(θ | M) dθ

This is an (infinite) mixture of multivariate Gaussians, each of which is a potential explanation of the latent function values f (and thus of the observed data y).

SLIDE 22

The “kernel kernel”

Given input locations X, we suggest two models M and M′ should be similar when the latent explanations p(f | X, M) and p(f | X, M′) are similar; i.e., when they have high overlap.

SLIDE 23

Measuring overlap: Hellinger distance

Omitting many details, we have a solution: the so-called expected Hellinger distance

d̄²H(M, M′; D)

(the expectation is over the hyperparameters of each model).

SLIDE 24

The “kernel kernel”

Now our “kernel kernel” between two models M and M′, given the data D, is defined to be

Kg(M, M′; D, ℓ) = exp(−d̄²H(M, M′; D) / (2ℓ²)).

Crucially, this depends on the data distribution!
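For fixed hyperparameters, the squared Hellinger distance between two multivariate Gaussians has a closed form, so the kernel kernel can be sketched as follows (my own illustration; the method proper averages the distance over hyperparameter samples):

    import numpy as np

    def hellinger2(mu1, S1, mu2, S2):
        """Closed-form squared Hellinger distance between N(mu1, S1)
        and N(mu2, S2): 0 for identical distributions, at most 1."""
        S = 0.5 * (S1 + S2)
        _, ld1 = np.linalg.slogdet(S1)
        _, ld2 = np.linalg.slogdet(S2)
        _, ld = np.linalg.slogdet(S)
        diff = mu1 - mu2
        log_bc = (0.25 * (ld1 + ld2) - 0.5 * ld
                  - 0.125 * diff @ np.linalg.solve(S, diff))
        return 1.0 - np.exp(log_bc)  # 1 - Bhattacharyya coefficient

    def kernel_kernel(mu1, S1, mu2, S2, ell=1.0):
        """Kg(M, M'; D, ell) for fixed hyperparameters."""
        return np.exp(-hellinger2(mu1, S1, mu2, S2) / (2.0 * ell**2))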

SLIDE 25

“Kernel kernel:” Illustration

[Figure: sample functions from SE, RQ, PER, and SE+PER, and the resulting “kernel kernel” Gram matrices over these four models for two datasets]

SLIDE 26

OPTIMIZING THE ACQUISITION FUNCTION

SLIDE 27

Overview of approach

We have defined a model over the model evidence function. We still need to figure out how to maximize the acquisition function (e.g., expected improvement):

M′ = arg max_{M ∈ M} α(M; Dg).
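For reference, a minimal sketch of expected improvement (for maximization) over a discrete candidate set, given posterior means and standard deviations from the evidence GP (array names are mine):

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(mu, sigma, best):
        """EI at candidates with posterior mean mu and standard deviation
        sigma; `best` is the best evidence observed so far."""
        z = (mu - best) / sigma
        return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    # e.g., pick candidates[np.argmax(expected_improvement(mu, sigma, g_best))]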

SLIDE 28

Active set construction

Our idea: dynamically maintain a bag of (∼500) candidate models and optimize α on that smaller set. To construct this set, we will heuristically encourage exploitation and exploration.

SLIDE 29

Active set construction: Exploitation

Exploitation: add models near the best model yet seen.

[Figure: the search tree, highlighting expansions of the incumbent model]

SLIDE 30

Active set construction: Exploration

Exploration: add models generated from (short) random walks from the empty kernel.

[Figure: the search tree, showing random walks through the grammar from the empty kernel]
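Putting the two heuristics together, a minimal sketch of maintaining the bag of candidates (my own reading; random_walk and update_candidates are names chosen here), reusing BASE_KERNELS and expand from the grammar sketch:

    import random

    def random_walk(max_steps=3):
        """Short random walk through the grammar, starting from a base
        kernel (standing in here for the empty kernel)."""
        kernel = random.choice(BASE_KERNELS)
        for _ in range(random.randint(0, max_steps)):
            kernel = random.choice(expand(kernel))
        return kernel

    def update_candidates(candidates, incumbent, size=500):
        """Exploitation: add neighbors of the incumbent model.
        Exploration: top up with random walks until the bag holds `size`."""
        candidates |= set(expand(incumbent))
        while len(candidates) < size:
            candidates.add(random_walk())
        return candidates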

SLIDE 31

EXPERIMENTS

SLIDE 32

Experimental setup

  • We compare our method (Bayesian optimization for model selection, BOMS) against the greedy search method of Duvenaud et al., ICML 2013.
  • Laplace approximation for estimating model evidence.
  • Budget of 50 model evidence computations.

SLIDE 33

Model space: CKS grammar

  • For time-series data, the base kernels were SE, RQ, LIN, and PER.
  • For higher-dimensional data, the base kernels were SEi and RQi (one-dimensional kernels applied to input dimension i).

SLIDE 34

Experimental setup: Details for BOMS

  • First model selected was SE.
  • Acquisition function was expected improvement per second.

SLIDE 35

Results: Time series

[Figure: best log evidence per datum, g(M∗; D)/|D|, vs. iteration for CKS and BOMS on the AIRLINE and MAUNA LOA time series]

SLIDE 36

Results: High-dimensional data

[Figure: best log evidence per datum vs. iteration for CKS and BOMS on the HOUSING and CONCRETE datasets]

SLIDE 37

Notes

  • The overhead of our method in terms of running time is approximately 10%.
  • The vast majority of the time is spent optimizing hyperparameters (random restarts, etc.).
  • We offer some advice, which we adopt here, for automatically selecting reasonable hyperparameter priors for given data.

SLIDE 38

Other options

For Bayesian optimization, one may want to choose another family of kernels, e.g.:

  • Additive decompositions (Kandasamy et al., ICML 2015)
  • Low-dimensional embeddings (Wang et al., IJCAI 2013; Garnett et al., UAI 2014)

Both would also be convenient for other reasons (e.g., easier optimization of the acquisition function).

SLIDE 39

LOOKING FORWARD

SLIDE 40

Looking forward

These results are promising, but the real promise of such methods is in the inner loop of another procedure (e.g., Bayesian optimization or Bayesian quadrature)!

SLIDE 41

Future code snippet?

data = [];
models = [SE];
for i = 1:budget
    % use mixture of models in acquisition function
    x_next = maximize_acquisition(data, models);
    y_next = f(x_next);
    data = [data; x_next, y_next];
    % update bag of models (BOMS)
    models = update_models(data, models);
end

SLIDE 42

THANK YOU!

Questions?