ACTIVELY LEARNING HYPERPARAMETERS FOR GPS - Roman Garnett - PowerPoint PPT Presentation



SLIDE 1

ACTIVELY LEARNING HYPERPARAMETERS FOR GPS

Roman Garnett Washington University in St. Louis 12.10.2016 Joint work with Michael Osborne (University of Oxford) Philipp Hennig (MPI Tübingen)

SLIDE 2

INTRODUCTION

Learning hyperparameters

SLIDE 3

Problem

  • Gaussian processes (GPs) are powerful models able to express a wide range of structure in nonlinear functions.

  • This power is sometimes a curse, as it can be very difficult to determine appropriate values of hyperparameters, especially with small datasets.

SLIDE 4

Small datasets

  • Small datasets are inherent in situations when the function of interest is very expensive, as is typical in Bayesian optimization.

  • Success on these problems hinges on accurate modeling of uncertainty, and undetermined hyperparameters can contribute a great deal (often hidden!).

  • The traditional approach in these scenarios is to spend some portion of the budget on model-agnostic initialization (Latin hypercubes, etc.).

  • We present a model-driven approach here.

SLIDE 5

Motivating problem: Learning embeddings

  • High dimensionality has stymied the progress of model-based approaches to many machine learning tasks.

  • In particular, Gaussian process approaches remain intractable for large numbers of input variables.

  • An old idea for combating this problem is to exploit low-dimensional structure in the function, the simplest example of which is a linear embedding.

SLIDE 6

Learning embeddings for GPs

  • We want to learn a function f : R^D → R, where D is very large.

  • We assume that f has low intrinsic dimension; that is, there is a function g : R^d → R such that f(x) = g(Rx), where R ∈ R^{d×D} is a matrix defining a linear embedding.
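This setup can be sketched numerically. In the minimal example below, the function g, the random embedding R, and all names are invented for illustration; the point is only that f inherits its structure from g through R:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 10, 2
R = rng.standard_normal((d, D))           # embedding matrix, R in R^{d x D}

def g(z):                                 # low-dimensional function g : R^d -> R
    return np.sin(z[0]) + z[1] ** 2

def f(x):                                 # f(x) = g(Rx): intrinsic dimension d
    return g(R @ x)

# Moving x within the null space of R leaves f unchanged.
x = rng.standard_normal(D)
null_basis = np.linalg.svd(R)[2][d:]      # rows of V^T spanning null(R)
assert np.isclose(f(x), f(x + null_basis[0]))
```

Any direction orthogonal to the rows of R is invisible to f, which is exactly why only the d-dimensional projection Rx matters.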

SLIDE 7

Example

  • Here f : R^2 → R (D = 2), but f only depends on a one-dimensional projection of x (d = 1).

  • All function values are realized along the black line.

[Figure: surface plot of f over (x1, x2), with the one-dimensional embedding marked by a black line.]

SLIDE 8

The GP model

If we knew the embedding R, modeling f would be straightforward. Our model for f given the embedding R is a zero-mean Gaussian process:

p(f | R) = GP(f; 0, K), with K(x, x′; R) = κ(Rx, Rx′),

where κ is a covariance on R^d × R^d.

SLIDE 9

The GP model

If κ is the familiar squared exponential, then

K(x, x′; R, γ) = γ^2 exp( −(1/2) (x − x′)⊤ R⊤R (x − x′) ).

This is a low-rank Mahalanobis covariance, also known as a factor analysis covariance.
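As a quick numerical check, the factor-analysis form above agrees with applying the squared exponential κ in the embedded space. All values below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
D, d, gamma = 5, 2, 1.3
R = rng.standard_normal((d, D))

def kappa(u, v):                      # squared exponential on R^d
    return gamma**2 * np.exp(-0.5 * np.sum((u - v)**2))

def K(x, xp):                         # factor-analysis form on R^D
    diff = x - xp
    return gamma**2 * np.exp(-0.5 * diff @ R.T @ R @ diff)

x, xp = rng.standard_normal(D), rng.standard_normal(D)
assert np.isclose(kappa(R @ x, R @ xp), K(x, xp))
```

The identity holds because (x − x′)⊤R⊤R(x − x′) is just the squared distance between Rx and Rx′ in the embedded space.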

SLIDE 10

Our approach

  • Our goal is to learn R (in general, any θ) as quickly as possible!

  • Unlike previous approaches, which focus on random embeddings (Wang, et al. 2013), we focus on learning the embedding directly.

SLIDE 11

What can happen with random choices

[Figure from Djolonga, et al. NIPS 2013.]


SLIDE 12

LEARNING THE HYPERPARAMETERS

SLIDE 13

Learning the hyperparameters

We maintain a probabilistic belief on θ. We start with a prior p(θ), and given data D we find the (approximate) posterior p(θ | D). The uncertainty in θ (in particular, its entropy) measures our progress!
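Since the approximate posterior used here is Gaussian, its entropy has a closed form and so this progress measure is cheap to track. A minimal sketch (the example covariances are invented):

```python
import numpy as np

def gaussian_entropy(Sigma):
    # H[N(theta_hat, Sigma)] = 0.5 * log det(2 * pi * e * Sigma)
    sign, logdet = np.linalg.slogdet(2 * np.pi * np.e * Sigma)
    return 0.5 * logdet

# A tighter posterior over theta has lower entropy, i.e. more progress:
assert gaussian_entropy(np.diag([0.1, 0.1])) < gaussian_entropy(np.eye(2))
```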


SLIDE 14

The prior

The prior is arbitrary, but here we took a diffuse independent prior distribution on each entry: p(θᵢ) = N(θᵢ; 0, σᵢ²).

Could also use something more sophisticated.

SLIDE 15

The posterior

Now, given observations D, we approximate the posterior distribution on θ: p(θ | D) ≈ N(θ; θ̂, Σ). The method of approximation is also arbitrary, but we took a Laplace approximation.
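A toy sketch of such a Laplace approximation for a single hyperparameter (here a log inverse length scale; the data, noise level, and prior width are invented for illustration): find the MAP θ̂, then take Σ as the inverse Hessian of the negative log posterior at θ̂.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = np.linspace(0, 1, 8)
y = np.sin(6 * X) + 0.1 * rng.standard_normal(8)

def neg_log_posterior(theta):
    ell = np.exp(-theta[0])                                  # length scale
    K = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / ell**2) + 1e-2 * np.eye(8)
    _, logdet = np.linalg.slogdet(K)
    nll = 0.5 * (y @ np.linalg.solve(K, y) + logdet + 8 * np.log(2 * np.pi))
    return nll + 0.5 * theta[0]**2                           # N(0, 1) prior

theta_hat = minimize(neg_log_posterior, x0=np.zeros(1)).x    # MAP estimate
h = 1e-4                                                     # finite-difference Hessian
H = (neg_log_posterior(theta_hat + h) - 2 * neg_log_posterior(theta_hat)
     + neg_log_posterior(theta_hat - h)) / h**2
Sigma = 1.0 / H                                              # Laplace covariance
assert Sigma > 0
```

In the talk's setting θ contains all entries of R (plus any other hyperparameters), but the recipe is the same in higher dimensions with a full Hessian.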


SLIDE 16

SELECTING INFORMATIVE POINTS

Active learning

SLIDE 17

Selecting informative points

  • We wish to sequentially sample the most informative point about θ.

  • We suggest maximizing the mutual information between the observed function value and the hyperparameters, particularly in the form known as Bayesian active learning by disagreement (BALD):¹

x∗ = arg max_x H[y | x, D] − E_θ[ H[y | x, D, θ] ].

¹ Houlsby, et al. BAYESOPT 2011
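For GP regression, each conditional predictive p(y | x, D, θ) is Gaussian, so the BALD objective can be approximated by Monte Carlo over hyperparameter samples. A minimal sketch (moment-matching the marginal to a Gaussian is one simple choice here, not the talk's method, which uses the MGP approximation instead; all values are invented):

```python
import numpy as np

def bald_score(means, variances):
    # First term: entropy of the marginal over theta samples, moment-matched
    # to a Gaussian via the law of total variance.
    v_marg = variances.mean() + means.var()
    h_marg = 0.5 * np.log(2 * np.pi * np.e * v_marg)
    # Second term: expected conditional entropy, averaged over theta samples.
    h_cond = np.mean(0.5 * np.log(2 * np.pi * np.e * variances))
    return h_marg - h_cond

# Confident-but-disagreeing predictives score higher than agreeing ones:
hi = bald_score(np.array([-1.0, 1.0]), np.array([0.1, 0.1]))
lo = bald_score(np.array([0.0, 0.0]), np.array([0.1, 0.1]))
assert hi > lo >= 0
```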


SLIDE 18

BALD

Breaking this down, we want to find points with high marginal uncertainty (à la uncertainty sampling). . .

x∗ = arg max_x H[y | x, D] − E_θ[ H[y | x, D, θ] ].

SLIDE 19

BALD

. . . but would have low uncertainty if we knew the hyperparameters θ:

x∗ = arg max_x H[y | x, D] − E_θ[ H[y | x, D, θ] ].

SLIDE 20

BALD

  • That is, we want to find points where the competing models (one for each value of θ) are all certain, but disagree highly with each other.

  • These points are the most informative points about the hyperparameters! (We can discard hyperparameters that were confident about the wrong answer.)

SLIDE 21

Computation of BALD

How can we compute or approximate the BALD objective for our model?

x∗ = arg max_x H[y | x, D] − E_θ[ H[y | x, D, θ] ].

The first term (the marginal uncertainty in y) is especially troubling. . .

SLIDE 22

LEARNING THE FUNCTION

Approximate marginalization of GP hyperparameters

SLIDE 23

Learning the function

Given data D and an input x∗, we wish to capture our belief about the associated latent value f∗, accounting for uncertainty in θ:

p(f∗ | x∗, D) = ∫ p(f∗ | x∗, D, θ) p(θ | D) dθ.

We provide an approximation called the “marginal GP” (MGP).
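For intuition, the integral above could also be approximated naively by Monte Carlo over posterior hyperparameter samples (the MGP is a cheaper analytic alternative). Here `predictive` is a hypothetical stand-in for the conditional GP predictive, returning a mean and variance given θ:

```python
import numpy as np

rng = np.random.default_rng(4)

def marginal_predictive(predictive, theta_hat, Sigma, n_samples=1000):
    # Draw theta from the Gaussian posterior approximation and average
    # the conditional Gaussian predictives (law of total variance).
    thetas = rng.multivariate_normal(theta_hat, Sigma, size=n_samples)
    moments = np.array([predictive(t) for t in thetas])   # rows: (mean, var)
    m = moments[:, 0].mean()
    v = moments[:, 1].mean() + moments[:, 0].var()
    return m, v

# Toy conditional predictive whose mean depends on theta:
predictive = lambda t: (t[0], 0.25)
m, v = marginal_predictive(predictive, np.zeros(1), np.eye(1))
assert v > 0.25          # hyperparameter uncertainty inflates the variance
```

This many-sample average is exactly what the MGP replaces with a closed-form moment expansion around the MAP hyperparameters.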

SLIDE 24

The MGP

The result is this:

p(f∗ | x∗, D) ≈ N(f∗; m∗_D, C∗_D), where m∗_D = μ∗_{D,θ̂}.

The approximate mean is the MAP posterior mean, and. . .

SLIDE 25

The MGP

C∗_D = (4/3) V∗_{D,θ̂} + (∂μ∗/∂θ)⊤ Σ (∂μ∗/∂θ) + (3 V∗_{D,θ̂})⁻¹ (∂V∗/∂θ)⊤ Σ (∂V∗/∂θ).

The variance is inflated according to how the posterior mean and posterior variance change with the hyperparameters.
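The formula above translates directly into code. All inputs below are placeholders; in practice the gradients come from differentiating the GP posterior and Σ comes from the Laplace approximation:

```python
import numpy as np

def mgp_variance(V_map, dmu, dV, Sigma):
    # C*_D = 4/3 V* + dmu' Sigma dmu + (3 V*)^-1 dV' Sigma dV
    return (4.0 / 3.0 * V_map
            + dmu @ Sigma @ dmu
            + (dV @ Sigma @ dV) / (3.0 * V_map))

V_map = 0.5                                     # MAP predictive variance
C = mgp_variance(V_map, np.array([0.2, -0.1]),  # d(mean)/d(theta)
                 np.array([0.05, 0.3]),         # d(variance)/d(theta)
                 np.diag([1.0, 2.0]))           # Laplace posterior covariance
assert C > V_map      # hyperparameter uncertainty inflates the MAP variance
```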

SLIDE 26

Return to BALD

The MGP gives us a simple approximation to the BALD objective; we maximize the following simple ratio:

C∗_D / V∗_{D,θ̂}.

So we sample the point with maximal variance inflation. This is the point where the plausible hyperparameters maximally disagree under our approximation!

SLIDE 27

BALD and the MGP

[Figure: the true, MGP, and BBQ approximations to the BALD utility and its maximum, shown with the data, the true/MAP/MGP means, and ±2 sd envelopes (true, MAP, MGP).]

SLIDE 28

EXAMPLE

SLIDE 29

Example

Consider a simple one-dimensional example (here R is simply an inverse length scale).

  • The blue envelope shows the uncertainty given by the MAP embedding.

  • The red envelope shows the additional uncertainty due to not knowing the embedding.

  • We sample where the ratio of these is maximized.

SLIDE 30

Example

The inset shows our belief over log R; it tightens as we continue to sample.

SLIDE 31

Example


SLIDE 32

Example


SLIDE 33

Example

We sample at a variety of separations to further refine our belief about R.


SLIDE 34

Example

Notice that we are relatively uncertain about many function values! Nonetheless, we are effectively learning R.


SLIDE 35

2d example

[Figure: BALD sampling vs. uncertainty sampling of f(x) over (x1, x2) ∈ [−5, 5]².]

SLIDE 36

2d example

[Figure: the true R and the posterior p(R | D) over (R1, R2) ∈ [−1, 1]².]

SLIDE 37

Results

  • We have tested this approach on numerous synthetic and real-world regression problems up to dimension D = 318, and our performance was significantly superior to:

  • random sampling,
  • Latin-hypercube designs, and
  • uncertainty sampling.

SLIDE 38

Test setup

For each method/dataset, we:

  • Began with a single observation of the function at the center of the (box-bounded) domain,

  • Allowed each method to select a sequence of n = 100 observations,

  • Given the resulting training data, found the MAP hyperparameters, and

  • Used these hyperparameters to test on a held-out set of 1000 points, measuring RMSE and negative log likelihood.
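The protocol above can be sketched as a simple loop. All names and the toy selector below are illustrative, not the talk's code:

```python
import numpy as np

def run_experiment(f, select_next, domain_center, n=100):
    X = [np.asarray(domain_center)]       # start at the domain center
    y = [f(domain_center)]
    for _ in range(n - 1):                # select n observations in total
        x_next = select_next(np.array(X), np.array(y))
        X.append(np.asarray(x_next))
        y.append(f(x_next))
    # (One would then fit MAP hyperparameters on this data and score
    # RMSE / negative log likelihood on the held-out points.)
    return np.array(X), np.array(y)

rng = np.random.default_rng(3)
X, y = run_experiment(lambda x: float(np.sum(x**2)),          # toy objective
                      lambda X, y: rng.uniform(-1, 1, size=2), # random selector
                      np.zeros(2), n=10)
assert X.shape == (10, 2) and y.shape == (10,)
```

Swapping `select_next` between random sampling, Latin-hypercube designs, uncertainty sampling, and the BALD criterion gives the four methods compared on the next slide.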

SLIDE 39

Results: RMSE

Choosing 100 observations, predicting on 1000 more.

dataset               D/d     RAND    LH      UNC     BALD
synthetic             10/2    0.412   0.371   0.146   0.138
synthetic             10/3    0.553   0.687   0.557   0.523
synthetic             20/2    0.578   0.549   0.551   0.464
synthetic             20/3    0.714   0.740   0.700   0.617
Branin                10/2    18.2    17.8    3.63    2.29
Branin                20/2    18.3    14.8    13.4    15.0
communities & crime   96/2    0.720   —       0.782   0.661
temperature           106/2   0.423   —       0.427   0.328
CT slices             318/2   0.878   —       0.845   0.767

SLIDE 40

Reminder

The framework we have presented for actively learning linear embeddings is completely general; we can use it for actively learning hyperparameters in any GP model!


SLIDE 41

Question

Both these approaches suggest a two-stage approach for optimization. Is this necessary? Can we use BALD to learn the embedding while simultaneously optimizing the function?

SLIDE 42

Code

github.com/rmgarnett/active_gp_hyperlearning

SLIDE 43

PAPER

For more details

SLIDE 44

UAI 2014

Actively Learning Linear Embeddings for Gaussian Processes, UAI 2014.


SLIDE 45

Extension: NIPS 2015

Extension to model selection, one step closer to fully automated Bayesian optimization! Bayesian Active Model Selection with an Application to Automated Audiometry, NIPS 2015.


SLIDE 46

Extension: NIPS 2016

Another extension to model selection with fixed datasets, one step closer to fully automated Bayesian optimization!

Bayesian optimization for automated model selection, NIPS 2016.

SLIDE 47

THANK YOU!

Questions?