SLIDE 1

A Short Introduction to Bayesian Optimization

With applications to parameter tuning on accelerators

Johannes Kirschner, 28th February 2018

ICFA Workshop on Machine Learning for Accelerator Control

SLIDE 2

Solve \( x^* = \arg\max_{x \in \mathcal{X}} f(x) \)

SLIDE 3

Application: Tuning of Accelerators

Example: x = parameter settings on the accelerator, f(x) = pulse energy.

SLIDE 4

Application: Tuning of Accelerators

Example: x = parameter settings on the accelerator, f(x) = pulse energy.

Goal: Find \( x^* = \arg\max_{x \in \mathcal{X}} f(x) \) using only noisy evaluations \( y_t = f(x_t) + \epsilon_t \).
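To make the setup concrete, here is a minimal sketch of the interface Bayesian optimization assumes: a black box that returns one noisy measurement per query. The toy objective f and the noise level are illustrative placeholders, not the actual accelerator readout.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Toy stand-in for the hidden objective (e.g. pulse energy);
    the real f is only accessible through measurements on the machine."""
    return float(np.exp(-np.sum((np.asarray(x) - 0.3) ** 2)))

def evaluate(x, noise_std=0.1):
    """One noisy evaluation y_t = f(x_t) + eps_t."""
    return f(x) + rng.normal(scale=noise_std)
```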

SLIDE 5

Part 1) A flexible & statistically sound model for f: Gaussian Processes

SLIDE 6

From Linear Least Squares to Gaussian Processes

Given: measurements \( (x_1, y_1), \dots, (x_T, y_T) \). Goal: find a statistical estimator \( \hat{f}(x) \) of \( f \).

SLIDE 7

From Linear Least Squares to Gaussian Processes

Regularized linear least squares:

\[ \hat{\theta} = \arg\min_{\theta \in \mathbb{R}^d} \sum_{t=1}^{T} \left( x_t^\top \theta - y_t \right)^2 + \|\theta\|^2 \]
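The estimator above has a closed form, \( \hat{\theta} = (X^\top X + I)^{-1} X^\top y \). A minimal sketch (the weight lam is exposed as a knob; the slide's formula corresponds to lam = 1):

```python
import numpy as np

def ridge_estimate(X, y, lam=1.0):
    """Closed form of the regularized least-squares problem:
    theta_hat = (X^T X + lam * I)^{-1} X^T y.
    X has shape (T, d) with rows x_t; y has shape (T,)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```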

SLIDE 8

From Linear Least Squares to Gaussian Processes

Least squares regression in a Hilbert space \( \mathcal{H} \):

\[ \hat{f} = \arg\min_{f \in \mathcal{H}} \sum_{t=1}^{T} \left( f(x_t) - y_t \right)^2 + \|f\|_{\mathcal{H}}^2 \]

SLIDE 9

From Linear Least Squares to Gaussian Processes

Least squares regression in a Hilbert space \( \mathcal{H} \):

\[ \hat{f} = \arg\min_{f \in \mathcal{H}} \sum_{t=1}^{T} \left( f(x_t) - y_t \right)^2 + \|f\|_{\mathcal{H}}^2 \]

Closed-form solution if \( \mathcal{H} \) is a Reproducing Kernel Hilbert Space, defined by a kernel \( k : \mathcal{X} \times \mathcal{X} \to \mathbb{R} \).

Example: RBF kernel \( k(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right) \)

The kernel characterizes the smoothness of the functions in \( \mathcal{H} \).
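By the representer theorem, the minimizer has the closed form \( \hat{f}(x) = k(x, X)(K + I)^{-1} y \), where \( K \) is the kernel matrix of the observed points. A minimal sketch with the RBF kernel, assuming unit regularization as in the formula above:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """RBF kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

def krr_fit_predict(X, y, X_new, sigma=1.0, reg=1.0):
    """Kernel ridge regression: f_hat(x) = k(x, X) @ (K + reg * I)^{-1} y."""
    K = rbf_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + reg * np.eye(len(X)), y)
    return rbf_kernel(X_new, X, sigma) @ alpha
```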

SLIDE 10

From Linear Least Squares to Gaussian Processes

\[ \hat{f} = \arg\min_{f \in \mathcal{H}} \sum_{t=1}^{T} \left( f(x_t) - y_t \right)^2 + \|f\|_{\mathcal{H}}^2 \]

SLIDE 13

From Linear Least Squares to Gaussian Processes

Bayesian Interpretation: \( \hat{f} \) is the posterior mean of a Gaussian Process. A Gaussian Process is a distribution over functions, such that

  • any finite collection of function values is jointly multivariate normally distributed,
  • the covariance structure is defined through the kernel.
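Since the GP is the Bayesian view of the kernel ridge estimator, the posterior mean coincides with \( \hat{f} \), and the posterior variance quantifies the uncertainty used for the confidence intervals in Part 2. A minimal sketch, reusing rbf_kernel from above and assuming unit observation noise:

```python
def gp_posterior(X, y, X_new, sigma=1.0, noise=1.0):
    """Posterior mean and variance of a GP with RBF kernel, given
    noisy observations y = f(X) + eps with eps ~ N(0, noise)."""
    K = rbf_kernel(X, X, sigma) + noise * np.eye(len(X))
    K_star = rbf_kernel(X_new, X, sigma)   # cross-covariances
    mean = K_star @ np.linalg.solve(K, y)  # equals the KRR prediction
    var = rbf_kernel(X_new, X_new, sigma).diagonal() \
        - np.einsum("ij,ji->i", K_star, np.linalg.solve(K, K_star.T))
    return mean, var
```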

SLIDE 17

Part 2) Bayesian Optimization Algorithms

SLIDE 18

Bayesian Optimization: Introduction

Idea: Use confidence intervals to efficiently optimize f. Example: plausible maximizers, i.e. points that the current confidence bounds do not yet rule out as the maximizer.

SLIDE 20

Bayesian Optimization: GP-UCB

Idea: Use confidence intervals to efficiently optimize f. Example: GP-UCB (Gaussian Process Upper Confidence Bound), sketched below.
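A minimal sketch of one GP-UCB round over a finite candidate grid, reusing gp_posterior from above. The rule evaluates the point maximizing the upper confidence bound mean + sqrt(beta) * std; keeping beta fixed is a simplifying assumption (in the theory it grows slowly with t):

```python
def gp_ucb_step(X_obs, y_obs, candidates, beta=2.0):
    """One GP-UCB round: pick the candidate maximizing the upper
    confidence bound mu(x) + sqrt(beta) * sigma(x)."""
    mean, var = gp_posterior(np.asarray(X_obs), np.asarray(y_obs), candidates)
    ucb = mean + np.sqrt(beta * np.maximum(var, 0.0))  # clip tiny negative variances
    return candidates[np.argmax(ucb)]

# Hypothetical usage with the toy evaluate() from the beginning:
# x_next = gp_ucb_step(X_obs, y_obs, candidates)
# X_obs = np.vstack([X_obs, x_next])
# y_obs = np.append(y_obs, evaluate(x_next))
```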

SLIDE 29

Bayesian Optimization: GP-UCB

Convergence guarantee: \( f(x_t) \to f(x^*) \) as \( t \to \infty \).

SLIDE 30

Bayesian Optimization: GP-UCB

Convergence guarantee, quantitatively (average regret):

\[ \frac{1}{T} \sum_{t=1}^{T} \left( f(x^*) - f(x_t) \right) \le O\!\left( \frac{1}{\sqrt{T}} \right) \]

SLIDE 31

Extension 1: Safe Bayesian Optimization

Objective: Keep a safety function s(x) below a threshold c:

\[ \max_{x \in \mathcal{X}} f(x) \quad \text{s.t.} \quad s(x) \le c \]

SafeOpt: [Sui et al. (2015); Berkenkamp et al. (2016)]

SLIDE 32

Extension 1: Safe Bayesian Optimization

Safe Tuning of 2 Matching Quadrupoles at SwissFEL.

SLIDE 33

Extension 2: Heteroscedastic Noise

What if the noise variance depends on the evaluation point?

SLIDE 34

Extension 2: Heteroscedastic Noise

What if the noise variance depends on the evaluation point? Standard approaches such as GP-UCB are agnostic to the noise level. Information Directed Sampling performs Bayesian optimization with heteroscedastic noise, including theoretical guarantees [Kirschner and Krause (2018); Russo and Van Roy (2014)]. A sketch of how point-dependent noise enters the model follows below.
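A minimal sketch of the modeling change (not the IDS acquisition rule itself): the constant noise term in the GP posterior is replaced by a diagonal matrix of per-observation noise variances. Reuses rbf_kernel from above; noise_vars is assumed known here:

```python
def gp_posterior_hetero(X, y, X_new, noise_vars, sigma=1.0):
    """GP posterior with heteroscedastic noise: each observation y_t
    has its own noise variance noise_vars[t] on the diagonal."""
    K = rbf_kernel(X, X, sigma) + np.diag(noise_vars)
    K_star = rbf_kernel(X_new, X, sigma)
    mean = K_star @ np.linalg.solve(K, y)
    var = rbf_kernel(X_new, X_new, sigma).diagonal() \
        - np.einsum("ij,ji->i", K_star, np.linalg.solve(K, K_star.T))
    return mean, var
```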

SLIDE 35

Acknowledgments

Experiments at SwissFEL: joint work with Franziska Frei, Nicole Hiller, Rasmus Ischebeck, Andreas Krause, and Mojmir Mutny.
Plots: thanks to Felix Berkenkamp for sharing his Python notebooks.
Pictures: accelerator structure by Franziska Frei.

SLIDE 36

References

  • F. Berkenkamp, A. P. Schoellig, and A. Krause. Safe Controller Optimization for Quadrotors with Gaussian Processes. ICRA, 2016.
  • J. Kirschner and A. Krause. Information Directed Sampling and Bandits with Heteroscedastic Noise. arXiv preprint, 2018.
  • D. Russo and B. Van Roy. Learning to Optimize via Information-Directed Sampling. NIPS, 2014.
  • Y. Sui, A. Gotovos, J. W. Burdick, and A. Krause. Safe Exploration for Optimization with Gaussian Processes. ICML, 2015.
