

SLIDE 1

Active Learning and

Optimized Information Gathering

Lecture 6 – Gaussian Process Optimization

CS 101.2 Andreas Krause

SLIDE 2

Announcements

Homework 1: out tomorrow

Due Thu Jan 29

Project

Proposal due Tue Jan 27

Office hours

Come to office hours before your presentation!
Andreas: Friday 1:30-3pm, 260 Jorgensen
Ryan: Wednesday 4:00-6:00pm, 109 Moore

SLIDE 3

Course outline

1. Online decision making
2. Statistical active learning
3. Combinatorial approaches

SLIDE 4

Recap Bandit problems

K arms with mean payoffs p1, p2, …, pK
εn-greedy and UCB1 have regret O(K log T)
What about infinitely many arms (K = ∞)?
Have to make assumptions!

SLIDE 5

Bandits = Noisy function optimization

We are given black box access to a function f
f(x) = mean payoff for arm x
Evaluating f is very expensive
Want to (quickly) find x* = argmaxx f(x)
Observations: y = f(x) + noise

SLIDE 6

Bandits with ∞-many arms

Can only hope to perform well if we make some assumptions:
Linear: f(x) = wᵀx
Lipschitz-continuous (bounded slope)

SLIDE 7

Regret depends on complexity

Bandit linear optimization over Rn:
“strong” assumptions, regret O(n T^(2/3))

Bandit problems for optimizing Lipschitz functions:
“weak” assumptions, regret O(C(n) T^(n/(n+1))). Curse of dimensionality!

Today: Flexible (Bayesian) approach for encoding assumptions about function complexity

SLIDE 8

What if we believe the function looks like:

Want a flexible way to encode assumptions about functions!
Piece-wise linear? Analytic? (∞-differentiable)

SLIDE 9

Bayesian inference

Two Bernoulli variables: A(larm), B(urglar)
P(B=1) = 0.1; P(A=1 | B=1) = 0.9; P(A=1 | B=0) = 0.1
What is P(B | A)?
P(B): “prior”; P(A | B): “likelihood”; P(B | A): “posterior”
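For concreteness, Bayes’ rule works out the answer the slide asks for:
P(B=1 | A=1) = P(A=1 | B=1) P(B=1) / [P(A=1 | B=1) P(B=1) + P(A=1 | B=0) P(B=0)]
             = (0.9 · 0.1) / (0.9 · 0.1 + 0.1 · 0.9) = 0.5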

SLIDE 10

A Bayesian approach

Bayesian models for functions:
Prior P(f)
Likelihood P(data | f)
Posterior P(f | data)
Uff… Why is this useful?

SLIDE 11

Probability of data

P(y1,…,yk) = ∫ P(f, y1,…,yk) df
Can compute P(y’ | y1,…,yk) = P(y’, y1,…,yk) / P(y1,…,yk)

SLIDE 12

Regression with uncertainty about predictions!

[Figure: regression fit with confidence bands around the observations]

SLIDE 13

How can we do this?

Want to compute P(y’ | y1,…,yk)

P(y1,…,yk) = ∫ P(f, y1,…,yk) df

Horribly complicated integral??
Will see how we can compute this (more or less) efficiently, in closed form!
… if P(f) is a Gaussian Process

SLIDE 14

Gaussian distribution

p(y) = 1/(σ√(2π)) · exp(-(y - µ)²/(2σ²))
µ = mean, σ = standard deviation

SLIDE 15

Bivariate Gaussian distribution

[Figures: density plots and contour plots of two bivariate Gaussian distributions]

SLIDE 16

Multivariate Gaussian distribution

Joint distribution over n random variables P(Y1,…,Yn)
σjk = E[(Yj - µj)(Yk - µk)]
Yj and Yk independent ⇔ σjk = 0 (true for jointly Gaussian variables)

SLIDE 17

Marginalization

Suppose (Y1,…,Yn) ~ N(µ, Σ). What is P(Y1)?
More generally: let A = {i1,…,ik} ⊆ {1,…,n} and write YA = (Yi1,…,Yik)
Then YA ~ N(µA, ΣAA)

SLIDE 18

Conditioning

Suppose (Y1,…,Yn) ~ N(µ, Σ), decomposed as (YA, YB)
What is P(YA | YB)?
P(YA = yA | YB = yB) = N(yA; µA|B, ΣA|B), where
µA|B = µA + ΣAB ΣBB⁻¹ (yB - µB)
ΣA|B = ΣAA - ΣAB ΣBB⁻¹ ΣBA
Computable using linear algebra!
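These two formulas are just a few matrix operations. A minimal NumPy sketch (my own illustration, not code from the lecture; the example numbers are made up):

```python
import numpy as np

def condition_gaussian(mu_A, mu_B, S_AA, S_AB, S_BB, y_B):
    """Mean and covariance of Y_A given Y_B = y_B (the slide's formulas)."""
    solve = np.linalg.solve                  # avoids forming an explicit inverse
    mu_cond = mu_A + S_AB @ solve(S_BB, y_B - mu_B)   # mu_{A|B}
    S_cond = S_AA - S_AB @ solve(S_BB, S_AB.T)        # Sigma_{A|B}
    return mu_cond, S_cond

# Bivariate example in the spirit of the next slide: observe Y1 = 0.75
mu = np.array([1.0, 1.0])
Sigma = np.array([[0.25, 0.15],
                  [0.15, 0.25]])
m, S = condition_gaussian(mu[1:], mu[:1], Sigma[1:, 1:], Sigma[1:, :1],
                          Sigma[:1, :1], np.array([0.75]))
print(m, S)  # posterior mean and variance of Y2
```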

SLIDE 19

Conditioning

[Figure: bivariate Gaussian density and contours; fixing Y1 = 0.75 gives the conditional P(Y2 | Y1 = 0.75)]

SLIDE 20

High dimensional Gaussians

Gaussian → bivariate Gaussian → multivariate Gaussian →
Gaussian Process = “∞-variate Gaussian”

SLIDE 21

Gaussian process

A Gaussian Process (GP) is an (infinite) set of random variables, indexed by some set V
i.e., for each x ∈ V there is a RV Yx
Let A = {x1,…,xk} ⊆ V, |A| < ∞
Then YA ~ N(µA, ΣAA), where µA = (µ(x1),…,µ(xk)) and (ΣAA)jk = K(xj, xk)
K: V × V → R is called the kernel (covariance) function
µ: V → R is called the mean function

SLIDE 22

Visualizing GPs

Typically, we only care about the “marginals”, i.e.,
P(yx) = N(yx; µ(x), K(x,x)) for each x ∈ V

SLIDE 23

Mean functions

Can encode prior knowledge
Typically, one simply assumes

µ(x) = 0

Will do that here to simplify notation

SLIDE 24

Kernel functions

K must be symmetric: K(x,x’) = K(x’,x) for all x, x’
K must be positive definite: for all A, ΣAA is a positive definite matrix
The kernel function K encodes assumptions about correlation!

SLIDE 25

Kernel functions: Examples

Squared exponential kernel: K(x,x’) = exp(-(x-x’)²/h²)

[Figures: samples from P(f) for bandwidths h = 0.1 and h = 0.3, and K(x,x’) as a function of the distance |x-x’|]
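To see the bandwidth effect concretely, here is a short NumPy sketch (my own, not the course demo) that draws samples f ~ P(f) from a zero-mean GP with this kernel:

```python
import numpy as np

def k_se(x, xp, h):
    """Squared exponential kernel K(x,x') = exp(-(x-x')^2 / h^2)."""
    return np.exp(-((x[:, None] - xp[None, :]) ** 2) / h ** 2)

xs = np.linspace(0, 1, 200)                       # finite grid A subset of V
for h in (0.1, 0.3):
    K = k_se(xs, xs, h) + 1e-8 * np.eye(len(xs))  # Sigma_AA (+ jitter for stability)
    # Zero mean function; each row of f is one sampled function evaluated on xs
    f = np.random.multivariate_normal(np.zeros(len(xs)), K, size=3)
    print(f"h={h}: samples of shape {f.shape}")   # smaller h => wigglier samples
```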

SLIDE 26

Kernel functions: Examples

Exponential kernel: K(x,x’) = exp(-|x-x’|/h)

[Figures: samples from P(f) for bandwidths h = 1 and h = 0.3, and K(x,x’) as a function of the distance |x-x’|]

SLIDE 27

Kernel functions: Examples

Linear kernel: K(x,x’) = xᵀx’
Corresponds to linear regression!


SLIDE 28

Kernel functions: Examples

Linear kernel with features: K(x,x’) = Φ(x)ᵀΦ(x’)

[Figures: samples from P(f) for Φ(x) = [0, x, x²] and for Φ(x) = sin(x)]

SLIDE 29

Kernel functions: Examples

White noise: K(x,x) = 1; K(x,x’) = 0 for x’ ≠ x


SLIDE 30

Constructing kernels from kernels

If K1(x,x’) and K2(x,x’) are kernel functions, then
α K1(x,x’) + β K2(x,x’) is a kernel for α, β > 0
K1(x,x’) · K2(x,x’) is a kernel
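A quick numerical illustration of these closure rules (my own sketch; the kernels and weights are arbitrary choices): the combined kernel matrices stay positive semi-definite.

```python
import numpy as np

k1 = lambda x, y: np.exp(-(x - y) ** 2 / 0.3 ** 2)     # squared exponential
k2 = lambda x, y: x * y                                # linear kernel (1-d)
k_sum  = lambda x, y: 2.0 * k1(x, y) + 0.5 * k2(x, y)  # alpha, beta > 0
k_prod = lambda x, y: k1(x, y) * k2(x, y)              # product of kernels

xs = np.linspace(0.1, 1, 50)
for k in (k_sum, k_prod):
    K = k(xs[:, None], xs[None, :])                    # kernel matrix on xs
    print(np.linalg.eigvalsh(K).min() >= -1e-8)        # True: valid (PSD) kernel
```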

SLIDE 31

GP Regression

Suppose we know the kernel function K
Get data (x1,y1),…,(xn,yn)
Want to predict y’ = f(x’) for some new x’

SLIDE 32

Linear prediction

Posterior mean: µx’|D = Σx’,D ΣD,D⁻¹ yD
Hence µx’|D = ∑i=1..n wi yi: the prediction depends linearly on the observations yi!
For a fixed data set D = {(x1,y1),…,(xn,yn)}, can precompute the weights wi
Like linear regression, but the number of parameters wi grows with the training data

“Nonparametric regression”: can fit any data set!! ☺
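A hedged NumPy sketch of this computation (my own; the data, the kernel, and the small jitter term standing in for observation noise are illustrative assumptions):

```python
import numpy as np

def k_se(a, b, h=0.3):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / h ** 2)

X = np.array([0.1, 0.4, 0.6, 0.9])            # observed inputs x_1..x_n
y = np.sin(2 * np.pi * X)                     # observed values y_1..y_n
x_new = np.linspace(0, 1, 5)                  # prediction points x'

K_DD = k_se(X, X) + 1e-6 * np.eye(len(X))     # Sigma_{D,D} (+ jitter as "noise")
W = np.linalg.solve(K_DD, k_se(X, x_new))     # weights w_i, precomputable per D
mu = W.T @ y                                  # posterior mean: linear in the y_i!
var = k_se(x_new, x_new) - k_se(x_new, X) @ W # posterior covariance
print(mu, np.diag(var))                       # predictions with uncertainty
```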

SLIDE 33

Learning parameters

Example: K(x,x’) = exp(-(x-x’)²/h²). Need to specify h!
In general, the kernel function has parameters θ; want to learn θ from data
[Figures: fits with h too small (“overfit”), h too large (“underfit”), and h “just right”]

SLIDE 34

Learning parameters

Pick the parameters θ that make the data most likely!
log P(y | θ) is differentiable if K(x,x’) is!

Can do gradient descent, conjugate gradient, etc.

Tends to work well (not over- or underfit) in practice!
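As an illustration (my own sketch, using the standard zero-mean GP marginal likelihood log P(y | θ) = -½ yᵀK⁻¹y - ½ log|K| - (n/2) log 2π; the data and the SciPy optimizer choice are assumptions, since the slides do not specify an implementation):

```python
import numpy as np
from scipy.optimize import minimize_scalar

X = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
y = np.sin(2 * np.pi * X)

def neg_log_marginal(log_h):
    h = np.exp(log_h)                              # optimize on a log scale
    K = np.exp(-((X[:, None] - X[None, :]) ** 2) / h ** 2)
    K += 1e-6 * np.eye(len(X))                     # jitter / tiny noise variance
    _, logdet = np.linalg.slogdet(K)               # log |K|
    alpha = np.linalg.solve(K, y)                  # K^-1 y
    return 0.5 * y @ alpha + 0.5 * logdet + 0.5 * len(X) * np.log(2 * np.pi)

res = minimize_scalar(neg_log_marginal, bounds=(-4.0, 1.0), method="bounded")
print("learned bandwidth h:", np.exp(res.x))
```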

SLIDE 35

Matlab demo

[Rasmussen & Williams, Gaussian Processes for Machine Learning] http://www.gaussianprocess.org/gpml/

SLIDE 36

Gaussian process

A Gaussian Process (GP) is an (infinite) set of random variables, indexed by some set V
i.e., for each x ∈ V there is a RV Yx
Let A = {x1,…,xk} ⊆ V, |A| < ∞
Then YA ~ N(µA, ΣAA), where µA = (µ(x1),…,µ(xk)) and (ΣAA)jk = K(xj, xk)
K: V × V → R is called the kernel (covariance) function
µ: V → R is called the mean function

SLIDE 37

GPs over other sets

A GP is a collection of random variables, indexed by a set V
So far: have seen GPs over V = R
Can define GPs over:
Text (strings)
Graphs
Sets
…

Only need to choose appropriate kernel function

SLIDE 38

Example: Using GPs to model spatial phenomena
SLIDE 39

Other extensions (won’t cover here)

GPs for classification

Nonparametric generalization of logistic regression
Like SVMs (but give confidence on predicted labels!)

GPs for modeling non-Gaussian phenomena

Model count data over space, …

Active set methods for fast inference
…
Still an active research area in machine learning

SLIDE 40

Bandits = Noisy function optimization

We are given black box access to a function f
Evaluating f is very expensive
Want to (quickly) find x* = argmaxx f(x)
Idea: Assume f is a sample from a Gaussian Process!
Gaussian Process optimization (a.k.a. response surface optimization)
Observations: y = f(x) + noise

SLIDE 41

Upper confidence bound approach

UCB(x | D) = µ(x | D) + 2σ(x | D)
Pick the point x* = argmaxx∈V UCB(x | D)
[Figure: posterior mean with confidence band; the next sample is taken where the upper band is highest]
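A minimal sketch of this rule on a finite grid of arms (my own illustration; the objective f_true, the noise level, and the grid V are made-up assumptions):

```python
import numpy as np

def k_se(a, b, h=0.2):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / h ** 2)

f_true = lambda x: -(x - 0.6) ** 2               # unknown payoff (illustrative)
V = np.linspace(0, 1, 101)                       # finite grid of candidate arms
X, y = [0.5], [f_true(0.5)]                      # one seed observation

for _ in range(10):
    Xa, ya = np.array(X), np.array(y)
    K = k_se(Xa, Xa) + 1e-4 * np.eye(len(Xa))    # Sigma_{D,D} (+ noise term)
    k_star = k_se(Xa, V)
    mu = k_star.T @ np.linalg.solve(K, ya)       # posterior mean mu(x | D)
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    ucb = mu + 2.0 * np.sqrt(np.maximum(var, 0)) # UCB(x | D) = mu + 2*sigma
    x_next = float(V[np.argmax(ucb)])            # pick the argmax of UCB
    X.append(x_next)
    y.append(f_true(x_next) + 0.01 * np.random.randn())

print("best observed x:", X[int(np.argmax(y))])
```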

SLIDE 42

Matlab demo

SLIDE 43

Properties

Implicitly trades off exploration and exploitation
Exploits prior knowledge about the function
Can converge to the optimal solution very quickly! ☺
Seems to work well in many applications
Can perform poorly if our prior assumptions are wrong

SLIDE 44

What you need to know

GPs =
Nonparametric generalization of linear regression
Flexible ways to encode prior assumptions about mean payoffs

Definition of GPs
Properties of multivariate Gaussians (marginalization, conditioning)

Gaussian Process optimization
Combination of regression and optimization
Use confidence bands for selecting samples