  1. Active Learning and Optimized Information Gathering Lecture 6 – Gaussian Process Optimization CS 101.2 Andreas Krause

  2. Announcements. Homework 1: out tomorrow, due Thu Jan 29. Project proposal due Tue Jan 27. Office hours: come to office hours before your presentation! Andreas: Friday 1:30-3pm, 260 Jorgensen. Ryan: Wednesday 4:00-6:00pm, 109 Moore.

  3. Course outline: 1. Online decision making; 2. Statistical active learning; 3. Combinatorial approaches.

  4. Recap: bandit problems with K arms and mean payoffs p_1, p_2, …, p_K. The ε_n-greedy and UCB1 algorithms have regret O(K log T). What about infinitely many arms (K = ∞)? Have to make assumptions!

  5. Bandits = noisy function optimization. We are given black-box access to a function f, where f(x) = mean payoff for arm x: query x, observe y = f(x) + noise. Evaluating f is very expensive. Want to (quickly) find x* = argmax_x f(x).

  6. Bandits with ∞-many arms: can only hope to perform well if we make some assumptions, e.g., that f is linear, f(x) = w^T x, or Lipschitz-continuous (bounded slope).

  7. Regret depends on complexity. Bandit linear optimization over R^n ("strong" assumptions): regret O(T^(2/3) n). Bandit optimization of Lipschitz functions ("weak" assumptions): regret O(C(n) T^(n/(n+1))), the curse of dimensionality! Today: a flexible (Bayesian) approach for encoding assumptions about function complexity.

  8. What if we believe the function looks like: piecewise linear? Analytic (∞-differentiable)? Want a flexible way to encode assumptions about functions!

  9. Bayesian inference. Two Bernoulli variables, A(larm) and B(urglar), with P(B=1) = 0.1, P(A=1 | B=1) = 0.9, P(A=1 | B=0) = 0.1. What is P(B | A)? Terminology: P(B) is the "prior", P(A | B) the "likelihood", P(B | A) the "posterior".
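Worked out explicitly (my arithmetic, not shown on the slide), Bayes' rule gives

P(B=1 | A=1) = P(A=1 | B=1) P(B=1) / [ P(A=1 | B=1) P(B=1) + P(A=1 | B=0) P(B=0) ]
             = (0.9 · 0.1) / (0.9 · 0.1 + 0.1 · 0.9) = 0.5,

so observing the alarm raises the probability of a burglary from the prior 0.1 to the posterior 0.5.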

  10. A Bayesian approach: Bayesian models for functions. Likelihood P(data | f), prior P(f), posterior P(f | data). [Figure: observed data points with candidate functions f.] Uff… Why is this useful?

  11. Probability of data: P(y_1,…,y_k) = ∫ P(f, y_1,…,y_k) df. Can then compute the predictive distribution P(y' | y_1,…,y_k) = P(y', y_1,…,y_k) / P(y_1,…,y_k).

  12. Regression with uncertainty about predictions! [Figure: observed data points and a regression fit with predictive uncertainty.]

  13. How can we do this? Want to compute P(y' | y_1,…,y_k). P(y_1,…,y_k) = ∫ P(f, y_1,…,y_k) df. A horribly complicated integral?? Will see how we can compute this (more or less) efficiently, in closed form! … if P(f) is a Gaussian Process.

  14. Gaussian distribution: µ = mean, σ = standard deviation. [Figure: one-dimensional Gaussian density.]
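For reference, the one-dimensional density the slide plots (the formula itself is not preserved in the transcript) is the standard Gaussian density

P(y) = 1 / (σ √(2π)) · exp( -(y - µ)^2 / (2σ^2) ).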

  15. Bivariate Gaussian distribution. [Figures: surface and contour plots of a bivariate Gaussian density.]

  16. Multivariate Gaussian distribution: joint distribution P(Y_1,…,Y_n) over n random variables, with covariances σ_jk = E[ (Y_j - µ_j)(Y_k - µ_k) ]. Y_j and Y_k independent ⇒ σ_jk = 0.
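The corresponding joint density (a standard formula, not written out in the transcript), with mean vector µ and covariance matrix Σ = (σ_jk), is

P(y) = (2π)^(-n/2) |Σ|^(-1/2) exp( -1/2 (y - µ)^T Σ^(-1) (y - µ) ).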

  17. Marginalization. Suppose (Y_1,…,Y_n) ~ N(µ, Σ). What is P(Y_1)?? More generally: let A = {i_1,…,i_k} ⊆ {1,…,n} and write Y_A = (Y_i1,…,Y_ik). Then Y_A ~ N(µ_A, Σ_AA).
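A minimal numpy sketch of this marginalization rule (the example numbers and variable names are mine, not from the slides): picking out µ_A and Σ_AA is just index selection.

    import numpy as np

    # Joint Gaussian over (Y_1, ..., Y_4): mean vector and covariance matrix
    mu = np.array([0.0, 1.0, -0.5, 2.0])
    Sigma = np.array([[1.0, 0.5, 0.2, 0.0],
                      [0.5, 1.0, 0.3, 0.1],
                      [0.2, 0.3, 1.0, 0.4],
                      [0.0, 0.1, 0.4, 1.0]])

    A = [0, 2]                      # indices i_1, ..., i_k of the subset A
    mu_A = mu[A]                    # marginal mean mu_A
    Sigma_AA = Sigma[np.ix_(A, A)]  # marginal covariance Sigma_AA
    # Y_A ~ N(mu_A, Sigma_AA); in particular P(Y_1) = N(mu[0], Sigma[0, 0])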

  18. Conditioning. Suppose (Y_1,…,Y_n) ~ N(µ, Σ), decomposed as (Y_A, Y_B). What is P(Y_A | Y_B)?? P(Y_A = y_A | Y_B = y_B) = N(y_A; µ_{A|B}, Σ_{A|B}), where µ_{A|B} = µ_A + Σ_AB Σ_BB^(-1) (y_B - µ_B) and Σ_{A|B} = Σ_AA - Σ_AB Σ_BB^(-1) Σ_BA. Computable using linear algebra!
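A small numpy sketch of these conditioning formulas (the function and the 0.8 correlation in the example are my choices, not from the slide):

    import numpy as np

    def condition_gaussian(mu_A, mu_B, S_AA, S_AB, S_BB, y_B):
        # Mean and covariance of P(Y_A | Y_B = y_B) for a joint Gaussian
        G = S_AB @ np.linalg.inv(S_BB)       # gain matrix Sigma_AB Sigma_BB^-1
        mu_cond = mu_A + G @ (y_B - mu_B)    # mu_{A|B}
        S_cond = S_AA - G @ S_AB.T           # Sigma_{A|B}
        return mu_cond, S_cond

    # Bivariate example in the spirit of the next slide: condition Y_2 on Y_1 = 0.75
    mu = np.array([0.0, 0.0])
    Sigma = np.array([[1.0, 0.8],
                      [0.8, 1.0]])
    m, S = condition_gaussian(mu[[1]], mu[[0]],
                              Sigma[np.ix_([1], [1])], Sigma[np.ix_([1], [0])],
                              Sigma[np.ix_([0], [0])], np.array([0.75]))
    # m = [0.6], S = [[0.36]]: the observation shifts the mean and shrinks the variance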

  19. Conditioning. [Figure: bivariate Gaussian with the conditional slice P(Y_2 | Y_1 = 0.75) highlighted at Y_1 = 0.75.]

  20. High-dimensional Gaussians: Gaussian → bivariate Gaussian → multivariate Gaussian → Gaussian Process = "∞-variate Gaussian". [Figures: densities of increasing dimension.]

  21. Gaussian process. A Gaussian Process (GP) is an (infinite) set of random variables indexed by some set V, i.e., for each x ∈ V there is a RV Y_x. Let A = {x_1,…,x_k} ⊆ V with |A| < ∞. Then Y_A ~ N(µ_A, Σ_AA), where (µ_A)_i = µ(x_i) and (Σ_AA)_ij = K(x_i, x_j). K: V × V → R is called the kernel (covariance) function; µ: V → R is called the mean function.

  22. Visualizing GPs. [Figure: GP marginals plotted over x ∈ V.] Typically, we only care about the "marginals", i.e., P(y) = N(y; µ(x), K(x,x)).

  23. Mean functions can encode prior knowledge. Typically, one simply assumes µ(x) = 0; will do that here to simplify notation.

  24. Kernel functions. K must be symmetric: K(x,x') = K(x',x) for all x, x'. K must be positive definite: for all A, Σ_AA is a positive definite matrix. The kernel function K encodes assumptions about correlation!

  25. Kernel functions: examples. Squared exponential kernel: K(x,x') = exp(-(x-x')^2 / h^2). [Figures: kernel value vs. distance |x-x'|, and samples from P(f) for bandwidth h = .1 and h = .3.]
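A short sketch of how such "samples from P(f)" panels can be generated (my code; the grid, the zero mean, and the jitter term are my choices): evaluate the kernel on a grid of inputs and draw from the resulting multivariate Gaussian.

    import numpy as np

    def sq_exp_kernel(x, xp, h=0.3):
        # Squared exponential kernel K(x, x') = exp(-(x - x')^2 / h^2)
        return np.exp(-(x - xp) ** 2 / h ** 2)

    xs = np.linspace(0, 1, 200)                      # grid of inputs in V = [0, 1]
    K = sq_exp_kernel(xs[:, None], xs[None, :])      # covariance matrix Sigma_AA
    K += 1e-8 * np.eye(len(xs))                      # small jitter for numerical stability
    f_samples = np.random.multivariate_normal(np.zeros(len(xs)), K, size=3)
    # Each row of f_samples is one function drawn from the GP prior P(f);
    # a smaller bandwidth h gives wigglier samples, a larger h smoother ones.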

  26. Kernel functions: examples. Exponential kernel: K(x,x') = exp(-|x-x'| / h). [Figures: kernel value vs. distance |x-x'|, and samples from P(f) for bandwidth h = .3 and h = 1.]

  27. Kernel functions: examples. Linear kernel: K(x,x') = x^T x'. [Figure: samples from P(f).] Corresponds to linear regression!

  28. Kernel functions: examples. Linear kernel with features: K(x,x') = Φ(x)^T Φ(x'). E.g., Φ(x) = [0, x, x^2]; e.g., Φ(x) = sin(x). [Figures: samples from P(f) for each choice of Φ.]

  29. Kernel functions: examples. White noise: K(x,x) = 1; K(x,x') = 0 for x' ≠ x. [Figure: a sample from P(f).]

  30. Constructing kernels from kernels. If K_1(x,x') and K_2(x,x') are kernel functions, then α K_1(x,x') + β K_2(x,x') is a kernel for α, β > 0, and K_1(x,x') · K_2(x,x') is a kernel.
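A tiny illustration of these closure rules (a sketch; the particular base kernels and weights are my choices, not from the slide):

    import numpy as np

    def k_se(x, xp, h=0.3):          # squared exponential kernel
        return np.exp(-(x - xp) ** 2 / h ** 2)

    def k_lin(x, xp):                # linear kernel K(x, x') = x * x'
        return x * xp

    def k_new(x, xp, alpha=1.0, beta=0.5):
        # alpha*K1 + beta*K2 (alpha, beta > 0) and K1*K2 are again valid kernels
        return alpha * k_se(x, xp) + beta * k_se(x, xp) * k_lin(x, xp)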

  31. GP Regression. Suppose we know the kernel function K. Get data (x_1,y_1),…,(x_n,y_n). Want to predict y' = f(x') for some new x'.

  32. Linear prediction. Posterior mean µ_{x'|D} = Σ_{x',D} Σ_{D,D}^(-1) y_D. Hence µ_{x'|D} = ∑_{i=1}^n w_i y_i: the prediction µ_{x'|D} depends linearly on the observations y_i! For a fixed data set D = {(x_1,y_1),…,(x_n,y_n)}, can precompute the weights w_i. Like linear regression, but the number of parameters w_i grows with the training data ⇒ "nonparametric regression" ⇒ can fit any data set!! ☺
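A compact sketch of GP regression prediction based on these formulas (my code; it adds an explicit observation-noise term noise_var that the slide leaves implicit):

    import numpy as np

    def gp_predict(X, y, X_star, kernel, noise_var=1e-2):
        # Posterior mean and variance of f at the test inputs X_star
        K = kernel(X[:, None], X[None, :]) + noise_var * np.eye(len(X))
        K_star = kernel(X_star[:, None], X[None, :])           # Sigma_{x', D}
        w = np.linalg.solve(K, y)                               # Sigma_{D,D}^-1 y_D
        mean = K_star @ w                                       # posterior mean mu_{x'|D}
        cov = (kernel(X_star[:, None], X_star[None, :])
               - K_star @ np.linalg.solve(K, K_star.T))         # posterior covariance
        return mean, np.diag(cov)

    kernel = lambda a, b: np.exp(-(a - b) ** 2 / 0.3 ** 2)      # squared exponential, h = 0.3
    X = np.array([0.1, 0.4, 0.7]); y = np.array([1.0, 0.2, -0.5])
    mu_star, var_star = gp_predict(X, y, np.linspace(0, 1, 50), kernel)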

  33. Learning parameters. Example: K(x,x') = exp(-(x-x')^2 / h^2); need to specify h! [Figure: three fits of the same data: h too small ("overfit"), h too large ("underfit"), h "just right".] In general, the kernel function has parameters θ; want to learn θ from data.

  34. Learning parameters. Pick the parameters that make the data most likely! log P(y | θ) is differentiable if K(x,x') is ⇒ can do gradient descent, conjugate gradient, etc. Tends to work well (neither over- nor underfits) in practice!
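For reference (a standard result, not written out in the transcript): with a zero mean function, kernel matrix K_θ, and noise variance σ_n^2, the quantity being maximized is the log marginal likelihood

log P(y | θ) = -1/2 y^T (K_θ + σ_n^2 I)^(-1) y - 1/2 log |K_θ + σ_n^2 I| - (n/2) log(2π),

whose gradient with respect to θ is available in closed form, which is what makes gradient descent or conjugate gradient applicable.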

  35. Matlab demo [Rasmussen & Williams, Gaussian Processes for Machine Learning] http://www.gaussianprocess.org/gpml/

  36. Gaussian process (recap). A Gaussian Process (GP) is an (infinite) set of random variables indexed by some set V, i.e., for each x ∈ V there is a RV Y_x. Let A = {x_1,…,x_k} ⊆ V with |A| < ∞. Then Y_A ~ N(µ_A, Σ_AA), where (µ_A)_i = µ(x_i) and (Σ_AA)_ij = K(x_i, x_j). K: V × V → R is called the kernel (covariance) function; µ: V → R is called the mean function.

  37. GPs over other sets. A GP is a collection of random variables indexed by a set V. So far we have seen GPs over V = R. Can define GPs over text (strings), graphs, sets, … Only need to choose an appropriate kernel function.

  38. Example: Using GPs to model spatial phenomena.

  39. Other extensions (won't cover here): GPs for classification, a nonparametric generalization of logistic regression, like SVMs (but giving confidence on predicted labels!); GPs for modeling non-Gaussian phenomena, e.g., count data over space; active set methods for fast inference; … Still an active research area in machine learning.

  40. Bandits = noisy function optimization. We are given black-box access to a function f: query x, observe y = f(x) + noise. Evaluating f is very expensive. Want to (quickly) find x* = argmax_x f(x). Idea: assume f is a sample from a Gaussian Process! ⇒ Gaussian Process optimization (a.k.a. response surface optimization).

  41. Upper confidence bound approach. UCB(x | D) = µ(x | D) + 2 σ(x | D). Pick the point x* = argmax_{x ∈ V} UCB(x | D). [Figure: GP posterior over x ∈ V with observed points and the upper confidence bound.]
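A minimal, self-contained sketch of one round of this rule (my code; the squared exponential kernel, candidate grid, noise term, and the constant 2 are assumptions, not from the slide):

    import numpy as np

    def kernel(a, b, h=0.3):
        return np.exp(-(a - b) ** 2 / h ** 2)     # squared exponential kernel

    def gp_ucb_choose(X_obs, y_obs, candidates, noise_var=1e-2, beta=2.0):
        # Pick x* = argmax_x  mu(x | D) + beta * sigma(x | D) over a candidate grid
        X_obs, y_obs = np.asarray(X_obs, float), np.asarray(y_obs, float)
        K = kernel(X_obs[:, None], X_obs[None, :]) + noise_var * np.eye(len(X_obs))
        K_star = kernel(candidates[:, None], X_obs[None, :])
        mean = K_star @ np.linalg.solve(K, y_obs)                            # mu(x | D)
        var = 1.0 - np.sum(K_star * np.linalg.solve(K, K_star.T).T, axis=1)  # sigma^2(x | D)
        ucb = mean + beta * np.sqrt(np.maximum(var, 0.0))
        return candidates[np.argmax(ucb)]

    # One round: f has been evaluated at three points so far; choose where to query next.
    x_next = gp_ucb_choose([0.1, 0.5, 0.9], [0.3, 1.2, 0.7], np.linspace(0, 1, 100))

Each round, the chosen point is evaluated, the new (x, y) pair is added to D, and the posterior is updated; the 2σ term is what drives exploration of uncertain regions.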

  42. Matlab demo

  43. Properties. Implicitly trades off exploration and exploitation. Exploits prior knowledge about the function. Can converge to the optimal solution very quickly! ☺ Seems to work well in many applications. Can perform poorly if our prior assumptions are wrong.
