

1. Gaussian Processes. Dan Cervone, NYU CDS. November 10, 2015.

2. What are Gaussian processes? GPs let us do Bayesian inference on functions. Using GPs we can:
- Interpolate spatial data
- Forecast time series
- Represent latent surfaces for classification, point processes, etc.
- Emulate likelihoods and complex, black-box functions
- Model cool stuff across many scientific disciplines!
[https://pythonhosted.org/infpy/gps.html] [http://becs.aalto.fi/en/research/bayes/mcmcstuff/traindata.jpg]

3. Preliminaries. The basic setup:
- Data set {(x_i, y_i), i = 1, ..., n}
- Inputs x_i ∈ S ⊂ R^D, with x_i ∼ p(x)
- Outputs y_i ∈ R, with y_i = f(x_i) + ε_i, where ε_i iid ∼ N(0, σ_ε²)

4. Preliminaries (continued). Definition: f is a Gaussian process if, for any collection X = {x_i ∈ S, i = 1, ..., n},
(f(x_1), ..., f(x_n))' ∼ N(μ(X), K(X, X)).

5. Mean, covariance functions. GPs are characterized by their mean and covariance functions:
- Mean function μ(x). WLOG, we can assume μ = 0. (Why?)
- Covariance function k, where [K(X, X)]_ij = k(x_i, x_j) = Cov(f(x_i), f(x_j)).

6. Mean, covariance functions (continued). Example:
k(x_i, x_j) = τ² exp(−||x_i − x_j||² / (2ℓ²)) (squared exponential)
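To make the definition concrete, here is a minimal NumPy sketch (mine, not from the talk) that evaluates the squared exponential covariance above and draws one realization of the finite-dimensional marginal f(X) ∼ N(0, K(X, X)). The helper name sq_exp_kernel and the small jitter term are my own choices.

```python
import numpy as np

def sq_exp_kernel(X1, X2, tau2=1.0, ell2=1.0):
    """Squared exponential covariance: k(x, x') = tau^2 exp(-||x - x'||^2 / (2 ell^2))."""
    d2 = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return tau2 * np.exp(-np.maximum(d2, 0.0) / (2.0 * ell2))

# Finite-dimensional marginal of the GP prior: f(X) ~ N(0, K(X, X)).
X = np.linspace(0.0, 10.0, 100)[:, None]
K = sq_exp_kernel(X, X)
f = np.random.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)))
```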

7. GP regression (prediction). Interpolation/prediction at target locations:
- (Noise-free observations) Observe {(x_i, f(x_i)), i = 1, ..., n}.
- (Noisy observations) Observe {(x_i, y_i), i = 1, ..., n}.
- Want to predict f* = {f(x*_1), ..., f(x*_k)} at X*.

8. GP regression (prediction, continued). Prediction with noise-free data:
[f; f*] | X, X* ∼ N( [0; 0], [K(X, X), K(X, X*); K(X*, X), K(X*, X*)] )
f* | f, X, X* ∼ N( K(X*, X) [K(X, X)]^{-1} f, K(X*, X*) − K(X*, X) [K(X, X)]^{-1} K(X, X*) )

9. GP regression (prediction, continued). Prediction with noisy data:
[y; f*] | X, X* ∼ N( [0; 0], [K(X, X) + σ_ε² I_n, K(X, X*); K(X*, X), K(X*, X*)] )
f* | y, X, X* ∼ N( K(X*, X) [K(X, X) + σ_ε² I_n]^{-1} y, K(X*, X*) − K(X*, X) [K(X, X) + σ_ε² I_n]^{-1} K(X, X*) )
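The noisy-data formulas translate directly into code. Below is a sketch (my own helper, not the speaker's) that computes the posterior mean and covariance of f* | y using a Cholesky factorization rather than an explicit matrix inverse, the standard numerically stable route.

```python
import numpy as np

def gp_predict(X, y, Xstar, kernel, noise_var):
    """Posterior mean and covariance of f* | y, X, X* (noisy-data formulas above)."""
    n = len(X)
    K = kernel(X, X) + noise_var * np.eye(n)   # K(X, X) + sigma_eps^2 I_n (the "nugget")
    Ks = kernel(X, Xstar)                      # K(X, X*)
    Kss = kernel(Xstar, Xstar)                 # K(X*, X*)
    L = np.linalg.cholesky(K)                  # K = L L'
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha                        # K(X*, X) [K + s2 I]^{-1} y
    V = np.linalg.solve(L, Ks)
    cov = Kss - V.T @ V                        # K(X*, X*) - K(X*, X) [K + s2 I]^{-1} K(X, X*)
    return mean, cov
```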

10. GP regression (prediction). Some cool things we've noticed:
- f, f*, y, y* are all jointly Gaussian.
- GP regression gives us interval (distributional) predictions for free.
- Prediction using noise-free vs. noisy data: which situation is more likely in practice?
- The "nugget" σ_ε² I_n: arises due to measurement error or high-frequency behavior, and provides numerical stability and regularization.

11. Illustrating GP regression. TRUTH: τ² = 1, ℓ² = 1, σ_ε² = 0.01. [Plot: f(x) vs. x over (0, 10).]

12. Illustrating GP regression. Sample {(x_i, y_i), i = 1, ..., 20}. [Plot: f(x) vs. x with the 20 sampled points.]

13. Illustrating GP regression. Posterior mean of f* | y. [Plot: f(x) vs. x.]

14. Illustrating GP regression. 95% prediction interval for f* | y. [Plot: f(x) vs. x.]

15. Illustrating GP regression. Fitting GP with ℓ² = 10. [Plot: f(x) vs. x.]

16. Illustrating GP regression. Fitting GP with ℓ² = 0.1. [Plot: f(x) vs. x.]

17. Illustrating GP regression. Fitting GP with σ_ε² = 1. [Plot: f(x) vs. x.]

18. Illustrating GP regression. Fitting GP with σ_ε² = 0.0001. [Plot: f(x) vs. x.]
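Slides 11-18 can be approximated with the two sketches above. The following usage example is my reconstruction (seed, grid, and sampling scheme are assumptions, not from the talk): simulate the truth with τ² = 1, ℓ² = 1, σ_ε² = 0.01, draw 20 noisy points, then refit under deliberately mis-specified hyperparameters as in the ℓ² = 10, ℓ² = 0.1, σ_ε² = 1, and σ_ε² = 0.0001 slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Truth: tau^2 = 1, ell^2 = 1, sigma_eps^2 = 0.01, on a grid over (0, 10).
grid = np.linspace(0.0, 10.0, 200)[:, None]
Ktrue = sq_exp_kernel(grid, grid, tau2=1.0, ell2=1.0)
f_true = rng.multivariate_normal(np.zeros(200), Ktrue + 1e-8 * np.eye(200))

# Sample {(x_i, y_i), i = 1, ..., 20}, with noise sd sqrt(0.01) = 0.1.
idx = rng.choice(200, size=20, replace=False)
Xobs, yobs = grid[idx], f_true[idx] + rng.normal(0.0, 0.1, size=20)

# Refit under different hyperparameters, as in the illustration slides.
for ell2, s2 in [(1.0, 0.01), (10.0, 0.01), (0.1, 0.01), (1.0, 1.0), (1.0, 1e-4)]:
    kern = lambda A, B, e=ell2: sq_exp_kernel(A, B, tau2=1.0, ell2=e)
    mean, cov = gp_predict(Xobs, yobs, grid, kern, noise_var=s2)
    band = 1.96 * np.sqrt(np.clip(np.diag(cov), 0.0, None))  # 95% pointwise interval
```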

19. GPs and Bayesian linear regression. Assume f(x_i) is linear in a p-dimensional feature vector of x_i:
f(x_i) = φ(x_i)'w = φ_i'w

20. GPs and Bayesian linear regression (continued). Usual Bayesian regression setup for φ:
y_i | X ind ∼ N(φ_i'w, σ_ε²) (likelihood)
w ∼ N(0, Σ) (prior)
w | y, X ∼ N(ŵ, A^{-1}) (posterior)
f* | y, X, x* ∼ N((φ*)'ŵ, (φ*)' A^{-1} φ*) (posterior predictive)
where ŵ = A^{-1} Φ y / σ_ε², A = Φ Φ' / σ_ε² + Σ^{-1}, and Φ is the p × n matrix stacking φ_i, i = 1, ..., n, columnwise.
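A sketch of this weight-space computation (helper names are mine, not from the slides): the posterior precision A and mean ŵ follow the formulas above, with Φ stored as a p × n matrix.

```python
import numpy as np

def bayes_linreg_posterior(Phi, y, Sigma, noise_var):
    """Posterior of w under the Gaussian likelihood and N(0, Sigma) prior above.

    Phi: p x n matrix stacking the feature vectors phi_i columnwise.
    """
    A = Phi @ Phi.T / noise_var + np.linalg.inv(Sigma)  # A = Phi Phi'/s2 + Sigma^{-1}
    A_inv = np.linalg.inv(A)
    w_hat = A_inv @ Phi @ y / noise_var                 # w_hat = A^{-1} Phi y / s2
    return w_hat, A_inv

def bayes_linreg_predict(phi_star, w_hat, A_inv):
    """Posterior predictive mean and variance of f* = phi(x*)' w."""
    return phi_star @ w_hat, phi_star @ A_inv @ phi_star
```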

21. GPs and Bayesian linear regression (continued). After some matrix algebra (Woodbury identity!), we can write this as:
f* | y, X, x* ∼ N( (φ*)' Σ Φ [Φ' Σ Φ + σ_ε² I]^{-1} y, (φ*)' Σ φ* − (φ*)' Σ Φ [Φ' Σ Φ + σ_ε² I]^{-1} Φ' Σ φ* )
Taking k(x_i, x_j) = φ(x_i)' Σ φ(x_j), we get the familiar GP prediction expression. Thus {Bayesian regression} ⊂ {Gaussian processes}. Is {Gaussian processes} ⊂ {Bayesian regression}?

22. GPs and Bayesian linear regression (continued). The "kernel trick": feature vectors φ only enter as inner products Φ' Σ Φ, (φ*)' Σ Φ, or (φ*)' Σ φ*. The kernel (covariance function) k(·, ·) spares us from ever calculating φ(x). Where have we seen this before?
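The equivalence is easy to verify numerically. The following check (my own, reusing the hypothetical bayes_linreg_posterior above) confirms that the weight-space posterior mean matches the kernel form with k(x_i, x_j) = φ(x_i)' Σ φ(x_j).

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 8
Phi = rng.normal(size=(p, n))      # feature vectors phi_i, stacked columnwise
phi_star = rng.normal(size=p)      # features at the prediction point x*
Sigma = np.eye(p)
y = rng.normal(size=n)
s2 = 0.1

# Weight-space prediction: (phi*)' w_hat.
w_hat, _ = bayes_linreg_posterior(Phi, y, Sigma, s2)
mu_weight = phi_star @ w_hat

# Function-space prediction with k(x, x') = phi(x)' Sigma phi(x').
Kxx = Phi.T @ Sigma @ Phi
kstar = Phi.T @ Sigma @ phi_star
mu_kernel = kstar @ np.linalg.solve(Kxx + s2 * np.eye(n), y)

assert np.allclose(mu_weight, mu_kernel)  # the Woodbury identity in action
```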

23. Covariance functions. Common choices:
k(x_i, x_j) = τ² exp(−||x_i − x_j|| / (2ℓ)) (exponential)
k(x_i, x_j) = τ² exp(−||x_i − x_j||² / (2ℓ²)) (squared exponential)
k(x_i, x_j) = τ² (1 − 3||x_i − x_j|| / (2θ) + ||x_i − x_j||³ / (2θ³)) 1[||x_i − x_j|| ≤ θ] (spherical)
k(x_i, x_j) = (τ² / Γ(ν)) (||x_i − x_j|| / (2φ))^ν B_ν(φ ||x_i − x_j||) (Matérn)
k(x_i, x_j) = σ² + τ² (x_i − c)'(x_j − c) (linear)
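A couple of these are straightforward to code. A minimal sketch (function names mine) of the exponential and spherical kernels as functions of the distance d = ||x_i − x_j||, following the parameterizations printed above:

```python
import numpy as np

def exponential_kernel(d, tau2=1.0, ell=1.0):
    """k(d) = tau^2 exp(-d / (2 ell)), as parameterized on the slide."""
    return tau2 * np.exp(-d / (2.0 * ell))

def spherical_kernel(d, tau2=1.0, theta=1.0):
    """k(d) = tau^2 (1 - 3d/(2 theta) + d^3/(2 theta^3)) for d <= theta, else 0."""
    k = tau2 * (1.0 - 1.5 * d / theta + 0.5 * (d / theta) ** 3)
    return np.where(d <= theta, k, 0.0)
```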

24. Covariance functions: properties. Isotropy (stationarity): covariance depends only on distance, k(x_i, x_j) = c(||x_i − x_j||). Common in many GP applications.
