 
CSci 8980: Advanced Topics in Graphical Models
Gaussian Processes
Instructor: Arindam Banerjee
November 15, 2007
Gaussian Processes: Outline
• Parametric Bayesian Regression
• Parameters to Functions
• GP Regression
• GP Classification
We will use
• Primary: Carl Rasmussen’s GP tutorial slides (NIPS’06)
• Secondary: Hanna Wallach’s slides on regression
The Prediction Problem
[Figure: CO2 concentration (ppm, 320 to 420) plotted against year (1960 to 2020).]
Source: Rasmussen (MPI for Biological Cybernetics), Advances in Gaussian Processes, December 4th, 2006.
Maximum likelihood, parametric model
Supervised parametric learning:
• data: $\mathbf{x}, \mathbf{y}$
• model: $y = f_{\mathbf{w}}(x) + \varepsilon$
Gaussian likelihood:
$$p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, M_i) \propto \prod_c \exp\left(-\tfrac{1}{2}\,(y_c - f_{\mathbf{w}}(x_c))^2 / \sigma^2_{\text{noise}}\right).$$
Maximize the likelihood:
$$\mathbf{w}_{\text{ML}} = \operatorname*{argmax}_{\mathbf{w}} \; p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, M_i).$$
Make predictions by plugging in the ML estimate:
$$p(y_* \mid x_*, \mathbf{w}_{\text{ML}}, M_i)$$
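As a minimal sketch of the maximum likelihood recipe, assume the linear model $f_{\mathbf{w}}(x) = w_0 + w_1 x$. The model, the synthetic data, and the function names below are illustrative assumptions, not part of the slides. Under Gaussian noise, maximizing the likelihood is equivalent to minimizing the squared error, so $\mathbf{w}_{\text{ML}}$ is a least-squares solution.

```python
import numpy as np

def fit_ml_linear(x, y):
    """Return w_ML for the assumed model f_w(x) = w[0] + w[1] * x."""
    # Design matrix with a column of ones for the intercept.
    Phi = np.column_stack([np.ones_like(x), x])
    # With Gaussian noise, the likelihood maximizer is the least-squares solution.
    w_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w_ml

def predict_ml(w_ml, x_star):
    """Plug-in prediction: the mean of p(y* | x*, w_ML)."""
    return w_ml[0] + w_ml[1] * x_star

# Hypothetical synthetic data: y = 1 + 2x plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(x.shape)
w_ml = fit_ml_linear(x, y)
y_star = predict_ml(w_ml, 0.5)
```

The plug-in prediction ignores any uncertainty about the parameters, which is the gap the Bayesian treatment addresses.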
Bayesian Inference, parametric model
Supervised parametric learning:
• data: $\mathbf{x}, \mathbf{y}$
• model: $y = f_{\mathbf{w}}(x) + \varepsilon$
Gaussian likelihood:
$$p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, M_i) \propto \prod_c \exp\left(-\tfrac{1}{2}\,(y_c - f_{\mathbf{w}}(x_c))^2 / \sigma^2_{\text{noise}}\right).$$
Parameter prior: $p(\mathbf{w} \mid M_i)$
Posterior parameter distribution by Bayes rule $p(a \mid b) = p(b \mid a)\,p(a)/p(b)$:
$$p(\mathbf{w} \mid \mathbf{x}, \mathbf{y}, M_i) = \frac{p(\mathbf{w} \mid M_i)\, p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, M_i)}{p(\mathbf{y} \mid \mathbf{x}, M_i)}$$
Bayesian Inference, parametric model, cont.
Making predictions:
$$p(y_* \mid x_*, \mathbf{x}, \mathbf{y}, M_i) = \int p(y_* \mid \mathbf{w}, x_*, M_i)\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{y}, M_i)\, d\mathbf{w}$$
Marginal likelihood:
$$p(\mathbf{y} \mid \mathbf{x}, M_i) = \int p(\mathbf{w} \mid M_i)\, p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, M_i)\, d\mathbf{w}.$$
Model probability:
$$p(M_i \mid \mathbf{x}, \mathbf{y}) = \frac{p(M_i)\, p(\mathbf{y} \mid \mathbf{x}, M_i)}{p(\mathbf{y} \mid \mathbf{x})}$$
Problem: integrals are intractable for most interesting models!
Bayesian Linear Regression (2)
Likelihood of parameters is:
$$P(\mathbf{y} \mid X, \mathbf{w}) = N(X^\top \mathbf{w}, \sigma^2 I).$$
Assume a Gaussian prior over parameters:
$$P(\mathbf{w}) = N(\mathbf{0}, \Sigma_p).$$
Apply Bayes’ theorem to obtain posterior:
$$P(\mathbf{w} \mid \mathbf{y}, X) \propto P(\mathbf{y} \mid X, \mathbf{w})\, P(\mathbf{w}).$$
Source: Hanna M. Wallach (hmw26@cam.ac.uk), Introduction to Gaussian Process Regression.
Bayesian Linear Regression (3)
Posterior distribution over $\mathbf{w}$ is:
$$P(\mathbf{w} \mid \mathbf{y}, X) = N\left(\tfrac{1}{\sigma^2} A^{-1} X \mathbf{y},\; A^{-1}\right), \quad \text{where } A = \Sigma_p^{-1} + \tfrac{1}{\sigma^2} X X^\top.$$
Predictive distribution is:
$$P(f_\star \mid \mathbf{x}_\star, X, \mathbf{y}) = \int f(\mathbf{x}_\star \mid \mathbf{w})\, P(\mathbf{w} \mid X, \mathbf{y})\, d\mathbf{w} = N\left(\tfrac{1}{\sigma^2}\, \mathbf{x}_\star^\top A^{-1} X \mathbf{y},\; \mathbf{x}_\star^\top A^{-1} \mathbf{x}_\star\right).$$
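The posterior and predictive formulas for Bayesian linear regression can be sketched in a few lines of NumPy. This assumes $X$ is a $D \times n$ matrix with one input per column, which is the convention implied by the likelihood $N(X^\top \mathbf{w}, \sigma^2 I)$; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def blr_posterior(X, y, sigma2, Sigma_p):
    """Posterior N(w_mean, A_inv) over weights; X is D x n (one column per input)."""
    # A = Sigma_p^{-1} + (1 / sigma^2) X X^T
    A = np.linalg.inv(Sigma_p) + X @ X.T / sigma2
    A_inv = np.linalg.inv(A)
    # Posterior mean: (1 / sigma^2) A^{-1} X y
    w_mean = A_inv @ X @ y / sigma2
    return w_mean, A_inv

def blr_predict(x_star, w_mean, A_inv):
    """Mean and variance of f* = x_star^T w under the weight posterior."""
    return x_star @ w_mean, x_star @ A_inv @ x_star

# Hypothetical data: D = 2 features, n = 20 noisy observations.
rng = np.random.default_rng(0)
w_true = np.array([0.5, -1.0])
X = rng.standard_normal((2, 20))
y = X.T @ w_true + 0.1 * rng.standard_normal(20)
w_mean, A_inv = blr_posterior(X, y, sigma2=0.01, Sigma_p=np.eye(2))
f_mean, f_var = blr_predict(np.array([1.0, 1.0]), w_mean, A_inv)
```

The explicit inverses mirror the formulas for readability; a linear solve would be the more numerically careful choice for larger $D$.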
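Non-parametric Gaussian process models
In our non-parametric model, the “parameters” are the function itself!
Gaussian likelihood:
$$\mathbf{y} \mid \mathbf{x}, f(\mathbf{x}), M_i \sim N(\mathbf{f}, \sigma^2_{\text{noise}} I)$$
(Zero mean) Gaussian process prior:
$$f(x) \mid M_i \sim GP\left(m(x) \equiv 0,\; k(x, x')\right)$$
Leads to a Gaussian process posterior:
$$f(x) \mid \mathbf{x}, \mathbf{y}, M_i \sim GP\left(m_{\text{post}}(x),\; k_{\text{post}}(x, x')\right),$$
$$m_{\text{post}}(x) = k(x, \mathbf{x})\,[K(\mathbf{x}, \mathbf{x}) + \sigma^2_{\text{noise}} I]^{-1}\, \mathbf{y},$$
$$k_{\text{post}}(x, x') = k(x, x') - k(x, \mathbf{x})\,[K(\mathbf{x}, \mathbf{x}) + \sigma^2_{\text{noise}} I]^{-1}\, k(\mathbf{x}, x').$$
And a Gaussian predictive distribution:
$$y_* \mid x_*, \mathbf{x}, \mathbf{y}, M_i \sim N\left(k(x_*, \mathbf{x})^\top [K + \sigma^2_{\text{noise}} I]^{-1} \mathbf{y},\; k(x_*, x_*) + \sigma^2_{\text{noise}} - k(x_*, \mathbf{x})^\top [K + \sigma^2_{\text{noise}} I]^{-1} k(x_*, \mathbf{x})\right)$$
Here $\mathbf{x}$ and $\mathbf{y}$ denote the training inputs and targets, $K = K(\mathbf{x}, \mathbf{x})$ is the covariance matrix of the training inputs, and $k(x, \mathbf{x})$ is the vector of covariances between $x$ and the training inputs.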
The Gaussian Distribution
The Gaussian distribution is given by
$$p(x \mid \mu, \Sigma) = N(\mu, \Sigma) = (2\pi)^{-D/2}\, |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}\,(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$
where $\mu$ is the mean vector and $\Sigma$ the covariance matrix.
Conditionals and Marginals of a Gaussian
[Figure: two panels, each showing a joint Gaussian, one with a conditional and one with a marginal.]
Both the conditionals and the marginals of a joint Gaussian are again Gaussian.
What is a Gaussian Process?
A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.
Informally: infinitely long vector ≃ function
Definition: a Gaussian process is a collection of random variables, any finite number of which have (consistent) Gaussian distributions.
A Gaussian distribution is fully specified by a mean vector, $\mu$, and covariance matrix $\Sigma$:
$$\mathbf{f} = (f_1, \ldots, f_n)^\top \sim N(\mu, \Sigma), \quad \text{indexes } i = 1, \ldots, n$$
A Gaussian process is fully specified by a mean function $m(x)$ and covariance function $k(x, x')$:
$$f(x) \sim GP\left(m(x),\; k(x, x')\right), \quad \text{indexes: } x$$
The marginalization property
Thinking of a GP as a Gaussian distribution with an infinitely long mean vector and an infinite by infinite covariance matrix may seem impractical...
...luckily we are saved by the marginalization property:
Recall:
$$p(x) = \int p(x, y)\, dy.$$
For Gaussians:
$$p(x, y) = N\left(\begin{bmatrix} a \\ b \end{bmatrix},\; \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}\right) \;\Longrightarrow\; p(x) = N(a, A)$$
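The companion result for conditionals is what later turns a GP prior into a GP posterior. It is stated here as a standard property of joint Gaussians in the same block notation, not as something shown on this slide:

```latex
p(x, y) = \mathcal{N}\!\left(
  \begin{bmatrix} a \\ b \end{bmatrix},
  \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}
\right)
\;\Longrightarrow\;
p(x \mid y) = \mathcal{N}\!\left(
  a + B C^{-1} (y - b),\;
  A - B C^{-1} B^\top
\right)
```

Taking $x$ to be the function value at a test input and $y$ the noisy training targets (so $a = b = 0$, $B = k(x_*, \mathbf{x})^\top$, $C = K + \sigma^2_{\text{noise}} I$) gives the posterior mean and covariance of the GP regression model.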
Random functions from a Gaussian Process
Example one dimensional Gaussian process:
$$p(f(x)) \sim GP\left(m(x) = 0,\; k(x, x') = \exp\left(-\tfrac{1}{2}(x - x')^2\right)\right).$$
To get an indication of what this distribution over functions looks like, focus on a finite subset of function values $\mathbf{f} = (f(x_1), f(x_2), \ldots, f(x_n))^\top$, for which
$$\mathbf{f} \sim N(0, \Sigma), \quad \text{where } \Sigma_{ij} = k(x_i, x_j).$$
Then plot the coordinates of $\mathbf{f}$ as a function of the corresponding $x$ values.
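The sampling recipe can be sketched as follows, using the kernel from the slide. The input grid, the number of samples, the function names, and the small diagonal jitter (an assumed numerical safeguard for the Cholesky factorization, not part of the slides) are illustrative choices.

```python
import numpy as np

def sq_exp_kernel(xa, xb):
    """k(x, x') = exp(-0.5 * (x - x')^2) for 1-D input arrays xa and xb."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * diff**2)

def sample_gp_prior(x, n_samples=3, jitter=1e-6, seed=0):
    """Draw function values f ~ N(0, Sigma) at inputs x, with Sigma_ij = k(x_i, x_j)."""
    Sigma = sq_exp_kernel(x, x)
    # Jitter keeps the nearly singular covariance matrix numerically positive definite.
    L = np.linalg.cholesky(Sigma + jitter * np.eye(len(x)))
    rng = np.random.default_rng(seed)
    # If z ~ N(0, I), then L z ~ N(0, L L^T), which is N(0, Sigma) up to the jitter.
    return L @ rng.standard_normal((len(x), n_samples))

x = np.linspace(-5.0, 5.0, 50)
f = sample_gp_prior(x)  # shape (50, 3); plot each column against x
```

Each column of `f` is one random function evaluated on the grid; a finer grid gives a smoother-looking curve without changing the distribution at the existing points, which is the marginalization property at work.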
Some values of the random function
[Figure: sampled values of the random function; output f(x) (−1.5 to 1.5) plotted against input x (−5 to 5).]
Prior and Posterior
[Figure: two panels, each plotting output f(x) (−2 to 2) against input x (−5 to 5).]
Predictive distribution:
$$p(y_* \mid x_*, \mathbf{x}, \mathbf{y}) \sim N\left(k(x_*, \mathbf{x})^\top [K + \sigma^2_{\text{noise}} I]^{-1} \mathbf{y},\; k(x_*, x_*) + \sigma^2_{\text{noise}} - k(x_*, \mathbf{x})^\top [K + \sigma^2_{\text{noise}} I]^{-1} k(x_*, \mathbf{x})\right)$$
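A minimal sketch of the predictive distribution, assuming the squared exponential kernel $k(x, x') = \exp(-\tfrac{1}{2}(x - x')^2)$ and a known noise variance. The training data, the function names, and the use of a Cholesky factorization in place of an explicit inverse are assumptions made for illustration.

```python
import numpy as np

def sq_exp_kernel(xa, xb):
    """k(x, x') = exp(-0.5 * (x - x')^2) for 1-D input arrays xa and xb."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * diff**2)

def gp_predict(x_train, y_train, x_star, noise_var):
    """Predictive mean and variance of y* at each test input in x_star."""
    K = sq_exp_kernel(x_train, x_train)
    k_star = sq_exp_kernel(x_train, x_star)  # column j holds k(x*_j, x)
    # Factor K + noise_var * I once and reuse it for both mean and variance.
    L = np.linalg.cholesky(K + noise_var * np.eye(len(x_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # [K + s2 I]^{-1} y
    mean = k_star.T @ alpha
    v = np.linalg.solve(L, k_star)
    # k(x*, x*) = 1 for this kernel; subtract k*^T [K + s2 I]^{-1} k*.
    var = 1.0 + noise_var - np.sum(v**2, axis=0)
    return mean, var

# Hypothetical training data: five noisy-looking observations of sin(x).
x_train = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
y_train = np.sin(x_train)
x_star = np.linspace(-5.0, 5.0, 101)
mean, var = gp_predict(x_train, y_train, x_star, noise_var=0.01)
```

Far from the training inputs the mean returns to zero and the variance approaches the prior variance plus the noise variance; near the data the variance shrinks.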