
Probabilistic Graphical Models Lecture 21: Advanced Gaussian Processes - PowerPoint PPT Presentation



  1. Probabilistic Graphical Models, Lecture 21: Advanced Gaussian Processes. Andrew Gordon Wilson, www.cs.cmu.edu/~andrewgw, Carnegie Mellon University, April 1, 2015.

  2. Gaussian process review
     Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.
     Nonparametric regression model:
     ◮ Prior: f(x) ∼ GP(m(x), k(x, x′)), meaning (f(x_1), ..., f(x_N)) ∼ N(µ, K), with µ_i = m(x_i) and K_ij = cov(f(x_i), f(x_j)) = k(x_i, x_j).
     ◮ Posterior: p(f(x) | D) ∝ p(D | f(x)) p(f(x)), i.e. GP posterior ∝ likelihood × GP prior.
     [Figure: sample functions f(t) drawn from the GP prior and from the GP posterior.]
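As a quick illustration of the prior over functions, here is a minimal numpy sketch (not from the slides) that draws a few sample functions from a zero-mean GP with an RBF kernel; the grid, lengthscale, and jitter value are my own choices.

```python
import numpy as np

# Draw sample functions from a zero-mean GP prior with an RBF kernel,
# as in the "sample prior functions" panel above.
t = np.linspace(-5, 5, 200)                            # input grid
K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2)      # k(t, t') with lengthscale 1
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(t)))      # small jitter for numerical stability
samples = L @ np.random.randn(len(t), 3)               # three independent draws of f(t)
```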

  3. Gaussian Process Inference
     ◮ Observed noisy data y = (y(x_1), ..., y(x_N))^T at input locations X.
     ◮ Start with the standard regression assumption: N(y(x); f(x), σ²).
     ◮ Place a Gaussian process distribution over noise-free functions, f(x) ∼ GP(0, k_θ). The kernel k is parametrized by θ.
     ◮ Infer p(f_* | y, X, X_*) for the noise-free function f evaluated at test points X_* (sketched in code below).
     Joint distribution:
     [y, f_*] ∼ N(0, [K_θ(X, X) + σ²I, K_θ(X, X_*); K_θ(X_*, X), K_θ(X_*, X_*)]).   (1)
     Conditional predictive distribution:
     f_* | X_*, X, y, θ ∼ N(f̄_*, cov(f_*)),   (2)
     f̄_* = K_θ(X_*, X) [K_θ(X, X) + σ²I]^{−1} y,   (3)
     cov(f_*) = K_θ(X_*, X_*) − K_θ(X_*, X) [K_θ(X, X) + σ²I]^{−1} K_θ(X, X_*).   (4)
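Equations (2)-(4) translate directly into a few lines of linear algebra. The following is a minimal sketch assuming scalar inputs and an RBF kernel; the names k_rbf and gp_predict, the toy data, and the default hyperparameter values are mine rather than the lecture's.

```python
import numpy as np

def k_rbf(X1, X2, a=1.0, ell=1.0):
    """Squared-exponential (RBF) kernel between two sets of scalar inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return a**2 * np.exp(-0.5 * d2 / ell**2)

def gp_predict(X, y, Xs, sigma=0.1, **kern):
    """Predictive mean and covariance at test inputs Xs, Eqs. (3)-(4)."""
    K = k_rbf(X, X, **kern) + sigma**2 * np.eye(len(X))
    Ks = k_rbf(Xs, X, **kern)
    Kss = k_rbf(Xs, Xs, **kern)
    L = np.linalg.cholesky(K)                            # stable alternative to a direct inverse
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K_theta(X,X) + sigma^2 I)^{-1} y
    mean = Ks @ alpha                                    # Eq. (3)
    v = np.linalg.solve(L, Ks.T)
    cov = Kss - v.T @ v                                  # Eq. (4)
    return mean, cov

X = np.linspace(-3, 3, 20)
y = np.sin(X) + 0.1 * np.random.randn(20)                # toy noisy observations
mean, cov = gp_predict(X, y, np.linspace(-4, 4, 50))
```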

  4. Learning and Model Selection
     p(M_i | y) = p(y | M_i) p(M_i) / p(y).   (5)
     We can write the evidence of the model as
     p(y | M_i) = ∫ p(y | f, M_i) p(f) df.   (6)
     [Figures: (a) sample functions (output f(x) versus input x) from a simple, an appropriate, and a complex model, together with the data y; (b) the evidence p(y|M) of the simple, appropriate, and complex models plotted over all possible datasets.]

  5. Learning and Model Selection
     ◮ We can integrate away the entire Gaussian process f(x) to obtain the marginal likelihood, as a function of kernel hyperparameters θ alone:
     p(y | θ, X) = ∫ p(y | f, X) p(f | θ, X) df,   (7)
     log p(y | θ, X) = −(1/2) y^T (K_θ + σ²I)^{−1} y − (1/2) log |K_θ + σ²I| − (N/2) log(2π),   (8)
     where the first term measures model fit and the second is a complexity penalty.
     ◮ An extremely powerful mechanism for kernel learning (a code sketch of Eq. (8) follows below).
     [Figure: samples from the GP prior and samples from the GP posterior.]
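Here is one way Eq. (8) might be computed, again a sketch assuming an RBF kernel on scalar inputs and parametrizing θ = (log a, log ℓ, log σ) so that an unconstrained optimizer can be used later; the function name and parametrization are mine.

```python
import numpy as np

def neg_log_marginal_likelihood(theta, X, y):
    """Negative of Eq. (8) for an RBF kernel; theta = (log a, log ell, log sigma)."""
    a, ell, sigma = np.exp(theta)
    d2 = (X[:, None] - X[None, :]) ** 2
    K = a**2 * np.exp(-0.5 * d2 / ell**2) + sigma**2 * np.eye(len(X))  # K_theta + sigma^2 I
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K_theta + sigma^2 I)^{-1} y
    return (0.5 * y @ alpha                               # model fit term
            + np.sum(np.log(np.diag(L)))                  # 0.5 * log|K_theta + sigma^2 I|
            + 0.5 * len(X) * np.log(2 * np.pi))           # normalizing constant
```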

  6. Inference and Learning
     1. Learning: optimize the marginal likelihood,
     log p(y | θ, X) = −(1/2) y^T (K_θ + σ²I)^{−1} y − (1/2) log |K_θ + σ²I| − (N/2) log(2π)
     (model fit plus complexity penalty), with respect to the kernel hyperparameters θ.
     2. Inference: conditioned on the kernel hyperparameters θ, form the predictive distribution for test inputs X_*:
     f_* | X_*, X, y, θ ∼ N(f̄_*, cov(f_*)),
     f̄_* = K_θ(X_*, X) [K_θ(X, X) + σ²I]^{−1} y,
     cov(f_*) = K_θ(X_*, X_*) − K_θ(X_*, X) [K_θ(X, X) + σ²I]^{−1} K_θ(X, X_*).
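The two steps can be chained together as follows; this sketch reuses the hypothetical neg_log_marginal_likelihood and gp_predict functions from the earlier snippets, and the optimizer choice (BFGS with numerical gradients via scipy) is mine, not the lecture's.

```python
import numpy as np
from scipy.optimize import minimize

X = np.linspace(-3, 3, 20)
y = np.sin(X) + 0.1 * np.random.randn(20)

# Step 1 (learning): maximize the marginal likelihood over log-hyperparameters.
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, y))
a, ell, sigma = np.exp(res.x)

# Step 2 (inference): condition on the learned hyperparameters and predict.
mean, cov = gp_predict(X, y, np.linspace(-4, 4, 50), sigma=sigma, a=a, ell=ell)
```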

  7. Learning and Model Selection
     ◮ A fully Bayesian treatment would integrate away the kernel hyperparameters θ:
     p(f_* | X_*, X, y) = ∫ p(f_* | X_*, X, y, θ) p(θ | y) dθ.   (9)
     ◮ For example, we could specify a prior p(θ), use MCMC to take J samples from p(θ | y) ∝ p(y | θ) p(θ), and then find
     p(f_* | X_*, X, y) ≈ (1/J) Σ_{i=1}^{J} p(f_* | X_*, X, y, θ^{(i)}),   θ^{(i)} ∼ p(θ | y).   (10)
     ◮ If we have a non-Gaussian noise model, and thus cannot integrate away f, the strong dependencies between the Gaussian process f and the hyperparameters θ make sampling extremely difficult. In my experience, the most effective solution is to use a deterministic approximation for the posterior p(f | y), which enables one to work with an approximate marginal likelihood.
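One concrete, if crude, way to realize Eq. (10) is random-walk Metropolis on the log-hyperparameters. The sketch below reuses the earlier hypothetical functions and data, and its prior, step size, and thinning are arbitrary choices of mine.

```python
import numpy as np

def metropolis_theta(X, y, n_samples=2000, step=0.1):
    """Random-walk Metropolis targeting p(theta | y), proportional to p(y | theta) p(theta),
    with an assumed broad Gaussian prior on the log-hyperparameters."""
    theta = np.zeros(3)
    log_post = lambda t: -neg_log_marginal_likelihood(t, X, y) - 0.5 * t @ t / 10.0
    lp, samples = log_post(theta), []
    for _ in range(n_samples):
        prop = theta + step * np.random.randn(3)
        lp_prop = log_post(prop)
        if np.log(np.random.rand()) < lp_prop - lp:        # Metropolis accept/reject
            theta, lp = prop, lp_prop
        samples.append(theta.copy())
    return np.array(samples)

# Eq. (10): the mean of the mixture is the average of the per-sample predictive means.
thetas = metropolis_theta(X, y)[::20]                      # thinned posterior samples
Xs = np.linspace(-4, 4, 50)
means = [gp_predict(X, y, Xs, sigma=np.exp(t[2]), a=np.exp(t[0]), ell=np.exp(t[1]))[0]
         for t in thetas]
mixture_mean = np.mean(means, axis=0)
```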

  8. Popular Kernels
     Let τ = x − x′:
     k_SE(τ) = exp(−0.5 τ²/ℓ²),   (11)
     k_MA(τ) = a (1 + √3 |τ|/ℓ) exp(−√3 |τ|/ℓ),   (12)
     k_RQ(τ) = (1 + τ²/(2αℓ²))^{−α},   (13)
     k_PE(τ) = exp(−2 sin²(π τ ω)/ℓ²).   (14)
     (These four kernels are written out as code below.)
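A direct transcription of Eqs. (11)-(14) as numpy functions of τ; the default hyperparameter values are placeholders of mine.

```python
import numpy as np

def k_se(tau, ell=1.0):                     # Eq. (11): squared exponential
    return np.exp(-0.5 * tau**2 / ell**2)

def k_ma(tau, a=1.0, ell=1.0):              # Eq. (12): Matern 3/2
    r = np.sqrt(3) * np.abs(tau) / ell
    return a * (1 + r) * np.exp(-r)

def k_rq(tau, alpha=1.0, ell=1.0):          # Eq. (13): rational quadratic
    return (1 + tau**2 / (2 * alpha * ell**2)) ** (-alpha)

def k_pe(tau, omega=1.0, ell=1.0):          # Eq. (14): periodic
    return np.exp(-2 * np.sin(np.pi * tau * omega)**2 / ell**2)
```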

  9. Worked Example: Combining Kernels, CO₂ Data
     [Figure: atmospheric CO₂ concentration (ppm), roughly 320 to 400 ppm, plotted against year from 1968 to 2004.]
     Example from Rasmussen and Williams (2006), Gaussian Processes for Machine Learning.

  10. Worked Example: Combining Kernels, CO₂ Data

  11. Worked Example: Combining Kernels, CO₂ Data
     ◮ Long rising trend: k_1(x_p, x_q) = θ_1² exp(−(x_p − x_q)²/(2θ_2²)).
     ◮ Quasi-periodic seasonal changes: k_2(x_p, x_q) = k_RBF(x_p, x_q) k_PER(x_p, x_q) = θ_3² exp(−(x_p − x_q)²/(2θ_4²) − 2 sin²(π(x_p − x_q))/θ_5²).
     ◮ Multi-scale medium-term irregularities: k_3(x_p, x_q) = θ_6² (1 + (x_p − x_q)²/(2θ_8 θ_7²))^{−θ_8}.
     ◮ Correlated and i.i.d. noise: k_4(x_p, x_q) = θ_9² exp(−(x_p − x_q)²/(2θ_10²)) + θ_11² δ_pq.
     ◮ k_total(x_p, x_q) = k_1(x_p, x_q) + k_2(x_p, x_q) + k_3(x_p, x_q) + k_4(x_p, x_q) (sketched in code below).
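The composite kernel could be coded along the following lines; the hyperparameter vector θ is left symbolic, since its eleven entries would be learned by maximizing the marginal likelihood rather than set by hand.

```python
import numpy as np

def k_co2(xp, xq, theta):
    """Sum of the four components above; theta[0..10] stands for theta_1..theta_11."""
    t, tau2 = theta, (xp - xq) ** 2
    k1 = t[0]**2 * np.exp(-tau2 / (2 * t[1]**2))                          # long rising trend
    k2 = t[2]**2 * np.exp(-tau2 / (2 * t[3]**2)
                          - 2 * np.sin(np.pi * (xp - xq))**2 / t[4]**2)   # quasi-periodic seasonal
    k3 = t[5]**2 * (1 + tau2 / (2 * t[7] * t[6]**2)) ** (-t[7])           # medium-term irregularities
    k4 = t[8]**2 * np.exp(-tau2 / (2 * t[9]**2)) + t[10]**2 * (xp == xq)  # correlated + i.i.d. noise
    return k1 + k2 + k3 + k4
```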

  12. What is a kernel?
     ◮ Informally, k describes the similarities between pairs of data points. For example, far-away points may be considered less similar than nearby points. K_ij = ⟨φ(x_i), φ(x_j)⟩, and so tells us the overlap between the features (basis functions) φ(x_i) and φ(x_j).
     ◮ We have seen that all linear basis function models f(x) = w^T φ(x), with p(w) = N(0, Σ_w), correspond to Gaussian processes with kernel k(x, x′) = φ(x)^T Σ_w φ(x′) (checked numerically below).
     ◮ We have also accumulated some experience with the RBF kernel k_RBF(x, x′) = a² exp(−||x − x′||²/(2ℓ²)).
     ◮ The kernel controls the generalisation behaviour of a kernel machine. For example, a kernel controls the support and inductive biases of a Gaussian process: which functions are a priori likely.
     ◮ A kernel is also known as a covariance function or covariance kernel in the context of Gaussian processes.
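A small Monte Carlo check of the correspondence between a Bayesian linear basis-function model and its induced kernel; the polynomial features and prior covariance here are hypothetical choices for illustration.

```python
import numpy as np

# For f(x) = w^T phi(x) with w ~ N(0, Sigma_w), the covariance cov(f(x1), f(x2))
# should match phi(x1)^T Sigma_w phi(x2).
rng = np.random.default_rng(0)
phi = lambda x: np.array([1.0, x, x**2])              # hypothetical polynomial features
Sigma_w = np.diag([1.0, 0.5, 0.25])                   # hypothetical prior covariance on weights

x1, x2 = 0.3, -1.2
W = rng.multivariate_normal(np.zeros(3), Sigma_w, size=100_000)
f1, f2 = W @ phi(x1), W @ phi(x2)
print(np.mean(f1 * f2))                               # Monte Carlo estimate of cov(f(x1), f(x2))
print(phi(x1) @ Sigma_w @ phi(x2))                    # induced kernel k(x1, x2)
```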

  13. Candidate Kernel
     k(x, x′) = 1 if ||x − x′|| ≤ 1, and 0 otherwise.
     ◮ Symmetric
     ◮ Provides information about proximity of points
     ◮ Exercise: Is it a valid kernel?

  14. Candidate Kernel
     k(x, x′) = 1 if ||x − x′|| ≤ 1, and 0 otherwise.
     Try the points x_1 = 1, x_2 = 2, x_3 = 3. Compute the kernel matrix
     K = [? ? ?; ? ? ?; ? ? ?].   (15)

  15. Candidate Kernel
     k(x, x′) = 1 if ||x − x′|| ≤ 1, and 0 otherwise.
     Try the points x_1 = 1, x_2 = 2, x_3 = 3. Compute the kernel matrix
     K = [1 1 0; 1 1 1; 0 1 1].   (16)
     The eigenvalues of K are 1 + √2, 1, and 1 − √2. Since 1 − √2 < 0, K is not positive semidefinite, so the candidate k is not a valid kernel.
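The counterexample can be checked in a couple of lines:

```python
import numpy as np

# Gram matrix of the candidate kernel at x = 1, 2, 3 and its eigenvalues.
x = np.array([1.0, 2.0, 3.0])
K = (np.abs(x[:, None] - x[None, :]) <= 1).astype(float)
print(K)                          # [[1 1 0], [1 1 1], [0 1 1]]
print(np.linalg.eigvalsh(K))      # approx [1 - sqrt(2), 1, 1 + sqrt(2)]; one eigenvalue is negative
```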

  16. Representer Theorem
     A decision function f(x) can be written as
     f(x) = ⟨w, φ(x)⟩ = ⟨Σ_{i=1}^{N} α_i φ(x_i), φ(x)⟩ = Σ_{i=1}^{N} α_i k(x_i, x).   (17)
     ◮ The representer theorem says this function exists with finitely many coefficients α_i even when φ is infinite dimensional (an infinite number of basis functions).
     ◮ Initially viewed as a strength of kernel methods, for datasets not exceeding e.g. ten thousand points.
     ◮ Unfortunately, the number of nonzero α_i often grows linearly in the size of the training set N.
     ◮ Example: In GP regression, the predictive mean is
     E[f_* | y, X, x_*] = k_*^T (K + σ²I)^{−1} y = Σ_{i=1}^{N} α_i k(x_i, x_*),   (18)
     where α = (K + σ²I)^{−1} y.
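Eq. (18) written out explicitly as a weighted sum of kernel evaluations, in a self-contained sketch with an assumed RBF kernel and toy data; note that all N coefficients α_i are in general nonzero.

```python
import numpy as np

# GP predictive mean at a single test point, expressed as sum_i alpha_i k(x_i, x_star).
sigma = 0.1
X = np.linspace(-3, 3, 20)
y = np.sin(X) + sigma * np.random.randn(20)                # toy training data
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)          # RBF Gram matrix, lengthscale 1
alpha = np.linalg.solve(K + sigma**2 * np.eye(len(X)), y)  # alpha = (K + sigma^2 I)^{-1} y

x_star = 0.5
k_star = np.exp(-0.5 * (X - x_star) ** 2)                  # k(x_i, x_star) for each training point
mean_star = k_star @ alpha                                 # = sum_i alpha_i k(x_i, x_star), Eq. (18)
```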

  17. Making new kernels from old
     Suppose k_1(x, x′) and k_2(x, x′) are valid. Then the following covariance functions are also valid:
     k(x, x′) = g(x) k_1(x, x′) g(x′),   (19)
     k(x, x′) = q(k_1(x, x′)),   (20)
     k(x, x′) = exp(k_1(x, x′)),   (21)
     k(x, x′) = k_1(x, x′) + k_2(x, x′),   (22)
     k(x, x′) = k_1(x, x′) k_2(x, x′),   (23)
     k(x, x′) = k_3(φ(x), φ(x′)),   (24)
     k(x, x′) = x^T A x′,   (25)
     k(x, x′) = k_a(x_a, x_a′) + k_b(x_b, x_b′),   (26)
     k(x, x′) = k_a(x_a, x_a′) k_b(x_b, x_b′),   (27)
     where g is any function, q is a polynomial with nonnegative coefficients, φ(x) is a function from x to R^M, k_3 is a valid covariance function in R^M, A is a symmetric positive definite matrix, x_a and x_b are not necessarily disjoint variables with x = (x_a, x_b)^T, and k_a and k_b are valid kernels in their respective spaces.
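Rules (22) and (23) are easy to spot-check numerically: sums and elementwise products of Gram matrices built from valid kernels should remain positive semidefinite up to numerical tolerance. The particular kernels and inputs below are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(30)
d2 = (x[:, None] - x[None, :]) ** 2
K1 = np.exp(-0.5 * d2)                           # RBF Gram matrix
K2 = (1 + x[:, None] * x[None, :]) ** 2          # polynomial Gram matrix

for K in (K1 + K2, K1 * K2):                     # rule (22) and rule (23)
    print(np.linalg.eigvalsh(K).min() >= -1e-10) # smallest eigenvalue is (numerically) nonnegative
```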

  18. Stationary Kernels
     ◮ A stationary kernel is invariant to translations of the input space. Equivalently, k = k(x − x′) = k(τ).
     ◮ All distance kernels, k = k(||x − x′||), are examples of stationary kernels.
     ◮ The RBF kernel k_RBF(x, x′) = a² exp(−||x − x′||²/(2ℓ²)) is a stationary kernel. The polynomial kernel k_POL(x, x′) = (x^T x′ + σ_0²)^p is an example of a non-stationary kernel (compared in code below).
     ◮ Stationarity provides a useful inductive bias.
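A tiny check of translation invariance on scalar inputs; the particular inputs and shift are arbitrary.

```python
import numpy as np

def k_rbf_pair(x, xp, a=1.0, ell=1.0):
    return a**2 * np.exp(-0.5 * (x - xp)**2 / ell**2)

def k_pol_pair(x, xp, sigma0=1.0, p=2):
    return (x * xp + sigma0**2) ** p

x, xp, shift = 1.0, 2.5, 10.0
print(k_rbf_pair(x, xp) == k_rbf_pair(x + shift, xp + shift))   # True: depends only on x - x'
print(k_pol_pair(x, xp) == k_pol_pair(x + shift, xp + shift))   # False: changes under translation
```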
