Probabilistic Graphical Models Lecture 21: Advanced Gaussian Processes
Andrew Gordon Wilson
www.cs.cmu.edu/~andrewgw Carnegie Mellon University April 1, 2015
Gaussian process review

Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.

◮ Prior: f(x) ∼ GP(m(x), k(x, x′)), meaning that any finite collection of function values is jointly Gaussian: [f(x1), . . . , f(xN)]ᵀ ∼ N(μ, K), with μi = m(xi) and Kij = k(xi, xj).
◮ GP posterior ∝ Likelihood × GP prior.

[Figure: sample functions from a Gaussian process prior (left) and from the GP posterior after conditioning on data (right).]
◮ Observed noisy data y = (y(x1), . . . , y(xN))ᵀ at input locations X.
◮ Start with the standard regression assumption: p(y(x)|f(x)) = N(y(x); f(x), σ²).
◮ Place a Gaussian process distribution over noise free functions, f(x) ∼ GP(0, kθ).
◮ Infer p(f∗|y, X, X∗) for the noise free function f evaluated at test points X∗, as sketched below.
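A minimal numpy sketch of this inference step, under an assumed RBF kernel and hand-picked hyperparameters (the toy data and function names are illustrative, not from the lecture):

```python
import numpy as np

def rbf_kernel(x1, x2, a=1.0, ell=1.0):
    """k(x, x') = a^2 exp(-(x - x')^2 / (2 ell^2)) for 1-D inputs."""
    return a**2 * np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * ell**2))

def gp_predict(X, y, Xs, sigma=0.1):
    """Compute p(f* | y, X, X*) for GP regression with Gaussian noise."""
    K = rbf_kernel(X, X)                        # n x n train covariance
    Ks = rbf_kernel(X, Xs)                      # n x n* cross covariance
    Kss = rbf_kernel(Xs, Xs)                    # n* x n* test covariance
    A = K + sigma**2 * np.eye(len(X))           # K + sigma^2 I
    mean = Ks.T @ np.linalg.solve(A, y)         # predictive mean
    cov = Kss - Ks.T @ np.linalg.solve(A, Ks)   # predictive covariance
    return mean, cov

X = np.linspace(-3, 3, 20)
y = np.sin(X) + 0.1 * np.random.randn(20)       # noisy observations
mean, cov = gp_predict(X, y, np.linspace(-3, 3, 100))
```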
[Figure: (a) marginal likelihood p(y|M) over all possible datasets y, for a simple, a complex, and an appropriate model; (b) fits to data (output f(x) vs input x) from the simple, complex, and appropriate models.]
◮ We can integrate away the entire Gaussian process f(x) to obtain the marginal likelihood, as a function of the kernel hyperparameters θ alone:
p(y|θ) = ∫ p(y|f, θ) p(f|θ) df,
log p(y|θ) = −(1/2) yᵀ(Kθ + σ²I)⁻¹y − (1/2) log|Kθ + σ²I| − (N/2) log 2π,
where the first term is a model fit and the second a complexity penalty (computed in the sketch below).
◮ An extremely powerful mechanism for kernel learning.

[Figure: samples from the GP prior and from the GP posterior (output f(x) vs input x).]
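The decomposition above translates directly into code. A sketch of the log marginal likelihood with the model fit and complexity terms kept separate (the Cholesky-based evaluation is a standard numerical choice, not specific to the lecture):

```python
import numpy as np

def log_marginal_likelihood(K, y, sigma=0.1):
    """log p(y | theta) for y ~ N(0, K_theta + sigma^2 I)."""
    n = len(y)
    L = np.linalg.cholesky(K + sigma**2 * np.eye(n))     # A = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # A^{-1} y
    fit = -0.5 * y @ alpha                    # model fit term
    complexity = -np.log(np.diag(L)).sum()    # -(1/2) log|A|
    const = -0.5 * n * np.log(2 * np.pi)
    return fit + complexity + const
```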
◮ A fully Bayesian treatment would integrate away the kernel hyperparameters θ.
◮ For example, we could specify a prior p(θ), use MCMC to take J samples from p(θ|y), and then approximate the predictive distribution: p(f∗|y, X, X∗) ≈ (1/J) Σ_{j=1}^J p(f∗|y, X, X∗, θ(j)), with θ(j) ∼ p(θ|y).
◮ If we have a non-Gaussian noise model, and thus cannot integrate away f(x) in closed form, we can also sample f(x), or use a deterministic approximation to the non-Gaussian likelihood.
[Figure: CO₂ concentration (ppm, 320–400) from 1968 to 2004.]
◮ Long rising trend: k1(xp, xq) = θ1² exp(−(xp − xq)²/(2θ2²))
◮ Quasi-periodic seasonal changes: k2(xp, xq) = θ3² exp(−(xp − xq)²/(2θ4²) − 2 sin²(π(xp − xq))/θ5²)
◮ Multi-scale medium term irregularities: k3(xp, xq) = θ6² (1 + (xp − xq)²/(2θ8θ7²))^(−θ8)
◮ Correlated and i.i.d. noise: k4(xp, xq) = θ9² exp(−(xp − xq)²/(2θ10²)) + θ11² δpq
◮ ktotal(xp, xq) = k1(xp, xq) + k2(xp, xq) + k3(xp, xq) + k4(xp, xq), as sketched below.
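A sketch of this composite kernel in numpy; the θ values one would actually use come from maximising the marginal likelihood, so the placeholder array below is only for illustration:

```python
import numpy as np

def k_total(xp, xq, th):
    """Sum of the four CO2 components; th[i-1] plays the role of theta_i."""
    r = xp - xq
    k1 = th[0]**2 * np.exp(-r**2 / (2 * th[1]**2))                  # trend
    k2 = th[2]**2 * np.exp(-r**2 / (2 * th[3]**2)
                           - 2 * np.sin(np.pi * r)**2 / th[4]**2)   # seasonal
    k3 = th[5]**2 * (1 + r**2 / (2 * th[7] * th[6]**2))**(-th[7])   # medium term
    k4 = th[8]**2 * np.exp(-r**2 / (2 * th[9]**2)) \
         + th[10]**2 * (r == 0)                                     # noise
    return k1 + k2 + k3 + k4

theta = np.ones(11)            # placeholder hyperparameters
print(k_total(2000.0, 2000.5, theta))
```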
◮ Informally, k describes the similarities between pairs of data points: the covariance between f(x) and f(x′).
◮ We have seen that all linear basis function models f(x) = wᵀφ(x), with w ∼ N(0, Σw), are GPs with kernel k(x, x′) = φ(x)ᵀΣwφ(x′).
◮ We have also accumulated some experience with the RBF kernel kRBF(x, x′) = a² exp(−||x − x′||²/(2ℓ²)).
◮ The kernel controls the generalisation behaviour of a kernel machine.
◮ A kernel is also known as a covariance function or covariance kernel in the Gaussian process literature.
◮ Symmetric
◮ Provides information about proximity of points
◮ Exercise: Is it a valid kernel?
◮ A kernel machine makes predictions using a weighted combination of kernels centred on the training points: f(x) = Σ_{i=1}^N αi k(x, xi).
◮ The representer theorem says this function exists with finitely many coefficients αi.
◮ Initially viewed as a strength of kernel methods: for datasets that are not too large, this is a compact representation.
◮ Unfortunately, the number of nonzero αi often grows linearly in the size of the training set.
◮ Example: In GP regression, the predictive mean is f̄(x∗) = k∗ᵀ(K + σ²I)⁻¹y = Σ_{i=1}^N αi k(xi, x∗), with α = (K + σ²I)⁻¹y.
◮ Sums of kernels are kernels: k(x, x′) = ka(xa, x′a) + kb(xb, x′b).
◮ Products of kernels are kernels: k(x, x′) = ka(xa, x′a) kb(xb, x′b).
◮ A stationary kernel is invariant to translations of the input space.
◮ All distance kernels, k = k(||x − x′||), are examples of stationary kernels.
◮ The RBF kernel kRBF(x, x′) = a² exp(−||x − x′||²/(2ℓ²)) is stationary; the polynomial kernel k(x, x′) = (xᵀx′ + σ0²)^p is an example of a non-stationary kernel.
◮ Stationarity provides a useful inductive bias.
◮ f(x, w) is a Gaussian process, f(x) ∼ GP(m, k), with mean function m(x) = E[f(x)] and covariance function k(x, x′) = cov(f(x), f(x′)).
◮ The entire basis function model of Eqs. (31) and (32) is encapsulated as a distribution over functions with kernel k.
◮ Start with the basis model f(x) = Σ_{i=1}^J wi φi(x), with wi ∼ N(0, σ²/J) and radial basis functions φi(x) = exp(−(x − ci)²/(2ℓ²)).
◮ Equations (37)–(39) define a radial basis function regression model, with basis functions centred at the points ci.
◮ Using our result for the kernel of a generalised linear model, k(x, x′) = (σ²/J) Σ_{i=1}^J φi(x)φi(x′).
◮ Letting ci+1 − ci = Δc = 1/J, and J → ∞, the kernel in Eq. (42) becomes a Riemann sum: k(x, x′) = lim_{J→∞} (σ²/J) Σ_{i=1}^J φi(x)φi(x′) = σ² ∫_{c0}^{c∞} φc(x)φc(x′) dc.
◮ By setting c0 = −∞ and c∞ = ∞, we spread the infinitely many basis functions across the whole real line, each a distance Δc → 0 apart: k(x, x′) = σ² ∫_{−∞}^{∞} exp(−(x − c)²/(2ℓ²)) exp(−(x′ − c)²/(2ℓ²)) dc = √π ℓ σ² exp(−(x − x′)²/(2(√2ℓ)²)).
◮ It is remarkable we can work with infinitely many basis functions using finite amounts of computation.
◮ The RBF kernel, also known as the Gaussian or squared exponential (SE) kernel: kRBF(x, x′) = a² exp(−||x − x′||²/(2ℓ²)).
◮ Recall Bochner's theorem: if we take the Fourier transform of the RBF kernel, we obtain a Gaussian spectral density.
◮ Functions drawn from a GP with an RBF kernel are infinitely differentiable: very smooth.
[Figure: SE kernels k(τ) with different length-scales, ℓ = 0.7, 2.28, and 7.]
◮ Recap: a kernel machine makes predictions using a weighted combination of kernels centred on the training points: f(x) = Σ_{i=1}^N αi k(x, xi).
◮ The representer theorem says this function exists with finitely many coefficients αi.
◮ Initially viewed as a strength of kernel methods: for datasets that are not too large, this is a compact representation.
◮ Unfortunately, the number of nonzero αi often grows linearly in the size of the training set.
◮ Example: In GP regression, the predictive mean is f̄(x∗) = k∗ᵀ(K + σ²I)⁻¹y = Σ_{i=1}^N αi k(xi, x∗), with α = (K + σ²I)⁻¹y.
◮ What if we want data varying at multiple scales?
◮ k(r) = ∫ exp(−r²/(2ℓ²)) p(ℓ) dℓ: a scale mixture of RBF kernels over length-scales ℓ.
◮ Choosing a gamma distribution over inverse squared length-scales gives the rational quadratic (RQ) kernel, kRQ(r) = (1 + r²/(2αℓ²))^(−α).
◮ One could derive other interesting covariance functions using different choices of p(ℓ).
[Figure: (a) RQ kernels k(τ) for α = 0.1, 2, and 40; (b) sample GP-RQ functions f(x).]
◮ The neural network kernel (Neal, 1996) is famous for triggering research on Gaussian processes in the machine learning community.
◮ Consider a neural network with one hidden layer: f(x) = b + Σ_{i=1}^J vi h(x; ui).
◮ b is a bias, vi are the hidden to output weights, h is any bounded hidden unit transfer function, and ui are the input to hidden weights.
◮ Let b and vi have zero mean and variances σb² and σv²/J, respectively, and let the ui have independent identical distributions. Then, by the central limit theorem, as J → ∞,
k(x, x′) = E[f(x)f(x′)] = σb² + (1/J) Σ_{i=1}^J σv² Eu[hi(x; ui)hi(x′; ui)] → σb² + σv² Eu[h(x; u)h(x′; u)].
◮ Let h(x; u) = erf(u0 + Σ_{j=1}^P uj xj), where erf(z) = (2/√π) ∫₀^z e^(−t²) dt.
◮ Choose u ∼ N(0, Σ).
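For this choice of h and u ∼ N(0, Σ), the expectation has a closed form, the arcsine kernel of Williams (1998): kNN(x, x′) = (2/π) sin⁻¹(2x̃ᵀΣx̃′ / √((1 + 2x̃ᵀΣx̃)(1 + 2x̃′ᵀΣx̃′))), with augmented input x̃ = (1, x1, . . . , xP)ᵀ. A sketch (the Σ below is an assumed value):

```python
import numpy as np

def nn_kernel(x, xp, Sigma):
    """Neural network (arcsine) kernel for erf hidden units, u ~ N(0, Sigma).
    xt = (1, x) is the input augmented with a 1 for the bias u0."""
    xt = np.concatenate(([1.0], np.atleast_1d(x)))
    xpt = np.concatenate(([1.0], np.atleast_1d(xp)))
    num = 2 * xt @ Sigma @ xpt
    den = np.sqrt((1 + 2 * xt @ Sigma @ xt) * (1 + 2 * xpt @ Sigma @ xpt))
    return (2 / np.pi) * np.arcsin(num / den)

Sigma = np.diag([1.0, 10.0])   # assumed prior covariance for (u0, u1)
print(nn_kernel(0.5, -0.5, Sigma))
```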
◮ What if we want to make the length-scale ℓ input dependent, so that ℓ = ℓ(x)?
◮ What if we want to make the length-scale ℓ input dependent, so that ℓ = ℓ(x)?
◮ Just letting ℓ → ℓ(x) doesn't produce a valid kernel.
The Gibbs construction gives a valid kernel with input-dependent length-scales:
k(x, x′) = Π_{p=1}^P (2ℓp(x)ℓp(x′) / (ℓp(x)² + ℓp(x′)²))^(1/2) exp(−Σ_{p=1}^P (xp − x′p)² / (ℓp(x)² + ℓp(x′)²)).

[Figure: (a) an input-dependent length-scale function ℓ(x); (b) a sample function drawn using this length-scale.]
◮ Transform the inputs through a vector-valued function: u(x) = (cos(x), sin(x)).
◮ Apply the RBF kernel in u space: kRBF(x, x′) → kRBF(u(x), u(x′)).
◮ Recover the periodic kernel, kPER(x, x′) = exp(−2 sin²((x − x′)/2)/ℓ²) (checked numerically below).
◮ Can you see anything unusual about this kernel?
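A quick numerical check that warping through u(x) = (cos x, sin x) and applying the RBF kernel reproduces the periodic kernel above, since ||u(x) − u(x′)||² = 4 sin²((x − x′)/2):

```python
import numpy as np

def k_periodic_direct(x, xp, ell=1.0):
    """k_PER(x, x') = exp(-2 sin^2((x - x')/2) / ell^2)."""
    return np.exp(-2 * np.sin((x - xp) / 2)**2 / ell**2)

def k_periodic_warped(x, xp, ell=1.0):
    """The RBF kernel applied to u(x) = (cos x, sin x)."""
    u = np.array([np.cos(x), np.sin(x)])
    up = np.array([np.cos(xp), np.sin(xp)])
    return np.exp(-np.sum((u - up)**2) / (2 * ell**2))

x, xp = 0.3, 2.1
print(k_periodic_direct(x, xp), k_periodic_warped(x, xp))  # identical values
```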
[Figure: (a) the periodic kernel as a function of input distance r; (b) a sample periodic function y(x).]
◮ A stationary kernel is invariant to translations of the input space: k(x, x′) = k(x − x′) = k(τ).
◮ Intuitively, this means the properties of the function are similar across the whole input domain.
◮ How might we make other non-stationary kernels, besides the Gibbs kernel?
◮ Warp the input space: k(x, x′) → k(g(x), g(x′)), where g is an arbitrary deterministic function.
◮ Modulate the amplitude of the kernel: if f(x) ∼ GP(0, k(x, x′)), then w(x)f(x) is a GP with kernel w(x)k(x, x′)w(x′).
◮ What would happen if we tried w1(x)f1(x) + w2(x)f2(x), where f1 and f2 are GPs?
◮ How about σ(w1(x))f1(x) + (1 − σ(w1(x)))f2(x)?
◮ The RBF kernel kRBF(x, x′) = a² exp(−||x − x′||²/(2ℓ²)) gives rise to very smooth sample functions.
◮ How might we create a drop-in replacement, while retaining useful inductive biases?
◮ Could replace the Euclidean distance measure with an absolute distance measure.
◮ Recall that stationary kernels k(τ), τ = x − x′, and spectral densities S(s) are Fourier duals.
◮ If we use a Student-t spectral density for S(s), and take the inverse Fourier transform, we obtain the Matérn kernel.
◮ The Matérn kernel: kMatérn(x, x′) = (2^(1−ν)/Γ(ν)) (√(2ν)||x − x′||/ℓ)^ν Kν(√(2ν)||x − x′||/ℓ), where Kν is a modified Bessel function.
◮ In one dimension, and when ν + 1/2 = p, for some natural number p, the kernel simplifies into a product of an exponential and a polynomial.
◮ By setting ν = 1/2, we obtain the Ornstein-Uhlenbeck (OU) kernel, kOU(x, x′) = exp(−|x − x′|/ℓ).
◮ The Matérn kernel does not have the concentration of measure problems of the RBF kernel in high dimensional input spaces.
◮ The kernel gives rise to a Markovian process (and classical filtering and smoothing techniques apply). A sampling comparison with the RBF kernel follows the figure below.
[Figure: (a) the OU and RBF covariance functions as a function of distance r; (b) sample functions from GP-OU and GP-RBF.]
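A sketch of how such sample paths can be drawn, using a Cholesky factor of the kernel matrix (the jitter term is a standard numerical safeguard):

```python
import numpy as np

def sample_gp(kernel, X, n_samples=3, jitter=1e-6):
    """Draw sample functions from a zero-mean GP with the given kernel."""
    K = kernel(X[:, None], X[None, :]) + jitter * np.eye(len(X))
    return np.linalg.cholesky(K) @ np.random.randn(len(X), n_samples)

k_rbf = lambda x, xp: np.exp(-(x - xp)**2 / 2)   # very smooth samples
k_ou = lambda x, xp: np.exp(-np.abs(x - xp))     # rough, Markovian samples

X = np.linspace(0, 10, 200)
f_rbf, f_ou = sample_gp(k_rbf, X), sample_gp(k_ou, X)
```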
◮ Are Gaussian processes Bayesian nonparametric models?
◮ For a Gaussian process f(x) to be non-parametric, the conditionals f(xi)|f−i, where f−i are all function values except f(xi), must be free to depend on all of f−i.
◮ For this freedom to be possible, it is a necessary (but not sufficient) condition that the kernel be derived from an infinite set of basis functions.
◮ Nonparametric kernels allow for a great amount of flexibility: the information capacity of the model grows with the available data.
◮ The parametric analogue to a GP with a non-parametric RBF kernel is a regression model with finitely many fixed radial basis functions.
◮ Discrete time auto-regressive model, e.g. an AR(1) process: f(t + 1) = a f(t) + ε(t), with ε(t) ∼ N(0, σ²).
◮ Is this model a Gaussian process?
[Figure: Gaussian process regression network. An input x feeds latent node functions f̂1(x), . . . , f̂q(x), which are mixed by input-dependent weight functions W11(x), . . . , Wpq(x) to produce outputs y1(x), . . . , yp(x).]
[Figure: spatial predictions plotted over longitude and latitude.]
◮ GPs in Bayesian neural network like architectures (Salakhutdinov and Hinton, 2008).
◮ Compositions of kernels (Archambeau and Bach, 2011; Durrande et al., 2011).
◮ If we can approximate the spectral density S(s) to arbitrary accuracy, then we can approximate any stationary kernel to arbitrary accuracy.
◮ We can model S(s) to arbitrary accuracy, since scale-location mixtures of Gaussians can approximate any distribution to arbitrary precision.
◮ A scale-location mixture of Gaussians can flexibly model many distributions, even with a small number of components.
◮ For inputs x ∈ R^P, let τ = x − x′. If S(s) is a mixture of Q Gaussians with weights wq, means μq = (μq(1), . . . , μq(P)), and diagonal covariances with entries vq = (vq(1), . . . , vq(P)), then taking the inverse Fourier transform gives the spectral mixture (SM) kernel (sketched below):
kSM(τ) = Σ_{q=1}^Q wq Π_{p=1}^P exp{−2π² τp² vq(p)} cos(2π τp μq(p)).
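A sketch of the SM kernel in one dimension (P = 1), with assumed hyperparameter values:

```python
import numpy as np

def sm_kernel(tau, w, mu, v):
    """k_SM(tau) = sum_q w_q exp(-2 pi^2 tau^2 v_q) cos(2 pi tau mu_q)."""
    tau = np.asarray(tau, dtype=float)[..., None]   # broadcast over components
    comp = np.exp(-2 * np.pi**2 * tau**2 * v) * np.cos(2 * np.pi * tau * mu)
    return (w * comp).sum(-1)

# Q = 2 components; weights, spectral means, and spectral variances (assumed)
w = np.array([1.0, 0.5])
mu = np.array([0.10, 0.25])
v = np.array([0.01, 0.02])
print(sm_kernel(np.linspace(0, 20, 5), w, mu, v))
```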
◮ Observations y(x) ∼ N(y(x); f(x), σ²).
◮ f(x) ∼ GP(0, kSM(x, x′|θ)).
◮ kSM(x, x′|θ) can approximate many different kernels with different settings of its hyperparameters θ.
◮ Learning involves training these hyperparameters through maximum marginal likelihood optimization of log p(y|θ) = −(1/2)yᵀ(Kθ + σ²I)⁻¹y (model fit) − (1/2)log|Kθ + σ²I| (complexity penalty) + const.
◮ Once hyperparameters are trained as θ̂, we make predictions using the predictive distribution p(f∗|y, θ̂).
[Figure: (c) CO₂ concentration (ppm) vs year: train and test data, 95% credible region, and extrapolations using MA, RQ, PER, SE, and SM kernels. (d) Log spectral density vs frequency (1/month): SM and SE kernels against the empirical spectrum.]
[Figure: (a), (b) recovered correlation functions k(τ).]
[Figure: (e) observations y(x); (f) true covariance function against the covariance learned with the SM kernel; (g) log spectral density of the SM kernel against the empirical spectrum.]
[Figure: (h) observations; (i) train and test data with predictions from MA, RQ, SE, PER, and SM kernels; (j) correlation functions of the MA and SM kernels; (k) log spectral densities of the SM and SE kernels against the empirical spectrum.]
[Figure: (l) airline passenger numbers (thousands) from 1949 to 1961: train and test data, 95% credible region, and predictions using PER, SE, RQ, MA, SM, and SSGP. (m) Log spectral densities of the SM and SE kernels against the empirical spectrum.]
◮ Expressive kernels will be most valuable on large datasets.
◮ Computational bottlenecks for GPs:
◮ Inference: computing (Kθ + σ²I)⁻¹y for an n × n matrix Kθ.
◮ Learning: computing log|Kθ + σ²I| for the marginal likelihood evaluations needed to train hyperparameters θ.
◮ Both inference and learning naively require O(n³) operations and O(n²) storage.
◮ Approximate non-parametric kernels in a finite basis ‘dual space’ (e.g. random Fourier features).
◮ Inducing point based sparse approximations. Examples: SoR, FITC.
◮ Exploit existing structure in K to quickly (and exactly) solve linear systems and compute log determinants (e.g. Kronecker and Toeplitz methods).
◮ Return to Bochner's theorem: k(τ) = ∫ S(s) e^(2πis·τ) ds.
◮ We can treat S(s) as a probability distribution and sample from it, to approximate the kernel by Monte Carlo.
◮ It is a valid Monte Carlo procedure to sample the pairs {sj, −sj} from S(s):
k(τ) ≈ (1/J) Σ_{j=1}^J (1/2)[exp(2πi sjᵀτ) + exp(−2πi sjᵀτ)] = (1/J) Σ_{j=1}^J cos(2π sjᵀτ).
◮ This is exactly the covariance function we get if we use a linear basis function model with trigonometric basis functions (a sketch follows).
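A sketch of this Monte Carlo approximation for the RBF kernel, whose spectral density under the convention above is Gaussian with standard deviation 1/(2πℓ):

```python
import numpy as np

def rff_kernel_estimate(tau, ell=1.0, J=500, seed=0):
    """k(tau) ~= (1/J) sum_j cos(2 pi s_j tau), with s_j drawn from S(s)."""
    rng = np.random.default_rng(seed)
    s = rng.normal(0.0, 1.0 / (2 * np.pi * ell), size=J)   # spectral samples
    return np.cos(2 * np.pi * s[None, :] * np.asarray(tau)[:, None]).mean(1)

tau = np.linspace(0, 3, 7)
exact = np.exp(-tau**2 / 2)   # RBF kernel with ell = 1
print(np.abs(rff_kernel_estimate(tau) - exact).max())   # small MC error
```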
◮ Gaussian process f and f∗ evaluated at n training points and J testing points, respectively.
◮ m ≪ n inducing points u, with p(u) = N(0, Ku,u).
◮ p(f∗, f) = ∫ p(f∗, f|u) p(u) du.
◮ Assume that f and f∗ are conditionally independent given u: p(f∗, f) ≈ q(f∗, f) = ∫ q(f∗|u) q(f|u) p(u) du.
◮ Exact conditional distributions:
p(f|u) = N(Kf,u Ku,u⁻¹ u, Kf,f − Qf,f), p(f∗|u) = N(Kf∗,u Ku,u⁻¹ u, Kf∗,f∗ − Qf∗,f∗), where Qa,b := Ka,u Ku,u⁻¹ Ku,b.
◮ Cost for predictions is reduced from O(n³) to O(m²n), where m ≪ n (see the sketch below).
◮ Different inducing point approaches correspond to different additional assumptions on q(f|u) and q(f∗|u).
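A sketch of the resulting O(m²n) predictive mean for an SoR-style approximation, using the standard identity μ∗ = K∗,u(σ²Ku,u + Ku,f Kf,u)⁻¹Ku,f y (toy data and names are illustrative):

```python
import numpy as np

def inducing_predict_mean(X, y, Xs, U, kernel, sigma=0.1):
    """Predictive mean using m inducing points: O(m^2 n) instead of O(n^3)."""
    Kuf = kernel(U[:, None], X[None, :])    # m x n
    Kus = kernel(U[:, None], Xs[None, :])   # m x n*
    Kuu = kernel(U[:, None], U[None, :])    # m x m
    A = sigma**2 * Kuu + Kuf @ Kuf.T        # only an m x m system to solve
    return Kus.T @ np.linalg.solve(A, Kuf @ y)

kernel = lambda x, xp: np.exp(-(x - xp)**2 / 2)
X = np.linspace(-3, 3, 500)
y = np.sin(X) + 0.1 * np.random.randn(500)
U = np.linspace(-3, 3, 20)                  # m = 20 inducing points
print(inducing_predict_mean(X, y, np.array([0.0, 1.0]), U, kernel))
```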
◮ SoR (subset of regressors) assumes deterministic conditionals:
qSoR(f|u) = N(Kf,u Ku,u⁻¹ u, 0), qSoR(f∗|u) = N(Kf∗,u Ku,u⁻¹ u, 0),
so every covariance is replaced by Qa,b = Ka,u Ku,u⁻¹ Ku,b.
◮ Equivalently, SoR uses the degenerate kernel kSoR(xi, xj) = k(xi, U) K_{U,U}⁻¹ k(U, xj).
K_{n×n} ≈ Q_{n×n} = K_{n×m} K_{m×m}⁻¹ K_{m×n}
◮ For m < n, this is a low rank covariance matrix, corresponding to a degenerate (finite basis) Gaussian process.
◮ As a result, for n large, SoR tends to underestimate uncertainty.
◮ Under SoR, the prior variance at a test point z is kSoR(z, z) = k(z, U) K_{U,U}⁻¹ K_{U,z} ≤ k(z, z), which is why uncertainty is underestimated; FITC restores the exact diagonal, qFITC(f|u) = N(Kf,u Ku,u⁻¹ u, diag[Kf,f − Qf,f]).
◮ If x ∈ R^P and the kernel decomposes as a product of kernels across input dimensions, k(xi, xj) = Π_{p=1}^P k^p(xi^p, xj^p) (e.g., the RBF kernel has this property), and
◮ the inputs x ∈ X lie on a multidimensional grid, X = X1 × · · · × XP ⊂ R^P, then
◮ K decomposes into a Kronecker product of matrices over each input dimension, K = K1 ⊗ · · · ⊗ KP, and
◮ the eigendecomposition of K into QVQᵀ also decomposes: Q = Q1 ⊗ · · · ⊗ QP, V = V1 ⊗ · · · ⊗ VP.
◮ Exploiting this structure, we can compute (K + σ²I)⁻¹y and log|K + σ²I| exactly and efficiently for N datapoints (a sketch follows).
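A sketch of a Kronecker-structured solve of (K + σ²I)x = y: eigendecompose each small K_p and apply the Kronecker matrix-vector products dimension by dimension, never forming the full N × N matrix (the helper names are my own):

```python
import numpy as np

def kron_mvprod(As, x):
    """Compute (A1 kron ... kron AP) x without forming the Kronecker product."""
    T = x.reshape([A.shape[0] for A in As])
    for p, A in enumerate(As):
        # contract A with the p-th tensor axis, then restore the axis order
        T = np.moveaxis(np.tensordot(A, T, axes=([1], [p])), 0, p)
    return T.reshape(-1)

def kron_solve(Ks, sigma2, y):
    """Solve (K1 kron ... kron KP + sigma^2 I) x = y via the per-dimension
    eigendecompositions K_p = Q_p V_p Q_p^T."""
    eigs = [np.linalg.eigh(K) for K in Ks]
    lam = eigs[0][0]
    for l, _ in eigs[1:]:
        lam = np.kron(lam, l)                    # eigenvalues of kron(K_p)
    Qs = [Q for _, Q in eigs]
    z = kron_mvprod([Q.T for Q in Qs], y)        # rotate into the eigenbasis
    return kron_mvprod(Qs, z / (lam + sigma2))   # scale and rotate back

# 30 x 40 grid: a 1200-point GP solved with only 30x30 and 40x40 eigensolves
x1, x2 = np.linspace(0, 1, 30), np.linspace(0, 1, 40)
K1 = np.exp(-(x1[:, None] - x1[None, :])**2 / 0.1)
K2 = np.exp(-(x2[:, None] - x2[None, :])**2 / 0.1)
x = kron_solve([K1, K2], 0.01, np.random.randn(30 * 40))
```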
◮ We assumed that the inputs x ∈ X are on a multidimensional grid.
◮ How might we relax this assumption, to use Kronecker methods if there are gaps (missing observations) in the grid?
◮ Assume imaginary points that complete the grid.
◮ Place infinite noise on these points so they have no effect on inference.
◮ The relevant matrices are no longer Kronecker, but we can get around this.
◮ Assuming we have a dataset of M observations which are not necessarily on a grid, complete the grid with W imaginary observations yW.
◮ The total observation vector y = [yM, yW]ᵀ has N = M + W entries.
◮ The imaginary observations yW have no corrupting effect on inference: they are given infinite noise variance.
◮ We use preconditioned conjugate gradients to compute (KN + DN)⁻¹y, where DN contains the noise variances (infinite for the imaginary points).
◮ For the log determinant in the marginal likelihood (used in hyperparameter learning), we approximate the eigenvalues λᵢᴹ of KM using the largest M eigenvalues λᵢᴺ of KN:
log|KM + σ²IM| = Σ_{i=1}^M log(λᵢᴹ + σ²) ≈ Σ_{i=1}^M log(λ̃ᵢᴹ + σ²), where λ̃ᵢᴹ = (M/N) λᵢᴺ for i = 1, . . . , M (a numerical check follows).
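A numerical sketch of this eigenvalue approximation; here K_N is formed densely just to check it, whereas in practice the λᴺ come cheaply from Kronecker structure:

```python
import numpy as np

M, N, sigma2 = 80, 100, 0.1
x_grid = np.linspace(0, 10, N)                         # the completed grid
keep = np.sort(np.random.choice(N, M, replace=False))  # M observed points
K_N = np.exp(-(x_grid[:, None] - x_grid[None, :])**2 / 2)
K_M = K_N[np.ix_(keep, keep)]

lam_N = np.linalg.eigvalsh(K_N)[::-1]          # eigenvalues, largest first
lam_approx = (M / N) * lam_N[:M]               # scaled top-M eigenvalues of K_N
approx = np.log(lam_approx + sigma2).sum()
exact = np.log(np.linalg.eigvalsh(K_M) + sigma2).sum()
print(approx, exact)                           # close for M/N near 1
```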
◮ The spectral mixture kernel, in its standard form, does not quite have product structure across input dimensions.
◮ Introduce a spectral mixture product (SMP) kernel, which takes a product of one-dimensional SM kernels across input dimensions: kSMP(τ|θ) = Π_{p=1}^P kSM(τp|θp).
◮ Observations y(x) ∼ N(y(x); f(x), σ²).
◮ f(x) ∼ GP(0, kSMP(x, x′|θ)).
◮ kSMP(x, x′|θ) can approximate many different kernels with different settings of its hyperparameters θ.
◮ Learning involves training these hyperparameters through maximum marginal likelihood optimization, balancing model fit against a complexity penalty, as before.
◮ Once hyperparameters are trained as θ̂, we make predictions using the predictive distribution p(f∗|y, θ̂).
◮ Exploit Kronecker structure for fast exact inference and learning (with imaginary observations for non-grid data), requiring O(PN^((P+1)/P)) operations and O(PN^(2/P)) storage, for N datapoints and P input dimensions.
[Figure: pattern extrapolation. (a) Train, (b) test, and (c) full images, with reconstructions by (d) GPatt, (e) SSGP, (f) FITC, (g) GP-SE, (h) GP-MA, and (i) GP-RQ.]
[Figure: further pattern extrapolations on two training images, (a) and (d), comparing GPatt, (b) and (e), against GP-MA, (c) and (f).]
◮ Simple initialisation.
◮ The marginal likelihood shrinks the weights of extraneous components to zero.
[Figure: (a)–(i) pattern extrapolation as above; (j) learned versus initial spectral mixture weights and means (w1, μ1, w2, μ2) under the GPatt initialisation; (k)–(p) further train/GPatt/GP-MA comparisons.]
[Figure: texture patterns: (a) rubber mat, (b) tread plate, (c) pores, (d) wood, (e) chain mail, (f) cone.]
[Figure: (a) runtime stress test; (b) accuracy stress test.]
◮ GPatt makes almost no assumptions about the correlation structures in the data; they are learned automatically.
◮ Top row: true frames taken from the middle of a movie. Bottom row: frames predicted by GPatt.
◮ 112,500 datapoints. GPatt training time is under 5 minutes.
◮ Train using 9 years of temperature data. First two rows show the last 12 months of data.
◮ Predictions using GP-SE (a GP with an SE, or RBF, kernel).

[Figure: land surface temperature maps, colour scale −40 to 40.]
◮ Train using 9 years of temperature data. First two rows show the last 12 months of data.
◮ Predictions using GPatt. Training time < 30 minutes.

[Figure: land surface temperature maps, colour scale −40 to 40.]
[Figure: (a) the learned GPatt kernel for the temperature data, showing correlations against time (months), X (km), and Y (km); (b) the learned GP-SE kernel for the same data.]

◮ The learned GPatt kernel tells us interesting properties of the data, capturing structure that a simple SE kernel cannot.