Non-Gaussian likelihoods for Gaussian Processes - Alan Saul



SLIDE 1

Non-Gaussian likelihoods for Gaussian Processes

Alan Saul

SLIDE 2

Outline

◮ Motivation
◮ Non-Gaussian posteriors
◮ Approximate methods
◮ Laplace approximation
◮ Variational Bayes
◮ Expectation propagation
◮ Comparisons

SLIDE 3

GP regression - recap so far

Model the observations as a distorted version of the process fi = f(xi):

yi ∼ N(f(xi), σ²)

f is a non-linear function; in our case we assume it is latent, and assign it a Gaussian process prior.

[Figure: realizations of f(x), observations, and 95% credible intervals for p(y*|y)]

SLIDE 4

GP regression setting

So far we have assumed that the latent values, f, have been corrupted by Gaussian noise. Everything remains analytically tractable.

Gaussian prior: p(f) = N(f|0, Kff), i.e. f ∼ GP(0, Kff)

Gaussian likelihood: y ∼ N(f, σ²I), so p(y|f) = ∏_{i=1}^n p(yi|fi)

Gaussian posterior: p(f|y) ∝ N(y|f, σ²I) N(f|0, Kff)
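The tractable Gaussian case above can be sketched numerically. A minimal sketch, assuming a squared-exponential (RBF) kernel, a zero mean, and an illustrative noise level (none of these are fixed by the slide); `rbf` and `gp_posterior` are hypothetical helper names:

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(a, b) = variance * exp(-(a-b)^2 / (2 l^2))."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(x, y, xs, noise=0.1):
    """Exact posterior mean/variance of f* at test inputs xs: pure Gaussian algebra."""
    K = rbf(x, x) + noise * np.eye(len(x))       # Kff + sigma^2 I
    Ks = rbf(x, xs)                              # Kf*
    Kss = rbf(xs, xs)                            # K**
    alpha = np.linalg.solve(K, y)
    mean = Ks.T @ alpha                          # E[f* | y]
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)    # Cov[f* | y]
    return mean, np.diag(cov)

x = np.linspace(0, 5, 10)
y = np.sin(x) + 0.1 * np.random.default_rng(0).normal(size=10)
mu, v = gp_posterior(x, y, np.linspace(0, 5, 50))
```

Everything here is a closed-form linear-algebra computation, which is exactly the tractability the non-Gaussian likelihoods below break.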
SLIDE 5

Outline

◮ Motivation
◮ Non-Gaussian posteriors
◮ Approximate methods
◮ Laplace approximation
◮ Variational Bayes
◮ Expectation propagation
◮ Comparisons

SLIDE 6

Motivation

◮ You have been given some data you wish to model.
◮ You believe that the observations are connected through some underlying unknown function.
◮ You know, from your understanding of the data-generation process, that the observations are not Gaussian.
◮ You still want to learn, as well as possible, the unknown function, and make predictions.

[Figure: example datasets with Poisson, Log-Gaussian, and Bernoulli observations]

SLIDE 7-10

Likelihood

◮ p(y|f) is the probability that we would see some random variables, y, if we knew the latent function values f, which act as parameters.
◮ Given that the observed values of y are fixed, it can also be seen as the likelihood that some latent function values, f, would give rise to the observed values of y. Note this is a function of f, and does not integrate to 1 in f.
◮ Often observations are not generated by simple Gaussian corruption of the underlying latent function, f.
◮ In the case of count data, binary data, etc., we need to choose a different likelihood function.

SLIDE 11

Likelihood

p(y|f) as a function of y, with fixed f

[Figure: six likelihoods at fixed f: Gaussian N(y|µ = f = 2, σ = 3), Log-Gaussian LG(y|µ = f = 2, σ = 0.7), Student-t t(y|µ = f = 2, σ = 3, df = 4), Beta Be(y|a = f = 2, b = 1.6), Bernoulli B(y|p = f = 0.3), Poisson P(y|λ = f = 2)]

SLIDE 12

Likelihood

p(y|f) as a function of f, with fixed y

[Figure: the same six likelihoods as functions of f at fixed y: Gaussian N(y = 3|µ = f, σ = 2), Log-Gaussian LG(y = 3|µ = f, σ = 0.7), Student-t t(y = 3|µ = f, σ = 3, df = 4), Beta Be(y = 0.3|a = f, b = 1.6), Bernoulli B(y = 1|p = f), Poisson P(y = 3|λ = f)]

SLIDE 13

Binary example

◮ Binary outcomes for yi, yi ∈ {0, 1}.
◮ Model the probability that yi = 1 with a transformation of a GP, using a Bernoulli likelihood.
◮ The probability of a 1 must lie between 0 and 1, so use a squashing transformation, λ(fi) = Φ(fi).

p(yi|λ(fi)) = λ(fi) if yi = 1, and 1 − λ(fi) if yi = 0

[Figure: realizations of f(x), and realizations of Φ(f(x)) with binary observations]
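The generative story on this slide can be sketched directly: draw f from a GP prior, squash through Φ, then sample Bernoulli outcomes. A minimal sketch assuming an RBF kernel with an illustrative lengthscale (the slide does not specify either):

```python
import numpy as np
from math import erf

def probit(f):
    """Squashing transformation λ(f) = Φ(f), the standard normal CDF."""
    return 0.5 * (1.0 + erf(f / np.sqrt(2.0)))

rng = np.random.default_rng(1)
x = np.linspace(-20, 20, 200)
# one realization of f from a zero-mean GP prior (RBF kernel assumed)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 5.0**2) + 1e-8 * np.eye(len(x))
f = rng.multivariate_normal(np.zeros(len(x)), K)
p = np.array([probit(fi) for fi in f])   # Φ(f(x)): squashed into (0, 1)
y = rng.binomial(1, p)                   # Bernoulli observations
```

Whatever values the latent draw f takes, Φ maps them into valid probabilities, which is the whole point of the squashing transformation.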

SLIDE 14

Count data example

◮ Only non-negative, discrete values for yi, yi ∈ ℕ.
◮ Model the rate or intensity, λ, of events with a transformation of a Gaussian process.
◮ The rate parameter must remain positive, so use a transformation that maintains positivity: λ(fi) = exp(fi) or λ(fi) = fi².

yi ∼ Poisson(yi|λi = λ(fi)), where Poisson(yi|λi) = λi^{yi} e^{−λi} / yi!

[Figure: realizations of exp(f(x)), observations, and 95% credible intervals for p(y*|y)]
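The same generative sketch for counts, using the exp link from the slide; as before the RBF kernel and its lengthscale are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 30, 150)
# one realization of f from a zero-mean GP prior (RBF kernel assumed)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 3.0**2) + 1e-8 * np.eye(len(x))
f = rng.multivariate_normal(np.zeros(len(x)), K)
lam = np.exp(f)            # exp link: the rate λ(f) = exp(f) stays positive
y = rng.poisson(lam)       # non-negative integer counts
```

The exp link guarantees a valid Poisson rate for any real-valued latent draw; λ(f) = f² would work equally well here.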

SLIDE 15

Application example

◮ Chicago crime counts.
◮ Same Poisson likelihood.
◮ 2D input to the kernel.

[Figure: map of estimated crime-count intensity over Chicago, on a longitude/latitude grid]

SLIDE 16

Outline

◮ Motivation
◮ Non-Gaussian posteriors
◮ Approximate methods
◮ Laplace approximation
◮ Variational Bayes
◮ Expectation propagation
◮ Comparisons

SLIDE 17

Non-Gaussian posteriors

◮ Exact computation of the posterior is no longer analytically tractable, because the non-Gaussian likelihood, p(y|f), is not conjugate to the Gaussian process prior.

p(f|y) = p(f) ∏_{i=1}^n p(yi|fi) / ∫ p(f) ∏_{i=1}^n p(yi|fi) df

Why is it so difficult?

SLIDE 18-21

Non-Gaussian posteriors illustrated

◮ Consider one observation, y1 = 1, at input x1.
◮ Can normalise easily with numerical integration, ∫ p(y1 = 1|λ(f1)) p(f1) df1.

[Figure sequence: draws of f from the prior, the squashed process λ(f), and finally the prior p(f1), likelihood p(y1 = 1|f1), and posterior p(f1|y1 = 1) at x1]
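The one-dimensional normalisation on these slides is easy to carry out on a grid. A sketch assuming a standard normal marginal prior p(f1) = N(0, 1) and the probit squashing Φ (the slides do not pin down the prior's variance); note that by symmetry ∫ Φ(f) N(f|0, 1) df = 1/2, so the normaliser should come out near 0.5:

```python
import numpy as np
from math import erf

def probit(f):
    return 0.5 * (1.0 + erf(f / np.sqrt(2.0)))

f1 = np.linspace(-7.5, 7.5, 2001)
df = f1[1] - f1[0]
prior = np.exp(-0.5 * f1**2) / np.sqrt(2.0 * np.pi)   # p(f1) = N(0, 1), assumed
lik = np.array([probit(v) for v in f1])               # p(y1 = 1 | f1) = Φ(f1)
Z = np.sum(lik * prior) * df                          # normalising constant ≈ 0.5
posterior = lik * prior / Z                           # p(f1 | y1 = 1) on the grid
```

With one (or two) latent values this grid approach is fine; it is the exponential growth of the grid with n that makes it hopeless for a full GP posterior.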

SLIDE 22-24

Non-Gaussian posteriors illustrated

◮ Now two observations, y1 = 1 and y2 = 1, at x1 and x2.
◮ Need to calculate the joint posterior, p(f|y) = p(f1, f2|y1 = 1, y2 = 1).
◮ Requires the 2D integral ∫∫ p(y1 = 1, y2 = 1|λ(f1), λ(f2)) p(f1, f2) df1 df2.

[Figure sequence: draws of f from the prior and the squashed process λ(f), with the two inputs x1 and x2 marked]

SLIDE 25-27

Non-Gaussian posteriors illustrated

◮ To find the true posterior values, we need to perform a two-dimensional integral.
◮ Still possible, but things are getting more difficult quickly.

[Figure sequence over (f1, f2): the prior p(f1, f2), the likelihood p(y1 = 1, y2 = 1|f1, f2), and the true posterior p(f1, f2|y1 = 1, y2 = 1)]

SLIDE 28

Approaches to handling non-Gaussian posteriors

Generally fall into two areas:

◮ Sampling methods that obtain samples of the posterior.
◮ Approximation of the posterior with something of known form.

Today we will focus on the latter.

SLIDE 29

Non-Gaussian posterior approximation

◮ Various methods make a Gaussian approximation, p(f|y) ≈ q(f) = N(f|µ = ?, C = ?).
◮ We only need to obtain an approximate posterior at the training locations.
◮ At test locations, the data affect the predictions only via the posterior at the training locations: p(f, f*|x*, x, y) = p(f*|f, x*) p(f|x, y)

SLIDE 30

Why do we want the posterior anyway?

The true posterior, a posterior approximation, or samples from it are needed to make predictions at new locations, x*:

p(f*|x*, x, y) = ∫ p(f*|f, x*) p(f|y, x) df

q(f*|x*, x, y) = ∫ p(f*|f, x*) q(f|x) df
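When q(f) is Gaussian, the second integral above is itself Gaussian and can be written in closed form. A sketch assuming q(f) = N(m, C), an RBF kernel, and a small jitter (`predict_from_q` is a hypothetical helper name):

```python
import numpy as np

def rbf(a, b, l=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / l**2)

def predict_from_q(x, xs, m, C, jitter=1e-6):
    """Approximate predictive q(f*) = ∫ p(f*|f, x*) q(f|x) df for Gaussian
    q(f) = N(m, C): the integral is Gaussian with the moments computed below."""
    K = rbf(x, x) + jitter * np.eye(len(x))     # Kff (with jitter)
    Ks = rbf(x, xs)                             # Kf*
    Kss = rbf(xs, xs)                           # K**
    A = np.linalg.solve(K, Ks).T                # A = K*f Kff^{-1}
    mean = A @ m                                # E[f*]
    cov = Kss - A @ Ks + A @ C @ A.T            # conditional cov + pushed-through C
    return mean, cov

x = np.linspace(0, 5, 8)
xs = np.linspace(0, 5, 20)
mean, cov = predict_from_q(x, xs, np.sin(x), 0.25 * np.eye(8))
```

The A C Aᵀ term is where the approximate posterior's uncertainty at the training points propagates to the test points; with C = 0 this reduces to the usual noiseless GP conditional.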
SLIDE 31

Outline

◮ Motivation
◮ Non-Gaussian posteriors
◮ Approximate methods
◮ Laplace approximation
◮ Variational Bayes
◮ Expectation propagation
◮ Comparisons

SLIDE 32

Methods overview

Given the choice of a Gaussian approximation to the posterior, how do we choose the parameter values µ and C? There are a number of different methods for setting the parameters of our Gaussian approximation.

SLIDE 33

Parameters' effect - mean

SLIDE 34

Parameters' effect - variance

SLIDE 35

How to choose the parameters?

Two approaches that we might take:

◮ Match the mean and variance at some point, for example the mode.
◮ Attempt to minimise some divergence measure between the approximate distribution and the true distribution.

◮ Laplace takes the former.
◮ Variational Bayes takes the latter.
◮ EP, loosely speaking, takes the latter.

SLIDE 36

Outline

◮ Motivation
◮ Non-Gaussian posteriors
◮ Approximate methods
◮ Laplace approximation
◮ Variational Bayes
◮ Expectation propagation
◮ Comparisons

SLIDE 37

Laplace approximation

Task: for some generic random variable, f, and data, y, find a good approximation to the difficult-to-compute posterior distribution, p(f|y).

Laplace approach: fit a Gaussian by matching the curvature at the mode of the posterior.

◮ Use a second-order Taylor expansion around the mode of the log-posterior.
◮ Use the expansion to find the equivalent Gaussian in probability space.

SLIDE 38

Laplace approximation

◮ The log of a Gaussian distribution, q(f) = N(f|µ, C), is a quadratic function of f.
◮ A second-order Taylor expansion approximates a function using only up-to-quadratic terms.
◮ The Laplace approximation expands the un-normalised log-posterior, and uses the expansion to set the linear and quadratic terms of log q(f).
◮ The first and second derivatives of the log-posterior at the mode therefore match the derivatives of the approximating Gaussian at the same point.

SLIDE 39

Second-order Taylor expansion

p(f|y) = (1/Z) h(f). In our case: h(f) = p(y|f) p(f).

[Figure: the un-normalised density h(f)]

SLIDE 40

Second-order Taylor expansion

log p(f|y) = log 1/Z + log h(f)

[Figure: log h(f)]

SLIDE 41

Second-order Taylor expansion

log p(f|y) = log 1/Z + log h(f)
           ≈ log 1/Z + log h(a) + d log h(a)/da (f − a) + ½ (f − a)ᵀ d² log h(a)/da² (f − a) + · · ·

SLIDE 42

Second-order Taylor expansion

≈ log 1/Z + log h(a) + d log h(a)/da (f − a) + ½ (f − a)ᵀ d² log h(a)/da² (f − a) + · · ·

We want to make the expansion around the mode, f̂, where the first derivative vanishes:

d log h(a)/da |_{a = f̂} = 0

SLIDE 43

Second-order Taylor expansion

log p(f|y) ≈ log 1/Z + log h(f̂) + d log h(f̂)/df̂ (f − f̂)

[Figure: log h(f) with its first-order Taylor expansion at the mode f̂]

SLIDE 44

Second-order Taylor expansion

log p(f|y) ≈ log 1/Z + log h(f̂) + d log h(f̂)/df̂ (f − f̂) + ½ (f − f̂)ᵀ d² log h(f̂)/df̂² (f − f̂)

[Figure: log h(f) with its first- and second-order Taylor expansions at the mode f̂]

SLIDE 45

Second-order Taylor expansion

log p(f|y) ≈ log 1/Z + log h(f̂) + d log h(f̂)/df̂ (f − f̂) + ½ (f − f̂)ᵀ d² log h(f̂)/df̂² (f − f̂) + · · ·

[Figure: log h(f) with its first-, second-, and third-order Taylor expansions at the mode f̂]

SLIDE 46

Second-order Taylor expansion

log p(f|y) ≈ log 1/Z + log h(f̂) + d log h(f̂)/df̂ (f − f̂) + ½ (f − f̂)ᵀ d² log h(f̂)/df̂² (f − f̂)

[Figure: log h(f) with only the second-order Taylor expansion at the mode f̂]

SLIDE 47

Second-order Taylor expansion

p(f|y) ≈ (1/Z) h(f̂) exp( −½ (f − f̂)ᵀ ( −d² log h(f̂)/df̂² ) (f − f̂) )

[Figure: h(f) against the exponential of the second-order Taylor expansion at f̂]

SLIDE 48

Second-order Taylor expansion

p(f|y) ≈ (1/Z) h(f̂) exp( −½ (f − f̂)ᵀ ( −d² log h(f̂)/df̂² ) (f − f̂) )
       = N( f | f̂, ( −d² log h(f̂)/df̂² )⁻¹ )

[Figure: h(f) and the Gaussian N( f | f̂, (−d² log h(f̂)/df̂²)⁻¹ ) around the mode f̂]
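The recipe above can be checked on a scalar example. A sketch using a hypothetical un-normalised density h(f) = f² e^{−f} (a Gamma(3, 1) shape, not from the slides), chosen because its mode and curvature are known exactly: f̂ = 2 and (−d² log h/df̂²)⁻¹ = 2:

```python
import numpy as np

# un-normalised log density log h(f) = 2 log f - f  (i.e. h(f) = f^2 exp(-f))
def log_h(f):
    return 2.0 * np.log(f) - f

fs = np.linspace(0.1, 10.0, 100001)
fhat = fs[np.argmax(log_h(fs))]          # locate the mode on a fine grid
eps = 1e-4                               # central finite difference for curvature
d2 = (log_h(fhat + eps) - 2.0 * log_h(fhat) + log_h(fhat - eps)) / eps**2
mu, var = fhat, -1.0 / d2                # Laplace: q(f) = N(fhat, (-d2 log h)^{-1})
```

The grid search and finite difference stand in for what, in the GP case, become Newton's method and the analytic Hessian of the next slides.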

SLIDE 49

Laplace approximation for Gaussian processes

In our case, h(f) = p(y|f) p(f), so we need to evaluate

−d² log h(f̂)/df̂² = −d²( log p(y|f̂) + log p(f̂) )/df̂² = −d² log p(y|f̂)/df̂² + K⁻¹ = W + K⁻¹,

giving a posterior approximation: p(f|y) ≈ q(f) = N( f | f̂, (W + K⁻¹)⁻¹ )
SLIDE 50

Laplace approximation - algorithm overview

◮ Find the mode, f̂, of the true log-posterior via Newton's method.
◮ Use a second-order Taylor expansion around this modal value.
◮ Form the Gaussian approximation by setting the mean equal to the posterior mode, f̂, and matching the curvature:
◮ p(f|y) ≈ q(f|µ, C) = N( f | f̂, (K⁻¹ + W)⁻¹ )
◮ W = −d² log p(y|f̂)/df̂²
◮ For factorizing likelihoods (most), W is diagonal.
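The steps above can be sketched end to end. A minimal sketch assuming a logistic (sigmoid) squashing rather than the probit Φ of the earlier slides, purely for algebraic convenience, with an RBF kernel and naive matrix inverses (a real implementation would use Cholesky factorisations):

```python
import numpy as np

def rbf(a, b, l=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / l**2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_mode(K, y, iters=100, tol=1e-10):
    """Newton's method for the mode of p(f|y) ∝ N(f|0, K) ∏_i Bern(y_i|sigmoid(f_i)).
    At the mode: d log p(y|f)/df - K^{-1} f = 0, i.e. y - sigmoid(f) = K^{-1} f."""
    f = np.zeros(len(y))
    Kinv = np.linalg.inv(K)
    for _ in range(iters):
        pi = sigmoid(f)
        grad = y - pi                      # d log p(y|f) / df
        W = pi * (1.0 - pi)                # W = -d^2 log p(y|f) / df^2 (diagonal)
        f_new = np.linalg.solve(Kinv + np.diag(W), W * f + grad)   # Newton step
        if np.max(np.abs(f_new - f)) < tol:
            f = f_new
            break
        f = f_new
    pi = sigmoid(f)
    W = pi * (1.0 - pi)
    return f, np.linalg.inv(Kinv + np.diag(W))   # q(f) = N(fhat, (K^{-1} + W)^{-1})

x = np.linspace(0, 5, 6)
K = rbf(x, x) + 1e-6 * np.eye(6)
y = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
fhat, C = laplace_mode(K, y)
```

Because the Bernoulli log-likelihood is log-concave, this Newton iteration converges to the unique posterior mode, and W is diagonal exactly as the slide notes.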

SLIDE 51-61

Visualization of Laplace

[Figure sequence, built up over slides 51-61: the log prior log p(f), the log likelihood log p(y = 4|λ(f)), and the log posterior log p(f|y = 4) are plotted over f1; the mode f̂ is located and the curvature is evaluated there]

SLIDE 62

Visualization of Laplace

[Figure: prior p(f), likelihood p(y = 4|λ(f)), posterior p(f|y = 4), the Laplace approximation q(f), and the mode f̂]

SLIDE 63

Visualization of Laplace - Bernoulli

[Figure: prior p(f), likelihood p(y = 1|λ(f)), posterior p(f|y = 1), the Laplace approximation q(f), and the mode f̂]

SLIDE 64

Outline

◮ Motivation
◮ Non-Gaussian posteriors
◮ Approximate methods
◮ Laplace approximation
◮ Variational Bayes
◮ Expectation propagation
◮ Comparisons

SLIDE 65

Variational Bayes (VB)

Task: for some generic random variable, z, and data, y, find a good approximation to the difficult-to-compute posterior distribution, p(z|y).

VB approach: minimise a divergence measure between an approximate posterior, q(z), and the true posterior, p(z|y).

◮ KL divergence, KL[q(z) ‖ p(z|y)].
◮ Minimize this with respect to the parameters of q(z).

slide-66
SLIDE 66

KL divergence

◮ Defined for any two distributions q(x) and p(x).
◮ KL[q(x) || p(x)] is the average additional amount of information lost when p(x) is used to approximate q(x). It is a measure of the divergence of one distribution from another.
◮ KL[q(x) || p(x)] = ∫ q(x) log (q(x) / p(x)) dx
◮ Always zero or positive, and not symmetric.
◮ Let's look at how it changes in response to changes in the approximating distribution.
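For two univariate Gaussians the KL divergence has a standard closed form, which is enough to reproduce the behaviour shown in the following plots:

```python
import numpy as np

# Closed-form KL divergence between two univariate Gaussians,
# KL[ N(mu_q, var_q) || N(mu_p, var_p) ].
def kl_gauss(mu_q, var_q, mu_p, var_p):
    return 0.5 * (np.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p
                  - 1.0)

# Zero only when the two distributions coincide...
print(kl_gauss(0.0, 1.0, 0.0, 1.0))   # 0.0
# ...growing as the mean of q(x) moves away from p(x) = N(0, 1)...
print(kl_gauss(-2.0, 1.0, 0.0, 1.0))  # 2.0
# ...and not symmetric in its two arguments:
print(kl_gauss(0.0, 0.3, 0.0, 1.0), kl_gauss(0.0, 1.0, 0.0, 0.3))
```

The asymmetry in the last line is exactly what distinguishes the VB and EP approximations later in the talk.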

slide-67
SLIDE 67

KL varying mean

[Figure: five panels showing q(x) ~ N(µ, 1) for µ = −2.0, −1.0, 0.0, 1.0, 2.0 against p(x) ~ N(0, 1), with KL[q(x) || p(x)] plotted as a function of µ; the divergence is zero at µ = 0 and grows as the means separate.]

slide-72
SLIDE 72

KL varying variance

[Figure: five panels showing q(x) ~ N(0, σ²) for σ² = 0.3, 0.725, 1.15, 1.575, 2.0 against p(x) ~ N(0, 1), with KL[q(x) || p(x)] plotted as a function of σ²; the divergence is zero at σ² = 1.]

slide-77
SLIDE 77

Variational Bayes

We don't have access to, or can't compute for computational reasons, p(z|y) or p(y), and hence KL[q(z) || p(z|y)]. How can we minimize something we can't compute?

◮ We can compute q(z) and p(y|z) for any z.
◮ q(z) is parameterised by 'variational parameters'.
◮ The true posterior, by Bayes' rule, is p(z|y) = p(y|z)p(z) / p(y).
◮ p(y) doesn't change when the variational parameters are changed.

slide-82
SLIDE 82

Variational Bayes - Derivation

KL[q(z) || p(z|y)] = ∫ q(z) log (q(z) / p(z|y)) dz

= ∫ q(z) [ log (q(z) / p(z)) − log p(y|z) + log p(y) ] dz

= KL[q(z) || p(z)] − ∫ q(z) log p(y|z) dz + log p(y)

Rearranging for log p(y):

log p(y) = ∫ q(z) log p(y|z) dz − KL[q(z) || p(z)] + KL[q(z) || p(z|y)]
slide-83
SLIDE 83

Variational Bayes - Derivation

log p(y) = ∫ q(z) log p(y|z) dz − KL[q(z) || p(z)] + KL[q(z) || p(z|y)]

≥ ∫ q(z) log p(y|z) dz − KL[q(z) || p(z)]

◮ The tractable terms give a lower bound on log p(y), since KL[q(z) || p(z|y)] is always non-negative.
◮ Adjust the variational parameters of q(z) to make the tractable terms as large as possible, and thus KL[q(z) || p(z|y)] as small as possible.
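The decomposition can be verified numerically on a toy problem. Assuming a discrete latent z in {0, 1} (so every quantity is exactly computable; the probabilities below are made up), the tractable terms never exceed log p(y), with equality exactly at the true posterior:

```python
import numpy as np

prior = np.array([0.5, 0.5])        # p(z)
lik = np.array([0.8, 0.1])          # p(y | z) for the observed y
evidence = np.sum(prior * lik)      # p(y)
posterior = prior * lik / evidence  # p(z | y) by Bayes' rule

def elbo(q):
    # The tractable terms: E_q[log p(y|z)] - KL[q(z) || p(z)]
    return np.sum(q * np.log(lik)) - np.sum(q * np.log(q / prior))

# The bound holds for any q(z), with equality when q(z) = p(z|y).
print(elbo(np.array([0.5, 0.5])), np.log(evidence))  # bound below log evidence
print(elbo(posterior), np.log(evidence))             # bound equals log evidence
```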

slide-84
SLIDE 84

VB optimisation illustration

slide-85
SLIDE 85

Variational Bayes for Gaussian processes

◮ Make a Gaussian approximation, q(f) = N(f | µ, C), as similar as possible to the true posterior, p(f|y).
◮ Treat µ and C as 'variational parameters', affecting the quality of the approximation.

KL[q(f) || p(f|y)] = ⟨log (q(f) / p(f|y))⟩_q(f)

= ⟨log (q(f) / p(f)) − log p(y|f) + log p(y)⟩_q(f)

= KL[q(f) || p(f)] − ⟨log p(y|f)⟩_q(f) + log p(y)

log p(y) = ⟨log p(y|f)⟩_q(f) − KL[q(f) || p(f)] + KL[q(f) || p(f|y)]

slide-86
SLIDE 86

Variational Bayes for Gaussian processes - bound

log p(y) = ⟨log p(y|f)⟩_q(f) − KL[q(f) || p(f)] + KL[q(f) || p(f|y)]

≥ ⟨log p(y|f)⟩_q(f) − KL[q(f) || p(f)]

◮ Adjust the variational parameters µ and C to make the tractable terms as large as possible, and thus KL[q(f) || p(f|y)] as small as possible.
◮ With a factorizing likelihood, ⟨log p(y|f)⟩_q(f) can be computed as a series of n one-dimensional integrals.
◮ In practice, the number of variational parameters can be reduced by reparameterizing C = (Kff − 2Λ)⁻¹, noting that the bound is constant in the off-diagonal terms of C.
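One common way to compute each of those n one-dimensional integrals is Gauss-Hermite quadrature. The sketch below assumes a Bernoulli likelihood with a probit link, p(yi|fi) = Φ(yi fi) with yi in {−1, +1}; the specific likelihood is an illustrative choice, not fixed by the bound itself.

```python
import numpy as np
from scipy.stats import norm

# <log p(y_i | f_i)>_{q(f_i)} with q(f_i) = N(mu, v), by Gauss-Hermite quadrature.
def expected_log_lik(y, mu, v, degree=50):
    # Probabilists' Gauss-Hermite nodes and weights; weights sum to sqrt(2*pi).
    x, w = np.polynomial.hermite_e.hermegauss(degree)
    f = mu + np.sqrt(v) * x  # change of variables to integrate against N(mu, v)
    return np.sum(w * norm.logcdf(y * f)) / np.sqrt(2.0 * np.pi)

# With a tiny variance this collapses to log Phi(y * mu); with a larger variance
# it is strictly smaller, since log Phi is concave (Jensen's inequality).
print(expected_log_lik(+1, 2.0, 1e-10), norm.logcdf(2.0))
print(expected_log_lik(+1, 2.0, 0.5))
```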

slide-87
SLIDE 87

VB optimisation illustration for Gaussian processes

slide-88
SLIDE 88

Outline

Motivation Non-Gaussian posteriors Approximate methods Laplace approximation Variational Bayes Expectation propagation Comparisons

slide-89
SLIDE 89

Expectation propagation

p(f|y) ∝ p(f) ∏_{i=1}^n p(yi|fi) ≈ q(f) = (1/Zep) p(f) ∏_{i=1}^n ti(fi | Z̃i, µ̃i, σ̃²i) = N(f | µ, Σ)

with ti = Z̃i N(fi | µ̃i, σ̃²i).

◮ Individual likelihood terms, p(yi|fi), are replaced by independent un-normalised 1D Gaussians, ti.
◮ Uses an iterative algorithm to update the ti's, to get a more and more accurate approximation.

slide-94
SLIDE 94

Expectation propagation

1. Remove one factor ti from the approximation q(f).
2. The approximate marginal q(fi) with the ti contribution removed is called the cavity distribution, q−i(fi).
3. Find the ti that minimises KL[p(yi|fi) q−i(fi)/Zi || q(fi)] by matching moments.
4. Repeat until convergence.

This approximately minimises KL[p(f|y) || q(f)] locally, but not globally.
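The moment-matching step can be done numerically for one site. The sketch below assumes a Bernoulli likelihood with a probit link, p(yi|fi) = Φ(yi fi), and made-up cavity parameters; for the probit case the matched moments also have known closed forms, which the comments use as a check.

```python
import numpy as np
from scipy.stats import norm

# Moments of the tilted distribution p(y_i|f_i) * q_{-i}(f_i) on a dense grid.
def tilted_moments(y, mu_cav, var_cav, n=200001):
    f = np.linspace(-12.0, 12.0, n)
    df = f[1] - f[0]
    # Un-normalised tilted distribution
    tilted = norm.cdf(y * f) * norm.pdf(f, mu_cav, np.sqrt(var_cav))
    Z = np.sum(tilted) * df                            # zeroth moment, Z_hat
    mu = np.sum(f * tilted) * df / Z                   # first moment, mu_hat
    var = np.sum(f ** 2 * tilted) * df / Z - mu ** 2   # central second moment
    return Z, mu, var

Z, mu_hat, var_hat = tilted_moments(+1, 0.0, 1.0)
# Probit closed forms: Z_hat = Phi(mu_cav / sqrt(1 + var_cav)) = 0.5 here,
# mu_hat = 1/sqrt(pi) ~ 0.564, var_hat = 1 - 1/pi ~ 0.682.
print(Z, mu_hat, var_hat)
```

The site parameters Z̃i, µ̃i, σ̃²i are then chosen so that the new marginal q(fi) has exactly these moments.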

slide-97
SLIDE 97

Expectation propagation - in math

Steps 1 & 2. First choose a local likelihood contribution, i, to leave out, and find the marginal cavity distribution:

q(f|y) ∝ p(f) ∏_{j=1}^n tj(fj) → p(f) ∏_{j≠i} tj(fj) → ∫ p(f) ∏_{j≠i} tj(fj) df_{j≠i} ∝ q−i(fi)

Step 3.1. Find the Gaussian closest to the tilted distribution:

q̂(fi) ≈ min KL[ p(yi|fi) q−i(fi) / Ẑi || N(fi | µ̂i, σ̂²i) ]

Step 3.2. Compute the parameters of ti(fi | Z̃i, µ̃i, σ̃²i) making the moments of q(fi) match those of Ẑi N(fi | µ̂i, σ̂²i).
slide-98
SLIDE 98

Outline

Motivation Non-Gaussian posteriors Approximate methods Laplace approximation Variational Bayes Expectation propagation Comparisons

slide-99
SLIDE 99

Comparing posterior approximations

[Figure: contour plots over (f1, f2) of the Gaussian prior p(f1, f2) and the likelihood p(y = 1 | f1, f2).]

◮ Gaussian prior between two function values {f1, f2}, at {x1, x2} respectively.
◮ Bernoulli likelihood, with y1 = 1 and y2 = 1.

slide-100
SLIDE 100

Comparing posterior approximations

[Figure: contour plots over (f1, f2) of the true posterior and the Laplace approximation.]

◮ p(f|y) = p(y|f)p(f) / p(y)
◮ The true posterior is non-Gaussian.
◮ Laplace approximates it with a Gaussian centred at the mode of the posterior.

slide-101
SLIDE 101

Comparing posterior approximations

[Figure: contour plots over (f1, f2) of the true posterior and the KL (VB) approximation.]

◮ The true posterior is non-Gaussian.
◮ VB approximates it with the Gaussian that has minimal KL divergence, KL[q(f) || p(f|y)].
◮ This leads to distributions that avoid regions in which p(f|y) is small.
◮ There is a large penalty for assigning density where there is none.

slide-102
SLIDE 102

Comparing posterior approximations

[Figure: contour plots over (f1, f2) of the true posterior and the EP approximation.]

◮ The true posterior is non-Gaussian.
◮ EP tends to put density where p(f|y) is large.
◮ It cares less about assigning density where there is none. This contrasts with the VB method.
slide-103
SLIDE 103

Comparing posterior marginal approximations

[Figure: marginals for f2 compared — the Laplace, EP, and KL (VB) approximations against the true posterior.]

◮ Laplace: a poor approximation.
◮ VB: avoids assigning density to areas where there is none, at the expense of areas where there is some (right tail).
◮ EP: assigns density to areas with density, at the expense of areas where there is none (left tail).

slide-104
SLIDE 104

Pros - Cons - When - Laplace

Laplace approximation

◮ Pros
  ◮ Simple to implement.
  ◮ Fast.
◮ Cons
  ◮ Poor approximation if the mode does not describe the posterior well, for example with a Bernoulli likelihood.
◮ When
  ◮ When the posterior is well characterized by its mode, for example with a Poisson likelihood.

slide-105
SLIDE 105

Pros - Cons - When - VB

Variational Bayes

◮ Pros
  ◮ Principled, in that we directly optimize a measure of divergence between the approximation and the true distribution.
  ◮ Lends itself to sparse extensions.
◮ Cons
  ◮ Requires factorizing likelihoods to avoid an n-dimensional integral.
  ◮ As seen, can underestimate the variance, i.e. become overconfident.
◮ When
  ◮ Applicable to a range of likelihoods.
  ◮ Might need care if you wish to be conservative with predictive uncertainty.

slide-106
SLIDE 106

Pros - Cons - When - EP

EP method

◮ Pros
  ◮ Very effective for certain likelihoods (classification).
  ◮ Also lends itself to sparse approximations.
◮ Cons
  ◮ The standard algorithm is slow, though it is possible to extend it to the sparse case.
  ◮ Not always guaranteed to converge.
  ◮ Can be brittle with initialisation and tricky to implement.
◮ When
  ◮ Binary data (Nickisch and Rasmussen, 2008; Kuß, 2006), perhaps with a truncated likelihood (censored data) (Vanhatalo et al., 2015).
  ◮ In conjunction with sparse methods.

slide-107
SLIDE 107

Pros - Cons - When - MCMC

MCMC methods

◮ Pros
  ◮ In the theoretical limit, gives the true distribution.
◮ Cons
  ◮ Can be very slow.
◮ When
  ◮ If time is not an issue, but exact accuracy is.
  ◮ If you are unsure whether a different approximation is appropriate, it can be used as a "ground truth".

slide-108
SLIDE 108

Conclusion

◮ Many real-world tasks require non-Gaussian observation models.
◮ Non-Gaussian likelihoods cause complications in applying our framework.
◮ There are several different ways to deal with the problem; many are based on Gaussian approximations.
◮ Different methods have their own advantages and disadvantages.

slide-109
SLIDE 109

Questions

Thanks for listening. Any questions?

slide-110
SLIDE 110

Bonus - Heteroscedastic likelihoods

◮ A likelihood whose parameters are governed by two latent functions, f and g.
◮ p(y|f, g) = N(y | µ = f, σ² = exp(g))
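Sampling from this likelihood makes the role of the two functions concrete. The particular f and g below are made-up stand-ins (in the model each would carry a GP prior); the exp link simply keeps the variance positive.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 200)
f = np.sin(3.0 * x)  # stand-in mean function
g = x                # stand-in log-variance function: noise grows with x

# Heteroscedastic Gaussian: p(y | f, g) = N(y | f, exp(g))
y = rng.normal(loc=f, scale=np.sqrt(np.exp(g)))
print(y.shape)
```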
slide-111
SLIDE 111

Bonus - non-Gaussian heteroscedastic likelihoods

[Figure: three panels comparing a standard Gaussian process, a heteroscedastic Gaussian, and a heteroscedastic Student-t fit to the same data.]

◮ A likelihood whose parameters are governed by two latent functions, f and g.
◮ p(y|f, g) = t(y | µ = f, σ² = exp(g), ν = 3.0)

slide-112
SLIDE 112

Bonus - non-Gaussian heteroscedastic likelihoods

[Figure: two spatial maps (contours over latitude/longitude) and two time series over 2006-2016, showing the spatial and temporal components of the intensity.]

◮ Λ(x, t) = λ1(x)µ1(t) + λ2(x)µ2(t)

slide-113
SLIDE 113

References I

Hensman, J., Matthews, A. G. D. G., and Ghahramani, Z. (2015). Scalable variational Gaussian process classification. In 18th International Conference on Artificial Intelligence and Statistics, pages 1-9, San Diego, California, USA.

Kuß, M. (2006). Gaussian Process Models for Robust Regression, Classification, and Reinforcement Learning. PhD thesis, TU Darmstadt.

Nickisch, H. and Rasmussen, C. E. (2008). Approximations for Binary Gaussian Process Classification. Journal of Machine Learning Research, 9:2035-2078.

Vanhatalo, J., Riihimäki, J., Hartikainen, J., Jylänki, P., Tolvanen, V., and Vehtari, A. (2015). GPstuff. http://mloss.org/software/view/451/.