
Non-Gaussian likelihoods for Gaussian Processes
Alan Saul



  1. Non-Gaussian posterior approximation
     ◮ Various methods make a Gaussian approximation, $p(f \mid y) \approx q(f) = \mathcal{N}(f \mid \mu = ?, C = ?)$.
     ◮ We only need to obtain an approximate posterior at the training locations.
     ◮ At test locations, the data only affects their probability via the posterior at the training locations:
       $p(f, f_* \mid x_*, x, y) = p(f_* \mid f, x_*)\, p(f \mid x, y)$

  2. Why do we want the posterior anyway?
     The true posterior, a posterior approximation, or samples are needed to make predictions at new locations, $x_*$:
     $p(f_* \mid x_*, x, y) = \int p(f_* \mid f, x_*)\, p(f \mid y, x)\, df$
     $q(f_* \mid x_*, x, y) = \int p(f_* \mid f, x_*)\, q(f \mid x)\, df$
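When $q(f) = \mathcal{N}(f \mid \mu, C)$ the second integral is available in closed form, because $p(f_* \mid f, x_*)$ is the usual Gaussian GP conditional. The following is a minimal sketch of that computation, not code from the slides. The squared-exponential kernel, the helper names `rbf_kernel` and `predict_from_gaussian_posterior`, and the jitter value are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs (assumed kernel)."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def predict_from_gaussian_posterior(x, x_star, mu, C, jitter=1e-8):
    """Mean and covariance of q(f*) = ∫ p(f* | f, x*) N(f | mu, C) df."""
    K = rbf_kernel(x, x) + jitter * np.eye(len(x))   # prior covariance at training inputs
    K_star = rbf_kernel(x_star, x)                   # cross-covariance, shape (n*, n)
    K_ss = rbf_kernel(x_star, x_star)                # prior covariance at test inputs
    A = np.linalg.solve(K, K_star.T).T               # A = K_* K^{-1} (K is symmetric)
    mean = A @ mu
    # Prior conditional covariance plus the uncertainty in f propagated through A.
    cov = K_ss - A @ K_star.T + A @ C @ A.T
    return mean, cov
```

Two sanity checks follow from the formula: setting `C` equal to the prior covariance returns the prior at the test points, and setting `C` to zero returns the noise-free GP interpolation of `mu`.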

  3. Outline
     Motivation; Non-Gaussian posteriors; Approximate methods; Laplace approximation; Variational Bayes; Expectation propagation; Comparisons

  4. Methods overview
     Given the choice of a Gaussian approximation of the posterior, how do we choose the parameter values $\mu$ and $C$? There are a number of different methods for setting the parameters of our Gaussian approximation.

  5. Parameter effect - mean

  6. Parameter effect - variance

  7. How to choose the parameters?
     Two approaches that we might take:
     ◮ Match the mean and variance at some point, for example the mode.
     ◮ Attempt to minimise some divergence measure between the approximate distribution and the true distribution.
     ◮ Laplace takes the former.
     ◮ Variational Bayes takes the latter.
     ◮ EP kind of takes the latter.


  9. Laplace approximation
     Task: for some generic random variable, $f$, and data, $y$, find a good approximation to a posterior distribution, $p(f \mid y)$, that is difficult to compute.
     Laplace approach: fit a Gaussian by matching the curvature at the modal point of the posterior.
     ◮ Use a second-order Taylor expansion around the mode of the log-posterior.
     ◮ Use the expansion to find an equivalent Gaussian in the probability space.

  10. Laplace approximation
     ◮ The log of a Gaussian distribution, $q(f) = \mathcal{N}(f \mid \mu, C)$, is a quadratic function of $f$.
     ◮ A second-order Taylor expansion approximates a function using terms up to quadratic order.
     ◮ The Laplace approximation expands the log of the un-normalised posterior, and then uses the expansion to set the linear and quadratic terms of $\log q(f)$.
     ◮ The first and second derivatives of the log-posterior, at the mode, will match the derivatives of the log of the approximate Gaussian at this same point.

  11. Second-order Taylor expansion
     $p(f \mid y) = \frac{1}{Z} h(f)$
     In our case: $h(f) = p(y \mid f)\, p(f)$
     [Figure: $h(f)$ plotted against $f$.]

  12. Second-order Taylor expansion
     $\log p(f \mid y) = \log \frac{1}{Z} + \log h(f)$
     [Figure: $\log h(f)$ plotted against $f$.]

  13. Second-order Taylor expansion
     $\log p(f \mid y) = \log \frac{1}{Z} + \log h(f)$
     $\approx \log \frac{1}{Z} + \log h(a) + \frac{d \log h(a)}{da}(f - a) + \frac{1}{2}(f - a)^\top \frac{d^2 \log h(a)}{da^2}(f - a) + \cdots$

  14. Second-order Taylor expansion
     $\log p(f \mid y) \approx \log \frac{1}{Z} + \log h(a) + \frac{d \log h(a)}{da}(f - a) + \frac{1}{2}(f - a)^\top \frac{d^2 \log h(a)}{da^2}(f - a) + \cdots$
     We want to make the expansion around the mode, $\hat{f}$, where the gradient vanishes:
     $\left. \frac{d \log h(a)}{da} \right|_{a = \hat{f}} = 0$

  15. Second-order Taylor expansion
     $\log p(f \mid y) \approx \log \frac{1}{Z} + \log h(\hat{f}) + \frac{d \log h(\hat{f})}{d\hat{f}}(f - \hat{f})$
     [Figure: $\log h(f)$ against $f$, with the mode $\hat{f}$ and the first-order Taylor expansion at $\hat{f}$.]

  16. Second-order Taylor expansion
     $\log p(f \mid y) \approx \log \frac{1}{Z} + \log h(\hat{f}) + \frac{d \log h(\hat{f})}{d\hat{f}}(f - \hat{f}) + \frac{1}{2}(f - \hat{f})^\top \frac{d^2 \log h(\hat{f})}{d\hat{f}^2}(f - \hat{f})$
     [Figure: $\log h(f)$ against $f$, with the mode $\hat{f}$ and the first- and second-order Taylor expansions at $\hat{f}$.]

  17. Second-order Taylor expansion
     $\log p(f \mid y) \approx \log \frac{1}{Z} + \log h(\hat{f}) + \frac{d \log h(\hat{f})}{d\hat{f}}(f - \hat{f}) + \frac{1}{2}(f - \hat{f})^\top \frac{d^2 \log h(\hat{f})}{d\hat{f}^2}(f - \hat{f}) + \cdots$
     [Figure: $\log h(f)$ against $f$, with the mode $\hat{f}$ and the first-, second- and third-order Taylor expansions at $\hat{f}$.]

  18. Second-order Taylor expansion
     $\log p(f \mid y) \approx \log \frac{1}{Z} + \log h(\hat{f}) + \frac{d \log h(\hat{f})}{d\hat{f}}(f - \hat{f}) + \frac{1}{2}(f - \hat{f})^\top \frac{d^2 \log h(\hat{f})}{d\hat{f}^2}(f - \hat{f})$
     [Figure: $\log h(f)$ against $f$, with the mode $\hat{f}$ and the second-order Taylor expansion at $\hat{f}$.]

  19. Second-order Taylor expansion
     $p(f \mid y) \approx \frac{1}{Z} h(\hat{f}) \exp\left( -\frac{1}{2}(f - \hat{f})^\top \left( -\frac{d^2 \log h(\hat{f})}{d\hat{f}^2} \right)(f - \hat{f}) \right)$
     [Figure: $h(f)$ against $f$, with the mode $\hat{f}$ and the exponentiated second-order Taylor expansion at $\hat{f}$.]

  20. Second-order Taylor expansion
     $p(f \mid y) \approx \frac{1}{Z} h(\hat{f}) \exp\left( -\frac{1}{2}(f - \hat{f})^\top \left( -\frac{d^2 \log h(\hat{f})}{d\hat{f}^2} \right)(f - \hat{f}) \right)$
     $= \mathcal{N}\left( f \,\middle|\, \hat{f}, \left( -\frac{d^2 \log h(\hat{f})}{d\hat{f}^2} \right)^{-1} \right)$
     [Figure: $h(f)$ against $f$, with the mode $\hat{f}$ and the resulting Gaussian approximation.]
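The recipe above can be tried on a one-dimensional toy problem. The sketch below is an illustration under stated assumptions rather than the model used for the slides' plots: it assumes a Poisson likelihood with an exponential link, a single observation `y`, and a zero-mean Gaussian prior with variance `prior_var`. The function name `laplace_1d` is hypothetical.

```python
import math

def laplace_1d(y=4, prior_var=4.0, iters=50):
    """Laplace approximation to p(f | y) for an assumed toy model:
    y ~ Poisson(exp(f)), f ~ N(0, prior_var).

    Up to a constant, log h(f) = y*f - exp(f) - f**2 / (2 * prior_var).
    """
    f = 0.0
    for _ in range(iters):
        grad = y - math.exp(f) - f / prior_var    # d log h / df
        hess = -math.exp(f) - 1.0 / prior_var     # d^2 log h / df^2 (always negative)
        f -= grad / hess                          # Newton step towards the mode
    # Variance of the Gaussian is the inverse of the negative curvature at the mode.
    variance = 1.0 / (math.exp(f) + 1.0 / prior_var)
    return f, variance

f_hat, var = laplace_1d()
print(f"mode = {f_hat:.4f}, variance = {var:.4f}")
```

Because the curvature of this log posterior is negative everywhere, the returned variance is always positive and smaller than the prior variance.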

  21. Laplace approximation for Gaussian processes
     In our case, $h(f) = p(y \mid f)\, p(f)$, so we need to evaluate
     $-\frac{d^2 \log h(\hat{f})}{d\hat{f}^2} = -\frac{d^2 \left( \log p(y \mid \hat{f}) + \log p(\hat{f}) \right)}{d\hat{f}^2}$
     $= -\frac{d^2 \log p(y \mid \hat{f})}{d\hat{f}^2} + K^{-1}$
     $\triangleq W + K^{-1}$
     giving a posterior approximation:
     $p(f \mid y) \approx q(f) = \mathcal{N}\left( f \mid \hat{f}, (W + K^{-1})^{-1} \right)$
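The $K^{-1}$ term comes from the GP prior. As a short worked step, assuming a zero-mean prior $p(f) = \mathcal{N}(f \mid 0, K)$ over the $n$ training locations (the zero mean and the symbol $n$ are assumptions made here for illustration):

```latex
\log p(f) = -\tfrac{1}{2} f^\top K^{-1} f - \tfrac{1}{2} \log |K| - \tfrac{n}{2} \log 2\pi
\quad\Longrightarrow\quad
\frac{d \log p(f)}{d f} = -K^{-1} f,
\qquad
-\frac{d^2 \log p(f)}{d f^2} = K^{-1}.
```

The prior curvature does not depend on $f$, so only the likelihood term $W$ has to be re-evaluated as the estimate of the mode changes.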

  22. Laplace approximation - algorithm overview
     ◮ Find the mode, $\hat{f}$, of the true log posterior, via Newton's method.
     ◮ Use a second-order Taylor expansion around this modal value.
     ◮ Form a Gaussian approximation, setting the mean equal to the posterior mode, $\hat{f}$, and matching the curvature.
     ◮ $p(f \mid y) \approx q(f \mid \mu, C) = \mathcal{N}\left( f \mid \hat{f}, (K^{-1} + W)^{-1} \right)$
     ◮ $W \triangleq -\frac{d^2 \log p(y \mid \hat{f})}{d\hat{f}^2}$
     ◮ For factorizing likelihoods (most), $W$ is diagonal.
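The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation behind the slides: it assumes a Bernoulli likelihood with a logistic link and labels in {0, 1}, takes plain undamped Newton steps, and uses the hypothetical names `sigmoid` and `laplace_gp_bernoulli`. It also rewrites $(K^{-1} + W)^{-1}$ as $(I + KW)^{-1}K$ so that $K$ is never inverted explicitly.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def laplace_gp_bernoulli(K, y, max_iters=100, tol=1e-10):
    """Laplace approximation q(f) = N(f | f_hat, (K^{-1} + W)^{-1}).

    Assumes p(y_i | f_i) = sigmoid(f_i)^y_i * (1 - sigmoid(f_i))^(1 - y_i), y_i in {0, 1}.
    """
    n = len(y)
    f = np.zeros(n)
    for _ in range(max_iters):
        pi = sigmoid(f)
        grad = y - pi                 # d log p(y | f) / df
        W = pi * (1.0 - pi)           # diagonal of -d^2 log p(y | f) / df^2
        # K * W scales column j of K by W[j], i.e. K @ diag(W).
        B = np.eye(n) + K * W
        # Newton update: f_new = (K^{-1} + W)^{-1} (W f + grad) = B^{-1} K (W f + grad)
        f_new = np.linalg.solve(B, K @ (W * f + grad))
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    pi = sigmoid(f)
    W = pi * (1.0 - pi)
    C = np.linalg.solve(np.eye(n) + K * W, K)   # (K^{-1} + W)^{-1}
    return f, C
```

At convergence the mode satisfies $\hat{f} = K \, \nabla \log p(y \mid \hat{f})$, which gives a cheap check on the result. A production implementation would typically use a more numerically careful factorisation and guard the Newton step; that is outside the scope of this sketch.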

  23-33. Visualization of Laplace
     [Figure, animation frames: the log prior $\log p(f)$ plotted against $f_1$; successive frames add the log likelihood for the observation $y = 4$, the log posterior $\log p(f \mid y = 4)$, an "Evaluate curvature" annotation, and the mode $\hat{f}$.]

  34. Visualization of Laplace
     [Figure: prior $p(f)$, likelihood for the observation $y = 4$, posterior $p(f \mid y = 4)$, Laplace approximation $q(f)$, and mode $\hat{f}$, plotted against $f_1$.]

  35. Visualization of Laplace - Bernoulli
     [Figure: prior $p(f)$, likelihood for the observation $y = 1$, posterior $p(f \mid y = 1)$, Laplace approximation $q(f)$, and mode $\hat{f}$, plotted against $f_1$.]


  37. Variational Bayes (VB)
     Task: for some generic random variable, $z$, and data, $y$, find a good approximation to a posterior distribution, $p(z \mid y)$, that is difficult to compute.
     VB approach: minimise a divergence measure between an approximate posterior, $q(z)$, and the true posterior, $p(z \mid y)$.
     ◮ KL divergence, $\mathrm{KL}\left( q(z) \,\|\, p(z \mid y) \right)$.
     ◮ Minimise this with respect to the parameters of $q(z)$.

  38. KL divergence
     ◮ General for any two distributions $q(x)$ and $p(x)$.
     ◮ $\mathrm{KL}\left( q(x) \,\|\, p(x) \right)$ is the average additional amount of information lost when $p(x)$ is used to approximate $q(x)$. It is a measure of the divergence from one distribution to another.
     ◮ $\mathrm{KL}\left( q(x) \,\|\, p(x) \right) = \left\langle \log \frac{q(x)}{p(x)} \right\rangle_{q(x)}$
     ◮ Always 0 or positive, not symmetric.
     ◮ Let's look at how it changes in response to changes in the approximating distribution.
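For two univariate Gaussians the expectation has a closed form, which makes these properties easy to check numerically. The sketch below uses the standard result $\mathrm{KL} = \frac{1}{2}\left[ \log \frac{\sigma_p^2}{\sigma_q^2} + \frac{\sigma_q^2 + (\mu_q - \mu_p)^2}{\sigma_p^2} - 1 \right]$; the function name `kl_gauss` and the chosen parameter values are illustrative assumptions.

```python
import math

def kl_gauss(mu_q, var_q, mu_p=0.0, var_p=1.0):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for univariate Gaussians."""
    return 0.5 * (
        math.log(var_p / var_q)
        + (var_q + (mu_q - mu_p) ** 2) / var_p
        - 1.0
    )

# Varying the mean of q with unit variance, against p = N(0, 1).
for mu in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"mu = {mu:+.1f}  KL = {kl_gauss(mu, 1.0):.3f}")

# Asymmetry: swapping the roles of q and p changes the value.
print(kl_gauss(0.0, 2.0, 0.0, 1.0), kl_gauss(0.0, 1.0, 0.0, 2.0))
```

With equal variances the divergence grows quadratically in the mean offset and is zero only when the two distributions coincide.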

  39-43. KL varying mean
     [Figure, animation frames: top panel shows the pdfs of $q(x) = \mathcal{N}(\mu, \sigma^2 = 1.0)$ and $p(x) = \mathcal{N}(0.0, 1.0)$ against $x$; bottom panel shows $\mathrm{KL}[q(x) \,\|\, p(x)]$ against $\mu$. Frames for $\mu = -2.0, -1.0, 0.0, 1.0, 2.0$.]

  44. 0.8 0.8 0.8 0.8 q(x) ~ N ( µ =0 . 0 ,σ 2 =1 . 575) q(x) ~ N ( µ =0 . 0 ,σ 2 =0 . 725) q(x) ~ N ( µ =0 . 0 ,σ 2 =1 . 15) q(x) ~ N ( µ =0 . 0 ,σ 2 =2 . 0) 0.7 0.7 0.7 0.7 p(x) ~ N (0 . 0 , 1 . 0) p(x) ~ N (0 . 0 , 1 . 0) p(x) ~ N (0 . 0 , 1 . 0) p(x) ~ N (0 . 0 , 1 . 0) 0.6 0.6 0.6 0.6 0.5 0.5 0.5 0.5 pdf pdf pdf pdf 0.4 0.4 0.4 0.4 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0.0 0.0 0.0 0.0 6 6 6 6 4 4 4 4 2 2 2 2 0 0 0 0 2 2 2 2 4 4 4 4 6 6 6 6 x x x x 0.5 0.5 0.5 0.5 0.4 0.4 0.4 0.4 KL [ q ( x ) || p ( x )] KL [ q ( x ) || p ( x )] KL [ q ( x ) || p ( x )] KL [ q ( x ) || p ( x )] 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0.0 0.0 0.0 0.0 0.4 0.4 0.4 0.4 0.6 0.6 0.6 0.6 0.8 0.8 0.8 0.8 1.0 1.0 1.0 1.0 1.2 1.2 1.2 1.2 1.4 1.4 1.4 1.4 1.6 1.6 1.6 1.6 1.8 1.8 1.8 1.8 2.0 2.0 2.0 2.0 σ 2 σ 2 σ 2 σ 2 KL varying variance 0.8 q(x) ~ N ( µ =0 . 0 ,σ 2 =0 . 3) 0.7 p(x) ~ N (0 . 0 , 1 . 0) 0.6 0.5 pdf 0.4 0.3 0.2 0.1 0.0 6 4 2 0 2 4 6 x 0.5 0.4 KL [ q ( x ) || p ( x )] 0.3 0.2 0.1 0.0 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 σ 2

  45. 0.8 0.8 0.8 0.8 q(x) ~ N ( µ =0 . 0 ,σ 2 =1 . 575) q(x) ~ N ( µ =0 . 0 ,σ 2 =1 . 15) q(x) ~ N ( µ =0 . 0 ,σ 2 =0 . 3) q(x) ~ N ( µ =0 . 0 ,σ 2 =2 . 0) 0.7 0.7 0.7 0.7 p(x) ~ N (0 . 0 , 1 . 0) p(x) ~ N (0 . 0 , 1 . 0) p(x) ~ N (0 . 0 , 1 . 0) p(x) ~ N (0 . 0 , 1 . 0) 0.6 0.6 0.6 0.6 0.5 0.5 0.5 0.5 pdf pdf pdf pdf 0.4 0.4 0.4 0.4 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0.0 0.0 0.0 0.0 6 6 6 6 4 4 4 4 2 2 2 2 0 0 0 0 2 2 2 2 4 4 4 4 6 6 6 6 x x x x 0.5 0.5 0.5 0.5 0.4 0.4 0.4 0.4 KL [ q ( x ) || p ( x )] KL [ q ( x ) || p ( x )] KL [ q ( x ) || p ( x )] KL [ q ( x ) || p ( x )] 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0.0 0.0 0.0 0.0 0.4 0.4 0.4 0.4 0.6 0.6 0.6 0.6 0.8 0.8 0.8 0.8 1.0 1.0 1.0 1.0 1.2 1.2 1.2 1.2 1.4 1.4 1.4 1.4 1.6 1.6 1.6 1.6 1.8 1.8 1.8 1.8 2.0 2.0 2.0 2.0 σ 2 σ 2 σ 2 σ 2 KL varying variance 0.8 q(x) ~ N ( µ =0 . 0 ,σ 2 =0 . 725) 0.7 p(x) ~ N (0 . 0 , 1 . 0) 0.6 0.5 pdf 0.4 0.3 0.2 0.1 0.0 6 4 2 0 2 4 6 x 0.5 0.4 KL [ q ( x ) || p ( x )] 0.3 0.2 0.1 0.0 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 σ 2

  46. KL varying variance [Figure: top panel shows the pdfs of $q(x) \sim \mathcal{N}(\mu = 0.0, \sigma^2 = 1.15)$ and $p(x) \sim \mathcal{N}(0.0, 1.0)$ against $x$; bottom panel shows $\mathrm{KL}[q(x) \,\|\, p(x)]$ against $\sigma^2$.]

  47. KL varying variance [Figure: top panel shows the pdfs of $q(x) \sim \mathcal{N}(\mu = 0.0, \sigma^2 = 1.575)$ and $p(x) \sim \mathcal{N}(0.0, 1.0)$ against $x$; bottom panel shows $\mathrm{KL}[q(x) \,\|\, p(x)]$ against $\sigma^2$.]

  48. KL varying variance [Figure: top panel shows the pdfs of $q(x) \sim \mathcal{N}(\mu = 0.0, \sigma^2 = 2.0)$ and $p(x) \sim \mathcal{N}(0.0, 1.0)$ against $x$; bottom panel shows $\mathrm{KL}[q(x) \,\|\, p(x)]$ against $\sigma^2$.]
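
The curve in the lower panel of these slides can be reproduced from the closed-form KL divergence between two univariate Gaussians, $\mathrm{KL}[q \,\|\, p] = \tfrac{1}{2}\left[\log\frac{\sigma_p^2}{\sigma_q^2} + \frac{\sigma_q^2 + (\mu_q - \mu_p)^2}{\sigma_p^2} - 1\right]$. The sketch below is only an illustration: the function name `kl_gauss` is hypothetical, and the variances are the five values shown in the slide legends.

```python
import math


def kl_gauss(mu_q, var_q, mu_p=0.0, var_p=1.0):
    """KL[q || p] for univariate Gaussians q = N(mu_q, var_q) and p = N(mu_p, var_p)."""
    return 0.5 * (
        math.log(var_p / var_q)
        + (var_q + (mu_q - mu_p) ** 2) / var_p
        - 1.0
    )


# Variances of q(x) taken from the slide legends; p(x) is N(0, 1) throughout.
variances = [0.3, 0.725, 1.15, 1.575, 2.0]
kl_values = [kl_gauss(0.0, v) for v in variances]

for v, kl in zip(variances, kl_values):
    print(f"sigma^2 = {v:5.3f}   KL[q || p] = {kl:.4f}")
```

The divergence is zero only when the variances agree, and it is asymmetric around $\sigma^2 = 1$: shrinking the variance of $q$ is penalised more steeply than inflating it by the same amount.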

  49. Variational Bayes Don't have access to, or can't compute for computational reasons, $p(z \mid y)$ or $p(y)$, and hence $\mathrm{KL}[q(z) \,\|\, p(z \mid y)]$. How can we minimise something we can't compute? ◮ Can compute $q(z)$ and $p(y \mid z)$ for any $z$. ◮ $q(z)$ is parameterised by 'variational parameters'. ◮ True posterior using Bayes' rule, $p(z \mid y) = \frac{p(y \mid z)\, p(z)}{p(y)}$. ◮ $p(y)$ doesn't change when the variational parameters are changed.

  54. Variational Bayes - Derivation $\mathrm{KL}[q(z) \,\|\, p(z \mid y)] = \int q(z) \log \frac{q(z)}{p(z \mid y)} \, dz = \int q(z) \left[ \log \frac{q(z)}{p(z)} - \log p(y \mid z) + \log p(y) \right] dz = \mathrm{KL}[q(z) \,\|\, p(z)] - \int q(z) \log p(y \mid z) \, dz + \log p(y)$. Rearranging, $\log p(y) = \int q(z) \log p(y \mid z) \, dz - \mathrm{KL}[q(z) \,\|\, p(z)] + \mathrm{KL}[q(z) \,\|\, p(z \mid y)]$.

  55. Variational Bayes - Derivation $\log p(y) = \int q(z) \log p(y \mid z) \, dz - \mathrm{KL}[q(z) \,\|\, p(z)] + \mathrm{KL}[q(z) \,\|\, p(z \mid y)] \geq \int q(z) \log p(y \mid z) \, dz - \mathrm{KL}[q(z) \,\|\, p(z)]$ ◮ The tractable terms give a lower bound on $\log p(y)$, as $\mathrm{KL}[q(z) \,\|\, p(z \mid y)]$ is always non-negative. ◮ Adjust the variational parameters of $q(z)$ to make the tractable terms as large as possible, thus making $\mathrm{KL}[q(z) \,\|\, p(z \mid y)]$ as small as possible.
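
The decomposition of $\log p(y)$ can be checked numerically on a model where every term is available in closed form. The sketch below assumes a toy conjugate model that is not in the slides: prior $z \sim \mathcal{N}(0, 1)$, likelihood $y \mid z \sim \mathcal{N}(z, s^2)$ and $q(z) = \mathcal{N}(m, v)$. All function names and the values of `y` and `noise_var` are hypothetical.

```python
import math


def kl_gauss(mu_q, var_q, mu_p, var_p):
    """KL[q || p] between two univariate Gaussians."""
    return 0.5 * (
        math.log(var_p / var_q)
        + (var_q + (mu_q - mu_p) ** 2) / var_p
        - 1.0
    )


def elbo(m, v, y, noise_var):
    """Tractable terms: E_q[log p(y | z)] - KL[q(z) || p(z)] with p(z) = N(0, 1)."""
    expected_log_lik = (
        -0.5 * math.log(2.0 * math.pi * noise_var)
        - ((y - m) ** 2 + v) / (2.0 * noise_var)
    )
    return expected_log_lik - kl_gauss(m, v, 0.0, 1.0)


def log_evidence(y, noise_var):
    """Exact log p(y): marginally y ~ N(0, 1 + noise_var)."""
    var_y = 1.0 + noise_var
    return -0.5 * math.log(2.0 * math.pi * var_y) - y ** 2 / (2.0 * var_y)


def exact_posterior(y, noise_var):
    """Mean and variance of the exact Gaussian posterior p(z | y)."""
    return y / (1.0 + noise_var), noise_var / (1.0 + noise_var)


y, noise_var = 1.3, 0.5  # illustrative values
post_mean, post_var = exact_posterior(y, noise_var)

for m, v in [(0.0, 1.0), (0.5, 0.5), (post_mean, post_var)]:
    gap = log_evidence(y, noise_var) - elbo(m, v, y, noise_var)
    kl_to_posterior = kl_gauss(m, v, post_mean, post_var)
    print(f"m = {m:.3f}, v = {v:.3f}: gap = {gap:.6f}, KL[q || p(z|y)] = {kl_to_posterior:.6f}")
```

For every choice of $(m, v)$ the gap between $\log p(y)$ and the bound should equal $\mathrm{KL}[q(z) \,\|\, p(z \mid y)]$, and it closes when $q$ is the exact posterior.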

  56. VB optimisation illustration

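  57. Variational Bayes for Gaussian processes ◮ Make a Gaussian approximation, $q(\mathbf{f}) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, \mathbf{C})$, as similar as possible to the true posterior, $p(\mathbf{f} \mid \mathbf{y})$. ◮ Treat $\boldsymbol{\mu}$ and $\mathbf{C}$ as 'variational parameters', affecting the quality of the approximation. $\mathrm{KL}[q(\mathbf{f}) \,\|\, p(\mathbf{f} \mid \mathbf{y})] = \left\langle \log \frac{q(\mathbf{f})}{p(\mathbf{f} \mid \mathbf{y})} \right\rangle_{q(\mathbf{f})} = \left\langle \log \frac{q(\mathbf{f})}{p(\mathbf{f})} - \log p(\mathbf{y} \mid \mathbf{f}) + \log p(\mathbf{y}) \right\rangle_{q(\mathbf{f})} = \mathrm{KL}[q(\mathbf{f}) \,\|\, p(\mathbf{f})] - \langle \log p(\mathbf{y} \mid \mathbf{f}) \rangle_{q(\mathbf{f})} + \log p(\mathbf{y})$. Rearranging, $\log p(\mathbf{y}) = \langle \log p(\mathbf{y} \mid \mathbf{f}) \rangle_{q(\mathbf{f})} - \mathrm{KL}[q(\mathbf{f}) \,\|\, p(\mathbf{f})] + \mathrm{KL}[q(\mathbf{f}) \,\|\, p(\mathbf{f} \mid \mathbf{y})]$.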

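  58. Variational Bayes for Gaussian processes - bound $\log p(\mathbf{y}) = \langle \log p(\mathbf{y} \mid \mathbf{f}) \rangle_{q(\mathbf{f})} - \mathrm{KL}[q(\mathbf{f}) \,\|\, p(\mathbf{f})] + \mathrm{KL}[q(\mathbf{f}) \,\|\, p(\mathbf{f} \mid \mathbf{y})] \geq \langle \log p(\mathbf{y} \mid \mathbf{f}) \rangle_{q(\mathbf{f})} - \mathrm{KL}[q(\mathbf{f}) \,\|\, p(\mathbf{f})]$ ◮ Adjust the variational parameters $\boldsymbol{\mu}$ and $\mathbf{C}$ to make the tractable terms as large as possible, thus making $\mathrm{KL}[q(\mathbf{f}) \,\|\, p(\mathbf{f} \mid \mathbf{y})]$ as small as possible. ◮ With a factorising likelihood, $\langle \log p(\mathbf{y} \mid \mathbf{f}) \rangle_{q(\mathbf{f})}$ can be computed as a series of $n$ one-dimensional integrals. ◮ In practice, the number of variational parameters can be reduced by reparameterising $\mathbf{C} = (\mathbf{K}_{ff} - 2\boldsymbol{\Lambda})^{-1}$, noting that the bound is constant in the off-diagonal terms of $\mathbf{C}$.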
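
The series of $n$ one-dimensional integrals can be sketched with Gauss-Hermite quadrature, since each term only needs the marginal $q(f_i) = \mathcal{N}(\mu_i, C_{ii})$. The example below is a minimal illustration under assumptions that are not in the slides: a Bernoulli likelihood with a logistic link and labels $y_i \in \{-1, +1\}$, a zero-mean prior $p(\mathbf{f}) = \mathcal{N}(\mathbf{0}, \mathbf{K})$, and made-up values for `K`, `mu` and `C`. The function names are hypothetical.

```python
import numpy as np


def expected_log_lik(y, mu, var, n_points=20):
    """Per-point <log p(y_i | f_i)> under q(f_i) = N(mu_i, var_i), by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    # Change of variables f = mu + sqrt(2 var) x maps the Gaussian expectation
    # onto the weight function exp(-x^2) that hermgauss integrates against.
    f = mu[:, None] + np.sqrt(2.0 * var)[:, None] * nodes[None, :]
    log_lik = -np.logaddexp(0.0, -y[:, None] * f)  # log sigmoid(y * f)
    return log_lik @ weights / np.sqrt(np.pi)


def kl_to_prior(mu, C, K):
    """KL[N(mu, C) || N(0, K)] for full covariance matrices."""
    n = mu.size
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_C = np.linalg.slogdet(C)
    trace_term = np.trace(np.linalg.solve(K, C))
    quad_term = mu @ np.linalg.solve(K, mu)
    return 0.5 * (trace_term + quad_term - n + logdet_K - logdet_C)


def variational_bound(y, mu, C, K):
    """Lower bound on log p(y): sum of 1-D expectations minus KL[q(f) || p(f)]."""
    return expected_log_lik(y, mu, np.diag(C)).sum() - kl_to_prior(mu, C, K)


# Illustrative two-point problem (values are assumptions).
K = np.array([[4.0, 3.0], [3.0, 4.0]])
y = np.array([1.0, 1.0])
mu = np.array([1.0, 1.0])
C = np.array([[1.5, 0.5], [0.5, 1.5]])

print("bound on log p(y):", variational_bound(y, mu, C, K))
```

Only the diagonal of `C` enters the likelihood term; the off-diagonal entries affect the bound through the KL term alone.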

  59. VB optimisation illustration for Gaussian processes

  60. Outline Motivation Non-Gaussian posteriors Approximate methods Laplace approximation Variational bayes Expectation propagation Comparisons

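  61. Expectation propagation $p(\mathbf{f} \mid \mathbf{y}) \propto p(\mathbf{f}) \prod_{i=1}^{n} p(y_i \mid f_i)$, $\quad q(\mathbf{f}) \triangleq \frac{1}{Z_{\mathrm{ep}}} \, p(\mathbf{f}) \prod_{i=1}^{n} t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$, $\quad t_i \triangleq \tilde{Z}_i \, \mathcal{N}(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2)$ ◮ Individual likelihood terms, $p(y_i \mid f_i)$, are replaced by independent un-normalised 1D Gaussians, $t_i$. ◮ Uses an iterative algorithm to update the $t_i$'s, to get a more and more accurate approximation.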

  66. Expectation propagation 1. Remove one factor $t_i$ from the approximation $q(\mathbf{f})$. 2. The approximate marginal $q(f_i)$ with the $t_i$ contribution removed is called the cavity distribution, $q_{-i}(f_i)$. 3. Find the $t_i$ that minimises $\mathrm{KL}[p(y_i \mid f_i) \, q_{-i}(f_i) / z_i \,\|\, q(f_i)]$ by matching moments. 4. Repeat until convergence. This approximately minimises $\mathrm{KL}[p(\mathbf{f} \mid \mathbf{y}) \,\|\, q(\mathbf{f})]$ locally, but not globally.

  69. Expectation propagation - in math Step 1 & 2. First choose a local likelihood contribution, $i$, to leave out, and find the marginal cavity distribution: $q(\mathbf{f} \mid \mathbf{y}) \propto p(\mathbf{f}) \prod_{j=1}^{n} t_j(f_j) \;\rightarrow\; \frac{p(\mathbf{f}) \prod_{j=1}^{n} t_j(f_j)}{t_i(f_i)} \;\rightarrow\; p(\mathbf{f}) \prod_{j \neq i} t_j(f_j)$, $\quad q_{-i}(f_i) \;\rightarrow\; \int p(\mathbf{f}) \prod_{j \neq i} t_j(f_j) \, d\mathbf{f}_{j \neq i}$. Step 3.1. $\hat{q}(f_i) \approx \min \mathrm{KL}\left[ p(y_i \mid f_i) \, q_{-i}(f_i) \,\|\, \hat{Z}_i \, \mathcal{N}(f_i \mid \hat{\mu}_i, \hat{\sigma}_i^2) \right]$. Step 3.2. Compute the parameters of $t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2)$, making the moments of $q(f_i)$ match those of $\hat{Z}_i \, \mathcal{N}(f_i \mid \hat{\mu}_i, \hat{\sigma}_i^2)$.
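
Steps 3.1 and 3.2 can be sketched for a single site once the cavity $q_{-i}(f_i) = \mathcal{N}(m, v)$ is known. The example below assumes a probit likelihood $p(y_i \mid f_i) = \Phi(y_i f_i)$ with $y_i \in \{-1, +1\}$, chosen because the tilted moments then have a closed form; the slides do not fix the link function. The site normaliser $\tilde{Z}_i$ is left out because it does not affect the mean and covariance of $q(\mathbf{f})$. All function names and numerical values are hypothetical.

```python
import math


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tilted_moments(y, cav_mean, cav_var):
    """Step 3.1: zeroth, first and second central moments of p(y | f) q_{-i}(f)
    for the probit likelihood p(y | f) = Phi(y f)."""
    z = y * cav_mean / math.sqrt(1.0 + cav_var)
    z_hat = norm_cdf(z)
    ratio = norm_pdf(z) / z_hat
    mu_hat = cav_mean + y * cav_var * ratio / math.sqrt(1.0 + cav_var)
    var_hat = cav_var - cav_var ** 2 * ratio * (z + ratio) / (1.0 + cav_var)
    return z_hat, mu_hat, var_hat


def site_from_moments(mu_hat, var_hat, cav_mean, cav_var):
    """Step 3.2: choose the site mean and variance so that cavity times site
    has mean mu_hat and variance var_hat."""
    site_var = 1.0 / (1.0 / var_hat - 1.0 / cav_var)
    site_mean = site_var * (mu_hat / var_hat - cav_mean / cav_var)
    return site_mean, site_var


# Illustrative cavity parameters and label (assumptions).
y_i, cav_mean, cav_var = 1.0, 0.5, 2.0
z_hat, mu_hat, var_hat = tilted_moments(y_i, cav_mean, cav_var)
site_mean, site_var = site_from_moments(mu_hat, var_hat, cav_mean, cav_var)
print("tilted moments:", z_hat, mu_hat, var_hat)
print("updated site:  ", site_mean, site_var)
```

In a full EP loop this update would be applied to each site in turn, recomputing $q(\mathbf{f})$ and the next cavity after every update, until the site parameters stop changing.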

  70. Outline Motivation Non-Gaussian posteriors Approximate methods Laplace approximation Variational bayes Expectation propagation Comparisons

  71. Comparing posterior approximations [Figure: two panels over $f_1$ and $f_2$: Prior $p(f_1, f_2)$ and Likelihood $p(y = 1 \mid f_1, f_2)$.] ◮ Gaussian prior between two function values $\{f_1, f_2\}$, at $\{x_1, x_2\}$ respectively. ◮ Bernoulli likelihood, $y_1 = 1$ and $y_2 = 1$.

  72. Comparing posterior approximations [Figure: two panels over $f_1$ and $f_2$: True posterior and Laplace approximation.] ◮ $p(\mathbf{f} \mid \mathbf{y}) \propto \frac{p(\mathbf{y} \mid \mathbf{f}) \, p(\mathbf{f})}{p(\mathbf{y})}$ ◮ True posterior is non-Gaussian. ◮ Laplace approximates with a Gaussian at the mode of the posterior.
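
For a two-point problem like this one, the Laplace approximation can be sketched with a few Newton iterations: find the posterior mode $\hat{\mathbf{f}}$, then use the inverse of the negative Hessian there, $(\mathbf{K}^{-1} + \mathbf{W})^{-1}$, as the covariance. The code below is an illustration under assumptions not taken from the slides: a logistic link for the Bernoulli likelihood with labels in $\{-1, +1\}$, and a made-up prior covariance `K`. The function name is hypothetical.

```python
import numpy as np


def laplace_approximation(K, y, max_iter=100, tol=1e-10):
    """Gaussian N(f_hat, cov) centred at the mode of p(f | y) for a
    zero-mean Gaussian prior N(0, K) and a logistic Bernoulli likelihood."""
    n = len(y)
    targets = (y + 1.0) / 2.0  # map labels {-1, +1} to {0, 1}
    f = np.zeros(n)
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-f))
        grad = targets - pi                # gradient of log p(y | f)
        W = np.diag(pi * (1.0 - pi))       # negative Hessian of log p(y | f)
        # Newton step: f_new = (K^{-1} + W)^{-1} (W f + grad), written without K^{-1}.
        f_new = K @ np.linalg.solve(np.eye(n) + W @ K, W @ f + grad)
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    pi = 1.0 / (1.0 + np.exp(-f))
    W = np.diag(pi * (1.0 - pi))
    cov = K @ np.linalg.inv(np.eye(n) + W @ K)  # equals (K^{-1} + W)^{-1}
    return f, cov


# Illustrative prior covariance between f_1 and f_2 (an assumption); y_1 = y_2 = 1.
K = np.array([[4.0, 3.0], [3.0, 4.0]])
y = np.array([1.0, 1.0])
f_hat, cov = laplace_approximation(K, y)
print("mode:", f_hat)
print("covariance:\n", cov)
```

Because the Gaussian is fitted to the curvature at the mode only, it can misrepresent a skewed posterior such as the one in the left panel, which is the contrast this slide is drawing.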
