  1. Power Expectation Propagation for Deep Gaussian Processes. Dr. Richard E. Turner (ret26@cam.ac.uk), Computational and Biological Learning Lab, Department of Engineering, University of Cambridge, with Thang Bui, Yingzhen Li, José Miguel Hernández-Lobato, Daniel Hernández-Lobato and Josiah Yan.

  2-6. Motivation: Gaussian Process regression. [Figure: a regression dataset with inputs on the horizontal axis and outputs on the vertical axis; the task is inference & learning, which runs into both computational and analytic intractabilities.]
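To make the computational intractability concrete, here is a minimal sketch of exact GP regression (a standard construction, not code from the talk), assuming a squared-exponential kernel and toy sine data. The Cholesky factorisation of the N x N kernel matrix is the O(N^3) step that pseudo-point approximations are designed to avoid.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') = s^2 exp(-|x - x'|^2 / (2 l^2))."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, Xstar, noise_var=0.1):
    """Exact GP regression posterior at test inputs Xstar.

    The Cholesky factorisation of the N x N matrix Kff + sigma^2 I is the
    O(N^3) bottleneck that motivates pseudo-point approximations."""
    Kff = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    Kfs = rbf_kernel(X, Xstar)
    Kss = rbf_kernel(Xstar, Xstar)
    L = np.linalg.cholesky(Kff)                          # O(N^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (Kff + noise)^-1 y
    mean = Kfs.T @ alpha
    V = np.linalg.solve(L, Kfs)
    cov = Kss - V.T @ V
    return mean, cov

# Toy data: noisy observations of a sine function.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
Xstar = np.linspace(-3, 3, 100)[:, None]
mean, cov = gp_posterior(X, y, Xstar)
```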

  7-14. EP pseudo-point approximation. [Figure: the likelihood, marginal posterior and true posterior alongside the approximate posterior; the approximation is parameterised by the input locations of the 'pseudo' data and by the outputs and covariance of the 'pseudo' data.]
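For reference, here is a hedged sketch of the pseudo-point (FITC) approximate posterior that such pseudo-data parameterise; it reuses the rbf_kernel helper and toy data from the previous snippet, with M pseudo-inputs Z placed on a grid as an arbitrary placeholder. Only M x M matrices are factorised.

```python
import numpy as np

def fitc_posterior(X, y, Z, Xstar, kernel, noise_var=0.1, jitter=1e-6):
    """FITC approximate GP posterior with M pseudo-inputs Z.

    Only M x M matrices are factorised, so the cost is O(N M^2) rather
    than the O(N^3) of exact GP regression."""
    Kuu = kernel(Z, Z) + jitter * np.eye(len(Z))
    Kuf = kernel(Z, X)
    Kus = kernel(Z, Xstar)
    # Full Kff is formed here only for its diagonal (fine for a sketch).
    kff_diag = np.diag(kernel(X, X))
    # Qff = Kfu Kuu^{-1} Kuf; only its diagonal is needed.
    Kuu_inv_Kuf = np.linalg.solve(Kuu, Kuf)
    qff_diag = np.sum(Kuf * Kuu_inv_Kuf, axis=0)
    lam = kff_diag - qff_diag + noise_var            # per-point noise (FITC)
    # Sigma = (Kuu + Kuf Lambda^{-1} Kfu)^{-1}
    Sigma = np.linalg.inv(Kuu + (Kuf / lam) @ Kuf.T)
    # Predictive mean and covariance at the test inputs.
    mean = Kus.T @ Sigma @ ((Kuf / lam) @ y)
    Kss = kernel(Xstar, Xstar)
    cov = Kss - Kus.T @ np.linalg.solve(Kuu, Kus) + Kus.T @ Sigma @ Kus
    return mean, cov

# Usage sketch, reusing rbf_kernel and the toy data from the previous snippet:
# Z = np.linspace(-3, 3, 8)[:, None]     # 8 pseudo-inputs on a grid
# mean, cov = fitc_posterior(X, y, Z, Xstar, rbf_kernel)
```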

  15-21. EP algorithm (a minimal code sketch of the loop follows below):
      1. Take out one: remove a pseudo-observation likelihood to form the cavity.
      2. Add in one: include the corresponding true observation likelihood to form the tilted distribution.
      3. Project onto the approximating family by minimising the KL divergence between the unnormalised tilted stochastic processes; at the minimum the moments are matched at the pseudo-inputs, and for Gaussian regression the moments are matched everywhere.
      4. Update the pseudo-observation likelihood (a rank-1 update).
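The four steps map almost line for line onto code. The sketch below is not the GP version from the talk: it is a scalar analogue of my own construction (Gaussian prior on a latent theta, Student-t likelihoods, one Gaussian site per observation, moment matching by grid quadrature) that shows the cavity / tilted / project / update cycle.

```python
import numpy as np
from scipy.stats import t as student_t

def ep_scalar(y, prior_mean=0.0, prior_var=1.0, lik_scale=0.5, n_sweeps=20):
    """EP for p(theta | y) with a Gaussian prior and Student-t likelihoods.

    Each likelihood term is approximated by an unnormalised Gaussian 'site'
    stored in natural parameters (precision tau_i, precision-times-mean nu_i)."""
    n = len(y)
    tau = np.zeros(n)                        # site precisions
    nu = np.zeros(n)                         # site precision * mean
    tau0, nu0 = 1.0 / prior_var, prior_mean / prior_var
    grid = np.linspace(-10, 10, 2001)        # quadrature grid for the projection

    for _ in range(n_sweeps):
        for i in range(n):
            # 1. remove one site -> cavity distribution
            tau_cav = tau0 + tau.sum() - tau[i]
            nu_cav = nu0 + nu.sum() - nu[i]
            m_cav, v_cav = nu_cav / tau_cav, 1.0 / tau_cav
            # 2. include the true likelihood -> tilted distribution
            tilted = (np.exp(-0.5 * (grid - m_cav) ** 2 / v_cav)
                      * student_t.pdf(y[i] - grid, df=3, scale=lik_scale))
            # 3. project: match the mean and variance of the tilted distribution
            Z = np.trapz(tilted, grid)
            m_new = np.trapz(grid * tilted, grid) / Z
            v_new = np.trapz((grid - m_new) ** 2 * tilted, grid) / Z
            # 4. update the site so that cavity * site has the matched moments
            tau[i] = 1.0 / v_new - tau_cav
            nu[i] = m_new / v_new - nu_cav

    post_var = 1.0 / (tau0 + tau.sum())
    return post_var * (nu0 + nu.sum()), post_var

# e.g. ep_scalar(np.array([0.3, -0.1, 2.5]))  # the outlier is down-weighted
```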

  22-36. Fixed points of EP = FITC approximation. [Derivation slides: the equations are not recoverable from this transcript.] The EP fixed points are equivalent to the FITC approximation (Csató & Opper, 2002; Qi, Abdel-Gawad & Minka, 2010). This interpretation resolves philosophical issues with FITC (such as whether the number of pseudo-points M should grow with N), and since FITC is known to overfit, EP over-estimates the marginal likelihood.
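The quantity being over-estimated is the FITC log marginal likelihood, log N(y; 0, Qff + Lambda), which the slides identify with the EP approximate marginal likelihood at the fixed point. A hedged sketch of its O(N M^2) evaluation, assuming the rbf_kernel helper from the earlier snippet:

```python
import numpy as np

def fitc_log_marginal_likelihood(X, y, Z, kernel, noise_var=0.1, jitter=1e-6):
    """log N(y; 0, Qff + Lambda), with Qff = Kfu Kuu^{-1} Kuf and
    Lambda = diag(Kff - Qff) + noise_var * I, evaluated in O(N M^2)
    via the Woodbury identity and the matrix determinant lemma."""
    Kuu = kernel(Z, Z) + jitter * np.eye(len(Z))
    Kuf = kernel(Z, X)
    kff_diag = np.diag(kernel(X, X))
    Luu = np.linalg.cholesky(Kuu)
    V = np.linalg.solve(Luu, Kuf)                  # V^T V = Qff
    lam = kff_diag - np.sum(V**2, axis=0) + noise_var
    B = np.eye(len(Z)) + (V / lam) @ V.T           # M x M
    Lb = np.linalg.cholesky(B)
    beta = np.linalg.solve(Lb, (V / lam) @ y)
    n = len(y)
    logdet = 2 * np.sum(np.log(np.diag(Lb))) + np.sum(np.log(lam))
    quad = y @ (y / lam) - beta @ beta             # y^T (Qff + Lambda)^{-1} y
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

# Maximising this objective over hyperparameters and Z is what is known to
# overfit, e.g. by shrinking the heteroscedastic FITC noise terms in lam.
```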

  37. EP algorithm (recap of the four steps above, before generalising them).

  38. Power EP algorithm (as tractable as EP; a scalar code sketch follows below):
      1. Take out a fraction of one pseudo-observation likelihood to form the cavity.
      2. Add in a fraction of the corresponding true observation likelihood to form the tilted distribution.
      3. Project onto the approximating family (KL between unnormalised tilted stochastic processes; moments matched at the pseudo-inputs, and everywhere for Gaussian regression).
      4. Update the pseudo-observation likelihood (a rank-1 update).
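In the scalar toy setting used above, Power EP changes only how much of a site is removed and re-included in each cycle, plus a 1/alpha scaling in the site update. A hedged sketch with the same assumptions as the EP example, using the undamped fractional update:

```python
import numpy as np
from scipy.stats import t as student_t

def power_ep_scalar(y, alpha=0.5, prior_var=1.0, lik_scale=0.5, n_sweeps=20):
    """Power EP for the same toy model as ep_scalar above; alpha = 1 is EP."""
    n = len(y)
    tau, nu = np.zeros(n), np.zeros(n)       # Gaussian sites (natural params)
    tau0, nu0 = 1.0 / prior_var, 0.0
    grid = np.linspace(-10, 10, 2001)

    for _ in range(n_sweeps):
        for i in range(n):
            # 1. remove only a fraction alpha of site i -> cavity
            tau_cav = tau0 + tau.sum() - alpha * tau[i]
            nu_cav = nu0 + nu.sum() - alpha * nu[i]
            m_cav, v_cav = nu_cav / tau_cav, 1.0 / tau_cav
            # 2. include a fraction alpha of the true likelihood -> tilted
            tilted = (np.exp(-0.5 * (grid - m_cav) ** 2 / v_cav)
                      * student_t.pdf(y[i] - grid, df=3, scale=lik_scale) ** alpha)
            # 3. project: moment-match the tilted distribution
            Z = np.trapz(tilted, grid)
            m_new = np.trapz(grid * tilted, grid) / Z
            v_new = np.trapz((grid - m_new) ** 2 * tilted, grid) / Z
            # 4. the matched change corresponds to the fractional site t_i^alpha,
            #    so divide by alpha to recover the full site (undamped update)
            tau[i] = (1.0 / v_new - tau_cav) / alpha
            nu[i] = (m_new / v_new - nu_cav) / alpha

    post_var = 1.0 / (tau0 + tau.sum())
    return post_var * (nu0 + nu.sum()), post_var
```

With alpha = 1 this reduces to the EP sketch above; the next slides describe how, as alpha goes to 0, the fixed points approach the variational free-energy solution.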

  39. Power EP: a unifying framework. The power α interpolates between existing approximations: α = 1 recovers EP and the FITC approximation (Csató and Opper, 2002; Snelson and Ghahramani, 2005), while the limit α → 0 recovers the variational free energy (VFE) approximation (Titsias, 2009).

  40-46. Power EP: a unifying framework (extensions). Approximating blocks of data gives structured approximations, e.g. PITC / BCM (Schwaighofer & Tresp, 2002; Snelson, 2006) and structured VFE (Titsias, 2009). Placing the pseudo-data in a different space via interdomain transformations (a linear transform of the process) gives pseudo-data in the new space (Figueiras-Vidal & Lázaro-Gredilla, 2009; Tobar et al., 2015; Matthews et al., 2016).

  47. Power EP: a unifying framework. [Table: existing GP regression and GP classification approximations (FITC, VFE, PEP, EP, with structured and inter-domain variants) arranged within the Power EP framework.]
      References: [4] Quiñonero-Candela et al., 2005; [5] Snelson et al., 2005; [6] Snelson, 2006; [7] Schwaighofer, 2002; [8] Titsias, 2009; [9] Csató, 2002; [10] Csató et al., 2002; [11] Seeger et al., 2003; [12] Naish-Guzman et al., 2007; [13] Qi et al., 2010; [14] Hensman et al., 2015; [15] Hernández-Lobato et al., 2016; [16] Matthews et al., 2016; [17] Figueiras-Vidal et al., 2009. (* = optimised pseudo-inputs; ** = structured versions of VFE recover VFE.)

  48. How should I set the power parameter α? Experiments on 8 UCI regression datasets and 6 UCI classification datasets, each with 20 random splits, hyperparameters and inducing inputs optimised; pseudo-point counts M up to 200 for regression and M = 10, 50, 100 for classification. [Figure: average-rank plots with critical-difference (CD) intervals for Error, SMSE, SMLL and MLL; the CD bar indicates a significant difference.] α = 0.5 does well overall.

  49. Deep Gaussian processes. Each layer is a GP, f_l ~ GP(0, k(·, ·)), and the output is the composition
      y_n = g(x_n) = f_L(f_{L-1}(··· f_2(f_1(x_n)))) + ε_n,
      or, writing h_{L-1,n} := f_{L-1}(··· f_1(x_n)) for the top hidden layer, y_n = f_L(h_{L-1,n}) + ε_n.
      [Graphical model: x_n → f_1 → h_{1,n} → f_2 → h_{2,n} → f_3 → y_n, with a plate over the N data points.]
      Deep GPs (Damianou and Lawrence, 2013, introduced for unsupervised learning) are a multi-layer generalisation of Gaussian processes, equivalent to deep neural networks with infinitely wide hidden layers.
      Questions: How can inference and learning be performed tractably? How do deep GPs compare to alternatives, e.g. Bayesian neural networks?
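To make the composition y_n = f_L(··· f_1(x_n)) + ε_n concrete, here is a hedged sketch (my own toy construction, not the talk's model) that draws one sample from a deep GP prior at a finite set of inputs by sampling each layer's function values at the previous layer's outputs; the RBF kernel, one-dimensional hidden layers and three layers are arbitrary choices.

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel for 1-D inputs."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def sample_deep_gp_prior(x, n_layers=3, noise_std=0.05, seed=0):
    """Draw one sample from a deep GP prior at inputs x by composing layers:
    h_1 = f_1(x), h_2 = f_2(h_1), ..., y = f_L(h_{L-1}) + noise."""
    rng = np.random.default_rng(seed)
    h = x
    for _ in range(n_layers):
        K = rbf(h, h) + 1e-8 * np.eye(len(h))      # jitter for stability
        h = np.linalg.cholesky(K) @ rng.standard_normal(len(h))
    return h + noise_std * rng.standard_normal(len(h))

x = np.linspace(-3, 3, 200)
y = sample_deep_gp_prior(x)   # one draw from the (typically non-Gaussian) prior over g(x)
```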

  50. Pros and cons of Deep GPs. Why deep GPs? They are deep and nonparametric; they discover useful input warpings or dimensionality compression/expansion, i.e. automatic, nonparametric Bayesian kernel design; and they give a non-Gaussian functional mapping g.
