

  1. SOME CONCERNS ABOUT SPARSE APPROXIMATIONS FOR GAUSSIAN PROCESS REGRESSION. Joaquin Quiñonero Candela, Max Planck Institute for Biological Cybernetics. Gaussian Process Round Table, Sheffield, June 9 and 10, 2005.

  2. Menu
  • Concerns about the quality of the predictive distributions
  • Augmentation: a bit more expensive, but gooood ...
  • Dude, where's my prior?
  • A short tale about sparse greedy support set selection

  3. The Regression Task
  • Simplest case: additive independent Gaussian noise of variance σ².
  • Gaussian process prior over functions: p(y | f) ~ N(f, σ² I), p(f) ~ N(0, K).
  • Task: obtain the predictive distribution of f_* at the new input x_*:
    p(f_* | x_*, y) = ∫ p(f_* | x_*, f) p(f | y) df
  • Need to compute the posterior distribution (expensive):
    p(f | y) ~ N( K (K + σ² I)⁻¹ y , σ² K (K + σ² I)⁻¹ )
  • ... and integrate f out of the conditional distribution of f_*:
    p(f_* | x_*, f) ~ N( K_{*,·} K⁻¹ f , K_{*,*} − K_{*,·} K⁻¹ K_{*,·}ᵀ )
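
For concreteness, here is a minimal numpy sketch of the exact computation on this slide; the function name `gp_predict`, the unit-variance squared-exponential kernel and the toy data are my own illustrative choices, not part of the talk.

```python
import numpy as np

def k(a, b, lengthscale=1.0):
    """Unit-variance squared-exponential covariance for 1-D inputs."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / lengthscale) ** 2)

def gp_predict(X, y, Xs, noise_var=0.1):
    """Exact GP regression: O(n^3) cost from the (K + sigma^2 I) solve."""
    K = k(X, X)                                    # n x n prior covariance of f
    Ks = k(Xs, X)                                  # K_{*,.}
    A = K + noise_var * np.eye(len(X))             # K + sigma^2 I
    mean = Ks @ np.linalg.solve(A, y)              # K_{*,.} (K + sigma^2 I)^{-1} y
    var = k(Xs, Xs).diagonal() - np.einsum('ij,ji->i', Ks, np.linalg.solve(A, Ks.T))
    return mean, var                               # mean and (noise-free) variance of f_*

# toy usage
X = np.linspace(-10, 10, 50)
y = np.sin(X) + 0.3 * np.random.default_rng(0).normal(size=50)
mu, var = gp_predict(X, y, np.linspace(-15, 15, 201))
```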

  4. Usual Reduced Set Approximations
  • Consider some very common approximations:
    – Naïve process approximation on a subset of the data
    – Subset of regressors (Wahba, Smola and Bartlett, ...)
    – Sparse online GPs (Csató and Opper)
    – Fast sparse projected process approximation (Seeger et al.)
    – Relevance Vector Machines (Tipping)
    – Augmented reduced rank GPs (Rasmussen, Quiñonero Candela)
  • All are based on considering only a subset I of the latent variables:
    p(f_* | x_*, y) = ∫ p(f_* | x_*, f_I) p(f_I | y) df_I
  • However, they differ in:
    – the way the support set I and the hyperparameters are learnt
    – the likelihood and/or predictive distribution approximations
  • This has important consequences for the resulting predictive distribution:
    – risk of over-fitting
    – degenerate approximations with nonsense predictive uncertainties

  5. Naïve Process Approximation
  • Extremely simple idea: throw away all the data outside I!
  • The posterior only benefits from the information contained in y_I:
    p(f_I | y_I) ~ N( K_I (K_I + σ² I)⁻¹ y_I , σ² K_I (K_I + σ² I)⁻¹ )
  • The model underfits and is under-confident:
    p(f_* | x_*, y_I) ~ N(μ_*, σ²_*)
    μ_* = K_{*,I} (K_I + σ² I)⁻¹ y_I
    σ²_* = K_{*,*} − K_{*,I} (K_I + σ² I)⁻¹ K_{*,I}ᵀ
  • Training scales with m³; predicting with m (mean) and m² (variance).
  • Baseline approximation: we want higher accuracy and confidence.
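
A minimal sketch of this baseline, assuming the same toy setup and unit SE kernel as the earlier sketch; picking the support set I at random is my own simplification, the slide does not say how I is chosen here.

```python
import numpy as np
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)   # unit SE kernel, 1-D inputs

def sod_predict(X, y, Xs, m, noise_var=0.1, seed=0):
    """Naive process approximation: discard every training point outside I.
    Training cost drops to O(m^3); error bars widen because information is thrown away."""
    I = np.random.default_rng(seed).choice(len(X), size=m, replace=False)
    XI, yI = X[I], y[I]
    A = k(XI, XI) + noise_var * np.eye(m)            # K_I + sigma^2 I
    KsI = k(Xs, XI)                                  # K_{*,I}
    mean = KsI @ np.linalg.solve(A, yI)              # K_{*,I} (K_I + sigma^2 I)^{-1} y_I
    var = k(Xs, Xs).diagonal() - np.einsum('ij,ji->i', KsI, np.linalg.solve(A, KsI.T))
    return mean, var
```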

  6. Subset of Regressors
  • Finite linear model with a peculiar prior on the weights:
    f_* = K_{*,I} α_I , α_I ~ N(0, K_I⁻¹) ⟹ f_* = K_{*,I} K_I⁻¹ f_I , f_I ~ N(0, K_I)
  • The posterior now benefits from all of y:
    q(f_I | y) ∝ N(y | K_{I,·}ᵀ K_I⁻¹ f_I, σ² I) · N(f_I | 0, K_I)
              ~ N( K_I [K_{I,·} K_{I,·}ᵀ + σ² K_I]⁻¹ K_{I,·} y , σ² K_I [K_{I,·} K_{I,·}ᵀ + σ² K_I]⁻¹ K_I )
  • The conditional distribution of f_* is degenerate!
    p(f_* | f_I) ~ N( K_{*,I} K_I⁻¹ f_I , 0 )
  • The predictive distribution produces nonsense error bars:
    μ_* = K_{*,I} [K_{I,·} K_{I,·}ᵀ + σ² K_I]⁻¹ K_{I,·} y
    σ²_* = σ² K_{*,I} [K_{I,·} K_{I,·}ᵀ + σ² K_I]⁻¹ K_{*,I}ᵀ
  • Under the prior, only functions with m degrees of freedom.
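
A sketch of the Subset of Regressors predictor under the same toy assumptions; note that the variance expression has no K_{*,*} term, which is what makes the error bars collapse away from the support set.

```python
import numpy as np
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)   # unit SE kernel, 1-D inputs

def sor_predict(X, y, Xs, I, noise_var=0.1, jitter=1e-8):
    """Subset of Regressors: uses all of y, but the degenerate prior gives nonsense error bars."""
    XI = X[I]
    KI = k(XI, XI) + jitter * np.eye(len(I))         # K_I (jitter for numerical stability)
    KIn = k(XI, X)                                   # K_{I,.}, m x n
    KsI = k(Xs, XI)                                  # K_{*,I}
    A = KIn @ KIn.T + noise_var * KI                 # K_{I,.} K_{I,.}^T + sigma^2 K_I
    mean = KsI @ np.linalg.solve(A, KIn @ y)
    var = noise_var * np.einsum('ij,ji->i', KsI, np.linalg.solve(A, KsI.T))
    return mean, var                                 # variance -> 0 far from the support set
```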

  7. Projected Process (Seeger et al.)
  • Basic principle: likelihood approximation
    p(y | f_I) ~ N( K_{I,·}ᵀ K_I⁻¹ f_I , σ² I )
  • Leads to exactly the same posterior as Subset of Regressors.
  • But the conditional distribution is now non-degenerate (a process approximation):
    p(f_* | f_I) ~ N( K_{*,I} K_I⁻¹ f_I , K_{*,*} − K_{*,I} K_I⁻¹ K_{*,I}ᵀ )
  • Predictive distribution has the same mean as Subset of Regressors, but a way under-confident predictive variance!
    μ_* = K_{*,I} [K_{I,·} K_{I,·}ᵀ + σ² K_I]⁻¹ K_{I,·} y
    σ²_* = K_{*,*} − K_{*,I} K_I⁻¹ K_{*,I}ᵀ + σ² K_{*,I} [K_{I,·} K_{I,·}ᵀ + σ² K_I]⁻¹ K_{*,I}ᵀ
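
Under the same toy assumptions, the Projected Process predictor only changes the variance line relative to the Subset of Regressors sketch above.

```python
import numpy as np
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)   # unit SE kernel, 1-D inputs

def pp_predict(X, y, Xs, I, noise_var=0.1, jitter=1e-8):
    """Projected Process: same mean as SoR, but the K_{*,*} - K_{*,I} K_I^{-1} K_{I,*} term
    keeps the predictive variance from collapsing away from the support set."""
    XI = X[I]
    KI = k(XI, XI) + jitter * np.eye(len(I))
    KIn = k(XI, X)                                   # K_{I,.}
    KsI = k(Xs, XI)                                  # K_{*,I}
    A = KIn @ KIn.T + noise_var * KI
    mean = KsI @ np.linalg.solve(A, KIn @ y)
    var = (k(Xs, Xs).diagonal()
           - np.einsum('ij,ji->i', KsI, np.linalg.solve(KI, KsI.T))       # - K_{*,I} K_I^{-1} K_{I,*}
           + noise_var * np.einsum('ij,ji->i', KsI, np.linalg.solve(A, KsI.T)))
    return mean, var
```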

  8. Augmented Subset of Regressors
  • For each x_*, augment f_I with f_*: new active set I_*.
  • Augmented posterior: q([f_I, f_*]ᵀ | y)
  • ... at a cost of O(nm) per test case: need to compute K_{*,·} K_{I,·}ᵀ.
  • aSoR:
    μ_* = K_{*,·} [Q + v_* v_*ᵀ / c_*]⁻¹ y
    σ²_* = K_{*,*} − K_{*,·} [Q + v_* v_*ᵀ / c_*]⁻¹ K_{*,·}ᵀ
    with the usual approximate covariance
    Q = K_{I,·}ᵀ K_I⁻¹ K_{I,·} + σ² I ,
    the difference between the actual and projected covariance of f_* and f
    v_* = K_{*,·}ᵀ − K_{I,·}ᵀ K_I⁻¹ K_{I,*} ,
    and the difference between the prior variance of f_* and the projected one
    c_* = K_{*,*} − K_{I,*}ᵀ K_I⁻¹ K_{I,*} .
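
A deliberately naive sketch of the augmented predictor, again under the toy assumptions above. It solves the full n x n system per test point for clarity; the O(nm) per-test-point cost quoted on the slide comes from exploiting the low-rank-plus-rank-one structure of Q + v_* v_*ᵀ / c_* instead.

```python
import numpy as np
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)   # unit SE kernel, 1-D inputs

def asor_predict(X, y, Xs, I, noise_var=0.1, jitter=1e-8):
    """Augmented SoR: for each x*, the active set is augmented with f_*, which adds
    the rank-one correction v_* v_*^T / c_* to the approximate covariance Q."""
    XI = X[I]
    KI = k(XI, XI) + jitter * np.eye(len(I))
    KIn = k(XI, X)                                               # K_{I,.}
    Q = KIn.T @ np.linalg.solve(KI, KIn) + noise_var * np.eye(len(X))
    means, variances = [], []
    for x in Xs:
        xs = np.atleast_1d(x)
        Ksn, KsI = k(xs, X)[0], k(xs, XI)[0]                     # K_{*,.} and K_{*,I}
        v = Ksn - KIn.T @ np.linalg.solve(KI, KsI)               # actual minus projected covariance
        c = 1.0 - KsI @ np.linalg.solve(KI, KsI) + jitter        # K_{*,*} = 1 for the unit SE kernel
        B = Q + np.outer(v, v) / c
        means.append(Ksn @ np.linalg.solve(B, y))
        variances.append(1.0 - Ksn @ np.linalg.solve(B, Ksn))
    return np.array(means), np.array(variances)
```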

  9. Dude, where’s my prior?

  10. The Priors
  The equivalent prior on [f, f_*]ᵀ is N(0, P), with Q = K_{I,·}ᵀ K_I⁻¹ K_{I,·} :
  • Subset of Regressors:
    P = [ Q ,                        K_{I,·}ᵀ K_I⁻¹ K_{I,*} ;
          K_{I,*}ᵀ K_I⁻¹ K_{I,·} ,   K_{I,*}ᵀ K_I⁻¹ K_{I,*} ]
  • Projected Process:
    P = [ Q ,                        K_{I,·}ᵀ K_I⁻¹ K_{I,*} ;
          K_{I,*}ᵀ K_I⁻¹ K_{I,·} ,   K_{*,*} ]
  • Nyström (positive definiteness!):
    P = [ Q ,        K_{·,*} ;
          K_{*,·} ,  K_{*,*} ]
  • Ed and Zoubin's funky thing:
    P = [ Q + Λ ,                    K_{I,·}ᵀ K_I⁻¹ K_{I,*} ;
          K_{I,*}ᵀ K_I⁻¹ K_{I,·} ,   K_{*,*} ] ,   with Λ = diag(K) − diag(Q)
  • Augmented Subset of Regressors:
    P = [ Q + v_* v_*ᵀ / c_* ,  K_{·,*} ;
          K_{*,·} ,             K_{*,*} ]
    with v_* = K_{·,*} − K_{I,·}ᵀ K_I⁻¹ K_{I,*} and c_* = K_{*,*} − K_{I,*}ᵀ K_I⁻¹ K_{I,*} .
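
To make the comparison concrete, here is a sketch (my own illustration, reusing the unit SE kernel from the earlier sketches) that assembles the implied joint prior P over [f, f_*] for several of these cases; the block assignments follow the reconstruction above, so treat them as an assumption rather than the talk's definitive definitions.

```python
import numpy as np
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)   # unit SE kernel, 1-D inputs

def implied_joint_prior(X, Xs, I, method, jitter=1e-8):
    """Joint prior covariance P over [f, f_*] implied by a sparse approximation."""
    XI = X[I]
    KI = k(XI, XI) + jitter * np.eye(len(I))
    KIn, KIs = k(XI, X), k(XI, Xs)                       # K_{I,.} and K_{I,*}
    Q = KIn.T @ np.linalg.solve(KI, KIn)                 # K_{I,.}^T K_I^{-1} K_{I,.}
    Qns = KIn.T @ np.linalg.solve(KI, KIs)               # projected train-test cross-covariance
    Qss = KIs.T @ np.linalg.solve(KI, KIs)               # projected test covariance
    Kss, Kns = k(Xs, Xs), k(X, Xs)                       # exact test and cross covariances
    blocks = {
        'sor':      (Q, Qns, Qss),
        'pp':       (Q, Qns, Kss),
        'nystrom':  (Q, Kns, Kss),                       # exact cross terms: P may fail to be PSD
        'edzoubin': (Q + np.diag(1.0 - np.diag(Q)), Qns, Kss),   # Lambda = diag(K) - diag(Q)
    }
    top, cross, bottom = blocks[method]
    return np.block([[top, cross], [cross.T, bottom]])
```

For instance, `np.linalg.eigvalsh(implied_joint_prior(X, Xs, I, 'nystrom')).min()` can come out negative for some configurations, which is exactly the positive-definiteness warning above.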

  11. More on Ed and Zoubin's Method
  • Here's a way of looking at it: the prior is a posterior process,
    f_* | f_I ~ N( K_{*,I} K_I⁻¹ f_I , K_{*,*} − K_{*,I} K_I⁻¹ K_{*,I}ᵀ ) ,
    ... well, almost: E[f_+ f_* | f_I] = 0, i.e. distinct latent values are taken conditionally independent given f_I.
  • And then of course f_I ~ N(0, K_I).
  • The corresponding prior is
    p(f) = N( 0 , K_{*,*} I + Q − diag(Q) ) ,   Q = K_{I,·}ᵀ K_I⁻¹ K_{I,·}
  • With a bit of algebra you recover the marginal likelihood and the predictive distribution.
  • I finished this 30 minutes ago, which is why I won't show figures on it! (well, I now may)
  • but ...
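
As a quick illustration of the marginal likelihood point, here is a sketch that evaluates the Gaussian log marginal of y under the prior above (same toy kernel as before, unit prior variances on the diagonal); it is computed naively at O(n³) for clarity, whereas the low-rank-plus-diagonal structure is what makes it cheap in practice.

```python
import numpy as np
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)   # unit SE kernel, 1-D inputs

def edzoubin_log_marginal(X, y, I, noise_var=0.1, jitter=1e-8):
    """Log marginal likelihood of y under the prior N(0, K_{*,*} I + Q - diag(Q)) plus noise."""
    XI = X[I]
    KI = k(XI, XI) + jitter * np.eye(len(I))
    KIn = k(XI, X)                                      # K_{I,.}
    Q = KIn.T @ np.linalg.solve(KI, KIn)                # K_{I,.}^T K_I^{-1} K_{I,.}
    Sigma = Q + np.diag(1.0 - np.diag(Q)) + noise_var * np.eye(len(X))
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (y @ np.linalg.solve(Sigma, y) + logdet + len(X) * np.log(2 * np.pi))
```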

  12. Naïve Process Approximation [figure: predictive mean and error bars on the 1-D toy data, inputs −15 to 15]

  13. Subset of Regressors (degenerate) [figure: predictive mean and error bars on the 1-D toy data, inputs −15 to 15]

  14. Projected Process Approximation [figure: predictive mean and error bars on the 1-D toy data, inputs −15 to 15]

  15. Ed and Zoubin's Projected Process Method [figure: predictive mean and error bars on the 1-D toy data, inputs −15 to 15]

  16. Augmented SoR (pred scales with nm) [figure: predictive mean and error bars on the 1-D toy data, inputs −15 to 15]

  17. Comparing the Predictive Uncertainties [figure: predictive uncertainties of the Naive, SR, Seeger, EdZoubin and Augm approximations over inputs −15 to 15]

  18. Smola and Bartlett's Greedy Selection [figure, two panels versus support set size m (log scale): top, negative log evidence and test squared error, with the minima of the negative log evidence and of the negative log posterior marked; bottom, negative log posterior with its upper and lower bounds, gap = 0.025]

  19. Wrap Up
  • Training: from O(n³) to O(nm²)
  • Predicting: from O(n²) to O(m²) (or O(nm))
  • Be sparse if you must, but only then.
  • Beware of over-fitting-prone greedy selection methods.
  • Do worry about the prior implied by the approximation!

  20. Appendix: Healing the RVM by Augmentation (joint work with Carl Rasmussen)

  21. Finite Linear Model [figure: 1-D toy example, inputs 0 to 15]

  22. A Bad Probabilistic Model [figure: same 1-D toy example, inputs 0 to 15]

  23. The Healing: Augmentation [figure: same 1-D toy example, inputs 0 to 15]

  24. Augmentation?
  • Train your m-dimensional model once.
  • At each new test point, add a new basis function.
  • Update the (m+1)-dimensional model (update the posterior).
  • Testing is now more expensive (a rough sketch of the procedure follows below).
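
A rough sketch of these steps for a simple Bayesian linear model with localized (kernel) basis functions; the single fixed `weight_prec` replaces the RVM's learnt per-weight precisions, so this illustrates only the augmentation step, not RVM training itself.

```python
import numpy as np
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)   # unit SE kernel, 1-D inputs

def augmented_linear_predict(X, y, centres, Xs, noise_var=0.1, weight_prec=1.0):
    """Finite linear model healed by augmentation: for each test point x*, a basis
    function centred on x* is appended to the m trained ones, the (m+1)-dimensional
    weight posterior is recomputed, and the predictive distribution is read off."""
    means, variances = [], []
    for x in Xs:
        C = np.concatenate([centres, [x]])               # m+1 basis-function centres
        Phi = k(X, C)                                    # n x (m+1) design matrix
        h = k(np.atleast_1d(x), C)[0]                    # basis responses at x*
        A = Phi.T @ Phi / noise_var + weight_prec * np.eye(len(C))
        Sigma = np.linalg.inv(A)                         # weight posterior covariance
        mu = Sigma @ Phi.T @ y / noise_var               # weight posterior mean
        means.append(h @ mu)
        variances.append(h @ Sigma @ h)                  # error bars now grow away from the centres
    return np.array(means), np.array(variances)
```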

  25. Wait a minute ... I don’t care about probabilistic predictions!

  26. Another Symptom: Underfitting

  Abalone
                        Squared error loss   Absolute error loss   −log test density loss
  Loss:   RVM           0.138                0.259                 0.469
          RVM*          0.135                0.253                 0.408
          GP            0.092                0.209                 0.219
  Pairwise p-values:
          RVM vs RVM*   not sig.             0.07                  < 0.01
          RVM vs GP     < 0.01               < 0.01                < 0.01
          RVM* vs GP    0.02                 < 0.01                < 0.01

  Robot Arm
                        Squared error loss   Absolute error loss   −log test density loss
  Loss:   RVM           0.0043               0.0482                −1.2162
          RVM*          0.0040               0.0467                −1.3295
          GP            0.0024               0.0334                −1.7446
  Pairwise p-values: all differences significant at p < 0.01.

  • GP (Gaussian Process): infinitely augmented linear model
  • Beats finite linear models on all datasets I've looked at

  27. Interlude: None of this happens with non-localized basis functions

  28. Finite Linear Model [figure: 1-D toy example, inputs 0 to 15]

  29. A Bad Probabilistic Model [figure: same 1-D toy example, inputs 0 to 15]

  30. The Healing: Augmentation [figure: same 1-D toy example, inputs 0 to 15]
