Some Concerns About Sparse Approximations for Gaussian Process Regression
Joaquin Quiñonero Candela
Max Planck Institute for Biological Cybernetics
Gaussian Process Round Table, Sheffield, June 9 and 10, 2005
Menu
- Concerns about the quality of the predictive distributions
- Augmentation: a bit more expensive, but gooood ...
- Dude, where’s my prior?
- A short tale about sparse greedy support set selection
The Regression Task
- Simplest case: additive independent Gaussian noise of variance σ²
- Gaussian process prior over functions:
  p(y|f) ∼ N(f, σ²I) ,  p(f) ∼ N(0, K)
- Task: obtain the predictive distribution of f∗ at the new input x∗:
  p(f∗|x∗, y) = ∫ p(f∗|x∗, f) p(f|y) df
- Need to compute the posterior distribution (expensive):
  p(f|y) ∼ N( K(K + σ²I)⁻¹y , σ²K(K + σ²I)⁻¹ )
- ... and integrate f out of the conditional distribution of f∗:
  p(f∗|x∗, f) ∼ N( K∗,· K⁻¹f , K∗,∗ − K∗,· K⁻¹K⊤∗,· )
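To make these quantities concrete, here is a minimal NumPy sketch of exact GP regression. The unit-amplitude squared-exponential kernel and the noise_var default are illustrative choices, not taken from the slides:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    # Squared-exponential covariance between input sets A (n x d) and B (m x d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, Xs, noise_var=0.1):
    # Exact GP prediction: O(n^3) training cost.
    K = rbf(X, X)
    Ks = rbf(Xs, X)                              # K_{*,.}
    A = K + noise_var * np.eye(len(X))           # K + sigma^2 I
    mu = Ks @ np.linalg.solve(A, y)              # K_{*,.} (K + sigma^2 I)^{-1} y
    cov = rbf(Xs, Xs) - Ks @ np.linalg.solve(A, Ks.T)
    return mu, np.diag(cov)
```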
Usual Reduced Set Approximations
- Consider some very common approximations:
  – Naïve process approximation on a subset of the data
  – Subset of Regressors (Wahba; Smola and Bartlett; ...)
  – Sparse online GPs (Csató and Opper)
  – Fast Sparse Projected Process Approximation (Seeger et al.)
  – Relevance Vector Machines (Tipping)
  – Augmented Reduced Rank GPs (Rasmussen, Quiñonero Candela)
- All based on considering only a subset I of the latent variables:
  p(f∗|x∗, y) = ∫ p(f∗|x∗, fI) p(fI|y) dfI
- However, they differ in:
  – the way the support set I and the hyperparameters are learnt
  – the likelihood and/or predictive distribution approximations
- This has important consequences for the resulting predictive distribution:
  – risk of over-fitting
  – degenerate approximations with nonsense predictive uncertainties
Naïve Process Approximation
- Extremely simple idea: throw away all the data outside I!
- The posterior only benefits from the information contained in yI:
  p(fI|yI) ∼ N( KI(KI + σ²I)⁻¹yI , σ²KI(KI + σ²I)⁻¹ )
- The model underfits and is under-confident:
  p(f∗|x∗, yI) ∼ N(µ∗, σ²∗)
  µ∗ = K∗,I (KI + σ²I)⁻¹yI ,  σ²∗ = K∗,∗ − K∗,I (KI + σ²I)⁻¹K⊤∗,I
- Training scales with m³; predicting with m (mean) and m² (variance)
- Baseline approximation: we want higher accuracy and confidence
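A minimal sketch of this baseline, reusing rbf and gp_predict from the sketch above. Picking I at random is an illustrative choice; later slides discuss greedy selection:

```python
import numpy as np  # reuses rbf and gp_predict from the first sketch

def sod_predict(X, y, Xs, m, noise_var=0.1, seed=0):
    # Naive subset-of-data baseline: keep m points as I, discard the rest.
    I = np.random.default_rng(seed).choice(len(X), size=m, replace=False)
    return gp_predict(X[I], y[I], Xs, noise_var)   # O(m^3) training
```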
Subset Of Regressors
- Finite linear model with a peculiar prior on the weights:
  f∗ = K∗,I αI ,  αI ∼ N(0, KI⁻¹)   ⇒   f∗ = K∗,I KI⁻¹fI ,  fI ∼ N(0, KI)
- Posterior now benefits from all of y:
  q(fI|y) ∝ N(y | K⊤I,· KI⁻¹fI , σ²I) · N(fI | 0, KI)
          ∼ N( KI [KI,· K⊤I,· + σ²KI]⁻¹KI,· y , σ²KI [KI,· K⊤I,· + σ²KI]⁻¹KI )
- The conditional distribution of f∗ is degenerate!
  p(f∗|fI) ∼ N( K∗,I KI⁻¹fI , 0 )
- The predictive distribution produces nonsense errorbars:
  µ∗ = K∗,I [KI,· K⊤I,· + σ²KI]⁻¹KI,· y
  σ²∗ = σ²K∗,I [KI,· K⊤I,· + σ²KI]⁻¹K⊤∗,I
- Under the prior, there are only functions with m degrees of freedom
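A sketch of the SoR predictive equations, reusing rbf from the first sketch; the support set I is assumed given:

```python
import numpy as np  # reuses rbf from the first sketch

def sor_predict(X, y, Xs, I, noise_var=0.1):
    # Subset of Regressors: weights alpha_I ~ N(0, K_I^{-1}).
    Kin = rbf(X[I], X)                 # K_{I,.}  (m x n)
    Ki = rbf(X[I], X[I])               # K_I      (m x m)
    Ksi = rbf(Xs, X[I])                # K_{*,I}
    A = Kin @ Kin.T + noise_var * Ki   # K_{I,.} K_{I,.}^T + sigma^2 K_I
    mu = Ksi @ np.linalg.solve(A, Kin @ y)
    var = noise_var * np.einsum('ij,ji->i', Ksi, np.linalg.solve(A, Ksi.T))
    return mu, var   # var vanishes far from the support set: nonsense errorbars
```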
Projected Process (Seeger et al.)
- Basic principle: likelihood approximation
  p(y|fI) ∼ N( K⊤I,· KI⁻¹fI , σ²I )
- Leads to exactly the same posterior as for Subset of Regressors
- But the conditional distribution is now non-degenerate (process approximation):
  p(f∗|fI) ∼ N( K∗,I KI⁻¹fI , K∗,∗ − K∗,I KI⁻¹K⊤∗,I )
- Predictive distribution with the same mean as Subset of Regressors, but with a way under-confident predictive variance!
  µ∗ = K∗,I [KI,· K⊤I,· + σ²KI]⁻¹KI,· y
  σ²∗ = K∗,∗ − K∗,I KI⁻¹K⊤∗,I + σ²K∗,I [KI,· K⊤I,· + σ²KI]⁻¹K⊤∗,I
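A sketch of the Projected Process predictive distribution, reusing rbf and sor_predict from the sketches above; k(x∗, x∗) = 1 holds for the unit-amplitude kernel assumed there:

```python
import numpy as np  # reuses rbf and sor_predict from the sketches above

def pp_predict(X, y, Xs, I, noise_var=0.1):
    # Projected Process: SoR mean, plus the non-degenerate variance term.
    mu, var_sor = sor_predict(X, y, Xs, I, noise_var)
    Ki = rbf(X[I], X[I])
    Ksi = rbf(Xs, X[I])
    kss = np.ones(len(Xs))   # k(x*, x*) = 1 for the unit-amplitude RBF kernel
    q = np.einsum('ij,ji->i', Ksi, np.linalg.solve(Ki, Ksi.T))  # K_{*,I} K_I^{-1} K_{*,I}^T
    return mu, kss - q + var_sor
```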
Augmented Subset Of Regressors
- For each x∗, augment fI with f∗: new active set I∗
- Augmented posterior: q([fI, f∗]⊤ | y)
- ... at a cost of O(nm) per test case: need to compute K∗,· K⊤I,·
- aSoR:
  µ∗ = K∗,· [Q + v∗v⊤∗/c∗]⁻¹y ,  σ²∗ = K∗,∗ − K∗,· [Q + v∗v⊤∗/c∗]⁻¹K⊤∗,·
  with the usual approximate covariance: Q = K⊤I,· KI⁻¹KI,· + σ²I
  with the difference between the actual and projected covariance of f∗ and f: v∗ = K⊤∗,· − K⊤I,· KI⁻¹KI,∗
  with the difference between the prior variance of f∗ and the projected one: c∗ = K∗,∗ − K⊤I,∗ KI⁻¹KI,∗
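A naive sketch of the aSoR equations at a single test input, reusing rbf from the first sketch. Note it rebuilds Q and solves from scratch in O(n³); exploiting the rank-one structure is what gives the O(nm) per-test-case cost quoted above:

```python
import numpy as np  # reuses rbf from the first sketch

def asor_predict(X, y, xs, I, noise_var=0.1):
    # Augmented SoR at one test input xs (a d-vector); naive O(n^3) version.
    Kin = rbf(X[I], X)                         # K_{I,.}
    Ki = rbf(X[I], X[I])                       # K_I
    ks = rbf(xs[None, :], X).ravel()           # K_{*,.}
    ksi = rbf(xs[None, :], X[I]).ravel()       # K_{*,I}
    Q = Kin.T @ np.linalg.solve(Ki, Kin) + noise_var * np.eye(len(X))
    v = ks - Kin.T @ np.linalg.solve(Ki, ksi)  # actual minus projected covariance
    c = 1.0 - ksi @ np.linalg.solve(Ki, ksi)   # k(x*, x*) = 1 for this kernel
    A = Q + np.outer(v, v) / c
    mu = ks @ np.linalg.solve(A, y)
    var = 1.0 - ks @ np.linalg.solve(A, ks)
    return mu, var
```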
Dude, where’s my prior?
The Priors
The equivalent prior on [f, f∗]⊤ is N(0, P), with Q = K⊤I,· KI⁻¹KI,· (blocks written row-wise, rows separated by ";"):

Subset of Regressors:
  P = [ Q , K⊤I,· KI⁻¹KI,∗ ;  K⊤I,∗ KI⁻¹KI,· , K⊤I,∗ KI⁻¹KI,∗ ]

Projected Process:
  P = [ Q , K⊤I,· KI⁻¹KI,∗ ;  K⊤I,∗ KI⁻¹KI,· , K∗,∗ ]

Nyström (positive definiteness!):
  P = [ Q , K⊤∗,· ;  K∗,· , K∗,∗ ]

Ed and Zoubin's funky thing:
  P = [ Q + Λ , K⊤I,· KI⁻¹KI,∗ ;  K⊤I,∗ KI⁻¹KI,· , K∗,∗ ] ,  Λ = diag(K) − diag(Q)

Augmented Subset of Regressors:
  P = [ Q + v∗v⊤∗/c∗ , K⊤∗,· ;  K∗,· , K∗,∗ ]
  with: v∗ = K⊤∗,· − K⊤I,· KI⁻¹KI,∗ ,  c∗ = K∗,∗ − K⊤I,∗ KI⁻¹KI,∗
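A sketch that makes the implied priors tangible by building P for three of the approximations (Nyström and aSoR omitted for brevity; rbf is from the first sketch):

```python
import numpy as np  # reuses rbf from the first sketch

def implied_priors(X, xs, I):
    # Joint prior covariance P over [f, f*] implied by three approximations.
    Z = np.vstack([X, xs[None, :]])             # training inputs plus x*
    n = len(X)
    Ki = rbf(X[I], X[I])
    Kiz = rbf(X[I], Z)                          # covariances K_I to (f, f*)
    K = rbf(Z, Z)                               # exact prior covariance
    Q = Kiz.T @ np.linalg.solve(Ki, Kiz)        # projected covariance
    P_sor = Q                                   # degenerate everywhere
    P_pp = Q.copy(); P_pp[n, n] = K[n, n]       # exact prior variance at x* only
    P_ez = Q + np.diag(np.diag(K) - np.diag(Q)) # exact on the whole diagonal
    return P_sor, P_pp, P_ez
```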
More on Ed and Zoubin’s Method
- Here's a way of looking at it: the prior is a posterior process,
  f∗|fI ∼ N( K∗,I KI⁻¹fI , K∗,∗ − K∗,I KI⁻¹K⊤∗,I )
  ... well, almost: E[f₊ f∗ | fI] = 0, i.e. distinct latent values are made conditionally uncorrelated
- And then of course fI ∼ N(0, KI)
- The corresponding prior is
  p(f) = N( 0 , K∗,∗I + Q − diag(Q) ) ,  Q = K⊤I,· KI⁻¹KI,·
- With a bit of algebra you recover the marginal likelihood and the predictive distribution (see the sketch below)
- I finished this 30 minutes ago, which is why I won't show figures on it! (well, now I may)
- but ...
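A naive sketch of that marginal likelihood under the diagonal-corrected prior (an O(n³) illustration; rbf is from the first sketch, and with its unit-amplitude kernel K∗,∗I is exactly the diagonal of K):

```python
import numpy as np  # reuses rbf from the first sketch

def ez_log_marginal(X, y, I, noise_var=0.1):
    # Log marginal likelihood under the prior N(0, Q + diag(K) - diag(Q)).
    Ki = rbf(X[I], X[I])
    Kin = rbf(X[I], X)
    K = rbf(X, X)
    Q = Kin.T @ np.linalg.solve(Ki, Kin)
    C = Q + np.diag(np.diag(K) - np.diag(Q)) + noise_var * np.eye(len(X))
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + y @ np.linalg.solve(C, y) + len(X) * np.log(2 * np.pi))
```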
[Figure slides: predictive distributions on a 1-D toy dataset (x from −15 to 15, y from −1.5 to 1.5) for the Naïve Process Approximation; Subset of Regressors (degenerate); Projected Process Approximation; Ed and Zoubin's Projected Process Method; and Augmented SoR (prediction scales with nm).]

Comparing the Predictive Uncertainties
[Figure: predictive uncertainties (0.1 to 0.7) of the Naive, SR, Seeger, EdZoubin, and Augm methods on the same toy dataset.]
Smola and Bartlett's Greedy Selection
[Figure: three panels against support set size m (logarithmic scale, 10¹ to 10²): test squared error (0.04 to 0.12); negative log evidence (−160 to −80), marking the minima of the negative log evidence and of the negative log posterior; and upper and lower bounds on the negative log posterior (−50 to 50), with gap = 0.025.]
Wrap Up
- Training: from O(n³) to O(nm²)
- Predicting: from O(n²) to O(m²) (or O(nm))
- Be sparse if you must, but only then
- Beware of over-fitting prone greedy selection methods
- Do worry about the prior implied by the approximation!
Appendix: Healing the RVM by Augmentation (joint work with Carl Rasmussen)
Finite Linear Model
A Bad Probabilistic Model
The Healing: Augmentation
[Three figure slides on a 1-D example (x from 5 to 15, y from −2 to 2): the finite linear model's fit, its predictive distribution, and the predictive distribution after augmentation.]
Augmentation?
- Train your m-dimensional model once
- At each new test point, add a new basis function
- Update the (m+1)-dimensional model (update the posterior), as sketched below
- Testing is now more expensive
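A schematic sketch of this loop for a finite linear model with localized RBF basis functions. The unit-variance weight prior is a simplifying assumption (the RVM instead learns one precision per weight), and all names here are illustrative:

```python
import numpy as np

def augmented_predict(X, y, centers, xs, noise_var=0.1, ell=1.0):
    # At test time, one extra basis function centred at xs is added
    # (m -> m+1) and the weight posterior is recomputed from scratch.
    def phi(A, C):   # design matrix of RBF basis functions with centers C
        return np.exp(-0.5 * ((A[:, None, :] - C[None, :, :]) ** 2).sum(-1) / ell ** 2)
    C_aug = np.vstack([centers, xs[None, :]])           # augmented center set
    Phi = phi(X, C_aug)                                  # n x (m+1)
    S_inv = Phi.T @ Phi / noise_var + np.eye(len(C_aug)) # unit-variance prior
    w = np.linalg.solve(S_inv, Phi.T @ y / noise_var)    # posterior mean weights
    p = phi(xs[None, :], C_aug).ravel()
    mu = p @ w
    var = p @ np.linalg.solve(S_inv, p) + noise_var
    return mu, var
```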
Wait a minute ... I don’t care about probabilistic predictions!
Another Symptom: Underfitting
Abalone
              Squared error    Absolute error    − log test density
              RVM   RVM*   GP    RVM   RVM*   GP    RVM   RVM*   GP
  Loss:     0.138  0.135 0.092  0.259  0.253 0.209  0.469  0.408 0.219

  p-values of pairwise comparisons (squared / absolute / − log density):
  RVM  vs RVM*:  not sig. / 0.07 / < 0.01
  RVM  vs GP:    < 0.01 / < 0.01 / < 0.01
  RVM* vs GP:    0.02 / < 0.01 / < 0.01
Robot Arm
              Squared error      Absolute error       − log test density
              RVM    RVM*    GP     RVM    RVM*    GP      RVM     RVM*      GP
  Loss:     0.0043 0.0040 0.0024  0.0482  0.0467 0.0334  −1.2162 −1.3295 −1.7446

  p-values of pairwise comparisons (squared / absolute / − log density):
  RVM  vs RVM*:  < 0.01 / < 0.01 / < 0.01
  RVM  vs GP:    < 0.01 / < 0.01 / < 0.01
  RVM* vs GP:    < 0.01 / < 0.01 / < 0.01
- GP (Gaussian Process): infinitely augmented linear model
- Beats finite linear models on all datasets I've looked at
Interlude
- None of this happens with non-localized basis functions
Finite Linear Model
A Bad Probabilistic Model
The Healing: Augmentation
[The same three figure slides (x from 5 to 15, y from −2 to 2), now with non-localized basis functions.]
Appendix: Augmentation in Sparse GPs
- O(nm²) sparse approximation to Gaussian Processes (Smola and Bartlett, 2001)
- Augmentation: same training cost, more expensive testing
- Better mean predictions and better probabilistic performance
2000 training / 2000 test:
                           non-augmented             augmented
  method     tr. neg ev.   MAE     MSE     NTL       MAE     MSE     NTL
  SGGP           –        0.0481  0.0048  −0.3525   0.0460  0.0045  −0.4613
  SGEV        −1.1555     0.0484  0.0049  −0.3446   0.0463  0.0045  −0.4562
  HPEV-rand   −1.0978     0.0503  0.0047  −0.3694   0.0486  0.0045  −0.4269
  HPEV-SGEV   −1.3234     0.0425  0.0036  −0.4218   0.0404  0.0033  −0.5918
  HPEV-SGGP   −1.3274     0.0425  0.0036  −0.4217   0.0405  0.0033  −0.5920

36000 training / 4000 test:
  SGEV        −1.4932     0.0371  0.0028  −0.6223   0.0346  0.0024  −0.6672
  HPEV-rand   −1.5378     0.0363  0.0026  −0.6417   0.0340  0.0023  −0.7004