Some Concerns About Sparse Approximations for Gaussian Process Regression
Joaquin Quiñonero Candela
Max Planck Institute for Biological Cybernetics
Gaussian Process Round Table, Sheffield, June 9 and 10, 2005
Menu
- Concerns about the quality of the predictive distributions
- Augmentation: a bit more expensive, but gooood ...
- Dude, where’s my prior?
- A short tale about sparse greedy support set selection
The Regression Task
- Simplest case: additive independent Gaussian noise of variance σ²
- Gaussian process prior over functions:
  p(y|f) ∼ N(f, σ²I) ,  p(f) ∼ N(0, K)
- Task: obtain the predictive distribution of f∗ at the new input x∗:
  p(f∗|x∗, y) = ∫ p(f∗|x∗, f) p(f|y) df
- Need to compute the posterior distribution (expensive):
  p(f|y) ∼ N( K(K + σ²I)⁻¹y , σ²K(K + σ²I)⁻¹ )
- ... and integrate f out of the conditional distribution of f∗:
  p(f∗|x∗, f) ∼ N( K∗,· K⁻¹f , K∗,∗ − K∗,· K⁻¹K⊤∗,· )
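To make these quantities concrete, here is a minimal NumPy sketch of exact GP regression. The unit-amplitude squared-exponential kernel and the noise_var default are illustrative choices, not taken from the slides:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    # Squared-exponential covariance between input sets A (n x d) and B (m x d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, Xs, noise_var=0.1):
    # Exact GP prediction: O(n^3) training cost.
    K = rbf(X, X)
    Ks = rbf(Xs, X)                              # K_{*,.}
    A = K + noise_var * np.eye(len(X))           # K + sigma^2 I
    mu = Ks @ np.linalg.solve(A, y)              # K_{*,.} (K + sigma^2 I)^{-1} y
    cov = rbf(Xs, Xs) - Ks @ np.linalg.solve(A, Ks.T)
    return mu, np.diag(cov)
```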
Usual Reduced Set Approximations
- Consider some very common approximations:
  – Naïve process approximation on a subset of the data
  – Subset of Regressors (Wahba; Smola and Bartlett; ...)
  – Sparse online GPs (Csató and Opper)
  – Fast Sparse Projected Process Approximation (Seeger et al.)
  – Relevance Vector Machines (Tipping)
  – Augmented Reduced Rank GPs (Rasmussen, Quiñonero Candela)
- All based on considering only a subset I of the latent variables:
  p(f∗|x∗, y) = ∫ p(f∗|x∗, fI) p(fI|y) dfI
- However, they differ in:
  – the way the support set I and the hyperparameters are learnt
  – the likelihood and/or predictive distribution approximations
- This has important consequences for the resulting predictive distribution:
  – risk of over-fitting
  – degenerate approximations with nonsense predictive uncertainties
Naïve Process Approximation
- Extremely simple idea: throw away all the data outside I!
- The posterior only benefits from the information contained in yI:
  p(fI|yI) ∼ N( KI(KI + σ²I)⁻¹yI , σ²KI(KI + σ²I)⁻¹ )
- The model underfits and is under-confident:
  p(f∗|x∗, yI) ∼ N(µ∗, σ²∗)
  µ∗ = K∗,I (KI + σ²I)⁻¹yI ,  σ²∗ = K∗,∗ − K∗,I (KI + σ²I)⁻¹K⊤∗,I
- Training scales with m³; predicting with m (mean) and m² (variance)
- Baseline approximation: we want higher accuracy and confidence
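A minimal sketch of this baseline, reusing rbf and gp_predict from the sketch above. Picking I at random is an illustrative choice; later slides discuss greedy selection:

```python
import numpy as np  # reuses rbf and gp_predict from the first sketch

def sod_predict(X, y, Xs, m, noise_var=0.1, seed=0):
    # Naive subset-of-data baseline: keep m points as I, discard the rest.
    I = np.random.default_rng(seed).choice(len(X), size=m, replace=False)
    return gp_predict(X[I], y[I], Xs, noise_var)   # O(m^3) training
```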
Subset Of Regressors
- Finite linear model with a peculiar prior on the weights:
  f∗ = K∗,I αI ,  αI ∼ N(0, KI⁻¹)   ⇒   f∗ = K∗,I KI⁻¹fI ,  fI ∼ N(0, KI)
- Posterior now benefits from all of y:
  q(fI|y) ∝ N(y | K⊤I,· KI⁻¹fI , σ²I) · N(fI | 0, KI)
          ∼ N( KI [KI,· K⊤I,· + σ²KI]⁻¹KI,· y , σ²KI [KI,· K⊤I,· + σ²KI]⁻¹KI )
- The conditional distribution of f∗ is degenerate!
  p(f∗|fI) ∼ N( K∗,I KI⁻¹fI , 0 )
- The predictive distribution produces nonsense errorbars:
  µ∗ = K∗,I [KI,· K⊤I,· + σ²KI]⁻¹KI,· y
  σ²∗ = σ²K∗,I [KI,· K⊤I,· + σ²KI]⁻¹K⊤∗,I
- Under the prior, there are only functions with m degrees of freedom
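A sketch of the SoR predictive equations, reusing rbf from the first sketch; the support set I is assumed given:

```python
import numpy as np  # reuses rbf from the first sketch

def sor_predict(X, y, Xs, I, noise_var=0.1):
    # Subset of Regressors: weights alpha_I ~ N(0, K_I^{-1}).
    Kin = rbf(X[I], X)                 # K_{I,.}  (m x n)
    Ki = rbf(X[I], X[I])               # K_I      (m x m)
    Ksi = rbf(Xs, X[I])                # K_{*,I}
    A = Kin @ Kin.T + noise_var * Ki   # K_{I,.} K_{I,.}^T + sigma^2 K_I
    mu = Ksi @ np.linalg.solve(A, Kin @ y)
    var = noise_var * np.einsum('ij,ji->i', Ksi, np.linalg.solve(A, Ksi.T))
    return mu, var   # var vanishes far from the support set: nonsense errorbars
```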
Projected Process (Seeger et al.)
- Basic principle: likelihood approximation
  p(y|fI) ∼ N( K⊤I,· KI⁻¹fI , σ²I )
- Leads to exactly the same posterior as for Subset of Regressors
- But the conditional distribution is now non-degenerate (process approximation):
  p(f∗|fI) ∼ N( K∗,I KI⁻¹fI , K∗,∗ − K∗,I KI⁻¹K⊤∗,I )
- Predictive distribution with the same mean as Subset of Regressors, but with a way under-confident predictive variance!
  µ∗ = K∗,I [KI,· K⊤I,· + σ²KI]⁻¹KI,· y
  σ²∗ = K∗,∗ − K∗,I KI⁻¹K⊤∗,I + σ²K∗,I [KI,· K⊤I,· + σ²KI]⁻¹K⊤∗,I
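A sketch of the Projected Process predictive distribution, reusing rbf and sor_predict from the sketches above; k(x∗, x∗) = 1 holds for the unit-amplitude kernel assumed there:

```python
import numpy as np  # reuses rbf and sor_predict from the sketches above

def pp_predict(X, y, Xs, I, noise_var=0.1):
    # Projected Process: SoR mean, plus the non-degenerate variance term.
    mu, var_sor = sor_predict(X, y, Xs, I, noise_var)
    Ki = rbf(X[I], X[I])
    Ksi = rbf(Xs, X[I])
    kss = np.ones(len(Xs))   # k(x*, x*) = 1 for the unit-amplitude RBF kernel
    q = np.einsum('ij,ji->i', Ksi, np.linalg.solve(Ki, Ksi.T))  # K_{*,I} K_I^{-1} K_{*,I}^T
    return mu, kss - q + var_sor
```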
Augmented Subset Of Regressors
- For each x∗, augment fI with f∗: new active set I∗
- Augmented posterior: q([fI, f∗]⊤ | y)
- ... at a cost of O(nm) per test case: need to compute K∗,· K⊤I,·
- aSoR:
  µ∗ = K∗,· [Q + v∗v⊤∗/c∗]⁻¹y ,  σ²∗ = K∗,∗ − K∗,· [Q + v∗v⊤∗/c∗]⁻¹K⊤∗,·
  with the usual approximate covariance: Q = K⊤I,· KI⁻¹KI,· + σ²I
  with the difference between the actual and projected covariance of f∗ and f: v∗ = K⊤∗,· − K⊤I,· KI⁻¹KI,∗
  with the difference between the prior variance of f∗ and the projected one: c∗ = K∗,∗ − K⊤I,∗ KI⁻¹KI,∗
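A naive sketch of the aSoR equations at a single test input, reusing rbf from the first sketch. Note it rebuilds Q and solves from scratch in O(n³); exploiting the rank-one structure is what gives the O(nm) per-test-case cost quoted above:

```python
import numpy as np  # reuses rbf from the first sketch

def asor_predict(X, y, xs, I, noise_var=0.1):
    # Augmented SoR at one test input xs (a d-vector); naive O(n^3) version.
    Kin = rbf(X[I], X)                         # K_{I,.}
    Ki = rbf(X[I], X[I])                       # K_I
    ks = rbf(xs[None, :], X).ravel()           # K_{*,.}
    ksi = rbf(xs[None, :], X[I]).ravel()       # K_{*,I}
    Q = Kin.T @ np.linalg.solve(Ki, Kin) + noise_var * np.eye(len(X))
    v = ks - Kin.T @ np.linalg.solve(Ki, ksi)  # actual minus projected covariance
    c = 1.0 - ksi @ np.linalg.solve(Ki, ksi)   # k(x*, x*) = 1 for this kernel
    A = Q + np.outer(v, v) / c
    mu = ks @ np.linalg.solve(A, y)
    var = 1.0 - ks @ np.linalg.solve(A, ks)
    return mu, var
```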
Dude, where’s my prior?
The Priors
The equivalent prior on [f, f∗]⊤ is N(0, P), with Q = K⊤I,· KI⁻¹KI,· (blocks written row-wise, rows separated by ";"):

Subset of Regressors:
  P = [ Q , K⊤I,· KI⁻¹KI,∗ ;  K⊤I,∗ KI⁻¹KI,· , K⊤I,∗ KI⁻¹KI,∗ ]

Projected Process:
  P = [ Q , K⊤I,· KI⁻¹KI,∗ ;  K⊤I,∗ KI⁻¹KI,· , K∗,∗ ]

Nyström (positive definiteness!):
  P = [ Q , K⊤∗,· ;  K∗,· , K∗,∗ ]

Ed and Zoubin's funky thing:
  P = [ Q + Λ , K⊤I,· KI⁻¹KI,∗ ;  K⊤I,∗ KI⁻¹KI,· , K∗,∗ ] ,  Λ = diag(K) − diag(Q)

Augmented Subset of Regressors:
  P = [ Q + v∗v⊤∗/c∗ , K⊤∗,· ;  K∗,· , K∗,∗ ]
  with: v∗ = K⊤∗,· − K⊤I,· KI⁻¹KI,∗ ,  c∗ = K∗,∗ − K⊤I,∗ KI⁻¹KI,∗
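A sketch that makes the implied priors tangible by building P for three of the approximations (Nyström and aSoR omitted for brevity; rbf is from the first sketch):

```python
import numpy as np  # reuses rbf from the first sketch

def implied_priors(X, xs, I):
    # Joint prior covariance P over [f, f*] implied by three approximations.
    Z = np.vstack([X, xs[None, :]])             # training inputs plus x*
    n = len(X)
    Ki = rbf(X[I], X[I])
    Kiz = rbf(X[I], Z)                          # covariances K_I to (f, f*)
    K = rbf(Z, Z)                               # exact prior covariance
    Q = Kiz.T @ np.linalg.solve(Ki, Kiz)        # projected covariance
    P_sor = Q                                   # degenerate everywhere
    P_pp = Q.copy(); P_pp[n, n] = K[n, n]       # exact prior variance at x* only
    P_ez = Q + np.diag(np.diag(K) - np.diag(Q)) # exact on the whole diagonal
    return P_sor, P_pp, P_ez
```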
More on Ed and Zoubin’s Method
- Here's a way of looking at it: the prior is a posterior process,
  f∗|fI ∼ N( K∗,I KI⁻¹fI , K∗,∗ − K∗,I KI⁻¹K⊤∗,I )
  ... well, almost: E[f₊ f∗ | fI] = 0, i.e. distinct latent values are made conditionally uncorrelated
- And then of course fI ∼ N(0, KI)
- The corresponding prior is
  p(f) = N( 0 , K∗,∗I + Q − diag(Q) ) ,  Q = K⊤I,· KI⁻¹KI,·
- With a bit of algebra you recover the marginal likelihood and the predictive distribution (see the sketch below)
- I finished this 30 minutes ago, which is why I won't show figures on it! (well, now I may)
- but ...
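A naive sketch of that marginal likelihood under the diagonal-corrected prior (an O(n³) illustration; rbf is from the first sketch, and with its unit-amplitude kernel K∗,∗I is exactly the diagonal of K):

```python
import numpy as np  # reuses rbf from the first sketch

def ez_log_marginal(X, y, I, noise_var=0.1):
    # Log marginal likelihood under the prior N(0, Q + diag(K) - diag(Q)).
    Ki = rbf(X[I], X[I])
    Kin = rbf(X[I], X)
    K = rbf(X, X)
    Q = Kin.T @ np.linalg.solve(Ki, Kin)
    C = Q + np.diag(np.diag(K) - np.diag(Q)) + noise_var * np.eye(len(X))
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + y @ np.linalg.solve(C, y) + len(X) * np.log(2 * np.pi))
```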
[Figure slides: predictive distributions on a 1-D toy dataset (x from −15 to 15, y from −1.5 to 1.5) for the Naïve Process Approximation; Subset of Regressors (degenerate); Projected Process Approximation; Ed and Zoubin's Projected Process Method; and Augmented SoR (prediction scales with nm).]

Comparing the Predictive Uncertainties
[Figure: predictive uncertainties (0.1 to 0.7) of the Naive, SR, Seeger, EdZoubin, and Augm methods on the same toy dataset.]
Smola and Bartlett's Greedy Selection
[Figure: three panels against support set size m (logarithmic scale, 10¹ to 10²): test squared error (0.04 to 0.12); negative log evidence (−160 to −80), marking the minima of the negative log evidence and of the negative log posterior; and upper and lower bounds on the negative log posterior (−50 to 50), with gap = 0.025.]
Wrap Up
- Training: from O(n³) to O(nm²)
- Predicting: from O(n²) to O(m²) (or O(nm))
- Be sparse if you must, but only then
- Beware of over-fitting prone greedy selection methods
- Do worry about the prior implied by the approximation!
Appendix: Healing the RVM by Augmentation (joint work with Carl Rasmussen)
Finite Linear Model
A Bad Probabilistic Model
The Healing: Augmentation
[Three figure slides on a 1-D example (x from 5 to 15, y from −2 to 2): the finite linear model's fit, its predictive distribution, and the predictive distribution after augmentation.]
Augmentation?
- Train your m-dimensional model once
- At each new test point, add a new basis function
- Update the (m+1)-dimensional model (update the posterior), as sketched below
- Testing is now more expensive
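A schematic sketch of this loop for a finite linear model with localized RBF basis functions. The unit-variance weight prior is a simplifying assumption (the RVM instead learns one precision per weight), and all names here are illustrative:

```python
import numpy as np

def augmented_predict(X, y, centers, xs, noise_var=0.1, ell=1.0):
    # At test time, one extra basis function centred at xs is added
    # (m -> m+1) and the weight posterior is recomputed from scratch.
    def phi(A, C):   # design matrix of RBF basis functions with centers C
        return np.exp(-0.5 * ((A[:, None, :] - C[None, :, :]) ** 2).sum(-1) / ell ** 2)
    C_aug = np.vstack([centers, xs[None, :]])           # augmented center set
    Phi = phi(X, C_aug)                                  # n x (m+1)
    S_inv = Phi.T @ Phi / noise_var + np.eye(len(C_aug)) # unit-variance prior
    w = np.linalg.solve(S_inv, Phi.T @ y / noise_var)    # posterior mean weights
    p = phi(xs[None, :], C_aug).ravel()
    mu = p @ w
    var = p @ np.linalg.solve(S_inv, p) + noise_var
    return mu, var
```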
Wait a minute ... I don’t care about probabilistic predictions!
Another Symptom: Underfitting
Abalone
              Squared error    Absolute error    − log test density
              RVM   RVM*   GP    RVM   RVM*   GP    RVM   RVM*   GP
  Loss:     0.138  0.135 0.092  0.259  0.253 0.209  0.469  0.408 0.219

  p-values of pairwise comparisons (squared / absolute / − log density):
  RVM  vs RVM*:  not sig. / 0.07 / < 0.01
  RVM  vs GP:    < 0.01 / < 0.01 / < 0.01
  RVM* vs GP:    0.02 / < 0.01 / < 0.01
Robot Arm
              Squared error      Absolute error       − log test density
              RVM    RVM*    GP     RVM    RVM*    GP      RVM     RVM*      GP
  Loss:     0.0043 0.0040 0.0024  0.0482  0.0467 0.0334  −1.2162 −1.3295 −1.7446

  p-values of pairwise comparisons (squared / absolute / − log density):
  RVM  vs RVM*:  < 0.01 / < 0.01 / < 0.01
  RVM  vs GP:    < 0.01 / < 0.01 / < 0.01
  RVM* vs GP:    < 0.01 / < 0.01 / < 0.01
- GP (Gaussian Process): infinitely augmented linear model
- Beats finite linear models on all datasets I've looked at
Interlude
- None of this happens with non-localized basis functions
Finite Linear Model
A Bad Probabilistic Model
The Healing: Augmentation
[The same three figure slides (x from 5 to 15, y from −2 to 2), now with non-localized basis functions.]
Appendix: Augmentation in Sparse GPs
- O(nm²) sparse approximation to Gaussian Processes (Smola and Bartlett, 2001)
- Augmentation: same training cost, more expensive testing
- Better mean predictions and better probabilistic performance
2000 training / 2000 test:
                           non-augmented             augmented
  method     tr. neg ev.   MAE     MSE     NTL       MAE     MSE     NTL
  SGGP           –        0.0481  0.0048  −0.3525   0.0460  0.0045  −0.4613
  SGEV        −1.1555     0.0484  0.0049  −0.3446   0.0463  0.0045  −0.4562
  HPEV-rand   −1.0978     0.0503  0.0047  −0.3694   0.0486  0.0045  −0.4269
  HPEV-SGEV   −1.3234     0.0425  0.0036  −0.4218   0.0404  0.0033  −0.5918
  HPEV-SGGP   −1.3274     0.0425  0.0036  −0.4217   0.0405  0.0033  −0.5920

36000 training / 4000 test:
  SGEV        −1.4932     0.0371  0.0028  −0.6223   0.0346  0.0024  −0.6672
  HPEV-rand   −1.5378     0.0363  0.0026  −0.6417   0.0340  0.0023  −0.7004