SLIDE 1

SOME CONCERNS ABOUT SPARSE APPROXIMATIONS FOR GAUSSIAN PROCESS REGRESSION

Joaquin Quiñonero Candela
Max Planck Institute for Biological Cybernetics

Gaussian Process Round Table Sheffield, June 9 and 10, 2005

SLIDE 2

Menu

  • Concerns about the quality of the predictive distributions
  • Augmentation: a bit more expensive, but gooood ...
  • Dude, where’s my prior?
  • A short tale about sparse greedy support set selection

SLIDE 3

The Regression Task

  • Simplest case: additive independent Gaussian noise of variance $\sigma^2$
  • Gaussian process prior over functions:

$$p(\mathbf{y}\,|\,\mathbf{f}) = \mathcal{N}(\mathbf{f},\, \sigma^2 I)\,, \qquad p(\mathbf{f}) = \mathcal{N}(\mathbf{0},\, K)$$

  • Task: obtain the predictive distribution of $f_*$ at the new input $x_*$:

$$p(f_*\,|\,x_*, \mathbf{y}) = \int p(f_*\,|\,x_*, \mathbf{f})\; p(\mathbf{f}\,|\,\mathbf{y})\; \mathrm{d}\mathbf{f}$$

  • Need to compute the posterior distribution (expensive):

$$p(\mathbf{f}\,|\,\mathbf{y}) = \mathcal{N}\!\left(K (K + \sigma^2 I)^{-1}\mathbf{y},\; \sigma^2 K (K + \sigma^2 I)^{-1}\right)$$

  • ... and integrate $\mathbf{f}$ out of the conditional distribution of $f_*$:

$$p(f_*\,|\,x_*, \mathbf{f}) = \mathcal{N}\!\left(K_{*,\cdot}\, K^{-1}\mathbf{f},\; K_{*,*} - K_{*,\cdot}\, K^{-1} K_{*,\cdot}^\top\right)$$
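
These formulas translate almost directly into code. A minimal NumPy sketch of full GP regression (the squared-exponential kernel, the toy data, and the noise level are illustrative assumptions, not from the slides):

```python
import numpy as np

def rbf(A, B, ell=2.0):
    """Squared-exponential covariance between row-stacked inputs A (p x d), B (q x d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
X = rng.uniform(-15, 15, size=(50, 1))               # n = 50 training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)  # noisy targets
Xs = np.linspace(-15, 15, 201)[:, None]              # test inputs x_*
sigma2 = 0.01                                        # noise variance sigma^2

K = rbf(X, X)                                        # K
Ks = rbf(Xs, X)                                      # K_{*,.}
Kn = K + sigma2 * np.eye(len(X))                     # K + sigma^2 I  (O(n^3) to factor)
mu = Ks @ np.linalg.solve(Kn, y)                     # predictive mean of f_*
var = rbf(Xs, Xs).diagonal() - np.sum(Ks * np.linalg.solve(Kn, Ks.T).T, axis=1)
```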

SLIDE 4

Usual Reduced Set Approximations

  • Consider some very common approximations:
    – Naïve process approximation on a subset of the data
    – Subset of regressors (Wahba; Smola and Bartlett; ...)
    – Sparse online GPs (Csató and Opper)
    – Fast sparse projected process approximation (Seeger et al.)
    – Relevance vector machines (Tipping)
    – Augmented reduced-rank GPs (Rasmussen and Quiñonero Candela)
  • All are based on considering only a subset $I$ of the latent variables:

$$p(f_*\,|\,x_*, \mathbf{y}) = \int p(f_*\,|\,x_*, \mathbf{f}_I)\; p(\mathbf{f}_I\,|\,\mathbf{y})\; \mathrm{d}\mathbf{f}_I$$

  • However, they differ in:
    – the way the support set $I$ and the hyperparameters are learnt
    – the likelihood and/or predictive distribution approximations
  • This has important consequences for the resulting predictive distribution:
    – risk of over-fitting
    – degenerate approximations with nonsense predictive uncertainties

SLIDE 5

Naïve Process Approximation

  • Extremely simple idea: throw away all the data outside $I$!
  • The posterior only benefits from the information contained in $\mathbf{y}_I$:

$$p(\mathbf{f}_I\,|\,\mathbf{y}_I) = \mathcal{N}\!\left(K_I (K_I + \sigma^2 I)^{-1}\mathbf{y}_I,\; \sigma^2 K_I (K_I + \sigma^2 I)^{-1}\right)$$

  • The model under-fits and is under-confident:

$$p(f_*\,|\,x_*, \mathbf{y}_I) = \mathcal{N}(\mu_*, \sigma_*^2)$$

$$\mu_* = K_{*,I}\, (K_I + \sigma^2 I)^{-1}\mathbf{y}_I\,, \qquad \sigma_*^2 = K_{*,*} - K_{*,I}\, (K_I + \sigma^2 I)^{-1} K_{*,I}^\top$$

  • Training scales with $m^3$; predicting with $m$ (mean) and $m^2$ (variance)
  • Baseline approximation: we want higher accuracy and confidence
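
Continuing the sketch from slide 3, the naive approximation restricts everything to the $m$ support points; the support set here is chosen at random, purely for illustration:

```python
m = 10
I = rng.choice(len(X), size=m, replace=False)  # support set (random, for illustration)
KII = rbf(X[I], X[I])                          # K_I
KsI = rbf(Xs, X[I])                            # K_{*,I}
B = KII + sigma2 * np.eye(m)                   # K_I + sigma^2 I  (m x m: O(m^3) training)
mu_naive = KsI @ np.linalg.solve(B, y[I])      # uses only y_I
var_naive = rbf(Xs, Xs).diagonal() - np.sum(KsI * np.linalg.solve(B, KsI.T).T, axis=1)
```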

SLIDE 6

Subset Of Regressors

  • Finite linear model with a peculiar prior on the weights:

$$f_* = K_{*,I}\, \boldsymbol{\alpha}_I\,, \quad \boldsymbol{\alpha}_I \sim \mathcal{N}(\mathbf{0}, K_I^{-1}) \qquad \Longrightarrow \qquad f_* = K_{*,I}\, K_I^{-1} \mathbf{f}_I\,, \quad \mathbf{f}_I \sim \mathcal{N}(\mathbf{0}, K_I)$$

  • The posterior now benefits from all of $\mathbf{y}$:

$$q(\mathbf{f}_I\,|\,\mathbf{y}) \propto \mathcal{N}(\mathbf{y}\,|\,K_{I,\cdot}^\top K_I^{-1} \mathbf{f}_I,\; \sigma^2 I)\cdot \mathcal{N}(\mathbf{f}_I\,|\,\mathbf{0}, K_I)$$

$$= \mathcal{N}\!\left(K_I\, [K_{I,\cdot} K_{I,\cdot}^\top + \sigma^2 K_I]^{-1} K_{I,\cdot}\, \mathbf{y},\; \sigma^2 K_I\, [K_{I,\cdot} K_{I,\cdot}^\top + \sigma^2 K_I]^{-1} K_I\right)$$

  • The conditional distribution of $f_*$ is degenerate!

$$p(f_*\,|\,\mathbf{f}_I) = \mathcal{N}\!\left(K_{*,I}\, K_I^{-1} \mathbf{f}_I,\; 0\right)$$

  • The predictive distribution produces nonsense error bars:

$$\mu_* = K_{*,I} \left[K_{I,\cdot} K_{I,\cdot}^\top + \sigma^2 K_I\right]^{-1} K_{I,\cdot}\, \mathbf{y}\,, \qquad \sigma_*^2 = \sigma^2\, K_{*,I} \left[K_{I,\cdot} K_{I,\cdot}^\top + \sigma^2 K_I\right]^{-1} K_{*,I}^\top$$

  • Under the prior, only functions with $m$ degrees of freedom
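
In the same sketch, the SoR predictive equations are a few lines; the last one is where the degeneracy shows up:

```python
KIX = rbf(X[I], X)                   # K_{I,.}  (m x n)
Sigma = KIX @ KIX.T + sigma2 * KII   # K_{I,.} K_{I,.}^T + sigma^2 K_I  (m x m)
mu_sor = KsI @ np.linalg.solve(Sigma, KIX @ y)
var_sor = sigma2 * np.sum(KsI * np.linalg.solve(Sigma, KsI.T).T, axis=1)
# Far from every support point K_{*,I} -> 0, so var_sor -> 0: confidently wrong
```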

SLIDE 7

Projected Process (Seeger et al.)

  • Basic principle: likelihood approximation

$$p(\mathbf{y}\,|\,\mathbf{f}_I) = \mathcal{N}\!\left(K_{I,\cdot}^\top K_I^{-1} \mathbf{f}_I,\; \sigma^2 I\right)$$

  • Leads to exactly the same posterior as for Subset of Regressors
  • But the conditional distribution is now non-degenerate (process approximation):

$$p(f_*\,|\,\mathbf{f}_I) = \mathcal{N}\!\left(K_{*,I}\, K_I^{-1} \mathbf{f}_I,\; K_{*,*} - K_{*,I}\, K_I^{-1} K_{*,I}^\top\right)$$

  • Predictive distribution with the same mean as Subset of Regressors, but with a way under-confident predictive variance!

$$\mu_* = K_{*,I} \left[K_{I,\cdot} K_{I,\cdot}^\top + \sigma^2 K_I\right]^{-1} K_{I,\cdot}\, \mathbf{y}$$

$$\sigma_*^2 = K_{*,*} - K_{*,I}\, K_I^{-1} K_{*,I}^\top + \sigma^2\, K_{*,I} \left[K_{I,\cdot} K_{I,\cdot}^\top + \sigma^2 K_I\right]^{-1} K_{*,I}^\top$$
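
Continuing the sketch, the projected process keeps the SoR mean and adds the exact-prior correction to the variance:

```python
mu_pp = mu_sor                       # identical mean to Subset of Regressors
var_pp = (rbf(Xs, Xs).diagonal()                                 # K_{*,*}
          - np.sum(KsI * np.linalg.solve(KII, KsI.T).T, axis=1)  # - K_{*,I} K_I^{-1} K_{I,*}
          + var_sor)                 # + sigma^2 K_{*,I} Sigma^{-1} K_{*,I}^T (from SoR)
```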

SLIDE 8

Augmented Subset Of Regressors

  • For each $x_*$, augment $\mathbf{f}_I$ with $f_*$: new active set $I_*$
  • Augmented posterior: $q(\mathbf{f}_I, f_*\,|\,\mathbf{y})$
  • ... at a cost of $O(nm)$ per test case: need to compute $K_{*,\cdot}$ and its projection $K_{I,\cdot}^\top K_I^{-1} K_{I,*}$
  • aSoR:

$$\mu_* = K_{*,\cdot} \left[Q + \frac{v_*\, v_*^\top}{c_*}\right]^{-1} \mathbf{y}\,, \qquad \sigma_*^2 = K_{*,*} - K_{*,\cdot} \left[Q + \frac{v_*\, v_*^\top}{c_*}\right]^{-1} K_{*,\cdot}^\top$$

with the usual approximate covariance:

$$Q = K_{I,\cdot}^\top K_I^{-1} K_{I,\cdot} + \sigma^2 I$$

with the difference between the actual and the projected covariance of $f_*$ and $\mathbf{f}$:

$$v_* = K_{*,\cdot}^\top - K_{I,\cdot}^\top K_I^{-1} K_{I,*}$$

and with the difference between the prior variance of $f_*$ and the projected one:

$$c_* = K_{*,*} - K_{I,*}^\top K_I^{-1} K_{I,*}$$
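
A sketch of aSoR prediction via the Sherman–Morrison identity for the rank-one term, reusing earlier names. This naive version costs $O(n^2)$ per test point; the $O(nm)$ quoted above requires exploiting the low-rank-plus-diagonal structure of $Q$:

```python
n = len(X)
Q = KIX.T @ np.linalg.solve(KII, KIX) + sigma2 * np.eye(n)  # usual approximate covariance
Qinv = np.linalg.inv(Q)
KsX = rbf(Xs, X)                                  # K_{*,.}
kss = rbf(Xs, Xs).diagonal()                      # K_{*,*}
proj = KIX.T @ np.linalg.solve(KII, KsI.T)        # K_{I,.}^T K_I^{-1} K_{I,*}, one column per x_*
mu_asor = np.empty(len(Xs)); var_asor = np.empty(len(Xs))
for t in range(len(Xs)):
    v = KsX[t] - proj[:, t]                             # v_*
    c = kss[t] - KsI[t] @ np.linalg.solve(KII, KsI[t])  # c_* (near 0 at support points)
    Qv = Qinv @ v
    Ainv = Qinv - np.outer(Qv, Qv) / (c + v @ Qv)       # (Q + v_* v_*^T / c_*)^{-1}
    mu_asor[t] = KsX[t] @ Ainv @ y
    var_asor[t] = kss[t] - KsX[t] @ Ainv @ KsX[t]
```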

SLIDE 9

Dude, where’s my prior?

SLIDE 10

The Priors

The equivalent prior on $[\mathbf{f}, f_*]^\top$ is $\mathcal{N}(\mathbf{0}, P)$, with $Q = K_{I,\cdot}^\top K_I^{-1} K_{I,\cdot}$.

Subset of Regressors:

$$P = \begin{bmatrix} Q & K_{I,\cdot}^\top K_I^{-1} K_{I,*} \\ K_{I,*}^\top K_I^{-1} K_{I,\cdot} & K_{I,*}^\top K_I^{-1} K_{I,*} \end{bmatrix}$$

Projected Process:

$$P = \begin{bmatrix} Q & K_{I,\cdot}^\top K_I^{-1} K_{I,*} \\ K_{I,*}^\top K_I^{-1} K_{I,\cdot} & K_{*,*} \end{bmatrix}$$

Nyström (positive definiteness!):

$$P = \begin{bmatrix} Q & K_{*,\cdot}^\top \\ K_{*,\cdot} & K_{*,*} \end{bmatrix}$$

Ed and Zoubin's funky thing:

$$P = \begin{bmatrix} Q + \Lambda & K_{I,\cdot}^\top K_I^{-1} K_{I,*} \\ K_{I,*}^\top K_I^{-1} K_{I,\cdot} & K_{*,*} \end{bmatrix}\,, \qquad \Lambda = \mathrm{diag}(K) - \mathrm{diag}(Q)$$

Augmented Subset of Regressors:

$$P = \begin{bmatrix} Q + \dfrac{v_*\, v_*^\top}{c_*} & K_{*,\cdot}^\top \\ K_{*,\cdot} & K_{*,*} \end{bmatrix}$$

with:

$$v_* = K_{*,\cdot}^\top - K_{I,\cdot}^\top K_I^{-1} K_{I,*}\,, \qquad c_* = K_{*,*} - K_{I,*}^\top K_I^{-1} K_{I,*}$$
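
To make the differences concrete, here is a sketch (one test point, names reused from the earlier snippets) assembling three of these priors; SoR and the projected process differ only in the lower-right entry:

```python
xs = Xs[:1]                                    # a single test input
kIs = rbf(X[I], xs)                            # K_{I,*}  (m x 1)
Qff = KIX.T @ np.linalg.solve(KII, KIX)        # Q  (n x n)
q_fs = KIX.T @ np.linalg.solve(KII, kIs)       # K_{I,.}^T K_I^{-1} K_{I,*}
q_ss = kIs.T @ np.linalg.solve(KII, kIs)       # K_{I,*}^T K_I^{-1} K_{I,*}
kss1 = rbf(xs, xs)                             # K_{*,*}
P_sor = np.block([[Qff, q_fs], [q_fs.T, q_ss]])       # degenerate: rank m
P_pp  = np.block([[Qff, q_fs], [q_fs.T, kss1]])       # exact prior variance at x_*
Lam = np.diag(rbf(X, X).diagonal() - Qff.diagonal())  # Lambda = diag(K) - diag(Q)
P_ez  = np.block([[Qff + Lam, q_fs], [q_fs.T, kss1]]) # Ed and Zoubin's prior
```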

SLIDE 11

More on Ed and Zoubin’s Method

  • Here's a way of looking at it: the prior is a posterior process

$$f_*\,|\,\mathbf{f}_I \sim \mathcal{N}\!\left(K_{*,I}\, K_I^{-1} \mathbf{f}_I,\; K_{*,*} - K_{*,I}\, K_I^{-1} K_{*,I}^\top\right)$$

... well, almost: distinct test latents are taken to be conditionally uncorrelated, $\mathrm{Cov}[f_+, f_*\,|\,\mathbf{f}_I] = 0$

  • And then of course $\mathbf{f}_I \sim \mathcal{N}(\mathbf{0}, K_I)$
  • The corresponding prior is

$$p(\mathbf{f}) = \mathcal{N}\!\left(\mathbf{0},\; K_{*,*}\, I + Q - \mathrm{diag}(Q)\right)\,, \qquad Q = K_{I,\cdot}^\top K_I^{-1} K_{I,\cdot}$$

  • With a bit of algebra you recover the marginal likelihood and the predictive distribution
  • I finished this 30 minutes ago, which is why I won't show figures on it! (well, now I may)
  • but ...
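
A sketch of prediction under this prior, reusing names from the earlier snippets; a practical implementation would exploit the low-rank-plus-diagonal structure of the training covariance instead of the dense solves used here:

```python
Qff = KIX.T @ np.linalg.solve(KII, KIX)          # Q on the training inputs
K_ez = Qff + np.diag(rbf(X, X).diagonal() - Qff.diagonal())  # Q + diag(K) - diag(Q)
A = K_ez + sigma2 * np.eye(len(X))               # training covariance plus noise
Qsf = KsI @ np.linalg.solve(KII, KIX)            # cross-covariance K_{*,I} K_I^{-1} K_{I,.}
mu_ez = Qsf @ np.linalg.solve(A, y)
var_ez = rbf(Xs, Xs).diagonal() - np.sum(Qsf * np.linalg.solve(A, Qsf.T).T, axis=1)
```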

SLIDE 12

Naïve Process Approximation

[Figure: predictive distribution on the toy data; x from −15 to 15, y from −1.5 to 1.5]

SLIDE 13

Subset of Regressors (degenerate)

[Figure: predictive distribution on the toy data; x from −15 to 15, y from −1.5 to 1.5]

SLIDE 14

Projected Process Approximation

[Figure: predictive distribution on the toy data; x from −15 to 15, y from −1.5 to 1.5]

SLIDE 15

Ed and Zoubin’s Projected Process Method

[Figure: predictive distribution on the toy data; x from −15 to 15, y from −1.5 to 1.5]

SLIDE 16

Augmented SoR (pred scales with nm)

[Figure: predictive distribution on the toy data; x from −15 to 15, y from −1.5 to 1.5]

SLIDE 17

Comparing the Predictive Uncertainties

[Figure: predictive uncertainties of the five methods (legend: Naive, SR, Seeger, EdZoubin, Augm); x from −15 to 15, y from 0.1 to 0.7]

SLIDE 18

Smola and Bartlett's Greedy Selection

[Figure, three panels against the size of the support set m (logarithmic scale): (1) test squared error, from 0.04 to 0.12; (2) negative log evidence, from −160 to −80, marking the minimum negative log evidence and the minimum negative log posterior; (3) upper and lower bounds on the negative log posterior, from −50 to 50, with gap = 0.025]

SLIDE 19

Wrap Up

  • Training: from $O(n^3)$ to $O(nm^2)$
  • Predicting: from $O(n^2)$ to $O(m^2)$ (or $O(nm)$)
  • Be sparse if you must, but only then
  • Beware of over-fitting prone greedy selection methods
  • Do worry about the prior implied by the approximation!

SLIDE 20

Appendix: Healing the RVM by Augmentation (joint work with Carl Rasmussen)

SLIDE 21

Finite Linear Model

[Figure: x from 5 to 15, y from −2 to 2]

SLIDE 22

A Bad Probabilistic Model

[Figure: x from 5 to 15, y from −2 to 2]

SLIDE 23

The Healing: Augmentation

[Figure: x from 5 to 15, y from −2 to 2]

SLIDE 24

Augmentation?

  • Train your $m$-dimensional model once
  • At each new test point, add a new basis function
  • Update the $(m+1)$-dimensional model (update the posterior)
  • Testing is now more expensive
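
A sketch of these four steps for a Bayesian linear model with localized (RBF) basis functions, reusing names from the earlier snippets. The unit weight prior is an assumption (an actual RVM learns per-weight precisions), and the full refit here stands in for the incremental posterior update:

```python
Phi = rbf(X, X[I])                     # n x m design matrix, basis centred on support set

def predict(Phi_tr, phi_te):
    """Predictive mean and variance under w ~ N(0, I), noise variance sigma2."""
    A = Phi_tr.T @ Phi_tr / sigma2 + np.eye(Phi_tr.shape[1])  # posterior precision
    w = np.linalg.solve(A, Phi_tr.T @ y / sigma2)             # posterior mean of w
    return phi_te @ w, sigma2 + phi_te @ np.linalg.solve(A, phi_te)

xt = Xs[100:101]                                   # one test input
mu0, v0 = predict(Phi, rbf(xt, X[I])[0])           # plain m-dimensional model
Phi_a = np.hstack([Phi, rbf(X, xt)])               # add a basis function centred at xt
phi_a = np.append(rbf(xt, X[I])[0], rbf(xt, xt)[0, 0])
mu1, v1 = predict(Phi_a, phi_a)                    # (m+1)-dimensional model
```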

SLIDE 25

Wait a minute ... I don’t care about probabilistic predictions!

SLIDE 26

Another Symptom: Underfitting

Abalone

| Loss | RVM | RVM* | GP |
|---|---|---|---|
| Squared error | 0.138 | 0.135 | 0.092 |
| Absolute error | 0.259 | 0.253 | 0.209 |
| −log test density | 0.469 | 0.408 | 0.219 |

Pairwise significance (p-values): squared error: RVM vs. RVM* not sig., RVM vs. GP < 0.01, RVM* vs. GP 0.02; absolute error: RVM vs. RVM* 0.07, RVM vs. GP < 0.01, RVM* vs. GP < 0.01; −log test density: all pairs < 0.01.

Robot Arm

| Loss | RVM | RVM* | GP |
|---|---|---|---|
| Squared error | 0.0043 | 0.0040 | 0.0024 |
| Absolute error | 0.0482 | 0.0467 | 0.0334 |
| −log test density | −1.2162 | −1.3295 | −1.7446 |

Pairwise significance (p-values): all pairwise differences < 0.01 for all three losses.

  • GP (Gaussian process): infinitely augmented linear model
  • Beats finite linear models on all datasets I've looked at

SLIDE 27

Interlude: none of this happens with non-localized basis functions

SLIDE 28

Finite Linear Model

[Figure: x from 5 to 15, y from −2 to 2]

SLIDE 29

A Bad Probabilistic Model

[Figure: x from 5 to 15, y from −2 to 2]

SLIDE 30

The Healing: Augmentation

[Figure: x from 5 to 15, y from −2 to 2]

SLIDE 31

Appendix: Augmentation in Sparse GPs

  • $O(nm^2)$ sparse approximation to Gaussian processes (Smola and Bartlett, 2001)
  • Augmentation: same training, more expensive testing
  • Better mean-based and probabilistic performance

2000 training / 2000 test cases:

| method | tr. neg. ev. | MAE (non-augm.) | MSE (non-augm.) | NTL (non-augm.) | MAE (augm.) | MSE (augm.) | NTL (augm.) |
|---|---|---|---|---|---|---|---|
| SGGP | – | 0.0481 | 0.0048 | −0.3525 | 0.0460 | 0.0045 | −0.4613 |
| SGEV | −1.1555 | 0.0484 | 0.0049 | −0.3446 | 0.0463 | 0.0045 | −0.4562 |
| HPEV-rand | −1.0978 | 0.0503 | 0.0047 | −0.3694 | 0.0486 | 0.0045 | −0.4269 |
| HPEV-SGEV | −1.3234 | 0.0425 | 0.0036 | −0.4218 | 0.0404 | 0.0033 | −0.5918 |
| HPEV-SGGP | −1.3274 | 0.0425 | 0.0036 | −0.4217 | 0.0405 | 0.0033 | −0.5920 |

36000 training / 4000 test cases:

| method | tr. neg. ev. | MAE (non-augm.) | MSE (non-augm.) | NTL (non-augm.) | MAE (augm.) | MSE (augm.) | NTL (augm.) |
|---|---|---|---|---|---|---|---|
| SGEV | −1.4932 | 0.0371 | 0.0028 | −0.6223 | 0.0346 | 0.0024 | −0.6672 |
| HPEV-rand | −1.5378 | 0.0363 | 0.0026 | −0.6417 | 0.0340 | 0.0023 | −0.7004 |

SLIDE 32

Thanks a lot to Sheffield and to Neil!