SLIDE 1

Gaussian Process Regression with Mismatched Models & Can GP Regression Be Made Robust Against Model Mismatch?

Peter Sollich, NeurIPS 2002 & International Workshop on Deterministic and Statistical Methods in Machine Learning (2004)

SLIDE 2

Learning curve

Ideal learning curve: average generalization error as a function of the number of training examples n.

  • Performance is measured on the true input distribution
  • and averaged over multiple training datasets (see the sketch below)
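
To make this concrete, here is a minimal Python sketch of estimating a learning curve by Monte Carlo: for each training-set size n, draw several training datasets, fit some estimator, and average its squared error over test points drawn from the true input distribution. The sinusoidal target and the nearest-neighbour predictor are placeholder choices for illustration, not anything from the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(2 * np.pi * x)   # placeholder target function
    x_test = rng.uniform(0, 1, 500)       # test points ~ true input distribution

    for n in [5, 10, 20, 40, 80]:
        errs = []
        for _ in range(50):               # average over training datasets
            x_tr = rng.uniform(0, 1, n)
            y_tr = f(x_tr) + 0.1 * rng.standard_normal(n)
            # placeholder estimator: 1-nearest-neighbour regression
            idx = np.abs(x_test[:, None] - x_tr[None, :]).argmin(axis=1)
            errs.append(np.mean((y_tr[idx] - f(x_test)) ** 2))
        print(f"n={n:3d}  mean generalization error = {np.mean(errs):.4f}")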

SLIDE 3

What is GP regression?

y = f(x) + Ξ·, where Ξ· ~ N(0, σ²); we want to estimate f. Put a GP prior on f:

  • Cov[f(x_j), f(x_k)] = K(x_j, x_k)
  • E[f(x)] = 0

Why GP regression?

  • Error bars
  • Posterior available analytically (requires O(n³) computation; see the sketch below)
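
For reference, a minimal self-contained sketch of exact GP regression with an RBF kernel (all settings here are illustrative): the posterior mean is the estimate of f, the posterior variance supplies the error bars, and the Cholesky factorization is the O(nΒ³) step.

    import numpy as np

    def rbf(A, B, l=0.3):
        return np.exp(-0.5 * ((A[:, None] - B[None, :]) / l) ** 2)

    def gp_posterior(X, y, X_star, noise_var=0.01):
        # zero-mean GP prior + Gaussian noise: the posterior is Gaussian too
        K = rbf(X, X) + noise_var * np.eye(len(X))
        L = np.linalg.cholesky(K)                    # the O(n^3) step
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        Ks = rbf(X_star, X)
        mean = Ks @ alpha                            # posterior mean
        v = np.linalg.solve(L, Ks.T)
        var = rbf(X_star, X_star).diagonal() - np.sum(v ** 2, axis=0)
        return mean, var                             # var -> error bars

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 1, 20)
    y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(20)
    X_star = np.linspace(0, 1, 5)
    mean, var = gp_posterior(X, y, X_star)
    for x_s, m, s in zip(X_star, mean, np.sqrt(var)):
        print(f"f({x_s:.2f}) β‰ˆ {m:+.3f} Β± {2 * s:.3f}")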
SLIDE 4

Mismatched model?

Input to GP regression: kernel K, noise level σ²

  • What if we use the wrong ones?

Setting:

  • Assume p(x) known: uniform on a line or hypercube
  • Theory exact for d β†’ ∞, otherwise all kinds of approximations
  • Assume a stationary kernel: K(x, xβ€²) = g(x βˆ’ xβ€²)
SLIDE 5

Weird learning curves

  • Plateaus, or an arbitrary number of overfitting maxima

Hypercube, d = 10, assumed noise level too small (1e-4, 1e-3, …) while the true level is 1. On the 1D line, a noise level chosen too low produces a plateau. A simulation sketch of this effect follows.
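The sketch below reproduces the flavour of this experiment in 1D (the hypercube version is analogous): the target is drawn from a rough OU kernel with true noise variance 1, while the GP models it with a smoother RBF kernel and an assumed noise variance of only 1e-4. All kernel parameters are illustrative; with settings like these the averaged error typically stalls near the true noise level, or even degrades over a range of n, instead of decaying steadily.

    import numpy as np

    rng = np.random.default_rng(2)

    def rbf(A, B, l=0.2):
        return np.exp(-0.5 * ((A[:, None] - B[None, :]) / l) ** 2)

    def ou(A, B, l=0.2):
        # Ornstein-Uhlenbeck kernel: much rougher sample paths than RBF
        return np.exp(-np.abs(A[:, None] - B[None, :]) / l)

    x_test = np.linspace(0, 1, 200)
    true_noise = 1.0       # true noise variance
    model_noise = 1e-4     # assumed noise variance: far too small

    for n in [8, 16, 32, 64, 128]:
        errs = []
        for _ in range(30):
            x = np.concatenate([rng.uniform(0, 1, n), x_test])
            # draw the target jointly on train+test points from the OU prior
            f = np.linalg.cholesky(ou(x, x) + 1e-8 * np.eye(len(x))) @ rng.standard_normal(len(x))
            y = f[:n] + np.sqrt(true_noise) * rng.standard_normal(n)
            # mismatched GP: wrong (too smooth) kernel, wrong (too small) noise level
            K = rbf(x[:n], x[:n]) + model_noise * np.eye(n)
            mu = rbf(x_test, x[:n]) @ np.linalg.solve(K, y)
            errs.append(np.mean((mu - f[n:]) ** 2))
        print(f"n={n:4d}  avg test MSE = {np.mean(errs):.3f}")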

SLIDE 6

Asymptotic problems

  • If the true kernel (OU, MB2) is less smooth than the chosen kernel (RBF):
  • No asymptotic O(1/n) decay of the error, as there is for parametric models; the decay is much slower (logarithmically slow); see the sketch below
  • The prior cannot be overwhelmed by the data (it is too strong)
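
One way to probe this empirically is to estimate the local exponent of the learning curve: fit log(error) against log(n) and check whether the slope comes anywhere near -1 (the parametric O(1/n) rate). A minimal sketch, assuming a rough OU truth, a too-smooth RBF model, and a correctly specified noise level so that only the kernel is mismatched; all settings illustrative.

    import numpy as np

    rng = np.random.default_rng(4)

    def rbf(A, B, l=0.2):
        return np.exp(-0.5 * ((A[:, None] - B[None, :]) / l) ** 2)

    def ou(A, B, l=0.2):
        return np.exp(-np.abs(A[:, None] - B[None, :]) / l)

    x_test = np.linspace(0, 1, 200)
    noise = 0.01           # same noise level in truth and model; only the kernel is wrong
    ns = np.array([16, 32, 64, 128, 256])
    avg = []
    for n in ns:
        errs = []
        for _ in range(20):
            x = np.concatenate([rng.uniform(0, 1, n), x_test])
            f = np.linalg.cholesky(ou(x, x) + 1e-8 * np.eye(len(x))) @ rng.standard_normal(len(x))
            y = f[:n] + np.sqrt(noise) * rng.standard_normal(n)
            K = rbf(x[:n], x[:n]) + noise * np.eye(n)   # RBF model for an OU truth
            mu = rbf(x_test, x[:n]) @ np.linalg.solve(K, y)
            errs.append(np.mean((mu - f[n:]) ** 2))
        avg.append(np.mean(errs))
    slope = np.polyfit(np.log(ns), np.log(avg), 1)[0]
    print(f"empirical learning-curve exponent β‰ˆ {slope:.2f} (parametric rate would be -1)")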
SLIDE 7

Fix?

  • But maybe we just chose very bad hyperparameters?
  • A true Bayesian approach is too expensive… what about evidence maximization?
  • Maximize the evidence P(D) = ∫ P(D|θ) P(θ) dθ with respect to the hyperparameters
  • Setting: assume a wrong kernel, but tune σ², a, l (noise level, amplitude, length scale) using the evidence
  • All kinds of approximations are needed to make the analysis tractable… (a brute-force numerical sketch follows)
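
The hard part in the papers is the analytical treatment; a brute-force numerical version of evidence maximization is short to write down. A minimal sketch, assuming an RBF kernel and reading the tuned hyperparameters as noise variance σ², amplitude a, and length scale l; the data and starting values are placeholders. For a GP with Gaussian noise the evidence integral is itself Gaussian, so the negative log evidence can be evaluated exactly via a Cholesky factorization.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(3)
    X = rng.uniform(0, 1, 40)
    y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(40)  # placeholder data

    def neg_log_evidence(log_params):
        # -log P(D) = 0.5 y^T K^{-1} y + 0.5 log|K| + (n/2) log(2*pi)
        sigma2, a, l = np.exp(log_params)  # log parametrization keeps all three positive
        K = a * np.exp(-0.5 * ((X[:, None] - X[None, :]) / l) ** 2) + sigma2 * np.eye(len(X))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(X) * np.log(2 * np.pi)

    res = minimize(neg_log_evidence, x0=np.log([0.1, 1.0, 0.3]), method="Nelder-Mead")
    sigma2, a, l = np.exp(res.x)
    print(f"evidence-optimal hyperparameters: sigma^2={sigma2:.3f}, a={a:.3f}, l={l:.3f}")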
SLIDE 8

Hypercube analysis

  • If the hyperparameters can be tuned to give Bayes-optimal performance, evidence maximization will get it.
  • No overfitting maxima
  • If those hyperparameter values cannot be reached (for example, l β†’ ∞), convergence is still very slow
  • No experiments?
SLIDE 9

1D case

  • True kernel = MB2
  • Kernel used = as shown in the plot
  • No maxima or plateaus
  • Optimal rate achieved