SLIDE 1

INTEGRATION OVER HYPERPARAMETERS AND ESTIMATION OF PREDICTIVE PERFORMANCE

Aki Vehtari

Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Finland aki.vehtari@aalto.fi

SLIDE 2

Outline

◮ GP hyperparameter inference

◮ Priors on GP hyperparameters
◮ Benefits of integration vs. point estimate
◮ MCMC, CCD


SLIDE 4

Gaussian processes and hyperparameters

◮ Gaussian processes are priors on function space
◮ GPs are usually constructed with a parametric covariance function
◮ we need to think about priors on those parameters

◮ If we have “big data” and a small number of hyperparameters
◮ priors and integration over the posterior are not so important
◮ even more so when sparse approximations, which limit the complexity of the models, are used

SLIDE 5

1D demo

◮ 1D demo originally by Michael Betancourt


SLIDE 7

1D demo summary

◮ The likelihood for lengthscales beyond the data scale is flat and non-identifiable because the functions all look the same (see the sketch below)
◮ add a prior making large lengthscales less likely
◮ Without repeated measurements there is non-identifiability between signal magnitude and noise magnitude when the lengthscale is short
◮ add a prior making short lengthscales less likely
◮ add a prior on the measurement noise
◮ make repeated measurements
◮ Non-identifiability between lengthscale and magnitude
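
To make the flatness concrete, here is a minimal numpy sketch (the toy data and values are mine, not from the talk): it evaluates the GP log marginal likelihood on a grid of lengthscales and shows it levelling off once the lengthscale exceeds the input range.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 20)                      # inputs on the unit interval
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(20)

def log_marginal(ell, sigma_f=1.0, sigma_n=0.2):
    """Log marginal likelihood of a zero-mean GP with a squared-exponential covariance."""
    K = sigma_f**2 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)
    C = K + sigma_n**2 * np.eye(len(x))
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(x) * np.log(2 * np.pi)

for ell in [0.01, 0.1, 1.0, 10.0, 100.0]:
    print(f"lengthscale {ell:7.2f}  log marginal {log_marginal(ell):9.2f}")
# lengthscales 10 and 100 give nearly identical values: on [0, 1] both
# correspond to effectively constant functions, so the data cannot tell
# them apart and a prior penalizing large lengthscales is needed
```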


SLIDE 10

Non-Gaussian likelihoods

◮ Poisson
◮ variance is equal to the mean, and thus can’t overfit
◮ except if the data are not conditionally Poisson distributed
◮ Binary classification (logit/probit)
◮ unbounded likelihood if the classes are separable
◮ with a short enough lengthscale the data become separable

SLIDE 11

Sparse approximations

◮ Sparse approximations limit the complexity

◮ FITC-type models work only with large lengthscales

SLIDE 12

Higher dimensions

◮ Separate lengthscale for each dimension, aka automatic relevance determination (ARD)
◮ the lengthscale is related to non-linearity, not directly to relevance (see the toy example below)


SLIDE 14

Toy example

[figure: the eight additive component functions f1(x1), …, f8(x8)]

f(x) = f_1(x_1) + · · · + f_8(x_8),   y ∼ N(f, 0.3²),   Var[f_j] = 1 for all j

⇒ All inputs equally relevant

[figure: true relevance vs. optimized ARD value for inputs 1–8]

Optimized ARD-values, ARD(j) = 1/ℓj (averaged over 100 data realizations, n = 200)
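
As a hedged illustration of the ARD construction (function and variable names are mine, not from the slides), here is a squared-exponential covariance with one lengthscale per input dimension, with the slide's relevance measure ARD(j) = 1/ℓ_j:

```python
import numpy as np

def ard_se_cov(X1, X2, ell, sigma_f=1.0):
    """ARD squared-exponential covariance between the rows of X1 and X2;
    ell holds one lengthscale per input dimension."""
    diff = (X1[:, None, :] - X2[None, :, :]) / ell   # scaled pairwise differences
    return sigma_f**2 * np.exp(-0.5 * (diff**2).sum(axis=-1))

ell = np.array([0.5, 2.0, 20.0])           # hypothetical fitted lengthscales
X = np.random.default_rng(0).standard_normal((5, 3))
K = ard_se_cov(X, X, ell)                  # 5 x 5 covariance matrix
print("ARD relevances 1/ell:", 1.0 / ell)  # short lengthscale = more non-linear
```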

SLIDE 15

Bayesian optimization

◮ GPs have been used too much as black boxes
◮ Bonus: use shape-constrained GPs (see, e.g., Siivola et al., 2017)

SLIDE 16

Periodic covariance function

◮ If you know the period, fix it
◮ If you don’t, there can be serious identifiability problems unless informative priors are used


SLIDE 18

Parametric model plus GP

◮ For example, a linear model plus a GP
◮ with a long lengthscale the GP is like a linear model, which causes non-identifiability and problems in interpretation
◮ The same holds for other parametric model + GP combinations
◮ need more informative priors

SLIDE 19

GP plus GP

[figure: relative number of births 1970–1988 decomposed into a slow trend, a fast non-periodic component, a day-of-week effect, a seasonal effect, and a day-of-year effect with special days (New year, Valentine’s day, Leap day, April 1st, Memorial day, Independence day, Labor day, Halloween, Thanksgiving, Christmas)]

SLIDE 20

GP plus GP

◮ Identifiability problems arise as different components explain the same features in the data
◮ priors which “encourage” specialization of the components


SLIDE 22

Summary on priors and benefits of integration

◮ Specific prior recommendations for the lengthscale (see the tail-mass sketch below)
◮ inverse gamma has a sharp left tail that puts negligible mass on small lengthscales, but a generous right tail that allows large lengthscales (while still reducing non-identifiability)
◮ generalized inverse Gaussian has an inverse gamma left tail (if p ≤ 0) and a Gaussian right tail (avoids the identifiability issue when combined with a linear model)
◮ Specific weakly informative prior recommendations for the signal and noise magnitudes
◮ half-normals are often enough if the lengthscale has an informative prior
◮ if information about the measurement accuracy is available, use an informative prior such as a gamma or scaled inverse-χ² on the variance
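
A small sketch of the tail behaviour using scipy (the shape and scale values are illustrative choices of mine, not recommendations from the talk):

```python
from scipy import stats

inv_gamma = stats.invgamma(a=2.0, scale=1.0)   # sharp left tail
half_normal = stats.halfnorm(scale=1.0)        # finite density at zero

for ell in [0.01, 0.1, 1.0]:
    print(f"P(lengthscale < {ell}):  inv-gamma {inv_gamma.cdf(ell):.2e}"
          f"  half-normal {half_normal.cdf(ell):.2e}")
# the inverse gamma puts essentially no mass on very small lengthscales,
# which is what suppresses the short-lengthscale non-identifiability
```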

SLIDE 23

GPs in Stan

◮ Stan manual 2.16.0 (and later), chapter 16: http://mc-stan.org/users/documentation/index.html
◮ code and documentation by Rob Trangucci
◮ prior recommendations by Rob Trangucci, Michael Betancourt, and Aki Vehtari
◮ Code examples: https://github.com/rtrangucci/gps_in_stan
◮ by Rob Trangucci


SLIDE 27

Hamiltonian Monte Carlo + NUTS

◮ Uses gradient information for more efficient sampling
◮ Alternates dynamic simulation and sampling of the energy level
◮ Parameters
◮ step size, number of simulation steps
◮ No-U-Turn Sampler (NUTS)
◮ adaptively selects the number of steps to improve robustness and efficiency
◮ Adaptation in Stan
◮ step size and mass matrix are estimated during the initial adaptation phase
◮ Demo: https://chi-feng.github.io/mcmc-demo/app.html#RandomWalkMH,donut
◮ note that HMC/NUTS in this demo is not exactly the same as in Stan; a minimal leapfrog sketch follows below
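
The following is a minimal sketch of one HMC transition with a plain leapfrog integrator and a fixed step count; Stan's sampler adds NUTS trajectory selection and adaptation on top of this, so this illustrates the idea rather than Stan's implementation.

```python
import numpy as np

def hmc_step(q, log_prob, grad_log_prob, eps=0.1, n_steps=20,
             rng=np.random.default_rng()):
    p = rng.standard_normal(q.shape)               # sample momentum
    q_new, p_new = q.copy(), p.copy()
    p_new += 0.5 * eps * grad_log_prob(q_new)      # half step for momentum
    for _ in range(n_steps - 1):
        q_new += eps * p_new                       # full step for position
        p_new += eps * grad_log_prob(q_new)        # full step for momentum
    q_new += eps * p_new
    p_new += 0.5 * eps * grad_log_prob(q_new)      # final half step
    # Metropolis accept/reject on the joint (position, momentum) energy
    h_old = -log_prob(q) + 0.5 * p @ p
    h_new = -log_prob(q_new) + 0.5 * p_new @ p_new
    return q_new if np.log(rng.uniform()) < h_old - h_new else q

# usage on a standard normal target
samples, q = [], np.zeros(2)
for _ in range(1000):
    q = hmc_step(q, lambda q: -0.5 * q @ q, lambda q: -q)
    samples.append(q)
```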

SLIDE 28

CCD

◮ Central composite design (CCD): deterministic placement of integration points in the hyperparameter posterior (see the sketch below)
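
A hedged sketch of the point placement, in standardized coordinates with the posterior mode at the origin; this is the textbook central composite design, and in practice the points are mapped through a scaling based on the Hessian at the mode, so the exact variant used for GP hyperparameters may differ.

```python
import itertools
import numpy as np

def ccd_points(d, alpha=None):
    """Central composite design: centre + 2*d axial points + 2**d corners."""
    alpha = np.sqrt(d) if alpha is None else alpha   # a common axial distance
    centre = np.zeros((1, d))
    axial = alpha * np.vstack([np.eye(d), -np.eye(d)])
    corners = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
    return np.vstack([centre, axial, corners])

print(ccd_points(2))   # 1 + 4 + 4 = 9 integration points for 2 hyperparameters
```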

SLIDE 29

Estimation of the predictive performance of GPs

◮ How to avoid naive k-fold-CV?
◮ leave-one-out (LOO) approximations
◮ The approximations depend on how the predictions are made
◮ analytically, Laplace, EP, VB, MCMC for the latents?
◮ marginal posterior improvements?
◮ integration over the hyperparameters?

SLIDE 30

Predictive distributions

◮ Posterior predictive distribution

    p(ỹ | x̃, D)   (1)

◮ LOO predictive distribution

    p(y_i | x_i, D_{-i})   (2)

SLIDE 31

Hierarchical LOO computation

◮ It is possible to first compute

    p(y_i | x_i, D_{-i}, θ, φ)   (3)

and then

    p(y_i | x_i, D_{-i}) = ∫ p(y_i | x_i, D_{-i}, θ, φ) p(θ, φ | D_{-i}) dθ dφ   (4)


SLIDE 34

Generic approach

◮ Consider the case where we have not yet seen the i-th observation. Then, using Bayes’ rule, we can add the information from the i-th observation:

    p(f_i | D) = p(y_i | f_i) p(f_i | x_i, D_{-i}) / p(y_i | x_i, D_{-i})   (5)

◮ Correspondingly, we can remove the effect of the i-th observation from the full posterior:

    p(f_i | x_i, D_{-i}) = p(f_i | D) p(y_i | x_i, D_{-i}) / p(y_i | f_i)   (6)

◮ If we now integrate both sides over f_i and rearrange the terms, we get

    p(y_i | x_i, D_{-i}) = 1 / ∫ [ p(f_i | D) / p(y_i | f_i) ] df_i   (7)

SLIDE 35

Generic approach

◮ In some cases we can compute p(f_i | x_i, D_{-i}) exactly or approximate it efficiently, and then we can compute the LOO predictive density (a quadrature sketch follows below):

    p(y_i | x_i, D_{-i}) = ∫ p(f_i | x_i, D_{-i}) p(y_i | f_i) df_i   (8)
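
A sketch of computing eq. (8) numerically with Gauss-Hermite quadrature; the probit classification likelihood and the numeric values are illustrative choices of mine.

```python
import numpy as np
from scipy.stats import norm

def loo_pred_density(y_i, mu_loo, v_loo, n_quad=32):
    """p(y_i | x_i, D_-i) = int N(f | mu_loo, v_loo) p(y_i | f) df, eq. (8)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    f = mu_loo + np.sqrt(2.0 * v_loo) * nodes   # change of variables for N(mu, v)
    lik = norm.cdf(y_i * f)                     # probit likelihood, y_i in {-1, +1}
    return (weights * lik).sum() / np.sqrt(np.pi)

print(loo_pred_density(y_i=1, mu_loo=0.8, v_loo=0.5))
```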

SLIDE 36

Analytic

◮ With a Gaussian likelihood and fixed hyperparameters there are analytic LOO equations (transcribed to code below):

    p(f_i | x_i, D_{-i}, θ, φ) ∝ p(f_i | D, θ, φ) / p(y_i | f_i, φ) = N(f_i | μ_{-i}, v_{-i})   (9)

where

    μ_{-i} = v_{-i} (Σ_{ii}^{-1} μ_i − σ^{-2} y_i)
    v_{-i} = (Σ_{ii}^{-1} − σ^{-2})^{-1}   (10)

which removes the effect of observation y_i from the marginal p(f_i | x_i, D, θ, φ)
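
A direct transcription of eqs. (9)-(10); the variable names are mine.

```python
def gaussian_loo_marginal(mu_i, Sigma_ii, y_i, sigma2):
    """LOO latent marginal N(f_i | mu_loo, v_loo) from the full-posterior
    marginal N(f_i | mu_i, Sigma_ii) and noise variance sigma2, eqs. (9)-(10)."""
    v_loo = 1.0 / (1.0 / Sigma_ii - 1.0 / sigma2)
    mu_loo = v_loo * (mu_i / Sigma_ii - y_i / sigma2)
    # the LOO predictive density for y_i is then N(y_i | mu_loo, v_loo + sigma2)
    return mu_loo, v_loo

print(gaussian_loo_marginal(mu_i=0.9, Sigma_ii=0.2, y_i=1.3, sigma2=0.5))
```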

SLIDE 37

EP

◮ Opper & Winther (2000) showed that EP cavity distribution

is up to first order LOO consistent

◮ this means that if we are going to use EP approximated

predictive distribution of the latent q(˜ f|˜ x, D, θ, φ) we can use analytic equations given the Gaussian latent posterior approximation by EP

◮ LOO distributions are cavity distributions, which are

  • btained as a byproduct of the method

SLIDE 39

Laplace

◮ First-order LOO consistency of the Laplace approximation was shown by Vehtari, Mononen, Tolvanen, and Winther (2016b)
◮ this means that if we are going to use the Laplace-approximated predictive distribution of the latent, q(f̃ | x̃, D, θ, φ), we can use the analytic equations given the Gaussian latent posterior approximation by Laplace, with site terms N(f_i | μ̃_i, Σ̃_i):

    Σ̃_i = −( ∇_i ∇_i log p(y_i | f_i, φ) |_{f_i = f̂_i} )^{-1}   (11)

    μ̃_i = f̂_i + Σ̃_i ∇_i log p(y_i | f_i, φ) |_{f_i = f̂_i}   (12)

◮ computation of LOO takes the same time as in the Gaussian-likelihood case (see the sketch below)
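
For concreteness, a sketch of the site terms for a Poisson likelihood p(y | f) = Poisson(y | exp(f)), where ∇ log p = y − exp(f) and ∇∇ log p = −exp(f); the example likelihood is my choice, not from the slides.

```python
import numpy as np

def laplace_site_terms_poisson(y_i, f_hat_i):
    """Site terms (11)-(12) at the Laplace mode f_hat_i for a Poisson likelihood."""
    grad = y_i - np.exp(f_hat_i)              # d/df log p(y_i | f)
    hess = -np.exp(f_hat_i)                   # d^2/df^2 log p(y_i | f)
    Sigma_tilde = -1.0 / hess                 # eq. (11)
    mu_tilde = f_hat_i + Sigma_tilde * grad   # eq. (12)
    return mu_tilde, Sigma_tilde

print(laplace_site_terms_poisson(y_i=3, f_hat_i=1.0))
```

With these Gaussian pseudo-observations the Gaussian-likelihood equations (9)-(10) apply unchanged, which is why the cost matches the Gaussian case.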

SLIDE 40

VB

◮ It is likely that the same holds for VB

SLIDE 41

Experimental results

◮ Small datasets, so that we can compute brute-force LOO
◮ Accuracy of the approximations improves for larger datasets

Data set     n     d   Observation model
Ripley       250    2  probit
Australian   690   14  probit
Ionosphere   351   33  probit
Sonar        208   60  probit
Leukemia    1043    4  log-logistic with censoring

Table: Summary of datasets and models in our examples.

SLIDE 42

LA results with fixed hyperparameters

[figure: bias vs. p_eff/n for the Ripley, Australian, Ionosphere, Sonar, and Leukemia datasets; methods: LA-LOO, TQ-LOO-LA-G, WAICG-LA-L, WAICV-LA-L]

Figure: Bias when the target is brute-force-LOO with Laplace and varying flexibility of the model. Model flexibility was varied by rescaling the length scale(s) in the GP model. Model flexibility is measured by the relative effective number of parameters peff/n. The flexibility of the MAP model is shown with a vertical dashed line.

SLIDE 43

EP results with fixed hyperparameters

[figure: bias vs. p_eff/n for the Ripley, Australian, Ionosphere, Sonar, and Leukemia datasets; methods: EP-LOO, TQ-LOO-EP-G, WAICG-EP-L, WAICV-EP-L]

Figure: Bias when the target is brute-force-LOO with EP and varying flexibility of the model. Model flexibility was varied by rescaling the length scale(s) in the GP model. Model flexibility is measured by the relative effective number of parameters peff/n. The flexibility of the MAP model is shown with a vertical dashed line.

SLIDE 44

LA-CM2 results with fixed hyperparameters

[figure: bias vs. p_eff/n for the Ripley, Australian, Ionosphere, Sonar, and Leukemia datasets; methods: LA-LOO, Q-LOO-LA-CM2, WAICG-LA-CM2, WAICV-LA-CM2]

Figure: Bias when the target is brute-force-LOO with Laplace-CM2 and varying flexibility of the model. Model flexibility was varied by rescaling the length scale(s) in the GP model. Model flexibility is measured by the relative effective number of parameters peff/n. The flexibility of the MAP model is shown with a vertical dashed line.

SLIDE 45

EP-FACT results with fixed hyperparameters

[figure: bias vs. p_eff/n for the Ripley, Australian, Ionosphere, Sonar, and Leukemia datasets; methods: EP-LOO, Q-LOO-EP-FACT, WAICG-EP-FACT, WAICV-EP-FACT]

Figure: Bias when the target is brute-force-LOO with EP-FACT and varying flexibility of the model. Model flexibility was varied by rescaling the length scale(s) in the GP model. Model flexibility is measured by the relative effective number of parameters peff/n. The flexibility of the MAP model is shown with a vertical dashed line.


SLIDE 47

Unknown hyperparameters

◮ If the hyperparameters are unknown and optimised, the above estimates are optimistic
◮ the bias can be negligible with big data and a small number of hyperparameters
◮ Better to integrate over the hyperparameters
◮ deterministic samples, e.g., CCD
◮ stochastic samples, e.g., importance sampling, MCMC


SLIDE 50

Hierarchical approximation using IS

◮ Using the above results for the conditional part p(y_i | x_i, D_{-i}, θ, φ), the LOO predictive distribution can be approximated using importance sampling (IS) over the hyperparameters (see the sketch below):

    p(ỹ_i | x_i, D_{-i}) ≈ [ Σ_{s=1}^{S} p(ỹ_i | x_i, D_{-i}, θ^s, φ^s) w_i^s ] / [ Σ_{s=1}^{S} w_i^s ]   (13)

where the w_i^s are importance weights and

    w_i^s ∝ 1 / p(y_i | x_i, D_{-i}, θ^s, φ^s)   (14)

◮ The LOO predictive density simplifies to

    p(y_i | x_i, D_{-i}) ≈ 1 / [ (1/S) Σ_{s=1}^{S} 1 / p(y_i | x_i, D_{-i}, θ^s, φ^s) ]   (15)
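
A sketch of eqs. (13)-(15) evaluated at the observed y_i, where the weighted sum collapses to a harmonic mean (the numeric values are made up):

```python
import numpy as np

def is_loo_density(cond_dens):
    """cond_dens[s] = p(y_i | x_i, D_-i, theta_s, phi_s) for draws s = 1..S."""
    w = 1.0 / cond_dens        # eq. (14): unnormalized importance weights
    return 1.0 / np.mean(w)    # eq. (15): harmonic mean of conditional densities

dens = np.array([0.21, 0.35, 0.18, 0.27])   # hypothetical conditional LOO densities
print(is_loo_density(dens))
```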

SLIDE 51

Improving IS

◮ The variance of IS can be reduced by using truncated importance sampling (sketched below)
◮ “Very Good Importance Sampling” (work in progress)
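
A sketch of the truncation: raw weights are capped at √S times their mean, the rule from Ionides (2008); whether this matches the talk's exact variant is my assumption. The capped weights are then used in the weighted form of eq. (13).

```python
import numpy as np

def truncated_is_loo_density(cond_dens):
    S = len(cond_dens)
    w = 1.0 / cond_dens                          # raw weights, eq. (14)
    w = np.minimum(w, np.sqrt(S) * w.mean())     # truncate extreme weights
    return (w * cond_dens).sum() / w.sum()       # weighted eq. (13) at y_i

dens = np.array([0.21, 0.35, 0.0004, 0.27])      # one draw with a huge raw weight
print(truncated_is_loo_density(dens))
```

Truncation trades a small bias for a much smaller variance of the estimate.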

SLIDE 52

Hierarchical approximation using IS

◮ Importance weighting also works for the deterministic CCD method


SLIDE 54

LA/EP results with unknown hyperparameters

Method          Ripley       Australian   Ionosphere   Sonar          Leukemia
LA-LOO+CCD+IS   0.2 (0.1)    3.4 (0.4)    −0.1 (0.1)   −0.13 (0.06)   0.56 (0.05)
LA-LOO+CCD      0.8 (0.2)    7.2 (0.9)    0.6 (0.2)    0.5 (0.2)      4.8 (0.2)
LA-LOO+MAP      1.0 (0.2)    9.2 (1.8)    1.3 (0.2)    1.3 (0.3)      4.9 (0.6)

Table: Bias and standard deviation when the target is brute-force-LOO with Laplace and CCD.

Method          Ripley       Australian   Ionosphere   Sonar          Leukemia
EP-LOO+CCD+IS   0.42 (0.14)  7.3 (1.4)    0.8 (0.6)    −0.24 (0.14)   0.49 (0.04)
EP-LOO+CCD      1.3 (0.4)    15 (2)       2.8 (1.3)    0.6 (0.3)      4.8 (0.2)
EP-LOO+MAP      1.4 (0.3)    17 (2)       2.8 (0.7)    0.9 (0.3)      4.9 (0.6)

Table: Bias and standard deviation when the target is brute-force-LOO with EP and CCD.


SLIDE 56

Non-log-concave likelihoods

◮ The nice results above are for log-concave likelihoods
◮ Things do not work as well with non-log-concave likelihoods
◮ the first-order consistency proof assumes log-concave likelihoods
◮ the posterior can be multimodal → a unimodal approximation is bad
◮ pseudo observations may have a repulsive effect
◮ (current) marginal improvement methods don’t fix this problem

SLIDE 57

Summary

◮ LOO with LA or EP, log-concave likelihoods, and fixed hyperparameters is fast and reliable
◮ IS can be used to handle unknown hyperparameters


SLIDE 59

Warning

◮ LOO-CV can be used to compare a small set of models
◮ For a large number of models
◮ the selection process will cause overfitting
◮ the inference conditional on the selected model is wrong

[figure: selection-induced overfitting in model selection for n = 20, 50, and 100]

◮ Use instead a projection predictive approach

Piironen, J., and Vehtari, A. (2016b). Projection predictive input variable selection for Gaussian process models. In Machine Learning for Signal Processing (MLSP), 2016 IEEE International Workshop on, doi:10.1109/MLSP.2016.7738829. arXiv preprint arXiv:1510.04813.

SLIDE 60

Selection-induced bias in variable selection

[figure: selection-induced bias in variable selection for CV-10, WAIC, DIC, MPP, BMA-ref, and BMA-proj at n = 100, 200, and 400]

Piironen & Vehtari (2016)


SLIDE 62

References

Piironen, J. and Vehtari, A. (2016a). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711–735.

Piironen, J. and Vehtari, A. (2016b). Projection predictive input variable selection for Gaussian process models. In Machine Learning for Signal Processing (MLSP), 2016 IEEE International Workshop on.

Siivola, E., Vehtari, A., Vanhatalo, J., and González, J. (2017). Bayesian optimization with virtual derivative sign observations. arXiv:1704.00963.

Stan Development Team (2017). Stan: A C++ library for probability and sampling, version 2.16.

Vehtari, A., Gelman, A., and Gabry, J. (2016a). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. arXiv:1507.04544.

Vehtari, A., Mononen, T., Tolvanen, V., and Winther, O. (2016b). Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models. Journal of Machine Learning Research, 17(103):1–38.