Projection Predictive Model Selection for Gaussian Processes
Juho Piironen, Aki Vehtari
Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Finland
juho.piironen@aalto.fi
Contents
- Introduction
- Automatic relevance determination (ARD)
- Projection predictive method
- Examples
- Summary
Introduction
- Model the target y with several input variables x
- Only some of the inputs x are relevant
- Bayesian approach: use a relevant prior and integrate over all uncertainties
- Radford Neal won the NIPS 2003 feature selection competition using Bayesian methods with all the features (500–100 000)
- Sometimes we want to select a minimal subset of x with good predictive performance:
  - improved model interpretability
  - reduced measurement costs in the future
  - reduced prediction time
Gaussian process (GP) regression
- GP prior:
  $f(x) \sim \mathcal{GP}\big(0,\, k(x, x')\big)$
- Observation model:
  $y \mid f \sim \mathrm{N}\big(y \mid f, \sigma^2 I\big)$
- Predictive distribution:
  $f_* \mid y \sim \mathrm{N}(f_* \mid \mu_*, \Sigma_*)$, where
  $\mu_* = K_*(K + \sigma^2 I)^{-1} y$, $\quad \Sigma_* = K_{**} - K_*(K + \sigma^2 I)^{-1} K_*^{\mathsf{T}}$.
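These predictive equations can be sketched in a few lines of NumPy; `se_kernel` and `gp_predict` are our own illustrative names, and for simplicity a single shared length-scale is used:

```python
import numpy as np

def se_kernel(X1, X2, sigma_f=1.0, ell=1.0):
    """Squared exponential covariance with one shared length-scale."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

def gp_predict(X, y, X_star, sigma=0.1, ell=1.0):
    """GP predictive mean and covariance at test inputs X_star."""
    K = se_kernel(X, X, ell=ell)
    K_star = se_kernel(X_star, X, ell=ell)
    K_ss = se_kernel(X_star, X_star, ell=ell)
    A = K + sigma**2 * np.eye(len(X))
    mu = K_star @ np.linalg.solve(A, y)                  # K_*(K + s^2 I)^-1 y
    Sigma = K_ss - K_star @ np.linalg.solve(A, K_star.T)
    return mu, Sigma
```

With a small noise level the predictive mean essentially interpolates the training targets.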
“Automatic relevance determination”
- Squared exponential (SE), or exponentiated quadratic, covariance function:
  $k_{\mathrm{SE}}(x, x') = \sigma_f^2 \exp\left( -\frac{1}{2} \sum_{j=1}^{D} \frac{(x_j - x'_j)^2}{\ell_j^2} \right)$
- Using a separate length-scale $\ell_j$ for each input is referred to as automatic relevance determination (ARD)
- Idea: optimizing the marginal likelihood will yield large values $\ell_j$ for irrelevant inputs
- Problem: a large length-scale may simply mean linearity w.r.t. the input (not irrelevance)
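A minimal sketch of the ARD kernel above (the function name and toy length-scales are ours): a large $\ell_j$ makes dimension $j$ nearly invisible to the covariance, which is why $1/\ell_j$ is read as a relevance value:

```python
import numpy as np

def ard_se_kernel(X1, X2, sigma_f=1.0, ell=None):
    """SE kernel with a separate length-scale per input dimension (ARD)."""
    ell = np.ones(X1.shape[1]) if ell is None else np.asarray(ell, dtype=float)
    diff = (X1[:, None, :] - X2[None, :, :]) / ell   # scale each dim by ell_j
    return sigma_f**2 * np.exp(-0.5 * (diff**2).sum(-1))
```

For two points differing only in a dimension with a very long length-scale, the covariance stays close to $\sigma_f^2$, i.e. that input is effectively ignored.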
Toy example
[Figure: draws of the eight additive components $f_1(x_1), \dots, f_8(x_8)$]
$f(x) = f_1(x_1) + \dots + f_8(x_8)$, $\quad y \sim \mathrm{N}\big(f, 0.3^2\big)$, $\quad \mathrm{Var}[f_j] = 1$ for all $j$
$\Rightarrow$ all inputs equally relevant
[Figure: true relevance vs. optimized ARD values for each input]
Optimized ARD values, $\mathrm{ARD}(j) = 1/\ell_j$ (averaged over 100 data realizations, $n = 200$)
How about estimating the predictive performance?
- Cross-validation gives an (almost) unbiased estimate of the predictive performance
- Fast LOO-CV approximations in: Vehtari, Mononen, Tolvanen, Sivula, and Winther (2017). Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models. JMLR, 17(103):1–38.
- But...
Selection induced bias in variable selection
- Even if the model performance estimate is unbiased (like LOO-CV) but noisy (also like LOO-CV), using it for model selection introduces additional fitting to the data
- The performance of the selection process itself can be assessed with two-level cross-validation, but this does not help in choosing better models
- The problem is bigger when there is a large number of candidate models, as in covariate selection
- Juho Piironen and Aki Vehtari (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711–735. doi:10.1007/s11222-016-9649-y. arXiv:1503.08650.
Selection induced bias in variable selection
[Figure: selection induced bias in a simulated example, performance vs. number of selected variables for n = 20, 50, 100]
[Figure: performance vs. number of selected variables for n = 100, 200, 400, comparing CV-10, WAIC, DIC, MPP, BMA-ref, and BMA-proj; Piironen & Vehtari (2017)]
Selection induced bias in variable selection
[Figure: performance vs. number of selected variables for the real datasets Sonar, Ionosphere, Ovarian, and Colon, comparing CV-10 / IS-LOO-CV, WAIC, DIC, MPP, BMA-ref, and BMA-proj; Piironen & Vehtari (2017)]
Projection predictive method, general idea
- Originally proposed for generalized linear models by Goutis and Robert (1998) and Dupuis and Robert (2003); the decision theoretic idea of using the full model can be traced to Lindley (1968), see also the many related references in Vehtari and Ojanen (2012)
- Performs well in practice in comparison to many other methods (Piironen and Vehtari, 2016):
  - has low variance
  - able to preserve information from the full model
- General idea:
  1. Fit the full encompassing model (with all the inputs) using the best possible prior information
  2. Train any submodel (with a reduced number of inputs) by minimizing the predictive Kullback-Leibler (KL) divergence to the full model (= projection)
- For a given number of variables, choose the model with minimal projection discrepancy
Projective predictive covariate selection, idea
- The full model predictive distribution represents our best knowledge about the future $\tilde{y}$:
  $p(\tilde{y} \mid D) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid D)\, d\theta$,
  where $\theta = (\beta, \sigma^2)$ and $\beta$ is in general non-sparse (all $\beta_j \neq 0$)
- What is the best distribution $q^*(\theta)$ given the constraint that only the selected covariates have nonzero coefficients?
- Optimization problem:
  $q^* = \arg\min_{q} \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}\left( p(\tilde{y}_i \mid D) \,\middle\|\, \int p(\tilde{y}_i \mid \theta)\, q(\theta)\, d\theta \right)$
- Optimal projection from the full posterior to a sparse posterior (with minimal predictive loss)
Projective predictive feature selection, computation
- We have posterior draws $\{\theta^s\}_{s=1}^{S}$ for the full model ($\theta = (\beta, \sigma^2)$, and $\beta$ is in general non-sparse, all $\beta_j \neq 0$)
- The predictive distribution $p(\tilde{y} \mid D) \approx \frac{1}{S} \sum_s p(\tilde{y} \mid \theta^s)$ represents our best knowledge about the future $\tilde{y}$
- Easier optimization problem by changing the order of integration and optimization (Goutis & Robert, 1998):
  $\theta^s_* = \arg\min_{\hat{\theta}} \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}\left( p(\tilde{y}_i \mid \theta^s) \,\middle\|\, p(\tilde{y}_i \mid \hat{\theta}) \right)$
- The $\theta^s_*$ are now (approximate) draws from the projected distribution
Projection by draws
- The projection of one Monte Carlo draw can be solved
- Gaussian case, analytically:
  $w_\perp = (X_\perp^{\mathsf{T}} X_\perp)^{-1} X_\perp^{\mathsf{T}} f$
  $\sigma_\perp^2 = \sigma^2 + \frac{1}{n} (f - f_\perp)^{\mathsf{T}} (f - f_\perp)$
- Exponential family case: equivalent to finding the maximum likelihood parameters for the submodel with the observations replaced by the fit of the reference model (Goutis & Robert, 1998; Dupuis & Robert, 2003)
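In the Gaussian case the projection of a single draw is just a least-squares fit of the submodel to the full-model fit $f$, plus an inflated noise variance. A sketch of those two formulas (the function name is ours):

```python
import numpy as np

def project_draw(X_sub, f, sigma2):
    """Project one draw (f, sigma2) of the full model onto the submodel
    with design matrix X_sub (Gaussian case, closed form)."""
    # w_perp = (X^T X)^-1 X^T f : least-squares fit to the full-model fit f
    w_perp, *_ = np.linalg.lstsq(X_sub, f, rcond=None)
    f_perp = X_sub @ w_perp
    # sigma_perp^2 = sigma^2 + (1/n) ||f - f_perp||^2
    sigma2_perp = sigma2 + np.mean((f - f_perp) ** 2)
    return w_perp, sigma2_perp
```

If $f$ lies exactly in the column space of the submodel, no noise is added; any unexplained part of the full-model fit inflates the submodel's noise variance instead.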
Projection predictive method for GPs
- The parameters of the GP are essentially the latent values f (and likelihood parameters such as $\sigma$)
- Without constraints on the latent values in the submodel, the solution to the minimization problem is $f_\perp = f$
- We therefore require the constraint that the submodel prediction satisfies the usual GP predictive equations
Projection predictive method for GPs
- Fit the full model $M$ by learning the hyperparameters $\theta$ to obtain the latent fit $f \mid y, \theta \sim \mathrm{N}(f \mid \mu_\theta, \Sigma_\theta)$
- The projection onto a submodel $M_\perp$ with a smaller number of variables $D_\perp$ is obtained by solving
  $\delta(M \,\|\, M_\perp) = \min_{\theta_\perp} \mathrm{KL}\left( \mathrm{N}(f \mid \mu_\theta, \Sigma_\theta) \,\middle\|\, \mathrm{N}\big(f \mid \mu_{\theta_\perp}, \Sigma_{\theta_\perp}\big) \right) \quad (1)$
  where $\mu_\perp = K_\perp(K_\perp + \sigma^2 I)^{-1} y$ and $\Sigma_\perp = K_\perp - K_\perp(K_\perp + \sigma^2 I)^{-1} K_\perp$
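The discrepancy in (1) is a KL divergence between two multivariate Gaussians, which has a well-known closed form. A small sketch (our own helper, using plain inverses and determinants rather than Cholesky factors for brevity):

```python
import numpy as np

def kl_gaussians(mu1, S1, mu2, S2):
    """KL( N(mu1, S1) || N(mu2, S2) ) between multivariate Gaussians."""
    n = len(mu1)
    S2_inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.trace(S2_inv @ S1) + d @ S2_inv @ d - n
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))
```

The KL divergence is zero only when the two Gaussians coincide, so minimizing it over the submodel's hyperparameters drives the submodel's latent fit toward the full model's.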
Toy example
[Figure: draws of the eight additive components $f_1(x_1), \dots, f_8(x_8)$]
$f(x) = f_1(x_1) + \dots + f_8(x_8)$, $\quad y \sim \mathrm{N}\big(f, 0.3^2\big)$, $\quad \mathrm{Var}[f_j] = 1$ for all $j$
$\Rightarrow$ all inputs equally relevant
[Figure: true relevance, optimized ARD values, and LIO projection errors for each input]
Leave-input-out (LIO) projection errors (averaged over 100 data realizations, $n = 200$)
Projection predictive variable selection
- In variable selection it is usually not feasible to go through all variable combinations
- Use e.g. forward search to explore promising combinations:
  - start from the empty model; at each step add the variable that reduces the objective (1) the most
  - stop when the performance is similar to the full model
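The forward search can be sketched as a greedy loop; here `objective` stands in for the projection discrepancy (1) of a candidate variable set, and all names and the stopping tolerance are our own illustration:

```python
def forward_search(candidates, objective, tol=1e-3, full_value=0.0):
    """Greedy forward search: at each step add the variable that most
    reduces objective(selected). Stops when the objective is within
    tol of the full model's value (full_value)."""
    selected = []
    remaining = list(candidates)
    while remaining:
        # the single addition with the smallest objective value
        best = min(remaining, key=lambda j: objective(selected + [j]))
        selected.append(best)
        remaining.remove(best)
        if objective(selected) - full_value < tol:
            break
    return selected
```

With a toy objective counting how many truly relevant variables are still missing, the loop picks exactly the relevant ones and then stops.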
Real world examples
[Figure: MLPD on test data vs. number of variables for Boston Housing (D = 13), Automobile (D = 38), and Crime (D = 102); curves for the full model, ARD, and projection]
- Full model: mean log predictive density (MLPD) on test data for the model with all inputs and sampled hyperparameters
- ARD: accuracy for each submodel size, variables sorted by ARD (length-scales), hyperparameters optimized to maximum marginal likelihood
- Projection: accuracy for each submodel size, variables sorted by stepwise minimization of the projection error (forward search), hyperparameters learned via the projection
Non-Gaussian likelihood
- Given a Gaussian posterior approximation (e.g. obtained using EP), we can make the projection conditional on the Gaussian likelihood approximations
Projection predictive method, pros and cons
- Advantage:
  - the discrepancy to the full model is a much more reliable indicator of a submodel's performance than the length-scales
- Disadvantage:
  - the computational complexity of the projection is $O(n^3)$ (unless sparse approximations are used) $\Rightarrow$ slow if several submodels (e.g. variable combinations) are explored
Summary
- Carry out inference for the full model for the best performance; select only if necessary
- ARD values (length-scales) are unreliable for assessing input relevance
- The projection discrepancy to the full model is a more robust indicator
- However, the forward search requires a substantial amount of additional computation (on top of fitting the full model)
References
Dupuis, J. A. and Robert, C. P. (2003). Variable selection in qualitative models via an entropic explanatory power. Journal of Statistical Planning and Inference, 111(1-2):77–94.
Goutis, C. and Robert, C. P. (1998). Model choice in generalised linear models: A Bayesian approach via Kullback-Leibler projections. Biometrika, 85(1):29–37.