PROJECTION PREDICTIVE MODEL SELECTION FOR GAUSSIAN PROCESSES

Juho Piironen, Aki Vehtari

Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Finland juho.piironen@aalto.fi, aki.vehtari@aalto.fi


Contents

- Introduction
- Automatic relevance determination (ARD)
- Projection predictive method
- Examples
- Summary


Introduction

- Model target y with several input variables x
- Only some of the inputs x are relevant
- Bayesian approach: use a relevant prior and integrate over all uncertainties
- Radford Neal won the NIPS 2003 feature selection competition using Bayesian methods with all the features (500–100 000)
- Sometimes we want to select a minimal subset of x with good predictive performance:
  - improved model interpretability
  - reduced measurement costs in the future
  - reduced prediction time


Gaussian process (GP) regression

- GP prior

      f(x) ~ GP(0, k(x, x′))

- Observation model

      y | f ~ N(y | f, σ²I)

- Predictive distribution

      f* | y ~ N(f* | μ*, Σ*),
      μ* = K*(K + σ²I)⁻¹ y,
      Σ* = K** − K*(K + σ²I)⁻¹ K*ᵀ
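The predictive equations above can be sketched in a few lines of numpy. This is an illustrative implementation, not the authors' code: the function names, the isotropic kernel, and the toy data are all made up for the example.

```python
import numpy as np

def gp_predict(X, y, Xs, kernel, sigma2):
    """GP regression predictive mean and covariance:
    mu*    = K* (K + sigma^2 I)^{-1} y
    Sigma* = K** - K* (K + sigma^2 I)^{-1} K*^T
    """
    K = kernel(X, X)        # training covariance K
    Ks = kernel(Xs, X)      # cross-covariance K*
    Kss = kernel(Xs, Xs)    # test covariance K**
    # Cholesky of (K + sigma^2 I) for a numerically stable solve
    L = np.linalg.cholesky(K + sigma2 * np.eye(len(X)))
    mu = Ks @ np.linalg.solve(L.T, np.linalg.solve(L, y))
    V = np.linalg.solve(L, Ks.T)
    return mu, Kss - V.T @ V

def k_se(A, B, ell=1.0, sf2=1.0):
    # isotropic squared exponential kernel (the ARD version comes next)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell**2)

# toy usage: noisy observations of a smooth function
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
mu, cov = gp_predict(X, y, X, k_se, sigma2=0.01)
```

The Cholesky-based solve avoids forming (K + σ²I)⁻¹ explicitly, which is the standard way to evaluate these equations in practice.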


“Automatic relevance determination”

- Squared exponential (SE) or exponentiated quadratic covariance function

      k_SE(x, x′) = σ_f² exp( −(1/2) Σ_{j=1}^{D} (x_j − x′_j)² / ℓ_j² )

- Use of separate length-scales ℓ_j for each input is referred to as automatic relevance determination (ARD)
- Idea: optimizing the marginal likelihood will yield large values of ℓ_j for irrelevant inputs
- Problem: a large length-scale may simply mean linearity w.r.t. the input (not irrelevance)
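A minimal numpy sketch of the SE-ARD covariance above (the helper name and the toy length-scales are illustrative, not from the talk). It also demonstrates the mechanism behind ARD: sending one length-scale ℓ_j to a very large value makes the kernel, and hence the GP, effectively ignore that input.

```python
import numpy as np

def k_se_ard(A, B, ell, sf2=1.0):
    """SE kernel with a separate length-scale ell[j] per input:
    k(x, x') = sf2 * exp(-0.5 * sum_j (x_j - x'_j)^2 / ell_j^2)
    """
    d2 = (((A / ell)[:, None, :] - (B / ell)[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2)

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
# a huge length-scale on input 2 => that input barely affects the kernel
K_ard = k_se_ard(X, X, ell=np.array([1.0, 1.0, 1e6]))
# same kernel with input 2 dropped altogether
K_drop = k_se_ard(X[:, :2], X[:, :2], ell=np.array([1.0, 1.0]))
# K_ard and K_drop are numerically identical
```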


Toy example

[Figure: the eight additive components f1(x1), …, f8(x8)]

f(x) = f1(x1) + · · · + f8(x8),  y ~ N(f, 0.3²),  Var[f_j] = 1 for all j
⇒ All inputs equally relevant

[Figure: true relevance and optimized ARD value for each of the 8 inputs]

Optimized ARD values, ARD(j) = 1/ℓ_j (averaged over 100 data realizations, n = 200)
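The relevance score plotted in the figure is simply the inverse length-scale. Given fitted length-scales (the values below are hypothetical, for illustration only), computing the score and the implied ranking is a one-liner:

```python
import numpy as np

# hypothetical fitted ARD length-scales for the 8 toy inputs
ell = np.array([0.8, 2.1, 1.3, 5.0, 0.9, 1.1, 3.2, 1.7])
ard = 1.0 / ell              # ARD(j) = 1 / ell_j
ranking = np.argsort(-ard)   # inputs ordered from most to least "relevant"
```

As the slide shows, this ranking can be misleading: in the toy example all eight inputs are equally relevant, yet the optimized ARD values differ substantially.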


How about estimating the predictive performance?

- Cross-validation gives an (almost) unbiased estimate of the predictive performance
- Fast LOO-CV approximations in Vehtari, Mononen, Tolvanen, Sivula, and Winther (2017). Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models. JMLR, 17(103):1–38.
- But...


Selection induced bias in variable selection

- Even if the model performance estimate is unbiased (like LOO-CV) but noisy (like LOO-CV), using it for model selection introduces additional fitting to the data
- The performance of the selection process itself can be assessed using two-level cross-validation, but that does not help in choosing better models
- The problem is bigger when there is a large number of candidate models, as in covariate selection
- Juho Piironen and Aki Vehtari (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711–735. doi:10.1007/s11222-016-9649-y. arXiv:1503.08650.


Selection induced bias in variable selection

[Figure: simulated data; performance of the selected models vs. number of selected variables for n = 20, 50, and 100]


Selection induced bias in variable selection

[Figure: simulated data; performance vs. number of selected variables (up to 100) for n = 100, 200, and 400, comparing CV-10, WAIC, DIC, MPP, BMA-ref, and BMA-proj]

Piironen & Vehtari (2017)


Selection induced bias in variable selection

[Figure: real datasets (Sonar, Ionosphere, Ovarian, Colon); performance vs. number of selected variables, comparing CV-10 / IS-LOO-CV, WAIC, DIC, MPP, BMA-ref, and BMA-proj]

Piironen & Vehtari (2017)


Projection predictive method, general idea

- Originally proposed for generalized linear models by Goutis and Robert (1998) and Dupuis and Robert (2003) (the decision-theoretic idea of using the full model can be traced to Lindley (1968); see also many related references in Vehtari and Ojanen (2012))
- Performs well in practice in comparison to many other methods (Piironen and Vehtari, 2016):
  - has low variance
  - able to preserve information from the full model
- General idea:
  1. Fit the full encompassing model (with all the inputs) with the best possible prior information
  2. Train any submodel (reduced number of inputs) by minimizing the predictive Kullback–Leibler (KL) divergence to the full model (= projection)
- For a given number of variables, choose the model with minimal projection discrepancy


Projective predictive covariate selection, idea

- The full model predictive distribution represents our best knowledge about future ỹ:

      p(ỹ | D) = ∫ p(ỹ | θ) p(θ | D) dθ,

  where θ = (β, σ²) and β is in general non-sparse (all β_j ≠ 0)
- What is the best distribution q⊥(θ) given the constraint that only the selected covariates have nonzero coefficients?
- Optimization problem:

      q⊥ = arg min_q (1/n) Σ_{i=1}^{n} KL( p(ỹ_i | D) ‖ ∫ p(ỹ_i | θ) q(θ) dθ )

- Optimal projection from the full posterior to a sparse posterior (with minimal predictive loss)


Projective predictive feature selection, computation

- We have posterior draws {θ_s}, s = 1, …, S, for the full model (θ = (β, σ²), and β is in general non-sparse: all β_j ≠ 0)
- The predictive distribution p(ỹ | D) ≈ (1/S) Σ_s p(ỹ | θ_s) represents our best knowledge about future ỹ
- Easier optimization problem by changing the order of integration and optimization (Goutis & Robert, 1998):

      θ_s⊥ = arg min_θ̂ (1/n) Σ_{i=1}^{n} KL( p(ỹ_i | θ_s) ‖ p(ỹ_i | θ̂) )

- The θ_s⊥ are now (approximate) draws from the projected distribution

Projection by draws

- The projection of one Monte Carlo sample can be solved
- Gaussian case: analytically,

      w⊥ = (X⊥ᵀ X⊥)⁻¹ X⊥ᵀ f
      σ⊥² = σ² + (1/n) (f − f⊥)ᵀ (f − f⊥)

- Exponential family case: equivalent to finding the maximum likelihood parameters for the submodel with the observations replaced by the fit of the reference model (Goutis & Robert, 1998; Dupuis & Robert, 2003)
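In the Gaussian case the per-draw projection above is just a least-squares fit of the full-model fit f onto the submodel design, with the lost fit absorbed into the noise. A minimal numpy sketch (function and variable names are illustrative, not from the talk):

```python
import numpy as np

def project_draw(X_sub, f, sigma2):
    """Project one draw (f, sigma2) onto the submodel with design X_sub:
    w_perp  = (X_sub^T X_sub)^{-1} X_sub^T f
    s2_perp = sigma2 + (1/n) (f - f_perp)^T (f - f_perp)
    """
    n = len(f)
    w_perp, *_ = np.linalg.lstsq(X_sub, f, rcond=None)
    r = f - X_sub @ w_perp          # fit lost by the submodel
    return w_perp, sigma2 + (r @ r) / n

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
f = X @ np.array([1.0, 2.0, 0.5])          # one draw of the full-model fit
w_full, s2_full = project_draw(X, f, 0.25)      # all 3 inputs: exact
w_sub, s2_sub = project_draw(X[:, :2], f, 0.25) # inputs 0, 1: noise inflates
```

Projecting onto the full design recovers the draw exactly; dropping a relevant input pushes the unexplained variation into the projected noise variance.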


Projection predictive method for GPs

- The parameters of the GP are essentially the latent values f (and likelihood parameters like σ)
- Without constraints on the latent values in the submodel, the solution to the minimization problem is f⊥ = f
- We require the constraint that the submodel prediction satisfies the usual GP predictive equations


Projection predictive method for GPs

- Fit the full model M by learning the hyperparameters θ to obtain the latent fit f | y, θ ~ N(f | μ_θ, Σ_θ)
- The projection to a submodel M⊥ with a smaller number of variables D⊥ is obtained by solving

      δ(M ‖ M⊥) = min_{θ⊥} KL( N(f | μ_θ, Σ_θ) ‖ N(f | μ_{θ⊥}, Σ_{θ⊥}) )    (1)

  where

      μ_{θ⊥} = K⊥(K⊥ + σ²I)⁻¹ y,
      Σ_{θ⊥} = K⊥ − K⊥(K⊥ + σ²I)⁻¹ K⊥
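The objective in (1) is the closed-form KL divergence between two multivariate Gaussians. Below is a numpy sketch of evaluating that objective for fixed submodel hyperparameters (illustrative only; the outer minimization over θ⊥ would wrap this in an optimizer and is not shown):

```python
import numpy as np

def kl_mvn(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) in closed form:
    0.5 * ( tr(S1^{-1} S0) + (mu1-mu0)^T S1^{-1} (mu1-mu0)
            - k + log det S1 - log det S0 )
    """
    k = len(mu0)
    L1 = np.linalg.cholesky(S1)
    # tr(S1^{-1} S0) via two triangular solves
    tr = np.trace(np.linalg.solve(L1.T, np.linalg.solve(L1, S0)))
    v = np.linalg.solve(L1, mu1 - mu0)     # whitened mean difference
    _, logdet0 = np.linalg.slogdet(S0)
    logdet1 = 2.0 * np.log(np.diag(L1)).sum()
    return 0.5 * (tr + v @ v - k + logdet1 - logdet0)

# demo on two random well-conditioned Gaussians
rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
S0 = A @ A.T + 4.0 * np.eye(4)
mu0 = rng.standard_normal(4)
B = rng.standard_normal((4, 4))
S1 = B @ B.T + 4.0 * np.eye(4)
mu1 = rng.standard_normal(4)
```

The divergence is zero for identical distributions and strictly positive otherwise, which is what makes it usable as a projection discrepancy.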


Toy example

[Figure: the eight additive components f1(x1), …, f8(x8)]

f(x) = f1(x1) + · · · + f8(x8),  y ~ N(f, 0.3²),  Var[f_j] = 1 for all j
⇒ All inputs equally relevant

[Figure: true relevance, ARD value, and LIO projection error for each of the 8 inputs]

Leave-input-out (LIO) projection errors (averaged over 100 data realizations, n = 200)


Projection predictive variable selection

- In variable selection it is usually not feasible to go through all variable combinations
- Use e.g. forward search to explore promising combinations:
  - start from the empty model; at each step, add the variable that reduces the objective (1) the most
  - stop when the performance is similar to that of the full model
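The forward search can be sketched as a greedy loop around a projection-error routine. The interface is hypothetical: `proj_error(subset)` is assumed to return the projection discrepancy of the submodel using `subset`, with 0 meaning full-model performance.

```python
import numpy as np

def forward_search(n_inputs, proj_error, tol=0.0):
    """Greedy forward selection: at each step add the input that reduces
    the projection objective the most; stop once the submodel is within
    `tol` of the full model (whose discrepancy is 0 here)."""
    chosen, remaining = [], list(range(n_inputs))
    while remaining:
        errs = [proj_error(chosen + [j]) for j in remaining]
        best = int(np.argmin(errs))
        chosen.append(remaining.pop(best))
        if errs[best] <= tol:
            break
    return chosen

# toy discrepancy: only inputs 0 and 3 matter
relevant = {0: 1.0, 3: 0.5}
toy_error = lambda subset: sum(v for j, v in relevant.items() if j not in subset)
forward_search(5, toy_error)   # picks input 0 first, then input 3, then stops
```

Each step evaluates one projection per remaining candidate, which is where the O(n³) cost per projection (next slides) starts to bite for GPs.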


Real world examples

[Figure: MLPD on test data vs. number of variables for Boston Housing (D = 13), Automobile (D = 38), and Crime (D = 102), showing the full model, the ARD ordering, and the projection ordering]

Mean log predictive density (MLPD) on test data for: the full model (all inputs, sampled hyperparameters); submodels of each size with variables sorted by ARD (length-scales), hyperparameters optimized to maximum marginal likelihood; and submodels with variables sorted by stepwise minimization of the projection error (forward search), hyperparameters learned via the projection.


Non-Gaussian likelihood

- Given a Gaussian posterior approximation (e.g. obtained using EP), we can make the projection conditional on the Gaussian likelihood approximations


Projection predictive method, pros and cons

- Advantage:
  - the discrepancy to the full model is a much more reliable indicator of the submodel's performance than the length-scales
- Disadvantage:
  - the computational complexity of the projection is O(n³) (unless sparse approximations are used) ⇒ slow if several submodels (e.g. variable combinations) are explored


Summary

- Carry out inference for the full model for best performance; select only if necessary
- ARD values (length-scales) are unreliable for input relevance assessment
- The projection discrepancy to the full model is a more robust indicator
- However, the forward search requires a substantial amount of additional computation (in addition to fitting the full model)


References

Dupuis, J. A. and Robert, C. P. (2003). Variable selection in qualitative models via an entropic explanatory power. Journal of Statistical Planning and Inference, 111(1–2):77–94.

Goutis, C. and Robert, C. P. (1998). Model choice in generalised linear models: A Bayesian approach via Kullback–Leibler projections. Biometrika, 85(1):29–37.

Lindley, D. V. (1968). The choice of variables in multiple regression. Journal of the Royal Statistical Society, Series B (Methodological), 30:31–66.

Piironen, J. and Vehtari, A. (2016). Comparison of Bayesian predictive methods for model selection. Statistics and Computing. First online.

Vehtari, A. and Ojanen, J. (2012). A survey of Bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 6:142–228.