

SLIDE 1

The Catch-up Phenomenon in Bayesian and MDL Model Selection

Tim van Erven
www.timvanerven.nl
23 May 2013
Joint work with Peter Grünwald, Steven de Rooij and Wouter Koolen

SLIDE 2

Outline

✤ Bayes Factors and MDL Model Selection
✤ Consistent, but suboptimal predictions
✤ Explanation: the Catch-up Phenomenon
✤ Predictive MDL interpretation of Bayes factors
✤ Markov chain example
✤ Solution: the Switch Distribution
✤ Simulations & Theorems: consistent + optimal predictions
✤ Cumulative risk

SLIDE 3

Two Desirable Properties in Model Selection

✤ Suppose $M_1, \ldots, M_K$ are statistical models
(sets of probability distributions: $M_k = \{p_\theta \mid \theta \in \Theta_k\}$)

✤ Consistency: If some $p^*$ in model $M_{k^*}$ generates the data, then $M_{k^*}$ is selected with probability one as the amount of data goes to infinity.

✤ Rate of convergence: How fast does an estimator based on the available models converge to the true distribution?

The AIC-BIC Dilemma:

                              Consistent   Optimal rate of convergence
  BIC, Bayes, MDL             Yes          No
  AIC, LOO Cross-validation   No           Yes
  ?                           Yes          Yes


SLIDE 5

Bayesian Prediction

✤ Given model $M_k = \{p_\theta \mid \theta \in \Theta_k\}$ with prior $w_k$ and data $x^n = (x_1, \ldots, x_n)$, the Bayesian marginal likelihood is

$$\bar{p}_k(x^n) \equiv p(x^n \mid M_k) := \int_{\Theta_k} p_\theta(x^n)\, w_k(\theta)\, d\theta$$

✤ Given $M_k$, predict with the estimator

$$\bar{p}_k(x_{n+1} \mid x^n) = \frac{\bar{p}_k(x^{n+1})}{\bar{p}_k(x^n)} = \int_{\Theta_k} p_\theta(x_{n+1} \mid x^n)\, w_k(\theta \mid x^n)\, d\theta$$
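To make the two formulas concrete, here is a minimal Python sketch (my own illustration, not part of the talk) for a single Bernoulli model with a uniform prior, where the predictive has the closed form of Laplace's rule of succession and the chain rule recovers the marginal likelihood:

```python
def bernoulli_predictive(past, x_next):
    """Predictive p(x_next | past) for the Bernoulli model with a
    uniform Beta(1,1) prior: Laplace's rule of succession."""
    p_one = (sum(past) + 1) / (len(past) + 2)
    return p_one if x_next == 1 else 1 - p_one

def marginal_likelihood(xs):
    """The product of one-step predictive probabilities equals the
    Bayesian marginal likelihood, by the chain rule."""
    prob = 1.0
    for i, x in enumerate(xs):
        prob *= bernoulli_predictive(xs[:i], x)
    return prob

xs = [1, 0, 1, 1, 0]
print(marginal_likelihood(xs))  # 1/60, the Beta integral of θ³(1-θ)²
```

Multiplying the one-step predictives gives exactly the integral above, which is the identity the rest of the talk builds on.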

SLIDE 6

Bayes Factors and MDL Model Selection

✤ Suppose we have multiple models $M_1, M_2, \ldots$

✤ Bayes factors: Put a prior $\pi$ on the model index k and choose $\hat{k}(x^n)$ to maximize the posterior probability

$$p(M_k \mid x^n) := \frac{\bar{p}_k(x^n)\,\pi(k)}{\sum_{k'} \bar{p}_{k'}(x^n)\,\pi(k')}$$

✤ Equivalently, $\hat{k}(x^n)$ is minimizing

$$\underbrace{-\log \bar{p}_k(x^n) - \log \pi(k)}_{\text{Minimum Description Length (MDL)}} \;\approx\; -\log \bar{p}_k(x^n)$$


SLIDE 8

Example: Histogram Density Estimation

✤ I.I.D. data in the interval [0,1]

✤ Given k, estimate the density by the histogram estimator in the figure, with models $M_k = \{p_\theta \mid \theta \in \Theta_k \subset \mathbb{R}^k\}$

[Figure: a 4-bin histogram estimate; the four bin masses are $\frac{4+1}{n+4}$, $\frac{0+1}{n+4}$, $\frac{2+1}{n+4}$ and $\frac{1+1}{n+4}$, one per bin.]

✤ This is equivalent to $\bar{p}_k$ for the conjugate Dirichlet(1,...,1) prior

✤ How should we choose the number of bins k?
✤ Too few: does not capture enough structure
✤ Too many: overfitting (many bins will be empty)

✤ [Yu, Speed, '92]: Bayes does not achieve the optimal rate of convergence!
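A minimal sketch (my own, under the assumptions above: equal-width bins on [0,1] and the Dirichlet(1,...,1) prior) of this Bayes predictive density:

```python
import numpy as np

def histogram_predictive(data, k):
    """Bayes predictive density for the k-bin histogram model with a
    Dirichlet(1,...,1) prior: bin j carries mass (n_j + 1)/(n + k),
    so the density on bin j is k * (n_j + 1)/(n + k)."""
    n = len(data)
    counts, _ = np.histogram(data, bins=k, range=(0.0, 1.0))
    def density(x):
        j = min(int(x * k), k - 1)   # index of the bin containing x
        return k * (counts[j] + 1) / (n + k)
    return density

data = np.random.default_rng(0).beta(2, 5, size=1000)
p10 = histogram_predictive(data, k=10)
print(p10(0.25))                      # estimated density at x = 0.25
```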

SLIDE 9

CV Selects More Bins than Bayes

[Figure: the true density f(x) (left); average number of bins selected vs. sample size n, for Bayes and LOO-CV (right).]
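For comparison, a sketch of how LOO-CV can pick k (my own illustration using the maximum-likelihood histogram; the slides do not specify the exact CV variant). Removing a point from its own bin gives a closed-form leave-one-out density, so no explicit refitting loop is needed:

```python
import numpy as np

def loo_cv_log_score(data, k):
    """Leave-one-out log likelihood for the k-bin ML histogram on [0,1]:
    the density of x_i with x_i removed is k * (n_j - 1) / (n - 1)."""
    n = len(data)
    counts, _ = np.histogram(data, bins=k, range=(0.0, 1.0))
    idx = np.minimum((data * k).astype(int), k - 1)
    dens = k * (counts[idx] - 1) / (n - 1)
    if np.any(dens <= 0):            # a singleton bin gives infinite loss
        return -np.inf
    return np.log(dens).sum()

data = np.random.default_rng(0).beta(2, 5, size=1000)
print(max(range(1, 101), key=lambda k: loo_cv_log_score(data, k)))
```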

SLIDE 10

CV Predicts Better than Bayes

Prediction error in log loss at sample size n:

$$-\log \bar{p}_{\hat{k}(x^n)}(x_{n+1} \mid x^n)$$

Cumulative prediction error:

$$\sum_{i=1}^{n} -\log \bar{p}_{\hat{k}(x^{i-1})}(x_i \mid x^{i-1})$$

[Figure: cumulative loss vs. sample size n for Bayes and LOO-CV.]

SLIDE 11

CV Predicts Better than Bayes...

✤ The true density is not a histogram, but can be approximated arbitrarily well
✤ LOO-CV and AIC converge at the optimal rate
✤ Bayesian model selection selects too few bins (underfits)!

[Figure: the true density f(x) (left); number of bins selected vs. sample size n for Bayes and LOO-CV (right).]

SLIDE 12

... but CV is Inconsistent!

✤ Now suppose the data are sampled from the uniform distribution
✤ LOO cross-validation selects 2.5 bins on average: it is inconsistent!

[Figure: the uniform density f(x) (left); number of bins selected vs. sample size n for Bayes and LOO-CV (right).]

SLIDE 13

Outline

✤ Bayes Factors and MDL Model Selection
✤ Consistent, but suboptimal predictions
✤ Explanation: the Catch-up Phenomenon
✤ Predictive MDL interpretation of Bayes factors
✤ Markov chain example
✤ Solution: the Switch Distribution
✤ Simulations & Theorems: consistent + optimal predictions
✤ Cumulative risk

SLIDE 14

Logarithmic Loss

If we measure prediction quality by log loss

$$\mathrm{loss}(p, x) := -\log p(x),$$

then minus log likelihood = cumulative log loss:

$$-\log p(x_1, \ldots, x_n) = \sum_{i=1}^{n} -\log p(x_i \mid x^{i-1}),$$

where $x^{i-1} = (x_1, \ldots, x_{i-1})$.

✤ Proof: Take the negative logarithm of the chain rule

$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x^{i-1}).$$

SLIDE 15

The Most Important Slide

Bayes factors and MDL pick the k minimizing

$$-\log \bar{p}_k(x_1, \ldots, x_n) = \sum_{i=1}^{n} \underbrace{-\log \bar{p}_k(x_i \mid x^{i-1})}_{\text{prediction error for model } M_k \text{ at sample size } i}$$

Prequential/predictive MDL interpretation: select the model $M_k$ such that $\bar{p}_k$, when used as a sequential prediction strategy, minimizes cumulative sequential prediction error. [Dawid '84, Rissanen '84]
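As an illustration of this prequential reading (a toy sketch of mine, not the talk's MATLAB code): run each model as a sequential predictor on binary data, accumulate its log loss, and select the argmin. The selected index may change with n.

```python
import numpy as np

def laplace_iid(xs, i):
    """P(x_i = 1 | x^{i-1}) under an i.i.d. Bernoulli model, uniform prior."""
    past = xs[:i]
    return (sum(past) + 1) / (len(past) + 2)

def laplace_markov(xs, i):
    """P(x_i = 1 | x^{i-1}) under a first-order binary Markov chain with
    uniform priors on the two transition probabilities."""
    if i == 0:
        return 0.5
    s = xs[i - 1]
    pairs = list(zip(xs[:i], xs[1:i]))   # transitions seen so far
    n_s = sum(1 for a, _ in pairs if a == s)
    n_s1 = sum(1 for a, b in pairs if a == s and b == 1)
    return (n_s1 + 1) / (n_s + 2)

def cumulative_log_loss(predict, xs):
    """Sum_i -log p(x_i | x^{i-1}), which equals -log p(x^n)."""
    total = 0.0
    for i, x in enumerate(xs):
        p = predict(xs, i)
        total += -np.log(p if x == 1 else 1 - p)
    return total

xs = [1, 1, 0] * 20                          # strongly patterned data
for name, f in [("iid", laplace_iid), ("markov", laplace_markov)]:
    print(name, cumulative_log_loss(f, xs))  # select the smaller one
```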

SLIDE 16

Example: Markov Chains

Natural language text: “The Picture of Dorian Gray” by Oscar Wilde

"... But beauty, real beauty, ends where an intellectual expression begins. Intellect is in itself a mode of exaggeration, and destroys the harmony of any face. The moment one sits down to think, one becomes all nose, or all forehead, or something horrid. Look at the successful men in any of the learned professions. How perfectly hideous they are! ..."

Compare the first-order and second-order Markov chain models:

✤ $\bar{p}_1(x^n)$ and $\bar{p}_2(x^n)$ on the first n characters in the book, with uniform priors on the transition probabilities
✤ The first-order chain has 128 × 127 free parameters; the second-order chain has 128 × 128 × 127
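For these models the marginal likelihood has a closed form: with a uniform Dirichlet(1,...,1) prior per context, each context contributes a Dirichlet-multinomial term. A sketch of mine (the file name is hypothetical; the alphabet size of 128 matches the parameter counts above):

```python
from math import lgamma
from collections import defaultdict

def neg_log_marginal(text, order, alphabet_size=128):
    """-log p(x^n) (in nats) for an order-m Markov chain with uniform
    Dirichlet(1,...,1) priors: each context c with symbol counts n_{c,a}
    and total N_c contributes lgamma(N_c + A) - lgamma(A)
    - sum_a lgamma(n_{c,a} + 1), conditioning on the first m characters."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    A = alphabet_size
    total = 0.0
    for ctx_counts in counts.values():
        N = sum(ctx_counts.values())
        total += lgamma(N + A) - lgamma(A)
        total -= sum(lgamma(c + 1) for c in ctx_counts.values())
    return total

text = open("dorian_gray.txt").read()      # hypothetical local copy
for order in (1, 2):
    print(order, neg_log_marginal(text, order))
```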


SLIDE 18

Example: Markov chains

Compare the marginal likelihoods:

$$-\log \bar{p}_2(x^n) - \left[-\log \bar{p}_1(x^n)\right] = \sum_{i=1}^{n} \mathrm{loss}(\bar{p}_2, x_i) - \sum_{i=1}^{n} \mathrm{loss}(\bar{p}_1, x_i)$$

[Figure: this difference plotted against sample size n; the green line equals the log of the Bayes factor.]

✤ When the difference is negative, Bayes factors select the complex model
✤ But the complex model makes the best predictions from a much earlier sample size onward!


SLIDE 22

The Catch-up Phenomenon

✤ Given “simple” model and a “complex” model ✤ Common phenomenon: for some sample size s ✤ simple model predicts better if n ≤ s ✤ complex model predicts better if n > s ✤ Catch-up Phenomenon: Bayes/MDL exhibit inertia ✤ complex model has to “catch up”,

so we prefer simpler model for a while even after n > s!

Remark: Methods similar to Bayes factors (e.g. BIC) will also exhibit the catch-up

  • phenomenon. Bayesian model averaging does not help either!

M1 M2

SLIDE 23

Example: Markov chains

[Figure: cumulative log loss of Bayes / MDL vs. sample size n, compared with a black curve that switches from the simple to the complex model at the right moment.]

Can we modify Bayes so as to do as well as the black curve? Almost!


SLIDE 25

Outline

✤ Bayes Factors and MDL Model Selection
✤ Consistent, but suboptimal predictions
✤ Explanation: the Catch-up Phenomenon
✤ Predictive MDL interpretation of Bayes factors
✤ Markov chain example
✤ Solution: the Switch Distribution
✤ Simulations & Theorems: consistent + optimal predictions
✤ Cumulative risk

SLIDE 26

The Best of Both Worlds

✤ Catch-up phenomenon: a new explanation for the poor predictions of Bayes (and other BIC-like methods)
✤ We want a model selection/averaging method that, in a wide variety of circumstances,
✤ is provably consistent, and
✤ provably achieves optimal convergence rates
✤ But it has previously been suggested that this is impossible! [Yang '05]
✤ So we have to be careful to avoid impossibility results...


SLIDE 28

The Switch Distribution

✤ To avoid the catch-up phenomenon we would like to switch between models at switch-point s:

$$p_{\mathrm{sw}}(x^n \mid s) := \prod_{i=1}^{s} \bar{p}_1(x_i \mid x^{i-1}) \times \prod_{i=s+1}^{n} \bar{p}_2(x_i \mid x^{i-1})$$

✤ Q. But how do we know when to switch?!

✤ A. Switch distribution: do not put a prior on models, but a prior $\pi$ on when to switch between models:

$$p_{\mathrm{sw}}(x^n) := \sum_{s \ge 0} p_{\mathrm{sw}}(x^n \mid s)\,\pi(s)$$

✤ Generalizes to an arbitrary (unknown) number of switches between any countable number of models.

✤ For many model classes, the method is computationally feasible.
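Computationally, for two strategies and a single switch the sum over s can be evaluated directly from the per-step log losses. A sketch of mine, with a prior π(s) ∝ 1/(s+2)² chosen only for illustration (any summable prior works; the normalization constant and switch points beyond n are omitted):

```python
import numpy as np

def neg_log_switch(losses1, losses2,
                   log_prior=lambda s: -2.0 * np.log(s + 2)):
    """-log p_sw(x^n), where losses_k[i] = -log p_k(x_{i+1} | x^i).
    p_sw(x^n | s) follows strategy 1 up to time s and strategy 2 after."""
    n = len(losses1)
    cum1 = np.concatenate(([0.0], np.cumsum(losses1)))  # prefix sums
    cum2 = np.concatenate(([0.0], np.cumsum(losses2)))
    # log [ pi(s) * p_sw(x^n | s) ] for every switch point s = 0, ..., n
    log_terms = np.array([log_prior(s) - cum1[s] - (cum2[n] - cum2[s])
                          for s in range(n + 1)])
    m = log_terms.max()                       # log-sum-exp for stability
    return -(m + np.log(np.exp(log_terms - m).sum()))

l1 = np.array([0.5] * 5 + [0.9] * 5)   # strategy 1 is better early ...
l2 = np.array([0.9] * 5 + [0.4] * 5)   # ... strategy 2 catches up later
print(neg_log_switch(l1, l2))
```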


SLIDE 32

Switching Resolves the Catch-up Phenomenon

✤ Pay less than 32 bits for not knowing s: $2\log_2(s+1) = 2\log_2(50\,001) \approx 31.2$ bits
✤ Gain more than 20 000 bits by switching
✤ Almost as good as knowing in advance when to switch!

[Figure: cumulative loss vs. sample size n; the switch distribution tracks the better model throughout.]

SLIDE 33

Switch Distribution is Consistent for Histograms

[Figure: the uniform density f(x) (left); number of bins selected (middle) and cumulative loss (right) vs. sample size n, for Switch, Bayes, and LOO-CV.]

SLIDE 34

Switch Distribution Predicts Well with Histograms

[Figure: the true density f(x) (left); number of bins selected (middle) and cumulative loss (right) vs. sample size n, for Switch, Bayes, and LOO-CV.]

SLIDE 35

Theorem: Switching is Consistent

✦ Let $M_1, M_2, \ldots$ be models with priors $w_1, w_2, \ldots$ on parameter sets $\Theta_1, \Theta_2, \ldots$ and marginal likelihoods $\bar{p}_1, \bar{p}_2, \ldots$

✦ Suppose $\bar{p}_1, \bar{p}_2, \ldots$ are asymptotically sufficiently distinguishable in a suitable sense.

For example, it is sufficient if the models consist of i.i.d. or Markov distributions, the parameter sets are of different dimensionality and the priors have a density w.r.t. Lebesgue measure.

✦ Then, for all $k^*$ and all $p^* \in M_{k^*}$, except for a subset of $\Theta_{k^*}$ with $w_{k^*}$-prior probability 0, the switch distribution is consistent in that

$$\lim_{n \to \infty} p_{\mathrm{sw}}(M_{k^*} \mid X^n) = 1$$

with $p^*$-probability 1.

SLIDE 36

Setting for Prediction

✤ Let $M_1, M_2, \ldots$ be i.i.d. models that can approximate a large set $M^*$ of i.i.d. distributions arbitrarily well (in Kullback-Leibler divergence)

✤ For example, $M^*$ may be the set of all densities on [0,1] with bounded derivatives, and $M_1, M_2, \ldots$ may be histograms

✤ Suppose data $X^n = (X_1, \ldots, X_n)$ are i.i.d. with distribution $p^* \in M^*$

SLIDE 37

Risk

✤ Let $p_{X^{n-1}}$ be the prediction of outcome $X_n$ for some estimator $p$.
For example, $p$ may be based on the Bayesian marginal likelihood.

✤ The risk is the expected divergence $D$ of the predictions of $p$ from $p^*$:

$$r_n(p^*, p) := \mathbb{E}_{X^{n-1} \sim p^*}\, D(p^* \,\|\, p_{X^{n-1}})$$

✤ We take $D$ to be the Kullback-Leibler divergence:

$$D(p^* \,\|\, p_{X^{n-1}}) = \mathbb{E}_{X_n \sim p^*}\!\left[\mathrm{loss}(p_{X^{n-1}}, X_n) - \mathrm{loss}(p^*, X_n)\right]$$

SLIDE 38

Cumulative Risk

The cumulative risk is

$$R_n(p^*, p) = \sum_{i=1}^{n} r_i(p^*, p) = \mathbb{E}_{X^n}\!\left[\sum_{i=1}^{n} \mathrm{loss}(p_{X^{i-1}}, X_i) - \sum_{i=1}^{n} \mathrm{loss}(p^*, X_i)\right]$$

Motivation:

✤ Appropriate when the goal is sequential prediction
✤ Can convert to ordinary risk via online-to-batch conversion [Yang, Barron, '99]
✤ Equals redundancy in universal coding
✤ Avoids Yang's impossibility results
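The expectation can be estimated by simulation. A Monte Carlo sketch of mine for i.i.d. Bernoulli data and the Laplace (add-one) sequential estimator, whose cumulative risk grows like (1/2) log n:

```python
import numpy as np

def cumulative_risk(theta_star, n, runs=2000, seed=1):
    """Monte Carlo estimate of R_n = E[ sum_i loss(p_{X^{i-1}}, X_i)
    - loss(p*, X_i) ] for Bernoulli(theta_star) data and the
    Laplace (add-one) sequential estimator."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(runs):
        xs = (rng.random(n) < theta_star).astype(int)
        ones = 0
        for i, x in enumerate(xs):
            p = (ones + 1) / (i + 2)     # Laplace one-step predictive
            q = theta_star
            total += -np.log(p if x else 1 - p) + np.log(q if x else 1 - q)
            ones += x
    return total / runs

print(cumulative_risk(0.3, n=100))  # close to 0.5 * log(100) ≈ 2.3 nats
```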

SLIDE 39

Theorem: Switching Achieves Minimax Cumulative Rate

✤ Let $\bar{p}_1, \bar{p}_2, \ldots$ be estimators for the models $M_1, M_2, \ldots$
An oracle chooses model $k \equiv k(p^*, X^n)$, knowing the true distribution and the data.

✤ Suppose the cumulative risk of the oracle grows fast enough that

$$\frac{(\log n)^{2+\alpha}}{\sup_{p^* \in M^*} R_n(p^*, \bar{p}_k)} \to 0$$

for some $\alpha > 0$, and the effective number of models is polynomial in n, i.e. $k(p^*, X^n) \le n^\beta$ for some $\beta > 0$.

✤ Then the switch distribution, with suitable prior $\pi$, predicts at least as well as the oracle:

$$\limsup_{n \to \infty} \frac{\sup_{p^* \in M^*} R_n(p^*, p_{\mathrm{sw}})}{\sup_{p^* \in M^*} R_n(p^*, \bar{p}_k)} \le 1$$

SLIDE 40

Outline

✤ Bayes Factors and MDL Model Selection
✤ Consistent, but suboptimal predictions
✤ Explanation: the Catch-up Phenomenon
✤ Predictive MDL interpretation of Bayes factors
✤ Markov chain example
✤ Solution: the Switch Distribution
✤ Simulations & Theorems: consistent + optimal predictions
✤ Cumulative risk

SLIDE 41

Conclusion

✤ Bayes and other BIC-like methods select the model that minimizes cumulative prediction error.

✤ If the best-predicting model depends on the sample size, then they suffer from the catch-up phenomenon.

✤ This explains the AIC-BIC dilemma.

✤ The switch distribution provably resolves the catch-up phenomenon:

                              Consistent   Optimal rate of convergence
  BIC, Bayes, MDL             Yes          No
  AIC, LOO Cross-validation   No           Yes
  Switch distribution         Yes          Yes (for cumulative risk)

SLIDE 42

References

✤ A. P. Dawid, Statistical theory: The prequential approach, Journal of the Royal Statistical Society, Series A 147, Part 2 (1984), 278-292
✤ D. Haussler and M. Opper, Mutual information, metric entropy and cumulative relative entropy risk, The Annals of Statistics, Vol. 25, No. 6 (1997), 2451-2492
✤ J. Rissanen, Universal coding, information, prediction, and estimation, IEEE Transactions on Information Theory IT-30, No. 4 (1984), 629-636
✤ B. Yu and T. P. Speed, Data compression and histograms, Probability Theory and Related Fields 92 (1992), 195-229
✤ Y. Yang and A. Barron, Information-theoretic determination of minimax rates of convergence, Annals of Statistics, Vol. 27, No. 5 (1999), 1564-1599
✤ Y. Yang, Can the strengths of AIC and BIC be shared?, Biometrika 92(4), 2005, 937-950
✤ T. van Erven, P. Grünwald and S. de Rooij, Catching up faster by switching sooner: A predictive approach to adaptive estimation with an application to the AIC-BIC dilemma, Journal of the Royal Statistical Society, Series B, Vol. 74, No. 3 (2012), 361-417

MATLAB software available from my website: www.timvanerven.nl/publications/

SLIDE 43

Bayesian Prediction

✤ Given model $M_k = \{p_\theta \mid \theta \in \Theta_k\}$ with prior $w_k$ and data $x^n = (x_1, \ldots, x_n)$, the Bayesian marginal likelihood is

$$\bar{p}_k(x^n) \equiv p(x^n \mid M_k) := \int_{\Theta_k} p_\theta(x^n)\, w_k(\theta)\, d\theta$$

✤ Given $M_k$, predict with the estimator

$$\bar{p}_k(x_{n+1} \mid x^n) = \frac{\bar{p}_k(x^{n+1})}{\bar{p}_k(x^n)} = \int_{\Theta_k} p_\theta(x_{n+1} \mid x^n)\, w_k(\theta \mid x^n)\, d\theta$$

✤ If k is unknown, Bayesian model averaging also puts a prior $\pi$ on k:

$$p(x_{n+1} \mid x^n) = \sum_{k} \bar{p}_k(x_{n+1} \mid x^n)\, \pi(k \mid x^n)$$
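A small sketch of mine of this model-averaged predictive, taking the per-model one-step predictives and (log) marginal likelihoods as inputs:

```python
import numpy as np

def bma_predictive(one_step, log_marginals, log_prior):
    """Bayesian model averaging: p(x_{n+1} | x^n) is the mixture of the
    models' one-step predictives with posterior weights
    pi(k | x^n) proportional to p_k(x^n) * pi(k).
    one_step[k]      = p_k(x_{n+1} | x^n)
    log_marginals[k] = log p_k(x^n)   (logs avoid underflow)"""
    logw = np.asarray(log_marginals) + np.asarray(log_prior)
    w = np.exp(logw - logw.max())     # stable unnormalized weights
    w /= w.sum()
    return float(np.dot(w, one_step))

print(bma_predictive(one_step=[0.7, 0.5],
                     log_marginals=[-70.0, -69.0],
                     log_prior=np.log([0.5, 0.5])))
```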