The Catch-up Phenomenon in Bayesian and MDL Model Selection Tim van - PowerPoint PPT Presentation

The Catch-up Phenomenon in Bayesian and MDL Model Selection Tim van Erven www.timvanerven.nl 23 May 2013 Joint work with Peter Grünwald , Steven de Rooij and Wouter Koolen

Outline ✤ Bayes Factors and MDL Model Selection ✤ Consistent, but suboptimal predictions ✤ Explanation: the Catch-up Phenomenon ✤ Predictive MDL interpretation of Bayes factors ✤ Markov chain example ✤ Solution: the Switch Distribution ✤ Simulations & Theorems: consistent + optimal predictions ✤ Cumulative risk

T wo Desirable Properties in Model Selection ✤ Suppose are statistical models M 1 , . . . , M K (sets of probability distributions: ) M k = { p θ | θ ∈ Θ k } ✤ Consistency : If some in model generates the data, then is p ∗ M k ∗ M k ∗ selected with probability one as the amount of data goes to infinity. ✤ Rate of convergence : How fast does an estimator based on the available models converge to the true distribution? AIC-BIC Dilemma Consistent Optimal rate of convergence BIC, Bayes, MDL Yes No AIC, LOO Cross-validation No Yes

T wo Desirable Properties in Model Selection ✤ Suppose are statistical models M 1 , . . . , M K (sets of probability distributions: ) M k = { p θ | θ ∈ Θ k } ✤ Consistency : If some in model generates the data, then is p ∗ M k ∗ M k ∗ selected with probability one as the amount of data goes to infinity. ✤ Rate of convergence : How fast does an estimator based on the available models converge to the true distribution? AIC-BIC Dilemma Consistent Optimal rate of convergence ? BIC, Bayes, MDL Yes No AIC, LOO Cross-validation No Yes

Bayesian Prediction ✤ Given model with prior and data M k = { p θ | θ ∈ Θ k } w k x n = ( x 1 , . . . , x n ) , the Bayesian marginal likelihood is Z p k ( x n ) ≡ p ( x n |M k ) := p θ ( x n ) w k ( θ ) d θ ¯ Θ k ✤ Given predict with estimator M k p k ( x n +1 ) p k ( x n +1 | x n ) = ¯ Z p θ ( x n +1 | x n ) w k ( θ | x n ) d θ ¯ = p k ( x n ) ¯ Θ k

Bayes Factors and MDL Model Selection ✤ Suppose we have multiple models M 1 , M 2 , . . . ✤ Bayes factors : Put a prior on model index k and choose to ˆ k ( x n ) π maximize the posterior probability p k ( x n ) π ( k ) ¯ p ( M k | x n ) := P k 0 ¯ p k 0 ( x n ) π ( k 0 ) ˆ ✤ is minimizing k ( x n ) p k ( x n ) − log π ( k ) ≈ − log ¯ p k ( x n ) − log ¯

Bayes Factors and MDL Model Selection ✤ Suppose we have multiple models M 1 , M 2 , . . . ✤ Bayes factors : Put a prior on model index k and choose to ˆ k ( x n ) π maximize the posterior probability p k ( x n ) π ( k ) ¯ p ( M k | x n ) := P k 0 ¯ p k 0 ( x n ) π ( k 0 ) ˆ ✤ is minimizing k ( x n ) p k ( x n ) − log π ( k ) ≈ − log ¯ p k ( x n ) − log ¯ } Minimum Description Length (MDL)

Example: Histogram Density Estimation 0.6 0.5 M k = { p θ | θ ∈ Θ k ⊂ R k } 0.4 0.3 = 4+1 θ 1 n +4 0.2 = 2+1 θ 3 n +4 = 1+1 θ 4 0.1 n +4 θ 2 = 0+1 n +4 0 0 0.25 0.5 0.75 1 ✤ I.I.D. data in interval [0,1] ✤ Given k, estimate density by the estimator in the figure ✤ This is equivalent to for conjugate Dirichlet(1,...,1) prior ¯ p k ✤ How should we choose the number of bins k ? ✤ Too few: does not capture enough structure ✤ Too many: overfitting (many bins will be empty) ✤ [Yu, Speed, ‘92]: Bayes does not achieve the optimal rate of convergence!

CV Selects More Bins than Bayes average # bins selected 3 100 2.5 80 2 60 1.5 40 1 20 Bayes 0.5 LOOCV f(x) 0 0 0.2 0.4 0.6 0.8 1 0 50000 100000 150000 200000 n

CV Predicts Better than Bayes k ( x n ) ( x n +1 | x n ) Prediction error in log loss at sample size n: − log ¯ p ˆ n X k ( x i − 1 ) ( x i | x i − 1 ) cumulative prediction error: − log ¯ p ˆ i =1 cumulative loss 350 300 250 200 150 100 Bayes 50 LOOCV 0 0 50000 100000 150000 200000 n

CV Predicts Better than Bayes... # bins selected 3 100 2.5 80 2 60 1.5 40 1 20 Bayes 0.5 LOOCV f(x) 0 0 0.2 0.4 0.6 0.8 1 0 50000 100000 150000 200000 n ✤ Density not a histogram, but can be approximated arbitrarily well ✤ LOO-CV, AIC converge at optimal rate ✤ Bayesian model selection selects too few bins ( underfits )!

... but CV is Inconsistent! ✤ Now suppose data are sampled from the uniform distribution # bins selected 8 Bayes 1 7 LOOCV 6 0.8 5 0.6 4 3 0.4 2 0.2 1 f(x) 0 0 0 50000 100000 150000 200000 0 0.2 0.4 0.6 0.8 1 n ✤ LOO cross-validation selects 2.5 bins on average: it is inconsistent !

Logarithmic Loss If we measure prediction quality by log loss loss( p, x ) := − log p ( x ) then minus log likelihood = cumulative log loss : n X − log p ( x i | x i − 1 ) − log p ( x 1 , . . . , x n ) = i =1 x i − 1 = ( x 1 , . . . , x i − 1 ) where n Y p ( x i | x i − 1 ) p ( x 1 , . . . , x n ) = Proof. Take the negative logarithm of the chain rule: i =1

The Most Important Slide Bayes factors and MDL pick the k minimizing n X p k ( x i | x i − 1 ) − log ¯ p k ( x 1 , . . . , x n ) = − log ¯ } i =1 Prediction error for model at sample size i! M k Prequential/predictive MDL interpretation: select the model such that, when used as a sequential prediction M k strategy, minimizes cumulative sequential prediction error ¯ p k [Dawid ’84, Rissanen ’84]

Example: Markov Chains Natural language text : “The Picture of Dorian Gray” by Oscar Wilde "... But beauty, real beauty, ends where an intellectual expression begins. Intellect is in itself a mode of exaggeration, and destroys the harmony of any face. The moment one sits down to think, one becomes all nose, or all forehead, or something horrid. Look at the successful men in any of the learned professions. How perfectly hideous they are! ..." Compare the first-order and second-order Markov chain models on the first n characters in the book, with uniform priors on the transition probabilities

Example: Markov Chains Natural language text : “The Picture of Dorian Gray” by Oscar Wilde "... But beauty, real beauty, ends where an intellectual expression begins. Intellect is in itself a mode of exaggeration, and destroys the harmony of any face. The moment one sits down to think, one becomes all nose, or all forehead, or something horrid. Look at the successful men in any of the learned professions. How perfectly hideous they are! ..." 128x128x127 parameters 128x127 parameters Compare the first-order and second-order Markov chain models on the first n characters in the book, with uniform priors on the transition probabilities

Example: Markov chains Compare the marginal likelihoods Sample size (n) (green line equals the log of the Bayes factor)

Example: Markov chains Sample size (n) n n X X p 2 ( x n ) − [ − log ¯ p 1 ( x n )] = − log ¯ loss(¯ p 2 , x i ) − loss(¯ p 1 , x i ) i =1 i =1

Example: Markov chains For , select complex model Sample size (n) n n X X p 2 ( x n ) − [ − log ¯ p 1 ( x n )] = − log ¯ loss(¯ p 2 , x i ) − loss(¯ p 1 , x i ) i =1 i =1

Example: Markov chains For , complex model makes the best predictions! For , select complex model Sample size (n) n n X X p 2 ( x n ) − [ − log ¯ p 1 ( x n )] = − log ¯ loss(¯ p 2 , x i ) − loss(¯ p 1 , x i ) i =1 i =1

The Catch-up Phenomenon ✤ Given “simple” model and a “complex” model M 1 M 2 ✤ Common phenomenon: for some sample size s ✤ simple model predicts better if n ≤ s ✤ complex model predicts better if n > s ✤ Catch-up Phenomenon : Bayes/MDL exhibit inertia ✤ complex model has to “ catch up ”, so we prefer simpler model for a while even after n > s! Remark : Methods similar to Bayes factors (e.g. BIC) will also exhibit the catch-up ✤ phenomenon. Bayesian model averaging does not help either!

Example: Markov chains Bayes / MDL Can we modify Bayes so as to do as well as the black curve ?

Example: Markov chains Bayes / MDL Can we modify Bayes so as to do as well as the black curve ? Almost!

The Best of Both Worlds ✤ Catch-up phenomenon : new explanation for poor predictions of Bayes (and other BIC-like methods) ✤ We want a model selection/averaging method that, in a wide variety of circumstances, ✤ is provably consistent , ✤ provably achieves optimal convergence rates ✤ But it has previously been suggested that this is impossible! [Yang ’05]

The Catch-up Phenomenon in Bayesian and MDL Model Selection Tim van - PowerPoint PPT Presentation

The Catch-up Phenomenon in Bayesian and MDL Model Selection Tim van Erven www.timvanerven.nl 23 May 2013 Joint work with Peter Grnwald , Steven de Rooij and Wouter Koolen Outline Bayes Factors and MDL Model Selection Consistent, but

Overview Two-Part MDL Two-Part MDL Two-Part MDL for Two-Part MDL for Grammar Learning

REAL-TIME RAY TRACING WITH MDL Ignacio Llamas & Maksim Eisenstein, 03.21.2019 - GPU Technology

Recent Developments in Class Action Law and Impact on MDL Cases and Impact on MDL Cases

Between Renderers with MDL Jan Jordan Software Product Manager MDL March 18, GTC San Jose 2019

Outline Intro to RL and Bayesian Learning History of Bayesian RL Model-based Bayesian

Catch the Cham eleon - transcript of presentation video Nick: So, this is team Catch the Chameleon.

Being Bayesian About Being Bayesian About Net work St ruct ure Net work St ruct ure A Bayesian

Strong Asymptotic Assertions for Discrete MDL in Regression and Classification or A Strange Way

Implement Physically Based Ray Tracing with OptiX and MDL Detlef Rttger, 4/4/2016 OptiX

Integrating the NVIDIA Material Definition Language MDL in Your Application Lutz Kettner

Sharing Physically Based Materials Between Renderers with MDL Jan Jordan Software Product Manager

SIGGRAPH16 : NVIDIA BEST OF GTC MDL MATERIALS TO GLSL SHADERS Andreas Senbach, NVIDIA

RESTORATION ADVISORY BOARD MEETING MARCH 14, 2019 87th Air Base Wing JB MDL PBR CONTRACT UPDATE

MDL-Based Unsupervised Attribute Ranking Zdravko Markov Computer Science Department Central

CS440/ECE448 Lecture 15: Bayesian Inference and Bayesian Learning Slides by Svetlana Lazebnik,

Bayesian Learning 1 Outline MLE, MAP vs. Bayesian Learning Bayesian Linear Regression

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part 2 S ebastien

Catalan intransitive verbs and argument realization Alex Alsina & Fengrong Yang

Latent Models: Sequence Models Beyond HMMs and Machine Translation Alignment CMSC 473/673 UMBC

YES, YOU CAN TRAIN CATS! Circuscats.com (The Acro-Cats) CHERYL KOLUS, DVM, KPA-CTP 2 IF THIS

Catch-up Effects in Health Outcomes - Linear and Quantile Regression Estimates from Four Countries

Review , Catch-up, Question&Answ er &A ti Q C t h i R Outline Dear Prof.

CS 309: Autonomous Intelligent Robotics FRI I Lecture 24: Final Presentations Instructor:

Ranking pop songs through the years Julia Silge Data Scientist at Stack Overflow DataCamp

The Catch-up Phenomenon in Bayesian and MDL Model Selection Tim van - PowerPoint PPT Presentation

The Catch-up Phenomenon in Bayesian and MDL Model Selection Tim van Erven www.timvanerven.nl 23 May 2013 Joint work with Peter Grnwald , Steven de Rooij and Wouter Koolen Outline Bayes Factors and MDL Model Selection Consistent, but

Overview Two-Part MDL Two-Part MDL Two-Part MDL for Two-Part MDL for Grammar Learning

REAL-TIME RAY TRACING WITH MDL Ignacio Llamas &amp; Maksim Eisenstein, 03.21.2019 - GPU Technology

Recent Developments in Class Action Law and Impact on MDL Cases and Impact on MDL Cases

Between Renderers with MDL Jan Jordan Software Product Manager MDL March 18, GTC San Jose 2019

Outline Intro to RL and Bayesian Learning History of Bayesian RL Model-based Bayesian

Catch the Cham eleon - transcript of presentation video Nick: So, this is team Catch the Chameleon.

Being Bayesian About Being Bayesian About Net work St ruct ure Net work St ruct ure A Bayesian

Strong Asymptotic Assertions for Discrete MDL in Regression and Classification or A Strange Way

Implement Physically Based Ray Tracing with OptiX and MDL Detlef Rttger, 4/4/2016 OptiX

Integrating the NVIDIA Material Definition Language MDL in Your Application Lutz Kettner

Sharing Physically Based Materials Between Renderers with MDL Jan Jordan Software Product Manager

SIGGRAPH16 : NVIDIA BEST OF GTC MDL MATERIALS TO GLSL SHADERS Andreas Senbach, NVIDIA

RESTORATION ADVISORY BOARD MEETING MARCH 14, 2019 87th Air Base Wing JB MDL PBR CONTRACT UPDATE

MDL-Based Unsupervised Attribute Ranking Zdravko Markov Computer Science Department Central

CS440/ECE448 Lecture 15: Bayesian Inference and Bayesian Learning Slides by Svetlana Lazebnik,

Bayesian Learning 1 Outline MLE, MAP vs. Bayesian Learning Bayesian Linear Regression

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part 2 S ebastien

Catalan intransitive verbs and argument realization Alex Alsina &amp; Fengrong Yang

Latent Models: Sequence Models Beyond HMMs and Machine Translation Alignment CMSC 473/673 UMBC

YES, YOU CAN TRAIN CATS! Circuscats.com (The Acro-Cats) CHERYL KOLUS, DVM, KPA-CTP 2 IF THIS

Catch-up Effects in Health Outcomes - Linear and Quantile Regression Estimates from Four Countries

Review , Catch-up, Question&amp;Answ er &amp;A ti Q C t h i R Outline Dear Prof.

CS 309: Autonomous Intelligent Robotics FRI I Lecture 24: Final Presentations Instructor:

Ranking pop songs through the years Julia Silge Data Scientist at Stack Overflow DataCamp

REAL-TIME RAY TRACING WITH MDL Ignacio Llamas & Maksim Eisenstein, 03.21.2019 - GPU Technology

Catalan intransitive verbs and argument realization Alex Alsina & Fengrong Yang

Review , Catch-up, Question&Answ er &A ti Q C t h i R Outline Dear Prof.