Bayesian Model Comparison - Roberto Trotta (www.robertotrotta.com)

SLIDE 1

Bayesian Model Comparison

Roberto Trotta - www.robertotrotta.com

@R_Trotta

Analytics, Computation and Inference in Cosmology Cargese, Sept 2018

SLIDE 2

Frequentist hypothesis testing

  • Warning: frequentist hypothesis testing (e.g., the likelihood ratio test) cannot be interpreted as a statement about the probability of the hypothesis!

  • Example: to test the null hypothesis H0: θ = 0, draw n normally distributed points (with known variance σ²). The χ² statistic is distributed as a chi-square distribution with (n−1) degrees of freedom (dof). Pick a significance level α (e.g. α = 0.05). If P(χ² > χ²_obs) < α, reject the null hypothesis.

  • This is a statement about the probability of observing data as extreme as, or more extreme than, those actually measured, assuming the null hypothesis is correct.

  • It is not a statement about the probability of the null hypothesis itself, and cannot be interpreted as such (or you'll make gross mistakes).

  • "The use of p-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred." (Jeffreys, 1961)

SLIDE 3

Exercise: Is the coin fair?

DATA: T T H T H T T T T T T H

Blue Team: N = 12 is fixed, H (the number of heads) is the random variable
Red Team: H = 3 is fixed, N (the number of tosses) is the random variable

Question: What is the p-value for the null hypothesis (fair coin) in each case?
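The two teams see the same data but assume different stopping rules, and the p-value depends on the stopping rule. A minimal sketch of the computation (assuming the conventional one-sided p-value in each case):

```python
from math import comb

def p_blue(n=12, h=3):
    # Blue Team: N = 12 tosses fixed, number of heads H is the random variable.
    # One-sided p-value: probability of h or fewer heads under a fair coin.
    return sum(comb(n, k) for k in range(h + 1)) / 2**n

def p_red(n=12, h=3):
    # Red Team: stop at the h-th head, so N is the random variable.
    # p-value: P(N >= n) = P(fewer than h heads in the first n-1 tosses).
    return sum(comb(n - 1, k) for k in range(h)) / 2**(n - 1)

print(p_blue())   # 299/4096 ≈ 0.073 → not significant at alpha = 0.05
print(p_red())    # 67/2048  ≈ 0.033 → significant at alpha = 0.05
```

Identical data, opposite conclusions at α = 0.05: the p-value is not a property of the data alone.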

SLIDE 4

The significance of significance

  • Important: A 2-sigma result does not wrongly reject the null hypothesis 5% of the time: at least 29% of 2-sigma results are wrong!

  • Take an equal mixture of H0, H1
  • Simulate data, perform hypothesis testing for H0
  • Select results rejecting H0 at (or within a small range from) the 1−α confidence level (this is Fisher's prescription)
  • What fraction of those results actually came from H0 ("true nulls", which should not have been rejected)?

Recommended reading: Sellke, Bayarri & Berger, The American Statistician, 55, 1 (2001)
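The "at least 29%" figure follows from the Sellke et al. calibration bound on the Bayes factor. A short sketch (the equal prior odds are the slide's assumption):

```python
import math

def min_true_null_fraction(p, prior_h0=0.5):
    # Lower bound on the Bayes factor in favour of the null (Sellke+01),
    # valid for p < 1/e:
    b01 = -math.e * p * math.log(p)
    # Implied minimum posterior probability of H0, i.e. the minimum fraction
    # of rejections at this p-value that are in fact true nulls:
    prior_odds = prior_h0 / (1 - prior_h0)
    odds = prior_odds * b01
    return odds / (1 + odds)

print(round(min_true_null_fraction(0.05), 3))   # 0.289 → "at least 29%"
```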

SLIDE 5

Bayesian model comparison

SLIDE 6

The 3 levels of inference

LEVEL 1 (parameter inference): I have selected a model M and prior P(θ|M). What are the favoured values of the parameters? (Assumes M is true.)

LEVEL 2 (model comparison): Actually, there are several possible models: M0, M1, ... What is the relative plausibility of M0, M1, ... in light of the data?

  Posterior odds = P(M0|d) / P(M1|d)

LEVEL 3 (model averaging): None of the models is clearly the best. What is the inference on the parameters, accounting for model uncertainty?

  P(θ|d) = Σi P(Mi|d) P(θ|d, Mi)

Bayes' theorem at the parameter level reads:

  P(θ|d, M) = P(d|θ, M) P(θ|M) / P(d|M)

SLIDE 7

Examples of model comparison questions

Many scientific questions are of the model comparison type.

ASTROPHYSICS: Exoplanet detection. Is there a line in this spectrum? Is there a source in this image?

COSMOLOGY: Is the Universe flat? Does dark energy evolve? Are there anomalies in the CMB? Which inflationary model is 'best'? Is there evidence for modified gravity? Are the initial conditions adiabatic?

ASTROPARTICLE: Gravitational wave detection. Do cosmic rays correlate with AGNs? Which SUSY model is 'best'? Is there evidence for DM modulation? Is there a DM signal in gamma-ray/neutrino data?

SLIDE 8

Level 2: model comparison

The Bayesian evidence (or model likelihood) is the integral of the likelihood over the prior:

  P(d|M) = ∫_Ω dθ P(d|θ, M) P(θ|M)

It is the normalising constant of Bayes' theorem at the parameter level:

  P(θ|d, M) = P(d|θ, M) P(θ|M) / P(d|M)

Bayes' theorem delivers the model's posterior:

  P(M|d) = P(d|M) P(M) / P(d)

When we are comparing two models:

  P(M0|d) / P(M1|d) = [P(d|M0) / P(d|M1)] × [P(M0) / P(M1)]

  Posterior odds = Bayes factor × prior odds

The Bayes factor:

  B01 ≡ P(d|M0) / P(d|M1)
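The definitions above can be made concrete with a toy calculation. This is a hypothetical single-datum example, not from the slides: a measurement d_obs with Gaussian likelihood, comparing a fixed-parameter model against one with a free parameter and a uniform prior.

```python
import numpy as np

def bayes_factor(d_obs=1.5, sigma=1.0, prior_width=10.0, n_grid=20001):
    # Single measurement d_obs with likelihood L(theta) = N(d_obs; theta, sigma^2).
    # M0: theta = 0 fixed (no free parameter); M1: theta ~ U(-W/2, W/2).
    theta = np.linspace(-prior_width / 2, prior_width / 2, n_grid)
    like = np.exp(-0.5 * ((d_obs - theta) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    prior = 1.0 / prior_width
    # P(d|M1) = integral of likelihood x prior (simple Riemann sum)
    evidence_m1 = np.sum(like * prior) * (theta[1] - theta[0])
    # P(d|M0) = likelihood at the fixed value theta = 0
    evidence_m0 = np.exp(-0.5 * (d_obs / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return evidence_m0 / evidence_m1   # B01

print(bayes_factor())   # ≈ 1.30: M0 slightly favoured despite a 1.5-sigma offset
```

Even though θ = 0 sits 1.5σ from the data, B01 > 1: the wide prior dilutes M1's predictiveness, which is the Occam's razor effect discussed two slides below.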

SLIDE 9

Scale for the strength of evidence

  • A (slightly modified) Jeffreys’ scale to assess the strength of evidence

  |lnB|    relative odds   favoured model's probability   interpretation
  < 1.0    < 3:1           < 0.750                        not worth mentioning
  < 2.5    < 12:1          < 0.923                        weak
  < 5.0    < 150:1         < 0.993                        moderate
  > 5.0    > 150:1         > 0.993                        strong

SLIDE 10

Bayesian model comparison of 193 models, with Higgs inflation as the reference model (models range from disfavoured to favoured).

Martin, RT+14

SLIDE 11

An automatic Occam’s razor

  • Bayes factor balances quality of fit vs extra model complexity.
  • It rewards highly predictive models, penalizing “wasted” parameter space

With Δθ the prior width and δθ the width of the likelihood around its peak θ̂, the Occam's factor emerges from a Laplace-type approximation of the evidence:

  P(d|M) = ∫ dθ L(θ) P(θ|M) ≈ P(θ̂|M) δθ L(θ̂) ≈ (δθ/Δθ) L(θ̂)

The ratio δθ/Δθ ≤ 1 is the Occam's factor: the fraction of the prior volume that survives the arrival of the data.
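The Laplace-type approximation above can be checked numerically. A minimal sketch for a Gaussian likelihood and a uniform prior, identifying the effective likelihood width δθ with √(2π)·σ (an assumption about the convention; with it, the approximation is exact for a Gaussian):

```python
import numpy as np

# Gaussian likelihood of width sigma around theta_hat, uniform prior of width W
sigma, theta_hat, W = 0.5, 0.0, 20.0
L_max = 1.0

theta = np.linspace(-W / 2, W / 2, 200_001)
L = L_max * np.exp(-0.5 * ((theta - theta_hat) / sigma) ** 2)
evidence_exact = np.sum(L / W) * (theta[1] - theta[0])   # ∫ dθ L(θ) P(θ|M), P = 1/W

delta_theta = np.sqrt(2 * np.pi) * sigma    # effective likelihood width δθ
evidence_approx = L_max * delta_theta / W   # L(θ̂) × Occam's factor δθ/Δθ

print(evidence_exact, evidence_approx)   # both ≈ 0.0627
```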

SLIDE 12

The evidence as predictive probability

  • The evidence can be understood as a function of d, giving the predictive probability of the data under the model M.

[Figure: P(d|M) vs d for a simpler model M0 (narrow, tall) and a more complex model M1 (wide, shallow), with the observed value d_obs marked.]

SLIDE 13

Simple example: nested models

  • This happens often in practice: we have a more complex model M1 with prior P(θ|M1), which reduces to a simpler model M0 for a certain value of the parameter, e.g. θ = θ* = 0 (nested models).

  • Is the extra complexity of M1 warranted by the data?

[Figure: prior of width Δθ and likelihood of width δθ peaked at θ̂, with the nested value θ* = 0 marked.]

SLIDE 14

Simple example: nested models

Define the number of sigmas of discrepancy between the maximum likelihood value θ̂ and the nested value θ*:

  λ ≡ (θ̂ − θ*) / δθ

For "informative" data:

  ln B01 ≈ ln(Δθ/δθ) − λ²/2

The first term measures the wasted parameter space (it favours the simpler model); the second term measures the mismatch of the prediction θ* with the observed data (it favours the more complex model).
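The nested-Gaussian formula ln B01 ≈ ln(Δθ/δθ) − λ²/2 can be verified against a direct evidence computation. A sketch, assuming (as in the check two slides earlier) that δθ absorbs the √(2π) factor of the Gaussian width:

```python
import numpy as np

def lnB01_exact(lam, sigma=0.1, W=10.0):
    # Gaussian likelihood of width sigma, peaked lam posterior-sigmas away
    # from the nested value theta* = 0; M1 prior: theta ~ U(-W/2, W/2)
    theta = np.linspace(-W / 2, W / 2, 400_001)
    L = np.exp(-0.5 * ((theta - lam * sigma) / sigma) ** 2)
    ev1 = np.sum(L / W) * (theta[1] - theta[0])    # P(d|M1)
    ev0 = np.exp(-0.5 * lam**2)                    # L(theta*) = P(d|M0)
    return np.log(ev0 / ev1)

lam, sigma, W = 2.0, 0.1, 10.0
approx = np.log(W / (np.sqrt(2 * np.pi) * sigma)) - lam**2 / 2
print(lnB01_exact(lam), approx)   # both ≈ 1.69
```

Note that a 2σ discrepancy (λ = 2) still yields ln B01 > 0 here: the wasted-parameter-space term outweighs the mismatch term.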

SLIDE 15

The rough guide to model comparison

[Figure from Trotta (2008): ln B01 as a function of the information content I10 ≡ log10(Δθ/δθ). A wider prior (fixed data), or a larger sample at fixed prior and significance (e.g. WMAP1 → WMAP3 → Planck), increases I10 and strengthens the Occam's razor effect. Here Δθ = prior width, δθ = likelihood width.]

SLIDE 16

“Prior-free” evidence bounds

  • What if we do not know how to set the prior? For nested models, we can still choose the prior that maximises the support for the more complex model, which yields an upper bound on the evidence in favour of Model 1.

SLIDE 17

Maximum evidence for a detection

  • The absolute upper bound: put all the prior mass for the alternative onto the observed maximum-likelihood value. Then the Bayes factor in favour of the extra parameter satisfies

  B10 ≤ exp(χ²/2)

  • A more reasonable class of priors: symmetric and unimodal around Ψ = 0. Then (with α the significance level)

  B10 ≤ −1 / (e α ln α)

If the upper bound is small, no choice of prior within the class will make the extra parameter significant.

Sellke, Bayarri & Berger, The American Statistician, 55, 1 (2001)
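Both bounds are easy to evaluate; the sketch below reproduces (approximately) the table on the next slide, which rounds the number of sigmas to 2, 3 and 3.6:

```python
import math
from statistics import NormalDist

def evidence_bounds(alpha):
    # Two-sided significance level alpha → number of sigmas
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lnB_absolute = z**2 / 2                 # ln of exp(chi2/2), with chi2 = z^2
    lnB_reasonable = math.log(-1 / (math.e * alpha * math.log(alpha)))
    return z, lnB_absolute, lnB_reasonable

for alpha in (0.05, 0.003, 0.0003):
    z, a, r = evidence_bounds(alpha)
    print(f"alpha={alpha}: {z:.1f} sigma, absolute lnB <= {a:.2f}, reasonable lnB <= {r:.2f}")
```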

SLIDE 18

How to interpret the “number of sigma’s”

  α        sigma   absolute bound on lnB (B)   "reasonable" bound on lnB (B)
  0.05     2       2.0  (7:1)    weak          0.9  (3:1)    undecided
  0.003    3       4.5  (90:1)   moderate      3.0  (21:1)   moderate
  0.0003   3.6     6.48 (650:1)  strong        5.0  (150:1)  strong

SLIDE 19

How to assess p-values

Rule of thumb: interpret an n-sigma result as an (n−1)-sigma result

Sellke, Bayarri & Berger, The American Statistician, 55, 1 (2001)

SLIDE 20

Computing the model likelihood

  • Usually computationally demanding: it's a multi-dimensional integral, averaging the likelihood over the (possibly much wider) prior:

  P(d|M) = ∫_Ω dθ P(d|θ, M) P(θ|M)

  • I'll present two methods used by cosmologists:

  • Savage-Dickey density ratio (Dickey 1971): gives the Bayes factor B01 ≡ P(d|M0)/P(d|M1) between nested models (under mild conditions). Can usually be derived from posterior samples of the larger (higher-dimensional) model.

  • Nested sampling (Skilling 2004): transforms the D-dimensional integral into a 1D integration. Can be used generally (within the limitations of the efficiency of the sampling method adopted).

SLIDE 21

The Savage-Dickey density ratio

  • This method works for nested models and gives the Bayes factor analytically.
  • Assumptions:
  • Nested models: M1 with parameters (Ψ, ω) reduces to M0 for e.g. ω = ω⋆
  • Separable priors: in M1, the prior on the extra parameter is independent of the prior on the common parameters, i.e. π1(Ψ, ω|M1) = π1(ω) π0(Ψ)
  • Result: the Bayes factor is the ratio of the normalised (1D) marginal posterior of the additional parameter in M1 to its prior, both evaluated at the value for which M1 reduces to M0:

  B01 = p(ω⋆|d) / π1(ω⋆)

Dickey J. M., 1971, Ann. Math. Stat., 42, 204

SLIDE 22

Derivation of the SDDR

The two model likelihoods are:

  P(d|M1) = ∫ dΨ dω π1(Ψ, ω) p(d|Ψ, ω)
  P(d|M0) = ∫ dΨ π0(Ψ) p(d|Ψ, ω⋆)

so that:

  B01 = [∫ dΨ π0(Ψ) p(d|Ψ, ω⋆)] / P(d|M1)

Since:

  p(ω⋆, Ψ|d) = p(d|ω⋆, Ψ) π1(ω⋆, Ψ) / P(d|M1)

we can write:

  B01 = ∫ dΨ π0(Ψ) p(ω⋆, Ψ|d) / π1(ω⋆, Ψ)

Divide and multiply by p(ω⋆|d), using p(ω⋆, Ψ|d) = p(ω⋆|d) p(Ψ|ω⋆, d):

  B01 = p(ω⋆|d) ∫ dΨ π0(Ψ) p(Ψ|ω⋆, d) / π1(ω⋆, Ψ)

Assuming separable priors, π1(ω, Ψ) = π1(ω) π0(Ψ):

  B01 = [p(ω⋆|d) / π1(ω⋆)] ∫ dΨ p(Ψ|ω⋆, d) = p(ω⋆|d) / π1(ω⋆)

RT, Mon. Not. Roy. Astron. Soc. 378 (2007) 72-82
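In practice the SDDR is evaluated from posterior samples. A hypothetical Gaussian toy (not one of the slides' examples): the marginal posterior of ω is known analytically, so the sample-based estimate can be compared to the exact answer.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy: under M1, the marginal posterior of the extra parameter
# omega is N(1.0, 0.5^2); its (separable) prior is U(-5, 5); nested value omega* = 0.
omega_star = 0.0
prior_density = 1.0 / 10.0
samples = rng.normal(1.0, 0.5, 200_000)      # stand-in for MCMC posterior samples

# SDDR: B01 = p(omega*|d) / pi1(omega*); estimate the numerator from the
# samples with a narrow histogram bin at omega* (a KDE would also work).
bin_width = 0.05
p_at_star = np.mean(np.abs(samples - omega_star) < bin_width / 2) / bin_width
B01 = p_at_star / prior_density

# Analytic value for this Gaussian toy, for comparison:
B01_exact = (np.exp(-0.5 * (1.0 / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))) / prior_density
print(B01, B01_exact)   # both ≈ 1.08
```

The accuracy is limited by how well the samples resolve the posterior density at ω⋆, which is exactly the caveat raised on the next slide.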

SLIDE 23

SDDR: Some comments

  • For separable priors (and nested models), the common parameters do not matter for the value of the Bayes factor
  • No need to spend time/resources averaging the likelihood over the common parameters
  • The role of the prior on the additional parameter is clarified: the wider it is, the stronger the Occam's razor effect (due to the dilution of the predictive power of model 1)
  • Sensitivity analysis is simplified: only the prior/scale on the additional parameter between the models needs to be considered
  • Notice: the SDDR does not assume Gaussianity, but it does require sufficiently detailed sampling of the posterior to evaluate its value at ω = ω⋆ reliably

[Figure: examples of posterior sampling near ω = ω⋆, illustrating when the SDDR estimate is good, bad, or questionable.]

SLIDE 24

Accuracy tests (Normal case)

  • Tests with variable dimensionality (D) and number of MCMC samples
  • λ is the distance of the posterior peak from ω⋆ in units of the posterior standard deviation:

  λ = (ω_ML − ω⋆)/σ

  • The SDDR is accurate with standard MCMC sampling up to 20-D and λ = 3
  • Accurate estimates further into the tails may require dedicated sampling schemes

RT, MNRAS, 378, 72-82 (2007)

SLIDE 25

Nested Sampling

  • Proposed by John Skilling in 2004: the idea is to convert a D-dimensional integral into a 1D integral that can be done easily.
  • As a by-product, it also produces posterior samples: the model likelihood and parameter inference are obtained simultaneously.

[Figure (Mukherjee+06): X = prior fraction; L(X) = likelihood value of the iso-likelihood contour enclosing prior fraction X.]

SLIDE 26

Nested Sampling basics

Define X(λ) as the prior mass associated with likelihood values above λ:

  X(λ) = ∫_{L(θ)>λ} P(θ) dθ,   with X(0) = 1 and X(Lmax) = 0

This is a decreasing function of λ: dX is the prior mass associated with likelihoods in [λ, λ+dλ]. An infinitesimal interval dX contributes λ dX to the evidence, so that:

  P(d) = ∫ dθ L(θ) P(θ) = ∫_0^1 L(X) dX

where L(X) is the inverse of X(λ).

Skilling, AIP Conf. Proc. 735, 395 (2004); doi: 10.1063/1.1835238

SLIDE 27

Nested Sampling basics (continued)

Suppose that we can evaluate Lj = L(Xj) for a sequence:

  0 < Xm < · · · < X2 < X1 < 1

Then the model likelihood P(d) can be estimated numerically as:

  P(d) ≈ Σ_{j=1}^{m} wj Lj

with a suitable set of weights, e.g. for the trapezium rule:

  wj = (X_{j−1} − X_{j+1}) / 2
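The scheme above can be sketched end-to-end in a few lines. This is a toy 1-D implementation under stated simplifications: deterministic shrinkage X_i ≈ e^(−i/N), simple (non-trapezium) weights, and a bounding-interval replacement step standing in for the ellipsoidal sampling discussed on the following slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy problem: prior U(-5, 5); Gaussian likelihood of width 0.1 around 0.
# Analytic evidence: sqrt(2*pi)*0.1/10 ≈ 0.02507
def loglike(theta):
    return -0.5 * (theta / 0.1) ** 2

N = 400                                    # number of live points
live = rng.uniform(-5, 5, N)
logL = loglike(live)
logZ, X_prev = -np.inf, 1.0

for i in range(1, 4001):
    worst = int(np.argmin(logL))
    X = np.exp(-i / N)                     # deterministic shrinkage estimate
    w = X_prev - X                         # simple weight (trapezium rule also fine)
    logZ = np.logaddexp(logZ, np.log(w) + logL[worst])
    # Replace the worst point: draw uniformly from a slightly enlarged interval
    # bounding the live points, rejecting until L > L_worst (1-D stand-in for
    # the ellipsoidal sampling step; the leftover live-point contribution to Z
    # is negligible here and is dropped).
    Lmin = logL[worst]
    lo, hi = live.min(), live.max()
    pad = 0.1 * (hi - lo)
    while True:
        cand = rng.uniform(lo - pad, hi + pad)
        if loglike(cand) > Lmin:
            break
    live[worst], logL[worst] = cand, loglike(cand)
    X_prev = X

print(np.exp(logZ))   # analytic value: 0.02507
```

The scatter of the ln Z estimate scales roughly as √(H/N), with H the information gain, so more live points buy accuracy.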

SLIDE 28

Nested Sampling in Action

(Animation courtesy of David Parkinson: a 2D parameter space with uniform priors is sampled progressively, while the 1D model likelihood integral is accumulated.)

  P(d) = ∫ dθ L(θ) P(θ) = ∫_0^1 L(X) dX   (X = prior fraction)

SLIDE 29

MultiNest sampling approach

(Slide courtesy of Mike Hobson)

Hard!

SLIDE 30

Nested Sampling: Sampling Step

  • The hardest part is to sample uniformly from the prior, subject to the hard constraint that the likelihood needs to be above a certain level.

  • Many specific implementations of this sampling step:
  • Single ellipsoidal sampling (Mukherjee+06)
  • Metropolis nested sampling (Sivia&Skilling06)
  • Clustered and simultaneous ellipsoidal sampling (Shaw+07)
  • Ellipsoidal sampling with k-means (Feroz&Hobson08)
  • Rejection sampling (MultiNest, Feroz&Hobson09)
  • Diffusion nested sampling (Brewer+09)
  • Artificial neural networks (Graff+12)
  • Galilean Sampling (Betancourt11; Feroz&Skilling13)
  • Simultaneous ellipsoidal sampling with X-means (DIAMONDS, Corsaro&deRidder14)

  • Slice Sampling Nested Sampling (PolyChord, Handley+15)
  • … there will be others, no doubt.
SLIDE 31

Sampling Step: Ellipsoid Fit

  • Simple MCMC (e.g. Metropolis-Hastings) works but can be inefficient
  • Mukherjee+06: take advantage of the existing live points. Fit an ellipsoid to the live points, enlarge it sufficiently (to account for a non-ellipsoidal shape), then sample from it using an exact method.
  • This works, but is problematic/inefficient for multi-modal likelihoods and/or strong, non-linear degeneracies between parameters.

SLIDE 32

Sampling Step: Multimodal Sampling

  • Feroz&Hobson08; Feroz+08: at each nested sampling iteration
  • Partition the active points into clusters
  • Construct ellipsoidal bounds for each cluster
  • Determine the ellipsoid overlaps
  • Remove the point with the lowest Li from the active points; increment the evidence
  • Pick an ellipsoid randomly and sample a new point with L > Li, accounting for overlaps
  • Each isolated cluster gives a local evidence
  • The global evidence is the sum of the local evidences
SLIDE 33

Test: Gaussian Mixture Model

(Slide courtesy of Mike Hobson)

SLIDE 34

Test: Egg-Box Likelihood

  • A more challenging example is the egg-box likelihood:

  L(θ1, θ2) = exp[(2 + cos(θ1/2) cos(θ2/2))^5]

  • Prior: θi ∼ U(0, 10π) (i = 1, 2)

(Animation: Farhan Feroz; likelihood sampling with 30k likelihood evaluations)

  log P(d) = 235.86 ± 0.06 (analytical: 235.88)

SLIDE 35

Test: Multiple Gaussian Shells

(Courtesy Mike Hobson)

  D    Nlike     Efficiency
  2    7,000     70%
  5    18,000    51%
  10   53,000    34%
  20   255,000   15%
  30   753,000   8%

SLIDE 36

Aside: Posterior Samples

  • Samples from the posterior can be extracted as a (free) by-product: take the sequence of sampled points θj and weight sample j by pj = Lj wj / P(d)
  • MultiNest has only 2 tuning parameters: the number of live points and the tolerance for the stopping criterion (stop if Lmax Xi < tol × P(d), where tol is the tolerance)
  • It can be used (and routinely is used) as a fool-proof inference black box: no need to tune e.g. a proposal distribution as in conventional MCMC

Multi-modal marginal posterior distributions in an 8D supersymmetric model, sampled with MultiNest (Feroz, RT+11)

SLIDE 37

Aside: Profile Likelihood

  • With a higher number of live points and a smaller tolerance (plus keeping all discarded samples), MultiNest also delivers good profile likelihood estimates (Feroz, RT+11):

  L(θ1) = max_{θ2} L(θ1, θ2)

8D Gaussian Mixture Model: profile likelihood

SLIDE 38

Parallelisation and Efficiency

  • The sampling efficiency is less than unity, since the ellipsoidal approximation to the iso-likelihood contour is imperfect and ellipsoids may overlap
  • Parallel solution: at each attempt to draw a replacement point, draw NCPU candidates, with the optimal number of CPUs given by NCPU = 1/efficiency
  • Limitations:
  • Performance improvement plateaus for NCPU >> 1/efficiency
  • For D >> 30, a small error in the ellipsoidal decomposition entails a large drop in efficiency, as most of the volume is near the surface
  • MultiNest is thus (fundamentally) limited to D <= 30 dimensions
SLIDE 39

Neural Network Acceleration

  • A relatively straightforward idea: use MultiNest's discarded samples to train on-line a multi-layer Neural Network (NN) to learn the likelihood function.
  • Periodically test the accuracy of predictions: when the NN is ready, replace (possibly expensive) likelihood calls with (fast) NN predictions.
  • SkyNet: a feed-forward NN with N hidden layers, each with Mn nodes.
  • BAMBI (Blind Accelerated Multimodal Bayesian Inference): SkyNet integration with MultiNest.
  • In cosmological applications, BAMBI typically accelerates the model likelihood computation by ~30%: useful, but not a game-changer.
  • Further use of the resulting trained network (e.g. with different priors) delivers speed increases of a factor of 4 to 50 (limited by the error-prediction calculation time).

Graff+12 (BAMBI) and Graff+14 (SkyNet); Johannesson,RT+16

SLIDE 40

PolyChord: Nested Sampling in high-D

  • A new sampling-step scheme is required to beat the limitations of the ellipsoidal decomposition at the heart of MultiNest
  • Slice Sampling (Neal00) in 1D:
  • Slice: all points with L(x) > L0
  • From a starting point x0, set initial bounds L/R by expanding from a parameter w
  • Draw x1 randomly from within L/R
  • If x1 is not in the slice, contract the bound down to x1 and re-sample x1

Handley et al, Mon.Not.Roy.Astron.Soc. 450 (2015)1, L61-L65
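The steps above can be sketched directly. A minimal 1-D implementation of the stepping-out/shrinkage move subject to the nested-sampling hard constraint (a sketch after Neal 2000, not PolyChord's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

def slice_sample(loglike, x0, logL0, w=1.0):
    """One 1-D slice-sampling move subject to the hard constraint L(x) > L0,
    using stepping-out then shrinkage (after Neal 2000). x0 must already
    satisfy the constraint."""
    # Position an interval of width w randomly around x0
    left = x0 - w * rng.random()
    right = left + w
    # Step out until both ends fall outside the slice
    while loglike(left) > logL0:
        left -= w
    while loglike(right) > logL0:
        right += w
    # Shrink towards x0 until a point inside the slice is found
    while True:
        x1 = rng.uniform(left, right)
        if loglike(x1) > logL0:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1

# Constraint log L(x) > -2, i.e. |x| < 2 for a unit-Gaussian log-likelihood:
ll = lambda x: -0.5 * x**2
pts, x = [], 0.5
for _ in range(1000):
    x = slice_sample(ll, x, logL0=-2.0)
    pts.append(x)
print(all(ll(p) > -2.0 for p in pts))   # True: every point obeys the constraint
```

Unlike rejection from an ellipsoid, the move needs no global bound on the constrained region, which is what makes it attractive in high dimensions.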

SLIDE 41

High-D Slice Sampling

  • A degenerate contour is transformed into a contour with dimensions of order O(1) in all directions ("whitening")
  • The linear skew transform is defined by the inverse of the Cholesky decomposition of the live points' covariance matrix
  • A direction is selected at random, then slice sampling in 1D is performed (w = 1)
  • Repeat N times, with N of order O(D), generating a new point xN decorrelated from x0

Handley+15
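The whitening transform itself is one line of linear algebra. A sketch on hypothetical "live points" drawn from a strongly degenerate Gaussian:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical "live points" drawn from a strongly degenerate 2-D Gaussian
cov = np.array([[1.0, 0.99], [0.99, 1.0]])
live = rng.multivariate_normal([0.0, 0.0], cov, size=500)

# Whitening: apply the inverse of the Cholesky factor of the live points'
# empirical covariance, so the contour becomes O(1) in all directions
L = np.linalg.cholesky(np.cov(live.T))
white = live @ np.linalg.inv(L).T

print(np.cov(white.T).round(2))   # ≈ identity matrix
```

If cov(x) = LLᵀ, then y = L⁻¹x has cov(y) = L⁻¹(LLᵀ)L⁻ᵀ = I, so the whitened empirical covariance is the identity by construction.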

SLIDE 42

PolyChord: Performance

  • PolyChord's number of likelihood evaluations scales at worst as O(D³), as opposed to exponential in high-D for MultiNest

SLIDE 43

Information criteria

  • Several information criteria exist for approximate model comparison, with
    k = number of fitted parameters,
    N = number of data points,
    −2 ln(Lmax) = best-fit chi-squared.
  • Akaike Information Criterion (AIC): AIC = −2 ln(Lmax) + 2k
  • Bayesian Information Criterion (BIC): BIC = −2 ln(Lmax) + k ln(N)
  • Deviance Information Criterion (DIC): penalises via the effective number of parameters (the Bayesian complexity, see later) instead of k
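A quick sketch of how AIC and BIC can disagree; the numbers are a hypothetical comparison, not from the slides:

```python
import numpy as np

def aic(lnL_max, k):
    return -2 * lnL_max + 2 * k           # Akaike

def bic(lnL_max, k, N):
    return -2 * lnL_max + k * np.log(N)   # Schwarz / Bayesian

# Hypothetical comparison: two nested fits to N = 100 points, where the
# extra parameter of model B improves the best-fit chi-squared by 3
# (-2 ln L_max = chi-squared for Gaussian errors).
N = 100
print(aic(-110.0 / 2, 2), aic(-107.0 / 2, 3))        # 114.0 vs 113.0 → AIC prefers B
print(bic(-110.0 / 2, 2, N), bic(-107.0 / 2, 3, N))  # ≈ 119.2 vs 120.8 → BIC prefers A
```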
SLIDE 44

Notes on information criteria

  • The best model is the one which minimizes the AIC/BIC/DIC
  • Warning: AIC and BIC penalize models differently as a function of the number of data points N. For N > 7, BIC imposes a stronger penalty than AIC on models with a larger number of free parameters k.
  • BIC is an approximation to the full Bayesian evidence, with a default Gaussian prior equivalent to 1/N-th of the data in the large-N limit.
  • DIC takes into account whether parameters are actually measured or not (via the Bayesian complexity, see later).
  • When possible, computation of the Bayesian evidence (with explicit prior specification) is preferable.

SLIDE 45

A “simple” example: how many sources?

Feroz and Hobson (2007)

Signal + Noise

SLIDE 46

A “simple” example: how many sources?

Feroz and Hobson (2007)

Signal: 8 sources

SLIDE 47

A “simple” example: how many sources?

Feroz and Hobson (2007)

Bayesian reconstruction

7 out of 8 objects were correctly identified; the mistake happens because 2 objects are very close together.

SLIDE 48

Cluster detection from Sunyaev-Zeldovich effect in cosmic microwave background maps

Left: background + 3 point radio sources. Right: background + 3 point radio sources + cluster (~2 deg).

Feroz et al 2009

SLIDE 49

Background + 3 point radio sources vs. background + 3 point radio sources + cluster.

Bayesian model comparison:

  R = P(cluster | data) / P(no cluster | data)

For the map without and with the cluster: R = 0.35 ± 0.05 and R ~ 10^33, respectively.

Cluster parameters are also recovered (position, temperature, profile, etc.)

SLIDE 50

The cosmological concordance model

lnB < 0 favours ΛCDM. From Trotta (2008).

SLIDE 51

Model complexity

  • "Number of free parameters" is a relative concept: the relevant scale is set by the prior range
  • How many parameters can the data support, regardless of whether their detection is significant?
  • The Bayesian complexity, or effective number of parameters, is the mean posterior chi-squared minus the chi-squared at the posterior mean:

  C_b = <χ²(θ)> − χ²(<θ>)

Kunz, RT & Parkinson, astro-ph/0602378, Phys. Rev. D 74, 023503 (2006), following Spiegelhalter et al (2002)
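The effective number of parameters is easy to estimate from posterior samples. A hypothetical Gaussian toy in which the single parameter is well measured, so C_b should come out close to 1:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy: 50 data points ~ N(mu, 1); flat prior on mu → posterior mu ~ N(xbar, 1/50)
data = rng.normal(0.5, 1.0, 50)
xbar = data.mean()
post = rng.normal(xbar, 1.0 / np.sqrt(len(data)), 50_000)  # posterior samples of mu

def chi2(mu):
    # Chi-squared of the data as a function of mu (unit variance)
    return np.sum((data[None, :] - np.atleast_1d(mu)[:, None]) ** 2, axis=1)

# Bayesian complexity: C_b = <chi2(theta)>_posterior - chi2(posterior mean)
C_b = chi2(post).mean() - chi2(np.array([post.mean()]))[0]
print(C_b)   # ≈ 1: the single parameter mu is well constrained by the data
```

With weaker data the posterior would fill the prior, and C_b would drop below the nominal parameter count: exactly the behaviour exploited on the next slides.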

SLIDE 52

Polynomial fitting

  • Data generated from a model with n = 6:

GOOD DATA: maximum supported complexity ~ 9
INSUFFICIENT DATA: maximum supported complexity ~ 4

SLIDE 53

How many parameters does the CMB need?

With WMAP3+HST (WMAP5 is qualitatively the same), the evidence ratio shows that ns and τ are measured & favoured, while Ωκ is measured & unnecessary: 7 parameters are measured, but only 6 are sufficient.

SLIDE 54

Bayesian Model-averaging

Model-averaged inferences (Liddle et al 2007):

  P(θ|d) = Σi P(θ|d, Mi) P(Mi|d)

An application to dark energy:
SLIDE 55

Key points

  • Bayesian model comparison extends parameter inference to the space of models
  • The Bayesian evidence (model likelihood) represents the change in the degree of belief in the model after we have seen the data
  • Models are rewarded for their predictivity (automatic Occam's razor)
  • For model comparison, prior specification is a key ingredient of the model-building step. If the prior cannot be meaningfully set, then the physics in the model is probably not good enough.
  • The Bayesian complexity can help (together with the Bayesian evidence) in assessing model performance.