SLIDE 1

Bayesian inference: what it means and why we care

Robin J. Ryder

Centre de Recherche en Mathématiques de la Décision, Université Paris-Dauphine

6 November 2017 Mathematical Coffees

SLIDE 2

The aim of Statistics

In Statistics, we generally care about inferring information about an unknown parameter θ. For instance, we observe X1, . . . , Xn ∼ N(θ, 1) and wish to:

- Obtain a (point) estimate θ̂ of θ, e.g. θ̂ = 1.3.
- Measure the uncertainty of our estimator, by obtaining an interval or region of plausible values, e.g. [0.9, 1.5] is a 95% confidence interval for θ.
- Perform model choice/hypothesis testing, e.g. decide between H0 : θ = 0 and H1 : θ ≠ 0, or between H0 : Xi ∼ N(θ, 1) and H1 : Xi ∼ E(θ).
- Use this inference in postprocessing: prediction, decision-making, input of another model...

SLIDE 3

Why be Bayesian?

Some application areas make heavy use of Bayesian inference, because:

- The models are complex
- Estimating uncertainty is paramount
- The output of one model is used as the input of another
- We are interested in complex functions of our parameters

SLIDE 4

Frequentist statistics

Statistical inference deals with estimating an unknown parameter θ given some data D.

In the frequentist view of statistics, θ has a true, fixed (deterministic) value. Uncertainty is measured by confidence intervals, which are not intuitive to interpret: if I get a 95% CI of [80 ; 120] (i.e. 100 ± 20) for θ, I cannot say that there is a 95% probability that θ belongs to the interval [80 ; 120].

Frequentist statistics often use the maximum likelihood estimator: for which value of θ would the data be most likely (under our model)?

L(θ|D) = P[D|θ]
θ̂ = arg max_θ L(θ|D)

SLIDE 5

Bayes’ rule

Recall Bayes’ rule: for two events A and B, we have

P[A|B] = P[B|A] P[A] / P[B].

Alternatively, with marginal and conditional densities:

π(y|x) = π(x|y) π(y) / π(x).

SLIDE 6

Bayesian statistics

In the Bayesian framework, the parameter θ is seen as inherently random: it has a distribution.

Before I see any data, I have a prior distribution π(θ) on θ, usually uninformative. Once I take the data into account, I get a posterior distribution, which is hopefully more informative.

By Bayes’ rule,

π(θ|D) = π(D|θ) π(θ) / π(D).

By definition, π(D|θ) = L(θ|D). The quantity π(D) is a normalizing constant with respect to θ, so we usually do not include it and write instead

π(θ|D) ∝ π(θ) L(θ|D).
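A minimal numerical sketch of this relation (not from the talk; model, prior and names are illustrative): compute prior × likelihood on a grid for the Gaussian-mean example of slide 2, then divide by the numerical value of π(D) so that the result integrates to 1.

```python
# Minimal grid illustration of  posterior ∝ prior × likelihood  (illustrative sketch).
# Model: X_1,...,X_n ~ N(theta, 1), prior theta ~ N(0, 10^2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(1.3, 1.0, size=20)            # simulated data with true theta = 1.3

theta = np.linspace(-5, 5, 2001)                # grid of parameter values
prior = stats.norm.pdf(theta, loc=0, scale=10)
loglik = np.array([stats.norm.logpdf(data, loc=t, scale=1).sum() for t in theta])
unnorm = prior * np.exp(loglik - loglik.max())  # prior × likelihood, up to a constant

# pi(D) is just the normalizing constant: divide so the density integrates to 1
posterior = unnorm / np.trapz(unnorm, theta)
print("posterior mean ≈", np.trapz(theta * posterior, theta))
```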

SLIDE 7

Bayesian statistics

π(θ|D) ∝ π(θ) L(θ|D)

- Different people have different priors, hence different posteriors. But with enough data, the choice of prior matters little.
- We are now allowed to make probability statements about θ, such as “there is a 95% probability that θ belongs to the interval [78 ; 119]” (credible interval).
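As a hedged illustration of reading off a credible interval (the numbers below use the Sn = 72, n = 100 coin example from later slides, not the [78 ; 119] interval quoted above): with a Beta posterior, the central 95% credible interval is simply a pair of posterior quantiles.

```python
# Sketch: a 95% credible interval is read directly off the posterior (Beta posterior assumed).
from scipy import stats

n, s = 100, 72                              # illustrative Bernoulli data: 72 successes out of 100
posterior = stats.beta(1 + s, 1 + n - s)    # uniform prior -> Beta(1+S, 1+n-S) posterior
lo, hi = posterior.ppf([0.025, 0.975])
print(f"95% credible interval for theta: [{lo:.3f}, {hi:.3f}]")
# Unlike a confidence interval, this supports the statement
# "theta lies in [lo, hi] with posterior probability 0.95".
```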

SLIDE 8

Advantages and drawbacks of Bayesian statistics

Advantages:
- More intuitive interpretation of the results
- Easier to think about uncertainty
- In a hierarchical setting, it becomes easier to take into account all the sources of variability

Drawbacks:
- Prior specification: need to check that changing your prior does not change your result
- Computationally intensive

SLIDE 9

Example: Bernoulli

Take Xi ∼ Bernoulli(θ), i.e.

P[Xi = 1] = θ,  P[Xi = 0] = 1 − θ.

Possible prior: θ ∼ U([0, 1]): π(θ) = 1 for 0 ≤ θ ≤ 1.

Likelihood:

L(θ|Xi) = θ^Xi (1 − θ)^(1−Xi)
L(θ|X1, . . . , Xn) = θ^(Σ Xi) (1 − θ)^(n − Σ Xi) = θ^Sn (1 − θ)^(n−Sn)

Posterior, with Sn = Σ_{i=1}^n Xi:

π(θ|X1, . . . , Xn) ∝ 1 · θ^Sn (1 − θ)^(n−Sn)

We can compute the normalizing constant analytically:

π(θ|X1, . . . , Xn) = (n + 1)! / (Sn! (n − Sn)!) · θ^Sn (1 − θ)^(n−Sn)
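A quick sanity check of this constant (a sketch, not part of the slides): the normalized posterior is the Beta(Sn + 1, n − Sn + 1) density, whose normalizing constant is exactly (n + 1)!/(Sn!(n − Sn)!).

```python
# Sketch: check that (n+1)!/(Sn!(n-Sn)!) theta^Sn (1-theta)^(n-Sn) equals the Beta(Sn+1, n-Sn+1) pdf.
from math import factorial
from scipy import stats

n, Sn, theta = 10, 7, 0.6                     # illustrative values
analytic = factorial(n + 1) / (factorial(Sn) * factorial(n - Sn)) \
           * theta**Sn * (1 - theta)**(n - Sn)
beta_pdf = stats.beta(Sn + 1, n - Sn + 1).pdf(theta)
print(analytic, beta_pdf)                      # the two numbers agree
```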

SLIDE 10

Conjugate prior

Suppose we take the prior θ ∼ Beta(α, β):

π(θ) = Γ(α + β) / (Γ(α) Γ(β)) · θ^(α−1) (1 − θ)^(β−1).

Then the posterior satisfies

π(θ|X1, . . . , Xn) ∝ θ^(α−1) (1 − θ)^(β−1) · θ^Sn (1 − θ)^(n−Sn)

hence θ|X1, . . . , Xn ∼ Beta(α + Sn, β + n − Sn).

Whatever the data, the posterior is in the same family as the prior: we say that the prior is conjugate for this model. This is very convenient mathematically.
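The conjugate update is one line of code; the sketch below (function name and data are illustrative) returns the parameters of the Beta posterior.

```python
# Sketch of the conjugate Beta-Bernoulli update (names and data are illustrative).
def beta_bernoulli_update(alpha, beta, data):
    """Posterior Beta parameters after observing 0/1 data under a Beta(alpha, beta) prior."""
    s = sum(data)                      # number of successes S_n
    return alpha + s, beta + len(data) - s

# Example: Beta(2, 2) prior, then 7 successes out of 10 -> Beta(9, 5) posterior.
print(beta_bernoulli_update(2, 2, [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]))
```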

SLIDE 11

Jeffreys prior

Another possible default prior is the Jeffreys prior, which is invariant under a change of variables. Let ℓ be the log-likelihood and I be Fisher’s information:

I(θ) = E[ (dℓ/dθ)² | X ∼ Pθ ] = −E[ d²ℓ/dθ² (θ; X) | X ∼ Pθ ].

The Jeffreys prior is defined by

π(θ) ∝ √I(θ).
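A small sketch (assuming the Bernoulli model, for which I(θ) = 1/(θ(1 − θ))) checking that √I(θ) is proportional to a Beta(1/2, 1/2) density, which is indeed the Jeffreys prior for this model.

```python
# Sketch (Bernoulli model assumed): check that sqrt(I(theta)) ∝ Beta(1/2, 1/2) density.
import numpy as np
from scipy import stats

theta = np.linspace(0.05, 0.95, 19)
fisher = 1.0 / (theta * (1 - theta))          # Bernoulli Fisher information I(theta)
jeffreys = np.sqrt(fisher)                    # unnormalized Jeffreys prior
ratio = jeffreys / stats.beta(0.5, 0.5).pdf(theta)
print(np.allclose(ratio, ratio[0]))           # constant ratio -> proportional, prints True
```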

SLIDE 12

Invariance of the Jeffreys prior

Let φ be an alternative parameterization of the model. Then the prior induced on φ by the Jeffreys prior on θ is

π(φ) = π(θ) |dθ/dφ| ∝ √( I(θ) (dθ/dφ)² ) = √( E[(dℓ/dθ)²] (dθ/dφ)² ) = √( E[(dℓ/dθ · dθ/dφ)²] ) = √( E[(dℓ/dφ)²] ) = √I(φ),

which is the Jeffreys prior on φ.

SLIDE 13

Effect of prior

Example: Bernoulli model (biased coin); θ = probability of success. We observe Sn = 72 successes out of n = 100 trials.

Frequentist estimate: θ̂ = 0.72, with 95% confidence interval [0.63, 0.81].

Bayesian estimate: will depend on the prior.

SLIDE 14

Effect of prior

[Figure: prior (black), likelihood (green) and posterior (red) under three different priors, for Sn = 72, n = 100.]

SLIDE 15

Effect of prior

[Figure: prior (black), likelihood (green) and posterior (red) under the same three priors, for Sn = 7, n = 10.]

SLIDE 16

Effect of prior

[Figure: prior (black), likelihood (green) and posterior (red) under the same three priors, for Sn = 721, n = 1000.]
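A sketch of how panels like these can be reproduced; this is not the original plotting code, and the three Beta priors are illustrative guesses, since the slides do not state their parameters.

```python
# Sketch reproducing a prior/likelihood/posterior panel (prior parameters are guesses).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

Sn, n = 72, 100
theta = np.linspace(0.001, 0.999, 500)
likelihood = stats.beta(Sn + 1, n - Sn + 1).pdf(theta)     # likelihood, rescaled to integrate to 1

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, (a, b) in zip(axes, [(1, 1), (30, 10), (2, 18)]):  # three illustrative Beta priors
    ax.plot(theta, stats.beta(a, b).pdf(theta), "k", label="prior")
    ax.plot(theta, likelihood, "g", label="likelihood")
    ax.plot(theta, stats.beta(a + Sn, b + n - Sn).pdf(theta), "r", label="posterior")
    ax.legend()
plt.show()
```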

SLIDE 17

Choosing the prior

The choice of the prior distribution can have a large impact, especially if the data set is of small to moderate size. How do we choose the prior?

- Expert knowledge of the application
- A previous experiment
- A conjugate prior, i.e. one that is convenient mathematically, with moments chosen by expert knowledge
- A non-informative prior
- ...

In all cases, best practice is to try several priors and to check whether the posteriors agree: would the data be enough to bring experts who disagreed a priori into agreement?

SLIDE 18

Example: phylogenetic tree

Example from Ryder & Nicholls (2011). Given lexical data, we wish to infer the age of the Most Recent Common Ancestor of the Indo-European languages. Two main hypotheses:

- Kurgan hypothesis: root age is 6000-6500 years Before Present (BP)
- Anatolian hypothesis: root age is 8000-9500 years BP

SLIDE 19

Example of a tree

SLIDE 20

Why be Bayesian in this setting?

- Our model is complex and the likelihood function is not pleasant
- We are interested in the marginal distribution of the root age
- Many nuisance parameters: tree topology, internal ages, evolution rates...
- We want to make sure that our inference procedure does not favour one of the two hypotheses a priori
- We will use the output as input of other models

For the root age, we choose a prior U([5000, 16000]). The prior for the other parameters is beyond the scope of this talk.

SLIDE 21

Model parameters

Parameter space is large:

- Root age R
- Tree topology and internal ages g (complex state space)
- Evolution parameters λ, µ, ρ, κ
- ...

The posterior distribution is defined by

π(R, g, λ, µ, ρ, κ|D) ∝ π(R) π(g) π(λ, µ, κ, ρ) L(R, g, λ, µ, κ, ρ|D).

We are interested in the marginal distribution of R given the data D:

π(R|D) = ∫ π(R, g, λ, µ, ρ, κ|D) dg dλ dµ dρ dκ.

SLIDE 22

Computation

This distribution is not available analytically, nor can we sample from it directly. But we can build a Markov Chain Monte Carlo scheme (see Jalal’s talk) to get a sample from the joint posterior distribution of (R, g, λ, µ, ρ, κ) given D. Then keeping only the R component gives us a sample from the marginal posterior distribution of R given D.
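The mechanism in miniature (a toy sketch, not the phylogenetic model, which is far more complex): run Metropolis-Hastings on a joint target, then keep only the coordinate of interest; the retained draws are samples from its marginal.

```python
# Toy sketch: sample a joint posterior by Metropolis-Hastings, then keep one
# coordinate to obtain its marginal. The target is a correlated 2D Gaussian,
# standing in for pi(R, g, lambda, ... | D).
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):                          # toy joint log-density in 2 dimensions
    return -0.5 * (x[0]**2 + x[1]**2 - 1.6 * x[0] * x[1]) / (1 - 0.8**2)

x = np.zeros(2)
samples = []
for _ in range(50_000):
    prop = x + rng.normal(scale=0.5, size=2)             # random-walk proposal
    if np.log(rng.uniform()) < log_target(prop) - log_target(x):
        x = prop                                          # accept
    samples.append(x[0])                                  # keep only the first coordinate

print("marginal mean and std:", np.mean(samples), np.std(samples))   # close to 0 and 1
```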

SLIDE 23

Root age: prior

[Figure: prior density for the age of Proto-Indo-European, in years Before Present, with the ranges of the Kurgan and Anatolian hypotheses marked.]

SLIDE 24

Root age: posterior

[Figure: posterior density for the age of Proto-Indo-European, in years Before Present, with the ranges of the Kurgan and Anatolian hypotheses marked and the prior shown for comparison.]

SLIDE 25

Phylogenetic tree of languages: conclusions

- Strong support for the Anatolian hypothesis; no support for the Kurgan hypothesis
- Measuring the uncertainty of the root age estimate is key
- We integrate out the uncertainty on the nuisance parameters
- This problem is much easier to handle in the Bayesian framework than in the frequentist one
- Computational aspects are complex

SLIDE 26

Air France Flight 447

This section and its figures follow Stone et al. (Statistical Science, 2014). Air France Flight 447 disappeared over the Atlantic on 1 June 2009, en route from Rio de Janeiro to Paris; all 228 people on board were killed. The first three search parties did not succeed in locating the wreckage or the flight recorders. In 2011, a fourth party was launched, based on a Bayesian search.

Figure : Flight route. Picture by Mysid, Public Domain.

SLIDE 27

Why be Bayesian?

- Many sources of uncertainty
- Subjective probabilities
- The object of interest is a distribution
- The frequentist formalism does not apply (unique event)

SLIDE 28

Previous searches

SLIDE 29

Prior based on flight dynamics

SLIDE 30

Probabilities derived from drift

SLIDE 31

Posterior

SLIDE 32

Conclusions

- Once the posterior distribution was derived, the search was organized starting with the areas of highest posterior probability
- Actually several posteriors, because several models were considered
- The wreckage was located in one week

SLIDE 33

Point estimates

Although one of the main purposes of Bayesian inference is to obtain a whole distribution, we may also need to summarize the posterior with a point estimate. Common choices:

- Posterior mean: θ̂ = ∫ θ · π(θ|D) dθ
- Maximum a posteriori (MAP): θ̂ = arg max_θ π(θ|D)
- Posterior median
- ...
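A sketch of the three summaries computed for a concrete posterior (a Beta posterior is assumed for illustration; the MAP is found on a grid).

```python
# Sketch: point estimates computed from a posterior (Beta posterior assumed for illustration).
import numpy as np
from scipy import stats

posterior = stats.beta(73, 29)              # e.g. Beta(1+72, 1+28): uniform prior, Sn=72, n=100
draws = posterior.rvs(100_000, random_state=0)

post_mean = draws.mean()                    # posterior mean (MMSE estimate)
post_median = np.median(draws)              # posterior median
grid = np.linspace(0.01, 0.99, 981)
post_map = grid[np.argmax(posterior.pdf(grid))]   # MAP: mode of the posterior density
print(post_mean, post_median, post_map)
```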

SLIDE 34

Optimality

From a frequentist point of view, the posterior expectation is optimal in a certain sense. Let θ be the true value of the parameter of interest and θ̂(X) an estimator. Then the posterior mean minimizes the expected squared error under the prior,

E_π[ (θ − θ̂(X))² ].

(Indeed, for any point estimate t, E[(θ − t)² | D] = Var(θ|D) + (t − E[θ|D])², which is minimized at t = E[θ|D].) For this reason, the posterior mean is also called the minimum mean square error (MMSE) estimator. For other loss functions, other point estimates are optimal.

SLIDE 35

2D Ising models

[Figure: (a) original image; (b) focused region of the image.]

SLIDE 36

2D Ising model

Higdon (JASA 1998)

Target density

Consider a 2D Ising model, with posterior density

π(x|y) ∝ exp( α Σ_i 𝟙[yi = xi] + β Σ_{i∼j} 𝟙[xi = xj] )

with α = 1, β = 0.7. The first term (likelihood) encourages states x which are similar to the original image y. The second term (prior) favors states x for which neighbouring pixels are equal, like a Potts model.
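A sketch of posterior exploration for this target using plain single-site Metropolis (not the Wang-Landau algorithm discussed on the next slide); the image y, the grid size and the periodic boundary conditions are illustrative assumptions.

```python
# Sketch: single-site Metropolis on pi(x|y) ∝ exp(alpha*sum 1[y_i=x_i] + beta*sum_{i~j} 1[x_i=x_j]).
# The "image" y is a random stand-in; periodic boundaries are an assumption for simplicity.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 1.0, 0.7
y = rng.integers(0, 2, size=(40, 40))        # stand-in noisy image (0/1 pixels)
x = y.copy()                                 # start the chain at the data

def local_logpost(x, i, j, value):
    """Terms of the log-posterior that involve pixel (i, j) set to `value`."""
    h, w = x.shape
    neighbours = [x[(i-1) % h, j], x[(i+1) % h, j], x[i, (j-1) % w], x[i, (j+1) % w]]
    return alpha * (value == y[i, j]) + beta * sum(value == nb for nb in neighbours)

for _ in range(200_000):                     # single-site flip proposals
    i, j = rng.integers(0, 40, size=2)
    new = 1 - x[i, j]
    log_ratio = local_logpost(x, i, j, new) - local_logpost(x, i, j, x[i, j])
    if np.log(rng.uniform()) < log_ratio:
        x[i, j] = new                        # accept the flip

print("fraction of pixels matching y:", (x == y).mean())
```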

SLIDE 37

2D Ising models: posterior exploration

Figure: Spatial model example: states explored over 500,000 iterations for the Metropolis-Hastings (top) and Wang-Landau (bottom) algorithms. Figure from Bornn et al. (JCGS 2013). See also Jacob & Ryder (AAP 2014) for more on the algorithm.

SLIDE 38

2D Ising models: posterior mean

Figure: Spatial model example: average state explored with Wang-Landau after importance sampling. Figure from Bornn et al. (JCGS 2013). See also Jacob & Ryder (AAP 2014) for more on the algorithm.

SLIDE 39

Ising model: conclusions

- Problem-specific prior
- Even with a point estimate (posterior mean), we measure uncertainty
- Computational cost is very high

SLIDE 40

Several models

Suppose we have several models m1, m2, . . . , mk. Then the model index can be viewed as a parameter. Take a uniform (or other) prior: P[M = mj] = 1/k. The posterior distribution then gives us the probability of each model given the data. We can use this for model choice (though there are other, more sophisticated, techniques), but also for estimation/prediction while integrating out the uncertainty about the model.
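A minimal sketch of the bookkeeping: with a uniform prior over k models, the posterior model probabilities are renormalized marginal likelihoods. The marginal likelihood values below are made up for illustration.

```python
# Sketch: posterior model probabilities under a uniform prior over k models.
# The log marginal likelihoods log pi(D | m_j) below are illustrative numbers.
import numpy as np

log_marginal_lik = np.array([-104.2, -101.7, -103.5])    # log pi(D | m_j), illustrative
log_prior = np.log(np.ones(3) / 3)                        # uniform prior P[M = m_j] = 1/k

log_post = log_prior + log_marginal_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()                                        # P[M = m_j | D]
print(post)
```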

SLIDE 41

Example: variable selection for linear regression

A model is a choice of covariates to include in the regression. With p covariates, there are 2^p models.

Classical (frequentist) setting:
- Select variables, using your favourite penalty, thus selecting one model
- Perform estimation and prediction within that model
- If you want error bars, you can compute them, but only within that model

Bayesian setting:
- Explore the space of all models
- Get posterior probabilities
- Compute estimation and prediction for each model (or, in practice, for those with non-negligible probability)
- Weight these estimates/predictions by the posterior probability of each model

The uncertainty about the model is thus fully taken into account.
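A sketch of this model-averaging idea on simulated data. Since the slide does not say how marginal likelihoods are computed, the sketch uses the BIC approximation exp(−BIC/2) as a stand-in weight, which is a crude but common choice; all data and names are illustrative.

```python
# Sketch of Bayesian model averaging for linear regression with small p.
# Assumption: BIC is used as a rough approximation to each model's marginal likelihood.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)      # only covariates 0 and 2 matter

models, log_weights = [], []
for k in range(p + 1):
    for subset in combinations(range(p), k):                 # all 2^p subsets of covariates
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
        beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
        rss = np.sum((y - Z @ beta_hat) ** 2)
        bic = n * np.log(rss / n) + Z.shape[1] * np.log(n)
        models.append(subset)
        log_weights.append(-0.5 * bic)                        # approx. log marginal likelihood

w = np.exp(np.array(log_weights) - max(log_weights))
w /= w.sum()                                                  # posterior model probabilities
for m, prob in sorted(zip(models, w), key=lambda t: -t[1])[:3]:
    print(m, round(prob, 3))                                  # the model (0, 2) should dominate
```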

SLIDE 42

Conclusions

- Bayesian inference is a powerful tool to fully take into account all sources of uncertainty
- Difficulty of prior specification
- Computational issues are the main hurdle
