

SLIDE 1

An Introduction to

Objective Bayesian Statistics

José M. Bernardo

Universitat de València, Spain <jose.m.bernardo@uv.es> http://www.uv.es/bernardo Université de Neuchâtel, Switzerland March 15th–March 17th, 2006

SLIDE 2

Summary

  • 1. Concept of Probability
    Introduction. Notation. Statistical models. Intrinsic discrepancy. Intrinsic convergence of distributions. Foundations. Probability as a rational degree of belief.

  • 2. Basics of Bayesian Analysis
    Parametric inference. The learning process. Reference analysis. No relevant initial information. Inference summaries. Point and region estimation. Prediction. Regression. Hierarchical models. Exchangeability.

  • 3. Decision Making
    Structure of a decision problem. Intrinsic loss functions. Point and region estimation. Intrinsic estimators and credible regions. Hypothesis testing. Bayesian reference criterion (BRC).

SLIDE 3

  • 1. Concept of Probability

1.1. Introduction

Tentatively accept a formal statistical model, typically suggested by informal descriptive evaluation; conclusions are conditional on the assumption that the model is correct. The Bayesian approach is firmly based on axiomatic foundations: there is a mathematical need to describe all uncertainties by probabilities. In particular, parameters must have a (prior) distribution describing the available information about their values. This is not a description of their variability (they are fixed unknown quantities), but a description of the uncertainty about their true values. An important particular case is that of no relevant (or subjective) initial information: scientific and industrial reporting, public decision making, ... The prior is then exclusively based on model assumptions and available, well-documented data: Objective Bayesian Statistics.

SLIDE 4

  • Notation

Under conditions C, p(x | C) and π(θ | C) are, respectively, probability density (or mass) functions of observables x and parameters θ:

p(x | C) ≥ 0,  ∫_X p(x | C) dx = 1,  E[x | C] = ∫_X x p(x | C) dx,
π(θ | C) ≥ 0,  ∫_Θ π(θ | C) dθ = 1,  E[θ | C] = ∫_Θ θ π(θ | C) dθ.

Special density (or mass) functions use specific notation, as N(x | µ, σ), Bi(x | n, θ), or Pn(x | λ). Other examples:

Beta:    Be(x | α, β) = [Γ(α+β)/(Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1),   0 < x < 1, α > 0, β > 0
Gamma:   Ga(x | α, β) = [β^α/Γ(α)] x^(α−1) e^(−βx),   x > 0, α > 0, β > 0
Student: St(x | µ, σ, α) = [Γ((α+1)/2)/Γ(α/2)] (1/(σ√(απ))) [1 + (1/α)((x−µ)/σ)²]^(−(α+1)/2),   x ∈ ℜ, µ ∈ ℜ, σ > 0, α > 0

SLIDE 5

  • Statistical Models

Statistical model generating x ∈ X: {p(x | θ), x ∈ X, θ ∈ Θ}. Parameter vector θ = {θ1, . . . , θk} ∈ Θ; parameter space Θ ⊂ ℜ^k. Data set x ∈ X; sampling (outcome) space X, of arbitrary structure.

Likelihood function of x, l(θ | x): l(θ | x) = p(x | θ), as a function of θ ∈ Θ.
Maximum likelihood estimator (mle) of θ: θ̂ = θ̂(x) = arg sup_{θ∈Θ} l(θ | x).

Data x = {x1, . . . , xn} are a random sample (iid) from the model if p(x | θ) = ∏_{j=1}^{n} p(xj | θ), xj ∈ X, so that the sampling space of x is X^n.

Behaviour under repeated sampling (general, not necessarily iid data): consider {x1, x2, . . .}, a (possibly infinite) sequence of possible replications of the complete data set x, and denote by x(m) = {x1, . . . , xm} a finite set of m such replications. Asymptotic results are obtained as m → ∞.

SLIDE 6

1.2. Intrinsic Divergence

  • Logarithmic divergences

The logarithmic (Kullback-Leibler) divergence κ{p̂ | p} of a density p̂(x), x ∈ X, from its true density p(x), is
κ{p̂ | p} = ∫_X p(x) log[p(x)/p̂(x)] dx   (provided this exists).
The functional κ{p̂ | p} is non-negative (zero iff p̂(x) = p(x) a.e.) and invariant under one-to-one transformations of x. But κ{p1 | p2} is not symmetric, and diverges if, strictly, X2 ⊂ X1.

  • Intrinsic discrepancy between distributions

δ{p1, p2} = min{ ∫_X1 p1(x) log[p1(x)/p2(x)] dx,  ∫_X2 p2(x) log[p2(x)/p1(x)] dx }

The intrinsic discrepancy δ{p1, p2} is non-negative (zero iff p1 = p2 a.e.) and invariant under one-to-one transformations of x. It is defined whenever X2 ⊂ X1 or X1 ⊂ X2, and it has an operative interpretation as the minimum amount of information (in nits) required to discriminate between p1 and p2.
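
As a concrete illustration (not part of the original slides), the intrinsic discrepancy between two fully specified densities can be evaluated by numerical quadrature; the choice of densities, the finite integration limits and the N(0, 1) vs N(0.5, 1) example below are illustrative assumptions.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def kl(p, q, lo, hi):
    """Directed logarithmic divergence  integral p(x) log[p(x)/q(x)] dx  (in nits)."""
    integrand = lambda x: p(x) * (np.log(p(x)) - np.log(q(x)))
    val, _ = quad(integrand, lo, hi)
    return val

def intrinsic_discrepancy(p, q, lo, hi):
    """delta{p, q}: the smaller of the two directed divergences."""
    return min(kl(p, q, lo, hi), kl(q, p, lo, hi))

# Illustrative example: two Normal densities with equal spread
p = stats.norm(0.0, 1.0).pdf
q = stats.norm(0.5, 1.0).pdf
print(intrinsic_discrepancy(p, q, -10, 10))   # 0.125 nits = (0.5)^2 / 2
```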

SLIDE 7

  • Interpretation and calibration of the intrinsic discrepancy

Let {p1(x | θ1), θ1 ∈ Θ1} and {p2(x | θ2), θ2 ∈ Θ2} be two alternative statistical models for x ∈ X, one of which is assumed to be true. The intrinsic discrepancy δ{θ1, θ2} = δ{p1, p2} is then the minimum expected log-likelihood ratio in favour of the true model. Indeed, if p1(x | θ1) is the true model, the expected log-likelihood ratio in its favour is E1[log{p1(x | θ1)/p2(x | θ2)}] = κ{p2 | p1}; if the true model is p2(x | θ2), the expected log-likelihood ratio in favour of the true model is κ{p1 | p2}. And δ{p1, p2} = min[κ{p2 | p1}, κ{p1 | p2}].

  • Calibration. δ = log[100] ≈ 4.6 nits corresponds to likelihood ratios for the true model larger than about 100, making discrimination very easy; δ = log(1 + ε) ≈ ε nits corresponds to likelihood ratios for the true model of about 1 + ε, making discrimination very hard.

  Intrinsic discrepancy δ:                          0.01   0.69   2.3   4.6   6.9
  Average likelihood ratio for true model, exp[δ]:  1.01   2      10    100   1000

SLIDE 8

  • Example: the conventional Poisson approximation Pn(r | nθ) to the Binomial probabilities Bi(r | n, θ).

The intrinsic discrepancy between the Binomial and Poisson distributions is
δ{Bi(r | n, θ), Pn(r | nθ)} = min[κ{Bi | Pn}, κ{Pn | Bi}]
  = Σ_{r=0}^{n} Bi(r | n, θ) log[Bi(r | n, θ)/Pn(r | nθ)] = δ{n, θ}.

δ{3, 0.05} = 0.00074,   δ{5000, 0.05} = 0.00065,   δ{∞, θ} = ½[−θ − log(1 − θ)].

Good Poisson approximations are impossible if θ is not small, however large n might be.

[Figure: δ{Bi, Pn}(n, θ) as a function of θ for n = 1, 3, 5 and n → ∞.]
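
A minimal numerical sketch (not part of the slides) of δ{n, θ} = Σ_r Bi(r | n, θ) log[Bi(r | n, θ)/Pn(r | nθ)], using SciPy's Binomial and Poisson mass functions:

```python
import numpy as np
from scipy import stats

def delta_binom_poisson(n, theta):
    r = np.arange(n + 1)
    pb = stats.binom.pmf(r, n, theta)        # Bi(r | n, theta)
    pp = stats.poisson.pmf(r, n * theta)     # Pn(r | n*theta)
    mask = pb > 0
    return float(np.sum(pb[mask] * np.log(pb[mask] / pp[mask])))

print(delta_binom_poisson(3, 0.05))      # ~0.00074, as quoted on the slide
print(delta_binom_poisson(5000, 0.05))   # ~0.00065
```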

SLIDE 9

  • Intrinsic Convergence of Distributions

Intrinsic convergence. A sequence of probability density (or mass) functions {pi(x)}_{i=1}^{∞} converges intrinsically to p(x) if (and only if) the intrinsic discrepancy between pi(x) and p(x) converges to zero, i.e., iff lim_{i→∞} δ(pi, p) = 0.

  • Example: Normal approximation to a Student distribution.

δ(α) = δ{St(x | µ, σ, α), N(x | µ, σ)} = min[κ{Stα | N}, κ{N | Stα}] = κ{Stα | N}
     = ∫ N(x | 0, 1) log[N(x | 0, 1)/St(x | 0, 1, α)] dx ≈ 7/[α(22 + 4α)].

κ{N | Stα} diverges for α ≤ 2; κ{Stα | N} is finite for all α > 0. For instance, δ(18) ≈ 0.004 and δ(25) ≈ 0.002. Expected log-density ratios are at least 0.001 whenever α < 40.

[Figure: κ{N | Stα} and δ(α) = κ{Stα | N} as functions of α; κ{N | St39} ≈ 0.0012, κ{St39 | N} ≈ 0.0010.]
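
A quick numerical check (not from the slides) of δ(α) = κ{Stα | N}, computed by quadrature and compared with the closed-form approximation above; the integration limits are an illustrative choice.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def delta_student(alpha):
    n, t = stats.norm(0, 1), stats.t(df=alpha)
    integrand = lambda x: n.pdf(x) * (n.logpdf(x) - t.logpdf(x))
    val, _ = quad(integrand, -12, 12)    # tails beyond +-12 are negligible here
    return val

print(delta_student(39))                 # ~0.0010
print(7 / (39 * (22 + 4 * 39)))          # ~0.0010, the slide's approximation
```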

SLIDE 10

1.3. Foundations

  • Foundations of Statistics

Axiomatic foundations on the rational description of uncertainty imply that the uncertainty about all unknown quantities should be measured with probability distributions {π(θ | C), θ ∈ Θ} describing the plausibility of their values under the available conditions C.

The axioms have a strong intuitive appeal; examples include

  • Transitivity of plausibility:
    if E1 ≻ E2 | C and E2 ≻ E3 | C, then E1 ≻ E3 | C.

  • The sure-thing principle:
    if E1 ≻ E2 | A, C and E1 ≻ E2 | not-A, C, then E1 ≻ E2 | C.

The axioms are not a description of actual human activity, but a normative set of principles for those aspiring to rational behaviour. “Absolute” probabilities do not exist. Typical applications produce Pr(E | x, A, K), a measure of rational belief in the occurrence of the event E, given data x, assumptions A and available knowledge K.

SLIDE 11

  • Probability as a Measure of Conditional Uncertainty

Axiomatic foundations imply that Pr(E | C), the probability of an event E given C, is always a conditional measure of the (presumably rational) uncertainty, on a [0, 1] scale, about the occurrence of E under conditions C.

  • Probabilistic diagnosis. V is the event that a person carries a virus and + a positive test result. All related probabilities, e.g.,
    Pr(+ | V) = 0.98,   Pr(+ | not-V) = 0.01,   Pr(V | K) = 0.002,
    Pr(+ | K) = Pr(+ | V) Pr(V | K) + Pr(+ | not-V) Pr(not-V | K) = 0.012,
    Pr(V | +, A, K) = Pr(+ | V) Pr(V | K) / Pr(+ | K) = 0.164   (Bayes' theorem),
    are conditional uncertainty measures (and proportion estimates).

  • Estimation of a proportion. A survey is conducted to estimate the proportion θ of positive individuals in a population. Random sample of size n with r positives. Pr(a < θ < b | r, n, A, K) is a conditional measure of the uncertainty about the event that θ belongs to [a, b], given assumptions A, initial knowledge K and data {r, n}.
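
A quick numerical check of the diagnosis example (the code itself is not part of the slides):

```python
p_pos_given_v = 0.98       # Pr(+ | V)
p_pos_given_not_v = 0.01   # Pr(+ | not-V)
p_v = 0.002                # Pr(V | K)

p_pos = p_pos_given_v * p_v + p_pos_given_not_v * (1 - p_v)   # Pr(+ | K) ~ 0.012
p_v_given_pos = p_pos_given_v * p_v / p_pos                   # Bayes' theorem ~ 0.164
print(round(p_pos, 3), round(p_v_given_pos, 3))
```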

SLIDE 12

  • Measurement of a physical constant. Measuring the unknown value of a physical constant µ, with data x = {x1, . . . , xn} considered to be measurements of µ subject to error. It is desired to find Pr(a < µ < b | x1, . . . , xn, A, K), the probability that the unknown value of µ (fixed in nature, but unknown to the scientists) belongs to [a, b], given the information provided by the data x, the assumptions A made, and the available knowledge K.

The statistical model may include nuisance parameters, unknown quantities which have to be eliminated in the statement of the final results; for instance, the precision of the measurements, described by the unknown standard deviation σ in a N(x | µ, σ) normal model.

Relevant scientific information may impose restrictions on the admissible values of the quantities of interest, and these must be taken into account. For instance, in measuring the value of the gravitational field g in a laboratory, it is known that it must lie between 9.7803 m/sec² (average value at the Equator) and 9.8322 m/sec² (average value at the poles).

SLIDE 13

  • Future discrete observations. An experiment counts the number r of times that an event E takes place in each of n replications. It is desired to forecast the number of times r that E will take place in a future, similar situation: Pr(r | r1, . . . , rn, A, K). For instance, no accidents in each of n = 10 consecutive months may yield Pr(r = 0 | x, A, K) = 0.953.

  • Future continuous observations. Data x = {y1, . . . , yn}. It is desired to forecast the value of a future observation y, p(y | x, A, K). For instance, from the breaking strengths x = {y1, . . . , yn} of n randomly chosen safety belt webbings, the engineer may find Pr(y > y* | x, A, K) = 0.9987.

  • Regression. The data set consists of pairs x = {(y1, v1), . . . , (yn, vn)} of a quantity yj observed in conditions vj. It is desired to forecast the value of y in conditions v, p(y | v, x, A, K). For instance, y may be a contamination level and v the wind speed from the source; environmental authorities are interested in Pr(y > y* | v, x, A, K).

SLIDE 14

  • 2. Basics of Bayesian Analysis

2.1. Parametric Inference

  • Bayes Theorem

Let M = {p(x | θ), x ∈ X, θ ∈ Θ} be a statistical model, let π(θ | K) be a probability density for θ given prior knowledge K, and let x be some available data. Then

π(θ | x, M, K) = p(x | θ) π(θ | K) / ∫_Θ p(x | θ) π(θ | K) dθ

encapsulates all information about θ given the data and the prior knowledge. Simplifying the notation, Bayes' theorem may be expressed as
π(θ | x) ∝ p(x | θ) π(θ):
the posterior is proportional to the likelihood times the prior. The missing proportionality constant, [∫_Θ p(x | θ) π(θ) dθ]^{−1}, may be deduced from the fact that π(θ | x) must integrate to one. To identify a posterior distribution it suffices to identify a kernel k(θ, x) such that π(θ | x) = c(x) k(θ, x). This is a very common technique.

SLIDE 15

  • Bayesian Inference with a Finite Parameter Space

Model {p(x | θi), x ∈ X, θi ∈ Θ}, with Θ = {θ1, . . . , θm}, so that θ may only take a finite number m of different values. Using the finite form of Bayes' theorem,

Pr(θi | x) = p(x | θi) Pr(θi) / Σ_{j=1}^{m} p(x | θj) Pr(θj),   i = 1, . . . , m.

Example: probabilistic diagnosis. A test to detect a virus is known from laboratory research to give a positive result in 98% of infected people and in 1% of non-infected people. The posterior probability that a person who tested positive is infected is
Pr(V | +) = 0.98 p / [0.98 p + 0.01 (1 − p)]
as a function of p = Pr(V). Notice the sensitivity of the posterior Pr(V | +) to changes in the prior p = Pr(V).

[Figure: Pr(V | +) as a function of p = Pr(V).]

SLIDE 16

  • Example: Inference about a binomial parameter

Let data x be n Bernoulli observations with parameter θ which contain r positives, so that p(x | θ, n) = θ^r (1 − θ)^{n−r}. If π(θ) = Be(θ | α, β), then π(θ | x) ∝ θ^{r+α−1} (1 − θ)^{n−r+β−1}, the kernel of Be(θ | r + α, n − r + β).

Prior information (K): Pr(0.4 < θ < 0.6) = 0.95 and symmetric, yields α = β = 47; no prior information: α = β = 1/2.
n = 1500, r = 720:  Pr(θ < 0.5 | x, K) = 0.933,  Pr(θ < 0.5 | x) = 0.934.
n = 100, r = 0:  Pr(θ < 0.01 | x) = 0.844. Notice: θ̂ = 0, but Me[θ | x] = 0.0023.

[Figure: posterior densities of θ for the two data sets.]
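
The posterior probabilities quoted above are straightforward to reproduce with any Beta implementation; a minimal sketch (not part of the slides), using SciPy:

```python
from scipy import stats

def posterior(r, n, a, b):
    """Posterior Be(theta | r + a, n - r + b) for a Be(theta | a, b) prior."""
    return stats.beta(r + a, n - r + b)

# Informative prior matched to Pr(0.4 < theta < 0.6) = 0.95: alpha = beta = 47
print(posterior(720, 1500, 47, 47).cdf(0.5))     # Pr(theta < 0.5 | x, K); slide quotes 0.933
# Objective prior: alpha = beta = 1/2
print(posterior(720, 1500, 0.5, 0.5).cdf(0.5))   # Pr(theta < 0.5 | x); slide quotes 0.934
print(posterior(0, 100, 0.5, 0.5).cdf(0.01))     # Pr(theta < 0.01 | x); slide quotes 0.844
print(posterior(0, 100, 0.5, 0.5).median())      # Me[theta | x]; slide quotes 0.0023
```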

SLIDE 17

  • Sufficiency

Given a model p(x | θ), a function of the data t = t(x) is a sufficient statistic if it encapsulates all information about θ available in x. Formally, t = t(x) is sufficient if (and only if), for any prior π(θ), π(θ | x) = π(θ | t). Hence, π(θ | x) = π(θ | t) ∝ p(t | θ) π(θ). This is equivalent to the frequentist definition; thus t = t(x) is sufficient iff p(x | θ) = f(θ, t) g(x).

A sufficient statistic always exists, for t(x) = x is obviously sufficient. A much simpler sufficient statistic, with fixed dimensionality independent of the sample size, often exists; this is the case whenever the statistical model belongs to the generalized exponential family, which includes many of the more frequently used statistical models.

In contrast to frequentist statistics, Bayesian methods are independent of the possible existence of a sufficient statistic of fixed dimensionality. For instance, if data come from a Student distribution, there is no sufficient statistic of fixed dimensionality: all the data are needed.

SLIDE 18

  • Example: Inference from Cauchy observations

Data x = {x1, . . . , xn} random from Ca(x | µ, 1) = St(x | µ, 1, 1). The objective reference prior for the location parameter µ is π(µ) = 1. By Bayes' theorem,

π(µ | x) ∝ ∏_{j=1}^{n} Ca(xj | µ, 1) π(µ) ∝ ∏_{j=1}^{n} 1/[1 + (xj − µ)²].

The proportionality constant is easily obtained by numerical integration. Five samples of size n = 2 were simulated from Ca(x | 5, 1):

  x1:  4.034   21.220   5.272   4.776   7.409
  x2:  4.054    5.831   6.475   5.317   4.743

[Figure: the five posterior densities π(µ | x).]
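
A minimal sketch (not from the slides) of this computation: the unnormalized posterior for Cauchy data under the flat prior π(µ) = 1, normalized by numerical integration; the integration range below is an illustrative assumption.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def cauchy_posterior(x):
    x = np.asarray(x, dtype=float)
    unnorm = lambda mu: np.prod(stats.cauchy.pdf(x, loc=mu))   # product of Ca(x_j | mu, 1)
    const, _ = quad(unnorm, x.min() - 50, x.max() + 50)        # proportionality constant
    return lambda mu: unnorm(mu) / const

post = cauchy_posterior([21.220, 5.831])   # the second simulated sample in the table
print(post(5.0), post(21.0))               # bimodal: posterior mass near both observations
```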

SLIDE 19

  • Improper prior functions

Objective Bayesian methods often use functions which play the role of prior distributions but are not probability distributions. An improper prior function is a non-negative function π(θ) such that ∫_Θ π(θ) dθ is not finite. The Cauchy example uses the improper prior function π(µ) = 1, µ ∈ ℜ.

Let π(θ) be an improper prior function, {Θi}_{i=1}^{∞} an increasing sequence approximating Θ such that ∫_{Θi} π(θ) dθ < ∞, and {πi(θ)}_{i=1}^{∞} the proper priors obtained by renormalizing π(θ) within the Θi's. For any data x with likelihood p(x | θ), the sequence of posteriors πi(θ | x) converges intrinsically to π(θ | x) ∝ p(x | θ) π(θ).

Normal data, σ known, π(µ) = 1:
π(µ | x) ∝ p(x | µ, σ) π(µ) ∝ exp[−(n/(2σ²))(x̄ − µ)²],   so   π(µ | x) = N(µ | x̄, σ/√n).
Example: n = 9, x̄ = 2.11, σ = 4.

[Figure: the renormalized posteriors πi(µ | x) and their limit π(µ | x).]

SLIDE 20

  • Sequential updating

Prior and posterior are terms relative to a set of data. If data x = {x1, . . . , xn} are sequentially presented, the final result will be the same whether the data are globally or sequentially processed:
π(θ | x1, . . . , xi+1) ∝ p(xi+1 | θ) π(θ | x1, . . . , xi).
The “posterior” at a given stage becomes the “prior” at the next. Typically (but not always), the new posterior, π(θ | x1, . . . , xi+1), is more concentrated around the true value than π(θ | x1, . . . , xi).

Example: posteriors π(λ | x1, . . . , xi) from increasingly large data sets simulated from a Poisson Pn(x | λ) with λ = 3:
π(λ | x1, . . . , xi) = Ga(λ | ri + 1/2, i),   ri = Σ_{j=1}^{i} xj.

[Figure: posteriors π(λ | x1, . . . , xn) for n = 5, 10, 20, 50, 100.]

SLIDE 21

  • Nuisance parameters

In general the vector of interest is not the whole parameter vector θ, but some function φ = φ(θ) of possibly lower dimension. By Bayes' theorem, π(θ | x) ∝ p(x | θ) π(θ). Let ω = ω(θ) ∈ Ω be another function of θ such that ψ = {φ, ω} is a bijection of θ, and let J(ψ) = (∂θ/∂ψ) be the Jacobian of the inverse function θ = θ(ψ). From probability theory,
π(ψ | x) = |J(ψ)| [π(θ | x)]_{θ=θ(ψ)}   and   π(φ | x) = ∫_Ω π(φ, ω | x) dω.
Any valid conclusion on φ will be contained in π(φ | x).

Particular case: marginal posteriors. Often the model is directly expressed in terms of the vector of interest φ and a vector of nuisance parameters ω, p(x | θ) = p(x | φ, ω). Specify the prior π(θ) = π(φ) π(ω | φ); get the joint posterior π(φ, ω | x) ∝ p(x | φ, ω) π(ω | φ) π(φ); integrate out ω,
π(φ | x) ∝ π(φ) ∫_Ω p(x | φ, ω) π(ω | φ) dω.
SLIDE 22

  • Example: Inferences about a Normal mean

Data x = {x1, . . . , xn} random from N(x | µ, σ). Likelihood function
p(x | µ, σ) ∝ σ^{−n} exp[−n{s² + (x̄ − µ)²}/(2σ²)],   with n x̄ = Σ xi and n s² = Σ (xi − x̄)².
The objective prior is uniform in both µ and log(σ), i.e., π(µ, σ) = σ^{−1}. Joint posterior
π(µ, σ | x) ∝ σ^{−(n+1)} exp[−n{s² + (x̄ − µ)²}/(2σ²)].
Marginal posterior π(µ | x) ∝ ∫_0^∞ π(µ, σ | x) dσ ∝ [s² + (x̄ − µ)²]^{−n/2}, the kernel of the Student density St(µ | x̄, s/√(n − 1), n − 1).

A classroom experiment to measure gravity g yields x̄ = 9.8087 and s = 0.0428 with n = 20 measurements:
π(g | x̄, s, n) = St(g | 9.8087, 0.0098, 19),   Pr(9.788 < g < 9.829 | x) = 0.95 (shaded area in the figure).

[Figure: reference posterior density π(g | x̄, s, n).]
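
A small sketch (not from the slides) of the gravity example, using the Student reference posterior of the Normal mean:

```python
from scipy import stats

xbar, s, n = 9.8087, 0.0428, 20
post = stats.t(df=n - 1, loc=xbar, scale=s / (n - 1) ** 0.5)   # scale ~ 0.0098
print(post.cdf(9.829) - post.cdf(9.788))                       # ~0.95, the slide's credible interval
```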

SLIDE 23

  • Restricted parameter space

The range of values of θ may be restricted by contextual considerations. If θ is known to belong to Θc ⊂ Θ, then π(θ) > 0 iff θ ∈ Θc. By Bayes' theorem,

π(θ | x, θ ∈ Θc) = π(θ | x) / ∫_{Θc} π(θ | x) dθ   if θ ∈ Θc,   and 0 otherwise.

To incorporate a restriction, it suffices to renormalize the unrestricted posterior distribution to the set Θc ⊂ Θ of admissible parameter values.

Classroom experiment to measure gravity g, with the restriction that g must lie between g0 = 9.7803 (equator) and g1 = 9.8322 (poles): Pr(9.7921 < g < 9.8322 | x) = 0.95 (shaded area in the figure).

[Figure: restricted reference posterior density π(g | x̄, s, n, g ∈ Θc).]
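
A sketch (not from the slides) of the renormalization step: restrict the unrestricted Student posterior of g to the admissible range and locate the lower end of the 0.95-credible interval ending at g1.

```python
from scipy import stats
from scipy.optimize import brentq

post = stats.t(df=19, loc=9.8087, scale=0.0098)    # unrestricted reference posterior of g
g0, g1 = 9.7803, 9.8322                            # admissible range (equator, poles)
mass = post.cdf(g1) - post.cdf(g0)                 # posterior probability of the admissible set

restricted_cdf = lambda g: (post.cdf(g) - post.cdf(g0)) / mass

# lower end of the 0.95-credible interval ending at g1 (the slide quotes 9.7921)
lower = brentq(lambda g: (1.0 - restricted_cdf(g)) - 0.95, g0, g1)
print(round(lower, 4))
```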

SLIDE 24

  • Asymptotic behaviour, discrete case

If the parameter space Θ = {θ1, θ2, . . .} is countable and the true parameter value θt is distinguishable from the others, i.e., δ{p(x | θt), p(x | θi)} > 0 for all i ≠ t, then
lim_{n→∞} π(θt | x1, . . . , xn) = 1,   lim_{n→∞} π(θi | x1, . . . , xn) = 0,   i ≠ t.
To prove this, take logarithms in Bayes' theorem, define zi = log[p(x | θi)/p(x | θt)], and use the strong law of large numbers on the n i.i.d. random variables z1, . . . , zn.

For instance, in probabilistic diagnosis the posterior probability of the true disease converges to one as new relevant information accumulates, provided the model distinguishes the probabilistic behaviour of the data under the true disease from its behaviour under the other alternatives.

SLIDE 25

  • Asymptotic behaviour, continuous case

If the parameter θ is one-dimensional and continuous, so that Θ ⊂ ℜ, and the model {p(x | θ), x ∈ X} is regular (basically, X does not depend on θ, and p(x | θ) is twice differentiable with respect to θ), then, as n → ∞, π(θ | x1, . . . , xn) converges intrinsically to a normal distribution with mean at the mle θ̂ and variance v(x1, . . . , xn, θ̂), where

v^{−1}(x1, . . . , xn, θ̂) = −Σ_{j=1}^{n} ∂²/∂θ² log[p(xj | θ)], evaluated at θ = θ̂.

To prove this, express Bayes' theorem as
π(θ | x1, . . . , xn) ∝ exp[ log π(θ) + Σ_{j=1}^{n} log p(xj | θ) ],
and expand Σ_{j=1}^{n} log p(xj | θ) about its maximum, the mle θ̂.

The result is easily extended to the multivariate case θ = {θ1, . . . , θk}, to obtain a limiting k-variate normal centered at θ̂, with a dispersion matrix V(x1, . . . , xn, θ̂) which generalizes v(x1, . . . , xn, θ̂).

SLIDE 26

  • Asymptotic behaviour, continuous case. Simpler form

Using the strong law of large numbers on the sums above, a simpler, less precise approximation is obtained. If the parameter θ = {θ1, . . . , θk} is continuous, so that Θ ⊂ ℜ^k, and the model {p(x | θ), x ∈ X} is regular, so that X does not depend on θ and p(x | θ) is twice differentiable with respect to each of the θi's, then, as n → ∞, π(θ | x1, . . . , xn) converges intrinsically to a multivariate normal distribution with mean the mle θ̂ and precision matrix (inverse of the dispersion or variance-covariance matrix) n F(θ̂), where F(θ) is Fisher's matrix, of general element

F_{ij}(θ) = −E_{x|θ}[ ∂²/∂θi∂θj log p(x | θ) ].

The properties of the multivariate normal yield from this result the asymptotic forms of the marginal and conditional posterior distributions of any subgroup of the θj's.

In one dimension, π(θ | x1, . . . , xn) ≈ N(θ | θ̂, (n F(θ̂))^{−1/2}), where F(θ) = −E_{x|θ}[∂² log p(x | θ)/∂θ²].

SLIDE 27

  • Example: Asymptotic approximation with Poisson data

Data x = {x1, . . . , xn} random from Pn(x | λ) ∝ e^{−λ} λ^x / x!; hence p(x | λ) ∝ e^{−nλ} λ^r, r = Σj xj, and λ̂ = r/n. Fisher's function is
F(λ) = −E_{x|λ}[ ∂²/∂λ² log Pn(x | λ) ] = 1/λ.
The objective prior function is π(λ) = F(λ)^{1/2} = λ^{−1/2}; hence π(λ | x) ∝ e^{−nλ} λ^{r−1/2}, the kernel of Ga(λ | r + 1/2, n).

The Normal approximation is π(λ | x) ≈ N{λ | λ̂, (n F(λ̂))^{−1/2}} = N{λ | r/n, √r/n}.

Samples of sizes n = 5 and n = 25 simulated from a Poisson with λ = 3 yielded r = 19 and r = 82.

[Figure: exact posteriors Ga(λ | r + 1/2, n) and their Normal approximations for the two samples.]
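
A minimal sketch (not from the slides) comparing the exact reference posterior Ga(λ | r + 1/2, n) with its asymptotic Normal approximation for the two simulated samples quoted above:

```python
import numpy as np
from scipy import stats

def posteriors(r, n):
    exact = stats.gamma(a=r + 0.5, scale=1.0 / n)          # Ga(lambda | r + 1/2, n)
    approx = stats.norm(loc=r / n, scale=np.sqrt(r) / n)   # N(lambda | r/n, sqrt(r)/n)
    return exact, approx

for r, n in [(19, 5), (82, 25)]:
    exact, approx = posteriors(r, n)
    grid = np.linspace(1.0, 7.0, 5)
    print(n, np.round(exact.pdf(grid), 3), np.round(approx.pdf(grid), 3))
```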

SLIDE 28

2.2. Reference Analysis

  • No Relevant Initial Information

Identify the mathematical form of a “noninformative” prior: one with minimal effect, relative to the data, on the posterior distribution of the quantity of interest. Intuitive basis: use information theory to measure the amount of information about the quantity of interest to be expected from the data. This depends on prior knowledge: the more that is known, the less information the data may be expected to provide. Define the missing information about the quantity of interest as that which infinite independent replications of the experiment could possibly provide. Define the reference prior as that which maximizes the missing information about the quantity of interest.

SLIDE 29

  • Expected information from the data

Given a model {p(x | θ), x ∈ X, θ ∈ Θ}, the amount of information I^θ{X, π(θ)} which may be expected to be provided by x about the value of θ is defined by
I^θ{X, π(θ)} = δ{p(x, θ), p(x) π(θ)},
the intrinsic discrepancy between the joint distribution p(x, θ) and the product of its marginals p(x) π(θ), which is the intrinsic association between the random quantities x and θ.

Consider I^θ{X^k, π(θ)}, the information about θ which may be expected from k conditionally independent replications of the original setup. As k → ∞, this would provide any missing information about θ; hence, as k → ∞, the functional I^θ{X^k, π(θ)} approaches the missing information about θ associated with the prior π(θ).

Let πk(θ) be the prior which maximizes I^θ{X^k, π(θ)} in the class P of strictly positive prior distributions compatible with accepted assumptions on the value of θ (which may simply be the class of all strictly positive priors). The reference prior π*(θ) is the limit as k → ∞ (in a sense to be made precise) of the sequence of priors {πk(θ), k = 1, 2, . . .}.

SLIDE 30

  • Reference priors in the finite case

If θ may only take a finite number m of different values {θ1, . . . , θm} and π(θ) = {p1, . . . , pm}, with pi = Pr(θ = θi), then
lim_{k→∞} I^θ{X^k, π(θ)} = H(p1, . . . , pm) = −Σ_{i=1}^{m} pi log(pi),
that is, the entropy of the prior distribution {p1, . . . , pm}.

In the finite case, the reference prior is that with maximum entropy within the class P of priors compatible with accepted assumptions (cf. Statistical Physics). If, in particular, P contains all priors over {θ1, . . . , θm}, the reference prior is the uniform prior, π(θ) = {1/m, . . . , 1/m} (cf. the Bayes-Laplace postulate of insufficient reason).

Example: prior {p1, p2, p3, p4} in a genetics problem where p1 = 2p2. The reference prior is {0.324, 0.162, 0.257, 0.257}.

[Figure: the entropy surface H(p2, p3) under the restriction p1 = 2p2.]

SLIDE 31

  • Reference priors in one-dimensional continuous case

Let πk(θ) be the prior which maximizes I^θ{X^k, π(θ)} in the class P of acceptable priors. For any data x ∈ X, let πk(θ | x) ∝ p(x | θ) πk(θ) be the corresponding posterior. The reference posterior density π*(θ | x) is defined to be the intrinsic limit of the sequence {πk(θ | x), k = 1, 2, . . .}. A reference prior function π*(θ) is any positive function such that, for all x ∈ X, π*(θ | x) ∝ p(x | θ) π*(θ); this is defined up to an (irrelevant) arbitrary constant. Let x(k) ∈ X^k be the result of k independent replications of x ∈ X. The exact expression for πk(θ) (which may be obtained with calculus of variations) is
πk(θ) = exp[ E_{x(k)|θ}{ log πk(θ | x(k)) } ].
This formula may be used, by repeated simulation from p(x | θ) for different θ values, to obtain a numerical approximation to the reference prior.

SLIDE 32

  • Reference priors under regularity conditions

Let θ̃k = θ̃(x(k)) be a consistent, asymptotically sufficient estimator of θ; in regular problems this is often the case for the mle estimator θ̂. The exact expression for πk(θ) then becomes, for large k,
πk(θ) ≈ exp[ E_{θ̃k|θ}{ log πk(θ | θ̃k) } ],
and, as k → ∞, this converges to πk(θ | θ̃k)|_{θ̃k=θ}.

Thus, let θ̃k = θ̃(x(k)) be a consistent, asymptotically sufficient estimator of θ, and let π(θ | θ̃k) be any asymptotic approximation to π(θ | x(k)), the posterior distribution of θ. Then
π*(θ) = π(θ | θ̃k)|_{θ̃k=θ}.

Under regularity conditions, the posterior distribution of θ is asymptotically Normal, with mean θ̂ and precision n F(θ̂), where F(θ) = −E_{x|θ}[∂² log p(x | θ)/∂θ²] is Fisher's information function. Hence,
π*(θ) = F(θ)^{1/2}   (Jeffreys' rule).

SLIDE 33

  • One nuisance parameter

Two parameters: reduce the problem to a sequential application of the one-parameter case. The probability model is {p(x | θ, λ), θ ∈ Θ, λ ∈ Λ} and a θ-reference prior π*_θ(θ, λ) is required. Two steps:

(i) Conditional on θ, p(x | θ, λ) only depends on λ, and it is possible to obtain the conditional reference prior π*(λ | θ).

(ii) If π*(λ | θ) is proper, integrate out λ to get the one-parameter model p(x | θ) = ∫_Λ p(x | θ, λ) π*(λ | θ) dλ, and use the one-parameter solution to obtain π*(θ).

The θ-reference prior is then π*_θ(θ, λ) = π*(λ | θ) π*(θ), and the required reference posterior is π*(θ | x) ∝ p(x | θ) π*(θ).

If π*(λ | θ) is an improper prior function, proceed within an increasing sequence {Λi} over which π*(λ | θ) is integrable and, for given data x, obtain the corresponding sequence of reference posteriors {π*_i(θ | x)}; the required reference posterior π*(θ | x) is their intrinsic limit.

A θ-reference prior is any positive function such that, for any data x,
π*(θ | x) ∝ ∫_Λ p(x | θ, λ) π*_θ(θ, λ) dλ.

SLIDE 34

  • The regular two-parameter continuous case

Model p(x | θ, λ). If the joint posterior of (θ, λ) is asymptotically normal, the θ-reference prior may be derived in terms of the corresponding Fisher information matrix F(θ, λ):

F(θ, λ) = [ Fθθ(θ, λ)   Fθλ(θ, λ) ;  Fθλ(θ, λ)   Fλλ(θ, λ) ],    S(θ, λ) = F^{−1}(θ, λ).

The θ-reference prior is π*_θ(θ, λ) = π*(λ | θ) π*(θ), where
π*(λ | θ) ∝ F_{λλ}^{1/2}(θ, λ),  λ ∈ Λ,   and, if π*(λ | θ) is proper,
π*(θ) ∝ exp{ ∫_Λ π*(λ | θ) log[S_{θθ}^{−1/2}(θ, λ)] dλ },  θ ∈ Θ.

If π*(λ | θ) is not proper, integrations are performed within an approximating sequence {Λi} to obtain a sequence {π*_i(λ | θ) π*_i(θ)}, and the θ-reference prior π*_θ(θ, λ) is defined as its intrinsic limit.

Even if π*(λ | θ) is improper, if θ and λ are variation independent, with
S_{θθ}^{−1/2}(θ, λ) ∝ fθ(θ) gθ(λ)   and   F_{λλ}^{1/2}(θ, λ) ∝ fλ(θ) gλ(λ),
then π*_θ(θ, λ) = fθ(θ) gλ(λ).

SLIDE 35

  • Examples: Inference on normal parameters

The information matrix for the normal model N(x | µ, σ) is
F(µ, σ) = [ σ^{−2}   0 ;  0   2σ^{−2} ],    S(µ, σ) = F^{−1}(µ, σ) = [ σ²   0 ;  0   σ²/2 ].

Since µ and σ are variation independent, and both Fσσ and Sµµ factorize,
π*(σ | µ) ∝ F_{σσ}^{1/2} ∝ σ^{−1},    π*(µ) ∝ S_{µµ}^{−1/2} ∝ 1.
The µ-reference prior is, as anticipated, π*_µ(µ, σ) = π*(σ | µ) π*(µ) = σ^{−1}, i.e., uniform on both µ and log σ.

Since F(µ, σ) is diagonal, the σ-reference prior is π*_σ(µ, σ) = π*(µ | σ) π*(σ) = σ^{−1}, the same as π*_µ(µ, σ).

In fact, it may be shown that, for location-scale models, p(x | µ, σ) = (1/σ) f((x − µ)/σ), the reference prior for the location and scale parameters is always π*_µ(µ, σ) = π*_σ(µ, σ) = σ^{−1}.

SLIDE 36

Within any given model p(x | θ) the φ-reference prior π*_φ(θ) maximizes the missing information about φ = φ(θ) and, in multiparameter problems, that prior may change with the quantity of interest φ.

For instance, within a normal N(x | µ, σ) model, let the standardized mean φ = µ/σ be the quantity of interest. Fisher's information matrix in terms of the parameters φ and σ is F(φ, σ) = Jᵗ F(µ, σ) J, where J = (∂(µ, σ)/∂(φ, σ)) is the Jacobian of the inverse transformation; this yields

F(φ, σ) = [ 1   φ/σ ;  φ/σ   (2 + φ²)/σ² ],    S(φ, σ) = [ 1 + φ²/2   −φσ/2 ;  −φσ/2   σ²/2 ],

with F_{σσ}^{1/2} ∝ σ^{−1} and S_{φφ}^{−1/2} ∝ (1 + φ²/2)^{−1/2}. The φ-reference prior is
π*_φ(φ, σ) = (1 + φ²/2)^{−1/2} σ^{−1}.
In the original parametrization this is π*_φ(µ, σ) = (1 + (µ/σ)²/2)^{−1/2} σ^{−2}, which is different from π*_µ(µ, σ) = π*_σ(µ, σ). This prior may be shown to lead to a reference posterior for φ with consistent marginalization properties.

SLIDE 37

  • Many parameters

The reference algorithm generalizes to any number of parameters. If the model is p(x | θ) = p(x | θ1, . . . , θm), a joint reference prior
π*(φm | φm−1, . . . , φ1) × · · · × π*(φ2 | φ1) × π*(φ1)
may be sequentially obtained for each ordered parametrization {φ1(θ), . . . , φm(θ)}. Reference priors are invariant under reparametrization of the φi(θ)'s. The choice of the ordered parametrization {φ1, . . . , φm} describes the particular prior required, namely that which sequentially maximizes the missing information about each of the φi's, conditional on {φ1, . . . , φi−1}, for i = m, m − 1, . . . , 1.

Example: Stein's paradox. Data are random from an m-variate normal Nm(x | µ, I). The reference prior function for any permutation of the µi's is uniform, and leads to appropriate posterior distributions for any of the µi's, but cannot be used if the quantity of interest is θ = Σi µi², the distance of µ to the origin. The reference prior for {θ, λ1, . . . , λm−1} produces, for any choice of the λi's, an appropriate reference posterior for θ.

SLIDE 38

2.3. Inference Summaries

  • Summarizing the posterior distribution

The Bayesian final outcome of a problem of inference about any unknown quantity θ is precisely the posterior density π(θ | x, C). Bayesian inference may thus be described as the problem of stating a probability distribution for the quantity of interest encapsulating all available information about its value.

In one or two dimensions, a graph of the posterior probability density of the quantity of interest conveys an intuitive summary of the main conclusions. This is greatly appreciated by users, and is an important asset of Bayesian methods. However, graphical methods do not easily extend to more than two dimensions, and elementary quantitative conclusions are often required.

The simplest forms of summarizing the information contained in the posterior distribution are closely related to the conventional concepts of point estimation and interval estimation.

SLIDE 39

  • Point Estimation: Posterior mean and posterior mode

It is often required to provide point estimates of relevant quantities. Bayesian point estimation is best described as a decision problem where one has to choose a particular value θ̃ as an approximate proxy for the actual, unknown value of θ.

Intuitively, any location measure of the posterior density π(θ | x) may be used as a point estimator. When they exist, either
E[θ | x] = ∫_Θ θ π(θ | x) dθ   (posterior mean)   or   Mo[θ | x] = arg sup_{θ∈Θ} π(θ | x)   (posterior mode)
are often regarded as natural choices.

Lack of invariance: neither the posterior mean nor the posterior mode is invariant under reparametrization. The point estimator ψ̃ of a bijection ψ = ψ(θ) of θ will generally not be equal to ψ(θ̃). In pure “inferential” applications, where one is requested to provide a point estimate of the vector of interest without a specific application in mind, it is difficult to justify a non-invariant solution: the best estimate of, say, φ = log(θ) should be φ* = log(θ*).

SLIDE 40

  • Point Estimation: Posterior median

A summary of a multivariate density π(θ | x), where θ = {θ1, . . . , θk}, should contain summaries of (i) each of the marginal densities π(θi | x), and (ii) the densities π(φ | x) of other functions of interest φ = φ(θ).

In one-dimensional continuous problems the posterior median is easily defined and computed as
Me[θ | x] = q   such that   Pr[θ ≤ q | x] = ∫_{θ≤q} π(θ | x) dθ = 1/2.
The one-dimensional posterior median has many attractive properties: (i) it is invariant under bijections, Me[φ(θ) | x] = φ(Me[θ | x]); (ii) it exists and is unique under very wide conditions; (iii) it is rather robust under moderate perturbations of the data. The posterior median is often considered to be the best ‘automatic’ Bayesian point estimator in one-dimensional continuous problems.

The posterior median is not easily extended to a multivariate setting: the natural extension of its definition produces surfaces (not points). General invariant multivariate definitions of point estimators are possible using Bayesian decision theory.

SLIDE 41

  • General Credible Regions

To describe π(θ | x) it is often convenient to quote regions Θp ⊂ Θ of given probability content p under π(θ | x). This is the intuitive basis of graphical representations like boxplots.

A subset Θp of the parameter space Θ such that ∫_{Θp} π(θ | x) dθ = p, so that Pr(θ ∈ Θp | x) = p, is a posterior p-credible region for θ. Credible regions are invariant under reparametrization: if Θp is p-credible for θ, then φ(Θp) is p-credible for φ = φ(θ). For any given p there are generally infinitely many credible regions.

Credible regions may be selected to have minimum size (length, area, volume), resulting in highest probability density (HPD) regions, where all points in the region have larger probability density than all points outside. HPD regions are not invariant: the image φ(Θp) of an HPD region Θp will be a credible region for φ, but will generally not be HPD. There is no reason to restrict attention to HPD credible regions.

SLIDE 42

  • Credible Intervals

In one-dimensional continuous problems, posterior quantiles are often used to derive credible intervals. If θq = Qq[θ | x] is the q-quantile of the posterior distribution of θ, the interval Θp = {θ; θ ≤ θp} is a p-credible region, and it is invariant under reparametrization. Equal-tailed p-credible intervals of the form Θp = {θ; θ_{(1−p)/2} ≤ θ ≤ θ_{(1+p)/2}} are typically unique, and they are invariant under reparametrization.

Example: model N(x | µ, σ); credible intervals for the normal mean. The reference posterior for µ is π(µ | x) = St(µ | x̄, s/√(n − 1), n − 1). Hence the reference posterior distribution of τ = √(n − 1)(µ − x̄)/s, a function of µ, is π(τ | x̄, s, n) = St(τ | 0, 1, n − 1). Thus, the equal-tailed p-credible intervals for µ are
{µ; µ ∈ x̄ ± q_{n−1}^{(1−p)/2} s/√(n − 1)},
where q_{n−1}^{(1−p)/2} is the (1 − p)/2 quantile of a standard Student density with n − 1 degrees of freedom.

SLIDE 43

  • Calibration

In the normal example above, the expression t = √(n − 1)(µ − x̄)/s may also be analyzed, for fixed µ, as a function of the data. The fact that the sampling distribution of the statistic t = t(x̄, s | µ, n) is also a standard Student density, p(t | µ, n) = St(t | 0, 1, n − 1), with the same degrees of freedom, implies that, in this example, objective Bayesian credible intervals are also exact frequentist confidence intervals.

Exact numerical agreement between Bayesian credible intervals and frequentist confidence intervals is the exception, not the norm. For large samples, convergence to normality implies approximate numerical agreement; this provides a frequentist calibration to objective Bayesian methods.

Exact numerical agreement is obviously impossible when the data are discrete: precise (non-randomized) frequentist confidence intervals do not exist in that case for most confidence levels. The computation of Bayesian credible regions for continuous parameters is, however, precisely the same whether the data are discrete or continuous.

SLIDE 44

2.4. Prediction

  • Posterior predictive distributions

Data x = {x1, . . . , xn}, xi ∈ X, are a set of “homogeneous” observations. It is desired to predict the value of a future observation x ∈ X generated by the same mechanism. From the foundations arguments, the solution must be a probability distribution p(x | x, K) describing the uncertainty about the value that x will take, given the data x and any other available knowledge K. This is called the (posterior) predictive density of x.

To derive p(x | x, K) it is necessary to specify the precise sense in which the xi's are judged to be homogeneous. It is often directly assumed that the data x = {x1, . . . , xn} consist of a random sample from some specified model {p(x | θ), x ∈ X, θ ∈ Θ}, so that p(x | θ) = p(x1, . . . , xn | θ) = ∏_{j=1}^{n} p(xj | θ). If this is the case, the solution to the prediction problem is immediate once a prior distribution π(θ) has been specified.
SLIDE 45

  • Posterior predictive distributions from random samples

Let x = {x1, . . . , xn}, xi ∈ X, be a random sample of size n from the statistical model {p(x | θ), x ∈ X, θ ∈ Θ}, and let π(θ) be a prior distribution describing available knowledge (if any) about the value of the parameter vector θ. The posterior predictive distribution is

p(x | x) = p(x | x1, . . . , xn) = ∫_Θ p(x | θ) π(θ | x) dθ.

This encapsulates all available information about the outcome of any future observation x ∈ X from the same model. To prove this, use the total probability theorem to get p(x | x) = ∫_Θ p(x | θ, x) π(θ | x) dθ, and notice that the new observation x has been assumed to be conditionally independent of the observed data x, so that p(x | θ, x) = p(x | θ).

The observable values x ∈ X may be either discrete or continuous random quantities. In the discrete case, the predictive distribution is described by its probability mass function; in the continuous case, by its probability density function. Both are denoted p(x | x).

SLIDE 46

  • Prediction in a Poisson process

Data x = {r1, . . . , rn} random from Pn(r | λ). The reference posterior density of λ is π*(λ | x) = Ga(λ | t + 1/2, n), where t = Σj rj. The (reference) posterior predictive distribution is

p(r | x) = Pr[r | t, n] = ∫_0^∞ Pn(r | λ) Ga(λ | t + 1/2, n) dλ
         = [n^{t+1/2} / Γ(t + 1/2)] (1/r!) Γ(r + t + 1/2) / (1 + n)^{r+t+1/2},

an example of a Poisson-Gamma probability mass function.

For example, no flash floods have been recorded at a particular location in 10 consecutive years. Local authorities are interested in forecasting possible future flash floods. Using a Poisson model, and assuming that meteorological conditions remain similar, the probabilities that r flash floods will occur next year at that location are given by the Poisson-Gamma mass function above, with t = 0 and n = 10. This yields Pr[0 | t, n] = 0.953, Pr[1 | t, n] = 0.043, and Pr[2 | t, n] = 0.003. Many other situations may be described with the same model.

SLIDE 47

  • Prediction of Normal measurements

Data x = {x1, . . . , xn} random from N(x | µ, σ). Reference prior π*(µ, σ) = σ^{−1} or, in terms of the precision λ = σ^{−2}, π*(µ, λ) = λ^{−1}. The joint reference posterior, π*(µ, λ | x) ∝ p(x | µ, λ) π*(µ, λ), is
π*(µ, λ | x) = N(µ | x̄, (nλ)^{−1/2}) Ga(λ | (n − 1)/2, ns²/2).
The predictive distribution is
π*(x | x) = ∫_0^∞ ∫_{−∞}^{∞} N(x | µ, λ^{−1/2}) π*(µ, λ | x) dµ dλ ∝ {(1 + n)s² + (x − x̄)²}^{−n/2},
a kernel of the Student density π*(x | x) = St(x | x̄, s √((n + 1)/(n − 1)), n − 1).

  • Example: production of safety belts. The observed breaking strengths of 10 randomly chosen webbings have mean x̄ = 28.011 kN and standard deviation s = 0.443 kN. The specification requires x > 26 kN. The reference posterior predictive is p(x | x) = St(x | 28.011, 0.490, 9), and
Pr(x > 26 | x) = ∫_{26}^{∞} St(x | 28.011, 0.490, 9) dx = 0.9987.
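
A small sketch (not from the slides) of the safety-belt calculation, using the Student predictive St(x | x̄, s√((n+1)/(n−1)), n−1):

```python
import numpy as np
from scipy import stats

xbar, s, n = 28.011, 0.443, 10
pred = stats.t(df=n - 1, loc=xbar, scale=s * np.sqrt((n + 1) / (n - 1)))  # scale ~ 0.490
print(pred.sf(26.0))   # Pr(x > 26 | data) ~ 0.9987
```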

SLIDE 48

  • Regression

Often there is additional information from relevant covariates. Data structure: a set of pairs x = {(y1, v1), . . . , (yn, vn)}, with yi and vi both vectors. Given a new observation, with v known, predict the corresponding value of y; formally, compute p{y | v, (y1, v1), . . . , (yn, vn)}.

This requires a model {p(y | v, θ), y ∈ Y, θ ∈ Θ} which makes precise the probabilistic relationship between y and v. The simplest option assumes a linear dependency of the form p(y | v, θ) = N(y | V β, Σ), but far more complex structures are common in applications.

Univariate linear regression on k covariates: Y ⊂ ℜ, v = {v1, . . . , vk}, p(y | v, β, σ) = N(y | vβ, σ), β = {β1, . . . , βk}ᵗ. Data x = {y, V}, with y = {y1, . . . , yn}ᵗ and V the n × k matrix with the vi's as rows, so that p(y | V, β, σ) = Nn(y | V β, σ²In); reference prior π*(β, σ) = σ^{−1}. The posterior predictive is the Student density

p(y | v, y, V) = St(y | v β̂, s √(f(v, V) n/(n − k)), n − k),
β̂ = (VᵗV)^{−1} Vᵗ y,   n s² = (y − V β̂)ᵗ(y − V β̂),   f(v, V) = 1 + v (VᵗV)^{−1} vᵗ.

SLIDE 49

  • Example: Simple linear regression

One covariate and a constant term: p(y | v, β, σ) = N(y | β1 + β2 v, σ).
The sufficient statistic is t = {v̄, ȳ, svy, svv}, with n v̄ = Σ vj, n ȳ = Σ yj, svy = Σ vj yj/n − v̄ ȳ, svv = Σ vj²/n − v̄².

p(y | v, t) = St(y | β̂1 + β̂2 v, s √(f(v, t) n/(n − 2)), n − 2),
β̂1 = ȳ − β̂2 v̄,   β̂2 = svy/svv,   n s² = Σ_{j=1}^{n} (yj − β̂1 − β̂2 vj)²,
f(v, t) = 1 + (1/n) [(v − v̄)² + svv]/svv.

Example: pollution density yj (µg/m³) and wind speed from the source vj (m/s):

  vj:  4.8   3.3   3.1   1.7   4.7   2.1   3.9   0.9   1.4   4.3   2.9   3.4
  yj:  1212  836   850   446   1164  601   1074  284   352   1064  712   976

Pr[y > 50 | v = 0, x] = 0.66

[Figure: data, fitted regression and posterior predictive densities p(y | v, x).]
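
A sketch (not from the slides) of the Student posterior predictive for univariate linear regression under the reference prior π(β, σ) = σ^{−1}, applied to the pollution example; note that the data pairs are read off the flattened table above, so the printed probability should be treated as illustrative.

```python
import numpy as np
from scipy import stats

v = np.array([4.8, 3.3, 3.1, 1.7, 4.7, 2.1, 3.9, 0.9, 1.4, 4.3, 2.9, 3.4])
y = np.array([1212, 836, 850, 446, 1164, 601, 1074, 284, 352, 1064, 712, 976])

V = np.column_stack([np.ones_like(v), v])          # design matrix: constant + one covariate
n, k = V.shape
beta_hat, *_ = np.linalg.lstsq(V, y, rcond=None)   # (V'V)^(-1) V'y
s2 = np.sum((y - V @ beta_hat) ** 2) / n           # n s^2 = residual sum of squares

def predictive(v_new):
    x = np.array([1.0, v_new])
    f = 1.0 + x @ np.linalg.solve(V.T @ V, x)      # f(v, V) = 1 + v (V'V)^(-1) v'
    scale = np.sqrt(s2 * f * n / (n - k))
    return stats.t(df=n - k, loc=x @ beta_hat, scale=scale)

print(predictive(0.0).sf(50.0))                    # posterior predictive Pr[y > 50 | v = 0, data]
```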

SLIDE 50

2.5. Hierarchical Models

  • Exchangeability

Random quantities are often “homogeneous” in the precise sense that only their values matter, not the order in which they appear. Formally, this is captured by the notion of exchangeability. The set of random vectors {x1, . . . , xn} is exchangeable if their joint distribution is invariant under permutations. An infinite sequence {xj} of random vectors is exchangeable if all its finite subsequences are exchangeable. Any random sample from any model is exchangeable.

The representation theorem establishes that if observations {x1, . . . , xn} are exchangeable, they are a random sample from some model {p(x | θ), θ ∈ Θ}, labeled by a parameter vector θ, defined as the limit (as n → ∞) of some function of the xi's. Information about θ in prevailing conditions C is necessarily described by some probability distribution π(θ | C).

Formally, the joint density of any finite set of exchangeable observations {x1, . . . , xn} has an integral representation of the form
p(x1, . . . , xn | C) = ∫_Θ ∏_{i=1}^{n} p(xi | θ) π(θ | C) dθ.

SLIDE 51

  • Structured Models

Complex data structures may often be usefully described by partial exchangeability assumptions.

Example: public opinion. Sample k different regions of the country, sample ni citizens in region i, and record whether or not (yij = 1 or yij = 0) citizen j would vote A. Assuming exchangeable citizens within each region implies
p(yi1, . . . , yini) = ∏_{j=1}^{ni} p(yij | θi) = θi^{ri} (1 − θi)^{ni−ri},
where θi is the (unknown) proportion of citizens in region i voting A and ri = Σj yij is the number of citizens voting A in region i. Assuming regions exchangeable within the country similarly leads to
p(θ1, . . . , θk) = ∏_{i=1}^{k} π(θi | φ)
for some probability distribution π(θ | φ) describing the political variation within the regions; often one chooses π(θ | φ) = Be(θ | α, β).

The resulting two-stage hierarchical Binomial-Beta model,
x = {y1, . . . , yk}, yi = {yi1, . . . , yini} random from Bi(y | θi),   {θ1, . . . , θk} random from Be(θ | α, β),
provides a far richer model than (unrealistic) simple binomial sampling.

SLIDE 52

Example: biological response. Sample k different animals of the same species in a specific environment, observe animal i on ni occasions, and record its responses {yi1, . . . , yini} to the prevailing conditions. Assuming exchangeable observations within each animal implies
p(yi1, . . . , yini) = ∏_{j=1}^{ni} p(yij | θi).
Often one chooses p(yij | θi) = Nr(y | µi, Σ1), where r is the number of biological responses measured. Assuming exchangeable animals within the environment leads to
p(µ1, . . . , µk) = ∏_{i=1}^{k} π(µi | φ)
for some probability distribution π(µ | φ) describing the biological variation within the species; often one chooses π(µ | φ) = Nr(µ | µ0, Σ2).

The two-stage hierarchical multivariate Normal-Normal model,
x = {y1, . . . , yk}, yi = {yi1, . . . , yini} random from Nr(y | µi, Σ1),   {µ1, . . . , µk} random from Nr(µ | µ0, Σ2),
provides a far richer model than (unrealistic) simple multivariate normal sampling. Finer subdivisions, e.g., subspecies within each species, similarly lead to hierarchical models with more stages.

SLIDE 53

  • Bayesian analysis of hierarchical models

A two-stage hierarchical model has the general form
x = {y1, . . . , yk},  yi = {zi1, . . . , zini},
yi a random sample of size ni from p(z | θi), θi ∈ Θ,   {θ1, . . . , θk} a random sample of size k from π(θ | φ), φ ∈ Φ.

Specify a prior distribution (or a reference prior function) π(φ) for the hyperparameter vector φ, and use standard probability theory to compute all desired posterior distributions:
π(φ | x) for inferences about the hyperparameters,
π(θi | x) for inferences about the parameters,
π(ψ | x) for inferences about any function ψ = ψ(θ1, . . . , θk) of the parameters,
π(y | x) for predictions of future observations,
π(t | x) for predictions of any function t = t(y1, . . . , ym) of m future observations.

Markov chain Monte Carlo based software is available for the necessary computations.

SLIDE 54

  • 3. Decision Making

3.1 Structure of a Decision Problem

  • Alternatives, consequences, relevant events

A decision problem exists if there are two or more possible courses of action; A is the class of possible actions. For each a ∈ A, Θa is the set of relevant events, those which may affect the result of choosing a. Each pair {a, θ}, θ ∈ Θa, produces a consequence c(a, θ) ∈ Ca. In this context, θ is often referred to as the parameter of interest.

The class of pairs {(Θa, Ca), a ∈ A} describes the structure of the decision problem. Without loss of generality, it may be assumed that the possible actions are mutually exclusive, for otherwise the appropriate Cartesian product may be used. In many problems the class of relevant events Θa is the same for all a ∈ A; even if this is not the case, a comprehensive parameter space Θ may be defined as the union of all the Θa.

SLIDE 55

  • Foundations of decision theory

Different sets of principles capture a minimum collection of logical rules required for “rational” decision-making. These are axioms with strong intuitive appeal. Their basic structure consists of:

  • The Transitivity of preferences:

If a1 ≻ a2 given C, and a2 ≻ a3 given C, then a1 ≻ a3 given C.

  • The Sure-thing principle:

If a1 ≻ a2 given C and E, and a1 ≻ a2 given C and not E, then a1 ≻ a2 given C.

  • The existence of Standard events:

There are events of known plausibility. These may be used as a unit of measurement, and have the properties of a probability measure.

These axioms are not a description of human decision-making, but a normative set of principles defining coherent decision-making.

SLIDE 56

  • Decision making

Many different axiom sets have been proposed; all lead basically to the same set of conclusions, namely:

  • The consequences of wrong actions should be evaluated in terms of a real-valued loss function ℓ(a, θ) which specifies, on a numerical scale, their undesirability.

  • The uncertainty about the parameter of interest θ should be measured with a probability distribution π(θ | C),
    π(θ | C) ≥ 0, θ ∈ Θ,   ∫_Θ π(θ | C) dθ = 1,
    describing all available knowledge about its value, given the conditions C under which the decision must be taken.

  • The relative undesirability of the available actions a ∈ A is measured by their expected loss,
    ℓ[a | C] = ∫_Θ ℓ(a, θ) π(θ | C) dθ,   a ∈ A;
    the optimal action minimizes the expected loss (alternatively, one may maximize expected utility).

SLIDE 57

  • Intrinsic loss functions: Intrinsic discrepancy

The loss function is typically context dependent. In mathematical statistics, intrinsic loss functions are used to measure the distance between statistical models. They measure the divergence between the models {p1(x | θ1), x ∈ X} and {p2(x | θ2), x ∈ X} as some non-negative function ℓ{p1, p2} which is zero if (and only if) the two distributions are equal almost everywhere.

The intrinsic discrepancy between two statistical models is simply the intrinsic discrepancy between their sampling distributions, i.e.,

δ{p1, p2} = δ{θ1, θ2} = min{ ∫_X1 p1(x | θ1) log[p1(x | θ1)/p2(x | θ2)] dx,  ∫_X2 p2(x | θ2) log[p2(x | θ2)/p1(x | θ1)] dx }.

  • The intrinsic discrepancy is an information-based, symmetric, invariant intrinsic loss.

SLIDE 58

3.2 Point and Region Estimation

  • Point estimation as a decision problem

Given a statistical model {p(x | ω), x ∈ X, ω ∈ Ω} and a quantity of interest θ = θ(ω) ∈ Θ, a point estimator θ̃ = θ̃(x) of θ is some function of the data to be regarded as a proxy for the unknown value of θ.

To choose a point estimate for θ is a decision problem, where the action space is A = Θ. Given a loss function ℓ(θ̃, θ), the posterior expected loss is
ℓ[θ̃ | x] = ∫_Θ ℓ(θ̃, θ) π(θ | x) dθ,
and the corresponding Bayes estimator is the function of the data
θ* = θ*(x) = arg inf_{θ̃∈Θ} ℓ[θ̃ | x]
which minimizes that expectation.

SLIDE 59

  • Conventional estimators

The posterior mean and the posterior mode are the Bayes estimators which respectively correspond to a quadratic and a zero-one loss function.

  • If ℓ(θ̃, θ) = (θ̃ − θ)ᵗ(θ̃ − θ), then, assuming that the mean exists, the Bayes estimator is the posterior mean E[θ | x].

  • If the loss function is a zero-one function, so that ℓ(θ̃, θ) = 0 if θ̃ belongs to a ball of radius ε centered at θ and ℓ(θ̃, θ) = 1 otherwise, then, assuming that a unique mode exists, the Bayes estimator converges to the posterior mode Mo[θ | x] as the ball radius ε tends to zero.

  • If θ is univariate and continuous, and the loss function is linear,
    ℓ(θ̃, θ) = c1(θ̃ − θ) if θ̃ ≥ θ,   ℓ(θ̃, θ) = c2(θ − θ̃) if θ̃ < θ,
    then the Bayes estimator is the posterior quantile of order c2/(c1 + c2), so that Pr[θ < θ*] = c2/(c1 + c2). In particular, if c1 = c2, the Bayes estimator is the posterior median.

Any θ value may be optimal: it all depends on the loss function.

SLIDE 60

  • Intrinsic point estimation

Given the statistical model {p(x | θ), x ∈ X, θ ∈ Θ}, the intrinsic discrepancy δ(θ1, θ2) between two parameter values θ1 and θ2 is the intrinsic discrepancy δ{p(x | θ1), p(x | θ2)} between the corresponding probability models. This is symmetric, non-negative (zero iff θ1 = θ2), invariant under reparametrization and invariant under bijections of x.

The intrinsic estimator is the reference Bayes estimator which corresponds to the loss defined by the intrinsic discrepancy:

  • The expected loss with respect to the reference posterior distribution,
    d(θ̃ | x) = ∫_Θ δ{θ̃, θ} π*(θ | x) dθ,
    is an objective measure, in information units, of the expected discrepancy between the model p(x | θ̃) and the true (unknown) model p(x | θ).

  • The intrinsic estimator θ* = θ*(x) is the value which minimizes that expected discrepancy,
    θ* = arg inf_{θ̃∈Θ} d(θ̃ | x).

SLIDE 61

  • Example: Intrinsic estimation of the Binomial parameter

Data x = {x1, . . . , xn}, random from p(x | θ) = θ^x (1 − θ)^{1−x}, r = Σ xj.
Intrinsic discrepancy δ(θ̃, θ) = n min{κ(θ̃ | θ), κ(θ | θ̃)}, with
κ(θ1 | θ2) = θ2 log(θ2/θ1) + (1 − θ2) log[(1 − θ2)/(1 − θ1)].
Reference prior π*(θ) = Be(θ | 1/2, 1/2); reference posterior π*(θ | r, n) = Be(θ | r + 1/2, n − r + 1/2).

Expected reference discrepancy d(θ̃, r, n) = ∫_0^1 δ(θ̃, θ) π*(θ | r, n) dθ.
Intrinsic estimator θ*(r, n) = arg min_{0<θ̃<1} d(θ̃, r, n). From invariance, for any bijection φ = φ(θ), φ* = φ(θ*).
Analytic approximation: θ*(r, n) ≈ (r + 1/3)/(n + 2/3), n > 2.

Example: n = 12, r = 0: θ*(0, 12) = 0.026; Me[θ | x] = 0.018, E[θ | x] = 0.038.

[Figure: θ*(0, n) as a function of n, and the reference posterior π*(θ | 0, 12).]
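
A numerical sketch (not from the slides) of the intrinsic estimator for the Binomial parameter, minimizing the reference-posterior expectation of δ(θ̃, θ) by quadrature and one-dimensional optimization; the integration endpoints and search bounds are illustrative choices.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def kdir(t1, t2):
    """Directed divergence kappa(theta1 | theta2) for one Bernoulli observation."""
    return t2 * np.log(t2 / t1) + (1 - t2) * np.log((1 - t2) / (1 - t1))

def d(theta_tilde, r, n):
    post = stats.beta(r + 0.5, n - r + 0.5)
    integrand = lambda th: n * min(kdir(theta_tilde, th), kdir(th, theta_tilde)) * post.pdf(th)
    val, _ = quad(integrand, 1e-9, 1 - 1e-9, limit=200)
    return val

r, n = 0, 12
res = minimize_scalar(lambda t: d(t, r, n), bounds=(1e-4, 0.5), method="bounded")
print(res.x)                      # ~0.026, the intrinsic estimator theta*(0, 12)
print((r + 1/3) / (n + 2/3))      # ~0.026, the slide's analytic approximation
```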

SLIDE 62

  • Intrinsic region (interval) estimation

The intrinsic q-credible region R*(q) ⊂ Θ is the q-credible reference region which corresponds to minimum expected intrinsic loss:
(i)  ∫_{R*(q)} π*(θ | x) dθ = q,
(ii) ∀θi ∈ R*(q), ∀θj ∉ R*(q),  d(θi | x) < d(θj | x).

Binomial examples, with d(θi | x) = d(θi | r, n):
r = 0, n = 12:  θ* = 0.0263,  R*_{0.95} = [0, 0.145];
r = 25, n = 100:  θ* = 0.2514,  R*_{0.95} = [0.172, 0.340].

[Figure: reference posteriors π(θ | 0, 12) and π(θ | 25, 100) with the expected discrepancies d(θ̃ | 0, 12) and d(θ̃ | 25, 100) and the corresponding intrinsic 0.95-credible regions.]

SLIDE 63

3.3 Hypothesis Testing

  • Precise hypothesis testing as a decision problem

The posterior π(θ | x) conveys intuitive information on the values of θ which are compatible with the observed data x: those with a relatively high probability density. Often a particular value θ0 is suggested for special consideration:

  • because θ = θ0 would greatly simplify the model;
  • because there are context-specific arguments suggesting that θ = θ0.

More generally, one may analyze the restriction of the parameter space Θ to a subset Θ0 which may contain more than one value. Formally, testing the hypothesis H0 ≡ {θ = θ0} is a decision problem with just two possible actions:

  • a0: to accept H0 and work with p(x | θ0);
  • a1: to reject H0 and keep the general model p(x | θ).

To proceed, a loss function ℓ(ai, θ), θ ∈ Θ, describing the possible consequences of both actions, must be specified.

SLIDE 64

  • Structure of the loss function

Given data x, the optimal action is to reject H0 (action a1) iff the expected posterior loss of accepting, ∫_Θ ℓ(a0, θ) π(θ | x) dθ, is larger than the expected posterior loss of rejecting, ∫_Θ ℓ(a1, θ) π(θ | x) dθ, i.e., iff
∫_Θ [ℓ(a0, θ) − ℓ(a1, θ)] π(θ | x) dθ = ∫_Θ ∆ℓ(θ) π(θ | x) dθ > 0.

Therefore, only the loss difference ∆ℓ(θ) = ℓ(a0, θ) − ℓ(a1, θ), which measures the advantage of rejecting H0 as a function of θ, has to be specified: the hypothesis should be rejected whenever the expected advantage of rejecting is positive.

The advantage ∆ℓ(θ) of rejecting H0 as a function of θ should be of the form ∆ℓ(θ) = l(θ0, θ) − l*, for some l* > 0, where

  • l(θ0, θ) measures the discrepancy between p(x | θ0) and p(x | θ), and
  • l* is a positive utility constant which measures the advantage of working with the simpler model when it is true.

The Bayes criterion is then: reject H0 if (and only if) ∫_Θ l(θ0, θ) π(θ | x) dθ > l*, that is, if (and only if) the expected discrepancy between p(x | θ0) and p(x | θ) is too large.

SLIDE 65

  • Bayesian Reference Criterion

A good choice for the function l(θ0, θ) is the intrinsic discrepancy
δ(θ0, θ) = min{κ(θ0 | θ), κ(θ | θ0)},   where   κ(θ0 | θ) = ∫_X p(x | θ) log[p(x | θ)/p(x | θ0)] dx.
If x = {x1, . . . , xn} ∈ X^n is a random sample from p(x | θ), then κ(θ0 | θ) = n ∫_X p(x | θ) log[p(x | θ)/p(x | θ0)] dx.

For objective results, exclusively based on the model assumptions and the data, the reference posterior distribution π*(θ | x) should be used. Hence, reject H0 if (and only if) the expected reference posterior intrinsic discrepancy d(θ0 | x) is too large:
d(θ0 | x) = ∫_Θ δ(θ0, θ) π*(θ | x) dθ > d*,   for some d* > 0.

This is the Bayesian reference criterion (BRC). The reference test statistic d(θ0 | x) is non-negative, it is invariant both under reparametrization and under sufficient transformations of the data, and it is a measure, in natural information units (nits), of the expected discrepancy between p(x | θ0) and the true model.

SLIDE 66

  • Calibration of the BRC

The reference test statistic d(θ0 | x) is the posterior expected value of the intrinsic discrepancy between p(x | θ0) and p(x | θ).

  • A value d(θ0 | x) ≈ 1 suggests that the data are clearly compatible with the hypothesis θ = θ0.
  • A value d(θ0 | x) ≈ log(10) = 2.303 nits implies that, given the data x, the average value of the likelihood ratio against the hypothesis, p(x | θ)/p(x | θ0), is expected to be about 10: mild evidence against θ0.
  • Similarly, d(θ0 | x) ≈ log(100) = 4.605 (expected likelihood ratio against θ0 about 100) indicates strong evidence against θ0, and log(1000) = 6.908 conclusive evidence against θ0.

There are strong connections between BRC and intrinsic estimation:

  • The intrinsic estimator is the value of θ which minimizes the reference test statistic, θ* = arg inf_{θ∈Θ} d(θ | x).
  • The regions defined by {θ; d(θ | x) ≤ d*} are invariant reference posterior q(d*)-credible regions for θ. For regular problems and large samples, q(log(10)) ≈ 0.95 and q(log(100)) ≈ 0.995.

SLIDE 67

  • A canonical example: Testing a value for the Normal mean

In the simplest case, where the variance σ² is known,
δ(µ0, µ) = n(µ − µ0)²/(2σ²),   π*(µ | x) = N(µ | x̄, σ/√n),
d(µ0 | x) = ½(1 + z²),   z = (x̄ − µ0)/(σ/√n).
Thus, rejecting µ = µ0 when d(µ0 | x) > d* is equivalent to rejecting when |z| > √(2d* − 1) and, hence, to a conventional two-sided frequentist test with significance level α = 2(1 − Φ(|z|)).

  d*          |z|      α
  log(10)     1.8987   0.0576
  log(100)    2.8654   0.0042
  log(1000)   3.5799   0.0003

The expected value of d(µ0 | x) if the hypothesis is true is
∫_{−∞}^{∞} ½(1 + z²) N(z | 0, 1) dz = 1.

[Figure: d(µ0 | x) = (1 + z²)/2 as a function of z.]
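
A short sketch (not from the slides) of the BRC test statistic for a Normal mean with known σ, together with the frequentist significance level implied by each threshold d*:

```python
import numpy as np
from scipy import stats

def d_stat(xbar, mu0, sigma, n):
    """Reference test statistic d(mu0 | x) = (1 + z^2)/2 for the known-sigma Normal mean."""
    z = (xbar - mu0) / (sigma / np.sqrt(n))
    return 0.5 * (1 + z ** 2)

for dstar in (np.log(10), np.log(100), np.log(1000)):
    z_crit = np.sqrt(2 * dstar - 1)
    alpha = 2 * stats.norm.sf(z_crit)
    print(round(z_crit, 4), round(alpha, 4))   # reproduces the slide's calibration table
```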

SLIDE 68

  • Fisher’s tasting tea lady

Data x = {x1, . . . , xn}, random from p(x | θ) = θx(1 − θ)1−x, r = Σxj. Intrinsic discrepancy δ(θ0, θ) = n min{k(θ0 | θ), k(θ | θ0)},

k(θ1 | θ2) = θ2 log θ2

θ1 + (1 − θ2) log 1−θ2 1−θ1 ,

π∗(θ | r, n) = Be(θ | r + 1

2, n − r + 1 2)

Intrinsic test statistic

d(θ0 | r, n) = 1

0 δ(˜

θ, θ) π∗(θ | r, n) dθ

Fisher’s example: x = {10, 10}, Test θ0 = 1/2, θ∗(x) = 0.9686 d(θ0 | 10, 10) = 5.414 = log[224] Using d∗ = log[100] = 4.61, the value θ0 = 1/2 is rejected. Pr[θ < 0.5 | x] = 0.00016

d(θ∗ | x) θ∗ Pr[θ < θ∗ | x] log[10] 0.711 0.00815 log[100] 0.547 0.00043 log[1000] 0.425 0.00003

0.4 0.5 0.6 0.7 0.8 0.9 1 2.5 5 7.5 10 12.5 15 17.5 20 0.4 0.5 0.6 0.7 0.8 0.9 1 1 2 3 4 5 6 7

π∗(θ | 10, 10) d(θ0 | 10, 10)

SLIDE 69

  • Asymptotic approximation

For large samples, the posterior approaches N(θ | θ̂, (nF(θ̂))^{−1/2}), where F(θ) is Fisher's function. Changing variables, the posterior distribution of φ = φ(θ) = ∫ F^{1/2}(θ) dθ = 2 arcsin(√θ) is approximately normal N(φ | φ̂, n^{−1/2}). Since d(θ, x) is invariant,
d(θ0 | x) ≈ ½[1 + n{φ(θ0) − φ(θ̂)}²].

  • Testing for a majority (θ0 = 1/2)

x = {720, 1500} (i.e., r = 720, n = 1500), θ*(x) = 0.4800.

  d(θ* | x)    R = (θ*_0, θ*_1)    Pr[θ ∈ R | x]
  log[10]      (0.456, 0.505)      0.9427
  log[100]     (0.443, 0.517)      0.9959
  log[1000]    (0.434, 0.526)      0.9997

Very mild evidence against θ = 0.5: d(0.5 | 720, 1500) = 1.67, Pr(θ < 0.5 | 720, 1500) = 0.9393.

[Figure: reference posterior π*(θ | x) and test statistic d(θ0 | x) as a function of θ0.]
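
A quick check (not from the slides) of the large-sample approximation above, using the variance-stabilizing transformation φ(θ) = 2 arcsin(√θ):

```python
import numpy as np

def d_approx(theta0, r, n):
    """Asymptotic intrinsic test statistic ~ (1 + n*(phi(theta0) - phi(theta_hat))^2)/2."""
    phi = lambda t: 2 * np.arcsin(np.sqrt(t))
    return 0.5 * (1 + n * (phi(theta0) - phi(r / n)) ** 2)

print(d_approx(0.5, 720, 1500))   # ~1.7, close to the exact value 1.67 quoted on the slide
```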

SLIDE 70

Basic References

Many available on line at www.uv.es/bernardo

  • Introductions

Bernardo, J. M. and Ramón, J. M. (1998). An introduction to Bayesian reference analysis. The Statistician 47, 1–35.
Bernardo, J. M. (2003). Bayesian Statistics. Encyclopedia of Life Support Systems (EOLSS): Probability and Statistics (R. Viertl, ed.). Oxford, UK: UNESCO.
Bernardo, J. M. (2005). Reference Analysis. Handbook of Statistics 25 (D. K. Dey and C. R. Rao, eds.). Amsterdam: Elsevier, 17–90.

  • Textbooks

Gelman, A., Carlin, J. B., Stern, H. and Rubin, D. B. (2003). Bayesian Data Analysis (2nd ed.). New York: CRC Press.
Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. Chichester: Wiley. 2nd ed. to appear in June 2006.

SLIDE 71

  • Research papers on reference analysis (chronological order)

Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference. J. Roy. Statist. Soc. B 41, 113–147 (with discussion). Reprinted in Bayesian Inference (N. G. Polson and G. C. Tiao, eds.), Brookfield, VT: Edward Elgar (1995), 229–263.
Berger, J. O. and Bernardo, J. M. (1992). On the development of reference priors. Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.). Oxford: University Press, 35–60 (with discussion).
Bernardo, J. M. (1997). Noninformative priors do not exist. J. Statist. Planning and Inference 65, 159–189 (with discussion).
Bernardo, J. M. and Rueda, R. (2002). Bayesian hypothesis testing: A reference approach. Internat. Statist. Rev. 70, 351–372.
Bernardo, J. M. and Juárez, M. (2003). Intrinsic estimation. Bayesian Statistics 7 (J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West, eds.). Oxford: University Press, 465–476.
Bernardo, J. M. (2005). Intrinsic credible regions: An objective Bayesian approach to interval estimation. Test 14, 317–384 (invited paper, with discussion).

SLIDE 72

Valencia International Meetings on Bayesian Statistics. Sponsored by the University of Valencia and held every four years in Spain: world forums on research and applications of Bayesian analysis.

8th Valencia International Meeting on Bayesian Statistics
Benidorm (Alicante), June 1st – 6th 2006, www.uv.es/valenciameeting

Valencia Mailing List. The Valencia Mailing List contains about 2,000 entries of people interested in Bayesian Statistics. It sends information about the Valencia Meetings and other material of interest to the Bayesian community. If you do not belong to the list and want to be included, please send your e-mail address to <valenciameeting@uv.es>.

José-Miguel Bernardo contact data: <jose.m.bernardo@uv.es>, www.uv.es/bernardo