Bayesian statistics: DS GA 1002 Probability and Statistics for Data Science (PowerPoint presentation)



slide-1
SLIDE 1

Bayesian statistics

DS GA 1002 Probability and Statistics for Data Science

http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall17

Carlos Fernandez-Granda

slide-2
SLIDE 2

Frequentist vs Bayesian statistics

In frequentist statistics the data are modeled as realizations from a distribution that depends on deterministic parameters. In Bayesian statistics the parameters are modeled as random variables. This allows us to quantify our prior uncertainty and to incorporate additional information.

slide-3
SLIDE 3

Learning Bayesian models Conjugate priors Bayesian estimators

slide-4
SLIDE 4

Prior distribution and likelihood

The data x ∈ Rⁿ are a realization of a random vector X, which depends on a vector of parameters Θ. Modeling choices:

◮ Prior distribution: distribution of Θ encoding our uncertainty about the model before seeing the data

◮ Likelihood: conditional distribution of X given Θ

slide-5
SLIDE 5

Posterior distribution

The posterior distribution is the conditional distribution of Θ given X. Evaluating it at the observed data x allows us to update our uncertainty about Θ.

slide-6
SLIDE 6

Bernoulli distribution

Goal: estimate the Bernoulli parameter from iid data. We consider two different Bayesian estimators Θ1 and Θ2:

1. Θ1 is a conservative estimator with a uniform prior pdf

   f_{Θ1}(θ) = 1 for 0 ≤ θ ≤ 1, 0 otherwise

2. Θ2 has a prior pdf skewed towards 1

   f_{Θ2}(θ) = 2 θ for 0 ≤ θ ≤ 1, 0 otherwise

slide-7
SLIDE 7

Prior distributions

[Figure: prior pdfs f_{Θ1} and f_{Θ2} on the unit interval]

slide-9
SLIDE 9

Bernoulli distribution: likelihood

The data are assumed to be iid, so the likelihood is

p_{X|Θ}(x | θ) = θ^{n1} (1 − θ)^{n0},

where n0 is the number of zeros and n1 the number of ones.

slide-14
SLIDE 14

Bernoulli distribution: posterior distribution

f_{Θ1|X}(θ | x) = f_{Θ1}(θ) p_{X|Θ1}(x | θ) / p_X(x)

  = f_{Θ1}(θ) p_{X|Θ1}(x | θ) / ∫_u f_{Θ1}(u) p_{X|Θ1}(x | u) du

  = θ^{n1} (1 − θ)^{n0} / ∫_u u^{n1} (1 − u)^{n0} du

  = θ^{n1} (1 − θ)^{n0} / β(n1 + 1, n0 + 1),

where β(a, b) := ∫_u u^{a−1} (1 − u)^{b−1} du.
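As a sanity check, the posterior under the uniform prior can be evaluated numerically with only the standard library. A minimal sketch, assuming hypothetical example counts n0 = 3, n1 = 1 (not from the slides):

```python
import math

# Hypothetical observed counts for illustration: n0 zeros, n1 ones.
n0, n1 = 3, 1

# Unnormalized posterior under the uniform prior: theta^n1 * (1 - theta)^n0.
def unnormalized(theta):
    return theta**n1 * (1 - theta)**n0

# Normalizing constant beta(n1 + 1, n0 + 1), computed via the gamma function.
beta_const = math.gamma(n1 + 1) * math.gamma(n0 + 1) / math.gamma(n0 + n1 + 2)

def posterior(theta):
    return unnormalized(theta) / beta_const

# Sanity checks: the posterior integrates to 1 (trapezoidal rule), and its
# mean matches the known beta mean (n1 + 1) / (n0 + n1 + 2).
h = 1 / 10000
grid = [i * h for i in range(10001)]
total = sum(posterior(t) for t in grid[1:-1]) * h + (posterior(0) + posterior(1)) * h / 2
mean = sum(t * posterior(t) for t in grid) * h
print(round(total, 3), round(mean, 3))
```

The numerically computed mean agrees with the closed-form beta mean, which is also the MMSE estimate discussed later in the deck.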

slide-18
SLIDE 18

Bernoulli distribution: posterior distribution

f_{Θ2|X}(θ | x) = f_{Θ2}(θ) p_{X|Θ2}(x | θ) / p_X(x)

  = θ^{n1+1} (1 − θ)^{n0} / ∫_u u^{n1+1} (1 − u)^{n0} du

  = θ^{n1+1} (1 − θ)^{n0} / β(n1 + 2, n0 + 1),

where β(a, b) := ∫_u u^{a−1} (1 − u)^{b−1} du.

slide-19
SLIDE 19

Bernoulli distribution: n0 = 1, n1 = 3

[Figure: posterior pdfs for both priors]

slide-20
SLIDE 20

Bernoulli distribution: n0 = 3, n1 = 1

[Figure: posterior pdfs for both priors]

slide-21
SLIDE 21

Bernoulli distribution: n0 = 91, n1 = 9

[Figure: posterior pdfs; markers show the posterior mean (uniform prior), the posterior mean (skewed prior), and the ML estimator]

slide-22
SLIDE 22

Learning Bayesian models Conjugate priors Bayesian estimators

slide-23
SLIDE 23

Beta random variable

Useful in Bayesian statistics: a unimodal continuous distribution on the unit interval. The pdf of a beta distribution with parameters a and b is

f_β(θ; a, b) := θ^{a−1} (1 − θ)^{b−1} / β(a, b) for 0 ≤ θ ≤ 1, 0 otherwise,

where β(a, b) := ∫_u u^{a−1} (1 − u)^{b−1} du.

slide-24
SLIDE 24

Beta random variables

[Figure: beta pdfs fX(x) for (a, b) = (1, 1), (1, 2), (3, 3), (6, 2), (3, 15)]

slide-25
SLIDE 25

Learning a Bernoulli distribution

The first prior is beta with parameters a = 1 and b = 1. The second prior is beta with parameters a = 2 and b = 1. The posteriors are beta with parameters a = n1 + 1, b = n0 + 1 and a = n1 + 2, b = n0 + 1, respectively.

slide-26
SLIDE 26

Conjugate priors

A conjugate family of distributions for a certain likelihood satisfies the following property: if the prior belongs to the family, the posterior also belongs to the family. Beta distributions are conjugate priors when the likelihood is binomial.

slide-32
SLIDE 32

The beta distribution is conjugate to the binomial likelihood

Θ is beta with parameters a and b; X is binomial with parameters n and Θ.

f_{Θ|X}(θ | x) = f_Θ(θ) p_{X|Θ}(x | θ) / p_X(x)

  = f_Θ(θ) p_{X|Θ}(x | θ) / ∫_u f_Θ(u) p_{X|Θ}(x | u) du

  = θ^{a−1} (1 − θ)^{b−1} (n choose x) θ^x (1 − θ)^{n−x} / ∫_u u^{a−1} (1 − u)^{b−1} (n choose x) u^x (1 − u)^{n−x} du

  = θ^{x+a−1} (1 − θ)^{n−x+b−1} / ∫_u u^{x+a−1} (1 − u)^{n−x+b−1} du

  = f_β(θ; x + a, n − x + b)
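The conjugacy claim can be checked numerically: normalizing the product of a beta prior and a binomial likelihood reproduces the beta(x + a, n − x + b) pdf. A minimal sketch, with hypothetical values a = 3, b = 2, n = 10, x = 7 (not from the slides):

```python
import math

# Hypothetical prior parameters and data, for illustration only.
a, b = 3.0, 2.0   # beta prior parameters
n, x = 10, 7      # binomial trials and observed successes

def beta_fn(p, q):
    # beta(p, q) = Gamma(p) Gamma(q) / Gamma(p + q)
    return math.gamma(p) * math.gamma(q) / math.gamma(p + q)

def beta_pdf(theta, p, q):
    return theta**(p - 1) * (1 - theta)**(q - 1) / beta_fn(p, q)

def posterior_direct(theta):
    # Prior times likelihood, normalized by numerical integration on a grid.
    def unnorm(t):
        return beta_pdf(t, a, b) * math.comb(n, x) * t**x * (1 - t)**(n - x)
    h = 1e-4
    norm = sum(unnorm(i * h) for i in range(1, 10000)) * h
    return unnorm(theta) / norm

# The directly computed posterior should match beta(x + a, n - x + b).
theta = 0.6
direct = posterior_direct(theta)
closed_form = beta_pdf(theta, x + a, n - x + b)
print(round(direct, 3), round(closed_form, 3))
```

The two values agree up to numerical integration error, which is the conjugacy property stated on the slide.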

slide-33
SLIDE 33

Poll in New Mexico

429 participants: 227 people intend to vote for Clinton and 202 for Trump. What is the probability that Trump wins in New Mexico? Assumptions:

◮ The fraction of Trump voters is modeled as a random variable Θ
◮ Poll participants are selected uniformly at random with replacement
◮ The number of Trump voters in the poll is binomial with parameters n = 429 and p = Θ

slide-34
SLIDE 34

Poll in New Mexico

◮ The prior is uniform, so beta with parameters a = 1 and b = 1
◮ The likelihood is binomial
◮ The posterior is beta with parameters a = 202 + 1 and b = 227 + 1
◮ The probability that Trump wins in New Mexico is the probability that Θ given the data is greater than 0.5
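The posterior probability that Θ exceeds 0.5 can be estimated with a short Monte Carlo sketch using only the standard library's beta sampler (the exact value could also be read off the beta cdf):

```python
import random

random.seed(0)

# Posterior from the poll: beta(a, b) with a = 202 + 1, b = 227 + 1.
a, b = 203, 228

# Monte Carlo estimate of P(Theta > 0.5 | data).
samples = 200_000
p_trump_wins = sum(random.betavariate(a, b) > 0.5 for _ in range(samples)) / samples
print(p_trump_wins)  # roughly 0.11, consistent with the 11.4% shown on the figure
```

With 200,000 samples the standard error of the estimate is under 0.001, so the Monte Carlo answer is reliable to two decimal places.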

slide-35
SLIDE 35

Poll in New Mexico

[Figure: posterior pdf of Θ; 88.6% of the posterior mass lies below 0.5 and 11.4% above]

slide-36
SLIDE 36

Learning Bayesian models Conjugate priors Bayesian estimators

slide-37
SLIDE 37

Bayesian estimators

What estimator should we use? Two main options:

◮ The posterior mean
◮ The posterior mode

slide-38
SLIDE 38

Posterior mean

Mean of the posterior distribution:

θ_MMSE(x) := E(Θ | X = x)

This is the minimum mean-square-error (MMSE) estimate: for any arbitrary estimator θ_other(x),

E((θ_other(X) − Θ)²) ≥ E((θ_MMSE(X) − Θ)²)

slide-42
SLIDE 42

Posterior mean

E((θ_other(X) − Θ)² | X = x)

  = E((θ_other(X) − θ_MMSE(X) + θ_MMSE(X) − Θ)² | X = x)

  = (θ_other(x) − θ_MMSE(x))² + E((θ_MMSE(X) − Θ)² | X = x)
    + 2 (θ_other(x) − θ_MMSE(x)) (θ_MMSE(x) − E(Θ | X = x))

  = (θ_other(x) − θ_MMSE(x))² + E((θ_MMSE(X) − Θ)² | X = x)

slide-46
SLIDE 46

Posterior mean

By iterated expectation,

E((θ_other(X) − Θ)²) = E( E((θ_other(X) − Θ)² | X) )

  = E((θ_other(X) − θ_MMSE(X))²) + E( E((θ_MMSE(X) − Θ)² | X) )

  = E((θ_other(X) − θ_MMSE(X))²) + E((θ_MMSE(X) − Θ)²)

  ≥ E((θ_MMSE(X) − Θ)²)
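The MMSE optimality of the posterior mean can be illustrated by simulation: draw Θ from the uniform prior, generate Bernoulli data, and compare the Bayes MSE of the posterior mean (n1 + 1)/(n + 2) against another estimator, here the ML estimate n1/n. A minimal sketch, with hypothetical n = 5:

```python
import random

random.seed(1)

# Theta ~ uniform(0, 1) prior; X = n iid Bernoulli(Theta) draws per trial.
n, trials = 5, 100_000
se_mmse = se_ml = 0.0
for _ in range(trials):
    theta = random.random()
    n1 = sum(random.random() < theta for _ in range(n))
    se_mmse += ((n1 + 1) / (n + 2) - theta) ** 2   # posterior mean (uniform prior)
    se_ml += (n1 / n - theta) ** 2                  # ML estimate
mse_mmse, mse_ml = se_mmse / trials, se_ml / trials
print(mse_mmse < mse_ml)  # the posterior mean achieves lower Bayes MSE
```

The gap is largest for small n, where the prior contributes most of the information.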

slide-47
SLIDE 47

Bernoulli distribution: n0 = 1, n1 = 3

[Figure: posterior pdfs for both priors]

slide-48
SLIDE 48

Bernoulli distribution: n0 = 3, n1 = 1

[Figure: posterior pdfs for both priors]

slide-49
SLIDE 49

Bernoulli distribution: n0 = 91, n1 = 9

[Figure: posterior pdfs; markers show the posterior mean (uniform prior), the posterior mean (skewed prior), and the ML estimator]

slide-50
SLIDE 50

Posterior mode

The maximum-a-posteriori (MAP) estimator is the mode of the posterior distribution:

θ_MAP(x) := arg max_θ p_{Θ|X}(θ | x) if Θ is discrete

θ_MAP(x) := arg max_θ f_{Θ|X}(θ | x) if Θ is continuous

slide-55
SLIDE 55

Maximum-likelihood estimator

If the prior is uniform, the ML estimator coincides with the MAP estimator:

arg max_θ f_{Θ|X}(θ | x) = arg max_θ f_Θ(θ) f_{X|Θ}(x | θ) / ∫_u f_Θ(u) f_{X|Θ}(x | u) du

  = arg max_θ f_{X|Θ}(x | θ)

  = arg max_θ L_x(θ)

Note: uniform priors are only well defined over bounded domains.
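A quick numerical illustration: with a uniform prior on [0, 1] the posterior is proportional to the likelihood, so maximizing either over a grid gives the same estimate. A sketch with hypothetical Bernoulli counts n1 = 7 out of n = 10:

```python
# Hypothetical Bernoulli data for illustration.
n, n1 = 10, 7

grid = [i / 1000 for i in range(1, 1000)]

def likelihood(theta):
    return theta**n1 * (1 - theta)**(n - n1)

def unnormalized_posterior(theta):
    prior = 1.0  # uniform prior pdf on [0, 1]
    return prior * likelihood(theta)

# The argmax of the likelihood and of the (unnormalized) posterior coincide.
theta_ml = max(grid, key=likelihood)
theta_map = max(grid, key=unnormalized_posterior)
print(theta_ml, theta_map)  # both 0.7 = n1 / n
```

Normalizing constants do not affect the argmax, which is why they can be dropped here.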
slide-61
SLIDE 61

Probability of error

If Θ is discrete, the MAP estimator minimizes the probability of error: for any arbitrary estimator θ_other(x),

P(θ_other(X) ≠ Θ) ≥ P(θ_MAP(X) ≠ Θ)

This follows because, for any estimator,

P(Θ = θ_other(X)) = ∫_x f_X(x) P(Θ = θ_other(x) | X = x) dx

  = ∫_x f_X(x) p_{Θ|X}(θ_other(x) | x) dx

  ≤ ∫_x f_X(x) p_{Θ|X}(θ_MAP(x) | x) dx

  = P(Θ = θ_MAP(X))

slide-62
SLIDE 62

Sending bits

Model for a communication channel: the signal Θ encodes a single bit. Prior knowledge indicates that a 0 is 3 times more likely than a 1:

pΘ(1) = 1/4, pΘ(0) = 3/4.

The channel is noisy, so we send the signal n times. At the receptor we observe

Xi = Θ + Zi, 1 ≤ i ≤ n,

where Z is iid standard Gaussian.

slide-63
SLIDE 63

Sending bits: ML estimator

The likelihood is equal to

L_x(θ) = ∏_{i=1}^n f_{Xi|Θ}(xi | θ) = ∏_{i=1}^n (1/√(2π)) e^{−(xi − θ)² / 2}

The log-likelihood is equal to

log L_x(θ) = −∑_{i=1}^n (xi − θ)² / 2 − (n/2) log 2π
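The log-likelihood formula can be verified against the log of the product of Gaussian densities directly. A minimal sketch with hypothetical simulated data (θ = 1, n = 8):

```python
import math
import random

random.seed(2)

# Hypothetical received signal: theta = 1 plus standard Gaussian noise.
n, theta = 8, 1.0
xs = [theta + random.gauss(0, 1) for _ in range(n)]

# Log of the product of Gaussian densities, term by term.
log_prod = sum(
    math.log(math.exp(-(x - theta) ** 2 / 2) / math.sqrt(2 * math.pi)) for x in xs
)

# Closed-form log-likelihood from the slide.
formula = -sum((x - theta) ** 2 / 2 for x in xs) - (n / 2) * math.log(2 * math.pi)
print(abs(log_prod - formula) < 1e-9)  # True
```

Working with the log-likelihood avoids the numerical underflow that the raw product would suffer for large n.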

slide-64
SLIDE 64

Sending bits: ML estimator

θ_ML(x) = 1 if

log L_x(1) = −∑_{i=1}^n (xi² − 2 xi + 1) / 2 − (n/2) log 2π
  ≥ −∑_{i=1}^n xi² / 2 − (n/2) log 2π = log L_x(0)

Equivalently,

θ_ML(x) = 1 if (1/n) ∑_{i=1}^n xi > 1/2, 0 otherwise
slide-68
SLIDE 68

Sending bits: ML estimator

The probability of error is

P(Θ ≠ θ_ML(X)) = P(Θ ≠ θ_ML(X) | Θ = 0) P(Θ = 0) + P(Θ ≠ θ_ML(X) | Θ = 1) P(Θ = 1)

  = P((1/n) ∑_{i=1}^n Xi > 1/2 | Θ = 0) P(Θ = 0) + P((1/n) ∑_{i=1}^n Xi < 1/2 | Θ = 1) P(Θ = 1)

  = Q(√n / 2)
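The closed-form error probability Q(√n/2) can be checked by simulating the channel and the ML decision rule. A minimal sketch with hypothetical n = 4 (Q computed from the standard library's `erfc`):

```python
import math
import random

random.seed(3)

def q_function(z):
    # Gaussian tail probability Q(z) = P(N(0,1) > z)
    return 0.5 * math.erfc(z / math.sqrt(2))

# Empirical error rate of the ML rule (decide 1 if the sample mean
# exceeds 1/2) versus the closed form Q(sqrt(n)/2).
n, trials = 4, 100_000
errors = 0
for _ in range(trials):
    theta = 0 if random.random() < 0.75 else 1  # prior p(0) = 3/4, p(1) = 1/4
    mean = sum(theta + random.gauss(0, 1) for _ in range(n)) / n
    decision = 1 if mean > 0.5 else 0
    errors += decision != theta
print(errors / trials, q_function(math.sqrt(n) / 2))  # both near 0.159
```

Both conditional error probabilities equal Q(√n/2), so the overall error rate does not depend on the prior weights, which the simulation reflects.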

slide-72
SLIDE 72

Sending bits: MAP estimator

The logarithm of the posterior is equal to

log p_{Θ|X}(θ | x) = log( ∏_{i=1}^n f_{Xi|Θ}(xi | θ) pΘ(θ) / f_X(x) )

  = ∑_{i=1}^n log f_{Xi|Θ}(xi | θ) + log pΘ(θ) − log f_X(x)

  = −∑_{i=1}^n (xi² − 2 xi θ + θ²) / 2 − (n/2) log 2π + log pΘ(θ) − log f_X(x)

slide-73
SLIDE 73

Sending bits: MAP estimator

θ_MAP(x) = 1 if

log p_{Θ|X}(1 | x) + log f_X(x) = −∑_{i=1}^n (xi² − 2 xi + 1) / 2 − (n/2) log 2π − log 4
  ≥ −∑_{i=1}^n xi² / 2 − (n/2) log 2π − log 4 + log 3 = log p_{Θ|X}(0 | x) + log f_X(x).

Equivalently,

θ_MAP(x) = 1 if (1/n) ∑_{i=1}^n xi > 1/2 + (log 3)/n, 0 otherwise.
slide-77
SLIDE 77

Sending bits: MAP estimator

The probability of error is

P(Θ ≠ θ_MAP(X)) = P(Θ ≠ θ_MAP(X) | Θ = 0) P(Θ = 0) + P(Θ ≠ θ_MAP(X) | Θ = 1) P(Θ = 1)

  = P((1/n) ∑_{i=1}^n Xi > 1/2 + (log 3)/n | Θ = 0) P(Θ = 0)
    + P((1/n) ∑_{i=1}^n Xi < 1/2 + (log 3)/n | Θ = 1) P(Θ = 1)

  = (3/4) Q(√n / 2 + (log 3)/√n) + (1/4) Q(√n / 2 − (log 3)/√n)
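Since the MAP estimator minimizes the probability of error, its closed-form error should never exceed the ML error Q(√n/2). A short sketch comparing the two formulas over a range of n:

```python
import math

def q_function(z):
    # Gaussian tail probability Q(z) = P(N(0,1) > z)
    return 0.5 * math.erfc(z / math.sqrt(2))

def error_ml(n):
    return q_function(math.sqrt(n) / 2)

def error_map(n):
    shift = math.log(3) / math.sqrt(n)
    return (0.75 * q_function(math.sqrt(n) / 2 + shift)
            + 0.25 * q_function(math.sqrt(n) / 2 - shift))

# The MAP error is at most the ML error for every n.
for n in range(1, 21):
    assert error_map(n) <= error_ml(n)
print(round(error_ml(4), 3), round(error_map(4), 3))
```

The advantage of the MAP rule shrinks as n grows, since the shift (log 3)/n in the decision threshold vanishes, which matches the figure on the next slide.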

slide-78
SLIDE 78

Sending bits: Probability of error

[Figure: probability of error versus n (up to 20) for the ML and MAP estimators]