Bayesian statistics
DS GA 1002: Statistical and Mathematical Models
http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall15
Carlos Fernandez-Granda
Frequentist vs Bayesian statistics

In frequentist statistics the data are modeled as realizations from a distribution that depends on deterministic parameters

In Bayesian statistics the parameters are modeled as random variables, which allows us to quantify our prior uncertainty and to incorporate additional information
◮ Learning Bayesian models
◮ Conjugate priors
◮ Bayesian estimators
Prior distribution and likelihood

The data x ∈ R^n are a realization of a random vector X, which depends on a vector of parameters Θ

Modeling choices:
◮ Prior distribution: distribution of Θ encoding our uncertainty about the model before seeing the data
◮ Likelihood: conditional distribution of X given Θ
Posterior distribution

The posterior distribution is the conditional distribution of Θ given X

Evaluating the posterior at the data x allows us to update our uncertainty about Θ using the data
Bernoulli distribution

Goal: estimating a Bernoulli parameter from iid data

We consider two different Bayesian estimators Θ1 and Θ2:

1. Θ1 is a conservative estimator with a uniform prior pdf

   f_{Θ1}(θ) = 1 for 0 ≤ θ ≤ 1, 0 otherwise

2. Θ2 has a prior pdf skewed towards 1

   f_{Θ2}(θ) = 2θ for 0 ≤ θ ≤ 1, 0 otherwise
Prior distributions

[Figure: the uniform prior f_{Θ1} and the skewed prior f_{Θ2} plotted over θ ∈ [0, 1]]
Bernoulli distribution: likelihood

The data are assumed to be iid, so the likelihood is

p_{X|Θ}(x | θ) = θ^{n1} (1 − θ)^{n0},

where n0 is the number of zeros and n1 the number of ones
Bernoulli distribution: posterior distribution
fΘ1|
X (θ|
x)
Bernoulli distribution: posterior distribution
fΘ1|
X (θ|
x) = fΘ1 (θ) p
X|Θ1 (
x|θ) p
X (
x)
Bernoulli distribution: posterior distribution
fΘ1|
X (θ|
x) = fΘ1 (θ) p
X|Θ1 (
x|θ) p
X (
x) = fΘ1 (θ) p
X|Θ1 (
x|θ)
- u fΘ1 (u) p
X|Θ1 (
x|u) du
Bernoulli distribution: posterior distribution
fΘ1|
X (θ|
x) = fΘ1 (θ) p
X|Θ1 (
x|θ) p
X (
x) = fΘ1 (θ) p
X|Θ1 (
x|θ)
- u fΘ1 (u) p
X|Θ1 (
x|u) du = θn1 (1 − θ)n0
- u un1 (1 − u)n0 du
Bernoulli distribution: posterior distribution
fΘ1|
X (θ|
x) = fΘ1 (θ) p
X|Θ1 (
x|θ) p
X (
x) = fΘ1 (θ) p
X|Θ1 (
x|θ)
- u fΘ1 (u) p
X|Θ1 (
x|u) du = θn1 (1 − θ)n0
- u un1 (1 − u)n0 du
= θn1 (1 − θ)n0 β (n1 + 1, n0 + 1) β (a, b) :=
- u
ua−1 (1 − u)b−1 du
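As a quick numerical check (a sketch in Python with hypothetical counts n1 = 3, n0 = 1), the normalizing constant of this posterior can be obtained by direct integration and compared against scipy's beta function:

```python
from scipy.special import beta as beta_fn
from scipy.integrate import quad

n1, n0 = 3, 1  # hypothetical counts of ones and zeros

# Normalizing constant of the posterior, by direct numerical integration
def unnormalized(theta):
    return theta**n1 * (1 - theta)**n0

Z, _ = quad(unnormalized, 0, 1)
print(Z, beta_fn(n1 + 1, n0 + 1))  # both ≈ 0.05 = β(4, 2)
```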
Bernoulli distribution: posterior distribution

f_{Θ2|X}(θ | x) = f_{Θ2}(θ) p_{X|Θ2}(x | θ) / p_X(x)

  = θ^{n1+1} (1 − θ)^{n0} / ∫_u u^{n1+1} (1 − u)^{n0} du

  = θ^{n1+1} (1 − θ)^{n0} / β(n1 + 2, n0 + 1)
Bernoulli distribution: n0 = 1, n1 = 3
[Figure: the two posterior pdfs over θ ∈ [0, 1]]

Bernoulli distribution: n0 = 3, n1 = 1
[Figure: the two posterior pdfs over θ ∈ [0, 1]]

Bernoulli distribution: n0 = 91, n1 = 9
[Figure: the two posterior pdfs over θ ∈ [0, 1]]

Legend: posterior mean (uniform prior), posterior mean (skewed prior), ML estimator
◮ Learning Bayesian models
◮ Conjugate priors
◮ Bayesian estimators
Beta distribution

The pdf of a beta distribution with parameters a and b is defined as

f_β(θ; a, b) := θ^{a−1} (1 − θ)^{b−1} / β(a, b) if 0 ≤ θ ≤ 1, 0 otherwise,

where β(a, b) := ∫_u u^{a−1} (1 − u)^{b−1} du
Learning a Bernoulli distribution

The first prior is beta with parameters a = 1 and b = 1; the second prior is beta with parameters a = 2 and b = 1

The posteriors are beta with parameters a = n1 + 1, b = n0 + 1 and a = n1 + 2, b = n0 + 1, respectively
Conjugate priors

A conjugate family of distributions for a certain likelihood satisfies the following property: if the prior belongs to the family, the posterior also belongs to the family

Beta distributions are conjugate priors when the likelihood is binomial
The beta distribution is conjugate to the binomial likelihood

Θ is beta with parameters a and b; X is binomial with parameters n and Θ

f_{Θ|X}(θ | x) = f_Θ(θ) p_{X|Θ}(x | θ) / p_X(x)

  = f_Θ(θ) p_{X|Θ}(x | θ) / ∫_u f_Θ(u) p_{X|Θ}(x | u) du

  = θ^{a−1} (1 − θ)^{b−1} (n choose x) θ^x (1 − θ)^{n−x} / ∫_u u^{a−1} (1 − u)^{b−1} (n choose x) u^x (1 − u)^{n−x} du

  = θ^{x+a−1} (1 − θ)^{n−x+b−1} / ∫_u u^{x+a−1} (1 − u)^{n−x+b−1} du

  = f_β(θ; x + a, n − x + b)
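The conjugacy identity can be verified numerically. This is a sketch with hypothetical values a = 2, b = 1, n = 10, x = 7, checking that the prior times the likelihood is proportional to the Beta(x + a, n − x + b) pdf:

```python
import numpy as np
from scipy.stats import beta, binom

a, b = 2.0, 1.0   # hypothetical prior parameters
n, x = 10, 7      # hypothetical binomial observation

grid = np.linspace(0.001, 0.999, 999)

# Unnormalized posterior: beta prior pdf times binomial likelihood
unnorm = beta.pdf(grid, a, b) * binom.pmf(x, n, grid)

# If conjugacy holds, this ratio is constant in theta
# (it equals the normalizing constant p_X(x))
ratio = unnorm / beta.pdf(grid, x + a, n - x + b)
print(ratio.std() / ratio.mean())  # ≈ 0, i.e. the ratio is constant
```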
Poll in New Mexico

449 participants: 227 people intend to vote for Clinton and 202 for Trump

What is the probability that Trump wins in New Mexico? Assumptions:

◮ The fraction of Trump voters is modeled as a random variable Θ
◮ Poll participants are selected uniformly at random with replacement
◮ The number of Trump voters in the poll is binomial with parameters n = 449 and p = Θ

Poll in New Mexico

◮ The prior is uniform, so beta with parameters a = 1 and b = 1
◮ The likelihood is binomial
◮ The posterior is beta with parameters a = 202 + 1 and b = 227 + 1
◮ The probability that Trump wins in New Mexico is the probability that Θ is greater than 0.5 given the data
Poll in New Mexico

[Figure: posterior pdf of Θ; 88.6% of the probability mass lies below 0.5 and 11.4% above]
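The 11.4% figure can be reproduced with scipy's beta distribution (a sketch; `sf` is the survival function, 1 − cdf):

```python
from scipy.stats import beta

# Posterior for the fraction of Trump voters: Beta(202 + 1, 227 + 1)
posterior = beta(202 + 1, 227 + 1)

# Probability that Trump wins: P(Theta > 0.5 | data)
p_win = posterior.sf(0.5)
print(p_win)  # ≈ 0.114, matching the 11.4% on the slide
```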
◮ Learning Bayesian models
◮ Conjugate priors
◮ Bayesian estimators
Bayesian estimators

What estimator should we use? Two main options:

◮ The posterior mean
◮ The posterior mode
Posterior mean

The posterior mean is the mean of the posterior distribution:

θ_MMSE(x) := E[Θ | X = x]

It is the minimum mean-square-error (MMSE) estimate: for any arbitrary estimator θ_other(x),

E[(θ_other(X) − Θ)^2] ≥ E[(θ_MMSE(X) − Θ)^2]
Posterior mean

E[(θ_other(X) − Θ)^2 | X = x]

  = E[(θ_other(X) − θ_MMSE(X) + θ_MMSE(X) − Θ)^2 | X = x]

  = (θ_other(x) − θ_MMSE(x))^2 + E[(θ_MMSE(X) − Θ)^2 | X = x]
    + 2 (θ_other(x) − θ_MMSE(x)) (θ_MMSE(x) − E[Θ | X = x])

  = (θ_other(x) − θ_MMSE(x))^2 + E[(θ_MMSE(X) − Θ)^2 | X = x],

where the cross term vanishes because θ_MMSE(x) = E[Θ | X = x] by definition
Posterior mean

By iterated expectation,

E[(θ_other(X) − Θ)^2] = E[ E[(θ_other(X) − Θ)^2 | X] ]

  = E[(θ_other(X) − θ_MMSE(X))^2] + E[ E[(θ_MMSE(X) − Θ)^2 | X] ]

  = E[(θ_other(X) − θ_MMSE(X))^2] + E[(θ_MMSE(X) − Θ)^2]

  ≥ E[(θ_MMSE(X) − Θ)^2]
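The MMSE optimality can be illustrated with a small Monte Carlo simulation, a sketch under the uniform-prior Bernoulli setup from earlier: draw Θ from its prior, simulate data, and compare the mean-square error of the posterior mean against the ML estimate (the sample mean).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 100_000

# Draw Theta from the uniform prior, then n Bernoulli(Theta) observations each
theta = rng.uniform(0, 1, trials)
x = rng.random((trials, n)) < theta[:, None]
n1 = x.sum(axis=1)

# Posterior mean under the uniform prior (Beta(n1 + 1, n0 + 1)) vs ML estimate
mmse = (n1 + 1) / (n + 2)
ml = n1 / n

mse_mmse = np.mean((mmse - theta) ** 2)
mse_ml = np.mean((ml - theta) ** 2)
print(mse_mmse, mse_ml)  # the posterior mean achieves the lower MSE
```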
Bernoulli distribution: posterior means vs ML estimate

[Figures repeated for n0 = 1, n1 = 3; n0 = 3, n1 = 1; and n0 = 91, n1 = 9, comparing the posterior mean (uniform prior), the posterior mean (skewed prior), and the ML estimator]
Posterior mode

The maximum-a-posteriori (MAP) estimator is the mode of the posterior distribution:

θ_MAP(x) := arg max_θ p_{Θ|X}(θ | x) if Θ is discrete, and

θ_MAP(x) := arg max_θ f_{Θ|X}(θ | x) if Θ is continuous
Maximum-likelihood estimator

If the prior is uniform the ML estimator coincides with the MAP estimator:

arg max_θ f_{Θ|X}(θ | x) = arg max_θ f_Θ(θ) f_{X|Θ}(x | θ) / ∫_u f_Θ(u) f_{X|Θ}(x | u) du

  = arg max_θ f_{X|Θ}(x | θ)

  = arg max_θ L_x(θ)

Uniform priors are only well defined over bounded domains
Probability of error

If Θ is discrete, the MAP estimator minimizes the probability of error: for any arbitrary estimator θ_other(x),

P(θ_other(X) ≠ Θ) ≥ P(θ_MAP(X) ≠ Θ)

Equivalently, the MAP estimator maximizes the probability of a correct guess:

P(Θ = θ_other(X)) = ∫_x f_X(x) P(Θ = θ_other(x) | X = x) dx

  = ∫_x f_X(x) p_{Θ|X}(θ_other(x) | x) dx

  ≤ ∫_x f_X(x) p_{Θ|X}(θ_MAP(x) | x) dx

  = P(Θ = θ_MAP(X))
Sending bits

Model for a communication channel: the signal Θ encodes a single bit

Prior knowledge indicates that a 0 is 3 times more likely than a 1:

p_Θ(1) = 1/4,  p_Θ(0) = 3/4

The channel is noisy, so we send the signal n times. At the receiver we observe

X_i = Θ + Z_i,  1 ≤ i ≤ n,

where the Z_i are iid standard Gaussian
Sending bits: ML estimator

The likelihood is equal to

L_x(θ) = ∏_{i=1}^n f_{X_i|Θ}(x_i | θ) = ∏_{i=1}^n (1/√(2π)) e^{−(x_i − θ)^2 / 2}

The log-likelihood is equal to

log L_x(θ) = − Σ_{i=1}^n (x_i − θ)^2 / 2 − (n/2) log 2π
Sending bits: ML estimator

θ_ML(x) = 1 if

log L_x(1) = − Σ_{i=1}^n (x_i^2 − 2 x_i + 1) / 2 − (n/2) log 2π
  ≥ − Σ_{i=1}^n x_i^2 / 2 − (n/2) log 2π = log L_x(0)

Equivalently,

θ_ML(x) = 1 if (1/n) Σ_{i=1}^n x_i > 1/2, 0 otherwise
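The decision rule above is simply a threshold on the sample mean; a minimal sketch:

```python
import numpy as np

def ml_decision(x):
    """ML estimate of the transmitted bit: 1 if the sample mean exceeds 1/2."""
    return int(np.mean(x) > 0.5)

# Noiseless checks of the threshold rule
print(ml_decision([1.0, 1.0, 1.0]))  # 1
print(ml_decision([0.0, 0.0, 0.0]))  # 0
```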
Sending bits: ML estimator

The probability of error is

P(Θ ≠ θ_ML(X)) = P(Θ ≠ θ_ML(X) | Θ = 0) P(Θ = 0) + P(Θ ≠ θ_ML(X) | Θ = 1) P(Θ = 1)

  = P((1/n) Σ_{i=1}^n X_i > 1/2 | Θ = 0) P(Θ = 0) + P((1/n) Σ_{i=1}^n X_i < 1/2 | Θ = 1) P(Θ = 1)

  = Q(√n / 2)
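A Monte Carlo sketch (with assumed values n = 4 and 200,000 trials) agrees with the Q(√n/2) formula; Q is the standard Gaussian tail function, available as `scipy.stats.norm.sf`:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, trials = 4, 200_000  # assumed values for the sketch

# Draw the bit from its prior, add standard Gaussian noise, decode with ML
theta = (rng.random(trials) < 0.25).astype(float)  # P(Theta = 1) = 1/4
x = theta[:, None] + rng.standard_normal((trials, n))
decisions = (x.mean(axis=1) > 0.5).astype(float)

empirical = np.mean(decisions != theta)
analytic = norm.sf(np.sqrt(n) / 2)  # Q(sqrt(n)/2)
print(empirical, analytic)
```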
Sending bits: MAP estimator

The logarithm of the posterior is equal to

log p_{Θ|X}(θ | x) = log( ∏_{i=1}^n f_{X_i|Θ}(x_i | θ) p_Θ(θ) / f_X(x) )

  = Σ_{i=1}^n log f_{X_i|Θ}(x_i | θ) + log p_Θ(θ) − log f_X(x)

  = − Σ_{i=1}^n (x_i^2 − 2 x_i θ + θ^2) / 2 − (n/2) log 2π + log p_Θ(θ) − log f_X(x)
Sending bits: MAP estimator

θ_MAP(x) = 1 if

log p_{Θ|X}(1 | x) + log f_X(x) = − Σ_{i=1}^n (x_i^2 − 2 x_i + 1) / 2 − (n/2) log 2π − log 4
  ≥ − Σ_{i=1}^n x_i^2 / 2 − (n/2) log 2π − log 4 + log 3 = log p_{Θ|X}(0 | x) + log f_X(x)

Equivalently,

θ_MAP(x) = 1 if (1/n) Σ_{i=1}^n x_i > 1/2 + (log 3)/n, 0 otherwise
Sending bits: MAP estimator

The probability of error is

P(Θ ≠ θ_MAP(X)) = P(Θ ≠ θ_MAP(X) | Θ = 0) P(Θ = 0) + P(Θ ≠ θ_MAP(X) | Θ = 1) P(Θ = 1)

  = P((1/n) Σ_{i=1}^n X_i > 1/2 + (log 3)/n | Θ = 0) P(Θ = 0)
    + P((1/n) Σ_{i=1}^n X_i < 1/2 + (log 3)/n | Θ = 1) P(Θ = 1)

  = (3/4) Q(√n / 2 + (log 3)/√n) + (1/4) Q(√n / 2 − (log 3)/√n)
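Evaluating the two error formulas side by side (a sketch; Q is again `scipy.stats.norm.sf`) shows that the MAP estimator's average error never exceeds the ML estimator's, as its optimality guarantees:

```python
import numpy as np
from scipy.stats import norm

Q = norm.sf  # Gaussian tail function Q(z) = P(Z > z)

ns = np.array([1, 5, 10, 15, 20])
ml_err = Q(np.sqrt(ns) / 2)
map_err = (0.75 * Q(np.sqrt(ns) / 2 + np.log(3) / np.sqrt(ns))
           + 0.25 * Q(np.sqrt(ns) / 2 - np.log(3) / np.sqrt(ns)))

for n, e_ml, e_map in zip(ns, ml_err, map_err):
    print(n, round(e_ml, 4), round(e_map, 4))  # MAP error <= ML error
```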
Sending bits: probability of error

[Figure: probability of error of the ML and MAP estimators as a function of n]