Bayesian statistics
DS GA 1002: Statistical and Mathematical Models
http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall15
Carlos Fernandez-Granda
Frequentist vs Bayesian statistics

In frequentist statistics the data are modeled as realizations from a distribution that depends on deterministic parameters

In Bayesian statistics the parameters are modeled as random variables, which allows us to quantify our prior uncertainty and to incorporate additional information
◮ Learning Bayesian models
◮ Conjugate priors
◮ Bayesian estimators
Prior distribution and likelihood

The data x ∈ R^n are a realization of a random vector X, which depends on a vector of parameters Θ

Modeling choices:
◮ Prior distribution: distribution of Θ encoding our uncertainty about the model before seeing the data
◮ Likelihood: conditional distribution of X given Θ
Posterior distribution

The posterior distribution is the conditional distribution of Θ given X

Evaluating the posterior at the data x allows us to update our uncertainty about Θ using the data
Bernoulli distribution

Goal: estimating a Bernoulli parameter from iid data

We consider two different Bayesian estimators Θ1 and Θ2:

1. Θ1 is a conservative estimator with a uniform prior pdf

   f_{Θ1}(θ) = 1 for 0 ≤ θ ≤ 1, 0 otherwise

2. Θ2 has a prior pdf skewed towards 1

   f_{Θ2}(θ) = 2θ for 0 ≤ θ ≤ 1, 0 otherwise
Prior distributions

[Figure: the uniform prior f_{Θ1} and the skewed prior f_{Θ2} plotted over θ ∈ [0, 1]]
Bernoulli distribution: likelihood

The data are assumed to be iid, so the likelihood is

p_{X|Θ}(x | θ) = θ^{n1} (1 − θ)^{n0},

where n0 is the number of zeros and n1 the number of ones
Bernoulli distribution: posterior distribution
fΘ1|
X (θ|
x)
Bernoulli distribution: posterior distribution
fΘ1|
X (θ|
x) = fΘ1 (θ) p
X|Θ1 (
x|θ) p
X (
x)
Bernoulli distribution: posterior distribution
fΘ1|
X (θ|
x) = fΘ1 (θ) p
X|Θ1 (
x|θ) p
X (
x) = fΘ1 (θ) p
X|Θ1 (
x|θ)
- u fΘ1 (u) p
X|Θ1 (
x|u) du
Bernoulli distribution: posterior distribution
fΘ1|
X (θ|
x) = fΘ1 (θ) p
X|Θ1 (
x|θ) p
X (
x) = fΘ1 (θ) p
X|Θ1 (
x|θ)
- u fΘ1 (u) p
X|Θ1 (
x|u) du = θn1 (1 − θ)n0
- u un1 (1 − u)n0 du
Bernoulli distribution: posterior distribution
fΘ1|
X (θ|
x) = fΘ1 (θ) p
X|Θ1 (
x|θ) p
X (
x) = fΘ1 (θ) p
X|Θ1 (
x|θ)
- u fΘ1 (u) p
X|Θ1 (
x|u) du = θn1 (1 − θ)n0
- u un1 (1 − u)n0 du
= θn1 (1 − θ)n0 β (n1 + 1, n0 + 1) β (a, b) :=
- u
ua−1 (1 − u)b−1 du
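As a quick numerical check (a sketch in Python with hypothetical counts n1 = 3, n0 = 1), the normalizing constant of this posterior can be obtained by direct integration and compared against scipy's beta function:

```python
from scipy.special import beta as beta_fn
from scipy.integrate import quad

n1, n0 = 3, 1  # hypothetical counts of ones and zeros

# Normalizing constant of the posterior, by direct numerical integration
def unnormalized(theta):
    return theta**n1 * (1 - theta)**n0

Z, _ = quad(unnormalized, 0, 1)
print(Z, beta_fn(n1 + 1, n0 + 1))  # both ≈ 0.05 = β(4, 2)
```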
Bernoulli distribution: posterior distribution

f_{Θ2|X}(θ | x) = f_{Θ2}(θ) p_{X|Θ2}(x | θ) / p_X(x)

  = θ^{n1+1} (1 − θ)^{n0} / ∫_u u^{n1+1} (1 − u)^{n0} du

  = θ^{n1+1} (1 − θ)^{n0} / β(n1 + 2, n0 + 1)
Bernoulli distribution: n0 = 1, n1 = 3
[Figure: the two posterior pdfs over θ ∈ [0, 1]]

Bernoulli distribution: n0 = 3, n1 = 1
[Figure: the two posterior pdfs over θ ∈ [0, 1]]

Bernoulli distribution: n0 = 91, n1 = 9
[Figure: the two posterior pdfs over θ ∈ [0, 1]]

Legend: posterior mean (uniform prior), posterior mean (skewed prior), ML estimator
◮ Learning Bayesian models
◮ Conjugate priors
◮ Bayesian estimators
Beta distribution

The pdf of a beta distribution with parameters a and b is defined as

f_β(θ; a, b) := θ^{a−1} (1 − θ)^{b−1} / β(a, b) if 0 ≤ θ ≤ 1, 0 otherwise,

where β(a, b) := ∫_u u^{a−1} (1 − u)^{b−1} du
Learning a Bernoulli distribution

The first prior is beta with parameters a = 1 and b = 1; the second prior is beta with parameters a = 2 and b = 1

The posteriors are beta with parameters a = n1 + 1, b = n0 + 1 and a = n1 + 2, b = n0 + 1, respectively
Conjugate priors

A conjugate family of distributions for a certain likelihood satisfies the following property: if the prior belongs to the family, the posterior also belongs to the family

Beta distributions are conjugate priors when the likelihood is binomial
The beta distribution is conjugate to the binomial likelihood

Θ is beta with parameters a and b; X is binomial with parameters n and Θ

f_{Θ|X}(θ | x) = f_Θ(θ) p_{X|Θ}(x | θ) / p_X(x)

  = f_Θ(θ) p_{X|Θ}(x | θ) / ∫_u f_Θ(u) p_{X|Θ}(x | u) du

  = θ^{a−1} (1 − θ)^{b−1} (n choose x) θ^x (1 − θ)^{n−x} / ∫_u u^{a−1} (1 − u)^{b−1} (n choose x) u^x (1 − u)^{n−x} du

  = θ^{x+a−1} (1 − θ)^{n−x+b−1} / ∫_u u^{x+a−1} (1 − u)^{n−x+b−1} du

  = f_β(θ; x + a, n − x + b)
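The conjugacy identity can be verified numerically. This is a sketch with hypothetical values a = 2, b = 1, n = 10, x = 7, checking that the prior times the likelihood is proportional to the Beta(x + a, n − x + b) pdf:

```python
import numpy as np
from scipy.stats import beta, binom

a, b = 2.0, 1.0   # hypothetical prior parameters
n, x = 10, 7      # hypothetical binomial observation

grid = np.linspace(0.001, 0.999, 999)

# Unnormalized posterior: beta prior pdf times binomial likelihood
unnorm = beta.pdf(grid, a, b) * binom.pmf(x, n, grid)

# If conjugacy holds, this ratio is constant in theta
# (it equals the normalizing constant p_X(x))
ratio = unnorm / beta.pdf(grid, x + a, n - x + b)
print(ratio.std() / ratio.mean())  # ≈ 0, i.e. the ratio is constant
```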
Poll in New Mexico

449 participants: 227 people intend to vote for Clinton and 202 for Trump

What is the probability that Trump wins in New Mexico? Assumptions:

◮ The fraction of Trump voters is modeled as a random variable Θ
◮ Poll participants are selected uniformly at random with replacement
◮ The number of Trump voters in the poll is binomial with parameters n = 449 and p = Θ

Poll in New Mexico

◮ The prior is uniform, so beta with parameters a = 1 and b = 1
◮ The likelihood is binomial
◮ The posterior is beta with parameters a = 202 + 1 and b = 227 + 1
◮ The probability that Trump wins in New Mexico is the probability that Θ is greater than 0.5 given the data
Poll in New Mexico

[Figure: posterior pdf of Θ; 88.6% of the probability mass lies below 0.5 and 11.4% above]
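The 11.4% figure can be reproduced with scipy's beta distribution (a sketch; `sf` is the survival function, 1 − cdf):

```python
from scipy.stats import beta

# Posterior for the fraction of Trump voters: Beta(202 + 1, 227 + 1)
posterior = beta(202 + 1, 227 + 1)

# Probability that Trump wins: P(Theta > 0.5 | data)
p_win = posterior.sf(0.5)
print(p_win)  # ≈ 0.114, matching the 11.4% on the slide
```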
◮ Learning Bayesian models
◮ Conjugate priors
◮ Bayesian estimators
Bayesian estimators

What estimator should we use? Two main options:

◮ The posterior mean
◮ The posterior mode
Posterior mean

The posterior mean is the mean of the posterior distribution:

θ_MMSE(x) := E[Θ | X = x]

It is the minimum mean-square-error (MMSE) estimate: for any arbitrary estimator θ_other(x),

E[(θ_other(X) − Θ)^2] ≥ E[(θ_MMSE(X) − Θ)^2]
Posterior mean

E[(θ_other(X) − Θ)^2 | X = x]

  = E[(θ_other(X) − θ_MMSE(X) + θ_MMSE(X) − Θ)^2 | X = x]

  = (θ_other(x) − θ_MMSE(x))^2 + E[(θ_MMSE(X) − Θ)^2 | X = x]
    + 2 (θ_other(x) − θ_MMSE(x)) (θ_MMSE(x) − E[Θ | X = x])

  = (θ_other(x) − θ_MMSE(x))^2 + E[(θ_MMSE(X) − Θ)^2 | X = x],

where the cross term vanishes because θ_MMSE(x) = E[Θ | X = x] by definition
Posterior mean

By iterated expectation,

E[(θ_other(X) − Θ)^2] = E[ E[(θ_other(X) − Θ)^2 | X] ]

  = E[(θ_other(X) − θ_MMSE(X))^2] + E[ E[(θ_MMSE(X) − Θ)^2 | X] ]

  = E[(θ_other(X) − θ_MMSE(X))^2] + E[(θ_MMSE(X) − Θ)^2]

  ≥ E[(θ_MMSE(X) − Θ)^2]
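The MMSE optimality can be illustrated with a small Monte Carlo simulation, a sketch under the uniform-prior Bernoulli setup from earlier: draw Θ from its prior, simulate data, and compare the mean-square error of the posterior mean against the ML estimate (the sample mean).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 100_000

# Draw Theta from the uniform prior, then n Bernoulli(Theta) observations each
theta = rng.uniform(0, 1, trials)
x = rng.random((trials, n)) < theta[:, None]
n1 = x.sum(axis=1)

# Posterior mean under the uniform prior (Beta(n1 + 1, n0 + 1)) vs ML estimate
mmse = (n1 + 1) / (n + 2)
ml = n1 / n

mse_mmse = np.mean((mmse - theta) ** 2)
mse_ml = np.mean((ml - theta) ** 2)
print(mse_mmse, mse_ml)  # the posterior mean achieves the lower MSE
```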
Bernoulli distribution: posterior means vs ML estimate

[Figures repeated for n0 = 1, n1 = 3; n0 = 3, n1 = 1; and n0 = 91, n1 = 9, comparing the posterior mean (uniform prior), the posterior mean (skewed prior), and the ML estimator]
Posterior mode

The maximum-a-posteriori (MAP) estimator is the mode of the posterior distribution:

θ_MAP(x) := arg max_θ p_{Θ|X}(θ | x) if Θ is discrete, and

θ_MAP(x) := arg max_θ f_{Θ|X}(θ | x) if Θ is continuous
Maximum-likelihood estimator

If the prior is uniform the ML estimator coincides with the MAP estimator:

arg max_θ f_{Θ|X}(θ | x) = arg max_θ f_Θ(θ) f_{X|Θ}(x | θ) / ∫_u f_Θ(u) f_{X|Θ}(x | u) du

  = arg max_θ f_{X|Θ}(x | θ)

  = arg max_θ L_x(θ)

Uniform priors are only well defined over bounded domains
Probability of error

If Θ is discrete, the MAP estimator minimizes the probability of error: for any arbitrary estimator θ_other(x),

P(θ_other(X) ≠ Θ) ≥ P(θ_MAP(X) ≠ Θ)

Equivalently, the MAP estimator maximizes the probability of a correct guess:

P(Θ = θ_other(X)) = ∫_x f_X(x) P(Θ = θ_other(x) | X = x) dx

  = ∫_x f_X(x) p_{Θ|X}(θ_other(x) | x) dx

  ≤ ∫_x f_X(x) p_{Θ|X}(θ_MAP(x) | x) dx

  = P(Θ = θ_MAP(X))
Sending bits

Model for a communication channel: the signal Θ encodes a single bit

Prior knowledge indicates that a 0 is 3 times more likely than a 1:

p_Θ(1) = 1/4,  p_Θ(0) = 3/4

The channel is noisy, so we send the signal n times. At the receiver we observe

X_i = Θ + Z_i,  1 ≤ i ≤ n,

where the Z_i are iid standard Gaussian
Sending bits: ML estimator

The likelihood is equal to

L_x(θ) = ∏_{i=1}^n f_{X_i|Θ}(x_i | θ) = ∏_{i=1}^n (1/√(2π)) e^{−(x_i − θ)^2 / 2}

The log-likelihood is equal to

log L_x(θ) = − Σ_{i=1}^n (x_i − θ)^2 / 2 − (n/2) log 2π
Sending bits: ML estimator

θ_ML(x) = 1 if

log L_x(1) = − Σ_{i=1}^n (x_i^2 − 2 x_i + 1) / 2 − (n/2) log 2π
  ≥ − Σ_{i=1}^n x_i^2 / 2 − (n/2) log 2π = log L_x(0)

Equivalently,

θ_ML(x) = 1 if (1/n) Σ_{i=1}^n x_i > 1/2, 0 otherwise
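The decision rule above is simply a threshold on the sample mean; a minimal sketch:

```python
import numpy as np

def ml_decision(x):
    """ML estimate of the transmitted bit: 1 if the sample mean exceeds 1/2."""
    return int(np.mean(x) > 0.5)

# Noiseless checks of the threshold rule
print(ml_decision([1.0, 1.0, 1.0]))  # 1
print(ml_decision([0.0, 0.0, 0.0]))  # 0
```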
Sending bits: ML estimator

The probability of error is

P(Θ ≠ θ_ML(X)) = P(Θ ≠ θ_ML(X) | Θ = 0) P(Θ = 0) + P(Θ ≠ θ_ML(X) | Θ = 1) P(Θ = 1)

  = P((1/n) Σ_{i=1}^n X_i > 1/2 | Θ = 0) P(Θ = 0) + P((1/n) Σ_{i=1}^n X_i < 1/2 | Θ = 1) P(Θ = 1)

  = Q(√n / 2)
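A Monte Carlo sketch (with assumed values n = 4 and 200,000 trials) agrees with the Q(√n/2) formula; Q is the standard Gaussian tail function, available as `scipy.stats.norm.sf`:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, trials = 4, 200_000  # assumed values for the sketch

# Draw the bit from its prior, add standard Gaussian noise, decode with ML
theta = (rng.random(trials) < 0.25).astype(float)  # P(Theta = 1) = 1/4
x = theta[:, None] + rng.standard_normal((trials, n))
decisions = (x.mean(axis=1) > 0.5).astype(float)

empirical = np.mean(decisions != theta)
analytic = norm.sf(np.sqrt(n) / 2)  # Q(sqrt(n)/2)
print(empirical, analytic)
```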
Sending bits: MAP estimator

The logarithm of the posterior is equal to

log p_{Θ|X}(θ | x) = log( ∏_{i=1}^n f_{X_i|Θ}(x_i | θ) p_Θ(θ) / f_X(x) )

  = Σ_{i=1}^n log f_{X_i|Θ}(x_i | θ) + log p_Θ(θ) − log f_X(x)

  = − Σ_{i=1}^n (x_i^2 − 2 x_i θ + θ^2) / 2 − (n/2) log 2π + log p_Θ(θ) − log f_X(x)
Sending bits: MAP estimator

θ_MAP(x) = 1 if

log p_{Θ|X}(1 | x) + log f_X(x) = − Σ_{i=1}^n (x_i^2 − 2 x_i + 1) / 2 − (n/2) log 2π − log 4
  ≥ − Σ_{i=1}^n x_i^2 / 2 − (n/2) log 2π − log 4 + log 3 = log p_{Θ|X}(0 | x) + log f_X(x)

Equivalently,

θ_MAP(x) = 1 if (1/n) Σ_{i=1}^n x_i > 1/2 + (log 3)/n, 0 otherwise
Sending bits: MAP estimator

The probability of error is

P(Θ ≠ θ_MAP(X)) = P(Θ ≠ θ_MAP(X) | Θ = 0) P(Θ = 0) + P(Θ ≠ θ_MAP(X) | Θ = 1) P(Θ = 1)

  = P((1/n) Σ_{i=1}^n X_i > 1/2 + (log 3)/n | Θ = 0) P(Θ = 0)
    + P((1/n) Σ_{i=1}^n X_i < 1/2 + (log 3)/n | Θ = 1) P(Θ = 1)

  = (3/4) Q(√n / 2 + (log 3)/√n) + (1/4) Q(√n / 2 − (log 3)/√n)
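Evaluating the two error formulas side by side (a sketch; Q is again `scipy.stats.norm.sf`) shows that the MAP estimator's average error never exceeds the ML estimator's, as its optimality guarantees:

```python
import numpy as np
from scipy.stats import norm

Q = norm.sf  # Gaussian tail function Q(z) = P(Z > z)

ns = np.array([1, 5, 10, 15, 20])
ml_err = Q(np.sqrt(ns) / 2)
map_err = (0.75 * Q(np.sqrt(ns) / 2 + np.log(3) / np.sqrt(ns))
           + 0.25 * Q(np.sqrt(ns) / 2 - np.log(3) / np.sqrt(ns)))

for n, e_ml, e_map in zip(ns, ml_err, map_err):
    print(n, round(e_ml, 4), round(e_map, 4))  # MAP error <= ML error
```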
Sending bits: probability of error

[Figure: probability of error of the ML and MAP estimators as a function of n]