Bayesian statistics: DS GA 1002 Probability and Statistics for Data Science (PowerPoint presentation)



slide-1
SLIDE 1

Bayesian statistics

DS GA 1002 Probability and Statistics for Data Science

http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall17

Carlos Fernandez-Granda

slide-2
SLIDE 2

Frequentist vs Bayesian statistics

In frequentist statistics the data are modeled as realizations from a distribution that depends on deterministic parameters. In Bayesian statistics the parameters are modeled as random variables. This allows us to quantify our prior uncertainty and to incorporate additional information.

slide-3
SLIDE 3

Learning Bayesian models Conjugate priors Bayesian estimators

slide-4
SLIDE 4

Prior distribution and likelihood

The data x ∈ Rⁿ are a realization of a random vector X, which depends on a vector of parameters Θ. Modeling choices:

◮ Prior distribution: distribution of Θ encoding our uncertainty about the model before seeing the data

◮ Likelihood: conditional distribution of X given Θ

slide-5
SLIDE 5

Posterior distribution

The posterior distribution is the conditional distribution of Θ given X. Evaluating it at the observed data x allows us to update our uncertainty about Θ.

slide-6
SLIDE 6

Bernoulli distribution

Goal: estimate the Bernoulli parameter from iid data. We consider two different Bayesian estimators Θ1 and Θ2:

1. Θ1 is a conservative estimator with a uniform prior pdf

   f_{Θ1}(θ) = 1 for 0 ≤ θ ≤ 1, 0 otherwise

2. Θ2 has a prior pdf skewed towards 1

   f_{Θ2}(θ) = 2 θ for 0 ≤ θ ≤ 1, 0 otherwise

slide-7
SLIDE 7

Prior distributions

[Figure: prior pdfs f_{Θ1} and f_{Θ2} on the unit interval]

slide-9
SLIDE 9

Bernoulli distribution: likelihood

The data are assumed to be iid, so the likelihood is

p_{X|Θ}(x | θ) = θ^{n1} (1 − θ)^{n0},

where n0 is the number of zeros and n1 the number of ones.

slide-14
SLIDE 14

Bernoulli distribution: posterior distribution

f_{Θ1|X}(θ | x) = f_{Θ1}(θ) p_{X|Θ1}(x | θ) / p_X(x)

  = f_{Θ1}(θ) p_{X|Θ1}(x | θ) / ∫_u f_{Θ1}(u) p_{X|Θ1}(x | u) du

  = θ^{n1} (1 − θ)^{n0} / ∫_u u^{n1} (1 − u)^{n0} du

  = θ^{n1} (1 − θ)^{n0} / β(n1 + 1, n0 + 1),

where β(a, b) := ∫_u u^{a−1} (1 − u)^{b−1} du.
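As a sanity check, the posterior under the uniform prior can be evaluated numerically with only the standard library. A minimal sketch, assuming hypothetical example counts n0 = 3, n1 = 1 (not from the slides):

```python
import math

# Hypothetical observed counts for illustration: n0 zeros, n1 ones.
n0, n1 = 3, 1

# Unnormalized posterior under the uniform prior: theta^n1 * (1 - theta)^n0.
def unnormalized(theta):
    return theta**n1 * (1 - theta)**n0

# Normalizing constant beta(n1 + 1, n0 + 1), computed via the gamma function.
beta_const = math.gamma(n1 + 1) * math.gamma(n0 + 1) / math.gamma(n0 + n1 + 2)

def posterior(theta):
    return unnormalized(theta) / beta_const

# Sanity checks: the posterior integrates to 1 (trapezoidal rule), and its
# mean matches the known beta mean (n1 + 1) / (n0 + n1 + 2).
h = 1 / 10000
grid = [i * h for i in range(10001)]
total = sum(posterior(t) for t in grid[1:-1]) * h + (posterior(0) + posterior(1)) * h / 2
mean = sum(t * posterior(t) for t in grid) * h
print(round(total, 3), round(mean, 3))
```

The numerically computed mean agrees with the closed-form beta mean, which is also the MMSE estimate discussed later in the deck.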

slide-18
SLIDE 18

Bernoulli distribution: posterior distribution

f_{Θ2|X}(θ | x) = f_{Θ2}(θ) p_{X|Θ2}(x | θ) / p_X(x)

  = θ^{n1+1} (1 − θ)^{n0} / ∫_u u^{n1+1} (1 − u)^{n0} du

  = θ^{n1+1} (1 − θ)^{n0} / β(n1 + 2, n0 + 1),

where β(a, b) := ∫_u u^{a−1} (1 − u)^{b−1} du.

slide-19
SLIDE 19

Bernoulli distribution: n0 = 1, n1 = 3

[Figure: posterior pdfs for both priors]

slide-20
SLIDE 20

Bernoulli distribution: n0 = 3, n1 = 1

[Figure: posterior pdfs for both priors]

slide-21
SLIDE 21

Bernoulli distribution: n0 = 91, n1 = 9

[Figure: posterior pdfs; markers show the posterior mean (uniform prior), the posterior mean (skewed prior), and the ML estimator]

slide-22
SLIDE 22

Learning Bayesian models Conjugate priors Bayesian estimators

slide-23
SLIDE 23

Beta random variable

Useful in Bayesian statistics: a unimodal continuous distribution on the unit interval. The pdf of a beta distribution with parameters a and b is

f_β(θ; a, b) := θ^{a−1} (1 − θ)^{b−1} / β(a, b) for 0 ≤ θ ≤ 1, 0 otherwise,

where β(a, b) := ∫_u u^{a−1} (1 − u)^{b−1} du.

slide-24
SLIDE 24

Beta random variables

[Figure: beta pdfs fX(x) for (a, b) = (1, 1), (1, 2), (3, 3), (6, 2), (3, 15)]

slide-25
SLIDE 25

Learning a Bernoulli distribution

The first prior is beta with parameters a = 1 and b = 1. The second prior is beta with parameters a = 2 and b = 1. The posteriors are beta with parameters a = n1 + 1, b = n0 + 1 and a = n1 + 2, b = n0 + 1, respectively.

slide-26
SLIDE 26

Conjugate priors

A conjugate family of distributions for a certain likelihood satisfies the following property: if the prior belongs to the family, the posterior also belongs to the family. Beta distributions are conjugate priors when the likelihood is binomial.

slide-32
SLIDE 32

The beta distribution is conjugate to the binomial likelihood

Θ is beta with parameters a and b; X is binomial with parameters n and Θ.

f_{Θ|X}(θ | x) = f_Θ(θ) p_{X|Θ}(x | θ) / p_X(x)

  = f_Θ(θ) p_{X|Θ}(x | θ) / ∫_u f_Θ(u) p_{X|Θ}(x | u) du

  = θ^{a−1} (1 − θ)^{b−1} (n choose x) θ^x (1 − θ)^{n−x} / ∫_u u^{a−1} (1 − u)^{b−1} (n choose x) u^x (1 − u)^{n−x} du

  = θ^{x+a−1} (1 − θ)^{n−x+b−1} / ∫_u u^{x+a−1} (1 − u)^{n−x+b−1} du

  = f_β(θ; x + a, n − x + b)
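The conjugacy claim can be checked numerically: normalizing the product of a beta prior and a binomial likelihood reproduces the beta(x + a, n − x + b) pdf. A minimal sketch, with hypothetical values a = 3, b = 2, n = 10, x = 7 (not from the slides):

```python
import math

# Hypothetical prior parameters and data, for illustration only.
a, b = 3.0, 2.0   # beta prior parameters
n, x = 10, 7      # binomial trials and observed successes

def beta_fn(p, q):
    # beta(p, q) = Gamma(p) Gamma(q) / Gamma(p + q)
    return math.gamma(p) * math.gamma(q) / math.gamma(p + q)

def beta_pdf(theta, p, q):
    return theta**(p - 1) * (1 - theta)**(q - 1) / beta_fn(p, q)

def posterior_direct(theta):
    # Prior times likelihood, normalized by numerical integration on a grid.
    def unnorm(t):
        return beta_pdf(t, a, b) * math.comb(n, x) * t**x * (1 - t)**(n - x)
    h = 1e-4
    norm = sum(unnorm(i * h) for i in range(1, 10000)) * h
    return unnorm(theta) / norm

# The directly computed posterior should match beta(x + a, n - x + b).
theta = 0.6
direct = posterior_direct(theta)
closed_form = beta_pdf(theta, x + a, n - x + b)
print(round(direct, 3), round(closed_form, 3))
```

The two values agree up to numerical integration error, which is the conjugacy property stated on the slide.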

slide-33
SLIDE 33

Poll in New Mexico

429 participants: 227 people intend to vote for Clinton and 202 for Trump. What is the probability that Trump wins in New Mexico? Assumptions:

◮ The fraction of Trump voters is modeled as a random variable Θ
◮ Poll participants are selected uniformly at random with replacement
◮ The number of Trump voters in the poll is binomial with parameters n = 429 and p = Θ

slide-34
SLIDE 34

Poll in New Mexico

◮ The prior is uniform, so beta with parameters a = 1 and b = 1
◮ The likelihood is binomial
◮ The posterior is beta with parameters a = 202 + 1 and b = 227 + 1
◮ The probability that Trump wins in New Mexico is the probability that Θ given the data is greater than 0.5
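The posterior probability that Θ exceeds 0.5 can be estimated with a short Monte Carlo sketch using only the standard library's beta sampler (the exact value could also be read off the beta cdf):

```python
import random

random.seed(0)

# Posterior from the poll: beta(a, b) with a = 202 + 1, b = 227 + 1.
a, b = 203, 228

# Monte Carlo estimate of P(Theta > 0.5 | data).
samples = 200_000
p_trump_wins = sum(random.betavariate(a, b) > 0.5 for _ in range(samples)) / samples
print(p_trump_wins)  # roughly 0.11, consistent with the 11.4% shown on the figure
```

With 200,000 samples the standard error of the estimate is under 0.001, so the Monte Carlo answer is reliable to two decimal places.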

slide-35
SLIDE 35

Poll in New Mexico

[Figure: posterior pdf of Θ; 88.6% of the posterior mass lies below 0.5 and 11.4% above]

slide-36
SLIDE 36

Learning Bayesian models Conjugate priors Bayesian estimators

slide-37
SLIDE 37

Bayesian estimators

What estimator should we use? Two main options:

◮ The posterior mean
◮ The posterior mode

slide-38
SLIDE 38

Posterior mean

Mean of the posterior distribution:

θ_MMSE(x) := E(Θ | X = x)

This is the minimum mean-square-error (MMSE) estimate: for any arbitrary estimator θ_other(x),

E((θ_other(X) − Θ)²) ≥ E((θ_MMSE(X) − Θ)²)

slide-42
SLIDE 42

Posterior mean

E((θ_other(X) − Θ)² | X = x)

  = E((θ_other(X) − θ_MMSE(X) + θ_MMSE(X) − Θ)² | X = x)

  = (θ_other(x) − θ_MMSE(x))² + E((θ_MMSE(X) − Θ)² | X = x)
    + 2 (θ_other(x) − θ_MMSE(x)) (θ_MMSE(x) − E(Θ | X = x))

  = (θ_other(x) − θ_MMSE(x))² + E((θ_MMSE(X) − Θ)² | X = x)

slide-46
SLIDE 46

Posterior mean

By iterated expectation,

E((θ_other(X) − Θ)²) = E( E((θ_other(X) − Θ)² | X) )

  = E((θ_other(X) − θ_MMSE(X))²) + E( E((θ_MMSE(X) − Θ)² | X) )

  = E((θ_other(X) − θ_MMSE(X))²) + E((θ_MMSE(X) − Θ)²)

  ≥ E((θ_MMSE(X) − Θ)²)
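The MMSE optimality of the posterior mean can be illustrated by simulation: draw Θ from the uniform prior, generate Bernoulli data, and compare the Bayes MSE of the posterior mean (n1 + 1)/(n + 2) against another estimator, here the ML estimate n1/n. A minimal sketch, with hypothetical n = 5:

```python
import random

random.seed(1)

# Theta ~ uniform(0, 1) prior; X = n iid Bernoulli(Theta) draws per trial.
n, trials = 5, 100_000
se_mmse = se_ml = 0.0
for _ in range(trials):
    theta = random.random()
    n1 = sum(random.random() < theta for _ in range(n))
    se_mmse += ((n1 + 1) / (n + 2) - theta) ** 2   # posterior mean (uniform prior)
    se_ml += (n1 / n - theta) ** 2                  # ML estimate
mse_mmse, mse_ml = se_mmse / trials, se_ml / trials
print(mse_mmse < mse_ml)  # the posterior mean achieves lower Bayes MSE
```

The gap is largest for small n, where the prior contributes most of the information.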

slide-47
SLIDE 47

Bernoulli distribution: n0 = 1, n1 = 3

[Figure: posterior pdfs for both priors]

slide-48
SLIDE 48

Bernoulli distribution: n0 = 3, n1 = 1

[Figure: posterior pdfs for both priors]

slide-49
SLIDE 49

Bernoulli distribution: n0 = 91, n1 = 9

[Figure: posterior pdfs; markers show the posterior mean (uniform prior), the posterior mean (skewed prior), and the ML estimator]

slide-50
SLIDE 50

Posterior mode

The maximum-a-posteriori (MAP) estimator is the mode of the posterior distribution:

θ_MAP(x) := arg max_θ p_{Θ|X}(θ | x) if Θ is discrete

θ_MAP(x) := arg max_θ f_{Θ|X}(θ | x) if Θ is continuous

slide-55
SLIDE 55

Maximum-likelihood estimator

If the prior is uniform, the ML estimator coincides with the MAP estimator:

arg max_θ f_{Θ|X}(θ | x) = arg max_θ f_Θ(θ) f_{X|Θ}(x | θ) / ∫_u f_Θ(u) f_{X|Θ}(x | u) du

  = arg max_θ f_{X|Θ}(x | θ)

  = arg max_θ L_x(θ)

Note: uniform priors are only well defined over bounded domains.
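A quick numerical illustration: with a uniform prior on [0, 1] the posterior is proportional to the likelihood, so maximizing either over a grid gives the same estimate. A sketch with hypothetical Bernoulli counts n1 = 7 out of n = 10:

```python
# Hypothetical Bernoulli data for illustration.
n, n1 = 10, 7

grid = [i / 1000 for i in range(1, 1000)]

def likelihood(theta):
    return theta**n1 * (1 - theta)**(n - n1)

def unnormalized_posterior(theta):
    prior = 1.0  # uniform prior pdf on [0, 1]
    return prior * likelihood(theta)

# The argmax of the likelihood and of the (unnormalized) posterior coincide.
theta_ml = max(grid, key=likelihood)
theta_map = max(grid, key=unnormalized_posterior)
print(theta_ml, theta_map)  # both 0.7 = n1 / n
```

Normalizing constants do not affect the argmax, which is why they can be dropped here.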
slide-61
SLIDE 61

Probability of error

If Θ is discrete, the MAP estimator minimizes the probability of error: for any arbitrary estimator θ_other(x),

P(θ_other(X) ≠ Θ) ≥ P(θ_MAP(X) ≠ Θ)

This follows because, for any estimator,

P(Θ = θ_other(X)) = ∫_x f_X(x) P(Θ = θ_other(x) | X = x) dx

  = ∫_x f_X(x) p_{Θ|X}(θ_other(x) | x) dx

  ≤ ∫_x f_X(x) p_{Θ|X}(θ_MAP(x) | x) dx

  = P(Θ = θ_MAP(X))

slide-62
SLIDE 62

Sending bits

Model for a communication channel: the signal Θ encodes a single bit. Prior knowledge indicates that a 0 is 3 times more likely than a 1:

pΘ(1) = 1/4, pΘ(0) = 3/4.

The channel is noisy, so we send the signal n times. At the receptor we observe

Xi = Θ + Zi, 1 ≤ i ≤ n,

where Z is iid standard Gaussian.

slide-63
SLIDE 63

Sending bits: ML estimator

The likelihood is equal to

L_x(θ) = ∏_{i=1}^n f_{Xi|Θ}(xi | θ) = ∏_{i=1}^n (1/√(2π)) e^{−(xi − θ)² / 2}

The log-likelihood is equal to

log L_x(θ) = −∑_{i=1}^n (xi − θ)² / 2 − (n/2) log 2π
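The log-likelihood formula can be verified against the log of the product of Gaussian densities directly. A minimal sketch with hypothetical simulated data (θ = 1, n = 8):

```python
import math
import random

random.seed(2)

# Hypothetical received signal: theta = 1 plus standard Gaussian noise.
n, theta = 8, 1.0
xs = [theta + random.gauss(0, 1) for _ in range(n)]

# Log of the product of Gaussian densities, term by term.
log_prod = sum(
    math.log(math.exp(-(x - theta) ** 2 / 2) / math.sqrt(2 * math.pi)) for x in xs
)

# Closed-form log-likelihood from the slide.
formula = -sum((x - theta) ** 2 / 2 for x in xs) - (n / 2) * math.log(2 * math.pi)
print(abs(log_prod - formula) < 1e-9)  # True
```

Working with the log-likelihood avoids the numerical underflow that the raw product would suffer for large n.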

slide-64
SLIDE 64

Sending bits: ML estimator

θ_ML(x) = 1 if

log L_x(1) = −∑_{i=1}^n (xi² − 2 xi + 1) / 2 − (n/2) log 2π
  ≥ −∑_{i=1}^n xi² / 2 − (n/2) log 2π = log L_x(0)

Equivalently,

θ_ML(x) = 1 if (1/n) ∑_{i=1}^n xi > 1/2, 0 otherwise
slide-68
SLIDE 68

Sending bits: ML estimator

The probability of error is

P(Θ ≠ θ_ML(X)) = P(Θ ≠ θ_ML(X) | Θ = 0) P(Θ = 0) + P(Θ ≠ θ_ML(X) | Θ = 1) P(Θ = 1)

  = P((1/n) ∑_{i=1}^n Xi > 1/2 | Θ = 0) P(Θ = 0) + P((1/n) ∑_{i=1}^n Xi < 1/2 | Θ = 1) P(Θ = 1)

  = Q(√n / 2)
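The closed-form error probability Q(√n/2) can be checked by simulating the channel and the ML decision rule. A minimal sketch with hypothetical n = 4 (Q computed from the standard library's `erfc`):

```python
import math
import random

random.seed(3)

def q_function(z):
    # Gaussian tail probability Q(z) = P(N(0,1) > z)
    return 0.5 * math.erfc(z / math.sqrt(2))

# Empirical error rate of the ML rule (decide 1 if the sample mean
# exceeds 1/2) versus the closed form Q(sqrt(n)/2).
n, trials = 4, 100_000
errors = 0
for _ in range(trials):
    theta = 0 if random.random() < 0.75 else 1  # prior p(0) = 3/4, p(1) = 1/4
    mean = sum(theta + random.gauss(0, 1) for _ in range(n)) / n
    decision = 1 if mean > 0.5 else 0
    errors += decision != theta
print(errors / trials, q_function(math.sqrt(n) / 2))  # both near 0.159
```

Both conditional error probabilities equal Q(√n/2), so the overall error rate does not depend on the prior weights, which the simulation reflects.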

slide-72
SLIDE 72

Sending bits: MAP estimator

The logarithm of the posterior is equal to

log p_{Θ|X}(θ | x) = log( ∏_{i=1}^n f_{Xi|Θ}(xi | θ) pΘ(θ) / f_X(x) )

  = ∑_{i=1}^n log f_{Xi|Θ}(xi | θ) + log pΘ(θ) − log f_X(x)

  = −∑_{i=1}^n (xi² − 2 xi θ + θ²) / 2 − (n/2) log 2π + log pΘ(θ) − log f_X(x)

slide-73
SLIDE 73

Sending bits: MAP estimator

θ_MAP(x) = 1 if

log p_{Θ|X}(1 | x) + log f_X(x) = −∑_{i=1}^n (xi² − 2 xi + 1) / 2 − (n/2) log 2π − log 4
  ≥ −∑_{i=1}^n xi² / 2 − (n/2) log 2π − log 4 + log 3 = log p_{Θ|X}(0 | x) + log f_X(x).

Equivalently,

θ_MAP(x) = 1 if (1/n) ∑_{i=1}^n xi > 1/2 + (log 3)/n, 0 otherwise.
slide-77
SLIDE 77

Sending bits: MAP estimator

The probability of error is

P(Θ ≠ θ_MAP(X)) = P(Θ ≠ θ_MAP(X) | Θ = 0) P(Θ = 0) + P(Θ ≠ θ_MAP(X) | Θ = 1) P(Θ = 1)

  = P((1/n) ∑_{i=1}^n Xi > 1/2 + (log 3)/n | Θ = 0) P(Θ = 0)
    + P((1/n) ∑_{i=1}^n Xi < 1/2 + (log 3)/n | Θ = 1) P(Θ = 1)

  = (3/4) Q(√n / 2 + (log 3)/√n) + (1/4) Q(√n / 2 − (log 3)/√n)
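Since the MAP estimator minimizes the probability of error, its closed-form error should never exceed the ML error Q(√n/2). A short sketch comparing the two formulas over a range of n:

```python
import math

def q_function(z):
    # Gaussian tail probability Q(z) = P(N(0,1) > z)
    return 0.5 * math.erfc(z / math.sqrt(2))

def error_ml(n):
    return q_function(math.sqrt(n) / 2)

def error_map(n):
    shift = math.log(3) / math.sqrt(n)
    return (0.75 * q_function(math.sqrt(n) / 2 + shift)
            + 0.25 * q_function(math.sqrt(n) / 2 - shift))

# The MAP error is at most the ML error for every n.
for n in range(1, 21):
    assert error_map(n) <= error_ml(n)
print(round(error_ml(4), 3), round(error_map(4), 3))
```

The advantage of the MAP rule shrinks as n grows, since the shift (log 3)/n in the decision threshold vanishes, which matches the figure on the next slide.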

slide-78
SLIDE 78

Sending bits: Probability of error

[Figure: probability of error versus n (up to 20) for the ML and MAP estimators]