SLIDE 1

Bayesian Methods

David S. Rosenberg

New York University

March 20, 2018

SLIDE 2

Contents

1. Classical Statistics
2. Bayesian Statistics: Introduction
3. Bayesian Decision Theory
4. Summary

SLIDE 3

Classical Statistics

SLIDE 4

Parametric Family of Densities

A parametric family of densities is a set {p(y | θ) : θ ∈ Θ},

where p(y | θ) is a density on a sample space Y, and θ is a parameter in a [finite dimensional] parameter space Θ.

This is the common starting point for a treatment of classical or Bayesian statistics.
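As a concrete sketch (my illustration, not from the slides): in code, a parametric family is naturally a map from θ to a density (here, a mass function), with each θ ∈ Θ picking out one distribution on Y.

```python
# A parametric family {p(y | θ) : θ ∈ Θ}, sketched for the Bernoulli model:
# each θ in Θ = (0, 1) picks out one mass function p(· | θ) on Y = {0, 1}.
def bernoulli_family(theta):
    assert 0 < theta < 1, "theta must lie in the parameter space (0, 1)"
    def p(y):
        return theta if y == 1 else 1 - theta  # p(y | theta)
    return p

p_fair = bernoulli_family(0.5)
print(p_fair(1), p_fair(0))  # 0.5 0.5
```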

SLIDE 5

Density vs Mass Functions

In this lecture, whenever we say “density”, we could replace it with “mass function.” Corresponding integrals would be replaced by summations.

(In more advanced, measure-theoretic treatments, they are each considered densities w.r.t. different base measures.)

SLIDE 6

Frequentist or “Classical” Statistics

Parametric family of densities {p(y | θ) : θ ∈ Θ}.

We assume that p(y | θ) governs the world we are observing, for some θ ∈ Θ.

If we knew the right θ ∈ Θ, there would be no need for statistics.

Instead of θ, we have data D: y1, ..., yn sampled i.i.d. from p(y | θ).

Statistics is about how to get by with D in place of θ.

SLIDE 7

Point Estimation

One type of statistical problem is point estimation.

A statistic s = s(D) is any function of the data.

A statistic θ̂ = θ̂(D) taking values in Θ is a point estimator of θ.

A good point estimator will have θ̂ ≈ θ.

SLIDE 8

Desirable Properties of Point Estimators

Desirable statistical properties of point estimators:

Consistency: as the data size n → ∞, we get θ̂n → θ.

Efficiency: (roughly speaking) θ̂n is as accurate as we can get from a sample of size n.

Maximum likelihood estimators are consistent and efficient under reasonable conditions.

SLIDE 9

The Likelihood Function

Consider a parametric family {p(y | θ) : θ ∈ Θ} and an i.i.d. sample D = (y1, ..., yn).

The density of the sample D for θ ∈ Θ is

p(D | θ) = ∏_{i=1}^n p(yᵢ | θ).

p(D | θ) is a function of D and θ.

For fixed θ, p(D | θ) is a density function on Yⁿ.

For fixed D, the function θ ↦ p(D | θ) is called the likelihood function: L_D(θ) := p(D | θ).
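A minimal sketch of the likelihood of an i.i.d. sample (illustrative only; the helper names are mine):

```python
import numpy as np

def likelihood(data, theta, density):
    """L_D(theta) = product over i of p(y_i | theta) for an i.i.d. sample D."""
    return np.prod([density(y, theta) for y in data])

# Bernoulli mass function: p(y | theta) = theta if y = 1 (heads), else 1 - theta
def bernoulli(y, theta):
    return theta if y == 1 else 1 - theta

D = [1, 1, 0, 0, 0]                    # two heads, three tails
print(likelihood(D, 0.4, bernoulli))   # 0.4^2 * 0.6^3 = 0.03456
```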

SLIDE 10

Maximum Likelihood Estimation

Definition: The maximum likelihood estimator (MLE) for θ in the model {p(y | θ) : θ ∈ Θ} is

θ̂_MLE = argmax_{θ∈Θ} L_D(θ).

Maximum likelihood is just one approach to getting a point estimator for θ.

Method of moments is another general approach one learns about in statistics.

Later we’ll talk about MAP and the posterior mean as approaches to point estimation. These arise naturally in Bayesian settings.

SLIDE 11

Coin Flipping: Setup

Parametric family of mass functions: p(Heads | θ) = θ, for θ ∈ Θ = (0, 1).

Note that every θ ∈ Θ gives us a different probability model for a coin.

SLIDE 12

Coin Flipping: Likelihood function

Data: D = (H, H, T, T, T, T, T, H, ..., T)

n_h: number of heads; n_t: number of tails

Assume these were i.i.d. flips.

Likelihood function for the data D: L_D(θ) = p(D | θ) = θ^{n_h}(1 − θ)^{n_t}.

This is the probability of getting the flips in the order they were received.

SLIDE 13

Coin Flipping: MLE

As usual, it is easier to maximize the log-likelihood function:

θ̂_MLE = argmax_{θ∈Θ} log L_D(θ) = argmax_{θ∈Θ} [n_h log θ + n_t log(1 − θ)]

First-order condition: n_h/θ − n_t/(1 − θ) = 0 ⟺ θ = n_h/(n_h + n_t).

So θ̂_MLE is the empirical fraction of heads.
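A quick numeric check of the closed form (my sketch; the 0/1 encoding of flips is an assumption):

```python
import numpy as np

flips = np.array([1, 1, 0, 0, 0, 0, 0, 1])  # 1 = heads, 0 = tails
n_h = flips.sum()
n_t = len(flips) - n_h
theta_mle = n_h / (n_h + n_t)               # empirical fraction of heads
print(theta_mle)                            # 0.375
```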

SLIDE 14

Bayesian Statistics: Introduction

SLIDE 15

Bayesian Statistics

Bayesian statistics introduces a new ingredient: the prior distribution.

A prior distribution p(θ) is a distribution on the parameter space Θ.

A prior reflects our belief about θ before seeing any data.

SLIDE 16

A Bayesian Model

A [parametric] Bayesian model consists of two pieces:

1. A parametric family of densities {p(D | θ) : θ ∈ Θ}.
2. A prior distribution p(θ) on the parameter space Θ.

Putting the pieces together, we get a joint density on θ and D: p(D, θ) = p(D | θ) p(θ).

SLIDE 17

The Posterior Distribution

The posterior distribution for θ is p(θ | D).

The prior represents our belief about θ before observing the data D.

The posterior represents the rationally “updated” belief about θ after seeing D.

SLIDE 18

Expressing the Posterior Distribution

By Bayes’ rule, we can write the posterior distribution as

p(θ | D) = p(D | θ) p(θ) / p(D).

Let’s consider both sides as functions of θ, for fixed D. Then both sides are densities on Θ and we can write

p(θ | D) ∝ p(D | θ) p(θ)
(posterior ∝ likelihood × prior),

where ∝ means we have dropped factors independent of θ.
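To make the ∝ concrete, here is a grid sketch (my illustration; the Beta(2, 2) prior and tiny data set are arbitrary choices): evaluate likelihood × prior on a grid of θ values, then normalize numerically to recover the dropped 1/p(D).

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)   # grid over Θ = (0, 1)
prior = theta * (1 - theta)              # Beta(2, 2) prior, unnormalized
n_h, n_t = 3, 1                          # observed heads and tails
lik = theta**n_h * (1 - theta)**n_t      # likelihood L_D(θ)
post = prior * lik                       # posterior ∝ likelihood × prior
post /= np.trapz(post, theta)            # recover the dropped 1/p(D)
print(theta[np.argmax(post)])            # ≈ 0.667, the Beta(5, 3) mode
```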

SLIDE 19

Coin Flipping: Bayesian Model

Parametric family of mass functions: p(Heads | θ) = θ, for θ ∈ Θ = (0, 1).

We need a prior distribution p(θ) on Θ = (0, 1).

A distribution from the Beta family will do the trick...

SLIDE 20

Coin Flipping: Beta Prior

Prior: θ ∼ Beta(α, β), with density p(θ) ∝ θ^{α−1}(1 − θ)^{β−1}.

[Figure: Beta distribution PDFs. Figure by Horas, based on the work of Krishnavedala (own work) [public domain], via Wikimedia Commons: http://commons.wikimedia.org/wiki/File:Beta_distribution_pdf.svg]

SLIDE 21

Coin Flipping: Beta Prior

Prior: θ ∼ Beta(h, t), with density p(θ) ∝ θ^{h−1}(1 − θ)^{t−1}.

Mean of the Beta distribution: E[θ] = h/(h + t).

Mode of the Beta distribution: argmax_θ p(θ) = (h − 1)/(h + t − 2), for h, t > 1.

SLIDE 22

Coin Flipping: Posterior

Prior: θ ∼ Beta(h, t), with p(θ) ∝ θ^{h−1}(1 − θ)^{t−1}.

Likelihood function: L_D(θ) = p(D | θ) = θ^{n_h}(1 − θ)^{n_t}.

Posterior density:

p(θ | D) ∝ p(θ) p(D | θ) ∝ θ^{h−1}(1 − θ)^{t−1} × θ^{n_h}(1 − θ)^{n_t} = θ^{h−1+n_h}(1 − θ)^{t−1+n_t}

SLIDE 23

Posterior is Beta

Prior: θ ∼ Beta(h, t), with p(θ) ∝ θ^{h−1}(1 − θ)^{t−1}.

Posterior density: p(θ | D) ∝ θ^{h−1+n_h}(1 − θ)^{t−1+n_t}.

The posterior is in the Beta family: θ | D ∼ Beta(h + n_h, t + n_t).

Interpretation: the prior initializes our counts with h heads and t tails; the posterior increments the counts by the observed n_h and n_t.
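The conjugate update is pure count arithmetic; a minimal sketch (the function name is mine):

```python
def beta_bernoulli_update(h, t, n_h, n_t):
    """Beta(h, t) prior + (n_h heads, n_t tails) -> Beta(h + n_h, t + n_t) posterior."""
    return h + n_h, t + n_t

print(beta_bernoulli_update(2, 2, 75, 60))  # (77, 62)
```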

SLIDE 24

Sidebar: Conjugate Priors

It is interesting that the posterior is in the same distribution family as the prior.

Let π be a family of prior distributions on Θ.

Let P be a parametric family of distributions with parameter space Θ.

Definition: A family of distributions π is conjugate to a parametric model P if for any prior in π, the posterior is always in π.

The Beta family is conjugate to the coin-flipping (i.e., Bernoulli) model.

The family of all probability distributions is [trivially] conjugate to any parametric model.

SLIDE 25

Example: Coin Flipping - Concrete Example

Suppose we have a coin, possibly biased (parametric probability model): p(Heads | θ) = θ.

Parameter space: θ ∈ Θ = [0, 1].

Prior distribution: θ ∼ Beta(2, 2).

SLIDE 26

Example: Coin Flipping

Next, we gather some data D = {H, H, T, T, T, T, T, H, ..., T}: 75 heads and 60 tails.

θ̂_MLE = 75/(75 + 60) ≈ 0.556

Posterior distribution: θ | D ∼ Beta(77, 62).
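A sketch of this example with scipy (the scipy calls are my illustration; the counts and the Beta(2, 2) prior are from the slides):

```python
from scipy.stats import beta

n_h, n_t = 75, 60                      # observed heads and tails
posterior = beta(2 + n_h, 2 + n_t)     # Beta(2, 2) prior -> Beta(77, 62) posterior

print(n_h / (n_h + n_t))               # MLE ≈ 0.556
print(posterior.mean())                # posterior mean = 77/139 ≈ 0.554
```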

SLIDE 27

Bayesian Point Estimates

So we have the posterior θ | D... but we want a point estimate θ̂ for θ.

Common options:

posterior mean: θ̂ = E[θ | D]

maximum a posteriori (MAP) estimate: θ̂ = argmax_θ p(θ | D)
(note: this is the mode of the posterior distribution)
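For a Beta(h, t) posterior, both point estimates have the closed forms given on the Beta prior slide; a small sketch (the helper names are mine):

```python
def beta_posterior_mean(h, t):
    return h / (h + t)               # E[θ | D] for a Beta(h, t) posterior

def beta_posterior_mode(h, t):
    assert h > 1 and t > 1           # the mode formula needs h, t > 1
    return (h - 1) / (h + t - 2)     # MAP estimate (posterior mode)

print(beta_posterior_mean(77, 62))   # ≈ 0.554
print(beta_posterior_mode(77, 62))   # ≈ 0.555
```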

SLIDE 28

What else can we do with a posterior?

Look at it.

Extract a “credible set” for θ (the Bayesian version of a confidence interval):
e.g., an interval [a, b] is a 95% credible set if P(θ ∈ [a, b] | D) ≥ 0.95 (see the sketch below).

The most “Bayesian” approach is Bayesian decision theory:

Choose a loss function.

Find the action minimizing the expected loss w.r.t. the posterior.
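A minimal sketch (my construction, not from the slides) of a central 95% credible interval for the Beta(77, 62) posterior from the earlier example, using equal-tailed posterior quantiles:

```python
from scipy.stats import beta

posterior = beta(77, 62)
a, b = posterior.ppf(0.025), posterior.ppf(0.975)  # central 95% interval
print(a, b)    # roughly (0.47, 0.64); P(θ ∈ [a, b] | D) = 0.95
```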

SLIDE 29

Bayesian Decision Theory

SLIDE 30

Bayesian Decision Theory

Ingredients:

Parameter space Θ.

Prior: a distribution p(θ) on Θ.

Action space A.

Loss function: ℓ : A × Θ → R.

The posterior risk of an action a ∈ A is

r(a) := E[ℓ(θ, a) | D] = ∫ ℓ(θ, a) p(θ | D) dθ.

It is the expected loss under the posterior.

A Bayes action a∗ is an action that minimizes the posterior risk: r(a∗) = min_{a∈A} r(a).
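As a sketch of how a Bayes action can be found numerically (the Monte Carlo approximation, the square loss, and the Beta(77, 62) posterior are my choices for illustration): approximate r(a) by an average over posterior draws and minimize over a grid of actions.

```python
import numpy as np
from scipy.stats import beta

theta_samples = beta(77, 62).rvs(100_000, random_state=0)  # draws from p(θ | D)

def posterior_risk(a, loss):
    """Monte Carlo estimate of r(a) = E[ℓ(θ, a) | D]."""
    return loss(theta_samples, a).mean()

square_loss = lambda th, act: (th - act) ** 2
actions = np.linspace(0, 1, 1001)                # grid over the action space A
risks = [posterior_risk(a, square_loss) for a in actions]
print(actions[np.argmin(risks)])                 # ≈ 0.554, the posterior mean
```

Consistent with the derivation on the next slides, the minimizer under square loss lands at the posterior mean.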

SLIDE 31

Bayesian Point Estimation

General setup:

Data D generated by p(y | θ), for unknown θ ∈ Θ.

We want to produce a point estimate for θ.

Choose the following:

Prior p(θ) on Θ = R.

Loss ℓ(θ̂, θ) = (θ − θ̂)²

Find the action θ̂ ∈ Θ that minimizes the posterior risk:

r(θ̂) = E[(θ − θ̂)² | D] = ∫ (θ − θ̂)² p(θ | D) dθ

SLIDE 32

Bayesian Point Estimation: Square Loss

Find the action θ̂ ∈ Θ that minimizes the posterior risk r(θ̂) = ∫ (θ − θ̂)² p(θ | D) dθ.

Differentiate:

dr(θ̂)/dθ̂ = −∫ 2 (θ − θ̂) p(θ | D) dθ
          = −2 ∫ θ p(θ | D) dθ + 2 θ̂ ∫ p(θ | D) dθ      (the last integral equals 1)
          = −2 ∫ θ p(θ | D) dθ + 2 θ̂

SLIDE 33

Bayesian Point Estimation: Square Loss

The derivative of the posterior risk is

dr(θ̂)/dθ̂ = −2 ∫ θ p(θ | D) dθ + 2 θ̂.

The first-order condition dr(θ̂)/dθ̂ = 0 gives

θ̂ = ∫ θ p(θ | D) dθ = E[θ | D].

The Bayes action for square loss is the posterior mean.

SLIDE 34

Bayesian Point Estimation: Absolute Loss

Loss: ℓ(θ, θ̂) = |θ − θ̂|

The Bayes action for absolute loss is the posterior median, that is, the median of the distribution p(θ | D).

This can be shown with an approach similar to the one used in Homework #1.
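A Monte Carlo check (my sketch) that the posterior median minimizes the expected absolute loss, again for the Beta(77, 62) posterior:

```python
import numpy as np
from scipy.stats import beta

samples = beta(77, 62).rvs(100_000, random_state=0)   # draws from p(θ | D)
candidates = np.linspace(0, 1, 1001)                  # grid of candidate actions
abs_risk = [np.mean(np.abs(samples - a)) for a in candidates]
print(candidates[np.argmin(abs_risk)], np.median(samples))  # both ≈ 0.55
```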

SLIDE 35

Bayesian Point Estimation: Zero-One Loss

Suppose Θ is discrete (e.g., Θ = {english, french}).

Zero-one loss: ℓ(θ, θ̂) = 1(θ ≠ θ̂)

Posterior risk:

r(θ̂) = E[1(θ ≠ θ̂) | D] = P(θ ≠ θ̂ | D) = 1 − P(θ = θ̂ | D) = 1 − p(θ̂ | D)

The Bayes action is θ̂ = argmax_{θ∈Θ} p(θ | D).

This θ̂ is called the maximum a posteriori (MAP) estimate.

The MAP estimate is the mode of the posterior distribution.
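For discrete Θ, the Bayes action under zero-one loss is just the posterior mode; a toy sketch (the posterior values are invented for illustration):

```python
# Toy posterior on Θ = {english, french} (illustrative numbers only)
posterior = {"english": 0.7, "french": 0.3}

# MAP estimate: the θ maximizing p(θ | D), equivalently minimizing 1 - p(θ̂ | D)
theta_map = max(posterior, key=posterior.get)
print(theta_map, 1 - posterior[theta_map])   # english, posterior risk 0.3
```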

SLIDE 36

Summary

SLIDE 37

Recap and Interpretation

The prior represents our belief about θ before observing the data D.

The posterior represents the rationally “updated” beliefs after seeing D.

All inferences and action-taking are based on the posterior distribution.

In the Bayesian approach, there is no issue of “choosing a procedure” or justifying an estimator. The only choices are the family of distributions, indexed by Θ, and the prior distribution on Θ.

For decision making, we need a loss function. Everything after that is computation.

SLIDE 38

The Bayesian Method

1. Define the model:
   Choose a parametric family of densities: {p(D | θ) : θ ∈ Θ}.
   Choose a distribution p(θ) on Θ, called the prior distribution.

2. After observing D, compute the posterior distribution p(θ | D).

3. Choose an action based on p(θ | D).