Bayesian Methods for Parameter Estimation

Chris Williams

School of Informatics, University of Edinburgh

October 2007


Overview

Introduction to Bayesian statistics:
- Learning a probability
- Learning the mean of a Gaussian

Readings: Bishop §2.1 (Beta), §2.2 (Dirichlet), §2.3.6 (Gaussian); Heckerman tutorial, section 2


Bayesian vs Frequentist Inference

Frequentist:
- Assumes that there is an unknown but fixed parameter θ
- Estimates θ with some confidence
- Predicts using the estimated parameter value

Bayesian:
- Represents uncertainty about the unknown parameter
- Uses probability to quantify this uncertainty; unknown parameters are treated as random variables
- Prediction follows the rules of probability


Frequentist method

Model $p(x \mid \theta, M)$, data $D = \{x_1, \ldots, x_n\}$. The maximum likelihood estimate is

$$\hat{\theta} = \operatorname{argmax}_{\theta}\, p(D \mid \theta, M)$$

Prediction for $x_{n+1}$ is based on $p(x_{n+1} \mid \hat{\theta}, M)$.


Bayesian method

Prior distribution $p(\theta \mid M)$, posterior distribution $p(\theta \mid D, M)$:

$$p(\theta \mid D, M) = \frac{p(D \mid \theta, M)\, p(\theta \mid M)}{p(D \mid M)}$$

Making predictions:

$$p(x_{n+1} \mid D, M) = \int p(x_{n+1}, \theta \mid D, M)\, d\theta$$
$$= \int p(x_{n+1} \mid \theta, D, M)\, p(\theta \mid D, M)\, d\theta$$
$$= \int p(x_{n+1} \mid \theta, M)\, p(\theta \mid D, M)\, d\theta$$

where the last step uses the fact that, given θ, $x_{n+1}$ is independent of the earlier data D.

Interpretation: an average of the predictions $p(x_{n+1} \mid \theta, M)$, weighted by the posterior $p(\theta \mid D, M)$. The normalizer $p(D \mid M)$ is the marginal likelihood (important for model comparison).
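To make the recipe concrete, here is a brute-force numerical sketch: discretize θ, multiply likelihood by prior, normalize, and average the per-θ predictions. The Bernoulli model, uniform prior, and counts are assumptions for illustration, not from the slides.

```python
# Minimal sketch of the Bayesian recipe on a grid: posterior is
# likelihood x prior (normalized), prediction averages over theta.
import numpy as np

theta = np.linspace(0.001, 0.999, 999)   # grid over the parameter
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)              # p(theta | M): uniform (assumed)
n_h, n_t = 6, 4                          # illustrative data summary
lik = theta**n_h * (1 - theta)**n_t      # p(D | theta, M)

post = lik * prior
post /= post.sum() * dtheta              # normalize: divide by p(D | M)

# p(x_{n+1} = heads | D, M) = integral of theta * p(theta | D, M) dtheta
p_heads = (theta * post).sum() * dtheta
print(f"p(next = heads | D) = {p_heads:.3f}")   # approx (6+1)/(10+2) = 0.583
```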

Bayes, MAP and Maximum Likelihood

$$p(x_{n+1} \mid D, M) = \int p(x_{n+1} \mid \theta, M)\, p(\theta \mid D, M)\, d\theta$$

Maximum a posteriori value of θ:

$$\theta_{\mathrm{MAP}} = \operatorname{argmax}_{\theta}\, p(\theta \mid D, M)$$

Note: θ_MAP is not invariant to reparameterization (cf. the ML estimator, which is). If the posterior is sharply peaked about the most probable value θ_MAP, then

$$p(x_{n+1} \mid D, M) \simeq p(x_{n+1} \mid \theta_{\mathrm{MAP}}, M)$$

In the limit n → ∞, θ_MAP converges to $\hat{\theta}$ (as long as the prior $p(\hat{\theta}) \neq 0$). The Bayesian approach is most effective when data is limited, i.e. when n is small.
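The reparameterization caveat is easy to verify numerically: a monotone change of variables multiplies the density by a Jacobian factor, which moves the mode. A sketch, assuming a Beta(3, 2) posterior and a logit reparameterization (both chosen purely for illustration):

```python
# Sketch: the MAP estimate depends on the parameterization.
import numpy as np
from scipy.stats import beta

a, b = 3.0, 2.0                              # assumed Beta posterior
theta = np.linspace(1e-4, 1 - 1e-4, 100001)

# Mode in the theta parameterization: (a-1)/(a+b-2) = 2/3
map_theta = theta[np.argmax(beta.pdf(theta, a, b))]

# Reparameterize phi = logit(theta); the density in phi picks up the
# Jacobian |dtheta/dphi| = theta * (1 - theta), so the mode moves.
p_phi = beta.pdf(theta, a, b) * theta * (1 - theta)
map_theta_from_phi = theta[np.argmax(p_phi)]   # a/(a+b) = 0.6, not 2/3

print(map_theta, map_theta_from_phi)
```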


Learning probabilities: thumbtack example

Frequentist approach:
- The probability of heads θ is unknown
- Given i.i.d. data, estimate θ using an estimator with good properties (e.g. the ML estimator)

[Figure: the two thumbtack outcomes, "heads" and "tails"]


Likelihood

Likelihood for a sequence of heads and tails:

$$p(hhth \ldots tth \mid \theta) = \theta^{n_h} (1 - \theta)^{n_t}$$

MLE:

$$\hat{\theta} = \frac{n_h}{n_h + n_t}$$
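In code, the MLE is just a ratio of counts; the outcome string below is made up for illustration:

```python
# Sketch: ML estimate of the heads probability from an (illustrative)
# outcome sequence.
seq = "hhthtthhth"
n_h, n_t = seq.count("h"), seq.count("t")
theta_mle = n_h / (n_h + n_t)
print(f"n_h={n_h}, n_t={n_t}, MLE = {theta_mle:.2f}")   # 6, 4, 0.60
```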


Learning probabilities: thumbtack example

Bayesian approach: (a) the prior

For the prior density p(θ), use a Beta distribution:

$$p(\theta) = \mathrm{Beta}(\alpha_h, \alpha_t) \propto \theta^{\alpha_h - 1} (1 - \theta)^{\alpha_t - 1} \quad \text{for } \alpha_h, \alpha_t > 0$$

Properties of the Beta distribution:

$$E[\theta] = \int \theta\, p(\theta)\, d\theta = \frac{\alpha_h}{\alpha_h + \alpha_t}$$


Examples of the Beta distribution

[Figure: four Beta densities on (0, 1): Beta(0.5, 0.5), Beta(1, 1), Beta(3, 2), Beta(15, 10)]
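A sketch that reproduces the four panels with scipy/matplotlib (layout choices are mine):

```python
# Sketch: plot the four Beta densities from the slide.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

theta = np.linspace(0.001, 0.999, 500)
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, (a, b) in zip(axes.ravel(), [(0.5, 0.5), (1, 1), (3, 2), (15, 10)]):
    ax.plot(theta, beta.pdf(theta, a, b))
    ax.set_title(f"Beta({a},{b})")
plt.tight_layout()
plt.show()
```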


Bayesian approach: (b) the posterior

$$p(\theta \mid D) \propto p(\theta)\, p(D \mid \theta) \propto \theta^{\alpha_h - 1} (1 - \theta)^{\alpha_t - 1}\, \theta^{n_h} (1 - \theta)^{n_t} \propto \theta^{\alpha_h + n_h - 1} (1 - \theta)^{\alpha_t + n_t - 1}$$

- The posterior is also a Beta distribution, Beta(α_h + n_h, α_t + n_t)
- The Beta prior is conjugate to the binomial likelihood (i.e. prior and posterior have the same parametric form)
- α_h and α_t can be thought of as imaginary counts, with α = α_h + α_t the equivalent sample size
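The conjugate update is pure bookkeeping, as this sketch with assumed prior counts and data shows:

```python
# Sketch: conjugate Beta-binomial update (prior counts and data are
# illustrative).
alpha_h, alpha_t = 2.0, 2.0     # imaginary counts; equivalent sample size 4
n_h, n_t = 7, 3                 # observed heads and tails

post_h, post_t = alpha_h + n_h, alpha_t + n_t   # posterior Beta parameters
post_mean = post_h / (post_h + post_t)          # E[theta | D]
print(f"posterior Beta({post_h:g}, {post_t:g}), mean {post_mean:.3f}")
```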


Bayesian Approach: (c) making predictions

[Figure: graphical model with θ as the parent of x_1, x_2, …, x_n and of x_{n+1}]

$$p(X_{n+1} = \mathrm{heads} \mid D, M) = \int p(X_{n+1} = \mathrm{heads} \mid \theta)\, p(\theta \mid D, M)\, d\theta$$
$$= \int \theta\, \mathrm{Beta}(\alpha_h + n_h, \alpha_t + n_t)\, d\theta$$
$$= \frac{\alpha_h + n_h}{\alpha + n}$$
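Comparing this predictive with the ML plug-in shows the smoothing effect of the imaginary counts. The Beta(1, 1) prior and all-heads data below are assumptions for illustration; with this prior the formula reduces to Laplace's rule of succession:

```python
# Sketch: Bayesian predictive vs. ML plug-in for the thumbtack.
alpha_h, alpha_t = 1.0, 1.0     # uniform prior Beta(1, 1) (assumed)
n_h, n_t = 3, 0                 # three heads in a row, no tails
alpha, n = alpha_h + alpha_t, n_h + n_t

p_pred = (alpha_h + n_h) / (alpha + n)   # 4/5: hedged away from certainty
p_mle = n_h / n                          # 1.0: overconfident when n is small
print(p_pred, p_mle)
```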


Beyond Conjugate Priors

The thumbtack came from a magic shop → a mixture prior

$$p(\theta) = 0.4\,\mathrm{Beta}(20, 0.5) + 0.2\,\mathrm{Beta}(2, 2) + 0.4\,\mathrm{Beta}(0.5, 20)$$

A mixture of Betas is still tractable: each component updates conjugately, and the mixture weights are reweighted by how well each component predicts the data (see the sketch below).
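A sketch of that update, using the slide's mixture and hypothetical counts; each component's marginal likelihood is the ratio of Beta functions B(α_h + n_h, α_t + n_t) / B(α_h, α_t):

```python
# Sketch: updating the slide's mixture-of-Betas prior on illustrative
# counts. Posterior = reweighted mixture of updated Beta components.
import numpy as np
from scipy.special import betaln

w = np.array([0.4, 0.2, 0.4])      # mixture weights from the slide
a = np.array([20.0, 2.0, 0.5])     # per-component alpha_h
b = np.array([0.5, 2.0, 20.0])     # per-component alpha_t
n_h, n_t = 5, 1                    # hypothetical data

# log marginal likelihood of the data under each component:
# log B(a + n_h, b + n_t) - log B(a, b)
log_ml = betaln(a + n_h, b + n_t) - betaln(a, b)
w_post = w * np.exp(log_ml - log_ml.max())
w_post /= w_post.sum()

# posterior: sum_k w_post[k] * Beta(a[k] + n_h, b[k] + n_t)
print("posterior weights:", np.round(w_post, 3))
```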


Generalization to multinomial variables

Dirichlet prior:

$$p(\theta_1, \ldots, \theta_r) = \mathrm{Dir}(\alpha_1, \ldots, \alpha_r) \propto \prod_{i=1}^{r} \theta_i^{\alpha_i - 1}$$

with $\sum_i \theta_i = 1$ and $\alpha_i > 0$.

- The α_i's are imaginary counts, with $\alpha = \sum_i \alpha_i$ the equivalent sample size
- Properties: $E[\theta_i] = \alpha_i / \alpha$
- The Dirichlet distribution is conjugate to the multinomial likelihood


Posterior distribution:

$$p(\theta \mid n_1, \ldots, n_r) \propto \prod_{i=1}^{r} \theta_i^{\alpha_i + n_i - 1}$$

Marginal likelihood:

$$p(D \mid M) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{i=1}^{r} \frac{\Gamma(\alpha_i + n_i)}{\Gamma(\alpha_i)}$$
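A sketch of both formulas with made-up counts; the marginal likelihood is computed in log space with gammaln for numerical stability:

```python
# Sketch: Dirichlet-multinomial posterior update and log marginal
# likelihood, using the formulas on this slide (counts are illustrative).
import numpy as np
from scipy.special import gammaln

alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet prior over r = 3 outcomes
counts = np.array([4, 1, 2])        # observed counts n_1, ..., n_r

post = alpha + counts               # posterior is Dir(alpha + n)

a, n = alpha.sum(), counts.sum()
log_ml = (gammaln(a) - gammaln(a + n)
          + np.sum(gammaln(alpha + counts) - gammaln(alpha)))
print("posterior parameters:", post, " log p(D|M) =", log_ml)
```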


Inferring the mean of a Gaussian

Likelihood: $p(x \mid \mu) \sim N(\mu, \sigma^2)$, with the variance σ² assumed known

Prior: $p(\mu) \sim N(\mu_0, \sigma_0^2)$

Given data $D = \{x_1, \ldots, x_n\}$, what is $p(\mu \mid D)$?


$$p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)$$

with

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\bar{x} + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$$

See Bishop §2.3.6 for details.
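A sketch of this update on synthetic data (all numbers are illustrative); note how the posterior mean interpolates between the prior mean µ0 and the sample mean x̄, and how the posterior variance shrinks as n grows:

```python
# Sketch: closed-form posterior for the mean of a Gaussian with known
# variance, following the update on this slide.
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma = 2.0, 1.0
x = rng.normal(mu_true, sigma, size=10)   # synthetic data

mu0, sigma0 = 0.0, 2.0                    # prior N(mu0, sigma0^2) (assumed)
n, xbar = len(x), x.mean()

mu_n = (n * sigma0**2 * xbar + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
var_n = 1.0 / (n / sigma**2 + 1.0 / sigma0**2)
print(f"posterior: N({mu_n:.3f}, {var_n:.3f})")   # pulled toward xbar as n grows
```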


Comparing Bayesian and Frequentist approaches

- Frequentist: fix θ, consider all possible data sets generated with that θ
- Bayesian: fix D, consider all possible values of θ
- One view is that the Bayesian and frequentist approaches simply have different definitions of what it means to be a good estimator
