Slide 1

Bayesian Methods 1

Chris Williams

School of Informatics, University of Edinburgh

October 2015

Slide 2

Overview

◮ Introduction to Bayesian statistics: learning a Bernoulli probability
◮ Learning a discrete distribution
◮ Learning the mean of a Gaussian
◮ Exponential family
◮ Readings: Murphy §3.3 (Beta), §3.4 (Dirichlet), §4.6.1 (Gaussian); Barber §9.1.1, 9.1.3 (Beta), §9.4.3 (no parents, Dirichlet), §8.8.2 (Gaussian)

Slide 3

Bayesian vs Frequentist Inference

Frequentist

◮ Assumes that there is an unknown but fixed parameter θ
◮ Estimates θ with some confidence
◮ Prediction by using the estimated parameter value

Bayesian

◮ Represents uncertainty about the unknown parameter
◮ Uses probability to quantify this uncertainty; unknown parameters are treated as random variables
◮ Prediction follows the rules of probability

Slide 4

Frequentist method

◮ Model p(x|θ, M), data D = {x_1, . . . , x_N}
◮ Maximum likelihood estimate of θ:

$$\hat{\theta} = \arg\max_{\theta}\, p(D \mid \theta, M)$$

◮ Prediction for x_{N+1} is based on p(x_{N+1} | θ̂, M)

Slide 5

Bayesian method

◮ Prior distribution p(θ|M)
◮ Posterior distribution

$$p(\theta \mid D, M) = \frac{p(D \mid \theta, M)\, p(\theta \mid M)}{p(D \mid M)}$$

◮ Making predictions

$$\begin{aligned} p(x_{N+1} \mid D, M) &= \int p(x_{N+1}, \theta \mid D, M)\, d\theta \\ &= \int p(x_{N+1} \mid \theta, D, M)\, p(\theta \mid D, M)\, d\theta \\ &= \int p(x_{N+1} \mid \theta, M)\, p(\theta \mid D, M)\, d\theta \end{aligned}$$

(the last step uses the fact that x_{N+1} is independent of D given θ)

Interpretation: an average of the predictions p(x_{N+1}|θ, M), weighted by the posterior p(θ|D, M)

◮ Marginal likelihood (important for model comparison)

$$p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta$$
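The "average of predictions" interpretation suggests a simple Monte Carlo check: draw samples θ⁽ˢ⁾ from the posterior and average the predictions p(x_{N+1}|θ⁽ˢ⁾, M). A minimal sketch for the Beta-Bernoulli model treated on the following slides (the Beta(8, 6) posterior is a hypothetical example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior over a coin's heads-probability: Beta(8, 6).
# Since p(heads | theta) = theta, the predictive probability of heads is
# the posterior mean of theta, approximated here by Monte Carlo.
theta_samples = rng.beta(8.0, 6.0, size=100_000)
print(theta_samples.mean())   # ~ 8/14 ~ 0.571, the exact posterior mean
```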

Slide 6

Bayes, MAP and Maximum Likelihood

$$p(x_{N+1} \mid D, M) = \int p(x_{N+1} \mid \theta, M)\, p(\theta \mid D, M)\, d\theta$$

◮ Maximum a posteriori (MAP) value of θ

$$\theta_{MAP} = \arg\max_{\theta}\, p(\theta \mid D, M)$$

Note: not invariant to reparameterization (cf. the ML estimator, which is); example: variance vs precision (τ = 1/σ²) for a Gaussian, as in the sketch below
◮ If the posterior is sharply peaked about the most probable value θ_MAP, then p(x_{N+1}|D, M) ≃ p(x_{N+1}|θ_MAP, M)
◮ In the limit N → ∞, θ_MAP converges to the ML estimate θ̂ (as long as p(θ̂) ≠ 0)
◮ The Bayesian approach is most effective when data is limited, i.e. when N is small
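A minimal numerical sketch of the non-invariance (the Inverse-Gamma posterior over σ² and its parameters are hypothetical choices, not from the slides). If the posterior over the variance σ² is Inverse-Gamma(a, b), the same posterior expressed over the precision τ = 1/σ² is Gamma(a, rate b), because of the Jacobian of the transformation, and the two modes do not map onto each other:

```python
import numpy as np
from scipy.stats import invgamma, gamma

a, b = 3.0, 2.0   # hypothetical posterior over sigma^2: Inverse-Gamma(a, scale=b)

# MAP under the variance parameterisation (mode of IG(a, b) is b/(a+1)).
grid = np.linspace(1e-3, 5.0, 100_000)
sig2_map = grid[np.argmax(invgamma.pdf(grid, a, scale=b))]

# MAP under the precision parameterisation tau = 1/sigma^2. The induced
# density picks up a Jacobian factor 1/tau^2, giving Gamma(a, rate=b),
# whose mode is (a-1)/b.
tau_map = grid[np.argmax(gamma.pdf(grid, a, scale=1.0 / b))]

print(sig2_map, b / (a + 1))   # ~0.5: MAP of the variance
print(tau_map, (a - 1) / b)    # ~1.0: MAP of the precision
print(1.0 / sig2_map)          # ~2.0 != tau_map, so MAP is not invariant
```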

Slide 7

Learning probabilities: thumbtack example

Frequentist Approach

◮ The probability of heads θ is unknown
◮ Given iid data, estimate θ using an estimator with good properties (e.g. the ML estimator)

[Figure: thumbtack landing orientations, labelled heads and tails]

Slide 8

Likelihood

◮ Likelihood for a sequence of heads (1) and tails (0)

$$p(1100\ldots001 \mid \theta) = \theta^{N_1}(1 - \theta)^{N_0}$$

◮ MLE

$$\hat{\theta} = \frac{N_1}{N_1 + N_0}$$
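As a quick sanity check of these two formulas, the sketch below (the data sequence is made up) evaluates the log-likelihood on a grid of θ values and confirms that the maximiser matches the closed-form MLE:

```python
import numpy as np

# A made-up sequence of outcomes: 1 = heads, 0 = tails.
x = np.array([1, 1, 0, 0, 1, 0, 1, 1, 1, 0])
N1 = x.sum()
N0 = len(x) - N1

# log p(D|theta) = N1*log(theta) + N0*log(1 - theta), evaluated on a grid.
theta = np.linspace(0.001, 0.999, 999)
loglik = N1 * np.log(theta) + N0 * np.log(1 - theta)

print(theta[np.argmax(loglik)])   # ~0.6
print(N1 / (N1 + N0))             # 0.6, the closed-form MLE
```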

Slide 9

Learning probabilities: thumbtack example

Bayesian Approach: (a) the prior

◮ Prior density p(θ): use a Beta distribution

$$p(\theta) = \mathrm{Beta}(\alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1} \quad \text{for } \alpha, \beta > 0$$

◮ Properties of the Beta distribution

$$E[\theta] = \int \theta\, p(\theta)\, d\theta = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{var}(\theta) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$
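A two-line check of these moment formulas against scipy.stats (the values of α and β are arbitrary):

```python
from scipy.stats import beta

a, b = 3.0, 2.0   # arbitrary alpha, beta
mean, var = beta.stats(a, b, moments='mv')
print(mean, a / (a + b))                          # 0.6  0.6
print(var, a * b / ((a + b)**2 * (a + b + 1)))    # 0.04  0.04
```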

Slide 10

Examples of the Beta distribution

[Figure: four Beta densities on (0, 1): Beta(0.5, 0.5), Beta(1, 1), Beta(3, 2), Beta(15, 10)]

Slide 11

Bayesian Approach: (b) the posterior

$$p(\theta \mid D) \propto p(\theta)\, p(D \mid \theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\, \theta^{N_1}(1-\theta)^{N_0} \propto \theta^{\alpha+N_1-1}(1-\theta)^{\beta+N_0-1}$$

◮ The posterior is also a Beta distribution: Beta(α + N_1, β + N_0)
◮ The Beta prior is conjugate to the binomial likelihood (i.e. prior and posterior have the same parametric form)
◮ α and β can be thought of as imaginary counts, with α + β as the equivalent sample size [cointoss demo]
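A minimal sketch of the conjugate update (the prior pseudo-counts and data counts are made-up numbers): the posterior parameters are simply the prior pseudo-counts plus the observed counts.

```python
from scipy.stats import beta

a0, b0 = 2.0, 2.0   # hypothetical Beta(2, 2) prior pseudo-counts
N1, N0 = 6, 4       # made-up observed heads/tails counts

# Conjugate update: posterior is Beta(a0 + N1, b0 + N0).
posterior = beta(a0 + N1, b0 + N0)

print(posterior.mean())   # (a0 + N1)/(a0 + b0 + N1 + N0) = 8/14 ~ 0.571
print(N1 / (N1 + N0))     # 0.6, the MLE, pulled towards the prior mean 0.5
```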

Slide 12

Bayesian Approach: (c) making predictions

$$\begin{aligned} p(X_{N+1} = \text{heads} \mid D, M) &= \int p(X_{N+1} = \text{heads} \mid \theta)\, p(\theta \mid D, M)\, d\theta \\ &= \int \theta\, \mathrm{Beta}(\theta;\, \alpha + N_1, \beta + N_0)\, d\theta \\ &= \frac{\alpha + N_1}{\alpha + \beta + N} \end{aligned}$$
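This closed form is just the mean of the Beta posterior. The sketch below (same made-up numbers as above) confirms it by numerically integrating ∫ θ Beta(θ; α + N_1, β + N_0) dθ:

```python
from scipy.stats import beta
from scipy.integrate import quad

a0, b0, N1, N0 = 2.0, 2.0, 6, 4
N = N1 + N0

# Numerically integrate theta * Beta(theta; a0 + N1, b0 + N0) over (0, 1).
pred, _ = quad(lambda t: t * beta.pdf(t, a0 + N1, b0 + N0), 0.0, 1.0)

print(pred)                          # ~0.5714
print((a0 + N1) / (a0 + b0 + N))     # 8/14, the closed form
```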

Slide 13

Beyond Conjugate Priors

◮ The thumbtack came from a magic shop → use a mixture prior

$$p(\theta) = 0.4\, \mathrm{Beta}(20, 0.5) + 0.2\, \mathrm{Beta}(2, 2) + 0.4\, \mathrm{Beta}(0.5, 20)$$
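A mixture-of-Betas prior still gives a tractable posterior: each component updates conjugately to Beta(α_k + N_1, β_k + N_0), and the mixture weights are reweighted by each component's marginal likelihood B(α_k + N_1, β_k + N_0)/B(α_k, β_k). A sketch using the slide's mixture and made-up counts:

```python
import numpy as np
from scipy.special import betaln

# Mixture prior from the slide: weights and Beta(alpha_k, beta_k) parameters.
w = np.array([0.4, 0.2, 0.4])
a = np.array([20.0, 2.0, 0.5])
b = np.array([0.5, 2.0, 20.0])

N1, N0 = 6, 4   # made-up heads/tails counts

# New weights ~ w_k * B(a_k + N1, b_k + N0) / B(a_k, b_k), in log space.
log_w = np.log(w) + betaln(a + N1, b + N0) - betaln(a, b)
w_post = np.exp(log_w - np.logaddexp.reduce(log_w))

print(w_post)            # posterior mixture weights
print(a + N1, b + N0)    # updated component parameters
```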

Slide 14

Generalization to multinomial variables

◮ Dirichlet prior

$$p(\theta_1, \ldots, \theta_r) = \mathrm{Dir}(\alpha_1, \ldots, \alpha_r) \propto \prod_{i=1}^{r} \theta_i^{\alpha_i - 1} \quad \text{with } \sum_i \theta_i = 1,\ \alpha_i > 0$$

◮ The α_i's are imaginary counts; α = Σ_i α_i is the equivalent sample size
◮ Properties

$$E(\theta_i) = \frac{\alpha_i}{\alpha}$$

◮ The Dirichlet distribution is conjugate to the multinomial likelihood
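A quick check of the mean formula by sampling (the α_i values are arbitrary):

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 5.0, 3.0])   # arbitrary imaginary counts

samples = dirichlet.rvs(alpha, size=100_000, random_state=0)
print(samples.mean(axis=0))    # ~ [0.2, 0.5, 0.3]
print(alpha / alpha.sum())     # exact E(theta_i) = alpha_i / alpha
```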

Slide 15

Examples of Dirichlet Distributions

[Source: https://projects.csail.mit.edu/church/wiki/Models_with_Unbounded_Complexity]

Slide 16

◮ Likelihood

$$\prod_{i=1}^{r} \theta_i^{N_i}$$

◮ Exercise: show that the MLE is θ̂_i = N_i/N
◮ Posterior distribution

$$p(\theta \mid N_1, \ldots, N_r) \propto \prod_{i=1}^{r} \theta_i^{\alpha_i + N_i - 1}$$

◮ Marginal likelihood

$$p(D \mid M) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)} \prod_{i=1}^{r} \frac{\Gamma(\alpha_i + N_i)}{\Gamma(\alpha_i)}$$
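In code this is best evaluated in log space with the log-gamma function. A sketch (the prior and counts are made up; as on the slide, this is the marginal likelihood of one particular observed sequence, with no multinomial coefficient):

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(alpha, counts):
    """log p(D|M) = log[ Gamma(a)/Gamma(a+N) * prod_i Gamma(a_i+N_i)/Gamma(a_i) ]."""
    alpha = np.asarray(alpha, dtype=float)
    counts = np.asarray(counts, dtype=float)
    a, N = alpha.sum(), counts.sum()
    return (gammaln(a) - gammaln(a + N)
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

# Made-up example: uniform Dir(1, 1, 1) prior, counts over r = 3 outcomes.
print(log_marginal_likelihood([1.0, 1.0, 1.0], [3, 5, 2]))
```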

Slide 17

Inferring the mean of a Gaussian

◮ Likelihood: p(x|µ) ∼ N(µ, σ²), with the variance σ² assumed known
◮ Prior: p(µ) ∼ N(µ₀, σ₀²)
◮ Given data D = {x_1, . . . , x_N}, what is p(µ|D)?

Slide 18

$$p(\mu \mid D) \sim N(\mu_N, \sigma_N^2)$$

with

$$\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \mu_N = \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\bar{x} + \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \frac{1}{\sigma_N^2} = \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}$$

◮ See Murphy §4.6.1 or Barber §8.8.2 for details
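A direct transcription of these update equations into code (the data and prior values are made up):

```python
import numpy as np

def posterior_mean_params(x, sigma2, mu0, sigma0_2):
    """Posterior N(mu_N, sigma_N^2) over the mean mu of a Gaussian with
    known variance sigma2, under the prior N(mu0, sigma0_2)."""
    N, xbar = len(x), np.mean(x)
    mu_N = (N * sigma0_2 * xbar + sigma2 * mu0) / (N * sigma0_2 + sigma2)
    sigma_N2 = 1.0 / (N / sigma2 + 1.0 / sigma0_2)
    return mu_N, sigma_N2

# Made-up data: 20 draws from N(2, 1), with a broad N(0, 10) prior on mu.
rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=20)
print(posterior_mean_params(x, sigma2=1.0, mu0=0.0, sigma0_2=10.0))
```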

Slide 19

The exponential family

◮ Any distribution over some x that can be written as

$$P(x \mid \eta) = h(x)\, g(\eta) \exp\left(\eta^{\top} u(x)\right)$$

with h, g and u known functions, is in the exponential family of distributions.
◮ Many common distributions are in the exponential family. A notable exception is the t-distribution.
◮ The η are called the natural parameters of the distribution.
◮ For most distributions, the common representation (and parameterization) does not take the exponential family form.
◮ So it is sometimes useful to convert to the exponential family representation and find the natural parameters.
◮ Exercise: why not try this for some of the distributions that we've seen already? (A worked Bernoulli example follows below.)

Slide 20

Conjugate exponential models

◮ If the prior takes the same functional form as the posterior for a given likelihood, the prior is said to be conjugate for that likelihood
◮ There is a conjugate prior for any exponential family distribution
◮ If the prior and likelihood are conjugate and exponential, then the model is said to be conjugate exponential
◮ In conjugate exponential models, the Bayesian integrals can be done analytically

Slide 21

Reflecting on Conjugacy

◮ All of the priors that we have seen so far are conjugate
◮ Good thing: easy to do the sums
◮ Bad thing: the prior distribution should match your beliefs. Does a Beta distribution match your beliefs? Is it good enough?
◮ Certainly not always
◮ For non-conjugate models, use approximate inference methods (see later in MLPR)

Slide 22

Comparing Bayesian and Frequentist approaches

◮ Frequentist: fix θ, consider all possible data sets generated with that θ
◮ Bayesian: fix D, consider all possible values of θ
◮ One view is that the Bayesian and frequentist approaches have different definitions of what it means to be a good estimator

Slide 23

Summary of Bayesian Methods

◮ Maximum likelihood fails to capture prior knowledge or uncertainty
◮ Need to use a prior distribution (maximum likelihood equals MAP with a uniform prior)
◮ The prior distribution might have its own parameters (usually called hyper-parameters)
◮ MAP fails to capture uncertainty; we need the full posterior distribution
◮ Prediction using MAP parameters does not capture uncertainty
◮ Do inference by marginalization: inference and learning are just using the rules of probability
