Machine Learning Lecture 2 - Bayesian Learning: Binomial and Dirichlet Distributions


SLIDE 1

Machine Learning Lecture 2 - Bayesian Learning: Binomial and Dirichlet Distributions

Devdatt Dubhashi <dubhashi@chalmers.se>

Department of Computer Science and Engineering, Chalmers University

January 21, 2016

SLIDE 2

Coin Tossing

◮ Estimate the probability that a coin shows heads, based on observed coin tosses.

◮ Simple but fundamental!

◮ Historically important: originally used by Bayes (1763) and generalized by Pierre–Simon de Laplace (1774), creating Bayes' Rule.

SLIDE 3

Likelihood

Suppose Xi ∼ Ber(θ), i.e.

P(Xi = 1) = θ ("heads"), P(Xi = 0) = 1 − θ ("tails"),

and θ ∈ [0, 1] is the parameter to be estimated. Given a series of N observed coin tosses, the probability that we observe k heads is given by the Binomial distribution:

Bin(k | N, θ) = (N choose k) θ^k (1 − θ)^(N−k).
SLIDE 4

Likelihood

Thus, the likelihood has the form

P(D | θ) ∝ θ^N1 (1 − θ)^N0,

where N1 and N0 are the numbers of heads and tails seen, respectively. These are called sufficient statistics, since they are all we need to know about the data to estimate θ. Formally, s(D) is a set of sufficient statistics for D if P(θ | D) = P(θ | s(D)).

SLIDE 5

Binomial Distribution

[Figure: Binomial distributions Bin(k | N = 10, θ) for θ = 0.250 (left) and θ = 0.900 (right).]

SLIDE 6

Bayes Rule for Posterior

P(θ | D) = P(D | θ) P(θ) / ∫₀¹ P(D | θ) P(θ) dθ

A bit daunting to compute the integral in the denominator! We can avoid it via the clever trick of choosing a suitable prior!

SLIDE 7

Prior

Need a prior with support on [0, 1] and, to make the math easy, of the same form as the likelihood.

Beta distribution

Beta(θ | a, b) = (1 / B(a, b)) θ^(a−1) (1 − θ)^(b−1),

with hyperparameters a, b, and where

B(a, b) = Γ(a) Γ(b) / Γ(a + b)

is a normalizing factor.

Mean: a / (a + b)

Mode: (a − 1) / (a + b − 2)

Prior Knowledge: If we believe that the mean is 0.7 and the standard deviation is 0.2, we can set a = 2.975 and b = 1.275 (exercise!).

Uninformative Prior: Uniform prior a = 1 = b.
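A minimal sketch of that exercise in Python, matching the Beta mean and variance to the desired moments (the helper name beta_from_moments is ours, not from the lecture):

```python
from scipy import stats

def beta_from_moments(m, s):
    """Beta(a, b) hyperparameters with mean m and standard deviation s.

    From mean = a/(a+b) and var = m(1-m)/(a+b+1):
    set nu := a + b = m(1-m)/s^2 - 1, then a = m*nu, b = (1-m)*nu.
    """
    nu = m * (1 - m) / s**2 - 1
    return m * nu, (1 - m) * nu

a, b = beta_from_moments(0.7, 0.2)
print(a, b)                       # -> 2.975 1.275
print(stats.beta(a, b).mean())    # ~0.7
print(stats.beta(a, b).std())     # ~0.2
```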

SLIDE 8

Beta Distribution

[Figure: Beta densities on [0, 1] for (a, b) = (0.1, 0.1), (1.0, 1.0), (2.0, 3.0) and (8.0, 4.0).]

SLIDE 9

Posterior, Conjugate Prior

Multiplying prior and likelihood gives the posterior:

P(θ | D) ∝ Bin(N1 | N0 + N1, θ) Beta(θ | a, b) ∝ Beta(θ | N1 + a, N0 + b).

The posterior has the same distribution as the prior (with different parameters): the Beta distribution is said to be a conjugate prior for the Binomial distribution. The posterior is obtained by simply adding the prior parameters to the empirical counts; hence the hyperparameters are often called pseudo–counts.
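A sketch of this update with scipy, using the numbers from the left panel of the figure on the next slide (prior Be(2, 2), N1 = 3 heads, N0 = 17 tails):

```python
from scipy import stats
import numpy as np

a, b = 2.0, 2.0                # prior pseudo-counts
N1, N0 = 3, 17                 # observed heads and tails

posterior = stats.beta(a + N1, b + N0)   # conjugacy: still a Beta

theta = np.linspace(0.05, 0.95, 5)
print(posterior.pdf(theta))    # posterior density at a few points
print(posterior.mean())        # (a + N1) / (a + b + N1 + N0)
```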

SLIDE 10

Updating Beta Prior with Binomial Likelihood

[Figure: Updating a Beta prior with a Binomial likelihood. Left: prior Be(2.0, 2.0), likelihood Be(4.0, 18.0), posterior Be(5.0, 19.0). Right: prior Be(5.0, 2.0), likelihood Be(12.0, 14.0), posterior Be(16.0, 15.0).]

SLIDE 11

Sequential update versus Batch

Suppose we have two data sets D1 and D2 with sufficient statistics N1^(1), N0^(1) and N1^(2), N0^(2) respectively. Let N1 := N1^(1) + N1^(2) and N0 := N0^(1) + N0^(2) be the combined sufficient statistics.

In batch mode,

P(θ | D1, D2) ∝ Bin(N1 | N1 + N0, θ) Beta(θ | a, b) ∝ Beta(θ | N1 + a, N0 + b).

In sequential mode,

P(θ | D1, D2) ∝ P(D2 | θ) P(θ | D1)
             ∝ Bin(N1^(2) | N1^(2) + N0^(2), θ) Beta(θ | N1^(1) + a, N0^(1) + b)
             ∝ Beta(θ | N1^(1) + N1^(2) + a, N0^(1) + N0^(2) + b)
             = Beta(θ | N1 + a, N0 + b).

Very suitable for online learning!
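A small numerical check of this equivalence (the split of the data into two batches is made up):

```python
from scipy import stats

a, b = 2.0, 2.0
N1_1, N0_1 = 3, 7              # sufficient statistics of D1
N1_2, N0_2 = 5, 2              # sufficient statistics of D2

# batch: update once with the combined counts
batch = stats.beta(a + N1_1 + N1_2, b + N0_1 + N0_2)

# sequential: the posterior after D1 becomes the prior for D2
seq = stats.beta((a + N1_1) + N1_2, (b + N0_1) + N0_2)

print(batch.args == seq.args)  # True: identical posteriors
```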

SLIDE 12

Posterior Mean and Mode

◮ The MAP estimate is given by

θ̂_MAP = (a + N1 − 1) / (a + b + N − 2).

◮ With uniform prior a = 1 = b, this becomes

θ̂_MLE = N1 / N,

which is just the MLE.

◮ The posterior mean is

θ̄ = (a + N1) / (a + b + N),

which is a convex combination of the prior mean and the MLE:

θ̄ = λ · a / (a + b) + (1 − λ) · θ̂_MLE, with λ := (a + b) / (a + b + N).

Note that as N → ∞, θ̄ → θ̂_MLE.
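These three estimates in code (prior and counts invented), including a check of the convex-combination identity:

```python
a, b = 2.0, 2.0
N1, N0 = 3, 17
N = N1 + N0

theta_map  = (a + N1 - 1) / (a + b + N - 2)   # posterior mode
theta_mle  = N1 / N                           # maximum likelihood
theta_mean = (a + N1) / (a + b + N)           # posterior mean

lam = (a + b) / (a + b + N)
# posterior mean = convex combination of prior mean and MLE
assert abs(theta_mean - (lam * a / (a + b) + (1 - lam) * theta_mle)) < 1e-12
print(theta_map, theta_mle, theta_mean)
```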

SLIDE 13

Posterior Predictive Distribution

The probability of heads in a single new coin toss is:

P(x̃ = 1 | D) = ∫₀¹ P(x = 1 | θ) P(θ | D) dθ
             = ∫₀¹ θ Beta(θ | N1 + a, N0 + b) dθ
             = E_Beta[θ]
             = (N1 + a) / (N1 + N0 + a + b)

SLIDE 14

Predicting Multiple Future Trials

The probability of predicting x heads in M future trials:

P(x | D) = ∫₀¹ Bin(x | M, θ) P(θ | D) dθ
         = ∫₀¹ Bin(x | M, θ) Beta(θ | N1 + a, N0 + b) dθ
         = (M choose x) (1 / B(N1 + a, N0 + b)) ∫₀¹ θ^(x+N1+a−1) (1 − θ)^(M−x+N0+b−1) dθ
         = (M choose x) B(x + N1 + a, (M − x) + N0 + b) / B(N1 + a, N0 + b),

the compound Beta–Binomial distribution, with mean and variance:

E[x] = M (N1 + a) / (N + a + b),

var[x] = M (N1 + a)(N0 + b) / (N1 + a + N0 + b)² · (N + a + b + M) / (N + a + b + 1).
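scipy ships this compound distribution as stats.betabinom; a sketch checking the mean formula above (the numbers are invented):

```python
from scipy import stats

a, b = 2.0, 2.0
N1, N0 = 3, 17
M = 10                                       # future trials

pred = stats.betabinom(M, N1 + a, N0 + b)    # Beta-Binomial(M, N1+a, N0+b)
print(pred.pmf(range(M + 1)))                # P(x | D) for x = 0..M
print(pred.mean())                           # M (N1 + a) / (N + a + b)
```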

SLIDE 15

Posterior Predictive and Plugin

[Figure: Posterior predictive (left) versus plugin predictive (right) distributions over the number of heads.]

SLIDE 16

Tossing a Dice

◮ From coins and two outcomes to dice and many outcomes.

◮ Given observations from a dice with K faces, predict the next roll.

◮ Suppose we observe N dice rolls D = {x1, x2, · · · , xN}, where each xi ∈ {1, · · · , K}.

SLIDE 17

Likelihood

◮ Suppose we observe N dice rolls D = {x1, x2, · · · , xN}, where each xi ∈ {1, · · · , K}. Then

P(D | θ) = ∏_{k=1}^{K} θk^Nk,

where θk is the (unknown) probability of showing face k and Nk is the observed number of outcomes showing face k.

SLIDE 18

Multinomial Distribution

The probability of observing face k appear xk times, for each k, in n rolls of a dice with face probabilities θ := (θk, k ∈ {1, · · · , K}) is the

Multinomial Distribution

Mu(x | n, θ) = (n choose x1, x2, · · · , xK) ∏_{k=1}^{K} θk^xk.
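scipy exposes this pmf directly; a small sketch (the face counts are invented):

```python
from scipy import stats

n = 6
theta = [1 / 6] * 6            # a fair six-faced dice
x = [1, 0, 2, 1, 1, 1]         # face counts, summing to n

print(stats.multinomial(n, theta).pmf(x))
```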

SLIDE 19

Priors

Since the parameter vector θ lives in the K–dimensional simplex

SK := {(x1, · · · , xK) | x1 + · · · + xK = 1},

we need a prior that

◮ has support on this simplex, and

◮ is ideally also conjugate to the likelihood distribution, i.e. the multinomial.

SLIDE 20

Dirichlet Distribution

Dirichlet Distribution

Dir(x | α) := (1 / B(α)) ∏_{k=1}^{K} xk^(αk−1) · 1[x ∈ SK],

where B(α) is the normalization factor

B(α) := ∏_{k=1}^{K} Γ(αk) / Γ(α0), with α0 := Σ_{k=1}^{K} αk.
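numpy can sample from this distribution and scipy can evaluate its density; a brief sketch:

```python
import numpy as np
from scipy import stats

alpha = [2.0, 2.0, 2.0]

# one draw from the simplex S_3: nonnegative entries summing to 1
x = np.random.default_rng(0).dirichlet(alpha)
print(x, x.sum())

# density of that point under Dir(alpha)
print(stats.dirichlet(alpha).pdf(x))
```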

SLIDE 21

Dirichlet Distribution

[Figure: Dirichlet density over the probability simplex for α = 0.10.]

SLIDE 22

Dirichlet Distribution examples

α0 controls the strength, i.e. how peaked the distribution is, and the αk's control where the peak occurs:

(1,1,1): Uniform distribution.
(2,2,2): Broad distribution centered at (1/3, 1/3, 1/3).
(20,20,20): Narrow distribution centered at (1/3, 1/3, 1/3).

When αk < 1 for all k, we get "spikes" at the corners of the simplex.

SLIDE 23

Samples from Dirichlet Distribution

[Figure: Bar plots of samples from Dir(α = 0.1) and Dir(α = 1) over K = 5 outcomes, five samples each.]

SLIDE 24

Prior and Posterior

Prior

Dirichlet prior: Dir(θ | α) = (1 / B(α)) ∏_{k=1}^{K} θk^(αk−1)

Posterior

P(θ | D) ∝ P(D | θ) P(θ)
        ∝ ∏_{k=1}^{K} θk^Nk θk^(αk−1)
        = ∏_{k=1}^{K} θk^(αk+Nk−1)
        ∝ Dir(θ | α1 + N1, · · · , αK + NK)
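The conjugate update is again just adding counts to the hyperparameters; a sketch in the dice setting (the counts are invented):

```python
import numpy as np

alpha = np.ones(6)                    # uniform Dirichlet prior over 6 faces
N = np.array([3, 1, 4, 1, 5, 9])      # observed face counts

alpha_post = alpha + N                # posterior is Dir(alpha + N)
print(alpha_post)
print(alpha_post / alpha_post.sum())  # posterior mean E[theta_k | D]
```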

SLIDE 25

MAP estimate using Lagrange Multipliers

max over θ of

Dir(θ | α1 + N1, · · · , αK + NK) ∝ ∏_{k=1}^{K} θk^(αk+Nk−1)

subject to θ1 + θ2 + · · · + θK = 1. Use Lagrange multipliers! Solution:

θ̂k = (αk + Nk − 1) / (α0 + N − K).

With uniform prior this becomes θ̂k = Nk / N.
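Continuing the sketch above, the closed-form MAP estimate:

```python
import numpy as np

alpha = np.ones(6)
N = np.array([3, 1, 4, 1, 5, 9])

alpha0, Ntot, K = alpha.sum(), N.sum(), len(N)
theta_map = (alpha + N - 1) / (alpha0 + Ntot - K)
print(theta_map)
# with a uniform prior the MAP estimate reduces to the MLE N_k / N
assert np.allclose(theta_map, N / Ntot)
```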

SLIDE 26

Posterior Predictive

P(X = k | D) = ∫ P(X = k | θ) P(θ | D) dθ
             = ∫ P(X = k | θk) [∫ P(θ−k, θk | D) dθ−k] dθk
             = ∫ θk P(θk | D) dθk
             = E[θk | D] = (αk + Nk) / (α0 + N)

SLIDE 27

Application to Language Modelling

Suppose we observe the following sentences:

Sentences

Mary had a little lamb, little lamb, little lamb.
Mary had a little lamb, its fleece as white as snow.

Can we predict which word comes next?

SLIDE 28

Pre–Processing

◮ Vocabulary: mary, lamb, little, big, fleece, white, black, snow, rain, unk, represented by the numerals 1, · · · , 10 (where unk stands for "unknown").

◮ Strip away all punctuation and stop words like "a", "as", "the".

◮ Stemming, i.e. reduce words to their base form by removing plurals etc. For example, "running" becomes "run".

◮ This reduces the given sentences to the sequences:

1 10 3 2 3 2 3 2, 1 10 3 2 10 5 10 6 8.

SLIDE 29

Bag of Words Model

◮ Count word occurrences, giving, for the size-10 vocabulary, the counts: 2, 4, 4, 0, 1, 1, 0, 1, 0, 4 (so N = 17 in total).

◮ Using a Dir(1, 1, · · · , 1) prior gives:

P(X = k | D) = E[θk | D] = (αk + Nk) / (α0 + N) = (1 + Nk) / (10 + 17).

◮ So the predictive distribution for the next word is:

(3/27, 5/27, 5/27, 1/27, 2/27, 2/27, 1/27, 2/27, 1/27, 5/27),

◮ whose modes are X = 2 ("lamb"), X = 3 ("little") and X = 10 (unk).

◮ The words "big", "black" and "rain" have non-zero probabilities, even though they have not been seen!
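A sketch that reproduces this computation end to end (token IDs copied from the slides):

```python
import numpy as np

K = 10                                     # vocabulary size
tokens = [1, 10, 3, 2, 3, 2, 3, 2,         # "mary had a little lamb, ..."
          1, 10, 3, 2, 10, 5, 10, 6, 8]    # "... its fleece as white as snow"

N = np.bincount(tokens, minlength=K + 1)[1:]  # counts N_k for k = 1..K
alpha = np.ones(K)                            # uniform Dir(1, ..., 1) prior

pred = (alpha + N) / (alpha.sum() + N.sum())  # (1 + N_k) / (10 + 17)
print(N)          # [2 4 4 0 1 1 0 1 0 4]
print(pred * 27)  # [3 5 5 1 2 2 1 2 1 5], i.e. the probabilities times 27
```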