SLIDE 1

Probabilistic Graphical Models

Parameter learning in Bayesian networks

Siamak Ravanbakhsh, Fall 2019

SLIDE 2

Learning objectives

• likelihood function and MLE
• role of the sufficient statistics
• MLE for parameter learning in directed models: why is it easy?
• conjugate priors and Bayesian parameter learning

SLIDE 3

Likelihood function: through an example

a thumbtack with unknown prob. of heads & tails

Bernoulli dist.: p(x; θ) = θ^x (1 − θ)^{1−x}   (heads ≡ 1, tails ≡ 0)

SLIDE 4

Likelihood function: through an example

a thumbtack with unknown prob. of heads & tails

Bernoulli dist.: p(x; θ) = θ^x (1 − θ)^{1−x}   (heads ≡ 1, tails ≡ 0)

IID observations D = {1, 0, 0, 1, 1}

likelihood of θ: L(θ; D) = ∏_{x∈D} p(x; θ) = θ^3 (1 − θ)^2

the likelihood function is not a pdf (it does not integrate to 1)

SLIDE 5

Likelihood function: through an example

a thumbtack with unknown prob. of heads & tails

Bernoulli dist.: p(x; θ) = θ^x (1 − θ)^{1−x}   (heads ≡ 1, tails ≡ 0)

IID observations D = {1, 0, 0, 1, 1}

likelihood of θ: L(θ; D) = ∏_{x∈D} p(x; θ) = θ^3 (1 − θ)^2

the likelihood function is not a pdf (it does not integrate to 1)

max-likelihood estimate (MLE): maximize the log-likelihood (the M-projection of P_D)

log-likelihood: log L(θ; D) = 3 log θ + 2 log(1 − θ)

∂/∂θ (3 log θ + 2 log(1 − θ)) = 3/θ − 2/(1 − θ) = (3 − 5θ) / (θ(1 − θ)) = 0  ⇒  θ̂ = 3/5
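As a quick illustration (not part of the original slides), here is a minimal Python sketch that computes the same MLE for the thumbtack data and checks it against the log-likelihood on a grid:

```python
import numpy as np

D = np.array([1, 0, 0, 1, 1])  # thumbtack observations (heads = 1, tails = 0)

# closed-form MLE: fraction of heads, N(1) / |D| = 3/5
theta_mle = D.mean()

# sanity check: the log-likelihood N(1) log(theta) + N(0) log(1 - theta)
# peaks at the same value over a grid of candidate thetas
thetas = np.linspace(0.01, 0.99, 99)
loglik = D.sum() * np.log(thetas) + (len(D) - D.sum()) * np.log(1 - thetas)
assert abs(thetas[np.argmax(loglik)] - theta_mle) < 0.01

print(theta_mle)  # 0.6
```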

SLIDE 6

Sufficient statistics: through an example

IID observations D = {1, 0, 0, 1, 1}   (heads ≡ 1, tails ≡ 0)

likelihood of θ: L(θ; D) = ∏_{x∈D} p(x; θ) = θ^3 (1 − θ)^2

all we needed to know about the data: the number of heads and tails

given a distribution p(x; θ), its sufficient statistics is a function ϕ = [ϕ_1, …, ϕ_K] such that

E_D[ϕ(x)] = E_{D′}[ϕ(x′)]  ⇒  L(θ; D)^{1/|D|} = L(θ; D′)^{1/|D′|}   ∀ D, D′, θ

the sufficient statistics of the dataset is all that matters about the data

SLIDE 7

Revisiting exponential family

the (linear) exponential family: p(x) ∝ exp(⟨θ, ϕ(x)⟩)

the max-entropy distribution subject to E_p[ϕ(x)] = μ

L(θ; D) = ∏_{x∈D} p(x; θ)

given a distribution p(x; θ), its sufficient statistics is a function ϕ = [ϕ_1, …, ϕ_K] such that

E_D[ϕ(x)] = E_{D′}[ϕ(x′)]  ⇒  L(θ; D)^{1/|D|} = L(θ; D′)^{1/|D′|}   ∀ D, D′, θ

SLIDE 8

Revisiting exponential family

the (linear) exponential family: p(x) ∝ exp(⟨θ, ϕ(x)⟩)

the max-entropy distribution subject to E_p[ϕ(x)] = μ

if ϕ_1, …, ϕ_K are linearly independent, then θ ↔ μ

L(θ; D) = ∏_{x∈D} p(x; θ)

given a distribution p(x; θ), its sufficient statistics is a function ϕ = [ϕ_1, …, ϕ_K] such that

E_D[ϕ(x)] = E_{D′}[ϕ(x′)]  ⇒  L(θ; D)^{1/|D|} = L(θ; D′)^{1/|D′|}   ∀ D, D′, θ

SLIDE 9

MLE for Bayesian networks: an example

a simple network X → Y

p(x, y; θ) = p(x; θ_X) p(y|x; θ_{Y|X})

SLIDE 10

MLE for Bayesian networks: an example

a simple network X → Y

p(x, y; θ) = p(x; θ_X) p(y|x; θ_{Y|X})

likelihood:

L(D; θ) = ∏_{(x,y)∈D} p(x; θ_X) p(y|x; θ_{Y|X}) = (∏_{x∈D} p(x; θ_X)) (∏_{(x,y)∈D} p(y|x; θ_{Y|X}))

i.e., (likelihood of x) · (conditional likelihood of y)
SLIDE 11

MLE for Bayesian networks: an example

a simple network X → Y

p(x, y; θ) = p(x; θ_X) p(y|x; θ_{Y|X})

likelihood:

L(D; θ) = ∏_{(x,y)∈D} p(x; θ_X) p(y|x; θ_{Y|X}) = (∏_{x∈D} p(x; θ_X)) (∏_{(x,y)∈D} p(y|x; θ_{Y|X}))

i.e., (likelihood of x) · (conditional likelihood of y)

for discrete vars.:

L(D; θ) = (∏_{ℓ∈Val(X)} θ_{X,ℓ}^{N(x=ℓ)}) (∏_{(ℓ,ℓ′)∈Val(X)×Val(Y)} θ_{Y|X,ℓ,ℓ′}^{N(x=ℓ, y=ℓ′)})

where N(x=ℓ) and N(x=ℓ, y=ℓ′) count the number of times x = ℓ (resp. x = ℓ, y = ℓ′) in the dataset,
θ_{X,ℓ} = p(X = ℓ) and θ_{Y|X,ℓ,ℓ′} = p(Y = ℓ′ | X = ℓ)

SLIDE 12

MLE for Bayesian networks: an example

a simple network X → Y

p(x, y; θ) = p(x; θ_X) p(y|x; θ_{Y|X})

likelihood:

L(D; θ) = ∏_{(x,y)∈D} p(x; θ_X) p(y|x; θ_{Y|X}) = (∏_{x∈D} p(x; θ_X)) (∏_{(x,y)∈D} p(y|x; θ_{Y|X}))

i.e., (likelihood of x) · (conditional likelihood of y)

for discrete vars.:

L(D; θ) = (∏_{ℓ∈Val(X)} θ_{X,ℓ}^{N(x=ℓ)}) (∏_{(ℓ,ℓ′)∈Val(X)×Val(Y)} θ_{Y|X,ℓ,ℓ′}^{N(x=ℓ, y=ℓ′)})

where N(x=ℓ) and N(x=ℓ, y=ℓ′) count the number of times x = ℓ (resp. x = ℓ, y = ℓ′) in the dataset,
θ_{X,ℓ} = p(X = ℓ) and θ_{Y|X,ℓ,ℓ′} = p(Y = ℓ′ | X = ℓ)

MLE: maximize the local likelihood terms individually

θ̂_{X,ℓ} = N(x=ℓ) / |D|,   θ̂_{Y|X,ℓ,ℓ′} = N(x=ℓ, y=ℓ′) / N(x=ℓ)
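To make the counting concrete, here is a small sketch (toy data, not from the slides) that computes these MLEs for the network X → Y purely by counting:

```python
from collections import Counter

# hypothetical dataset of (x, y) pairs for the network X -> Y
data = [(0, 1), (0, 0), (1, 1), (1, 1), (0, 1), (1, 0)]

n = len(data)
count_x = Counter(x for x, _ in data)
count_xy = Counter(data)

# MLE of p(X = l): N(x = l) / |D|
theta_X = {l: count_x[l] / n for l in count_x}

# MLE of p(Y = l' | X = l): N(x = l, y = l') / N(x = l)
theta_Y_given_X = {(l, lp): count_xy[(l, lp)] / count_x[l] for (l, lp) in count_xy}

print(theta_X)          # {0: 0.5, 1: 0.5}
print(theta_Y_given_X)  # {(0, 1): 0.667, (0, 0): 0.333, (1, 1): 0.667, (1, 0): 0.333}
```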

SLIDE 13

MLE for Bayesian networks: general case

Bayes-net: p(x; θ) = ∏_i p(x_i | Pa_{X_i}; θ_{X_i|Pa_{X_i}})

likelihood:

L(D; θ) = ∏_{x∈D} ∏_i p(x_i | Pa_{x_i}; θ_{X_i|Pa_{X_i}}) = ∏_i ∏_{(x_i, Pa_{x_i})∈D} p(x_i | Pa_{x_i}; θ_{X_i|Pa_{X_i}})

the inner products are the local likelihood terms

SLIDE 14

MLE for Bayesian networks: general case

Bayes-net: p(x; θ) = ∏_i p(x_i | Pa_{X_i}; θ_{X_i|Pa_{X_i}})

likelihood:

L(D; θ) = ∏_{x∈D} ∏_i p(x_i | Pa_{x_i}; θ_{X_i|Pa_{X_i}}) = ∏_i ∏_{(x_i, Pa_{x_i})∈D} p(x_i | Pa_{x_i}; θ_{X_i|Pa_{X_i}})

the inner products are the local likelihood terms

maximizing the conditional likelihood for each node: similar to solving individual prediction problems

SLIDE 15

MLE for Bayesian networks: general case

Bayes-net: p(x; θ) = ∏_i p(x_i | Pa_{X_i}; θ_{X_i|Pa_{X_i}})

likelihood:

L(D; θ) = ∏_{x∈D} ∏_i p(x_i | Pa_{x_i}; θ_{X_i|Pa_{X_i}}) = ∏_i ∏_{(x_i, Pa_{x_i})∈D} p(x_i | Pa_{x_i}; θ_{X_i|Pa_{X_i}})

the inner products are the local likelihood terms

maximizing the conditional likelihood for each node: similar to solving individual prediction problems

Example: how to learn a naive Bayes?

SLIDE 16

Bayesian parameter estimation

Example (heads ≡ 1, tails ≡ 0): the max-likelihood estimate θ̂ = 1/3 is the same for

case 1. N(x = 1) = 1, N(x = 0) = 2
case 2. N(x = 1) = 100, N(x = 0) = 200

SLIDE 17

Bayesian parameter estimation

Example (heads ≡ 1, tails ≡ 0): the max-likelihood estimate θ̂ = 1/3 is the same for

case 1. N(x = 1) = 1, N(x = 0) = 2
case 2. N(x = 1) = 100, N(x = 0) = 200

we need to model our uncertainty

Bayesian approach: assume a prior p(θ) and estimate the posterior

SLIDE 18

Bayesian parameter estimation

Example (heads ≡ 1, tails ≡ 0): the max-likelihood estimate θ̂ = 1/3 is the same for

case 1. N(x = 1) = 1, N(x = 0) = 2
case 2. N(x = 1) = 100, N(x = 0) = 200

we need to model our uncertainty

Bayesian approach: assume a prior p(θ) and estimate the posterior

p(θ | D) = p(θ) p(D | θ) / p(D)

likelihood: p(D | θ) = ∏_{x∈D} p(x | θ);   p(D): marginal likelihood;   p(θ): prior;   p(θ | D): posterior

SLIDE 19

Bayesian parameter estimation

assuming a uniform prior (heads ≡ 1, tails ≡ 0):

p(θ) = 1 for 0 ≤ θ ≤ 1   (0 o.w.)

posterior: p(θ | D) ∝ p(θ) p(D | θ) ∝ p(D | θ)

SLIDE 20

Bayesian parameter estimation

assuming a uniform prior (heads ≡ 1, tails ≡ 0):

p(θ) = 1 for 0 ≤ θ ≤ 1   (0 o.w.)

posterior: p(θ | D) ∝ p(θ) p(D | θ) ∝ p(D | θ)

posterior predictive: predicting heads/tails using the posterior, rather than a single MLE value

p(x | D) = ∫_0^1 p(θ | D) p(x | θ) dθ ∝ ∫_0^1 θ^{N(1)} (1 − θ)^{N(0)} · θ^x (1 − θ)^{1−x} dθ

SLIDE 21

Bayesian parameter estimation

assuming a uniform prior (heads ≡ 1, tails ≡ 0):

p(θ) = 1 for 0 ≤ θ ≤ 1   (0 o.w.)

posterior: p(θ | D) ∝ p(θ) p(D | θ) ∝ p(D | θ)

posterior predictive: predicting heads/tails using the posterior, rather than a single MLE value

p(x | D) = ∫_0^1 p(θ | D) p(x | θ) dθ ∝ ∫_0^1 θ^{N(1)} (1 − θ)^{N(0)} · θ^x (1 − θ)^{1−x} dθ

Laplace correction: doing the integration above (and normalizing) gives p(x = 1 | D) = (N(1) + 1) / (N(0) + N(1) + 2)

SLIDE 22

Bayesian parameter estimation

assuming a uniform prior (heads ≡ 1, tails ≡ 0):

p(θ) = 1 for 0 ≤ θ ≤ 1   (0 o.w.)

posterior: p(θ | D) ∝ p(θ) p(D | θ) ∝ p(D | θ)

posterior predictive: predicting heads/tails using the posterior, rather than a single MLE value

p(x | D) = ∫_0^1 p(θ | D) p(x | θ) dθ ∝ ∫_0^1 θ^{N(1)} (1 − θ)^{N(0)} · θ^x (1 − θ)^{1−x} dθ

Laplace correction: doing the integration above (and normalizing) gives p(x = 1 | D) = (N(1) + 1) / (N(0) + N(1) + 2)

compare with the prediction using the MLE: p(x = 1 | D) = N(1) / (N(0) + N(1))
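A small sketch (hypothetical counts) contrasting the two predictive rules; the Laplace-corrected estimate never assigns probability 0 to an unseen outcome:

```python
# posterior predictive under a uniform prior (Laplace correction) vs. plug-in MLE
def predict_mle(n1, n0):
    return n1 / (n1 + n0)

def predict_laplace(n1, n0):
    return (n1 + 1) / (n1 + n0 + 2)

for n1, n0 in [(3, 2), (1, 2), (100, 200), (0, 5)]:
    print(n1, n0, predict_mle(n1, n0), predict_laplace(n1, n0))
# e.g. with N(1) = 0, N(0) = 5 the MLE predicts 0.0 (heads are impossible),
# while the Laplace-corrected estimate gives 1/7.
```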

SLIDE 23

Conjugate priors

how about non-uniform priors (e.g., more likely to see heads)?

p(θ | D) ∝ p(θ) p(D | θ)

we need an efficient way to get the posterior

SLIDE 24

Conjugate priors

how about non-uniform priors (e.g., more likely to see heads)?

p(θ | D) ∝ p(θ) p(D | θ)

we need an efficient way to get the posterior

ideally the prior p(θ) & the posterior p(θ | D) should have the same form:

p(θ) is then a conjugate prior to the likelihood p(D | θ)

SLIDE 25

Conjugate priors

how about non-uniform priors (e.g., more likely to see heads)?

p(θ | D) ∝ p(θ) p(D | θ)

we need an efficient way to get the posterior

ideally the prior p(θ) & the posterior p(θ | D) should have the same form:

p(θ) is then a conjugate prior to the likelihood p(D | θ)

the conjugate prior to the Bernoulli likelihood is the Beta distribution:

p(D | θ) ∝ θ^{N(1)} (1 − θ)^{N(0)},   p(θ; α, β) = γ θ^{α−1} (1 − θ)^{β−1},   γ = Γ(α + β) / (Γ(α) Γ(β))

SLIDE 26

Conjugate priors: Beta-Bernoulli

the conjugate prior to the Bernoulli likelihood is the Beta distribution:

p(θ; α, β) = (Γ(α + β) / (Γ(α) Γ(β))) θ^{α−1} (1 − θ)^{β−1}

[image: wikipedia]

SLIDE 27

Conjugate priors: Beta-Bernoulli

the conjugate prior to the Bernoulli likelihood is the Beta distribution:

p(θ; α, β) = (Γ(α + β) / (Γ(α) Γ(β))) θ^{α−1} (1 − θ)^{β−1}

Γ is an extension of the factorial function: Γ(n + 1) = n!

[image: wikipedia]

SLIDE 28

Conjugate priors: Beta-Bernoulli

the conjugate prior to the Bernoulli likelihood is the Beta distribution:

p(θ; α, β) = (Γ(α + β) / (Γ(α) Γ(β))) θ^{α−1} (1 − θ)^{β−1}

Γ is an extension of the factorial function: Γ(n + 1) = n!

the hyper-parameters α, β can be interpreted as # imaginary heads & tails

prior predictive: p(x = 1 | D = ∅) = ∫_θ p(x = 1 | θ) p(θ; α, β) dθ = α / (α + β)

[image: wikipedia]

SLIDE 29

Conjugate priors: Beta-Bernoulli

the conjugate prior to the Bernoulli likelihood is the Beta distribution:

p(θ; α, β) = (Γ(α + β) / (Γ(α) Γ(β))) θ^{α−1} (1 − θ)^{β−1}

Γ is an extension of the factorial function: Γ(n + 1) = n!

the hyper-parameters α, β can be interpreted as # imaginary heads & tails

prior predictive: p(x = 1 | D = ∅) = ∫_θ p(x = 1 | θ) p(θ; α, β) dθ = α / (α + β)

posterior: p(θ | D) ∝ p(θ) p(D | θ) ∝ θ^{α−1} (1 − θ)^{β−1} · θ^{N(1)} (1 − θ)^{N(0)} = θ^{α+N(1)−1} (1 − θ)^{β+N(0)−1}

if the prior is p(θ; α, β), the posterior is p(θ; α + N(1), β + N(0))

[image: wikipedia]
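Because the update only shifts the hyper-parameters, the posterior is trivial to maintain; a minimal sketch using SciPy's Beta distribution (the prior pseudo-counts here are illustrative):

```python
from scipy import stats

alpha, beta = 2.0, 2.0            # hypothetical prior pseudo-counts of heads / tails
data = [1, 0, 0, 1, 1]            # observed flips
n1, n0 = sum(data), len(data) - sum(data)

# conjugacy: Beta(alpha, beta) prior + Bernoulli likelihood -> Beta posterior
post = stats.beta(alpha + n1, beta + n0)

print(post.mean())                                  # posterior mean
print((alpha + n1) / (alpha + beta + len(data)))    # same value, computed by hand
```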

SLIDE 30

Beta-Bernoulli: Example

posterior for different priors and sample sizes (true p(x = 1) = .2)

[figure: posteriors for different prior means α/(α + β) and different prior strengths α + β]

SLIDE 31

Beta-Bernoulli: Example

posterior for different priors and sample sizes (true p(x = 1) = .2)

[figure: posteriors for different prior means α/(α + β) and different prior strengths α + β]

posterior predictive for the online setting

[figure: posterior predictive after each sample, comparing the MLE with priors α = β = 5 and α = β = 1]

SLIDE 32

Conjugate priors: Dirichlet-categorical

Bernoulli : Beta   ::   Categorical : Dirichlet

p(θ; α) = (Γ(∑_d α_d) / ∏_d Γ(α_d)) ∏_d θ_d^{α_d − 1}

SLIDE 33

Conjugate priors: Dirichlet-categorical

Bernoulli : Beta   ::   Categorical : Dirichlet

p(θ; α) = (Γ(∑_d α_d) / ∏_d Γ(α_d)) ∏_d θ_d^{α_d − 1}

α ∈ (ℝ_+)^D: pseudo-counts for the different categories

SLIDE 34

Conjugate priors: Dirichlet-categorical

Bernoulli : Beta   ::   Categorical : Dirichlet

p(θ; α) = (Γ(∑_d α_d) / ∏_d Γ(α_d)) ∏_d θ_d^{α_d − 1}

α ∈ (ℝ_+)^D: pseudo-counts for the different categories

prior: p(θ; α)

SLIDE 35

Conjugate priors: Dirichlet-categorical

Bernoulli : Beta   ::   Categorical : Dirichlet

p(θ; α) = (Γ(∑_d α_d) / ∏_d Γ(α_d)) ∏_d θ_d^{α_d − 1}

α ∈ (ℝ_+)^D: pseudo-counts for the different categories

prior: p(θ; α)

likelihood: p(D | θ) ∝ ∏_{x∈D} ∏_d θ_d^{I(x=d)} = ∏_d θ_d^{N(d)}

posterior: p(θ | D) ∝ p(θ) p(D | θ)

SLIDE 36

Conjugate priors: Dirichlet-categorical

Bernoulli : Beta   ::   Categorical : Dirichlet

p(θ; α) = (Γ(∑_d α_d) / ∏_d Γ(α_d)) ∏_d θ_d^{α_d − 1}

α ∈ (ℝ_+)^D: pseudo-counts for the different categories

prior: p(θ; α)

likelihood: p(D | θ) ∝ ∏_{x∈D} ∏_d θ_d^{I(x=d)} = ∏_d θ_d^{N(d)}

posterior: p(θ | D) ∝ p(θ) p(D | θ) ∝ ∏_d θ_d^{α_d − 1} θ_d^{N(d)} = ∏_d θ_d^{α_d + N(d) − 1}

SLIDE 37

Conjugate priors: Dirichlet-categorical

Bernoulli : Beta   ::   Categorical : Dirichlet

p(θ; α) = (Γ(∑_d α_d) / ∏_d Γ(α_d)) ∏_d θ_d^{α_d − 1}

α ∈ (ℝ_+)^D: pseudo-counts for the different categories

prior: p(θ; α)

likelihood: p(D | θ) ∝ ∏_{x∈D} ∏_d θ_d^{I(x=d)} = ∏_d θ_d^{N(d)}

posterior: p(θ | D) ∝ p(θ) p(D | θ) ∝ ∏_d θ_d^{α_d − 1} θ_d^{N(d)} = ∏_d θ_d^{α_d + N(d) − 1}

posterior predictive: p(x = x̄ | D) = ∫_θ p(θ | D) p(x = x̄ | θ) dθ

SLIDE 38

Conjugate priors: Dirichlet-categorical

Bernoulli : Beta   ::   Categorical : Dirichlet

p(θ; α) = (Γ(∑_d α_d) / ∏_d Γ(α_d)) ∏_d θ_d^{α_d − 1}

α ∈ (ℝ_+)^D: pseudo-counts for the different categories

prior: p(θ; α)

likelihood: p(D | θ) ∝ ∏_{x∈D} ∏_d θ_d^{I(x=d)} = ∏_d θ_d^{N(d)}

posterior: p(θ | D) ∝ p(θ) p(D | θ) ∝ ∏_d θ_d^{α_d − 1} θ_d^{N(d)} = ∏_d θ_d^{α_d + N(d) − 1}

posterior predictive: p(x = x̄ | D) = ∫_θ p(θ | D) p(x = x̄ | θ) dθ = (α_{x̄} + N(x̄)) / (N + ∑_d α_d)
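The same bookkeeping in the categorical case; a sketch with three hypothetical categories:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # hypothetical Dirichlet pseudo-counts for 3 categories
data = [0, 2, 2, 1, 2, 0]           # observed category labels

counts = np.bincount(data, minlength=len(alpha))

# posterior predictive: p(x = d | D) = (alpha_d + N(d)) / (N + sum_d alpha_d)
pred = (alpha + counts) / (len(data) + alpha.sum())
print(pred)          # a proper distribution: the entries sum to 1
```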

SLIDE 39

Marginal likelihood vs. maximum likelihood

D = {1, 0, 0, 1, 1}

SLIDE 40

Marginal likelihood vs. maximum likelihood

D = {1, 0, 0, 1, 1}

maximum likelihood value: P(D | θ̂) = (3/5)^3 (2/5)^2 ≈ .035

SLIDE 41

Marginal likelihood vs. maximum likelihood

D = {1, 0, 0, 1, 1}

maximum likelihood value: P(D | θ̂) = (3/5)^3 (2/5)^2 ≈ .035

marginal likelihood value: P(D) = ∫_{θ∈[0,1]} P(θ) P(D | θ) dθ

SLIDE 42

Marginal likelihood vs. maximum likelihood

D = {1, 0, 0, 1, 1}

maximum likelihood value: P(D | θ̂) = (3/5)^3 (2/5)^2 ≈ .035

marginal likelihood value: P(D) = ∫_{θ∈[0,1]} P(θ) P(D | θ) dθ

chain rule: P(D) = ∏_{m=1}^{M} P(x^{(m)} | x^{(1)}, …, x^{(m−1)})

SLIDE 43

Marginal likelihood vs. maximum likelihood

D = {1, 0, 0, 1, 1}

maximum likelihood value: P(D | θ̂) = (3/5)^3 (2/5)^2 ≈ .035

marginal likelihood value: P(D) = ∫_{θ∈[0,1]} P(θ) P(D | θ) dθ

chain rule: P(D) = ∏_{m=1}^{M} P(x^{(m)} | x^{(1)}, …, x^{(m−1)})

with a Beta(α_1, α_2) prior on θ (α = α_1 + α_2; α_1 pseudo-heads, α_2 pseudo-tails):

P(D) = (α_1 / α) · (α_2 / (α + 1)) · ((α_2 + 1) / (α + 2)) · ((α_1 + 1) / (α + 3)) · ((α_1 + 2) / (α + 4))

SLIDE 44

Marginal likelihood vs. maximum likelihood

D = {1, 0, 0, 1, 1}

maximum likelihood value: P(D | θ̂) = (3/5)^3 (2/5)^2 ≈ .035

marginal likelihood value: P(D) = ∫_{θ∈[0,1]} P(θ) P(D | θ) dθ

chain rule: P(D) = ∏_{m=1}^{M} P(x^{(m)} | x^{(1)}, …, x^{(m−1)})

with a Beta(α_1, α_2) prior on θ (α = α_1 + α_2; α_1 pseudo-heads, α_2 pseudo-tails):

P(D) = (α_1 / α) · (α_2 / (α + 1)) · ((α_2 + 1) / (α + 2)) · ((α_1 + 1) / (α + 3)) · ((α_1 + 2) / (α + 4))

= (Γ(α) / Γ(α + 5)) · (Γ(α_1 + 3) / Γ(α_1)) · (Γ(α_2 + 2) / Γ(α_2))   [using Γ(x + 1) = x Γ(x)]

with α_1 = α_2 = 1 (the uniform prior) this gives ≈ .017

SLIDE 45

Marginal likelihood vs. maximum likelihood

D = {1, 0, 0, 1, 1}

maximum likelihood value: P(D | θ̂) = (3/5)^3 (2/5)^2 ≈ .035

marginal likelihood value: P(D) = ∫_{θ∈[0,1]} P(θ) P(D | θ) dθ

chain rule: P(D) = ∏_{m=1}^{M} P(x^{(m)} | x^{(1)}, …, x^{(m−1)})

with a Beta(α_1, α_2) prior on θ (α = α_1 + α_2; α_1 pseudo-heads, α_2 pseudo-tails):

P(D) = (α_1 / α) · (α_2 / (α + 1)) · ((α_2 + 1) / (α + 2)) · ((α_1 + 1) / (α + 3)) · ((α_1 + 2) / (α + 4))

= (Γ(α) / Γ(α + 5)) · (Γ(α_1 + 3) / Γ(α_1)) · (Γ(α_2 + 2) / Γ(α_2))   [using Γ(x + 1) = x Γ(x)]

with α_1 = α_2 = 1 (the uniform prior) this gives ≈ .017

in general, the marginal likelihood for a Dirichlet prior:

P(D) = (Γ(α) / Γ(α + |D|)) ∏_i (Γ(α_i + |D| p̂_D(i)) / Γ(α_i))

where p̂_D(i) is the empirical frequency of category i in D (so |D| p̂_D(i) = N(i))
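A short sketch reproducing both numbers for D = {1, 0, 0, 1, 1}, using log-Gamma for numerical stability (the Beta(1, 1) prior matches the uniform prior used above):

```python
from math import lgamma, exp

D = [1, 0, 0, 1, 1]
n1, n0 = sum(D), len(D) - sum(D)

# maximum likelihood value: P(D | theta_hat) = (3/5)^3 (2/5)^2
max_lik = (n1 / len(D)) ** n1 * (n0 / len(D)) ** n0

# marginal likelihood under a Beta(a1, a2) prior, via the Gamma-function formula
def marginal_likelihood(a1, a2):
    a = a1 + a2
    log_p = (lgamma(a) - lgamma(a + len(D))
             + lgamma(a1 + n1) - lgamma(a1)
             + lgamma(a2 + n0) - lgamma(a2))
    return exp(log_p)

print(round(max_lik, 3))                     # ~0.035
print(round(marginal_likelihood(1, 1), 3))   # ~0.017 (uniform prior)
```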

SLIDE 46

Conjugate priors: exponential family

for the likelihood function p(x | θ) = exp(⟨ϕ(x), θ⟩ − A(θ))

SLIDE 47

Conjugate priors: exponential family

for the likelihood function p(x | θ) = exp(⟨ϕ(x), θ⟩ − A(θ))

suppose we observe N instances:

p(D | θ) = exp(⟨∑_{x∈D} ϕ(x), θ⟩ − N A(θ))

SLIDE 48

Conjugate priors: exponential family

for the likelihood function p(x | θ) = exp(⟨ϕ(x), θ⟩ − A(θ))

suppose we observe N instances:

p(D | θ) = exp(⟨∑_{x∈D} ϕ(x), θ⟩ − N A(θ))

conjugate prior: p(θ; η, ν) = exp(⟨ν η, θ⟩ − ν A(θ))

SLIDE 49

Conjugate priors: exponential family

for the likelihood function p(x | θ) = exp(⟨ϕ(x), θ⟩ − A(θ))

suppose we observe N instances:

p(D | θ) = exp(⟨∑_{x∈D} ϕ(x), θ⟩ − N A(θ))

conjugate prior: p(θ; η, ν) = exp(⟨ν η, θ⟩ − ν A(θ))

posterior: p(θ | D; η, ν) = exp(⟨ν η + ∑_{x∈D} ϕ(x), θ⟩ − (ν + N) A(θ))

ν: imaginary counts;   η: imaginary expected sufficient statistics
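The posterior therefore keeps the prior's form, with hyper-parameters updated by the data count and the summed sufficient statistics. A generic sketch of that update (the Bernoulli specialization with ϕ(x) = x is an illustrative assumption, not from the slides):

```python
import numpy as np

def conjugate_update(nu, nu_eta, phi_values):
    """Add the number of observations to nu and the summed sufficient
    statistics to nu*eta, as in the posterior formula above."""
    phi_values = np.asarray(phi_values, dtype=float)
    return nu + len(phi_values), nu_eta + phi_values.sum(axis=0)

# Bernoulli example: phi(x) = x; (nu, nu*eta) = (2, 1) corresponds to a
# Beta(1, 1) prior over the head probability (one imaginary head, one tail).
nu_post, nu_eta_post = conjugate_update(2.0, 1.0, [1, 0, 0, 1, 1])
print(nu_post, nu_eta_post)       # 7.0 4.0
print(nu_eta_post / nu_post)      # posterior expected sufficient statistic = 4/7
```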

SLIDE 50

Bayesian learning for Bayes-nets

assumption (global parameter independence): the prior decomposes

p(θ) = ∏_i p(θ_{X_i|Pa_{X_i}})

example: for the network X → Y, θ = (θ_X, θ_{Y|X})

conclusion: the posterior also decomposes

p(θ | D) = ∏_i p(θ_{X_i|Pa_{X_i}} | D)

SLIDE 51

Bayesian learning for Bayes-nets

assumption (global parameter independence): the prior decomposes

p(θ) = ∏_i p(θ_{X_i|Pa_{X_i}})

example: for the network X → Y, θ = (θ_X, θ_{Y|X})

conclusion: the posterior also decomposes

p(θ | D) = ∏_i p(θ_{X_i|Pa_{X_i}} | D)

p(θ | D) ∝ (∏_i p(θ_{X_i|Pa_{X_i}})) (∏_{x∈D} ∏_i p(x_i | Pa_{x_i}; θ_{X_i|Pa_{X_i}}))   (prior × likelihood)

SLIDE 52

Bayesian learning for Bayes-nets

assumption (global parameter independence): the prior decomposes

p(θ) = ∏_i p(θ_{X_i|Pa_{X_i}})

example: for the network X → Y, θ = (θ_X, θ_{Y|X})

conclusion: the posterior also decomposes

p(θ | D) = ∏_i p(θ_{X_i|Pa_{X_i}} | D)

p(θ | D) ∝ (∏_i p(θ_{X_i|Pa_{X_i}})) (∏_{x∈D} ∏_i p(x_i | Pa_{x_i}; θ_{X_i|Pa_{X_i}}))   (prior × likelihood)

= ∏_i ( p(θ_{X_i|Pa_{X_i}}) ∏_{x∈D} p(x_i | Pa_{x_i}; θ_{X_i|Pa_{X_i}}) ) ∝ ∏_i p(θ_{X_i|Pa_{X_i}} | D)

SLIDE 53

Bayesian learning for Bayes-nets

assumption (global parameter independence): the prior decomposes

p(θ) = ∏_i p(θ_{X_i|Pa_{X_i}})

example: for the network X → Y, θ = (θ_X, θ_{Y|X})

conclusion: the posterior also decomposes

p(θ | D) = ∏_i p(θ_{X_i|Pa_{X_i}} | D)

p(θ | D) ∝ (∏_i p(θ_{X_i|Pa_{X_i}})) (∏_{x∈D} ∏_i p(x_i | Pa_{x_i}; θ_{X_i|Pa_{X_i}}))   (prior × likelihood)

= ∏_i ( p(θ_{X_i|Pa_{X_i}}) ∏_{x∈D} p(x_i | Pa_{x_i}; θ_{X_i|Pa_{X_i}}) ) ∝ ∏_i p(θ_{X_i|Pa_{X_i}} | D)

individual posteriors: we can apply Bayesian learning to the individual conditional distributions

SLIDE 54

Bayesian learning for Bayes-nets

assumption (global parameter independence): the prior decomposes

p(θ) = ∏_i p(θ_{X_i|Pa_{X_i}})

example: for the network X → Y, θ = (θ_X, θ_{Y|X})

conclusion: the posterior also decomposes

p(θ | D) = ∏_i p(θ_{X_i|Pa_{X_i}} | D)

p(θ | D) ∝ (∏_i p(θ_{X_i|Pa_{X_i}})) (∏_{x∈D} ∏_i p(x_i | Pa_{x_i}; θ_{X_i|Pa_{X_i}}))   (prior × likelihood)

= ∏_i ( p(θ_{X_i|Pa_{X_i}}) ∏_{x∈D} p(x_i | Pa_{x_i}; θ_{X_i|Pa_{X_i}}) ) ∝ ∏_i p(θ_{X_i|Pa_{X_i}} | D)

individual posteriors: we can apply Bayesian learning to the individual conditional distributions

the posterior predictive also decomposes:

p(x′ | D) = ∏_i p(x′_i | D) = ∏_i ∫ p(θ_{X_i|Pa_{X_i}} | D) p(x′_i | Pa_{x′_i}; θ_{X_i|Pa_{X_i}}) dθ_{X_i|Pa_{X_i}}

SLIDE 55

Bayesian learning for Bayes-nets

discrete case: conditional probability tables (CPTs)

local parameter independence: assume a decomposed prior, one factor for each assignment to the parent (e.g., the columns of the table)

for binary X: p(θ_{Y|X}) = p(θ_{Y|x^0}) p(θ_{Y|x^1})

we can further decompose the prior & the posterior

SLIDE 56

Bayesian learning for Bayes-nets

discrete case: conditional probability tables (CPTs)

local parameter independence: assume a decomposed prior, one factor for each assignment to the parent (e.g., the columns of the table)

for binary X: p(θ_{Y|X}) = p(θ_{Y|x^0}) p(θ_{Y|x^1})

we can further decompose the prior & the posterior

the posterior is also decomposed:

p(θ_{Y|X} | D) = p(θ_{Y|x^0} | D) p(θ_{Y|x^1} | D),   with p(θ_{Y|x^0} | D) ∝ p(θ_{Y|x^0}) ∏_{(x^0,y)∈D} p(y | x^0; θ_{Y|x^0})

SLIDE 57

Bayesian learning for Bayes-nets

In practice this means (discrete case, CPTs):

keep a vector of pseudo-counts for each node; after observing N samples, update these based on the frequency of the different (x, y) values

K2 prior: α_{Y|x^0} = α_{Y|x^1} = [1, …, 1]   (similar to Laplace smoothing)

BDe prior: use a second Bayes-net P′ to keep frequencies and a total pseudo-count α, then α_{x_i | pa_{X_i}} = α P′(x_i, pa_{X_i})
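A sketch of this bookkeeping for a single CPT with a binary parent, under a K2-style all-ones prior (the data are hypothetical):

```python
import numpy as np

# pseudo-count table for a node Y with a binary parent X (K2-style prior: all ones);
# rows index the parent value x, columns index y
pseudo_counts = np.ones((2, 2))

data = [(0, 1), (0, 0), (1, 1), (1, 1), (0, 1)]   # observed (x, y) samples

for x, y in data:
    pseudo_counts[x, y] += 1      # Bayesian update = add the observed frequencies

# posterior predictive CPT: normalize each row (each parent assignment separately)
cpt = pseudo_counts / pseudo_counts.sum(axis=1, keepdims=True)
print(cpt)
```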

SLIDE 58

Bayesian learning for Bayes-nets

example: ICU alarm network, Bayesian learning vs MLE

SLIDE 59

Summary

• learn the parameter by maximizing the likelihood
• it does not reflect uncertainty: maintain a distribution over the parameters
• for conjugate pairs (prior-likelihood), this maintenance is easy

SLIDE 60

Summary

• learn the parameter by maximizing the likelihood
• it does not reflect uncertainty: maintain a distribution over the parameters
• for conjugate pairs (prior-likelihood), this maintenance is easy
• in Bayes-nets, both MLE and Bayesian learning are easy: they have a decomposed form