Probabilistic Graphical Models - PowerPoint PPT Presentation


SLIDE 1

Probabilistic Graphical Models

Structure learning in Bayesian networks

Siamak Ravanbakhsh, Fall 2019

SLIDE 2

Learning objectives

  • why is structure learning hard?
  • two approaches to structure learning: constraint-based methods, score-based methods
  • MLE vs. Bayesian score

SLIDES 3-6

Structure learning in BayesNets

family of methods:

  • constraint-based methods: estimate conditional independencies from the data, then find compatible BayesNets
  • score-based methods: search over the combinatorial space of 2^{O(n^2)} structures, maximizing a score
  • Bayesian model averaging: integrate over all possible structures
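To see why the search space is called combinatorial, the number of labeled DAGs on n nodes can be computed with Robinson's recurrence (conditioning on the set of nodes with no incoming edge). This small sketch is an illustration, not part of the original slides:

```python
from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def num_dags(n):
    """Robinson's recurrence for the number of labeled DAGs on n nodes."""
    if n == 0:
        return 1
    # inclusion-exclusion over the k nodes that have no incoming edge
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))
```

Already num_dags(4) = 543, and the count grows as 2^{Θ(n²)}, which is why exhaustive search is hopeless.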

SLIDES 7-11

Structure learning in BayesNets

constraint-based methods estimate conditional independencies from the data and find compatible BayesNets.

the structure is identifiable only up to I-equivalence: the goal is a DAG with the same set of conditional independencies (CIs) as the data distribution, I(G) = I(p_D), called a perfect map (P-map)

the CIs are estimated by hypothesis testing: X ⊥ Y ∣ Z?

first attempt: a DAG that is an I-map for p_D, that is, I(G) ⊆ I(p_D)

SLIDE 12

minimal I-map from CI test

input: CI test oracle; an ordering X_1, …, X_n
output: a minimal I-map G

for i = 1, …, n: find a minimal U ⊆ {X_1, …, X_{i−1}} such that X_i ⊥ ({X_1, …, X_{i−1}} − U) ∣ U, and set Pa_{X_i} = U

this is justified by the local Markov property X_i ⊥ NonDesc_{X_i} ∣ Pa_{X_i}

a minimal I-map is a DAG where removing any edge violates the I-map property
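The procedure above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical oracle ci_oracle(x, rest, cond) that answers "is x independent of the set rest given the set cond?"; scanning subsets smallest-first makes the first hit a minimal U:

```python
from itertools import combinations

def minimal_imap(variables, ci_oracle):
    """Build a minimal I-map for a fixed variable ordering (sketch)."""
    parents = {}
    for i, x in enumerate(variables):
        preds = variables[:i]
        chosen = set(preds)  # fallback: all predecessors
        done = False
        # scan candidate parent sets smallest-first: first hit is minimal
        for size in range(len(preds) + 1):
            for u in combinations(preds, size):
                rest = [v for v in preds if v not in u]
                if ci_oracle(x, rest, list(u)):
                    chosen, done = set(u), True
                    break
            if done:
                break
        parents[x] = chosen
    return parents

# toy oracle for the chain A -> B -> C (the only CI is C ⊥ A | B)
def chain_ci(x, rest, cond):
    if not rest:
        return True  # anything is independent of the empty set
    return x == "C" and set(rest) == {"A"} and "B" in cond

parents = minimal_imap(["A", "B", "C"], chain_ci)
```

With the topological ordering A, B, C this recovers the chain itself: parents A: {}, B: {A}, C: {B}.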

SLIDES 13-14

minimal I-map from CI test

Problems:

  • CI tests involve many variables
  • the number of CI tests is exponential
  • a minimal I-map may be far from a P-map
  • different orderings give different graphs

Example: the orderings D, I, S, G, L (a topological ordering) vs. L, S, G, I, D vs. L, D, S, I, G give different minimal I-maps

SLIDE 15

Structure learning in BayesNets

constraint-based methods: estimate conditional independencies from the data, find compatible BayesNets; the goal is a DAG with the same set of conditional independencies (CIs), I(G) = I(p_D)

first attempt: a DAG that is an I-map for p_D, that is, I(G) ⊆ I(p_D)

can we find a perfect map with fewer CI tests involving fewer variables?

second attempt: a DAG that is a P-map for p_D

SLIDES 16-17

Perfect map from CI test

identifiable only up to I-equivalence: the same set of CIs, the same skeleton, the same immoralities

procedure:

  • 1. find the undirected skeleton using CI tests
  • 2. identify immoralities in the undirected graph
SLIDES 18-22

Perfect map from CI test

  • 1. finding the undirected skeleton

observation: if X and Y are not adjacent, then X ⊥ Y ∣ Pa_X or X ⊥ Y ∣ Pa_Y
assumption: the max number of parents is d
idea: search over all conditioning subsets of size at most d, and check the CI above

input: CI oracle; bound on #parents d
output: undirected skeleton

initialize H as a complete undirected graph
for all pairs X_i, X_j:
  for all subsets U of size ≤ d (within the current neighbors of X_i, X_j):
    if X_i ⊥ X_j ∣ U, then remove X_i − X_j from H
return H

cost: O(n^2) pairs × O((n − 2)^d) subsets = O(n^{d+2}) CI tests
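The skeleton-recovery loop above can be sketched as follows. This is an illustration under the slide's assumptions, with a hypothetical oracle ci_oracle(x, y, cond); it also records the separating set found for each removed pair, which step 2 will need:

```python
from itertools import combinations

def find_skeleton(variables, ci_oracle, d):
    """Recover the undirected skeleton with CI tests of size <= d (sketch)."""
    neighbors = {v: set(variables) - {v} for v in variables}
    witness = {}  # separating set U found for each removed pair
    for size in range(d + 1):
        for x in variables:
            for y in sorted(neighbors[x]):
                if y not in neighbors[x]:
                    continue  # already removed by an earlier test
                candidates = sorted(neighbors[x] - {y})
                for u in combinations(candidates, size):
                    if ci_oracle(x, y, set(u)):
                        neighbors[x].discard(y)
                        neighbors[y].discard(x)
                        witness[frozenset((x, y))] = set(u)
                        break
    return neighbors, witness

# toy oracle for the chain A - B - C, where the only CI is A ⊥ C | B
def chain_ci(x, y, cond):
    return {x, y} == {"A", "C"} and "B" in cond

skeleton, witness = find_skeleton(["A", "B", "C"], chain_ci, d=1)
```

On the chain example this removes only the edge A − C, with separating set {B}.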

SLIDES 23-26

Perfect map from CI test

  • 2. finding the immoralities

potential immorality: X − Z, Y − Z ∈ H and X − Y ∉ H

X − Z − Y is not an immorality only if the separating sets satisfy X ⊥ Y ∣ U ⇒ Z ∈ U

so: save the set U used when removing X − Y from the skeleton and check whether Z ∈ U; if not, then X → Z ← Y is an immorality
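The "save the U, check for Z" rule translates directly to code. A minimal sketch, assuming the skeleton step has stored a separating set per removed pair (as in the slide):

```python
def find_immoralities(neighbors, witness):
    """Orient X -> Z <- Y for each potential immorality X - Z - Y
    (X, Y non-adjacent) whose saved separating set does not contain Z."""
    directed = set()
    for z in sorted(neighbors):
        adj = sorted(neighbors[z])
        for i in range(len(adj)):
            for j in range(i + 1, len(adj)):
                x, y = adj[i], adj[j]
                if y in neighbors[x]:
                    continue  # adjacent pair: not a potential immorality
                if z not in witness.get(frozenset((x, y)), set()):
                    directed.add((x, z))
                    directed.add((y, z))
    return directed

# skeleton of the collider A -> C <- B: the pair (A, B) was separated by U = {}
neighbors = {"A": {"C"}, "B": {"C"}, "C": {"A", "B"}}
witness = {frozenset(("A", "B")): set()}
vstructs = find_immoralities(neighbors, witness)
```

Since C is not in the separating set of (A, B), both edges get oriented towards C.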

SLIDES 27-31

Perfect map from CI test

  • 3. propagate the constraints

at this point we have a mix of directed and undirected edges; add directions using rules R1, R2, R3 (needed to preserve the immoralities and the DAG structure) until convergence

for exact CI tests, this recovers the exact I-equivalence family

Example (figure): ground truth DAG; its undirected skeleton + immoralities; orientations added using rules R1, R2, R3

SLIDES 32-39

conditional independence (CI) test

how to decide X ⊥ Y ∣ Z from the dataset D?

measure the deviance of p_D(X, Y ∣ Z) from p_D(X ∣ Z) p_D(Y ∣ Z):

  • conditional mutual information statistic:
    d_I(D) = E_Z[ D( p_D(X, Y ∣ Z) ∥ p_D(X ∣ Z) p_D(Y ∣ Z) ) ]
  • χ² statistic:
    d_{χ²}(D) = ∣D∣ ∑_{x,y,z} (p_D(x, y, z) − p_D(z) p_D(x ∣ z) p_D(y ∣ z))² / (p_D(z) p_D(x ∣ z) p_D(y ∣ z))

both are computed using frequencies in the dataset

a large deviance rejects the null hypothesis (of conditional independence): pick a threshold t and reject when d(D) > t

the p-value is the probability of false rejection, over all possible datasets:

pvalue(t) = P({D : d(D) > t} ∣ X ⊥ Y ∣ Z)

it is possible to derive the distribution of the deviance measure (e.g., the χ² distribution) and reject the hypothesis (CI) for small p-values (e.g., .05)
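The χ² deviance above can be computed directly from counts. A small sketch from the slide's formula (for a p-value one would then compare against the χ² distribution with the appropriate degrees of freedom, omitted here):

```python
from collections import Counter

def chi2_deviance(samples):
    """d(D) = |D| * sum (p(x,y,z) - p(z)p(x|z)p(y|z))^2 / (p(z)p(x|z)p(y|z)).

    samples: list of (x, y, z) observations; all probabilities are
    empirical frequencies from the dataset."""
    n = len(samples)
    joint = Counter(samples)
    cz = Counter(z for _, _, z in samples)
    cxz = Counter((x, z) for x, _, z in samples)
    cyz = Counter((y, z) for _, y, z in samples)
    stat = 0.0
    for (x, z) in cxz:
        for (y, z2) in cyz:
            if z2 != z:
                continue
            expected = (cz[z] / n) * (cxz[(x, z)] / cz[z]) * (cyz[(y, z)] / cz[z])
            observed = joint.get((x, y, z), 0) / n
            stat += (observed - expected) ** 2 / expected
    return n * stat

# X and Y independent given Z gives zero deviance; X = Y gives a large one
indep = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
dep = [(0, 0, 0), (1, 1, 0), (0, 0, 0), (1, 1, 0)]
```

chi2_deviance(indep) is 0, while chi2_deviance(dep) is 4, so a threshold between them separates the two cases.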

SLIDE 40

Structure learning in BayesNets (recap of the family of methods)

  • constraint-based methods: estimate conditional independencies from the data, find compatible BayesNets
  • score-based methods: search over the combinatorial space, maximizing a score
  • Bayesian model averaging: integrate over all possible structures

SLIDES 41-45

Mutual information

how much information does X encode about Y? the reduction in the uncertainty of X after observing Y:

I(X, Y) = H(X) − H(X ∣ Y) = H(Y) − H(Y ∣ X)

where the conditional entropy is H(Y ∣ X) = ∑_x p(x) H(p(y ∣ x))

symmetric: I(X, Y) = I(Y, X)

I(X, Y) = ∑_{x,y} p(x, y) log( p(x, y) / (p(x) p(y)) ) = D_KL( p(x, y) ∥ p(x) p(y) ) ≥ 0 (non-negative)
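The summation form of I(X, Y) is a one-liner over a joint table. A minimal sketch (probabilities in nats):

```python
from math import log

def mutual_information(joint):
    """I(X, Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ).

    joint: dict mapping (x, y) -> probability."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p  # marginal p(x)
        py[y] = py.get(y, 0.0) + p  # marginal p(y)
    return sum(p * log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# independent variables carry zero information; a perfect copy carries H(X)
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
copy = {(0, 0): 0.5, (1, 1): 0.5}
```

For the independent table I(X, Y) = 0, and for the copy I(X, Y) = log 2, the full entropy of a fair bit.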

SLIDES 46-51

MLE in Bayes-nets: mutual information form

log-likelihood:

ℓ(D; θ) = ∑_{x∈D} ∑_i log p(x_i ∣ pa_i; θ_{i∣Pa_i})

= ∑_i ∑_{(x_i, pa_i)∈D} log p(x_i ∣ pa_i; θ_{i∣Pa_i})

= N ∑_i ∑_{x_i, pa_i} p_D(x_i, pa_i) log p(x_i ∣ pa_i; θ_{i∣Pa_i})   (using the empirical distribution p_D)

substituting the MLE estimate θ*:

ℓ(D; θ*) = N ∑_i ∑_{x_i, pa_i} p_D(x_i, pa_i) log p_D(x_i ∣ pa_i)

= N ∑_i ∑_{x_i, pa_i} p_D(x_i, pa_i) ( log( p_D(x_i, pa_i) / (p_D(x_i) p_D(pa_i)) ) + log p_D(x_i) )

= N ∑_i ( I_D(X_i, Pa_{X_i}) − H_D(X_i) )   (using the definition of mutual information)
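The identity ℓ(D; θ*) = N ∑_i ( I_D(X_i, Pa_i) − H_D(X_i) ) can be checked numerically by computing both sides from the same data. A sketch (variables indexed by column; parents given as tuples of column indices):

```python
from math import log
from collections import Counter

def loglik_direct(data, parents):
    """ℓ(D, θ*) computed directly: sum over samples of log p_D(x_i | pa_i)."""
    total = 0.0
    for i, pa in parents.items():
        fam = Counter((row[i],) + tuple(row[j] for j in pa) for row in data)
        par = Counter(tuple(row[j] for j in pa) for row in data)
        total += sum(c * log(c / par[key[1:]]) for key, c in fam.items())
    return total

def loglik_mi_form(data, parents):
    """The same quantity via N * sum_i ( I_D(X_i, Pa_i) - H_D(X_i) )."""
    n = len(data)
    total = 0.0
    for i, pa in parents.items():
        fam = Counter((row[i],) + tuple(row[j] for j in pa) for row in data)
        par = Counter(tuple(row[j] for j in pa) for row in data)
        marg = Counter(row[i] for row in data)
        mi = sum((c / n) * log((c / n) / ((marg[k[0]] / n) * (par[k[1:]] / n)))
                 for k, c in fam.items())
        ent = -sum((c / n) * log(c / n) for c in marg.values())
        total += n * (mi - ent)
    return total

data = [(0, 0), (0, 1), (1, 1), (1, 1)]
parents = {0: (), 1: (0,)}  # the DAG X0 -> X1
```

Both functions agree up to floating-point error, exactly as the derivation predicts.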

SLIDES 52-55

Optimal solution for trees

likelihood score: ℓ(D; θ*) = N ∑_i ( I_D(X_i, Pa_{X_i}) − H_D(X_i) )

the entropy term does not depend on the structure; for a tree, each node has a single parent, so the structure-dependent term is a sum of pairwise terms I_D(X_i, X_j)

structure learning algorithms use mutual information in the structure search. Chow-Liu algorithm: find the maximum spanning tree with

  • edge weights = mutual information (symmetric: I_D(X_j, X_i) = I_D(X_i, X_j))
  • add directions to the edges later, making sure each node has at most one parent (i.e., no v-structures)
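The Chow-Liu idea fits in a short sketch: compute pairwise empirical mutual information and grow a maximum spanning tree (Prim's algorithm here), orienting edges away from an arbitrary root so each node gets at most one parent:

```python
from math import log
from collections import Counter

def chow_liu_edges(data):
    """Maximum spanning tree over pairwise empirical mutual information."""
    n, k = len(data), len(data[0])

    def mi(i, j):
        cij = Counter((row[i], row[j]) for row in data)
        ci = Counter(row[i] for row in data)
        cj = Counter(row[j] for row in data)
        return sum((c / n) * log((c / n) / ((ci[a] / n) * (cj[b] / n)))
                   for (a, b), c in cij.items())

    in_tree, edges = [0], []
    while len(in_tree) < k:
        # Prim's step: heaviest edge from the tree to a new node
        w, i, j = max((mi(i, j), i, j)
                      for i in in_tree for j in range(k) if j not in in_tree)
        edges.append((i, j))  # orient the edge away from the root X0
        in_tree.append(j)
    return edges

# X1 is a copy of X0; X2 is independent noise
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
edges = chow_liu_edges(data)
```

On this data the strong X0 to X1 dependence is picked first, then X2 is attached with a zero-weight edge.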

SLIDES 56-63

Bayesian Score for BayesNets

Bayesian about both structure G and parameters θ:

P(G ∣ D) ∝ P(D ∣ G) P(G)

log score_B(G, D) = log P(D ∣ G) + log P(G)

the marginal likelihood of a structure G is

P(D ∣ G) = ∫_{θ∈Θ_G} P(D ∣ θ, G) P(θ ∣ G) dθ

  • assuming local and global parameter independence, it factorizes into the marginal likelihood of each node
  • for the Dirichlet-multinomial it has a closed form

for large sample size (and any exponential-family member):

score_B(G, D) ≈ ℓ(D, θ*_G) − (1/2) log(∣D∣) K   (Bayesian Information Criterion, BIC; K = #parameters)

compare the Akaike Information Criterion (AIC): ℓ(D, θ*_G) − K
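The BIC approximation is easy to compute for a discrete BayesNet, since both the MLE log-likelihood and the parameter count decompose per node. A sketch (parents as tuples of column indices, card = cardinality of each variable):

```python
from math import log
from collections import Counter

def bic_score(data, parents, card):
    """BIC: ℓ(D, θ*) − (1/2) log(|D|) · K, with K the number of free parameters."""
    n = len(data)
    loglik, k_params = 0.0, 0
    for i, pa in parents.items():
        fam = Counter((row[i],) + tuple(row[j] for j in pa) for row in data)
        par = Counter(tuple(row[j] for j in pa) for row in data)
        loglik += sum(c * log(c / par[key[1:]]) for key, c in fam.items())
        k = card[i] - 1
        for j in pa:
            k *= card[j]  # one distribution per parent configuration
        k_params += k
    return loglik - 0.5 * log(n) * k_params

# on independent data the penalty makes BIC prefer the empty graph
data = [(0, 0), (0, 1), (1, 0), (1, 1)]
empty = bic_score(data, {0: (), 1: ()}, {0: 2, 1: 2})
edge = bic_score(data, {0: (), 1: (0,)}, {0: 2, 1: 2})
```

Here the extra edge buys no likelihood (the variables are independent) but costs an extra parameter, so the empty graph scores higher: the bias towards simpler structures from the slides.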

SLIDES 64-65

Bayesian Score for BayesNets

Example (figure: Bayesian score vs. sample size N = ∣D∣): data sampled from the ICU-alarm BayesNet; Bayesian score of the true model (509 params.) and of two simplified models (359 params. and 214 params.)

the Bayesian score is biased towards simpler structures

SLIDES 66-71

Structure search

arg max_G Score(D, G) is NP-hard

use heuristic search algorithms (discussed for MAP inference): local search using edge addition, edge deletion, edge reversal

  • O(N^2) possible moves
  • for each move: collect sufficient statistics (frequencies) and estimate the score
  • use the decomposition of the score, so each move only re-scores the affected families

example: ICU-alarm network
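The local search above can be sketched as greedy hill climbing with a decomposable BIC score. This is an illustration, not the exact algorithm from the course: it covers edge addition and deletion (with a cycle check) and omits edge reversal for brevity:

```python
from math import log
from collections import Counter

def family_score(data, i, pa, card):
    """Decomposable BIC contribution of node i with parent tuple pa."""
    n = len(data)
    fam = Counter((row[i],) + tuple(row[j] for j in pa) for row in data)
    par = Counter(tuple(row[j] for j in pa) for row in data)
    ll = sum(c * log(c / par[key[1:]]) for key, c in fam.items())
    k = card[i] - 1
    for j in pa:
        k *= card[j]
    return ll - 0.5 * log(n) * k

def is_ancestor(parents, a, b):
    """Is there a directed path a -> ... -> b?"""
    stack, seen = list(parents[b]), set()
    while stack:
        v = stack.pop()
        if v == a:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def hill_climb(data, card, max_steps=50):
    """Greedy local search over edge additions and deletions."""
    k = len(card)
    parents = {i: () for i in range(k)}
    scores = {i: family_score(data, i, (), card) for i in range(k)}
    for _ in range(max_steps):
        best_gain, best_move = 0.0, None
        for i in range(k):
            for j in range(k):
                if i == j:
                    continue
                if j in parents[i]:  # try deleting the edge j -> i
                    new_pa = tuple(p for p in parents[i] if p != j)
                elif not is_ancestor(parents, i, j):  # try adding j -> i
                    new_pa = parents[i] + (j,)
                else:
                    continue  # addition would create a directed cycle
                gain = family_score(data, i, new_pa, card) - scores[i]
                if gain > best_gain:
                    best_gain, best_move = gain, (i, new_pa)
        if best_move is None:
            break  # local optimum: no move improves the score
        i, new_pa = best_move
        parents[i] = new_pa
        scores[i] = family_score(data, i, new_pa, card)
    return parents

# X1 is a deterministic copy of X0: a single edge between them pays for itself
data = [(0, 0)] * 4 + [(1, 1)] * 4
learned = hill_climb(data, {0: 2, 1: 2})
```

Because the score decomposes, each move only needs to re-score the one family it changes, which is what makes local search with O(N²) moves affordable.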

SLIDES 72-74

Summary

  • structure learning is NP-hard; make assumptions to simplify
  • constraint-based methods: limit the max number of parents, rely on CI tests; identify the I-equivalence class
  • score-based methods: assume a tree structure, or use a Bayesian score + heuristic search to find a locally optimal structure