CS 3750 Advanced Machine Learning
Lecture 3: Graphical models and inference II

Milos Hauskrecht, milos@pitt.edu, 5329 Sennott Square, x4-8845
http://www.cs.pitt.edu/~milos/courses/cs3750-Spring2020/



Challenges for modeling complex multivariate distributions

How to model/parameterize a complex multivariate distribution P(X) with a large number of variables? One solution:

  • Decompose the distribution and reduce the number of parameters using some form of independence. Two models:
    – Bayesian belief networks (BBNs)
    – Markov random fields (MRFs)
  • Learning of these models relies on the decomposition.


Bayesian belief network

Example (alarm network): Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls; local probabilities P(B), P(E), P(A|B,E), P(J|A), P(M|A).

Directed acyclic graph

  • Nodes = random variables
  • Links = direct (causal) dependencies

Missing links encode different marginal and conditional independences

Bayesian belief network

Alarm network with its conditional probability tables:

P(B):  B=T 0.001, B=F 0.999
P(E):  E=T 0.002, E=F 0.998

P(A|B,E):          A=T     A=F
  B=T, E=T         0.95    0.05
  B=T, E=F         0.94    0.06
  B=F, E=T         0.29    0.71
  B=F, E=F         0.001   0.999

P(J|A):  A=T: J=T 0.90, J=F 0.10;   A=F: J=T 0.05, J=F 0.95
P(M|A):  A=T: M=T 0.70, M=F 0.30;   A=F: M=T 0.01, M=F 0.99


Full joint distribution in BBNs

The full joint distribution is defined as a product of local conditional distributions:

P(X_1, X_2, ..., X_n) = ∏_{i=1..n} P(X_i | pa(X_i))

Example: assume the following assignment of values to the random variables:

B = T, E = T, A = T, J = T, M = F

Then its probability is:

P(B=T, E=T, A=T, J=T, M=F) = P(B=T) P(E=T) P(A=T | B=T, E=T) P(J=T | A=T) P(M=F | A=T)
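The product above can be evaluated directly from the CPTs. A minimal sketch, using the numbers from the tables earlier in the lecture (the dictionary encoding and the helper name `joint` are mine, not the lecture's):

```python
# Local CPTs of the alarm network (values from the slides); each dict
# stores P(X=T | parents); the complement gives X=F.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=T | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=T | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=T | A)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as the product of local conditionals."""
    pa = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
    pj = P_J[a] if j else 1.0 - P_J[a]
    pm = P_M[a] if m else 1.0 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

p = joint(True, True, True, True, False)
print(p)   # 0.001 * 0.002 * 0.95 * 0.90 * 0.30
```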

Inference in Bayesian networks

  • Full joint uses the decomposition
  • Calculation of marginals:
    – Requires summation over the variables we want to take out
  • How to compute sums and products more efficiently?
    – Use the distributive law: ∑_x a·f(x) = a·∑_x f(x)

Example:

P(J=T) = ∑_{b,e,a,m ∈ {T,F}} P(B=b, E=e, A=a, J=T, M=m)
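The sum above can be evaluated by brute force, enumerating all 2^4 = 16 assignments of B, E, A, M. A sketch with the CPTs from the slides (helper names are mine); variable elimination, introduced next, computes the same number with far fewer operations:

```python
from itertools import product

# CPTs of the alarm network, stored as P(X=T | parents).
P_B, P_E = {True: 0.001, False: 0.999}, {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J, P_M = {True: 0.90, False: 0.05}, {True: 0.70, False: 0.01}

def joint(b, e, a, j, m):
    pa = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
    pj = P_J[a] if j else 1.0 - P_J[a]
    pm = P_M[a] if m else 1.0 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# Brute-force marginal: sum the joint over B, E, A, M (2^4 terms).
p_j_true = sum(joint(b, e, a, True, m)
               for b, e, a, m in product([True, False], repeat=4))
print(round(p_j_true, 6))   # ≈ 0.052139
```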


Variable elimination

Assume order M, E, B, A to calculate P(J=T):

P(J=T) = ∑_{m,b,e,a ∈ {T,F}} P(J=T|A=a) P(M=m|A=a) P(A=a|B=b,E=e) P(B=b) P(E=e)

       = ∑_{b,e,a} P(J=T|A=a) P(A=a|B=b,E=e) P(B=b) P(E=e) · [∑_m P(M=m|A=a)]     (the bracket equals 1)

       = ∑_{a} P(J=T|A=a) ∑_{b} P(B=b) ∑_{e} P(A=a|B=b,E=e) P(E=e)

       = ∑_{a} P(J=T|A=a) ∑_{b} P(B=b) τ1(A=a, B=b)      (E eliminated)

       = ∑_{a} P(J=T|A=a) τ2(A=a)                        (B eliminated)

       = P(J=T)                                          (A eliminated)

Variable elimination

Assume order M, E, B, A to calculate P(J=T). The conditional probabilities defining the joint are factors, and variable elimination inference can be cast in terms of operations defined over factors:

P(J=T) = ∑_{m,b,e,a ∈ {T,F}} P(J=T|A=a) P(M=m|A=a) P(A=a|B=b,E=e) P(B=b) P(E=e)

       = ∑_{M,B,E,A ∈ {T,F}} f1(A) f2(M,A) f3(A,B,E) f4(B) f5(E)

where f1(A) = P(J=T|A), f2(M,A) = P(M|A), f3(A,B,E) = P(A|B,E), f4(B) = P(B), f5(E) = P(E).
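The factor view can be turned into a small elimination routine. A compact sketch (factors as (scope, table) pairs; `make_factor`, `eliminate`, and `prod_of` are my names, not the lecture's):

```python
from itertools import product

def make_factor(scope, fn):
    """Tabulate fn over all True/False assignments to the scope."""
    return (scope, {vals: fn(*vals)
                    for vals in product([True, False], repeat=len(scope))})

P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
factors = [
    make_factor(('A',), lambda a: 0.90 if a else 0.05),            # f1(A)=P(J=T|A)
    make_factor(('M', 'A'),
                lambda m, a: (0.70 if a else 0.01) if m else (0.30 if a else 0.99)),
    make_factor(('A', 'B', 'E'),
                lambda a, b, e: P_A[(b, e)] if a else 1 - P_A[(b, e)]),
    make_factor(('B',), lambda b: 0.001 if b else 0.999),
    make_factor(('E',), lambda e: 0.002 if e else 0.998),
]

def prod_of(factors, asg):
    """Product of factor table entries selected by the assignment asg."""
    p = 1.0
    for scope, table in factors:
        p *= table[tuple(asg[v] for v in scope)]
    return p

def eliminate(factors, var):
    """Multiply all factors mentioning var, then sum var out."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    scope = tuple(sorted({v for s, _ in touching for v in s if v != var}))
    table = {}
    for vals in product([True, False], repeat=len(scope)):
        asg = dict(zip(scope, vals))
        table[vals] = sum(prod_of(touching, {**asg, var: x})
                          for x in [True, False])
    return rest + [(scope, table)]

for var in ['M', 'E', 'B', 'A']:     # the order from the slide
    factors = eliminate(factors, var)
p_j = prod_of(factors, {})           # all variables eliminated: P(J=T)
print(round(p_j, 6))
```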


Factors

  • Factor: a function that maps value assignments for a subset of random variables to ℝ (the reals)
  • The scope of the factor:
    – the set of variables defining the factor
  • Example:
    – Assume discrete random variables x (with values a1, a2, a3) and y (with values b1, b2)
    – Factor φ(x, y):

        x   y    φ(x,y)
        a1  b1   0.5
        a1  b2   0.2
        a2  b1   0.1
        a2  b2   0.3
        a3  b1   0.2
        a3  b2   0.4

    – Scope of the factor: {x, y}

Factor Product

Variables: A, B, C. The product ψ(A,B,C) = φ1(A,B) · φ2(B,C) multiplies the entries that agree on the shared variable B:

φ1(A,B):              φ2(B,C):
  a1 b1  0.5            b1 c1  0.1
  a1 b2  0.2            b1 c2  0.6
  a2 b1  0.1            b2 c1  0.3
  a2 b2  0.3            b2 c2  0.4
  a3 b1  0.2
  a3 b2  0.4

ψ(A,B,C):
  a1 b1 c1  0.5·0.1     a1 b1 c2  0.5·0.6
  a1 b2 c1  0.2·0.3     a1 b2 c2  0.2·0.4
  a2 b1 c1  0.1·0.1     a2 b1 c2  0.1·0.6
  a2 b2 c1  0.3·0.3     a2 b2 c2  0.3·0.4
  a3 b1 c1  0.2·0.1     a3 b1 c2  0.2·0.6
  a3 b2 c1  0.4·0.3     a3 b2 c2  0.4·0.4
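In code, the factor product can be sketched as follows (dictionary-backed factors; the function name and the explicit `DOMAINS` table are mine):

```python
from itertools import product as cartesian

DOMAINS = {'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2'], 'C': ['c1', 'c2']}

def factor_product(f, g, domains=DOMAINS):
    """psi = f * g: entries that agree on the shared variables multiply."""
    (fs, ft), (gs, gt) = f, g
    scope = list(fs) + [v for v in gs if v not in fs]   # union of the scopes
    table = {}
    for vals in cartesian(*(domains[v] for v in scope)):
        asg = dict(zip(scope, vals))
        table[vals] = (ft[tuple(asg[v] for v in fs)] *
                       gt[tuple(asg[v] for v in gs)])
    return scope, table

phi1 = (('A', 'B'), {('a1', 'b1'): 0.5, ('a1', 'b2'): 0.2, ('a2', 'b1'): 0.1,
                     ('a2', 'b2'): 0.3, ('a3', 'b1'): 0.2, ('a3', 'b2'): 0.4})
phi2 = (('B', 'C'), {('b1', 'c1'): 0.1, ('b1', 'c2'): 0.6,
                     ('b2', 'c1'): 0.3, ('b2', 'c2'): 0.4})
scope, psi = factor_product(phi1, phi2)
print(psi[('a1', 'b1', 'c1')])   # 0.5 * 0.1
```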


Factor Marginalization

Variables: A, B, C. Summing B out of ψ(A,B,C) gives τ(A,C) = ∑_B ψ(A,B,C):

ψ(A,B,C):                               τ(A,C):
  a1 b1 c1  0.2    a1 b1 c2  0.35        a1 c1  0.2 + 0.4   = 0.6
  a1 b2 c1  0.4    a1 b2 c2  0.15        a1 c2  0.35 + 0.15 = 0.5
  a2 b1 c1  0.5    a2 b1 c2  0.1         a2 c1  0.8
  a2 b2 c1  0.3    a2 b2 c2  0.2         a2 c2  0.3
  a3 b1 c1  0.25   a3 b1 c2  0.45        a3 c1  0.4
  a3 b2 c1  0.15   a3 b2 c2  0.25        a3 c2  0.7
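Marginalization is a short sketch in the same dictionary representation (names mine): entries that agree on the remaining variables are summed.

```python
def marginalize(scope, table, var):
    """Sum var out of a factor: entries agreeing elsewhere are added."""
    keep = [i for i, v in enumerate(scope) if v != var]
    out = {}
    for vals, p in table.items():
        key = tuple(vals[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return [scope[i] for i in keep], out

psi = {('a1','b1','c1'): 0.2,  ('a1','b1','c2'): 0.35, ('a1','b2','c1'): 0.4,
       ('a1','b2','c2'): 0.15, ('a2','b1','c1'): 0.5,  ('a2','b1','c2'): 0.1,
       ('a2','b2','c1'): 0.3,  ('a2','b2','c2'): 0.2,  ('a3','b1','c1'): 0.25,
       ('a3','b1','c2'): 0.45, ('a3','b2','c1'): 0.15, ('a3','b2','c2'): 0.25}
scope, tau = marginalize(['A', 'B', 'C'], psi, 'B')
print(scope, tau[('a1', 'c1')])   # τ(a1,c1) = 0.2 + 0.4
```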


Factor division

The inverse of a factor product: divide φ(A,B) entry-wise by ψ(A):

φ(A,B):            ψ(A):        φ(A,B) / ψ(A):
  A=1 B=1  0.5      A=1  0.4     A=1 B=1  0.5/0.4 = 1.25
  A=1 B=2  0.4      A=2  0.4     A=1 B=2  0.4/0.4 = 1.0
  A=2 B=1  0.8      A=3  0.5     A=2 B=1  0.8/0.4 = 2.0
  A=2 B=2  0.2                   A=2 B=2  0.2/0.4 = 0.5
  A=3 B=1  0.6                   A=3 B=1  0.6/0.5 = 1.2
  A=3 B=2  0.5                   A=3 B=2  0.5/0.5 = 1.0
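Factor division can be sketched the same way (names mine; the 0/0 = 0 convention is a common one in message-passing updates, not something stated on the slide):

```python
def factor_divide(f_scope, f_table, g_scope, g_table):
    """Entry-wise division f/g; g's scope must be a subset of f's scope.

    Convention (assumed here): 0/0 is taken as 0.
    """
    idx = [f_scope.index(v) for v in g_scope]
    out = {}
    for vals, p in f_table.items():
        q = g_table[tuple(vals[i] for i in idx)]
        out[vals] = 0.0 if p == q == 0.0 else p / q
    return out

phi = {(1, 1): 0.5, (1, 2): 0.4, (2, 1): 0.8,
       (2, 2): 0.2, (3, 1): 0.6, (3, 2): 0.5}
psi = {(1,): 0.4, (2,): 0.4, (3,): 0.5}
res = factor_divide(['A', 'B'], phi, ['A'], psi)
print(res[(1, 1)], res[(2, 2)])   # 1.25 0.5
```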


Markov random fields

An undirected network (also called an independence graph)

  • Probabilistic models with symmetric dependences
  • G = (S, E)
    – S: a set of random variables
    – E: undirected edges that define dependences between pairs of variables

Example: variables A, B, ..., H connected in an undirected graph.

Markov random fields

The full joint of the MRF is defined as a normalized product of potentials:

P(x) = (1/Z) ∏_{c ∈ cl(x)} φ_c(x_c)

  • φ_c(x_c): a potential function defined over the variables of a clique (factor) c of the graph
  • Z: the normalization constant

Example (variables A, B, ..., H):

P(A, B, ..., H) ~ φ1(A,B,C) φ2(B,D,E) φ3(A,G) φ4(C,F) φ5(G,H) φ6(F,H)
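The role of Z can be made concrete by brute-force enumeration. The slides give the clique structure but no numeric potentials, so the "agreement" potential below is an arbitrary toy choice of mine:

```python
from itertools import product

# Cliques of the example MRF over A..H (from the slide).
cliques = [('A','B','C'), ('B','D','E'), ('A','G'),
           ('C','F'), ('G','H'), ('F','H')]

def phi(vals):
    # Toy potential (illustrative only): reward agreement within a clique.
    return 2.0 if len(set(vals)) == 1 else 1.0

variables = sorted({v for c in cliques for v in c})      # A..H

def unnormalized(asg):
    p = 1.0
    for c in cliques:
        p *= phi(tuple(asg[v] for v in c))
    return p

# Z sums the unnormalized product over all 2^8 joint states.
Z = sum(unnormalized(dict(zip(variables, vals)))
        for vals in product([0, 1], repeat=len(variables)))
p0 = unnormalized(dict.fromkeys(variables, 0)) / Z       # P(all variables = 0)
print(Z, p0)
```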

Markov random fields: independence relations

  • Pairwise Markov property
    – Two nodes in the network that are not directly connected are independent given all other nodes
  • Local Markov property
    – A node (variable) is independent of the rest of the variables given its immediate neighbors
  • Global Markov property
    – A vertex set A is independent of a vertex set B (A and B disjoint) given a set C if all paths between elements of A and B intersect C
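The global Markov property is a graph-separation test: A is independent of B given C exactly when removing C's nodes disconnects A from B. A sketch (the edge list is derived from the example cliques; `separated` is my name):

```python
from collections import deque

def separated(adj, A, B, C):
    """Do all paths from A to B pass through C?

    Equivalent to: A and B are disconnected after removing the nodes in C.
    """
    seen = set(A) - set(C)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        if u in B:
            return False           # found a path avoiding C
        for w in adj[u]:
            if w not in C and w not in seen:
                seen.add(w)
                queue.append(w)
    return True

# Undirected adjacency read off the example cliques over A..H.
adj = {'A': {'B', 'C', 'G'}, 'B': {'A', 'C', 'D', 'E'}, 'C': {'A', 'B', 'F'},
       'D': {'B', 'E'}, 'E': {'B', 'D'}, 'F': {'C', 'H'},
       'G': {'A', 'H'}, 'H': {'F', 'G'}}
print(separated(adj, {'D'}, {'H'}, {'B'}))   # every D-H path goes through B
```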

MRF variable elimination inference

Example: compute the marginal P(B) = ∑_{A,C,D,...,H} P(A, B, ..., H)

P(B) = (1/Z) ∑_{A,C,D,E,F,G,H} φ1(A,B,C) φ2(B,D,E) φ3(A,G) φ4(C,F) φ5(G,H) φ6(F,H)

Eliminate E:  τ1(B,D) = ∑_{E} φ2(B,D,E)

MRF variable elimination inference

Example (cont):

P(B) = (1/Z) ∑_{A,C,D,F,G,H} φ1(A,B,C) τ1(B,D) φ3(A,G) φ4(C,F) φ5(G,H) φ6(F,H)

Eliminate D:  τ2(B) = ∑_{D} τ1(B,D)

MRF variable elimination inference

Example (cont):

P(B) = (1/Z) ∑_{A,C,F,G,H} φ1(A,B,C) τ2(B) φ3(A,G) φ4(C,F) φ5(G,H) φ6(F,H)

Eliminate H:  τ3(F,G,H) = φ5(G,H) φ6(F,H),  τ4(F,G) = ∑_{H} τ3(F,G,H)

MRF variable elimination inference

Example (cont):

P(B) = (1/Z) ∑_{A,C,F,G} φ1(A,B,C) τ2(B) φ3(A,G) φ4(C,F) τ4(F,G)

Eliminate F:  τ5(C,F,G) = φ4(C,F) τ4(F,G),  τ6(C,G) = ∑_{F} τ5(C,F,G)

MRF variable elimination inference

Example (cont):

P(B) = (1/Z) ∑_{A,C,G} φ1(A,B,C) τ2(B) φ3(A,G) τ6(C,G)

Eliminate G:  τ7(A,C,G) = φ3(A,G) τ6(C,G),  τ8(A,C) = ∑_{G} τ7(A,C,G)

MRF variable elimination inference

Example (cont):

P(B) = (1/Z) ∑_{A,C} φ1(A,B,C) τ2(B) τ8(A,C)

Eliminate C:  τ9(A,B,C) = φ1(A,B,C) τ8(A,C),  τ10(A,B) = ∑_{C} τ9(A,B,C)

MRF variable elimination inference

Example (cont):

P(B) = (1/Z) ∑_{A} τ2(B) τ10(A,B) = (1/Z) τ2(B) ∑_{A} τ10(A,B)

Eliminate A:  τ11(B) = ∑_{A} τ10(A,B)

Result:  P(B) = (1/Z) τ2(B) τ11(B)
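The whole elimination run can be sketched in code. The slides keep the potentials symbolic, so the toy "agreement" values below are mine; the order E, D, H, F, G, C, A follows the slides, and at the end two factors over B remain, matching the final expression (1/Z) τ2(B) τ11(B):

```python
from itertools import product

def agree(*vals):
    # Toy potential (the slides keep phi symbolic): favor agreement.
    return 2.0 if len(set(vals)) == 1 else 1.0

scopes = [('A','B','C'), ('B','D','E'), ('A','G'),
          ('C','F'), ('G','H'), ('F','H')]
factors = [(s, {v: agree(*v) for v in product([0, 1], repeat=len(s))})
           for s in scopes]

def prodval(factors, asg):
    p = 1.0
    for s, t in factors:
        p *= t[tuple(asg[v] for v in s)]
    return p

def eliminate(factors, var):
    """Sum var out of the product of all factors mentioning it."""
    touch = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    scope = tuple(sorted({v for s, _ in touch for v in s} - {var}))
    table = {}
    for vals in product([0, 1], repeat=len(scope)):
        asg = dict(zip(scope, vals))
        table[vals] = sum(prodval(touch, {**asg, var: x}) for x in [0, 1])
    return rest + [(scope, table)]

for var in 'EDHFGCA':                # the elimination order from the slides
    factors = eliminate(factors, var)

# Two factors over B remain (tau2 and tau11); normalize their product.
unnorm = {b: prodval(factors, {'B': b}) for b in [0, 1]}
Z = unnorm[0] + unnorm[1]
P_B = {b: unnorm[b] / Z for b in [0, 1]}
print(P_B)
```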


Are BBNs and MRFs different?

Both models represent independences that hold among variables or sets of variables.

  • Are the two the same in terms of independences they can represent?
  • Or, are they different?

Are BBNs and MRFs different?

Both models represent independences that hold among variables or sets of variables.

  • Are the two the same in terms of independences they can represent?
  • Or, are they different?

Answer: MRFs are different from BBNs
  • There are independences that can be represented by one model (directed BBNs) but not the other (undirected MRFs), and vice versa.


Are BBNs and MRFs different?

MRFs are different from BBNs
  • There are independences that can be represented by one model but not the other

Analysis (chain):  directed A → B → C   vs   undirected A – B – C
  • In both, A is independent of C given B

Are BBNs and MRFs different?

MRFs are different from BBNs
  • There are independences that can be represented by one model but not the other

Analysis (common parent):  directed B ← A → C   vs   undirected B – A – C
  • In both, B is independent of C given A


Are BBNs and MRFs different?

MRFs are different from BBNs
  • There are independences that can be represented by one model but not the other

Analysis (v-structure):  directed A → C ← B
  • A and B are marginally independent, but A and B are dependent given C

The undirected counterpart A – C – B is not equivalent: it represents A independent of B given C, not marginally.

Fix to undirected (moralization): connect A and B
  • A, B, C are all dependent; no false independence is represented.

Are BBNs and MRFs different?

MRFs are different from BBNs
  • There are independences that can be represented by one model but not the other

Analysis (undirected 4-cycle A – B, A – C, B – D, C – D):
  • B is independent of C given A, D
  • A is independent of D given B, C
  • No directed graph can represent the same set of independences.


Converting BBNs to MRFs

Moral graph H[G] of a Bayesian network G over X: an undirected graph over X that contains an edge between x and y if:

  • There exists a directed edge between them in G, or
  • They are both parents of the same node in G.

(Figure: a network over variables C, D, I, G, S, L, J, H, before and after moralization.)
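Moralization is easy to state in code. A sketch (function name mine), demonstrated on the alarm network from the start of the lecture rather than on the figure's network:

```python
from itertools import combinations

def moralize(parents):
    """Moral graph of a DAG given as {node: set of its parents}."""
    adj = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:                                  # keep each directed edge
            adj[v].add(p); adj[p].add(v)
        for p, q in combinations(sorted(ps), 2):      # 'marry' the co-parents
            adj[p].add(q); adj[q].add(p)
    return adj

# Alarm network: A's parents B and E get married.
parents = {'B': set(), 'E': set(), 'A': {'B', 'E'}, 'J': {'A'}, 'M': {'A'}}
moral = moralize(parents)
print('E' in moral['B'])   # the co-parents B and E are now connected
```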

Moral Graphs: define MRFs

Why moralization? Each conditional in the BBN factorization becomes a potential over a clique of the moral graph:

P(C,D,G,I,S,L,J,H) = P(C) P(D|C) P(G|I,D) P(S|I) P(L|G) P(J|L,S) P(H|G,J)
                   = φ1(C) φ2(D,C) φ3(G,I,D) φ4(S,I) φ5(L,G) φ6(J,L,S) φ7(H,G,J)

For example, P(G|I,D) maps to φ3(G,I,D): moralization connects the co-parents I and D, so {G, I, D} forms a clique.


Inference

Variable elimination: depends on the order of the variables to eliminate.

Question: can we optimize the structures ahead of time so that we can make inferences efficiently, without worrying about the specific variable order?

  • Structures that support efficient inference: chains and trees

(Figure slides on inference in chains and trees; slides by C. Bishop.)


Inference

Many BBNs or MRFs are not tree-structured.

  • Can we optimize the structures ahead of time so that we can make inferences efficiently, without worrying about the specific variable order?
  • Idea: Convert to trees that support efficient inference
  • Next: two approaches to convert MRFs (or BBNs) to tree structures

Induced graph

A graph induced by a specific variable elimination order that covers all variables:

  • the graph G extended by links that represent the intermediate factors

The induced graph defines a tree decomposition of the graph G (or a clique tree).

(Figure: the induced graph for the example over A..H and its tree decomposition; its maximal cliques, e.g. {A,B,C}, {B,D,E}, {A,C,G}, {C,F,G}, {F,G,H}, form the tree nodes.)


Tree decomposition

  • A node in the tree T is formed by a set of vertices corresponding to a maximal clique in G
  • For every edge {v,w} ∈ G: there is a set containing both v and w in T
  • For every v ∈ G: the nodes in T that contain v form a connected subtree.

(Figure: the induced graph (G) over A..H and its tree decomposition (T).)
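The three conditions are mechanical to verify. A sketch (the edge list is the induced graph I read off the earlier elimination run, with fill-in edges C–G and F–G; all names are mine):

```python
def is_tree_decomposition(edges, bags, tree_edges):
    """Check the three conditions; assumes tree_edges already form a tree."""
    vertices = {v for e in edges for v in e}
    # 1) every vertex appears in some bag
    if any(all(v not in b for b in bags.values()) for v in vertices):
        return False
    # 2) every graph edge is contained in some bag
    if any(all(u not in b or w not in b for b in bags.values())
           for u, w in edges):
        return False
    # 3) the bags holding each vertex form a connected subtree of T
    for v in vertices:
        holding = {n for n, b in bags.items() if v in b}
        start = next(iter(holding))
        seen, stack = {start}, [start]
        while stack:
            n = stack.pop()
            for a, b in tree_edges:
                for x, y in ((a, b), (b, a)):
                    if x == n and y in holding and y not in seen:
                        seen.add(y); stack.append(y)
        if seen != holding:
            return False
    return True

edges = [('A','B'), ('A','C'), ('B','C'), ('B','D'), ('B','E'), ('D','E'),
         ('A','G'), ('C','F'), ('C','G'), ('F','G'), ('G','H'), ('F','H')]
bags = {1: {'A','B','C'}, 2: {'B','D','E'}, 3: {'A','C','G'},
        4: {'C','F','G'}, 5: {'F','G','H'}}
tree_edges = [(1, 2), (1, 3), (3, 4), (4, 5)]
print(is_tree_decomposition(edges, bags, tree_edges))
```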

Tree decomposition of the graph

A tree decomposition of a graph G (clique tree):

  • A node in the tree T is formed by a set of vertices corresponding to a maximal clique in G
  • For every edge {v,w} ∈ G: there is a set containing both v and w in T.
  • For every v ∈ G: the nodes in T that contain v form a connected subtree.

(Figure: the decomposition for the example graph over A..H; subsequent build slides highlight the cliques of the graph one at a time.)





Tree decomposition of the graph

Another decomposition of the graph G satisfying the same three conditions:

  • A node in the tree T is formed by a set of vertices corresponding to a maximal clique in G
  • For every edge {v,w} ∈ G: there is a set containing both v and w in T.
  • For every v ∈ G: the nodes in T that contain v form a connected subtree.

(Figure: an alternative clique tree over A..H.)


Treewidth of the graph

  • Width of the tree decomposition:  max_{i ∈ I} |X_i| − 1
  • Treewidth of a graph G: tw(G) = the minimum width over all tree decompositions of G.

(Figure: the induced graph over A..H and a tree decomposition.)
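For the clique tree of the running example (cluster sets read off the earlier elimination run; an assumption on my part), the width works out as:

```python
# Clusters X_i of one tree decomposition of the example graph over A..H.
bags = [{'A', 'B', 'C'}, {'B', 'D', 'E'}, {'A', 'C', 'G'},
        {'C', 'F', 'G'}, {'F', 'G', 'H'}]
width = max(len(b) for b in bags) - 1    # max_i |X_i| - 1
print(width)   # 2
```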

Treewidth of the graph

  • Treewidth of a graph G: tw(G) = the minimum width over all tree decompositions of G
  • Why is it important?
    – Many calculations can take advantage of the structure and be performed more efficiently
    – The treewidth gives the best-case complexity of inference

(Figure: two graphs over A..H with different treewidths.)




Chordal graphs

Chordal graph: an undirected graph G in which
  • all cycles of four or more vertices have a chord (another edge breaking the cycle), i.e.
  • the minimal cycles contain only 3 vertices

(Figure: two graphs over C, D, I, G, S, L, J, H; one chordal, one not chordal.)
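Chordality can be tested with maximum cardinality search (MCS): the reverse of an MCS visit order is a perfect elimination ordering iff the graph is chordal. A sketch (this particular algorithm is my choice; the slide does not name one):

```python
def is_chordal(adj):
    """MCS-based chordality test on an undirected adjacency dict."""
    order, weight = [], {v: 0 for v in adj}
    unvisited = set(adj)
    while unvisited:
        v = max(unvisited, key=lambda u: weight[u])
        unvisited.discard(v)
        order.append(v)
        for w in adj[v]:
            if w in unvisited:
                weight[w] += 1
    order.reverse()                  # candidate elimination order
    pos = {v: i for i, v in enumerate(order)}
    # Perfect elimination check: the later neighbors of each vertex,
    # minus the earliest of them (m), must all be adjacent to m.
    for v in order:
        later = [w for w in adj[v] if pos[w] > pos[v]]
        if later:
            m = min(later, key=lambda w: pos[w])
            if any(w != m and w not in adj[m] for w in later):
                return False
    return True

# A chordless 4-cycle is not chordal; adding a chord fixes it.
square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
print(is_chordal(square))            # False
square[1].add(3); square[3].add(1)
print(is_chordal(square))            # True
```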

Chordal Graphs

Properties:
  – There exists an elimination ordering that adds no edges.
  – The minimal induced tree-width of the graph is equal to the size of the largest clique − 1.

(Figure: elimination on chordal graphs over C, D, I, G, S, L, J, H.)


Triangulation

The process of converting a graph G into a chordal graph is called triangulation. A new graph obtained via triangulation is:
  1) Guaranteed to be chordal.
  2) Not guaranteed to be (tree-width) optimal.

  • There exist exact algorithms for finding minimal chordal graphs, and heuristic methods with a guaranteed upper bound.
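One common heuristic is min-fill: repeatedly eliminate the vertex whose elimination adds the fewest fill-in edges. This is my illustrative choice, not an algorithm named on the slide:

```python
from itertools import combinations

def triangulate_min_fill(adj):
    """Min-fill triangulation: chordal result, not necessarily optimal."""
    g = {v: set(ns) for v, ns in adj.items()}         # working copy
    chordal = {v: set(ns) for v, ns in adj.items()}   # result graph
    while g:
        def fill_cost(v):
            return sum(1 for a, b in combinations(g[v], 2) if b not in g[a])
        v = min(g, key=fill_cost)
        for a, b in combinations(list(g[v]), 2):      # make neighbors a clique
            if b not in g[a]:
                g[a].add(b); g[b].add(a)
                chordal[a].add(b); chordal[b].add(a)
        for w in g[v]:                                # remove v
            g[w].discard(v)
        del g[v]
    return chordal

# Chordless 4-cycle: a single fill-in edge makes it chordal.
square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
tri = triangulate_min_fill(square)
added = sum(len(ns) for ns in tri.values()) // 2 - 4   # minus original 4 edges
print(added)   # 1
```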

Chordal Graphs

  • Given a minimum triangulation for a graph G, we can carry out the variable-elimination algorithm in the minimum possible time.
  • Complexity of the optimal triangulation:
    – Finding the minimal triangulation is NP-hard.
  • The inference limit:
    – Inference time is exponential in the size of the largest clique (factor) in G.


Conversion of an MRF (BBN) to a clique tree

MRF conversion to clique trees:

Option 1:
  • Via triangulation, to form a chordal graph
  • Cliques in the chordal graph define the clique tree

Option 2:
  • From the induced graph built by running the variable elimination (VE) procedure
  • Cliques are defined by the factors generated during the VE procedure

BBN conversion:
  • Convert the BBN to an MRF (a moral graph)
  • Apply the MRF conversion

Conclusions on inference complexity

We cannot escape costs exponential in the tree-width of the graph.
  • Recall: tree-width = the width of the optimal tree decomposition (or the optimal clique tree)

Good news:
  • For many graphs the tree-width is much smaller than the total number of variables!

Still a problem: finding the optimal clique tree is hard (NP-hard)
  – But paying the cost up front may be worth it
  – Triangulate once, query many times
  – Real cost savings, even if not a bounded one