Probabilistic Graphical Models (Cmput 651): Undirected Graphical Models 1 (PDF document)


SLIDE 1

Cmput 651 – Probabilistic Graphical Models
17/10/08

Probabilistic Graphical Models (Cmput 651):
Undirected Graphical Models 1

Matthew Brown
17/10/2008

Space of Topics

  • Directed
  • Undirected
  • Semantics
  • Learning
  • Discrete
  • Continuous
  • Inference


SLIDE 2

What is an undirected model (or Markov net)?

Graph structure has undirected edges (understandably).

Some examples: [example graphs not reproduced]

Why use undirected models? (Misconception Example)

[Figure: undirected square over nodes A, B, C, D]

Bayesian networks cannot represent some distributions.

  • Misconception example from Koller‐Friedman (Fig 3.16)

This works (the undirected square):
  (A⊥C | B,D)  and  (B⊥D | A,C)

NO! This Bayes net implies:
  (A⊥C | B,D)  but  ¬(B⊥D | A,C)

NO! This Bayes net implies:
  ¬(A⊥C | B,D)  but  (B⊥D | A,C)


SLIDE 3

Why use undirected models? (Lab Dynamics Example)

Some distributions can be represented by both Bayes and Markov nets. Sometimes, the Markov net is more natural (namely, when there is no obvious directionality). (We’ll come back to Bayes vs. Markov nets later.)

[Figure: two graphs over nodes A–F, one undirected ("Two‐way communication") and one directed ("Top‐down communication"), with levels Professor, PhD Students, MSc Students, Undergrad]

Outline

  • What are Markov networks
  • Relating undirected graphs and PDFs
  • Beyond Markov networks
  • Bringing it all together: Tumour segmentation eg
  • Markov nets vs. Bayes nets


SLIDE 4

Parameterization: Recall Bayes net CPDs

[Figure: Bayes net over nodes A, B, C, D]

Recall: In Bayes nets, conditional probability distributions (CPDs) describe the relationship between nodes joined by a (directed) edge.

  a | P(B=1|A=a)
  0 | 0.6
  1 | 0.8

  b, d | P(C=1|B=b,D=d)
  0, 0 | 0.1
  0, 1 | 0.9
  1, 0 | 0.5
  1, 1 | 0.5

Parameterization: Factors

Factors describe the weighting between connected nodes.
  Factor values are always ≥ 0.
  They are not necessarily normalized.

  • PDFs, CPDs are special cases

KF Fig 4.1


SLIDE 5

Parameterization: Factors and Factorization

[Figure: undirected square over nodes A, B, C, D]

The probability distribution is derived by multiplying the factors and then normalizing the product:

  P(a,b,c,d) = (1/Z) φ1[a,b] ⋅ φ2[b,c] ⋅ φ3[c,d] ⋅ φ4[a,d]

  Z = Σ_{a,b,c,d} φ1[a,b] ⋅ φ2[b,c] ⋅ φ3[c,d] ⋅ φ4[a,d]

Any probability distribution that can be expressed as a normalized product of factors in this way is called a Gibbs distribution. (We’ll come back to this below; also see KF Definition 4.3.4.)

Parameterization: Example (slide 1/2)

[Figure: undirected square over nodes A, B, C, D, with factor tables from KF Fig 4.1]

What happens when you multiply over all the factors below? (answer on next slide)
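The normalized-product-of-factors definition above can be sketched numerically. The factor tables below are made up for illustration (they are not the lecture's KF Fig 4.1 values); the code multiplies the four pairwise factors of the A–B–C–D square and normalizes by Z.

```python
import itertools

# Hypothetical pairwise factor tables (values >= 0, not normalized),
# indexed by the pair of binary values they score.
phi1 = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}    # phi1[a, b]
phi2 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}  # phi2[b, c]
phi3 = {(0, 0): 1.0, (0, 1): 100.0, (1, 0): 100.0, (1, 1): 1.0}  # phi3[c, d]
phi4 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}  # phi4[a, d]

def unnormalized(a, b, c, d):
    """Product of the four factors for one joint assignment."""
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[a, d]

# Partition function Z: sum of the factor product over all 2^4 assignments.
Z = sum(unnormalized(*v) for v in itertools.product([0, 1], repeat=4))

def P(a, b, c, d):
    """Gibbs distribution: normalized product of factors."""
    return unnormalized(a, b, c, d) / Z

# The normalized values form a proper distribution.
total = sum(P(*v) for v in itertools.product([0, 1], repeat=4))
print(round(total, 10))  # 1.0
```

Brute-force enumeration like this is only feasible for a handful of variables; computing Z is the expensive part in general.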


SLIDE 6

Parameterization: Example

Product over factors

[Figure: table of the unnormalized factor product over A, B, C, D]

Parameterization: Factor products

Match up the shared variable assignments (B in this example):

  ϕ[A,B,C] = φ1[A,B] ⋅ φ2[B,C]

KF Fig 4.3 (also see Join from RG's slides: Variable elimination – slide 25)
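A minimal sketch of the factor product operation, assuming binary variables and made-up factor values (not the KF Fig 4.3 numbers): entries that agree on the shared variable B are matched up and multiplied.

```python
import itertools

# Hypothetical factors: phi1 over (A, B), phi2 over (B, C).
phi1 = {(0, 0): 0.5, (0, 1): 0.8, (1, 0): 0.1, (1, 1): 0.3}  # phi1[A, B]
phi2 = {(0, 0): 0.5, (0, 1): 0.7, (1, 0): 0.1, (1, 1): 0.2}  # phi2[B, C]

# Factor product: for each assignment to (A, B, C), multiply the entries of
# phi1 and phi2 that share the same value of B, giving a factor over (A, B, C).
psi = {
    (a, b, c): phi1[a, b] * phi2[b, c]
    for a, b, c in itertools.product([0, 1], repeat=3)
}

print(round(psi[0, 1, 1], 3))  # phi1[0,1] * phi2[1,1] = 0.8 * 0.2 = 0.16
```

This is exactly the Join operation from variable elimination: the result's scope is the union of the two input scopes.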


SLIDE 7

Parameterization: General thoughts

General factors and Markov nets:

  • Advantage: not normalized
    • Computations are easier; we don't have to normalize until the end
  • Disadvantage: not normalized
    • Harder to intuit how changes to a factor affect the whole PDF
    • Harder to train

Factorization of PDFs: Gibbs distributions

A PDF P(X1,...,Xn) is a Gibbs distribution if it factorizes thus:

  P(X1,…,Xn) = (1/Z) φ1[D1] ⋅ φ2[D2] ⋅ … ⋅ φm[Dm]

  Z = Σ_{X1,…,Xn} φ1[D1] ⋅ φ2[D2] ⋅ … ⋅ φm[Dm]

where D1, D2, etc. are (possibly overlapping) subsets of X1,...,Xn.
  Di is called the scope of factor φi.
  Z is the partition function and normalizes the factor product in the numerator.

(also see KF Definition 4.3.4)


SLIDE 8

Factorization of PDFs

A PDF P(X1,...,Xn) factorizes over a Markov net H if

  1. P(X1,...,Xn) is a Gibbs distribution:

     P(X1,…,Xn) = (1/Z) φ1[D1] ⋅ φ2[D2] ⋅ … ⋅ φm[Dm]

     Z = Σ_{X1,…,Xn} φ1[D1] ⋅ φ2[D2] ⋅ … ⋅ φm[Dm]

  and

  2. D1, D2, etc. are (maximal or non‐maximal) cliques of H

Recall:
  clique = a complete (fully‐connected) subgraph of H
  maximal clique = a clique that is not a subgraph of a larger clique

(K&F use the terms “clique” and “subclique” for what Russ and I (and the graphical modeling community) call “maximal clique” and “clique”.)

Cliques

[Figure: undirected graph over nodes A–G]

Maximal cliques:
  {A,E}
  {B,C,D,E}
  {D,E,F}
  {D,F,G}

Examples of cliques:
  {A}, {B}, {C}, etc. (i.e. single nodes)
  {B,C}, {B,D}, {B,E}, {E,D}, {F,G}, etc.
  {B,C,D}, {C,D,E}, etc.
  {D,E,F,G} is NOT a clique (no E‐G edge)
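The clique and maximal-clique definitions above can be checked by brute force on a graph this small. The edge list below is reconstructed from the slide's maximal-clique listing (every edge lies in some maximal clique), so treat it as an assumption about the figure.

```python
from itertools import combinations

# Undirected graph reconstructed from the listed maximal cliques.
edges = {frozenset(e) for e in [
    ("A", "E"), ("B", "C"), ("B", "D"), ("B", "E"), ("C", "D"), ("C", "E"),
    ("D", "E"), ("D", "F"), ("E", "F"), ("D", "G"), ("F", "G"),
]}
nodes = sorted({n for e in edges for n in e})

def is_clique(subset):
    """A clique is a fully-connected subgraph: every pair shares an edge."""
    return all(frozenset(p) in edges for p in combinations(subset, 2))

# Brute force over all node subsets (fine for 7 nodes: 2^7 subsets).
cliques = [frozenset(s)
           for r in range(1, len(nodes) + 1)
           for s in combinations(nodes, r)
           if is_clique(s)]

# Maximal clique = clique not contained in any larger clique.
maximal = [c for c in cliques if not any(c < d for d in cliques)]
print(sorted(sorted(c) for c in maximal))
# [['A', 'E'], ['B', 'C', 'D', 'E'], ['D', 'E', 'F'], ['D', 'F', 'G']]
```

Note that {D,E,F,G} is rejected by `is_clique` because the E‐G edge is missing, matching the slide.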


SLIDE 9

Factorization of PDFs: example

[Figure: undirected graph over nodes A, B, C, D, E]

  P(A,B,C,D,E) = (1/Z) φ1[A,B] ⋅ φ2[B,C,D] ⋅ φ3[B,D,E] ⋅
                 φ4[B,C] ⋅ φ5[B,D] ⋅ φ6[C,D] ⋅ φ7[B,E] ⋅ φ8[D,E] ⋅
                 φ9[A] ⋅ φ10[B] ⋅ φ11[C] ⋅ φ12[D] ⋅ φ13[E]

NOTE: There is more than one way to define the factors. For example, one can use only the maximal cliques (because the maximal clique factors can subsume the (sub)clique factors):

  P(A,B,C,D,E) = (1/Z) ψ1[A,B] ⋅ ψ2[B,C,D] ⋅ ψ3[B,D,E]

Independence: global Markov assumption

[Figure: undirected graph over nodes A–G; fill denotes conditioning]

Active path: a path from X to Y with no conditioned nodes on it.

Global Markov assumption: if there is no active path from X to Y after conditioning on some set of nodes Z, then X and Y are independent given Z.

  (A⊥B | E)   (A⊥{B,C,D,F,G} | E)   ¬(B⊥G | E)

Also see KF section 4.3.1
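The global Markov assumption is just graph reachability after deleting the conditioned nodes. A minimal sketch on the same A–G example graph (adjacency reconstructed from the slide's maximal cliques, so an assumption about the figure):

```python
from collections import deque

# Adjacency list for the example graph.
adj = {
    "A": {"E"}, "B": {"C", "D", "E"}, "C": {"B", "D", "E"},
    "D": {"B", "C", "E", "F", "G"}, "E": {"A", "B", "C", "D", "F"},
    "F": {"D", "E", "G"}, "G": {"D", "F"},
}

def separated(X, Y, Z):
    """True iff there is no active path from X to Y once the conditioned
    nodes Z are removed from the graph (global Markov assumption)."""
    X, Y, Z = set(X), set(Y), set(Z)
    seen, queue = set(X), deque(X)
    while queue:                      # BFS that never enters conditioned nodes
        u = queue.popleft()
        if u in Y:
            return False              # found an active path
        for v in adj[u]:
            if v not in Z and v not in seen:
                seen.add(v)
                queue.append(v)
    return True

print(separated({"A"}, {"B"}, {"E"}))              # True:  (A ⊥ B | E)
print(separated({"B"}, {"G"}, {"E"}))              # False: ¬(B ⊥ G | E)
print(separated({"B", "C"}, {"F", "G"}, {"D", "E"}))  # True
```

The three calls reproduce the independence statements on this and the next slide.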


SLIDE 10

Independence: global Markov assumption

[Figure: undirected graph over nodes A–G; fill denotes conditioning]

More examples:

  ({B,C}⊥{F,G} | D,E)   (A⊥{B,C,F,G} | D,E)

Independence: global Markov assumption

With no conditioning, there are no independencies among any of the nodes, eg:

  ¬(A⊥{B,C,D,E,F,G} | {})


SLIDE 11

Independence: global Markov assumption

Without conditioning, only non‐connected graphs have independencies:

  ({H,I,J,K}⊥{A,B,C,D,E,F,G} | {})

[Figure: the A–G graph plus a separate connected component over H, I, J, K]

Global Markov independence is monotonic

[Figure: graph over nodes A–J, with F and G conditioned]

After conditioning on F and G, {A,B} is independent of {H,I,J}. Adding more nodes to the conditioned set does NOT change this.

Monotonicity: Adding more nodes to the conditioned set (cut set) never destroys an independence relation in a Markov net.


SLIDE 12

Bayes net independence is NOT monotonic

[Figure: Bayes net over nodes A, B, C, D, E]

After conditioning on B, A and C are independent.

But if we also condition on E, then A and C are NO LONGER independent.

Markov networks do NOT possess Bayes networks’ ability to handle intercausal reasoning like this.

Outline

  • What are Markov networks
  • Relating undirected graphs and PDFs
  • Beyond Markov networks
  • Bringing it all together: Tumour segmentation eg
  • Markov nets vs. Bayes nets


SLIDE 13

Combining Markov nets and PDFs: soundness

Let P(X1,...,Xn) be a positive (>0) distribution. Then:
  P factorizes over a Markov network H
  if and only if
  H is an I‐map of P.

(proofs in KF Section 4.3.1.1)

Notes:

  • The => direction does not require the positivity assumption
  • The <= direction is called the Hammersley‐Clifford theorem (and does require positivity)

Combining Markov nets and PDFs: soundness

[Figure: complete graph over nodes A–F]

Recall: the complete graph is an I‐map for every distribution.


SLIDE 14

Combining Markov nets and PDFs: soundness

[Figure: undirected graph over nodes A–F]

The soundness theorem guarantees that:
  two nodes (eg: A and F) are not connected in the graph structure
  if and only if
  A and F can be made independent by conditioning on some cut set (eg: {B,C,D,E})

Combining Markov nets and PDFs: completeness

Ideally, we would like to have the following:

Theorem: Let H be a Markov net and P a distribution that factorizes over H. Then I(P) = I(H).

(That is, the independencies in P are precisely those implied by the graph structure H.)

BUT:
  1) this theorem is NOT true in general.
  2) this theorem IS true for all distributions EXCEPT for a set of measure zero (in the parameterization space)

(also see KF Section 4.3.1.2)


SLIDE 15

Combining Markov nets and PDFs: completeness

If we want a theorem that ALWAYS holds, we must weaken our requirements:

Theorem: Let H be a Markov net and let X, Y, Z be disjoint sets of nodes in H. If X and Y are not separated given Z, then X and Y are dependent given Z in some distribution P that factorizes over H.

(also see KF Section 4.3.1.2)

Independencies in Markov nets

  • Given a Markov net structure, how do we extract independencies among the nodes as represented by that Markov net?
  • Three different definitions of independence that are all equivalent (for positive distributions, i.e. P>0)


SLIDE 16

Three (equivalent) definitions of independence

Def’n 1 – Global Markov independence (seen already):

Let H be a Markov net and X, Y, Z disjoint sets of nodes in H. If Z separates X and Y, then

  (X⊥Y | Z)

(also see KF Definition 4.3.2)

[Figure: undirected graph over nodes A–G]

  ({B,C}⊥{F,G} | D,E)   (A⊥{B,C,F,G} | D,E)

Three (equivalent) definitions of independence

Def’n 2 – Pairwise Markov independence:

Let H be a Markov net and X and Y two nodes in H that are NOT connected. Then X and Y are independent given ALL the other nodes in H.

(also see KF Definition 4.3.5)

[Figure: undirected graph over nodes A–G]

  (A⊥B | C,D,E,F,G)


SLIDE 17

Three (equivalent) definitions of independence

Def’n 3 – Local Markov independence:

Let H be a Markov net. Then a node X in H is independent of all its NON‐neighbours given its neighbours. (X’s neighbours share an edge with it.)

(also see KF Definition 4.3.10)

[Figure: undirected graph over nodes A–G]

B’s neighbours are {C,D,E}, so:

  (B⊥{A,F,G} | C,D,E)

Global <=> Local <=> Pairwise

Theorem: global => local => pairwise
That is, for any Markov net H and distribution P:
  P satisfies the global Markov independencies in H
  implies P satisfies the local Markov independencies in H
  implies P satisfies the pairwise Markov independencies in H.

Theorem: The reverse direction (pairwise => local => global) holds for positive distributions only.

(also see KF Definition 4.3.10)


SLIDE 18

What’s special about positive distributions?

Suppose A=B, D=E, C=B∨D.
This is a non‐positive distribution: P(A=0,B=0,C=1,D=0,E=0)=0.
(We can also view the graph as a Bayes net.)

[Figure: graph over nodes A, B, C, D, E]

Local Markov independence works:
  (B⊥{D,E} | A,C)

Global Markov independence does not:
  ¬(A⊥E | C)

i.e. for non‐positive PDFs, local independence does not imply global.

(also see KF Example 4.3.15)

What’s special about positive distributions?

Suppose A=B, D=E, C=B∨D (the same non‐positive distribution as above).

Note: deterministic relationships (like in this example) are one way to create non‐positive distributions. That is, deterministic relationships are problematic for Markov networks (unlike Bayes nets).

[Figure: graph over nodes A, B, C, D, E]


SLIDE 19

What’s special about positive distributions?

eg: distribution P: A=B=C, with A uniformly distributed

The description in the text is a bit confusing. Shouldn’t there be a link between B and C? Answer: you must look at this in terms of I‐maps based on specific definitions of independence.

That graph structure is a valid I‐map for P in terms of pairwise independence but not in terms of local Markov independence, eg:

  (A⊥C | B) holds, but ¬(C⊥{A,B} | {}), as local Markov independence requires.

For non‐positive distributions, pairwise independence does not imply local Markov independence.

[Figure: graph over nodes A, B, C]

(also see KF Example 4.3.16)

Distributions to graphs

Task: Given a positive distribution P, construct a Markov net H that is an I‐map of P.

(also see KF Section 4.3.3)

Approach 1 – use pairwise Markov independence:

Add an edge between nodes X and Y only when

  ¬(X⊥Y | all_nodes − {X,Y})


SLIDE 20

Distributions to graphs

[Figure: undirected graph over nodes A–G]

Approach 1 – use pairwise Markov independence

Example: testing for edge A‐B:
  Given all the other nodes (C–G), is A independent of B?
  Yes -> no edge
  No -> add edge A‐B

Distributions to graphs

Task: Given a positive distribution P, construct a Markov net H that is an I‐map of P.

Approach 2 – use local Markov independence:

For a given node X, define its neighbours as the minimal set of nodes N_X which, when conditioned upon, renders X independent of all the other nodes. Then add an edge between X and each of its neighbours.

NB: The neighbours of X are also called its Markov blanket (when we’re talking about Markov nets, not Bayes nets).


SLIDE 21

Distributions to graphs

[Figure: undirected graph over nodes A–G]

Approach 2 – use local Markov independence

Example: consider node A:
  Find its Markov blanket and let these nodes be its neighbours (i.e. connected to A)

Distributions to graphs

Theorem: Let P be a positive distribution. If H is a Markov net built based on P using either
  i) the pairwise Markov independence method or
  ii) the local Markov independence method,
then H is the unique, minimal I‐map for P.

(also see KF Theorems 4.3.18, 4.3.19)

NOTE: This does not work if P is not positive (see KF text for examples).


SLIDE 22

Outline

  • What are Markov networks
  • Relating undirected graphs and PDFs
  • Beyond Markov networks
  • Bringing it all together: Tumour segmentation eg
  • Markov nets vs. Bayes nets

Factor graphs (KF Section 4.4.1.1)

[Figure: undirected graph over nodes A, B, C, D, E]

  P(A,B,C,D,E) = (1/Z) φ1[A,B] ⋅ φ2[B,C,D] ⋅ φ3[B,D,E] ⋅
                 φ4[B,C] ⋅ φ5[B,D] ⋅ φ6[C,D] ⋅ φ7[B,E] ⋅ φ8[D,E] ⋅
                 φ9[A] ⋅ φ10[B] ⋅ φ11[C] ⋅ φ12[D] ⋅ φ13[E]

OR we can use

  P(A,B,C,D,E) = (1/Z) ψ1[A,B] ⋅ ψ2[B,C,D] ⋅ ψ3[B,D,E]

Recall: the Hammersley‐Clifford theorem says that if a Markov net H is an I‐map for a distribution P, then P factorizes according to H’s cliques.

BUT, we have some choice of which cliques to include, because some clique potentials can be expressed as products of other clique potentials.


SLIDE 23

Factor graphs (KF Section 4.4.1.1)

[Figure: factor graph with a variable node for variable C and a factor node for factor Φ2(B,C)]

Factor graphs make factorization explicit.

Factor nodes connect only to variable nodes, and vice versa (i.e. the graph is bipartite).

Factor graphs

[Figure: two factor graphs over nodes A, B, C]

  P(A,B,C) = ϕ(A,B,C)
  i.e. one factor over the maximal clique

  P(A,B,C) = ϕ1(A,B) × ϕ2(B,C) × ϕ3(A,C)
  i.e. one factor for each pair‐wise clique

Both factor graphs induce the same Markov net.


SLIDE 24

Factor graphs

Definition: A factor graph is an undirected graph with two types of nodes:
  i) variable nodes (ovals)
  ii) factor nodes (rectangles)

Variable nodes connect only to factor nodes and vice‐versa. Each factor node represents a single factor whose scope (i.e. arguments) includes precisely those variables which are represented by the factor’s neighbours (i.e. the variable nodes to which the factor node is connected).

Log‐linear models (KF Section 4.4.1.2)

We want a finer‐grained way to factorize a PDF.

Definition: A distribution P is a log‐linear model over a Markov net H if, given
  1. a set of features {ψ1[D1] ... ψk[Dk]}, Di being a clique in H
  2. weights w1 ... wk

  P(X1,...,Xn) = (1/Z) exp( − Σ_i wi ψi(Di) )


SLIDE 25

Log‐linear models (KF Section 4.4.1.2)

  P(X1,...,Xn) = (1/Z) Π_i φi(Di)
               = (1/Z) Π_i exp( −(−ln φi(Di)) )
               = (1/Z) exp( − Σ_i −ln φi(Di) )
               = (1/Z) exp( − Σ_i εi(Di) )

  where εi(Di) = −ln(φi(Di))

εi is called an energy function (from statistical physics).

Log‐linear models (KF Section 4.4.1.2)

  P(X1,...,Xn) = (1/Z) exp( − Σ_i wi ψi(Di) )

Features ψi can be just about anything:
  eg: raw data, indicator functions, filters

Training is done on the weights wi:
  eg: gradient descent
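The identity φi(Di) = exp(−εi(Di)) underlying this rewrite can be checked numerically. A sketch with made-up positive factors (positivity is required, since εi = −ln φi is undefined at zero):

```python
import itertools
import math

# Hypothetical positive factors over cliques of binary variables.
phi1 = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}  # phi1[A, B]
phi2 = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}  # phi2[B, C]

# Energy functions: eps_i = -ln(phi_i), so phi_i = exp(-eps_i).
eps1 = {k: -math.log(v) for k, v in phi1.items()}
eps2 = {k: -math.log(v) for k, v in phi2.items()}

for a, b, c in itertools.product([0, 1], repeat=3):
    product = phi1[a, b] * phi2[b, c]                # product of factors
    energy = eps1[a, b] + eps2[b, c]                 # sum of energies
    assert abs(product - math.exp(-energy)) < 1e-12  # exp(-sum) == product
print("product of factors == exp(-(sum of energies))")
```

Low energy corresponds to high factor value, which is why energy minimization and probability maximization coincide.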


SLIDE 26

Outline

  • What are Markov networks
  • Relating undirected graphs and PDFs
  • Beyond Markov networks
  • Bringing it all together: Tumour segmentation eg
  • Markov nets vs. Bayes nets

Tumour segmentation example: Problem statement

[Figure: T1 MR image (w/ Gad enhancement), T2 MR image, and outline of T2 abnormality (from human expert)]

Task: automatically classify pixels as normal or abnormal (i.e. reproduce the red outline in the rightmost image).


SLIDE 27

Tumour segmentation example: Problem statement

We’ll be using a pixel‐wise approach, with pixels indexed by i (which takes values from 1 to N).

Tumour segmentation example: Problem statement

Probabilistic formulation:
  Let: i index pixel location [1...N]
  Yi = label at pixel i (0 or 1 for normal/abnormal)
  Xi = data vector for pixel i
    (Xi = [Xi,1 ... Xi,K], where Xi,m = value of feature m at pixel i)

We want to find values for Y1...YN that maximize P(Y1...YN | X1...XN).

[Figure: label node Yi connected to feature node Xi,m]

Note: In the literature, the term “feature vector” is used instead of “data vector”. I’ve changed terminology to avoid conflict with how the text uses the term “feature”.


SLIDE 28

Tumour segmentation: Markov Random Fields

Markov Random Field approach:

Now we just need to represent P(Y1...YN, X1...XN) with a graphical model and then use some inference method (eg: variable elimination) to compute P(Y1...YN, X1...XN) and the marginal P(X1...XN):

  P(Y1...YN | X1...XN) = P(Y1...YN, X1...XN) / P(X1...XN)

Tumour segmentation: Markov Random Fields

[Figure: T1 MR image (w/ Gad enhancement), T2 MR image, and outline of T2 abnormality (from human expert)]

Building the full joint distribution, P(Y1...YN, X1...XN), is hard:
  eg: 3 images at 512×512 = 786432 nodes
    -> over 300 billion edges in the complete graph

Invariably, we are forced to simplify (see next slide).


SLIDE 29

Tumour segmentation: Markov Random Fields

[Figure: grid‐network MRF with label nodes (white circles) and feature nodes (blue circles)]

We represent the label image (all the Yi’s) with a grid network of label nodes (white circles). Each label node represents one pixel’s label (normal or abnormal). Each label node is also connected to its pixel’s feature nodes (blue circles, only some of which are depicted above). Feature nodes represent Xi,m. (Also see Case Study 4.4 in the KF text.)

Tumour segmentation: Markov Random Fields

Independence Assumptions:

  • 1. A pixel’s label is independent of all other labels given the pixel’s face neighbours’ labels
  • 2. Conditional on the labels, a pixel’s data are
    i) independent of each other (because there are no edges among the blue nodes connected to a given white node)
    ii) independent of other pixels’ data (because there are no edges among the blue nodes connected to different white nodes)

(NB: ALL of these assumptions are somewhat wrong)


SLIDE 30

Tumour segmentation: Markov Random Fields

[Figure: T1 MR image (w/ Gad enhancement), T2 MR image, and outline of T2 abnormality (from human expert)]

Clearly, the label image (rightmost) and the two data images on the left have structure beyond the independence assumptions we make using a Markov Random Field.

Tumour segmentation: Markov Random Fields

  P(Y1,...,YN, X1,...,XN) = (1/Z) exp( − Σ_{i=1...N} Σ_{m=1...K} wm ψ(Xi,m, Yi) − λ Σ_{i=1...N} Σ_{j∈Nbrsi} φ(Yi, Yj) )

We have N pixels. Each pixel i has neighbouring pixels denoted Nbrsi.

Node potentials:
  ψ(Xi,m, Yi) returns the value of datum m for pixel i (eg: intensity of the T1 or T2 MR image) times Yi (i.e. if label = 0, the return value is 0).

Edge potentials:
  φ(Yi, Yj) returns 1 if Yi and Yj are different, and 0 otherwise. (Also see Metric MRFs – Concept 4.3 of the KF text.)

The w’s and λ are trainable weights.
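A toy sketch of this energy function, with a made-up 2×2 "image", a single feature per pixel (K = 1), and hypothetical weights; none of these numbers come from the lecture. Brute-force enumeration of all labelings makes the partition function exact, which is only feasible at toy scale.

```python
import itertools
import math

# Toy 2x2 image: one data value per pixel (K = 1 feature); values made up.
X = {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.1, (1, 1): 0.2}
w, lam = 2.0, 1.0  # hypothetical trainable weights

pixels = list(X)

def nbrs(p):
    """4-connected (face) neighbours inside the grid."""
    i, j = p
    return [q for q in [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)] if q in X]

def unnormalized(Y):
    """exp(-(sum of node potentials) - lambda * (sum of edge potentials))."""
    node = sum(w * X[p] * Y[p] for p in pixels)  # psi(X, Y) = datum * label
    edge = sum(Y[p] != Y[q] for p in pixels for q in nbrs(p)) / 2  # each edge once
    return math.exp(-node - lam * edge)

# Brute-force partition function over all 2^4 labelings.
labelings = [dict(zip(pixels, ys))
             for ys in itertools.product([0, 1], repeat=len(pixels))]
Z = sum(unnormalized(Y) for Y in labelings)

probs = {tuple(Y.values()): unnormalized(Y) / Z for Y in labelings}
print(round(sum(probs.values()), 10))  # 1.0
```

With these particular (positive) weights the all-zero labeling has the lowest energy; real models learn w and λ so that the data drives the labels.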


SLIDE 31

Tumour segmentation: Markov Random Fields

Can rewrite the equation:

  P(Y1,...,YN, X1,1,...,XN,K) = (1/Z) exp( − Σ_{i=1...N} Σ_{m=1...K} wm ψ(Xi,m, Yi) ) ⋅ exp( − λ Σ_{i=1...N} Σ_{j∈Nbrsi} φ(Yi, Yj) )
                                     [node potentials]                              [edge potentials]

The node potential term is just a logistic regression on the dot product of the weight vector (w’s) and the data vector (Xi) (see next slide).

The edge potential term penalizes neighbouring nodes’ having different labels (i.e. smoothes the label image).

Tumour segmentation: Markov Random Fields

Proof that the node potentials simply implement logistic regression. Consider the node potential term on its own; since ψ(Xi,m, Yi) = Xi,m ⋅ Yi, the unnormalized score for pixel i is exp(−w⋅Xi) when Yi = 1 and exp(−w⋅0) = 1 when Yi = 0. Normalizing over Yi ∈ {0,1}:

  P(Yi=1 | Xi) = exp(−w⋅Xi) / ( exp(−w⋅Xi) + exp(−w⋅0) )
               = exp(−w⋅Xi) / ( exp(−w⋅Xi) + 1 )

the formula for a sigmoid.
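The per-pixel normalization above can be verified numerically. A sketch with hypothetical weight and data vectors (any values would do):

```python
import math

# Hypothetical weight and data vectors for one pixel.
w = [0.5, -1.2, 2.0]
x = [1.0, 0.3, 0.4]

dot = sum(wi * xi for wi, xi in zip(w, x))  # w . x

# Node potential scores: psi(x, y) = x * y, so the unnormalized score is
# exp(-w.x) for y = 1 and exp(-w.0) = 1 for y = 0.
score = {1: math.exp(-dot), 0: 1.0}
p_y1 = score[1] / (score[1] + score[0])  # normalize over y in {0, 1}

# The same value via the sigmoid (logistic) formula.
sigmoid = 1.0 / (1.0 + math.exp(dot))
assert abs(p_y1 - sigmoid) < 1e-12
print(round(p_y1, 4))
```

So per pixel, ignoring the edge potentials, the model reduces to logistic regression on the data vector.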


SLIDE 32

Tumour segmentation: Markov Random Fields

  P(Y1,...,YN, X1,...,XN) = (1/Z) exp( − Σ_{i=1...N} Σ_{m=1...K} wm ψ(Xi,m, Yi) − λ Σ_{i=1...N} Σ_{j∈Nbrsi} φ(Yi, Yj) )

Summary: Once we have P(Y1...YN, X1...XN), we solve

  argmax_{Y1...YN} P(Y1...YN | X1...XN) = argmax_{Y1...YN} P(Y1...YN, X1...XN) / P(X1...XN)

Tumour segmentation: Markov Random Fields

Lastly: I didn’t talk about inference or training with this model. We’ll cover these topics in the next couple of lectures.


SLIDE 33

Tumour segmentation: Conditional Random Fields

Markov Random Fields

  • model the full joint distribution (generative approach)
  • make strong assumptions about independence relations among the data variables

  P(Y1...YN | X1...XN) = P(Y1...YN, X1...XN) / P(X1...XN)

Conditional Random Fields

  • model the conditional distribution (discriminative approach)
  • can safely ignore independence relations among the data variables (i.e. ignore red edges)

  P(Y1...YN | X1...XN)

Tumour segmentation: Conditional Random Fields

How does a Conditional Random Field work? It uses a partition function Z(X1...XN) that is a function of the data vectors (X1...XN). That is, Z(X1...XN) must be recomputed for every assignment to the Xi’s, which makes training a Conditional Random Field much more computationally expensive than training a Markov Random Field.

  P(Y1,...,YN | X1,...,XN) = (1/Z(X1,...,XN)) exp( − Σ_{i=1...N} Σ_{m=1...K} wm ψ(Xi,m, Yi) − λ Σ_{i=1...N} Σ_{j∈Nbrsi} φ(Yi, Yj) )


SLIDE 34

Generative vs. Discriminative

Suppose you want to know Y given some data X.

Generative:
  requires the full joint distribution P(Y,X); then we have:

  P(Y | X) = P(Y,X) / P(X)   or   P(Y | X) = P(X | Y) P(Y) / P(X)

Discriminative:
  estimates P(Y|X) directly
  don’t bother understanding the full joint distribution

Generative vs. Discriminative

Generative is more flexible and adaptable IF you correctly specify the full joint distribution P(Y,X).

Discriminative is more accurate if you cannot specify the full joint distribution with high fidelity.

SLIDE 35

Generative vs. Discriminative

Example:
P(Y,X), Y = patient has meningitis, X = patient has a fever

Discriminative approach: estimate P(meningitis | fever) from the hospital‐visiting population over 10 years, say. If the underlying distribution P(Y,X) changes (eg: because it’s flu season and people are coming in with flu‐caused fevers), then you must re‐estimate P(Y|X) for the new distribution.

Generative vs. Discriminative

Example:
P(Y,X), Y = patient has meningitis, X = patient has a fever

Generative approach: estimate

  P(Y | X) = P(X | Y) P(Y) / P(X)

During flu season, simply update P(X) and re‐estimate P(Y|X).

BUT you had better have P(X|Y) right (which is often hard to do for complex systems, generally) or you’ll get wrong answers.
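The generative update route can be sketched with hypothetical numbers (the probabilities below are invented for illustration, not clinical figures): only the fever rate in the non-meningitis population changes during flu season, and the posterior is re-derived without retraining P(X|Y).

```python
# Hypothetical numbers for the meningitis/fever example.
p_men = 0.0001       # P(Y=1): prior probability of meningitis
p_fever_men = 0.9    # P(X=1 | Y=1): fever rate given meningitis
p_fever_no = 0.05    # P(X=1 | Y=0): baseline fever rate otherwise

def posterior(p_y, p_x_given_y, p_x_given_not_y):
    """Generative route: P(Y|X) = P(X|Y) P(Y) / P(X)."""
    p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)  # marginal P(X=1)
    return p_x_given_y * p_y / p_x

print(round(posterior(p_men, p_fever_men, p_fever_no), 5))

# Flu season: only P(X=1 | Y=0) rises; re-derive P(Y|X) by updating P(X),
# reusing the same P(X|Y=1) and P(Y).
print(round(posterior(p_men, p_fever_men, 0.30), 5))
```

The flu-season posterior drops, as expected: fevers become weaker evidence for meningitis when the background fever rate rises.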