SLIDE 1

Introduction Dual Decomposition Experimental Results Conclusions

Dual Decomposition for Marginal Inference

Justin Domke

Rochester Institute of Technology

AAAI 2011

SLIDE 2

Outline

Introduction: Graphical Models, Motivation, Wait a Second
Dual Decomposition: Dual Decomposition in General, Dual Decomposition for Marginal Inference
Experimental Results: Experiments
Conclusions: Conclusions

SLIDE 4

Graphical Models

Markov Random Field / Factor Graph:

$$p(x) \propto \prod_c \psi(x_c)$$

SLIDE 5

Graphical Models

$c_1 = \{1,2,3\}$, $c_2 = \{3,4\}$, $c_3 = \{4,5,6\}$

$$p(x) \propto \prod_c \psi(x_c) = \psi(x_1,x_2,x_3)\,\psi(x_3,x_4)\,\psi(x_4,x_5,x_6)$$
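The factorization over the three cliques can be written out directly. The following is a minimal sketch, assuming binary variables and a made-up potential `psi` that rewards agreement inside a clique; neither choice comes from the slides, and indices are shifted to start at 0.

```python
from itertools import product

# Cliques from the slide, shifted to 0-based variable indices.
cliques = [(0, 1, 2), (2, 3), (3, 4, 5)]

def psi(values):
    """Hypothetical potential: 2 if all clique variables agree, else 1."""
    return 2.0 if len(set(values)) == 1 else 1.0

def p_hat(x):
    """Unnormalized probability: one potential per clique, multiplied together."""
    result = 1.0
    for c in cliques:
        result *= psi(tuple(x[i] for i in c))
    return result

# Normalizing constant by brute force over all 2^6 binary assignments.
Z = sum(p_hat(x) for x in product([0, 1], repeat=6))

def p(x):
    """Normalized probability of a full assignment x."""
    return p_hat(x) / Z
```

Enumerating all assignments is only workable here because there are six binary variables; the cost doubles with each added variable.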

SLIDE 6

Marginal Inference

Want to recover $p(X_i = x_i)$.
Brute-force sum: Define $\hat p(x) = \prod_c \psi(x_c)$. Then

$$p(X_i = x_i) = \frac{1}{Z} \sum_{x_1} \cdots \sum_{x_{i-1}} \sum_{x_{i+1}} \cdots \sum_{x_M} \hat p(x), \qquad Z = \sum_{x_1} \cdots \sum_{x_M} \hat p(x)$$

On trees, can do sums quickly by dynamic programming (sum-product algorithm / belief propagation).
In general, #P-hard.
Approximate: Tree-reweighted belief propagation (TRW)
This paper: Same approximation as TRW, different algorithm.
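To make the contrast between the brute-force sum and dynamic programming concrete, here is a small sketch on a chain (the simplest tree). The chain length and the potentials `unary` and `pair` are hypothetical choices for illustration, not values from the talk.

```python
from itertools import product

M = 4                 # hypothetical chain x0 - x1 - x2 - x3 of binary variables
states = [0, 1]

def unary(i, xi):
    """Hypothetical unary potential."""
    return 1.0 + 0.5 * i if xi == 1 else 1.0

def pair(xi, xj):
    """Hypothetical pairwise potential favouring equal neighbours."""
    return 2.0 if xi == xj else 1.0

def p_hat(x):
    """Unnormalized probability of a full assignment."""
    val = 1.0
    for i in range(M):
        val *= unary(i, x[i])
    for i in range(M - 1):
        val *= pair(x[i], x[i + 1])
    return val

def brute_force_marginal(i):
    """Sum p_hat over all assignments: cost grows as 2^M."""
    totals = [0.0, 0.0]
    for x in product(states, repeat=M):
        totals[x[i]] += p_hat(x)
    Z = sum(totals)
    return [t / Z for t in totals]

def sum_product_marginal(i):
    """Forward/backward messages on the chain: cost grows linearly in M."""
    fwd = [[1.0, 1.0] for _ in range(M)]
    for k in range(1, M):
        for s in states:
            fwd[k][s] = sum(fwd[k - 1][t] * unary(k - 1, t) * pair(t, s)
                            for t in states)
    bwd = [[1.0, 1.0] for _ in range(M)]
    for k in range(M - 2, -1, -1):
        for s in states:
            bwd[k][s] = sum(bwd[k + 1][t] * unary(k + 1, t) * pair(s, t)
                            for t in states)
    belief = [fwd[i][s] * unary(i, s) * bwd[i][s] for s in states]
    Z = sum(belief)
    return [b / Z for b in belief]
```

Both functions return the same marginals on this chain; only the second avoids enumerating every joint assignment.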

SLIDE 11

Motivation

TRW convergence rates can be very slow.
If lucky, TRW = block coordinate ascent on dual.
TRW may fail to converge.
Damping converges in practice, but more slowly.
Recent alternatives guarantee convergence [Hazan & Shashua 2009, Globerson & Jaakkola 2007b].
Not claimed faster than TRW. TRW-S [Meltzer et al. 2009] is an exception.
This paper: use a quasi-Newton method on dual.
Line searches guarantee convergence.
Hopefully, faster convergence.

SLIDE 14

Ising Model

$x_i \in \{-1,+1\}$

$$p(x) \propto \prod_{ij} \exp\big(\theta(x_i,x_j)\big) \prod_i \exp\big(\theta(x_i)\big)$$

$\theta(x_i) = \alpha_F x_i$, $\alpha_F \in [-1,+1]$
$\theta(x_i,x_j) = \alpha_I x_i x_j$, $\alpha_I \in [0,T]$ for various $T$
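A model of this form could be set up as follows. This is a sketch under assumptions the slide does not state: the graph is taken to be an n x n grid, and the field and interaction strengths are drawn uniformly from their intervals. The function names are illustrative.

```python
import random

def make_ising_grid(n, T, seed=0):
    """Sample parameters for an n x n grid Ising model (grid and uniform draws assumed)."""
    rng = random.Random(seed)
    nodes = [(r, c) for r in range(n) for c in range(n)]
    edges = []
    for r in range(n):
        for c in range(n):
            if r + 1 < n:
                edges.append(((r, c), (r + 1, c)))
            if c + 1 < n:
                edges.append(((r, c), (r, c + 1)))
    alpha_F = {i: rng.uniform(-1.0, 1.0) for i in nodes}   # field strengths
    alpha_I = {e: rng.uniform(0.0, T) for e in edges}      # interaction strengths
    return nodes, edges, alpha_F, alpha_I

def log_p_hat(x, alpha_F, alpha_I):
    """Unnormalized log-probability: sum of theta(x_i) and theta(x_i, x_j) terms."""
    field = sum(a * x[i] for i, a in alpha_F.items())
    inter = sum(a * x[i] * x[j] for (i, j), a in alpha_I.items())
    return field + inter
```

Here `x` is a dict mapping each node to a spin in {-1, +1}. Larger `T` allows stronger couplings between neighbouring spins.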

SLIDE 15

$\theta(x_i,x_j) = \alpha_I x_i x_j$, $\alpha_I \in [0,1]$

[Figure: error |µ−µ*|∞ (log scale, 10^−6 to 10^0) against iterations (up to 100); curves: TRW, dual decomposition.]

SLIDE 18

$\theta(x_i,x_j) = \alpha_I x_i x_j$, $\alpha_I \in [0,3]$

[Figure: error |µ−µ*|∞ (log scale, 10^−6 to 10^0) against iterations (up to 10000); curves: TRW, dual decomposition.]

SLIDE 21

$\theta(x_i,x_j) = \alpha_I x_i x_j$, $\alpha_I \in [0,5]$

[Figure: error |µ−µ*|∞ (log scale, 10^−6 to 10^0) against iterations (up to 10000); curves: TRW, dual decomposition.]

SLIDE 25

Wait a Second

Question: Why should I care about very accurately computing approximate marginals!?
Answer: You might not.
One reason to care: Number of iterations TRW needs for reasonable results is not easy to predict.

SLIDE 27

Why I Care

Want to fit a CRF with some loss $L(\theta) = M(\mu(\theta))$. Algorithm (Domke, 2010):

1. Get $\mu$ by running TRW with parameters $\theta$.
2. Compute $\frac{dM(\mu)}{d\mu}$.
3. Get $\mu^+$ by running TRW with parameters $\theta + r \frac{dM}{d\mu}$.
4. $\frac{dL}{d\theta} \approx \frac{1}{r} (\mu^+ - \mu)$.

Strong convergence needed for difference $\mu^+ - \mu$ to be meaningful.
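The four steps can be tried on a toy model. In the sketch below, exact inference by enumeration stands in for TRW, and the three-variable model, the features, the target vector, and the squared-error loss are all hypothetical. With exact marginals the Jacobian of µ with respect to θ is a symmetric covariance matrix, which is why the perturbation estimate can be checked against an ordinary finite-difference gradient of the loss.

```python
import math
from itertools import product

def features(x):
    """Hypothetical features: three node indicators and two pairwise products."""
    return [x[0], x[1], x[2], x[0] * x[1], x[1] * x[2]]

def marginals(theta):
    """Exact mu(theta) = E[f(x)] by enumeration (stand-in for running TRW)."""
    configs = list(product([0, 1], repeat=3))
    weights = [math.exp(sum(t * f for t, f in zip(theta, features(x))))
               for x in configs]
    Z = sum(weights)
    mu = [0.0] * len(theta)
    for w, x in zip(weights, configs):
        for k, f in enumerate(features(x)):
            mu[k] += w * f / Z
    return mu

target = [0.9, 0.1, 0.5, 0.05, 0.05]   # hypothetical target marginals

def M(mu):
    """Hypothetical loss on marginals: half squared error to the target."""
    return 0.5 * sum((m - t) ** 2 for m, t in zip(mu, target))

def dM_dmu(mu):
    return [m - t for m, t in zip(mu, target)]

def perturbation_gradient(theta, r=1e-5):
    mu = marginals(theta)                                        # step 1
    g = dM_dmu(mu)                                               # step 2
    mu_plus = marginals([t + r * gk for t, gk in zip(theta, g)]) # step 3
    return [(mp - m) / r for mp, m in zip(mu_plus, mu)]          # step 4

def finite_difference_gradient(theta, eps=1e-6):
    """Reference: central differences of L(theta) = M(mu(theta)), one coordinate at a time."""
    grad = []
    for k in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[k] += eps
        dn[k] -= eps
        grad.append((M(marginals(up)) - M(marginals(dn))) / (2 * eps))
    return grad
```

Because step 4 divides a difference of marginals by a small r, any inference error in µ or µ+ is amplified by 1/r, which is the reason for the strong-convergence requirement.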

SLIDE 31

Dual Decomposition with Two subproblems

$$\max_x \; f(x) + g(x)$$

Can quickly and exactly maximize $f(x) + a \cdot x$.
Can quickly and exactly maximize $g(x) + b \cdot x$.

SLIDE 32

Dual Decomposition with Two subproblems

Transform $\max_x f(x) + g(x)$ to a constrained problem:

$$\max_{x,y} \; f(x) + g(y) \quad \text{s.t.} \quad x = y$$

Leads to dual problem:

$$\min_\lambda h(\lambda), \qquad h(\lambda) = \max_x \big[ f(x) + \lambda \cdot x \big] + \max_y \big[ g(y) - \lambda \cdot y \big]$$
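A scalar toy problem shows the mechanics. The sketch below assumes two hypothetical concave quadratics for f and g, so that each subproblem has a closed-form maximizer, and minimizes h by plain gradient descent with a fixed step, using the fact that the derivative of h is the disagreement x* − y* between the two maximizers. The functions and the step size are illustrative choices.

```python
def f(x):
    return -(x - 1.0) ** 2          # hypothetical concave subproblem objective

def g(y):
    return -(y - 3.0) ** 2          # hypothetical concave subproblem objective

def argmax_f(lam):
    """Maximizer of f(x) + lam * x, from setting the derivative to zero."""
    return 1.0 + lam / 2.0

def argmax_g(lam):
    """Maximizer of g(y) - lam * y, from setting the derivative to zero."""
    return 3.0 - lam / 2.0

def h(lam):
    """Dual objective: the two subproblems solved independently."""
    x, y = argmax_f(lam), argmax_g(lam)
    return f(x) + lam * x + g(y) - lam * y

def solve_dual(lam=0.0, step=0.5, iters=100):
    """Gradient descent on h; dh/dlam equals x* - y*."""
    for _ in range(iters):
        lam -= step * (argmax_f(lam) - argmax_g(lam))
    return lam, argmax_f(lam), argmax_g(lam)
```

At the dual optimum the two maximizers coincide at x = y = 2, which is also the maximizer of f(x) + g(x), and h equals f(2) + g(2).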

SLIDE 34

Dual Decomposition with Two subproblems

[Diagram: a master problem $\min_\lambda h(\lambda)$ coordinating two subproblems, $\max_x f(x) + \lambda \cdot x$ and $\max_x g(x) - \lambda \cdot x$.]

SLIDE 35

Dual Decomposition with N subproblems

$$\max_x \sum_{i=1}^N f_i(x)$$

Can quickly and exactly maximize $f_i(x) + a_i \cdot x$, for all $i$.

SLIDE 36

Dual Decomposition with N subproblems

Transform $\max_x \sum_i f_i(x)$ to a constrained problem:

$$\max_{\{x_i\}} \sum_i f_i(x_i) \quad \text{s.t.} \quad x_i = \frac{1}{N} \sum_j x_j$$

Leads to dual problem:

$$\min_\lambda h(\lambda), \qquad h(\lambda) = \sum_i h_i(\lambda), \qquad h_i(\lambda) = \max_{x_i} \; f_i(x_i) + \Big( \lambda_i - \frac{1}{N} \sum_j \lambda_j \Big) \cdot x_i$$
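The multiplier combination inside $h_i$ can be checked by attaching one multiplier $\lambda_i$ to each constraint and regrouping terms by $x_i$. The following is a sketch of that step; the symbol $\mathcal{L}$ for the Lagrangian is introduced here for illustration and does not appear in the slides.

```latex
\begin{align*}
\mathcal{L}(\{x_i\},\lambda)
  &= \sum_i f_i(x_i) + \sum_i \lambda_i \cdot \Big( x_i - \frac{1}{N}\sum_j x_j \Big) \\
  &= \sum_i f_i(x_i) + \sum_i \lambda_i \cdot x_i
     - \frac{1}{N} \Big( \sum_j \lambda_j \Big) \cdot \Big( \sum_i x_i \Big) \\
  &= \sum_i \Big[ f_i(x_i) + \Big( \lambda_i - \frac{1}{N}\sum_j \lambda_j \Big) \cdot x_i \Big].
\end{align*}
```

Each bracket involves only its own $x_i$, so maximizing over $\{x_i\}$ splits into $N$ independent maximizations, one per $h_i(\lambda)$.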

SLIDE 38

Dual Decomposition with N subproblems

[Diagram: a master problem $\min_\lambda h(\lambda)$ coordinating subproblems $\max_x f'_1(x,\lambda)$, $\max_x f'_2(x,\lambda)$, $\max_x f'_3(x,\lambda)$.]

$$f'_i(x,\lambda) = f_i(x_i) + \Big( \lambda_i - \frac{1}{N} \sum_j \lambda_j \Big) \cdot x_i$$

SLIDE 39

Dual Decomposition with N subproblems

Has been used extensively for MAP inference.
For MAP inference, h(λ) is non-differentiable.
For marginal inference, h(λ) is differentiable, convex.
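A one-variable toy case illustrates the difference. In this sketch, which is an illustration rather than the h(λ) of the talk, a single spin x in {−1, +1} is scored by λ·x: a MAP-style subproblem takes the maximum, giving |λ| with a kink at 0, while a marginal-style subproblem takes a log-sum-exp, which is smooth with slope tanh(λ).

```python
import math

def map_value(lam):
    """MAP-style value: max over x in {-1, +1} of lam * x (equals |lam|)."""
    return max(lam * x for x in (-1, 1))

def marginal_value(lam):
    """Marginal-style value: log-sum-exp over x in {-1, +1} of lam * x."""
    return math.log(sum(math.exp(lam * x) for x in (-1, 1)))

def one_sided_slopes(fn, lam, eps=1e-6):
    """Left and right finite-difference slopes of fn at lam."""
    left = (fn(lam) - fn(lam - eps)) / eps
    right = (fn(lam + eps) - fn(lam)) / eps
    return left, right
```

At λ = 0 the one-sided slopes of `map_value` disagree (−1 on the left, +1 on the right), whereas those of `marginal_value` agree up to finite-difference error. A smooth objective is what makes gradient-based methods with line searches applicable.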

SLIDE 41

Variational Inference

Can represent a graphical model in exponential family:

$$p(x;\theta) = \exp\big( f(x) \cdot \theta - A(\theta) \big), \qquad A(\theta) = \log \sum_x \exp\big( f(x) \cdot \theta \big)$$

Can compute A as [Wainwright and Jordan]

$$A(\theta) = \max_{\mu \in M} \; \theta \cdot \mu + H(\mu)$$

M is marginal polytope (hard).
H is entropy (hard).
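On a model small enough to enumerate, the variational identity can be verified numerically. The sketch below uses a hypothetical two-variable model with made-up features and parameters; it checks that θ·µ + H(µ) evaluated at the exact marginals equals A(θ), and that another distribution (uniform) gives a smaller value.

```python
import math
from itertools import product

def features(x):
    """Hypothetical features for two binary variables."""
    return [x[0], x[1], x[0] * x[1]]

theta = [0.5, -1.0, 2.0]                       # hypothetical parameters
configs = list(product([0, 1], repeat=2))
scores = [sum(t * f for t, f in zip(theta, features(x))) for x in configs]

# Log-partition function by direct summation.
A = math.log(sum(math.exp(s) for s in scores))

def objective(probs):
    """theta . mu + H for the distribution probs over configs."""
    mu = [sum(p * features(x)[k] for p, x in zip(probs, configs))
          for k in range(len(theta))]
    H = -sum(p * math.log(p) for p in probs if p > 0)
    return sum(t * m for t, m in zip(theta, mu)) + H

exact_probs = [math.exp(s - A) for s in scores]   # p(x; theta)
uniform_probs = [0.25] * 4                        # some other distribution

value_at_exact = objective(exact_probs)
value_at_uniform = objective(uniform_probs)
```

Here `value_at_exact` matches `A` and `value_at_uniform` is lower. Both hard parts are trivial in this example only because the model has four states.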

SLIDE 42

Variational Inference

Exact inference:

$$A(\theta) = \max_{\mu \in M} \; \theta \cdot \mu + H(\mu)$$

TRW approximation:

$$B(\theta) = \max_{\mu \in L} \; \theta \cdot \mu + \sum_T \rho_T H(\mu(T))$$

L is local polytope (easy).
H(µ(T)) is entropy of marginals projected onto tree T (easy).

Our problem: how to compute B?

SLIDE 46

Dual Decomposition for Marginal Inference

TRW approximation:

$$B(\theta) = \max_{\mu \in L} \; \theta \cdot \mu + \sum_T \rho_T H(\mu(T))$$

Theorem (main result):

$$B(\theta) = \min_{\{\theta^T\}} h(\{\theta^T\}) \quad \text{s.t.} \quad \sum_{T : a \in T} \theta^T_a = \theta_a$$

$$h(\{\theta^T\}) = \sum_T B_T(\theta^T), \qquad B_T(\theta^T) = \max_{\mu^T \in M_T} \; \theta^T \cdot \mu^T + \rho_T H_T(\mu^T)$$

$B_T(\theta^T)$ is computable by running regular sum-product algorithm.
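One way to see why sum-product suffices is to factor $\rho_T$ out of the subproblem. The sketch below assumes $\rho_T > 0$ and writes $A_T$ for the log-partition function of the model restricted to tree $T$; that symbol is introduced here for illustration.

```latex
\begin{align*}
B_T(\theta^T)
  &= \max_{\mu^T \in M_T} \; \theta^T \cdot \mu^T + \rho_T H_T(\mu^T) \\
  &= \rho_T \max_{\mu^T \in M_T} \; \frac{\theta^T}{\rho_T} \cdot \mu^T + H_T(\mu^T) \\
  &= \rho_T \, A_T\!\Big( \frac{\theta^T}{\rho_T} \Big).
\end{align*}
```

The last line applies the variational representation of the log-partition function to the tree, where the marginal polytope and the entropy are both exact. The maximizing $\mu^T$ is the vector of marginals of the tree model with parameters $\theta^T / \rho_T$, which sum-product returns exactly.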

SLIDE 49

Dual Decomposition for Marginal Inference

[Diagram: a master problem $\min_{\{\theta^T\}} h(\{\theta^T\})$ coordinating one subproblem $\max_{\mu^T \in M_T} f^T(\theta^T, \mu^T)$ per tree.]

$$f^T(\theta^T, \mu^T) = \theta^T \cdot \mu^T + \rho_T H_T(\mu^T)$$

SLIDE 51

Dual Decomposition for Marginal Inference

Inference: Plug $\min_{\{\theta^T\}} \sum_T B_T(\theta^T)$ into L-BFGS.

Guarantees convergence. (Line searches)
Fast convergence rates.
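The overall loop can be sketched end to end on a three-node cycle. Several substitutions are made here and should not be read as the method of the talk: plain gradient descent with a backtracking line search stands in for L-BFGS, brute-force enumeration stands in for sum-product, and the graph, the uniform tree weights, and the parameter values are hypothetical. The constraint on the tree parameters is enforced through free variables `delta` whose per-component mean is subtracted, mirroring the N-subproblem construction, and each subproblem uses B_T(θ^T) = ρ_T A_T(θ^T / ρ_T).

```python
import math
from itertools import product

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]
comps = nodes + edges                        # components a: nodes and edges
trees = [[(0, 1), (1, 2)], [(1, 2), (0, 2)], [(0, 1), (0, 2)]]
rho = 1.0 / len(trees)                       # uniform tree weights (assumption)
theta = {0: 0.3, 1: -0.2, 2: 0.1,            # hypothetical parameters
         (0, 1): 0.5, (1, 2): 0.4, (0, 2): 0.3}
configs = list(product([-1, 1], repeat=3))

def feat(a, x):
    """Feature for component a: x_i for a node, x_i * x_j for an edge."""
    return x[a] if isinstance(a, int) else x[a[0]] * x[a[1]]

def log_partition_and_means(eta):
    """Brute-force A(eta) and E[f_a(x)] (stand-in for sum-product)."""
    scores = [sum(v * feat(a, x) for a, v in eta.items()) for x in configs]
    top = max(scores)
    A = top + math.log(sum(math.exp(s - top) for s in scores))
    mu = {a: sum(math.exp(s - A) * feat(a, x) for s, x in zip(scores, configs))
          for a in eta}
    return A, mu

def members(a):
    """Indices of trees containing component a (every tree spans all nodes)."""
    return [t for t, tree in enumerate(trees) if isinstance(a, int) or a in tree]

def tree_params(delta):
    """Map free variables delta to tree parameters that always sum to theta."""
    out = [dict() for _ in trees]
    for a in comps:
        ts = members(a)
        mean = sum(delta[t][a] for t in ts) / len(ts)
        for t in ts:
            out[t][a] = theta[a] / len(ts) + delta[t][a] - mean
    return out

def dual(delta):
    """Return h = sum_T rho * A_T(theta^T / rho), its gradient, and tree means."""
    value, mus = 0.0, []
    for th in tree_params(delta):
        A, mu = log_partition_and_means({a: v / rho for a, v in th.items()})
        value += rho * A
        mus.append(mu)
    grad = [dict() for _ in trees]
    for a in comps:
        ts = members(a)
        mean = sum(mus[t][a] for t in ts) / len(ts)
        for t in ts:
            grad[t][a] = mus[t][a] - mean    # disagreement with the average
    return value, grad, mus

def zero_delta():
    return [{a: 0.0 for a in comps if t in members(a)} for t in range(len(trees))]

def minimize(max_iters=1000, tol=1e-10):
    """Gradient descent with a backtracking line search on the dual."""
    delta = zero_delta()
    value, grad, mus = dual(delta)
    for _ in range(max_iters):
        sq = sum(g * g for gr in grad for g in gr.values())
        if sq < tol:
            break
        step = 1.0
        while True:
            trial = [{a: d[a] - step * gr[a] for a in d}
                     for d, gr in zip(delta, grad)]
            new = dual(trial)
            if new[0] <= value - 0.5 * step * sq or step < 1e-10:
                break
            step *= 0.5                      # shrink until sufficient decrease
        delta, (value, grad, mus) = trial, new
    return value, mus
```

The gradient of the dual is the disagreement between each tree's marginals and their average, so at the minimum the trees agree on every shared node and edge. Calling `minimize()` returns the dual value and the per-tree marginals; the dual value stays above the exact log-partition function of the cycle, as expected of the TRW bound.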

SLIDE 53

Ising Model

$x_i \in \{-1,+1\}$

$$p(x) \propto \prod_{ij} \exp\big(\theta(x_i,x_j)\big) \prod_i \exp\big(\theta(x_i)\big)$$

$\theta(x_i) = \alpha_F x_i$
$\theta(x_i,x_j) = \alpha_I x_i x_j$

SLIDE 54

Algorithms

Algorithms Compared:

Dual Decomposition + L-BFGS
TRW
TRW with damping of 1/2 in the log-domain.
TRW-S [Meltzer et al. 2009]

Max of 10^5 iterations allowed.

SLIDE 55

Dual Decomp vs. TRW

$\alpha_I \in [0,1]$

[Figure: scatter plot; axes: # iterations, TRW and # iterations, dual decomposition (log scales, 10 to 10^5); series: 10^−2, 10^−4, 10^−6 convergence.]

SLIDE 56

Dual Decomp vs. TRW

$\alpha_I \in [0,3]$

[Figure: scatter plot; axes: # iterations, TRW and # iterations, dual decomposition (log scales, 10 to 10^5); series: 10^−2, 10^−4, 10^−6 convergence.]

SLIDE 57

Dual Decomp vs. TRW

$\alpha_I \in [0,9]$

[Figure: scatter plot; axes: # iterations, TRW and # iterations, dual decomposition (log scales, 10 to 10^5); series: 10^−2, 10^−4, 10^−6 convergence.]

SLIDE 58

Dual Decomp vs. TRW-damped

$\alpha_I \in [0,1]$

[Figure: scatter plot; axes: # iterations, TRW-damped and # iterations, dual decomposition (log scales, 10 to 10^5); series: 10^−2, 10^−4, 10^−6 convergence.]

SLIDE 59

Dual Decomp vs. TRW-damped

$\alpha_I \in [0,3]$

[Figure: scatter plot; axes: # iterations, TRW-damped and # iterations, dual decomposition (log scales, 10 to 10^5); series: 10^−2, 10^−4, 10^−6 convergence.]

SLIDE 60

Dual Decomp vs. TRW-damped

$\alpha_I \in [0,9]$

[Figure: scatter plot; axes: # iterations, TRW-damped and # iterations, dual decomposition (log scales, 10 to 10^5); series: 10^−2, 10^−4, 10^−6 convergence.]

SLIDE 61

Dual Decomp vs. TRW-S

$\alpha_I \in [0,1]$

[Figure: scatter plot; axes: # iterations, TRW-S and # iterations, dual decomposition (log scales, 10 to 10^5); series: 10^−2, 10^−4, 10^−6 convergence.]

SLIDE 62

Dual Decomp vs. TRW-S

$\alpha_I \in [0,3]$

[Figure: scatter plot; axes: # iterations, TRW-S and # iterations, dual decomposition (log scales, 10 to 10^5); series: 10^−2, 10^−4, 10^−6 convergence.]

SLIDE 63

Dual Decomp vs. TRW-S

$\alpha_I \in [0,9]$

[Figure: scatter plot; axes: # iterations, TRW-S and # iterations, dual decomposition (log scales, 10 to 10^5); series: 10^−2, 10^−4, 10^−6 convergence.]

SLIDE 64

Convergence

$\alpha_I \in [0,1]$

[Figure: axes: convergence level (.01, .001, 10^−4, 10^−5, 10^−6) and # iterations (log scale, 1 to 10^5); series: Dual Decomposition, TRW-damped, TRW-S, TRW.]

SLIDE 65

Convergence

$\alpha_I \in [0,3]$

[Figure: axes: convergence level (.01, .001, 10^−4, 10^−5, 10^−6) and # iterations (log scale, 1 to 10^5); series: Dual Decomposition, TRW-damped, TRW-S, TRW.]

SLIDE 66

Convergence

$\alpha_I \in [0,9]$

[Figure: axes: convergence level (.01, .001, 10^−4, 10^−5, 10^−6) and # iterations (log scale, 1 to 10^5); series: Dual Decomposition, TRW-damped, TRW-S, TRW.]

SLIDE 68

Conclusions

Dual Decomposition
Faster on “hard” problems or if strong convergence needed.
Caveats
Not really faster on “easy” problems.
Restriction on tree distribution P(T).
