SLIDE 1

Marginal Inference in MRFs using Frank-Wolfe

David Belanger, Daniel Sheldon, Andrew McCallum

School of Computer Science, University of Massachusetts Amherst
{belanger,sheldon,mccallum}@cs.umass.edu

December 10, 2013

SLIDE 2

Table of Contents

1. Markov Random Fields
2. Frank-Wolfe for Marginal Inference
3. Optimality Guarantees and Convergence Rate
4. Beyond MRFs
5. Fancier FW

SLIDES 4–8

Markov Random Fields

$$\Phi_\theta(x) = \sum_{c \in \mathcal{C}} \theta_c(x_c)$$

$$P(x) = \exp\big(\Phi_\theta(x) - \log Z\big)$$

Overcomplete representation: $x \to \mu$, so that $\Phi_\theta(x) \to \langle \theta, \mu \rangle$.
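To make the overcomplete representation concrete, here is a minimal Python sketch (the toy model and all names such as `theta_c`, `phi`, and `mu_of_x` are hypothetical illustrations, not from the talk): μ(x) stacks one-hot indicator tables per clique, so the linear score ⟨θ, μ(x)⟩ reproduces Φθ(x).

```python
import numpy as np

# Toy MRF with a single pairwise clique c over binary (x0, x1);
# theta_c[a, b] is the log-potential of configuration (x0=a, x1=b).
theta_c = np.array([[0.5, -1.0],
                    [0.2,  0.8]])

def phi(x):
    """Phi_theta(x): sum of clique log-potentials (one clique here)."""
    return theta_c[x[0], x[1]]

def mu_of_x(x):
    """Overcomplete representation: one-hot indicator over clique configurations."""
    mu = np.zeros_like(theta_c)
    mu[x[0], x[1]] = 1.0
    return mu

x = (1, 0)
# The inner product <theta, mu(x)> recovers Phi_theta(x).
assert np.isclose(phi(x), np.sum(theta_c * mu_of_x(x)))
```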

SLIDES 9–12

Marginal Inference

$$\mu_{\text{MARG}} = \mathbb{E}_{P_\theta}[\mu]$$

Equivalently, as variational optimization over the marginal polytope $\mathcal{M}$ with the exact entropy $H_{\mathcal{M}}$:

$$\mu_{\text{MARG}} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle + H_{\mathcal{M}}(\mu)$$

Tractable surrogate over the local polytope $\mathcal{L}$ with a decomposed entropy $H_B$:

$$\bar{\mu}_{\text{approx}} = \arg\max_{\mu \in \mathcal{L}} \langle \mu, \theta \rangle + H_B(\mu), \qquad H_B(\mu) = \sum_{c \in \mathcal{C}} W_c H(\mu_c)$$
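A minimal sketch of the decomposed entropy surrogate (the counting weights `weights` and the list-of-tables layout are hypothetical; the slides leave both abstract):

```python
import numpy as np

def clique_entropy(mu_c):
    """Shannon entropy H(mu_c) of one clique marginal (any table shape)."""
    p = mu_c.ravel()
    p = p[p > 0]                        # 0 log 0 = 0 convention
    return -np.sum(p * np.log(p))

def H_B(clique_marginals, weights):
    """Decomposed entropy H_B(mu) = sum_c W_c * H(mu_c)."""
    return sum(w * clique_entropy(mu_c)
               for mu_c, w in zip(clique_marginals, weights))

# Example: two pairwise clique marginals with counting weights of 1.
mus = [np.array([[0.4, 0.1], [0.1, 0.4]]),
       np.array([[0.25, 0.25], [0.25, 0.25]])]
print(H_B(mus, weights=[1.0, 1.0]))
```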

SLIDES 13–15

MAP Inference

$$\mu_{\text{MAP}} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle$$

[Diagram: a Black Box MAP Solver maps $\theta$ to $\mu_{\text{MAP}}$; a Gray Box MAP Solver does the same.]

SLIDE 16

Marginal → MAP Reductions

Hazan and Jaakkola [2012]; Ermon et al. [2013]

SLIDE 18

Generic FW with Line Search

$$y_t = \arg\min_{y \in \mathcal{X}} \langle y, -\nabla f(x_{t-1}) \rangle$$

$$\gamma_t = \arg\max_{\gamma \in [0,1]} f\big((1-\gamma)\,x_{t-1} + \gamma\, y_t\big), \qquad x_t = (1-\gamma_t)\,x_{t-1} + \gamma_t\, y_t$$

SLIDE 19

Generic FW with Line Search

[Diagram: a cycle of Compute Gradient → Linear Minimization Oracle → Line Search: the gradient $\nabla f(x_{t-1})$ is handed to the oracle, which returns $y_t$; line search combines $x_{t-1}$ and $y_t$ into $x_t$.]
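For reference, a minimal generic Frank-Wolfe loop with line search, as a sketch (the callables `f`, `grad_f`, and `lmo` are hypothetical placeholders, and a grid stands in for exact 1-D maximization, which is safe here because the restriction of a concave f to a segment is concave):

```python
import numpy as np

def frank_wolfe(x0, f, grad_f, lmo, num_iters=100, grid=201):
    """Generic Frank-Wolfe for maximizing a concave f over a convex set X.

    lmo(g) must return argmax_{y in X} <y, g> (a vertex of X), which matches
    argmin_{y in X} <y, -g> on the slide.
    """
    x = x0
    for _ in range(num_iters):
        y = lmo(grad_f(x))                        # linear (MAP-like) oracle
        gammas = np.linspace(0.0, 1.0, grid)      # line search on a grid
        vals = [f((1 - g) * x + g * y) for g in gammas]
        gamma = gammas[int(np.argmax(vals))]
        x = (1 - gamma) * x + gamma * y
    return x
```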

SLIDE 20

FW for Marginal Inference

[Diagram: the same cycle specialized to inference: Compute Gradient forms $\tilde{\theta} = \nabla F(\mu_t) = \theta + \nabla H(\mu_t)$; a MAP Inference Oracle maps $\tilde{\theta}$ to $\tilde{\mu}_{\text{MAP}}$; Line Search produces $\mu_{t+1}$.]

SLIDES 21–22

Subproblem Parametrization

$$F(\mu) = \langle \mu, \theta \rangle + \sum_{c \in \mathcal{C}} W_c H(\mu_c)$$

$$\tilde{\theta} = \nabla F(\mu_t) = \theta + \sum_{c \in \mathcal{C}} W_c \nabla H(\mu_c)$$
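Since $H(\mu_c) = -\sum_i \mu_{c,i} \log \mu_{c,i}$, the gradient is $\nabla H(\mu_c) = -(1 + \log \mu_c)$ elementwise, so the subproblem potentials are the original potentials plus weighted entropy-gradient terms. A minimal sketch (the per-clique list layout and names are hypothetical); the clipping hints at the boundary issue revisited under curvature below:

```python
import numpy as np

def entropy_grad(mu_c, eps=1e-12):
    """Elementwise gradient of H(mu_c) = -sum mu log mu: -(1 + log mu).

    Clipping keeps log finite; the true gradient diverges at the boundary.
    """
    return -(1.0 + np.log(np.clip(mu_c, eps, None)))

def subproblem_potentials(theta_cliques, mu_cliques, weights):
    """theta_tilde_c = theta_c + W_c * grad H(mu_c), one table per clique."""
    return [th + w * entropy_grad(mu)
            for th, mu, w in zip(theta_cliques, mu_cliques, weights)]
```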

SLIDES 23–26

Line Search

[Diagram: $\mu_{t+1}$ lies on the segment between $\mu_t$ and $\tilde{\mu}_{\text{MAP}}$.]

Computing the line search objective can scale with:
- Bad: the number of possible values in the cliques.
- Good: the number of cliques in the graph (see paper).
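One way the good case can come about (my reconstruction of the "(see paper)" remark, under the assumption that $\tilde{\mu}_{\text{MAP}}$ is a polytope vertex, so each clique marginal moves toward a one-hot table; consult the paper for the authors' actual derivation): after a one-time precomputation of each clique's entropy and of the mass it places on the MAP configuration, every evaluation of $F((1-\gamma)\mu_t + \gamma\tilde{\mu}_{\text{MAP}})$ costs only $O(\#\text{cliques})$.

```python
import numpy as np

def line_objective(gamma, theta_dot_mu_t, theta_dot_mu_map, H_t, p_star, weights,
                   tiny=1e-300):
    """Evaluate F((1-gamma)*mu_t + gamma*mu_MAP) in O(#cliques).

    Precomputed once per clique c (cost proportional to the clique table):
      H_t[c]    -- entropy H(mu_c^t) of the current clique marginal;
      p_star[c] -- mass mu_c^t places on the clique's MAP configuration.
    Each clique marginal is interpolated toward a one-hot table, and the
    entropy of that mixture has a closed form in H_t[c] and p_star[c] alone.
    """
    lin = (1 - gamma) * theta_dot_mu_t + gamma * theta_dot_mu_map
    ent = 0.0
    for h_c, p, w in zip(H_t, p_star, weights):
        q = (1 - gamma) * p + gamma          # new mass on the MAP configuration
        ent += w * (-(1 - gamma) * (1 - p) * np.log(max(1 - gamma, tiny))
                    + (1 - gamma) * (h_c + p * np.log(max(p, tiny)))
                    - q * np.log(max(q, tiny)))
    return lin + ent
```

Because F is concave, its restriction to the segment is concave in γ, so a grid or golden-section search over [0, 1] finds the step with a handful of these O(#cliques) evaluations.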

SLIDE 27

Experiment #1

SLIDES 29–32

Convergence Rate

Convergence Rate of Frank-Wolfe [Jaggi, 2013]:

$$F(\mu^*) - F(\mu_t) \;\le\; \frac{2\,C_F}{t + 2}\,(1 + \delta)$$

where $\frac{\delta\, C_F}{t + 2}$ is the MAP suboptimality allowed at iteration $t$ → solving MAP to that accuracy is NP-hard.

How to deal with MAP hardness?
- Use a MAP solver and hope for the best [Hazan and Jaakkola, 2012].
- Relax to the local polytope.

SLIDES 33–34

Curvature + Convergence Rate

$$C_f = \sup_{x, s \in \mathcal{D};\; \gamma \in [0,1];\; y = x + \gamma(s - x)} \frac{2}{\gamma^2}\Big(f(y) - f(x) - \langle y - x, \nabla f(x) \rangle\Big)$$

[Diagram: $\mu_{t+1}$ on the segment between $\mu_t$ and $\tilde{\mu}_{\text{MAP}}$.]

[Plot: entropy as a function of prob $x = 1$.]
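A standard companion fact from Jaggi [2013] (not stated on these slides) ties curvature to gradient smoothness: if $\nabla f$ is $L$-Lipschitz over $\mathcal{D}$, then

$$C_f \;\le\; L \cdot \operatorname{diam}(\mathcal{D})^2.$$

The entropy plot points at the obstruction: $\nabla H(\mu) = -(1 + \log \mu)$ diverges as any coordinate of $\mu$ approaches 0, so no finite $L$ holds over the whole polytope and this bound gives no control near the boundary.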

SLIDE 35

Experiment #2

SLIDES 37–38

Beyond MRFs

Question: Are MRFs the right Gibbs distributions on which to use Frank-Wolfe?

| Problem Family | MAP Algorithm | Marginal Algorithm |
|---|---|---|
| Tree-structured graphical models | Viterbi | Forward-Backward |
| Loopy graphical models | Max-Product BP | Sum-Product BP |
| Directed spanning tree | Chu-Liu-Edmonds | Matrix Tree Theorem |
| Bipartite matching | Hungarian Algorithm | ✗ |

SLIDES 40–41

Fancier FW

Norm-regularized marginal inference [Harchaoui et al., 2013]:

$$\mu_{\text{MARG}} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle + H_{\mathcal{M}}(\mu) + \lambda R(\mu)$$

Local linear oracle for MRFs? [Garber and Hazan, 2013]:

$$\tilde{\mu}_t = \arg\max_{\mu \in \mathcal{M} \cap B_r(\mu_t)} \langle \mu, \theta \rangle$$

SLIDES 42–43

Conclusion

We need to figure out how to handle the entropy gradient. There are plenty of extensions to other Gibbs distributions and regularizers.

SLIDE 44

Further Reading I

- Stefano Ermon, Carla Gomes, Ashish Sabharwal, and Bart Selman. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 334–342, 2013.
- Dan Garber and Elad Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv e-prints, January 2013.
- Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. arXiv preprint arXiv:1302.2325, 2013.
- Tamir Hazan and Tommi S. Jaakkola. On the partition function and random maximum a-posteriori perturbations. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 991–998, 2012.
- Bert Huang and Tony Jebara. Approximating the permanent with belief propagation. arXiv preprint arXiv:0908.1769, 2009.

SLIDE 45

Further Reading II

- Mark Huber. Exact sampling from perfect matchings of dense regular bipartite graphs. Algorithmica, 44(3):183–193, 2006.
- Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 427–435, 2013.
- James Petterson, Tiberio Caetano, Julian McAuley, and Jin Yu. Exponential family graph matching and ranking. 2009.
- Tim Roughgarden and Michael Kearns. Marginals-to-models reducibility. In Advances in Neural Information Processing Systems, pages 1043–1051, 2013.
- Maksims Volkovs and Richard S. Zemel. Efficient sampling for bipartite matching problems. In Advances in Neural Information Processing Systems, pages 1322–1330, 2012.
- Pascal O. Vontobel. The Bethe permanent of a non-negative matrix. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 341–346. IEEE, 2010.

SLIDES 46–49

Finding the Marginal Matching

Sampling
- Expensive, but doable [Huber, 2006; Volkovs and Zemel, 2012].
- Used for maximum-likelihood learning [Petterson et al., 2009].

Sum-Product
- Also requires the Bethe approximation.
- Works well: in practice [Huang and Jebara, 2009], in theory [Vontobel, 2010].

Frank-Wolfe
- Basically the same algorithm as for graphical models.
- Same issue with curvature.
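To illustrate "basically the same algorithm" for the matching case, here is a minimal sketch (mine, not the talk's: it substitutes the plain entropy $-\sum_{ij}\mu_{ij}\log\mu_{ij}$ for the Bethe entropy, uses the standard $2/(t+2)$ step instead of line search, and leans on SciPy's `linear_sum_assignment` as the Hungarian-algorithm MAP oracle):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_marginals(theta, num_iters=200, eps=1e-9):
    """FW marginal inference over the Birkhoff polytope (illustrative sketch).

    theta: n x n matrix of edge scores. Objective: <mu, theta> + H(mu), with
    the plain entropy standing in for the Bethe entropy the talk references.
    """
    n = theta.shape[0]
    mu = np.full((n, n), 1.0 / n)             # uniform doubly stochastic start
    for t in range(num_iters):
        grad = theta - (1.0 + np.log(np.clip(mu, eps, None)))
        rows, cols = linear_sum_assignment(grad, maximize=True)  # Hungarian LMO
        vertex = np.zeros_like(mu)
        vertex[rows, cols] = 1.0               # permutation matrix = MAP matching
        gamma = 2.0 / (t + 2.0)                # standard FW step (no line search)
        mu = (1 - gamma) * mu + gamma * vertex
    return mu                                  # mu[i, j] ~ P(i matched to j)
```

The Birkhoff polytope's vertices are exactly the permutation matrices, which is why a max-weight matching solver is the natural linear minimization oracle here.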