
Slides Set 10: Bounded Inference Non-iteratively; Mini-Bucket Elimination - PowerPoint PPT Presentation

Algorithms for Reasoning with Graphical Models. Slides Set 10: Bounded Inference Non-iteratively; Mini-Bucket Elimination. Rina Dechter (Class Notes (8-9), Darwiche chapter 14). slides10 828X 2019. Outline: Mini-bucket elimination


  1. CPCS Networks – Medical Diagnosis (noisy-OR model). Test case: no evidence. Plot: anytime-mpe(0.0001) upper/lower (U/L) error vs. time for cpcs360b and cpcs422b, with the parameter i ranging from i=1 to i=21 (U/L ratio from 1.0 up to about 3.8 over 1–1000 sec). Times (sec):

     Algorithm               cpcs360   cpcs422
     elim-mpe                115.8     1697.6
     anytime-mpe(ε=10^-4)    70.3      505.2
     anytime-mpe(ε=10^-1)    70.3      110.5

     slides10 828X 2019

  2. Outline • Mini-bucket elimination • Weighted Mini-bucket • Mini-clustering • Re-parameterization, cost-shifting • Iterative Belief propagation • Iterative-join-graph propagation slides10 828X 2019

  3. Decomposition for Sum • Generalize the technique to sum via Hölder's inequality. • Define the weighted (or powered) sum: $\sum_y^{w} g(y) = \big(\sum_y g(y)^{1/w}\big)^{w}$. • The weight acts as a "temperature" interpolating between sum and max: $w = 1$ gives the ordinary sum, and $w \to 0^+$ gives $\max_y g(y)$. • Different weights do not commute: applying weighted sums over two variables in different orders generally gives different results. slides10 828X 2019

  4. The Power Sum and Hölder Inequality slides10 828X 2019
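
To make the power-sum operator concrete, here is a minimal numpy sketch (my own illustration, not from the slides); the function name `powsum`, the max-normalization trick, and the toy arrays are all assumptions:

```python
import numpy as np

def powsum(f, w):
    """Weighted ("power") sum: (sum_y f(y)**(1/w))**w.
    Normalizing by the max keeps the computation stable for small w.
    w = 1 gives the ordinary sum; w -> 0+ approaches max(f)."""
    m = f.max()
    return m * np.sum((f / m) ** (1.0 / w)) ** w

f = np.array([2.0, 3.0, 5.0])
g = np.array([1.0, 4.0, 2.0])
print(powsum(f, 1.0))    # 10.0: the plain sum
print(powsum(f, 1e-6))   # ~5.0: approaches the max
# Hölder bound: sum(f*g) <= powsum(f, w1) * powsum(g, w2) for w1 + w2 = 1, both > 0
w1, w2 = 0.5, 0.5
assert np.sum(f * g) <= powsum(f, w1) * powsum(g, w2)   # 24 <= ~28.25
```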

  5. Working Example • Model: Markov network over variables A, B, C • Task: compute the partition function (Qiang Liu slides) slides10 828X 2019

  6. Mini-Bucket (Basic Principles) • Upper bound • Lower bound (Qiang Liu slides) slides10 828X 2019

  7. Hölder Inequality • $\sum_y f(y)\,g(y) \le \big(\sum_y f(y)^{1/w_1}\big)^{w_1} \cdot \big(\sum_y g(y)^{1/w_2}\big)^{w_2}$, where $w_1 + w_2 = 1$ and $w_1, w_2 > 0$. • When $f(y)^{1/w_1} \propto g(y)^{1/w_2}$, the equality is achieved. (Qiang Liu slides) G. H. Hardy, J. E. Littlewood and G. Pólya, Inequalities, Cambridge Univ. Press, London and New York, 1934.

  8. Reverse Hölder Inequality • If one of the weights is negative (e.g., $w_1 > 0$, $w_2 < 0$, $w_1 + w_2 = 1$), the inequality still holds but its direction reverses, yielding a lower bound. (Qiang Liu slides) G. H. Hardy, J. E. Littlewood and G. Pólya, Inequalities, Cambridge Univ. Press, London and New York, 1934.

  9. Weighted Mini-Bucket (for summation). Exact bucket elimination eliminates C in a single bucket: $\mu_C(b,d,e,f) = \sum_c g(b,c)\,g(c,d)\,g(c,e)\,g(c,f)$. Splitting the bucket into mini-buckets $\{g(b,c), g(c,d)\}$ and $\{g(c,e), g(c,f)\}$ and applying Hölder's inequality with weights $w_1 + w_2 = 1$, $w_1, w_2 > 0$:
     $\mu_C(b,d,e,f) \le \Big(\sum_c \big(g(b,c)\,g(c,d)\big)^{1/w_1}\Big)^{w_1} \cdot \Big(\sum_c \big(g(c,e)\,g(c,f)\big)^{1/w_2}\Big)^{w_2} = \mu_{C\to D}(b,d)\cdot\mu_{C\to E}(e,f)$
     where $\sum_y^{w} g(y) = \big(\sum_y g(y)^{1/w}\big)^{w}$ is the weighted or "power" sum operator, and in general $\sum_y^{w} g_1(y)\,g_2(y) \le \sum_y^{w_1} g_1(y) \cdot \sum_y^{w_2} g_2(y)$ for $w_1 + w_2 = w$, $w_1, w_2 > 0$ (a lower bound if $w_1 > 0$, $w_2 < 0$). The remaining buckets pass the messages down: bucket D combines $\mu_{C\to D}(b,d)$ with $g(b,d), g(d,f)$; bucket E combines $g(b,e)$ with $\mu_{C\to E}(e,f)$; bucket F combines $\mu_{D\to F}(b,f)$ with $\mu_{E\to F}(b,f)$; the last bucket combines $g(b)$ with $\mu_{F\to B}(b)$, producing U = upper bound. [Liu and Ihler, 2011] slides10 828X 2019
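
A small numpy sketch of the single-bucket bound on this slide, with random positive tables standing in for the slide's factors (the tables and domain size are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4  # domain size; random positive factors play the role of the g's above
g_bc, g_cd, g_ce, g_cf = (rng.random((k, k)) + 0.1 for _ in range(4))

# Exact elimination of C: mu_C(b,d,e,f) = sum_c g(b,c) g(c,d) g(c,e) g(c,f)
mu_exact = np.einsum('bc,cd,ce,cf->bdef', g_bc, g_cd, g_ce, g_cf)

# Mini-buckets {g(b,c), g(c,d)} and {g(c,e), g(c,f)} with weights w1 + w2 = 1:
w1, w2 = 0.5, 0.5
t1 = g_bc[:, :, None] * g_cd[None, :, :]        # t1[b,c,d]
mu_CD = (t1 ** (1 / w1)).sum(axis=1) ** w1      # power sum over c -> mu(b,d)
t2 = g_ce[:, :, None] * g_cf[:, None, :]        # t2[c,e,f]
mu_CE = (t2 ** (1 / w2)).sum(axis=0) ** w2      # power sum over c -> mu(e,f)

# Hölder guarantees an upper bound for every configuration (b,d,e,f):
bound = mu_CD[:, :, None, None] * mu_CE[None, None, :, :]
assert np.all(mu_exact <= bound + 1e-9)
```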

  10. slides10 828X 2019

  11. Weighted Mini-Bucket for Marginal MAP slides10 828X 2019

  12. Bucket Elimination for MMAP. Use a constrained elimination order: the SUM variables (here B, C, D, E) are eliminated first, by summation in their buckets, and the MAX variable (here A) is eliminated last, by maximization. MAP* is the marginal MAP value. slides7 828X 2019
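
A tiny sketch of the constrained order on a hypothetical model with one MAX variable A and SUM variables B, C (the factor tables are random, for illustration only): sum variables are eliminated first, the max variable last.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3
f_ab = rng.random((k, k))   # f(a,b), made-up factor
f_bc = rng.random((k, k))   # f(b,c), made-up factor

# Constrained elimination order: SUM variables (C, then B) before the MAX variable A.
msg_b = f_bc.sum(axis=1)                     # sum over c -> message on b
msg_a = (f_ab * msg_b[None, :]).sum(axis=1)  # sum over b -> message on a
mmap_value = msg_a.max()                     # maximize over a: the marginal MAP value
print(mmap_value)
```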

  13. MB and WMB for Marginal MAP (same network over A–F). For the sum variable C, the bucket is split into mini-buckets with weights $w_1 + w_2 = 1$:
     $\mu_{C\to D}(b,d) = \sum_c^{w_1} g(b,c)\,g(c,d)$,  $\mu_{C\to E}(e,f) = \sum_c^{w_2} g(c,e)\,g(c,f)$
     The max buckets are processed by maximization, e.g.
     $\mu_{F\to B}(b) = \max_f \mu_{D\to F}(b,f)\,\mu_{E\to F}(b,f)$,  $U = \max_b g(b)\,\mu_{F\to B}(b)$
     U = upper bound on the marginal MAP value. Can optimize over cost-shifting and weights (single pass "MM" or iterative message passing). [Liu and Ihler, 2011; 2013] [Dechter and Rish, 2003] slides10 828X 2019

  14. MBE-MAP: process max buckets with max mini-buckets, and sum buckets with weighted mini-buckets. slides10 828X 2019

  15. Initial partitioning slides10 828X 2019

  16. slides10 828X 2019

  17. Complexity and Tractability of MBE(i,m) slides10 828X 2019

  18. Outline • Mini-bucket elimination • Weighted Mini-bucket • Mini-clustering • Re-parameterization, cost-shifting • Iterative Belief propagation • Iterative-join-graph propagation slides10 828X 2019

  19. Join-Tree Clustering (Cluster-Tree Elimination). Clusters: 1 = {A,B,C}, 2 = {B,C,D,F}, 3 = {B,E,F}, 4 = {E,F,G}, with separators BC, BF, EF. The messages:
     $h_{(1,2)}(b,c) = \sum_a p(a)\,p(b|a)\,p(c|a,b)$
     $h_{(2,1)}(b,c) = \sum_{d,f} p(d|b)\,p(f|c,d)\,h_{(3,2)}(b,f)$
     $h_{(2,3)}(b,f) = \sum_{c,d} p(d|b)\,p(f|c,d)\,h_{(1,2)}(b,c)$
     $h_{(3,2)}(b,f) = \sum_e p(e|b,f)\,h_{(4,3)}(e,f)$
     $h_{(3,4)}(e,f) = \sum_b p(e|b,f)\,h_{(2,3)}(b,f)$
     $h_{(4,3)}(e,f) = p(G = g_e \mid e,f)$
     EXACT algorithm. Time and space: exp(cluster size) = exp(treewidth + 1); here the largest cluster has 4 variables. slides10 828X 2019
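
A minimal sketch of computing one CTE message, the $h_{(1,2)}(b,c)$ above, from cluster 1 = {A,B,C} over separator BC (the CPTs are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)
k = 2
p_a = rng.dirichlet(np.ones(k))                  # p(a)
p_b_a = rng.dirichlet(np.ones(k), size=k)        # p(b|a), one row per value of a
p_c_ab = rng.dirichlet(np.ones(k), size=(k, k))  # p(c|a,b)

# h_(1,2)(b,c) = sum_a p(a) p(b|a) p(c|a,b), a function over the separator {B,C}
h_12 = np.einsum('a,ab,abc->bc', p_a, p_b_a, p_c_ab)
print(h_12)   # here it equals the joint marginal p(b,c)
```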

  20. We can replace the sum with a power sum, with weights that sum to 1 across the mini-buckets of each cluster. slides10 828X 2019

  21. Mini-Clustering, i-bound = 3. Cluster 1 = {A,B,C} holds p(a), p(b|a), p(c|a,b) and sends $h_{(1,2)}(b,c) = \sum_a p(a)\,p(b|a)\,p(c|a,b)$ over separator BC. Cluster 2 = {B,C,D,F} holds p(d|b), p(f|c,d) and the incoming $h_{(1,2)}(b,c)$; since processing them together exceeds the i-bound, it is split into two mini-clusters, which send over separator BF:
     $h^1_{(2,3)}(b) = \sum_{c,d} p(d|b)\,h_{(1,2)}(b,c)$,  $h^2_{(2,3)}(f) = \max_{c,d} p(f|c,d)$
     Cluster 3 = {B,E,F} holds p(e|b,f), $h^1_{(2,3)}(b)$, $h^2_{(2,3)}(f)$; cluster 4 = {E,F,G} holds p(g|e,f), with separator EF. APPROXIMATE algorithm. Time and space: exp(i-bound), where the i-bound caps the number of variables in a mini-cluster. slides10 828X 2019

  22. Mini-Clustering – Example. The message sets:
     $H_{(1,2)} = \{\,h^1_{(1,2)}(b,c) = \sum_a p(a)\,p(b|a)\,p(c|a,b)\,\}$
     $H_{(2,1)} = \{\,h^1_{(2,1)}(b) = \sum_{d,f} p(d|b)\,h^1_{(3,2)}(b,f),\; h^2_{(2,1)}(c) = \max_{d,f} p(f|c,d)\,\}$
     $H_{(2,3)} = \{\,h^1_{(2,3)}(b) = \sum_{c,d} p(d|b)\,h^1_{(1,2)}(b,c),\; h^2_{(2,3)}(f) = \max_{c,d} p(f|c,d)\,\}$
     $H_{(3,2)} = \{\,h^1_{(3,2)}(b,f) = \sum_e p(e|b,f)\,h^1_{(4,3)}(e,f)\,\}$
     $H_{(3,4)} = \{\,h^1_{(3,4)}(e,f) = \sum_b p(e|b,f)\,h^1_{(2,3)}(b)\,h^2_{(2,3)}(f)\,\}$
     $H_{(4,3)} = \{\,h^1_{(4,3)}(e,f) = p(G = g_e \mid e,f)\,\}$
     slides10 828X 2019
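
A sketch of the cluster-2 split above, checking that the product of mini-cluster messages upper-bounds the exact CTE message (random CPTs; the sum/max split follows the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
k = 2
p_d_b = rng.dirichlet(np.ones(k), size=k)        # p(d|b)
p_f_cd = rng.dirichlet(np.ones(k), size=(k, k))  # p(f|c,d)
h_12 = rng.random((k, k))                        # incoming h_(1,2)(b,c)

# Exact CTE message: h_(2,3)(b,f) = sum_{c,d} p(d|b) p(f|c,d) h_(1,2)(b,c)
h_exact = np.einsum('bd,cdf,bc->bf', p_d_b, p_f_cd, h_12)

# Mini-clusters {p(d|b), h_(1,2)} and {p(f|c,d)}: sum in the first, max in the second
h1 = np.einsum('bd,bc->b', p_d_b, h_12)          # sum over c,d -> h1(b)
h2 = p_f_cd.max(axis=(0, 1))                     # max over c,d -> h2(f)
assert np.all(h_exact <= h1[:, None] * h2[None, :] + 1e-12)   # upper bound
```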

  23. Cluster Tree Elimination vs. Mini-Clustering. CTE sends one exact message per edge between clusters ABC, BCDF, BEF, EFG: $h_{(1,2)}(b,c)$, $h_{(2,1)}(b,c)$, $h_{(2,3)}(b,f)$, $h_{(3,2)}(b,f)$, $h_{(3,4)}(e,f)$, $h_{(4,3)}(e,f)$. MC replaces each with a set of lower-arity messages: $H_{(1,2)} = \{h^1(b,c)\}$, $H_{(2,1)} = \{h^1(b), h^2(c)\}$, $H_{(2,3)} = \{h^1(b), h^2(f)\}$, $H_{(3,2)} = \{h^1(b,f)\}$, $H_{(3,4)} = \{h^1(e,f)\}$, $H_{(4,3)} = \{h^1(e,f)\}$. slides10 828X 2019

  24. Heuristics for partitioning (Dechter and Rish 2003, Rollon and Dechter 2010). The scope-based partitioning heuristic (SCP) aims at minimizing the number of mini-buckets in the partition by including in each mini-bucket as many functions as possible, as long as the i-bound is respected. Alternatively, use a greedy heuristic derived from a distance function to decide which functions go into a single mini-bucket. slides10 828X 2019

  25. Greedy Scope-based Partitioning slides10 828X 2019

  26. Heuristic for Partitioning. Scope-based Partitioning Heuristic. The scope-based partition heuristic (SCP) aims at minimizing the number of mini-buckets in the partition by including in each mini-bucket as many functions as possible, as long as the i-bound is satisfied. First, single-function mini-buckets are ordered by decreasing arity from left to right. Then, each mini-bucket is absorbed into the left-most mini-bucket with which it can be merged. The time complexity of Partition(B, i), where B is the bucket to be partitioned and |B| the number of functions in the bucket, is O(|B| log(|B|) + |B|^2) using the SCP heuristic. The scope-based heuristic is quite fast; its shortcoming is that it does not consider the actual information in the functions. slides10 828X 2019
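
A runnable sketch of the SCP heuristic as described above; the representation (scopes as sets of variable names) and all identifiers are my own choices:

```python
def scope_based_partition(bucket, i_bound):
    """Greedy scope-based (SCP) partition: `bucket` is a list of function
    scopes; each returned mini-bucket's union scope has <= i_bound variables."""
    minibuckets = []  # each entry: [list of scopes, union of their scopes]
    # Order single-function mini-buckets by decreasing arity ...
    for scope in sorted(bucket, key=len, reverse=True):
        # ... and absorb each into the left-most mini-bucket it can merge with.
        for funcs, union in minibuckets:
            if len(union | scope) <= i_bound:
                funcs.append(scope)
                union |= scope   # sets update in place, so the entry is mutated
                break
        else:
            minibuckets.append([[scope], set(scope)])
    return [funcs for funcs, _ in minibuckets]

# The bucket split in the weighted mini-bucket example, with i_bound = 3:
print(scope_based_partition([{'B', 'C'}, {'C', 'D'}, {'C', 'E'}, {'C', 'F'}], 3))
# -> [[{'B','C'}, {'C','D'}], [{'C','E'}, {'C','F'}]] (up to ordering)
```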

  27. Greedy Partition as a function of a distance function h slides10 828X 2019

  28. Comparing Mini-Clustering against Belief Propagation. What is belief propagation? slides10 828X 2019

  29. Iterative Belief Propagation • Belief propagation is exact for poly-trees. • IBP applies BP iteratively to cyclic networks. One step: update $BEL(U_1)$ from the incoming messages, e.g. $\lambda_{X_1}(U_1)$, $\lambda_{X_2}(U_1)$ and the parents' $\pi$ messages (figure: nodes $U_1, U_2, U_3$ with children $X_1, X_2$). • No guarantees for convergence. • Works well for many coding networks. slides10 828X 2019
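
A compact sketch of IBP on a small pairwise cyclic network (a 3-node cycle, so the poly-tree guarantee does not apply); the flooding schedule, potentials, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
k = 2
edges = [(0, 1), (1, 2), (0, 2)]                     # a cycle: BP is approximate here
psi = {e: rng.random((k, k)) + 0.1 for e in edges}   # pairwise potentials psi[i,j]
msg = {(i, j): np.ones(k) / k for e in edges for (i, j) in (e, e[::-1])}

for _ in range(50):                                  # iterate and hope for convergence
    new = {}
    for (i, j) in msg:
        pot = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T  # orient as [x_i, x_j]
        incoming = np.ones(k)
        for (u, v) in msg:                           # messages into i, except from j
            if v == i and u != j:
                incoming = incoming * msg[(u, v)]
        m = pot.T @ incoming                         # sum out x_i
        new[(i, j)] = m / m.sum()
    msg = new

belief1 = msg[(0, 1)] * msg[(2, 1)]                  # product of messages into node 1
print(belief1 / belief1.sum())                       # approximate marginal of node 1
```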

  30. Linear Block Codes (figure): input bits A, B, C, D, E, F, G, H feed parity (XOR) nodes producing parity bits p1–p6; Gaussian channel noise with parameter σ yields the received bits a, b, c, d, e, f, g, h and the received parity bits. slides10 828X 2019

  31. Probabilistic decoding of an error-correcting linear block code. State-of-the-art: an approximate algorithm – iterative belief propagation (IBP), i.e., Pearl's polytree algorithm applied to loopy networks. slides10 828X 2019

  32. MBE-mpe vs. IBP. MBE-mpe is better on low-w* codes; IBP (or BP) is better on randomly generated (high-w*) codes. Bit error rate (BER) as a function of noise (sigma). slides10 828X 2019

  33. Grid 15x15 – 10 evidence (plots): Grid 15x15, evid=10, w*=22, 10 instances. Four panels compare MC and IBP as a function of the i-bound: absolute error, NHD, relative error, and time (seconds). slides10 828X 2019

  34. Outline • Mini-bucket elimination • Weighted Mini-bucket • Mini-clustering • Iterative Belief propagation • Iterative-join-graph propagation • Re-parameterization, cost-shifting slides10 828X 2019

  35. Iterative Belief Propagation • Belief propagation is exact for poly-trees. • IBP applies BP iteratively to cyclic networks (same one-step update figure as before). • No guarantees for convergence. • Works well for many coding networks. • Let's combine the iterative nature of IBP with anytime behavior: IJGP. slides10 828X 2019

  36. Iterative Join Graph Propagation • Loopy Belief Propagation • Cyclic graphs • Iterative • Converges fast in practice (no guarantees though) • Very good approximations (e.g., turbo decoding, LDPC codes, SAT – survey propagation) • Mini-Clustering(i) • Tree decompositions • Only two sets of messages (inward, outward) • Anytime behavior – can improve with more time by increasing the i-bound • We want to combine: • Iterative virtues of Loopy BP • Anytime behavior of Mini-Clustering(i) slides10 828X 2019

  37. IJGP - The basic idea • Apply Cluster Tree Elimination to any join-graph • We commit to graphs that are I-maps • Avoid cycles as long as I-mapness is not violated • Result: use minimal arc-labeled join-graphs slides10 828X 2019

  38. Tree Decomposition for Belief Updating (figure): belief network over A, B, C, D, E, F, G with CPTs p(a), p(b|a), p(c|a,b), p(d|b), p(e|b,f), p(f|c,d), p(g|e,f).

  39. Tree Decomposition for Belief Updating (figure): the same network with tree decomposition clusters {A,B,C}: p(a), p(b|a), p(c|a,b); {B,C,D,F}: p(d|b), p(f|c,d); {B,E,F}: p(e|b,f); {E,F,G}: p(g|e,f); separators BC, BF, EF.

  40. CTE: Cluster Tree Elimination.
     $h_{(1,2)}(b,c) = \sum_a p(a)\,p(b|a)\,p(c|a,b)$
     $h_{(2,1)}(b,c) = \sum_{d,f} p(d|b)\,p(f|c,d)\,h_{(3,2)}(b,f)$
     $h_{(2,3)}(b,f) = \sum_{c,d} p(d|b)\,p(f|c,d)\,h_{(1,2)}(b,c)$
     $h_{(3,2)}(b,f) = \sum_e p(e|b,f)\,h_{(4,3)}(e,f)$
     $h_{(3,4)}(e,f) = \sum_b p(e|b,f)\,h_{(2,3)}(b,f)$
     $h_{(4,3)}(e,f) = p(G = g_e \mid e,f)$
     Time: O(exp(w+1)). Space: O(exp(sep)). For each cluster, P(X|e) is computed, and also P(e).

  41. Example. A tree decomposition for a belief network BN = ⟨X, D, G, P⟩ is a triple ⟨T, χ, ψ⟩, where T = (V, E) is a tree and χ and ψ are labeling functions associating with each vertex v ∈ V two sets, χ(v) ⊆ X and ψ(v) ⊆ P, satisfying:
     1. For each function p_i ∈ P there is exactly one vertex v such that p_i ∈ ψ(v) and scope(p_i) ⊆ χ(v).
     2. For each variable X_i, the set {v ∈ V | X_i ∈ χ(v)} forms a connected subtree (running intersection property).
     (Figure: the belief network over A–G and its tree decomposition with clusters {A,B,C}: p(a), p(b|a), p(c|a,b); {B,C,D,F}: p(d|b), p(f|c,d); {B,E,F}: p(e|b,f); {E,F,G}: p(g|e,f); separators BC, BF, EF.)

  42. IJGP - The basic idea • Apply Cluster Tree Elimination to any join-graph • We commit to graphs that are I-maps • Avoid cycles as long as I-mapness is not violated • Result: use minimal arc-labeled join-graphs slides10 828X 2019

  43. Minimal Arc-Labeled Decomposition (figure): clusters ABCDE, BCE, CDEF. (a) Fragment of an arc-labeled join-graph; (b) shrinking the arc labels makes it a minimal arc-labeled join-graph. • Use a DFS algorithm to eliminate cycles relative to each variable. slides10 828X 2019

  44. Minimal arc-labeled join-graph

  45. Message propagation. Cluster 1 contains p(a), p(c), p(b|a,c), p(d|a,b,e), p(e|b,c) and the incoming message $h_{(3,1)}(b,c)$ (figure: join-graph with clusters such as ABCDE, BCE, CDEF, FGH, FGI, GHIJ).
     Minimal arc-labeled, sep(1,2) = {D,E}, elim(1,2) = {A,B,C}:
     $h_{(1,2)}(d,e) = \sum_{a,b,c} p(a)\,p(c)\,p(b|a,c)\,p(d|a,b,e)\,p(e|b,c)\,h_{(3,1)}(b,c)$
     Non-minimal arc-labeled, sep(1,2) = {C,D,E}, elim(1,2) = {A,B}:
     $h_{(1,2)}(c,d,e) = \sum_{a,b} p(a)\,p(c)\,p(b|a,c)\,p(d|a,b,e)\,p(e|b,c)\,h_{(3,1)}(b,c)$
     slides10 828X 2019

  46. IJGP – Example (figure): a belief network over A–J and the corresponding loopy BP graph, with clusters ABC, ABDE, BCE, CDEF, FGH, FGI, GHIJ and arc labels such as AB, BC, BE, C, DE, CE, F, FG, GH, H, GI. slides10 828X 2019

  47. Arc-Minimal Join-Graph (figure: same clusters as before). Arcs labeled with any single variable should form a TREE. slides10 828X 2019

  48. Collapsing Clusters (figure): clusters of the join-graph are merged, e.g., ABC and ABDE collapse into ABCDE, and FGH, FGI, GHIJ collapse toward FGHI and GHIJ, yielding a coarser join-graph. slides10 828X 2019

  49. Join-Graphs (figure): a spectrum of join-graphs over the same network, from the loopy BP graph to the join-tree; moving along it trades more accuracy against less complexity. slides10 828X 2019

  50. Bounded decompositions • We want arc-labeled decompositions such that: • the cluster size (internal width) is bounded by i (the accuracy parameter) • Possible approaches to build decompositions: • partition-based algorithms - inspired by the mini-bucket decomposition • grouping-based algorithms slides10 828X 2019

  51. Constructing Join-Graphs. (a) Schematic mini-bucket(i), i = 3, with buckets:
     G: (GFE) P(G|F,E)
     E: (EBF) (EF) P(E|B,F)
     F: (FCD) (BF) P(F|C,D)
     D: (DB) (CD) P(D|B)
     C: (CAB) (CB) P(C|A,B)
     B: (BA) (AB) (B) P(B|A)
     A: (A) P(A)
     (b) The corresponding arc-labeled join-graph decomposition, with clusters GFE, EBF, FCD, CDB, CAB, BA, A and separators such as EF, BF, CD, CB, BA, A. slides10 828X 2019

  52. IJGP properties • IJGP(i) applies BP to a minimal arc-labeled join-graph whose cluster size is bounded by i. • On join-trees, IJGP finds exact beliefs. • IJGP is a Generalized Belief Propagation algorithm (Yedidia, Freeman, Weiss 2001). • Complexity of one iteration: time O(deg · (n+N) · d^(i+1)); space O(N · d^θ), where θ bounds the arc-label (separator) size. slides10 828X 2019

  53. Empirical evaluation. Algorithms: Exact, IBP, MC, IJGP. Measures: absolute error, relative error, Kullback-Leibler (KL) distance, bit error rate, time. Networks (all variables are binary): random networks, grid networks (MxM), CPCS 54, 360, 422, coding networks. slides10 828X 2019

  54. Coding Networks – Bit Error Rate (plots): N=400, w*=43, 30 iterations, 500–1000 instances; BER vs. i-bound for IBP, MC, and IJGP at σ = .22, .32, .51, .65. slides10 828X 2019

  55. CPCS 422 – KL Distance (plots): w*=23, 1 instance, for evidence=0 and evidence=30; KL distance vs. i-bound for IJGP (30 iterations, at convergence), MC, and IBP (10 iterations, at convergence). slides10 828X 2019

  56. CPCS 422 – KL vs. Iterations (plots): w*=23, 1 instance, for evidence=0 and evidence=30; KL distance vs. number of iterations for IJGP(3), IJGP(10), and IBP. slides10 828X 2019

  57. Coding Networks – Time (plot): N=400, 500 instances, 30 iterations, w*=43; time (seconds) vs. i-bound for IJGP (30 iterations), MC, and IBP (30 iterations). slides10 828X 2019

  58. More on the Power of Belief Propagation • BP as local minima of KL distance (read Darwiche). • BP's power from a constraint propagation perspective.

  59. Lambda is grounding for evidence e.

  60. Theorem: Yedidia, Freeman and Weiss 2005

  61. Summary of IJGP so far

  62. Outline • Mini-bucket elimination • Weighted Mini-bucket • Mini-clustering • Iterative Belief propagation • Iterative-join-graph propagation • Re-parameterization, cost-shifting slides10 828X 2019

  63. Cost-Shifting (Reparameterization). Example over domain {b, g}, with shift λ(B): λ(b) = 3, λ(g) = -1, added to f(A,B) and subtracted from f(B,C):

     A B   f(A,B)       B C   f(B,C)       A B C   f(A,B,C)
     b b   6 + 3        b b   6 - 3        b b b   12
     b g   0 - 1        b g   0 - 3        b b g    6
     g b   0 + 3        g b   0 + 1        b g b    0
     g g   6 - 1        g g   6 + 1        b g g    6 (= 0 + 6)
                                           g b b    6
                                           g b g    0
                                           g g b    6
                                           g g g   12

     Modify the individual functions, but keep the sum of functions the same. slides10 828X 2019

  64. Tightening the bound • Reparameterization (or "cost shifting") decreases the bound without changing the overall function.

     A B   f1(A,B)      B C   f2(B,C)      A B C   F(A,B,C) = f1 + f2
     0 0   2.0          0 0   1.0          0 0 0   3.0
     0 1   1.0          0 1   0.0          0 0 1   2.0
     1 0   3.5          1 0   1.0          0 1 0   2.0
     1 1   3.0          1 1   3.0          0 1 1   4.0
                                           1 0 0   4.5
                                           1 0 1   3.5
                                           1 1 0   4.0
                                           1 1 1   6.0

     Shifting λ(B) into f1 and out of f2 (+1 into f1 and -1 out of f2 for B = 1) leaves F unchanged (the adjusting functions cancel each other), and here makes the decomposition bound exact. slides10 828X 2019
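
A numeric sketch using the two tables on this slide; the shift λ(B) = (0, 1) is one choice that makes the bound tight here:

```python
import numpy as np

f1 = np.array([[2.0, 1.0],    # f1(A,B) from the slide, rows = A, cols = B
               [3.5, 3.0]])
f2 = np.array([[1.0, 0.0],    # f2(B,C) from the slide, rows = B, cols = C
               [1.0, 3.0]])
F = f1[:, :, None] + f2[None, :, :]      # F(A,B,C) = f1 + f2
print(F.max(), f1.max() + f2.max())      # 6.0 vs 6.5: the decomposed bound is loose

lam = np.array([0.0, 1.0])               # cost shift lambda(B)
f1s = f1 + lam[None, :]                  # shift lambda into f1 ...
f2s = f2 - lam[:, None]                  # ... and out of f2: shifts cancel in the sum
assert np.allclose(F, f1s[:, :, None] + f2s[None, :, :])   # F is unchanged
print(f1s.max() + f2s.max())             # 6.0: the bound is now exact
```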

  65. Dual Decomposition (figure: pairwise factors $g_{12}(y_1,y_2)$, $g_{23}(y_2,y_3)$, $g_{13}(y_1,y_3)$ on a triangle, split into independent copies $g_{12}(\cdot)$, $g_{23}(\cdot)$, $g_{13}(\cdot)$).
     $G^* = \min_y \sum_\beta g_\beta(y) \;\ge\; \sum_\beta \min_y g_\beta(y)$
     • Bound the solution using decomposed optimization. • Solve independently: an optimistic bound. slides10 828X 2019
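
A direct check of this bound on a random triangle model (tables random; minimization by brute force, since the model is tiny):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(5)
k = 3
g12, g23, g13 = (rng.random((k, k)) for _ in range(3))   # pairwise costs on a triangle

# Exact: G* = min_y sum_beta g_beta(y), brute force over all assignments
G = min(g12[y1, y2] + g23[y2, y3] + g13[y1, y3]
        for y1, y2, y3 in product(range(k), repeat=3))

# Decomposed: minimize each factor independently -> an optimistic lower bound
bound = g12.min() + g23.min() + g13.min()
assert bound <= G
print(bound, G)
```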

  66. Dual Decomposition (same triangle, now with messages $\mu_{k\to\beta}(y_k)$ on each variable copy). Reparameterization constraint: $\forall k: \sum_{\beta \ni k} \mu_{k\to\beta}(y_k) = 0$.
     $G^* = \min_y \sum_\beta g_\beta(y) \;\ge\; \max_\mu \sum_\beta \min_y \Big[ g_\beta(y) + \sum_{j\in\beta} \mu_{j\to\beta}(y_j) \Big]$
     • Bound the solution using decomposed optimization. • Solve independently: an optimistic bound. • Tighten the bound by reparameterization: enforce the lost equality constraints via Lagrange multipliers. slides10 828X 2019

  67. Dual Decomposition (same bound as above). Many names for the same class of bounds:
     ‒ Dual decomposition [Komodakis et al. 2007]
     ‒ TRW, MPLP [Wainwright et al. 2005; Globerson & Jaakkola 2007]
     ‒ Soft arc consistency [Cooper & Schiex 2004]
     ‒ Max-sum diffusion [Werner 2007]
     slides10 828X 2019

  68. Dual Decomposition (same bound as above). Many ways to optimize the bound:
     ‒ Sub-gradient descent [Komodakis et al. 2007; Jojic et al. 2010]
     ‒ Coordinate descent [Werner 2007; Globerson & Jaakkola 2007; Sontag et al. 2009; Ihler et al. 2012]
     ‒ Proximal optimization [Ravikumar et al. 2010]
     ‒ ADMM [Meshi & Globerson 2011; Martins et al. 2011; Forouzan & Ihler 2013]
     slides10 828X 2019
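
A toy projected-subgradient sketch in the spirit of the first bullet above (my own construction; the step size, schedule, and data structures are all assumptions). It ascends the lower bound $\sum_\beta \min_y [g_\beta + \sum_j \mu_{j\to\beta}]$ while keeping each variable's messages summing to zero:

```python
import numpy as np

rng = np.random.default_rng(6)
k = 3
factors = {'12': rng.random((k, k)), '23': rng.random((k, k)), '13': rng.random((k, k))}
var_of = {'12': (1, 2), '23': (2, 3), '13': (1, 3)}          # triangle model
mu = {(v, f): np.zeros(k) for f, vs in var_of.items() for v in vs}

def bound():
    """Sum of independent factor minima after shifting by mu, plus the argmins."""
    total, argmins = 0.0, {}
    for f, (u, v) in var_of.items():
        t = factors[f] + mu[(u, f)][:, None] + mu[(v, f)][None, :]
        idx = np.unravel_index(t.argmin(), t.shape)
        argmins[(u, f)], argmins[(v, f)] = idx
        total += t[idx]
    return total, argmins

print(bound()[0])                            # initial (loose) lower bound
for step in range(200):                      # projected subgradient ascent
    _, am = bound()
    grad = {key: np.eye(k)[am[key]] for key in mu}     # one-hot argmin indicators
    for v in (1, 2, 3):
        fs = [f for f in var_of if v in var_of[f]]
        avg = sum(grad[(v, f)] for f in fs) / len(fs)
        for f in fs:                         # projection keeps sum_f mu[(v,f)] = 0
            mu[(v, f)] += 0.1 / (1 + step) ** 0.5 * (grad[(v, f)] - avg)
print(bound()[0])                            # tightened lower bound on G*
```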
