Bounded inference, non-iteratively: Mini-bucket elimination
COMPSCI 276, Spring 2017, Set 8: Rina Dechter
Reading: Class Notes (8), Darwiche chapter 14
Agenda: Mini-bucket elimination; Weighted Mini-bucket
Queries over a belief network with variables X and evidence e:
‒ Belief updating: BEL(X_i) = P(X_i | evidence)
‒ MPE: x* = argmax_x P(x, e)
‒ MAP: (a*_1, …, a*_k) = argmax over the hypothesis variables A of Σ_{X\A} P(x, e)
‒ MEU: (d*_1, …, d*_k) = argmax over the decision variables D of Σ_{X\D} P(x, e)·U(x), where U(x) is a utility function
The network represents P(x) = Π_{i=1}^n P(x_i | x_{pa_i}); belief updating computes BEL(X_i) = α Σ_{X \ {X_i}} Π_{j=1}^n P(x_j | x_{pa_j}), with the functions restricted to the evidence e.
Belief updating: P(a | e) ∝ P(a, e). For the network with CPTs P(a), P(b|a), P(c|a), P(d|a,b), P(e|b,c):

P(a, e) = Σ_{b,c,d} P(a)·P(b|a)·P(c|a)·P(d|a,b)·P(e|b,c)

Query: BEL(A). Elimination order: d, e, b, c.

bucket D: P(d|a,b) → f_D(a,b) = Σ_d P(d|a,b)
bucket E: P(e|b,c) → f_E(b,c) = Σ_e P(e|b,c) (restricted to the observed e)
bucket B: P(b|a), f_D(a,b), f_E(b,c) → f_B(a,c) = Σ_b P(b|a)·f_D(a,b)·f_E(b,c)
bucket C: P(c|a), f_B(a,c) → f_C(a) = Σ_c P(c|a)·f_B(a,c)
bucket A: P(a), f_C(a) → P(a, e) = P(a)·f_C(a)

Bucket tree over the clusters {D,A,B}, {E,B,C}, {B,A,C}, {C,A}, {A}, holding the original functions and the messages f_D(a,b), f_E(b,c), f_B(a,c), f_C(a). Time and space: exp(w*).
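The bucket-elimination pass above can be sketched in code. This is a minimal illustration, assuming binary variables and randomly generated CPTs (the slides give only the network structure, not numeric tables):

```python
import numpy as np

# Toy CPTs for the network A->B, A->C, (A,B)->D, (B,C)->E, all variables binary.
# The numeric values are made up; only the structure comes from the slides.
rng = np.random.default_rng(0)
def cpt(*shape):
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)  # normalize over the child (last axis)

P_a = cpt(2)            # P(A)
P_b_a = cpt(2, 2)       # P(B|A), indexed [a, b]
P_c_a = cpt(2, 2)       # P(C|A)
P_d_ab = cpt(2, 2, 2)   # P(D|A,B)
P_e_bc = cpt(2, 2, 2)   # P(E|B,C)

e = 0  # observed value of E

# Bucket elimination along the order d, e, b, c:
f_D = P_d_ab.sum(axis=2)                           # bucket D -> f_D(a,b)
f_E = P_e_bc[:, :, e]                              # bucket E, evidence E=e -> f_E(b,c)
f_B = np.einsum('ab,ab,bc->ac', P_b_a, f_D, f_E)   # bucket B -> f_B(a,c)
f_C = np.einsum('ac,ac->a', P_c_a, f_B)            # bucket C -> f_C(a)
P_ae = P_a * f_C                                   # bucket A: P(a, e)

# Validate against brute-force enumeration of the full joint.
joint = np.einsum('a,ab,ac,abd,bce->abcde', P_a, P_b_a, P_c_a, P_d_ab, P_e_bc)
assert np.allclose(P_ae, joint[:, :, :, :, e].sum(axis=(1, 2, 3)))
print(P_ae, P_ae / P_ae.sum())  # P(a, e) and BEL(A) = P(a | e)
```

The forward pass touches each bucket once; the brute-force check enumerates the full joint only to validate the result.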
Finding the MPE: the elimination operator Σ is replaced by max,

MPE = max_{a,b,c,d,e} P(a)·P(c|a)·P(b|a)·P(d|b,a)·P(e|b,c)

w* = 4: the “induced width” (max clique size).
Generating the MPE tuple from the recorded functions h_B(a,d,c,e), h_C(a,d,e), h_D(a,e), h_E(a):

1. a′ = argmax_a P(a)·h_E(a)
2. e′ = argmax_e h_D(a′, e)
3. d′ = argmax_d h_C(a′, d, e′)
4. c′ = argmax_c P(c|a′)·h_B(a′, d′, c, e′)
5. b′ = argmax_b P(b|a′)·P(d′|b,a′)·P(e′|b,c′)
The computation in a bucket is exponential in the number of variables involved; the idea is to partition each bucket into “mini-buckets” on smaller numbers of variables.
Split a bucket into mini-buckets ⇒ bound complexity. Exponential complexity decrease: O(e^n) becomes O(e^r) + O(e^{n−r}).
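A quick numeric illustration of the complexity split (hypothetical domain size and variable counts, not from the slides):

```python
# Size of the function recorded by a full bucket vs. its two mini-buckets,
# for domain size d, bucket width n, and split point r.
d, n, r = 2, 20, 10
full = d ** n                  # one function on all n variables: exponential in n
split = d ** r + d ** (n - r)  # two mini-buckets: exp(r) + exp(n - r)
print(full, split)             # 1048576 vs 2048
```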
MBE-mpe trace (mini-buckets A, B, C, D, E): original functions F(a,b), F(a,c), F(a,d), F(b,c), F(b,d), F(b,e), F(c,e), with evidence e = 0. Bucket B is split into mini-buckets, each processed by max_B Π, producing the messages h_B(a,c) and h_B(d,e); the remaining buckets produce h_C(e,a), h_D(e,a), h_E(a), and bucket A yields an upper bound U on the MPE. We can generate a solution s going forward as before, giving a lower bound L = F(s).
Mini-bucket elimination (min-sum example): bucket C holds g(b,c), g(c,d), g(c,e), g(c,f); the other buckets hold g(d,b), g(d,f), g(b,e), g(b). Bucket C is split into the mini-buckets {g(b,c′), g(c′,d)} and {g(c,e), g(c,f)}, duplicating C as C′; each is eliminated separately (min_{C′} g(·) and min_C g(·)), producing μ_{C→D}(b,d) and μ_{C→E}(e,f). The remaining messages are μ_{D→F}(b,f), μ_{E→F}(b,f), μ_{F→B}(b). The result L is a lower bound [Dechter and Rish, 1997; 2003].
Before splitting: network N. After splitting: network N′. Variables in different buckets are renamed and duplicated (Kask et al., 2001; Geffner et al., 2007; Choi, Chavira, Darwiche, 2007).
Example: MBE-mpe(3) versus BE-mpe [Dechter and Rish, 1997]. With i-bound 3 (the maximum number of variables in a mini-bucket), MBE-mpe(3) produces the messages μ_{C→D}(b,d), μ_{C→E}(e,f), μ_{D→F}(b,f), μ_{E→F}(b,f), μ_{F→B}(b) and an upper bound U. Exact BE-mpe keeps each bucket whole, producing μ_{C→D}(b,d,e,f), μ_{D→E}(b,e,f), μ_{E→F}(b,f), μ_{F→B}(b) and the exact optimum OPT (induced width w* = 4, versus 2 after splitting).
ECAI 2016
Mini-bucket elimination with greedy decoding: bucket C holds g(b,c), g(c,d), g(c,e), g(c,f) and is split into mini-buckets (min_{C′} g(·) and min_C g(·)), producing μ_{C→D}(b,d) and μ_{C→E}(e,f); the remaining messages are μ_{D→F}(b,f), μ_{E→F}(b,f), μ_{F→B}(b), and L is a lower bound. Going forward:

b̂ = argmin_b [ g(b) + μ_{F→B}(b) ]
f̂ = argmin_f [ μ_{D→F}(b̂,f) + μ_{E→F}(b̂,f) ]
ê = argmin_e [ g(b̂,e) + μ_{C→E}(e,f̂) ]
d̂ = argmin_d [ μ_{C→D}(b̂,d) + g(d,b̂) + g(d,f̂) ]
ĉ = argmin_c [ g(b̂,c) + g(c,d̂) + g(c,ê) + g(c,f̂) ]

The cost of this greedy configuration is an upper bound [Dechter and Rish, 2003].
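The min-sum mini-bucket pass and the greedy decoding steps above can be sketched as follows, on randomly generated cost tables with the slide's scopes (the numeric values are made up):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
# Random binary cost tables; only the scopes g(b,c), g(c,d), g(c,e), g(c,f),
# g(b,d), g(d,f), g(b,e), g(b) come from the slides.
g_bc, g_cd, g_ce, g_cf, g_bd, g_df, g_be = (rng.random((2, 2)) for _ in range(7))
g_b = rng.random(2)

def cost(b, c, d, e, f):
    return (g_bc[b, c] + g_cd[c, d] + g_ce[c, e] + g_cf[c, f]
            + g_bd[b, d] + g_df[d, f] + g_be[b, e] + g_b[b])

# Bucket C split into mini-buckets {g(b,c), g(c,d)} and {g(c,e), g(c,f)}:
mu_CD = np.min(g_bc[:, :, None] + g_cd[None, :, :], axis=1)            # mu_{C->D}(b,d)
mu_CE = np.min(g_ce[:, :, None] + g_cf[:, None, :], axis=0)            # mu_{C->E}(e,f)
mu_DF = np.min((g_bd + mu_CD)[:, :, None] + g_df[None, :, :], axis=1)  # mu_{D->F}(b,f)
mu_EF = np.min(g_be[:, :, None] + mu_CE[None, :, :], axis=1)           # mu_{E->F}(b,f)
mu_FB = np.min(mu_DF + mu_EF, axis=1)                                  # mu_{F->B}(b)
L = np.min(g_b + mu_FB)                                                # lower bound

# Greedy forward decoding (the argmin steps above) -> upper bound:
b = int(np.argmin(g_b + mu_FB))
f = int(np.argmin(mu_DF[b] + mu_EF[b]))
e = int(np.argmin(g_be[b] + mu_CE[:, f]))
d = int(np.argmin(mu_CD[b] + g_bd[b] + g_df[:, f]))
c = int(np.argmin(g_bc[b] + g_cd[:, d] + g_ce[:, e] + g_cf[:, f]))
U = cost(b, c, d, e, f)

exact = min(cost(*z) for z in itertools.product((0, 1), repeat=5))
assert L <= exact + 1e-12 and exact <= U + 1e-12  # L <= OPT <= U
print(L, exact, U)
```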
MBE-bel and MBE-map approximate the summation of BE-bel by migrating a factor out of the sum:

Σ_X f(x)·g(x) is replaced by (Σ_X f(x)) · (max_X g(x))

(replacing max_X by min_X yields a lower-bound).
We sometimes normalize the approximation, but then there is no guarantee: the problem is that P(e) must also be approximated.
Anytime-mpe(0.0001): U/L error vs time. [Figure: Upper/Lower ratio (0.6–3.8) as a function of time and of the parameter i (i = 1 … 21), for cpcs360b and cpcs422b.] Test case: no evidence.

Algorithm            Time (sec): cpcs360 | cpcs422
elim-mpe                         115.8   | 1697.6
anytime-mpe(10^-4)                70.3   |  505.2
anytime-mpe(10^-1)                70.3   |  110.5
(Qiang Liu slides)
Weighted mini-bucket: bucket C holds g(b,c), g(c,d), g(c,e), g(c,f) (the other buckets hold g(b,d), g(d,f), g(b,e), g(b)); splitting it produces μ_{C→D}(b,d) and μ_{C→E}(e,f), then μ_{D→F}(b,f), μ_{E→F}(b,f), μ_{F→B}(b). U = upper bound.
The weighted or “power” sum operator is

Σ_y^x g(y) := ( Σ_y g(y)^{1/x} )^x.

Exact bucket elimination computes μ_C(b,d,e,f) = Σ_c g(b,c)·g(c,d)·g(c,e)·g(c,f). The mini-buckets bound it:

Σ_c^x g(b,c)·g(c,d)·g(c,e)·g(c,f) ≤ ( Σ_c^{x1} g(b,c)·g(c,d) ) · ( Σ_c^{x2} g(c,e)·g(c,f) ) = μ_{C→D}(b,d) · μ_{C→E}(e,f)

by Hölder’s inequality (for summation):

Σ_y^x g_1(y)·g_2(y) ≤ ( Σ_y^{x1} g_1(y) ) · ( Σ_y^{x2} g_2(y) ), where x_1 + x_2 = x, x_1 > 0, x_2 > 0

(a lower bound if x_1 > 0, x_2 < 0) [Liu and Ihler, 2011].
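The power-sum operator and the Hölder bound can be checked numerically; a small sketch with made-up factor values:

```python
import numpy as np

def power_sum(g, x):
    # Weighted ("power") sum over y: ( sum_y g(y)^(1/x) )^x.
    # x = 1 is the ordinary sum; as x -> 0+ it approaches max_y g(y).
    return np.sum(g ** (1.0 / x)) ** x

rng = np.random.default_rng(2)
g1, g2 = rng.random(4), rng.random(4)  # two mini-bucket factors sharing variable y

x1, x2 = 0.3, 0.7                      # weights with x1 + x2 = 1, both positive
exact = np.sum(g1 * g2)                # exact bucket sum (x = 1)
bound = power_sum(g1, x1) * power_sum(g2, x2)  # Holder's inequality
assert exact <= bound + 1e-12          # upper bound, as claimed

assert abs(power_sum(g1, 1.0) - g1.sum()) < 1e-12  # x = 1 recovers the sum
print(exact, bound)
```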
Process max buckets with max mini-buckets, and sum buckets with sum mini-buckets and max mini-buckets.
Weighted mini-bucket for Marginal MAP [Liu and Ihler, 2011; 2013], [Dechter and Rish, 2003]: sum buckets (Σ_C, Σ_D) and max buckets; bucket C holds g(b,c), g(c,d), g(c,e), g(c,f), and the mini-bucket messages are μ_{C→D}(b,d), μ_{C→E}(e,f), μ_{D→F}(b,f), μ_{E→F}(b,f), μ_{F→B}(b), with U = upper bound:

μ_{C→D}(b,d) = Σ_c^{x1} g(b,c)·g(c,d),  μ_{C→E}(e,f) = Σ_c^{x2} g(c,e)·g(c,f)   (x_1 + x_2 = 1)
μ_{F→B}(b) = max_f μ_{D→F}(b,f)·μ_{E→F}(b,f)
V = max_b g(b)·μ_{F→B}(b)

One can optimize over cost-shifting and weights (single-pass “MM” or iterative message passing).
Error-correcting linear block code State-of-the-art: approximate algorithm – iterative belief propagation (IBP) (Pearl’s poly-tree algorithm applied to loopy networks)
Initial partitioning
[Figure: belief network for decoding, with information bits U1, U2, U3 and code bits X1, X2, X3. Step one: compute BEL(U1).]
Bit error rate (BER) as a function of noise (sigma): on randomly generated codes IBP is better; on structured (low-w*) codes mpe is better.
Mini-buckets – local inference approximation
Idea: bound size of recorded functions
MBE-mpe(i) - mini-bucket algorithm for MPE
Better results for noisy-OR than for random problems
Accuracy increases with decreasing noise in coding
Accuracy increases for likely evidence
Sparser graphs -> higher accuracy
Coding networks: MBE-mpe outperforms IBP on low-induced-width codes
where deg = the maximum degree of a node n = number of variables (= number of CPTs) N = number of nodes in the tree decomposition d = the maximum domain size of a variable w* = the induced width sep = the separator size
Cluster Tree Elimination (CTE) example: join tree with clusters 1 = ABC, 2 = BCDF, 3 = BEF, 4 = EFG and separators BC, BF, EF. Messages:

h_(1,2)(b,c) = Σ_a p(a)·p(b|a)·p(c|a,b)
h_(2,1)(b,c) = Σ_{d,f} p(d|b)·p(f|c,d)·h_(3,2)(b,f)
h_(2,3)(b,f) = Σ_{c,d} p(d|b)·p(f|c,d)·h_(1,2)(b,c)
h_(3,2)(b,f) = Σ_e p(e|b,f)·h_(4,3)(e,f)
h_(3,4)(e,f) = Σ_b p(e|b,f)·h_(2,3)(b,f)
h_(4,3)(e,f) = p(G = g_e | e,f)

Time and space: exp(cluster size) = exp(treewidth). EXACT algorithm.
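The first CTE message above is just a tensor contraction; a tiny sketch with random CPTs (made-up values):

```python
import numpy as np

rng = np.random.default_rng(4)
def cpt(*shape):
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)  # normalize over the child variable

p_a = cpt(2)           # p(a)
p_b_a = cpt(2, 2)      # p(b|a), indexed [a, b]
p_c_ab = cpt(2, 2, 2)  # p(c|a,b), indexed [a, b, c]

# Cluster ABC sends h_(1,2)(b,c) = sum_a p(a) p(b|a) p(c|a,b) over separator BC:
h_12 = np.einsum('a,ab,abc->bc', p_a, p_b_a, p_c_ab)
assert np.allclose(h_12.sum(), 1.0)  # here it is exactly the marginal P(b,c)
print(h_12)
```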
Mini-Clustering idea: a cluster message elim(Π_{i=1}^n h_i) is replaced by the separately computed elim(Π_{i=1}^r h_i) and elim(Π_{i=r+1}^n h_i). Exponential complexity decrease: O(e^n) becomes O(e^r) + O(e^{n−r}). APPROXIMATE algorithm.
Mini-Clustering MC(3): the i-bound (3) is the maximum number of variables in a mini-cluster. Cluster 2 (BCDF) is split into the mini-clusters {p(d|b), h_(1,2)(b,c)} and {p(f|c,d)}:

ABC: p(a), p(b|a), p(c|a,b)
BCD: p(d|b), h_(1,2)(b,c)
CDF: p(f|c,d)
BEF: p(e|b,f), h1_(2,3)(b), h2_(2,3)(f)
EFG: p(g|e,f)

with separators BC, BF, EF; e.g., h1_(1,2)(b,c) = Σ_a p(a)·p(b|a)·p(c|a,b). Time and space: exp(i-bound). APPROXIMATE algorithm.
MC(3) messages (clusters ABC, BCDF, BEF, EFG; separators BC, BF, EF):

H_(1,2) = { h1_(1,2)(b,c) := Σ_a p(a)·p(b|a)·p(c|a,b) }
H_(2,1) = { h1_(2,1)(b) := Σ_{d,f} p(d|b)·h1_(3,2)(b,f),  h2_(2,1)(c) := max_{d,f} p(f|c,d) }
H_(2,3) = { h1_(2,3)(b) := Σ_{c,d} p(d|b)·h1_(1,2)(b,c),  h2_(2,3)(f) := max_{c,d} p(f|c,d) }
H_(3,2) = { h1_(3,2)(b,f) := Σ_e p(e|b,f)·h1_(4,3)(e,f) }
H_(3,4) = { h1_(3,4)(e,f) := Σ_b p(e|b,f)·h1_(2,3)(b)·h2_(2,3)(f) }
H_(4,3) = { h1_(4,3)(e,f) := p(G = g_e | e,f) }
[Figure: message flow on the join tree ABC – BCDF – BEF – EFG (separators BC, BF, EF): h1_(1,2)(b,c); h1_(2,1)(b), h2_(2,1)(c); h1_(2,3)(b), h2_(2,3)(f); h1_(3,2)(b,f); h1_(3,4)(e,f); h1_(4,3)(e,f).]
[Figure: Grid 15x15, evid=10, w*=22, 10 instances. Four panels plot NHD, absolute error, relative error, and time (seconds) as a function of the i-bound (2–18), comparing MC with IBP.]
[Figure: CPCS 422, w*=23, 1 instance. Absolute error vs i-bound (2–18) for MC and IBP, with evid=0 and evid=10.]
[Figure: Coding networks, N=100, P=4, w*=12, 50 instances. Bit error rate vs i-bound (2–12) for MC and IBP, with sigma=.51 and sigma=.22.]
Scope-based Partitioning Heuristic. The scope-based partitioning heuristic (SCP) aims at minimizing the number of mini-buckets in the partition by including in each mini-bucket as many functions as possible, as long as the i-bound is satisfied. First, single-function mini-buckets are ordered by decreasing arity; then each mini-bucket is merged into the first mini-bucket with which it can be merged. The time and space complexity of Partition(B, i), where B is the partitioned bucket, using the SCP heuristic is O(|B| log(|B|) + |B|^2) and O(exp(i)), respectively. The scope-based heuristic is quite fast; its shortcoming is that it does not consider the actual information in the functions.
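A rough sketch of the SCP heuristic as described above (the merge order and data structures are simplified assumptions, not the exact Partition(B, i) procedure):

```python
def scp_partition(scopes, i_bound):
    """Scope-based partitioning sketch: each function (given by its variable
    scope) joins the first mini-bucket whose combined scope stays within the
    i-bound; otherwise it starts a new mini-bucket."""
    mini_buckets = []  # each entry: {'scope': set of variables, 'fns': indices}
    order = sorted(range(len(scopes)), key=lambda j: -len(scopes[j]))  # decreasing arity
    for j in order:
        for mb in mini_buckets:
            if len(mb['scope'] | scopes[j]) <= i_bound:
                mb['scope'] |= scopes[j]
                mb['fns'].append(j)
                break
        else:  # no mergeable mini-bucket found: start a new one
            mini_buckets.append({'scope': set(scopes[j]), 'fns': [j]})
    return mini_buckets

# Hypothetical bucket of functions with scopes {b,c}, {b,d}, {b,e} under i-bound 3:
parts = scp_partition([{'b', 'c'}, {'b', 'd'}, {'b', 'e'}], i_bound=3)
print([sorted(mb['scope']) for mb in parts])  # -> [['b', 'c', 'd'], ['b', 'e']]
```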
Use a greedy heuristic derived from a distance function to decide which functions go into a single mini-bucket.
(Reparameterization) Shift a function λ(B) between factors: modify the individual functions but keep the sum of functions the same.

f(A,B): bb 6+3, bg 0−1, gb 0+3, gg 6−1
f(B,C): bb 6−3, bg 0−3, gb 0+1, gg 6+1
λ(B): b 3, g −1
f(A,B,C) = f(A,B)+f(B,C): bbb 12, bbg 6, bgb 0, bgg 6, gbb 6, gbg 0, ggb 6, ggg 12

f(A,B) ← f(A,B) + λ(B) and f(B,C) ← f(B,C) − λ(B): the sum f(A,B,C) is unchanged.
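The cost-shifting tables above can be verified directly; the values below are read off those tables:

```python
import numpy as np

# Variables A, B, C with values (b, g) encoded as (0, 1).
f_ab = np.array([[6., 0.], [0., 6.]])   # f(A,B)
f_bc = np.array([[6., 0.], [0., 6.]])   # f(B,C)
lam  = np.array([3., -1.])              # lambda(B)

g_ab = f_ab + lam[None, :]              # f(A,B) + lambda(B)
g_bc = f_bc - lam[:, None]              # f(B,C) - lambda(B)

# The individual functions change, but the sum f(A,B,C) does not:
before = f_ab[:, :, None] + f_bc[None, :, :]
after  = g_ab[:, :, None] + g_bc[None, :, :]
assert np.allclose(before, after)
print(before[0, 0, 0], before[0, 0, 1], before[0, 1, 0])  # 12.0 6.0 0.0, as in the table
```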
Decomposition bounds: a model over y1, y2, y3 with factors g12(y1,y2), g13(y1,y3), g23(y2,y3). Splitting each variable into independent copies relaxes the problem:

G* = min_y Σ_β g_β(y) ≥ Σ_β min_y g_β(y)
Enforce the lost equality constraints via Lagrange multipliers μ_{1→12}(y1), μ_{1→13}(y1), μ_{2→12}(y2), μ_{2→23}(y2), μ_{3→13}(y3), μ_{3→23}(y3):

G* = min_y Σ_β g_β(y) ≥ max_μ Σ_β min_y [ g_β(y) + Σ_{j∈β} μ_{j→β}(y_j) ]

Reparameterization: ∀k: Σ_{β∋k} μ_{k→β}(y_k) = 0.
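A numeric sketch of the decomposition bound and one reparameterization step, with random pairwise costs (made-up values):

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
g12, g13, g23 = (rng.random((2, 2)) for _ in range(3))  # pairwise costs, binary y

G_star = min(g12[y1, y2] + g13[y1, y3] + g23[y2, y3]
             for y1, y2, y3 in itertools.product((0, 1), repeat=3))

# Splitting the variables: each factor minimizes over its own copies.
L0 = g12.min() + g13.min() + g23.min()
assert L0 <= G_star + 1e-12

# A reparameterization on y1: mu_{1->12} = -mu_{1->13}, so the messages into
# y1 sum to zero and the joint objective G* is unchanged.
mu = rng.standard_normal(2)
L1 = (g12 + mu[:, None]).min() + (g13 - mu[:, None]).min() + g23.min()
assert L1 <= G_star + 1e-12  # still a valid lower bound; maximize over mu to tighten
print(G_star, L0, L1)
```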
Many names for the same class of bounds:
‒ Dual decomposition [Komodakis et al. 2007]
‒ TRW, MPLP [Wainwright et al. 2005; Globerson & Jaakkola, 2007]
‒ Soft arc consistency [Cooper & Schiex, 2004]
‒ Max-sum diffusion [Werner 2007]
Many ways to optimize the bound:
‒ Sub-gradient descent [Komodakis et al. 2007; Jojic et al. 2010]
‒ Coordinate descent [Werner 2007; Globerson & Jaakkola 2007; Sontag et al. 2009; Ihler et al. 2012]
‒ Proximal optimization [Ravikumar et al. 2010]
‒ ADMM [Meshi & Globerson 2011; Martins et al. 2011; Forouzan & Ihler 2013]
Tightening the mini-bucket bound by cost shifting: with the mini-bucket messages μ_{C→D}(b,d), μ_{C→E}(e,f), μ_{D→F}(b,f), μ_{E→F}(b,f), μ_{F→B}(b) (L = lower bound), the messages satisfy, bucket by bucket:

min_{b,d,c} [ g(b,c) + g(c,d) − μ_{C→D}(b,d) ] = 0
min_{e,f,c} [ g(c,e) + g(c,f) − μ_{C→E}(e,f) ] = 0
min_{b,f,d} [ μ_{C→D}(b,d) + g(b,d) + g(d,f) − μ_{D→F}(b,f) ] = 0
min_{b,e,f} [ g(b,e) + μ_{C→E}(e,f) − μ_{E→F}(b,f) ] = 0
min_{b,f} [ μ_{D→F}(b,f) + μ_{E→F}(b,f) − μ_{F→B}(b) ] = 0
min_b [ g(b) + μ_{F→B}(b) ] = M
UTA 1/2015
Mini-buckets: message update within each bucket during the downward sweep.
Join graph: clusters {b,c,d}, {c,e,f}, {b,d,f}, {b,e,f}, {b,f}, {b}, one per (mini-)bucket. L = lower bound.
MBE-mpe on the network A, B, C, D, E with P(A), P(B|A), P(C|A), P(D|A,B), P(E|B,C) and evidence E = 0. Bucket B is split into mini-buckets, each processed by max_B Π, producing h_B(A,D) and h_B(C,E); then bucket C produces h_C(A,E), bucket D produces h_D(A), bucket E produces h_E(A), and bucket A combines P(A), h_D(A), h_E(A). MPE* is an upper bound on MPE (U); generating a solution yields a lower bound (L). W = 2; m11, m12 are moment-matching messages.
[Dechter and Rish, 2003]
Weighted mini-bucket gives the same class of bounds as dual decomposition, but with an efficient “primal” bound form and a single-pass “moment-matching” variant.
Join graph with weights x1, x2 on the split bucket: clusters {b,c,d}, {c,e,f}, {b,d,f}, {b,e,f}, {b,f}, {b}. U = upper bound [Liu and Ihler, 2011].