Approximate Inference
Part 1 of 2
Tom Minka
Microsoft Research, Cambridge, UK
Machine Learning Summer School 2009
http://mlg.eng.cam.ac.uk/mlss09/
Bayesian paradigm
Consistent use of probability theory for representing unknowns
[Plot: exact posterior p(x,D) vs x]
Sampling
  Good for complex, multi-modal distributions
  Slow, but predictable accuracy
Deterministic approximation
  Good for simple, smooth distributions
  Fast, but unpredictable accuracy
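To make the sampling column concrete, here is a minimal self-normalized importance sampler for a toy 1-D posterior (my own sketch; the target density, proposal, and sample size are all assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unnormalized log-posterior (an assumption for illustration):
# skewed and non-Gaussian, the kind of target sampling handles well.
log_p = lambda x: -0.5 * (x - 1.0) ** 2 - 0.1 * x ** 4

# Draw from a broad Gaussian proposal N(0, 3^2) and weight by p/q.
xs = rng.normal(0.0, 3.0, size=100_000)
log_q = -0.5 * (xs / 3.0) ** 2          # proposal log-density up to a constant
log_w = log_p(xs) - log_q
w = np.exp(log_w - log_w.max())         # subtract max for numerical stability
w /= w.sum()

# Accuracy shrinks predictably, O(1/sqrt(n)), but needs many samples.
print("posterior mean ~=", float(np.sum(w * xs)))
```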
Laplace’s method
  Neural networks (MacKay)
Variational bounds
  (Ghahramani, Jordan, Williams)
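A minimal sketch of Laplace's method itself (my own illustration; the Gamma-shaped toy posterior and finite-difference step are assumptions): fit a Gaussian at the mode, with variance taken from the curvature of the log-density there.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy unnormalized log-posterior, Gamma-shaped (an assumption for illustration).
a, b = 3.0, 2.0
log_p = lambda x: a * np.log(x) - b * x

# Step 1: find the mode of log p.
mode = minimize_scalar(lambda x: -log_p(x), bounds=(1e-6, 20.0),
                       method="bounded").x

# Step 2: curvature at the mode (finite-difference second derivative).
h = 1e-4
d2 = (log_p(mode + h) - 2.0 * log_p(mode) + log_p(mode - h)) / h ** 2

# Laplace approximation: N(mode, -1/d2).
print(f"Laplace: N({mode:.3f}, {-1.0 / d2:.3f})")   # N(1.500, 0.750)
# The true density is Gamma(4, 2) with mean 2.0 and variance 1.0,
# so Laplace underestimates both here.
```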
Expectation Propagation
  Another way to perform deterministic approximation
  Applies to a wide range of problems
  Related algorithms:
    Assumed-density filtering (1984)
    Loopy belief propagation (1997)
    Expectation Propagation (2001)
[Plot: exact posterior p(x,D) vs x, with the best Gaussian approximation]
(naïve) Approximate each factor in isolation:
  $\tilde f_i(x) = \arg\min_{\tilde f_i} \mathrm{KL}\big(f_i(x) \,\|\, \tilde f_i(x)\big)$
ignoring the context $q^{\setminus i}(x)$, the rest of the approximation to $p(x)$.
(informed) Approximate each factor in context:
  $\tilde f_i(x) = \arg\min_{\tilde f_i} \mathrm{KL}\big(f_i(x)\, q^{\setminus i}(x) \,\|\, \tilde f_i(x)\, q^{\setminus i}(x)\big)$
so that $\tilde f_i(x)$ is accurate where $q^{\setminus i}(x)$ says $p(x)$ has mass.
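A minimal sketch of one informed update in 1-D (my own illustration, not from the slides): with a Gaussian cavity $q^{\setminus i}(x)$ and the step factor $f_i(x) = \mathbb{1}[x > 0]$, the KL minimization reduces to matching the moments of a truncated Gaussian. The cavity parameters below are assumptions.

```python
import numpy as np
from scipy.stats import norm

# Cavity q^{\i}(x) = N(m, v) (assumed values) and factor f_i(x) = 1[x > 0].
m, v = 0.5, 2.0
s = np.sqrt(v)
z = m / s

# Minimizing KL(f_i q^{\i} || q) over Gaussians q means matching the mean
# and variance of the truncated Gaussian f_i(x) N(x; m, v).
Z = norm.cdf(z)                  # normalizing constant of f_i * cavity
r = norm.pdf(z) / Z
new_mean = m + s * r             # mean of N(m, v) truncated to x > 0
new_var = v * (1.0 - r * (r + z))

# The refined factor approximation is the Gaussian ratio q_new / q^{\i}.
print(f"q_new = N({new_mean:.3f}, {new_var:.3f})")
```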
Message passing
– A distributed representation
– q(x) stands in for p(x) when answering queries
– What type of distribution to construct (approximating family)
– What cost to minimize (divergence measure)
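One concrete pairing of these two choices (a sketch under my own assumptions: a two-component Gaussian-mixture target evaluated on a grid): with a Gaussian family and the cost $\mathrm{KL}(p \,\|\, q)$, the minimizer simply matches the mean and variance of $p$.

```python
import numpy as np
from scipy.optimize import minimize

# Target p: mixture of N(-2,1) and N(2,1) on a grid (an assumption).
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
p = 0.5 * np.exp(-0.5 * (x + 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x - 2.0) ** 2)
p /= p.sum() * dx

def kl_cost(params):
    # KL(p||q) up to the constant entropy of p, for q = N(m, exp(log_v)).
    m, log_v = params
    v = np.exp(log_v)
    log_q = -0.5 * np.log(2.0 * np.pi * v) - 0.5 * (x - m) ** 2 / v
    return -np.sum(p * log_q) * dx

m_opt, log_v_opt = minimize(kl_cost, x0=[0.1, 0.0]).x
mean_p = np.sum(p * x) * dx
var_p = np.sum(p * x ** 2) * dx - mean_p ** 2
print("KL minimizer:", m_opt, np.exp(log_v_opt))   # ~ (0, 5)
print("moments of p:", mean_p, var_p)              # ~ (0, 5)
```

A different cost, e.g. $\mathrm{KL}(q \,\|\, p)$, would instead lock onto a single mode of the mixture, which is why the divergence choice matters as much as the family.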
[Plot: p(x,D) vs x, comparing exact, EP, and best-Gaussian approximations]
[Plot: p(x,D) vs x, comparing exact, Laplace, and VB approximations]
Method     Posterior mean   Posterior variance
exact      1.64864          0.359673
ep         1.64514          0.311474
laplace    1.61946          0.234616
vb         1.61834          0.171155
[Plots: posterior with 20 points vs. 200 points]
Deterministic methods improve with more data (the posterior becomes more Gaussian); sampling methods do not.
Guess the position of an object given noisy measurements
[Diagram: object at unknown positions x_1, …, x_4 with noisy measurements y_1, …, y_4]
Model:
  $x_t \sim \mathcal{N}(x_{t-1}, \sigma_x^2)$  (random walk)
  $y_t \sim \mathcal{N}(x_t, \sigma_y^2)$  (noisy measurement)
e.g. want distribution of x’s given y’s: $p(x_1, \dots, x_4 \mid y_1, \dots, y_4)$
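Because every factor here is linear-Gaussian, message passing along this chain is exactly Kalman filtering. A minimal forward pass (my own sketch; the noise variances, measurements, and broad prior are assumptions):

```python
import numpy as np

# Random-walk model: x_t ~ N(x_{t-1}, q), y_t ~ N(x_t, r).
q, r = 1.0, 1.0                               # assumed noise variances
y = np.array([0.3, 1.1, 0.9, 1.7])            # assumed measurements y_1..y_4

m, v = 0.0, 100.0                             # broad prior on x_1 (assumed)
for t, yt in enumerate(y):
    if t > 0:
        v = v + q                             # predict through the random walk
    k = v / (v + r)                           # Kalman gain
    m, v = m + k * (yt - m), (1.0 - k) * v    # condition on y_t
    print(f"p(x_{t + 1} | y_1..y_{t + 1}) = N({m:.3f}, {v:.3f})")
# A matching backward (smoothing) pass would give the posterior of every
# state given all of y_1..y_4, i.e. the full message-passing answer.
```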
[Factor graph for the model, over x_1, …, x_4]
[Figure sequence: messages passed along the chain, one step at a time]
Each forward message combines the dynamics, the local evidence, and the previous message:
  $m_{t-1 \to t}(x_t) = \int \mathcal{N}(x_t;\, x_{t-1}, \sigma_x^2)\, \mathcal{N}(y_{t-1};\, x_{t-1}, \sigma_y^2)\, m_{t-2 \to t-1}(x_{t-1})\, \mathrm{d}x_{t-1}$
Posterior for the last state: the product of all messages arriving at the last node.
[Diagram in the complex plane (Re, Im): coefficient with magnitude a and phase φ] (Qi and Minka, 2003)
Flat-fading channel model:
  $y_t = s_t x_t + \text{noise}$  (observed signal: symbol $s_t$, fading coefficient $x_t$)
  $s_1 = 1$  (known training symbol)
  $x_t$ drifts over time according to the channel dynamics
Filtering recursion: update $p(x_t \mid y_1, \dots, y_{t-1})$ to $p(x_t \mid y_1, \dots, y_t)$ as each $y_t$ arrives.
[Factor graph for the flat-fading model: symbols, channel states x_1, …, x_4, observations y_1, …, y_4]
Channel dynamics are learned from training data (all 1’s).
Symbols can also be correlated (e.g. by an error-correcting code).
[Scatter plot: Spam vs. Not spam examples]
Choose a boundary that will generalize to new data
Minimum training error solution (Perceptron)
Too arbitrary – won’t generalize well
Maximum-margin solution (SVM)
Ignores information in the vertical direction
Bayesian solution (via averaging)
Has a margin, and uses information in all dimensions
Separator is any vector w such that:
  $w^\top x_i > 0$  (class 1)
  $w^\top x_i < 0$  (class 2)
  $\|w\| = 1$  (sphere)
This set has an unusual shape.
SVM: Optimize over it. Bayes: Average over it.
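“Average over it” can be made concrete with a brute-force sketch (my own illustration; this is rejection sampling on a tiny made-up dataset, not the billiard or EP algorithms discussed later): draw unit vectors uniformly on the sphere, keep those that separate the data, and average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linearly separable dataset (an assumption for illustration).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.5, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Uniform unit vectors on the sphere: normalize Gaussian draws.
W = rng.normal(size=(100_000, 2))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Keep the separators: y_i * w^T x_i > 0 for every training point.
ok = (y * (W @ X.T) > 0).all(axis=1)

# Bayes point = mean of the surviving set, renormalized to the sphere.
w_bayes = W[ok].mean(axis=0)
w_bayes /= np.linalg.norm(w_bayes)
print("approximate Bayes point:", w_bayes, f"({ok.sum()} samples kept)")
```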
EP: Gaussian approximation to the posterior
A typical run on the 3-point problem
  Error = distance to true mean of w
  Billiard = Monte Carlo sampling (Herbrich et al., 2001)
  Opper & Winther’s algorithms:
    MF = mean-field theory
    TAP = cavity method (equivalent to Gaussian EP for this problem)
Synthetic data where 6 features are relevant (out of 20):
  Bayes picks 6
  Margin picks 13
http://research.microsoft.com/~minka/papers/ep/bpm/
http://www.kyb.tuebingen.mpg.de/bs/people/csatol/ogp/index.html
http://research.microsoft.com/infernet
http://research.microsoft.com/~minka/papers/ep/roadmap.html
http://research.microsoft.com/~minka/papers/ep/minka-ep-quickref.pdf