Approximate Inference, Part 1 of 2 (Tom Minka, Machine Learning Summer School 2009)


SLIDE 1

Approximate Inference

Part 1 of 2

Tom Minka

Microsoft Research, Cambridge, UK
Machine Learning Summer School 2009
http://mlg.eng.cam.ac.uk/mlss09/

SLIDE 2

Bayesian paradigm

  • Consistent use of probability theory for representing unknowns (parameters, latent variables, missing data)

SLIDE 3

Bayesian paradigm

  • Bayesian posterior distribution summarizes what we’ve learned from training data and prior knowledge
  • Can use posterior to:
    – Describe training data
    – Make predictions on test data
    – Incorporate new data (online learning)
  • Today’s question: How to efficiently represent and compute posteriors?

SLIDE 4

Factor graphs

  • Shows how a function of several variables can be factored into a product of simpler functions
  • f(x,y,z) = (x+y)(y+z)(x+z)
  • Very useful for representing posteriors
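To make this concrete, here is a minimal Python sketch (not from the slides; the grid and values are illustrative) that stores the factorization above as a list of factors and evaluates f as their product, then normalizes over a small grid the way an unnormalized posterior would be:

    import itertools

    # Factors of f(x,y,z) = (x+y)(y+z)(x+z); each touches only two variables.
    factors = [
        (("x", "y"), lambda x, y: x + y),
        (("y", "z"), lambda y, z: y + z),
        (("x", "z"), lambda x, z: x + z),
    ]

    def f(assignment):
        """Evaluate f as the product of its simpler factors."""
        result = 1.0
        for variables, factor in factors:
            result *= factor(*(assignment[v] for v in variables))
        return result

    # Treat f as an unnormalized distribution over a small discrete grid.
    grid = [0.0, 1.0, 2.0]
    Z = sum(f(dict(zip("xyz", v))) for v in itertools.product(grid, repeat=3))
    print(f({"x": 1.0, "y": 2.0, "z": 0.0}) / Z)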

SLIDE 5

Example factor graph

  p(x_i | m) = N(x_i; m, 1)

SLIDE 6

Two tasks

  • Modeling

– What graph should I use for this data?

  • Inference

    – Given the graph and data, what is the mean of x (for example)?
    – Algorithms:
      • Sampling
      • Variable elimination
      • Message-passing (Expectation Propagation, Variational Bayes, …)

SLIDE 7

A (seemingly) intractable problem

SLIDE 8

Clutter problem

  • Want to estimate x given multiple y’s

SLIDE 9

Exact posterior

[Plot: the exact posterior p(x,D) against x, labeled “exact”.]
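The slide does not spell out the model, so here is a hedged grid sketch assuming the standard clutter problem from Minka (2001): each y_i comes from a mixture of a Gaussian centred at x and a broad clutter Gaussian, with a broad prior on x (all numbers illustrative):

    import numpy as np
    from scipy.stats import norm

    # Assumed model: p(y | x) = (1-w) N(y; x, 1) + w N(y; 0, 10)
    #                p(x) = N(x; 0, 100)
    y = np.array([0.1, 1.7, 2.2, 3.5])     # hypothetical observations
    w = 0.5                                 # assumed clutter fraction

    xs = np.linspace(-2.0, 6.0, 2001)       # grid over the unknown x
    lik = ((1 - w) * norm.pdf(y[:, None], loc=xs, scale=1.0)
           + w * norm.pdf(y[:, None], loc=0.0, scale=np.sqrt(10.0)))
    post = norm.pdf(xs, loc=0.0, scale=10.0) * lik.prod(axis=0)
    post /= np.trapz(post, xs)              # normalized exact posterior p(x | D)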

SLIDE 10

Representing posterior distributions

  • Sampling: good for complex, multi-modal distributions; slow, but with predictable accuracy
  • Deterministic approximation: good for simple, smooth distributions; fast, but with unpredictable accuracy

SLIDE 11

Deterministic approximation

Laplace’s method

  • Bayesian curve fitting, neural networks (MacKay)
  • Bayesian PCA (Minka)

Variational bounds

  • Bayesian mixture of experts (Waterhouse)
  • Mixtures of PCA (Tipping, Bishop)
  • Factorial/coupled Markov models (Ghahramani, Jordan, Williams)

SLIDE 12

Moment matching

Another way to perform deterministic approximation

  • Much higher accuracy on some problems

Expectation Propagation (2001) builds on assumed-density filtering (1984) and is closely related to loopy belief propagation (1997).

SLIDE 13

Today

  • Moment matching (Expectation Propagation)

Tomorrow

  • Variational bounds (Variational Message Passing)

SLIDE 14

Best Gaussian by moment matching

[Plot: p(x,D) against x, comparing the exact posterior with the best Gaussian fit by moment matching.]
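Moment matching chooses the Gaussian with the same mean and variance as the exact posterior. Continuing the grid sketch above (reusing its xs and post arrays):

    # Best Gaussian by moment matching: equate mean and variance with the
    # grid-approximated exact posterior from the clutter sketch above.
    mean = np.trapz(xs * post, xs)
    var = np.trapz((xs - mean) ** 2 * post, xs)
    best_gaussian = norm(loc=mean, scale=np.sqrt(var))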

SLIDE 15

Strategy

  • Approximate each factor by a Gaussian in x

SLIDE 16

Approximating a single factor

SLIDE 17

(naïve) Approximate the factor f_i(x) on its own by a Gaussian, then multiply by the context:

  f_i(x) × q^{\i}(x)  →  f̃_i(x) × q^{\i}(x)  ≈  p(x)

SLIDE 18

(informed) Approximate the factor in the presence of its context, choosing f̃_i(x) so that the product matches where q^{\i}(x) has mass:

  f_i(x) × q^{\i}(x)  →  f̃_i(x) × q^{\i}(x)  ≈  p(x)

SLIDE 19

Single factor with Gaussian context

SLIDE 20

Gaussian multiplication formula
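The formula did not survive extraction; the standard identity the slide presumably shows is that a product of Gaussians in x is another (unnormalized) Gaussian:

    \mathcal{N}(x; m_1, v_1)\,\mathcal{N}(x; m_2, v_2)
      = \mathcal{N}(m_1; m_2, v_1 + v_2)\,\mathcal{N}(x; m, v),
    \qquad v = \left(v_1^{-1} + v_2^{-1}\right)^{-1},
    \quad m = v\left(\frac{m_1}{v_1} + \frac{m_2}{v_2}\right)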

SLIDE 21

Approximation with narrow context

SLIDE 22

Approximation with medium context

SLIDE 23

Approximation with wide context

SLIDE 24

Two factors

[Diagram: two factors sharing the variable x, approximated by messages.]

Message passing

SLIDE 25

Three factors

[Diagram: three factors sharing the variable x, approximated by messages.]

Message passing

SLIDE 26

Message Passing = Distributed Optimization

  • Messages represent a simpler distribution q(x) that approximates p(x)
    – A distributed representation
  • Message passing = optimizing q to fit p
    – q stands in for p when answering queries
  • Choices:
    – What type of distribution to construct (approximating family)
    – What cost to minimize (divergence measure)

SLIDE 27
Distributed divergence minimization

  • Write p as a product of factors:  p(x) = ∏_i f_i(x)
  • Approximate factors one by one:  f_i(x) → f̃_i(x)
  • Multiply to get the approximation:  q(x) = ∏_i f̃_i(x)
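Spelled out (a reconstruction of the standard EP updates from Minka (2001); the slide's own formulas did not survive extraction):

    p(x) = \prod_i f_i(x), \qquad
    q(x) = \prod_i \tilde{f}_i(x), \qquad
    q^{\setminus i}(x) = \prod_{j \neq i} \tilde{f}_j(x)

    \tilde{f}_i(x) = \arg\min_{\tilde{f}_i}
      D\!\left( f_i(x)\, q^{\setminus i}(x) \;\middle\|\; \tilde{f}_i(x)\, q^{\setminus i}(x) \right)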

SLIDE 28

Gaussian found by EP

[Plot: p(x,D) against x, comparing the EP Gaussian with the exact posterior and the best Gaussian.]

SLIDE 29

Other methods

[Plot: p(x,D) against x, comparing the variational-bound (vb) and Laplace approximations with the exact posterior.]

SLIDE 30

Accuracy

              Posterior mean    Posterior variance
  exact       1.64864           0.359673
  ep          1.64514           0.311474
  laplace     1.61946           0.234616
  vb          1.61834           0.171155

SLIDE 31

Cost vs. accuracy

[Plots: cost vs. accuracy with 20 data points and with 200 data points.] Deterministic methods improve with more data (the posterior becomes more Gaussian); sampling methods do not.

SLIDE 32

Censoring example

  • Want to estimate x given multiple y’s

  p(y_i | x) = N(y_i; x, 1)
  p(x) = N(x; 0, 100)

Only the event |y_i| > t is observed, so each measurement contributes:

  p(|y_i| > t | x) = ∫_{-∞}^{-t} N(y; x, 1) dy + ∫_{t}^{∞} N(y; x, 1) dy
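Both tail integrals are Gaussian cdf values, so the censored likelihood is cheap to evaluate; a minimal sketch:

    from scipy.stats import norm

    def censored_lik(x, t):
        """p(|y_i| > t | x): both tails of N(y; x, 1) beyond the threshold t."""
        return norm.cdf(-t, loc=x, scale=1.0) + norm.sf(t, loc=x, scale=1.0)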

SLIDE 33

Time series problems

SLIDE 34

Example: Tracking

Guess the position of an object given noisy measurements

[Diagram: an object moving through positions x_1 … x_4, with noisy measurements y_1 … y_4.]

SLIDE 35

Factor graph

[Factor graph: a chain x_1 – x_2 – x_3 – x_4, with each x_t linked to a measurement y_t.]

  x_t = x_{t-1} + ν_t   (random walk)
  y_t = x_t + noise

e.g. want distribution of x’s given y’s
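Because this chain is linear-Gaussian, the messages are exact and a forward sweep is just the Kalman filter. A minimal sketch, assuming unit transition and measurement noise and a broad prior on x_1 (the slide gives no numbers):

    import numpy as np

    rng = np.random.default_rng(0)
    T, q, r = 4, 1.0, 1.0                          # assumed noise variances
    x = np.cumsum(rng.normal(0.0, np.sqrt(q), T))  # x_t = x_{t-1} + nu_t
    y = x + rng.normal(0.0, np.sqrt(r), T)         # y_t = x_t + noise

    m, v = 0.0, 100.0                              # broad Gaussian prior on x_1
    for t in range(T):
        v_pred = v + (q if t > 0 else 0.0)         # propagate the random walk
        k = v_pred / (v_pred + r)                  # Kalman gain for y_t
        m, v = m + k * (y[t] - m), (1 - k) * v_pred
    print(m, v)                                    # Gaussian posterior on x_T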

SLIDE 36

Approximate factor graph

[Diagram: the chain x_1 … x_4 with every factor replaced by a Gaussian approximation.]

SLIDE 37

Splitting a pairwise factor

[Diagram: a pairwise factor on (x_1, x_2) split into a product of two single-variable Gaussian messages, one to x_1 and one to x_2.]

SLIDE 38

Splitting in context

[Diagram: the same split applied to the factor on (x_2, x_3), inside the context formed by the rest of the chain.]

SLIDE 39

Sweeping through the graph

[Diagram: step 1 of a sweep through the chain x_1 … x_4.]

SLIDE 40

Sweeping through the graph

[Diagram: step 2 of the sweep through the chain x_1 … x_4.]

SLIDE 41

Sweeping through the graph

[Diagram: step 3 of the sweep through the chain x_1 … x_4.]

SLIDE 42

Sweeping through the graph

[Diagram: step 4 of the sweep; every factor in the chain x_1 … x_4 has now been approximated.]

SLIDE 43

Example: Poisson tracking

  • y_t is a Poisson-distributed integer with mean exp(x_t)

SLIDE 44

Poisson tracking model

  p(x_1) ~ N(0, 100)
  p(x_t | x_{t-1}) ~ N(x_{t-1}, 0.01)
  p(y_t | x_t) = exp(x_t y_t) exp(-exp(x_t)) / y_t!
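The Poisson factor has no Gaussian conjugate, so EP needs its moments under a Gaussian context message. Since x_t is one-dimensional, a grid projection is an easy sketch (function and argument names are illustrative):

    import numpy as np
    from scipy.stats import norm, poisson

    def project(y_t, m_ctx, v_ctx, n=2000):
        """Mean/variance of p(y_t | x) N(x; m_ctx, v_ctx), for moment matching."""
        s = np.sqrt(v_ctx)
        xs = np.linspace(m_ctx - 10 * s, m_ctx + 10 * s, n)
        tilted = poisson.pmf(y_t, np.exp(xs)) * norm.pdf(xs, m_ctx, s)
        Z = np.trapz(tilted, xs)
        mean = np.trapz(xs * tilted, xs) / Z
        var = np.trapz((xs - mean) ** 2 * tilted, xs) / Z
        return mean, var

    print(project(y_t=3, m_ctx=1.0, v_ctx=0.5))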

SLIDE 45

Factor graph

[Factor graph: the chain x_1 … x_4 with Poisson measurements y_1 … y_4, shown alongside its Gaussian approximation.]

SLIDE 46

Approximating a measurement factor

[Diagram: the measurement factor p(y_1 | x_1) approximated by a Gaussian message to x_1.]

SLIDE 47

SLIDE 48

Posterior for the last state

SLIDE 49

SLIDE 50

SLIDE 51

EP for signal detection

(Qi and Minka, 2003)

  • Wireless communication problem
  • Transmitted signal = a sin(ωt + φ)
  • Vary (a, φ) to encode each symbol
  • In complex numbers: a e^{iφ}

[Diagram: the complex plane (Re, Im axes) with a point at radius a and angle φ.]

SLIDE 52

Binary symbols, Gaussian noise

  • Symbols are s = 1 and s = −1 (in the complex plane)
  • Received signal:  y_t = a sin(ωt + φ) + noise
  • Optimal detection is easy in this case

[Diagram: the received signal y_t and the two symbols in the complex plane.]

SLIDE 53

Fading channel

  • Channel systematically changes amplitude and phase:

      y_t = x_t s_t + noise

  • s_t = transmitted symbol
  • x_t = channel multiplier (a complex number)
  • x_t changes over time

[Diagram: the symbol s_t rotated and scaled by x_t in the complex plane.]

SLIDE 54

Differential detection

  • Use the last measurement to estimate the state:  x_t ≈ y_{t-1} / s_{t-1}
  • State estimate is noisy – can we do better?

[Diagram: successive measurements y_{t-1}, y_t in the complex plane.]
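A sketch of the differential detector in complex arithmetic (illustrative names; binary symbols):

    def detect(y_t, y_prev, s_prev, symbols=(1, -1)):
        """Use the previous measurement as a one-sample channel estimate,
        then pick the symbol whose predicted measurement is nearest y_t."""
        x_est = y_prev / s_prev            # noisy estimate of the multiplier
        return min(symbols, key=lambda s: abs(y_t - x_est * s))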

SLIDE 55

Factor graph

[Factor graph: symbols s_1 … s_4 and channel states x_1 … x_4 jointly generating measurements y_1 … y_4.]

Channel dynamics are learned from training data (all 1’s). Symbols can also be correlated (e.g. error-correcting code).

SLIDE 56

SLIDE 57

SLIDE 58

Splitting a transition factor

SLIDE 59

Splitting a measurement factor

SLIDE 60

On-line implementation

  • Iterate over the last δ measurements
  • Previous measurements act as prior
  • Results comparable to particle filtering, but much faster

SLIDE 61

SLIDE 62

Classification problems

SLIDE 63

Spam filtering by linear separation

[Scatter plot: “Spam” vs. “Not spam” documents.] Choose a boundary that will generalize to new data.

SLIDE 64

Linear separation

Minimum training error solution (Perceptron)

Too arbitrary – won’t generalize well

SLIDE 65

Linear separation

Maximum-margin solution (SVM)

Ignores information in the vertical direction

SLIDE 66

Linear separation

Bayesian solution (via averaging)

Has a margin, and uses information in all dimensions

SLIDE 67

Geometry of linear separation

Separator is any vector w such that:

  w^T x_i > 0   (class 1)
  w^T x_i < 0   (class 2)
  ||w|| = 1     (sphere)

This set has an unusual shape. SVM: optimize over it. Bayes: average over it.
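A Monte Carlo caricature of “average over it” (a rejection-sampling sketch, not the EP algorithm of this talk; assumes the data are separable):

    import numpy as np

    def bayes_point(X, y, n_draws=100_000, seed=0):
        """X: (n, d) inputs; y: (n,) labels in {+1, -1}."""
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(n_draws, X.shape[1]))
        W /= np.linalg.norm(W, axis=1, keepdims=True)  # uniform on the sphere
        ok = np.all((W @ X.T) * y > 0, axis=1)         # w^T x_i on correct side
        w = W[ok].mean(axis=0)                         # average over that set
        return w / np.linalg.norm(w)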

SLIDE 68

Performance on linear separation

EP Gaussian approximation to posterior

SLIDE 69

Factor graph

  p(w) = N(w; 0, I)

SLIDE 70

Computing moments

  q^{\i}(w) = N(w; m^{\i}, V^{\i})

SLIDE 71

Computing moments

SLIDE 72

Time vs. accuracy

A typical run on the 3-point problem. Error = distance to the true mean of w. Billiard = Monte Carlo sampling (Herbrich et al., 2001). Opper & Winther’s algorithms: MF = mean-field theory; TAP = cavity method (equivalent to Gaussian EP for this problem).

SLIDE 73

Gaussian kernels

  • Map data into a high-dimensional space so that the classes become linearly separable
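The mapping itself did not survive extraction; the standard Gaussian (RBF) kernel, which implicitly defines such a feature space, is:

    k(x, x') = \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2\sigma^2} \right)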

SLIDE 74

Bayesian model comparison

  • Multiple models M_i with prior probabilities p(M_i)
  • Posterior probabilities:
  • For equal priors, models are compared using model evidence:
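The formulas dropped out in extraction; the standard expressions are:

    p(M_i \mid D) = \frac{p(D \mid M_i)\, p(M_i)}{\sum_j p(D \mid M_j)\, p(M_j)},
    \qquad
    p(D \mid M_i) = \int p(D \mid \theta, M_i)\, p(\theta \mid M_i)\, d\theta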

SLIDE 75

Highest-probability kernel

SLIDE 76

Margin-maximizing kernel

SLIDE 77

Bayesian feature selection

Synthetic data where 6 features are relevant (out of 20). Bayes picks 6; margin picks 13.

SLIDE 78

EP versus Monte Carlo

  • Monte Carlo is general but expensive
    – A sledgehammer
  • EP exploits underlying simplicity of the problem (if it exists)
  • Monte Carlo is still needed for complex problems (e.g. large isolated peaks)
  • Trick is to know what problem you have

SLIDE 79

Software for EP

  • Bayes Point Machine toolbox

http://research.microsoft.com/~minka/papers/ep/bpm/

  • Sparse Online Gaussian Process toolbox

http://www.kyb.tuebingen.mpg.de/bs/people/csatol/ogp/index.html

  • Infer.NET

http://research.microsoft.com/infernet

SLIDE 80

Further reading

  • EP bibliography

http://research.microsoft.com/~minka/papers/ep/roadmap.html

  • EP quick reference

http://research.microsoft.com/~minka/papers/ep/minka-ep-quickref.pdf

SLIDE 81

Tomorrow

  • Variational Message Passing
  • Divergence measures
  • Comparisons to EP
