SLIDE 1

Expectation Propagation

Tom Minka
Microsoft Research, Cambridge, UK
2006 Advanced Tutorial Lecture Series, CUED

SLIDE 2

A typical machine learning problem

SLIDE 3

Spam filtering by linear separation

[Figure: spam and non-spam messages as points in feature space]

Choose a boundary that will generalize to new data

SLIDE 4

Linear separation

Minimum training error solution (Perceptron)

Too close to the data – won't generalize well

SLIDE 5

Linear separation

Maximum-margin solution (SVM)

Ignores information in the vertical direction

SLIDE 6

Linear separation

Bayesian solution (via averaging)

Has a margin, and uses information in all dimensions

SLIDE 7

Geometry of linear separation

Separator is any vector $w$ such that:

  $w^T x_i > 0$   (class 1)
  $w^T x_i < 0$   (class 2)
  $\|w\| = 1$     (sphere)

This set has an unusual shape. SVM: optimize over it. Bayes: average over it.

SLIDE 8

Performance on linear separation

[Figure: EP Gaussian approximation to the posterior]

SLIDE 9

Bayesian paradigm

  • Consistent use of probability theory for representing unknowns (parameters, latent variables, missing data)

SLIDE 10

Factor graphs

  • Shows how a function of several variables can be factored into a product of simpler functions
  • $f(x,y,z) = (x+y)(y+z)(x+z)$
  • Very useful for representing posteriors
SLIDE 11

Example factor graphs

[Figure: example factor graphs]

SLIDE 12

Two tasks

  • Modeling
    – What graph should I use for this data?
  • Inference
    – Given the graph and data, what is the mean of $x$ (for example)?
    – Algorithms:
      • Sampling
      • Variable elimination
      • Message-passing (Expectation Propagation, Variational Bayes, …)

SLIDE 13

Division of labor

  • Model construction
    – Domain specific (computer vision, biology, text)
  • Inference computation
    – Generic, mechanical
    – Further divided into:
      • Fitting an approximate posterior
      • Computing properties of the approx posterior
SLIDE 14

Benefits of the division

  • Algorithmic knowledge is consolidated into general graph-based algorithms (like EP)
  • Applied research has more freedom in choosing models
  • Algorithm research has much wider impact
SLIDE 15

Take-home message

  • Applied researcher:
    – express your model as a factor graph
    – use graph-based inference algorithms
  • Algorithm researcher:
    – present your algorithm in terms of graphs

SLIDE 16

A (seemingly) intractable problem

SLIDE 17

Clutter problem

  • Want to estimate $x$ given multiple $y$'s
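For concreteness, a minimal Python sketch of a clutter-style model (the mixing weight, clutter variance, and prior below are assumptions in the spirit of the standard clutter problem from the EP literature, not values read off these slides):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

w, clutter_var, prior_var = 0.5, 10.0, 100.0   # assumed constants
x_true, n = 2.0, 20

# Each y is an inlier around x with prob 1-w, otherwise broad clutter around 0.
is_clutter = rng.random(n) < w
y = np.where(is_clutter,
             rng.normal(0.0, np.sqrt(clutter_var), n),
             rng.normal(x_true, 1.0, n))

def likelihood(x, yi):
    # mixture likelihood of one observation
    return (1 - w) * norm.pdf(yi, x, 1.0) + w * norm.pdf(yi, 0.0, np.sqrt(clutter_var))

# Unnormalized exact posterior p(x, D) on a grid (used on the next slides).
grid = np.linspace(-5, 5, 2001)
post = norm.pdf(grid, 0.0, np.sqrt(prior_var))
for yi in y:
    post *= likelihood(grid, yi)
print("posterior mode near", grid[np.argmax(post)])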

SLIDE 18

Exact posterior

[Plot: the exact posterior $p(x, D)$ as a function of $x$]

SLIDE 19

Representing posterior distributions

  Sampling: good for complex, multi-modal distributions; slow, but predictable accuracy
  Deterministic approximation: good for simple, smooth distributions; fast, but unpredictable accuracy

SLIDE 20

Deterministic approximation

Laplace's method
  • Bayesian curve fitting, neural networks (MacKay)
  • Bayesian PCA (Minka)

Variational bounds
  • Bayesian mixture of experts (Waterhouse)
  • Mixtures of PCA (Tipping, Bishop)
  • Factorial/coupled Markov models (Ghahramani, Jordan, Williams)

SLIDE 21

Moment matching

Another way to perform deterministic approximation

  • Much higher accuracy on some problems

Lineage: Assumed-density filtering (1984) → Loopy belief propagation (1997) → Expectation Propagation (2001)

SLIDE 22

Best Gaussian by moment matching

[Plot: the exact posterior $p(x, D)$ and its best Gaussian approximation as functions of $x$]
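Numerically, the best Gaussian in this sense matches the mean and variance of the exact posterior; a sketch reusing the hypothetical grid and post arrays from the clutter snippet above:

import numpy as np
from scipy.stats import norm

Z = np.trapz(post, grid)                              # normalizer
mean = np.trapz(grid * post, grid) / Z                # matched mean
var = np.trapz((grid - mean) ** 2 * post, grid) / Z   # matched variance
best_gaussian = norm.pdf(grid, mean, np.sqrt(var))    # the moment-matched Gaussian
print("moment-matched Gaussian: mean %.3f, var %.3f" % (mean, var))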

SLIDE 23

Strategy

  • Approximate each factor by a Gaussian in $x$

SLIDE 24

Approximating a single factor

SLIDE 25

Approximating a single factor (naïve)

Approximate $f_i(x)$ by itself, then multiply in the context:

  $\tilde f_i(x) \times q^{\setminus i}(x) = \hat p(x)$

SLIDE 26

Approximating a single factor (informed)

Approximate $f_i(x)$ in the context $q^{\setminus i}(x)$, i.e. fit the product:

  $f_i(x) \times q^{\setminus i}(x) \approx \tilde f_i(x) \times q^{\setminus i}(x) = \hat p(x)$

SLIDE 27

Single factor with Gaussian context

SLIDE 28

Gaussian multiplication formula
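The formula itself is lost in extraction; the standard Gaussian multiplication identity the slide presumably shows is that a product of Gaussian densities in $x$ is again Gaussian (up to scale), with natural parameters adding:

  $N(x;\, m_1, v_1)\, N(x;\, m_2, v_2) \propto N(x;\, m, v)$, where
  $\frac{1}{v} = \frac{1}{v_1} + \frac{1}{v_2}$ and $\frac{m}{v} = \frac{m_1}{v_1} + \frac{m_2}{v_2}$

This is what makes Gaussian message passing cheap: multiplying and dividing factor approximations is just adding and subtracting natural parameters.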

SLIDE 29

Approximation with narrow context

SLIDE 30

Approximation with medium context

SLIDE 31

Approximation with wide context

SLIDE 32

Two factors

[Figure: two factors attached to $x$; message passing between them]

SLIDE 33

Three factors

[Figure: three factors attached to $x$; message passing between them]

SLIDE 34

Message Passing = Distributed Optimization

  • Messages represent a simpler distribution $q(x)$ that approximates $p(x)$
    – A distributed representation
  • Message passing = optimizing $q$ to fit $p$
    – $q$ stands in for $p$ when answering queries
  • Choices:
    – What type of distribution to construct (approximating family)
    – What cost to minimize (divergence measure)

SLIDE 35

Distributed divergence minimization

  • Write $p$ as a product of factors:
  • Approximate factors one by one:
  • Multiply to get the approximation:
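The equations after each colon are lost in extraction; in standard EP notation they read:

  $p(x) = \prod_i f_i(x)$,  $f_i(x) \approx \tilde f_i(x)$,  $q(x) = \prod_i \tilde f_i(x)$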
SLIDE 36

Global divergence to local divergence

  • Global divergence:
  • Local divergence:
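The lost formulas are, in the standard formulation:

  Global: $D\big( \prod_i f_i(x) \,\big\|\, \prod_i \tilde f_i(x) \big)$
  Local:  $D\big( f_i(x)\, q^{\setminus i}(x) \,\big\|\, \tilde f_i(x)\, q^{\setminus i}(x) \big)$, where $q^{\setminus i}(x) = \prod_{j \neq i} \tilde f_j(x)$

EP minimizes the local divergences as tractable surrogates for the global one.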
SLIDE 37

Message passing

  • Messages are passed between factors
  • Messages are factor approximations $\tilde f_i(x)$
  • Factor $i$ receives the context $q^{\setminus i}(x)$
    – Minimize the local divergence to get $\tilde f_i(x)$
    – Send $\tilde f_i(x)$ to the other factors
    – Repeat until convergence
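Putting slides 34–37 together, a minimal numerical EP loop for the clutter sketch above; grid-based moment matching stands in for the usual closed-form updates, and the constants and observations are assumptions for illustration:

import numpy as np
from scipy.stats import norm

w, clutter_var, prior_var = 0.5, 10.0, 100.0    # assumed constants, as before
y = np.array([0.1, 1.2, 2.0, 1.7, -0.5])        # made-up observations

def site_lik(x, yi):
    return (1 - w) * norm.pdf(yi, x, 1.0) + w * norm.pdf(yi, 0.0, np.sqrt(clutter_var))

# Each site keeps a Gaussian approximation in natural parameters:
# precision r[i] and precision-times-mean b[i].
r = np.zeros(len(y)); b = np.zeros(len(y))
r0, b0 = 1.0 / prior_var, 0.0                   # Gaussian prior N(0, prior_var)
grid = np.linspace(-10.0, 10.0, 4001)

for sweep in range(20):
    for i in range(len(y)):
        # Cavity: remove site i from the current posterior.
        r_cav = r0 + r.sum() - r[i]
        b_cav = b0 + b.sum() - b[i]
        if r_cav <= 0:                          # skip ill-defined cavities
            continue
        cavity = norm.pdf(grid, b_cav / r_cav, np.sqrt(1.0 / r_cav))
        # Tilted distribution: exact factor times cavity; match its moments.
        tilted = site_lik(grid, y[i]) * cavity
        tilted /= np.trapz(tilted, grid)
        m = np.trapz(grid * tilted, grid)
        v = np.trapz((grid - m) ** 2 * tilted, grid)
        # New site = moment-matched Gaussian divided by the cavity
        # (natural parameters subtract).
        r[i] = 1.0 / v - r_cav
        b[i] = m / v - b_cav

r_post = r0 + r.sum(); b_post = b0 + b.sum()
print("EP posterior: mean %.4f, variance %.4f" % (b_post / r_post, 1.0 / r_post))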

SLIDE 38

Gaussian found by EP

[Plot: $p(x, D)$ vs. $x$ – EP, exact, and best Gaussian overlaid]

SLIDE 39

Other methods

[Plot: $p(x, D)$ vs. $x$ – VB, Laplace, and exact overlaid]

SLIDE 40

Accuracy

                       exact      ep         laplace    vb
  Posterior mean:      1.64864    1.64514    1.61946    1.61834
  Posterior variance:  0.359673   0.311474   0.234616   0.171155

SLIDE 41

Cost vs. accuracy

[Plots: cost vs. accuracy with 20 points and with 200 points]

Deterministic methods improve with more data (the posterior is more Gaussian); sampling methods do not.

SLIDE 42

Time series problems

SLIDE 43

Example: Tracking

Guess the position of an object given noisy measurements

[Figure: object path through positions $x_1 \dots x_4$ with noisy measurements $y_1 \dots y_4$]

SLIDE 44

Factor graph

[Figure: chain factor graph over states $x_1 \dots x_4$ with measurements $y_1 \dots y_4$]

  $x_t = x_{t-1} + \text{noise}$   (random walk)
  $y_t = x_t + \text{noise}$

e.g. want the distribution of the $x$'s given the $y$'s
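A minimal simulation of this model with a forward Kalman filter; for this linear-Gaussian chain, Gaussian message passing reduces to exact Kalman filtering/smoothing (the noise scales below are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
T = 50
q, r = 0.1, 1.0                          # assumed process / measurement noise std

x = np.zeros(T)
for t in range(1, T):
    x[t] = x[t - 1] + rng.normal(0, q)   # random-walk dynamics
y = x + rng.normal(0, r, T)              # noisy measurements

m, v = 0.0, 1.0                          # assumed prior on the first state
for t in range(T):
    v_pred = v + q ** 2                  # predict through the random walk
    k = v_pred / (v_pred + r ** 2)       # Kalman gain
    m = m + k * (y[t] - m)               # update with measurement y[t]
    v = (1 - k) * v_pred
print("filtered estimate of the last state: %.3f (true %.3f)" % (m, x[-1]))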

SLIDE 45

Approximate factor graph

[Figure: the same chain over $x_1 \dots x_4$ with each factor replaced by its approximation]

SLIDE 46

Splitting a pairwise factor

[Figure: the pairwise factor on $(x_1, x_2)$ split into a product of single-variable factors]
slide-47
SLIDE 47

Splitting in context

2

x

3

x

47

2

x

3

x

SLIDE 48–51

Sweeping through the graph

[Animation frames: a sweep over the chain $x_1 \dots x_4$, splitting one factor at a time]

SLIDE 52

Example: Poisson tracking

  • $y_t$ is a Poisson-distributed integer with mean $\exp(x_t)$

SLIDE 53

Poisson tracking model

  $p(x_1) \sim N(0, 100)$
  $p(x_t \mid x_{t-1}) \sim N(x_{t-1}, 0.01)$
  $p(y_t \mid x_t) = \exp(x_t y_t)\, e^{-\exp(x_t)} / y_t!$
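The key EP computation for this model (cf. slide 55) is moment-matching the product of a Gaussian context and one Poisson factor. A numerical sketch, with the grid width and test values chosen arbitrarily:

import numpy as np
from scipy.stats import norm
from scipy.special import gammaln

def match_poisson_site(m_cav, v_cav, y_t, width=8.0, n=2001):
    # Context N(m_cav, v_cav) times the Poisson factor
    # p(y_t | x) = exp(x*y_t - exp(x)) / y_t!, moment-matched on a grid.
    s = np.sqrt(v_cav)
    x = np.linspace(m_cav - width * s, m_cav + width * s, n)
    log_lik = x * y_t - np.exp(x) - gammaln(y_t + 1)
    tilted = norm.pdf(x, m_cav, s) * np.exp(log_lik - log_lik.max())
    tilted /= np.trapz(tilted, x)
    m = np.trapz(x * tilted, x)                # matched mean
    v = np.trapz((x - m) ** 2 * tilted, x)     # matched variance
    return m, v                                # what Gaussian EP propagates

print(match_poisson_site(m_cav=0.0, v_cav=1.0, y_t=3))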

SLIDE 54

Factor graph

[Figure: chain over $x_1 \dots x_4$ with Poisson measurements $y_1 \dots y_4$, and its approximation]

SLIDE 55

Approximating a measurement factor

[Figure: the measurement factor $p(y_1 \mid x_1)$ and its Gaussian approximation in $x_1$]


SLIDE 57

Posterior for the last state

[Plot: posterior for the last state]


SLIDE 60

EP for signal detection

  • Wireless communication problem
  • Transmitted signal = $a \sin(\omega t + \phi)$
  • Vary $(a, \phi)$ to encode each symbol
  • In complex numbers: $a e^{i\phi}$

[Figure: the symbol $a e^{i\phi}$ in the complex plane (Re/Im axes)]

SLIDE 61

Binary symbols, Gaussian noise

  • Symbols are 1 and –1 (in the complex plane)
  • Received signal = $a \sin(\omega t + \phi) + \text{noise}$
  • Recovered: $y_t = a e^{i\phi} + \text{noise} = \hat a e^{i\hat\phi}$
  • Optimal detection is easy in this case

[Figure: received values $y_t$ scattered around the two symbols in the complex plane]

SLIDE 62

Fading channel

  • Channel systematically changes amplitude and phase:
    $y_t = x_t s + \text{noise}$
  • $x_t$ changes over time

[Figure: received values $y_t$ around $x_t s$ as the channel drifts]

SLIDE 63

Differential detection

  • Use the last measurement $y_{t-1}$ to estimate the state
  • Binary symbols only
  • No smoothing of the state = noisy

[Figure: detecting the symbol by comparing $y_t$ with $y_{t-1}$]
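A toy simulation of the fading channel and the differential detector (the drift model and noise levels are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(2)
T = 200
s = rng.choice([1.0, -1.0], size=T)       # binary symbols

# Slowly drifting complex channel gain x_t: random-walk phase plus jitter.
x = np.empty(T, dtype=complex)
x[0] = 1.0 + 0.0j
for t in range(1, T):
    x[t] = x[t - 1] * np.exp(1j * rng.normal(0, 0.05)) + rng.normal(0, 0.01)

# Received signal: y_t = x_t * s_t + complex Gaussian noise.
noise = 0.1 * (rng.normal(size=T) + 1j * rng.normal(size=T))
y = x * s + noise

# Differential detection: y_t * conj(y_{t-1}) ~ |x|^2 * s_t * s_{t-1},
# so the sign of its real part estimates the symbol change.
d_hat = np.sign(np.real(y[1:] * np.conj(y[:-1])))
err = np.mean(d_hat != s[1:] * s[:-1])
print("differential detection error rate: %.3f" % err)

Smoothing the channel state with EP over the factor graph on the next slide uses all the measurements rather than just the previous one, which is why it can beat differential detection.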

SLIDE 64

Factor graph

[Figure: chain over channel states $x_1 \dots x_4$, with symbols $s_1 \dots s_4$ and measurements $y_1 \dots y_4$]

Dynamics are learned from training data (all 1's). Symbols can also be correlated (e.g. by an error-correcting code).


SLIDE 67

Splitting a transition factor

SLIDE 68

Splitting a measurement factor

SLIDE 69

On-line implementation

  • Iterate over the last δ measurements
  • Previous measurements act as a prior
  • Results comparable to particle filtering, but much faster


SLIDE 71

Linear separation revisited

SLIDE 72

Geometry of linear separation

Separator is any vector $w$ such that:

  $w^T x_i > 0$   (class 1)
  $w^T x_i < 0$   (class 2)
  $\|w\| = 1$     (sphere)

This set has an unusual shape. SVM: optimize over it. Bayes: average over it.

SLIDE 73

Factor graph

SLIDE 74

Performance on linear separation

[Figure: EP Gaussian approximation to the posterior]

SLIDE 75

Time vs. accuracy

A typical run on the 3-point problem. Error = distance to the true mean of $w$.

Billiard = Monte Carlo sampling (Herbrich et al., 2001). Opper & Winther's algorithms: MF = mean-field theory; TAP = cavity method (equivalent to Gaussian EP for this problem).

SLIDE 76

Gaussian kernels

  • Map the data into a high-dimensional space so that the classes become linearly separable
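The slide's kernel definition is lost in extraction; the standard Gaussian (RBF) kernel is

  $k(x, x') = \exp\!\big( -\|x - x'\|^2 / (2\sigma^2) \big)$

which corresponds to an implicit infinite-dimensional feature space.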

SLIDE 77

Bayesian model comparison

  • Multiple models $M_i$ with prior probabilities $p(M_i)$
  • Posterior probabilities:
  • For equal priors, models are compared using the model evidence:
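The formulas after the colons are lost; the standard expressions are:

  $p(M_i \mid D) \propto p(D \mid M_i)\, p(M_i)$
  $p(D \mid M_i) = \int p(D \mid \theta, M_i)\, p(\theta \mid M_i)\, d\theta$

EP supplies an approximation to this evidence integral as a by-product of fitting the posterior.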

SLIDE 78

Highest-probability kernel

SLIDE 79

Margin-maximizing kernel

SLIDE 80

Bayesian feature selection

Synthetic data where 6 features are relevant (out of 20)

Bayes picks 6; margin picks 13.

SLIDE 81

EP versus Monte Carlo

  • Monte Carlo is general but expensive
    – A sledgehammer
  • EP exploits the underlying simplicity of the problem (if it exists)
  • Monte Carlo is still needed for complex problems (e.g. large isolated peaks)
  • Trick is to know what problem you have
SLIDE 82

Further reading

  • Bayes Point Machine toolbox
    http://research.microsoft.com/~minka/papers/ep/bpm/
  • EP bibliography
    http://research.microsoft.com/~minka/papers/ep/roadmap.html
  • EP quick reference
    http://research.microsoft.com/~minka/papers/ep/minka-ep-quickref.pdf