SLIDE 1

Expectation Propagation

Tom Minka
Microsoft Research, Cambridge, UK
2006 Advanced Tutorial Lecture Series, CUED

SLIDE 2

A typical machine learning problem

SLIDE 3

Spam filtering by linear separation

[Figure: spam and non-spam messages as points in feature space]

Choose a boundary that will generalize to new data

SLIDE 4

Linear separation

Minimum training error solution (Perceptron)

Too close to the data – won't generalize well

SLIDE 5

Linear separation

Maximum-margin solution (SVM)

Ignores information in the vertical direction

SLIDE 6

Linear separation

Bayesian solution (via averaging)

Has a margin, and uses information in all dimensions

SLIDE 7

Geometry of linear separation

Separator is any vector $w$ such that:

  $w^T x_i > 0$   (class 1)
  $w^T x_i < 0$   (class 2)
  $\|w\| = 1$     (sphere)

This set has an unusual shape. SVM: optimize over it. Bayes: average over it.

SLIDE 8

Performance on linear separation

[Figure: EP Gaussian approximation to the posterior]

SLIDE 9

Bayesian paradigm

  • Consistent use of probability theory for representing unknowns (parameters, latent variables, missing data)

SLIDE 10

Factor graphs

  • Shows how a function of several variables can be factored into a product of simpler functions
  • $f(x,y,z) = (x+y)(y+z)(x+z)$
  • Very useful for representing posteriors
SLIDE 11

Example factor graphs

[Figure: example factor graphs]

SLIDE 12

Two tasks

  • Modeling
    – What graph should I use for this data?
  • Inference
    – Given the graph and data, what is the mean of $x$ (for example)?
    – Algorithms:
      • Sampling
      • Variable elimination
      • Message-passing (Expectation Propagation, Variational Bayes, …)

SLIDE 13

Division of labor

  • Model construction
    – Domain specific (computer vision, biology, text)
  • Inference computation
    – Generic, mechanical
    – Further divided into:
      • Fitting an approximate posterior
      • Computing properties of the approx posterior
SLIDE 14

Benefits of the division

  • Algorithmic knowledge is consolidated into general graph-based algorithms (like EP)
  • Applied research has more freedom in choosing models
  • Algorithm research has much wider impact
SLIDE 15

Take-home message

  • Applied researcher:
    – express your model as a factor graph
    – use graph-based inference algorithms
  • Algorithm researcher:
    – present your algorithm in terms of graphs

SLIDE 16

A (seemingly) intractable problem

SLIDE 17

Clutter problem

  • Want to estimate $x$ given multiple $y$'s
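For concreteness, a minimal Python sketch of a clutter-style model (the mixing weight, clutter variance, and prior below are assumptions in the spirit of the standard clutter problem from the EP literature, not values read off these slides):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

w, clutter_var, prior_var = 0.5, 10.0, 100.0   # assumed constants
x_true, n = 2.0, 20

# Each y is an inlier around x with prob 1-w, otherwise broad clutter around 0.
is_clutter = rng.random(n) < w
y = np.where(is_clutter,
             rng.normal(0.0, np.sqrt(clutter_var), n),
             rng.normal(x_true, 1.0, n))

def likelihood(x, yi):
    # mixture likelihood of one observation
    return (1 - w) * norm.pdf(yi, x, 1.0) + w * norm.pdf(yi, 0.0, np.sqrt(clutter_var))

# Unnormalized exact posterior p(x, D) on a grid (used on the next slides).
grid = np.linspace(-5, 5, 2001)
post = norm.pdf(grid, 0.0, np.sqrt(prior_var))
for yi in y:
    post *= likelihood(grid, yi)
print("posterior mode near", grid[np.argmax(post)])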

SLIDE 18

Exact posterior

[Plot: the exact posterior $p(x, D)$ as a function of $x$]

SLIDE 19

Representing posterior distributions

  Sampling: good for complex, multi-modal distributions; slow, but predictable accuracy
  Deterministic approximation: good for simple, smooth distributions; fast, but unpredictable accuracy

SLIDE 20

Deterministic approximation

Laplace's method
  • Bayesian curve fitting, neural networks (MacKay)
  • Bayesian PCA (Minka)

Variational bounds
  • Bayesian mixture of experts (Waterhouse)
  • Mixtures of PCA (Tipping, Bishop)
  • Factorial/coupled Markov models (Ghahramani, Jordan, Williams)

SLIDE 21

Moment matching

Another way to perform deterministic approximation

  • Much higher accuracy on some problems

Lineage: Assumed-density filtering (1984) → Loopy belief propagation (1997) → Expectation Propagation (2001)

SLIDE 22

Best Gaussian by moment matching

[Plot: the exact posterior $p(x, D)$ and its best Gaussian approximation as functions of $x$]
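Numerically, the best Gaussian in this sense matches the mean and variance of the exact posterior; a sketch reusing the hypothetical grid and post arrays from the clutter snippet above:

import numpy as np
from scipy.stats import norm

Z = np.trapz(post, grid)                              # normalizer
mean = np.trapz(grid * post, grid) / Z                # matched mean
var = np.trapz((grid - mean) ** 2 * post, grid) / Z   # matched variance
best_gaussian = norm.pdf(grid, mean, np.sqrt(var))    # the moment-matched Gaussian
print("moment-matched Gaussian: mean %.3f, var %.3f" % (mean, var))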

SLIDE 23

Strategy

  • Approximate each factor by a Gaussian in $x$

SLIDE 24

Approximating a single factor

SLIDE 25

Approximating a single factor (naïve)

Approximate $f_i(x)$ by itself, then multiply in the context:

  $\tilde f_i(x) \times q^{\setminus i}(x) = \hat p(x)$

SLIDE 26

Approximating a single factor (informed)

Approximate $f_i(x)$ in the context $q^{\setminus i}(x)$, i.e. fit the product:

  $f_i(x) \times q^{\setminus i}(x) \approx \tilde f_i(x) \times q^{\setminus i}(x) = \hat p(x)$

SLIDE 27

Single factor with Gaussian context

SLIDE 28

Gaussian multiplication formula
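The formula itself is lost in extraction; the standard Gaussian multiplication identity the slide presumably shows is that a product of Gaussian densities in $x$ is again Gaussian (up to scale), with natural parameters adding:

  $N(x;\, m_1, v_1)\, N(x;\, m_2, v_2) \propto N(x;\, m, v)$, where
  $\frac{1}{v} = \frac{1}{v_1} + \frac{1}{v_2}$ and $\frac{m}{v} = \frac{m_1}{v_1} + \frac{m_2}{v_2}$

This is what makes Gaussian message passing cheap: multiplying and dividing factor approximations is just adding and subtracting natural parameters.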

SLIDE 29

Approximation with narrow context

SLIDE 30

Approximation with medium context

SLIDE 31

Approximation with wide context

SLIDE 32

Two factors

[Figure: two factors attached to $x$; message passing between them]

SLIDE 33

Three factors

[Figure: three factors attached to $x$; message passing between them]

SLIDE 34

Message Passing = Distributed Optimization

  • Messages represent a simpler distribution $q(x)$ that approximates $p(x)$
    – A distributed representation
  • Message passing = optimizing $q$ to fit $p$
    – $q$ stands in for $p$ when answering queries
  • Choices:
    – What type of distribution to construct (approximating family)
    – What cost to minimize (divergence measure)

SLIDE 35

Distributed divergence minimization

  • Write $p$ as a product of factors:
  • Approximate factors one by one:
  • Multiply to get the approximation:
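The equations after each colon are lost in extraction; in standard EP notation they read:

  $p(x) = \prod_i f_i(x)$,  $f_i(x) \approx \tilde f_i(x)$,  $q(x) = \prod_i \tilde f_i(x)$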
SLIDE 36

Global divergence to local divergence

  • Global divergence:
  • Local divergence:
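The lost formulas are, in the standard formulation:

  Global: $D\big( \prod_i f_i(x) \,\big\|\, \prod_i \tilde f_i(x) \big)$
  Local:  $D\big( f_i(x)\, q^{\setminus i}(x) \,\big\|\, \tilde f_i(x)\, q^{\setminus i}(x) \big)$, where $q^{\setminus i}(x) = \prod_{j \neq i} \tilde f_j(x)$

EP minimizes the local divergences as tractable surrogates for the global one.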
SLIDE 37

Message passing

  • Messages are passed between factors
  • Messages are factor approximations $\tilde f_i(x)$
  • Factor $i$ receives the context $q^{\setminus i}(x)$
    – Minimize the local divergence to get $\tilde f_i(x)$
    – Send $\tilde f_i(x)$ to the other factors
    – Repeat until convergence
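Putting slides 34–37 together, a minimal numerical EP loop for the clutter sketch above; grid-based moment matching stands in for the usual closed-form updates, and the constants and observations are assumptions for illustration:

import numpy as np
from scipy.stats import norm

w, clutter_var, prior_var = 0.5, 10.0, 100.0    # assumed constants, as before
y = np.array([0.1, 1.2, 2.0, 1.7, -0.5])        # made-up observations

def site_lik(x, yi):
    return (1 - w) * norm.pdf(yi, x, 1.0) + w * norm.pdf(yi, 0.0, np.sqrt(clutter_var))

# Each site keeps a Gaussian approximation in natural parameters:
# precision r[i] and precision-times-mean b[i].
r = np.zeros(len(y)); b = np.zeros(len(y))
r0, b0 = 1.0 / prior_var, 0.0                   # Gaussian prior N(0, prior_var)
grid = np.linspace(-10.0, 10.0, 4001)

for sweep in range(20):
    for i in range(len(y)):
        # Cavity: remove site i from the current posterior.
        r_cav = r0 + r.sum() - r[i]
        b_cav = b0 + b.sum() - b[i]
        if r_cav <= 0:                          # skip ill-defined cavities
            continue
        cavity = norm.pdf(grid, b_cav / r_cav, np.sqrt(1.0 / r_cav))
        # Tilted distribution: exact factor times cavity; match its moments.
        tilted = site_lik(grid, y[i]) * cavity
        tilted /= np.trapz(tilted, grid)
        m = np.trapz(grid * tilted, grid)
        v = np.trapz((grid - m) ** 2 * tilted, grid)
        # New site = moment-matched Gaussian divided by the cavity
        # (natural parameters subtract).
        r[i] = 1.0 / v - r_cav
        b[i] = m / v - b_cav

r_post = r0 + r.sum(); b_post = b0 + b.sum()
print("EP posterior: mean %.4f, variance %.4f" % (b_post / r_post, 1.0 / r_post))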

SLIDE 38

Gaussian found by EP

[Plot: $p(x, D)$ vs. $x$ – EP, exact, and best Gaussian overlaid]

SLIDE 39

Other methods

[Plot: $p(x, D)$ vs. $x$ – VB, Laplace, and exact overlaid]

SLIDE 40

Accuracy

                       exact      ep         laplace    vb
  Posterior mean:      1.64864    1.64514    1.61946    1.61834
  Posterior variance:  0.359673   0.311474   0.234616   0.171155

SLIDE 41

Cost vs. accuracy

[Plots: cost vs. accuracy with 20 points and with 200 points]

Deterministic methods improve with more data (the posterior is more Gaussian); sampling methods do not.

SLIDE 42

Time series problems

SLIDE 43

Example: Tracking

Guess the position of an object given noisy measurements

[Figure: object path through positions $x_1 \dots x_4$ with noisy measurements $y_1 \dots y_4$]

SLIDE 44

Factor graph

[Figure: chain factor graph over states $x_1 \dots x_4$ with measurements $y_1 \dots y_4$]

  $x_t = x_{t-1} + \text{noise}$   (random walk)
  $y_t = x_t + \text{noise}$

e.g. want the distribution of the $x$'s given the $y$'s
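A minimal simulation of this model with a forward Kalman filter; for this linear-Gaussian chain, Gaussian message passing reduces to exact Kalman filtering/smoothing (the noise scales below are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
T = 50
q, r = 0.1, 1.0                          # assumed process / measurement noise std

x = np.zeros(T)
for t in range(1, T):
    x[t] = x[t - 1] + rng.normal(0, q)   # random-walk dynamics
y = x + rng.normal(0, r, T)              # noisy measurements

m, v = 0.0, 1.0                          # assumed prior on the first state
for t in range(T):
    v_pred = v + q ** 2                  # predict through the random walk
    k = v_pred / (v_pred + r ** 2)       # Kalman gain
    m = m + k * (y[t] - m)               # update with measurement y[t]
    v = (1 - k) * v_pred
print("filtered estimate of the last state: %.3f (true %.3f)" % (m, x[-1]))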

SLIDE 45

Approximate factor graph

[Figure: the same chain over $x_1 \dots x_4$ with each factor replaced by its approximation]

SLIDE 46

Splitting a pairwise factor

[Figure: the pairwise factor on $(x_1, x_2)$ split into a product of single-variable factors]
slide-47
SLIDE 47

Splitting in context

2

x

3

x

47

2

x

3

x

SLIDE 48–51

Sweeping through the graph

[Animation frames: a sweep over the chain $x_1 \dots x_4$, splitting one factor at a time]

SLIDE 52

Example: Poisson tracking

  • $y_t$ is a Poisson-distributed integer with mean $\exp(x_t)$

SLIDE 53

Poisson tracking model

  $p(x_1) \sim N(0, 100)$
  $p(x_t \mid x_{t-1}) \sim N(x_{t-1}, 0.01)$
  $p(y_t \mid x_t) = \exp(x_t y_t)\, e^{-\exp(x_t)} / y_t!$
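The key EP computation for this model (cf. slide 55) is moment-matching the product of a Gaussian context and one Poisson factor. A numerical sketch, with the grid width and test values chosen arbitrarily:

import numpy as np
from scipy.stats import norm
from scipy.special import gammaln

def match_poisson_site(m_cav, v_cav, y_t, width=8.0, n=2001):
    # Context N(m_cav, v_cav) times the Poisson factor
    # p(y_t | x) = exp(x*y_t - exp(x)) / y_t!, moment-matched on a grid.
    s = np.sqrt(v_cav)
    x = np.linspace(m_cav - width * s, m_cav + width * s, n)
    log_lik = x * y_t - np.exp(x) - gammaln(y_t + 1)
    tilted = norm.pdf(x, m_cav, s) * np.exp(log_lik - log_lik.max())
    tilted /= np.trapz(tilted, x)
    m = np.trapz(x * tilted, x)                # matched mean
    v = np.trapz((x - m) ** 2 * tilted, x)     # matched variance
    return m, v                                # what Gaussian EP propagates

print(match_poisson_site(m_cav=0.0, v_cav=1.0, y_t=3))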

SLIDE 54

Factor graph

[Figure: chain over $x_1 \dots x_4$ with Poisson measurements $y_1 \dots y_4$, and its approximation]

SLIDE 55

Approximating a measurement factor

[Figure: the measurement factor $p(y_1 \mid x_1)$ and its Gaussian approximation in $x_1$]


SLIDE 57

Posterior for the last state

[Plot: posterior for the last state]


SLIDE 60

EP for signal detection

  • Wireless communication problem
  • Transmitted signal = $a \sin(\omega t + \phi)$
  • Vary $(a, \phi)$ to encode each symbol
  • In complex numbers: $a e^{i\phi}$

[Figure: the symbol $a e^{i\phi}$ in the complex plane (Re/Im axes)]

SLIDE 61

Binary symbols, Gaussian noise

  • Symbols are 1 and –1 (in the complex plane)
  • Received signal = $a \sin(\omega t + \phi) + \text{noise}$
  • Recovered: $y_t = a e^{i\phi} + \text{noise} = \hat a e^{i\hat\phi}$
  • Optimal detection is easy in this case

[Figure: received values $y_t$ scattered around the two symbols in the complex plane]

SLIDE 62

Fading channel

  • Channel systematically changes amplitude and phase:
    $y_t = x_t s + \text{noise}$
  • $x_t$ changes over time

[Figure: received values $y_t$ around $x_t s$ as the channel drifts]

SLIDE 63

Differential detection

  • Use the last measurement $y_{t-1}$ to estimate the state
  • Binary symbols only
  • No smoothing of the state = noisy

[Figure: detecting the symbol by comparing $y_t$ with $y_{t-1}$]
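A toy simulation of the fading channel and the differential detector (the drift model and noise levels are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(2)
T = 200
s = rng.choice([1.0, -1.0], size=T)       # binary symbols

# Slowly drifting complex channel gain x_t: random-walk phase plus jitter.
x = np.empty(T, dtype=complex)
x[0] = 1.0 + 0.0j
for t in range(1, T):
    x[t] = x[t - 1] * np.exp(1j * rng.normal(0, 0.05)) + rng.normal(0, 0.01)

# Received signal: y_t = x_t * s_t + complex Gaussian noise.
noise = 0.1 * (rng.normal(size=T) + 1j * rng.normal(size=T))
y = x * s + noise

# Differential detection: y_t * conj(y_{t-1}) ~ |x|^2 * s_t * s_{t-1},
# so the sign of its real part estimates the symbol change.
d_hat = np.sign(np.real(y[1:] * np.conj(y[:-1])))
err = np.mean(d_hat != s[1:] * s[:-1])
print("differential detection error rate: %.3f" % err)

Smoothing the channel state with EP over the factor graph on the next slide uses all the measurements rather than just the previous one, which is why it can beat differential detection.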

SLIDE 64

Factor graph

[Figure: chain over channel states $x_1 \dots x_4$, with symbols $s_1 \dots s_4$ and measurements $y_1 \dots y_4$]

Dynamics are learned from training data (all 1's). Symbols can also be correlated (e.g. by an error-correcting code).


SLIDE 67

Splitting a transition factor

SLIDE 68

Splitting a measurement factor

SLIDE 69

On-line implementation

  • Iterate over the last δ measurements
  • Previous measurements act as a prior
  • Results comparable to particle filtering, but much faster


SLIDE 71

Linear separation revisited

SLIDE 72

Geometry of linear separation

Separator is any vector $w$ such that:

  $w^T x_i > 0$   (class 1)
  $w^T x_i < 0$   (class 2)
  $\|w\| = 1$     (sphere)

This set has an unusual shape. SVM: optimize over it. Bayes: average over it.

SLIDE 73

Factor graph

SLIDE 74

Performance on linear separation

[Figure: EP Gaussian approximation to the posterior]

SLIDE 75

Time vs. accuracy

A typical run on the 3-point problem. Error = distance to the true mean of $w$.

Billiard = Monte Carlo sampling (Herbrich et al., 2001). Opper & Winther's algorithms: MF = mean-field theory; TAP = cavity method (equivalent to Gaussian EP for this problem).

SLIDE 76

Gaussian kernels

  • Map the data into a high-dimensional space so that the classes become linearly separable
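The slide's kernel definition is lost in extraction; the standard Gaussian (RBF) kernel is

  $k(x, x') = \exp\!\big( -\|x - x'\|^2 / (2\sigma^2) \big)$

which corresponds to an implicit infinite-dimensional feature space.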

SLIDE 77

Bayesian model comparison

  • Multiple models $M_i$ with prior probabilities $p(M_i)$
  • Posterior probabilities:
  • For equal priors, models are compared using the model evidence:
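The formulas after the colons are lost; the standard expressions are:

  $p(M_i \mid D) \propto p(D \mid M_i)\, p(M_i)$
  $p(D \mid M_i) = \int p(D \mid \theta, M_i)\, p(\theta \mid M_i)\, d\theta$

EP supplies an approximation to this evidence integral as a by-product of fitting the posterior.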

SLIDE 78

Highest-probability kernel

SLIDE 79

Margin-maximizing kernel

SLIDE 80

Bayesian feature selection

Synthetic data where 6 features are relevant (out of 20)

Bayes picks 6; margin picks 13.

SLIDE 81

EP versus Monte Carlo

  • Monte Carlo is general but expensive
    – A sledgehammer
  • EP exploits the underlying simplicity of the problem (if it exists)
  • Monte Carlo is still needed for complex problems (e.g. large isolated peaks)
  • Trick is to know what problem you have
SLIDE 82

Further reading

  • Bayes Point Machine toolbox
    http://research.microsoft.com/~minka/papers/ep/bpm/
  • EP bibliography
    http://research.microsoft.com/~minka/papers/ep/roadmap.html
  • EP quick reference
    http://research.microsoft.com/~minka/papers/ep/minka-ep-quickref.pdf