Approximate Inference, Part 1 of 2 (Tom Minka, Machine Learning Summer School 2009)


SLIDE 1

Approximate Inference

Part 1 of 2

Tom Minka

Microsoft Research, Cambridge, UK
Machine Learning Summer School 2009
http://mlg.eng.cam.ac.uk/mlss09/

SLIDE 2

Bayesian paradigm

  • Consistent use of probability theory for representing unknowns (parameters, latent variables, missing data)

SLIDE 3

Bayesian paradigm

  • Bayesian posterior distribution summarizes what we’ve learned from training data and prior knowledge
  • Can use posterior to:
    – Describe training data
    – Make predictions on test data
    – Incorporate new data (online learning)
  • Today’s question: How to efficiently represent and compute posteriors?

SLIDE 4

Factor graphs

  • Shows how a function of several variables can be factored into a product of simpler functions
  • f(x,y,z) = (x+y)(y+z)(x+z)
  • Very useful for representing posteriors
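To make this concrete, here is a minimal Python sketch (not from the slides; the grid and values are illustrative) that stores the factorization above as a list of factors and evaluates f as their product, then normalizes over a small grid the way an unnormalized posterior would be:

    import itertools

    # Factors of f(x,y,z) = (x+y)(y+z)(x+z); each touches only two variables.
    factors = [
        (("x", "y"), lambda x, y: x + y),
        (("y", "z"), lambda y, z: y + z),
        (("x", "z"), lambda x, z: x + z),
    ]

    def f(assignment):
        """Evaluate f as the product of its simpler factors."""
        result = 1.0
        for variables, factor in factors:
            result *= factor(*(assignment[v] for v in variables))
        return result

    # Treat f as an unnormalized distribution over a small discrete grid.
    grid = [0.0, 1.0, 2.0]
    Z = sum(f(dict(zip("xyz", v))) for v in itertools.product(grid, repeat=3))
    print(f({"x": 1.0, "y": 2.0, "z": 0.0}) / Z)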

SLIDE 5

Example factor graph

  p(x_i | m) = N(x_i; m, 1)

SLIDE 6

Two tasks

  • Modeling

– What graph should I use for this data?

  • Inference

    – Given the graph and data, what is the mean of x (for example)?
    – Algorithms:
      • Sampling
      • Variable elimination
      • Message-passing (Expectation Propagation, Variational Bayes, …)

SLIDE 7

A (seemingly) intractable problem

SLIDE 8

Clutter problem

  • Want to estimate x given multiple y’s

SLIDE 9

Exact posterior

[Plot: the exact posterior p(x,D) against x, labeled “exact”.]
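The slide does not spell out the model, so here is a hedged grid sketch assuming the standard clutter problem from Minka (2001): each y_i comes from a mixture of a Gaussian centred at x and a broad clutter Gaussian, with a broad prior on x (all numbers illustrative):

    import numpy as np
    from scipy.stats import norm

    # Assumed model: p(y | x) = (1-w) N(y; x, 1) + w N(y; 0, 10)
    #                p(x) = N(x; 0, 100)
    y = np.array([0.1, 1.7, 2.2, 3.5])     # hypothetical observations
    w = 0.5                                 # assumed clutter fraction

    xs = np.linspace(-2.0, 6.0, 2001)       # grid over the unknown x
    lik = ((1 - w) * norm.pdf(y[:, None], loc=xs, scale=1.0)
           + w * norm.pdf(y[:, None], loc=0.0, scale=np.sqrt(10.0)))
    post = norm.pdf(xs, loc=0.0, scale=10.0) * lik.prod(axis=0)
    post /= np.trapz(post, xs)              # normalized exact posterior p(x | D)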

SLIDE 10

Representing posterior distributions

  • Sampling: good for complex, multi-modal distributions; slow, but with predictable accuracy
  • Deterministic approximation: good for simple, smooth distributions; fast, but with unpredictable accuracy

SLIDE 11

Deterministic approximation

Laplace’s method

  • Bayesian curve fitting, neural networks (MacKay)
  • Bayesian PCA (Minka)

Variational bounds

  • Bayesian mixture of experts (Waterhouse)
  • Mixtures of PCA (Tipping, Bishop)
  • Factorial/coupled Markov models (Ghahramani, Jordan, Williams)

SLIDE 12

Moment matching

Another way to perform deterministic approximation

  • Much higher accuracy on some problems

Expectation Propagation (2001) builds on assumed-density filtering (1984) and is closely related to loopy belief propagation (1997).

SLIDE 13

Today

  • Moment matching (Expectation Propagation)

Tomorrow

  • Variational bounds (Variational Message Passing)

SLIDE 14

Best Gaussian by moment matching

[Plot: p(x,D) against x, comparing the exact posterior with the best Gaussian fit by moment matching.]
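Moment matching chooses the Gaussian with the same mean and variance as the exact posterior. Continuing the grid sketch above (reusing its xs and post arrays):

    # Best Gaussian by moment matching: equate mean and variance with the
    # grid-approximated exact posterior from the clutter sketch above.
    mean = np.trapz(xs * post, xs)
    var = np.trapz((xs - mean) ** 2 * post, xs)
    best_gaussian = norm(loc=mean, scale=np.sqrt(var))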

SLIDE 15

Strategy

  • Approximate each factor by a Gaussian in x

SLIDE 16

Approximating a single factor

SLIDE 17

(naïve) Approximate the factor f_i(x) on its own by a Gaussian, then multiply by the context:

  f_i(x) × q^{\i}(x)  →  f̃_i(x) × q^{\i}(x)  ≈  p(x)

SLIDE 18

(informed) Approximate the factor in the presence of its context, choosing f̃_i(x) so that the product matches where q^{\i}(x) has mass:

  f_i(x) × q^{\i}(x)  →  f̃_i(x) × q^{\i}(x)  ≈  p(x)

SLIDE 19

Single factor with Gaussian context

SLIDE 20

Gaussian multiplication formula
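The formula did not survive extraction; the standard identity the slide presumably shows is that a product of Gaussians in x is another (unnormalized) Gaussian:

    \mathcal{N}(x; m_1, v_1)\,\mathcal{N}(x; m_2, v_2)
      = \mathcal{N}(m_1; m_2, v_1 + v_2)\,\mathcal{N}(x; m, v),
    \qquad v = \left(v_1^{-1} + v_2^{-1}\right)^{-1},
    \quad m = v\left(\frac{m_1}{v_1} + \frac{m_2}{v_2}\right)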

SLIDE 21

Approximation with narrow context

SLIDE 22

Approximation with medium context

SLIDE 23

Approximation with wide context

SLIDE 24

Two factors

[Diagram: two factors sharing the variable x, approximated by messages.]

Message passing

SLIDE 25

Three factors

[Diagram: three factors sharing the variable x, approximated by messages.]

Message passing

SLIDE 26

Message Passing = Distributed Optimization

  • Messages represent a simpler distribution q(x) that approximates p(x)
    – A distributed representation
  • Message passing = optimizing q to fit p
    – q stands in for p when answering queries
  • Choices:
    – What type of distribution to construct (approximating family)
    – What cost to minimize (divergence measure)

SLIDE 27
Distributed divergence minimization

  • Write p as a product of factors:  p(x) = ∏_i f_i(x)
  • Approximate factors one by one:  f_i(x) → f̃_i(x)
  • Multiply to get the approximation:  q(x) = ∏_i f̃_i(x)
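Spelled out (a reconstruction of the standard EP updates from Minka (2001); the slide's own formulas did not survive extraction):

    p(x) = \prod_i f_i(x), \qquad
    q(x) = \prod_i \tilde{f}_i(x), \qquad
    q^{\setminus i}(x) = \prod_{j \neq i} \tilde{f}_j(x)

    \tilde{f}_i(x) = \arg\min_{\tilde{f}_i}
      D\!\left( f_i(x)\, q^{\setminus i}(x) \;\middle\|\; \tilde{f}_i(x)\, q^{\setminus i}(x) \right)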

SLIDE 28

Gaussian found by EP

[Plot: p(x,D) against x, comparing the EP Gaussian with the exact posterior and the best Gaussian.]

SLIDE 29

Other methods

[Plot: p(x,D) against x, comparing the variational-bound (vb) and Laplace approximations with the exact posterior.]

SLIDE 30

Accuracy

              Posterior mean    Posterior variance
  exact       1.64864           0.359673
  ep          1.64514           0.311474
  laplace     1.61946           0.234616
  vb          1.61834           0.171155

SLIDE 31

Cost vs. accuracy

[Plots: cost vs. accuracy with 20 data points and with 200 data points.] Deterministic methods improve with more data (the posterior becomes more Gaussian); sampling methods do not.

SLIDE 32

Censoring example

  • Want to estimate x given multiple y’s

  p(y_i | x) = N(y_i; x, 1)
  p(x) = N(x; 0, 100)

Only the event |y_i| > t is observed, so each measurement contributes:

  p(|y_i| > t | x) = ∫_{-∞}^{-t} N(y; x, 1) dy + ∫_{t}^{∞} N(y; x, 1) dy
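Both tail integrals are Gaussian cdf values, so the censored likelihood is cheap to evaluate; a minimal sketch:

    from scipy.stats import norm

    def censored_lik(x, t):
        """p(|y_i| > t | x): both tails of N(y; x, 1) beyond the threshold t."""
        return norm.cdf(-t, loc=x, scale=1.0) + norm.sf(t, loc=x, scale=1.0)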

SLIDE 33

Time series problems

SLIDE 34

Example: Tracking

Guess the position of an object given noisy measurements

[Diagram: an object moving through positions x_1 … x_4, with noisy measurements y_1 … y_4.]

SLIDE 35

Factor graph

[Factor graph: a chain x_1 – x_2 – x_3 – x_4, with each x_t linked to a measurement y_t.]

  x_t = x_{t-1} + ν_t   (random walk)
  y_t = x_t + noise

e.g. want distribution of x’s given y’s
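Because this chain is linear-Gaussian, the messages are exact and a forward sweep is just the Kalman filter. A minimal sketch, assuming unit transition and measurement noise and a broad prior on x_1 (the slide gives no numbers):

    import numpy as np

    rng = np.random.default_rng(0)
    T, q, r = 4, 1.0, 1.0                          # assumed noise variances
    x = np.cumsum(rng.normal(0.0, np.sqrt(q), T))  # x_t = x_{t-1} + nu_t
    y = x + rng.normal(0.0, np.sqrt(r), T)         # y_t = x_t + noise

    m, v = 0.0, 100.0                              # broad Gaussian prior on x_1
    for t in range(T):
        v_pred = v + (q if t > 0 else 0.0)         # propagate the random walk
        k = v_pred / (v_pred + r)                  # Kalman gain for y_t
        m, v = m + k * (y[t] - m), (1 - k) * v_pred
    print(m, v)                                    # Gaussian posterior on x_T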

SLIDE 36

Approximate factor graph

[Diagram: the chain x_1 … x_4 with every factor replaced by a Gaussian approximation.]

SLIDE 37

Splitting a pairwise factor

[Diagram: a pairwise factor on (x_1, x_2) split into a product of two single-variable Gaussian messages, one to x_1 and one to x_2.]

SLIDE 38

Splitting in context

[Diagram: the same split applied to the factor on (x_2, x_3), inside the context formed by the rest of the chain.]

SLIDE 39

Sweeping through the graph

[Diagram: step 1 of a sweep through the chain x_1 … x_4.]

SLIDE 40

Sweeping through the graph

[Diagram: step 2 of the sweep through the chain x_1 … x_4.]

SLIDE 41

Sweeping through the graph

[Diagram: step 3 of the sweep through the chain x_1 … x_4.]

SLIDE 42

Sweeping through the graph

[Diagram: step 4 of the sweep; every factor in the chain x_1 … x_4 has now been approximated.]

SLIDE 43

Example: Poisson tracking

  • y_t is a Poisson-distributed integer with mean exp(x_t)

SLIDE 44

Poisson tracking model

  p(x_1) ~ N(0, 100)
  p(x_t | x_{t-1}) ~ N(x_{t-1}, 0.01)
  p(y_t | x_t) = exp(x_t y_t) exp(-exp(x_t)) / y_t!
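The Poisson factor has no Gaussian conjugate, so EP needs its moments under a Gaussian context message. Since x_t is one-dimensional, a grid projection is an easy sketch (function and argument names are illustrative):

    import numpy as np
    from scipy.stats import norm, poisson

    def project(y_t, m_ctx, v_ctx, n=2000):
        """Mean/variance of p(y_t | x) N(x; m_ctx, v_ctx), for moment matching."""
        s = np.sqrt(v_ctx)
        xs = np.linspace(m_ctx - 10 * s, m_ctx + 10 * s, n)
        tilted = poisson.pmf(y_t, np.exp(xs)) * norm.pdf(xs, m_ctx, s)
        Z = np.trapz(tilted, xs)
        mean = np.trapz(xs * tilted, xs) / Z
        var = np.trapz((xs - mean) ** 2 * tilted, xs) / Z
        return mean, var

    print(project(y_t=3, m_ctx=1.0, v_ctx=0.5))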

SLIDE 45

Factor graph

[Factor graph: the chain x_1 … x_4 with Poisson measurements y_1 … y_4, shown alongside its Gaussian approximation.]

SLIDE 46

Approximating a measurement factor

[Diagram: the measurement factor p(y_1 | x_1) approximated by a Gaussian message to x_1.]

SLIDE 47

SLIDE 48

Posterior for the last state

SLIDE 49

SLIDE 50

SLIDE 51

EP for signal detection

(Qi and Minka, 2003)

  • Wireless communication problem
  • Transmitted signal = a sin(ωt + φ)
  • Vary (a, φ) to encode each symbol
  • In complex numbers: a e^{iφ}

[Diagram: the complex plane (Re, Im axes) with a point at radius a and angle φ.]

SLIDE 52

Binary symbols, Gaussian noise

  • Symbols are s = 1 and s = −1 (in the complex plane)
  • Received signal:  y_t = a sin(ωt + φ) + noise
  • Optimal detection is easy in this case

[Diagram: the received signal y_t and the two symbols in the complex plane.]

SLIDE 53

Fading channel

  • Channel systematically changes amplitude and phase:

      y_t = x_t s_t + noise

  • s_t = transmitted symbol
  • x_t = channel multiplier (a complex number)
  • x_t changes over time

[Diagram: the symbol s_t rotated and scaled by x_t in the complex plane.]

SLIDE 54

Differential detection

  • Use the last measurement to estimate the state:  x_t ≈ y_{t-1} / s_{t-1}
  • State estimate is noisy – can we do better?

[Diagram: successive measurements y_{t-1}, y_t in the complex plane.]
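A sketch of the differential detector in complex arithmetic (illustrative names; binary symbols):

    def detect(y_t, y_prev, s_prev, symbols=(1, -1)):
        """Use the previous measurement as a one-sample channel estimate,
        then pick the symbol whose predicted measurement is nearest y_t."""
        x_est = y_prev / s_prev            # noisy estimate of the multiplier
        return min(symbols, key=lambda s: abs(y_t - x_est * s))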

SLIDE 55

Factor graph

[Factor graph: symbols s_1 … s_4 and channel states x_1 … x_4 jointly generating measurements y_1 … y_4.]

Channel dynamics are learned from training data (all 1’s). Symbols can also be correlated (e.g. error-correcting code).

SLIDE 56

SLIDE 57

SLIDE 58

Splitting a transition factor

SLIDE 59

Splitting a measurement factor

SLIDE 60

On-line implementation

  • Iterate over the last δ measurements
  • Previous measurements act as prior
  • Results comparable to particle filtering, but much faster

SLIDE 61

SLIDE 62

Classification problems

SLIDE 63

Spam filtering by linear separation

[Scatter plot: “Spam” vs. “Not spam” documents.] Choose a boundary that will generalize to new data.

SLIDE 64

Linear separation

Minimum training error solution (Perceptron)

Too arbitrary – won’t generalize well

SLIDE 65

Linear separation

Maximum-margin solution (SVM)

Ignores information in the vertical direction

SLIDE 66

Linear separation

Bayesian solution (via averaging)

Has a margin, and uses information in all dimensions

SLIDE 67

Geometry of linear separation

Separator is any vector w such that:

  w^T x_i > 0   (class 1)
  w^T x_i < 0   (class 2)
  ||w|| = 1     (sphere)

This set has an unusual shape. SVM: optimize over it. Bayes: average over it.
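A Monte Carlo caricature of “average over it” (a rejection-sampling sketch, not the EP algorithm of this talk; assumes the data are separable):

    import numpy as np

    def bayes_point(X, y, n_draws=100_000, seed=0):
        """X: (n, d) inputs; y: (n,) labels in {+1, -1}."""
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(n_draws, X.shape[1]))
        W /= np.linalg.norm(W, axis=1, keepdims=True)  # uniform on the sphere
        ok = np.all((W @ X.T) * y > 0, axis=1)         # w^T x_i on correct side
        w = W[ok].mean(axis=0)                         # average over that set
        return w / np.linalg.norm(w)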

SLIDE 68

Performance on linear separation

EP Gaussian approximation to posterior

SLIDE 69

Factor graph

  p(w) = N(w; 0, I)

SLIDE 70

Computing moments

  q^{\i}(w) = N(w; m^{\i}, V^{\i})

SLIDE 71

Computing moments

SLIDE 72

Time vs. accuracy

A typical run on the 3-point problem. Error = distance to the true mean of w. Billiard = Monte Carlo sampling (Herbrich et al., 2001). Opper & Winther’s algorithms: MF = mean-field theory; TAP = cavity method (equivalent to Gaussian EP for this problem).

SLIDE 73

Gaussian kernels

  • Map data into a high-dimensional space so that the classes become linearly separable
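The mapping itself did not survive extraction; the standard Gaussian (RBF) kernel, which implicitly defines such a feature space, is:

    k(x, x') = \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2\sigma^2} \right)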

SLIDE 74

Bayesian model comparison

  • Multiple models M_i with prior probabilities p(M_i)
  • Posterior probabilities:
  • For equal priors, models are compared using model evidence:
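The formulas dropped out in extraction; the standard expressions are:

    p(M_i \mid D) = \frac{p(D \mid M_i)\, p(M_i)}{\sum_j p(D \mid M_j)\, p(M_j)},
    \qquad
    p(D \mid M_i) = \int p(D \mid \theta, M_i)\, p(\theta \mid M_i)\, d\theta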

SLIDE 75

Highest-probability kernel

SLIDE 76

Margin-maximizing kernel

SLIDE 77

Bayesian feature selection

Synthetic data where 6 features are relevant (out of 20). Bayes picks 6; margin picks 13.

SLIDE 78

EP versus Monte Carlo

  • Monte Carlo is general but expensive
    – A sledgehammer
  • EP exploits underlying simplicity of the problem (if it exists)
  • Monte Carlo is still needed for complex problems (e.g. large isolated peaks)
  • Trick is to know what problem you have

SLIDE 79

Software for EP

  • Bayes Point Machine toolbox

http://research.microsoft.com/~minka/papers/ep/bpm/

  • Sparse Online Gaussian Process toolbox

http://www.kyb.tuebingen.mpg.de/bs/people/csatol/ogp/index.html

  • Infer.NET

http://research.microsoft.com/infernet

SLIDE 80

Further reading

  • EP bibliography

http://research.microsoft.com/~minka/papers/ep/roadmap.html

  • EP quick reference

http://research.microsoft.com/~minka/papers/ep/minka-ep-quickref.pdf

SLIDE 81

Tomorrow

  • Variational Message Passing
  • Divergence measures
  • Comparisons to EP
