COMS 4721: Machine Learning for Data Science Lecture 22, 4/18/2017

COMS 4721: Machine Learning for Data Science Lecture 22, 4/18/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University

MARKOV MODELS

[Figure: the chain of states $s_1, s_2, s_3, s_4$]

The sequence $(s_1, s_2, s_3, \dots)$ has the Markov property if, for all $t$,

$$p(s_t \mid s_{t-1}, \dots, s_1) = p(s_t \mid s_{t-1}).$$

Our first encounter with Markov models assumed a finite state space, meaning we can define an indexing such that $s \in \{1, \dots, S\}$. This allowed us to represent the transition probabilities in a matrix,

$$A_{ij} \;\Leftrightarrow\; p(s_t = j \mid s_{t-1} = i).$$
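As a minimal sketch of how a transition matrix drives a finite-state Markov chain, the snippet below samples a short state sequence with NumPy. The function name `sample_markov_chain`, the zero-based state indexing, and the numbers in the matrix are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def sample_markov_chain(A, s0, T, rng):
    """Sample T states from a finite-state Markov chain.

    A[i, j] = p(s_t = j | s_{t-1} = i); states are indexed from 0 here.
    """
    states = [s0]
    for _ in range(T - 1):
        # The next state depends only on the current one (Markov property).
        states.append(int(rng.choice(A.shape[0], p=A[states[-1]])))
    return np.array(states)

# Hypothetical 3-state transition matrix; each row sums to 1.
A = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
rng = np.random.default_rng(0)
chain = sample_markov_chain(A, s0=0, T=10, rng=rng)
```

Row `i` of `A` is the distribution over the next state given that the current state is `i`, which is why the sampler only ever looks at `A[states[-1]]`.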

HIDDEN MARKOV MODELS

[Figure: graphical model with latent states $s_1, s_2, \dots, s_{n-1}, s_n, s_{n+1}$ and one observation $x_t$ for each state $s_t$]

The hidden Markov model modified this by assuming the sequence of states was a latent process (i.e., unobserved). An observation $x_t$ is associated with each $s_t$, where

$$x_t \mid s_t \sim p(x \mid \theta_{s_t}).$$

Like a mixture model, this allowed for a few distributions to generate the data. It adds an extra transition rule between distributions.

DISCRETE STATE SPACES

[Figure: a three-state transition diagram ($k = 1, 2, 3$) with edges labeled $A_{11}$ through $A_{33}$, and a plot placing the three states at positions in the unit square]

In both cases, the state space was discrete and relatively small in number.

◮ For the Markov chain, we gave an example where states correspond to positions in $\mathbb{R}^d$.
◮ A continuous hidden Markov model might perturb the latent state of the Markov chain.
◮ For example, each $s_i$ can be modified by continuous-valued noise, $x_i = s_i + \epsilon_i$.
◮ But $s_{1:T}$ is still a discrete Markov chain.

DISCRETE VS CONTINUOUS STATE SPACES

Markov and hidden Markov models both assume a discrete state space.

For Markov models:
◮ The state could be a data point $x_i$ (Markov chain classifier)
◮ The state could be an object (object ranking)
◮ The state could be the destination of a link (internet search engines)

For hidden Markov models we can simplify complex data:
◮ Sequences of discrete data may come from a few discrete distributions.
◮ Sequences of continuous data may come from a few distributions.

What if we model the states as continuous too?

CONTINUOUS-STATE MARKOV MODEL

Continuous Markov models extend the state space to a continuous domain. Instead of $s \in \{1, \dots, S\}$, $s$ can take any value in $\mathbb{R}^d$.

Again compare:
◮ Discrete-state Markov models: The states live in a discrete space.
◮ Continuous-state Markov models: The states live in a continuous space.

The simplest example is the process

$$s_t = s_{t-1} + \epsilon_t, \qquad \epsilon_t \sim N(0, aI).$$

Each successive state is a perturbed version of the current state.
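The random-walk process $s_t = s_{t-1} + \epsilon_t$ with $\epsilon_t \sim N(0, aI)$ can be simulated in a few lines. This is only a sketch: the function name `gaussian_random_walk` and the values chosen for the starting point, `a`, and `T` are assumptions made for illustration.

```python
import numpy as np

def gaussian_random_walk(s0, a, T, rng):
    """Simulate s_t = s_{t-1} + eps_t with eps_t ~ N(0, a I), for t = 1..T."""
    d = s0.shape[0]
    # a is the variance on the diagonal, so the standard deviation is sqrt(a).
    steps = np.sqrt(a) * rng.standard_normal((T, d))
    # Cumulative sums of the steps give the displacement from s0 at each time.
    return s0 + np.cumsum(steps, axis=0)

rng = np.random.default_rng(0)
path = gaussian_random_walk(s0=np.zeros(2), a=0.1, T=100, rng=rng)
```

Row `t` of `path` holds the state after `t + 1` steps; the starting point `s0` itself is not included in the output.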

LINEAR GAUSSIAN MARKOV MODEL

The most basic continuous-state version of the hidden Markov model is called a linear Gaussian Markov model (also called the Kalman filter).

$$\underbrace{s_t = C s_{t-1} + \epsilon_{t-1}}_{\text{latent process}}, \qquad \underbrace{x_t = D s_t + \varepsilon_t}_{\text{observed process}}$$

◮ $s_t \in \mathbb{R}^p$ is a continuous-state latent (unobserved) Markov process
◮ $x_t \in \mathbb{R}^d$ is a continuous-valued observation
◮ The process noise $\epsilon_t \sim N(0, Q)$
◮ The measurement noise $\varepsilon_t \sim N(0, V)$
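To make the generative model concrete, here is a hedged sketch that draws a latent trajectory and its noisy observations from the two equations of the linear Gaussian Markov model. The function name `simulate_lgmm`, the starting state `s0`, and the particular matrices are hypothetical choices for illustration only.

```python
import numpy as np

def simulate_lgmm(C, D, Q, V, s0, T, rng):
    """Draw states s_1..s_T and observations x_1..x_T from the model."""
    p, d = C.shape[0], D.shape[0]
    S = np.zeros((T, p))
    X = np.zeros((T, d))
    s = s0
    for t in range(T):
        # Latent process: linear map of the previous state plus N(0, Q) noise.
        s = C @ s + rng.multivariate_normal(np.zeros(p), Q)
        S[t] = s
        # Observed process: linear map of the current state plus N(0, V) noise.
        X[t] = D @ s + rng.multivariate_normal(np.zeros(d), V)
    return S, X

# Hypothetical example: a 2-d state (position, velocity) observed through position only.
C = np.array([[1.0, 1.0],
              [0.0, 1.0]])
D = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
V = np.array([[0.25]])
S, X = simulate_lgmm(C, D, Q, V, s0=np.zeros(2), T=50, rng=np.random.default_rng(0))
```

Only `X` would be available in practice; `S` is returned here so that an estimate of the latent states can be compared against the truth in a simulation.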

EXAMPLE APPLICATIONS

[Figure: graphical model with latent states $s_1, s_2, \dots, s_{n-1}, s_n, s_{n+1}$ and one observation $x_t$ for each state $s_t$]

Difference from HMM: $s_t$ and $x_t$ are both from continuous distributions.

The linear Gaussian Markov model (and its variants) has many applications.
◮ Tracking moving objects
◮ Automatic control systems
◮ Economics and finance (e.g., stock modeling)
◮ etc.

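EXAMPLE: TRACKING

We get (very) noisy measurements of an object's position in time, $x_t \in \mathbb{R}^2$. The time-varying state vector is

$$s = [\,\mathrm{pos}_1 \;\; \mathrm{vel}_1 \;\; \mathrm{accel}_1 \;\; \mathrm{pos}_2 \;\; \mathrm{vel}_2 \;\; \mathrm{accel}_2\,]^T.$$

Motivated by the underlying physics, we model this as:

$$s_{t+1} = \underbrace{\begin{bmatrix}
1 & \Delta t & \tfrac{1}{2}(\Delta t)^2 & 0 & 0 & 0 \\
0 & 1 & \Delta t & 0 & 0 & 0 \\
0 & 0 & e^{-\alpha \Delta t} & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & \Delta t & \tfrac{1}{2}(\Delta t)^2 \\
0 & 0 & 0 & 0 & 1 & \Delta t \\
0 & 0 & 0 & 0 & 0 & e^{-\alpha \Delta t}
\end{bmatrix}}_{\equiv\, C} s_t + \epsilon_t$$

$$x_{t+1} = \underbrace{\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0
\end{bmatrix}}_{\equiv\, D} s_{t+1} + \varepsilon_{t+1}$$

Therefore, $s_t$ not only approximates where the target is, but where it's going.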
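The tracking matrices have a block structure: the same 3x3 block acts on (position, velocity, acceleration) for each of the two coordinates, and the observation matrix picks out the two positions. The sketch below builds them with NumPy; the function name `tracking_matrices` and the values of `dt` and `alpha` are assumptions for illustration.

```python
import numpy as np

def tracking_matrices(dt, alpha):
    """Build C (6x6) and D (2x6) for the state [pos1, vel1, accel1, pos2, vel2, accel2]."""
    # One block per coordinate: constant-acceleration kinematics, with the
    # acceleration itself decaying by a factor exp(-alpha * dt) per step.
    block = np.array([[1.0, dt, 0.5 * dt**2],
                      [0.0, 1.0, dt],
                      [0.0, 0.0, np.exp(-alpha * dt)]])
    # The Kronecker product with a 2x2 identity places the block twice on the diagonal.
    C = np.kron(np.eye(2), block)
    # D selects pos1 (index 0) and pos2 (index 3) from the state.
    D = np.zeros((2, 6))
    D[0, 0] = 1.0
    D[1, 3] = 1.0
    return C, D

C, D = tracking_matrices(dt=1.0, alpha=0.0)
s = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
s_next = C @ s        # noise-free one-step state prediction
x_next = D @ s_next   # noise-free predicted measurement (the two positions)
```

With `dt=1.0` and `alpha=0.0`, each position advances by its velocity plus half its acceleration, which is the usual constant-acceleration update.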


THE LEARNING PROBLEM

As with the hidden Markov model, we're given the sequence $(x_1, x_2, x_3, \dots)$, where each $x \in \mathbb{R}^d$. The goal is to learn the state sequence $(s_1, s_2, s_3, \dots)$.

All distributions are Gaussian,

$$p(s_{t+1} = s \mid s_t) = N(C s_t, Q), \qquad p(x_t = x \mid s_t) = N(D s_t, V).$$

Notice that with the discrete HMM we wanted to learn $\pi$, $A$ and $B$, where
◮ $\pi$ is the initial state distribution
◮ $A$ is the transition matrix among the discrete set of states
◮ $B$ contains the state-dependent distributions on discrete-valued data

The situation here is very different.

THE LEARNING PROBLEM

No "B" to learn: In the linear Gaussian Markov model, each state is unique and so the distribution on $x_t$ is different for each $t$.

No "A" to learn: In addition, each state transition is to a brand new state, so each $s_t$ has its own unique probability distribution.

What we can learn are the two posterior distributions.
1. $p(s_t \mid x_1, \dots, x_t)$: A distribution on the current state given the past.
2. $p(s_t \mid x_1, \dots, x_T)$: A distribution on each latent state in the sequence.

◮ #1: Kalman filtering problem. We'll focus on this one today.
◮ #2: Kalman smoothing problem. Requires extra step (not discussed).

THE KALMAN FILTER

Goal: Learn the sequence of distributions $p(s_t \mid x_1, \dots, x_t)$ given a sequence of data $(x_1, x_2, x_3, \dots)$ and the model

$$s_{t+1} \mid s_t \sim N(C s_t, Q), \qquad x_t \mid s_t \sim N(D s_t, V).$$

This is the (linear) Kalman filtering problem and is often used for tracking.

Setup: We can use Bayes rule to write

$$p(s_t \mid x_1, \dots, x_t) \propto p(x_t \mid s_t)\, p(s_t \mid x_1, \dots, x_{t-1})$$

and represent the prior as a marginal distribution

$$p(s_t \mid x_1, \dots, x_{t-1}) = \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid x_1, \dots, x_{t-1})\, ds_{t-1}.$$

THE KALMAN FILTER

We've decomposed the problem into parts that we do and don't know (yet)

$$p(s_t \mid x_1, \dots, x_t) \propto \underbrace{p(x_t \mid s_t)}_{N(D s_t,\, V)} \int \underbrace{p(s_t \mid s_{t-1})}_{N(C s_{t-1},\, Q)}\, \underbrace{p(s_{t-1} \mid x_1, \dots, x_{t-1})}_{?}\, ds_{t-1}$$

Observations and considerations:
1. The left is the posterior on $s_t$ and the right has the posterior on $s_{t-1}$.
2. We want the integral to be in closed form and a known distribution.
3. We want the prior and likelihood terms to lead to a known posterior.
4. We want future calculations, e.g. for $s_{t+1}$, to be easy.

We will see how choosing the Gaussian distribution makes this all work.

THE KALMAN FILTER: STEP 1

Calculate the marginal for prior distribution

Hypothesize (temporarily) that the unknown distribution is Gaussian,

$$p(s_t \mid x_1, \dots, x_t) \propto \underbrace{p(x_t \mid s_t)}_{N(D s_t,\, V)} \int \underbrace{p(s_t \mid s_{t-1})}_{N(C s_{t-1},\, Q)}\, \underbrace{p(s_{t-1} \mid x_1, \dots, x_{t-1})}_{N(\mu,\, \Sigma) \text{ by hypothesis}}\, ds_{t-1}$$

A property of the Gaussian is that marginals are still Gaussian,

$$\int N(s_t \mid C s_{t-1}, Q)\, N(s_{t-1} \mid \mu, \Sigma)\, ds_{t-1} = N(s_t \mid C\mu,\; Q + C \Sigma C^T).$$

We know $C$ and $Q$ (by design) and $\mu$ and $\Sigma$ (by hypothesis).
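Step 1 amounts to two matrix expressions: the marginal prior on $s_t$ has mean $C\mu$ and covariance $Q + C\Sigma C^T$. A minimal sketch, where the function name `kalman_predict_state` and the example numbers are assumptions for illustration:

```python
import numpy as np

def kalman_predict_state(mu, Sigma, C, Q):
    """Marginal prior on s_t given the posterior N(mu, Sigma) on s_{t-1}."""
    prior_mean = C @ mu
    prior_cov = Q + C @ Sigma @ C.T
    return prior_mean, prior_cov

# Hypothetical values for a 2-d state.
mu = np.array([1.0, 2.0])
Sigma = np.eye(2)
C = np.array([[1.0, 1.0],
              [0.0, 1.0]])
Q = 0.5 * np.eye(2)
prior_mean, prior_cov = kalman_predict_state(mu, Sigma, C, Q)
```

Pushing the previous posterior through $C$ moves the mean, and the added $Q$ means the covariance $Q + C\Sigma C^T$ is at least as large as $C\Sigma C^T$: uncertainty grows before the next observation arrives.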

THE KALMAN FILTER: STEP 2

Calculate the posterior

We plug in the marginal distribution for the prior and see that

$$p(s_t \mid x_1, \dots, x_t) \propto N(x_t \mid D s_t, V)\, N(s_t \mid C\mu,\; Q + C \Sigma C^T).$$

Though the parameters look complicated, the posterior is just a Gaussian

$$p(s_t \mid x_1, \dots, x_t) = N(s_t \mid \mu', \Sigma')$$

$$\Sigma' = \left( (Q + C \Sigma C^T)^{-1} + D^T V^{-1} D \right)^{-1}$$

$$\mu' = \Sigma' \left( D^T V^{-1} x_t + (Q + C \Sigma C^T)^{-1} C \mu \right)$$

We can plug the relevant values into these two equations.
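The two posterior equations translate directly into code: the posterior precision is the prior precision $(Q + C\Sigma C^T)^{-1}$ plus $D^T V^{-1} D$, and the mean weights the observation against the prior mean $C\mu$. The sketch below follows that form literally using `np.linalg.inv`; the function name `kalman_update` and the scalar example are assumptions, and a production implementation might prefer linear solves over explicit inverses.

```python
import numpy as np

def kalman_update(x, mu, Sigma, C, D, Q, V):
    """Posterior N(mu_new, Sigma_new) on s_t after observing x_t = x.

    mu, Sigma describe the previous posterior on s_{t-1}.
    """
    P = Q + C @ Sigma @ C.T            # covariance of the marginal prior on s_t
    P_inv = np.linalg.inv(P)
    V_inv = np.linalg.inv(V)
    Sigma_new = np.linalg.inv(P_inv + D.T @ V_inv @ D)
    mu_new = Sigma_new @ (D.T @ V_inv @ x + P_inv @ C @ mu)
    return mu_new, Sigma_new

# Hypothetical scalar example written with 1x1 matrices.
mu_new, Sigma_new = kalman_update(
    x=np.array([2.0]),
    mu=np.array([0.0]), Sigma=np.array([[0.5]]),
    C=np.array([[1.0]]), D=np.array([[1.0]]),
    Q=np.array([[0.5]]), V=np.array([[1.0]]),
)
```

In this example the prior on $s_t$ is $N(0, 1)$ and the measurement noise variance is also 1, so the posterior mean lands halfway between the prior mean 0 and the observation 2, with variance 0.5.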

ADDRESSING THE GAUSSIAN ASSUMPTION

By making the assumption of a Gaussian in the prior,

$$p(s_t \mid x_1, \dots, x_t) \propto \underbrace{p(x_t \mid s_t)}_{N(x_t \mid D s_t,\, V)} \int \underbrace{p(s_t \mid s_{t-1})}_{N(s_t \mid C s_{t-1},\, Q)}\, \underbrace{p(s_{t-1} \mid x_1, \dots, x_{t-1})}_{N(\mu,\, \Sigma) \text{ by hypothesis}}\, ds_{t-1}$$

we found that the posterior is also Gaussian with a new mean and covariance.

◮ We therefore only need to define a Gaussian prior on the first state to keep things moving forward. For example, $p(s_0) = N(0, I)$.

Once this is done, all future calculations are in closed form.

KALMAN FILTER: ONE FINAL QUANTITY

Making predictions

We know how to update the sequence of state posterior distributions $p(s_t \mid x_1, \dots, x_t)$. What about predicting $x_{t+1}$?

$$p(x_{t+1} \mid x_1, \dots, x_t) = \int p(x_{t+1} \mid s_{t+1})\, p(s_{t+1} \mid x_1, \dots, x_t)\, ds_{t+1}$$

$$= \int \underbrace{p(x_{t+1} \mid s_{t+1})}_{N(x_{t+1} \mid D s_{t+1},\, V)} \int \underbrace{p(s_{t+1} \mid s_t)}_{N(s_{t+1} \mid C s_t,\, Q)}\, \underbrace{p(s_t \mid x_1, \dots, x_t)}_{N(s_t \mid \mu',\, \Sigma')}\, ds_t\, ds_{t+1}$$

Again, Gaussians are nice because these operations stay Gaussian. This is a multivariate Gaussian that looks even more complicated than the previous one (omitted). Simply perform the previous integral twice.
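The slide omits the resulting distribution, but applying the Gaussian marginalization rule from Step 1 twice gives it: first $s_{t+1} \mid x_{1:t} \sim N(C\mu',\, Q + C\Sigma' C^T)$, then $x_{t+1} \mid x_{1:t} \sim N(DC\mu',\, V + D(Q + C\Sigma' C^T)D^T)$. The sketch below computes these two moments; the function name `predict_observation` and the example values are assumptions for illustration.

```python
import numpy as np

def predict_observation(mu_post, Sigma_post, C, D, Q, V):
    """Mean and covariance of p(x_{t+1} | x_1..x_t), given the posterior
    N(mu_post, Sigma_post) on s_t."""
    # First marginalization: integrate out s_t to get the prior on s_{t+1}.
    state_mean = C @ mu_post
    state_cov = Q + C @ Sigma_post @ C.T
    # Second marginalization: integrate out s_{t+1} to get the law of x_{t+1}.
    obs_mean = D @ state_mean
    obs_cov = V + D @ state_cov @ D.T
    return obs_mean, obs_cov

# Hypothetical scalar example written with 1x1 matrices.
obs_mean, obs_cov = predict_observation(
    mu_post=np.array([1.0]), Sigma_post=np.array([[0.5]]),
    C=np.array([[2.0]]), D=np.array([[3.0]]),
    Q=np.array([[1.0]]), V=np.array([[0.5]]),
)
```

Both steps use the same identity, with $(C, Q)$ in the first and $(D, V)$ in the second, which is the sense in which the integral is performed twice.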

ALGORITHM: KALMAN FILTERING

The Kalman filtering algorithm can be run in real time.

0. Set the initial state distribution $p(s_0) = N(0, I)$.
1. Prior to observing each new $x_t \in \mathbb{R}^d$, predict $x_t \sim N(\mu^x_t, \Sigma^x_t)$ (using the previously discussed marginalization).
2. After observing each new $x_t \in \mathbb{R}^d$, update $p(s_t \mid x_1, \dots, x_t) = N(\mu^s_t, \Sigma^s_t)$ (using the posterior equations from Step 2).
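Putting the steps together, one pass over the data alternates a prediction of $x_t$ with an update of the state posterior. The following is a minimal sketch under the model assumptions of this lecture (known $C$, $D$, $Q$, $V$ and prior $s_0 \sim N(0, I)$); the function name `kalman_filter`, its return layout, and the one-point example are hypothetical.

```python
import numpy as np

def kalman_filter(X, C, D, Q, V):
    """Run the filter over observations X (one row per time step).

    Returns predictive means/covariances of x_t and posterior
    means/covariances of s_t, each stacked over time.
    """
    p = C.shape[0]
    mu, Sigma = np.zeros(p), np.eye(p)       # step 0: p(s_0) = N(0, I)
    V_inv = np.linalg.inv(V)
    x_means, x_covs, s_means, s_covs = [], [], [], []
    for x in X:
        # Step 1: before seeing x_t, form the prior on s_t and predict x_t.
        m = C @ mu
        P = Q + C @ Sigma @ C.T
        x_means.append(D @ m)
        x_covs.append(V + D @ P @ D.T)
        # Step 2: after seeing x_t, update the posterior on s_t.
        P_inv = np.linalg.inv(P)
        Sigma = np.linalg.inv(P_inv + D.T @ V_inv @ D)
        mu = Sigma @ (D.T @ V_inv @ x + P_inv @ m)
        s_means.append(mu)
        s_covs.append(Sigma)
    return (np.array(x_means), np.array(x_covs),
            np.array(s_means), np.array(s_covs))

# Hypothetical scalar example with a single observation x_1 = 2.
one = np.array([[1.0]])
x_means, x_covs, s_means, s_covs = kalman_filter(
    X=np.array([[2.0]]), C=one, D=one, Q=one, V=one)
```

Because each iteration needs only the previous posterior `(mu, Sigma)` and the newest observation, the loop can run online as data arrive, which is what makes real-time use possible.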
