Introduction to (Statistical) Machine Learning



  1. Introduction to (Statistical) Machine Learning, Brown University CSCI1420 & ENGN2520, Prof. Erik Sudderth. Lecture for Nov. 21, 2013: HMMs: Forward-Backward & EM Algorithms, Principal Components Analysis (PCA). Many figures courtesy of Kevin Murphy's textbook, Machine Learning: A Probabilistic Perspective.

  2. Inference for HMMs
[Figure: HMM graphical model with hidden states $z_1, \dots, z_5$ and observations $x_1, \dots, x_5$]
• Assume the parameters defining the HMM are fixed and known: distributions of the initial state, state transitions, and observations.
• Given an observation sequence, we want to estimate the hidden states.
Minimize sequence (word) error rate: $L(z, a) = I(z \neq a)$
$$\hat{z} = \arg\max_z p(z \mid x) = \arg\max_z \left[ p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \right] \left[ \prod_{t=1}^{T} p(x_t \mid z_t) \right]$$
Minimize state (symbol) error rate: $L(z, a) = \sum_{t=1}^{T} I(z_t \neq a_t)$
$$\hat{z}_t = \arg\max_{z_t} p(z_t \mid x) = \arg\max_{z_t} \sum_{z_1} \cdots \sum_{z_{t-1}} \sum_{z_{t+1}} \cdots \sum_{z_T} p(z, x)$$
Problem: Naïve computation of either estimate requires $O(K^T)$ operations.
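To make that $O(K^T)$ cost concrete, here is a minimal sketch that computes the state marginals by brute-force enumeration of every state sequence. The toy matrices (`pi`, `A`, `B`) and observation sequence `x` are made up for illustration, not taken from the slides; the dynamic programming on the next slides makes this loop unnecessary.

```python
import itertools
import numpy as np

K, T = 3, 5                               # number of hidden states, sequence length
rng = np.random.default_rng(0)
pi = np.full(K, 1.0 / K)                  # initial state distribution p(z_1)
A = rng.dirichlet(np.ones(K), size=K)     # A[i, j] = p(z_t = j | z_{t-1} = i)
B = rng.dirichlet(np.ones(4), size=K)     # B[k, v] = p(x_t = v | z_t = k), 4 symbols
x = rng.integers(0, 4, size=T)            # a toy observation sequence

marg = np.zeros((T, K))                   # accumulates p(z_t = k, x)
for z in itertools.product(range(K), repeat=T):   # all K**T state sequences
    p = pi[z[0]] * B[z[0], x[0]]
    for t in range(1, T):
        p *= A[z[t - 1], z[t]] * B[z[t], x[t]]
    for t in range(T):
        marg[t, z[t]] += p
marg /= marg.sum(axis=1, keepdims=True)   # normalize rows to get p(z_t | x)
```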

  3. Forward Filtering for HMMs
[Figure: HMM graphical model]
$$p(z, x) = p(z)\, p(x \mid z) = \left[ p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \right] \left[ \prod_{t=1}^{T} p(x_t \mid z_t) \right]$$
Filtered state estimates: $\alpha_t(z_t) = p(z_t \mid x_t, x_{t-1}, \dots, x_1)$
• Directly useful for online inference or tracking with HMMs
• Building block towards finding the posterior given all observations
Initialization: Easy from the known HMM parameters,
$$\alpha_1(z_1) = p(z_1 \mid x_1) \propto p(z_1)\, p(x_1 \mid z_1)$$
(multiply by a proportionality constant so that it sums to one).
Recursion: The derivation will follow from Markov properties,
$$\alpha_t(z_t) \propto p(x_t \mid z_t) \sum_{z_{t-1}=1}^{K} p(z_t \mid z_{t-1})\, \alpha_{t-1}(z_{t-1}), \qquad O(K^2) \text{ per step.}$$
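A minimal sketch of this forward recursion, reusing the toy `pi`, `A`, `B`, `x` arrays (and the NumPy import) from the brute-force snippet above; the function name is my own, not the slides'.

```python
def forward_filter(pi, A, B, x):
    """alpha[t, k] = p(z_t = k | x_1, ..., x_t), computed left to right."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, x[0]]              # initialization: p(z_1) p(x_1 | z_1)
    alpha[0] /= alpha[0].sum()              # proportionality constant: sums to one
    for t in range(1, T):
        pred = A.T @ alpha[t - 1]           # sum_{z_{t-1}} p(z_t | z_{t-1}) alpha_{t-1}(z_{t-1})
        alpha[t] = B[:, x[t]] * pred        # multiply by the likelihood p(x_t | z_t)
        alpha[t] /= alpha[t].sum()          # O(K^2) work per time step
    return alpha

alpha = forward_filter(pi, A, B, x)         # final row matches the brute-force marg[T - 1]
```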

  4. Forward Filtering for HMMs
[Figure: HMM graphical model]
$$\alpha_t(z_t) = p(z_t \mid x_t, x_{t-1}, \dots, x_1), \qquad \alpha_{t+1}(z_{t+1}) \propto p(x_{t+1} \mid z_{t+1}) \sum_{z_t=1}^{K} p(z_{t+1} \mid z_t)\, \alpha_t(z_t)$$
Prediction Step: Given current knowledge, what is the next state?
$$p(z_{t+1} \mid x_t, \dots, x_1) = \sum_{z_t=1}^{K} p(z_{t+1} \mid z_t)\, \alpha_t(z_t)$$
Update Step: What does the latest observation tell us about the state?
$$\alpha_{t+1}(z_{t+1}) = p(z_{t+1} \mid x_{t+1}, x_t, \dots, x_1) \propto p(x_{t+1} \mid z_{t+1})\, p(z_{t+1} \mid x_t, \dots, x_1)$$
Key Markov Identities: From the generative structure of the HMM,
$$p(z_{t+1} \mid z_t, x_t, \dots, x_1) = p(z_{t+1} \mid z_t), \qquad p(x_{t+1} \mid z_{t+1}, x_t, \dots, x_1) = p(x_{t+1} \mid z_{t+1})$$
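The same recursion, written as one explicit prediction/update step (a sketch; `alpha_t` is one row of the `alpha` array from the `forward_filter` sketch above, and the function name is hypothetical):

```python
def predict_update(alpha_t, A, B, x_next):
    predicted = A.T @ alpha_t               # prediction step: p(z_{t+1} | x_1, ..., x_t)
    updated = B[:, x_next] * predicted      # update step: weight by p(x_{t+1} | z_{t+1})
    return updated / updated.sum()          # normalize: p(z_{t+1} | x_1, ..., x_{t+1})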

  5. Forward-Backward for HMMs
[Figure: HMM graphical model]
Forward Recursion: Distribution of the state given past data,
$$\alpha_t(z_t) = p(z_t \mid x_t, x_{t-1}, \dots, x_1), \quad \alpha_1(z_1) \propto p(z_1)\, p(x_1 \mid z_1), \quad \alpha_{t+1}(z_{t+1}) \propto p(x_{t+1} \mid z_{t+1}) \sum_{z_t=1}^{K} p(z_{t+1} \mid z_t)\, \alpha_t(z_t)$$
Backward Recursion: Likelihood of future data given the state,
$$\beta_t(z_t) \propto p(x_{t+1}, \dots, x_T \mid z_t), \quad \beta_T(z_T) = 1, \quad \beta_t(z_t) \propto \sum_{z_{t+1}=1}^{K} p(x_{t+1} \mid z_{t+1})\, p(z_{t+1} \mid z_t)\, \beta_{t+1}(z_{t+1})$$
Marginal: Posterior distribution of the state given all data,
$$p(z_t \mid x_1, \dots, x_T) \propto \alpha_t(z_t)\, \beta_t(z_t)$$
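A minimal sketch of the backward recursion and the smoothed marginals, built on the `forward_filter` sketch above and the same hypothetical toy arrays:

```python
def forward_backward(pi, A, B, x):
    """Return filtered alpha, backward beta, and smoothed marginals gamma."""
    T, K = len(x), len(pi)
    alpha = forward_filter(pi, A, B, x)        # forward recursion (previous sketch)
    beta = np.ones((T, K))                     # beta_T(z_T) = 1
    for t in range(T - 2, -1, -1):
        msg = B[:, x[t + 1]] * beta[t + 1]     # p(x_{t+1} | z_{t+1}) beta_{t+1}(z_{t+1})
        beta[t] = A @ msg                      # sum over z_{t+1} with p(z_{t+1} | z_t)
        beta[t] /= beta[t].sum()               # rescale; beta is only defined up to a constant
    gamma = alpha * beta                       # proportional to p(z_t | x_1, ..., x_T)
    gamma /= gamma.sum(axis=1, keepdims=True)
    return alpha, beta, gamma
```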

  6. EM for Hidden Markov Models
[Figure: HMM graphical model with parameter nodes $\pi$, $\theta$]
Parameters (state transition & emission distributions): $\pi, \theta$. Hidden discrete state sequence: $z_1, \dots, z_N$.
• Initialization: Randomly select starting parameters
• E-Step: Given parameters, find the posterior of the hidden states
  • Dynamic programming to efficiently infer state marginals
• M-Step: Given posterior distributions, find likely parameters
  • Like the training of mixture models and Markov chains
• Iteration: Alternate E-step & M-step until convergence

  7. E-Step: HMMs
$$q^{(t)}(z) = p(z \mid x, \pi^{(t-1)}, \theta^{(t-1)}) \propto p(z \mid \pi^{(t-1)})\, p(x \mid z, \theta^{(t-1)})$$
Mixture Models:
$$q^{(t)}(z) \propto \prod_{i=1}^{N} p(z_i \mid \pi^{(t-1)})\, p(x_i \mid z_i, \theta^{(t-1)})$$
• Hidden states are conditionally independent given the parameters
• Naïve representation of the full posterior has size $O(KN)$
HMMs:
$$q^{(t)}(z) \propto \prod_{i=1}^{N} p(z_i \mid \pi^{(t-1)}_{z_{i-1}})\, p(x_i \mid z_i, \theta^{(t-1)})$$
• Hidden states have Markov dependence given the parameters
• Naïve representation of the full posterior has size $O(K^N)$
• But our forward-backward dynamic programming can quickly find the marginals (at each time) of the posterior distribution
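A sketch of the other E-step quantity the next slide asks for, the pairwise marginals $p(z_t, z_{t+1} \mid x)$, assembled from the (rescaled) alpha and beta messages of the `forward_backward` sketch above; the name `pairwise_marginals` and the toy arrays are assumptions, not the slides' notation.

```python
def pairwise_marginals(pi, A, B, x):
    """xi[t, i, j] = p(z_t = i, z_{t+1} = j | x), needed by the M-step."""
    alpha, beta, _ = forward_backward(pi, A, B, x)
    T, K = len(x), len(pi)
    xi = np.zeros((T - 1, K, K))
    for t in range(T - 1):
        # xi_t(i, j) propto alpha_t(i) p(z_{t+1}=j | z_t=i) p(x_{t+1} | z_{t+1}=j) beta_{t+1}(j)
        xi[t] = alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()                   # normalize over all (i, j) pairs
    return xi
```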

  8. M-Step: HMMs
$$\theta^{(t)} = \arg\max_\theta \mathcal{L}(q^{(t)}, \theta) = \arg\max_\theta \sum_z q(z) \ln p(x, z \mid \theta)$$
The initial state distribution, state transition distribution, and state emission distribution (observation likelihoods) are updated via weighted moment matching.
This requires the posterior marginal distributions of single states and of pairs of sequential states: $p(z_t \mid x)$ and $p(z_t, z_{t+1} \mid x)$.
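A minimal sketch of the weighted moment-matching updates for a single discrete observation sequence, consuming the `gamma` and `xi` marginals from the sketches above (names and the single-sequence setup are my assumptions):

```python
def m_step(gamma, xi, x, num_symbols):
    """Weighted moment matching for one training sequence of discrete symbols."""
    pi_new = gamma[0]                                  # initial state distribution
    A_new = xi.sum(axis=0)                             # expected transition counts
    A_new /= A_new.sum(axis=1, keepdims=True)          # normalize each row
    K = gamma.shape[1]
    B_new = np.zeros((K, num_symbols))
    for v in range(num_symbols):
        B_new[:, v] = gamma[x == v].sum(axis=0)        # expected emission counts
    B_new /= B_new.sum(axis=1, keepdims=True)
    return pi_new, A_new, B_new

# One full EM iteration on the toy sequence:
_, _, gamma = forward_backward(pi, A, B, x)            # E-step: single-state marginals
xi = pairwise_marginals(pi, A, B, x)                   # E-step: pairwise marginals
pi, A, B = m_step(gamma, xi, x, num_symbols=4)         # M-step: re-estimate parameters
```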

  9. Unsupervised Learning
             Supervised Learning                 Unsupervised Learning
Discrete     classification or categorization    clustering
Continuous   regression                          dimensionality reduction
• Goal: Infer a label/response y given only features x
• Classical: Find latent variables y good for compression of x
• Probabilistic learning: Estimate parameters of the joint distribution p(x, y) which maximize the marginal probability p(x)

  10. Dimensionality Reduction
[Figure: Isomap Algorithm, Tenenbaum et al., Science 2000]

  11. PCA Objective: Compression
• Observed feature vectors: $x_n \in \mathbb{R}^D$, $n = 1, 2, \dots, N$
• Hidden manifold coordinates: $z_n \in \mathbb{R}^M$, $n = 1, 2, \dots, N$
• Hidden linear mapping: $\tilde{x}_n = W z_n + b$, with $W \in \mathbb{R}^{D \times M}$ and $b \in \mathbb{R}^{D \times 1}$
$$J(z, W, b \mid x, M) = \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2 = \sum_{n=1}^{N} \| x_n - W z_n - b \|^2$$
• Unlike clustering objectives like K-means, we can find the global optimum of this objective efficiently: set $b = \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$ and construct $W$ from the top eigenvectors of the sample covariance matrix (the directions of largest variance).
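A small sketch of this reconstruction objective for any candidate $W$, $b$, assuming $W$ has orthonormal columns so that the optimal coordinates are the projections $z_n = W^T(x_n - b)$ (as the one-dimensional derivation below shows); the function name and row-wise data layout are my own choices.

```python
import numpy as np

def pca_objective(X, W, b):
    """X: (N, D) data rows, W: (D, M) mapping, b: (D,) offset. Returns J."""
    Z = (X - b) @ W                        # coordinates z_n = W^T (x_n - b)
    X_tilde = Z @ W.T + b                  # reconstructions x~_n = W z_n + b
    return np.sum((X - X_tilde) ** 2)      # J = sum_n || x_n - x~_n ||^2
```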

  12. Principal Components Analysis
Example: PCA analysis of MNIST images of the digit 3.
$$J(z, W, b \mid x, M) = \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2 = \sum_{n=1}^{N} \| x_n - W z_n - b \|^2$$
• PCA models all translations of the data equally well (by shifting $b$)
• PCA models all rotations of the data equally well (by rotating $W$)
• Appropriate when modeling quantities over time, space, etc.

  13. PCA Derivation: One Dimension
• Observed feature vectors: $x_n \in \mathbb{R}^D$, $n = 1, 2, \dots, N$
• Hidden manifold coordinates: $z_n \in \mathbb{R}$, $n = 1, 2, \dots, N$
• Hidden linear mapping: $\tilde{x}_n = w z_n$, with $w \in \mathbb{R}^{D \times 1}$ and $w^T w = 1$
Assume the mean has already been subtracted from the data (centered).
$$J(z, w \mid x) = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2 = \frac{1}{N} \sum_{n=1}^{N} \| x_n - w z_n \|^2$$
• Step 1: The optimal manifold coordinate is always the projection $\hat{z}_n = w^T x_n$
• Step 2: The optimal mapping maximizes the variance of the projection,
$$J(\hat{z}, w \mid x) = C - \frac{1}{N} \sum_{n=1}^{N} (w^T x_n)(x_n^T w) = C - w^T \Sigma w, \qquad \Sigma = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T$$
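A sketch of the one-dimensional case on made-up toy data: with centered data, the unit-norm direction $w$ maximizing $w^T \Sigma w$ is the top eigenvector of $\Sigma$, and the projected variance equals the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
X = X - X.mean(axis=0)                     # assume mean already subtracted
Sigma = (X.T @ X) / len(X)                 # Sigma = (1/N) sum_n x_n x_n^T
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
w = eigvecs[:, -1]                         # top eigenvector, so w^T w = 1
z_hat = X @ w                              # Step 1: z_n = w^T x_n
print(np.var(z_hat), eigvals[-1])          # projected variance = largest eigenvalue
```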

  14. Gaussian Geometry
• Eigenvalues and eigenvectors: $\Sigma u_i = \lambda_i u_i$, $i = 1, \dots, d$, for $\Sigma \in \mathbb{R}^{d \times d}$; collecting $U = [u_1, \dots, u_d]$ and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ gives $\Sigma U = U \Lambda$.
• For a symmetric matrix: $\lambda_i \in \mathbb{R}$, $u_i^T u_i = 1$, $u_i^T u_j = 0$ for $i \neq j$, and
$$\Sigma = U \Lambda U^T = \sum_{i=1}^{d} \lambda_i u_i u_i^T$$
• For a positive semidefinite matrix: $\lambda_i \geq 0$
• For a positive definite matrix: $\lambda_i > 0$
• Quadratic forms:
$$\Sigma^{-1} = U \Lambda^{-1} U^T = \sum_{i=1}^{d} \frac{1}{\lambda_i} u_i u_i^T, \qquad y_i = u_i^T (x - \mu)$$
where $y_i$ is the projection of the difference from the mean onto eigenvector $u_i$.
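A quick numerical check of these identities on a random symmetric positive definite matrix (toy data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
G = rng.normal(size=(4, 4))
Sigma = G @ G.T + 4 * np.eye(4)                  # symmetric, positive definite
lam, U = np.linalg.eigh(Sigma)                   # Sigma u_i = lambda_i u_i
Lam = np.diag(lam)
print(np.allclose(Sigma, U @ Lam @ U.T))         # Sigma = U Lambda U^T
print(np.allclose(np.linalg.inv(Sigma),
                  U @ np.diag(1.0 / lam) @ U.T)) # Sigma^{-1} = U Lambda^{-1} U^T
print(np.allclose(U.T @ U, np.eye(4)))           # orthonormal eigenvectors
```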

  15. Maximizes Variance & Minimizes Error
[Figure: projecting the data points $x_n$ onto the direction $u$ simultaneously maximizes the projected variance and minimizes the reconstruction error. C. Bishop, Pattern Recognition & Machine Learning]

  16. Principal Components Analysis (PCA)
[Figure: 3D data, its best 2D projection, and its best 1D projection]

  17. PCA Optimal Solution
$$J(z, W, b \mid x, M) = \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2 = \sum_{n=1}^{N} \| x_n - W z_n - b \|^2$$
$$b = \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad X = [x_1 - \bar{x},\, x_2 - \bar{x},\, \dots,\, x_N - \bar{x}]$$
• Option A: Eigendecomposition of the sample covariance matrix,
$$\Sigma = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T = \frac{1}{N} X X^T = U \Lambda U^T$$
Construct $W$ from the eigenvectors with the $M$ largest eigenvalues.
• Option B: Singular value decomposition (SVD) of the centered data, $X = U S V^T$.
Construct $W$ from the singular vectors with the $M$ largest singular values.
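A minimal sketch of the two equivalent routes to $W$, on made-up toy data and a hypothetical choice of $M$. Note one layout difference from the slide: the data matrix here stores points as rows, so the relevant singular vectors are the right singular vectors rather than the left ones.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))               # N = 100 points in D = 10 dimensions (rows)
M = 3                                        # number of components to keep
b = X.mean(axis=0)                           # b = x-bar
Xc = X - b                                   # centered data (rows here, columns on the slide)

# Option A: eigendecomposition of the sample covariance matrix
Sigma = (Xc.T @ Xc) / len(Xc)
lam, U = np.linalg.eigh(Sigma)               # ascending eigenvalues
W_eig = U[:, ::-1][:, :M]                    # eigenvectors with the M largest eigenvalues

# Option B: SVD of the centered data matrix
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W_svd = Vt[:M].T                             # top-M right singular vectors

# The two constructions agree up to the sign of each column
print(np.allclose(np.abs(W_eig.T @ W_svd), np.eye(M)))
```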
