SLIDE 1

Introduction to (Statistical) Machine Learning

Brown University CSCI1420 & ENGN2520

  • Prof. Erik Sudderth

Lecture for Nov. 21, 2013:

HMMs: Forward-Backward & EM Algorithms, Principal Components Analysis (PCA)

Many figures courtesy of Kevin Murphy’s textbook, Machine Learning: A Probabilistic Perspective

SLIDE 2

Inference for HMMs

[HMM graphical model: hidden states z1, ..., z5 with observations x1, ..., x5]

  • Assume parameters defining the HMM are fixed and known:

distributions of initial state, state transitions, observations

  • Given observation sequence, want to estimate hidden states

Minimize sequence (word) error rate: $L(z, a) = \mathbb{I}(z \neq a)$
$$\hat{z} = \arg\max_{z} p(z \mid x) = \arg\max_{z} \left[ p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \right] \cdot \left[ \prod_{t=1}^{T} p(x_t \mid z_t) \right]$$

Minimize state (symbol) error rate: $L(z, a) = \sum_{t=1}^{T} \mathbb{I}(z_t \neq a_t)$
$$\hat{z}_t = \arg\max_{z_t} p(z_t \mid x) = \arg\max_{z_t} \sum_{z_1} \cdots \sum_{z_{t-1}} \sum_{z_{t+1}} \cdots \sum_{z_T} p(z, x)$$

Problem: Naïve computation of either estimate requires $O(K^T)$ operations.
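To make the $O(K^T)$ cost concrete, here is a brute-force sketch that literally enumerates every state sequence of a tiny toy HMM; the parameter names pi0, A, and B are assumptions for this illustration, and the approach is only feasible for very small K and T.

```python
import itertools
import numpy as np

def brute_force_estimates(pi0, A, B, x):
    """Enumerate all K^T state sequences of a toy HMM.

    pi0: (K,) initial state distribution p(z_1)
    A:   (K, K) transitions, A[i, j] = p(z_t = j | z_{t-1} = i)
    B:   (K, V) emissions,   B[k, v] = p(x_t = v | z_t = k)
    x:   length-T sequence of discrete observations
    """
    K, T = len(pi0), len(x)
    best_seq, best_p = None, -1.0
    marginals = np.zeros((T, K))                      # accumulates p(z_t, x)
    for z in itertools.product(range(K), repeat=T):   # all K^T sequences
        p = pi0[z[0]] * B[z[0], x[0]]
        for t in range(1, T):
            p *= A[z[t - 1], z[t]] * B[z[t], x[t]]
        if p > best_p:                                # sequence (MAP) estimate
            best_seq, best_p = z, p
        for t in range(T):                            # accumulate state marginals
            marginals[t, z[t]] += p
    marginals /= marginals.sum(axis=1, keepdims=True)
    return best_seq, marginals.argmax(axis=1)         # sequence and state estimates
```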

SLIDE 3

Forward Filtering for HMMs

[HMM graphical model: hidden states z1, ..., z5 with observations x1, ..., x5]

Filtered state estimates: $\alpha_t(z_t) = p(z_t \mid x_t, x_{t-1}, \ldots, x_1)$

  • Directly useful for online inference or tracking with HMMs
  • Building block towards finding posterior given all observations

$$p(z, x) = p(z)\, p(x \mid z) = \left[ p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \right] \cdot \left[ \prod_{t=1}^{T} p(x_t \mid z_t) \right]$$

Recursion: Derivation will follow from Markov properties
$$\alpha_t(z_t) \propto p(x_t \mid z_t) \sum_{z_{t-1}=1}^{K} p(z_t \mid z_{t-1})\, \alpha_{t-1}(z_{t-1}) \qquad O(K^2) \text{ per step}$$

Initialization: Easy from known HMM parameters
$$\alpha_1(z_1) = p(z_1 \mid x_1) \propto p(z_1)\, p(x_1 \mid z_1)$$

Multiply by a proportionality constant so that each $\alpha_t$ sums to one.
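As a concrete illustration of this recursion, here is a minimal NumPy sketch of forward filtering for an HMM with discrete observations; the parameter names pi0, A, and B are assumptions for the example, not notation from the lecture.

```python
import numpy as np

def forward_filter(pi0, A, B, x):
    """Normalized forward recursion: alpha[t, k] = p(z_t = k | x_1, ..., x_t).

    pi0: (K,) initial state distribution p(z_1)
    A:   (K, K) transitions, A[i, j] = p(z_t = j | z_{t-1} = i)
    B:   (K, V) emissions,   B[k, v] = p(x_t = v | z_t = k)
    x:   length-T sequence of discrete observations
    """
    T, K = len(x), len(pi0)
    alpha = np.zeros((T, K))
    # Initialization: alpha_1(z_1) ∝ p(z_1) p(x_1 | z_1)
    alpha[0] = pi0 * B[:, x[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        # Recursion: alpha_t(z_t) ∝ p(x_t | z_t) sum_{z_{t-1}} p(z_t | z_{t-1}) alpha_{t-1}(z_{t-1})
        alpha[t] = B[:, x[t]] * (A.T @ alpha[t - 1])   # O(K^2) per time step
        alpha[t] /= alpha[t].sum()                     # normalize so it sums to one
    return alpha
```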

SLIDE 4

Forward Filtering for HMMs

[HMM graphical model: hidden states z1, ..., z5 with observations x1, ..., x5]

$\alpha_t(z_t) = p(z_t \mid x_t, x_{t-1}, \ldots, x_1)$

Prediction Step: Given current knowledge, what is the next state?
Update Step: What does the latest observation tell us about the state?

Prediction: $p(z_{t+1} \mid x_t, \ldots, x_1) = \sum_{z_t=1}^{K} p(z_{t+1} \mid z_t)\, \alpha_t(z_t)$

Update: $\alpha_{t+1}(z_{t+1}) = p(z_{t+1} \mid x_{t+1}, x_t, \ldots, x_1) \propto p(x_{t+1} \mid z_{t+1})\, p(z_{t+1} \mid x_t, \ldots, x_1)$

Combined: $\alpha_{t+1}(z_{t+1}) \propto p(x_{t+1} \mid z_{t+1}) \sum_{z_t=1}^{K} p(z_{t+1} \mid z_t)\, \alpha_t(z_t)$

Key Markov Identities: From the generative structure of the HMM,
$$p(z_{t+1} \mid z_t, x_t, \ldots, x_1) = p(z_{t+1} \mid z_t), \qquad p(x_{t+1} \mid z_{t+1}, x_t, \ldots, x_1) = p(x_{t+1} \mid z_{t+1})$$
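Organizing the recursion as an explicit predict/update pair mirrors the two questions above; a hedged sketch, reusing the illustrative pi0/A/B conventions from the previous sketch:

```python
import numpy as np

def predict(alpha_t, A):
    """Prediction step: p(z_{t+1} | x_1, ..., x_t) = sum_{z_t} p(z_{t+1} | z_t) alpha_t(z_t)."""
    return A.T @ alpha_t

def update(predicted, likelihood):
    """Update step: alpha_{t+1}(z_{t+1}) ∝ p(x_{t+1} | z_{t+1}) * predicted probability."""
    posterior = likelihood * predicted
    return posterior / posterior.sum()

# One forward step: update(predict(alpha_t, A), B[:, x_next]) reproduces the
# combined recursion alpha_{t+1} ∝ p(x_{t+1} | z_{t+1}) sum_{z_t} p(z_{t+1} | z_t) alpha_t(z_t).
```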

SLIDE 5

Forward-Backward for HMMs

[HMM graphical model: hidden states z1, ..., z5 with observations x1, ..., x5]

$\alpha_t(z_t) = p(z_t \mid x_t, x_{t-1}, \ldots, x_1)$

Forward Recursion: Distribution of state given past data
$$\alpha_1(z_1) \propto p(z_1)\, p(x_1 \mid z_1), \qquad \alpha_{t+1}(z_{t+1}) \propto p(x_{t+1} \mid z_{t+1}) \sum_{z_t=1}^{K} p(z_{t+1} \mid z_t)\, \alpha_t(z_t)$$

Backward Recursion: Likelihood of future data given state
$$\beta_t(z_t) \propto p(x_{t+1}, \ldots, x_T \mid z_t), \qquad \beta_T(z_T) = 1, \qquad \beta_t(z_t) \propto \sum_{z_{t+1}=1}^{K} p(x_{t+1} \mid z_{t+1})\, p(z_{t+1} \mid z_t)\, \beta_{t+1}(z_{t+1})$$

Marginal: Posterior distribution of state given all data
$$p(z_t \mid x_1, \ldots, x_T) \propto \alpha_t(z_t)\, \beta_t(z_t)$$
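To make the backward pass concrete, here is a minimal sketch of the backward recursion and the smoothed marginals, meant to pair with the forward_filter sketch above; rescaling beta at each step is an assumption for numerical stability and does not change the proportionality relations.

```python
import numpy as np

def backward_messages(A, B, x):
    """Backward recursion: beta[t, k] ∝ p(x_{t+1}, ..., x_T | z_t = k)."""
    T, K = len(x), A.shape[0]
    beta = np.ones((T, K))                    # beta_T(z_T) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(z_t) ∝ sum_{z_{t+1}} p(x_{t+1} | z_{t+1}) p(z_{t+1} | z_t) beta_{t+1}(z_{t+1})
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()              # rescale; only proportionality is needed
    return beta

def smoothed_marginals(alpha, beta):
    """p(z_t | x_1, ..., x_T) ∝ alpha_t(z_t) beta_t(z_t)."""
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```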

SLIDE 6

EM for Hidden Markov Models


  • Initialization: Randomly select starting parameters
  • E-Step: Given parameters, find posterior of hidden states
  • Dynamic programming to efficiently infer state marginals
  • M-Step: Given posterior distributions, find likely parameters
  • Like training of mixture models and Markov chains
  • Iteration: Alternate E-step & M-step until convergence

π, θ — parameters (state transition & emission distributions)
z1, . . . , zN — hidden discrete state sequence

[Graphical model: parameters π, θ generate the hidden state chain z1, ..., z5 and observations x1, ..., x5]
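The overall alternation can be written down compactly; a sketch of the loop, where the e_step and m_step callables and all variable names are placeholders rather than the lecture's notation (concrete versions accompany the next two slides):

```python
import numpy as np

def em_hmm(x, K, V, e_step, m_step, n_iters=50, seed=0):
    """Skeleton of EM for a discrete-emission HMM with K states and V symbols."""
    rng = np.random.default_rng(seed)
    # Initialization: randomly select starting parameters (each distribution sums to one)
    pi0 = rng.dirichlet(np.ones(K))              # initial state distribution
    A = rng.dirichlet(np.ones(K), size=K)        # transition distributions (rows)
    B = rng.dirichlet(np.ones(V), size=K)        # emission distributions (rows)
    for _ in range(n_iters):
        # E-step: posterior of hidden states via forward-backward dynamic programming
        gamma, xi = e_step(pi0, A, B, x)         # p(z_t | x) and p(z_t, z_{t+1} | x)
        # M-step: most likely parameters given those posterior marginals
        pi0, A, B = m_step(gamma, xi, x, K, V)
    return pi0, A, B
```

An e_step can, for example, be composed from the forward_filter, backward_messages, smoothed_marginals, and pairwise_marginals sketches shown with the neighboring slides.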

SLIDE 7

E-Step: HMMs

$$q^{(t)}(z) = p(z \mid x, \pi^{(t-1)}, \theta^{(t-1)}) \propto p(z \mid \pi^{(t-1)})\, p(x \mid z, \theta^{(t-1)})$$

Mixture Models
$$q^{(t)}(z) \propto \prod_{i=1}^{N} p(z_i \mid \pi^{(t-1)})\, p(x_i \mid z_i, \theta^{(t-1)})$$

  • Hidden states are conditionally independent given parameters
  • Naïve representation of full posterior has size $O(KN)$

HMMs
$$q^{(t)}(z) \propto \prod_{i=1}^{N} p(z_i \mid \pi^{(t-1)}_{z_{i-1}})\, p(x_i \mid z_i, \theta^{(t-1)})$$

  • Hidden states have Markov dependence given parameters
  • Naïve representation of full posterior has size $O(K^N)$
  • But our forward-backward dynamic programming can quickly find the marginals (at each time) of the posterior distribution

SLIDE 8

M-Step: HMMs

$$\theta^{(t)} = \arg\max_{\theta} \mathcal{L}(q^{(t)}, \theta) = \arg\max_{\theta} \sum_{z} q(z) \ln p(x, z \mid \theta)$$

Parameters updated in the M-step:
  • Initial state dist.
  • State transition dist.
  • State emission dist. (observation likelihoods)

Need posterior marginal distributions of single states and pairs of sequential states:
$$p(z_t \mid x), \qquad p(z_t, z_{t+1} \mid x)$$

Emission parameters are updated via weighted moment matching.
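A hedged sketch of these updates for a discrete-emission HMM, assuming gamma = p(z_t | x) and xi = p(z_t, z_{t+1} | x) marginals from forward-backward (all names are illustrative); with Gaussian emissions the final block would instead perform weighted moment matching of means and covariances.

```python
import numpy as np

def pairwise_marginals(alpha, beta, A, B, x):
    """xi[t, i, j] = p(z_t = i, z_{t+1} = j | x) ∝ alpha_t(i) A[i, j] p(x_{t+1} | j) beta_{t+1}(j)."""
    T, K = len(x), A.shape[0]
    xi = np.zeros((T - 1, K, K))
    for t in range(T - 1):
        m = alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :]
        xi[t] = m / m.sum()                         # normalize each pairwise marginal
    return xi

def m_step(gamma, xi, x, K, V):
    """Maximize sum_z q(z) ln p(x, z | theta) given the posterior marginals."""
    pi0 = gamma[0] / gamma[0].sum()                 # initial state distribution
    A = xi.sum(axis=0)                              # expected transition counts
    A /= A.sum(axis=1, keepdims=True)               # state transition distributions
    B = np.zeros((K, V))
    for t, v in enumerate(x):                       # expected emission counts
        B[:, v] += gamma[t]
    B /= B.sum(axis=1, keepdims=True)               # state emission distributions
    return pi0, A, B
```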

SLIDE 9

Unsupervised Learning

              Supervised Learning                 Unsupervised Learning
Discrete      classification or categorization    clustering
Continuous    regression                          dimensionality reduction

  • Goal: Infer label/response y given only features x
  • Classical: Find latent variables y good for compression of x
  • Probabilistic learning: Estimate parameters of joint distribution p(x, y) which maximize marginal probability p(x)

SLIDE 10

Dimensionality Reduction

Isomap Algorithm: Tenenbaum et al., Science 2000.

SLIDE 11

PCA Objective: Compression

  • Observed feature vectors: $x_n \in \mathbb{R}^D,\ n = 1, 2, \ldots, N$
  • Hidden manifold coordinates: $z_n \in \mathbb{R}^M,\ n = 1, 2, \ldots, N$
  • Hidden linear mapping: $\tilde{x}_n = W z_n + b$, with $W \in \mathbb{R}^{D \times M}$ and $b \in \mathbb{R}^{D \times 1}$

$$J(z, W, b \mid x, M) = \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \sum_{n=1}^{N} \|x_n - W z_n - b\|^2$$

  • Unlike clustering objectives like K-means, we can find the global optimum of this objective efficiently:

$$b = \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$$

Construct $W$ from the top eigenvectors of the sample covariance matrix (the directions of largest variance).
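A minimal NumPy sketch of exactly this recipe (function and variable names are illustrative): center at $b = \bar{x}$, take the top $M$ eigenvectors of the sample covariance as $W$, and evaluate the reconstruction objective $J$.

```python
import numpy as np

def pca_fit(X, M):
    """X: (N, D) data matrix with rows x_n. Returns (W, b) minimizing J."""
    b = X.mean(axis=0)                          # optimal offset b = sample mean
    Xc = X - b                                  # centered data
    Sigma = Xc.T @ Xc / X.shape[0]              # sample covariance, (D, D)
    evals, evecs = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
    W = evecs[:, ::-1][:, :M]                   # top-M eigenvectors: largest variance
    return W, b

def pca_objective(X, W, b):
    """J = sum_n ||x_n - W z_n - b||^2 with the optimal z_n = W^T (x_n - b)."""
    Z = (X - b) @ W                             # hidden manifold coordinates, (N, M)
    X_tilde = Z @ W.T + b                       # reconstructions
    return np.sum((X - X_tilde) ** 2)
```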

SLIDE 12

Principal Components Analysis Example

  • PCA models all translations of data equally well (by shifting b)
  • PCA models all rotations of data equally well (by rotating W)
  • Appropriate when modeling quantities over time, space, etc.

PCA Analysis of MNIST Images of the Digit 3

$$J(z, W, b \mid x, M) = \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \sum_{n=1}^{N} \|x_n - W z_n - b\|^2$$

SLIDE 13

PCA Derivation: One-Dimension

  • Observed feature vectors: $x_n \in \mathbb{R}^D,\ n = 1, 2, \ldots, N$
  • Hidden manifold coordinates: $z_n \in \mathbb{R},\ n = 1, 2, \ldots, N$
  • Hidden linear mapping: $\tilde{x}_n = w z_n$, with $w \in \mathbb{R}^{D \times 1}$ constrained so that $w^T w = 1$

Assume mean already subtracted from data (centered).

  • Step 1: Optimal manifold coordinate is always the projection $\hat{z}_n = w^T x_n$

$$J(z, w \mid x) = \frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \frac{1}{N} \sum_{n=1}^{N} \|x_n - w z_n\|^2$$

  • Step 2: Optimal mapping maximizes variance of projection

$$J(\hat{z}, w \mid x) = C - \frac{1}{N} \sum_{n=1}^{N} (w^T x_n)(x_n^T w) = C - w^T \Sigma w, \qquad \Sigma = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T$$
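Under the constraint $w^T w = 1$, minimizing $J$ therefore means maximizing $w^T \Sigma w$, and the maximizer is the eigenvector of $\Sigma$ with the largest eigenvalue (the direction of largest variance). As a numerical illustration (the function name and toy covariance are assumptions), a simple power iteration recovers that direction:

```python
import numpy as np

def top_direction(Sigma, n_iters=500):
    """Power iteration: returns the unit vector w maximizing w^T Sigma w."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=Sigma.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iters):
        w = Sigma @ w                   # repeatedly apply Sigma
        w /= np.linalg.norm(w)          # re-impose the constraint w^T w = 1
    return w

# Toy example: for Sigma = diag(4, 1, 0.25) the direction of largest variance
# is the first coordinate axis, so |w| should be close to [1, 0, 0].
Sigma = np.diag([4.0, 1.0, 0.25])
print(np.round(np.abs(top_direction(Sigma)), 3))
```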

SLIDE 14

Gaussian Geometry

  • Eigenvalues and eigenvectors: $\Sigma u_i = \lambda_i u_i,\ i = 1, \ldots, d$, for $\Sigma \in \mathbb{R}^{d \times d}$; stacking columns, $\Sigma U = U \Lambda$ with $U = [u_1, \ldots, u_d]$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$
  • For a symmetric matrix: $\lambda_i \in \mathbb{R}$, and the eigenvectors can be chosen orthonormal, $u_i^T u_i = 1$ and $u_i^T u_j = 0$ for $i \neq j$
  • For a positive semidefinite matrix: $\lambda_i \geq 0$
  • For a positive definite matrix: $\lambda_i > 0$

$$\Sigma = U \Lambda U^T = \sum_{i=1}^{d} \lambda_i u_i u_i^T, \qquad \Sigma^{-1} = U \Lambda^{-1} U^T = \sum_{i=1}^{d} \frac{1}{\lambda_i} u_i u_i^T$$

  • Quadratic forms: with $y_i = u_i^T (x - \mu)$, the projection of the difference from the mean onto eigenvector $u_i$, the Gaussian quadratic form becomes $(x - \mu)^T \Sigma^{-1} (x - \mu) = \sum_{i=1}^{d} y_i^2 / \lambda_i$
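These identities are easy to sanity-check numerically; a small sketch using np.linalg.eigh on a toy positive definite matrix, constructed here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4 * np.eye(4)        # symmetric positive definite, d = 4

lam, U = np.linalg.eigh(Sigma)         # Sigma U = U Lambda, columns of U orthonormal
Lam = np.diag(lam)

assert np.all(lam > 0)                                    # positive definite: lambda_i > 0
assert np.allclose(U.T @ U, np.eye(4))                    # u_i^T u_i = 1, u_i^T u_j = 0
assert np.allclose(Sigma, U @ Lam @ U.T)                  # Sigma = U Lambda U^T
assert np.allclose(np.linalg.inv(Sigma),
                   U @ np.diag(1.0 / lam) @ U.T)          # Sigma^{-1} = U Lambda^{-1} U^T
```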

SLIDE 15

Maximizes Variance & Minimizes Error

[Figure: a data point $x_n$, its projection onto the principal direction $u$, and the reconstruction error]

  • C. Bishop, Pattern Recognition & Machine Learning
SLIDE 16

Principal Components Analysis (PCA)

[Figure panels: 3D Data, Best 2D Projection, Best 1D Projection]

SLIDE 17

PCA Optimal Solution

$$J(z, W, b \mid x, M) = \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \sum_{n=1}^{N} \|x_n - W z_n - b\|^2$$

$$b = \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad X = [x_1 - \bar{x},\ x_2 - \bar{x},\ \ldots,\ x_N - \bar{x}]$$

  • Option A: Eigendecomposition of sample covariance matrix
Construct $W$ from the eigenvectors with the $M$ largest eigenvalues
$$\Sigma = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T = \frac{1}{N} X X^T = U \Lambda U^T$$

  • Option B: Singular value decomposition (SVD) of centered data
Construct $W$ from the singular vectors with the $M$ largest singular values
$$X = U S V^T$$
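A short numerical sketch (the toy data and variable names are illustrative) confirming that the two options agree: the top-$M$ eigenvectors of $\frac{1}{N} X X^T$ match the left singular vectors of the centered data matrix, with eigenvalues $\lambda_i = s_i^2 / N$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, M = 5, 200, 2
data = rng.normal(size=(D, N)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])[:, None]

x_bar = data.mean(axis=1, keepdims=True)
X = data - x_bar                                   # centered data, columns x_n - x_bar

# Option A: eigendecomposition of the sample covariance matrix
Sigma = X @ X.T / N
lam, U_eig = np.linalg.eigh(Sigma)                 # ascending eigenvalues
W_a = U_eig[:, ::-1][:, :M]                        # eigenvectors with M largest eigenvalues

# Option B: SVD of the centered data matrix
U_svd, s, Vt = np.linalg.svd(X, full_matrices=False)
W_b = U_svd[:, :M]                                 # singular vectors with M largest singular values

# Same subspace (columns agree up to sign), and lambda_i = s_i^2 / N.
assert np.allclose(np.abs(W_a.T @ W_b), np.eye(M), atol=1e-6)
assert np.allclose(np.sort(lam)[::-1][:M], (s ** 2 / N)[:M])
```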