SLIDE 1

Introduction to (Statistical) Machine Learning

Brown University CSCI1420 & ENGN2520

  • Prof. Erik Sudderth

Lecture for Nov. 21, 2013:

HMMs: Forward-Backward & EM Algorithms, Principal Components Analysis (PCA)

Many figures courtesy of Kevin Murphy’s textbook, Machine Learning: A Probabilistic Perspective

SLIDE 2

Inference for HMMs

[HMM graphical model: hidden states z1, ..., z5 with observations x1, ..., x5]

  • Assume parameters defining the HMM are fixed and known:

distributions of initial state, state transitions, observations

  • Given observation sequence, want to estimate hidden states

Minimize sequence (word) error rate: $L(z, a) = \mathbb{I}(z \neq a)$
$$\hat{z} = \arg\max_{z} p(z \mid x) = \arg\max_{z} \left[ p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \right] \cdot \left[ \prod_{t=1}^{T} p(x_t \mid z_t) \right]$$

Minimize state (symbol) error rate: $L(z, a) = \sum_{t=1}^{T} \mathbb{I}(z_t \neq a_t)$
$$\hat{z}_t = \arg\max_{z_t} p(z_t \mid x) = \arg\max_{z_t} \sum_{z_1} \cdots \sum_{z_{t-1}} \sum_{z_{t+1}} \cdots \sum_{z_T} p(z, x)$$

Problem: Naïve computation of either estimate requires $O(K^T)$ operations.
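To make the $O(K^T)$ cost concrete, here is a brute-force sketch that literally enumerates every state sequence of a tiny toy HMM; the parameter names pi0, A, and B are assumptions for this illustration, and the approach is only feasible for very small K and T.

```python
import itertools
import numpy as np

def brute_force_estimates(pi0, A, B, x):
    """Enumerate all K^T state sequences of a toy HMM.

    pi0: (K,) initial state distribution p(z_1)
    A:   (K, K) transitions, A[i, j] = p(z_t = j | z_{t-1} = i)
    B:   (K, V) emissions,   B[k, v] = p(x_t = v | z_t = k)
    x:   length-T sequence of discrete observations
    """
    K, T = len(pi0), len(x)
    best_seq, best_p = None, -1.0
    marginals = np.zeros((T, K))                      # accumulates p(z_t, x)
    for z in itertools.product(range(K), repeat=T):   # all K^T sequences
        p = pi0[z[0]] * B[z[0], x[0]]
        for t in range(1, T):
            p *= A[z[t - 1], z[t]] * B[z[t], x[t]]
        if p > best_p:                                # sequence (MAP) estimate
            best_seq, best_p = z, p
        for t in range(T):                            # accumulate state marginals
            marginals[t, z[t]] += p
    marginals /= marginals.sum(axis=1, keepdims=True)
    return best_seq, marginals.argmax(axis=1)         # sequence and state estimates
```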

SLIDE 3

Forward Filtering for HMMs

[HMM graphical model: hidden states z1, ..., z5 with observations x1, ..., x5]

Filtered state estimates: $\alpha_t(z_t) = p(z_t \mid x_t, x_{t-1}, \ldots, x_1)$

  • Directly useful for online inference or tracking with HMMs
  • Building block towards finding posterior given all observations

$$p(z, x) = p(z)\, p(x \mid z) = \left[ p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \right] \cdot \left[ \prod_{t=1}^{T} p(x_t \mid z_t) \right]$$

Recursion: Derivation will follow from Markov properties
$$\alpha_t(z_t) \propto p(x_t \mid z_t) \sum_{z_{t-1}=1}^{K} p(z_t \mid z_{t-1})\, \alpha_{t-1}(z_{t-1}) \qquad O(K^2) \text{ per step}$$

Initialization: Easy from known HMM parameters
$$\alpha_1(z_1) = p(z_1 \mid x_1) \propto p(z_1)\, p(x_1 \mid z_1)$$

Multiply by a proportionality constant so that each $\alpha_t$ sums to one.
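As a concrete illustration of this recursion, here is a minimal NumPy sketch of forward filtering for an HMM with discrete observations; the parameter names pi0, A, and B are assumptions for the example, not notation from the lecture.

```python
import numpy as np

def forward_filter(pi0, A, B, x):
    """Normalized forward recursion: alpha[t, k] = p(z_t = k | x_1, ..., x_t).

    pi0: (K,) initial state distribution p(z_1)
    A:   (K, K) transitions, A[i, j] = p(z_t = j | z_{t-1} = i)
    B:   (K, V) emissions,   B[k, v] = p(x_t = v | z_t = k)
    x:   length-T sequence of discrete observations
    """
    T, K = len(x), len(pi0)
    alpha = np.zeros((T, K))
    # Initialization: alpha_1(z_1) ∝ p(z_1) p(x_1 | z_1)
    alpha[0] = pi0 * B[:, x[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        # Recursion: alpha_t(z_t) ∝ p(x_t | z_t) sum_{z_{t-1}} p(z_t | z_{t-1}) alpha_{t-1}(z_{t-1})
        alpha[t] = B[:, x[t]] * (A.T @ alpha[t - 1])   # O(K^2) per time step
        alpha[t] /= alpha[t].sum()                     # normalize so it sums to one
    return alpha
```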

SLIDE 4

Forward Filtering for HMMs

[HMM graphical model: hidden states z1, ..., z5 with observations x1, ..., x5]

$\alpha_t(z_t) = p(z_t \mid x_t, x_{t-1}, \ldots, x_1)$

Prediction Step: Given current knowledge, what is the next state?
Update Step: What does the latest observation tell us about the state?

Prediction: $p(z_{t+1} \mid x_t, \ldots, x_1) = \sum_{z_t=1}^{K} p(z_{t+1} \mid z_t)\, \alpha_t(z_t)$

Update: $\alpha_{t+1}(z_{t+1}) = p(z_{t+1} \mid x_{t+1}, x_t, \ldots, x_1) \propto p(x_{t+1} \mid z_{t+1})\, p(z_{t+1} \mid x_t, \ldots, x_1)$

Combined: $\alpha_{t+1}(z_{t+1}) \propto p(x_{t+1} \mid z_{t+1}) \sum_{z_t=1}^{K} p(z_{t+1} \mid z_t)\, \alpha_t(z_t)$

Key Markov Identities: From the generative structure of the HMM,
$$p(z_{t+1} \mid z_t, x_t, \ldots, x_1) = p(z_{t+1} \mid z_t), \qquad p(x_{t+1} \mid z_{t+1}, x_t, \ldots, x_1) = p(x_{t+1} \mid z_{t+1})$$
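Organizing the recursion as an explicit predict/update pair mirrors the two questions above; a hedged sketch, reusing the illustrative pi0/A/B conventions from the previous sketch:

```python
import numpy as np

def predict(alpha_t, A):
    """Prediction step: p(z_{t+1} | x_1, ..., x_t) = sum_{z_t} p(z_{t+1} | z_t) alpha_t(z_t)."""
    return A.T @ alpha_t

def update(predicted, likelihood):
    """Update step: alpha_{t+1}(z_{t+1}) ∝ p(x_{t+1} | z_{t+1}) * predicted probability."""
    posterior = likelihood * predicted
    return posterior / posterior.sum()

# One forward step: update(predict(alpha_t, A), B[:, x_next]) reproduces the
# combined recursion alpha_{t+1} ∝ p(x_{t+1} | z_{t+1}) sum_{z_t} p(z_{t+1} | z_t) alpha_t(z_t).
```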

SLIDE 5

Forward-Backward for HMMs

[HMM graphical model: hidden states z1, ..., z5 with observations x1, ..., x5]

$\alpha_t(z_t) = p(z_t \mid x_t, x_{t-1}, \ldots, x_1)$

Forward Recursion: Distribution of state given past data
$$\alpha_1(z_1) \propto p(z_1)\, p(x_1 \mid z_1), \qquad \alpha_{t+1}(z_{t+1}) \propto p(x_{t+1} \mid z_{t+1}) \sum_{z_t=1}^{K} p(z_{t+1} \mid z_t)\, \alpha_t(z_t)$$

Backward Recursion: Likelihood of future data given state
$$\beta_t(z_t) \propto p(x_{t+1}, \ldots, x_T \mid z_t), \qquad \beta_T(z_T) = 1, \qquad \beta_t(z_t) \propto \sum_{z_{t+1}=1}^{K} p(x_{t+1} \mid z_{t+1})\, p(z_{t+1} \mid z_t)\, \beta_{t+1}(z_{t+1})$$

Marginal: Posterior distribution of state given all data
$$p(z_t \mid x_1, \ldots, x_T) \propto \alpha_t(z_t)\, \beta_t(z_t)$$
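To make the backward pass concrete, here is a minimal sketch of the backward recursion and the smoothed marginals, meant to pair with the forward_filter sketch above; rescaling beta at each step is an assumption for numerical stability and does not change the proportionality relations.

```python
import numpy as np

def backward_messages(A, B, x):
    """Backward recursion: beta[t, k] ∝ p(x_{t+1}, ..., x_T | z_t = k)."""
    T, K = len(x), A.shape[0]
    beta = np.ones((T, K))                    # beta_T(z_T) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(z_t) ∝ sum_{z_{t+1}} p(x_{t+1} | z_{t+1}) p(z_{t+1} | z_t) beta_{t+1}(z_{t+1})
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()              # rescale; only proportionality is needed
    return beta

def smoothed_marginals(alpha, beta):
    """p(z_t | x_1, ..., x_T) ∝ alpha_t(z_t) beta_t(z_t)."""
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```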

SLIDE 6

EM for Hidden Markov Models


  • Initialization: Randomly select starting parameters
  • E-Step: Given parameters, find posterior of hidden states
  • Dynamic programming to efficiently infer state marginals
  • M-Step: Given posterior distributions, find likely parameters
  • Like training of mixture models and Markov chains
  • Iteration: Alternate E-step & M-step until convergence

π, θ — parameters (state transition & emission distributions)
z1, . . . , zN — hidden discrete state sequence

[Graphical model: parameters π, θ generate the hidden state chain z1, ..., z5 and observations x1, ..., x5]
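The overall alternation can be written down compactly; a sketch of the loop, where the e_step and m_step callables and all variable names are placeholders rather than the lecture's notation (concrete versions accompany the next two slides):

```python
import numpy as np

def em_hmm(x, K, V, e_step, m_step, n_iters=50, seed=0):
    """Skeleton of EM for a discrete-emission HMM with K states and V symbols."""
    rng = np.random.default_rng(seed)
    # Initialization: randomly select starting parameters (each distribution sums to one)
    pi0 = rng.dirichlet(np.ones(K))              # initial state distribution
    A = rng.dirichlet(np.ones(K), size=K)        # transition distributions (rows)
    B = rng.dirichlet(np.ones(V), size=K)        # emission distributions (rows)
    for _ in range(n_iters):
        # E-step: posterior of hidden states via forward-backward dynamic programming
        gamma, xi = e_step(pi0, A, B, x)         # p(z_t | x) and p(z_t, z_{t+1} | x)
        # M-step: most likely parameters given those posterior marginals
        pi0, A, B = m_step(gamma, xi, x, K, V)
    return pi0, A, B
```

An e_step can, for example, be composed from the forward_filter, backward_messages, smoothed_marginals, and pairwise_marginals sketches shown with the neighboring slides.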

SLIDE 7

E-Step: HMMs

$$q^{(t)}(z) = p(z \mid x, \pi^{(t-1)}, \theta^{(t-1)}) \propto p(z \mid \pi^{(t-1)})\, p(x \mid z, \theta^{(t-1)})$$

Mixture Models
$$q^{(t)}(z) \propto \prod_{i=1}^{N} p(z_i \mid \pi^{(t-1)})\, p(x_i \mid z_i, \theta^{(t-1)})$$

  • Hidden states are conditionally independent given parameters
  • Naïve representation of full posterior has size $O(KN)$

HMMs
$$q^{(t)}(z) \propto \prod_{i=1}^{N} p(z_i \mid \pi^{(t-1)}_{z_{i-1}})\, p(x_i \mid z_i, \theta^{(t-1)})$$

  • Hidden states have Markov dependence given parameters
  • Naïve representation of full posterior has size $O(K^N)$
  • But our forward-backward dynamic programming can quickly find the marginals (at each time) of the posterior distribution

SLIDE 8

M-Step: HMMs

$$\theta^{(t)} = \arg\max_{\theta} \mathcal{L}(q^{(t)}, \theta) = \arg\max_{\theta} \sum_{z} q(z) \ln p(x, z \mid \theta)$$

Parameters updated in the M-step:
  • Initial state dist.
  • State transition dist.
  • State emission dist. (observation likelihoods)

Need posterior marginal distributions of single states and pairs of sequential states:
$$p(z_t \mid x), \qquad p(z_t, z_{t+1} \mid x)$$

Emission parameters are updated via weighted moment matching.
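A hedged sketch of these updates for a discrete-emission HMM, assuming gamma = p(z_t | x) and xi = p(z_t, z_{t+1} | x) marginals from forward-backward (all names are illustrative); with Gaussian emissions the final block would instead perform weighted moment matching of means and covariances.

```python
import numpy as np

def pairwise_marginals(alpha, beta, A, B, x):
    """xi[t, i, j] = p(z_t = i, z_{t+1} = j | x) ∝ alpha_t(i) A[i, j] p(x_{t+1} | j) beta_{t+1}(j)."""
    T, K = len(x), A.shape[0]
    xi = np.zeros((T - 1, K, K))
    for t in range(T - 1):
        m = alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :]
        xi[t] = m / m.sum()                         # normalize each pairwise marginal
    return xi

def m_step(gamma, xi, x, K, V):
    """Maximize sum_z q(z) ln p(x, z | theta) given the posterior marginals."""
    pi0 = gamma[0] / gamma[0].sum()                 # initial state distribution
    A = xi.sum(axis=0)                              # expected transition counts
    A /= A.sum(axis=1, keepdims=True)               # state transition distributions
    B = np.zeros((K, V))
    for t, v in enumerate(x):                       # expected emission counts
        B[:, v] += gamma[t]
    B /= B.sum(axis=1, keepdims=True)               # state emission distributions
    return pi0, A, B
```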

SLIDE 9

Unsupervised Learning

              Supervised Learning                 Unsupervised Learning
Discrete      classification or categorization    clustering
Continuous    regression                          dimensionality reduction

  • Goal: Infer label/response y given only features x
  • Classical: Find latent variables y good for compression of x
  • Probabilistic learning: Estimate parameters of joint distribution p(x, y) which maximize marginal probability p(x)

SLIDE 10

Dimensionality Reduction

Isomap Algorithm: Tenenbaum et al., Science 2000.

SLIDE 11

PCA Objective: Compression

  • Observed feature vectors: $x_n \in \mathbb{R}^D,\ n = 1, 2, \ldots, N$
  • Hidden manifold coordinates: $z_n \in \mathbb{R}^M,\ n = 1, 2, \ldots, N$
  • Hidden linear mapping: $\tilde{x}_n = W z_n + b$, with $W \in \mathbb{R}^{D \times M}$ and $b \in \mathbb{R}^{D \times 1}$

$$J(z, W, b \mid x, M) = \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \sum_{n=1}^{N} \|x_n - W z_n - b\|^2$$

  • Unlike clustering objectives like K-means, we can find the global optimum of this objective efficiently:

$$b = \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$$

Construct $W$ from the top eigenvectors of the sample covariance matrix (the directions of largest variance).
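A minimal NumPy sketch of exactly this recipe (function and variable names are illustrative): center at $b = \bar{x}$, take the top $M$ eigenvectors of the sample covariance as $W$, and evaluate the reconstruction objective $J$.

```python
import numpy as np

def pca_fit(X, M):
    """X: (N, D) data matrix with rows x_n. Returns (W, b) minimizing J."""
    b = X.mean(axis=0)                          # optimal offset b = sample mean
    Xc = X - b                                  # centered data
    Sigma = Xc.T @ Xc / X.shape[0]              # sample covariance, (D, D)
    evals, evecs = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
    W = evecs[:, ::-1][:, :M]                   # top-M eigenvectors: largest variance
    return W, b

def pca_objective(X, W, b):
    """J = sum_n ||x_n - W z_n - b||^2 with the optimal z_n = W^T (x_n - b)."""
    Z = (X - b) @ W                             # hidden manifold coordinates, (N, M)
    X_tilde = Z @ W.T + b                       # reconstructions
    return np.sum((X - X_tilde) ** 2)
```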

SLIDE 12

Principal Components Analysis Example

  • PCA models all translations of data equally well (by shifting b)
  • PCA models all rotations of data equally well (by rotating W)
  • Appropriate when modeling quantities over time, space, etc.

PCA Analysis of MNIST Images of the Digit 3

$$J(z, W, b \mid x, M) = \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \sum_{n=1}^{N} \|x_n - W z_n - b\|^2$$

SLIDE 13

PCA Derivation: One-Dimension

  • Observed feature vectors: $x_n \in \mathbb{R}^D,\ n = 1, 2, \ldots, N$
  • Hidden manifold coordinates: $z_n \in \mathbb{R},\ n = 1, 2, \ldots, N$
  • Hidden linear mapping: $\tilde{x}_n = w z_n$, with $w \in \mathbb{R}^{D \times 1}$ constrained so that $w^T w = 1$

Assume mean already subtracted from data (centered).

  • Step 1: Optimal manifold coordinate is always the projection $\hat{z}_n = w^T x_n$

$$J(z, w \mid x) = \frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \frac{1}{N} \sum_{n=1}^{N} \|x_n - w z_n\|^2$$

  • Step 2: Optimal mapping maximizes variance of projection

$$J(\hat{z}, w \mid x) = C - \frac{1}{N} \sum_{n=1}^{N} (w^T x_n)(x_n^T w) = C - w^T \Sigma w, \qquad \Sigma = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T$$
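Under the constraint $w^T w = 1$, minimizing $J$ therefore means maximizing $w^T \Sigma w$, and the maximizer is the eigenvector of $\Sigma$ with the largest eigenvalue (the direction of largest variance). As a numerical illustration (the function name and toy covariance are assumptions), a simple power iteration recovers that direction:

```python
import numpy as np

def top_direction(Sigma, n_iters=500):
    """Power iteration: returns the unit vector w maximizing w^T Sigma w."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=Sigma.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iters):
        w = Sigma @ w                   # repeatedly apply Sigma
        w /= np.linalg.norm(w)          # re-impose the constraint w^T w = 1
    return w

# Toy example: for Sigma = diag(4, 1, 0.25) the direction of largest variance
# is the first coordinate axis, so |w| should be close to [1, 0, 0].
Sigma = np.diag([4.0, 1.0, 0.25])
print(np.round(np.abs(top_direction(Sigma)), 3))
```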

SLIDE 14

Gaussian Geometry

  • Eigenvalues and eigenvectors: $\Sigma u_i = \lambda_i u_i,\ i = 1, \ldots, d$, for $\Sigma \in \mathbb{R}^{d \times d}$; stacking columns, $\Sigma U = U \Lambda$ with $U = [u_1, \ldots, u_d]$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$
  • For a symmetric matrix: $\lambda_i \in \mathbb{R}$, and the eigenvectors can be chosen orthonormal, $u_i^T u_i = 1$ and $u_i^T u_j = 0$ for $i \neq j$
  • For a positive semidefinite matrix: $\lambda_i \geq 0$
  • For a positive definite matrix: $\lambda_i > 0$

$$\Sigma = U \Lambda U^T = \sum_{i=1}^{d} \lambda_i u_i u_i^T, \qquad \Sigma^{-1} = U \Lambda^{-1} U^T = \sum_{i=1}^{d} \frac{1}{\lambda_i} u_i u_i^T$$

  • Quadratic forms: with $y_i = u_i^T (x - \mu)$, the projection of the difference from the mean onto eigenvector $u_i$, the Gaussian quadratic form becomes $(x - \mu)^T \Sigma^{-1} (x - \mu) = \sum_{i=1}^{d} y_i^2 / \lambda_i$
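These identities are easy to sanity-check numerically; a small sketch using np.linalg.eigh on a toy positive definite matrix, constructed here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4 * np.eye(4)        # symmetric positive definite, d = 4

lam, U = np.linalg.eigh(Sigma)         # Sigma U = U Lambda, columns of U orthonormal
Lam = np.diag(lam)

assert np.all(lam > 0)                                    # positive definite: lambda_i > 0
assert np.allclose(U.T @ U, np.eye(4))                    # u_i^T u_i = 1, u_i^T u_j = 0
assert np.allclose(Sigma, U @ Lam @ U.T)                  # Sigma = U Lambda U^T
assert np.allclose(np.linalg.inv(Sigma),
                   U @ np.diag(1.0 / lam) @ U.T)          # Sigma^{-1} = U Lambda^{-1} U^T
```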

SLIDE 15

Maximizes Variance & Minimizes Error

[Figure: a data point $x_n$, its projection onto the principal direction $u$, and the reconstruction error]

  • C. Bishop, Pattern Recognition & Machine Learning
SLIDE 16

Principal Components Analysis (PCA)

[Figure panels: 3D Data, Best 2D Projection, Best 1D Projection]

SLIDE 17

PCA Optimal Solution

$$J(z, W, b \mid x, M) = \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \sum_{n=1}^{N} \|x_n - W z_n - b\|^2$$

$$b = \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad X = [x_1 - \bar{x},\ x_2 - \bar{x},\ \ldots,\ x_N - \bar{x}]$$

  • Option A: Eigendecomposition of sample covariance matrix
Construct $W$ from the eigenvectors with the $M$ largest eigenvalues
$$\Sigma = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T = \frac{1}{N} X X^T = U \Lambda U^T$$

  • Option B: Singular value decomposition (SVD) of centered data
Construct $W$ from the singular vectors with the $M$ largest singular values
$$X = U S V^T$$
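A short numerical sketch (the toy data and variable names are illustrative) confirming that the two options agree: the top-$M$ eigenvectors of $\frac{1}{N} X X^T$ match the left singular vectors of the centered data matrix, with eigenvalues $\lambda_i = s_i^2 / N$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, M = 5, 200, 2
data = rng.normal(size=(D, N)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])[:, None]

x_bar = data.mean(axis=1, keepdims=True)
X = data - x_bar                                   # centered data, columns x_n - x_bar

# Option A: eigendecomposition of the sample covariance matrix
Sigma = X @ X.T / N
lam, U_eig = np.linalg.eigh(Sigma)                 # ascending eigenvalues
W_a = U_eig[:, ::-1][:, :M]                        # eigenvectors with M largest eigenvalues

# Option B: SVD of the centered data matrix
U_svd, s, Vt = np.linalg.svd(X, full_matrices=False)
W_b = U_svd[:, :M]                                 # singular vectors with M largest singular values

# Same subspace (columns agree up to sign), and lambda_i = s_i^2 / N.
assert np.allclose(np.abs(W_a.T @ W_b), np.eye(M), atol=1e-6)
assert np.allclose(np.sort(lam)[::-1][:M], (s ** 2 / N)[:M])
```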