SLIDE 1

A spectral algorithm for learning hidden Markov models

[Figure: HMM graphical model, a chain h_1 → h_2 → h_3 → · · · with emissions x_1, x_2, x_3, . . .]

Daniel Hsu (UCSD), Sham M. Kakade (TTI-C), Tong Zhang (Rutgers)

SLIDE 2

Motivation

  • Hidden Markov Models (HMMs): a popular model for sequential data (e.g. speech, bio-sequences, natural language)

[Figure: hidden state process h_1 → h_2 → h_3 → · · · (Markovian), with observation sequence x_1, x_2, x_3, . . .]

  • Each state defines a distribution over observations.
  • Hidden state sequence is not observed; hence, unsupervised learning.

SLIDE 3

Motivation

  • Why HMMs?

– Handle temporally-dependent data
– Succinct “factored” representation when the state space is low-dimensional (cf. autoregressive models)

  • Some uses of HMMs:

– Monitor the “belief state” of a dynamical system
– Infer latent variables from time series
– Density estimation

SLIDE 4

Preliminaries: parameters of discrete HMMs

Sequences of hidden states (h1, h2, . . .) and observations (x1, x2, . . .).

  • Hidden states {1, 2, . . . , m}; observations {1, 2, . . . , n}

  • Conditional independences given by the Markov chain h_1 → h_2 → h_3 → · · · with emissions x_1, x_2, x_3, . . .

  • Initial state distribution π ∈ R^m: π_i = Pr[h_1 = i]
  • Transition matrix T ∈ R^{m×m}: T_{ij} = Pr[h_{t+1} = i | h_t = j]
  • Observation matrix O ∈ R^{n×m}: O_{ij} = Pr[x_t = i | h_t = j]

Pr[x_{1:t}] = Σ_{h_1} Pr[h_1] · Σ_{h_2} Pr[h_2 | h_1] Pr[x_1 | h_1] · . . . · Σ_{h_{t+1}} Pr[h_{t+1} | h_t] Pr[x_t | h_t]
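To make the factored sum concrete, here is a minimal sketch (not from the slides) that evaluates Pr[x_{1:t}] by the standard forward recursion; the names joint_prob, pi, T, and O are illustrative.

```python
# Minimal sketch (not from the slides): evaluate Pr[x_1, ..., x_t] for a
# discrete HMM by the standard forward recursion, with pi the initial
# distribution, T[i, j] = Pr[h' = i | h = j], O[i, j] = Pr[x = i | h = j].
import numpy as np

def joint_prob(x_seq, pi, T, O):
    """Joint probability of an observation sequence (symbols in 0..n-1)."""
    alpha = O[x_seq[0], :] * pi        # alpha_1(h) = Pr[x_1, h_1 = h]
    for x in x_seq[1:]:
        alpha = O[x, :] * (T @ alpha)  # propagate and emit, summing out h_t
    return float(alpha.sum())          # sum out the final hidden state
```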

SLIDE 5

Preliminaries: learning discrete HMMs

  • Popular heuristic: Expectation-Maximization (EM), a.k.a. the Baum-Welch algorithm

  • Computationally hard in general under cryptographic assumptions (Terwijn, ’02)

  • This work: computationally efficient algorithm with learning guarantees for invertible HMMs: assume T ∈ R^{m×m} and O ∈ R^{n×m} have rank m (here, n ≥ m).

SLIDE 6

Our contributions

  • Simple and efficient algorithm for learning invertible HMMs.
  • Sample complexity bounds for

– Joint probability estimation (total variation distance):

  Σ_{x_{1:t}} | Pr[x_{1:t}] − P̂r[x_{1:t}] | ≤ ε   (relevant for density estimation tasks)

– Conditional probability estimation (KL distance):

  KL( Pr[x_t | x_{1:t−1}] ‖ P̂r[x_t | x_{1:t−1}] ) ≤ ε   (relevant for next-symbol prediction / “belief states”)

  • Connects subspace identification to observation operators.

SLIDE 7

Outline

  • 1. Motivation and preliminaries
  • 2. Discrete HMMs: key ideas
  • 3. Observable representation for HMMs
  • 4. Learning algorithm and guarantees
  • 5. Conclusions and future work

SLIDE 9

Discrete HMMs: linear model

Let h_t ∈ {e_1, . . . , e_m} and x_t ∈ {e_1, . . . , e_n} (coordinate vectors in R^m and R^n). Then

E[h_{t+1} | h_t] = T h_t   and   E[x_t | h_t] = O h_t.

In expectation, the dynamics and the observation process are linear!

e.g. conditioned on h_t = 2 (i.e. h_t = e_2):

E[h_{t+1} | h_t] = ⎡ 0.2  0.4  0.6 ⎤ ⎡ 0 ⎤   ⎡ 0.4 ⎤
                   ⎢ 0.3  0.6  0.4 ⎥ ⎢ 1 ⎥ = ⎢ 0.6 ⎥
                   ⎣ 0.5  0    0   ⎦ ⎣ 0 ⎦   ⎣ 0   ⎦

Upshot: Can borrow “subspace identification” techniques from linear systems theory.
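A quick numerical check of the linearity claim; note the zero entries of the example matrix above are reconstructed here under the assumption that its columns sum to one:

```python
import numpy as np

# The example transition matrix from this slide (zeros assumed so that
# each column is a probability distribution).
T = np.array([[0.2, 0.4, 0.6],
              [0.3, 0.6, 0.4],
              [0.5, 0.0, 0.0]])
e2 = np.array([0.0, 1.0, 0.0])  # h_t = 2 as a coordinate vector
print(T @ e2)                   # [0.4 0.6 0. ]: the 2nd column of T
```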

SLIDE 10

Discrete HMMs: linear model

Exploiting linearity

  • Subspace identification for general linear models

– Use SVD to discover a subspace containing the relevant states, then learn effective transition and observation matrices (Ljung, ’87).
– Analysis typically assumes additive noise (independent of state), e.g. Gaussian noise (Kalman filter); not applicable to HMMs.

  • This work: use subspace identification, then learn an alternative HMM parameterization.

SLIDE 11

Discrete HMMs: observation operators

For x ∈ {1, . . . , n}, define

A_x = T · diag(O_{x,1}, . . . , O_{x,m}) ∈ R^{m×m},   [A_x]_{i,j} = Pr[h_{t+1} = i ∧ x_t = x | h_t = j].

The {A_x} are observation operators (Schützenberger, ’61; Jaeger, ’00).

[Figure: one step of the chain, with A_{x_t} combining the transition T and the emission of x_t from h_t.]

SLIDE 12

Discrete HMMs: observation operators

Using observation operators

Matrix multiplication handles “local” marginalization of hidden variables, e.g.

Pr[x_1, x_2] = Σ_{h_1} Pr[h_1] · Σ_{h_2} Pr[h_2 | h_1] Pr[x_1 | h_1] · Σ_{h_3} Pr[h_3 | h_2] Pr[x_2 | h_2] = 1_m^⊤ A_{x_2} A_{x_1} π

where 1_m ∈ R^m is the all-ones vector.

Upshot: The {A_x} contain the same information as T and O.
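To make the operator identity concrete, here is a minimal sketch (all names illustrative) that builds the A_x for a small random invertible HMM and checks the factored form of Pr[x_1, x_2] against brute-force marginalization:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5                                   # toy state/observation sizes
pi = rng.dirichlet(np.ones(m))                # initial distribution
T = rng.dirichlet(np.ones(m), size=m).T       # columns are distributions
O = rng.dirichlet(np.ones(n), size=m).T

# [A_x]_{ij} = Pr[h_{t+1} = i, x_t = x | h_t = j] = T diag(O_{x,1..m})
A = [T @ np.diag(O[x, :]) for x in range(n)]

x1, x2 = 1, 4
factored = np.ones(m) @ A[x2] @ A[x1] @ pi    # 1_m^T A_{x2} A_{x1} pi
brute = sum(pi[h1] * O[x1, h1] * T[h2, h1] * O[x2, h2]
            for h1 in range(m) for h2 in range(m))
assert np.isclose(factored, brute)            # both equal Pr[x_1, x_2]
```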

SLIDE 13

Discrete HMMs: observation operators

Learning observation operators

  • Previous methods face the problem of discovering and extracting the relationship between hidden states and observations (Jaeger, ’00).

– Various techniques proposed (e.g. James and Singh, ’04; Wiewiora, ’05).
– Formal guarantees were unclear.

  • This work: combine subspace identification with observation operators to yield an observable HMM representation that is efficiently learnable.

SLIDE 15

Observable representation for HMMs

Key rank condition: require T ∈ R^{m×m} and O ∈ R^{n×m} to have rank m
(rules out the pathological cases from the hardness reductions).

Define P_1 ∈ R^n, P_{2,1} ∈ R^{n×n}, and P_{3,x,1} ∈ R^{n×n} for x = 1, . . . , n by

[P_1]_i = Pr[x_1 = i],   [P_{2,1}]_{i,j} = Pr[x_2 = i, x_1 = j],   [P_{3,x,1}]_{i,j} = Pr[x_3 = i, x_2 = x, x_1 = j]

(probabilities of singletons, pairs, and triples).

Claim: Can recover equivalent HMM parameters from P_1, P_{2,1}, and {P_{3,x,1}}, and these quantities can be estimated from data.

SLIDE 16

Observable representation for HMMs

“Thin” SVD: P_{2,1} = U Σ V^⊤, where U = [u_1 | · · · | u_m] ∈ R^{n×m}.
The rank condition guarantees m non-zero singular values.

[Figure: range(U) = range(O); E[x_t | h_t] lies in the span of u_1, . . . , u_m.]

New parameters (based on U) implicitly transform hidden states:

h_t → (U^⊤ O) h_t = U^⊤ E[x_t | h_t]

(i.e. a change to the coordinate representation of E[x_t | h_t] w.r.t. {u_1, . . . , u_m}).

SLIDE 17

Observable representation for HMMs

For each x = 1, . . . , n,

B_x = (U^⊤ P_{3,x,1}) (U^⊤ P_{2,1})^+   (X^+ is the pseudoinverse of X)
    = (U^⊤ O) A_x (U^⊤ O)^{−1}.

(Algebra: U^⊤ P_{3,x,1} = (U^⊤ O) A_x (U^⊤ O)^{−1} (U^⊤ P_{2,1}), and U^⊤ P_{2,1} has full row rank m, so (U^⊤ P_{2,1})(U^⊤ P_{2,1})^+ = I.)

The B_x operate in the coordinate system defined by {u_1, . . . , u_m} (the columns of U):

Pr[x_{1:t}] = 1_m^⊤ A_{x_t} · · · A_{x_1} π = 1_m^⊤ (U^⊤ O)^{−1} B_{x_t} · · · B_{x_1} (U^⊤ O) π

[Figure: A_{x_t} acting on h_t corresponds to B_{x_t} acting on the transformed state: B_{x_t} (U^⊤ O h_t) = U^⊤ (O A_{x_t} h_t).]

Upshot: Suffices to learn the {B_x} instead of the {A_x}.
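A minimal numerical sketch of this claim (toy parameters as in the earlier snippet; names illustrative): form the exact moments P_{2,1} and P_{3,x,1} from (π, T, O), take the thin SVD, and check that B_x equals the similarity transform of A_x:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
pi = rng.dirichlet(np.ones(m))
T = rng.dirichlet(np.ones(m), size=m).T
O = rng.dirichlet(np.ones(n), size=m).T
A = [T @ np.diag(O[x, :]) for x in range(n)]

# Exact moment matrices implied by the model:
# [P21]_{ij} = Pr[x2 = i, x1 = j], [P3x1[x]]_{ij} = Pr[x3 = i, x2 = x, x1 = j]
P21 = O @ T @ np.diag(pi) @ O.T
P3x1 = [O @ A[x] @ T @ np.diag(pi) @ O.T for x in range(n)]

U = np.linalg.svd(P21)[0][:, :m]     # top m left singular vectors of P21
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)]

UO = U.T @ O                         # invertible under the rank condition
for x in range(n):
    assert np.allclose(B[x], UO @ A[x] @ np.linalg.inv(UO))
```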

SLIDE 19

Learning algorithm

The algorithm

  • 1. Look at triples of observations (x_1, x_2, x_3) in the data; estimate the frequencies P̂_1, P̂_{2,1}, and {P̂_{3,x,1}}.
  • 2. Compute the SVD of P̂_{2,1} to get the matrix Û of top m singular vectors (“subspace identification”).
  • 3. Compute B̂_x = (Û^⊤ P̂_{3,x,1}) (Û^⊤ P̂_{2,1})^+ for each x (“observation operators”).
  • 4. Compute b̂_1 = Û^⊤ P̂_1 and b̂_∞ = (P̂_{2,1}^⊤ Û)^+ P̂_1.
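A minimal sketch of these four steps (assuming the data is given as an iterable `triples` of observation triples (x_1, x_2, x_3) with symbols in {0, . . . , n−1}; the function name is illustrative):

```python
import numpy as np

def spectral_hmm(triples, n, m):
    """Steps 1-4 of the algorithm; returns ({B_x}, b_1, b_inf) estimates."""
    N = len(triples)
    P1, P21, P3x1 = np.zeros(n), np.zeros((n, n)), np.zeros((n, n, n))
    for x1, x2, x3 in triples:                 # step 1: empirical frequencies
        P1[x1] += 1.0 / N
        P21[x2, x1] += 1.0 / N
        P3x1[x2, x3, x1] += 1.0 / N            # indexed by the middle symbol
    U = np.linalg.svd(P21)[0][:, :m]           # step 2: subspace identification
    pinv = np.linalg.pinv(U.T @ P21)
    B = [U.T @ P3x1[x] @ pinv for x in range(n)]   # step 3: operators B_x
    b1 = U.T @ P1                              # step 4: boundary vectors
    binf = np.linalg.pinv(P21.T @ U) @ P1
    return B, b1, binf
```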

SLIDE 20

Learning algorithm

  • Joint probability calculations:

P̂r[x_1, . . . , x_t] = b̂_∞^⊤ B̂_{x_t} · · · B̂_{x_1} b̂_1.

  • Conditional probabilities: given x_{1:t−1},

P̂r[x_t | x_{1:t−1}] = b̂_∞^⊤ B̂_{x_t} b̂_t,   where   b̂_t = B̂_{x_{t−1}} · · · B̂_{x_1} b̂_1 / (b̂_∞^⊤ B̂_{x_{t−1}} · · · B̂_{x_1} b̂_1) ≈ (U^⊤ O) E[h_t | x_{1:t−1}].

“Belief states” b̂_t are linearly related to the conditional hidden-state distribution.
(The b̂_t live in the hypercube [−1, +1]^m instead of the simplex Δ_m.)
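A sketch of these calculations on top of the `spectral_hmm` output above (same assumptions; the belief state is renormalized at each step, which matches the displayed formula for b̂_t):

```python
import numpy as np

def joint(x_seq, B, b1, binf):
    """Estimated Pr[x_1, ..., x_t] = binf^T B_{x_t} ... B_{x_1} b_1."""
    b = b1
    for x in x_seq:
        b = B[x] @ b
    return float(binf @ b)

def predict_next(history, B, b1, binf, n):
    """Estimated Pr[x_t = x | x_{1:t-1}] for every symbol x."""
    b = b1
    for x in history:                  # belief-state recursion
        b = B[x] @ b
        b = b / (binf @ b)             # renormalize so that binf^T b = 1
    return np.array([binf @ (B[x] @ b) for x in range(n)])
```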

SLIDE 21

Sample complexity bound

Joint probability accuracy: with probability ≥ 1 − δ,

O( (t²/ε²) · [ m / (σ_m(O)² σ_m(P_{2,1})⁴) + m n_0 / (σ_m(O)² σ_m(P_{2,1})²) ] · log(1/δ) )

observation triples sampled from the HMM suffice to guarantee

Σ_{x_1,...,x_t} | Pr[x_1, . . . , x_t] − P̂r[x_1, . . . , x_t] | ≤ ε.

  • m: number of states
  • n_0: number of observations that account for most of the probability mass
  • σ_m(M): m-th largest singular value of the matrix M

Also have a sample complexity bound for conditional probability accuracy.

SLIDE 22

Conclusions and future work

Summary:

  • Simple and efficient learning algorithm for invertible HMMs (sample complexity bounds, streaming implementation, etc.)
  • Observable representation lets us avoid estimating T and O (n.b. these could be recovered if we really wanted, but doing so is less stable).

  • SVD subspace captures dynamics of observable representation.
  • Salvages old ideas and techniques from automata and control theory.

Future work:

  • Improve efficiency, stability
  • General linear dynamical models
  • Behavior of EM under the rank condition?

SLIDE 23

Thanks!

SLIDE 24

Sample complexity bound

Conditional probability accuracy: with probability ≥ 1 − δ,

poly(1/ε, 1/α, 1/γ, 1/σ_m(O), 1/σ_m(P_{2,1})) · (m² + m n_0) · log(1/δ)

observation triples sampled from the HMM suffice to guarantee

KL( Pr[x_t | x_{1:t−1}] ‖ P̂r[x_t | x_{1:t−1}] ) ≤ ε for all t and all history sequences x_{1:t−1}.

  • n_0: number of observations that account for 1 − ε₀ of the total probability mass, where ε₀ = σ_m(O) σ_m(P_{2,1}) / (4√m)
  • γ = inf_{v : ‖v‖_1 = 1} ‖O v‖_1, the “value of observation” (Even-Dar et al., ’07)
  • α = min_{x,i,j} [A_x]_{ij} (stochasticity requirement)

SLIDE 25

Computer simulations

  • Simple “cycle” HMM: m = 9 states, n = 180 observations
  • Train using 20000 sequences of length 100 (generated by true model)
  • Results (average log-loss on 2000 test sequences of length 100):

True model                            4.79 nats per symbol
EM, initialized with the true model   4.79 nats per symbol
EM, random initializations            5.15 nats per symbol
Our algorithm                         4.88 nats per symbol
Our algorithm, followed by EM         4.81 nats per symbol
