

SLIDE 1

A Spectral Learning Algorithm for Finite State Transducers

Borja Balle, Ariadna Quattoni, Xavier Carreras ECML PKDD — September 7, 2011

B. Balle, A. Quattoni, X. Carreras. Spectral Learning FST, ECML PKDD 2011.

SLIDE 2

Overview

Probabilistic Transducers

◮ Model input-output relations with hidden states
◮ As a conditional distribution Pr[y | x] over strings
◮ With certain independence assumptions

[Figure: graphical model with hidden chain H1 H2 H3 H4 · · ·, inputs X1 … X4 and outputs Y1 … Y4]

◮ Used in many applications: NLP, biology, …

◮ Hard to learn in general — usually EM algorithm is used

SLIDE 3

Overview

Spectral Learning Probabilistic Transducers

Our contribution:

◮ Fast learning algorithm for probabilistic FSTs
◮ With PAC-style theoretical guarantees
◮ Based on an Observable Operator Model for FSTs
◮ Using spectral methods (Chang ’96, Mossel-Roch ’05, Hsu et al. ’09, Siddiqi et al. ’10)
◮ Performing better than EM in experiments with real data

SLIDE 4

Outline

◮ Observable Operators for FST
◮ Learning Observable Operator Models
◮ Experimental Evaluation
◮ Conclusion

SLIDE 5

Observable Operators for FST

Deriving Observable Operator Models

Given aligned sequences (x, y) ∈ (X × Y)^t (i.e. |x| = |y|), the model computes the conditional probability

Pr[y | x] = Σ_{h ∈ H^t} Pr[y, h | x]                (marginalize states)
          = Σ_{h_{t+1} ∈ H} Pr[y, h_{t+1} | x]      (independence assumptions)
          = 1^⊤ α_{t+1}                             (vector form, α_{t+1} ∈ R^m)
          = 1^⊤ A^{y_t}_{x_t} α_t                   (forward-backward equations)
          = 1^⊤ A^{y_t}_{x_t} · · · A^{y_1}_{x_1} α (induction on t)

The choice of an operator A^b_a depends only on observable symbols
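The last line of the derivation is just a chain of matrix products. As an illustrative sketch (not from the paper; the 2-state operators `A[(a, b)]` and initial vector `alpha` are invented placeholders), Pr[y | x] can be evaluated like this:

```python
import numpy as np

# Invented OOM with m = 2 hidden states and one operator A^b_a per
# (input symbol a, output symbol b) pair; the numbers are illustrative only.
alpha = np.array([0.6, 0.4])                        # initial state vector
A = {
    ("a", "0"): np.array([[0.3, 0.1], [0.2, 0.2]]),
    ("a", "1"): np.array([[0.1, 0.3], [0.4, 0.4]]),
}

def cond_prob(x, y, A, alpha):
    """Pr[y | x] = 1^T A^{y_t}_{x_t} ... A^{y_1}_{x_1} alpha."""
    state = alpha
    for a, b in zip(x, y):        # apply one operator per aligned position
        state = A[(a, b)] @ state
    return float(np.ones(len(alpha)) @ state)

p0 = cond_prob("a", "0", A, alpha)
p1 = cond_prob("a", "1", A, alpha)
# The two one-step outputs exhaust the possibilities, so p0 + p1 == 1:
# the operators for input "a" stack into a column-stochastic matrix.
```

Evaluation is linear in the sequence length, with one m×m matrix-vector product per aligned position.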

SLIDE 6

Observable Operators for FST

Observable Operator Model Parameters

Given X = {a_1, …, a_k}, Y = {b_1, …, b_l}, H = {c_1, …, c_m}, then

Pr[y | x] = 1^⊤ A^{y_t}_{x_t} · · · A^{y_1}_{x_1} α

with parameters:

A^b_a = T_a D_b ∈ R^{m×m}                                          (factorized operator)
T_a(i, j) = Pr[H_s = c_i | X_{s−1} = a, H_{s−1} = c_j] ∈ R^{m×m}   (state transition)
D_b(i, j) = δ_{i,j} Pr[Y_s = b | H_s = c_j] ∈ R^{m×m}              (observation emission)
O(i, j) = Pr[Y_s = b_i | H_s = c_j] ∈ R^{l×m}                      (collected emissions)
α(i) = Pr[H_1 = c_i] ∈ R^m                                         (initial probabilities)

The choice of an operator A^b_a depends only on observable symbols …
… but operator parameters are conditioned on hidden states
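To make the factorization concrete, here is a small sketch (the matrices are invented for illustration) of assembling A^b_a = T_a D_b; since the diagonal matrices D_b sum to the identity over all output symbols, summing the operators over b recovers T_a:

```python
import numpy as np

# Invented parameters for m = 2 states and two output symbols.
T_a = np.array([[0.7, 0.4],   # T_a(i, j) = Pr[H_s = c_i | X_{s-1} = a, H_{s-1} = c_j]
                [0.3, 0.6]])
O = np.array([[0.9, 0.2],     # O(i, j) = Pr[Y_s = b_i | H_s = c_j]
              [0.1, 0.8]])

def operator(T_a, O, b):
    """A^b_a = T_a D_b, with D_b = diag(row b of O)."""
    return T_a @ np.diag(O[b])

A0 = operator(T_a, O, 0)
A1 = operator(T_a, O, 1)
# D_{b_1} + D_{b_2} = I, so A0 + A1 equals T_a exactly.
```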

SLIDE 7

Observable Operators for FST

A Learnable Set of Observable Operators

Note that for any invertible Q ∈ R^{m×m}:

Pr[y | x] = 1^⊤ Q^{−1} (Q A^{y_t}_{x_t} Q^{−1}) · · · (Q A^{y_1}_{x_1} Q^{−1}) Q α

Idea (subspace identification methods for linear systems, ’80s):
Find a basis for the state space such that operators in the new basis are related to observable quantities.

Following multiplicity automata and spectral HMM learning …

SLIDE 8

Observable Operators for FST

A Learnable Set of Observable Operators

Find a basis Q where operators can be expressed in terms of unigram, bigram and trigram probabilities:

ρ(i) = Pr[Y_1 = b_i] ∈ R^l
P(i, j) = Pr[Y_1 = b_j, Y_2 = b_i] ∈ R^{l×l}
P^b_a(i, j) = Pr[Y_1 = b_j, Y_2 = b, Y_3 = b_i | X_2 = a] ∈ R^{l×l}

Theorem (ρ, P and P^b_a are sufficient statistics)
Let P = U Σ V^∗ be a thin SVD decomposition; then Q = U^⊤ O yields (under certain assumptions):

Q α = U^⊤ ρ
1^⊤ Q^{−1} = ρ^⊤ (U^⊤ P)^+
Q A^b_a Q^{−1} = (U^⊤ P^b_a)(U^⊤ P)^+
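The theorem can be sanity-checked numerically. The sketch below is my own illustration, not from the paper: it uses an invented 2-state, 3-output-symbol target with a single input symbol (so every transition uses the same T), computes exact ρ, P and P^b_a from the target, applies the formulas above, and verifies that the recovered observable-form quantities reproduce the target's unigram probabilities:

```python
import numpy as np

# Invented target: m = 2 states, l = 3 output symbols, one input symbol.
# Columns of T and O sum to 1; rank(T) = rank(O) = m as the theory requires.
alpha = np.array([0.6, 0.4])               # alpha(i) = Pr[H_1 = c_i]
T = np.array([[0.7, 0.4],
              [0.3, 0.6]])                 # T(i, j) = Pr[c_i | c_j]
O = np.array([[0.6, 0.1],
              [0.3, 0.3],
              [0.1, 0.6]])                 # O(i, j) = Pr[b_i | c_j]

# Exact observable statistics for this target.
rho = O @ alpha                            # rho(i) = Pr[Y1 = b_i]
P = O @ T @ np.diag(alpha) @ O.T           # P(i, j) = Pr[Y1 = b_j, Y2 = b_i]
Pb = [O @ T @ np.diag(O[b]) @ T @ np.diag(alpha) @ O.T for b in range(3)]

# Thin SVD of P; keep the top m left singular vectors.
U = np.linalg.svd(P)[0][:, :2]

# Observable-form parameters from the theorem.
b1 = U.T @ rho                             # Q alpha
binf = rho @ np.linalg.pinv(U.T @ P)       # 1^T Q^{-1}
B = [(U.T @ Pb[b]) @ np.linalg.pinv(U.T @ P) for b in range(3)]

# binf @ B[b] @ b1 should equal Pr[Y1 = b_b] = rho[b] for every b.
recovered = np.array([binf @ B[b] @ b1 for b in range(3)])
```

The same products over longer operator chains reproduce bigram and longer-string probabilities, since B[b] equals Q A^b Q^{−1} exactly when the statistics are exact.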

SLIDE 9

Learning Observable Operator Models

Spectral Learning Algorithm

Given
◮ Input alphabet X and output alphabet Y
◮ Number of hidden states m
◮ Training sample S = {(x_1, y_1), …, (x_n, y_n)}

Do
◮ Compute unigram ρ, bigram P and trigram P^b_a relative frequencies in S
◮ Perform SVD on P and take U with the top m left singular vectors
◮ Return operators computed using ρ, P, P^b_a and U

In Time
◮ O(n) to compute relative frequencies
◮ O(|Y|^3) to compute the SVD
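The counting step can be sketched as follows (my own illustration; the toy aligned sample and symbol tables are invented). Each training pair contributes one count to ρ, to P, and to the P^b_a slice selected by its second input symbol; the trigram frequencies conditioned on X_2 = a are normalized per input symbol:

```python
import numpy as np

# Toy aligned sample standing in for S = {(x1, y1), ..., (xn, yn)}.
sample = [("aab", "010"), ("aba", "101"), ("baa", "001"), ("aab", "110")]
X_sym, Y_sym = "ab", "01"
k, l = len(X_sym), len(Y_sym)
xi = {a: i for i, a in enumerate(X_sym)}   # input symbol -> index
yi = {b: i for i, b in enumerate(Y_sym)}   # output symbol -> index

rho = np.zeros(l)                # rho(i)  = Pr[Y1 = b_i]
P = np.zeros((l, l))             # P(i, j) = Pr[Y1 = b_j, Y2 = b_i]
Pba = np.zeros((k, l, l, l))     # Pba[a, b, i, j] = Pr[Y1=b_j, Y2=b, Y3=b_i | X2=a]
na = np.zeros(k)                 # number of sequences with X2 = a

for x, y in sample:
    rho[yi[y[0]]] += 1
    P[yi[y[1]], yi[y[0]]] += 1
    a = xi[x[1]]
    na[a] += 1
    Pba[a, yi[y[1]], yi[y[2]], yi[y[0]]] += 1

n = len(sample)
rho /= n                         # unigram relative frequencies
P /= n                           # bigram relative frequencies
Pba /= na[:, None, None, None]   # trigram frequencies, conditioned on X2 = a
```

A single pass over the sample suffices, matching the O(n) bound above.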

SLIDE 10

Learning Observable Operator Models

PAC-Style Result

◮ Input distribution D_X over X^∗ with λ = E[|X|], μ = min_a Pr[X_2 = a]
◮ Conditional distributions D_{Y|x} on Y^∗ given x ∈ X^∗, modeled by an FST with m states (satisfying certain rank assumptions)
◮ Sampling i.i.d. from the joint distribution D_X ⊗ D_{Y|X}

Theorem
For any 0 < ε, δ < 1, if the algorithm receives a sample of size

n ≥ O( (λ^2 m |Y| / (ε^4 μ σ_O^2 σ_P^4)) · log(|X| / δ) )

(σ_O and σ_P are the m-th singular values of O and P in the target), then with probability at least 1 − δ the learned hypothesis D̂_{Y|x} satisfies

E_X [ Σ_{y ∈ Y^∗} | D_{Y|X}(y) − D̂_{Y|X}(y) | ] ≤ ε

(the L1 distance between the joint distributions D_X ⊗ D_{Y|X} and D_X ⊗ D̂_{Y|X})

SLIDE 11

Experimental Evaluation

Synthetic Experiments

Goal: compare against baselines when the learning hypothesis holds
Target: randomly generated with |X| = 3, |Y| = 3, |H| = 2

[Figure: L1 distance vs. number of training samples (in thousands, 32 to 32768) for HMM, k-HMM and FST]

◮ HMM: models input-output jointly
◮ k-HMM: one model for each input symbol
◮ Results averaged over 5 runs

SLIDE 12

Experimental Evaluation

Transliteration Experiments

Goal: compare against EM in a real task (where modeling assumptions fail)
Task: English-to-Russian transliteration (brooklyn → бруклин)

[Figure: normalized edit distance vs. number of training sequences (75 to 6000) for Spectral and EM with m = 2 and m = 3]

Training times:
Spectral         26 s
EM (iteration)   37 s
EM (best)      1133 s

◮ Sequence alignment done in preprocessing
◮ Standard techniques used for inference
◮ Test size: 943, |X| = 82, |Y| = 34

SLIDE 13

Conclusion

Summary of Contributions

◮ Fast spectral method for learning input-output OOMs
◮ Strong theoretical guarantees with few assumptions on the input distribution
◮ Outperforming previous spectral algorithms on FSTs
◮ Faster and better than EM in some real tasks

SLIDE 14

A Spectral Learning Algorithm for Finite State Transducers

Borja Balle, Ariadna Quattoni, Xavier Carreras ECML PKDD — September 7, 2011

SLIDE 15

Technical Assumptions

X = {a_1, …, a_k}, Y = {b_1, …, b_l}, H = {c_1, …, c_m}

Parameters
T_a(i, j) = Pr[H_s = c_i | X_{s−1} = a, H_{s−1} = c_j] ∈ R^{m×m}   (state transition)
T = Σ_a T_a Pr[X_1 = a] ∈ R^{m×m}                                  (“mean” transition matrix)
O(i, j) = Pr[Y_s = b_i | H_s = c_j] ∈ R^{l×m}                      (collected emissions)
α(i) = Pr[H_1 = c_i] ∈ R^m                                         (initial probabilities)

Assumptions

  • 1. l ≥ m
  • 2. α > 0
  • 3. rank(T) = rank(O) = m
  • 4. mina Pr[X2 = a] > 0