

SLIDE 1

A Spectral Learning Algorithm for Finite State Transducers

Borja Balle, Ariadna Quattoni, Xavier Carreras ECML PKDD — September 7, 2011

B. Balle, A. Quattoni, X. Carreras. Spectral Learning FST, ECML PKDD 2011.

SLIDE 2

Overview

Probabilistic Transducers

◮ Model input-output relations with hidden states
◮ As a conditional distribution Pr[y | x] over strings
◮ With certain independence assumptions

[Figure: graphical model with hidden chain H1 H2 H3 H4 · · ·, inputs X1 … X4 and outputs Y1 … Y4]

◮ Used in many applications: NLP, biology, …

◮ Hard to learn in general — usually EM algorithm is used

SLIDE 3

Overview

Spectral Learning Probabilistic Transducers

Our contribution:

◮ Fast learning algorithm for probabilistic FSTs
◮ With PAC-style theoretical guarantees
◮ Based on an Observable Operator Model for FSTs
◮ Using spectral methods (Chang ’96, Mossel-Roch ’05, Hsu et al. ’09, Siddiqi et al. ’10)
◮ Performing better than EM in experiments with real data

SLIDE 4

Outline

◮ Observable Operators for FST
◮ Learning Observable Operator Models
◮ Experimental Evaluation
◮ Conclusion

SLIDE 5

Observable Operators for FST

Deriving Observable Operator Models

Given aligned sequences (x, y) ∈ (X × Y)^t (i.e. |x| = |y|), the model computes the conditional probability

Pr[y | x] = Σ_{h ∈ H^t} Pr[y, h | x]                (marginalize states)
          = Σ_{h_{t+1} ∈ H} Pr[y, h_{t+1} | x]      (independence assumptions)
          = 1^⊤ α_{t+1}                             (vector form, α_{t+1} ∈ R^m)
          = 1^⊤ A^{y_t}_{x_t} α_t                   (forward-backward equations)
          = 1^⊤ A^{y_t}_{x_t} · · · A^{y_1}_{x_1} α (induction on t)

The choice of an operator A^b_a depends only on observable symbols
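The last line of the derivation is just a chain of matrix products. As an illustrative sketch (not from the paper; the 2-state operators `A[(a, b)]` and initial vector `alpha` are invented placeholders), Pr[y | x] can be evaluated like this:

```python
import numpy as np

# Invented OOM with m = 2 hidden states and one operator A^b_a per
# (input symbol a, output symbol b) pair; the numbers are illustrative only.
alpha = np.array([0.6, 0.4])                        # initial state vector
A = {
    ("a", "0"): np.array([[0.3, 0.1], [0.2, 0.2]]),
    ("a", "1"): np.array([[0.1, 0.3], [0.4, 0.4]]),
}

def cond_prob(x, y, A, alpha):
    """Pr[y | x] = 1^T A^{y_t}_{x_t} ... A^{y_1}_{x_1} alpha."""
    state = alpha
    for a, b in zip(x, y):        # apply one operator per aligned position
        state = A[(a, b)] @ state
    return float(np.ones(len(alpha)) @ state)

p0 = cond_prob("a", "0", A, alpha)
p1 = cond_prob("a", "1", A, alpha)
# The two one-step outputs exhaust the possibilities, so p0 + p1 == 1:
# the operators for input "a" stack into a column-stochastic matrix.
```

Evaluation is linear in the sequence length, with one m×m matrix-vector product per aligned position.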

SLIDE 6

Observable Operators for FST

Observable Operator Model Parameters

Given X = {a_1, …, a_k}, Y = {b_1, …, b_l}, H = {c_1, …, c_m}, then

Pr[y | x] = 1^⊤ A^{y_t}_{x_t} · · · A^{y_1}_{x_1} α

with parameters:

A^b_a = T_a D_b ∈ R^{m×m}                                          (factorized operator)
T_a(i, j) = Pr[H_s = c_i | X_{s−1} = a, H_{s−1} = c_j] ∈ R^{m×m}   (state transition)
D_b(i, j) = δ_{i,j} Pr[Y_s = b | H_s = c_j] ∈ R^{m×m}              (observation emission)
O(i, j) = Pr[Y_s = b_i | H_s = c_j] ∈ R^{l×m}                      (collected emissions)
α(i) = Pr[H_1 = c_i] ∈ R^m                                         (initial probabilities)

The choice of an operator A^b_a depends only on observable symbols …
… but operator parameters are conditioned on hidden states
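To make the factorization concrete, here is a small sketch (the matrices are invented for illustration) of assembling A^b_a = T_a D_b; since the diagonal matrices D_b sum to the identity over all output symbols, summing the operators over b recovers T_a:

```python
import numpy as np

# Invented parameters for m = 2 states and two output symbols.
T_a = np.array([[0.7, 0.4],   # T_a(i, j) = Pr[H_s = c_i | X_{s-1} = a, H_{s-1} = c_j]
                [0.3, 0.6]])
O = np.array([[0.9, 0.2],     # O(i, j) = Pr[Y_s = b_i | H_s = c_j]
              [0.1, 0.8]])

def operator(T_a, O, b):
    """A^b_a = T_a D_b, with D_b = diag(row b of O)."""
    return T_a @ np.diag(O[b])

A0 = operator(T_a, O, 0)
A1 = operator(T_a, O, 1)
# D_{b_1} + D_{b_2} = I, so A0 + A1 equals T_a exactly.
```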

SLIDE 7

Observable Operators for FST

A Learnable Set of Observable Operators

Note that for any invertible Q ∈ R^{m×m}:

Pr[y | x] = 1^⊤ Q^{−1} (Q A^{y_t}_{x_t} Q^{−1}) · · · (Q A^{y_1}_{x_1} Q^{−1}) Q α

Idea (subspace identification methods for linear systems, ’80s):
Find a basis for the state space such that operators in the new basis are related to observable quantities.

Following multiplicity automata and spectral HMM learning …

SLIDE 8

Observable Operators for FST

A Learnable Set of Observable Operators

Find a basis Q where operators can be expressed in terms of unigram, bigram and trigram probabilities:

ρ(i) = Pr[Y_1 = b_i] ∈ R^l
P(i, j) = Pr[Y_1 = b_j, Y_2 = b_i] ∈ R^{l×l}
P^b_a(i, j) = Pr[Y_1 = b_j, Y_2 = b, Y_3 = b_i | X_2 = a] ∈ R^{l×l}

Theorem (ρ, P and P^b_a are sufficient statistics)
Let P = U Σ V^∗ be a thin SVD decomposition; then Q = U^⊤ O yields (under certain assumptions):

Q α = U^⊤ ρ
1^⊤ Q^{−1} = ρ^⊤ (U^⊤ P)^+
Q A^b_a Q^{−1} = (U^⊤ P^b_a)(U^⊤ P)^+
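The theorem can be sanity-checked numerically. The sketch below is my own illustration, not from the paper: it uses an invented 2-state, 3-output-symbol target with a single input symbol (so every transition uses the same T), computes exact ρ, P and P^b_a from the target, applies the formulas above, and verifies that the recovered observable-form quantities reproduce the target's unigram probabilities:

```python
import numpy as np

# Invented target: m = 2 states, l = 3 output symbols, one input symbol.
# Columns of T and O sum to 1; rank(T) = rank(O) = m as the theory requires.
alpha = np.array([0.6, 0.4])               # alpha(i) = Pr[H_1 = c_i]
T = np.array([[0.7, 0.4],
              [0.3, 0.6]])                 # T(i, j) = Pr[c_i | c_j]
O = np.array([[0.6, 0.1],
              [0.3, 0.3],
              [0.1, 0.6]])                 # O(i, j) = Pr[b_i | c_j]

# Exact observable statistics for this target.
rho = O @ alpha                            # rho(i) = Pr[Y1 = b_i]
P = O @ T @ np.diag(alpha) @ O.T           # P(i, j) = Pr[Y1 = b_j, Y2 = b_i]
Pb = [O @ T @ np.diag(O[b]) @ T @ np.diag(alpha) @ O.T for b in range(3)]

# Thin SVD of P; keep the top m left singular vectors.
U = np.linalg.svd(P)[0][:, :2]

# Observable-form parameters from the theorem.
b1 = U.T @ rho                             # Q alpha
binf = rho @ np.linalg.pinv(U.T @ P)       # 1^T Q^{-1}
B = [(U.T @ Pb[b]) @ np.linalg.pinv(U.T @ P) for b in range(3)]

# binf @ B[b] @ b1 should equal Pr[Y1 = b_b] = rho[b] for every b.
recovered = np.array([binf @ B[b] @ b1 for b in range(3)])
```

The same products over longer operator chains reproduce bigram and longer-string probabilities, since B[b] equals Q A^b Q^{−1} exactly when the statistics are exact.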

SLIDE 9

Learning Observable Operator Models

Spectral Learning Algorithm

Given
◮ Input alphabet X and output alphabet Y
◮ Number of hidden states m
◮ Training sample S = {(x_1, y_1), …, (x_n, y_n)}

Do
◮ Compute unigram ρ, bigram P and trigram P^b_a relative frequencies in S
◮ Perform SVD on P and take U with the top m left singular vectors
◮ Return operators computed using ρ, P, P^b_a and U

In Time
◮ O(n) to compute relative frequencies
◮ O(|Y|^3) to compute the SVD
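The counting step can be sketched as follows (my own illustration; the toy aligned sample and symbol tables are invented). Each training pair contributes one count to ρ, to P, and to the P^b_a slice selected by its second input symbol; the trigram frequencies conditioned on X_2 = a are normalized per input symbol:

```python
import numpy as np

# Toy aligned sample standing in for S = {(x1, y1), ..., (xn, yn)}.
sample = [("aab", "010"), ("aba", "101"), ("baa", "001"), ("aab", "110")]
X_sym, Y_sym = "ab", "01"
k, l = len(X_sym), len(Y_sym)
xi = {a: i for i, a in enumerate(X_sym)}   # input symbol -> index
yi = {b: i for i, b in enumerate(Y_sym)}   # output symbol -> index

rho = np.zeros(l)                # rho(i)  = Pr[Y1 = b_i]
P = np.zeros((l, l))             # P(i, j) = Pr[Y1 = b_j, Y2 = b_i]
Pba = np.zeros((k, l, l, l))     # Pba[a, b, i, j] = Pr[Y1=b_j, Y2=b, Y3=b_i | X2=a]
na = np.zeros(k)                 # number of sequences with X2 = a

for x, y in sample:
    rho[yi[y[0]]] += 1
    P[yi[y[1]], yi[y[0]]] += 1
    a = xi[x[1]]
    na[a] += 1
    Pba[a, yi[y[1]], yi[y[2]], yi[y[0]]] += 1

n = len(sample)
rho /= n                         # unigram relative frequencies
P /= n                           # bigram relative frequencies
Pba /= na[:, None, None, None]   # trigram frequencies, conditioned on X2 = a
```

A single pass over the sample suffices, matching the O(n) bound above.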

SLIDE 10

Learning Observable Operator Models

PAC-Style Result

◮ Input distribution D_X over X^∗ with λ = E[|X|], μ = min_a Pr[X_2 = a]
◮ Conditional distributions D_{Y|x} on Y^∗ given x ∈ X^∗, modeled by an FST with m states (satisfying certain rank assumptions)
◮ Sampling i.i.d. from the joint distribution D_X ⊗ D_{Y|X}

Theorem
For any 0 < ε, δ < 1, if the algorithm receives a sample of size

n ≥ O( (λ^2 m |Y| / (ε^4 μ σ_O^2 σ_P^4)) · log(|X| / δ) )

(σ_O and σ_P are the m-th singular values of O and P in the target), then with probability at least 1 − δ the learned hypothesis D̂_{Y|x} satisfies

E_X [ Σ_{y ∈ Y^∗} | D_{Y|X}(y) − D̂_{Y|X}(y) | ] ≤ ε

(the L1 distance between the joint distributions D_X ⊗ D_{Y|X} and D_X ⊗ D̂_{Y|X})

SLIDE 11

Experimental Evaluation

Synthetic Experiments

Goal: compare against baselines when the learning hypothesis holds
Target: randomly generated with |X| = 3, |Y| = 3, |H| = 2

[Figure: L1 distance vs. number of training samples (in thousands, 32 to 32768) for HMM, k-HMM and FST]

◮ HMM: models input-output jointly
◮ k-HMM: one model for each input symbol
◮ Results averaged over 5 runs

SLIDE 12

Experimental Evaluation

Transliteration Experiments

Goal: compare against EM in a real task (where modeling assumptions fail)
Task: English-to-Russian transliteration (brooklyn → бруклин)

[Figure: normalized edit distance vs. number of training sequences (75 to 6000) for Spectral and EM with m = 2 and m = 3]

Training times:
Spectral         26 s
EM (iteration)   37 s
EM (best)      1133 s

◮ Sequence alignment done in preprocessing
◮ Standard techniques used for inference
◮ Test size: 943, |X| = 82, |Y| = 34

SLIDE 13

Conclusion

Summary of Contributions

◮ Fast spectral method for learning input-output OOMs
◮ Strong theoretical guarantees with few assumptions on the input distribution
◮ Outperforming previous spectral algorithms on FSTs
◮ Faster and better than EM in some real tasks

SLIDE 14

A Spectral Learning Algorithm for Finite State Transducers

Borja Balle, Ariadna Quattoni, Xavier Carreras ECML PKDD — September 7, 2011

SLIDE 15

Technical Assumptions

X = {a_1, …, a_k}, Y = {b_1, …, b_l}, H = {c_1, …, c_m}

Parameters
T_a(i, j) = Pr[H_s = c_i | X_{s−1} = a, H_{s−1} = c_j] ∈ R^{m×m}   (state transition)
T = Σ_a T_a Pr[X_1 = a] ∈ R^{m×m}                                  (“mean” transition matrix)
O(i, j) = Pr[Y_s = b_i | H_s = c_j] ∈ R^{l×m}                      (collected emissions)
α(i) = Pr[H_1 = c_i] ∈ R^m                                         (initial probabilities)

Assumptions

  • 1. l ≥ m
  • 2. α > 0
  • 3. rank(T) = rank(O) = m
  • 4. mina Pr[X2 = a] > 0