SLIDE 1

Learning Automata with Hankel Matrices

Borja Balle

Amazon Research Cambridge

Highlights — London, September 2017

SLIDE 2

Weighted Finite Automata (WFA) (over ℝ)

Graphical Representation: a two-state automaton over Σ = {a, b}; state q1 carries initial weight −1 and final weight 1.2, state q2 carries initial weight 0.5, and each transition carries one weight per symbol (a, 1.2 and b, 2 on the q1 self-loop; a, −1 and b, −2 from q1 to q2; a, 3.2 and b, 5 on the q2 self-loop; a, −2 and b, 0 from q2 to q1), matching the matrices below.

Algebraic Representation: A = ⟨α, β, {A_a}_{a∈Σ}⟩ with

α = [−1, 0.5]ᵀ   β = [1.2, ·]ᵀ   A_a = [[1.2, −1], [−2, 3.2]]   A_b = [[2, −2], [5, 0]]

Behavioral Representation: each WFA A computes a function A : Σ* → ℝ given by

A(x_1 ··· x_T) = αᵀ A_{x_1} ··· A_{x_T} β
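As a quick illustration of the behavioral representation, here is a minimal NumPy sketch that evaluates the WFA above on a string; the second entry of β is a placeholder value, since it is not fully specified above.

```python
import numpy as np

# Example WFA from this slide; beta's second entry is a placeholder.
alpha = np.array([-1.0, 0.5])                    # initial weight vector
beta = np.array([1.2, 0.0])                      # final weight vector (second entry assumed)
A = {
    "a": np.array([[1.2, -1.0], [-2.0, 3.2]]),   # transition operator for symbol a
    "b": np.array([[2.0, -2.0], [5.0, 0.0]]),    # transition operator for symbol b
}

def wfa_value(word):
    """Compute A(x_1 ... x_T) = alpha^T A_{x_1} ... A_{x_T} beta."""
    v = alpha
    for symbol in word:
        v = v @ A[symbol]
    return float(v @ beta)

print(wfa_value("ab"))   # value assigned to the string "ab"
print(wfa_value(""))     # empty string: alpha^T beta
```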

SLIDE 3

In This Talk...

§ Describe a core algorithm common to many algorithms for learning weighted automata
§ Explain the role this core plays in three learning problems in different setups
§ Survey extensions to more complex models and some applications

SLIDE 4

Outline

  • 1. From Hankel Matrices to Weighted Automata
  • 2. From Data to Hankel Matrices
  • 3. From Theory to Practice
SLIDE 5

Outline

  • 1. From Hankel Matrices to Weighted Automata
  • 2. From Data to Hankel Matrices
  • 3. From Theory to Practice
SLIDE 6

Hankel Matrices and Fliess’ Theorem

Given f : Σ* → ℝ, define its Hankel matrix H_f ∈ ℝ^{Σ*×Σ*} by

H_f(p, s) = f(p·s)  for all p, s ∈ Σ*

Rows are indexed by prefixes and columns by suffixes, so for example H_f(ε, a) = H_f(a, ε) = f(a) and H_f(a, a) = f(aa): every decomposition of the same string p·s carries the same value.

Theorem [Fli74]

  • 1. The rank of H_f is finite if and only if f is computed by a WFA
  • 2. The rank rank(f) = rank(H_f) equals the number of states of a minimal WFA computing f
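To make the theorem concrete, here is a small sketch (assuming a NumPy environment; the helper name is illustrative) that builds a finite Hankel block and checks its rank numerically for a rank-one function.

```python
import numpy as np

def hankel_block(f, prefixes, suffixes):
    """Finite Hankel sub-block: entry (i, j) holds f(prefixes[i] + suffixes[j])."""
    return np.array([[f(p + s) for s in suffixes] for p in prefixes])

# Toy function f(x) = 0.5^|x|, which has rank 1; by Fliess' theorem every
# finite Hankel block of f has rank at most 1.
f = lambda x: 0.5 ** len(x)
basis = ["", "a", "b", "aa", "ab"]
H = hankel_block(f, basis, basis)
print(np.linalg.matrix_rank(H))   # -> 1
```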
SLIDE 7

The Structure of Hankel Matrices

A(p_1 ··· p_T s_1 ··· s_{T'}) = αᵀ A_{p_1} ··· A_{p_T} A_{s_1} ··· A_{s_{T'}} β

so the Hankel matrix factors as H = P S, where the row of P indexed by prefix p = p_1 ··· p_T is αᵀ A_{p_1} ··· A_{p_T} and the column of S indexed by suffix s = s_1 ··· s_{T'} is A_{s_1} ··· A_{s_{T'}} β.

A(p_1 ··· p_T a s_1 ··· s_{T'}) = αᵀ A_{p_1} ··· A_{p_T} A_a A_{s_1} ··· A_{s_{T'}} β

so the block of entries indexed by strings of the form p·a·s satisfies H_a = P A_a S.

Algebraically: factorizing H lets us solve for A_a:

H = P S  ⟹  H_a = P A_a S  ⟹  A_a = P⁺ H_a S⁺
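A minimal NumPy sketch of this identity, assuming an exact rank factorization and noise-free blocks (all concrete matrices below are toy placeholders, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_prefixes, n_suffixes = 2, 5, 4
P = rng.normal(size=(n_prefixes, n))        # forward factor (rows indexed by prefixes)
S = rng.normal(size=(n, n_suffixes))        # backward factor (columns indexed by suffixes)
A_a = np.array([[1.2, -1.0], [-2.0, 3.2]])  # operator we want to recover

H = P @ S                                    # Hankel block
H_a = P @ A_a @ S                            # shifted block for symbol a

# A_a = P^+ H_a S^+ holds exactly when P has full column rank and S full row rank.
A_a_rec = np.linalg.pinv(P) @ H_a @ np.linalg.pinv(S)
print(np.allclose(A_a_rec, A_a))             # -> True
```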

SLIDE 8

SVD-based Reconstruction [HKZ09; Bal+14]

Inputs

§ Desired number of states r
§ Basis B = (P, S) with P, S ⊂ Σ*, ε ∈ P ∩ S
§ Finite Hankel blocks indexed by prefixes and suffixes in B:
  § H_B ∈ ℝ^{P×S}
  § H_B,Σ = {H_B,a ∈ ℝ^{P×S} : a ∈ Σ}

Algorithm: Spectral(H_B, H_B,Σ, r)

  • 1. Compute the rank-r SVD of H_B ≈ U D Vᵀ
  • 2. Let A_a = D⁻¹ Uᵀ H_B,a V
  • 3. Let α = Vᵀ H_B(ε, −) and β = D⁻¹ Uᵀ H_B(−, ε)
  • 4. Return A = ⟨α, β, {A_a}⟩

Running time:

  • 1. SVD takes O(|P||S|r)
  • 2. Matrix multiplications take O(|Σ||P||S|r)
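A minimal NumPy sketch of the routine above, assuming the empty string is stored at index 0 of both the prefix and suffix lists (the function signature is illustrative, not a reference implementation):

```python
import numpy as np

def spectral(H, H_sigma, r, eps_row=0, eps_col=0):
    """Sketch of Spectral(H_B, H_B,Sigma, r) via a rank-r truncated SVD.

    H        : |P| x |S| Hankel block over the basis B = (P, S)
    H_sigma  : dict mapping each symbol a to its |P| x |S| block H_B,a
    r        : desired number of states
    eps_row  : index of the empty prefix in P (assumed to be 0)
    eps_col  : index of the empty suffix in S (assumed to be 0)
    """
    U, d, Vt = np.linalg.svd(H, full_matrices=False)
    U, D_inv, V = U[:, :r], np.diag(1.0 / d[:r]), Vt[:r, :].T     # rank-r truncation
    A = {a: D_inv @ U.T @ H_a @ V for a, H_a in H_sigma.items()}  # A_a = D^-1 U^T H_B,a V
    alpha = V.T @ H[eps_row, :]                                   # alpha = V^T H_B(eps, -)
    beta = D_inv @ U.T @ H[:, eps_col]                            # beta  = D^-1 U^T H_B(-, eps)
    return alpha, beta, A
```

Fed with exact sub-blocks of a rank-r Hankel matrix, this returns a WFA computing the same function (the recovery property on the next slide).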
SLIDE 9

Properties of Spectral [HKZ09; Bal13; BM15a]

Consistency

§ If P is prefix-closed, S is suffix-closed, and r = rank(H_B) = rank([H_B | H_B,Σ])
§ Then for all p ∈ P, s ∈ S, a ∈ Σ, the WFA A = Spectral(H_B, H_B,Σ, r) satisfies

A(p·s) = H_B(p, s) and A(p·a·s) = H_B,a(p, s)

Recovery

§ If H_B and H_B,Σ are sub-blocks of H_f with r = rank(f) = rank(H_B)
§ Then the WFA A = Spectral(H_B, H_B,Σ, r) satisfies A ≡ f

Robustness

§ If r = rank(H_B) = rank([H_B | H_B,Σ]) and ‖H_B − Ĥ_B‖ ≤ ε and ‖H_B,a − Ĥ_B,a‖ ≤ ε for all a ∈ Σ
§ Then ⟨α, β, {A_a}⟩ = Spectral(H_B, H_B,Σ, r) and ⟨α̂, β̂, {Â_a}⟩ = Spectral(Ĥ_B, Ĥ_B,Σ, r) satisfy

‖α − α̂‖, ‖β − β̂‖, ‖A_a − Â_a‖ ≤ ε

SLIDE 10

Outline

  • 1. From Hankel Matrices to Weighted Automata
  • 2. From Data to Hankel Matrices
  • 3. From Theory to Practice
SLIDE 11

Learning Models

  • 1. Exact query learning: membership + equivalence queries [BV96; BBM06; BM15a]
  • 2. Distributional PAC learning: samples from a stochastic WFA [HKZ09; BDR09; Bal+14]
  • 3. Statistical learning: optimize output predictions wrt a loss function [BM12; BM15b]
SLIDE 12

Exact Learning of WFA with Queries

Setup:

§ Unknown f : Σ* → ℝ with rank(f) = n
§ Membership oracle: MQ_f(x) returns f(x) for any x ∈ Σ*
§ Equivalence oracle: EQ_f(A) returns true if f ≡ A and (false, z) if f(z) ≠ A(z)

Algorithm:

  • 1. Initialize P = S = {ε} and maintain B = (P, S)
  • 2. Let A = Spectral(H_B, H_B,Σ, rank(H_B))
  • 3. While EQ(A) = (false, z):

    3.1 Let z = p·a·s with p the longest prefix of z in P
    3.2 Let S = S ∪ suffixes(s)
    3.3 While there exist p ∈ P and a ∈ Σ such that H_B,a(p, −) ∉ rowspan(H_B), add p·a to P
    3.4 Let A = Spectral(H_B, H_B,Σ, rank(H_B))
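A minimal NumPy sketch of the closure test behind step 3.3 (the function name and numerical tolerance are illustrative assumptions):

```python
import numpy as np

def in_rowspan(H, row, tol=1e-10):
    """Return True if `row` lies (numerically) in the row span of H.

    Step 3.3 adds p.a to P whenever some row H_B,a(p, -) fails this test,
    so that the basis stays closed before re-running Spectral.
    """
    coeffs, *_ = np.linalg.lstsq(H.T, row, rcond=None)   # best combination of rows of H
    return np.linalg.norm(row - H.T @ coeffs) <= tol
```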

Analysis:

§ At most n + 1 calls to EQ_f and O(|Σ| n² L) calls to MQ_f, where L = max |z|
§ Can be improved to O((|Σ| + log L) n²) calls to MQ_f; calls to EQ_f can be reduced by increasing calls to MQ_f

SLIDE 13

PAC Learning Stochastic WFA

Setup:

§ Unknown f : Σ* → ℝ with rank(f) = n defining a probability distribution on Σ*
§ Data: x^(1), ..., x^(m) i.i.d. strings sampled from f
§ Parameters: n and B = (P, S) such that rank(H_B) = n and ε ∈ P ∩ S

Algorithm:

  • 1. Estimate the Hankel matrices Ĥ_B and Ĥ_B,a for all a ∈ Σ using the empirical probabilities

    f̂(x) = (1/m) Σ_{i=1}^m 1[x^(i) = x]

  • 2. Return Â = Spectral(Ĥ_B, Ĥ_B,Σ, n)
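A minimal sketch of the estimation in step 1, assuming strings are plain Python strings and the sample fits in memory (names are illustrative):

```python
import numpy as np
from collections import Counter

def empirical_hankel(samples, prefixes, suffixes, alphabet):
    """Estimate the Hankel block H_B and shifted blocks H_B,a from i.i.d. strings.

    Each entry is the empirical probability f_hat(x) = (1/m) * #{i : x^(i) = x}.
    """
    m = len(samples)
    counts = Counter(samples)
    f_hat = lambda x: counts[x] / m
    H = np.array([[f_hat(p + s) for s in suffixes] for p in prefixes])
    H_sigma = {a: np.array([[f_hat(p + a + s) for s in suffixes] for p in prefixes])
               for a in alphabet}
    return H, H_sigma
```

Step 2 then passes these estimates to the SVD-based reconstruction.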

Analysis:

§ Running time is O(|P·S| m + |Σ||P||S| n)
§ With high probability Σ_{|x|≤L} |f(x) − Â(x)| = O( L² |Σ| √n / (σ_n(H_B)² √m) )

SLIDE 14

Statistical Learning of WFA

Setup:

§ Unknown distribution D over Σ* × ℝ
§ Data: (x^(1), y^(1)), ..., (x^(m), y^(m)) i.i.d. string-label pairs sampled from D
§ Parameters: n, convex loss function ℓ : ℝ × ℝ → ℝ₊, convex regularizer R, regularization parameter λ > 0, and B = (P, S) with ε ∈ P ∩ S

Algorithm:

  • 1. Build B' = (P', S) with P' = P ∪ P·Σ
  • 2. Find the Hankel matrix Ĥ_B' solving min_H (1/m) Σ_{i=1}^m ℓ(H(x^(i)), y^(i)) + λ R(H)
  • 3. Return Â = Spectral(Ĥ_B, Ĥ_B,Σ, n), where Ĥ_B and Ĥ_B,Σ are submatrices of Ĥ_B'
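A sketch of step 2 for one concrete instantiation, squared loss with a nuclear-norm regularizer in the spirit of the constrained matrix completion of [BM12]; the use of the cvxpy package and all names below are assumptions for illustration, not the setup used in the original works.

```python
import cvxpy as cp

def fit_hankel(train, prefixes, suffixes, lam):
    """Sketch: minimize (1/m) * sum_i (H(x_i) - y_i)^2 + lam * ||H||_* over
    matrices H with Hankel structure on the basis B' = (P', S).

    `train` is a list of (string, label) pairs; each training string is assumed
    to admit at least one split x = p + s with p in prefixes and s in suffixes.
    """
    H = cp.Variable((len(prefixes), len(suffixes)))

    # Hankel constraints: all cells (p, s) with the same concatenation p + s share one value.
    cells, constraints = {}, []
    for i, p in enumerate(prefixes):
        for j, s in enumerate(suffixes):
            if p + s in cells:
                constraints.append(H[i, j] == H[cells[p + s]])
            else:
                cells[p + s] = (i, j)

    # Squared loss on one representative cell per training string.
    losses = [cp.square(H[cells[x]] - y) for x, y in train if x in cells]
    objective = cp.Minimize(sum(losses) / len(train) + lam * cp.normNuc(H))
    cp.Problem(objective, constraints).solve()
    return H.value   # estimated Hankel block over B'
```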

Analysis:

§ Running time is polynomial in n, m, |Σ|, |P|, and |S|
§ With high probability

E_{(x,y)~D}[ℓ(Â(x), y)] ≤ (1/m) Σ_{i=1}^m ℓ(Â(x^(i)), y^(i)) + O(1/√m)

SLIDE 15

Outline

  • 1. From Hankel Matrices to Weighted Automata
  • 2. From Data to Hankel Matrices
  • 3. From Theory to Practice
SLIDE 16

Extensions

  • 1. More complex models

§ Transducers and taggers [BQC11; Qua+14]
§ Grammars and tree automata [Luq+12; Bal+14; RBC16]
§ Reactive models [BBP15; LBP16; BM17a]

  • 2. More realistic setups

§ Multiple related tasks [RBP17]
§ Timing data [BBP15; LBP16]
§ Single trajectory [BM17a]
§ Probabilistic models [BHP14]

  • 3. Deeper theory

§ Convex relaxations [BQC12]
§ Generalization bounds [BM15b; BM17b]
§ Approximate minimisation [BPP15]
§ Bisimulation metrics [BGP17]

SLIDE 17

And It Works Too!

Spectral methods are competitive against traditional methods:

§ Expectation maximization
§ Conditional random fields
§ Tensor decompositions

In a variety of problems:

§ Sequence tagging
§ Constituency and dependency parsing
§ Timing and geometry learning
§ POS-level language modelling

[Experimental plots: L1 distance vs. number of training samples (HMM, k-HMM, FST); Hamming accuracy vs. training samples (averaged perceptron, CRF, and spectral/max-margin variants); runtime of initialization and model building (CO, Tensor, EM); relative error vs. Hankel rank; word error rate vs. number of states for spectral models with different bases against unigram/bigram baselines; results for SVTA and SVTA* by sentence length.]

SLIDE 18

Open Problems and Current Trends

§ Optimal selection of P and S from data
§ Scalable convex optimization over sets of Hankel matrices
§ Constraining the output WFA (e.g. probabilistic automata)
§ Relations between learning and approximate minimisation
§ How much of this can be extended to WFA over semirings?
§ Spectral methods for initializing non-convex gradient-based learning algorithms

SLIDE 19

Conclusion

Take home points

§ A single building block based on SVD of Hankel matrices
§ Implementation only requires linear algebra
§ Analysis involves linear algebra, probability, convex optimization
§ Can be made practical for a variety of models and applications

Want to know more?

§ EMNLP’14 tutorial (with slides, video, and code)

https://borjaballe.github.io/emnlp14-tutorial/

§ Survey papers [BM15a; TJ15]
§ Python toolkit Sp2Learn [Arr+16]
§ Neighbouring literature: predictive state representations (PSR) [LSS02] and observable operator models (OOM) [Jae00]
SLIDE 20

Thanks To All My Collaborators!

Xavier Carreras, Mehryar Mohri, Prakash Panangaden, Joelle Pineau, Doina Precup, Ariadna Quattoni, Guillaume Rabusseau, Franco M. Luque, Pierre-Luc Bacon, Pascale Gourdeau, Odalric-Ambrym Maillard, Will Hamilton, Lucas Langer, Shay Cohen, Amir Globerson

SLIDE 21

Bibliography I

[Arr+16] D. Arrivault, D. Benielli, F. Denis, and R. Eyraud. “Sp2Learn: A Toolbox for the Spectral Learning of Weighted Automata”. In: ICGI. 2016.
[Bal+14] B. Balle, X. Carreras, F. M. Luque, and A. Quattoni. “Spectral learning of weighted automata: A forward-backward perspective”. In: Machine Learning (2014).
[Bal13] B. Balle. “Learning Finite-State Machines: Algorithmic and Statistical Aspects”. PhD thesis. Universitat Politècnica de Catalunya, 2013.
[BBM06] L. Bisht, N. H. Bshouty, and H. Mazzawi. “On Optimal Learning Algorithms for Multiplicity Automata”. In: COLT. 2006.
[BBP15] P.-L. Bacon, B. Balle, and D. Precup. “Learning and Planning with Timing Information in Markov Decision Processes”. In: UAI. 2015.
[BDR09] R. Bailly, F. Denis, and L. Ralaivola. “Grammatical inference as a principal component analysis problem”. In: ICML. 2009.

SLIDE 22

Bibliography II

[BGP17] B. Balle, P. Gourdeau, and P. Panangaden. “Bisimulation Metrics for Weighted Automata”. In: ICALP. 2017.
[BHP14] B. Balle, W. L. Hamilton, and J. Pineau. “Methods of Moments for Learning Stochastic Languages: Unified Presentation and Empirical Comparison”. In: ICML. 2014.
[BM12] B. Balle and M. Mohri. “Spectral learning of general weighted automata via constrained matrix completion”. In: NIPS. 2012.
[BM15a] B. Balle and M. Mohri. “Learning Weighted Automata (invited paper)”. In: CAI. 2015.
[BM15b] B. Balle and M. Mohri. “On the Rademacher complexity of weighted automata”. In: ALT. 2015.
[BM17a] B. Balle and O.-A. Maillard. “Spectral Learning from a Single Trajectory under Finite-State Policies”. In: ICML. 2017.

SLIDE 23

Bibliography III

[BM17b] B. Balle and M. Mohri. “Generalization Bounds for Learning Weighted Automata”. In: Theor. Comput. Sci. (to appear) (2017).
[BPP15] B. Balle, P. Panangaden, and D. Precup. “A Canonical Form for Weighted Automata and Applications to Approximate Minimization”. In: LICS. 2015.
[BQC11] B. Balle, A. Quattoni, and X. Carreras. “A spectral learning algorithm for finite state transducers”. In: ECML-PKDD. 2011.
[BQC12] B. Balle, A. Quattoni, and X. Carreras. “Local loss optimization in operator models: A new insight into spectral learning”. In: ICML. 2012.
[BV96] F. Bergadano and S. Varricchio. “Learning behaviors of automata from multiplicity and equivalence queries”. In: SIAM Journal on Computing (1996).
[Fli74] M. Fliess. “Matrices de Hankel”. In: Journal de Mathématiques Pures et Appliquées (1974).
[HKZ09] D. Hsu, S. M. Kakade, and T. Zhang. “A spectral algorithm for learning hidden Markov models”. In: COLT. 2009.

SLIDE 24

Bibliography IV

[Jae00] H. Jaeger. “Observable operator models for discrete stochastic time series”. In: Neural Computation (2000).
[LBP16] L. Langer, B. Balle, and D. Precup. “Learning Multi-Step Predictive State Representations”. In: IJCAI. 2016.
[LSS02] M. Littman, R. S. Sutton, and S. Singh. “Predictive representations of state”. In: NIPS. 2002.
[Luq+12] F. M. Luque, A. Quattoni, B. Balle, and X. Carreras. “Spectral learning in non-deterministic dependency parsing”. In: EACL. 2012.
[Qua+14] A. Quattoni, B. Balle, X. Carreras, and A. Globerson. “Spectral Regularization for Max-Margin Sequence Tagging”. In: ICML. 2014.
[RBC16] G. Rabusseau, B. Balle, and S. B. Cohen. “Low-Rank Approximation of Weighted Tree Automata”. In: AISTATS. 2016.
[RBP17] G. Rabusseau, B. Balle, and J. Pineau. “Multitask Spectral Learning of Weighted Automata”. In: NIPS. 2017.

SLIDE 25

Bibliography V

[TJ15] M. R. Thon and H. Jaeger. “Links between multiplicity automata, observable operator models and predictive state representations: a unified learning framework”. In: Journal of Machine Learning Research (2015).

SLIDE 26

Learning Automata with Hankel Matrices

Borja Balle

Amazon Research Cambridge

Highlights — London, September 2017