Spectral Learning Techniques for Weighted Automata


Spectral Learning Techniques for Weighted Automata, Transducers, and Grammars. Borja Balle (McGill University), Ariadna Quattoni and Xavier Carreras (Xerox Research Centre Europe). Tutorial @ EMNLP 2014.


1. Applications of WFA

WFA can model:
- Probability distributions: f_A(x) = P[x]
- Binary classifiers: g(x) = sign(f_A(x) + θ)
- Real-valued predictors: f_A(x)
- Sequence predictors: g(x) = argmax_y f_A(x, y) (with Σ = X × Y)

Used in several applications:
- Speech recognition [Mohri et al., 2008]
- Machine translation [de Gispert et al., 2010]
- Image processing [Albert and Kari, 2009]
- OCR systems [Knight and May, 2009]
- System testing [Baier et al., 2009]

2–5. Useful Intuitions About f_A

f_A(x) = f_A(x_1 ... x_T) = α_0^⊤ A_{x_1} ··· A_{x_T} α_∞ = α_0^⊤ A_x α_∞

- Sum-Product: f_A(x) is a sum-product computation
      f_A(x) = Σ_{i_0, i_1, ..., i_T ∈ [n]} α_0(i_0) ( ∏_{t=1}^T A_{x_t}(i_{t-1}, i_t) ) α_∞(i_T)
- Forward-Backward: f_A(x) is the dot product of a forward and a backward vector; for any split x = p·s,
      f_A(ps) = (α_0^⊤ A_p) · (A_s α_∞) = α_p · β_s
- Compositional Features: f_A(x) is a linear model
      f_A(x) = (α_0^⊤ A_x) · α_∞ = φ(x) · α_∞
  where φ : Σ* → R^n computes compositional features (i.e. φ(xσ) = φ(x) A_σ)
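To make the "compositional features" view concrete, here is a minimal numpy sketch that evaluates f_A(x) by propagating the forward vector φ; the 2-state WFA below is a made-up illustration, not a model from the tutorial.

```python
import numpy as np

# Hypothetical 2-state WFA over Sigma = {"a", "b"}: alpha_0, alpha_inf in R^2
# and one transition operator A_sigma in R^{2x2} per symbol.
alpha0 = np.array([1.0, 0.0])
alpha_inf = np.array([0.2, 0.5])
A = {"a": np.array([[0.5, 0.1], [0.0, 0.4]]),
     "b": np.array([[0.2, 0.3], [0.3, 0.1]])}

def f_A(x):
    """f_A(x) = alpha_0^T A_{x_1} ... A_{x_T} alpha_inf."""
    phi = alpha0                      # phi(lambda) = alpha_0^T
    for sigma in x:
        phi = phi @ A[sigma]          # phi(x sigma) = phi(x) A_sigma
    return float(phi @ alpha_inf)     # f_A(x) = phi(x) . alpha_inf

print(f_A("abba"))
```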

6–7. Forward–Backward Equations for A_σ

Any WFA A defines forward and backward maps α_A, β_A : Σ* → R^n such that for any splitting x = p·s one has
    f_A(x) = (α_0^⊤ A_{p_1} ··· A_{p_T}) (A_{s_1} ··· A_{s_{T'}} α_∞) = α_A(p) · β_A(s)

Example: in an HMM the coordinates of α_A and β_A have a probabilistic interpretation:
    [α_A(p)]_i = P[p, h_{|p|+1} = i]        [β_A(s)]_i = P[s | h = i]

8–9. Forward–Backward Equations for A_σ (continued)

Key Observation: comparing f_A(ps) and f_A(pσs) reveals information about A_σ:
    f_A(ps)  = α_A(p) · β_A(s)
    f_A(pσs) = α_A(p)^⊤ A_σ β_A(s)

Hankel matrices help organize and solve these equations!

10–12. The Hankel Matrix

Two equivalent representations:
- Functional: f : Σ* → R
- Matricial: H_f ∈ R^{Σ* × Σ*}, the Hankel matrix of f

Definition: for p a prefix and s a suffix, H_f(p, s) = f(p·s)

Example: f(x) = |x|_a (the number of a's in x), with rows indexed by prefixes and columns by suffixes:

           λ  a  b  aa  ···
    λ    [ 0  1  0  2       ]
    a    [ 1  2  1  3       ]
    b    [ 0  1  0  2       ]
    aa   [ 2  3  2  4       ]
    ···

    H_f(λ, aa) = H_f(a, a) = H_f(aa, λ) = f(aa) = 2

13–14. The Hankel Matrix: Properties

- f(x) appears in |x| + 1 entries of H_f (one per split of x into a prefix and a suffix)
- H_f depends on the chosen ordering of Σ*
- H_f captures the structure of f
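A small Python/numpy sketch that builds a finite sub-block of the Hankel matrix for the example f(x) = |x|_a above and checks its rank (the choice of prefixes and suffixes is illustrative):

```python
import numpy as np

def f(x):
    return x.count("a")               # f(x) = |x|_a

prefixes = ["", "a", "b", "aa"]       # "" plays the role of lambda
suffixes = ["", "a", "b", "aa"]

H = np.array([[f(p + s) for s in suffixes] for p in prefixes], dtype=float)
print(H)
# H[0, 3], H[1, 1] and H[3, 0] all equal f("aa") = 2: the value f(x) shows up
# in |x| + 1 entries, one per prefix/suffix split of x.
print(np.linalg.matrix_rank(H))       # 2, consistent with a 2-state WFA for f
```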

15. A Fundamental Theorem about WFA

It relates the rank of H_f to the number of states of a WFA computing f.

16–19. A Fundamental Theorem about WFA

Theorem [Carlyle and Paz, 1971; Fliess, 1974]: let f : Σ* → R be any function.
1. If f = f_A for some WFA A with n states, then rank(H_f) ≤ n
2. If rank(H_f) = n, then there exists a WFA A with n states such that f = f_A

Why fundamental? Because the proof of (2) gives an algorithm for recovering A from the Hankel matrix of f_A.

Example: one can recover an HMM from the probabilities it assigns to sequences of observations.

20–21. Structure of Low-rank Hankel Matrices

A rank-n Hankel matrix H_{f_A} ∈ R^{Σ* × Σ*} factors as H_{f_A} = P S with P ∈ R^{Σ* × n} and S ∈ R^{n × Σ*}, because
    f_A(p_1 ··· p_T · s_1 ··· s_{T'}) = (α_0^⊤ A_{p_1} ··· A_{p_T}) (A_{s_1} ··· A_{s_{T'}} α_∞) = α_A(p) · β_A(s)

So the rows of P are the forward vectors and the columns of S are the backward vectors:
    α_A(p) = P(p, ·)        β_A(s) = S(·, s)

22–25. Hankel Factorizations and Operators

For each symbol σ define the shifted Hankel matrix H_σ ∈ R^{Σ* × Σ*} with H_σ(p, s) = f_A(p·σ·s). It factors through the same P ∈ R^{Σ* × n} and S ∈ R^{n × Σ*}, with the operator A_σ ∈ R^{n × n} in the middle:
    f_A(p_1 ··· p_T · σ · s_1 ··· s_{T'}) = (α_0^⊤ A_{p_1} ··· A_{p_T}) A_σ (A_{s_1} ··· A_{s_{T'}} α_∞) = α_A(p)^⊤ A_σ β_A(s)

Hence
    H = P S   and   H_σ = P A_σ S   ⇒   A_σ = P^+ H_σ S^+

Note: this also works with finite sub-blocks of the Hankel matrices (assuming rank(P) = rank(S) = n).
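As a sanity check of A_σ = P^+ H_σ S^+, here is a numpy sketch on the same hypothetical 2-state WFA used in the earlier snippet; the Hankel blocks are exact, so the recovered operator matches A_a up to a change of basis:

```python
import numpy as np

alpha0 = np.array([1.0, 0.0]); alpha_inf = np.array([0.2, 0.5])
A = {"a": np.array([[0.5, 0.1], [0.0, 0.4]]),
     "b": np.array([[0.2, 0.3], [0.3, 0.1]])}

def f_A(x):
    v = alpha0
    for sigma in x:
        v = v @ A[sigma]
    return float(v @ alpha_inf)

prefixes = ["", "a", "b"]; suffixes = ["", "a", "b"]
H   = np.array([[f_A(p + s)       for s in suffixes] for p in prefixes])
H_a = np.array([[f_A(p + "a" + s) for s in suffixes] for p in prefixes])

# Any rank-revealing factorization H = P S will do; here P, S come from the SVD.
n = 2
U, svals, Vt = np.linalg.svd(H)
P = U[:, :n] * svals[:n]              # P in R^{|prefixes| x n}
S = Vt[:n, :]                         # S in R^{n x |suffixes|}

A_a_hat = np.linalg.pinv(P) @ H_a @ np.linalg.pinv(S)
# A_a_hat = Q^{-1} A_a Q for some invertible Q: same function, different basis.
```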

26–27. General Learning Algorithm for WFA

Pipeline: data → Hankel matrix estimation → low-rank matrix → factorization and linear algebra → WFA

Key Idea (the Hankel trick):
1. Learn a low-rank Hankel matrix that implicitly induces "latent" states
2. Recover the states from a decomposition of the Hankel matrix

28–30. Limitations of WFA

Invariance under change of basis: for any invertible matrix Q the following WFA are equivalent:
- A = ⟨α_0, α_∞, {A_σ}⟩
- B = ⟨Q^⊤ α_0, Q^{-1} α_∞, {Q^{-1} A_σ Q}⟩
since
    f_A(x) = α_0^⊤ A_{x_1} ··· A_{x_T} α_∞ = (α_0^⊤ Q)(Q^{-1} A_{x_1} Q) ··· (Q^{-1} A_{x_T} Q)(Q^{-1} α_∞) = f_B(x)

Example:
    A_a = [ 0.5  0.1 ]      Q = [  0  1 ]      Q^{-1} A_a Q = [  0.3  -0.2 ]
          [ 0.2  0.3 ]          [ -1  0 ]                     [ -0.1   0.5 ]

Consequences:
- There is no unique parametrization for WFA
- Given A, it is undecidable whether f_A(x) ≥ 0 for all x
- One cannot expect to recover a probabilistic parametrization
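A short numpy check of this invariance, using the matrices from the example plus made-up initial and final weights:

```python
import numpy as np

A_a = np.array([[0.5, 0.1], [0.2, 0.3]])
Q = np.array([[0.0, 1.0], [-1.0, 0.0]])
Qinv = np.linalg.inv(Q)

print(Qinv @ A_a @ Q)                 # [[0.3, -0.2], [-0.1, 0.5]], as on the slide

# Hypothetical initial/final weights; both automata assign the same value to "aaa".
alpha0 = np.array([1.0, 0.0]); alpha_inf = np.array([0.2, 0.5])
fA = alpha0 @ A_a @ A_a @ A_a @ alpha_inf
fB = (alpha0 @ Q) @ (Qinv @ A_a @ Q) @ (Qinv @ A_a @ Q) @ (Qinv @ A_a @ Q) @ (Qinv @ alpha_inf)
print(np.isclose(fA, fB))             # True
```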

31. Outline

1. Weighted Automata and Hankel Matrices
2. Spectral Learning of Probabilistic Automata
3. Spectral Methods for Transducers and Grammars (Sequence Tagging, Finite-State Transductions, Tree Automata)
4. Hankel Matrices with Missing Entries
5. Conclusion
6. References

32. Spectral Learning of Probabilistic Automata

Pipeline: data → Hankel matrix estimation → low-rank matrix → factorization and linear algebra → WFA

Basic setup:
- Data are strings sampled from a probability distribution on Σ*
- The Hankel matrix is estimated by empirical probabilities
- Factorization and low-rank approximation are computed using SVD

33–35. The Empirical Hankel Matrix

Suppose S = (x_1, ..., x_N) is a sample of N i.i.d. strings.
    Empirical distribution:     f̂_S(x) = (1/N) Σ_{i=1}^N I[x_i = x]
    Empirical Hankel matrix:    Ĥ_S(p, s) = f̂_S(ps)

Example: for the sample
    S = {aa, b, bab, a, b, a, ab, aa, ba, b, aa, a, aa, bab, b, aa}   (N = 16)
we get f̂_S(aa) = 5/16 ≈ 0.31, and with rows P = {λ, a, b, ba} and columns S = {a, b}:

            a     b
    λ    [ .19   .25 ]
    a    [ .31   .06 ]
    b    [ .06   .00 ]
    ba   [ .00   .13 ]
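A sketch of this estimation step in Python/numpy; the sample is the one from the example above:

```python
import numpy as np
from collections import Counter

sample = ["aa", "b", "bab", "a", "b", "a", "ab", "aa",
          "ba", "b", "aa", "a", "aa", "bab", "b", "aa"]
N = len(sample)
counts = Counter(sample)

def f_hat(x):
    """Empirical string probability f_hat_S(x)."""
    return counts[x] / N

prefixes = ["", "a", "b", "ba"]       # "" is lambda
suffixes = ["a", "b"]
H_hat = np.array([[f_hat(p + s) for s in suffixes] for p in prefixes])
print(np.round(H_hat, 2))             # matches the matrix on the slide
```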

36. Finite Sub-blocks of Hankel Matrices

Parameters:
- Set of rows (prefixes) P ⊂ Σ*
- Set of columns (suffixes) S ⊂ Σ*

The slide shows the required blocks inside one big table indexed by prefixes and suffixes λ, a, b, aa, ab, ...:

            λ      a      b      aa     ab     ···
    λ    [ 1      0.3    0.7    0.05   0.25        ]
    a    [ 0.3    0.05   0.25   0.02   0.03        ]
    b    [ 0.7    0.6    0.1    0.03   0.2         ]
    aa   [ 0.05   0.02   0.03   0.017  0.003       ]
    ab   [ 0.25   0.23   0.02   0.11   0.12        ]
    ···

The algorithm needs four such blocks:
- H ∈ R^{P × S}, for finding the factors P and S
- H_σ ∈ R^{P × S}, for finding A_σ
- h_{λ,S} ∈ R^{1 × S}, for finding α_0
- h_{P,λ} ∈ R^{P × 1}, for finding α_∞

37. Low-rank Approximation and Factorization

We will use the singular value decomposition (SVD) as the main building block; hence the name spectral!

38–40. Low-rank Approximation and Factorization

Parameters:
- Desired number of states n
- Block H ∈ R^{P × S} of the empirical Hankel matrix

Low-rank approximation: compute the truncated SVD of rank n,
    H ≈ U_n Λ_n V_n^⊤     (H: P × S,  U_n: P × n,  Λ_n: n × n,  V_n^⊤: n × S)

Factorization: H ≈ P S is given by the SVD, and pseudo-inverses are easy:
    P = U_n Λ_n  ⇒  P^+ = Λ_n^{-1} U_n^⊤ = (H V_n)^+
    S = V_n^⊤    ⇒  S^+ = V_n
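A minimal numpy sketch of the truncated-SVD factorization step; the function name and interface are illustrative, not taken from any particular toolkit:

```python
import numpy as np

def truncated_svd_factorization(H_hat, n):
    """Return P, S with H_hat ~= P S, plus their pseudo-inverses."""
    U, svals, Vt = np.linalg.svd(H_hat, full_matrices=False)
    U_n, s_n, Vt_n = U[:, :n], svals[:n], Vt[:n, :]
    P = U_n * s_n                     # P = U_n Lambda_n
    S = Vt_n                          # S = V_n^T
    P_pinv = (U_n / s_n).T            # P^+ = Lambda_n^{-1} U_n^T
    S_pinv = Vt_n.T                   # S^+ = V_n
    return P, S, P_pinv, S_pinv
```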

41–43. Computing the WFA

Parameters:
- Factorization H ≈ (U Λ) V^⊤ = P S
- Hankel blocks H_σ, h_{λ,S}, h_{P,λ}

Equations:
    A_σ   = P^+ H_σ S^+  = Λ^{-1} U^⊤ H_σ V  = (H V)^+ H_σ V
    α_0^⊤ = h_{λ,S} S^+  = h_{λ,S} V
    α_∞   = P^+ h_{P,λ}  = Λ^{-1} U^⊤ h_{P,λ} = (H V)^+ h_{P,λ}

Full algorithm:
1. Estimate the empirical Hankel matrix and retrieve the sub-blocks H, H_σ, h_{λ,S}, h_{P,λ}
2. Perform the SVD of H
3. Solve for A_σ, α_0, α_∞ with pseudo-inverses
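Putting the pieces together, a compact sketch of the whole algorithm (plain numpy; the function name, sample, basis and n are illustrative, and no smoothing or careful basis selection is done):

```python
import numpy as np
from collections import Counter

def spectral_wfa(sample, prefixes, suffixes, alphabet, n):
    N = len(sample)
    counts = Counter(sample)
    f = lambda x: counts[x] / N                       # empirical probabilities

    # 1. Empirical Hankel sub-blocks
    H     = np.array([[f(p + s) for s in suffixes] for p in prefixes])
    H_sig = {a: np.array([[f(p + a + s) for s in suffixes] for p in prefixes])
             for a in alphabet}
    h_lam  = np.array([f(s) for s in suffixes])       # h_{lambda,S}
    h_Plam = np.array([f(p) for p in prefixes])       # h_{P,lambda}

    # 2. SVD of H
    _, _, Vt = np.linalg.svd(H, full_matrices=False)
    V = Vt[:n, :].T                                   # S^+ = V
    HV_pinv = np.linalg.pinv(H @ V)                   # P^+ = (H V)^+

    # 3. Solve with pseudo-inverses
    A_ops  = {a: HV_pinv @ H_sig[a] @ V for a in alphabet}   # A_sigma = (HV)^+ H_sigma V
    alpha0 = h_lam @ V                                        # alpha_0^T = h_{lambda,S} V
    alphaI = HV_pinv @ h_Plam                                 # alpha_inf = (HV)^+ h_{P,lambda}
    return alpha0, alphaI, A_ops

sample = ["aa", "b", "bab", "a", "b", "a", "ab", "aa",
          "ba", "b", "aa", "a", "aa", "bab", "b", "aa"]
alpha0, alpha_inf, A_ops = spectral_wfa(sample, ["", "a", "b", "ba"],
                                        ["", "a", "b"], "ab", n=2)
```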

44. Computational and Statistical Complexity

Running time:
- Empirical Hankel matrix: O(|P S| · N)
- SVD and linear algebra: O(|P| · |S| · n)

Statistical consistency:
- By the law of large numbers, Ĥ_S → E[H] as N → ∞
- If E[H] is the Hankel matrix of some WFA A, then Â → A
- Works for data coming from PFA and HMM

PAC analysis (assuming data from A with n states):
- With high probability, ‖Ĥ_S − H‖ ≤ O(1/√N)
- When N ≥ O(n |Σ|² T⁴ / (ε² s_n(H)⁴)), then Σ_{|x| ≤ T} |f_A(x) − f_Â(x)| ≤ ε

Proofs can be found in [Hsu et al., 2009, Bailly, 2011, Balle, 2013]

45. Practical Considerations

Pipeline: data → Hankel matrix estimation → low-rank matrix → factorization and linear algebra → WFA

Basic setup:
- Data are strings sampled from a probability distribution on Σ*
- The Hankel matrix is estimated by empirical probabilities
- Factorization and low-rank approximation are computed using SVD

Advanced implementations:
- Choice of the parameters P and S
- Scalable estimation and factorization of Hankel matrices
- Smoothing and variance normalization
- Use of prefix and substring statistics

46–47. Choosing the Basis

Definition: the pair (P, S) defining the sub-block is called a basis.

Intuitions:
- The basis should be chosen so that E[H] has full rank
- P must contain strings reaching each possible state of the WFA
- S must contain strings producing different outcomes for each pair of states of the WFA

Popular approaches:
- Set P = S = Σ^{≤k} for some k ≥ 1 [Hsu et al., 2009]
- Choose P and S to contain the K most frequent prefixes and suffixes in the sample [Balle et al., 2012]
- Take all prefixes and suffixes appearing in the sample [Bailly et al., 2009]
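A sketch of the second heuristic above (K most frequent prefixes and suffixes); counting every prefix and suffix of every sampled string is one straightforward reading of it:

```python
from collections import Counter

def frequent_basis(sample, K):
    pref, suff = Counter(), Counter()
    for x in sample:
        for i in range(len(x) + 1):
            pref[x[:i]] += 1          # all prefixes of x, including "" and x itself
            suff[x[i:]] += 1          # all suffixes of x
    P = [p for p, _ in pref.most_common(K)]
    S = [s for s, _ in suff.most_common(K)]
    return P, S

P, S = frequent_basis(["aa", "bab", "ab", "aa"], K=5)
```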

48. Scalable Implementations

Problem: when |Σ| is large, even the simplest bases become huge.

Hankel matrix representation:
- Use hash functions to map P (resp. S) to row (resp. column) indices
- Use sparse matrix data structures, since the statistics are usually sparse
- Never store the full Hankel matrix in memory

Efficient SVD computation:
- SVD for sparse matrices [Berry, 1992]
- Approximate randomized SVD [Halko et al., 2011]
- On-line SVD with rank-1 updates [Brand, 2006]
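A sketch of the sparse route, assuming SciPy is available; for a realistic alphabet one would fill the matrix from the observed strings rather than looping over all (p, s) pairs as done here:

```python
from scipy.sparse import dok_matrix
from scipy.sparse.linalg import svds

def sparse_hankel(stats, prefixes, suffixes):
    """stats: dict string -> empirical value, storing only the nonzeros."""
    row = {p: i for i, p in enumerate(prefixes)}
    col = {s: j for j, s in enumerate(suffixes)}
    H = dok_matrix((len(prefixes), len(suffixes)))
    for p in prefixes:
        for s in suffixes:
            v = stats.get(p + s, 0.0)
            if v:
                H[row[p], col[s]] = v
    return H.tocsr()

stats = {"aa": 0.31, "a": 0.19, "b": 0.25, "ab": 0.06, "ba": 0.06, "bab": 0.13}
H = sparse_hankel(stats, ["", "a", "b", "ba"], ["", "a", "b"])
U, svals, Vt = svds(H, k=2)           # truncated SVD without densifying H
```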

49–50. Refining the Statistics in the Hankel Matrix

Smoothing the estimates:
- Empirical probabilities f̂_S(x) tend to be sparse
- As in n-gram models, smoothing can help when Σ is large
- It should take into account that the strings in P·S have different lengths
- Open problem: how to smooth empirical Hankel matrices properly

Row and column weighting:
- More frequent prefixes (suffixes) have better-estimated rows (columns)
- Rows and columns can be scaled to reflect that
- This leads to more reliable SVD decompositions
- See [Cohen et al., 2013] for details

51–53. Substring Statistics

Problem: if the sample contains strings with a wide range of lengths, a small basis will ignore most of the examples.

String statistics (occurrence probability): for the sample
    S = {aa, b, bab, a, bbab, abb, babba, abbb, ab, a, aabba, baa, abbab, baba, bb, a}
with rows {λ, a, b, ba} and columns {a, b}, the Hankel matrix of string probabilities is

            a     b
    λ    [ .19   .06 ]
    a    [ .06   .06 ]
    b    [ .00   .06 ]
    ba   [ .06   .06 ]

Substring statistics (expected number of occurrences as a substring):
    f̂_S(x) = (1/N) Σ_{i=1}^N [number of occurrences of x in x_i]     (empirical expectation)
For the same sample:

            a      b
    λ    [ 1.31   1.56 ]
    a    [ .19    .62  ]
    b    [ .56    .50  ]
    ba   [ .06    .31  ]
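A sketch of the substring-statistic estimator in plain Python; the convention for the empty string is an assumption, and the printed values match the example above:

```python
def substring_stat(sample, x):
    """Empirical expected number of occurrences of x as a contiguous substring."""
    N = len(sample)
    if x == "":
        return sum(len(w) + 1 for w in sample) / N   # one convention for lambda
    total = sum(sum(1 for i in range(len(w) - len(x) + 1) if w[i:i + len(x)] == x)
                for w in sample)
    return total / N

sample = ["aa", "b", "bab", "a", "bbab", "abb", "babba", "abbb",
          "ab", "a", "aabba", "baa", "abbab", "baba", "bb", "a"]
print(round(substring_stat(sample, "a"), 2))          # 1.31, as in the table
print(round(substring_stat(sample, "ab"), 2))         # 0.62
```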

54. Substring Statistics

Theorem [Balle et al., 2014]: if a probability distribution f is computed by a WFA with n states, then the corresponding substring statistics are also computed by a WFA with n states.

Learning from substring statistics:
- One can work with smaller Hankel matrices
- But estimating the matrix takes longer

55. Experiment: PoS-tag Sequence Models

[Plot: word error rate (%) vs. number of states (0–50) for the spectral method with the Σ basis and with bases of k = 25, 50, 100, 300, 500 frequent substrings, against unigram and bigram baselines.]

- PTB sequences of simplified PoS tags [Petrov et al., 2012]
- Configuration: expectations on frequent substrings
- Metric: error rate on predicting the next symbol in test sequences

56. Experiment: PoS-tag Sequence Models

[Plot: word error rate (%) vs. number of states for the spectral method (Σ basis and k = 500 basis) compared with EM, unigram, and bigram baselines.]

- Comparison with a bigram baseline and with EM
- Metric: error rate on predicting the next symbol in test sequences
- At training time, the spectral method is about 100× faster than EM

57. Outline

1. Weighted Automata and Hankel Matrices
2. Spectral Learning of Probabilistic Automata
3. Spectral Methods for Transducers and Grammars (Sequence Tagging, Finite-State Transductions, Tree Automata)
4. Hankel Matrices with Missing Entries
5. Conclusion
6. References

58. Sequence Tagging and Transduction

- Many applications involve pairs of input-output sequences:
  - Sequence tagging (one output tag per input token), e.g. part-of-speech tagging:
        input:  Ms. Haag plays Elianti .
        output: NNP NNP VBZ NNP .
  - Transductions (sequence lengths might differ), e.g. spelling correction:
        input:  a p l e
        output: a p p l e
- Finite-state automata are classic tools for modelling these relations, and spectral methods apply naturally to this setting.

59. Sequence Tagging

- Notation: input alphabet X, output alphabet Y, joint alphabet Σ = X × Y
- Goal: map input sequences to output sequences of the same length
- Approach: learn a function f : (X × Y)* → R; then, given an input x ∈ X^T, return
      argmax_{y ∈ Y^T} f(x, y)
  (note: this maximization is not tractable in general)

60–62. Weighted Finite Tagger

Notation:
- X × Y: joint alphabet (a finite set)
- n: number of states (a positive integer)
- α_0: initial weights, a vector in R^n (features of the empty prefix)
- α_∞: final weights, a vector in R^n (features of the empty suffix)
- A^b_a: transition weights, a matrix in R^{n × n} (for every a ∈ X, b ∈ Y)

Definition: a WFTagger with n states over X × Y is A = ⟨α_0, α_∞, {A^b_a}⟩

Compositional function: every WFTagger defines a function f_A : (X × Y)* → R,
    f_A(x_1 ... x_T, y_1 ... y_T) = α_0^⊤ A^{y_1}_{x_1} ··· A^{y_T}_{x_T} α_∞ = α_0^⊤ A^y_x α_∞

63. The Spectral Method for WFTaggers

Pipeline: data → Hankel matrix estimation → low-rank matrix → factorization and linear algebra → WFA

- Assume f(x, y) = P(x, y)
- Same mechanics as for WFA, with Σ = X × Y
- In a nutshell:
  1. Choose a set of prefixes and suffixes to define the Hankel matrix (here they are bistrings)
  2. Estimate the Hankel matrix with prefix-suffix training statistics
  3. Factorize the Hankel matrix using SVD
  4. Compute the α and β projections, and compute the operators ⟨α_0, α_∞, {A_σ}⟩
- Other cases:
  - f_A(x, y) = P(y | x): see [Balle et al., 2011]
  - f_A(x, y) non-probabilistic: see [Quattoni et al., 2014]

64–65. Prediction with WFTaggers

- Assume f_A(x, y) = P(x, y)
- Given x_{1:T}, compute the most likely output tag at position t: argmax_{a ∈ Y} µ(t, a), where
      µ(t, a) ≜ P(y_t = a | x) ∝ Σ_{y = y_1 ... a ... y_T} P(x, y)
              ∝ Σ_{y = y_1 ... a ... y_T} α_0^⊤ A^y_x α_∞
              ∝ ( α_0^⊤ Σ_{y_1 ... y_{t-1}} A^{y_{1:t-1}}_{x_{1:t-1}} ) A^a_{x_t} ( Σ_{y_{t+1} ... y_T} A^{y_{t+1:T}}_{x_{t+1:T}} α_∞ )
              = α*_A(x_{1:t-1}) A^a_{x_t} β*_A(x_{t+1:T})
- The aggregated forward and backward vectors α*_A and β*_A satisfy the recursions (see the sketch below)
      α*_A(x_{1:t}) = α*_A(x_{1:t-1}) Σ_{b ∈ Y} A^b_{x_t}
      β*_A(x_{t:T}) = Σ_{b ∈ Y} A^b_{x_t} β*_A(x_{t+1:T})
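A sketch of these recursions in numpy; the operators A[x][y], the toy alphabets and the random parameters are placeholders, and the returned scores are only proportional to probabilities unless the learned model is properly normalized:

```python
import numpy as np

def tag_marginals(x, alpha0, alpha_inf, A, Y):
    """mu[t, a] proportional to P(y_t = a | x) for a WFTagger."""
    T = len(x)
    fwd = [alpha0]                                    # alpha*(x_{1:t}), t = 0..T
    for t in range(T):
        fwd.append(fwd[-1] @ sum(A[x[t]][b] for b in Y))
    bwd = [None] * (T + 1)                            # bwd[t] = beta*(x_{t+1:T})
    bwd[T] = alpha_inf
    for t in range(T - 1, -1, -1):
        bwd[t] = sum(A[x[t]][b] for b in Y) @ bwd[t + 1]
    mu = np.array([[fwd[t] @ A[x[t]][a] @ bwd[t + 1] for a in Y] for t in range(T)])
    return mu / mu.sum(axis=1, keepdims=True)         # normalize over tags

# Toy usage with random positive operators, X = {"w1", "w2"}, Y = {"N", "V"}.
rng = np.random.default_rng(0)
A = {w: {y: rng.random((2, 2)) * 0.3 for y in "NV"} for w in ["w1", "w2"]}
mu = tag_marginals(["w1", "w2", "w1"], np.array([1.0, 0.0]),
                   np.array([0.5, 0.5]), A, "NV")
```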

66. Prediction with WFTaggers (II)

- Assume f_A(x, y) = P(x, y)
- Given x_{1:T}, compute the most likely output bigram ab at position t: argmax_{a,b ∈ Y} µ(t, a, b), where
      µ(t, a, b) = P(y_t = a, y_{t+1} = b | x) ∝ α*_A(x_{1:t-1}) A^a_{x_t} A^b_{x_{t+1}} β*_A(x_{t+2:T})
- Computing the most likely full sequence y is intractable; in practice, use Minimum Bayes-Risk decoding:
      argmax_{y ∈ Y^T} Σ_t µ(t, y_t, y_{t+1})

67. Finite State Transducers

[Diagram: a transducer reading the aligned pair (ab, cde) along arcs labelled a-c, ǫ-d, b-e.]

- A WFTransducer evaluates aligned strings, using the empty symbol ǫ to produce one-to-one alignments:
      f( a:c ǫ:d b:e ) = α_0^⊤ A^c_a A^d_ǫ A^e_b α_∞
- A function g on unaligned strings can then be defined by aggregating over alignments:
      g(ab, cde) = Σ_{π ∈ Π(ab, cde)} f(π)

68–70. Finite State Transducers: Main Problems

Prediction: given an FST A, how to...
- Compute g(x, y) for unaligned strings? → using edit-distance recursions
- Compute marginal quantities µ(edge) = P(edge | x)? → also using edit-distance recursions
- Compute the most likely y for a given x? → use MBR decoding with marginal scores

Unsupervised learning: learn an FST from pairs of unaligned strings
- Unlike EM, the spectral method cannot recover latent structure such as alignments (recall: alignments are needed to estimate the Hankel entries)
- See [Bailly et al., 2013b] for a solution based on Hankel matrix completion

71. Spectral Learning of Tree Automata and Grammars

[Example parse tree: S → NP VP for the sentence "Mary plays the guitar".]

Some references:
- Tree series: [Bailly et al., 2010]
- Latent-annotated PCFG: [Cohen et al., 2012, Cohen et al., 2013]
- Dependency parsing: [Luque et al., 2012, Dhillon et al., 2012]
- Unsupervised learning of WCFG: [Bailly et al., 2013a, Parikh et al., 2014]
- Synchronous grammars: [Saluja et al., 2014]

72–74. Compositional Functions over Trees

A weighted tree automaton computes a function f over trees compositionally:
- The inside vector of a subtree with root label σ and children t_1, t_2 is built bottom-up with a tensor product:
      β_A(σ(t_1, t_2)) = A_σ ( β_A(t_1) ⊗ β_A(t_2) )
- The outside vector α_A summarizes the part of the tree above a node
- The value of the whole tree is the inner product of the two; e.g., splitting the example tree at a node labelled a with children b and c gives
      f(t) = α_A(t_o)^⊤ A_a ( β_A(b) ⊗ β_A(c) )
(see the sketch below)
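A sketch of the bottom-up (inside) computation for a binary weighted tree automaton; the tree representation, the random parameters and the n × n² operator shape are illustrative assumptions:

```python
import numpy as np

n = 2
rng = np.random.default_rng(0)
alpha = rng.random(n)                                   # outside vector at the root
leaf_beta = {"b": rng.random(n), "c": rng.random(n)}    # inside vectors of leaf labels
A = {lbl: rng.random((n, n * n)) for lbl in ["a", "c"]} # one n x n^2 operator per internal label

def inside(t):
    """beta_A(t): trees are either leaf labels (str) or (label, left, right) tuples."""
    if isinstance(t, str):
        return leaf_beta[t]
    label, left, right = t
    # A_label applied to the tensor (Kronecker) product of the children's inside vectors
    return A[label] @ np.kron(inside(left), inside(right))

tree = ("a", "b", ("c", "b", "b"))                      # a(b, c(b, b))
print(float(alpha @ inside(tree)))                      # f(tree) = alpha^T beta_A(tree)
```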

75. Inside-Outside Composition of Trees

[Diagram: a tree t split into an outside tree t_o, a context with a hole, and an inside tree t_i, so that t = t_o ∘ t_i.]

Note: inside-outside composition generalizes the notion of concatenation in strings, i.e., outside trees play the role of prefixes and inside trees the role of suffixes.
