SLIDE 1

Spectral Learning Techniques for Weighted Automata, Transducers, and Grammars

Borja Balle (♦)    Ariadna Quattoni (♥)    Xavier Carreras (♥)

(♦) McGill University    (♥) Xerox Research Centre Europe

TUTORIAL @ EMNLP 2014

SLIDE 2

Status Quo

• Composable/composite objects (strings and trees) are ubiquitous in NLP

• Latent variables provide powerful mechanisms for learning the relevant information needed to solve tasks from composable data

• Classical learning paradigms are Expectation–Maximization and Split–Merge

SLIDE 3

An Alternative Approach

Spectral Methods in General...

• Provide tools for learning latent variable models with strong algorithmic and statistical guarantees

• Facilitate the connection between latent variable models and the (multi-)linear algebra commonly used in machine learning

• In practice are faster than iterative methods, and not prone to local minima

• Implementations can readily benefit from the latest developments in numerical linear algebra in a black-box fashion

This Tutorial in Particular...

• Emphasizes the relation between spectral methods and the recursive computations performed by classical weighted automata and grammars

• Shows how the language of Hankel matrices seamlessly applies to string and tree computations

SLIDE 4

Outline

  • 1. Weighted Automata and Hankel Matrices
  • 2. Spectral Learning of Probabilistic Automata
  • 3. Spectral Methods for Transducers and Grammars
      – Sequence Tagging
      – Finite-State Transductions
      – Tree Automata
  • 4. Hankel Matrices with Missing Entries
  • 5. Conclusion
  • 6. References
SLIDE 6–9

Compositional Functions and Bilinear Operators

• Compositional functions are defined in terms of recurrence relations
• Consider the sequence abaccb:

f(abaccb) = αf(ab) · βf(accb) = αf(ab) · Aa · βf(ccb) = αf(aba) · βf(ccb)

where

• n is the dimension of the model
• αf maps prefixes to R^n
• βf maps suffixes to R^n
• Aa is a bilinear operator in R^{n×n}

Problem

How to estimate αf(λ), βf(λ) and Aa, Ab, . . . from “samples” of f?

SLIDE 10

Weighted Finite Automata (WFA)

An algebraic model for compositional functions on strings

SLIDE 11–18

Weighted Finite Automata (WFA)

Example with 2 states and alphabet Σ = {a, b}

(State diagram: states q0 and q1; arcs q0→q0 with a: 0.4, b: 0.1; q0→q1 with a: 0.2, b: 0.3; q1→q0 with a: 0.1, b: 0.1; q1→q1 with a: 0.1, b: 0.1; stopping weight 0.6 at q1.)

Operator Representation

α0 = [1.0, 0.0]    α∞ = [0.0, 0.6]

Aa = [ 0.4  0.2 ]    Ab = [ 0.1  0.3 ]
     [ 0.1  0.1 ]         [ 0.1  0.1 ]

Evaluating f on the string ab:

f(ab) = α0ᵀ Aa Ab α∞
      = [1.0  0.0] Aa Ab α∞
      = [0.4  0.2] Ab α∞
      = 0.4 × 0.3 × 0.6 + 0.2 × 0.1 × 0.6 = 0.084
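The evaluation is just a chain of vector–matrix products. A minimal numpy sketch using the parameters on the slide (the function name is ours):

```python
import numpy as np

# WFA parameters from the slide
alpha0 = np.array([1.0, 0.0])            # initial weights
alpha_inf = np.array([0.0, 0.6])         # final (stopping) weights
A = {"a": np.array([[0.4, 0.2], [0.1, 0.1]]),
     "b": np.array([[0.1, 0.3], [0.1, 0.1]])}

def wfa_eval(x):
    """Compute f(x) = alpha0^T A_x1 ... A_xT alpha_inf."""
    v = alpha0
    for sigma in x:                      # one vector-matrix product per symbol
        v = v @ A[sigma]
    return v @ alpha_inf

print(wfa_eval("ab"))                    # 0.084
```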

SLIDE 19–21

Weighted Finite Automata (WFA)

Notation:

• Σ: alphabet – finite set
• n: number of states – positive integer
• α0: initial weights – vector in R^n (features of the empty prefix)
• α∞: final weights – vector in R^n (features of the empty suffix)
• Aσ: transition weights – matrix in R^{n×n} (∀σ ∈ Σ)

Definition: a WFA with n states over Σ is a tuple A = ⟨α0, α∞, {Aσ}⟩

Compositional Function: every WFA A defines a function fA : Σ* → R

fA(x) = fA(x1 . . . xT) = α0ᵀ Ax1 · · · AxT α∞ = α0ᵀ Ax α∞

SLIDE 22

Example – Hidden Markov Model

• Assigns probabilities to strings: f(x) = P[x]
• Emission and transition are conditionally independent given the state

α0ᵀ = [0.3 0.3 0.4]    α∞ᵀ = [1 1 1]    Aa = Oa · T

with Oa = diag(0.3, 0.9, 0.5) holding the per-state emission probabilities of a, and T the 3 × 3 state-transition matrix from the diagram.

(HMM diagram: three states; each state has two outgoing transitions, with weights 0.7/0.3, 0.75/0.25, 0.6/0.4; emission distributions a: 0.5 / b: 0.5, a: 0.3 / b: 0.7, a: 0.9 / b: 0.1.)

SLIDE 23

Example – Probabilistic Tagger

• Σ = X × Y, where X is the input alphabet and Y the output alphabet
• Assigns conditional probabilities f(x, y) = P[y | x] to pairs (x, y) ∈ Σ*

X = {A, B}    Y = {a, b}    α0ᵀ = [0.3 0 0.7]    α∞ᵀ = [1 1 1]

A_B^b: the 3 × 3 operator for input B with output b (visible entries: 0.2, 0.4, 1, 0.75; the remaining entries are blank on the slide)

(Transducer diagram: three states with arc labels A/a 0.1, A/b 0.9, B/a 0.25, B/b 0.75, A/b 0.15, A/a 0.75, A/b 0.25, B/b 1, B/b 0.4, B/a 0.4, A/b 0.85, B/b 0.2, and initial weights 0.3 and 0.7.)

SLIDE 24–25

Other Examples of WFA

Automata-theoretic:

• Probabilistic Finite Automata (PFA)
• Deterministic Finite Automata (DFA)

Dynamical Systems:

• Observable Operator Models (OOM)
• Predictive State Representations (PSR)

Disclaimer: all weights are in R with the usual addition and multiplication (no semi-rings!)

SLIDE 26

Applications of WFA

WFA Can Model:

• Probability distributions fA(x) = P[x]
• Binary classifiers g(x) = sign(fA(x) + θ)
• Real predictors fA(x)
• Sequence predictors g(x) = argmax_y fA(x, y) (with Σ = X × Y)

Used In Several Applications:

• Speech recognition [Mohri et al., 2008]
• Machine translation [de Gispert et al., 2010]
• Image processing [Albert and Kari, 2009]
• OCR systems [Knight and May, 2009]
• System testing [Baier et al., 2009]

SLIDE 27–30

Useful Intuitions About fA

fA(x) = fA(x1 . . . xT) = α0ᵀ Ax1 · · · AxT α∞ = α0ᵀ Ax α∞

• Sum-Product: fA(x) is a sum–product computation

  fA(x) = Σ_{i0,i1,...,iT ∈ [n]} α0(i0) ( ∏_{t=1}^{T} Axt(i_{t−1}, i_t) ) α∞(iT)

• Forward-Backward: fA(x) is the dot product between forward and backward vectors; for x = ps,

  fA(ps) = (α0ᵀ Ap) · (As α∞) = αp · βs

• Compositional Features: fA(x) is a linear model

  fA(x) = (α0ᵀ Ax) · α∞ = φ(x) · α∞

  where φ : Σ* → R^n are compositional features (i.e. φ(xσ) = φ(x) Aσ)

SLIDE 31–34

Forward–Backward Equations for Aσ

Any WFA A defines forward and backward maps αA, βA : Σ* → R^n such that for any splitting x = p · s one has

fA(x) = (α0ᵀ Ap1 · · · ApT) · (As1 · · · AsT′ α∞) = αA(p) · βA(s)

Example: in an HMM the coordinates of αA and βA have a probabilistic interpretation:

[αA(p)]_i = P[p, h = i]    [βA(s)]_i = P[s | h = i]

(h is the hidden state at the split point)

Key Observation: comparing fA(ps) and fA(pσs) reveals information about Aσ:

fA(ps) = αA(p) · βA(s)
fA(pσs) = αA(p) · Aσ · βA(s)

Hankel matrices help organize and solve these equations!

SLIDE 35–39

The Hankel Matrix

Two Equivalent Representations

• Functional: f : Σ* → R
• Matricial: Hf ∈ R^{Σ*×Σ*}, the Hankel matrix of f

Definition: p prefix, s suffix ⇒ Hf(p, s) = f(p · s)

Example: f(x) = |x|_a (number of a’s in x)

            λ   a   b   aa  · · ·
      λ   [ 0   1   0   2       ]
      a   [ 1   2   1   3       ]
Hf =  b   [ 0   1   0   2       ]
      aa  [ 2   3   2   4       ]
      ⋮   [                 ⋱   ]

(zero entries are displayed as blanks on the original slide)

Hf(λ, aa) = Hf(a, a) = Hf(aa, λ) = 2

Properties:

• |x| + 1 entries for f(x)
• Depends on the ordering of Σ*
• Captures structure
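The Hankel block of this example is easy to build programmatically. A small numpy sketch (helper names are ours):

```python
import numpy as np

def f(x):
    """Example function from the slide: number of a's in x."""
    return x.count("a")

prefixes = ["", "a", "b", "aa"]    # rows (the empty string is λ)
suffixes = ["", "a", "b", "aa"]    # columns

# Hankel block: H[p, s] = f(p · s)
H = np.array([[f(p + s) for s in suffixes] for p in prefixes])
print(H)
# [[0 1 0 2]
#  [1 2 1 3]
#  [0 1 0 2]
#  [2 3 2 4]]
```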

SLIDE 40

A Fundamental Theorem about WFA

Relates the rank of Hf and the number of states of a WFA computing f

SLIDE 41–44

A Fundamental Theorem about WFA

Theorem [Carlyle and Paz, 1971, Fliess, 1974]: let f : Σ* → R be any function.

  • 1. If f = fA for some WFA A with n states ⇒ rank(Hf) ≤ n
  • 2. If rank(Hf) = n ⇒ there exists a WFA A with n states s.t. f = fA

Why Fundamental? Because the proof of (2) gives an algorithm for recovering A from the Hankel matrix of fA.

Example: one can recover an HMM from the probabilities it assigns to sequences of observations.
SLIDE 45–46

Structure of Low-rank Hankel Matrices

A rank-n Hankel matrix factorizes as HfA = P S, with HfA ∈ R^{Σ*×Σ*}, P ∈ R^{Σ*×n}, S ∈ R^{n×Σ*}:

fA(p1 · · · pT · s1 · · · sT′) = (α0ᵀ Ap1 · · · ApT) · (As1 · · · AsT′ α∞) = αA(p) · βA(s)

αA(p) = P(p, ·)    βA(s) = S(·, s)

(Figure: the entry HfA(p, s) is the product of row p of P and column s of S.)

SLIDE 47–50

Hankel Factorizations and Operators

For each σ ∈ Σ define the shifted block Hσ ∈ R^{Σ*×Σ*} by Hσ(p, s) = fA(p · σ · s). With P ∈ R^{Σ*×n}, Aσ ∈ R^{n×n}, S ∈ R^{n×Σ*}:

fA(p1 · · · pT · σ · s1 · · · sT′) = (α0ᵀ Ap1 · · · ApT) · Aσ · (As1 · · · AsT′ α∞) = αA(p) · Aσ · βA(s)

H = P S   ⇒   Hσ = P Aσ S   ⇒   Aσ = P⁺ Hσ S⁺

Note: this works with finite sub-blocks as well (assuming rank(P) = rank(S) = n)

SLIDE 51–52

General Learning Algorithm for WFA

Data ──(low-rank matrix estimation)──▶ Hankel matrix ──(factorization and linear algebra)──▶ WFA

Key Idea: The Hankel Trick

  • 1. Learn a low-rank Hankel matrix that implicitly induces “latent” states
  • 2. Recover the states from a decomposition of the Hankel matrix

SLIDE 53–55

Limitations of WFA

Invariance Under Change of Basis: for any invertible matrix Q the following WFA are equivalent:

• A = ⟨α0, α∞, {Aσ}⟩
• B = ⟨Qᵀα0, Q⁻¹α∞, {Q⁻¹AσQ}⟩

fA(x) = α0ᵀ Ax1 · · · AxT α∞ = (α0ᵀ Q)(Q⁻¹Ax1Q) · · · (Q⁻¹AxT Q)(Q⁻¹α∞) = fB(x)

Example:

Aa = [ 0.5  0.1 ]    Q = [  0  1 ]    Q⁻¹AaQ = [  0.3  −0.2 ]
     [ 0.2  0.3 ]        [ −1  0 ]              [ −0.1   0.5 ]

(The zero entries of Q appear blank on the original slide; this Q reproduces the printed result.)

Consequences

• There is no unique parametrization for WFA
• Given A it is undecidable whether ∀x fA(x) ≥ 0
• Cannot expect to recover a probabilistic parametrization
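A quick numeric check of the example (numpy sketch):

```python
import numpy as np

Aa = np.array([[0.5, 0.1], [0.2, 0.3]])
Q = np.array([[0.0, 1.0], [-1.0, 0.0]])       # change of basis from the slide
print(np.linalg.inv(Q) @ Aa @ Q)              # [[ 0.3 -0.2], [-0.1  0.5]]
```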

SLIDE 56

Outline

  • 1. Weighted Automata and Hankel Matrices
  • 2. Spectral Learning of Probabilistic Automata
  • 3. Spectral Methods for Transducers and Grammars
      – Sequence Tagging
      – Finite-State Transductions
      – Tree Automata
  • 4. Hankel Matrices with Missing Entries
  • 5. Conclusion
  • 6. References
SLIDE 57

Spectral Learning of Probabilistic Automata

Data ──(low-rank matrix estimation)──▶ Hankel matrix ──(factorization and linear algebra)──▶ WFA

Basic Setup:

• Data are strings sampled from a probability distribution on Σ*
• The Hankel matrix is estimated by empirical probabilities
• Factorization and low-rank approximation are computed using SVD

SLIDE 58–60

The Empirical Hankel Matrix

Suppose S = (x1, . . . , xN) is a sample of N i.i.d. strings.

Empirical distribution: f̂S(x) = (1/N) Σ_{i=1}^{N} I[xi = x]

Empirical Hankel matrix: ĤS(p, s) = f̂S(ps)

Example:

S = { aa, b, bab, a, b, a, ab, aa, ba, b, aa, a, aa, bab, b, aa }    f̂S(aa) = 5/16 ≈ 0.31

             a     b
      λ   [ .19   .25 ]
ĤS =  a   [ .31   .06 ]
      b   [ .06   .00 ]
      ba  [ .00   .13 ]

(Hankel block with rows P = {λ, a, b, ba} and columns S = {a, b})
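A sketch of how this block is estimated from a sample (names are ours):

```python
from collections import Counter

sample = ["aa", "b", "bab", "a", "b", "a", "ab", "aa",
          "ba", "b", "aa", "a", "aa", "bab", "b", "aa"]
counts = Counter(sample)
N = len(sample)

def f_hat(x):
    """Empirical probability of the string x."""
    return counts[x] / N

P, S = ["", "a", "b", "ba"], ["a", "b"]      # "" is the empty string λ
H_hat = [[f_hat(p + s) for s in S] for p in P]
# [[0.1875, 0.25], [0.3125, 0.0625], [0.0625, 0.0], [0.0, 0.125]]
```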

SLIDE 61

Finite Sub-blocks of Hankel Matrices

Parameters:

• Set of rows (prefixes) P ⊂ Σ*
• Set of columns (suffixes) S ⊂ Σ*

Example (rows λ, a, b, aa, ab; columns λ, a, b, aa, ab):

       λ      a      b      aa     ab     ...
λ   [  1      0.3    0.7    0.05   0.25       ]
a   [  0.3    0.05   0.25   0.02   0.03       ]
b   [  0.7    0.6    0.1    0.03   0.2        ]
aa  [  0.05   0.02   0.03   0.017  0.003      ]
ab  [  0.25   0.23   0.02   0.11   0.12       ]

Sub-blocks used by the algorithm (H, Hσ, hλ,S, hP,λ in the figure):

• H ∈ R^{P×S} for finding P and S
• Hσ ∈ R^{P×S} for finding Aσ
• hλ,S ∈ R^{1×S} for finding α0
• hP,λ ∈ R^{P×1} for finding α∞

SLIDE 62

Low-rank Approximation and Factorization

We will use the singular value decomposition (SVD) as the main building block. Hence the name spectral!

SLIDE 63–65

Low-rank Approximation and Factorization

Parameters:

• Desired number of states n
• Block H ∈ R^{P×S} of the empirical Hankel matrix

Low-rank Approximation: compute the truncated SVD of rank n

H ≈ Un Λn Vnᵀ    (H is P×S, Un is P×n, Λn is n×n, Vnᵀ is n×S)

Factorization: H ≈ PS is given by the SVD, and pseudo-inverses are easy:

P = Un Λn  ⇒  P⁺ = Λn⁻¹ Unᵀ = (H Vn)⁺
S = Vnᵀ    ⇒  S⁺ = Vn

SLIDE 66–68

Computing the WFA

Parameters:

• Factorization H ≈ (UΛ) · Vᵀ = P · S
• Hankel blocks Hσ, hλ,S, hP,λ

Equations:

Aσ = P⁺ Hσ S⁺ = Λ⁻¹ Uᵀ Hσ V = (HV)⁺ Hσ V
α0ᵀ = hλ,S S⁺ = hλ,S V
α∞ = P⁺ hP,λ = Λ⁻¹ Uᵀ hP,λ = (HV)⁺ hP,λ

Full Algorithm

  • 1. Estimate the empirical Hankel matrix and retrieve the sub-blocks H, Hσ, hλ,S, hP,λ
  • 2. Perform SVD of H
  • 3. Solve for Aσ, α0, α∞ with pseudo-inverses
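A compact numpy sketch of the full algorithm (variable names are ours; the blocks are assumed to be built as in the earlier snippet):

```python
import numpy as np

def spectral_wfa(H, H_sigma, h_lS, h_Pl, n):
    """Recover WFA operators from empirical Hankel blocks.

    H       : |P| x |S| Hankel block
    H_sigma : dict sigma -> |P| x |S| shifted block, H_sigma[p, s] = f(p sigma s)
    h_lS    : length-|S| row for prefix lambda
    h_Pl    : length-|P| column for suffix lambda
    n       : desired number of states
    """
    U, lam, Vt = np.linalg.svd(H, full_matrices=False)
    U, lam, Vt = U[:, :n], lam[:n], Vt[:n, :]       # rank-n truncation
    P_pinv = np.diag(1.0 / lam) @ U.T               # P^+ for P = U Lambda
    V = Vt.T                                        # S^+ = V for S = V^T
    A = {s: P_pinv @ Hs @ V for s, Hs in H_sigma.items()}
    alpha0 = h_lS @ V                               # initial weights
    alpha_inf = P_pinv @ h_Pl                       # final weights
    return alpha0, alpha_inf, A
```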
SLIDE 69

Computational and Statistical Complexity

Running Time:

• Empirical Hankel matrix: O(|PS| · N)
• SVD and linear algebra: O(|P| · |S| · n)

Statistical Consistency:

• By the law of large numbers, ĤS → E[H] when N → ∞
• If E[H] is the Hankel matrix of some WFA A, then Â → A
• Works for data coming from PFA and HMM

PAC Analysis: (assuming data from A with n states)

• With high probability, ‖ĤS − H‖ ≤ O(1/√N)
• When N ≥ O( n|Σ|²T⁴ / (ε² sn(H)⁴) ), then Σ_{|x|≤T} |fA(x) − f_Â(x)| ≤ ε

Proofs can be found in [Hsu et al., 2009, Bailly, 2011, Balle, 2013]

SLIDE 70

Practical Considerations

Data ──(low-rank matrix estimation)──▶ Hankel matrix ──(factorization and linear algebra)──▶ WFA

Basic Setup:

• Data are strings sampled from a probability distribution on Σ*
• The Hankel matrix is estimated by empirical probabilities
• Factorization and low-rank approximation are computed using SVD

Advanced Implementations:

• Choice of the parameters P and S
• Scalable estimation and factorization of Hankel matrices
• Smoothing and variance normalization
• Use of prefix and substring statistics

SLIDE 71–72

Choosing the Basis

Definition: the pair (P, S) defining the sub-block is called a basis

Intuitions:

• The basis should be chosen such that E[H] has full rank
• P must contain strings reaching each possible state of the WFA
• S must contain strings producing different outcomes for each pair of states in the WFA

Popular Approaches:

• Set P = S = Σ^{≤k} for some k ≥ 1 [Hsu et al., 2009]
• Choose P and S to contain the K most frequent prefixes and suffixes in the sample [Balle et al., 2012]
• Take all prefixes and suffixes appearing in the sample [Bailly et al., 2009]

SLIDE 73

Scalable Implementations

Problem: when |Σ| is large, even the simplest bases become huge

Hankel Matrix Representation:

• Use hash functions to map P (S) to row (column) indices
• Use sparse matrix data structures, because the statistics are usually sparse
• Never store the full Hankel matrix in memory

Efficient SVD Computation:

• SVD for sparse matrices [Berry, 1992]
• Approximate randomized SVD [Halko et al., 2011]
• On-line SVD with rank-1 updates [Brand, 2006]
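A sketch of a truncated SVD on a sparse matrix with scipy (the toy matrix is ours; real Hankel blocks would be filled with empirical statistics):

```python
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Toy sparse "Hankel" matrix; real blocks can have millions of rows/columns.
H = sparse_random(10_000, 10_000, density=1e-4, format="csr", random_state=0)

# Truncated SVD with n = 20 states; H is never densified.
U, lam, Vt = svds(H, k=20)
```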

SLIDE 74–75

Refining the Statistics in the Hankel Matrix

Smoothing the Estimates

• Empirical probabilities f̂S(x) tend to be sparse
• As in n-gram models, smoothing can help when Σ is large
• Smoothing should take into account that strings in PS have different lengths
• Open Problem: how to smooth empirical Hankel matrices properly

Row and Column Weighting

• More frequent prefixes (suffixes) have better estimated rows (columns)
• Rows and columns can be scaled to reflect that
• This leads to more reliable SVD decompositions
• See [Cohen et al., 2013] for details

SLIDE 76–78

Substring Statistics

Problem: if the sample contains strings with a wide range of lengths, a small basis will ignore most of the examples

String Statistics (occurrence probability):

S = { aa, b, bab, a, bbab, abb, babba, abbb, ab, a, aabba, baa, abbab, baba, bb, a }

            a     b
     λ   [ .19   .06 ]
Ĥ =  a   [ .06   .06 ]
     b   [ .00   .06 ]
     ba  [ .06   .06 ]

Substring Statistics (expected number of occurrences as a substring):

Empirical expectation = (1/N) Σ_{i=1}^{N} [number of occurrences of x in xi]

            a      b
     λ   [ 1.31   1.56 ]
Ĥ =  a   [  .19    .62 ]
     b   [  .56    .50 ]
     ba  [  .06    .31 ]
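A sketch of the substring-statistics version of the Hankel block (helper names are ours):

```python
def count_occurrences(sub, s):
    """Number of (possibly overlapping) occurrences of sub in s."""
    return sum(1 for i in range(len(s) - len(sub) + 1)
               if s[i:i + len(sub)] == sub)

sample = ["aa", "b", "bab", "a", "bbab", "abb", "babba", "abbb",
          "ab", "a", "aabba", "baa", "abbab", "baba", "bb", "a"]
N = len(sample)

P, S = ["", "a", "b", "ba"], ["a", "b"]
H_sub = [[sum(count_occurrences(p + s, x) for x in sample) / N for s in S]
         for p in P]
# first row ≈ [1.31, 1.56]
```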

SLIDE 79

Substring Statistics

Theorem [Balle et al., 2014]: if a probability distribution f is computed by a WFA with n states, then the corresponding substring statistics are also computed by a WFA with n states

Learning from Substring Statistics

• Can work with smaller Hankel matrices
• But estimating the matrix takes longer
slide-80
SLIDE 80

Experiment: PoS-tag Sequence Models

60 62 64 66 68 70 72 74 10 20 30 40 50 Word Error Rate (%) Number of States Spectral, Σ basis Spectral, basis k=25 Spectral, basis k=50 Spectral, basis k=100 Spectral, basis k=300 Spectral, basis k=500 Unigram Bigram

➓ PTB sequences of simplified PoS tags [Petrov et al., 2012] ➓ Configuration: expectations on frequent substrings ➓ Metric: error rate on predicting next symbol in test sequences

slide-81
SLIDE 81

Experiment: PoS-tag Sequence Models

58 60 62 64 66 68 70 10 20 30 40 50 Word Error Rate (%) Number of States Spectral, Σ basis Spectral, basis k=500 EM Unigram Bigram

➓ Comparison with a bigram baseline and EM ➓ Metric: error rate on predicting next symbol in test sequences ➓ At training, the Spectral Method is → 100 faster than EM

SLIDE 82

Outline

  • 1. Weighted Automata and Hankel Matrices
  • 2. Spectral Learning of Probabilistic Automata
  • 3. Spectral Methods for Transducers and Grammars
      – Sequence Tagging
      – Finite-State Transductions
      – Tree Automata
  • 4. Hankel Matrices with Missing Entries
  • 5. Conclusion
  • 6. References
SLIDE 83

Sequence Tagging and Transduction

• Many applications involve pairs of input-output sequences:

  • Sequence tagging (one output tag per input token), e.g. part-of-speech tagging:

      output: NNP NNP VBZ NNP .
      input:  Ms. Haag plays Elianti .

  • Transductions (sequence lengths might differ), e.g. spelling correction:

      output: a p p l e
      input:  a p l e

• Finite-state automata are classic methods to model these relations. Spectral methods apply naturally to this setting.

SLIDE 84

Sequence Tagging

• Notation:
  • Input alphabet X
  • Output alphabet Y
  • Joint alphabet Σ = X × Y

• Goal: map input sequences to output sequences of the same length

• Approach: learn a function f : (X × Y)* → R; then, given an input x ∈ X^T, return

  argmax_{y ∈ Y^T} f(x, y)

  (note: this maximization is not tractable in general)

SLIDE 85–87

Weighted Finite Tagger

• Notation:
  • X × Y: joint alphabet – finite set
  • n: number of states – positive integer
  • α0: initial weights – vector in R^n (features of the empty prefix)
  • α∞: final weights – vector in R^n (features of the empty suffix)
  • A_a^b: transition weights – matrix in R^{n×n} (∀a ∈ X, b ∈ Y)

• Definition: a WFTagger with n states over X × Y is A = ⟨α0, α∞, {A_a^b}⟩

• Compositional Function: every WFTagger defines a function fA : (X × Y)* → R

  fA(x1 . . . xT, y1 . . . yT) = α0ᵀ A_{x1}^{y1} · · · A_{xT}^{yT} α∞ = α0ᵀ A_x^y α∞

SLIDE 88

The Spectral Method for WFTaggers

Data ──(low-rank matrix estimation)──▶ Hankel matrix ──(factorization and linear algebra)──▶ WFA

• Assume f(x, y) = P(x, y)
• Same mechanics as for WFA, with Σ = X × Y
• In a nutshell:
  • 1. Choose the set of prefixes and suffixes to define the Hankel matrix → in this case they are bistrings
  • 2. Estimate the Hankel matrix with prefix-suffix training statistics
  • 3. Factorize the Hankel matrix using SVD
  • 4. Compute the α and β projections, and compute the operators ⟨α0, α∞, {Aσ}⟩

• Other cases:
  • fA(x, y) = P(y | x): see [Balle et al., 2011]
  • fA(x, y) non-probabilistic: see [Quattoni et al., 2014]

SLIDE 89–90

Prediction with WFTaggers

• Assume fA(x, y) = P(x, y)
• Given x_{1:T}, compute the most likely output tag at position t:

  argmax_{a ∈ Y} µ(t, a)

where

  µ(t, a) ≜ P(yt = a | x) ∝ Σ_{y = y1···a···yT} P(x, y) ∝ Σ_{y = y1···a···yT} α0ᵀ A_x^y α∞

  ∝ α0ᵀ ( Σ_{y1···y_{t−1}} A_{x_{1:t−1}}^{y_{1:t−1}} ) A_{xt}^a ( Σ_{y_{t+1}···yT} A_{x_{t+1:T}}^{y_{t+1:T}} ) α∞
    = α*_A(x_{1:t−1}) · A_{xt}^a · β*_A(x_{t+1:T})

• The forward and backward aggregates satisfy the recursions

  α*_A(x_{1:t}) = α*_A(x_{1:t−1}) ( Σ_{b ∈ Y} A_{xt}^b )
  β*_A(x_{t:T}) = ( Σ_{b ∈ Y} A_{xt}^b ) β*_A(x_{t+1:T})
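A numpy sketch of these forward–backward recursions (the encoding of the operators and all names are ours; it assumes an operator exists for every (input, tag) pair):

```python
import numpy as np

def tag_marginals(x, alpha0, alpha_inf, A):
    """mu[t][b], proportional to P(y_t = b | x), for a probabilistic WFTagger.

    A[(a, b)] is the n x n operator for input symbol a with output tag b.
    """
    Y = sorted({b for (_, b) in A})
    M = {a: sum(A[(a, b)] for b in Y) for (a, _) in A}   # M[a] = sum_b A[(a,b)]
    T = len(x)
    fwd = [alpha0]                        # fwd[t] = alpha*_A(x_{1:t})
    for t in range(T):
        fwd.append(fwd[-1] @ M[x[t]])
    bwd = [alpha_inf]                     # built right-to-left
    for t in reversed(range(T)):
        bwd.append(M[x[t]] @ bwd[-1])
    bwd.reverse()                         # bwd[t+1] = beta*_A(x_{t+2:T})
    mu = []
    for t in range(T):
        scores = {b: fwd[t] @ A[(x[t], b)] @ bwd[t + 1] for b in Y}
        Z = sum(scores.values())
        mu.append({b: s / Z for b, s in scores.items()})
    return mu
```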

SLIDE 91

Prediction with WFTaggers (II)

• Assume fA(x, y) = P(x, y)
• Given x_{1:T}, compute the most likely output bigram ab at position t:

  argmax_{a,b ∈ Y} µ(t, a, b)

where

  µ(t, a, b) = P(yt = a, y_{t+1} = b | x) ∝ α*_A(x_{1:t−1}) A_{xt}^a A_{x_{t+1}}^b β*_A(x_{t+2:T})

• Computing the most likely full sequence y is intractable. In practice, use Minimum Bayes-Risk decoding:

  argmax_{y ∈ Y^T} Σ_t µ(t, yt, y_{t+1})

SLIDE 92

Finite State Transducers

(Figure: the pair (ab, cde) aligned as a-c, ǫ-d, b-e.)

• A WFTransducer evaluates aligned strings, using the empty symbol ǫ to produce one-to-one alignments:

  f(a-c, ǫ-d, b-e) = α0ᵀ A_a^c A_ǫ^d A_b^e α∞

• Then, a function g can be defined on unaligned strings by aggregating alignments:

  g(ab, cde) = Σ_{π ∈ Π(ab,cde)} f(π)

SLIDE 93–95

Finite State Transducers: Main Problems

• Prediction: given an FST A, how to . . .
  • Compute g(x, y) for unaligned strings? → using edit-distance recursions
  • Compute marginal quantities µ(edge) = P(edge | x)? → also using edit-distance recursions
  • Compute the most likely y for a given x? → use MBR decoding with marginal scores

• Unsupervised Learning: learn an FST from pairs of unaligned strings
  • Unlike EM, the spectral method cannot recover latent structure such as alignments (recall: alignments are needed to estimate Hankel entries)
  • See [Bailly et al., 2013b] for a solution based on Hankel matrix completion

SLIDE 96

Spectral Learning of Tree Automata and Grammars

(Parse tree: S → NP (noun Mary), VP (verb plays, NP (det the, noun guitar)).)

Some References:

• Tree Series: [Bailly et al., 2010]
• Latent-annotated PCFG: [Cohen et al., 2012, Cohen et al., 2013]
• Dependency parsing: [Luque et al., 2012, Dhillon et al., 2012]
• Unsupervised learning of WCFG: [Bailly et al., 2013a, Parikh et al., 2014]
• Synchronous grammars: [Saluja et al., 2014]

SLIDE 97–99

Compositional Functions over Trees

Consider the tree t = a(b, a(c(b, b), c)). Its value decomposes recursively along inside/outside splits:

f(t) = αA(a(b, ∗))ᵀ βA(a(c(b, b), c))
     = αA(a(b, ∗))ᵀ Aa( βA(c(b, b)) ⊗ βA(c) )
     = αA(a(b, a(∗, c)))ᵀ Ac( βA(b) ⊗ βA(b) )

(∗ marks the insertion point of the outside tree.)

SLIDE 100–101

Inside-Outside Composition of Trees

(Figure: a full tree decomposed as t = to ⊙ ti, an outside tree composed with an inside tree.)

Note: i-o composition generalizes the notion of concatenation in strings, i.e., outside trees are prefixes, inside trees are suffixes

SLIDE 102

Weighted Finite Tree Automata (WFTA)

An algebraic model for compositional functions on trees

SLIDE 103

WFTA Notation (I)

Labeled Trees

• {Σk} = {Σ0, Σ1, . . . , Σr} – ranked alphabet
• T – space of labeled trees over some ranked alphabet

Tree:

• t ∈ T = ⟨V, E, l(v)⟩: a labeled tree
• V = {1, . . . , m}: the set of vertices
• E = {⟨i, j⟩}: the set of edges forming a tree
• l(v) → {Σk}: returns the label of v (i.e. a symbol in {Σk})

SLIDE 104

WFTA Notation (II)

Labeled Trees

• {Σk} = {Σ0, Σ1, . . . , Σr} – ranked alphabet
• T – space of labeled trees over some ranked alphabet

Leaf Trees and Inside Compositions:

• leaf tree: σ ∈ Σ0, t = σ
• unary composition: σ ∈ Σ1, t1 ∈ T, t = σ[t1]
• binary composition: σ ∈ Σ2, t1, t2 ∈ T, t = σ[t1, t2]

SLIDE 105–106

Notation for Matrices and Tensors

Kronecker product:

• For v1 ∈ R^n and v2 ∈ R^n, v1 ⊗ v2 ∈ R^{n²} contains all products between elements of v1 and v2
• Example: v1 = [a, b], v2 = [c, d], v1 ⊗ v2 = [ac, ad, bc, bd]

Simplifying assumption:

• We consider trees with maximum arity 2
• We think of matrices and tensors as functions:
  • Vectors v ∈ R^n
  • Matrices A1 ∈ R^{n×n}: take one vector v ∈ R^n and produce another vector A1 v ∈ R^n
  • Tensors A2 ∈ R^{n×n²}: take two vectors v1, v2 ∈ R^n and produce another vector A2 (v1 ⊗ v2) ∈ R^n
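A quick numpy illustration of these conventions (the toy tensor is ours):

```python
import numpy as np

v1, v2 = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(np.kron(v1, v2))          # [3. 4. 6. 8.] = [ac, ad, bc, bd]

# A binary operator A2, packed as an n x n^2 matrix, acts on v1 ⊗ v2:
n = 2
A2 = np.arange(n * n * n, dtype=float).reshape(n, n * n)
print(A2 @ np.kron(v1, v2))     # a vector in R^n
```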

SLIDE 107

Weighted Finite Tree Automata (WFTA)

Σ = {Σ0, Σ1, Σ2}: ranked alphabet of order 2 – finite set

Definition: a WFTA with n states over Σ is A = ⟨α∗, {βσ}, {A1_σ}, {A2_σ}⟩

• n: number of states – positive integer
• α∗ ∈ R^n: root weights
• βσ ∈ R^n: leaf weights (∀σ ∈ Σ0)
• A1_σ ∈ R^{n×n}: node weights (∀σ ∈ Σ1)
• A2_σ ∈ R^{n×n²}: node weights (∀σ ∈ Σ2)
• Note: A2_σ is a tensor in R^{n×n×n} packed as a matrix

SLIDE 108–109

WFTA: Inside Function

Definition: any WFTA A defines an inside function βA : T → R^n, mapping a tree to a vector in R^n:

• if t = σ is a leaf: βA(t) = βσ
• if t = σ[t1] results from a unary composition: βA(t) = A1_σ βA(t1)
• if t = σ[t1, t2] results from a binary composition: βA(t) = A2_σ (βA(t1) ⊗ βA(t2))

WFTA Function: every WFTA A defines a function fA : T → R computed as fA(t) = α∗ᵀ βA(t)
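A Python sketch of the recursive inside computation (the tuple encoding of trees and all names are ours):

```python
import numpy as np

def inside(t, beta, A1, A2):
    """Inside vector beta_A(t) for a tree t.

    Trees are tuples: ("sigma",) for a leaf, ("sigma", t1) for a unary
    node, ("sigma", t1, t2) for a binary node.
    """
    sigma, children = t[0], t[1:]
    if len(children) == 0:
        return beta[sigma]                               # leaf
    if len(children) == 1:
        return A1[sigma] @ inside(children[0], beta, A1, A2)
    b1 = inside(children[0], beta, A1, A2)
    b2 = inside(children[1], beta, A1, A2)
    return A2[sigma] @ np.kron(b1, b2)                   # binary node

def wfta_eval(t, alpha_star, beta, A1, A2):
    """f_A(t) = alpha_star^T beta_A(t)."""
    return alpha_star @ inside(t, beta, A1, A2)
```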

SLIDE 110–115

Weighted Finite Tree Automaton (WFTA)

Example of an inside computation for the tree t = a(b, a(c[b], c)), built bottom-up:

βA(b) = βb    βA(c) = βc
βA(c[b]) = A1_c(βb)
βA(a(c[b], c)) = A2_a( A1_c(βb) ⊗ βc )
fA(t) = α∗ᵀ A2_a( βb ⊗ A2_a( A1_c(βb) ⊗ βc ) )

SLIDE 116–118

Useful Intuition: Latent-variable Models as WFTA

fA(t) = α∗ᵀ βA(t)

• Each labeled node v is decorated with a latent variable hv ∈ [n]
• fA(t) is a sum–product computation:

  fA(t) = Σ_{h0,h1,...,h|V| ∈ [n]} ( α∗(h0) · ∏_{v∈V: a(v)=0} βl(v)[hv] × ∏_{v∈V: a(v)=1} A1_{l(v)}[hv, h_{c(t,v)}] × ∏_{v∈V: a(v)=2} A2_{l(v)}[hv, h_{c1(t,v)}, h_{c2(t,v)}] )

  (a(v) is the arity of node v; c(t, v), c1(t, v), c2(t, v) are its children)

• fA(t) is a linear model in the latent space defined by βA : T → R^n

  fA(t) = Σ_{i=1}^{n} α∗[i] βA(t)[i]

SLIDE 119

Inside/Outside Decomposition

Consider a tree t and one node v:

• Inside tree t[v]: the subtree of t rooted at v
  • t[v] ∈ T
• Outside tree t\v: the rest of t when removing t[v]
  • T∗: the space of outside trees, i.e. t\v ∈ T∗
  • Foot node ∗: a tree insertion point (a special symbol ∗ ∉ {Σk})
  • An outside tree has exactly one foot node among its leaves

(Figure: a tree t split at node v into outside tree t\v and inside tree t[v].)

SLIDE 120–121

Inside/Outside Composition

(Figure: a tree formed as t = to ⊙ ti.)

• A tree is formed by composing an outside tree with an inside tree → generalizes prefix/suffix concatenation in strings
• There are multiple ways to decompose a full tree into inside/outside trees → as many as nodes in the tree

SLIDE 122

Outside Trees

• Outside trees t∗ ∈ T∗ are defined recursively using compositions:

  • foot node: t∗ = ∗
  • unary composition (to ∈ T∗, σ ∈ Σ1): t∗ = to ⊙ σ[∗]
  • binary composition (to ∈ T∗, σ ∈ Σ2, ti ∈ T): t∗ = to ⊙ σ[∗, ti] or t∗ = to ⊙ σ[ti, ∗]

SLIDE 123

WFTA: Outside Function

Definition: any WFTA A defines an outside function αA : T∗ → R^n, mapping an outside tree to a vector in R^n:

• if t∗ = ∗ is a foot node: αA(t∗) = α0
• if t∗ = to ⊙ σ[∗] results from a unary composition: αA(t∗) = αA(to)ᵀ A1_σ
• if t∗ = to ⊙ σ[ti, ∗] results from a binary composition: αA(t∗) = αA(to)ᵀ A2_σ (βA(ti) ⊗ 1n)

(note: a similar expression holds for t∗ = to ⊙ σ[∗, ti])

SLIDE 124–125

WFTA are fully compositional

(Figure: a tree decomposed as t = to ⊙ ti.)

For any inside-outside decomposition of a tree:

fA(t) = αA(to)ᵀ βA(ti)                      (let t = to ⊙ ti)
      = αA(to)ᵀ A2_σ ( βA(t1) ⊗ βA(t2) )    (let ti = σ[t1, t2])

Consequences:

• We can isolate the αA and βA vector spaces
• Given αA and βA, we can isolate the operators Ak_σ

SLIDE 126

Hankel Matrices of Functions over Labeled Trees

Two Equivalent Representations

• Functional: fA : T → R
• Matricial: HfA ∈ R^{|T∗|×|T|} (the Hankel matrix of fA)

• Definition: H(to, ti) = f(to ⊙ ti)
• Sub-block for σ: Hσ(to, σ[t1, t2]) = f(to ⊙ σ[t1, t2])
• Highly redundant, i.e., |V| + 1 entries for f(t)

(Matrix sketch: rows indexed by outside trees, columns by inside trees; sample entries 1, −1, 2, 3, 4, 1, 6, 2, −1, −3, −7, 3, . . .)

SLIDE 127–130

A Fundamental Theorem about WFTA

Relates the rank of Hf and the number of states of a WFTA computing f.

Let f : T → R be any function over labeled trees.

  • 1. If f = fA for some WFTA A with n states ⇒ rank(Hf) ≤ n
  • 2. If rank(Hf) = n ⇒ there exists a WFTA A with n states s.t. f = fA

Why Fundamental? The proof of (2) gives an algorithm for “recovering” A from the Hankel matrix of f = fA.
SLIDE 131–133

Structure of Low-rank Hankel Matrices

A rank-n Hankel matrix factorizes as Hf = O I, with Hf ∈ R^{T∗×T}, O ∈ R^{T∗×n}, I ∈ R^{n×T}:

f(to ⊙ ti) = αA(to)ᵀ βA(ti)

αA(to) = O(to, ·)    βA(ti) = I(·, ti)

SLIDE 134–136

Hankel Factorizations and Operators

For σ ∈ Σ2, the sub-block Hσ ∈ R^{T∗×T} satisfies, with O ∈ R^{T∗×n}, A2_σ ∈ R^{n×n²}, I ∈ R^{n×T}:

f(to ⊙ σ[t1, t2]) = αA(to)ᵀ A2_σ ( βA(t1) ⊗ βA(t2) )

Hσ = O A2_σ [I ⊗ I]   ⇒   A2_σ = O⁺ Hσ [I ⊗ I]⁺

Note: works with finite sub-blocks as well (assuming rank(O) = rank(I) = n)

SLIDE 137

WFTA: Application to Parsing

(Parse tree: S → NP (noun Mary), VP (verb plays, NP (det the, noun guitar)).)

Some intuitions:

• Derivation = Labeled Tree
• Learning compositional functions over derivations ⇒ learning functions over trees
• We are interested in functions computed by WFTA

SLIDE 138

WFTA for Parsing: Key Questions

• What is the latent state representing?
  • e.g., latent real-valued embeddings of words and phrases

• What form of supervision do we get?
  • Full derivations (labeled trees), i.e., supervised learning of latent-variable grammars
  • Derivation skeletons (unlabeled trees), e.g. [Pereira and Schabes, 1992]
  • Yields from the grammar (only sentences), i.e., grammar induction

SLIDE 139–141

Parsing and Tree Automaton

For the parse tree S(NP(Mary), VP(plays, NP(the, guitar))), the bottom-up computation is:

βMary, βplays, βthe, βguitar
A_NP[βMary]    A_NP[βthe ⊗ βguitar]
A_VP[βplays ⊗ A_NP[βthe ⊗ βguitar]]
A_S[A_NP[βMary] ⊗ A_VP[βplays ⊗ A_NP[βthe ⊗ βguitar]]]
fA = α0ᵀ A_S[A_NP[βMary] ⊗ A_VP[βplays ⊗ A_NP[βthe ⊗ βguitar]]]

• Vectors βσ are associated with terminal symbols
• Matrices and tensors Ak_σ are associated with non-terminals
• The bottom-up computation embeds inside trees into vectors in R^n

SLIDE 142–143

WFTA on Parse Trees

• WFTA A = ⟨α∗, {βw}, {A1_N}, {A2_N}⟩
• n: number of states, i.e. the dimensionality of the embedding
• Ranked alphabet:
  • Σ0 = {the, Mary, plays, . . . } – terminal words
  • Σ1 = {noun, verb, det, NP, VP, . . . } – unary non-terminals
  • Σ2 = {S, NP, VP, . . . } – binary non-terminals
• α∗ – initial weights
• {βw} for all w ∈ Σ0 – word embeddings
• {A1_N} for all N ∈ Σ1 – compute embeddings of unary phrases
• {A2_N} for all N ∈ Σ2 – compute embeddings of binary phrases

If t = to ⊙ ti:

• fA(t) = αA(to)ᵀ βA(ti)
• βA(ti): an n-dimensional embedding of the inside tree ti, i.e., it maps inside trees to similar vectors if they are replaceable
• αA(to): an n-dimensional embedding of the outside tree to, i.e., it maps outside trees to similar vectors if they accept similar arguments

SLIDE 144

Production Parse Trees

(Figure: the parse tree S(NP(n Mary), VP(v plays, NP(d the, n guitar))) and its equivalent production parse tree, whose nodes are the context-free productions S→NP VP, NP→n, VP→v NP, n→Mary, v→plays, NP→d n, d→the, n→guitar.)

• A production parse tree represents the edges of a parse tree, i.e. the context-free productions

SLIDE 145

WFTA on Production Parse Trees

(Figure: the same parse tree / production parse tree pair as on the previous slide.)

• WFTA operators are associated with rule productions
• i/o compositions are constrained by the overlapping non-terminal
• The WFTA induces a separate n-dimensional space per non-terminal, i.e. observed non-terminals are refined
• WFTA on production parse trees include:
  • classic WCFG, for n = 1
  • PCFG-LA, for n > 1 [Matsuzaki et al., 2005, Petrov et al., 2006, Cohen et al., 2012]

SLIDE 146

Spectral Learning of Tree Automata

• WFTA are a general algebraic framework for compositional functions
• WFTA can exploit real-valued embeddings
• There are simple algorithms for learning WFTA from samples

SLIDE 147

Outline

  • 1. Weighted Automata and Hankel Matrices
  • 2. Spectral Learning of Probabilistic Automata
  • 3. Spectral Methods for Transducers and Grammars
      – Sequence Tagging
      – Finite-State Transductions
      – Tree Automata
  • 4. Hankel Matrices with Missing Entries
  • 5. Conclusion
  • 6. References
SLIDE 148–149

Learning WFA in More General Settings

Data ──(low-rank matrix estimation)──▶ Hankel matrix ──(factorization and linear algebra)──▶ WFA

Question: how do we use this approach to learn f : Σ* → R where f(x) does not have a probabilistic interpretation?

Examples:

• Classification f : Σ* → {1, −1}
• Unconstrained real-valued predictions f : Σ* → R
• General scoring functions for tagging: f : (Σ × ∆)* → R

slide-150
SLIDE 150

Example: Hankel Matrices with Missing Entries

When learning probabilistic functions. . .

entries in Hf are estimated from empirical counts, e.g. f♣xq ✏ Prxs

✩ ✬ ✬ ✫ ✬ ✬ ✪ aa, b, bab, a, b, a, ab, aa, ba, b, aa, a, aa, bab, b, aa ✱ ✴ ✴ ✳ ✴ ✴ ✲ − Ñ ✔ ✖ ✖ ✕

a b ǫ

.19 .25

a

.31 .06

b

.06 .00

ba

.00 .13 ✜ ✣ ✣ ✢ ✩ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✫ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✪ ✱ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✳ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✲ Ñ ✔ ✖ ✖ ✖ ✖ ✖ ✖ ✕ ☎ ☎ ☎ ☎ ☎ ☎ ☎ ✜ ✣ ✣ ✣ ✣ ✣ ✣ ✢

slide-151
SLIDE 151

Example: Hankel Matrices with Missing Entries

When learning probabilistic functions. . .

entries in Hf are estimated from empirical counts, e.g. f♣xq ✏ Prxs

✩ ✬ ✬ ✫ ✬ ✬ ✪ aa, b, bab, a, b, a, ab, aa, ba, b, aa, a, aa, bab, b, aa ✱ ✴ ✴ ✳ ✴ ✴ ✲ − Ñ ✔ ✖ ✖ ✕

a b ǫ

.19 .25

a

.31 .06

b

.06 .00

ba

.00 .13 ✜ ✣ ✣ ✢

But in a general regression setting...

entries in Hf are labels observed in the sample, and many may be missing

✩ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✫ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✬ ✪ (bab,1) (bbb,0) (aaa,3) (a,1) (ab,1) (aa,2) (aba,2) (bb,0) ✱ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✳ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✴ ✲ − Ñ ✔ ✖ ✖ ✖ ✖ ✖ ✖ ✕

ǫ a b a

1 2 1

b

☎ ☎

aa

2 3 ☎

ab

1 2 ☎

ba

☎ ☎ 1

bb

☎ ✜ ✣ ✣ ✣ ✣ ✣ ✣ ✢

slide-152
SLIDE 152

Example: Hankel Matrices with Missing Entries

When learning probabilistic functions...

entries in H_f are estimated from empirical counts, e.g. f(x) = P[x]

{ aa, b, bab, a, b, a, ab, aa, ba, b, aa, a, aa, bab, b, aa }

          a     b
    ε    .19   .25
    a    .31   .06
    b    .06   .00
    ba   .00   .13

But in a general regression setting...

entries in H_f are labels observed in the sample, and many may be missing

{ (bab,1), (bbb,0), (aaa,3), (a,1), (ab,1), (aa,2), (aba,2), (bb,0) }

          ε     a     b
    a     1     2     1
    b     ?     ?     0
    aa    2     3     ?
    ab    1     2     ?
    ba    ?     ?     1
    bb    0     ?     0
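And a matching sketch for the regression case, with NaN standing in for the labels that were never observed (the `labeled` dict mirrors the sample above):

```python
# Hedged sketch: partially observed Hankel matrix for regression labels.
import numpy as np

labeled = {"bab": 1, "bbb": 0, "aaa": 3, "a": 1,
           "ab": 1, "aa": 2, "aba": 2, "bb": 0}
prefixes = ["a", "b", "aa", "ab", "ba", "bb"]
suffixes = ["", "a", "b"]

H = np.array([[labeled.get(p + s, np.nan) for s in suffixes]
              for p in prefixes])
observed = ~np.isnan(H)    # boolean mask later used by the completion objective
print(H)
```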

slide-153
SLIDE 153

Inference of Hankel Matrices

Goal: learn a Hankel matrix H ∈ R^{P×S} from partial information, then apply the Hankel trick.

Why is this even possible? Only O((n + m) · r) coefficients are needed to represent an n × m matrix of rank r.
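For reference, the "Hankel trick" in its standard SVD form (cf. Balle et al., 2014) can be sketched as follows; the function names are ours, and we assume the empty string belongs to both P and S so that the vectors h_P (with h_P(p) = f(p)) and h_S (with h_S(s) = f(s)) can be read off the data:

```python
# Hedged sketch of the standard spectral recipe: WFA from Hankel blocks.
import numpy as np

def spectral_wfa(H, H_sigma, h_P, h_S, rank):
    """H[p, s] = f(ps); H_sigma[a][p, s] = f(p a s); returns (alpha0, A, alpha_inf)."""
    U, lam, Vt = np.linalg.svd(H, full_matrices=False)
    F = U[:, :rank] * lam[:rank]       # forward factor of the factorization H = F B
    B = Vt[:rank, :]                   # backward factor
    Fp, Bp = np.linalg.pinv(F), np.linalg.pinv(B)
    alpha0 = h_S @ Bp                  # initial weights
    alpha_inf = Fp @ h_P               # final weights
    A = {a: Fp @ Ha @ Bp for a, Ha in H_sigma.items()}   # one operator per symbol
    return alpha0, A, alpha_inf

def evaluate(alpha0, A, alpha_inf, x):
    v = alpha0
    for a in x:                        # f(x) = alpha0 A_{x_1} ... A_{x_T} alpha_inf
        v = v @ A[a]
    return v @ alpha_inf
```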

slide-154
SLIDE 154

Inference of Hankel Matrices

Goal: learn a Hankel matrix H ∈ R^{P×S} from partial information, then apply the Hankel trick.

Why is this even possible? Only O((n + m) · r) coefficients are needed to represent an n × m matrix of rank r:

SVD:   M = U Λ V^⊤,   with coefficient counts   M: n·m,   U: n·r,   Λ: r,   V^⊤: m·r
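A quick numerical check of this parameter count, using numpy's SVD on a matrix that is rank-r by construction:

```python
# Hedged sketch: a rank-r n x m matrix is described by ~ (n + m) r numbers.
import numpy as np

n, m, r = 200, 100, 5
M = np.random.randn(n, r) @ np.random.randn(r, m)    # rank r by construction

U, lam, Vt = np.linalg.svd(M, full_matrices=False)
U, lam, Vt = U[:, :r], lam[:r], Vt[:r, :]
print(M.size)                           # n*m = 20000 entries
print(U.size + lam.size + Vt.size)      # n*r + r + m*r = 1505 coefficients
print(np.allclose(M, (U * lam) @ Vt))   # True: the factors reconstruct M
```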

slide-155
SLIDE 155

Inference of Hankel Matrices

Goal: learn a Hankel matrix H ∈ R^{P×S} from partial information, then apply the Hankel trick.

What partial information about H can we hope to gather?

slide-156
SLIDE 156

Inference of Hankel Matrices

Goal: learn a Hankel matrix H ∈ R^{P×S} from partial information, then apply the Hankel trick.

Information Models:

➓ Subset of entries: { H(p, s) | (p, s) ∈ I }

➓ Linear measurements: { Hv | v ∈ V }

➓ Bilinear measurements: { u^⊤ H v | u ∈ U, v ∈ V }

➓ Constraints between entries: { H(p, s) ≥ H(p′, s′) | (p, s, p′, s′) ∈ I }

➓ Noisy versions of all of the above

slide-157
SLIDE 157

Inference of Hankel Matrices

Goal: learn a Hankel matrix H ∈ R^{P×S} from partial information, then apply the Hankel trick.

Information Models:

➓ Subset of entries: { H(p, s) | (p, s) ∈ I }

➓ Linear measurements: { Hv | v ∈ V }

➓ Bilinear measurements: { u^⊤ H v | u ∈ U, v ∈ V }

➓ Constraints between entries: { H(p, s) ≥ H(p′, s′) | (p, s, p′, s′) ∈ I }

➓ Noisy versions of all of the above

What a priori information about H do we have?

slide-158
SLIDE 158

Inference of Hankel Matrices

Goal: learn a Hankel matrix H ∈ R^{P×S} from partial information, then apply the Hankel trick.

Information Models:

➓ Subset of entries: { H(p, s) | (p, s) ∈ I }

➓ Linear measurements: { Hv | v ∈ V }

➓ Bilinear measurements: { u^⊤ H v | u ∈ U, v ∈ V }

➓ Constraints between entries: { H(p, s) ≥ H(p′, s′) | (p, s, p′, s′) ∈ I }

➓ Noisy versions of all of the above

Constraints and Inductive Bias:

➓ Hankel constraints: H(p, s) = H(p′, s′) whenever ps = p′s′

➓ Bounds on entries: |H(p, s)| ≤ C

➓ Low-rank constraints / regularization via rank(H)
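Operationally, the Hankel constraints tie every pair of cells whose prefix and suffix concatenate to the same string to a single free variable, as in this tiny sketch (the prefix/suffix sets are illustrative):

```python
# Hedged sketch: cells tied by the constraint H(p, s) = H(p', s') if ps = p's'.
from collections import defaultdict

prefixes = ["", "a", "b", "aa", "ab", "ba", "bb"]
suffixes = ["", "a", "b"]

cells_of = defaultdict(list)
for i, p in enumerate(prefixes):
    for j, s in enumerate(suffixes):
        cells_of[p + s].append((i, j))   # all cells that encode f(ps)

for x, cells in sorted(cells_of.items()):
    if len(cells) > 1:
        print(x, cells)   # e.g. "aa" -> [(1, 1), (3, 0)]: two cells, one variable
```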


slide-160
SLIDE 160

Empirical Risk Minimization Approach

Formalize the problem: “find a Hankel matrix that agrees with the data”

[Balle and Mohri, 2012]

slide-161
SLIDE 161

Empirical Risk Minimization Approach

Data: { (x_i, y_i) }_{i=1}^N,  with x_i ∈ Σ* and y_i ∈ R


slide-162
SLIDE 162

Empirical Risk Minimization Approach

Data: { (x_i, y_i) }_{i=1}^N,  with x_i ∈ Σ* and y_i ∈ R

Parameters:

➓ Rows and columns P, S ⊂ Σ*

➓ (Convex) loss function ℓ : R × R → R

➓ Regularization parameter λ / rank bound R


slide-163
SLIDE 163

Empirical Risk Minimization Approach

Data: { (x_i, y_i) }_{i=1}^N,  with x_i ∈ Σ* and y_i ∈ R

Parameters:

➓ Rows and columns P, S ⊂ Σ*

➓ (Convex) loss function ℓ : R × R → R

➓ Regularization parameter λ / rank bound R

Optimization (constrained formulation):

argmin_{H ∈ R^{P×S}}   (1/N) Σ_{i=1}^N ℓ(y_i, H(x_i))   subject to   rank(H) ≤ R

(here H(x_i) denotes the entry of H associated with the string x_i)


slide-164
SLIDE 164

Empirical Risk Minimization Approach

Data: { (x_i, y_i) }_{i=1}^N,  with x_i ∈ Σ* and y_i ∈ R

Parameters:

➓ Rows and columns P, S ⊂ Σ*

➓ (Convex) loss function ℓ : R × R → R

➓ Regularization parameter λ / rank bound R

Optimization (constrained formulation):

argmin_{H ∈ R^{P×S}}   (1/N) Σ_{i=1}^N ℓ(y_i, H(x_i))   subject to   rank(H) ≤ R

Optimization (regularized formulation):

argmin_{H ∈ R^{P×S}}   (1/N) Σ_{i=1}^N ℓ(y_i, H(x_i)) + λ · rank(H)

slide-165
SLIDE 165

Empirical Risk Minimization Approach

Data: { (x_i, y_i) }_{i=1}^N,  with x_i ∈ Σ* and y_i ∈ R

Parameters:

➓ Rows and columns P, S ⊂ Σ*

➓ (Convex) loss function ℓ : R × R → R

➓ Regularization parameter λ / rank bound R

Optimization (constrained formulation):

argmin_{H ∈ R^{P×S}}   (1/N) Σ_{i=1}^N ℓ(y_i, H(x_i))   subject to   rank(H) ≤ R

Optimization (regularized formulation):

argmin_{H ∈ R^{P×S}}   (1/N) Σ_{i=1}^N ℓ(y_i, H(x_i)) + λ · rank(H)

Note: these optimization problems are non-convex!

slide-166
SLIDE 166

Nuclear Norm Relaxation

Nuclear norm: for a matrix M, ‖M‖_* = Σ_i s_i(M), the sum of its singular values.

In machine learning, minimizing the nuclear norm is a commonly used convex surrogate for minimizing the rank.

Convex optimization for Hankel matrix estimation:

argmin_{H ∈ R^{P×S}}   (1/N) Σ_{i=1}^N ℓ(y_i, H(x_i)) + λ ‖H‖_*
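For concreteness, the nuclear norm is just the sum of singular values, e.g.:

```python
# Hedged sketch: nuclear norm vs. rank on a small diagonal matrix.
import numpy as np

M = np.diag([3.0, 1.0, 0.0])
s = np.linalg.svd(M, compute_uv=False)
print(s.sum())            # ||M||_* = 3 + 1 + 0 = 4
print((s > 1e-9).sum())   # rank(M) = 2; the norm is its convex surrogate
```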

slide-167
SLIDE 167

Optimization Algorithms for Hankel Matrix Estimation

Optimizing the Nuclear Norm Surrogate

➓ Projected/proximal sub-gradient methods (e.g. [Duchi and Singer, 2009])

➓ Frank–Wolfe [Jaggi and Sulovský, 2010]

➓ Singular value thresholding [Cai et al., 2010]

Non-Convex “Heuristics”

➓ Alternating minimization (e.g. [Jain et al., 2013])
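To illustrate, a minimal proximal-gradient sketch for this objective with squared loss, using singular value thresholding as the proximal step of the nuclear norm (cf. [Cai et al., 2010]); this simplification is ours and, in particular, omits the Hankel tying constraints used by [Balle and Mohri, 2012]:

```python
# Hedged sketch: nuclear-norm-regularized Hankel estimation via proximal gradient.
import numpy as np

def svt(M, tau):
    """Prox of tau * ||.||_*: soft-threshold the singular values (SVT)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def complete_hankel(Y, observed, lam=0.1, step=0.5, iters=500):
    """Y holds observed labels (unobserved cells can hold anything);
    observed is a boolean mask; squared loss on observed entries only."""
    H = np.zeros_like(Y)
    for _ in range(iters):
        grad = np.where(observed, H - Y, 0.0)   # gradient of the data term
        H = svt(H - step * grad, step * lam)    # proximal (SVT) step
    return H

# Toy usage with the regression example from slide 152 (masked entries unknown):
Y = np.array([[1, 2, 1], [0, 0, 0.0], [2, 3, 0],
              [1, 2, 0], [0, 0, 1], [0, 0, 0]])
mask = np.array([[1, 1, 1], [0, 0, 1], [1, 1, 0],
                 [1, 1, 0], [0, 0, 1], [1, 0, 1]], dtype=bool)
print(np.round(complete_hankel(Y, mask), 2))
```

The recovered matrix fills the missing cells with values consistent with a low-rank hypothesis; a WFA can then be extracted from it via the Hankel trick.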

slide-168
SLIDE 168

Applications of Hankel Matrix Estimation

➓ Max-margin taggers [Quattoni et al., 2014]

➓ Unsupervised transducers [Bailly et al., 2013b]

➓ Unsupervised WCFG [Bailly et al., 2013a]

slide-169
SLIDE 169

Outline

  • 1. Weighted Automata and Hankel Matrices
  • 2. Spectral Learning of Probabilistic Automata
  • 3. Spectral Methods for Transducers and Grammars

Sequence Tagging Finite-State Transductions Tree Automata

  • 4. Hankel Matrices with Missing Entries
  • 5. Conclusion
  • 6. References
slide-170
SLIDE 170

Conclusion

➓ Spectral methods provide new tools for learning compositional functions by means of algebraic operations

➓ Key result: forward–backward recursions ⇔ low-rank Hankel matrices

➓ Applicable to a wide range of compositional formalisms: finite-state automata and transducers, context-free grammars, ...

➓ Related to loss-regularized methods by means of matrix-completion techniques

slide-171
SLIDE 171

Spectral Learning Techniques for Weighted Automata, Transducers, and Grammars

Borja Balle♦ Ariadna Quattoni♥ Xavier Carreras♥

(♦) McGill University   (♥) Xerox Research Centre Europe

TUTORIAL @ EMNLP 2014

slide-172
SLIDE 172

Outline

  • 1. Weighted Automata and Hankel Matrices
  • 2. Spectral Learning of Probabilistic Automata
  • 3. Spectral Methods for Transducers and Grammars

Sequence Tagging Finite-State Transductions Tree Automata

  • 4. Hankel Matrices with Missing Entries
  • 5. Conclusion
  • 6. References
slide-173
SLIDE 173

Albert, J. and Kari, J. (2009). Digital image compression. In Handbook of Weighted Automata.

Baier, C., Größer, M., and Ciesinski, F. (2009). Model checking linear-time properties of probabilistic systems. In Handbook of Weighted Automata.

Bailly, R. (2011). Méthodes spectrales pour l'inférence grammaticale probabiliste de langages stochastiques rationnels. PhD thesis, Aix-Marseille Université.

Bailly, R., Carreras, X., Luque, F., and Quattoni, A. (2013a). Unsupervised spectral learning of WCFG as low-rank matrix completion. In EMNLP.

Bailly, R., Carreras, X., and Quattoni, A. (2013b). Unsupervised spectral learning of finite state transducers. In NIPS.

Bailly, R., Denis, F., and Ralaivola, L. (2009). Grammatical inference as a principal component analysis problem. In ICML.

Bailly, R., Habrard, A., and Denis, F. (2010). A spectral approach for probabilistic grammatical inference on trees. In ALT.

Balle, B. (2013). Learning Finite-State Machines: Algorithmic and Statistical Aspects. PhD thesis, Universitat Politècnica de Catalunya.

Balle, B., Carreras, X., Luque, F., and Quattoni, A. (2014). Spectral learning of weighted automata: A forward-backward perspective. Machine Learning.

Balle, B. and Mohri, M. (2012). Spectral learning of general weighted automata via constrained matrix completion. In NIPS.

slide-174
SLIDE 174

Balle, B., Quattoni, A., and Carreras, X. (2011). A spectral learning algorithm for finite state transducers. In ECML-PKDD.

Balle, B., Quattoni, A., and Carreras, X. (2012). Local loss optimization in operator models: A new insight into spectral learning. In ICML.

Berry, M. W. (1992). Large-scale sparse singular value computations. International Journal of Supercomputer Applications.

Brand, M. (2006). Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 415(1):20–30.

Cai, J.-F., Candès, E., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization.

Carlyle, J. W. and Paz, A. (1971). Realizations by stochastic finite automata. Journal of Computer and System Sciences.

Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., and Ungar, L. (2012). Spectral learning of latent-variable PCFGs. In ACL.

Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., and Ungar, L. (2013). Experiments with spectral learning of latent-variable PCFGs. In NAACL-HLT.

de Gispert, A., Iglesias, G., Blackwood, G., Banga, E. R., and Byrne, W. (2010). Hierarchical phrase-based translation with weighted finite-state transducers and shallow-n grammars. Computational Linguistics, 36(3):505–533.

slide-175
SLIDE 175

Dhillon, P. S., Rodu, J., Collins, M., Foster, D. P., and Ungar, L. H. (2012). Spectral dependency parsing with latent variables. In EMNLP-CoNLL.

Duchi, J. and Singer, Y. (2009). Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research.

Fliess, M. (1974). Matrices de Hankel. Journal de Mathématiques Pures et Appliquées.

Halko, N., Martinsson, P., and Tropp, J. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288.

Hsu, D., Kakade, S. M., and Zhang, T. (2009). A spectral algorithm for learning hidden Markov models. In COLT.

Jaggi, M. and Sulovský, M. (2010). A simple algorithm for nuclear norm regularized problems. In ICML.

Jain, P., Netrapalli, P., and Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization. In STOC.

Knight, K. and May, J. (2009). Applications of weighted automata in natural language processing. In Handbook of Weighted Automata.

Luque, F., Quattoni, A., Balle, B., and Carreras, X. (2012). Spectral learning in non-deterministic dependency parsing. In EACL.

Mohri, M., Pereira, F. C. N., and Riley, M. (2008). Speech recognition with weighted finite-state transducers. In Handbook on Speech Processing and Speech Communication.

slide-176
SLIDE 176

Parikh, A. P., Cohen, S. B., and Xing, E. (2014). Spectral unsupervised parsing with additive tree metrics. In ACL.

Pereira, F. and Schabes, Y. (1992). Inside-outside reestimation from partially bracketed corpora. In ACL.

Petrov, S., Das, D., and McDonald, R. (2012). A universal part-of-speech tagset. In LREC.

Quattoni, A., Balle, B., Carreras, X., and Globerson, A. (2014). Spectral regularization for max-margin sequence tagging. In ICML.

Saluja, A., Dyer, C., and Cohen, S. B. (2014). Latent-variable synchronous CFGs for hierarchical translation. In EMNLP.