Spectral Learning Techniques for Weighted Automata, Transducers, and Grammars
Borja Balle (McGill University), Ariadna Quattoni (Xerox Research Centre Europe), Xavier Carreras (Xerox Research Centre Europe)
TUTORIAL @ EMNLP 2014

Status Quo
- Composable/composite objects (strings and trees) are ubiquitous in NLP.
- Latent variables provide powerful mechanisms for learning the hidden structure of such objects.
- Classical learning paradigms are Expectation–Maximization and related iterative methods.
Spectral methods:
- Provide tools for learning latent variable models with strong theoretical guarantees.
- Facilitate the connection of latent variable models with (multi-)linear algebra.
- In practice are faster than iterative methods, and not prone to local optima.
- Implementations can readily benefit from the latest developments in numerical linear algebra.

This tutorial will:
- Emphasize the relation of spectral methods and recursive compositional functions.
- Show how the language of Hankel matrices seamlessly applies to strings, trees, and grammars.
- Compositional functions defined in terms of recurrence relations. Consider a sequence abaccb, split into a prefix, a middle symbol, and a suffix.
- n is the dimension of the model
- αf maps prefixes to R^n
- βf maps suffixes to R^n
- A_a is a bilinear operator in R^{n×n}, so that f(p·a·s) = αf(p)^T A_a βf(s)
[Figure: an example weighted automaton over {a, b} with transition weights (a 0.4, b 0.1, a 0.2, b 0.3, a 0.1, b 0.1, a 0.1, b 0.1) and stopping weight 0.6; the slides step through computing f(ab) = α0^T A_a A_b α∞ on it.]
- Σ: alphabet – finite set
- n: number of states – positive integer
- α0: initial weights – vector in R^n (features of the empty prefix)
- α∞: final weights – vector in R^n (features of the empty suffix)
- Aσ: transition weights – matrix in R^{n×n} (for all σ ∈ Σ)

Definition: a WFA with n states is A = ⟨α0, α∞, {Aσ}⟩.
Compositional function: every WFA A computes a function f_A : Σ* → R given by

f_A(x1 ··· xT) = α0^T A_{x1} ··· A_{xT} α∞ = α0^T A_x α∞
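As a concrete illustration, here is a minimal numpy sketch of this computation; the 2-state automaton below is a hypothetical toy example, not one of the slides' models.

```python
# A minimal sketch of WFA evaluation; the 2-state automaton is hypothetical.
import numpy as np

def wfa_eval(a0, A, ainf, x):
    """Compute f_A(x) = a0^T A_{x_1} ... A_{x_T} ainf."""
    v = a0
    for sym in x:                # left-to-right product of transition matrices
        v = v @ A[sym]
    return float(v @ ainf)

a0   = np.array([1.0, 0.0])
ainf = np.array([0.5, 0.5])
A = {"a": np.array([[0.4, 0.1], [0.2, 0.3]]),
     "b": np.array([[0.1, 0.1], [0.1, 0.1]])}
print(wfa_eval(a0, A, ainf, "ab"))   # f_A(ab) = a0^T A_a A_b ainf
```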
Example: Hidden Markov Models
- Assigns probabilities to strings: f(x) = P[x]
- Emission and transition are conditionally independent given the state
- α0 = [0.3 0.3 0.4], α∞ = [1 1 1]

[Figure: a three-state HMM with transition weights 0.3, 0.4, 0.75, 0.25, 0.7, 0.6 and emission weights (a 0.5, b 0.5), (a 0.3, b 0.7), (a 0.9, b 0.1).]
Example: probabilistic transducers
- Σ = X × Y, where X is the input alphabet and Y the output alphabet
- Assigns conditional probabilities f(x, y) = P[y | x] to pairs (x, y) ∈ Σ*
- α0 = [0.3 0 0.7], α∞ = [1 1 1]

[Figure: a three-state transducer with initial weights 0.3 and 0.7 and transitions such as A/a 0.1, A/b 0.9, B/a 0.25, B/b 0.75, A/b 0.15, A/a 0.75, A/b 0.25, B/b 1, B/b 0.4, B/a 0.4, A/b 0.85, B/b 0.2.]
Related models include:
- Probabilistic Finite Automata (PFA)
- Deterministic Finite Automata (DFA)
- Observable Operator Models (OOM)
- Predictive State Representations (PSR)
WFA can be used to represent:
- Probability distributions: f_A(x) = P[x]
- Binary classifiers: g(x) = sign(f_A(x) + θ)
- Real predictors: f_A(x)
- Sequence predictors: g(x) = argmax_y f_A(x, y) (with Σ = X × Y)
Applications:
- Speech recognition [Mohri et al., 2008]
- Machine translation [de Gispert et al., 2010]
- Image processing [Albert and Kari, 2009]
- OCR systems [Knight and May, 2009]
- System testing [Baier et al., 2009]
Three views of the computation f_A(x1 ··· xT) = α0^T A_{x1} ··· A_{xT} α∞ = α0^T A_x α∞:

- Sum-Product: f_A(x) is a sum–product computation,
  f_A(x) = Σ_{i0,i1,...,iT ∈ [n]} α0[i0] ( Π_{t=1}^T A_{xt}[i_{t−1}, i_t] ) α∞[iT]
- Forward-Backward: f_A(x) is a dot product between forward and backward vectors; for any split x = p·s,
  f_A(x) = (α0^T A_{p1} ··· A_{pT}) (A_{s1} ··· A_{sT'} α∞) = α_A(p)^T β_A(s)
- Compositional Features: f_A(x) is a linear model, f_A(x) = α0^T β_A(x), over latent features β_A(x) = A_x α∞ computed compositionally from x
- Forward vectors: α_A(p)^T = α0^T A_{p1} ··· A_{pT} for a prefix p; backward vectors: β_A(s) = A_{s1} ··· A_{sT'} α∞ for a suffix s, so that f_A(p·s) = α_A(p)^T β_A(s)
- In an HMM, the coordinates of α_A and β_A have a probabilistic interpretation: α_A(p)[i] is the probability of emitting p and reaching state i, and β_A(s)[i] is the probability of emitting s starting from state i
Two views of a function:
- Functional: f : Σ* → R
- Matricial: H_f ∈ R^{Σ*×Σ*}, the Hankel matrix of f, with entries H_f(p, s) = f(p·s)

Example, for f(x) = number of a's in x (rows indexed by prefixes, columns by suffixes):

          λ    a    b    aa   ···
    λ     0    1    0    2
    a     1    2    1    3
    b     0    1    0    2
    aa    2    3    2    4
    ···

H_f(λ, aa) = H_f(a, a) = H_f(aa, λ) = 2, since all three entries correspond to f(aa).

Properties of H_f:
- |x| + 1 entries hold the value f(x)
- Depends on the ordering of Σ*
- Captures the structure of f
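A small sketch that materializes this finite block in numpy, using the same f and basis as the example above.

```python
# Sketch: materialize a finite Hankel block for f(x) = number of a's in x.
import numpy as np

def hankel_block(f, prefixes, suffixes):
    return np.array([[f(p + s) for s in suffixes] for p in prefixes])

f = lambda x: x.count("a")
P = S = ["", "a", "b", "aa"]               # "" stands for lambda
H = hankel_block(f, P, S)
print(H)
# H[0, 3] == H[1, 1] == H[3, 0] == 2, i.e. H(lambda, aa) = H(a, a) = H(aa, lambda),
# since all three entries equal f("aa").
```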
Every Hankel entry factorizes through the state space: for x = p·s,

H_f(p, s) = f_A(p·s) = (α0^T A_{p1} ··· A_{pT}) (A_{s1} ··· A_{sT'} α∞) = α_A(p)^T β_A(s)
For a WFA A with n states, define P ∈ R^{Σ*×n} with rows α_A(p)^T (one per prefix p) and S ∈ R^{n×Σ*} with columns β_A(s) (one per suffix s). For each σ ∈ Σ, let Hσ ∈ R^{Σ*×Σ*} be the shifted Hankel block with entries Hσ(p, s) = f(p·σ·s). Then Hσ factorizes as

Hσ = P · Aσ · S

because for any prefix p = p1 ··· pT and suffix s = s1 ··· sT':

f_A(p1 ··· pT · σ · s1 ··· sT') = α0^T A_{p1} ··· A_{pT} · Aσ · A_{s1} ··· A_{sT'} α∞ = α_A(p)^T Aσ β_A(s)
The spectral recipe: Data → Hankel matrix → WFA, via low-rank matrix estimation followed by factorization and linear algebra.
Equivalent WFA: for any invertible Q ∈ R^{n×n},
- A = ⟨α0, α∞, {Aσ}⟩
- B = ⟨Q^T α0, Q^{−1} α∞, {Q^{−1} Aσ Q}⟩
compute the same function:

f_A(x) = α0^T A_{x1} ··· A_{xT} α∞ = (α0^T Q)(Q^{−1} A_{x1} Q) ··· (Q^{−1} A_{xT} Q)(Q^{−1} α∞) = f_B(x)

Consequences:
- There is no unique parametrization for WFA
- Given A, it is undecidable whether f_A(x) ≥ 0 for all x
- Cannot expect to recover a probabilistic parametrization
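A quick numerical check of this invariance, with hypothetical random parameters: f_A and f_B agree on any string.

```python
# Numerical check of the change-of-basis invariance for a random toy WFA.
import numpy as np

rng = np.random.default_rng(0)
n = 2
a0, ainf = rng.random(n), rng.random(n)
A = {s: rng.random((n, n)) for s in "ab"}

Q = rng.random((n, n)) + np.eye(n)          # generically invertible
Qi = np.linalg.inv(Q)
b0, binf = Q.T @ a0, Qi @ ainf
B = {s: Qi @ A[s] @ Q for s in A}           # conjugated operators

def f(init, ops, final, x):
    v = init
    for s in x:
        v = v @ ops[s]
    return float(v @ final)

print(f(a0, A, ainf, "abba"), f(b0, B, binf, "abba"))   # equal up to rounding
```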
Data → Hankel matrix → WFA
- Data are strings sampled from a probability distribution on Σ*
- The Hankel matrix is estimated by empirical probabilities
- Factorization and low-rank approximation are computed using SVD
Estimate the Hankel entries by empirical frequencies: f̂_S(x) = (1/N) Σ_{i=1}^N 1[x_i = x].

Example:

S = {aa, b, bab, a, b, a, ab, aa, ba, b, aa, a, aa, bab, b, aa}  →  f̂_S(aa) = 5/16 ≈ 0.31

Empirical Hankel block with rows P = {λ, a, b, ba} and columns S = {a, b}:

          a     b
    λ    .19   .25
    a    .31   .06
    b    .06   .00
    ba   .00   .13
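A sketch of this estimation step in numpy, using the sample and basis from the example above ("" stands for λ):

```python
# Sketch: empirical Hankel block from a sample of strings.
from collections import Counter
import numpy as np

sample = ["aa", "b", "bab", "a", "b", "a", "ab", "aa", "ba", "b",
          "aa", "a", "aa", "bab", "b", "aa"]
counts, N = Counter(sample), len(sample)
f_hat = lambda x: counts[x] / N            # empirical probability of string x

P, S = ["", "a", "b", "ba"], ["a", "b"]
H_hat = np.array([[f_hat(p + s) for s in S] for p in P])
print(H_hat)   # matches the block above up to rounding
```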
In practice we use finite blocks of the Hankel matrix:
- A set of rows (prefixes) P ⊂ Σ* and a set of columns (suffixes) S ⊂ Σ*

Example block (rows and columns indexed by strings; the sub-block H_a holds the shifted entries f(p·a·s)):

          λ      a      b      aa     ab     ···
    λ     1      0.3    0.7    0.05   0.25
    a     0.3    0.05   0.25   0.02   0.03
    b     0.7    0.6    0.1    0.03   0.2
    aa    0.05   0.02   0.03   0.017  0.003
    ab    0.25   0.23   0.02   0.11   0.12
    ···

Blocks needed by the algorithm:
- H ∈ R^{P×S} for finding P and S (the factorization)
- Hσ ∈ R^{P×S} for finding Aσ
- h_{λ,S} ∈ R^{1×S} for finding α0
- h_{P,λ} ∈ R^{P×1} for finding α∞
The spectral method. Input:
- Desired number of states n
- Block H ∈ R^{P×S} of the empirical Hankel matrix, together with the blocks Hσ ∈ R^{P×S}, h_{λ,S} ∈ R^{1×S}, h_{P,λ} ∈ R^{P×1}

Steps:
1. Compute the rank-n truncated SVD: H ≈ U_n Λ_n V_n^T
2. Use the factorization H ≈ (U_n Λ_n) · V_n^T = P · S
3. Recover the WFA from the Hankel blocks:
   Aσ = (U_n Λ_n)^+ Hσ V_n = Λ_n^{−1} U_n^T Hσ V_n
   α0^T = h_{λ,S} V_n
   α∞ = (U_n Λ_n)^+ h_{P,λ} = Λ_n^{−1} U_n^T h_{P,λ}
Computational complexity:
- Empirical Hankel matrix: O(|PS| · N)
- SVD and linear algebra: O(|P| · |S| · n)
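The following is a minimal numpy sketch of the three steps above; the function name and argument layout are our own, and the blocks are assumed to be estimated as in the previous slides, with Hsig[σ] holding the entries f̂(p·σ·s).

```python
# A minimal sketch of the spectral method: truncated SVD plus pseudo-inverses.
import numpy as np

def spectral_wfa(H, Hsig, h_lS, h_Pl, n):
    """H: |P| x |S| Hankel block; Hsig: dict sym -> |P| x |S| shifted block;
    h_lS: (1, |S|) row for prefix lambda; h_Pl: (|P|, 1) column for suffix lambda."""
    U, lam, Vt = np.linalg.svd(H, full_matrices=False)
    U, lam, V = U[:, :n], lam[:n], Vt[:n, :].T        # rank-n truncation
    P_fac = U * lam                                   # P = U Lambda, so H ~ P V^T
    P_pinv = np.linalg.pinv(P_fac)                    # equals Lambda^{-1} U^T here
    A = {sym: P_pinv @ Hs @ V for sym, Hs in Hsig.items()}  # A_sym = P^+ H_sym V
    a0 = (h_lS @ V).ravel()                           # alpha_0^T = h_{lambda,S} V
    ainf = (P_pinv @ h_Pl).ravel()                    # alpha_inf = P^+ h_{P,lambda}
    return a0, A, ainf
```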
Why does this work?
- By the law of large numbers, Ĥ converges to E[H]
- If E[H] is the Hankel matrix of some WFA A, then the learned WFA Â converges to (an equivalent parametrization of) A
- Works for data coming from PFA and HMM
- With high probability, ‖Ĥ − E[H]‖ is small and decreases with the sample size N
- When N ≥ O(n |Σ|² T⁴ / (ε² s_n(H)⁴)), where s_n(H) denotes the n-th singular value of H, then
  Σ_{|x| ≤ T} |f_Â(x) − f_A(x)| ≤ ε
Proofs can be found in [Hsu et al., 2009, Bailly, 2011, Balle, 2013].
Data → Hankel matrix → WFA
- Data are strings sampled from a probability distribution on Σ*
- The Hankel matrix is estimated by empirical probabilities
- Factorization and low-rank approximation are computed using SVD

Practical issues:
- Choice of the parameters P and S
- Scalable estimation and factorization of Hankel matrices
- Smoothing and variance normalization
- Use of prefix and substring statistics
Choosing the basis (P, S):
- The basis should be chosen such that E[H] has full rank
- P must contain strings reaching each possible state of the WFA
- S must contain strings producing different outcomes for each pair of states

Heuristics:
- Set P = S = Σ^{≤k} for some k ≥ 1 [Hsu et al., 2009]
- Choose P and S to contain the K most frequent prefixes and suffixes
- Take all prefixes and suffixes appearing in the sample [Bailly et al., 2009]
Scaling up:
- Use hash functions to map P (S) to row (column) indices
- Use sparse matrix data structures, because the statistics are usually sparse
- Never store the full Hankel matrix in memory
- SVD for sparse matrices [Berry, 1992]
- Approximate randomized SVD [Halko et al., 2011]
- On-line SVD with rank-1 updates [Brand, 2006]
Smoothing:
- Empirical probabilities f̂ are noisy, especially for long, rare strings
- As in n-gram models, smoothing can help when Σ is large
- It should take into account that strings in P·S have different lengths
- Open problem: how to smooth empirical Hankel matrices properly

Variance normalization:
- More frequent prefixes (suffixes) have better-estimated rows (columns)
- Rows and columns can be scaled to reflect that
- This leads to more reliable SVD decompositions
- See [Cohen et al., 2013] for details
The same recipe applies to statistics other than string probabilities. With string probabilities, f̂(x) = (1/N) Σ_{i=1}^N 1[x_i = x], the sample

S = {aa, b, bab, a, bbab, abb, babba, abbb, ab, a, aabba, baa, abbab, baba, bb, a}

gives a sparse block (rows P = {λ, a, b, ba}, columns S = {a, b}):

          a     b
    λ    .19   .06
    a    .06   .06
    b    .00   .06
    ba   .06   .06

With expected substring counts, Ê[x] = (1/N) Σ_{i=1}^N (number of occurrences of x in x_i), the same sample gives a denser block:

          a     b
    λ    1.31  1.56
    a    .19   .62
    b    .56   .50
    ba   .06   .31
- Substring statistics can work with smaller Hankel matrices
- But estimating the matrix takes longer
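A sketch of the substring-expectation statistics from the example above; `occ` counts possibly overlapping occurrences.

```python
# Sketch: Hankel block from expected substring counts, for the sample above.
import numpy as np

sample = ["aa", "b", "bab", "a", "bbab", "abb", "babba", "abbb",
          "ab", "a", "aabba", "baa", "abbab", "baba", "bb", "a"]

def occ(x, s):
    # number of (possibly overlapping) occurrences of x in s
    return sum(s[i:i + len(x)] == x for i in range(len(s) - len(x) + 1))

def expected_count(x):
    return sum(occ(x, s) for s in sample) / len(sample)

P, S = ["", "a", "b", "ba"], ["a", "b"]
H_hat = np.array([[expected_count(p + s) for s in S] for p in P])
print(H_hat)   # first row ~ [1.31, 1.56], matching the table above
```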
[Plot: word error rate (%) vs. number of states (10–50) for spectral models with different bases (Σ basis; k = 25, 50, 100, 300, 500), against unigram and bigram baselines.]
- PTB sequences of simplified PoS tags [Petrov et al., 2012]
- Configuration: expectations on frequent substrings
- Metric: error rate on predicting the next symbol in test sequences
[Plot: word error rate (%) vs. number of states for the spectral method (Σ basis; k = 500) against EM, unigram, and bigram baselines.]
- Comparison with a bigram baseline and EM
- Metric: error rate on predicting the next symbol in test sequences
- At training time, the spectral method is over 100× faster than EM
Many applications involve pairs of input-output sequences:
- Sequence tagging (one output tag per input token), e.g. part-of-speech tagging
- Transductions (sequence lengths might differ), e.g. spelling correction
Finite-state automata are classic methods to model these relations.

Notation:
- Input alphabet X, output alphabet Y, joint alphabet Σ = X × Y

Goal: map input sequences to output sequences of the same length.
Approach: learn a function f(x, y) and predict g(x) = argmax_{y ∈ Y^T} f(x, y)
(note: this maximization is not tractable in general)
Notation:
- X × Y: joint alphabet – finite set
- n: number of states – positive integer
- α0: initial weights – vector in R^n (features of the empty prefix)
- α∞: final weights – vector in R^n (features of the empty suffix)
- A^b_a: transition weights – matrix in R^{n×n} (for all a ∈ X, b ∈ Y)

Definition: a WFTagger with n states over X × Y is A = ⟨α0, α∞, {A^b_a}⟩.
Compositional function: every WFTagger defines

f_A(x, y) = α0^T A^{y1}_{x1} ··· A^{yT}_{xT} α∞ = α0^T A^y_x α∞
Learning taggers follows the same recipe: Data → Hankel matrix → WFA.
- Assume f(x, y) = P(x, y)
- Same mechanics as for WFA, with Σ = X × Y
- In a nutshell: choose prefixes and suffixes → in this case they are bistrings; then estimate the Hankel blocks and compute the operators ⟨α0, α∞, {Aσ}⟩

Other cases:
- f_A(x, y) = P(y | x) – see [Balle et al., 2011]
- f_A(x, y) non-probabilistic – see [Quattoni et al., 2014]
Prediction with a WFTagger. Assume f_A(x, y) = P(x, y). Given x_{1:T}, compute the most likely output tag at position t: argmax_{a ∈ Y} µ(t, a), where

µ(t, a) ≜ P(y_t = a | x) ∝ Σ_{y = y1···a···yT} P(x, y) = Σ_{y = y1···a···yT} α0^T A^y_x α∞

Pushing the sums inside the product of operators:

µ(t, a) ∝ α0^T ( Σ_{y1···y_{t−1}} A^{y_{1:t−1}}_{x_{1:t−1}} ) A^a_{x_t} ( Σ_{y_{t+1}···y_T} A^{y_{t+1:T}}_{x_{t+1:T}} ) α∞ = α*_A(x_{1:t−1}) A^a_{x_t} β*_A(x_{t+1:T})

The forward and backward vectors admit the recursions:

α*_A(x_{1:t}) = α*_A(x_{1:t−1}) ( Σ_{b∈Y} A^b_{x_t} )
β*_A(x_{t:T}) = ( Σ_{b∈Y} A^b_{x_t} ) β*_A(x_{t+1:T})
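A sketch of these recursions in numpy; `A[(a, b)]` holds a hypothetical operator A^b_a for input a and output b, and scores are assumed strictly positive so the normalization is well defined.

```python
# Sketch: per-position tag marginals via the forward/backward recursions above.
import numpy as np

def tag_marginals(a0, A, ainf, x, Y):
    T = len(x)
    M = [sum(A[(x[t], b)] for b in Y) for t in range(T)]   # M_t = sum_b A_{x_t}^b
    fwd = [a0]                                  # fwd[t] = alpha*_A(x_{1:t})
    for t in range(T):
        fwd.append(fwd[-1] @ M[t])
    bwd = [ainf]
    for t in reversed(range(T)):
        bwd.append(M[t] @ bwd[-1])
    bwd.reverse()                               # bwd[t] = beta*_A(x_{t+1:T})
    mu = np.array([[fwd[t] @ A[(x[t], a)] @ bwd[t + 1] for a in Y]
                   for t in range(T)])
    return mu / mu.sum(axis=1, keepdims=True)   # rows are P(y_t = . | x)
```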
- Given x_{1:T}, the most likely output bigram ab at position t:
  argmax_{a,b ∈ Y} α*_A(x_{1:t−1}) A^a_{x_t} A^b_{x_{t+1}} β*_A(x_{t+2:T})
- Computing the most likely full sequence argmax_{y ∈ Y^T} f_A(x, y) is intractable
A WFTransducer evaluates aligned strings; e.g., for the alignment a/c · ǫ/d · b/e of input ab with output cde:

f_A(a/c · ǫ/d · b/e) = α0^T A^c_a A^d_ǫ A^e_b α∞

A function g on unaligned strings is then defined by summing over all alignments:

g(ab, cde) = Σ_{π ∈ Π(ab, cde)} f_A(π)

Prediction: given an FST A, how to ...
- Compute g(x, y) for unaligned strings? → using edit-distance recursions (a sketch follows this list)
- Compute marginal quantities µ(edge) = P(edge | x)? → also using edit-distance recursions
- Compute the most-likely y for a given x? → use MBR-decoding with marginal scores
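Here is a hedged sketch of such an edit-distance-style recursion, assuming operators `A[(a, b)]` for substitutions, `A[(EPS, b)]` for insertions, and `A[(a, EPS)]` for deletions (the key scheme is our own naming).

```python
# Sketch of g(x, y) = sum over alignments, via an edit-distance-style DP.
import numpy as np

EPS = ""

def g_unaligned(a0, A, ainf, x, y):
    Tx, Ty, n = len(x), len(y), len(a0)
    F = [[None] * (Ty + 1) for _ in range(Tx + 1)]
    F[0][0] = a0                           # forward row vectors over the lattice
    for i in range(Tx + 1):
        for j in range(Ty + 1):
            if i == j == 0:
                continue
            v = np.zeros(n)
            if i > 0 and j > 0:
                v = v + F[i-1][j-1] @ A[(x[i-1], y[j-1])]   # substitution a/b
            if j > 0:
                v = v + F[i][j-1] @ A[(EPS, y[j-1])]        # insertion eps/b
            if i > 0:
                v = v + F[i-1][j] @ A[(x[i-1], EPS)]        # deletion a/eps
            F[i][j] = v
    return float(F[Tx][Ty] @ ainf)         # sums f_A over all alignments
```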
Unsupervised learning: learn an FST from pairs of unaligned strings.
- Unlike with EM, the spectral method cannot recover latent structure such as alignments (recall: alignments are needed to estimate Hankel entries)
- See [Bailly et al., 2013b] for a solution based on Hankel matrix completion
Some references on spectral learning for trees and grammars:
- Tree series: [Bailly et al., 2010]
- Latent-annotated PCFG: [Cohen et al., 2012, Cohen et al., 2013]
- Dependency parsing: [Luque et al., 2012, Dhillon et al., 2012]
- Unsupervised learning of WCFG: [Bailly et al., 2013a, Parikh et al., 2014]
- Synchronous grammars: [Saluja et al., 2014]
[Figure: the string a b a c c b b split at several positions into a prefix (e.g. a b) and a suffix (e.g. a c c b b); the analogous decomposition for trees splits a tree into an outside part and an inside part.]

Note: inside-outside composition generalizes the notion of concatenation in strings, i.e., outside trees are prefixes, inside trees are suffixes.
- {Σk} = {Σ0, Σ1, ..., Σr} – ranked alphabet
- T – space of labeled trees over some ranked alphabet
- t = ⟨V, E, l⟩ ∈ T: a labeled tree
  - V = {1, ..., m}: the set of vertices
  - E = {⟨i, j⟩}: the set of edges, forming a tree
  - l(v) → {Σk}: returns the label of v (i.e. a symbol in {Σk})
Tensor products:
- For v1 ∈ R^n and v2 ∈ R^n, v1 ⊗ v2 ∈ R^{n²} contains all products between elements of v1 and v2
- Example: v1 = [a, b], v2 = [c, d], v1 ⊗ v2 = [ac, ad, bc, bd]

Setup:
- We consider trees with maximum arity 2
- We think of matrices and tensors as functions:
  - Vectors v ∈ R^n
  - Matrices A1 ∈ R^{n×n}: take one vector v ∈ R^n and produce another vector A1 v ∈ R^n
  - Tensors A2 ∈ R^{n×n²}: take two vectors v1, v2 ∈ R^n and produce another vector A2 (v1 ⊗ v2) ∈ R^n
Definition: a WFTA with n states is A = ⟨α*, {βσ}, {A1_σ}, {A2_σ}⟩:
- n: number of states – positive integer
- α* ∈ R^n: root weights
- βσ ∈ R^n: leaf weights (for all σ ∈ Σ0)
- A1_σ ∈ R^{n×n}: node weights (for all σ ∈ Σ1)
- A2_σ ∈ R^{n×n²}: node weights (for all σ ∈ Σ2)
- Note: A2_σ is a tensor in R^{n×n×n} packed as a matrix

Inside vectors βA(t) are defined recursively:
- if t = σ is a leaf: βA(t) = βσ
- if t = σ[t1] results from a unary node: βA(t) = A1_σ βA(t1)
- if t = σ[t1, t2] results from a binary node: βA(t) = A2_σ (βA(t1) ⊗ βA(t2))
and fA(t) = α*^T βA(t).

Worked example, for t = a[b, a[c[b], c]]:

fA(t) = α*^T A2_a ( βb ⊗ A2_a ( A1_c(βb) ⊗ βc ) )
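A sketch of this recursion in numpy, with trees encoded as nested tuples (our own encoding), run on the worked example above with hypothetical random parameters.

```python
# Sketch: bottom-up WFTA evaluation. Trees are nested tuples: a leaf is a
# symbol, ("sig", t1) is a unary node, ("sig", t1, t2) is a binary node.
import numpy as np

def beta(A, t):
    if isinstance(t, str):                       # leaf: beta_sigma
        return A["leaf"][t]
    if len(t) == 2:                              # unary: A1_sigma beta(t1)
        return A["A1"][t[0]] @ beta(A, t[1])
    return A["A2"][t[0]] @ np.kron(beta(A, t[1]), beta(A, t[2]))  # binary

def f_wfta(A, t):
    return float(A["root"] @ beta(A, t))         # f_A(t) = alpha_*^T beta_A(t)

# The worked example: t = a[b, a[c[b], c]], with random toy parameters.
n, rng = 2, np.random.default_rng(0)
A = {"root": rng.random(n),
     "leaf": {"b": rng.random(n), "c": rng.random(n)},
     "A1":   {"c": rng.random((n, n))},
     "A2":   {"a": rng.random((n, n * n))}}
t = ("a", "b", ("a", ("c", "b"), "c"))
print(f_wfta(A, t))
```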
- fA(t) = α*^T βA(t)
- Each labeled node v is decorated with a latent variable hv ∈ [n]
- fA(t) is a sum–product computation (a(v) denotes the arity of v, and c(t,v), c1(t,v), c2(t,v) its children):

fA(t) = Σ_{h0, h1, ..., h|V| ∈ [n]} α*[h0] ( Π_{v∈V : a(v)=0} β_{l(v)}[hv] ) × ( Π_{v∈V : a(v)=1} A1_{l(v)}[hv, h_{c(t,v)}] ) × ( Π_{v∈V : a(v)=2} A2_{l(v)}[hv, h_{c1(t,v)}, h_{c2(t,v)}] )

- fA(t) is a linear model in the latent space defined by βA : T → R^n:

fA(t) = Σ_{i=1}^n α*[i] βA(t)[i]
Inside and outside trees:
- Inside tree t[v]: the subtree of t rooted at v; t[v] ∈ T
- Outside tree t\v: the rest of t when removing t[v]
- T*: the space of outside trees, i.e. t\v ∈ T*
- Foot node *: a tree insertion point (a special symbol * ∉ {Σk}); an outside tree has exactly one foot node among its leaves
- A tree is formed by composing an outside tree with an inside tree: t = to ⊙ ti
- There are multiple ways to decompose a full tree into an inside/outside pair
- Outside trees t* ∈ T* are defined recursively using compositions: t* = to ⊙ σ[*, ti] or t* = to ⊙ σ[ti, *]

Outside vectors αA(t*) are defined recursively:
- if t* = * is a foot node: αA(t*) = α*
- if t* = to ⊙ σ[*] results from a unary node: αA(t*)^T = αA(to)^T A1_σ
- if t* = to ⊙ σ[ti, *] results from a binary node: αA(t*)^T = αA(to)^T A2_σ (βA(ti) ⊗ I_n)
  (note: similar expression, with I_n ⊗ βA(ti), for t* = to ⊙ σ[*, ti])

- We can isolate the αA and βA vector spaces
- Given αA and βA, we can isolate the operators Ak_σ, since fA(to ⊙ σ[t1, t2]) = αA(to)^T A2_σ (βA(t1) ⊗ βA(t2))
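A matching sketch for the outside recursion, reusing `beta` and the parameter dict from the previous block; the top-down list encoding of outside trees is our own.

```python
# Sketch of the outside recursion. An outside tree is encoded top-down as a
# list of steps from the root toward the foot node.
import numpy as np

def alpha(A, steps):
    a = A["root"]                        # t* = foot node at the root: alpha_*
    I = np.eye(len(a))
    for step in steps:
        if step[0] == "u":               # unary sigma[*]: alpha^T A1_sigma
            a = a @ A["A1"][step[1]]
        elif step[0] == "right":         # sigma[t_i, *]: foot in the right child
            _, sigma, ti = step
            a = a @ A["A2"][sigma] @ np.kron(beta(A, ti), I)
        else:                            # sigma[*, t_i]: foot in the left child
            _, sigma, ti = step
            a = a @ A["A2"][sigma] @ np.kron(I, beta(A, ti))
    return a
```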
Two views of a tree function:
- Functional: fA : T → R
- Matricial: H_{fA} ∈ R^{|T*|×|T|}, with rows indexed by outside trees and columns by inside trees

[Matrix: the slides show a small example block with integer entries such as 1, −1, 2, 3 in the row of the trivial outside tree *.]

- Definition: Hf(to, ti) = f(to ⊙ ti)
- Sub-block for σ: entries f(to ⊙ σ[t1, t2]), with columns indexed by pairs (t1, t2)
- The representation is highly redundant
If fA is computed by a WFTA with n states, the Hankel blocks factorize through R^n. Let O ∈ R^{T*×n} have rows αA(to)^T and I ∈ R^{n×T} have columns βA(ti). For the binary sub-block Hσ (rows indexed by outside trees to, columns by pairs of inside trees (t1, t2)):

f(to ⊙ σ[t1, t2]) = αA(to)^T A2_σ (βA(t1) ⊗ βA(t2))

where A2_σ (βA(t1) ⊗ βA(t2)) = βA(ti) for ti = σ[t1, t2]. In matrix form, Hσ = O · A2_σ · [I ⊗ I], and the operator can be recovered as

A2_σ = O^+ Hσ [I ⊗ I]^+   (assuming rank(O) = rank(I) = n)
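The recovery step as a short numpy sketch, assuming the factors O and I are given (e.g. from an SVD of the plain Hankel block); names are our own.

```python
# Sketch: recovering the binary operator from tree-Hankel blocks.
import numpy as np

def recover_A2(O, Ins, H_sigma):
    """O: |To| x n outside factor; Ins: n x |Ti| inside factor;
    H_sigma: |To| x |Ti|^2 block of f(t_o . sigma[t1, t2]), with the
    column index (t1, t2) flattened."""
    return np.linalg.pinv(O) @ H_sigma @ np.linalg.pinv(np.kron(Ins, Ins))
```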
Grammars:
- Derivation = labeled tree; we learn compositional functions over derivations
- We are interested in functions computed by WFTA
- What is the latent state representing? e.g., latent real-valued embeddings of words and phrases
- What form of supervision do we get?
  - Full derivations (labeled trees), i.e., supervised learning of latent-variable grammars
  - Derivation skeletons (unlabeled trees), e.g. [Pereira and Schabes, 1992]
  - Yields from the grammar (only sentences), i.e., grammar induction
Example: for the parse tree t of "Mary plays the guitar",

fA(t) = α*^T A_S[ A_NP[βMary] ⊗ A_VP[ βplays ⊗ A_NP[βthe ⊗ βguitar] ] ]

- Vectors βσ are associated with terminal symbols
- Matrices and tensors Ak_σ are associated with non-terminals
- The bottom-up computation embeds inside trees into vectors in R^n
WFTA A = ⟨α*, {βw}, {A1_N}, {A2_N}⟩:
- n: number of states, i.e. the dimensionality of the embedding
- Ranked alphabet:
  - Σ0 = {the, Mary, plays, ...} – terminal words
  - Σ1 = {noun, verb, det, NP, VP, ...} – unary non-terminals
  - Σ2 = {S, NP, VP, ...} – binary non-terminals
- α* – initial weights
- {βw} for all w ∈ Σ0 – word embeddings
- {A1_N} for all N ∈ Σ1 – compute embeddings of unary phrases
- {A2_N} for all N ∈ Σ2 – compute embeddings of binary phrases

If t = to ⊙ ti:
- fA(t) = αA(to)^T βA(ti)
- βA(ti): an n-dimensional embedding of the inside tree ti, i.e., it maps inside trees to similar vectors if they are replaceable
- αA(to): an n-dimensional embedding of the outside tree to, i.e., it maps outside trees to similar vectors if they accept similar arguments
[Figure: the parse tree of "Mary plays the guitar" and the corresponding production parse tree, whose nodes are rule productions such as S → NP VP.]

- A production parse tree represents the edges of a parse tree
- WFTA operators are associated with rule productions
- Inside/outside compositions are constrained by the overlapping non-terminal
- The WFTA induces a separate n-dimensional space per non-terminal, i.e. observed non-terminals are refined
- WFTA on production parse trees include:
  - classic WCFG, for n = 1
  - PCFG-LA, for n > 1 [Matsuzaki et al. 2005, Petrov et al. 2006, Cohen et al. 2012]
- WFTA are a general algebraic framework for compositional functions
- WFTA can exploit real-valued embeddings
- There are simple algorithms for learning WFTA from samples
Data → Hankel matrix → WFA: the recipe is not limited to probability distributions. Hankel-based learning also applies to:
- Classification: f : Σ* → {1, −1}
- Unconstrained real-valued predictions: f : Σ* → R
- General scoring functions for tagging: f : (Σ × ∆)* → R
When learning distributions, entries in Hf are estimated from empirical counts, e.g. f(x) = P[x]:

S = {aa, b, bab, a, b, a, ab, aa, ba, b, aa, a, aa, bab, b, aa}  →

          a     b
    λ    .19   .25
    a    .31   .06
    b    .06   .00
    ba   .00   .13

When learning general functions, entries in Hf are labels observed in the sample, and many may be missing:

S = {(bab, 1), (bbb, 0), (aaa, 3), (a, 1), (ab, 1), (aa, 2), (aba, 2), (bb, 0)}  →

          λ    a    b
    a     1    2    1
    b     ?    ?    0
    aa    2    3    ?
    ab    1    2    ?
    ba    ?    ?    1
    bb    0    ?    0

(entries marked ? are unobserved)
[Figure: an n×m matrix of rank r factorizes into an n×r factor times an r×m factor, so it has far fewer degrees of freedom than n·m.]
Completing a low-rank Hankel matrix from partial observations. Possible observations:
- Subset of entries: {H(p, s) | (p, s) ∈ I}
- Linear measurements: {Hv | v ∈ V}
- Bilinear measurements: {u^T H v | u ∈ U, v ∈ V}
- Constraints between entries: {H(p, s) ≥ H(p', s') | (p, s, p', s') ∈ I}
- Noisy versions of all the above

Possible constraints:
- Hankel constraints: H(p, s) = H(p', s') if ps = p's'
- Constraints on entries: |H(p, s)| ≤ C
- Low-rank constraints/regularization on rank(H)
[Balle and Mohri, 2012]
Hankel estimation as an optimization problem:
- Data: {(x_i, y_i)}_{i=1}^N, with x_i ∈ Σ* and y_i ∈ R
- Rows and columns P, S ⊂ Σ*
- (Convex) loss function ℓ : R × R → R
- Regularization parameter λ / rank bound R

Optimize over matrices H ∈ R^{P×S}:

min_{H ∈ R^{P×S}} Σ_{i=1}^N ℓ(H(x_i), y_i) + λ ‖H‖_*   (or subject to rank(H) ≤ R)

where H(x_i) refers to the Hankel entries (p, s) with p·s = x_i, and the nuclear norm ‖H‖_* is a convex surrogate for the rank.
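A hedged sketch of one such solver: proximal gradient with singular value thresholding for a squared loss plus nuclear-norm regularization. For simplicity it ignores the Hankel equality constraints between repeated entries; names and defaults are our own.

```python
# Sketch of proximal gradient with singular value thresholding (SVT) for
#   min_H  1/2 * sum_i (H[r_i, c_i] - y_i)^2 + lam * ||H||_*
import numpy as np

def svt(M, tau):
    """Shrink the singular values of M by tau (prox of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def complete_hankel(shape, rows, cols, vals, lam=0.1, steps=500, lr=0.5):
    H = np.zeros(shape)
    for _ in range(steps):
        G = np.zeros(shape)
        G[rows, cols] = H[rows, cols] - vals     # gradient of the squared loss
        H = svt(H - lr * G, lr * lam)            # proximal (SVT) step
    return H
```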
Solvers:
- Projected/proximal sub-gradient (e.g. [Duchi and Singer, 2009])
- Frank–Wolfe [Jaggi and Sulovský, 2010]
- Singular value thresholding [Cai et al., 2010]
- Alternating minimization (e.g. [Jain et al., 2013])

Applications:
- Max-margin taggers [Quattoni et al., 2014]
- Unsupervised transducers [Bailly et al., 2013b]
- Unsupervised WCFG [Bailly et al., 2013a]
Conclusions:
- Spectral methods provide new tools to learn compositional functions
- Key result: a low-rank factorization of the Hankel matrix yields the operators of the model
- Applicable to a wide range of compositional formalisms: sequences, taggers, transducers, trees, and grammars
- Relation to loss-regularized methods, by means of matrix completion
Albert, J. and Kari, J. (2009). Digital image compression. In Handbook of Weighted Automata.
Baier, C., Größer, M., and Ciesinski, F. (2009). Model checking linear-time properties of probabilistic systems. In Handbook of Weighted Automata.
Bailly, R. (2011). Méthodes spectrales pour l'inférence grammaticale probabiliste de langages stochastiques rationnels. PhD thesis, Aix-Marseille Université.
Bailly, R., Carreras, X., Luque, F., and Quattoni, A. (2013a). Unsupervised spectral learning of WCFG as low-rank matrix completion. In EMNLP.
Bailly, R., Carreras, X., and Quattoni, A. (2013b). Unsupervised spectral learning of finite state transducers. In NIPS.
Bailly, R., Denis, F., and Ralaivola, L. (2009). Grammatical inference as a principal component analysis problem. In ICML.
Bailly, R., Habrard, A., and Denis, F. (2010). A spectral approach for probabilistic grammatical inference on trees. In ALT.
Balle, B. (2013). Learning Finite-State Machines: Algorithmic and Statistical Aspects. PhD thesis, Universitat Politècnica de Catalunya.
Balle, B., Carreras, X., Luque, F., and Quattoni, A. (2014). Spectral learning of weighted automata: A forward-backward perspective. Machine Learning.
Balle, B. and Mohri, M. (2012). Spectral learning of general weighted automata via constrained matrix completion. In NIPS.
Balle, B., Quattoni, A., and Carreras, X. (2011). A spectral learning algorithm for finite state transducers. In ECML-PKDD.
Balle, B., Quattoni, A., and Carreras, X. (2012). Local loss optimization in operator models: A new insight into spectral learning. In ICML.
Berry, M. W. (1992). Large-scale sparse singular value computations. International Journal of Supercomputer Applications.
Brand, M. (2006). Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 415(1):20–30.
Cai, J.-F., Candès, E., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization.
Carlyle, J. W. and Paz, A. (1971). Realizations by stochastic finite automata. Journal of Computer and System Sciences.
Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., and Ungar, L. (2012). Spectral learning of latent-variable PCFGs. In ACL.
Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., and Ungar, L. (2013). Experiments with spectral learning of latent-variable PCFGs. In NAACL-HLT.
de Gispert, A., Iglesias, G., Blackwood, G., Banga, E. R., and Byrne, W. (2010). Hierarchical phrase-based translation with weighted finite-state transducers and shallow-n grammars. Computational Linguistics, 36(3):505–533.
Dhillon, P. S., Rodu, J., Collins, M., Foster, D. P., and Ungar, L. H. (2012). Spectral dependency parsing with latent variables. In EMNLP-CoNLL.
Duchi, J. and Singer, Y. (2009). Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research.
Fliess, M. (1974). Matrices de Hankel. Journal de Mathématiques Pures et Appliquées.
Halko, N., Martinsson, P., and Tropp, J. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288.
Hsu, D., Kakade, S. M., and Zhang, T. (2009). A spectral algorithm for learning hidden Markov models. In COLT.
Jaggi, M. and Sulovský, M. (2010). A simple algorithm for nuclear norm regularized problems. In ICML.
Jain, P., Netrapalli, P., and Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization. In STOC.
Knight, K. and May, J. (2009). Applications of weighted automata in natural language processing. In Handbook of Weighted Automata.
Luque, F., Quattoni, A., Balle, B., and Carreras, X. (2012). Spectral learning in non-deterministic dependency parsing. In EACL.
Mohri, M., Pereira, F. C. N., and Riley, M. (2008). Speech recognition with weighted finite-state transducers. In Handbook on Speech Processing and Speech Communication.
Parikh, A. P., Cohen, S. B., and Xing, E. (2014). Spectral unsupervised parsing with additive tree metrics. In ACL.
Pereira, F. and Schabes, Y. (1992). Inside-outside reestimation from partially bracketed corpora. In ACL.
Petrov, S., Das, D., and McDonald, R. (2012). A universal part-of-speech tagset. In LREC.
Quattoni, A., Balle, B., Carreras, X., and Globerson, A. (2014). Spectral regularization for max-margin sequence tagging. In ICML.
Saluja, A., Dyer, C., and Cohen, S. B. (2014). Latent-variable synchronous CFGs for hierarchical translation. In EMNLP.