Local Loss Optimization in Operator Models: A New Insight into Spectral Learning
Borja Balle, Ariadna Quattoni, Xavier Carreras
ICML 2012, June 2012, Edinburgh
This work is partially supported by the PASCAL2 Network and a Google Research Award
Discrete Homogeneous Hidden Markov Model
[Diagram: hidden state chain Y1 → Y2 → Y3 → Y4 → ⋯ emitting observations X1, X2, X3, X4, ⋯]
• n states: Y_t ∈ {1, ..., n}
• k symbols: X_t ∈ {σ_1, ..., σ_k}
• for now assume n ≤ k
• Forward-backward equations with A_σ ∈ R^{n×n}:

  Pr[X_{1:t} = w] = α_1^T A_{w_1} ⋯ A_{w_t} 1    (1 denotes the all-ones vector)

• Probabilities arranged into matrices H, H_{σ_1}, ..., H_{σ_k} ∈ R^{k×k}:

  H(i, j) = Pr[X_1 = σ_i, X_2 = σ_j]
  H_σ(i, j) = Pr[X_1 = σ_i, X_2 = σ, X_3 = σ_j]
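To make these objects concrete, here is a minimal numpy sketch (mine, not from the talk) that builds observable operators for a toy 2-state, 2-symbol HMM, evaluates the forward equation, and fills in H and H_σ. All numbers, and the convention A_σ = diag(O[:, σ]) T, are illustrative assumptions.

```python
import numpy as np

# Toy HMM with n = 2 states and k = 2 symbols ('a' = 0, 'b' = 1).
# T[i, j] = Pr[Y_{t+1} = j | Y_t = i];  O[i, s] = Pr[X_t = s | Y_t = i].
# All numbers below are invented for illustration.
T = np.array([[0.7, 0.3],
              [0.2, 0.8]])
O = np.array([[0.9, 0.1],
              [0.4, 0.6]])
alpha1 = np.array([0.6, 0.4])            # initial state distribution

# Observable operators: A_sigma = diag(O[:, sigma]) @ T (one common convention).
A = {s: np.diag(O[:, s]) @ T for s in range(2)}

def prob(word):
    """Pr[X_{1:t} = w] = alpha_1^T A_{w_1} ... A_{w_t} 1."""
    v = alpha1
    for s in word:
        v = v @ A[s]
    return float(v @ np.ones(2))

# Arrange bigram and trigram probabilities into H and H_sigma (both k x k).
k = 2
H = np.array([[prob([i, j]) for j in range(k)] for i in range(k)])
Hs = {s: np.array([[prob([i, s, j]) for j in range(k)] for i in range(k)])
      for s in range(k)}
```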
• Spectral learning algorithm for B_σ = Q A_σ Q^{-1}: take V_n to be the top n right singular vectors of Ĥ (via SVD) and set B_σ = (Ĥ V_n)^+ Ĥ_σ V_n.
  (For simplicity, in this talk we ignore learning of initial and final vectors.)
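A sketch of that computation, assuming the H and Hs from the previous block (empirical estimates work in their place); the function name is mine:

```python
import numpy as np

def spectral_method(H, H_sigmas, n):
    """Recover B_sigma = (H V_n)^+ H_sigma V_n, where V_n holds the
    top-n right singular vectors of H."""
    _, _, Vt = np.linalg.svd(H)
    Vn = Vt[:n].T                        # k x n, orthonormal columns
    HVn_pinv = np.linalg.pinv(H @ Vn)    # (H V_n)^+
    return {s: HVn_pinv @ Hs @ Vn for s, Hs in H_sigmas.items()}

# e.g. B = spectral_method(H, Hs, n=2) recovers the operators of the
# toy HMM above up to a change of basis Q.
```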
• Maximum likelihood uses the whole of the sample S = {w^1, ..., w^N} and is always consistent in the realizable case:

  max_{α_1, {A_σ}} (1/N) ∑_{i=1}^{N} log(α_1^T A_{w^i_1} ⋯ A_{w^i_{t_i}} 1)
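For contrast with the local method below, a short sketch (mine) that evaluates this global objective at a given parameter setting, reusing alpha1 and A from the toy HMM sketch above:

```python
import numpy as np

def avg_log_likelihood(sample, alpha1, A):
    """(1/N) * sum_i log(alpha_1^T A_{w_1} ... A_{w_t} 1): the ML
    objective, which touches every full string in the sample."""
    ones = np.ones(len(alpha1))
    total = 0.0
    for w in sample:
        v = alpha1
        for s in w:
            v = v @ A[s]
        total += np.log(v @ ones)
    return total / len(sample)
```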
• The spectral method only uses local information from the sample, through Ĥ, Ĥ_a, Ĥ_b, and its consistency depends on properties of H.

S = {abbabba, aabaa, baaabbbabab, bbaaba, bababbabbaaaba, abbb, ...}
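For example, Ĥ and Ĥ_σ can be estimated by simple counting over the sample; a sketch (this particular estimator is an illustrative choice, not necessarily the paper's exact one):

```python
import numpy as np

def empirical_estimates(sample, alphabet):
    """Hat-H(i, j) ~ Pr[X_1 = s_i, X_2 = s_j] and
    Hat-H_sigma(i, j) ~ Pr[X_1 = s_i, X_2 = sigma, X_3 = s_j]."""
    idx = {s: i for i, s in enumerate(alphabet)}
    k, N = len(alphabet), len(sample)
    H = np.zeros((k, k))
    Hs = {s: np.zeros((k, k)) for s in alphabet}
    for w in sample:
        if len(w) >= 2:
            H[idx[w[0]], idx[w[1]]] += 1.0 / N
        if len(w) >= 3:
            Hs[w[1]][idx[w[0]], idx[w[2]]] += 1.0 / N
    return H, Hs

S = ["abbabba", "aabaa", "baaabbbabab", "bbaaba", "bababbabbaaaba", "abbb"]
H_hat, Hs_hat = empirical_estimates(S, "ab")
```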
Questions
• Is the spectral method minimizing a "local" loss function?
• When does this minimization yield a consistent algorithm?
Outline
• Spectral Learning as Local Loss Optimization
• A Convex Relaxation of the Local Loss
• Choosing a Consistent Local Loss
• Both ingredients in the spectral method have optimization interpretations:

  SVD: min_{V_n^T V_n = I} ‖H V_n V_n^T − H‖_F
  Pseudo-inverse: min_{B_σ} ‖H V_n B_σ − H_σ V_n‖_F

• Can formulate a joint optimization; the spectral method is equivalent to

  min_{{B_σ}, V_n^T V_n = I} ∑_{σ ∈ Σ} ‖H V_n B_σ − H_σ V_n‖_F^2

• Theorem: this optimization is consistent under the same conditions as the spectral method.
• The loss is non-convex, due to the product V_n B_σ and the constraint V_n^T V_n = I.
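Both interpretations are easy to check numerically: for a fixed V_n the loss decouples over σ into least-squares problems, whose pseudo-inverse solutions are exactly the spectral method's second step. A sketch (helper names are mine):

```python
import numpy as np

def local_loss(H, H_sigmas, Vn, B):
    """sum_sigma ||H V_n B_sigma - H_sigma V_n||_F^2."""
    return sum(np.linalg.norm(H @ Vn @ B[s] - Hs @ Vn, 'fro') ** 2
               for s, Hs in H_sigmas.items())

def best_B_given_Vn(H, H_sigmas, Vn):
    """With V_n fixed, the loss decouples over sigma into least-squares
    problems min_B ||(H V_n) B - H_sigma V_n||_F, each solved by the
    pseudo-inverse, i.e. the second step of the spectral method."""
    M = H @ Vn
    return {s: np.linalg.lstsq(M, Hs @ Vn, rcond=None)[0]
            for s, Hs in H_sigmas.items()}
```

Plugging in the V_n from the SVD of H reproduces the spectral method's B_σ.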
Intuition about the Loss Function
• Minimize the ℓ2 norm of the unexplained futures (over a finite set of futures) when a symbol σ is generated and the transition is explained using B_σ, over a finite set of pasts.
• Strongly based on the Markovianity of the process, which generic maximum likelihood does not exploit.
• For algorithmic purposes a convex local loss function is more desirable.
• A relaxation can be obtained by replacing the projection V_n with a regularization term:

  min_{{B_σ}, V_n^T V_n = I} ∑_{σ ∈ Σ} ‖H V_n B_σ − H_σ V_n‖_F^2  ⇝  min_{B_Σ} ‖H B_Σ − H_Σ‖_F^2 + τ ‖B_Σ‖_*

• This optimization is convex and has some interesting theoretical (see paper) and empirical properties.
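A minimal sketch of one standard way to solve such a nuclear-norm-regularized least-squares problem: proximal gradient descent with singular value thresholding (the prox operator of the nuclear norm). The horizontal-stacking convention, step size, and names below are my assumptions, not necessarily the paper's solver.

```python
import numpy as np

def svt(M, t):
    """Singular value thresholding: prox of t * (nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def convex_relaxation(H, H_sigmas, tau, iters=500):
    """Proximal gradient for min_B ||H B - H_Sigma||_F^2 + tau ||B||_*,
    with B = [B_a B_b ...] and H_Sigma = [H_a H_b ...] stacked."""
    H_Sigma = np.hstack(list(H_sigmas.values()))
    B = np.zeros((H.shape[1], H_Sigma.shape[1]))
    eta = 1.0 / (2.0 * np.linalg.norm(H, 2) ** 2)  # 1 / Lipschitz constant
    for _ in range(iters):
        grad = 2.0 * H.T @ (H @ B - H_Sigma)       # gradient of the quadratic
        B = svt(B - eta * grad, eta * tau)          # proximal step
    return B
```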
Performing experiments with synthetic targets, the following facts are observed:
• Tuning the regularization parameter τ, a better trade-off between generalization and model complexity can be achieved.
• The largest gains from the convex relaxation are attained on targets that are supposedly hard for the spectral method.
[Plot: L1 error vs. regularization parameter τ, for SVD with n = 1, ..., 5 and the convex optimization (CO).]
[Plot: L1 error vs. minimum singular value of the target model, for SVD and CO, with their difference.]
For any function f : Σ* → R, its Hankel matrix H_f ∈ R^{Σ* × Σ*} is defined as H_f(p, s) = f(p · s)
        λ      a      b      aa     ab     ⋯
  λ     1      0.3    0.7    0.05   0.25   ⋯
  a     0.3    0.05   0.25   0.02   0.03   ⋯
  b     0.7    0.6    0.1    0.03   0.2    ⋯
  aa    0.05   0.02   0.03   0.017  0.003  ⋯
  ab    0.25   0.23   0.02   0.11   0.12   ⋯

(The slide highlights two finite sub-blocks of this infinite matrix: H and H_a.)
• Blocks defined by sets of rows (prefixes P) and columns (suffixes S).
• Can parametrize the spectral method by P and S, taking H ∈ R^{P×S}.
• Each pair (P, S) defines a different local loss function.
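A sketch of building the block H ∈ R^{P×S} for a chosen pair (P, S), reusing values from the table above (helper names are mine):

```python
import numpy as np

def hankel_block(f, prefixes, suffixes):
    """H(p, s) = f(p . s), restricted to rows P and columns S."""
    return np.array([[f(p + s) for s in suffixes] for p in prefixes])

# Values of f taken from the table above; unlisted strings default to 0.
vals = {"": 1.0, "a": 0.3, "b": 0.7, "aa": 0.05, "ab": 0.25,
        "ba": 0.6, "bb": 0.1, "aaa": 0.02, "aab": 0.03}
H = hankel_block(lambda w: vals.get(w, 0.0),
                 prefixes=["", "a", "b"], suffixes=["", "a", "b"])
```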
Theorem (Schützenberger '61): rank(H_f) = n iff f can be computed by an operator model with n states (and by none with fewer).
Consequences
• The spectral method is consistent iff rank(H) = rank(H_f) = n.
• There always exist P and S with |P| = |S| = n such that rank(H) = n.
Trade-off
• Larger P and S are more likely to give rank(H) = n, but also require larger samples for a good estimate Ĥ.

Question
• Given a sample, how to choose good P and S?

Answer
• Random sampling succeeds w.h.p. with |P| and |S| depending polynomially on the complexity of the target.
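One illustrative instantiation of this strategy, drawing P and S uniformly from the prefixes and suffixes occurring in the sample (the paper's analysis specifies the actual sampling distribution and sizes):

```python
import random

def random_basis(sample, size, seed=0):
    """Draw candidate prefix and suffix sets at random from the
    substrings occurring in the sample."""
    rng = random.Random(seed)
    prefixes = sorted({w[:i] for w in sample for i in range(len(w) + 1)})
    suffixes = sorted({w[i:] for w in sample for i in range(len(w) + 1)})
    P = rng.sample(prefixes, min(size, len(prefixes)))
    S = rng.sample(suffixes, min(size, len(suffixes)))
    return P, S
```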