SLIDE 1

Local Loss Optimization in Operator Models: A New Insight into Spectral Learning

Borja Balle, Ariadna Quattoni, Xavier Carreras

ICML 2012 June 2012, Edinburgh

This work is partially supported by the PASCAL2 Network and a Google Research Award

SLIDE 2

A Simple Spectral Method [HKZ09]

Discrete Homogeneous Hidden Markov Model

[Graphical model: hidden state chain Y1 → Y2 → Y3 → Y4 → ⋯ emitting observations X1, X2, X3, X4]

• n states: $Y_t \in \{1, \ldots, n\}$
• k symbols: $X_t \in \{\sigma_1, \ldots, \sigma_k\}$
• for now, assume $n \le k$
• Forward-backward equations with $A_\sigma \in \mathbb{R}^{n \times n}$:

  $\Pr[X_{1:t} = w] = \alpha_1^\top A_{w_1} \cdots A_{w_t} \alpha_\infty$

• Probabilities arranged into matrices $H, H_{\sigma_1}, \ldots, H_{\sigma_k} \in \mathbb{R}^{k \times k}$:

  $H(i, j) = \Pr[X_1 = \sigma_i, X_2 = \sigma_j]$
  $H_\sigma(i, j) = \Pr[X_1 = \sigma_i, X_2 = \sigma, X_3 = \sigma_j]$

• Spectral learning algorithm for $B_\sigma = Q A_\sigma Q^{-1}$:
  1. Compute the SVD $H = U D V^\top$ and take the top n right singular vectors $V_n$
  2. Set $B_\sigma = (H V_n)^+ H_\sigma V_n$

(For simplicity, in this talk we ignore learning of initial and final vectors)
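
The two steps above translate almost directly into code. Below is a minimal numpy sketch, assuming the matrices H and H_σ are already available (for instance, estimated from data as discussed on the next slide); the function name `spectral_operators` is ours, not from the paper.

```python
import numpy as np

def spectral_operators(H, H_sigmas, n):
    """Two-step spectral method: SVD of H, then a pseudo-inverse
    (least-squares) step for each observable operator B_sigma.

    H        : (k, k) matrix of pair probabilities H(i, j)
    H_sigmas : dict mapping each symbol sigma to its (k, k) matrix H_sigma
    n        : number of hidden states
    """
    # Step 1: top-n right singular vectors of H.
    _, _, Vt = np.linalg.svd(H)
    Vn = Vt[:n].T                      # shape (k, n)

    # Step 2: B_sigma = (H Vn)^+ H_sigma Vn for every symbol.
    HVn_pinv = np.linalg.pinv(H @ Vn)  # shape (n, k)
    return {s: HVn_pinv @ Hs @ Vn for s, Hs in H_sigmas.items()}
```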

SLIDE 3

A Local Approach to Learning?

• Maximum likelihood uses the whole of the sample $S = \{w^1, \ldots, w^N\}$ and is always consistent in the realizable case:

  $\max_{\alpha_1, \{A_\sigma\}} \; \frac{1}{N} \sum_{i=1}^{N} \log\left( \alpha_1^\top A_{w^i_1} \cdots A_{w^i_{t_i}} \alpha_\infty \right)$

• The spectral method only uses local information from the sample, through the estimates $\hat{H}, \hat{H}_a, \hat{H}_b$, and its consistency depends on properties of $H$

  S = {abbabba, aabaa, baaabbbabab, bbaaba, bababbabbaaaba, abbb, ...}
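
To make "local information" concrete, here is one plausible way to form the empirical estimates Ĥ and Ĥ_σ from such a sample, by counting the first two or three symbols of each string as in the definitions on the previous slide. This is a sketch under our own assumptions (in particular the naive normalization); the paper's estimators may differ in detail.

```python
import numpy as np

def estimate_matrices(sample, alphabet):
    """Empirical estimates of H and H_sigma from a list of strings,
    using the first two / three symbols of each string."""
    k = len(alphabet)
    idx = {a: i for i, a in enumerate(alphabet)}
    H = np.zeros((k, k))
    H_sigmas = {a: np.zeros((k, k)) for a in alphabet}
    for w in sample:
        if len(w) >= 2:
            H[idx[w[0]], idx[w[1]]] += 1
        if len(w) >= 3:
            H_sigmas[w[1]][idx[w[0]], idx[w[2]]] += 1
    N = len(sample)  # naive normalization by sample size
    return H / N, {a: Hs / N for a, Hs in H_sigmas.items()}

# estimate_matrices(["abbabba", "aabaa", "baaabbbabab", "bbaaba"], "ab")
```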

Questions

• Is the spectral method minimizing a “local” loss function?
• When does this minimization yield a consistent algorithm?

SLIDE 4

Outline

• Spectral Learning as Local Loss Optimization
• A Convex Relaxation of the Local Loss
• Choosing a Consistent Local Loss

SLIDE 5

Loss Function of the Spectral Method

• Both ingredients in the spectral method have optimization interpretations:

  SVD: $\min_{V_n^\top V_n = I} \|H V_n V_n^\top - H\|_F$

  Pseudo-inverse: $\min_{B_\sigma} \|H V_n B_\sigma - H_\sigma V_n\|_F$

• Can formulate a joint optimization for the spectral method:

  $\min_{\{B_\sigma\},\; V_n^\top V_n = I} \; \sum_{\sigma \in \Sigma} \|H V_n B_\sigma - H_\sigma V_n\|_F^2$
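
For a fixed $V_n$, the inner minimization over each $B_\sigma$ is an ordinary least-squares problem, which is why the pseudo-inverse step of the algorithm solves it exactly. A standard derivation (our addition) via the normal equations:

```latex
\min_{B_\sigma} \|H V_n B_\sigma - H_\sigma V_n\|_F^2
\;\Rightarrow\;
(H V_n)^\top (H V_n)\, B_\sigma = (H V_n)^\top H_\sigma V_n
\;\Rightarrow\;
B_\sigma = (H V_n)^+\, H_\sigma V_n .
```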

SLIDE 6

Properties of the Spectral Optimization

$\min_{\{B_\sigma\},\; V_n^\top V_n = I} \; \sum_{\sigma \in \Sigma} \|H V_n B_\sigma - H_\sigma V_n\|_F^2$

• Theorem: the optimization is consistent under the same conditions as the spectral method

• The loss is non-convex due to the product $V_n B_\sigma$ and the constraint $V_n^\top V_n = I$
• The spectral method is equivalent to:
  1. Choosing $V_n$ using the SVD
  2. Optimizing $\{B_\sigma\}$ with $V_n$ fixed

Intuition about the Loss Function

• Minimize the ℓ2 norm of the unexplained futures (over a finite set of futures) when a symbol σ is generated and the transition is explained using $B_\sigma$ (over a finite set of pasts)
• Strongly based on the Markovianity of the process, which generic maximum likelihood does not exploit

SLIDE 7

A Convex Relaxation of the Local Loss

• For algorithmic purposes, a convex local loss function is more desirable
• A relaxation can be obtained by replacing the projection $V_n$ in

  $\min_{\{B_\sigma\},\; V_n^\top V_n = I} \; \sum_{\sigma \in \Sigma} \|H V_n B_\sigma - H_\sigma V_n\|_F^2$

  with a regularization term:
  1. fix $n = |S|$ and take $V_n = I$
  2. stack $B_\Sigma = [B_{\sigma_1} | \cdots | B_{\sigma_k}]$ and $H_\Sigma = [H_{\sigma_1} | \cdots | H_{\sigma_k}]$
  3. regularize via the nuclear norm to emulate $V_n$

  $\min_{B_\Sigma} \|H B_\Sigma - H_\Sigma\|_F^2 + \tau \|B_\Sigma\|_*$

• This optimization is convex and has some interesting theoretical (see paper) and empirical properties
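
The nuclear-norm objective has no closed-form minimizer, but it is a standard composite convex problem. One generic way to solve it (our choice, not necessarily the solver used in the paper) is proximal gradient descent, whose proximal step is singular value soft-thresholding. A minimal sketch, with a fixed step size and no stopping heuristics:

```python
import numpy as np

def svt(M, t):
    """Prox of t * nuclear norm: soft-threshold the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def convex_local_loss(H, H_Sigma, tau, steps=500):
    """Solve min_B ||H B - H_Sigma||_F^2 + tau ||B||_* by proximal gradient.

    H       : (p, s) empirical Hankel block
    H_Sigma : (p, s*k) horizontal stack [H_sigma_1 | ... | H_sigma_k]
    """
    B = np.zeros((H.shape[1], H_Sigma.shape[1]))
    # Gradient of the smooth part is 2 H^T (H B - H_Sigma);
    # its Lipschitz constant is 2 ||H||_2^2.
    L = 2.0 * np.linalg.norm(H, 2) ** 2
    for _ in range(steps):
        grad = 2.0 * H.T @ (H @ B - H_Sigma)
        B = svt(B - grad / L, tau / L)
    return B
```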

SLIDE 8

Experimental Results with the Convex Local Loss

Experiments with synthetic targets show the following:

• By tuning the regularization parameter τ, a better trade-off between generalization and model complexity can be achieved
• The largest gains from the convex relaxation are attained on targets that are supposedly hard for the spectral method

[Plots: L1 error vs. regularization parameter τ for the convex optimization (CO) and for the SVD method with n = 1, ..., 5; L1 error vs. minimum singular value of the target model for SVD, CO, and their difference]

SLIDE 9

The Hankel Matrix

For any function $f : \Sigma^* \to \mathbb{R}$, its Hankel matrix $H_f \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$ is defined as $H_f(p, s) = f(p \cdot s)$

        λ      a      b      aa     ab     ⋯
  λ     1      0.3    0.7    0.05   0.25   ⋯
  a     0.3    0.05   0.25   0.02   0.03   ⋯
  b     0.7    0.6    0.1    0.03   0.2    ⋯
  aa    0.05   0.02   0.03   0.017  0.003  ⋯
  ab    0.25   0.23   0.02   0.11   0.12   ⋯
  ⋮     ⋮      ⋮      ⋮      ⋮      ⋮      ⋱

(In the slide, two sub-blocks of this matrix are highlighted as H and H_a.)

• Blocks are defined by sets of rows (prefixes P) and columns (suffixes S)
• The spectral method can be parametrized by P and S, taking $H \in \mathbb{R}^{P \times S}$
• Each pair (P, S) defines a different local loss function
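
As a small illustration (our own, not from the slides), a finite block of the Hankel matrix can be built directly from its definition, given a function f and chosen prefix and suffix sets:

```python
import numpy as np

def hankel_block(f, prefixes, suffixes):
    """Finite Hankel block H(p, s) = f(p . s) over rows P and columns S."""
    return np.array([[f(p + s) for s in suffixes] for p in prefixes])

# Using the values from the table (the empty string plays the role of lambda):
# f_table = {"": 1, "a": 0.3, "b": 0.7, "aa": 0.05, "ab": 0.25}
# H = hankel_block(lambda w: f_table.get(w, 0.0), ["", "a"], ["", "a", "b"])
```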

SLIDE 10

Consistency of the Local Loss

Theorem (Schützenberger '61): $\mathrm{rank}(H_f) = n$ iff f can be computed with operators $A_\sigma \in \mathbb{R}^{n \times n}$

Consequences

• The spectral method is consistent iff $\mathrm{rank}(H) = \mathrm{rank}(H_f) = n$
• There always exist P and S with $|P| = |S| = n$ and $\mathrm{rank}(H) = n$

Trade-off

• Larger P and S make $\mathrm{rank}(H) = n$ more likely, but also require larger samples for a good estimate $\hat{H}$

Question

➓ Given a sample, how to choose good P and S?

Answer

• Random sampling succeeds w.h.p. with |P| and |S| depending polynomially on the complexity of the target
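
As a rough illustration only (the paper's sampling scheme and its guarantees are more precise than this), one could draw candidate prefixes and suffixes uniformly from the substrings observed in the sample, then check whether the induced block reaches the desired rank:

```python
import numpy as np

def random_basis(sample, size, seed=0):
    """Draw candidate prefixes (or suffixes) uniformly at random from the
    substrings observed in the sample; a rough illustration only."""
    rng = np.random.default_rng(seed)
    subs = sorted({w[i:j] for w in sample
                   for i in range(len(w)) for j in range(i, len(w) + 1)})
    return list(rng.choice(subs, size=min(size, len(subs)), replace=False))

# With P = random_basis(S, m) and S_ = random_basis(S, m), one can test
# np.linalg.matrix_rank(hankel_block(f_hat, P, S_)) == n, reusing the
# hankel_block sketch above (f_hat being an empirical estimate of f).
```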

SLIDE 11

Visit us at poster 53

SLIDE 12

Local Loss Optimization in Operator Models: A New Insight into Spectral Learning

Borja Balle, Ariadna Quattoni, Xavier Carreras

ICML 2012 June 2012, Edinburgh

This work is partially supported by the PASCAL2 Network and a Google Research Award