Scalable Large-Margin Structured Learning: Theory and Algorithms

Liang Huang, Kai Zhao, and Lemao Liu
The City University of New York (CUNY)

slides at: http://acl.cs.qc.edu/~lhuang/

What is Structured Prediction?

  • binary classification: output is binary
  • multiclass classification: output is one of a (small) number of classes
  • structured classification: output is a structure (sequence, tree, graph)
  • examples: part-of-speech tagging, parsing, summarization, translation
  • exponentially many classes: search (inference) efficiency is crucial!

[figure: example outputs; binary (x ↦ y = ±1), tagging (“the man bit the dog” ↦ DT NN VBD DT NN), parsing (“the man bit the dog” ↦ (S (NP (DT the) (NN man)) (VP (VBD bit) (NP (DT the) (NN dog))))), translation (“the man bit the dog” ↦ 那 人 咬 了 狗)]
NLP is all about structured prediction!

Examples of Bad Structured Prediction


Learning: Unstructured vs. Structured

                                        binary/multiclass                structured learning
  generative (count & divide):          naive Bayes                      HMMs
  conditional (expectations):           logistic regression (maxent)     CRFs
  max margin (loss-augmented argmax):   SVM                              structured SVM
  online + Viterbi (argmax):            perceptron                       structured perceptron


Why Perceptron (Online Learning)?

  • because we want scalability on big data!
  • learning time has to be linear in the number of examples
  • can make only a constant number of passes over the training data
  • only online learning (perceptron/MIRA) can guarantee this!
  • SVM scales between O(n²) and O(n³); CRFs offer no such guarantee
  • and inference on each example must be super fast
  • another advantage of the perceptron: it only needs the argmax

Perceptron: from binary to structured

All three variants share the same loop (run inference z = argmax w · Φ(x, ·); update w if y ≠ z), but differ in how hard that inference is:

  • binary perceptron (Rosenblatt, 1959): 2 classes; exact inference is trivial
  • multiclass perceptron (Freund/Schapire, 1999): constant # of classes; exact inference is easy
  • structured perceptron (Collins, 2002): exponential # of classes (e.g., all tag sequences for “the man bit the dog”); exact inference is hard
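To make the shared loop concrete, here is a minimal sketch in Python (our illustration, not the tutorial's code); the trainer is identical for all three variants, and only the plugged-in `phi` and `argmax_fn` change:

```python
# Minimal sketch of the unified perceptron loop. phi(x, y) maps an
# (input, output) pair to a sparse feature dict; argmax_fn(w, x) hides
# the inference problem: trivial for 2 classes, easy for k classes,
# hard (search) for structured outputs.
from collections import defaultdict

def perceptron(train, phi, argmax_fn, epochs=5):
    """train: list of (x, y) pairs; returns the learned weight dict."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in train:
            z = argmax_fn(w, x)          # inference under current weights
            if z != y:                   # mistake: w += Phi(x,y) - Phi(x,z)
                for f, v in phi(x, y).items():
                    w[f] += v
                for f, v in phi(x, z).items():
                    w[f] -= v
    return w
```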

Scalability Challenges

  • inference (on one example) is too slow (even w/ DP)
  • can we sacrifice search exactness for faster learning?
  • would inexact search interfere with learning?
  • if so, how should we modify learning?
  • even fastest inexact inference is still too slow
  • due to too many training examples
  • can we parallelize online learning?

[figure: online learning over examples 1–16, with one weight update after each example]

Tutorial Outline

  • Overview of Structured Learning
  • Challenges in Scalability
  • Structured Perceptron
  • convergence proof
  • Structured Perceptron with Inexact Search
  • Latent-Variable Structured Perceptron
  • Parallelizing Online Learning (Perceptron & MIRA)



Generic Perceptron

  • perceptron is the simplest machine learning algorithm
  • online-learning: one example at a time
  • learning by doing
  • find the best output under the current weights
  • update weights at mistakes

[figure: the perceptron loop; for each example xᵢ, inference produces zᵢ, and w is updated when zᵢ ≠ yᵢ]

Structured Perceptron

[figure: the same loop with structured outputs; e.g., x = “the man bit the dog”, gold y = DT NN VBD DT NN, current output z = DT NN NN DT NN]

Example: POS Tagging

  • gold-standard y: DT NN VBD DT NN (“the man bit the dog”)
  • current output z: DT NN NN DT NN (“the man bit the dog”)
  • assume only two feature classes
  • tag bigrams (t_{i−1}, t_i)
  • word/tag pairs (w_i → t_i)
  • update w ← w + Φ(x, y) − Φ(x, z)
  • weights ++: (NN, VBD), (VBD, DT), (VBD → bit)
  • weights −−: (NN, NN), (NN, DT), (NN → bit)
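As a sketch (the feature-name tuples are ours), the two feature classes and the resulting update look like this in Python:

```python
# Phi(x, y) with the slide's two feature classes: tag bigrams (t_{i-1}, t_i)
# and word/tag pairs (w_i -> t_i).
from collections import Counter, defaultdict

def phi(words, tags):
    feats = Counter()
    prev = "<s>"                        # start-of-sentence tag
    for word, tag in zip(words, tags):
        feats["bigram", prev, tag] += 1
        feats["word/tag", word, tag] += 1
        prev = tag
    return feats

x = "the man bit the dog".split()
y = "DT NN VBD DT NN".split()           # gold standard
z = "DT NN NN DT NN".split()            # current (wrong) output

w = defaultdict(float)
for f, v in phi(x, y).items():          # shared features cancel; net effect
    w[f] += v                           # ++: (NN,VBD), (VBD,DT), (VBD->bit)
for f, v in phi(x, z).items():
    w[f] -= v                           # --: (NN,NN), (NN,DT), (NN->bit)
```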

Inference: Dynamic Programming

  • exact inference via dynamic programming
  • tagging: O(nT³); CKY parsing: O(n³)
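For concreteness, here is a sketch of exact DP inference for a bigram tagger, reusing the `phi`-style features above (a bigram model is O(nT²); the slide's O(nT³) corresponds to a trigram model):

```python
def viterbi(w, words, tagset):
    """Exact argmax over tag sequences under bigram + word/tag features."""
    n = len(words)
    best = [dict() for _ in range(n)]   # best[i][t] = (score, prev_tag)
    for t in tagset:
        best[0][t] = (w.get(("bigram", "<s>", t), 0.0)
                      + w.get(("word/tag", words[0], t), 0.0), None)
    for i in range(1, n):
        for t in tagset:
            best[i][t] = max(
                (best[i - 1][p][0]
                 + w.get(("bigram", p, t), 0.0)
                 + w.get(("word/tag", words[i], t), 0.0), p)
                for p in tagset)
    tag = max(tagset, key=lambda t: best[n - 1][t][0])
    tags = [tag]
    for i in range(n - 1, 0, -1):       # follow backpointers
        tag = best[i][tag][1]
        tags.append(tag)
    return tags[::-1]
```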


Efficiency vs. Expressiveness

  • the inference (argmax) step must be efficient
  • either the search space GEN(x) is small, or it factors into local decisions
  • features must be local to y (but can be global to x)
  • e.g., a bigram tagger sees only adjacent tags, but may look at all input words (cf. CRFs)

The inference step computes argmax_{y ∈ GEN(x)} w · Φ(x, y).

Averaged Perceptron

  • much more stable and accurate results
  • approximation of voted perceptron (large-margin)

(Freund & Schapire, 1999)

[figure: test error of the per-iteration weights w⁽ʲ⁾ vs. the running average; averaging smooths the fluctuations]


Efficient Implementation of Averaging

  • the naive implementation (a running sum of weight vectors) doesn’t scale
  • very clever trick from Daumé (2006, PhD thesis)

w^(1) = Δw^(1)
w^(2) = Δw^(1) + Δw^(2)
w^(3) = Δw^(1) + Δw^(2) + Δw^(3)
w^(4) = Δw^(1) + Δw^(2) + Δw^(3) + Δw^(4)

so the running sum Σₜ w^(t) weights each update Δw^(t) by how long it survives, and can be maintained with one extra accumulator vector.
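A common form of the trick (our sketch, not necessarily Daumé's exact formulation): alongside w, keep an accumulator wa of step-weighted updates; the average is then recoverable in one O(|w|) pass at the end.

```python
# Lazy averaging (one common form of the trick; our sketch). Maintain
# wa = sum over updates of (step * delta); the average of w over all steps
# is then w - wa / c (up to an off-by-one), with no per-step full-vector sum.
from collections import defaultdict

class AveragedPerceptronWeights:
    def __init__(self):
        self.w = defaultdict(float)     # current weights
        self.wa = defaultdict(float)    # step-weighted accumulator
        self.c = 0                      # number of steps seen

    def step(self, delta):
        """delta: sparse dict Phi(x, y) - Phi(x, z); pass {} on no mistake."""
        self.c += 1
        for f, v in delta.items():
            self.w[f] += v
            self.wa[f] += self.c * v

    def averaged(self):
        return {f: v - self.wa[f] / self.c for f, v in self.w.items()}
```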


Perceptron vs. CRFs

  • the perceptron is an online, Viterbi approximation of a CRF
  • simpler to code; faster to converge; ~same accuracy
  • CRF (Lafferty et al, 2001): batch training with expectations, weighting each z ∈ GEN(x) by exp(w · Φ(x, z)) / Z(x), summed over (x, y) ∈ D
  • structured perceptron (Collins, 2002) ≈ SGD + hard/Viterbi CRF: for each (x, y) ∈ D, replace the expectation with the single argmax_{z ∈ GEN(x)} w · Φ(x, z)

Perceptron Convergence Proof

  • binary classification: converges iff the data is separable
  • structured prediction: converges iff the data is separable
  • separable: there is an oracle vector that correctly labels all examples
  • one vs. the rest (the correct label scores higher than all incorrect labels)
  • theorem: if separable, then the # of updates ≤ R²/δ² (R: diameter; δ: margin)

(Novikoff 1962 ⇒ Freund & Schapire 1999 ⇒ Collins 2002)

Geometry of Convergence Proof pt 1

Part 1 (lower bound): the perceptron update is w^(k+1) = w^(k) + ΔΦ(x, y, z), where z is the exact 1-best and y the correct label. By δ-separation there is a unit oracle vector u with margin u · ΔΦ(x, y, z) ≥ δ, so u · w^(k+1) ≥ u · w^(k) + δ, and by induction ‖w^(k+1)‖ ≥ u · w^(k+1) ≥ kδ after k updates (the angle between u and w stays below 90˚).
<90˚

Geometry of Convergence Proof pt 2

Part 2 (upper bound): each update is on a violation (the incorrect label z scored higher than y, so w^(k) · ΔΦ(x, y, z) ≤ 0), and the diameter is bounded: ‖ΔΦ(x, y, z)‖ ≤ R. Expanding the update gives ‖w^(k+1)‖² ≤ ‖w^(k)‖² + R², so by induction ‖w^(k+1)‖² ≤ kR².

summary: the proof uses 3 facts:

  • 1. separation (margin)
  • 2. diameter (always finite)
  • 3. violation (guaranteed by exact search)

combining with part 1: kδ ≤ ‖w^(k+1)‖ ≤ √k R, which bounds the # of updates: k ≤ R²/δ².
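For reference, the whole chain of the slides' derivation in one display:

```latex
% Mistake bound for the structured perceptron (Novikoff 1962; Collins 2002).
% Assumptions: unit oracle vector u with margin u . DeltaPhi >= delta,
% diameter ||DeltaPhi|| <= R, and every update a violation.
\begin{align*}
  w^{(k+1)} &= w^{(k)} + \Delta\Phi(x, y, z) \\
  \|w^{(k+1)}\| \;\ge\; u \cdot w^{(k+1)} &\ge k\delta
      && \text{(part 1: separation)} \\
  \|w^{(k+1)}\|^2 \le \|w^{(k)}\|^2 + R^2 \;\Rightarrow\; \|w^{(k+1)}\| &\le \sqrt{k}\,R
      && \text{(part 2: diameter, violation)} \\
  k\delta \le \sqrt{k}\,R \;\Rightarrow\; k &\le R^2/\delta^2
      && \text{(mistake bound)}
\end{align*}
```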


Tutorial Outline

  • Overview of Structured Learning
  • Challenges in Scalability
  • Structured Perceptron
  • convergence proof
  • Structured Perceptron with Inexact Search
  • Latent-Variable Perceptron
  • Parallelizing Online Learning (Perceptron & MIRA)


Scalability Challenge 1: Inference

  • challenge: search efficiency (exponentially many classes)
  • often addressed with dynamic programming (DP)
  • but DP is still too slow for repeated use, e.g., parsing is O(n³)
  • Q: can we sacrifice search exactness for faster learning?

[figure: binary classification (trivial inference, constant # of classes) vs. structured classification (hard inference, exponential # of classes)]

Perceptron w/ Inexact Inference

  • inexact inference (e.g., beam search) is in routine use in NLP
  • how does the structured perceptron work with inexact search?
  • so far most structured learning theory assumes exact search
  • would search errors break these learning properties?
  • if so, how should we modify learning to accommodate inexact search?

Q: does the perceptron still work when exact inference is replaced by inexact inference (beam or greedy search)?

Bad News and Good News

  • bad news: no more guarantee of convergence
  • in practice perceptron degrades a lot due to search errors
  • good news: new update methods guarantee convergence
  • new perceptron variants that “live with” search errors
  • in practice they work really well w/ inexact search

A: it no longer works as-is, but we can make it work by some magic.


Convergence with Exact Search

[figure: toy example; training example “time flies” with gold tags N V, output space {N, V} × {N, V}; with exact search each update moves w toward the correct label, and the structured perceptron converges]

No Convergence w/ Greedy Search

[figure: the same toy example; with greedy search the updates can cycle, and the structured perceptron does not converge]
Which part of the convergence proof no longer holds?

the proof only uses 3 facts:

  • 1. separation (margin)
  • 2. diameter (always finite)
  • 3. violation (guaranteed by exact search)


Geometry of Convergence Proof pt 2

Recall part 2 of the proof: it needs each update to be a violation (the incorrect label scored higher). Exact search returns the true 1-best z, which by definition scores at least as high as y. But an inexact 1-best may score lower than y: inexact search doesn’t guarantee violation!

Observation: Violation is all we need!

violation: the incorrect label z scores higher than the correct label y, i.e., the update direction ΔΦ(x, y, z) satisfies w · ΔΦ(x, y, z) ≤ 0

  • exact search is not really required by the proof
  • rather, it is only used to ensure violation!

the proof only uses 3 facts:

  • 1. separation (margin)
  • 2. diameter (always finite)
  • 3. violation (but no need for exact)

Violation-Fixing Perceptron

  • if we guarantee violation, we don’t care about exactness!
  • violation is good b/c we can at least fix a mistake

standard perceptron: z from exact inference; update w if y ≠ z.
violation-fixing perceptron: find any violation pair (y′, z) among all possible updates; update w if y′ ≠ z. Same mistake bound as before!
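A sketch of the violation-fixing update in Python (our illustration): update only when the found pair is a genuine violation.

```python
# Violation-fixing update (sketch): accept any pair (y', z) from the search
# space, but update only if z actually scores at least as high as y', so
# that every update fixes a real mistake.
def score(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def violation_fixing_update(w, feats_correct, feats_incorrect):
    """feats_*: sparse feature dicts of y' and z. Returns True if updated."""
    if score(w, feats_incorrect) >= score(w, feats_correct):
        for f, v in feats_correct.items():
            w[f] = w.get(f, 0.0) + v
        for f, v in feats_incorrect.items():
            w[f] = w.get(f, 0.0) - v
        return True
    return False    # not a violation: skip (updating could reinforce errors)
```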

What if we can’t guarantee violation?

  • this is why the perceptron doesn’t work well w/ inexact search
  • because not every update is guaranteed to be a violation
  • thus the proof breaks; no convergence guarantee
  • example: beam or greedy search
  • the model might prefer the correct label (under exact search)
  • but the search prunes it away
  • such a non-violation update is “bad” because it doesn’t fix any mistake
  • the new model still misguides the search
  • Q: how can we always guarantee violation?

Solution 1: Early update (Collins/Roark 2004)

[figure: the toy example again; with greedy search the standard perceptron does not converge, but stopping and updating at the first mistake (early update) does]

Early Update: Guarantees Violation

[figure: at the early-update point the incorrect prefix already scores higher than the correct prefix: a violation! the standard update doesn’t converge b/c it doesn’t guarantee violation]


Early Update: from Greedy to Beam

  • beam search is a generalization of greedy search (where b = 1)
  • at each step we keep the top b hypotheses
  • widely used: tagging, parsing, translation...
  • early update: update when the correct label first falls off the beam
  • up to this point the incorrect prefix must score higher
  • standard update (full update): no violation guarantee! (see the sketch after the figure)
[figure: early update fires when the correct label falls off the beam (is pruned); violation guaranteed, since the incorrect prefix scores higher up to this point; the standard update at the end has no guarantee]
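Here is a sketch of beam-search tagging with early update (Python; `phi_step`, a per-position feature function, and the helper `score` are our assumptions, not the tutorial's code):

```python
# Beam search with early update (sketch). Returns the (correct, incorrect)
# prefix pair to update on, or None if the full gold sequence wins.
def score(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def beam_early_update(w, words, gold, tagset, phi_step, b=4):
    beam = [((), 0.0)]                          # (tag prefix, prefix score)
    for i in range(len(words)):
        cands = [(prefix + (t,), s + score(w, phi_step(words, i, prefix, t)))
                 for prefix, s in beam for t in tagset]
        beam = sorted(cands, key=lambda c: -c[1])[:b]
        gold_prefix = tuple(gold[:i + 1])
        if all(prefix != gold_prefix for prefix, _ in beam):
            # gold prefix just fell off the beam: early update right here.
            # violation guaranteed: the beam's best outscores the gold prefix.
            return gold_prefix, beam[0][0]
    z = beam[0][0]
    return (tuple(gold), z) if z != tuple(gold) else None
```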

Early Update as Violation-Fixing

[figure: early update is a violation-fixing perceptron that updates on prefix violations (y′, z) instead of full labels]

this also suggests a new definition of “beam separability”: a correct prefix should score higher than any incorrect prefix of the same length (maybe too strong; cf. Kulesza and Pereira, 2007)

Solution 2: Max-Violation (Huang et al 2012)

[figure: possible update positions along the beam: early, max-violation, latest, and standard (bad!)]

  • we have now established a theory for early update (Collins/Roark)
  • but it learns too slowly due to partial updates
  • max-violation: update at the prefix where the violation is maximum
  • the “worst mistake” in the search space
  • all these update methods are violation-fixing perceptrons (see the sketch below)
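A max-violation sketch in the same style (again our illustration, reusing `score` and `phi_step` from the early-update sketch): score the gold prefix at every length, track the beam's best at every length, and update where the gap is largest.

```python
# Max-violation update (sketch): among all prefix lengths i, pick the one
# where score(best-in-beam) - score(gold prefix) is largest, update there.
def max_violation_pair(w, words, gold, tagset, phi_step, b=4):
    beam = [((), 0.0)]
    gold_prefix, gold_score = (), 0.0
    best_gap, pair = float("-inf"), None
    for i in range(len(words)):
        cands = [(p + (t,), s + score(w, phi_step(words, i, p, t)))
                 for p, s in beam for t in tagset]
        beam = sorted(cands, key=lambda c: -c[1])[:b]
        gold_score += score(w, phi_step(words, i, gold_prefix, gold[i]))
        gold_prefix += (gold[i],)
        top, top_score = beam[0]
        if top != gold_prefix and top_score - gold_score > best_gap:
            best_gap, pair = top_score - gold_score, (gold_prefix, top)
    return pair if best_gap >= 0 else None      # only update on a violation
```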

Four Experiments

[figure: four tasks; part-of-speech tagging (“the man bit the dog” → DT NN VBD DT NN), incremental parsing, bottom-up parsing w/ cube pruning, and machine translation (“那 人 咬 了 狗” → “the man bit the dog”)]


Max-Violation > Early >> Standard

  • exp 1 on part-of-speech tagging w/ beam search (on CTB5)
  • early and max-violation >> standard update at smallest beams
  • this advantage shrinks as beam size increases
  • max-violation converges faster than early (and slightly better)

[figure: tagging accuracy on held-out vs. training time (beam=1, greedy) and best accuracy vs. beam size, for max-violation, early, and standard updates]

[figure: the same comparison at beam=2]

Max-Violation > Early >> Standard

  • exp 2 on my incremental dependency parser (Huang & Sagae 10)
  • standard update is horrible due to search errors
  • early update: 38 iterations, 15.4 hours (92.24)
  • max-violation: 10 iterations, 4.6 hours (92.25)

[figure: parsing accuracy on held-out vs. training time (beam=8); standard update (79.0) omitted from the zoomed-in plot]
max-violation is 3.3x faster than early update

Why is the standard update so bad for parsing?

  • standard update works horribly with severe search error
  • due to large number of invalid updates (non-violation)

[figure: % of invalid (non-violation) updates in the standard update vs. beam size, for parsing (b=8) and tagging (b=1); search spaces: tagging O(nT³) ⇒ O(nb), parsing O(n¹¹) ⇒ O(nb)]

take-home message: early/max-violation updates are more helpful for harder search problems!


Exp 3: Bottom-up Parsing

  • CKY parsing with cube pruning for higher-order features
  • we extended our framework from graphs to hypergraphs

[figure: UAS on Penn-YM dev vs. training epochs, comparing s-max, p-max, skip, and standard updates]

(Zhang et al 2013)

Exp 4: Machine Translation

  • the standard perceptron works poorly for machine translation
  • b/c the invalid update ratio is very high (search quality is low)
  • max-violation converges faster than early update
  • first truly successful effort in large-scale training for translation

[figure: ratio of invalid updates vs. beam size for the standard perceptron (+non-local features); BLEU vs. iteration for max-violation, early, standard, and local updates]

(Yu et al 2013)

Comparison of Four Exps

  • the harder your search problem, the more advantageous these update methods are

[figure: all four experiments side by side: tagging (b=1), incremental parsing (b=8), bottom-up parsing, and machine translation]

Related Work and Discussions

  • our “violation-fixing” framework includes as special cases:
  • early update (Collins and Roark, 2004)
  • LaSO (Daumé and Marcu, 2005)
  • not sure about Searn (Daumé et al, 2009)
  • “beam-separability” and “greedy-separability” are related to the “algorithmic separability” of (Kulesza and Pereira, 2007)
  • but these conditions are too strong to hold in practice
  • under-generating (beam) vs. over-generating (LP relaxation):
  • Kulesza & Pereira (2007) and Martins et al (2011): LP relaxation
  • Finley and Joachims (2008): both under- and over-generating, for SVM


Conclusions So Far

  • structured perceptron is simple, scalable, and powerful
  • (almost) same convergence proof from multiclass perceptron
  • but it doesn’t work very well with inexact search
  • solution: violation-fixing perceptron framework
  • convergence under new defs of separability
  • learn to “live with” search errors
  • in particular, “max-violation” works great
  • converges fast, and results in high accuracy
  • they are more helpful to harder search problems!


Tutorial Outline

  • Overview of Structured Learning
  • Challenges in Scalability
  • Structured Perceptron
  • convergence proof
  • Structured Perceptron with Inexact Search
  • Latent-Variable Perceptron
  • Parallelizing Online Learning (Perceptron & MIRA)


Learning with Latent Variables

  • aka “weakly-supervised” or “partially-observed” learning
  • learning from “natural annotations”; more scalable
  • examples: translation, transliteration, semantic parsing...

[figure: three settings; parallel text: “Bush talked with Sharon” ↔ “布什 与 沙龙 会谈” with a latent derivation (Liang et al 2006; Yu et al 2013; Xiao and Xiong 2013); QA pairs: “What is the largest state?” → argmax(state, size) → Alaska (Clark et al 2010; Liang et al 2013; Kwiatkowski et al 2013); transliteration: コンピューター ↔ “ko n py u : ta :” (computer) with a latent derivation (Knight & Graehl, 1998; Kondrak et al 2007, etc.)]

Learning Latent Structures

                      binary/multiclass               structured learning      latent structures
  generative:         naive Bayes                     HMMs                     EM (forward-backward)
  conditional:        logistic regression (maxent)    CRFs                     latent CRFs
  online + Viterbi:   perceptron                      structured perceptron    latent perceptron
  max margin:         SVM                             structured SVM           latent structured SVM


Latent Structured Perceptron

  • no explicit positive signal
  • hallucinate the “correct” derivation by current weights

during online learning, given training example x = “那 人 咬 了 狗” with reference y = “the man bit the dog”:

  • d̂: the highest-scoring derivation of x overall; it may produce a wrong output, e.g., “the dog bit the man” (penalize wrong)
  • d*: the highest-scoring gold derivation, i.e., one that produces the reference y (reward correct)

update: w ← w + Φ(x, d*) − Φ(x, d̂)

(Liang et al 2006; Yu et al 2013)
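A sketch of this update in Python (`decode`, `forced_decode`, and `output_of` are assumed search helpers, e.g. beam decoders over the full and forced spaces):

```python
# Latent-variable perceptron update, following the slide:
#   d_hat  = highest-scoring derivation of x                (penalize wrong)
#   d_star = highest-scoring reference-producing derivation (reward correct)
def latent_update(w, x, y_ref, decode, forced_decode, output_of, phi):
    d_hat = decode(w, x)                  # search over the full space
    d_star = forced_decode(w, x, y_ref)   # search constrained to yield y_ref
    if output_of(d_hat) != y_ref:         # mistake: w += Phi(x,d*) - Phi(x,d^)
        for f, v in phi(x, d_star).items():
            w[f] = w.get(f, 0.0) + v
        for f, v in phi(x, d_hat).items():
            w[f] = w.get(f, 0.0) - v
```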

Unconstrained Search

  • example: beam search phrase-based decoding

[figure: beam stacks over coverage vectors for “Bushi yu Shalong juxing le huitan”; partial hypotheses like “Bush ...”, “... talks”, “... talk”, “... meeting”, “... Sharon”, “... Shalong”]

Constrained Search

  • forced decoding: must produce the exact reference translation “Bush held talks with Sharon”
  • the gold derivations form a lattice

[figure: the gold derivation lattice over coverage vectors 1–6 for “Bushi yu Shalong juxing le huitan”; the forced decoding space sits inside the full search space explored by the beam]

Search Errors in Decoding

  • no explicit positive signal
  • hallucinate the “correct” derivation by current weights
  • same update w ← w + Φ(x, d*) − Φ(x, d̂) as above
  • problem: search errors; the beam may prune every gold derivation (Liang et al 2006; Yu et al 2013)


Search Error: Gold Derivations Pruned

[figure: in real decoding with beam search, parts of the gold derivation lattice fall outside the beam, so gold derivations get pruned]

should address search errors here!

Fixing Search Error 1: Early Update

  • early update (Collins/Roark ’04): update when the correct derivation falls off the beam
  • up to this point the incorrect prefix must score higher
  • that’s a “violation” we want to fix; proof in (Huang et al 2012)
  • the standard perceptron does not guarantee violation
  • the correct sequence (though pruned) might still score higher at the end!
  • such an “invalid” update reinforces the model error

[figure: early update fires when the correct sequence falls off the beam (is pruned); violation guaranteed, since the incorrect prefix scores higher up to that point; the standard update has no guarantee]

Early Update w/ Latent Variable

  • the gold-standard derivations are not annotated
  • we treat any reference-producing derivation as good
  • latent early update: stop decoding and update as soon as all correct derivations fall off the beam; the violation guarantee carries over

[figure: the gold derivation lattice; the update fires once the beam contains no reference-producing prefix]

Fixing Search Error 2: Max-Violation

  • early update works but learns slowly due to partial updates
  • max-violation: update at the prefix where the violation is maximum
  • the “worst mistake” in the search space
  • now extended to handle latent variables

notation: at step i, d⁺ᵢ is the best reference-producing prefix and d⁻ᵢ the highest-scoring prefix in the beam; early update fires where d⁺ᵢ falls off the beam, max-violation at the step i* where score(d⁻ᵢ) − score(d⁺ᵢ) is biggest, and the standard update compares d⁻_|x| with the full gold derivation d^y_|x| (possibly an invalid update)


Latent-Variable Perceptron

[figure: update positions on the beam; early (correct sequence falls off the beam), max-violation (biggest violation), latest (last valid update), full/standard (invalid update!)]

Roadmap of Techniques

  • structured perceptron (Collins, 2002)
  • + latent variables: latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al, 2009)
  • + inexact search: perceptron w/ inexact search (Collins & Roark, 2004; Huang et al 2012)
  • + both: latent-variable perceptron w/ inexact search (Yu et al 2013; Zhao et al 2014)
  • applications: MT, syntactic parsing, semantic parsing, transliteration

Experiments: Discriminative Training for MT

  • standard update (Liang et al’s “bold”) works poorly
  • b/c invalid update ratio is very high (search quality is low)
  • max-violation converges faster than early update

[figure: ratio of invalid updates vs. beam size for the standard latent-variable perceptron (+non-local features); BLEU vs. iteration, with max-violation (MaxForce) above MERT, early, local, and standard]

this explains why Liang et al ’06 failed: their “bold” ≈ standard update, their “local” ≈ local update

Open Problems in Theory

  • latent-variable structured perceptron: does it converge? under what conditions?
  • latent-variable structured perceptron with inexact search: does it converge? under what conditions?

[figure: for the fully supervised case, an oracle vector with margin δ and diameter R gives #updates ≤ R²/δ² (Novikoff, 1962; Collins, 2002)]


[figure continued: (Sun et al, 2009) prove #updates ≤ R²/δ² for the latent case under a stronger separability condition; easy to prove but unrealistic. the realistic condition is hard to prove, and the inexact-search case is open]

Tutorial Outline

  • Overview of Structured Learning
  • Challenges in Scalability
  • Structured Perceptron
  • convergence proof
  • Structured Perceptron with Inexact Search
  • Latent-Variable Perceptron
  • Parallelizing Online Learning (Perceptron & MIRA)


Online Learning from Big Data

  • online learning has linear-time guarantee
  • contrast: popular methods such as SVM/CRF are superlinear
  • but online learning can still be too slow if data is too big
  • even with the fastest inexact search (e.g. greedy)
  • how to parallelize online learning for big data?

[figure: online learning over examples 1–16 with an update after each, vs. superlinear batch learners (SVM, CRF)]


Aside: Perceptron => MIRA

  • the perceptron is simple but...
  • it only “aims to” fix one mistake (violation) on each example
  • yet structured prediction may have many violations
  • and it may under- or over-correct on a violation
  • MIRA (Margin Infused Relaxed Algorithm):
  • 1-best MIRA: corrects one violation “just enough” (Crammer 03)
  • k-best MIRA: corrects k violations at one time (McDonald et al 05)

MIRA update: with Z = the k-best outputs under wᵗ · Φ(x, z) over z ∈ GEN(x),

    wᵗ⁺¹ = argmin_{w′ : ∀z ∈ Z, w′ · ΔΦ(x, y, z) ≥ ℓ(y, z)} ‖w′ − wᵗ‖²

i.e., the minimal change to the weights that makes the correct label beat each z in Z by at least its loss ℓ(y, z).
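With a single constraint the QP has a closed-form solution; a sketch in our notation:

```python
# 1-best MIRA update (sketch): the single-constraint QP has the analytic
# solution w' = w + tau * DeltaPhi with the smallest tau satisfying the
# margin constraint w' . DeltaPhi >= loss.
def mira_1best_update(w, feats_y, feats_z, loss):
    delta = dict(feats_y)                   # DeltaPhi = Phi(x,y) - Phi(x,z)
    for f, v in feats_z.items():
        delta[f] = delta.get(f, 0.0) - v
    margin = sum(w.get(f, 0.0) * v for f, v in delta.items())
    norm2 = sum(v * v for v in delta.values())
    if norm2 == 0.0:
        return                              # y and z share all features
    tau = max(0.0, (loss - margin) / norm2) # "just enough" correction
    for f, v in delta.items():
        w[f] = w.get(f, 0.0) + tau * v
```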

Geometry of 1-best & k-best MIRA

[figure: 1-best MIRA has a single constraint, so the update has an analytic (closed-form) solution; k-best MIRA has up to k constraints, requiring a small convex optimization]

Can We Parallelize Online Learning?

  • can we parallelize online learning?
  • harder than parallelizing batch learners (CRF)
  • we lose the dependency between examples
  • each iteration gets faster, but accuracy drops
  • method 1: iterative parameter mixing, IPM (McDonald et al 2010)
  • only ~3-4x faster on 10+ CPUs
  • can we do (a lot) better?

[figure: method 1, iterative parameter mixing (IPM, McDonald et al 2010); shards run online learning in parallel and mix parameters between iterations]

Method 2: Minibatch Parallelization

  • decode a minibatch in parallel, then update in serial (see the sketch below)

[figure: method 2, minibatch (Zhao and Huang, 2013); decode examples 1–8 in parallel, aggregate (⨁) into one serial update, then examples 9–16, etc.]
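A sketch of the minibatch scheme with Python's multiprocessing (our illustration; `decode` must be a picklable top-level function, and each parallel call reads a snapshot of w):

```python
# Minibatch parallelization (sketch): decode a minibatch in parallel,
# then apply one aggregated (averaged) update in serial.
from multiprocessing import Pool

def minibatch_train(train, decode, phi, workers=8, batch=24, epochs=1):
    w = {}
    with Pool(workers) as pool:
        for _ in range(epochs):
            for i in range(0, len(train), batch):
                chunk = train[i:i + batch]
                # parallel part: inference only reads (a snapshot of) w
                zs = pool.starmap(decode, [(w, x) for x, _ in chunk])
                # serial part: one averaged update for the whole minibatch
                a = len(chunk)
                for (x, y), z in zip(chunk, zs):
                    if z != y:
                        for f, v in phi(x, y).items():
                            w[f] = w.get(f, 0.0) + v / a
                        for f, v in phi(x, z).items():
                            w[f] = w.get(f, 0.0) - v / a
    return w
```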


Why Minibatch?

  • minibatch also helps in serial mode!
  • perceptron: average the updates within a minibatch
  • an “averaging effect” (cf. McDonald et al 2010)
  • easy to prove convergence (still R²/δ²), as sketched below

minibatch update (average the updates over the a examples in the batch):

    w^(k+1) = w^(k) + (1/a) Σᵢ ΔΦ(xᵢ, yᵢ, zᵢ)

lower bound: u · w^(k+1) = u · w^(k) + (1/a) Σᵢ u · ΔΦ(xᵢ, yᵢ, zᵢ) ≥ u · w^(k) + δ by the margin, so by induction ‖w^(k+1)‖ ≥ u · w^(k+1) ≥ kδ.

upper bound: ‖w^(k+1)‖² = ‖w^(k)‖² + ‖(1/a) Σᵢ ΔΦ‖² + (2/a) w^(k) · Σᵢ ΔΦ; the middle term is ≤ R² by Jensen’s inequality and the last term is ≤ 0 by violation, so by induction ‖w^(k+1)‖² ≤ kR².

Why Minibatch? (MIRA)

[figure: as the minibatch size grows from 1 (pure online MIRA) toward the whole dataset, each update optimizes over more constraints, approaching SVM]

  • MIRA: optimization over more constraints
  • MIRA is an online approximation of SVM
  • minibatch MIRA: better approximation of SVM
  • approaches SVM at maximum batch size

Geometry of Minibatch MIRA

[figure: one minibatch MIRA step; gold labels y1, y2, y3 and incorrect outputs z1, z2, z3 from the whole batch jointly constrain the move from w to w′]

Load Balancing

  • rearrange the examples within each minibatch to minimize wasted (idle) time; a sketch follows the figure

[figure: timelines for iterative parameter mixing (McDonald et al 2010), plain minibatch, and load-balanced minibatch, which rearranges examples within each minibatch so that all cores finish at about the same time]
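As promised above, a small load-balancing sketch (our illustration): estimate each example's decoding cost, e.g. by sentence length, and deal the longest examples greedily onto the currently lightest core.

```python
# Greedy load balancing within a minibatch (sketch): longest-processing-time
# first, so all cores finish at roughly the same time and none idles at the
# synchronization barrier before the serial update.
def balance(minibatch, cores, cost):
    """cost: e.g. lambda ex: len(ex[0]) if examples are (sentence, gold)."""
    bins = [[] for _ in range(cores)]
    loads = [0.0] * cores
    for ex in sorted(minibatch, key=cost, reverse=True):
        j = loads.index(min(loads))     # currently lightest core
        bins[j].append(ex)
        loads[j] += cost(ex)
    return bins
```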


Experiments

[figure: three tasks; part-of-speech tagging, incremental parsing, and phrase-based translation]

Experiment 1: Parsing with MIRA

  • beam search incremental parsing (Huang et al 2012) with MIRA
  • minibatch learns better even in the serial setting (1 CPU)

[figure: parsing accuracy on held-out vs. wall-clock time; minibatch sizes 4 and 24 beat the pure online baseline (minibatch=1)]

Parallelized Minibatch Faster than IPM

  • minibatch is much faster than iterative parameter mixing

minibatch: 9x speedup on 12 cores; IPM (McDonald et al): 3x on 12 cores

Experiment 2: Tagging (Perceptron)



Experiment 3: Machine Translation

  • minibatch leads to 7x speedup on 24 cores

[figure: BLEU vs. training time; minibatch-24 on 24, 6, and 1 cores vs. the pure online baseline, MERT, and PRO-dense]

Roadmap of Techniques

  • structured perceptron (Collins, 2002)
  • + latent variables: latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al, 2009)
  • + inexact search: perceptron w/ inexact search (Collins & Roark, 2004; Huang et al 2012)
  • + minibatch parallelization (Zhao & Huang, 2013)
  • combined: latent-variable perceptron w/ inexact search & parallelization (Yu et al 2013; Zhao et al 2014)
  • applications: MT, syntactic parsing, semantic parsing, transliteration

Final Conclusions

  • online structured learning is simple and powerful
  • search efficiency is the key challenge
  • search errors do interfere with learning
  • but we can use violation-fixing perceptron w/ inexact search
  • we can extend perceptron to learn latent structures
  • we can parallelize online learning using minibatch

[figure: the beam update positions once more; early, max-violation, latest, full (standard, invalid update!)]

Annotated References (1)

  • Binary Perceptron
  • original: Rosenblatt, 1959
  • convergence proof: Novikoff, 1962
  • Multiclass Perceptron (and voted/averaged perceptron)
  • Freund and Schapire, 1999
  • Structured Perceptron (and inexact search extensions)
  • original: Collins, 2002 (also contains generalization bounds; proofs mostly verbatim from Freund/Schapire, 1999)
  • early update: Collins and Roark, 2004 (but no justification)
  • max-violation: Huang et al, 2012 (also defines the violation-fixing perceptron framework, of which early/max-violation are instances)
  • hypergraph inexact search: Zhang et al, 2013 (CKY-style parsing)


Annotated References (2)

  • Latent-Variable Perceptron (and inexact search extensions)
  • semantic parsing: Zettlemoyer and Collins, 2005
  • machine translation: Liang et al, 2006
  • separability condition: Sun et al, 2009
  • inexact search: Yu et al, 2013
  • hiero w/ inexact search: Zhao et al, 2014
  • MIRA
  • 1-best MIRA: Crammer and Singer, 2003
  • k-best MIRA: McDonald et al, 2005

Annotated References (3)

  • Parallelizing Online Learning
  • iterative parameter mixing: McDonald et al, 2010
  • minibatch: Zhao and Huang, 2013
  • Other References
  • CRF: Lafferty et al, 2001
  • M3N (structured SVM): Taskar et al, 2003
  • LaSO: Daumé and Marcu, 2005
  • averaging trick: Daumé, 2006, Ph.D. thesis


Backup Slides: Convergence

  • if the data is separable in the representation, and search is exact

Separation and Violation

What if not separable?

  • a weaker theorem, and generalization bounds

Proof