Machine Learning, Fall 2017: Structured Prediction (structured perceptron, HMM, structured SVM)


SLIDE 1

Machine Learning, Fall 2017

Structured Prediction (structured perceptron, HMM, structured SVM)

Professor Liang Huang

(Chap. 17 of CIML)

SLIDE 2

Structured Prediction

  • binary classification: output is binary
  • multiclass classification: output is a number (small # of classes)
  • structured classification: output is a structure (seq., tree, graph)
  • e.g. part-of-speech tagging, parsing, summarization, translation
  • exponentially many classes: search (inference) efficiency is crucial!

[Figure: example outputs for each task. Binary: x → y ∈ {−1, +1}. POS tagging: "the man bit the dog" → "DT NN VBD DT NN". Parsing: "the man bit the dog" → (S (NP (DT the) (NN man)) (VP (VB bit) (NP (DT the) (NN dog)))). Translation: "the man bit the dog" → "那 人 咬 了 狗".]

SLIDE 3

Generic Perceptron

  • online-learning: one example at a time
  • learning by doing
  • find the best output under the current weights
  • update weights at mistakes

[Diagram: the perceptron loop — inference finds the best output z_i for input x_i under the current weights w; if z_i ≠ y_i, w is updated.]
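A minimal sketch of this loop in Python (hypothetical names: `decode` stands in for the task-specific inference routine, `phi` for the feature map; sparse weights in a defaultdict, as the later slides suggest):

```python
from collections import defaultdict

def perceptron(data, decode, phi, epochs=10):
    """Generic perceptron: decode each example under the current
    weights; update the weights only on mistakes."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            z = decode(x, w)            # best output under current w
            if z != y:                  # mistake: reward y, penalize z
                for f, v in phi(x, y).items():
                    w[f] += v
                for f, v in phi(x, z).items():
                    w[f] -= v
    return w
```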

SLIDE 4

Perceptron: from binary to structured

[Figure: the same perceptron loop in all three settings — update w if y ≠ z, where z is the exact inference output.]

                     binary        multiclass    structured
  # of classes       2             constant      exponential
  exact inference    trivial       easy          hard
  update             if y ≠ z      if y ≠ z      if y ≠ z

(binary: x → y ∈ {−1, +1}; structured example: "the man bit the dog" → "DT NN VBD DT NN")

SLIDE 5

From Perceptron to SVM

[Timeline: perceptron → SVM, with online and max-margin threads]

  • 1959 Rosenblatt: perceptron invented
  • 1962 Novikoff: convergence proof
  • 1964 Vapnik & Chervonenkis: max margin
  • 1964 Aizerman+: kernels (same journal!) → kernelized perceptron
  • 1995 Cortes/Vapnik: SVM (max margin, +soft-margin for the inseparable case)
  • 1999 Freund/Schapire: voted/averaged perceptron (revived)
  • 2001 Lafferty+: CRF (multinomial logistic regression, max. entropy)
  • 2002 Collins: structured perceptron
  • 2003 Crammer/Singer: MIRA (online approx. max margin, conservative updates)
  • 2003 Taskar+: M3N (structured SVM)
  • 2004 Tsochantaridis+: structured SVM
  • 2005 McDonald+: structured MIRA
  • 2006 Singer group: (passive-)aggressive (online)
  • 2007-2010 Singer group: Pegasos (subgradient descent, minibatch, batch)

(Much of this line of work came from AT&T Research and ex-AT&T researchers and their students; Vapnik moved to AT&T after the fall of the USSR.)

SLIDE 6

Multiclass Classification

  • one weight vector (“prototype”) for each class:
  • multiclass decision rule:

(best agreement w/ prototype)

ŷ = argmax_{z ∈ 1…M} w(z) · x

Q1: what about 2-class? Q2: do we still need augmented space?

SLIDE 7

Multiclass Perceptron

  • on an error, penalize the weight for the wrong class, and reward the weight for the true class
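A dense-vector sketch of the decision rule and the update (numpy assumed; `W` stores one prototype row per class):

```python
import numpy as np

def predict(W, x):
    """Decision rule: the class whose prototype agrees best with x."""
    return int(np.argmax(W @ x))

def multiclass_perceptron(data, num_classes, dim, epochs=10):
    """On an error, reward the true class's prototype and penalize
    the predicted (wrong) class's prototype."""
    W = np.zeros((num_classes, dim))
    for _ in range(epochs):
        for x, y in data:               # x: feature vector, y: class id
            z = predict(W, x)
            if z != y:
                W[y] += x               # reward the true class
                W[z] -= x               # penalize the wrong class
    return W
```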

SLIDE 8

Convergence of Multiclass

update rule:

w ← w + ∆Φ(x, y, z), where ∆Φ(x, y, z) = Φ(x, y) − Φ(x, z)

separability:

∃ u s.t. ∀(x, y) ∈ D and ∀z ≠ y: u · ∆Φ(x, y, z) ≥ δ

SLIDE 9

Example: POS Tagging

  • gold-standard y: DT NN VBD DT NN (the man bit the dog)
  • current output z: DT NN NN DT NN (the man bit the dog)
  • assume only two feature classes
  • tag bigrams ti-1 ti
  • word/tag pairs (ti, wi)
  • weights ++: (NN, VBD) (VBD, DT) (VBD, bit)
  • weights --: (NN, NN) (NN, DT) (NN, bit)

ɸ(x,y) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (VBD, bit): 1, ..., (NN, dog): 1}
ɸ(x,z) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (NN, bit): 1, ..., (NN, dog): 1}
update: w += ɸ(x,y) − ɸ(x,z)
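These feature vectors are easy to compute with a Counter; a sketch for this example (boundary tags <s>/</s> as in the ɸ above):

```python
from collections import Counter

def phi(words, tags):
    """Sparse feature counts: tag bigrams (t_{i-1}, t_i) with <s>/</s>
    boundaries, plus word/tag pairs (t_i, w_i)."""
    feats = Counter()
    padded = ["<s>"] + tags + ["</s>"]
    for prev, cur in zip(padded, padded[1:]):
        feats[(prev, cur)] += 1         # tag bigram feature
    for word, tag in zip(words, tags):
        feats[(tag, word)] += 1         # word/tag feature
    return feats

x = "the man bit the dog".split()
y = "DT NN VBD DT NN".split()           # gold-standard
z = "DT NN NN DT NN".split()            # current output
# Counter subtraction keeps only positive counts, giving exactly the
# two feature sets above:
print(phi(x, y) - phi(x, z))  # weights ++: (NN, VBD), (VBD, DT), (VBD, bit)
print(phi(x, z) - phi(x, y))  # weights --: (NN, NN), (NN, DT), (NN, bit)
```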

SLIDE 10

Structured Perceptron

[Diagram: the same loop as the generic perceptron, but inference is now a structured search for z_i over an exponential output space.]

SLIDE 11

Inference: Dynamic Programming

[Diagram: exact inference x → z by dynamic programming; update w if y ≠ z.]

SLIDE 12

Python implementation


Q: what about top-down recursive + memoization?
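The slide's code did not survive extraction; here is a sketch of what a bottom-up Viterbi decoder for the bigram tagger might look like, assuming the tag-bigram and word/tag features from the POS-tagging example and weights in a plain dict `w`:

```python
def decode(words, w, tagset):
    """Bottom-up Viterbi, O(n |T|^2): best[t] is the score of the best
    tagging of the prefix seen so far that ends in tag t; back[i][t]
    records that tagging's tag at position i-1."""
    best, back = {"<s>": 0.0}, []
    for word in words:
        new_best, pointers = {}, {}
        for t in tagset:
            # best predecessor under the tag-bigram feature (prev, t)
            prev = max(best, key=lambda p: best[p] + w.get((p, t), 0.0))
            pointers[t] = prev
            new_best[t] = (best[prev] + w.get((prev, t), 0.0)
                           + w.get((t, word), 0.0))  # word/tag feature
        best = new_best
        back.append(pointers)
    last = max(best, key=lambda t: best[t] + w.get((t, "</s>"), 0.0))
    tags = [last]
    for pointers in reversed(back[1:]):  # follow backpointers
        tags.append(pointers[tags[-1]])
    return tags[::-1]
```

As for the question: a top-down recursive version with memoization (e.g. functools.lru_cache) fills the same table at the same asymptotic cost; the bottom-up version just avoids Python's recursion-depth limit on long sentences.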

SLIDE 13

Efficiency vs. Expressiveness

  • the inference (argmax) must be efficient
  • either the search space GEN(x) is small, or factored
  • features must be local to y (but can be global to x)
  • e.g. bigram tagger, but look at all input words (cf. CRFs)

[Diagram: the perceptron loop with inference ŷ = argmax_{y ∈ GEN(x)} w · ɸ(x, y).]

SLIDE 14

Averaged Perceptron

  • more stable and accurate results
  • approximation of voted perceptron

(Freund & Schapire, 1999)

[Formula: the averaged weights are the mean of the per-step weight vectors, w̄ = (1 / (J+1)) Σ_{j=0…J} W_j.]

SLIDE 15

Averaging Tricks

  • Daume (2006, PhD thesis)

[Figure: each w(t) is the running sum of its updates, w(t) = ∆w(1) + … + ∆w(t); summing w(1) … w(T) column by column shows the average can be maintained from the sparse ∆w(t) alone — store w as a sparse vector (defaultdict).]
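One common form of the trick (as in CIML's averaged perceptron): alongside w, keep a "timestamped" sum u, so the average is w − u/c and only touched features are ever updated:

```python
from collections import defaultdict

class AveragedWeights:
    """Averaging trick: w holds current weights, u holds the sum of
    c-weighted updates; the averaged weights are w - u/c."""
    def __init__(self):
        self.w = defaultdict(float)
        self.u = defaultdict(float)
        self.c = 1                      # example counter

    def update(self, feats, scale):
        """Apply scale * feats (a sparse dict) to the weights."""
        for f, v in feats.items():
            self.w[f] += scale * v
            self.u[f] += self.c * scale * v

    def step(self):
        self.c += 1                     # advance once per example

    def averaged(self):
        return {f: v - self.u[f] / self.c for f, v in self.w.items()}
```

Each update touches only its active features, so averaging costs nothing extra per example.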

SLIDE 16

Do we need smoothing?

  • smoothing is much easier in discriminative models
  • just make sure that for each feature template, its subset templates are also included
  • e.g., to include (t0 w0 w-1) you must also include (t0 w0), (t0 w-1), (w0 w-1)
  • and maybe also (t0 t-1), because t is less sparse than w

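A tiny helper can enumerate the subset templates mechanically (a sketch; the template names are just the slide's examples):

```python
from itertools import combinations

def backoff_templates(template):
    """All non-empty sub-templates of a conjunctive feature template."""
    return [c for r in range(1, len(template) + 1)
            for c in combinations(template, r)]

print(backoff_templates(("t0", "w0", "w-1")))
# [('t0',), ('w0',), ('w-1',), ('t0', 'w0'), ('t0', 'w-1'),
#  ('w0', 'w-1'), ('t0', 'w0', 'w-1')]
```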

SLIDE 17

Convergence with Exact Search

  • linear classification: converges iff the data is separable
  • structured: converges iff the data is separable & the search is exact
  • there is an oracle vector that correctly labels all examples
  • one vs. the rest (correct label better than all incorrect labels)
  • theorem: if separable, then # of updates ≤ R²/δ² (R: diameter)

[Figure: margin δ and diameter R, in the binary case (y = ±1) and the structured case (gold y₁₀₀ vs. all z ≠ y₁₀₀ for example x₁₀₀); Rosenblatt (1957) ⇒ Collins (2002).]

SLIDE 18

Geometry of Convergence Proof pt 1

[Diagram: the update ∆Φ(x, y, z) moves the current model w(k) to the new model w(k+1); u is the unit oracle vector.]

perceptron update: w(k+1) = w(k) + ∆Φ(x, y, z), where z is the exact 1-best under w(k)

δ-separation: the unit oracle vector u satisfies u · ∆Φ(x, y, z) ≥ δ, i.e. every update makes an angle of less than 90° with u

part 1 (lower bound, by induction): after k updates, u · w(k+1) ≥ kδ, hence ‖w(k+1)‖ ≥ kδ

SLIDE 19

Geometry of Convergence Proof pt 2

[Diagram: as in part 1, the update ∆Φ(x, y, z) moves w(k) to w(k+1); z is the exact 1-best.]

violation: the incorrect label z scored at least as high as the correct y under w(k), so w(k) · ∆Φ(x, y, z) ≤ 0

perceptron update: w(k+1) = w(k) + ∆Φ(x, y, z)

part 2 (upper bound): ‖w(k+1)‖² = ‖w(k)‖² + 2 w(k) · ∆Φ(x, y, z) + ‖∆Φ(x, y, z)‖² ≤ ‖w(k)‖² + R², where R is the max diameter of an update; by induction, ‖w(k+1)‖² ≤ kR²

parts 1+2 ⇒ update bound: k ≤ R²/δ²
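Chaining the two parts (with ‖u‖ = 1 and Cauchy-Schwarz in the middle):

```latex
k^2\delta^2 \;\overset{\text{part 1}}{\le}\; (u \cdot w^{(k+1)})^2
\;\le\; \|w^{(k+1)}\|^2
\;\overset{\text{part 2}}{\le}\; kR^2
\quad\Longrightarrow\quad
k \;\le\; R^2/\delta^2 .
```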

SLIDE 20

Experiments

SLIDE 21

Experiments: Tagging

  • (almost) identical features from (Ratnaparkhi, 1996)
  • trigram tagger: current tag ti, previous tags ti-1, ti-2
  • current word wi and its spelling features
  • surrounding words wi-1, wi+1, wi-2, wi+2, ...


SLIDE 22

Experiments: NP Chunking

  • B-I-O scheme
  • features:
  • unigram model
  • surrounding words and POS tags


Rockwell/B International/I Corp./I 's/B Tulsa/I unit/I said/O it/B signed/O a/B tentative/I agreement/I ...
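How the B-I-O labels arise from NP-chunk spans, on this example (a sketch; the spans are read off the bracketing implied above):

```python
def bio_tags(words, chunks):
    """B-I-O scheme: B(egin) on a chunk's first word, I(nside) on the
    rest, O(utside) elsewhere; chunks are half-open (start, end) spans."""
    tags = ["O"] * len(words)
    for start, end in chunks:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

sent = ("Rockwell International Corp. 's Tulsa unit "
        "said it signed a tentative agreement").split()
# NPs: [Rockwell International Corp.] ['s Tulsa unit] [it] [a tentative agreement]
print(bio_tags(sent, [(0, 3), (3, 6), (7, 8), (9, 12)]))
# ['B','I','I', 'B','I','I', 'O', 'B', 'O', 'B','I','I']
```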

SLIDE 23

Experiments: NP Chunking

  • results, vs. the trigram tagger of (Sha and Pereira, 2003)
  • voted perceptron: 94.09% vs. CRF: 94.38%


SLIDE 24

Structured SVM

  • structured perceptron: w·Δɸ(x,y,z) > 0
  • SVM: for all (x,y), functional margin y(w·x) ≥ 1
  • structured SVM version 1: simple loss
  • for all (x,y), for all z≠y, margin w·Δɸ(x,y,z) ≥ 1
  • correct y has to score higher than any wrong z by 1
  • structured SVM version 2: structured loss
  • for all (x,y), for all z≠y, margin w·Δɸ(x,y,z) ≥ ℓ(y,z)
  • correct y has to score higher than any wrong z by ℓ(y,z), a distance metric such as Hamming loss


SLIDE 25

Loss-Augmented Decoding

  • want, for all z: w·ɸ(x,y) ≥ w·ɸ(x,z) + ℓ(y,z)
  • same as: w·ɸ(x,y) ≥ max_z [w·ɸ(x,z) + ℓ(y,z)]
  • loss-augmented decoding: argmax_z [w·ɸ(x,z) + ℓ(y,z)]
  • if ℓ(y,z) factors in z (e.g. Hamming), just modify the DP (see the sketch below)


[CIML's pseudocode with annotations: the inner argmax is computed by a modified DP (loss-augmented decoding); λ = 1/(2C); the update should have a learning rate; the algorithm is very similar to Pegasos, so the Pegasos framework should be used instead (next slide).]
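For Hamming loss the modification is one extra term per position; a sketch on top of the earlier Viterbi `decode` (hypothetical names, used only at training time):

```python
def loss_augmented_decode(words, gold, w, tagset):
    """argmax_z [w . phi(x,z) + hamming(y,z)]: Hamming loss factors
    over positions, so it enters the DP as a +1 local bonus whenever
    a candidate tag differs from the gold tag."""
    best, back = {"<s>": 0.0}, []
    for i, word in enumerate(words):
        new_best, pointers = {}, {}
        for t in tagset:
            prev = max(best, key=lambda p: best[p] + w.get((p, t), 0.0))
            pointers[t] = prev
            new_best[t] = (best[prev] + w.get((prev, t), 0.0)
                           + w.get((t, word), 0.0)
                           + (1.0 if t != gold[i] else 0.0))  # loss term
        best = new_best
        back.append(pointers)
    last = max(best, key=lambda t: best[t] + w.get((t, "</s>"), 0.0))
    tags = [last]
    for pointers in reversed(back[1:]):
        tags.append(pointers[tags[-1]])
    return tags[::-1]
```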

SLIDE 26

Correct Version following Pegasos

  • want, for all z: w·ɸ(x,y) ≥ w·ɸ(x,z) + ℓ(y,z)
  • same as: w·ɸ(x,y) ≥ max_z [w·ɸ(x,z) + ℓ(y,z)]
  • loss-augmented decoding: argmax_z [w·ɸ(x,z) + ℓ(y,z)]
  • if ℓ(y,z) factors in z (e.g. Hamming), just modify the DP


[Update schedule: factors 1/t and NC/(2t), where N = |D|, C is from the SVM objective, and t += 1 for each example.]
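A sketch of the resulting loop, assuming the schedule read off the slide (shrink w by 1 − 1/t, step by NC/(2t)) and the `phi` and `loss_augmented_decode` sketches from earlier slides; an illustration of the idea rather than Pegasos verbatim:

```python
def train_structured_svm(data, tagset, C=1.0, epochs=10):
    """Pegasos-style structured SVM: loss-augmented decoding, then a
    regularizer shrink and, on a violation, a step along delta-phi."""
    w, N, t = {}, len(data), 1
    for _ in range(epochs):
        for words, gold in data:
            z = loss_augmented_decode(words, gold, w, tagset)
            for f in w:                     # regularizer: w *= (1 - 1/t)
                w[f] *= 1.0 - 1.0 / t
            if z != gold:                   # margin violated: update
                eta = N * C / (2.0 * t)     # learning rate from the slide
                for f, v in phi(words, gold).items():
                    w[f] = w.get(f, 0.0) + eta * v
                for f, v in phi(words, z).items():
                    w[f] = w.get(f, 0.0) - eta * v
            t += 1                          # t advances once per example
    return w
```

In practice the shrink is folded into one scalar multiplier so each example touches only its active features; the explicit loop over all of w is for clarity.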

SLIDE 27

  • Struct. Perceptron vs. Struct. SVM
  • tagging on ATIS (train: 488 sentences); error rates: SVM < avg. perceptron << perceptron


perceptron:
epoch 1  updates 102, |W|=291, train_err 3.90%, dev_err 9.36%, avg_err 6.14%
epoch 2  updates 91,  |W|=334, train_err 3.33%, dev_err 8.19%, avg_err 4.97%
epoch 3  updates 78,  |W|=347, train_err 2.92%, dev_err 5.85%, avg_err 4.97%
epoch 4  updates 81,  |W|=368, train_err 3.11%, dev_err 6.73%, avg_err 5.85%
epoch 5  updates 78,  |W|=378, train_err 2.70%, dev_err 6.14%, avg_err 5.56%
epoch 6  updates 63,  |W|=385, train_err 2.26%, dev_err 6.14%, avg_err 5.56%
epoch 7  updates 69,  |W|=385, train_err 2.43%, dev_err 7.02%, avg_err 5.56%
epoch 8  updates 60,  |W|=388, train_err 2.15%, dev_err 6.73%, avg_err 5.56%
epoch 9  updates 59,  |W|=390, train_err 2.04%, dev_err 6.14%, avg_err 5.56%
epoch 10 updates 64,  |W|=394, train_err 2.15%, dev_err 5.85%, avg_err 5.26%

SVM (C=1):
epoch 1  updates 116, |W|=311, train_err 4.55%, dev_err 5.85%
epoch 2  updates 82,  |W|=328, train_err 3.05%, dev_err 4.97%
epoch 3  updates 78,  |W|=334, train_err 2.92%, dev_err 5.56%
epoch 4  updates 77,  |W|=339, train_err 2.92%, dev_err 5.26%
epoch 5  updates 80,  |W|=344, train_err 2.94%, dev_err 5.56%
epoch 6  updates 73,  |W|=345, train_err 2.75%, dev_err 4.68%
epoch 7  updates 72,  |W|=347, train_err 2.75%, dev_err 4.97%
epoch 8  updates 75,  |W|=352, train_err 2.86%, dev_err 4.97%
epoch 9  updates 74,  |W|=353, train_err 2.78%, dev_err 4.97%
epoch 10 updates 72,  |W|=354, train_err 2.78%, dev_err 4.97%