
Machine Learning Fall 2017: Structured Prediction (structured perceptron, HMM, structured SVM)



  1. Machine Learning Fall 2017 Structured Prediction (structured perceptron, HMM, structured SVM) Professor Liang Huang (Chap. 17 of CIML)

  2. Structured Prediction
  [Figure: example tasks — binary classification (x = "the man bit the dog", y = +1 or −1); POS tagging (y = DT NN VBD DT NN); parsing (y = a tree with NP and VP over the same sentence); translation (y = 那 人 咬 了 狗, "the man bit the dog")]
  • binary classification: output is binary
  • multiclass classification: output is a number (small # of classes)
  • structured classification: output is a structure (sequence, tree, graph)
    • part-of-speech tagging, parsing, summarization, translation
    • exponentially many classes: search (inference) efficiency is crucial!

  3. Generic Perceptron
  [Figure: the online loop — feed x_i to inference under the current weights w, get prediction z_i, compare with y_i, and update the weights on mistakes]
  • online learning: one example at a time ("learning by doing")
  • find the best output under the current weights
  • update weights at mistakes
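  A minimal Python sketch of this online loop follows; decode and features are hypothetical stand-ins for the task-specific argmax inference and feature map (they are not from the slides), and the update is the generic reward-gold / penalize-prediction rule described above.

    # A minimal sketch of the generic online perceptron loop (assumed interfaces).
    from collections import defaultdict

    def perceptron_train(data, decode, features, epochs=5):
        """data: list of (x, y); decode(x, w) returns the best output z under weights w;
        features(x, y) returns a sparse feature dict for the pair (x, y)."""
        w = defaultdict(float)
        for _ in range(epochs):
            for x, y in data:
                z = decode(x, w)                      # find the best output under current weights
                if z != y:                            # update only on mistakes
                    for f, v in features(x, y).items():
                        w[f] += v                     # reward features of the correct output
                    for f, v in features(x, z).items():
                        w[f] -= v                     # penalize features of the predicted output
        return w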

  4. Perceptron: from binary to structured
  [Figure: the same loop (exact inference, then update if y ≠ z) in all three settings]
  • binary classification: 2 classes, trivial exact inference
  • multiclass classification: constant # of classes, easy exact inference
  • structured classification: exponential # of classes, hard exact inference (e.g. "the man bit the dog" → DT NN VBD DT NN)

  5. From Perceptron to SVM
  [Figure: a timeline connecting the perceptron and SVM families; roughly:]
  • 1959 Rosenblatt: invention of the perceptron; 1962 Novikoff: convergence proof
  • 1964 Aizerman+: kernelized perceptron; 1964 Vapnik & Chervonenkis: max margin (same journal!)
  • 1995 Cortes/Vapnik: soft-margin SVM (fall of the USSR)
  • 1999 Freund/Schapire: voted/averaged perceptron, revived for the inseparable case
  • 2001 Lafferty+: CRF (structured multinomial logistic regression / max. entropy)
  • 2002 Collins: structured perceptron
  • 2003 Crammer/Singer: MIRA (conservative updates); 2006 Singer group: aggressive updates; 2005 McDonald+: structured MIRA
  • 2003 Taskar: M3N; 2004 Tsochantaridis: structured SVM
  • 2007–2010 Singer group: Pegasos (online/minibatch subgradient descent)
  • much of this work is from AT&T Research, ex-AT&T researchers, and their students

  6. Multiclass Classification
  • one weight vector ("prototype") for each class
  • multiclass decision rule: ŷ = argmax_{z ∈ 1...M} w(z) · x (best agreement with the prototype)
  • Q1: what about the 2-class case? Q2: do we still need the augmented space?

  7. Multiclass Perceptron • on an error, penalize the weight for the wrong class, and reward the weight for the true class
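  As a concrete illustration of slides 6–7, here is a small sketch of the multiclass perceptron with dense feature vectors and classes numbered 0..M−1; the names and shapes are illustrative assumptions, not from the slides.

    # Multiclass perceptron: one prototype per class, argmax decision rule,
    # reward the true class and penalize the wrongly predicted class on errors.
    import numpy as np

    def multiclass_perceptron(data, num_classes, dim, epochs=5):
        """data: list of (x, y) with x a length-`dim` numpy vector and y in 0..num_classes-1."""
        W = np.zeros((num_classes, dim))              # one weight vector ("prototype") per class
        for _ in range(epochs):
            for x, y in data:
                z = int(np.argmax(W @ x))             # best agreement with a prototype
                if z != y:
                    W[y] += x                         # reward the true class
                    W[z] -= x                         # penalize the wrong class
        return W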

  8. Convergence of Multiclass
  • update rule: w ← w + ΔΦ(x, y, z), where ΔΦ(x, y, z) = Φ(x, y) − Φ(x, z)
  • separability: there exists a (unit) oracle vector u such that for all (x, y) ∈ D and all z ≠ y, u · ΔΦ(x, y, z) ≥ δ

  9. Example: POS Tagging
  • gold standard y: DT NN VBD DT NN for x = "the man bit the dog"
  • current output z: DT NN NN DT NN for the same sentence
  • assume only two feature classes: tag bigrams (t_{i-1}, t_i) and word/tag pairs (w_i, t_i)
  • Φ(x, y) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (VBD, bit): 1, ..., (NN, dog): 1}
  • Φ(x, z) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (NN, bit): 1, ..., (NN, dog): 1}
  • the update Φ(x, y) − Φ(x, z) increases weights for (NN, VBD), (VBD, DT), (VBD, bit) and decreases weights for (NN, NN), (NN, DT), (NN, bit)
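  The feature vectors in this example can be computed in a few lines; the sketch below (illustrative names, assuming collections.Counter for sparse vectors) builds Φ from tag bigrams and word/tag pairs and reproduces the update direction Φ(x, y) − Φ(x, z) shown above.

    # Feature map for the tagging example: tag bigrams (t_{i-1}, t_i) and word/tag pairs (t_i, w_i).
    from collections import Counter

    def phi(words, tags):
        """Return the sparse feature vector Φ(x, y) as a Counter."""
        feats = Counter()
        padded = ["<s>"] + list(tags) + ["</s>"]
        for t_prev, t in zip(padded, padded[1:]):
            feats[(t_prev, t)] += 1                   # tag bigram features
        for w, t in zip(words, tags):
            feats[(t, w)] += 1                        # word/tag pair features
        return feats

    x = "the man bit the dog".split()
    y = ["DT", "NN", "VBD", "DT", "NN"]               # gold standard
    z = ["DT", "NN", "NN", "DT", "NN"]                # current (wrong) output
    delta = phi(x, y).copy()
    delta.subtract(phi(x, z))                         # signed difference Φ(x, y) − Φ(x, z)
    print({f: v for f, v in delta.items() if v != 0})
    # weights ++: (NN, VBD), (VBD, DT), (VBD, bit); weights --: (NN, NN), (NN, DT), (NN, bit)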

  10. Structured Perceptron
  [Figure: the online loop with structured inputs and outputs — inference maps x_i to z_i under w, and on a mistake w ← w + Φ(x_i, y_i) − Φ(x_i, z_i)]

  11. Inference: Dynamic Programming
  [Figure: the same loop, with the exact inference box (argmax) implemented by dynamic programming]

  12. Python implementation
  [Slide shows Python code for the dynamic-programming decoder; the code itself is not reproduced in this transcript, but a sketch follows below]
  • Q: what about top-down recursion + memoization?
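  Since the code on the slide is not reproduced here, the following is a rough bottom-up Viterbi sketch for a bigram tagger whose weights live in a dict keyed by tag bigrams and (tag, word) pairs (as in the earlier phi sketch); it is an illustration, not the original implementation.

    # Bottom-up Viterbi decoding (exact inference / argmax) for a bigram tagger.
    def viterbi_decode(words, tagset, w):
        """Return the highest-scoring tag sequence for `words` under sparse weights `w`."""
        n = len(words)
        best = [{} for _ in range(n)]                 # best[i][t] = (score, back-pointer)
        for t in tagset:
            best[0][t] = (w.get(("<s>", t), 0.0) + w.get((t, words[0]), 0.0), None)
        for i in range(1, n):
            for t in tagset:
                score, prev = max(
                    (best[i - 1][p][0] + w.get((p, t), 0.0) + w.get((t, words[i]), 0.0), p)
                    for p in tagset)
                best[i][t] = (score, prev)
        # add the </s> transition, then trace the back-pointers
        score, last = max((best[n - 1][t][0] + w.get((t, "</s>"), 0.0), t) for t in tagset)
        tags = [last]
        for i in range(n - 1, 0, -1):
            tags.append(best[i][tags[-1]][1])
        return list(reversed(tags))

  A top-down recursive decoder with memoization would fill the same chart lazily; the bottom-up version is shown only for brevity.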

  13. Efficiency vs. Expressiveness
  [Figure: the same loop, with inference written as argmax over y ∈ GEN(x)]
  • the inference (argmax over GEN(x)) must be efficient
  • either the search space GEN(x) is small, or it is factored
  • features must be local to y (but can be global to x)
    • e.g. a bigram tagger, but it can look at all input words (cf. CRFs)

  14. Averaged Perceptron
  • output the average of all intermediate weight vectors, w̄ = (1/T) Σ_{j=1..T} w^(j), instead of the final one
  • more stable and accurate results
  • approximation of the voted perceptron (Freund & Schapire, 1999)

  15. Averaging Tricks
  • use sparse weight vectors (defaultdict)
  • Daume (2006, PhD thesis): each intermediate vector is a sum of updates,
    w^(0) = 0, w^(1) = Δw^(1), w^(2) = Δw^(1) + Δw^(2), w^(3) = Δw^(1) + Δw^(2) + Δw^(3), w^(4) = Δw^(1) + Δw^(2) + Δw^(3) + Δw^(4), ...
  • so the average can be maintained by weighting each update Δw^(t) by how many steps it remains in the weight vector, rather than summing every w^(t) explicitly
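  Putting slides 14–15 together, one common implementation of the averaging trick keeps a second, counter-weighted accumulator so the average never requires summing every intermediate vector; the sketch below assumes phi returns a collections.Counter and decode is the task-specific argmax (both hypothetical helpers, as before).

    # Averaged structured perceptron with the counter-weighted accumulator trick.
    from collections import defaultdict

    def averaged_perceptron(data, decode, phi, epochs=5):
        w = defaultdict(float)      # current weights
        wa = defaultdict(float)     # accumulator of counter-weighted updates
        c = 1                       # example counter
        for _ in range(epochs):
            for x, y in data:
                z = decode(x, w)
                if z != y:
                    delta = phi(x, y).copy()
                    delta.subtract(phi(x, z))         # ΔΦ(x, y, z)
                    for f, v in delta.items():
                        w[f] += v
                        wa[f] += c * v
                c += 1
        # averaged weights: (1/c) * sum of intermediate weight vectors = w - wa / c
        return {f: w[f] - wa[f] / c for f in w}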

  16. Do we need smoothing?
  • smoothing is much easier in discriminative models
  • just make sure that for each feature template, its subset templates are also included
    • e.g., to include (t_0 w_0 w_-1) you must also include (t_0 w_0), (t_0 w_-1), (w_0 w_-1)
    • and maybe also (t_0 t_-1), because t is less sparse than w

  17. Convergence with Exact Search
  • linear classification: converges iff the data is separable
  • structured: converges iff the data is separable and the search is exact
  • separable: there is an oracle vector that correctly labels all examples, one vs. the rest (the correct label scores better than all incorrect labels)
  • theorem: if separable with margin δ, then # of updates ≤ R² / δ², where R is the diameter of the data (Rosenblatt 1957 ⇒ Collins 2002)
  [Figure: the separable binary case (y = +1 vs. y = −1, margin δ, diameter R) and its structured analogue (correct y vs. all z ≠ y)]

  18. Geometry of Convergence Proof, part 1
  [Figure: exact 1-best inference returns z; the perceptron update ΔΦ(x, y, z) moves the current model w^(k) to the new model w^(k+1); the unit oracle vector u separates the correct label y with margin ≥ δ]
  • perceptron update: w^(k+1) = w^(k) + ΔΦ(x, y, z)
  • separation: u · ΔΦ(x, y, z) ≥ δ, so each update grows u · w by at least δ
  • by induction, part 1 gives a lower bound: u · w^(k+1) ≥ k δ

  19. Geometry of Convergence Proof, part 2
  [Figure: the same update in the violation case; R is the max diameter, so every update satisfies ‖ΔΦ(x, y, z)‖² ≤ R²]
  • violation: the incorrect label z scored higher than the correct label y, so w^(k) · ΔΦ(x, y, z) ≤ 0
  • each update therefore grows ‖w‖² by at most R²; by induction, part 2 gives an upper bound: ‖w^(k+1)‖² ≤ k R²
  • parts 1 + 2 ⇒ bound on the number of updates: k ≤ R² / δ²
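  To spell out how the two parts combine, here is a sketch of the standard argument in LaTeX, under the assumptions that the initial weights are w^(1) = 0 and that the oracle u is a unit vector (both implicit in the slides).

    \begin{align*}
    \text{Part 1 (lower bound):}\quad
      u \cdot w^{(k+1)} &= u \cdot w^{(k)} + u \cdot \Delta\Phi(x,y,z)
        \ge u \cdot w^{(k)} + \delta
        \;\Rightarrow\; \|w^{(k+1)}\| \ge u \cdot w^{(k+1)} \ge k\,\delta \\
    \text{Part 2 (upper bound):}\quad
      \|w^{(k+1)}\|^2 &= \|w^{(k)}\|^2 + 2\,w^{(k)}\cdot\Delta\Phi(x,y,z) + \|\Delta\Phi(x,y,z)\|^2
        \le \|w^{(k)}\|^2 + R^2
        \;\Rightarrow\; \|w^{(k+1)}\|^2 \le k\,R^2 \\
    \text{Combining:}\quad
      k^2\,\delta^2 &\le \|w^{(k+1)}\|^2 \le k\,R^2
        \;\Rightarrow\; k \le R^2/\delta^2
    \end{align*}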

  20. Experiments

  21. Experiments: Tagging
  • (almost) identical features to (Ratnaparkhi, 1996)
  • trigram tagger: current tag t_i, previous tags t_{i-1}, t_{i-2}
  • current word w_i and its spelling features
  • surrounding words w_{i-1}, w_{i+1}, w_{i-2}, w_{i+2}, ...

  22. Experiments: NP Chunking
  • B-I-O scheme, e.g.:
    Rockwell/B International/I Corp./I 's/B Tulsa/I unit/I said/O it/B signed/O a/B tentative/I agreement/I ...
  • features:
    • unigram model
    • surrounding words and POS tags

  23. Experiments: NP Chunking
  • results (Sha and Pereira, 2003), trigram tagger:
    • voted perceptron: 94.09% vs. CRF: 94.38%

  24. Structured SVM
  • structured perceptron: w · ΔΦ(x, y, z) > 0
  • SVM: for all (x, y), functional margin y (w · x) ≥ 1
  • structured SVM, version 1 (simple loss):
    • for all (x, y) and all z ≠ y, margin w · ΔΦ(x, y, z) ≥ 1
    • the correct y has to score higher than any wrong z by 1
  • structured SVM, version 2 (structured loss):
    • for all (x, y) and all z ≠ y, margin w · ΔΦ(x, y, z) ≥ ℓ(y, z)
    • the correct y has to score higher than any wrong z by ℓ(y, z), a distance metric such as Hamming loss

  25. Loss-Augmented Decoding
  • want, for all z: w · Φ(x, y) ≥ w · Φ(x, z) + ℓ(y, z)
  • same as: w · Φ(x, y) ≥ max_z [ w · Φ(x, z) + ℓ(y, z) ]
  • loss-augmented decoding: argmax_z [ w · Φ(x, z) + ℓ(y, z) ]
  • if ℓ(y, z) factors over z (e.g. Hamming loss), just modify the DP
  [Slide shows the CIML version of the training pseudocode with the modified DP; annotations: it should have a learning rate, λ = 1/(2C); very similar to Pegasos, but one should use the Pegasos framework instead]
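  If ℓ(y, z) is the Hamming loss, it factors over positions, so loss-augmented decoding only changes the local scores; the sketch below modifies the earlier Viterbi decoder by adding +1 at every position whose tag disagrees with the gold tag (illustrative names, not the slide's code).

    # Loss-augmented Viterbi: argmax_z [ w · Φ(x, z) + hamming(y, z) ] for a bigram tagger.
    def loss_augmented_decode(words, gold_tags, tagset, w):
        n = len(words)
        best = [{} for _ in range(n)]

        def local(i, t):                              # word/tag score plus per-position loss
            return w.get((t, words[i]), 0.0) + (0.0 if t == gold_tags[i] else 1.0)

        for t in tagset:
            best[0][t] = (w.get(("<s>", t), 0.0) + local(0, t), None)
        for i in range(1, n):
            for t in tagset:
                score, prev = max((best[i - 1][p][0] + w.get((p, t), 0.0) + local(i, t), p)
                                  for p in tagset)
                best[i][t] = (score, prev)
        score, last = max((best[n - 1][t][0] + w.get((t, "</s>"), 0.0), t) for t in tagset)
        tags = [last]
        for i in range(n - 1, 0, -1):
            tags.append(best[i][tags[-1]][1])
        return list(reversed(tags))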

  26. Correct Version following Pegasos
  • want, for all z: w · Φ(x, y) ≥ w · Φ(x, z) + ℓ(y, z)
  • same as: w · Φ(x, y) ≥ max_z [ w · Φ(x, z) + ℓ(y, z) ]
  • loss-augmented decoding: argmax_z [ w · Φ(x, z) + ℓ(y, z) ]
  • if ℓ(y, z) factors over z (e.g. Hamming loss), just modify the DP
  [Slide shows Pegasos-style pseudocode; annotations mention a learning rate of 1/t with t += 1 for each example, N = |D|, C from the SVM objective, and a factor NC/2t]
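  The slide's exact pseudocode and constants (the 1/t rate, N = |D|, C, and the NC/2t factor) are not reproduced in this transcript. As a rough, generic illustration only, a Pegasos-style stochastic subgradient step for the structured hinge loss with regularization strength λ and learning rate 1/(λt) could look like the sketch below; it reuses the hypothetical phi and loss_augmented_decode helpers from above.

    # Generic Pegasos-style update for the structured hinge loss (not the slide's exact version).
    def pegasos_structured_svm(data, tagset, phi, loss_augmented_decode, lam=0.01, epochs=5):
        w, t = {}, 1
        for _ in range(epochs):
            for words, gold in data:
                eta = 1.0 / (lam * t)                 # learning rate 1/(λ t)
                z = loss_augmented_decode(words, gold, tagset, w)
                for f in list(w):                     # shrink w: gradient step on (λ/2)·||w||²
                    w[f] *= (1.0 - eta * lam)
                if z != list(gold):                   # loss-augmented argmax beats the gold output
                    delta = phi(words, gold).copy()
                    delta.subtract(phi(words, z))     # ΔΦ(x, y, z)
                    for f, v in delta.items():
                        w[f] = w.get(f, 0.0) + eta * v
                t += 1                                # t += 1 for each example, as on the slide
        return w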
