
Scalable Large-Margin Structured Learning: Theory and Algorithms - PowerPoint PPT Presentation



  1. What is Structured Prediction?

  Scalable Large-Margin Structured Learning: Theory and Algorithms
  Liang Huang, Kai Zhao, Lemao Liu
  The City University of New York (CUNY)
  slides at: http://acl.cs.qc.edu/~lhuang/

  • binary classification: output is binary (y = +1 or y = -1)
    • e.g. x = "the man bit the dog" vs. x = "the man hit the dog"
  • multiclass classification: output is a (small) number
  • structured classification: output is a structure (sequence, tree, graph)
    • tagging: x = "the man bit the dog" → y = DT NN VBD DT NN
    • parsing: x = "the man bit the dog" → y = a parse tree (S → NP VP, ...)
    • translation: x = "the man bit the dog" → y = 那 人 咬 了 狗
  • part-of-speech tagging, parsing, summarization, translation
  • exponentially many classes: search (inference) efficiency is crucial!
    (see the toy calculation below)
  • NLP is all about structured prediction!

  Examples of Bad Structured Prediction
  (figures; no text captured)

  Learning: Unstructured vs. Structured

                              binary/multiclass                 structured learning
  generative                  naive Bayes (count & divide)      HMMs
  conditional (discriminative) logistic regression (maxent;     CRFs
                                via expectations)
  online + Viterbi            perceptron (argmax)               structured perceptron
  max margin                  SVM (loss-augmented argmax)       structured SVM
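To make "exponentially many classes" concrete, here is a toy calculation; the numbers are illustrative (the 45-tag set is the Penn Treebank's, and the sentence is the slides' running example):

```python
# With T tags, a sentence of n words has T^n possible tag sequences,
# so enumerating all "classes" is hopeless; search must be efficient.
n, T = 5, 45        # "the man bit the dog" with the 45 Penn Treebank POS tags
print(T ** n)       # 184528125 possible tag sequences for just 5 words
```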

  2. Why Perceptron (Online Learning)?

  • because we want scalability on big data!
  • learning time has to be linear in the number of examples
    • can make only a constant number of passes over the training data
    • only online learning (perceptron/MIRA) can guarantee this!
    • SVM scales between O(n^2) and O(n^3); CRF has no guarantee
  • and inference on each example must be super fast
  • another advantage of perceptron: it just needs the argmax

  Perceptron: from binary to structured
  • binary perceptron (Rosenblatt, 1959): trivial inference, exact; 2 classes
  • multiclass perceptron (Freund/Schapire, 1999): easy inference, exact; constant # of classes
  • structured perceptron (Collins, 2002): hard inference, exact; exponential # of classes
    • e.g. x = "the man bit the dog" (or 那 人 咬 了 狗) → a full structure z
  • all three share the same loop: inference z under current w, then update
    weights if y ≠ z (see the binary sketch below)

  Scalability Challenges
  • inference (on one example) is too slow (even w/ dynamic programming)
    • can we sacrifice search exactness for faster learning?
    • would inexact search interfere with learning? if so, how should we modify learning?
  • even the fastest inexact inference is still too slow
    • due to too many training examples
    • can we parallelize online learning?
  (diagram: training examples 1-16 processed in parallel, their updates combined ⨁)

  Tutorial Outline
  • Overview of Structured Learning
  • Challenges in Scalability
  • Structured Perceptron
    • convergence proof
  • Structured Perceptron with Inexact Search
  • Latent-Variable Structured Perceptron
  • Parallelizing Online Learning (Perceptron & MIRA)
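A minimal sketch of the binary perceptron loop referenced above (Rosenblatt, 1959); the feature vectors and toy data are made-up illustrations, and "inference" here is the trivial two-class case z = sign(w · x):

```python
import numpy as np

def perceptron(data, epochs=10):
    """data: list of (x, y) with x a feature vector and y in {+1, -1}."""
    w = np.zeros(len(data[0][0]))
    for _ in range(epochs):                # constant number of passes
        for x, y in data:
            z = 1 if w @ x >= 0 else -1    # trivial "inference" (2 classes)
            if z != y:                     # update weights at mistakes
                w += y * x
    return w

# toy usage (hypothetical data): two linearly separable points
data = [(np.array([1.0, 2.0]), +1), (np.array([-1.0, -1.5]), -1)]
w = perceptron(data)
```

The multiclass and structured versions keep exactly this loop; only the argmax inside it gets harder.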

  3. Generic Perceptron

  • perceptron is the simplest machine learning algorithm
  • online learning: one example at a time
  • learning by doing:
    • find the best output z_i under the current weights w (inference)
    • update weights at mistakes

  Structured Perceptron
  • same loop as the binary case, but y_i and z_i are now structures:
    x_i = "the man bit the dog", gold y_i = DT NN VBD DT NN, predicted z_i = DT NN NN DT NN
  • update weights if z_i ≠ y_i:  w ← w + Φ(x_i, y_i) − Φ(x_i, z_i)

  Example: POS Tagging
  • gold-standard y: DT NN VBD DT NN  ("the man bit the dog")
  • current output z: DT NN NN DT NN  ("the man bit the dog")
  • assume only two feature classes:
    • tag bigrams (t_{i-1}, t_i)
    • word/tag pairs (w_i, t_i)
  • comparing Φ(x, y) against Φ(x, z), the update gives:
    • weights ++ : (NN, VBD), (VBD, DT), (VBD → bit)
    • weights -- : (NN, NN), (NN, DT), (NN → bit)

  Inference: Dynamic Programming
  • the argmax is computed exactly by dynamic programming
  • tagging: O(nT³); CKY parsing: O(n³)  (see the Viterbi sketch below)
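A minimal runnable sketch of the structured perceptron (Collins, 2002) with Viterbi inference, using exactly the two feature classes from the slide; the function names, tagset, and toy data are illustrative assumptions (with bigram features this Viterbi is O(nT²); the slide's O(nT³) corresponds to a richer model):

```python
from collections import defaultdict

def features(words, tags):
    """Phi(x, y): the two local feature classes, summed over the sequence."""
    feats = defaultdict(int)
    prev = "<s>"
    for w, t in zip(words, tags):
        feats[("bigram", prev, t)] += 1   # tag bigram (t_{i-1}, t_i)
        feats[("wordtag", w, t)] += 1     # word/tag pair (w_i, t_i)
        prev = t
    return feats

def viterbi(words, tagset, weight):
    """argmax_z w . Phi(x, z) by dynamic programming."""
    # chart[i][t] = (best score of a tagging of words[:i+1] ending in t, backpointer)
    chart = [{} for _ in words]
    for t in tagset:
        chart[0][t] = (weight[("bigram", "<s>", t)]
                       + weight[("wordtag", words[0], t)], None)
    for i in range(1, len(words)):
        for t in tagset:
            chart[i][t] = max(
                (chart[i-1][p][0] + weight[("bigram", p, t)]
                 + weight[("wordtag", words[i], t)], p)
                for p in tagset)
    best = max(tagset, key=lambda t: chart[-1][t][0])
    tags = [best]
    for i in range(len(words) - 1, 0, -1):   # follow backpointers
        best = chart[i][best][1]
        tags.append(best)
    return list(reversed(tags))

def train(data, tagset, epochs=5):
    w = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            z = viterbi(words, tagset, w)    # inference under current w
            if z != gold:                    # update weights at mistakes:
                for f, v in features(words, gold).items():
                    w[f] += v                # w <- w + Phi(x, y)
                for f, v in features(words, z).items():
                    w[f] -= v                #        - Phi(x, z)
    return w

# toy usage with the slides' running example
data = [("the man bit the dog".split(), "DT NN VBD DT NN".split())]
w = train(data, tagset={"DT", "NN", "VBD"})
```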

  4. Efficiency vs. Expressiveness

  • the inference step z = argmax_{y ∈ GEN(x)} w · Φ(x, y) must be efficient
    • either the search space GEN(x) is small, or it factors
  • features must be local to y (but can be global to x)
    • e.g. a bigram tagger looks only at adjacent tags, but may look at all input words (cf. CRFs)

  Averaged Perceptron
  • much more stable and accurate results
  • an approximation of the voted perceptron (large-margin) (Freund & Schapire, 1999)
  • output the average of all intermediate weight vectors, w̄ = (1/T) Σ_j w^(j),
    instead of the final w

  Averaging => Large Margin
  (plot: test error over training, averaged vs. unaveraged weights)

  Efficient Implementation of Averaging
  • a naive implementation (a running sum of every w^(j)) doesn't scale
  • very clever trick from Daume (2006, PhD thesis), using the fact that each
    w^(t) is the sum of the updates Δw^(s) made so far:
      w^(1) = Δw^(1)
      w^(2) = Δw^(1) + Δw^(2)
      w^(3) = Δw^(1) + Δw^(2) + Δw^(3)
      w^(4) = Δw^(1) + Δw^(2) + Δw^(3) + Δw^(4)
  (a sketch of the trick follows)
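A common way to implement the trick, shown on the binary perceptron for brevity (the same bookkeeping carries over to the structured case; the exact pseudocode in the thesis may differ in off-by-one details): keep a second vector of timestamp-weighted updates, so the average never has to be summed explicitly.

```python
import numpy as np

def averaged_perceptron(data, epochs=10):
    dim = len(data[0][0])
    w = np.zeros(dim)      # current weights
    u = np.zeros(dim)      # timestamp-weighted sum of updates
    c = 1                  # example counter ("clock")
    for _ in range(epochs):
        for x, y in data:
            if y * (w @ x) <= 0:   # mistake
                w += y * x
                u += c * y * x     # same update, scaled by the timestamp
            c += 1                 # the clock advances on every example
    # w - u/c is proportional to the average of all intermediate weight
    # vectors, which is all that matters for prediction (scaling is free)
    return w - u / c
```

This is O(size of each update) per example, instead of O(dim) per example for the naive running sum.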

  5. Perceptron vs. CRFs

  • perceptron is an online and Viterbi approximation of CRFs (Lafferty et al., 2001)
  • simpler to code; faster to converge; ~same accuracy
  • CRFs maximize  Σ_{(x,y)∈D} log [ exp(w · Φ(x, y)) / Z(x) ],
    where Z(x) = Σ_{z∈GEN(x)} exp(w · Φ(x, z))
  • replacing the expectations with an argmax gives "hard/Viterbi CRFs":
    for (x, y) ∈ D, compute argmax_{z∈GEN(x)} w · Φ(x, z)
  • adding stochastic gradient descent (SGD, online) on top of the Viterbi
    approximation yields the structured perceptron (Collins, 2002)

  Perceptron Convergence Proof
  • binary classification: converges iff the data is separable
  • structured prediction: converges iff the data is separable
    • separable: there is an oracle vector that correctly labels all examples,
      one vs. the rest (correct label scores better than all incorrect labels)
  • theorem: if separable with margin δ, then # of updates ≤ R²/δ²  (R: diameter)
  • Novikoff (1962) => Freund & Schapire (1999) => Collins (2002)
  (diagram: examples such as x_100, x_111, x_2000, x_3012 separated by margin δ
   within diameter R; for x_100, some z ≠ y_100 is scored close to the gold y_100)

  Geometry of Convergence Proof (parts 1 and 2)
  • summary: the proof uses 3 facts:
    1. separation (margin δ)
    2. diameter (R, always finite)
    3. violation (guaranteed by exact search): the incorrect 1-best z scored
       higher than the correct label y
  • perceptron update: w^(k+1) = w^(k) + ΔΦ(x, y, z), with ΔΦ(x, y, z) = Φ(x, y) − Φ(x, z)
  • part 1 (lower bound): each update gains ≥ δ along the unit oracle vector u
    (the angle between u and the update is < 90°), so by induction ‖w^(k+1)‖ ≥ kδ
  • part 2 (upper bound): by induction ‖w^(k+1)‖² ≤ kR², i.e. ‖w^(k+1)‖ ≤ √k · R
  • combining kδ ≤ ‖w^(k+1)‖ ≤ √k · R gives the bound on # of updates: k ≤ R²/δ²
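For reference, a worked version of the two inductions, under the standard assumptions (u is the unit oracle vector with u · ΔΦ(x, y, z) ≥ δ for every possible mistake, and R bounds ‖ΔΦ(x, y, z)‖):

```latex
% Part 1 (lower bound): each update gains at least \delta along u,
%   w^{(k+1)} \cdot u = (w^{(k)} + \Delta\Phi) \cdot u \ge w^{(k)} \cdot u + \delta,
% so by induction, and since \|u\| = 1,
\[
  \|w^{(k+1)}\| \;\ge\; w^{(k+1)} \cdot u \;\ge\; k\delta .
\]

% Part 2 (upper bound): exact search guarantees each update is a violation,
% i.e. w^{(k)} \cdot \Delta\Phi \le 0, so by induction
\[
  \|w^{(k+1)}\|^2
    = \|w^{(k)}\|^2 + 2\, w^{(k)} \cdot \Delta\Phi + \|\Delta\Phi\|^2
    \;\le\; \|w^{(k)}\|^2 + R^2
    \;\le\; kR^2 .
\]

% Combining the two bounds:
%   k\delta \le \|w^{(k+1)}\| \le \sqrt{k}\,R
%   \Rightarrow k \le R^2 / \delta^2 .
```

Note where fact 3 enters: without a guaranteed violation (i.e. with inexact search), part 2 breaks, which is exactly the issue the next section takes up.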

  6. Tutorial Outline

  • Overview of Structured Learning
  • Challenges in Scalability
  • Structured Perceptron
    • convergence proof
  • Structured Perceptron with Inexact Search
  • Latent-Variable Perceptron
  • Parallelizing Online Learning (Perceptron & MIRA)

  Scalability Challenge 1: Inference
  • binary classification: trivial inference, constant # of classes
  • structured classification: hard inference, exponential # of classes
    (e.g. x = "the man bit the dog" → z, update weights if y ≠ z)
  • challenge: search efficiency (exponentially many classes)
    • often use dynamic programming (DP)
    • but DP is still too slow for repeated use, e.g. parsing is O(n³)
  • Q: can we sacrifice search exactness for faster learning?

  Perceptron w/ Inexact Inference
  • routine use of inexact inference in NLP (e.g. beam search, greedy search)
  • Q: does the perceptron still work?
  • so far most structured learning theory assumes exact search
    • would search errors break these learning properties?
    • if so, how do we modify learning to accommodate inexact search?

  Bad News and Good News
  • A: it no longer works as is, but we can make it work by some magic
  • bad news: no more guarantee of convergence
    • in practice the perceptron degrades a lot due to search errors
  • good news: new update methods guarantee convergence
    • new perceptron variants that "live with" search errors (see the sketch below)
    • in practice they work really well w/ inexact search
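The slides don't name the variants at this point, but one well-known update method of this kind is early update (Collins & Roark, 2004): stop beam search as soon as the gold prefix falls off the beam, and update on the prefix where the violation is guaranteed. A minimal sketch on top of the bigram tagger from earlier (it reuses `features` and a `defaultdict(float)` weight vector `w`; the beam width and helper names are illustrative assumptions):

```python
def update(w, words, gold_tags, pred_tags):
    """Standard perceptron update: w <- w + Phi(x, y) - Phi(x, z)."""
    for f, v in features(words, gold_tags).items():
        w[f] += v
    for f, v in features(words, pred_tags).items():
        w[f] -= v

def beam_early_update(words, gold, tagset, w, beam_size=4):
    """One perceptron step with beam search + early update.
    Returns True if an update was made."""
    beam = [((), 0.0)]                       # (tag prefix, score)
    for i, word in enumerate(words):
        cands = []
        for prefix, score in beam:
            prev = prefix[-1] if prefix else "<s>"
            for t in tagset:
                s = score + w[("bigram", prev, t)] + w[("wordtag", word, t)]
                cands.append((prefix + (t,), s))
        beam = sorted(cands, key=lambda c: -c[1])[:beam_size]  # inexact search
        gold_prefix = tuple(gold[:i + 1])
        if all(prefix != gold_prefix for prefix, _ in beam):
            # gold fell off the beam: update on the prefix up to position i,
            # where the top beam item is a guaranteed violation
            update(w, words[:i + 1], gold_prefix, beam[0][0])
            return True
    if beam[0][0] != tuple(gold):            # full update at the end
        update(w, words, tuple(gold), beam[0][0])
        return True
    return False
```

The point of stopping early is to restore fact 3 of the convergence proof (a guaranteed violation) even though the search itself is inexact.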
