Scalable Large-Margin Structured Learning: Theory and Algorithms

Liang Huang, Kai Zhao, and Lemao Liu
The City University of New York (CUNY)

slides at: http://acl.cs.qc.edu/~lhuang/

What is Structured Prediction?

  • binary classification: output is binary
  • multiclass classification: output is one of a (small) number of classes
  • structured classification: output is a structure (sequence, tree, graph)
  • examples: part-of-speech tagging, parsing, summarization, translation
  • exponentially many classes: search (inference) efficiency is crucial!

[figure: example outputs; binary (x ↦ y = ±1), tagging (“the man bit the dog” ↦ DT NN VBD DT NN), parsing (“the man bit the dog” ↦ (S (NP (DT the) (NN man)) (VP (VBD bit) (NP (DT the) (NN dog))))), translation (“the man bit the dog” ↦ 那 人 咬 了 狗)]
NLP is all about structured prediction!

Examples of Bad Structured Prediction


Learning: Unstructured vs. Structured

                                        binary/multiclass                structured learning
  generative (count & divide):          naive Bayes                      HMMs
  conditional (expectations):           logistic regression (maxent)     CRFs
  max margin (loss-augmented argmax):   SVM                              structured SVM
  online + Viterbi (argmax):            perceptron                       structured perceptron


Why Perceptron (Online Learning)?

  • because we want scalability on big data!
  • learning time has to be linear in the number of examples
  • can make only a constant number of passes over the training data
  • only online learning (perceptron/MIRA) can guarantee this!
  • SVM scales between O(n²) and O(n³); CRFs offer no such guarantee
  • and inference on each example must be super fast
  • another advantage of the perceptron: it only needs the argmax

Perceptron: from binary to structured

All three variants share the same loop (run inference z = argmax w · Φ(x, ·); update w if y ≠ z), but differ in how hard that inference is:

  • binary perceptron (Rosenblatt, 1959): 2 classes; exact inference is trivial
  • multiclass perceptron (Freund/Schapire, 1999): constant # of classes; exact inference is easy
  • structured perceptron (Collins, 2002): exponential # of classes (e.g., all tag sequences for “the man bit the dog”); exact inference is hard
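To make the shared loop concrete, here is a minimal sketch in Python (our illustration, not the tutorial's code); the trainer is identical for all three variants, and only the plugged-in `phi` and `argmax_fn` change:

```python
# Minimal sketch of the unified perceptron loop. phi(x, y) maps an
# (input, output) pair to a sparse feature dict; argmax_fn(w, x) hides
# the inference problem: trivial for 2 classes, easy for k classes,
# hard (search) for structured outputs.
from collections import defaultdict

def perceptron(train, phi, argmax_fn, epochs=5):
    """train: list of (x, y) pairs; returns the learned weight dict."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in train:
            z = argmax_fn(w, x)          # inference under current weights
            if z != y:                   # mistake: w += Phi(x,y) - Phi(x,z)
                for f, v in phi(x, y).items():
                    w[f] += v
                for f, v in phi(x, z).items():
                    w[f] -= v
    return w
```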

Scalability Challenges

  • inference (on one example) is too slow (even w/ DP)
  • can we sacrifice search exactness for faster learning?
  • would inexact search interfere with learning?
  • if so, how should we modify learning?
  • even fastest inexact inference is still too slow
  • due to too many training examples
  • can we parallelize online learning?

[figure: online learning over examples 1–16, with one weight update after each example]

Tutorial Outline

  • Overview of Structured Learning
  • Challenges in Scalability
  • Structured Perceptron
  • convergence proof
  • Structured Perceptron with Inexact Search
  • Latent-Variable Structured Perceptron
  • Parallelizing Online Learning (Perceptron & MIRA)



Generic Perceptron

  • perceptron is the simplest machine learning algorithm
  • online-learning: one example at a time
  • learning by doing
  • find the best output under the current weights
  • update weights at mistakes

[figure: the perceptron loop; for each example xᵢ, inference produces zᵢ, and w is updated when zᵢ ≠ yᵢ]

Structured Perceptron

[figure: the same loop with structured outputs; e.g., x = “the man bit the dog”, gold y = DT NN VBD DT NN, current output z = DT NN NN DT NN]

Example: POS Tagging

  • gold-standard y: DT NN VBD DT NN (“the man bit the dog”)
  • current output z: DT NN NN DT NN (“the man bit the dog”)
  • assume only two feature classes
  • tag bigrams (t_{i−1}, t_i)
  • word/tag pairs (w_i → t_i)
  • update w ← w + Φ(x, y) − Φ(x, z)
  • weights ++: (NN, VBD), (VBD, DT), (VBD → bit)
  • weights −−: (NN, NN), (NN, DT), (NN → bit)
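As a sketch (the feature-name tuples are ours), the two feature classes and the resulting update look like this in Python:

```python
# Phi(x, y) with the slide's two feature classes: tag bigrams (t_{i-1}, t_i)
# and word/tag pairs (w_i -> t_i).
from collections import Counter, defaultdict

def phi(words, tags):
    feats = Counter()
    prev = "<s>"                        # start-of-sentence tag
    for word, tag in zip(words, tags):
        feats["bigram", prev, tag] += 1
        feats["word/tag", word, tag] += 1
        prev = tag
    return feats

x = "the man bit the dog".split()
y = "DT NN VBD DT NN".split()           # gold standard
z = "DT NN NN DT NN".split()            # current (wrong) output

w = defaultdict(float)
for f, v in phi(x, y).items():          # shared features cancel; net effect
    w[f] += v                           # ++: (NN,VBD), (VBD,DT), (VBD->bit)
for f, v in phi(x, z).items():
    w[f] -= v                           # --: (NN,NN), (NN,DT), (NN->bit)
```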

Inference: Dynamic Programming

  • exact inference via dynamic programming
  • tagging: O(nT³); CKY parsing: O(n³)
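For concreteness, here is a sketch of exact DP inference for a bigram tagger, reusing the `phi`-style features above (a bigram model is O(nT²); the slide's O(nT³) corresponds to a trigram model):

```python
def viterbi(w, words, tagset):
    """Exact argmax over tag sequences under bigram + word/tag features."""
    n = len(words)
    best = [dict() for _ in range(n)]   # best[i][t] = (score, prev_tag)
    for t in tagset:
        best[0][t] = (w.get(("bigram", "<s>", t), 0.0)
                      + w.get(("word/tag", words[0], t), 0.0), None)
    for i in range(1, n):
        for t in tagset:
            best[i][t] = max(
                (best[i - 1][p][0]
                 + w.get(("bigram", p, t), 0.0)
                 + w.get(("word/tag", words[i], t), 0.0), p)
                for p in tagset)
    tag = max(tagset, key=lambda t: best[n - 1][t][0])
    tags = [tag]
    for i in range(n - 1, 0, -1):       # follow backpointers
        tag = best[i][tag][1]
        tags.append(tag)
    return tags[::-1]
```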


Efficiency vs. Expressiveness

  • the inference (argmax) step must be efficient
  • either the search space GEN(x) is small, or it factors into local decisions
  • features must be local to y (but can be global to x)
  • e.g., a bigram tagger sees only adjacent tags, but may look at all input words (cf. CRFs)

The inference step computes argmax_{y ∈ GEN(x)} w · Φ(x, y).

Averaged Perceptron

  • much more stable and accurate results
  • approximation of voted perceptron (large-margin)

(Freund & Schapire, 1999)

[figure: test error of the per-iteration weights w⁽ʲ⁾ vs. the running average; averaging smooths the fluctuations]


Efficient Implementation of Averaging

  • the naive implementation (a running sum of weight vectors) doesn’t scale
  • very clever trick from Daumé (2006, PhD thesis)

w^(1) = Δw^(1)
w^(2) = Δw^(1) + Δw^(2)
w^(3) = Δw^(1) + Δw^(2) + Δw^(3)
w^(4) = Δw^(1) + Δw^(2) + Δw^(3) + Δw^(4)

so the running sum Σₜ w^(t) weights each update Δw^(t) by how long it survives, and can be maintained with one extra accumulator vector.
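A common form of the trick (our sketch, not necessarily Daumé's exact formulation): alongside w, keep an accumulator wa of step-weighted updates; the average is then recoverable in one O(|w|) pass at the end.

```python
# Lazy averaging (one common form of the trick; our sketch). Maintain
# wa = sum over updates of (step * delta); the average of w over all steps
# is then w - wa / c (up to an off-by-one), with no per-step full-vector sum.
from collections import defaultdict

class AveragedPerceptronWeights:
    def __init__(self):
        self.w = defaultdict(float)     # current weights
        self.wa = defaultdict(float)    # step-weighted accumulator
        self.c = 0                      # number of steps seen

    def step(self, delta):
        """delta: sparse dict Phi(x, y) - Phi(x, z); pass {} on no mistake."""
        self.c += 1
        for f, v in delta.items():
            self.w[f] += v
            self.wa[f] += self.c * v

    def averaged(self):
        return {f: v - self.wa[f] / self.c for f, v in self.w.items()}
```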


Perceptron vs. CRFs

  • the perceptron is an online, Viterbi approximation of a CRF
  • simpler to code; faster to converge; ~same accuracy
  • CRF (Lafferty et al, 2001): batch training with expectations, weighting each z ∈ GEN(x) by exp(w · Φ(x, z)) / Z(x), summed over (x, y) ∈ D
  • structured perceptron (Collins, 2002) ≈ SGD + hard/Viterbi CRF: for each (x, y) ∈ D, replace the expectation with the single argmax_{z ∈ GEN(x)} w · Φ(x, z)

Perceptron Convergence Proof

  • binary classification: converges iff the data is separable
  • structured prediction: converges iff the data is separable
  • separable: there is an oracle vector that correctly labels all examples
  • one vs. the rest (the correct label scores higher than all incorrect labels)
  • theorem: if separable, then the # of updates ≤ R²/δ² (R: diameter; δ: margin)

(Novikoff 1962 ⇒ Freund & Schapire 1999 ⇒ Collins 2002)

Geometry of Convergence Proof pt 1

Part 1 (lower bound): the perceptron update is w^(k+1) = w^(k) + ΔΦ(x, y, z), where z is the exact 1-best and y the correct label. By δ-separation there is a unit oracle vector u with margin u · ΔΦ(x, y, z) ≥ δ, so u · w^(k+1) ≥ u · w^(k) + δ, and by induction ‖w^(k+1)‖ ≥ u · w^(k+1) ≥ kδ after k updates (the angle between u and w stays below 90˚).
<90˚

Geometry of Convergence Proof pt 2

Part 2 (upper bound): each update is on a violation (the incorrect label z scored higher than y, so w^(k) · ΔΦ(x, y, z) ≤ 0), and the diameter is bounded: ‖ΔΦ(x, y, z)‖ ≤ R. Expanding the update gives ‖w^(k+1)‖² ≤ ‖w^(k)‖² + R², so by induction ‖w^(k+1)‖² ≤ kR².

summary: the proof uses 3 facts:

  • 1. separation (margin)
  • 2. diameter (always finite)
  • 3. violation (guaranteed by exact search)

combining with part 1: kδ ≤ ‖w^(k+1)‖ ≤ √k R, which bounds the # of updates: k ≤ R²/δ².
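For reference, the whole chain of the slides' derivation in one display:

```latex
% Mistake bound for the structured perceptron (Novikoff 1962; Collins 2002).
% Assumptions: unit oracle vector u with margin u . DeltaPhi >= delta,
% diameter ||DeltaPhi|| <= R, and every update a violation.
\begin{align*}
  w^{(k+1)} &= w^{(k)} + \Delta\Phi(x, y, z) \\
  \|w^{(k+1)}\| \;\ge\; u \cdot w^{(k+1)} &\ge k\delta
      && \text{(part 1: separation)} \\
  \|w^{(k+1)}\|^2 \le \|w^{(k)}\|^2 + R^2 \;\Rightarrow\; \|w^{(k+1)}\| &\le \sqrt{k}\,R
      && \text{(part 2: diameter, violation)} \\
  k\delta \le \sqrt{k}\,R \;\Rightarrow\; k &\le R^2/\delta^2
      && \text{(mistake bound)}
\end{align*}
```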


Tutorial Outline

  • Overview of Structured Learning
  • Challenges in Scalability
  • Structured Perceptron
  • convergence proof
  • Structured Perceptron with Inexact Search
  • Latent-Variable Perceptron
  • Parallelizing Online Learning (Perceptron & MIRA)


Scalability Challenge 1: Inference

  • challenge: search efficiency (exponentially many classes)
  • often addressed with dynamic programming (DP)
  • but DP is still too slow for repeated use, e.g., parsing is O(n³)
  • Q: can we sacrifice search exactness for faster learning?

[figure: binary classification (trivial inference, constant # of classes) vs. structured classification (hard inference, exponential # of classes)]

Perceptron w/ Inexact Inference

  • inexact inference (e.g., beam search) is in routine use in NLP
  • how does the structured perceptron work with inexact search?
  • so far most structured learning theory assumes exact search
  • would search errors break these learning properties?
  • if so, how should we modify learning to accommodate inexact search?

Q: does the perceptron still work when exact inference is replaced by inexact inference (beam or greedy search)?

Bad News and Good News

  • bad news: no more guarantee of convergence
  • in practice perceptron degrades a lot due to search errors
  • good news: new update methods guarantee convergence
  • new perceptron variants that “live with” search errors
  • in practice they work really well w/ inexact search

A: it no longer works as-is, but we can make it work by some magic.


Convergence with Exact Search

[figure: toy example; training example “time flies” with gold tags N V, output space {N, V} × {N, V}; with exact search each update moves w toward the correct label, and the structured perceptron converges]

No Convergence w/ Greedy Search

[figure: the same toy example; with greedy search the updates can cycle, and the structured perceptron does not converge]
Which part of the convergence proof no longer holds?

the proof only uses 3 facts:

  • 1. separation (margin)
  • 2. diameter (always finite)
  • 3. violation (guaranteed by exact search)


Geometry of Convergence Proof pt 2

Recall part 2 of the proof: it needs each update to be a violation (the incorrect label scored higher). Exact search returns the true 1-best z, which by definition scores at least as high as y. But an inexact 1-best may score lower than y: inexact search doesn’t guarantee violation!

Observation: Violation is all we need!

violation: the incorrect label z scores higher than the correct label y, i.e., the update direction ΔΦ(x, y, z) satisfies w · ΔΦ(x, y, z) ≤ 0

  • exact search is not really required by the proof
  • rather, it is only used to ensure violation!

the proof only uses 3 facts:

  • 1. separation (margin)
  • 2. diameter (always finite)
  • 3. violation (but no need for exact)

Violation-Fixing Perceptron

  • if we guarantee violation, we don’t care about exactness!
  • violation is good b/c we can at least fix a mistake

standard perceptron: z from exact inference; update w if y ≠ z.
violation-fixing perceptron: find any violation pair (y′, z) among all possible updates; update w if y′ ≠ z. Same mistake bound as before!
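A sketch of the violation-fixing update in Python (our illustration): update only when the found pair is a genuine violation.

```python
# Violation-fixing update (sketch): accept any pair (y', z) from the search
# space, but update only if z actually scores at least as high as y', so
# that every update fixes a real mistake.
def score(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def violation_fixing_update(w, feats_correct, feats_incorrect):
    """feats_*: sparse feature dicts of y' and z. Returns True if updated."""
    if score(w, feats_incorrect) >= score(w, feats_correct):
        for f, v in feats_correct.items():
            w[f] = w.get(f, 0.0) + v
        for f, v in feats_incorrect.items():
            w[f] = w.get(f, 0.0) - v
        return True
    return False    # not a violation: skip (updating could reinforce errors)
```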

What if we can’t guarantee violation?

  • this is why the perceptron doesn’t work well w/ inexact search
  • because not every update is guaranteed to be a violation
  • thus the proof breaks; no convergence guarantee
  • example: beam or greedy search
  • the model might prefer the correct label (under exact search)
  • but the search prunes it away
  • such a non-violation update is “bad” because it doesn’t fix any mistake
  • the new model still misguides the search
  • Q: how can we always guarantee violation?

Solution 1: Early update (Collins/Roark 2004)

[figure: the toy example again; with greedy search the standard perceptron does not converge, but stopping and updating at the first mistake (early update) does]

Early Update: Guarantees Violation

[figure: at the early-update point the incorrect prefix already scores higher than the correct prefix: a violation! the standard update doesn’t converge b/c it doesn’t guarantee violation]


Early Update: from Greedy to Beam

  • beam search is a generalization of greedy search (where b = 1)
  • at each step we keep the top b hypotheses
  • widely used: tagging, parsing, translation...
  • early update: update when the correct label first falls off the beam
  • up to this point the incorrect prefix must score higher
  • standard update (full update): no violation guarantee! (see the sketch after the figure)
[figure: early update fires when the correct label falls off the beam (is pruned); violation guaranteed, since the incorrect prefix scores higher up to this point; the standard update at the end has no guarantee]
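Here is a sketch of beam-search tagging with early update (Python; `phi_step`, a per-position feature function, and the helper `score` are our assumptions, not the tutorial's code):

```python
# Beam search with early update (sketch). Returns the (correct, incorrect)
# prefix pair to update on, or None if the full gold sequence wins.
def score(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def beam_early_update(w, words, gold, tagset, phi_step, b=4):
    beam = [((), 0.0)]                          # (tag prefix, prefix score)
    for i in range(len(words)):
        cands = [(prefix + (t,), s + score(w, phi_step(words, i, prefix, t)))
                 for prefix, s in beam for t in tagset]
        beam = sorted(cands, key=lambda c: -c[1])[:b]
        gold_prefix = tuple(gold[:i + 1])
        if all(prefix != gold_prefix for prefix, _ in beam):
            # gold prefix just fell off the beam: early update right here.
            # violation guaranteed: the beam's best outscores the gold prefix.
            return gold_prefix, beam[0][0]
    z = beam[0][0]
    return (tuple(gold), z) if z != tuple(gold) else None
```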

Early Update as Violation-Fixing

[figure: early update is a violation-fixing perceptron that updates on prefix violations (y′, z) instead of full labels]

this also suggests a new definition of “beam separability”: a correct prefix should score higher than any incorrect prefix of the same length (maybe too strong; cf. Kulesza and Pereira, 2007)

Solution 2: Max-Violation (Huang et al 2012)

[figure: possible update positions along the beam: early, max-violation, latest, and standard (bad!)]

  • we have now established a theory for early update (Collins/Roark)
  • but it learns too slowly due to partial updates
  • max-violation: update at the prefix where the violation is maximum
  • the “worst mistake” in the search space
  • all these update methods are violation-fixing perceptrons (see the sketch below)
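A max-violation sketch in the same style (again our illustration, reusing `score` and `phi_step` from the early-update sketch): score the gold prefix at every length, track the beam's best at every length, and update where the gap is largest.

```python
# Max-violation update (sketch): among all prefix lengths i, pick the one
# where score(best-in-beam) - score(gold prefix) is largest, update there.
def max_violation_pair(w, words, gold, tagset, phi_step, b=4):
    beam = [((), 0.0)]
    gold_prefix, gold_score = (), 0.0
    best_gap, pair = float("-inf"), None
    for i in range(len(words)):
        cands = [(p + (t,), s + score(w, phi_step(words, i, p, t)))
                 for p, s in beam for t in tagset]
        beam = sorted(cands, key=lambda c: -c[1])[:b]
        gold_score += score(w, phi_step(words, i, gold_prefix, gold[i]))
        gold_prefix += (gold[i],)
        top, top_score = beam[0]
        if top != gold_prefix and top_score - gold_score > best_gap:
            best_gap, pair = top_score - gold_score, (gold_prefix, top)
    return pair if best_gap >= 0 else None      # only update on a violation
```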

Four Experiments

[figure: four tasks; part-of-speech tagging (“the man bit the dog” → DT NN VBD DT NN), incremental parsing, bottom-up parsing w/ cube pruning, and machine translation (“那 人 咬 了 狗” → “the man bit the dog”)]


Max-Violation > Early >> Standard

  • exp 1 on part-of-speech tagging w/ beam search (on CTB5)
  • early and max-violation >> standard update at smallest beams
  • this advantage shrinks as beam size increases
  • max-violation converges faster than early (and slightly better)

[figure: tagging accuracy on held-out vs. training time (beam=1, greedy) and best accuracy vs. beam size, for max-violation, early, and standard updates]

[figure: the same comparison at beam=2]

Max-Violation > Early >> Standard

  • exp 2 on my incremental dependency parser (Huang & Sagae 10)
  • standard update is horrible due to search errors
  • early update: 38 iterations, 15.4 hours (92.24)
  • max-violation: 10 iterations, 4.6 hours (92.25)

[figure: parsing accuracy on held-out vs. training time (beam=8); standard update (79.0) omitted from the zoomed-in plot]
max-violation is 3.3x faster than early update

Why is the standard update so bad for parsing?

  • standard update works horribly with severe search error
  • due to large number of invalid updates (non-violation)

[figure: % of invalid (non-violation) updates in the standard update vs. beam size, for parsing (b=8) and tagging (b=1); search spaces: tagging O(nT³) ⇒ O(nb), parsing O(n¹¹) ⇒ O(nb)]

take-home message: early/max-violation updates are more helpful for harder search problems!


Exp 3: Bottom-up Parsing

  • CKY parsing with cube pruning for higher-order features
  • we extended our framework from graphs to hypergraphs

[figure: UAS on Penn-YM dev vs. training epochs, comparing s-max, p-max, skip, and standard updates]

(Zhang et al 2013)

Exp 4: Machine Translation

  • the standard perceptron works poorly for machine translation
  • b/c the invalid update ratio is very high (search quality is low)
  • max-violation converges faster than early update
  • first truly successful effort in large-scale training for translation

[figure: ratio of invalid updates vs. beam size for the standard perceptron (+non-local features); BLEU vs. iteration for max-violation, early, standard, and local updates]

(Yu et al 2013)

Comparison of Four Exps

  • the harder your search problem, the more advantageous these update methods are

[figure: all four experiments side by side: tagging (b=1), incremental parsing (b=8), bottom-up parsing, and machine translation]

Related Work and Discussions

  • our “violation-fixing” framework includes as special cases:
  • early update (Collins and Roark, 2004)
  • LaSO (Daumé and Marcu, 2005)
  • not sure about Searn (Daumé et al, 2009)
  • “beam-separability” and “greedy-separability” are related to the “algorithmic separability” of (Kulesza and Pereira, 2007)
  • but these conditions are too strong to hold in practice
  • under-generating (beam) vs. over-generating (LP relaxation):
  • Kulesza & Pereira (2007) and Martins et al (2011): LP relaxation
  • Finley and Joachims (2008): both under- and over-generating, for SVM


Conclusions So Far

  • structured perceptron is simple, scalable, and powerful
  • (almost) same convergence proof from multiclass perceptron
  • but it doesn’t work very well with inexact search
  • solution: violation-fixing perceptron framework
  • convergence under new defs of separability
  • learn to “live with” search errors
  • in particular, “max-violation” works great
  • converges fast, and results in high accuracy
  • they are more helpful to harder search problems!


Tutorial Outline

  • Overview of Structured Learning
  • Challenges in Scalability
  • Structured Perceptron
  • convergence proof
  • Structured Perceptron with Inexact Search
  • Latent-Variable Perceptron
  • Parallelizing Online Learning (Perceptron & MIRA)


Learning with Latent Variables

  • aka “weakly-supervised” or “partially-observed” learning
  • learning from “natural annotations”; more scalable
  • examples: translation, transliteration, semantic parsing...

[figure: three settings; parallel text: “Bush talked with Sharon” ↔ “布什 与 沙龙 会谈” with a latent derivation (Liang et al 2006; Yu et al 2013; Xiao and Xiong 2013); QA pairs: “What is the largest state?” → argmax(state, size) → Alaska (Clark et al 2010; Liang et al 2013; Kwiatkowski et al 2013); transliteration: コンピューター ↔ “ko n py u : ta :” (computer) with a latent derivation (Knight & Graehl, 1998; Kondrak et al 2007, etc.)]

Learning Latent Structures

                      binary/multiclass               structured learning      latent structures
  generative:         naive Bayes                     HMMs                     EM (forward-backward)
  conditional:        logistic regression (maxent)    CRFs                     latent CRFs
  online + Viterbi:   perceptron                      structured perceptron    latent perceptron
  max margin:         SVM                             structured SVM           latent structured SVM


Latent Structured Perceptron

  • no explicit positive signal
  • hallucinate the “correct” derivation by current weights

during online learning, given training example x = “那 人 咬 了 狗” with reference y = “the man bit the dog”:

  • d̂: the highest-scoring derivation of x overall; it may produce a wrong output, e.g., “the dog bit the man” (penalize wrong)
  • d*: the highest-scoring gold derivation, i.e., one that produces the reference y (reward correct)

update: w ← w + Φ(x, d*) − Φ(x, d̂)

(Liang et al 2006; Yu et al 2013)
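A sketch of this update in Python (`decode`, `forced_decode`, and `output_of` are assumed search helpers, e.g. beam decoders over the full and forced spaces):

```python
# Latent-variable perceptron update, following the slide:
#   d_hat  = highest-scoring derivation of x                (penalize wrong)
#   d_star = highest-scoring reference-producing derivation (reward correct)
def latent_update(w, x, y_ref, decode, forced_decode, output_of, phi):
    d_hat = decode(w, x)                  # search over the full space
    d_star = forced_decode(w, x, y_ref)   # search constrained to yield y_ref
    if output_of(d_hat) != y_ref:         # mistake: w += Phi(x,d*) - Phi(x,d^)
        for f, v in phi(x, d_star).items():
            w[f] = w.get(f, 0.0) + v
        for f, v in phi(x, d_hat).items():
            w[f] = w.get(f, 0.0) - v
```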

Unconstrained Search

  • example: beam search phrase-based decoding

[figure: beam stacks over coverage vectors for “Bushi yu Shalong juxing le huitan”; partial hypotheses like “Bush ...”, “... talks”, “... talk”, “... meeting”, “... Sharon”, “... Shalong”]

Constrained Search

  • forced decoding: must produce the exact reference translation “Bush held talks with Sharon”
  • the gold derivations form a lattice

[figure: the gold derivation lattice over coverage vectors 1–6 for “Bushi yu Shalong juxing le huitan”; the forced decoding space sits inside the full search space explored by the beam]

Search Errors in Decoding

  • no explicit positive signal
  • hallucinate the “correct” derivation by current weights
  • same update w ← w + Φ(x, d*) − Φ(x, d̂) as above
  • problem: search errors; the beam may prune every gold derivation (Liang et al 2006; Yu et al 2013)


Search Error: Gold Derivations Pruned

[figure: in real decoding with beam search, parts of the gold derivation lattice fall outside the beam, so gold derivations get pruned]

should address search errors here!

Fixing Search Error 1: Early Update

  • early update (Collins/Roark ’04): update when the correct derivation falls off the beam
  • up to this point the incorrect prefix must score higher
  • that’s a “violation” we want to fix; proof in (Huang et al 2012)
  • the standard perceptron does not guarantee violation
  • the correct sequence (though pruned) might still score higher at the end!
  • such an “invalid” update reinforces the model error

[figure: early update fires when the correct sequence falls off the beam (is pruned); violation guaranteed, since the incorrect prefix scores higher up to that point; the standard update has no guarantee]

Early Update w/ Latent Variable

  • the gold-standard derivations are not annotated
  • we treat any reference-producing derivation as good
  • latent early update: stop decoding and update as soon as all correct derivations fall off the beam; the violation guarantee carries over

[figure: the gold derivation lattice; the update fires once the beam contains no reference-producing prefix]

Fixing Search Error 2: Max-Violation

  • early update works but learns slowly due to partial updates
  • max-violation: update at the prefix where the violation is maximum
  • the “worst mistake” in the search space
  • now extended to handle latent variables

notation: at step i, d⁺ᵢ is the best reference-producing prefix and d⁻ᵢ the highest-scoring prefix in the beam; early update fires where d⁺ᵢ falls off the beam, max-violation at the step i* where score(d⁻ᵢ) − score(d⁺ᵢ) is biggest, and the standard update compares d⁻_|x| with the full gold derivation d^y_|x| (possibly an invalid update)


Latent-Variable Perceptron

[figure: update positions on the beam; early (correct sequence falls off the beam), max-violation (biggest violation), latest (last valid update), full/standard (invalid update!)]

Roadmap of Techniques

  • structured perceptron (Collins, 2002)
  • + latent variables: latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al, 2009)
  • + inexact search: perceptron w/ inexact search (Collins & Roark, 2004; Huang et al 2012)
  • + both: latent-variable perceptron w/ inexact search (Yu et al 2013; Zhao et al 2014)
  • applications: MT, syntactic parsing, semantic parsing, transliteration

Experiments: Discriminative Training for MT

  • standard update (Liang et al’s “bold”) works poorly
  • b/c invalid update ratio is very high (search quality is low)
  • max-violation converges faster than early update

[figure: ratio of invalid updates vs. beam size for the standard latent-variable perceptron (+non-local features); BLEU vs. iteration, with max-violation (MaxForce) above MERT, early, local, and standard]

this explains why Liang et al ’06 failed: their “bold” ≈ standard update, their “local” ≈ local update

Open Problems in Theory

  • latent-variable structured perceptron: does it converge? under what conditions?
  • latent-variable structured perceptron with inexact search: does it converge? under what conditions?

[figure: for the fully supervised case, an oracle vector with margin δ and diameter R gives #updates ≤ R²/δ² (Novikoff, 1962; Collins, 2002)]


[figure continued: (Sun et al, 2009) prove #updates ≤ R²/δ² for the latent case under a stronger separability condition; easy to prove but unrealistic. the realistic condition is hard to prove, and the inexact-search case is open]

Tutorial Outline

  • Overview of Structured Learning
  • Challenges in Scalability
  • Structured Perceptron
  • convergence proof
  • Structured Perceptron with Inexact Search
  • Latent-Variable Perceptron
  • Parallelizing Online Learning (Perceptron & MIRA)


Online Learning from Big Data

  • online learning has linear-time guarantee
  • contrast: popular methods such as SVM/CRF are superlinear
  • but online learning can still be too slow if data is too big
  • even with the fastest inexact search (e.g. greedy)
  • how to parallelize online learning for big data?

[figure: online learning over examples 1–16 with an update after each, vs. superlinear batch learners (SVM, CRF)]


Aside: Perceptron => MIRA

  • the perceptron is simple but...
  • it only “aims to” fix one mistake (violation) on each example
  • yet structured prediction may have many violations
  • and it may under- or over-correct on a violation
  • MIRA (Margin Infused Relaxed Algorithm):
  • 1-best MIRA: corrects one violation “just enough” (Crammer 03)
  • k-best MIRA: corrects k violations at one time (McDonald et al 05)

MIRA update: with Z = the k-best outputs under wᵗ · Φ(x, z) over z ∈ GEN(x),

    wᵗ⁺¹ = argmin_{w′ : ∀z ∈ Z, w′ · ΔΦ(x, y, z) ≥ ℓ(y, z)} ‖w′ − wᵗ‖²

i.e., the minimal change to the weights that makes the correct label beat each z in Z by at least its loss ℓ(y, z).
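With a single constraint the QP has a closed-form solution; a sketch in our notation:

```python
# 1-best MIRA update (sketch): the single-constraint QP has the analytic
# solution w' = w + tau * DeltaPhi with the smallest tau satisfying the
# margin constraint w' . DeltaPhi >= loss.
def mira_1best_update(w, feats_y, feats_z, loss):
    delta = dict(feats_y)                   # DeltaPhi = Phi(x,y) - Phi(x,z)
    for f, v in feats_z.items():
        delta[f] = delta.get(f, 0.0) - v
    margin = sum(w.get(f, 0.0) * v for f, v in delta.items())
    norm2 = sum(v * v for v in delta.values())
    if norm2 == 0.0:
        return                              # y and z share all features
    tau = max(0.0, (loss - margin) / norm2) # "just enough" correction
    for f, v in delta.items():
        w[f] = w.get(f, 0.0) + tau * v
```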

Geometry of 1-best & k-best MIRA

[figure: 1-best MIRA has a single constraint, so the update has an analytic (closed-form) solution; k-best MIRA has up to k constraints, requiring a small convex optimization]

Can We Parallelize Online Learning?

  • can we parallelize online learning?
  • harder than parallelizing batch learners (CRF)
  • we lose the dependency between examples
  • each iteration gets faster, but accuracy drops
  • method 1: iterative parameter mixing, IPM (McDonald et al 2010)
  • only ~3-4x faster on 10+ CPUs
  • can we do (a lot) better?

[figure: method 1, iterative parameter mixing (IPM, McDonald et al 2010); shards run online learning in parallel and mix parameters between iterations]

Method 2: Minibatch Parallelization

  • decode a minibatch in parallel, then update in serial (see the sketch below)

[figure: method 2, minibatch (Zhao and Huang, 2013); decode examples 1–8 in parallel, aggregate (⨁) into one serial update, then examples 9–16, etc.]
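A sketch of the minibatch scheme with Python's multiprocessing (our illustration; `decode` must be a picklable top-level function, and each parallel call reads a snapshot of w):

```python
# Minibatch parallelization (sketch): decode a minibatch in parallel,
# then apply one aggregated (averaged) update in serial.
from multiprocessing import Pool

def minibatch_train(train, decode, phi, workers=8, batch=24, epochs=1):
    w = {}
    with Pool(workers) as pool:
        for _ in range(epochs):
            for i in range(0, len(train), batch):
                chunk = train[i:i + batch]
                # parallel part: inference only reads (a snapshot of) w
                zs = pool.starmap(decode, [(w, x) for x, _ in chunk])
                # serial part: one averaged update for the whole minibatch
                a = len(chunk)
                for (x, y), z in zip(chunk, zs):
                    if z != y:
                        for f, v in phi(x, y).items():
                            w[f] = w.get(f, 0.0) + v / a
                        for f, v in phi(x, z).items():
                            w[f] = w.get(f, 0.0) - v / a
    return w
```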


Why Minibatch?

  • minibatch also helps in serial mode!
  • perceptron: average the updates within a minibatch
  • an “averaging effect” (cf. McDonald et al 2010)
  • easy to prove convergence (still R²/δ²), as sketched below

minibatch update (average the updates over the a examples in the batch):

    w^(k+1) = w^(k) + (1/a) Σᵢ ΔΦ(xᵢ, yᵢ, zᵢ)

lower bound: u · w^(k+1) = u · w^(k) + (1/a) Σᵢ u · ΔΦ(xᵢ, yᵢ, zᵢ) ≥ u · w^(k) + δ by the margin, so by induction ‖w^(k+1)‖ ≥ u · w^(k+1) ≥ kδ.

upper bound: ‖w^(k+1)‖² = ‖w^(k)‖² + ‖(1/a) Σᵢ ΔΦ‖² + (2/a) w^(k) · Σᵢ ΔΦ; the middle term is ≤ R² by Jensen’s inequality and the last term is ≤ 0 by violation, so by induction ‖w^(k+1)‖² ≤ kR².

Why Minibatch? (MIRA)

[figure: as the minibatch size grows from 1 (pure online MIRA) toward the whole dataset, each update optimizes over more constraints, approaching SVM]

  • MIRA: optimization over more constraints
  • MIRA is an online approximation of SVM
  • minibatch MIRA: better approximation of SVM
  • approaches SVM at maximum batch size

Geometry of Minibatch MIRA

[figure: one minibatch MIRA step; gold labels y1, y2, y3 and incorrect outputs z1, z2, z3 from the whole batch jointly constrain the move from w to w′]

Load Balancing

  • rearrange the examples within each minibatch to minimize wasted (idle) time; a sketch follows the figure

[figure: timelines for iterative parameter mixing (McDonald et al 2010), plain minibatch, and load-balanced minibatch, which rearranges examples within each minibatch so that all cores finish at about the same time]
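As promised above, a small load-balancing sketch (our illustration): estimate each example's decoding cost, e.g. by sentence length, and deal the longest examples greedily onto the currently lightest core.

```python
# Greedy load balancing within a minibatch (sketch): longest-processing-time
# first, so all cores finish at roughly the same time and none idles at the
# synchronization barrier before the serial update.
def balance(minibatch, cores, cost):
    """cost: e.g. lambda ex: len(ex[0]) if examples are (sentence, gold)."""
    bins = [[] for _ in range(cores)]
    loads = [0.0] * cores
    for ex in sorted(minibatch, key=cost, reverse=True):
        j = loads.index(min(loads))     # currently lightest core
        bins[j].append(ex)
        loads[j] += cost(ex)
    return bins
```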


Experiments

[figure: three tasks; part-of-speech tagging, incremental parsing, and phrase-based translation]

Experiment 1: Parsing with MIRA

  • beam search incremental parsing (Huang et al 2012) with MIRA
  • minibatch learns better even in the serial setting (1 CPU)

[figure: parsing accuracy on held-out vs. wall-clock time; minibatch sizes 4 and 24 beat the pure online baseline (minibatch=1)]

Parallelized Minibatch Faster than IPM

  • minibatch is much faster than iterative parameter mixing

minibatch: 9x speedup on 12 cores; IPM (McDonald et al): 3x on 12 cores

Experiment 2: Tagging (Perceptron)



Experiment 3: Machine Translation

  • minibatch leads to 7x speedup on 24 cores

[figure: BLEU vs. training time; minibatch-24 on 24, 6, and 1 cores vs. the pure online baseline, MERT, and PRO-dense]

Roadmap of Techniques

  • structured perceptron (Collins, 2002)
  • + latent variables: latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al, 2009)
  • + inexact search: perceptron w/ inexact search (Collins & Roark, 2004; Huang et al 2012)
  • + minibatch parallelization (Zhao & Huang, 2013)
  • combined: latent-variable perceptron w/ inexact search & parallelization (Yu et al 2013; Zhao et al 2014)
  • applications: MT, syntactic parsing, semantic parsing, transliteration

Final Conclusions

  • online structured learning is simple and powerful
  • search efficiency is the key challenge
  • search errors do interfere with learning
  • but we can use violation-fixing perceptron w/ inexact search
  • we can extend perceptron to learn latent structures
  • we can parallelize online learning using minibatch

[figure: the beam update positions once more; early, max-violation, latest, full (standard, invalid update!)]

Annotated References (1)

  • Binary Perceptron
  • original: Rosenblatt, 1959
  • convergence proof: Novikoff, 1962
  • Multiclass Perceptron (and voted/averaged perceptron)
  • Freund and Schapire, 1999
  • Structured Perceptron (and inexact search extensions)
  • original: Collins, 2002 (also contains generalization bounds; proofs mostly verbatim from Freund/Schapire, 1999)
  • early update: Collins and Roark, 2004 (but no justification)
  • max-violation: Huang et al, 2012 (also defines the violation-fixing perceptron framework, of which early/max-violation are instances)
  • hypergraph inexact search: Zhang et al, 2013 (CKY-style parsing)


Annotated References (2)

  • Latent-Variable Perceptron (and inexact search extensions)
  • semantic parsing: Zettlemoyer and Collins, 2005
  • machine translation: Liang et al, 2006
  • separability condition: Sun et al, 2009
  • inexact search: Yu et al, 2013
  • hiero w/ inexact search: Zhao et al, 2014
  • MIRA
  • 1-best MIRA: Crammer and Singer, 2003
  • k-best MIRA: McDonald et al, 2005

Annotated References (3)

  • Parallelizing Online Learning
  • iterative parameter mixing: McDonald et al, 2010
  • minibatch: Zhao and Huang, 2013
  • Other References
  • CRF: Lafferty et al, 2001
  • M3N (structured SVM): Taskar et al, 2003
  • LaSO: Daumé and Marcu, 2005
  • averaging trick: Daumé, 2006, Ph.D. thesis


Backup Slides: Convergence

  • if the data is separable in the representation, and search is exact

Separation and Violation

What if not separable?

  • a weaker theorem, and generalization bounds

Proof