Machine Learning, Fall 2017: Structured Prediction (structured perceptron, HMM, structured SVM)


SLIDE 1

Machine Learning, Fall 2017

Structured Prediction (structured perceptron, HMM, structured SVM)

Professor Liang Huang

(Chap. 17 of CIML)

SLIDE 2

Structured Prediction

  • binary classification: output is binary
  • multiclass classification: output is a number (small # of classes)
  • structured classification: output is a structure (seq., tree, graph)
  • e.g. part-of-speech tagging, parsing, summarization, translation
  • exponentially many classes: search (inference) efficiency is crucial!

[Figure: example outputs for each task. Binary: x → y ∈ {−1, +1}. POS tagging: "the man bit the dog" → "DT NN VBD DT NN". Parsing: "the man bit the dog" → (S (NP (DT the) (NN man)) (VP (VB bit) (NP (DT the) (NN dog)))). Translation: "the man bit the dog" → "那 人 咬 了 狗".]

SLIDE 3

Generic Perceptron

  • online-learning: one example at a time
  • learning by doing
  • find the best output under the current weights
  • update weights at mistakes

[Diagram: the perceptron loop — inference finds the best output z_i for input x_i under the current weights w; if z_i ≠ y_i, w is updated.]
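A minimal sketch of this loop in Python (hypothetical names: `decode` stands in for the task-specific inference routine, `phi` for the feature map; sparse weights in a defaultdict, as the later slides suggest):

```python
from collections import defaultdict

def perceptron(data, decode, phi, epochs=10):
    """Generic perceptron: decode each example under the current
    weights; update the weights only on mistakes."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            z = decode(x, w)            # best output under current w
            if z != y:                  # mistake: reward y, penalize z
                for f, v in phi(x, y).items():
                    w[f] += v
                for f, v in phi(x, z).items():
                    w[f] -= v
    return w
```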

SLIDE 4

Perceptron: from binary to structured

[Figure: the same perceptron loop in all three settings — update w if y ≠ z, where z is the exact inference output.]

                     binary        multiclass    structured
  # of classes       2             constant      exponential
  exact inference    trivial       easy          hard
  update             if y ≠ z      if y ≠ z      if y ≠ z

(binary: x → y ∈ {−1, +1}; structured example: "the man bit the dog" → "DT NN VBD DT NN")

SLIDE 5

From Perceptron to SVM

[Timeline: perceptron → SVM, with online and max-margin threads]

  • 1959 Rosenblatt: perceptron invented
  • 1962 Novikoff: convergence proof
  • 1964 Vapnik & Chervonenkis: max margin
  • 1964 Aizerman+: kernels (same journal!) → kernelized perceptron
  • 1995 Cortes/Vapnik: SVM (max margin, +soft-margin for the inseparable case)
  • 1999 Freund/Schapire: voted/averaged perceptron (revived)
  • 2001 Lafferty+: CRF (multinomial logistic regression, max. entropy)
  • 2002 Collins: structured perceptron
  • 2003 Crammer/Singer: MIRA (online approx. max margin, conservative updates)
  • 2003 Taskar+: M3N (structured SVM)
  • 2004 Tsochantaridis+: structured SVM
  • 2005 McDonald+: structured MIRA
  • 2006 Singer group: (passive-)aggressive (online)
  • 2007-2010 Singer group: Pegasos (subgradient descent, minibatch, batch)

(Much of this line of work came from AT&T Research and ex-AT&T researchers and their students; Vapnik moved to AT&T after the fall of the USSR.)

SLIDE 6

Multiclass Classification

  • one weight vector (“prototype”) for each class:
  • multiclass decision rule:

(best agreement w/ prototype)

ŷ = argmax_{z ∈ 1…M} w(z) · x

Q1: what about 2-class? Q2: do we still need augmented space?

SLIDE 7

Multiclass Perceptron

  • on an error, penalize the weight for the wrong class, and reward the weight for the true class
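A dense-vector sketch of the decision rule and the update (numpy assumed; `W` stores one prototype row per class):

```python
import numpy as np

def predict(W, x):
    """Decision rule: the class whose prototype agrees best with x."""
    return int(np.argmax(W @ x))

def multiclass_perceptron(data, num_classes, dim, epochs=10):
    """On an error, reward the true class's prototype and penalize
    the predicted (wrong) class's prototype."""
    W = np.zeros((num_classes, dim))
    for _ in range(epochs):
        for x, y in data:               # x: feature vector, y: class id
            z = predict(W, x)
            if z != y:
                W[y] += x               # reward the true class
                W[z] -= x               # penalize the wrong class
    return W
```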

SLIDE 8

Convergence of Multiclass

update rule:

w ← w + ∆Φ(x, y, z), where ∆Φ(x, y, z) = Φ(x, y) − Φ(x, z)

separability:

∃ u s.t. ∀(x, y) ∈ D and ∀z ≠ y: u · ∆Φ(x, y, z) ≥ δ

SLIDE 9

Example: POS Tagging

  • gold-standard y: DT NN VBD DT NN (the man bit the dog)
  • current output z: DT NN NN DT NN (the man bit the dog)
  • assume only two feature classes
  • tag bigrams ti-1 ti
  • word/tag pairs (ti, wi)
  • weights ++: (NN, VBD) (VBD, DT) (VBD, bit)
  • weights --: (NN, NN) (NN, DT) (NN, bit)

ɸ(x,y) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (VBD, bit): 1, ..., (NN, dog): 1}
ɸ(x,z) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (NN, bit): 1, ..., (NN, dog): 1}
update: w += ɸ(x,y) − ɸ(x,z)
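These feature vectors are easy to compute with a Counter; a sketch for this example (boundary tags <s>/</s> as in the ɸ above):

```python
from collections import Counter

def phi(words, tags):
    """Sparse feature counts: tag bigrams (t_{i-1}, t_i) with <s>/</s>
    boundaries, plus word/tag pairs (t_i, w_i)."""
    feats = Counter()
    padded = ["<s>"] + tags + ["</s>"]
    for prev, cur in zip(padded, padded[1:]):
        feats[(prev, cur)] += 1         # tag bigram feature
    for word, tag in zip(words, tags):
        feats[(tag, word)] += 1         # word/tag feature
    return feats

x = "the man bit the dog".split()
y = "DT NN VBD DT NN".split()           # gold-standard
z = "DT NN NN DT NN".split()            # current output
# Counter subtraction keeps only positive counts, giving exactly the
# two feature sets above:
print(phi(x, y) - phi(x, z))  # weights ++: (NN, VBD), (VBD, DT), (VBD, bit)
print(phi(x, z) - phi(x, y))  # weights --: (NN, NN), (NN, DT), (NN, bit)
```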

SLIDE 10

Structured Perceptron

[Diagram: the same loop as the generic perceptron, but inference is now a structured search for z_i over an exponential output space.]

SLIDE 11

Inference: Dynamic Programming

[Diagram: exact inference x → z by dynamic programming; update w if y ≠ z.]

SLIDE 12

Python implementation


Q: what about top-down recursive + memoization?
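The slide's code did not survive extraction; here is a sketch of what a bottom-up Viterbi decoder for the bigram tagger might look like, assuming the tag-bigram and word/tag features from the POS-tagging example and weights in a plain dict `w`:

```python
def decode(words, w, tagset):
    """Bottom-up Viterbi, O(n |T|^2): best[t] is the score of the best
    tagging of the prefix seen so far that ends in tag t; back[i][t]
    records that tagging's tag at position i-1."""
    best, back = {"<s>": 0.0}, []
    for word in words:
        new_best, pointers = {}, {}
        for t in tagset:
            # best predecessor under the tag-bigram feature (prev, t)
            prev = max(best, key=lambda p: best[p] + w.get((p, t), 0.0))
            pointers[t] = prev
            new_best[t] = (best[prev] + w.get((prev, t), 0.0)
                           + w.get((t, word), 0.0))  # word/tag feature
        best = new_best
        back.append(pointers)
    last = max(best, key=lambda t: best[t] + w.get((t, "</s>"), 0.0))
    tags = [last]
    for pointers in reversed(back[1:]):  # follow backpointers
        tags.append(pointers[tags[-1]])
    return tags[::-1]
```

As for the question: a top-down recursive version with memoization (e.g. functools.lru_cache) fills the same table at the same asymptotic cost; the bottom-up version just avoids Python's recursion-depth limit on long sentences.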

SLIDE 13

Efficiency vs. Expressiveness

  • the inference (argmax) must be efficient
  • either the search space GEN(x) is small, or factored
  • features must be local to y (but can be global to x)
  • e.g. bigram tagger, but look at all input words (cf. CRFs)

[Diagram: the perceptron loop with inference ŷ = argmax_{y ∈ GEN(x)} w · ɸ(x, y).]

SLIDE 14

Averaged Perceptron

  • more stable and accurate results
  • approximation of voted perceptron

(Freund & Schapire, 1999)

[Formula: the averaged weights are the mean of the per-step weight vectors, w̄ = (1 / (J+1)) Σ_{j=0…J} W_j.]

SLIDE 15

Averaging Tricks

  • Daume (2006, PhD thesis)

[Figure: each w(t) is the running sum of its updates, w(t) = ∆w(1) + … + ∆w(t); summing w(1) … w(T) column by column shows the average can be maintained from the sparse ∆w(t) alone — store w as a sparse vector (defaultdict).]
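One common form of the trick (as in CIML's averaged perceptron): alongside w, keep a "timestamped" sum u, so the average is w − u/c and only touched features are ever updated:

```python
from collections import defaultdict

class AveragedWeights:
    """Averaging trick: w holds current weights, u holds the sum of
    c-weighted updates; the averaged weights are w - u/c."""
    def __init__(self):
        self.w = defaultdict(float)
        self.u = defaultdict(float)
        self.c = 1                      # example counter

    def update(self, feats, scale):
        """Apply scale * feats (a sparse dict) to the weights."""
        for f, v in feats.items():
            self.w[f] += scale * v
            self.u[f] += self.c * scale * v

    def step(self):
        self.c += 1                     # advance once per example

    def averaged(self):
        return {f: v - self.u[f] / self.c for f, v in self.w.items()}
```

Each update touches only its active features, so averaging costs nothing extra per example.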

SLIDE 16

Do we need smoothing?

  • smoothing is much easier in discriminative models
  • just make sure that for each feature template, its subset templates are also included
  • e.g., to include (t0 w0 w-1) you must also include (t0 w0), (t0 w-1), (w0 w-1)
  • and maybe also (t0 t-1), because t is less sparse than w

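A tiny helper can enumerate the subset templates mechanically (a sketch; the template names are just the slide's examples):

```python
from itertools import combinations

def backoff_templates(template):
    """All non-empty sub-templates of a conjunctive feature template."""
    return [c for r in range(1, len(template) + 1)
            for c in combinations(template, r)]

print(backoff_templates(("t0", "w0", "w-1")))
# [('t0',), ('w0',), ('w-1',), ('t0', 'w0'), ('t0', 'w-1'),
#  ('w0', 'w-1'), ('t0', 'w0', 'w-1')]
```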

SLIDE 17

Convergence with Exact Search

  • linear classification: converges iff the data is separable
  • structured: converges iff the data is separable & the search is exact
  • there is an oracle vector that correctly labels all examples
  • one vs. the rest (correct label better than all incorrect labels)
  • theorem: if separable, then # of updates ≤ R²/δ² (R: diameter)

[Figure: margin δ and diameter R, in the binary case (y = ±1) and the structured case (gold y₁₀₀ vs. all z ≠ y₁₀₀ for example x₁₀₀); Rosenblatt (1957) ⇒ Collins (2002).]

SLIDE 18

Geometry of Convergence Proof pt 1

[Diagram: the update ∆Φ(x, y, z) moves the current model w(k) to the new model w(k+1); u is the unit oracle vector.]

perceptron update: w(k+1) = w(k) + ∆Φ(x, y, z), where z is the exact 1-best under w(k)

δ-separation: the unit oracle vector u satisfies u · ∆Φ(x, y, z) ≥ δ, i.e. every update makes an angle of less than 90° with u

part 1 (lower bound, by induction): after k updates, u · w(k+1) ≥ kδ, hence ‖w(k+1)‖ ≥ kδ

SLIDE 19

Geometry of Convergence Proof pt 2

[Diagram: as in part 1, the update ∆Φ(x, y, z) moves w(k) to w(k+1); z is the exact 1-best.]

violation: the incorrect label z scored at least as high as the correct y under w(k), so w(k) · ∆Φ(x, y, z) ≤ 0

perceptron update: w(k+1) = w(k) + ∆Φ(x, y, z)

part 2 (upper bound): ‖w(k+1)‖² = ‖w(k)‖² + 2 w(k) · ∆Φ(x, y, z) + ‖∆Φ(x, y, z)‖² ≤ ‖w(k)‖² + R², where R is the max diameter of an update; by induction, ‖w(k+1)‖² ≤ kR²

parts 1+2 ⇒ update bound: k ≤ R²/δ²
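Chaining the two parts (with ‖u‖ = 1 and Cauchy-Schwarz in the middle):

```latex
k^2\delta^2 \;\overset{\text{part 1}}{\le}\; (u \cdot w^{(k+1)})^2
\;\le\; \|w^{(k+1)}\|^2
\;\overset{\text{part 2}}{\le}\; kR^2
\quad\Longrightarrow\quad
k \;\le\; R^2/\delta^2 .
```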

SLIDE 20

Experiments

SLIDE 21

Experiments: Tagging

  • (almost) identical features from (Ratnaparkhi, 1996)
  • trigram tagger: current tag ti, previous tags ti-1, ti-2
  • current word wi and its spelling features
  • surrounding words wi-1, wi+1, wi-2, wi+2, ...


SLIDE 22

Experiments: NP Chunking

  • B-I-O scheme
  • features:
  • unigram model
  • surrounding words and POS tags


Rockwell/B International/I Corp./I 's/B Tulsa/I unit/I said/O it/B signed/O a/B tentative/I agreement/I ...
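How the B-I-O labels arise from NP-chunk spans, on this example (a sketch; the spans are read off the bracketing implied above):

```python
def bio_tags(words, chunks):
    """B-I-O scheme: B(egin) on a chunk's first word, I(nside) on the
    rest, O(utside) elsewhere; chunks are half-open (start, end) spans."""
    tags = ["O"] * len(words)
    for start, end in chunks:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

sent = ("Rockwell International Corp. 's Tulsa unit "
        "said it signed a tentative agreement").split()
# NPs: [Rockwell International Corp.] ['s Tulsa unit] [it] [a tentative agreement]
print(bio_tags(sent, [(0, 3), (3, 6), (7, 8), (9, 12)]))
# ['B','I','I', 'B','I','I', 'O', 'B', 'O', 'B','I','I']
```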

SLIDE 23

Experiments: NP Chunking

  • results, vs. the trigram tagger of (Sha and Pereira, 2003)
  • voted perceptron: 94.09% vs. CRF: 94.38%


SLIDE 24

Structured SVM

  • structured perceptron: w·Δɸ(x,y,z) > 0
  • SVM: for all (x,y), functional margin y(w·x) ≥ 1
  • structured SVM version 1: simple loss
  • for all (x,y), for all z≠y, margin w·Δɸ(x,y,z) ≥ 1
  • correct y has to score higher than any wrong z by 1
  • structured SVM version 2: structured loss
  • for all (x,y), for all z≠y, margin w·Δɸ(x,y,z) ≥ ℓ(y,z)
  • correct y has to score higher than any wrong z by ℓ(y,z), a distance metric such as Hamming loss


SLIDE 25

Loss-Augmented Decoding

  • want, for all z: w·ɸ(x,y) ≥ w·ɸ(x,z) + ℓ(y,z)
  • same as: w·ɸ(x,y) ≥ max_z [w·ɸ(x,z) + ℓ(y,z)]
  • loss-augmented decoding: argmax_z [w·ɸ(x,z) + ℓ(y,z)]
  • if ℓ(y,z) factors in z (e.g. Hamming), just modify the DP (see the sketch below)


[CIML's pseudocode with annotations: the inner argmax is computed by a modified DP (loss-augmented decoding); λ = 1/(2C); the update should have a learning rate; the algorithm is very similar to Pegasos, so the Pegasos framework should be used instead (next slide).]
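For Hamming loss the modification is one extra term per position; a sketch on top of the earlier Viterbi `decode` (hypothetical names, used only at training time):

```python
def loss_augmented_decode(words, gold, w, tagset):
    """argmax_z [w . phi(x,z) + hamming(y,z)]: Hamming loss factors
    over positions, so it enters the DP as a +1 local bonus whenever
    a candidate tag differs from the gold tag."""
    best, back = {"<s>": 0.0}, []
    for i, word in enumerate(words):
        new_best, pointers = {}, {}
        for t in tagset:
            prev = max(best, key=lambda p: best[p] + w.get((p, t), 0.0))
            pointers[t] = prev
            new_best[t] = (best[prev] + w.get((prev, t), 0.0)
                           + w.get((t, word), 0.0)
                           + (1.0 if t != gold[i] else 0.0))  # loss term
        best = new_best
        back.append(pointers)
    last = max(best, key=lambda t: best[t] + w.get((t, "</s>"), 0.0))
    tags = [last]
    for pointers in reversed(back[1:]):
        tags.append(pointers[tags[-1]])
    return tags[::-1]
```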

SLIDE 26

Correct Version following Pegasos

  • want, for all z: w·ɸ(x,y) ≥ w·ɸ(x,z) + ℓ(y,z)
  • same as: w·ɸ(x,y) ≥ max_z [w·ɸ(x,z) + ℓ(y,z)]
  • loss-augmented decoding: argmax_z [w·ɸ(x,z) + ℓ(y,z)]
  • if ℓ(y,z) factors in z (e.g. Hamming), just modify the DP


[Update schedule: factors 1/t and NC/(2t), where N = |D|, C is from the SVM objective, and t += 1 for each example.]
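A sketch of the resulting loop, assuming the schedule read off the slide (shrink w by 1 − 1/t, step by NC/(2t)) and the `phi` and `loss_augmented_decode` sketches from earlier slides; an illustration of the idea rather than Pegasos verbatim:

```python
def train_structured_svm(data, tagset, C=1.0, epochs=10):
    """Pegasos-style structured SVM: loss-augmented decoding, then a
    regularizer shrink and, on a violation, a step along delta-phi."""
    w, N, t = {}, len(data), 1
    for _ in range(epochs):
        for words, gold in data:
            z = loss_augmented_decode(words, gold, w, tagset)
            for f in w:                     # regularizer: w *= (1 - 1/t)
                w[f] *= 1.0 - 1.0 / t
            if z != gold:                   # margin violated: update
                eta = N * C / (2.0 * t)     # learning rate from the slide
                for f, v in phi(words, gold).items():
                    w[f] = w.get(f, 0.0) + eta * v
                for f, v in phi(words, z).items():
                    w[f] = w.get(f, 0.0) - eta * v
            t += 1                          # t advances once per example
    return w
```

In practice the shrink is folded into one scalar multiplier so each example touches only its active features; the explicit loop over all of w is for clarity.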

SLIDE 27

  • Struct. Perceptron vs. Struct. SVM
  • tagging on ATIS (train: 488 sentences); error rates: SVM < avg. perceptron << perceptron


perceptron:
epoch 1  updates 102, |W|=291, train_err 3.90%, dev_err 9.36%, avg_err 6.14%
epoch 2  updates 91,  |W|=334, train_err 3.33%, dev_err 8.19%, avg_err 4.97%
epoch 3  updates 78,  |W|=347, train_err 2.92%, dev_err 5.85%, avg_err 4.97%
epoch 4  updates 81,  |W|=368, train_err 3.11%, dev_err 6.73%, avg_err 5.85%
epoch 5  updates 78,  |W|=378, train_err 2.70%, dev_err 6.14%, avg_err 5.56%
epoch 6  updates 63,  |W|=385, train_err 2.26%, dev_err 6.14%, avg_err 5.56%
epoch 7  updates 69,  |W|=385, train_err 2.43%, dev_err 7.02%, avg_err 5.56%
epoch 8  updates 60,  |W|=388, train_err 2.15%, dev_err 6.73%, avg_err 5.56%
epoch 9  updates 59,  |W|=390, train_err 2.04%, dev_err 6.14%, avg_err 5.56%
epoch 10 updates 64,  |W|=394, train_err 2.15%, dev_err 5.85%, avg_err 5.26%

SVM (C=1):
epoch 1  updates 116, |W|=311, train_err 4.55%, dev_err 5.85%
epoch 2  updates 82,  |W|=328, train_err 3.05%, dev_err 4.97%
epoch 3  updates 78,  |W|=334, train_err 2.92%, dev_err 5.56%
epoch 4  updates 77,  |W|=339, train_err 2.92%, dev_err 5.26%
epoch 5  updates 80,  |W|=344, train_err 2.94%, dev_err 5.56%
epoch 6  updates 73,  |W|=345, train_err 2.75%, dev_err 4.68%
epoch 7  updates 72,  |W|=347, train_err 2.75%, dev_err 4.97%
epoch 8  updates 75,  |W|=352, train_err 2.86%, dev_err 4.97%
epoch 9  updates 74,  |W|=353, train_err 2.78%, dev_err 4.97%
epoch 10 updates 72,  |W|=354, train_err 2.78%, dev_err 4.97%