

SLIDE 1

Applied Machine Learning

Professor Liang Huang

Week 5: Extensions and Variations of Perceptron, and Practical Issues

CIML Chaps 4-5

(A Geometric Approach)

“A ship in port is safe, but that is not what ships are for.” 
 – Grace Hopper (1906-1992)

some slides from A. Zisserman (Oxford)

SLIDE 2

Trivia: Grace Hopper and the first bug

  • Edison coined the term “bug” around 1878, and it has since been widely used in engineering
  • Hopper was associated with the discovery of the first computer bug in 1947: a moth stuck in a relay

Smithsonian National Museum of American History

SLIDE 3

Week 5: Perceptron in Practice

  • Problems with Perceptron
      • doesn’t converge with inseparable data
      • update might often be too “bold”
      • doesn’t optimize margin
      • result is sensitive to the order of examples
  • Ways to alleviate these problems (without SVM/kernels)
      • Part II: voted perceptron and averaged perceptron
      • Part III: MIRA (margin-infused relaxation algorithm)
      • Part IV: Practical Issues and HW1
      • Part V: “Soft” Perceptron: Logistic Regression

“A ship in port is safe, but that is not what ships are for.” 
 – Grace Hopper (1906-1992)

SLIDE 4

Recap of Week 4

[Figure: setup for the convergence proof. A unit oracle vector u (‖u‖ = 1) separates the data with margin δ, i.e. u · x ≥ δ, and R bounds the radius of the data.]

input: training data D
output: weights w

initialize w ← 0
while not converged
    for (x, y) ∈ D
        if y(w · x) ≤ 0
            w ← w + yx
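A minimal NumPy sketch of this loop (the function name, the (x, y)-pair data format, and the `epochs` cap are my assumptions, not part of the slide):

    import numpy as np

    def perceptron_train(D, epochs=10):
        """Vanilla perceptron. D is a list of (x, y) pairs with y in {-1, +1};
        assumes a bias feature is already folded into x."""
        w = np.zeros(len(D[0][0]))
        for _ in range(epochs):            # caps "while not converged"
            converged = True
            for x, y in D:
                x = np.asarray(x, dtype=float)
                if y * np.dot(w, x) <= 0:  # mistake (or exactly on the boundary)
                    w += y * x             # w <- w + yx
                    converged = False
            if converged:
                break
        return w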

[Diagrams: three training pipelines, each mapping Input x to Output y via Model w. “Idealized” ML trains directly on x; “actual” ML first applies a hand-designed feature map ϕ; in deep learning ≈ representation learning, the feature map ϕ is itself learned.]

SLIDE 5

Python Demo


$ python perc_demo.py

(requires numpy and matplotlib)

SLIDE 6

Part II: Voted and Averaged Perceptron

[Plot: dev set error for the vanilla perceptron, voted perceptron, and averaged perceptron.]
SLIDE 7

Brief History of Perceptron

1959  Rosenblatt: invention
1962  Novikoff: proof
1969* Minsky/Papert: book killed it (perceptron declared DEAD)
1997  Cortes/Vapnik: SVM (+max margin, +kernels, +soft-margin)
1999  Freund/Schapire: voted/avg. perceptron: revived (inseparable case)
2002  Collins: structured perceptron
2003  Crammer/Singer: MIRA (online approx. of max margin; conservative updates)
2005* McDonald/Crammer/Pereira: structured MIRA
2006  Singer group: aggressive MIRA
2007–2010* Singer group: Pegasos (subgradient descent; minibatch, between online and batch)

*mentioned in lectures but optional (the other papers are all covered in detail)

(most of this work is from AT&T Research, ex-AT&T researchers, and their students)

SLIDE 8

Voted/Avged Perceptron

  • problem: later examples dominate earlier examples
  • solution: voted perceptron (Freund and Schapire, 1999)
      • record the weight vector after each example in D
      • not just after each update!
      • and vote on a new example using |D| models
      • shown to have better generalization power
  • averaged perceptron (from the same paper)
      • an approximation of the voted perceptron
      • just use the average of all weight vectors
      • can be implemented efficiently

SLIDE 9

Voted Perceptron

  • our notation: (x^(1), y^(1)), …
  • v is a weight vector; c is its # of votes
  • if correct, increase the current model’s # of votes; otherwise create a new model with 1 vote
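A sketch of this bookkeeping in NumPy, under the same assumptions as before (data as (x, y) pairs, y ∈ {−1, +1}; the helper names are mine):

    import numpy as np

    def voted_perceptron_train(D, epochs=10):
        """Voted perceptron (Freund & Schapire, 1999). Each weight vector v
        is stored with c, the number of examples it classified correctly."""
        v = np.zeros(len(D[0][0]))
        c = 1
        models = []                           # list of (v, c) pairs
        for _ in range(epochs):
            for x, y in D:
                x = np.asarray(x, dtype=float)
                if y * np.dot(v, x) <= 0:
                    models.append((v.copy(), c))  # retire current model with its votes
                    v = v + y * x                 # otherwise create a new model...
                    c = 1                         # ...starting with 1 vote
                else:
                    c += 1                        # correct: one more vote
        models.append((v.copy(), c))
        return models

    def voted_predict(models, x):
        """Each stored model casts c votes for sign(v . x)."""
        s = sum(c * np.sign(np.dot(v, x)) for v, c in models)
        return 1 if s >= 0 else -1

Note that prediction touches every stored model, which is exactly why the voted perceptron does not scale (next slides).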

SLIDE 10

Experiments

[Plot: dev set error for the vanilla perceptron, voted perceptron, and averaged perceptron.]
SLIDE 11

Averaged Perceptron

  • voted perceptron is not scalable
      • and does not output a single model
  • avg perceptron is an approximation of the voted perceptron
  • actually, summing all weight vectors is enough; no need to divide

accumulate after each example, not after each update!

w^(1) = Δw^(1)
w^(2) = Δw^(1) + Δw^(2)
w^(3) = Δw^(1) + Δw^(2) + Δw^(3)
w^(4) = Δw^(1) + Δw^(2) + Δw^(3) + Δw^(4)

initialize w ← 0; w_s ← 0
while not converged
    for (x, y) ∈ D
        if y(w · x) ≤ 0
            w ← w + yx
        w_s ← w_s + w
output: summed weights w_s
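A direct transcription of this pseudocode in NumPy (a sketch; fine for HW1-scale dimensions):

    import numpy as np

    def averaged_perceptron_naive(D, epochs=10):
        """Accumulate the running sum ws after every example."""
        w = np.zeros(len(D[0][0]))
        ws = np.zeros_like(w)
        for _ in range(epochs):
            for x, y in D:
                x = np.asarray(x, dtype=float)
                if y * np.dot(w, x) <= 0:
                    w += y * x
                ws += w        # after each example, not after each update!
        return ws              # summing is enough; no need to divide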
SLIDE 12

Efficient Implementation of Averaging

  • naive implementation (running sum ws) doesn’t scale either
  • OK for low dim. (HW1); too slow for high-dim. (HW3)
  • very clever trick from Hal Daumé (2006, PhD thesis)

w^(t) is the weight vector after example t; Δw^(t) is the update (possibly zero) made at example t:

w^(1) = Δw^(1)
w^(2) = Δw^(1) + Δw^(2)
w^(3) = Δw^(1) + Δw^(2) + Δw^(3)
w^(4) = Δw^(1) + Δw^(2) + Δw^(3) + Δw^(4)

initialize w ← 0; w_a ← 0; c ← 0
while not converged
    for (x, y) ∈ D
        if y(w · x) ≤ 0
            w ← w + yx
            w_a ← w_a + c·yx
        c ← c + 1
output: c·w − w_a

(w_a is updated after each update, not after each example!)
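A sketch of the trick in NumPy; the returned c·w − w_a equals the running sum w_s from the naive version:

    import numpy as np

    def averaged_perceptron(D, epochs=10):
        """Daume's trick: keep an auxiliary vector wa of c-weighted updates,
        then recover the summed weights as c*w - wa at the end."""
        w = np.zeros(len(D[0][0]))
        wa = np.zeros_like(w)
        c = 0
        for _ in range(epochs):
            for x, y in D:
                x = np.asarray(x, dtype=float)
                if y * np.dot(w, x) <= 0:
                    w += y * x
                    wa += c * y * x    # after each update, not after each example!
                c += 1                 # c counts every example seen
        return c * w - wa              # same vector as the naive running sum ws

The point of the design is that w_a is touched only on mistakes, so averaging costs almost nothing per example even in high dimensions.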

SLIDE 13

Part III: MIRA

  • perceptron often makes bold updates (over-correction)
  • and sometimes too small updates (under-correction)
  • but hard to tune learning rate
  • “just enough” update to correct the mistake?

MIRA (margin-infused relaxation algorithm) replaces the perceptron’s over-correcting (or under-correcting) update with a “just enough” one:

    w′ = w + ((y − w · x) / ‖x‖²) x

easy to show:

    w′ · x = (w + ((y − w · x) / ‖x‖²) x) · x = w · x + (y − w · x) = y

so after the update the functional margin is y(w′ · x) = y² = 1.

[Figure: perceptron over-correction vs. under-correction; x lies at distance (w · x)/‖x‖ from the boundary, and the MIRA target corresponds to distance 1/‖x‖.]
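The update as a NumPy sketch (the training-loop wrapper is my addition; plain MIRA updates only on mistakes):

    import numpy as np

    def mira_update(w, x, y):
        """'Just enough' update: move w minimally along x so that
        the functional margin becomes exactly 1, i.e. y * (w' . x) = 1."""
        x = np.asarray(x, dtype=float)
        return w + (y - np.dot(w, x)) / np.dot(x, x) * x

    def mira_train(D, epochs=10):
        w = np.zeros(len(D[0][0]))
        for _ in range(epochs):
            for x, y in D:
                x = np.asarray(x, dtype=float)
                if y * np.dot(w, x) <= 0:      # mistake: correct it exactly
                    w = mira_update(w, x, y)
        return w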

SLIDE 14

Example: Perceptron under-correction

[Figure: an example where the perceptron update under-corrects: after w ← w + yx, the new weight vector w′ still misclassifies x.]

SLIDE 15

MIRA: just enough

Unlike the perceptron, MIRA solves:

    min_{w′} ‖w′ − w‖²   s.t.  w′ · x ≥ 1

the minimal change to ensure a functional margin of 1 (dot-product w′ · x = 1); MIRA ≈ 1-step SVM.

functional margin: y(w · x)    geometric margin: y(w · x)/‖w‖

[Figure: perceptron vs. MIRA updates on example x; the distance 1/‖x‖ marks the MIRA target.]

SLIDE 16

MIRA: functional vs geom. margin

MIRA solves:

    min_{w′} ‖w′ − w‖²   s.t.  w′ · x ≥ 1

the minimal change to ensure a functional margin of 1 (dot-product w′ · x = 1); MIRA ≈ 1-step SVM.

functional margin: y(w · x)    geometric margin: y(w · x)/‖w‖

[Figure: after the update, the hyperplane w′ · x = 1 passes through x, so the geometric margin is 1/‖w′‖.]

SLIDE 17

Optional: Aggressive MIRA

  • aggressive version of MIRA
      • also update if correct but not confident enough
      • i.e., functional margin y(w · x) not big enough
  • p-aggressive MIRA: update if y(w · x) < p (0 ≤ p < 1)
      • MIRA is the special case p = 0: only update if misclassified!
  • update equation is the same as MIRA
      • i.e., after the update, the functional margin becomes 1
  • larger p leads to a larger geometric margin but slower convergence

[Figure: decision boundary w · x = 0 with the parallel lines w · x = 0.7 and w · x = 1; 1/‖w‖ is the geometric margin at functional margin 1.]
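A sketch of the p-aggressive variant, reusing the same update (names and defaults are mine):

    import numpy as np

    def aggressive_mira_train(D, p=0.5, epochs=10):
        """p-aggressive MIRA: update whenever the functional margin
        y * (w . x) falls below p, even if the example is correct."""
        assert 0 <= p < 1      # p = 0 recovers plain MIRA (only misclassifications)
        w = np.zeros(len(D[0][0]))
        for _ in range(epochs):
            for x, y in D:
                x = np.asarray(x, dtype=float)
                if y * np.dot(w, x) < p:                        # not confident enough
                    w += (y - np.dot(w, x)) / np.dot(x, x) * x  # margin becomes exactly 1
        return w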

SLIDE 18

Demo

SLIDE 19

Demo

SLIDE 20

Part IV: Practical Issues


“A ship in port is safe, but that is not what ships are for.” 
 – Grace Hopper (1906-1992)

  • you will build your own linear classifiers for HW2 (same data as HW1)
  • slightly different binarizations
      • for k-NN, we binarize all categorical fields but keep the two numerical ones
      • for perceptron (and most other classifiers), we binarize numerical fields as well
      • why? hint: larger “age” always better? more “hours” always better?
SLIDE 21

Useful Engineering Tips:

averaging, shuffling, variable learning rate, fixing feature scale

  • averaging helps significantly; MIRA helps a tiny little bit
      • perceptron < MIRA < avg. perceptron ≈ avg. MIRA ≈ SVM
  • shuffling the data helps hugely if classes were ordered (HW1)
      • shuffling before each epoch helps a little bit
  • variable (decaying) learning rate often helps a little
      • 1/(total # updates) or 1/(total # examples) helps
      • any requirement in order to converge?
      • how to prove convergence now?
  • centering of each dimension helps (Ex1/HW1); see the sketch after the figure below
      • why? => smaller radius, bigger margin!
  • unit variance also helps (why?) (Ex1/HW1)
      • 0-mean, 1-var => each feature ≈ a unit Gaussian

[Figure: the same data before centering (small margin) and after centering (big margin).]
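A minimal sketch of the centering/scaling tip (hypothetical helper; statistics come from the training data only, then are reused on test data):

    import numpy as np

    def standardize(X_train, X_test):
        """Map each feature to 0-mean, 1-variance, so each feature is
        roughly a unit Gaussian; this shrinks the radius R and enlarges
        the relative margin."""
        mu = X_train.mean(axis=0)
        sigma = X_train.std(axis=0)
        sigma[sigma == 0] = 1.0            # guard against constant features
        return (X_train - mu) / sigma, (X_test - mu) / sigma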

SLIDE 22

Feature Maps in Other Domains

  • how to convert an image or text to a vector?

  • image: a 28×28 grayscale image flattens to x ∈ ℝ^(28×28); a 23×23 RGB image flattens to x ∈ ℝ^(23×23×3)
  • text: “one-hot” representation of words (all binary features)

in deep learning there are other feature maps
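A small sketch of both feature maps (the toy vocabulary is hypothetical):

    import numpy as np

    # image: flatten the pixel grid into one long feature vector
    gray = np.random.rand(28, 28)       # stand-in for a 28x28 grayscale image
    x_gray = gray.reshape(-1)           # x in R^784

    rgb = np.random.rand(23, 23, 3)     # stand-in for a 23x23 RGB image
    x_rgb = rgb.reshape(-1)             # x in R^(23*23*3)

    # text: "one-hot" representation of words (all binary features)
    vocab = {"ship": 0, "port": 1, "safe": 2}   # toy vocabulary
    def one_hot(word):
        x = np.zeros(len(vocab))
        x[vocab[word]] = 1.0
        return x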

SLIDE 23

Part V: Perceptron vs. Logistic Regression

  • logistic regression is another popular linear classifier
  • can be viewed as “soft” or “probabilistic” perceptron
  • same decision rule (sign of dot-product), but prob. output

perceptron:           f(x) = sign(w · x)
logistic regression:  f(x) = σ(w · x) = 1 / (1 + e^(−w · x))
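Side by side as a NumPy sketch (function names are mine):

    import numpy as np

    def perceptron_predict(w, x):
        return np.sign(np.dot(w, x))                 # hard label in {-1, +1}

    def logistic_predict(w, x):
        return 1.0 / (1.0 + np.exp(-np.dot(w, x)))   # P(y = 1 | x), in (0, 1)

    # same linear decision boundary w . x = 0: the sign flips there,
    # and the sigmoid crosses 0.5 there.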

SLIDE 24

Logistic vs. Linear Regression

  • linear regression is regression applied to a real-valued output using a linear function
  • logistic regression is regression applied to a 0-1 output using the sigmoid function

https://florianhartl.com/logistic-regression-geometric-intuition.html

[Figure: linear vs. logistic regression fits, each shown with 1 feature and with 2 features.]

SLIDE 25

Why Logistic instead of Linear


  • linear regression is easily dominated by distant points
      • causing misclassification

http://www.robots.ox.ac.uk/~az/lectures/ml/2011/lect4.pdf

SLIDE 26

Why Logistic instead of Linear

  • linear regression is easily dominated by distant points
      • causing misclassification
SLIDE 27

Why 0/1 instead of +/-1

  • perceptron: y = +1 or −1; logistic regression: y = 1 or 0
  • reason: want the output to be a probability
  • decision boundary is still linear: p(y = 1 | x) = 0.5

SLIDE 28

Logistic Regression: Large Margin


  • perceptron can be viewed roughly as “step” regression
  • logistic regression favors large margin; SVM: max margin
  • in practice: perc. << avg. perc. ≈ logistic regression ≈ SVM
SLIDE 29

[Diagram: family tree of linear classifiers]

  • perceptron (1958)
  • voted/avg. perceptron (1999)
  • structured perceptron (2002)
  • multilayer perceptron / deep learning (~1986; 2006–now)
  • kernels (1964)
  • SVM (1964; 1995)
  • structured SVM (2003)
  • logistic regression (1958)
  • cond. random fields (2001)