Applied Machine Learning CIML Chaps 4-5 (A Geometric Approach)

Applied Machine Learning
CIML Chaps 4-5 (A Geometric Approach)
“A ship in port is safe, but that is not what ships are for.” – Grace Hopper (1906-1992)
Week 3: Extensions and Variations of Perceptron; Practical Issues and HW1
Professor Liang Huang
some slides from A. Zisserman (Oxford)

Trivia: Grace Hopper and the first bug
• Edison coined the term “bug” around 1878 and it had been widely used in engineering
• Hopper was associated with the discovery of the first computer bug in 1947, which was a moth stuck in a relay
(image credit: Smithsonian National Museum of American History)

Week 3: Perceptron in Practice
• Problems with Perceptron
  • doesn’t converge with inseparable data
  • update might often be too “bold”
  • doesn’t optimize margin
  • result is sensitive to the order of examples
• Ways to alleviate these problems (without SVM/kernels)
  • Part II: voted perceptron and averaged perceptron
  • Part III: MIRA (margin-infused relaxation algorithm)
  • Part IV: Practical Issues and HW1
  • Part V: “Soft” Perceptron: Logistic Regression

Recap of Week 2
input: training data D
output: weights w
initialize w ← 0
while not converged
  for (x, y) ∈ D
    if y (w · x) ≤ 0
      w ← w + y x
(figure: separable data with unit vector u, ‖u‖ = 1, margin δ with u · x ≥ δ, radius R, and the update from w to w′)
“idealized” ML: Input x → Training → Model w → Output y
“actual” ML: Input x → feature map ϕ → Training → Model w → Output y
deep learning ≈ representation learning
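
The recap pseudocode can be sketched in plain Python as follows. This is a minimal illustration, not the course's `perc_demo.py`: the names `dot`, `train_perceptron`, `max_epochs`, and the toy data (with a constant bias feature as the last coordinate) are assumptions made for the example.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def train_perceptron(data, max_epochs=100):
    """data: list of (x, y) pairs, x a list of floats, y in {+1, -1}."""
    w = [0.0] * len(data[0][0])              # initialize w <- 0
    for _ in range(max_epochs):              # "while not converged", capped
        updates = 0
        for x, y in data:
            if y * dot(w, x) <= 0:           # mistake (or exactly on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]   # w <- w + y x
                updates += 1
        if updates == 0:                     # a full pass without mistakes
            break
    return w

# Hypothetical separable toy data; the last coordinate is a bias feature.
toy = [([2.0, 1.0, 1.0], 1), ([1.0, 3.0, 1.0], 1),
       ([-1.0, -2.0, 1.0], -1), ([-2.0, -1.0, 1.0], -1)]
w = train_perceptron(toy)
```

The `max_epochs` cap is there because the loop would never terminate on inseparable data, which is the first problem listed for Week 3.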

Python Demo (requires numpy and matplotlib)
$ python perc_demo.py

Part II: Voted and Averaged Perceptron
(figure: dev set error for vanilla perceptron, voted perceptron, and averaged perceptron)

Voted/Avg. Perceptron Revives Perceptron
(timeline figure)
• 1959 Rosenblatt: invention
• 1962 Novikoff: proof
• 1969* Minsky/Papert: book killed it (DEAD)
• 1997 Cortes/Vapnik: SVM
• 1999 Freund/Schapire: voted/avg: revived
• 2002 Collins: structured
• 2003 Crammer/Singer: MIRA
• 2005* McDonald/Crammer/Pereira: structured MIRA
• 2006 Singer group: aggressive
• 2007--2010* Singer group: Pegasos
arrow labels in the figure: batch, online, minibatch, +soft-margin, +kernels, +max margin, online approx. max margin, subgradient descent, conservative updates, inseparable case
*mentioned in lectures but optional (other papers are all covered in detail)

Voted/Averaged Perceptron
• problem: later examples dominate earlier examples
• solution: voted perceptron (Freund and Schapire, 1999)
  • record the weight vector after each example in D
    • not just after each update!
  • and vote on a new example using |D| models
  • shown to have better generalization power
• averaged perceptron (from the same paper)
  • an approximation of voted perceptron
  • just use the average of all weight vectors
  • can be implemented efficiently

Voted Perceptron
• our notation: (x^(1), y^(1)); v is a weight vector, c is its # of votes
• if correct, increase the current model’s # of votes; otherwise create a new model with 1 vote
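
A minimal sketch of the voting scheme, assuming the usual formulation in which each stored model v casts c votes for its own prediction. The function names, the fixed `epochs` count, and the toy data are hypothetical choices for illustration.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sign(z):
    return 1 if z > 0 else -1

def train_voted(data, epochs=5):
    """Return a list of (v, c): weight vector v and its number of votes c."""
    v, c = [0.0] * len(data[0][0]), 0
    models = []
    for _ in range(epochs):
        for x, y in data:
            if y * dot(v, x) <= 0:           # mistake: retire the current model
                models.append((v, c))
                v = [vi + y * xi for vi, xi in zip(v, x)]
                c = 1                        # the new model starts with 1 vote
            else:
                c += 1                       # correct: one more vote
    models.append((v, c))                    # keep the last model too
    return models

def predict_voted(models, x):
    """Weighted majority vote over all stored models."""
    return sign(sum(c * sign(dot(v, x)) for v, c in models))

# Hypothetical toy data; the last coordinate is a bias feature.
toy = [([2.0, 1.0, 1.0], 1), ([1.0, 3.0, 1.0], 1),
       ([-1.0, -2.0, 1.0], -1), ([-2.0, -1.0, 1.0], -1)]
models = train_voted(toy, epochs=5)
```

Every stored model has to be kept and evaluated at prediction time, so the cost of a single prediction grows with the number of updates made during training.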

Experiments
(figure: dev set error for vanilla perceptron, voted perceptron, and averaged perceptron)

Averaged Perceptron
• voted perceptron is not scalable
  • and does not output a single model
• avg perceptron is an approximation of voted perceptron
• actually, summing all weight vectors is enough; no need to divide
initialize w ← 0; w_s ← 0
while not converged
  for (x, y) ∈ D
    if y (w · x) ≤ 0
      w ← w + y x
    w_s ← w_s + w
output: summed weights w_s
• each weight vector is the sum of the updates so far:
  $w^{(1)} = \Delta w^{(1)}$
  $w^{(2)} = \Delta w^{(1)} + \Delta w^{(2)}$
  $w^{(3)} = \Delta w^{(1)} + \Delta w^{(2)} + \Delta w^{(3)}$
  $w^{(4)} = \Delta w^{(1)} + \Delta w^{(2)} + \Delta w^{(3)} + \Delta w^{(4)}$
• w_s is updated after each example, not after each update!
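
The running-sum version can be sketched as below. The function name, the fixed `epochs` count, and the small inseparable data set `hard` are assumptions for the example; the point is only that the sum is extended after every example, whether or not an update happened.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train_averaged_naive(data, epochs=5):
    """Return the summed weights w_s (no division needed for prediction)."""
    dim = len(data[0][0])
    w = [0.0] * dim
    w_sum = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
            # after each example, not only after each update
            w_sum = [si + wi for si, wi in zip(w_sum, w)]
    return w_sum

# Hypothetical XOR-like data (inseparable); last coordinate is a bias feature.
hard = [([1.0, 1.0, 1.0], 1), ([1.0, -1.0, 1.0], -1),
        ([-1.0, 1.0, 1.0], -1), ([-1.0, -1.0, 1.0], 1)]
w_sum = train_averaged_naive(hard, epochs=1)
```

In one pass over `hard` every example triggers an update, and the four snapshots of w add up to `[2.0, 4.0, 0.0]`. Adding the full vector w after every example costs time proportional to the dimension even when nothing changed.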

Efficient Implementation of Averaging
• naive implementation (running sum w_s) doesn’t scale
  • OK for low dim. (HW1); too slow for high-dim. (HW3)
• very clever trick from Hal Daumé (2006, PhD thesis)
initialize w ← 0; w_a ← 0; c ← 0
while not converged
  for (x, y) ∈ D
    if y (w · x) ≤ 0
      w ← w + y x
      w_a ← w_a + c y x
    c ← c + 1
output: c w − w_a
• each weight vector $w^{(t)}$ is the sum of the updates so far:
  $w^{(1)} = \Delta w^{(1)}$
  $w^{(2)} = \Delta w^{(1)} + \Delta w^{(2)}$
  $w^{(3)} = \Delta w^{(1)} + \Delta w^{(2)} + \Delta w^{(3)}$
  $w^{(4)} = \Delta w^{(1)} + \Delta w^{(2)} + \Delta w^{(3)} + \Delta w^{(4)}$
• w_a is updated after each update, not after each example!
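
A sketch of the counter-based trick, under the same assumptions as before (hypothetical names, fixed `epochs`, toy data). An update made when the counter equals c is part of every later weight vector, so it appears (T − c) times in the sum over T examples; c w − w_a with c = T recovers exactly that total while touching w_a only on mistakes.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train_averaged(data, epochs=5):
    """Return c*w - w_a, which equals the running sum of all weight vectors."""
    dim = len(data[0][0])
    w = [0.0] * dim
    w_a = [0.0] * dim      # updates weighted by the counter at update time
    c = 0                  # number of examples seen so far
    for _ in range(epochs):
        for x, y in data:
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                w_a = [ai + c * y * xi for ai, xi in zip(w_a, x)]
            c += 1         # the counter advances after every example
    return [c * wi - ai for wi, ai in zip(w, w_a)]

# Hypothetical XOR-like data (inseparable); last coordinate is a bias feature.
hard = [([1.0, 1.0, 1.0], 1), ([1.0, -1.0, 1.0], -1),
        ([-1.0, 1.0, 1.0], -1), ([-1.0, -1.0, 1.0], 1)]
w_avg = train_averaged(hard, epochs=1)
```

On this data the result is `[2.0, 4.0, 0.0]`, which is what adding up the weight vector after each of the four examples would give. With sparse feature vectors, both updates inside the `if` only need to touch the active features of x, which is where the saving comes from in high dimensions.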

Part III: MIRA
• perceptron often makes bold updates (over-correction)
• and sometimes too small updates (under-correction)
• but hard to tune learning rate
• “just enough” update to correct the mistake?
  $w' \leftarrow w + \frac{y - w \cdot x}{\|x\|^2} x$
• easy to show: $w' \cdot x = \left( w + \frac{y - w \cdot x}{\|x\|^2} x \right) \cdot x = y$
• margin-infused relaxation algorithm (MIRA)
(figure: x, w, and the updated w′ for perceptron under-correction, perceptron over-correction, and MIRA; marked distances are $\frac{1}{\|x\|}$ and $\frac{-w \cdot x}{\|x\|}$)
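
A single MIRA step can be sketched next to the plain perceptron step. The helper names and the numbers are hypothetical; the example is chosen so that the perceptron under-corrects.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def mira_update(w, x, y):
    """Return w' = w + ((y - w.x) / ||x||^2) x, so that w'.x == y."""
    eta = (y - dot(w, x)) / dot(x, x)        # per-example step size
    return [wi + eta * xi for wi, xi in zip(w, x)]

w = [-3.0, 0.0]                  # current weights (toy values)
x, y = [1.0, 0.0], 1             # misclassified: y * (w . x) = -3

w_perc = [wi + y * xi for wi, xi in zip(w, x)]   # perceptron step: w + y x
w_mira = mira_update(w, x, y)                    # MIRA step
```

Here the perceptron step gives `[-2.0, 0.0]`, which still misclassifies x, while the MIRA step gives `[1.0, 0.0]` with w′ · x = 1.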

Example: Perceptron under-correction
(figure: x, w, and the perceptron update w′)

MIRA: just enough
$\min_{w'} \|w' - w\|^2$ s.t. $w' \cdot x \ge 1$
• minimal change to ensure functional margin of 1 (dot-product $w' \cdot x = 1$)
• MIRA ≈ 1-step SVM
• functional margin: $y (w \cdot x)$
• geometric margin: $\frac{y (w \cdot x)}{\|w\|}$
(figure: x, w, the perceptron update w′, and the MIRA update w′, with distance $\frac{1}{\|x\|}$ marked)
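
The closed-form update follows from this minimization in a few lines. The sketch below writes the constraint for a general label y ∈ {+1, −1} as y (w′ · x) ≥ 1, which is an assumption that reduces to the slide's w′ · x ≥ 1 for a positive example; η is a name introduced here for the step size.

```latex
\begin{aligned}
&\min_{w'} \|w' - w\|^2 \quad \text{s.t.} \quad y\,(w' \cdot x) \ge 1 \\
&\text{If } y\,(w \cdot x) < 1 \text{, the constraint is tight: } w' \cdot x = y \quad (\text{since } y^2 = 1). \\
&\text{The closest point on that hyperplane is reached by moving along } x\text{: } w' = w + \eta\, x. \\
&(w + \eta\, x) \cdot x = y \;\Longrightarrow\; \eta = \frac{y - w \cdot x}{\|x\|^2} \\
&\|w' - w\| = |\eta|\,\|x\| = \frac{|y - w \cdot x|}{\|x\|}
\end{aligned}
```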

MIRA: functional vs geom. margin
$\min_{w'} \|w' - w\|^2$ s.t. $w' \cdot x \ge 1$
• minimal change to ensure functional margin of 1 (dot-product $w' \cdot x = 1$)
• MIRA ≈ 1-step SVM
• functional margin: $y (w \cdot x)$
• geometric margin: $\frac{y (w \cdot x)}{\|w\|}$
(figure: the lines $w' \cdot x = 0$ and $w' \cdot x = 1$, at distance $\frac{1}{\|w'\|}$ from each other)

Optional: Aggressive MIRA
(figure: lines of constant $w' \cdot x$, including $w' \cdot x = 0$ and $w' \cdot x = 1$, with distance $\frac{1}{\|w'\|}$ marked)
• aggressive version of MIRA
  • also update if correct but not confident enough
  • i.e., functional margin $y (w \cdot x)$ not big enough
• p-aggressive MIRA: update if $y (w \cdot x) < p$ ($0 \le p < 1$)
  • MIRA is a special case with p = 0: only update if misclassified!
• update equation is same as MIRA
  • i.e., after update, functional margin becomes 1
• larger p leads to a larger geometric margin but slower convergence
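
A training loop for p-aggressive MIRA might look like the sketch below. The names, the `max_epochs` cap, and the toy data are hypothetical. One labeled assumption: the test also fires when the functional margin is exactly 0, so that training can start from w = 0 even with p = 0.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train_mira(data, p=0.0, max_epochs=100):
    """p-aggressive MIRA; p = 0 updates only on mistakes."""
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        updates = 0
        for x, y in data:
            margin = y * dot(w, x)               # functional margin
            # Assumption: margin == 0 also counts as a mistake.
            if margin <= 0 or margin < p:
                eta = (y - dot(w, x)) / dot(x, x)
                w = [wi + eta * xi for wi, xi in zip(w, x)]
                updates += 1
        if updates == 0:                         # every margin is >= p (and > 0)
            break
    return w

def min_margin(w, data):
    return min(y * dot(w, x) for x, y in data)

# Hypothetical separable toy data; the last coordinate is a bias feature.
toy = [([2.0, 1.0, 1.0], 1), ([1.0, 3.0, 1.0], 1),
       ([-1.0, -2.0, 1.0], -1), ([-2.0, -1.0, 1.0], -1)]
w_plain = train_mira(toy, p=0.0)
w_aggr = train_mira(toy, p=0.9)
```

On this toy data the plain version stops after one update with a smallest functional margin of about 0.5, while p = 0.9 makes one more update and ends with every functional margin at or above 0.9.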

Demo
(two figure-only slides)

Part IV: Practical Issues and HW1
• you will build your own linear classifiers for HW1 data

HW1: Adult Income >50K?
training/dev sets:
Age, Sector, Education, Marital_Status, Occupation, Race, Sex, Hours, Country, Target
40, Private, Doctorate, Married-civ-spouse, Prof-specialty, White, Female, 60, United-States, >50K
44, Local-gov, Some-college, Married-civ-spouse, Exec-managerial, Black, Male, 38, United-States, >50K
55, Private, HS-grad, Divorced, Sales, White, Male, 40, England, <=50K
test data (semi-blind):
30, Private, Assoc-voc, Married-civ-spouse, Tech-support, White, Female, 40, Canada, ???
• 2 numerical features: age and hours-per-week
  • option 1: keep them as numerical features
    • but is older and more hours always better?
  • option 2: (better) treat them as binary features
    • e.g., age=22, hours=38, ...
• 7 categorical features: convert to binary features
  • country, race, occupation, etc.
  • e.g., country=United_States, education=Doctorate, ...
• perceptron: ~19% dev error, avg. perceptron: ~15% dev error
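
Option 2 (every field=value pair becomes its own binary feature) can be sketched as follows. The field names are lowercased from the header row above; the function names, the `<bias>` feature, and the sparse index representation are assumptions for illustration, not the HW1 starter code.

```python
FIELDS = ["age", "sector", "education", "marital_status", "occupation",
          "race", "sex", "hours", "country"]

def parse_line(line):
    """Turn one CSV line into (list of 'field=value' strings, label)."""
    parts = [p.strip() for p in line.split(",")]
    feats = ["%s=%s" % (f, v) for f, v in zip(FIELDS, parts[:-1])]
    label = 1 if parts[-1] == ">50K" else -1
    return feats, label

def build_feature_map(rows):
    """Assign an index to each distinct binary feature seen in training."""
    fmap = {"<bias>": 0}                     # assumed always-on bias feature
    for feats, _ in rows:
        for f in feats:
            fmap.setdefault(f, len(fmap))
    return fmap

def to_indices(feats, fmap):
    """Sparse binary vector: indices of active features (unseen ones dropped)."""
    return [0] + [fmap[f] for f in feats if f in fmap]

lines = [
    "40, Private, Doctorate, Married-civ-spouse, Prof-specialty, White, Female, 60, United-States, >50K",
    "44, Local-gov, Some-college, Married-civ-spouse, Exec-managerial, Black, Male, 38, United-States, >50K",
    "55, Private, HS-grad, Divorced, Sales, White, Male, 40, England, <=50K",
]
rows = [parse_line(l) for l in lines]
fmap = build_feature_map(rows)
vectors = [to_indices(feats, fmap) for feats, _ in rows]
```

With this encoding, age=40 and age=44 are unrelated features, so the classifier does not have to assume that older or longer hours is always better. The cost is a much larger feature space, which is why a sparse representation is used here.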

Interesting Facts in HW1 Data
• only ~25% positive (>50K); data was from 1994 (~$27K per capita)
• education is probably the single most important factor
  • education=Doctorate is extremely positive (80%)
  • education=Prof-school is also very positive (75%)
  • education=Masters is also positive (55%)
  • education=9th (high school dropout) is extremely negative (6%)
• “married” is good (45%), “never married” is extremely bad (5%)
• “self-emp-inc” is the best sector (59%), but “self-emp-not-inc” is only 30%
• hours-per-week=1 is 100% positive; country=Iran is 70% positive
• exec-managerial and prof-specialty are the best occupations (48% / 46%)
• interesting combinations (e.g. “edu=Doc and sector=self-emp-inc”: 100%)
