
Applied Machine Learning, CIML Chaps 4-5 (A Geometric Approach) - PowerPoint PPT Presentation

Applied Machine Learning, CIML Chaps 4-5 (A Geometric Approach). "A ship in port is safe, but that is not what ships are for." – Grace Hopper (1906-1992). Week 5: Extensions and Variations of Perceptron, and Practical Issues.


  1. Applied Machine Learning, CIML Chaps 4-5 (A Geometric Approach)
 "A ship in port is safe, but that is not what ships are for." – Grace Hopper (1906-1992)
 Week 5: Extensions and Variations of Perceptron, and Practical Issues
 Professor Liang Huang; some slides from A. Zisserman (Oxford)

  2. Trivia: Grace Hopper and the first bug
 • Edison coined the term "bug" around 1878, and it has been widely used in engineering ever since
 • Hopper was associated with the discovery of the first computer bug in 1947: a moth stuck in a relay (Smithsonian National Museum of American History)

  3. Week 5: Perceptron in Practice
 "A ship in port is safe, but that is not what ships are for." – Grace Hopper (1906-1992)
 • Problems with Perceptron
   • doesn't converge with inseparable data
   • update might often be too "bold"
   • doesn't optimize margin
   • result is sensitive to the order of examples
 • Ways to alleviate these problems (without SVM/kernels)
   • Part II: voted perceptron and averaged perceptron
   • Part III: MIRA (margin-infused relaxation algorithm)
   • Part IV: Practical Issues and HW1
   • Part V: "Soft" Perceptron: Logistic Regression

  4. Recap of Week 4
 Perceptron algorithm:
   input: training data D
   output: weights w
   initialize w ← 0
   while not converged
     for (x, y) ∈ D
       if y(w · x) ≤ 0
         w ← w + y x
 [Figure: linearly separable data with margin δ: a unit vector u (‖u‖ = 1) with u · x ≥ δ for all examples; data radius R]
 [Diagram: "idealized" ML vs. "actual" ML pipelines (Input x, feature map ϕ, Model w, Training, Output y); deep learning ≈ representation learning]

  5. Python Demo (requires numpy and matplotlib): $ python perc_demo.py
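The actual perc_demo.py is not reproduced here; below is a minimal sketch of the vanilla perceptron it demonstrates, assuming the training data is a numpy array X of feature vectors and an array y of labels in {+1, -1} (names and toy data are illustrative only).

```python
import numpy as np

def perceptron(X, y, epochs=10):
    w = np.zeros(X.shape[1])               # initialize w <- 0
    for _ in range(epochs):                # "while not converged" (fixed epoch count here)
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:    # mistake (or exactly on the boundary)
                w += yi * xi               # w <- w + y x
    return w

# toy usage: two separable 2D clusters
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))                      # predicted labels
```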

  6. Part II: Voted and Averaged Perceptron
 [Plot: dev set error of the vanilla perceptron vs. the voted and averaged perceptrons]

  7. Brief History of Perceptron
 • 1959 Rosenblatt: invention
 • 1962 Novikoff: proof
 • 1969* Minsky/Papert: book killed it (perceptron declared DEAD)
 • 1997 Cortes/Vapnik: SVM (batch, max margin, +soft-margin, +kernels)
 • 1999 Freund/Schapire: voted/avg. perceptron (revived; handles the inseparable case)
 • 2002 Collins: structured perceptron
 • 2003 Crammer/Singer: MIRA (online, conservative updates)
 • 2005* McDonald/Crammer/Pereira: structured MIRA
 • 2006 Singer group: aggressive
 • 2007--2010* Singer group: Pegasos (minibatch, online approx. subgradient descent, +max margin)
 *mentioned in lectures but optional (the other papers are all covered in detail)
 [Timeline also marks which works came from AT&T Research vs. ex-AT&T researchers and their students]

  8. Voted/Avged Perceptron
 • problem: later examples dominate earlier examples
 • solution: voted perceptron (Freund and Schapire, 1999)
   • record the weight vector after each example in D (not just after each update!)
   • and vote on a new example using |D| models
   • shown to have better generalization power
 • averaged perceptron (from the same paper)
   • an approximation of voted perceptron: just use the average of all weight vectors
   • can be implemented efficiently

  9. Voted Perceptron
 • our notation: (x(1), y(1)), ...; v is a weight vector and c is its number of votes
 • if the current example is classified correctly, increase the current model's number of votes; otherwise create a new model with 1 vote
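A hedged sketch of the voted perceptron described above: every weight vector is kept together with its vote count, and prediction is a weighted vote over all recorded models. Function and variable names are illustrative, not taken from the course code.

```python
import numpy as np

def voted_perceptron_train(X, y, epochs=10):
    models = []                                 # list of (v, c) pairs
    v = np.zeros(X.shape[1])
    c = 0                                       # votes of the current model
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(v, xi) <= 0:
                models.append((v.copy(), c))    # retire the current model
                v = v + yi * xi                 # start a new model ...
                c = 1                           # ... with 1 vote
            else:
                c += 1                          # correct: one more vote
    models.append((v, c))
    return models

def voted_predict(models, x):
    # each model casts c votes for sign(v . x)
    return np.sign(sum(c * np.sign(np.dot(v, x)) for v, c in models))
```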

  10. Experiments
 [Plot: dev set error of the vanilla perceptron vs. the voted and averaged perceptrons]

  11. Averaged Perceptron
 • voted perceptron is not scalable, and does not output a single model
 • avg perceptron is an approximation of voted perceptron
 • actually, summing all weight vectors is enough; no need to divide
 Algorithm:
   initialize w ← 0; w_s ← 0
   while not converged
     for (x, y) ∈ D
       if y(w · x) ≤ 0
         w ← w + y x
       w_s ← w_s + w    (after each example, not after each update!)
   output: summed weights w_s
 [Diagram: w(1) = Δw(1); w(2) = Δw(1) + Δw(2); w(3) = Δw(1) + Δw(2) + Δw(3); w(4) = Δw(1) + Δw(2) + Δw(3) + Δw(4)]
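A small sketch of the averaged perceptron with the naive running sum w_s, mirroring the pseudocode above (the sum is accumulated after every example). It assumes the same numpy X, y format as the earlier sketch.

```python
import numpy as np

def averaged_perceptron_naive(X, y, epochs=10):
    w = np.zeros(X.shape[1])
    w_s = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:
                w += yi * xi
            w_s += w          # after each example, not after each update
    return w_s                # summed weights; no need to divide for prediction
```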

  12. Efficient Implementation of Averaging
 • naive implementation (running sum w_s) doesn't scale either
 • OK for low dim. (HW1); too slow for high dim. (HW3)
 • very clever trick from Hal Daumé (2006, PhD thesis):
   initialize w ← 0; w_a ← 0; c ← 0
   while not converged
     for (x, y) ∈ D
       if y(w · x) ≤ 0
         w ← w + y x
         w_a ← w_a + c y x    (after each update, not after each example!)
       c ← c + 1
   output: c w − w_a
 [Diagram: each w(t) decomposed into its updates Δw(1), ..., Δw(t)]
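A sketch of Daumé's averaging trick as described above: accumulate c·y·x only when an update happens, increment the example counter c on every example, and return c·w − w_a, which is proportional to the sum of all per-example weight vectors. Names and the data format are assumptions.

```python
import numpy as np

def averaged_perceptron_fast(X, y, epochs=10):
    w = np.zeros(X.shape[1])
    w_a = np.zeros(X.shape[1])
    c = 0                               # counts examples seen so far
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:
                w += yi * xi
                w_a += c * yi * xi      # after each update, not after each example
            c += 1
    return c * w - w_a                  # proportional to the sum of all weight vectors
```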

  13. Part III: MIRA
 • perceptron often makes bold updates (over-correction)
 • and sometimes too small updates (under-correction)
 • but the learning rate is hard to tune
 • "just enough" update to correct the mistake?
   w′ = w + ((y − w · x) / ‖x‖²) x
 • easy to show: w′ · x = (w + ((y − w · x) / ‖x‖²) x) · x = y
 • this is MIRA, the margin-infused relaxation algorithm
 [Figure: the perceptron update can under-correct or over-correct; MIRA moves w just enough along x]
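A minimal sketch of the MIRA update on this slide, assuming a single example x with label y in {+1, -1}: the smallest change to w that makes w′ · x = y (equivalently, a functional margin of 1, since y² = 1).

```python
import numpy as np

def mira_update(w, x, y):
    if y * np.dot(w, x) <= 0:                              # update only on a mistake
        w = w + ((y - np.dot(w, x)) / np.dot(x, x)) * x    # now w . x == y
    return w
```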

  14. Example: Perceptron under-correction
 [Figure: a perceptron update that moves w toward x but does not fully correct the mistake]

  15. MIRA: just enough
 min over w′ of ‖w′ − w‖²   s.t.   w′ · x ≥ 1
 • minimal change to ensure a functional margin of 1 (dot-product w′ · x = 1)
 • MIRA ≈ 1-step SVM
 • functional margin: y(w · x);   geometric margin: y(w · x) / ‖w‖
 [Figure: perceptron vs. MIRA update directions]

  16. MIRA: functional vs. geometric margin
 min over w′ of ‖w′ − w‖²   s.t.   w′ · x ≥ 1
 • minimal change to ensure a functional margin of 1 (dot-product w′ · x = 1)
 • MIRA ≈ 1-step SVM
 • functional margin: y(w · x);   geometric margin: y(w · x) / ‖w‖
 [Figure: hyperplanes w′ · x = 0 and w′ · x = 1; the distance between them is 1 / ‖w′‖]

  17. Optional: Aggressive MIRA
 [Figure: hyperplanes w · x = 0, w · x = 0.7, and w · x = 1; spacing 1 / ‖w‖]
 • aggressive version of MIRA: also update if correct but not confident enough, i.e., the functional margin y(w · x) is not big enough
 • p-aggressive MIRA: update if y(w · x) < p  (0 ≤ p < 1)
 • MIRA is a special case with p = 0: only update if misclassified!
 • the update equation is the same as MIRA's, i.e., after the update the functional margin becomes 1
 • larger p leads to a larger geometric margin but slower convergence
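A sketch of p-aggressive MIRA as described above: update whenever the functional margin falls below p, using the same update as MIRA, so the margin becomes exactly 1 afterwards. The default value of p below is just an example.

```python
import numpy as np

def aggressive_mira_update(w, x, y, p=0.5):
    margin = y * np.dot(w, x)                             # functional margin y (w . x)
    if margin < p:                                        # also update if correct but not confident
        w = w + y * (1 - margin) / np.dot(x, x) * x       # afterwards y (w . x) == 1
    return w                                              # p = 0 recovers plain MIRA
```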

  18. Demo

  19. Demo

  20. Part IV: Practical Issues
 "A ship in port is safe, but that is not what ships are for." – Grace Hopper (1906-1992)
 • you will build your own linear classifiers for HW2 (same data as HW1)
 • slightly different binarizations:
   • for k-NN, we binarize all categorical fields but keep the two numerical ones
   • for perceptron (and most other classifiers), we binarize the numerical fields as well
 • why? hint: is a larger "age" always better? are more "hours" always better?
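A hedged sketch of the two binarization schemes mentioned on this slide. The dict-based example format and the extra field name are assumptions for illustration, not the actual HW data format; only "age" and "hours" come from the slide.

```python
def binarize(example, binarize_numeric=True):
    feats = {}
    for field, value in example.items():
        if field in ("age", "hours") and not binarize_numeric:
            feats[field] = float(value)          # k-NN: keep the two numerical fields as numbers
        else:
            feats[f"{field}={value}"] = 1.0      # one binary feature per (field, value) pair
    return feats

# perceptron-style features (numerical fields binarized too):
print(binarize({"age": 39, "sector": "private", "hours": 40}))
# -> {'age=39': 1.0, 'sector=private': 1.0, 'hours=40': 1.0}
```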

  21. Useful Engineering Tips: averaging, shuffling, variable learning rate, fixing feature scale
 • averaging helps significantly; MIRA helps a tiny little bit
   • perceptron < MIRA < avg. perceptron ≈ avg. MIRA ≈ SVM
 • shuffling the data helps hugely if the classes were ordered (HW1); shuffling before each epoch helps a little bit
 • a variable (decaying) learning rate often helps a little: 1/(total # of updates) or 1/(total # of examples)
   • any requirement in order to converge? how do we prove convergence now?
 • centering each dimension helps (Ex1/HW1): why? => smaller radius, bigger margin!
 • unit variance also helps (why?) (Ex1/HW1): 0-mean, 1-var => each feature ≈ a unit Gaussian
 [Figure: two plots with origin O; uncentered data has a small margin, centered data has a big margin]
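A small sketch of the centering and unit-variance tips (0-mean, 1-var per dimension) plus per-epoch shuffling, using numpy; this is illustrative, not the HW starter code.

```python
import numpy as np

def standardize(X):
    # center each dimension (smaller radius, bigger relative margin) and scale to unit variance
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                 # guard against constant features
    return (X - mean) / std

def shuffle_epoch(X, y, rng=None):
    # shuffle the data before an epoch (helps if the classes were ordered)
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(X))
    return X[idx], y[idx]
```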

  22. Feature Maps in Other Domains
 • how do we convert an image or a piece of text to a vector?
 • image: e.g., a 28x28 grayscale image, or a 23x23 RGB image with x ∈ ℝ^(23×23×3)
 • text: a "one-hot" representation of words (all binary features)
 • in deep learning there are other feature maps
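A sketch of the two feature maps mentioned here: flattening an image into a vector and mapping text to one-hot word indicators. The vocabulary handling (a word-to-index dict, unknown words ignored) is an assumption for illustration.

```python
import numpy as np

def image_to_vector(img):
    # e.g. a 28x28 grayscale array -> a vector in R^784,
    # or a 23x23x3 RGB array -> a vector in R^(23*23*3)
    return np.asarray(img, dtype=float).ravel()

def text_to_onehot(text, vocab):
    # vocab: dict mapping word -> index; words not in the vocabulary are ignored
    x = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            x[vocab[word]] = 1.0
    return x

vocab = {"ship": 0, "port": 1, "safe": 2}
print(text_to_onehot("A ship in port is safe", vocab))   # [1. 1. 1.]
```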

  23. Part V: Perceptron vs. Logistic Regression
 • logistic regression is another popular linear classifier
 • it can be viewed as a "soft" or "probabilistic" perceptron
 • same decision rule (sign of the dot-product), but probabilistic output
 • perceptron: f(x) = sign(w · x)
 • logistic regression: f(x) = σ(w · x) = 1 / (1 + e^(−w · x))
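A tiny sketch contrasting the two decision rules on this slide: the same linear score w · x, output either as a hard sign (perceptron) or squashed by the sigmoid into a probability (logistic regression).

```python
import numpy as np

def perceptron_predict(w, x):
    return np.sign(np.dot(w, x))                   # hard decision: +1 or -1

def logistic_predict_proba(w, x):
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))     # sigma(w . x) = P(y = 1 | x)

# both share the same linear decision boundary: w . x = 0, i.e. P(y = 1 | x) = 0.5
```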

  24. Logistic vs. Linear Regression
 • linear regression is regression applied to real-valued output using a linear function
 • logistic regression is regression applied to 0-1 output using the sigmoid function
 [Figure: linear and logistic regression fits with 1 feature and with 2 features]
 https://florianhartl.com/logistic-regression-geometric-intuition.html

  25. Why Logistic instead of Linear
 • linear regression is easily dominated by distant points, causing misclassification
 http://www.robots.ox.ac.uk/~az/lectures/ml/2011/lect4.pdf

  26. Why Logistic instead of Linear
 • linear regression is easily dominated by distant points, causing misclassification

  27. Why 0/1 instead of +/-1
 • perceptron: y = +1 or -1; logistic regression: y = 1 or 0
 • reason: we want the output to be a probability
 • the decision boundary is still linear: p(y = 1 | x) = 0.5

  28. Logistic Regression: Large Margin
 • the perceptron can be viewed roughly as "step" regression
 • logistic regression favors a large margin; SVM: max margin
 • in practice: perc. << avg. perc. ≈ logistic regression ≈ SVM

  29. [Diagram: models and the years they were introduced]
 • perceptron: 1958
 • logistic regression: 1958
 • kernels: 1964
 • SVM: 1964; 1995
 • multilayer perceptron (deep learning): ~1986; 2006-now
 • voted/avg. perceptron: 1999
 • cond. random fields: 2001
 • structured perceptron: 2002
 • structured SVM: 2003
