BBM406 Fundamentals of Machine Learning, Lecture 15: Support Vector Machines


  1. BBM406 Fundamentals of Machine Learning, Lecture 15: Support Vector Machines. Aykut Erdem // Hacettepe University // Fall 2019. Cover photo by Arthur Gretton: CMU Machine Learning Protestors at G20.

  2. Announcement
 • Midterm exam on Dec 6, 2019 (moved from Nov 29) at 09:00 in rooms D3 & D4
 • No class next Wednesday! Extra office hour instead.
 • No class on Friday! Make-up class on Dec 2 (Monday), 15:00-17:00
 • No change in the due date of your Assignment 3!

  3. Last time… AlexNet [Krizhevsky et al. 2012]
 Full (simplified) AlexNet architecture:
 [227x227x3] INPUT
 [55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
 [27x27x96] MAX POOL1: 3x3 filters at stride 2
 [27x27x96] NORM1: Normalization layer
 [27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
 [13x13x256] MAX POOL2: 3x3 filters at stride 2
 [13x13x256] NORM2: Normalization layer
 [13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
 [13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
 [13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
 [6x6x256] MAX POOL3: 3x3 filters at stride 2
 [4096] FC6: 4096 neurons
 [4096] FC7: 4096 neurons
 [1000] FC8: 1000 neurons (class scores)
 Details/Retrospectives:
 - first use of ReLU
 - used Norm layers (not common anymore)
 - heavy data augmentation
 - dropout 0.5
 - batch size 128
 - SGD Momentum 0.9
 - Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
 - L2 weight decay 5e-4
 - 7 CNN ensemble: 18.2% -> 15.4%
 slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
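A quick way to check the spatial sizes listed above is the standard conv/pool output-size formula. A minimal Python sketch (layer parameters taken from the slide) reproduces the 227 → 55 → 27 → 13 → 6 progression:

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a conv/pool layer: floor((size - kernel + 2*pad) / stride) + 1."""
    return (size - kernel + 2 * pad) // stride + 1

print(conv_out(227, 11, 4, 0))  # CONV1: 227 -> 55
print(conv_out(55, 3, 2, 0))    # MAX POOL1: 55 -> 27
print(conv_out(27, 5, 1, 2))    # CONV2: 27 -> 27
print(conv_out(27, 3, 2, 0))    # MAX POOL2: 27 -> 13
print(conv_out(13, 3, 1, 1))    # CONV3-5: 13 -> 13
print(conv_out(13, 3, 2, 0))    # MAX POOL3: 13 -> 6
```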

  4. Last time… Understanding ConvNets. slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson. http://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf

  5. Last time… Data Augmentation. Random mix/combinations of: translation, rotation, stretching, shearing, lens distortions, …

  6. Last time… Transfer Learning with Convolutional Networks
 1. Train on Imagenet.
 2. Small dataset: feature extractor. Freeze the pretrained layers and retrain only the top layer (see the sketch below).
 3. Medium dataset: finetuning. More data = retrain more of the network (or all of it).
 Tip: in finetuning, use only ~1/10th of the original learningning rate on the top layer, and ~1/100th on intermediate layers.
 slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
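A minimal sketch of case 2 (feature extractor), assuming PyTorch with torchvision ≥ 0.13 is available and using ResNet-18 as a stand-in for the generic pretrained CNN on the slide; only the new top layer is left trainable.

```python
import torch.nn as nn
import torchvision.models as models

# Load an ImageNet-pretrained backbone (ResNet-18 here, standing in for the slide's CNN).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Case 2 (small dataset): freeze all pretrained layers ...
for param in model.parameters():
    param.requires_grad = False

# ... and retrain only a new top layer (10 target classes is an illustrative choice).
model.fc = nn.Linear(model.fc.in_features, 10)
```

For case 3 (finetuning) one would instead leave more layers trainable and give them a reduced learning rate, roughly 1/10th for the top layer and 1/100th for intermediate layers, as the slide suggests.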

  7. Today • Support Vector Machines - Large Margin Separation - Optimization Problem - Support Vectors

  8. Recap: Binary Classification Problem
 • Training data: sample drawn i.i.d. from a set X ⊆ ℝ^N according to some distribution D:
   S = ((x_1, y_1), …, (x_m, y_m)) ∈ (X × {−1, +1})^m
 • Problem: find a hypothesis h ∈ H, h : X → {−1, +1} (classifier), with small generalization error R_D(h).
 • Linear classification:
   - Hypotheses based on hyperplanes.
   - Linear separation in high-dimensional space.
 slide by Mehryar Mohri

  9. Example: Spam
 • Imagine 3 features (spam is the "positive" class):
   1. free (number of occurrences of "free")
   2. money (occurrences of "money")
   3. BIAS (intercept, always has value 1)
 • Input "free money": f(x) = (BIAS: 1, free: 1, money: 1, …)
 • Weights: w = (BIAS: −3, free: 4, money: 2, …)
 • Decision rule: w · f(x) > 0 ➞ SPAM!!!
 slide by David Sontag
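With the slide's illustrative numbers, the decision rule is a single dot product. A minimal sketch:

```python
import numpy as np

w = np.array([-3.0, 4.0, 2.0])    # weights for [BIAS, free, money], as on the slide
f_x = np.array([1.0, 1.0, 1.0])   # features of "free money": BIAS=1, free=1, money=1

score = w @ f_x                   # w . f(x) = -3 + 4 + 2 = 3
print("SPAM" if score > 0 else "HAM")   # score > 0  ->  SPAM
```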

  10. Binary Decision Rule
 • In the space of feature vectors:
   - Examples are points
   - Any weight vector is a hyperplane
   - One side corresponds to Y = +1 (SPAM)
   - Other side corresponds to Y = −1 (HAM)
 [Figure: decision boundary for w = (BIAS: −3, free: 4, money: 2) in the free/money plane]
 slide by David Sontag

  11. The perceptron algorithm
 • Start with weight vector w = 0
 • For each training instance (x_i, y*_i):
   - Classify with current weights: y_i = sign(w · f(x_i))
   - If correct (i.e. y_i = y*_i), no change!
   - If wrong: update w = w + y*_i f(x_i)
 slide by David Sontag
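A minimal NumPy sketch of this update rule, assuming f(x) is just the raw feature vector with the BIAS feature already appended:

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Perceptron training sketch: rows of X are feature vectors f(x_i), y_i in {-1, +1}."""
    w = np.zeros(X.shape[1])                    # start with the zero weight vector
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if w @ x_i > 0 else -1    # classify with current weights
            if y_hat != y_i:                    # if wrong: w <- w + y_i * f(x_i)
                w = w + y_i * x_i
    return w
```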

  12. Properties of the perceptron algorithm
 • Separability: some parameters get the training set perfectly correct
 • Convergence: if the training set is linearly separable, the perceptron will eventually converge
 slide by David Sontag

  13. Problems with the perceptron algorithm
 • Noise: if the data isn't linearly separable, no guarantees of convergence or accuracy
 • Frequently the training data is linearly separable! Why?
   - When the number of features is much larger than the number of data points, there is lots of flexibility
   - As a result, the perceptron can significantly overfit the data
 • Averaged perceptron is an algorithmic modification that helps with both issues
   - Averages the weight vectors across all iterations (see the sketch below)
 slide by David Sontag
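The averaging mentioned above can be sketched as a small variant of the previous loop; returning the running mean of the weight vectors rather than the final one is the only change.

```python
import numpy as np

def averaged_perceptron(X, y, epochs=10):
    """Averaged perceptron sketch: same updates, but return the average weight vector."""
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)
    steps = 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if (1 if w @ x_i > 0 else -1) != y_i:
                w = w + y_i * x_i               # usual perceptron update
            w_sum += w                          # accumulate after every instance
            steps += 1
    return w_sum / steps                        # average across all iterations
```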

  14. Linear Separators • Which of these linear separators is optimal? slide by David Sontag

  15. Support Vector Machines

  16. Linear Separator [Figure: Ham and Spam examples separated by a line] slide by Alex Smola

  17. Large Margin Classifier [Figure: Ham and Spam examples separated with a large margin] slide by Alex Smola

  18. Review: Normal to a plane
 • w/‖w‖ is the unit vector normal to the plane w · x + b = 0
 • x̄_j is the projection of x_j onto the plane
 • x_j = x̄_j + λ w/‖w‖, where λ is the length of the vector x_j − x̄_j, i.e. ‖x_j − x̄_j‖ = λ
 slide by David Sontag
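A small numerical check of this decomposition, with an illustrative plane and point (the numbers are made up):

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -5.0      # plane w . x + b = 0 (illustrative)
x_j = np.array([4.0, 6.0])             # an arbitrary point

lam = (w @ x_j + b) / np.linalg.norm(w)        # signed distance of x_j to the plane
x_bar = x_j - lam * w / np.linalg.norm(w)      # projection of x_j onto the plane

print(np.isclose(w @ x_bar + b, 0.0))                        # True: x_bar lies on the plane
print(np.isclose(np.linalg.norm(x_j - x_bar), abs(lam)))     # True: lambda is the length
```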

  19. Scale invariance
 Any other ways of writing the same dividing line?
 • w · x + b = 0
 • 2w · x + 2b = 0
 • 1000w · x + 1000b = 0
 • …
 slide by David Sontag

  20. Scale invariance
 During learning, we set the scale by asking that, for all t:
 for y_t = +1, w · x_t + b ≥ 1, and for y_t = −1, w · x_t + b ≤ −1.
 That is, we want to satisfy all of the linear constraints y_t (w · x_t + b) ≥ 1 ∀ t.
 slide by David Sontag

  21. Large Margin Classifier
 Regions: ⟨w, x⟩ + b ≥ 1 and ⟨w, x⟩ + b ≤ −1
 Linear function: f(x) = ⟨w, x⟩ + b
 slide by Alex Smola

  22. Large Margin Classifier
 Margin between the planes ⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = −1:
 ⟨x₊ − x₋, w/‖w‖⟩ = ( [⟨x₊, w⟩ + b] − [⟨x₋, w⟩ + b] ) / ‖w‖ = 2/‖w‖
 slide by Alex Smola

  23. Large Margin Classifier
 Optimization problem: maximize_{w,b} 1/‖w‖ subject to y_i [⟨x_i, w⟩ + b] ≥ 1
 slide by Alex Smola

  24. Large Margin Classifier
 Optimization problem: minimize_{w,b} (1/2)‖w‖² subject to y_i [⟨x_i, w⟩ + b] ≥ 1
 slide by Alex Smola
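A sketch of this primal problem as a quadratic program, assuming cvxpy is available (not something the lecture itself uses); on toy separable data it also recovers the 2/‖w‖ margin from the previous slide.

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm_primal(X, y):
    """Solve minimize (1/2)||w||^2  s.t.  y_i (<x_i, w> + b) >= 1 (hard-margin primal)."""
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    return w.value, b.value

# Toy linearly separable data (illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm_primal(X, y)
print("margin = 2/||w|| =", 2.0 / np.linalg.norm(w))
```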

  25. Convex Programs for Dummies
 • Primal optimization problem: minimize_x f(x) subject to c_i(x) ≤ 0
 • Lagrange function: L(x, α) = f(x) + Σ_i α_i c_i(x)
 • First order optimality conditions in x: ∂_x L(x, α) = ∂_x f(x) + Σ_i α_i ∂_x c_i(x) = 0
 • Solve for x and plug it back into L: maximize_α L(x(α), α) (keep explicit constraints)
 slide by Alex Smola

  26. Dual Problem
 • Primal optimization problem: minimize_{w,b} (1/2)‖w‖² subject to y_i [⟨x_i, w⟩ + b] ≥ 1
 • Lagrange function (one multiplier α_i per constraint):
   L(w, b, α) = (1/2)‖w‖² − Σ_i α_i [ y_i [⟨x_i, w⟩ + b] − 1 ]
 • Optimality in (w, b) is at a saddle point with α
 • Derivatives in w, b need to vanish
 slide by Alex Smola

  27. Dual Problem
 • Lagrange function: L(w, b, α) = (1/2)‖w‖² − Σ_i α_i [ y_i [⟨x_i, w⟩ + b] − 1 ]
 • Derivatives in w, b need to vanish:
   ∂_w L(w, b, α) = w − Σ_i α_i y_i x_i = 0
   ∂_b L(w, b, α) = Σ_i α_i y_i = 0
 • Plugging these back into L yields:
   maximize_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩
   subject to  Σ_i α_i y_i = 0 and α_i ≥ 0
 slide by Alex Smola
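The same problem can be attacked from the dual side. A sketch, again assuming cvxpy, which writes the quadratic term as ‖Σ_i α_i y_i x_i‖² and then recovers w from the optimal α:

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm_dual(X, y):
    """Maximize sum_i a_i - (1/2)||sum_i a_i y_i x_i||^2  s.t.  sum_i a_i y_i = 0, a_i >= 0."""
    n = X.shape[0]
    alpha = cp.Variable(n)
    # The quadratic term equals (1/2) * sum_ij a_i a_j y_i y_j <x_i, x_j>.
    objective = cp.Maximize(cp.sum(alpha)
                            - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
    constraints = [alpha >= 0, alpha @ y == 0]
    cp.Problem(objective, constraints).solve()
    a = alpha.value
    w = (a * y) @ X                 # w = sum_i alpha_i y_i x_i  (from dL/dw = 0)
    return a, w
```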

  28. Support Vector Machines
 Primal: minimize_{w,b} (1/2)‖w‖² subject to y_i [⟨x_i, w⟩ + b] ≥ 1, with solution w = Σ_i y_i α_i x_i
 Dual: maximize_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ subject to Σ_i α_i y_i = 0 and α_i ≥ 0
 slide by Alex Smola

  29. Support Vectors
 minimize_{w,b} (1/2)‖w‖² subject to y_i [⟨x_i, w⟩ + b] ≥ 1, with w = Σ_i y_i α_i x_i
 Karush-Kuhn-Tucker optimality condition: α_i [ y_i [⟨w, x_i⟩ + b] − 1 ] = 0,
 so either α_i = 0, or α_i > 0 ⇒ y_i [⟨w, x_i⟩ + b] = 1
 slide by Alex Smola
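The complementary-slackness statement can be observed directly with scikit-learn's SVC; a very large C approximates the hard-margin problem. Both the library and the toy data are assumptions here, not part of the lecture.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C ~ hard margin

print(clf.support_)       # indices i with alpha_i > 0 (the support vectors)
print(clf.dual_coef_)     # y_i * alpha_i, stored only for the support vectors
print(clf.coef_, clf.intercept_)   # w and b; each support vector satisfies y_i(w.x_i + b) = 1
```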

  30. Properties
 w = Σ_i y_i α_i x_i
 • Weight vector w is a weighted linear combination of instances
 • Only points on the margin matter (ignore the rest and get the same solution)
 • Only inner products matter
   - Quadratic program
   - We can replace the inner product by a kernel
 • Keeps instances away from the margin
 slide by Alex Smola
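Because the dual touches the data only through the inner products ⟨x_i, x_j⟩, they can be swapped for any kernel evaluation k(x_i, x_j). A sketch of an RBF Gram matrix that could stand in for the matrix of inner products in the dual sketch above (the kernel choice and gamma are illustrative):

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2), replacing <x_i, x_j>."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    return np.exp(-gamma * d2)
```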

  31. Example slide by Alex Smola

  32. Example slide by Alex Smola

  33. Why Large Margins?
 • Maximum robustness relative to uncertainty
 • Symmetry breaking
 • Independent of correctly classified instances
 • Easy to find for easy problems
 [Figure: margin of width ρ separating the two classes]
 slide by Alex Smola

  34. Watch: Patrick Winston, Support Vector Machines https://www.youtube.com/watch?v=_PwhiWxHK8o

  35. Next Lecture: Soft Margin Classification, Multi-class SVMs
