Intro to Supervised Learning
Christoph Lampert, IST Austria (Institute of Science and Technology Austria), Vienna, Austria
ENS/INRIA Summer School "Visual Recognition and Machine Learning", Paris 2013
Slides available on my home page
1 / 90


1. Loss Functions
Reminder: Δ(y, ȳ) = cost of predicting ȳ if y is correct.
Optimal decision: choose g: X → Y to minimize the expected loss
$$L_\Delta(y; x) = \sum_{\bar y \in \mathcal{Y}} p(\bar y \mid x)\,\Delta(\bar y, y) = \sum_{\bar y \neq y} p(\bar y \mid x)\,\Delta(\bar y, y) \qquad (\text{since } \Delta(y, y) = 0)$$
$$g(x) = \operatorname*{argmin}_{y \in \mathcal{Y}} L_\Delta(y; x) \qquad\text{pick the label of smallest expected loss}$$
21 / 90

2. Loss Functions
Reminder: Δ(y, ȳ) = cost of predicting ȳ if y is correct.
Optimal decision: choose g: X → Y to minimize the expected loss
$$L_\Delta(y; x) = \sum_{\bar y \in \mathcal{Y}} p(\bar y \mid x)\,\Delta(\bar y, y) = \sum_{\bar y \neq y} p(\bar y \mid x)\,\Delta(\bar y, y) \qquad (\text{since } \Delta(y, y) = 0)$$
$$g(x) = \operatorname*{argmin}_{y \in \mathcal{Y}} L_\Delta(y; x) \qquad\text{pick the label of smallest expected loss}$$
Special case: Δ(y, ȳ) = ⟦y ≠ ȳ⟧, e.g. the loss matrix
$$\begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix} \quad\text{(for 3 labels)}$$
$$g_\Delta(x) = \operatorname*{argmin}_{y\in\mathcal{Y}} L_\Delta(y; x) = \operatorname*{argmin}_{y\in\mathcal{Y}} \sum_{\bar y \neq y} p(\bar y \mid x) = \operatorname*{argmax}_{y\in\mathcal{Y}} p(y \mid x)$$
(→ Bayes classifier)
21 / 90
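Not part of the slides: a minimal numpy sketch of the optimal decision rule above, assuming the posterior p(y|x) is given as a vector and Δ as a K×K loss matrix; the helper name `bayes_decision` is hypothetical.

```python
# Minimal sketch of g(x) = argmin_y L_Delta(y; x) for a given posterior and loss matrix.
import numpy as np

def bayes_decision(posterior, Delta):
    """posterior: shape (K,), p(y|x) for K labels.
    Delta: shape (K, K), Delta[y_true, y_pred] = cost of predicting y_pred."""
    expected_loss = Delta.T @ posterior   # entry y_pred: sum_y p(y|x) * Delta[y, y_pred]
    return int(np.argmin(expected_loss))

# With the 0/1 loss from the slide, the rule reduces to argmax_y p(y|x):
Delta01 = np.ones((3, 3)) - np.eye(3)
p = np.array([0.2, 0.5, 0.3])
assert bayes_decision(p, Delta01) == int(np.argmax(p))
```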

3. Learning Paradigms
Given: training data {(x_1, y_1), ..., (x_n, y_n)} ⊂ X × Y
Approach 1) Generative Probabilistic Models
 1) Use the training data to obtain an estimate of p(x|y) for every y ∈ Y.
 2) Compute p(y|x) ∝ p(x|y) p(y).
 3) Predict using g(x) = argmin_y Σ_ȳ p(ȳ|x) Δ(ȳ, y).
Approach 2) Discriminative Probabilistic Models
 1) Use the training data to estimate p(y|x) directly.
 2) Predict using g(x) = argmin_y Σ_ȳ p(ȳ|x) Δ(ȳ, y).
Approach 3) Loss-Minimizing Parameter Estimation
 1) Use the training data to search for the best g: X → Y directly.
22 / 90

4. Generative Probabilistic Models
This is what we did in the RoboCup example!
◮ Training data X = {x_1, ..., x_n}, Y = {y_1, ..., y_n}, X × Y ⊂ X × Y.
◮ For each y ∈ Y, build a model for p(x|y) from X_y := {x_i ∈ X : y_i = y}:
◮ Histogram: if x can take only few discrete values.
◮ Kernel Density Estimator: p(x|y) ∝ Σ_{x_i ∈ X_y} k(x_i, x)
◮ Gaussian: $p(x\mid y) = \mathcal{G}(x;\mu_y,\Sigma_y) \propto \exp\!\big(-\tfrac12 (x-\mu_y)^\top \Sigma_y^{-1}(x-\mu_y)\big)$
◮ Mixture of Gaussians: $p(x\mid y) = \sum_{k=1}^{K} \pi_y^k\, \mathcal{G}(x;\mu_y^k,\Sigma_y^k)$
[Figure: class-conditional densities p(x|+1), p(x|−1) and class posteriors p(+1|x), p(−1|x)]
23 / 90

5. Generative Probabilistic Models
This is what we did in the RoboCup example!
◮ Training data X = {x_1, ..., x_n}, Y = {y_1, ..., y_n}, X × Y ⊂ X × Y.
◮ For each y ∈ Y, build a model for p(x|y) from X_y := {x_i ∈ X : y_i = y}:
◮ Histogram: if x can take only few discrete values.
◮ Kernel Density Estimator: p(x|y) ∝ Σ_{x_i ∈ X_y} k(x_i, x)
◮ Gaussian: $p(x\mid y) = \mathcal{G}(x;\mu_y,\Sigma_y) \propto \exp\!\big(-\tfrac12 (x-\mu_y)^\top \Sigma_y^{-1}(x-\mu_y)\big)$
◮ Mixture of Gaussians: $p(x\mid y) = \sum_{k=1}^{K} \pi_y^k\, \mathcal{G}(x;\mu_y^k,\Sigma_y^k)$
[Figure, left: class-conditional densities p(x|+1), p(x|−1) (Gaussian); right: class posteriors p(+1|x), p(−1|x) for p(+1) = p(−1) = 1/2]
23 / 90

6. Generative Probabilistic Models
This is what we did in the RoboCup example!
◮ Training data X = {x_1, ..., x_n}, Y = {y_1, ..., y_n}, X × Y ⊂ X × Y.
◮ For each y ∈ Y, build a model for p(x|y) from X_y := {x_i ∈ X : y_i = y}:
◮ Histogram: if x can take only few discrete values.
◮ Kernel Density Estimator: p(x|y) ∝ Σ_{x_i ∈ X_y} k(x_i, x)
◮ Gaussian: $p(x\mid y) = \mathcal{G}(x;\mu_y,\Sigma_y) \propto \exp\!\big(-\tfrac12 (x-\mu_y)^\top \Sigma_y^{-1}(x-\mu_y)\big)$
◮ Mixture of Gaussians: $p(x\mid y) = \sum_{k=1}^{K} \pi_y^k\, \mathcal{G}(x;\mu_y^k,\Sigma_y^k)$
Typically: Y small, i.e. few possible labels; X low-dimensional, e.g. RGB colors, X = R³.
But: large Y is possible with the right tools → "Intro to graphical models"
23 / 90
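A minimal sketch (not from the slides) of the Gaussian case above: fit one Gaussian per class plus a class prior, then predict via argmax_y p(x|y) p(y). The function names and the small covariance regularizer are assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_generative(X, y):
    """X: (n, d) features, y: (n,) labels. Returns per-class prior, mean, covariance."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),                                  # prior p(y=c)
                     Xc.mean(axis=0),                                   # mu_c
                     np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1]))  # Sigma_c
    return params

def predict(params, x):
    # Bayes rule: argmax_y log p(y) + log p(x|y)
    scores = {c: np.log(prior) + multivariate_normal.logpdf(x, mean=mu, cov=Sigma)
              for c, (prior, mu, Sigma) in params.items()}
    return max(scores, key=scores.get)
```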

7. Discriminative Probabilistic Models
Most popular: Logistic Regression
◮ Training data X = {x_1, ..., x_n}, Y = {y_1, ..., y_n}, X × Y ⊂ X × Y.
◮ To simplify notation: assume X = R^d, Y = {±1}.
◮ Parametric model with free parameter w ∈ R^d:
$$p(y \mid x) = \frac{1}{1 + \exp(-y\, w^\top x)}$$
[Figure: surface plot of p(+1|x) under the logistic model]
24 / 90

8. Discriminative Probabilistic Models
Most popular: Logistic Regression
◮ Training data X = {x_1, ..., x_n}, Y = {y_1, ..., y_n}, X × Y ⊂ X × Y.
◮ To simplify notation: assume X = R^d, Y = {±1}.
◮ Parametric model with free parameter w ∈ R^d:
$$p(y \mid x) = \frac{1}{1 + \exp(-y\, w^\top x)}$$
[Figure: surface plot of p(+1|x) under the logistic model]
24 / 90

9. Discriminative Probabilistic Models
Most popular: Logistic Regression
◮ Training data X = {x_1, ..., x_n}, Y = {y_1, ..., y_n}, X × Y ⊂ X × Y.
◮ To simplify notation: assume X = R^d, Y = {±1}.
◮ Parametric model with free parameter w ∈ R^d:
$$p(y \mid x) = \frac{1}{1 + \exp(-y\, w^\top x)}$$
◮ Find w by maximizing the conditional data likelihood:
$$w = \operatorname*{argmax}_{w\in\mathbb{R}^d} \prod_{i=1}^{n} p(y_i \mid x_i) = \operatorname*{argmin}_{w\in\mathbb{R}^d} \sum_{i=1}^{n} \log\big(1 + \exp(-y_i\, w^\top x_i)\big)$$
Extensions to very large Y → "Structured Outputs" (Wednesday)
24 / 90
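The slides do not specify an optimizer; below is a minimal sketch that minimizes the objective above by plain gradient descent. The helper names (`fit_logistic_regression`, `_sigmoid`) and the learning-rate settings are assumptions.

```python
import numpy as np

def _sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_regression(X, y, lr=0.1, n_steps=1000):
    """X: (n, d) features, y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        margins = y * (X @ w)                              # y_i <w, x_i>
        # gradient of (1/n) sum_i log(1 + exp(-m_i)):  -(1/n) sum_i sigmoid(-m_i) y_i x_i
        grad = -(X * (y * _sigmoid(-margins))[:, None]).sum(axis=0) / n
        w -= lr * grad
    return w

def predict_proba(w, x, y=+1):
    return _sigmoid(y * (x @ w))                           # p(y | x) as on the slide
```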

10. Loss-Minimizing Parameter Estimation
◮ Training data X = {x_1, ..., x_n}, Y = {y_1, ..., y_n}, X × Y ⊂ X × Y.
◮ Simplify: X = R^d, Y = {±1}, Δ(y, ȳ) = ⟦y ≠ ȳ⟧.
◮ Choose a hypothesis class (which classifiers do we consider?): H = {g : X → Y}, e.g. all linear classifiers.
◮ Expected loss of a classifier g: X → Y on a sample x:
$$L(g, x) = \sum_{y\in\mathcal{Y}} p(y \mid x)\,\Delta(y, g(x))$$
◮ Expected overall loss of a classifier:
$$L(g) = \sum_{x\in\mathcal{X}} p(x)\, L(g, x) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} p(x, y)\,\Delta(y, g(x)) = \mathbb{E}_{x,y}\,\Delta(y, g(x))$$
◮ Task: find the "best" g in H, i.e. g := argmin_{g∈H} L(g).
Note: for simplicity, we always write Σ_x. When X is infinite (i.e. almost always), read this as ∫_X dx.
25 / 90

  11. Rest of this Lecture Part II: H = { linear classifiers } Part III: H = { nonlinear classifiers } Part IV (if there’s time): Multi-class Classification 26 / 90

12. Notation...
◮ data points X = {x_1, ..., x_n}, x_i ∈ R^d (think: feature vectors)
◮ class labels Y = {y_1, ..., y_n}, y_i ∈ {+1, −1} (think: cat or no cat)
◮ goal: a classification rule g: R^d → {−1, +1}
27 / 90

13. Notation...
◮ data points X = {x_1, ..., x_n}, x_i ∈ R^d (think: feature vectors)
◮ class labels Y = {y_1, ..., y_n}, y_i ∈ {+1, −1} (think: cat or no cat)
◮ goal: a classification rule g: R^d → {−1, +1}
◮ parameterize g(x) = sign f(x) with f: R^d → R:
$$f(x) = a_1 x_1 + a_2 x_2 + \cdots + a_d x_d + a_0, \qquad\text{where } x_1,\ldots,x_d \text{ denote the coordinates of } x$$
◮ simplify notation: x̂ = (1, x), ŵ = (a_0, a_1, ..., a_d):
$$f(x) = \langle \hat w, \hat x \rangle \qquad\text{(inner/scalar product in } \mathbb{R}^{d+1}\text{; also written } \hat w\cdot\hat x \text{ or } \hat w^\top \hat x)$$
◮ out of laziness, we just write f(x) = ⟨w, x⟩ with x, w ∈ R^d.
27 / 90
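A tiny sketch (not from the slides) of the notational trick above: absorb the bias a_0 into the weight vector by augmenting each input with a constant 1. The helper name `augment` is hypothetical.

```python
import numpy as np

def augment(X):
    """Map each row x to x_hat = (1, x), so that <w_hat, x_hat> = a_0 + <a, x>."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

X = np.array([[1.0, 2.0], [3.0, 4.0]])
w_hat = np.array([0.5, 1.0, -1.0])          # (a_0, a_1, a_2)
f = augment(X) @ w_hat                      # f(x_i) = a_0 + a_1 x_i1 + a_2 x_i2
```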

14. Linear Classification – the classical view
Given X = {x_1, ..., x_n}, Y = {y_1, ..., y_n}.
[Figure: a two-class dataset in the plane]
28 / 90

15. Linear Classification – the classical view
Given X = {x_1, ..., x_n}, Y = {y_1, ..., y_n}.
Any w partitions the data space into two half-spaces by means of f(x) = ⟨w, x⟩.
[Figure: the dataset with a hyperplane, its normal vector w, and the regions f(x) > 0 and f(x) < 0]
28 / 90

16. Linear Classification – the classical view
Given X = {x_1, ..., x_n}, Y = {y_1, ..., y_n}.
Any w partitions the data space into two half-spaces by means of f(x) = ⟨w, x⟩.
[Figure: the dataset with a hyperplane, its normal vector w, and the regions f(x) > 0 and f(x) < 0]
"What's the best w?"
28 / 90

17. Criteria for Linear Classification
What properties should an optimal w have?
Given X = {x_1, ..., x_n}, Y = {y_1, ..., y_n}.
[Figure: two candidate hyperplanes]
Are these the best? No, they misclassify many examples.
Criterion 1: Enforce sign⟨w, x_i⟩ = y_i for i = 1, ..., n.
29 / 90

18. Criteria for Linear Classification
What properties should an optimal w have?
Given X = {x_1, ..., x_n}, Y = {y_1, ..., y_n}. What's the best w?
[Figure: two candidate hyperplanes that separate the training data]
Are these the best? No, they would be "risky" for future samples.
Criterion 2: Ensure sign⟨w, x⟩ = y for future (x, y) as well.
30 / 90

19. Criteria for Linear Classification
Given X = {x_1, ..., x_n}, Y = {y_1, ..., y_n}.
Assume that future samples are similar to current ones. What's the best w?
[Figure: a separating hyperplane with margin ρ on both sides]
Maximize "robustness": use the w such that we can maximally perturb the input samples without introducing misclassifications.
31 / 90

20. Criteria for Linear Classification
Given X = {x_1, ..., x_n}, Y = {y_1, ..., y_n}.
Assume that future samples are similar to current ones. What's the best w?
[Figure: a separating hyperplane with a margin region of width ρ on both sides]
Maximize "robustness": use the w such that we can maximally perturb the input samples without introducing misclassifications.
Central quantity: margin(x) = distance of x to the decision hyperplane = ⟨w/‖w‖, x⟩
31 / 90

21. Maximum Margin Classification
The maximum-margin solution is determined by a maximization problem:
$$\max_{w\in\mathbb{R}^d,\ \gamma\in\mathbb{R}^+} \gamma$$
subject to
$$\operatorname{sign}\langle w, x_i\rangle = y_i \quad\text{and}\quad \Big|\Big\langle \tfrac{w}{\|w\|},\, x_i \Big\rangle\Big| \ge \gamma \qquad\text{for } i = 1, \ldots, n.$$
Classify new samples using f(x) = ⟨w, x⟩.
32 / 90

22. Maximum Margin Classification
The maximum-margin solution is determined by a maximization problem:
$$\max_{w\in\mathbb{R}^d,\ \|w\|=1,\ \gamma\in\mathbb{R}} \gamma \qquad\text{subject to}\quad y_i\langle w, x_i\rangle \ge \gamma \quad\text{for } i = 1, \ldots, n.$$
Classify new samples using f(x) = ⟨w, x⟩.
33 / 90

23. Maximum Margin Classification
We can rewrite this as a minimization problem:
$$\min_{w\in\mathbb{R}^d} \|w\|^2 \qquad\text{subject to}\quad y_i\langle w, x_i\rangle \ge 1 \quad\text{for } i = 1, \ldots, n.$$
Classify new samples using f(x) = ⟨w, x⟩.
Maximum Margin Classifier (MMC)
34 / 90

24. Maximum Margin Classification
From the viewpoint of optimization theory,
$$\min_{w\in\mathbb{R}^d} \|w\|^2 \qquad\text{subject to}\quad y_i\langle w, x_i\rangle \ge 1 \quad\text{for } i = 1, \ldots, n$$
is rather easy:
◮ The objective function is differentiable and convex.
◮ The constraints are all linear.
◮ We can find the globally optimal w in O(d³) (usually much faster).
◮ There are no local minima.
◮ We have a definite stopping criterion.
35 / 90
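As an illustration of "rather easy": the problem above can be handed to any off-the-shelf constrained solver. The sketch below (an assumption, not the solver the slides refer to) uses scipy's generic SLSQP method and assumes the data is linearly separable.

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_classifier(X, y):
    """Solve min ||w||^2 s.t. y_i <w, x_i> >= 1.  X: (n, d) separable data, y in {-1,+1}."""
    d = X.shape[1]
    result = minimize(
        fun=lambda w: w @ w,                           # objective ||w||^2
        x0=np.zeros(d),
        jac=lambda w: 2 * w,
        constraints=[{"type": "ineq",                  # y_i <w, x_i> - 1 >= 0
                      "fun": lambda w: y * (X @ w) - 1}],
        method="SLSQP",
    )
    return result.x

# Classify new samples with sign <w, x>:
# w = hard_margin_classifier(X, y);  y_pred = np.sign(X_test @ w)
```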

25. Linear Separability
What is the best w for this dataset?
[Figure: a two-class dataset in the plane]
36 / 90

26. Linear Separability
What is the best w for this dataset?
[Figure: a hyperplane with margin ρ; one sample x_i lies on the wrong side, with margin violation ξ_i]
Possibly this one, even though one sample is misclassified.
37 / 90

27. Linear Separability
What is the best w for this dataset?
[Figure: a two-class dataset in the plane]
38 / 90

28. Linear Separability
What is the best w for this dataset?
[Figure: a hyperplane that classifies all training samples correctly]
Maybe not this one, even though all points are classified correctly.
39 / 90

29. Linear Separability
What is the best w for this dataset?
[Figure: a hyperplane with margin ρ and a margin violation ξ_i]
Trade-off: large margin vs. few mistakes on the training set
40 / 90

30. Soft-Margin Classification
Mathematically, we formulate the trade-off by slack variables ξ_i:
$$\min_{w\in\mathbb{R}^d,\ \xi_i\in\mathbb{R}^+} \|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i$$
subject to
$$y_i\langle w, x_i\rangle \ge 1 - \xi_i \quad\text{and}\quad \xi_i \ge 0 \qquad\text{for } i = 1,\ldots,n.$$
Linear Support Vector Machine (linear SVM)
41 / 90

31. Soft-Margin Classification
Mathematically, we formulate the trade-off by slack variables ξ_i:
$$\min_{w\in\mathbb{R}^d,\ \xi_i\in\mathbb{R}^+} \|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i$$
subject to
$$y_i\langle w, x_i\rangle \ge 1 - \xi_i \quad\text{and}\quad \xi_i \ge 0 \qquad\text{for } i = 1,\ldots,n.$$
Linear Support Vector Machine (linear SVM)
◮ We can fulfill every constraint by choosing ξ_i large enough.
◮ The larger ξ_i, the larger the objective (that we try to minimize).
◮ C is a regularization/trade-off parameter:
  ◮ small C → constraints are easily ignored
  ◮ large C → constraints are hard to ignore
  ◮ C = ∞ → hard-margin case → no errors on the training set
◮ Note: the problem is still convex and efficiently solvable.
41 / 90

32. Solving for Soft-Margin Solution
Reformulate:
$$\min_{w\in\mathbb{R}^d,\ \xi_i\in\mathbb{R}^+} \|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i$$
subject to y_i⟨w, x_i⟩ ≥ 1 − ξ_i and ξ_i ≥ 0 for i = 1, ..., n.
We can read off the optimal values ξ_i = max{0, 1 − y_i⟨w, x_i⟩}.
Equivalent optimization problem (with λ = 1/C):
$$\min_{w\in\mathbb{R}^d}\ \lambda\|w\|^2 + \frac1n\sum_{i=1}^{n}\max\{0,\, 1 - y_i\langle w, x_i\rangle\}$$
◮ Now an unconstrained optimization problem, but non-differentiable.
◮ Solve efficiently, e.g., by the subgradient method → "Large-scale visual recognition" (Thursday)
42 / 90
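A minimal sketch of the subgradient method the slide mentions, applied to λ‖w‖² + (1/n) Σ_i max{0, 1 − y_i⟨w, x_i⟩}. The step-size schedule and default parameters are assumptions, not part of the slides.

```python
import numpy as np

def linear_svm_subgradient(X, y, lam=0.01, lr=0.1, n_steps=1000):
    """X: (n, d) features, y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_steps + 1):
        margins = y * (X @ w)
        active = margins < 1                               # samples with non-zero hinge loss
        # subgradient: 2*lam*w - (1/n) * sum over active samples of y_i x_i
        g = 2 * lam * w - (X[active] * y[active][:, None]).sum(axis=0) / n
        w -= (lr / t) * g                                  # decaying step size
    return w
```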

33. Linear SVMs in Practice
Efficient software packages:
◮ liblinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
◮ SVMperf: http://www.cs.cornell.edu/People/tj/svm_light/svm_perf.html
◮ see also: Pegasos, http://www.cs.huji.ac.il/~shais/code/
◮ see also: sgd, http://leon.bottou.org/projects/sgd
Training time:
◮ approximately linear in the data dimensionality
◮ approximately linear in the number of training examples
Evaluation time (per test example):
◮ linear in the data dimensionality
◮ independent of the number of training examples
Linear SVMs are currently the most frequently used classifiers in Computer Vision.
43 / 90
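Hedged usage example, not from the slides: scikit-learn's LinearSVC is built on liblinear, the first package in the list above; the toy data here is made up for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + np.array([[2.0, 0.0]] * 100 + [[-2.0, 0.0]] * 100)
y = np.array([+1] * 100 + [-1] * 100)

clf = LinearSVC(C=1.0)                 # liblinear-based linear SVM (squared hinge loss by default)
clf.fit(X, y)
print(clf.coef_, clf.intercept_, clf.score(X, y))
```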

34. Linear Classification – the modern view
Geometric intuition is nice, but are there any guarantees?
◮ The SVM solution is g(x) = sign f(x) for f(x) = ⟨w, x⟩ with
$$w = \operatorname*{argmin}_{w\in\mathbb{R}^d}\ \lambda\|w\|^2 + \frac1n\sum_{i=1}^{n}\max\{0,\, 1 - y_i\langle w, x_i\rangle\}$$
◮ What we really wanted to minimize is the expected loss
$$g = \operatorname*{argmin}_{g\in\mathcal{H}}\ \mathbb{E}_{x,y}\,\Delta(y, g(x))$$
with H = {g(x) = sign f(x) | f(x) = ⟨w, x⟩ for w ∈ R^d}.
What's the relation?
44 / 90

38. Linear Classification – the modern view
SVM training is an example of Regularized Risk Minimization. General form:
$$\min_{f\in\mathcal{F}}\ \underbrace{\Omega(f)}_{\text{regularizer}}\ +\ \underbrace{\frac1n\sum_{i=1}^{n}\ell(y_i, f(x_i))}_{\text{loss on the training set: `risk'}}$$
Support Vector Machine:
$$\min_{w\in\mathbb{R}^d}\ \lambda\|w\|^2 + \frac1n\sum_{i=1}^{n}\max\{0,\, 1 - y_i\langle w, x_i\rangle\}$$
◮ F = {f(x) = ⟨w, x⟩ | w ∈ R^d}
◮ Ω(f) = ‖w‖² for any f(x) = ⟨w, x⟩
◮ ℓ(y, f(x)) = max{0, 1 − y f(x)} (hinge loss)
45 / 90

39. Linear Classification – the modern view: the loss term
Observation 1: The empirical loss approximates the expected loss.
For i.i.d. training examples (x_1, y_1), ..., (x_n, y_n):
$$\mathbb{E}_{x,y}\,\Delta(y, g(x)) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} p(x, y)\,\Delta(y, g(x)) \approx \frac1n\sum_{i=1}^{n}\Delta(y_i, g(x_i))$$
46 / 90

40. Linear Classification – the modern view: the loss term
Observation 1: The empirical loss approximates the expected loss.
For i.i.d. training examples (x_1, y_1), ..., (x_n, y_n):
$$\mathbb{E}_{x,y}\,\Delta(y, g(x)) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} p(x, y)\,\Delta(y, g(x)) \approx \frac1n\sum_{i=1}^{n}\Delta(y_i, g(x_i))$$
Observation 2: The hinge loss upper-bounds the 0/1 loss.
For Δ(y, ȳ) = ⟦y ≠ ȳ⟧ and g(x) = sign⟨w, x⟩ one has
$$\Delta(y, g(x)) = \llbracket\, y\langle w, x\rangle < 0\,\rrbracket \le \max\{0,\, 1 - y\langle w, x\rangle\}$$
46 / 90

41. Linear Classification – the modern view: the loss term
Observation 1: The empirical loss approximates the expected loss.
For i.i.d. training examples (x_1, y_1), ..., (x_n, y_n):
$$\mathbb{E}_{x,y}\,\Delta(y, g(x)) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} p(x, y)\,\Delta(y, g(x)) \approx \frac1n\sum_{i=1}^{n}\Delta(y_i, g(x_i))$$
Observation 2: The hinge loss upper-bounds the 0/1 loss.
For Δ(y, ȳ) = ⟦y ≠ ȳ⟧ and g(x) = sign⟨w, x⟩ one has
$$\Delta(y, g(x)) = \llbracket\, y\langle w, x\rangle < 0\,\rrbracket \le \max\{0,\, 1 - y\langle w, x\rangle\}$$
Combination:
$$\mathbb{E}_{x,y}\,\Delta(y, g(x)) \lesssim \frac1n\sum_{i}\max\{0,\, 1 - y_i\langle w, x_i\rangle\}$$
Intuition: a small "risk" term in the SVM → few mistakes in the future
46 / 90

42. Linear Classification – the modern view: the regularizer
Observation 3: Minimizing only the loss term can lead to overfitting.
We want classifiers that have small loss, but are simple enough to generalize.
47 / 90

43. Linear Classification – the modern view: the regularizer
Ad-hoc definition: a function f: R^d → R is simple if it is not very sensitive to the exact input.
[Figure: two functions with small and large slope dy/dx]
Sensitivity is measured by the slope f′. For linear f(x) = ⟨w, x⟩, the slope is ‖∇_x f‖ = ‖w‖:
minimizing ‖w‖² encourages "simple" functions.
Formal results, including proper bounds on the generalization error: e.g. [Shawe-Taylor, Cristianini, "Kernel Methods for Pattern Analysis", Cambridge University Press, 2004]
48 / 90

44. Other classifiers based on Regularized Risk Minimization
There are many other RRM-based classifiers, including variants of the SVM:
L1-regularized Linear SVM
$$\min_{w\in\mathbb{R}^d}\ \lambda\|w\|_{L^1} + \frac1n\sum_{i=1}^{n}\max\{0,\, 1 - y_i\langle w, x_i\rangle\}$$
‖w‖_{L¹} = Σ_{j=1}^d |w_j| encourages sparsity:
◮ the learned weight vector w will have many zero entries
◮ acts as a feature selector
◮ evaluating f(x) = ⟨w, x⟩ becomes more efficient
Use it if you have prior knowledge that the optimal classifier should be sparse.
49 / 90

45. Other classifiers based on Regularized Risk Minimization
SVM with squared slacks / squared hinge loss
$$\min_{w\in\mathbb{R}^d}\ \lambda\|w\|^2 + \frac1n\sum_{i=1}^{n}\xi_i^2 \qquad\text{subject to}\quad y_i\langle w, x_i\rangle \ge 1 - \xi_i \ \text{ and }\ \xi_i \ge 0.$$
Equivalently:
$$\min_{w\in\mathbb{R}^d}\ \lambda\|w\|^2 + \frac1n\sum_{i=1}^{n}\big(\max\{0,\, 1 - y_i\langle w, x_i\rangle\}\big)^2$$
Also has a max-margin interpretation, but the objective is once differentiable.
50 / 90

46. Other classifiers based on Regularized Risk Minimization
Least-Squares SVM, aka Ridge Regression
$$\min_{w\in\mathbb{R}^d}\ \lambda\|w\|^2 + \frac1n\sum_{i=1}^{n}\big(1 - y_i\langle w, x_i\rangle\big)^2$$
Loss function: ℓ(y, f(x)) = (y − f(x))² ("squared loss")
◮ Easier to optimize than the regular SVM: closed-form solution for w,
$$w = X^\top\big(\lambda\,\mathrm{Id} + XX^\top\big)^{-1} y,$$
where X is the matrix with the samples x_i as rows and y is the vector of labels.
◮ But: the loss does not really reflect classification: ℓ(y, f(x)) can be big even if sign f(x) = y.
51 / 90
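A minimal numpy sketch (not from the slides) of the closed-form solution above. The factor n in front of λ accounts for the 1/n in the loss term; the slide's formula absorbs it into λ.

```python
import numpy as np

def ridge_classifier(X, y, lam=0.1):
    """X: (n, d) features, y: (n,) labels in {-1, +1}.
    Minimizes lam*||w||^2 + (1/n) * sum_i (1 - y_i <w, x_i>)^2 in closed form."""
    n = X.shape[0]
    K = X @ X.T                                          # n x n matrix of inner products
    alpha = np.linalg.solve(lam * n * np.eye(n) + K, y)
    return X.T @ alpha                                   # w = X^T (lam*n*Id + X X^T)^{-1} y

# Prediction, as for the other linear classifiers: np.sign(X_test @ ridge_classifier(X, y))
```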

47. Other classifiers based on Regularized Risk Minimization
Regularized Logistic Regression
$$\min_{w\in\mathbb{R}^d}\ \lambda\|w\|^2 + \frac1n\sum_{i=1}^{n}\log\big(1 + \exp(-y_i\langle w, x_i\rangle)\big)$$
Loss function: ℓ(y, f(x)) = log(1 + exp(−y f(x))) ("logistic loss")
◮ Smooth (C^∞-differentiable) objective
◮ Often gives results similar to the SVM
52 / 90

48. Summary – Linear Classifiers
(Linear) Support Vector Machines
◮ geometric intuition: maximum-margin classifier
◮ well-understood theory: regularized risk minimization
[Figure: 0/1 loss, hinge loss, squared hinge loss, squared loss, and logistic loss as functions of the margin y·f(x)]
53 / 90
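A minimal sketch (not from the slides) that evaluates the loss functions compared in the figure as functions of the margin m = y·f(x); plotting them, e.g. with matplotlib, reproduces the comparison.

```python
import numpy as np

m = np.linspace(-2.0, 2.0, 401)                   # margin values y * f(x)
losses = {
    "0-1 loss":           (m < 0).astype(float),
    "hinge loss":         np.maximum(0.0, 1.0 - m),
    "squared hinge loss": np.maximum(0.0, 1.0 - m) ** 2,
    "squared loss":       (1.0 - m) ** 2,
    "logistic loss":      np.log1p(np.exp(-m)),
}
# The hinge, squared hinge, and squared losses upper-bound the 0-1 loss;
# the logistic loss does so up to a constant scaling (divide by log 2).
```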

49. Summary – Linear Classifiers
(Linear) Support Vector Machines
◮ geometric intuition: maximum-margin classifier
◮ well-understood theory: regularized risk minimization
Many variants of losses and regularizers:
◮ first: try Ω(·) = ‖·‖²
◮ encourage sparsity: Ω(·) = ‖·‖_{L¹}
◮ differentiable losses: easier numeric optimization
[Figure: loss comparison as on the previous slide]
53 / 90

50. Summary – Linear Classifiers
(Linear) Support Vector Machines
◮ geometric intuition: maximum-margin classifier
◮ well-understood theory: regularized risk minimization
Many variants of losses and regularizers:
◮ first: try Ω(·) = ‖·‖²
◮ encourage sparsity: Ω(·) = ‖·‖_{L¹}
◮ differentiable losses: easier numeric optimization
[Figure: loss comparison as on the previous slide]
Fun fact: different losses often have similar empirical performance
◮ don't blindly believe claims like "My classifier is the best."
53 / 90

  51. Nonlinear Classification 54 / 90

52. Nonlinear Classification
What is the best linear classifier for this dataset?
[Figure: a two-class dataset in the plane that is not linearly separable]
None. We need something nonlinear!
55 / 90

53. Nonlinear Classification
Idea 1) Combine multiple linear classifiers into a nonlinear classifier.
[Figure: outputs σ(f_1(x)), ..., σ(f_4(x)) of several linear classifiers combined by a further unit σ(f_5(x))]
56 / 90

54. Nonlinear Classification: Boosting
Boosting
Situation:
◮ we have many simple classifiers (typically linear), h_1, ..., h_k: X → {±1}
◮ none of them is particularly good
Method:
◮ construct a stronger nonlinear classifier: g(x) = sign Σ_j α_j h_j(x) with α_j ∈ R
◮ typically: iterative construction for finding α_1, α_2, ...
Advantage:
◮ very easy to implement
Disadvantage:
◮ computationally expensive to train
◮ finding base classifiers can be hard
57 / 90
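A minimal AdaBoost-style sketch (an assumption: AdaBoost is one common instance of the iterative construction mentioned above, not necessarily the one the slides have in mind), using axis-aligned decision stumps as the simple base classifiers h_j.

```python
import numpy as np

def fit_stump(X, y, weights):
    """Find the stump h(x) = s * sign(x_j - t) with lowest weighted training error."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.sign(X[:, j] - t)
                pred[pred == 0] = s
                err = weights[pred != y].sum()
                if err < best[0]:
                    best = (err, (j, t, s))
    return best

def adaboost(X, y, n_rounds=20):
    n = X.shape[0]
    weights = np.full(n, 1.0 / n)
    stumps, alphas = [], []
    for _ in range(n_rounds):
        err, (j, t, s) = fit_stump(X, y, weights)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)         # weight of this base classifier
        pred = s * np.sign(X[:, j] - t)
        pred[pred == 0] = s
        weights *= np.exp(-alpha * y * pred)          # upweight misclassified samples
        weights /= weights.sum()
        stumps.append((j, t, s))
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    """g(x) = sign sum_j alpha_j h_j(x)"""
    score = np.zeros(X.shape[0])
    for (j, t, s), alpha in zip(stumps, alphas):
        pred = s * np.sign(X[:, j] - t)
        pred[pred == 0] = s
        score += alpha * pred
    return np.sign(score)
```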

55. Nonlinear Classification: Decision Trees
Decision Trees
[Figure: a tree that first tests f_1(x), then f_2(x) or f_3(x) depending on the sign, and outputs a label y ∈ {1, 2, 3} at each leaf]
Advantage:
◮ easy to interpret
◮ handles the multi-class situation
Disadvantage:
◮ by themselves, typically worse results than other modern methods
[Breiman, Friedman, Olshen, Stone, "Classification and Regression Trees", 1984]
58 / 90

56. Nonlinear Classification: Random Forest
Random Forest
[Figure: several randomly constructed decision trees whose outputs are combined by voting]
Method:
◮ construct many decision trees randomly (under some constraints)
◮ classify using a majority vote
Advantage:
◮ conceptually easy
◮ works surprisingly well
Disadvantage:
◮ computationally expensive to train
◮ expensive at test time if the forest has many trees
[Breiman, "Random Forests", 2001]
59 / 90
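Hedged usage example, not from the slides: scikit-learn provides standard implementations of both tree-based methods above; `X_train`, `y_train`, and `X_test` are placeholder names for your own data.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

tree = DecisionTreeClassifier(max_depth=3)           # a single, easily interpretable tree
forest = RandomForestClassifier(n_estimators=100)    # 100 randomized trees, votes combined

# tree.fit(X_train, y_train); forest.fit(X_train, y_train)
# y_pred = forest.predict(X_test)
```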

57. Nonlinear Classification: Neural Networks
Artificial Neural Network / Multilayer Perceptron / Deep Learning
[Figure: multilayer network; each unit computes σ(f_i(x)) with f_i(x) = ⟨w_i, x⟩ and a nonlinearity σ]
Multi-layer architecture:
◮ first layer: inputs x
◮ each layer k evaluates f_1^k, ..., f_m^k and feeds its output to the next layer
◮ last layer: output y
Advantage:
◮ biologically inspired → easy to explain to non-experts
◮ efficient at evaluation time
Disadvantage:
◮ non-convex optimization problem
◮ many design parameters, few theoretical results
→ "Deep Learning" (Tuesday)
[Rumelhart, Hinton, Williams, "Learning Internal Representations by Error Propagation", 1986]
60 / 90

58. Nonlinearity: Data Preprocessing
Idea 2) Preprocess the data
[Figure, top: the dataset in Cartesian coordinates (x, y); bottom: the same dataset in polar coordinates (r, θ)]
In Cartesian coordinates this dataset is not linearly separable; in polar coordinates it is separable. But both are the same dataset!
61 / 90

59. Nonlinearity: Data Preprocessing
Idea 2) Preprocess the data
[Figure, top: nonlinear separation in Cartesian coordinates (x, y); bottom: linear separation in polar coordinates (r, θ)]
A linear classifier in polar space acts nonlinearly in Cartesian space.
62 / 90

60. Generalized Linear Classifier
Given:
◮ X = {x_1, ..., x_n}, Y = {y_1, ..., y_n},
◮ any (nonlinear) feature map φ: R^k → R^m.
Solve the minimization for φ(x_1), ..., φ(x_n) instead of x_1, ..., x_n:
$$\min_{w\in\mathbb{R}^m,\ \xi_i\in\mathbb{R}^+} \|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i \qquad\text{subject to}\quad y_i\langle w, \varphi(x_i)\rangle \ge 1 - \xi_i \quad\text{for } i = 1,\ldots,n.$$
◮ The weight vector w now comes from the target space R^m.
◮ Distances/angles are measured by the inner product ⟨·,·⟩ in R^m.
◮ The classifier f(x) = ⟨w, φ(x)⟩ is linear in w, but nonlinear in x.
63 / 90

61. Example Feature Mappings
◮ Polar coordinates:
$$\varphi: \begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} \sqrt{x^2 + y^2} \\ \angle(x, y) \end{pmatrix}$$
◮ d-th degree polynomials:
$$\varphi: (x_1, \ldots, x_n) \mapsto \big(1,\ x_1, \ldots, x_n,\ x_1^2, \ldots, x_n^2,\ \ldots,\ x_1^d, \ldots, x_n^d\big)$$
◮ Distance map:
$$\varphi: \vec x \mapsto \big(\|\vec x - \vec p_1\|,\ \ldots,\ \|\vec x - \vec p_N\|\big)$$
for a set of N prototype vectors p_i, i = 1, ..., N.
64 / 90
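A minimal sketch (not from the slides) of two of the feature maps above; training then means running any linear classifier, e.g. the earlier linear SVM sketch, on φ(x) instead of x.

```python
import numpy as np

def phi_polar(X):
    """Polar-coordinate map for 2-D inputs: (x, y) -> (r, theta)."""
    r = np.linalg.norm(X, axis=1)
    theta = np.arctan2(X[:, 1], X[:, 0])
    return np.column_stack([r, theta])

def phi_distance(X, prototypes):
    """Distance map: phi(x) = (||x - p_1||, ..., ||x - p_N||), prototypes: (N, d)."""
    return np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)

# e.g. w = linear_svm_subgradient(phi_polar(X), y);  prediction: np.sign(phi_polar(X_new) @ w)
```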

62. Representer Theorem
Solve the soft-margin minimization for φ(x_1), ..., φ(x_n) ∈ R^m:
$$\min_{w\in\mathbb{R}^m,\ \xi_i\in\mathbb{R}^+} \|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i \qquad (1)$$
subject to y_i⟨w, φ(x_i)⟩ ≥ 1 − ξ_i for i = 1, ..., n.
For large m, won't solving for w ∈ R^m become impossible?
65 / 90

63. Representer Theorem
Solve the soft-margin minimization for φ(x_1), ..., φ(x_n) ∈ R^m:
$$\min_{w\in\mathbb{R}^m,\ \xi_i\in\mathbb{R}^+} \|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i \qquad (1)$$
subject to y_i⟨w, φ(x_i)⟩ ≥ 1 − ξ_i for i = 1, ..., n.
For large m, won't solving for w ∈ R^m become impossible? No!
Theorem (Representer Theorem). The minimizing solution w of problem (1) can always be written as
$$w = \sum_{j=1}^{n} \alpha_j\, \varphi(x_j) \qquad\text{for coefficients } \alpha_1, \ldots, \alpha_n \in \mathbb{R}.$$
[Schölkopf, Smola, "Learning with Kernels", 2001]
65 / 90

64. Kernel Trick
Rewrite the optimization using the Representer Theorem:
◮ insert w = Σ_{j=1}^n α_j φ(x_j) everywhere,
◮ minimize over the α_i instead of w.
$$\min_{w\in\mathbb{R}^m,\ \xi_i\in\mathbb{R}^+} \|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i \qquad\text{subject to}\quad y_i\langle w, \varphi(x_i)\rangle \ge 1 - \xi_i \quad\text{for } i = 1,\ldots,n.$$
66 / 90

65. Kernel Trick
Rewrite the optimization using the Representer Theorem:
◮ insert w = Σ_{j=1}^n α_j φ(x_j) everywhere,
◮ minimize over the α_i instead of w.
$$\min_{\alpha_i\in\mathbb{R},\ \xi_i\in\mathbb{R}^+} \Big\|\sum_{j=1}^{n}\alpha_j\,\varphi(x_j)\Big\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i$$
subject to
$$y_i \sum_{j=1}^{n}\alpha_j\,\langle\varphi(x_j), \varphi(x_i)\rangle \ge 1 - \xi_i \qquad\text{for } i = 1,\ldots,n.$$
The former m-dimensional optimization is now n-dimensional.
67 / 90
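A minimal sketch (not from the slides) of the quantities after the rewrite: the feature map φ enters only through inner products K[i, j] = ⟨φ(x_i), φ(x_j)⟩, so a kernel function suffices and prediction becomes f(x) = Σ_j α_j k(x_j, x). The RBF kernel here is used purely as an illustrative example, since the slides keep the inner products explicit.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def objective_and_constraints(alpha, xi, K, y, C):
    """Kernelized soft-margin quantities for w = sum_j alpha_j phi(x_j)."""
    n = len(y)
    obj = alpha @ K @ alpha + C / n * xi.sum()      # ||w||^2 = alpha^T K alpha
    constraints = y * (K @ alpha) - (1 - xi)        # must be >= 0 elementwise
    return obj, constraints

def decision_function(alpha, X_train, X_test, gamma=1.0):
    return rbf_kernel(X_test, X_train, gamma) @ alpha   # f(x) = sum_j alpha_j k(x_j, x)
```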
