Statistical Machine Learning, Lecture 11: Support Vector Machines

SLIDE 1

Statistical Machine Learning

Lecture 11: Support Vector Machines

Kristian Kersting TU Darmstadt

Summer Term 2020

  • K. Kersting based on Slides from J. Peters· Statistical Machine Learning· Summer Term 2020

SLIDE 2

Today’s Objectives

Covered Topics

  • Linear Support Vector Classification
  • Features and Kernels
  • Non-Linear Support Vector Classification
  • Outlook on Applications, Relevance Vector Machines and Support Vector Regression

SLIDE 3

Outline

  • 1. From Structural Risk Minimization to Linear SVMs
  • 2. Nonlinear SVMs
  • 3. Applications
  • 4. Wrap-Up
SLIDE 4
  • 1. From Structural Risk Minimization to Linear SVMs

Outline

  • 1. From Structural Risk Minimization to Linear SVMs
  • 2. Nonlinear SVMs
  • 3. Applications
  • 4. Wrap-Up
SLIDE 5
  • 1. From Structural Risk Minimization to Linear SVMs

Structural Risk Minimization

How can we implement structural risk minimization?

$R(w) \le R_{emp}(w) + \epsilon(N, p^*, h)$

where $N$ is the number of training examples, $p^*$ is the probability that the bound is met, and $h$ is the VC-dimension.

Classical machine learning algorithms
  • Keep $\epsilon(N, p^*, h)$ constant and minimize $R_{emp}(w)$
  • $\epsilon(N, p^*, h)$ is fixed by keeping some model parameters fixed, e.g. the number of hidden neurons in a neural network (see later)

Support Vector Machines (SVMs)
  • Keep $R_{emp}(w)$ constant and minimize $\epsilon(N, p^*, h)$
  • In practice $R_{emp}(w) = 0$ with separable data
  • $\epsilon(N, p^*, h)$ is controlled by changing the VC-dimension ("capacity control")

SLIDE 6
  • 1. From Structural Risk Minimization to Linear SVMs

Support Vector Machines

  • Linear classifiers (generalized later)
  • Approximate implementation of the structural risk minimization principle
  • If the data is linearly separable, the empirical risk of SVM classifiers will be zero, and the risk bound will be approximately minimized
  • SVMs have built-in "guaranteed" generalization abilities

SLIDE 7
  • 1. From Structural Risk Minimization to Linear SVMs

Support Vector Machines

For now assume linearly separable data: $N$ training data points $\{x_i, y_i\}_{i=1}^N$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$

Hyperplane that separates the data: $y(x) = w^\top x + b$

[Figure: the hyperplane $y(x) = 0$ splits the input space into the regions $y > 0$ and $y < 0$; the signed distance of a point $x$ to the hyperplane is $y(x)/\|w\|$, and the hyperplane's offset from the origin is $-w_0/\|w\|$]

Which hyperplane shall we use? How can we minimize the VC dimension?
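As a minimal sketch of this decision rule (the numbers for $w$, $b$ and the points are made up for illustration; they are not from the lecture):

```python
import numpy as np

# Hypothetical hyperplane parameters, chosen only for illustration
w = np.array([2.0, -1.0])
b = 0.5

X = np.array([[1.0, 0.0],    # y(x) =  2.5 -> region y > 0, class +1
              [0.0, 2.0],    # y(x) = -1.5 -> region y < 0, class -1
              [-1.0, -1.0]]) # y(x) = -0.5 -> region y < 0, class -1

y_x = X @ w + b                    # y(x) = w^T x + b for every row of X
labels = np.sign(y_x)              # classify by the sign of y(x)
dist = y_x / np.linalg.norm(w)     # signed distance to the hyperplane
print(labels, dist)
```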

SLIDE 8
  • 1. From Structural Risk Minimization to Linear SVMs

Support Vector Machines

Intuitively: We should find the hyperplane with the maximum “distance” to the data

SLIDE 9
  • 1. From Structural Risk Minimization to Linear SVMs

Support Vector Machines

Maximizing the margin

Why does that make sense? Why does it minimize the VC dimension?

Key result (from Vapnik)

If the data points lie in a sphere of radius $R$, i.e. $\|x_i\| < R$, and the margin of the linear classifier in $d$ dimensions is $\gamma$, then

$h \le \min\left(d, \frac{4R^2}{\gamma^2}\right)$
  • Maximizing the margin lowers a bound on the VC-dimension!
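For a purely illustrative set of numbers (not from the slides): with $R = 1$ and margin $\gamma = 0.1$, we get $4R^2/\gamma^2 = 4/0.01 = 400$, so even in $d = 10^6$ dimensions the bound gives $h \le \min(10^6, 400) = 400$; increasing the margin to $\gamma = 0.5$ tightens it to $h \le 16$.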
SLIDE 10
  • 1. From Structural Risk Minimization to Linear SVMs

Support Vector Machines

Find a hyperplane so that the data is linearly separated: $y_i(w^\top x_i + b) \ge 1 \;\forall i$

Enforce $y_i(w^\top x_i + b) = 1$ for at least one data point

SLIDE 11
  • 1. From Structural Risk Minimization to Linear SVMs

Support Vector Machines

[Figure: the same hyperplane geometry as before, with $y(x)/\|w\|$ the signed distance of $x$ to the plane $y(x) = 0$]

We can easily express the margin. The distance of $x_i$ to the hyperplane is

$\frac{y(x_i)}{\|w\|} = \frac{w^\top x_i + b}{\|w\|}$

(Note: in the figure $b = w_0$)

Since the closest points satisfy $y_i(w^\top x_i + b) = 1$, the margin is $\frac{1}{\|w\|}$

SLIDE 12
  • 1. From Structural Risk Minimization to Linear SVMs

Support Vector Machines

[Figure: maximum-margin hyperplane with the margin lines $y = 1$, $y = 0$, $y = -1$]

Support vectors: all points that lie on the margin, i.e., $y_i(w^\top x_i + b) = 1$

SLIDE 13
  • 1. From Structural Risk Minimization to Linear SVMs

Support Vector Machines

Maximizing the margin $1/\|w\|$ is equivalent to minimizing $\|w\|^2$

Formulate as a constrained optimization problem:

$\arg\min_{w,b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^\top x_i + b) - 1 \ge 0 \;\forall i$

Lagrangian formulation:

$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^N \alpha_i \left( y_i(w^\top x_i + b) - 1 \right)$
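A minimal sketch of solving this primal problem with a generic constrained optimizer (this is not the lecture's solver; the toy data, SciPy's SLSQP method, and all variable names are my own choices):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(theta):
    w = theta[:2]                      # theta = [w1, w2, b]
    return 0.5 * np.dot(w, w)          # (1/2) ||w||^2

constraints = [
    # y_i (w^T x_i + b) - 1 >= 0 for every training point
    {"type": "ineq", "fun": lambda theta, xi=xi, yi=yi: yi * (xi @ theta[:2] + theta[2]) - 1.0}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))
```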

SLIDE 14
  • 1. From Structural Risk Minimization to Linear SVMs

Support Vector Machines

$\min_{w,b} L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^N \alpha_i \left( y_i(w^\top x_i + b) - 1 \right)$

$\frac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^N \alpha_i y_i = 0$

$\frac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^N \alpha_i y_i x_i$

The separating hyperplane is a linear combination of the input data. But what are the $\alpha_i$?

SLIDE 15
  • 1. From Structural Risk Minimization to Linear SVMs

Sparsity

Important property

  • Almost all the $\alpha_i$ are zero
  • There are only a few support vectors

[Figure: margin lines $y = 1$, $y = 0$, $y = -1$; the support vectors lie on the margin]

But the hyperplane was written as $w = \sum_{i=1}^N \alpha_i y_i x_i$

SVMs are sparse learning machines

The classifier only depends on a few data points
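A small illustration with scikit-learn, assuming a linear SVC on a made-up two-blob dataset; the attributes used (`support_vectors_`, `dual_coef_`, `coef_`) are scikit-learn's, everything else is illustrative:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two illustrative Gaussian blobs, one per class
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1e3).fit(X, y)     # large C behaves close to a hard margin

print("support vectors:", clf.support_vectors_.shape[0], "of", X.shape[0])
print("alpha_i * y_i:", clf.dual_coef_)          # nonzero only for support vectors
w = clf.coef_[0]                                 # w = sum_i alpha_i y_i x_i
print("margin 1/||w|| =", 1.0 / np.linalg.norm(w))
```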

SLIDE 16
  • 1. From Structural Risk Minimization to Linear SVMs

Dual Form

Let us rewrite the Lagrangian:

$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^N \alpha_i \left( y_i(w^\top x_i + b) - 1 \right) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^N \alpha_i y_i w^\top x_i - \sum_{i=1}^N \alpha_i y_i b + \sum_{i=1}^N \alpha_i$

We know that $\sum_{i=1}^N \alpha_i y_i = 0$. Hence we have

$\hat{L}(w, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^N \alpha_i y_i w^\top x_i + \sum_{i=1}^N \alpha_i$

SLIDE 17
  • 1. From Structural Risk Minimization to Linear SVMs

Dual Form

$\hat{L}(w, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^N \alpha_i y_i w^\top x_i + \sum_{i=1}^N \alpha_i$

Use the constraint $w = \sum_{i=1}^N \alpha_i y_i x_i$:

$\hat{L}(w, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^N \alpha_i y_i \sum_{j=1}^N \alpha_j y_j\, x_j^\top x_i + \sum_{i=1}^N \alpha_i = \frac{1}{2}\|w\|^2 - \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \left( x_j^\top x_i \right) + \sum_{i=1}^N \alpha_i$

SLIDE 18
  • 1. From Structural Risk Minimization to Linear SVMs

Dual Form

We also have

$\frac{1}{2}\|w\|^2 = \frac{1}{2} w^\top w = \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \left( x_j^\top x_i \right)$

Finally we obtain the Wolfe dual formulation:

$\tilde{L}(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \left( x_j^\top x_i \right)$

We can now solve the original problem by maximizing the dual function $\tilde{L}$

SLIDE 19
  • 1. From Structural Risk Minimization to Linear SVMs

Support Vector Machines - Dual Form

$\max_\alpha \; \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \left( x_j^\top x_i \right) \quad \text{s.t.} \quad \alpha_i \ge 0, \quad \sum_{i=1}^N \alpha_i y_i = 0$

The separating hyperplane is given by the $N_S$ support vectors:

$w = \sum_{i=1}^{N_S} \alpha_i y_i x_i$

$b$ can also be computed, but we skip the derivation
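A sketch of solving this dual numerically on toy data, using a generic optimizer rather than a dedicated QP or SMO solver; the data and variable names are illustrative, not from the lecture:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

K = X @ X.T                        # Gram matrix of scalar products x_j^T x_i
Q = (y[:, None] * y[None, :]) * K  # entries y_i y_j x_i^T x_j

def neg_dual(a):                   # minimizing the negative dual = maximizing L~(alpha)
    return 0.5 * a @ Q @ a - a.sum()

cons = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i alpha_i y_i = 0
bnds = [(0.0, None)] * N                         # alpha_i >= 0

res = minimize(neg_dual, x0=np.zeros(N), bounds=bnds, constraints=cons, method="SLSQP")
alpha = res.x
sv = alpha > 1e-6                                # support vectors have nonzero alpha
w = (alpha[sv] * y[sv]) @ X[sv]                  # w = sum_i alpha_i y_i x_i
b = np.mean(y[sv] - X[sv] @ w)                   # from y_i (w^T x_i + b) = 1 on the margin
print("alpha =", alpha.round(3), "w =", w, "b =", b)
```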

SLIDE 20
  • 1. From Structural Risk Minimization to Linear SVMs

Support Vector Machines so far

Both the original SVM formulation (primal) and the derived dual formulation are quadratic programming problems (quadratic cost, linear constraints), which have unique solutions that can be computed efficiently.

Why did we bother to derive the dual form? To go beyond linear classifiers!

SLIDE 21
  • 2. Nonlinear SVMs

Outline

  • 1. From Structural Risk Minimization to Linear SVMs
  • 2. Nonlinear SVMs
  • 3. Applications
  • 4. Wrap-Up
SLIDE 22
  • 2. Nonlinear SVMs

Nonlinear SVMs

Nonlinear transformation $\phi$ of the data (features): $x \in \mathbb{R}^d$, $\phi : \mathbb{R}^d \to \mathcal{H}$

Hyperplane in $\mathcal{H}$ (linear classifier in $\mathcal{H}$): $w^\top \phi(x) + b = 0$

This gives a nonlinear classifier in $\mathbb{R}^d$

Same trick as in least-squares regression. So what is so special here?

SLIDE 23
  • 2. Nonlinear SVMs

Nonlinear SVMs

Dual form

$\max_\alpha \; \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \left( x_j^\top x_i \right) \quad \text{s.t.} \quad \alpha_i \ge 0, \quad \sum_{i=1}^N \alpha_i y_i = 0$

With a nonlinear transformation, we obtain

$\tilde{L}(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \left( \phi(x_j)^\top \phi(x_i) \right)$

  • $\phi(x_i)$ only appears in scalar products with another $\phi(x_j)$
  • We only need to be able to evaluate scalar products
SLIDE 24
  • 2. Nonlinear SVMs

Nonlinear SVMs

What about the discriminant function? $y(x) = w^\top \phi(x) + b$

We can represent the weights differently and write the nonlinear discriminant function as

$w = \sum_{i=1}^{N_S} \alpha_i y_i \phi(x_i) \qquad y(x) = \sum_{i=1}^{N_S} \alpha_i y_i \phi(x_i)^\top \phi(x) + b$

where $N_S$ is the number of support vectors

The discriminant function can also be written with scalar products of the nonlinear features only
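A small sketch of this kernelized discriminant function, assuming the support vectors, multipliers $\alpha_i$ and offset $b$ are already given (the numbers below are made up, not learned):

```python
import numpy as np

def poly_kernel(a, b, degree=2):
    """K(a, b) = (a^T b)^degree, evaluated without mapping into feature space."""
    return (a @ b) ** degree

def discriminant(x, support_X, support_y, alpha, b, kernel=poly_kernel):
    """y(x) = sum_i alpha_i y_i K(x_i, x) + b over the support vectors."""
    return sum(a_i * y_i * kernel(x_i, x)
               for a_i, y_i, x_i in zip(alpha, support_y, support_X)) + b

# Illustrative support vectors and multipliers (not learned here)
support_X = np.array([[1.0, 0.0], [0.0, 1.0]])
support_y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
print(discriminant(np.array([2.0, 1.0]), support_X, support_y, alpha, b=0.0))
```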

SLIDE 25
  • 2. Nonlinear SVMs

Nonlinear SVMs

Both the dual optimization problem and the discriminant function can be written in terms of scalar products of the features

We have already seen this when we talked about the dual version of the perceptron. In fact the discriminant function even has the very same functional form:

$y(x) = \sum_{i=1}^{N_S} \alpha_i y_i \phi(x_i)^\top \phi(x) + b$

Key difference: In an SVM the parameters $\alpha_i$ maximize the margin of the classifier, and have built-in generalization properties

SLIDE 26
  • 2. Nonlinear SVMs

Kernel Trick

Kernel trick: replace every occurrence of a scalar product between features with a kernel function

$K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$

If we can find a kernel function that is equivalent to this scalar product, we can avoid mapping into a high-dimensional space and instead compute the scalar product directly

What are examples of such kernels and when do they exist?

SLIDE 27
  • 2. Nonlinear SVMs

Polynomial Kernel

Polynomial kernel of 2nd degree: $K(x, y) = (x^\top y)^2$ with $x, y \in \mathbb{R}^2$

Equivalence to the dot product in feature space:

$K(x, y) = (x^\top y)^2 = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2$

$\phi(x)^\top \phi(y) = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix}^{\!\top} \begin{pmatrix} y_1^2 \\ \sqrt{2}\, y_1 y_2 \\ y_2^2 \end{pmatrix}$

Why is the kernel method an advantage?
  • Number of computations with the kernel: 3 (dot product between $x$ and $y$) + 1 (square the result) = 4
  • Number of computations with the feature transformation and then the dot product?
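A quick numerical check of this equivalence (the toy vectors are my own):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for v in R^2 (one possible choice)."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

k_direct = (x @ y) ** 2      # kernel: square the 2D dot product
k_mapped = phi(x) @ phi(y)   # map to R^3 first, then take the dot product

print(k_direct, k_mapped)    # both give the same value
```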

SLIDE 28
  • 2. Nonlinear SVMs

Polynomial Kernel

We could also have used a different $\phi(x)$:

$\phi(x)^\top \phi(y) = \frac{1}{\sqrt{2}} \begin{pmatrix} x_1^2 - x_2^2 \\ 2 x_1 x_2 \\ x_1^2 + x_2^2 \end{pmatrix}^{\!\top} \frac{1}{\sqrt{2}} \begin{pmatrix} y_1^2 - y_2^2 \\ 2 y_1 y_2 \\ y_1^2 + y_2^2 \end{pmatrix}$

$\phi(x)$ is not unique for a given kernel function $K(x, y)$

SLIDE 29
  • 2. Nonlinear SVMs

Polynomial Kernel of Degree d

Let $C_d(x)$ be the transformation that maps a vector into the space of all ordered monomials of degree $d$

We can represent all polynomials of degree $d$ as linear functions in this transformed space

Example
  • Ordered monomials: $x_1^2$, $x_1 x_2$, $x_2 x_1$, $x_2^2$
  • Unordered monomials: $x_1^2$, $x_1 x_2$, $x_2^2$

The kernel $K(x, y) = (x^\top y)^d$ lets us compute arbitrary scalar products without doing the explicit mapping:

$K(x, y) = (x^\top y)^d = C_d(x)^\top C_d(y)$

SLIDE 30
  • 2. Nonlinear SVMs

Polynomial Kernel of Degree d

$K(x, y) = (x^\top y)^d = C_d(x)^\top C_d(y)$

Dimensionality of the transformed space $\mathcal{H}$: $\binom{d + N - 1}{d}$

Example
  • $N = 16 \times 16 = 256$, $d = 4$
  • $\dim(\mathcal{H}) = \binom{259}{4} = 183\,181\,376$
  • The classifier has VC-dimension $\dim(\mathcal{H}) + 1$!
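A quick check of that count, assuming the binomial-coefficient formula above:

```python
from math import comb

N, d = 256, 4
dim_H = comb(d + N - 1, d)   # number of unordered monomials of degree d in N variables
print(dim_H)                 # 183181376, matching the slide
```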

SLIDE 31
  • 2. Nonlinear SVMs

SVM - Linear Case

SLIDE 32
  • 2. Nonlinear SVMs

SVM with Kernels

Polynomial kernel with degree 3

[Figure captions: linearly separable data, classifier almost linear; data not linearly separable (in original space)]

SLIDE 33
  • 2. Nonlinear SVMs

Constructing Kernels

So far: we identified some nonlinear transformation $\phi(x)$ that we think will be useful, and then found a kernel $K(x_i, x_j)$ that allows us to compute the scalar product without making the mapping explicit:

$K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$

What do kernels do? They measure similarity (in a transformed space)

But what if we have a notion of similarity and want to encode this in a kernel function $K(x_i, x_j)$ directly?

SLIDE 34
  • 2. Nonlinear SVMs

Radial Basis Functions

Radial Basis Function (RBF) kernel:

$K(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)$

Measures similarity between $x$ and $y$

Interesting property: $\mathcal{H}$ is infinite dimensional
  • Intuition given by the Taylor series expansion $e^x = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \ldots + \frac{x^n}{n!} + \ldots$
  • Since we only use the kernel function, it is not a problem
  • But the hyperplane also has infinite VC-dimension!
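A minimal sketch of the RBF kernel and how $\sigma$ controls the width of the similarity (the values are illustrative):

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

x = np.array([0.0, 0.0])
print(rbf_kernel(x, np.array([0.0, 0.0])))                # 1.0: identical points
print(rbf_kernel(x, np.array([3.0, 4.0])))                # ~0: distant points, similarity decays
print(rbf_kernel(x, np.array([3.0, 4.0]), sigma=10.0))    # larger sigma -> broader similarity
```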

SLIDE 35
  • 2. Nonlinear SVMs

Radial Basis Function Kernel

SLIDE 36
  • 2. Nonlinear SVMs

VC-Dimension for RBF Kernel

Intuition: If we can make the radius of the kernel arbitrarily small, then at some point every data point will have its "own" kernel.

But in contrast: If we bound the radius of the RBF, we can limit the VC-dimension!

SLIDE 37
  • 2. Nonlinear SVMs

Kernels

Question: Is the Gaussian RBF kernel a valid kernel, i.e., is there a mapping $\{\mathcal{H}, \phi\}$ with $\phi : \mathbb{R}^d \to \mathcal{H}$ so that $K(x, y) = \phi(x)^\top \phi(y)$?

How can we assess this more generally?

SLIDE 38
  • 2. Nonlinear SVMs

Mercer’s Condition

A function $K(x, y)$ is a valid kernel if, for every $g(x)$ with

$\int g(x)^2 \, dx < \infty$

it holds that

$\int\!\!\int K(x, y)\, g(x)\, g(y)\, dx\, dy \ge 0$

SLIDE 39
  • 2. Nonlinear SVMs

Kernels satisfying Mercer’s condition

Inhomogeneous polynomial kernel: $K(x, y) = (x^\top y + c)^d$
  • Can also represent polynomials of degree $d$

Gaussian RBF kernel: $K(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)$

Hyperbolic tangent kernel: $K(x, y) = \tanh(a\, x^\top y + b)$

SLIDE 40
  • 2. Nonlinear SVMs

Combining Kernels

It may not always be easy to check whether Mercer's condition is satisfied, but it is possible to construct new kernels out of known ones.

If $K_1(x, y)$ and $K_2(x, y)$ are valid kernels, then so are
  • $c\, K_1(x, y)$
  • $K_1(x, y) + K_2(x, y)$
  • $K_1(x, y)\, K_2(x, y)$
  • $f(x)\, K_1(x, y)\, f(y)$
  • ...
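A small numerical sanity check of these closure rules (my own construction, not from the lecture): a necessary condition for a valid kernel is that the Gram matrix on any set of points is positive semi-definite, so we can at least verify that property for a sum and a product of two known kernels:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                            # illustrative sample points

def gram(kernel, X):
    return np.array([[kernel(a, b) for b in X] for a in X])

k1 = lambda a, b: (a @ b + 1.0) ** 2                    # polynomial kernel
k2 = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)   # RBF kernel
k_sum = lambda a, b: k1(a, b) + k2(a, b)                # sum of kernels
k_prod = lambda a, b: k1(a, b) * k2(a, b)               # product of kernels

for k in (k1, k2, k_sum, k_prod):
    eigs = np.linalg.eigvalsh(gram(k, X))               # Gram matrix eigenvalues
    print(eigs.min() >= -1e-9)                          # all (numerically) non-negative
```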

SLIDE 41
  • 2. Nonlinear SVMs

Non-separable data

What if the data is not linearly separable?

Simple solution: transform the features into a space so that they become linearly separable
  • E.g. RBF kernel with small kernel radius
  • Problem: such a classifier will have a very high VC-dimension, and thus a large capacity
  • It will lead to overfitting

Solution: allow data points to "violate the margin"

SLIDE 42
  • 2. Nonlinear SVMs

SVMs with slack

Instead of requiring that the data is perfectly linearly separable,

$w^\top x_i + b \ge +1$ for $y_i = +1$
$w^\top x_i + b \le -1$ for $y_i = -1$

allow for small violations $\xi_i$ from perfect separation:

$w^\top x_i + b \ge +1 - \xi_i$ for $y_i = +1$
$w^\top x_i + b \le -1 + \xi_i$ for $y_i = -1$
$\xi_i \ge 0 \;\forall i$

SLIDE 43
  • 2. Nonlinear SVMs

SVMs with slack

We require that $y_i(w^\top x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0 \;\forall i$

The $\xi_i$ are called slack variables

[Figure: margin lines $y = 1$, $y = 0$, $y = -1$; points on or outside their margin have $\xi = 0$, points inside the margin have $\xi < 1$, misclassified points have $\xi > 1$]

SLIDE 44
  • 2. Nonlinear SVMs

SVMs with slack

We have to penalize the deviations:

$\arg\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i \quad \text{s.t.} \quad y_i(w^\top x_i + b) - 1 + \xi_i \ge 0, \quad \xi_i \ge 0$

Maximize the margin while minimizing the penalty for all data points that are not outside the margin

The weight $C$ allows us to specify a trade-off; it is typically determined through cross-validation

Even if the data is separable, it may be better to allow for an occasional penalty
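A sketch of that cross-validation step with scikit-learn, assuming an RBF soft-margin SVM; the data, the grid of $C$ and $\gamma$ values, and the fold count are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Overlapping classes (illustrative), so some slack is unavoidable
X = np.vstack([rng.normal(-1, 1.5, size=(100, 2)), rng.normal(1, 1.5, size=(100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

# Choose the trade-off C (and the RBF width) by cross-validation
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```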
SLIDE 45
  • 2. Nonlinear SVMs

SVMs with slack

Dual formulation:

$\max_\alpha \; \tilde{L}(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \left( x_j^\top x_i \right) \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^N \alpha_i y_i = 0$

where $\alpha_i \le C$ is called the box constraint

The separating hyperplane is given by the $N_S$ support vectors:

$w = \sum_{i=1}^{N_S} \alpha_i y_i x_i$

SLIDE 46
  • 3. Applications

Outline

  • 1. From Structural Risk Minimization to Linear SVMs
  • 2. Nonlinear SVMs
  • 3. Applications
  • 4. Wrap-Up
SLIDE 47
  • 3. Applications

Text Classification

Joachims, T., "Text categorization with Support Vector Machines: learning with many relevant features", ECML 1998

Problem: Classify documents into a number of categories

The text is represented using word statistics, i.e. histograms of the word frequency
  • We count how often every word occurs and ignore their order ("bag of words")
  • Very high-dimensional feature space (roughly 10,000 dimensions)
  • Very few features that are not relevant (difficult to apply feature selection or dimensionality reduction)
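A minimal bag-of-words text classification sketch in scikit-learn; the tiny corpus and labels are invented purely to show the pipeline, and LinearSVC stands in for the linear SVM used in the paper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, just to show the word-histogram + linear SVM pipeline
docs = ["the match ended with a late goal",
        "the striker scored twice in the final",
        "stocks fell sharply after the earnings report",
        "the central bank raised interest rates"]
labels = ["sports", "sports", "finance", "finance"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())  # sparse word statistics + linear SVM
clf.fit(docs, labels)
print(clf.predict(["interest rates and stocks"]))     # expected: ['finance']
```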

SLIDE 48
  • 3. Applications

Text Classification

SLIDE 49
  • 3. Applications

Handwritten Digit Classification

U.S. Postal Service Database

SLIDE 50
  • 3. Applications

Handwritten Digit Classification

Human performance: 2.5% error

Various learning algorithms
  • 16.2%
  • 5.9%: 2-layer neural network
  • 5.1%: LeNet 1 - 5-layer neural network

Various SVM results
  • 4.0%: polynomial kernel (p = 3, 274 support vectors)
  • 4.1%: Gaussian kernel (σ = 0.3, 291 support vectors)

SLIDE 51
  • 3. Applications

Handwritten Digit Classification

Very little overfitting and good generalization

SLIDE 52
  • 3. Applications

Handwritten Digit Classification

To get even better results, supply knowledge about invariances in the data: geometric deformations, etc.

2.7% error: elastic matching (no learning)
  • Use knowledge of how digits can deform
  • Classify a test digit by finding the template that required the least deformation

Recent results
  • With more training data, better modeling of invariances, etc.
  • Error down to about 0.5% with SVMs and 0.4% with neural networks

SLIDE 53
  • 3. Applications

(Lack of) Sparseness

If the classes overlap, SVMs may need many support vectors


SLIDE 54
  • 3. Applications

Relevance Vector Machines

  • Probabilistic alternative to SVMs
  • Much sparser results
  • No notion of margin maximization

SLIDE 55
  • 3. Applications

Support Vector Regression

SVMs can also be adapted to regression tasks
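A minimal scikit-learn sketch of support vector regression on made-up 1D data; the kernel, $C$ and $\epsilon$ values are illustrative:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)     # noisy 1D target (illustrative)

# epsilon-insensitive loss: errors smaller than epsilon are not penalized
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors:", reg.support_vectors_.shape[0], "of", X.shape[0])
print(reg.predict([[0.0], [1.5]]))
```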


SLIDE 56
  • 4. Wrap-Up

Outline

  • 1. From Structural Risk Minimization to Linear SVMs
  • 2. Nonlinear SVMs
  • 3. Applications
  • 4. Wrap-Up
SLIDE 57
  • 4. Wrap-Up

You know now
  • What the main idea behind SVMs is
  • Why maximizing the margin is a good idea
  • How to translate the SVM problem into a quadratic optimization problem
  • How to interpret the support vectors
  • How to use SVMs for data that is not linearly separable
  • What the kernel trick is
  • How to construct kernels
  • How to formulate SVMs with slack variables

SLIDE 58
  • 4. Wrap-Up

Self-Test Questions

  • How did learning theory motivate support vector machines?
  • What does maximum margin separation mean?
  • Why did the SVM-craze drown the Neural-Networks-craze?
  • What is a Kernel?
  • How does a Kernel relate to features?
  • How can I build Kernels from Kernels?
  • What functions does the Radial Basis Function Kernel contain?
  • How does support vector regression work?

SLIDE 59
  • 4. Wrap-Up

Homework

Reading Assignment for next lecture

Bishop 6.1, 6.3, 6.4
