About this class

- Maximum margin classifiers
- SVMs: geometric derivation of the primal problem
- Statement of the dual problem
- The "kernel trick"
- SVMs as the solution to a regularization problem


Maximizing the Margin

- Picture of large and small margin hyperplanes
- Intuition: the large margin condition acts as a regularizer and should generalize better
- The Support Vector Machine (SVM) makes this formal. Not only that, it is amenable to the kernel trick, which will allow us to get much greater representational power!


Deriving the SVM

(Derivation based on Ryan Rifkin's slides in MIT 9.520 from Spring 2003)

Assume we classify a point $x$ as $\mathrm{sgn}(w \cdot x)$. Let $x$ be a datapoint on the margin, and $z$ the point on the separating hyperplane closest to $x$. We want to maximize $\|x - z\|$. For some $k$ (assumed positive):

$$w \cdot x = k, \qquad w \cdot z = 0 \quad\Rightarrow\quad w \cdot (x - z) = k$$


Since $x - z$ is parallel to $w$ (both are perpendicular to the separating hyperplane):

$$k = w \cdot (x - z) \;\Rightarrow\; k = \|w\|\,\|x - z\| \;\Rightarrow\; \|x - z\| = \frac{k}{\|w\|}$$

So now, maximizing $\|x - z\|$ is equivalent to minimizing $\|w\|$. We can fix $k = 1$ (this is just a rescaling). Now we have an optimization problem:

$$\min_{w \in \mathbb{R}^n} \|w\|^2 \quad \text{subject to: } y_i(w \cdot x_i) \ge 1, \; i = 1, \ldots, l$$

This can be solved using quadratic programming.
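As a concrete illustration (not from the original slides), here is a minimal sketch that hands this primal QP to the cvxpy modeling library; the toy dataset and the choice of cvxpy are assumptions for the example.

```python
import cvxpy as cp
import numpy as np

# Toy, linearly separable data (assumed for illustration); labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])

# min ||w||^2  subject to  y_i (w . x_i) >= 1  for all i
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                     [cp.multiply(y, X @ w) >= 1])
problem.solve()

print("w =", w.value)
print("margin = 1/||w|| =", 1.0 / np.linalg.norm(w.value))  # since k = 1
```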


Think about this expression in terms of training set error and inductive bias! Typically we also use a bias term to shift the hyperplane around (so it doesn't have to pass through the origin). Now $f(x) = \mathrm{sgn}(w \cdot x + b)$.

When a Separating Hyperplane Does Not Exist

We introduce slack variables. The new optimization problem becomes:

$$\min_{w \in \mathbb{R}^n,\; \xi \in \mathbb{R}^l} \; \frac{C}{l} \sum_{i=1}^{l} \xi_i + \frac{1}{2}\|w\|^2$$

$$\text{subject to: } y_i(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \ldots, l$$

Now we are trading the error off against the margin.
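A sketch of the slack-variable version under the same assumptions (toy data, cvxpy; the value of the trade-off parameter $C$ is arbitrary here):

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [0.5, -0.5], [-2.0, -2.0], [-0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l, n = X.shape
C = 1.0

w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(l, nonneg=True)          # slack variables, xi_i >= 0

# min (C/l) sum_i xi_i + (1/2)||w||^2
objective = cp.Minimize((C / l) * cp.sum(xi) + 0.5 * cp.sum_squares(w))
# subject to  y_i (w . x_i + b) >= 1 - xi_i
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value, "slacks =", xi.value)
```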


The Dual Formulation

$$\max_{\alpha \in \mathbb{R}^l} \; \sum_{i=1}^{l} \alpha_i - \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$

$$\text{subject to: } \sum_{i=1}^{l} y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le C, \; i = 1, \ldots, l$$

The hypothesis is then:

$$f(x) = \mathrm{sgn}\!\left(\sum_{i=1}^{l} \alpha_i y_i (x \cdot x_i)\right)$$

Sparsity: it turns out that:

$$y_i f(x_i) > 1 \Rightarrow \alpha_i = 0, \qquad y_i f(x_i) < 1 \Rightarrow \alpha_i = C$$


This allows for a more efficient solution of the QP than we could get otherwise.
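A minimal sketch (not from the slides; toy data assumed) that solves the dual in cvxpy and checks the sparsity claim. Note the sketch uses the common convention that puts a factor of $\frac{1}{2}$ on the quadratic term, under which $w = \sum_i \alpha_i y_i x_i$; the support-vector pattern is the same either way.

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l, C = len(y), 10.0

alpha = cp.Variable(l)

# alpha^T Q alpha with Q_ij = y_i y_j (x_i . x_j), written as
# ||sum_i alpha_i y_i x_i||^2 so cvxpy recognizes the problem as concave
quad = cp.sum_squares(X.T @ cp.multiply(alpha, y))
objective = cp.Maximize(cp.sum(alpha) - 0.5 * quad)
constraints = [y @ alpha == 0, alpha >= 0, alpha <= C]
cp.Problem(objective, constraints).solve()

a = alpha.value
support = a > 1e-6                          # sparsity: most alpha_i are zero
w = (a[support] * y[support]) @ X[support]  # w = sum_i alpha_i y_i x_i
print("support vectors:\n", X[support])
print("w =", w)
```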


The Kernel Trick

The really nice thing: the optimization depends only on the dot product between examples.

An example from Russell & Norvig

[Figure: the example data plotted in the original $(x_1, x_2)$ coordinates.]

Now suppose we go from the representation $x = \langle x_1, x_2 \rangle$ to the representation:

$$F(x) = \langle x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2 \rangle$$

[Figure: the same data plotted in the feature-space coordinates $(x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$.]

Now $F(x_i) \cdot F(x_j) = (x_i \cdot x_j)^2$. We don't need to compute the actual feature representation in the higher-dimensional space, because of Mercer's theorem: for a Mercer kernel $K$, the dot product of $F(x_i)$ and $F(x_j)$ is given by $K(x_i, x_j)$. What is a Mercer kernel? Continuous, symmetric, and positive definite.
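A quick numerical check of this identity (illustrative only; the two points are assumed):

```python
import numpy as np

def feature_map(x):
    # Explicit feature map F(x) = (x1^2, x2^2, sqrt(2) x1 x2)
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

lhs = feature_map(xi) @ feature_map(xj)  # dot product in feature space
rhs = (xi @ xj) ** 2                     # kernel evaluated in input space
print(lhs, rhs)                          # both equal 1.0
```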


Positive definiteness: for any $m$-size subset of the input space, the matrix $K$ where $K_{ij} = K(X_i, X_j)$ is positive definite. Remember positive definiteness: for all non-zero vectors $z$, $z^T K z > 0$. This allows us to work with very high-dimensional spaces! Examples:

1. Polynomial: $K(X_i, X_j) = (1 + x_i \cdot x_j)^d$ (feature space is exponential in $d$!)
2. Gaussian: $K(X_i, X_j) = e^{-\|x_i - x_j\|^2 / 2\sigma^2}$ (infinite-dimensional feature space!)
3. String kernels, protein kernels!

How do we choose which kernel and which λ to use? (The first could be harder!)
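As an illustration of the two numeric kernels just listed (a hedged sketch; the toy data and the hyperparameter values $d$ and $\sigma$ are assumed, not from the slides):

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
d, sigma = 3, 0.5

K_poly = (1.0 + X @ X.T) ** d                   # (1 + x_i . x_j)^d

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_gauss = np.exp(-sq_dists / (2.0 * sigma**2))  # exp(-||x_i - x_j||^2 / 2 sigma^2)

# Mercer sanity check: both Gram matrices should be positive (semi-)definite
print(np.linalg.eigvalsh(K_poly) >= -1e-9)
print(np.linalg.eigvalsh(K_gauss) >= -1e-9)
```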


Selecting the Best Hypothesis

Based on notes from Poggio, Mukherjee and Rifkin. Define the performance of a hypothesis by a loss function $V$.

- Commonly used for regression: $V(f(x), y) = (f(x) - y)^2$
- Could use absolute value: $V(f(x), y) = |f(x) - y|$
- What about classification? 0-1 loss: $V(f(x), y) = I[y \ne f(x)]$
- Hinge loss: $V(f(x), y) = (1 - y \cdot f(x))_+$

Hypothesis space: the space of functions that we search.
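Illustrative implementations of the losses above (the example values are assumed; here $f(x)$ is a real-valued score and the classifier outputs its sign):

```python
import numpy as np

def squared_loss(fx, y):  return (fx - y) ** 2
def absolute_loss(fx, y): return np.abs(fx - y)
def zero_one_loss(fx, y): return float(np.sign(fx) != y)  # I[y != sgn(f(x))]
def hinge_loss(fx, y):    return max(0.0, 1.0 - y * fx)   # (1 - y f(x))_+

fx, y = 0.4, 1.0  # a weakly confident but correct prediction
print(zero_one_loss(fx, y))  # 0.0: the sign is right, so no 0-1 loss
print(hinge_loss(fx, y))     # 0.6: hinge still penalizes the small margin
```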


Expected error of a hypothesis: the expected error on a sample drawn from the underlying (unknown) distribution:

$$I[f] = \int V(f(x), y)\, d\mu(x, y)$$

In discrete terms we would replace the integral with a sum and $\mu$ with $P$. Empirical error, or empirical risk, is the average loss over the training set:

$$I_S[f] = \frac{1}{l} \sum_i V(f(x_i), y_i)$$

Empirical risk minimization: find the hypothesis in the hypothesis space that minimizes the empirical risk:

$$\min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i)$$
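A minimal sketch of computing the empirical risk of one fixed linear hypothesis under hinge loss (toy data and weight vector assumed), to contrast with the expected-risk integral above:

```python
import numpy as np

X = np.array([[2.0, 1.0], [0.2, -0.1], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.array([1.0, 1.0])                 # a fixed hypothesis f(x) = w . x

margins = y * (X @ w)                    # y_i * f(x_i)
# I_S[f] = (1/l) sum_i V(f(x_i), y_i) with hinge loss V
emp_risk = np.mean(np.maximum(0.0, 1.0 - margins))
print(emp_risk)
```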


For most hypothesis spaces, ERM is an ill-posed problem. A problem is ill-posed if it is not well-posed; a problem is well-posed if its solution exists, is unique, and depends continuously on the data. Regularization restores well-posedness. Ivanov regularization directly constrains the hypothesis space, and Tikhonov regularization imposes a penalty on hypothesis complexity.

Ivanov regularization:

$$\min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) \quad \text{subject to } \omega(f) \le \tau$$

Tikhonov regularization:

$$\min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda\, \omega(f)$$

$\omega$ is the regularization or smoothness functional. The mathematical machinery for defining this is complex, and we won't get into it much more, but the interesting thing is that if we use the hinge loss and the linear kernel, the SVM comes out of solving the Tikhonov regularization problem!

Meaning of using an unregularized bias term? We punish function complexity, but not an arbitrary translation of the origin. However, in the case of SVMs, the answer will end up being different if we add a fictional "1" to each example, because now we punish the weight we put on it!
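A hedged sketch of that last point: Tikhonov regularization with hinge loss, a linear hypothesis, and an unregularized bias, minimized with cvxpy. The toy data and the value of $\lambda$ are assumptions for the example.

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [0.5, -0.5], [-2.0, -2.0], [-0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n_samples, dim = X.shape
lam = 0.1

w = cp.Variable(dim)
b = cp.Variable()  # unregularized bias: omega(f) penalizes w only

# (1/n) sum_i (1 - y_i f(x_i))_+  +  lambda ||w||^2
hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ w + b))) / n_samples
cp.Problem(cp.Minimize(hinge + lam * cp.sum_squares(w))).solve()

print("w =", w.value, "b =", b.value)
```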


Generalization Bounds

Important concepts of error:

1. Sample (estimation) error: the difference between the hypothesis we find in $H$ and the best hypothesis in $H$
2. Approximation error: the difference between the best hypothesis in $H$ and the true function in some other space $T$
3. Generalization error: the difference between the hypothesis we find in $H$ and the true function in $T$, which is the sum of the two above

Tradeoff: making $H$ bigger makes the approximation error smaller, but the estimation error larger.
