

slide-1
SLIDE 1

Machine Learning

A Geometric Approach

Professor Liang Huang

Linear Classification: Support Vector Machines (SVM)

some slides from Alex Smola (CMU)

CIML book Chap 7.7

slide-2
SLIDE 2

Linear Separator

Spam Ham

slide-3
SLIDE 3

From Perceptron to SVM

Perceptron line (online):

  • 1959 Rosenblatt: invention of the perceptron
  • 1962 Novikoff: convergence proof
  • 1999 Freund/Schapire: voted/averaged perceptron (revived)
  • 2002 Collins: structured perceptron
  • 2003 Crammer/Singer: MIRA (online approx. of max margin; conservative updates)
  • 2005* McDonald/Crammer/Pereira: structured MIRA
  • 2006 Singer group: aggressive MIRA

SVM line (batch):

  • 1964 Vapnik & Chervonenkis (then the fall of the USSR)
  • 1997 Cortes/Vapnik: SVM (+ max margin + kernels + soft margin; inseparable case)
  • 2007-2010* Singer group: Pegasos (subgradient descent, minibatch)

(much of this work from AT&T Research, ex-AT&T researchers, and their students)

*mentioned in lectures but optional (the other papers are all covered in detail)

slide-4
SLIDE 4

Large Margin Classifier

f(x) = ⟨w, x⟩ + b   (a linear function)
⟨w, x⟩ + b ≥ 1 on one side of the margin, ⟨w, x⟩ + b ≤ −1 on the other

slide-5
SLIDE 5

Large Margin Classifier

f(x) = ⟨w, x⟩ + b   (a linear function)
⟨w, x⟩ + b ≥ 1 on one side of the margin, ⟨w, x⟩ + b ≤ −1 on the other

slide-6
SLIDE 6

Why large margins?

  • Maximum robustness relative to uncertainty
  • Symmetry breaking
  • Independent of correctly classified instances
  • Easy to find for easy problems

(figure labels: +, r, ρ)

slide-7
SLIDE 7

Feature Map Φ

  • SVM is often used with kernels
slide-8
SLIDE 8

Large Margin Classifier

margin hyperplanes: ⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = −1
decision regions: ⟨w, x⟩ + b ≥ 1 and ⟨w, x⟩ + b ≤ −1

functional margin: y_i(w · x_i)
geometric margin: y_i(w · x_i)/‖w‖ = 1/‖w‖   (when the functional margin is fixed to 1)

slide-9
SLIDE 9

Large Margin Classifier

margin hyperplanes: ⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = −1

  • max. geometric margin, s.t. functional margin is at least 1

SVM objective (max version):

    max_w 1/‖w‖   s.t.  ∀(x, y) ∈ D:  y(w · x) ≥ 1

Q1: what if we want a functional margin of 2? Q2: what if we want a geometric margin of 1?
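A quick numeric answer to Q1/Q2 (a minimal sketch on made-up toy data; the particular w is only illustrative): scaling w rescales the functional margin but leaves the geometric margin unchanged, which is why we can fix the functional margin to 1 and then optimize ‖w‖.

    import numpy as np

    # toy separable data (illustrative); labels y in {+1, -1}; bias folded in / omitted
    X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    w = np.array([1.0, 1.0])                        # some separating weight vector

    def functional_margin(w):
        return np.min(y * (X @ w))                  # min_i y_i (w . x_i)

    def geometric_margin(w):
        return functional_margin(w) / np.linalg.norm(w)

    print(functional_margin(w), geometric_margin(w))          # 3.0 and ~2.12
    print(functional_margin(2 * w), geometric_margin(2 * w))  # 6.0 (doubled) and ~2.12 (unchanged)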

slide-10
SLIDE 10

Large Margin Classifier

margin hyperplanes: ⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = −1

  • min. weight vector, s.t. functional margin is at least 1

SVM objective (min version):

    min_w ‖w‖   s.t.  ∀(x, y) ∈ D:  y(w · x) ≥ 1

interpretation: small models generalize better

slide-11
SLIDE 11

Large Margin Classifier

margin hyperplanes: ⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = −1

  • min. weight vector, s.t. functional margin is at least 1

SVM objective (min version):

    min_w ½‖w‖²   s.t.  ∀(x, y) ∈ D:  y(w · x) ≥ 1

‖w‖ is not differentiable (at 0), but ‖w‖² is.

slide-12
SLIDE 12

SVM vs. MIRA

  • SVM: min weight vector to enforce a functional margin of at least 1 on ALL EXAMPLES
  • MIRA: min weight change to enforce a functional margin of at least 1 on THIS EXAMPLE
  • MIRA is a 1-step or online approximation of SVM
  • Aggressive MIRA → SVM as p → 1

SVM:    min_w ½‖w‖²   s.t.  ∀(x, y) ∈ D:  y(w · x) ≥ 1
MIRA:   min_{w′} ‖w′ − w‖²   s.t.  w′ · x ≥ 1

(figure labels: x_i, w, MIRA, perceptron)

slide-13
SLIDE 13

Convex Hull Interpretation

  • max. distance between convex hulls

why don't we use convex hulls for SVMs in practice?

how many support vectors in 2D?

the weight vector is determined by the support vectors alone
c.f. the perceptron, where  w = Σ_{(x,y) ∈ errors} y · x    (what about MIRA?)

slide-14
SLIDE 14

Convexity and Convex Hulls

convex combination

slide-15
SLIDE 15

Optimization

  • Primal optimization problem
  • Convex optimization: convex function over convex set!
  • Quadratic prog.: quadratic function w/ linear constraints

    min_w ½‖w‖²   s.t.  ∀(x, y) ∈ D:  y(w · x) ≥ 1

(quadratic objective, linear constraints)

slide-16
SLIDE 16

MIRA as QP

  • MIRA is a trivial QP; can be solved geometrically
  • what about multiple constraints (e.g. minibatch)?

    MIRA:   min_{w′} ‖w′ − w‖²   s.t.  w′ · x ≥ 1

(figure: w_i and x_i; the distances involved are 1/‖x_i‖, (w_i · x_i)/‖x_i‖, and (1 − w_i · x_i)/‖x_i‖ to the constraint plane w · x_i = 1)
slide-17
SLIDE 17

Optimization

  • Primal optimization problem
  • Convex optimization: convex function over convex set!
  • Lagrange function

Derivatives in w need to vanish

    min_w ½‖w‖²   s.t.  ∀(x, y) ∈ D:  y(w · x) ≥ 1

    L(w, b, α) = ½‖w‖² − Σ_i α_i [y_i(⟨x_i, w⟩ + b) − 1]

    ∂_w L(w, b, α) = w − Σ_i α_i y_i x_i = 0
    ∂_b L(w, b, α) = Σ_i α_i y_i = 0
    ⟹  w = Σ_i y_i α_i x_i

the model is a linear combination of a small subset of the input (the support vectors), i.e., those with α_i > 0

slide-18
SLIDE 18

Lagrangian & Saddle Point

  • equality: min x² s.t. x = 1
  • inequality: min x² s.t. x ≥ 1
  • Lagrangian: L(x, α) = x² − α(x − 1)
  • the derivative in x needs to vanish
  • optimality is at a saddle point with respect to α
  • min_x in the primal ⟹ max_α in the dual

(figure axes: x, α)

slide-19
SLIDE 19

Constrained Optimization

  • Quadratic Programming
  • Quadratic Objective
  • Linear Constraints

KKT condition (complementary slackness)

  • the optimal point is achieved at active constraints, i.e., where α_i > 0   (α_i = 0 ⟹ inactive)

    minimize_{w,b} ½‖w‖²   subject to  y_i(⟨x_i, w⟩ + b) ≥ 1

    α_i [y_i(⟨w, x_i⟩ + b) − 1] = 0

    w = Σ_i y_i α_i x_i

Karush–Kuhn–Tucker

slide-20
SLIDE 20

w

KKT => Support Vectors

    minimize_{w,b} ½‖w‖²   subject to  y_i(⟨x_i, w⟩ + b) ≥ 1

    w = Σ_i y_i α_i x_i

Karush Kuhn Tucker (KKT) Optimality Condition:

    α_i [y_i(⟨w, x_i⟩ + b) − 1] = 0

    α_i = 0  ⟹  y_i(⟨w, x_i⟩ + b) ≥ 1
    α_i > 0  ⟹  y_i(⟨w, x_i⟩ + b) = 1

slide-21
SLIDE 21

    w = Σ_i y_i α_i x_i

Properties

  • Weight vector w as weighted linear combination of instances
  • Only points on margin matter (ignore the rest and get same solution)
  • Only inner products matter
  • Quadratic program
  • We can replace the inner product by a kernel
  • Keeps instances away from the margin
slide-22
SLIDE 22

Alternative: Primal=>Dual

  • Lagrange function
  • Derivatives in w need to vanish
  • Plugging w back into L yields

    L(w, b, α) = ½‖w‖² − Σ_i α_i [y_i(⟨x_i, w⟩ + b) − 1]

    ∂_w L(w, b, α) = w − Σ_i α_i y_i x_i = 0
    ∂_b L(w, b, α) = Σ_i α_i y_i = 0
    ⟹  w = Σ_i y_i α_i x_i

    maximize_α  −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
    subject to  Σ_i α_i y_i = 0  and  α_i ≥ 0      (α: dual variables)
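The substitution step, written out (a sketch consistent with the formulas above):

    \begin{aligned}
    L(w,b,\alpha) &= \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i\,[y_i(\langle x_i, w\rangle + b) - 1] \\
      &= \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle
         - \sum_{i,j}\alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle
         - b\sum_i \alpha_i y_i + \sum_i \alpha_i \\
      &= -\tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle + \sum_i \alpha_i
    \end{aligned}

(using w = Σ_i α_i y_i x_i for the first two terms, and Σ_i α_i y_i = 0 to remove the b term)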

slide-23
SLIDE 23

w

Primal vs. Dual

Primal:

    minimize_{w,b} ½‖w‖²   subject to  y_i(⟨x_i, w⟩ + b) ≥ 1

Dual:

    maximize_α  −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
    subject to  Σ_i α_i y_i = 0  and  α_i ≥ 0      (α: dual variables)

    w = Σ_i y_i α_i x_i

slide-24
SLIDE 24

Solving the optimization problem

  • Dual problem
  • If the problem is small enough (1000s of variables), we can use an off-the-shelf solver (CVXOPT, CPLEX, OOQP, LOQO)
  • For larger problems, use the fact that only the SVs matter and solve in blocks (active set method)

    maximize_α  −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
    subject to  Σ_i α_i y_i = 0  and  α_i ≥ 0      (α: dual variables)
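One way to use an off-the-shelf solver here is CVXOPT (named above); a sketch with my own helper name, using the soft-margin box 0 ≤ α_i ≤ C (take C very large to approximate the hard-margin dual shown above). CVXOPT minimizes ½αᵀPα + qᵀα, so the dual is negated:

    import numpy as np
    from cvxopt import matrix, solvers

    def svm_dual_qp(X, y, C=1.0):
        n = len(y)
        y = y.astype(float)
        K = X @ X.T                                      # Gram matrix of inner products <x_i, x_j>
        P = matrix(np.outer(y, y) * K)                   # P_ij = y_i y_j <x_i, x_j>
        q = matrix(-np.ones(n))                          # negated  sum_i alpha_i
        G = matrix(np.vstack([-np.eye(n), np.eye(n)]))   # -alpha_i <= 0  and  alpha_i <= C
        h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
        A = matrix(y.reshape(1, -1))                     # sum_i alpha_i y_i = 0
        b = matrix(0.0)
        alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
        w = (alpha * y) @ X                              # w = sum_i y_i alpha_i x_i
        on_margin = (alpha > 1e-6) & (alpha < C - 1e-6)
        bias = float(np.mean(y[on_margin] - X[on_margin] @ w)) if on_margin.any() else 0.0
        return alpha, w, bias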

slide-25
SLIDE 25

Quadratic Program in Dual

  • Dual problem

    maximize_α  −½ αᵀQα − αᵀb   subject to  α ≥ 0

  • Quadratic Programming
  • Objective: quadratic function
  • Q is positive semidefinite
  • Constraints: linear functions
  • Methods
  • Gradient Descent
  • Coordinate Descent (a.k.a. the Hildreth algorithm)
  • Sequential Minimal Optimization (SMO)

Q: what is the Q in the SVM primal? how about the Q in the SVM dual?

slide-26
SLIDE 26

Convex QP

  • if Q is positive (semi)definite, i.e., xᵀQx ≥ 0 for all x, then the QP is convex ⟹ a local min/max is a global min/max
  • if Q = 0, it reduces to linear programming
  • if Q is indefinite ⟹ saddle point
  • general QP is NP-hard; convex QP is polynomial-time

(diagram: LP ⊂ QP ⊂ CP, with SVM inside QP)
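A quick numeric convexity check (a sketch; the helper name is mine): a QP objective ½xᵀQx + ... is convex exactly when Q is positive semidefinite, i.e., all eigenvalues of its symmetric part are nonnegative.

    import numpy as np

    def is_convex_qp(Q, tol=1e-10):
        eigs = np.linalg.eigvalsh((Q + Q.T) / 2)     # eigenvalues of the symmetric part
        return bool(np.all(eigs >= -tol))

    print(is_convex_qp(np.array([[4.0, 1.0], [1.0, 2.0]])))    # True: the Q used in the example below
    print(is_convex_qp(np.array([[1.0, 0.0], [0.0, -1.0]])))   # False: indefinite, gives a saddle point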

slide-27
SLIDE 27

QP: Hildreth Algorithm

  • idea 1: update one coordinate while fixing all other coordinates
  • e.g., updating coordinate i means solving:

    argmax_{α_i}  −½ αᵀQα − αᵀb   subject to  α ≥ 0

a quadratic function of a single variable; at the maximum, the first-order derivative is 0

slide-28
SLIDE 28

QP: Hildreth Algorithm

  • idea 2: choose another coordinate and repeat until a stopping criterion is met:
  • the maximum is reached, or
  • the increase between 2 consecutive iterations is very small, or
  • after some # of iterations
  • how to choose the coordinate: sweep pattern
  • Sequential:
  • 1, 2, ..., n, 1, 2, ..., n, ...
  • 1, 2, ..., n, n-1, n-2, ..., 1, 2, ...
  • Random: permutation of 1, 2, ..., n
  • Maximal descent: choose i with maximal descent in the objective
slide-29
SLIDE 29

QP: Hildreth Algorithm

    initialize α_i ← 0 for all i
    repeat:
        pick i following the sweep pattern
        solve  α_i ← argmax_{α_i}  −½ αᵀQα − αᵀb   subject to  α ≥ 0
    until a stopping criterion is met
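A minimal Python sketch of the loop above (the function name and stopping tolerance are my own choices): sweep the coordinates, and for each one take the 1-D unconstrained maximizer and clip it at 0.

    import numpy as np

    def hildreth(Q, b, sweeps=100, tol=1e-8):
        """Coordinate ascent for  max_a  -1/2 a^T Q a - a^T b   s.t.  a >= 0."""
        n = len(b)
        alpha = np.zeros(n)
        def obj(a):
            return -0.5 * a @ Q @ a - a @ b
        prev = obj(alpha)
        for _ in range(sweeps):
            for i in range(n):                       # sequential sweep pattern 1, 2, ..., n
                rest = Q[i] @ alpha - Q[i, i] * alpha[i]
                # set the 1-D derivative -(Q alpha)_i - b_i to 0, then clip at alpha_i >= 0
                alpha[i] = max(0.0, -(b[i] + rest) / Q[i, i])
            cur = obj(alpha)
            if cur - prev < tol:                     # negligible improvement: stop
                break
            prev = cur
        return alpha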

slide-30
SLIDE 30

QP: Hildreth Algorithm

  • choose coordinates 1, 2, 1, 2, ...

    maximize_α  −½ αᵀQα − αᵀb   with  Q = [[4, 1], [1, 2]],  b = [−6, −4],   subject to  α ≥ 0
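Running the sketch from the previous slide on this example (the 1, 2, 1, 2, ... choice is exactly the sequential sweep):

    Q = np.array([[4.0, 1.0], [1.0, 2.0]])
    b = np.array([-6.0, -4.0])
    print(hildreth(Q, b))     # approaches [8/7, 10/7]; here the unconstrained optimum is already >= 0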

slide-31
SLIDE 31

QP: Hildreth Algorithm

  • pros:
  • extremely simple
  • no gradient calculation
  • easy to implement
  • cons:
  • converges slowly compared to other methods
  • can’t deal with too many constraints
  • works for minibatch MIRA but not SVM
slide-32
SLIDE 32

Linear Separator

Spam Ham

slide-33
SLIDE 33

Large Margin Classifier

f(x) = ⟨w, x⟩ + b   (a linear function)
⟨w, x⟩ + b ≥ 1,  ⟨w, x⟩ + b ≤ −1

a linear separator is impossible

slide-34
SLIDE 34

Large Margin Classifier

⟨w, x⟩ + b ≥ 1,  ⟨w, x⟩ + b ≤ −1

Theorem (Minsky & Papert): finding the minimum error separating hyperplane is NP-hard

minimum error separator is impossible

slide-35
SLIDE 35

Adding slack variables

⟨w, x⟩ + b ≤ −1 + ξ   and   ⟨w, x⟩ + b ≥ 1 − ξ

Convex optimization problem

minimize the amount of slack

slide-36
SLIDE 36

margin violation vs. misclassification

misclassification is also margin violation (ξ>0)

slide-37
SLIDE 37

Adding slack variables

  • Hard margin problem
  • With slack variables

    hard margin:    minimize_{w,b} ½‖w‖²   subject to  y_i(⟨w, x_i⟩ + b) ≥ 1

    with slack:     minimize_{w,b} ½‖w‖² + C Σ_i ξ_i   subject to  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0

The problem is always feasible. Proof: w = 0, b = 0, ξ_i = 1 (this also yields an upper bound).

C = 0? C = +∞?   C = +∞ ⟹ no tolerance for violations ⟹ hard margin

w determines ξ

slide-38
SLIDE 38

Review: Convex Optimization

  • Primal optimization problem
  • Lagrange function
  • First order optimality conditions in x
  • Solve for x and plug it back into L

(keep explicit constraints)

    minimize_x f(x)   subject to  c_i(x) ≤ 0

    L(x, α) = f(x) + Σ_i α_i c_i(x)

    ∂_x L(x, α) = ∂_x f(x) + Σ_i α_i ∂_x c_i(x) = 0

    maximize_α  L(x(α), α)

slide-39
SLIDE 39
  • Primal optimization problem
  • Lagrange function
  • optimality in (w, ξ) is at a saddle point with α, η
  • Derivatives in (w, ξ) need to vanish

Dual Problem

    minimize_{w,b} ½‖w‖² + C Σ_i ξ_i   subject to  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0

    L(w, b, ξ, α, η) = ½‖w‖² + C Σ_i ξ_i − Σ_i α_i [y_i(⟨x_i, w⟩ + b) + ξ_i − 1] − Σ_i η_i ξ_i

slide-40
SLIDE 40

Dual Problem

  • Lagrange function
  • Derivatives in w need to vanish
  • Plugging terms back into L yields

    L(w, b, ξ, α, η) = ½‖w‖² + C Σ_i ξ_i − Σ_i α_i [y_i(⟨x_i, w⟩ + b) + ξ_i − 1] − Σ_i η_i ξ_i

    ∂_w L(w, b, ξ, α, η) = w − Σ_i α_i y_i x_i = 0
    ∂_b L(w, b, ξ, α, η) = Σ_i α_i y_i = 0
    ∂_{ξ_i} L(w, b, ξ, α, η) = C − α_i − η_i = 0

    maximize_α  −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
    subject to  Σ_i α_i y_i = 0  and  α_i ∈ [0, C]      (α: dual variables; the box constraint bounds each example's influence)

slide-41
SLIDE 41

w

Karush Kuhn Tucker Conditions

    w = Σ_i y_i α_i x_i

    L(w, b, ξ, α, η) = ½‖w‖² + C Σ_i ξ_i − Σ_i α_i [y_i(⟨x_i, w⟩ + b) + ξ_i − 1] − Σ_i η_i ξ_i

    α_i [y_i(⟨w, x_i⟩ + b) + ξ_i − 1] = 0,    η_i ξ_i = 0

    0 ≤ α_i = C − η_i ≤ C

    α_i = 0       ⟹  y_i(⟨w, x_i⟩ + b) ≥ 1
    0 < α_i < C   ⟹  y_i(⟨w, x_i⟩ + b) = 1
    α_i = C       ⟹  y_i(⟨w, x_i⟩ + b) ≤ 1

why are these three cases not disjoint?

(figure labels: α = 0, 0 < α < C, α = C)
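A small sketch for checking these three cases on a trained model (names are mine; margin_i stands for y_i(⟨w, x_i⟩ + b) and alpha_i for the dual variable, e.g. taken from the sklearn examples on the later slides):

    def kkt_case(alpha_i, margin_i, C, tol=1e-6):
        """Classify a training point by its KKT case and check the implied margin condition."""
        if alpha_i < tol:
            return "alpha=0", margin_i >= 1 - tol        # expect y_i f(x_i) >= 1
        if alpha_i > C - tol:
            return "alpha=C", margin_i <= 1 + tol        # expect y_i f(x_i) <= 1
        return "0<alpha<C", abs(margin_i - 1) <= 1e-3    # expect y_i f(x_i) = 1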

slide-42
SLIDE 42

Support Vectors and Violations

all circled and squared examples are support vectors (α > 0); they include ξ = 0 (hard-margin SVs), ξ > 0 (margin violations), and ξ > 1 (misclassifications)

slide-43
SLIDE 43

Support Vectors and Violations

all circled and squared examples are support vectors (α > 0); they include ξ = 0 (hard-margin SVs), ξ > 0 (margin violations), and ξ > 1 (misclassifications)

(0<α≤C, ξ≥0) support vectors

(α=C, ξ>0)

margin violations

non-support vectors (α=0, ξ=0)

(α=C, ξ>1)

misclassifications

α=C

slide-44
SLIDE 44

SVM with sklearn

python demo.py 1e10 python demo.py 1e10

slide-45
SLIDE 45

SVM with sklearn

python demo.py 0.01 python demo.py 0.1

slide-46
SLIDE 46

SVM with sklearn


In [2]: clf = svm.SVC(kernel='linear', C=1e10)
In [3]: X = [[1,1], [1,-1], [-1,1], [-1,-1]]
In [4]: Y = [1,1,-1,-1]
In [5]: clf.fit(X, Y)
In [6]: clf.support_
Out[6]: array([3, 1], dtype=int32)
In [7]: clf.dual_coef_
Out[7]: array([[-0.5, 0.5]])
In [8]: clf.coef_
Out[8]: array([[ 1., 0.]])
In [9]: clf.intercept_
Out[9]: array([-0.])
In [10]: clf.support_vectors_
Out[10]: array([[-1., -1.],
                [ 1., -1.]])
In [11]: clf.n_support_
Out[11]: array([1, 1], dtype=int32)

In [2]: X = [[1,1], [1,-1], [-1,1], [-1,-1], [-0.1,-0.1]]
In [3]: Y = [1,1,-1,-1, 1]
In [4]: clf = svm.SVC(kernel='linear', C=1)
In [5]: clf.fit(X, Y)
In [6]: clf.support_vectors_
Out[6]: array([[-1. ,  1. ],
               [-1. , -1. ],
               [ 1. , -1. ],
               [-0.1, -0.1]])
In [7]: clf.dual_coef_
Out[7]: array([[-0.45, -0.6 ,  0.05,  1.  ]])
In [8]: clf.coef_
Out[8]: array([[ 1.00000000e+00, 1.49011611e-09]])
In [9]: clf.intercept_
Out[9]: array([-0.])

α=C

α=C

slide-47
SLIDE 47

SVM with sklearn


In [2]: clf = svm.SVC(kernel='linear', C=1e10)
In [3]: X = [[1,1], [1,-1], [-1,1], [-1,-1]]
In [4]: Y = [1,1,-1,-1]
In [5]: clf.fit(X, Y)
In [6]: clf.support_
Out[6]: array([3, 1], dtype=int32)
In [7]: clf.dual_coef_
Out[7]: array([[-0.5, 0.5]])
In [8]: clf.coef_
Out[8]: array([[ 1., 0.]])
In [9]: clf.intercept_
Out[9]: array([-0.])
In [10]: clf.support_vectors_
Out[10]: array([[-1., -1.],
                [ 1., -1.]])
In [11]: clf.n_support_
Out[11]: array([1, 1], dtype=int32)

In [2]: X = [[1,1], [1,-1], [-1,1], [-1,-1], [-0.1,-0.1]]
In [3]: Y = [1,1,-1,-1, 1]
In [12]: clf = svm.SVC(kernel='linear', C=1e10)
In [13]: clf.fit(X, Y)
In [14]: clf.coef_
Out[14]: array([[ 2.02010102e+00, 1.00999543e-04]])
In [15]: clf.intercept_
Out[15]: array([ 1.02013469])
In [16]: clf.support_vectors_
Out[16]: array([[-1.  ,  1.  ],
                [-1.  , -1.  ],
                [-0.01, -0.01]])
In [17]: clf.dual_coef_
Out[17]: array([[-1.01000001, -1.03050607,  2.04050608]])

slide-48
SLIDE 48

SVM with sklearn


In [2]: clf = svm.SVC(kernel='linear', C=1e10)
In [3]: X = [[1,1], [1,-1], [-1,1], [-1,-1]]
In [4]: Y = [1,1,-1,-1]
In [5]: clf.fit(X, Y)
In [6]: clf.support_
Out[6]: array([3, 1], dtype=int32)
In [7]: clf.dual_coef_
Out[7]: array([[-0.5, 0.5]])
In [8]: clf.coef_
Out[8]: array([[ 1., 0.]])
In [9]: clf.intercept_
Out[9]: array([-0.])
In [10]: clf.support_vectors_
Out[10]: array([[-1., -1.],
                [ 1., -1.]])
In [11]: clf.n_support_
Out[11]: array([1, 1], dtype=int32)

In [2]: X = [[1,1], [1,-1], [-1,1], [-1,-1], [-2,0]]
In [3]: Y = [1,1,-1,-1, 1]
In [4]: clf = svm.SVC(kernel='linear', C=1)
In [5]: clf.fit(X, Y)
In [6]: clf.coef_
Out[6]: array([[ 1., 0.]])
In [7]: clf.support_vectors_
Out[7]: array([[-1.,  1.],
               [-1., -1.],
               [ 1.,  1.],
               [ 1., -1.],
               [-2.,  0.]])
In [8]: clf.dual_coef_
Out[8]: array([[-1. , -1. ,  0.5,  0.5,  1. ]])
In [9]: clf.intercept_
Out[9]: array([-0.])

α=C α=C α=C

    α_i = 0       ⟹  y_i(⟨w, x_i⟩ + b) ≥ 1
    0 < α_i < C   ⟹  y_i(⟨w, x_i⟩ + b) = 1
    α_i = C       ⟹  y_i(⟨w, x_i⟩ + b) ≤ 1

what if C=1e10?

slide-49
SLIDE 49

hard vs. soft margins

slide-50
SLIDE 50

hard vs. soft margins

slide-51
SLIDE 51

C=1


slide-52
SLIDE 52

C=20


slide-53
SLIDE 53

Optimization

From Constrained Optimization to Unconstrained Optimization (back to Primal)

Learning an SVM has been formulated as a constrained optimization problem over w and ξ:

    min_{w ∈ R^d, ξ_i ∈ R^+}  ‖w‖² + C Σ_i^N ξ_i   subject to  y_i(wᵀx_i + b) ≥ 1 − ξ_i  for i = 1 ... N

The constraint y_i(wᵀx_i + b) ≥ 1 − ξ_i can be written more concisely as

    y_i f(x_i) ≥ 1 − ξ_i

which, together with ξ_i ≥ 0, is equivalent to

    ξ_i = max(0, 1 − y_i f(x_i))

Hence the learning problem is equivalent to the unconstrained optimization problem over w:

    min_{w ∈ R^d}  ‖w‖² + C Σ_i^N max(0, 1 − y_i f(x_i))
    (regularization)      (loss function)

slides 49-56 from Andrew Zisserman (Oxford/DeepMind); with annotations

w determines ξ
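A small sketch of the equivalence (and of "w determines ξ"): for any fixed (w, b) the tightest feasible slacks are exactly the hinge losses, so the constrained and unconstrained objectives coincide.

    import numpy as np

    def svm_objective(w, b, X, y, C):
        scores = X @ w + b                           # f(x_i) = w^T x_i + b
        xi = np.maximum(0.0, 1.0 - y * scores)       # optimal slacks: w (and b) determine xi
        return w @ w + C * xi.sum(), xi              # ||w||^2 + C * sum_i xi_i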

slide-54
SLIDE 54

Loss function

(figure labels: w, Support Vector, wᵀx + b = 0)

    min_{w ∈ R^d}  ‖w‖² + C Σ_i^N max(0, 1 − y_i f(x_i))

Points are in three categories:

  • 1. yif(xi) > 1

Point is outside margin. No contribution to loss

  • 2. yif(xi) = 1

Point is on margin. No contribution to loss. As in hard margin case.

  • 3. yif(xi) < 1

Point violates margin constraint. Contributes to loss

loss function


(margin violation ξ>0, including misclassification ξ>1)

slide-55
SLIDE 55

Loss functions

  • SVM uses “hinge” loss
  • an approximation to the 0-1 loss

    hinge loss: max(0, 1 − y_i f(x_i)),   plotted against y_i f(x_i)

(plot annotations: y_i f(x_i) ≥ 1 is good; y_i f(x_i) < 0 is very bad (misclassification); 0 ≤ y_i f(x_i) < 1 is not good enough)

(perceptron uses a shifted hinge-loss touching the origin)

slide-56
SLIDE 56

SVM

    min_{w ∈ R^d}  C Σ_i^N max(0, 1 − y_i f(x_i)) + ‖w‖²
                   (convex)                        (convex)

convex + convex = convex!

slide-57
SLIDE 57

Gradient (or steepest) descent algorithm for SVM

First, rewrite the optimization problem as an average:

    min_w C(w) = (λ/2)‖w‖² + (1/N) Σ_i^N max(0, 1 − y_i f(x_i))
               = (1/N) Σ_i^N [ (λ/2)‖w‖² + max(0, 1 − y_i f(x_i)) ]

(with λ = 2/(NC), up to an overall scale of the problem) and f(x) = wᵀx + b

Because the hinge loss is not differentiable, a sub-gradient is computed.

To minimize a cost function C(w), use the iterative update

    w_{t+1} ← w_t − η_t ∇_w C(w_t)

where η is the learning rate.

slide-58
SLIDE 58

Sub-gradient for hinge loss

    L(x_i, y_i; w) = max(0, 1 − y_i f(x_i)),    f(x_i) = wᵀx_i + b

sub-gradients (as a function of y_i f(x_i)):

    ∂L/∂w = −y_i x_i   if  y_i f(x_i) < 1
    ∂L/∂w = 0          otherwise

slide-59
SLIDE 59

Sub-gradient descent algorithm for SVM

    C(w) = (1/N) Σ_i^N [ (λ/2)‖w‖² + L(x_i, y_i; w) ]

The iterative update is

    w_{t+1} ← w_t − η ∇_w C(w_t)
            ← w_t − η (1/N) Σ_i^N (λ w_t + ∇_w L(x_i, y_i; w_t))      (batch gradient)

where η is the learning rate. Then each iteration t involves cycling through the training data with the updates (online gradient; just like the perceptron, whose condition is y_i f(x_i) ≤ 0):

    w_{t+1} ← w_t − η (λ w_t − y_i x_i)   if  y_i f(x_i) < 1
            ← w_t − η λ w_t               otherwise

In the Pegasos algorithm the learning rate is set at η_t = 1/(λt)
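A minimal sketch of these updates (bias omitted; names are mine). With the Pegasos rate η_t = 1/(λt) and one randomly chosen example per step, this is the stochastic/online version:

    import numpy as np

    def pegasos(X, y, lam=0.1, epochs=10, seed=0):
        """SGD on  (lam/2)||w||^2 + (1/N) sum_i max(0, 1 - y_i w.x_i)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        t = 0
        for _ in range(epochs):
            for i in rng.permutation(n):             # randomly sample from the training data
                t += 1
                eta = 1.0 / (lam * t)                # Pegasos learning rate
                if y[i] * (w @ X[i]) < 1:            # margin violated: sub-gradient is lam*w - y_i*x_i
                    w = (1 - eta * lam) * w + eta * y[i] * X[i]
                else:                                # only the regularizer contributes
                    w = (1 - eta * lam) * w
        return w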

slide-60
SLIDE 60


slide-61
SLIDE 61

Pegasos – Stochastic Gradient Descent Algorithm

Randomly sample from the training data

(figure: training objective ("energy", log scale) vs. iterations, plus 2-D plots of the sampled data)

SGD is online update: gradient on one example (unbiasedly) approximates the gradient on the whole training data (SGD)

