Lecture 17: Multi-class SVMs and Kernels. Aykut Erdem, December 2016, Hacettepe University.



SLIDE 1

Lecture 17:
  • Multi-class SVMs
  • Kernels

Aykut Erdem
December 2016, Hacettepe University

SLIDE 2

Administrative

  • We will have a make-up lecture on Saturday, December 17, 2016 (I will check the date today).
  • Project progress reports are due today!

SLIDE 3

Last time… Support Vector Machines

The two classes satisfy $\langle w, x \rangle + b \ge 1$ and $\langle w, x \rangle + b \le -1$; the decision function $f(x) = \langle w, x \rangle + b$ is a linear function.

slide by Alex Smola

SLIDE 4

Last time… Support Vector Machines

Margin boundaries: $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$.

  • Optimization problem
$$\underset{w,b}{\text{maximize}} \;\; \frac{1}{\|w\|} \quad \text{subject to} \quad y_i \,[\langle x_i, w \rangle + b] \ge 1$$

slide by Alex Smola

SLIDE 5

Last time… Support Vector Machines

Margin boundaries: $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$.

  • Optimization problem
$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i \,[\langle x_i, w \rangle + b] \ge 1$$

slide by Alex Smola

SLIDE 6

Last time… Support Vector Machines

Primal problem:
$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i \,[\langle x_i, w \rangle + b] \ge 1$$

Dual problem:
$$\underset{\alpha}{\text{maximize}} \;\; -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \ge 0$$

with $w = \sum_i y_i \alpha_i x_i$.

slide by Alex Smola
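The dual problem is what SVM solvers actually work with. Below is a minimal sketch (assuming NumPy and scikit-learn are available; neither is named in the lecture) that fits a linear SVM on a toy separable dataset and recovers the primal weight vector from the dual coefficients via $w = \sum_i y_i \alpha_i x_i$.

```python
# A minimal sketch: fit a linear SVM, then rebuild w from the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1e6).fit(X, y)     # a very large C approximates the hard margin

# dual_coef_ holds y_i * alpha_i for the support vectors only
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))      # True: both give the same weight vector
```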

SLIDE 7

Last time… Large Margin Classifier

Margin boundaries: $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$. The support vectors are the points with $\alpha_i > 0$.

SLIDE 8

Last time… Soft-margin Classifier

In general the constraints $\langle w, x \rangle + b \ge 1$ and $\langle w, x \rangle + b \le -1$ cannot all be satisfied.

Theorem (Minsky & Papert)
Finding the minimum error separating hyperplane is NP hard, so computing the minimum error separator is practically infeasible.

slide by Alex Smola

SLIDE 9

Last time… Adding Slack Variables

Relaxed constraints: $\langle w, x \rangle + b \ge 1 - \xi$ and $\langle w, x \rangle + b \le -1 + \xi$, with $\xi_i \ge 0$.

Convex optimization problem: minimize the amount of slack.

slide by Alex Smola

SLIDE 10

Last time… Adding Slack Variables

Relaxed constraints: $\langle w, x \rangle + b \ge 1 - \xi$ and $\langle w, x \rangle + b \le -1 + \xi$, with $\xi_i \ge 0$.

Convex optimization problem: minimize the amount of slack.

  • For $0 < \xi \le 1$ the point is between the margin and correctly classified
  • For $\xi > 1$ the point is misclassified

adopted from Andrew Zisserman

SLIDE 11

Last time… Adding Slack Variables

  • Hard margin problem
$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i \,[\langle w, x_i \rangle + b] \ge 1$$

  • With slack variables
$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i \,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

The problem is always feasible. Proof: take $w = 0$, $b = 0$, and $\xi_i = 1$ (this also yields an upper bound on the objective).

slide by Alex Smola

SLIDE 12

Soft-margin classifier

  • Optimisation problem:
$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i \,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

C is a regularization parameter:
  • small C allows constraints to be easily ignored → large margin
  • large C makes constraints hard to ignore → narrow margin
  • C = ∞ enforces all constraints: hard margin

adopted from Andrew Zisserman
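A minimal sketch (using scikit-learn, an assumption on my part) of the trade-off above: since the margin width is $2/\|w\|$, we can read off how the margin shrinks and the support-vector count changes as C grows.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(30, 2) + [1.5, 1.5], rng.randn(30, 2) - [1.5, 1.5]])
y = np.hstack([np.ones(30), -np.ones(30)])

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)          # margin width is 2 / ||w||
    print(f"C={C:>6}: margin = {margin:.3f}, support vectors = {clf.n_support_.sum()}")
# Expected pattern: small C -> wide margin (constraints easily ignored),
# large C -> narrow margin (constraints hard to ignore).
```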
SLIDE 13

Demo time…

13

SLIDE 14

This week

  • Multi-class classification
  • Introduction to kernels

14

SLIDES 15-17

Multi-class classification

[Figures illustrating the multi-class classification setting.]

slides by Eric Xing

SLIDE 18

One versus all classification

  • Learn 3 classifiers:
    – "−" vs. {o, +}, weights w−
    – "+" vs. {o, −}, weights w+
    – "o" vs. {+, −}, weights wo
  • Predict label using: $\hat{y} = \arg\max_{y} \; \langle w_y, x \rangle + b_y$
  • Any problems?
  • Could we learn this dataset?

slide by Eric Xing
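A minimal sketch (scikit-learn assumed; the dataset is a stand-in for the one on the slide) of one-versus-all with linear SVMs: one binary classifier per class, prediction by the largest score.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=150, centers=3, random_state=0)    # three classes: 0, 1, 2

# One binary SVM per class: class c vs. the rest
classifiers = [LinearSVC(C=1.0).fit(X, (y == c).astype(int)) for c in range(3)]

scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
y_pred = scores.argmax(axis=1)                                  # label with the highest score
print("training accuracy:", (y_pred == y).mean())
```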

SLIDE 19

Multi-class SVM

  • Simultaneously learn 3 sets of weights: w+, w−, wo
  • How do we guarantee the correct labels?
  • Need new constraints!

The "score" of the correct class must be better than the "score" of the wrong classes.

slide by Eric Xing

SLIDE 20

Multi-class SVM

  • As for the SVM, we introduce slack variables and maximize the margin: the score of the correct class must beat the score of every wrong class by a margin, minus a slack term.

To predict, we use: $\hat{y} = \arg\max_{y} \; \langle w_y, x \rangle + b_y$

Now can we learn it?

slide by Eric Xing
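A minimal sketch (scikit-learn assumed) contrasting the two strategies on these slides: independently trained one-vs-rest SVMs versus jointly learned weights whose constraints require the correct class to out-score the wrong ones (the Crammer-Singer formulation, one concrete instance of the idea above).

```python
from sklearn.svm import LinearSVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

ovr = LinearSVC(multi_class="ovr", C=1.0).fit(X, y)               # one vs. all
joint = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X, y)  # joint multi-class SVM

print("one-vs-rest accuracy:   ", ovr.score(X, y))
print("crammer_singer accuracy:", joint.score(X, y))
```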

SLIDE 21

Kernels

21

slide by Alex Smola

SLIDE 22

Non-linear features

  • Regression: we got nonlinear functions by preprocessing
  • Perceptron:
    • Map data into feature space $x \to \phi(x)$
    • Solve the problem in this space
    • Query-replace $\langle x, x' \rangle$ by $\langle \phi(x), \phi(x') \rangle$ in the code
  • Feature Perceptron: solution lies in the span of the $\phi(x_i)$

slide by Alex Smola

SLIDE 23

Non-linear features

  • Separating surfaces are circles, hyperbolae, parabolae

slide by Alex Smola

SLIDE 24

Solving XOR

  • XOR is not linearly separable
  • Mapping into 3 dimensions, $(x_1, x_2) \mapsto (x_1, x_2, x_1 x_2)$, makes it easily solvable

slide by Alex Smola
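A minimal sketch (NumPy and scikit-learn assumed) of the XOR example: a linear SVM cannot fit the four XOR points in 2D, but after adding the product feature $x_1 x_2$ it separates them perfectly.

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])                        # XOR labels

X3 = np.column_stack([X, X[:, 0] * X[:, 1]])        # (x1, x2) -> (x1, x2, x1*x2)

print("2D accuracy:", LinearSVC(C=100.0).fit(X, y).score(X, y))    # cannot reach 1.0
print("3D accuracy:", LinearSVC(C=100.0).fit(X3, y).score(X3, y))  # 1.0: linearly separable
```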

SLIDE 25

Linear Separation with Quadratic Kernels

25

slide by Alex Smola

SLIDE 26

Quadratic Features

Quadratic features in $\mathbb{R}^2$:
$$\Phi(x) := \left( x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2 \right)$$

Dot product:
$$\langle \Phi(x), \Phi(x') \rangle = \left\langle \left( x_1^2, \sqrt{2}\, x_1 x_2, x_2^2 \right), \left( x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2 \right) \right\rangle = \langle x, x' \rangle^2.$$

Insight: the trick works for any polynomial of order $d$ via $\langle x, x' \rangle^d$.

slide by Alex Smola
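A quick numerical check (plain NumPy) of the identity on this slide: the explicit quadratic feature map and the squared dot product give the same value.

```python
import numpy as np

def phi(x):
    # explicit quadratic feature map Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.RandomState(0)
x, xp = rng.randn(2), rng.randn(2)

lhs = phi(x) @ phi(xp)          # feature map, then dot product
rhs = (x @ xp) ** 2             # kernel evaluated directly in input space
print(np.isclose(lhs, rhs))     # True
```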

SLIDE 27

Computational Efficiency

Problem
Extracting features can sometimes be very costly. Example: second order features in 1000 dimensions. This leads to about $5 \cdot 10^5$ numbers. For higher order polynomial features it is much worse.

Solution
Don't compute the features; try to compute the dot products implicitly. For some features this works…

Definition
A kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric function in its arguments for which the following property holds: $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$ for some feature map $\Phi$. If $k(x, x')$ is much cheaper to compute than $\Phi(x)$…

slide by Alex Smola

SLIDE 28

Recap: The Perceptron

  • Nothing happens if classified correctly
  • Weight vector is a linear combination of the training points
  • Classifier is a linear combination of inner products

initialize w = 0 and b = 0
repeat
  if $y_i \,[\langle w, x_i \rangle + b] \le 0$ then
    $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
  end if
until all classified correctly

$$w = \sum_{i \in I} y_i x_i \qquad f(x) = \sum_{i \in I} y_i \langle x_i, x \rangle + b$$

slide by Alex Smola

SLIDE 29

Recap: The Perceptron on features

  • Nothing happens if classified correctly
  • Weight vector is a linear combination of the feature vectors
  • Classifier is a linear combination of inner products

initialize w = 0 and b = 0
repeat
  pick $(x_i, y_i)$ from the data
  if $y_i (w \cdot \Phi(x_i) + b) \le 0$ then
    $w \leftarrow w + y_i \Phi(x_i)$ and $b \leftarrow b + y_i$
  end if
until $y_i (w \cdot \Phi(x_i) + b) > 0$ for all $i$

$$w = \sum_{i \in I} y_i \phi(x_i) \qquad f(x) = \sum_{i \in I} y_i \langle \phi(x_i), \phi(x) \rangle + b$$

slide by Alex Smola

SLIDE 30

The Kernel Perceptron

  • Nothing happens if classified correctly
  • Weight vector is a linear combination of the feature vectors
  • Classifier is a linear combination of inner products

initialize f = 0
repeat
  pick $(x_i, y_i)$ from the data
  if $y_i f(x_i) \le 0$ then
    $f(\cdot) \leftarrow f(\cdot) + y_i k(x_i, \cdot) + y_i$
  end if
until $y_i f(x_i) > 0$ for all $i$

$$w = \sum_{i \in I} y_i \phi(x_i) \qquad f(x) = \sum_{i \in I} y_i \langle \phi(x_i), \phi(x) \rangle + b = \sum_{i \in I} y_i k(x_i, x) + b$$

slide by Alex Smola
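A minimal sketch (plain NumPy; the RBF kernel and the toy data are my own choices) of the kernel perceptron above: instead of an explicit weight vector we keep one mistake count per training point and evaluate $f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_perceptron(X, y, kernel=rbf, epochs=20):
    n = len(X)
    alpha, b = np.zeros(n), 0.0
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            f_xi = np.sum(alpha * y * K[:, i]) + b
            if y[i] * f_xi <= 0:          # misclassified: add y_i * k(x_i, .) and y_i to f
                alpha[i] += 1.0
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

# XOR, which a linear perceptron cannot solve, is learned with the RBF kernel:
X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
y = np.array([1., -1., -1., 1.])
alpha, b = kernel_perceptron(X, y)
preds = np.sign([np.sum(alpha * y * [rbf(xi, x) for xi in X]) + b for x in X])
print(preds)   # matches y once training has converged
```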

SLIDE 31

Processing Pipeline

  • Original data
  • Data in feature space (implicit)
  • Solve in feature space using kernels

31

slide by Alex Smola

SLIDE 32

Polynomial Kernels

Idea
We want to extend $k(x, x') = \langle x, x' \rangle^2$ to $k(x, x') = (\langle x, x' \rangle + c)^d$ where $c > 0$ and $d \in \mathbb{N}$. Prove that such a kernel corresponds to a dot product.

Proof strategy
Simple and straightforward: compute the explicit sum given by the kernel, i.e.
$$k(x, x') = (\langle x, x' \rangle + c)^d = \sum_{i=0}^{d} \binom{d}{i} \langle x, x' \rangle^i \, c^{d-i}.$$
Individual terms $\langle x, x' \rangle^i$ are dot products for some $\Phi_i(x)$.

slide by Alex Smola

SLIDE 33

Kernel Conditions

Computability
We have to be able to compute $k(x, x')$ efficiently (much cheaper than the dot products themselves).

"Nice and Useful" Functions
The features themselves have to be useful for the learning problem at hand. Quite often this means smooth functions.

Symmetry
Obviously $k(x, x') = k(x', x)$ due to the symmetry of the dot product $\langle \Phi(x), \Phi(x') \rangle = \langle \Phi(x'), \Phi(x) \rangle$.

Dot Product in Feature Space
Is there always a $\Phi$ such that $k$ really is a dot product?

slide by Alex Smola

SLIDE 34

Mercer's Theorem

The Theorem
For any symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is square integrable in $\mathcal{X} \times \mathcal{X}$ and which satisfies
$$\int_{\mathcal{X} \times \mathcal{X}} k(x, x')\, f(x) f(x')\, dx\, dx' \ge 0 \quad \text{for all } f \in L_2(\mathcal{X}),$$
there exist $\phi_i : \mathcal{X} \to \mathbb{R}$ and numbers $\lambda_i \ge 0$ where
$$k(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x') \quad \text{for all } x, x' \in \mathcal{X}.$$

Interpretation
The double integral is the continuous version of a vector-matrix-vector multiplication. For positive semidefinite matrices we have
$$\sum_i \sum_j k(x_i, x_j)\, \alpha_i \alpha_j \ge 0.$$

slide by Alex Smola

SLIDE 35

Properties

Distance in Feature Space
Distance between points in feature space via
$$d(x, x')^2 := \|\Phi(x) - \Phi(x')\|^2 = \langle \Phi(x), \Phi(x) \rangle - 2\langle \Phi(x), \Phi(x') \rangle + \langle \Phi(x'), \Phi(x') \rangle = k(x, x) + k(x', x') - 2k(x, x')$$

Kernel Matrix
To compare observations we compute dot products, so we study the matrix $K$ given by $K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle = k(x_i, x_j)$, where the $x_i$ are the training patterns.

Similarity Measure
The entries $K_{ij}$ tell us the overlap between $\Phi(x_i)$ and $\Phi(x_j)$, so $k(x_i, x_j)$ is a similarity measure.

slide by Alex Smola

SLIDE 36

Properties

K is Positive Semidefinite
Claim: $\alpha^\top K \alpha \ge 0$ for all $\alpha \in \mathbb{R}^m$ and all kernel matrices $K \in \mathbb{R}^{m \times m}$. Proof:
$$\sum_{i,j}^{m} \alpha_i \alpha_j K_{ij} = \sum_{i,j}^{m} \alpha_i \alpha_j \langle \Phi(x_i), \Phi(x_j) \rangle = \left\langle \sum_{i}^{m} \alpha_i \Phi(x_i), \sum_{j}^{m} \alpha_j \Phi(x_j) \right\rangle = \left\| \sum_{i=1}^{m} \alpha_i \Phi(x_i) \right\|^2 \ge 0$$

Kernel Expansion
If $w$ is given by a linear combination of the $\Phi(x_i)$ we get
$$\langle w, \Phi(x) \rangle = \left\langle \sum_{i=1}^{m} \alpha_i \Phi(x_i), \Phi(x) \right\rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x).$$

slide by Alex Smola
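A small numerical illustration (NumPy) of the claim above: the Gram matrix of a valid kernel has no negative eigenvalues, so $\alpha^\top K \alpha \ge 0$ for any $\alpha$. The Gaussian RBF kernel is used here only as an example.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 3)

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)                      # Gaussian RBF kernel matrix

eigvals = np.linalg.eigvalsh(K)                  # symmetric matrix, so use eigvalsh
print("smallest eigenvalue:", eigvals.min())     # >= 0 up to numerical precision

alpha = rng.randn(50)
print("alpha^T K alpha =", alpha @ K @ alpha)    # nonnegative
```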

SLIDE 37

Examples

Examples of kernels $k(x, x')$:

  • Linear: $\langle x, x' \rangle$
  • Laplacian RBF: $\exp(-\lambda \|x - x'\|)$
  • Gaussian RBF: $\exp(-\lambda \|x - x'\|^2)$
  • Polynomial: $(\langle x, x' \rangle + c)^d$, $c \ge 0$, $d \in \mathbb{N}$
  • B-Spline: $B_{2n+1}(x - x')$
  • Cond. Expectation: $E_c[p(x|c)\, p(x'|c)]$

Simple trick for checking Mercer's condition: compute the Fourier transform of the kernel and check that it is nonnegative.

slide by Alex Smola
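A rough sketch (NumPy) of the "Fourier transform trick" above for translation-invariant kernels: sample $k(x - x')$ on a grid and check that its discrete Fourier transform is nonnegative. The grid, truncation, and kernel parameters here are arbitrary choices for illustration.

```python
import numpy as np

t = np.linspace(-20, 20, 2001)
for name, k in [("Gaussian RBF", np.exp(-t**2)),
                ("Laplacian RBF", np.exp(-np.abs(t)))]:
    spectrum = np.fft.fft(np.fft.ifftshift(k)).real          # FT of a symmetric function is real
    print(f"{name}: min of spectrum = {spectrum.min():.2e}")  # >= 0 up to numerical error
```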

SLIDES 38-42

[Figure slides: Linear Kernel, Laplacian Kernel, Gaussian Kernel, Polynomial of order 3, B3 Spline Kernel.]

slides by Alex Smola

SLIDE 43

Kernels in Computer Vision

  • Features x = histogram (of color, texture, etc.)
  • Common kernels:
    • Intersection Kernel
    • Chi-square Kernel

slide by Dhruv Batra
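A minimal sketch (NumPy) of the histogram intersection kernel named on this slide, $k(h, h') = \sum_i \min(h_i, h'_i)$; any normalization used in the lecture's figures is not reproduced here, and the histograms below are random stand-ins.

```python
import numpy as np

def intersection_kernel(H1, H2):
    # H1: (n, d) and H2: (m, d) histograms; returns the (n, m) Gram matrix of sum_i min(., .)
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(axis=-1)

rng = np.random.RandomState(0)
H = rng.rand(4, 8)
H /= H.sum(axis=1, keepdims=True)        # normalize each histogram to sum to 1
print(intersection_kernel(H, H))         # diagonal entries equal 1 for normalized histograms
```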

SLIDE 44

[Figure]

slide by Dhruv Batra
Image credit: Subhransu Maji

SLIDE 45

The Kernel Trick for SVMs

slide by Alex Smola

SLIDE 46

The Kernel Trick for SVMs

  • Linear soft margin problem
$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i \,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

  • Dual problem
$$\underset{\alpha}{\text{maximize}} \;\; -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \in [0, C]$$

  • Support vector expansion
$$f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$$

slide by Alex Smola

SLIDE 47

The Kernel Trick for SVMs

  • Soft margin problem in feature space
$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i \,[\langle w, \phi(x_i) \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

  • Dual problem
$$\underset{\alpha}{\text{maximize}} \;\; -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) + \sum_i \alpha_i \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \;\text{ and }\; \alpha_i \in [0, C]$$

  • Support vector expansion
$$f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$$

slide by Alex Smola
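A minimal sketch (scikit-learn assumed) of the kernel trick in practice: the dual solver only ever touches $k(x_i, x_j)$, so we can hand it a precomputed Gram matrix and never build feature vectors.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

def rbf_gram(A, B, gamma=1.0):
    # Gaussian RBF Gram matrix between the rows of A and the rows of B
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

K = rbf_gram(X, X)
clf = SVC(kernel="precomputed", C=10.0).fit(K, y)

# To predict at new points we only need k(x_i, x) for the training points x_i:
X_test, y_test = make_moons(n_samples=50, noise=0.2, random_state=1)
print("test accuracy:", clf.score(rbf_gram(X_test, X), y_test))
```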

SLIDES 48-55

[Plots: decision boundary of a kernel SVM on the same dataset as C increases through 1, 2, 5, 10, 20, 50, 100; the support vectors and the level sets y = -1, y = 0, y = +1 are marked.]

slides by Alex Smola
SLIDES 56-76

[Plots: the same sweep over C = 1, 2, 5, 10, 20, 50, 100, repeated three more times.]

slides by Alex Smola
SLIDES 77-81

And now with a narrower kernel

[Plots: the same experiment with a narrower kernel.]

slides by Alex Smola
SLIDES 82-83

And now with a very wide kernel

[Plots: the same experiment with a very wide kernel.]

slides by Alex Smola
SLIDE 84

Nonlinear Separation

  • Increasing C allows for more nonlinearities
  • Decreases the number of errors
  • The SV decision boundary need not be contiguous
  • The kernel width adjusts the function class

slide by Alex Smola
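A minimal sketch (scikit-learn assumed) of the last bullet: in the library's RBF kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$, a large $\gamma$ plays the role of a narrow kernel and a small $\gamma$ of a wide one, which changes the flexibility of the decision boundary.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

for gamma in [0.1, 1.0, 100.0]:           # wide ... narrow kernel
    clf = SVC(kernel="rbf", C=10.0, gamma=gamma).fit(X, y)
    print(f"gamma={gamma:>6}: train accuracy = {clf.score(X, y):.2f}, "
          f"support vectors = {clf.n_support_.sum()}")
# A very narrow kernel typically fits the training set almost perfectly (overfitting risk).
```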

SLIDE 85

Overfitting?

  • Huge feature space with kernels: should we worry about overfitting?
  • The SVM objective seeks a solution with a large margin
  • Theory says that a large margin leads to good generalization (we will see this in a couple of lectures)
  • But everything overfits sometimes!
  • We can control overfitting by:
    • Setting C
    • Choosing a better kernel
    • Varying the parameters of the kernel (width of the Gaussian, etc.)

slide by Alex Smola
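A minimal sketch (scikit-learn assumed) of the control knobs on this slide: pick C and the kernel width by cross-validation rather than by training accuracy, so the chosen model is judged on data it was not fit to.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy: %.2f" % search.best_score_)
```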