SLIDE 1

REGULARIZED LEAST SQUARES AND SUPPORT VECTOR MACHINES

Francesca Odone and Lorenzo Rosasco

odone@disi.unige.it - lrosasco@mit.edu

BISS 2012

March 14, 2012

SLIDE 2

ABOUT THIS CLASS

GOAL: To introduce two main examples of Tikhonov regularization, deriving and comparing their computational properties.

SLIDE 3

BASICS: DATA

Training set: S = {(x1, y1), . . . , (xn, yn)}. Inputs: X = {x1, . . . , xn}. Labels: Y = {y1, . . . , yn}.

SLIDE 4

BASICS: RKHS, KERNEL

RKHS H with a positive semidefinite kernel function K:

  linear: K(x_i, x_j) = x_i^T x_j
  polynomial: K(x_i, x_j) = (x_i^T x_j + 1)^d
  gaussian: K(x_i, x_j) = exp( -||x_i - x_j||^2 / σ^2 )

Define the kernel matrix K to satisfy K_{ij} = K(x_i, x_j).

The kernel function with one argument fixed is K_x = K(x, ·). Given an arbitrary input x_*, K_{x_*} is a vector whose ith entry is K(x_i, x_*).
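As a concrete illustration (this sketch is not part of the original slides), the three kernel matrices above can be built in a few lines of Matlab/Octave; the toy data X and the parameters deg and sigma are assumptions made purely for the example.

  % Hypothetical toy data: n points in R^d
  n = 200; d = 5;
  X = randn(n, d);

  % Linear kernel matrix
  Klin = X * X';

  % Polynomial kernel matrix of (assumed) degree deg
  deg = 3;
  Kpoly = (X * X' + 1).^deg;

  % Gaussian kernel matrix with (assumed) bandwidth sigma
  sigma = 1;
  sqdist = sum(X.^2, 2) * ones(1, n) + ones(n, 1) * sum(X.^2, 2)' - 2 * (X * X');
  Kgauss = exp(-sqdist / sigma^2);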

SLIDE 5

TIKHONOV REGULARIZATION

We are interested in studying Tikhonov regularization

argmin_{f∈H} { Σ_{i=1}^n V(y_i, f(x_i)) + λ ||f||^2_H }.

SLIDE 6

REPRESENTER THEOREM

The representer theorem guarantees that the solution can be written as

f = Σ_{j=1}^n c_j K_{x_j}

for some c = (c_1, . . . , c_n) ∈ R^n.

So Kc is a vector whose ith element is f(x_i):

f(x_i) = Σ_{j=1}^n c_j K_{x_i}(x_j) = Σ_{j=1}^n c_j K_{ij},

and ||f||^2_H = c^T K c.

SLIDE 7

RKHS NORM AND REPRESENTER THEOREM

Since f = Σ_{j=1}^n c_j K_{x_j}, then

||f||^2_H = ⟨f, f⟩_H
          = ⟨ Σ_{i=1}^n c_i K_{x_i} , Σ_{j=1}^n c_j K_{x_j} ⟩_H
          = Σ_{i=1}^n Σ_{j=1}^n c_i c_j ⟨K_{x_i}, K_{x_j}⟩_H
          = Σ_{i=1}^n Σ_{j=1}^n c_i c_j K(x_i, x_j)
          = c^T K c

SLIDE 8

PLAN

RLS: dual problem, regularization path, linear case

SVM: dual problem, linear case, historical derivation

SLIDE 9

THE RLS PROBLEM

Goal: Find the function f ∈ H that minimizes the weighted sum of the square loss and the RKHS norm

argmin_{f∈H} { (1/2) Σ_{i=1}^n (f(x_i) − y_i)^2 + (λ/2) ||f||^2_H }.

SLIDE 10

RLS AND REPRESENTER THEOREM

Using the representer theorem, the RLS problem is:

argmin_{c∈R^n} (1/2) ||Y − Kc||^2_2 + (λ/2) c^T K c.

The above functional is differentiable; we can find the minimum by setting the gradient w.r.t. c to 0:

SLIDE 11

RLS AND REPRESENTER THEOREM

Using the representer theorem, the RLS problem is:

argmin_{c∈R^n} (1/2) ||Y − Kc||^2_2 + (λ/2) c^T K c.

The above functional is differentiable; we can find the minimum by setting the gradient w.r.t. c to 0:

−K(Y − Kc) + λKc = 0  ⇒  (K + λI)c = Y  ⇒  c = (K + λI)^{-1} Y

We find c by solving a system of linear equations.

SLIDE 12

SOLVING RLS FOR FIXED PARAMETERS

(K + λI)c = Y.

The matrix K + λI is symmetric positive definite, so the appropriate algorithm is Cholesky factorization. In Matlab, the "slash" operator seems to be using Cholesky, so you can just write c = (K + l*I)\Y, but to be safe (or in Octave), I suggest R = chol(K + l*I); c = (R\(R'\Y));.

The above algorithm has complexity O(n^3).
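To make the recipe above concrete, here is a self-contained Matlab/Octave sketch on a toy problem; the Gaussian kernel, the labels Y, and the value of the regularization parameter l are all assumptions made for illustration only.

  % Assumed toy RLS problem with a Gaussian kernel
  n = 200; d = 5;
  X = randn(n, d);
  Y = sign(randn(n, 1));
  sqdist = sum(X.^2, 2) * ones(1, n) + ones(n, 1) * sum(X.^2, 2)' - 2 * (X * X');
  K = exp(-sqdist);                 % sigma = 1
  l = 0.1;                          % regularization parameter lambda

  % Direct solve with the backslash operator
  c1 = (K + l * eye(n)) \ Y;

  % Explicit Cholesky factorization, as suggested on the slide
  R = chol(K + l * eye(n));         % upper triangular, R' * R = K + l * I
  c2 = R \ (R' \ Y);

  norm(c1 - c2)                     % should be numerically zero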

SLIDE 13

THE RLS SOLUTION, COMMENTS

c = (K + λI)^{-1} Y

The prediction at a new input x_* is:

f(x_*) = Σ_{j=1}^n c_j K_{x_j}(x_*) = K_{x_*} c = K_{x_*} G^{-1} Y,

where G = K + λI. Note that the above operation is O(n^2).

SLIDE 14

RLS REGULARIZATION PATH

Typically we have to choose λ, and hence compute the solutions corresponding to different values of λ. Is there a more efficient method than solving c(λ) = (K + λI)^{-1} Y anew for each λ?

SLIDE 15

RLS REGULARIZATION PATH

Typically we have to choose λ, and hence compute the solutions corresponding to different values of λ. Is there a more efficient method than solving c(λ) = (K + λI)^{-1} Y anew for each λ?

Form the eigendecomposition K = QΛQ^T, where Λ is diagonal with Λ_{ii} ≥ 0 and QQ^T = I. Then

G = K + λI = QΛQ^T + λI = Q(Λ + λI)Q^T,

which implies that G^{-1} = Q(Λ + λI)^{-1}Q^T.

SLIDE 16

RLS REGULARIZATION PATH CONT’D

O(n^3) time to solve one (dense) linear system, or to compute the eigendecomposition (the constant is maybe 4x worse). Given Q and Λ, we can find c(λ) in O(n^2) time:

c(λ) = Q(Λ + λI)^{-1}Q^T Y,

noting that (Λ + λI) is diagonal. Finding c(λ) for many λ's is (essentially) free!
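A minimal Matlab/Octave sketch of this idea, continuing from the toy K, Y, n of the previous sketch; the grid of λ values is an arbitrary choice for illustration.

  % One O(n^3) eigendecomposition of the symmetric kernel matrix
  [Q, L] = eig((K + K') / 2);       % symmetrize to guard against round-off
  ev = diag(L);                     % eigenvalues (nonnegative up to round-off)
  QtY = Q' * Y;

  % Each additional lambda now costs only O(n^2)
  lambdas = logspace(-6, 2, 50);
  Cs = zeros(n, numel(lambdas));
  for k = 1:numel(lambdas)
      Cs(:, k) = Q * (QtY ./ (ev + lambdas(k)));   % c(lambda) = Q (Lambda + lambda I)^{-1} Q' Y
  end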

SLIDE 17

PARAMETER CHOICE

Idea: try different λ and see which one performs best. How to try them? A simple choice is to use a validation set of data.

If we have "enough" training data we may sample out a training and a validation set. Otherwise a common practice is K-fold Cross Validation (KCV):

1. Divide the data into K sets of equal size: S_1, . . . , S_K.
2. For each i, train on the other K − 1 sets and test on the ith set.

If K = n we get the leave-one-out strategy (LOO).
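A rough Matlab/Octave sketch of K-fold cross validation for choosing λ in RLS, continuing from the toy K, Y, n above; the number of folds, the λ grid, and the squared-error validation score are assumptions made for the example.

  Kfolds = 5;
  lambdas = logspace(-6, 2, 30);
  idx = randperm(n);
  foldof = mod(0:n-1, Kfolds) + 1;            % fold label of each (permuted) point
  cverr = zeros(size(lambdas));

  for k = 1:Kfolds
      va = idx(foldof == k);                  % validation indices
      tr = idx(foldof ~= k);                  % training indices
      Ktr = K(tr, tr); Kva = K(va, tr);
      for j = 1:numel(lambdas)
          c = (Ktr + lambdas(j) * eye(numel(tr))) \ Y(tr);
          cverr(j) = cverr(j) + mean((Kva * c - Y(va)).^2);
      end
  end
  [~, best] = min(cverr);
  best_lambda = lambdas(best);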

SLIDE 18

PARAMETER CHOICE

Notice that some data should always be kept aside as a test set, to assess the generalization performance of the system after parameter tuning has taken place.

[Figure: the entire data set is split into TRAINING, VALIDATION, and TEST subsets.]

SLIDE 19

THE LINEAR CASE

The linear kernel is K(x_i, x_j) = x_i^T x_j.

The linear kernel offers many advantages for computation. Key idea: we get a decomposition of the kernel matrix for free: K = XX^T, where X = [x_1^T; . . . ; x_n^T] is the n × d data matrix.

In the linear case, we will see that we have two different computation options.

SLIDE 20

LINEAR KERNEL, LINEAR FUNCTION

With a linear kernel, the function we are learning is linear as well:

f(x_*) = K_{x_*} c = x_*^T X^T c = x_*^T w,

where we define w to be X^T c.

SLIDE 21

LINEAR KERNEL CONT.

For the linear kernel,

min_{c∈R^n} (1/2)||Y − Kc||^2_2 + (λ/2) c^T K c
  = min_{c∈R^n} (1/2)||Y − XX^T c||^2_2 + (λ/2) c^T XX^T c
  = min_{w∈R^d} (1/2)||Y − Xw||^2_2 + (λ/2)||w||^2_2.

Taking the gradient with respect to w and setting it to zero,

X^T X w − X^T Y + λw = 0,

we get

w = (X^T X + λI)^{-1} X^T Y.

SLIDE 22

SOLUTION FOR FIXED PARAMETER

w = (X^T X + λI)^{-1} X^T Y.

Cholesky decomposition allows us to solve the above problem in O(d^3) for any fixed λ. We can work with the covariance matrix X^T X ∈ R^{d×d}. The algorithm is identical to solving a general RLS problem, replacing the kernel matrix by X^T X and the label vector by X^T Y. We can classify new points in O(d) time, using w, rather than having to compute a weighted sum of n kernel products (which will usually cost O(nd) time).
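A minimal sketch of the linear-case solver for one fixed λ (Matlab/Octave, reusing the assumed toy X, Y, d from the earlier sketches):

  % Linear RLS in the d x d covariance formulation
  l = 0.1;
  A = X' * X + l * eye(d);          % d x d, symmetric positive definite
  b = X' * Y;                       % d x 1
  R = chol(A);
  w = R \ (R' \ b);                 % O(d^3) after X'X is formed

  % Predicting a new point is O(d)
  x_new = randn(1, d);
  y_hat = x_new * w;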

SLIDE 23

REGULARIZATION PATH VIA SVD

To compute solutions corresponding to multiple values of λ we can again consider an eigendecomposition/SVD. We need O(nd) memory to store the data in the first place. The SVD also requires O(nd) memory, and O(nd^2) time. Compared to the nonlinear case, we have replaced an O(n) with an O(d), in both time and memory. If n >> d, this can represent a huge savings.
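A sketch of the SVD-based regularization path in the linear case (assuming the same toy X, Y, d as before). With the economy-size SVD X = U S V^T, the solution becomes w(λ) = V (S^2 + λI)^{-1} S U^T Y, which the loop below evaluates on a grid of λ values.

  % Economy-size SVD: O(n d^2) time, O(n d) memory
  [U, S, V] = svd(X, 'econ');
  s = diag(S);                      % singular values of X
  UtY = U' * Y;

  % Each lambda now costs only O(d^2)
  lambdas = logspace(-6, 2, 50);
  W = zeros(d, numel(lambdas));
  for k = 1:numel(lambdas)
      W(:, k) = V * ((s .* UtY) ./ (s.^2 + lambdas(k)));
  end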

SLIDE 24

SUMMARY SO FAR

When can we solve one RLS problem? (I.e. what are the bottlenecks?)

SLIDE 25

SUMMARY SO FAR

When can we solve one RLS problem? (I.e. what are the bottlenecks?)

We need to form K, which takes O(n^2 d) time and O(n^2) memory. We need to perform a Cholesky factorization or an eigendecomposition of K, which takes O(n^3) time.

In the linear case we have replaced an O(n) with an O(d), in both time and memory. If n >> d, this can represent a huge savings.

Usually, we run out of memory before we run out of time. The practical limit on today's workstations is (more-or-less) 10,000 points (using Matlab).

SLIDE 26

PLAN

RLS: dual problem, regularization path, linear case

SVM: dual problem, linear case, historical derivation

SLIDE 27

THE HINGE LOSS

The support vector machine (SVM) for classification arises by considering the hinge loss

V(y, f(x)) ≡ (1 − y f(x))_+,

where (s)_+ ≡ max(s, 0).

[Figure: the hinge loss plotted as a function of y * f(x).]
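As a small aside (not on the slide), the hinge loss is a one-liner in Matlab/Octave; the sketch below just evaluates it on a grid of y * f(x) values, e.g. to reproduce a plot like the one described above.

  hinge = @(y, fx) max(1 - y .* fx, 0);   % V(y, f(x)) = (1 - y f(x))_+
  t = linspace(-3, 4, 200);               % grid of y * f(x) values
  plot(t, hinge(1, t));
  xlabel('y * f(x)'); ylabel('hinge loss');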

SLIDE 28

SVM STANDARD NOTATION

With the hinge loss, our regularization problem becomes

argmin_{f∈H} (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + λ||f||^2_H.

SLIDE 29

SVM STANDARD NOTATION

With the hinge loss, our regularization problem becomes

argmin_{f∈H} (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + λ||f||^2_H.

In most of the SVM literature, the problem is written as

argmin_{f∈H} C Σ_{i=1}^n V(y_i, f(x_i)) + (1/2)||f||^2_H.

The formulations are equivalent setting C = 1/(2λn).

This problem is non-differentiable (because of the "kink" in V).

SLIDE 30

SLACK VARIABLES FORMULATION

We rewrite the functional using slack variables ξ_i:

argmin_{f∈H}  C Σ_{i=1}^n ξ_i + (1/2)||f||^2_H
subject to:   ξ_i ≥ 1 − y_i f(x_i),   i = 1, . . . , n
              ξ_i ≥ 0,                i = 1, . . . , n

SLIDE 31

SLACK VARIABLES FORMULATION

We rewrite the functional using slack variables ξ_i:

argmin_{f∈H}  C Σ_{i=1}^n ξ_i + (1/2)||f||^2_H
subject to:   ξ_i ≥ 1 − y_i f(x_i),   i = 1, . . . , n
              ξ_i ≥ 0,                i = 1, . . . , n

Applying the representer theorem we get a constrained quadratic programming problem:

argmin_{c∈R^n, ξ∈R^n}  C Σ_{i=1}^n ξ_i + (1/2) c^T K c
subject to:   ξ_i ≥ 1 − y_i Σ_{j=1}^n c_j K(x_i, x_j),   i = 1, . . . , n
              ξ_i ≥ 0,                                    i = 1, . . . , n

SLIDE 32

HOW TO SOLVE?

argmin_{c∈R^n, ξ∈R^n}  C Σ_{i=1}^n ξ_i + (1/2) c^T K c
subject to:   ξ_i ≥ 1 − y_i ( Σ_{j=1}^n c_j K(x_i, x_j) ),   i = 1, . . . , n
              ξ_i ≥ 0,                                        i = 1, . . . , n

This is a constrained optimization problem. The general approach:

Form the primal problem – we did this.
Lagrangian from primal – just like Lagrange multipliers.
Dual – one dual variable associated to each primal constraint in the Lagrangian.

SLIDE 33

LAGRANGIAN AND DUAL

We derive the dual from the primal using the Lagrangian

L(c, ξ, α, ζ) = C Σ_{i=1}^n ξ_i + (1/2) c^T K c − Σ_{i=1}^n α_i ( y_i { Σ_{j=1}^n c_j K(x_i, x_j) } − 1 + ξ_i ) − Σ_{i=1}^n ζ_i ξ_i

SLIDE 34

LAGRANGIAN AND DUAL

We derive the dual from the primal using the Lagrangian

L(c, ξ, α, ζ) = C Σ_{i=1}^n ξ_i + (1/2) c^T K c − Σ_{i=1}^n α_i ( y_i { Σ_{j=1}^n c_j K(x_i, x_j) } − 1 + ξ_i ) − Σ_{i=1}^n ζ_i ξ_i

The dual problem is:

argmax_{α, ζ ≥ 0} inf_{c, ξ} L(c, ξ, α, ζ)

First, minimize L w.r.t. (c, ξ):

(1)  ∂L/∂c = 0    ⇒  c_i = α_i y_i
(2)  ∂L/∂ξ_i = 0  ⇒  C − α_i − ζ_i = 0  ⇒  0 ≤ α_i ≤ C

SLIDE 35

TOWARDS THE DUAL I

From (2), plugging ζ_i = C − α_i in the Lagrangian

L(c, ξ, α, ζ) = C Σ_{i=1}^n ξ_i + (1/2) c^T K c − Σ_{i=1}^n α_i ( y_i { Σ_{j=1}^n c_j K(x_i, x_j) } − 1 + ξ_i ) − Σ_{i=1}^n ζ_i ξ_i

we get

argmax_{α≥0} inf_c L(c, α),  with  L(c, α) = (1/2) c^T K c + Σ_{i=1}^n α_i ( 1 − y_i Σ_{j=1}^n K(x_i, x_j) c_j )

SLIDE 36

TOWARDS THE DUAL II

argmax_{α≥0} inf_c L(c, α),  with  L(c, α) = (1/2) c^T K c + Σ_{i=1}^n α_i ( 1 − y_i Σ_{j=1}^n K(x_i, x_j) c_j )

Next, plugging in (1), i.e. c_i = α_i y_i, we get

argmax_{α≥0} L(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i y_i K(x_i, x_j) α_j y_j
                  = Σ_{i=1}^n α_i − (1/2) α^T (diag Y) K (diag Y) α

SLIDE 37

THE PRIMAL AND DUAL PROBLEMS AGAIN

Primal:

argmin_{c∈R^n, ξ∈R^n}  C Σ_{i=1}^n ξ_i + (1/2) c^T K c
subject to:   ξ_i ≥ 1 − y_i ( Σ_{j=1}^n c_j K(x_i, x_j) ),   i = 1, . . . , n
              ξ_i ≥ 0,                                        i = 1, . . . , n

Dual:

max_{α∈R^n}  Σ_{i=1}^n α_i − (1/2) α^T Q α
subject to:  0 ≤ α_i ≤ C,   i = 1, . . . , n

where Q = (diag Y) K (diag Y). The dual problem is easier to solve: simple box constraints.
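As a rough illustration (not part of the slides), the box-constrained dual can be handed to a generic QP solver. The sketch below uses quadprog, assuming Matlab's Optimization Toolbox is available, together with the toy K, Y, n from the earlier sketches; since quadprog minimizes, the sign of the dual objective is flipped, and the value of C is arbitrary.

  Cparam = 1;
  Q = diag(Y) * K * diag(Y);               % Q = (diag Y) K (diag Y)
  f = -ones(n, 1);                         % maximizing sum(a) - (1/2) a'Qa  <=>  minimizing (1/2) a'Qa - sum(a)
  alpha = quadprog((Q + Q') / 2, f, [], [], [], [], zeros(n, 1), Cparam * ones(n, 1));

  % Recover the expansion coefficients and the function values on the training set
  c = alpha .* Y;                          % c_i = alpha_i y_i
  f_train = K * c;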

SLIDE 38

SUPPORT VECTORS

The input points with nonzero coefficients are called support vectors. We get a geometric interpretation using complementary slackness and the primal/dual constraints.

SLIDE 39

OPTIMALITY CONDITIONS

All optimal solutions must satisfy:

Σ_{j=1}^n c_j K(x_i, x_j) − Σ_{j=1}^n y_j α_j K(x_i, x_j) = 0,   i = 1, . . . , n

C − α_i − ζ_i = 0,   i = 1, . . . , n

y_i ( Σ_{j=1}^n y_j α_j K(x_i, x_j) ) − 1 + ξ_i ≥ 0,   i = 1, . . . , n

α_i [ y_i ( Σ_{j=1}^n y_j α_j K(x_i, x_j) ) − 1 + ξ_i ] = 0,   i = 1, . . . , n

ζ_i ξ_i = 0,   i = 1, . . . , n

ξ_i, α_i, ζ_i ≥ 0,   i = 1, . . . , n

SLIDE 40

OPTIMALITY CONDITIONS

These optimality conditions are both necessary and sufficient: (c, ξ, α, ζ) satisfy all of the conditions if and only if they are optimal for both the primal and the dual. (They are also known as the Karush-Kuhn-Tucker (KKT) conditions.)

SLIDE 41

INTERPRETING THE SOLUTION — SPARSITY

α_i [ y_i ( Σ_{j=1}^n y_j α_j K(x_i, x_j) ) − 1 + ξ_i ] = 0,   i = 1, . . . , n.

Remember we defined f(x) = Σ_{i=1}^n y_i α_i K(x, x_i), so that

y_i f(x_i) > 1  ⇒  (1 − y_i f(x_i)) < 0  ⇒  ξ_i ≠ (1 − y_i f(x_i)) (since ξ_i ≥ 0)  ⇒  α_i = 0

SLIDE 42

INTERPRETING THE SOLUTION — SUPPORT VECTORS

Consider

C − α_i − ζ_i = 0,   i = 1, . . . , n
ζ_i ξ_i = 0,          i = 1, . . . , n

Then

y_i f(x_i) < 1  ⇒  (1 − y_i f(x_i)) > 0  ⇒  ξ_i > 0  ⇒  ζ_i = 0  ⇒  α_i = C

SLIDE 43

INTERPRETING THE SOLUTION — SUPPORT VECTORS

So y_i f(x_i) < 1 ⇒ α_i = C. Conversely, suppose α_i = C. From

α_i [ y_i ( Σ_{j=1}^n y_j α_j K(x_i, x_j) ) − 1 + ξ_i ] = 0,   i = 1, . . . , n,

we have

α_i = C  ⇒  ξ_i = 1 − y_i f(x_i)  ⇒  y_i f(x_i) ≤ 1

SLIDE 44

INTERPRETING THE SOLUTION

Here are all of the derived conditions:

α_i = 0      ⇒  y_i f(x_i) ≥ 1
0 < α_i < C  ⇒  y_i f(x_i) = 1
α_i = C      ⇐  y_i f(x_i) < 1
α_i = 0      ⇐  y_i f(x_i) > 1
α_i = C      ⇒  y_i f(x_i) ≤ 1
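A small Matlab/Octave sketch that applies these conditions to the α returned by the dual sketch above (the variables alpha and Cparam are assumed from that sketch, and the tolerance tol is an assumption to cope with floating-point round-off):

  tol = 1e-6;
  nonSV    = find(alpha < tol);                           % alpha_i = 0:      y_i f(x_i) >= 1
  marginSV = find(alpha > tol & alpha < Cparam - tol);    % 0 < alpha_i < C:  y_i f(x_i) = 1
  boundSV  = find(alpha > Cparam - tol);                  % alpha_i = C:      y_i f(x_i) <= 1
  fprintf('%d non-SVs, %d margin SVs, %d bound SVs\n', ...
          numel(nonSV), numel(marginSV), numel(boundSV));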

SLIDE 45

GEOMETRIC INTERPRETATION OF REDUCED OPTIMALITY CONDITIONS

SLIDE 46

THE GEOMETRIC APPROACH

The "traditional" approach to describe SVM is to start with the concepts of separating hyperplanes and margin. The theory is usually developed in a linear space, beginning with the idea of a perceptron, a linear hyperplane that separates the positive and the negative examples. Defining the margin as the distance from the hyperplane to the nearest example, the basic observation is that, intuitively, we expect a hyperplane with larger margin to generalize better than one with smaller margin.

SLIDE 47

LARGE AND SMALL MARGIN HYPERPLANES

[Figure: two separating hyperplanes, one with a large margin and one with a small margin, shown in panels (a) and (b).]

SLIDE 48

MAXIMAL MARGIN CLASSIFICATION

Classification function: f(x) = sign(⟨w, x⟩).

w is a normal vector to the hyperplane separating the classes. We define the boundaries of the margin by ⟨w, x⟩ = ±1.

What happens as we change w?

SLIDE 49

MAXIMAL MARGIN CLASSIFICATION

Classification function: f(x) = sign(⟨w, x⟩).

w is a normal vector to the hyperplane separating the classes. We define the boundaries of the margin by ⟨w, x⟩ = ±1.

What happens as we change w? We push the margin in/out by rescaling w – the margin moves out with 1/||w||. So maximizing the margin corresponds to minimizing ||w||.

SLIDE 50

MAXIMAL MARGIN CLASSIFICATION, SEPARABLE CASE

Separable means ∃w s.t. all points are beyond the margin, i.e. y_i⟨w, x_i⟩ ≥ 1, ∀i. So we solve:

argmin_w ||w||^2   s.t.  y_i⟨w, x_i⟩ ≥ 1, ∀i

SLIDE 51

MAXIMAL MARGIN CLASSIFICATION, NON-SEPARABLE CASE

Non-separable means there are points on the wrong side of the margin, i.e. ∃i s.t. y_i⟨w, x_i⟩ < 1. We add slack variables to account for the wrongness:

argmin_{ξ_i, w}  Σ_{i=1}^n ξ_i + ||w||^2   s.t.  y_i⟨w, x_i⟩ ≥ 1 − ξ_i, ∀i
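As an illustration (not from the slides), this geometric primal is also a quadratic program over the stacked variable z = [w; ξ]; a rough quadprog sketch, again assuming the Optimization Toolbox and the toy X, Y, n, d from before:

  % minimize ||w||^2 + sum(xi)  s.t.  y_i <w, x_i> >= 1 - xi_i,  xi_i >= 0
  H = blkdiag(2 * eye(d), zeros(n));        % quadratic term over z = [w; xi]
  fvec = [zeros(d, 1); ones(n, 1)];         % linear term: sum of the slacks
  A = [-diag(Y) * X, -eye(n)];              % -y_i x_i' w - xi_i <= -1
  b = -ones(n, 1);
  lb = [-inf(d, 1); zeros(n, 1)];           % xi >= 0, w unconstrained
  z = quadprog(H, fvec, A, b, [], [], lb, []);
  w = z(1:d); xi = z(d+1:end);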

SLIDE 52

HISTORICAL PERSPECTIVE

Historically, most developments began with the geometric form, derived a dual program identical to the dual we derived above, and only then observed that the dual program required only dot products and that these dot products could be replaced with a kernel function.

SLIDE 53

MORE HISTORICAL PERSPECTIVE

In the linearly separable case, we can also derive the separating hyperplane as a vector parallel to the vector connecting the closest two points in the positive and negative classes, passing through the perpendicular bisector of this vector. This was the "Method of Portraits", derived by Vapnik in the 1970's, and recently rediscovered (with non-separable extensions) by Keerthi.

SLIDE 54

SUMMARY

The SVM is a Tikhonov regularization problem, with the hinge loss. Solving the SVM means solving a constrained quadratic program, roughly O(n^3). It is better to work with the dual program.

Solutions can be sparse, i.e. few nonzero coefficients; this can have an impact on memory and computational requirements. The nonzero coefficients correspond to points not classified correctly enough, a.k.a. "support vectors."

There is an alternative, geometric interpretation of the SVM, from the perspective of "maximizing the margin."

SLIDE 55

RLS AND SVM TOOLBOX

GURLS (Grand Unified Regularized Least Squares): http://cbcl.mit.edu/gurls/

SVM Light: http://svmlight.joachims.org

libSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
