  1. RegML 2020 Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT

  2. Learning problem. For $\mathcal{H} \subset \{ f \mid f : X \to Y \}$, solve $\min_{f \in \mathcal{H}} \mathcal{E}(f)$, with $\mathcal{E}(f) = \int L(f(x), y)\, d\rho(x, y)$, given only the sample $S_n = (x_i, y_i)_{i=1}^n$ ($\rho$ fixed, unknown).

  3. Empirical Risk Minimization (ERM). $\min_{f \in \mathcal{H}} \mathcal{E}(f) \;\mapsto\; \min_{f \in \mathcal{H}} \hat{\mathcal{E}}(f)$, where $\hat{\mathcal{E}}(f) = \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i)$ is a proxy for $\mathcal{E}$.

  4. From ERM to regularization. ERM can be a bad idea if $n$ is "small" and $\mathcal{H}$ is "big". Regularization: $\min_{f \in \mathcal{H}} \hat{\mathcal{E}}(f) \;\mapsto\; \min_{f \in \mathcal{H}} \hat{\mathcal{E}}(f) + \lambda R(f)$, where $R$ is the regularization term and $\lambda$ the regularization parameter.

  5. Examples of regularizers. Let $f(x) = \sum_{j=1}^p \varphi_j(x) w_j$.
     ◮ $\ell_2$: $R(f) = \|w\|^2 = \sum_{j=1}^p |w_j|^2$
     ◮ $\ell_1$: $R(f) = \|w\|_1 = \sum_{j=1}^p |w_j|$
     ◮ Differential operators: $R(f) = \int_X \|\nabla f(x)\|^2\, d\rho(x)$
     ◮ ...

  6. From statistics to optimization. Solve $\min_{w \in \mathbb{R}^p} \hat{\mathcal{E}}(w) + \lambda \|w\|^2$, with $\hat{\mathcal{E}}(w) = \frac{1}{n} \sum_{i=1}^n L(w^\top x_i, y_i)$.

  7. Minimization. $\min_w \hat{\mathcal{E}}(w) + \lambda \|w\|^2$
     ◮ Strongly convex functional
     ◮ The computations depend on the considered loss function

  8. Logistic regression. $\hat{\mathcal{E}}_\lambda(w) = \frac{1}{n} \sum_{i=1}^n \log(1 + e^{-y_i w^\top x_i}) + \lambda \|w\|^2$ (smooth and strongly convex), with gradient $\nabla \hat{\mathcal{E}}_\lambda(w) = -\frac{1}{n} \sum_{i=1}^n \frac{x_i y_i}{1 + e^{y_i w^\top x_i}} + 2\lambda w$.

  9. Gradient descent. Let $F : \mathbb{R}^d \to \mathbb{R}$ be differentiable, (strictly) convex and such that $\|\nabla F(w) - \nabla F(w')\| \le L \|w - w'\|$ (e.g. $\sup_w \|H(w)\| \le L$, with $H$ the Hessian). Then $w_0 = 0$, $w_{t+1} = w_t - \frac{1}{L} \nabla F(w_t)$ converges to the minimizer of $F$.

  10. Gradient descent for LR. $\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \log(1 + e^{-y_i w^\top x_i}) + \lambda \|w\|^2$. Consider $w_{t+1} = w_t - \frac{1}{L} \left( -\frac{1}{n} \sum_{i=1}^n \frac{x_i y_i}{1 + e^{y_i w_t^\top x_i}} + 2\lambda w_t \right)$.
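
A minimal NumPy sketch of this iteration (not part of the slides): the data X, y, the regularization parameter lam and the iteration count T are illustrative, labels are assumed in {-1, +1}, and the constant L is the standard Lipschitz bound for the gradient of the regularized logistic loss.

```python
import numpy as np

def logreg_gd(X, y, lam, T=1000):
    """Gradient descent for l2-regularized logistic regression (slide 10)."""
    n, d = X.shape
    # Lipschitz constant of the gradient: ||X||^2 / (4n) + 2*lam (logistic loss curvature <= 1/4)
    L = np.linalg.norm(X, 2) ** 2 / (4 * n) + 2 * lam
    w = np.zeros(d)
    for _ in range(T):
        margins = y * (X @ w)
        # gradient from slide 8: -(1/n) sum_i x_i y_i / (1 + e^{y_i w^T x_i}) + 2*lam*w
        grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0) + 2 * lam * w
        w = w - grad / L
    return w
```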

  11. Complexity. Logistic: $O(ndT)$, with $n$ the number of examples, $d$ the dimensionality, $T$ the number of steps. What if $n \ll d$? Can we get better complexities?

  12. Representer theorems. Idea: show that $f(x) = w^\top x = \sum_{i=1}^n x_i^\top x\, c_i$, $c_i \in \mathbb{R}$, i.e. $w = \sum_{i=1}^n x_i c_i$. Then compute $c = (c_1, \ldots, c_n) \in \mathbb{R}^n$ rather than $w \in \mathbb{R}^d$.
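
A quick numerical check of this identity (random, illustrative data; not part of the slides):

```python
import numpy as np

# Sanity check of the representer identity on slide 12:
# if w = sum_i x_i c_i, then w^T x = sum_i (x_i^T x) c_i for any x.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))     # n = 5 examples in d = 3 dimensions (illustrative)
c = rng.standard_normal(5)          # coefficients c_i
x = rng.standard_normal(3)          # an arbitrary test point

w = X.T @ c                         # w = sum_i x_i c_i
print(np.allclose(w @ x, (X @ x) @ c))   # True: both expressions agree
```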

  13. Representer theorem for GD & LR. By induction, $c_{t+1} = c_t - \frac{1}{L} \left( -\frac{1}{n} \sum_{i=1}^n \frac{e_i y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right)$, with $e_i$ the $i$-th element of the canonical basis and $f_t(x) = \sum_{i=1}^n x^\top x_i (c_t)_i$.
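
A minimal sketch of this coefficient update (not part of the slides), assuming the matrix of inner products K with K[i, j] = x_i^T x_j has been precomputed, labels are in {-1, +1}, and the step constant L is supplied by the caller:

```python
import numpy as np

def logreg_gd_coef(K, y, lam, L, T=1000):
    """Gradient descent on the coefficients c for logistic regression (slide 13).

    K : (n, n) matrix of inner products, K[i, j] = x_i^T x_j
    y : (n,) labels in {-1, +1}; lam : regularization parameter; L : step constant.
    """
    n = K.shape[0]
    c = np.zeros(n)
    for _ in range(T):
        f = K @ c                                   # f_t(x_i) = sum_j x_i^T x_j (c_t)_j
        grad = -(y / (1 + np.exp(y * f))) / n + 2 * lam * c
        c = c - grad / L
    return c
```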

  14. Non-linear features. $f(x) = \sum_{i=1}^d w_i x_i \;\mapsto\; f(x) = \sum_{i=1}^p w_i \varphi_i(x)$. Model: $\Phi(x) = (\varphi_1(x), \ldots, \varphi_p(x))$.

  15. Non-linear features. $f(x) = \sum_{i=1}^d w_i x_i \;\mapsto\; f(x) = \sum_{i=1}^p w_i \varphi_i(x)$, with $\Phi(x) = (\varphi_1(x), \ldots, \varphi_p(x))$. Computations: the same, up to the change $x \mapsto \Phi(x)$.

  16. Representer theorem with non-linear features. $f(x) = \sum_{i=1}^n x_i^\top x\, c_i \;\mapsto\; f(x) = \sum_{i=1}^n \Phi(x_i)^\top \Phi(x)\, c_i$.

  17. Rewriting logistic regression and gradient descent. By induction, $c_{t+1} = c_t - \frac{1}{L} \left( -\frac{1}{n} \sum_{i=1}^n \frac{e_i y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right)$, with $e_i$ the $i$-th element of the canonical basis and $f_t(x) = \sum_{i=1}^n \Phi(x)^\top \Phi(x_i) (c_t)_i$.

  18. Hinge loss and support vector machines. $\hat{\mathcal{E}}_\lambda(w) = \frac{1}{n} \sum_{i=1}^n |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2$ (non-smooth and strongly convex). Consider the "left" derivative: $w_{t+1} = w_t - \frac{1}{L\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^n S_i(w_t) + 2\lambda w_t \right)$, with $S_i(w) = -y_i x_i$ if $y_i w^\top x_i \le 1$ and $0$ otherwise, and $L = \sup_{x \in X} \|x\| + 2\lambda$.

  19. Subgradient. Let $F : \mathbb{R}^p \to \mathbb{R}$ be convex. The subgradient $\partial F(w_0)$ is the set of vectors $v \in \mathbb{R}^p$ such that, for every $w \in \mathbb{R}^p$, $F(w) - F(w_0) \ge (w - w_0)^\top v$. In one dimension, $\partial F(w_0) = [F'_-(w_0), F'_+(w_0)]$; for example, for $F(w) = |w|$ one has $\partial F(0) = [-1, 1]$.

  20. Subgradient method. Let $F : \mathbb{R}^p \to \mathbb{R}$ be strictly convex with bounded subdifferential, and let $\gamma_t = 1/t$. Then $w_{t+1} = w_t - \gamma_t v_t$, with $v_t \in \partial F(w_t)$, converges to the minimizer of $F$.

  21. Subgradient method for SVM. $\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2$. Consider $w_{t+1} = w_t - \frac{1}{t} \left( \frac{1}{n} \sum_{i=1}^n S_i(w_t) + 2\lambda w_t \right)$, with $S_i(w_t) = -y_i x_i$ if $y_i w_t^\top x_i \le 1$ and $0$ otherwise.
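
A minimal NumPy sketch of this subgradient iteration (not part of the slides): the data X, y, the regularization parameter lam and the iteration count T are illustrative, labels are assumed in {-1, +1}, and the step size gamma_t = 1/t follows slide 20.

```python
import numpy as np

def svm_subgradient(X, y, lam, T=1000):
    """Subgradient method for the hinge-loss SVM (slide 21)."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        margins = y * (X @ w)
        active = margins <= 1                      # examples where the hinge is active
        # (1/n) sum_i S_i(w_t) = -(1/n) * sum over active examples of y_i x_i
        sub = -(X[active] * y[active][:, None]).sum(axis=0) / n + 2 * lam * w
        w = w - sub / t                            # step size gamma_t = 1/t
    return w
```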

  22. Representer theorem for SVM. By induction, $c_{t+1} = c_t - \frac{1}{t} \left( \frac{1}{n} \sum_{i=1}^n S_i(c_t) + 2\lambda c_t \right)$, with $e_i$ the $i$-th element of the canonical basis, $f_t(x) = \sum_{i=1}^n x^\top x_i (c_t)_i$, and $S_i(c_t) = -y_i e_i$ if $y_i f_t(x_i) < 1$ and $0$ otherwise.

  23. Rewriting SVM by subgradient. By induction, $c_{t+1} = c_t - \frac{1}{t} \left( \frac{1}{n} \sum_{i=1}^n S_i(c_t) + 2\lambda c_t \right)$, with $e_i$ the $i$-th element of the canonical basis, $f_t(x) = \sum_{i=1}^n \Phi(x)^\top \Phi(x_i) (c_t)_i$, and $S_i(c_t) = -y_i e_i$ if $y_i f_t(x_i) < 1$ and $0$ otherwise.

  24. Optimality condition for SVM. Smooth convex: $\nabla F(w_*) = 0$. Non-smooth convex: $0 \in \partial F(w_*)$. Here this reads $0 \in \partial\, \frac{1}{n} \sum_{i=1}^n |1 - y_i w^\top x_i|_+ + 2\lambda w \;\Leftrightarrow\; w \in -\frac{1}{2\lambda}\, \partial\, \frac{1}{n} \sum_{i=1}^n |1 - y_i w^\top x_i|_+$.

  25. Optimality condition for SVM (cont.). The optimality condition can be rewritten as $0 = \frac{1}{n} \sum_{i=1}^n (-y_i x_i c_i) + 2\lambda w \;\Rightarrow\; w = \sum_{i=1}^n x_i \left( \frac{y_i c_i}{2\lambda n} \right)$, where $c_i = c_i(w)$ lies between the one-sided derivatives $V'_-(-y_i w^\top x_i)$ and $V'_+(-y_i w^\top x_i)$ of the loss. A direct computation gives: $c_i = 1$ if $y_i f(x_i) < 1$; $0 \le c_i \le 1$ if $y_i f(x_i) = 1$; $c_i = 0$ if $y_i f(x_i) > 1$.

  26. Support vectors. $c_i = 1$ if $y_i f(x_i) < 1$; $0 \le c_i \le 1$ if $y_i f(x_i) = 1$; $c_i = 0$ if $y_i f(x_i) > 1$. The points $x_i$ with $c_i \neq 0$ are the support vectors.

  27. Complexity. Without the representer theorem: Logistic $O(ndT)$, SVM $O(ndT)$. With the representer theorem: Logistic $O(n^2(d + T))$, SVM $O(n^2(d + T))$. Here $n$ is the number of examples, $d$ the dimensionality, $T$ the number of steps.

  28. Are loss functions all the same? $\min_w \hat{\mathcal{E}}(w) + \lambda \|w\|^2$
     ◮ each loss has a different target function...
     ◮ ...and different computations
     The choice of the loss is problem dependent.

  29. So far:
     ◮ regularization by penalization
     ◮ iterative optimization
     ◮ linear/non-linear parametric models
     What about nonparametric models?

  30. From features to kernels. $f(x) = \sum_{i=1}^n x_i^\top x\, c_i \;\mapsto\; f(x) = \sum_{i=1}^n \Phi(x_i)^\top \Phi(x)\, c_i$. Kernels: $\Phi(x)^\top \Phi(x') \mapsto K(x, x')$, so that $f(x) = \sum_{i=1}^n K(x_i, x)\, c_i$.

  31. LR and SVM with kernels. As before: LR: $c_{t+1} = c_t - \frac{1}{L} \left( -\frac{1}{n} \sum_{i=1}^n \frac{e_i y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right)$; SVM: $c_{t+1} = c_t - \frac{1}{t} \left( \frac{1}{n} \sum_{i=1}^n S_i(c_t) + 2\lambda c_t \right)$. But now $f_t(x) = \sum_{i=1}^n K(x, x_i)(c_t)_i$.
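
A minimal NumPy sketch of the kernel SVM iteration above (not part of the slides): the kernel matrix K is assumed precomputed with K[i, j] = K(x_i, x_j), labels are in {-1, +1}, and lam and T are illustrative. Prediction then only needs the kernel evaluated between test and training points.

```python
import numpy as np

def kernel_svm_subgradient(K, y, lam, T=1000):
    """Subgradient method on the coefficients c for the kernel SVM (slide 31)."""
    n = K.shape[0]
    c = np.zeros(n)
    for t in range(1, T + 1):
        f = K @ c                            # f_t(x_i) = sum_j K(x_i, x_j) (c_t)_j
        s = np.where(y * f < 1, -y, 0.0)     # i-th entry of sum_i S_i(c_t)
        c = c - (s / n + 2 * lam * c) / t    # step size 1/t
    return c

def predict(K_test, c):
    """f(x) = sum_i K(x, x_i) c_i, with K_test[a, i] = K(x_test_a, x_i)."""
    return K_test @ c
```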

  32. Examples of kernels.
     ◮ Linear: $K(x, x') = x^\top x'$
     ◮ Polynomial: $K(x, x') = (1 + x^\top x')^p$, with $p \in \mathbb{N}$
     ◮ Gaussian: $K(x, x') = e^{-\gamma \|x - x'\|^2}$, with $\gamma > 0$
     $f(x) = \sum_{i=1}^n c_i K(x_i, x)$
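
A minimal NumPy sketch of these three kernels (not part of the slides), computing the kernel matrix between two sets of points stored as rows; the default parameter values are illustrative.

```python
import numpy as np

def linear_kernel(X1, X2):
    """K(x, x') = x^T x' for all pairs of rows of X1 and X2."""
    return X1 @ X2.T

def polynomial_kernel(X1, X2, p=3):
    """K(x, x') = (1 + x^T x')^p."""
    return (1 + X1 @ X2.T) ** p

def gaussian_kernel(X1, X2, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq_dists = (
        np.sum(X1 ** 2, axis=1)[:, None]
        + np.sum(X2 ** 2, axis=1)[None, :]
        - 2 * X1 @ X2.T
    )
    return np.exp(-gamma * sq_dists)
```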

  33. Kernel engineering. Kernels can be designed for:
     ◮ strings,
     ◮ graphs,
     ◮ histograms,
     ◮ sets,
     ◮ ...

  34. What is a kernel? $K(x, x')$ is:
     ◮ a similarity measure,
     ◮ an inner product,
     ◮ a positive definite function.

  35. Positive definite function. $K : X \times X \to \mathbb{R}$ is positive definite when, for any $n \in \mathbb{N}$ and $x_1, \ldots, x_n \in X$, the matrix $K_n \in \mathbb{R}^{n \times n}$ with $(K_n)_{ij} = K(x_i, x_j)$ is positive semidefinite (all eigenvalues $\ge 0$).
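
A small NumPy check of this property (illustrative data, not part of the slides): build the kernel matrix on a set of points and verify that its eigenvalues are non-negative, up to round-off.

```python
import numpy as np

def is_positive_semidefinite(Kn, tol=1e-10):
    """Check that (K_n)_ij = K(x_i, x_j) has eigenvalues >= 0, up to numerical tolerance."""
    return np.all(np.linalg.eigvalsh(Kn) >= -tol)   # eigvalsh: the kernel matrix is symmetric

# e.g. the Gaussian kernel matrix of random points is positive semidefinite
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
print(is_positive_semidefinite(np.exp(-sq_dists)))   # True
```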

  36. PD functions and RKHS. Each PD kernel defines a function space, called a Reproducing Kernel Hilbert Space (RKHS): $\mathcal{H} = \mathrm{span}\{ K(\cdot, x) \mid x \in X \}$.

  37. Nonparametrics and kernels. The number of parameters is automatically determined by the number of points: $f(x) = \sum_{i=1}^n K(x_i, x)\, c_i$. Compare to $f(x) = \sum_{j=1}^p \varphi_j(x) w_j$.

  38. This class:
     ◮ Learning and regularization: logistic regression and SVM
     ◮ Optimization with first-order methods
     ◮ Linear and non-linear parametric models
     ◮ Non-parametric models and kernels

  39. Next class: beyond penalization. Regularization by
     ◮ subsampling
     ◮ stochastic projection
