  1. MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017

  2. About this class ◮ We introduce a class of learning algorithms based on Tikhonov regularization ◮ We study computational aspects of these algorithms.

  3. Empirical Risk Minimization (ERM) ◮ Empirical Risk Minimization (ERM): probably the most popular approach to designing learning algorithms. ◮ General idea: consider the empirical error $\hat{E}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f(x_i))$ as a proxy for the expected error $E(f) = \mathbb{E}[\ell(y, f(x))] = \int p(x, y)\,\ell(y, f(x))\,dx\,dy$.
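
Below is a minimal Python sketch (not from the slides; the predictor f and the data model are made up for illustration) showing the empirical error as a sample-based proxy for the expected error: as n grows, the empirical risk concentrates around the expected risk.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        # a fixed (illustrative) predictor
        return 2.0 * x

    def sample(n):
        # synthetic data: y = 3x + noise, x ~ N(0, 1)
        x = rng.standard_normal(n)
        y = 3.0 * x + 0.5 * rng.standard_normal(n)
        return x, y

    def empirical_risk(n):
        # (1/n) * sum_i (y_i - f(x_i))^2, the square-loss empirical error
        x, y = sample(n)
        return np.mean((y - f(x)) ** 2)

    # For this synthetic model the expected risk is E[(y - 2x)^2] = 1 + 0.25 = 1.25;
    # the empirical risk approaches it as n grows.
    for n in (10, 1000, 100000):
        print(n, empirical_risk(n))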

  4. The Expected Risk is Not Computable Recall that ◮ ℓ measures the price we pay predicting f(x) when the true label is y ◮ E(f) cannot be directly computed, since p(x, y) is unknown.

  5. From Theory to Algorithms: The Hypothesis Space To turn the above idea into an actual algorithm, we: ◮ Fix a suitable hypothesis space H ◮ Minimize $\hat{E}$ over H. H should allow feasible computations and be rich, since the complexity of the problem is not known a priori.

  6. Example: Space of Linear Functions The simplest example of H is the space of linear functions: $H = \{ f : \mathbb{R}^d \to \mathbb{R} \;:\; \exists\, w \in \mathbb{R}^d \text{ such that } f(x) = x^T w, \ \forall x \in \mathbb{R}^d \}$. ◮ Each function f is defined by a vector w ◮ $f_w(x) = x^T w$.

  7. Rich H's May Require Regularization ◮ If H is rich enough, solving ERM may cause overfitting (solutions highly dependent on the data) ◮ Regularization techniques restore stability and ensure generalization.

  8. Tikhonov Regularization Consider the Tikhonov regularization scheme $\min_{w \in \mathbb{R}^d} \hat{E}(f_w) + \lambda \|w\|^2 \qquad (1)$ It describes a large class of methods sometimes called Regularization Networks.

  9. The Regularizer ◮ $\|w\|^2$ is called the regularizer ◮ It controls the stability of the solution and prevents overfitting ◮ λ balances the error term and the regularizer.

  10. Loss Functions ◮ Different loss functions ℓ induce different classes of methods ◮ We will see common aspects and differences when considering different loss functions ◮ There is no general computational scheme to solve Tikhonov regularization ◮ The solution depends on the loss function considered.

  11. The Regularized Least Squares Algorithm Regularized Least Squares: Tikhonov regularization $\min_{w \in \mathbb{R}^D} \hat{E}(f_w) + \lambda \|w\|^2, \qquad \hat{E}(f_w) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f_w(x_i)) \qquad (2)$ with the square loss function $\ell(y, f_w(x)) = (y - f_w(x))^2$. We then obtain the RLS optimization problem (linear model): $\min_{w \in \mathbb{R}^D} \frac{1}{n}\sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda w^T w, \qquad \lambda \ge 0. \qquad (3)$

  12. Matrix Notation ◮ The n × d matrix $X_n$, whose rows are the input points ◮ The n × 1 vector $Y_n$, whose entries are the corresponding outputs. With this notation, $\frac{1}{n}\sum_{i=1}^{n} (y_i - w^T x_i)^2 = \frac{1}{n}\|Y_n - X_n w\|^2$.

  13. Gradients of the ER and of the Regularizer By direct computation, ◮ Gradient of the empirical risk w.r.t. w: $-\frac{2}{n} X_n^T (Y_n - X_n w)$ ◮ Gradient of the regularizer w.r.t. w: $2w$.
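
A quick sanity check of these gradients (a minimal sketch with made-up data; nothing below comes from the slides) compares the analytic expression against finite differences:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 5
    X = rng.standard_normal((n, d))
    Y = rng.standard_normal(n)
    w = rng.standard_normal(d)

    def empirical_risk(w):
        # (1/n) * ||Y - Xw||^2
        return np.mean((Y - X @ w) ** 2)

    grad_analytic = -2.0 / n * X.T @ (Y - X @ w)

    eps = 1e-6
    grad_numeric = np.array([
        (empirical_risk(w + eps * e) - empirical_risk(w - eps * e)) / (2 * eps)
        for e in np.eye(d)
    ])

    print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # expected: True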

  14. The RLS Solution By setting the gradient to zero, the solution of RLS solves the linear system $(X_n^T X_n + \lambda n I)\, w = X_n^T Y_n$. λ controls the invertibility of $(X_n^T X_n + \lambda n I)$.

  15. Choosing the Cholesky Solver ◮ Several methods can be used to solve the above linear system ◮ Cholesky decomposition is the method of choice, since $X_n^T X_n + \lambda n I$ is symmetric and positive definite.
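
A minimal sketch of RLS training along these lines (the function names rls_train and rls_predict are illustrative, not from the course code), using SciPy's Cholesky routines:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def rls_train(X, Y, lam):
        """Solve (X^T X + lam*n*I) w = X^T Y via a Cholesky factorization."""
        n, d = X.shape
        A = X.T @ X + lam * n * np.eye(d)  # symmetric positive definite for lam > 0
        c, low = cho_factor(A)
        return cho_solve((c, low), X.T @ Y)

    def rls_predict(w, X_new):
        # linear model: f_w(x) = x^T w
        return X_new @ w

    # usage on synthetic data
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 10))
    Y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(100)
    w = rls_train(X, Y, lam=0.1)

Forming X^T X and factorizing it is consistent with the training cost quoted on the next slide.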

  16. Time Complexity Time complexity of the method: ◮ Training: O(nd²) (assuming n ≫ d) ◮ Testing: O(d).

  17. Dealing with an Offset For linear models, especially in low dimensional spaces, it is useful to consider an offset: $w^T x + b$. How to estimate b from data?

  18. Idea: Augmenting the Dimension of the Input Space ◮ Simple idea: augment the dimension of the input space, considering $\tilde{x} = (x, 1)$ and $\tilde{w} = (w, b)$. ◮ This is fine if we do not regularize, but if we do, this method tends to prefer linear functions passing through the origin (zero offset), since the regularizer becomes $\|\tilde{w}\|^2 = \|w\|^2 + b^2$.

  19. Avoiding Penalizing Solutions with an Offset We want to regularize considering only $\|w\|^2$, without penalizing the offset. The modified regularized problem becomes: $\min_{(w,b) \in \mathbb{R}^{D+1}} \frac{1}{n}\sum_{i=1}^{n} (y_i - w^T x_i - b)^2 + \lambda \|w\|^2.$

  20. Solution with Offset: Centering the Data It can be proved that a solution $(w^*, b^*)$ of the above problem is given by $b^* = \bar{y} - \bar{x}^T w^*$ where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

  21. Solution with Offset: Centering the Data $w^*$ solves $\min_{w \in \mathbb{R}^D} \frac{1}{n}\sum_{i=1}^{n} (y_i^c - w^T x_i^c)^2 + \lambda \|w\|^2,$ where $y_i^c = y_i - \bar{y}$ and $x_i^c = x_i - \bar{x}$ for all $i = 1, \dots, n$. Note: This corresponds to centering the data and then applying the standard RLS algorithm.
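
A minimal sketch of this recipe (illustrative names, plain NumPy): center the inputs and outputs, solve standard RLS on the centered data, then recover the offset as b* = ȳ − x̄ᵀw*.

    import numpy as np

    def rls_train_offset(X, Y, lam):
        x_bar, y_bar = X.mean(axis=0), Y.mean()
        Xc, Yc = X - x_bar, Y - y_bar                      # centered data
        n, d = X.shape
        w = np.linalg.solve(Xc.T @ Xc + lam * n * np.eye(d), Xc.T @ Yc)
        b = y_bar - x_bar @ w                              # b* = y_bar - x_bar^T w*
        return w, b

    # prediction: f(x) = x^T w + b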

  22. Introduction: Regularized Logistic Regression Regularized logistic regression: Tikhonov regularization $\min_{w \in \mathbb{R}^d} \hat{E}(f_w) + \lambda \|w\|^2, \qquad \hat{E}(f_w) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f_w(x_i)) \qquad (4)$ with the logistic loss function $\ell(y, f_w(x)) = \log(1 + e^{-y f_w(x)})$.

  23. The Logistic Loss Function Figure: Plot of the logistic regression loss function.

  24. Minimization Through Gradient Descent ◮ The logistic loss function is differentiable ◮ The natural candidate for computing a minimizer is the gradient descent (GD) algorithm.

  25. Regularized Logistic Regression (RLR) ◮ The regularized ERM problem associated with the logistic loss is called regularized logistic regression ◮ Its solution can be computed via gradient descent ◮ Note: $\nabla \hat{E}(f_{w_{t-1}}) = \frac{1}{n}\sum_{i=1}^{n} \frac{-y_i e^{-y_i x_i^T w_{t-1}}}{1 + e^{-y_i x_i^T w_{t-1}}}\, x_i = \frac{1}{n}\sum_{i=1}^{n} \frac{-y_i}{1 + e^{y_i x_i^T w_{t-1}}}\, x_i$.

  26. RLR: Gradient Descent Iteration For $w_0 = 0$, the GD iteration applied to $\min_{w \in \mathbb{R}^d} \hat{E}(f_w) + \lambda \|w\|^2$ is $w_t = w_{t-1} - \gamma \underbrace{\left( \frac{1}{n}\sum_{i=1}^{n} \frac{-y_i}{1 + e^{y_i x_i^T w_{t-1}}}\, x_i + 2\lambda w_{t-1} \right)}_{a}$ for $t = 1, \dots, T$, where $a = \nabla(\hat{E}(f_w) + \lambda \|w\|^2)$.
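
A minimal sketch of this iteration (the step size gamma, the iteration count T, and the function name are assumptions made for illustration), for labels y_i in {-1, +1}:

    import numpy as np

    def rlr_gd(X, y, lam, gamma=0.1, T=1000):
        n, d = X.shape
        w = np.zeros(d)                        # w_0 = 0
        for _ in range(T):
            margins = y * (X @ w)              # y_i * x_i^T w_{t-1}
            coeff = -y / (1.0 + np.exp(margins))
            grad = X.T @ coeff / n + 2 * lam * w
            w = w - gamma * grad               # w_t = w_{t-1} - gamma * a
        return w
    # (a production implementation would guard np.exp against overflow
    #  and use a stopping rule instead of a fixed T)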

  27. Logistic Regression and Confidence Estimation ◮ The solution of logistic regression has a probabilistic interpretation ◮ It can be derived from the following model: $p(1 \mid x) = \underbrace{\frac{e^{x^T w}}{1 + e^{x^T w}}}_{h}$ where h is called the logistic function. ◮ This can be used to compute a confidence for each prediction.
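
For instance, a learned w can be turned into class probabilities with the logistic function h; the helper name below is illustrative only:

    import numpy as np

    def predict_proba(w, X):
        # h(x) = e^{x^T w} / (1 + e^{x^T w}) = 1 / (1 + e^{-x^T w}) = p(1 | x)
        return 1.0 / (1.0 + np.exp(-(X @ w)))

    # predict +1 when predict_proba(w, X) > 0.5, and report the probability
    # itself as a confidence for the prediction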

  28. Support Vector Machines Formulation in terms of Tikhonov regularization: $\min_{w \in \mathbb{R}^d} \hat{E}(f_w) + \lambda \|w\|^2, \qquad \hat{E}(f_w) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f_w(x_i)) \qquad (5)$ with the hinge loss function $\ell(y, f_w(x)) = |1 - y f_w(x)|_+$. [Figure: plot of the hinge loss as a function of y·f(x)]

  29. A more classical formulation (linear case) $w^* = \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2$ with $\lambda = \frac{1}{C}$.
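
The slides do not specify a solver for this form; as one possibility, a plain subgradient-descent sketch for the hinge-loss objective (step size and iteration count are arbitrary illustrative choices) could look like this:

    import numpy as np

    def svm_subgradient(X, y, lam, gamma=0.01, T=2000):
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(T):
            margins = y * (X @ w)
            active = margins < 1                            # points with nonzero hinge loss
            subgrad = -(X[active].T @ y[active]) / n + 2 * lam * w
            w = w - gamma * subgrad
        return w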

  30. A more classical formulation (linear case) $w^* = \operatorname*{arg\,min}_{w \in \mathbb{R}^d,\ \xi_i \ge 0} \|w\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i w^\top x_i \ge 1 - \xi_i \ \ \forall i \in \{1, \dots, n\}$.

  31. A geometric intuition - classification In general there are many possible solutions. [Figure: two-class data in the plane with several candidate separating lines] Which one do you select?

  32. A geometric intuition - classification Intuitively I would choose an “equidistant” line. [Figure: two-class data with an “equidistant” separating line]

  33. A geometric intuition - classification Intuitively I would choose an “equidistant” line. [Figure: two-class data with an “equidistant” separating line]

  34. Maximum margin classifier I want the classifier that ◮ classifies the dataset perfectly ◮ maximizes the distance from its closest examples [Figure: two-class data with the maximum margin separating line]

  35. Point-Hyperplane distance How to do it mathematically? Let w be our separating hyperplane. We have $x = \alpha w + x_\perp$ with $\alpha = \frac{x^\top w}{\|w\|}$ and $x_\perp = x - \alpha w$. Point-Hyperplane distance: $d(x, w) = \|x_\perp\|$.

  36. Margin A hyperplane w correctly classifies an example $(x_i, y_i)$ if ◮ $y_i = 1$ and $w^\top x_i > 0$, or ◮ $y_i = -1$ and $w^\top x_i < 0$; therefore $x_i$ is correctly classified iff $y_i w^\top x_i > 0$. Margin: $m_i = y_i w^\top x_i$. Note that $x_\perp = x_i - \frac{y_i m_i}{\|w\|}\, w$.
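
A tiny numerical illustration (made-up points) of the margin: m_i > 0 exactly when the hyperplane classifies (x_i, y_i) correctly.

    import numpy as np

    w = np.array([1.0, -2.0])
    X = np.array([[0.5, -1.0], [1.0, 1.0], [2.0, 0.5]])
    y = np.array([1, -1, -1])

    margins = y * (X @ w)   # m_i = y_i * w^T x_i  ->  [2.5, 1.0, -1.0]
    correct = margins > 0   # the third point is misclassified by w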

  37. Maximum margin classifier definition I want the classifier that ◮ classifies the dataset perfectly ◮ maximizes the distance from its closest examples: $w^* = \operatorname*{arg\,max}_{w \in \mathbb{R}^d} \min_{1 \le i \le n} d(x_i, w)^2 \quad \text{subject to} \quad m_i > 0 \ \forall i \in \{1, \dots, n\}$ Calling $\mu$ the smallest $m_i$, we have $w^* = \operatorname*{arg\,max}_{w \in \mathbb{R}^d,\ \mu \ge 0} \min_{1 \le i \le n} \|x_i\|^2 - \frac{(x_i^\top w)^2}{\|w\|^2} \quad \text{subject to} \quad y_i w^\top x_i \ge \mu \ \forall i \in \{1, \dots, n\}$, that is
