MLCC 2017
Regularization Networks I: Linear Models
Lorenzo Rosasco
UNIGE-MIT-IIT
June 27, 2017
About this class
◮ We introduce a class of learning algorithms based on Tikhonov regularization
◮ We study computational aspects of these algorithms.
Empirical Risk Minimization (ERM)
◮ Empirical Risk Minimization (ERM): probably the most popular approach to design learning algorithms.
◮ General idea: considering the empirical error
$$\hat{\mathcal{E}}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i)),$$
as a proxy for the expected error
$$\mathcal{E}(f) = \mathbb{E}[\ell(y, f(x))] = \int dx\, dy\, p(x, y)\, \ell(y, f(x)).$$
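As a concrete illustration (not in the original slides), a minimal NumPy sketch of the empirical error for a generic loss; the data, the loss, and the predictor below are placeholders chosen for the example.

```python
import numpy as np

def empirical_risk(f, loss, X, y):
    """Empirical error: the average of loss(y_i, f(x_i)) over the n samples."""
    predictions = np.array([f(x) for x in X])
    return np.mean(loss(y, predictions))

# Placeholder data and a linear predictor f_w(x) = x^T w (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))              # n = 100 points in R^5
w = rng.standard_normal(5)
y = X @ w + 0.1 * rng.standard_normal(100)

square_loss = lambda y, t: (y - t) ** 2
print(empirical_risk(lambda x: x @ w, square_loss, X, y))
```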
The Expected Risk is Not Computable
Recall that
◮ ℓ measures the price we pay predicting f(x) when the true label is y
◮ $\mathcal{E}(f)$ cannot be directly computed, since p(x, y) is unknown
From Theory to Algorithms: The Hypothesis Space
To turn the above idea into an actual algorithm, we:
◮ Fix a suitable hypothesis space H
◮ Minimize $\hat{\mathcal{E}}$ over H
H should allow feasible computations and be rich, since the complexity of the problem is not known a priori.
Example: Space of Linear Functions
The simplest example of H is the space of linear functions:
$$H = \{ f : \mathbb{R}^d \to \mathbb{R} \;:\; \exists\, w \in \mathbb{R}^d \text{ such that } f(x) = x^\top w, \;\forall x \in \mathbb{R}^d \}.$$
◮ Each function f is defined by a vector w
◮ $f_w(x) = x^\top w$
Rich Hs May Require Regularization
◮ If H is rich enough, solving ERM may cause overfitting (solutions highly dependent on the data)
◮ Regularization techniques restore stability and ensure generalization
Tikhonov Regularization
Consider the Tikhonov regularization scheme
$$\min_{w \in \mathbb{R}^d} \hat{\mathcal{E}}(f_w) + \lambda \|w\|^2 \quad (1)$$
It describes a large class of methods sometimes called Regularization Networks.
The Regularizer
◮ $\|w\|^2$ is called the regularizer
◮ It controls the stability of the solution and prevents overfitting
◮ λ balances the error term and the regularizer
Loss Functions
◮ Different loss functions ℓ induce different classes of methods
◮ We will see common aspects and differences in considering different loss functions
◮ There exists no general computational scheme to solve Tikhonov regularization
◮ The solution depends on the considered loss function
The Regularized Least Squares Algorithm
Regularized Least Squares: Tikhonov regularization
$$\min_{w \in \mathbb{R}^D} \hat{\mathcal{E}}(f_w) + \lambda \|w\|^2, \qquad \hat{\mathcal{E}}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f_w(x_i)) \quad (2)$$
Square loss function:
$$\ell(y, f_w(x)) = (y - f_w(x))^2$$
We then obtain the RLS optimization problem (linear model):
$$\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda w^\top w, \qquad \lambda \ge 0. \quad (3)$$
Matrix Notation
◮ The n × d matrix $X_n$, whose rows are the input points
◮ The n × 1 vector $Y_n$, whose entries are the corresponding outputs.
With this notation,
$$\frac{1}{n} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 = \frac{1}{n} \|Y_n - X_n w\|^2.$$
Gradients of the ER and of the Regularizer
By direct computation,
◮ Gradient of the empirical risk w.r.t. w:
$$-\frac{2}{n} X_n^\top (Y_n - X_n w)$$
◮ Gradient of the regularizer w.r.t. w:
$$2w$$
The RLS Solution
By setting the gradient to zero, the solution of RLS solves the linear system
$$(X_n^\top X_n + \lambda n I)\, w = X_n^\top Y_n.$$
λ controls the invertibility of $(X_n^\top X_n + \lambda n I)$
Choosing the Cholesky Solver
◮ Several methods can be used to solve the above linear system
◮ Cholesky decomposition is the method of choice, since
$$X_n^\top X_n + \lambda n I$$
is symmetric and positive definite.
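A minimal NumPy/SciPy sketch of RLS training along these lines; the function names, the synthetic data, and the value of λ are illustrative assumptions, not part of the slides.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rls_fit(X, y, lam):
    """Solve (X^T X + lam * n * I) w = X^T y via a Cholesky factorization."""
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)          # symmetric and positive definite for lam > 0
    c, low = cho_factor(A)
    return cho_solve((c, low), X.T @ y)

def rls_predict(X, w):
    return X @ w

# Synthetic data, just to exercise the two functions.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_true = rng.standard_normal(10)
y = X @ w_true + 0.1 * rng.standard_normal(200)

w = rls_fit(X, y, lam=0.1)
print(np.mean((rls_predict(X, w) - y) ** 2))   # training error
```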
Time Complexity
Time complexity of the method:
◮ Training: O(nd²) (assuming n ≫ d)
◮ Testing: O(d)
Dealing with an Offset
For linear models, especially in low dimensional spaces, it is useful to consider an offset: $w^\top x + b$. How to estimate b from data?
Idea: Augmenting the Dimension of the Input Space
◮ Simple idea: augment the dimension of the input space, considering $\tilde{x} = (x, 1)$ and $\tilde{w} = (w, b)$.
◮ This is fine if we do not regularize, but if we do, this method tends to prefer linear functions passing through the origin (zero offset), since the regularizer becomes
$$\|\tilde{w}\|^2 = \|w\|^2 + b^2.$$
Avoiding to Penalize the Solutions with Offset
We want to regularize considering only $\|w\|^2$, without penalizing the offset.
The modified regularized problem becomes:
$$\min_{(w, b) \in \mathbb{R}^{D+1}} \frac{1}{n} \sum_{i=1}^{n} (y_i - w^\top x_i - b)^2 + \lambda \|w\|^2.$$
Solution with Offset: Centering the Data
It can be proved that a solution $(w^*, b^*)$ of the above problem is given by
$$b^* = \bar{y} - \bar{x}^\top w^*$$
where
$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$
Solution with Offset: Centering the Data
$w^*$ solves
$$\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^{n} (y_i^c - w^\top x_i^c)^2 + \lambda \|w\|^2,$$
where $y_i^c = y_i - \bar{y}$ and $x_i^c = x_i - \bar{x}$ for all i = 1, ..., n.
Note: This corresponds to centering the data and then applying the standard RLS algorithm.
Introduction: Regularized Logistic Regression
Regularized logistic regression: Tikhonov regularization
$$\min_{w \in \mathbb{R}^d} \hat{\mathcal{E}}(f_w) + \lambda \|w\|^2, \qquad \hat{\mathcal{E}}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f_w(x_i)) \quad (4)$$
With the logistic loss function:
$$\ell(y, f_w(x)) = \log(1 + e^{-y f_w(x)})$$
The Logistic Loss Function
Figure: Plot of the logistic loss function
Minimization Through Gradient Descent
◮ The logistic loss function is differentiable
◮ The natural candidate for computing a minimizer is the gradient descent (GD) algorithm
Regularized Logistic Regression (RLR)
◮ The regularized ERM problem associated with the logistic loss is called regularized logistic regression
◮ Its solution can be computed via gradient descent
◮ Note:
$$\nabla \hat{\mathcal{E}}(f_w) = \frac{1}{n} \sum_{i=1}^{n} x_i \frac{-y_i\, e^{-y_i x_i^\top w_{t-1}}}{1 + e^{-y_i x_i^\top w_{t-1}}} = \frac{1}{n} \sum_{i=1}^{n} x_i \frac{-y_i}{1 + e^{y_i x_i^\top w_{t-1}}}$$
RLR: Gradient Descent Iteration
For $w_0 = 0$, the GD iteration applied to $\min_{w \in \mathbb{R}^d} \hat{\mathcal{E}}(f_w) + \lambda \|w\|^2$ is
$$w_t = w_{t-1} - \gamma \underbrace{\left[ \frac{1}{n} \sum_{i=1}^{n} x_i \frac{-y_i}{1 + e^{y_i x_i^\top w_{t-1}}} + 2\lambda w_{t-1} \right]}_{a}$$
for t = 1, ..., T, where $a = \nabla(\hat{\mathcal{E}}(f_w) + \lambda \|w\|^2)$ evaluated at $w_{t-1}$
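A minimal NumPy sketch of this iteration; the step size γ, the number of iterations T, and the synthetic data are arbitrary choices made for the illustration.

```python
import numpy as np

def rlr_gradient_descent(X, y, lam, gamma=0.1, T=1000):
    """Gradient descent for (1/n) sum_i log(1 + exp(-y_i x_i^T w)) + lam * ||w||^2.

    Labels y are assumed to be in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)                            # w_0 = 0, as in the slides
    for _ in range(T):
        margins = y * (X @ w)
        # gradient of the empirical risk: (1/n) sum_i x_i * (-y_i) / (1 + exp(y_i x_i^T w))
        grad_risk = X.T @ (-y / (1.0 + np.exp(margins))) / n
        w = w - gamma * (grad_risk + 2 * lam * w)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sign(X @ rng.standard_normal(5))
w = rlr_gradient_descent(X, y, lam=0.01)
print(np.mean(np.sign(X @ w) == y))            # training accuracy
```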
Logistic Regression and Confidence Estimation
◮ The solution of logistic regression has a probabilistic interpretation
◮ It can be derived from the following model:
$$p(1|x) = \underbrace{\frac{e^{x^\top w}}{1 + e^{x^\top w}}}_{h}$$
where h is called the logistic function.
◮ This can be used to compute a confidence for each prediction
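A small sketch of this confidence computation, assuming a weight vector w has already been estimated (for instance with the gradient-descent sketch above):

```python
import numpy as np

def prediction_confidence(X, w):
    """p(1|x) = e^{x^T w} / (1 + e^{x^T w}), i.e. the logistic function of x^T w."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))

# Predict +1 where prediction_confidence(X, w) > 0.5, and -1 otherwise.
```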
Support Vector Machines
Formulation in terms of Tikhonov regularization:
$$\min_{w \in \mathbb{R}^d} \hat{\mathcal{E}}(f_w) + \lambda \|w\|^2, \qquad \hat{\mathcal{E}}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f_w(x_i)) \quad (5)$$
With the hinge loss function:
$$\ell(y, f_w(x)) = |1 - y f_w(x)|_+$$
Figure: Plot of the hinge loss as a function of y · f(x)
A more classical formulation (linear case)
$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2, \qquad \text{with } \lambda = \frac{1}{C}$$
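The hinge loss is not differentiable where $y_i w^\top x_i = 1$, but the objective above can still be minimized by subgradient descent. The sketch below is my illustration of that approach, not a method prescribed by the slides; the step size and iteration count are arbitrary.

```python
import numpy as np

def linear_svm_subgradient(X, y, lam, gamma=0.01, T=2000):
    """Subgradient descent for (1/n) sum_i |1 - y_i w^T x_i|_+ + lam * ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        margins = y * (X @ w)
        active = margins < 1                   # points where the hinge loss is positive
        # subgradient of the empirical risk: (1/n) * sum over active points of -y_i x_i
        subgrad_risk = -(X[active].T @ y[active]) / n
        w = w - gamma * (subgrad_risk + 2 * lam * w)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sign(X @ rng.standard_normal(5))
w = linear_svm_subgradient(X, y, lam=0.01)
print(np.mean(np.sign(X @ w) == y))            # training accuracy
```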
A more classical formulation (linear case)
$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d,\; \xi_i \ge 0} \; \|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \qquad \text{subject to } y_i w^\top x_i \ge 1 - \xi_i \;\; \forall i \in \{1, \dots, n\}$$
A geometric intuition - classification
In general there are many possible solutions
Which one should we select?
A geometric intuition - classification
Intuitively I would choose an “equidistant” line
Maximum margin classifier
I want the classifier that
◮ classifies the dataset perfectly
◮ maximizes the distance from its closest examples
Point-Hyperplane distance
How to do it mathematically? Let w define our separating hyperplane $\{x : w^\top x = 0\}$. We can decompose
$$x = \alpha w + x_\perp, \qquad \alpha = \frac{x^\top w}{\|w\|^2}, \qquad x_\perp = x - \alpha w.$$
Point-hyperplane distance:
$$d(x, w) = \|x - x_\perp\| = \|\alpha w\| = \frac{|x^\top w|}{\|w\|}$$
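A few lines of NumPy checking this decomposition and distance numerically (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(3)                     # normal vector of the hyperplane {x : w^T x = 0}
x = rng.standard_normal(3)

alpha = (x @ w) / (w @ w)                      # component of x along w
x_perp = x - alpha * w                         # component of x lying in the hyperplane
distance = abs(x @ w) / np.linalg.norm(w)      # point-hyperplane distance

assert np.isclose(w @ x_perp, 0.0)             # x_perp is orthogonal to w
assert np.isclose(np.linalg.norm(alpha * w), distance)
print(distance)
```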
Margin
A hyperplane w correctly classifies an example $(x_i, y_i)$ if
◮ $y_i = 1$ and $w^\top x_i > 0$, or
◮ $y_i = -1$ and $w^\top x_i < 0$
therefore $x_i$ is correctly classified iff $y_i w^\top x_i > 0$.
Margin: $m_i = y_i w^\top x_i$
Note that $x_\perp = x_i - y_i m_i \frac{w}{\|w\|^2}$, so that $d(x_i, w) = \frac{m_i}{\|w\|}$ for a correctly classified point.
Maximum margin classifier definition
I want the classifier that
◮ classifies the dataset perfectly
◮ maximizes the distance from its closest examples
$$w^* = \operatorname*{argmax}_{w \in \mathbb{R}^d} \; \min_{1 \le i \le n} d(x_i, w)^2 \qquad \text{subject to } m_i > 0 \;\; \forall i \in \{1, \dots, n\}$$
Calling µ the smallest of the margins $m_i$, we have
$$w^* = \operatorname*{argmax}_{w \in \mathbb{R}^d,\; \mu \ge 0} \; \min_{1 \le i \le n} \frac{(x_i^\top w)^2}{\|w\|^2} \qquad \text{subject to } y_i w^\top x_i \ge \mu \;\; \forall i \in \{1, \dots, n\}$$
that is
Computation of w∗
$$w^* = \operatorname*{argmax}_{w \in \mathbb{R}^d} \; \frac{\mu^2}{\|w\|^2} \qquad \text{subject to } y_i w^\top x_i \ge \mu \;\; \forall i \in \{1, \dots, n\},\; \mu \ge 0$$
Computation of w∗
$$w^* = \operatorname*{argmax}_{w \in \mathbb{R}^d,\; \mu \ge 0} \; \frac{\mu^2}{\|w\|^2} \qquad \text{subject to } y_i w^\top x_i \ge \mu \;\; \forall i \in \{1, \dots, n\}$$
Note that if $y_i w^\top x_i \ge \mu$, then $y_i (\alpha w)^\top x_i \ge \alpha\mu$ and $\frac{\mu^2}{\|w\|^2} = \frac{(\alpha\mu)^2}{\|\alpha w\|^2}$ for any $\alpha > 0$. Therefore we have to fix the scale parameter; in particular we choose µ = 1.
Computation of w∗
$$w^* = \operatorname*{argmax}_{w \in \mathbb{R}^d} \; \frac{1}{\|w\|^2} \qquad \text{subject to } y_i w^\top x_i \ge 1 \;\; \forall i \in \{1, \dots, n\}$$
Computation of w∗
$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d} \; \|w\|^2 \qquad \text{subject to } y_i w^\top x_i \ge 1 \;\; \forall i \in \{1, \dots, n\}$$
What if the problem is not separable?
We relax the constraints and penalize the relaxation:
$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d} \; \|w\|^2 \qquad \text{subject to } y_i w^\top x_i \ge 1 \;\; \forall i \in \{1, \dots, n\}$$
What if the problem is not separable?
We relax the constraints and penalize the relaxation:
$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d,\; \xi_i \ge 0} \; \|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \qquad \text{subject to } y_i w^\top x_i \ge 1 - \xi_i \;\; \forall i \in \{1, \dots, n\}$$
where C is a penalization parameter for the average error $\frac{1}{n} \sum_{i=1}^{n} \xi_i$.
Dual formulation
It can be shown that the solution of the SVM problem is of the form
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i$$
where the $\alpha_i$ are given by the solution of the following quadratic programming problem:
$$\max_{\alpha \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j x_i^\top x_j \qquad \text{subject to } \alpha_i \ge 0, \;\; i = 1, \dots, n$$
◮ The solution requires the estimate of n rather than D coefficients
◮ The $\alpha_i$ are often sparse. The input points associated with non-zero coefficients are called support vectors
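For a quick experiment, an off-the-shelf solver can be used to inspect the dual solution. The sketch below uses scikit-learn's SVC with a linear kernel (a choice made for this illustration, not part of the slides); its dual_coef_ attribute stores the products α_i y_i for the support vectors only.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.sign(X @ np.array([1.0, -1.0]) + 0.1 * rng.standard_normal(200))

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# w = sum_i alpha_i y_i x_i, using only the support vectors.
w = clf.dual_coef_ @ clf.support_vectors_
print("number of support vectors:", len(clf.support_))
print("w from the dual expansion: ", w.ravel())
print("w stored by the solver:    ", clf.coef_.ravel())
```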
Wrapping up
Regularized Empirical Risk Minimization:
$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, w^\top x_i) + \lambda \|w\|^2$$
Examples of Regularization Networks:
◮ $\ell(y, t) = (y - t)^2$ (square loss) leads to Least Squares
◮ $\ell(y, t) = \log(1 + e^{-yt})$ (logistic loss) leads to Logistic Regression
◮ $\ell(y, t) = |1 - yt|_+$ (hinge loss) leads to the Maximum Margin Classifier
Next class
... beyond linear models!