Data Sciences – CentraleSupelec, Advanced Machine Learning, Course II – Linear regression / Linear classification


  1. Data Sciences – CentraleSupelec, Advanced Machine Learning, Course II – Linear regression / Linear classification
Emilie Chouzenoux, Center for Visual Computing, CentraleSupelec
emilie.chouzenoux@centralesupelec.fr

  2. Linear regression
Motivations:
◮ Simple approach (essential to understand more sophisticated ones)
◮ Interpretable description of the relations inputs ↔ outputs
◮ Can outperform nonlinear models in the case of few training data / high noise / sparse data
◮ Extended applicability when combined with basis-function methods (see Lab)

  3. Linear regression (cont.)
Applications: prediction of
◮ future sales of products, based on past buying behaviour
◮ the economic growth of a country or state
◮ how many houses will sell in the coming months, and at what price
◮ the number of goals a player will score in coming matches, based on previous performances
◮ exam results, based on the hours of study a student puts in

  4. Linear model
Training data: x_i ∈ R^d, y_i ∈ R, i = 1, …, n.
The (x_i)_{1≤i≤n} are inputs, transformed versions of inputs (e.g. through a log), or basis expansions.

  5. Linear model (cont.)
Fitting model: y_i ≈ f(x_i) for all i = 1, …, n, with, for every i ∈ {1, …, n},
f(x_i) = β_0 + β_1 x_{i1} + … + β_d x_{id} = x'_i^⊤ β = [Xβ]_i
where X ∈ R^{n×(d+1)} is the matrix whose i-th row is x'_i = [1, x_{i1}, …, x_{id}].

  6. Linear model (cont.)
[β_1, …, β_d] defines a hyperplane in R^d, and β_0 can be viewed as a bias shifting the function f perpendicularly to the hyperplane.

  7. Linear model (cont.)
Goal: using the training set, learn the linear function f (parametrized by β) that predicts a real value y from an observation x. (A small sketch of the design matrix construction is given below.)
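As a minimal illustration (not part of the slides), the design matrix X with its leading column of ones and the evaluation f(x_i) = [Xβ]_i can be written in NumPy as follows; the toy data and variable names are made up.

```python
import numpy as np

def design_matrix(X_raw):
    """Prepend a column of ones so that beta_0 acts as the intercept."""
    n = X_raw.shape[0]
    return np.hstack([np.ones((n, 1)), X_raw])

# Hypothetical toy data: n = 5 samples, d = 2 features.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(5, 2))
beta = np.array([0.5, 2.0, -1.0])   # [beta_0, beta_1, beta_2]

X = design_matrix(X_raw)            # shape (n, d + 1)
f = X @ beta                        # f(x_i) = x_i'^T beta = [X beta]_i
```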

  8. Least squares
Principle: search for β that minimizes the sum of squared residuals
F(β) = (1/2) ∑_{i=1}^n (y_i − f(x_i))² = (1/2) ‖Xβ − y‖² = (1/2) ‖e‖²
with e = Xβ − y the residual vector.

  9. Optimization (reminders?)
We search for a solution to min_β F(β), where F : R^{d+1} → R is convex.
β̂ is a minimizer if and only if ∇F(β̂) = 0, where ∇F is the gradient of F, with components [∇F(β)]_j = ∂F(β)/∂β_j for all j = 0, …, d.
Note that F also reads: F(β) = (1/2) y^⊤ y − β^⊤ X^⊤ y + (1/2) β^⊤ X^⊤ X β.
Its gradient is ∇F(β) = −X^⊤ y + X^⊤ X β.
Assuming that X has full column rank, X^⊤ X is positive definite, the solution is unique and reads:
β̂ = (X^⊤ X)^{−1} X^⊤ y
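A minimal NumPy sketch of this closed-form solution (not from the slides): np.linalg.lstsq solves the least-squares problem directly, and agrees with the normal-equations formula when X has full column rank.

```python
import numpy as np

def least_squares(X, y):
    """Solve min_beta 0.5 * ||X beta - y||^2."""
    # Numerically preferred route (SVD based):
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Normal equations, as on the slide: beta = (X^T X)^{-1} X^T y
    beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)
    assert np.allclose(beta_hat, beta_normal_eq)
    return beta_hat
```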

  10. White board

  11. Interpretation
The fitted values at the training inputs are
ŷ = X β̂ = X (X^⊤ X)^{−1} X^⊤ y = H y
where H is called the "hat matrix". This matrix computes the orthogonal projection of y onto the vector subspace spanned by the columns of X.
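As a hedged sketch of this interpretation, the hat matrix can be formed explicitly and checked to be an orthogonal projector (symmetric and idempotent); this is for illustration only, since forming H is rarely done in practice.

```python
import numpy as np

def hat_matrix(X):
    """H = X (X^T X)^{-1} X^T, the orthogonal projector onto col(X)."""
    return X @ np.linalg.solve(X.T @ X, X.T)

# H = hat_matrix(X)
# y_hat = H @ y                    # fitted values
# assert np.allclose(H @ H, H)     # idempotent
# assert np.allclose(H, H.T)       # symmetric
```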

  12. Statistical properties
Variance: Var(β̂) = (X^⊤ X)^{−1} σ², for uncorrelated observations y_i with variance σ² and deterministic x_i.
Unbiased estimator: σ̂² = 1/(n − (d+1)) ∑_{i=1}^n (y_i − ŷ_i)².
Inference properties: assume that Y = β_0 + ∑_{j=1}^d X_j β_j + ε with ε ∼ N(0, σ²). Then β̂ and σ̂ are independent, and
◮ β̂ ∼ N(β, (X^⊤ X)^{−1} σ²)
◮ (n − (d+1)) σ̂² ∼ σ² χ²_{n−(d+1)}
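These two estimates translate directly into code; a minimal sketch, assuming an (n, d+1) design matrix X with the intercept column included:

```python
import numpy as np

def sigma2_hat(X, y, beta_hat):
    """Unbiased estimate of sigma^2: RSS / (n - (d + 1))."""
    n, p = X.shape                      # p = d + 1
    residuals = y - X @ beta_hat
    return residuals @ residuals / (n - p)

def beta_covariance(X, sigma2):
    """Estimated covariance of beta_hat: (X^T X)^{-1} sigma^2."""
    return np.linalg.inv(X.T @ X) * sigma2
```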

  13. High-dimensional linear regression
Problems with least-squares regression if d is large:
◮ Accuracy: the hyperplane fits the data well but predicts (generalizes) badly (low bias / large variance).
◮ Interpretation: we want to identify a small subset of features that are important/relevant for predicting the data.

  14. High-dimensional linear regression (cont.)
Regularization: F(β) = (1/2) ‖y − Xβ‖² + λ R(β)
◮ ridge regression: R(β) = (1/2) ‖β‖²
◮ shrinkage (lasso): R(β) = ‖β‖₁
◮ subset selection: R(β) = ‖β‖₀
An explicit solution exists in the case of ridge (see the sketch below); otherwise, an optimization method is usually needed.
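A minimal sketch of the explicit ridge solution, assuming all coefficients are penalized (in practice the intercept is often left unpenalized, e.g. by centering the data first):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```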

  15. White board

  16. Penalty functions
[Figure: contour plots of ∑_j |β_j|^q]
When the columns of X are orthonormal, the estimators can be deduced component-wise from the LS estimator β̂:
◮ Ridge: β̂_j / (1 + λ)  (weight decay)
◮ Lasso: sign(β̂_j) (|β̂_j| − λ)₊  (soft thresholding)
◮ Best subset: β̂_j · δ(β̂_j² ≥ 2λ)  (hard thresholding)
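The three component-wise rules read directly as code; a small sketch in NumPy (the function names are ours):

```python
import numpy as np

def ridge_shrink(beta_ls, lam):
    """Ridge with orthonormal columns: uniform weight decay."""
    return beta_ls / (1.0 + lam)

def soft_threshold(beta_ls, lam):
    """Lasso with orthonormal columns: sign(b) * max(|b| - lam, 0)."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

def hard_threshold(beta_ls, lam):
    """Best subset with orthonormal columns: keep b_j iff b_j^2 >= 2*lam."""
    return np.where(beta_ls**2 >= 2.0 * lam, beta_ls, 0.0)
```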

  17. White board

  18. White board

  19. Robust regression
Challenge: estimation methods insensitive to outliers and possibly to high-leverage points.
Approach: M-estimation,
F(β) = ∑_{i=1}^n ρ(y_i − x'_i^⊤ β)
with ρ a potential function satisfying:
◮ ρ(e) ≥ 0 and ρ(0) = 0
◮ ρ(e) = ρ(−e)
◮ ρ(e) ≥ ρ(e′) for |e| ≥ |e′|

  20. Robust regression (cont.)
The minimizer β̂ satisfies
∑_{i=1}^n ρ̇(y_i − x'_i^⊤ β̂) x'_i = 0
⇒ IRLS algorithm.

  21. IRLS algorithm
Core idea: let ρ be defined as ρ(x) = φ(|x|) for all x ∈ R, where
(i) φ is differentiable on ]0, +∞[,
(ii) φ(√·) is concave on ]0, +∞[,
(iii) φ̇(x) ≥ 0 for all x ∈ [0, +∞[,
(iv) ω(x) := φ̇(x)/x has a finite limit as x → 0, x > 0.
Then, for all y ∈ R,
(∀ x ∈ R)  ρ(x) ≤ ρ(y) + ρ̇(y)(x − y) + (1/2) ω(|y|)(x − y)².
[Figure: the quadratic majorant h(·, y) lies above the function and touches it at y]

  22. Examples of functions ρ
(The corresponding weights ω(x) are left as an exercise.)
Convex:
◮ |x| − δ log(|x|/δ + 1)
◮ x² if |x| < δ, 2δ|x| − δ² otherwise
◮ log(cosh(x))
◮ (1 + x²/δ²)^{κ/2} − 1
Nonconvex:
◮ 1 − exp(−x²/(2δ²))
◮ x²/(2δ² + x²)
◮ 1 − (1 − x²/(6δ²))³ if |x| ≤ √6 δ, 1 otherwise
◮ tanh(x²/(2δ²))
◮ log(1 + x²/δ²)
with (λ, δ) ∈ ]0, +∞[² and κ ∈ [1, 2].
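As one worked instance (kept hedged, since the ω column is an exercise on the slide): for the second convex entry, applying the construction of slide 21 gives φ̇(t) = 2t for t < δ and 2δ for t ≥ δ, hence ω(x) = 2 if |x| < δ and 2δ/|x| otherwise. A small sketch:

```python
import numpy as np

def rho_huber_like(x, delta):
    """Convex potential from the table:
    x^2 if |x| < delta, 2*delta*|x| - delta^2 otherwise."""
    ax = np.abs(x)
    return np.where(ax < delta, x**2, 2.0 * delta * ax - delta**2)

def omega_huber_like(x, delta):
    """Weight omega(x) = phi'(|x|)/|x| for the potential above
    (our derivation, to be checked against the exercise):
    2 if |x| < delta, 2*delta/|x| otherwise."""
    ax = np.abs(x)
    return np.where(ax < delta, 2.0, 2.0 * delta / np.maximum(ax, 1e-12))
```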

  23. White board

  24. IRLS algorithm
For all k ∈ N,
β_{k+1} = (X^⊤ W_k X)^{−1} X^⊤ W_k y
with the IRLS weight matrix W_k = Diag(ω(y − X β_k)).
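A minimal sketch of this iteration in NumPy (the weight function `omega` can be any of the ω functions discussed above, e.g. omega_huber_like; the initialization and iteration count are our choices, not from the slide):

```python
import numpy as np

def irls(X, y, omega, n_iter=50):
    """IRLS: beta_{k+1} = (X^T W_k X)^{-1} X^T W_k y,
    with W_k = Diag(omega(y - X beta_k))."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares start
    for _ in range(n_iter):
        w = omega(y - X @ beta)                   # elementwise weights
        XtW = X.T * w                             # equals X^T W_k
        beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta

# Usage (hypothetical):
# beta_rob = irls(X, y, lambda e: omega_huber_like(e, delta=1.0))
```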

  25. Linear classification
Applications:
◮ Sentiment analysis from text features
◮ Handwritten digit recognition
◮ Gene expression data classification
◮ Object recognition in images

  26. Linear classification (cont.)
Goal: learn linear functions f_k(·) dividing the input space into a collection of K regions.
◮ Fit a linear function to Pr(G = k | X = x) ∼ linear regression (see the sketch below)
◮ More generally, fit a linear function to a transformation of Pr(G = k | X = x)
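A hedged sketch of the first option (indicator-matrix regression; the function names and data layout are our own, not from the slides): regress a one-hot class-indicator matrix on X by least squares and classify each input by the largest fitted value.

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Least-squares regression of one-hot class indicators on X.

    X : (n, d+1) design matrix (intercept column included)
    g : (n,) integer class labels in {0, ..., K-1}
    Returns B of shape (d+1, K): one column of coefficients per class.
    """
    Y = np.eye(K)[g]                       # one-hot indicators, (n, K)
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B

def predict_class(X, B):
    """Assign each input to the class with the largest fitted value."""
    return np.argmax(X @ B, axis=1)
```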
