
Regression: Practical Machine Learning. Fabian Wauthier, 09/10/2009.



  1. Regression. Practical Machine Learning. Fabian Wauthier, 09/10/2009. Adapted from slides by Kurt Miller and Romain Thibaux.

  2. Outline
     • Ordinary Least Squares Regression
       - Online version
       - Normal equations
       - Probabilistic interpretation
     • Overfitting and Regularization
     • Overview of additional topics
       - L1 Regression
       - Quantile Regression
       - Generalized Linear Models
       - Kernel Regression and Locally Weighted Regression

  3. Outline (repeated as a section transition into Ordinary Least Squares Regression; same items as slide 2).

  4. Regression vs. Classification: Classification. X ⇒ Y
     X can be anything:
       • continuous (ℝ, ℝ^d, …)
       • discrete ({0,1}, {1,…,k}, …)
       • structured (tree, string, …)
       • …
     Y is discrete:
       • {0,1}: binary
       • {1,…,k}: multi-class
       • tree, etc.: structured

  5. Regression vs. Classification: Classification. X ⇒ Y
     X can be anything (continuous ℝ, ℝ^d; discrete {0,1}, {1,…,k}; structured trees, strings; …). Methods:
       • Perceptron
       • Logistic Regression
       • Support Vector Machine
       • Decision Tree
       • Random Forest
       • Kernel trick

  6. Regression vs. Classification: Regression. X ⇒ Y
     X can be anything:
       • continuous (ℝ, ℝ^d, …)
       • discrete ({0,1}, {1,…,k}, …)
       • structured (tree, string, …)
       • …
     Y is continuous: ℝ or ℝ^d.

  7. Examples
     • Voltage ⇒ Temperature
     • Processes, memory ⇒ Power consumption
     • Protein structure ⇒ Energy
     • Robot arm controls ⇒ Torque at effector
     • Location, industry, past losses ⇒ Premium

  8. Linear regression. Given examples (x_i, y_i) for i = 1, …, n, and a new point x_{n+1}, predict y_{n+1}. [Figures: y plotted against x in one and two dimensions.]

  9. Linear regression. We wish to estimate y_{n+1} by a linear function ŷ_{n+1} of our data x_{n+1}:
      ŷ_{n+1} = w_0 + w_1 x_{n+1,1} + w_2 x_{n+1,2} = w⊤x_{n+1},
     where w is a parameter vector to be estimated and we have used the standard convention of letting the first component of x be 1. [Figures: y plotted against x in one and two dimensions, with the linear fit.]

  10. Choosing the regressor. Of the many regression fits that approximate the data, which should we choose? [Figure: data points with candidate regression fits.] Each observation x_i is represented by the augmented feature vector X_i = (1, x_i)⊤.

  11. LMS Algorithm (Least Mean Squares). To clarify what we mean by a good choice of w, we define a cost function for how well we are doing on the training data:
      Cost = (1/2) Σ_{i=1}^n (w⊤x_i - y_i)²
      [Figure: the error, or "residual", is the gap between each observation y_i and its prediction w⊤x_i, with X_i = (1, x_i)⊤.]

  12. LMS Algorithm (Least Mean Squares). The best choice of w is the one that minimizes our cost function
      E = (1/2) Σ_{i=1}^n (w⊤x_i - y_i)² = Σ_{i=1}^n E_i.
      In order to optimize this, we use standard gradient descent,
      w^{t+1} := w^t - α ∂E/∂w,
      where ∂E/∂w = Σ_{i=1}^n ∂E_i/∂w and ∂E_i/∂w = ∂/∂w (1/2)(w⊤x_i - y_i)² = (w⊤x_i - y_i) x_i.

  13. LMS Algorithm (Least Mean Squares). The LMS algorithm is an online method that performs the following update for each new data point:
      w^{t+1} := w^t - α ∂E_i/∂w = w^t + α (y_i - x_i⊤w^t) x_i.
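As an illustration, here is a minimal NumPy sketch of the online LMS update above; the synthetic data, step size, and number of passes are my own choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a bias column, as in the slides: X_i = (1, x_i)^T.
n = 100
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(0, 0.1, size=n)
X = np.column_stack([np.ones(n), x])      # n x 2 design matrix

w = np.zeros(2)                           # initial parameter vector
alpha = 0.05                              # step size (learning rate)

# Online LMS: one step per data point, w <- w + alpha * (y_i - x_i^T w) * x_i.
for _ in range(20):                       # a few passes over the data
    for xi, yi in zip(X, y):
        w = w + alpha * (yi - xi @ w) * xi

print("LMS estimate:", w)                 # approximately (2.0, 1.5) for this data
```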

  14. LMS, Logistic Regression, and Perceptron updates. The three online updates share the same form and differ only in the prediction function f_w:
      • LMS: w^{t+1} := w^t + α (y_i - x_i⊤w) x_i
      • Logistic Regression: w^{t+1} := w^t + α (y_i - f_w(x_i)) x_i
      • Perceptron: w^{t+1} := w^t + α (y_i - f_w(x_i)) x_i
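To make the comparison concrete, here is a small sketch (my own illustration, not from the slides) in which the same update rule is reused with a different prediction function for each method: the identity for LMS, the logistic sigmoid for logistic regression, and a hard threshold for the perceptron.

```python
import numpy as np

def online_update(w, xi, yi, f, alpha=0.1):
    """Shared update: w <- w + alpha * (y_i - f(w, x_i)) * x_i."""
    return w + alpha * (yi - f(w, xi)) * xi

# Prediction functions: only this part differs between the three methods.
f_lms        = lambda w, x: x @ w                           # identity (real-valued)
f_logistic   = lambda w, x: 1.0 / (1.0 + np.exp(-(x @ w)))  # sigmoid in (0, 1)
f_perceptron = lambda w, x: float(x @ w >= 0)               # hard threshold in {0, 1}

# One update on a single labeled point (x_i with a leading 1 for the bias).
xi, yi = np.array([1.0, 2.0]), 1.0
w = np.zeros(2)
for name, f in [("LMS", f_lms), ("logistic", f_logistic), ("perceptron", f_perceptron)]:
    print(name, online_update(w, xi, yi, f))
```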

  15. Ordinary Least Squares (OLS). [Figure: as before, the error or "residual" is the gap between each observation y_i and its prediction w⊤x_i, with X_i = (1, x_i)⊤.]
      Cost = (1/2) Σ_{i=1}^n (w⊤x_i - y_i)²

  16. Minimize the sum squared error.
      E = (1/2) Σ_{i=1}^n (w⊤x_i - y_i)²
        = (1/2) (Xw - y)⊤(Xw - y)
        = (1/2) (w⊤X⊤Xw - 2 y⊤Xw + y⊤y)
      ∂E/∂w = X⊤Xw - X⊤y
      Setting the derivative equal to zero gives us the Normal Equations (X is the n × d matrix whose rows are the x_i⊤):
      X⊤Xw = X⊤y,   so   w = (X⊤X)^{-1} X⊤y.
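A minimal NumPy sketch of solving the normal equations on synthetic data (the data and dimensions are my own illustrative choices); it also checks that the residuals are orthogonal to the columns of X, the geometric fact used on the next two slides.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])   # n x d, first column all ones
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, size=n)

# Solve the normal equations X^T X w = X^T y directly.
w = np.linalg.solve(X.T @ X, X.T @ y)

# In practice, np.linalg.lstsq is the numerically preferred route to the same solution.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals are orthogonal to the columns of X: X^T (Xw - y) = 0 (up to rounding).
print("normal-equation solution:", w)
print("lstsq solution:          ", w_lstsq)
print("X^T (Xw - y) ~ 0:        ", X.T @ (X @ w - y))
```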

  17. A geometric interpretation. We solved ∂E/∂w = X⊤(Xw - y) = 0.
      ⇒ The residuals are orthogonal to the columns of X.
      ⇒ ŷ = Xw gives the best reconstruction of y in the range of X.

  18. Residual vector. [Figure: the residual vector y - y' is orthogonal to the subspace S spanned by the columns of X (drawn with basis vectors [X]_1 and [X]_2); y' is the orthogonal projection of y onto S.]

  19. Computing the solution. We compute w = (X⊤X)^{-1} X⊤y. If X⊤X is invertible, then (X⊤X)^{-1} X⊤ coincides with the pseudoinverse X⁺ of X and the solution w is unique. If X⊤X is not invertible, there is no unique solution; in that case w = X⁺y chooses the solution with the smallest Euclidean norm. An alternative way to deal with a non-invertible X⊤X is to add a small portion of the identity matrix (Ridge regression).
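A small sketch (with made-up data) of the rank-deficient case: np.linalg.pinv applies the pseudoinverse and returns the minimum-norm solution, while adding ε·I (ridge) makes X⊤X invertible again.

```python
import numpy as np

# A rank-deficient design: the third column duplicates the second, so X^T X is singular.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 5.0, 5.0],
              [1.0, 7.0, 7.0]])
y = np.array([1.0, 2.0, 4.0, 6.0])

print("rank of X^T X:", np.linalg.matrix_rank(X.T @ X))   # 2 < 3, so not invertible

# Pseudoinverse solution: among all minimizers of ||Xw - y||, this one has the smallest norm.
w_pinv = np.linalg.pinv(X) @ y
print("minimum-norm solution:", w_pinv)

# Ridge alternative: X^T X + eps*I is invertible for any eps > 0.
eps = 1e-3
w_ridge = np.linalg.solve(X.T @ X + eps * np.eye(X.shape[1]), X.T @ y)
print("ridge solution:", w_ridge)
```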

  20. Beyond lines and planes. Linear models become powerful function approximators when we consider non-linear feature transformations of the inputs. Predictions are still linear in X! All the math is the same! [Figure: a non-linear fit to the data.]

  21. Geometric interpretation. ŷ = w_0 + w_1 x + w_2 x². [Figure: the quadratic fit viewed as a plane over the features (x, x²).] [Matlab demo]
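The Matlab demo is not included here; as a stand-in, here is a small NumPy sketch (with synthetic data of my own choosing) of fitting ŷ = w_0 + w_1 x + w_2 x² with the same least-squares machinery, only with an expanded design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a quadratic with noise.
x = rng.uniform(0, 20, size=60)
y = 1.0 + 0.5 * x - 0.1 * x**2 + rng.normal(0, 1.0, size=60)

# Non-linear feature transformation: each x becomes (1, x, x^2).
Phi = np.column_stack([np.ones_like(x), x, x**2])

# The model is still linear in the features, so ordinary least squares applies unchanged.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("fitted (w0, w1, w2):", w)   # compare with the generating coefficients (1.0, 0.5, -0.1)

# Predict at a new point.
x_new = 12.0
y_hat = np.array([1.0, x_new, x_new**2]) @ w
print("prediction at x =", x_new, ":", y_hat)
```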

  22. Ordinary Least Squares [summary]. Given examples (x_i, y_i) for i = 1, …, n, let X be the n × d matrix whose rows are the (possibly feature-transformed) inputs, e.g. (1, x_i, x_i², …), and let y be the vector of outputs. Minimize (1/2)‖Xw - y‖² by solving the normal equations X⊤Xw = X⊤y, and predict ŷ_{n+1} = w⊤x_{n+1}.

  23. Probabilistic interpretation. Model each observation as y_i = w⊤x_i plus Gaussian noise of variance σ², so the likelihood of the data is proportional to Π_i exp(-(w⊤x_i - y_i)²/(2σ²)). Maximizing this likelihood is the same as minimizing the squared-error cost. [Figure: data with the fitted line and the Gaussian likelihood around it.]
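Spelling out the equivalence sketched on this slide (using the Gaussian likelihood that reappears on slide 30):

```latex
% Gaussian noise model: y_i = w^T x_i + eps_i,  eps_i ~ N(0, sigma^2).
\begin{aligned}
p(y_1,\dots,y_n \mid x_1,\dots,x_n, w)
  &\propto \prod_{i=1}^{n} \exp\!\Big(-\tfrac{1}{2\sigma^2}\,(w^\top x_i - y_i)^2\Big) \\
  &= \exp\!\Big(-\tfrac{1}{2\sigma^2}\sum_{i=1}^{n} (w^\top x_i - y_i)^2\Big),
\end{aligned}
% so maximizing the likelihood over w is exactly minimizing the squared-error cost
% E = (1/2) \sum_i (w^\top x_i - y_i)^2.
```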

  24. Conditional Gaussians. [Figure: the conditional density p(y|x) at several values of x, each Gaussian with its own mean µ (µ = 3, 5, 8 in the plot).]

  25. BREAK

  26. Outline (repeated as a section transition into Overfitting and Regularization; same items as slide 2).

  27. Overfitting
      • So the more features the better? NO!
      • Carefully selected features can improve model accuracy.
      • But adding too many can lead to overfitting.
      • Feature selection will be discussed in a separate lecture.

  28. Overfitting. [Figure: a degree-15 polynomial fit to the data, illustrating overfitting.] [Matlab demo]

  29. Ridge Regression (Regularization). Minimize the squared error plus a penalty (ε/2)‖w‖² with "small" ε, by solving
      (X⊤X + εI) w = X⊤y.
      [Figure: effect of regularization on a degree-19 polynomial fit.] [Continue Matlab demo]
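A minimal NumPy sketch of the ridge solve above, reusing the polynomial-expansion idea from slide 21; the data, degree, and ε values are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a smooth function, expanded in a degree-15 polynomial basis.
x = np.linspace(0.0, 1.0, 21)
y = np.sin(6.0 * x) + rng.normal(0, 0.2, size=x.size)
Phi = np.vander(x, 16, increasing=True)          # columns 1, x, x^2, ..., x^15

def ridge_fit(Phi, y, eps):
    """Solve (Phi^T Phi + eps*I) w = Phi^T y."""
    return np.linalg.solve(Phi.T @ Phi + eps * np.eye(Phi.shape[1]), Phi.T @ y)

for eps in [1e-8, 1e-3, 1e-1]:
    w = ridge_fit(Phi, y, eps)
    resid = Phi @ w - y
    print(f"eps={eps:g}  ||w||={np.linalg.norm(w):10.2f}  training RSS={resid @ resid:8.3f}")
# Larger eps shrinks the weights (a smoother fit) at the cost of a larger training error.
```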

  30. Probabilistic interpretation. Posterior ∝ Likelihood × Prior:
      P(w | X, y) = P(w, x_1, …, x_n, y_1, …, y_n) / P(x_1, …, x_n, y_1, …, y_n)
                  ∝ P(w, x_1, …, x_n, y_1, …, y_n)
                  ∝ exp(-(ε/(2σ²)) ‖w‖²) · Π_i exp(-(1/(2σ²)) (x_i⊤w - y_i)²)
                  = exp(-(1/(2σ²)) [Σ_i (x_i⊤w - y_i)² + ε ‖w‖²])
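For completeness, maximizing this posterior (MAP estimation) recovers exactly the ridge solve from slide 29; a short derivation:

```latex
\begin{aligned}
\hat{w}_{\mathrm{MAP}}
  &= \arg\max_{w}\ \exp\!\Big(-\tfrac{1}{2\sigma^2}\Big[\textstyle\sum_i (x_i^\top w - y_i)^2 + \epsilon\,\|w\|^2\Big]\Big) \\
  &= \arg\min_{w}\ \tfrac{1}{2}\,\|Xw - y\|^2 + \tfrac{\epsilon}{2}\,\|w\|^2 .
\end{aligned}
% Setting the gradient to zero:  X^\top (Xw - y) + \epsilon w = 0
% \;\Rightarrow\; (X^\top X + \epsilon I)\,w = X^\top y .
```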

  31. Outline (repeated as a section transition into the overview of additional topics; same items as slide 2).

  32. Errors in Variables (Total Least Squares). Total least squares accounts for measurement noise in the inputs x as well as in y. [Figure: illustration of an errors-in-variables fit.]

  33. Sensitivity to outliers. High weight is given to outliers. [Figure: "Temperature at noon" data in which an outlier pulls the least-squares fit; the influence function is shown alongside.]

  34. L1 Regression. Minimize the sum of absolute residuals Σ_i |w⊤x_i - y_i| instead of the squared residuals. This can be solved as a linear program, and its influence function is bounded, so outliers have limited effect. [Matlab demo]
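The slide's linear program is not written out; below is one standard LP formulation of L1 regression (minimize Σ_i t_i subject to -t_i ≤ w⊤x_i - y_i ≤ t_i), sketched with scipy.optimize.linprog on made-up data containing one outlier.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Small data set with one gross outlier.
x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=10)
y[7] += 25.0                                   # the outlier
X = np.column_stack([np.ones_like(x), x])      # n x d design matrix
n, d = X.shape

# Variables z = (w, t): minimize sum(t) subject to X w - y <= t and -(X w - y) <= t.
c = np.concatenate([np.zeros(d), np.ones(n)])
A_ub = np.block([[ X, -np.eye(n)],
                 [-X, -np.eye(n)]])
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * d + [(0, None)] * n  # w free, t >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
w_l1 = res.x[:d]

# Least-squares fit for comparison: the outlier drags it away from roughly (1.0, 2.0).
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("L1  fit (robust):       ", w_l1)
print("OLS fit (outlier-driven):", w_ols)
```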

  35. Quantile Regression. [Figure: CPU utilization [MHz] vs. workload (ViewItem.php) [req/s], with one line tracking the mean CPU and another the 95th percentile of CPU.] Slide courtesy of Peter Bodik.

  36. Generalized Linear Models. Probabilistic interpretation of OLS: the mean of the Gaussian conditional is linear in X_i, i.e. OLS linearly predicts the mean of a Gaussian conditional density. GLM: predict the mean of some other conditional density,
      y_i | x_i ∼ p( f(X_i⊤w) ).
      The linear prediction may need to be transformed by f(·) to produce a valid parameter.

  37. Example: "Poisson regression". Suppose the data are event counts: y ∈ ℕ₀. A typical distribution for count data is the Poisson,
      Poisson(y | λ) = e^{-λ} λ^y / y!,
      whose mean parameter is λ > 0. Say we predict λ = f(x⊤w) = exp(x⊤w). GLM:
      y_i | x_i ∼ Poisson( f(X_i⊤w) ).
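A minimal sketch of fitting this GLM by gradient descent on the negative log-likelihood; the synthetic counts, step size, and iteration count are my own choices. The gradient Σ_i (exp(x_i⊤w) - y_i) x_i follows directly from the Poisson log-likelihood above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic count data from a Poisson GLM with a log link.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # rows (1, x_i)
w_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ w_true))

# Gradient descent on the negative log-likelihood sum_i [exp(x_i^T w) - y_i * x_i^T w + log(y_i!)].
w = np.zeros(2)
lr = 1e-2
for _ in range(2000):
    lam = np.exp(X @ w)                  # predicted means, lambda_i = exp(x_i^T w)
    grad = X.T @ (lam - y)               # sum_i (lambda_i - y_i) x_i
    w -= lr * grad / n                   # averaged gradient step

print("true w:  ", w_true)
print("fitted w:", w)                    # close to the true parameters for this sample size
```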
