
  1. Machine Learning - MT 2016, 2. Linear Regression. Varun Kanade, University of Oxford, October 12, 2016

  2. Announcements
     ◮ All students eligible to take the course for credit can sign up for classes and practicals
     ◮ Attempt Problem Sheet 0 (contact your class tutor if you intend to attend class in Week 2)
     ◮ Problem Sheet 1 is posted (submit by noon on 21 Oct at CS reception)

  3. Announcement: Strachey Lecture
     ◮ Will finish 15-20 min early on Monday, October 31
     ◮ May run over by 5 minutes or so on a few other days

  4. Outline
     Goals
     ◮ Review the supervised learning setting
     ◮ Describe the linear regression framework
     ◮ Apply the linear model to make predictions
     ◮ Derive the least squares estimate
     Supervised Learning Setting
     ◮ Data consists of input and output pairs
     ◮ Inputs (also covariates, independent variables, predictors, features)
     ◮ Outputs (also variates, dependent variables, targets, labels)

  5. Why study linear regression?
     ◮ Least squares is at least 200 years old, going back to Legendre and Gauss
     ◮ Francis Galton (1886): "regression to the mean"
     ◮ Often real processes can be approximated by linear models
     ◮ More complex models require understanding linear regression
     ◮ Closed-form analytic solutions can be obtained
     ◮ Many key notions of machine learning can be introduced

  6. A toy example: Commute Times
     Want to predict the commute time into the city centre. What variables would be useful?
     ◮ Distance to city centre
     ◮ Day of the week
     Data:
     dist (km)   day   commute time (min)
     2.7         fri   25
     4.1         mon   33
     1.0         sun   15
     5.2         tue   45
     2.8         sat   22

  7. Linear Models
     Suppose the input is a vector $\mathbf{x} \in \mathbb{R}^D$ and the output is $y \in \mathbb{R}$. We have data $\langle \mathbf{x}_i, y_i \rangle_{i=1}^N$.
     Notation: data dimension $D$, size of dataset $N$, column vectors.
     Linear Model: $y = w_0 + x_1 w_1 + \cdots + x_D w_D + \epsilon$, where $w_0$ is the bias/intercept and $\epsilon$ is the noise/uncertainty term.
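To make the model concrete, here is a minimal sketch (assuming NumPy; the weights, bias, and noise level are made up for illustration, not taken from the lecture) that generates outputs from a linear model $y = w_0 + \mathbf{x} \cdot \mathbf{w} + \epsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 5, 2                           # dataset size and input dimension (toy values)
X = rng.uniform(0, 5, size=(N, D))    # inputs x_i in R^D
w = np.array([6.0, 2.0])              # weights w_1, ..., w_D (made up)
w0 = 10.0                             # bias / intercept
eps = rng.normal(0, 1.0, size=N)      # epsilon: noise / uncertainty

y = w0 + X @ w + eps                  # y = w_0 + x_1 w_1 + ... + x_D w_D + epsilon
print(y)
```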

  8. Linear Models: Commute Time
     Linear Model: $y = w_0 + x_1 w_1 + \cdots + x_D w_D + \epsilon$ (bias/intercept $w_0$, noise/uncertainty $\epsilon$)
     Input encoding: mon-sun has to be converted to a number.
     ◮ monday: 0, tuesday: 1, ..., sunday: 6. Using 0-6 is a bad encoding, since it imposes an arbitrary ordering and scale on the days.
     ◮ Use seven 0-1 valued features instead, one per day; this is called one-hot encoding.
     ◮ For this example we simplify further to a single 0-1 feature: 0 if weekend, 1 if weekday.
     Say $x_1 \in \mathbb{R}$ (distance) and $x_2 \in \{0, 1\}$ (weekend/weekday).
     Linear model for commute time: $y = w_0 + w_1 x_1 + w_2 x_2 + \epsilon$
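A small sketch of the three encodings mentioned above, assuming NumPy and the days from the toy table (the variable names are illustrative, not from the lecture):

```python
import numpy as np

days = ["fri", "mon", "sun", "tue", "sat"]            # day feature from the toy data
order = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

# Bad encoding: mon=0, ..., sun=6 imposes an arbitrary ordering/scale on the days.
ordinal = np.array([order.index(d) for d in days])

# One-hot encoding: seven 0-1 features, exactly one of which is 1 per example.
one_hot = np.zeros((len(days), 7))
one_hot[np.arange(len(days)), ordinal] = 1

# Simplification used in the lecture: a single 0-1 feature, 1 if weekday, 0 if weekend.
is_weekday = np.array([0 if d in ("sat", "sun") else 1 for d in days])

print(ordinal)      # [4 0 6 1 5]
print(one_hot)
print(is_weekday)   # [1 1 0 1 0]
```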

  9. Linear Model: Adding a feature for the bias term

     dist  day  commute time            one  dist  day  commute time
     x1    x2   y                       x0   x1    x2   y
     2.7   fri  25                      1    2.7   fri  25
     4.1   mon  33           ⇔          1    4.1   mon  33
     1.0   sun  15                      1    1.0   sun  15
     5.2   tue  45                      1    5.2   tue  45
     2.8   sat  22                      1    2.8   sat  22

     Model: $y = w_0 + w_1 x_1 + w_2 x_2 + \epsilon$   ⇔   $y = w_0 x_0 + w_1 x_1 + w_2 x_2 + \epsilon = \mathbf{w} \cdot \mathbf{x} + \epsilon$
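A sketch of the same trick in code, assuming NumPy: prepend a constant column of ones so the intercept becomes just another weight.

```python
import numpy as np

# Toy features: distance (km) and weekday flag, as in the table above.
X = np.array([[2.7, 1],
              [4.1, 1],
              [1.0, 0],
              [5.2, 1],
              [2.8, 0]])

# Add the constant feature x_0 = 1 so that y = w . x absorbs the bias term w_0.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
print(X_aug)
```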

  10. Learning Linear Models
      Data: $\langle (\mathbf{x}_i, y_i) \rangle_{i=1}^N$, where $\mathbf{x}_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$
      Model parameter: $\mathbf{w} \in \mathbb{R}^D$
      Training phase (learning/estimating $\mathbf{w}$ from data): data $\langle (\mathbf{x}_i, y_i) \rangle_{i=1}^N$ → Learning Algorithm → $\mathbf{w}$ (estimate)
      Testing/Deployment phase: predict $\hat{y}_{\mathrm{new}} = \mathbf{x}_{\mathrm{new}} \cdot \mathbf{w}$
      ◮ How different is $\hat{y}_{\mathrm{new}}$ from $y_{\mathrm{new}}$ (the actual observation)?
      ◮ We should keep some data aside for testing before deploying a model
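One way to "keep some data aside" is a simple random train/test split. A minimal sketch assuming NumPy; `train_test_split` here is a local helper written for this note, not a library function:

```python
import numpy as np

def train_test_split(X, y, test_frac=0.2, seed=0):
    """Shuffle the indices and hold out a fraction of the data for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(round(test_frac * len(y)))
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]

# Example with the toy commute data (5 points, 1 held out for testing).
X = np.array([[2.7], [4.1], [1.0], [5.2], [2.8]])
y = np.array([25., 33., 15., 45., 22.])
X_tr, y_tr, X_te, y_te = train_test_split(X, y)
```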

  11. Data: $\langle (x_i, y_i) \rangle_{i=1}^N$, where $x_i \in \mathbb{R}$ and $y_i \in \mathbb{R}$
      Model: $\hat{y}(x) = w_0 + w_1 x$ (no noise term in $\hat{y}$)
      $L(\mathbf{w}) = L(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^N (\hat{y}_i - y_i)^2 = \frac{1}{2N} \sum_{i=1}^N (w_0 + w_1 x_i - y_i)^2$
      This is variously called a loss function, cost function, objective function, or energy function; notation: $L$, $J$, $E$, $R$.
      This objective is known as the residual sum of squares (RSS).
      The minimising pair $(w_0, w_1)$ is known as the least squares estimate.
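A direct transcription of this loss into code, assuming NumPy and 1-D arrays `x` and `y`; the parameter values passed in the example call are made up:

```python
import numpy as np

def rss_loss(w0, w1, x, y):
    """L(w0, w1) = 1/(2N) * sum_i (w0 + w1*x_i - y_i)^2."""
    residuals = w0 + w1 * x - y
    return np.mean(residuals ** 2) / 2

# Toy data: distance (km) vs commute time (min).
x = np.array([2.7, 4.1, 1.0, 5.2, 2.8])
y = np.array([25., 33., 15., 45., 22.])
print(rss_loss(10.0, 5.0, x, y))   # loss for one (made-up) parameter setting
```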

  12. Data: $\langle (x_i, y_i) \rangle_{i=1}^N$, where $x_i \in \mathbb{R}$ and $y_i \in \mathbb{R}$; model $\hat{y}(x) = w_0 + w_1 x$ (no noise term in $\hat{y}$)
      $L(\mathbf{w}) = L(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^N (\hat{y}_i - y_i)^2 = \frac{1}{2N} \sum_{i=1}^N (w_0 + w_1 x_i - y_i)^2$
      Partial derivatives:
      $\frac{\partial L}{\partial w_0} = \frac{1}{N} \sum_{i=1}^N (w_0 + w_1 x_i - y_i)$
      $\frac{\partial L}{\partial w_1} = \frac{1}{N} \sum_{i=1}^N (w_0 + w_1 x_i - y_i)\, x_i$
      Write $\bar{x} = \frac{\sum_i x_i}{N}$, $\bar{y} = \frac{\sum_i y_i}{N}$, $\mathrm{var}(x) = \frac{\sum_i x_i^2}{N} - \bar{x}^2$, $\mathrm{cov}(x, y) = \frac{\sum_i x_i y_i}{N} - \bar{x} \bar{y}$.
      We obtain the solution for $(w_0, w_1)$ by setting the partial derivatives to 0 and solving the resulting system (the normal equations):
      (1) $w_0 + w_1 \cdot \frac{\sum_i x_i}{N} = \frac{\sum_i y_i}{N}$
      (2) $w_0 \cdot \frac{\sum_i x_i}{N} + w_1 \cdot \frac{\sum_i x_i^2}{N} = \frac{\sum_i x_i y_i}{N}$
      Solving gives $w_1 = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}$ and $w_0 = \bar{y} - w_1 \bar{x}$.
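The closed-form solution above translates directly into code. A sketch assuming NumPy; the moments are computed by hand to match the $1/N$ normalisation used on the slide:

```python
import numpy as np

def least_squares_1d(x, y):
    """Least squares fit of y = w0 + w1*x via the normal equations."""
    x_bar, y_bar = x.mean(), y.mean()
    var_x = (x ** 2).mean() - x_bar ** 2       # var(x)   = E[x^2] - x_bar^2
    cov_xy = (x * y).mean() - x_bar * y_bar    # cov(x,y) = E[xy]  - x_bar * y_bar
    w1 = cov_xy / var_x
    w0 = y_bar - w1 * x_bar
    return w0, w1

x = np.array([2.7, 4.1, 1.0, 5.2, 2.8])
y = np.array([25., 33., 15., 45., 22.])
print(least_squares_1d(x, y))
```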

  13. Linear Regression: General Case
      Recall that the linear model is $\hat{y}_i = \sum_{j=0}^D x_{ij} w_j$, where we assume that $x_{i0} = 1$ for all $\mathbf{x}_i$, so that the bias term $w_0$ does not need to be treated separately.
      Expressing everything in matrix notation: $\hat{\mathbf{y}} = \mathbf{X} \mathbf{w}$
      Here we have $\hat{\mathbf{y}} \in \mathbb{R}^{N \times 1}$, $\mathbf{X} \in \mathbb{R}^{N \times (D+1)}$ and $\mathbf{w} \in \mathbb{R}^{(D+1) \times 1}$:
      $\begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix} \begin{bmatrix} w_0 \\ \vdots \\ w_D \end{bmatrix} = \begin{bmatrix} x_{10} & \cdots & x_{1D} \\ x_{20} & \cdots & x_{2D} \\ \vdots & \ddots & \vdots \\ x_{N0} & \cdots & x_{ND} \end{bmatrix} \begin{bmatrix} w_0 \\ \vdots \\ w_D \end{bmatrix}$

  14. Back to toy example

      one   dist (km)   weekday?   commute time (min)
      1     2.7         1 (fri)    25
      1     4.1         1 (mon)    33
      1     1.0         0 (sun)    15
      1     5.2         1 (tue)    45
      1     2.8         0 (sat)    22

      We have $N = 5$, $D + 1 = 3$, and so we get
      $\mathbf{y} = \begin{bmatrix} 25 \\ 33 \\ 15 \\ 45 \\ 22 \end{bmatrix}, \quad \mathbf{X} = \begin{bmatrix} 1 & 2.7 & 1 \\ 1 & 4.1 & 1 \\ 1 & 1.0 & 0 \\ 1 & 5.2 & 1 \\ 1 & 2.8 & 0 \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$
      Suppose we get $\mathbf{w} = [6.09, 6.53, 2.11]^T$. Then our predictions would be
      $\hat{\mathbf{y}} = \begin{bmatrix} 25.83 \\ 34.97 \\ 12.62 \\ 42.16 \\ 24.37 \end{bmatrix}$
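The prediction step is a single matrix-vector product. A sketch assuming NumPy, using the $\mathbf{X}$ and $\mathbf{w}$ from this slide:

```python
import numpy as np

X = np.array([[1, 2.7, 1],
              [1, 4.1, 1],
              [1, 1.0, 0],
              [1, 5.2, 1],
              [1, 2.8, 0]])
w = np.array([6.09, 6.53, 2.11])

y_hat = X @ w                # predictions y_hat = X w
print(np.round(y_hat, 2))    # [25.83 34.97 12.62 42.16 24.37]
```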

  15. Least Squares Estimate: Minimise the Squared Error
      $L(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^N (\mathbf{x}_i^T \mathbf{w} - y_i)^2 = \frac{1}{2N} (\mathbf{X}\mathbf{w} - \mathbf{y})^T (\mathbf{X}\mathbf{w} - \mathbf{y})$
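A quick numeric check (assuming NumPy and random data) that the summation form and the matrix form of the loss agree:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D + 1))
y = rng.normal(size=N)
w = rng.normal(size=D + 1)

loss_sum = np.sum((X @ w - y) ** 2) / (2 * N)    # 1/(2N) sum_i (x_i^T w - y_i)^2
r = X @ w - y
loss_matrix = (r @ r) / (2 * N)                  # 1/(2N) (Xw - y)^T (Xw - y)
print(np.isclose(loss_sum, loss_matrix))         # True
```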

  16. Finding Optimal Solutions using Calculus
      $L(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^N (\mathbf{x}_i^T \mathbf{w} - y_i)^2 = \frac{1}{2N} (\mathbf{X}\mathbf{w} - \mathbf{y})^T (\mathbf{X}\mathbf{w} - \mathbf{y})$
      $= \frac{1}{2N} \left( \mathbf{w}^T (\mathbf{X}^T \mathbf{X}) \mathbf{w} - \mathbf{w}^T \mathbf{X}^T \mathbf{y} - \mathbf{y}^T \mathbf{X} \mathbf{w} + \mathbf{y}^T \mathbf{y} \right)$
      $= \frac{1}{2N} \left( \mathbf{w}^T (\mathbf{X}^T \mathbf{X}) \mathbf{w} - 2\, \mathbf{y}^T \mathbf{X} \mathbf{w} + \mathbf{y}^T \mathbf{y} \right)$
      $= \cdots$
      One option is to write out all the partial derivatives $\frac{\partial L}{\partial w_0}, \frac{\partial L}{\partial w_1}, \ldots, \frac{\partial L}{\partial w_D}$ to form the gradient $\nabla_{\mathbf{w}} L$. Instead, we will develop tricks to differentiate using matrix notation directly.

  17. Differentiating Matrix Expressions
      Rules (Tricks)
      (i) Linear form expressions: $\nabla_{\mathbf{w}} (\mathbf{c}^T \mathbf{w}) = \mathbf{c}$
          Since $\mathbf{c}^T \mathbf{w} = \sum_{j=0}^D c_j w_j$, we have $\frac{\partial (\mathbf{c}^T \mathbf{w})}{\partial w_j} = c_j$, and so $\nabla_{\mathbf{w}} (\mathbf{c}^T \mathbf{w}) = \mathbf{c}$.  (3)
      (ii) Quadratic form expressions: $\nabla_{\mathbf{w}} (\mathbf{w}^T \mathbf{A} \mathbf{w}) = \mathbf{A}^T \mathbf{w} + \mathbf{A} \mathbf{w}$ ($= 2 \mathbf{A} \mathbf{w}$ for symmetric $\mathbf{A}$)
          Since $\mathbf{w}^T \mathbf{A} \mathbf{w} = \sum_{i=0}^D \sum_{j=0}^D w_i w_j A_{ij}$, we have
          $\frac{\partial (\mathbf{w}^T \mathbf{A} \mathbf{w})}{\partial w_k} = \sum_{i=0}^D w_i A_{ik} + \sum_{j=0}^D A_{kj} w_j = \mathbf{A}_{[:,k]}^T \mathbf{w} + \mathbf{A}_{[k,:]} \mathbf{w}$,
          and so $\nabla_{\mathbf{w}} (\mathbf{w}^T \mathbf{A} \mathbf{w}) = \mathbf{A}^T \mathbf{w} + \mathbf{A} \mathbf{w}$.  (4)
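A numerical sanity check of rules (3) and (4) by central finite differences, assuming NumPy and randomly drawn $\mathbf{c}$, $\mathbf{A}$, $\mathbf{w}$:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
c = rng.normal(size=D)
A = rng.normal(size=(D, D))
w = rng.normal(size=D)

def numerical_grad(f, w, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    g = np.zeros_like(w)
    for k in range(len(w)):
        e = np.zeros_like(w); e[k] = eps
        g[k] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

# Rule (3): grad_w (c^T w) = c
print(np.allclose(numerical_grad(lambda v: c @ v, w), c))
# Rule (4): grad_w (w^T A w) = A^T w + A w
print(np.allclose(numerical_grad(lambda v: v @ A @ v, w), A.T @ w + A @ w))
```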

  18. Deriving the Least Squares Estimate
      $L(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^N (\mathbf{x}_i^T \mathbf{w} - y_i)^2 = \frac{1}{2N} \left( \mathbf{w}^T (\mathbf{X}^T \mathbf{X}) \mathbf{w} - 2\, \mathbf{y}^T \mathbf{X} \mathbf{w} + \mathbf{y}^T \mathbf{y} \right)$
      We compute the gradient $\nabla_{\mathbf{w}} L$ using the matrix differentiation rules:
      $\nabla_{\mathbf{w}} L = \frac{1}{N} \left( (\mathbf{X}^T \mathbf{X}) \mathbf{w} - \mathbf{X}^T \mathbf{y} \right)$
      By setting $\nabla_{\mathbf{w}} L = 0$ and solving we get
      $(\mathbf{X}^T \mathbf{X}) \mathbf{w} = \mathbf{X}^T \mathbf{y}$
      $\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$ (assuming the inverse exists)
      The predictions made by the model on the data $\mathbf{X}$ are given by
      $\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} = \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
      For this reason the matrix $\mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T$ is called the "hat" matrix.
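A sketch of the closed-form estimate on the toy data, assuming NumPy. In practice one solves the normal equations (or uses np.linalg.lstsq) rather than forming the explicit inverse:

```python
import numpy as np

X = np.array([[1, 2.7, 1],
              [1, 4.1, 1],
              [1, 1.0, 0],
              [1, 5.2, 1],
              [1, 2.8, 0]], dtype=float)
y = np.array([25., 33., 15., 45., 22.])

# Solve (X^T X) w = X^T y instead of inverting X^T X explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ w                                   # fitted values: the "hat" matrix applied to y
print(w, y_hat)

# Equivalent and numerically more robust:
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_lstsq))                  # True
```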

  19. Least Squares Estimate
      $\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
      ◮ When do we expect $\mathbf{X}^T \mathbf{X}$ to be invertible?
        $\mathrm{rank}(\mathbf{X}^T \mathbf{X}) = \mathrm{rank}(\mathbf{X}) \le \min\{D + 1, N\}$
        As $\mathbf{X}^T \mathbf{X}$ is $(D+1) \times (D+1)$, it is invertible if and only if $\mathrm{rank}(\mathbf{X}) = D + 1$.
      ◮ What if we use one-hot encoding for a feature like day?
        Suppose $x_{\mathrm{mon}}, \ldots, x_{\mathrm{sun}}$ stand for the 0-1 valued variables in the one-hot encoding. We always have $x_{\mathrm{mon}} + \cdots + x_{\mathrm{sun}} = 1$, which coincides with the constant bias feature $x_0$. This introduces a linear dependence among the columns of $\mathbf{X}$, reducing its rank. In this case we can drop some features to restore full rank (a small sketch follows below); we'll see alternative approaches later in the course.
      ◮ What is the computational complexity of computing $\mathbf{w}$?
        Relatively easy to get an $O(D^2 N)$ bound.
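A small sketch (assuming NumPy; the example data is made up) of the rank deficiency caused by combining a full one-hot encoding with the constant bias feature:

```python
import numpy as np

N = 10
day = np.arange(N) % 7                        # day of week as 0..6 (all seven days occur)
one_hot = np.eye(7)[day]                      # full one-hot encoding (7 columns)
X = np.hstack([np.ones((N, 1)), one_hot])     # bias column + one-hot columns (8 columns)

# The one-hot columns sum to the bias column, so X cannot have full column rank.
print(np.linalg.matrix_rank(X))               # 7, even though X has 8 columns
print(np.linalg.matrix_rank(X[:, :-1]))       # 7: dropping one column restores full rank
```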

  21. Recap: Predicting Commute Time
      Goal
      ◮ Predict the time taken for the commute given distance and day of week
      ◮ Do we only wish to make predictions, or also suggestions?
      Model and Choice of Loss Function
      ◮ Use a linear model: $y = w_0 + w_1 x_1 + \cdots + w_D x_D + \epsilon = \hat{y} + \epsilon$
      ◮ Minimise the average squared error $\frac{1}{2N} \sum_i (y_i - \hat{y}_i)^2$
      Algorithm to Fit Model
      ◮ Simple matrix operations using the closed-form solution

  22. Model and Loss Function Choice
      "Optimisation" View of Machine Learning
      ◮ Pick a model that you expect may fit the data well enough
      ◮ Pick a measure of performance that makes "sense" and can be optimised
      ◮ Run an optimisation algorithm to obtain the model parameters
      Probabilistic View of Machine Learning
      ◮ Pick a model for the data and explicitly formulate the deviation (or uncertainty) from the model using the language of probability
      ◮ Use notions from probability to define the suitability of various models
      ◮ "Find" the parameters or make predictions on unseen data using these suitability criteria (frequentist vs Bayesian viewpoints)
