Machine Learning - MT 2016
2. Linear Regression
Varun Kanade
University of Oxford
October 12, 2016
Announcements
◮ All students eligible to take the course for credit can sign up for classes and practicals
◮ Attempt Problem Sheet 0 (contact your class tutor if you intend to attend class in Week 2)
◮ Problem Sheet 1 is posted (submit by noon on 21 Oct at CS reception)
Announcement: Strachey Lecture
◮ Will finish 15-20 min early on Monday, October 31
◮ May run over by 5 minutes or so on a few other days
Outline

Goals
◮ Review the supervised learning setting
◮ Describe the linear regression framework
◮ Apply the linear model to make predictions
◮ Derive the least squares estimate

Supervised Learning Setting
◮ Data consists of input and output pairs
◮ Inputs (also covariates, independent variables, predictors, features)
◮ Outputs (also variates, dependent variables, targets, labels)
Why study linear regression?
◮ Least squares is at least 200 years old, going back to Legendre and Gauss
◮ Francis Galton (1886): ‘‘Regression to the mean’’
◮ Often real processes can be approximated by linear models
◮ More complex models require understanding linear regression
◮ Closed-form analytic solutions can be obtained
◮ Many key notions of machine learning can be introduced
A toy example: Commute Times

Want to predict commute time into the city centre. What variables would be useful?
◮ Distance to city centre
◮ Day of the week

Data:

  dist (km)   day   commute time (min)
  2.7         fri   25
  4.1         mon   33
  1.0         sun   15
  5.2         tue   45
  2.8         sat   22
Linear Models

Suppose the input is a vector x ∈ R^D and the output is y ∈ R. We have data ⟨(x_i, y_i)⟩_{i=1}^N.

Notation: data dimension D, size of dataset N, column vectors.

Linear Model:
  y = w_0 + x_1 w_1 + · · · + x_D w_D + ε
where w_0 is the bias/intercept term and ε captures noise/uncertainty.
Linear Models: Commute Time

Linear Model:
  y = w_0 + x_1 w_1 + · · · + x_D w_D + ε
where w_0 is the bias/intercept term and ε captures noise/uncertainty.

Input encoding: mon-sun has to be converted to a number.
◮ Using monday: 0, tuesday: 1, . . . , sunday: 6 is a bad encoding, as it imposes a spurious ordering on the days
◮ Instead, use seven 0-1 indicator features, one per day; this is called one-hot encoding
◮ Simpler still for this problem: a single 0-1 feature, 1 if weekday, 0 if weekend

Say x_1 ∈ R (distance) and x_2 ∈ {0, 1} (weekend/weekday). Linear model for commute time:
  y = w_0 + w_1 x_1 + w_2 x_2 + ε
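The two encodings above can be sketched as follows (a small illustration, not part of the lecture; the helper names are my own):

```python
# Day-of-week encodings for the commute model. Using 0-6 imposes a false
# ordering on days; one-hot encoding uses seven 0/1 indicator features.
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def one_hot(day):
    """Return a seven-dimensional 0/1 indicator vector for the given day."""
    return [1 if d == day else 0 for d in DAYS]

def is_weekday(day):
    """The simpler single feature used in the commute model: 1 on weekdays."""
    return 0 if day in ("sat", "sun") else 1
```

For example, `one_hot("fri")` gives `[0, 0, 0, 0, 1, 0, 0]`, while `is_weekday("fri")` gives `1`.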
Linear Model: Adding a feature for the bias term

Augment each input with a constant feature x_0 = 1:

  x_1   x_2   y                x_0   x_1   x_2   y
  2.7   fri   25               1     2.7   fri   25
  4.1   mon   33               1     4.1   mon   33
  1.0   sun   15      ⇔        1     1.0   sun   15
  5.2   tue   45               1     5.2   tue   45
  2.8   sat   22               1     2.8   sat   22

Model:                          Model:
  y = w_0 + w_1 x_1 + w_2 x_2 + ε     y = w_0 x_0 + w_1 x_1 + w_2 x_2 + ε = w · x + ε
Learning Linear Models

Data: ⟨(x_i, y_i)⟩_{i=1}^N, where x_i ∈ R^D and y_i ∈ R
Model parameters: w ∈ R^D

Training phase (learning/estimating w from data):
  data ⟨(x_i, y_i)⟩_{i=1}^N → Learning Algorithm → w (estimate)

Testing/Deployment phase (predict ŷ_new = x_new · w):
◮ How different is ŷ_new from y_new (the actual observation)?
◮ We should keep some data aside for testing before deploying a model
Data ⟨(x_i, y_i)⟩_{i=1}^N, where x_i ∈ R and y_i ∈ R, and prediction ŷ(x) = w_0 + w_1 x (no noise term in ŷ)

  L(w) = L(w_0, w_1) = (1/2N) Σ_{i=1}^N (ŷ_i − y_i)² = (1/2N) Σ_{i=1}^N (w_0 + w_1 x_i − y_i)²

Terminology: loss function, cost function, objective function, energy function. Notation: L, J, E, R.

This objective is known as the residual sum of squares (RSS). The minimising (w_0, w_1) is known as the least squares estimate.
Data ⟨(x_i, y_i)⟩_{i=1}^N, where x_i ∈ R and y_i ∈ R, and prediction ŷ(x) = w_0 + w_1 x

  L(w) = L(w_0, w_1) = (1/2N) Σ_{i=1}^N (ŷ_i − y_i)² = (1/2N) Σ_{i=1}^N (w_0 + w_1 x_i − y_i)²

The partial derivatives are

  ∂L/∂w_0 = (1/N) Σ_{i=1}^N (w_0 + w_1 x_i − y_i)
  ∂L/∂w_1 = (1/N) Σ_{i=1}^N (w_0 + w_1 x_i − y_i) x_i

We obtain the solution for (w_0, w_1) by setting the partial derivatives to 0 and solving the resulting system (the normal equations):

  (1)  w_0 + w_1 · (Σ_i x_i)/N = (Σ_i y_i)/N
  (2)  w_0 · (Σ_i x_i)/N + w_1 · (Σ_i x_i²)/N = (Σ_i x_i y_i)/N

Writing x̄ = (Σ_i x_i)/N, ȳ = (Σ_i y_i)/N, var(x) = (Σ_i x_i²)/N − x̄², cov(x, y) = (Σ_i x_i y_i)/N − x̄ · ȳ, the solution is

  w_1 = cov(x, y) / var(x)
  w_0 = ȳ − w_1 · x̄
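The one-dimensional closed-form solution above can be sketched directly (a minimal illustration; function name is my own):

```python
import numpy as np

# 1-D least squares via the normal equations:
#   w1 = cov(x, y) / var(x),  w0 = ybar - w1 * xbar
def least_squares_1d(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    var_x = (x ** 2).mean() - xbar ** 2       # (1/N) sum x_i^2 - xbar^2
    cov_xy = (x * y).mean() - xbar * ybar     # (1/N) sum x_i y_i - xbar ybar
    w1 = cov_xy / var_x
    w0 = ybar - w1 * xbar
    return w0, w1

# On noiseless data from y = 2 + 3x the estimate recovers (2, 3) exactly.
w0, w1 = least_squares_1d([0, 1, 2, 3], [2, 5, 8, 11])
```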
Linear Regression: General Case

Recall that the linear model is

  ŷ_i = Σ_{j=0}^D x_{ij} w_j

where we assume that x_{i0} = 1 for all i, so that the bias term w_0 does not need to be treated separately.

Expressing everything in matrix notation:

  ŷ = Xw

Here we have ŷ ∈ R^{N×1}, X ∈ R^{N×(D+1)} and w ∈ R^{(D+1)×1}, with row i of X being x_i^T = (x_{i0}, · · · , x_{iD}), so that ŷ_i = x_i^T w:

  ⎡ŷ_1⎤   ⎡x_{10} · · · x_{1D}⎤ ⎡w_0⎤
  ⎢ŷ_2⎥ = ⎢x_{20} · · · x_{2D}⎥ ⎢ . ⎥
  ⎢ . ⎥   ⎢  .    ...    .   ⎥ ⎣w_D⎦
  ⎣ŷ_N⎦   ⎣x_{N0} · · · x_{ND}⎦
Back to the toy example

  one   dist (km)   weekday?   commute time (min)
  1     2.7         1 (fri)    25
  1     4.1         1 (mon)    33
  1     1.0         0 (sun)    15
  1     5.2         1 (tue)    45
  1     2.8         0 (sat)    22

We have N = 5, D + 1 = 3, and so we get

  y = [25, 33, 15, 45, 22]^T

      ⎡1  2.7  1⎤
      ⎢1  4.1  1⎥           ⎡w_0⎤
  X = ⎢1  1.0  0⎥  ,   w =  ⎢w_1⎥
      ⎢1  5.2  1⎥           ⎣w_2⎦
      ⎣1  2.8  0⎦

Suppose we get w = [6.09, 6.53, 2.11]^T. Then our predictions would be

  ŷ = [25.83, 34.97, 12.62, 42.16, 24.37]^T
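The predictions above are just a single matrix-vector product ŷ = Xw, which can be reproduced directly (values taken from the slide):

```python
import numpy as np

# Toy commute-time design matrix: columns are [one, dist (km), weekday?]
X = np.array([[1, 2.7, 1],
              [1, 4.1, 1],
              [1, 1.0, 0],
              [1, 5.2, 1],
              [1, 2.8, 0]])
w = np.array([6.09, 6.53, 2.11])

y_hat = X @ w   # rounds to [25.83, 34.97, 12.62, 42.16, 24.37]
```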
Least Squares Estimate: Minimise the Squared Error

  L(w) = (1/2N) Σ_{i=1}^N (x_i^T w − y_i)² = (1/2N) (Xw − y)^T (Xw − y)
Finding Optimal Solutions using Calculus

  L(w) = (1/2N) Σ_{i=1}^N (x_i^T w − y_i)² = (1/2N) (Xw − y)^T (Xw − y)
       = (1/2N) [ w^T (X^T X) w − w^T X^T y − y^T Xw + y^T y ]
       = (1/2N) [ w^T (X^T X) w − 2 · y^T Xw + y^T y ]

One option is to write out all the partial derivatives ∂L/∂w_0, ∂L/∂w_1, . . . , ∂L/∂w_D to form the gradient ∇_w L. Instead, we will develop tricks to differentiate using matrix notation directly.
Differentiating Matrix Expressions

Rules (tricks):

(i) Linear form expressions: ∇_w (c^T w) = c

  c^T w = Σ_{j=0}^D c_j w_j, so ∂(c^T w)/∂w_j = c_j, and hence ∇_w (c^T w) = c.   (3)

(ii) Quadratic form expressions: ∇_w (w^T A w) = A^T w + Aw (= 2Aw for symmetric A)

  w^T A w = Σ_{i=0}^D Σ_{j=0}^D w_i w_j A_{ij}

  ∂(w^T A w)/∂w_k = Σ_{i=0}^D w_i A_{ik} + Σ_{j=0}^D A_{kj} w_j = (A^T w)_k + (Aw)_k

  so ∇_w (w^T A w) = A^T w + Aw.   (4)
Deriving the Least Squares Estimate

  L(w) = (1/2N) Σ_{i=1}^N (x_i^T w − y_i)² = (1/2N) [ w^T (X^T X) w − 2 · y^T Xw + y^T y ]

We compute the gradient ∇_w L using the matrix differentiation rules:

  ∇_w L = (1/N) [ (X^T X) w − X^T y ]

By setting ∇_w L = 0 and solving we get

  (X^T X) w = X^T y
  w = (X^T X)^{-1} X^T y   (assuming the inverse exists)

The predictions made by the model on the data X are given by

  ŷ = Xw = X (X^T X)^{-1} X^T y

For this reason the matrix X (X^T X)^{-1} X^T is called the ‘‘hat’’ matrix.
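On the toy commute data this closed-form estimate recovers the weight vector used earlier. A minimal sketch (note: solving the normal equations with `np.linalg.solve` is preferable to forming the inverse explicitly):

```python
import numpy as np

# Toy commute-time data: columns of X are [one, dist (km), weekday?]
X = np.array([[1, 2.7, 1],
              [1, 4.1, 1],
              [1, 1.0, 0],
              [1, 5.2, 1],
              [1, 2.8, 0]])
y = np.array([25, 33, 15, 45, 22], dtype=float)

# Least squares estimate: solve (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

# Predictions on the training data: the "hat" matrix applied to y
y_hat = X @ w
```

This gives w ≈ (6.09, 6.53, 2.11), matching the values on the earlier slide.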
Least Squares Estimate

  w = (X^T X)^{-1} X^T y

◮ When do we expect X^T X to be invertible?
  rank(X^T X) = rank(X) ≤ min{D + 1, N}
  As X^T X is (D + 1) × (D + 1), it is invertible if and only if rank(X) = D + 1.

◮ What if we use one-hot encoding for a feature like day?
  Suppose x_mon, . . . , x_sun stand for the 0-1 valued variables in the one-hot encoding. We always have x_mon + · · · + x_sun = 1. This introduces a linear dependence among the columns of X, reducing the rank. In this case, we can drop some features to restore full rank. We'll see alternative approaches later in the course.

◮ What is the computational complexity of computing w?
  Relatively easy to get an O(D²N) bound.
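The rank deficiency caused by one-hot encoding can be seen concretely (a small illustration with made-up data):

```python
import numpy as np

# With a bias column of ones plus a full one-hot day encoding, the seven
# indicator columns sum to the bias column, so X loses a rank and X^T X
# is singular (not invertible).
N = 10
days = np.arange(N) % 7                      # every day of the week appears
one_hot = np.eye(7)[days]                    # N x 7 one-hot day features
X = np.hstack([np.ones((N, 1)), one_hot])    # N x 8 design matrix

rank_full = np.linalg.matrix_rank(X.T @ X)   # 7, not 8: singular

# Dropping one indicator column removes the dependence.
X_dropped = X[:, :-1]
rank_dropped = np.linalg.matrix_rank(X_dropped.T @ X_dropped)  # 7 = full rank
```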
Recap: Predicting Commute Time

Goal
◮ Predict the time taken for the commute given distance and day of week
◮ Do we only wish to make predictions, or also suggestions?

Model and Choice of Loss Function
◮ Use a linear model y = w_0 + w_1 x_1 + · · · + w_D x_D + ε = ŷ + ε
◮ Minimise the average squared error (1/2N) Σ_i (y_i − ŷ_i)²

Algorithm to Fit Model
◮ Simple matrix operations using the closed-form solution
Model and Loss Function Choice

‘‘Optimisation’’ View of Machine Learning
◮ Pick a model that you expect may fit the data well enough
◮ Pick a measure of performance that makes ‘‘sense’’ and can be optimised
◮ Run an optimisation algorithm to obtain the model parameters

Probabilistic View of Machine Learning
◮ Pick a model for the data and explicitly formulate the deviation (or uncertainty) from the model using the language of probability
◮ Use notions from probability to define the suitability of various models
◮ ‘‘Find’’ the parameters or make predictions on unseen data using these suitability criteria (Frequentist vs Bayesian viewpoints)