Linear Regression
David M. Blei
COS424, Princeton University
April 4, 2012
Regression
• We have studied classification, the problem of automatically categorizing data into a set of discrete classes.
• E.g., based on its words, is an email spam or ham?
• Regression is the problem of predicting a real-valued response from input data.
Linear regression

[Figure: scatter plot of response vs. input]

Data are a set of inputs and outputs, D = {(x_n, y_n)}, n = 1, ..., N.
Linear regression

[Figure: scatter plot of response vs. input]

The goal is to predict y from x using a linear function.
Examples

[Figure: scatter plot of response vs. input]

• Given today's weather, how much will it rain tomorrow?
• Given today's market, what will be the price of a stock tomorrow?
• Given her emails, how long will a user stay on a page?
• Others?
Linear regression

[Figure: scatter plot with a data point labeled (x_n, y_n) and a fitted line f(x) = β_0 + β x]
Multiple inputs
• Usually, we have a vector of inputs, each representing a different feature of the data that might be predictive of the response:

  x = ⟨x_1, x_2, ..., x_p⟩

• The response is assumed to be a linear function of the input:

  f(x) = β_0 + Σ_{i=1}^p x_i β_i

• Here, β^T x = 0 is a hyperplane.
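The linear function above can be sketched as a dot product plus an intercept. (The specific coefficient and input values below are hypothetical, chosen only for illustration.)

```python
import numpy as np

# Hypothetical fitted coefficients for p = 3 inputs (illustrative values).
beta0 = 0.5
beta = np.array([1.0, -2.0, 0.3])

def f(x):
    """Linear prediction f(x) = beta0 + sum_i x_i * beta_i."""
    return beta0 + x @ beta

x = np.array([2.0, 1.0, 4.0])
print(f(x))  # 0.5 + 2.0 - 2.0 + 1.2 = 1.7
```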
Multiple inputs

[Figure: regression plane fit to data points in (X_1, X_2, Y) space]
Flexibility of linear regression
• This set-up is less limiting than you might imagine.
• Inputs can be:
  • Any features of the data
  • Transformations of the original features, e.g., x_2 = log x_1 or x_2 = √x_1
  • A basis expansion, e.g., x_2 = x_1² and x_3 = x_1³
  • Indicators of qualitative inputs, e.g., category
  • Interactions between inputs, e.g., x_1 = x_2 x_3
• Its simplicity and flexibility make linear regression one of the most important and widely used statistical prediction techniques.
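The point is that the model is linear in the coefficients, not the raw features: any transformed column becomes just another input. A minimal sketch, with hypothetical feature values:

```python
import numpy as np

# Hypothetical raw feature column (illustrative values).
x1 = np.array([1.0, 2.0, 4.0])

# Each transformed column is still just "an input" to the linear model.
log_x1  = np.log(x1)   # transformation of the original feature
x1_sq   = x1 ** 2      # basis expansion
x1_cube = x1 ** 3

X = np.column_stack([x1, log_x1, x1_sq, x1_cube])
print(X.shape)  # (3, 4): three data points, four derived inputs
```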
Polynomial regression example

[Figure: scatter plot of response vs. input, with data following a curved (cubic-shaped) trend]
Linear regression

[Figure: the same data with a straight-line fit]

f(x) = β_0 + β x
Polynomial regression

[Figure: the same data with a cubic fit]

f(x) = β_0 + β_1 x + β_2 x² + β_3 x³
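Fitting this cubic is an ordinary linear least-squares problem on the expanded basis (1, x, x², x³). A sketch on synthetic data, where the true coefficients are assumed for illustration:

```python
import numpy as np

# Synthetic data from an assumed cubic, plus small noise (illustrative).
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 100)
y = 1.0 + 0.5 * x - 0.2 * x**2 + 0.8 * x**3 + rng.normal(0, 0.1, x.size)

# Design matrix with columns 1, x, x^2, x^3; the least-squares solution
# gives beta_0 ... beta_3 in f(x) = beta_0 + beta_1 x + beta_2 x^2 + beta_3 x^3.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))  # approximately [1.0, 0.5, -0.2, 0.8]
```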
Fitting a regression

[Figure: scatter plot of y vs. x]

• Given data D = {(x_n, y_n)}, n = 1, ..., N, find the coefficient β that can predict y_new from x_new.
• Simplifications:
  • Zero intercept, i.e., β_0 = 0
  • One input, i.e., p = 1
• How should we proceed?
Residual sum of squares

[Figure: scatter plot with a fitted line and residuals |y_n − β x_n| marked as vertical distances]

A reasonable approach is to minimize the sum of the squared distances between each prediction β x_n and the truth y_n:

  RSS(β) = (1/2) Σ_{n=1}^N (y_n − β x_n)²
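The objective is straightforward to compute. A minimal sketch, with hypothetical data points:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # hypothetical inputs
y = np.array([2.1, 3.9, 6.2])  # hypothetical outputs

def rss(beta):
    """RSS(beta) = 1/2 * sum_n (y_n - beta * x_n)^2."""
    return 0.5 * np.sum((y - beta * x) ** 2)

# At beta = 2 the residuals are 0.1, -0.1, 0.2,
# so RSS = 0.5 * (0.01 + 0.01 + 0.04) = 0.03.
print(rss(2.0))
```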
RSS for two inputs

[Figure: regression plane in (X_1, X_2, Y) space with residuals from the data points to the plane]
Optimizing β

The objective function is

  RSS(β) = (1/2) Σ_{n=1}^N (y_n − β x_n)²

The derivative is

  d/dβ RSS(β) = −Σ_{n=1}^N (y_n − β x_n) x_n

Setting it to zero, the optimal value is

  β̂ = (Σ_{n=1}^N y_n x_n) / (Σ_n x_n²)
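The closed form can be checked numerically: at β̂ the derivative of the RSS vanishes. A sketch on hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # hypothetical inputs
y = np.array([2.1, 3.9, 6.2])  # hypothetical outputs

# Closed-form minimizer: beta_hat = sum(y_n x_n) / sum(x_n^2).
beta_hat = np.sum(y * x) / np.sum(x ** 2)

# Sanity check: the derivative -sum((y_n - beta x_n) x_n) is zero at beta_hat.
grad = -np.sum((y - beta_hat * x) * x)

print(round(beta_hat, 3))  # 28.5 / 14 = 2.036
```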
The optimal β

[Figure: scatter plot of y vs. x with the fitted line through the origin]

• The optimal value is

  β̂ = (Σ_{n=1}^N y_n x_n) / (Σ_n x_n²)

• Positive values of y_n x_n pull the slope up.
• Negative values of y_n x_n pull the slope down.
Prediction

[Figure: fitted line with a predicted point marked at a new input]

• After finding the optimal β, we would like to predict a new output from a new input.
• We use the point on the line at the new input:

  ŷ_new = β̂ x_new
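Prediction in the zero-intercept case is just a point on the fitted line. A one-line sketch, assuming a hypothetical fitted slope:

```python
beta_hat = 2.0            # hypothetical fitted slope (illustrative)
x_new = 1.5               # new input
y_new = beta_hat * x_new  # predicted output: the point on the line at x_new
print(y_new)              # 3.0
```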