IAML: Linear Regression

Nigel Goddard, School of Informatics, Semester 1

Overview

◮ The linear model
◮ Fitting the linear model to data
◮ Probabilistic interpretation of the error function
◮ Examples of regression problems
◮ Dealing with multiple outputs
◮ Generalized linear regression
◮ Radial basis function (RBF) models

The Regression Problem

◮ Classification and regression problems:
  ◮ Classification: target of prediction is discrete
  ◮ Regression: target of prediction is continuous
◮ Training data: set D of pairs $(x_i, y_i)$ for $i = 1, \dots, n$, where $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$
◮ Today: linear regression, i.e., the relationship between x and y is linear.
◮ Although this is simple (and limited) it is:
  ◮ More powerful than you would expect
  ◮ The basis for more complex nonlinear methods
  ◮ Teaches a lot about regression and classification

Examples of regression problems

◮ Robot inverse dynamics: predicting what torques are needed to drive a robot arm along a given trajectory
◮ Electricity load forecasting: generate hourly forecasts two days in advance (see W & F, §1.3)
◮ Predicting staffing requirements at help desks based on historical data and product and sales information
◮ Predicting the time to failure of equipment based on utilization and environmental conditions


The Linear Model

◮ Linear model

$$f(x; w) = w_0 + w_1 x_1 + \dots + w_D x_D = \phi(x)\,w, \qquad
\phi(x) = (1, x_1, \dots, x_D) = (1, x^T), \qquad
w = (w_0, w_1, \dots, w_D)^T \qquad (1)$$

◮ The maths of fitting linear models to data is easy. We use the notation φ(x) to make generalisation easy later.
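As a concrete illustration of equation (1), here is a minimal Python/NumPy sketch (not from the slides; the input and weight values are made up) of the prediction f(x; w) = φ(x)w for a single input with D = 3 features:

```python
import numpy as np

def phi(x):
    """Feature map phi(x) = (1, x1, ..., xD): prepend the constant bias feature."""
    return np.concatenate(([1.0], x))

# Hypothetical values for illustration only (D = 3).
x = np.array([2.0, -1.0, 0.5])
w = np.array([0.3, 1.0, -2.0, 0.7])   # (w0, w1, w2, w3)

y_hat = phi(x) @ w                    # f(x; w) = w0 + w1*x1 + ... + wD*xD
print(y_hat)                          # 4.65
```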

Toy example: Data

[Figure: scatter plot of the toy data, x vs. y]

Toy example: Data

[Figure: two panels showing the toy data, x vs. y]

With two features

[Figure: data points over features X1 and X2 with target Y, fitted by a plane. Credit: Hastie, Tibshirani, and Friedman]

Instead of a line, a plane. With more features, a hyperplane.


With more features

CPU Performance Data Set

◮ Predict: PRP: published relative performance
◮ MYCT: machine cycle time in nanoseconds (integer)
◮ MMIN: minimum main memory in kilobytes (integer)
◮ MMAX: maximum main memory in kilobytes (integer)
◮ CACH: cache memory in kilobytes (integer)
◮ CHMIN: minimum channels in units (integer)
◮ CHMAX: maximum channels in units (integer)

With more features

PRP = −56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH − 0.270 CHMIN + 1.46 CHMAX

In matrix notation

◮ Design matrix is n × (D + 1)

$$\Phi = \begin{pmatrix}
1 & x_{11} & x_{12} & \dots & x_{1D} \\
1 & x_{21} & x_{22} & \dots & x_{2D} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \dots & x_{nD}
\end{pmatrix}$$

◮ $x_{ij}$ is the jth component of the training input $x_i$
◮ Let $y = (y_1, \dots, y_n)^T$
◮ Then $\hat{y} = \Phi w$ is ...?

Linear Algebra: The 1-Slide Version

What is matrix multiplication?

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}, \qquad
b = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$$

First consider matrix times vector, i.e., Ab. Two answers:

1. Ab is a linear combination of the columns of A:
$$Ab = b_1 \begin{pmatrix} a_{11} \\ a_{21} \\ a_{31} \end{pmatrix}
    + b_2 \begin{pmatrix} a_{12} \\ a_{22} \\ a_{32} \end{pmatrix}
    + b_3 \begin{pmatrix} a_{13} \\ a_{23} \\ a_{33} \end{pmatrix}$$

2. Ab is a vector; each element is the dot product between one row of A and b:
$$Ab = \begin{pmatrix} (a_{11}, a_{12}, a_{13})\, b \\ (a_{21}, a_{22}, a_{23})\, b \\ (a_{31}, a_{32}, a_{33})\, b \end{pmatrix}$$
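A quick NumPy check of the two views (not part of the slides; the numbers are arbitrary):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 10.]])
b = np.array([1., 0., 2.])

# View 1: Ab as a linear combination of the columns of A.
view1 = b[0] * A[:, 0] + b[1] * A[:, 1] + b[2] * A[:, 2]

# View 2: Ab as the vector of dot products between the rows of A and b.
view2 = np.array([A[0] @ b, A[1] @ b, A[2] @ b])

print(np.allclose(view1, A @ b), np.allclose(view2, A @ b))  # True True
```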


Linear model (part 2)

In matrix notation:

◮ Design matrix is n × (D + 1)

$$\Phi = \begin{pmatrix}
1 & x_{11} & x_{12} & \dots & x_{1D} \\
1 & x_{21} & x_{22} & \dots & x_{2D} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \dots & x_{nD}
\end{pmatrix}$$

◮ $x_{ij}$ is the jth component of the training input $x_i$
◮ Let $y = (y_1, \dots, y_n)^T$
◮ Then $\hat{y} = \Phi w$ is the model's predicted values on the training inputs.
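A sketch of the same statement in code (the training data and weights below are made up): stack the inputs into the design matrix Φ with a leading column of ones, and Φw then gives the predictions for all n training inputs at once.

```python
import numpy as np

# Hypothetical training inputs: n = 4 points, D = 2 features each.
X = np.array([[0.5,  1.0],
              [1.5, -0.5],
              [2.0,  0.0],
              [3.0,  2.0]])
w = np.array([0.1, 2.0, -1.0])                 # (w0, w1, w2), arbitrary weights

Phi = np.column_stack([np.ones(len(X)), X])    # n x (D+1) design matrix
y_hat = Phi @ w                                # predictions for all n training inputs
print(Phi.shape, y_hat)                        # (4, 3) [0.1 3.6 4.1 4.1]
```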

Solving for Model Parameters

This looks like what we've seen in linear algebra:
$$y = \Phi w$$
We know y and Φ but not w. So why not take $w = \Phi^{-1} y$? (You can't, but why?)

Solving for Model Parameters

This looks like what we've seen in linear algebra:
$$y = \Phi w$$
We know y and Φ but not w. So why not take $w = \Phi^{-1} y$? (You can't, but why?) Three reasons:

◮ Φ is not square. It is n × (D + 1).
◮ The system is overconstrained (n equations for D + 1 parameters); in other words,
◮ The data has noise.

Loss function

Want a loss function O(w) that

◮ We minimize wrt w.
◮ At minimum, ŷ looks like y.
◮ (Recall: ŷ depends on w, since ŷ = Φw.)


Fitting a linear model to data

[Figure: data over features X1 and X2 with target Y and a fitted plane; black sticks show the residuals]

◮ A common choice: squared error (makes the maths easy):
$$O(w) = \sum_{i=1}^{n} (y_i - w^T x_i)^2$$
◮ In the picture: this is the sum of squared lengths of the black sticks.
◮ (Each one is called a residual, i.e., each $y_i - w^T x_i$.)

Fitting a linear model to data

$$O(w) = \sum_{i=1}^{n} (y_i - w^T x_i)^2 = (y - \Phi w)^T (y - \Phi w)$$

◮ We want to minimize this with respect to w.
◮ The error surface is a parabolic bowl.

[Figure: the parabolic error surface E[w] over (w0, w1)]

◮ How do we do this?

The Solution

◮ Answer: to minimize $O(w) = \sum_{i=1}^{n} (y_i - w^T x_i)^2$, set the partial derivatives to 0.
◮ This has an analytical solution:
$$\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y$$
◮ $(\Phi^T \Phi)^{-1} \Phi^T$ is the pseudo-inverse of Φ.
◮ First check: Does this make sense? Do the matrix dimensions line up?
◮ Then: Why is this called a pseudo-inverse?
◮ Finally: What happens if there are no features?
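A minimal sketch of the analytical solution on synthetic data (the true weights and noise level below are assumed for the demo). In practice np.linalg.lstsq, or solving the normal equations, is numerically preferable to forming the pseudo-inverse explicitly, but all three give the same ŵ here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 50, 2
X = rng.normal(size=(n, D))
w_true = np.array([1.0, 2.0, -3.0])                 # assumed true weights for the demo
Phi = np.column_stack([np.ones(n), X])
y = Phi @ w_true + rng.normal(scale=0.1, size=n)    # targets with Gaussian noise

# The slide's solution: w_hat = (Phi^T Phi)^{-1} Phi^T y
w_hat = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ y

# Numerically preferable equivalents (same answer here):
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)
w_pinv = np.linalg.pinv(Phi) @ y

print(w_hat)
print(np.allclose(w_hat, w_lstsq), np.allclose(w_hat, w_pinv))  # True True
```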

Probabilistic interpretation of O(w)

◮ Assume that $y = w^T x + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$.
◮ (This is an exact linear relationship plus Gaussian noise.)
◮ This implies that $y_i \mid x_i \sim N(w^T x_i, \sigma^2)$, i.e.
$$-\log p(y_i \mid x_i) = \log\sqrt{2\pi} + \log\sigma + \frac{(y_i - w^T x_i)^2}{2\sigma^2}$$
◮ So minimising O(w) is equivalent to maximising the likelihood!
◮ Can view $w^T x$ as $E[y \mid x]$.
◮ The squared residuals allow estimation of $\sigma^2$:
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - w^T x_i)^2$$
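A sketch of the noise-variance estimate on synthetic data (the line and noise level below are assumed): fit ŵ as before, then average the squared residuals.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-3, 3, size=n)
sigma_true = 0.5                                   # assumed noise level for the demo
y = 1.0 + 2.0 * x + rng.normal(scale=sigma_true, size=n)

Phi = np.column_stack([np.ones(n), x])
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

residuals = y - Phi @ w_hat
sigma2_hat = np.mean(residuals ** 2)               # (1/n) * sum of squared residuals
print(w_hat, sigma2_hat)                           # sigma2_hat should be near 0.25
```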


Fitting this into the general structure for learning algorithms:

◮ Define the task: regression
◮ Decide on the model structure: linear regression model
◮ Decide on the score function: squared error (likelihood)
◮ Decide on the optimization/search method to optimize the score function: calculus (analytic solution)

Sensitivity to Outliers

◮ Linear regression is sensitive to outliers.
◮ Example: Suppose $y = 0.5x + \epsilon$, where $\epsilon \sim N(0, \sqrt{0.25})$, and then add a point (2.5, 3):

[Figure: the generated data with the added outlier at (2.5, 3)]
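A sketch of the experiment on the slide (the data are regenerated here, so the exact numbers differ): fit the line with and without the added point (2.5, 3) and compare the estimates.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=20)
y = 0.5 * x + rng.normal(scale=np.sqrt(0.25), size=20)      # y = 0.5x + eps

def fit_line(x, y):
    Phi = np.column_stack([np.ones(len(x)), x])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w                                                # (intercept, slope)

w_clean = fit_line(x, y)
w_outlier = fit_line(np.append(x, 2.5), np.append(y, 3.0))  # add the point (2.5, 3)
print("clean fit:   ", w_clean)
print("with outlier:", w_outlier)                           # intercept and slope shift
```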

Diagnostics

Graphical diagnostics can be useful for checking:

◮ Is the relationship obviously nonlinear? Look for structure in the residuals.
◮ Are there obvious outliers?

The goal isn't to find all problems. You can't. The goal is to find obvious, embarrassing problems.

Example: plot residuals against fitted values. Stats packages will do this for you.
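A sketch of that diagnostic plot (matplotlib assumed available; the data are synthetic and deliberately nonlinear so the plot shows structure):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=100)
y = 1.0 + 0.8 * x + 0.5 * x**2 + rng.normal(scale=0.2, size=100)  # deliberately nonlinear

Phi = np.column_stack([np.ones(len(x)), x])        # fit a purely linear model anyway
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
fitted = Phi @ w
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0.0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")        # visible curvature here reveals the missing x^2 term
plt.show()
```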

Dealing with multiple outputs

◮ Suppose there are q different targets for each input x.
◮ We introduce a different $w_i$ for each target dimension, and do regression separately for each one.
◮ This is called multiple regression.
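A sketch with synthetic data (shapes and values assumed): stacking the q targets as the columns of an n × q matrix Y and solving once gives exactly the same weights as running the single-output solution separately on each column.

```python
import numpy as np

rng = np.random.default_rng(4)
n, D, q = 100, 3, 2
X = rng.normal(size=(n, D))
Phi = np.column_stack([np.ones(n), X])
W_true = rng.normal(size=(D + 1, q))                    # one weight vector per output
Y = Phi @ W_true + rng.normal(scale=0.1, size=(n, q))   # n x q target matrix

W_joint, *_ = np.linalg.lstsq(Phi, Y, rcond=None)       # all q regressions in one call
W_separate = np.column_stack(
    [np.linalg.lstsq(Phi, Y[:, j], rcond=None)[0] for j in range(q)]
)
print(np.allclose(W_joint, W_separate))                 # True: identical to q separate fits
```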


Basis expansion

◮ We can easily transform the original attributes x non-linearly into φ(x) and do linear regression on them.

[Figure: three families of basis functions — polynomial, Gaussians, sigmoids. Figure credit: Chris Bishop, PRML]

◮ Design matrix is n × m

$$\Phi = \begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & \dots & \phi_m(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & \dots & \phi_m(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_1(x_n) & \phi_2(x_n) & \dots & \phi_m(x_n)
\end{pmatrix}$$

◮ Let $y = (y_1, \dots, y_n)^T$
◮ Minimize $E(w) = |y - \Phi w|^2$. As before we have an analytical solution $\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y$.
◮ $(\Phi^T \Phi)^{-1} \Phi^T$ is the pseudo-inverse of Φ.

Example: polynomial regression

$$\phi(x) = (1, x, x^2, \dots, x^M)^T$$

[Figure: polynomial fits of degree M = 0, 1, 3, and 9 to the same dataset (axes x, t). Figure credit: Chris Bishop, PRML]
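A sketch of polynomial regression via basis expansion (the target function, noise level, and degree M below are assumed): build Φ with columns 1, x, x², ..., x^M and reuse exactly the same least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)   # assumed toy target

M = 3                                                        # polynomial degree
Phi = np.vander(x, M + 1, increasing=True)                   # columns: 1, x, x^2, ..., x^M
w_hat, *_ = np.linalg.lstsq(Phi, t, rcond=None)

x_new = np.array([0.3])
phi_new = np.vander(x_new, M + 1, increasing=True)
print(w_hat, phi_new @ w_hat)                                # fitted weights and a prediction
```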

More about the features

◮ Transforming the features can be important.
◮ Example: Suppose I want to predict the CPU performance.
◮ Maybe one of the features is manufacturer.

$$x_1 = \begin{cases} 1 & \text{if Intel} \\ 2 & \text{if AMD} \\ 3 & \text{if Apple} \\ 4 & \text{if Motorola} \end{cases}$$

◮ Let's use this as a feature. Will this work?


More about the features

◮ Transforming the features can be important.
◮ Example: Suppose I want to predict the CPU performance.
◮ Maybe one of the features is manufacturer.

$$x_1 = \begin{cases} 1 & \text{if Intel} \\ 2 & \text{if AMD} \\ 3 & \text{if Apple} \\ 4 & \text{if Motorola} \end{cases}$$

◮ Let's use this as a feature. Will this work?
◮ Not the way you want. Do you really believe AMD is double Intel?

More about the features

◮ Instead convert this into 0/1

x1 = 1 if Intel, 0 otherwise
x2 = 1 if AMD, 0 otherwise
. . .

◮ Note this is a consequence of linearity. We saw something similar with text in the first week.
◮ Other good transformations: log, square, square root.
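A sketch of the 0/1 encoding (the manufacturer list is from the slide; the `one_hot` helper is hypothetical, not a library call):

```python
import numpy as np

manufacturers = ["Intel", "AMD", "Apple", "Motorola"]

def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 in the position of `value`."""
    return np.array([1.0 if value == c else 0.0 for c in categories])

print(one_hot("AMD", manufacturers))   # [0. 1. 0. 0.]
```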

Radial basis function (RBF) models

◮ Set $\phi_i(x) = \exp\!\left(-\tfrac{1}{2}\,|x - c_i|^2 / \alpha^2\right)$.
◮ Need to position these "basis functions" at some prior chosen centres $c_i$ and with a given width α. There are many ways to do this, but choosing a subset of the datapoints as centres is one method that is quite effective.
◮ Finding the weights is the same as ever: the pseudo-inverse solution.
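A sketch of the RBF model on synthetic 1-D data (the data, the centres c = 3 and 6, and the width α = 1 are assumed here, loosely following the example in the next slides; a bias column is also added, which is a modelling choice rather than part of the slide): build Φ from the Gaussian basis functions and fit with the usual least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(1, 7, size=60)
y = np.exp(-0.5 * (x - 3) ** 2) - 0.5 * np.exp(-0.5 * (x - 6) ** 2) \
    + rng.normal(scale=0.1, size=60)                       # assumed toy data

centres = np.array([3.0, 6.0])                             # chosen centres c_i (assumed)
alpha = 1.0                                                # basis-function width (assumed)

def rbf_features(x, centres, alpha):
    """phi_i(x) = exp(-0.5 * |x - c_i|^2 / alpha^2); a bias column is added as well."""
    diffs = x[:, None] - centres[None, :]
    return np.column_stack([np.ones(len(x)), np.exp(-0.5 * diffs**2 / alpha**2)])

Phi = rbf_features(x, centres, alpha)
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)            # pseudo-inverse solution as usual
residuals = y - Phi @ w_hat
print(w_hat, np.mean(residuals ** 2))
```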

RBF example

[Figure: scatter plot of the example data, x vs. y]


RBF example

[Figure: two panels showing the example data, x vs. y]

An RBF feature

Original data
[Figure: scatter plot of the data, x vs. y]

RBF feature, c1 = 3, α1 = 1
[Figure: y plotted against the RBF feature value φ1(x)]

Another RBF feature

Notice how the feature functions "specialize" in input space.

Original data
[Figure: scatter plot of the data, x vs. y]

RBF feature, c2 = 6, α2 = 1
[Figure: y plotted against the RBF feature value φ2(x), the RBF kernel with mean 6]

RBF example

Run the RBF model with both basis functions above and plot the residuals $y_i - \phi(x_i)^T w$.

Original data
[Figure: scatter plot of the data, x vs. y]

Residuals
[Figure: residuals of the RBF model plotted against x]


RBF: Ay, there’s the rub

◮ So why not use RBFs for everything?
◮ Short answer: you might need too many basis functions.
◮ This is especially true in high dimensions (we'll say more later).
◮ Too many means you probably overfit.
◮ Extreme example: centre one on each training point.
◮ Also: notice that we haven't seen yet in the course how to learn the RBF parameters, i.e., the mean and standard deviation of each kernel.
◮ Main point of presenting RBFs now: set up later methods like support vector machines.

Summary

◮ Linear regression is often useful out of the box.
◮ It is more useful than it would seem, because "linear" means linear in the parameters. You can do a nonlinear transform of the data first, e.g., polynomial or RBF. This point will come up again.
◮ The maximum likelihood solution is computationally efficient (pseudo-inverse).
◮ Danger of overfitting, especially with many features or basis functions.