 
CZECH TECHNICAL UNIVERSITY IN PRAGUE
Faculty of Electrical Engineering, Department of Cybernetics

Linear Methods for Regression and Classification
Petr Pošík

P. Pošík © 2017, Artificial Intelligence
Linear regression
Outline:
Linear regression (Illustration; Regression; Notation remarks; Train, apply; 1D regression; LSM; Minimizing $J(w, T)$; Gradient descent; Multivariate linear regression)
Linear classification
Perceptron
Logistic regression
Optimal separating hyperplane
Summary

Linear regression: Illustration

[Figure: plot of the example dataset.]

Given a dataset of input vectors $x^{(i)}$ and the respective values of the output variable $y^{(i)}$, we would like to find a linear model of this dataset which minimizes a certain error between the known values of the output variable and the model predictions.
Linear regression

A regression task is a supervised learning task, i.e.
■ a training (multi)set $T = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(|T|)}, y^{(|T|)})\}$ is available, where
■ the labels $y^{(i)}$ are quantitative, often continuous (as opposed to classification tasks, where $y^{(i)}$ are nominal).
■ Its purpose is to model the relationship between the independent variables (inputs) $x = (x_1, \ldots, x_D)$ and the dependent variable (output) $y$.

Linear regression is a particular regression model which assumes (and learns) a linear relationship between the inputs and the output:

$$\hat{y} = h(x) = w_0 + w_1 x_1 + \ldots + w_D x_D = w_0 + \langle w, x \rangle = w_0 + x w^T,$$

where
■ $\hat{y}$ is the model prediction (an estimate of the true value $y$),
■ $h(x)$ is the linear model (a hypothesis),
■ $w_0, \ldots, w_D$ are the coefficients of the linear function; $w_0$ is the bias, and in the formula above the remaining coefficients are organized in a row vector $w = (w_1, \ldots, w_D)$,
■ $\langle w, x \rangle$ is the dot product (scalar product) of vectors $w$ and $x$,
■ which can also be computed as the matrix product $x w^T$ if $w$ and $x$ are row vectors.
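As a small illustration of the formula $\hat{y} = w_0 + \langle w, x \rangle$, the prediction for a single input vector might be computed as in the following sketch. The function name `predict` and all numeric values are hypothetical and chosen only for this example; they do not come from the slides.

```python
import numpy as np

def predict(x, w, w0):
    """Linear model h(x) = w0 + <w, x> for a single input vector x."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    return w0 + np.dot(w, x)

# Hypothetical coefficients for D = 2 inputs (illustrative values only).
w0 = 1.0                      # bias
w = np.array([2.0, -3.0])     # (w_1, w_2)
x = np.array([0.5, 1.0])      # (x_1, x_2)

# Dot-product form: 1.0 + 2.0 * 0.5 + (-3.0) * 1.0
y_hat = predict(x, w, w0)

# The same value via the matrix product x w^T with x and w as row vectors.
x_row = x.reshape(1, -1)
w_row = w.reshape(1, -1)
y_hat_matrix = w0 + (x_row @ w_row.T).item()
```

Both forms give the same number; the matrix form becomes convenient once many inputs are processed at the same time.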
Notation remarks

Homogeneous coordinates: If we add "1" as the first element of $x$ so that $x = (1, x_1, \ldots, x_D)$, and let $w = (w_0, w_1, \ldots, w_D)$ include the bias, then we can write the linear model in an even simpler form (without the explicit bias term):

$$\hat{y} = h(x) = w_0 \cdot 1 + w_1 x_1 + \ldots + w_D x_D = \langle w, x \rangle = x w^T.$$

Matrix notation: If we organize the data into matrix $X$ and vector $y$ such that

$$X = \begin{pmatrix} 1 & x^{(1)} \\ \vdots & \vdots \\ 1 & x^{(|T|)} \end{pmatrix} \quad \text{and} \quad y = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(|T|)} \end{pmatrix},$$

and similarly with $\hat{y}$, then we can write a batch computation of predictions for all data in $X$ as

$$\hat{y} = X w^T.$$
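The homogeneous-coordinate trick and the batch formula $\hat{y} = X w^T$ can be sketched in NumPy as follows. The helper name `add_ones_column` and the data values are assumptions made for illustration only.

```python
import numpy as np

def add_ones_column(X_raw):
    """Prepend a column of ones so that each row becomes (1, x_1, ..., x_D)."""
    X_raw = np.asarray(X_raw, dtype=float)
    ones = np.ones((X_raw.shape[0], 1))
    return np.hstack([ones, X_raw])

# Hypothetical data: |T| = 3 examples with D = 2 inputs each.
X_raw = np.array([[0.5, 1.0],
                  [2.0, 0.0],
                  [-1.0, 1.0]])

# Hypothetical coefficients (w_0, w_1, w_2); w_0 is the bias.
w = np.array([1.0, 2.0, -3.0])

X = add_ones_column(X_raw)    # shape (3, 3), first column is all ones

# With a 1-D array w, X @ w computes the same values as X w^T for a row vector w.
y_hat = X @ w
```

Each entry of `y_hat` is the prediction for the corresponding row of `X`, so no explicit loop over the examples is needed.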
Two operation modes

Any ML model has 2 operation modes:
1. learning (training, fitting) and
2. application (testing, making predictions).

The model $h$ can be viewed as a function of 2 variables: $h(x, w)$.

Model application: If the model is given ($w$ is fixed), we can manipulate $x$ to make predictions:

$$\hat{y} = h(x, w) = h_w(x).$$

Model learning: If the data is given ($T$ is fixed), we can manipulate the model parameters $w$ to fit the model to the data:

$$w^* = \arg\min_{w} J(w, T).$$

How to train the model?
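The two modes might be sketched as two separate functions, as below. This is only an illustration under stated assumptions: the criterion $J$ is not defined in this part of the slides, so the sketch assumes a mean squared error, and it uses NumPy's least-squares solver as one possible way to find the minimizer. The names `h`, `J`, and `fit` and the data are hypothetical.

```python
import numpy as np

def h(X, w):
    """Model application: w is fixed, X varies (rows in homogeneous coordinates)."""
    return X @ w

def J(w, X, y):
    """ASSUMED criterion: mean squared error between predictions and known outputs."""
    residuals = h(X, w) - y
    return np.mean(residuals ** 2)

def fit(X, y):
    """Model learning: the data is fixed, w varies; returns a minimizer of J.

    Uses np.linalg.lstsq, which minimizes the sum of squared residuals and
    therefore also minimizes the assumed mean squared error.
    """
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_star

# Hypothetical noise-free training data generated from y = 1 + 2 x.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # rows are (1, x)
y = 1.0 + 2.0 * x

w_star = fit(X, y)                           # learning mode
y_new = h(np.array([[1.0, 4.0]]), w_star)    # application mode for x = 4
```

Keeping the two modes in separate functions mirrors the view of the model as $h(x, w)$: `fit` varies `w` for fixed data, while `h` varies the input for fixed `w`.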