
Machine Learning Lecture 02: Linear Regression and Basic ML Issues



  1. Machine Learning Lecture 02: Linear Regression and Basic ML Issues Nevin L. Zhang lzhang@cse.ust.hk Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set of notes is based on internet resources and K. P. Murphy (2012). Machine Learning: A Probabilistic Perspective. MIT Press. (Chapter 7); Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. www.deeplearningbook.org. (Chapter 5); Andrew Ng. Lecture Notes on Machine Learning. Stanford. Nevin L. Zhang (HKUST) Machine Learning 1 / 30

  2. Linear Regression Outline 1 Linear Regression 2 Probabilistic Interpretation 3 Polynomial Regression 4 Model Capacity, Overfitting and Underfitting Nevin L. Zhang (HKUST) Machine Learning 2 / 30

  3. Linear Regression Linear Regression: Problem Statement Given: a training set $D = \{x_i, y_i\}_{i=1}^{N}$, where each $x_i$ is a $D$-dimensional real-valued column vector $x = (x_1, \ldots, x_D)^\top$ and each $y_i$ is a real number. To learn: $y = f(x) = w^\top x = \sum_{j=0}^{D} w_j x_j$. The weights $w = (w_0, w_1, \ldots, w_D)^\top$ determine how important the features $(x_1, \ldots, x_D)$ are in predicting the response $y$. We always set $x_0 = 1$, so that $w_0$ is the bias term; it is often denoted by $b$. Nevin L. Zhang (HKUST) Machine Learning 3 / 30
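The prediction rule on this slide can be made concrete with a minimal NumPy sketch (not part of the original slides; the numbers and array names are illustrative, and the $x_0 = 1$ convention for the bias is assumed):

```python
import numpy as np

# One input with D = 3 features; prepend x_0 = 1 so that w[0] acts as the bias term.
x = np.array([1.0, 2.5, -0.7, 3.1])   # (1, x_1, x_2, x_3)
w = np.array([0.5, 1.2, -0.3, 0.0])   # (w_0, w_1, w_2, w_3)

# Linear prediction: y = w^T x = sum_j w_j x_j
y_hat = w @ x
print(y_hat)
```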

  4. Linear Regression Linear Regression: Examples Here are several examples from http://people.sc.fsu.edu/~jburkardt/datasets/regression/regression.html Predict brain weight of mammals based on their body weight (x01.txt) Predict blood fat content based on age and weight (x09.txt) Predict death rate from cirrhosis based on a number of other factors (x20.txt) Predict selling price of houses based on a number of factors (x27.txt) Nevin L. Zhang (HKUST) Machine Learning 4 / 30

  5. Linear Regression Linear Regression: Mean Squared Error How to determine the weights $w$? We want the predicted response values $f(x_i)$ to be close to the observed response values. So, we want to minimize the following objective function: $J(w) = \frac{1}{N} \sum_{i=1}^{N} (y_i - w^\top x_i)^2$. This is called the mean squared error (MSE). Nevin L. Zhang (HKUST) Machine Learning 5 / 30
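As a rough illustration (a sketch assuming NumPy; the toy data below are made up and not from the slides), the MSE objective can be computed in vectorized form:

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error J(w) = (1/N) * sum_i (y_i - w^T x_i)^2."""
    residuals = y - X @ w          # vector of y_i - w^T x_i
    return np.mean(residuals ** 2)

# Toy data: N = 4 examples, D = 2 features plus the constant x_0 = 1.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 0.5],
              [1.0, 3.0, 1.5]])
y = np.array([1.0, 2.0, 2.5, 3.5])
w = np.zeros(3)
print(mse(w, X, y))
```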

  6. Linear Regression Linear Regression: Mean Squared Error As a function of the weights $w$, the MSE is a quadratic “bowl” with a unique minimum. We can minimize it by setting its gradient to zero: $\nabla J(w) = 0$. Nevin L. Zhang (HKUST) Machine Learning 6 / 30

  7. Linear Regression Linear Regression: Matrix Representation We can represent the training set $D = \{x_i, y_i\}_{i=1}^{N}$ using the design matrix $X$ and a column vector $y$:
$$X = \begin{pmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_N^\top \end{pmatrix} = \begin{pmatrix} x_{1,0} & x_{1,1} & x_{1,2} & \cdots & x_{1,D} \\ x_{2,0} & x_{2,1} & x_{2,2} & \cdots & x_{2,D} \\ \vdots & \vdots & \vdots & & \vdots \\ x_{N,0} & x_{N,1} & x_{N,2} & \cdots & x_{N,D} \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$
Then the MSE can be written as
$$J(w) = \frac{1}{N} (y - Xw)^\top (y - Xw) = \frac{1}{N} \| y - Xw \|^2 = \frac{1}{N} \left( w^\top (X^\top X) w - 2 w^\top (X^\top y) + y^\top y \right)$$
Nevin L. Zhang (HKUST) Machine Learning 7 / 30

  8. Linear Regression Linear Regression: The Normal Equation From the equation on the previous slide, we get (see Murphy, Chapter 7) that $\nabla J(w) = \frac{1}{N} (2 X^\top X w - 2 X^\top y)$. Setting the gradient to zero, we get the normal equation $X^\top X w = X^\top y$. The value of $w$ that minimizes $J(w)$ is $\hat{w} = (X^\top X)^{-1} X^\top y$. This is called the ordinary least squares (OLS) solution. Nevin L. Zhang (HKUST) Machine Learning 8 / 30
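A minimal sketch of the OLS solution follows (assuming NumPy; the synthetic data and the name `ols_fit` are illustrative). In practice a pseudo-inverse or `np.linalg.lstsq` is usually preferred when $X^\top X$ is ill-conditioned:

```python
import numpy as np

def ols_fit(X, y):
    """Solve the normal equation X^T X w = X^T y for w."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
N, D = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])   # x_0 = 1 column plus D features
true_w = np.array([0.5, 2.0, -1.0, 0.3])
y = X @ true_w + 0.1 * rng.normal(size=N)                     # noisy responses

w_hat = ols_fit(X, y)
print(w_hat)   # should be close to true_w
```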

  9. Probabilistic Interpretation Outline 1 Linear Regression 2 Probabilistic Interpretation 3 Polynomial Regression 4 Model Capacity, Overfitting and Underfitting Nevin L. Zhang (HKUST) Machine Learning 9 / 30

  10. Probabilistic Interpretation Probabilistic Interpretation Next, we show that least squares regression can be derived from a probabilistic model. We assert $y = w^\top x + \epsilon = \sum_{j=0}^{D} w_j x_j + \epsilon$, where the error term $\epsilon$ captures unmodeled effects and random noise. We also assume that $\epsilon$ follows a Gaussian distribution with zero mean and variance $\sigma^2$: $\epsilon \sim \mathcal{N}(0, \sigma^2)$. The model parameters $\theta$ include $w$ and $\sigma$. The conditional distribution of $y$ given input $x$ and parameters $\theta$ is a Gaussian $p(y \mid x, \theta) = \mathcal{N}(y \mid \mu(x), \sigma^2)$, where $\mu(x) = w^\top x$. Nevin L. Zhang (HKUST) Machine Learning 10 / 30
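To make the probabilistic model concrete, here is a small sketch (assuming NumPy; the parameter values are invented for illustration) that draws $y$ from $\mathcal{N}(w^\top x, \sigma^2)$ and evaluates the conditional density:

```python
import numpy as np

# Illustrative parameter values (not from the slides).
w = np.array([0.5, 2.0, -1.0])       # weights, with w[0] the bias
sigma = 0.3                          # noise standard deviation
x = np.array([1.0, 0.8, -0.2])       # input with x_0 = 1

mu = w @ x                           # mean of y given x: mu(x) = w^T x

rng = np.random.default_rng(1)
y = rng.normal(mu, sigma)            # y = w^T x + eps, eps ~ N(0, sigma^2)

# Gaussian density p(y | x, theta) = N(y | mu(x), sigma^2)
density = np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
print(mu, y, density)
```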

  11. Probabilistic Interpretation Probabilistic Interpretation $p(y \mid x, \theta) = \mathcal{N}(y \mid \mu(x), \sigma^2)$ For each input $x$, we get a distribution over $y$, which is a Gaussian distribution. To get a point estimate of $y$, we can use the mean, i.e., $\hat{y} = \mu(x) = w^\top x$. Nevin L. Zhang (HKUST) Machine Learning 11 / 30

  12. Probabilistic Interpretation Parameter Estimation Determine $\theta = (w, \sigma)$ by minimizing the cross entropy:
$$-\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta) = -\frac{1}{N} \sum_{i=1}^{N} \log \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - w^\top x_i)^2}{2\sigma^2} \right) \right] = \frac{1}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \cdot \frac{1}{N} \sum_{i=1}^{N} (y_i - w^\top x_i)^2$$
Assume $\sigma$ is fixed. Then this is the same as minimizing the MSE $J(w) = \frac{1}{N} \sum_{i=1}^{N} (y_i - w^\top x_i)^2$. Summary: Under some assumptions, least-squares regression can be justified as a very natural method that minimizes cross entropy, or equivalently maximizes likelihood. Nevin L. Zhang (HKUST) Machine Learning 12 / 30
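A quick numerical check of this identity (a sketch assuming NumPy; the data and parameter values are synthetic, not from the slides): the average negative log-likelihood should equal $\frac{1}{2}\log(2\pi\sigma^2)$ plus the MSE scaled by $\frac{1}{2\sigma^2}$.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.normal(size=N)

w = np.array([0.9, 1.8])     # some candidate weights
sigma = 0.5

resid = y - X @ w
# Average negative log-likelihood under the Gaussian noise model
nll = -np.mean(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))
# Closed form: (1/2) log(2*pi*sigma^2) + MSE / (2*sigma^2)
closed_form = 0.5 * np.log(2 * np.pi * sigma**2) + np.mean(resid**2) / (2 * sigma**2)
print(nll, closed_form)      # the two numbers agree
```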

  13. Polynomial Regression Outline 1 Linear Regression 2 Probabilistic Interpretation 3 Polynomial Regression 4 Model Capacity, Overfitting and Underfitting Nevin L. Zhang (HKUST) Machine Learning 13 / 30

  14. Polynomial Regression Beyond Linear Regression Here again is the linear regression model: $y = f(x) = w^\top x = \sum_{j=0}^{D} w_j x_j$. Linear regression can be made to model non-linear relationships by replacing $x$ with some non-linear function of the inputs, $\phi(x)$. That is, we use $y = f(x) = w^\top \phi(x)$. This is known as basis function expansion, and $\phi$ is called the feature mapping. Nevin L. Zhang (HKUST) Machine Learning 14 / 30

  15. Polynomial Regression Polynomial Regression For $x = [1, x_1, \ldots, x_D]^\top$, we can use the polynomial feature mapping of degree $d$: $\phi(x) = [1, x_1, x_2, \ldots, x_D, x_1^2, x_1 x_2, \ldots, x_D^d]^\top$. When $D = 2$ and $d = 2$, we get polynomial regression of the form $y = w^\top \phi(x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2$. Model selection: What $d$ to choose? What is the impact of $d$? Nevin L. Zhang (HKUST) Machine Learning 15 / 30
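For the two-feature, degree-2 case above, the feature mapping and the fit might be sketched as follows (assuming NumPy; the function name `phi`, the synthetic target function, and the noise level are all illustrative, and the normal-equation solver mirrors the earlier slide):

```python
import numpy as np

def phi(x1, x2):
    """Degree-2 polynomial feature mapping for two features."""
    return np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])

rng = np.random.default_rng(3)
raw = rng.uniform(-1, 1, size=(200, 2))                      # raw inputs (x_1, x_2)
Phi = np.array([phi(a, b) for a, b in raw])                  # design matrix in feature space
y = 1 + raw[:, 0] - 2 * raw[:, 1] ** 2 + 0.05 * rng.normal(size=200)

# Plain linear regression on phi(x): solve the normal equation in feature space.
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(w)   # roughly [1, 1, 0, 0, 0, -2]
```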

  16. Model Capacity, Overfitting and Underfitting Outline 1 Linear Regression 2 Probabilistic Interpretation 3 Polynomial Regression 4 Model Capacity, Overfitting and Underfitting Nevin L. Zhang (HKUST) Machine Learning 16 / 30

  17. Model Capacity, Overfitting and Underfitting Hypothesis Space and Capacity The hypothesis space of a machine learning algorithm/model is the set of functions that it is allowed to select as being the solution. The “size” of the hypothesis space is called the capacity of the model. For polynomial regression, the larger the d , the higher the model capacity. Higher model capacity implies better fit to training data . Two examples with d = 14 and 20 and one feature x = ( x ). Nevin L. Zhang (HKUST) Machine Learning 17 / 30

  18. Model Capacity, Overfitting and Underfitting Generalization Error A machine learning model is trained to perform well on the training examples, but that is not really what we care about. What we really care about is that it performs well on new, previously unseen examples. This is called generalization. We use the error on a test set to measure how well a model generalizes: $J^{(\text{test})}(w) = \frac{1}{N^{(\text{test})}} \| y^{(\text{test})} - X^{(\text{test})} w \|_2^2$. This is called the test error or the generalization error. In contrast, here is the training error we have been talking about so far: $J^{(\text{train})}(w) = \frac{1}{N^{(\text{train})}} \| y^{(\text{train})} - X^{(\text{train})} w \|_2^2$. Nevin L. Zhang (HKUST) Machine Learning 18 / 30
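A small sketch of measuring both errors (assuming NumPy; the data and the 75/25 split are synthetic choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([0.5, 1.5]) + 0.3 * rng.normal(size=N)

# Hold out 25% of the i.i.d. sample as the test set.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# Fit on the training set only (normal equation).
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

def err(Xs, ys, w):
    return np.mean((ys - Xs @ w) ** 2)

print("train error:", err(X_train, y_train, w))
print("test error :", err(X_test, y_test, w))   # typically a bit larger than the train error
```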

  19. Model Capacity, Overfitting and Underfitting Test and Training Error The test and training errors are related because we assume both training and test data are i.i.d. samples from an underlying data-generating process $p(x, y)$. However, a small training error does not always imply a small generalization error. The generalization error is usually larger than the training error because the model parameters are selected to minimize the training error. So, we need to: Make the training error small, and Make the gap between the test and training error small. Nevin L. Zhang (HKUST) Machine Learning 19 / 30

  20. Model Capacity, Overfitting and Underfitting Overfitting and Underfitting Training and test error behave differently as model capacity increases. At the left end of the graph, training error and generalization error are both high. This is the underfitting regime. As we increase capacity, training error decreases, but the gap between training and generalization error increases. Eventually, the size of this gap outweighs the decrease in training error, and we enter the overfitting regime, where capacity is too large. Nevin L. Zhang (HKUST) Machine Learning 20 / 30
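This behaviour can be reproduced with a small experiment (a sketch assuming NumPy; the true function, noise level, sample sizes, and degrees are invented, not taken from the slides): as the polynomial degree grows, the training error keeps falling while the test error eventually rises.

```python
import numpy as np

rng = np.random.default_rng(5)

def true_f(x):
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 15)
y_train = true_f(x_train) + 0.2 * rng.normal(size=15)
x_test = rng.uniform(0, 1, 200)
y_test = true_f(x_test) + 0.2 * rng.normal(size=200)

for d in [1, 3, 5, 9, 14]:
    # Fit a degree-d polynomial by least squares (high d may trigger a conditioning warning).
    coeffs = np.polyfit(x_train, y_train, d)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"d={d:2d}  train={train_err:.3f}  test={test_err:.3f}")
```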

  21. Model Capacity, Overfitting and Underfitting Overfitting and Underfitting Choosing a model with the appropriate capacity is important. This can be achieved by either validation or regularization. Nevin L. Zhang (HKUST) Machine Learning 21 / 30
