Applied Machine Learning
Linear Regression
Siamak Ravanbakhsh
COMP 551 (fall 2020)
Learning objectives: the linear model, evaluation criteria, how to find the best fit, the geometric interpretation, and the maximum likelihood interpretation.

Motivation
Example: the effect of income inequality on health and social problems.
source: http://chrisauld.com/2012/10/07/what-do-we-know-about-the-effect-of-income-inequality-on-health/
History: the method of least squares was invented by Legendre and Gauss (1800s). Gauss, at age 24, used it to predict the future location of Ceres (the largest asteroid in the asteroid belt).
Notation: vectors are assumed to be column vectors, $x = [x_1, x_2, \ldots, x_D]^\top$.
We assume N instances in the dataset $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$.
Each instance has D features indexed by d; for example, $x_d^{(n)} \in \mathbb{R}$ is feature d of instance n.
Stacking the instances as rows gives the design matrix

$$X = \begin{bmatrix} x^{(1)\top} \\ x^{(2)\top} \\ \vdots \\ x^{(N)\top} \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_D^{(1)} \\ \vdots & & & \vdots \\ x_1^{(N)} & x_2^{(N)} & \cdots & x_D^{(N)} \end{bmatrix}$$

each row is a datapoint, each column is a feature.
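As a concrete illustration of this layout (not from the slides), a minimal NumPy sketch; the variable names and values below are made up:

```python
import numpy as np

# toy design matrix: N = 4 instances (rows), D = 3 features (columns); values are made up
X = np.array([[1.0, 2.0, 0.5],
              [0.3, 1.5, 2.2],
              [2.1, 0.0, 1.0],
              [0.7, 0.9, 0.1]])
y = np.array([1.2, 0.7, 2.5, 0.4])   # one label per instance

N, D = X.shape        # N = 4, D = 3
x_2 = X[1]            # the instance x^(2) (0-based row index 1)
x_3_of_2 = X[1, 2]    # feature d = 3 of instance n = 2
```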
Example: microarray data. The design matrix X contains gene expression levels, with one row per patient (n) and one column per gene (d); the label $y^{(n)}$ for each patient can be {cancer / no cancer}.
Linear model: $f_w(x) = w_0 + w_1 x_1 + \cdots + w_D x_D = w^\top x$ (with a constant feature $x_0 = 1$ absorbed into x), where $w_0$ is the bias or intercept and $w = [w_0, w_1, \ldots, w_D]^\top$ are the model parameters or weights. For now the target y is a scalar; we will generalize to a vector of targets later.
We want to minimize a measure of the difference between the prediction $\hat{y}^{(n)} = f_w(x^{(n)})$ and the label $y^{(n)}$.

Squared error loss (a.k.a. L2 loss) for a single instance: $\frac{1}{2}\big(y^{(n)} - \hat{y}^{(n)}\big)^2$.

Sum-of-squared-errors cost function for the whole dataset: $J(w) = \frac{1}{2}\sum_{n=1}^{N} \big(y^{(n)} - w^\top x^{(n)}\big)^2$.

The factor $\frac{1}{2}$ is there for future convenience; the loss is defined for a single instance versus the cost for the whole dataset.
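A minimal NumPy sketch of this cost function (an illustration, not code from the course):

```python
import numpy as np

def sse_cost(w, X, y):
    """Sum-of-squared-errors cost J(w) = 1/2 * sum_n (y^(n) - w^T x^(n))^2."""
    residuals = y - X @ w           # vector of y^(n) - w^T x^(n), shape (N,)
    return 0.5 * np.sum(residuals ** 2)
```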
The best fit minimizes the cost: $w^* = \arg\min_w \frac{1}{2}\sum_n \big(y^{(n)} - w^\top x^{(n)}\big)^2$.

[figure: the fitted line $w^*$ through the datapoints $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), (x^{(3)}, y^{(3)}), (x^{(4)}, y^{(4)})$; the vertical distance of each point from the line, e.g. $y^{(3)} - w^{*\top} x^{(3)}$, is its residual]
Matrix-vector form: each prediction $w^\top x^{(n)}$ is a product of shapes $(1 \times D)(D \times 1) \in \mathbb{R}$; collecting all instances with the design matrix (shapes $N \times D$, $D \times 1$, $N \times 1$) gives

$$J(w) = \frac{1}{2}\sum_n \big(y^{(n)} - w^\top x^{(n)}\big)^2 = \frac{1}{2}\,\lVert y - Xw \rVert_2^2$$

the squared L2 norm of the residual vector. The cost function is a smooth function of w, so we can find the minimum by setting partial derivatives to zero.
Simplest case, both x and w scalar (a single feature, no intercept): $J(w) = \frac{1}{2}\sum_n \big(y^{(n)} - w\,x^{(n)}\big)^2$. Setting $\frac{dJ}{dw} = \sum_n \big(w\,x^{(n)} - y^{(n)}\big)\,x^{(n)} = 0$ gives

$$w^* = \frac{\sum_n x^{(n)} y^{(n)}}{\sum_n \big(x^{(n)}\big)^2}$$

This is the global minimum because the cost is smooth and convex (more on convexity later).
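A sketch of this single-feature case on made-up data (names and values are illustrative only):

```python
import numpy as np

# single feature, no intercept; the data values are made up
x = np.array([0.5, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 4.2, 5.8])

# setting dJ/dw = 0 gives w* = (sum_n x^(n) y^(n)) / (sum_n (x^(n))^2)
w_star = np.sum(x * y) / np.sum(x ** 2)
y_hat = w_star * x        # predictions of the fitted line through the origin
```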
For a multivariate function $J(w_1, w_2)$ we use partial derivatives instead of the derivative:

$$\frac{\partial}{\partial w_2} J(w_1, w_2) = \lim_{\epsilon \to 0} \frac{J(w_1, w_2 + \epsilon) - J(w_1, w_2)}{\epsilon}$$

i.e., the derivative when the other variables are fixed. A critical point is where all partial derivatives are zero. The gradient is the vector of all partial derivatives, $\nabla J(w) = \big[\frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_D}\big]^\top$.
The cost is a smooth and convex function of w, so we set each partial derivative to zero:

$$\frac{\partial J}{\partial w_d} = \frac{\partial}{\partial w_d}\, \frac{1}{2}\sum_n \big(y^{(n)} - w^\top x^{(n)}\big)^2 = 0$$

Using the chain rule, $\frac{\partial J}{\partial w_d} = \frac{dJ}{df_w}\,\frac{\partial f_w}{\partial w_d}$, we get

$$\frac{\partial J}{\partial w_d} = \sum_n \big(w^\top x^{(n)} - y^{(n)}\big)\, x_d^{(n)} = 0$$
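In vectorized form this gradient is one line of NumPy; a sketch (not the course's code):

```python
import numpy as np

def sse_gradient(w, X, y):
    """Gradient of J: dJ/dw_d = sum_n (w^T x^(n) - y^(n)) x_d^(n), for all d at once."""
    return X.T @ (X @ w - y)   # shape (D,); setting this to zero gives the normal equation
```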
Matrix form (using the design matrix): the D conditions above become $X^\top (Xw - y) = 0$, where $X^\top$ is $D \times N$ and the residual $y - Xw$ is $N \times 1$; each row enforces one of the D equations.

Normal equation: $X^\top X w = X^\top y$, so called because for the optimal w the residual vector $y - Xw$ is normal (orthogonal) to the column space of the design matrix, e.g. to its 2nd column. This is a system of D linear equations.
This is a linear system $Aw = b$ with $A = X^\top X$ and $b = X^\top y$; equivalently, $X^\top (y - Xw) = 0$. We can get a closed-form solution:

$$w^* = (X^\top X)^{-1} X^\top y$$

with shapes $(D \times D)(D \times N)(N \times 1)$. The matrix $(X^\top X)^{-1} X^\top$ is the pseudo-inverse of X, and $X (X^\top X)^{-1} X^\top$ is the projection matrix into the column space of X: it maps y to the fitted values, leaving the residual $y - Xw^*$ orthogonal to that space.
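A sketch of the closed-form solution on synthetic data; the D x D system is solved directly rather than by forming the inverse (all names and data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                         # N = 20, D = 3 (synthetic data)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

# normal equations (X^T X) w = X^T y, solved as a D x D linear system
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# the residual is normal (orthogonal) to the column space of X
print(X.T @ (y - X @ w_star))                        # ~ 0 up to floating-point error
```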
What if the covariance matrix $X^\top X$ is not invertible? Recall the eigenvalue decomposition of the covariance matrix, $X^\top X = Q \Lambda Q^\top$, so that

$$(Q \Lambda Q^\top)^{-1} = Q \Lambda^{-1} Q^\top, \qquad \Lambda^{-1} = \begin{bmatrix} \frac{1}{\lambda_1} & & \\ & \ddots & \\ & & \frac{1}{\lambda_D} \end{bmatrix}$$

This matrix is not well-defined when some eigenvalues are zero. That happens if features are completely correlated, or more generally if the features are not linearly independent: there exist some coefficients $\{\alpha_d\}$, not all zero, such that $\sum_d \alpha_d x_d = 0$.
Example: having a binary feature $x_1$ as well as its negation $x_2 = (1 - x_1)$; together with the constant feature these are linearly dependent.

Consequence: if $w^*$ satisfies the normal equation $X^\top (y - X w^*) = 0$, then $w^* + c\alpha$ is also a solution for any c, because $X(w^* + c\alpha) = X w^* + c\,X\alpha = X w^*$ (the linear dependence means $X\alpha = 0$). The solution is no longer unique.
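A small sketch of this degeneracy with made-up data: a binary feature next to its negation (plus a bias column) makes $X^\top X$ singular, and adding any multiple of $\alpha$ leaves the predictions unchanged:

```python
import numpy as np

x1 = np.array([0.0, 1.0, 1.0, 0.0])
X = np.column_stack([np.ones(4), x1, 1.0 - x1])    # columns: bias, x1, x2 = 1 - x1
y = np.array([0.2, 1.1, 0.9, 0.3])

print(np.linalg.matrix_rank(X.T @ X))              # 2 < 3, so X^T X is singular
w_star = np.linalg.pinv(X) @ y                     # pseudo-inverse picks one solution

alpha = np.array([-1.0, 1.0, 1.0])                 # X @ alpha = 0 (the linear dependence)
print(np.allclose(X @ (w_star + 5.0 * alpha), X @ w_star))   # True: same predictions
```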
Non-invertibility also happens when we have many features and $D \geq N$ (e.g., the microarray data, with far more genes (d) than patients (n)).

Computational complexity of $w^* = (X^\top X)^{-1} X^\top y$ (shapes $(D \times D)(D \times N)(N \times 1)$): computing $X^\top X$ costs $O(D^2 N)$, since its $D \times D$ elements each require N multiplications; the matrix inversion costs $O(D^3)$; computing $X^\top y$ costs $O(ND)$, D elements each using N operations. The total complexity for N > D is $O(ND^2 + D^3)$.

In practice we don't directly use matrix inversion (it is numerically unstable); however, other more stable solutions (e.g., Gaussian elimination) have similar complexity.
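In NumPy this typically means calling a least-squares routine rather than inverting anything explicitly; a sketch with synthetic shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                      # N = 100, D = 5 (synthetic)
y = rng.normal(size=100)

# solves min_w ||y - Xw||^2 via an orthogonal factorization (no explicit inverse)
w_star, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
```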
Multiple targets: instead of $y \in \mathbb{R}^{N}$ we have $Y \in \mathbb{R}^{N \times D'}$. The solution becomes $W^* = (X^\top X)^{-1} X^\top Y$, with shapes $(D \times D)(D \times N)(N \times D')$, and the predictions $X W^*$ have shape $(N \times D)(D \times D') = N \times D'$. There is a different weight vector for each target: each column of Y is associated with a column of W.
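A sketch with D' = 2 targets on synthetic data; the same normal equations are solved once, with one weight column per target:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, D_prime = 50, 4, 2
X = rng.normal(size=(N, D))
Y = rng.normal(size=(N, D_prime))                  # one column of targets per output

W_star = np.linalg.solve(X.T @ X, X.T @ Y)         # D x D': one weight vector per target
Y_hat = X @ W_star                                 # N x D' predictions
```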
So far we learned a linear function $f_w(x) = \sum_d w_d x_d$; sometimes this may be too simplistic.

Idea: create new, more useful features out of the initial set of given features, e.g. $x_1^2$, $x_1 x_2$, $\log(x_1)$. How about $x_1 + 2x_3$? (A linear combination of existing features adds no expressive power, since the model is already linear in them.)

Let's denote the set of all features by $\phi_d(x)\ \forall d$, where each $\phi_d$ is a (nonlinear) feature. The model becomes $f_w(x) = \sum_d w_d\, \phi_d(x)$, so $\phi_d(x)$ is the new $x_d$ and the problem of linear regression doesn't change. The solution simply becomes $w^* = (\Phi^\top \Phi)^{-1} \Phi^\top y$, with $\Phi$ replacing the design matrix X:

$$\Phi = \begin{bmatrix} \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_D(x^{(1)}) \\ \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_D(x^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x^{(N)}) & \phi_2(x^{(N)}) & \cdots & \phi_D(x^{(N)}) \end{bmatrix}$$
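A sketch of such a feature map; the particular choice of nonlinear features below is illustrative only:

```python
import numpy as np

def feature_map(X_raw):
    """Example nonlinear features built from two raw inputs x1, x2 (illustrative choice)."""
    x1, x2 = X_raw[:, 0], X_raw[:, 1]
    return np.column_stack([np.ones(len(X_raw)),   # bias
                            x1, x2,                # the original features
                            x1 ** 2, x1 * x2,      # new nonlinear features
                            np.log(x1)])

rng = np.random.default_rng(3)
X_raw = rng.uniform(0.1, 2.0, size=(40, 2))        # synthetic positive inputs
y = 1.0 + X_raw[:, 0] ** 2 - 0.5 * X_raw[:, 0] * X_raw[:, 1] + 0.05 * rng.normal(size=40)

Phi = feature_map(X_raw)                           # Phi plays the role of the design matrix
w_star = np.linalg.lstsq(Phi, y, rcond=None)[0]    # the least-squares problem is unchanged
```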
Common choices of bases: polynomial bases $\phi_k(x) = x^k$; Gaussian bases $\phi_k(x) = \exp\!\big(\!-\frac{(x - \mu_k)^2}{s^2}\big)$; sigmoid bases $\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}$.

[figure: fits using Gaussian and sigmoid bases] The green curve (our fit) is the sum of the scaled bases $w_k\,\phi_k(x)$ plus the intercept; each basis is scaled by the corresponding weight. We are using a fixed standard deviation of s = 1.
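A sketch of regression with Gaussian bases (centers on a grid, s = 1), following the formula above; the data and grid are made up:

```python
import numpy as np

def gaussian_design(x, centers, s=1.0):
    """Columns: an intercept plus phi_k(x) = exp(-(x - mu_k)^2 / s^2) for each center mu_k."""
    phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / s ** 2)
    return np.column_stack([np.ones(len(x)), phi])

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 10.0, size=50)                # synthetic 1-D inputs
y = np.sin(x) + 0.1 * rng.normal(size=50)

centers = np.linspace(0.0, 10.0, 9)                # mu_k on a grid, fixed s = 1
Phi = gaussian_design(x, centers)
w_star = np.linalg.lstsq(Phi, y, rcond=None)[0]
y_hat = Phi @ w_star     # the fit: the intercept plus the weighted sum of the scaled bases
```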
Probabilistic interpretation: given the dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$, the idea is to learn a probabilistic model $p(y \mid x; w)$.

Consider $p(y \mid x; w)$ with the following form:

$$p_w(y \mid x) = \mathcal{N}(y \mid w^\top x, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - w^\top x)^2}{2\sigma^2}}$$

and assume a fixed variance, say $\sigma^2 = 1$.

Q: how to fit the model? A: maximize the conditional likelihood!

image: http://blog.nguyenvq.com/blog/2009/05/12/linear-regression-plot-with-normal-curves-for-error-sideways/
Likelihood: $L(w) = \prod_{n=1}^{N} p(y^{(n)} \mid x^{(n)}; w)$.

Log-likelihood: $\ell(w) = -\frac{1}{2\sigma^2}\sum_n \big(y^{(n)} - w^\top x^{(n)}\big)^2 + \text{constants}$.

Maximum-likelihood parameters:

$$w^* = \arg\max_w \ell(w) = \arg\min_w \frac{1}{2}\sum_n \big(y^{(n)} - w^\top x^{(n)}\big)^2$$

which is exactly linear least squares! Whenever we use the squared error loss, we are assuming Gaussian noise.
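A small numerical check of this equivalence on synthetic data (assuming $\sigma^2 = 1$ as above): the least-squares weights also maximize the Gaussian log-likelihood:

```python
import numpy as np

def log_likelihood(w, X, y, sigma2=1.0):
    """Conditional log-likelihood of y under y ~ N(w^T x, sigma^2)."""
    r = y - X @ w
    return -0.5 / sigma2 * np.sum(r ** 2) - 0.5 * len(y) * np.log(2 * np.pi * sigma2)

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=60)

w_ls = np.linalg.lstsq(X, y, rcond=None)[0]        # linear least-squares solution
# perturbing w_ls can only lower the log-likelihood, since the two objectives coincide
print(log_likelihood(w_ls, X, y) >= log_likelihood(w_ls + 0.1, X, y))   # True
```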
Summary: linear regression models the targets as a linear function of the features. We fit the model by minimizing the sum of squared errors, which has a direct closed-form solution with complexity $O(ND^2 + D^3)$ and a probabilistic (maximum-likelihood) interpretation. We can build more expressive models by using any number of nonlinear features.