 
Applied Machine Learning
Linear Regression
Siamak Ravanbakhsh
COMP 551 (winter 2020)

Learning objectives
- linear model
- evaluation criteria
- how to find the best fit
- geometric interpretation

Motivation
History: the method of least squares was invented by Legendre and Gauss (1800s).
Gauss, at age 24, used it to predict the future location of Ceres (the largest asteroid in the asteroid belt).
Example: effect of income inequality on health and social problems
source: http://chrisauld.com/2012/10/07/what-do-we-know-about-the-effect-of-income-inequality-on-health/

Motivation (?)
Winter 2020 | Applied Machine Learning (COMP551)

Representing data
- each instance: $x^{(n)} \in \mathbb{R}^D$, $y^{(n)} \in \mathbb{R}$
- vectors are assumed to be column vectors: $x = [x_1, x_2, \ldots, x_D]^\top$, where each entry $x_d$ is a feature
- we assume N instances in the dataset: $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$
- each instance has D features indexed by d
- for example, $x_d^{(n)} \in \mathbb{R}$ is feature d of instance n

Representing data
design matrix: concatenate all instances
each row is a datapoint (one instance), each column is a feature

$X = \begin{bmatrix} x^{(1)\top} \\ x^{(2)\top} \\ \vdots \\ x^{(N)\top} \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_D^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(N)} & x_2^{(N)} & \cdots & x_D^{(N)} \end{bmatrix} \in \mathbb{R}^{N \times D}$

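A minimal NumPy sketch of the design matrix layout described above. The toy values and the names `instances`, `row_n`, and `col_d` are made up for illustration.

```python
import numpy as np

# Hypothetical toy dataset: N = 3 instances, each with D = 2 features.
instances = [np.array([1.0, 2.0]),
             np.array([3.0, 4.0]),
             np.array([5.0, 6.0])]

# Stack the instances as rows, so X has shape (N, D).
X = np.stack(instances, axis=0)
N, D = X.shape

row_n = X[1]      # one instance (second datapoint), shape (D,)
col_d = X[:, 0]   # one feature across all instances, shape (N,)
```

Indexing a row gives one instance and indexing a column gives one feature, which matches the row/column convention on the slide.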
Representing data
Example: microarray data (X) contains gene expression levels
labels (y) can be a {cancer/no cancer} label for each patient
$X \in \mathbb{R}^{N \times D}$: rows index patients (n), columns index genes (d)

Linear model
$f_w : \mathbb{R}^D \to \mathbb{R}$, assuming a scalar output (will generalize to a vector later)
$f_w(x) = w_0 + w_1 x_1 + \ldots + w_D x_D$
- $w$: model parameters or weights
- $w_0$: bias or intercept
simplification: concatenate a 1 to x, $x = [1, x_1, \ldots, x_D]^\top$, so that $f_w(x) = w^\top x$
    yh_n = np.dot(w,x)

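The "concatenate a 1" simplification can be sketched as follows. The helper name `add_bias_column` and the numbers are illustrative assumptions, not part of the slides.

```python
import numpy as np

def add_bias_column(X):
    """Prepend a column of ones so that w[0] acts as the intercept w_0."""
    return np.column_stack([np.ones(X.shape[0]), X])

X = np.array([[2.0, 3.0],
              [4.0, 5.0]])          # N = 2 instances, D = 2 features
w = np.array([1.0, 0.5, -1.0])      # [w_0, w_1, w_2]

X1 = add_bias_column(X)             # shape (N, D + 1)
yh = X1 @ w                         # predictions using w^T x with the extra 1
yh_explicit = w[0] + X @ w[1:]      # same predictions with a separate bias term
```

Both forms give the same predictions; the augmented form just lets the bias be treated like any other weight.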
Loss function
objective: find parameters to fit the data $(x^{(n)}, y^{(n)}) \; \forall n$, so that $f_w(x^{(n)}) \approx y^{(n)}$
minimize a measure of difference between $\hat{y}^{(n)} = f_w(x^{(n)})$ and $y^{(n)}$
- square error loss (a.k.a. L2 loss), for a single instance (a function of labels): $L(y, \hat{y}) \triangleq \frac{1}{2}(y - \hat{y})^2$, where the factor $\frac{1}{2}$ is for future convenience
- versus, for the whole dataset, the sum of squared errors cost function: $J(w) = \frac{1}{2} \sum_{n=1}^{N} \left(y^{(n)} - w^\top x^{(n)}\right)^2$

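A small sketch of the per-instance loss versus the whole-dataset cost. The function names `l2_loss` and `sse_cost` and the toy data are assumptions made for this example.

```python
import numpy as np

def l2_loss(y, yh):
    """Square error loss for a single instance (or elementwise for arrays)."""
    return 0.5 * (y - yh) ** 2

def sse_cost(w, X, y):
    """Sum of squared errors cost J(w) over the whole dataset."""
    return np.sum(l2_loss(y, X @ w))

# Toy data generated exactly by y = 1 + 2x; the first column of X is the bias.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

cost_exact = sse_cost(np.array([1.0, 2.0]), X, y)  # weights that fit perfectly
cost_off = sse_cost(np.array([0.0, 2.0]), X, y)    # intercept off by 1
```

With the intercept off by 1, every residual is 1, so the cost is 3 times 1/2.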
Example (D = 1), + bias (D = 2)!
[figure: data points $(x^{(1)}, y^{(1)}), \ldots, (x^{(4)}, y^{(4)})$ and the fitted line $f_{w^*}(x) = w_0^* + w_1^* x$ with intercept $w_0^*$; the residual $f(x^{(3)}) - y^{(3)}$ is marked; horizontal axis $x = [x_1]$, vertical axis $y$]
Linear Least Squares: $\min_w \sum_n \left(y^{(n)} - w^\top x^{(n)}\right)^2$

Example (D = 2), + bias (D = 3)!
[figure: data points and the fitted plane $f_{w^*}(x) = w_0^* + w_1^* x_1 + w_2^* x_2$ with intercept $w_0^*$; axes $x_1$, $x_2$, $y$]
Linear Least Squares: $w^* = \arg\min_w \sum_n \left(y^{(n)} - w^\top x^{(n)}\right)^2$

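Matrix form
instead of $\hat{y}^{(n)} = w^\top x^{(n)}$ (with $w^\top \in \mathbb{R}^{1 \times D}$ and $x^{(n)} \in \mathbb{R}^{D \times 1}$),
use the design matrix to write $\hat{y} = Xw$ (dimensions: $N \times 1 = (N \times D)(D \times 1)$)
    yh = np.dot(X, w)
Linear least squares: $\arg\min_w \frac{1}{2} \|y - Xw\|_2^2$, where $\frac{1}{2} \|y - Xw\|_2^2 = \frac{1}{2} (y - Xw)^\top (y - Xw)$ is the squared L2 norm of the residual vector
    cost = np.sum((yh - y)**2)/2.
    # or
    cost = np.mean((yh - y)**2)/2.
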
Minimizing the cost
[figure: the cost surface in weight space (axes $w_0$, $w_1$) next to the corresponding fit in data space (axes $x$, $y$)]
- the objective is a smooth function of w
- find the minimum by setting partial derivatives to zero
image: Grosse, Farahmand, Carrasquilla

Simple case: D = 1
- model: $f_w(x) = wx$ (both scalar)
- cost function: $J(w) = \frac{1}{2} \sum_n \left(y^{(n)} - w x^{(n)}\right)^2$
- derivative: $\frac{dJ}{dw} = \sum_n x^{(n)} \left(w x^{(n)} - y^{(n)}\right)$
- setting the derivative to zero: $w = \frac{\sum_n x^{(n)} y^{(n)}}{\sum_n \left(x^{(n)}\right)^2}$
global minimum because the cost is smooth and convex (more on convexity later)

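The scalar closed form can be checked numerically. This is a sketch on synthetic data; the true slope of 3, the noise level, and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 3.0 * x + 0.1 * rng.normal(size=50)   # hypothetical data: slope 3 plus noise

def J(w):
    """Cost for the scalar model f_w(x) = w * x."""
    return 0.5 * np.sum((y - w * x) ** 2)

# Closed-form minimizer obtained by setting dJ/dw = 0.
w_star = np.sum(x * y) / np.sum(x ** 2)

# The derivative sum_n x (w x - y) should vanish at w_star.
dJ_dw = np.sum(x * (w_star * x - y))
```

Because the cost is convex in w, moving away from `w_star` in either direction can only increase `J`.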
Gradient
for a multivariate function $J(w_0, w_1)$:
- partial derivatives instead of derivative (= derivative when other vars. are fixed): $\frac{\partial}{\partial w_1} J(w_0, w_1) \triangleq \lim_{\epsilon \to 0} \frac{J(w_0, w_1 + \epsilon) - J(w_0, w_1)}{\epsilon}$
- critical point: all partial derivatives are zero
- gradient: vector of all partial derivatives, $\nabla J(w) = \left[\frac{\partial}{\partial w_1} J(w), \cdots, \frac{\partial}{\partial w_D} J(w)\right]^\top$

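The limit definition suggests a simple numerical approximation: perturb one coordinate at a time by a small, finite epsilon. The function `numerical_gradient` and the test function `J` below are illustrative assumptions.

```python
import numpy as np

def numerical_gradient(J, w, eps=1e-6):
    """Approximate each partial derivative by a forward difference."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for d in range(w.size):
        step = np.zeros_like(w)
        step[d] = eps                      # perturb only coordinate d
        grad[d] = (J(w + step) - J(w)) / eps
    return grad

# Hypothetical test function with a known gradient:
# J(w0, w1) = w0^2 + 3 w0 w1, so grad J = [2 w0 + 3 w1, 3 w0].
J = lambda w: w[0] ** 2 + 3.0 * w[0] * w[1]
g = numerical_gradient(J, [1.0, 2.0])
```

At the point (1, 2) the analytic gradient is [8, 3], and the finite-difference estimate agrees up to an error on the order of `eps`.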
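Finding w (any D)
- setting $\frac{\partial}{\partial w_d} J(w) = 0$: $\frac{\partial}{\partial w_d} \sum_n \frac{1}{2} \left(y^{(n)} - f_w(x^{(n)})\right)^2 = 0$
- using the chain rule: $\frac{\partial J}{\partial w_d} = \frac{dJ}{df_w} \frac{\partial f_w}{\partial w_d}$
- we get $\sum_n \left(w^\top x^{(n)} - y^{(n)}\right) x_d^{(n)} = 0 \quad \forall d \in \{1, \ldots, D\}$
the cost is a smooth and convex function of w
[figure: cost surface $J(w)$ over axes $w_0$, $w_1$]
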
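For reference, the chain-rule step can be written out term by term. This is a worked version of the derivation on the slide, using $\frac{\partial f_w(x)}{\partial w_d} = x_d$ for the linear model $f_w(x) = w^\top x$.

```latex
\begin{align*}
\frac{\partial J}{\partial w_d}
  &= \sum_n \frac{\partial}{\partial w_d} \, \frac{1}{2}
     \left(y^{(n)} - f_w(x^{(n)})\right)^2 \\
  &= \sum_n -\left(y^{(n)} - f_w(x^{(n)})\right)
     \frac{\partial f_w(x^{(n)})}{\partial w_d} \\
  &= \sum_n \left(w^\top x^{(n)} - y^{(n)}\right) x_d^{(n)}
\end{align*}
```

Setting this expression to zero for every d gives the D equations stated above.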
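Normal equation
system of D linear equations: $\sum_n \left(y^{(n)} - w^\top x^{(n)}\right) x_d^{(n)} = 0 \quad \forall d$
matrix form (using the design matrix): $X^\top (y - Xw) = 0$, with $X^\top$ of size $D \times N$ and $(y - Xw)$ of size $N \times 1$
each row enforces one of the D equations
Normal equation: because for the optimal w, the residual vector $y - Xw$ is normal to the column space of the design matrix
[figure: $y \in \mathbb{R}^N$, its projection $\hat{y}$ onto the plane spanned by the columns $x_1$, $x_2$ of X, and the residual $y - Xw$]
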
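The orthogonality claim can be verified numerically. A minimal sketch on random data, where the sizes and seed are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 20, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Least-squares weights, computed with NumPy's solver.
w = np.linalg.lstsq(X, y, rcond=None)[0]

residual = y - X @ w
# One entry per column of X: the inner product of that column with the residual.
normal_eq = X.T @ residual
```

Every entry of `normal_eq` is zero up to floating-point error, so the residual is orthogonal to each column of X.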
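Direct solution
we can get a closed form solution!
$X^\top (y - Xw) = 0$
$X^\top X w = X^\top y$
$w^* = (X^\top X)^{-1} X^\top y$, where $(X^\top X)^{-1} X^\top$ is the pseudo-inverse of X (dimensions: $(D \times D)(D \times N)(N \times 1)$)
$\hat{y} = X w^* = X (X^\top X)^{-1} X^\top y$, where $X (X^\top X)^{-1} X^\top$ is the projection matrix into the column space of X
    w = np.linalg.lstsq(X,y)[0]
[figure: $y \in \mathbb{R}^N$, its projection $\hat{y}$ onto the plane spanned by the columns $x_1$, $x_2$ of X, and the residual $y - Xw$]
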
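A sketch that spells out the pseudo-inverse and the projection matrix on random data. The explicit inverse is used here only to mirror the formula; the variable names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 10, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Pseudo-inverse written out as (X^T X)^{-1} X^T, shape (D, N).
pinv_formula = np.linalg.inv(X.T @ X) @ X.T
w_star = pinv_formula @ y

# Projection matrix into the column space of X, shape (N, N).
P = X @ pinv_formula
yh = P @ y
```

When the columns of X are linearly independent, `pinv_formula` agrees with `np.linalg.pinv(X)`, and applying `P` twice changes nothing, as expected of a projection.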
Time complexity
$w^* = (X^\top X)^{-1} X^\top y$ (dimensions: $(D \times D)(D \times N)(N \times 1)$)
- $X^\top y$: $O(ND)$, D elements, each using N ops.
- $X^\top X$: $O(D^2 N)$, D x D elements, each requiring N multiplications
- matrix inversion: $O(D^3)$
total complexity for N > D is $O(ND^2 + D^3)$
in practice we don't directly use matrix inversion (unstable)

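One common way to avoid forming the inverse is to solve the linear system $X^\top X w = X^\top y$ directly. The sketch below uses `np.linalg.solve` for this; it is one possible choice, not something prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 200, 5
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

A = X.T @ X          # D x D matrix, costs O(N D^2)
b = X.T @ y          # length-D vector, costs O(N D)

# Solve A w = b without explicitly computing the inverse of A.
w_solve = np.linalg.solve(A, b)

# Reference: NumPy's least-squares routine applied to X directly.
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
```

On well-conditioned data both give the same weights. Working on X directly, as `lstsq` does, avoids forming $X^\top X$ at all, which matters when X is ill-conditioned.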
Multiple targets
instead of $y \in \mathbb{R}^N$ we have $Y \in \mathbb{R}^{N \times D'}$
- a different weight vector for each target
- each column of Y is associated with a column of W
$\hat{Y} = XW$ (dimensions: $N \times D' = (N \times D)(D \times D')$)
$W^* = (X^\top X)^{-1} X^\top Y$ (dimensions: $(D \times D)(D \times N)(N \times D')$)
    W = np.linalg.lstsq(X,Y)[0]

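A sketch showing that fitting all targets at once matches fitting each column of Y separately. The noise-free synthetic data and the name `W_true` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, D_out = 30, 4, 2                 # D_out plays the role of D'
X = rng.normal(size=(N, D))
W_true = rng.normal(size=(D, D_out))   # hypothetical ground-truth weights
Y = X @ W_true                         # noise-free targets, shape (N, D')

# All targets at once: W has one column per target.
W = np.linalg.lstsq(X, Y, rcond=None)[0]

# Same thing, one target (column of Y) at a time.
W_cols = np.column_stack(
    [np.linalg.lstsq(X, Y[:, k], rcond=None)[0] for k in range(D_out)]
)
```

Each column of `W` depends only on the matching column of `Y`, so the joint fit is just D' independent regressions that share the same X.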
Nonlinear basis functions
so far we learned a linear function $f_w = \sum_d w_d x_d$
nothing changes if we have nonlinear bases: $f_w = \sum_d w_d \phi_d(x)$
the solution simply becomes $w^* = (\Phi^\top \Phi)^{-1} \Phi^\top y$, with $\Phi$ replacing X

$\Phi = \begin{bmatrix} \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_D(x^{(1)}) \\ \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_D(x^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x^{(N)}) & \phi_2(x^{(N)}) & \cdots & \phi_D(x^{(N)}) \end{bmatrix}$

each row is one instance, each column is a (nonlinear) feature

Nonlinear basis functions
examples (the original input is a scalar, $x \in \mathbb{R}$):
- polynomial bases: $\phi_k(x) = x^k$
- Gaussian bases: $\phi_k(x) = e^{-\frac{(x - \mu_k)^2}{s^2}}$
- Sigmoid bases: $\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}$

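The three families can be sketched as functions that map a vector of scalar inputs to a feature matrix $\Phi$. The function names, the default $s = 1$, and the choice to start the polynomial powers at $k = 0$ are assumptions made for this example.

```python
import numpy as np

def polynomial_basis(x, K):
    """Columns are x^0, x^1, ..., x^(K-1); result has shape (N, K)."""
    return x[:, None] ** np.arange(K)[None, :]

def gaussian_basis(x, mu, s=1.0):
    """Column k is exp(-(x - mu_k)^2 / s^2); result has shape (N, len(mu))."""
    return np.exp(-((x[:, None] - mu[None, :]) / s) ** 2)

def sigmoid_basis(x, mu, s=1.0):
    """Column k is 1 / (1 + exp(-(x - mu_k) / s)); result has shape (N, len(mu))."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - mu[None, :]) / s))

x = np.array([0.0, 2.0])            # N = 2 scalar inputs
mu = np.array([0.0, 2.0, 4.0])      # 3 basis centers

Phi_poly = polynomial_basis(x, 3)
Phi_gauss = gaussian_basis(x, mu)
Phi_sig = sigmoid_basis(x, mu)
```

A Gaussian basis peaks at 1 when the input equals its center, and a sigmoid basis crosses 0.5 there, which gives a quick sanity check of the shapes and formulas.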
Example: Gaussian bases
$\phi_k(x) = e^{-\frac{(x - \mu_k)^2}{s^2}}$
data: $y^{(n)} = \sin(x^{(n)}) + \cos(|x^{(n)}|) + \epsilon$, where $\epsilon$ is noise
[figure: the function before adding noise]

    import numpy as np
    import matplotlib.pyplot as plt
    # x: N inputs (assumed given)
    # y: N targets (assumed given)
    plt.plot(x, y, 'b.')                        # plot the data
    phi = lambda x,mu: np.exp(-(x-mu)**2)       # Gaussian basis with s = 1
    mu = np.linspace(0,10,10)                   # 10 Gaussian bases
    Phi = phi(x[:,None], mu[None,:])            # N x 10
    w = np.linalg.lstsq(Phi, y)[0]
    yh = np.dot(Phi,w)
    plt.plot(x, yh, 'g-')                       # plot the fit

[figure: our fit to data using 10 Gaussian bases]

Example: Sigmoid bases
$\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}$
data: $y^{(n)} = \sin(x^{(n)}) + \cos(|x^{(n)}|) + \epsilon$

    import numpy as np
    import matplotlib.pyplot as plt
    # x: N inputs (assumed given)
    # y: N targets (assumed given)
    plt.plot(x, y, 'b.')                             # plot the data
    phi = lambda x,mu: 1/(1 + np.exp(-(x - mu)))     # sigmoid basis with s = 1
    mu = np.linspace(0,10,10)                        # 10 sigmoid bases
    Phi = phi(x[:,None], mu[None,:])                 # N x 10
    w = np.linalg.lstsq(Phi, y)[0]
    yh = np.dot(Phi,w)
    plt.plot(x, yh, 'g-')                            # plot the fit

[figure: our fit to data using 10 sigmoid bases]

Problematic settings
In $W^* = (X^\top X)^{-1} X^\top Y$:
- what if we have a large dataset (N > 100,000,000)?
  - use stochastic gradient descent
- what if $X^\top X$ is not invertible?
  - this happens when the columns of X (features) are not linearly independent (either redundant features or D > N)
  - W* is not unique
  - make it unique by removing redundant features, or by regularization (later!)
  - or find one of the solutions: decomposition-based methods (not discussed) still work, or use gradient descent (later!)
    w = np.linalg.lstsq(X,Y)[0]

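A sketch of the non-invertible case, using a deliberately duplicated feature. The data are made up. The claim that `np.linalg.lstsq` returns the smallest-norm solution among the minimizers follows NumPy's documented behavior.

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=10)
X = np.column_stack([x1, x1])     # redundant feature: both columns are identical
y = 2.0 * x1

# X^T X is 2 x 2 but has rank 1, so it cannot be inverted.
rank = np.linalg.matrix_rank(X.T @ X)

# lstsq still works and picks the minimum-norm solution, here [1, 1].
w_min_norm = np.linalg.lstsq(X, y, rcond=None)[0]

# A different weight vector that fits the data equally well.
w_other = np.array([2.0, 0.0])
```

Both weight vectors reproduce y exactly, which illustrates the non-uniqueness; removing the duplicate column would make the solution unique again.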