Machine Learning and Data Mining: Linear regression (Kalev Kask)


  1. Machine Learning and Data Mining: Linear regression. Kalev Kask

  2. Supervised learning
     • Notation – Features x – Targets y – Predictions ŷ – Parameters θ
     • Block diagram: training data (examples) supply features and feedback / target values; the program ("Learner"), characterized by some "parameters" θ, is a procedure (using θ) that outputs a prediction from the features; a score ("cost function") measures performance, and the learning algorithm changes θ to improve performance.

  3. Linear regression
     • "Predictor": evaluate the line and return r
     [Plot: Target y vs. Feature x, with a fitted line]
     • Define form of function f(x) explicitly
     • Find a good f(x) within that family
     (c) Alexander Ihler

  4. Notation
     • Define "feature" x0 = 1 (constant)
     • Then ŷ(x) = θ0 x0 + θ1 x1 + … + θn xn = θ · x
     (c) Alexander Ihler
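  As a small illustration (not from the slides; the values are made up), prepending the constant feature x0 = 1 folds the intercept into one dot product:

      import numpy as np

      # Made-up raw features for m = 3 examples, one real feature each.
      X_raw = np.array([[2.0], [5.0], [9.0]])
      # Prepend the constant feature x0 = 1 as the first column.
      X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

      theta = np.array([1.0, 0.5])    # [theta_0, theta_1], made-up parameters
      y_hat = X.dot(theta)            # prediction theta . x for every example
      print(y_hat)                    # -> [2.  3.5  5.5]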

  5. Measuring error
     • Error or "residual": the gap between the observation y(j) and the prediction ŷ(x(j))
     [Plot: observation vs. prediction; the vertical distance between them is the residual]
     (c) Alexander Ihler

  6. Mean squared error
     • How can we quantify the error? Mean squared error: J(θ) = (1/m) Σj ( y(j) − ŷ(x(j)) )²
     • Could choose something else, of course…
       – Computationally convenient (more later)
       – Measures the variance of the residuals
       – Corresponds to likelihood under a Gaussian model of the "noise"
     (c) Alexander Ihler

  7. MSE cost function
     • Rewrite using matrix form: J(θ) = (1/m) (y − X θᵀ)ᵀ (y − X θᵀ)
     # Python / NumPy:
     e = Y - X.dot( theta.T )
     J = e.T.dot( e ) / m       # = np.mean( e ** 2 )
     (c) Alexander Ihler
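  A runnable version of the snippet above on a tiny made-up data set, with Y and theta stored as 1-D NumPy arrays so the expressions work as written:

      import numpy as np

      # Made-up data: m = 4 examples, constant feature x0 = 1 plus one real feature.
      X = np.array([[1.0, 1.0],
                    [1.0, 2.0],
                    [1.0, 3.0],
                    [1.0, 4.0]])
      Y = np.array([2.1, 3.9, 6.2, 8.1])
      m = X.shape[0]
      theta = np.array([0.0, 2.0])      # [theta_0, theta_1], made-up values

      e = Y - X.dot(theta.T)            # residuals, one per example
      J = e.T.dot(e) / m                # mean squared error
      print(J, np.mean(e ** 2))         # the two expressions agree (0.0175)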

  8. Visualizing the cost function J(θ)
     [Figure: surface / contour plots of J(θ) over (θ0, θ1)]
     (c) Alexander Ihler
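  One way to reproduce this kind of picture (a sketch assuming matplotlib is available; the data are made up):

      import numpy as np
      import matplotlib.pyplot as plt

      # Made-up 1-feature data, constant feature already prepended.
      X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
      Y = np.array([2.1, 3.9, 6.2, 8.1])

      # Evaluate J on a grid of (theta0, theta1) values.
      t0, t1 = np.meshgrid(np.linspace(-1, 3, 100), np.linspace(-1, 3, 100))
      J = np.zeros_like(t0)
      for i in range(t0.shape[0]):
          for j in range(t0.shape[1]):
              e = Y - (t0[i, j] * X[:, 0] + t1[i, j] * X[:, 1])
              J[i, j] = np.mean(e ** 2)

      plt.contour(t0, t1, J, levels=30)     # contour plot of the cost surface
      plt.xlabel("theta0"); plt.ylabel("theta1")
      plt.show()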

  9. Finding good parameters
     • Want to find parameters θ which minimize our error…
     • Think of a cost "surface": error residual for that θ…
     (c) Alexander Ihler

  10. Machine Learning and Data Mining: Linear regression – gradient descent & stochastic gradient descent. Kalev Kask

  11. Gradient descent
     • How to change θ to improve J(θ)?
     • Choose a direction in which J(θ) is decreasing
     (c) Alexander Ihler

  12. Gradient descent
     • How to change θ to improve J(θ)?
     • Choose a direction in which J(θ) is decreasing
     • Derivative:
       – Positive => increasing
       – Negative => decreasing
     (c) Alexander Ihler

  13. Gradient descent in more dimensions
     • Gradient vector: ∇θ J(θ) = [ ∂J/∂θ0, ∂J/∂θ1, … ]
     • Indicates the direction of steepest ascent (its negative is the direction of steepest descent)
     (c) Alexander Ihler

  14. Gradient descent
     • Initialization
     • Step size – can change as a function of iteration
     • Gradient direction
     • Stopping condition
     Pseudocode:
       Initialize θ
       Do {
         θ ← θ − α ∇θ J(θ)
       } while ( α ||∇J|| > ε )
     (c) Alexander Ihler
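  A minimal NumPy sketch of this loop for the MSE cost; the data set, step size α, and tolerance ε below are made-up illustrative values, not choices from the slides:

      import numpy as np

      # Made-up data: constant feature x0 = 1 plus one real feature.
      X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
      Y = np.array([2.1, 3.9, 6.2, 8.1])
      m = X.shape[0]

      theta = np.zeros(X.shape[1])        # initialize theta
      alpha, eps = 0.05, 1e-8             # step size and stopping tolerance

      while True:
          e = Y - X.dot(theta.T)          # residuals for the current theta
          DJ = -e.dot(X) * 2.0 / m        # gradient of the MSE
          theta = theta - alpha * DJ      # gradient step
          if alpha * np.linalg.norm(DJ) < eps:   # stop when alpha * ||grad J|| < eps
              break

      print(theta)                        # approaches the least-squares fit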

  15. Gradient for the MSE
     • MSE: J(θ) = (1/m) Σj ( y(j) − θ·x(j) )²
     • ∇θ J = ?
     (c) Alexander Ihler

  16. Gradient for the MSE
     • MSE: J(θ) = (1/m) Σj ( y(j) − θ·x(j) )²
     • ∇θ J = −(2/m) Σj ( y(j) − θ·x(j) ) x(j)
     (c) Alexander Ihler

  17. Gradient descent
     • Initialization
     • Step size – can change as a function of iteration
     • Gradient direction
     • Stopping condition
     Pseudocode:
       Initialize θ
       Do {
         θ ← θ − α ∇θ J(θ)
       } while ( α ||∇J|| > ε )
     In the MSE gradient, each term combines the sensitivity to each θi (the feature value x(j)i) with the error magnitude & direction for datum j (the residual).
     (c) Alexander Ihler

  18. Derivative of MSE
     • ∇θ J = −(2/m) Σj ( y(j) − θ·x(j) ) x(j)
       – x(j): sensitivity to each θi;  ( y(j) − θ·x(j) ): error magnitude & direction for datum j
     • Rewrite using matrix form:
     e = Y - X.dot( theta.T )      # error residual
     DJ = - e.dot(X) * 2.0/m       # compute the gradient
     theta -= alpha * DJ           # take a step
     (c) Alexander Ihler
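  As a sanity check (not on the slides), the analytic gradient can be compared to a finite-difference approximation on made-up data:

      import numpy as np

      X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
      Y = np.array([2.1, 3.9, 6.2, 8.1])
      m = X.shape[0]

      def J(theta):
          e = Y - X.dot(theta)
          return e.dot(e) / m                     # MSE for this theta

      theta = np.array([0.5, 1.5])                # made-up evaluation point
      e = Y - X.dot(theta)
      DJ = -e.dot(X) * 2.0 / m                    # analytic gradient from the slide

      # Central finite-difference estimate of each partial derivative.
      h = 1e-6
      DJ_num = np.zeros_like(theta)
      for i in range(len(theta)):
          d = np.zeros_like(theta); d[i] = h
          DJ_num[i] = (J(theta + d) - J(theta - d)) / (2 * h)

      print(DJ, DJ_num)                           # should agree to ~6 decimal places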

  19. Gradient descent on the cost function
     [Figures: the descent path on the contours of J(θ), and the corresponding line fit to the data]
     (c) Alexander Ihler

  20. Comments on gradient descent
     • Very general algorithm – we'll see it many times
     • Local minima
       – Sensitive to starting point
     (c) Alexander Ihler

  21. Comments on gradient descent
     • Very general algorithm – we'll see it many times
     • Local minima
       – Sensitive to starting point
     • Step size
       – Too large? Too small? Automatic ways to choose?
       – May want step size to decrease with iteration
       – Common choices:
         • Fixed
         • Linear: C/(iteration)
         • Line search / backoff (Armijo, etc.)
         • Newton's method
     (c) Alexander Ihler

  22. Newton's method
     • Want to find the roots of f(x)
       – "Root": a value of x for which f(x) = 0
     • Initialize to some point x
     • Compute the tangent at x & compute where it crosses the x-axis
     • Optimization: find roots of ∇J(θ)
       – ("Step size" = 1/∇²J; inverse curvature)
       – (Multivariate: ∇J(θ) = gradient vector, ∇²J(θ) = matrix of 2nd derivatives; a/b = a b⁻¹, matrix inverse)
     • Properties:
       – Does not always converge; sometimes unstable
       – If it converges, usually very fast
       – Works well for smooth, non-pathological functions that are locally quadratic
       – For large n, may be computationally hard: O(n²) storage, O(n³) time
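  For the MSE the Hessian is constant, ∇²J = (2/m) XᵀX, so a single Newton step from any starting point lands on the minimum. A minimal sketch on made-up data (not code from the slides):

      import numpy as np

      X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
      Y = np.array([2.1, 3.9, 6.2, 8.1])
      m = X.shape[0]

      theta = np.zeros(X.shape[1])                # starting point
      e = Y - X.dot(theta)
      grad = -e.dot(X) * 2.0 / m                  # gradient of J at theta
      H = 2.0 / m * X.T.dot(X)                    # Hessian of J (constant for the MSE)

      theta = theta - np.linalg.solve(H, grad)    # Newton step: theta - H^{-1} grad
      print(theta)                                # the exact least-squares solution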

  23. Stochastic / online gradient descent
     • MSE: J(θ) = (1/m) Σj J_j(θ), where J_j(θ) = ( y(j) − θ·x(j) )²
     • Gradient: ∇J(θ) = (1/m) Σj ∇J_j(θ)
     • Stochastic (or "online") gradient descent:
       – Use updates based on an individual datum j, chosen at random
       – At an optimum, the full gradient (the average over the data) is zero
     (c) Alexander Ihler

  24. Online gradient descent
     • Update based on each datum at a time
       – Find the residual and the gradient of its part of the error & update
     Pseudocode:
       Initialize θ
       Do {
         for j = 1:m
           θ ← θ − α ∇θ J_j(θ)
       } while ( not done )
     [Figures: the parameter path on the contours of J(θ), and the resulting fit to the data]
     (c) Alexander Ihler

  25.–29. Online gradient descent (repeats of the previous slide, stepping through successive updates on the cost contours and the corresponding data fits) (c) Alexander Ihler

  30. Online gradient descent
     Pseudocode:
       Initialize θ
       Do {
         for j = 1:m
           θ ← θ − α ∇θ J_j(θ)
       } while ( not converged )
     • Benefits
       – Lots of data = many more updates per pass
       – Computationally faster
     • Drawbacks
       – No longer strictly "descent"
       – Stopping conditions may be harder to evaluate (can use "running estimates" of J(.), etc.)
     • Related: mini-batch updates, etc.
     (c) Alexander Ihler
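  A minimal sketch of the online loop for the MSE; the data, step size, and number of passes are illustrative choices (a fixed pass count stands in for the "not converged" test):

      import numpy as np

      X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
      Y = np.array([2.1, 3.9, 6.2, 8.1])
      m = X.shape[0]

      theta = np.zeros(X.shape[1])
      alpha = 0.02
      rng = np.random.default_rng(0)

      for sweep in range(1000):                   # fixed number of passes for simplicity
          for j in rng.permutation(m):            # visit the data in random order
              e_j = Y[j] - X[j].dot(theta)        # residual for datum j only
              DJ_j = -2.0 * e_j * X[j]            # gradient of J_j (per-datum squared error)
              theta = theta - alpha * DJ_j        # update using this single datum

      print(theta)                                # hovers close to the batch least-squares fit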

  31. Machine Learning and Data Mining: Linear regression – direct minimization. Kalev Kask

  32. MSE Minimum
     • Consider a simple problem
       – One feature, two data points
       – Two unknowns: θ0, θ1
       – Two equations:  y(1) = θ0 + θ1 x(1),  y(2) = θ0 + θ1 x(2)
     • Can solve this system directly
     • However, most of the time m > n
       – There may be no linear function that hits all the data exactly
       – Instead, solve directly for the minimum of the MSE function
     (c) Alexander Ihler
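  For the exactly determined case (as many data points as unknowns), the system can be solved directly; a sketch with two made-up data points:

      import numpy as np

      # Two data points, one feature plus the constant feature x0 = 1, so X is square.
      X = np.array([[1.0, 2.0],
                    [1.0, 5.0]])
      Y = np.array([3.0, 9.0])

      theta = np.linalg.solve(X, Y)    # solves X theta = Y exactly
      print(theta)                     # [-1.  2.]: the line y = -1 + 2x hits both points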

  33. MSE Minimum
     • Reordering the normal equations gives θ = yᵀ X (XᵀX)⁻¹
     • X (XᵀX)⁻¹ is called the "pseudo-inverse"
     • If Xᵀ is square and independent, this is the ordinary inverse
     • If m > n: overdetermined; gives the minimum-MSE fit
     (c) Alexander Ihler

  34. Python MSE
     • This is easy to solve in Python / NumPy…
     # y = np.matrix( [[y1], … , [ym]] )
     # X = np.matrix( [[x1_0 … x1_n], [x2_0 … x2_n], … ] )
     # Solution 1: "manual"
     th = y.T * X * np.linalg.inv(X.T * X)
     # Solution 2: "least squares solve" (lstsq returns a tuple; the solution is its first element)
     th = np.linalg.lstsq(X, y)
     (c) Alexander Ihler
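  A runnable variant using plain ndarrays rather than np.matrix (the data are made up):

      import numpy as np

      X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
      y = np.array([2.1, 3.9, 6.2, 8.1])

      # "Manual" solution via the normal equations.
      th_manual = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

      # Least-squares solve; lstsq returns (solution, residuals, rank, singular values).
      th_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

      print(th_manual, th_lstsq)    # the two solutions agree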

  35. Normal equations
     • Normal equations: ∇J(θ) = 0, i.e. Xᵀ (y − X θᵀ) = 0
     • Interpretation:
       – (y − X θᵀ) = (y − ŷ) is the vector of errors in each example
       – X are the features we have to work with for each example
       – Dot product = 0: orthogonal
     (c) Alexander Ihler

  36. Normal equations
     • Normal equations: ∇J(θ) = 0, i.e. Xᵀ (y − X θᵀ) = 0
     • Interpretation:
       – (y − X θᵀ) = (y − ŷ) is the vector of errors in each example
       – X are the features we have to work with for each example
       – Dot product = 0: orthogonal
     • Example:
     (c) Alexander Ihler
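  A quick numerical check of this orthogonality property on made-up data:

      import numpy as np

      X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
      y = np.array([2.1, 3.9, 6.2, 8.1])

      theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimum-MSE parameters
      e = y - X.dot(theta)                            # residual vector (y - y_hat)

      print(e.dot(X))    # ~[0, 0]: residuals are orthogonal to every feature column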

  37. Effects of MSE choice
     • Sensitivity to outliers: a single outlying datum can incur a very large squared cost (e.g. 16² for that one datum), so the MSE places a heavy penalty on large errors
     [Figures: fitted lines with and without an outlier, and the quadratic cost curve]
     (c) Alexander Ihler
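  To see the sensitivity numerically, one can compare least-squares fits with and without a single corrupted target (a sketch on made-up data):

      import numpy as np

      x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
      y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])        # roughly y = 2x
      X = np.column_stack([np.ones_like(x), x])      # add the constant feature

      theta_clean, *_ = np.linalg.lstsq(X, y, rcond=None)

      y_out = y.copy()
      y_out[4] = -6.0                                 # one corrupted target value
      theta_out, *_ = np.linalg.lstsq(X, y_out, rcond=None)

      print(theta_clean)   # close to [0, 2]
      print(theta_out)     # pulled far from [0, 2]; the slope even changes sign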
