

SLIDE 1

Machine Learning and Data Mining Linear regression

Kalev Kask


SLIDE 2

Supervised learning

  • Notation

– Features x
– Targets y
– Predictions ŷ
– Parameters θ

[Diagram: training data (examples, features) feed a Program ("Learner"), characterized by some "parameters" θ, a procedure (using θ) that outputs a prediction; a "cost function" scores performance against feedback / target values, and the learning algorithm changes θ to improve performance.]

SLIDE 3

Linear regression

  • Define form of function f(x) explicitly
  • Find a good f(x) within that family

(c) Alexander Ihler

[Plot: feature x vs. target y with a fitted line. "Predictor": evaluate the line, r = θ1·x + θ0, and return r.]

SLIDE 4

Notation

(c) Alexander Ihler

Define “feature” x0 = 1 (constant). Then f(x) = θ0·x0 + θ1·x1 + … + θn·xn = x·θᵀ
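A minimal NumPy sketch of this predictor, assuming the deck's conventions (constant feature x0 = 1, parameters θ as a row vector); the numeric values are made up for illustration:

    import numpy as np

    theta = np.array([[1.5, 0.8]])       # parameters [theta_0, theta_1], shape (1, 2)
    X = np.array([[1.0, 10.0],           # each row is one example: [x0 = 1, x1]
                  [1.0, 20.0],
                  [1.0, 30.0]])

    def predict(X, theta):
        """Linear predictor f(x) = theta_0*x0 + theta_1*x1 + ..., i.e. X θᵀ."""
        return X.dot(theta.T)            # shape (m, 1): one prediction per example

    yhat = predict(X, theta)             # [[9.5], [17.5], [25.5]]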

SLIDE 5

Measuring error

(c) Alexander Ihler

[Plot: an observation y and the prediction at that x; the vertical gap between them is the error or "residual".]

SLIDE 6

Mean squared error

  • How can we quantify the error?
  • Could choose something else, of course…

– Computationally convenient (more later)
– Measures the variance of the residuals
– Corresponds to likelihood under Gaussian model of “noise”

(c) Alexander Ihler

SLIDE 7

MSE cost function

  • Rewrite using matrix form

(c) Alexander Ihler

# Python / NumPy:
e = Y - X.dot( theta.T )        # residuals, shape (m,1)
J = e.T.dot( e ) / m            # MSE;  = np.mean( e ** 2 )
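For concreteness, a small self-contained check of this cost computation; the data and parameter values below are made up for illustration:

    import numpy as np

    X = np.array([[1.0, 10.0], [1.0, 20.0], [1.0, 30.0]])   # constant feature + one raw feature
    Y = np.array([[11.0], [19.0], [32.0]])                   # targets as an (m,1) column
    theta = np.array([[1.5, 1.0]])                           # candidate parameters
    m = X.shape[0]

    e = Y - X.dot(theta.T)     # residuals: [[-0.5], [-2.5], [0.5]]
    J = np.mean(e ** 2)        # MSE = 2.25; same value as e.T.dot(e) / m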

SLIDE 8

Visualizing the cost function

(c) Alexander Ihler

[Figure: the data with several candidate fit lines, and the corresponding MSE cost surface J(θ) over the parameters.]

SLIDE 9

Finding good parameters

  • Want to find parameters which minimize our error…
  • Think of a cost “surface”: error residual for that θ…

(c) Alexander Ihler

SLIDE 10

Machine Learning and Data Mining Linear regression: Gradient descent & stochastic gradient descent

Kalev Kask


SLIDE 11

Gradient descent

(c) Alexander Ihler

  • How to change θ to improve J(θ)?
  • Choose a direction in which J(θ) is decreasing

SLIDE 12

Gradient descent

(c) Alexander Ihler

  • How to change θ to improve J(θ)?
  • Choose a direction in which J(θ) is decreasing
  • Derivative
  • Positive => increasing
  • Negative => decreasing
SLIDE 13

Gradient descent in more dimensions

(c) Alexander Ihler

  • Gradient vector: the vector of partial derivatives [∂J/∂θ0, ∂J/∂θ1, …]
  • Indicates direction of steepest ascent (negative gradient = steepest descent)

SLIDE 14

Gradient descent

  • Initialization
  • Step size

– Can change as a function of iteration

  • Gradient direction
  • Stopping condition

(c) Alexander Ihler

Initialize θ
Do {
    θ ← θ - α ∇θ J(θ)
} while (α ||∇J|| > ε)

SLIDE 15

Gradient for the MSE

  • MSE
  • ∇ J = ?

(c) Alexander Ihler

SLIDE 16

Gradient for the MSE

  • MSE
  • ∇ J = ?

(c) Alexander Ihler
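The gradient itself appears on the slide as an image; for reference, the expression that the NumPy code on SLIDE 18 implements (writing x(j) for the feature row of example j) is

    ∇θ J(θ) = -(2/m) Σj ( y(j) - x(j)·θᵀ ) x(j)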

SLIDE 17

Gradient descent

  • Initialization
  • Step size

– Can change as a function of iteration

  • Gradient direction
  • Stopping condition

(c) Alexander Ihler

[Gradient formula annotated: one factor is the error magnitude & direction for datum j; the other is the sensitivity to each θi.]

Initialize θ
Do {
    θ ← θ - α ∇θ J(θ)
} while (α ||∇J|| > ε)

SLIDE 18

Derivative of MSE

  • Rewrite using matrix form

(c) Alexander Ihler

[Gradient formula annotated: one factor is the error magnitude & direction for datum j; the other is the sensitivity to each θi.]

e  = Y - X.dot( theta.T )       # error residual, shape (m,1)
DJ = - e.T.dot(X) * 2.0/m       # compute the gradient, shape (1,n)
theta -= alpha * DJ             # take a step
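Putting the pieces together, a minimal batch gradient-descent loop built around this update; the step size, tolerance, and synthetic data are illustrative assumptions, not values from the slides:

    import numpy as np

    # Synthetic data (illustrative): y ≈ 1.5 + 0.8*x1 plus noise
    rng = np.random.default_rng(0)
    m = 50
    x1 = rng.uniform(0, 10, size=(m, 1))
    X = np.hstack([np.ones((m, 1)), x1])              # constant feature x0 = 1
    Y = 1.5 + 0.8 * x1 + rng.normal(0, 0.5, (m, 1))   # targets, shape (m,1)

    theta = np.zeros((1, 2))     # initialize θ
    alpha, eps = 0.01, 1e-6      # step size and stopping tolerance (assumed values)

    while True:
        e  = Y - X.dot(theta.T)          # residuals
        DJ = -e.T.dot(X) * 2.0 / m       # gradient of the MSE
        theta -= alpha * DJ              # take a step
        if alpha * np.linalg.norm(DJ) < eps:   # stopping condition from the pseudocode above
            break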

SLIDE 19

Gradient descent on cost function

(c) Alexander Ihler

[Figure: the gradient-descent path on the MSE cost surface J(θ), with the corresponding sequence of fit lines on the data.]

SLIDE 20

Comments on gradient descent

  • Very general algorithm

– we’ll see it many times

  • Local minima

– Sensitive to starting point

(c) Alexander Ihler

SLIDE 21

Comments on gradient descent

  • Very general algorithm

– we’ll see it many times

  • Local minima

– Sensitive to starting point

  • Step size

– Too large? Too small? Automatic ways to choose?
– May want step size to decrease with iteration
– Common choices:

  • Fixed
  • Linear: C/(iteration)
  • Line search / backoff (Armijo, etc.)
  • Newton’s method

(c) Alexander Ihler

SLIDE 22

Newton’s method

  • Want to find the roots of f(x)

– “Root”: value of x for which f(x)=0

  • Initialize to some point x
  • Compute the tangent at x & compute where it crosses x-axis
  • Optimization: find roots of ∇J(q)

– Does not always converge; sometimes unstable
– If it converges, usually very fast
– Works well for smooth, non-pathological functions, locally quadratic
– For n large, may be computationally hard: O(n²) storage, O(n³) time

(Multivariate: ∇J(θ) = gradient vector; ∇²J(θ) = matrix of 2nd derivatives; a/b = a·b⁻¹, matrix inverse.)  (“Step size” α = 1/∇²J: inverse curvature.)
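A minimal sketch of the resulting Newton update, θ ← θ - (∇²J)⁻¹ ∇J; the quadratic test function and tolerances below are illustrative assumptions, not part of the slides:

    import numpy as np

    def newton_minimize(grad, hess, theta0, tol=1e-8, max_iter=50):
        """Newton's method for optimization: find a root of the gradient."""
        theta = np.array(theta0, dtype=float)
        for _ in range(max_iter):
            step = np.linalg.solve(hess(theta), grad(theta))   # (∇²J)⁻¹ ∇J without forming the inverse
            theta = theta - step
            if np.linalg.norm(step) < tol:
                break
        return theta

    # Illustrative quadratic J(θ) = (θ - c)ᵀ A (θ - c): Newton converges in one step.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    c = np.array([1.0, -2.0])
    grad = lambda th: 2.0 * A.dot(th - c)
    hess = lambda th: 2.0 * A
    print(newton_minimize(grad, hess, [0.0, 0.0]))   # ≈ [ 1., -2.]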

SLIDE 23
  • MSE
  • Gradient
  • Stochastic (or “online”) gradient descent:

– Use updates based on an individual datum j, chosen at random
– At optima, the gradient averaged over the data is zero (individual per-datum gradients need not be)

(c) Alexander Ihler

Stochastic / Online gradient descent

SLIDE 24

Online gradient descent

[Figure: online gradient-descent updates shown on the data fit and on the cost surface J(θ).]

  • Update based on each datum at a time

– Find residual and the gradient of its part of the error & update

(c) Alexander Ihler


Initialize θ
Do {
    for j = 1:m
        θ ← θ - α ∇θ Jj(θ)
} while (not done)
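A minimal stochastic / online gradient-descent sketch following this pseudocode; the learning rate, number of passes, and random shuffling are illustrative assumptions:

    import numpy as np

    def sgd_linear_mse(X, Y, alpha=0.005, epochs=20, seed=0):
        """Online gradient descent for linear regression with MSE.
        X: (m,n) features (first column = 1), Y: (m,1) targets."""
        rng = np.random.default_rng(seed)
        m, n = X.shape
        theta = np.zeros((1, n))
        for _ in range(epochs):
            for j in rng.permutation(m):           # visit the data in random order
                e_j  = Y[j] - X[j].dot(theta.T)    # residual for datum j
                DJ_j = -2.0 * e_j * X[j]           # gradient of J_j(θ)
                theta -= alpha * DJ_j              # update using this single datum
        return theta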

SLIDE 25

Online gradient descent (continued)

(c) Alexander Ihler

SLIDE 26

Online gradient descent (continued)

(c) Alexander Ihler

SLIDE 27

Online gradient descent (continued)

(c) Alexander Ihler

SLIDE 28

Online gradient descent (continued)

(c) Alexander Ihler

SLIDE 29

Online gradient descent (continued)

(c) Alexander Ihler

SLIDE 30
  • Benefits

– Lots of data = many more updates per pass
– Computationally faster

  • Drawbacks

– No longer strictly “descent”
– Stopping conditions may be harder to evaluate (can use “running estimates” of J(.), etc.)

  • Related: mini-batch updates, etc. (see the sketch after the pseudocode below)

(c) Alexander Ihler

Online gradient descent

Initialize θ
Do {
    for j = 1:m
        θ ← θ - α ∇θ Jj(θ)
} while (not converged)
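For reference, a minimal mini-batch variant of the same update; the batch size, learning rate, and epoch count are illustrative assumptions, not values from the slides:

    import numpy as np

    def minibatch_gd(X, Y, alpha=0.01, batch_size=8, epochs=50, seed=0):
        """Mini-batch gradient descent for linear regression with MSE.
        X: (m,n) features (first column = 1), Y: (m,1) targets."""
        rng = np.random.default_rng(seed)
        m, n = X.shape
        theta = np.zeros((1, n))
        for _ in range(epochs):
            order = rng.permutation(m)
            for start in range(0, m, batch_size):
                idx = order[start:start + batch_size]
                e  = Y[idx] - X[idx].dot(theta.T)        # residuals on this batch
                DJ = -2.0 / len(idx) * e.T.dot(X[idx])   # average gradient over the batch
                theta -= alpha * DJ
        return theta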

SLIDE 31

Machine Learning and Data Mining Linear regression: direct minimization

Kalev Kask


SLIDE 32

MSE Minimum

  • Consider a simple problem

– One feature, two data points
– Two unknowns: θ0, θ1
– Two equations:

(c) Alexander Ihler

  • Can solve this system directly:
  • However, most of the time, m > n

– There may be no linear function that hits all the data exactly
– Instead, solve directly for minimum of MSE function

SLIDE 33

MSE Minimum

  • Reordering, we have the normal equations: θ = yᵀ X (Xᵀ X)⁻¹

(c) Alexander Ihler

  • X (XᵀX)⁻¹ is called the “pseudo-inverse”
  • If Xᵀ is square and independent (full rank), this is the inverse
  • If m > n: overdetermined; gives minimum MSE fit
SLIDE 34

Python MSE

  • This is easy to solve in Python / NumPy…

(c) Alexander Ihler

# y = np.matrix( [[y1], … , [ym]] )                       # targets as an (m,1) column
# X = np.matrix( [[x1_0 … x1_n], [x2_0 … x2_n], …] )      # one row of features per example

# Solution 1: “manual”
th = y.T * X * np.linalg.inv(X.T * X)

# Solution 2: “least squares solve”
th = np.linalg.lstsq(X, y, rcond=None)[0]     # note: returns θ as a column (n,1)
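A quick self-contained check that the two solutions agree on synthetic data (np.matrix is kept only to mirror the slide; the data are made up):

    import numpy as np

    rng = np.random.default_rng(1)
    m = 20
    x1 = rng.uniform(0, 10, size=(m, 1))
    X = np.matrix(np.hstack([np.ones((m, 1)), x1]))              # [1, x1] per example
    y = np.matrix(1.5 + 0.8 * x1 + rng.normal(0, 0.3, (m, 1)))   # (m,1) targets

    th1 = y.T * X * np.linalg.inv(X.T * X)                               # row vector (1,2)
    th2 = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)[0]   # column vector (2,1)
    print(np.allclose(th1, th2.T))                                       # True: same parameters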

SLIDE 35

Normal equations

  • Interpretation:

– (y - θX) = (y - ŷ) is the vector of errors in each example
– X are the features we have to work with for each example
– Dot product = 0: orthogonal

(c) Alexander Ihler

SLIDE 36

Normal equations

  • Interpretation:

– (y - θX) = (y - ŷ) is the vector of errors in each example
– X are the features we have to work with for each example
– Dot product = 0: orthogonal

  • Example:

(c) Alexander Ihler
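A small numeric illustration of this orthogonality (synthetic data, assumed purely for illustration): at the MSE minimum, the residual vector is orthogonal to every feature column.

    import numpy as np

    rng = np.random.default_rng(2)
    m = 30
    x1 = rng.uniform(0, 10, size=(m, 1))
    X = np.hstack([np.ones((m, 1)), x1])
    y = 1.5 + 0.8 * x1 + rng.normal(0, 0.3, (m, 1))

    theta_T = np.linalg.solve(X.T.dot(X), X.T.dot(y))   # normal equations: (XᵀX) θᵀ = Xᵀ y
    residual = y - X.dot(theta_T)                       # y - ŷ at the optimum
    print(X.T.dot(residual))                            # ≈ [[0.], [0.]]: orthogonal to each feature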

SLIDE 37

Effects of MSE choice

  • Sensitivity to outliers

(c) Alexander Ihler

[Plot: a fit with one outlying datum; that single datum contributes a 16² cost, a heavy penalty for large errors.]

SLIDE 38

L1 error

(c) Alexander Ihler

[Plots comparing fits: L2 (MSE) on the original data; L1 on the original data; L1 on the outlier data.]

SLIDE 39

Cost functions for regression

(c) Alexander Ihler

“Arbitrary” functions can’t be solved in closed form…

  • use gradient descent

[Cost functions shown: mean squared error (MSE), mean absolute error (MAE), or something else entirely… (???)]
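For illustration, the two named costs written side by side as code (a sketch; array shapes follow the earlier slides):

    import numpy as np

    def mse(theta, X, Y):
        e = Y - X.dot(theta.T)
        return np.mean(e ** 2)       # mean squared error: heavy penalty for large residuals

    def mae(theta, X, Y):
        e = Y - X.dot(theta.T)
        return np.mean(np.abs(e))    # mean absolute error: less sensitive to outliers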

SLIDE 40

Machine Learning and Data Mining Linear regression: nonlinear features

Kalev Kask


SLIDE 41

More dimensions?

(c) Alexander Ihler

[3-D scatter plots: two features x1, x2 vs. target y.]

SLIDE 42

Nonlinear functions

  • What if our hypotheses are not lines?

– Ex: higher-order polynomials

(c) Alexander Ihler

[Plots: the data fit with an order-1 polynomial and with an order-3 polynomial.]

SLIDE 43

Nonlinear functions

  • Single feature x, predict target y:
  • Sometimes useful to think of “feature transform”

(c) Alexander Ihler

Add features: Linear regression in new features
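A minimal sketch of such a feature transform followed by ordinary linear regression; the polynomial degree and synthetic data are illustrative assumptions:

    import numpy as np

    def poly_features(x, degree):
        """Map a single raw feature x (shape (m,1)) to [1, x, x^2, ..., x^degree]."""
        return np.hstack([x ** k for k in range(degree + 1)])

    rng = np.random.default_rng(3)
    m = 40
    x = rng.uniform(0, 2, size=(m, 1))
    y = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(0, 0.1, (m, 1))

    Phi = poly_features(x, degree=3)                     # new features Φ(x)
    theta_T = np.linalg.lstsq(Phi, y, rcond=None)[0]     # linear regression in the new features
    yhat = Phi.dot(theta_T)                              # nonlinear in x, but linear in θ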

SLIDE 44

Higher-order polynomials

  • Fit in the same way
  • More “features”

[Plots: the same data fit with order-1, order-2, and order-3 polynomials.]

SLIDE 45

Features

  • In general, can use any features we think are useful
  • Other information about the problem

– Sq. footage, location, age, …

  • Polynomial functions

– Features [1, x, x², x³, …]

  • Other functions

– 1/x, sqrt(x), x1 * x2, …

  • “Linear regression” = linear in the parameters

– Features we can make as complex as we want!

(c) Alexander Ihler

SLIDE 46

Higher-order polynomials

  • Are more features better?
  • “Nested” hypotheses

– 2nd order is more general than 1st,
– 3rd order is more general than 2nd, …

  • Fits the observed data better
SLIDE 47

Overfitting and complexity

  • More complex models will always fit the training data better
  • But they may “overfit” the training data, learning complex relationships that are not really present

(c) Alexander Ihler

[Plots: the same X-Y data fit by a complex model and by a simple model.]

SLIDE 48

Test data

  • After training the model
  • Go out and get more data from the world

– New observations (x,y)

  • How well does our model perform?

(c) Alexander Ihler

SLIDE 49

[Plot: training data.]

Training versus test error

  • Plot MSE as a function of model complexity

– Polynomial order

  • Training error decreases

– More complex function fits training data better

  • What about new data?

(c) Alexander Ihler

[Plot: mean squared error vs. polynomial order, for the training data and for new, “test” data.]

  • 0th to 1st order

– Error decreases
– Underfitting

  • Higher order

– Error increases
– Overfitting

SLIDE 50

Machine Learning and Data Mining Linear regression: bias and variance

Kalev Kask


SLIDE 51

Inductive bias

  • The assumptions needed to predict examples we haven’t seen
  • Makes us “prefer” one model over another
  • Polynomial functions; smooth functions; etc
  • Some bias is necessary for learning!

[Plots: the same X-Y data fit by a complex model and by a simple model.]

SLIDE 52

Bias & variance

(c) Alexander Ihler

Data we observe

“The world”

Three different possible data sets:

SLIDE 53

Bias & variance

(c) Alexander Ihler

Data we observe

“The world”

Three different possible data sets: Each would give different predictors for any polynomial degree:

SLIDE 54

Detecting overfitting

  • Overfitting effect

– Do better on training data than on future data
– Need to choose the “right” complexity

  • One solution: “Hold-out” data
  • Separate our data into two sets

– Training
– Test

  • Learn only on training data
  • Use test data to estimate generalization quality

– Model selection

  • All good competitions use this formulation

– Often multiple splits: one by judges, then another by you

(c) Alexander Ihler

SLIDE 55

What to do about under/overfitting?

  • Ways to increase complexity?

– Add features (e.g. higher polynomial), parameters
– We’ll see more…

  • Ways to decrease complexity?

– Remove features (“feature selection”) (e.g. lower polynomial)
– “Fail to fully memorize data”

  • Partial training
  • Regularization

(c) Alexander Ihler

[Plot: predictive error vs. model complexity. Error on training data keeps decreasing, while error on test data has a minimum; the ideal range for model complexity lies between the underfitting and overfitting regimes.]

SLIDE 56

Machine Learning and Data Mining Linear regression: regularization

Kalev Kask


SLIDE 57

Linear regression

  • Linear model, two data
  • Quadratic model, two data?

– Infinitely many settings with zero error
– How to choose among them?

  • Higher order coefficients = 0?

– Uses knowledge of where features came from…

  • Could choose e.g. minimum magnitude:
  • A type of bias: tells us which models to prefer

(c) Alexander Ihler

SLIDE 58

Regularization

  • Can modify our cost function J to add “preference” for certain parameter values

  • New solution (derive the same way)

– Problem is now well-posed for any degree

  • Notes:

– “Shrinks” the parameters toward zero
– Alpha large: we prefer small theta to small MSE
– Regularization term is independent of the data: paying more attention to it reduces our model variance

(c) Alexander Ihler

L2 penalty: “Ridge regression”
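A minimal sketch of the ridge-regression solution described above; here α multiplies ‖θ‖² against the sum of squared errors (scaling conventions vary), and the α values and data are illustrative assumptions:

    import numpy as np

    def ridge_fit(X, y, alpha):
        """Minimize sum of squared errors + alpha * ||theta||^2; returns θ as a row (1,n)."""
        n = X.shape[1]
        # Regularized normal equations: (XᵀX + alpha·I) θᵀ = Xᵀ y  (well-posed for alpha > 0)
        theta_T = np.linalg.solve(X.T.dot(X) + alpha * np.eye(n), X.T.dot(y))
        return theta_T.T

    # Illustrative use: alpha = 0 recovers the unregularized least-squares fit.
    rng = np.random.default_rng(4)
    x = rng.uniform(0, 2, size=(30, 1))
    y = 1.0 + 0.5 * x + rng.normal(0, 0.1, (30, 1))
    X = np.hstack([np.ones((30, 1)), x])
    print(ridge_fit(X, y, alpha=0.0))   # ≈ unregularized θ
    print(ridge_fit(X, y, alpha=5.0))   # parameters shrink toward zero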

SLIDE 59

Regularization

  • Compare between unreg. & reg. results

(c) Alexander Ihler

[Plots: fitted curves for α = 0 (unregularized) and for α = 1.]

SLIDE 60

Different regularization functions

  • More generally, for the Lp regularizer:

(c) Alexander Ihler

p = 2: quadratic; p = 1: “Lasso”
L0 = limit as p → 0: “number of nonzero weights”, a natural notion of complexity
L∞ = limit as p → ∞: “maximum parameter value”

Isosurfaces: ||θ||p = constant (shown for p = 0.5, 1, 2, 4)

SLIDE 61

Regularization: L1 vs L2

  • Estimate balances data term & regularization term

(c) Alexander Ihler

[Figure: one point minimizes the data term, another minimizes the regularization term; the estimate minimizes the combination.]

SLIDE 62

Regularization: L1 vs L2

  • Estimate balances data term & regularization term
  • Lasso tends to generate sparser solutions than a quadratic regularizer.

(c) Alexander Ihler

Data term only: all θi non-zero
Regularized estimate: some θi may be zero

SLIDE 63

Machine Learning and Data Mining Linear regression: hold-out, cross-validation

Kalev Kask


SLIDE 64

Model selection

  • Which of these models fits the data best?

– p=0 (constant); p=1 (linear); p=3 (cubic); …

  • Or, should we use KNN? Other methods?
  • Model selection problem

– Can’t use training data to decide (esp. if models are nested!)

  • Want to estimate

(c) Alexander Ihler

[Plots of p = 0, p = 1, and p = 3 fits. Notation: J = loss function (MSE), D = training data set.]

SLIDE 65

Hold-out method

  • Validation data

– “Hold out” some data for evaluation (e.g., 70/30 split)
– Train only on the remainder

  • Some problems, if we have few data:

– Few data in hold-out: noisy estimate of the error
– More hold-out data leaves less for training!

(c) Alexander Ihler

[Table of example data (x(i), y(i)), split into training data and validation data. Validation MSE = 331.8.]
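A minimal hold-out evaluation sketch; the 70/30 split, the shuffling, and the synthetic data are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(5)
    m = 100
    x = rng.uniform(0, 10, size=(m, 1))
    y = 2.0 + 1.3 * x + rng.normal(0, 1.0, (m, 1))
    X = np.hstack([np.ones((m, 1)), x])

    # Shuffle, then hold out 30% of the data for validation
    idx = rng.permutation(m)
    split = int(0.7 * m)
    tr, va = idx[:split], idx[split:]

    theta_T = np.linalg.lstsq(X[tr], y[tr], rcond=None)[0]   # train only on the training split
    e_va = y[va] - X[va].dot(theta_T)
    print("validation MSE:", np.mean(e_va ** 2))             # estimate of generalization error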

SLIDE 66

Cross-validation method

  • K-fold cross-validation

– Divide data into K disjoint sets
– Hold out one set (= M / K data) for evaluation
– Train on the others (= M·(K-1) / K data)

(c) Alexander Ihler

[Table of example data (x(i), y(i)), shown three times with a different third held out for validation each time.]

Training data / Validation data
Split 1: MSE = 331.8
Split 2: MSE = 361.2
Split 3: MSE = 669.8

3-Fold X-Val MSE = 464.1
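A minimal K-fold cross-validation sketch following this recipe; the value of K and the data are whatever the caller supplies (illustrative, not from the slides):

    import numpy as np

    def kfold_mse(X, y, K=3, seed=0):
        """K-fold cross-validation estimate of MSE for linear regression."""
        rng = np.random.default_rng(seed)
        m = X.shape[0]
        folds = np.array_split(rng.permutation(m), K)   # K disjoint index sets
        scores = []
        for k in range(K):
            va = folds[k]                                            # hold out fold k (≈ M/K data)
            tr = np.hstack([folds[i] for i in range(K) if i != k])   # train on the rest
            theta_T = np.linalg.lstsq(X[tr], y[tr], rcond=None)[0]
            e = y[va] - X[va].dot(theta_T)
            scores.append(np.mean(e ** 2))                           # MSE on the held-out fold
        return np.mean(scores), scores                               # cross-validated MSE, per-fold MSEs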

SLIDE 67

Cross-validation method

  • K-fold cross-validation

– Divide data into K disjoint sets
– Hold out one set (= M / K data) for evaluation
– Train on the others (= M·(K-1) / K data)

(c) Alexander Ihler

[Table of example data (x(i), y(i)), shown three times with a different third held out for validation each time.]

Training data / Validation data
Split 1: MSE = 280.5
Split 2: MSE = 3081.3
Split 3: MSE = 1640.1

3-Fold X-Val MSE = 1667.3

SLIDE 68

Cross-validation

  • Advantages:

– Lets us use more (M) validation data (= less noisy estimate of test performance)

  • Disadvantages:

– More work

  • Trains K models instead of just one

– Doesn’t evaluate any particular predictor

  • Evaluates K different models & averages
  • Scores hyperparameters / procedure, not an actual, specific predictor!
  • Also: still estimating error for M’ < M data…

(c) Alexander Ihler

SLIDE 69

Learning curves

  • Plot performance as a function of training size

– Assess impact of fewer data on performance
  (Ex: MSE0 - MSE for regression, or 1 - Err for classification)
  • Few data

– More data significantly improve performance

  • “Enough” data

– Performance saturates

  • If slope is high, decreasing m (for validation / cross-validation) might have a big impact…

(c) Alexander Ihler

SLIDE 70

Leave-one-out cross-validation

  • When K=M (# of data), we get

– Train on all data except one
– Evaluate on the left-out data
– Repeat M times (each data point held out once) and average

(c) Alexander Ihler

[Table of example data (x(i), y(i)), with a single datum held out at a time. Training data / Validation data; MSE = …, MSE = …, LOO X-Val MSE = ….]

SLIDE 71

Cross-validation Issues

  • Need to balance:

– Computational burden (multiple trainings)
– Accuracy of estimated performance / error

  • Single hold-out set:

– Estimates performance with M’ < M data (important? learning curve?)
– Need enough data to trust performance estimate
– Estimates performance of a particular, trained learner

  • K-fold XVal

– K times as much work, computationally
– Better estimates, still of performance with M’ < M data

  • LOO XVal

– M times as much work, computationally
– M’ ≈ M, but overall error estimate may have high variance

(c) Alexander Ihler