10-601 Machine Learning: Regression
Outline
- Regression vs Classification
- Linear regression – another discriminative learning method
  – As optimization → gradient descent
  – As matrix inversion (Ordinary Least Squares)
- Overfitting and bias-variance
- Bias-variance decomposition for classification
What is regression?
Where we are
[Course map: Inputs → Classifier → predict category (done √); Inputs → Density Estimator → probability (done √); Inputs → Regressor → predict real # (today)]
Regression examples
Prediction of menu prices
Chahuneau, Gimpel, … and Smith, EMNLP 2012
A decision tree: classification
[Tree diagram with class labels at the leaves: Play / Play / Don't Play / Don't Play]
A regression tree
[Same tree, but each leaf holds the observed play times and predicts their mean:
 Play = {30m, 45m} → Play ≈ 37; Play = {0m, 0m, 15m} → Play ≈ 5; Play = {0m, 0m} → Play ≈ 0; Play = {20m, 30m, 45m} → Play ≈ 32]
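The rule at each leaf is just the mean of the training targets that land there. A minimal sketch of that leaf rule (the groupings mirror the slide; the names are mine):

```python
import numpy as np

# Observed play times (minutes) for the examples reaching each leaf.
leaves = {
    "leaf_1": [30, 45],       # -> Play ~= 37
    "leaf_2": [0, 0, 15],     # -> Play ~= 5
    "leaf_3": [0, 0],         # -> Play ~= 0
    "leaf_4": [20, 30, 45],   # -> Play ~= 32
}

# A regression tree predicts the mean of the targets in the leaf.
predictions = {name: np.mean(times) for name, times in leaves.items()}
print(predictions)  # {'leaf_1': 37.5, 'leaf_2': 5.0, 'leaf_3': 0.0, 'leaf_4': 31.67}
```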
Theme for the week: learning as optimization
Types of learners
- Two types of learners:
- 1. Generative: make assumptions about how the data is generated (given the class)
  – e.g., naïve Bayes
- 2. Discriminative: directly estimate a decision rule/boundary
  – e.g., logistic regression
Today: another discriminative learner, but for regression tasks
Regression: least mean squares (LMS) as optimization
Toy problem #2
Least Mean Squares
Linear regression
- Given an input x we would like to compute an output y
- For example:
  - Predict height from age
  - Predict Google's price from Yahoo's price
  - Predict distance from wall from sensors
[Scatter plot of Y vs. X]
Linear regression
- Given an input x we would like to compute an output y
- In linear regression we assume that y and x are related by the equation
  y = wx + ε
  where w is a parameter and ε represents measurement or other noise
[Plot: observed values (X, Y) scattered around the line we are trying to predict]
- Our goal is to estimate w from training data of ⟨xi, yi⟩ pairs
- Optimization goal: minimize the squared error (least squares):
- Why least squares?
  - minimizes the squared distance between measurements and the predicted line
  - has a nice probabilistic interpretation
  - the math is pretty
Linear regression
$$\arg\min_w \sum_i (y_i - w x_i)^2$$
[Plot: data (X, Y) around the fitted line y = wx + ε]
(probabilistic interpretation: see HW)
Solving linear regression
- To optimize:
- We just take the derivative w.r.t. w:
$$\frac{\partial}{\partial w}\sum_i (y_i - w x_i)^2 = \sum_i 2(-x_i)(y_i - w x_i)$$
(the terms $w x_i$ are the predictions; compare to logistic regression, where the gradient also has the form (observed − prediction) × input…)
Solving linear regression
- To optimize – closed form:
- We just take the derivative w.r.t. w and set it to 0:
$$\frac{\partial}{\partial w}\sum_i (y_i - w x_i)^2 = \sum_i 2(-x_i)(y_i - w x_i)$$
$$\Rightarrow\; 2\sum_i x_i(y_i - w x_i) = 0 \;\Leftrightarrow\; 2\sum_i x_i y_i - 2w\sum_i x_i x_i = 0$$
$$\Rightarrow\; \sum_i x_i y_i = w\sum_i x_i^2 \;\Rightarrow\; w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$
(this is covar(X,Y)/var(X) if mean(X) = mean(Y) = 0)
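Since the closed form is just two sums, it is a one-liner on data. A minimal sketch on synthetic data, mirroring the examples that follow (the seed, sample size, and range are arbitrary choices; the recovered value varies with the draw):

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_w, noise_std = 100, 2.0, 1.0

x = rng.uniform(-3, 3, size=n)
y = true_w * x + rng.normal(0, noise_std, size=n)   # y = wx + noise

# Closed-form least squares for a line through the origin:
# w = sum_i x_i y_i / sum_i x_i^2
w_hat = np.sum(x * y) / np.sum(x * x)
print(f"recovered w = {w_hat:.2f}")                 # close to 2
```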
Regression example
- Data generated with true w = 2; least squares recovers w well, degrading gracefully as noise grows:
  - noise std = 1 → recovered w = 2.03
  - noise std = 2 → recovered w = 2.05
  - noise std = 4 → recovered w = 2.08
Bias term
- So far we assumed that the line passes through the origin
- What if the line does not?
- No problem, simply extend the model to y = w0 + w1x + ε
- Can use least squares to determine w0, w1:
$$w_0 = \frac{\sum_i (y_i - w_1 x_i)}{n} \qquad\qquad w_1 = \frac{\sum_i x_i\,(y_i - w_0)}{\sum_i x_i^2}$$
[Plot: fitted line with intercept w0]
Simpler solution is coming soon…
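For the curious, one standard shortcut (not shown on the slides) is to center the data, fit the slope as before, and read the intercept off the means; this is equivalent to solving the two coupled equations above:

```python
import numpy as np

def fit_line(x, y):
    """Least squares for y = w0 + w1*x, via centering."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope on centered data: covar(X, Y) / var(X)
    w1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    w0 = y_mean - w1 * x_mean   # intercept recovered from the means
    return w0, w1
```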
Multivariate regression
- What if we have several inputs?
  - Stock prices for Yahoo, Microsoft and Ebay for the Google prediction task
- This becomes a multivariate regression problem
- Again, it's easy to model:
  y = w0 + w1x1 + … + wkxk + ε
- Example with derived inputs: y = 10 + 3x1² − 2x2² + ε
- In some cases we would like to use polynomial or other terms based on the input data. Are these still linear regression problems?
- Yes. As long as the equation is linear in the coefficients, it is still a linear regression problem! (Not all functions can be approximated by a line/hyperplane…)
Non-Linear basis function
- So far we only used the observed values x1, x2, …
- However, linear regression can be applied in the same way to functions of these values
  – E.g.: to add a term w·x1x2, add a new variable z = x1x2, so each example becomes: x1, x2, …, z
- As long as these functions can be directly computed from the observed values, the parameters are still linear in the data and the problem remains a multivariate linear regression problem:
$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_k x_k + \varepsilon$$
Non-Linear basis function
- How can we use this to add an intercept term?
Add a new “variable” z=1 and weight w0
Non-linear basis functions
- What type of functions can we use?
- A few common examples:
  - Polynomial: $\phi_j(x) = x^j$ for j = 0 … n
  - Gaussian: $\phi_j(x) = \dfrac{(x-\mu_j)^2}{2\sigma_j^2}$
  - Sigmoid: $\phi_j(x) = \dfrac{1}{1+\exp(-s_j x)}$
  - Logs: $\phi_j(x) = \log(x+1)$
Any function of the input values can be used; the solution for the parameters of the regression remains the same.
General linear regression problem
- Using our new notation for the basis functions, linear regression can be written as
$$y = \sum_{j=0}^{n} w_j\,\phi_j(x)$$
- where φj(x) can be either xj for multivariate regression or one of the non-linear basis functions we defined
- … and φ0(x) = 1 for the intercept term
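A sketch of the basis-expansion idea in NumPy: build the columns φj(x) explicitly, and the problem becomes ordinary multivariate linear regression (the polynomial basis and degree here are illustrative choices):

```python
import numpy as np

def design_matrix(x, degree=3):
    """Polynomial basis: columns are phi_j(x) = x**j,
    with phi_0(x) = 1 supplying the intercept term."""
    return np.column_stack([x ** j for j in range(degree + 1)])

x = np.linspace(0, 1, 5)
Phi = design_matrix(x)   # shape (n, k+1)
print(Phi.shape)         # (5, 4)
```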
Learning/Optimizing Multivariate Least Squares
Approach 1: Gradient Descent
Gradient descent
Gradient Descent for Linear Regression
Predict with:
$$\hat{y}_i = \sum_{j=0}^{k} w_j\,\phi_j(x_i)$$
Goal: minimize the following loss function:
$$J_{X,y}(w) = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \Big( y_i - \sum_j w_j\,\phi_j(x_i) \Big)^2$$
(the outer sum runs over the n examples; the inner sum over the k+1 basis vectors)
Gradient Descent for Linear Regression
(Setup as before: predict with $\hat{y}_i = \sum_{j=0}^{k} w_j\,\phi_j(x_i)$; minimize $J_{X,y}(w) = \sum_i (y_i - \hat{y}_i)^2$.)
Take the partial derivative with respect to each weight wj:
$$\frac{\partial}{\partial w_j} J(w) = \frac{\partial}{\partial w_j}\sum_i (y_i - \hat{y}_i)^2 = 2\sum_i (y_i - \hat{y}_i)\,\frac{\partial}{\partial w_j}(y_i - \hat{y}_i) = -2\sum_i (y_i - \hat{y}_i)\,\frac{\partial}{\partial w_j}\hat{y}_i$$
and since $\frac{\partial}{\partial w_j}\hat{y}_i = \frac{\partial}{\partial w_j}\sum_{j'} w_{j'}\,\phi_{j'}(x_i) = \phi_j(x_i)$:
$$\frac{\partial}{\partial w_j} J(w) = -2\sum_i (y_i - \hat{y}_i)\,\phi_j(x_i)$$
Gradient Descent for Linear Regression
$$\hat{y}_i = \sum_{j=0}^{k} w_j\,\phi_j(x_i) \qquad\qquad \frac{\partial}{\partial w_j} J(w) = -2\sum_i (y_i - \hat{y}_i)\,\phi_j(x_i)$$
Learning algorithm:
- Initialize weights w=0
- For t=1,… until convergence:
- Predict for each example xi using w:
- Compute gradient of loss:
- This is a vector g
- Update: w = w – λg
- λ is the learning rate.
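A minimal NumPy sketch of this loop (a fixed iteration count stands in for a convergence test; the learning rate is an illustrative choice):

```python
import numpy as np

def gradient_descent(Phi, y, lr=0.01, iters=1000):
    """Batch gradient descent for least squares.
    Phi: (n, k+1) design matrix; y: (n,) targets."""
    w = np.zeros(Phi.shape[1])           # initialize weights w = 0
    for _ in range(iters):
        y_hat = Phi @ w                  # predict for each example
        g = -2 * Phi.T @ (y - y_hat)     # gradient of the squared-error loss
        w = w - lr * g                   # update step
    return w
```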
Gradient Descent for Linear Regression
- We can use any of the tricks we used for logistic regression:
  – stochastic gradient descent (if the data is too big to fit in memory)
  – regularization
  – …
Linear regression is a convex optimization problem
proof: differentiate again to get the second derivative
so again gradient descent will reach a global optimum
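Spelling out that step in the notation above: the second derivative is a sum of squares, hence non-negative:

```latex
\frac{\partial^2}{\partial w_j^2} J(w)
  = \frac{\partial}{\partial w_j}\Big(-2\sum_i (y_i - \hat{y}_i)\,\phi_j(x_i)\Big)
  = 2\sum_i \phi_j(x_i)^2 \;\ge\; 0
% In matrix form the Hessian is 2\Phi^\top\Phi, which is positive
% semi-definite, so J(w) is convex.
```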
Multivariate Least Squares
Approach 2: Matrix Inversion
OLS (Ordinary Least Squares Solution)
Predict with $\hat{y}_i = \sum_{j=0}^{k} w_j\,\phi_j(x_i)$; goal: minimize the loss function
$$J_{X,y}(w) = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \Big( y_i - \sum_j w_j\,\phi_j(x_i) \Big)^2, \qquad \frac{\partial}{\partial w_j} J(w) = -2\sum_i (y_i - \hat{y}_i)\,\phi_j(x_i)$$
Notation: with n examples and k+1 basis vectors, collect the basis values into a matrix
$$\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_k(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_k(x_2) \\ \vdots & & & \vdots \\ \phi_0(x_n) & \phi_1(x_n) & \cdots & \phi_k(x_n) \end{pmatrix}$$
and stack the targets and the weights as vectors:
$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \qquad w = \begin{pmatrix} w_0 \\ \vdots \\ w_k \end{pmatrix}$$
Writing out the gradient one coordinate at a time:
$$\frac{\partial}{\partial w_0} J(w) = -2\sum_i (y_i - \hat{y}_i)\,\phi_0(x_i), \;\;\dots,\;\; \frac{\partial}{\partial w_k} J(w) = -2\sum_i (y_i - \hat{y}_i)\,\phi_k(x_i)$$
Write Φ as stacked rows $\phi^1, \dots, \phi^n$, with the shorthand $\phi_j^i \equiv \phi_j(x_i)$, and recall
$$\hat{y}_i = \sum_{j=0}^{k} w_j\,\phi_j^i = \phi^i w$$
Substituting into each gradient coordinate:
$$\frac{\partial}{\partial w_j} J(w) = -2\sum_i \big( y_i\,\phi_j^i - \hat{y}_i\,\phi_j^i \big) = -2\sum_i \big( \phi_j^i\,y_i - \phi_j^i\,\phi^i w \big)$$
Stacking the k+1 coordinates into one vector: the first terms collect into $\Phi^T y$ (with $\Phi^T$ a (k+1) × n matrix) and the second into $\Phi^T\Phi\,w$, so
$$\nabla_w J(w) = -2\,\Phi^T y + 2\,\Phi^T\Phi\,w$$
Setting the gradient to 0:
$$\Phi^T\Phi\,w = \Phi^T y \;\;\Rightarrow\;\; w = (\Phi^T\Phi)^{-1}\Phi^T y$$
recap: Solving linear regression
- To optimize – closed form:
- We just take the derivative w.r.t. w and set it to 0:
$$\frac{\partial}{\partial w}\sum_i (y_i - w x_i)^2 = \sum_i 2(-x_i)(y_i - w x_i)$$
$$\Rightarrow\; 2\sum_i x_i y_i - 2w\sum_i x_i^2 = 0 \;\Rightarrow\; w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$
(covar(X,Y)/var(X) if mean(X) = mean(Y) = 0)
And the same pattern in matrix form: setting the gradient to zero,
$$-2\,\Phi^T y + 2\,\Phi^T\Phi\,w = 0 \;\;\Rightarrow\;\; w = (\Phi^T\Phi)^{-1}\Phi^T y$$
LMS for general linear regression problem
$$J(w) = \sum_i \big( y_i - w^T\phi(x_i) \big)^2$$
Deriving w we get:
$$w = (\Phi^T\Phi)^{-1}\Phi^T y$$
(Φ is an n × (k+1) matrix; y is a vector with n entries; w is a vector with k+1 entries.) This solution is also known as the 'pseudo-inverse'.
Another reason to start with an objective function: you can see when two learning methods are the same!
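A sketch of the closed-form solution in NumPy; np.linalg.lstsq solves the same normal equations but is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

def ols(Phi, y):
    """Ordinary least squares: solves Phi^T Phi w = Phi^T y."""
    # Equivalent to np.linalg.pinv(Phi) @ y, but more stable than
    # computing (Phi.T @ Phi)^{-1} @ Phi.T @ y directly.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w
```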
LMS versus gradient descent
$$J(w) = \sum_i \big( y_i - w^T\phi(x_i) \big)^2 \qquad\qquad w = (\Phi^T\Phi)^{-1}\Phi^T y$$
LMS (closed-form) solution:
+ Very simple in Matlab or something similar
− Requires a matrix inverse, which is expensive for a large matrix
Gradient descent:
+ Fast for large matrices
+ Stochastic GD is very memory efficient
+ Easily extended to other cases
− Parameters to tweak (how to decide convergence? what is the learning rate? …)
Regression and Overfitting
An example: polynomial basis vectors on a small dataset
– From Bishop Ch 1
[Plots from Bishop Ch. 1, fitting n = 10 points: 0th-, 1st-, 3rd-, and 9th-order polynomial fits; the 9th-order fit overfits. Companion plots: root-mean-square (RMS) error on training vs. test data, the exploding magnitudes of the 9th-order polynomial's coefficients, and the effect of data set size on the 9th-order fit.]
Regularization
- Penalize large coefficient values:
$$J_{X,y}(w) = \frac{1}{2}\sum_i \Big( y_i - \sum_j w_j\,\phi_j(x_i) \Big)^2 + \frac{\lambda}{2}\,\lVert w\rVert^2$$
[Bishop's plots: the regularized 9th-order fit and the table of its coefficients (huge with no regularization; tamed by a moderate λ; too large a λ over-regularizes and the fit goes flat)]
Regularized Gradient Descent for LR
Predict with: $\hat{y}_i = \sum_j w_j\,\phi_j(x_i)$
Goal: minimize the following loss function:
$$J_{X,y}(w) = \frac{1}{2}\sum_i \Big( y_i - \sum_j w_j\,\phi_j(x_i) \Big)^2 + \frac{\lambda}{2}\sum_j w_j^2$$
$$\frac{\partial}{\partial w_j} J(w) = -\sum_i (y_i - \hat{y}_i)\,\phi_j(x_i) + \lambda\,w_j$$
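A sketch of the regularized version, a small change to the earlier gradient loop (the value of λ here is illustrative; in practice it is tuned, e.g. on held-out data):

```python
import numpy as np

def ridge_gradient_descent(Phi, y, lam=0.1, lr=0.01, iters=1000):
    """Gradient descent on the L2-regularized least squares loss."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y_hat = Phi @ w
        # Gradient of 0.5*sum((y - y_hat)^2) + 0.5*lam*||w||^2
        g = -Phi.T @ (y - y_hat) + lam * w
        w = w - lr * g
    return w
```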
Probabilistic Interpretation of Least Squares
A probabilistic interpretation
Our least squares minimization solution can also be motivated by a probabilistic interpretation of the regression problem:
$$y = w^T\phi(x) + \varepsilon$$
where ε is Gaussian noise. The MLE for w in this model is the same as the solution we derived for the least squares criterion:
$$w = (\Phi^T\Phi)^{-1}\Phi^T y$$
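Filling in the step the slide leaves implicit (a standard derivation, assuming ε ~ N(0, σ²)):

```latex
% With y_i = w^T \phi(x_i) + \varepsilon_i,  \varepsilon_i \sim N(0, \sigma^2):
\log p(y \mid X, w)
  = \sum_i \log \mathcal{N}\!\big(y_i \,;\, w^T\phi(x_i),\, \sigma^2\big)
  = -\frac{1}{2\sigma^2}\sum_i \big(y_i - w^T\phi(x_i)\big)^2 + \text{const}
% Maximizing over w is therefore the same as minimizing
% \sum_i (y_i - w^T\phi(x_i))^2, the least squares objective.
```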
Understanding Overfitting: Bias-Variance
Example
(Tom Dietterich, Oregon St)
[Plots: a linear fit to one sample; then the same experiment repeated with 50 samples of 20 points each]
- The true function f can't be fit perfectly with hypotheses from our class H (lines) → Error 1. Fix: a more expressive set of hypotheses H.
- We don't get the best hypothesis from H because of noise/small sample size → Error 2. Fix: a less expressive set of hypotheses H.
- (Noise behaves similarly to Error 1.)
Bias-Variance Decomposition: Regression
Bias and variance for regression
- For regression, we can easily decompose the error of the learned model into two parts: bias (Error 1) and variance (Error 2)
  – Bias: the class of models can't fit the data
    - Fix: a more expressive model class
  – Variance: the class of models could fit the data, but doesn't because it's hard to fit
    - Fix: a less expressive model class
Bias – Variance decomposition of error
Fix a test case x, then do this experiment:
- 1. Draw a size-n sample D = (x1,y1), …, (xn,yn)
- 2. Train a linear regressor hD using D
- 3. Draw one test example (x, f(x)+ε)
- 4. Measure the squared error of hD on that one example x
What's the expected error?
$$E_{D,\varepsilon}\Big\{ \big( f(x) + \varepsilon - h_D(x) \big)^2 \Big\}$$
(f is the true function; the expectation is over the dataset D and the noise ε; hD is learned from D)
Bias – Variance decomposition of error
$$E_{D,\varepsilon}\Big\{ \big( f(x) + \varepsilon - h_D(x) \big)^2 \Big\}$$
Notation, to simplify this:
$$f \equiv f(x) + \varepsilon \qquad \hat{y} \equiv \hat{y}_D = h_D(x) \qquad h \equiv E_D\{h_D(x)\}$$
where h is the long-term expectation of the learner's prediction on this x, averaged over many data sets D.
Bias – Variance decomposition of error
$$E_{D,\varepsilon}\{(f-\hat{y})^2\} = E\big\{([f-h]+[h-\hat{y}])^2\big\}$$
$$= E\big\{[f-h]^2 + [h-\hat{y}]^2 + 2[f-h][h-\hat{y}]\big\}$$
$$= E\big\{[f-h]^2 + [h-\hat{y}]^2 + 2[fh - f\hat{y} - h^2 + h\hat{y}]\big\}$$
$$= E[(f-h)^2] + E[(h-\hat{y})^2] + 2\big( E[fh] - E[f\hat{y}] - E[h^2] + E[h\hat{y}] \big)$$
The cross terms cancel, because the expectations factor (the noise ε is independent of D, and h is a constant with respect to D):
$$E_{D,\varepsilon}\big\{ (f(x)+\varepsilon)\cdot E_D\{h_D(x)\} \big\} = E_{D,\varepsilon}\big\{ (f(x)+\varepsilon)\cdot h_D(x) \big\}, \;\text{ i.e. } E[fh] = E[f\hat{y}]$$
$$E_{D,\varepsilon}\big\{ E_D\{h_D(x)\}\cdot E_D\{h_D(x)\} \big\} = E_{D,\varepsilon}\big\{ E_D\{h_D(x)\}\cdot h_D(x) \big\}, \;\text{ i.e. } E[h^2] = E[h\hat{y}]$$
Bias – Variance decomposition of error
$$E_{D,\varepsilon}\{(f-\hat{y})^2\} = E\big\{([f-h]+[h-\hat{y}])^2\big\} = \underbrace{E[(f-h)^2]}_{\text{BIAS}^2} + \underbrace{E[(h-\hat{y})^2]}_{\text{VARIANCE}}$$
- BIAS²: the squared difference between the best possible prediction for x, f(x), and our "long-term" expectation of what the learner will do if we averaged over many datasets D, ED[hD(x)]
- VARIANCE: the squared difference between our long-term expectation of the learner's performance, ED[hD(x)], and what we expect in a representative run on a dataset D (ŷ)
[Plot at x = 5 illustrating the bias and variance components]
Bias-variance decomposition
- This is something real that you can (approximately) measure experimentally – if you have synthetic data
- Different learners and model classes have different tradeoffs
  – large bias/small variance: few features, highly regularized, highly pruned decision trees, large-k k-NN, …
  – small bias/high variance: many features, less regularization, unpruned trees, small-k k-NN, …
Bias and variance
- For classification, we can also decompose the error of a learned classifier into two terms: bias and variance
  – Bias: the class of models can't fit the data. Fix: a more expressive model class.
  – Variance: the class of models could fit the data, but doesn't because it's hard to fit. Fix: a less expressive model class.
Another view of a decision tree
[Plots: a decision tree on iris data drawn as axis-parallel splits of the (sepal_length, sepal_width) plane, e.g. sepal_length > 5.7 and sepal_width > 2.8; deeper trees (adding tests like length > 5.1, width > 3.1, length > 4.6) carve the plane into finer and finer rectangles]
Bias-Variance Decomposition: Measuring
Bias-variance decomposition
- This is something real that you can (approximately) measure experimentally
  – if you have synthetic data… or if you're clever
  – You need to somehow approximate ED{hD(x)}
  – I.e., construct many variants of the dataset D
Background: “Bootstrap” sampling
- Input: dataset D
- Output: many variants of D: D1,…,DT
- For t = 1, …, T:
  – Dt = { }
  – For i = 1 … |D|:
    - Pick (x,y) uniformly at random from D (i.e., with replacement) and add it to Dt
- Some examples never get picked (~37%); some are picked 2x, 3x, … (see the sketch below)
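A minimal sketch of that loop (the function and variable names are mine):

```python
import numpy as np

def bootstrap_variants(D, T, seed=0):
    """Return T bootstrap resamples of dataset D (a list of (x, y) pairs)."""
    rng = np.random.default_rng(seed)
    n = len(D)
    variants = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)   # sample n indices with replacement
        variants.append([D[i] for i in idx])
    return variants
# On average a fraction (1 - 1/n)^n -> 1/e ~ 37% of the examples
# are left out of each variant ("out of bag").
```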
Measuring Bias-Variance with “Bootstrap” sampling
- Create B bootstrap variants of D (approximating many draws of D)
- For each bootstrap dataset:
  – Tb is the dataset; Ub are the "out of bag" examples
  – Train a hypothesis hb on Tb
  – Test hb on each x in Ub
- Now for each (x,y) example we have many predictions h1(x), h2(x), …, so we can estimate (ignoring noise; a sketch follows below):
  – variance: ordinary variance of h1(x), …, hn(x)
  – bias: average(h1(x), …, hn(x)) − y
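A sketch of the per-example estimate, assuming the out-of-bag predictions for one example (x, y) have been collected into an array (the numbers here are made up for illustration):

```python
import numpy as np

# preds: out-of-bag predictions h_1(x), ..., h_B(x) for one example (x, y)
preds = np.array([4.8, 5.3, 5.1, 4.6, 5.4])
y = 5.0

bias = preds.mean() - y    # average(h_1(x), ..., h_B(x)) - y
variance = preds.var()     # ordinary variance of the predictions
print(f"bias={bias:.3f}, variance={variance:.3f}")
```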
Applying Bias-Variance Analysis
- By measuring the bias and variance on a problem, we can determine how to improve our model
  – If bias is high, we need to allow our model to be more complex
  – If variance is high, we need to reduce the complexity of the model
- Bias-variance analysis also suggests a way to reduce variance: bagging
Bagging
Bootstrap Aggregation (Bagging)
- Use the bootstrap to create B variants of D
- Learn a classifier from each variant
- Vote the learned classifiers to predict on a test example
Bagging (bootstrap aggregation)
- Breaking it down:
  – input: dataset D and YFCL (your favorite classifier learner)
  – output: a classifier hD-BAG
  – use the bootstrap to construct variants D1, …, DT
  – for t = 1, …, T: train YFCL on Dt to get ht
  – to classify x with hD-BAG: classify x with h1, …, hT and predict the most frequently predicted class for x (majority vote; a sketch follows below)
Note that you can use any learner you like! You can also test ht on the "out of bag" examples.
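A minimal sketch of hD-BAG, with scikit-learn's DecisionTreeClassifier standing in for YFCL (any learner with fit/predict would do; assumes non-negative integer class labels for the vote count):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, T=25, seed=0):
    """Train T trees on bootstrap variants of D; predict by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    votes = np.empty((T, len(X_test)), dtype=y_train.dtype)
    for t in range(T):
        idx = rng.integers(0, n, size=n)   # bootstrap variant D_t
        h_t = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes[t] = h_t.predict(X_test)
    # Majority vote over the T classifiers, per test example.
    return np.array([np.bincount(col).argmax() for col in votes.T])
```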
Experiments
(Freund and Schapire)
[Learning-curve plots: solid = naïve Bayes, dashed = logistic regression, compared against bagged, minimally pruned decision trees]
Generally, bagged decision trees eventually outperform the linear classifiers, if the data is large enough and clean enough.
Bagging (bootstrap aggregation)
- Experimentally:
– especially with minimal pruning: decision trees have low bias but high variance. – bagging usually improves performance for decision trees and similar methods – It reduces variance without increasing the bias (much).