SLIDE 1

Linear Regression

Yijun Zhao

Northeastern University

Fall 2016

SLIDE 2

Regression Examples

Any attributes (x) ⇒ continuous value (y):

{age, major, gender, race} ⇒ GPA
{income, credit score, profession} ⇒ loan
{college, major, GPA} ⇒ future income
. . .

SLIDE 3

Regression Examples

Data often comes in, or can be converted into, matrix form:

Age  Gender  Race  Major     GPA
20   0       A     Art       3.85
22   0       C     Engineer  3.90
25   1       A     Engineer  3.50
24   0       AA    Art       3.60
19   1       H     Art       3.70
18   1       C     Engineer  3.00
30   0       AA    Engineer  3.80
25   0       C     Engineer  3.95
28   1       A     Art       4.00
26   0       C     Engineer  3.20

SLIDE 4

Formal Problem Setup

Given N observations {(x_1, y_1), (x_2, y_2), . . . , (x_N, y_N)}, a regression problem tries to uncover the function y_i = f(x_i) ∀i = 1, 2, . . . , N such that, for a new input value x*, we can accurately predict the corresponding value y* = f(x*).

SLIDE 5

Linear Regression

Assume the function f is a linear combination of the components in x.

Formally, let x = (1, x_1, x_2, . . . , x_d)^T. Then

y = ω_0 + ω_1 x_1 + ω_2 x_2 + · · · + ω_d x_d = w^T x

where w = (ω_0, ω_1, ω_2, . . . , ω_d)^T.

w is the parameter to estimate!

Prediction: y* = w^T x*
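
A minimal numerical sketch of this prediction step (NumPy, with made-up weights and feature values):

```python
import numpy as np

# Hypothetical weights w = (w0, w1, w2) and a raw two-feature input.
w = np.array([0.5, 2.0, -1.0])       # w0 = 0.5 is the intercept
x_raw = np.array([3.0, 4.0])

x = np.concatenate(([1.0], x_raw))   # prepend x0 = 1
y_star = w @ x                       # y* = w^T x*
print(y_star)                        # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5
```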

SLIDE 6

Visual Illustration

Figure: 1D and 2D linear regression

SLIDE 7

Error Measure

Mean Squared Error (MSE):

E(w) = (1/N) Σ_{n=1}^{N} (w^T x_n − y_n)² = (1/N) ‖Xw − y‖²

where X is the matrix whose rows are x_1^T, x_2^T, . . . , x_N^T, and y = (y_1, y_2, . . . , y_N)^T.
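
The two forms of E(w) can be checked against each other numerically; a small sketch with synthetic (made-up) data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # rows are x_n^T, with x0 = 1
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

# Sum form: (1/N) * sum over n of (w^T x_n - y_n)^2
mse_sum = sum((w @ X[n] - y[n]) ** 2 for n in range(N)) / N

# Matrix form: (1/N) * ||Xw - y||^2
residual = X @ w - y
mse_matrix = (residual @ residual) / N

assert np.isclose(mse_sum, mse_matrix)
```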

SLIDE 8

Minimizing Error Measure

E(w) = (1/N) ‖Xw − y‖²

∇E(w) = (2/N) X^T (Xw − y) = 0

⇒ X^T X w = X^T y

⇒ w = X†y, where X† = (X^T X)^{−1} X^T is the 'pseudo-inverse' of X

SLIDE 9

LR Algorithm Summary

Ordinary Least Squares (OLS) Algorithm

1. Construct the matrix X and the vector y from the dataset {(x_1, y_1), (x_2, y_2), . . . , (x_N, y_N)} (each x includes x_0 = 1): the rows of X are x_1^T, x_2^T, . . . , x_N^T, and y = (y_1, y_2, . . . , y_N)^T.
2. Compute X† = (X^T X)^{−1} X^T.
3. Return w = X†y.
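
A NumPy sketch of these three steps, assuming X^T X is invertible (np.linalg.solve is used rather than forming (X^T X)^{−1} explicitly; it computes the same w more stably):

```python
import numpy as np

def ols_fit(X_raw, y):
    """Fit w by the normal equations; X_raw is (N, d), y is (N,)."""
    N = X_raw.shape[0]
    X = np.hstack([np.ones((N, 1)), X_raw])   # each x includes x0 = 1
    # Solving X^T X w = X^T y is equivalent to w = X† y.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Usage on synthetic data with known true weights:
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(200, 2))
y = 1.0 + 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + 0.1 * rng.normal(size=200)
print(ols_fit(X_raw, y))   # approximately [1.0, 3.0, -2.0]
```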

SLIDE 10

Gradient Descent

Why? To minimize our target function E(w) by moving downhill in the steepest direction.

SLIDE 11

Gradient Descent

Gradient Descent Algorithm

Initialize the weights w(0) for time t = 0.
For t = 0, 1, 2, . . . do:
  Compute the gradient g_t = ∇E(w(t)).
  Set the direction to move: v_t = −g_t.
  Update w(t+1) = w(t) + η v_t.
  Iterate until it is time to stop.
Return the final weights w.
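
A sketch of this loop applied to the linear-regression error E(w) = (1/N)‖Xw − y‖²; a fixed iteration budget stands in for the unspecified stopping rule:

```python
import numpy as np

def gd_fit(X, y, eta=0.1, num_iters=1000):
    """Gradient descent on E(w) = (1/N)||Xw - y||^2; X already includes x0 = 1."""
    N, D = X.shape
    w = np.zeros(D)                          # initialize w(0)
    for t in range(num_iters):               # fixed budget as the stopping rule
        g = (2.0 / N) * (X.T @ (X @ w - y))  # gradient g_t = ∇E(w(t))
        w = w - eta * g                      # w(t+1) = w(t) + η v_t, with v_t = -g_t
    return w

# With a small enough η this converges to the OLS solution w = X† y.
```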

SLIDE 12

Gradient Descent

How does η affect the algorithm?

Use η = 0.1 (a practical observation).
Use a variable step size: η_t = η ‖∇E‖.

SLIDE 13

OLS or Gradient Descent?

SLIDE 14

Computational Complexity

OLS: O(ND² + D³) (form X^T X, then invert it)
Gradient Descent: O(ND) per iteration

OLS is expensive when D is large!

SLIDE 15

Linear Regression

What is the Probabilistic Interpretation?

SLIDE 16

Normal Distribution

Figure: right-skewed, left-skewed, and random distributions

SLIDE 17

Normal Distribution

mean = median = mode
symmetry about the center

x ∼ N(µ, σ²) ⇒ f(x) = (1 / (σ√(2π))) e^{−(x−µ)² / (2σ²)}
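
A direct transcription of this density as code (a sketch; the names are made up):

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """f(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

xs = np.array([-1.0, 0.0, 1.0])
print(normal_pdf(xs))   # symmetric: f(-1) == f(1); the peak sits at the mean
```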

SLIDE 18

Central Limit Theorem

All things bell shaped! Random occurrences over a large population tend to wash out the asymmetry and uniformness of individual events, and a more 'natural' distribution ensues. The name for it is the Normal distribution (the bell curve).

Formal definition: if y_1, . . . , y_n are i.i.d. with mean µ_y and 0 < σ_y² < ∞, then when n is large the distribution of ȳ is well approximated by a normal distribution N(µ_y, σ_y²/n).
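
A quick simulation illustrating the theorem, using uniform draws as the decidedly non-bell-shaped starting distribution (the values of n and the number of trials are arbitrary):

```python
import numpy as np

# Averages of i.i.d. Uniform(0, 1) draws, where mu_y = 1/2 and sigma_y^2 = 1/12.
rng = np.random.default_rng(42)
n, trials = 100, 10_000
ybar = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)

print(ybar.mean())   # ≈ 0.5          = mu_y
print(ybar.var())    # ≈ (1/12) / 100 = sigma_y^2 / n
```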

SLIDE 19

Central Limit Theorem

Example:

SLIDE 20

LR: Probabilistic Interpretation

SLIDE 21

LR: Probabilistic Interpretation

Assume y_i = w^T x_i + ε_i with Gaussian noise ε_i ∼ N(0, σ²). Then

prob(y_i | x_i) = (1 / (√(2π) σ)) e^{−(w^T x_i − y_i)² / (2σ²)}

SLIDE 22

LR: Probabilistic Interpretation

Likelihood of the entire dataset:

L ∝ ∏_i e^{−(w^T x_i − y_i)² / (2σ²)} = e^{−(1/(2σ²)) Σ_i (w^T x_i − y_i)²}

Maximize L ⟺ Minimize Σ_i (w^T x_i − y_i)²
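
A small numerical check of this equivalence (a sketch: SciPy's generic minimizer stands in for any optimizer, and since σ does not affect the argmin, the negative log-likelihood is coded only up to constants):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 1))])
y = X @ np.array([1.0, 2.0]) + 0.3 * rng.normal(size=50)

# -log L is, up to constants and the fixed factor 1/(2σ²), the sum of squares.
neg_log_likelihood = lambda w: np.sum((X @ w - y) ** 2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_mle, w_ols, atol=1e-4))   # True: MLE coincides with OLS
```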

SLIDE 23

Non-linear Transformation

Linear is limited. Linear models become powerful when we consider non-linear feature transformations:

X_i = (1, x_i, x_i²) ⇒ y_i = ω_0 + ω_1 x_i + ω_2 x_i²
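
A sketch of this transformation in practice: quadratic data fit by ordinary linear regression on the features (1, x_i, x_i²) (the data-generating coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=100)
y = 1.0 - 0.5 * x + 2.0 * x ** 2 + 0.1 * rng.normal(size=100)

X = np.column_stack([np.ones_like(x), x, x ** 2])   # X_i = (1, x_i, x_i^2)
w = np.linalg.solve(X.T @ X, X.T @ y)               # plain linear regression
print(w)   # approximately [1.0, -0.5, 2.0]
```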

SLIDE 24

Overfitting

SLIDE 25

Overfitting

How do we know we have overfitted?

E_in: error on the training data
E_out: error on the test data

Example:

SLIDE 26

Overfitting

How to avoid overfitting?

Use more data
Evaluate on a parameter tuning set
Regularization

SLIDE 27

Regularization

Attempts to impose the "Occam's razor" principle: add a penalty term for model complexity. Most commonly used:

L2 regularization (ridge regression) minimizes:
E(w) = ‖Xw − y‖² + λ‖w‖², where λ ≥ 0 and ‖w‖² = w^T w

L1 regularization (LASSO) minimizes:
E(w) = ‖Xw − y‖² + λ‖w‖₁, where λ ≥ 0 and ‖w‖₁ = Σ_{i=1}^{D} |ω_i|

SLIDE 28

Regularization

L2: closed-form solution w = (X^T X + λI)^{−1} X^T y
L1: no closed-form solution; use quadratic programming:
minimize ‖Xw − y‖² s.t. ‖w‖₁ ≤ s
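
The L2 closed form translates directly to code; a sketch (for simplicity it penalizes all weights, including ω_0, whereas in practice the intercept is often left unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """L2-regularized least squares: w = (X^T X + λI)^{-1} X^T y."""
    D = X.shape[1]
    # λ = 0 recovers plain OLS; larger λ shrinks the weights toward zero.
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```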

SLIDE 29

L2 Regularization Example

SLIDE 30

Model Selection

Which model? A central problem in supervised learning.

Simple models "underfit" the data:
  a constant function
  a linear model applied to quadratic data

Complex models "overfit" the data:
  high-degree polynomials
  a model with hidden logic that fits the data to completion

SLIDE 31

Bias-Variance Trade-off

Consider E[(1/N) Σ_{n=1}^{N} (w^T x_n − y_n)²] and let ŷ = w^T x_n. E[(ŷ − y_n)²] can be decomposed into (reading):

var{noise} + bias² + var{ŷ}

var{noise}: can't be reduced
bias² + var{ŷ} is what counts for prediction
High bias²: model mismatch, often due to "underfitting"
High var{ŷ}: training set and test set mismatch, often due to "overfitting"
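
A simulation sketch of this decomposition (the quadratic target, polynomial degrees, and sample sizes are all made up): refitting a model on many resampled training sets lets us estimate bias² and var{ŷ} directly:

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: 1.0 + 2.0 * x - x ** 2      # true (noise-free) target function
x_test = np.linspace(-1, 1, 50)

def fit_many(degree, trials=500, n=20, noise=0.3):
    """Predictions of a degree-p polynomial refit on many fresh training sets."""
    preds = np.empty((trials, x_test.size))
    for t in range(trials):
        x = rng.uniform(-1, 1, size=n)
        y = f(x) + noise * rng.normal(size=n)
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_test)
    return preds

for degree in (0, 2, 9):
    p = fit_many(degree)
    bias2 = np.mean((p.mean(axis=0) - f(x_test)) ** 2)   # bias^2
    variance = np.mean(p.var(axis=0))                     # var{ŷ}
    print(degree, round(bias2, 4), round(variance, 4))
# Low degree: high bias^2, low variance. High degree: low bias^2, high variance.
```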

SLIDE 32

Bias-Variance Trade-off

Often:
low bias ⇒ high variance
low variance ⇒ high bias

Trade-off:

SLIDE 33

How to choose λ ?

But we still need to pick λ. Use the test set data? NO!

Set aside another evaluation set:
small evaluation set ⇒ inaccurate estimated error
large evaluation set ⇒ small training set

Cross-Validation

SLIDE 34

Cross Validation (CV)

Divide the data into K folds. In turn, train on all but the kth fold, and test on the kth fold.

SLIDE 35

Cross Validation (CV)

How to choose K? Common choices are K = 5, 10, or N (leave-one-out CV, LOOCV). Measure the average performance across the K folds. Cost of computation: K folds × choices of λ.
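
A sketch of K-fold CV used to score one value of λ for ridge regression (X is assumed to already include the constant column; the fold split is a simple random partition):

```python
import numpy as np

def cv_error(X, y, lam, K=5):
    """Average validation MSE of ridge regression across K folds."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(y)), K)
    errors = []
    for k in range(K):
        val = folds[k]                                            # kth fold: test
        trn = np.concatenate([folds[j] for j in range(K) if j != k])  # rest: train
        w = np.linalg.solve(X[trn].T @ X[trn] + lam * np.eye(X.shape[1]),
                            X[trn].T @ y[trn])
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errors)

# Pick the λ with the lowest cross-validated error, e.g.:
# best_lam = min([0.01, 0.1, 1.0, 10.0], key=lambda lam: cv_error(X, y, lam))
```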

SLIDE 36

Learning Curve

A learning curve plots the performance of the algorithm as a function of the size of the training data.
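
A sketch of how such a curve is computed: fit on the first n training points for increasing n and record the test error (it assumes n ≥ D so the normal equations are solvable; plotting the returned errors against the sizes gives the curve):

```python
import numpy as np

def learning_curve(X_train, y_train, X_test, y_test, sizes):
    """Test MSE of OLS fitted on the first n training points, for each n."""
    errors = []
    for n in sizes:                      # each n must satisfy n >= X_train.shape[1]
        Xn, yn = X_train[:n], y_train[:n]
        w = np.linalg.solve(Xn.T @ Xn, Xn.T @ yn)
        errors.append(np.mean((X_test @ w - y_test) ** 2))
    return errors
```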

SLIDE 37

Learning Curve
