SLIDE 1

Linear Regression

Aarti Singh

Machine Learning 10-701/15-781 Sept 27, 2010

SLIDE 2

Discrete to Continuous Labels

Classification (discrete labels):

  • X = Document → Y = Topic (Sports / Science / News)
  • X = Cell Image → Y = Diagnosis (Anemic cell / Healthy cell)

Regression (continuous labels):

  • Stock Market Prediction: X = Feb01 (data up to Feb 01) → Y = ? (a continuous value)

SLIDE 3

Regression Tasks

  • Weather Prediction: X = 7 pm → Y = Temperature
  • Estimating Contamination: X = new location → Y = sensor reading

SLIDE 4


Supervised Learning

Goal: learn a predictor f mapping X to Y from labeled training data.

  • Classification (e.g. X = Document → Y ∈ {Sports, Science, News}): minimize the Probability of Error, P(f(X) ≠ Y)
  • Regression (e.g. X = Feb01 → Y = ?): minimize the Mean Squared Error, E[(f(X) − Y)²]

SLIDE 5

Regression

Optimal predictor (under Mean Squared Error): the conditional mean,

$$f^*(x) = \mathbb{E}[Y \mid X = x]$$

Intuition: Signal plus (zero-mean) Noise model, Y = f*(X) + ε with E[ε] = 0, so the conditional mean recovers the signal.

SLIDE 6

Regression

Optimal predictor: f*(x) = E[Y | X = x]. Proof strategy: show that any other predictor f incurs at least as much risk.

Dropping subscripts for notational convenience, decompose for any f:

$$\mathbb{E}[(f(X) - Y)^2] = \mathbb{E}[(f(X) - f^*(X))^2] + \mathbb{E}[(f^*(X) - Y)^2]$$

(the cross term vanishes by conditioning on X, since E[Y − f*(X) | X] = 0). The first term is ≥ 0, so f* minimizes the mean squared error.

SLIDE 7

Regression

Optimal predictor: f*(x) = E[Y | X = x]

Problem: the conditional mean depends on the unknown distribution of (X, Y), so it cannot be computed directly; it must be estimated from training data.

Intuition: Signal plus (zero-mean) Noise model (Conditional Mean).

SLIDE 8

Regression algorithms

A learning algorithm maps training data to an estimated predictor. Common regression algorithms:

  • Linear Regression
  • Lasso, Ridge Regression (Regularized Linear Regression)
  • Nonlinear Regression
  • Kernel Regression
  • Regression Trees, Splines, Wavelet estimators, …

SLIDE 9

Empirical Risk Minimization (ERM)

Optimal predictor:

$$f^* = \arg\min_f \mathbb{E}[(f(X) - Y)^2]$$

Empirical Risk Minimizer, over a class of predictors F:

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} (f(X_i) - Y_i)^2$$

The empirical mean approximates the true risk by the Law of Large Numbers. More later…

SLIDE 10
ERM – you saw it before!

  • Learning Distributions: Max likelihood = Min negative log likelihood, i.e. minimizing an empirical risk
  • What is the class F? A class of parametric distributions, e.g. Bernoulli(θ), Gaussian(μ, σ²)

SLIDE 11

Linear Regression

  • Class of Linear functions

Uni-variate case:

$$f_\beta(x) = \beta_1 + \beta_2 x$$

where β₁ is the intercept and β₂ the slope.

Multi-variate case:

$$f_\beta(\mathbf{x}) = \mathbf{x}\boldsymbol{\beta}, \qquad \text{where } \mathbf{x} = [1 \;\; x^{(1)} \cdots x^{(p)}]$$

and the leading 1 absorbs the intercept.

Least Squares Estimator:

$$\hat{\beta} = \arg\min_\beta \sum_{i=1}^{n} (Y_i - X_i\beta)^2$$

SLIDE 12

Least Squares Estimator

In matrix form, with X the n × p design matrix and Y the n × 1 response vector,

$$\hat{\beta} = \arg\min_\beta \|Y - X\beta\|_2^2 = \arg\min_\beta J(\beta)$$

SLIDE 13

Least Squares Estimator

Setting the gradient of J(β) to zero,

$$\nabla_\beta J(\beta) = -2X^T(Y - X\beta) = 0,$$

leads to the normal equations on the next slide.

SLIDE 14

Normal Equations

$$\underbrace{X^T X}_{p \times p} \; \underbrace{\beta}_{p \times 1} = \underbrace{X^T Y}_{p \times 1}$$

If X^T X is invertible,

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

When is X^T X invertible? Recall: full rank matrices are invertible. What is the rank of X^T X? It equals the rank of X, so X must have p linearly independent columns (in particular, n ≥ p). What if X^T X is not invertible? Regularization (later).
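A minimal numpy sketch of the estimator above; the synthetic data, seed, and all names here are illustrative assumptions, not from the slides.

```python
import numpy as np

# Synthetic data: n = 100 samples, p = 3 features (illustrative only)
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # leading 1s absorb the intercept
beta_true = np.array([2.0, -1.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=n)  # signal plus zero-mean noise

# Normal equations: solve (X^T X) beta = X^T Y rather than forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# np.linalg.lstsq handles the rank-deficient case as well
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat)    # ≈ beta_true
print(beta_lstsq)  # same solution when X has full column rank
```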

SLIDE 15

Geometric Interpretation

The fitted values on the training set,

$$\hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y,$$

are the orthogonal projection of Y onto the linear subspace spanned by the columns of X; the difference in prediction on the training set, Y − Ŷ, is orthogonal to that subspace.

SLIDE 16

Revisiting Gradient Descent

Even when X^T X is invertible, solving the normal equations can be computationally expensive if the matrix is huge. Gradient descent is an alternative:

  • Initialize: β⁰
  • Update: β^{t+1} = β^t − α ∇J(β^t), with ∇J(β) = −2X^T(Y − Xβ); the update is 0 if ∇J(β^t) = 0
  • Stop: when some criterion is met, e.g. a fixed # of iterations, or ‖β^{t+1} − β^t‖ < ε

Gradient descent finds the global minimum since J(β) is convex.
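A runnable sketch of this update rule; the function name is assumed, and the automatic step size derived from the Lipschitz constant of ∇J is an addition for safety, not part of the slide.

```python
import numpy as np

def gd_least_squares(X, Y, alpha=None, iters=1000, eps=1e-8):
    """Minimize J(beta) = ||Y - X beta||^2 by gradient descent (a minimal sketch)."""
    if alpha is None:
        # grad J is Lipschitz with constant L = 2*lambda_max(X^T X);
        # any step below 2/L converges, so default to 1/L
        alpha = 1.0 / (2 * np.linalg.eigvalsh(X.T @ X).max())
    beta = np.zeros(X.shape[1])                      # Initialize
    for _ in range(iters):
        grad = -2 * X.T @ (Y - X @ beta)             # gradient of J at current beta
        beta_next = beta - alpha * grad              # Update
        if np.linalg.norm(beta_next - beta) < eps:   # Stop when steps are tiny
            return beta_next
        beta = beta_next
    return beta
```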

SLIDE 17

Effect of step-size α

  • Large α ⇒ fast convergence, but larger residual error; oscillations are also possible
  • Small α ⇒ slow convergence, but small residual error

SLIDE 18

Least Squares and MLE

Intuition: Signal plus (zero-mean) Noise model,

$$Y = X\beta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I)$$

The log likelihood is then, up to additive constants,

$$\log p(Y \mid X, \beta) = -\frac{1}{2\sigma^2}\|Y - X\beta\|_2^2 + \text{const},$$

so maximizing it is the same as minimizing squared error: the Least Square Estimate is the same as the Maximum Likelihood Estimate under a Gaussian model!

SLIDE 19

Regularized Least Squares and MAP

What if X^T X is not invertible?

MAP estimate: log posterior = log likelihood + log prior.

I) Gaussian Prior: the prior belief that β is Gaussian with zero mean biases the solution toward "small" β:

$$\hat{\beta}_{MAP} = \arg\min_\beta \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$

Ridge Regression

Closed form: HW
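The closed form is left as homework on the slide; the sketch below uses the standard ridge solution (X^T X + λI)⁻¹ X^T Y, stated here as a well-known result rather than the course's own derivation, with an assumed function name.

```python
import numpy as np

def ridge(X, Y, lam=1.0):
    """Ridge regression via the standard closed form
    beta = (X^T X + lam*I)^{-1} X^T Y."""
    p = X.shape[1]
    # X^T X + lam*I is always invertible for lam > 0, even when X^T X is not
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
```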

SLIDE 20

Regularized Least Squares and MAP

What if X^T X is not invertible?

MAP estimate: log posterior = log likelihood + log prior.

II) Laplace Prior: the prior belief that β is Laplace with zero mean biases the solution toward "small" β:

$$\hat{\beta}_{MAP} = \arg\min_\beta \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

Lasso
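The slides give no algorithm for the lasso; this is a minimal iterative soft-thresholding (ISTA) sketch, one standard solver for the l1-penalized objective above. All names and the step-size choice are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1: shrinks each coordinate toward 0 by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, Y, lam=1.0, iters=500):
    """Minimize ||Y - X beta||^2 + lam*||beta||_1 by iterative soft-thresholding."""
    step = 1.0 / (2 * np.linalg.eigvalsh(X.T @ X).max())  # safe step for the smooth part
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -2 * X.T @ (Y - X @ beta)                  # gradient of the squared-error term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta                                           # many coordinates end up exactly 0
```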

SLIDE 21

Ridge Regression vs Lasso

Ridge Regression: l2 penalty. Lasso: l1 penalty – HOT!

Lasso (l1 penalty) results in sparse solutions – vectors with more zero coordinates. Good for high-dimensional problems – you don't have to store all coordinates! Ideally one would use an l0 penalty, but then the optimization becomes non-convex.

[Figure: level sets of J(β) in the (β1, β2) plane, overlaid with βs of constant l1, l2, and l0 norm]

SLIDE 22

Beyond Linear Regression

  • Polynomial regression
  • Regression with nonlinear features/basis functions
  • Kernel regression – local/weighted regression
  • Regression trees – spatially adaptive regression

SLIDE 23

Polynomial Regression

Univariate (1-d) case:

$$f_\beta(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_m x^m = \sum_{k=0}^{m} \beta_k x^k$$

where the monomials x^k are nonlinear features and β_k is the weight of each feature. The model is still linear in β, so the least squares machinery applies unchanged.
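A sketch of polynomial regression as linear least squares on monomial features; the data and names are illustrative assumptions.

```python
import numpy as np

def poly_fit(x, y, degree=3):
    """Fit a degree-m polynomial by least squares on monomial features."""
    # Design matrix with columns [1, x, x^2, ..., x^degree]
    Phi = np.vander(x, degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return beta

# Illustrative data: a noisy cubic
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1 - 2 * x + 0.5 * x**3 + 0.1 * rng.normal(size=x.size)
print(poly_fit(x, y, degree=3))  # ≈ [1, -2, 0, 0.5]
```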

SLIDE 24

Polynomial Regression

Demo: http://mste.illinois.edu/users/exner/java.f/leastsquares/

SLIDE 25

Nonlinear Regression

Nonlinear features/basis functions with basis coefficients:

$$f_\beta(x) = \sum_k \beta_k \phi_k(x)$$

  • Fourier Basis – good representation for oscillatory functions
  • Wavelet Basis – good representation for functions localized at multiple scales
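A sketch of regression on a truncated Fourier basis, in the basis-function form above; the function name, the truncation level K, and the assumption that x is scaled to [0, 1] are mine, not the slides'.

```python
import numpy as np

def fourier_fit(x, y, K=5):
    """Least squares on a truncated Fourier basis (x assumed scaled to [0, 1])."""
    feats = [np.ones_like(x)]                     # constant basis function
    for k in range(1, K + 1):
        feats.append(np.sin(2 * np.pi * k * x))   # oscillatory basis functions
        feats.append(np.cos(2 * np.pi * k * x))
    Phi = np.column_stack(feats)
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return Phi, beta
```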

SLIDE 26

Local Regression

Nonlinear features/basis functions with basis coefficients, as on the previous slide. Globally supported basis functions (polynomial, Fourier) will not yield a good representation when the function behaves differently in different regions.


SLIDE 28

What you should know

  • Linear Regression: Least Squares Estimator, Normal Equations, Gradient Descent, Geometric and Probabilistic Interpretation (connection to MLE)
  • Regularized Linear Regression (connection to MAP): Ridge Regression, Lasso
  • Polynomial Regression, Basis (Fourier, Wavelet) Estimators

Next time:

  • Kernel Regression (Localized)
  • Regression Trees