Introduction to Machine Learning: Linear Regression


SLIDE 1

Introduction to Machine Learning
Linear Regression

  • Prof. Andreas Krause

Learning and Adaptive Systems (las.ethz.ch)

SLIDE 2

Basic Supervised Learning Pipeline

[Pipeline diagram] Training data (emails labeled "spam" / "ham") → Learning method → Model \hat{f} : X \to Y → Prediction on test data (labels "?").

Model fitting happens on the training data; prediction/generalization happens on the test data.

SLIDE 3

Regression

Regression is an instance of supervised learning. Goal: predict real-valued labels (possibly vectors). Examples:

  X                        Y
  Flight route             Delay (minutes)
  Real estate objects      Price
  Customer & ad features   Click-through probability

SLIDE 4

Running example: Diabetes

[Efron et al ‘04]

Features X:

  • Age
  • Sex
  • Body mass index
  • Average blood pressure
  • Six blood serum measurements (S1-S6)

Label (target) Y: a quantitative measure of disease progression

SLIDE 5

Regression

Goal: learn a real-valued mapping f : \mathbb{R}^d \to \mathbb{R}

[Scatter plot of data points over axes x and y]

SLIDE 6

Important choices in regression

  • What types of functions f should we consider?
  • How should we measure goodness of fit?

[Two example plots of candidate fits f(x) over x]

SLIDE 7

Example: linear regression

[Scatter plot of data points over axes x and y]

SLIDE 8

Homogeneous representation

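The formulas on this slide are not preserved in the transcript; a standard statement of the homogeneous representation (a sketch, not the slide's exact notation) absorbs the offset b into the weight vector by appending a constant 1 to each input:

f(x) = w^\top x + b = \tilde{w}^\top \tilde{x}, \qquad \tilde{w} = (w_1, \dots, w_d, b)^\top, \quad \tilde{x} = (x_1, \dots, x_d, 1)^\top

This way, a linear model with offset can be treated as a purely linear (homogeneous) model in d+1 dimensions.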

SLIDE 9

Quantifying goodness of fit

Data: D = \{(x_1, y_1), \dots, (x_n, y_n)\}, with x_i \in \mathbb{R}^d and y_i \in \mathbb{R}

[Scatter plot of data points over axes x and y]

SLIDE 10

Least-squares linear regression optimization

[Legendre 1805, Gauss 1809]

Given a data set D = \{(x_1, y_1), \dots, (x_n, y_n)\}, how do we find the optimal weight vector?

\hat{w} = w^* = \arg\min_w \sum_{i=1}^n (y_i - w^\top x_i)^2
SLIDE 11

Method 1: Closed form solution

The least-squares problem

\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - w^\top x_i)^2

can be solved in closed form:

\hat{w} = (X^\top X)^{-1} X^\top y

Hereby, X \in \mathbb{R}^{n \times d} is the design matrix whose rows are the x_i^\top, and y = (y_1, \dots, y_n)^\top is the vector of labels.
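As a minimal sketch (not from the slides; data and variable names are illustrative), the closed form can be computed in NumPy. Solving the normal equations with np.linalg.solve is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

# Illustrative data: n samples with d features
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))          # design matrix, rows are x_i^T
w_true = np.array([1.0, -2.0, 0.5])  # ground-truth weights for the toy data
y = X @ w_true + 0.1 * rng.normal(size=n)

# w* = (X^T X)^{-1} X^T y, computed by solving the normal equations
# X^T X w = X^T y rather than inverting X^T X
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true
```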
SLIDE 12

How to solve? Example: Scikit Learn

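The code on this slide is not preserved; a minimal scikit-learn version, assuming the diabetes running example (which ships with sklearn.datasets), might look like this:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Running example: diabetes data [Efron et al '04]
X, y = load_diabetes(return_X_y=True)

# Ordinary least-squares linear regression
model = LinearRegression()
model.fit(X, y)

print(model.coef_)       # learned weight vector w
print(model.intercept_)  # learned offset b
```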

SLIDE 13

Demo

[Plot: disease progression vs. body mass index]

SLIDE 14

Method 2: Optimization

The objective function

\hat{R}(w) = \sum_i (y_i - w^\top x_i)^2

is convex!
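One standard way to verify the convexity claim (an argument added here, not shown on the slide): writing the objective in matrix form, its Hessian is positive semi-definite,

\hat{R}(w) = \|y - Xw\|_2^2, \qquad \nabla^2 \hat{R}(w) = 2\, X^\top X \succeq 0,

since v^\top (2 X^\top X) v = 2 \|Xv\|_2^2 \ge 0 for every v.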

SLIDE 15

Gradient Descent

Start at an arbitrary w_0 \in \mathbb{R}^d. For t = 1, 2, \dots do:

w_{t+1} = w_t - \eta_t \nabla \hat{R}(w_t)

Hereby, \eta_t is called the learning rate.
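A minimal sketch of this update for the squared-loss objective (names and defaults are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(X, y, eta=0.001, num_steps=1000):
    """Gradient descent on R(w) = sum_i (y_i - w^T x_i)^2."""
    w = np.zeros(X.shape[1])           # arbitrary starting point w_0
    for _ in range(num_steps):
        grad = -2 * X.T @ (y - X @ w)  # gradient of the squared loss
        w = w - eta * grad             # update: w_{t+1} = w_t - eta_t * grad
    return w
```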

SLIDE 16

Convergence of gradient descent

Under mild assumptions, if the step size is sufficiently small, gradient descent converges to a stationary point (gradient = 0). For convex objectives, it therefore finds the optimal solution!

In the case of the squared loss, gradient descent with constant step size ½ converges linearly.

SLIDE 17

Computing the gradient

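The worked computation on this slide is not preserved in the transcript; the standard derivation for the squared loss is:

\nabla \hat{R}(w) = \sum_{i=1}^n \nabla_w (y_i - w^\top x_i)^2 = -2 \sum_{i=1}^n (y_i - w^\top x_i)\, x_i = 2\, X^\top (Xw - y)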

SLIDE 18

Demo: Gradient descent


SLIDE 19

Choosing a stepsize

What happens if we choose a poor stepsize?

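The slide's demo is not reproduced here; the following toy experiment (all values illustrative) shows the two failure modes on a one-dimensional least-squares problem: too large a step diverges, too small a step barely moves:

```python
def run_gd(eta, num_steps=20):
    """Gradient descent on R(w) = (1 - w)^2, whose minimizer is w* = 1."""
    w = 0.0
    for _ in range(num_steps):
        grad = -2 * (1.0 - w)  # dR/dw
        w -= eta * grad
    return w

print(run_gd(eta=0.4))    # converges quickly toward w* = 1
print(run_gd(eta=1.1))    # too large: iterates oscillate and blow up
print(run_gd(eta=0.001))  # too small: barely moves in 20 steps
```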

SLIDE 20

Adaptive step size

We can update the step size adaptively. For example:
1) Via line search (optimizing the step size in every step)
2) "Bold driver" heuristic: if the function value decreases, increase the step size; if it increases, decrease the step size (see the sketch below).
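The update rules themselves are cut off in the transcript; a common form of the bold-driver heuristic (the factors 1.1 and 0.5 are conventional choices, not taken from the slide) is sketched below. In practice, the step that increased the objective is often also rejected.

```python
def bold_driver_step(eta, loss_prev, loss_curr, inc=1.1, dec=0.5):
    """Adapt the step size based on whether the objective improved."""
    if loss_curr < loss_prev:
        return eta * inc  # function decreased: increase step size
    return eta * dec      # function increased: decrease step size
```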


SLIDE 21

Demo: Gradient Descent for Linear Regression


SLIDE 22

Gradient descent vs closed form

Why would one ever consider performing gradient descent when it is possible to find a closed-form solution?

  • Computational complexity (see below)
  • We may not need an optimal solution
  • Many problems don't admit a closed-form solution
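To make the computational-complexity point concrete (standard operation counts, not stated on the slide): forming and solving the normal equations costs on the order of

\mathcal{O}(nd^2 + d^3)

while one gradient-descent iteration costs only \mathcal{O}(nd), which matters when the dimension d is large.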


SLIDE 23

Other loss functions

So far, we measured goodness of fit via the squared error. Many other loss functions are possible (and sensible!):
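Two common alternatives (standard examples, not listed in this transcript) are the absolute (L1) loss and the Huber loss, which are less sensitive to outliers than the squared error:

\ell_{\mathrm{abs}}(y, \hat{y}) = |y - \hat{y}|, \qquad
\ell_{\delta}(y, \hat{y}) = \begin{cases} \tfrac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta \\ \delta\,|y - \hat{y}| - \tfrac{1}{2}\delta^2 & \text{otherwise} \end{cases}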
