Linear Regression + Optimization for ML
10-601 Introduction to Machine Learning
Matt Gormley, Lecture 8, Feb. 07, 2020
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Q&A
Q: How can I get more [...] course staff?
A: [...] as possible!
The Big Picture
y = f(x) = x^2 + 1
v* = 1, the minimum value of the function
x* = 0, the argument that yields the minimum value
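In standard optimization notation, and as a one-line check by setting the derivative to zero (not spelled out on the slide):

    v^* = \min_x f(x), \qquad x^* = \operatorname{argmin}_x f(x)
    f'(x) = 2x = 0 \;\Rightarrow\; x^* = 0, \qquad v^* = f(0) = 0^2 + 1 = 1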
Contour Plots
1. Each level curve is labeled with a value
2. The value label indicates the value of the function for all points lying on that level curve
3. Just like a topographical map, but for a function
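As an illustration, a minimal matplotlib sketch that draws labeled level curves for the quadratic objective J(θ1, θ2) used on the following slides (the grid range [0, 1] x [0, 1] is an assumption for plotting):

    import numpy as np
    import matplotlib.pyplot as plt

    # Quadratic objective from the following slides
    def J(theta1, theta2):
        return (10 * (theta1 - 0.5))**2 + (6 * (theta2 - 0.4))**2

    # Grid of (theta1, theta2) values
    t1, t2 = np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200))

    # Draw level curves and label each one with its value, like a topographical map
    cs = plt.contour(t1, t2, J(t1, t2), levels=10)
    plt.clabel(cs, inline=True)
    plt.xlabel("theta1")
    plt.ylabel("theta2")
    plt.show()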
J(θ) = J(θ1, θ2) = (10(θ1 - 0.5))^2 + (6(θ2 - 0.4))^2
Optimization Method #0: Random Guessing
1. Pick a random θ
2. Evaluate J(θ)
3. Repeat steps 1 and 2 many times
4. Return the θ that gives the smallest J(θ)
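A minimal sketch of this method in code, using the toy objective above (the guessing range [0, 1]^2, the number of guesses, and the function names are illustrative assumptions):

    import numpy as np

    def J(theta):
        """Toy objective from the slides: minimized near (0.5, 0.4)."""
        theta1, theta2 = theta
        return (10 * (theta1 - 0.5))**2 + (6 * (theta2 - 0.4))**2

    def random_guessing(J, num_guesses=100, seed=0):
        """Method #0: keep the random theta with the smallest objective value."""
        rng = np.random.default_rng(seed)
        best_theta, best_value = None, float("inf")
        for _ in range(num_guesses):
            theta = rng.uniform(0.0, 1.0, size=2)   # 1. pick a random theta in [0, 1]^2
            value = J(theta)                        # 2. evaluate J(theta)
            if value < best_value:                  # track the best guess seen so far
                best_theta, best_value = theta, value
        return best_theta, best_value               # 4. return theta with smallest J(theta)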
J(θ) = J(θ1, θ2) = (10(θ1 - 0.5))^2 + (6(θ2 - 0.4))^2

    t    θ1     θ2     J(θ1, θ2)
    1    0.2    0.2    10.4
    2    0.3    0.7    7.2
    3    0.6    0.4    1.0
    4    0.9    0.7    19.2
For Linear Regression:
the objective function is the Mean Squared Error (MSE)
J(θ) = J(θ1, θ2) = (1/N) Σ_{i=1..N} ( y(i) - (θ1 x(i) + θ2) )^2
MSE: a lower value means a better fit
goal: find the parameters (w, b) = (θ1, θ2) that best fit some training dataset
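A minimal sketch of this objective in code, assuming the 1D linear hypothesis h(x; θ) = θ1*x + θ2 that matches the (w, b) = (θ1, θ2) pairing above:

    import numpy as np

    def mse(theta, x, y):
        """Mean squared error J(theta1, theta2) for h(x; theta) = theta1*x + theta2."""
        theta1, theta2 = theta
        predictions = theta1 * x + theta2        # h(x(i); theta) for every example
        return np.mean((y - predictions) ** 2)   # lower value = better fit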
[Figure: training examples of # tourists (thousands) vs. time, the unknown true function y = h*(x), and the four guessed hypotheses h(x; θ(1)), h(x; θ(2)), h(x; θ(3)), h(x; θ(4))]
For Linear Regression:
the training examples (x(i), y(i)) are assumed to come from the unknown target function y = h*(x)
the learned hypothesis h(x; θ) approximates h*(x)
the inductive bias restricts the hypothesis class to linear functions
[Figure: the four random guesses h(x; θ(1)), ..., h(x; θ(4)) plotted against the training data (# tourists (thousands) vs. time) and the unknown y = h*(x), alongside their values of J(θ1, θ2) from the table above]
These are the gradients that Gradient Ascent would follow.
These are the negative gradients that Gradient Descent would follow.
Shown are the paths that Gradient Descent would follow if it were making infinitesimally small steps.
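For the quadratic objective used on the earlier contour plots, the gradient (and hence the negative gradient that Gradient Descent follows) works out to:

    J(\theta_1, \theta_2) = \big(10(\theta_1 - 0.5)\big)^2 + \big(6(\theta_2 - 0.4)\big)^2
    \quad\Rightarrow\quad
    \nabla J(\theta) =
    \begin{bmatrix} \partial J / \partial \theta_1 \\ \partial J / \partial \theta_2 \end{bmatrix}
    =
    \begin{bmatrix} 200\,(\theta_1 - 0.5) \\ 72\,(\theta_2 - 0.4) \end{bmatrix}

Gradient Ascent follows +∇J(θ); Gradient Descent follows -∇J(θ), which points toward the minimum at (θ1, θ2) = (0.5, 0.4).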
Slide courtesy of William Cohen
1: procedure GD(D, θ(0))
2:     θ ← θ(0)
3:     while not converged do
4:         θ ← θ - γ ∇θ J(θ)    (γ = step size)
5:     return θ

In order to apply GD to Linear Regression, all we need is the gradient of the objective function (i.e. the vector of partial derivatives):

∇θ J(θ) = [ dJ(θ)/dθ1, dJ(θ)/dθ2, ..., dJ(θ)/dθM ]^T    (one entry per parameter; M entries in total)
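A minimal sketch of this procedure in Python (the function names, fixed step size, and stopping rule are illustrative assumptions, not taken from the slides):

    import numpy as np

    def gradient_descent(grad_J, theta0, step_size=0.01, max_iters=1000, tol=1e-6):
        """Generic gradient descent: repeatedly step opposite the gradient of J."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iters):
            g = grad_J(theta)                  # vector of partial derivatives dJ/dtheta_m
            if np.linalg.norm(g) < tol:        # converged: gradient is (near) zero
                break
            theta = theta - step_size * g      # take a step opposite the gradient
        return theta

For the quadratic example above, grad_J(theta) would return np.array([200 * (theta[0] - 0.5), 72 * (theta[1] - 0.4)]).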
There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient is below some small tolerance. Alternatively, we could check that the reduction in the objective function value J(θ) from one iteration to the next is below some small threshold.
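The two convergence tests just mentioned could look like this in code (a sketch; the tolerance values and function name are illustrative):

    import numpy as np

    def converged(grad, J_prev, J_curr, grad_tol=1e-6, obj_tol=1e-8):
        """Stop when the gradient is (near) zero OR the objective barely decreased."""
        small_gradient = np.linalg.norm(grad) < grad_tol      # L2 norm of the gradient
        small_reduction = abs(J_prev - J_curr) < obj_tol      # reduction in J(theta)
        return small_gradient or small_reduction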
Optimization Method #1: Gradient Descent
1. Pick a random θ
2. Repeat: take a small step opposite the gradient, θ ← θ - γ ∇θ J(θ)
3. Return the θ that gives the smallest J(θ)
J(θ) = J(θ1, θ2) = (10(θ1 - 0.5))^2 + (6(θ2 - 0.4))^2

    t    θ1      θ2      J(θ1, θ2)
    1    0.01    0.02    25.2
    2    0.30    0.12    8.7
    3    0.51    0.30    1.5
    4    0.59    0.43    0.2
[Figure: the hypotheses h(x; θ(1)), ..., h(x; θ(4)) from successive Gradient Descent iterates, plotted against the training data (# tourists (thousands) vs. time) and the unknown y = h*(x)]
[Figure: learning curve of mean squared error, J(θ1, θ2), vs. iteration, t]
Gradient Descent for Linear Regression repeatedly takes steps opposite the gradient of the objective function
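A worked version of that gradient for the simple 1D model h(x; θ) = θ1 x + θ2 with mean squared error (the chain-rule steps are standard calculus, not copied from the slides):

    J(\theta_1, \theta_2) = \frac{1}{N}\sum_{i=1}^{N}\big(y^{(i)} - (\theta_1 x^{(i)} + \theta_2)\big)^2

    \frac{\partial J}{\partial \theta_1} = -\frac{2}{N}\sum_{i=1}^{N}\big(y^{(i)} - (\theta_1 x^{(i)} + \theta_2)\big)\,x^{(i)},
    \qquad
    \frac{\partial J}{\partial \theta_2} = -\frac{2}{N}\sum_{i=1}^{N}\big(y^{(i)} - (\theta_1 x^{(i)} + \theta_2)\big)

Each update θ ← θ - γ ∇θ J(θ) then steps opposite this gradient.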
Convex Function: any local minimum is also a global minimum.
Nonconvex Function: a local minimum is not necessarily a global minimum.
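For reference, the standard definition of convexity these slides rely on (not written out explicitly above):

    f \text{ is convex} \;\iff\;
    f\big(\lambda u + (1-\lambda)v\big) \le \lambda f(u) + (1-\lambda) f(v)
    \quad \text{for all } u, v \text{ and } \lambda \in [0, 1]

The squared-error objective for linear regression is convex in θ, which is why a local minimum found by Gradient Descent is also a global minimum.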
Which of the following could be used as loss functions for training a linear regression model? Select all that apply.
Optimization Method #2: Closed Form
1. Evaluate the closed-form solution for θMLE (set the gradient ∇θ J(θ) to zero and solve for θ)
2. Return θMLE
[Figure: the closed-form (MLE) solution shown as a point on the contour plot of J(θ1, θ2) and as the fitted hypothesis h(x; θ(MLE)) plotted against the training data (# tourists (thousands) vs. time) and the unknown y = h*(x)]

    t      θ1      θ2      J(θ1, θ2)
    MLE    0.59    0.43    0.2

J(θ) = J(θ1, θ2) = (10(θ1 - 0.5))^2 + (6(θ2 - 0.4))^2
Computational Complexity of OLS:
To solve the Ordinary Least Squares problem we compute θ = (X^T X)^(-1) X^T y.
The resulting shapes of the matrices: X is N × M, so X^T X is M × M, X^T y is M × 1, and the solution θ is M × 1.
The cost is linear in the # of examples, N, and polynomial in the # of features, M (forming X^T X takes O(N M^2) and inverting it takes O(M^3)).
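A minimal sketch of the closed-form solution in numpy (using a linear solve rather than an explicit inverse, which is the usual numerically safer choice; X is assumed to be the N × M design matrix, with a column of ones appended if a bias term is wanted):

    import numpy as np

    def ols_closed_form(X, y):
        """Closed-form OLS: solve (X^T X) theta = X^T y for theta."""
        # X: (N, M) design matrix, y: (N,) targets, returned theta: (M,)
        return np.linalg.solve(X.T @ X, X.T @ y)

For the 1D example, X = np.column_stack([x, np.ones_like(x)]) would give back (w, b) = (θ1, θ2).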
[Figure: empirical convergence of Gradient Descent, SGD, and the closed-form (normal equations) solution; figure adapted from Eric P. Xing]
SGD reduces the MSE much more rapidly than GD.
The MSE is initially large due to the uninformed (random) initialization.
An epoch is a single pass through the training data:
1. For GD, there is only one update per epoch.
2. For SGD, there are N updates per epoch, where N = (# train examples) (see the sketch below).
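To make the "N updates per epoch" point concrete, a minimal sketch of one SGD epoch for the 1D linear model (the step size, function name, and per-example gradient, which follows the MSE gradient given earlier, are illustrative assumptions):

    import numpy as np

    def sgd_epoch(theta, x, y, step_size=0.01):
        """One SGD epoch: one parameter update per training example (N updates total)."""
        theta1, theta2 = theta
        for i in np.random.permutation(len(x)):            # visit examples in random order
            residual = y[i] - (theta1 * x[i] + theta2)      # y(i) - h(x(i); theta)
            theta1 += step_size * 2 * residual * x[i]       # step opposite d/dtheta1 of the squared error
            theta2 += step_size * 2 * residual              # step opposite d/dtheta2 of the squared error
        return np.array([theta1, theta2])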