Aykut Erdem // Hacettepe University // Fall 2019
illustration: detail from xkcd strip #2048
Lecture 4:
Linear Regression, Optimization, Generalization, Model complexity, Regularization
Recall from last time… Kernel Regression
1-NN for Regression
[Figure: 1-NN for regression; at each query point, "here, this is the closest" training example determines the prediction]
− Training data: $(y_1, z_1), \ldots, (y_n, z_n)$, where $y_j \in Y$ are the inputs and $z_j \in \mathbb{R}$ are the targets
− A distance function $L : Y \times Y \to \mathbb{R}$
− For a query $y'$, predict the target of the training example minimizing $L(y_j, y')$
Weighted K-NN for Regression
− Distance metric, e.g. the Minkowski distance: $D = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
− Weights from a kernel: $w_i = \exp\left( -d(x_i, \text{query})^2 / \sigma^2 \right)$, where $\sigma$ is the kernel width
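A minimal Python sketch of this idea (the function name and the Euclidean choice of distance are illustrative assumptions; the slide leaves the metric open):

```python
import numpy as np

def weighted_knn_regress(X_train, t_train, x_query, k=5, sigma=1.0):
    """Weighted k-NN regression: predict a kernel-weighted average of the
    targets of the k nearest training examples."""
    d = np.linalg.norm(X_train - x_query, axis=1)   # distances to the query
    nn = np.argsort(d)[:k]                          # indices of k nearest neighbours
    w = np.exp(-d[nn] ** 2 / sigma ** 2)            # kernel weights
    return np.sum(w * t_train[nn]) / np.sum(w)      # weighted average of targets
```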
$t(x) = f(x) + \varepsilon$, with $\varepsilon$ some noise
slide by Sanja Fidler
− How do we parametrize the model (the curve)?
− What loss (objective) function should we use to judge the fit?
− How do we optimize fit to unseen test data (generalization)?
slide by Sanja Fidler
statistics
https://archive.ics.uci.edu/ml/datasets/Housing
slide by Sanja Fidler
− x is the input feature (per capita crime rate)
− t is the target output (median house price)
− (i) simply indicates the training examples (we have N in this case)
y(x) = w0 + w1x
− Use the training examples to construct a hypothesis, or function approximator, that maps x to a predicted y
− Evaluate the hypothesis on the test set
slide by Sanja Fidler
Noise: whatever the model cannot fit can be considered noise. Sources of noise:
− Imprecision in data attributes (input noise, e.g. noise in per-capita crime)
− Errors in data targets (mislabeling, e.g. noise in house prices)
− Additional attributes not taken into account by the data attributes that affect the target values (latent variables). In the example, what else could affect house prices?
− The model may be too simple to account for the data targets
slide by Sanja Fidler
Least-Squares Regression
− Define a model: $y(x) = \text{function}(x, w)$
− Linear: $y(x) = w_0 + w_1 x$
− Standard loss / cost / objective function measures the squared error between $y$ and the true value $t$:
$\ell(w) = \sum_{n=1}^{N} \left[ t^{(n)} - y(x^{(n)}) \right]^2$
− For the linear model:
$\ell(w) = \sum_{n=1}^{N} \left[ t^{(n)} - (w_0 + w_1 x^{(n)}) \right]^2$
− For a particular hypothesis ($y(x)$ defined by a choice of $w$, drawn in red), what does the loss represent geometrically? The sum of squared errors (squared lengths of green vertical lines)
− We want to find the $w = (w_0, w_1)$ that minimizes the loss $\ell(w)$
slide by Sanja Fidler
Optimizing the Objective
− One straightforward method: gradient descent
− Initialize w (e.g., randomly)
− Repeatedly update w based on the gradient, until convergence (w stops changing)
slide by Sanja Fidler
$w \leftarrow w - \lambda \frac{\partial \ell}{\partial w}$
For a single training example this gives the update rule $w \leftarrow w + 2\lambda \left( t^{(n)} - y(x^{(n)}) \right) x^{(n)}$, where $\lambda$ is the learning rate.
[Figure: gradient descent on the error surface]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
− Oscillations are also possible (e.g., if the step size is too large)
[Figure: loss $\ell(w)$ as a function of $w_0$, showing convergence vs. oscillation]
slide by Erik Sudderth
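As a concrete illustration, a minimal batch gradient-descent sketch for the 1-D linear model (the synthetic data and the learning rate value are assumptions for this example):

```python
import numpy as np

# Synthetic 1-D data: t = 2x + 1 plus noise (illustrative)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
t = 2 * x + 1 + 0.1 * rng.normal(size=50)

w0, w1, lam = 0.0, 0.0, 0.1           # initialize w; lam is the learning rate
for _ in range(2000):                 # repeat until w (roughly) stops changing
    err = t - (w0 + w1 * x)           # t(n) - y(x(n)) for every example
    w0 += 2 * lam * err.mean()        # averaged gradient step for the bias
    w1 += 2 * lam * (err * x).mean()  # averaged gradient step for the slope
print(w0, w1)                         # approaches (1, 2); too large a lam oscillates
```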
Optimizing Across Training Set
Two ways to generalize the update for all examples in the training set:
1. Batch updates: sum or average updates across every example n, then change the parameter values
2. Stochastic/online updates: update the parameters for each training case in turn, according to its own gradients
Algorithm 1 Stochastic gradient descent
1: Randomly shuffle examples in the training set
2: for i = 1 to N do
3:   Update: w ← w + 2λ(t(i) − y(x(i)))x(i)   (update for a linear model)
4: end for
slide by Sanja Fidler
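A direct Python transcription of Algorithm 1 (a sketch; the design matrix X is assumed to carry a leading column of ones for the bias, and lam stands in for the learning rate λ):

```python
import numpy as np

def sgd_epoch(X, t, w, lam=0.01):
    """One epoch of stochastic gradient descent for a linear model y = w^T x."""
    for i in np.random.permutation(len(t)):     # 1: randomly shuffle examples
        y_i = X[i] @ w                          # prediction for example i
        w = w + 2 * lam * (t[i] - y_i) * X[i]   # 3: update for a linear model
    return w
```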
Applied to each training case in turn, the update is $w \leftarrow w + 2\lambda \left( t^{(n)} - y(x^{(n)}) \right) x^{(n)}$.
Underlying assumption: the training examples are sampled independently and identically distributed (i.i.d.)
slide by Sanja Fidler
Analytical Solution
− For some objectives we can also find the optimal solution analytically
− This is the case for linear least-squares regression
slide by Sanja Fidler
Vectorization
− $y(x) = w_0 + w_1 x$, with $w = [w_0, w_1]^T$ and $x = [1, x]^T$, so that $y(x) = w^T x$
slide by Sanja Fidler
$\ell(w) = \sum_{n=1}^{N} \left[ w^T x^{(n)} - t^{(n)} \right]^2 = (Xw - t)^T (Xw - t)$
where $t = \left[ t^{(1)}, t^{(2)}, \ldots, t^{(N)} \right]^T \in \mathbb{R}^{N \times 1}$, $w = [w_0, w_1]^T \in \mathbb{R}^{2 \times 1}$, and $X = \begin{bmatrix} 1, x^{(1)} \\ 1, x^{(2)} \\ \vdots \\ 1, x^{(N)} \end{bmatrix} \in \mathbb{R}^{N \times 2}$
− Notice the solution is when $\frac{\partial}{\partial w} \ell(w) = 0$
− Take the derivative and set it equal to 0, then solve for $w$:
$\ell(w) = (Xw - t)^T (Xw - t) = w^T X^T X w - t^T X w - w^T X^T t + t^T t = w^T X^T X w - 2 w^T X^T t + t^T t$
$\frac{\partial}{\partial w} \ell(w) = 2 X^T X w - 2 X^T t = 0$
Closed Form Solution: $w = (X^T X)^{-1} X^T t$
If $X^T X$ is not invertible (i.e., singular), we may need to:
− Use the pseudo-inverse instead of the inverse (in Python, numpy.linalg.pinv(a))
− Remove redundant (not linearly independent) features
− Remove extra features to ensure that d ≤ N
slide by Sanja Fidler
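In numpy, the closed-form solution takes a few lines (a sketch on assumed synthetic data; pinv covers the singular case discussed above):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
t = 2 * x + 1 + 0.1 * rng.normal(size=100)

X = np.column_stack([np.ones_like(x), x])   # design matrix with a bias column
w = np.linalg.pinv(X.T @ X) @ X.T @ t       # w = (X^T X)^{-1} X^T t
print(w)                                    # approximately [1.0, 2.0]
```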
Linear Regression with Multi-dimensional Inputs
− A linear model over these multi-dimensional observations: $y(x) = w_0 + w_1 x_1 + w_2 x_2$
− We can still compute w analytically (how does the solution change?)
slide by Sanja Fidler
$x^{(n)} = \left( x_1^{(n)}, \ldots, x_j^{(n)}, \ldots, x_d^{(n)} \right)$
$y(x) = w_0 + \sum_{j=1}^{d} w_j x_j = w^T x$
recall: $w = (X^T X)^{-1} X^T t$
− What if we need a more complicated model?
slide by Sanja Fidler
− One option: use features that are combinations of the components of x
− E.g., fit a polynomial in the input feature x, where $x^j$ is the j-th power of x:
slide by Sanja Fidler
$y(x, w) = w_0 + \sum_{j=1}^{M} w_j x^j$
Some types of basis functions in 1-D: Sigmoids, Gaussians, Polynomials
− Gaussians: $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2s^2} \right)$
− Sigmoids: $\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$
slide by Erik Sudderth
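These three families are easy to write down directly (a small sketch; the parameter names mu and s follow the formulas above):

```python
import numpy as np

def poly_basis(x, j):
    return x ** j                                   # polynomial: j-th power of x

def gauss_basis(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2))    # Gaussian bump centred at mu

def sigmoid_basis(x, mu, s):
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))      # smooth step centred at mu
```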
Two types of linear model that are equivalent with respect to learning:
$y(x, w) = w_0 + w_1 x_1 + w_2 x_2 + \ldots = w^T x$
$y(x, w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \ldots = w^T \Phi(x)$
(here $w_0$ is the bias)
− The first model has the same number of adaptive coefficients as the dimensionality of the data +1.
− The second model has the same number of adaptive coefficients as the number of basis functions +1.
− Once we have replaced the data by the outputs of the basis functions, fitting the second model is exactly the same problem as fitting the first model (unless we use the kernel trick)
slide by Erik Sudderth
General linear regression problem
− Using our new notation for the basis functions, linear regression can be written as
$y = \sum_{j=0}^{n} w_j \phi_j(x)$
− where $\phi_j(x)$ can be either $x_j$ for multivariate regression or one of the nonlinear basis functions
slide by E. P. Xing

LMS for the general linear regression problem
− Our goal is to minimize the following loss function:
$J(w) = \sum_i \left( y_i - w^T \phi(x_i) \right)^2$
where $\phi(x_i)$ is a vector of dimension k+1 and $y_i$ is a scalar
− We take the derivative w.r.t. $w$:
$\frac{\partial}{\partial w} \sum_i \left( y_i - w^T \phi(x_i) \right)^2 = -2 \sum_i \left( y_i - w^T \phi(x_i) \right) \phi(x_i)^T$
− Equating to 0 we get:
$\sum_i y_i \phi(x_i)^T = w^T \sum_i \phi(x_i) \phi(x_i)^T$
− Moving to vector notation, with
$\Phi = \begin{bmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_m(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_m(x_2) \\ \vdots & & & \vdots \\ \phi_0(x_n) & \phi_1(x_n) & \cdots & \phi_m(x_n) \end{bmatrix}$
we get:
$w = (\Phi^T \Phi)^{-1} \Phi^T y$
where $\Phi$ is an n by k+1 matrix, $y$ an n-entries vector, and $w$ a k+1-entries vector. This solution is also known as the 'pseudo-inverse'.
slide by E. P. Xing
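A sketch of this pseudo-inverse solution with a polynomial feature map standing in for φ (the degree and the synthetic data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=30)

M = 3
Phi = np.vander(x, M + 1, increasing=True)   # rows: [1, x, x^2, ..., x^M]
w = np.linalg.pinv(Phi) @ y                  # w = (Phi^T Phi)^{-1} Phi^T y
print(w)
```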
[Figures; slides by Erik Sudderth]
slide by Sanja Fidler from Bishop
[Figure: polynomial fits for M = 0, 1, 3, 9 (from Bishop)]
$E(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, w) - t_n \right\}^2$
$E_{RMS} = \sqrt{2 E(w^\star) / N}$
The division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that $E_{RMS}$ is measured on the same scale (and in the same units) as the target variable t.
slide by Erik Sudderth
Root-Mean-Square (RMS) Error:
$E(w) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - \phi(x_n)^T w \right)^2 = \frac{1}{2} \| t - \Phi w \|^2$
[Figure: training and test $E_{RMS}$ vs. polynomial order M]
slide by Erik Sudderth
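The qualitative behaviour in this plot can be reproduced by sweeping the polynomial order and computing training and test RMS error (a sketch; the sin-curve data mirrors Bishop's running example):

```python
import numpy as np

rng = np.random.default_rng(1)
def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

x_tr, t_tr = make_data(10)     # small training set
x_te, t_te = make_data(100)    # held-out test set

for M in range(10):
    Phi = np.vander(x_tr, M + 1, increasing=True)
    w = np.linalg.pinv(Phi) @ t_tr                         # least-squares fit
    for name, (xs, ts) in [("train", (x_tr, t_tr)), ("test", (x_te, t_te))]:
        y = np.vander(xs, M + 1, increasing=True) @ w
        print(M, name, np.sqrt(np.mean((y - ts) ** 2)))    # RMS error; test error grows for large M
```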
− Overfitting becomes more severe when we have fewer examples
slide by Sanja Fidler
1-D regression illustrates key concepts:
− Simplest models do not capture all the important variations (signal) in the data: underfit
− More complex models may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model
− Test generalization = the model's ability to predict held-out data
− Fitting the model: iterative approaches; analytic when available
slide by Richard Zemel
− One way to avoid overfitting is to encourage the weights to be small (this way no input dimension will have too much influence on prediction). This is called regularization.
slide by Sanja Fidler
− Regularization adds a penalty term to the error function in order to discourage the coefficients from reaching large values
$\tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, w) - t_n \right\}^2 + \frac{\lambda}{2} \| w \|^2$
where $\| w \|^2 = w^T w = w_0^2 + w_1^2 + \ldots + w_M^2$, and the coefficient $\lambda$ governs the relative importance of the regularization term compared with the sum-of-squares error term.
This is known as ridge regression, which is minimized by
$w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t$
slide by Erik Sudderth
[Figure: M = 9 polynomial fits with ln λ = −18 and ln λ = 0 (from Bishop)]
slide by Erik Sudderth
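A sketch of the ridge solution (the solve call implements the formula above; the usage line with the degree-9 design matrix and ln λ = −18 is an illustrative assumption):

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Minimize the penalized least-squares objective:
    returns w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    k = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(k) + Phi.T @ Phi, Phi.T @ t)

# e.g. with a degree-9 polynomial design matrix:
# w = ridge_fit(np.vander(x_tr, 10, increasing=True), t_tr, lam=np.exp(-18))
```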
[Figure: training and test $E_{RMS}$ vs. ln λ (from Bishop)]
          ln λ = −∞    ln λ = −18    ln λ = 0
w⋆0             0.35          0.35        0.13
w⋆1           232.37          4.74
w⋆2
w⋆3         48568.31
w⋆4
w⋆5        640042.26         55.28
w⋆6                          41.32
w⋆7       1042400.18
w⋆8                                       0.00
w⋆9        125201.43         72.68        0.01
The corresponding coefficients from the fitted polynomials, showing that regularization has the desired effect of reducing the magnitude of the coefficients.
slide by Erik Sudderth
A more general regularizer:
$\frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - w^T \phi(x_n) \right\}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$
[Figure: contours of the regularization term for q = 0.5, 1, 2, 4]
slide by Richard Zemel
Machine Learning Methodology