Lecture 4:
− Linear Regression (cont’d.)
− Optimization
− Generalization
− Model complexity
− Regularization
Aykut Erdem
October 2017, Hacettepe University
Administrative
Assignment 1 is out! It is due October 20 (i.e., in two weeks). It involves:
− Pencil-and-paper derivations
− Implementing a kNN classifier
− numpy/Python code
− 5033 train, 1000 test images
− Features: attributes, color histogram, HOG features, deep CNN features
[Image: Hooded Oriole (Icterus cucullatus)]
adapted from Sanja Fidler
1-NN for Regression
[Figure: 1-NN regression in one dimension; for each query point x, "here, this is the closest" training example, whose target gives the predicted y]
Given training data $(y_1, z_1), \ldots, (y_n, z_n)$ with
− inputs $y_j \in Y$
− targets $z_j \in \mathbb{R}$
and a distance function $L : Y \times Y \to \mathbb{R}$, predict at a query $y'$ using the training point that minimizes $L(y_j, y')$.
Weighted K-NN for Regression
Distance metrics, e.g., the Minkowski distance $D = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$, and kernel weights $w_i = \exp\left( -d(x_i, \mathrm{query})^2 / \sigma^2 \right)$, where $\sigma$ is the kernel width.
slide by Sanja Fidler
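To make the weighted k-NN predictor concrete, here is a minimal numpy sketch (the function names, toy data, and hyperparameter values are our illustrative choices, not from the slides):

```python
import numpy as np

def weighted_knn_regress(X_train, z_train, x_query, k=5, sigma=1.0):
    """Predict a real-valued target as a kernel-weighted average of the
    k nearest training targets, with Gaussian kernel of width sigma."""
    # Euclidean distances from the query to every training point
    d = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training examples
    nn = np.argsort(d)[:k]
    # Gaussian kernel weights: w_i = exp(-d_i^2 / sigma^2)
    w = np.exp(-d[nn] ** 2 / sigma ** 2)
    # Weighted average of the neighbors' targets
    return np.sum(w * z_train[nn]) / np.sum(w)

# Toy 1-D example: noisy samples of sin(x)
rng = np.random.RandomState(0)
X = rng.uniform(0, 2 * np.pi, size=(50, 1))
z = np.sin(X[:, 0]) + 0.1 * rng.randn(50)
print(weighted_knn_regress(X, z, np.array([np.pi / 2]), k=5, sigma=0.5))
```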
Model: $y(x) = \mathrm{function}(x, w)$
Linear model: $y(x) = w_0 + w_1 x$
The loss measures the error between the prediction $y$ and the true value $t$. If we fit a line to the data (shown in red), what does the loss represent geometrically? It is the sum of squared errors, i.e., the squared lengths of the green vertical lines:
$$\ell(w) = \sum_{n=1}^{N} \left[ t^{(n)} - y(x^{(n)}) \right]^2 = \sum_{n=1}^{N} \left[ t^{(n)} - \left( w_0 + w_1 x^{(n)} \right) \right]^2$$
slide by Sanja Fidler
Gradient descent:
− initialize $w$ (e.g., randomly)
− repeatedly update $w$ based on the gradient: $w \leftarrow w - \lambda \frac{\partial \ell}{\partial w}$
− stop when converged ($w$ stops changing)
For the linear model, the per-example update is $w \leftarrow w + 2\lambda \left( t^{(n)} - y(x^{(n)}) \right) x^{(n)}$, where $t^{(n)} - y(x^{(n)})$ is the error; see the sketch below.
slide by Sanja Fidler
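A minimal numpy sketch of batch gradient descent on the squared loss for this 1-D linear model (the learning rate, tolerance, and the averaging of the gradient over examples are illustrative choices):

```python
import numpy as np

def fit_linear_gd(x, t, lam=0.01, tol=1e-8, max_iters=100000):
    """Fit y(x) = w0 + w1*x by gradient descent on the squared loss."""
    X = np.column_stack([np.ones_like(x), x])  # N x 2 design matrix with bias
    w = np.zeros(2)                            # initialize w (here: zeros)
    for _ in range(max_iters):
        err = t - X @ w                        # errors t^(n) - y(x^(n))
        grad = -2 * X.T @ err / len(t)         # gradient, averaged over examples
        w_new = w - lam * grad                 # gradient step
        if np.max(np.abs(w_new - w)) < tol:    # stop when w stops changing
            break
        w = w_new
    return w
```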
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Oscillations are also possible, e.g., when the learning rate is too large.
slide by Erik Sudderth
Stochastic (online) gradient descent: rather than computing the gradient on the whole training set and only then changing the parameter values, update the parameters for each training case in turn, according to its own gradients.
Algorithm 1: Stochastic gradient descent
1: Randomly shuffle the examples in the training set
2: for i = 1 to N do
3:    Update: $w \leftarrow w + 2\lambda \left( t^{(i)} - y(x^{(i)}) \right) x^{(i)}$ (update for a linear model)
4: end for
slide by Sanja Fidler
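A direct numpy transcription of Algorithm 1, performing one epoch of SGD (the function and variable names are ours):

```python
import numpy as np

def sgd_epoch(X, t, w, lam=0.01):
    """One pass of Algorithm 1 for the linear model y(x) = w^T x.
    X is N x d with a leading column of ones; lam is the learning rate."""
    for i in np.random.permutation(len(t)):   # 1: shuffle the training set
        err = t[i] - X[i] @ w                 # error on training case i
        w = w + 2 * lam * err * X[i]          # 3: per-example update
    return w
```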
Assuming the training examples are independent and identically distributed (i.i.d.), can we solve for the optimal $w$ analytically?
slide by Sanja Fidler
Vector notation: writing $w = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$ and augmenting the input as $x = [1, x]^T$, the model $y(x) = w_0 + w_1 x$ becomes $y(x) = w^T x$.
slide by Sanja Fidler
$$\ell(w) = \sum_{n=1}^{N} \left[ w^T x^{(n)} - t^{(n)} \right]^2 = (Xw - t)^T (Xw - t)$$
where $t = \left[ t^{(1)}, t^{(2)}, \ldots, t^{(N)} \right]^T \in \mathbb{R}^{N \times 1}$, $X = \begin{bmatrix} 1 & x^{(1)} \\ 1 & x^{(2)} \\ \vdots & \vdots \\ 1 & x^{(N)} \end{bmatrix} \in \mathbb{R}^{N \times 2}$, and $w = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} \in \mathbb{R}^{2 \times 1}$.
− Notice the solution is where $\frac{\partial}{\partial w} \ell(w) = 0$
− Take the derivative and set it equal to 0, then solve for $w$:
$$\ell(w) = (Xw - t)^T (Xw - t) = w^T X^T X w - t^T X w - w^T X^T t + t^T t = w^T X^T X w - 2 w^T X^T t + t^T t$$
$$\frac{\partial}{\partial w} \ell(w) = 2 X^T X w - 2 X^T t = 0$$
Closed Form Solution: $w = \left( X^T X \right)^{-1} X^T t$
If $X^T X$ is not invertible (i.e., singular), we may need to:
− use the pseudo-inverse instead of the inverse (in Python, numpy.linalg.pinv(a))
− remove redundant (not linearly independent) features
− ensure that $d \le N$, i.e., have more examples than feature dimensions
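A short numpy sketch of this closed-form fit, using the pseudo-inverse as suggested above (the toy data are ours):

```python
import numpy as np

def fit_linear_closed_form(x, t):
    """Closed form w = (X^T X)^{-1} X^T t for the model y(x) = w0 + w1*x."""
    X = np.column_stack([np.ones_like(x), x])  # N x 2 design matrix
    # pinv gives the least-squares solution even when X^T X is singular
    return np.linalg.pinv(X) @ t

x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.1, 6.9])             # roughly t = 1 + 2x
print(fit_linear_closed_form(x, t))            # approx. [1.0, 2.0]
```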
slide by Sanja Fidler
What if the input is multi-dimensional, e.g., $y(x) = w_0 + w_1 x_1 + w_2 x_2$? We can still fit these multi-dimensional observations and compute $w$ analytically (how does the solution change?).
slide by Sanja Fidler
For $x^{(n)} = \left( x_1^{(n)}, \ldots, x_j^{(n)}, \ldots, x_d^{(n)} \right)$, the model is
$$y(x) = w_0 + \sum_{j=1}^{d} w_j x_j = w^T x$$
and the solution keeps the same form; recall: $w = \left( X^T X \right)^{-1} X^T t$, now with $X \in \mathbb{R}^{N \times (d+1)}$.
What if we want a more complicated model?
slide by Sanja Fidler
What if we want a more complicated model? We can use features that are combinations of the components of $x$, for example a polynomial in a scalar feature $x$, where $x^j$ is the $j$-th power of $x$.
slide by Sanja Fidler
$$y(x, w) = w_0 + \sum_{j=1}^{M} w_j x^j$$
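A sketch of fitting this polynomial model with the closed-form solution (the degree M and the toy data are illustrative):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of y(x, w) = w0 + sum_{j=1}^{M} w_j x^j."""
    X = np.vander(x, M + 1, increasing=True)  # columns: x^0, x^1, ..., x^M
    return np.linalg.pinv(X) @ t

rng = np.random.RandomState(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.randn(10)  # noisy sinusoid toy data
w = fit_polynomial(x, t, M=3)
```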
Other basis functions are possible, e.g., Gaussian
$$\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)$$
or sigmoidal
$$\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right), \quad \text{where} \quad \sigma(a) = \frac{1}{1 + \exp(-a)}.$$
slide by Erik Sudderth
Both models are linear in the parameters and include a bias term. In the first, the dimension of $w$ is the dimensionality of the data + 1; in the second, it is the number of basis functions + 1. Once we replace the data by the outputs of the basis functions, fitting the second model is exactly the same problem as fitting the first model (unless we use the kernel trick).
slide by Erik Sudderth
Non-linear basis functions: our goal is to minimize the following loss function:
$$J(w) = \sum_i \left( y_i - \sum_{j=0}^{k} w_j \phi_j(x_i) \right)^2$$
Moving to vector notation we get:
$$J(w) = \sum_i \left( y_i - w^T \phi(x_i) \right)^2$$
where $\phi(x_i)$ is a vector of dimension $k+1$ and $y_i$ is a scalar. We take the derivative w.r.t. $w$:
$$\frac{\partial}{\partial w} J(w) = -2 \sum_i \phi(x_i) \left( y_i - w^T \phi(x_i) \right)$$
Equating to 0 we get:
$$\sum_i \phi(x_i)\, y_i = \left( \sum_i \phi(x_i)\, \phi(x_i)^T \right) w$$
slide by E. P. Xing
Defining the matrix of basis-function outputs
$$\Phi = \begin{bmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_m(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_m(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_n) & \phi_1(x_n) & \cdots & \phi_m(x_n) \end{bmatrix}$$
slide by E. P. Xing
we get:
$$w = \left( \Phi^T \Phi \right)^{-1} \Phi^T y$$
where $\Phi$ is an $n \times (k+1)$ matrix, $y$ is a vector with $n$ entries, and $w$ is a vector with $k+1$ entries. This solution is also known as the 'pseudo-inverse'.
slide by E. P. Xing
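A compact numpy sketch of this pseudo-inverse solution with Gaussian basis functions (the centers mu and width s are illustrative choices):

```python
import numpy as np

def gaussian_design_matrix(x, mu, s):
    """Rows are phi(x_i) = [1, phi_1(x_i), ..., phi_k(x_i)] with Gaussian bases."""
    Phi = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * s ** 2))
    return np.column_stack([np.ones_like(x), Phi])  # phi_0(x) = 1 is the bias

def fit_basis(x, y, mu, s):
    Phi = gaussian_design_matrix(x, mu, s)          # n x (k+1)
    return np.linalg.pinv(Phi) @ y                  # w = (Phi^T Phi)^{-1} Phi^T y

x = np.linspace(0.0, 1.0, 25)
y = np.sin(2 * np.pi * x)
w = fit_basis(x, y, mu=np.linspace(0, 1, 9), s=0.1)
```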
slide by Erik Sudderth
slide by Sanja Fidler from Bishop
[Figure: polynomial fits of order M = 0, 1, 3, 9 to data $(x, t)$, as in Bishop, Fig. 1.4]
$$E(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, w) - t_n \right\}^2, \qquad E_{\mathrm{RMS}} = \sqrt{2 E(w^\star) / N}$$
The division by $N$ allows us to compare different sizes of data sets on an equal footing, and the square root ensures that $E_{\mathrm{RMS}}$ is measured on the same scale (and in the same units) as the target variable $t$.
slide by Erik Sudderth
Root-Mean-Square (RMS) Error:
$$E(w) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - \phi(x_n)^T w \right)^2 = \frac{1}{2} \left\| t - \Phi w \right\|^2$$
[Figure: $E_{\mathrm{RMS}}$ on training and test sets as a function of the polynomial order $M$, as in Bishop, Fig. 1.5]
slide by Erik Sudderth
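A tiny helper for computing $E_{\mathrm{RMS}}$ from the quantities above (the names are ours):

```python
import numpy as np

def rms_error(w, Phi, t):
    """E_RMS = sqrt(2 E(w) / N), with E(w) = 0.5 * ||t - Phi @ w||^2."""
    E = 0.5 * np.sum((t - Phi @ w) ** 2)
    return np.sqrt(2.0 * E / len(t))
```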
slide by Sanja Fidler
Overfitting becomes more severe when we have fewer examples.
slide by Sanja Fidler
One remedy is to encourage the weights to be small (this way no input dimension will have too much influence on prediction). This is called regularization.
slide by Sanja Fidler
− Simplest models do not capture all the important variations (signal) in the data: they underfit
− More complex models may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model
− Test generalization = the model's ability to predict held-out data
− Fitting: iterative approaches; analytic when available
slide by Richard Zemel
Regularized error function:
$$\tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, w) - t_n \right\}^2 + \frac{\lambda}{2} \left\| w \right\|^2$$
where $\|w\|^2 \equiv w^T w = w_0^2 + w_1^2 + \ldots + w_M^2$, and the coefficient $\lambda$ governs the relative importance of the regularization term compared with the sum-of-squares error term.
slide by Erik Sudderth
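Minimizing this regularized loss also has a closed form, $w = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T t$ (ridge regression); a minimal sketch, reusing the polynomial design matrix with an illustrative λ:

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Minimize 0.5*||t - Phi w||^2 + 0.5*lam*||w||^2 in closed form:
    w = (Phi^T Phi + lam * I)^{-1} Phi^T t."""
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ t)

x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.RandomState(0).randn(10)
Phi = np.vander(x, 10, increasing=True)   # M = 9 polynomial features
w = fit_ridge(Phi, t, lam=np.exp(-18))    # ln(lambda) = -18, as in the figures
```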
[Figure: fits of the M = 9 polynomial for ln λ = −18 and ln λ = 0, as in Bishop, Fig. 1.7]
slide by Erik Sudderth
[Figure: $E_{\mathrm{RMS}}$ on training and test sets vs. ln λ for the M = 9 polynomial, as in Bishop, Fig. 1.8]
Coefficients $w^\star$ of the M = 9 polynomial for varying λ (values as in Bishop, Table 1.2):

          ln λ = −∞   ln λ = −18   ln λ = 0
w⋆0            0.35         0.35       0.13
w⋆1          232.37         4.74      −0.05
w⋆2        −5321.83        −0.77      −0.06
w⋆3        48568.31       −31.97      −0.05
w⋆4      −231639.30        −3.89      −0.03
w⋆5       640042.26        55.28      −0.02
w⋆6     −1061800.52        41.32      −0.01
w⋆7      1042400.18       −45.95      −0.00
w⋆8      −557682.99       −91.53       0.00
w⋆9       125201.43        72.68       0.01

The corresponding coefficients from the fitted polynomials, showing that regularization has the desired effect of reducing the magnitude of the coefficients.
slide by Erik Sudderth
[Figure: contours of the regularization term $\sum_j |w_j|^q$ for q = 0.5, 1, 2, 4, as in Bishop, Fig. 3.3]
slide by Richard Zemel