Lecture 5:
−Linear regression (cont’d.) −Regularization −ML Methodology −Learning theory
Aykut Erdem
October 2017 Hacettepe University
About class projects
This semester the theme is machine learning and the city. To be done: classroom + video (new) presentations, final report and code.
http://web.cs.hacettepe.edu.tr/~aykut/classes/fall2017/bbm406/project.html
3
4
Recall from last time… Linear Regression
Model: y(x) = w0 + w1 x, with parameters w = (w0, w1)
Loss: ℓ(w) = Σ_{n=1}^N [t^(n) − (w0 + w1 x^(n))]²
Gradient Descent Update Rule: w ← w + 2λ (t^(n) − y(x^(n))) x^(n)
Closed Form Solution: w = (X^T X)^(−1) X^T t
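As a side note (not from the slides), here is a minimal NumPy sketch of both fitting strategies on synthetic 1-D data; the data, the learning-rate value and the variable names are my own assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)              # inputs x^(n)
t = 1.5 + 2.0 * x + rng.normal(0, 0.1, 50)   # targets t^(n) from a noisy line

# Stochastic gradient descent: w <- w + 2*lam*(t_n - y(x_n)) * x_n
w0, w1 = 0.0, 0.0
lam = 0.05                                   # learning rate (lambda in the slide)
for epoch in range(200):
    for xn, tn in zip(x, t):
        err = tn - (w0 + w1 * xn)
        w0 += 2 * lam * err * 1.0            # bias uses constant input 1
        w1 += 2 * lam * err * xn

# Closed form: w = (X^T X)^(-1) X^T t, with a column of ones for w0
X = np.column_stack([np.ones_like(x), x])
w_closed = np.linalg.solve(X.T @ X, X.T @ t)

print("SGD:         w0=%.3f w1=%.3f" % (w0, w1))
print("Closed form: w0=%.3f w1=%.3f" % (w_closed[0], w_closed[1]))

Both routes should recover roughly w0 ≈ 1.5 and w1 ≈ 2.0 on this toy data.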
5
6
slide by Sanja Fidler
Linear Regression with Multi-dimensional Inputs
y(x) = w0 + w1 x1 + w2 x2
We want to fit these multi-dimensional observations; we can still compute w analytically (how does the solution change?)
7
slide by Sanja Fidler
x^(n) = (x1^(n), . . . , xj^(n), . . . , xd^(n))
y(x) = w0 + Σ_{j=1}^d wj xj = w^T x
What if we need a more complicated model?
8
slide by Sanja Fidler
How can we create a more complicated model? Define new input features that are combinations of components of x. Example: an M-th order polynomial function of a one-dimensional feature x, where x^j is the j-th power of x:
9
slide by Sanja Fidler
y(x, w) = w0 + Σ_{j=1}^M wj x^j
Some types of basis functions in 1-D
10
Polynomials: φj(x) = x^j
Gaussians: φj(x) = exp(−(x − μj)² / (2s²))
Sigmoids: φj(x) = σ((x − μj)/s), where σ(a) = 1/(1 + exp(−a))
slide by Erik Sudderth
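For concreteness, a small sketch (my own, not from the lecture) that evaluates these three basis-function families on a 1-D input to build a design matrix; the function names and the choice of centers are assumptions.

import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x^j for j = 0..M (phi_0 = 1 is the bias)."""
    return np.column_stack([x**j for j in range(M + 1)])

def gaussian_basis(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * s**2))

def sigmoid_basis(x, centers, s):
    """phi_j(x) = sigma((x - mu_j)/s), with sigma(a) = 1/(1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))

x = np.linspace(-1, 1, 5)
mu = np.linspace(-1, 1, 3)
print(polynomial_basis(x, 3).shape)       # (5, 4)
print(gaussian_basis(x, mu, 0.5).shape)   # (5, 3)
print(sigmoid_basis(x, mu, 0.5).shape)    # (5, 3)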
Two types of linear model that are equivalent with respect to learning:
y(x, w) = w0 + w1 x1 + w2 x2 + … = w^T x            (w0 is the bias)
y(x, w) = w0 + w1 φ1(x) + w2 φ2(x) + … = w^T Φ(x)
The first model has a number of adaptive coefficients equal to the dimensionality of the data + 1; the second has a number equal to the number of basis functions + 1. Once we have replaced the data by the outputs of the basis functions, fitting the second model is exactly the same problem as fitting the first model (unless we use the kernel trick).
11
slide by Erik Sudderth
The general linear regression problem can be written as
y = Σ_{j=0}^k wj φj(x)
where φj(x) can be either xj (for multivariate linear regression) or one of the non-linear basis functions above, with φ0(x) = 1 for the bias term.

Our goal is to minimize the following loss function:
J(w) = Σ_i (yi − Σ_j wj φj(xi))²
Moving to vector notation we get:
J(w) = Σ_i (yi − w^T φ(xi))²
where φ(xi) is a vector of dimension k+1 and yi is a scalar.

We take the derivative w.r.t. w and equate it to 0:
2 Σ_i (yi − w^T φ(xi)) φ(xi) = 0   ⇒   Σ_i yi φ(xi) = Σ_i (w^T φ(xi)) φ(xi)
slide by E. P. Xing
Writing this in matrix form with the design matrix
Φ = [ φ0(x1)  φ1(x1)  …  φk(x1)
      φ0(x2)  φ1(x2)  …  φk(x2)
        ⋮        ⋮            ⋮
      φ0(xn)  φ1(xn)  …  φk(xn) ]
(an n × (k+1) matrix) and the n-entry target vector y, the minimizer of J(w) is the (k+1)-entry vector
w = (Φ^T Φ)^(−1) Φ^T y
This solution is also known as the "pseudo-inverse" solution.
slide by E. P. Xing
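A minimal sketch of the pseudo-inverse solution above, on made-up data with a polynomial basis; the use of np.linalg.lstsq as a numerically safer alternative is my own choice, not part of the slides.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=30)   # noisy targets y_i

k = 3                                                     # basis functions 0..k
Phi = np.column_stack([x**j for j in range(k + 1)])       # n-by-(k+1) design matrix

# Direct normal-equations / pseudo-inverse formula:
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Equivalent, more numerically robust least-squares solve:
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(np.allclose(w, w_lstsq))   # True (up to numerical precision)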
16
slide by Erik Sudderth
17
slide by Erik Sudderth
18
slide by Erik Sudderth
19
slide by Erik Sudderth
20
slide by Sanja Fidler from Bishop
21
[Figure (Bishop): polynomial fits for M = 0, 1, 3 and 9, plotting t against x.]
E(w) = ½ Σ_{n=1}^N {y(xn, w) − tn}²
E_RMS = √(2 E(w*) / N)
The division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that ERMS is measured on the same scale (and in the same units) as the target variable t
slide by Erik Sudderth
Root-Mean-Square (RMS) Error:
E(w) = ½ Σ_{n=1}^N (tn − φ(xn)^T w)² = ½ ‖t − Φw‖²
22
[Figure (Bishop): E_RMS versus polynomial order M for the training and test sets.]
slide by Erik Sudderth
23
slide by Sanja Fidler
[Figure (Bishop): E_RMS versus polynomial order M for the training and test sets.]
24
slide by Sanja Fidler
− For a fixed model complexity (e.g., M = 9), overfitting becomes more severe when we have fewer examples, and less severe as the data set grows.
25
slide by Sanja Fidler
− One way to control overfitting is to encourage the weights to be small (this way no input dimension will have too much influence on prediction). This is called regularization.
26
slide by Sanja Fidler
1-D regression illustrates key concepts
− Simplest models do not capture all the important variations (signal) in the data: underfit
− More complex models may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model
− Test generalization = the model's ability to predict held-out data
− Fitting the model: iterative approaches, or analytic solutions when available
27
slide by Richard Zemel
Regularization adds a penalty term to the error function to discourage the coefficients from reaching large values.
28
Ẽ(w) = ½ Σ_{n=1}^N {y(xn, w) − tn}² + (λ/2) ‖w‖²
where ‖w‖² = w^T w = w0² + w1² + … + wM², and the coefficient λ governs the relative importance of the regularization term compared to the sum-of-squares error term.
This is called ridge regression, which is minimized by
w = (λI + Φ^T Φ)^(−1) Φ^T t
slide by Erik Sudderth
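A minimal sketch of the ridge solution quoted above; the degree-9 polynomial basis, the synthetic data and the λ values are my own stand-ins for the M = 9 example, so the printed numbers will not match the slide's table.

import numpy as np

def fit_ridge(Phi, t, lam):
    """Minimize 1/2 sum_n (t_n - w^T phi(x_n))^2 + lam/2 * ||w||^2."""
    k = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(k) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=10)
Phi = np.column_stack([x**j for j in range(10)])          # M = 9 polynomial

# lam ~ 0 (essentially unregularized), lam = exp(-18), lam = 1 (ln lam = 0)
for lam in [1e-10, np.exp(-18), 1.0]:
    w = fit_ridge(Phi, t, lam)
    print("lambda=%.2e  max|w_j|=%.2f" % (lam, np.abs(w).max()))

With almost no regularization the coefficients blow up; larger λ shrinks them, mirroring the coefficient table below.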
29
[Figure (Bishop): the M = 9 polynomial fit with regularization, for ln λ = −18 and ln λ = 0; t plotted against x.]
slide by Erik Sudderth
[Figure (Bishop): E_RMS versus ln λ for the M = 9 polynomial, training and test sets.]
30
Coefficients of the M = 9 polynomial for varying ln λ:

        ln λ = −∞   ln λ = −18   ln λ = 0
w0*          0.35         0.35       0.13
w1*        232.37         4.74
w2*
w3*      48568.31
w4*
w5*     640042.26        55.28
w6*                      41.32
w7*    1042400.18
w8*                                  0.00
w9*     125201.43        72.68       0.01
The corresponding coefficients from the fitted polynomials, showing that regularization has the desired effect of reducing the magnitude of the coefficients.
slide by Erik Sudderth
31
Ẽ(w) = ½ Σ_{n=1}^N {tn − w^T φ(xn)}² + (λ/2) Σ_{j=1}^M |wj|^q
[Figure: contours of the regularization term for q = 0.5, q = 1, q = 2 and q = 4.]
slide by Richard Zemel
32
− So far we assumed the target is continuous (regression); classification problems are solved very similarly, and most of what we have covered so far transfers to classification with very minor changes. We fit a model to training examples and then apply new examples to the fitted model.
33
slide by Olga Veksler
[Figure: a small training set plotted as y versus x.]
− What we really care about is how well the model generalizes to new data.
34
slide by Olga Veksler
− Fit polynomials of varying degree: make the degree a parameter d.
− degree 3 is the best according to the training error, but overfits
the data
35
slide by Olga Veksler
− degree 2 is the best model according to the test error
− Note that the test examples were not used for training at all!
36
slide by Olga Veksler
choosing among 3 classifiers (degree 1, 2, or 3)
37
slide by Olga Veksler
Training ≈ 60% Validation ≈ 20% Test ≈ 20%
− Training set: used to train the tunable parameters w
− Validation set: used to tune the other parameters (e.g., the degree d)
− Test set: used only to assess the final performance of the classifier
Split the available labeled data into these three sets.
38
slide by Olga Veksler
Training ≈ 60% Validation ≈ 20% Test ≈ 20%
Training error: computed on the training examples
Validation error: computed on the validation examples
Test error: computed on the test examples
39
slide by Olga Veksler
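As an illustration (not from the slides), a sketch of the 60/20/20 split used to pick a polynomial degree d on the validation set and report performance once on the test set; the data and candidate degrees are made up.

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=100)
y = x**2 - 0.5 * x + rng.normal(0, 0.1, size=100)

idx = rng.permutation(len(x))
n_tr, n_va = int(0.6 * len(x)), int(0.2 * len(x))
tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

def fit_poly(xs, ys, d):
    Phi = np.column_stack([xs**j for j in range(d + 1)])
    w, *_ = np.linalg.lstsq(Phi, ys, rcond=None)
    return w

def mse(w, xs, ys):
    Phi = np.column_stack([xs**j for j in range(len(w))])
    return np.mean((Phi @ w - ys)**2)

# Train w on the training set, choose d on the validation set,
# and report the final performance once on the test set.
best_d = min(range(1, 6), key=lambda d: mse(fit_poly(x[tr], y[tr], d), x[va], y[va]))
w = fit_poly(x[tr], y[tr], best_d)
print("chosen degree:", best_d, " test MSE: %.4f" % mse(w, x[te], y[te]))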
Validation error: 3.3, 1.8 and 3.4 for the three candidate models; pick the one with the lowest validation error.
40
slide by Olga Veksler
[Figure: training error and validation error as a function of the number of basis functions (up to 50).]
41
slide by Olga Veksler
Underfitting
Just Right
Overfitting
42
slide by Olga Veksler
− Drawback: this approach wastes data; only about 60% is used to train the parameters, with 20% for test and 20% for validation data
− With a single split, the error estimate can be lucky or unlucky
− Cross-validation is an alternative that wastes less data
43
slide by Olga Veksler
44
slide by Olga Veksler
Three models fit to the same data set:
− Linear model: Mean Squared Error = 2.4
− Quadratic model: Mean Squared Error = 0.9
− Join-the-dots model: Mean Squared Error = 2.2
LOOCV (Leave-one-out Cross Validation)
45
slide by Olga Veksler
For k = 1 to n:
− temporarily remove the k-th example from the dataset
− train on the remaining n − 1 examples
− note the error on the held-out example
When you've done all points, report the mean error.
slide by Olga Veksler
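A sketch of the leave-one-out loop just described; the straight-line model, the synthetic data and the helper names are my own, so the printed error will not match the numbers on the following slides.

import numpy as np

def loocv_mse(x, y, fit, predict):
    """For k = 1..n: hold out example k, train on the rest, record its error."""
    errors = []
    for k in range(len(x)):
        mask = np.arange(len(x)) != k           # temporarily remove example k
        model = fit(x[mask], y[mask])           # train on the remaining n-1 points
        errors.append((predict(model, x[k]) - y[k])**2)
    return np.mean(errors)                      # report the mean error

# Example: LOOCV for a straight-line model y = w0 + w1 x.
fit_line = lambda xs, ys: np.polyfit(xs, ys, deg=1)
predict_line = lambda w, x0: np.polyval(w, x0)

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=15)
y = 2 * x + 1 + rng.normal(0, 1, size=15)
print("LOOCV MSE: %.3f" % loocv_mse(x, y, fit_line, predict_line))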
50
slide by Olga Veksler
Linear Regression: MSE_LOOCV = 2.12
51
slide by Olga Veksler
Quadratic Regression: MSE_LOOCV = 0.96
52
slide by Olga Veksler
Join-the-dots: MSE_LOOCV = 3.33
53
− Test set: cheap, but may give an unreliable estimate of future performance
− Leave-one-out: doesn't waste data, but expensive
54
k-fold Cross Validation (here k = 3)
− Randomly break the dataset into k partitions (in our example we will have k = 3 partitions, colored red, green and blue)
− For the blue partition: train on all the points not in the blue partition. Find the test-set sum of errors on the blue points
− For the green partition: train on all the points not in the green partition. Find the test-set sum of errors on the green points
− For the red partition: train on all the points not in the red partition. Find the test-set sum of errors on the red points
− Then report the mean error
slide by Olga Veksler
Linear Regression: MSE_3FOLD = 2.05
59
slide by Olga Veksler
Quadratic Regression: MSE_3FOLD = 1.1
60
slide by Olga Veksler
Join-the-dots: MSE_3FOLD = 2.93
61
− Test set: cheap, but may give an unreliable estimate of future performance
− Leave-one-out: doesn't waste data, but expensive
− 10-fold: wastes 10% of the data and is 10 times more expensive than a test set, but only 10 times more expensive instead of n times
− 3-fold: wastes more data than 10-fold and is more expensive than a test set, but slightly better than a test set
− N-fold: identical to Leave-one-out
slide by Olga Veksler
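A sketch of k-fold cross-validation as summarized above (k = 3 gives the red/green/blue example); the polynomial models and the data are my own choices, so the errors will not match the MSE_3FOLD values on the earlier slides.

import numpy as np

def kfold_mse(x, y, deg, k=3, seed=0):
    """Randomly split into k partitions; each is used once as the held-out fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)                    # all points not in this fold
        w = np.polyfit(x[train], y[train], deg)
        errs.append(np.mean((np.polyval(w, x[f]) - y[f])**2))
    return np.mean(errs)                                # mean error over the k folds

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=30)
y = 0.5 * x**2 - x + rng.normal(0, 0.05, size=30)
for deg in (1, 2, 9):                                   # linear, quadratic, very flexible
    print("degree %d: 3-fold MSE = %.4f" % (deg, kfold_mse(x, y, deg)))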
For classification, instead of computing the sum of squared errors on a test set, you should compute the total number of misclassifications on the test set.
slide by Andrew Moore
64
slide by Andrew Moore
65
slide by Andrew Moore
66
[Table: candidate models f1, ..., f6 with their training errors.]
slide by Olga Veksler
67
[Table: candidate models f1, ..., f6 with their training errors and 10-fold CV errors.]
slide by Olga Veksler
68
[Table: candidate models f1, ..., f6 with training and 10-fold CV errors; the choice is the model with the lowest CV error.]
slide by Olga Veksler
69
[Table: candidate values k = 1, ..., 6 with training and 10-fold CV errors; the choice is the k with the lowest CV error.]
slide by Olga Veksler
− What if the error kept getting worse as K was increasing?
− How do we know that the selected value will be the global optimum?
70
slide by Olga Veksler
71
Learning theory: the mathematical analysis of machine learning algorithms
− PAC (probably approximately correct) learning
→ boosting
− VC (Vapnik–Chervonenkis) theory
→ support vector machines
72
slide by Eric Eaton
Annual conference: Conference on Learning Theory (COLT)
− How much training data do I need to do a good job learning?
− Is my test performance going to be much worse than my training performance?
73
The key idea that underlies all these answers is that simple functions generalize well.
adapted from Hal Daume III
− It can justify and help understand why common practice works.
− It can also serve to suggest new algorithms and approaches that turn out to work well in practice.
74
adapted from Hal Daume III
Does theory come before practice, or after? Often, it turns out to be a mix!
− Sometimes practice comes first: an algorithm is observed to work surprisingly well, and theorists then try to understand why and prove something about it.
− In the process, they make it better or find new algorithms.
− Sometimes theory comes first: it tells us what's possible and what's not possible.
75
adapted from Hal Daume III
− Suppose we want to decide whether there is an "ultimate" learning algorithm, Aawesome, that solves the Binary Classification problem.
− What would it mean for such an Aawesome to be out there?
− It should take in a data set D and produce a function f.
− No matter what D looks like, this function f should get perfect classification on all future examples drawn from the same distribution that produced D.
76
adapted from Hal Daume III
77
adapted from Hal Daume III
− 80% of data points in this distribution have x = y and 20% don't.
− No matter what function f the algorithm returns, there's no way that it can do better than 20% error on this data.
− So no Aawesome exists that always achieves an error rate of zero.
− The best that we can hope is that the error rate is not "too large."
78
adapted from Hal Daume III
D(⟨+1⟩, +1) = 0.4   D(⟨+1⟩, −1) = 0.1   D(⟨−1⟩, −1) = 0.4   D(⟨−1⟩, +1) = 0.1
− A second difficulty is the randomness of sampling.
− When trying to learn about a distribution, you only get to see data points drawn from that distribution.
− You know that "eventually" you will see enough data points that your sample is representative of the distribution, but it might not happen immediately.
− For instance, even though a fair coin comes up heads with probability 1/2, it is entirely possible that in a sequence of four coin flips you never see a tails, or perhaps only see one tails.
79
adapted from Hal Daume III
− Because of this sampling randomness, we cannot expect Aawesome to work all of the time.
− In particular, if we happen to get a lousy sample of data from D, we need to allow Aawesome to do something completely unreasonable.
− So we can only ask it to do well most of the time.
80
adapted from Hal Daume III
The best we can reasonably hope of Aawesome is that it will do pretty well, most of the time.
So the best we can hope of an algorithm is that:
− it does a good job most of the time (probably approximately correct)
81
adapted from Hal Daume III
− We have 10 different binary classification data sets. − For each one, it comes back with functions f1, f2, . . . , f10.
✦ For some reason, whenever you run f4 on a test point, it
crashes your computer. For the other learned functions, their performance on test data is always at most 5% error.
✦ If this situation is guaranteed to happen, then this
hypothetical learning algorithm is a PAC learning algorithm.
✤ It satisfies probably because it only failed in one out of
ten cases, and it’s approximate because it achieved low, but non-zero, error on the remainder of the cases.
82
adapted from Hal Daume III
83
adapted from Hal Daume III
Definition 1. An algorithm A is an (ε, δ)-PAC learning algorithm if, for all distributions D: given samples from D, the probability that it returns a "bad function" is at most δ, where a "bad" function is one with test error rate more than ε on D.
84
adapted from Hal Daume III
Definition: An algorithm A is an efficient (ε, δ)-PAC learning algorithm if it is an (ε, δ)-PAC learning algorithm whose runtime is polynomial in 1/ε and 1/δ.
In other words, if you want your algorithm to achieve 4% error rather than 5%, the runtime required to do so should not go up by an exponential factor!
− Computational complexity: Prefer an algorithm that runs quickly
to one that takes forever
− Sample complexity: The number of examples required for your
algorithm to achieve its goals
Example: PAC Learning of Conjunctions
(e.g. x1 ⋀ x2 ⋀ x5)
x = ⟨x1, x2, . . . , xD⟩.
y = c(x)
− Clearly, the true formula cannot
include the terms x1 , x2, ¬x3, ¬x4
85
adapted from Hal Daume III
y    x1  x2  x3  x4
+1    0   0   1   1
+1    0   1   1   1
−1    1   1   0   1
Table 10.1: Data set for learning conjunctions.
Example: PAC Learning
f0(x) = x1 ⋀ ¬x1 ⋀ x2 ⋀ ¬x2 ⋀ x3 ⋀ ¬x3 ⋀ x4 ⋀ ¬x4
f1(x) = ¬x1 ⋀ ¬x2 ⋀ x3 ⋀ x4
f2(x) = ¬x1 ⋀ x3 ⋀ x4
f3(x) = ¬x1 ⋀ x3 ⋀ x4
− The learned conjunction classifies every training example correctly (provided that there is no noise)
− Given a data set of N examples in D dimensions, it takes O (ND)
time to process the data. This is linear in the size of the data set.
86
Algorithm 30 BinaryConjunctionTrain(D)
1: f ← x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ · · · ∧ xD ∧ ¬xD   // initialize function
2: for all positive examples (x, +1) in D do
3:   for d = 1 . . . D do
4:     if xd = 0 then
5:       f ← f without term "xd"
6:     else
7:       f ← f without term "¬xd"
8:     end if
9:   end for
10: end for
11: return f
adapted from Hal Daume III
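A direct Python transcription (mine, not from the book or slides) of Algorithm 30, representing each literal as a signed index: +d for xd and −d for ¬xd. The toy data follows the conjunction example above (two positive examples and one negative), so the learned conjunction comes out as ¬x1 ∧ x3 ∧ x4, matching f2 and f3.

def binary_conjunction_train(data):
    """data: list of (x, y) with x a tuple of 0/1 values and y in {+1, -1}."""
    D = len(data[0][0])
    f = set()
    for d in range(1, D + 1):          # initialize f = x1 AND NOT x1 AND ... AND xD AND NOT xD
        f.add(d)
        f.add(-d)
    for x, y in data:
        if y != +1:
            continue                   # only positive examples are used
        for d in range(1, D + 1):
            if x[d - 1] == 0:
                f.discard(d)           # remove term "x_d"
            else:
                f.discard(-d)          # remove term "NOT x_d"
    return f

def predict(f, x):
    """The conjunction is true iff every remaining literal is satisfied."""
    return +1 if all((x[abs(l) - 1] == 1) == (l > 0) for l in f) else -1

# Toy data consistent with the example above:
data = [((0, 0, 1, 1), +1), ((0, 1, 1, 1), +1), ((1, 1, 0, 1), -1)]
f = binary_conjunction_train(data)
print(sorted(f))                       # [-1, 3, 4], i.e. NOT x1 AND x3 AND x4
print([predict(f, x) for x, _ in data])   # [1, 1, -1]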
“Throw Out Bad Terms”
− How many examples N do you need to see in order to
guarantee that it achieves an error rate of at most ε (in all but δ- many cases)?
− Perhaps N has to be gigantic (like 2^(2D)/ε) to (probably) guarantee a small error.
87
adapted from Hal Daume III
“Throw Out Bad Terms”
− It turns out that the number of examples needed to (probably) achieve a small error is not-too-big.
− Say there is some term (say ¬x8) that should have been thrown out, but wasn't.
− If this was the case, then you must not have seen any positive training examples with ¬x8 = 0 (i.e., with x8 = 1).
− So examples with ¬x8 = 0 must have low probability (otherwise you would have seen them). So such a thing is not that common.
88
adapted from Hal Daume III
“Throw Out Bad Terms”
− The argument above was specific to Boolean conjunctions over D-many variables.
− The hypothesis class for Boolean conjunctions is finite; the
hypothesis class for linear classifiers is infinite.
− For Occam’s razor, we can only work with finite hypothesis classes.
89
adapted from Hal Daume III
William of Occam (c. 1288 – c. 1348)
"If one can explain a phenomenon without assuming this or that hypothetical entity, then there is no ground for assuming it, i.e., one should always opt for an explanation in terms of the fewest possible number of causes, factors, or variables."
Theorem 14 (Occam's Bound). Suppose A is an algorithm that learns a function f from some finite hypothesis class H. Suppose the learned function always gets zero error on the training data. Then, the sample complexity of f is at most log |H|.
− What if the features are real-valued? Instead of representing your hypothesis as a Boolean conjunction, represent it as a conjunction of inequalities.
− Instead of having x1 ∧ ¬x2 ∧ x5, you have
[x1 > 0.2] ∧ [x2 < 0.77] ∧ [x5 < π/4]
− In this representation, for each feature, you need to choose
an inequality (< or >) and a threshold.
− Since the thresholds can be arbitrary real values, there are
now infinitely many possibilities: |H| = 2D×∞ = ∞
90
adapted from Hal Daume III
− The VC (Vapnik–Chervonenkis) dimension is a classic measure of the complexity of an infinite hypothesis class, based on this intuition:
− Look at a finite set of unlabeled examples and ask: no matter how these points were labeled, would we be able to find a hypothesis that correctly classifies them?
− As the set of examples grows, being able to represent an arbitrary labeling becomes harder and harder.
91
adapted from Hal Daume III
Definition 2. For data drawn from some space X, the VC dimension of a hypothesis space H over X is the maximal K such that: there exists a set X ⊆ X of size |X| = K, such that for all binary labelings of X, there exists a function f ∈ H that matches this labeling.
− Example: linear classifiers in the plane can shatter 3 points but cannot shatter 4 points, so their VC dimension is 3.
− A hypothesis class that can shatter 4 or more points hence has VC dimension greater than 3.
92
adapted from Trevor Hastie, Robert Tibshirani, Jerome Friedman
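To make Definition 2 concrete, here is a brute-force sketch (my own, not from the slides) that tests whether the simple hypothesis class of 1-D threshold classifiers h(x) = sign(s·(x − t)) can shatter a given point set by enumerating all labelings; the helper names are assumptions. It confirms that this class shatters any 2 points but no set of 3 points, i.e. its VC dimension is 2.

from itertools import product

def threshold_hypotheses(points):
    """Enumerate enough (threshold, sign) pairs to cover all distinct behaviors."""
    xs = sorted(points)
    thresholds = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    return [(t, s) for t in thresholds for s in (+1, -1)]

def shatters(points):
    """True if for every labeling of the points some threshold classifier matches it."""
    hyps = threshold_hypotheses(points)
    for labeling in product([+1, -1], repeat=len(points)):
        ok = any(all((+1 if s * (x - t) > 0 else -1) == y
                     for x, y in zip(points, labeling))
                 for (t, s) in hyps)
        if not ok:
            return False
    return True

print(shatters([0.3]))              # True
print(shatters([0.3, 0.7]))         # True: any 2 points can be shattered
print(shatters([0.1, 0.5, 0.9]))    # False: labeling (+1, -1, +1) is impossible, so VC dim = 2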