1
University College London
Introduction to Machine Learning
Iasonas Kokkinos
Iasonas.kokkinos@gmail.com
Lecture 1: Introduction and Linear Regression
2
Lecture outline
– Introduction to the course
– Introduction to Machine Learning
– Least squares
3
Machine Learning
Principles, methods, and algorithms for learning and prediction based on past evidence.
Goal: machines that perform a task based on experience, instead of explicitly coded instructions.
Why?
4
Machine Learning variants
– Classification – Regression
– Clustering – Dimensionality Reduction
Some data supervised, some unsupervised
Supervision: sparse reward for a sequence of decisions
5
Classification
– Binary decision: yes/no
Decision boundary
6
Classification examples
7
Decision boundary
(Figure: 'Face' vs. 'Background' regions separated by a decision boundary; the learned 'faceness' function is the classifier)
8
Test time: deploy the learned function
– Scan windows at multiple scales and multiple orientations
– The classifier maps each window to Face or Non-face
9
Machine Learning variants
– Classification – Regression
– Clustering – Dimensionality Reduction
Some data supervised, some unsupervised
Supervision: reward for a sequence of decisions
10
Regression
– E.g. the price of a car based on its age, mileage, condition, …
11
Computer vision example
12
Machine Learning variants
– Classification – Regression
– Clustering – Dimensionality Reduction
Some data supervised, some unsupervised
Supervision: reward for a sequence of decisions
13
Clustering
– Labels are `invented’
14
Clustering examples
15
Clustering examples
16
Machine Learning variants
– Classification – Regression
– Clustering – Dimensionality Reduction
Some data supervised, some unsupervised
Supervision: reward for a sequence of decisions
17
Dimensionality reduction & manifold learning
– Continuous outputs are `invented’
18
Example of nonlinear manifold: faces
(Figure: two face images x_1 and x_2, and their average ½(x_1 + x_2))
The average of two faces is not a face.
19
Moving along the learned face manifold
Trajectory along the "male" dimension; trajectory along the "young" dimension (Lample et al., Fader Networks, NIPS 2017)
20
Machine Learning variants
– Classification – Regression
– Clustering – Dimensionality Reduction
Partially supervised
Supervision: reward for a sequence of decisions
21
Weakly supervised learning: only part of the supervision signal
Supervision signal: "motorcycle". Inferred: localization information.
23
Semi-supervised learning: only part of the data labelled
(Figure: labelled data only vs. labelled + unlabelled data)
24
Machine Learning variants
– Classification – Regression
– Clustering – Dimensionality Reduction
Some data supervised, some unsupervised
Supervision: reward for a sequence of decisions
25
Reinforcement learning
– Take actions, based on state – (occasionally) receive rewards – Update state – Repeat
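As a rough illustration of this loop, here is a self-contained Python sketch; the toy environment, its sparse reward at position 10, and the random policy are all made up for illustration, not taken from the lecture.

```python
import random

class ToyEnv:
    """Tiny illustrative environment: walk on positions 0..10;
    a reward arrives only when position 10 is reached (sparse reward)."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):                 # action in {-1, +1}
        self.pos = max(0, min(10, self.pos + action))
        done = (self.pos == 10)
        reward = 1.0 if done else 0.0       # reward only occasionally
        return self.pos, reward, done

def random_policy(state):
    return random.choice([-1, +1])

env = ToyEnv()
state = env.reset()
total_reward = 0.0
for _ in range(1000):                       # take an action based on the state
    action = random_policy(state)
    state, reward, done = env.step(action)  # (occasionally) receive a reward, update state
    total_reward += reward
    if done:                                # repeat
        state = env.reset()
print("episodes solved:", total_reward)
```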
26
Reinforcement learning examples
Backgammon, 90’s GO, 2015
27
Focus of first part: supervised learning
– Classification – Regression
– Clustering – Dimensionality Reduction, Manifold Learning
Some data supervised, some unsupervised
Supervision: reward for a sequence of decisions
28
Classification: yes/no decision
29
Regression: continuous output
30
What we want to learn: a function
y = f_w(x)
31
What we want to learn: a function
y = f_w(x)   (x: input, f: method, w: parameters, y: prediction)
32
What we want to learn: a function
y = f_w(x)   (x: input, f: method, w: parameters, y: prediction)
– x ∈ R: calculus
– x ∈ R^D: vector calculus
– Machine learning can also work with discrete inputs: strings, trees, graphs, …
33
What we want to learn: a function
y = f_w(x)   (x: input, f: method, w: parameters, y: prediction)
– Classification: y ∈ {0, 1}
– Regression: y ∈ R
34
What we want to learn: a function
y = f_w(x)
Possible methods f: linear classifiers, neural networks, decision trees, ensemble models, probabilistic classifiers, …
35
Example of method: K-nearest neighbor classifier
(Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor neighbourhoods around a query point X)
– Compute the distance to the other training records
– Identify the K nearest neighbors
– Take a majority vote over their labels
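A minimal NumPy sketch of this procedure (the toy training points and the function name `knn_predict` are illustrative assumptions, not from the lecture):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, K=3):
    """Classify x_query by majority vote among its K nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distance to every training record
    nearest = np.argsort(dists)[:K]                      # indices of the K nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                     # majority vote

# toy 2D example with two classes
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), K=3))  # -> 1
```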
36
Training data for NN classifier (in R2)
37
1-nn classifier prediction (in R2)
38
3-nn classifier prediction
39
Method example: decision tree
Machine learning can also work with discrete inputs: strings, trees, graphs, …
40
Method example: decision tree
41
Method example: decision tree
What is the depth of the decision tree for this problem?
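Independently of the specific problem in the figure, a decision tree is just a nested sequence of threshold tests; a tiny hand-written sketch (the features and thresholds below are made up):

```python
def classify(x):
    """A depth-2 decision tree on two features (illustrative thresholds)."""
    if x[0] < 0.5:            # first split on feature 0
        return "A"
    else:
        if x[1] < 0.3:        # second split on feature 1
            return "A"
        else:
            return "B"

print(classify([0.2, 0.9]))   # "A"
print(classify([0.8, 0.9]))   # "B"
```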
42
Method example: linear classifier
(Axes: feature coordinate i, feature coordinate j)
43
Method example: neural network
44
Method example: neural network
45
Method example: neural network
46
We have two centuries of material to cover!
"The first clear and concise exposition of the method of least squares was published by Legendre in 1805. The technique is described as an algebraic procedure for fitting linear equations to data and Legendre demonstrates the new method by analyzing the same data as Laplace for the shape of the earth. The value of Legendre's method of least squares was immediately recognized by leading astronomers and geodesists of the time." (https://en.wikipedia.org/wiki/Least_squares)
47
What we want to learn: a function
y = f_w(x) = f(x; w)   (x: input, f: method, w: parameters, y: prediction)
Parameters: w ∈ R or w ∈ R^K
48
Assumption: linear function
y = f_w(x) = f(x, w) = w^T x
Inner product: for x ∈ R^D and w ∈ R^D,
w^T x = ⟨w, x⟩ = Σ_{d=1}^D w_d x_d
49
Reminder: linear classifier
(Axes: feature coordinate i, feature coordinate j)
Each data point has a class label: y_t = +1 (positive) or −1 (negative)
Linear classifier: predict positive if w · x_i + b ≥ 0, negative if w · x_i + b < 0
50
Each data point has a class label: y_t = +1 (positive) or −1 (negative); the classifier predicts positive if w · x_i + b ≥ 0 and negative if w · x_i + b < 0.
(Axes: feature coordinate i, feature coordinate j)
Question: which one?
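For concreteness, a minimal Python sketch of such a linear decision rule (the parameter values w and b below are made up):

```python
import numpy as np

def linear_classify(x, w, b):
    """Predict +1 if w·x + b >= 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

w, b = np.array([1.0, -2.0]), 0.5                      # illustrative parameters
print(linear_classify(np.array([3.0, 1.0]), w, b))     # +1
print(linear_classify(np.array([0.0, 2.0]), w, b))     # -1
```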
51
Linear regression in 1D
52
Linear regression in 1D
Training set: input–output pairs  S = {(x^i, y^i)},  i = 1, …, N,  with x^i ∈ R, y^i ∈ R
53
Linear regression in 1D
y^i = w_0 + w_1 x^i_1 + ε^i
    = w_0 x^i_0 + w_1 x^i_1 + ε^i,   with x^i_0 = 1 ∀i
    = w^T x^i + ε^i
54
Sum of squared errors criterion
y^i = w^T x^i + ε^i
Loss function: sum of squared errors:
L(w) = Σ_{i=1}^N (ε^i)²
Expressed as a function of the two variables:
L(w_0, w_1) = Σ_{i=1}^N [ y^i − (w_0 x^i_0 + w_1 x^i_1) ]²
Question: what is the best (or least bad) value of w? Answer: least squares.
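As a quick illustration, this loss can be evaluated directly for any candidate (w_0, w_1); a sketch with made-up data (not from the slides):

```python
import numpy as np

def sse_loss(w0, w1, x, y):
    """L(w0, w1) = sum_i [y_i - (w0 + w1 * x_i)]^2"""
    residuals = y - (w0 + w1 * x)       # epsilon_i for every training sample
    return np.sum(residuals ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 7.1])      # roughly y = 1 + 2x
print(sse_loss(1.0, 2.0, x, y))         # near the minimum: small loss
print(sse_loss(0.0, 0.0, x, y))         # far from it: large loss
```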
55
Calculus 101
(Figure: a function f(x) of x, with optimum x*)
56
Calculus 101
x* = argmax_x f(x)   (figure: f(x) with its maximizer x*)
57
Condition for maximum: derivative is zero
x* = argmax_x f(x)   (figure: f(x) with its maximizer x*)
58
Condition for maximum: derivative is zero
x* = argmax_x f(x)  →  f'(x*) = 0
59
Condition for minimum: derivative is zero
x* = argmin_x f(x)  →  f'(x*) = 0
60
Vector calculus 101
(Figure: graph of a 2D function f(x), its isocontours f(x) = c, and its gradient field)
Gradient:  ∇f(x) = [∂f/∂x_1, ∂f/∂x_2]^T
At the minimum of the function:  ∇f(x) = 0
61
Back to least squares..
y^i = w^T x^i + ε^i
Loss function: sum of squared errors:  L(w) = Σ_{i=1}^N (ε^i)²
Expressed as a function of the two variables (superscript i: training sample, subscript: feature dimension):
L(w_0, w_1) = Σ_{i=1}^N [ y^i − (w_0 x^i_0 + w_1 x^i_1) ]²
Question: what is the best (or least bad) value of w? Answer: least squares.
62
Fitting a line
L(w_0, w_1) = Σ_{i=1}^N [ y^i − (w_0 x^i_0 + w_1 x^i_1) ]²
∂L(w_0, w_1)/∂w_0 = Σ_{i=1}^N ∂[ y^i − (w_0 x^i_0 + w_1 x^i_1) ]² / ∂w_0
                  = Σ_{i=1}^N 2 [ y^i − (w_0 x^i_0 + w_1 x^i_1) ] (−x^i_0)
                  = −2 Σ_{i=1}^N ( y^i x^i_0 − w_0 x^i_0 x^i_0 − w_1 x^i_1 x^i_0 )
Setting ∂L(w_0, w_1)/∂w_0 = 0:
⇔  Σ_{i=1}^N y^i x^i_0 = w_0 Σ_{i=1}^N x^i_0 x^i_0 + w_1 Σ_{i=1}^N x^i_1 x^i_0
63
Fitting a line, continued
∂L(w_0, w_1)/∂w_0 = 0  ⇔  Σ_{i=1}^N y^i x^i_0 = w_0 Σ_{i=1}^N x^i_0 x^i_0 + w_1 Σ_{i=1}^N x^i_1 x^i_0
∂L(w_0, w_1)/∂w_1 = 0  ⇔  Σ_{i=1}^N y^i x^i_1 = w_0 Σ_{i=1}^N x^i_0 x^i_1 + w_1 Σ_{i=1}^N x^i_1 x^i_1
2 linear equations, 2 unknowns
64
Fitting a line, continued
Σ_{i=1}^N y^i x^i_0 = w_0 Σ_{i=1}^N x^i_0 x^i_0 + w_1 Σ_{i=1}^N x^i_1 x^i_0
Σ_{i=1}^N y^i x^i_1 = w_0 Σ_{i=1}^N x^i_0 x^i_1 + w_1 Σ_{i=1}^N x^i_1 x^i_1
In matrix form:
[ Σ_i y^i x^i_0 ]   [ Σ_i x^i_0 x^i_0   Σ_i x^i_0 x^i_1 ] [ w_0 ]
[ Σ_i y^i x^i_1 ] = [ Σ_i x^i_0 x^i_1   Σ_i x^i_1 x^i_1 ] [ w_1 ]
That's it!
65
Fitting a line, continued
The same 2×2 system of equations:
[ Σ_i y^i x^i_0 ]   [ Σ_i x^i_0 x^i_0   Σ_i x^i_0 x^i_1 ] [ w_0 ]
[ Σ_i y^i x^i_1 ] = [ Σ_i x^i_0 x^i_1   Σ_i x^i_1 x^i_1 ] [ w_1 ]
Or, without summations, with y = [y^1, …, y^N]^T and X the N×2 matrix whose i-th row is [x^i_0, x^i_1]:
Solution:  w = (X^T X)^{-1} X^T y
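A minimal NumPy sketch of this closed-form line fit on synthetic data (in practice np.linalg.lstsq is usually preferable to forming an explicit inverse; the data below are made up):

```python
import numpy as np

# synthetic 1D data, roughly y = 1 + 2x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(30)

# design matrix with x_0 = 1 (bias) and x_1 = x
X = np.column_stack([np.ones_like(x), x])       # shape (N, 2)

# normal equations: (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)                                        # approximately [1.0, 2.0]
```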
66
Linear regression in 1D
67
Linear regression in 2D (or ND)
68
Least squares solution for linear regression
Dimensions: y is N×1, X is N×D, w is D×1, ε is N×1  (D: problem dimension, N: training set size)
y = Xw + ε,  with  y = [y^1, y^2, …, y^N]^T,  X the matrix with entries X_{id} = x^i_d,  w = [w_1, …, w_D]^T,  ε = [ε^1, …, ε^N]^T
69
Least squares solution for linear regression
70
Least squares solution for linear regression
Loss function:  L(w) = Σ_{i=1}^N (y^i − w^T x^i)² = Σ_{i=1}^N (ε^i)²
In matrix form:  L(w) = [ε^1, ε^2, …, ε^N] [ε^1, ε^2, …, ε^N]^T
71
Least squares solution for linear regression
Loss function:  L(w) = Σ_{i=1}^N (y^i − w^T x^i)² = Σ_{i=1}^N (ε^i)²
In matrix form:  L(w) = [ε^1, ε^2, …, ε^N] [ε^1, ε^2, …, ε^N]^T = ε^T ε,  where  y = Xw + ε
72
Generalized linear regression
x → φ(x) = [φ_1(x), …, φ_M(x)]^T
73
1D Example: 2nd degree polynomial fitting
φ(x) = [1, x, x²]^T,   ⟨w, φ(x)⟩ = w_0 + w_1 x + w_2 x²
74
1D Example: K-th degree polynomial fitting
φ(x) = [1, x, …, x^K]^T,   ⟨w, φ(x)⟩ = w_0 + w_1 x + … + w_K x^K
75
2D example: second-order polynomials
x = (x1, x2)
⟨w, φ(x)⟩ = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1² + w_4 x_2² + w_5 x_1 x_2
φ(x) = [1, x_1, x_2, x_1², x_2², x_1 x_2]^T
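A small sketch of this particular feature map in code (the function name phi_2d_quadratic and the parameter values are illustrative):

```python
import numpy as np

def phi_2d_quadratic(x):
    """phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2] for x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

w = np.array([0.5, 1.0, -2.0, 0.3, 0.0, 1.5])    # some parameter vector w0..w5
x = np.array([2.0, -1.0])
print(w @ phi_2d_quadratic(x))                   # <w, phi(x)>
```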
76
Reminder: linear regression
Loss function:  L(w) = Σ_{i=1}^N (y^i − w^T x^i)² = Σ_{i=1}^N (ε^i)²
In matrix form:  y = Xw + ε,  with  y = [y^1, y^2, …, y^N]^T,  X the N×D matrix with entries X_{id} = x^i_d,  w = [w_1, …, w_D]^T,  ε = [ε^1, …, ε^N]^T
77
Reminder: linear regression
Loss function:  L(w) = Σ_{i=1}^N (y^i − w^T x^i)² = Σ_{i=1}^N (ε^i)²
Equivalently:  y = Xw + ε,  where X is the matrix with rows (x^1)^T, (x^2)^T, …, (x^N)^T
78
Generalized linear regression
Feature map:  φ(x) : R^D → R^M
Loss function:  L(w) = Σ_{i=1}^N (y^i − w^T φ(x^i))² = Σ_{i=1}^N (ε^i)²
In matrix form:  y = Φw + ε,  with y N×1, Φ N×M (rows φ(x^i)^T), w = [w_1, …, w_M]^T M×1, ε N×1
79
Least squares solution for linear regression
Minimize:  L(w) = (y − Xw)^T (y − Xw),  where X is the N×D matrix with rows (x^1)^T, (x^2)^T, …, (x^N)^T
Solution:  w = (X^T X)^{-1} X^T y
80
Least squares solution for generalized linear regression
Φ is the N×M matrix with rows φ(x^1)^T, φ(x^2)^T, …, φ(x^N)^T
Minimize:  L(w) = (y − Φw)^T (y − Φw)
Solution:  w = (Φ^T Φ)^{-1} Φ^T y
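Putting the pieces together, a hedged sketch of generalized linear regression with a polynomial design matrix Φ (the 1D data and the degree are made up; np.linalg.lstsq minimizes ||y − Φw||²):

```python
import numpy as np

def poly_features(x, K):
    """phi(x) = [1, x, x^2, ..., x^K] applied to every sample in x."""
    return np.vander(x, K + 1, increasing=True)      # shape (N, K+1)

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 40)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.size)

Phi = poly_features(x, K=5)                          # N x M design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # least-squares solution
print(w)
```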
81
2D example: second-order polynomials
x = (x1, x2)
⟨w, φ(x)⟩ = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1² + w_4 x_2² + w_5 x_1 x_2
φ(x) = [1, x_1, x_2, x_1², x_2², x_1 x_2]^T
82
5D Example: fourth-order polynomials in 5D
x = (x1, . . . , x5)
15625 dimensions ⇒ 15625 parameters
φ(x) = [1, x_1, …, x_5, …, (x_1 x_2 x_3 x_4 x_5)^4]^T
83
What was happening before: approximations
Training set:  S = {(x^i, y^i)},  i = 1, …, N
y^1 ≈ w_0 x^1_0 + w_1 x^1_1 + … + w_D x^1_D
y^2 ≈ w_0 x^2_0 + w_1 x^2_1 + … + w_D x^2_D
…
y^N ≈ w_0 x^N_0 + w_1 x^N_1 + … + w_D x^N_D
If N > D (e.g. 30 points, 2 dimensions) we have more equations than unknowns: an overdetermined system! The input–output relations can only hold approximately.
84
What is happening now: overfitting
Training set:  S = {(x^i, y^i)},  i = 1, …, N
y^1 = w_0 x^1_0 + w_1 x^1_1 + … + w_D x^1_D
y^2 = w_0 x^2_0 + w_1 x^2_1 + … + w_D x^2_D
…
y^N = w_0 x^N_0 + w_1 x^N_1 + … + w_D x^N_D
If N < D (e.g. 30 points, 15625 dimensions) we have more unknowns than equations: an underdetermined system! The input–output equations hold exactly, but we are simply memorizing the data.
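A small numerical illustration of this regime (the dimensions and data below are made up): with more parameters than training points, a least-squares fit satisfies the training equations exactly, i.e. it memorizes the data.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 5, 20                          # N < D: underdetermined
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

# minimum-norm least-squares solution (lstsq handles the underdetermined case)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ w, y))          # True: the training equations hold exactly
```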
85
Overfitting, in images
(Figure: classification and regression fits, including one that is "just right")
86
Tuning the model’s complexity
A flexible model approximates the target function well on the training set, but can "overtrain" and perform poorly on the test set ("variance"). A rigid model's performance is more predictable on the test set, but the model may not be good even on the training set ("bias").
87
Regularization: keeping it simple
In high dimensions there are too many solutions to the same problem. Regularization: prefer the least complex among them. How? Penalize complexity.
88
How to control complexity?
Observation: the problem started with high-dimensional embeddings. Guess: the number of dimensions relates to "complexity". But what if we force the classifier not to use all of the parameters? (Week 4: we will guess again!) Intuition: with many parameters, we can fit anything. Idea: penalize the use of large parameter values. How do we measure "large"? How do we enforce small values?
89
How do we measure “large”?
Method parameters: D-dimensional vector  w = [w_1, w_2, …, w_D]
"Large" vector: measured with a vector norm.
L2 ("Euclidean") norm:  ||w||_2 := √( Σ_{d=1}^D w_d² ) = √⟨w, w⟩
L1 ("Manhattan") norm:  ||w||_1 := Σ_{d=1}^D |w_d|
Lp norm, p > 1:  ||w||_p := ( Σ_{d=1}^D |w_d|^p )^{1/p}
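For concreteness, these norms computed directly in NumPy (the example vector is arbitrary):

```python
import numpy as np

w = np.array([3.0, -4.0, 0.5])

l2 = np.sqrt(np.sum(w ** 2))                            # Euclidean norm, same as np.linalg.norm(w, 2)
l1 = np.sum(np.abs(w))                                  # Manhattan norm, np.linalg.norm(w, 1)
lp = lambda p: np.sum(np.abs(w) ** p) ** (1.0 / p)      # general Lp norm

print(l2, l1, lp(3))
```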
90
Regularized linear regression
Residual vector:  ε = y − Φw
Linear regression: minimize the model error ("data fidelity")  ε^T ε
Complexity term (regularizer):  R(w) := ||w||_2² = w^T w
New objective: "data fidelity" + complexity, i.e.  ε^T ε + λ R(w),  where λ is a scalar that remains to be determined; the minimum also remains to be determined.
91
Least squares solution
L(w) = ε^T ε = (y − Xw)^T (y − Xw) = y^T y − 2 y^T Xw + w^T X^T X w
Condition for minimum:  ∇L(w*) = 0
⇔  −2 X^T y + 2 X^T X w* = 0   ⇔   w* = (X^T X)^{-1} X^T y
92
Ridge regression: L2-regularized linear regression
L(w) = ε^T ε + λ w^T w = y^T y − 2 y^T Xw + w^T X^T X w + λ w^T I w   (the first terms as before, for linear regression; I is the identity matrix)
     = y^T y − 2 y^T Xw + w^T (X^T X + λ I) w
Condition for minimum:  ∇L(w*) = 0
⇔  −2 X^T y + 2 (X^T X + λ I) w* = 0   ⇔   w* = (X^T X + λ I)^{-1} X^T y
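A minimal sketch of the ridge solution on synthetic data, compared with the unregularized solution (the data and the λ value below are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 30, 10
X = rng.standard_normal((N, D))
w_true = rng.standard_normal(D)
y = X @ w_true + 0.1 * rng.standard_normal(N)

lam = 0.5                                                        # regularization strength lambda
w_ols = np.linalg.solve(X.T @ X, X.T @ y)                        # plain least squares
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)    # ridge regression
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))            # ridge solution has the smaller norm
```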
93
Ridge regression, continued
Regularizer:  R(w) := ||w||_2² = w^T w
New objective: "data fidelity" + complexity, i.e.  ε^T ε + λ R(w)
λ: a "hyperparameter" scalar that remains to be determined; we just determined the minimum with respect to w.
Note: direct minimization of the objective with respect to λ would lead to λ = 0.
94
Bias-Variance tradeoff as a function of λ
(Figure: bias–variance tradeoff as a function of λ; the "sweet spot" lies at an intermediate value)
95
Selecting λ with cross-validation
– Exclude part of the training data from parameter estimation
– Use this held-out part only to predict the test error
– K splits, average the K errors
– Repeat for different values of the λ parameter
– Pick the value that minimizes the cross-validation error
Least glorious, most effective.
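A hedged sketch of this procedure for ridge regression (K-fold splits over a made-up grid of λ values; all names and data are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

def cv_error(X, y, lam, K=5):
    """Average validation error over K splits: each fold is excluded from
    parameter estimation and used only to predict the test error."""
    folds = np.array_split(np.random.default_rng(0).permutation(len(y)), K)
    errs = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[val] - X[val] @ w) ** 2))
    return np.mean(errs)

# pick the lambda that minimizes the cross-validation error
X = np.random.default_rng(1).standard_normal((50, 10))
y = X[:, 0] + 0.1 * np.random.default_rng(2).standard_normal(50)
lambdas = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
best = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print("selected lambda:", best)
```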