Statistical Machine Learning: A Crash Course
Part II: Classification & SVMs
Stefan Roth, 11.05.2012 | Department of Computer Science | GRIS
Bayesian Decision Theory

Decision rule: if $p(C_1 \mid \mathbf{x}) > p(C_2 \mid \mathbf{x})$, decide $C_1$; otherwise decide $C_2$.
Equivalently: $p(\mathbf{x} \mid C_1)\, p(C_1) > p(\mathbf{x} \mid C_2)\, p(C_2)$. We do not need the normalization $p(\mathbf{x})$!
Bayes' rule:
$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}$$
with likelihood $p(\mathbf{x} \mid C_k)$, class prior $p(C_k)$, and posterior $p(C_k \mid \mathbf{x})$.
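Since only the product of likelihood and prior matters, the rule is a one-liner. A minimal Python sketch, assuming two hypothetical 1-D Gaussian class-conditionals (means, variances, and priors below are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional parameters and priors (not from the slides).
mu = {1: 0.0, 2: 2.0}
sigma = {1: 1.0, 2: 1.0}
prior = {1: 0.7, 2: 0.3}

def decide(x):
    """Bayes decision rule: pick the class with the larger unnormalized
    posterior p(x | C_k) * p(C_k); the evidence p(x) cancels."""
    s1 = norm.pdf(x, mu[1], sigma[1]) * prior[1]
    s2 = norm.pdf(x, mu[2], sigma[2]) * prior[2]
    return 1 if s1 > s2 else 2

print(decide(0.5), decide(1.8))  # -> 1 2
```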
[Figure: class-conditional densities $p(\mathbf{x} \mid C_k)$, posteriors $p(C_k \mid \mathbf{x})$, and the resulting decision boundary]
Linear discriminant: $y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$, where $\mathbf{w}$ is the normal vector of the decision boundary. If $y(\mathbf{x}) \ge 0$, decide class 1, otherwise class 2.
[Figure: the decision boundary $y = 0$ separates regions $R_1$ ($y > 0$) and $R_2$ ($y < 0$); $\mathbf{w}$ is the normal vector, and the boundary lies at distance $-w_0 / \|\mathbf{w}\|$ from the origin]

$y(\mathbf{x}) / \|\mathbf{w}\|$ is the signed distance of $\mathbf{x}$ to the decision boundary.
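A small numpy sketch of this geometry, with illustrative values for $\mathbf{w}$ and $w_0$:

```python
import numpy as np

w = np.array([3.0, 4.0])   # normal vector (illustrative)
w0 = -5.0                  # offset (illustrative)

def signed_distance(x):
    """y(x) / ||w||: positive in region R1, negative in region R2."""
    return (w @ x + w0) / np.linalg.norm(w)

print(signed_distance(np.array([3.0, 4.0])))  # 20 / 5 = 4.0
print(signed_distance(np.array([0.0, 0.0])))  # y(0)/||w|| = w0/||w|| = -1.0
```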
[Figure: a linearly separable data set (left) and a data set that is not linearly separable (right)]
Extension to multiple classes by combining binary classifiers leads to ambiguities.

[Figure (one-versus-the-rest): classifiers “$C_1$ vs. not $C_1$” and “$C_2$ vs. not $C_2$” leave regions ($R_1$, $R_2$, $R_3$, “?”) that are claimed by neither or by both classes]

[Figure (one-versus-one): pairwise classifiers $C_1/C_2$, $C_1/C_3$, $C_2/C_3$ still leave an ambiguous region “?”]
Better: use one discriminant function $y_k(\mathbf{x})$ for each class: assign $\mathbf{x}$ to class $C_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \neq k$.

[Figure: decision regions $R_i$, $R_j$, $R_k$; every point $\hat{\mathbf{x}}$ on the line segment between $\mathbf{x}_A$ and $\mathbf{x}_B$ stays within the same region]

If the discriminant functions are linear, the decision regions are connected and convex.
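The argmax rule in code, with illustrative per-class weight vectors:

```python
import numpy as np

# One linear discriminant y_k(x) = w_k^T x + w_k0 per class (values illustrative).
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])

def classify(x):
    """Assign x to the class whose discriminant value is largest."""
    return int(np.argmax(W @ x + w0))

print(classify(np.array([2.0, 0.5])))  # scores [2.0, 0.5, -2.0] -> class 0
```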
Using the posteriors as discriminant functions, $y_k(\mathbf{x}) = p(C_k \mid \mathbf{x})$, recovers the Bayes classifier. But can we find suitable discriminant functions directly, without building probabilistic models of the classes?
We do not need to model the full class-conditional distributions to come up with a good decision boundary. Probabilistic models often capture intricacies that do not matter at the end; discriminant functions can avoid some of the complexity. In practice, discriminative methods are often superior to probabilistic ones.
Least squares: ideally, the discriminant satisfies $\mathbf{x}_i^T \mathbf{w} + w_0 = y_i$ for every training example $(\mathbf{x}_i, y_i)$ with $y_i \in \{-1, +1\}$.
With augmented vectors $\tilde{\mathbf{x}}_i = (1, \mathbf{x}_i^T)^T$ and $\tilde{\mathbf{w}} = (w_0, \mathbf{w}^T)^T$, the conditions $\mathbf{x}_i^T \mathbf{w} + w_0 = y_i$ become $\tilde{X} \tilde{\mathbf{w}} = \mathbf{y}$, with the rows of $\tilde{X}$ given by the $\tilde{\mathbf{x}}_i^T$.
Pseudo-inverse solution of the (generally overdetermined) system, minimizing the squared error:
$$\tilde{\mathbf{w}} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \mathbf{y} = \tilde{X}^{\dagger} \mathbf{y}$$
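A minimal numpy sketch on an illustrative toy set; np.linalg.lstsq computes exactly this pseudo-inverse solution:

```python
import numpy as np

# Illustrative toy data with labels y_i in {-1, +1}.
X = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 2.5], [3.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

Xt = np.hstack([np.ones((len(X), 1)), X])         # augmented inputs (1, x^T)
w_tilde, *_ = np.linalg.lstsq(Xt, y, rcond=None)  # pseudo-inverse solution

print(np.sign(Xt @ w_tilde))  # sign of x~^T w~; recovers y on this toy set
```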
[Figure, left: no outliers, the least-squares discriminant works. Right: outliers present, the least-squares discriminant breaks down]
Perceptron algorithm: cycle through the training examples $(\mathbf{x}_i, y_i)$; whenever $\mathbf{x}_i$ is misclassified ($y(\mathbf{x}_i) \neq y_i$), update: if $y_i = +1$: $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{x}_i$, $b \leftarrow b + 1$; if $y_i = -1$: $\mathbf{w} \leftarrow \mathbf{w} - \mathbf{x}_i$, $b \leftarrow b - 1$.
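The update rule translates directly into code (the toy data is illustrative):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron learning: add or subtract misclassified points.
    X: (n, d) inputs, y: labels in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:  # misclassified (or on the boundary)
                w += yi * xi            # w <- w +/- x_i
                b += yi                 # b <- b +/- 1
                mistakes += 1
        if mistakes == 0:               # all points correct: converged
            break
    return w, b

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))  # matches y for this separable toy set
```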
[Figures: evolution of the perceptron decision boundary over several update iterations]
If a separating hyperplane exists (i.e., the data is linearly separable), the perceptron algorithm will find it.
Theorem 2.1: Let $S = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ be a data set containing at least one example of each class, with $\|\mathbf{x}_i\| \le R$ for all $i$. Suppose there exists a weight vector $\mathbf{w}^*$ with $\|\mathbf{w}^*\| = 1$ such that $\gamma_i = y_i \langle \mathbf{w}^*, \mathbf{x}_i \rangle \ge \gamma$ for all $i$ and some $\gamma > 0$. Then the perceptron converges after a finite number of updates, on the order of $(R / \gamma)^2$.
Not every data set is linearly separable, however. The classic counterexample is the “XOR function”: no linear classifier can represent it. The perceptron's not being able to handle this case halted research on this and related techniques for decades.
[Figure: data that is not linearly separable in the original coordinates (left); after a suitable feature transformation it becomes linearly separable (right)]
Idea: apply a nonlinear feature transformation $\boldsymbol{\phi}(\mathbf{x})$ and train a linear classifier in the transformed space.
Example: if one class lies inside a circle and the other one outside, we can use a linear classifier on the radius from the center to perfectly classify the data. The classifier is linear in the transformed feature space, but if we look at the original space, the classifier is nonlinear!
[Figure: ring data in the original space and in the transformed space]
$$\boldsymbol{\phi}(\mathbf{x}) = \left( \sqrt{x_1^2 + x_2^2},\ \arctan(x_2 / x_1) \right)^T$$
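A sketch of the transformation on illustrative ring data; np.arctan2 is used instead of arctan to avoid dividing by zero:

```python
import numpy as np

def phi(X):
    """Polar feature transform: (radius, angle)."""
    r = np.sqrt(X[:, 0]**2 + X[:, 1]**2)
    theta = np.arctan2(X[:, 1], X[:, 0])
    return np.column_stack([r, theta])

# Illustrative ring data: class -1 inside radius 1, class +1 outside.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 40)
radii = np.concatenate([rng.uniform(0.2, 0.8, 20), rng.uniform(1.2, 1.8, 20)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.concatenate([-np.ones(20), np.ones(20)])

# A simple threshold on the radius coordinate separates the classes:
print(np.all(np.sign(phi(X)[:, 0] - 1.0) == y))  # True
```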
Another popular class of feature transformations: polynomials of a certain degree. A general quadratic function in two variables can be written as
$$A_{11} x_1^2 + (A_{12} + A_{21}) x_1 x_2 + A_{22} x_2^2 + b_1 x_1 + b_2 x_2 + c.$$
We can thus represent quadratic decision boundaries using:
$$A_{11} x_1^2 + (A_{12} + A_{21}) x_1 x_2 + A_{22} x_2^2 + b_1 x_1 + b_2 x_2 + c,$$
which is a linear function of the transformed features $\boldsymbol{\phi}(\mathbf{x}) = (x_1^2,\ x_1 x_2,\ x_2^2,\ x_1,\ x_2)^T$.
[Figure: circular decision boundary]
Example: a circle of radius $r$ around the origin:
$$x_1^2 + x_2^2 - r^2 = 1 \cdot x_1^2 + 0 \cdot x_1 x_2 + 1 \cdot x_2^2 + 0 \cdot x_1 + 0 \cdot x_2 - r^2 = (1, 0, 1, 0, 0)\, (x_1^2,\ x_1 x_2,\ x_2^2,\ x_1,\ x_2)^T - r^2$$
More generally, we can map the input into the space of all monomials with degree $d$. For inputs with $N$ components there are
$$\binom{d + N - 1}{d}$$
dimensions in the transformed space, e.g., $d = 5$, $N = 256$ $\Rightarrow$ $\approx 10^{10}$.
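The count is a single binomial coefficient and easy to check:

```python
from math import comb

# Number of monomials of degree d in N variables: C(d + N - 1, d).
d, N = 5, 256
print(comb(d + N - 1, d))  # 9525431552, i.e. ~1e10 as on the slide
```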
A linear classifier in this space would have $\approx 10^{10}$ parameters. However, if the learning algorithm can be rewritten so that it only ever needs scalar products between feature vectors, working with such spaces doesn't sound like an unreasonable task.
Recall the perceptron updates for a misclassified example ($y(\mathbf{x}_i) \neq y_i$): if $y_i = +1$, then $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{x}_i$, $b \leftarrow b + 1$; if $y_i = -1$, then $\mathbf{w} \leftarrow \mathbf{w} - \mathbf{x}_i$, $b \leftarrow b - 1$.
Since $\mathbf{w}$ starts at $\mathbf{0}$ and only ever has data points added or subtracted, it is always a linear combination of the $n$ training points:
$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i$$
The classifier therefore only needs scalar products with the training points:
$$y(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i y_i\, \mathbf{x}_i^T \mathbf{x} + b$$
In this dual representation the parameters are one coefficient $\alpha_i$ per training example, and both learning and classification access the data only through scalar products:
$$y(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i y_i\, \mathbf{x}_i^T \mathbf{x} + b$$
Dual perceptron: for each training example $\mathbf{x}_i$, if it is misclassified ($y(\mathbf{x}_i) \neq y_i$), update $\alpha_i \leftarrow \alpha_i + 1$ and $b \leftarrow b + y_i$, where
$$y(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i y_i\, \mathbf{x}_i^T \mathbf{x} + b$$
With a feature transformation, everything only depends on scalar products of transformed features:
$$y(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i y_i\, \boldsymbol{\phi}(\mathbf{x}_i)^T \boldsymbol{\phi}(\mathbf{x}) + b$$
The question is whether we can compute the scalar product between the transformed features efficiently. For many useful transformations, this scalar product can be computed efficiently without having to go to the high-dimensional feature space.
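Combining the dual update with such a scalar-product function gives the kernel perceptron. A sketch (the quadratic kernel at the end is one illustrative choice):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, max_epochs=100):
    """Dual-form perceptron: data is accessed only via kernel values.
    X: (n, d) inputs, y: labels in {-1, +1}."""
    n = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix
    alpha, b = np.zeros(n), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (np.sum(alpha * y * K[:, i]) + b) <= 0:  # misclassified
                alpha[i] += 1.0   # alpha_i <- alpha_i + 1
                b += y[i]         # b <- b + y_i
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

# Example kernel: quadratic monomials via (x^T y)^2, no explicit phi needed.
quad_kernel = lambda u, v: (u @ v) ** 2
```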
A caveat: with enough features or a sufficiently complex model we can always fit the training data. What matters is the generalization ability and the corresponding risk.

[Figure: a classifier built on irrelevant features such as shoe size and matriculation number can fit the training data but will not generalize]
[Figure: training error and test error vs. # of model parameters, over runs 1 to 4: the training error keeps decreasing with model complexity while the test error eventually increases]
Statistical learning theory studies the generalization abilities (of a learning machine). Before it, practitioners largely used heuristics. Classical asymptotic results, in turn, do not say that much about real problems...
True risk (unknown, since the data distribution is unknown):
$$R(\mathbf{w}) = \int L\big(y, f(\mathbf{x}; \mathbf{w})\big)\, dP(\mathbf{x}, y)$$
Empirical risk, measured on the $N$ training examples:
$$R_{\text{emp}}(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(\mathbf{x}_i; \mathbf{w})\big)$$
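For concreteness, the empirical risk under 0/1 loss in a few lines (classifier and data are illustrative):

```python
import numpy as np

def empirical_risk(predict, X, y):
    """R_emp = (1/N) sum_i L(y_i, f(x_i)) with 0/1 loss."""
    return float(np.mean(predict(X) != y))

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
predict = lambda X: np.sign(X[:, 0] - 1.5)  # simple threshold classifier
print(empirical_risk(predict, X, y))        # 0.0 on this toy sample
```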
All 3 decision boundaries have zero empirical risk. Which one is preferable? Which one generalizes best to unseen data?
Minimizing the empirical risk alone is not enough: we also have to control the complexity of the class of decision functions (formally: its VC-dimension).
[Figure: classifying patients from positive and negative examples; a too simple boundary, a too complex boundary, and a good tradeoff give different predictions (“?”) for a new patient. Figure credit: Florian Markowetz]
The VC-dimension of a family of decision functions is the largest number of points that can be shattered by that family (no matter which label configuration the data points have). It measures the “expressive power” of a classifier.
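As a sanity check of the definition, the sketch below verifies that linear classifiers in the plane shatter 3 non-collinear points, i.e., realize all $2^3$ labelings (consistent with their VC-dimension of 3; the points are illustrative):

```python
import numpy as np
from itertools import product

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # non-collinear points
Xt = np.hstack([np.ones((3, 1)), pts])                # augmented coordinates

def realizable(labels):
    """Solve X~ w = labels exactly (3x3 system) and check the signs."""
    w, *_ = np.linalg.lstsq(Xt, labels, rcond=None)
    return bool(np.all(np.sign(Xt @ w) == labels))

print(all(realizable(np.array(l)) for l in product([-1.0, 1.0], repeat=3)))
# -> True: all 8 labelings are linearly realizable
```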
Vapnik: with probability $1 - p$,
$$R(\mathbf{w}) \le R_{\text{emp}}(\mathbf{w}) + \varepsilon(N, p, h),$$
where $h$ is the VC-dimension of the function family (Vapnik's inequality). Minimizing the right-hand side controls the true risk. It is not guaranteed, however, that this bound on the true risk is a tight bound.
Structural risk minimization: the number of training examples (etc.) is fixed. We minimize the bound on the true risk by trading the empirical risk against the VC-dimension $h$ (“capacity control”).
$$R(\mathbf{w}) \le R_{\text{emp}}(\mathbf{w}) + \varepsilon(N, p, h)$$
Minimizing only $R_{\text{emp}}(\mathbf{w})$ ignores the capacity term $\varepsilon(N, p, h)$; minimizing only $\varepsilon(N, p, h)$ ignores the data. If the data is separable, we can achieve $R_{\text{emp}}(\mathbf{w}) = 0$ and then minimize $\varepsilon(N, p, h)$.
Support vector machines follow this principle. For separable data the empirical risk will be zero, and the risk bound will be approximately minimized. This underlies their good generalization abilities.
Which of the many hyperplanes that separates the data should we pick? Which one gives a small VC-dimension?
A hyperplane classifier $y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$ separates the data if $y_i\, y(\mathbf{x}_i) > 0$ for $i = 1, \ldots, N$.

[Figure: hyperplane geometry as before: boundary $y = 0$, regions $R_1$ ($y > 0$) and $R_2$ ($y < 0$), normal vector $\mathbf{w}$, offset $-w_0 / \|\mathbf{w}\|$, signed distance $y(\mathbf{x}) / \|\mathbf{w}\|$]
Intuition: pick the hyperplane that is farthest from the data. The maximum-margin solution generalizes best to unseen data.

[Figure: separating hyperplane with level sets $y = -1$, $y = 0$, $y = +1$ and the margin between them]

Maximize the margin (distance to the closest data point).
Recall that the distance of a point $\mathbf{x}$ to the hyperplane is $|y(\mathbf{x})| / \|\mathbf{w}\|$.

[Figure: hyperplane geometry, as before]
The maximum-margin hyperplane solves
$$\arg\max_{\mathbf{w}, b}\ \Big\{ \frac{1}{\|\mathbf{w}\|}\ \min_{i = 1, \ldots, N} \big[ y_i (\mathbf{w}^T \mathbf{x}_i + b) \big] \Big\}$$
This max-min problem is awkward to optimize directly. Since rescaling $(\mathbf{w}, b)$ does not change the classifier, we can fix the scale using the point closest to the data: $y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1$ for that point, hence $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$ for all $i = 1, \ldots, N$, and the margin becomes $1 / \|\mathbf{w}\|$.
The points closest to the hyperplane, i.e., those with $y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1$, are called support vectors.

[Figure: maximum-margin hyperplane with level sets $y = -1$, $y = 0$, $y = +1$; the support vectors lie on $y = \pm 1$]

The optimization problem can be written as:
$$\min_{\mathbf{w}, b}\ \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{s.t.}\quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1,\quad i = 1, \ldots, N$$
Introduce one Lagrange multiplier $a_i \ge 0$ per constraint:
$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{N} a_i \big[ y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 \big]$$
Setting the derivatives with respect to $\mathbf{w}$ and $b$ to zero:
$$\mathbf{w} = \sum_{i=1}^{N} a_i y_i \mathbf{x}_i, \qquad \sum_{i=1}^{N} a_i y_i = 0$$
Substituting both conditions back into $L(\mathbf{w}, b, \mathbf{a})$ eliminates $\mathbf{w}$ and $b$:
$$\tilde{L}(\mathbf{a}) = \sum_{i=1}^{N} a_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j y_i y_j\, (\mathbf{x}_j^T \mathbf{x}_i)$$
This gives the dual problem: maximize the dual function
$$\tilde{L}(\mathbf{a}) = \sum_{i=1}^{N} a_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j y_i y_j\, (\mathbf{x}_j^T \mathbf{x}_i)$$
subject to $a_i \ge 0$ for $i = 1, \ldots, N$ and $\sum_{i=1}^{N} a_i y_i = 0$.
This is a quadratic program and can be solved with standard methods. Given the solution $\mathbf{a}$, classifying a new point requires only scalar products:
$$y(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{N} a_i y_i\, (\mathbf{x}_i^T \mathbf{x}) + b \Big)$$
The bias $b$ can be calculated as well, but we will skip that here...
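A minimal sketch that solves this dual on a tiny separable toy set with a general-purpose optimizer (a dedicated QP solver would be used in practice; data is illustrative):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = (X @ X.T) * np.outer(y, y)                  # y_i y_j x_i^T x_j

neg_dual = lambda a: 0.5 * a @ K @ a - a.sum()  # minimize the negated dual
res = minimize(neg_dual, np.zeros(len(X)),
               bounds=[(0.0, None)] * len(X),                       # a_i >= 0
               constraints={'type': 'eq', 'fun': lambda a: a @ y})  # sum a_i y_i = 0
a = res.x

w = (a * y) @ X                  # w = sum_i a_i y_i x_i
sv = a > 1e-6                    # support vectors have a_i > 0
b = np.mean(y[sv] - X[sv] @ w)   # y_n (w^T x_n + b) = 1 on support vectors
print(np.sign(X @ w + b))        # matches y on this toy set
```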
Properties of the solution (Karush-Kuhn-Tucker conditions): for all $i = 1, \ldots, N$,
$$a_i \ge 0, \qquad y_i\, y(\mathbf{x}_i) - 1 \ge 0, \qquad a_i \big[ y_i\, y(\mathbf{x}_i) - 1 \big] = 0,$$
with $y(\mathbf{x}) = \sum_{j=1}^{N} a_j y_j\, (\mathbf{x}_j^T \mathbf{x}) + b$. So for every point either $a_i = 0$ or $\mathbf{x}_i$ lies exactly on the margin.
The solution is sparse: only the $N_S$ support vectors ($N_S$ = # of support vectors, forming the set $S$) have $a_i > 0$. The bias follows from them:
$$b = \frac{1}{N_S} \sum_{n \in S} \Big( y_n - \sum_{m \in S} a_m y_m\, \mathbf{x}_m^T \mathbf{x}_n \Big)$$
Only the support vectors enter the functional form:
$$y(\mathbf{x}) = \sum_{i \in S} a_i y_i\, \mathbf{x}_i^T \mathbf{x} + b$$
The $N_S$ support vectors are the points that define the margin of the classifier.
Back to feature transformations: since the data enters training and classification only through the scalar product, we can avoid mapping into a high-dimensional space and instead compute the scalar product directly.
Example in 2-D:
$$(\mathbf{x}^T \mathbf{y})^2 = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = \big(x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2\big)\, \big(y_1^2,\ \sqrt{2}\, y_1 y_2,\ y_2^2\big)^T$$
The scalar product in the quadratic feature space is computed directly from $\mathbf{x}^T \mathbf{y}$.
The feature map is not unique; the same value also arises from a different transformation:
$$x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = \frac{1}{2}\, \big(x_1^2 - x_2^2,\ 2 x_1 x_2,\ x_1^2 + x_2^2\big)\, \big(y_1^2 - y_2^2,\ 2 y_1 y_2,\ y_1^2 + y_2^2\big)^T$$
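Both factorizations are easy to verify numerically:

```python
import numpy as np

phi1 = lambda v: np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])
phi2 = lambda v: np.array([v[0]**2 - v[1]**2,
                           2 * v[0] * v[1],
                           v[0]**2 + v[1]**2]) / np.sqrt(2)

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# Kernel value and both feature-space scalar products agree (all equal 1.0):
print((x @ y) ** 2, phi1(x) @ phi1(y), phi2(x) @ phi2(y))
```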
In general, let $C_d(\mathbf{x})$ map $\mathbf{x}$ into the space of all ordered monomials of degree $d$. For $d = 2$ the unordered monomials are $(x_1^2,\ x_1 x_2,\ x_2^2)$, the ordered ones $(x_1^2,\ x_1 x_2,\ x_2 x_1,\ x_2^2)$. We can train a maximum-margin classifier in this transformed space, because the kernel computes the scalar products without doing the explicit mapping:
$$C_d(\mathbf{x})^T C_d(\mathbf{y}) = (\mathbf{x}^T \mathbf{y})^d$$
The transformed space has $\binom{d + N - 1}{d}$ dimensions (unordered monomials), yet the kernel $(\mathbf{x}^T \mathbf{y})^d$ costs only a single scalar product in $\mathbb{R}^N$.
[Figure: data that is linearly separable, data where the classifier is almost linear, and data that is not linearly separable (in the original space)]
The explicit mapping $\boldsymbol{\phi}(\mathbf{x})$ is never needed; only scalar products are useful. A kernel computes this scalar product without making the mapping explicit:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \boldsymbol{\phi}(\mathbf{x}_i)^T \boldsymbol{\phi}(\mathbf{x}_j)$$
So why not specify the kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ directly, without ever defining $\boldsymbol{\phi}$?
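Two standard kernels written directly as functions; no feature map is ever constructed (the Gaussian one corresponds to an infinite-dimensional feature space):

```python
import numpy as np

def poly_kernel(x, y, d=2):
    """Homogeneous polynomial kernel (x^T y)^d."""
    return (x @ y) ** d

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian / RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
```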
Caution: if we shrink the kernel bandwidth more and more, at some point every data point will have its “own” kernel: the classifier simply memorizes the training set. To guarantee generalization we must limit the VC-dimension!
Which functions are admissible kernels? Mercer's condition: $K$ is a valid kernel if, for every finite set of points, with $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$, the Gram matrix is symmetric and positive semi-definite, i.e., for all coefficients $c_i$ it holds that $\sum_i \sum_j c_i c_j K(\mathbf{x}_i, \mathbf{x}_j) \ge 0$.
The inhomogeneous polynomial kernel $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y} + 1)^d$ additionally contains all monomials of degree $< d$.
With a suitable kernel, data that is not separable in the original space can become linearly separable. But beware: more flexible kernels also mean more capacity.
Soft margin: allow margin violations, measured by slack variables $\xi_i \ge 0$: $\xi_i = 0$ for points that are not inside the margin, $0 < \xi_i < 1$ for points inside the margin but correctly classified, and $\xi_i > 1$ for misclassified points.

[Figure: boundary $y = 0$ with margins $y = \pm 1$ and points with $\xi = 0$, $\xi < 1$, $\xi > 1$]
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}}\ \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.}\quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \ldots, N$$
The constant $C$ controls the trade-off between margin and slack penalty.
The dual problem is unchanged except for a “box constraint” on the multipliers: maximize
$$\tilde{L}(\mathbf{a}) = \sum_{i=1}^{N} a_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j y_i y_j\, (\mathbf{x}_j^T \mathbf{x}_i) \quad \text{s.t.}\quad 0 \le a_i \le C,\ \ \sum_{i=1}^{N} a_i y_i = 0.$$
The solution is again sparse with $N_S$ support vectors.
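In practice the QP is rarely solved by hand. A sketch using scikit-learn's SVC (the data and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative overlapping two-class data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel='rbf', C=1.0)  # C is the box-constraint upper bound on a_i
clf.fit(X, y)
print(clf.n_support_)   # number of support vectors per class
print(clf.score(X, y))  # training accuracy
```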
Application example: text classification (spam filtering, topic categorization, etc.). Documents are represented by their word frequency: a bag-of-words feature vector (often after dimensionality reduction).
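A minimal bag-of-words sketch with scikit-learn (the tiny corpus and its spam/ham labels are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap watches now", "project meeting tomorrow"]
labels = [1, -1, 1, -1]  # 1 = spam, -1 = ham (illustrative)

X = CountVectorizer().fit_transform(docs)  # word-frequency features
clf = LinearSVC(C=1.0).fit(X, labels)      # linear soft-margin SVM
print(clf.predict(X))                      # recovers the labels on this toy corpus
```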
networks.