CS 6355: Structured Prediction
Review: Supervised Learning
Previous lecture:
- A broad overview of structured prediction
- The different aspects of the area
- Basically the syllabus of the class

Questions?

This lecture: a review of supervised learning.
For now, consider binary classification with linear classifiers.

In n dimensions, a linear classifier represents a hyperplane that separates the space into two half-spaces. In two dimensions, the weight vector [w1 w2] is the normal to the separating line.
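As a concrete illustration (not from the slides), here is a minimal sketch of such a classifier in NumPy; the weights `w` and bias `b` are hypothetical values chosen for the example:

```python
import numpy as np

# A linear classifier predicts sign(w.x + b):
# w is the normal of the separating hyperplane, b shifts it off the origin.
w = np.array([1.0, -2.0])  # hypothetical 2D weight vector
b = 0.5                    # hypothetical bias

def predict(x):
    """Return +1 or -1 depending on which half-space x falls into."""
    return 1 if np.dot(w, x) + b > 0 else -1

print(predict(np.array([3.0, 1.0])))  # +1
print(predict(np.array([0.0, 1.0])))  # -1
```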
Not all functions are linearly separable: for some datasets, no line can be drawn to separate the two classes.
Update only on an error: the Perceptron is a mistake-driven algorithm. The number of epochs T is a hyperparameter of the algorithm.
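A minimal sketch of the mistake-driven Perceptron described above, assuming labels in {-1, +1}; the training data here is hypothetical:

```python
import numpy as np

def perceptron(X, y, T):
    """Train a Perceptron for T epochs, updating w only on mistakes."""
    w = np.zeros(X.shape[1])
    for _ in range(T):                     # T: number of epochs (hyperparameter)
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:  # mistake (or on the boundary)
                w += y_i * x_i             # update only on an error
    return w

# Hypothetical separable data; the constant last feature acts as a bias term.
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0],
              [-1.0, -1.0, 1.0], [-2.0, 1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y, T=10)
print(np.sign(X @ w))  # should reproduce y once the data is fit
```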
We want to minimize the expected loss over examples drawn from the data distribution D. But the distribution D is unknown, so we minimize the loss on the training set instead.

Minimizing training loss alone can lead to overfitting!
Regularized loss minimization:

$$\min_{w} \; \text{regularizer}(w) + \frac{C}{N} \sum_{i=1}^{N} L(h(x_i), y_i)$$

For example, with the L2 regularizer and a linear hypothesis:

$$\min_{w} \; \frac{1}{2} w^\top w + C \sum_{i} L(f(x_i, w), y_i)$$
Learning as optimization: find the weights that minimize the loss on the training set,

$$\min_{w} \; \sum_{i} L(f(x_i, w), y_i)$$
Minimize with stochastic gradient descent: repeatedly pick a training example $(x_i, y_i)$ and update the weights in the direction of the negative gradient,

$$w^{t+1} \leftarrow w^{t} - \gamma_t \nabla L(f(x_i, w^t), y_i)$$

$\gamma_t$: learning rate, many tweaks possible. If the objective is not convex, initialization can be important.
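A sketch of this SGD loop, using the hinge loss as the concrete choice of $L$ and a simple decaying learning-rate schedule; both choices are assumptions for illustration:

```python
import numpy as np

def sgd_hinge(X, y, epochs=20, gamma0=0.1):
    """Minimize sum_i max(0, 1 - y_i * w.x_i) by stochastic gradient descent."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):  # visit examples in random order
            gamma = gamma0 / (1 + t)       # gamma_t: a simple decaying learning rate
            if y[i] * np.dot(w, X[i]) < 1: # inside the margin: gradient is -y_i x_i
                w += gamma * y[i] * X[i]
            t += 1
    return w
```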
The zero-one loss counts mistakes:

$$L_{0\text{-}1}(y, y') = \begin{cases} 1 & \text{if } y \neq y', \\ 0 & \text{if } y = y'. \end{cases}$$

For a linear classifier, in terms of the margin:

$$L_{0\text{-}1}(y, x, w) = \begin{cases} 1 & \text{if } y\,f(x, w) \leq 0, \\ 0 & \text{if } y\,f(x, w) > 0. \end{cases}$$
For binary classification, minimize

$$\min_{w} \; \text{regularizer}(w) + C \cdot \frac{1}{N} \sum_{i} L(f(x_i, w), y_i)$$

Different choices of the loss $L$ give different learning algorithms:

- Zero-one: $L_{0\text{-}1}(y, x, w) = 1$ if $y\,f(x, w) \leq 0$, else $0$
- Hinge (SVM): $L_{hinge}(y, x, w) = \max(0, 1 - y\,f(x, w))$
- Perceptron: $L_{perceptron}(y, x, w) = \max(0, -y\,f(x, w))$
- Exponential (AdaBoost): $L_{exp}(y, x, w) = e^{-y\,f(x, w)}$
- Logistic regression: $L_{logistic}(y, x, w) = \log(1 + e^{-y\,f(x, w)})$
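To make the comparison concrete, here is a small sketch (not from the slides) that evaluates each surrogate loss at a given margin value $m = y\,f(x, w)$; the margin values are hypothetical:

```python
import numpy as np

def losses(margin):
    """Each loss above, written as a function of the margin m = y * f(x, w)."""
    return {
        "zero-one":    1.0 if margin <= 0 else 0.0,
        "hinge":       max(0.0, 1.0 - margin),
        "perceptron":  max(0.0, -margin),
        "exponential": np.exp(-margin),
        "logistic":    np.log(1.0 + np.exp(-margin)),
    }

for m in (-1.0, 0.0, 0.5, 2.0):  # from confidently wrong to confidently right
    print(m, losses(m))
```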
SVM: maximize the margin; the penalty for a prediction is the hinge loss.
- Regularization term: restricts the hypothesis space and pushes for better generalization.
- Empirical loss: penalizes mistakes on the training data.
- C: a hyperparameter that controls the tradeoff between a large margin and a small hinge loss.
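Putting the pieces together, a sketch of the SVM objective as described above; the data and the value of C are hypothetical:

```python
import numpy as np

def svm_objective(w, X, y, C):
    """(1/2) w.w + C * sum_i max(0, 1 - y_i * w.x_i)"""
    regularizer = 0.5 * np.dot(w, w)            # pushes for a large margin
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))  # per-example hinge loss
    return regularizer + C * hinge.sum()

# Hypothetical example: this w classifies both points with margin >= 1,
# so the empirical loss term vanishes and only the regularizer remains.
w = np.array([1.0, -1.0])
X = np.array([[2.0, 0.0], [0.0, 2.0]])
y = np.array([1, -1])
print(svm_objective(w, X, y, C=1.0))  # 1.0
```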