CS 6355: Structured Prediction
Training Strategies
So far we saw
– What is structured output prediction?
– Different ways for modeling structured prediction
– Conditional random fields, factor graphs, constraints
What we only
Minimize norm of weights such that the closest points to the hyperplane have a score ±1
Minimize total norm of the weights such that the true label is scored at least 1 more than the second best one
We have a data set $D = \{\langle \mathbf{x}_i, y_i \rangle\}$
Recall hard binary SVM
– Size of the weights: effectively, a regularizer
– The score for the true label is higher than the score for any other label by 1
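As a sketch, the requirement that the true label outscores every other label by at least 1 could be written as the optimization problem below. The per-label weight vectors $\mathbf{w}_k$, the squared norm, and the factor $\tfrac{1}{2}$ are common conventions assumed here; they are not taken from the slides.

```latex
\[
\begin{aligned}
\min_{\mathbf{w}_1, \dots, \mathbf{w}_K} \quad & \frac{1}{2} \sum_{k} \mathbf{w}_k^{T} \mathbf{w}_k \\
\text{s.t.} \quad & \mathbf{w}_{y_i}^{T} \mathbf{x}_i \;\geq\; \mathbf{w}_k^{T} \mathbf{x}_i + 1
\qquad \forall i,\ \forall k \neq y_i
\end{aligned}
\]
```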
We also have a data set $D = \{(\mathbf{x}_i, \mathbf{y}_i)\}$
– And feature definitions for each “part” $p$ as $\Phi_p(\mathbf{x}, \mathbf{y}_p)$
– Remember: we can talk about the feature vector for the entire structure
$\Phi(\mathbf{x}, \mathbf{y}) = \sum_p \Phi_p(\mathbf{x}, \mathbf{y}_p)$
For each training example $(\mathbf{x}_i, \mathbf{y}_i)$:
– The annotated structure $\mathbf{y}_i$ gets the highest score among all structures
– Or to be safe, $\mathbf{y}_i$ gets a score that is at least one more than all other structures
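The decomposition of the feature vector into parts can be illustrated with a minimal Python sketch. It assumes a toy sequence-labeling task in which each position is a part with one emission and one transition feature; the feature templates, the weights, and the names `part_features`, `global_features`, and `score` are hypothetical and not taken from the slides.

```python
from collections import Counter

def part_features(x, y, t):
    """Hypothetical features for the part at position t of a tag sequence."""
    feats = Counter()
    feats[("emit", x[t], y[t])] += 1           # word/tag emission feature
    if t > 0:
        feats[("trans", y[t - 1], y[t])] += 1  # tag-to-tag transition feature
    return feats

def global_features(x, y):
    """Feature vector of the whole structure: the sum of its part feature vectors."""
    total = Counter()
    for t in range(len(x)):
        total.update(part_features(x, y, t))
    return total

def score(w, x, y):
    """Linear score of structure y for input x, with sparse dict weights w."""
    return sum(w.get(f, 0.0) * v for f, v in global_features(x, y).items())

x = ["the", "dog", "barks"]
gold = ["D", "N", "V"]
other = ["D", "N", "N"]
# Made-up weights, chosen only so that the gold structure wins by more than 1.
w = {("emit", "barks", "V"): 2.0,
     ("trans", "N", "V"): 1.0,
     ("emit", "barks", "N"): 0.5}
margin = score(w, x, gold) - score(w, x, other)
```

With these made-up weights the gold structure outscores the other structure by more than one, which is what the "to be safe" requirement asks for.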
Maximize margin by minimizing the norm of w
For every training example (an input with its gold structure) and every other structure:
– Score for gold structure ≥ score for other structure + 1

Consider a gold structure and two other structures:
– Other structure A: only one mistake
– Other structure B: fully incorrect
Structure B is more wrong, but this formulation will be happy if both A and B are scored one less than gold. No partial credit!
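Written out, the formulation with a fixed margin of 1 might look as follows. The squared norm and the factor $\tfrac{1}{2}$ are a common convention assumed here, not something the slides confirm.

```latex
\[
\begin{aligned}
\min_{\mathbf{w}} \quad & \frac{1}{2}\, \mathbf{w}^{T} \mathbf{w} \\
\text{s.t.} \quad & \mathbf{w}^{T} \Phi(\mathbf{x}_i, \mathbf{y}_i) \;\geq\; \mathbf{w}^{T} \Phi(\mathbf{x}_i, \mathbf{y}) + 1
\qquad \forall i,\ \forall \mathbf{y} \neq \mathbf{y}_i
\end{aligned}
\]
```

The right-hand side adds the same constant 1 for every competing structure, which is why a structure with one mistake and a fully incorrect structure are treated alike.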
Hamming distance between structures: counts the number of differences between them
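A minimal sketch of the Hamming distance, assuming structures are equal-length label sequences (the function name and the toy tag sequences are illustrative only):

```python
def hamming(y_gold, y_other):
    """Count the positions at which two equal-length label sequences differ."""
    if len(y_gold) != len(y_other):
        raise ValueError("structures must have the same number of parts")
    return sum(a != b for a, b in zip(y_gold, y_other))

gold = ["D", "N", "V"]
structure_a = ["D", "N", "N"]  # only one mistake
structure_b = ["N", "V", "D"]  # fully incorrect
```

Here `hamming(gold, structure_a)` is 1 and `hamming(gold, structure_b)` is 3, so a margin that scales with this distance asks for a larger gap against the fully incorrect structure.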
Maximize margin by minimizing the norm of w
For every training example (input with gold structure) and another structure (which could be $\mathbf{y}_i$ itself):
– Score for gold ≥ score for other + Hamming distance between the two structures
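As a sketch, the constraint with a Hamming-scaled margin could be written as below. The symbol $\Delta$ for the Hamming distance and the factor $\tfrac{1}{2}$ are notational assumptions.

```latex
\[
\begin{aligned}
\min_{\mathbf{w}} \quad & \frac{1}{2}\, \mathbf{w}^{T} \mathbf{w} \\
\text{s.t.} \quad & \mathbf{w}^{T} \Phi(\mathbf{x}_i, \mathbf{y}_i) \;\geq\; \mathbf{w}^{T} \Phi(\mathbf{x}_i, \mathbf{y}) + \Delta(\mathbf{y}_i, \mathbf{y})
\qquad \forall i,\ \forall \mathbf{y}
\end{aligned}
\]
```

Allowing $\mathbf{y} = \mathbf{y}_i$ is harmless: since $\Delta(\mathbf{y}_i, \mathbf{y}_i) = 0$, that constraint holds with equality for any $\mathbf{w}$.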
What if these constraints are not satisfied for any w for a given dataset?
Introduce a slack variable for each example; all slacks must be non-negative
Maximize margin by minimizing the norm of w, and also minimize total slack. C: the tradeoff parameter
For every training example (input with gold structure) and all structures:
– Score for gold ≥ score for other + Hamming distance − slack variable for that example
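The formulation with slack variables might be written as follows. The symbols $\xi_i$ for the slacks and $\Delta$ for the Hamming distance, and the factor $\tfrac{1}{2}$, are notational assumptions.

```latex
\[
\begin{aligned}
\min_{\mathbf{w},\, \boldsymbol{\xi}} \quad & \frac{1}{2}\, \mathbf{w}^{T} \mathbf{w} + C \sum_{i} \xi_i \\
\text{s.t.} \quad & \mathbf{w}^{T} \Phi(\mathbf{x}_i, \mathbf{y}_i) \;\geq\; \mathbf{w}^{T} \Phi(\mathbf{x}_i, \mathbf{y}) + \Delta(\mathbf{y}_i, \mathbf{y}) - \xi_i
\qquad \forall i,\ \forall \mathbf{y} \\
& \xi_i \geq 0 \qquad \forall i
\end{aligned}
\]
```

A large $C$ penalizes violated constraints heavily; a small $C$ favors a small norm of $\mathbf{w}$ even if some examples violate their margins.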
Questions?
Equivalent formulation
Exercise: Work it out
This must look familiar. We have seen this before for binary classification!
– Loss(f(x), y) tells us how good f is for this x by comparing it against y
– Expected risk:
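One standard way to write the expected risk, together with the empirical risk that replaces it in practice, is sketched below. The symbol $\mathcal{P}$ for the unknown data distribution and $n$ for the number of training examples are assumptions introduced here.

```latex
\[
R(f) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{P}} \big[ \mathrm{Loss}(f(\mathbf{x}), y) \big],
\qquad
\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(f(\mathbf{x}_i), y_i)
\]
```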
[Figure: loss functions for binary classification. Curves: zero-one, Perceptron, hinge (SVM), logistic regression, exponential (AdaBoost)]
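The binary losses named above can be sketched as functions of the signed margin m = y · wᵀx with y in {−1, +1}. These are the usual textbook forms; treating a margin of exactly zero as a mistake in the zero-one loss is a convention chosen here.

```python
import math

def zero_one(m):
    """1 for a mistake (non-positive margin), 0 otherwise."""
    return 1.0 if m <= 0 else 0.0

def perceptron(m):
    """Penalizes only misclassified points, linearly in the margin."""
    return max(0.0, -m)

def hinge(m):
    """SVM loss: also penalizes correct predictions with margin below 1."""
    return max(0.0, 1.0 - m)

def logistic(m):
    """Logistic regression loss (natural log)."""
    return math.log(1.0 + math.exp(-m))

def exponential(m):
    """AdaBoost loss."""
    return math.exp(-m)
```

For example, a correctly classified point with margin 0.5 has zero Perceptron loss but hinge loss 0.5, because it falls inside the margin.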
Structured hinge loss
– Regularizer
– How badly does w do on the training data
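A minimal sketch of the structured hinge loss for one example, computed by brute-force enumeration of all structures. It assumes a toy tagging task with emission and transition features; the label set, the feature templates, and the function names are hypothetical. Enumeration is only feasible for tiny problems; in practice the maximization is done by an inference algorithm.

```python
from itertools import product

LABELS = ["D", "N", "V"]  # hypothetical label set

def score(w, x, y):
    """Linear score of tag sequence y for input x under sparse weights w."""
    s = 0.0
    for t in range(len(x)):
        s += w.get(("emit", x[t], y[t]), 0.0)
        if t > 0:
            s += w.get(("trans", y[t - 1], y[t]), 0.0)
    return s

def hamming(a, b):
    """Number of positions at which two sequences differ."""
    return sum(u != v for u, v in zip(a, b))

def structured_hinge(w, x, y_gold):
    """max over all y of: score(y) + Hamming(y_gold, y) - score(y_gold)."""
    gold_score = score(w, x, y_gold)
    return max(score(w, x, y) + hamming(y_gold, y) - gold_score
               for y in product(LABELS, repeat=len(x)))

x = ["the", "dog", "barks"]
gold = ["D", "N", "V"]
w_zero = {}
w_good = {("emit", "the", "D"): 5.0,
          ("emit", "dog", "N"): 5.0,
          ("emit", "barks", "V"): 5.0}
```

Because the gold structure itself is among the candidates and contributes zero, the loss is never negative. It is zero exactly when every other structure is outscored by at least its Hamming distance.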
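Log loss
– Regularizer
– How badly does w do on the training data
Where $P$ is defined as $P(\mathbf{y}_i \mid \mathbf{x}_i, \mathbf{w}) = \dfrac{\exp\left(\mathbf{w}^{T} \Phi(\mathbf{x}_i, \mathbf{y}_i)\right)}{Z(\mathbf{x}_i, \mathbf{w})}$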
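A minimal sketch of the log loss for one example, with the partition function computed by brute-force enumeration. It assumes a toy tagging task with emission and transition features; the label set, feature templates, and function names are hypothetical, and enumeration stands in for the dynamic programming used in practice.

```python
import math
from itertools import product

LABELS = ["D", "N", "V"]  # hypothetical label set

def score(w, x, y):
    """Linear score of tag sequence y for input x under sparse weights w."""
    s = 0.0
    for t in range(len(x)):
        s += w.get(("emit", x[t], y[t]), 0.0)
        if t > 0:
            s += w.get(("trans", y[t - 1], y[t]), 0.0)
    return s

def log_partition(w, x):
    """log Z(x, w): log of the sum of exp(score) over all structures."""
    scores = [score(w, x, y) for y in product(LABELS, repeat=len(x))]
    m = max(scores)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(s - m) for s in scores))

def log_loss(w, x, y_gold):
    """Negative log-probability of the gold structure: log Z - score(gold)."""
    return log_partition(w, x) - score(w, x, y_gold)

x = ["the", "dog", "barks"]
gold = ["D", "N", "V"]
w_good = {("emit", "the", "D"): 5.0,
          ("emit", "dog", "N"): 5.0,
          ("emit", "barks", "V"): 5.0}
```

With all weights at zero, every one of the 27 structures is equally likely, so the loss is log 27. Unlike the hinge loss, the log loss stays strictly positive even when the gold structure is scored highest.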
Structured Perceptron loss
– Regularizer
– How badly does w do on the training data
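A minimal sketch of the structured Perceptron loss for one example, taken here to be the score of the best-scoring structure minus the score of the gold structure. That definition is the usual one and is assumed rather than read off the slides; the toy label set, feature templates, and function names are hypothetical.

```python
from itertools import product

LABELS = ["D", "N", "V"]  # hypothetical label set

def score(w, x, y):
    """Linear score of tag sequence y for input x under sparse weights w."""
    s = 0.0
    for t in range(len(x)):
        s += w.get(("emit", x[t], y[t]), 0.0)
        if t > 0:
            s += w.get(("trans", y[t - 1], y[t]), 0.0)
    return s

def perceptron_loss(w, x, y_gold):
    """max over all y of score(y), minus score(y_gold); zero when gold is the argmax."""
    best = max(score(w, x, y) for y in product(LABELS, repeat=len(x)))
    return best - score(w, x, y_gold)

x = ["the", "dog", "barks"]
gold = ["D", "N", "V"]
w_wrong = {("emit", "barks", "N"): 1.0}  # made-up weights that prefer a wrong tag
```

Compared with the structured hinge loss, there is no Hamming term inside the maximization, so the loss is already zero as soon as the gold structure ties for the highest score.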