Machine Learning
Fall 2017
Professor Liang Huang
Structured Prediction
(structured perceptron, HMM, structured SVM)
(Chap. 17 of CIML)
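Examples of input x and output y:
binary classification: x → y = -1 or y = +1
POS tagging: x = the man bit the dog → y = DT NN VBD DT NN
parsing: x = the man bit the dog → y = (S (NP (DT the) (NN man)) (VP (VB bit) (NP (DT the) (NN dog))))
translation: x = the man bit the dog → y = 那 人 咬 了 狗 ("that man bit the dog")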
[Figure: perceptron loop. x_i → inference (using w) → z_i; z_i is compared with y_i to update the weights w.]
In all three settings: x → exact inference (using w) → z; update weights if y ≠ z.
binary classification: 2 classes (y = -1 or y = +1); exact inference is trivial
structured classification: exponential # of classes (e.g. x = the man bit the dog, y = DT NN VBD DT NN); exact inference is hard
multiclass classification: constant # of classes; exact inference is easy
Timeline:
1959 Rosenblatt: invention
1962 Novikoff: proof
1964 Vapnik & Chervonenkis: max margin
1964 Aizerman+: kernels (same journal!)
(fall of USSR)
1995 Cortes/Vapnik: SVM
1999 Freund/Schapire: voted/avg perceptron (revived)
2001 Lafferty+: CRF; multinomial logistic regression (max. entropy)
2002 Collins: structured
2003 Crammer/Singer: MIRA
2003 Taskar: M3N
2004 Tsochantaridis
2005 McDonald+: structured MIRA
2006 Singer group: aggressive
2007--2010 Singer group: Pegasos (subgradient descent, minibatch)
Diagram labels: online approx. max margin; +soft-margin; conservative updates; inseparable case; batch
Legend: AT&T Research; ex-AT&T and students
(best agreement w/ prototype)
$\hat{y} = \operatorname{argmax}_{z \in 1 \ldots M} \; w^{(z)} \cdot x$
Q1: what about 2-class? Q2: do we still need augmented space?
class, and reward the weight for the true class
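The multiclass rule above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function names, the toy data, and the use of classes 0..M-1 (rather than 1..M) are assumptions made for the example, and a constant 1 is appended to each input as a bias feature.

```python
import numpy as np

def predict(W, x):
    """Return the class z whose weight vector W[z] scores x highest."""
    return int(np.argmax(W @ x))

def train_multiclass_perceptron(X, Y, num_classes, epochs=10):
    """X: (n, d) array of inputs; Y: n class ids in 0..num_classes-1."""
    W = np.zeros((num_classes, X.shape[1]))
    for _ in range(epochs):
        updates = 0
        for x, y in zip(X, Y):
            z = predict(W, x)
            if z != y:
                W[y] += x   # reward the weight vector of the true class
                W[z] -= x   # penalize the weight vector of the wrongly predicted class
                updates += 1
        if updates == 0:    # no mistakes in a full pass over the training set
            break
    return W

# Toy data (made up for illustration): three separated classes in 2D,
# with a constant 1 appended as a bias feature.
X = np.array([[2.0, 0.0, 1.0], [3.0, 0.5, 1.0],
              [0.0, 2.0, 1.0], [0.5, 3.0, 1.0],
              [-2.0, -2.0, 1.0], [-3.0, -2.5, 1.0]])
Y = [0, 0, 1, 1, 2, 2]
W = train_multiclass_perceptron(X, Y, num_classes=3)
predictions = [predict(W, x) for x in X]
```

On this toy set the loop stops after the second pass, and `predictions` matches `Y`.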
update rule:
$w \leftarrow w + \Delta\Phi(x, y, z)$
separability:
$\exists u$ s.t. for all $(x, y) \in D$ and $z \neq y$: $u \cdot \Delta\Phi(x, y, z) \geq \delta$
Feature example: correct output y vs. predicted output z for the same x; features involving the tag VBD include (VBD, DT) and (VBD, bit).
Φ(x,y) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (VBD, bit): 1, ..., (NN, dog): 1}
Φ(x,z) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (NN, bit): 1, ..., (NN, dog): 1}
∆Φ(x, y, z) = Φ(x,y) - Φ(x,z)
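As a rough sketch of how such feature vectors and their difference might be computed, the Python below counts tag-bigram and tag-word features with `collections.Counter`. The two templates, the function names `phi` and `delta_phi`, and the predicted sequence z = DT NN NN DT NN (inferred from the (NN, bit) feature) are assumptions for illustration; the slides may use more templates.

```python
from collections import Counter

def phi(words, tags):
    """Count tag-bigram features (prev_tag, tag) and tag-word features (tag, word)."""
    feats = Counter()
    padded = ["<s>"] + list(tags) + ["</s>"]
    for prev, cur in zip(padded, padded[1:]):
        feats[(prev, cur)] += 1
    for tag, word in zip(tags, words):
        feats[(tag, word)] += 1
    return feats

def delta_phi(words, gold_tags, pred_tags):
    """Return Phi(x, y) - Phi(x, z) as a sparse dict, dropping zero entries."""
    diff = Counter(phi(words, gold_tags))
    diff.subtract(phi(words, pred_tags))   # subtract keeps negative counts
    return {f: v for f, v in diff.items() if v != 0}

x = "the man bit the dog".split()
y = "DT NN VBD DT NN".split()   # correct tags
z = "DT NN NN DT NN".split()    # assumed wrong prediction: "bit" tagged NN
d = delta_phi(x, y, z)
```

Only the features that differ between y and z survive in `d`: (NN, VBD), (VBD, DT) and (VBD, bit) get +1, while (NN, NN), (NN, DT) and (NN, bit) get -1. Under this particular counting, (DT, the) comes out as 2 in both vectors because "the" occurs twice, so it cancels in the difference.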
Q: what about top-down recursive + memoization?
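One possible answer, sketched in Python: the same dynamic program can be written as a recursion over (position, tag) and memoized with `functools.lru_cache`. This is an illustrative sketch assuming a first-order tagging model with tag-bigram and tag-word features; the function name, the weight dictionary `w`, and the toy weights are hypothetical.

```python
from functools import lru_cache

def viterbi_topdown(words, tagset, w):
    """Best tag sequence under tag-bigram and tag-word features.

    w maps features such as (prev_tag, tag) and (tag, word) to weights;
    missing features score 0. Returns (score, tags).
    """
    n = len(words)

    @lru_cache(maxsize=None)
    def best(i, tag):
        """(score, tags) of the best tagging of words[:i+1] ending in `tag`."""
        emit = w.get((tag, words[i]), 0.0)
        if i == 0:
            return w.get(("<s>", tag), 0.0) + emit, (tag,)
        candidates = []
        for prev in tagset:
            score, tags = best(i - 1, prev)   # memoized: each (i, tag) is solved once
            candidates.append((score + w.get((prev, tag), 0.0) + emit, tags + (tag,)))
        return max(candidates, key=lambda c: c[0])

    finals = []
    for tag in tagset:
        score, tags = best(n - 1, tag)
        finals.append((score + w.get((tag, "</s>"), 0.0), tags))
    return max(finals, key=lambda c: c[0])

# Toy weights (made up): every feature of the correct tagging has weight 1.
w = {("<s>", "DT"): 1, ("DT", "NN"): 1, ("NN", "VBD"): 1, ("VBD", "DT"): 1,
     ("NN", "</s>"): 1, ("DT", "the"): 1, ("NN", "man"): 1,
     ("VBD", "bit"): 1, ("NN", "dog"): 1}
score, tags = viterbi_topdown("the man bit the dog".split(), ("DT", "NN", "VBD"), w)
```

The memoized recursion visits the same (position, tag) states as the bottom-up table, so the asymptotic cost is the same; the recursion depth grows with sentence length, which can matter for very long inputs in Python.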
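[Figure: structured perceptron loop. x_i → inference (using w) → z_i; z_i is compared with y_i to update the weights w. Inference is $\operatorname{argmax}_{y \in \mathrm{GEN}(x)}$.]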
(Freund & Schapire, 1999)
[Pseudocode residue; surviving fragments: j ← j + 1; W_j]
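$w^{(0)} = 0$
$w^{(1)} = \Delta w^{(1)}$
$w^{(2)} = \Delta w^{(1)} + \Delta w^{(2)}$
$w^{(3)} = \Delta w^{(1)} + \Delta w^{(2)} + \Delta w^{(3)}$
$w^{(4)} = \Delta w^{(1)} + \Delta w^{(2)} + \Delta w^{(3)} + \Delta w^{(4)}$
($w_t$: weights after step t; $\Delta w_t$: the update at step t)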
sparse vector: defaultdict
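Since each weight vector is a running sum of updates, the average can be computed without summing every intermediate vector. The Python sketch below compares a naive average with a lazy one that only touches the features present in each update, using `defaultdict` as the sparse vector. The function names and the toy updates are illustrative assumptions; the average here is taken over w(1) ... w(T), and an empty update stands for a step with no mistake.

```python
from collections import defaultdict

def average_naive(updates):
    """Average w(1..T), where w(t) = dw(1) + ... + dw(t), by summing every w(t)."""
    w = defaultdict(float)       # current weights, as a sparse vector
    total = defaultdict(float)   # running sum of w(1) ... w(t)
    for dw in updates:
        for f, v in dw.items():
            w[f] += v
        for f, v in w.items():   # touches every nonzero weight at every step
            total[f] += v
    T = len(updates)
    return {f: v / T for f, v in total.items()}

def average_lazy(updates):
    """Same average, but each step only touches the features in that update."""
    w = defaultdict(float)
    wa = defaultdict(float)      # accumulates (t - 1) * dw(t)
    for t, dw in enumerate(updates, start=1):
        for f, v in dw.items():
            w[f] += v
            wa[f] += (t - 1) * v
    T = len(updates)
    # dw(t) appears in w(t), ..., w(T), i.e. T - t + 1 times, so
    # average = w(T) - (1/T) * sum_t (t - 1) * dw(t).
    return {f: w[f] - wa[f] / T for f in w}

# Toy updates (made up); {} is a step where the prediction was correct.
updates = [{"a": 1.0}, {}, {"a": -1.0, "b": 2.0}, {"b": 1.0}]
naive = average_naive(updates)
lazy = average_lazy(updates)
```

Both functions return `{"a": 0.5, "b": 1.25}` on this input; the lazy version does work proportional to the size of each update rather than to the size of the whole weight vector.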
templates are also included
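[Figure: geometry of separability, from Rosenblatt (1957) to Collins (2002). Binary case: points labeled y=+1 and y=-1. Structured case: examples such as x100, x111, x2000, x3012, with the correct y100 against every z ≠ y100. R: diameter; δ: margin.]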
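perceptron update: $w^{(k+1)} = w^{(k)} + \Delta\Phi(x, y, z)$, where y is the correct label and z is the exact 1-best under the current model $w^{(k)}$
δ separation: the unit oracle vector u gives margin $u \cdot \Delta\Phi(x, y, z) \geq \delta$, so every update makes an angle <90˚ with u
(by induction) $u \cdot w^{(k+1)} \geq k\delta$, hence $\|w^{(k+1)}\| \geq k\delta$ (part 1: lower bound)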
violation: the incorrect label z scored higher than the correct label y under $w^{(k)}$, i.e. $w^{(k)} \cdot \Delta\Phi(x, y, z) \leq 0$
R: max diameter, $\|\Delta\Phi(x, y, z)\|^2 \leq R^2$
by induction: $\|w^{(k+1)}\|^2 \leq k R^2$ (part 2: upper bound)
parts 1+2 => update bound: $k \leq R^2/\delta^2$
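The bound can be checked numerically on a small example. The Python sketch below uses the binary perceptron, where the update vector is y·x, so R bounds the norm of the inputs. The data set and the oracle vector `u` are made up for illustration; any unit vector that separates the data would serve as `u`.

```python
import numpy as np

def perceptron_updates(X, y, max_epochs=1000):
    """Run the binary perceptron until a pass makes no mistakes.

    Returns (w, k), where k is the total number of updates.
    """
    w = np.zeros(X.shape[1])
    k = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # violation: correct label does not score higher
                w = w + yi * xi      # perceptron update
                k += 1
                mistakes += 1
        if mistakes == 0:
            break
    return w, k

# Toy separable data (made up for illustration).
X = np.array([[2.0, 1.0], [1.0, 2.0], [3.0, 3.0],
              [-1.0, -2.0], [-2.0, -1.0], [-3.0, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

u = np.array([1.0, 1.0]) / np.sqrt(2.0)   # assumed unit oracle vector
delta = np.min(y * (X @ u))               # margin achieved by u
R = np.max(np.linalg.norm(X, axis=1))     # bound on the norm of any update
w, k = perceptron_updates(X, y)
bound = R**2 / delta**2
```

Here `k` stays below `bound`. The bound depends only on R and δ, not on the number of examples or the dimensionality.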
and POS tags
Rockwell/B International/I Corp./I 's/B Tulsa/I unit/I said/O it/B signed/O a/B tentative/I agreement/I ...
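To make the B/I/O encoding concrete, here is a small Python sketch that groups the tagged words back into chunks. The function name `bio_to_chunks` is hypothetical, and the handling of a stray I tag (treated like O) is an assumption; the trailing "..." of the example sentence is left out.

```python
def bio_to_chunks(words, tags):
    """Group words into chunks from B (begin), I (inside), O (outside) tags."""
    chunks, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B":                 # a new chunk starts; close any open one
            if current:
                chunks.append(current)
            current = [word]
        elif tag == "I" and current:   # continue the open chunk
            current.append(word)
        else:                          # "O" (or a stray "I"): close any open chunk
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]

words = "Rockwell International Corp. 's Tulsa unit said it signed a tentative agreement".split()
tags = "B I I B I I O B O B I I".split()
chunks = bio_to_chunks(words, tags)
```

For this sentence the chunks are "Rockwell International Corp.", "'s Tulsa unit", "it", and "a tentative agreement". The B tag is what lets two adjacent chunks be told apart without any O between them.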
by ℓ(y, z), a distance metric such as Hamming loss
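For tag sequences of equal length, Hamming loss simply counts the positions where the two sequences disagree. A minimal Python sketch (the function name and the example sequences are illustrative):

```python
def hamming_loss(y, z):
    """Number of positions at which two equal-length label sequences differ."""
    if len(y) != len(z):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(y, z) if a != b)

y = "DT NN VBD DT NN".split()
z = "DT NN NN DT NN".split()
loss = hamming_loss(y, z)
```

Because this loss is a sum over positions, it decomposes the same way as the per-position features.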
CIML version: very similar to Pegasos, but should use the Pegasos framework instead
← modified DP
λ = 1/(2C)
← should have learning rate!
1/t; NC/2t (...)
N = |D|; C is from SVM; t += 1 for each example
perceptron
epoch 1 updates 102, |W|=291, train_err 3.90%, dev_err 9.36% avg_err 6.14%
epoch 2 updates 91, |W|=334, train_err 3.33%, dev_err 8.19% avg_err 4.97%
epoch 3 updates 78, |W|=347, train_err 2.92%, dev_err 5.85% avg_err 4.97%
epoch 4 updates 81, |W|=368, train_err 3.11%, dev_err 6.73% avg_err 5.85%
epoch 5 updates 78, |W|=378, train_err 2.70%, dev_err 6.14% avg_err 5.56%
epoch 6 updates 63, |W|=385, train_err 2.26%, dev_err 6.14% avg_err 5.56%
epoch 7 updates 69, |W|=385, train_err 2.43%, dev_err 7.02% avg_err 5.56%
epoch 8 updates 60, |W|=388, train_err 2.15%, dev_err 6.73% avg_err 5.56%
epoch 9 updates 59, |W|=390, train_err 2.04%, dev_err 6.14% avg_err 5.56%
epoch 10 updates 64, |W|=394, train_err 2.15%, dev_err 5.85% avg_err 5.26%
SVM C=1
epoch 1 updates 116, |W|=311, train_err 4.55%, dev_err 5.85%
epoch 2 updates 82, |W|=328, train_err 3.05%, dev_err 4.97%
epoch 3 updates 78, |W|=334, train_err 2.92%, dev_err 5.56%
epoch 4 updates 77, |W|=339, train_err 2.92%, dev_err 5.26%
epoch 5 updates 80, |W|=344, train_err 2.94%, dev_err 5.56%
epoch 6 updates 73, |W|=345, train_err 2.75%, dev_err 4.68%
epoch 7 updates 72, |W|=347, train_err 2.75%, dev_err 4.97%
epoch 8 updates 75, |W|=352, train_err 2.86%, dev_err 4.97%
epoch 9 updates 74, |W|=353, train_err 2.78%, dev_err 4.97%
epoch 10 updates 72, |W|=354, train_err 2.78%, dev_err 4.97%