CS 6355: Structured Prediction
Predicting Structures: Conditional Models and Local Classifiers
Outline:
– Sequence models
– Hidden Markov models
– Inference with HMM
– Learning
– Conditional models and local classifiers
– Global models
The HMM defines the joint probability of an input sequence $x_1, \cdots, x_n$ and a state sequence $y_1, \cdots, y_n$ as

$$P(x_1, x_2, \cdots, x_n, y_1, y_2, \cdots, y_n) = P(y_1) \prod_{t=1}^{n-1} P(y_{t+1} \mid y_t) \prod_{t=1}^{n} P(x_t \mid y_t)$$

Maximum likelihood training picks the initial, transition, and emission parameters $(\pi, A, B)$ that maximize the likelihood of the labeled dataset $D$:

$$\max_{\pi, A, B} P(D \mid \pi, A, B) = \max_{\pi, A, B} \prod_i P(\mathbf{x}_i, \mathbf{y}_i \mid \pi, A, B)$$

We are optimizing the joint likelihood of the input and the output for training, which includes the probability of the input given the prediction! At prediction time, we only care about the probability of the output given the input: $P(y_1, y_2, \cdots, y_n \mid x_1, x_2, \cdots, x_n)$. Why not directly optimize this conditional likelihood instead?
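A minimal sketch of the HMM joint probability above as code (the toy parameters $\pi, A, B$ below are illustrative assumptions, not values from the slides):

```python
import numpy as np

def hmm_joint_prob(x, y, pi, A, B):
    """P(x, y) = P(y1) * prod_t P(y_{t+1} | y_t) * prod_t P(x_t | y_t)."""
    p = pi[y[0]]                     # initial state probability
    for t in range(len(y) - 1):
        p *= A[y[t], y[t + 1]]       # transition probabilities
    for t in range(len(x)):
        p *= B[y[t], x[t]]           # emission probabilities
    return p

pi = np.array([0.6, 0.4])                  # two states
A = np.array([[0.7, 0.3], [0.4, 0.6]])     # A[i, j] = P(state j | state i)
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # B[i, k] = P(obs k | state i)
print(hmm_joint_prob([0, 1, 1], [0, 0, 1], pi, A, B))
```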
(other names: discriminative/conditional Markov model, …)
A generative model tries to characterize the distribution of the inputs; a discriminative model doesn't care.
[Figure: two graphical models over y_{t-1}, y_t, and x_t. In the HMM, the state generates the observation (arrow y_t → x_t); in the conditional model, the observation is an input to the prediction (arrow x_t → y_t).]
We need to learn this function
Interpretation: a score for each label, converted to a well-formed probability distribution by exponentiating and normalizing.
$$P(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) = \frac{\exp\left(\mathbf{w}^T \phi(\mathbf{x}, \mathbf{y})\right)}{\sum_{\mathbf{y}'} \exp\left(\mathbf{w}^T \phi(\mathbf{x}, \mathbf{y}')\right)}$$

Training maximizes the conditional log-likelihood of the data:

$$\max_{\mathbf{w}} \sum_i \log P(\mathbf{y}_i \mid \mathbf{x}_i, \mathbf{w})$$
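A minimal sketch of this distribution in code (the block feature map and the dimensions are illustrative assumptions):

```python
import numpy as np

def conditional_prob(w, phi, x, labels):
    """P(y | x) = exp(w . phi(x, y)) / sum_y' exp(w . phi(x, y'))."""
    scores = np.array([w @ phi(x, y) for y in labels])
    scores -= scores.max()                # stabilize the exponentials
    probs = np.exp(scores)
    return probs / probs.sum()

def phi(x, y, n_labels=3):
    """One copy of x per label: exactly multinomial logistic regression."""
    v = np.zeros(n_labels * len(x))
    v[y * len(x):(y + 1) * len(x)] = x
    return v

x = np.array([1.0, 2.0])
w = np.zeros(6)
print(conditional_prob(w, phi, x, labels=[0, 1, 2]))  # uniform: [1/3 1/3 1/3]
```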
Equivalently, training minimizes the negative log-likelihood, i.e., the cross-entropy loss:

$$L(\mathbf{w}, \mathbf{x}, \mathbf{y}) = -\log P(\mathbf{y} \mid \mathbf{x}, \mathbf{w})$$

Stochastic gradient descent: repeatedly pick a training example $(\mathbf{x}_i, \mathbf{y}_i)$ and
– Update $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla L(\mathbf{w}, \mathbf{x}_i, \mathbf{y}_i)$

Other methods exist, for example the L-BFGS algorithm.
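A minimal end-to-end SGD sketch for this loss (the synthetic data, the feature map, and the fixed learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
labels, n_feats = [0, 1, 2], 6

def phi(x, y):
    v = np.zeros(n_feats)
    v[y * 2:(y + 1) * 2] = x          # one block of features per label
    return v

def probs(w, x):
    s = np.array([w @ phi(x, y) for y in labels])
    s -= s.max()
    p = np.exp(s)
    return p / p.sum()

X = rng.normal(size=(100, 2))
Y = rng.integers(0, 3, size=100)

w = np.zeros(n_feats)
for epoch in range(10):
    for i in rng.permutation(len(X)):
        p = probs(w, X[i])
        # gradient of -log P(Y_i | X_i, w):
        # expected features under the model minus the true features
        grad = sum(p[y] * phi(X[i], y) for y in labels) - phi(X[i], Y[i])
        w -= 0.1 * grad               # gamma_t fixed at 0.1 for simplicity
```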
The gradient is a vector whose jth element is the derivative of L with respect to w_j. It has a neat interpretation.
$$\frac{\partial L}{\partial \mathbf{w}}(\mathbf{w}, \mathbf{x}_i, \mathbf{y}_i) = -\left( \phi(\mathbf{x}_i, \mathbf{y}_i) - \sum_{\mathbf{y}'} P(\mathbf{y}' \mid \mathbf{x}_i, \mathbf{w}) \, \phi(\mathbf{x}_i, \mathbf{y}') \right)$$

The first term inside the parentheses is the feature vector for the true output; the second is the expected feature vector according to the current model. The gradient is zero exactly when the model's expected features match the observed ones.
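A quick finite-difference check of this gradient identity (a sketch with a toy feature map; all specific values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
labels, n_feats = [0, 1, 2], 6

def phi(x, y):
    v = np.zeros(n_feats)
    v[y * 2:(y + 1) * 2] = x
    return v

def loss(w, x, y):
    s = np.array([w @ phi(x, yp) for yp in labels])
    m = s.max()
    return -s[y] + m + np.log(np.exp(s - m).sum())   # -log P(y | x, w)

w, x, y = rng.normal(size=n_feats), rng.normal(size=2), 1

s = np.array([w @ phi(x, yp) for yp in labels])
p = np.exp(s - s.max()); p /= p.sum()
analytic = -(phi(x, y) - sum(p[yp] * phi(x, yp) for yp in labels))

eps = 1e-6
numeric = np.array([(loss(w + eps * np.eye(n_feats)[j], x, y) - loss(w, x, y)) / eps
                    for j in range(n_feats)])
print(np.allclose(analytic, numeric, atol=1e-4))     # True
```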
There can be many conditional probability distributions that satisfy this constraint. What is a trivial one that does so? We need a principled way to choose between such distributions: find a distribution that satisfies the constraint and does not make any additional assumptions. That is, given the constraint, it is maximally uncertain otherwise. This is the maximum entropy principle.
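A minimal numeric illustration of this principle (the domain, feature values, and target expectation below are illustrative assumptions): among all distributions over a small finite domain with a fixed expected feature value, the maximum entropy one has the exponential form p(x) ∝ exp(λ f(x)), and λ can be found by bisection.

```python
import numpy as np

f = np.array([0.0, 1.0, 2.0])    # feature value of each of three outcomes
c = 0.8                          # constraint: E_p[f] must equal c

def expected_f(lam):
    p = np.exp(lam * f)
    p /= p.sum()                 # max-entropy solutions look like exp(lam*f)/Z
    return p @ f

lo, hi = -20.0, 20.0             # E_p[f] is increasing in lam, so bisect
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if expected_f(mid) < c else (lo, mid)

p = np.exp(lo * f); p /= p.sum()
print("max-entropy p:", p.round(4), " E[f] =", round(p @ f, 4))
```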
Questions?
Basically, multinomial logistic regression
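Since the local model is just multinomial logistic regression, off-the-shelf implementations apply directly. A minimal sketch with scikit-learn (the synthetic data stands in for feature vectors built from the input and the previous label; that setup is an illustrative assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for (features of input + previous label) -> next label.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
# The default lbfgs solver fits the multinomial model (cf. L-BFGS above).
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:2]))   # each row is P(y | x) and sums to 1
```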
Goal: compute P(y | x). The prediction task: using the entire input and the current label, predict the next label.

To model the probability, first, we need to define features for the current classification problem, e.g., capitalization (Caps) and the previous word.

[Figure: a sentence tagged left to right from a special start state, with one local classification step, and one feature function, per position:]

φ(x, 0, start, y0), φ(x, 1, y0, y1), φ(x, 2, y1, y2), φ(x, 3, y2, y3), φ(x, 4, y3, y4)

Compare to the HMM, which only depends on the word and the previous tag. Here, we can get very creative with features, as in the sketch below.
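A minimal sketch of such a feature function φ(x, t, y_{t-1}, y_t) (the particular feature templates are illustrative assumptions):

```python
def phi(x, t, y_prev, y_curr):
    """Named indicator features for tagging position t of sentence x."""
    word = x[t]
    feats = {
        f"word={word}+label={y_curr}",                # current word
        f"prev_label={y_prev}+label={y_curr}",        # label transition
        f"caps={word[0].isupper()}+label={y_curr}",   # capitalization
    }
    if t > 0:
        feats.add(f"prev_word={x[t-1]}+label={y_curr}")   # previous word
    return feats

x = ["The", "robot", "wheels", "are", "round"]
print(phi(x, 1, "D", "N"))
```

Because the classifier conditions on the input rather than generating it, any function of the entire input x can be a feature.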
[Figure: side-by-side comparison of the HMM and the conditional Markov model.]
Questions?
We need to train local multiclass classifiers that predict the next state given the previous state and the input.
E.g., part-of-speech tagging the sentence: The robot wheels are round

[Figure: a lattice of candidate tags D, N, V, A, R over the sentence, with locally normalized transition probabilities (0.8 and 0.2 out of N; 1 on the remaining edges).] Suppose these are the only state transitions allowed. Then there are two possible tag sequences:

Option 1: P(D | The) · P(N | D, robot) · P(N | N, wheels) · P(V | N, are) · P(A | V, round)
Option 2: P(D | The) · P(N | D, robot) · P(V | N, wheels) · P(N | V, are) · P(R | N, round)

Example based on [Wallach 2002]
Now consider the sentence: The robot wheels Fred round

The factors at the fourth word become P(V | N, Fred) and P(N | V, Fred) respectively. The path scores are the same. Even if the word Fred is never observed as a verb in the data, it will be predicted as one. The input Fred does not influence the output at all.
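A minimal numeric sketch of this failure mode, often called the label bias problem (the weights and feature templates are illustrative assumptions): a locally normalized classifier must spread probability 1 over the allowed next states, so a word it has never seen cannot move the local distribution at all.

```python
import math
from collections import defaultdict

allowed = {"start": ["D"], "D": ["N"], "N": ["N", "V", "R"], "V": ["N", "A"]}
weights = defaultdict(float)            # w[(prev, y)] and w[(word, y)]
weights[("N", "N")] = math.log(0.8)     # transition preferences out of N
weights[("N", "V")] = math.log(0.2)
weights[("are", "V")] = 2.0             # "are" was seen as a verb in training
# "Fred" never occurs in training, so weights[("Fred", y)] stays 0 for all y.

def local_prob(prev, curr, word):
    scores = {y: math.exp(weights[(prev, y)] + weights[(word, y)])
              for y in allowed[prev]}
    return scores[curr] / sum(scores.values())

for word in ("are", "Fred"):
    print(f"P(V | N, {word}) = {local_prob('N', 'V', word):.3f}")
# "are" shifts the local distribution, but the unseen "Fred" leaves it at the
# transition-only value, so Fred is happily labeled as a verb.
```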