Sequential Supervised Learning
Many Application Problems Require Sequential Learning

- Part-of-speech Tagging
- Information Extraction from the Web
- Text-to-Speech Mapping
Part-of-Speech Tagging

Given an English sentence, can we assign a part of speech to each word?

"Do you want fries with that?"
<verb pron verb noun prep pron>
Information Extraction from the Web

<dl><dt><b>Srinivasan Seshan</b> (Carnegie Mellon University) <dt><a href=…><i>Making Virtual Worlds Real</i></a><dt>Tuesday, June 4, 2002<dd>2:00 PM, 322 Sieg<dd>Research Seminar

Each token is labeled with one of the fields {name, affiliation, title, date, time, location, event-type} or with * (irrelevant).
Text-to-Speech Mapping

"photograph" => /f-Ot@graf-/
Sequential Supervised Learning (SSL)

Given: A set of training examples of the form (X_i, Y_i), where X_i = <x_{i,1}, …, x_{i,T_i}> and Y_i = <y_{i,1}, …, y_{i,T_i}> are sequences of length T_i.

Find: A function f for predicting new sequences: Y = f(X).
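To make the notation concrete, here is a minimal Python sketch of how such a training set of (X_i, Y_i) pairs might be stored; the representation is illustrative, not something prescribed by the slides.

```python
# A toy SSL training set: each example pairs an input sequence X_i with a
# label sequence Y_i of the same length T_i (hypothetical representation).
training_set = [
    # Part-of-speech tagging example from the slides
    (["Do", "you", "want", "fries", "with", "that"],
     ["verb", "pron", "verb", "noun", "prep", "pron"]),
    # Text-to-speech example: letters -> phonemes ("-" marks a silent letter)
    (list("photograph"),
     ["f", "-", "O", "t", "@", "g", "r", "a", "f", "-"]),
]

for X, Y in training_set:
    assert len(X) == len(Y)   # each pair of sequences has the same length T_i
```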
Examples of Sequential Supervised Learning

Domain                   | Input X_i           | Output Y_i
Part-of-speech Tagging   | sequence of words   | sequence of parts of speech
Information Extraction   | sequence of tokens  | sequence of field labels {name, …}
Text-to-speech Mapping   | sequence of letters | sequence of phonemes
Two Kinds of Relationships

- "Vertical" relationship between the x_t's and y_t's
  - Example: "Friday" is usually a "date"
- "Horizontal" relationships among the y_t's
  - Example: "name" is usually followed by "affiliation"
- SSL can (and should) exploit both kinds of information

[Figure: chain of label nodes y1, y2, y3 linked to one another, each connected to its input x1, x2, x3]
Existing Methods

- Hacks
  - Sliding windows
  - Recurrent sliding windows
- Hidden Markov models
  - joint distribution: P(X, Y)
- Conditional Random Fields
  - conditional distribution: P(Y|X)
- Discriminant methods: HM-SVMs, MMMs, voted perceptrons
  - discriminant function: f(Y; X)
Sliding Windows

Sentence: "Do you want fries with that"

Window (x_{t-1}, x_t, x_{t+1}) → label y_t
___ Do you       → verb
Do you want      → pron
you want fries   → verb
want fries with  → noun
fries with that  → prep
with that ___    → pron
Properties of Sliding Windows

- Converts SSL to ordinary supervised learning
- Only captures the relationship between (part of) X and y_t. Does not explicitly model relations among the y_t's
- Assumes each window is independent
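For illustration, here is a minimal sketch of the sliding-window conversion (window half-width 1, padded with "___" as on the slide); the function name and representation are assumptions of mine.

```python
def make_windows(X, Y, half_width=1, pad="___"):
    """Convert one (X, Y) sequence pair into ordinary supervised examples:
    each example is the window of inputs around position t, labeled y_t."""
    padded = [pad] * half_width + list(X) + [pad] * half_width
    examples = []
    for t in range(len(X)):
        window = padded[t:t + 2 * half_width + 1]
        examples.append((window, Y[t]))
    return examples

X = ["Do", "you", "want", "fries", "with", "that"]
Y = ["verb", "pron", "verb", "noun", "prep", "pron"]
for window, label in make_windows(X, Y):
    print(window, "->", label)
# e.g. ['___', 'Do', 'you'] -> verb
```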
Recurrent Sliding Windows

Sentence: "Do you want fries with that"

Window (x_{t-1}, x_t, x_{t+1}) | previous label y_{t-1} → label y_t
___ Do you      | ___   → verb
Do you want     | verb  → pron
you want fries  | pron  → verb
want fries with | verb  → noun
fries with that | noun  → prep
with that ___   | prep  → pron
Recurrent Sliding Windows

Key Idea: include y_t as an input feature when computing y_{t+1}.

During training:
- Use the correct value of y_t
- Or train iteratively (especially recurrent neural networks)

During evaluation:
- Use the predicted value of y_t
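A minimal sketch of left-to-right recurrent sliding-window prediction at evaluation time: the previously predicted label is fed back in as an extra feature. The `classify` callable is a stand-in for any trained per-position classifier, and the lookup table in the demo is a toy stand-in rather than a learned model.

```python
def predict_recurrent(X, classify, half_width=1, pad="___"):
    """Left-to-right recurrent sliding-window prediction.
    `classify(features)` is any trained per-position classifier; the
    previously *predicted* label is appended to the window features."""
    padded = [pad] * half_width + list(X) + [pad] * half_width
    prev_label = pad
    predictions = []
    for t in range(len(X)):
        window = padded[t:t + 2 * half_width + 1]
        features = window + [prev_label]      # include predicted y_{t-1} as a feature
        prev_label = classify(features)
        predictions.append(prev_label)
    return predictions

# Toy classifier: tag each word by a hand-written lookup (stand-in for a model)
lookup = {"Do": "verb", "you": "pron", "want": "verb",
          "fries": "noun", "with": "prep", "that": "pron"}
print(predict_recurrent(["Do", "you", "want", "fries", "with", "that"],
                        lambda feats: lookup[feats[1]]))
# -> ['verb', 'pron', 'verb', 'noun', 'prep', 'pron']
```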
Properties of Recurrent Sliding Windows

Captures relationships among the y's, but only in one direction!

Results on text-to-speech:

Method           | Direction  | Words | Letters
sliding window   | none       | 12.5% | 69.6%
recurrent s. w.  | left-right | 17.0% | 67.9%
recurrent s. w.  | right-left | 24.4% | 74.2%
Hidden Markov Models

Generalization of Naïve Bayes to SSL.

[Figure: chain y1 → y2 → y3 → y4 → y5, with each y_t emitting its x_t]

- P(y_1)
- P(y_t | y_{t-1}), assumed the same for all t
- P(x_t | y_t) = P(x_{t,1} | y_t) · P(x_{t,2} | y_t) ··· P(x_{t,n} | y_t), assumed the same for all t
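To make the factorization concrete, here is a minimal sketch that evaluates log P(X, Y) under these three sets of parameters; for simplicity each x_t is treated as a single symbol rather than a feature vector, and the toy probabilities are made up for illustration.

```python
import math

# Toy HMM parameters (illustrative values only, not from the slides)
p_y1 = {"verb": 0.3, "pron": 0.7}
p_trans = {("verb", "pron"): 0.6, ("verb", "verb"): 0.4,
           ("pron", "verb"): 0.8, ("pron", "pron"): 0.2}
p_emit = {("verb", "Do"): 0.5, ("verb", "want"): 0.2,
          ("pron", "you"): 0.5, ("pron", "Do"): 0.01}

def log_joint(X, Y):
    """log P(X, Y) = log P(y_1) + log P(x_1|y_1)
                     + sum_{t>1} [log P(y_t|y_{t-1}) + log P(x_t|y_t)]"""
    lp = math.log(p_y1[Y[0]]) + math.log(p_emit[(Y[0], X[0])])
    for t in range(1, len(X)):
        lp += math.log(p_trans[(Y[t - 1], Y[t])]) + math.log(p_emit[(Y[t], X[t])])
    return lp

print(log_joint(["Do", "you", "want"], ["verb", "pron", "verb"]))
```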
Making Predictions with HMMs

Two possible goals:
- argmax_Y P(Y|X): find the most likely sequence of labels Y given the input sequence X
- argmax_{y_t} P(y_t | X) for all t: find the most likely label y_t at each time t given the entire input sequence X
Finding the Most Likely Label Sequence: The Trellis

[Trellis figure: the sentence "Do you want fries sir?" with candidate labels {verb, pronoun, noun, adjective} for each word, plus a start node s and a finish node f]

Every label sequence corresponds to a path through the trellis graph. The probability of a label sequence is proportional to
P(y_1) · P(x_1|y_1) · P(y_2|y_1) · P(x_2|y_2) ··· P(y_T|y_{T-1}) · P(x_T|y_T)
Converting to a Shortest Path Problem

max_{y_1,…,y_T} P(y_1) · P(x_1|y_1) · P(y_2|y_1) · P(x_2|y_2) ··· P(y_T|y_{T-1}) · P(x_T|y_T)
= min_{y_1,…,y_T} –log [P(y_1) · P(x_1|y_1)] + –log [P(y_2|y_1) · P(x_2|y_2)] + ··· + –log [P(y_T|y_{T-1}) · P(x_T|y_T)]

This is a shortest path through the trellis graph, with edge cost –log [P(y_t|y_{t-1}) · P(x_t|y_t)].
Finding the Most Likely Label Sequence: The Viterbi Algorithm

[Trellis figure as before: "Do you want fries sir?" with candidate labels per word, start node s and finish node f]

Step t of the Viterbi algorithm computes the possible successors of state y_{t-1} and computes the total path length for each edge.

Each node y_t = k stores the cost µ(k) of the shortest path that reaches it from s, and the predecessor class y_{t-1} = k' that achieves this cost:

k' = argmin_{y_{t-1}} –log [P(y_t | y_{t-1}) · P(x_t | y_t)] + µ(y_{t-1})
µ(k) = min_{y_{t-1}} –log [P(y_t | y_{t-1}) · P(x_t | y_t)] + µ(y_{t-1})

The algorithm alternates between computing the successors of each node and computing and storing the shortest incoming edge at each node, column by column, until it computes the best edge into f. It then traces back along the best incoming edges to recover the predicted Y sequence: "verb pronoun verb noun noun".
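A minimal sketch of the Viterbi recursion in this shortest-path form; the dict-of-dicts parameter layout and the toy numbers are my own choices, not from the slides.

```python
import math

def viterbi(X, labels, p_y1, p_trans, p_emit):
    """Most likely label sequence argmax_Y P(X, Y), computed as a shortest path
    with edge cost -log[P(y_t | y_{t-1}) * P(x_t | y_t)]."""
    floor = 1e-12  # probability floor to avoid log(0) for unseen events
    # mu[k] = cost of the cheapest path from the start that ends in label k
    mu = {k: -math.log(p_y1[k] * p_emit[k].get(X[0], floor)) for k in labels}
    backpointers = []
    for t in range(1, len(X)):
        new_mu, pointers = {}, {}
        for k in labels:
            costs = {kp: mu[kp] - math.log(p_trans[kp].get(k, floor)
                                           * p_emit[k].get(X[t], floor))
                     for kp in labels}
            best_prev = min(costs, key=costs.get)
            pointers[k], new_mu[k] = best_prev, costs[best_prev]
        mu = new_mu
        backpointers.append(pointers)
    # Trace back along the stored best predecessors to recover Y
    y = [min(mu, key=mu.get)]
    for pointers in reversed(backpointers):
        y.append(pointers[y[-1]])
    return list(reversed(y))

# Toy usage (illustrative parameters)
labels = ["verb", "pron", "noun"]
p_y1 = {"verb": 0.4, "pron": 0.3, "noun": 0.3}
p_trans = {"verb": {"pron": 0.5, "noun": 0.4, "verb": 0.1},
           "pron": {"verb": 0.6, "noun": 0.3, "pron": 0.1},
           "noun": {"noun": 0.4, "verb": 0.3, "pron": 0.3}}
p_emit = {"verb": {"Do": 0.4, "want": 0.4},
          "pron": {"you": 0.6, "that": 0.3},
          "noun": {"fries": 0.5}}
print(viterbi(["Do", "you", "want", "fries"], labels, p_y1, p_trans, p_emit))
# -> ['verb', 'pron', 'verb', 'noun'] with these toy numbers
```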
Finding the Most Likely Label at Time t: P(y_t | X)

[Trellis figure as before]

P(y_3 = 2 | X) = (probability of reaching y_3 = 2 from the start) × (probability of getting from y_3 = 2 to the finish)
Finding the Most Likely Class at Each Time t

Goal: compute P(y_t | x_1, …, x_T)

P(y_t | x_1, …, x_T)
  ∝ ∑_{y_{1:t-1}} ∑_{y_{t+1:T}} P(y_1) · P(x_1|y_1) · P(y_2|y_1) · P(x_2|y_2) ··· P(y_T|y_{T-1}) · P(x_T|y_T)

  ∝ [ ∑_{y_{1:t-1}} P(y_1) · P(x_1|y_1) · P(y_2|y_1) · P(x_2|y_2) ··· P(y_t|y_{t-1}) · P(x_t|y_t) ]
    · [ ∑_{y_{t+1:T}} P(y_{t+1}|y_t) · P(x_{t+1}|y_{t+1}) ··· P(y_T|y_{T-1}) · P(x_T|y_T) ]

Pushing each sum inside as far as it will go gives two recursive computations, one running forward and one running backward along the sequence:

  ∝ ∑_{y_{t-1}} [ ··· ∑_{y_2} [ ∑_{y_1} P(y_1) · P(x_1|y_1) · P(y_2|y_1) ] · P(x_2|y_2) · P(y_3|y_2) ] ··· P(y_t|y_{t-1}) ] · P(x_t|y_t)
    · ∑_{y_{t+1}} [ P(y_{t+1}|y_t) · P(x_{t+1}|y_{t+1}) ··· ∑_{y_{T-1}} [ P(y_{T-1}|y_{T-2}) · P(x_{T-1}|y_{T-1}) · ∑_{y_T} [ P(y_T|y_{T-1}) · P(x_T|y_T) ] ] ··· ]
Forward-Backward Algorithm

α_t(y_t) = ∑_{y_{t-1}} P(y_t | y_{t-1}) · P(x_t | y_t) · α_{t-1}(y_{t-1})
- This is the sum over the arcs coming into y_t = k
- It is computed "forward" along the sequence and stored in the trellis

β_t(y_t) = ∑_{y_{t+1}} P(y_{t+1} | y_t) · P(x_{t+1} | y_{t+1}) · β_{t+1}(y_{t+1})
- It is computed "backward" along the sequence and stored in the trellis

P(y_t | X) = α_t(y_t) β_t(y_t) / [∑_k α_t(k) β_t(k)]
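A minimal sketch of the forward-backward recursions and the resulting per-position posteriors P(y_t | X); it assumes the same dict-of-dicts HMM parameter layout as the Viterbi sketch above.

```python
def forward_backward(X, labels, p_y1, p_trans, p_emit, floor=1e-12):
    """Compute alpha_t(k), beta_t(k), and P(y_t = k | X) for a simple HMM."""
    T = len(X)
    alpha = [{} for _ in range(T)]
    beta = [{} for _ in range(T)]
    # Forward pass: alpha_t(k) = sum_{k'} P(k | k') P(x_t | k) alpha_{t-1}(k')
    for k in labels:
        alpha[0][k] = p_y1[k] * p_emit[k].get(X[0], floor)
    for t in range(1, T):
        for k in labels:
            alpha[t][k] = p_emit[k].get(X[t], floor) * sum(
                p_trans[kp].get(k, floor) * alpha[t - 1][kp] for kp in labels)
    # Backward pass: beta_t(k) = sum_{k''} P(k'' | k) P(x_{t+1} | k'') beta_{t+1}(k'')
    for k in labels:
        beta[T - 1][k] = 1.0
    for t in range(T - 2, -1, -1):
        for k in labels:
            beta[t][k] = sum(p_trans[k].get(kn, floor)
                             * p_emit[kn].get(X[t + 1], floor) * beta[t + 1][kn]
                             for kn in labels)
    # Posterior: P(y_t = k | X) = alpha_t(k) beta_t(k) / sum_k' alpha_t(k') beta_t(k')
    posteriors = []
    for t in range(T):
        z = sum(alpha[t][k] * beta[t][k] for k in labels)
        posteriors.append({k: alpha[t][k] * beta[t][k] / z for k in labels})
    return alpha, beta, posteriors
```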
Training Hidden Markov Models

If the inputs and outputs are fully observed, this is extremely easy:

P(y_1 = k) = [# examples with y_1 = k] / m
P(y_t = k | y_{t-1} = k') = [# k' → k transitions] / [# times y_{t-1} = k']
P(x_j = v | y = k) = [# times y = k and x_j = v] / [# times y_t = k]

We should apply Laplace corrections to these estimates.
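A minimal sketch of this counting-based estimation with Laplace (add-alpha) corrections; as before, each x_t is treated as a single symbol, which simplifies the per-feature emission model on the earlier slide. It returns parameters in the same dict-of-dicts shape used by the Viterbi and forward-backward sketches.

```python
from collections import Counter

def train_hmm(training_set, labels, vocab, alpha=1.0):
    """Estimate HMM parameters by counting fully observed sequences,
    with Laplace (add-alpha) corrections."""
    start, trans, emit, label_count = Counter(), Counter(), Counter(), Counter()
    for X, Y in training_set:
        start[Y[0]] += 1
        for t, (x, y) in enumerate(zip(X, Y)):
            label_count[y] += 1
            emit[(y, x)] += 1
            if t > 0:
                trans[(Y[t - 1], y)] += 1
    m, K, V = len(training_set), len(labels), len(vocab)
    p_y1 = {k: (start[k] + alpha) / (m + alpha * K) for k in labels}
    p_trans = {kp: {k: (trans[(kp, k)] + alpha) /
                       (sum(trans[(kp, k2)] for k2 in labels) + alpha * K)
                    for k in labels}
               for kp in labels}
    p_emit = {k: {v: (emit[(k, v)] + alpha) / (label_count[k] + alpha * V)
                  for v in vocab}
              for k in labels}
    return p_y1, p_trans, p_emit
```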
Conditional Random Fields

The y_t's form a Markov Random Field conditioned on X: P(Y|X)

Lafferty, McCallum, & Pereira (2001)

[Figure: undirected chain y1 - y2 - y3, with each y_t connected to X = (x1, x2, x3)]
Markov Random Fields

Graph G = (V, E)
- Each vertex v ∈ V represents a random variable y_v.
- Each edge represents a direct probabilistic dependency.

P(Y) = 1/Z exp [∑_c Ψ_c(c(Y))]
- c indexes the cliques in the graph
- Ψ_c is a potential function
- c(Y) selects the random variables participating in clique c
A Simple MRF

[Figure: chain y1 - y2 - y3]

Cliques:
- singletons: {y1}, {y2}, {y3}
- pairs (edges): {y1, y2}, {y2, y3}

P(<y1, y2, y3>) = 1/Z exp[Ψ_1(y1) + Ψ_2(y2) + Ψ_3(y3) + Ψ_12(y1, y2) + Ψ_23(y2, y3)]
CRF Potential Functions are Conditioned on X

Ψ_t(y_t, X): how compatible is y_t with X?
Ψ_{t,t-1}(y_t, y_{t-1}, X): how compatible is a transition from y_{t-1} to y_t with X?

[Figure: undirected chain y1 - y2 - y3, with each y_t connected to X = (x1, x2, x3)]
CRF Potentials are Log-Linear Models

Ψ_t(y_t, X) = ∑_b β_b g_b(y_t, X)
Ψ_{t,t+1}(y_t, y_{t+1}, X) = ∑_a λ_a f_a(y_t, y_{t+1}, X)

where g_b and f_a are user-defined boolean functions ("features")
- Example: g_23 = [x_t = "o" and y_t = /@/]

We will lump them together as Ψ_t(y_t, y_{t+1}, X) = ∑_a λ_a f_a(y_t, y_{t+1}, X)
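A minimal sketch of what boolean feature functions and the lumped log-linear potential might look like in code. The specific features, the weights, and the extra position argument t are illustrative assumptions of mine, not the slides' exact definitions.

```python
def f_letter_is_o_label_schwa(y_t, y_next, X, t):
    # Analogue of the slide's example g_23: current letter is "o" and label is /@/
    return X[t] == "o" and y_t == "@"

def f_schwa_then_g(y_t, y_next, X, t):
    # A transition feature: /@/ followed by /g/
    return y_t == "@" and y_next == "g"

FEATURES = [f_letter_is_o_label_schwa, f_schwa_then_g]
LAMBDAS = [1.2, 0.7]   # weights lambda_a (illustrative values, not learned here)

def psi(y_t, y_next, X, t):
    """Lumped log-linear potential: Psi_t(y_t, y_{t+1}, X) = sum_a lambda_a f_a(...)."""
    return sum(lam * f(y_t, y_next, X, t) for lam, f in zip(LAMBDAS, FEATURES))

X = list("photograph")
print(psi("@", "g", X, 4))   # position of the second "o" -> 1.2 + 0.7 = 1.9
```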
Making Predictions with CRFs

The Viterbi and Forward-Backward algorithms can be applied exactly as for HMMs.
Training CRFs

Let θ = {β_1, β_2, …, λ_1, λ_2, …} be all of our parameters.
Let F_θ be our CRF, so F_θ(Y, X) = P(Y|X).

Define the loss function L(Y, F_θ(Y, X)) to be the negative log likelihood:
L(Y, F_θ(Y, X)) = –log F_θ(Y, X)

Goal: find θ to minimize the loss (maximize the likelihood).
Algorithm: gradient descent.
Gradient Computation

g_q = ∂/∂λ_q log P(Y|X)
    = ∂/∂λ_q log [ ∏_t exp Ψ_t(y_t, y_{t-1}, X) / Z ]
    = ∂/∂λ_q [ ∑_t Ψ_t(y_t, y_{t-1}, X) – log Z ]
    = ∑_t ∂/∂λ_q ∑_a λ_a f_a(y_t, y_{t-1}, X) – ∂/∂λ_q log Z
    = ∑_t f_q(y_t, y_{t-1}, X) – ∂/∂λ_q log Z
Gradient of Z

∂/∂λ_q log Z = (1/Z) ∂Z/∂λ_q
  = (1/Z) ∂/∂λ_q ∑_{Y'} ∏_t exp Ψ_t(y'_t, y'_{t-1}, X)
  = (1/Z) ∂/∂λ_q ∑_{Y'} exp ∑_t Ψ_t(y'_t, y'_{t-1}, X)
  = (1/Z) ∑_{Y'} exp [ ∑_t Ψ_t(y'_t, y'_{t-1}, X) ] · ∑_t ∂/∂λ_q Ψ_t(y'_t, y'_{t-1}, X)
  = ∑_{Y'} ( exp [ ∑_t Ψ_t(y'_t, y'_{t-1}, X) ] / Z ) · ∑_t ∂/∂λ_q ∑_a λ_a f_a(y'_t, y'_{t-1}, X)
  = ∑_{Y'} P(Y'|X) [ ∑_t f_q(y'_t, y'_{t-1}, X) ]
Gradient Computation

g_q = ∑_t f_q(y_t, y_{t-1}, X) – ∑_{Y'} P(Y'|X) [ ∑_t f_q(y'_t, y'_{t-1}, X) ]

This is the number of times feature q is true minus the expected number of times feature q is true. The expectation can be computed via the forward-backward algorithm.

First, apply forward-backward to compute P(y_{t-1}, y_t | X):
P(y_{t-1}, y_t | X) = α_{t-1}(y_{t-1}) · exp Ψ_t(y_t, y_{t-1}, X) · β_t(y_t) / Z

Then compute the gradient with respect to each λ_q:
g_q = ∑_t f_q(y_t, y_{t-1}, X) – ∑_t ∑_{y_t} ∑_{y_{t-1}} P(y_{t-1}, y_t | X) f_q(y_t, y_{t-1}, X)
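A minimal sketch of this gradient for one training sequence. It assumes the pairwise marginals P(y_{t-1}, y_t | X) have already been computed by forward-backward and are simply passed in as a table, and it assumes feature functions with the signature f_q(y_prev, y_cur, X, t).

```python
def crf_gradient(X, Y, labels, features, pairwise_marginals):
    """g_q = (observed count of feature q along the true Y)
            - (expected count of feature q under P(Y'|X)).

    `features[q]` is a boolean function f_q(y_prev, y_cur, X, t).
    `pairwise_marginals[t][(k_prev, k)]` holds P(y_{t-1}=k_prev, y_t=k | X)
    for t = 1..T-1, as produced by the forward-backward algorithm."""
    grad = [0.0] * len(features)
    for q, f in enumerate(features):
        for t in range(1, len(X)):
            # Observed feature count along the true label sequence
            grad[q] += f(Y[t - 1], Y[t], X, t)
            # Expected feature count under the model's pairwise marginals
            for k_prev in labels:
                for k in labels:
                    grad[q] -= (pairwise_marginals[t][(k_prev, k)]
                                * f(k_prev, k, X, t))
    return grad
```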
Discriminative Methods

Learn a discriminant function to which the Viterbi algorithm can be applied
- "just get the right answer"

Methods:
- Averaged perceptron (Collins)
- Hidden Markov SVMs (Altun, et al.)
- Max Margin Markov Nets (Taskar, et al.)
Collins' Perceptron Method

If we ignore the global normalizer in the CRF, the score for a label sequence Y given an input sequence X is

score(Y) = ∑_t ∑_a λ_a f_a(y_{t-1}, y_t, X)

Collins' approach is to adjust the weights λ_a so that the correct label sequence gets the highest score according to the Viterbi algorithm.
Sequence Perceptron Algorithm

- Initialize weights λ_a = 0
- For ℓ = 1, …, L do
  - For each training example (X_i, Y_i):
    - Apply the Viterbi algorithm to find the path Ŷ with the highest score
    - For all a, update λ_a := λ_a + ∑_t [ f_a(y_t, y_{t-1}, X) – f_a(ŷ_t, ŷ_{t-1}, X) ]

This compares the "Viterbi path" to the "correct path". Note that no update is made if the Viterbi path is correct.
Averaged Perceptron

Let λ_a^{ℓ,i} be the value of λ_a after processing training example i in iteration ℓ.

Define λ_a* = the average value of λ_a = 1/(LN) ∑_{ℓ,i} λ_a^{ℓ,i}

Use these averaged weights in the final classifier.
Collins: Part-of-Speech Tagging with the Averaged Sequence Perceptron

- Without averaging: 3.68% error (20 iterations)
- With averaging: 2.93% error (10 iterations)
Hidden Markov SVM

Define a kernel between two input values x and x': k(x, x').

Define a kernel between (X, Y) and (X', Y') as follows:

K((X,Y), (X',Y')) = ∑_{s,t} I[y_{s-1} = y'_{t-1} & y_s = y'_t] + I[y_s = y'_t] k(x_s, x'_t)

This is the number of (y_{t-1}, y_t) transitions that the two sequences share, plus the number of matching labels (weighted by the similarity between the corresponding x values).
Dual Form of the Linear Classifier

Score(Y|X) = ∑_j ∑_a α_j(Y_a) K((X_j, Y_a), (X, Y))

where a indexes "support vector" label sequences Y_a.

The learning algorithm finds:
- the set of Y_a label sequences
- the weight values α_j(Y_a)
Dual Perceptron Algorithm

- Initialize α_j = 0
- For ℓ from 1 to L do
  - For i from 1 to N do
    - Ŷ = argmax_Y Score(Y | X_i)
    - If Ŷ ≠ Y_i then
      - α_i(Y_i) := α_i(Y_i) + 1
      - α_i(Ŷ) := α_i(Ŷ) – 1
Hidden Markov SVM Algorithm

- For all i, initialize
  - S_i = {Y_i}, the set of "support vector sequences" for i
  - α_i(Y) = 0 for all Y in S_i
- For ℓ from 1 to L do
  - For i from 1 to N do
    - Ŷ = argmax_{Y ≠ Y_i} Score(Y | X_i)
    - If Score(Y_i | X_i) < Score(Ŷ | X_i):
      - Add Ŷ to S_i
      - Solve a quadratic program to optimize the α_i(Y) for all Y in S_i, maximizing the margin between Y_i and all of the other Y's in S_i
      - If α_i(Y) = 0, delete Y from S_i
Altun et al. comparison
Maximum Margin Markov Networks

Define an SVM-like optimization problem to maximize the per-time-step margin.

Define:
ΔF(X_i, Y_i, Ŷ) = F(X_i, Y_i) – F(X_i, Ŷ)
ΔY(Y_i, Ŷ) = ∑_t I[ŷ_t ≠ y_{i,t}]

MMM SVM formulation:
min ||w||² + C ∑_i ξ_i
subject to
w · ΔF(X_i, Y_i, Ŷ) ≥ ΔY(Y_i, Ŷ) – ξ_i   for all Ŷ, for all i
Dual Form

maximize
∑_i ∑_Ŷ α_i(Ŷ) ΔY(Y_i, Ŷ) – ½ ∑_i ∑_Ŷ ∑_j ∑_Ŷ' α_i(Ŷ) α_j(Ŷ') [ΔF(X_i, Y_i, Ŷ) · ΔF(X_j, Y_j, Ŷ')]

subject to
∑_Ŷ α_i(Ŷ) = C   for all i
α_i(Ŷ) ≥ 0   for all i, for all Ŷ

Note that there are exponentially many Ŷ label sequences.
Converting to a Polynomial-Sized Formulation

Note the constraints:
∑_Ŷ α_i(Ŷ) = C   for all i
α_i(Ŷ) ≥ 0   for all i, for all Ŷ

These imply that for each i, the α_i(Ŷ) values are proportional to a probability distribution:
Q(Ŷ | X_i) = α_i(Ŷ) / C

Because the MRF is a simple chain, this distribution can be factored into local distributions:
Q(Ŷ | X_i) = ∏_t Q(ŷ_{t-1}, ŷ_t | X_i)

Let µ_i(ŷ_{t-1}, ŷ_t) be the unnormalized version of Q.
Reformulated Dual Form

maximize
∑_i ∑_t ∑_{ŷ_t} µ_i(ŷ_t) I[ŷ_t ≠ y_{i,t}]
  – ½ ∑_{i,j} ∑_t ∑_{ŷ_{t-1}, ŷ_t} ∑_s ∑_{ŷ'_{s-1}, ŷ'_s} µ_i(ŷ_{t-1}, ŷ_t) µ_j(ŷ'_{s-1}, ŷ'_s) [ΔF(ŷ_{t-1}, ŷ_t, X_i) · ΔF(ŷ'_{s-1}, ŷ'_s, X_j)]

subject to
∑_{ŷ_{t-1}} µ_i(ŷ_{t-1}, ŷ_t) = µ_i(ŷ_t)
∑_{ŷ_t} µ_i(ŷ_t) = C
µ_i(ŷ_{t-1}, ŷ_t) ≥ 0
Variables in the Dual Form

- µ_i(k, k') for each training example i and each pair of possible class labels k, k': O(NK²)
- µ_i(k) for each training example i and each possible class label k: O(NK)

Polynomial!
Taskar et al. comparison: Handwriting Recognition

[Results figure legend]
- log-reg: logistic regression sliding window
- CRF: conditional random field
- mSVM: multiclass SVM sliding window
- M^3N: max margin Markov net