Sequential Supervised Learning

SLIDE 1

Sequential Supervised Learning

SLIDE 2

Many Application Problems Require Sequential Learning

  • Part-of-speech Tagging
  • Information Extraction from the Web
  • Text-to-Speech Mapping

SLIDE 3

Part-of-Speech Tagging

Given an English sentence, can we assign a part of speech to each word?

“Do you want fries with that?”
<verb pron verb noun prep pron>

SLIDE 4

Information Extraction from the Web

<dl><dt><b>Srinivasan Seshan</b> (Carnegie Mellon University) <dt><a href=…><i>Making Virtual Worlds Real</i></a><dt>Tuesday, June 4, 2002<dd>2:00 PM , 322 Sieg<dd>Research Seminar

Per-token labels: * * * name name * * affiliation affiliation affiliation * * * * title title title title * * * date date date date * time time * location location * event-type event-type

SLIDE 5

Text-to-Speech Mapping

“photograph” => /f-Ot@graf-/

SLIDE 6

Sequential Supervised Learning (SSL)

Given: A set of training examples of the form (Xi, Yi), where Xi = ⟨xi,1, …, xi,Ti⟩ and Yi = ⟨yi,1, …, yi,Ti⟩ are sequences of length Ti.

Find: A function f for predicting new sequences: Y = f(X).
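As a concrete illustration (not from the slides), one training pair could be represented as two parallel lists; the class and field names below are hypothetical:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SequenceExample:
    """One training pair (Xi, Yi): parallel input and label sequences of length Ti."""
    x: List[str]  # x[t] is the input at position t (here, a single word)
    y: List[str]  # y[t] is the label at position t

# Part-of-speech tagging example from the earlier slide
example = SequenceExample(
    x=["Do", "you", "want", "fries", "with", "that"],
    y=["verb", "pron", "verb", "noun", "prep", "pron"],
)
assert len(example.x) == len(example.y)  # both sequences share the same length Ti
```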

SLIDE 7

Examples of Sequential Supervised Learning

Domain                  | Input Xi             | Output Yi
Part-of-speech Tagging  | sequence of words    | sequence of parts of speech
Information Extraction  | sequence of tokens   | sequence of field labels {name, …}
Text-to-speech Mapping  | sequence of letters  | sequence of phonemes

SLIDE 8

Two Kinds of Relationships

“Vertical” relationship between the xt’s and yt’s
– Example: “Friday” is usually a “date”

“Horizontal” relationships among the yt’s
– Example: “name” is usually followed by “affiliation”

SSL can (and should) exploit both kinds of information

[Figure: chain of labels y1, y2, y3 with inputs x1, x2, x3]

SLIDE 9

Existing Methods

Hacks
– Sliding windows
– Recurrent sliding windows

Hidden Markov models
– joint distribution: P(X,Y)

Conditional Random Fields
– conditional distribution: P(Y|X)

Discriminant Methods: HM-SVMs, MMMs, voted perceptrons
– discriminant function: f(Y; X)

SLIDE 10

Sliding Windows

Sentence: Do you want fries with that. Each window of three words is labeled with the part of speech of its center word:

(___, Do, you)      → verb
(Do, you, want)     → pron
(you, want, fries)  → verb
(want, fries, with) → noun
(fries, with, that) → prep
(with, that, ___)   → pron
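A minimal sketch of how the window/label pairs above could be generated, assuming one categorical label per word; the helper name and padding token are illustrative:

```python
def make_windows(words, labels, half_width=1, pad="___"):
    """Turn one labeled sentence into (window, label) pairs for ordinary supervised learning."""
    padded = [pad] * half_width + list(words) + [pad] * half_width
    pairs = []
    for t, label in enumerate(labels):
        window = tuple(padded[t:t + 2 * half_width + 1])  # words t-1, t, t+1
        pairs.append((window, label))
    return pairs

words = ["Do", "you", "want", "fries", "with", "that"]
labels = ["verb", "pron", "verb", "noun", "prep", "pron"]
for window, label in make_windows(words, labels):
    print(window, "->", label)  # ('___', 'Do', 'you') -> verb, ...
```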

SLIDE 11

Properties of Sliding Windows

Converts SSL to ordinary supervised learning.
Only captures the relationship between (part of) X and yt. Does not explicitly model relations among the yt’s.
Assumes each window is independent.

SLIDE 12

Recurrent Sliding Windows

Sentence: Do you want fries with that. Each window now also includes the label of the previous position as a feature:

previous label ___,  window (___, Do, you)      → verb
previous label verb, window (Do, you, want)     → pron
previous label pron, window (you, want, fries)  → verb
previous label verb, window (want, fries, with) → noun
previous label noun, window (fries, with, that) → prep
previous label prep, window (with, that, ___)   → pron

SLIDE 13

Recurrent Sliding Windows

Key Idea: Include yt as an input feature when computing yt+1.

During training:
– Use the correct value of yt
– Or train iteratively (especially recurrent neural networks)

During evaluation:
– Use the predicted value of yt (a sketch of this loop follows below)
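A sketch of the evaluation-time loop, assuming a classifier with a scikit-learn-style predict method was trained on windows augmented with the previous label; all names here are illustrative:

```python
def predict_recurrent(words, classifier, featurize, half_width=1, pad="___"):
    """Left-to-right recurrent sliding window: the *predicted* previous label
    is fed back in as an input feature for the next position."""
    padded = [pad] * half_width + list(words) + [pad] * half_width
    predictions = []
    prev_label = pad  # no previous label at t = 0
    for t in range(len(words)):
        window = padded[t:t + 2 * half_width + 1]
        features = featurize(window, prev_label)   # window words plus previous label
        label = classifier.predict([features])[0]
        predictions.append(label)
        prev_label = label                         # recurrence: reuse the prediction
    return predictions
```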

SLIDE 14

Properties of Recurrent Sliding Windows

Captures relationships among the y’s, but only in one direction!

Results on text-to-speech:

Method          | Direction  | Words | Letters
sliding window  | none       | 12.5% | 69.6%
recurrent s. w. | left-right | 17.0% | 67.9%
recurrent s. w. | right-left | 24.4% | 74.2%

SLIDE 15

Hidden Markov Models

Generalization of Naïve Bayes to SSL

[Figure: chain y1–y2–y3–y4–y5 with observations x1, x2, x3, x4, x5]

P(y1)
P(yt | yt-1), assumed the same for all t
P(xt | yt) = P(xt,1 | yt) · P(xt,2 | yt) ⋯ P(xt,n | yt), assumed the same for all t

SLIDE 16

Making Predictions with HMMs

Two possible goals:

– argmaxY P(Y|X)
  find the most likely sequence of labels Y given the input sequence X

– argmaxyt P(yt | X) for all t
  find the most likely label yt at each time t given the entire input sequence X

SLIDE 17

Finding Most Likely Label Sequence: The Trellis

[Trellis figure: one column of states (1 = verb, 2 = pronoun, 3 = noun, 4 = adjective) per word of “Do you want fries sir?”, with start node s and finish node f]

Every label sequence corresponds to a path through the trellis graph. The probability of a label sequence is proportional to
P(y1) · P(x1|y1) · P(y2|y1) · P(x2|y2) ⋯ P(yT | yT-1) · P(xT | yT)

SLIDE 18

Converting to Shortest Path Problem

maxy1,…,yT P(y1) · P(x1|y1) · P(y2|y1) · P(x2|y2) ⋯ P(yT | yT-1) · P(xT | yT)
= miny1,…,yT –log [P(y1) · P(x1|y1)] + –log [P(y2|y1) · P(x2|y2)] + ⋯ + –log [P(yT | yT-1) · P(xT | yT)]

This is a shortest path through the trellis graph, with edge cost = –log [P(yt|yt-1) · P(xt|yt)].

SLIDE 19

Finding Most Likely Label Sequence: The Viterbi Algorithm

Step t of the Viterbi algorithm computes the possible successors of state yt-1 and computes the total path length for each edge.

SLIDE 20

Finding Most Likely Label Sequence: The Viterbi Algorithm

Each node yt = k stores the cost µ of the shortest path that reaches it from s, and the predecessor class yt-1 = k’ that achieves this cost:

k’ = argminyt-1 –log [P(yt | yt-1) · P(xt | yt)] + µ(yt-1)
µ(k) = minyt-1 –log [P(yt | yt-1) · P(xt | yt)] + µ(yt-1)
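A compact sketch of this recurrence in negative-log space, assuming the HMM parameters are given as nested dictionaries of probabilities; the names are illustrative:

```python
import math

def viterbi(obs, states, p_start, p_trans, p_emit):
    """Most likely label sequence under an HMM (p_start[k], p_trans[k_prev][k], p_emit[k][x])."""
    # mu[t][k]: cost of the shortest path reaching state k at time t; back[t][k]: its predecessor
    mu = [{k: -math.log(p_start[k] * p_emit[k][obs[0]]) for k in states}]
    back = [{}]
    for t in range(1, len(obs)):
        mu.append({})
        back.append({})
        for k in states:
            prev = min(states, key=lambda kp: mu[t - 1][kp] - math.log(p_trans[kp][k]))
            mu[t][k] = mu[t - 1][prev] - math.log(p_trans[prev][k] * p_emit[k][obs[t]])
            back[t][k] = prev
    # trace back from the cheapest final state
    path = [min(states, key=lambda k: mu[-1][k])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```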

SLIDE 21

Finding Most Likely Label Sequence: The Viterbi Algorithm

Compute successors…

SLIDE 22

Finding Most Likely Label Sequence: The Viterbi Algorithm

Compute and store the shortest incoming arc at each node.

SLIDE 23

Finding Most Likely Label Sequence: The Viterbi Algorithm

Compute successors…

SLIDE 24

Finding Most Likely Label Sequence: The Viterbi Algorithm

Compute and store the shortest incoming arc at each node.

SLIDE 25

Finding Most Likely Label Sequence: The Viterbi Algorithm

Compute successors…

SLIDE 26

Finding Most Likely Label Sequence: The Viterbi Algorithm

Compute and store the shortest incoming edges.

SLIDE 27

Finding Most Likely Label Sequence: The Viterbi Algorithm

Compute successors (trivial).

SLIDE 28

Finding Most Likely Label Sequence: The Viterbi Algorithm

Compute the best edge into f.

SLIDE 29

Finding Most Likely Label Sequence: The Viterbi Algorithm

Now trace back along the best incoming edges to recover the predicted Y sequence: “verb pronoun verb noun noun”.

SLIDE 30

Finding the Most Likely Label at time t: P(yt | X)

P(y3 = 2 | X) = probability of reaching y3 = 2 from the start × probability of getting from y3 = 2 to the finish

SLIDE 31

Finding the most likely class at each time t

Goal: compute P(yt | x1, …, xT)

∝ ∑y1:t-1 ∑yt+1:T P(y1) · P(x1|y1) · P(y2|y1) · P(x2|y2) ⋯ P(yT | yT-1) · P(xT | yT)

∝ ∑y1:t-1 P(y1) · P(x1|y1) · P(y2|y1) · P(x2|y2) ⋯ P(yt | yt-1) · P(xt | yt) · ∑yt+1:T P(yt+1|yt) · P(xt+1|yt+1) ⋯ P(yT | yT-1) · P(xT | yT)

∝ ∑yt-1 [ ⋯ ∑y2 [∑y1 P(y1) · P(x1|y1) · P(y2|y1)] · P(x2|y2) · P(y3|y2)] ⋯ P(yt|yt-1)] · P(xt|yt) · ∑yt+1 [P(yt+1|yt) · P(xt+1|yt+1) ⋯ ∑yT-1 [P(yT-1|yT-2) · P(xT-1|yT-1) · ∑yT [P(yT|yT-1) · P(xT | yT)]] ⋯ ]

SLIDE 32

Forward-Backward Algorithm

αt(yt) = ∑yt-1 P(yt | yt-1) · P(xt | yt) · αt-1(yt-1)
– This is the sum over the arcs coming into yt = k
– It is computed “forward” along the sequence and stored in the trellis

βt(yt) = ∑yt+1 P(yt+1 | yt) · P(xt+1 | yt+1) · βt+1(yt+1)
– It is computed “backward” along the sequence and stored in the trellis

P(yt | X) = αt(yt) βt(yt) / [∑k αt(k) βt(k)]
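A sketch of the α/β recursions and the resulting per-position marginals, using the same dictionary-of-probabilities parameterization assumed in the Viterbi sketch above (no rescaling, so it would underflow on long sequences):

```python
def forward_backward(obs, states, p_start, p_trans, p_emit):
    """Return gamma[t][k] = P(y_t = k | X) for a fully specified HMM."""
    T = len(obs)
    alpha = [{k: p_start[k] * p_emit[k][obs[0]] for k in states}]
    for t in range(1, T):
        alpha.append({k: p_emit[k][obs[t]] * sum(p_trans[kp][k] * alpha[t - 1][kp] for kp in states)
                      for k in states})
    beta = [dict.fromkeys(states, 1.0) for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = {k: sum(p_trans[k][kn] * p_emit[kn][obs[t + 1]] * beta[t + 1][kn] for kn in states)
                   for k in states}
    gamma = []
    for t in range(T):
        z = sum(alpha[t][k] * beta[t][k] for k in states)
        gamma.append({k: alpha[t][k] * beta[t][k] / z for k in states})
    return gamma
```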

SLIDE 33

Training Hidden Markov Models

If the inputs and outputs are fully observed, this is extremely easy:

P(y1 = k) = [# examples with y1 = k] / m
P(yt = k | yt-1 = k’) = [# k’ → k transitions] / [# of times yt-1 = k’]
P(xj = v | y = k) = [# times y = k and xj = v] / [# times yt = k]

Should apply Laplace corrections to these estimates.
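A sketch of these counting estimates with add-one (Laplace) corrections, assuming each position carries a single categorical input value; the function and variable names are illustrative:

```python
from collections import Counter

def train_hmm(data, states, vocab):
    """Fully observed HMM training by counting, with Laplace (add-one) corrections.
    data is a list of (x_sequence, y_sequence) pairs."""
    start, trans, emit = Counter(), Counter(), Counter()
    prev_count, state_count = Counter(), Counter()
    for xs, ys in data:
        start[ys[0]] += 1
        for t, (x, y) in enumerate(zip(xs, ys)):
            emit[(y, x)] += 1
            state_count[y] += 1
            if t > 0:
                trans[(ys[t - 1], y)] += 1
                prev_count[ys[t - 1]] += 1
    m, K, V = len(data), len(states), len(vocab)
    p_start = {k: (start[k] + 1) / (m + K) for k in states}
    p_trans = {kp: {k: (trans[(kp, k)] + 1) / (prev_count[kp] + K) for k in states} for kp in states}
    p_emit = {k: {v: (emit[(k, v)] + 1) / (state_count[k] + V) for v in vocab} for k in states}
    return p_start, p_trans, p_emit
```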

SLIDE 34

Conditional Random Fields

The yt’s form a Markov Random Field conditioned on X: P(Y|X)

Lafferty, McCallum, & Pereira (2001)

[Figure: chain y1–y2–y3 conditioned on inputs x1, x2, x3]

SLIDE 35

Markov Random Fields

Graph G = (V,E)
– Each vertex v ∈ V represents a random variable yv.
– Each edge represents a direct probabilistic dependency.

P(Y) = 1/Z exp [∑c Ψc(c(Y))]
– c indexes the cliques in the graph
– Ψc is a potential function
– c(Y) selects the random variables participating in clique c.

SLIDE 36

A Simple MRF

Cliques:
– singletons: {y1}, {y2}, {y3}
– pairs (edges): {y1,y2}, {y2,y3}

P(⟨y1,y2,y3⟩) = 1/Z exp[Ψ1(y1) + Ψ2(y2) + Ψ3(y3) + Ψ12(y1,y2) + Ψ23(y2,y3)]

[Figure: chain y1–y2–y3]

SLIDE 37

CRF Potential Functions are Conditioned on X

Ψt(yt, X): how compatible is yt with X?
Ψt,t-1(yt, yt-1, X): how compatible is a transition from yt-1 to yt with X?

[Figure: chain y1–y2–y3 with inputs x1, x2, x3]

SLIDE 38

CRF Potentials are Log-Linear Models

Ψt(yt, X) = ∑b βb gb(yt, X)
Ψt,t+1(yt, yt+1, X) = ∑a λa fa(yt, yt+1, X)

where gb and fa are user-defined boolean functions (“features”)
– Example: g23 = [xt = “o” and yt = /@/]

We will lump them together as Ψt(yt, yt+1, X) = ∑a λa fa(yt, yt+1, X)
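As an illustration (the specific features and the extra position argument t are assumptions, not from the slides), such boolean features and the lumped potential might look like this in code:

```python
def f_o_to_schwa(y_t, y_next, X, t):
    """Vertical feature: the letter 'o' at position t is labeled with the phoneme /@/."""
    return X[t] == "o" and y_t == "@"

def f_verb_then_noun(y_t, y_next, X, t):
    """Horizontal feature: a verb is immediately followed by a noun."""
    return y_t == "verb" and y_next == "noun"

FEATURES = [f_o_to_schwa, f_verb_then_noun]

def psi(y_t, y_next, X, t, weights):
    """Lumped log-linear potential: a weighted sum of boolean feature functions."""
    return sum(w * f(y_t, y_next, X, t) for w, f in zip(weights, FEATURES))
```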

SLIDE 39

Making Predictions with CRFs

The Viterbi and Forward-Backward algorithms can be applied exactly as for HMMs.

SLIDE 40

Training CRFs

Let θ = {β1, β2, …, λ1, λ2, …} be all of our parameters.
Let Fθ be our CRF, so Fθ(Y,X) = P(Y|X).
Define the loss function L(Y, Fθ(Y,X)) to be the negative log likelihood:

L(Y, Fθ(Y,X)) = –log Fθ(Y,X)

Goal: Find θ to minimize the loss (maximize the likelihood).
Algorithm: Gradient Descent

SLIDE 41

Gradient Computation

gq = ∂/∂λq log P(Y|X)
   = ∂/∂λq log [ ∏t exp Ψt(yt, yt-1, X) / Z ]
   = ∂/∂λq [ ∑t Ψt(yt, yt-1, X) − log Z ]
   = ∑t ∂/∂λq ∑a λa fa(yt, yt-1, X) − ∂/∂λq log Z
   = ∑t fq(yt, yt-1, X) − ∂/∂λq log Z

SLIDE 42

Gradient of Z

∂/∂λq log Z = (1/Z) ∂Z/∂λq
   = (1/Z) ∂/∂λq ∑Y′ ∏t exp Ψt(y′t, y′t-1, X)
   = (1/Z) ∂/∂λq ∑Y′ exp ∑t Ψt(y′t, y′t-1, X)
   = (1/Z) ∑Y′ exp [∑t Ψt(y′t, y′t-1, X)] ∑t ∂/∂λq Ψt(y′t, y′t-1, X)
   = ∑Y′ (exp [∑t Ψt(y′t, y′t-1, X)] / Z) ∑t ∂/∂λq ∑a λa fa(y′t, y′t-1, X)
   = ∑Y′ P(Y′|X) [∑t fq(y′t, y′t-1, X)]

SLIDE 43

Gradient Computation

gq = ∑t fq(yt, yt-1, X) − ∑Y′ P(Y′|X) [∑t fq(y′t, y′t-1, X)]

This is the number of times feature q is true minus the expected number of times feature q is true. It can be computed via the forward-backward algorithm. First, apply forward-backward to compute P(yt-1, yt | X):

P(yt-1, yt | X) = (1/Z) αt-1(yt-1) · exp Ψ(yt, yt-1, X) · βt(yt)

Then compute the gradient with respect to each λq:

gq = ∑t fq(yt, yt-1, X) − ∑t ∑yt-1 ∑yt P(yt-1, yt | X) fq(yt, yt-1, X)
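A sketch of this gradient computation, assuming the pairwise marginals P(yt-1, yt | X) have already been obtained from forward-backward and are stored as pair_marginals[t][(k_prev, k)]; the names and the extra position argument are illustrative:

```python
def crf_gradient_q(f_q, X, Y, states, pair_marginals):
    """Gradient of log P(Y|X) with respect to lambda_q:
    observed count of feature q minus its expected count under the model."""
    observed = sum(f_q(Y[t], Y[t - 1], X, t) for t in range(1, len(Y)))
    expected = sum(
        pair_marginals[t][(k_prev, k)] * f_q(k, k_prev, X, t)
        for t in range(1, len(Y))
        for k_prev in states
        for k in states
    )
    return observed - expected
```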

SLIDE 44

Discriminative Methods

Learn a discriminant function to which the Viterbi algorithm can be applied
– “just get the right answer”

Methods:
– Averaged perceptron (Collins)
– Hidden Markov SVMs (Altun, et al.)
– Max Margin Markov Nets (Taskar, et al.)

SLIDE 45

Collins’ Perceptron Method

If we ignore the global normalizer in the CRF, the score for a label sequence Y given an input sequence X is

score(Y) = ∑t ∑a λa fa(yt-1, yt, X)

Collins’ approach is to adjust the weights λa so that the correct label sequence gets the highest score according to the Viterbi algorithm.

SLIDE 46

Sequence Perceptron Algorithm

Initialize weights λa = 0
For ℓ = 1, …, L do
– For each training example (Xi, Yi):
  - apply the Viterbi algorithm to find the path Ŷ with the highest score
  - for all a, update λa according to λa := λa + ∑t [fa(yt, yt-1, X) – fa(ŷt, ŷt-1, X)]

This compares the “Viterbi path” to the “correct path”. Note that no update is made if the Viterbi path is correct.
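A sketch of this training loop, assuming a viterbi_decode(X, weights) helper that returns the highest-scoring label sequence under the current weights; both the helper and the particular feature map are assumptions:

```python
from collections import Counter

def global_features(X, Y):
    """Counts of (previous label, label, current word) triples over the whole sequence."""
    counts = Counter()
    for t in range(len(Y)):
        prev = Y[t - 1] if t > 0 else "<s>"
        counts[(prev, Y[t], X[t])] += 1
    return counts

def train_sequence_perceptron(data, viterbi_decode, n_iterations):
    weights = Counter()
    for _ in range(n_iterations):
        for X, Y in data:
            Y_hat = viterbi_decode(X, weights)               # best path under current weights
            if Y_hat != Y:                                   # no update if the Viterbi path is correct
                weights.update(global_features(X, Y))        # add features of the correct path
                weights.subtract(global_features(X, Y_hat))  # subtract features of the Viterbi path
    return weights
```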

SLIDE 47

Averaged Perceptron

Let λaℓ,i be the value of λa after processing training example i in iteration ℓ.
Define λa* = the average value of λa = 1/(LN) ∑ℓ,i λaℓ,i
Use these averaged weights in the final classifier.

SLIDE 48

Collins Part-of-Speech Tagging with Averaged Sequence Perceptron

Without averaging: 3.68% error
– 20 iterations

With averaging: 2.93% error
– 10 iterations

SLIDE 49

Hidden Markov SVM

Define a kernel between two input values x and x’: k(x, x’).
Define a kernel between (X,Y) and (X’,Y’) as follows:

K((X,Y), (X’,Y’)) = ∑s,t I[ys-1 = y’t-1 & ys = y’t] + I[ys = y’t] k(xs, x’t)

This is the number of (yt-1, yt) transitions that they share, plus the number of matching labels (weighted by the similarity between the x values).

SLIDE 50

Dual Form of Linear Classifier

Score(Y|X) = ∑j ∑a αj(Ya) K((Xj, Ya), (X, Y))

a indexes “support vector” label sequences Ya

The learning algorithm finds
– the set of Ya label sequences
– the weight values αj(Ya)

SLIDE 51

Dual Perceptron Algorithm

Initialize αj = 0
For ℓ from 1 to L do
– For i from 1 to N do
  Ŷ = argmaxY Score(Y | Xi)
  if Ŷ ≠ Yi then
  – αi(Yi) = αi(Yi) + 1
  – αi(Ŷ) = αi(Ŷ) – 1

SLIDE 52

Hidden Markov SVM Algorithm

For all i, initialize
– Si = {Yi}, the set of “support vector sequences” for i
– αi(Y) = 0 for all Y in Si

For ℓ from 1 to L do
– For i from 1 to N do
  Ŷ = argmaxY≠Yi Score(Y | Xi)
  If Score(Yi | Xi) < Score(Ŷ | Xi):
  – Add Ŷ to Si
  – Solve a quadratic program to optimize the αi(Y) for all Y in Si to maximize the margin between Yi and all of the other Y’s in Si
  – If αi(Y) = 0, delete Y from Si

SLIDE 53

Altun et al. comparison

SLIDE 54

Maximum Margin Markov Networks

Define an SVM-like optimization problem to maximize the per-time-step margin.

Define
ΔF(Xi, Yi, Ŷ) = F(Xi, Yi) – F(Xi, Ŷ)
ΔY(Yi, Ŷ) = ∑t I[ŷt ≠ yi,t]

MMM SVM formulation:

min ||w||² + C ∑i ξi
subject to
w · ΔF(Xi, Yi, Ŷ) ≥ ΔY(Yi, Ŷ) – ξi   for all Ŷ, for all i

SLIDE 55

Dual Form

maximize ∑i ∑Ŷ αi(Ŷ) Δ(Yi, Ŷ) – ½ ∑i ∑Ŷ ∑j ∑Ŷ′ αi(Ŷ) αj(Ŷ′) [ΔF(Xi, Yi, Ŷ) · ΔF(Xj, Yj, Ŷ′)]

subject to
∑Ŷ αi(Ŷ) = C   for all i
αi(Ŷ) ≥ 0   for all i, for all Ŷ

Note that there are exponentially many Ŷ label sequences.

SLIDE 56

Converting to a Polynomial-Sized Formulation

Note the constraints:
∑Ŷ αi(Ŷ) = C   for all i
αi(Ŷ) ≥ 0   for all i, for all Ŷ

These imply that for each i, the αi(Ŷ) values are proportional to a probability distribution:

Q(Ŷ | Xi) = αi(Ŷ) / C

Because the MRF is a simple chain, this distribution can be factored into local distributions:

Q(Ŷ | Xi) = ∏t Q(ŷt-1, ŷt | Xi)

Let µi(ŷt-1, ŷt) be the unnormalized version of Q.
slide-57
SLIDE 57

Reformulated Dual Form Reformulated Dual Form

subject to

X ˆ yt−1

µi(ˆ yt−1, ˆ yt) = µi(ˆ yt)

X ˆ yt

µi(ˆ yt) = C µi(ˆ yt−1, ˆ yt) ≥ max

X i X t X ˆ yt

µi(ˆ yt)I[ˆ yt 6= yi,t]− 1 2

X i,j X t X ˆ yt,ˆ yt−1 X s X ˆ y0

s,ˆ

y0

s−1

µi(ˆ yt−1, ˆ yt)µj(ˆ y0

s−1, ˆ

y0

s)

∆F(ˆ yt−1, ˆ yt, Xi) · ∆F(ˆ y0

s−1, ˆ

y0

s, Xj)

SLIDE 58

Variables in the Dual Form

µi(k, k’) for each training example i and each possible pair of class labels k, k’: O(NK²)
µi(k) for each training example i and each possible class label k: O(NK)

Polynomial!

SLIDE 59

Taskar et al. comparison: Handwriting Recognition

Legend:
log-reg: logistic regression sliding window
CRF: conditional random field
mSVM: multiclass SVM sliding window
M^3N: max margin markov net

SLIDE 60

Current State of the Art

Discriminative methods give the best results
– not clear whether they scale
– published results all involve small numbers of training examples and very long training times

Work is continuing on making CRFs fast and practical
– new methods for training CRFs
– potentially extendable to discriminative methods