

SLIDE 1

Semi-Markov Conditional Random Fields for Information Extraction

Sunita Sarawagi and William Cohen, NIPS 2004

Presented by: Dinesh Khandelwal
Slides are adapted from Daniel Khashabi

SLIDE 2

Beyond Classification Learning

 The standard classification problem assumes that individual cases are disconnected and independent (i.i.d.: independently and identically distributed).

 Many NLP problems do not satisfy this assumption: they involve making many connected decisions, each resolving a different ambiguity, but all mutually dependent.

 More sophisticated learning and inference techniques are needed to handle such situations in general.

SLIDE 3

Sequence Labeling Problem

 Many NLP problems can be viewed as sequence labeling.
 Each token in a sequence is assigned a label.
 Labels of tokens are dependent on the labels of other tokens in the sequence, particularly their neighbors (not i.i.d.).

SLIDE 4

Named Entity Recognition

My review of Fermat’s last theorem by S. Singh

i:  1      2       3      4         5      6        7      8       9
x:  My     review  of     Fermat’s  last   theorem  by     S.      Singh
y:  Other  Other   Other  Title     Title  Title    Other  Author  Author

(y_1 … y_9 are the per-token labels for tokens x_1 … x_9.)

SLIDE 5

Problem Description

 The relational connection occurs in many applications: NLP, computer vision, signal processing, ….

 Traditionally in graphical models, modeling the joint distribution p(x, y) can lead to difficulties:
   rich local features occur in relational data,
   features may have complex dependencies,
   constructing a probability distribution p(x) over them is difficult.

 Solution: directly model the conditional p(y | x), which is sufficient for classification!

 A CRF is simply a conditional distribution p(y | x) with an associated graphical structure.

SLIDE 6

Log linear representation of CRFs

F(x, y) = Σ_{i=1}^{|y|} f(i, x, y)

 f = (f_1, …, f_K) is a vector of local feature functions.

P(y | x, W) = (1/Z_W(x)) exp(W^T F(x, y))

 W: the parameters to be estimated.
 Z_W(x) = Σ_{y′} exp(W^T F(x, y′)) is the normalizer (partition function).
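To make the log-linear form concrete, here is a minimal sketch (not from the paper) of the conditional P(y | x, W) on a toy 3-token sentence: all feature names, the tag set, and the weights are invented for illustration, and the partition function is computed by brute-force enumeration.

```python
import itertools
import math

# Toy CRF sketch: 3 tokens, invented tag set {DT, N, V}.
LABELS = ["DT", "N", "V"]

def local_features(i, x, y):
    """f(i, x, y): binary indicator features at position i."""
    feats = {f"word={x[i]}|label={y[i]}": 1.0}
    if i > 0:
        feats[f"prev={y[i-1]}|cur={y[i]}"] = 1.0
    return feats

def global_features(x, y):
    """F(x, y) = sum_i f(i, x, y)."""
    total = {}
    for i in range(len(x)):
        for k, v in local_features(i, x, y).items():
            total[k] = total.get(k, 0.0) + v
    return total

def prob(x, y, W):
    """P(y | x, W) = exp(W . F(x, y)) / Z_W(x), with Z_W(x) computed by
    brute-force enumeration over all label sequences (exponential; the
    point of the chain structure is to avoid exactly this)."""
    def score(lab):
        return sum(W.get(k, 0.0) * v for k, v in global_features(x, lab).items())
    Z = sum(math.exp(score(lab))
            for lab in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y)) / Z

x = ["the", "dog", "runs"]
W = {"word=the|label=DT": 2.0, "prev=DT|cur=N": 1.5}
```

Since `prob` normalizes over all 3³ label sequences, the probabilities sum to one by construction.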

SLIDE 7

Linear Chain CRF

(Figure: linear-chain graphical model; shaded nodes = observable tokens x, clear nodes = unobservable labels y.)

f(i, x, y) = f′(y_i, y_{i−1}, x, i)

SLIDE 8

Features

The kinds of features used in NLP-oriented machine learning systems typically involve:

 Binary values: think of a feature as being on or off, rather than as a feature with a value.
 Values that are relative to an object/class pair, rather than being a function of the object alone.
 Lots and lots of features (100,000s of features isn’t unusual).

SLIDE 9

Features

f_1(i, x, y) = 1 if y_i = DT and y_{i−1} = V, 0 otherwise

f_2(i, x, y) = 1 if x_i = "the" and y_i = DT, 0 otherwise

f_3(i, x, y) = 1 if suffix(x_i) = "ing" and y_i = V, 0 otherwise
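These three indicators translate directly into code. A minimal sketch (the example sentence and its tags are invented for illustration):

```python
# The three indicator features above, written as Python predicates.

def f1(i, x, y):
    """1 if y_i = DT and y_{i-1} = V, 0 otherwise."""
    return 1 if i > 0 and y[i] == "DT" and y[i-1] == "V" else 0

def f2(i, x, y):
    """1 if x_i = "the" and y_i = DT, 0 otherwise."""
    return 1 if x[i] == "the" and y[i] == "DT" else 0

def f3(i, x, y):
    """1 if x_i has suffix "ing" and y_i = V, 0 otherwise."""
    return 1 if x[i].endswith("ing") and y[i] == "V" else 0

x = ["she", "was", "walking", "the", "dog"]
y = ["PRP", "V", "V", "DT", "N"]
vals = [(f1(i, x, y), f2(i, x, y), f3(i, x, y)) for i in range(len(x))]
# f3 fires at "walking" (tagged V); f1 and f2 both fire at "the" (DT after V).
```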
SLIDE 10

Segmentation models (Semi-CRFs)

Sequence model (CRF): one label per token.

i:  1  2     3       4     5         6        7   8        9
x:  I  went  skiing  with  Fernando  Pereira  in  British  Columbia
y:  O  O     O       O     I         I        O   I        I

Features describe a single word.

Segmentation model (semi-CRF): one label per segment (u_k, v_k).

(u_1=v_1=1)  (u_2=v_2=2)  (u_3=v_3=3)  (u_4=v_4=4)  (u_5=5, v_5=6)  (u_6=v_6=7)  (u_7=8, v_7=9)

x:  I  went  skiing  with  [Fernando Pereira]  in  [British Columbia]
y:  O  O     O       O     I                   O   I

Features describe the segment from u_k to v_k: the token-level feature f′(y_i, y_{i−1}, x, i) becomes the segment-level feature h(y_k, y_{k−1}, x, u_k, v_k).
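The move from token labels to segments can be sketched in code. This helper rebuilds the slide's segmentation from the IO label sequence (a sketch: runs of I become one segment, while each O token stays a unit-length segment, matching the example above):

```python
def labels_to_segments(y):
    """Collapse per-token IO labels into segments (u_k, v_k, label),
    1-indexed: runs of "I" merge into one segment, each "O" token
    becomes a unit-length segment."""
    segments = []
    i = 0
    while i < len(y):
        if y[i] == "O":
            segments.append((i + 1, i + 1, "O"))
            i += 1
        else:
            j = i
            while j + 1 < len(y) and y[j + 1] == y[i]:
                j += 1
            segments.append((i + 1, j + 1, y[i]))
            i = j + 1
    return segments

# The slide's example: "I went skiing with Fernando Pereira in British Columbia"
y = ["O", "O", "O", "O", "I", "I", "O", "I", "I"]
segs = labels_to_segments(y)
```

Note that with plain IO labels, two adjacent distinct entities of the same type would merge into one segment; the semi-CRF represents segment boundaries directly, so it does not have this ambiguity.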

SLIDE 11

Semi-CRF

s = ⟨s_1, …, s_p⟩ denotes a segmentation of x.

Segment s_k = (u_k, v_k, y_k) consists of a start position u_k, an end position v_k, and a label y_k, with

1 ≤ u_k ≤ v_k ≤ |x|   and   u_{k+1} = v_k + 1.

SLIDE 12

Semi-CRF

(Figure: semi-CRF graphical model; shaded nodes = observable tokens x, clear nodes = unobservable segment labels.)

h(k, x, s) = h′(y_k, y_{k−1}, x, u_k, v_k)

P(s | x, W) = (1/Z_W(x)) exp(W^T H(x, s)),   Z_W(x) = Σ_{s′} exp(W^T H(x, s′))

H(x, s) = Σ_{k=1}^{|s|} h(k, x, s)

h is a vector of segment-level feature functions.
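A minimal sketch of the unnormalized score W·H(x, s) of one segmentation. The segment features below (length, capitalization pattern, label transition) and all weights are invented stand-ins for the paper's feature set:

```python
def segment_features(yk, yprev, x, u, v):
    """h(y_k, y_{k-1}, x, u_k, v_k): segment-level indicator features.
    u, v are 1-indexed inclusive segment boundaries (invented features)."""
    words = x[u - 1:v]
    pattern = "".join("X" if w[0].isupper() else "x" for w in words)
    return {
        f"len={v - u + 1}|label={yk}": 1.0,
        f"cap={pattern}|label={yk}": 1.0,
        f"trans={yprev}->{yk}": 1.0,
    }

def score(x, segments, W):
    """W . H(x, s) = sum_k W . h(y_k, y_{k-1}, x, u_k, v_k)."""
    total, yprev = 0.0, "START"
    for (u, v, yk) in segments:
        for feat, val in segment_features(yk, yprev, x, u, v).items():
            total += W.get(feat, 0.0) * val
        yprev = yk
    return total

x = "I went skiing with Fernando Pereira in British Columbia".split()
s = [(1, 1, "O"), (2, 2, "O"), (3, 3, "O"), (4, 4, "O"),
     (5, 6, "I"), (7, 7, "O"), (8, 9, "I")]
# Two weights fire on each of the two capitalized 2-word I-segments.
W = {"cap=XX|label=I": 2.0, "len=2|label=I": 1.0}
```

Note how a feature like `cap=XX` looks at the whole two-word segment at once; a token-level CRF feature could only see one word at a time.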

SLIDE 13

MAP Inference Semi-CRF

s* = argmax_s P(s | x, W)
   = argmax_s W^T H(x, s)
   = argmax_s W^T Σ_k h(y_k, y_{k−1}, x, u_k, v_k)

h is a vector of segment-level feature functions.

SLIDE 14

Viterbi algorithm for Semi-CRF

max_s W^T Σ_{k=1}^{|s|} h(y_k, y_{k−1}, x, u_k, v_k)

 Let L be an upper bound on segment length.
 Let s_{i:y} denote the set of all partial segmentations from 1 to i such that the last segment ends at position i and has label y.

V(i, y) = max_{y′, d} max_{s′ ∈ s_{i−d:y′}} [ W^T Σ_k h(y_k, y_{k−1}, x, u_k, v_k) + W^T h(y, y′, x, i−d+1, i) ]

SLIDE 15

Viterbi algorithm for Semi-CRF

V(i, y) = max_{y′, d} { max_{s′ ∈ s_{i−d:y′}} W^T Σ_k h(y_k, y_{k−1}, x, u_k, v_k) + W^T h(y, y′, x, i−d+1, i) }

Since V(i−d, y′) = max_{s′ ∈ s_{i−d:y′}} W^T Σ_k h(y_k, y_{k−1}, x, u_k, v_k), this simplifies to

V(i, y) = max_{y′, d} V(i−d, y′) + W^T h(y, y′, x, i−d+1, i)

SLIDE 16

Viterbi algorithm for Semi-CRF

V(i, y) = max_{y′, d=1..L} V(i−d, y′) + W^T h(y, y′, x, i−d+1, i)   if i > 0
V(i, y) = 0    if i = 0
V(i, y) = −∞   if i < 0

The optimal segmentation corresponds to the path traced by max_y V(|x|, y).
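The full recurrence, with its base cases and backpointers, can be sketched as a dynamic program. `seg_score(y, yprev, u, v)` stands in for W·h(y, y′, x, u, v); the toy scorer at the bottom is invented purely for illustration.

```python
def semi_crf_viterbi(n, labels, L, seg_score):
    """MAP segmentation via V(i, y) = max over d in 1..L and y' of
    V(i-d, y') + seg_score(y, y', i-d+1, i); V(0, y) = 0.  Positions
    are 1-indexed; d never reaches below 0, so the -inf case is
    implicit in the loop bound min(L, i)."""
    NEG = float("-inf")
    V = {(0, y): 0.0 for y in labels}
    back = {}
    for i in range(1, n + 1):
        for y in labels:
            best, arg = NEG, None
            for d in range(1, min(L, i) + 1):
                for yp in labels:
                    sc = V[(i - d, yp)] + seg_score(y, yp, i - d + 1, i)
                    if sc > best:
                        best, arg = sc, (i - d, yp)
            V[(i, y)], back[(i, y)] = best, arg
    # Trace the optimal path back from max_y V(n, y).
    y = max(labels, key=lambda z: V[(n, z)])
    best_score, segs, i = V[(n, y)], [], n
    while i > 0:
        j, yp = back[(i, y)]
        segs.append((j + 1, i, y))
        i, y = j, yp
    return list(reversed(segs)), best_score

# Invented scorer: reward one I-segment spanning (3, 4), plus a small
# reward for every unit-length O segment.
def toy_score(y, yp, u, v):
    if y == "I" and (u, v) == (3, 4):
        return 5.0
    if y == "O" and u == v:
        return 1.0
    return 0.0

segs, best = semi_crf_viterbi(5, ["O", "I"], 2, toy_score)
```

The inner loop over d is what makes the cost linear in L compared to a token-level Viterbi pass.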

SLIDE 17

Semi-Markov CRFs vs conventional CRFs

Since conventional CRFs need not maximize over possible segment lengths d, inference for semi-CRFs is more expensive. However, the additional cost is only linear in L. Semi-CRFs have more expressive power: a major advantage is that they allow features which measure properties of segments, rather than of individual elements.

SLIDE 18

Semi-Markov CRFs vs Higher order CRFs

Semi-CRFs are no more expressive than order-L CRFs. For order-L CRFs, however, the additional computational cost is exponential in L. Semi-CRFs only consider sequences in which the same label is assigned to all L positions, rather than all |Y|^L length-L label sequences. This is a useful restriction, as it leads to faster inference.

SLIDE 19

Parameter Learning: Semi-CRF

 Given the training data {(x_m, s_m)}_{m=1}^{M}, we wish to learn the parameters of the model. We express the log-likelihood over the training sequences as

ℒ(W) = Σ_m log P(s_m | x_m, W) = Σ_m ( W^T H(x_m, s_m) − log Z_W(x_m) )

 ℒ(W) is concave, and can thus be maximized by gradient ascent or one of many related methods. (The paper uses a limited-memory quasi-Newton method.)

∇ℒ(W) = Σ_m ( H(x_m, s_m) − E_{P(s′|x_m, W)}[H(x_m, s′)] )

Observed feature count minus expected feature count.
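The gradient identity (observed minus expected feature counts) can be verified by brute force on a tiny example. The (label, length) counting features below are invented stand-ins for the paper's feature set, and the usage note checks the analytic gradient against a finite difference of the log-likelihood:

```python
import math

def all_segmentations(n, labels, L):
    """Enumerate every segmentation of positions 1..n with segment
    length at most L (brute force; fine only for tiny n)."""
    def rec(start):
        if start > n:
            yield []
            return
        for end in range(start, min(start + L, n + 1)):
            for y in labels:
                for rest in rec(end + 1):
                    yield [(start, end, y)] + rest
    return list(rec(1))

def H(s):
    """Toy global feature vector H(x, s): counts of (label, length)
    pairs.  Invented stand-in for the paper's segment features."""
    feats = {}
    for (u, v, y) in s:
        key = f"{y}:len{v - u + 1}"
        feats[key] = feats.get(key, 0.0) + 1.0
    return feats

def loglik_and_grad(s_obs, W, n, labels, L):
    """log P(s_obs | x, W) and its gradient: observed feature counts
    minus expected feature counts under the current W."""
    segs = all_segmentations(n, labels, L)
    def dot(feats):
        return sum(W.get(k, 0.0) * v for k, v in feats.items())
    scores = [dot(H(s)) for s in segs]
    Z = sum(math.exp(sc) for sc in scores)
    grad = dict(H(s_obs))                       # observed counts
    for s, sc in zip(segs, scores):
        p = math.exp(sc) / Z
        for k, v in H(s).items():
            grad[k] = grad.get(k, 0.0) - p * v  # minus expected counts
    return dot(H(s_obs)) - math.log(Z), grad

W = {"I:len2": 0.5}
s_obs = [(1, 2, "I"), (3, 3, "O")]
ll, grad = loglik_and_grad(s_obs, W, 3, ["O", "I"], 2)
```

A finite-difference perturbation of the weight `"I:len2"` should reproduce `grad["I:len2"]`, confirming the observed-minus-expected form.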

SLIDE 20

Parameter Learning: Semi-CRF

∇ℒ(W) = Σ_m ( H(x_m, s_m) − E_{P(s′|x_m, W)}[H(x_m, s′)] )
       = Σ_m ( H(x_m, s_m) − Σ_{s′} H(x_m, s′) exp(W^T H(x_m, s′)) / Z_W(x_m) )

The Markov property of H and dynamic programming allow fast computation of the expected value of the features under the current weight vector:

α(i, y) = Σ_{s′ ∈ s_{i:y}} exp(W^T H(x, s′)),

where s_{i:y} denotes all segmentations of 1..i ending at i and labeled y, and

Z_W(x) = Σ_y α(|x|, y).

SLIDE 21

Parameter Learning: Semi-CRF

α(i, y) = Σ_{d=1..L} Σ_{y′ ∈ Y} α(i−d, y′) exp(W^T h(y, y′, x, i−d+1, i))   if i > 0
α(i, y) = 1   if i = 0
α(i, y) = 0   if i < 0

A similar approach can be used to compute the expectation Σ_{s′} H(x_m, s′) exp(W^T H(x_m, s′)). Let

η^l(i, y) = Σ_{s′ ∈ s_{i:y}} H^l(x_m, s′) exp(W^T H(x_m, s′)),

restricted to the part of the segmentation ending at position i. Then

η^l(i, y) = Σ_{d=1..L} Σ_{y′ ∈ Y} ( η^l(i−d, y′) + α(i−d, y′) h^l(y, y′, x, i−d+1, i) ) exp(W^T h(y, y′, x, i−d+1, i))
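The forward recursion for the partition function can be sketched and checked against brute-force enumeration of all segmentations. `seg_score` stands in for W·h and is invented for illustration; a special START label serves as the predecessor at position 0.

```python
import math

def forward_Z(n, labels, L, seg_score):
    """Z_W(x) = sum_y alpha(n, y), with alpha computed by the forward
    recursion: alpha(i, y) sums exp-scores of all segmentations of
    1..i whose last segment ends at i with label y."""
    alpha = {(0, "START"): 1.0}
    for i in range(1, n + 1):
        for y in labels:
            total = 0.0
            for d in range(1, min(L, i) + 1):
                prevs = ["START"] if i - d == 0 else labels
                for yp in prevs:
                    total += alpha[(i - d, yp)] * math.exp(
                        seg_score(y, yp, i - d + 1, i))
            alpha[(i, y)] = total
    return sum(alpha[(n, y)] for y in labels)

def brute_Z(n, labels, L, seg_score):
    """The same quantity by explicit enumeration of all segmentations."""
    def rec(start, yprev):
        if start > n:
            return 1.0
        total = 0.0
        for end in range(start, min(start + L, n + 1)):
            for y in labels:
                total += math.exp(seg_score(y, yprev, start, end)) * rec(end + 1, y)
        return total
    return rec(1, "START")

# Invented scorer for the check.
def toy_score(y, yp, u, v):
    return (0.5 if y == "I" else 0.2) + (0.3 if yp == "I" else 0.0) + 0.1 * (v - u)

Z_dp = forward_Z(4, ["O", "I"], 2, toy_score)
Z_bf = brute_Z(4, ["O", "I"], 2, toy_score)
```

The dynamic program touches O(n·L·|Y|²) terms, while the brute force grows exponentially in n; both must agree exactly.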

SLIDE 22

Parameter Learning: Semi-CRF

E_{P(s′|x, W)}[H^l(x, s′)] = (1/Z_W(x)) Σ_y η^l(|x|, y)

SLIDE 23

Extensions

Barun, Gagan, Dhruvin, Yashoteja: This idea of reasoning over segments can be extended to the task of image segmentation.

Nupur: Introduce constraints in the model to get something similar to CCMs, as in the case of CRFs.

Happy: Apart from the similarity measures they have used, there is a very good similarity measure called Gower distance, which is primarily used for non-numerical data. I think we can also use that here.

Prachi: Compare SOTA deep learning models and semi-CRFs to build insight into what one can capture and the other can't. This may enable us to improve the architectures of both models.

Yashoteja: Start with L=1, and quickly filter out the regions of the sequence that we are confident do not contain any named entities. Now we can use L=2 and resegment only those regions where entities might lie. We can then proceed with L=3, etc. The intuition is similar to that of the Apriori algorithm.

SLIDE 24

Experiments with NER data

Baseline algorithms:
 CRF/1 labels words inside and outside entities with I and O, respectively.
 CRF/4 replaces the I tag with four tags B, E, C, and U, which depend on where the word appears in an entity.

Datasets:
 The Address corpus contains 4,226 words and consists of 395 home addresses of students. The paper considered extraction of city names and state names from this corpus.
 The Jobs corpus contains 73,330 words and consists of 300 computer-related job postings. The paper considered extraction of company names and job titles.
 The 18,121-word Email corpus contains 216 email messages taken from the CSPACE email corpus, which is mail associated with a 14-week, 277-person management game. The paper considered extraction of person names.

SLIDE 25

Features

CRF features:
 Indicators for specific words at location i, or locations within three words of i.
 Indicators for capitalization/letter patterns.

Semi-CRF features:
 Indicators for the phrase inside a segment and the capitalization pattern inside a segment.
 Indicators for words and capitalization patterns in 3-word windows before and after the segment.
 Indicators for each segment length (d = 1, …, L), and all word-level features combined with indicators for the beginning and end of a segment.

Dictionary-based features:
 An external dictionary E of strings, and an internal segment dictionary.
 Segment similarity to the dictionary: h_E(k, x, s) = max_{v ∈ E} sim(x_{u_k..v_k}, v).
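A sketch of a segment-level dictionary feature: the similarity of the segment's text to its closest entry in an external dictionary E. `difflib.SequenceMatcher` is used here as an invented stand-in for the paper's sim(·,·), and the dictionary entries are illustrative only:

```python
import difflib

def dict_sim_feature(segment_words, dictionary):
    """Similarity of the segment text to its closest entry in an
    external dictionary E (difflib ratio as a stand-in similarity)."""
    text = " ".join(segment_words).lower()
    return max(difflib.SequenceMatcher(None, text, entry.lower()).ratio()
               for entry in dictionary)

E = ["Fernando Pereira", "William Cohen"]
v_match = dict_sim_feature(["Fernando", "Pereira"], E)  # exact dictionary hit
v_other = dict_sim_feature(["went", "skiing"], E)       # unrelated segment
```

Such a feature is only expressible because the model scores whole segments; a token-level CRF cannot compare a multi-word candidate against a dictionary entry in one feature.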

SLIDE 26

Results

SLIDE 27

Results

SLIDE 28

Results

Dhruvin, Prachi, Gagan: Precision/recall values are not reported.
Anshul: Why do order-L CRFs perform much worse than semi-CRFs?
Nupur, Haroun: Why is the comparison made with only CRFs?