Automatic Collocation Extraction from Text Corpora (Pavel Pecina)



SLIDE 1

Automatic Collocation Extraction from Text Corpora

Pavel Pecina

Ústav formální a aplikované lingvistiky (Institute of Formal and Applied Linguistics), MFF UK, Praha

May 17, 2004

Pavel Pecina Automatic Collocation Extraction from Text Corpora

SLIDE 2

Outline

1 The notion of collocation
  • Motivation
  • Few definitions
  • Characteristic features, classification and categorization

2 Methodology of collocation extraction
  • Phrase Extraction
  • Collocation identification

3 Experiments
  • Toolkit
  • Data
  • Basic Methods
  • Evaluation
  • Advanced Methods

4 Summary
  • Conclusion, Future work, Used Tools

SLIDE 4

Well known problems

Lexicography

  • Which multiword expressions to include in a lexicon?

My new computer is a laptop computer.

Machine translation

  • Where to break a sentence into chunks?

She likes ice cream pancakes.

Information retrieval

  • Which multiword terms to index?

Our new friend is from New York.

Word sense disambiguation

  • How to distinguish between possible word senses?

My uncle owns a wine yard.

SLIDE 5

Other well known problems

Spell/grammar/style-checking

  • Is this text written correctly?

Meals will be served outside, weather allowing.

Text classification and summarization

  • What is this text about?

Carriage return is necessary here.

Language modeling (text/speech synthesis)

  • How to create a fluent sentence?

Could you hand me salt and pepper?

Corpus-based language teaching/learning

  • What kinds of multiword expressions to teach?

When she kicked his head he kicked the bucket.

SLIDE 6

What are we looking for?

noun phrases

disk drive, weapons of mass destruction

light verb compounds

keep an eye, make a decision

phrasal verbs

make up, give up, tell off

stock phrases

bacon and eggs, salt and pepper

idioms

hear it through the grapevine

technological expressions

object oriented language

proper names

Joe Black, Prague Spring

frequent usages

game over, good morning

multiword units w/ independent existence

white wine, Far East

close associations between words

knock on a door, thick hair

SLIDE 7

What are we looking for?

Collocations.

SLIDE 8

Definitions ...

Firth (1951): “Collocations of a given word are statements of the habitual or customary places of that word.”

Choueka (1988): “A collocation is a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components.”

SLIDE 9

Other Definitions ...

Manning (1999): “A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things.”

Radev (1998): “A collocation is a group of words that occur together more often than by chance.”

SLIDE 10

... and The Definition

“A collocation is an expression consisting of two or more words that form a grammatical and semantic unit.”

SLIDE 11

Characteristic Features

Non-compositionality

kick the bucket, carriage return, white man

Non-substitutability

yellow wine, hit the bucket, make homework

Non-modifiability

give a small hand, poor as a church mice

Not straightforward translation

ice cream, to be right

Domain-dependency

carriage return,

“Subjectivity”

game over, new company

SLIDE 12

Classification

Semantics

  • compositional, noncompositional

Consecutivity

  • free, fixed

Functionality

  • idioms, proper names, technical terms, phrasal verbs, light verbs

Word usage

  • A→N, N→A, D→V, R→N

SLIDE 13

Grammar Patterns

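Part-Of-Speech
  A N      lineární funkce (linear function)
  N N      následník trůnu (heir to the throne)
  D A N    objektově orientovaný jazyk (object-oriented language)
  N A N    zbraně hromadného ničení (weapons of mass destruction)
  V R N    přijít k sobě (to come to, regain consciousness)

Dependency Types
  Atr      cenný papír (security, literally "valuable paper")
  Sb       soud rozhodl (the court decided)
  Obj      dávat přednost (to give preference)
  Adv      zdravotně postižený (disabled, literally "health-impaired")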

SLIDE 15

Phrase extraction

1. extracting all possible candidates for collocations
  • consecutive word n-grams
  • sliding window
  • syntactic subtrees

2. collecting their occurrence statistics
  • contingency tables
  • empirical context
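The first two candidate types can be sketched in a few lines of Python. This is only an illustration under simple assumptions (whitespace-tokenized sentences, ordered word pairs, a hypothetical window size of 3); the function names are invented here and are not taken from the toolkit described later.

```python
from collections import Counter

def consecutive_bigrams(tokens):
    """Return adjacent word pairs (x, y)."""
    return zip(tokens, tokens[1:])

def window_bigrams(tokens, window=3):
    """Yield ordered pairs (x, y) where y follows x within `window` positions."""
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + window]:
            yield (x, y)

def count_candidates(sentences, extractor):
    """Count candidate pairs over a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(extractor(tokens))
    return counts

# Toy input, invented for illustration.
sentences = [
    "she likes ice cream".split(),
    "ice cream is cold".split(),
]
adjacent = count_candidates(sentences, consecutive_bigrams)
windowed = count_candidates(sentences, window_bigrams)
```

The sliding window also proposes non-adjacent pairs such as ("she", "ice"), which is why it produces more candidates than plain n-grams. Syntactic subtrees would need a parser and are not sketched here.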

SLIDE 16

Contingency table: observed frequencies

bigram: xy

          X=x         X≠x
  Y=y     $O_{11}$    $O_{12}$    $R_1$
  Y≠y     $O_{21}$    $O_{22}$    $R_2$
          $C_1$       $C_2$       $N$

example: černý trh (black market)

          X=černý                   X≠černý
  Y=trh   černý trh                 domácí trh (domestic market)
  Y≠trh   černý čaj (black tea)     zelený čaj (green tea)

SLIDE 17

Contingency table: observed frequencies

bigram: xy

          X=x    X≠x
  Y=y     a      b
  Y≠y     c      d

example: černý trh (black market)

          X=černý                   X≠černý
  Y=trh   černý trh                 domácí trh (domestic market)
  Y≠trh   černý čaj (black tea)     zelený čaj (green tea)
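As a rough sketch, the four cells a, b, c, d can be derived from a table of bigram counts as follows. The cell layout follows the slide (columns for X, rows for Y); the function name and the toy frequencies are assumptions made for this example.

```python
from collections import Counter

def contingency(bigram_counts, x, y):
    """Return (a, b, c, d) for bigram xy:
    a: X=x,  Y=y     b: X!=x, Y=y
    c: X=x,  Y!=y    d: X!=x, Y!=y
    """
    n = sum(bigram_counts.values())
    a = bigram_counts.get((x, y), 0)
    # Marginal frequencies: all bigrams starting with x / ending with y.
    first = sum(f for (u, _), f in bigram_counts.items() if u == x)
    second = sum(f for (_, v), f in bigram_counts.items() if v == y)
    b = second - a
    c = first - a
    d = n - a - b - c
    return a, b, c, d

# Toy frequencies, invented for illustration only.
counts = Counter({
    ("černý", "trh"): 8,
    ("domácí", "trh"): 4,
    ("černý", "čaj"): 2,
    ("zelený", "čaj"): 6,
})
a, b, c, d = contingency(counts, "černý", "trh")
```

With these toy counts the table is a=8, b=4, c=2, d=6, so the row sums are 12 and 8, the column sums are 10 and 10, and N=20.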

SLIDE 18

Average Word Context

Example (concordance lines for the word trh, "market"):

  zlepšení situace . Kapitálový trh je však stále nelikvidní
  že to není samostatný trh a že je součástí širšího
  bariérách v přístupu na trh , cenových rozdílech ,
  banky . Americký akciový trh byl za silného obchodování
  jít se svou kůží na trh . Pro vydání mluvila zejména

Context word probability distribution $P(w_i|x)$
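One possible way to estimate such a distribution is to count the words that appear within a fixed window around every occurrence of x and normalize the counts. The window size, the symmetric window, and the function name below are assumptions for illustration; the slide does not specify how the context is delimited.

```python
from collections import Counter

def context_distribution(sentences, x, window=3):
    """Estimate P(w | x) as the relative frequency of words seen within
    `window` tokens to the left or right of any occurrence of x."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != x:
                continue
            left = tokens[max(0, i - window) : i]
            right = tokens[i + 1 : i + 1 + window]
            counts.update(left + right)
    total = sum(counts.values())
    return {w: f / total for w, f in counts.items()} if total else {}

# Two shortened concordance lines used as toy input.
sentences = [
    "kapitálový trh je však stále nelikvidní".split(),
    "americký akciový trh byl".split(),
]
p = context_distribution(sentences, "trh", window=2)
```

Here six context tokens are collected in total, each a different word, so every context word receives probability 1/6.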

SLIDE 19

Collocation Identification

Few different basic approaches

1 Cooccurrence statistics
2 Hypothesis tests
3 Association estimation
4 Information theory measures
5 Context similarity measures

SLIDE 20

Cooccurrence statistics

Joint probability: $P(xy)$

Conditional probability: $P(y|x)$

Reverse conditional probability: $P(x|y)$

Symmetric conditional probability: $P(y|x)P(x|y)$
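A minimal sketch of these statistics, assuming maximum-likelihood estimates from the a, b, c, d contingency cells (a: X=x,Y=y; b: X≠x,Y=y; c: X=x,Y≠y; d: X≠x,Y≠y). The function name and the input numbers are invented for the example.

```python
def cooccurrence_stats(a, b, c, d):
    """Maximum-likelihood estimates from a 2x2 contingency table."""
    n = a + b + c + d
    p_xy = a / n
    p_y_given_x = a / (a + c)  # f(xy) / f(x)
    p_x_given_y = a / (a + b)  # f(xy) / f(y)
    return {
        "joint": p_xy,
        "conditional": p_y_given_x,
        "reverse_conditional": p_x_given_y,
        "symmetric_conditional": p_y_given_x * p_x_given_y,
    }

stats = cooccurrence_stats(8, 4, 2, 6)  # toy table
```

The function assumes that x and y each occur at least once, otherwise the conditional estimates divide by zero.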

SLIDE 21

Hypothesis testing:

Null hypothesis: word occurrences are independent

$H_0: P(xy) = P(x)P(y)$

Expected frequencies for bigram xy:

          X=x                        X≠x
  Y=y     $E_{11} = R_1 C_1 / N$     $E_{12} = R_1 C_2 / N$
  Y≠y     $E_{21} = R_2 C_1 / N$     $E_{22} = R_2 C_2 / N$

SLIDE 22

Hypothesis testing cont.

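z-score: $\frac{O_{11} - E_{11}}{\sqrt{E_{11}}}$

t-score: $\frac{O_{11} - E_{11}}{\sqrt{O_{11}}}$

$\chi^2$ score: $\sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

log-likelihood: $2 \sum_{i,j} O_{ij} \log \frac{O_{ij}}{E_{ij}}$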
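The four scores can be computed directly from the observed table. The sketch below assumes the observed cells are passed as a tuple (O11, O12, O21, O22), uses the natural logarithm for the log-likelihood, and skips empty cells in that sum; these choices and the function names are assumptions of this example.

```python
import math

def expected(o11, o12, o21, o22):
    """Expected frequencies E_ij = R_i * C_j / N under independence."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)

def z_score(obs):
    e = expected(*obs)
    return (obs[0] - e[0]) / math.sqrt(e[0])

def t_score(obs):
    e = expected(*obs)
    return (obs[0] - e[0]) / math.sqrt(obs[0])

def chi_square(obs):
    e = expected(*obs)
    return sum((o - x) ** 2 / x for o, x in zip(obs, e))

def log_likelihood(obs):
    e = expected(*obs)
    # Cells with O_ij = 0 contribute nothing (o * log(o) tends to 0).
    return 2 * sum(o * math.log(o / x) for o, x in zip(obs, e) if o > 0)

obs = (8, 4, 2, 6)  # toy table, N = 20
```

For this toy table every expected frequency is positive; a table with an empty row or column would need separate handling before dividing by $E_{ij}$.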

SLIDE 23

Association estimation

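Russell-Rao: $\frac{a}{a+b+c+d}$

Sokal-Michener: $\frac{a+d}{a+b+c+d}$

Rogers-Tanimoto: $\frac{a+d}{a+2b+2c+d}$

Hamann: $\frac{(a+d)-(b+c)}{a+b+c+d}$

Sokal-Sneath 3rd: $\frac{b+c}{a+d}$

Jaccard: $\frac{a}{a+b+c}$

Kulczynski 1st: $\frac{a}{b+c}$

Sokal-Sneath 2nd: $\frac{a}{a+2(b+c)}$

Kulczynski 2nd: $\frac{1}{2}\left(\frac{a}{a+b} + \frac{a}{a+c}\right)$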
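Most of these coefficients are one-line functions of the contingency cells. The sketch below covers a subset of them; the dictionary keys and the function name are invented for the example, and the toy table is the same hypothetical one used for illustration elsewhere.

```python
def association_coefficients(a, b, c, d):
    """Association coefficients computed from a 2x2 contingency table."""
    n = a + b + c + d
    return {
        "russell_rao": a / n,
        "sokal_michener": (a + d) / n,
        "rogers_tanimoto": (a + d) / (a + 2 * b + 2 * c + d),
        "hamann": ((a + d) - (b + c)) / n,
        "jaccard": a / (a + b + c),
        "kulczynski_1": a / (b + c),
        "sokal_sneath_2": a / (a + 2 * (b + c)),
        "kulczynski_2": 0.5 * (a / (a + b) + a / (a + c)),
    }

scores = association_coefficients(8, 4, 2, 6)  # toy table
```

Coefficients with b+c in the denominator are undefined when the two words never occur apart, so a practical implementation would have to guard against division by zero.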

SLIDE 24

Information theory and context similarity measures

pointwise mutual information: $\log \frac{P(xy)}{P(x)P(y)}$

local mutual information: $N P(xy) \log \frac{P(xy)}{P(x)P(y)}$

Cross Entropy: $-\sum_{w \in C} P(w|x) \log_2 P(w|y)$

Intersection measure: $\frac{2|C_x \cap C_y|}{|C_x| + |C_y|}$

Euclidean norm: $\sqrt{\sum_{w \in C} (P(w|x) - P(w|y))^2}$

Cosine norm: $\frac{\sum_{w \in C} P(w|x) P(w|y)}{\sum_{w \in C} P(w|x)^2 \cdot \sum_{w \in C} P(w|y)^2}$

L1 norm: $\sum_{w \in C} |P(w|x) - P(w|y)|$
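A few of these measures are sketched below. The mutual information functions take contingency cells a, b, c, d and use a base-2 logarithm, which is an assumption (the slide leaves the base open). The context measures take dictionaries mapping context words to probabilities, a representation chosen only for this example. Cross entropy and the cosine norm are left out because they need decisions about smoothing and normalization that the slide does not settle.

```python
import math

def pmi(a, b, c, d):
    """Pointwise mutual information log2(P(xy) / (P(x) P(y)))."""
    n = a + b + c + d
    p_xy, p_x, p_y = a / n, (a + c) / n, (a + b) / n
    return math.log2(p_xy / (p_x * p_y))

def local_mi(a, b, c, d):
    """Local mutual information N * P(xy) * PMI, where N * P(xy) = a."""
    return a * pmi(a, b, c, d)

def l1_norm(px, py):
    words = set(px) | set(py)
    return sum(abs(px.get(w, 0.0) - py.get(w, 0.0)) for w in words)

def euclidean_norm(px, py):
    words = set(px) | set(py)
    return math.sqrt(sum((px.get(w, 0.0) - py.get(w, 0.0)) ** 2 for w in words))

def intersection_measure(cx, cy):
    """Overlap 2|Cx n Cy| / (|Cx| + |Cy|) of two context word sets."""
    cx, cy = set(cx), set(cy)
    return 2 * len(cx & cy) / (len(cx) + len(cy))

# Toy context distributions, invented for illustration.
px = {"a": 0.5, "b": 0.5}
py = {"b": 0.5, "c": 0.5}
```

Passing the dictionaries to `intersection_measure` compares only their key sets, that is, the sets of words observed in each context.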

SLIDE 26

Task setup

1 implementation of a toolkit for statistical analysis of word cooccurrences
2 collecting of basic methods for collocation extraction

1 implementation of the basic methods
2 evaluation of the basic methods
3 experiments with advanced methods

SLIDE 27

Toolkit

  • fully functional prototype implementation in Perl
  • Input: plain text / morphological level / analytical level
  • Output: collocation candidates with values of all specified measures and scores

SLIDE 28

Word Base Forms

  • Full word forms: too specific (morphology)
  • Lemmas: too general (losing semantic information)
  • Solution: lemmas with a subset of morphological tags

<f>nenahraditelná<l>nahraditelný (*4)<t>AAFS1----1N----<r>8<g>7
↓ ↓ ↓ ↓↓
nahraditelný (*4) A F 1N
⇓
<f>nahraditelný (*4)<t>A*F1N</f>
⇓
nenahraditelná
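The tag reduction in this example could look roughly like the sketch below. It is a guess read off the single example on the slide: it assumes the kept positions of the 15-character tag are part of speech, gender, degree and negation (0-based indices 0, 2, 9, 10) and that the second position is replaced by `*`. The function name is hypothetical and the rule may differ from what the toolkit actually does.

```python
def reduce_tag(tag):
    """Reduce a 15-position morphological tag to a short subset.

    ASSUMPTION: positions 0 (part of speech), 2 (gender), 9 (degree) and
    10 (negation) are kept and the sub-POS slot becomes '*'. This mapping
    is inferred from the slide example, not from a specification.
    """
    if len(tag) != 15:
        raise ValueError("expected a 15-position tag")
    return tag[0] + "*" + tag[2] + tag[9] + tag[10]

reduced = reduce_tag("AAFS1----1N----")
```

Dropping number and case while keeping negation is what lets the forms of the negated adjective share one base form that stays distinct from the non-negated lemma.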

SLIDE 29

Data

Prague Dependency Treebank
  • base form types: 66 662
  • bigram types: 306 845
  • experiments performed on dependency bigrams with frequency > 5: 21 595
  • all these collocation candidates manually evaluated ...

SLIDE 30

Evaluation Data

All dependency bigrams with frequency > 5 classified into 6 groups:

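  Group   Count   Examples
  5       7       kámen úrazu (stumbling block), slepá ulička (dead end), železná opona (Iron Curtain)
  4       201     bílý dům (White House), černý trh (black market), poslední slovo (last word), pata kolmice (foot of a perpendicular)
  3       2460    šifrovací klíč (encryption key), atomová energie (atomic energy), Baník Ostrava (proper name)
  2       443     dávat přednost (to give preference), minulé století (last century), starosta města (town mayor)
  1       484     na Slovensko (to Slovakia), do Portugalska (to Portugal)
          18002   (non-collocations)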

SLIDE 31

Evaluation Data

All dependency bigrams with frequency > 5 classified into 6 groups:

Groups 5 to 1 together: 3595
  5   kámen úrazu, slepá ulička, železná opona
  4   bílý dům, černý trh, poslední slovo, pata kolmice
  3   šifrovací klíč, atomová energie, Baník Ostrava
  2   dávat přednost, minulé století, starosta města
  1   na Slovensko, do Portugalska

(non-collocations): 18002

SLIDE 32

Evaluation Data

All dependency bigrams with frequency > 5 classified into 6 groups:

Groups 5 to 3 together: 2668
  5   kámen úrazu, slepá ulička, železná opona
  4   bílý dům, černý trh, poslední slovo, pata kolmice
  3   šifrovací klíč, atomová energie, Baník Ostrava

Groups 2, 1 and (non-collocations) together: 18929
  2   dávat přednost, minulé století, starosta města
  1   na Slovensko, do Portugalska
      (non-collocations)

SLIDE 33

Basic Methods

Pattern filtering
  • Part of Speech pattern
  • Dependency pattern

Association measures and scores
  • Cooccurrence statistics
  • Likelihood measures
  • Hypothesis testing
  • Association estimation
  • Information theory measures
  • Context similarity measures

SLIDE 34

Evaluation: trading recall for precision

Precision: $P = \frac{\#\text{selected collocations}}{\#\text{selected bigrams}} \in \langle 0, 1 \rangle$

Recall: $R = \frac{\#\text{selected collocations}}{\#\text{all collocations}} \in \langle 0, 1 \rangle$
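The trade-off can be traced by ranking the candidates by one score and computing precision and recall for every cut-off. This is a sketch under the assumption that "selected" means the top k bigrams of the ranking; the scores and the gold set below are invented.

```python
def precision_recall_curve(scored, gold):
    """scored: list of (bigram, score); gold: set of true collocations.
    Returns [(k, precision, recall)] for the top-k bigrams, k = 1..len(scored)."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    curve, hits = [], 0
    for k, (bigram, _) in enumerate(ranked, start=1):
        if bigram in gold:
            hits += 1
        curve.append((k, hits / k, hits / len(gold)))
    return curve

# Hypothetical scores for four candidates; two of them are gold collocations.
scored = [
    ("černý trh", 3.2),
    ("na Slovensko", 0.4),
    ("železná opona", 2.7),
    ("do Portugalska", 0.9),
]
gold = {"černý trh", "železná opona"}
curve = precision_recall_curve(scored, gold)
```

Selecting more bigrams can only keep or raise recall, while precision usually drops once the true collocations near the top of the ranking are used up.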

SLIDE 35

Recall and precision

Example: Mutual Information

[Plot: precision (%) and recall (%) curves for measure 11, Mutual information]

$\log_2 \frac{P(xy)}{P(x)P(y)}$

SLIDE 36

Recall and precision

Example: t score

[Plot: precision (%) and recall (%) curves for measure 19, t test]

$\frac{O_{11} - E_{11}}{\sqrt{O_{11}}}$

SLIDE 37

Recall and precision

Example: z score

[Plot: precision (%) and recall (%) curves for measure 20, Z score]

$\frac{O_{11} - E_{11}}{\sqrt{E_{11}}}$

SLIDE 38

Evaluation results

Overview

[Plot: precision (%) and recall (%) curves, overview]

SLIDE 39

Advanced Methods: motivation

Example: Mutual Information vs. Cosine context similarity

[Scatter plot: measure 11 (Mutual Information) against measure 85 (Cosine similarity); points labeled Negative and Positive]

SLIDE 40

Advanced Methods: idea

Statistical learning problem
  • for each bigram we get a set of features (categories, scores, etc.): $x_i = (x_1, x_2, \ldots, x_{90})$
  • each bigram we want to classify as a collocation or a non-collocation: $f(x_i) = y_i$, $y_i \in \{0, 1\}$
  • so we are looking for a function that minimizes a risk functional: $\min \sum_i Q(f(x_i), y_i)$

SLIDE 41

Advanced Methods: idea cont.

Statistical learning problem
  • but classification might be hard, what about regression? $f(x_i) = y_i$, $y_i \in \langle 0, 1 \rangle$
  • Isn’t $\mathbb{R}^{90}$ too much? What about feature selection? Yes!
  • And how to do it?

Linear discriminant
General linear models - logistic regression
Neural networks
Support vector machines
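To make the idea concrete, here is a small logistic regression trained by batch gradient descent in plain Python. It is a toy sketch, not the setup used in the experiments: the two features, their values, the learning rate and the number of epochs are all invented, whereas the slides work with 90 features per bigram.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, xi):
    """Estimated probability that the bigram is a collocation."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)

# Toy data: two invented association scores per bigram; 1 = collocation.
X = [[3.1, 0.9], [2.7, 0.8], [2.9, 0.7], [0.4, 0.2], [0.9, 0.1], [0.5, 0.3]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(X, y)
probs = [predict(w, b, xi) for xi in X]
```

The model outputs a value between 0 and 1, so sorting bigrams by it gives a ranking that can be evaluated with precision and recall in the same way as a single association measure.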

SLIDE 42

Result

Support Vector Machines

[Plot: precision (%) and recall (%) for Support Vector Machines]

SLIDE 44

Conclusion

Achieved results
  • implementation and evaluation of basic methods for collocation extraction
  • promising results with advanced methods

Future work
  • experiments with advanced methods
  • evaluation of advanced methods
  • experiments on English data

SLIDE 45

Used tools and toolkits

R-project
  • a language and environment for statistical computing and graphics
  • extremely powerful
  • GNU GPL license
  • www.r-project.org

Torch
  • machine learning library
  • C++, BSD license
  • www.torch.ch
