Lexical Association Measures: Collocation Extraction (Pavel Pecina)


Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions

Lexical Association Measures

Collocation Extraction

Pavel Pecina

pecina@ufal.mff.cuni.cz
Institute of Formal and Applied Linguistics, Charles University, Prague
DCU, Dublin, September 21, 2009


Talk outline

  • 1. Introduction
  • 2. Collocation extraction
  • 3. Lexical association measures
  • 4. Reference data
  • 5. Empirical evaluation
  • 6. Combining association measures
  • 7. Conclusions

Lexical association


Semantic association

◮ reflects a semantic relationship between words
◮ synonymy, antonymy, hyponymy, meronymy, etc.

➔ stored in a thesaurus

sick – ill, baby – infant, dog – cat

Cross-language association

◮ corresponds to potential translations of words between languages
◮ translation equivalents

➔ stored in a dictionary

maison (FR) – house (EN), Baum (GE) – tree (EN), květina (CZ) – flower (EN)

Collocational association

◮ restricts the combination of words into phrases (beyond grammar!)
◮ collocations / multiword expressions

➔ stored in a lexicon

crystal clear, cosmetic surgery, cold war


Measuring lexical association


Motivation

◮ automatic acquisition of associated words (into a lexicon/thesaurus/dictionary)

Tool: Lexical association measures

◮ mathematical formulas determining the strength of association between two (or more) words based on their occurrences and cooccurrences in a corpus

Applications

◮ lexicography, natural language generation, word sense disambiguation
◮ bilingual word alignment, identification of translation equivalents
◮ information retrieval, cross-lingual information retrieval
◮ keyword extraction, named entity recognition
◮ syntactic constituent boundary detection
◮ collocation extraction


Goals, objectives, and limitations


Goal

◮ application of lexical association measures to collocation extraction

Objectives

  • 1. to compile a comprehensive inventory of lexical association measures
  • 2. to build reference data sets for collocation extraction
  • 3. to evaluate the lexical association measures on these data sets
  • 4. to explore the possibility of combining these measures into more complex models and advance the state of the art in collocation extraction

Limitations

✓ focus on bigram (two-word) collocations

(limited scalability to higher-order n-grams; limited corpus size)

✓ binary (two-class) discrimination only (collocation/non-collocation)


Collocational association


Collocability

◮ the ability of words to combine with other words in text
◮ governed by a system of rules and constraints: syntactic, semantic, pragmatic
◮ must be adhered to in order to produce correct, meaningful, fluent utterances
◮ ranges from free word combinations to idioms
◮ specified intensionally (general rules) or extensionally (particular constraints)

Collocations

◮ word combinations with extensionally restricted collocability
◮ should be listed in a lexicon and learned in the same way as single words

Types of collocations

  • 1. idioms (to kick the bucket, to hear st. through the grapevine)
  • 2. proper names (New York, Old Town, Vaclav Havel)
  • 3. technical terms (car oil, stock owl, hard disk)
  • 4. phrasal verbs (to switch off, to look after)
  • 5. light verb compounds (to take a nap, to do homework)
  • 6. lexically restricted expressions (strong tea, broad daylight)

Collocation properties


Semantic non-compositionality

◮ exact meaning cannot be (fully) inferred from the meaning of components

to kick the bucket

Syntactic non-modifiability

◮ syntactic structure cannot be freely modified (word order, word insertions etc.)

poor as a church mouse vs. poor as a *big church mouse

Lexical non-substitutability

◮ components cannot be substituted by synonyms or other words

stiff breeze vs. *stiff wind

Translatability into other languages

◮ translation cannot generally be performed blindly, word by word

ice cream – zmrzlina

Domain dependency

◮ collocational character only in specific domains

carriage return


Collocation extraction


Task

◮ to extract a list of collocations (types) from a text corpus
◮ no need to identify particular occurrences (instances) of collocations

Methods

◮ based on extraction principles verifying characteristic collocation properties, i.e. hypotheses about word occurrences and cooccurrences in the corpus
◮ formulated as lexical association measures
◮ compute an association score for each collocation candidate from the corpus
◮ the scores indicate the chance of a candidate being a collocation

Extraction principles

  • 1. “Collocation components occur together more often than by chance”
  • 2. “Collocations occur as units in an information-theoretically noisy environment”
  • 3. “Collocations occur in different contexts to their components”

Extraction principle I


“Collocation components occur together more often than by chance”

◮ the corpus is interpreted as a sequence of randomly generated words
◮ word (marginal) probability ML estimate: p(x) = f(x)/N
◮ bigram (joint) probability ML estimate: p(xy) = f(xy)/N
◮ the chance ∼ the null hypothesis of independence, H0: p̂(xy) = p(x) · p(y)

AM: Log-likelihood ratio, χ² test, Odds ratio, Jaccard, Pointwise mutual information

Example: Pointwise Mutual Information

Data: f(iron curtain) = 11, f(iron) = 30, f(curtain) = 15
MLE: p(iron curtain) = 0.000007, p(iron) = 0.000020, p(curtain) = 0.000010
H0: p̂(iron curtain) = p(iron) · p(curtain) = 0.0000000002 (expected frequency f̂(iron curtain) = 0.0003)
AM: PMI(iron curtain) = log₂ (p(xy) / p̂(xy)) = log₂ (0.000007 / 0.0000000002) ≈ 15.1
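The PMI computation can be reproduced from the raw counts. This is a minimal sketch; the corpus size N = 1,500,000 tokens is an assumption (it is consistent with the frequency/probability ratios above), and the exact score depends on rounding:

```python
import math

# Counts from the "iron curtain" example; N is an assumed corpus size
# consistent with the rounded probability estimates.
N = 1_500_000
f_xy, f_x, f_y = 11, 30, 15

p_xy = f_xy / N              # joint ML estimate p(xy)
p_x, p_y = f_x / N, f_y / N  # marginal ML estimates p(x), p(y)
p_null = p_x * p_y           # H0: expected joint probability p(x) * p(y)

# PMI in bits: log2 of observed vs. expected cooccurrence probability
pmi = math.log2(p_xy / p_null)
print(round(pmi, 2))  # ≈ 15.16 with unrounded estimates
```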


Extraction principle II


“Collocations occur as units in an information-theoretically noisy environment”

◮ the corpus is again interpreted as a sequence of randomly generated words
◮ at each point of the sequence we estimate:

  • 1. the probability distribution of words occurring after/before it: p(w|C^r_xy), p(w|C^l_xy)
  • 2. the uncertainty (entropy) of what the next/previous word is: H(p(w|C^r_xy)), H(p(w|C^l_xy))

◮ points with high uncertainty are likely to be collocation boundaries
◮ points with low uncertainty are likely to be located within a collocation

AM: Left context entropy, Right context entropy

Example: H(p(w|C^r_xy))

Český kapitálový trh dnes ovlivnil pokles cen všech cenných papírů a zejména akcií.
(“The Czech capital market was today affected by a drop in the prices of all securities, especially shares.”)
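Right context entropy can be sketched as follows. This is a toy illustration; the function name and the miniature corpus are invented for the example:

```python
import math
from collections import Counter

def right_context_entropy(tokens, x, y):
    """Entropy (in bits) of the distribution of words that immediately
    follow occurrences of the bigram (x, y) in a token sequence."""
    followers = Counter(tokens[i + 2]
                        for i in range(len(tokens) - 2)
                        if tokens[i] == x and tokens[i + 1] == y)
    n = sum(followers.values())
    return -sum(c / n * math.log2(c / n) for c in followers.values())

# Toy corpus: the bigram is followed by a different word each time,
# so uncertainty after it is high (a likely collocation boundary).
corpus = ("the black market grew the black market shrank "
          "the black market collapsed").split()
h = right_context_entropy(corpus, "black", "market")
# three distinct followers, each seen once -> H = log2(3) ≈ 1.585
```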


Extraction principle III


“Collocations occur in different contexts to their components”

◮ non-compositionality: the meaning of a collocation must differ from the union of the meanings of its components
◮ meanings are modeled by empirical contexts: a bag of words occurring within a specified context window around a word or an expression
◮ the more the contexts of an expression differ from the contexts of its components, the higher the chance that the expression is a collocation

AM: J-S divergence, K-L divergence, Skew divergence, Cosine similarity in vector space

Example: Cxy, Cx

[Concordance lines from a Czech corpus, contrasting the empirical contexts of the bigram “černý trh” (“black market”, Cxy) with the contexts of its component “trh” (“market”, Cx).]
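The context-comparison idea behind these measures can be sketched with cosine similarity over bag-of-words context vectors. This is a minimal sketch under stated assumptions: the window size, helper names, and toy corpus are all illustrative:

```python
import math
from collections import Counter

def context_vector(tokens, target, window=3):
    """Bag of words within `window` tokens around each occurrence of
    `target` (a tuple of one or more consecutive tokens)."""
    n = len(target)
    ctx = Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == target:
            ctx.update(tokens[max(0, i - window):i])  # left context
            ctx.update(tokens[i + n:i + n + window])  # right context
    return ctx

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * b[w] for w, c in a.items())
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# Low similarity between the contexts of an expression and those of
# its components hints at non-compositionality, i.e. a collocation.
corpus = ("police raided the black market for smuggled goods while "
          "the stock market rallied on open trading").split()
sim = cosine(context_vector(corpus, ("black", "market")),
             context_vector(corpus, ("market",)))
```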


Inventory of lexical association measures


Extraction pipeline


  • 1. linguistic preprocessing (morphological and syntactic level)
  • 2. identification of collocation candidates (dependency/surface/distance bigrams)
  • 3. extraction of occurrence and cooccurrence statistics (frequency, contexts)
  • 4. filtering the candidates to improve precision (POS patterns)
  • 5. application of a chosen lexical association measure
  • 6. ranking/classification of collocation candidates according to their scores

Ranking

red cross             15.66
decimal point         14.01
arithmetic operation  10.52
paper feeder          10.17
system type            3.54
and others             0.54
program in             0.35
level is               0.25

Classification

red cross             1
decimal point         1
arithmetic operation  1
paper feeder          1
system type           0
and others            0
program in            0
level is              0
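Steps 2 to 6 of the pipeline can be sketched end to end. This is a toy illustration: the POS tags, patterns, and thresholds are invented, and raw frequency stands in for a real association measure:

```python
from collections import Counter

# Toy POS-tagged text; tags, patterns, and thresholds are illustrative.
tagged = [("red", "A"), ("cross", "N"), ("helps", "V"), ("the", "D"),
          ("red", "A"), ("cross", "N"), ("and", "C"), ("a", "D"),
          ("hard", "A"), ("disk", "N")]

patterns = {("A", "N"), ("N", "N")}  # POS-pattern filter (step 4)
min_freq = 2                         # frequency filter

# steps 2-3: surface-bigram candidates with cooccurrence counts
bigrams = Counter(f"{w1} {w2}"
                  for (w1, p1), (w2, p2) in zip(tagged, tagged[1:])
                  if (p1, p2) in patterns)
candidates = {bg: f for bg, f in bigrams.items() if f >= min_freq}

# steps 5-6: score each candidate (frequency as a stand-in measure),
# rank by score, and classify against a threshold
scores = dict(candidates)
ranking = sorted(scores, key=scores.get, reverse=True)
classified = {bg: int(s >= min_freq) for bg, s in scores.items()}
```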


Reference data set


Source corpus

◮ Prague Dependency Treebank 2.0, 1.5 mil. tokens
◮ manually annotated at the morphological and analytical levels

Collocation candidates

◮ dependency bigrams: direct dependency relation between components
◮ morphological normalization (lemma proper + POS + gender + degree + negation)
◮ part-of-speech filter (A:N, N:N, V:N, R:N, C:N, N:V, N:C, D:A, N:A, D:V, N:T, N:D, D:D)
◮ frequency filter (minimal frequency required, f > 5)

Annotation

◮ three independent parallel annotations (no context; full agreement required)
◮ 6 categories, merged into two: collocations (1-5), non-collocations (0):

  • 5. idiomatic expressions
  • 4. technical terms
  • 3. support verb constructions
  • 2. proper names
  • 1. frequent unpredictable usages
  • 0. non-collocations

◮ 12 232 candidates = 2 557 true collocations + 9 675 true non-collocations
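These counts imply a useful reference point: labeling every candidate as a collocation would reach about 20.9% precision, which any evaluated measure should beat. A quick arithmetic check:

```python
# Reference data set statistics from the slide.
collocations, non_collocations = 2_557, 9_675

candidates = collocations + non_collocations    # 12 232 candidates
baseline_precision = collocations / candidates  # label-everything baseline
print(candidates, round(baseline_precision, 3))  # prints "12232 0.209"
```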

slide-36
SLIDE 36


Experimental design

13/30

Reference data

◮ split into 7 stratified folds of the same size (the same ratio of true collocations)
◮ 1 fold put aside as held-out data
◮ 6 folds used for evaluation of AMs

eval1 eval2 eval3 eval4 eval5 eval6 held-out

Evaluation

◮ based on quality of ranking (ranking performance)
◮ evaluation measures estimated on each eval fold separately and averaged

Significance testing

◮ methods compared by the paired Wilcoxon signed-rank test on the 6 eval folds
◮ significance level α = 0.05


slide-46
SLIDE 46


Evaluation measures: Precision – Recall

14/30

1) Precision = |correctly classified collocations| / |total classified as collocations|

   Recall = |correctly classified collocations| / |total collocations|

Example ranking with classification (here all 16 candidates classified as collocations) and precision/recall at each rank:

bigram                  score   class   Precision   Recall
red cross               15.66     1       100 %       12 %
iron curtain            15.23     1       100 %       25 %
decimal point           14.01     1       100 %       37 %
coupon book             13.83     1       100 %       50 %
book author             11.05     1        80 %       50 %
arithmetic operation    10.52     1        83 %       62 %
paper feeder            10.17     1        85 %       75 %
new book                10.09     1        75 %       75 %
round table              7.03     1        77 %       87 %
new wave                 6.59     1        70 %       87 %
gas station              6.04     1        72 %      100 %
system type              3.54     1        66 %      100 %
central part             1.54     1        61 %      100 %
and others               0.54     1        57 %      100 %
program in               0.35     1        53 %      100 %
level is                 0.25     1        50 %      100 %

◮ measured within the entire interval of possible threshold values
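The precision/recall columns above can be reproduced by sweeping the threshold down the ranking; the binary label vector below is chosen to be consistent with the precision/recall values of the example (1 = true collocation).

```python
def precision_recall_curve(labels):
    """Precision and recall after accepting the top-k candidates, for every k."""
    total_true = sum(labels)
    curve, hits = [], 0
    for k, y in enumerate(labels, start=1):
        hits += y
        curve.append((hits / k, hits / total_true))
    return curve

# Gold labels of the example ranking, top to bottom (1 = true collocation),
# recovered from the precision/recall values on the slide.
labels = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
curve = precision_recall_curve(labels)
# curve[0] -> (1.0, 0.125): 100 % precision, 12 % recall after the top candidate
```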


slide-48
SLIDE 48


Visual evaluation: Precision-Recall curves

15/30

◮ graphical plots of recall vs. precision
◮ the closer to the top and right, the better the ranking performance
◮ estimated for each eval fold and vertically averaged

Precision-Recall curve averaging

[Plot: precision vs. recall; unaveraged curves for the six eval folds and the vertically averaged curve]

slide-49
SLIDE 49


Evaluation results: Precision-Recall curves

16/30

The best-performing association measures

[Plot: averaged precision vs. recall for the best-performing measures]

◮ Pointwise mutual information (4)
◮ Pearson's χ² test (10)
◮ z score (13)
◮ Unigram subtuple measure (39)
◮ Cosine context similarity in boolean vector space (77)


slide-53
SLIDE 53


Evaluation measure: Average Precision

17/30

2) Average Precision: E[P(R)], R ∼ U(0, 1)

   AP = (1/r) · Σ_{i=1}^{r} p_i

   (p_i = precision at the rank of the i-th true collocation, r = number of true collocations)

(same example ranking, classification, and precision/recall values as above)

3) Mean Average Precision: E[AP]

   MAP = (1/6) · Σ_{i=1}^{6} AP_i  (averaged over the 6 eval folds)

AP = 89.6 % (for the example above)
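The AP estimator applied to the same example ranking (labels recovered from the precision/recall columns). Averaging the exact per-rank precisions gives ≈ 89.9 %; the slide's 89.6 % corresponds to averaging the rounded percentages shown in the table.

```python
def average_precision(labels):
    """AP = (1/r) * sum of precisions measured at the rank of each true collocation."""
    hits, precisions = 0, []
    for k, y in enumerate(labels, start=1):
        if y:
            hits += 1
            precisions.append(hits / k)  # precision at the rank of this hit
    return sum(precisions) / hits

# Gold labels of the example ranking (1 = true collocation, r = 8 of them).
labels = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
ap = average_precision(labels)  # ~0.899 unrounded
```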

slide-54
SLIDE 54


Overall results: Mean Average Precision

18/30

MAP of all lexical association measures in descending order

[Bar chart: Mean Average Precision (0.0–0.8) of all 82 measures, IDs in descending MAP order: 77 39 80 38 32 13 10 31 30 37 5 42 27 28 29 4 63 16 23 22 24 45 33 7 21 18 19 20 43 34 6 54 9 76 50 82 48 3 8 59 44 66 61 73 71 26 70 25 15 14 72 74 11 69 53 52 49 35 41 68 55 64 40 47 65 81 75 46 56 12 78 2 60 79 51 36 58 62 57 1 17 67]

◮ Baseline (ratio of true collocations): 21.02 %
◮ Best context-based measure (■): Cosine similarity in vector space: 66.79 %
◮ Best statistical association measure (■): Unigram subtuple measure: 66.72 %
◮ the 16 best measures have statistically indistinguishable MAP ∼ the current state of the art



slide-61
SLIDE 61


Combining association measures

19/30

Motivation

◮ different association measures discover different groups/types of collocations
◮ existence of uncorrelated association measures

5 % data sample from PDT-Dep

[Scatter plot: Pointwise mutual information vs. Cosine context similarity in boolean vector space; collocations and non-collocations separated by a linear discriminant]

Note: all methods so far have been unsupervised; the combination methods are supervised.

slide-62
SLIDE 62


Combination models

20/30

Framework

◮ each collocation candidate xi is described by a feature vector xi = (xi1, . . . , xi82)^T consisting of the scores of all 82 association measures
◮ and assigned a label yi ∈ {0, 1} indicating whether the bigram is considered to be a true collocation (yi = 1) or not (yi = 0)
◮ we look for a ranker function f(xi) determining the strength of lexical association between the components of a candidate xi
◮ e.g. a linear combination of association scores: f(xi) = w0 + w1·xi1 + . . . + w82·xi82

Methods

  • 1. Linear logistic regression
  • 2. Linear discriminant analysis
  • 3. Support vector machines
  • 4. Neural networks

◮ in the training phase used as regular classifiers on two-class data
◮ in the application phase no classification threshold is applied
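A minimal sketch of method 1 (linear logistic regression) used as a ranker: fit on labelled two-class data, then use the raw linear score f(x), with no threshold, to rank candidates. Two toy AM scores stand in for the 82-dimensional feature vectors; all numbers are illustrative.

```python
import math
import random

# Toy labelled candidates: ([AM score 1, AM score 2], label); illustrative data.
random.seed(0)
data = [([15.7, 0.9], 1), ([14.0, 0.8], 1), ([0.5, 0.1], 0), ([0.3, 0.2], 0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=2000, lr=0.1):
    """Fit logistic regression by stochastic gradient ascent on the log-likelihood."""
    w = [0.0, 0.0, 0.0]  # w0 (bias), w1, w2
    for _ in range(epochs):
        for x, y in data:
            err = y - sigmoid(w[0] + w[1] * x[0] + w[2] * x[1])
            w[0] += lr * err
            w[1] += lr * err * x[0]
            w[2] += lr * err * x[1]
    return w

w = train(data)

def rank_score(x):
    """f(x) = w0 + w1*x1 + w2*x2 -- used directly for ranking, no threshold."""
    return w[0] + w[1] * x[0] + w[2] * x[1]
```

In the application phase only the ordering induced by rank_score matters, which is why no classification threshold is needed.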


slide-65
SLIDE 65


Combination models: Evaluation

21/30

Evaluation scheme

◮ 6-fold cross-validation on the 6 evaluation folds
◮ 5 folds for training (fitting parameters), 1 fold for testing (ranking performance)
◮ PR curve and AP score estimated on each test fold and averaged

train1 train2 train3 train4 train5 test6 held-out

Results: Mean Average Precision

method                              MAP      +%
Unigram subtuple measure            66.72     –
Cosine similarity in vector space   66.79    0.00
Support Vector Machine              73.03    9.35
Neural Network (1 unit)             74.88   12.11
Linear Discriminant Analysis        75.16   12.54
Linear Logistic Regression          77.36   15.82
Neural Network (5 units)            80.87   21.08


slide-67
SLIDE 67


Results: Precision-Recall curves

22/30

Combination methods compared with the best association measures

[Plot: average precision vs. recall for the combination methods and the two best individual measures]

◮ Neural network (5 units)
◮ Linear logistic regression
◮ Support vector machine (linear)
◮ Linear discriminant analysis
◮ Neural network (1 unit)
◮ Cosine context similarity in boolean vector space (77)
◮ Unigram subtuple measure (39)

slide-68
SLIDE 68


Learning curve analysis

23/30

Neural network (5 units) learning curve

[Plot: mean average precision (0.50–0.80) vs. training data size (20–100 %)]

◮ 100 % of training data = 5 training folds (8 737 annotated collocation candidates)
◮ 95 % of the final MAP achieved with 15 % of the training data
◮ 99 % of the final MAP achieved with 50 % of the training data

slide-69
SLIDE 69


Adding linguistic features

24/30

Idea

◮ improving the combination models by adding linguistic features
◮ categorical features can be transformed to binary dummy features

New features

◮ Part-of-Speech pattern: combination of component POS (A:N, N:N, . . .)
◮ Syntactic relation: dependency type (attribute, object, . . .)

Results: Mean Average Precision

method                              MAP      +%
Unigram subtuple measure            66.72     –
Cosine similarity in vector space   66.79    0.00
NNet/5 (AM)                         80.87   21.08
NNet/5 (AM+POS)                     82.79   24.09
NNet/5 (AM+POS+DEP)                 84.53   26.69
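The dummy-feature transformation can be sketched as follows; the POS-pattern and dependency-type inventories are illustrative subsets of those listed above, and the resulting binary indicators are simply appended to the association-score vector.

```python
# Illustrative subsets of the talk's category inventories.
POS_PATTERNS = ["A:N", "N:N", "V:N", "R:N"]
DEP_TYPES = ["attribute", "object"]

def one_hot(value, inventory):
    """Turn one categorical value into binary dummy features."""
    return [1.0 if value == v else 0.0 for v in inventory]

def feature_vector(am_scores, pos_pattern, dep_type):
    """Association scores extended with dummy-encoded linguistic features."""
    return am_scores + one_hot(pos_pattern, POS_PATTERNS) + one_hot(dep_type, DEP_TYPES)

x = feature_vector([15.66, 0.91], "A:N", "attribute")
# -> [15.66, 0.91, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
```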


slide-72
SLIDE 72


Model reduction

25/30

Motivation

◮ “Occam’s razor”
◮ a combination of all 82 association measures is too complex
◮ models should be reduced: redundant variables removed

Two issues

  • 1. groups of highly correlated measures
  • 2. measures with no or minimal contribution to the model

Two-step solution

  • 1. correlation based clustering; one representative selected from each cluster
  • 2. step-wise procedure removing variables one by one

slide-77
SLIDE 77


Model reduction: 1) Clustering

26/30

Agglomerative hierarchical clustering

◮ groups measures with the same or similar contribution to the model
◮ begins with each measure as a separate cluster and merges them into successively larger clusters
◮ distance metric = 1 − |Pearson's correlation| (estimated on the held-out fold)

[Dendrogram: the 82 association measures grouped by correlation distance]

◮ number of final clusters empirically set to 60
◮ the best performing measure (by MAP on the held-out fold) selected as the representative of each cluster
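The clustering step (agglomerative linkage over the 1 − |Pearson's correlation| distance, cut into 60 clusters) can be sketched as below. The `scores` matrix is a made-up stand-in for the real per-candidate measure scores; in the thesis the correlations are estimated on the held-out fold.

```python
# Illustrative sketch, not the author's code: correlation-based
# clustering of association measures. `scores` holds each measure's
# score for every collocation candidate (candidates x measures).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
scores = rng.normal(size=(500, 82))        # placeholder for real measure scores

corr = np.corrcoef(scores, rowvar=False)   # 82 x 82 Pearson correlations
dist = 1.0 - np.abs(corr)                  # distance = 1 - |correlation|

# scipy's agglomerative linkage expects a condensed distance vector
iu = np.triu_indices_from(dist, k=1)
Z = linkage(dist[iu], method="average")
labels = fcluster(Z, t=60, criterion="maxclust")  # cut into at most 60 clusters
```

The representative of each resulting cluster would then be the member with the highest held-out MAP.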

slide-80
SLIDE 80


Model reduction: 2) Stepwise variable removal

27/30

Iterative procedure

◮ initialized with the 60 variables/measures
◮ in each iteration, the variable whose removal causes the smallest performance degradation (by MAP on the held-out fold) is removed from the model
◮ stops before the degradation becomes statistically significant

[Plot: held-out and test mean average precision as the number of variables decreases from 60 to 1]

◮ the final model contains 13 variables/lexical association measures
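The iterative procedure can be sketched as greedy backward elimination. Here `evaluate` and the stopping threshold `max_drop` are illustrative stand-ins for the MAP-based held-out evaluation and the significance test used in the thesis:

```python
# Hypothetical sketch of step-wise variable removal, assuming
# evaluate(variables) returns held-out MAP for a model trained
# on the given subset of measures.
def backward_elimination(variables, evaluate, max_drop=0.005):
    current = list(variables)
    best_map = evaluate(current)
    while len(current) > 1:
        # try removing each variable; keep the least harmful removal
        candidates = [(evaluate([v for v in current if v != u]), u)
                      for u in current]
        new_map, victim = max(candidates)
        if best_map - new_map > max_drop:  # stop before significant degradation
            break
        current.remove(victim)
        best_map = new_map
    return current
```

With a toy `evaluate` that only rewards two informative variables, the uninformative ones are removed first and the procedure stops once removing anything else would hurt.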

slide-81
SLIDE 81


Model reduction: Process overview

28/30

MAP of individual lexical association measures

[Bar chart: MAP of the individual lexical association measures, ordered by performance]

◮ procedure initiated with all 82 association measures
◮ highly correlated measures removed in the first phase (clustering)
◮ 13 measures left after the second phase (stepwise removal):
  4 statistical association measures (■) + 9 context-based measures (■)

slide-84
SLIDE 84


Model reduction results: Precision-Recall curves

29/30

Reduced combination models compared with the best association measures

[Precision-recall curves: NNet (5 units) with 82, 47, 13, and 7 variables; cosine context similarity in boolean vector space (77); unigram subtuple measure (39)]
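One of the best single measures in this comparison, cosine context similarity in a boolean vector space, reduces to a short computation: each word's context is encoded as a 0/1 vector over the vocabulary and the two vectors are compared by cosine. The vocabulary and contexts below are invented for illustration:

```python
# Minimal sketch: cosine similarity of two word contexts in a
# boolean vector space (1 = word occurs in the context, 0 = not).
import math

def boolean_cosine(ctx_a, ctx_b, vocab):
    a = [1 if w in ctx_a else 0 for w in vocab]
    b = [1 if w in ctx_b else 0 for w in vocab]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(a)) * math.sqrt(sum(b))
    return dot / norm if norm else 0.0
```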

slide-85
SLIDE 85

Conclusions

30/30

Main results

  • 1. inventory of 82 lexical association measures
  • 2. 4 reference data sets
  • 3. all lexical association measures evaluated on these data sets
  • 4. combining association measures improved state of the art in collocation extraction
  • 5. combination models reduced to 13 measures without performance degradation

Other contributions of the thesis

◮ overview of different notions of collocation (definitions, typology, classification)
◮ evaluation scheme (average precision, cross-validation, significance tests)
◮ reference data sets used in the MWE 2008 Shared Task
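The core of the evaluation scheme, average precision of a ranked candidate list, can be computed as in this minimal sketch (mean average precision is then the mean of this quantity over lists/folds):

```python
# Average precision of one ranked list: precision is sampled at the
# rank of every true collocation and averaged over the true ones.
def average_precision(ranked_labels):
    hits, total = 0, 0.0
    for rank, is_true in enumerate(ranked_labels, start=1):
        if is_true:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```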

slide-87
SLIDE 87


List of relevant publications

Pavel Pecina: Lexical Association Measures and Collocation Extraction. Multiword Expressions: Hard Going or Plain Sailing? Special issue of the International Journal of Language Resources and Evaluation, Springer, 2009 (accepted).

Pavel Pecina: Lexical Association Measures: Collocation Extraction. PhD thesis, Charles University, Prague, Czech Republic, 2008.

Pavel Pecina: Machine Learning Approach to Multiword Expression Extraction. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC) Workshop: Towards a Shared Task for Multiword Expressions, Marrakech, Morocco, 2008.

Pavel Pecina: Reference Data for Czech Collocation Extraction. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC) Workshop: Towards a Shared Task for Multiword Expressions, Marrakech, Morocco, 2008.

Pavel Pecina, Pavel Schlesinger: Combining Association Measures for Collocation Extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL), Sydney, Australia, 2006.

Silvie Cinková, Petr Podveský, Pavel Pecina, Pavel Schlesinger: Semi-automatic Building of Swedish Collocation Lexicon. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), Genova, Italy, 2006.

Pavel Pecina: An Extensive Empirical Study of Collocation Extraction Methods. In Proceedings of the Association for Computational Linguistics Student Research Workshop (ACL), Ann Arbor, Michigan, USA, 2005.

Pavel Pecina, Martin Holub: Semantically Significant Collocations. UFAL/CKL Technical Report TR-2002-13, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic, 2002.

slide-88
SLIDE 88


Additional data sets

PDT-Surf

◮ analogous to PDT-Dep (corpus, filtering, annotation)
◮ collocation candidates extracted as surface bigrams: pairs of adjacent words
◮ assumption: collocations cannot be modified by insertion of another word
◮ annotation consistent with PDT-Dep

CNC-Surf

◮ collocation candidates: instances of PDT-Surf in the Czech National Corpus
◮ SYN 2000 and 2005, 240 mil. tokens, morphologically tagged and lemmatized
◮ annotation consistent with PDT-Surf

PAR-Dist

◮ source corpus: Swedish Parole, 22 mil. tokens
◮ automatic morphological tagging and lemmatization
◮ distance bigrams: word pairs occurring within a distance of 1–3 words
◮ annotation: non-exhaustive manual extraction of support verb constructions
◮ no frequency filter applied
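The two candidate types described above can be sketched on a lemmatized token list: surface bigrams are adjacent pairs, distance bigrams are pairs within a window of up to three intervening positions. The helper names are illustrative, not from the thesis:

```python
# Candidate extraction sketch: surface vs. distance bigrams.
def surface_bigrams(tokens):
    # pairs of adjacent words
    return list(zip(tokens, tokens[1:]))

def distance_bigrams(tokens, max_dist=3):
    # word pairs occurring within a distance of 1..max_dist words
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + 1 + max_dist, len(tokens)))]
```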

slide-89
SLIDE 89


Reference data summary

reference data set              PDT-Dep     PDT-Surf    CNC-Surf     PAR-Dist
source corpus                   PDT         PDT         CNC          PAROLE
language                        Czech       Czech       Czech        Swedish
morphology                      manual      manual      auto         auto
syntax                          manual      none        none         none
bigram types                    dependency  surface     surface      distance
tokens                          1 504 847   1 504 847   242 272 798  22 883 361
bigram types                    635 952     638 030     30 608 916   13 370 375
after frequency filtering       26 450      29 035      2 941 414    13 370 375
after part-of-speech filtering  12 232      10 021      1 503 072    898 324
collocation candidates          12 232      10 021      9 868        17 027
data sample size                100 %       100 %       0.66 %       1.90 %
true collocations               2 557       2 293       2 263        1 292
baseline precision (%)          21.02       22.88       22.66        7.59

slide-90
SLIDE 90


Context-based vs. statistical association measures

[Four bar charts: MAP of all 82 measures on PDT-Dep, PDT-Surf, CNC-Surf (y-axis 0.0–0.8), and PAR-Dist (y-axis 0.0–0.4)]

slide-91
SLIDE 91


Results / Mean average precision: PDT-Dep vs. PDT-Surf

Dependency bigrams vs. surface bigrams

[Bar chart: MAP of all measures on PDT-Dep (dependency bigrams) compared with PDT-Surf (surface bigrams)]

slide-92
SLIDE 92


Results / Mean average precision: PDT-Surf vs. CNC-Surf

Small source corpus vs. large source corpus

[Bar chart: MAP of all measures on PDT-Surf (small corpus) compared with CNC-Surf (large corpus)]

slide-93
SLIDE 93


Results / Mean average precision: PAR-Dist vs. PDT-Dep

Different corpus, different language, different task

[Bar chart: MAP of all measures on PAR-Dist compared with PDT-Dep]

slide-94
SLIDE 94


Comparison of AM evaluation results on different data sets

[Scatter-plot matrix comparing per-measure MAP scores across PDT-Dep, PDT-Surf, CNC-Surf, and PAR-Dist]

slide-95
SLIDE 95


Automatic extraction of semantically associated words

◮ joint work with Martin Kirschner
◮ application of lexical association measures
◮ the same approach as for collocation extraction
◮ supervised combination of lexical association measures
◮ reference data from WordNet and PDT annotation
◮ occurrence and cooccurrence statistics from a larger corpus (CNC)
◮ different models for different types of associations (synonyms, antonyms, etc.)

slide-96
SLIDE 96


Syntax based language models for Information Retrieval

◮ joint work with Jana Kravalova
◮ application of the language model paradigm to information retrieval
◮ unigram models × bigram models
◮ surface word forms × lemmas × stems
◮ surface bigrams × dependency bigrams
◮ various weighting schemes: MLE, association scores
◮ various approaches to combining the models
◮ test collection: CLEF 2007 Czech Ad-Hoc collection
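One of the configurations listed above can be sketched as an interpolated unigram/bigram document language model with MLE estimates. The interpolation weight and the toy document are assumptions for illustration, not the experimental setup:

```python
# Sketch: probability of a query under a document language model
# that interpolates bigram and unigram MLE estimates.
from collections import Counter

def lm_score(query, doc_tokens, lam=0.5):
    uni = Counter(doc_tokens)
    bi = Counter(zip(doc_tokens, doc_tokens[1:]))
    n = len(doc_tokens)
    score = uni[query[0]] / n                 # first term: unigram MLE
    prev = query[0]
    for w in query[1:]:
        p_uni = uni[w] / n
        p_bi = bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
        score *= lam * p_bi + (1 - lam) * p_uni
        prev = w
    return score
```

In the bigram term one could substitute association scores for the MLE estimate, which is the weighting variation the slide mentions.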

slide-97
SLIDE 97


Webpage cleaning

◮ joint work with Michal Marek, Miroslav Spousta
◮ the task: clean arbitrary web pages (remove boilerplate)
◮ each document split into a sequence of blocks
◮ blocks separated by one or more HTML tags
◮ blocks labeled as Title, Text, or Garbage
◮ supervised sequence labeling by Conditional Random Fields
◮ CleanEval 2007 Shared Task winners

slide-98
SLIDE 98


Near duplicate document detection

◮ joint work with Daniel Bencik
◮ motivated by the work of Deepak Ravichandran
◮ probabilistic (approximate) approach
◮ documents represented by term-frequency vectors (vector space model)
◮ Locality Sensitive Hashing used to create (binary) document signatures
◮ Hamming distance on the signatures approximates cosine similarity on the document vectors
◮ Hamming distance computed by the Point Location in Equal Balls algorithm
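The signature scheme can be sketched with random-hyperplane hashing: each hyperplane contributes one bit (which side of the plane the vector falls on), and the Hamming distance between signatures approximates the angle between the original vectors, hence their cosine similarity. Vector dimensions, bit count, and seed are illustrative, and the Point Location in Equal Balls step is omitted:

```python
# Random-hyperplane LSH sketch for near-duplicate detection.
import numpy as np

def signatures(vectors, n_bits=256, seed=0):
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_bits, vectors.shape[1]))
    return (vectors @ planes.T) > 0          # one bit per hyperplane

def hamming(a, b):
    return int(np.count_nonzero(a != b))
```

Identical document vectors hash to identical signatures (Hamming distance 0), while dissimilar vectors disagree on roughly half of the bits.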