Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Lexical Association Measures: Collocation Extraction
Pavel Pecina (pecina@ufal.mff.cuni.cz)
Institute of Formal and Applied Linguistics, Charles University, Prague
Talk outline
- 1. Introduction
- 2. Collocation extraction
- 3. Lexical association measures
- 4. Reference data
- 5. Empirical evaluation
- 6. Combining association measures
- 7. Conclusions
Lexical association
1/30
Semantic association
◮ reflects a semantic relationship between words
◮ synonymy, antonymy, hyponymy, meronymy, etc.
➔ stored in a thesaurus
sick – ill, baby – infant, dog – cat
Cross-language association
◮ corresponds to potential translations of words between languages
◮ translation equivalents
➔ stored in a dictionary
maison (FR) – house (EN), Baum (GE) – tree (EN), květina (CZ) – flower (EN)
Collocational association
◮ restricts the combination of words into phrases (beyond grammar!)
◮ collocations / multiword expressions
➔ stored in a lexicon
crystal clear, cosmetic surgery, cold war
Measuring lexical association
2/30
Motivation
◮ automatic acquisition of associated words (into a lexicon/thesaurus/dictionary)
Tool: Lexical association measures
◮ mathematical formulas determining the strength of association between two (or more) words based on their occurrences and cooccurrences in a corpus
Applications
◮ lexicography, natural language generation, word sense disambiguation
◮ bilingual word alignment, identification of translation equivalents
◮ information retrieval, cross-lingual information retrieval
◮ keyword extraction, named entity recognition
◮ syntactic constituent boundary detection
◮ collocation extraction
Goals, objectives, and limitations
3/30
Goal
◮ application of lexical association measures to collocation extraction
Objectives
- 1. to compile a comprehensive inventory of lexical association measures
- 2. to build reference data sets for collocation extraction
- 3. to evaluate the lexical association measures on these data sets
- 4. to explore the possibility of combining these measures into more complex models and advance the state of the art in collocation extraction
Limitations
✓ focus on bigram (two-word) collocations
(limited scalability to higher-order n-grams; limited corpus size)
✓ binary (two-class) discrimination only (collocation/non-collocation)
Collocational association
4/30
Collocability
◮ the ability of words to combine with other words in text
◮ governed by a system of rules and constraints: syntactic, semantic, pragmatic
◮ must be adhered to in order to produce correct, meaningful, fluent utterances
◮ ranges from free word combinations to idioms
◮ specified intensionally (general rules) or extensionally (particular constraints)
Collocations
◮ word combinations with extensionally restricted collocability
◮ should be listed in a lexicon and learned in the same way as single words
Types of collocations
- 1. idioms (to kick the bucket, to hear st. through the grapevine)
- 2. proper names (New York, Old Town, Vaclav Havel)
- 3. technical terms (car oil, stock owl, hard disk)
- 4. phrasal verbs (to switch off, to look after)
- 5. light verb compounds (to take a nap, to do homework)
- 6. lexically restricted expressions (strong tea, broad daylight)
Collocation properties
5/30
Semantic non-compositionality
◮ exact meaning cannot be (fully) inferred from the meaning of components
to kick the bucket
Syntactic non-modifiability
◮ syntactic structure cannot be freely modified (word order, word insertions etc.)
poor as a church mouse vs. poor as a *big church mouse
Lexical non-substitutability
◮ components cannot be substituted by synonyms or other words
stiff breeze vs. *stiff wind
Translatability into other languages
◮ translation cannot generally be performed blindly, word by word
ice cream – zmrzlina
Domain dependency
◮ collocational character only in specific domains
carriage return
Collocation extraction
6/30
Task
◮ to extract a list of collocations (types) from a text corpus
◮ no need to identify particular occurrences (instances) of collocations
Methods
◮ based on extraction principles verifying characteristic collocation properties
◮ i.e. hypotheses about word occurrences and cooccurrences in the corpus
◮ formulated as lexical association measures
◮ compute an association score for each collocation candidate from the corpus
◮ the scores indicate the chance of a candidate being a collocation
Extraction principles
- 1. “Collocation components occur together more often than by chance”
- 2. “Collocations occur as units in information-theoretically noisy environment”
- 3. “Collocations occur in different contexts to their components”
Extraction principle I
7/30
“Collocation components occur together more often than by chance”
◮ the corpus is interpreted as a sequence of randomly generated words
◮ word (marginal) probability, ML estimate: p(x) = f(x) / N
◮ bigram (joint) probability, ML estimate: p(xy) = f(xy) / N
◮ "chance" ~ the null hypothesis of independence: H0: p̂(xy) = p(x) · p(y)
AM: Log-likelihood ratio, χ² test, Odds ratio, Jaccard, Pointwise mutual information
Example: Pointwise Mutual Information
Data: f(iron curtain) = 11, f(iron) = 30, f(curtain) = 15
MLE: p(iron curtain) = 0.000007, p(iron) = 0.000020, p(curtain) = 0.000010
H0: p̂(iron curtain) = p(iron) · p(curtain) = 0.0000000002, f̂(iron curtain) = 0.0003
AM: PMI(iron curtain) = log₂ (p(xy) / p̂(xy)) = log₂ (0.000007 / 0.0000000002) ≈ 15.1
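The PMI computation above can be sketched in a few lines of Python (an illustrative helper, not the talk's code; the corpus size N = 1 500 000 is an assumption matching the PDT 2.0 figure given later):

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information (log base 2) of a bigram xy,
    from its joint frequency, the marginal frequencies, and corpus size n."""
    p_xy = f_xy / n              # ML estimate of the joint probability
    p_x, p_y = f_x / n, f_y / n  # ML estimates of the marginal probabilities
    return math.log2(p_xy / (p_x * p_y))

# iron curtain example from the slide (unrounded probabilities,
# hence a slightly different value than the rounded hand computation)
print(round(pmi(11, 30, 15, 1_500_000), 2))  # 15.16
```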
Extraction principle II
8/30
“Collocations occur as units in information-theoretically noisy environment”
◮ the corpus is again interpreted as a sequence of randomly generated words
◮ at each point of the sequence we estimate:
- 1. the probability distribution of the words occurring after/before it: p(w|C^r_xy), p(w|C^l_xy)
- 2. the uncertainty (entropy) about the next/previous word: H(p(w|C^r_xy)), H(p(w|C^l_xy))
◮ points with high uncertainty are likely to be collocation boundaries
◮ points with low uncertainty are likely to be located within a collocation
AM: Left context entropy, Right context entropy
Example: H(p(w|C^r_xy))
Český kapitálový trh dnes ovlivnil pokles cen všech cenných papírů a zejména akcií.
("The Czech capital market was today affected by a fall in the prices of all securities, especially shares.")
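A minimal sketch of right context entropy (our illustrative code, not the talk's implementation): collect the words that immediately follow occurrences of the bigram and compute the entropy of their distribution. Low entropy suggests the bigram is a fragment of a longer unit; high entropy suggests a unit boundary.

```python
import math
from collections import Counter

def context_entropy(context_words):
    """H(p(w | C)): entropy of the empirical distribution of words
    observed in a context slot C (e.g. immediately after a bigram)."""
    counts = Counter(context_words)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# many different followers -> high entropy -> likely collocation boundary
print(context_entropy(["of", "in", "said", "today"]))  # 2.0 bits
# always the same follower -> zero entropy -> likely inside a larger unit
print(round(context_entropy(["disk"] * 4), 1))
```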
Extraction principle III
9/30
“Collocations occur in different contexts to their components”
◮ non-compositionality: the meaning of a collocation must differ from the union of the meanings of its components
◮ modeling meanings by empirical contexts: a bag of words occurring within a specified context window of a word or an expression
◮ the more the contexts of an expression differ from those of its components, the higher the chance that the expression is a collocation
AM: J-S divergence, K-L divergence, Skew divergence, Cosine similarity in vector space
Example: Cxy, Cx
[Corpus concordance lines: occurrences of černý trh ("black market") in contexts such as currency exchange, smuggling, weapons, tickets, and endangered species (Cxy), contrasted with occurrences of trh ("market") alone in combinations such as dobytčí trh ("cattle market"), kapitálový trh ("capital market"), mezibankovní trh ("interbank market"), volný trh ("free market"), and mediální trh ("media market") (Cx).]
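Principle III can be illustrated by comparing bag-of-words context vectors with cosine similarity (one of the measures listed above). The toy contexts below are our own paraphrase of the concordance examples, not real corpus data:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two bag-of-words context vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

ctx_xy = Counter(["currency", "smuggling", "weapons", "tickets"])  # černý trh
ctx_x = Counter(["color", "dark", "coffee", "night"])              # černý alone
# disjoint contexts -> similarity 0 -> evidence of non-compositionality
print(cosine(ctx_xy, ctx_x))  # 0.0
```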
Inventory of lexical association measures
10/30
Extraction pipeline
11/30
- 1. linguistic preprocessing (morphological and syntactic level)
- 2. identification of collocation candidates (dependency/surface/distance bigrams)
- 3. extraction of occurrence and cooccurrence statistics (frequency, contexts)
- 4. filtering the candidates to improve precision (POS patterns)
- 5. application of a chosen lexical association measure
- 6. ranking/classification of collocation candidates according to their scores
Ranking
red cross 15.66
decimal point 14.01
arithmetic operation 10.52
paper feeder 10.17
system type 3.54
and others 0.54
program in 0.35
level is 0.25
Classification
red cross 1
decimal point 1
arithmetic operation 1
paper feeder 1
system type 0
and others 0
program in 0
level is 0
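Steps 2-6 of the pipeline can be sketched for surface bigrams as follows (a toy implementation under our own simplifications: no linguistic preprocessing or POS filtering, PMI as the chosen measure, an illustrative frequency threshold):

```python
import math
from collections import Counter

def extract(tokens, min_freq=2):
    """Collect adjacent-bigram candidates, filter by frequency,
    score them with PMI, and return them ranked by score."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = {}
    for (x, y), f in bigrams.items():
        if f < min_freq:  # frequency filter
            continue
        scored[(x, y)] = math.log2((f / n) / ((unigrams[x] / n) * (unigrams[y] / n)))
    return sorted(scored.items(), key=lambda kv: -kv[1])  # ranking

toks = "new york is big and new york is old and big is big".split()
print(extract(toks)[0][0])  # ('new', 'york') ranks first
```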
Reference data set
12/30
Source corpus
◮ Prague Dependency Treebank 2.0, 1.5 mil. tokens
◮ manually annotated on the morphological and analytical level
Collocation candidates
◮ dependency bigrams: direct dependency relation between components
◮ morphological normalization (lemma proper + pos + gender + degree + negation)
◮ part-of-speech filter (A:N, N:N, V:N, R:N, C:N, N:V, N:C, D:A, N:A, D:V, N:T, N:D, D:D)
◮ frequency filter (minimal frequency required, f > 5)
Annotation
◮ three independent parallel annotations (no context; full agreement required)
◮ 6 categories, merged into two: collocations (1-5), non-collocations (0):
- 5. idiomatic expressions
- 4. technical terms
- 3. support verb constructions
- 2. proper names
- 1. frequent unpredictable usages
- 0. non-collocations
◮ 12 232 candidates = 2 557 true collocations + 9 675 true non-collocations
Experimental design
13/30
Reference data
◮ split into 7 stratified folds of the same size (the same ratio of true collocations)
◮ 1 fold put aside as held-out data
◮ 6 folds used for evaluation of AMs
folds: eval1, eval2, eval3, eval4, eval5, eval6, held-out
Evaluation
◮ based on quality of ranking (ranking performance)
◮ evaluation measures estimated on each eval fold separately and averaged
Significance testing
◮ methods compared by the paired Wilcoxon signed-rank test on the 6 eval folds
◮ significance level α = 0.05
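The slides do not show how the test is computed; as a hypothetical sketch (not the authors' implementation), with only n = 6 folds the paired Wilcoxon signed-rank test can be done exactly by enumerating all 2^n sign assignments:

```python
from itertools import product

def wilcoxon_signed_rank_exact(xs, ys):
    """Exact two-sided paired Wilcoxon signed-rank test for small samples,
    e.g. per-fold AP scores of two association measures on 6 eval folds."""
    diffs = [a - b for a, b in zip(xs, ys) if a != b]  # drop zero differences
    n = len(diffs)
    # rank the absolute differences, averaging ranks over ties
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average rank, 1-based
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_obs = min(w_plus, total - w_plus)
    # exact null distribution: every sign assignment is equally likely
    hits = sum(1 for signs in product((0, 1), repeat=n)
               if min(wp := sum(r for r, s in zip(ranks, signs) if s),
                      total - wp) <= w_obs)
    return hits / 2 ** n  # two-sided p-value
```

Note that with six folds the smallest attainable two-sided p-value is 2/64 ≈ 0.031, so a consistent difference across all folds can still fall below α = 0.05.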
Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Evaluation measures: Precision – Recall
14/30
1) Precision = |correctly classified collocations| / |total classified as collocations|

   Recall = |correctly classified collocations| / |total collocations|
Example ranking (association score), classification (all 16 candidates above the threshold), and cumulative Precision/Recall:

candidate              score   classified   Precision   Recall
red cross              15.66       1          100 %       12 %
iron curtain           15.23       1          100 %       25 %
decimal point          14.01       1          100 %       37 %
coupon book            13.83       1          100 %       50 %
book author            11.05       1           80 %       50 %
arithmetic operation   10.52       1           83 %       62 %
paper feeder           10.17       1           85 %       75 %
new book               10.09       1           75 %       75 %
round table             7.03       1           77 %       87 %
new wave                6.59       1           70 %       87 %
gas station             6.04       1           72 %      100 %
system type             3.54       1           66 %      100 %
central part            1.54       1           61 %      100 %
and others              0.54       1           57 %      100 %
program in              0.35       1           53 %      100 %
level is                0.25       1           50 %      100 %
◮ measured within the entire interval of possible threshold values
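As a minimal illustration (hypothetical code; the 0/1 labels below are a reading of the example ranking's Precision/Recall column), cumulative precision and recall when the top-k candidates are classified as collocations:

```python
def precision_recall_at_ranks(ranked_labels):
    """For a list of 0/1 labels sorted by descending association score,
    return (precision, recall) when the top-k candidates are classified
    as collocations, for every threshold position k."""
    total_true = sum(ranked_labels)
    curve, tp = [], 0
    for k, label in enumerate(ranked_labels, start=1):
        tp += label
        curve.append((tp / k, tp / total_true))
    return curve

# labels of the example ranking (1 = true collocation), inferred from the slide
example = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
```

For instance, with the threshold after the fourth candidate, precision is 100 % and recall 50 %, matching the table.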
Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Visual evaluation: Precision-Recall curves
15/30
◮ graphical plots of precision vs. recall
◮ the closer to the top and right, the better the ranking performance
◮ estimated for each eval fold and vertically averaged
Precision-Recall curve averaging
[Figure: unaveraged precision-recall curves for the six eval folds and their vertically averaged curve]
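Vertical averaging is not spelled out in the slides; a common recipe (a sketch, not necessarily the authors' exact procedure) fixes a grid of recall values and averages, at each recall value, the precision of every fold's curve:

```python
def precision_at_recall(curve, r):
    """Step-wise interpolation: precision of the first curve point whose
    recall reaches r; curve is a list of (recall, precision) sorted by recall."""
    for recall, precision in curve:
        if recall >= r:
            return precision
    return curve[-1][1]

def vertical_average(curves, grid):
    """Average several per-fold precision-recall curves at fixed recall values."""
    return [sum(precision_at_recall(c, r) for c in curves) / len(curves)
            for r in grid]
```

The averaged curve is then plotted over the chosen recall grid.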
Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Evaluation results: Precision-Recall curves
16/30 The best-performing association measures
[Figure: recall vs. averaged precision for the five best measures:
Pointwise mutual information (4), Pearson’s test (10), z score (13), Unigram subtuple measure (39), Cosine context similarity in boolean vector space (77)]
Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Evaluation measure: Average Precision
17/30
2) Average Precision: AP = E[P(R)], R ∼ U(0, 1), estimated as

   AP = (1/r) · Σ_{i=1..r} p_i

   where r is the number of true collocations and p_i is the precision at the i-th recall level
(example ranking with cumulative Precision/Recall values, as on the previous slide)
3) Mean Average Precision: MAP = E[AP], estimated as

   MAP = (1/6) · Σ_{i=1..6} AP_i over the six eval folds
AP = 89.6 %
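A direct way to estimate AP from a ranked candidate list (a sketch; depending on the precise estimator and rounding, the result can differ slightly from the 89.6 % shown for the example):

```python
def average_precision(ranked_labels):
    """Mean of the precision values taken at the rank of each true collocation,
    i.e. an estimate of E[P(R)] with R uniform over the recall levels."""
    tp, precisions = 0, []
    for k, label in enumerate(ranked_labels, start=1):
        if label:
            tp += 1
            precisions.append(tp / k)
    return sum(precisions) / len(precisions)
```

MAP is then simply the mean of this AP score over the six eval folds.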
Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Overall results: Mean Average Precision
18/30 MAP of all lexical association measures in descending order
[Bar chart: MAP of all 82 measures in descending order; measure IDs: 77, 39, 80, 38, 32, 13, 10, 31, 30, 37, 5, 42, 27, 28, 29, 4, 63, 16, 23, 22, 24, 45, 33, 7, 21, 18, 19, 20, 43, 34, 6, 54, 9, 76, 50, 82, 48, 3, 8, 59, 44, 66, 61, 73, 71, 26, 70, 25, 15, 14, 72, 74, 11, 69, 53, 52, 49, 35, 41, 68, 55, 64, 40, 47, 65, 81, 75, 46, 56, 12, 78, 2, 60, 79, 51, 36, 58, 62, 57, 1, 17, 67]
◮ Baseline (ratio of true collocations): 21.02 %
◮ Best context-based measure (■): Cosine similarity in vector space: 66.79 %
◮ Best statistical association measure (■): Unigram subtuple measure: 66.72 %
◮ Best 16 measures: statistically indistinguishable MAP, comparable to the current state of the art
Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Combining association measures
19/30
Motivation
◮ different association measures discover different groups/types of collocations
◮ existence of uncorrelated association measures
5 % data sample from PDT-Dep
[Scatter plot: Pointwise mutual information vs. cosine context similarity in boolean vector space; collocations and non-collocations separated by a linear discriminant]
Note: all methods so far are unsupervised; the combination methods are supervised.
Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Combination models
20/30
Framework
◮ each collocation candidate x^i is described by the feature vector x^i = (x^i_1, . . . , x^i_82)^T consisting of the scores of all association measures
◮ and assigned a label y^i ∈ {0, 1} indicating whether the bigram is considered to be a true collocation (y^i = 1) or not (y^i = 0)
◮ we look for a ranker function f(x^i) determining the strength of lexical association between the components of a candidate x^i
◮ e.g. a linear combination of association scores: f(x^i) = w_0 + w_1 x^i_1 + . . . + w_82 x^i_82
Methods
- 1. Linear logistic regression
- 2. Linear discriminant analysis
- 3. Support vector machines
- 4. Neural networks
◮ in the training phase used as regular classifiers on two-class data
◮ in the application phase no classification threshold is applied
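A toy version of the linear-combination ranker (a hypothetical pure-Python sketch; the experiments used standard implementations of the four methods): logistic regression is trained as a two-class classifier, but only its real-valued score f(x) is kept for ranking:

```python
import math

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent for linear logistic regression; returns the
    weights (w0, w1, ..., wd) of f(x) = w0 + w1*x1 + ... + wd*xd."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            grad[0] += p - yi
            for j, xj in enumerate(xi, start=1):
                grad[j] += (p - yi) * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def ranker_score(w, xi):
    """Association strength used for ranking; no classification threshold."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
```

Candidates are then sorted by `ranker_score` exactly as they would be by a single association measure.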
Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Combination models: Evaluation
21/30
Evaluation scheme
◮ 6-fold cross-validation on the 6 evaluation folds
◮ 5 folds for training (fitting parameters), 1 fold for testing (ranking performance)
◮ PR curve and AP score estimated on each test fold and averaged
folds: train1, train2, train3, train4, train5, test6, held-out
Results: Mean Average Precision
method                               MAP      +%
Unigram subtuple measure             66.72     –
Cosine similarity in vector space    66.79    0.00
Support Vector Machine               73.03    9.35
Neural Network (1 unit)              74.88   12.11
Linear Discriminant Analysis         75.16   12.54
Linear Logistic Regression           77.36   15.82
Neural Network (5 units)             80.87   21.08
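The evaluation scheme above can be sketched generically (hypothetical code; `train_fn` and `ap_fn` stand in for any of the supervised combiners and for the AP computation):

```python
def crossvalidate_map(folds, train_fn, ap_fn):
    """k-fold cross-validation: train on all folds but one, evaluate AP on
    the remaining test fold, and average the per-fold AP scores into MAP."""
    aps = []
    for i, test_fold in enumerate(folds):
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train_fn(train)
        aps.append(ap_fn(model, test_fold))
    return sum(aps) / len(aps)
```

With the six eval folds this is exactly one training run and one AP estimate per fold.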
Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Results: Precision-Recall curves
22/30 Combination methods compared with the best association measures
[Figure: recall vs. average precision for: Neural network (5 units), Linear logistic regression, Support vector machine (linear), Linear discriminant analysis, Neural network (1 unit), Cosine context similarity in boolean vector space (77), Unigram subtuple measure (39)]
Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Learning curve analysis
23/30 Neural network (5 units) learning curve
[Figure: MAP of the neural network (5 units) as a function of training data size, 20–100 %]
◮ 100 % of training data = 5 training folds (8 737 annotated collocation candidates)
◮ 95 % of the final MAP achieved with 15 % of the training data
◮ 99 % of the final MAP achieved with 50 % of the training data
Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Adding linguistic features
24/30
Idea
◮ improving the combination models by adding linguistic features
◮ categorical features can be transformed to binary dummy features
New features
◮ Part-of-speech pattern: combination of component POS tags (A:N, N:N, . . .)
◮ Syntactic relation: dependency type (attribute, object, . . .)
Results: Mean Average Precision
method                               MAP      +%
Unigram subtuple measure             66.72     –
Cosine similarity in vector space    66.79    0.00
NNet/5 (AM)                          80.87   21.08
NNet/5 (AM+POS)                      82.79   24.09
NNet/5 (AM+POS+DEP)                  84.53   26.69
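Turning a categorical feature such as the POS pattern into binary dummy features can look like this (a hypothetical sketch; the pattern list is illustrative):

```python
def with_dummy_features(am_scores, value, categories):
    """Append one 0/1 indicator per category value (one-hot / dummy coding)
    to a candidate's vector of association-measure scores."""
    return list(am_scores) + [1.0 if value == c else 0.0 for c in categories]

# hypothetical subset of the POS patterns observed in the data
POS_PATTERNS = ["A:N", "N:N", "V:N", "R:N"]
```

The dependency-type feature (attribute, object, . . .) is encoded the same way, so the combiner's input stays a plain numeric vector.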
Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Model reduction
25/30
Motivation
◮ “Occam’s razor”
◮ the combination of all 82 association measures is too complex
◮ models should be reduced: redundant variables removed
Two issues
- 1. groups of highly correlated measures
- 2. measures with no or minimal contribution to the model
Two-step solution
- 1. correlation based clustering; one representative selected from each cluster
- 2. step-wise procedure removing variables one by one
Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Model reduction: 1) Clustering
26/30
Agglomerative hierarchical clustering
◮ groups the measures with the same/similar contribution to the model ◮ begins with each measure as a separate cluster and merge them into
successively larger clusters
◮ distance metrics = 1- |Pearson’s correlation| (estimated on the held-out fold)
[Dendrogram: agglomerative clustering of the 82 association measures by correlation distance]
◮ number of final clusters empirically set to 60
◮ the best-performing measure (by MAP on the held-out fold) selected as the representative of each cluster
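A small complete-linkage version of the clustering step, using the stated 1 − |Pearson's correlation| distance (a sketch, not the authors' exact code; the linkage criterion is an assumption):

```python
def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

def cluster_measures(scores, n_clusters):
    """Agglomerative (complete-linkage) clustering of association measures;
    scores[m] holds measure m's scores on the held-out candidates."""
    dist = lambda a, b: 1.0 - abs(pearson(scores[a], scores[b]))
    clusters = [[m] for m in range(len(scores))]
    while len(clusters) > n_clusters:
        # merge the pair of clusters with the smallest complete-linkage distance
        _, i, j = min((max(dist(a, b) for a in ci for b in cj), i, j)
                      for i, ci in enumerate(clusters)
                      for j, cj in enumerate(clusters) if i < j)
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

Taking the absolute correlation means perfectly anti-correlated measures also land in one cluster, since they carry the same ranking information.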
Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Model reduction: 2) Stepwise variable removal
27/30
Iterative procedure
◮ initiated with the 60 variables/measures ◮ in each iteration we remove the variable causing minimal performance
degradation when not used in the model (by MAP on the held-out fold)
◮ stops before the degradation becomes statistically significant
[Figure: held-out and test MAP as the number of variables is reduced from 60 down to 1]
◮ the final model contains 13 variables/lexical association measures
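The removal loop as a greedy sketch (hypothetical code; held-out MAP is abstracted into a callback, and the stopping rule here uses a plain tolerance instead of the significance test):

```python
def stepwise_remove(variables, heldout_map, max_degradation):
    """Backward elimination: repeatedly drop the variable whose removal hurts
    held-out MAP the least, stopping once even the best reduced model falls
    more than max_degradation below the current model."""
    current = list(variables)
    while len(current) > 1:
        base = heldout_map(current)
        # score every one-variable-removed model and keep the best
        best_map, victim = max(
            (heldout_map([v for v in current if v != u]), u) for u in current)
        if base - best_map > max_degradation:
            break  # further removal degrades performance too much
        current.remove(victim)
    return current
```

Each iteration retrains the combiner once per remaining variable, so the procedure is quadratic in the number of variables but cheap at this scale (60 down to 13).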
Introduction Collocation Extraction Association Measures Reference Data Empirical Evaluation Combining Association Measures Conclusions
Model reduction: Process overview
28/30
MAP of individual lexical association measures
[Bar chart: Mean Average Precision (0.0–0.8) of all 82 measures, ordered by score]
◮ procedure initiated with all 82 association measures
◮ highly correlated measures removed in the first phase (clustering)
◮ 13 measures left after the second phase (stepwise removal)
= 4 statistical association measures (■) + 9 context-based measures (■)
Model reduction results: Precision-Recall curves
29/30
Reduced combination models compared with the best association measures
[Precision-recall curves]
◮ NNet (5 units) with 82 variables
◮ NNet (5 units) with 47 variables
◮ NNet (5 units) with 13 variables
◮ NNet (5 units) with 7 variables
◮ Cosine context similarity in boolean vector space (77)
◮ Unigram subtuple measure (39)
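The curves above are read directly off a ranked candidate list. A minimal sketch (not the thesis code, with a made-up ranking): walk down the list in descending score order, record (recall, precision) at each true collocation, and average the recorded precisions to get average precision:

```python
def precision_recall_points(ranked_labels):
    """ranked_labels: 1 for a true collocation, 0 otherwise, in
    descending order of association score."""
    total_true = sum(ranked_labels)
    points, seen = [], 0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            seen += 1
            # recall = fraction of true collocations recovered so far,
            # precision = fraction of the list so far that is true
            points.append((seen / total_true, seen / rank))
    return points

points = precision_recall_points([1, 1, 0, 1, 0, 0])  # made-up ranking
average_precision = sum(p for _, p in points) / len(points)
print(points, round(average_precision, 4))
```

For the made-up ranking, precision stays at 1.0 for the first two true items and drops to 0.75 at the third, giving an average precision of about 0.9167.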
Conclusions
30/30
Main results
- 1. inventory of 82 lexical association measures
- 2. 4 reference data sets
- 3. all lexical association measures evaluated on these data sets
- 4. combining association measures improved state of the art in collocation extraction
- 5. combination models reduced to 13 measures without performance degradation
Other contributions of the thesis
◮ overview of different notions of collocation (definitions, typology, classification)
◮ evaluation scheme (average precision, cross-validation, significance tests)
◮ reference data sets used in the MWE 2008 Shared Task
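As a toy illustration of that evaluation scheme (average precision per cross-validation fold, MAP as the fold mean, and a paired t-test on the fold-level differences between two measures), with made-up per-fold AP values:

```python
import math
import statistics

# Hypothetical average precision per cross-validation fold for two measures.
ap_a = [0.81, 0.79, 0.84, 0.80, 0.82]   # measure A
ap_b = [0.74, 0.75, 0.73, 0.76, 0.72]   # measure B

map_a, map_b = statistics.mean(ap_a), statistics.mean(ap_b)  # MAP = fold mean
diffs = [x - y for x, y in zip(ap_a, ap_b)]
# Paired t statistic on the per-fold differences.
t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))
print(round(map_a, 3), round(map_b, 3), round(t, 2))
```

Here the t statistic exceeds the two-sided 5% critical value for 4 degrees of freedom (2.776), so the difference between the two made-up measures would count as significant.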
List of relevant publications
Pavel Pecina: Lexical Association Measures and Collocation Extraction. Multiword Expressions: Hard Going or Plain Sailing? Special issue of the International Journal of Language Resources and Evaluation, Springer, 2009 (accepted).
Pavel Pecina: Lexical Association Measures: Collocation Extraction. PhD thesis, Charles University, Prague, Czech Republic, 2008.
Pavel Pecina: Machine Learning Approach to Multiword Expression Extraction. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC) Workshop: Towards a Shared Task for Multiword Expressions, Marrakech, Morocco, 2008.
Pavel Pecina: Reference Data for Czech Collocation Extraction. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC) Workshop: Towards a Shared Task for Multiword Expressions, Marrakech, Morocco, 2008.
Pavel Pecina, Pavel Schlesinger: Combining Association Measures for Collocation Extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL), Sydney, Australia, 2006.
Silvie Cinková, Petr Podveský, Pavel Pecina, Pavel Schlesinger: Semi-automatic Building of Swedish Collocation Lexicon. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), Genova, Italy, 2006.
Pavel Pecina: An Extensive Empirical Study of Collocation Extraction Methods. In Proceedings of the Association for Computational Linguistics Student Research Workshop (ACL), Ann Arbor, Michigan, USA, 2005.
Pavel Pecina, Martin Holub: Semantically Significant Collocations. UFAL/CKL Technical Report TR-2002-13, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic, 2002.
Additional data sets
PDT-Surf
◮ analogous to PDT-Dep (corpus, filtering, annotation)
◮ collocation candidates extracted as surface bigrams: pairs of adjacent words
◮ assumption: collocations cannot be modified by insertion of another word
◮ annotation consistent with PDT-Dep
CNC-Surf
◮ collocation candidates – instances of PDT-Surf in the Czech National Corpus
◮ SYN 2000 and 2005, 240 mil. tokens, morphologically tagged and lemmatized
◮ annotation consistent with PDT-Surf
PAR-Dist
◮ source corpus: Swedish Parole, 22 mil. tokens
◮ automatic morphological tagging and lemmatization
◮ distance bigrams: word pairs occurring within a distance of 1–3 words
◮ annotation: non-exhaustive manual extraction of support verb constructions
◮ no frequency filter applied
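The two bigram notions used in these data sets can be sketched as follows. `surface_bigrams` and `distance_bigrams` are hypothetical helpers (not the thesis code), and treating distance 1 as adjacency is an assumption of this sketch:

```python
def surface_bigrams(tokens):
    """Surface bigrams: pairs of adjacent words."""
    return [(a, b) for a, b in zip(tokens, tokens[1:])]

def distance_bigrams(tokens, max_dist=3):
    """Distance bigrams: ordered word pairs within max_dist positions
    (distance 1 = adjacent, under this sketch's assumption)."""
    pairs = []
    for i, w in enumerate(tokens):
        for d in range(1, max_dist + 1):
            if i + d < len(tokens):
                pairs.append((w, tokens[i + d]))
    return pairs

tokens = ["a", "b", "c", "d"]
print(surface_bigrams(tokens))        # adjacent pairs only
print(distance_bigrams(tokens))       # includes pairs up to 3 apart
```

Note how the distance variant generates a superset of the surface bigrams, which is why PAR-Dist yields far more candidate types per token.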
Reference data summary
reference data set              PDT-Dep      PDT-Surf     CNC-Surf      PAR-Dist
source corpus                   PDT          PDT          CNC           PAROLE
language                        Czech        Czech        Czech         Swedish
morphology                      manual       manual       auto          auto
syntax                          manual       none         none          none
bigram types                    dependency   surface      surface       distance
tokens                          1 504 847    1 504 847    242 272 798   22 883 361
bigram types                    635 952      638 030      30 608 916    13 370 375
after frequency filtering       26 450       29 035       2 941 414     13 370 375
after part-of-speech filtering  12 232       10 021       1 503 072     898 324
collocation candidates          12 232       10 021       9 868         17 027
data sample size                100 %        100 %        0.66 %        1.90 %
true collocations               2 557        2 293        2 263         1 292
baseline precision (%)          21.02        22.88        22.66         7.59
Context-based vs. statistical association measures
[Four bar charts: Mean Average Precision of all 82 measures on PDT-Dep, PDT-Surf, and CNC-Surf (scale 0.0–0.8) and on PAR-Dist (scale 0.0–0.4)]
Results / Mean average precision: PDT-Dep vs. PDT-Surf
Dependency bigrams vs. surface bigrams
[Bar chart: Mean Average Precision (0.0–0.8) of all 82 measures on PDT-Dep compared with PDT-Surf]
Results / Mean average precision: PDT-Surf vs. CNC-Surf
Small source corpus vs. large source corpus
[Bar chart: Mean Average Precision (0.0–0.8) of all 82 measures on PDT-Surf compared with CNC-Surf]
Results / Mean average precision: PAR-Dist vs. PDT-Dep
Different corpus, different language, different task
[Bar chart: Mean Average Precision (0.0–0.8) of all 82 measures on PAR-Dist compared with PDT-Dep]
Comparison of AM evaluation results on different data sets
[Scatter-plot matrix: pairwise comparison of per-measure MAP scores across PDT-Dep, PDT-Surf, CNC-Surf, and PAR-Dist]
Automatic extraction of semantically associated words
◮ joint work with Martin Kirschner
◮ application of lexical association measures
◮ the same approach as for collocation extraction
◮ supervised combination of lexical association measures
◮ reference data from WordNet and PDT annotation
◮ occurrence and cooccurrence statistics from a larger corpus (CNC)
◮ different models for different types of associations (synonyms, antonyms, etc.)
Syntax-based language models for Information Retrieval
◮ joint work with Jana Kravalova
◮ application of the language model paradigm to information retrieval
◮ unigram models × bigram models
◮ surface word forms × lemmas × stems
◮ surface bigrams × dependency bigrams
◮ various weighting schemes: MLE, association scores
◮ various approaches to combining the models
◮ test collection: CLEF 2007 Czech Ad-Hoc collection
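A minimal sketch of the query-likelihood paradigm mentioned above, assuming a unigram model smoothed by interpolation with collection statistics (the smoothing weight and toy corpus are illustrative; a bigram variant would additionally condition each query term on its predecessor):

```python
import math
from collections import Counter

def score_document(query, doc_tokens, collection_tokens, lam=0.5):
    """Log query likelihood under an interpolated unigram model:
    P(t|d) mixed with the collection model P(t|C) to avoid zeros."""
    doc = Counter(doc_tokens)
    coll = Counter(collection_tokens)
    dlen, clen = len(doc_tokens), len(collection_tokens)
    logp = 0.0
    for term in query:
        p_doc = doc[term] / dlen
        p_coll = coll[term] / clen
        logp += math.log(lam * p_doc + (1 - lam) * p_coll)
    return logp

# Toy collection and documents (made up for the example).
collection = ["the", "cold", "war", "ended", "the", "war", "room"]
doc1 = ["cold", "war"]
doc2 = ["war", "room"]
query = ["cold", "war"]

s1 = score_document(query, doc1, collection)
s2 = score_document(query, doc2, collection)
print(s1 > s2)
```

The document matching both query terms scores higher, while collection smoothing keeps the non-matching document's score finite rather than minus infinity.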
Webpage cleaning
◮ joint work with Michal Marek, Miroslav Spousta
◮ the task: to clean arbitrary web pages (remove boilerplate)
◮ each document split into a sequence of blocks
◮ blocks separated by one or more HTML tags
◮ the blocks labeled as Title, Text, Garbage
◮ supervised sequence labeling by Conditional Random Fields
◮ CleanEval 2007 Shared Task winners
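The block-splitting step can be sketched with a regular expression; this is an illustrative reconstruction, not the CleanEval system, and the Title/Text/Garbage labeling is the separate supervised step:

```python
import re

def split_blocks(html):
    """Split a page into text blocks wherever one or more HTML tags
    separate two stretches of text; drop empty fragments."""
    parts = re.split(r"(?:<[^>]+>\s*)+", html)
    return [p.strip() for p in parts if p.strip()]

html = "<html><body><h1>Title</h1><p>Main text.</p><div>Footer</div></body></html>"
blocks = split_blocks(html)
print(blocks)
```

Each resulting block would then become one token of the sequence passed to the CRF labeler.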