Word Meaning & Word Sense Disambiguation
CMSC 723 / LING 723 / INST 725 MARINE CARPUAT
marine@cs.umd.edu
Today
– Representing word meaning
– Word sense disambiguation as supervised classification
– Word sense disambiguation
http://www.ling.upenn.edu/~beatrice/humor/headlines.html
Which flight serves breakfast?
Which flights serve Tucson?
*Which flights serve breakfast and Tucson?
– Homonyms (homonymy): unrelated senses; identical
– Polysemes (polysemy): related, but distinct senses
– Metonyms (metonymy): “stand in”; technically, a subcase of polysemy
– Synonyms (synonymy)
– How do humans store and access knowledge about concepts?
– Hypothesis: concepts are interconnected via meaningful relations
– Useful for reasoning
– Can most (all?) of the words in a language be represented as a semantic network where words are interlinked by meaning?
– If so, the result would be a large semantic network…
Noun
– {pipe, tobacco pipe} (a tube with a small bowl at one end; used for smoking tobacco)
– {pipe, pipage, piping} (a long tube made of metal or plastic that is used to carry water or oil or gas etc.)
– {pipe, tube} (a hollow cylindrical shape)
– {pipe} (a tubular wind instrument)
– {organ pipe, pipe, pipework} (the flues and stops on a pipe organ)
Verb
– {shriek, shrill, pipe up, pipe} (utter a shrill cry)
– {pipe} (transport by pipeline) “pipe oil, water, and gas into the desert”
– {pipe} (play on a pipe) “pipe a tune”
– {pipe} (trim with piping) “pipe the skirt”
What do you think of the sense granularity?
[Figure: fragment of the WordNet noun network around {car; auto; automobile; machine; motorcar}. Synsets shown: {vehicle}, {conveyance; transport}, {motor vehicle; automotive vehicle}, {cruiser; squad car; patrol car; police car; prowl car}, {cab; taxi; hack; taxicab}, {bumper}, {car door}, {car window}, {car mirror}, {hinge; flexible joint}, {doorlock}, {armrest}. Edges are labeled “hypernym” or “meronym”.]
Part of speech   Word forms   Synsets
Noun             117,798      82,115
Verb             11,529       13,767
Adjective        21,479       18,156
Adverb           4,481        3,621
Total            155,287      117,659
http://wordnet.princeton.edu/
From WordNet:
– Can be framed as lexical sample (focus on one word type at a time) or all-words (disambiguate all content words in a document)
– Information retrieval
– Machine translation
– …
– 62% in Longman’s Dictionary of Contemporary English (LDOCE)
– 79% in WordNet
– Average of 3.83 in LDOCE
– Average of 2.96 in WordNet
– In the British National Corpus, 84% of instances have more than
62% accuracy in this case!
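[Figure: supervised classification. Training: training data annotated with label1, label2, label3, label4 passes through feature functions into a supervised machine learning algorithm, which produces a classifier. Testing: an unlabeled document passes through the feature functions, and the classifier chooses among label1? label2? label3? label4?]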
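The training and testing pipeline above can be sketched for a single ambiguous word. This is a minimal sketch, assuming a bag-of-words feature function and a Naive Bayes learner with add-one smoothing (one common choice; the slides do not name a specific algorithm). The word "bass", the sense labels, and the four training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def features(context):
    """Feature function: bag of lowercased context words (an assumed choice)."""
    return context.lower().split()

def train(labeled_examples):
    """Count senses and per-sense word occurrences from (context, sense) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for context, sense in labeled_examples:
        sense_counts[sense] += 1
        for w in features(context):
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab

def classify(model, context):
    """Return the sense with the highest Naive Bayes log score."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, n in sense_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in features(context):
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Hypothetical sense-annotated training data for the word "bass".
TRAIN = [
    ("he caught a huge bass in the lake", "fish"),
    ("grilled sea bass with lemon", "fish"),
    ("she plays bass in a jazz band", "music"),
    ("turn up the bass on the guitar amp", "music"),
]

model = train(TRAIN)
prediction = classify(model, "the band needs a bass player")
print(prediction)
```

In the lexical sample setting, one such classifier would be trained per word type; the feature function is where collocational or syntactic features would be added.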
– Unsupervised, but knowledge rich
The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities. WordNet
– Count overlapping content words between glosses and context
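The overlap count can be sketched on the "bank" sentence above. This is a minimal sketch of a simplified Lesk variant: the two glosses are paraphrased from WordNet, while the sense keys `bank#1` and `bank#2`, the small stopword list, and the regex tokenizer are assumptions made for illustration.

```python
import re

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "and", "in", "into", "it", "that",
             "can", "will", "because", "especially", "beside"}

# Glosses for two senses of "bank"; the keys are hypothetical sense labels.
GLOSSES = {
    "bank#1": "a financial institution that accepts deposits and channels "
              "the money into lending activities",
    "bank#2": "sloping land (especially the slope beside a body of water)",
}

def content_words(text):
    """Lowercase, tokenize on letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def simplified_lesk(context, glosses):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap

CONTEXT = ("The bank can guarantee deposits will eventually cover future "
           "tuition costs because it invests in adjustable-rate mortgage securities.")

sense, overlap = simplified_lesk(CONTEXT, GLOSSES)
print(sense, overlap)
```

With glosses alone, the only shared content word is "deposits", so the financial sense wins by a margin of one; overlaps this small are typical, which is why the basic count is usually extended.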
– Include the examples in dictionary definitions
– Include hypernyms and hyponyms
– Give more weight to larger overlaps (e.g., bigrams)
– Give extra weight to infrequent words
– …
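Two of these variations can be combined in a short sketch: adding the dictionary examples to each sense's signature, and weighting overlapping words by inverse document frequency so that infrequent words count more. The IDF formula log(N/df), computed over the sense signatures themselves, is one assumed weighting among many; the glosses and examples are recalled from WordNet and should be treated as illustrative, and the sense keys are hypothetical.

```python
import math
import re

# Signature = gloss plus example sentences for each (hypothetically named) sense.
SIGNATURES = {
    "bank#1": ("a financial institution that accepts deposits and channels "
               "the money into lending activities. "
               "he cashed a check at the bank. "
               "that bank holds the mortgage on my home."),
    "bank#2": ("sloping land (especially the slope beside a body of water). "
               "they pulled the canoe up on the bank. "
               "he sat on the bank of the river and watched the currents."),
}

def tokens(text):
    """Set of lowercased alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def idf_weights(signatures):
    """IDF over signatures: words that occur in every signature get weight 0."""
    n = len(signatures)
    df = {}
    for text in signatures.values():
        for w in tokens(text):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n / d) for w, d in df.items()}

def weighted_lesk(context, signatures):
    """Score each sense by the summed IDF of words shared with the context."""
    idf = idf_weights(signatures)
    context_words = tokens(context)
    scores = {}
    for sense, text in signatures.items():
        overlap = context_words & tokens(text)
        scores[sense] = sum(idf[w] for w in overlap)
    return max(scores, key=scores.get), scores

CONTEXT = ("The bank can guarantee deposits will eventually cover future "
           "tuition costs because it invests in adjustable-rate mortgage securities.")

best, scores = weighted_lesk(CONTEXT, SIGNATURES)
print(best, scores)
```

Here the weighting replaces a stopword list: "the" and "bank" occur in both signatures and contribute nothing, while "deposits" and "mortgage" each contribute log 2 to the financial sense.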
– Start with (small) seed, learn classifier
– Use classifier to label rest of corpus
– Retain “confident” labels, add to training set
– Learn new classifier
– Repeat…
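The loop above can be sketched as follows. This is a minimal sketch, assuming a deliberately crude classifier (each context word votes for the senses it has been seen with) and a vote-margin confidence threshold; the seeds, contexts, sense names, and the `min_margin` parameter are all hypothetical.

```python
from collections import Counter, defaultdict

def learn(labeled):
    """Learn word -> sense counts from (context_words, sense) pairs."""
    counts = defaultdict(Counter)
    for words, sense in labeled:
        for w in set(words):
            counts[w][sense] += 1
    return counts

def predict(counts, words):
    """Return (sense, margin), where margin is the vote lead of the best sense."""
    votes = Counter()
    for w in set(words):
        votes.update(counts.get(w, {}))
    if not votes:
        return None, 0
    ranked = votes.most_common()
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return ranked[0][0], ranked[0][1] - runner_up

def bootstrap(seed, unlabeled, min_margin=1, max_rounds=10):
    """Grow the labeled set by repeatedly keeping only confident predictions."""
    labeled, pool = list(seed), list(unlabeled)
    for _ in range(max_rounds):
        counts = learn(labeled)              # learn classifier
        confident, rest = [], []
        for words in pool:                   # label the rest of the corpus
            sense, margin = predict(counts, words)
            if sense is not None and margin >= min_margin:
                confident.append((words, sense))
            else:
                rest.append(words)
        if not confident:                    # nothing new to add: stop
            break
        labeled.extend(confident)            # retain confident labels
        pool = rest
    return labeled, pool

# Hypothetical seeds and contexts for "plant" (target word removed).
SEED = [(["life"], "living"), (["manufacturing"], "factory")]
UNLABELED = [
    ["animal", "life", "species"],
    ["manufacturing", "equipment", "workers"],
    ["species", "flower"],
    ["workers", "strike"],
    ["flower", "garden"],
    ["nuclear", "reactor"],
]

labeled, pool = bootstrap(SEED, UNLABELED)
print(len(labeled), pool)
```

Each round extends the reach of the seeds by one step ("life", then "species", then "flower"), while a context that never shares a word with the labeled set stays unlabeled.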
– One sense per discourse
– One sense per collocation
A word tends to preserve its meaning across all its occurrences in a given discourse
– 8 words with two-way ambiguity, e.g., plant, crane
– 98% of the two-word occurrences in the same discourse carry the same meaning
– The heuristic holds mostly for coarse-grained senses and for homonymy rather than polysemy
– Performance of “one sense per discourse” measured on SemCor is approximately 70%
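One way to use the one-sense-per-discourse heuristic is as a post-processing step on a classifier's output. This is a minimal sketch, assuming a plain majority vote over the predicted senses of one word within one document; the function name and the example labels are hypothetical.

```python
from collections import Counter

def apply_one_sense_per_discourse(predictions):
    """Relabel every occurrence of a word in one discourse with the majority sense.

    predictions: list of predicted sense labels, one per occurrence.
    """
    if not predictions:
        return []
    majority, _ = Counter(predictions).most_common(1)[0]
    return [majority] * len(predictions)

# Hypothetical per-occurrence predictions for "plant" in a single document.
doc_predictions = ["living", "living", "factory", "living"]
smoothed = apply_one_sense_per_discourse(doc_predictions)
print(smoothed)
```

Given the SemCor figure above, this kind of smoothing is safer with coarse-grained, homonym-style sense distinctions than with fine-grained polysemy.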
– Strong for adjacent collocations
– Weaker as the distance between words increases
– 97% precision on words with two-way ambiguity
– Again, accuracy depends on granularity:
– How fine or coarse grained?
– Application specific?
– Use the foreign language as the sense inventory
– Added bonus: annotations for free! (Using machine translation data)
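The "annotations for free" idea can be sketched as follows: given word-aligned parallel sentences, the translation aligned to each occurrence of the target word serves as its sense label. This is a minimal sketch; the sentence pairs, the alignment format (pairs of source and target token indices), and the function name are assumptions for illustration, and real alignments would come from a word aligner.

```python
def sense_labels_from_alignments(parallel, target):
    """Build (context_words, translation) training pairs for one source word.

    parallel: list of (source_tokens, foreign_tokens, alignment), where
    alignment is a list of (source_index, foreign_index) pairs.
    """
    examples = []
    for src, tgt, alignment in parallel:
        for i, j in alignment:
            if src[i] == target:
                context = src[:i] + src[i + 1:]   # context without the target word
                examples.append((context, tgt[j]))  # translation acts as sense label
    return examples

# Hypothetical English-French sentence pairs with hand-written alignments.
PARALLEL = [
    (["the", "bank", "approved", "the", "loan"],
     ["la", "banque", "a", "approuvé", "le", "prêt"],
     [(0, 0), (1, 1), (2, 3), (3, 4), (4, 5)]),
    (["we", "walked", "along", "the", "river", "bank"],
     ["nous", "avons", "marché", "le", "long", "de", "la", "rive"],
     [(0, 0), (1, 2), (3, 6), (5, 7)]),
]

examples = sense_labels_from_alignments(PARALLEL, "bank")
for context, label in examples:
    print(label, context)
```

The resulting pairs have the same shape as sense-annotated training data, so they could feed a supervised classifier without manual sense annotation.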
– WordNet
– Lesk Algorithm
– Supervised classification
– Minimizing supervision
– If needed, review probability and expectations