From IR WSD to IR WSD
Julio Gonzalo
UNED
WSD → IR @ UNED: initial motivation
- 1997: EuroWordNet: let's use it!
- 1998: (manual) indexing with synsets: +29%
- 1999: Sanderson pseudo-senses vs. WordNet synsets (EMNLP)
- 1999: WSD versus first sense heuristic (SIGLEX)
- 2000: ITEM conceptual search engine
Conceptual versus textual indexing WSD strategy
ITEM search engine
- Scalable to several languages
- Conceptual query expansion
- Translations via hyperonym relations (e.g. governor's race)
but
- Granularity
- Indexing units versus translation units
  – Words are not good for translation
    » té cargado / strong tea
  – Phrases are not good for indexing
    » "word+sense+disambiguation" / "sense tagging"
→Is Word Sense Disambiguation an issue for the semantic web?
QUERY RECONSULT WITH PHRASE EXPLORE PHRASE EXPLORE DOCUMENT
Website Term Browser
WTB Evaluation
- 1523 sessions with interaction
- average 5.11 actions per session
- explore phrase used in 65.13% of sessions

                                All queries   1-word queries   >1-word queries
First action after QUERY
  DOC                           40.70%        45.49%           37.30%
  PHRASE                        51.14%        45.65%           55.05%
  RECONSULT                     8.141%        8.846%           7.640%
Last action before ending session with explore DOC
  QUERY                         48.74%        53.38%           45.15%
  PHRASE                        42.95%        40.85%           44.57%
  RECONSULT                     8.306%        5.764%           10.27%
Is WSD easier than MT/CLIR?
abortion issue
- abortion: aborto
- issue: tema, número, asunto, edición, emisión
Corpus evidence
- tema del aborto
- asunto del aborto
- asuntos como el aborto
- asuntos del aborto
- temas como el aborto
- asunto aborto
Alignment without parallel corpora: abortion issue → tema del aborto
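The slide above can be read as a small procedure: combine the dictionary translations of each lemma, then keep the combination best attested in the target-language corpus. A minimal Python sketch of that reading follows. The dictionary entries come from the slide, but the phrase counts, the lemma-set keys, and the function name `align` are assumptions made only for illustration, not the system's actual data or code.

```python
from collections import Counter
from itertools import product

# Bilingual dictionary entries taken from the slide (lemma -> candidate translations).
DICTIONARY = {
    "abortion": ["aborto"],
    "issue": ["tema", "número", "asunto", "edición", "emisión"],
}

# Hypothetical counts of target-corpus phrases, keyed by their set of content lemmas.
# The numbers are invented for the example.
TARGET_PHRASES = Counter({
    frozenset({"tema", "aborto"}): 12,    # e.g. "tema del aborto", "temas como el aborto"
    frozenset({"asunto", "aborto"}): 7,   # e.g. "asunto del aborto", "asunto aborto"
})

def align(source_lemmas):
    """Return (lemma set, count) of the most frequent candidate, or None if none is attested."""
    best = None
    for combo in product(*(DICTIONARY[lemma] for lemma in source_lemmas)):
        key = frozenset(combo)
        count = TARGET_PHRASES.get(key, 0)
        if count and (best is None or count > best[1]):
            best = (key, count)
    return best

print(align(["abortion", "issue"]))
```

With these toy counts, "abortion issue" aligns with the lemma set {tema, aborto}, so the other dictionary translations of "issue" (número, edición, emisión) are discarded without any parallel corpus.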
Results on CLEF comparable corpus

Spanish
Size   # Phrases    # Aligned
2      6,577,763    2,004,760
3      7,623,168    252,795

English
Size   # Phrases    # Aligned
2      3,830,663    1,456,140
3      3,058,698    198,956
Results on CLEF corpus

2 lemmas
              Algorithm       Random selection
              ES      EN      ES      EN
- frequent    .54     .66     .02     .02
+ frequent    .80     .83     .02     .02

3 lemmas
              Algorithm       Random selection
              ES      EN      ES      EN
- frequent    .62     .81     .004    .004
+ frequent    .80     .94     .005    .004
Noun Phrase translation
1) Select aligned sub-phrase with most frequent translation
2) Discard overlapping sub-phrases
3) Iterate
Example: advances in treatment of a wide variety of diseases
Sub-phrases: advances in treatment / treatment of a wide / wide variety / variety of diseases
- variety of diseases → tipo de enfermedades
- wide variety → (amplio)
- advances in treatment → avances en el tratamiento
- treatment of a wide → (amplio)
Result: avances en el tratamiento amplio tipo de enfermedades
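The three steps can be sketched as a greedy loop. The code below is only an illustration under stated assumptions: the `aligned` table, its frequencies, and the single-word entry for "wide" are invented, and the sketch does not model how the slides handle partially overlapping sub-phrases (such as "wide variety" contributing only "amplio").

```python
def translate_noun_phrase(words, aligned):
    """Greedy translation sketch.

    aligned: hypothetical dict mapping a tuple of words (a sub-phrase)
             to (translation, frequency of that translation).
    """
    # Collect every aligned sub-phrase occurrence as (start, end, translation, frequency).
    candidates = []
    n = len(words)
    for start in range(n):
        for end in range(start + 1, n + 1):
            hit = aligned.get(tuple(words[start:end]))
            if hit:
                candidates.append((start, end, hit[0], hit[1]))

    chosen = []
    while candidates:
        # Step 1: pick the sub-phrase whose translation is most frequent.
        best = max(candidates, key=lambda c: c[3])
        chosen.append(best)
        # Step 2: discard sub-phrases overlapping the chosen one (including itself).
        candidates = [c for c in candidates if c[1] <= best[0] or c[0] >= best[1]]
        # Step 3: iterate until no candidate is left.

    chosen.sort()  # restore source word order
    return [c[2] for c in chosen]


words = "advances in treatment of a wide variety of diseases".split()
aligned = {  # frequencies are invented for the example
    ("advances", "in", "treatment"): ("avances en el tratamiento", 20),
    ("variety", "of", "diseases"): ("tipo de enfermedades", 30),
    ("wide",): ("amplio", 5),
}
print(" ".join(translate_noun_phrase(words, aligned)))
```

Source words covered by no selected sub-phrase ("of", "a") are simply dropped in this sketch.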
Is this document relevant?
Source: Oard 2000
Systran
UNED @ iCLEF’2001
Noun phrases
Results

System       Precision     Recall
Systran MT   0.48          0.22
UNED NPs     0.47 (-2%)    0.34 (+52%)

- cf. U. Maryland experiment: word-by-word translation substantially worse than Systran.
CLIR query formulation
Reference system
- Assisted word-by-word translation.
UNED system
- Assisted formulation by phrases.
- Automatic translation using alignment.
UNED @ iCLEF’2002
UNED query formulation
UNED relevance feedback
Results

System      F (α = 0.8)
Reference   .23
UNED        .37 (+65%)

Statistical significance: p < 0.05 (linear mixed-effects model + ANOVA).
Initial query formulation

System      Average time
Reference   286 s.
UNED        44 s.

System      P @ 20
Reference   .19
UNED        .29
And what about WSD?
- Supervised systems have little to be supervised with...
  – Research on unsupervised systems (Senseval 2)
  – Solve the acquisition bottleneck of supervised systems: obtain training instances automatically.
- Better understanding of the problem: sense inventories, test suites, polysemy.
WSD is harder than the applications
- IR → WSD: automatic assignment of web directories to word senses (Computational Linguistics, to appear)
- MT → WSD: use aligned phrases for partial disambiguation (no need for parallel corpora!) (work in progress)
- WSD: go to the basics: study sense inventories, and polysemy distinctions for clustering (SIGLEX 00, 02)
WordNet senses ↔ web dirs
- Circuit 1 (electrical circuit)
  business/industries/electronics and electrical/contract manufacturers (.98)
  manufacturers/printed circuit boards/fabrication (.88)
  computers/cad/electronic design automation... (.78)
  Sense specializations: business/industries/electronics/components/integrated circuits (.98)
- Circuit 2 (tour, journey around a particular area)
  Sports/cycling/travel/travelogues/europe/france (.58)
  Regional/asia/nepal/travel and tourism/travel guides (.66)
- Circuit 5 (racing circuit)
  Sports/motorsports/auto racing/stock cars/drivers and teams (.78)
  Sports/motorsports/auto racing/tracks (.82)
  Sports/motorsports/auto racing/driving schools (.78)
Applications
- Automatic acquisition of training corpora.
- Sense clustering.
- Increase lexical coverage
  – Specialized senses (integrated circuit)
  – New senses (oasis, jaguar, tiger)
Cleaner than the full web as corpus
More stable: ODP is downloadable!
Fewer redistribution problems
Algorithm
1. Retrieve ODP directories for every word sense with WordNet-based queries.
2. Directory: extract lemmas in every directory full path.
3. Word sense: extract lemmas in related synsets, including hyperonym chain.
4. Compare both representations for co-occurrence.
5. Apply heuristic filters.
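Steps 2 to 5 can be sketched roughly as below. This is a toy illustration under loud assumptions: the sense lemma sets are invented (in the real setting they would come from WordNet synsets and hyperonyms), lower-cased path words stand in for lemmas, the overlap ratio is one arbitrary way to compare representations, and the `min_score` threshold is a placeholder for the unspecified heuristic filters. The scores it produces are unrelated to the confidence values shown on the slides.

```python
import re

def path_lemmas(path):
    """Step 2 (simplified): lower-cased words of the full directory path stand in for lemmas."""
    return set(re.findall(r"[a-z]+", path.lower()))

def overlap_score(path, sense_lemmas):
    """Step 4 (simplified): fraction of the sense's lemmas that occur in the directory path."""
    if not sense_lemmas:
        return 0.0
    return len(path_lemmas(path) & sense_lemmas) / len(sense_lemmas)

def best_sense(path, senses, min_score=0.25):
    """Step 5 (placeholder filter): keep the best-scoring sense only if it reaches min_score."""
    scored = [(overlap_score(path, lemmas), name) for name, lemmas in senses.items()]
    score, name = max(scored)
    return name if score >= min_score else None

# Step 3 (invented data): lemmas from related synsets for two senses of "circuit".
SENSES = {
    "circuit_1": {"electrical", "electronics", "device", "equipment", "component"},
    "circuit_5": {"racing", "track", "motorsports", "racetrack"},
}

print(best_sense("business/industries/electronics and electrical/contract manufacturers", SENSES))
print(best_sense("Sports/motorsports/auto racing/tracks", SENSES))
```

A directory whose path shares no lemma with any sense is rejected (`None`), which is the role the heuristic filters play in this sketch.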
Evaluation: Senseval-2 nouns

# senses   # dirs   # labeled senses   # sense extensions   precision   Coverage
147        148      43                 28                   .86         .73/.88

Highly relevant   Mildly relevant   Irrelevant
.67               .17               .15
Processing of WordNet nouns

Candidate nouns          51,168
Candidate senses         73,612
Associations             29,291
Characterized nouns      24,558
Characterized senses     27,383
Sense specializations    1,800
Automatic acquisition of WSD training corpora
- Circuit 1 (electrical circuit)
  Electromechanical products for brand name firms; offers printed circuit boards (..)
  Offers surface mount, thru-hole, and flex circuit assembly, in circuit and functional
- Circuit 2 (tour, journey around a particular area)
  The Tour du Mont-Blanc is a circuit of 322 km based in the northern French Alps
  A virtual tour of the circuit by Raimon Bach
- Circuit 5 (racing circuit)
  The circuit is a smooth 536 yards of racing for Hot Rod and Stock Car’s at the east
  History of the circuit and its banked track and news of Formula 1
Results of supervised WSD

Word senses        # train instances   # train instances   # test instances   Recall       Recall
                   (Senseval)          (Directories)                          (Senseval)   (Directories)
Bar 1,10           127,11              1,1                 62,6               .91          .50
Child 1,2          39,78               3,80                35,27              .57          .44
Circuit 1,2,5      67,6,7              229,2,5             23,2,8             .70          .70
Facility 1,4       26,61               4,18                15,28              .79          .67
Grip 2,7           6,1                 17,6                4,0                1            1
Holiday 1,2        4,57                5,17                26,2               .96          .96
Material 1,4       65,7                63,10               30,9               .79          .79
Post 2,3,4,7,8     1,64,20,11,7        2,7,1,9,3           2,25,13,12,4       .45          .25
Restrain 1,4,6     17,32,11            2,2,2               8,14,4             .65          .50
Stress 1,2         3,45                8,50                1,19               .95          .95
TOTAL              773                 547                 379                .73          .58

The same table is shown again under the titles "Comparable training material" and "Incorrect directories". On the latter, the starred senses are: Post 3, 4, 8; Material 1, 4; Holiday 2; Grip 7.
Characterization of sense inventories for WSD
- Given two senses of a word,
  – How are they related? (polysemy relations)
  – How closely? (sense proximity)
  – In what applications should they be distinguished?
- Given an individual sense of a word
  – Should it be split into subsenses? (sense stability)
Semantic distance in WordNet
- WordNet conceptual relations (Resnik, Agirre & Rigau, etc.) → topic relatedness?
- Cross-linguistic evidence: Resnik & Yarowsky 99.
Cross-linguistic evidence
Fine 40129 Mountains on the other side of the valley rose from the mist like islands, and here and there flecks of cloud as pale and <tag>fine</tag> as sea-spray, trailed across their sombre, wooded slopes. TRANSLATION: * *
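Sense proximity

$$P_L(\text{same lexicalization} \mid w_i, w_j) \equiv \frac{1}{|w_i|\,|w_j|} \sum_{\substack{x \in \{w_i\ \text{examples}\} \\ y \in \{w_j\ \text{examples}\}}} \mathbf{1}\left[\mathrm{tr}_L(x) = \mathrm{tr}_L(y)\right]$$

$$\mathrm{Proximity}(w_i, w_j) \equiv \frac{1}{|\text{languages}|} \sum_{L \in \text{languages}} P_L(\text{same lexicalization} \mid w_i, w_j)$$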
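As a worked illustration of the proximity definition, here is a small Python sketch. The data layout (a dict from language to the list of translations of a sense's examples), the function names, and the toy tokens `t1`…`t5` are assumptions made for the example only.

```python
def same_lexicalization(translations_i, translations_j):
    """P_L: fraction of example pairs (x, y) whose translations in language L coincide."""
    pairs = len(translations_i) * len(translations_j)
    same = sum(1 for x in translations_i for y in translations_j if x == y)
    return same / pairs

def proximity(sense_i, sense_j):
    """Average of P_L over the languages available for both senses.

    sense_i, sense_j: dict mapping language -> list of translations,
    one per annotated example of that sense (hypothetical layout).
    """
    languages = sense_i.keys() & sense_j.keys()
    return sum(same_lexicalization(sense_i[L], sense_j[L]) for L in languages) / len(languages)

# Toy data: t1..t5 are placeholder translation tokens, not real annotations.
sense_a = {"es": ["t1", "t1"], "ru": ["t2", "t3"]}
sense_b = {"es": ["t1", "t4"], "ru": ["t5", "t5"]}
print(proximity(sense_a, sense_b))  # es: 2 of 4 pairs match, ru: 0 of 4, average 0.25
```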
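Sense stability

$$\mathrm{Stability}(w_i) \equiv \frac{1}{|\text{languages}|} \sum_{L \in \text{languages}} \frac{1}{\left|\{x, y \in \{w_i\ \text{examples}\},\ x \neq y\}\right|} \sum_{\substack{x, y \in \{w_i\ \text{examples}\} \\ x \neq y}} \mathbf{1}\left[\mathrm{tr}_L(x) = \mathrm{tr}_L(y)\right]$$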
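The stability index can be illustrated the same way. In this sketch the data layout (language → list of translations of one sense's examples) and the toy tokens are assumptions. Unordered pairs of distinct examples are used, which gives the same ratio as counting ordered pairs.

```python
from itertools import combinations

def stability(sense):
    """Average, over languages, of the fraction of distinct example pairs
    of one sense that receive the same translation.

    sense: dict mapping language -> list of translations, one per example
    (hypothetical layout; each list needs at least two examples).
    """
    total = 0.0
    for translations in sense.values():
        pairs = list(combinations(translations, 2))  # pairs of distinct examples
        total += sum(1 for x, y in pairs if x == y) / len(pairs)
    return total / len(sense)

# Toy data: placeholder translation tokens, not real annotations.
sense = {"es": ["t1", "t1", "t2"], "ru": ["t3", "t3", "t3"]}
print(stability(sense))  # es: 1 of 3 pairs agree, ru: 3 of 3, average 2/3
```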
Experiment design
MAIN SET
- 44 words (nouns and adjectives)
- 182 senses
- 508 examples
- 11 native/bilingual speakers of 4 languages: Bulgarian, Russian, Spanish, Urdu
Results: distribution of proximity indexes
[Figure: histogram of # pairs by proximity]
Average proximity = 0.29
Results: distribution of stability indexes
[Figure: histogram of # senses by stability]
Average stability = 0.80
Re-scoring Senseval systems
SENSEVAL-2 score = P(correct sense)
NEW score = ∑_i proximity(i, correct) · P(i)
[Figure: scores of supervised systems (25) and unsupervised systems (10), with and without proximity matrices]
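To make the re-scoring concrete, the sketch below computes both scores for a single test instance. It assumes that a system outputs a probability for each sense, that proximities are stored in a dict keyed by unordered sense pairs, and that a sense has proximity 1 with itself. These are conventions chosen for the example and are not stated on the slide.

```python
def senseval_score(probs, correct):
    """Original score: probability mass assigned to the correct sense."""
    return probs.get(correct, 0.0)

def proximity_score(probs, correct, prox):
    """New score: each answer is credited by its proximity to the correct sense.

    probs: dict sense -> probability assigned by the system.
    prox:  dict frozenset({sense_a, sense_b}) -> proximity (hypothetical layout).
    Assumption: proximity(correct, correct) = 1; missing pairs count as 0.
    """
    total = 0.0
    for sense, p in probs.items():
        weight = 1.0 if sense == correct else prox.get(frozenset((sense, correct)), 0.0)
        total += weight * p
    return total

# Toy instance: the system splits its answer between two senses.
probs = {"s1": 0.6, "s2": 0.4}
prox = {frozenset(("s1", "s2")): 0.5}
print(senseval_score(probs, "s1"))         # 0.6
print(proximity_score(probs, "s1", prox))  # 0.6 + 0.5 * 0.4 = 0.8
```

Under these assumptions the new score is never lower than the original one, since wrong answers earn partial credit in proportion to how close they are to the correct sense.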
Distribution of metaphors
[Figure: histogram of # sense pairs by proximity]
Suggestions for Meaning
- IR → WSD: Use web directories
- MT → WSD: Use comparable corpora
- Lexical representation: enrich WordNet with polysemy relations
- Evaluation: Focus on interactive applications. Beware of the semantic web!
More info
http://nlp.uned.es
Typology of Polysemic Relations
- METONYMY
(post-letters vs. post-delivery)
- METAPHOR
(window-house vs. window-computer)
- SPECIALIZATION / GENERALIZATION
(fine-ok vs. fine-greeting)
- HOMONYMY (no relation)
(bar-law vs. bar-unit_of_pressure)
Distribution of homonyms
[Figure: histogram of # sense pairs by proximity]
Distribution of metonymy
[Figure: histogram of # sense pairs by proximity]
Distribution of specialization/generalization
[Figure: histogram of # sense pairs by proximity]