SLIDE 1
0. Word Sense Disambiguation (WSD)
Based on Foundations of Statistical NLP by C. Manning & H. Schütze, ch. 7, MIT Press, 2002

1. WSD Examples
They have the right to bear arms. (Romanian: drept)
The sign on the right was bent. (Romanian: direcție)
SLIDE 2
SLIDE 3
Plan
- 1. Supervised WSD
  1.1 A Naive Bayes learning algorithm for WSD
  1.2 An information-theoretic algorithm for WSD
- 2. Unsupervised WSD (clustering)
  2.1 WS clustering: the EM algorithm
  2.2 Constraint-based WSD: "one sense per discourse, one sense per collocation": Yarowsky's algorithm
  2.3 Resource-based WSD
    2.3.1 Dictionary-based WSD: Lesk's algorithm
    2.3.2 Thesaurus-based WSD: Walker's algorithm; Yarowsky's algorithm
SLIDE 4
1.1 Supervised WSD through Naive Bayesian Classification
s′ = argmax_{s_k} P(s_k | c)
   = argmax_{s_k} P(c | s_k) P(s_k) / P(c)
   = argmax_{s_k} P(c | s_k) P(s_k)
   = argmax_{s_k} [log P(c | s_k) + log P(s_k)]
   = argmax_{s_k} [log P(s_k) + Σ_{w_j in c} log P(w_j | s_k)]

where we used the Naive Bayes assumption:
P(c | s_k) = P({w_j | w_j in c} | s_k) = Π_{w_j in c} P(w_j | s_k)

The maximum likelihood estimates are:
P(w_j | s_k) = C(w_j, s_k) / C(s_k)   and   P(s_k) = C(w, s_k) / C(w)

where:
C(w_j, s_k) = number of occurrences of word w_j in contexts of w with the sense s_k
C(s_k) = number of context words occurring with the sense s_k (= Σ_j C(w_j, s_k))
C(w, s_k) = number of occurrences of the ambiguous word w with the sense s_k
C(w) = number of occurrences of the ambiguous word w
all counted in the training corpus.
SLIDE 5
A Naive Bayes Algorithm for WSD
comment: training
for all senses s_k of w do
  for all words w_j in the vocabulary do
    P(w_j | s_k) = C(w_j, s_k) / C(s_k)
  end
end
for all senses s_k of w do
  P(s_k) = C(w, s_k) / C(w)
end
comment: disambiguation
for all senses s_k of w do
  score(s_k) = log P(s_k)
  for all words w_j in the context window c do
    score(s_k) = score(s_k) + log P(w_j | s_k)
  end
end
choose s′ = argmax_{s_k} score(s_k)
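The training and disambiguation steps above can be sketched in Python. This is an illustrative implementation, not from the slides; add-one smoothing is an addition, because the raw MLE assigns zero probability to word/sense pairs unseen in training.

```python
from collections import defaultdict
from math import log

def train_naive_bayes(instances):
    """Estimate P(s_k) and P(w_j | s_k) from (context, sense) training pairs.
    Add-one smoothing (an addition to the slide's MLE) avoids zero counts."""
    sense_count = defaultdict(int)       # C(w, s_k)
    pair_count = defaultdict(int)        # C(w_j, s_k)
    vocab = set()
    for context, sense in instances:
        sense_count[sense] += 1
        for wj in context:
            pair_count[(wj, sense)] += 1
            vocab.add(wj)
    total = sum(sense_count.values())    # C(w)
    prior = {s: c / total for s, c in sense_count.items()}
    cond = {}
    for s in sense_count:
        denom = sum(pair_count[(wj, s)] for wj in vocab) + len(vocab)
        for wj in vocab:
            cond[(wj, s)] = (pair_count[(wj, s)] + 1) / denom
    return prior, cond, vocab

def disambiguate(context, prior, cond, vocab):
    """score(s_k) = log P(s_k) + sum over context words of log P(w_j | s_k)."""
    scores = {}
    for s in prior:
        scores[s] = log(prior[s]) + sum(
            log(cond[(wj, s)]) for wj in context if wj in vocab)
    return max(scores, key=scores.get)
```

Trained on a handful of labelled contexts of, say, bank, the classifier picks the sense whose context words co-occurred with it most often in training.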
SLIDE 6
1.2 An Information Theoretic approach to WSD
Remark: The Naive Bayes classifier uses information from all the words in the context window to make the disambiguation decision, at the cost of a somewhat unrealistic independence assumption. The information-theoretic ("Flip-Flop") algorithm that follows does the opposite: it tries to find a single contextual feature that reliably indicates which sense of the ambiguous word is being used.
Empirical result: the Flip-Flop algorithm improved the accuracy of a machine translation system by 20%.
SLIDE 7
Example: highly informative indicators for 3 ambiguous French words

Ambiguous word / indicator / examples (value → sense):
prendre (indicator: object): mesure → to take; décision → to make
vouloir (indicator: tense): present → to want; conditional → to like
cent (indicator: word to the left): per → %; number → c. [money]
SLIDE 8
Notations
t1, ..., tm: translations of the ambiguous word
  example: prendre → take, make, rise, speak
x1, ..., xn: possible values of the indicator
  example: mesure, décision, example, note, parole

Mutual information:
I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [p(x, y) / (p(x) p(y))]

Note: The Flip-Flop algorithm only disambiguates between 2 senses. For the more general case, see [Brown et al., 1991a].
SLIDE 9
Brown et al.’s WSD (“Flip-Flop”) algorithm: Finding indicators for disambiguation
find a random partition P = {P1, P2} of {t1, ..., tm}
while (I(P; Q) is improving) do
  find the partition Q = {Q1, Q2} of {x1, ..., xn} that maximizes I(P; Q)
  find the partition P = {P1, P2} of {t1, ..., tm} that maximizes I(P; Q)
end
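The alternating search above can be sketched in Python. This is a brute-force toy version, assuming the sets of translations and indicator values are small enough to enumerate every 2-partition; a real implementation would use the splitting theorem instead of exhaustive search. All function names are illustrative.

```python
from itertools import combinations
from math import log2

def mutual_information(pairs, P1, Q1):
    """I(P;Q) over the 2x2 table induced by partitions {P1, P2} and {Q1, Q2}
    on observed (translation, indicator_value) pairs."""
    n = len(pairs)
    joint = [[0.0, 0.0], [0.0, 0.0]]
    for t, x in pairs:
        joint[0 if t in P1 else 1][0 if x in Q1 else 1] += 1 / n
    mi = 0.0
    for i in range(2):
        for j in range(2):
            p = joint[i][j]
            row = joint[i][0] + joint[i][1]
            col = joint[0][j] + joint[1][j]
            if p > 0:
                mi += p * log2(p / (row * col))
    return mi

def best_partition(items, score):
    """Exhaustively search 2-partitions {S, items - S} maximizing score(S)."""
    items = list(items)
    best, best_s = None, float("-inf")
    for r in range(1, len(items)):
        for subset in combinations(items, r):
            s = score(set(subset))
            if s > best_s:
                best, best_s = set(subset), s
    return best, best_s

def flip_flop(pairs, max_iter=10):
    """pairs: observed (translation, indicator_value) occurrences."""
    translations = {t for t, _ in pairs}
    values = {x for _, x in pairs}
    P1 = {sorted(translations)[0]}       # arbitrary initial partition
    Q1, prev = None, -1.0
    for _ in range(max_iter):
        Q1, _ = best_partition(values, lambda Q: mutual_information(pairs, P1, Q))
        P1, mi = best_partition(translations, lambda P: mutual_information(pairs, P, Q1))
        if mi <= prev + 1e-12:           # stop: I(P;Q) no longer increases
            break
        prev = mi
    return P1, Q1
```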
SLIDE 10
“Flip-Flop” algorithm
Note: Using the splitting theorem [Breiman et al., 1984], it can be shown that the Flip-Flop algorithm monotonically increases I(P; Q). Stopping criterion: I(P; Q) no longer increases (significantly).
SLIDE 11
“Flip-Flop” algorithm: Disambiguation
For each occurrence of the ambiguous word, determine the value x_i of the indicator; if x_i ∈ Q1, assign the occurrence sense 1; if x_i ∈ Q2, assign the occurrence sense 2.
SLIDE 12
A running example
- 1. A randomly chosen partition P = {P1, P2}:
  P1 = {take, rise}, P2 = {make, speak}
- 2. Maximizing I(P; Q) over Q, using the (presumed) data:
  take: a measure, notes, an example
  make: a decision, a speech
  rise: to speak
  Q1 = {measure, note, example}, Q2 = {décision, parole}
- 3. Maximizing I(P; Q) over P:
  P1 = {take}, P2 = {make, rise, speak}
  Note: consider more than 2 'senses' to distinguish between {make, rise, speak}.
SLIDE 13
- 2. Unsupervised word sense clustering
2.1 The EM algorithm
Notation:
K: the number of desired senses;
c1, c2, ..., cI: the contexts of the ambiguous word in the corpus;
w1, w2, ..., wJ: the words used as disambiguating features.

Parameters of the model (µ):
P(w_j | s_k), 1 ≤ j ≤ J, 1 ≤ k ≤ K, and P(s_k), 1 ≤ k ≤ K.

Given µ, the log-likelihood of the corpus C is computed as:
l(C | µ) = log Π_{i=1}^I P(c_i) = log Π_{i=1}^I Σ_{k=1}^K P(c_i | s_k) P(s_k) = Σ_{i=1}^I log Σ_{k=1}^K P(c_i | s_k) P(s_k)

Note: to compute P(c_i | s_k), use the Naive Bayes assumption: P(c_i | s_k) = Π_{w_j in c_i} P(w_j | s_k).
SLIDE 14
Procedure:
- 1. Initialize the parameters of the model µ randomly.
- 2. While l(C | µ) is improving, repeat:
  - a. E-step: estimate the (posterior) probability that the sense s_k generated the context c_i:
    h_ik = P(c_i | s_k) / Σ_{k'=1}^K P(c_i | s_k')
  - b. M-step: re-estimate the parameters P(w_j | s_k) and P(s_k) by way of MLE:
    P(w_j | s_k) = (Σ_{{c_i : w_j in c_i}} h_ik) / Z_k
    P(s_k) = (Σ_{i=1}^I h_ik) / (Σ_{k'=1}^K Σ_{i=1}^I h_ik') = (Σ_{i=1}^I h_ik) / I
    where Z_k is a normalizing constant: Z_k = Σ_{j=1}^J Σ_{{c_i : w_j in c_i}} h_ik.
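The procedure above can be sketched compactly in Python. This is an illustrative toy version: it works in raw probability space (so it is only suitable for short contexts; real implementations work in log space), it includes the prior P(s_k) in the E-step numerator, and the tiny count floor is an assumption added to avoid zero probabilities.

```python
import random

def em_sense_clustering(contexts, K, iterations=50, seed=0):
    """Cluster contexts (lists of feature words) of an ambiguous word
    into K senses. Returns (P(s_k), P(w_j|s_k), soft assignments h[i][k])."""
    rng = random.Random(seed)
    vocab = sorted({w for c in contexts for w in c})
    # random initialization of the model parameters mu
    p_s = [1.0 / K] * K
    p_w = []
    for _ in range(K):
        raw = {w: rng.random() + 0.01 for w in vocab}
        z = sum(raw.values())
        p_w.append({w: v / z for w, v in raw.items()})
    h = []
    for _ in range(iterations):
        # E-step: h_ik proportional to P(c_i|s_k) P(s_k), naive-Bayes P(c_i|s_k)
        h = []
        for c in contexts:
            scores = []
            for k in range(K):
                p = p_s[k]
                for w in c:
                    p *= p_w[k][w]
                scores.append(p)
            z = sum(scores) or 1.0
            h.append([s / z for s in scores])
        # M-step: re-estimate P(s_k) and P(w_j|s_k) from the soft counts
        I = len(contexts)
        for k in range(K):
            p_s[k] = sum(h[i][k] for i in range(I)) / I
            soft = {w: 1e-9 for w in vocab}   # tiny floor (assumption)
            for i, c in enumerate(contexts):
                for w in c:
                    soft[w] += h[i][k]
            z = sum(soft.values())
            p_w[k] = {w: soft[w] / z for w in vocab}
    return p_s, p_w, h
```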
SLIDE 15
2.2 Constraint-based WSD: "one sense per discourse, one sense per collocation" [Yarowsky, 1995]

One sense per discourse: the sense of a target word is highly consistent within any given document.
One sense per collocation: nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order, and syntactic relationship.
SLIDE 16
Yarowsky Algorithm: WSD by constraint propagation
comment: initialization
for all senses s_k of w do
  F_k = the set of features (words) in s_k's dictionary definition
  E_k = ∅
end
comment: one sense per discourse
while (at least one E_k changed in the last iteration) do
  for all senses s_k of w do
    comment: identify the contexts c_i bearing the sense s_k
    E_k = {c_i | ∃ f_m ∈ F_k : f_m ∈ c_i}
  end
  for all senses s_k of w do
    comment: retain the features f_m which best express the sense s_k
    F_k = {f_m | ∀n ≠ k : P(s_k | f_m) / P(s_n | f_m) > α}, where P(s_i | f_m) = C(f_m, s_i) / Σ_j C(f_m, s_j)
  end
end
SLIDE 17
Yarowsky Algorithm (Cont’d)
comment: one sense per collocation
determine the majority sense s_k of w in the document d:
  s_k = argmax_{s_i} P(s_i), where P(s_i) = (Σ_{m∈F_i} C(f_m, s_i)) / (Σ_j Σ_{m∈F_i} C(f_m, s_j))
assign all occurrences of w in the document d the sense s_k
end
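The discourse loop of the algorithm can be sketched as follows. This is a toy version: the threshold α, the seed features, and the set representation of contexts are illustrative, and the majority-sense (one sense per collocation) step above is left out.

```python
def yarowsky(contexts, seed_features, alpha=2.0, max_iter=20):
    """contexts: list of sets of words around the ambiguous word.
    seed_features: {sense: set of seed words from its dictionary definition}.
    Iteratively grow E_k (contexts bearing sense k) and re-select features
    F_k whose sense distribution is skewed by more than alpha."""
    F = {s: set(fs) for s, fs in seed_features.items()}
    E = {s: set() for s in F}
    for _ in range(max_iter):
        # E_k = contexts containing at least one feature of F_k
        newE = {s: {i for i, c in enumerate(contexts) if F[s] & c} for s in F}
        if newE == E:                    # no E_k changed: stop
            break
        E = newE
        # C(f, s): occurrences of feature f in contexts labeled with sense s
        count = {}
        for s, idxs in E.items():
            for i in idxs:
                for f in contexts[i]:
                    count[(f, s)] = count.get((f, s), 0) + 1
        feats = {f for (f, _s) in count}
        senses = list(F)
        # keep f for sense s iff C(f,s) > alpha * C(f,n) for every other sense n
        # (the denominators of P(s|f) cancel in the ratio of the slide)
        for s in senses:
            F[s] = {f for f in feats
                    if all(count.get((f, s), 0) > alpha * count.get((f, n), 0)
                           for n in senses if n != s)}
    return E, F
```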
SLIDE 18
2.3 Resource-based WSD 2.3.1 Dictionary-based WSD
Example: two senses of ash

  sense               definition
  s1: tree            D1: a tree of the olive family
  s2: burned stuff    D2: the solid residue left when combustible material is burned

Disambiguation of ash using Lesk's algorithm (see next slide):

  context                                               score(s1)  score(s2)
  This cigar burns slowly and creates a stiff ash.          0          1
  The ash is one of the last trees to come into leaf.       1          0
SLIDE 19
Dictionary-based WSD: Lesk’s algorithm
comment: scoring
for all senses s_k of w do
  score(s_k) = overlap(D_k, ∪_{v_j in c} E_{v_j})
end
comment: disambiguation
choose s′ = argmax_{s_k} score(s_k)

where:
  D_k is the set of words occurring in the dictionary definition of the sense s_k of w;
  E_{v_j} is the set of words occurring in the dictionary definition of the context word v_j (the union of all its sense definitions).
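The overlap scoring can be sketched in Python. This is illustrative, not from the slides; the small stop-word list is an addition, since function words like "a" and "the" otherwise dominate the overlap counts.

```python
STOPWORDS = {"a", "an", "the", "of", "to", "is", "be", "when", "on", "for", "or", "in"}

def gloss_words(text):
    """Content words of a dictionary definition."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def lesk(context, sense_definitions, dictionary):
    """Score each sense s_k of the ambiguous word by the overlap between D_k
    (its definition) and the union of E_vj (definitions of context words).
    sense_definitions: {sense: definition string};
    dictionary: {context_word: [definition strings, one per sense]}."""
    context_gloss = set()                        # union of the E_vj
    for v in context:
        for definition in dictionary.get(v, []):
            context_gloss |= gloss_words(definition)
    best, best_score = None, -1
    for sense, definition in sense_definitions.items():
        score = len(gloss_words(definition) & context_gloss)
        if score > best_score:
            best, best_score = sense, score
    return best
```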
SLIDE 20
2.3.2 Thesaurus-based WSD
Example:

  Word      Sense                Thesaurus category (Roget)
  bass      musical senses       music
            fish                 animal, insect
  star      space object         universe
            celebrity            entertainer
            star-shaped object   insignia
  interest  curiosity            reasoning
            advantage            injustice
            financial            debt
            share                property
SLIDE 21
Thesaurus-based WSD: Walker’s algorithm
comment: given: context c
for all senses s_k of w do
  score(s_k) = Σ_{w_j in c} δ(t(s_k), w_j)
  comment: score = number of context words compatible with the category of s_k
end
comment: disambiguation
choose s′ = argmax_{s_k} score(s_k)

where:
  t(s_k) is the thesaurus category of the sense s_k;
  δ(t(s_k), w_j) = 1 if t(s_k) is one of the thesaurus categories for w_j, and 0 otherwise.
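Walker's scoring translates almost directly into Python; the dictionaries mapping senses and words to thesaurus categories are assumed given (their contents below are illustrative).

```python
def walker(context, sense_categories, word_categories):
    """sense_categories: {sense: t(s_k)}, its thesaurus category;
    word_categories: {word: set of thesaurus categories}.
    score(s_k) counts context words compatible with the category of s_k."""
    scores = {
        sense: sum(1 for wj in context if cat in word_categories.get(wj, set()))
        for sense, cat in sense_categories.items()
    }
    return max(scores, key=scores.get)
```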
SLIDE 22
Thesaurus-based WSD: Yarowsky’s algorithm
comment: characterize words by the contexts/topics in which they occur
for all words w_j in the vocabulary do
  C_j = {c | w_j in c}
end
for all topics t_l do
  T_l = {c | t_l ∈ t(c)}
end
for all words w_j and all topics t_l do
  P(w_j | t_l) = |C_j ∩ T_l| / Σ_j |C_j ∩ T_l|
end
for all topics t_l do
  P(t_l) = (Σ_j |C_j ∩ T_l|) / (Σ_l Σ_j |C_j ∩ T_l|)
end
comment: disambiguation
for all senses s_k of w occurring in c do
  score(s_k) = log P(t(s_k)) + Σ_{w_j in c} log P(w_j | t(s_k))
end
choose s′ = argmax_{s_k} score(s_k)
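Both phases can be sketched in Python. This is illustrative: it assumes topic-annotated training contexts are available, and the small `eps` floor for unseen word/topic pairs is an addition not in the slide.

```python
from math import log

def train_topic_model(contexts, context_topics):
    """contexts: list of sets of words; context_topics: list of topic sets t(c).
    Returns P(w_j | t_l) = |C_j ∩ T_l| / Σ_j |C_j ∩ T_l| and P(t_l)."""
    counts = {}                          # counts[(w, t)] = |C_j ∩ T_l|
    for c, topics in zip(contexts, context_topics):
        for t in topics:
            for w in c:
                counts[(w, t)] = counts.get((w, t), 0) + 1
    topic_total = {}                     # Σ_j |C_j ∩ T_l| per topic
    for (w, t), n in counts.items():
        topic_total[t] = topic_total.get(t, 0) + n
    grand = sum(topic_total.values())
    p_w_t = {(w, t): n / topic_total[t] for (w, t), n in counts.items()}
    p_t = {t: n / grand for t, n in topic_total.items()}
    return p_w_t, p_t

def disambiguate_by_topic(context, sense_topic, p_w_t, p_t, eps=1e-9):
    """score(s_k) = log P(t(s_k)) + sum over context of log P(w_j | t(s_k))."""
    scores = {}
    for sense, t in sense_topic.items():
        scores[sense] = log(p_t.get(t, eps)) + sum(
            log(p_w_t.get((wj, t), eps)) for wj in context)
    return max(scores, key=scores.get)
```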
SLIDE 23
Addenda: The t-Test for Comparing the Performance of two Systems
- 1. Dividing the data into n parts, compute the means µ1, µ2 and the variances s1², s2² of the two systems' scores.
- 2. Compute the t-value: t = (µ1 − µ2) / √(s²/n), with s² = (s1² + s2²) / (n − 1).
- 3. Find C, the confidence level corresponding to the computed t-value in the t-table, with 2(n − 1) degrees of freedom.