Natural Language Processing: Word Sense Disambiguation (Roman Kern) - PowerPoint PPT Presentation


SLIDE 1

SCIENCE PASSION TECHNOLOGY

Natural Language Processing: Word Sense Disambiguation

Roman Kern <rkern@tugraz.at> 2020-05-28

Roman Kern <rkern@tugraz.at>, Institute for Interactive Systems and Data Science 2020-05-28 1

SLIDE 2

Outline

1. Introduction
2. General Observations
3. Approaches
4. Evaluation, Applications, Tools

SLIDE 3

Introduction

Ambiguous Words

SLIDE 4

Motivational Example
Given a single (written) word, e.g., paper
Depending on the context, the word might have different meanings, e.g., newspaper or writing material
In short: words are ambiguous

SLIDE 5

Motivational Example

Sense: Sentence using that sense
Substance: That statue is made out of paper
Sheets of material: He needs some paper to draw on
Material with writing: Hand her that paper to read
Meaning of the writing: Did you understand that paper?
Oral presentation: I want to go hear his paper
News source: I read the paper every morning
Newspaper company: The paper might go out of business
Company representative: The paper called about doing an interview with you
Editorial policies: The paper is very pro-Illinois
Class report: I have to go turn in my paper
Wall covering: She got the most beautiful paper for her bedroom walls
Gift wrap: He tore open the paper to get at the present
Commercial paper: The paper on that silver mine is worth 10% on the dollar

Klein, D. and Murphy, G. 2002. Paper has been my ruin: conceptual relations of polysemous senses. Journal of Memory and Language 47, 4 (Nov. 2002), 548–570. DOI: https://doi.org/10.1016/S0749-596X(02)00020-7.

SLIDE 6

Recommended Literature
WSD Book

http://www.wsdbook.org

Papers

Navigli, R. 2009. Word Sense Disambiguation: A Survey. ACM Computing Surveys (CSUR) 41, 2 (2009), 10. DOI: https://doi.org/10.1145/1459352.1459355.
Iacobacci, I., Pilehvar, M.T. and Navigli, R. 2016. Embeddings for word sense disambiguation: An evaluation study. 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Long Papers. 2 (2016), 897–907. DOI: https://doi.org/10.18653/v1/p16-1085.

SLIDE 7

Polysemy and Homonymy
Sense vs. meaning
e.g., table (to put stuff on) and table (as in a spreadsheet)
Different meanings, but shared etymology → polysemous word
Different interpretations of a homonym are referred to as meanings, resulting in different lexical entries
Those of a polysemous word are referred to as senses

→ the ambiguity of words lies on a spectrum

Homograph: different words, same spelling
Homophone: different words, same sound
Contronym (auto-antonym): ambiguous word with contradictory senses (e.g., dust)

SLIDE 8

How Senses Evolve
Example: iron
Material: Much of the Erzberg consisted of iron.
Product: Voestalpine produces high-quality iron out of iron.
Object: The electric clothing iron might not even be made out of iron.

The evolution of senses is often similar (material → product).

SLIDE 9

Related Tasks
Word Sense Disambiguation
... the task of identifying, for a single instance (e.g., a word in a sentence), its correct sense
Typically given a list of possible senses (closed class)
Word Sense Induction and Disambiguation
... the task of identifying the different senses of a word
Typically without a pre-defined set of senses

Note: Related tasks include cross-lingual WSD, multi-lingual WSD, and entity disambiguation (where typically named entities are ambiguous, e.g., Aberdeen)

SLIDE 10

History
WSD was first formulated in the late 1940s
Weaver (1949): “[...] central word in question but also say N words on either side, then, [...] one can unambiguously decide the meaning”
Zipf (1949): more frequent words have more senses than less frequent words, in a power-law relationship
Acknowledgement of the hardness of the problem (1950s/60s)
Bar-Hillel (1960): “no existing or imaginable program will enable an electronic computer to determine that the word pen is used in its ‘enclosure’ sense”

SLIDE 11

History
WSD was held back by the knowledge acquisition bottleneck (1970s)
Knowledge sources need to be hand-crafted
Turning point for WSD in the 1980s
Usage of corpora, like the “Oxford Advanced Learner’s Dictionary of Current English” and “Roget’s International Thesaurus”
Dictionary-based approach to WSD

Downside: not robust due to lack of coverage

SLIDE 12

History
“Statistical revolution” in the 1980s/90s
Application of statistical and machine learning approaches to WSD
Evaluation initiatives emerged in the late 1990s and early 2000s
Needed to be able to compare approaches
Most prominently, the Senseval (and later SemEval) series

SLIDE 13

History
“Deep-learning revolution” in recent years
Adoption of word embeddings
Contextual word embeddings also imply (to a certain degree) WSD
Utilisation of various neural network architectures for the task, e.g., LSTMs, CNNs
End-to-end learning, i.e., WSD is also implicitly taken care of [1]

[1] Raganato, A., Bovi, C.D. and Navigli, R. 2017. Neural sequence learning models for word sense disambiguation. EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings (2017), 1156–1167. DOI: https://doi.org/10.18653/v1/d17-1120.

SLIDE 14

General Observations

... and starting point of solutions

SLIDE 15

How Hard is it?
Human performance
Humans require just 2 words of context (on either side) to infer the sense (equivalent to the whole sentence, Kaplan 1950)
Caveat: even humans agree only to a certain extent (figures as low as 85% have been reported)
Machine performance
WSD has been considered to be AI-complete [1]
... since it requires knowledge (of the world)

[1] Mallery, J.C. 1988. Thinking about foreign policy: Finding an appropriate role for artificially intelligent computers. Ph.D. dissertation. MIT Political Science Department, Cambridge, MA.

SLIDE 16

Language Specific
Ambiguity prevails in many human languages
... not only limited to the senses of words
English: the 121 most frequent English nouns have on average 7.8 meanings each [1]
The senses are not aligned between languages
The senses also depend on the domain
Senses come and go (diachronic)

[1] Ng, Hwee Tou & Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, California, 40–47.

SLIDE 17

Cross-Lingual WSD

Apidianaki, M., Ljubešić, N. and Fišer, D. 2013. Cross-lingual WSD for Translation Extraction from Comparable Corpora. Proceedings of the Sixth Workshop on Building and Using Comparable Corpora.

SLIDE 18

Important Hypotheses
Semantics of words
Distributional Hypothesis
First described by Harris in 1954; states that words which tend to occur together are semantically related. Firth described this intuition as “a word is characterised by the company it keeps”.
Strong Contextual Hypothesis
Proposed by Miller and Charles in 1991; says that the more similar the contexts of two words are, the more semantically related the words are.

Note: Linguists also use the term context to refer to situational or social context (pragmatics).

SLIDE 19

General Observations
Selectional preference
The most frequent sense (MFS) is typically a good choice (if one would have to guess)
Also found to be highly domain dependent
Figures of 69% precision reported in the literature [1]
In many cases MFS is a competitive baseline

[1] Agirre, E. and Martinez, D. 2001. Learning class-to-class selectional preferences. In Proceedings of the 5th Conference on Computational Natural Language Learning (CoNLL, Toulouse, France). 15–22.

SLIDE 20

General Observations
One sense per discourse
Within a single discourse, the words refer to the same sense
... can be approximated by same sense per paragraph/document
Works better for coarse-grained senses

For more fine-grained senses, 33% of words referred to different (fine-grained) senses [1]

[1] Krovetz, R. 1998. More than one sense per discourse. NEC Princeton NJ Labs., Research Memorandum. (1998), 1–10.

SLIDE 21

General Observations
One sense per collocation
The “closer” a word is to other words, the more often it is associated with the same sense
Works better for coarse-grained senses

A drop of 30% was found for finer-grained senses [1]

Does not appear to change across domains

[1] Martinez, D. and Agirre, E. 2000. One sense per collocation and genre/topic variations. Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora, 207–215.

SLIDE 22

General Observations Senses are domain-specific

Gliozzo, A., Magnini, B. and Strapparava, C. 2004. Unsupervised Domain Relevance Estimation for Word Sense Disambiguation. Proc. of the 2004 Conference on EMNLP (2004), 380–387.

SLIDE 23

Approaches

How to tackle the task of WSD

SLIDE 24

Overview
Knowledge-based methods
Use existing knowledge bases, e.g., lexicons
Corpus-based methods
Typically distributional approaches, e.g., co-occurrences
Often making use of machine learning
Supervised, semi-supervised, weakly-supervised and unsupervised
Translation-based methods

SLIDE 25

Knowledge-based methods
Lesk algorithm
Initially a lexicon-based approach
For each sense there is a distinct description
Compare the descriptions of all senses of all words in a sentence
Many variations of the original Lesk algorithm exist

SLIDE 26

Knowledge-based methods
Lesk algorithm

for each sense i of word w1:
    for each sense j of word w2:
        compute overlap(i, j) between the words of the definitions of both senses
find the i and j with the maximal overlap
assign sense i to w1, and sense j to w2
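A minimal sketch of the simplified (context-to-gloss) variant of Lesk: the sense whose gloss shares the most words with the sentence wins. The two glosses for "paper" below are invented for illustration, not taken from any real dictionary.

```python
# Simplified Lesk: pick the sense whose gloss overlaps most with the context.
def lesk(word, sentence, sense_glosses):
    """Return the sense id of `word` whose gloss overlaps most with `sentence`."""
    context = set(sentence.lower().split()) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses[word].items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical glosses for the ambiguous word "paper"
glosses = {
    "paper": {
        "news_source": "a daily publication with news articles to read",
        "material": "thin sheets of material used to write or draw on",
    }
}

print(lesk("paper", "I read the paper every morning for the news", glosses))
# → news_source
```

The pywsd package mentioned at the end of the deck provides ready-made Lesk variants along these lines.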

SLIDE 27

Knowledge-based methods
Other knowledge sources
Instead of a simple list of senses and descriptions, richer knowledge sources contain relationships between words
Typically organised as trees or graphs
Thesauri, taxonomies, ontologies, ... e.g., WordNet
Requires a similarity measure based on the distance within the graph

SLIDE 28

Machine learning methods
Typically used features
Local features: derived from the words surrounding the occurrence
Topical features: about the general topic of the text of the occurrence
Syntactic features: related to the syntax of the sentence
Semantic features: related to the domain, or the senses of close words

SLIDE 29

Machine learning methods
Typical feature generation
Following one-sense-per-collocation, often a window-based approach is followed
For each ambiguous target word
A context is defined, e.g., n words to the right and left of an occurrence
... where n typically ranges from 2 to 10
For each of the words within the window, features are derived
... and combined with global features (e.g., topical)
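The window-based feature generation above can be sketched in a few lines; this is a minimal illustration with n = 2, and the `word@offset` feature-naming scheme is an assumption for the example, not a standard.

```python
# Extract the words at positions -n..+n around the target as named features.
def window_features(tokens, target_index, n=2):
    feats = {}
    for offset in range(-n, n + 1):
        if offset == 0:
            continue  # skip the target word itself
        pos = target_index + offset
        if 0 <= pos < len(tokens):
            feats[f"word@{offset:+d}"] = tokens[pos].lower()
    return feats

tokens = "He needs some paper to draw on".split()
print(window_features(tokens, tokens.index("paper")))
# → {'word@-2': 'needs', 'word@-1': 'some', 'word@+1': 'to', 'word@+2': 'draw'}
```

In a full system such local features would be concatenated with topical, syntactic, and semantic features before being fed to a classifier.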

SLIDE 30

Machine learning methods
Supervised approaches
Consider WSD a classification task
All senses of an ambiguous word are known; for each occurrence the correct sense is inferred
Requires a labelled training dataset
Many classification algorithms can be applied
Often SVMs, LSTMs, CNNs

SLIDE 31

Machine learning methods
Semi-supervised and weakly-supervised approaches
Require only a small manually annotated dataset
Various heuristics for label propagation
Making use of monosemous relatives
Words with only a single sense that are synonyms of words with multiple senses
e.g., use a web search engine to collect a dataset

[1] Kilgarriff, A. and Grefenstette, G. 2003. Introduction to the special issue on the Web as corpus. Computational Linguistics 29, 3, 333–347.

SLIDE 32

Machine learning methods
Unsupervised approaches
No annotated dataset available
Optionally, also no fixed set of senses predefined for the ambiguous words (i.e., sense discrimination instead of sense labelling)
Approaches
Context clustering: represent the surroundings of an ambiguous target word
Word clustering: identification of (semantically) similar words to the target word
Graph partitioning: the text is represented as a graph, typically with co-occurrences as edges
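Context clustering can be illustrated with a deliberately tiny sketch: each occurrence of the ambiguous word "paper" is represented as a bag of context words and assigned to the nearest of two seed contexts by Jaccard similarity. Real systems use richer vector representations and proper clustering algorithms; the sentences and the choice of seeds here are invented for illustration.

```python
# Represent each occurrence context as a bag of words (minus the target).
def context(sentence, target="paper"):
    return set(sentence.lower().split()) - {target}

# Jaccard similarity between two bags of words.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

occurrences = [
    "I read the paper every morning",
    "She read the morning paper on the train",
    "He needs some paper to draw on",
]
# One seed context per hypothesised sense cluster
seeds = [context(occurrences[0]), context(occurrences[2])]

# Assign each occurrence to the most similar seed
clusters = [max(range(len(seeds)), key=lambda k: jaccard(context(s), seeds[k]))
            for s in occurrences]
print(clusters)
# → [0, 0, 1]
```

The two induced clusters correspond to the "news source" and "writing material" uses of paper.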

SLIDE 33

Cross-Lingual Methods
Intuition: senses only partially overlap across languages
i.e., a polysemous word in one language corresponds to multiple words in another language
Approaches
Making use of parallel corpora
Specialised knowledge bases like EuroWordNet

Also affected by the knowledge acquisition bottleneck

SLIDE 34

Embeddings
What about ambiguous words and word embeddings?
The embeddings appear to represent a linear combination of the senses [1]
Use embeddings as features for WSD [2]
Contextual word embeddings implicitly take care of WSD

[1] Arora, S., Li, Y., Liang, Y., Ma, T. and Risteski, A. 2016. Linear Algebraic Structure of Word Senses, with Applications to Polysemy. (2016).
[2] Iacobacci, I., Pilehvar, M.T. and Navigli, R. 2016. Embeddings for word sense disambiguation: An evaluation study. 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Long Papers. 2 (2016), 897–907. DOI: https://doi.org/10.18653/v1/p16-1085.

SLIDE 35

Evaluation, Applications, Tools

How good are the methods? What tools are out there?

SLIDE 36

Evaluation Expectations
Lower bound
Simple most frequent sense baseline
Upper bound
Inter-annotator agreement
Between 60% and 90%, for fine-grained to coarse-grained senses

The random sense heuristic is not a good baseline

SLIDE 37

Evaluation
Pseudo-words
Merge two different words into a single “new” word
Measure how well a WSD method splits the two meanings
Manually annotated corpora
Senseval, SemEval
Reported inter-annotator agreement as low as 85%
Allows the same evaluation strategy as for classification tasks
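Building a pseudo-word corpus can be sketched as follows: two unrelated words are merged into an artificial ambiguous word, and the original word serves as the gold label. The sentences and the pseudo-word "banana-door" are invented for illustration.

```python
import re

def make_pseudoword_corpus(sentences, w1, w2):
    """Replace w1/w2 with the merged pseudo-word; keep the original as gold label."""
    pseudo = f"{w1}-{w2}"
    corpus = []
    for s in sentences:
        for w in (w1, w2):
            if re.search(rf"\b{w}\b", s):
                # Gold label = the original word that was replaced
                corpus.append((re.sub(rf"\b{w}\b", pseudo, s), w))
    return corpus

sentences = ["she ate a ripe banana", "he closed the door quietly"]
print(make_pseudoword_corpus(sentences, "banana", "door"))
# → [('she ate a ripe banana-door', 'banana'), ('he closed the banana-door quietly', 'door')]
```

A WSD method is then scored on how often it recovers the gold label, with no manual annotation needed.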

SLIDE 38

Evaluation Measures for Supervised Methods
Coverage: ratio of ambiguous words for which an answer is given
... where the answer might be correct or not
Precision: ratio of correct answers among the answers given
Recall: ratio of correct answers among all instances
F1: harmonic mean of precision and recall
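Under the assumption that a system may abstain on some instances (answering None), the four measures can be computed as a small sketch:

```python
def wsd_scores(answers, gold):
    """Coverage, precision, recall, F1 for a system that may abstain (None)."""
    attempted = [(a, g) for a, g in zip(answers, gold) if a is not None]
    correct = sum(a == g for a, g in attempted)
    coverage = len(attempted) / len(gold)
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return coverage, precision, recall, f1

gold    = ["s1", "s2", "s1", "s2"]
answers = ["s1", "s2", None, "s1"]   # abstains on the third instance
print(tuple(round(x, 3) for x in wsd_scores(answers, gold)))
# → (0.75, 0.667, 0.5, 0.571)
```

Note how precision and recall diverge exactly when coverage is below 1, which is why all four numbers are usually reported together.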

SLIDE 39

Evaluation Measures for Unsupervised Methods
Compute the overlap between the found senses and a gold standard
Same evaluation measures as for clustering
e.g., V-measure, paired F-score [1]

[1] Manandhar, S., Klapaftis, I.P., Dligach, D. and Pradhan, S.S. 2010. SemEval-2010 task 14: Word sense induction & disambiguation. Proceedings of the 5th International Workshop on Semantic Evaluation, 63–68. Association for Computational Linguistics.

SLIDE 40

Applications
WSD could be beneficial in a number of scenarios
Information retrieval (search)
To improve precision (only the relevant documents)
To improve recall (via diversity)
Often humans provide disambiguation cues
Word embeddings
Document classification
Machine translation
...

SLIDE 41

Datasets
Recommended Resources & Web Sites
Papers with Code

https://paperswithcode.com/task/word-sense-disambiguation

NLPProgress

https://nlpprogress.com/english/word_sense_disambiguation.html

ACL Wiki

https://aclweb.org/aclwiki/Word_sense_disambiguation

SLIDE 42

Tools
Python implementations of common Word Sense Disambiguation (WSD) technologies

https://github.com/alvations/pywsd

DKPro WSD - corpora and code

https://dkpro.github.io/dkpro-wsd/corpora/

SLIDE 43

Thank You!

Next: Stylometry
