SLIDE 1

Outline: Word Sense Disambiguation; Supervised WSD (WSD evaluation, Feature extraction, Naive Bayes, Lesk algorithm, Heuristic-based WSD, Similarity-based WSD, Translation-based WSD); Unsupervised WSD; Modern WSD

Word Sense Disambiguation

L645 / B659 (Some material from Jurafsky & Martin (2009) + Manning & Schütze (2000))

Dept. of Linguistics, Indiana University

Fall 2015


SLIDE 2

Context

Lexical Semantics

A (word) sense represents one meaning of a word

◮ bank1: financial institution
◮ bank2: sloped ground near water

Various relations:

◮ homonymy: 2 words/senses happen to sound the same (e.g., bank1 & bank2)
◮ polysemy: 2 senses have some semantic relation between them
  ◮ bank1 & bank3 = repository for biological entities

SLIDE 3

Context

WordNet

WordNet (http://wordnet.princeton.edu/) is a database of lexical relations:

◮ Nouns (117,798); verbs (11,529); adjectives (21,479) & adverbs (4,481)
◮ https://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html

WordNet contains different senses of a word, defined by synsets (synonym sets)

◮ {chump1, fool2, gull1, mark9, patsy1, fall guy1, sucker1, soft touch1, mug2}
◮ Words are substitutable in some contexts
◮ gloss: a person who is gullible and easy to take advantage of

See http://babelnet.org for other languages
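For a hands-on look, NLTK exposes WordNet's synsets, lemmas, and glosses directly; a minimal sketch (assuming nltk is installed and the wordnet data has been fetched once via nltk.download):

```python
from nltk.corpus import wordnet as wn

# all senses (synsets) of "bank"
for syn in wn.synsets('bank'):
    print(syn.name(), '-', syn.definition())

# one synset and its mutually substitutable lemmas
chump = wn.synset('chump.n.01')
print(chump.lemma_names())   # ['chump', 'fool', 'gull', 'mark', 'patsy', ...]
print(chump.definition())    # the gloss quoted on this slide
```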


SLIDE 4

Word Sense Disambiguation (WSD)

Word Sense Disambiguation (WSD): determine the proper sense of an ambiguous word in a given context

e.g., Given the word bank, is it:
◮ the rising ground bordering a body of water?
◮ an establishment for exchanging funds?
◮ or maybe a repository (e.g., blood bank)?

WSD comes in two variants:
◮ Lexical sample task: a small pre-selected set of target words (along with a sense inventory)
◮ All-words task: disambiguate every ambiguous word in entire texts

Our goal: get a flavor for the insights & what the techniques need to accomplish


SLIDE 5

Supervised WSD

Supervised WSD: extract features which are helpful for particular senses & train a classifier to assign the correct sense
◮ lexical sample task: labeled corpora for individual words
◮ all-words disambiguation task: use a semantic concordance (e.g., SemCor)


SLIDE 6

WSD Evaluation

◮ Extrinsic (in vivo) evaluation: evaluate WSD in the context of another task, e.g., question answering
◮ Intrinsic (in vitro) evaluation: evaluate WSD as a stand-alone system
  ◮ Exact-match sense accuracy
  ◮ Precision/recall measures, if systems pass on some labelings

Baselines:

◮ Most frequent sense (MFS): for WordNet, take the first sense
◮ Lesk algorithm (later)

Ceiling: inter-annotator agreement, generally 75-80%
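To make the intrinsic measures concrete, here is a minimal sketch (the function name and toy labels are invented) of exact-match accuracy generalized to precision/recall when a system may abstain:

```python
def wsd_scores(gold, predicted):
    """Intrinsic WSD scores when a system may abstain (predict None):
    precision = correct / attempted, recall = correct / total.
    With no abstentions, both equal exact-match accuracy."""
    attempted = sum(1 for p in predicted if p is not None)
    correct = sum(1 for g, p in zip(gold, predicted) if p is not None and g == p)
    precision = correct / attempted if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

gold = ['bank1', 'bank2', 'bank1', 'bank3']
pred = ['bank1', None, 'bank2', 'bank3']     # abstains on the second token
print(wsd_scores(gold, pred))                # (0.666..., 0.5)
```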


SLIDE 7

Feature extraction

1. POS tag, lemmatize/stem, & perhaps parse the sentence in question
2. Extract context features within a certain window of the target word
◮ Feature vector: numeric or nominal values encoding linguistic information


SLIDE 8

Feature extraction

Collocational features

Collocational features encode information about specific positions to the left or right of a target word
◮ capture local lexical & grammatical information

Consider: An electric guitar and bass player stand off to one side, not really part of the scene ...
◮ [w_i-2, POS_i-2, w_i-1, POS_i-1, w_i+1, POS_i+1, w_i+2, POS_i+2]
◮ [guitar, NN, and, CC, player, NN, stand, VB]
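A minimal sketch of extracting this template around a target index (the tagged sentence is hand-built here; in practice it would come out of step 1 on the previous slide):

```python
# Tokenized, POS-tagged sentence from the slide's example (tags hand-assigned).
tagged = [('an', 'DT'), ('electric', 'JJ'), ('guitar', 'NN'), ('and', 'CC'),
          ('bass', 'NN'), ('player', 'NN'), ('stand', 'VB'), ('off', 'RP')]

def collocational_features(tagged, i, window=2):
    """Return [w-2, POS-2, w-1, POS-1, w+1, POS+1, w+2, POS+2] around index i."""
    feats = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(tagged):
            feats.extend(tagged[j])
        else:
            feats.extend(('<PAD>', '<PAD>'))   # pad at sentence edges
    return feats

print(collocational_features(tagged, tagged.index(('bass', 'NN'))))
# ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB']
```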


SLIDE 9

Feature extraction

Bag-of-words features

Bag-of-words features encode unordered sets of surrounding words, ignoring exact position
◮ Captures more semantic properties & the general topic of discourse
◮ Vocabulary for surrounding words is usually pre-defined

e.g., the 12 most frequent content words from bass sentences in the WSJ:
◮ [fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
leading to this feature vector:
◮ [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
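A minimal sketch of building that vector over the fixed vocabulary (using the slide's 12 WSJ content words for bass):

```python
vocab = ['fishing', 'big', 'sound', 'player', 'fly', 'rod',
         'pound', 'double', 'runs', 'playing', 'guitar', 'band']

def bow_vector(context_words, vocab):
    """1 if the vocabulary word occurs anywhere in the context, else 0."""
    context = set(context_words)
    return [1 if w in context else 0 for w in vocab]

context = 'an electric guitar and bass player stand off to one side'.split()
print(bow_vector(context, vocab))
# [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
```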


SLIDE 10

Bayesian WSD

◮ Look at a context of surrounding words, call it c, within a window of a particular size
◮ Select the best sense s from among the different senses:

(1) $\hat{s} = \arg\max_{s_k} P(s_k \mid c) = \arg\max_{s_k} \frac{P(c \mid s_k)\,P(s_k)}{P(c)} = \arg\max_{s_k} P(c \mid s_k)\,P(s_k)$

Computationally, it is simpler to calculate with logarithms, giving:

(2) $\hat{s} = \arg\max_{s_k} \left[ \log P(c \mid s_k) + \log P(s_k) \right]$


SLIDE 11

Naive Bayes assumption

◮ Treat the context (c) as a bag of words (vj)
◮ Make the Naive Bayes assumption that every surrounding word vj is independent of the other ones:

(3) $P(c \mid s_k) = \prod_{v_j \in c} P(v_j \mid s_k)$

(4) $\hat{s} = \arg\max_{s_k} \Big[ \sum_{v_j \in c} \log P(v_j \mid s_k) + \log P(s_k) \Big]$

We get maximum likelihood estimates from the corpus to obtain P(sk) and P(vj|sk)
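Putting (3)-(4) together with smoothed count estimates gives a tiny working classifier; a minimal sketch (the sense-labeled corpus below is invented for illustration):

```python
import math
from collections import Counter, defaultdict

train = [                                 # (context words, sense label)
    (['fishing', 'caught', 'river'], 'fish'),
    (['fly', 'rod', 'pound', 'fishing'], 'fish'),
    (['guitar', 'player', 'band'], 'music'),
    (['sound', 'playing', 'guitar'], 'music'),
]

sense_counts = Counter(sense for _, sense in train)
word_counts = defaultdict(Counter)        # sense -> counts of context words
vocab = set()
for words, sense in train:
    word_counts[sense].update(words)
    vocab.update(words)

def classify(context):
    """argmax_k [ log P(s_k) + sum_j log P(v_j|s_k) ], add-one smoothed."""
    scores = {}
    for sense, n in sense_counts.items():
        score = math.log(n / len(train))                  # log P(s_k)
        total = sum(word_counts[sense].values())
        for w in context:                                 # log P(v_j|s_k)
            score += math.log((word_counts[sense][w] + 1) / (total + len(vocab)))
        scores[sense] = score
    return max(scores, key=scores.get)

print(classify(['fishing', 'rod']))       # 'fish'
print(classify(['guitar', 'band']))       # 'music'
```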


SLIDE 12

Dictionary-based WSD

Lesk algorithm

Use general characterizations of the senses to aid in disambiguation

Intuition: words found in a particular sense definition can provide contextual cues, e.g., for ash:

  Sense              Definition
  s1: tree           a tree of the olive family
  s2: burned stuff   the solid residue left when combustible material is burned

If tree is in the context of ash, the sense is more likely s1


SLIDE 13

Lesk algorithm

Look at words within the sense definition and the words within the definitions of context words, too (unioning over different senses); a code sketch follows the steps below

1. Take all senses sk of a word w and gather the set of words for each definition
   ◮ Treat it as a bag of words
2. Gather all the words in the definitions of the surrounding words, within some context window
3. Calculate the overlap
4. Choose the sense with the highest overlap
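A minimal sketch of the simplified variant, which overlaps only the target word's WordNet glosses with the raw context (so step 2's definitions of context words are skipped), assuming NLTK and its WordNet data; the small stopword set is an assumption of this sketch:

```python
from nltk.corpus import wordnet as wn

STOPWORDS = {'the', 'a', 'an', 'is', 'of', 'to', 'and', 'or', 'one', 'into'}

def simplified_lesk(word, context_words):
    """Choose the synset of `word` whose gloss overlaps the context most."""
    context = {w.lower() for w in context_words} - STOPWORDS
    best, best_overlap = None, -1
    for syn in wn.synsets(word):
        gloss = set(syn.definition().lower().split()) - STOPWORDS
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best, best_overlap = syn, overlap
    return best

context = 'The ash is one of the last trees to come into leaf'.split()
sense = simplified_lesk('ash', context)
print(sense.name(), '-', sense.definition())   # should pick the tree (Fraxinus) sense
```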


SLIDE 14

Example

(5) This cigar burns slowly and creates a stiff ash.
(6) The ash is one of the last trees to come into leaf.

So, sense s2 goes with the first sentence and s1 with the second
◮ Note that, depending on the dictionary, leaf might also be a contextual cue for sense s1 of ash


SLIDE 15

Problems with dictionary-based WSD

◮ Not very accurate: 50%-70%
◮ Highly dependent upon the choice of dictionary
◮ Not always clear whether the dictionary definitions align with what we think of as different senses


SLIDE 16

Heuristic-based WSD

Can use a heuristic to automatically select seeds

◮ One sense per discourse: the sense of a word is highly consistent within a given document
◮ One sense per collocation: collocations rarely have multiple senses associated with them


SLIDE 17

One sense per collocation

Rank senses based on what collocations the word appears in
◮ e.g., show interest might be strongly correlated with the 'attention, concern' usage of interest
◮ The collocational feature could be a surrounding POS tag, or a word in the object position
◮ For a given context, select which collocational feature will be used to disambiguate, based on which feature is the strongest indicator
◮ This avoids having to combine different pieces of information

Rankings are based on the following ratio, where f is a collocational feature:

(7) $\frac{P(s_{k_1} \mid f)}{P(s_{k_2} \mid f)}$


SLIDE 18

Calculating collocations

1. Initially, calculate the collocations for sk
2. Calculate the contexts in which an ambiguous word is assigned to sk, based on those collocations
3. Calculate the set of collocations that are most characteristic of the contexts for sk, using the formula (see the sketch after these steps):

   (8) $\frac{P(s_{k_1} \mid f)}{P(s_{k_2} \mid f)}$

4. Repeat steps 2 & 3 until a threshold is reached.
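In code, the ranking in (7)-(8) is just a smoothed (log) ratio over counts; a minimal decision-list-style sketch with invented counts for the two senses of interest:

```python
import math

# feature -> counts of contexts per sense in which the feature fires (invented)
counts = {
    'show_<target>': {'concern': 40, 'legal-share': 2},
    '<target>_rate': {'concern': 3, 'legal-share': 55},
    'POS+1=IN':      {'concern': 20, 'legal-share': 18},
}

def strength(f, s1='concern', s2='legal-share', alpha=0.1):
    """|log P(s1|f)/P(s2|f)| with additive smoothing; higher = stronger cue.
    Both probabilities share the denominator P(f), so the count ratio suffices."""
    c = counts[f]
    return abs(math.log((c.get(s1, 0) + alpha) / (c.get(s2, 0) + alpha)))

for f in sorted(counts, key=strength, reverse=True):
    print(f'{strength(f):5.2f}  {f}')     # strongest disambiguator first
```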


SLIDE 19

Word similarity

Idea: expect synonyms to behave similarly

Define this in two ways:
◮ Knowledge-based: thesaurus-based WSD
◮ Knowledge-free: distributional methods

Word similarity computations are useful for IR, QA, summarization, language modeling, etc.


SLIDE 20

Thesaurus-based WSD

Use essentially the same set-up as dictionary-based WSD, but now:

◮ instead of requiring context words to have overlapping dictionary definitions
◮ we require surrounding context words to list the focus word w (or the subject code of w) as one of their topics

e.g., if an animal or insect appears in the context of bass, we choose the fish sense instead of the musical one

Alternative: use path lengths in an ontology like WordNet to calculate word similarity
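The path-length alternative is easy to try with NLTK's WordNet interface; a minimal sketch (assuming nltk with the wordnet data, and using fish.n.01 as a stand-in for an already-disambiguated context word):

```python
from nltk.corpus import wordnet as wn

fish = wn.synset('fish.n.01')                # the animal sense of "fish"
for syn in wn.synsets('bass', pos=wn.NOUN):
    sim = syn.path_similarity(fish)          # 1 / (shortest path length + 1)
    print(f'{sim:.3f}  {syn.name()}  {syn.definition()[:60]}')
# The fish senses of bass should score much higher than the music senses.
```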


SLIDE 21

Translation-based WSD

Idea: when disambiguating a word w, look for a combination of w and some contextual word which translates to a particular pair, indicating a particular sense
◮ interest can be 'legal share' (Beteiligung in German) or 'concern' (Interesse)
◮ In the phrase show interest, we are more likely to translate to Interesse zeigen than Beteiligung zeigen
◮ So, in this English context, the German translation tells us to go with the sense that corresponds to Interesse


SLIDE 22

Information-theoretic WSD

Instead of using all contextual features (which we assume are independent), an information-theoretic approach tries to find a single disambiguating feature
◮ Take a set of possible indicators and determine which is the best, i.e., which gives the highest mutual information in the training data

Possible indicators:
◮ object of the verb
◮ the verb tense
◮ word to the left
◮ word to the right
◮ etc.

When sense tagging, look up the value of that indicator to tag the word


SLIDE 23

Partitioning

More specifically, determine what the values (xi) of the indicator indicate, i.e., what sense (si) they point to.
◮ Assume two senses (P1 and P2), which can be captured in the subsets Q1 = {xi | xi indicates sense 1} and Q2 = {xi | xi indicates sense 2}
◮ We will have a set of indicator values Q; our goal is to partition Q into these two sets

The partition we choose is the one which maximizes the mutual information scores I(P1, Q1) and I(P2, Q2)
◮ The Flip-Flop algorithm is used when you have to automatically determine your senses (e.g., if using parallel text)


SLIDE 24

The Flip-Flop Algorithm (roughly)

1. Randomly partition P (possible senses/translations) into P1 and P2
2. While improving mutual information scores:
   2.1 Find the partition of Q (possible indicators) into Q1 and Q2 which maximizes I(P; Q)
       ◮ Q might be the set of objects which appear after the verb in question
   2.2 Find the partition of P into P1 and P2 which maximizes I(P; Q)

A toy sketch follows below.
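A brute-force toy sketch of that loop (the setup assumes P = German translations of interest and Q = verb indicators, with invented joint counts; real implementations split P and Q efficiently rather than enumerating all 2-way partitions):

```python
import itertools
import math
from collections import Counter

pairs = Counter({                       # counts of (translation p, indicator q)
    ('Interesse', 'show'): 30, ('Interesse', 'express'): 20,
    ('Beteiligung', 'acquire'): 25, ('Beteiligung', 'hold'): 15,
    ('Interesse', 'hold'): 2, ('Beteiligung', 'show'): 1,
})
P = sorted({p for p, _ in pairs})
Q = sorted({q for _, q in pairs})
N = sum(pairs.values())

def mi(p1, q1):
    """I(P;Q) of the 2x2 table induced by the partitions (p1 vs rest, q1 vs rest)."""
    total = 0.0
    for pi in (p1, tuple(x for x in P if x not in p1)):
        p_p = sum(pairs[(p, q)] for p in pi for q in Q) / N
        for qi in (q1, tuple(x for x in Q if x not in q1)):
            p_q = sum(pairs[(p, q)] for p in P for q in qi) / N
            joint = sum(pairs[(p, q)] for p in pi for q in qi) / N
            if joint > 0:
                total += joint * math.log2(joint / (p_p * p_q))
    return total

def splits(xs):                          # all non-trivial 2-way splits of xs
    for r in range(1, len(xs)):
        yield from itertools.combinations(xs, r)

p1, best = (P[0],), -1.0                 # step 1: arbitrary initial partition of P
while True:                              # step 2: flip-flop until no improvement
    q1 = max(splits(Q), key=lambda qs: mi(p1, qs))    # 2.1: repartition Q
    p1 = max(splits(P), key=lambda ps: mi(ps, q1))    # 2.2: repartition P
    score = mi(p1, q1)
    if score <= best:
        break
    best = score
print('Q1 =', q1, ' P1 =', p1, ' I =', round(best, 3))
```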


SLIDE 25

Disambiguation

After determining the best indicator and partitioning the values, disambiguating is easy:

1. Determine the value xi of the indicator for the ambiguous word.
2. If xi is in Q1, assign it sense 1; otherwise, sense 2.

This method is also applicable for determining which indicators are best for a set of translation words


SLIDE 26

Unsupervised WSD

Perform sense discrimination, or clustering

◮ In other words, group comparable senses together, even if you cannot give a correct label

We will look briefly at the EM (Expectation-Maximization) algorithm for this task, based on a Bayesian model


SLIDE 27

EM algorithm: Bayesian review

Bayesian WSD for supervised learning:

◮ Look at a context of surrounding words, call it c (vj = word in context), within a window of a particular size
◮ Select the best sense s from among the different senses:

(9) $\hat{s} = \arg\max_{s_k} P(s_k \mid c) = \arg\max_{s_k} \frac{P(c \mid s_k)\,P(s_k)}{P(c)} = \arg\max_{s_k} P(c \mid s_k)\,P(s_k) = \arg\max_{s_k} \left[ \log P(c \mid s_k) + \log P(s_k) \right] = \arg\max_{s_k} \Big[ \sum_{v_j \in c} \log P(v_j \mid s_k) + \log P(s_k) \Big]$

We need some other way to get estimates of P(sk) and P(c|sk)


SLIDE 28

EM algorithm

1. Initialize the parameters randomly, i.e., the probabilities for all senses and contexts
   ◮ And decide K, the number of senses you want → determines how fine-grained your distinctions are
2. While still improving:
   2.1 Expectation: re-estimate the probability of sk generating the context ci:

   (10) $\hat{P}(c_i \mid s_k) = \frac{P(c_i \mid s_k)}{\sum_{k'=1}^{K} P(c_i \mid s_{k'})}$

   Recall that all contextual words vj (i.e., P(vj|sk)) will be used to calculate the context probability


SLIDE 29

EM algorithm (cont.)

2.2 Maximization: use the expected probabilities to re-estimate the parameters:

   (11) $P(v_j \mid s_k) = \frac{\sum_{\{c_i : v_j \in c_i\}} \hat{P}(c_i \mid s_k)}{\sum_{k'} \sum_{\{c_i : v_j \in c_i\}} \hat{P}(c_i \mid s_{k'})}$

   → Of all the times that vj occurs in a context of any of this word's senses, how often does vj indicate sk?

   (12) $P(s_k) = \frac{\sum_i \hat{P}(c_i \mid s_k)}{\sum_{k'} \sum_i \hat{P}(c_i \mid s_{k'})}$

   → Of all the times that any sense generates ci, how often does sk generate it?

A toy sketch of the full loop follows below.
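Putting slides 28-29 together, here is a toy sketch of the EM loop for sense induction as a Naive Bayes mixture over contexts. The corpus, K, and the inclusion of the prior P(sk) in the E-step are assumptions of this sketch; real systems work in log space, smooth more carefully, and use random restarts:

```python
import random
from collections import defaultdict

contexts = [
    ['fishing', 'river', 'caught'], ['rod', 'fishing', 'pound'],
    ['guitar', 'band', 'player'], ['playing', 'guitar', 'sound'],
]
K = 2
vocab = sorted({w for c in contexts for w in c})
random.seed(0)

p_sense = [1.0 / K] * K                                   # P(s_k)
p_word = [{w: random.random() for w in vocab} for _ in range(K)]
for k in range(K):                                        # normalize P(v_j|s_k)
    z = sum(p_word[k].values())
    p_word[k] = {w: v / z for w, v in p_word[k].items()}

for _ in range(50):
    # E-step (eq. 10): h[i][k] ∝ P(s_k) * prod_j P(v_j|s_k), normalized over k
    h = []
    for c in contexts:
        scores = []
        for k in range(K):
            s = p_sense[k]
            for w in c:
                s *= p_word[k][w]
            scores.append(s)
        z = sum(scores)
        h.append([s / z for s in scores])
    # M-step (eqs. 11-12): re-estimate P(v_j|s_k) and P(s_k) from h
    for k in range(K):
        counts = defaultdict(float)
        for hi, c in zip(h, contexts):
            for w in c:
                counts[w] += hi[k]
        z = sum(counts.values())
        p_word[k] = {w: (counts[w] + 1e-9) / (z + 1e-9 * len(vocab)) for w in vocab}
        p_sense[k] = sum(hi[k] for hi in h) / len(contexts)

for c, hi in zip(contexts, h):                            # induced sense clusters
    print(max(range(K), key=lambda k: hi[k]), c)
```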


SLIDE 30

Surveys on WSD Systems

Surveys:

◮ Roberto Navigli (2009). Word Sense Disambiguation: A Survey. ACM Computing Surveys, 41(2), pp. 1-69.
  ◮ Covers: decision lists, decision trees, Naive Bayes, neural networks, instance-based learning, SVMs, ensemble methods, clustering, multilinguality, Semeval/Senseval, etc.
  ◮ http://wwwusers.di.uniroma1.it/~navigli/pubs/ACM_Survey_2009_Navigli.pdf
◮ Alok Ranjan Pal and Diganta Saha (2015). Word Sense Disambiguation: A Survey. International Journal of Control Theory and Computer Modeling (IJCTCM), 5(3).
  ◮ http://arxiv.org/pdf/1508.01346.pdf