SLIDE 1 SI425 : NLP
Set 11 Distributional Similarity
some slides adapted from Dan Jurafsky and Bill MacCartney
SLIDE 2 Distributional methods
“You shall know a word by the company it keeps!” (Firth, 1957)
- Example from Nida (1975) noted by Lin:
  A bottle of tezgüino is on the table.
  Everybody likes tezgüino.
  Tezgüino makes you tipsy.
  We make tezgüino out of corn.
- Intuition:
- Just from context, you can guess the meaning of tezgüino.
- So we should look at surrounding contexts and see what other words occur in similar contexts.
SLIDE 3
Fill-in-the-blank on Google
You can get a quick & dirty impression of what words show up in a given context with Google queries:
SLIDE 4 Context vectors
- Target word w
- We have a boolean variable fi for each word vi in the
vocabulary.
- fi = “word vi occurs in the neighborhood of w”
w = (f1, f2, f3, …, fN)
If w = tezgüino: v1 = bottle, v2 = make, v3 = matrix
w = (1, 1, 0, …)
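As a concrete illustration, here is a minimal Python sketch of building such a boolean context vector; the toy vocabulary and the example sentences are invented for the example, not taken from a real corpus:

```python
# A minimal sketch of a boolean context vector (vocabulary and
# sentences are invented for illustration).
vocab = ["bottle", "make", "matrix", "corn", "table"]

sentences = [
    "a bottle of tezguino is on the table",
    "we make tezguino out of corn",
]

def context_vector(target, sentences, vocab):
    """f_i = 1 if vocab word v_i occurs in a sentence containing target."""
    neighbors = set()
    for sent in sentences:
        words = sent.split()
        if target in words:
            neighbors.update(w for w in words if w != target)
    return [1 if v in neighbors else 0 for v in vocab]

print(context_vector("tezguino", sentences, vocab))  # [1, 1, 0, 1, 1]
```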
SLIDE 5 Intuition
- Define two words by these sparse vectors
- Apply a vector distance metric
- Call two words similar if their vectors are similar
SLIDE 6 Distributional similarity
We need to define 3 things:
- 1. How the co-occurrence terms are defined
- Vocabulary? N-Grams?
- 2. How terms are weighted
- (Boolean? Frequency? Logs? Mutual information?)
- 3. What vector similarity metric should we use?
- Euclidean distance? Cosine? Jaccard? Dice?
SLIDE 7
- 1. Defining co-occurrence vectors
- Windows of neighboring words (n words to the left…); see the counting sketch after this list
- Bag-of-words
- We generally remove stop words
- Con: we lose any sense of syntax
- Solution: use the words occurring in particular
grammatical relations
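A small sketch of the window-based option, assuming a pre-tokenized corpus and an illustrative stop-word set:

```python
# Window-based co-occurrence counting; window size, corpus, and
# stop words below are illustrative assumptions, not from the slides.
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=2, stop_words=frozenset()):
    """For each target word, count the words within +/- `window` positions."""
    counts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        if w in stop_words:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] not in stop_words:
                counts[w][tokens[j]] += 1
    return counts

tokens = "we make tezguino out of corn".split()
print(cooccurrence_counts(tokens, window=2, stop_words={"of", "out"})["tezguino"])
# Counter({'we': 1, 'make': 1})
```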
SLIDE 8
Defining co-occurrence vectors
“The meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities.” Zellig Harris (1968)
Idea: parse the sentence, extract grammatical dependencies
SLIDE 9
Vectors with grammatical dependencies
For the word cell: a vector of N*R features
(N is the vocabulary size; R is the number of dependency relations)
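A hedged sketch of turning dependencies into features; the (head, relation, dependent) triple format and the example parses are assumptions standing in for real parser output:

```python
# Dependency-based context features: each feature is a
# (relation, governor) pair, giving up to N*R distinct features.
from collections import Counter

# Assumed stand-in for parser output: (head, relation, dependent) triples.
deps = [
    ("absorb",   "subj-of", "cell"),  # "the cell absorbs ..."
    ("attack",   "obj-of",  "cell"),  # "... attacks the cell"
    ("membrane", "nmod-of", "cell"),
]

def dependency_features(word, deps):
    """Collect (relation, governor) features for `word`."""
    return Counter((rel, head) for head, rel, dep in deps if dep == word)

print(dependency_features("cell", deps))
# Counter({('subj-of', 'absorb'): 1, ('obj-of', 'attack'): 1,
#          ('nmod-of', 'membrane'): 1})
```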
SLIDE 10 Group Exercise
- Search “Naval Academy” and create a vector.
- What other school is most similar? Most different?
- Compare vectors
SLIDE 11
- 2. Weighting the counts
- We have been using the raw frequency count of each context term as its weight/value
- But we could use any function of this frequency
- Instead: compute an association score
- Consider one feature f = (r, w′) = (obj-of, attack)
- P(f | w) = count(f, w) / count(w)
- assoc_prob(w, f) = P(f | w)
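In code, the probability-based association score is a one-liner; the counts below are invented for illustration:

```python
# Conditional-probability association: P(f | w) = count(f, w) / count(w).
def assoc_prob(feature_count, word_count):
    """Fraction of w's contexts in which feature f appears."""
    return feature_count / word_count

count_w = 50    # total contexts of the target word (assumed)
count_fw = 10   # contexts with feature (obj-of, attack) (assumed)
print(assoc_prob(count_fw, count_w))  # 0.2
```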
SLIDE 12 Frequency-based problems
- Problem: “drink it” is more common than “drink wine”!
  (even though “wine” is a better drinkable thing than “it”)
- Need: control for the expected frequency of each context word
- Solution: normalize the observed count by the expected frequency
Objects of the verb drink:
water (7), champagne (4), it (3), much (3), anything (3), liquid (2), wine (2)
SLIDE 13 Weighting: Mutual Information
- Pointwise mutual information: a measure of how often two events x and y occur together, compared with what we would expect if they were independent:
  PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
- PMI between a target word w and a feature f:
  PMI(w, f) = log2 [ P(w, f) / (P(w) P(f)) ]
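A sketch of computing PMI from raw counts (all counts invented); note how a rare but characteristic object scores higher than a frequent but uninformative one:

```python
# PMI weighting from raw co-occurrence counts; log base 2 follows
# the standard definition. All counts below are invented.
import math

def pmi(count_wf, count_w, count_f, total):
    """PMI(w, f) = log2( P(w, f) / (P(w) * P(f)) )."""
    p_wf = count_wf / total
    p_w = count_w / total
    p_f = count_f / total
    return math.log2(p_wf / (p_w * p_f))

# "wine" as object of "drink": rare overall, but strongly associated.
print(pmi(count_wf=2, count_w=10, count_f=4, total=1000))    # ~5.64
# "it" as object of "drink": frequent overall, so PMI is lower.
print(pmi(count_wf=3, count_w=10, count_f=300, total=1000))  # 0.0
```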
SLIDE 14
Mutual information intuition
Objects of the verb drink
SLIDE 15 Summary: weightings
- See Manning and Schütze (1999) for more
SLIDE 16
- 3. Defining vector similarity
SLIDE 17
Summary of similarity measures
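Minimal Python sketches of three of the measures mentioned earlier (cosine, generalized Jaccard, and Dice) over sparse count vectors; the example vectors are invented:

```python
# Standard similarity measures over sparse vectors represented as
# dicts mapping feature -> weight. Example vectors are invented.
import math

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = (math.sqrt(sum(x * x for x in u.values())) *
            math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def jaccard(u, v):
    """Generalized (weighted) Jaccard: sum of mins over sum of maxes."""
    keys = set(u) | set(v)
    num = sum(min(u.get(k, 0), v.get(k, 0)) for k in keys)
    den = sum(max(u.get(k, 0), v.get(k, 0)) for k in keys)
    return num / den if den else 0.0

def dice(u, v):
    keys = set(u) | set(v)
    num = 2 * sum(min(u.get(k, 0), v.get(k, 0)) for k in keys)
    den = sum(u.values()) + sum(v.values())
    return num / den if den else 0.0

a = {"bottle": 1, "make": 1, "corn": 1}
b = {"bottle": 1, "make": 1, "tipsy": 1}
print(cosine(a, b), jaccard(a, b), dice(a, b))  # ~0.667, 0.5, ~0.667
```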
SLIDE 18 Evaluating similarity measures
- Intrinsic evaluation
- Correlation with word similarity ratings from humans
- Extrinsic (task-based, end-to-end) evaluation
- Malapropism (spelling error) detection
- Word sense disambiguation (WSD)
- Essay grading
- Plagiarism detection
- Taking TOEFL multiple-choice vocabulary tests
- Language modeling in some application
SLIDE 19
An example of detected plagiarism