SLIDE 1 SI425 : NLP
Set 11 Distributional Similarity
some slides adapted from Dan Jurafsky and Bill MacCartney
SLIDE 2 Distributional methods
“You shall know a word by the company it keeps!” (Firth, 1957)
- Example from Nida (1975) noted by Lin:
  A bottle of tezgüino is on the table.
  Everybody likes tezgüino.
  Tezgüino makes you tipsy.
  We make tezgüino out of corn.
- Intuition:
- Just from context, you can guess the meaning of tezgüino.
- So we should look at surrounding contexts and see what other words occur in similar contexts.
SLIDE 3
Fill-in-the-blank on Google
You can get a quick & dirty impression of what words show up in a given context with Google queries:
SLIDE 4 Context vectors
- Target word w
- We have a boolean variable fi for each word vi in the
vocabulary.
- fi = “word vi occurs in the neighborhood of w”
w = (f1, f2, f3, …, fN)
If w = tezgüino: v1 = bottle, v2 = make, v3 = matrix
w = (1, 1, 0, …)
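As a concrete illustration, here is a minimal Python sketch of building such a boolean context vector; the toy vocabulary and the example sentences are invented for the example, not taken from a real corpus:

```python
# A minimal sketch of a boolean context vector (vocabulary and
# sentences are invented for illustration).
vocab = ["bottle", "make", "matrix", "corn", "table"]

sentences = [
    "a bottle of tezguino is on the table",
    "we make tezguino out of corn",
]

def context_vector(target, sentences, vocab):
    """f_i = 1 if vocab word v_i occurs in a sentence containing target."""
    neighbors = set()
    for sent in sentences:
        words = sent.split()
        if target in words:
            neighbors.update(w for w in words if w != target)
    return [1 if v in neighbors else 0 for v in vocab]

print(context_vector("tezguino", sentences, vocab))  # [1, 1, 0, 1, 1]
```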
SLIDE 5 Intuition
- Define two words by these sparse vectors
- Apply a vector distance metric
- Call two words similar if their vectors are similar
SLIDE 6 Distributional similarity
We need to define 3 things:
- 1. How the co-occurrence terms are defined
- Vocabulary? N-Grams?
- 2. How terms are weighted
- (Boolean? Frequency? Logs? Mutual information?)
- 3. What vector similarity metric should we use?
- Euclidean distance? Cosine? Jaccard? Dice?
SLIDE 7
- 1. Defining co-occurrence vectors
- Windows of neighboring words (n words to the left…); see the counting sketch after this list
- Bag-of-words
- We generally remove stop words
- Con: we lose any sense of syntax
- Solution: use the words occurring in particular
grammatical relations
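A small sketch of the window-based option, assuming a pre-tokenized corpus and an illustrative stop-word set:

```python
# Window-based co-occurrence counting; window size, corpus, and
# stop words below are illustrative assumptions, not from the slides.
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=2, stop_words=frozenset()):
    """For each target word, count the words within +/- `window` positions."""
    counts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        if w in stop_words:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] not in stop_words:
                counts[w][tokens[j]] += 1
    return counts

tokens = "we make tezguino out of corn".split()
print(cooccurrence_counts(tokens, window=2, stop_words={"of", "out"})["tezguino"])
# Counter({'we': 1, 'make': 1})
```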
SLIDE 8
Defining co-occurrence vectors
“The meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities.” Zellig Harris (1968)
Idea: parse the sentence, extract grammatical dependencies
SLIDE 9
Vectors with grammatical dependencies
For the word cell: a vector of N*R features
(N is the vocabulary size; R is the number of dependency relations)
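A hedged sketch of turning dependencies into features; the (head, relation, dependent) triple format and the example parses are assumptions standing in for real parser output:

```python
# Dependency-based context features: each feature is a
# (relation, governor) pair, giving up to N*R distinct features.
from collections import Counter

# Assumed stand-in for parser output: (head, relation, dependent) triples.
deps = [
    ("absorb",   "subj-of", "cell"),  # "the cell absorbs ..."
    ("attack",   "obj-of",  "cell"),  # "... attacks the cell"
    ("membrane", "nmod-of", "cell"),
]

def dependency_features(word, deps):
    """Collect (relation, governor) features for `word`."""
    return Counter((rel, head) for head, rel, dep in deps if dep == word)

print(dependency_features("cell", deps))
# Counter({('subj-of', 'absorb'): 1, ('obj-of', 'attack'): 1,
#          ('nmod-of', 'membrane'): 1})
```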
SLIDE 10 Group Exercise
- Search “Naval Academy” and create a vector.
- What other school is most similar? Most different?
- Compare vectors
SLIDE 11
- 2. Weighting the counts
- We have been using the raw frequency count of each context term as its weight/value
- But we could use any function of this frequency
- Instead: compute an association score
- Consider one feature f = (r, w′) = (obj-of, attack)
- P(f | w) = count(f, w) / count(w)
- assoc_prob(w, f) = P(f | w)
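In code, the probability-based association score is a one-liner; the counts below are invented for illustration:

```python
# Conditional-probability association: P(f | w) = count(f, w) / count(w).
def assoc_prob(feature_count, word_count):
    """Fraction of w's contexts in which feature f appears."""
    return feature_count / word_count

count_w = 50    # total contexts of the target word (assumed)
count_fw = 10   # contexts with feature (obj-of, attack) (assumed)
print(assoc_prob(count_fw, count_w))  # 0.2
```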
SLIDE 12 Frequency-based problems
- Problem: “drink it” is more common than “drink wine”!
  (even though “wine” is a better drinkable thing than “it”)
- Need: control for the expected frequency of each context word
- Solution: normalize the observed count by the expected frequency
Objects of the verb drink:
water (7), champagne (4), it (3), much (3), anything (3), liquid (2), wine (2)
SLIDE 13 Weighting: Mutual Information
- Pointwise mutual information: a measure of how often two events x and y occur together, compared with what we would expect if they were independent:
  PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
- PMI between a target word w and a feature f:
  PMI(w, f) = log2 [ P(w, f) / (P(w) P(f)) ]
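A sketch of computing PMI from raw counts (all counts invented); note how a rare but characteristic object scores higher than a frequent but uninformative one:

```python
# PMI weighting from raw co-occurrence counts; log base 2 follows
# the standard definition. All counts below are invented.
import math

def pmi(count_wf, count_w, count_f, total):
    """PMI(w, f) = log2( P(w, f) / (P(w) * P(f)) )."""
    p_wf = count_wf / total
    p_w = count_w / total
    p_f = count_f / total
    return math.log2(p_wf / (p_w * p_f))

# "wine" as object of "drink": rare overall, but strongly associated.
print(pmi(count_wf=2, count_w=10, count_f=4, total=1000))    # ~5.64
# "it" as object of "drink": frequent overall, so PMI is lower.
print(pmi(count_wf=3, count_w=10, count_f=300, total=1000))  # 0.0
```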
SLIDE 14
Mutual information intuition
Objects of the verb drink
SLIDE 15 Summary: weightings
- See Manning and Schütze (1999) for more
SLIDE 16
- 3. Defining vector similarity
SLIDE 17
Summary of similarity measures
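Minimal Python sketches of three of the measures mentioned earlier (cosine, generalized Jaccard, and Dice) over sparse count vectors; the example vectors are invented:

```python
# Standard similarity measures over sparse vectors represented as
# dicts mapping feature -> weight. Example vectors are invented.
import math

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = (math.sqrt(sum(x * x for x in u.values())) *
            math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def jaccard(u, v):
    """Generalized (weighted) Jaccard: sum of mins over sum of maxes."""
    keys = set(u) | set(v)
    num = sum(min(u.get(k, 0), v.get(k, 0)) for k in keys)
    den = sum(max(u.get(k, 0), v.get(k, 0)) for k in keys)
    return num / den if den else 0.0

def dice(u, v):
    keys = set(u) | set(v)
    num = 2 * sum(min(u.get(k, 0), v.get(k, 0)) for k in keys)
    den = sum(u.values()) + sum(v.values())
    return num / den if den else 0.0

a = {"bottle": 1, "make": 1, "corn": 1}
b = {"bottle": 1, "make": 1, "tipsy": 1}
print(cosine(a, b), jaccard(a, b), dice(a, b))  # ~0.667, 0.5, ~0.667
```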
SLIDE 18 Evaluating similarity measures
- Intrinsic evaluation
- Correlation with word similarity ratings from humans
- Extrinsic (task-based, end-to-end) evaluation
- Malapropism (spelling error) detection
- Word sense disambiguation (WSD)
- Essay grading
- Plagiarism detection
- Taking TOEFL multiple-choice vocabulary tests
- Language modeling in some application
SLIDE 19
An example of detected plagiarism