SI425 : NLP, Set 8: Words as Vectors (distributional similarity)


SLIDE 1

SI425 : NLP

Set 8: Words as Vectors

(distributional similarity)

some slides adapted from Dan Jurafsky and Bill MacCartney

Fall 2020 : Chambers

SLIDE 2

Why are these so different?

P( ball | threw, the) = 0.12
P( baseball | threw, the) = 0.01
P( ran | they) = 0.08
P( sprinted | they) = 0.0

Smoothing doesn’t solve this. We should make a word like sprinted similar to its synonym ran.

SLIDE 3

Distributional methods

  • Firth (1957)

“You shall know a word by the company it keeps!”

  • Example from Nida (1975) noted by Lin:

A bottle of tezgüino is on the table.
Everybody likes tezgüino.
Tezgüino makes you tipsy.
We make tezgüino out of corn.

  • Intuition:
  • Just from context, you can guess the meaning of tezgüino.
  • So we should look at surrounding contexts, and see what other words occur in similar contexts.
SLIDE 4

Fill-in-the-blank on Google

You can get a quick & dirty impression of what words show up in a given context with Google queries:

SLIDE 5

Intuition

  • Define each word by a sparse vector
  • Use a vector distance metric between words
  • Declare two words similar if their distance is small
SLIDE 6

Context vectors

  • Target word w
  • We have a boolean variable f_i for each word v_i in the vocabulary
  • f_i = “word v_i occurs in the neighborhood of w”

w = (f_1, f_2, f_3, …, f_N)

If w = tezgüino, v_1 = bottle, v_2 = make, v_3 = matrix:

w = (1, 1, 0, …)
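The slides leave the implementation implicit; the following is a minimal sketch of building such a boolean context vector over a toy corpus. The window size, the corpus, and the name build_context_vector are illustrative assumptions, not from the slides.

```python
# Minimal sketch: boolean context vectors over a toy corpus.
# Window size, corpus, and names are illustrative assumptions.

WINDOW = 2  # words on each side counted as the "neighborhood"

corpus = [
    "a bottle of tezgüino is on the table".split(),
    "everybody likes tezgüino".split(),
    "tezgüino makes you tipsy".split(),
    "we make tezgüino out of corn".split(),
]

# include "matrix" so the output mirrors the slide's v3 = matrix example
vocab = sorted({w for sent in corpus for w in sent} | {"matrix"})

def build_context_vector(target, corpus, vocab, window=WINDOW):
    """f_i = 1 iff vocab word v_i ever occurs within `window` words of target."""
    seen = set()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == target:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                seen.update(sent[lo:i] + sent[i + 1:hi])
    return [1 if v in seen else 0 for v in vocab]

vec = build_context_vector("tezgüino", corpus, vocab)
print(dict(zip(vocab, vec)))  # bottle -> 1, make -> 1, matrix -> 0
```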

SLIDE 7

Distributional similarity

We need to define 3 things:

  • 1. How the co-occurrence terms are defined
    • Vocabulary? N-grams?
  • 2. How terms are weighted
    • Boolean? Frequency? Logs? Mutual information?
  • 3. What vector similarity metric should we use?
    • Euclidean distance? Cosine? Jaccard? Dice?
SLIDE 8
1. Defining co-occurrence vectors

  • Windows of neighboring words (n words to the left…)
  • Bag-of-words
  • We generally remove stop words
  • Con: we lose any sense of syntax
  • Solution: use the words occurring in particular grammatical relations

SLIDE 9

Defining co-occurrence vectors

“The meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities.” (Zellig Harris, 1968)

Idea: parse the sentence and extract grammatical relations.

SLIDE 10

Vectors with grammatical dependencies

For the word cell: a vector of N × R features

(N is the vocabulary size; R is the number of grammatical relations)
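As a concrete, hedged illustration of relation-based features, here is a sketch using spaCy's dependency parser. spaCy, the en_core_web_sm model, and the name dependency_features are assumptions for illustration; the slides don't prescribe a parser.

```python
# Hedged sketch: (relation, word) features from a dependency parse.
# Assumes spaCy and en_core_web_sm are installed; names are illustrative.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def dependency_features(text, target):
    """Count (relation, neighbor-word) pairs involving the target word,
    yielding up to N*R distinct features (R = number of relation types)."""
    feats = Counter()
    for tok in nlp(text):
        if tok.text.lower() == target:
            # target as dependent: (its relation, its head word)
            feats[(tok.dep_, tok.head.text.lower())] += 1
            # target as head: (child's relation, child word)
            for child in tok.children:
                feats[(child.dep_, child.text.lower())] += 1
    return feats

print(dependency_features("The cell absorbs nutrients through its membrane.", "cell"))
```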

SLIDE 11

Group Exercise

  • Search “Naval Academy” and create a vector.
  • What other school is most similar? Most different?
  • Compare vectors


SLIDE 12
2. Weighting the counts

Common: use the frequency count of the context term as its value (or any function of this frequency)

Alternative: compute an association score

  • Consider feature feat = “dried” for target word = “tangerine”
  • P(feat | word) = count(feat, word) / count(word)
  • assoc_prob(word, feat) = P(feat | word)
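A minimal sketch of this conditional-probability association, assuming co-occurrence counts have already been collected; the counts below are invented for illustration.

```python
# Sketch of assoc_prob; the counts are invented for illustration.
cooccur = {("tangerine", "dried"): 5, ("tangerine", "fresh"): 20}
word_count = {"tangerine": 100}

def assoc_prob(word, feat):
    """P(feat | word) = count(feat, word) / count(word)."""
    return cooccur.get((word, feat), 0) / word_count[word]

print(assoc_prob("tangerine", "dried"))  # 0.05
```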
SLIDE 13

Frequency-based problems

  • Problem: “drink it” is more common than “drink wine”!
    (yet “wine” is a better drinkable thing than “it”)
  • Need: control for expected frequency
  • Solution: normalize by the expected frequency

Objects of the verb drink:

  water       7
  champagne   4
  it          3
  much        3
  anything    3
  liquid      2
  wine        2

SLIDE 14

Weighting: Mutual Information

  • Pointwise mutual information: a measure of how often two events x and y occur together, compared with what we would expect if they were independent:

PMI(x, y) = log2( P(x, y) / (P(x) P(y)) )

  • PMI between a target word w and a feature f:

assoc_PMI(w, f) = log2( P(w, f) / (P(w) P(f)) )
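A sketch of PMI weighting computed from raw (word, feature) counts. The toy counts echo the drink table from the previous slide but are otherwise invented; note how PMI answers that slide's problem by normalizing against the frequency expected under independence.

```python
import math
from collections import Counter

# Toy (word, feature) counts; invented for illustration.
counts = Counter({
    ("drink", "water"): 7, ("drink", "champagne"): 4, ("drink", "it"): 3,
    ("drink", "wine"): 2, ("see", "it"): 9, ("see", "water"): 1,
})

total = sum(counts.values())
w_tot, f_tot = Counter(), Counter()
for (w, f), c in counts.items():
    w_tot[w] += c
    f_tot[f] += c

def pmi(w, f):
    """PMI(w, f) = log2( P(w, f) / (P(w) P(f)) )."""
    p_wf = counts[(w, f)] / total
    return math.log2(p_wf / ((w_tot[w] / total) * (f_tot[f] / total)))

# "wine" now outranks "it" even though its raw count is lower:
print(f"{pmi('drink', 'wine'):.2f} vs {pmi('drink', 'it'):.2f}")  # 0.70 vs -1.30
```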

SLIDE 15

Mutual information intuition

Objects of the verb drink

SLIDE 16

Summary: weightings

  • See Manning and Schütze (1999) for more
SLIDE 17
3. Defining vector similarity
SLIDE 18

Summary of similarity measures
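The summary table itself is an image in the original deck. As a hedged stand-in, here is a sketch of three of the measures named on slide 7: cosine, plus the min/max generalizations of Jaccard and Dice for count vectors. Function names are illustrative.

```python
import math

def cosine(a, b):
    """Dot product scaled by the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def jaccard(a, b):
    """Weighted Jaccard: sum of mins over sum of maxes."""
    return sum(min(x, y) for x, y in zip(a, b)) / sum(max(x, y) for x, y in zip(a, b))

def dice(a, b):
    """Weighted Dice: twice the overlap over the total mass."""
    return 2 * sum(min(x, y) for x, y in zip(a, b)) / (sum(a) + sum(b))

v1 = [7, 4, 3, 0, 2]
v2 = [1, 0, 9, 2, 0]
print(round(cosine(v1, v2), 3), round(jaccard(v1, v2), 3), round(dice(v1, v2), 3))
```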

SLIDE 19

Evaluating similarity measures

  • Intrinsic evaluation
    • Correlation with word similarity ratings from humans
  • Extrinsic (task-based, end-to-end) evaluation
    • Malapropism (spelling error) detection
    • WSD (word sense disambiguation)
    • Essay grading
    • Plagiarism detection
    • Taking TOEFL multiple-choice vocabulary tests
    • Language modeling in some application
SLIDE 20

An example of detected plagiarism