Dan Jurafsky and James Martin
Speech and Language Processing
Chapter 6: Vector Semantics
What do words mean?
First thought: look in a dictionary http://www.oed.com/
Words, Lemmas, Senses, Definitions
pepper, n.
Pronunciation: Brit. /ˈpɛpə/, U.S. /ˈpɛpər/
Forms: OE peopor (rare), OE pipcer (transmission error), OE pipor, OE pipur (rare ...
Etymology: A borrowing from Latin. Etymon: Latin piper. < classical Latin piper, a loanword < Indo-Aryan (as is ancient Greek πέπερι); compare Sanskrit ...
- I. The spice or the plant.
1.
- a. A hot pungent spice derived from the prepared fruits (peppercorns) of the pepper plant, Piper nigrum (see sense 2a), used from early times to season food, either whole or ground to powder (often in association with salt). Also (locally, chiefly with distinguishing word): a similar spice derived from the fruits of certain other species of the genus Piper; the fruits themselves.
The ground spice from Piper nigrum comes in two forms, the more pungent black pepper, produced from black peppercorns, and the milder white pepper, produced from white peppercorns: see BLACK adj. and n. Special uses 5a, PEPPERCORN n. 1a, and WHITE adj. and n. Special uses 7b(a).
2.
- a. The plant Piper nigrum (family Piperaceae), a climbing shrub indigenous to South Asia and also cultivated elsewhere in the tropics, which has alternate stalked entire leaves, with pendulous spikes of small green flowers opposite the leaves, succeeded by small berries turning red when ripe. Also more widely: any plant of the genus Piper or the family Piperaceae.
- b. Usu. with distinguishing word: any of numerous plants of other families having hot pungent fruits or leaves which resemble pepper (1a) in taste and in some cases are used as a substitute for it.
- c. U.S. The California pepper tree, Schinus molle. Cf. PEPPER TREE n. 3.
- 3. Any of various forms of capsicum, esp. Capsicum annuum var. annuum. Originally (chiefly with distinguishing word): any variety of the C. annuum Longum group, with elongated fruits having a hot, pungent taste, the source of cayenne, chilli powder, paprika, etc., or of the perennial C. frutescens, the source of Tabasco sauce. Now frequently (more fully sweet pepper): any variety of the C. annuum Grossum group, with large, bell-shaped or apple-shaped, mild-flavoured fruits, usually ripening to red, orange, or yellow and eaten raw in salads or cooked as a vegetable. Also: the fruit of any of these capsicums.
Sweet peppers are often used in their green immature state (more fully green pepper), but some new varieties remain green when ripe.
Lemma: pepper
Sense 1: spice from pepper plant
Sense 2: the pepper plant itself
Sense 3: another similar plant (Jamaican pepper)
Sense 4: another plant with peppercorns (California pepper)
Sense 5: capsicum (i.e. chili, paprika, bell pepper, etc.)
A sense or “concept” is the meaning component of a word
There are relations between senses
Relation: Synonymity
Synonyms have the same meaning in some or all contexts.
- filbert / hazelnut
- couch / sofa
- big / large
- automobile / car
- vomit / throw up
- water / H2O
Relation: Synonymity
Note that there are probably no examples of perfect synonymy.
- Even if many aspects of meaning are identical
- Still may not preserve the acceptability based on notions of politeness, slang, register, genre, etc.
The Linguistic Principle of Contrast:
- Difference in form -> difference in meaning
Relation: Synonymity?
water / H2O
big / large
brave / courageous
Relation: Antonymy
Senses that are opposites with respect to one feature of meaning; otherwise, they are very similar!
dark / light, short / long, fast / slow, rise / fall, hot / cold, up / down, in / out
More formally: antonyms can
- define a binary opposition, or be at opposite ends of a scale
- long/short, fast/slow
- be reversives:
- rise/fall, up/down
Relation: Similarity
Words with similar meanings. Not synonyms, but sharing some element of meaning.
car / bicycle
cow / horse
Ask humans how similar 2 words are
word1     word2        similarity
vanish    disappear    9.8
behave    obey         7.3
belief    impression   5.95
muscle    bone         3.65
modest    flexible     0.98
hole      agreement    0.3
SimLex-999 dataset (Hill et al., 2015)
Relation: Word relatedness
Also called "word association" Words be related in any way, perhaps via a semantic frame or field
- car, bicycle: similar
- car, gasoline: related, not similar
Semantic field
Words that
- cover a particular semantic domain
- bear structured relations with each other.
hospitals: surgeon, scalpel, nurse, anaesthetic, hospital
restaurants: waiter, menu, plate, food, chef
houses: door, roof, kitchen, family, bed
Relation: Superordinate/subordinate
One sense is a subordinate of another if the first sense is more specific, denoting a subclass of the other.
- car is a subordinate of vehicle
- mango is a subordinate of fruit
Conversely, superordinate:
- vehicle is a superordinate of car
- fruit is a superordinate of mango
Superordinate: vehicle, fruit, furniture
Subordinate: car, mango, chair
These levels are not symmetric
One level of category is distinguished from the others: the "basic level".
Name these items
Superordinate   Basic   Subordinate
furniture       chair   office chair, piano chair, rocking chair
furniture       lamp    torchiere, desk lamp
furniture       table   end table, coffee table
Cluster of Interactional Properties
Basic level things are “human-sized”. Consider chairs:
- We know how to interact with a chair (sitting)
- Not so clear for superordinate categories like furniture
- “Imagine a furniture without thinking of a bed/table/chair/specific basic-level category”
The basic level
- Is the level of distinctive actions
- Is the level which is learned earliest and at which things are first named
- Is the level at which names are shortest and used most frequently
Connotation
Words have affective meanings
- positive connotations (happy), negative connotations (sad)
- positive evaluation (great, love), negative evaluation (terrible, hate)
So far
Concepts or word senses
- Have a complex many-to-many association with words (homonymy, multiple senses)
Have relations with each other
- Synonymy
- Antonymy
- Similarity
- Relatedness
- Superordinate/subordinate
- Connotation
But how to define a concept?
Classical (“Aristotelian”) Theory of Concepts
The meaning of a word: a concept defined by necessary and sufficient conditions.
A necessary condition for being an X is a condition C that X must satisfy in order for it to be an X.
- If not C, then not X
- “Having four sides” is necessary to be a square.
A sufficient condition for being an X is a condition C such that if something satisfies C, then it must be an X.
- If C, then X
- The following necessary conditions, jointly, are sufficient to be a square
- x has (exactly) four sides
- each of x's sides is straight
- x is a closed figure
- x lies in a plane
- each of x's sides is equal in length to each of the others
- each of x's interior angles is equal to the others (right angles)
- the sides of x are joined at their ends
Example from Norman Swartz, SFU
Problem 1: The features are complex and may be context-dependent
William Labov (1975). What are these? Cup or bowl?
The category depends on complex features of the object (diameter, etc)
The category depends on the context! (If there is food in it, it’s a bowl)
Labov’s definition of cup
Explicating ‘cup’ and ‘mug’, a distinction of “notorious difficulty”, Labov defined ‘cup’ as:
“The term cup is used to denote round containers with a ratio of depth to width of 1±r where r ≤ r_b, and r_b = α_1 + α_2 + ... + α_ν, and α_i is a positive quantity when feature i is present and 0 otherwise.
feature 1 = with one handle
2 = made of opaque vitreous material
3 = used for consumption of food
4 = used for the consumption of liquid food
5 = used for consumption of hot liquid food
6 = with a saucer
7 = tapering
8 = circular in cross-section
Cup is used variably to denote such containers with ratios of width to depth 1±r where r_b ≤ r ≤ r_t, with a probability of (r_t - r)/(r_t - r_b). The quantity 1±r_b expresses the distance from the modal value of width to height.”
Ludwig Wittgenstein (1889–1951)
Philosopher of language. In his late years, a proponent of studying “ordinary language”.
Wittgenstein (1945) Philosophical Investigations. Paragraphs 66,67
What is a game?
Wittgenstein’s thought experiment on "What is a game”:
PI #66: “Don’t say: ‘there must be something common, or they would not be called games’, but look and see whether there is anything common to all.”
Is it amusing? Is there competition? Is there long-term strategy? Is skill required? Must luck play a role? Are there cards? Is there a ball?
Family Resemblance
Game 1: A B C
Game 2: B C D
Game 3: A C D
Game 4: A B D
“each item has at least one, and probably several, elements in common with one or more items, but no, or few, elements are common to all items” (Rosch and Mervis, 1975)
How about a radically different approach?
Ludwig Wittgenstein
PI #43: "The meaning of a word is its use in the language"
Let's define words by their usages
In particular, words are defined by their environments (the words around them) Zellig Harris (1954): If A and B have almost identical environments we say that they are synonyms.
What does ongchoi mean?
Suppose you see these sentences:
- Ong choi is delicious sautéed with garlic.
- Ong choi is superb over rice
- Ong choi leaves with salty sauces
And you've also seen these:
- …spinach sautéed with garlic over rice
- Chard stems and leaves are delicious
- Collard greens and other salty leafy greens
Conclusion:
- Ongchoi is a leafy green like spinach, chard, or collard greens
Ong choi: Ipomoea aquatica "Water Spinach"
Yamaguchi, Wikimedia Commons, public domain
[Figure: a 2-D projection of an embedding space. Words cluster by use: good, nice, wonderful, amazing, terrific, fantastic near each other; bad, worst, dislike, incredibly bad in a nearby region; function words such as now, you, i, that, with, by, to, ’s, are, is, a, than off on their own.]
We'll build a new model of meaning focusing on similarity
Each word = a vector
- Not just "word" or word45.
Similar words are "nearby in space"
We define a word as a vector
- Called an "embedding" because it's embedded into a space
- The standard way to represent meaning in NLP
- Fine-grained model of meaning for similarity
- NLP tasks like sentiment analysis
- With words: requires the same word to be in training and test
- With embeddings: ok if similar words occurred!
- Question answering, conversational agents, etc
We'll introduce 2 kinds of embeddings
Tf-idf
- A common baseline model
- Sparse vectors
- Words are represented by a simple function of the counts of nearby words
Word2vec
- Dense vectors
- Representation is created by training a classifier to
distinguish nearby and far-away words
Review: words, vectors, and co-occurrence matrices
Term-document matrix
             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle             1                1               8            15
soldier            2                2              12            36
fool              37               58               1             5
clown              5              117               0             0
Figure 6.2  The term-document matrix for four words in four Shakespeare plays. Each cell is the count of the (row) word in the (column) play.

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle             1                0               7            13
good             114               80              62            89
fool              36               58               1             4
wit               20               15               2             3
Figure 6.3
Each document is represented by a vector of words
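To make this concrete, here is a minimal Python sketch (ours, not from the chapter) that stores the Figure 6.3 counts as a NumPy matrix and reads off each play's document vector as a column:

```python
import numpy as np

plays = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
words = ["battle", "good", "fool", "wit"]

# counts[i, j] = frequency of words[i] in plays[j] (Figure 6.3)
counts = np.array([
    [  1,  0,  7, 13],   # battle
    [114, 80, 62, 89],   # good
    [ 36, 58,  1,  4],   # fool
    [ 20, 15,  2,  3],   # wit
])

# each column is a document vector over the vocabulary
doc_vectors = {play: counts[:, j] for j, play in enumerate(plays)}
print(doc_vectors["As You Like It"])   # [  1 114  36  20]
```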
Visualizing document vectors
[Figure: the four plays plotted in two dimensions, fool on the x-axis and battle on the y-axis: As You Like It [36,1], Twelfth Night [58,0], Julius Caesar [1,7], Henry V [4,13]. The two comedies lie close together, far from the histories.]
Vectors are the basis of information retrieval
             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle             1                0               7            13
good             114               80              62            89
fool              36               58               1             4
wit               20               15               2             3
Figure 6.3
Vectors are similar for the two comedies, but different from the histories. Comedies have more fools and wit and fewer battles.
Words can be vectors too
             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle             1                0               7            13
good             114               80              62            89
fool              36               58               1             4
wit               20               15               2             3
Figure 6.3  The term-document matrix for four words in four Shakespeare plays. Each cell is the count of the (row) word in the (column) play.
battle is "the kind of word that occurs in Julius Caesar and Henry V" fool is "the kind of word that occurs in comedies, especially Twelfth Night"
More common: word-word matrix (or "term-context matrix")
Two words are similar in meaning if their context vectors are similar
             aardvark   computer   data   pinch   result   sugar   …
apricot          0          0        0      1       0        1
pineapple        0          0        0      1       0        1
digital          0          2        1      0       1        0
information      0          1        6      0       4        0
sugar, a sliced lemon, a tablespoonful of apricot jam, a pinch each of
their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened
well suited to programming on the digital computer. In finding the optimal R-stage policy from
for the purpose of gathering data and information necessary for the study authorized in the
[Figure: two of the word vectors plotted in two dimensions, with the count of data on the x-axis and the count of result on the y-axis: digital [1,1], information [6,4].]
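How does such a term-context matrix get built? A hedged sketch (ours): count co-occurrences of words within a ±4-word window, a common window size for these matrices; the function name and the two tokenized snippets below are illustrative only.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=4):
    """Count how often each word pair co-occurs within +/- window words."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[w][tokens[j]] += 1
    return counts

counts = cooccurrence_counts([
    "a tablespoonful of apricot jam a pinch each of".split(),
    "well suited to programming on the digital computer".split(),
])
print(counts["apricot"]["pinch"])     # 1
print(counts["digital"]["computer"])  # 1
```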
Reminders from linear algebra
\text{dot-product}(\vec{v}, \vec{w}) = \vec{v} \cdot \vec{w} = \sum_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + \dots + v_N w_N
vector length
|\vec{v}| = \sqrt{\sum_{i=1}^{N} v_i^2}
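Both operations map directly onto NumPy; a quick illustrative sketch (values ours):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])

dot = np.dot(v, w)           # 1*4 + 2*5 + 3*6 = 32.0
length = np.linalg.norm(v)   # sqrt(1 + 4 + 9) = 3.74...
```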
Cosine for computing similarity
vi is the count for word v in context i wi is the count for word w in context i. Cos(v,w) is the cosine similarity of v and w
\vec{a} \cdot \vec{b} = |\vec{a}|\,|\vec{b}|\cos\theta
\qquad
\frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|} = \cos\theta

\text{cosine}(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\;\sqrt{\sum_{i=1}^{N} w_i^2}}
Cosine as a similarity metric
-1: vectors point in opposite directions
+1: vectors point in the same direction
0: vectors are orthogonal
But raw frequency values are non-negative, so the cosine for these vectors ranges from 0 to 1.
             large   data   computer
apricot        1       0        0
digital        0       1        2
information    1       6        1
Which pair of words is more similar? cosine(apricot,information) = cosine(digital,information) = cosine(apricot,digital) =
\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\;\sqrt{\sum_{i=1}^{N} w_i^2}}

\cos(\text{apricot}, \text{information}) = \frac{1 + 0 + 0}{\sqrt{1+0+0}\,\sqrt{1+36+1}} = \frac{1}{\sqrt{38}} = .16

\cos(\text{digital}, \text{information}) = \frac{0 + 6 + 2}{\sqrt{0+1+4}\,\sqrt{1+36+1}} = \frac{8}{\sqrt{5}\,\sqrt{38}} = .58

\cos(\text{apricot}, \text{digital}) = \frac{0 + 0 + 0}{\sqrt{1+0+0}\,\sqrt{0+1+4}} = 0
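The same arithmetic in a short NumPy sketch (the helper and variable names are ours) reproduces all three values:

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: dot product divided by the product of lengths."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# rows of the count table above; columns are (large, data, computer)
apricot     = np.array([1, 0, 0])
digital     = np.array([0, 1, 2])
information = np.array([1, 6, 1])

print(round(cosine(apricot, information), 2))   # 0.16
print(round(cosine(digital, information), 2))   # 0.58
print(round(cosine(apricot, digital), 2))       # 0.0
```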
Visualizing cosines (well, angles)
[Figure: apricot, digital, and information plotted as vectors, with the count of ‘large’ on dimension 1 and the count of ‘data’ on dimension 2. The angle between digital and information is small; the angle between apricot and information is much larger.]
But raw frequency is a bad representation
Frequency is clearly useful; if sugar appears a lot near apricot, that's useful information. But overly frequent words like the, it, or they are not very informative about the context. We need a function that resolves this frequency paradox!
tf-idf: combine two factors
tf: term frequency. The frequency count, usually log-transformed:

tf_{t,d} = \begin{cases} 1 + \log_{10} \text{count}(t,d) & \text{if count}(t,d) > 0 \\ 0 & \text{otherwise} \end{cases}

idf: inverse document frequency:

idf_i = \log_{10}\!\left(\frac{N}{df_i}\right)

where N is the total number of documents in the collection and df_i is the number of documents that contain word i. Words like "the" or "good" occur in nearly every document, so they have very low idf.

tf-idf value for word t in document d:

w_{t,d} = tf_{t,d} \times idf_t
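A minimal NumPy sketch of these formulas (ours), applied to the Figure 6.3 counts. Treating just these four plays as the entire collection makes the idf values differ from the book's 37-play Shakespeare example:

```python
import numpy as np

# term-document counts from Figure 6.3 (rows: battle, good, fool, wit)
counts = np.array([
    [  1,  0,  7, 13],
    [114, 80, 62, 89],
    [ 36, 58,  1,  4],
    [ 20, 15,  2,  3],
])

N = counts.shape[1]               # total number of documents
df = (counts > 0).sum(axis=1)     # docs containing each word: [3 4 4 4]
idf = np.log10(N / df)            # idf_i = log10(N / df_i)

tf = np.zeros(counts.shape)
nonzero = counts > 0
tf[nonzero] = 1 + np.log10(counts[nonzero])   # tf = 1 + log10(count) if count > 0

tfidf = tf * idf[:, None]         # w_{t,d} = tf_{t,d} * idf_t
# With only 4 documents, every word except "battle" occurs in all plays,
# so only the "battle" row gets nonzero tf-idf weights here.
print(tfidf.round(3))
```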
Summary: tf-idf
Compare two words using tf-idf cosine to see if they are similar Compare two documents
- Take the centroid of the vectors of all the words in the document
- The centroid document vector is:

d = \frac{w_1 + w_2 + \dots + w_k}{k}
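The centroid is just the mean of the word vectors; a tiny sketch with invented values:

```python
import numpy as np

# tf-idf vectors of the k = 3 words in a toy document (values invented)
w1, w2, w3 = np.array([1.0, 0.0]), np.array([4.0, 2.0]), np.array([1.0, 4.0])
d = (w1 + w2 + w3) / 3    # centroid document vector: [2. 2.]
```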
An alternative to tf-idf
Ask whether a context word is particularly informative about the target word.
- Positive Pointwise Mutual Information (PPMI)
Pointwise Mutual Information
Pointwise mutual information:
Do events x and y co-occur more than if they were independent?
PMI between two words: (Church & Hanks 1989)
Do words x and y co-occur more than if they were independent?

\text{PMI}(word_1, word_2) = \log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)}
Positive Pointwise Mutual Information
- PMI ranges from −∞ to + ∞
- But the negative values are problematic
- Things are co-occurring less than we expect by chance
- Unreliable without enormous corpora
- Imagine w1 and w2 whose probability is each 10^-6
- Hard to be sure p(w1,w2) is significantly different than 10^-12
- Plus it’s not clear people are good at “unrelatedness”
- So we just replace negative PMI values by 0
- Positive PMI (PPMI) between word1 and word2:
\text{PPMI}(word_1, word_2) = \max\!\left(\log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)},\; 0\right)
Computing PPMI on a term-context matrix
Matrix F with W rows (words) and C columns (contexts); f_{ij} is the number of times word w_i occurs in context c_j.
p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}

p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}
\qquad
p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}

pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}}
\qquad
ppmi_{ij} = \begin{cases} pmi_{ij} & \text{if } pmi_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}
p(w,context) and the marginals p(w) and p(context):

             computer   data   pinch   result   sugar    p(w)
apricot        0.00     0.00   0.05    0.00     0.05     0.11
pineapple      0.00     0.00   0.05    0.00     0.05     0.11
digital        0.11     0.05   0.00    0.05     0.00     0.21
information    0.05     0.32   0.00    0.21     0.00     0.58
p(context)     0.16     0.37   0.11    0.26     0.11

p(w=information, c=data) = 6/19 = .32
p(w=information) = 11/19 = .58
p(c=data) = 7/19 = .37
p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}
\qquad
p(w_i) = \frac{\sum_{j=1}^{C} f_{ij}}{N}
\qquad
p(c_j) = \frac{\sum_{i=1}^{W} f_{ij}}{N}

where N = \sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij} is the total count.
pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}}

Using the p(w,context) table above:

pmi(information, data) = \log_2\!\left(\frac{.32}{.37 \times .58}\right) = .58 \quad (.57 using full precision)

PPMI(w,context):

             computer   data   pinch   result   sugar
apricot         -        -     2.25     -       2.25
pineapple       -        -     2.25     -       2.25
digital        1.66     0.00    -      0.00      -
information    0.00     0.57    -      0.47      -
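A hedged NumPy sketch (names ours) that recomputes this PPMI table from the raw term-context counts; cells with zero counts come out as 0 rather than the dashes shown above:

```python
import numpy as np

# term-context counts (rows: apricot, pineapple, digital, information;
# columns: computer, data, pinch, result, sugar)
F = np.array([
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [2, 1, 0, 1, 0],
    [1, 6, 0, 4, 0],
], dtype=float)

P = F / F.sum()                     # joint probabilities p_ij (N = 19)
pw = P.sum(axis=1, keepdims=True)   # row marginals p(w)
pc = P.sum(axis=0, keepdims=True)   # column marginals p(c)

with np.errstate(divide="ignore"):  # log2(0) = -inf for zero cells
    pmi = np.log2(P / (pw * pc))
ppmi = np.maximum(pmi, 0)           # clip negatives (and -inf) to 0

print(round(ppmi[3, 1], 2))   # PPMI(information, data) = 0.57
print(round(ppmi[2, 0], 2))   # PPMI(digital, computer) = 1.66
```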
Weighting PMI
PMI is biased toward infrequent events
- Very rare words have very high PMI values
Two solutions:
- Give rare words slightly higher probabilities
- Use add-one smoothing (which has a similar effect)
Weighting PMI: Giving rare context words slightly higher probability
Raise the context probabilities to the power of α = 0.75. This helps because P_α(c) > P(c) for rare c.

Consider two events, P(a) = .99 and P(b) = .01:

P_\alpha(a) = \frac{.99^{.75}}{.99^{.75} + .01^{.75}} = .97
\qquad
P_\alpha(b) = \frac{.01^{.75}}{.99^{.75} + .01^{.75}} = .03
\text{PPMI}_\alpha(w,c) = \max\!\left(\log_2 \frac{P(w,c)}{P(w)\,P_\alpha(c)},\; 0\right)

P_\alpha(c) = \frac{\text{count}(c)^\alpha}{\sum_{c'} \text{count}(c')^\alpha}
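A quick check of P_α on the two-event example above (α = 0.75; code ours):

```python
import numpy as np

alpha = 0.75
p = np.array([0.99, 0.01])              # P(a), P(b)
p_alpha = p**alpha / (p**alpha).sum()   # raise to alpha, renormalize
print(p_alpha.round(2))                 # [0.97 0.03]: the rare event gains mass
```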
Use Laplace (add-1) smoothing
Add-2 smoothed count(w,context):

             computer   data   pinch   result   sugar
apricot         2        2      3       2        3
pineapple       2        2      3       2        3
digital         4        3      2       3        2
information     3        8      2       6        2

p(w,context) [add-2] and p(w):

             computer   data   pinch   result   sugar    p(w)
apricot        0.03     0.03   0.05    0.03     0.05     0.20
pineapple      0.03     0.03   0.05    0.03     0.05     0.20
digital        0.07     0.05   0.03    0.05     0.03     0.24
information    0.05     0.14   0.03    0.10     0.03     0.36
p(context)     0.19     0.25   0.17    0.22     0.17
PPMI versus add-2 smoothed PPMI
PPMI(w,context) [add-2]:

             computer   data   pinch   result   sugar
apricot        0.00     0.00   0.56    0.00     0.56
pineapple      0.00     0.00   0.56    0.00     0.56
digital        0.62     0.00   0.00    0.00     0.00
information    0.00     0.58   0.00    0.37     0.00

PPMI(w,context) [unsmoothed]:

             computer   data   pinch   result   sugar
apricot         -        -     2.25     -       2.25
pineapple       -        -     2.25     -       2.25
digital        1.66     0.00    -      0.00      -
information    0.00     0.57    -      0.47      -
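The add-2 numbers can be verified in a few lines (a sketch reusing the earlier count-matrix layout; names ours): add 2 to every cell, recompute probabilities, take PPMI:

```python
import numpy as np

F = np.array([
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [2, 1, 0, 1, 0],
    [1, 6, 0, 4, 0],
], dtype=float)

F2 = F + 2                           # add-2 smoothing: no zero cells remain
P = F2 / F2.sum()
pw = P.sum(axis=1, keepdims=True)
pc = P.sum(axis=0, keepdims=True)
ppmi = np.maximum(np.log2(P / (pw * pc)), 0)

# e.g. PPMI(information, data) = 0.58, PPMI(digital, computer) = 0.62
print(ppmi.round(2))
```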
Summary for Part I
- Survey of Lexical Semantics
- Idea of Embeddings: Represent a word as a function of its distribution with other words
- Tf-idf
- Cosines
- PPMI
- Next lecture: dense embeddings, word2vec