Vector Semantics
Natural Language Processing Lecture 16 Adapted from Jurafsky and Martin, 3rd ed.
Why vector models of meaning? Computing the similarity between words:
  "fast" is similar to "rapid"
  "tall" is similar to "height"
Example application, question answering:
  Q: "How tall is Mt. Everest?"
  Candidate A: "The official height of Mount Everest is 29,029 feet"
Vector models can also track how word meanings change over time (Kulkarni, Al-Rfou, Perozzi, Skiena 2015; Sagi, Kaufmann, Clark 2013).
[Figure: semantic broadening of "dog", "deer", and "hound" across the periods <1250, Middle (1350-1500), and Modern (1500-1710)]
Intuitions: a word's meaning can be inferred from the contexts it occurs in.
  A bottle of tesgüino is on the table.
  Everybody likes tesgüino.
  Tesgüino makes you drunk.
  We make tesgüino out of corn.
From these contexts alone we can infer that tesgüino is an alcoholic beverage made from corn.
Term-document matrix (word counts in four Shakespeare plays):

            As You Like It   Twelfth Night   Julius Caesar   Henry V
battle             1                1               8            15
soldier            2                2              12            36
fool              37               58               1             5
clown              6              117               0             0
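Read as vectors, the rows of this matrix are word vectors and the columns are document vectors. A minimal sketch, assuming numpy (the zero cells correspond to the blanks in the table above):

```python
import numpy as np

plays = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
words = ["battle", "soldier", "fool", "clown"]

# Term-document matrix: one row per word, one column per play.
M = np.array([
    [ 1,   1,  8, 15],   # battle
    [ 2,   2, 12, 36],   # soldier
    [37,  58,  1,  5],   # fool
    [ 6, 117,  0,  0],   # clown
])

# Each row is a word vector, each column is a document vector.
fool_vector = M[words.index("fool")]            # array([37, 58, 1, 5])
henry_v_vector = M[:, plays.index("Henry V")]   # array([15, 36, 5, 0])
print(fool_vector, henry_v_vector)
```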
Word-word (term-context) co-occurrence matrix:

            aardvark  computer  data  pinch  result  sugar  …
apricot         0         0       0     1      0       1
pineapple       0         0       0     1      0       1
digital         0         2       1     0      1       0
information     0         1       6     0      4       0
Sample contexts for each word:
  sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of,
  their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened
  well suited to programming on the digital computer. In finding the optimal R-stage policy from
  for the purpose of gathering data and information necessary for the study authorized in the
(Schütze and Pedersen, 1993)
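A word-word matrix like this is collected by counting co-occurrences inside a sliding window. A minimal counting sketch (the ±4 window and the toy two-sentence corpus are illustrative assumptions, not the corpus behind the table above):

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=4):
    """Count how often each context word appears within +/- `window`
    positions of each target word."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

corpus = [
    "a pinch of sugar and a tablespoonful of apricot jam".split(),
    "programming on the digital computer".split(),
]
counts = cooccurrence_counts(corpus)
print(dict(counts["apricot"]))
# {'and': 1, 'a': 1, 'tablespoonful': 1, 'of': 1, 'jam': 1}
```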
Pointwise mutual information:
Do events x and y co-occur more often than if they were independent?

$$\text{PMI}(x, y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}$$

Applied to words: do word1 and word2 co-occur more often than if they were independent?

$$\text{PMI}(word_1, word_2) = \log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)}$$

Positive PMI (PPMI) clips negative values at zero:

$$\text{PPMI}(word_1, word_2) = \max\left(\log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)},\ 0\right)$$
Computing PPMI on a term-context matrix: let F be a W × C matrix of counts, where f_ij is the number of times word w_i occurs with context c_j.

$$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$$

$$pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}} \qquad ppmi_{ij} = \begin{cases} pmi_{ij} & \text{if } pmi_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$$
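These formulas translate directly into code. A minimal sketch, assuming numpy, where F is the W × C matrix of counts f_ij:

```python
import numpy as np

def ppmi(F):
    """PPMI matrix from a word-by-context count matrix F (shape W x C)."""
    total = F.sum()
    p_wc = F / total                                  # p_ij
    p_w = F.sum(axis=1, keepdims=True) / total        # p_i*
    p_c = F.sum(axis=0, keepdims=True) / total        # p_*j
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                      # zero counts -> PPMI of 0
    return np.maximum(pmi, 0.0)
```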
! " # $ #
p(w=information,c=data) = p(w=information) = p(c=data) =
25
Count(w,context) computer data pinch result sugar apricot 1 1 pineapple 1 1 digital 2 1 1 information 1 6 4
p(w,context) p(w) computer data pinch result sugar apricot 0.00 0.00 0.05 0.00 0.05 0.11 pineapple 0.00 0.00 0.05 0.00 0.05 0.11 digital 0.11 0.05 0.00 0.05 0.00 0.21 information 0.05 0.32 0.00 0.21 0.00 0.58 p(context) 0.16 0.37 0.11 0.26 0.11 = .32 6/19 11/19 = .58 7/19 = .37
$$p(w_i) = \frac{\sum_{j=1}^{C} f_{ij}}{N} \qquad p(c_j) = \frac{\sum_{i=1}^{W} f_{ij}}{N}$$
$$pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}}$$

PPMI(w, context) ("-" marks cells with zero counts, where PPMI is 0):

              computer  data  pinch  result  sugar
apricot          -        -   2.25     -     2.25
pineapple        -        -   2.25     -     2.25
digital         1.66    0.00    -     0.00     -
information     0.00    0.57    -     0.47     -

(PPMI(information, data) = .57 using full precision)
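Running the ppmi sketch from above on this count table reproduces the worked values, e.g. PPMI(information, data) ≈ 0.57:

```python
import numpy as np

# Rows: apricot, pineapple, digital, information
# Columns: computer, data, pinch, result, sugar
F = np.array([
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [2, 1, 0, 1, 0],
    [1, 6, 0, 4, 0],
])

P = ppmi(F)                       # ppmi() from the sketch above
print(round(P[3, 1], 2))          # PPMI(information, data) -> 0.57
print(round(P[2, 0], 2))          # PPMI(digital, computer) -> 1.66
```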
Weighting PMI: PMI is biased toward infrequent events; very rare words get very high PMI values. One fix is to give rare context words slightly higher probability by raising the context distribution to the power α = 0.75:

$$\text{PPMI}_\alpha(w,c) = \max\left(\log_2 \frac{P(w,c)}{P(w)\,P_\alpha(c)},\ 0\right) \qquad P_\alpha(c) = \frac{count(c)^\alpha}{\sum_{c'} count(c')^\alpha}$$

This helps because $P_\alpha(c) > P(c)$ for rare c. Consider two events with P(a) = .99 and P(b) = .01:

$$P_\alpha(a) = \frac{.99^{.75}}{.99^{.75} + .01^{.75}} = .97 \qquad P_\alpha(b) = \frac{.01^{.75}}{.99^{.75} + .01^{.75}} = .03$$
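A quick check of the α = 0.75 re-weighting on the two-event example (a minimal sketch; the counts 99 and 1 simply stand in for probabilities .99 and .01):

```python
counts = {"a": 99, "b": 1}     # stands in for P(a) = .99, P(b) = .01
alpha = 0.75
denom = sum(c ** alpha for c in counts.values())
p_alpha = {w: c ** alpha / denom for w, c in counts.items()}
print(p_alpha)                 # {'a': ~0.97, 'b': ~0.03}
```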
Add-2 smoothed Count(w, context):

              computer  data  pinch  result  sugar
apricot          2        2     3      2       3
pineapple        2        2     3      2       3
digital          4        3     2      3       2
information      3        8     2      6       2

p(w, context) [add-2] and the marginals:

              computer  data  pinch  result  sugar   p(w)
apricot         0.03    0.03  0.05   0.03    0.05    0.20
pineapple       0.03    0.03  0.05   0.03    0.05    0.20
digital         0.07    0.05  0.03   0.05    0.03    0.24
information     0.05    0.14  0.03   0.10    0.03    0.36
p(context)      0.19    0.25  0.17   0.22    0.17
PPMI(w, context) [add-2]:

              computer  data  pinch  result  sugar
apricot         0.00    0.00  0.56   0.00    0.56
pineapple       0.00    0.00  0.56   0.00    0.56
digital         0.62    0.00  0.00   0.00    0.00
information     0.00    0.58  0.00   0.37    0.00

PPMI(w, context) [unsmoothed, for comparison]:

              computer  data  pinch  result  sugar
apricot          -        -   2.25     -     2.25
pineapple        -        -   2.25     -     2.25
digital         1.66    0.00    -     0.00     -
information     0.00    0.57    -     0.47     -
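Add-2 smoothing is simply adding 2 to every cell of the count matrix before computing PPMI. A short sketch reusing F and ppmi() from the earlier sketches:

```python
F_add2 = F + 2                    # F and ppmi() from the earlier sketches
P_smooth = ppmi(F_add2)
print(round(P_smooth[3, 1], 2))   # add-2 PPMI(information, data) -> 0.58
print(round(P_smooth[2, 0], 2))   # add-2 PPMI(digital, computer) -> 0.62
```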
$$\text{dot-product}(\vec{v},\vec{w}) = \vec{v}\cdot\vec{w} = \sum_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + \ldots + v_N w_N$$
Vector length:

$$|\vec{v}| = \sqrt{\sum_{i=1}^{N} v_i^2}$$
$$\vec{a}\cdot\vec{b} = |\vec{a}|\,|\vec{b}|\cos\theta \qquad \frac{\vec{a}\cdot\vec{b}}{|\vec{a}|\,|\vec{b}|} = \cos\theta$$

$$\cos(\vec{v},\vec{w}) = \frac{\vec{v}\cdot\vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|}\cdot\frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$$

The cosine is just the dot product of the two unit vectors. Here v_i is the PPMI value for word v in context i, and w_i is the PPMI value for word w in context i. Since raw frequency and PPMI values are non-negative, the cosine ranges from 0 to 1.
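A minimal cosine implementation, assuming numpy:

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity between two count/PPMI vectors."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
```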
Which pair of words is more similar?

              large  data  computer
apricot         2      0      0
digital         0      1      2
information     1      6      1

cosine(apricot, information) = (2·1 + 0·6 + 0·1) / (√(4+0+0) · √(1+36+1)) = 2 / (2√38) ≈ .16
cosine(digital, information) = (0·1 + 1·6 + 2·1) / (√(0+1+4) · √(1+36+1)) = 8 / (√5 · √38) ≈ .58
cosine(apricot, digital)     = (2·0 + 0·1 + 0·2) / (√4 · √5) = 0

[Figure: the three words plotted in two dimensions (Dimension 1: 'large', Dimension 2: 'data')]

              large  data
apricot         2      0
digital         0      1
information     1      6
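Plugging the three count vectors from the table into the cosine sketch above:

```python
import numpy as np

apricot     = np.array([2, 0, 0])   # dimensions: large, data, computer
digital     = np.array([0, 1, 2])
information = np.array([1, 6, 1])

print(round(cosine(apricot, information), 2))   # 0.16
print(round(cosine(digital, information), 2))   # 0.58
print(round(cosine(apricot, digital), 2))       # 0.0
```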
[Figure: hierarchical clustering of word vectors groups body parts (wrist, ankle, shoulder, …), animals (dog, cat, puppy, …), cities (Chicago, Atlanta, Montreal, …), and countries/regions (China, Russia, Africa, …)]
Rohde et al. (2006)
"Duty" and "responsibility" have similar syntactic distributions:
  Modified by adjectives: additional, administrative, assumed, collective, congressional, constitutional, …
  Objects of verbs: assert, assign, assume, attend to, avoid, become, breach, …
(Dekang Lin, 1998, "Automatic Retrieval and Clustering of Similar Words")

Contexts can thus be defined by syntactic dependencies rather than nearby words; each dimension is the count of a dependency relation, e.g., count(pobj(cell, absorb)), etc.
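Dependency-based contexts can be collected with any dependency parser. A sketch using spaCy (the en_core_web_sm model and the (word, relation, head-lemma) feature format are assumptions for illustration):

```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")   # assumed: the small English pipeline is installed
doc = nlp("The cell absorbs the nutrients quickly.")

# Count (word, relation, head-lemma) triples as context features.
contexts = Counter()
for token in doc:
    if token.dep_ != "ROOT":
        contexts[(token.text.lower(), token.dep_, token.head.lemma_)] += 1

print(contexts)   # e.g. ('cell', 'nsubj', 'absorb') appears once
```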
Hindle, Don. 1990. Noun Classification from Predicate-Argument Structure. ACL
Word2vec comes in two flavors: skip-gram with negative sampling (SGNS), which we'll cover in detail below, and CBOW (continuous bag of words), which gets a shorter introduction.
Skip-gram with negative sampling trains on two kinds of examples: observed (target, context) pairs as positive examples, and randomly sampled pairs as negative examples (the NS in SGNS). The embeddings are learned so as to distinguish the two cases (positive and negative examples).
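For one positive (target, context) pair and k sampled negatives, the SGNS objective can be written directly in terms of the target word's embedding and the context words' embeddings. A minimal sketch, assuming numpy and illustrative random vectors:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(target_vec, pos_context_vec, neg_context_vecs):
    """Negative log-likelihood of one positive pair plus its negative samples."""
    loss = -np.log(sigmoid(np.dot(pos_context_vec, target_vec)))
    for neg in neg_context_vecs:
        loss -= np.log(sigmoid(-np.dot(neg, target_vec)))
    return loss

rng = np.random.default_rng(0)
w = rng.normal(size=50)                               # embedding of the target word
c_pos = rng.normal(size=50)                           # embedding of a true context word
negatives = [rng.normal(size=50) for _ in range(5)]   # k = 5 sampled negatives
print(sgns_loss(w, c_pos, negatives))
```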
Each vocabulary word i has two embeddings: an input embedding v_i, stored in the input matrix W, and an output (context) embedding v′_i, stored in the output matrix W′.
[Figure: the matrix W (d × |V|) and the matrix W′ (|V| × d), each holding one d-dimensional embedding per vocabulary word i, 1 ≤ i ≤ |V|]
Suppose the input word is the j-th word in the vocabulary, so we'll call it w_j (1 ≤ j ≤ |V|). Hence our task is to compute P(w_k | w_j).
[Figure: skip-gram network. A 1-hot input vector (1 × |V|) for w_t is multiplied by W (|V| × d) to give the projection layer (1 × d), the embedding for w_t; multiplying by W′ (d × |V|) gives output layers (1 × |V|) of probabilities of the context words w_{t-1} and w_{t+1}]
$$P(w_k \mid w_j) = \frac{\exp(v'_k \cdot v_j)}{\sum_{w \in V} \exp(v'_w \cdot v_j)}$$
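This is a softmax over the dot products of the input embedding v_j with every output embedding. A minimal sketch with a tiny random vocabulary (the sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                        # toy vocabulary size and embedding dimension
W_in  = rng.normal(size=(V, d))    # input embeddings v_i  (one row per word)
W_out = rng.normal(size=(V, d))    # output embeddings v'_i

def p_context_given_target(j):
    """p(w_k | w_j) for every k in the vocabulary."""
    scores = W_out @ W_in[j]                      # v'_k . v_j for all k
    exp_scores = np.exp(scores - scores.max())    # subtract max for numerical stability
    return exp_scores / exp_scores.sum()

probs = p_context_given_target(2)
print(probs, probs.sum())          # a distribution over the 5 words; sums to 1
```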
Training objective: maximize the log-likelihood of the text.

$$\arg\max_\theta \log p(\text{Text})$$

$$= \arg\max_\theta \log \prod_{t=1}^{T} p(w^{(t-C)},\ldots,w^{(t-1)},w^{(t+1)},\ldots,w^{(t+C)} \mid w^{(t)})$$

$$= \arg\max_\theta \sum_{t=1}^{T} \sum_{-C \le j \le C,\ j \ne 0} \log p(w^{(t+j)} \mid w^{(t)})$$

$$= \arg\max_\theta \sum_{t=1}^{T} \sum_{-C \le j \le C,\ j \ne 0} \log \frac{\exp(v'_{(t+j)} \cdot v_{(t)})}{\sum_{w \in V} \exp(v'_w \cdot v_{(t)})}$$

$$= \arg\max_\theta \sum_{t=1}^{T} \sum_{-C \le j \le C,\ j \ne 0} \left[ v'_{(t+j)} \cdot v_{(t)} - \log \sum_{w \in V} \exp(v'_w \cdot v_{(t)}) \right]$$
The dot product v′_j · v_i measures the association between input word i and output word j. Skip-gram with negative sampling reaches its optimum just when this matrix is a shifted version of PMI: W W′ᵀ = M_PMI − log k, where k is the number of negative samples (Levy and Goldberg, 2014). In other words, SGNS implicitly factorizes a shifted PMI matrix into the two embedding matrices.
[Figure: CBOW network. 1-hot input vectors (1 × |V|), one per context word (w_{t-1}, w_{t+1}), are each multiplied by W (|V| × d); the projection layer (1 × d) is the sum of the context-word embeddings; multiplying by W′ (d × |V|) gives the output layer (1 × |V|), the probability of the center word w_t]
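CBOW runs in the opposite direction: sum the input embeddings of the context words and predict the center word. A sketch reusing the toy W_in and W_out matrices from the previous sketch:

```python
import numpy as np

def cbow_predict(context_ids):
    """Distribution over the vocabulary for the center word, given
    the ids of the surrounding context words (reuses W_in, W_out)."""
    h = W_in[context_ids].sum(axis=0)           # projection layer: sum of context embeddings
    scores = W_out @ h                          # one score per vocabulary word
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

print(cbow_predict([1, 3]))   # p(w_t | w_{t-1} = word 1, w_{t+1} = word 3)
```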
target:   Redmond              Havel                    ninjutsu        graffiti      capitulate
          Redmond Wash.        Vaclav Havel             ninja           spray paint   capitulation
          Redmond Washington   president Vaclav Havel   martial arts    grafitti      capitulated
          Microsoft            Velvet Revolution        swordsmanship   taggers       capitulating

Figure 19.14: Examples of the closest tokens to some target words using a phrase-based extension of the skip-gram algorithm.
vector('king') − vector('man') + vector('woman') ≈ vector('queen')
vector('Paris') − vector('France') + vector('Italy') ≈ vector('Rome')
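The analogy is answered by vector arithmetic followed by a nearest-neighbor search, usually excluding the three query words. A minimal sketch, assuming `embeddings` is a dict mapping each word to its vector:

```python
import numpy as np

def analogy(a, b, c, embeddings):
    """Return the word whose vector is closest (by cosine) to
    vector(b) - vector(a) + vector(c), excluding the three query words."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# analogy("man", "king", "woman", embeddings)   # -> ideally "queen"
```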
A caveat: this does not mean we should treat embeddings as fully compositional. The result of the arithmetic is only approximately vector('queen'), and what the offset captures is similarity, not a collection of semantic components.
Static embeddings give each word type exactly one embedding, even when it has several senses: "bank" gets the same vector whether it means a financial institution or the land at the edge of a river. Can we instead compute contextual representations of words?
The hottest blockbuster in NLP this year
Fully understanding these models requires covering neural architectures, including notions like attention, as well as state-of-the-art architectures like transformers; we're going to abstract over them here. Do yourself a favor and read the BERT paper: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
These models build contextual representations of each token. One approach combines left-to-right and right-to-left LSTMs; a naively bidirectional language model would let words see themselves, which would foul everything up.
BERT instead trains on a cloze-style task: words in the input are masked out, and a person/machine is required to fill in an appropriate word for each blank (the masked language modeling task).
Jay Alammar, http://jalammar.github.io/illustrated-bert/