Distributed Representations
CMSC 473/673 UMBC September 27th, 2017
Some slides adapted from 3SLP
Course Announcement: Assignment 2 due Wednesday, October 18th, by 11:59 AM
Capstone: perform language ID with maxent models on code-switched …
Why work with the log-likelihood? Log-probabilities cover a wide range of (negative) numbers, sums are more stable than products, and differentiating becomes nicer (even though Z depends on θ). The objective is implicitly defined with respect to (wrt) your data on hand.
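One way to see the stability point: log-probabilities are safe to add where raw probabilities underflow. A minimal numpy illustration (the scores below are made up):

```python
import numpy as np

# Three very negative log-probabilities, e.g. of long sentences.
log_probs = np.array([-1000.0, -1001.0, -1002.0])

# Naive: exponentiate, sum, take the log -- underflows to log(0) = -inf.
naive = np.log(np.sum(np.exp(log_probs)))

# Stable: the log-sum-exp trick factors out the max before exponentiating.
m = log_probs.max()
stable = m + np.log(np.sum(np.exp(log_probs - m)))

print(naive, stable)   # -inf  vs.  about -999.59
```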
Each component k of the gradient is the difference between:
the total value of feature f_k in the training data, and
the total value the current model p_θ expects feature f_k to have:
$$\frac{\partial F}{\partial \theta_k} \;=\; \sum_i f_k(x_i, y_i) \;-\; \sum_i \sum_{y'} f_k(x_i, y')\, p_\theta(y' \mid x_i)$$
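A minimal numpy sketch of this gradient, assuming features are stored densely as feats[i, y] = f(x_i, y); the array names here are illustrative, not from the slides:

```python
import numpy as np

def loglik_gradient(theta, feats, gold):
    """feats: (n, L, K) array of feature vectors f(x_i, y');
    gold: (n,) array of gold labels y_i; returns the K-dim gradient."""
    n, L, K = feats.shape
    scores = feats @ theta                        # theta . f(x_i, y') -> (n, L)
    scores -= scores.max(axis=1, keepdims=True)   # stabilize the softmax
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)             # p_theta(y' | x_i)

    observed = feats[np.arange(n), gold].sum(axis=0)   # sum_i f_k(x_i, y_i)
    expected = np.einsum('il,ilk->k', p, feats)        # model's expected counts
    return observed - expected
```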
predict the next word given some context ($w_{i-3}, w_{i-2}, w_{i-1} \to w_i$): compute beliefs about what is likely…

$p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \propto \mathrm{count}(w_{i-3}, w_{i-2}, w_{i-1}, w_i)$
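A minimal sketch of this count-based estimate (the toy corpus is the example sentence used later in these slides):

```python
from collections import Counter, defaultdict

counts = defaultdict(Counter)   # (w3, w2, w1) -> Counter over next words
tokens = "a cloud computer stores digital data on a remote computer".split()
for a, b, c, d in zip(tokens, tokens[1:], tokens[2:], tokens[3:]):
    counts[(a, b, c)][d] += 1   # count(w_{i-3}, w_{i-2}, w_{i-1}, w_i)

def p_next(context, word):
    """Maximum-likelihood estimate p(word | context), context = 3 words."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0

print(p_next(('a', 'cloud', 'computer'), 'stores'))   # 1.0 on this toy corpus
```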
the same prediction task, now with a maxent model instead of raw counts:

$p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \propto \operatorname{softmax}(\theta \cdot f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))$
and with per-word parameters scoring a representation of the context alone:

$p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \propto \operatorname{softmax}(\theta_{w_i} \cdot \boldsymbol{f}(w_{i-3}, w_{i-2}, w_{i-1}))$

create/use “distributed representations” $e_{i-3}, e_{i-2}, e_{i-1}$ for the context words; combine these representations (a matrix-vector product) into $\boldsymbol{f} = C\,[e_{i-3}; e_{i-2}; e_{i-1}]$; score each candidate word $w$ by its output vector $\theta_w$
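A minimal numpy sketch of this architecture (a Bengio-style feed-forward language model); the dimensions, the tanh nonlinearity, and all array names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10_000, 100                 # vocabulary size, embedding dimension

E = rng.normal(size=(V, D))        # embeddings: word id -> e_w
C = rng.normal(size=(D, 3 * D))    # combines the three context embeddings
Theta = rng.normal(size=(V, D))    # one output vector theta_w per word

def next_word_probs(w3, w2, w1):
    """p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) as a softmax over all words."""
    e = np.concatenate([E[w3], E[w2], E[w1]])   # distributed representations
    f = np.tanh(C @ e)                          # combine: matrix-vector product
    scores = Theta @ f                          # theta_w . f for every word w
    scores -= scores.max()                      # stabilize
    p = np.exp(scores)
    return p / p.sum()
```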
“Oculist and eye-doctor … occur in almost the same environments.” “If A and B have almost identical environments we say that they are synonyms.” (Harris, 1954)

“You shall know a word by the company it keeps!” (Firth, 1957)
[Slide example words: reflected, paper, truth, glean, hide, falsehood]
Capture “like” (similar) words
Capture relationships (Mikolov et al., 2013)
vector(‘king’) – vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’) vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
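A hedged sketch of how such analogies are tested: add and subtract vectors, then look for the nearest word. Here `vec` is a hypothetical dict mapping words to unit-normalized numpy vectors:

```python
import numpy as np

def analogy(a, b, c, vec, topn=1):
    """Words w whose vectors are closest (cosine) to vec[b] - vec[a] + vec[c]."""
    target = vec[b] - vec[a] + vec[c]
    target /= np.linalg.norm(target)
    sims = {w: float(v @ target)      # dot = cosine, since vectors are unit-norm
            for w, v in vec.items() if w not in (a, b, c)}
    return sorted(sims, key=sims.get, reverse=True)[:topn]

# analogy('man', 'king', 'woman', vec)     -> hopefully ['queen']
# analogy('France', 'Paris', 'Italy', vec) -> hopefully ['Rome']
```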
document (↓)–word (→) count matrix: basic bag-of-words counting

                 battle  soldier  fool  clown
As You Like It      1       2      37     6
Twelfth Night       1       2      58    117
Julius Caesar       8      12       1     0
Henry V            15      36       5     0
Assumption: Two documents are similar if their vectors are similar
Assumption: Two words are similar if their vectors are similar
Issue: Count word vectors are very large, sparse, and skewed!
context (↓)–word (→) count matrix:

            apricot  pineapple  digital  information
aardvark       0         0         0         0
computer       0         0         2         1
data           0        10         1         6
pinch          1         1         0         0
result         0         0         1         4
sugar          1         1         0         0

Context: those other words within a small “window” of a target word
Example: “a cloud computer stores digital data on a remote computer”, where the context of a target word like “digital” is the nearby words within the window.
The size of the window depends on your goals:
the shorter the window (±1–3 words), the more syntactic (“syntax-y”) the representation
the longer the window (±4–10 words), the more semantic (“semantic-y”) the representation
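A minimal sketch of windowed counting, with the window size as the knob discussed above (function and variable names are illustrative):

```python
from collections import Counter

def context_counts(tokens, window=2):
    """Count (target, context) pairs within +/- `window` words."""
    counts = Counter()
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

tokens = "a cloud computer stores digital data on a remote computer".split()
print(context_counts(tokens, window=2)[('digital', 'data')])   # 1
```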
Sparse vector representations: … matrices
Dense vector representations: …
You already saw some of this in assignment 1 (question 3)!
Raw word frequency is not a great measure of association between words
It’s very skewed: “the” and “of” are very frequent, but maybe not the most discriminative
We’d rather have a measure that asks whether a context word is particularly informative about the target word.
(Positive) Pointwise Mutual Information ((P)PMI)
Pointwise mutual information:

$\mathrm{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}$

Do events x and y co-occur more than if they were independent?
PMI between two words (Church & Hanks, 1989):

$\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\,p(w_2)}$

Do words $w_1$ and $w_2$ co-occur more than if they were independent?
Objects of “drink” (note how PMI picks out informative objects where raw counts do not):

Object of “drink”   Count   PMI
it                    3      1.3
anything              3      5.2
wine                  2      9.3
tea                   2     11.8
liquid                2     10.5
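A minimal numpy sketch of (positive) PMI over a count matrix like the ones above; the row/column orientation is an assumption:

```python
import numpy as np

def ppmi(counts):
    """counts: (rows=contexts, cols=words) co-occurrence matrix -> PPMI matrix."""
    total = counts.sum()
    p_xy = counts / total
    p_x = p_xy.sum(axis=1, keepdims=True)    # context marginals
    p_y = p_xy.sum(axis=0, keepdims=True)    # word marginals
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log2(p_xy / (p_x * p_y))
    # Positive PMI: clip negative associations (and log 0 = -inf) to zero.
    return np.where(p_xy > 0, np.maximum(pmi, 0.0), 0.0)
```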
Brown clustering: build a binary tree from bottom to top based on how clusters are merged. Each word is represented by a binary string = the path from the root to its leaf. Each intermediate node is a cluster.
[Figure: example cluster tree with leaves such as CEO, chairman, president, November, labeled by binary paths like 00, 01, 000, 001, 0010, 0011, 010 from the root]

In practice, use an available implementation: https://github.com/percyliang/brown-cluster
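A small sketch of consuming that tool's output; its paths file is expected to contain one bitstring<TAB>word<TAB>count line per word (check your version's format), and truncating the bitstring gives coarser clusters:

```python
def load_brown_clusters(path, prefix_len=None):
    """Map each word to its cluster bitstring (optionally truncated)."""
    clusters = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            bits, word, _count = line.rstrip('\n').split('\t')
            clusters[word] = bits[:prefix_len] if prefix_len else bits
    return clusters

# clusters = load_brown_clusters('paths', prefix_len=4)
# clusters.get('CEO')   # e.g. '0010'; shared prefixes = nearby in the tree
```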
Where word similarity helps: question answering, spell checking, essay grading
Correlation between algorithm and human word similarity ratings
WordSim-353: 353 noun pairs rated 0–10, e.g. sim(plane, car) = 5.77
Taking TOEFL multiple-choice vocabulary tests
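A hedged sketch of the rating-correlation evaluation; `model_sim` is a hypothetical similarity function (e.g., cosine over your vectors):

```python
from scipy.stats import spearmanr

def evaluate_similarity(pairs, human_scores, model_sim):
    """Rank correlation between model similarities and human ratings."""
    model_scores = [model_sim(w1, w2) for w1, w2 in pairs]
    rho, _pvalue = spearmanr(human_scores, model_scores)
    return rho   # 1.0 = perfect rank agreement, 0 = none

# rho = evaluate_similarity([('plane', 'car'), ...], [5.77, ...], cosine_sim)
```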
Cosine similarity: $\cos(\mathbf{v}, \mathbf{w}) = \frac{\mathbf{v} \cdot \mathbf{w}}{\lVert \mathbf{v} \rVert\, \lVert \mathbf{w} \rVert}$

High when two vectors have large values in the same dimensions; low (0) for orthogonal vectors with zeros in complementary dimensions
word (↓)–context (→) counts:

              large  data  computer
apricot         2      0      0
digital         0      1      2
information     1      6      1

$\cos(\text{apricot}, \text{information}) = \frac{2 + 0 + 0}{\sqrt{4 + 0 + 0}\,\sqrt{1 + 36 + 1}} = 0.1622$

$\cos(\text{digital}, \text{information}) = \frac{0 + 6 + 2}{\sqrt{0 + 1 + 4}\,\sqrt{1 + 36 + 1}} = 0.5804$

$\cos(\text{apricot}, \text{digital}) = \frac{0 + 0 + 0}{\sqrt{4 + 0 + 0}\,\sqrt{0 + 1 + 4}} = 0.0$
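The same numbers, checked in a few lines of numpy:

```python
import numpy as np

# Rows of the table above (dimensions: large, data, computer).
v = {'apricot':     np.array([2.0, 0.0, 0.0]),
     'digital':     np.array([0.0, 1.0, 2.0]),
     'information': np.array([1.0, 6.0, 1.0])}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine(v['apricot'], v['information']), 4))   # 0.1622
print(round(cosine(v['digital'], v['information']), 4))   # 0.5804
print(cosine(v['apricot'], v['digital']))                 # 0.0
```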
Lin (1998), “Automatic Retrieval and Clustering of Similar Words”
Padó and Lapata (2007), “Dependency-based Construction of Semantic Space Models”
Levy and Goldberg (2014), “Dependency-Based Word Embeddings”
Cotterell and Schütze (2015), “Morphological Word Embeddings”
Ferraro et al. (2017), “Frame-Based Continuous Lexical Semantics through Exponential Family Tensor Factorization and Semantic Proto-Roles”
and many more…