SLIDE 1 Distributed Representations
CMSC 473/673 UMBC
Some slides adapted from 3SLP
SLIDE 2
Outline
Recap
  Maxent models
  Basic neural language models
Continuous representations
  Motivation
  Key idea: represent words with vectors
  Two common counting types
  Two (four) common continuous representation models
  Evaluation
SLIDE 3 Maxent Objective: Log-Likelihood
L(θ) = Σ_i log p_θ(y_i | x_i) = Σ_i [ θ · f(x_i, y_i) − log Z_θ(x_i) ] = F(θ)
Differentiating this becomes nicer (even though Z depends on θ)
The objective is implicitly defined with respect to (wrt) your data on hand
SLIDE 4 Log-Likelihood Gradient
Each component k is the difference between:
the total value of feature f_k in the training data
and
the total value the current model p_θ expects feature f_k to have:
∂L/∂θ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_{y'} p_θ(y' | x_i) f_k(x_i, y')
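To make the "observed minus expected feature counts" reading concrete, here is a minimal NumPy sketch of this gradient for a multiclass maxent model, assuming the feature function f(x, y) = one-hot(y) ⊗ x; the names and setup are illustrative, not the course's exact formulation.

```python
import numpy as np

def log_likelihood_grad(theta, X, y, num_labels):
    """Maxent log-likelihood gradient: observed feature totals in the data
    minus the feature totals the current model p_theta expects."""
    n, d = X.shape
    grad = np.zeros((num_labels, d))
    for i in range(n):
        scores = theta @ X[i]                 # one score per label: theta_y . x_i
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                  # p_theta(y | x_i) via softmax
        grad[y[i]] += X[i]                    # observed value of each feature f_k
        grad -= np.outer(probs, X[i])         # expected value: sum_y' p(y'|x_i) f_k(x_i, y')
    return grad
```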
SLIDE 5 N-gram Language Models
predict the next word given some context…
p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i)
w_{i-3}, w_{i-2}, w_{i-1} → w_i: compute beliefs about what is likely…
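A tiny count-based sketch of the estimate above; the toy corpus is made up for illustration.

```python
from collections import Counter, defaultdict

# Count-based 4-gram sketch; the toy corpus is made up for illustration.
corpus = "the paper reflected the truth the paper hid the truth".split()

counts = defaultdict(Counter)
for i in range(3, len(corpus)):
    counts[tuple(corpus[i-3:i])][corpus[i]] += 1   # count(w_{i-3}, w_{i-2}, w_{i-1}, w_i)

def p_next(context):
    """p(w_i | w_{i-3}, w_{i-2}, w_{i-1}), normalized from the 4-gram counts."""
    c = counts[tuple(context)]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

print(p_next(["the", "paper", "reflected"]))   # {'the': 1.0}
```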
SLIDE 6 Maxent Language Models
predict the next word given some context… w_{i-3}, w_{i-2}, w_{i-1} → w_i: compute beliefs about what is likely…
p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))
SLIDE 7 Neural Language Models
predict the next word given some context… w_{i-3}, w_{i-2}, w_{i-1} → w_i: compute beliefs about what is likely…
p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1}))
create/use “distributed representations”… e_{i-3}, e_{i-2}, e_{i-1}
combine these representations (a matrix-vector product) into f
[figure: each context word is looked up as an embedding e_w; output parameters θ_{w_i} score the next word]
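A minimal NumPy sketch of the pieces named in the figure (embedding lookup, combining with a matrix-vector product, scoring with a softmax). All dimensions, the tanh nonlinearity, and the random initialization are illustrative assumptions, not the course's exact architecture.

```python
import numpy as np

V, d, h = 10_000, 64, 128                  # vocabulary size, embedding dim, hidden dim (assumed)
E = 0.01 * np.random.randn(V, d)           # input word embeddings e_w
C = 0.01 * np.random.randn(h, 3 * d)       # combines the three context embeddings
Theta = 0.01 * np.random.randn(V, h)       # output parameters theta_{w_i}, one row per word

def next_word_probs(w3, w2, w1):
    """p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(theta_{w_i} . f(w_{i-3}, w_{i-2}, w_{i-1}))"""
    e = np.concatenate([E[w3], E[w2], E[w1]])   # look up the distributed representations
    f = np.tanh(C @ e)                          # combine them with a matrix-vector product
    scores = Theta @ f                          # one score per vocabulary word
    scores -= scores.max()                      # stabilize the softmax
    p = np.exp(scores)
    return p / p.sum()
```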
SLIDE 8 Neural Language Models
predict the next word given some context… w_{i-3}, w_{i-2}, w_{i-1} → w_i: compute beliefs about what is likely…
p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1}))
create/use “distributed representations”… e_{i-3}, e_{i-2}, e_{i-1}
combine these representations (a matrix-vector product) into f
[figure: each context word is looked up as an embedding e_w; output parameters θ_{w_i} score the next word]
SLIDE 9
Outline
Recap
  Maxent models
  Basic neural language models
Continuous representations
  Motivation
  Key idea: represent words with vectors
  Two common counting types
  Two (four) common continuous representation models
  Evaluation
SLIDE 10
How have we represented words?
Each word is a distinct item
Bijection between the strings and unique integer ids: "cat" --> 3, "kitten" --> 792, "dog" --> 17394
Are "cat" and "kitten" similar?
SLIDE 11
How have we represented words?
Each word is a distinct item
Bijection between the strings and unique integer ids: "cat" --> 3, "kitten" --> 792, "dog" --> 17394
Are "cat" and "kitten" similar?
Equivalently: "One-hot" encoding
Represent each word type w with a vector the size of the vocabulary
This vector has V−1 zero entries and 1 non-zero (one) entry
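A small sketch of one-hot encoding over a made-up three-word vocabulary; note that any two distinct one-hot vectors are orthogonal, which is why "cat" and "kitten" look no more similar than "cat" and "dog".

```python
import numpy as np

# Tiny, made-up vocabulary for illustration.
vocab = {"cat": 0, "kitten": 1, "dog": 2}

def one_hot(word, vocab):
    """A |V|-dimensional vector: all zeros except a single 1 at the word's id."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

# Distinct one-hot vectors are always orthogonal, so "cat" is no more
# similar to "kitten" than it is to "dog".
print(one_hot("cat", vocab) @ one_hot("kitten", vocab))   # 0.0
```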
SLIDE 12
Word Similarity ➔ Plagiarism Detection
SLIDE 13
Distributional models of meaning = vector-space models of meaning = vector semantics
Zellig Harris (1954):
“oculist and eye-doctor … occur in almost the same environments”
“If A and B have almost identical environments we say that they are synonyms.”
Firth (1957):
“You shall know a word by the company it keeps!”
SLIDE 14
Continuous Meaning
The paper reflected the truth.
SLIDE 15 Continuous Meaning
The paper reflected the truth.
[figure: “paper”, “reflected”, “truth” plotted as points in a continuous 2-D space]
SLIDE 16 Continuous Meaning
The paper reflected the truth.
[figure: “paper”, “reflected”, “truth” plotted as points, together with “glean”, “hide”, “falsehood”]
SLIDE 17 (Some) Properties of Embeddings
Capture “like” (similar) words
Mikolov et al. (2013)
SLIDE 18 (Some) Properties of Embeddings
Capture “like” (similar) words Capture relationships
Mikolov et al. (2013)
vector(‘king’) – vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)
vector(‘Paris’) – vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
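A hedged sketch of that vector arithmetic, assuming a hypothetical dictionary `vecs` mapping words to NumPy vectors (for example, loaded from pretrained embeddings); it returns the vocabulary word closest by cosine to vector(‘king’) − vector(‘man’) + vector(‘woman’).

```python
import numpy as np

def analogy(a, b, c, vecs):
    """Return the word whose vector is closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    target = vecs[a] - vecs[b] + vecs[c]
    target = target / np.linalg.norm(target)
    candidates = (w for w in vecs if w not in {a, b, c})
    return max(candidates,
               key=lambda w: (vecs[w] @ target) / np.linalg.norm(vecs[w]))

# Usage (hypothetical): analogy("king", "man", "woman", vecs)  # ideally returns "queen"
```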
SLIDE 19
Outline
Recap
  Maxent models
  Basic neural language models
Continuous representations
  Motivation
  Key idea: represent words with vectors
  Two common counting types
  Two (four) common continuous representation models
  Evaluation
SLIDE 20 Key Idea
- 1. Acquire basic contextual statistics (counts) for each word type w
- 2. Extract a real-valued vector v for each word w from those statistics
- 3. Use the vectors to represent each word in later tasks
SLIDE 21
Outline
Recap
  Maxent models
  Basic neural language models
Continuous representations
  Motivation
  Key idea: represent words with vectors
  Two common counting types
  Two (four) common continuous representation models
  Evaluation
SLIDE 22 “You shall know a word by the company it keeps!” Firth (1957)
                 battle  soldier  fool  clown
As You Like It        1        2    37      6
Twelfth Night         1        2    58    117
Julius Caesar         8       12     1
Henry V              15       36     5
document (↓)-word (→) count matrix
SLIDE 23 “You shall know a word by the company it keeps!” Firth (1957)
                 battle  soldier  fool  clown
As You Like It        1        2    37      6
Twelfth Night         1        2    58    117
Julius Caesar         8       12     1
Henry V              15       36     5
document (↓)-word (→) count matrix
basic bag-of-words counting
SLIDE 24 “You shall know a word by the company it keeps!” Firth (1957)
                 battle  soldier  fool  clown
As You Like It        1        2    37      6
Twelfth Night         1        2    58    117
Julius Caesar         8       12     1
Henry V              15       36     5
document (↓)-word (→) count matrix
Assumption: Two documents are similar if their vectors are similar
SLIDE 25 “You shall know a word by the company it keeps!” Firth (1957)
                 battle  soldier  fool  clown
As You Like It        1        2    37      6
Twelfth Night         1        2    58    117
Julius Caesar         8       12     1
Henry V              15       36     5
document (↓)-word (→) count matrix
Assumption: Two words are similar if their vectors are similar
SLIDE 26 “You shall know a word by the company it keeps!” Firth (1957)
                 battle  soldier  fool  clown
As You Like It        1        2    37      6
Twelfth Night         1        2    58    117
Julius Caesar         8       12     1
Henry V              15       36     5
document (↓)-word (→) count matrix
Assumption: Two words are similar if their vectors are similar
Issue: Count word vectors are very large, sparse, and skewed!
SLIDE 27 “You shall know a word by the company it keeps!” Firth (1957)
              apricot  pineapple  digital  information
aardvark
computer                                2            1
data                         10         1            6
pinch               1         1
result                                  1            4
sugar               1         1
context (↓)-word (→) count matrix
Context: those other words within a small “window” of a target word
SLIDE 28 “You shall know a word by the company it keeps!” Firth (1957)
              apricot  pineapple  digital  information
aardvark
computer                                2            1
data                         10         1            6
pinch               1         1
result                                  1            4
sugar               1         1
context (↓)-word (→) count matrix
a cloud computer stores digital data on a remote computer
Context: those other words within a small “window” of a target word
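A small sketch of window-based context counting over the example sentence above; the ±2 window size is an arbitrary choice for illustration.

```python
from collections import Counter, defaultdict

# Window-based context counting on the example sentence above;
# the ±2 window size is an arbitrary choice for illustration.
sentence = "a cloud computer stores digital data on a remote computer".split()
window = 2

cooc = defaultdict(Counter)
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            cooc[target][sentence[j]] += 1    # count context words near the target

print(cooc["digital"])   # Counter({'computer': 1, 'stores': 1, 'data': 1, 'on': 1})
```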
SLIDE 29 “You shall know a word by the company it keeps!” Firth (1957)
              apricot  pineapple  digital  information
aardvark
computer                                2            1
data                         10         1            6
pinch               1         1
result                                  1            4
sugar               1         1
context (↓)-word (→) count matrix
The size of the window depends on your goals:
The shorter the window, the more syntactic the representation (± 1-3 words: more “syntax-y”)
The longer the window, the more semantic the representation (± 4-10 words: more “semantic-y”)
SLIDE 30 “You shall know a word by the company it keeps!” Firth (1957)
              apricot  pineapple  digital  information
aardvark
computer                                2            1
data                         10         1            6
pinch               1         1
result                                  1            4
sugar               1         1
context (↓)-word (→) count matrix
Assumption: Two words are similar if their vectors are similar
Issue: Count word vectors are very large, sparse, and skewed!
Context: those other words within a small “window” of a target word
SLIDE 31
Outline
Recap
  Maxent models
  Basic neural language models
Continuous representations
  Motivation
  Key idea: represent words with vectors
  Two common counting types
  Two (four) common continuous representation models
  Evaluation
SLIDE 32 Four kinds of vector models
Sparse vector representations
- 1. Mutual-information-weighted word co-occurrence matrices
Dense vector representations:
- 2. Singular value decomposition/Latent Semantic Analysis
- 3. Neural-network-inspired models (skip-grams, CBOW)
- 4. Brown clusters
SLIDE 33
Shared Intuition
Model the meaning of a word by “embedding” it in a vector space
The meaning of a word is a vector of numbers
Contrast: in many computational linguistic applications, word meaning is represented by a vocabulary index (“word number 545”) or the string itself
SLIDE 34
What’s the Meaning of Life?
SLIDE 35
What’s the Meaning of Life?
LIFE’
SLIDE 36
What’s the Meaning of Life?
LIFE’ (.478, -.289, .897, …)
SLIDE 37 “Embeddings” Did Not Begin In This Century
Hinton (1986): “Learning Distributed Representations of Concepts”
Deerwester et al. (1990): “Indexing by Latent Semantic Analysis”
Brown et al. (1992): “Class-based n-gram models of natural language”
SLIDE 38 Four kinds of vector models
Sparse vector representations
- 1. Mutual-information-weighted word co-occurrence matrices
Dense vector representations:
- 2. Singular value decomposition/Latent Semantic Analysis
- 3. Neural-network-inspired models (skip-grams, CBOW)
- 4. Brown clusters
You already saw some of this in assignment 2 (question 2)!
SLIDE 39 Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts
Raw word frequency is not a great measure of association between words
It’s very skewed: “the” and “of” are very frequent, but maybe not the most discriminative
We’d rather have a measure that asks whether a context word is particularly informative about the target word.
(Positive) Pointwise Mutual Information ((P)PMI)
SLIDE 40 Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts
Raw word frequency is not a great measure of association between words
It’s very skewed: “the” and “of” are very frequent, but maybe not the most discriminative
We’d rather have a measure that asks whether a context word is particularly informative about the target word.
(Positive) Pointwise Mutual Information ((P)PMI)
Pointwise mutual information:
Do events x and y co-occur more than if they were independent?
SLIDE 41 Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts
Raw word frequency is not a great measure of association between words
It’s very skewed: “the” and “of” are very frequent, but maybe not the most discriminative
We’d rather have a measure that asks whether a context word is particularly informative about the target word.
(Positive) Pointwise Mutual Information ((P)PMI)
Pointwise mutual information:
Do events x and y co-occur more than if they were independent?
PMI between two words (Church & Hanks, 1989):
Do words x and y co-occur more than if they were independent?
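Concretely, PMI(x, y) = log₂ [ P(x, y) / (P(x) P(y)) ], and PPMI clips negative values to zero. A minimal NumPy sketch over a word-context count matrix (the toy counts are placeholders, not data from the slides):

```python
import numpy as np

def ppmi(C):
    """PPMI(w, c) = max(0, log2( P(w, c) / (P(w) P(c)) )) over a count matrix C
    whose rows are words and columns are contexts."""
    P_wc = C / C.sum()
    P_w = P_wc.sum(axis=1, keepdims=True)        # marginal over contexts
    P_c = P_wc.sum(axis=0, keepdims=True)        # marginal over words
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(P_wc / (P_w * P_c))
    pmi[~np.isfinite(pmi)] = 0.0                 # zero counts would give -inf
    return np.maximum(pmi, 0.0)                  # clip negative PMI to zero

C = np.array([[2., 0., 0.],     # toy counts, placeholders only
              [0., 1., 2.],
              [1., 6., 1.]])
print(ppmi(C).round(2))
```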
SLIDE 42
“Noun Classification from Predicate- Argument Structure,” Hindle (1990)
Object of “drink”   Count    PMI
it                      3    1.3
anything                3    5.2
wine                    2    9.3
tea                     2   11.8
liquid                  2   10.5
“drink it” is more common than “drink wine”
“wine” is a better “drinkable” thing than “it”
SLIDE 43 Four kinds of vector models
Sparse vector representations
- 1. Mutual-information-weighted word co-occurrence matrices
Dense vector representations:
- 2. Singular value decomposition/Latent Semantic Analysis
- 3. Neural-network-inspired models (skip-grams, CBOW)
- 4. Brown clusters
Learn more in:
- Your project
- Paper (673)
- Other classes (478/678)
SLIDE 44 Four kinds of vector models
Sparse vector representations
- 1. Mutual-information-weighted word co-occurrence matrices
Dense vector representations:
- 2. Singular value decomposition/Latent Semantic Analysis
- 3. Neural-network-inspired models (skip-grams, CBOW)
- 4. Brown clusters
SLIDE 45
Brown clustering (Brown et al., 1992)
An agglomerative clustering algorithm that clusters words based on which words precede or follow them
These word clusters can be turned into a kind of vector (a binary vector)
SLIDE 46 Brown Clusters as vectors
Build a binary tree from bottom to top based on how clusters are merged
Each word is represented by a binary string = path from root to leaf
Each intermediate node is a cluster
[binary-tree figure: leaves are words such as “CEO”, “chairman”, “president”, “November”; each leaf is labeled with its root-to-leaf bit string, e.g. 000, 001, 0010, 0011, 010, …]
In practice, use an available implementation: https://github.com/percyliang/brown-cluster
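One common way to use the binary strings is to take bit-string prefixes as coarse-to-fine cluster features. A small sketch, with hypothetical paths (not the actual output of the brown-cluster tool):

```python
# Sketch: turn Brown cluster bit strings into prefix features for downstream tasks.
# The `paths` below are hypothetical, not actual output of the brown-cluster tool.
paths = {"CEO": "0010", "chairman": "0011", "president": "000", "November": "010"}

def brown_prefix_features(word, paths, prefix_lengths=(2, 4)):
    """Use prefixes of the root-to-leaf bit string as coarse-to-fine cluster features."""
    path = paths.get(word)
    if path is None:
        return []
    return [f"brown[:{k}]={path[:k]}" for k in prefix_lengths if len(path) >= k]

print(brown_prefix_features("CEO", paths))        # ['brown[:2]=00', 'brown[:4]=0010']
print(brown_prefix_features("president", paths))  # ['brown[:2]=00']
```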
SLIDE 47
Brown cluster examples
SLIDE 48
Outline
Recap
  Maxent models
  Basic neural language models
Continuous representations
  Motivation
  Key idea: represent words with vectors
  Two common counting types
  Two (four) common continuous representation models
  Evaluation
SLIDE 49
Evaluating Similarity
Extrinsic (task-based, end-to-end) Evaluation:
Question Answering
Spell Checking
Essay grading
Intrinsic Evaluation:
Correlation between algorithm and human word similarity ratings
Wordsim353: 353 noun pairs rated 0-10. sim(plane,car)=5.77
Taking TOEFL multiple-choice vocabulary tests
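A sketch of the intrinsic evaluation: compute the Spearman correlation between model similarities and human ratings, assuming a hypothetical `sim` function and a few made-up WordSim-353-style pairs (only the plane/car rating of 5.77 comes from the slide).

```python
from scipy.stats import spearmanr

# Made-up stand-ins for a WordSim-353-style file; only plane/car = 5.77 is from the slide.
pairs = [("plane", "car", 5.77), ("cup", "mug", 8.5), ("king", "cabbage", 0.2)]

def evaluate(sim, pairs):
    """Spearman correlation between human ratings and model similarity scores."""
    human = [rating for _, _, rating in pairs]
    model = [sim(w1, w2) for w1, w2, _ in pairs]
    rho, _ = spearmanr(human, model)
    return rho

# Usage (hypothetical): evaluate(lambda a, b: cosine(vectors[a], vectors[b]), pairs)
```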
SLIDE 50
Cosine: Measuring Similarity
Given 2 target words v and w, how similar are their vectors?
Dot product or inner product from linear algebra
High when two vectors have large values in same dimensions, low for orthogonal vectors with zeros in complementary distribution
SLIDE 51
Cosine: Measuring Similarity
Given 2 target words v and w, how similar are their vectors?
Dot product or inner product from linear algebra
High when two vectors have large values in same dimensions, low for orthogonal vectors with zeros in complementary distribution
Issue: the raw dot product favors high-magnitude vectors, so we need to correct for vector length
SLIDE 52
Cosine Similarity
Divide the dot product by the lengths of the two vectors:
cos(v, w) = (v · w) / (‖v‖ ‖w‖)
This is the cosine of the angle between them
SLIDE 53 Cosine as a similarity metric
−1: vectors point in opposite directions
+1: vectors point in the same direction
0: vectors are orthogonal
SLIDE 54 Example: Word Similarity
             large  data  computer
apricot          2     0         0
digital          0     1         2
information      1     6         1
SLIDE 55 Example: Word Similarity
             large  data  computer
apricot          2     0         0
digital          0     1         2
information      1     6         1
cosine(apricot, information) =
cosine(digital, information) =
cosine(apricot, digital) =
SLIDE 56 Example: Word Similarity
             large  data  computer
apricot          2     0         0
digital          0     1         2
information      1     6         1
cosine(apricot, information) = (2 + 0 + 0) / (√(4 + 0 + 0) · √(1 + 36 + 1)) = 0.1622
cosine(digital, information) =
cosine(apricot, digital) =
SLIDE 57 Example: Word Similarity
             large  data  computer
apricot          2     0         0
digital          0     1         2
information      1     6         1
cosine(apricot, information) = (2 + 0 + 0) / (√(4 + 0 + 0) · √(1 + 36 + 1)) = 0.1622
cosine(digital, information) = (0 + 6 + 2) / (√(0 + 1 + 4) · √(1 + 36 + 1)) = 0.5804
cosine(apricot, digital) = (0 + 0 + 0) / (√(4 + 0 + 0) · √(0 + 1 + 4)) = 0.0
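A quick check of the worked example with NumPy, using the three word vectors from the table above:

```python
import numpy as np

# Vectors for apricot, digital, information over the contexts (large, data, computer).
apricot     = np.array([2., 0., 0.])
digital     = np.array([0., 1., 2.])
information = np.array([1., 6., 1.])

def cosine(v, w):
    """cos(v, w) = (v . w) / (|v| |w|)"""
    return (v @ w) / (np.linalg.norm(v) * np.linalg.norm(w))

print(round(cosine(apricot, information), 4))   # 0.1622
print(round(cosine(digital, information), 4))   # 0.5804
print(round(cosine(apricot, digital), 4))       # 0.0
```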
SLIDE 58
Other Similarity Measures
SLIDE 59 Adding Morphology, Syntax, and Semantics to Embeddings
Lin (1998): “Automatic Retrieval and Clustering of Similar Words”
Padó and Lapata (2007): “Dependency-based Construction of Semantic Space Models”
Levy and Goldberg (2014): “Dependency-Based Word Embeddings”
Cotterell and Schütze (2015): “Morphological Word Embeddings”
Ferraro et al. (2017): “Frame-Based Continuous Lexical Semantics through Exponential Family Tensor Factorization and Semantic Proto-Roles”
and many more…
SLIDE 60
Outline
Recap
  Maxent models
  Basic neural language models
Continuous representations
  Motivation
  Key idea: represent words with vectors
  Two common counting types
  Two (four) common continuous representation models
  Evaluation