SLIDE 1 Distributed Representations
CMSC 473/673 UMBC
Some slides adapted from 3SLP
SLIDE 2
Outline
Recap
  Maxent models
  Basic neural language models
Continuous representations
  Motivation
  Key idea: represent blobs with vectors
  Two common counting types
  Evaluation
  Common continuous representation models
SLIDE 3 Maxent Objective: Log-Likelihood (n-gram LM, with pairs (h_i, y_i))
Differentiating this becomes nicer (even though Z depends on θ)
log ∏_i p_θ(y_i | h_i) = Σ_i log p_θ(y_i | h_i) = Σ_i [ θ · f(y_i, h_i) − log Z(h_i) ] = F(θ)
The objective is implicitly defined with respect to (wrt) your data on hand
SLIDE 4 Log-Likelihood Gradient
Each component k of the gradient is the difference between:
the total value of feature f_k in the training data, Σ_i f_k(y_i, h_i),
and
the total value the current model p_θ expects for feature f_k, Σ_i E_{y′∼p_θ}[ f_k(y′, h_i) ]
∂F(θ)/∂θ_k = Σ_i f_k(y_i, h_i) − Σ_i E_{y′∼p_θ}[ f_k(y′, h_i) ]
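A minimal numpy sketch of this gradient, assuming a toy setup with a few candidate labels per example and a random binary feature function; all names and sizes below are illustrative, not from the slides.

```python
import numpy as np

# Toy maxent setup: F[i, y] is the feature vector f(y, h_i) for candidate
# label y on training example i; `gold` holds the observed labels y_i.
rng = np.random.default_rng(0)
num_examples, num_labels, num_feats = 4, 3, 5
F = rng.integers(0, 2, size=(num_examples, num_labels, num_feats)).astype(float)
gold = np.array([0, 2, 1, 0])
theta = np.zeros(num_feats)

def gradient(theta):
    scores = F @ theta                                  # theta . f(y, h_i) for every candidate y
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)           # p_theta(y | h_i)
    observed = F[np.arange(num_examples), gold].sum(axis=0)  # sum_i f(y_i, h_i)
    expected = np.einsum('iy,iyk->k', probs, F)              # sum_i E_{y'~p_theta}[f(y', h_i)]
    return observed - expected                          # one entry per feature k

print(gradient(theta))
```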
SLIDE 5 N-gram Language Models
predict the next word given some contextโฆ
p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) ∝ count(x_{i-3}, x_{i-2}, x_{i-1}, x_i)
[diagram: w_{i-3}, w_{i-2}, w_{i-1} → w_i] compute beliefs about what is likely…
SLIDE 6 Maxent Language Models
predict the next word given some context… [diagram: w_{i-3}, w_{i-2}, w_{i-1} → w_i] compute beliefs about what is likely…
p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) ∝ softmax(θ · f(x_{i-3}, x_{i-2}, x_{i-1}, x_i))
SLIDE 7 Neural Language Models
predict the next word given some context… [diagram: w_{i-3}, w_{i-2}, w_{i-1} → w_i] compute beliefs about what is likely…
p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) ∝ softmax(θ_{x_i} · f(x_{i-3}, x_{i-2}, x_{i-1}))
create/use "distributed representations"… e_{i-3}, e_{i-2}, e_{i-1} (each e_w obtained via a matrix-vector product)
combine these representations… C = f(e_{i-3}, e_{i-2}, e_{i-1}); output parameters θ_{w_i}
SLIDE 8 Neural Language Models
(same diagram and equation as Slide 7)
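Below is a small numpy sketch of the forward pass these two slides describe; the combination function f is assumed to be a tanh over a linear layer applied to the concatenated context embeddings, and all sizes and parameters are random/illustrative.

```python
import numpy as np

# Toy neural LM forward pass: embedding lookup, combine, softmax over the vocab.
rng = np.random.default_rng(1)
V, E, H = 10, 4, 6                    # vocab size, embedding size, hidden size
embeddings = rng.normal(size=(V, E))  # e_w for each word w
W = rng.normal(size=(H, 3 * E))       # parameters of the (assumed) combiner f
out = rng.normal(size=(V, H))         # output vectors theta_w, one per word

def next_word_probs(context_ids):
    e = embeddings[context_ids].reshape(-1)  # create/use the distributed representations
    h = np.tanh(W @ e)                       # combine them: h = f(e_{i-3}, e_{i-2}, e_{i-1})
    scores = out @ h                         # theta_w . h for every word w
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                   # softmax over the vocabulary

probs = next_word_probs([3, 7, 2])
print(probs.sum(), probs.argmax())           # sums to 1; argmax is the predicted word id
```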
SLIDE 9
Outline
Recap
  Maxent models
  Basic neural language models
Continuous representations
  Motivation
  Key idea: represent blobs with vectors
  Two common counting types
  Evaluation
  Common continuous representation models
SLIDE 10 Recall from Deck 2: Representing a Linguistic "Blob"
word → array of characters; sentence → array of words
representation/one-hot encoding
Let V = vocab size (# types)
- 1. Represent each word type with a unique integer i, where 0 ≤ i < V
- 2. One-hot encoding:
  – Assign each word to some index i, where 0 ≤ i < V
  – Represent each word w with a V-dimensional binary vector e_w, where e_{w,i} = 1 and all other entries are 0
SLIDE 11 Recall from Deck 2: One-Hot Encoding Example
- Let our vocab be {a, cat, saw, mouse, happy}
- V = # types = 5
- Assign:
a → 4, cat → 2, saw → 3, mouse → 0, happy → 1
How do we represent "cat"? e_cat = [0, 0, 1, 0, 0] (a 1 at index 2)
How do we represent "happy"? e_happy = [0, 1, 0, 0, 0] (a 1 at index 1)
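A quick sketch of this one-hot encoding using the example vocabulary; the index for "mouse" is assumed to be the leftover index 0 (the other assignments come from the slide).

```python
import numpy as np

# One-hot encoding over the example vocabulary {a, cat, saw, mouse, happy}.
word_to_index = {"mouse": 0, "happy": 1, "cat": 2, "saw": 3, "a": 4}
V = len(word_to_index)

def one_hot(word):
    vec = np.zeros(V)
    vec[word_to_index[word]] = 1.0   # exactly one non-zero entry
    return vec

print(one_hot("cat"))    # [0. 0. 1. 0. 0.]
print(one_hot("happy"))  # [0. 1. 0. 0. 0.]
```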
SLIDE 12 Recall from Deck 2: Representing a Linguistic "Blob"
word → array of characters; sentence → array of words
representation/one-hot encoding
Let E be some embedding size (often 100, 200, 300, etc.). Represent each word w with an E-dimensional real-valued vector e_w.
SLIDE 13
Recall from Deck 2: A Dense Representation (E=2)
SLIDE 14 Maxent Plagiarism Detector?
Given two documents x₁, x₂, predict y = 1 (plagiarized) or y = 0 (not plagiarized). What is/are the:
- Method/steps for predicting?
- General formulation?
- Features?
SLIDE 15
Plagiarism Detection: Word Similarity?
SLIDE 16
Distributional Representations
A dense, "low" dimensional vector representation
SLIDE 17
How have we represented words?
Each word is a distinct item
Bijection between the strings and unique integer ids: "cat" --> 3, "kitten" --> 792, "dog" --> 17394. Are "cat" and "kitten" similar?
SLIDE 18
How have we represented words?
Each word is a distinct item
Bijection between the strings and unique integer ids: "cat" --> 3, "kitten" --> 792, "dog" --> 17394. Are "cat" and "kitten" similar?
Equivalently: "One-hot" encoding
Represent each word type w with a vector the size of the vocabulary. This vector has V−1 zero entries and 1 non-zero (one) entry.
SLIDE 19
Distributional Representations
A dense, "low" dimensional vector representation
An E-dimensional vector, often (but not always) real-valued
SLIDE 20
Distributional Representations
A dense, "low" dimensional vector representation
An E-dimensional vector, often (but not always) real-valued
Up till ~2013: E could be any size
2013–present: E << vocab size
SLIDE 21 Distributional Representations
A dense, "low" dimensional vector representation
Many values are not 0 (or at least less sparse than one-hot vectors)
An E-dimensional vector, often (but not always) real-valued
Up till ~2013: E could be any size
2013–present: E << vocab size
SLIDE 22 Distributional Representations
A dense, "low" dimensional vector representation
These are also called:
- embeddings
- continuous representations
- (word/sentence/…) vectors
- vector-space models
Many values are not 0 (or at least less sparse than one-hot vectors)
An E-dimensional vector, often (but not always) real-valued
Up till ~2013: E could be any size
2013–present: E << vocab size
SLIDE 23
Distributional models of meaning = vector-space models of meaning = vector semantics
Zellig Harris (1954):
"oculist and eye-doctor … occur in almost the same environments"
"If A and B have almost identical environments we say that they are synonyms."
Firth (1957):
"You shall know a word by the company it keeps!"
SLIDE 24
Continuous Meaning
The paper reflected the truth.
SLIDE 25 Continuous Meaning
The paper reflected the truth.
[2-D plot of word vectors: reflected, paper, truth]
SLIDE 26 Continuous Meaning
The paper reflected the truth.
[2-D plot of word vectors: reflected, paper, truth; plus glean, hide, falsehood]
SLIDE 27 (Some) Properties of Embeddings
Capture "like" (similar) words
Mikolov et al. (2013)
SLIDE 28 (Some) Properties of Embeddings
Capture "like" (similar) words; capture relationships
Mikolov et al. (2013)
vector("king") − vector("man") + vector("woman") ≈ vector("queen")
vector("Paris") − vector("France") + vector("Italy") ≈ vector("Rome")
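A small sketch of checking such relationships with cosine similarity over vector differences; the tiny hand-made vectors below are purely illustrative stand-ins for real pretrained embeddings.

```python
import numpy as np

# Toy 3-dimensional vectors standing in for real pretrained embeddings.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.0]),
    "woman": np.array([0.1, 0.0, 0.9]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, vecs):
    # a - b + c should land near the answer word (e.g. king - man + woman ~ queen)
    target = vecs[a] - vecs[b] + vecs[c]
    candidates = {w: cosine(target, v) for w, v in vecs.items() if w not in (a, b, c)}
    return max(candidates, key=candidates.get)

print(analogy("king", "man", "woman", vecs))  # "queen" with these toy vectors
```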
SLIDE 29 "Embeddings" Did Not Begin In This Century
Hinton (1986): "Learning Distributed Representations of Concepts"
Deerwester et al. (1990): "Indexing by Latent Semantic Analysis"
Brown et al. (1992): "Class-based n-gram models of natural language"
SLIDE 30
Outline
Recap
  Maxent models
  Basic neural language models
Continuous representations
  Motivation
  Key idea: represent blobs with vectors
  Two common counting types
  Evaluation
  Common continuous representation models
SLIDE 31 Key Ideas
- 1. Acquire basic contextual statistics (often
counts) for each word type w
SLIDE 32 Key Ideas
- 1. Acquire basic contextual statistics (often
counts) for each word type w
- 2. Extract a real-valued vector v for each word
w from those statistics
SLIDE 33 Key Ideas
- 1. Acquire basic contextual statistics (often
counts) for each word type w
- 2. Extract a real-valued vector v for each word
w from those statistics
- 3. Use the vectors to represent each word in
later tasks
SLIDE 34 Key Ideas: Generalizing to "blobs"
- 1. Acquire basic contextual statistics (often
counts) for each blob type w
- 2. Extract a real-valued vector v for each blob w
from those statistics
- 3. Use the vectors to represent each blob in
later tasks
SLIDE 35
Outline
Recap
  Maxent models
  Basic neural language models
Continuous representations
  Motivation
  Key idea: represent blobs with vectors
  Two common counting types
  Evaluation
  Common continuous representation models
SLIDE 36 "Acquire basic contextual statistics (often counts) for each word type w"
- Two basic, initial counting approaches
– Record which words appear in which documents
– Record which words appear together
- These are good first attempts, but with some
large downsides
SLIDE 37 "You shall know a word by the company it keeps!" Firth (1957)
                 battle   soldier   fool   clown
As You Like It      1        2       37      6
Twelfth Night       1        2       58     117
Julius Caesar       8       12        1      0
Henry V            15       36        5      0
document (rows) - word (columns) count matrix
SLIDE 38 "You shall know a word by the company it keeps!" Firth (1957)
(document-word count matrix from Slide 37)
basic bag-of-words counting
SLIDE 39 "You shall know a word by the company it keeps!" Firth (1957)
(document-word count matrix from Slide 37)
Assumption: Two documents are similar if their vectors are similar
SLIDE 40 "You shall know a word by the company it keeps!" Firth (1957)
(document-word count matrix from Slide 37)
Assumption: Two words are similar if their vectors are similar
SLIDE 41 "You shall know a word by the company it keeps!" Firth (1957)
(document-word count matrix from Slide 37)
Assumption: Two words are similar if their vectors are similar
Issue: Count word vectors are very large, sparse, and skewed!
SLIDE 42 "You shall know a word by the company it keeps!" Firth (1957)
              apricot   pineapple   digital   information
aardvark         0          0          0          0
computer         0          0          2          1
data             0         10          1          6
pinch            1          1          0          0
result           0          0          1          4
sugar            1          1          0          0
context (rows) - word (columns) count matrix
Context: those other words within a small "window" of a target word
SLIDE 43 "You shall know a word by the company it keeps!" Firth (1957)
(context-word count matrix from Slide 42)
Example: "a cloud computer stores digital data on a remote computer"
Context: those other words within a small "window" of a target word
SLIDE 44 "You shall know a word by the company it keeps!" Firth (1957)
(context-word count matrix from Slide 42)
The size of the window depends on your goals:
The shorter the window, the more syntactic the representation (± 1-3 words: more "syntax-y")
The longer the window, the more semantic the representation (± 4-10 words: more "semantic-y")
SLIDE 45 "You shall know a word by the company it keeps!" Firth (1957)
(context-word count matrix from Slide 42)
Assumption: Two words are similar if their vectors are similar
Issue: Count word vectors are very large, sparse, and skewed!
Context: those other words within a small "window" of a target word
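A sketch of collecting these window-based context counts, using the example sentence from Slide 43; the whitespace tokenizer and the window size of ±2 are assumptions for illustration.

```python
from collections import defaultdict

# Count, for each target word, the other words within +/- `window` tokens.
def context_counts(sentences, window=2):
    counts = defaultdict(lambda: defaultdict(int))   # counts[target][context]
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

corpus = ["a cloud computer stores digital data on a remote computer"]
counts = context_counts(corpus, window=2)
print(dict(counts["digital"]))   # {'computer': 1, 'stores': 1, 'data': 1, 'on': 1}
```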
SLIDE 46
Outline
Recap
  Maxent models
  Basic neural language models
Continuous representations
  Motivation
  Key idea: represent blobs with vectors
  Two common counting types
  Evaluation
  Common continuous representation models
SLIDE 47
Evaluating Similarity
Extrinsic (task-based, end-to-end) Evaluation:
Question Answering, Spell Checking, Essay grading
Intrinsic Evaluation:
Correlation between algorithm and human word similarity ratings
Wordsim353: 353 noun pairs rated 0-10. sim(plane,car)=5.77
Taking TOEFL multiple-choice vocabulary tests
SLIDE 48
Cosine: Measuring Similarity
Given 2 target words v and w how similar are their vectors? Dot product or inner product from linear algebra
High when two vectors have large values in same dimensions, low for orthogonal vectors with zeros in complementary distribution
SLIDE 49
Cosine: Measuring Similarity
Given 2 target words v and w how similar are their vectors? Dot product or inner product from linear algebra
High when two vectors have large values in same dimensions, low for orthogonal vectors with zeros in complementary distribution
Issue: the raw dot product favors high-magnitude (e.g., frequent-word) vectors, so we need to correct for vector length
SLIDE 50
Cosine Similarity
Divide the dot product by the lengths of the two vectors. This is the cosine of the angle between them.
SLIDE 51 Cosine as a similarity metric
−1: vectors point in opposite directions
+1: vectors point in the same direction
0: vectors are orthogonal
SLIDE 52 Example: Word Similarity
              large   data   computer
apricot         2       0       0
digital         0       1       2
information     1       6       1
cos(v, w) = Σ_i v_i w_i / ( √(Σ_i v_i²) · √(Σ_i w_i²) )
SLIDE 53 Example: Word Similarity
(count table and cosine formula from Slide 52)
cosine(apricot, information) = ?
cosine(digital, information) = ?
cosine(apricot, digital) = ?
SLIDE 54 Example: Word Similarity
(count table and cosine formula from Slide 52)
cosine(apricot, information) = (2 + 0 + 0) / (√(4 + 0 + 0) · √(1 + 36 + 1)) = 0.1622
cosine(digital, information) = ?
cosine(apricot, digital) = ?
SLIDE 55 Example: Word Similarity
(count table and cosine formula from Slide 52)
cosine(apricot, information) = (2 + 0 + 0) / (√(4 + 0 + 0) · √(1 + 36 + 1)) = 0.1622
cosine(digital, information) = (0 + 6 + 2) / (√(0 + 1 + 4) · √(1 + 36 + 1)) = 0.5804
cosine(apricot, digital) = (0 + 0 + 0) / (√(4 + 0 + 0) · √(0 + 1 + 4)) = 0.0
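A short numpy check of this worked example, using the count vectors from the table above.

```python
import numpy as np

# Count vectors from the table, with dimensions (large, data, computer).
vectors = {
    "apricot":     np.array([2.0, 0.0, 0.0]),
    "digital":     np.array([0.0, 1.0, 2.0]),
    "information": np.array([1.0, 6.0, 1.0]),
}

def cosine(v, w):
    return (v @ w) / (np.linalg.norm(v) * np.linalg.norm(w))

print(round(cosine(vectors["apricot"], vectors["information"]), 4))  # 0.1622
print(round(cosine(vectors["digital"], vectors["information"]), 4))  # 0.5804
print(round(cosine(vectors["apricot"], vectors["digital"]), 4))      # 0.0
```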
SLIDE 56
Other Similarity Measures
SLIDE 57 Adding Morphology, Syntax, and Semantics to Embeddings
Lin (1998): "Automatic Retrieval and Clustering of Similar Words"
Padó and Lapata (2007): "Dependency-based Construction of Semantic Space Models"
Levy and Goldberg (2014): "Dependency-Based Word Embeddings"
Cotterell and Schütze (2015): "Morphological Word Embeddings"
Ferraro et al. (2017): "Frame-Based Continuous Lexical Semantics through Exponential Family Tensor Factorization and Semantic Proto-Roles"
and many moreโฆ
SLIDE 58
Outline
Recap
  Maxent models
  Basic neural language models
Continuous representations
  Motivation
  Key idea: represent blobs with vectors
  Two common counting types
  Evaluation
  Common continuous representation models
SLIDE 59
Shared Intuition
Model the meaning of a word by "embedding" it in a vector space. The meaning of a word is a vector of numbers. Contrast: in many computational linguistics applications, word meaning is represented by a vocabulary index ("word number 545") or by the string itself.
SLIDE 60 Four kinds of vector models
- 1. Mutual-information weighted word co-occurrence
matrices
- 2. Singular value decomposition/Latent Semantic Analysis
- 3. Neural-network-inspired models (skip-grams, CBOW)
- 4. Brown clusters
SLIDE 61 Four kinds of vector models
- 1. Mutual-information weighted word co-occurrence
matrices
- 2. Singular value decomposition/Latent Semantic Analysis
- 3. Neural-network-inspired models (skip-grams, CBOW)
- 4. Brown clusters
You already saw some of this in assignment 2!
SLIDE 62 Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts
Raw word frequency is not a great measure of association between words
It's very skewed: "the" and "of" are very frequent, but maybe not the most discriminative
Weโd rather have a measure that asks whether a context word is particularly informative about the target word.
(Positive) Pointwise Mutual Information ((P)PMI)
SLIDE 63 Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts
Raw word frequency is not a great measure of association between words
It's very skewed: "the" and "of" are very frequent, but maybe not the most discriminative
Weโd rather have a measure that asks whether a context word is particularly informative about the target word.
(Positive) Pointwise Mutual Information ((P)PMI)
Pointwise mutual information:
Do events x and y co-occur more than if they were independent?
PMI(x, y) = log [ p(x, y) / ( p(x) p(y) ) ]
SLIDE 64 Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts
Raw word frequency is not a great measure of association between words
It's very skewed: "the" and "of" are very frequent, but maybe not the most discriminative
Weโd rather have a measure that asks whether a context word is particularly informative about the target word.
(Positive) Pointwise Mutual Information ((P)PMI)
Pointwise mutual information:
Do events x and y co-occur more than if they were independent?
PMI between two words (Church & Hanks, 1989):
Do words x and y co-occur more than if they were independent?
PMI(x, y) = log [ p(x, y) / ( p(x) p(y) ) ]
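A small numpy sketch of turning a count matrix into (positive) PMI scores; the 3×4 count matrix below is illustrative, not one from the slides.

```python
import numpy as np

# C[i, j] = count of word i with context j (illustrative values).
C = np.array([[0., 0., 2., 1.],
              [0., 0., 1., 6.],
              [1., 1., 0., 0.]])

total = C.sum()
p_wc = C / total                        # joint p(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)   # marginal p(w)
p_c = p_wc.sum(axis=0, keepdims=True)   # marginal p(c)

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))    # PMI(w, c) = log p(w,c) / (p(w) p(c))
ppmi = np.maximum(pmi, 0.0)             # positive PMI: clip negatives (and -inf) to 0
print(ppmi.round(2))
```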
SLIDE 65
"Noun Classification from Predicate-Argument Structure," Hindle (1990)
Object of "drink"   Count   PMI
it                    3      1.3
anything              3      5.2
wine                  2      9.3
tea                   2     11.8
liquid                2     10.5
"drink it" is more common than "drink wine"
"wine" is a better "drinkable" thing than "it"
SLIDE 66 Four kinds of vector models
- 1. Mutual-information weighted word co-occurrence matrices
- 2. Singular value decomposition/Latent Semantic Analysis
- 3. Neural-network-inspired models (skip-grams, CBOW)
- 4. Brown clusters
Learn more in:
- Your project
- Paper (673)
- Other classes (478/678)
SLIDE 67 Four kinds of vector models
- 1. Mutual-information weighted word co-occurrence matrices
- 2. Singular value decomposition/Latent Semantic Analysis
- 3. Neural-network-inspired models (skip-grams, CBOW)
- 4. Brown clusters
SLIDE 68 Word2Vec
- Mikolov et al. (2013; NeurIPS): "Distributed Representations of Words and Phrases and their Compositionality"
- Revisits the context-word approach
- Learn a model p(c | w) to predict a context word
from a target word
SLIDE 69 Word2Vec
- Mikolov et al. (2013; NeurIPS): "Distributed Representations of Words and Phrases and their Compositionality"
- Revisits the context-word approach
- Learn a model p(c | w) to predict a context word
from a target word
- Learn two types of vector representations
– c_c ∈ ℝ^E: vector embeddings for each context word c
– w_x ∈ ℝ^E: vector embeddings for each target word x
p(c | x) ∝ exp(c_c · w_x)
SLIDE 70 Word2Vec
(context-word count matrix from Slide 42)
Context: those other words within a small "window" of a target word
max_{c,w} Σ_{(c,x) pairs} count(c, x) · log p(c | x)
SLIDE 71 Word2Vec
(context-word count matrix from Slide 42)
Context: those other words within a small "window" of a target word
max_{c,w} Σ_{(c,x) pairs} count(c, x) · [ c_c · w_x − log Σ_v exp(c_v · w_x) ]
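A minimal numpy sketch of this objective, assuming toy counts and randomly initialized context and target embeddings; a real word2vec implementation would also use tricks like negative sampling, which are not shown here.

```python
import numpy as np

# Toy word2vec-style objective: context vectors C, target vectors W, and
# observed (context, target) pair counts. All values are illustrative.
rng = np.random.default_rng(2)
V, E = 6, 3
C = rng.normal(scale=0.1, size=(V, E))     # context embeddings c_c
W = rng.normal(scale=0.1, size=(V, E))     # target-word embeddings w_x
pairs = {(0, 1): 3, (2, 1): 1, (4, 5): 2}  # count(c, x)

def log_p(c, x):
    scores = C @ W[x]                                   # c_v . w_x for every candidate context v
    return scores[c] - np.log(np.exp(scores).sum())     # log softmax entry for context c

objective = sum(count * log_p(c, x) for (c, x), count in pairs.items())
print(objective)    # maximize this (e.g., by gradient ascent on C and W)
```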
SLIDE 72
Word2Vec has Inspired a Lot of Work
Off-the-shelf embeddings
https://code.google.com/archive/p/word2vec/
Off-the-shelf implementations
https://radimrehurek.com/gensim/models/word2vec.html
Follow-on work
"GloVe: Global Vectors for Word Representation" (Pennington, Socher and Manning, 2014)
https://nlp.stanford.edu/projects/glove/
Many others; 15,000+ citations
SLIDE 73 FastText
- "Enriching Word Vectors with Subword Information," Bojanowski et al. (2017; TACL)
- Main idea: learn n-gram embeddings for the
target word (not context) and modify the word2vec model to use these
- Pre-trained models in 150+ languages
โ https://fasttext.cc
SLIDE 74 FastText Details
Main idea: learn n-gram embeddings for the target word (not the context) and modify the word2vec model to use these
Original word2vec: p(c | x) ∝ exp(c_c · w_x)
FastText: p(c | x) ∝ exp( c_c · Σ_{n-gram g in x} z_g )
SLIDE 75 FastText Details
Main idea: learn n-gram embeddings for the target word (not the context) and modify the word2vec model to use these
p(c | x) ∝ exp( c_c · Σ_{n-gram g in x} z_g )
decompose: fluffy → fl, flu, luf, uff, ffy, fy
SLIDE 76 FastText Details
Main idea: learn n-gram embeddings for the target word (not the context) and modify the word2vec model to use these
p(c | x) ∝ exp( c_c · Σ_{n-gram g in x} z_g )
decompose: fluffy → fl, flu, luf, uff, ffy, fy
Learn n-gram embeddings z_g
SLIDE 77 FastText Details
Main idea: learn n-gram embeddings for the target word (not the context) and modify the word2vec model to use these
p(c | x) ∝ exp( c_c · Σ_{n-gram g in x} z_g )
decompose: fluffy → fl, flu, luf, uff, ffy, fy
Learn n-gram embeddings z_g
…to deterministically compute word embeddings
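A sketch of the FastText decomposition, assuming character trigrams with the < and > boundary markers used in the paper (the slide shows the n-grams without markers); the embeddings below are random and purely illustrative.

```python
import numpy as np

# Decompose a word into character trigrams and sum their embeddings.
def char_ngrams(word, n=3):
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("fluffy"))   # ['<fl', 'flu', 'luf', 'uff', 'ffy', 'fy>']

rng = np.random.default_rng(3)
E = 4
ngram_vecs = {g: rng.normal(size=E) for g in char_ngrams("fluffy")}   # z_g

# deterministically compute the word's embedding from its n-gram embeddings
w_fluffy = sum(ngram_vecs[g] for g in char_ngrams("fluffy"))
print(w_fluffy)
```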
SLIDE 78
SLIDE 79
Contextual Word Embeddings
Word2vec-based models are not context-dependent
Single word type → single word embedding
If a single word type can have different meanings…
bank, bass, plant, …
… why should we only have one embedding?
SLIDE 80 Contextual Word Embeddings
Word2vec-based models are not context-dependent
Single word type → single word embedding
If a single word type can have different meanings…
bank, bass, plant, …
… why should we only have one embedding?
Entire task devoted to classifying these meanings:
Word Sense Disambiguation
(we'll get back to it throughout the semester)
SLIDE 81 Contextual Word Embeddings
Growing interest in this
Off-the-shelf use is a bit more difficult: you download and run a model; you can't just download a file of embeddings
Two to know about (with code):
ELMo: "Deep contextualized word representations," Peters et al. (2018; NAACL)
https://allennlp.org/elmo
BERT: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," Devlin et al. (2019; NAACL)
https://github.com/google-research/bert
SLIDE 82
Your Idea?
SLIDE 83 Four kinds of vector models
- 1. Mutual-information weighted word co-occurrence
matrices
- 2. Singular value decomposition/Latent Semantic Analysis
- 3. Neural-network-inspired models (skip-grams, CBOW)
- 4. Brown clusters
SLIDE 84
Brown clustering (Brown et al., 1992)
An agglomerative clustering algorithm that clusters words based on which words precede or follow them. These word clusters can be turned into a kind of (binary) vector.
SLIDE 85 Brown Clusters as vectors
Build a binary tree from bottom to top, based on how clusters are merged
Each word is represented by a binary string = the path from root to leaf
Each intermediate node is a cluster
[Tree diagram: leaves such as CEO, chairman, president, November, with binary paths like 000, 001, 0010, 0011, 01, 010 from the root]
In practice, use an available implementation: https://github.com/percyliang/brown-cluster
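One common way to use these bit strings, sketched below: take prefixes of a word's path as coarse-to-fine cluster features; the example paths are made up, in the spirit of the figure.

```python
# Brown cluster bit strings as features: prefixes of a word's path give
# coarse-to-fine cluster ids. The paths below are illustrative.
clusters = {"CEO": "0010", "chairman": "0010", "president": "0011", "November": "000"}

def prefix_features(word, lengths=(2, 3, 4)):
    path = clusters[word]
    return sorted(f"prefix{k}={path[:k]}" for k in lengths if len(path) >= k)

print(prefix_features("CEO"))        # ['prefix2=00', 'prefix3=001', 'prefix4=0010']
print(prefix_features("chairman"))   # same prefixes -> same cluster features
```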
SLIDE 86
Brown cluster examples
SLIDE 87
Outline
Recap
  Maxent models
  Basic neural language models
Continuous representations
  Motivation
  Key idea: represent blobs with vectors
  Two common counting types
  Evaluation
  Common continuous representation models