SLIDE 1

Distributed Representations

CMSC 473/673 UMBC

Some slides adapted from 3SLP

SLIDE 2

Outline

Recap
– Maxent models
– Basic neural language models

Continuous representations
– Motivation
– Key idea: represent blobs with vectors
– Two common counting types
– Evaluation
– Common continuous representation models

SLIDE 3

Maxent Objective: Log-Likelihood

Given training pairs $(h_i, x_i)$ (e.g., histories and next words in an n-gram LM), the log-likelihood objective is

$$F(\theta) = \log \prod_i p_\theta(x_i \mid h_i) = \sum_i \log p_\theta(x_i \mid h_i) = \sum_i \left[ \theta^\top f(x_i, h_i) - \log Z(h_i) \right]$$

Differentiating this becomes nicer (even though Z depends on θ).

The objective is implicitly defined with respect to (wrt) your data on hand

SLIDE 4

Log-Likelihood Gradient

Each component k of the gradient is the difference between the total value of feature $f_k$ in the training data and the total value the current model $p_\theta$ expects for feature $f_k$:

$$\frac{\partial F}{\partial \theta_k} = \sum_i f_k(x_i, h_i) \;-\; \sum_i \mathbb{E}_{x' \sim p_\theta(\cdot \mid h_i)}\left[ f_k(x', h_i) \right]$$

SLIDE 5

N-gram Language Models

predict the next word $w_i$ given some context $w_{i-3}, w_{i-2}, w_{i-1}$, computing beliefs about what is likely…

$$p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \propto \mathrm{count}(w_{i-3}, w_{i-2}, w_{i-1}, w_i)$$

SLIDE 6

Maxent Language Models

predict the next word $w_i$ given some context $w_{i-3}, w_{i-2}, w_{i-1}$, computing beliefs about what is likely…

$$p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \propto \mathrm{softmax}(\theta \cdot f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))$$

SLIDE 7

Neural Language Models

predict the next word $w_i$ given some context $w_{i-3}, w_{i-2}, w_{i-1}$, computing beliefs about what is likely…

$$p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \propto \mathrm{softmax}(\theta_{w_i} \cdot f(w_{i-3}, w_{i-2}, w_{i-1}))$$

create/use "distributed representations" $e_{i-3}, e_{i-2}, e_{i-1}$ for the context words, then combine these representations (e.g., with a matrix-vector product) into the context function $f$; each candidate word $w$ is scored against its output embedding $\theta_w$.
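A minimal sketch of this pipeline (a Bengio-style feed-forward LM; the dimensions are hypothetical, and the combination $f$ here is concatenation followed by a linear map and tanh):

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H = 10, 4, 8            # vocab size, embedding size, hidden size

emb = rng.normal(size=(V, E))       # input ("distributed") representations e_w
W = rng.normal(size=(H, 3 * E))     # combines the three context embeddings
out = rng.normal(size=(V, H))       # output embeddings theta_w, one per word

def next_word_probs(ctx):
    """ctx: three word ids (w_{i-3}, w_{i-2}, w_{i-1}) -> p(. | ctx)."""
    c = np.concatenate([emb[w] for w in ctx])   # look up and stack the e's
    g = np.tanh(W @ c)                          # f(w_{i-3}, w_{i-2}, w_{i-1})
    scores = out @ g                            # theta_w . g for every word w
    scores -= scores.max()                      # numerical stability
    p = np.exp(scores)
    return p / p.sum()                          # softmax

print(next_word_probs((1, 5, 3)).round(3))
```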


SLIDE 9

Outline

Recap
– Maxent models
– Basic neural language models

Continuous representations
– Motivation
– Key idea: represent blobs with vectors
– Two common counting types
– Evaluation
– Common continuous representation models

SLIDE 10

Recall from Deck 2: Representing a Linguistic "Blob"

  • 1. An array of sub-blobs: word → array of characters; sentence → array of words
  • 2. Integer representation / one-hot encoding
  • 3. Dense embedding

Let V = vocab size (# types)

  • 1. Represent each word type with a unique integer i, where 0 ≤ i < V
  • 2. Or equivalently, …

– Assign each word to some index i, where 0 ≤ i < V
– Represent each word w with a V-dimensional binary vector $e_w$, where $e_{w,i} = 1$ and all other entries are 0
SLIDE 11

Recall from Deck 2: One-Hot Encoding Example

  • Let our vocab be {a, cat, saw, mouse, happy}
  • V = # types = 5
  • Assign:

a → 4, cat → 2, saw → 3, mouse → 0, happy → 1

How do we represent "cat"? It has index 2, so $e_{\text{cat}} = [0, 0, 1, 0, 0]$.

How do we represent "happy"? It has index 1, so $e_{\text{happy}} = [0, 1, 0, 0, 0]$.
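The same encoding in a few lines of numpy, using the vocabulary assignment above:

```python
import numpy as np

index = {"a": 4, "cat": 2, "saw": 3, "mouse": 0, "happy": 1}
V = len(index)

def one_hot(word):
    e = np.zeros(V)
    e[index[word]] = 1.0
    return e

print(one_hot("cat"))    # [0. 0. 1. 0. 0.]
print(one_hot("happy"))  # [0. 1. 0. 0. 0.]
```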

SLIDE 12

Recall from Deck 2: Representing a Linguistic โ€œBlobโ€

  • 1. An array of sub-blobs: word → array of characters; sentence → array of words
  • 2. Integer representation / one-hot encoding
  • 3. Dense embedding

Let E be some embedding size (often 100, 200, 300, etc.). Represent each word w with an E-dimensional real-valued vector $e_w$.

SLIDE 13

Recall from Deck 2: A Dense Representation (E=2)

SLIDE 14

Maxent Plagiarism Detector?

Given two documents $x_1, x_2$, predict y = 1 (plagiarized) or y = 0 (not plagiarized). What is/are the:

  • Method/steps for predicting?
  • General formulation?
  • Features?
SLIDE 15

Plagiarism Detection: Word Similarity?

SLIDE 16

Distributional Representations

A dense, "low" dimensional vector representation

SLIDE 17

How have we represented words?

Each word is a distinct item

Bijection between the strings and unique integer ids: "cat" → 3, "kitten" → 792, "dog" → 17394. Are "cat" and "kitten" similar?

SLIDE 18

How have we represented words?

Each word is a distinct item

Bijection between the strings and unique integer ids: "cat" → 3, "kitten" → 792, "dog" → 17394. Are "cat" and "kitten" similar?

Equivalently: "One-hot" encoding

Represent each word type w with a vector the size of the vocabulary. This vector has V−1 zero entries and 1 non-zero (one) entry.

SLIDE 19

Distributional Representations

A dense, "low" dimensional vector representation

An E-dimensional vector, often (but not always) real-valued

SLIDE 20

Distributional Representations

A dense, "low" dimensional vector representation

An E-dimensional vector, often (but not always) real-valued. Up till ~2013: E could be any size; 2013–present: E << vocab size.

SLIDE 21

Distributional Representations

A dense, "low" dimensional vector representation

Many values are not 0 (or at least less sparse than one-hot)

An E-dimensional vector, often (but not always) real-valued. Up till ~2013: E could be any size; 2013–present: E << vocab size.

SLIDE 22

Distributional Representations

A dense, "low" dimensional vector representation

These are also called

  • embeddings
  • Continuous representations
  • (word/sentence/โ€ฆ) vectors
  • Vector-space model

Many values are not 0 (or at least less sparse than one-hot)

An E-dimensional vector, often (but not always) real-valued. Up till ~2013: E could be any size; 2013–present: E << vocab size.

SLIDE 23

Distributional models of meaning = vector-space models of meaning = vector semantics

Zellig Harris (1954):

"oculist and eye-doctor … occur in almost the same environments" "If A and B have almost identical environments we say that they are synonyms."

Firth (1957):

"You shall know a word by the company it keeps!"

SLIDE 24

Continuous Meaning

The paper reflected the truth.

SLIDE 25

Continuous Meaning

The paper reflected the truth.

(figure: "reflected," "paper," and "truth" plotted as points in a continuous space)

SLIDE 26

Continuous Meaning

The paper reflected the truth.

(figure: the same plot, now also showing "glean," "hide," and "falsehood")

SLIDE 27

(Some) Properties of Embeddings

Capture "like" (similar) words

Mikolov et al. (2013)

SLIDE 28

(Some) Properties of Embeddings

Capture "like" (similar) words
Capture relationships

Mikolov et al. (2013)

vector('king') − vector('man') + vector('woman') ≈ vector('queen')
vector('Paris') − vector('France') + vector('Italy') ≈ vector('Rome')

SLIDE 29

"Embeddings" Did Not Begin In This Century

Hinton (1986): "Learning Distributed Representations of Concepts"
Deerwester et al. (1990): "Indexing by Latent Semantic Analysis"
Brown et al. (1992): "Class-based n-gram models of natural language"

SLIDE 30

Outline

Recap
– Maxent models
– Basic neural language models

Continuous representations
– Motivation
– Key idea: represent blobs with vectors
– Two common counting types
– Evaluation
– Common continuous representation models

SLIDE 31

Key Ideas

  • 1. Acquire basic contextual statistics (often counts) for each word type w

SLIDE 32

Key Ideas

  • 1. Acquire basic contextual statistics (often counts) for each word type w
  • 2. Extract a real-valued vector v for each word w from those statistics

SLIDE 33

Key Ideas

  • 1. Acquire basic contextual statistics (often counts) for each word type w
  • 2. Extract a real-valued vector v for each word w from those statistics
  • 3. Use the vectors to represent each word in later tasks

SLIDE 34

Key Ideas: Generalizing to "blobs"

  • 1. Acquire basic contextual statistics (often counts) for each blob type w
  • 2. Extract a real-valued vector v for each blob w from those statistics
  • 3. Use the vectors to represent each blob in later tasks

SLIDE 35

Outline

Recap
– Maxent models
– Basic neural language models

Continuous representations
– Motivation
– Key idea: represent blobs with vectors
– Two common counting types
– Evaluation
– Common continuous representation models

SLIDE 36

"Acquire basic contextual statistics (often counts) for each word type w"

  • Two basic, initial counting approaches:
– Record which words appear in which documents
– Record which words appear together
  • These are good first attempts, but with some large downsides

SLIDE 37

"You shall know a word by the company it keeps!" Firth (1957)

document (↓)–word (→) count matrix:

                 battle  soldier  fool  clown
As You Like It        1        2    37      6
Twelfth Night         1        2    58    117
Julius Caesar         8       12     1      0
Henry V              15       36     5      0

SLIDE 38

"You shall know a word by the company it keeps!" Firth (1957)

(the same document (↓)–word (→) count matrix as above)

basic bag-of-words counting

SLIDE 39

"You shall know a word by the company it keeps!" Firth (1957)

(the same document (↓)–word (→) count matrix as above)

Assumption: Two documents are similar if their vectors are similar

SLIDE 40

"You shall know a word by the company it keeps!" Firth (1957)

(the same document (↓)–word (→) count matrix as above)

Assumption: Two words are similar if their vectors are similar

SLIDE 41

"You shall know a word by the company it keeps!" Firth (1957)

(the same document (↓)–word (→) count matrix as above)

Assumption: Two words are similar if their vectors are similar
Issue: Count word vectors are very large, sparse, and skewed!

SLIDE 42

"You shall know a word by the company it keeps!" Firth (1957)

context (↓)–word (→) count matrix:

          apricot  pineapple  digital  information
aardvark        0          0        0            0
computer        0          0        2            1
data            0          0        1            6
pinch           1          1        0            0
result          0          0        1            4
sugar           1          1        0            0

Context: those other words within a small "window" of a target word

SLIDE 43

"You shall know a word by the company it keeps!" Firth (1957)

(the same context (↓)–word (→) count matrix as above)

Example: "a cloud computer stores digital data on a remote computer"

Context: those other words within a small "window" of a target word

SLIDE 44

"You shall know a word by the company it keeps!" Firth (1957)

(the same context (↓)–word (→) count matrix as above)

The size of the window depends on your goals:
– the shorter the window, the more syntactic the representation (± 1-3: more "syntax-y")
– the longer the window, the more semantic the representation (± 4-10: more "semantic-y")
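A small sketch of window-based co-occurrence counting (a toy corpus and a ±2 window are assumed here):

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window=2):
    """corpus: list of tokenized sentences -> counts[context][word]."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in corpus:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[sent[j]][word] += 1   # context row, target column
    return counts

corpus = [["a", "cloud", "computer", "stores", "digital", "data"]]
counts = cooccurrence_counts(corpus, window=2)
print(counts["digital"]["data"])   # 1: "digital" appears in the window of "data"
```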

SLIDE 45

"You shall know a word by the company it keeps!" Firth (1957)

(the same context (↓)–word (→) count matrix as above)

Assumption: Two words are similar if their vectors are similar
Issue: Count word vectors are very large, sparse, and skewed!
Context: those other words within a small "window" of a target word

SLIDE 46

Outline

Recap
– Maxent models
– Basic neural language models

Continuous representations
– Motivation
– Key idea: represent blobs with vectors
– Two common counting types
– Evaluation
– Common continuous representation models

SLIDE 47

Evaluating Similarity

Extrinsic (task-based, end-to-end) evaluation:
– Question answering
– Spell checking
– Essay grading

Intrinsic evaluation:
– Correlation between algorithm and human word similarity ratings (WordSim353: 353 noun pairs rated 0–10, e.g. sim(plane, car) = 5.77)
– Taking TOEFL multiple-choice vocabulary tests

SLIDE 48

Cosine: Measuring Similarity

Given two target words v and w, how similar are their vectors? Start with the dot product (inner product) from linear algebra.

High when two vectors have large values in the same dimensions; low for orthogonal vectors with zeros in complementary distribution.

SLIDE 49

Cosine: Measuring Similarity

Given two target words v and w, how similar are their vectors? Start with the dot product (inner product) from linear algebra.

High when two vectors have large values in the same dimensions; low for orthogonal vectors with zeros in complementary distribution.

But we must correct for high-magnitude vectors.

SLIDE 50

Cosine Similarity

Divide the dot product by the lengths of the two vectors; this is the cosine of the angle between them:

$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}$$

SLIDE 51

Cosine as a similarity metric

–1: vectors point in opposite directions
+1: vectors point in the same direction
0: vectors are orthogonal

SLIDE 52

Example: Word Similarity

             large  data  computer
apricot          2     0         0
digital          0     1         2
information      1     6         1

$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$$

SLIDE 53

Example: Word Similarity

(same counts as above)

cosine(apricot, information) = ?
cosine(digital, information) = ?
cosine(apricot, digital) = ?

$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$$

SLIDE 54

Example: Word Similarity

(same counts as above)

$$\cos(\text{apricot}, \text{information}) = \frac{2 + 0 + 0}{\sqrt{4 + 0 + 0}\,\sqrt{1 + 36 + 1}} = 0.1622$$

cosine(digital, information) = ?
cosine(apricot, digital) = ?

SLIDE 55

Example: Word Similarity

(same counts as above)

$$\cos(\text{apricot}, \text{information}) = \frac{2 + 0 + 0}{\sqrt{4 + 0 + 0}\,\sqrt{1 + 36 + 1}} = 0.1622$$

$$\cos(\text{digital}, \text{information}) = \frac{0 + 6 + 2}{\sqrt{0 + 1 + 4}\,\sqrt{1 + 36 + 1}} = 0.5804$$

$$\cos(\text{apricot}, \text{digital}) = \frac{0 + 0 + 0}{\sqrt{4 + 0 + 0}\,\sqrt{0 + 1 + 4}} = 0.0$$
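These worked examples are easy to verify in a few lines of numpy:

```python
import numpy as np

vec = {
    "apricot":     np.array([2.0, 0.0, 0.0]),   # counts for large, data, computer
    "digital":     np.array([0.0, 1.0, 2.0]),
    "information": np.array([1.0, 6.0, 1.0]),
}

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(round(cosine(vec["apricot"], vec["information"]), 4))  # 0.1622
print(round(cosine(vec["digital"], vec["information"]), 4))  # 0.5804
print(cosine(vec["apricot"], vec["digital"]))                # 0.0
```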

SLIDE 56

Other Similarity Measures

SLIDE 57

Adding Morphology, Syntax, and Semantics to Embeddings

Lin (1998): "Automatic Retrieval and Clustering of Similar Words"
Padó and Lapata (2007): "Dependency-based Construction of Semantic Space Models"
Levy and Goldberg (2014): "Dependency-Based Word Embeddings"
Cotterell and Schütze (2015): "Morphological Word Embeddings"
Ferraro et al. (2017): "Frame-Based Continuous Lexical Semantics through Exponential Family Tensor Factorization and Semantic Proto-Roles"

and many more…

SLIDE 58

Outline

Recap
– Maxent models
– Basic neural language models

Continuous representations
– Motivation
– Key idea: represent blobs with vectors
– Two common counting types
– Evaluation
– Common continuous representation models

SLIDE 59

Shared Intuition

Model the meaning of a word by "embedding" it in a vector space: the meaning of a word is a vector of numbers. Contrast: in many computational linguistic applications, word meaning is instead represented by a vocabulary index ("word number 545") or the string itself.

SLIDE 60

Four kinds of vector models

  • 1. Mutual-information weighted word co-occurrence matrices
  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters
SLIDE 61

Four kinds of vector models

  • 1. Mutual-information weighted word co-occurrence matrices
  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters

You already saw some of this in assignment 2!

SLIDE 62

Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts

Raw word frequency is not a great measure of association between words

It's very skewed: "the" and "of" are very frequent, but maybe not the most discriminative

We'd rather have a measure that asks whether a context word is particularly informative about the target word.

(Positive) Pointwise Mutual Information ((P)PMI)

SLIDE 63

Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts

Pointwise mutual information: Do events x and y co-occur more than if they were independent?

$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$

SLIDE 64

Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts

PMI between two words (Church & Hanks, 1989): Do words x and y co-occur more than if they were independent?

$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$

SLIDE 65

"Noun Classification from Predicate-Argument Structure," Hindle (1990)

Object of "drink"  Count   PMI
it                     3   1.3
anything               3   5.2
wine                   2   9.3
tea                    2  11.8
liquid                 2  10.5

"drink it" is more common than "drink wine," but "wine" is a better "drinkable" thing than "it"

SLIDE 66

Four kinds of vector models

  • 1. Mutual-information weighted word co-occurrence matrices
  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters

Learn more in:

  • Your project
  • Paper (673)
  • Other classes (478/678)
SLIDE 67

Four kinds of vector models

  • 1. Mutual-information weighted word co-occurrence matrices
  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters
SLIDE 68

Word2Vec

  • Mikolov et al. (2013; NeurIPS): "Distributed Representations of Words and Phrases and their Compositionality"
  • Revisits the context-word approach
  • Learn a model p(c | w) to predict a context word from a target word

SLIDE 69

Word2Vec

  • Mikolov et al. (2013; NeurIPS): "Distributed Representations of Words and Phrases and their Compositionality"
  • Revisits the context-word approach
  • Learn a model p(c | w) to predict a context word from a target word
  • Learn two types of vector representations:

– $h_c \in \mathbb{R}^E$: a vector embedding for each context word c
– $v_w \in \mathbb{R}^E$: a vector embedding for each target word w

$$p(c \mid w) \propto \exp(h_c^\top v_w)$$

SLIDE 70

Word2Vec

(the same context (↓)–word (→) count matrix as above)

Context: those other words within a small "window" of a target word

$$\max_{h, v} \; \sum_{(c, w) \text{ pairs}} \mathrm{count}(c, w) \, \log p(c \mid w)$$

SLIDE 71

Word2Vec

(the same context (↓)–word (→) count matrix as above)

Context: those other words within a small "window" of a target word

$$\max_{h, v} \; \sum_{(c, w) \text{ pairs}} \mathrm{count}(c, w) \left[ h_c^\top v_w - \log \sum_{c'} \exp(h_{c'}^\top v_w) \right]$$

SLIDE 72

Word2Vec has Inspired a Lot of Work

Off-the-shelf embeddings

https://code.google.com/archive/p/word2vec/

Off-the-shelf implementations

https://radimrehurek.com/gensim/models/word2vec.html

Follow-on work

"GloVe: Global Vectors for Word Representation" (Pennington, Socher, and Manning, 2014)

https://nlp.stanford.edu/projects/glove/

Many others; 15,000+ citations
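For example, with the gensim implementation linked above, training and querying a model takes only a few lines (toy corpus here; the parameter names follow gensim 4.x):

```python
from gensim.models import Word2Vec

sentences = [["a", "cloud", "computer", "stores", "digital", "data"],
             ["the", "apricot", "has", "a", "pinch", "of", "sugar"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv["digital"][:5])                 # first 5 dims of the learned vector
print(model.wv.similarity("digital", "data"))  # cosine similarity of two words
```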

SLIDE 73

FastText

  • "Enriching Word Vectors with Subword Information," Bojanowski et al. (2017; TACL)
  • Main idea: learn n-gram embeddings for the target word (not the context) and modify the word2vec model to use these
  • Pre-trained models in 150+ languages
– https://fasttext.cc

SLIDE 74

FastText Details

Main idea: learn n-gram embeddings for the target word (not the context) and modify the word2vec model to use these.

Original word2vec:

$$p(c \mid w) \propto \exp(h_c^\top v_w)$$

FastText:

$$p(c \mid w) \propto \exp\Big(h_c^\top \sum_{\text{n-gram } g \text{ in } w} z_g\Big)$$

SLIDE 75

FastText Details

Main idea: learn n-gram embeddings for the target word (not the context) and modify the word2vec model to use these.

$$p(c \mid w) \propto \exp\Big(h_c^\top \sum_{\text{n-gram } g \text{ in } w} z_g\Big)$$

decompose: fluffy → <fl, flu, luf, uff, ffy, fy>

SLIDE 76

FastText Details

Main idea: learn n-gram embeddings for the target word (not the context) and modify the word2vec model to use these.

$$p(c \mid w) \propto \exp\Big(h_c^\top \sum_{\text{n-gram } g \text{ in } w} z_g\Big)$$

decompose: fluffy → <fl, flu, luf, uff, ffy, fy>

Learn n-gram embeddings $z_g$.

SLIDE 77

FastText Details

Main idea: learn n-gram embeddings for the target word (not the context) and modify the word2vec model to use these.

$$p(c \mid w) \propto \exp\Big(h_c^\top \sum_{\text{n-gram } g \text{ in } w} z_g\Big)$$

decompose: fluffy → <fl, flu, luf, uff, ffy, fy>

Learn n-gram embeddings $z_g$, then deterministically compute word embeddings from them.
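A sketch of the decompose-and-sum step, assuming boundary markers and character trigrams as in the fluffy example (the $z_g$ table below is random just to make it runnable; real FastText hashes n-grams into a fixed number of buckets):

```python
import numpy as np

rng = np.random.default_rng(0)
E = 4

def char_ngrams(word, n=3):
    """Pad with boundary markers, then take all character n-grams."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("fluffy"))   # ['<fl', 'flu', 'luf', 'uff', 'ffy', 'fy>']

# Hypothetical n-gram embedding table z_g.
z = {g: rng.normal(size=E) for g in char_ngrams("fluffy")}

def word_embedding(word):
    """A word's embedding is the sum of its n-gram embeddings."""
    return sum(z[g] for g in char_ngrams(word))

print(word_embedding("fluffy").round(2))
```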

SLIDE 78

SLIDE 79

Contextual Word Embeddings

Word2vec-based models are not context-dependent

Single word type โ†’ single word embedding

If a single word type can have different meanings…

bank, bass, plant, …

… why should we only have one embedding?

SLIDE 80

Contextual Word Embeddings

Word2vec-based models are not context-dependent

Single word type โ†’ single word embedding

If a single word type can have different meanings…

bank, bass, plant, …

… why should we only have one embedding?

Entire task devoted to classifying these meanings:

Word Sense Disambiguation

(we'll get back to it throughout the semester)

SLIDE 81

Contextual Word Embeddings

Growing interest in this. Off-the-shelf use is a bit more difficult: you download and run a model; you can't just download a file of embeddings.

Two to know about (with code):

ELMo: "Deep contextualized word representations," Peters et al. (2018; NAACL)

https://allennlp.org/elmo

BERT: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," Devlin et al. (2019; NAACL)

https://github.com/google-research/bert
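As a sketch of what "download and run a model" looks like: one common way to run the BERT model above is through the HuggingFace transformers library (the snippet assumes that wrapper, not the original repo's API). Note how the same word type, "bank," gets a different vector in each context:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

for sent in ["I deposited cash at the bank .", "We sat on the river bank ."]:
    inputs = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # one vector per wordpiece
    idx = inputs["input_ids"][0].tolist().index(tokenizer.vocab["bank"])
    print(sent, "->", hidden[idx][:4])   # first few dims of "bank"'s vector
```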

SLIDE 82

Your Idea?

SLIDE 83

Four kinds of vector models

  • 1. Mutual-information weighted word co-occurrence matrices
  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters
SLIDE 84

Brown clustering (Brown et al., 1992)

An agglomerative clustering algorithm that clusters words based on which words precede or follow them. These word clusters can be turned into a kind of vector (a binary vector).

SLIDE 85

Brown Clusters as vectors

Build a binary tree from bottom to top based on how clusters are merged. Each word is represented by a binary string = the path from the root to its leaf. Each intermediate node is a cluster.

(figure: a binary tree over words; leaves such as CEO, chairman, president, and November sit under paths like 000, 001, 0010, 0011, 01, 010, …)

In practice, use an available implementation: https://github.com/percyliang/brown-cluster
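A sketch of how those bit strings get used as features: prefixes of a word's path give cluster memberships at several granularities (the paths below are hypothetical, chosen to echo the tree in the figure):

```python
# Hypothetical output of a Brown clustering run: word -> root-to-leaf path.
paths = {
    "CEO":       "0010",
    "chairman":  "0011",
    "president": "001",
    "November":  "010",
}

def cluster_features(word, prefix_lengths=(2, 3, 4)):
    """Prefixes of the bit string = cluster ids at coarse-to-fine granularity."""
    p = paths[word]
    return [p[:k] for k in prefix_lengths]

print(cluster_features("CEO"))       # ['00', '001', '0010']
print(cluster_features("chairman"))  # ['00', '001', '0011'] -- shares coarse clusters
```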

SLIDE 86

Brown cluster examples

SLIDE 87

Outline

Recap
– Maxent models
– Basic neural language models

Continuous representations
– Motivation
– Key idea: represent blobs with vectors
– Two common counting types
– Evaluation
– Common continuous representation models