Algorithms for NLP: Word Embeddings


SLIDE 1

Word Embeddings

Yulia Tsvetkov – CMU
Slides: Dan Jurafsky – Stanford, Mike Peters – AI2, Edouard Grave – FAIR

Algorithms for NLP

SLIDE 2

Brown Clustering

dog [0000], cat [0001], ant [001], river [010], lake [011], blue [10], red [11]

[Figure: binary merge tree over dog, cat, ant, river, lake, blue, red; each word's bit string records the 0/1 branch taken at each node from the root]

SLIDE 3

Brown Clustering

[Brown et al., 1992]

SLIDE 4

Brown Clustering

[Miller et al., 2004]

SLIDE 5

Brown Clustering

The model:

▪ V is the vocabulary
▪ C : V → {1, …, k} is a partition of the vocabulary into k clusters
▪ q(C(wi) | C(wi-1)) is the probability of the cluster of wi following the cluster of wi-1
▪ e(wi | C(wi)) is the probability of word wi given its cluster
▪ Quality(C) scores a clustering C by the likelihood the model assigns to the corpus

SLIDE 6

Quality(C)
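Spelled out (following Collins's lecture notes, credited below): Quality(C) is the normalized log-likelihood of the corpus under the class-based model, which works out to the mutual information between adjacent clusters plus a constant:

\[
\mathrm{Quality}(C) = \sum_{c=1}^{k} \sum_{c'=1}^{k} p(c, c') \log \frac{p(c, c')}{p(c)\, p(c')} \; + \; G
\]

where p(c, c') is the relative frequency with which cluster c' follows cluster c, and G is a constant that does not depend on C.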

Slide by Michael Collins

SLIDE 7

A Naive Algorithm

▪ We start with |V| clusters: each word gets its own cluster
▪ Our aim is to find k final clusters
▪ We run |V| − k merge steps:
  ▪ At each merge step we pick two clusters ci and cj, and merge them into a single cluster
  ▪ We greedily pick the merge that maximizes Quality(C) for the clustering C after the merge step
▪ Cost? Naive = O(|V|⁵). An improved algorithm gives O(|V|³): still too slow for realistic values of |V|

Slide by Michael Collins

SLIDE 8

Brown Clustering Algorithm

▪ Parameter of the approach is m (e.g., m = 1000)
▪ Take the top m most frequent words, put each into its own cluster, c1, c2, …, cm
▪ For i = (m + 1) … |V|:
  ▪ Create a new cluster, cm+1, for the i-th most frequent word. We now have m + 1 clusters
  ▪ Choose two clusters from c1 … cm+1 to be merged: pick the merge that gives a maximum value for Quality(C). We're now back to m clusters
▪ Carry out (m − 1) final merges, to create a full hierarchy
▪ Running time: O(|V|m² + n), where n is the corpus length (a runnable sketch of this loop follows below)

Slide by Michael Collins
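A minimal, runnable Python sketch of this loop (an illustration written for these notes, not the course's code). quality() recomputes the mutual-information part of Quality(C) from scratch for every candidate merge, so it is far slower than the O(|V|m² + n) algorithm, but it mirrors the slide's steps directly:

from collections import Counter
from itertools import combinations
from math import log

def quality(assign, bigrams):
    # Mutual-information term of Quality(C), computed over bigrams whose
    # words have already been assigned to clusters (constant term omitted).
    cc = Counter((assign[a], assign[b]) for a, b in bigrams
                 if a in assign and b in assign)
    n = sum(cc.values())
    left, right = Counter(), Counter()
    for (c1, c2), k in cc.items():
        left[c1] += k
        right[c2] += k
    return sum((k / n) * log(k * n / (left[c1] * right[c2]))
               for (c1, c2), k in cc.items())

def best_merge(assign, bigrams, history):
    # Greedily merge the pair of clusters that maximizes Quality(C).
    def after(ci, cj):
        return {w: (ci if c == cj else c) for w, c in assign.items()}
    ci, cj = max(combinations(sorted(set(assign.values())), 2),
                 key=lambda p: quality(after(*p), bigrams))
    history.append((ci, cj))
    return after(ci, cj)

def brown(corpus, m=3):
    bigrams = list(zip(corpus, corpus[1:]))
    by_freq = [w for w, _ in Counter(corpus).most_common()]
    assign = {w: i for i, w in enumerate(by_freq[:m])}  # top-m words
    history = []                          # merge order = the hierarchy
    for w in by_freq[m:]:
        assign[w] = max(assign.values()) + 1            # new cluster m+1
        assign = best_merge(assign, bigrams, history)   # back to m clusters
    while len(set(assign.values())) > 1:                # (m - 1) final merges
        assign = best_merge(assign, bigrams, history)
    return history

# e.g. brown("the cat sat on the mat and the dog sat on the rug".split(), m=3)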

SLIDE 9

Plan for Today

▪ Word2Vec

▪ Representation is created by training a classifier to distinguish nearby and far-away words

▪ FastText

▪ Extension of word2vec to include subword information

▪ ELMo

▪ Contextual token embeddings

▪ Multilingual embeddings
▪ Using embeddings to study history and culture

SLIDE 10

Word2Vec

▪ Popular embedding method
▪ Very fast to train
▪ Code available on the web
▪ Idea: predict rather than count

SLIDE 11

Word2Vec

[Mikolov et al., 2013]

SLIDE 12

Skip-gram Prediction

▪ Predict vs Count

the cat sat on the mat

SLIDE 13

▪ Predict vs Count

Skip-gram Prediction

the cat sat on the mat    (context size = 2)
wt = the → CLASSIFIER → wt-2 = <start-2>, wt-1 = <start-1>, wt+1 = cat, wt+2 = sat

SLIDE 14

Skip-gram Prediction

▪ Predict vs Count

the cat sat on the mat    (context size = 2)
wt = cat → CLASSIFIER → wt-2 = <start-1>, wt-1 = the, wt+1 = sat, wt+2 = on

SLIDE 15

the cat sat on the mat

▪ Predict vs Count

Skip-gram Prediction

context size = 2
wt = sat → CLASSIFIER → wt-2 = the, wt-1 = cat, wt+1 = on, wt+2 = the

SLIDE 16

▪ Predict vs Count

the cat sat on the mat

Skip-gram Prediction

context size = 2
wt = on → CLASSIFIER → wt-2 = cat, wt-1 = sat, wt+1 = the, wt+2 = mat

SLIDE 17

▪ Predict vs Count

the cat sat on the mat

Skip-gram Prediction

context size = 2
wt = the → CLASSIFIER → wt-2 = sat, wt-1 = on, wt+1 = mat, wt+2 = <end+1>

SLIDE 18

▪ Predict vs Count

the cat sat on the mat

Skip-gram Prediction

context size = 2
wt = mat → CLASSIFIER → wt-2 = on, wt-1 = the, wt+1 = <end+1>, wt+2 = <end+2>

SLIDE 19

▪ Predict vs Count

Skip-gram Prediction

wt = the → CLASSIFIER → wt-2 = <start-2>, wt-1 = <start-1>, wt+1 = cat, wt+2 = sat
wt = the → CLASSIFIER → wt-2 = sat, wt-1 = on, wt+1 = mat, wt+2 = <end+1>

SLIDE 20

Skip-gram Prediction

SLIDE 21

Skip-gram Prediction

▪ Training data

(wt, wt-2), (wt, wt-1), (wt, wt+1), (wt, wt+2), …
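A small sketch (illustrative, not from the slides) of generating these training pairs; out-of-range positions are simply skipped here rather than padded with <start>/<end> tokens as on the earlier slides:

def skipgram_pairs(tokens, window=2):
    # Emit one (target, context) pair per context word in the window.
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

print(skipgram_pairs("the cat sat on the mat".split())[:4])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]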

SLIDE 22

Skip-gram Prediction

SLIDE 23

▪ For each word in the corpus t = 1 … T:
maximize the probability of the words in the surrounding context window, given the current center word
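Written out in the standard skip-gram formulation (Mikolov et al., 2013), with context size c, the objective is to maximize the average log probability

\[
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \neq 0} \log p(w_{t+j} \mid w_t)
\]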

SLIDE 24

Skip-gram Prediction

▪ Softmax
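The softmax referred to here, in the usual two-matrix parameterization (v for center words, u for context words), gives the probability of a context word o given the center word c:

\[
p(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
\]

Normalizing over the whole vocabulary V is expensive, which motivates the negative sampling on the next slide.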

SLIDE 25

SGNS

▪ Negative Sampling

▪ Treat the target word and a neighboring context word as positive examples
▪ Subsample very frequent words
▪ Randomly sample other words in the lexicon to get negative samples
▪ 2 negative samples for each positive pair

Given a tuple (t, c) = (target, context):
▪ (cat, sat) — positive
▪ (cat, aardvark) — negative

SLIDE 26

Learning the classifier

▪ Iterative process
▪ We start with zero or random weights
▪ Then adjust the word weights to
  ▪ make the positive pairs more likely
  ▪ and the negative pairs less likely
  over the entire training set
▪ Train using gradient descent (on the loss below)
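For one positive pair (t, c) with k sampled noise words n1 … nk, the loss being minimized is the standard SGNS objective (as in Jurafsky & Martin's presentation):

\[
L = -\left[ \log \sigma(c \cdot t) + \sum_{i=1}^{k} \log \sigma(-n_i \cdot t) \right]
\]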

SLIDE 27

How to compute p(+|t,c)?

SLIDE 28

SGNS

Given a tuple (t, c) = (target, context):
▪ (cat, sat)
▪ (cat, aardvark)
Return the probability that c is a real context word:
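The returned probability is the sigmoid of the dot product of the two word vectors (the formulation in Jurafsky & Martin, whose treatment this section follows):

\[
P(+ \mid t, c) = \sigma(t \cdot c) = \frac{1}{1 + e^{-t \cdot c}}, \qquad
P(- \mid t, c) = 1 - P(+ \mid t, c)
\]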

SLIDE 29

Choosing noise words

We could pick noise words w according to their unigram frequency P(w). It is more common to choose them according to a weighted distribution:

pα(w) = count(w)^α / Σw′ count(w′)^α

α = ¾ works well because it gives rare noise words slightly higher probability. To see this, imagine two events with p(a) = .99 and p(b) = .01: then pα(a) = .99^.75 / (.99^.75 + .01^.75) ≈ .97 and pα(b) ≈ .03, so the rare event's probability roughly triples.
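A tiny sketch reproducing the slide's two-event example; the dict values stand in for unigram counts (probabilities work identically, since the normalization is the same):

def noise_dist(counts, alpha=0.75):
    # Unigram distribution raised to alpha, then renormalized.
    z = sum(c ** alpha for c in counts.values())
    return {w: (c ** alpha) / z for w, c in counts.items()}

print(noise_dist({"a": 0.99, "b": 0.01}))
# ≈ {'a': 0.97, 'b': 0.03} — the rare word gets a boost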

SLIDE 30

Skip-gram Prediction

SLIDE 31

FastText

https://fasttext.cc/

SLIDE 32

FastText: Motivation

SLIDE 33

Subword Representation

skiing = {^skiing$, ^ski, skii, kiin, iing, ing$}

SLIDE 34

FastText

SLIDE 35

Details

▪ n-grams between 3 and 6 characters
▪ How many possible n-grams? |character set|ⁿ
▪ Hashing maps n-grams to integers in 1 to K = 2M (2 million buckets)
▪ Word vectors for out-of-vocabulary words are built from their subword vectors
▪ Less than 2× slower than word2vec skip-gram
▪ Short n-grams (n = 4) are good at capturing syntactic information
▪ Longer n-grams (n = 6) are good at capturing semantic information

(a sketch of the n-gram extraction and hashing follows)
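A sketch of the subword scheme just described. This is illustrative: fastText itself uses the FNV-1a hash, and this sketch uses < and > as word-boundary markers where the earlier slide used ^ and $:

def char_ngrams(word, nmin=3, nmax=6):
    # Wrap the word in boundary markers, collect all n-grams,
    # and keep the whole word itself as a special feature.
    w = f"<{word}>"
    grams = {w}
    for n in range(nmin, nmax + 1):
        grams.update(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

def bucket(gram, K=2_000_000):
    # Stand-in for fastText's FNV-1a hashing into K buckets.
    return hash(gram) % K

print(sorted(char_ngrams("skiing", nmin=4, nmax=4)))
# ['<ski', '<skiing>', 'iing', 'ing>', 'kiin', 'skii']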

SLIDE 36

FastText Evaluation

▪ Intrinsic evaluation: word-similarity datasets in Arabic, German, Spanish, French, Romanian, Russian

word1    word2       similarity (humans)   similarity (embeddings)
vanish   disappear   9.8                   1.1
behave   obey        7.3                   0.5
belief   impression  5.95                  0.3
muscle   bone        3.65                  1.7
modest   flexible    0.98                  0.98
hole     agreement   0.3                   0.3

Metric: Spearman's rho between human ranks and model ranks
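Computing the metric above with SciPy, using the six pairs from the table:

from scipy.stats import spearmanr

human = [9.8, 7.3, 5.95, 3.65, 0.98, 0.3]
model = [1.1, 0.5, 0.3, 1.7, 0.98, 0.3]
rho, pval = spearmanr(human, model)  # rank correlation, not raw values
print(f"Spearman's rho = {rho:.2f}")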

SLIDE 37

FastText Evaluation

[Grave et al., 2017]

SLIDE 38

FastText Evaluation

SLIDE 39

FastText Evaluation

SLIDE 40

ELMo

https://allennlp.org/elmo

SLIDE 41

Motivation

p(play | Elmo and Cookie Monster play a game .)

p(play | The Broadway play premiered yesterday .)

SLIDE 42

Background

SLIDE 43

The Broadway play premiered yesterday .

[Figure: a forward LSTM language model over the sentence; which vector should represent "play"?]

SLIDE 44

The Broadway play premiered yesterday .

[Figure: a stacked (multi-layer) forward LSTM language model over the sentence; which layer's vector should represent "play"?]

SLIDE 45

The Broadway play premiered yesterday .

[Figure: forward and backward LSTM language models over the sentence; each direction yields its own representation of "play"]

SLIDE 46

The Broadway play premiered yesterday .

[Figure: a bidirectional LM (biLM) over the sentence]

ELMo (Embeddings from Language Models) = ??

SLIDE 47

The Broadway play premiered yesterday .

[Figure: a biLM over the sentence]

ELMo (Embeddings from Language Models) = …

SLIDE 48

The Broadway play premiered yesterday .

[Figure: a biLM over the sentence]

ELMo (Embeddings from Language Models) = (token embedding) + (layer-1 output) + (layer-2 output)

SLIDE 49

The Broadway play premiered yesterday .

[Figure: a biLM over the sentence]

ELMo (Embeddings from Language Models) = λ0 (token embedding) + λ1 (layer-1 output) + λ2 (layer-2 output)
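In the notation of the ELMo paper (Peters et al., 2018), for token k and an L-layer biLM this weighted combination is

\[
\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \, h_{k,j}^{LM}
\]

where h_{k,0} is the token embedding, h_{k,j} (j ≥ 1) are the biLM layer outputs, the s_j are softmax-normalized task-specific weights (the λ's on this slide), and γ is a task-specific scalar.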

SLIDE 50

Evaluation: Extrinsic Tasks

SLIDE 51

Stanford Question Answering Dataset (SQuAD)

[Rajpurkar et al., 2016, 2018]

SLIDE 52

SNLI

[Bowman et al., 2015]

SLIDE 53

Multilingual Embeddings

https://github.com/mfaruqui/crosslingual-cca
http://128.2.220.95/multilingual/

SLIDE 54

Motivation

[Figure: two independently trained embedding models, "model 1" and "model 2"; can their vectors be compared?]

SLIDE 55

Motivation

[Figure: separate English and French embedding spaces; how can they be mapped into a shared space?]

SLIDE 56

Canonical Correlation Analysis (CCA)

Canonical Correlation Analysis (Hotelling, 1936) projects two sets of vectors (of equal cardinality) into a space where they are maximally correlated.

[Figure: two vector sets Ω and Σ before and after the CCA projection]

SLIDE 57

Canonical Correlation Analysis (CCA)

Ω ⊆ X (an n1 × d1 matrix), Σ ⊆ Y (an n2 × d2 matrix)

W, V = CCA(Ω, Σ), with k = min(r(Ω), r(Σ))

X′ = X · W (n1 × k), Y′ = Y · V (n2 × k)

X′ and Y′ are now maximally correlated.

[Faruqui & Dyer, 2014]
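A sketch using scikit-learn's CCA as a stand-in for the authors' implementation; the random matrices below are placeholders for the two languages' vectors of word-translation pairs:

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))   # e.g., English vectors of translation pairs
Y = rng.normal(size=(500, 80))    # e.g., French vectors of the same pairs

cca = CCA(n_components=50)
X_c, Y_c = cca.fit_transform(X, Y)   # maximally correlated projections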

SLIDE 58

Extension: Multilingual Embeddings

[Ammar et al., ‘16]

[Figure: English, French, Spanish, Arabic, and Swedish vector spaces connected to a shared (English) space through projection matrices such as O_french→english and its inverse O_french←english, learned from French-English translation pairs]
SLIDE 59

Embeddings can help study word history!

SLIDE 60

Diachronic Embeddings

[Figure: timeline 1900-1950-2000; word vectors trained on 1920 text vs. word vectors trained on 1990 text, e.g., a "dog" 1920 word vector and a "dog" 1990 word vector]

▪ count-based embeddings w/ PPMI
▪ projected to a common space (see the alignment sketch below)
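One standard way to project two embedding spaces into a common space over a shared vocabulary is orthogonal Procrustes alignment, used in later diachronic work; this sketch is an assumption about the projection step, not necessarily the exact pipeline behind these slides:

import numpy as np
from scipy.linalg import orthogonal_procrustes

def align(source, target):
    # Rotate `source` (n x d) onto `target` (n x d); rows must correspond
    # to the same words in both decades' embedding matrices.
    R, _ = orthogonal_procrustes(source, target)
    return source @ R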

SLIDE 61

Project 300 dimensions down into 2

~30 million books, 1850-1990, Google Books data

SLIDE 62

Negative words change faster than positive words

SLIDE 63

Embeddings reflect ethnic stereotypes over time

SLIDE 64

Change in linguistic framing 1910-1990

Change in association of Chinese names with adjectives framed as "othering" (barbaric, monstrous, bizarre)

SLIDE 65

Conclusion

▪ Concepts or word senses

▪ Have a complex many-to-many association with words (homonymy, multiple senses)
▪ Have relations with each other:
  ▪ synonymy, antonymy, superordinate terms
▪ But are hard to define formally (necessary & sufficient conditions)

▪ Embeddings = vector models of meaning

▪ More fine-grained than just a string or index
▪ Especially good at modeling similarity/analogy
  ▪ Just download them and use cosines!!
▪ Useful in many NLP tasks
▪ But know that they encode cultural stereotypes