Neural LMs
SLIDE 1

Neural LMs

SLIDE 2

Neural LMs

Image: (Bengio et al, 03)

SLIDE 3

“One Hot” Vectors

SLIDE 4

Neural LMs

(Bengio et al, 03)

SLIDE 5

Low-dimensional Representations

▪ Learning representations by back-propagating errors

▪ Rumelhart, Hinton & Williams, 1986

▪ A neural probabilistic language model

▪ Bengio et al., 2003

▪ Natural Language Processing (almost) from scratch

▪ Collobert & Weston, 2008

▪ Word representations: A simple and general method for semi-supervised learning

▪ Turian et al., 2010

▪ Distributed Representations of Words and Phrases and their Compositionality

▪ Word2Vec; Mikolov et al., 2013

SLIDE 6

Word Vectors

Distributed representations

SLIDE 7

What are various ways to represent the meaning of a word?

SLIDE 8

Lexical Semantics

▪ How should we represent the meaning of a word?

▪ Dictionary definition
▪ Lemma and wordforms
▪ Senses
▪ Relationships between words or senses
▪ Taxonomic relationships
▪ Word similarity, word relatedness
▪ Semantic frames and roles
▪ Connotation and sentiment

SLIDE 9

WordNet

https://wordnet.princeton.edu/

Electronic Dictionaries

SLIDE 10

Electronic Dictionaries

WordNet in NLTK: www.nltk.org
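For instance, WordNet can be queried through NLTK. A minimal sketch (assumes `nltk` is installed and the data has been fetched once via `nltk.download('wordnet')`):

```python
# Look up senses and taxonomic relations in WordNet via NLTK.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("fool"):
    print(synset.name(), "-", synset.definition())

# Taxonomic relationships: hypernyms of the first sense.
print(wn.synsets("fool")[0].hypernyms())
```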

SLIDE 11

Problems with Discrete Representations

▪ Too coarse

▪ expert ↔ skillful

▪ Sparse

▪ wicked, badass, ninja

▪ Subjective
▪ Expensive
▪ Hard to compute word relationships

expert   [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
skillful [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]

dimensionality: PTB: 50K, Google1T: 13M
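A quick sketch of why these vectors are too coarse: with one-hot representations, every pair of distinct words is equally dissimilar (toy 15-dimensional vocabulary; the indices are arbitrary):

```python
import numpy as np

# Toy vocabulary; real vocabularies run 50K (PTB) to 13M (Google1T) types.
vocab = {"the": 0, "expert": 3, "skillful": 10}

def one_hot(word, dim=15):
    v = np.zeros(dim)
    v[vocab[word]] = 1.0
    return v

# The dot product (and cosine) between any two distinct one-hot vectors
# is 0, so "expert" is exactly as (dis)similar to "skillful" as to "the".
print(one_hot("expert") @ one_hot("skillful"))  # 0.0
```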

SLIDE 12

Distributional Hypothesis

“The meaning of a word is its use in the language”

[Wittgenstein PI 43]

“You shall know a word by the company it keeps”

[Firth 1957]

If A and B have almost identical environments we say that they are synonyms.

[Harris 1954]

SLIDE 13

Example

What does ongchoi mean?

SLIDE 14

▪ Suppose you see these sentences:

▪ Ongchoi is delicious sautéed with garlic.
▪ Ongchoi is superb over rice
▪ Ongchoi leaves with salty sauces

▪ And you've also seen these:

▪ …spinach sautéed with garlic over rice
▪ Chard stems and leaves are delicious
▪ Collard greens and other salty leafy greens

Example

What does ongchoi mean?

SLIDE 15

Ongchoi: Ipomoea aquatica "Water Spinach"

Ongchoi is a leafy green like spinach, chard, or collard greens

Yamaguchi, Wikimedia Commons, public domain

SLIDE 16

Model of Meaning Focusing on Similarity

▪ Each word = a vector

▪ not just “word” or word45
▪ similar words are “nearby in space”
▪ the standard way to represent meaning in NLP

SLIDE 17

We'll Introduce 4 Kinds of Embeddings

▪ Count-based

▪ Words are represented by a simple function of the counts of nearby words

▪ Class-based

▪ Representation is created through hierarchical clustering, Brown clusters

▪ Distributed prediction-based (type) embeddings

▪ Representation is created by training a classifier to distinguish nearby and far-away words: word2vec, fasttext

▪ Distributed contextual (token) embeddings from language models

▪ ELMo, BERT

SLIDE 18

Term-Document Matrix

Context = appearing in the same document.

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle          1                0               7            13
soldier         2               80              62            89
fool           36               58               1             4
clown          20               15               2             3

SLIDE 19

Term-Document Matrix

Each document is represented by a vector of words

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle          1                0               7            13
soldier         2               80              62            89
fool           36               58               1             4
clown          20               15               2             3

SLIDE 20

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle          1                0               7            13
soldier         2               80              62            89
fool           36               58               1             4
clown          20               15               2             3

Vectors are the Basis of Information Retrieval

▪ Vectors are similar for the two comedies
▪ but different from the history
▪ Comedies have more fools and wit and fewer battles

SLIDE 21

Visualizing Document Vectors

SLIDE 22

Words Can Be Vectors Too

▪ battle is "the kind of word that occurs in Julius Caesar and Henry V"
▪ fool is "the kind of word that occurs in comedies, especially Twelfth Night"

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle          1                0               7            13
good          114               80              62            89
fool           36               58               1             4
clown          20               15               2             3

SLIDE 23

Term-Context Matrix

▪ Two words are “similar” in meaning if their context vectors are similar

▪ Similarity == relatedness

        knife   dog   sword   love   like
knife     -      1      6       5      5
dog       1      -      5       5      5
sword     6      5      -       5      5
love      5      5      5       -      5
like      5      5      5       5      -

SLIDE 24

Count-Based Representations

▪ Counts: term-frequency

▪ remove stop words
▪ use log10(tf)
▪ normalize by document length

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle          1                0               7            13
good          114               80              62            89
fool           36               58               1             4
wit            20               15               2             3

SLIDE 25

TF-IDF

▪ What to do with words that are evenly distributed across many documents?

idf_i = log10(N / df_i)

where N is the total # of docs in the collection and df_i is the # of docs that contain word i; the tf-idf weight is then tf_i,d × idf_i.

Words like "the" or "good" occur in nearly every document, so they have very low idf.
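A sketch of the weighting scheme on the toy term-document counts used in these slides (log10-dampened term frequency, as on the previous slide):

```python
import numpy as np

# Rows = terms (battle, good, fool, wit), columns = the four plays.
counts = np.array([[1, 0, 7, 13],
                   [114, 80, 62, 89],
                   [36, 58, 1, 4],
                   [20, 15, 2, 3]])

tf = np.log10(counts + 1)          # dampened term frequency
df = (counts > 0).sum(axis=1)      # number of docs containing each term
idf = np.log10(counts.shape[1] / df)
tfidf = tf * idf[:, None]
print(tfidf.round(2))  # the "good" row becomes all zeros: idf = log10(4/4) = 0
```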

SLIDE 26

Positive Pointwise Mutual Information (PPMI)

▪ In a word-context matrix: do words w and c co-occur more than if they were independent?

PMI(w, c) = log2 [ P(w, c) / (P(w) P(c)) ],   PPMI(w, c) = max(PMI(w, c), 0)

▪ PMI is biased toward infrequent events
  ▪ very rare words have very high PMI values
  ▪ fix: give rare words slightly higher probabilities, P_α(c) = count(c)^α / Σ_c′ count(c′)^α, with α = 0.75

(Church and Hanks, 1990) (Turney and Pantel, 2010)
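A minimal PPMI computation over a toy word-context count matrix (values illustrative):

```python
import numpy as np

# Rows = target words, columns = context words (toy counts).
counts = np.array([[0., 1., 6., 5., 5.],
                   [1., 0., 5., 5., 5.],
                   [6., 5., 0., 5., 5.]])

p_wc = counts / counts.sum()
p_w = p_wc.sum(axis=1, keepdims=True)
p_c = p_wc.sum(axis=0, keepdims=True)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)
ppmi[np.isnan(ppmi)] = 0.0  # zero-count cells
print(ppmi.round(2))
```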

SLIDE 27

(Pecina’09)

SLIDE 28

Dimensionality Reduction

▪ Wikipedia: ~29 million English documents. Vocab: ~1M words.

▪ High dimensionality of the word-document matrix
▪ Sparsity
▪ The order of rows and columns doesn’t matter

▪ Goal:

▪ good similarity measure for words or documents
▪ dense representation

▪ Sparse vs Dense vectors

▪ Short vectors may be easier to use as features in machine learning (fewer weights to tune)
▪ Dense vectors may generalize better than storing explicit counts
▪ They may do better at capturing synonymy
▪ In practice, they work better


SLIDE 29

▪ Solution idea:
  ▪ Find a projection into a low-dimensional space (~300 dim)
  ▪ that gives the best separation between features

Singular Value Decomposition (SVD)

X = U Σ Vᵀ

U, V: orthonormal; Σ: diagonal, with singular values sorted in decreasing order

SLIDE 30

Truncated SVD

We can approximate the full matrix by keeping only the leftmost k terms of the diagonal matrix (the k largest singular values):

X ≈ U_k Σ_k V_kᵀ

The singular values fall off quickly (e.g., 9, 4, .1, .0, .0, …), so little is lost. The rows of U_k serve as dense word vectors; the columns of V_kᵀ serve as dense document vectors.
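A sketch of truncated SVD with numpy on the toy term-document matrix, keeping k = 2 singular values:

```python
import numpy as np

X = np.array([[1., 0., 7., 13.],
              [114., 80., 62., 89.],
              [36., 58., 1., 4.],
              [20., 15., 2., 3.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # s is sorted, descending
k = 2
word_vecs = U[:, :k] * s[:k]          # dense word vectors (rows)
doc_vecs = s[:k, None] * Vt[:k]       # dense document vectors (columns)
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
print(np.abs(X - X_approx).max())     # small reconstruction error
```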

SLIDE 31

Latent Semantic Analysis

[Deerwester et al., 1990]

[Figure: top words along the first LSA dimensions (#0-#5) learned from news text; recognizable themes include music/film/theater/dance/orchestra/ballet, company/stock/shares/sales/chief executive, russian/space/center/aircraft, and program/project/research/development]

SLIDE 32

LSA++

▪ Probabilistic Latent Semantic Indexing (pLSI)

▪ Hofmann, 1999

▪ Latent Dirichlet Allocation (LDA)

▪ Blei et al., 2003

▪ Nonnegative Matrix Factorization (NMF)

▪ Lee & Seung, 1999

SLIDE 33

Word Similarity

SLIDE 34

Evaluation

▪ Intrinsic
▪ Extrinsic
▪ Qualitative

SLIDE 35

Extrinsic Evaluation

▪ Chunking
▪ POS tagging
▪ Parsing
▪ MT
▪ SRL
▪ Topic categorization
▪ Sentiment analysis
▪ Metaphor detection
▪ etc.

SLIDE 36

Intrinsic Evaluation

▪ WS-353 (Finkelstein et al. ‘02)
▪ MEN-3k (Bruni et al. ‘12)
▪ SimLex-999 dataset (Hill et al., 2015)

word1     word2        similarity (humans)   similarity (embeddings)
vanish    disappear           9.8                     1.1
behave    obey                7.3                     0.5
belief    impression          5.95                    0.3
muscle    bone                3.65                    1.7
modest    flexible            0.98                    0.98
hole      agreement           0.3                     0.3

Spearman's rho (human ranks, model ranks)
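Computing the metric is a one-liner with scipy, using the six pairs above:

```python
from scipy.stats import spearmanr

human = [9.8, 7.3, 5.95, 3.65, 0.98, 0.3]
model = [1.1, 0.5, 0.3, 1.7, 0.98, 0.3]
rho, pval = spearmanr(human, model)   # compares ranks, not raw values
print(round(rho, 3))
```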

SLIDE 37

Visualisation

▪ Visualizing Data using t-SNE (van der Maaten & Hinton’08)

[Faruqui et al., 2015]

SLIDE 38

What we’ve seen by now

▪ Meaning representation
▪ Distributional hypothesis
▪ Count-based vectors

▪ term-document matrix
▪ word-in-context matrix
▪ normalizing counts: tf-idf, PPMI
▪ dimensionality reduction
▪ measuring similarity
▪ evaluation

Next: ▪ Brown clusters

▪ Representation is created through hierarchical clustering

SLIDE 39

Word embedding representations

▪ Count-based

▪ tf-idf, PPMI

▪ Class-based

▪ Brown clusters

▪ Distributed prediction-based (type) embeddings

▪ Word2Vec, Fasttext

▪ Distributed contextual (token) embeddings from language models

▪ ELMo, BERT

▪ + many more variants

▪ Multilingual embeddings
▪ Multisense embeddings
▪ Syntactic embeddings
▪ etc.
SLIDE 40

The intuition of Brown clustering

▪ Similar words appear in similar contexts
▪ More precisely: similar words have similar distributions of words to their immediate left and right

[Figure: Monday, Tuesday, and Wednesday all occur in the same immediate contexts, e.g. "on ___" and "last ___"]

SLIDE 41

Brown Clustering

dog [0000]
cat [0001]
ant [001]
river [010]
lake [011]
blue [10]
red [11]

[Figure: the binary merge tree over {dog, cat, ant, river, lake, blue, red}; each word's bit string is its path from the root]
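Prefixes of the bit strings give clusterings at different granularities, which is how Brown clusters are typically used as features (cf. Miller et al., 2004). A sketch with the toy clusters above (feature names hypothetical):

```python
clusters = {"dog": "0000", "cat": "0001", "ant": "001",
            "river": "010", "lake": "011", "blue": "10", "red": "11"}

def prefix_features(word, lengths=(2, 4)):
    path = clusters[word]
    return {f"brown_prefix_{n}": path[:n] for n in lengths}

print(prefix_features("dog"))  # {'brown_prefix_2': '00', 'brown_prefix_4': '0000'}
print(prefix_features("cat"))  # shares the '00' prefix with 'dog'
```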

SLIDE 42

Brown Clustering

[Brown et al, 1992]

SLIDE 43

Brown Clustering

[Miller et al., 2004]

SLIDE 44

Brown Clustering

The model:

▪ V is a vocabulary
▪ C : V → {1, …, k} is a partition of the vocabulary into k clusters
▪ q(C(wi) | C(wi-1)) is the probability of the cluster of wi following the cluster of wi-1

p(w1, …, wn) = ∏i e(wi | C(wi)) q(C(wi) | C(wi-1))

SLIDE 45

Brown Clustering

The model:

▪ V is a vocabulary
▪ C : V → {1, …, k} is a partition of the vocabulary into k clusters
▪ q(C(wi) | C(wi-1)) is the probability of the cluster of wi following the cluster of wi-1

Quality(C) = (1/n) Σi log e(wi | C(wi)) q(C(wi) | C(wi-1))

SLIDE 46

Quality(C) = Σc Σc′ p(c, c′) log [ p(c, c′) / (p(c) p(c′)) ] + G

i.e., the mutual information between adjacent clusters, plus a constant G that does not depend on C.

Slide by Michael Collins
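A sketch of evaluating Quality(C) for a fixed clustering as the mutual information between adjacent clusters (toy corpus and clustering; the constant term G is omitted):

```python
import math
from collections import Counter

corpus = "the dog ran the cat ran the dog walked".split()
C = {"the": 0, "dog": 1, "cat": 1, "ran": 2, "walked": 2}

bigrams = Counter((C[a], C[b]) for a, b in zip(corpus, corpus[1:]))
n = sum(bigrams.values())
p_joint = {cc: k / n for cc, k in bigrams.items()}
p_left, p_right = Counter(), Counter()
for (c1, c2), p in p_joint.items():
    p_left[c1] += p
    p_right[c2] += p

quality = sum(p * math.log2(p / (p_left[c1] * p_right[c2]))
              for (c1, c2), p in p_joint.items())
print(round(quality, 3))
```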

SLIDE 47

A Naive Algorithm

▪ We start with |V| clusters: each word gets its own cluster
▪ Our aim is to find k final clusters
▪ We run |V| − k merge steps:
  ▪ At each merge step we pick two clusters ci and cj and merge them into a single cluster
  ▪ We greedily pick the merge that maximizes Quality(C) of the resulting clustering at each stage
▪ Cost? Naive = O(|V|^5). An improved algorithm gives O(|V|^3): still too slow for realistic values of |V|

Slide by Michael Collins

SLIDE 48

Brown Clustering Algorithm

▪ Parameter of the approach is m (e.g., m = 1000)
▪ Take the top m most frequent words, put each into its own cluster, c1, c2, …, cm
▪ For i = (m + 1) … |V|:
  ▪ Create a new cluster, cm+1, for the i-th most frequent word. We now have m + 1 clusters
  ▪ Choose two clusters from c1 … cm+1 to be merged: pick the merge that gives a maximum value for Quality(C). We’re now back to m clusters
▪ Carry out (m − 1) final merges, to create a full hierarchy
▪ Running time: O(|V|m^2 + n), where n is the corpus length

Slide by Michael Collins

SLIDE 49

Word embedding representations

▪ Count-based

▪ tf-idf, PPMI

▪ Class-based

▪ Brown clusters

▪ Distributed prediction-based (type) embeddings

▪ Word2Vec, Fasttext

▪ Distributed contextual (token) embeddings from language models

▪ ELMo, BERT

▪ + many more variants

▪ Multilingual embeddings
▪ Multisense embeddings
▪ Syntactic embeddings
▪ etc.
SLIDE 50

Word2Vec

▪ Popular embedding method
▪ Very fast to train
▪ Code available on the web
▪ Idea: predict rather than count

SLIDE 51

Word2Vec

[Mikolov et al.’ 13]

SLIDE 52

Skip-gram Prediction

▪ Predict vs Count

the cat sat on the mat

SLIDE 53

▪ Predict vs Count

Skip-gram Prediction

the cat sat on the mat

context size = 2
wt = the → CLASSIFIER → wt-2 = <start-2>, wt-1 = <start-1>, wt+1 = cat, wt+2 = sat
SLIDE 54

Skip-gram Prediction

▪ Predict vs Count

the cat sat on the mat

context size = 2
wt = cat → CLASSIFIER → wt-2 = <start-1>, wt-1 = the, wt+1 = sat, wt+2 = on
SLIDE 55

the cat sat on the mat

▪ Predict vs Count

Skip-gram Prediction

context size = 2
wt = sat → CLASSIFIER → wt-2 = the, wt-1 = cat, wt+1 = on, wt+2 = the
SLIDE 56

▪ Predict vs Count

the cat sat on the mat

Skip-gram Prediction

context size = 2
wt = on → CLASSIFIER → wt-2 = cat, wt-1 = sat, wt+1 = the, wt+2 = mat
SLIDE 57

▪ Predict vs Count

the cat sat on the mat

Skip-gram Prediction

context size = 2
wt = the → CLASSIFIER → wt-2 = sat, wt-1 = on, wt+1 = mat, wt+2 = <end+1>
SLIDE 58

▪ Predict vs Count

the cat sat on the mat

Skip-gram Prediction

context size = 2
wt = mat → CLASSIFIER → wt-2 = on, wt-1 = the, wt+1 = <end+1>, wt+2 = <end+2>
SLIDE 59

▪ Predict vs Count

Skip-gram Prediction

wt = the → CLASSIFIER → wt-2 = <start-2>, wt-1 = <start-1>, wt+1 = cat, wt+2 = sat
wt = the → CLASSIFIER → wt-2 = sat, wt-1 = on, wt+1 = mat, wt+2 = <end+1>
SLIDE 60

Skip-gram Prediction

SLIDE 61

Skip-gram Prediction

▪ Training data

(wt, wt-2), (wt, wt-1), (wt, wt+1), (wt, wt+2), …
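A sketch of extracting these pairs (window = 2; for simplicity the window is truncated at the sentence boundary instead of padded with <start>/<end> tokens):

```python
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the cat sat on the mat".split()))
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat'), ...]
```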

SLIDE 62

Skip-gram Prediction

SLIDE 63

Objective

▪ For each word in the corpus, t = 1 … T: maximize the probability of each context word within the window, given the current center word:

J(θ) = (1/T) Σ_{t=1…T} Σ_{-m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t)

SLIDE 64

Skip-gram Prediction

▪ Softmax:

p(o | c) = exp(u_oᵀ v_c) / Σ_{w ∈ V} exp(u_wᵀ v_c)

where v_c is the vector of the center word c and u_o is the context vector of word o

SLIDE 65

SGNS

▪ Negative Sampling:
▪ Treat the target word and a neighboring context word as positive examples
▪ Subsample very frequent words
▪ Randomly sample other words in the lexicon to get negative samples
  ▪ 2 negative samples per positive example

Given a tuple (t, c) = target, context:
▪ (cat, sat) (positive)
▪ (cat, aardvark) (negative)

SLIDE 66

Choosing noise words

Could pick w according to its unigram frequency P(w). It is more common to choose noise words according to the smoothed distribution P_α(w) = count(w)^α / Σ_w′ count(w′)^α. α = ¾ works well because it gives rare noise words slightly higher probability. To see this, imagine two events with P(a) = .99 and P(b) = .01; then P_α(a) = .99^.75 / (.99^.75 + .01^.75) ≈ .97 and P_α(b) = .01^.75 / (.99^.75 + .01^.75) ≈ .03.
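The same reweighting in code (counts chosen to match the example):

```python
counts = {"a": 99, "b": 1}
alpha = 0.75

weights = {w: c ** alpha for w, c in counts.items()}
total = sum(weights.values())
p_alpha = {w: round(v / total, 3) for w, v in weights.items()}
print(p_alpha)  # {'a': 0.969, 'b': 0.031}: rare words get boosted
```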

SLIDE 67

How to compute p(+|t,c)?

SLIDE 68

SGNS

Given a tuple (t, c) = target, context
▪ (cat, sat)
▪ (cat, aardvark)

Return the probability that c is a real context word:

p(+ | t, c) = σ(t · c) = 1 / (1 + e^(-t·c)),   p(- | t, c) = 1 - p(+ | t, c)

SLIDE 69

Learning the classifier

▪ Iterative process
▪ We’ll start with 0 or random weights
▪ Then adjust the word weights to
  ▪ make the positive pairs more likely
  ▪ and the negative pairs less likely
▪ over the entire training set
▪ Train using gradient descent (a sketch follows below)
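A minimal SGNS training loop in numpy under the assumptions above (toy corpus; hyperparameters are hypothetical, and negatives are drawn uniformly rather than from P_α, so in a 5-word vocabulary they occasionally collide with true contexts):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}

dim, window, k, lr = 10, 2, 2, 0.05
T = rng.normal(0, 0.1, (len(vocab), dim))  # target (center) vectors
C = rng.normal(0, 0.1, (len(vocab), dim))  # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(200):
    for i, center in enumerate(tokens):
        t = idx[center]
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            pairs = [(idx[tokens[j]], 1.0)]                        # positive
            pairs += [(n, 0.0) for n in rng.integers(0, len(vocab), k)]
            for c, label in pairs:
                g = sigmoid(T[t] @ C[c]) - label   # gradient of the log loss
                grad_t = g * C[c]
                C[c] -= lr * g * T[t]
                T[t] -= lr * grad_t

print(sigmoid(T[idx["cat"]] @ C[idx["sat"]]))  # co-occurring pair: tends high
print(sigmoid(T[idx["cat"]] @ C[idx["mat"]]))  # never co-occur: tends low
```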

slide-70
SLIDE 70

Skip-gram Prediction

SLIDE 71

FastText

https://fasttext.cc/

SLIDE 72

FastText: Motivation

SLIDE 73

Subword Representation

skiing = {^skiing$, ^ski, skii, kiin, iing, ing$}

SLIDE 74

FastText

SLIDE 75

Details

▪ How many possible n-grams? |character set|^n
  ▪ Hashing maps n-grams to integers in 1 to K = 2M
▪ Get word vectors for out-of-vocabulary words using subwords
▪ Less than 2× slower than word2vec skip-gram
▪ n-grams between 3 and 6 characters
▪ Short n-grams (n = 4) are good for capturing syntactic information
▪ Longer n-grams (n = 6) are good for capturing semantic information
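A sketch of the subword machinery: boundary-marked character n-grams plus hashing into K buckets (the FNV-1a hash here is illustrative, standing in for fastText's own hashing). An out-of-vocabulary word's vector is then the sum of its n-gram bucket vectors:

```python
def char_ngrams(word, nmin=3, nmax=6):
    w = f"^{word}$"
    grams = {w}  # the whole word is kept as its own feature
    for n in range(nmin, nmax + 1):
        grams.update(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

def bucket(gram, K=2_000_000):
    h = 2166136261                      # FNV-1a over the UTF-8 bytes
    for ch in gram.encode("utf-8"):
        h = ((h ^ ch) * 16777619) & 0xFFFFFFFF
    return h % K

print(sorted(char_ngrams("skiing", 4, 4)))
# ['^ski', '^skiing$', 'iing', 'ing$', 'kiin', 'skii']
print(bucket("^ski"))
```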

SLIDE 76

FastText Evaluation

▪ Intrinsic evaluation
▪ Arabic, German, Spanish, French, Romanian, Russian

word1     word2        similarity (humans)   similarity (embeddings)
vanish    disappear           9.8                     1.1
behave    obey                7.3                     0.5
belief    impression          5.95                    0.3
muscle    bone                3.65                    1.7
modest    flexible            0.98                    0.98
hole      agreement           0.3                     0.3

Spearman's rho (human ranks, model ranks)

SLIDE 77

FastText Evaluation

[Grave et al, 2017]

SLIDE 78

FastText Evaluation

SLIDE 79

FastText Evaluation

SLIDE 80

Dense Embeddings You Can Download

Word2vec (Mikolov et al.’ 13): https://code.google.com/archive/p/word2vec/
Fasttext (Bojanowski et al.’ 17): http://www.fasttext.cc/
Glove (Pennington et al., 14): http://nlp.stanford.edu/projects/glove/

SLIDE 81

Word embedding representations

▪ Count-based

▪ tf-idf, PPMI

▪ Class-based

▪ Brown clusters

▪ Distributed prediction-based (type) embeddings

▪ Word2Vec, Fasttext

▪ Distributed contextual (token) embeddings from language models

▪ ELMo, BERT

▪ + many more variants

▪ Multilingual embeddings
▪ Multisense embeddings
▪ Syntactic embeddings
▪ etc.
SLIDE 82

Motivation

p(play | Elmo and Cookie Monster play a game .)

p(play | The Broadway play premiered yesterday .)

SLIDE 83

ELMo

https://allennlp.org/elmo

SLIDE 84

Background

SLIDE 85

The Broadway play premiered yesterday .

[Figure: a forward LSTM language model reads the sentence left to right; what vector should represent "play"?]
SLIDE 86

The Broadway play premiered yesterday .

[Figure: stacking additional forward LSTM layers over the sentence]
SLIDE 87

The Broadway play premiered yesterday .

[Figure: a forward LSTM and a backward LSTM each read the sentence, giving a representation of "play" from each direction]
SLIDE 88

The Broadway play premiered yesterday .

[Figure: the biLM's layers over the sentence]

ELMo (Embeddings from Language Models) = ??
SLIDE 89

The Broadway play premiered yesterday .

[Figure: the biLM's layers over the sentence]

ELMo (Embeddings from Language Models) = a function of the token's layer representations
SLIDE 90

The Broadway play premiered yesterday .

[Figure: the biLM's layers over the sentence]

ELMo (Embeddings from Language Models) = a sum of the token's layer representations (embedding layer + two biLSTM layers)
SLIDE 91

The Broadway play premiered yesterday .

[Figure: the biLM's layers over the sentence]

ELMo (Embeddings from Language Models):

ELMo_k = λ0 (x_k) + λ1 (h_k,1) + λ2 (h_k,2)

a weighted sum of the token embedding and the two biLSTM layer states, with learned weights λj
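A sketch of this combination step (shapes and λ values are illustrative; per the slide, layer 0 is the token embedding and layers 1 and 2 are the biLSTM layers, with softmax-normalized weights and a task-specific scale γ):

```python
import numpy as np

seq_len, dim = 6, 8                        # "The Broadway play premiered yesterday ."
layers = np.random.randn(3, seq_len, dim)  # h_{k,j} for j = 0, 1, 2

lam = np.array([0.2, 1.5, 0.9])            # learned scalars, pre-normalization
weights = np.exp(lam) / np.exp(lam).sum()  # softmax over layers
gamma = 1.0

elmo = gamma * np.einsum("j,jtd->td", weights, layers)
print(elmo.shape)  # (6, 8): one contextual vector per token
```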

SLIDE 92

Evaluation: Extrinsic Tasks

SLIDE 93

Stanford Question Answering Dataset (SQuAD)

[Rajpurkar et al, ‘16, ‘18]

SLIDE 94

SNLI

[Bowman et al, ‘15]

SLIDE 95

BERT

https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html

SLIDE 96

Cloze task objective
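The cloze idea in miniature: hide a token and ask the model to recover it from bidirectional context (a sketch of the data side only, not BERT's exact 80/10/10 masking recipe):

```python
import random

tokens = "the Broadway play premiered yesterday .".split()
i = random.randrange(len(tokens))
target, tokens[i] = tokens[i], "[MASK]"
print(" ".join(tokens), "-> predict:", target)
```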

SLIDE 97

SLIDE 98

https://rajpurkar.github.io/SQuAD-explorer/

SLIDE 99

Multilingual Embeddings

https://github.com/mfaruqui/crosslingual-cca
http://128.2.220.95/multilingual/

SLIDE 100

Motivation

[Figure: the same word embedded by model 1 and by model 2; can the two spaces be compared directly?]

▪ comparison of words trained with different models

SLIDE 101

Motivation

▪ translation induction
▪ improving monolingual embeddings through cross-lingual context

[Figure: aligning English and French embedding spaces]

SLIDE 102

Canonical Correlation Analysis (CCA)

SLIDE 103

Canonical Correlation Analysis (CCA)

[Figure: CCA projects dictionary-aligned subsets of the two languages' vocabularies into a shared space]

[Faruqui & Dyer, ‘14]
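A sketch of the CCA step with scikit-learn (random stand-in data; in the Faruqui & Dyer setup, X and Y would hold embeddings of dictionary-aligned translation pairs):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))   # e.g. English vectors of aligned pairs
Y = rng.normal(size=(100, 40))   # e.g. French vectors of the same pairs

cca = CCA(n_components=10)
X_c, Y_c = cca.fit_transform(X, Y)  # projections into the shared space
# Corresponding rows are projected so as to be maximally correlated.
print(np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])
```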

SLIDE 104

Extension: Multilingual Embeddings

[Ammar et al., ‘16]

[Figure: English, French, Spanish, Arabic, and Swedish embeddings mapped into one shared space; each language is projected via CCA projection matrices, e.g. O_french→english and O_french←english]
SLIDE 105

Polyglot Models

[Ammar et al., ‘16, Tsvetkov et al., ‘16]

SLIDE 106

Embeddings can help study word history!

SLIDE 107

Diachronic Embeddings

▪ count-based embeddings w/ PPMI
▪ projected to a common space

SLIDE 108

Project 300 dimensions down into 2

~30 million books, 1850-1990, Google Books data

SLIDE 109

Negative words change faster than positive words

SLIDE 110

Embeddings reflect ethnic stereotypes over time

SLIDE 111

Change in linguistic framing 1910-1990

Change in association of Chinese names with adjectives framed as "othering" (barbaric, monstrous, bizarre)

SLIDE 112

Analogy: Embeddings capture relational meaning!

[Mikolov et al.’ 13]

SLIDE 113

and also human biases

[Bolukbasi et al., ‘16]

SLIDE 114

Conclusion

▪ Concepts or word senses

▪ Have a complex many-to-many association with words (homonymy, multiple senses)
▪ Have relations with each other
  ▪ Synonymy, Antonymy, Superordinate
▪ But are hard to define formally (necessary & sufficient conditions)

▪ Embeddings = vector models of meaning

▪ More fine-grained than just a string or index
▪ Especially good at modeling similarity/analogy
  ▪ Just download them and use cosines!
▪ Useful in many NLP tasks
▪ But know they encode cultural stereotypes