Algorithms for NLP: Word Embeddings


SLIDE 1

Word Embeddings

Yulia Tsvetkov – CMU
Slides: Dan Jurafsky – Stanford, Mike Peters – AI2, Edouard Grave – FAIR

Algorithms for NLP

SLIDE 2

Brown Clustering

dog [0000], cat [0001], ant [001], river [010], lake [011], blue [10], red [11]

[Figure: binary merge tree over dog, cat, ant, river, lake, blue, red; each word's bit string records the 0/1 branch taken at each node from the root]

SLIDE 3

Brown Clustering

[Brown et al., 1992]

SLIDE 4

Brown Clustering

[Miller et al., 2004]

SLIDE 5

Brown Clustering

The model:

▪ V is the vocabulary
▪ C : V → {1, …, k} is a partition of the vocabulary into k clusters
▪ q(C(wi) | C(wi-1)) is the probability of the cluster of wi following the cluster of wi-1
▪ e(wi | C(wi)) is the probability of word wi given its cluster
▪ Quality(C) scores a clustering C by the likelihood the model assigns to the corpus

SLIDE 6

Quality(C)
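Spelled out (following Collins's lecture notes, credited below): Quality(C) is the normalized log-likelihood of the corpus under the class-based model, which works out to the mutual information between adjacent clusters plus a constant:

\[
\mathrm{Quality}(C) = \sum_{c=1}^{k} \sum_{c'=1}^{k} p(c, c') \log \frac{p(c, c')}{p(c)\, p(c')} \; + \; G
\]

where p(c, c') is the relative frequency with which cluster c' follows cluster c, and G is a constant that does not depend on C.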

Slide by Michael Collins

SLIDE 7

A Naive Algorithm

▪ We start with |V| clusters: each word gets its own cluster
▪ Our aim is to find k final clusters
▪ We run |V| − k merge steps:
  ▪ At each merge step we pick two clusters ci and cj, and merge them into a single cluster
  ▪ We greedily pick the merge that maximizes Quality(C) for the clustering C after the merge step
▪ Cost? Naive = O(|V|⁵). An improved algorithm gives O(|V|³): still too slow for realistic values of |V|

Slide by Michael Collins

SLIDE 8

Brown Clustering Algorithm

▪ Parameter of the approach is m (e.g., m = 1000)
▪ Take the top m most frequent words, put each into its own cluster, c1, c2, …, cm
▪ For i = (m + 1) … |V|:
  ▪ Create a new cluster, cm+1, for the i-th most frequent word. We now have m + 1 clusters
  ▪ Choose two clusters from c1 … cm+1 to be merged: pick the merge that gives a maximum value for Quality(C). We're now back to m clusters
▪ Carry out (m − 1) final merges, to create a full hierarchy
▪ Running time: O(|V|m² + n), where n is the corpus length (a runnable sketch of this loop follows below)

Slide by Michael Collins
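A minimal, runnable Python sketch of this loop (an illustration written for these notes, not the course's code). quality() recomputes the mutual-information part of Quality(C) from scratch for every candidate merge, so it is far slower than the O(|V|m² + n) algorithm, but it mirrors the slide's steps directly:

from collections import Counter
from itertools import combinations
from math import log

def quality(assign, bigrams):
    # Mutual-information term of Quality(C), computed over bigrams whose
    # words have already been assigned to clusters (constant term omitted).
    cc = Counter((assign[a], assign[b]) for a, b in bigrams
                 if a in assign and b in assign)
    n = sum(cc.values())
    left, right = Counter(), Counter()
    for (c1, c2), k in cc.items():
        left[c1] += k
        right[c2] += k
    return sum((k / n) * log(k * n / (left[c1] * right[c2]))
               for (c1, c2), k in cc.items())

def best_merge(assign, bigrams, history):
    # Greedily merge the pair of clusters that maximizes Quality(C).
    def after(ci, cj):
        return {w: (ci if c == cj else c) for w, c in assign.items()}
    ci, cj = max(combinations(sorted(set(assign.values())), 2),
                 key=lambda p: quality(after(*p), bigrams))
    history.append((ci, cj))
    return after(ci, cj)

def brown(corpus, m=3):
    bigrams = list(zip(corpus, corpus[1:]))
    by_freq = [w for w, _ in Counter(corpus).most_common()]
    assign = {w: i for i, w in enumerate(by_freq[:m])}  # top-m words
    history = []                          # merge order = the hierarchy
    for w in by_freq[m:]:
        assign[w] = max(assign.values()) + 1            # new cluster m+1
        assign = best_merge(assign, bigrams, history)   # back to m clusters
    while len(set(assign.values())) > 1:                # (m - 1) final merges
        assign = best_merge(assign, bigrams, history)
    return history

# e.g. brown("the cat sat on the mat and the dog sat on the rug".split(), m=3)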

SLIDE 9

Plan for Today

▪ Word2Vec

▪ Representation is created by training a classifier to distinguish nearby and far-away words

▪ FastText

▪ Extension of word2vec to include subword information

▪ ELMo

▪ Contextual token embeddings

▪ Multilingual embeddings
▪ Using embeddings to study history and culture

SLIDE 10

Word2Vec

▪ Popular embedding method
▪ Very fast to train
▪ Code available on the web
▪ Idea: predict rather than count

SLIDE 11

Word2Vec

[Mikolov et al., 2013]

SLIDE 12

Skip-gram Prediction

▪ Predict vs Count

the cat sat on the mat

SLIDE 13

▪ Predict vs Count

Skip-gram Prediction

the cat sat on the mat    (context size = 2)
wt = the → CLASSIFIER → wt-2 = <start-2>, wt-1 = <start-1>, wt+1 = cat, wt+2 = sat

SLIDE 14

Skip-gram Prediction

▪ Predict vs Count

the cat sat on the mat    (context size = 2)
wt = cat → CLASSIFIER → wt-2 = <start-1>, wt-1 = the, wt+1 = sat, wt+2 = on

SLIDE 15

the cat sat on the mat

▪ Predict vs Count

Skip-gram Prediction

context size = 2
wt = sat → CLASSIFIER → wt-2 = the, wt-1 = cat, wt+1 = on, wt+2 = the

SLIDE 16

▪ Predict vs Count

the cat sat on the mat

Skip-gram Prediction

context size = 2
wt = on → CLASSIFIER → wt-2 = cat, wt-1 = sat, wt+1 = the, wt+2 = mat

SLIDE 17

▪ Predict vs Count

the cat sat on the mat

Skip-gram Prediction

context size = 2
wt = the → CLASSIFIER → wt-2 = sat, wt-1 = on, wt+1 = mat, wt+2 = <end+1>

SLIDE 18

▪ Predict vs Count

the cat sat on the mat

Skip-gram Prediction

context size = 2
wt = mat → CLASSIFIER → wt-2 = on, wt-1 = the, wt+1 = <end+1>, wt+2 = <end+2>

SLIDE 19

▪ Predict vs Count

Skip-gram Prediction

wt = the → CLASSIFIER → wt-2 = <start-2>, wt-1 = <start-1>, wt+1 = cat, wt+2 = sat
wt = the → CLASSIFIER → wt-2 = sat, wt-1 = on, wt+1 = mat, wt+2 = <end+1>

SLIDE 20

Skip-gram Prediction

SLIDE 21

Skip-gram Prediction

▪ Training data

(wt, wt-2), (wt, wt-1), (wt, wt+1), (wt, wt+2), …
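A small sketch (illustrative, not from the slides) of generating these training pairs; out-of-range positions are simply skipped here rather than padded with <start>/<end> tokens as on the earlier slides:

def skipgram_pairs(tokens, window=2):
    # Emit one (target, context) pair per context word in the window.
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

print(skipgram_pairs("the cat sat on the mat".split())[:4])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]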

SLIDE 22

Skip-gram Prediction

SLIDE 23

▪ For each word in the corpus t = 1 … T:
maximize the probability of the words in the surrounding context window, given the current center word
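Written out in the standard skip-gram formulation (Mikolov et al., 2013), with context size c, the objective is to maximize the average log probability

\[
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \neq 0} \log p(w_{t+j} \mid w_t)
\]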

SLIDE 24

Skip-gram Prediction

▪ Softmax
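The softmax referred to here, in the usual two-matrix parameterization (v for center words, u for context words), gives the probability of a context word o given the center word c:

\[
p(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
\]

Normalizing over the whole vocabulary V is expensive, which motivates the negative sampling on the next slide.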

SLIDE 25

SGNS

▪ Negative Sampling

▪ Treat the target word and a neighboring context word as positive examples
▪ Subsample very frequent words
▪ Randomly sample other words in the lexicon to get negative samples
▪ 2 negative samples for each positive pair

Given a tuple (t, c) = (target, context):
▪ (cat, sat) — positive
▪ (cat, aardvark) — negative

SLIDE 26

Learning the classifier

▪ Iterative process
▪ We start with zero or random weights
▪ Then adjust the word weights to
  ▪ make the positive pairs more likely
  ▪ and the negative pairs less likely
  over the entire training set
▪ Train using gradient descent (on the loss below)
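For one positive pair (t, c) with k sampled noise words n1 … nk, the loss being minimized is the standard SGNS objective (as in Jurafsky & Martin's presentation):

\[
L = -\left[ \log \sigma(c \cdot t) + \sum_{i=1}^{k} \log \sigma(-n_i \cdot t) \right]
\]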

SLIDE 27

How to compute p(+|t,c)?

SLIDE 28

SGNS

Given a tuple (t, c) = (target, context):
▪ (cat, sat)
▪ (cat, aardvark)
Return the probability that c is a real context word:
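The returned probability is the sigmoid of the dot product of the two word vectors (the formulation in Jurafsky & Martin, whose treatment this section follows):

\[
P(+ \mid t, c) = \sigma(t \cdot c) = \frac{1}{1 + e^{-t \cdot c}}, \qquad
P(- \mid t, c) = 1 - P(+ \mid t, c)
\]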

SLIDE 29

Choosing noise words

We could pick noise words w according to their unigram frequency P(w). It is more common to choose them according to a weighted distribution:

pα(w) = count(w)^α / Σw′ count(w′)^α

α = ¾ works well because it gives rare noise words slightly higher probability. To see this, imagine two events with p(a) = .99 and p(b) = .01: then pα(a) = .99^.75 / (.99^.75 + .01^.75) ≈ .97 and pα(b) ≈ .03, so the rare event's probability roughly triples.
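A tiny sketch reproducing the slide's two-event example; the dict values stand in for unigram counts (probabilities work identically, since the normalization is the same):

def noise_dist(counts, alpha=0.75):
    # Unigram distribution raised to alpha, then renormalized.
    z = sum(c ** alpha for c in counts.values())
    return {w: (c ** alpha) / z for w, c in counts.items()}

print(noise_dist({"a": 0.99, "b": 0.01}))
# ≈ {'a': 0.97, 'b': 0.03} — the rare word gets a boost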

SLIDE 30

Skip-gram Prediction

SLIDE 31

FastText

https://fasttext.cc/

SLIDE 32

FastText: Motivation

SLIDE 33

Subword Representation

skiing = {^skiing$, ^ski, skii, kiin, iing, ing$}

SLIDE 34

FastText

SLIDE 35

Details

▪ n-grams between 3 and 6 characters
▪ How many possible n-grams? |character set|ⁿ
▪ Hashing maps n-grams to integers in 1 to K = 2M (2 million buckets)
▪ Word vectors for out-of-vocabulary words are built from their subword vectors
▪ Less than 2× slower than word2vec skip-gram
▪ Short n-grams (n = 4) are good at capturing syntactic information
▪ Longer n-grams (n = 6) are good at capturing semantic information

(a sketch of the n-gram extraction and hashing follows)
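A sketch of the subword scheme just described. This is illustrative: fastText itself uses the FNV-1a hash, and this sketch uses < and > as word-boundary markers where the earlier slide used ^ and $:

def char_ngrams(word, nmin=3, nmax=6):
    # Wrap the word in boundary markers, collect all n-grams,
    # and keep the whole word itself as a special feature.
    w = f"<{word}>"
    grams = {w}
    for n in range(nmin, nmax + 1):
        grams.update(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

def bucket(gram, K=2_000_000):
    # Stand-in for fastText's FNV-1a hashing into K buckets.
    return hash(gram) % K

print(sorted(char_ngrams("skiing", nmin=4, nmax=4)))
# ['<ski', '<skiing>', 'iing', 'ing>', 'kiin', 'skii']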

SLIDE 36

FastText Evaluation

▪ Intrinsic evaluation: word-similarity datasets in Arabic, German, Spanish, French, Romanian, Russian

word1    word2       similarity (humans)   similarity (embeddings)
vanish   disappear   9.8                   1.1
behave   obey        7.3                   0.5
belief   impression  5.95                  0.3
muscle   bone        3.65                  1.7
modest   flexible    0.98                  0.98
hole     agreement   0.3                   0.3

Metric: Spearman's rho between human ranks and model ranks
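Computing the metric above with SciPy, using the six pairs from the table:

from scipy.stats import spearmanr

human = [9.8, 7.3, 5.95, 3.65, 0.98, 0.3]
model = [1.1, 0.5, 0.3, 1.7, 0.98, 0.3]
rho, pval = spearmanr(human, model)  # rank correlation, not raw values
print(f"Spearman's rho = {rho:.2f}")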

SLIDE 37

FastText Evaluation

[Grave et al., 2017]

SLIDE 38

FastText Evaluation

SLIDE 39

FastText Evaluation

SLIDE 40

ELMo

https://allennlp.org/elmo

SLIDE 41

Motivation

p(play | Elmo and Cookie Monster play a game .)

p(play | The Broadway play premiered yesterday .)

SLIDE 42

Background

SLIDE 43

The Broadway play premiered yesterday .

[Figure: a forward LSTM language model over the sentence; which vector should represent "play"?]

SLIDE 44

The Broadway play premiered yesterday .

[Figure: a stacked (multi-layer) forward LSTM language model over the sentence; which layer's vector should represent "play"?]

SLIDE 45

The Broadway play premiered yesterday .

[Figure: forward and backward LSTM language models over the sentence; each direction yields its own representation of "play"]

SLIDE 46

The Broadway play premiered yesterday .

[Figure: a bidirectional LM (biLM) over the sentence]

ELMo (Embeddings from Language Models) = ??

SLIDE 47

The Broadway play premiered yesterday .

[Figure: a biLM over the sentence]

ELMo (Embeddings from Language Models) = …

SLIDE 48

The Broadway play premiered yesterday .

[Figure: a biLM over the sentence]

ELMo (Embeddings from Language Models) = (token embedding) + (layer-1 output) + (layer-2 output)

SLIDE 49

The Broadway play premiered yesterday .

[Figure: a biLM over the sentence]

ELMo (Embeddings from Language Models) = λ0 (token embedding) + λ1 (layer-1 output) + λ2 (layer-2 output)
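In the notation of the ELMo paper (Peters et al., 2018), for token k and an L-layer biLM this weighted combination is

\[
\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \, h_{k,j}^{LM}
\]

where h_{k,0} is the token embedding, h_{k,j} (j ≥ 1) are the biLM layer outputs, the s_j are softmax-normalized task-specific weights (the λ's on this slide), and γ is a task-specific scalar.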

SLIDE 50

Evaluation: Extrinsic Tasks

SLIDE 51

Stanford Question Answering Dataset (SQuAD)

[Rajpurkar et al., 2016, 2018]

SLIDE 52

SNLI

[Bowman et al., 2015]

SLIDE 53

Multilingual Embeddings

https://github.com/mfaruqui/crosslingual-cca
http://128.2.220.95/multilingual/

SLIDE 54

Motivation

[Figure: two independently trained embedding models, "model 1" and "model 2"; can their vectors be compared?]

SLIDE 55

Motivation

[Figure: separate English and French embedding spaces; how can they be mapped into a shared space?]

SLIDE 56

Canonical Correlation Analysis (CCA)

Canonical Correlation Analysis (Hotelling, 1936) projects two sets of vectors (of equal cardinality) into a space where they are maximally correlated.

[Figure: two vector sets Ω and Σ before and after the CCA projection]

SLIDE 57

Canonical Correlation Analysis (CCA)

Ω ⊆ X (an n1 × d1 matrix), Σ ⊆ Y (an n2 × d2 matrix)

W, V = CCA(Ω, Σ), with k = min(r(Ω), r(Σ))

X′ = X · W (n1 × k), Y′ = Y · V (n2 × k)

X′ and Y′ are now maximally correlated.

[Faruqui & Dyer, 2014]
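A sketch using scikit-learn's CCA as a stand-in for the authors' implementation; the random matrices below are placeholders for the two languages' vectors of word-translation pairs:

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))   # e.g., English vectors of translation pairs
Y = rng.normal(size=(500, 80))    # e.g., French vectors of the same pairs

cca = CCA(n_components=50)
X_c, Y_c = cca.fit_transform(X, Y)   # maximally correlated projections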

SLIDE 58

Extension: Multilingual Embeddings

[Ammar et al., ‘16]

[Figure: English, French, Spanish, Arabic, and Swedish vector spaces connected to a shared (English) space through projection matrices such as O_french→english and its inverse O_french←english, learned from French-English translation pairs]
SLIDE 59

Embeddings can help study word history!

SLIDE 60

Diachronic Embeddings

[Figure: timeline 1900-1950-2000; word vectors trained on 1920 text vs. word vectors trained on 1990 text, e.g., a "dog" 1920 word vector and a "dog" 1990 word vector]

▪ count-based embeddings w/ PPMI
▪ projected to a common space (see the alignment sketch below)
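One standard way to project two embedding spaces into a common space over a shared vocabulary is orthogonal Procrustes alignment, used in later diachronic work; this sketch is an assumption about the projection step, not necessarily the exact pipeline behind these slides:

import numpy as np
from scipy.linalg import orthogonal_procrustes

def align(source, target):
    # Rotate `source` (n x d) onto `target` (n x d); rows must correspond
    # to the same words in both decades' embedding matrices.
    R, _ = orthogonal_procrustes(source, target)
    return source @ R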

SLIDE 61

Project 300 dimensions down into 2

~30 million books, 1850-1990, Google Books data

SLIDE 62

Negative words change faster than positive words

SLIDE 63

Embeddings reflect ethnic stereotypes over time

SLIDE 64

Change in linguistic framing 1910-1990

Change in association of Chinese names with adjectives framed as "othering" (barbaric, monstrous, bizarre)

SLIDE 65

Conclusion

▪ Concepts or word senses

▪ Have a complex many-to-many association with words (homonymy, multiple senses)
▪ Have relations with each other:
  ▪ synonymy, antonymy, superordinate terms
▪ But are hard to define formally (necessary & sufficient conditions)

▪ Embeddings = vector models of meaning

▪ More fine-grained than just a string or index
▪ Especially good at modeling similarity/analogy
  ▪ Just download them and use cosines!!
▪ Useful in many NLP tasks
▪ But know that they encode cultural stereotypes