Neural LMs
SLIDE 1

Neural LMs

SLIDE 2

Neural LMs

Image: (Bengio et al, 03)

SLIDE 3

“One Hot” Vectors

SLIDE 4

Neural LMs

(Bengio et al, 03)

SLIDE 5

Low-dimensional Representations

▪ Learning representations by back-propagating errors

▪ Rumelhart, Hinton & Williams, 1986

▪ A neural probabilistic language model

▪ Bengio et al., 2003

▪ Natural Language Processing (almost) from scratch

▪ Collobert & Weston, 2008

▪ Word representations: A simple and general method for semi-supervised learning

▪ Turian et al., 2010

▪ Distributed Representations of Words and Phrases and their Compositionality

▪ Word2Vec; Mikolov et al., 2013

SLIDE 6

Word Vectors

Distributed representations

SLIDE 7

What are various ways to represent the meaning of a word?

SLIDE 8

Lexical Semantics

▪ How should we represent the meaning of a word?

▪ Dictionary definition
▪ Lemma and wordforms
▪ Senses
▪ Relationships between words or senses
▪ Taxonomic relationships
▪ Word similarity, word relatedness
▪ Semantic frames and roles
▪ Connotation and sentiment

SLIDE 9

WordNet

https://wordnet.princeton.edu/

Electronic Dictionaries

SLIDE 10

Electronic Dictionaries

WordNet in NLTK: www.nltk.org
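For instance, WordNet can be queried through NLTK. A minimal sketch (assumes `nltk` is installed and the data has been fetched once via `nltk.download('wordnet')`):

```python
# Look up senses and taxonomic relations in WordNet via NLTK.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("fool"):
    print(synset.name(), "-", synset.definition())

# Taxonomic relationships: hypernyms of the first sense.
print(wn.synsets("fool")[0].hypernyms())
```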

SLIDE 11

Problems with Discrete Representations

▪ Too coarse

▪ expert ↔ skillful

▪ Sparse

▪ wicked, badass, ninja

▪ Subjective
▪ Expensive
▪ Hard to compute word relationships

expert   [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
skillful [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]

dimensionality: PTB: 50K, Google1T: 13M
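A quick sketch of why these vectors are too coarse: with one-hot representations, every pair of distinct words is equally dissimilar (toy 15-dimensional vocabulary; the indices are arbitrary):

```python
import numpy as np

# Toy vocabulary; real vocabularies run 50K (PTB) to 13M (Google1T) types.
vocab = {"the": 0, "expert": 3, "skillful": 10}

def one_hot(word, dim=15):
    v = np.zeros(dim)
    v[vocab[word]] = 1.0
    return v

# The dot product (and cosine) between any two distinct one-hot vectors
# is 0, so "expert" is exactly as (dis)similar to "skillful" as to "the".
print(one_hot("expert") @ one_hot("skillful"))  # 0.0
```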

SLIDE 12

Distributional Hypothesis

“The meaning of a word is its use in the language”

[Wittgenstein PI 43]

“You shall know a word by the company it keeps”

[Firth 1957]

If A and B have almost identical environments we say that they are synonyms.

[Harris 1954]

SLIDE 13

Example

What does ongchoi mean?

SLIDE 14

▪ Suppose you see these sentences:

▪ Ongchoi is delicious sautéed with garlic.
▪ Ongchoi is superb over rice
▪ Ongchoi leaves with salty sauces

▪ And you've also seen these:

▪ …spinach sautéed with garlic over rice
▪ Chard stems and leaves are delicious
▪ Collard greens and other salty leafy greens

Example

What does ongchoi mean?

SLIDE 15

Ongchoi: Ipomoea aquatica "Water Spinach"

Ongchoi is a leafy green like spinach, chard, or collard greens

Yamaguchi, Wikimedia Commons, public domain

SLIDE 16

Model of Meaning Focusing on Similarity

▪ Each word = a vector

▪ not just “word” or word45
▪ similar words are “nearby in space”
▪ the standard way to represent meaning in NLP

SLIDE 17

We'll Introduce 4 Kinds of Embeddings

▪ Count-based

▪ Words are represented by a simple function of the counts of nearby words

▪ Class-based

▪ Representation is created through hierarchical clustering, Brown clusters

▪ Distributed prediction-based (type) embeddings

▪ Representation is created by training a classifier to distinguish nearby and far-away words: word2vec, fasttext

▪ Distributed contextual (token) embeddings from language models

▪ ELMo, BERT

SLIDE 18

Term-Document Matrix

Context = appearing in the same document.

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle          1                0               7            13
soldier         2               80              62            89
fool           36               58               1             4
clown          20               15               2             3

SLIDE 19

Term-Document Matrix

Each document is represented by a vector of words

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle          1                0               7            13
soldier         2               80              62            89
fool           36               58               1             4
clown          20               15               2             3

SLIDE 20

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle          1                0               7            13
soldier         2               80              62            89
fool           36               58               1             4
clown          20               15               2             3

Vectors are the Basis of Information Retrieval

▪ Vectors are similar for the two comedies
▪ but different from the history
▪ Comedies have more fools and wit and fewer battles

SLIDE 21

Visualizing Document Vectors

SLIDE 22

Words Can Be Vectors Too

▪ battle is "the kind of word that occurs in Julius Caesar and Henry V"
▪ fool is "the kind of word that occurs in comedies, especially Twelfth Night"

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle          1                0               7            13
good          114               80              62            89
fool           36               58               1             4
clown          20               15               2             3

SLIDE 23

Term-Context Matrix

▪ Two words are “similar” in meaning if their context vectors are similar

▪ Similarity == relatedness

        knife   dog   sword   love   like
knife     -      1      6       5      5
dog       1      -      5       5      5
sword     6      5      -       5      5
love      5      5      5       -      5
like      5      5      5       5      -

SLIDE 24

Count-Based Representations

▪ Counts: term-frequency

▪ remove stop words
▪ use log10(tf)
▪ normalize by document length

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle          1                0               7            13
good          114               80              62            89
fool           36               58               1             4
wit            20               15               2             3

SLIDE 25

TF-IDF

▪ What to do with words that are evenly distributed across many documents?

idf_i = log10(N / df_i)

where N is the total # of docs in the collection and df_i is the # of docs that contain word i; the tf-idf weight is then tf_i,d × idf_i.

Words like "the" or "good" occur in nearly every document, so they have very low idf.
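A sketch of the weighting scheme on the toy term-document counts used in these slides (log10-dampened term frequency, as on the previous slide):

```python
import numpy as np

# Rows = terms (battle, good, fool, wit), columns = the four plays.
counts = np.array([[1, 0, 7, 13],
                   [114, 80, 62, 89],
                   [36, 58, 1, 4],
                   [20, 15, 2, 3]])

tf = np.log10(counts + 1)          # dampened term frequency
df = (counts > 0).sum(axis=1)      # number of docs containing each term
idf = np.log10(counts.shape[1] / df)
tfidf = tf * idf[:, None]
print(tfidf.round(2))  # the "good" row becomes all zeros: idf = log10(4/4) = 0
```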

SLIDE 26

Positive Pointwise Mutual Information (PPMI)

▪ In a word-context matrix: do words w and c co-occur more than if they were independent?

PMI(w, c) = log2 [ P(w, c) / (P(w) P(c)) ],   PPMI(w, c) = max(PMI(w, c), 0)

▪ PMI is biased toward infrequent events
  ▪ very rare words have very high PMI values
  ▪ fix: give rare words slightly higher probabilities, P_α(c) = count(c)^α / Σ_c′ count(c′)^α, with α = 0.75

(Church and Hanks, 1990) (Turney and Pantel, 2010)
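A minimal PPMI computation over a toy word-context count matrix (values illustrative):

```python
import numpy as np

# Rows = target words, columns = context words (toy counts).
counts = np.array([[0., 1., 6., 5., 5.],
                   [1., 0., 5., 5., 5.],
                   [6., 5., 0., 5., 5.]])

p_wc = counts / counts.sum()
p_w = p_wc.sum(axis=1, keepdims=True)
p_c = p_wc.sum(axis=0, keepdims=True)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)
ppmi[np.isnan(ppmi)] = 0.0  # zero-count cells
print(ppmi.round(2))
```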

SLIDE 27

(Pecina’09)

SLIDE 28

Dimensionality Reduction

▪ Wikipedia: ~29 million English documents. Vocab: ~1M words.

▪ High dimensionality of the word-document matrix
▪ Sparsity
▪ The order of rows and columns doesn’t matter

▪ Goal:

▪ good similarity measure for words or documents
▪ dense representation

▪ Sparse vs Dense vectors

▪ Short vectors may be easier to use as features in machine learning (fewer weights to tune)
▪ Dense vectors may generalize better than storing explicit counts
▪ They may do better at capturing synonymy
▪ In practice, they work better


SLIDE 29

▪ Solution idea:
  ▪ Find a projection into a low-dimensional space (~300 dim)
  ▪ that gives the best separation between features

Singular Value Decomposition (SVD)

X = U Σ Vᵀ

U, V: orthonormal; Σ: diagonal, with singular values sorted in decreasing order

SLIDE 30

Truncated SVD

We can approximate the full matrix by keeping only the leftmost k terms of the diagonal matrix (the k largest singular values):

X ≈ U_k Σ_k V_kᵀ

The singular values fall off quickly (e.g., 9, 4, .1, .0, .0, …), so little is lost. The rows of U_k serve as dense word vectors; the columns of V_kᵀ serve as dense document vectors.
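A sketch of truncated SVD with numpy on the toy term-document matrix, keeping k = 2 singular values:

```python
import numpy as np

X = np.array([[1., 0., 7., 13.],
              [114., 80., 62., 89.],
              [36., 58., 1., 4.],
              [20., 15., 2., 3.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # s is sorted, descending
k = 2
word_vecs = U[:, :k] * s[:k]          # dense word vectors (rows)
doc_vecs = s[:k, None] * Vt[:k]       # dense document vectors (columns)
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
print(np.abs(X - X_approx).max())     # small reconstruction error
```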

SLIDE 31

Latent Semantic Analysis

[Deerwester et al., 1990]

[Figure: top words along the first LSA dimensions (#0-#5) learned from news text; recognizable themes include music/film/theater/dance/orchestra/ballet, company/stock/shares/sales/chief executive, russian/space/center/aircraft, and program/project/research/development]

SLIDE 32

LSA++

▪ Probabilistic Latent Semantic Indexing (pLSI)

▪ Hofmann, 1999

▪ Latent Dirichlet Allocation (LDA)

▪ Blei et al., 2003

▪ Nonnegative Matrix Factorization (NMF)

▪ Lee & Seung, 1999

SLIDE 33

Word Similarity

SLIDE 34

Evaluation

▪ Intrinsic
▪ Extrinsic
▪ Qualitative

SLIDE 35

Extrinsic Evaluation

▪ Chunking
▪ POS tagging
▪ Parsing
▪ MT
▪ SRL
▪ Topic categorization
▪ Sentiment analysis
▪ Metaphor detection
▪ etc.

SLIDE 36

Intrinsic Evaluation

▪ WS-353 (Finkelstein et al. ‘02)
▪ MEN-3k (Bruni et al. ‘12)
▪ SimLex-999 dataset (Hill et al., 2015)

word1     word2        similarity (humans)   similarity (embeddings)
vanish    disappear           9.8                     1.1
behave    obey                7.3                     0.5
belief    impression          5.95                    0.3
muscle    bone                3.65                    1.7
modest    flexible            0.98                    0.98
hole      agreement           0.3                     0.3

Spearman's rho (human ranks, model ranks)
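Computing the metric is a one-liner with scipy, using the six pairs above:

```python
from scipy.stats import spearmanr

human = [9.8, 7.3, 5.95, 3.65, 0.98, 0.3]
model = [1.1, 0.5, 0.3, 1.7, 0.98, 0.3]
rho, pval = spearmanr(human, model)   # compares ranks, not raw values
print(round(rho, 3))
```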

SLIDE 37

Visualisation

▪ Visualizing Data using t-SNE (van der Maaten & Hinton’08)

[Faruqui et al., 2015]

SLIDE 38

What we’ve seen by now

▪ Meaning representation
▪ Distributional hypothesis
▪ Count-based vectors

▪ term-document matrix
▪ word-in-context matrix
▪ normalizing counts: tf-idf, PPMI
▪ dimensionality reduction
▪ measuring similarity
▪ evaluation

Next: ▪ Brown clusters

▪ Representation is created through hierarchical clustering

SLIDE 39

Word embedding representations

▪ Count-based

▪ tf-idf, PPMI

▪ Class-based

▪ Brown clusters

▪ Distributed prediction-based (type) embeddings

▪ Word2Vec, Fasttext

▪ Distributed contextual (token) embeddings from language models

▪ ELMo, BERT

▪ + many more variants

▪ Multilingual embeddings
▪ Multisense embeddings
▪ Syntactic embeddings
▪ etc.
SLIDE 40

The intuition of Brown clustering

▪ Similar words appear in similar contexts
▪ More precisely: similar words have similar distributions of words to their immediate left and right

[Figure: Monday, Tuesday, and Wednesday all occur in the same immediate contexts, e.g. "on ___" and "last ___"]

SLIDE 41

Brown Clustering

dog [0000]
cat [0001]
ant [001]
river [010]
lake [011]
blue [10]
red [11]

[Figure: the binary merge tree over {dog, cat, ant, river, lake, blue, red}; each word's bit string is its path from the root]
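Prefixes of the bit strings give clusterings at different granularities, which is how Brown clusters are typically used as features (cf. Miller et al., 2004). A sketch with the toy clusters above (feature names hypothetical):

```python
clusters = {"dog": "0000", "cat": "0001", "ant": "001",
            "river": "010", "lake": "011", "blue": "10", "red": "11"}

def prefix_features(word, lengths=(2, 4)):
    path = clusters[word]
    return {f"brown_prefix_{n}": path[:n] for n in lengths}

print(prefix_features("dog"))  # {'brown_prefix_2': '00', 'brown_prefix_4': '0000'}
print(prefix_features("cat"))  # shares the '00' prefix with 'dog'
```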

SLIDE 42

Brown Clustering

[Brown et al, 1992]

SLIDE 43

Brown Clustering

[Miller et al., 2004]

SLIDE 44

Brown Clustering

The model:

▪ V is a vocabulary
▪ C : V → {1, …, k} is a partition of the vocabulary into k clusters
▪ q(C(wi) | C(wi-1)) is the probability of the cluster of wi following the cluster of wi-1

p(w1, …, wn) = ∏i e(wi | C(wi)) q(C(wi) | C(wi-1))

SLIDE 45

Brown Clustering

The model:

▪ V is a vocabulary
▪ C : V → {1, …, k} is a partition of the vocabulary into k clusters
▪ q(C(wi) | C(wi-1)) is the probability of the cluster of wi following the cluster of wi-1

Quality(C) = (1/n) Σi log e(wi | C(wi)) q(C(wi) | C(wi-1))

SLIDE 46

Quality(C) = Σc Σc′ p(c, c′) log [ p(c, c′) / (p(c) p(c′)) ] + G

i.e., the mutual information between adjacent clusters, plus a constant G that does not depend on C.

Slide by Michael Collins
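A sketch of evaluating Quality(C) for a fixed clustering as the mutual information between adjacent clusters (toy corpus and clustering; the constant term G is omitted):

```python
import math
from collections import Counter

corpus = "the dog ran the cat ran the dog walked".split()
C = {"the": 0, "dog": 1, "cat": 1, "ran": 2, "walked": 2}

bigrams = Counter((C[a], C[b]) for a, b in zip(corpus, corpus[1:]))
n = sum(bigrams.values())
p_joint = {cc: k / n for cc, k in bigrams.items()}
p_left, p_right = Counter(), Counter()
for (c1, c2), p in p_joint.items():
    p_left[c1] += p
    p_right[c2] += p

quality = sum(p * math.log2(p / (p_left[c1] * p_right[c2]))
              for (c1, c2), p in p_joint.items())
print(round(quality, 3))
```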

SLIDE 47

A Naive Algorithm

▪ We start with |V| clusters: each word gets its own cluster
▪ Our aim is to find k final clusters
▪ We run |V| − k merge steps:
  ▪ At each merge step we pick two clusters ci and cj and merge them into a single cluster
  ▪ We greedily pick the merge that maximizes Quality(C) of the resulting clustering at each stage
▪ Cost? Naive = O(|V|^5). An improved algorithm gives O(|V|^3): still too slow for realistic values of |V|

Slide by Michael Collins

SLIDE 48

Brown Clustering Algorithm

▪ Parameter of the approach is m (e.g., m = 1000)
▪ Take the top m most frequent words, put each into its own cluster, c1, c2, …, cm
▪ For i = (m + 1) … |V|:
  ▪ Create a new cluster, cm+1, for the i-th most frequent word. We now have m + 1 clusters
  ▪ Choose two clusters from c1 … cm+1 to be merged: pick the merge that gives a maximum value for Quality(C). We’re now back to m clusters
▪ Carry out (m − 1) final merges, to create a full hierarchy
▪ Running time: O(|V|m^2 + n), where n is the corpus length

Slide by Michael Collins

SLIDE 49

Word embedding representations

▪ Count-based

▪ tf-idf, PPMI

▪ Class-based

▪ Brown clusters

▪ Distributed prediction-based (type) embeddings

▪ Word2Vec, Fasttext

▪ Distributed contextual (token) embeddings from language models

▪ ELMo, BERT

▪ + many more variants

▪ Multilingual embeddings
▪ Multisense embeddings
▪ Syntactic embeddings
▪ etc.
SLIDE 50

Word2Vec

▪ Popular embedding method
▪ Very fast to train
▪ Code available on the web
▪ Idea: predict rather than count

SLIDE 51

Word2Vec

[Mikolov et al.’ 13]

SLIDE 52

Skip-gram Prediction

▪ Predict vs Count

the cat sat on the mat

SLIDE 53

▪ Predict vs Count

Skip-gram Prediction

the cat sat on the mat

context size = 2
wt = the → CLASSIFIER → wt-2 = <start-2>, wt-1 = <start-1>, wt+1 = cat, wt+2 = sat
SLIDE 54

Skip-gram Prediction

▪ Predict vs Count

the cat sat on the mat

context size = 2
wt = cat → CLASSIFIER → wt-2 = <start-1>, wt-1 = the, wt+1 = sat, wt+2 = on
SLIDE 55

the cat sat on the mat

▪ Predict vs Count

Skip-gram Prediction

context size = 2
wt = sat → CLASSIFIER → wt-2 = the, wt-1 = cat, wt+1 = on, wt+2 = the
SLIDE 56

▪ Predict vs Count

the cat sat on the mat

Skip-gram Prediction

context size = 2
wt = on → CLASSIFIER → wt-2 = cat, wt-1 = sat, wt+1 = the, wt+2 = mat
SLIDE 57

▪ Predict vs Count

the cat sat on the mat

Skip-gram Prediction

context size = 2
wt = the → CLASSIFIER → wt-2 = sat, wt-1 = on, wt+1 = mat, wt+2 = <end+1>
SLIDE 58

▪ Predict vs Count

the cat sat on the mat

Skip-gram Prediction

context size = 2
wt = mat → CLASSIFIER → wt-2 = on, wt-1 = the, wt+1 = <end+1>, wt+2 = <end+2>
SLIDE 59

▪ Predict vs Count

Skip-gram Prediction

wt = the → CLASSIFIER → wt-2 = <start-2>, wt-1 = <start-1>, wt+1 = cat, wt+2 = sat
wt = the → CLASSIFIER → wt-2 = sat, wt-1 = on, wt+1 = mat, wt+2 = <end+1>
SLIDE 60

Skip-gram Prediction

SLIDE 61

Skip-gram Prediction

▪ Training data

(wt, wt-2), (wt, wt-1), (wt, wt+1), (wt, wt+2), …
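A sketch of extracting these pairs (window = 2; for simplicity the window is truncated at the sentence boundary instead of padded with <start>/<end> tokens):

```python
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the cat sat on the mat".split()))
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat'), ...]
```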

SLIDE 62

Skip-gram Prediction

SLIDE 63

Objective

▪ For each word in the corpus, t = 1 … T: maximize the probability of each context word within the window, given the current center word:

J(θ) = (1/T) Σ_{t=1…T} Σ_{-m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t)

SLIDE 64

Skip-gram Prediction

▪ Softmax:

p(o | c) = exp(u_oᵀ v_c) / Σ_{w ∈ V} exp(u_wᵀ v_c)

where v_c is the vector of the center word c and u_o is the context vector of word o

SLIDE 65

SGNS

▪ Negative Sampling:
▪ Treat the target word and a neighboring context word as positive examples
▪ Subsample very frequent words
▪ Randomly sample other words in the lexicon to get negative samples
  ▪ 2 negative samples per positive example

Given a tuple (t, c) = target, context:
▪ (cat, sat) (positive)
▪ (cat, aardvark) (negative)

SLIDE 66

Choosing noise words

Could pick w according to its unigram frequency P(w). It is more common to choose noise words according to the smoothed distribution P_α(w) = count(w)^α / Σ_w′ count(w′)^α. α = ¾ works well because it gives rare noise words slightly higher probability. To see this, imagine two events with P(a) = .99 and P(b) = .01; then P_α(a) = .99^.75 / (.99^.75 + .01^.75) ≈ .97 and P_α(b) = .01^.75 / (.99^.75 + .01^.75) ≈ .03.
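The same reweighting in code (counts chosen to match the example):

```python
counts = {"a": 99, "b": 1}
alpha = 0.75

weights = {w: c ** alpha for w, c in counts.items()}
total = sum(weights.values())
p_alpha = {w: round(v / total, 3) for w, v in weights.items()}
print(p_alpha)  # {'a': 0.969, 'b': 0.031}: rare words get boosted
```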

SLIDE 67

How to compute p(+|t,c)?

SLIDE 68

SGNS

Given a tuple (t, c) = target, context
▪ (cat, sat)
▪ (cat, aardvark)

Return the probability that c is a real context word:

p(+ | t, c) = σ(t · c) = 1 / (1 + e^(-t·c)),   p(- | t, c) = 1 - p(+ | t, c)

SLIDE 69

Learning the classifier

▪ Iterative process
▪ We’ll start with 0 or random weights
▪ Then adjust the word weights to
  ▪ make the positive pairs more likely
  ▪ and the negative pairs less likely
▪ over the entire training set
▪ Train using gradient descent (a sketch follows below)
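A minimal SGNS training loop in numpy under the assumptions above (toy corpus; hyperparameters are hypothetical, and negatives are drawn uniformly rather than from P_α, so in a 5-word vocabulary they occasionally collide with true contexts):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}

dim, window, k, lr = 10, 2, 2, 0.05
T = rng.normal(0, 0.1, (len(vocab), dim))  # target (center) vectors
C = rng.normal(0, 0.1, (len(vocab), dim))  # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(200):
    for i, center in enumerate(tokens):
        t = idx[center]
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            pairs = [(idx[tokens[j]], 1.0)]                        # positive
            pairs += [(n, 0.0) for n in rng.integers(0, len(vocab), k)]
            for c, label in pairs:
                g = sigmoid(T[t] @ C[c]) - label   # gradient of the log loss
                grad_t = g * C[c]
                C[c] -= lr * g * T[t]
                T[t] -= lr * grad_t

print(sigmoid(T[idx["cat"]] @ C[idx["sat"]]))  # co-occurring pair: tends high
print(sigmoid(T[idx["cat"]] @ C[idx["mat"]]))  # never co-occur: tends low
```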

slide-70
SLIDE 70

Skip-gram Prediction

SLIDE 71

FastText

https://fasttext.cc/

SLIDE 72

FastText: Motivation

SLIDE 73

Subword Representation

skiing = {^skiing$, ^ski, skii, kiin, iing, ing$}

SLIDE 74

FastText

SLIDE 75

Details

▪ How many possible n-grams? |character set|^n
  ▪ Hashing maps n-grams to integers in 1 to K = 2M
▪ Get word vectors for out-of-vocabulary words using subwords
▪ Less than 2× slower than word2vec skip-gram
▪ n-grams between 3 and 6 characters
▪ Short n-grams (n = 4) are good for capturing syntactic information
▪ Longer n-grams (n = 6) are good for capturing semantic information
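A sketch of the subword machinery: boundary-marked character n-grams plus hashing into K buckets (the FNV-1a hash here is illustrative, standing in for fastText's own hashing). An out-of-vocabulary word's vector is then the sum of its n-gram bucket vectors:

```python
def char_ngrams(word, nmin=3, nmax=6):
    w = f"^{word}$"
    grams = {w}  # the whole word is kept as its own feature
    for n in range(nmin, nmax + 1):
        grams.update(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

def bucket(gram, K=2_000_000):
    h = 2166136261                      # FNV-1a over the UTF-8 bytes
    for ch in gram.encode("utf-8"):
        h = ((h ^ ch) * 16777619) & 0xFFFFFFFF
    return h % K

print(sorted(char_ngrams("skiing", 4, 4)))
# ['^ski', '^skiing$', 'iing', 'ing$', 'kiin', 'skii']
print(bucket("^ski"))
```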

SLIDE 76

FastText Evaluation

▪ Intrinsic evaluation
▪ Arabic, German, Spanish, French, Romanian, Russian

word1     word2        similarity (humans)   similarity (embeddings)
vanish    disappear           9.8                     1.1
behave    obey                7.3                     0.5
belief    impression          5.95                    0.3
muscle    bone                3.65                    1.7
modest    flexible            0.98                    0.98
hole      agreement           0.3                     0.3

Spearman's rho (human ranks, model ranks)

SLIDE 77

FastText Evaluation

[Grave et al, 2017]

SLIDE 78

FastText Evaluation

SLIDE 79

FastText Evaluation

SLIDE 80

Dense Embeddings You Can Download

Word2vec (Mikolov et al.’ 13): https://code.google.com/archive/p/word2vec/
Fasttext (Bojanowski et al.’ 17): http://www.fasttext.cc/
Glove (Pennington et al., 14): http://nlp.stanford.edu/projects/glove/

SLIDE 81

Word embedding representations

▪ Count-based

▪ tf-idf, PPMI

▪ Class-based

▪ Brown clusters

▪ Distributed prediction-based (type) embeddings

▪ Word2Vec, Fasttext

▪ Distributed contextual (token) embeddings from language models

▪ ELMo, BERT

▪ + many more variants

▪ Multilingual embeddings
▪ Multisense embeddings
▪ Syntactic embeddings
▪ etc.
SLIDE 82

Motivation

p(play | Elmo and Cookie Monster play a game .)

p(play | The Broadway play premiered yesterday .)

SLIDE 83

ELMo

https://allennlp.org/elmo

SLIDE 84

Background

SLIDE 85

The Broadway play premiered yesterday .

[Figure: a forward LSTM language model reads the sentence left to right; what vector should represent "play"?]
SLIDE 86

The Broadway play premiered yesterday .

[Figure: stacking additional forward LSTM layers over the sentence]
SLIDE 87

The Broadway play premiered yesterday .

[Figure: a forward LSTM and a backward LSTM each read the sentence, giving a representation of "play" from each direction]
SLIDE 88

The Broadway play premiered yesterday .

[Figure: the biLM's layers over the sentence]

ELMo (Embeddings from Language Models) = ??
SLIDE 89

The Broadway play premiered yesterday .

[Figure: the biLM's layers over the sentence]

ELMo (Embeddings from Language Models) = a function of the token's layer representations
SLIDE 90

The Broadway play premiered yesterday .

[Figure: the biLM's layers over the sentence]

ELMo (Embeddings from Language Models) = a sum of the token's layer representations (embedding layer + two biLSTM layers)
SLIDE 91

The Broadway play premiered yesterday .

[Figure: the biLM's layers over the sentence]

ELMo (Embeddings from Language Models):

ELMo_k = λ0 (x_k) + λ1 (h_k,1) + λ2 (h_k,2)

a weighted sum of the token embedding and the two biLSTM layer states, with learned weights λj
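A sketch of this combination step (shapes and λ values are illustrative; per the slide, layer 0 is the token embedding and layers 1 and 2 are the biLSTM layers, with softmax-normalized weights and a task-specific scale γ):

```python
import numpy as np

seq_len, dim = 6, 8                        # "The Broadway play premiered yesterday ."
layers = np.random.randn(3, seq_len, dim)  # h_{k,j} for j = 0, 1, 2

lam = np.array([0.2, 1.5, 0.9])            # learned scalars, pre-normalization
weights = np.exp(lam) / np.exp(lam).sum()  # softmax over layers
gamma = 1.0

elmo = gamma * np.einsum("j,jtd->td", weights, layers)
print(elmo.shape)  # (6, 8): one contextual vector per token
```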

SLIDE 92

Evaluation: Extrinsic Tasks

SLIDE 93

Stanford Question Answering Dataset (SQuAD)

[Rajpurkar et al, ‘16, ‘18]

SLIDE 94

SNLI

[Bowman et al, ‘15]

SLIDE 95

BERT

https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html

SLIDE 96

Cloze task objective
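The cloze idea in miniature: hide a token and ask the model to recover it from bidirectional context (a sketch of the data side only, not BERT's exact 80/10/10 masking recipe):

```python
import random

tokens = "the Broadway play premiered yesterday .".split()
i = random.randrange(len(tokens))
target, tokens[i] = tokens[i], "[MASK]"
print(" ".join(tokens), "-> predict:", target)
```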

SLIDE 97

SLIDE 98

https://rajpurkar.github.io/SQuAD-explorer/

SLIDE 99

Multilingual Embeddings

https://github.com/mfaruqui/crosslingual-cca
http://128.2.220.95/multilingual/

SLIDE 100

Motivation

[Figure: the same word embedded by model 1 and by model 2; can the two spaces be compared directly?]

▪ comparison of words trained with different models

SLIDE 101

Motivation

▪ translation induction
▪ improving monolingual embeddings through cross-lingual context

[Figure: aligning English and French embedding spaces]

SLIDE 102

Canonical Correlation Analysis (CCA)

SLIDE 103

Canonical Correlation Analysis (CCA)

[Figure: CCA projects dictionary-aligned subsets of the two languages' vocabularies into a shared space]

[Faruqui & Dyer, ‘14]
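A sketch of the CCA step with scikit-learn (random stand-in data; in the Faruqui & Dyer setup, X and Y would hold embeddings of dictionary-aligned translation pairs):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))   # e.g. English vectors of aligned pairs
Y = rng.normal(size=(100, 40))   # e.g. French vectors of the same pairs

cca = CCA(n_components=10)
X_c, Y_c = cca.fit_transform(X, Y)  # projections into the shared space
# Corresponding rows are projected so as to be maximally correlated.
print(np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])
```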

SLIDE 104

Extension: Multilingual Embeddings

[Ammar et al., ‘16]

[Figure: English, French, Spanish, Arabic, and Swedish embeddings mapped into one shared space; each language is projected via CCA projection matrices, e.g. O_french→english and O_french←english]
SLIDE 105

Polyglot Models

[Ammar et al., ‘16, Tsvetkov et al., ‘16]

SLIDE 106

Embeddings can help study word history!

SLIDE 107

Diachronic Embeddings

▪ count-based embeddings w/ PPMI
▪ projected to a common space

SLIDE 108

Project 300 dimensions down into 2

~30 million books, 1850-1990, Google Books data

SLIDE 109

Negative words change faster than positive words

SLIDE 110

Embeddings reflect ethnic stereotypes over time

SLIDE 111

Change in linguistic framing 1910-1990

Change in association of Chinese names with adjectives framed as "othering" (barbaric, monstrous, bizarre)

SLIDE 112

Analogy: Embeddings capture relational meaning!

[Mikolov et al.’ 13]

SLIDE 113

and also human biases

[Bolukbasi et al., ‘16]

SLIDE 114

Conclusion

▪ Concepts or word senses

▪ Have a complex many-to-many association with words (homonymy, multiple senses)
▪ Have relations with each other
  ▪ Synonymy, Antonymy, Superordinate
▪ But are hard to define formally (necessary & sufficient conditions)

▪ Embeddings = vector models of meaning

▪ More fine-grained than just a string or index
▪ Especially good at modeling similarity/analogy
  ▪ Just download them and use cosines!
▪ Useful in many NLP tasks
▪ But know they encode cultural stereotypes