SLIDE 1

Word Embeddings & Language Modeling

Lili Mou lmou@ualberta.ca lili-mou.github.io

CMPUT 651 (Fall 2019)

SLIDE 2

Last Lecture

  • Logistic regression/Softmax: Linear classification
  • Non-linear classification
  • Non-linear feature engineering
  • Non-linear kernel
  • Non-linear function composition
  • Neural networks
  • Forward propagation: Compute activation
  • Backward propagation: Compute derivatives (greedy dynamic programming)

SLIDE 3

Advantages of DL

  • Work with raw data
  • Image processing: pixels
  • Speech processing: frequency

[Graves+, ICASSP'13] ImageNet

SLIDE 4

How about Language?

  • The raw input of language: e.g., “I like the course”
  • Problem: Words are discrete tokens!

SLIDE 5

Representing Words

  • Attempt #1: By index in the vocabulary

[Figure: vocabulary 𝒲 = { … } with words indexed 1, 2, 3; the example sentence becomes the ID sequence 1 3 2 0]

  • Problem: Introducing artefacts
  • Order, metric, inner-product
  • Extreme non-linearity
SLIDE 6

Representing Words

  • Attempt #2: One-hot representation

[Figure: vocabulary 𝒲 = { … } with words indexed 1, 2, 3; each word is mapped to a one-hot vector with a single 1]

  X Separability doesn’t generalize
  X Metric is trivial
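A minimal NumPy sketch of the one-hot representation; the toy vocabulary and ID assignment (chosen to match the slide’s "1 3 2 0" example) are assumptions for illustration.

    import numpy as np

    # Toy vocabulary; IDs chosen so that "I like the course" -> 1 3 2 0, as on the slide.
    vocab = {"course": 0, "I": 1, "the": 2, "like": 3}

    def one_hot(word):
        v = np.zeros(len(vocab))   # a |W|-dimensional vector of zeros
        v[vocab[word]] = 1.0       # a single 1 at the word's index
        return v

    # Any two distinct words are equally far apart, so the metric carries no information:
    print(np.linalg.norm(one_hot("like") - one_hot("course")))  # sqrt(2)
    print(np.linalg.norm(one_hot("like") - one_hot("the")))     # sqrt(2)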

SLIDE 7

Metric in the Word Space

  • Design a metric d( ⋅ , ⋅ ) to evaluate the “distance” of two words in terms of some aspect
  • E.g., semantic similarity: I’d like to have some pop/soda/water/fruit/rest
  • Traditional method: WordNet distance (if it’s a metric; if not, it doesn’t matter).

[Figure: related words in WordNet — pop, water, fruit, rest, sleep, soda, drinks, food, leisure, things, …]
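As a concrete illustration of a WordNet-based distance, here is a minimal sketch using NLTK’s WordNet interface; using path similarity maximized over sense pairs is an assumption, not prescribed by the slide.

    from nltk.corpus import wordnet as wn   # requires nltk and the downloaded 'wordnet' corpus

    def wordnet_similarity(w1, w2):
        """Maximum path similarity over all sense pairs (higher means more related)."""
        best = 0.0
        for s1 in wn.synsets(w1):
            for s2 in wn.synsets(w2):
                sim = s1.path_similarity(s2)
                if sim is not None and sim > best:
                    best = sim
        return best

    print(wordnet_similarity("soda", "water"))    # both beverages: expected to score higher
    print(wordnet_similarity("soda", "leisure"))  # unrelated concepts: expected to score lower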

SLIDE 8

Metric in the Word Space

  • Design a metric d( ⋅ , ⋅ ) to evaluate the “distance” of two words in terms of some aspect
  • E.g., semantic similarity: I’d like to have some pop/soda/water/fruit/rest
  • A straightforward metric on one-hot vectors: the discrete metric
    d(xi, xj) = 1 if xi ≠ xj, and 0 otherwise
  • Non-informative

SLIDE 9

ID and One-Hot

                      ID representation    One-hot representation
  Dimension           One-dimensional      |𝒲|-dimensional
  Euclidean metric    Artefact             Non-informative
  Discrete metric     Non-informative      Non-informative
  Learnable           Difficult            Possible but may not generalize

Need to explore more.

SLIDE 10

Something in Between

  • Map a word to a low-dimensional space
  • Not as low as the one-dimensional ID representation
  • Not as high as the |𝒲|-dimensional one-hot representation
  • Attempt #3: Word vector representation (a.k.a. word embeddings)
  • Mapping a word to a vector
  • Equivalent to a linear transformation of the one-hot vector (see the sketch below)
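A minimal NumPy sketch of this equivalence, with an assumed 5-word vocabulary and 3-dimensional embeddings: looking up a word’s embedding is the same operation as multiplying the embedding matrix by the word’s one-hot vector.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, emb_dim = 5, 3                  # illustrative sizes
    E = rng.normal(size=(emb_dim, vocab_size))  # embedding matrix, one column per word

    word_id = 3
    one_hot = np.zeros(vocab_size)
    one_hot[word_id] = 1.0

    # Column lookup and matrix-vector product give the same embedding:
    print(np.allclose(E[:, word_id], E @ one_hot))  # True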

SLIDE 11

Obtaining the Embedding Matrix

  • Attempt #1: Treat as neural weights as usual
  • Random initialization & gradient descent
  • Properties of the embedding matrix
  • Huge: |𝒲| × d_NN parameters (cf. the weight matrices of a layerwise MLP)
  • Sparsely updated
  • Nature of language: Power-law distribution
  • Good if corpus is large
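A minimal PyTorch sketch of these properties (the vocabulary size and dimension are hypothetical): the embedding matrix is an ordinary trainable weight of shape |𝒲| × d_NN, but a forward pass touches only the rows of the words that occur, so its gradient is sparse.

    import torch
    import torch.nn as nn

    vocab_size, d_nn = 50_000, 300                   # |W| and d_NN (assumed sizes)
    emb = nn.Embedding(vocab_size, d_nn)             # randomly initialized, trained by gradient descent
    print(sum(p.numel() for p in emb.parameters()))  # 50,000 * 300 = 15,000,000 parameters

    word_ids = torch.tensor([1, 3, 2, 0])            # the example sentence as IDs
    emb(word_ids).sum().backward()
    print((emb.weight.grad.abs().sum(dim=1) != 0).sum())  # tensor(4): only 4 rows get a gradient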

SLIDE 12

Embedding Learning

  • Attempt #2: Manually specifying the distance metric/inner-product, etc.
  • Humans are not rational
  • Attempt #3: Pre-training on a massive corpus with a different (pre-training) objective
  • Then, we can fine-tune those pre-trained embeddings in almost any specific task.

SLIDE 13

Pretraining Criterion

  • Language Modeling
  • Given a corpus of sentences x = x1 x2 ⋯ xt
  • Goal: Maximize p(x)
  • Is it meaningful to view language sentences as a random variable?
  • Frequentist: Sentences are repetitions of i.i.d. experiments
  • Bayesian: Everything unknown is a random variable

SLIDE 14

Factorization

  • p(x) = p(x1, ⋯, xt) cannot be parametrized directly
  • Factorizing a giant probability:

    p(x) = p(x1, ⋯, xt) = p(x1) p(x2|x1) ⋯ p(xt|x1, ⋯, xt−1)

  • Still unable to parametrize, especially p(xn|x1, ⋯, xn−1) with a long history
  • Questions:
  • Can we decompose any probability distribution defined on x into this form? Yes.
  • Is it necessary to decompose a probability distribution in this form? No.

SLIDE 15

Markov Assumptions

  p(x) = p(x1, ⋯, xt) = p(x1) p(x2|x1) ⋯ p(xt|x1, ⋯, xt−1)

  • Independence
  • Given the current “state,” xt is independent of the previous ones
  • State at step t: (xt−n+1, xt−n+2, ⋯, xt−1)
  • i.e., xt ⊥ x≤t−n | xt−n+1, xt−n+2, ⋯, xt−1
  • Stationary property
  • p(xt|xt−n+1, ⋯, xt−1) = p(xs|xs−n+1, ⋯, xs−1) for all t, s

SLIDE 16

Parametrizing p(w)

  p(x) = p(x1, ⋯, xt) = p(x1) p(x2|x1) ⋯ p(xt|x1, ⋯, xt−1)
                      ≈ p(x1) p(x2|x1) ⋯ p(xt|xt−n+1, ⋯, xt−1)

  • Direct parametrization: Each multinomial distribution p(wn|w1, ⋯, wn−1) is directly parametrized
  • (notation abuse: w1, ⋯, wn−1 here denote the n−1 preceding words)

SLIDE 17

N-gram Model

  p(x) = p(x1, ⋯, xt) = p(x1) p(x2|x1) ⋯ p(xt|x1, ⋯, xt−1)
                      ≈ p(x1) p(x2|x1) ⋯ p(xt|xt−n+1, ⋯, xt−1)

  MLE by counting: p̂(wn|w1, ⋯, wn−1) = #(w1⋯wn) / #(w1⋯wn−1)

  • Questions:
  • How many multinomial distributions?
  • How many parameters in total?
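A minimal sketch of the counting estimate for a bigram model (n = 2); the tiny corpus is invented for illustration.

    from collections import Counter

    corpus = [["i", "like", "the", "course"],
              ["i", "like", "the", "book"]]

    bigram_counts, history_counts = Counter(), Counter()
    for sent in corpus:
        for w1, w2 in zip(sent, sent[1:]):
            bigram_counts[(w1, w2)] += 1   # #(w1 w2)
            history_counts[w1] += 1        # #(w1)

    def p_mle(w2, w1):
        """p-hat(w2 | w1) = #(w1 w2) / #(w1)."""
        return bigram_counts[(w1, w2)] / history_counts[w1]

    print(p_mle("the", "like"))    # 2/2 = 1.0
    print(p_mle("course", "the"))  # 1/2 = 0.5
    print(p_mle("movie", "the"))   # 0.0 -- unseen bigram: the data-sparsity problem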

SLIDE 18

Problems of n-gram models

  • #para ∝ exp(n)
  • Power-law distribution
  • Severe data sparsity even if n is small

  • Normal distribution: p(x) ∝ exp(−τx²)
  • Power-law distribution: p(x) ∝ x^(−k)

SLIDE 19

Smoothing Techniques

  • Add-one smoothing (a minimal sketch follows below)
  • Interpolation smoothing
  • Backoff smoothing

Useful link: https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
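Here is a minimal sketch of add-one (Laplace) smoothing for a bigram model; the counts and vocabulary size are toy assumptions. Interpolation and backoff follow the same spirit but combine estimates of different orders instead of adding pseudo-counts.

    from collections import Counter

    bigram_counts = Counter({("the", "course"): 1, ("the", "book"): 1})
    history_counts = Counter({"the": 2})
    V = 6  # hypothetical vocabulary size

    def p_add_one(w2, w1):
        # Add 1 to every count so that unseen bigrams no longer get zero probability.
        return (bigram_counts[(w1, w2)] + 1) / (history_counts[w1] + V)

    print(p_add_one("course", "the"))  # (1 + 1) / (2 + 6) = 0.25
    print(p_add_one("movie", "the"))   # (0 + 1) / (2 + 6) = 0.125, not zero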

SLIDE 20

Parametrizing LM by NN

  • Is it possible to parametrize the LM by a NN? Yes
  • Estimating p(wn|w1, ⋯, wn−1) is a classification problem
  • NNs are good at (esp. non-linear) classification

SLIDE 21

Feed-Forward Language Model

  • N.B. The Markov assumption also holds.
  • By-product: Embeddings are pre-trained in a meaningful way

Bengio, Yoshua, et al. "A Neural Probabilistic Language Model." JMLR, 2003.
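A minimal PyTorch sketch of a Bengio-style feed-forward LM: embed the previous n−1 words, concatenate, pass through a tanh hidden layer, and classify the next word over the vocabulary. All sizes and names are illustrative assumptions, not taken from the slide.

    import torch
    import torch.nn as nn

    class FeedForwardLM(nn.Module):
        def __init__(self, vocab_size, emb_dim=100, hidden_dim=256, context_size=3):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)            # the embedding matrix
            self.hidden = nn.Linear(context_size * emb_dim, hidden_dim)
            self.out = nn.Linear(hidden_dim, vocab_size)            # classifier over next words

        def forward(self, context_ids):                # (batch, context_size)
            e = self.emb(context_ids)                  # (batch, context_size, emb_dim)
            h = torch.tanh(self.hidden(e.flatten(1)))  # concatenate, then non-linear layer
            return self.out(h)                         # logits for the next word

    model = FeedForwardLM(vocab_size=10_000)
    ctx = torch.randint(0, 10_000, (32, 3))
    nxt = torch.randint(0, 10_000, (32,))
    loss = nn.functional.cross_entropy(model(ctx), nxt)
    loss.backward()   # the embeddings are trained jointly with the classifier,
                      # which is how they end up pre-trained as a by-product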

SLIDE 22

Recurrent Neural Language Model

  • RNN keeps one or a few hidden states
  • The hidden states change at each time step according to the input
  • RNN directly parametrizes p(xt|x1, ⋯, xt−1) rather than the Markov-truncated p(xt|xt−n+1, ⋯, xt−1)

Mikolov T, Karafiát M, Burget L, Cernocký J, Khudanpur S. Recurrent neural network based language model. In INTERSPEECH, 2010.
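A minimal PyTorch sketch of an RNN LM under the same assumed sizes as above: the hidden state summarizes the whole history, so no Markov truncation is required.

    import torch
    import torch.nn as nn

    class RNNLM(nn.Module):
        def __init__(self, vocab_size, emb_dim=100, hidden_dim=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)  # hidden state carries the history
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, token_ids):               # (batch, seq_len)
            h, _ = self.rnn(self.emb(token_ids))    # hidden state at every time step
            return self.out(h)                      # next-word logits at each position

    model = RNNLM(vocab_size=10_000)
    x = torch.randint(0, 10_000, (8, 21))
    logits = model(x[:, :-1])                       # predict token t+1 from tokens up to t
    loss = nn.functional.cross_entropy(logits.reshape(-1, 10_000), x[:, 1:].reshape(-1))
    loss.backward()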

SLIDE 23

How can we use word embeddings?

  • Embeddings demonstrate the internal structures of words
    – Relation represented by vector offset: “man” – “woman” = “king” – “queen”
    – Word similarity
  • Embeddings can serve as the initialization of almost every supervised task
    – A way of pretraining
    – N.B.: may not be useful when the training set is large enough

[Mikolov+NAACL13]
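A minimal NumPy sketch of the vector-offset idea: with pretrained embeddings, the analogy is answered by the word whose vector is closest to vec(king) − vec(man) + vec(woman). The random embeddings here are placeholders; real word2vec/GloVe vectors would be needed for the answer to actually come out as “queen.”

    import numpy as np

    def analogy(emb, a, b, c):
        """Return the word d (not in {a, b, c}) most cosine-similar to vec(c) - vec(a) + vec(b)."""
        target = emb[c] - emb[a] + emb[b]
        target /= np.linalg.norm(target)
        best, best_sim = None, -1.0
        for w, v in emb.items():
            if w in (a, b, c):
                continue
            sim = float(v @ target / np.linalg.norm(v))
            if sim > best_sim:
                best, best_sim = w, sim
        return best

    # Placeholder embeddings; substitute real pretrained vectors in practice.
    rng = np.random.default_rng(0)
    emb = {w: rng.normal(size=50) for w in ["man", "woman", "king", "queen", "apple"]}
    print(analogy(emb, "man", "woman", "king"))   # ideally "queen" with real embeddings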

SLIDE 24

Word Embeddings in our Brain

Huth, Alexander G., et al. "Natural speech reveals the semantic maps that tile human cerebral cortex." Nature 532.7600 (2016): 453-458.


SLIDE 25

“Somatotopic Embeddings” in our Brain

[8] Bear MF, Connors BW, Michael A. Paradiso. Neuroscience: Exploring the Brain. 2007


SLIDE 26

[8] Bear MF, Connors BW, Michael A. Paradiso. Neuroscience: Exploring the Brain. 2007


SLIDE 27

Complexity Concerns

  • Time complexity
    – Hierarchical softmax [1]
    – Negative sampling: Hinge loss [2], Noise-contrastive estimation [3]
  • Memory complexity
    – Compressing LM [4]
  • Model complexity
    – Shallow neural networks are still too “deep.”
    – CBOW, SkipGram [3]

[1] Mnih A, Hinton GE. A scalable hierarchical distributed language model. NIPS, 2009.
[2] Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P. Natural language processing (almost) from scratch. JMLR, 2011.
[3] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[4] Yunchuan Chen, Lili Mou, Yan Xu, Ge Li, Zhi Jin. Compressing neural language models by sparse word representations. In ACL, 2016.

SLIDE 28

Deep neural networks: To be, or not to be? That is the question.

SLIDE 29

CBOW, SkipGram (word2vec)

[6] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

SLIDE 30

Hierarchical Softmax and Noise-Contrastive Estimation

  • HS
  • NCE

[6] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
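To make the time-complexity point concrete, here is a minimal NumPy sketch of SkipGram with negative sampling (the simplified NCE-style objective used in word2vec): each update touches only one input row and k + 1 output rows instead of the full |𝒲|-way softmax. The sizes, the uniform noise distribution, and the helper names are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    V, d, k = 10_000, 100, 5                 # vocab size, embedding dim, negatives per pair
    W_in = rng.normal(0.0, 0.1, (V, d))      # word ("input") embeddings
    W_out = rng.normal(0.0, 0.1, (V, d))     # context ("output") embeddings
    noise = np.full(V, 1.0 / V)              # stand-in for word2vec's unigram^0.75 distribution

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_step(center, context, lr=0.025):
        """One SGD step on a (center, context) pair with k negative samples."""
        negatives = rng.choice(V, size=k, p=noise)
        targets = np.concatenate(([context], negatives))
        labels = np.array([1.0] + [0.0] * k)              # real pair vs. noise pairs
        scores = sigmoid(W_out[targets] @ W_in[center])   # predicted P(pair is real)
        grad = (scores - labels)[:, None]                 # dloss/dscore for each target
        grad_center = (grad * W_out[targets]).sum(axis=0)
        W_out[targets] -= lr * grad * W_in[center]        # update k+1 output rows only
        W_in[center] -= lr * grad_center                  # update one input row only

    sgns_step(center=42, context=7)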

SLIDE 31

Tricks in Training Word Embeddings

  • The # of negative samples?
    – The more, the better.
  • The distribution from which negative samples are generated? Should negative samples be close to positive samples?
    – The closer, the better.
  • Full softmax vs. NCE vs. HS vs. hinge loss?

SLIDE 32

Recent Advances in Pretraining

  • Pretraining the embedding mapping E: Vocabulary → ℝ^n for words is not enough
  • Context info?
  • Why not pre-train follow-up layers as well?
  • E.g., ELMo, BERT
  • Represent a word in a context, with LM-like pretraining
  • Factorization of p(w) = p(w1) p(w2|w1) ⋯ p(wn|w1⋯wn−1) is unnecessary

SLIDE 33

Learning Embeddings of Other Stuff

  • Node embeddings of a network [DeepWalk, KDD 2014]
  • General criteria of embedding learning
  • Atomic token represented by an embedding
  • Training embeddings by predicting “context” (a sketch follows below)
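A minimal sketch of the DeepWalk recipe under these criteria (details such as uniform walks and the toy adjacency-list graph are assumptions): truncated random walks turn the graph into "sentences" of node IDs, which could then be fed to a SkipGram trainer like the one sketched earlier, with co-occurring nodes playing the roles of token and context.

    import random

    def random_walk(adj, start, length=10):
        """One truncated random walk: a 'sentence' whose 'words' are node IDs."""
        walk = [start]
        for _ in range(length - 1):
            neighbors = adj[walk[-1]]
            if not neighbors:
                break
            walk.append(random.choice(neighbors))
        return walk

    # Toy graph (made up): node -> list of neighbours
    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    corpus = [random_walk(adj, v) for v in adj for _ in range(5)]
    # Nodes that co-occur within a window of a walk are (token, context) pairs,
    # exactly as words and contexts in word2vec.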
SLIDE 34

Mindmap

Representing Words: Index, One-hot, Real-valued embedding

Language modeling
  • Max Pr(corpus)
  • One pretraining method

N-gram
  • Markov assumption
  • MLE = counting %
  • Sparsity
  • #Para ∝ exp(n)
  • Power-law dist.

NN-LM
  • Predict the next word
  • Embeddings pretrained
  • Recent advance: Pretrain LM

Embeddings in general
  • Discrete token -> vector
  • Learned by predicting “context”

SLIDE 35

Suggested Reading

  • Neural LM: Bengio, Yoshua, et al. "A Neural Probabilistic Language Model." JMLR, 2003.
  • word2vec: Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • ELMo: Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. and Zettlemoyer, L. Deep contextualized word representations. In NAACL, 2018.
  • BERT: Devlin, J., Chang, M.W., Lee, K. and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  • DeepWalk: Perozzi, B., Al-Rfou, R. and Skiena, S. DeepWalk: Online learning of social representations. In KDD, 2014.

SLIDE 36

More References

  • Graves, A., Mohamed, A., and Hinton, G. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
  • Li, W. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6), 1842-1845, 1992.
  • Bengio, Yoshua, et al. A Neural Probabilistic Language Model. JMLR, 2003.
  • Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., and Khudanpur, S. Recurrent neural network based language model. In INTERSPEECH, 2010.
  • Devlin, J., Chang, M.W., Lee, K. and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  • Mikolov, T., Chen, K., Corrado, G. and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • Mikolov, T., Yih, W.T. and Zweig, G. Linguistic regularities in continuous space word representations. In NAACL, 2013.