SLIDE 1

CS7015 (Deep Learning) : Lecture 10

Learning Vectorial Representations Of Words

Mitesh M. Khapra

Department of Computer Science and Engineering Indian Institute of Technology Madras

SLIDE 2

Acknowledgments

'word2vec Parameter Learning Explained' by Xin Rong
'word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method' by Yoav Goldberg and Omer Levy
Sebastian Ruder's blogs on word embeddings (Blog1, Blog2, Blog3)

SLIDE 3

Module 10.1: One-hot representations of words

SLIDE 4

Model

[5.7, 1.2, 2.3, -10.2, 4.5, ..., 11.9, 20.1, -0.5, 40.7]

"This is by far AAMIR KHAN's best one. Finest casting and terrific acting by all."

Let us start with a very simple motivation for why we are interested in vectorial representations of words. Suppose we are given an input stream of words (sentence, document, etc.) and we are interested in learning some function of it (say, ŷ = sentiment(words)). Say, we employ a machine learning algorithm (some mathematical model) for learning such a function (ŷ = f(x)). We first need a way of converting the input stream (or each word in the stream) to a vector x (a mathematical quantity).

SLIDE 5

Corpus:

Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

V = [human, machine, interface, for, computer, applications, user, opinion, of, system, response, time, management, engineering, improved]

machine = [0, 1, 0, 0, ..., 0]

Given a corpus, consider the set V of all unique words across all input streams (i.e., all sentences or documents). V is called the vocabulary of the corpus. We need a representation for every word in V. One very simple way of doing this is to use one-hot vectors of size |V|. The representation of the i-th word will have a 1 in the i-th position and a 0 in the remaining |V| - 1 positions.
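To make this concrete, here is a minimal sketch (ours, not from the lecture) of one-hot encoding for the toy corpus; the helper names are our own:

```python
import numpy as np

corpus = [
    "human machine interface for computer applications",
    "user opinion of computer system response time",
    "user interface management system",
    "system engineering for improved response time",
]

# Vocabulary: unique words, in order of first appearance
V = []
for sentence in corpus:
    for w in sentence.split():
        if w not in V:
            V.append(w)

def one_hot(word):
    # 1 in the word's position, 0 in the remaining |V| - 1 positions
    x = np.zeros(len(V))
    x[V.index(word)] = 1.0
    return x

print(len(V))              # 15
print(one_hot("machine"))  # [0. 1. 0. 0. ... 0.]
```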

SLIDE 6

cat   = [0, ..., 1, ..., 0]
dog   = [0, ..., 1, ..., 0]
truck = [0, ..., 1, ..., 0]

euclid_dist(cat, dog) = sqrt(2)
euclid_dist(dog, truck) = sqrt(2)
cosine_sim(cat, dog) = 0
cosine_sim(dog, truck) = 0

Problems: V tends to be very large (for example, 50K for PTB, 13M for the Google 1T corpus), and these representations do not capture any notion of similarity. Ideally, we would want the representations of cat and dog (both domestic animals) to be closer to each other than the representations of cat and truck. However, with 1-hot representations, the Euclidean distance between any two words in the vocabulary is sqrt(2), and the cosine similarity between any two words in the vocabulary is 0.

SLIDE 7

Module 10.2: Distributed Representations of words

SLIDE 8

"A bank is a financial institution that accepts deposits from the public and creates credit."

The idea is to use the accompanying words (financial, deposits, credit) to represent bank.

"You shall know a word by the company it keeps" - Firth, J. R. 1957:11

This motivates distributional-similarity-based representations, and leads us to the idea of a co-occurrence matrix.

SLIDE 9

Corpus:

Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

          human  machine  system  for  ...  user
human            1                1    ...
machine   1                       1    ...
system                            1    ...  2
for       1      1        1           ...
...
user                      2           ...

Co-occurrence Matrix

A co-occurrence matrix is a terms × terms matrix which captures the number of times a term appears in the context of another term. The context is defined as a window of k words around the terms. Let us build a co-occurrence matrix for this toy corpus with k = 2. This is also known as a word × context matrix. You could choose the set of words and contexts to be the same or different. Each row (column) of the co-occurrence matrix gives a vectorial representation of the corresponding word (context). A sketch of this construction appears below.
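A sketch (ours) of building this matrix for the toy corpus; note that the exact counts you get depend on the windowing convention, and the numbers on the slide are illustrative:

```python
import numpy as np

corpus = [
    "human machine interface for computer applications",
    "user opinion of computer system response time",
    "user interface management system",
    "system engineering for improved response time",
]
V = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(V)}

k = 2  # context window: k words on either side
X = np.zeros((len(V), len(V)))
for s in corpus:
    words = s.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - k), min(len(words), i + k + 1)):
            if j != i:
                X[idx[w], idx[words[j]]] += 1

# Row X[idx[w], :] is now a vectorial representation of word w.
```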

SLIDE 10

[Co-occurrence matrix as on the previous slide]

Some (fixable) problems

Stop words (a, the, for, etc.) are very frequent → these counts will be very high.

SLIDE 11

          human  machine  system  ...  user
human            1               ...
machine   1                      ...
system                           ...  2
...
user                      2      ...

Some (fixable) problems

Solution 1: Ignore very frequent words.

SLIDE 12

          human  machine  system  for  ...  user
human            1                x    ...
machine   1                       x    ...
system                            x    ...  2
for       x      x        x      x    ...  x
...
user                      2      x    ...

Some (fixable) problems

Solution 2: Use a threshold t (say, t = 100):

X_ij = min(count(w_i, c_j), t), where w is a word and c is a context.

SLIDE 13

          human  machine  system  for   ...  user
human            2.944            2.25  ...
machine   2.944                   2.25  ...
system                            1.15  ...  1.84
for       2.25   2.25     1.15         ...
...
user                      1.84         ...

Some (fixable) problems

Solution 3: Instead of count(w, c), use PMI(w, c):

PMI(w, c) = log [ p(c|w) / p(c) ] = log [ (count(w, c) * N) / (count(c) * count(w)) ]

where N is the total number of words.

If count(w, c) = 0, then PMI(w, c) = -∞. Instead use:

PMI_0(w, c) = PMI(w, c)   if count(w, c) > 0
            = 0           otherwise

or

PPMI(w, c) = PMI(w, c)   if PMI(w, c) > 0
           = 0           otherwise
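A compact sketch (ours) of the count → PPMI transformation; here N is taken to be the total number of co-occurrence events, one common convention:

```python
import numpy as np

def ppmi(X):
    """PPMI transform of a word x context count matrix X (dense numpy array)."""
    N = X.sum()                             # total number of co-occurrence events
    count_w = X.sum(axis=1, keepdims=True)  # count(w), shape (m, 1)
    count_c = X.sum(axis=0, keepdims=True)  # count(c), shape (1, n)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(X * N / (count_w * count_c))
    pmi[~np.isfinite(pmi)] = 0.0            # count(w, c) = 0 -> PMI = -inf -> 0
    return np.maximum(pmi, 0.0)             # clip negative PMI to 0 (PPMI)
```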

SLIDE 14

[PMI matrix as on the previous slide]

Some (severe) problems

These representations are very high dimensional (|V|), very sparse, and grow with the size of the vocabulary.

Solution: Use dimensionality reduction (SVD).

SLIDE 15

Module 10.3: SVD for learning word representations

SLIDE 16

X_{m x n} = U_{m x k} Σ_{k x k} V^T_{k x n}

where the columns of U are u_1, ..., u_k, Σ = diag(σ_1, ..., σ_k), and the rows of V^T are v_1^T, ..., v_k^T.

Singular Value Decomposition gives a rank-k approximation of the original matrix:

X = X^PPMI_{m x n} = U_{m x k} Σ_{k x k} V^T_{k x n}

X^PPMI (simplifying notation to X) is the co-occurrence matrix with PPMI values. SVD gives the best rank-k approximation of the original data X, and discovers latent semantics in the corpus (let us examine this with the help of an example).
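A quick numpy illustration (ours) of the rank-k truncation:

```python
import numpy as np

def rank_k_approx(X, k):
    # Full SVD: X = U diag(S) Vt, singular values sorted in decreasing order
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Keeping the top-k singular triplets gives the best rank-k
    # approximation of X (Eckart-Young theorem)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

X = np.random.rand(10, 8)
print(np.linalg.matrix_rank(rank_k_approx(X, 2)))  # 2
```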

SLIDE 17

X_{m x n} = U_{m x k} Σ_{k x k} V^T_{k x n}
          = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + ... + σ_k u_k v_k^T

Notice that the product can be written as a sum of k rank-1 matrices. Each σ_i u_i v_i^T ∈ R^{m x n} because it is a product of an m x 1 vector with a 1 x n vector. If we truncate the sum at σ_1 u_1 v_1^T, then we get the best rank-1 approximation of X (by the SVD theorem! But what does this mean? We will see on the next slide). If we truncate the sum at σ_1 u_1 v_1^T + σ_2 u_2 v_2^T, then we get the best rank-2 approximation of X, and so on.

SLIDE 18

X_{m x n} = U_{m x k} Σ_{k x k} V^T_{k x n}
          = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + ... + σ_k u_k v_k^T

What do we mean by approximation here? Notice that X has m x n entries. When we use the rank-1 approximation, we are using only m + n + 1 entries to reconstruct it [u ∈ R^m, v ∈ R^n, σ ∈ R]. But the SVD theorem tells us that u_1, v_1 and σ_1 store the most information in X (akin to the principal components of X). Each subsequent term (σ_2 u_2 v_2^T, σ_3 u_3 v_3^T, ...) stores less and less important information.

SLIDE 19

[Figure: 8-bit binary codes for very light green, light green, dark green, and very dark green, and the 4-bit codes obtained after compression]

As an analogy, consider the case when we are using 8 bits to represent colors. The representations of very light, light, dark and very dark green would all look different. But now what if we were asked to compress this into 4 bits? (akin to compressing m x n values into m + n + 1 values on the previous slide) We will retain the most important 4 bits, and the previously (slightly) latent similarity between the colors now becomes very obvious. Something similar is guaranteed by SVD (retain the most important information and discover the latent similarities between words).

SLIDE 20

Co-occurrence Matrix (X):

          human  machine  system  for   ...  user
human            2.944            2.25  ...
machine   2.944                   2.25  ...
system                            1.15  ...  1.84
for       2.25   2.25     1.15         ...
...
user                      1.84         ...

Low-rank reconstruction (X → X̂):

          human  machine  system  for    ...  user
human     2.01   2.01     0.23    2.14   ...  0.43
machine   2.01   2.01     0.23    2.14   ...  0.43
system    0.23   0.23     1.17    0.96   ...  1.29
for       2.14   2.14     0.96    1.87   ...  -0.13
...
user      0.43   0.43     1.29    -0.13  ...  1.71

Notice that after low-rank reconstruction with SVD, the latent co-occurrence between {system, machine} and {human, user} has become visible.

SLIDE 21

X =

          human  machine  system  for   ...  user
human            2.944            2.25  ...
machine   2.944                   2.25  ...
system                            1.15  ...  1.84
for       2.25   2.25     1.15         ...
...
user                      1.84         ...

X X^T =

          human  machine  system  for    ...  user
human     32.5   23.9     7.78    20.25  ...  7.01
machine   23.9   32.5     7.78    20.25  ...  7.01
system    7.78   7.78             17.65  ...  21.84
for       20.25  20.25    17.65   36.3   ...  11.8
...
user      7.01   7.01     21.84   11.8   ...  28.3

cosine_sim(human, user) = 0.21

Recall that earlier each row of the original matrix X served as the representation of a word. Then X X^T is a matrix whose ij-th entry is the dot product between the representation of word i (X[i, :]) and word j (X[j, :]). [Worked example on the slide: multiplying row i of X with column j of X^T yields the entry (X X^T)_ij, e.g., the value 22.] The ij-th entry of X X^T thus (roughly) captures the cosine similarity between word_i and word_j.

SLIDE 22

X̂ =

          human  machine  system  for    ...  user
human     2.01   2.01     0.23    2.14   ...  0.43
machine   2.01   2.01     0.23    2.14   ...  0.43
system    0.23   0.23     1.17    0.96   ...  1.29
for       2.14   2.14     0.96    1.87   ...  -0.13
...
user      0.43   0.43     1.29    -0.13  ...  1.71

X̂ X̂^T =

          human  machine  system  for    ...  user
human     25.4   25.4     7.6     21.9   ...  6.84
machine   25.4   25.4     7.6     21.9   ...  6.84
system    7.6    7.6      24.8    18.03  ...  20.6
for       21.9   21.9     0.96    24.6   ...  15.32
...
user      6.84   6.84     20.6    15.32  ...  17.11

cosine_sim(human, user) = 0.33

Once we do an SVD, what is a good choice for the representation of word_i? Obviously, taking the i-th row of the reconstructed matrix does not make sense because it is still high dimensional. But we saw that the reconstructed matrix X̂ = U Σ V^T discovers latent semantics, and its word representations are more meaningful. Wishlist: we would want the representations of words (i, j) to be of smaller dimension but still have the same similarity (dot product) as the corresponding rows of X̂.

SLIDE 23

[X̂ and X̂ X̂^T as on the previous slide; cosine_sim(human, user) = 0.33]

Notice that the dot product between the rows of the matrix W_word = U Σ is the same as the dot product between the rows of X̂:

X̂ X̂^T = (U Σ V^T)(U Σ V^T)^T
      = (U Σ V^T)(V Σ U^T)
      = U Σ Σ^T U^T          (since V^T V = I)
      = (U Σ)(U Σ)^T = W_word W_word^T

Conventionally, W_word = U Σ ∈ R^{m x k} is taken as the representation of the m words in the vocabulary, and W_context = V is taken as the representation of the context words.
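In numpy, extracting W_word = UΣ and W_context = V from the PPMI matrix takes a few lines (a sketch, ours):

```python
import numpy as np

def svd_word_vectors(X_ppmi, k):
    U, S, Vt = np.linalg.svd(X_ppmi, full_matrices=False)
    W_word = U[:, :k] * S[:k]   # U Sigma: one k-dimensional row per word
    W_context = Vt[:k, :].T     # V: one k-dimensional row per context
    return W_word, W_context

# Sanity check: W_word @ W_word.T equals X_hat @ X_hat.T,
# where X_hat = U_k Sigma_k V_k^T is the rank-k reconstruction.
```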

SLIDE 24

Module 10.4: Continuous bag of words model

SLIDE 25

The methods that we have seen so far are called count-based models because they use the co-occurrence counts of words. We will now see methods which directly learn word representations (these are called (direct) prediction-based models).

SLIDE 26

The story ahead ...
Continuous bag of words model
Skip-gram model with negative sampling (the famous word2vec)
GloVe word embeddings
Evaluating word embeddings
Good old SVD does just fine!!

SLIDE 27

Sometime in the 21st century, Joseph Cooper, a widowed former engineer and former NASA pilot, runs a farm with his father-in-law Donald, son Tom, and daughter Murphy. It is a post-truth society (Cooper is reprimanded for telling Murphy that the Apollo missions did indeed happen) and a series of crop blights threatens humanity's survival. Murphy believes her bedroom is haunted by a poltergeist. When a pattern is created out of dust on the floor, Cooper realizes that gravity is behind its formation, not a "ghost". He interprets the pattern as a set of geographic coordinates formed into binary code. Cooper and Murphy follow the coordinates to a secret NASA facility, where they are met by Cooper's former professor, Dr. Brand.

(Some sample 4-word windows from a corpus)

Consider this task: predict the n-th word given the previous n-1 words. Example: he sat on a chair. Training data: all n-word windows in your corpus. Training data for this task is easily available (take all n-word windows from the whole of Wikipedia). For ease of illustration, we will first focus on the case when n = 2 (i.e., predict the second word based on the first word).
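Extracting the training windows is a one-liner; a sketch (ours):

```python
def word_windows(text, n):
    """All n-word windows in a text; for n = 2 these are
    (first word, second word) training pairs."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_windows("he sat on a chair", 2))
# [('he', 'sat'), ('sat', 'on'), ('on', 'a'), ('a', 'chair')]
```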

SLIDE 28

We will now try to answer these two questions: How do you model this task? What is the connection between this task and learning word representations?

SLIDE 29

[Network diagram: one-hot input x ∈ R^|V| for the context word 'sat'; hidden layer h = W_context x ∈ R^k; softmax output over |V| words giving P(he|sat), P(chair|sat), P(man|sat), P(on|sat), ...]

We will model this problem using a feedforward neural network. Input: the one-hot representation of the context word. Output: there are |V| words (classes) possible, and we want to predict a probability distribution over these |V| classes (a multi-class classification problem). Parameters: W_context ∈ R^{k x |V|} and W_word ∈ R^{k x |V|} (we are assuming that the set of words and context words is the same, each of size |V|).

SLIDE 30

[Same network diagram as before]

What is the product W_context x, given that x is a one-hot vector? It is simply the i-th column of W_context:

[ -1   0.5   2 ]   [0]   [ 0.5]
[  3  -1    -2 ] x [1] = [-1  ]
[ -2   1.7   3 ]   [0]   [ 1.7]

So when the i-th word is present, the i-th element in the one-hot vector is ON, and the i-th column of W_context gets selected. In other words, there is a one-to-one correspondence between the words and the columns of W_context. More specifically, we can treat the i-th column of W_context as the representation of context i.

SLIDE 31

[Same network diagram]

How do we obtain P(on|sat)? For this multi-class classification problem, what is an appropriate output function? Softmax:

P(on|sat) = exp((W_word h)[i]) / Σ_j exp((W_word h)[j])

where i is the index of the word 'on'. Therefore, P(on|sat) is proportional to the dot product between the column of W_context for 'sat' and the i-th column of W_word. P(word = i|sat) thus depends on the i-th column of W_word, so we treat the i-th column of W_word as the representation of word i. Hope you see an analogy with SVD! (There we had a different way of learning W_context and W_word, but we saw that the i-th column of W_word corresponded to the representation of the i-th word.) Now that we understand the interpretation of W_context and W_word, our aim is to learn these parameters.

SLIDE 32

[Same network diagram, with ŷ = P(on|sat)]

We denote the context word (sat) by the index c and the correct output word (on) by the index w. For this multi-class classification problem, what is an appropriate output function (ŷ = f(x))? Softmax. What is an appropriate loss function? Cross entropy:

L(θ) = -log ŷ_w = -log P(w|c)

h = W_context · x_c = u_c

ŷ_w = exp(u_c · v_w) / Σ_{w'∈V} exp(u_c · v_w')

where u_c is the column of W_context corresponding to context c, and v_w is the column of W_word corresponding to word w.
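Collecting the pieces so far, a minimal numpy sketch (ours) of the forward pass and loss for one (c, w) pair; sizes and indices are toy values:

```python
import numpy as np

V_size, k = 10, 4
rng = np.random.default_rng(0)
W_context = rng.normal(size=(k, V_size))  # column i = u_i (context vectors)
W_word = rng.normal(size=(k, V_size))     # column i = v_i (word vectors)

c, w = 3, 7               # toy indices of the context word and the correct word
u_c = W_context[:, c]     # h = W_context x_c: selecting column c
scores = W_word.T @ u_c   # u_c . v_w' for every word w' in V
y_hat = np.exp(scores) / np.exp(scores).sum()  # softmax
loss = -np.log(y_hat[w])                       # cross-entropy loss
```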

SLIDE 33

[Same network diagram]

How do we train this simple feedforward neural network? Backpropagation. Let us consider one input-output pair (c, w) and see the update rule for v_w.

SLIDE 34

[Same network diagram]

∇v_w = -∂L(θ)/∂v_w

L(θ) = -log ŷ_w
     = -log [ exp(u_c · v_w) / Σ_{w'∈V} exp(u_c · v_w') ]
     = -(u_c · v_w - log Σ_{w'∈V} exp(u_c · v_w'))

∇v_w = -(u_c - [exp(u_c · v_w) / Σ_{w'∈V} exp(u_c · v_w')] u_c)
     = -u_c(1 - ŷ_w)

And the update rule would be:

v_w = v_w - η∇v_w = v_w + ηu_c(1 - ŷ_w)
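The derived update, continuing the sketch above (a plain SGD step; η is a toy learning rate):

```python
# One SGD step on v_w for the pair (c, w), using the gradient
# derived above: grad(v_w) = -u_c (1 - y_hat[w])
eta = 0.1
W_word[:, w] += eta * u_c * (1.0 - y_hat[w])
```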

SLIDE 35

[Same network diagram]

This update rule has a nice interpretation:

v_w = v_w + ηu_c(1 - ŷ_w)

If ŷ_w → 1, then we are already predicting the right word and v_w will not be updated. If ŷ_w → 0, then v_w gets updated by adding a fraction of u_c to it. This increases the cosine similarity between v_w and u_c (how? refer to slide 38 of Lecture 2). The training objective thus ensures that the cosine similarity between a word (v_w) and its context word (u_c) is maximized.

SLIDE 36

[Same network diagram]

What happens to the representations of two words w and w' which tend to appear in similar contexts (c)? The training ensures that both v_w and v_w' have a high cosine similarity with u_c, and hence transitively (intuitively) ensures that v_w and v_w' have a high cosine similarity with each other. This is only a (reasonable) intuition; we haven't come across a formal proof for this!

SLIDE 37

[Network diagram with two context words 'he' and 'sat': x ∈ R^{2|V|}, [W_context, W_context] ∈ R^{k x 2|V|}; outputs P(he|sat, he), P(chair|sat, he), P(man|sat, he), P(on|sat, he), ...]

In practice, instead of a window size of 1, it is common to use a window size of d. So now:

h = Σ_{i=1}^{d-1} u_{c_i}

[W_context, W_context] just means that we are stacking 2 copies of the W_context matrix:

[ -1  0.5   2   -1  0.5   2 ]           [  2.5]
[  3  -1   -2    3  -1   -2 ]  x  x  =  [ -3  ]
[ -2  1.7   3   -2  1.7   3 ]           [  4.7]

where x ∈ R^{2|V|} has a 1 in the position of 'sat' (first copy) and a 1 in the position of 'he' (second copy). The resultant product is simply the sum of the columns corresponding to 'sat' and 'he'.

SLIDE 38

Of course, in practice we will not do this expensive matrix multiplication. If 'he' is the i-th word in the vocabulary and 'sat' is the j-th word, then we will simply access the i-th and j-th columns of W_context and add them.
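In code this is just column indexing, continuing the earlier sketch (i and j are toy indices):

```python
# No matrix multiplication: two column lookups and a vector addition
# (i = index of 'he', j = index of 'sat')
i, j = 1, 3
h = W_context[:, i] + W_context[:, j]
```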

SLIDE 39

Now what happens during backpropagation? Recall that

h = Σ_{i=1}^{d-1} u_{c_i}

and

P(on|sat, he) = exp((W_word h)[k]) / Σ_j exp((W_word h)[j])

where k is the index of the word 'on'. The loss function depends on {W_word, u_{c_1}, u_{c_2}, ..., u_{c_{d-1}}}, and all these parameters will get updated during backpropagation. Try deriving the update rule for v_w now and see how it differs from the one we derived before.

SLIDE 40

[Same two-context-word network diagram]

Some problems: notice that the softmax function at the output is computationally very expensive:

ŷ_w = exp(u_c · v_w) / Σ_{w'∈V} exp(u_c · v_w')

The denominator requires a summation over all words in the vocabulary. We will revisit this issue soon.

SLIDE 41

Module 10.5: Skip-gram model

SLIDE 42

The model that we just saw is called the continuous bag of words model (it predicts an output word given a bag of context words). We will now see the skip-gram model (which predicts context words given an input word).

SLIDE 43

[Skip-gram network diagram: one-hot input x ∈ R^|V| for the word 'sat', h ∈ R^k, W_word ∈ R^{k x |V|}, W_context ∈ R^{k x |V|}; outputs predict the context words 'he', 'a', 'chair']

Notice that the roles of context and word have changed now. In the simple case when there is only one context word, we will arrive at the same update rule for u_c as we did for v_w earlier. Notice that even when we have multiple context words, the loss function is just a summation of many cross-entropy errors:

L(θ) = -Σ_{i=1}^{d-1} log ŷ_{w_i}

Typically, we predict context words on both sides of the given word.

SLIDE 44

[Same skip-gram diagram]

Some problems: as with the bag of words model, the softmax function at the output is computationally expensive.

Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax

SLIDE 45

D = [(sat, on), (sat, a), (sat, chair), (on, a), (on, chair), (a, chair), (on, sat), (a, sat), (chair, sat), (a, on), (chair, on), (chair, a)]

D' = [(sat, oxygen), (sat, magic), (chair, sad), (chair, walking)]

Let D be the set of all correct (w, c) pairs in the corpus. Let D' be the set of all incorrect (w, r) pairs in the corpus. D' can be constructed by randomly sampling a context word r which has never appeared with w and creating a pair (w, r). As before, let v_w be the representation of the word w and u_c be the representation of the context word c.

SLIDE 46

[Diagram: v_w and u_c feed a dot product followed by a sigmoid σ, giving P(z = 1|w, c)]

For a given (w, c) ∈ D, we are interested in maximizing p(z = 1|w, c). Let us model this probability by:

p(z = 1|w, c) = σ(u_c^T v_w) = 1 / (1 + e^{-u_c^T v_w})

Considering all (w, c) ∈ D, we are interested in:

maximize_θ  Π_{(w,c)∈D} p(z = 1|w, c)

where θ comprises the word representations (v_w) and context representations (u_c) for all words in our corpus.

SLIDE 47

[Diagram: v_w and u_r feed a dot product, negated, through σ, giving P(z = 0|w, r)]

For (w, r) ∈ D', we are interested in maximizing p(z = 0|w, r). Again we model this as:

p(z = 0|w, r) = 1 - σ(u_r^T v_w)
             = 1 - 1 / (1 + e^{-u_r^T v_w})
             = 1 / (1 + e^{u_r^T v_w})
             = σ(-u_r^T v_w)

Considering all (w, r) ∈ D', we are interested in:

maximize_θ  Π_{(w,r)∈D'} p(z = 0|w, r)

SLIDE 48

[Same diagram]

Combining the two, we get:

maximize_θ  Π_{(w,c)∈D} p(z = 1|w, c)  Π_{(w,r)∈D'} p(z = 0|w, r)

= maximize_θ  Π_{(w,c)∈D} p(z = 1|w, c)  Π_{(w,r)∈D'} (1 - p(z = 1|w, r))

= maximize_θ  Σ_{(w,c)∈D} log p(z = 1|w, c) + Σ_{(w,r)∈D'} log(1 - p(z = 1|w, r))

= maximize_θ  Σ_{(w,c)∈D} log 1/(1 + e^{-u_c^T v_w}) + Σ_{(w,r)∈D'} log 1/(1 + e^{u_r^T v_w})

= maximize_θ  Σ_{(w,c)∈D} log σ(u_c^T v_w) + Σ_{(w,r)∈D'} log σ(-u_r^T v_w)

where σ(x) = 1/(1 + e^{-x}).

SLIDE 49

[Same diagram]

In the original paper, Mikolov et al. sample k negative (w, r) pairs for every positive (w, c) pair. The size of D' is thus k times the size of D. The random context word r is drawn from a modified unigram distribution:

r ~ p(r)^{3/4} = (count(r)/N)^{3/4}

where N is the total number of words in the corpus.
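A toy sketch (ours) of this objective: sampling negatives from the unigram^(3/4) noise distribution and scoring one positive pair against them; all names and numbers are made up, and we skip the lecture's check that r never co-occurs with w:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
counts = np.array([50.0, 30.0, 10.0, 5.0, 5.0])    # toy unigram counts
p_noise = counts ** 0.75 / (counts ** 0.75).sum()  # modified unigram distribution

neg_k, dim = 2, 8              # negatives per positive pair, embedding size
v = rng.normal(size=(5, dim))  # word vectors v_w
u = rng.normal(size=(5, dim))  # context vectors u_c

w, c = 0, 1                               # one positive (word, context) pair
r = rng.choice(5, size=neg_k, p=p_noise)  # sample k negative contexts

# objective for this pair: log sigma(u_c . v_w) + sum_r log sigma(-u_r . v_w)
obj = np.log(sigmoid(u[c] @ v[w])) + np.log(sigmoid(-(u[r] @ v[w]))).sum()
```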

SLIDE 50

Module 10.6: Contrastive estimation

SLIDE 51

[Same skip-gram diagram]

Some problems: as with the bag of words model, the softmax function at the output is computationally expensive.

Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax

SLIDE 52

Positive: He sat on a chair

[Network: the vectors v_c and v_w for 'sat' and 'on' are concatenated and passed through W_h ∈ R^{2d x h} and W_out ∈ R^{h x 1}, producing a score s]

Negative: He sat abracadabra a chair

[The same network on ('sat', 'abracadabra') produces a score s_c]

We would like s to be greater than s_c. Okay, so let us try to maximize s - s_c. But we would like the difference to be at least a margin m, so we maximize s - (s_c + m). What if s > s_c + m already? Then don't do anything. This gives the objective:

maximize min(0, s - (s_c + m))    (equivalently, minimize max(0, s_c + m - s))
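The max-margin idea in two lines (a sketch, ours):

```python
def ranking_loss(s, s_c, m=1.0):
    # zero (nothing to do) once s exceeds s_c by the margin m;
    # otherwise pushes s up and s_c down
    return max(0.0, s_c + m - s)   # minimize this
```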

SLIDE 53

Module 10.7: Hierarchical softmax

SLIDE 54

[Same skip-gram diagram]

Some problems: as with the bag of words model, the softmax function at the output is computationally expensive.

Solution 1: Use negative sampling
Solution 2: Use contrastive estimation
Solution 3: Use hierarchical softmax

SLIDE 55

[Diagram: instead of computing the full softmax e^{v_c^T u_w} / Σ_{w'∈V} e^{v_c^T u_w'} over all |V| outputs, the output layer is organized as a binary tree]

Construct a binary tree such that there are |V| leaf nodes, each corresponding to one word in the vocabulary.

SLIDE 56

[Tree diagram: the input one-hot for 'sat' gives h = v_c at the root; internal nodes carry vectors u_1, u_2, ...; the leaves are words; the path to the leaf 'on' has π(on)_1 = 1, π(on)_2 = 0, π(on)_3 = 0]

Construct a binary tree such that there are |V| leaf nodes, each corresponding to one word in the vocabulary. There exists a unique path from the root node to each leaf node. Let l(w_1), l(w_2), ..., l(w_p) be the nodes on the path from the root to w, and let π(w) be a binary vector such that:

π(w)_k = 1   if the path branches left at node l(w_k)
       = 0   otherwise

Finally, each internal node is associated with a vector u_i. So the parameters of the module are W_context and u_1, u_2, ..., u_v (in effect, we have the same number of parameters as before).

SLIDE 57

[Same tree diagram]

For a given pair (w, c), we are interested in the probability p(w|v_c). We model this probability as:

p(w|v_c) = Π_k P(π(w)_k | v_c)

For example:

P(on|v_sat) = P(π(on)_1 = 1|v_sat) * P(π(on)_2 = 0|v_sat) * P(π(on)_3 = 0|v_sat)

In effect, we are saying that the probability of predicting a word is the same as the probability of predicting the correct unique path from the root node to that word.

SLIDE 58

[Same tree diagram]

We model:

P(π(on)_i = 1) = 1 / (1 + e^{-v_c^T u_i})

P(π(on)_i = 0) = 1 - P(π(on)_i = 1) = 1 / (1 + e^{v_c^T u_i})

The above model ensures that the representation of a context word v_c will have a high (low) similarity with the representation of the node u_i if the path branches to the left (right) at u_i. Again, transitively, the representations of contexts which appear with the same words will have high similarity.

SLIDE 59

[Same tree diagram]

P(w|v_c) = Π_{k=1}^{|π(w)|} P(π(w)_k | v_c)

Note that p(w|v_c) can now be computed using |π(w)| computations instead of the |V| required by softmax. How do we construct the binary tree? It turns out that even a random arrangement of the words on the leaf nodes does well in practice.
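A small sketch (ours) of this computation; the node vectors and path bits are toy values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def path_probability(v_c, node_vectors, branches):
    """p(w | v_c) for hierarchical softmax.
    node_vectors: vectors u_i of the internal nodes on the root-to-leaf path;
    branches: the bits pi(w)_k (1 = branch left, 0 = branch right)."""
    p = 1.0
    for u_i, left in zip(node_vectors, branches):
        p_left = sigmoid(v_c @ u_i)           # P(pi_k = 1 | v_c)
        p *= p_left if left else 1.0 - p_left
    return p                                  # |pi(w)| sigmoids, not |V| terms

rng = np.random.default_rng(0)
v_c = rng.normal(size=4)                        # context representation
nodes = [rng.normal(size=4) for _ in range(3)]  # u_1, u_2, u_3 on the path
print(path_probability(v_c, nodes, [1, 0, 0]))  # P(pi=1) * P(pi=0) * P(pi=0)
```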

SLIDE 60

Module 10.8: GloVe representations

SLIDE 61

Count-based methods (SVD) rely on global co-occurrence counts from the corpus for computing word representations. Prediction-based methods learn word representations using co-occurrence information. Why not combine the two (count and learn)?

SLIDE 62

Corpus:

Human machine interface for computer applications
User opinion of computer system response time
User interface management system
System engineering for improved response time

X =

          human  machine  system  for    ...  user
human     2.01   2.01     0.23    2.14   ...  0.43
machine   2.01   2.01     0.23    2.14   ...  0.43
system    0.23   0.23     1.17    0.96   ...  1.29
for       2.14   2.14     0.96    1.87   ...  -0.13
...
user      0.43   0.43     1.29    -0.13  ...  1.71

P(j|i) = X_ij / Σ_j X_ij = X_ij / X_i    (and X_ij = X_ji)

X_ij encodes important global information about the co-occurrence between i and j (global because it is computed over the entire corpus). Why not learn word vectors which are faithful to this information? For example, enforce:

v_i^T v_j = log P(j|i) = log X_ij - log X_i

Similarly:

v_j^T v_i = log X_ij - log X_j    (since X_ij = X_ji)

Essentially we are saying that we want word vectors v_i and v_j such that v_i^T v_j is faithful to the globally computed P(j|i).

SLIDE 63

[Corpus and matrix X as on the previous slide]

Adding the two equations, we get:

2 v_i^T v_j = 2 log X_ij - log X_i - log X_j

v_i^T v_j = log X_ij - (1/2) log X_i - (1/2) log X_j

Note that log X_i and log X_j depend only on the words i and j, so we can think of them as word-specific biases which will be learned:

v_i^T v_j = log X_ij - b_i - b_j
v_i^T v_j + b_i + b_j = log X_ij

We can then formulate this as the following optimization problem:

min_{v_i, v_j, b_i, b_j}  Σ_{i,j} (v_i^T v_j + b_i + b_j - log X_ij)^2

where v_i^T v_j + b_i + b_j is the value predicted using the model parameters and log X_ij is the actual value computed from the given corpus.

SLIDE 64

[Corpus and matrix X as on the previous slide]

min_{v_i, v_j, b_i, b_j}  Σ_{i,j} (v_i^T v_j + b_i + b_j - log X_ij)^2

Drawback: this weighs all co-occurrences equally.

Solution: add a weighting function:

min_{v_i, v_j, b_i, b_j}  Σ_{i,j} f(X_ij)(v_i^T v_j + b_i + b_j - log X_ij)^2

Wishlist: f(X_ij) should be such that neither rare nor frequent words are over-weighted:

f(x) = (x / x_max)^α   if x < x_max
     = 1               otherwise

where α can be tuned for a given dataset.
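A direct transcription of this weighted least-squares objective into a sketch (ours); x_max = 100 and α = 0.75 are the defaults used in the GloVe paper:

```python
import numpy as np

def f_weight(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(vecs, b, X):
    """sum over observed pairs of f(X_ij) (v_i . v_j + b_i + b_j - log X_ij)^2"""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):
        err = vecs[i] @ vecs[j] + b[i] + b[j] - np.log(X[i, j])
        loss += f_weight(X[i, j]) * err ** 2
    return loss

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(6, 6)).astype(float)  # toy co-occurrence counts
vecs = rng.normal(size=(6, 3))
b = rng.normal(size=6)
print(glove_loss(vecs, b, X))
```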

SLIDE 65

Module 10.9: Evaluating word representations

SLIDE 66

How do we evaluate the learned word representations?

SLIDE 67

S_human(cat, dog) = 0.8

S_model(cat, dog) = (v_cat^T v_dog) / (||v_cat|| ||v_dog||) = 0.7

Semantic Relatedness: ask humans to judge the relatedness between a pair of words, and compute the cosine similarity between the corresponding word vectors learned by the model. Given a large number of such word pairs, compute the correlation between S_model and S_human, and compare different models. Model 1 is better than Model 2 if:

correlation(S_model1, S_human) > correlation(S_model2, S_human)

SLIDE 68

Term: levied
Candidates: {imposed, believed, requested, correlated}

synonym = argmax_{c∈C} cosine(v_term, v_c)

Synonym Detection: given a term and four candidate synonyms, pick the candidate which has the largest cosine similarity with the term. Compute the accuracy of different models and compare.

SLIDE 69

brother : sister :: grandson : ?
work : works :: speak : ?

Analogy:
Semantic analogy: find the nearest neighbour of v_sister - v_brother + v_grandson
Syntactic analogy: find the nearest neighbour of v_works - v_work + v_speak
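A sketch (ours) of the nearest-neighbour lookup with toy vectors; with real embeddings, W would hold one learned row per vocabulary word:

```python
import numpy as np

def nearest(W, q, words, exclude=()):
    # cosine similarity between the query vector q and every row of W
    sims = (W @ q) / (np.linalg.norm(W, axis=1) * np.linalg.norm(q))
    for i in np.argsort(-sims):
        if words[i] not in exclude:
            return words[i]

words = ["brother", "sister", "grandson", "granddaughter"]
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))   # toy word vectors, one row per word

# brother : sister :: grandson : ?
q = W[1] - W[0] + W[2]
print(nearest(W, q, words, exclude={"brother", "sister", "grandson"}))
```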

SLIDE 70

So which algorithm gives the best results? Baroni et al. [2014] showed that prediction-based models consistently outperform count-based models on all tasks. Levy et al. [2015] do a much more thorough analysis (IMO) and show that good old SVD does better than prediction-based models on similarity tasks, but not on analogy tasks.

SLIDE 71

Module 10.10: Relation between SVD & word2vec

SLIDE 72

The story ahead ...
Continuous bag of words model
Skip-gram model with negative sampling (the famous word2vec)
GloVe word embeddings
Evaluating word embeddings
Good old SVD does just fine!!

SLIDE 73

[Same skip-gram diagram]

Recall that SVD does a matrix factorization of the co-occurrence matrix. Levy et al. [2015] show that word2vec also implicitly does a matrix factorization. What does this mean? Recall that word2vec gives us W_context and W_word. It turns out that we can also show that:

M = W_context^T W_word, where M_ij = PMI(w_i, c_j) - log k

and k is the number of negative samples. So essentially, word2vec factorizes a matrix M which is related to the PMI-based co-occurrence matrix (very similar to what SVD does).
