
CS 3750: Word Models
Presented by: Muheng Yan, University of Pittsburgh, Feb. 20, 2020



  1. Are Document Models Enough?
     • Recap: previously we used LDA and LSI to learn document representations
     • What if we have very short documents, or even sentences (e.g., tweets)?
     • Can we investigate relationships between words/sentences with the previous models?
     • We need to model words individually for better granularity

  2. Distributional Semantics: from a Linguistic Aspect
     • Word embedding, distributed representations, semantic vector space... what are they?
     • A more formal term from linguistics is the Distributional Semantic Model: "... quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data." (Wikipedia)
     • The idea is to represent elements of language (here, words) as distributions over other elements (documents, paragraphs, sentences, words), e.g. word_1 = doc_1 + doc_5 + doc_10, or word_1 = 0.5 * word_12 + 0.7 * word_24

     Document Level Representation
     • Words as distributions over documents: Latent Semantic Analysis/Indexing (LSA/LSI)
       1. Build a word-by-document co-occurrence matrix (n by d)
       2. Decompose the word-document matrix via SVD
       3. Keep the largest singular values to obtain a lower-rank approximation of the word-document matrix, whose rows serve as the word representations (see the sketch below)
     Picture credit: https://en.wikipedia.org/wiki/Latent_semantic_analysis
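
     To make the three LSA steps concrete, here is a minimal numpy sketch (my own illustration, not from the slides); the toy word-by-document counts, the word labels, and the rank k = 2 are made-up values.

     ```python
     import numpy as np

     # Toy word-by-document count matrix (n words x d documents); the values are made up.
     X = np.array([
         [2, 0, 1, 0],   # "model"
         [1, 1, 0, 0],   # "topic"
         [0, 3, 1, 1],   # "word"
         [0, 0, 2, 2],   # "vector"
     ], dtype=float)

     # Step 2: decompose via SVD, X = U * diag(s) * Vt
     U, s, Vt = np.linalg.svd(X, full_matrices=False)

     # Step 3: keep the k largest singular values for a rank-k approximation.
     k = 2
     word_vectors = U[:, :k] * s[:k]   # each row is a k-dimensional word representation
     print(word_vectors)
     ```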

  3. Word Level Representation
     I. Counting and Matrix Factorization
     II. Latent Representation
         I. Neural Networks for Language Models
         II. CBOW
         III. Skip-gram
         IV. Other Models
     III. Graph-based Models
         I. Node2Vec

     Counting and Matrix Factorization
     • Counting methods start by constructing a matrix of co-occurrences between words and words (this can be extended to other levels; at the document level it becomes LSA)
     • Due to the high dimensionality and sparsity, they are usually combined with a dimensionality-reduction algorithm (PCA, SVD, etc.); see the sketch below
     • The rows of the matrix approximate the distribution of co-occurring words for every word we are trying to model
     • Example models include LSA, Explicit Semantic Analysis (ESA), and Global Vectors for Word Representation (GloVe)
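
     A minimal sketch of the counting step described above (my own illustration, not code from the lecture): build a word-word co-occurrence matrix with a sliding window; its rows could then be reduced with PCA or SVD as noted.

     ```python
     import numpy as np

     def cooccurrence_matrix(tokens, window=2):
         """Count how often each pair of words appears within `window` positions of each other."""
         vocab = sorted(set(tokens))
         index = {w: i for i, w in enumerate(vocab)}
         M = np.zeros((len(vocab), len(vocab)))
         for i, w in enumerate(tokens):
             for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                 if j != i:
                     M[index[w], index[tokens[j]]] += 1
         return vocab, M

     vocab, M = cooccurrence_matrix("i learn machine learning in cs 3750".split(), window=2)
     print(vocab)
     print(M)   # each row approximates that word's distribution of co-occurring words
     ```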

  4. Explicit Semantic Analysis (ESA)
     • Similar words most likely appear with the same distribution of topics
     • ESA represents topics by Wikipedia concepts (pages) and uses those concepts as the dimensions of the space into which words are projected
     • For each dimension (concept), the words appearing in that concept's article are counted
     • An inverted index is then constructed to convert each word into a vector of concepts
     • The vector constructed for each word represents the frequency of its occurrences within each concept
     Picture and content credit: Ahmed Magooda

     Global Vectors for Word Representation (GloVe)
     1. Build a word-word co-occurrence matrix X (|V| by |V|) with a sliding window, and normalize it into probabilities. Example: "I learn machine learning in CS 3750" with window = 2:

                     I   learn   machine   learning
        I            0     1        1         0
        learn        1     0        1         1
        machine      1     1        0         2

     2. Construct the cost as J = \sum_{i,j} f(X_{i,j}) (v_i^T v_j + b_i + b_j - \log X_{i,j})^2
     3. Use gradient descent to solve the optimization (a sketch follows below)
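
     A rough sketch of step 3 (my own, not the reference GloVe implementation): plain gradient descent on J = \sum_{i,j} f(X_{i,j})(v_i^T v_j + b_i + b_j - \log X_{i,j})^2 over the nonzero counts. The toy matrix is a symmetric completion of the small table above, a single set of word vectors is used instead of GloVe's separate word/context vectors, and the learning rate, dimension, and weighting constants are illustrative choices.

     ```python
     import numpy as np

     rng = np.random.default_rng(0)

     # Toy co-occurrence counts (symmetric completion of the example table; values assumed).
     X = np.array([[0, 1, 1, 0],
                   [1, 0, 1, 1],
                   [1, 1, 0, 2],
                   [0, 1, 2, 0]], dtype=float)

     V, dim, lr = X.shape[0], 5, 0.05
     v = rng.normal(scale=0.1, size=(V, dim))   # one vector per word (simplification)
     b = np.zeros(V)                            # one bias per word

     def f(x, x_max=100.0, alpha=0.75):
         """Weighting function; this particular form follows the GloVe paper."""
         return min(1.0, (x / x_max) ** alpha)

     for epoch in range(200):
         for i, j in zip(*np.nonzero(X)):
             err = v[i] @ v[j] + b[i] + b[j] - np.log(X[i, j])   # term inside the square
             w = 2 * f(X[i, j]) * err                            # common gradient factor
             grad_i, grad_j = w * v[j], w * v[i]
             v[i] -= lr * grad_i
             v[j] -= lr * grad_j
             b[i] -= lr * w
             b[j] -= lr * w

     print(v[0] @ v[1] + b[0] + b[1], np.log(X[0, 1]))  # the fit should approach log X_{0,1}
     ```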

  5. GloVe Cont.: How Is the Cost Derived?
     • Probability of words i and k appearing together: P_{i,k} = X_{i,k} / X_i
     • Using word k as a probe, the "ratio" of two word pairs: ratio_{i,j,k} = P_{i,k} / P_{j,k}
     • To model the ratio with embeddings v: J = \sum_{i,j,k} (ratio_{i,j,k} - g(v_i, v_j, v_k))^2, which is O(N^3)
     • Behavior of the ratio:

                                  j and k related    j and k not related
        i and k related                  1                  Inf
        i and k not related              0                   1

     • Simplify the computation by designing g(.) = e^{(v_i - v_j)^T v_k} (see the numerical check below)
     • Thus we are trying to make P_{i,k} / P_{j,k} = e^{v_i^T v_k} / e^{v_j^T v_k}
     • Matching numerator and denominator pair by pair, we have J = \sum_{i,j} (\log P_{i,j} - v_i^T v_j)^2
     • To expand the objective, \log P_{i,j} = v_i^T v_j means \log X_{i,j} - \log X_i = v_i^T v_j; absorbing \log X_i into biases gives \log X_{i,j} = v_i^T v_j + b_i + b_j. This also fixes the problem that P_{i,j} \neq P_{j,i} while v_i^T v_j = v_j^T v_i
     • We then arrive at the final cost function J = \sum_{i,j} f(X_{i,j}) (v_i^T v_j + b_i + b_j - \log X_{i,j})^2, where f(.) is a weighting function

     Latent Representation
     • Model the distribution of context* for a given word through a set of latent variables, by maximizing the likelihood P(word | context)**
     • Usually realized with neural networks
     • After optimization, the learned latent variables serve as the word representations
     * context refers to the other words, from whose distribution we model the target word
     ** in some models the likelihood is P(context | word), e.g. Skip-gram
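
     A quick numerical check (my own illustration) of the key simplification step above: with g(.) = e^{(v_i - v_j)^T v_k}, the modeled ratio factors into two per-pair terms, which is what lets the O(N^3) objective collapse into a pairwise one.

     ```python
     import numpy as np

     rng = np.random.default_rng(1)
     v_i, v_j, v_k = rng.normal(size=(3, 4))          # three random 4-dimensional embeddings

     lhs = np.exp((v_i - v_j) @ v_k)                  # g(v_i, v_j, v_k)
     rhs = np.exp(v_i @ v_k) / np.exp(v_j @ v_k)      # ratio of the two pair terms
     print(np.isclose(lhs, rhs))                      # True: the ratio factorizes pairwise
     ```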

  6. Neural Network for Language Models
     • Learning objective (predicting the next word x_t): find the parameter set \theta that minimizes
       L(\theta) = -\frac{1}{T} \sum_t \log P(x_t | x_{t-1}, ..., x_{t-n+1}) + R(\theta)
       where P(x_t | .) = e^{y_{x_t}} / \sum_i e^{y_i},  y = b + W_out \tanh(d + W_in x),
       and x is the lookup result for the input sequence: x = [C(x_{t-1}), ..., C(x_{t-n+1})]
     • (W_out, b) is the parameter set of the output layer; (W_in, d) is the parameter set of the hidden layer
     • In this model we learn the parameters in C (|V| x N), W_in (n x |V| x hidden_size), and W_out (hidden_size x |V|); see the sketch below
     Content credit: Ahmed Magooda

     RNN for Language Models
     • Learning objective: similar to the feed-forward NN language model
     • Change from the feed-forward NN: the hidden state is now a linear combination of the input of the current word t and the hidden state of the previous word t-1:
       s_t = f(U x_t + W s_{t-1})
       where f(.) is the activation function
     Content credit: Ahmed Magooda
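
     To make the shapes concrete, here is a minimal numpy sketch (my own, with made-up dimensions and word indices) of one forward pass of the feed-forward language model: look up the previous words in C, concatenate them, apply the tanh hidden layer, and softmax over the vocabulary.

     ```python
     import numpy as np

     rng = np.random.default_rng(0)
     V, m, n_ctx, hidden = 10, 8, 3, 16   # vocab size, embedding dim, context length, hidden size

     C = rng.normal(scale=0.1, size=(V, m))                  # word lookup table
     W_in = rng.normal(scale=0.1, size=(hidden, n_ctx * m))  # hidden-layer weights
     d = np.zeros(hidden)
     W_out = rng.normal(scale=0.1, size=(V, hidden))         # output-layer weights
     b = np.zeros(V)

     context = [4, 7, 1]                              # indices of x_{t-1}, ..., x_{t-n+1}
     x = np.concatenate([C[w] for w in context])      # lookup and concatenate
     y = b + W_out @ np.tanh(d + W_in @ x)            # scores over the vocabulary
     p = np.exp(y) / np.exp(y).sum()                  # P(x_t | context) via softmax
     print(p.argmax(), p.max())
     ```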

  7. Continuous Bag-of-Words (CBOW) Model
     • Learning objective: maximize the likelihood P(word | context) for every word in a corpus
     • Similar to the NN language model, the inputs are one-hot vectors and the matrix W acts as a lookup matrix
     • Differences compared to the NN language model:
       ◦ Bi-directional: instead of predicting the "next" word, it predicts the center word inside a window, with words from both directions as input
       ◦ Significantly reduced complexity: it only learns 2 x |V| x N parameters
     Picture credit: Francois Chaubard, Rohit Mundra, Richard Socher, from https://cs224d.stanford.edu/lecture_notes/notes1.pdf

     CBOW Cont.: Steps Breakdown (see the sketch below)
     1. Generate the one-hot vectors for the context, (x_{c-m}, ..., x_{c-1}, x_{c+1}, ..., x_{c+m}) \in R^{|V|}, and look up the word vectors v_i = W x_i
     2. Average the vectors over the context: h_c = (v_{c-m} + ... + v_{c+m}) / 2m
     3. Generate the scores z_c = W' h_c and turn them into probabilities \hat{y}_c = softmax(z_c), which approximate P(w_c | w_{c-m}, ..., w_{c+m})
     4. Calculate the loss as the cross-entropy -\sum_{i=1}^{|V|} y_i \log(\hat{y}_i)
     Notation:
     • m: half window size
     • c: center word index
     • w_i: word i from vocabulary V
     • x_i: one-hot input vector of word i
     • W \in R^{|V| x n}: the context lookup matrix
     • W' \in R^{n x |V|}: the center lookup matrix
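
     A minimal numpy sketch of the four CBOW steps (my own illustration; the dimensions and word indices are made up). W plays the role of the context lookup matrix and W_prime the center lookup matrix; with a one-hot true distribution, the cross-entropy reduces to -log of the probability assigned to the center word.

     ```python
     import numpy as np

     rng = np.random.default_rng(0)
     V, n, m = 10, 6, 2                            # vocab size, embedding dim, half window size
     W = rng.normal(scale=0.1, size=(V, n))        # context lookup matrix (row i = v_i)
     W_prime = rng.normal(scale=0.1, size=(n, V))  # center lookup matrix

     context_ids, center_id = [1, 3, 5, 7], 4      # indices of w_{c-m}, ..., w_{c+m} and w_c

     # Steps 1-2: look up the context vectors and average them into h_c
     h_c = W[context_ids].mean(axis=0)             # h_c = (v_{c-m} + ... + v_{c+m}) / 2m

     # Step 3: scores and softmax probabilities for the center word
     z_c = h_c @ W_prime                           # one score per vocabulary word
     y_hat = np.exp(z_c) / np.exp(z_c).sum()

     # Step 4: cross-entropy with a one-hot true distribution at the center word
     loss = -np.log(y_hat[center_id])
     print(loss)
     ```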

  8. CBOW Cont.: Loss Function
     For all w_c \in V, minimize
       J(.) = -\log P(w_c | w_{c-m}, ..., w_{c+m})
            => -\frac{1}{|V|} \sum_c \log P(w_c | h_c)
             = -\frac{1}{|V|} \sum_c \log \frac{ e^{w'^T_c h_c} }{ \sum_{j=1}^{|V|} e^{w'^T_j h_c} }
             = \frac{1}{|V|} \sum_c ( -w'^T_c h_c + \log \sum_{j=1}^{|V|} e^{w'^T_j h_c} )
     Optimization: use SGD to update all relevant vectors w'_c and w

     Skip-gram Model
     • Learning objective: maximize the likelihood P(context | word) for every word in a corpus
     • Steps breakdown:
       1. Generate the one-hot vector for the center word, x \in R^{|V|}, and calculate the embedded vector h_c = W x \in R^{n}
       2. Calculate the scores z_c = W' h_c
       3. For each word j in the context of the center word, calculate the probabilities \hat{y}_c = softmax(z_c)
       4. We want the probabilities \hat{y}_{c,j} in \hat{y}_c to match the true probabilities of the context words, which are y_{c-m}, ..., y_{c+m}
     • The cost function is constructed similarly to the CBOW model; see the sketch below
     Picture credit: Francois Chaubard, Rohit Mundra, Richard Socher, from https://cs224d.stanford.edu/lecture_notes/notes1.pdf
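
     A matching sketch for Skip-gram (again my own illustration with made-up indices): one forward pass from the center word, with the loss summed over the 2m context positions, mirroring the CBOW construction above.

     ```python
     import numpy as np

     rng = np.random.default_rng(0)
     V, n, m = 10, 6, 2                            # vocab size, embedding dim, half window size
     W = rng.normal(scale=0.1, size=(V, n))        # center-word lookup matrix
     W_prime = rng.normal(scale=0.1, size=(n, V))  # context (output) matrix

     center_id, context_ids = 4, [1, 3, 5, 7]      # w_c and its 2m context words

     # Steps 1-2: embed the center word and compute scores
     h_c = W[center_id]                            # h_c = W x, with x the one-hot of w_c
     z_c = h_c @ W_prime

     # Step 3: one softmax distribution over the vocabulary, reused for every context position
     y_hat = np.exp(z_c) / np.exp(z_c).sum()

     # Step 4: the loss asks y_hat to match each true context word, so sum -log prob over the window
     loss = -sum(np.log(y_hat[j]) for j in context_ids)
     print(loss)
     ```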
