
CS 3750: Word Models
Presented by: Muheng Yan, University of Pittsburgh, Feb. 20, 2020



  1. Are Document Models Enough?
     • Recap: previously we used LDA and LSI to learn document representations
     • What if we have very short documents, or even sentences (e.g., tweets)?
     • Can we investigate relationships between words/sentences with the previous models?
     • We need to model words individually for better granularity

  2. Distributional Semantics: from a Linguistic Aspect
     • Word embedding, distributed representations, semantic vector space... what are they?
     • A more formal term from linguistics is the Distributional Semantic Model: "... quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data." (Wikipedia)
     • The idea is to represent elements of language (here, words) as distributions over other elements (documents, paragraphs, sentences, words), e.g. word_1 = doc_1 + doc_5 + doc_10, or word_1 = 0.5 * word_12 + 0.7 * word_24

     Document Level Representation
     • Words as distributions over documents: Latent Semantic Analysis/Indexing (LSA/LSI)
       1. Build a word-by-document co-occurrence matrix (n by d)
       2. Decompose the word-document matrix via SVD
       3. Keep the largest singular values to obtain a lower-rank approximation of the word-document matrix, whose rows serve as the word representations (see the sketch below)
     Picture credit: https://en.wikipedia.org/wiki/Latent_semantic_analysis
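
     To make the three LSA steps concrete, here is a minimal numpy sketch (my own illustration, not from the slides); the toy word-by-document counts, the word labels, and the rank k = 2 are made-up values.

     ```python
     import numpy as np

     # Toy word-by-document count matrix (n words x d documents); the values are made up.
     X = np.array([
         [2, 0, 1, 0],   # "model"
         [1, 1, 0, 0],   # "topic"
         [0, 3, 1, 1],   # "word"
         [0, 0, 2, 2],   # "vector"
     ], dtype=float)

     # Step 2: decompose via SVD, X = U * diag(s) * Vt
     U, s, Vt = np.linalg.svd(X, full_matrices=False)

     # Step 3: keep the k largest singular values for a rank-k approximation.
     k = 2
     word_vectors = U[:, :k] * s[:k]   # each row is a k-dimensional word representation
     print(word_vectors)
     ```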

  3. Word Level Representation
     I. Counting and Matrix Factorization
     II. Latent Representation
         I. Neural Networks for Language Models
         II. CBOW
         III. Skip-gram
         IV. Other Models
     III. Graph-based Models
         I. Node2Vec

     Counting and Matrix Factorization
     • Counting methods start by constructing a matrix of co-occurrences between words and words (this can be extended to other levels; at the document level it becomes LSA)
     • Due to the high dimensionality and sparsity, they are usually combined with a dimensionality-reduction algorithm (PCA, SVD, etc.); see the sketch below
     • The rows of the matrix approximate the distribution of co-occurring words for every word we are trying to model
     • Example models include LSA, Explicit Semantic Analysis (ESA), and Global Vectors for Word Representation (GloVe)
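
     A minimal sketch of the counting step described above (my own illustration, not code from the lecture): build a word-word co-occurrence matrix with a sliding window; its rows could then be reduced with PCA or SVD as noted.

     ```python
     import numpy as np

     def cooccurrence_matrix(tokens, window=2):
         """Count how often each pair of words appears within `window` positions of each other."""
         vocab = sorted(set(tokens))
         index = {w: i for i, w in enumerate(vocab)}
         M = np.zeros((len(vocab), len(vocab)))
         for i, w in enumerate(tokens):
             for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                 if j != i:
                     M[index[w], index[tokens[j]]] += 1
         return vocab, M

     vocab, M = cooccurrence_matrix("i learn machine learning in cs 3750".split(), window=2)
     print(vocab)
     print(M)   # each row approximates that word's distribution of co-occurring words
     ```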

  4. Explicit Semantic Analysis (ESA)
     • Similar words most likely appear with the same distribution of topics
     • ESA represents topics by Wikipedia concepts (pages) and uses those concepts as the dimensions of the space into which words are projected
     • For each dimension (concept), the words appearing in that concept's article are counted
     • An inverted index is then constructed to convert each word into a vector of concepts
     • The vector constructed for each word represents the frequency of its occurrences within each concept
     Picture and content credit: Ahmed Magooda

     Global Vectors for Word Representation (GloVe)
     1. Build a word-word co-occurrence matrix X (|V| by |V|) with a sliding window, and normalize it into probabilities. Example: "I learn machine learning in CS 3750" with window = 2:

                     I   learn   machine   learning
        I            0     1        1         0
        learn        1     0        1         1
        machine      1     1        0         2

     2. Construct the cost as J = \sum_{i,j} f(X_{i,j}) (v_i^T v_j + b_i + b_j - \log X_{i,j})^2
     3. Use gradient descent to solve the optimization (a sketch follows below)
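
     A rough sketch of step 3 (my own, not the reference GloVe implementation): plain gradient descent on J = \sum_{i,j} f(X_{i,j})(v_i^T v_j + b_i + b_j - \log X_{i,j})^2 over the nonzero counts. The toy matrix is a symmetric completion of the small table above, a single set of word vectors is used instead of GloVe's separate word/context vectors, and the learning rate, dimension, and weighting constants are illustrative choices.

     ```python
     import numpy as np

     rng = np.random.default_rng(0)

     # Toy co-occurrence counts (symmetric completion of the example table; values assumed).
     X = np.array([[0, 1, 1, 0],
                   [1, 0, 1, 1],
                   [1, 1, 0, 2],
                   [0, 1, 2, 0]], dtype=float)

     V, dim, lr = X.shape[0], 5, 0.05
     v = rng.normal(scale=0.1, size=(V, dim))   # one vector per word (simplification)
     b = np.zeros(V)                            # one bias per word

     def f(x, x_max=100.0, alpha=0.75):
         """Weighting function; this particular form follows the GloVe paper."""
         return min(1.0, (x / x_max) ** alpha)

     for epoch in range(200):
         for i, j in zip(*np.nonzero(X)):
             err = v[i] @ v[j] + b[i] + b[j] - np.log(X[i, j])   # term inside the square
             w = 2 * f(X[i, j]) * err                            # common gradient factor
             grad_i, grad_j = w * v[j], w * v[i]
             v[i] -= lr * grad_i
             v[j] -= lr * grad_j
             b[i] -= lr * w
             b[j] -= lr * w

     print(v[0] @ v[1] + b[0] + b[1], np.log(X[0, 1]))  # the fit should approach log X_{0,1}
     ```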

  5. GloVe Cont.: How Is the Cost Derived?
     • Probability of words i and k appearing together: P_{i,k} = X_{i,k} / X_i
     • Using word k as a probe, the "ratio" of two word pairs: ratio_{i,j,k} = P_{i,k} / P_{j,k}
     • To model the ratio with embeddings v: J = \sum_{i,j,k} (ratio_{i,j,k} - g(v_i, v_j, v_k))^2, which is O(N^3)
     • Behavior of the ratio:

                                  j and k related    j and k not related
        i and k related                  1                  Inf
        i and k not related              0                   1

     • Simplify the computation by designing g(.) = e^{(v_i - v_j)^T v_k} (see the numerical check below)
     • Thus we are trying to make P_{i,k} / P_{j,k} = e^{v_i^T v_k} / e^{v_j^T v_k}
     • Matching numerator and denominator pair by pair, we have J = \sum_{i,j} (\log P_{i,j} - v_i^T v_j)^2
     • To expand the objective, \log P_{i,j} = v_i^T v_j means \log X_{i,j} - \log X_i = v_i^T v_j; absorbing \log X_i into biases gives \log X_{i,j} = v_i^T v_j + b_i + b_j. This also fixes the problem that P_{i,j} \neq P_{j,i} while v_i^T v_j = v_j^T v_i
     • We then arrive at the final cost function J = \sum_{i,j} f(X_{i,j}) (v_i^T v_j + b_i + b_j - \log X_{i,j})^2, where f(.) is a weighting function

     Latent Representation
     • Model the distribution of context* for a given word through a set of latent variables, by maximizing the likelihood P(word | context)**
     • Usually realized with neural networks
     • After optimization, the learned latent variables serve as the word representations
     * context refers to the other words, from whose distribution we model the target word
     ** in some models the likelihood is P(context | word), e.g. Skip-gram
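
     A quick numerical check (my own illustration) of the key simplification step above: with g(.) = e^{(v_i - v_j)^T v_k}, the modeled ratio factors into two per-pair terms, which is what lets the O(N^3) objective collapse into a pairwise one.

     ```python
     import numpy as np

     rng = np.random.default_rng(1)
     v_i, v_j, v_k = rng.normal(size=(3, 4))          # three random 4-dimensional embeddings

     lhs = np.exp((v_i - v_j) @ v_k)                  # g(v_i, v_j, v_k)
     rhs = np.exp(v_i @ v_k) / np.exp(v_j @ v_k)      # ratio of the two pair terms
     print(np.isclose(lhs, rhs))                      # True: the ratio factorizes pairwise
     ```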

  6. Neural Network for Language Models
     • Learning objective (predicting the next word x_t): find the parameter set \theta that minimizes
       L(\theta) = -\frac{1}{T} \sum_t \log P(x_t | x_{t-1}, ..., x_{t-n+1}) + R(\theta)
       where P(x_t | .) = e^{y_{x_t}} / \sum_i e^{y_i},  y = b + W_out \tanh(d + W_in x),
       and x is the lookup result for the input sequence: x = [C(x_{t-1}), ..., C(x_{t-n+1})]
     • (W_out, b) is the parameter set of the output layer; (W_in, d) is the parameter set of the hidden layer
     • In this model we learn the parameters in C (|V| x N), W_in (n x |V| x hidden_size), and W_out (hidden_size x |V|); see the sketch below
     Content credit: Ahmed Magooda

     RNN for Language Models
     • Learning objective: similar to the feed-forward NN language model
     • Change from the feed-forward NN: the hidden state is now a linear combination of the input of the current word t and the hidden state of the previous word t-1:
       s_t = f(U x_t + W s_{t-1})
       where f(.) is the activation function
     Content credit: Ahmed Magooda
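
     To make the shapes concrete, here is a minimal numpy sketch (my own, with made-up dimensions and word indices) of one forward pass of the feed-forward language model: look up the previous words in C, concatenate them, apply the tanh hidden layer, and softmax over the vocabulary.

     ```python
     import numpy as np

     rng = np.random.default_rng(0)
     V, m, n_ctx, hidden = 10, 8, 3, 16   # vocab size, embedding dim, context length, hidden size

     C = rng.normal(scale=0.1, size=(V, m))                  # word lookup table
     W_in = rng.normal(scale=0.1, size=(hidden, n_ctx * m))  # hidden-layer weights
     d = np.zeros(hidden)
     W_out = rng.normal(scale=0.1, size=(V, hidden))         # output-layer weights
     b = np.zeros(V)

     context = [4, 7, 1]                              # indices of x_{t-1}, ..., x_{t-n+1}
     x = np.concatenate([C[w] for w in context])      # lookup and concatenate
     y = b + W_out @ np.tanh(d + W_in @ x)            # scores over the vocabulary
     p = np.exp(y) / np.exp(y).sum()                  # P(x_t | context) via softmax
     print(p.argmax(), p.max())
     ```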

  7. Continuous Bag-of-Words (CBOW) Model
     • Learning objective: maximize the likelihood P(word | context) for every word in a corpus
     • Similar to the NN language model, the inputs are one-hot vectors and the matrix W acts as a lookup matrix
     • Differences compared to the NN language model:
       ◦ Bi-directional: instead of predicting the "next" word, it predicts the center word inside a window, with words from both directions as input
       ◦ Significantly reduced complexity: it only learns 2 x |V| x N parameters
     Picture credit: Francois Chaubard, Rohit Mundra, Richard Socher, from https://cs224d.stanford.edu/lecture_notes/notes1.pdf

     CBOW Cont.: Steps Breakdown (see the sketch below)
     1. Generate the one-hot vectors for the context, (x_{c-m}, ..., x_{c-1}, x_{c+1}, ..., x_{c+m}) \in R^{|V|}, and look up the word vectors v_i = W x_i
     2. Average the vectors over the context: h_c = (v_{c-m} + ... + v_{c+m}) / 2m
     3. Generate the scores z_c = W' h_c and turn them into probabilities \hat{y}_c = softmax(z_c), which approximate P(w_c | w_{c-m}, ..., w_{c+m})
     4. Calculate the loss as the cross-entropy -\sum_{i=1}^{|V|} y_i \log(\hat{y}_i)
     Notation:
     • m: half window size
     • c: center word index
     • w_i: word i from vocabulary V
     • x_i: one-hot input vector of word i
     • W \in R^{|V| x n}: the context lookup matrix
     • W' \in R^{n x |V|}: the center lookup matrix
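
     A minimal numpy sketch of the four CBOW steps (my own illustration; the dimensions and word indices are made up). W plays the role of the context lookup matrix and W_prime the center lookup matrix; with a one-hot true distribution, the cross-entropy reduces to -log of the probability assigned to the center word.

     ```python
     import numpy as np

     rng = np.random.default_rng(0)
     V, n, m = 10, 6, 2                            # vocab size, embedding dim, half window size
     W = rng.normal(scale=0.1, size=(V, n))        # context lookup matrix (row i = v_i)
     W_prime = rng.normal(scale=0.1, size=(n, V))  # center lookup matrix

     context_ids, center_id = [1, 3, 5, 7], 4      # indices of w_{c-m}, ..., w_{c+m} and w_c

     # Steps 1-2: look up the context vectors and average them into h_c
     h_c = W[context_ids].mean(axis=0)             # h_c = (v_{c-m} + ... + v_{c+m}) / 2m

     # Step 3: scores and softmax probabilities for the center word
     z_c = h_c @ W_prime                           # one score per vocabulary word
     y_hat = np.exp(z_c) / np.exp(z_c).sum()

     # Step 4: cross-entropy with a one-hot true distribution at the center word
     loss = -np.log(y_hat[center_id])
     print(loss)
     ```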

  8. CBOW Cont.: Loss Function
     For all w_c \in V, minimize
       J(.) = -\log P(w_c | w_{c-m}, ..., w_{c+m})
            => -\frac{1}{|V|} \sum_c \log P(w_c | h_c)
             = -\frac{1}{|V|} \sum_c \log \frac{ e^{w'^T_c h_c} }{ \sum_{j=1}^{|V|} e^{w'^T_j h_c} }
             = \frac{1}{|V|} \sum_c ( -w'^T_c h_c + \log \sum_{j=1}^{|V|} e^{w'^T_j h_c} )
     Optimization: use SGD to update all relevant vectors w'_c and w

     Skip-gram Model
     • Learning objective: maximize the likelihood P(context | word) for every word in a corpus
     • Steps breakdown:
       1. Generate the one-hot vector for the center word, x \in R^{|V|}, and calculate the embedded vector h_c = W x \in R^{n}
       2. Calculate the scores z_c = W' h_c
       3. For each word j in the context of the center word, calculate the probabilities \hat{y}_c = softmax(z_c)
       4. We want the probabilities \hat{y}_{c,j} in \hat{y}_c to match the true probabilities of the context words, which are y_{c-m}, ..., y_{c+m}
     • The cost function is constructed similarly to the CBOW model; see the sketch below
     Picture credit: Francois Chaubard, Rohit Mundra, Richard Socher, from https://cs224d.stanford.edu/lecture_notes/notes1.pdf
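
     A matching sketch for Skip-gram (again my own illustration with made-up indices): one forward pass from the center word, with the loss summed over the 2m context positions, mirroring the CBOW construction above.

     ```python
     import numpy as np

     rng = np.random.default_rng(0)
     V, n, m = 10, 6, 2                            # vocab size, embedding dim, half window size
     W = rng.normal(scale=0.1, size=(V, n))        # center-word lookup matrix
     W_prime = rng.normal(scale=0.1, size=(n, V))  # context (output) matrix

     center_id, context_ids = 4, [1, 3, 5, 7]      # w_c and its 2m context words

     # Steps 1-2: embed the center word and compute scores
     h_c = W[center_id]                            # h_c = W x, with x the one-hot of w_c
     z_c = h_c @ W_prime

     # Step 3: one softmax distribution over the vocabulary, reused for every context position
     y_hat = np.exp(z_c) / np.exp(z_c).sum()

     # Step 4: the loss asks y_hat to match each true context word, so sum -log prob over the window
     loss = -sum(np.log(y_hat[j]) for j in context_ids)
     print(loss)
     ```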
