  1. OPTIMIZATION OF SKIP-GRAM MODEL Chenxi Wu Final Presentation for STA 790

  2. Word Embedding ■ Map words to vectors of real numbers ■ The earliest word representation is the "one-hot representation"

  3. Word Embedding ■ Distributed representation
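
To make the contrast with one-hot vectors concrete, here is a minimal Python sketch (my own illustration, not from the slides; the vocabulary and vector values are invented):

import numpy as np

# One-hot representation: one dimension per vocabulary word, a single 1.
vocab = ["king", "queen", "man", "woman", "apple"]
one_hot_king = np.zeros(len(vocab))
one_hot_king[vocab.index("king")] = 1.0      # [1, 0, 0, 0, 0] -- sparse, no notion of similarity

# Distributed representation: a short dense vector of real numbers
# (values made up purely for illustration).
emb_king = np.array([0.52, -0.13, 0.88])
emb_queen = np.array([0.49, -0.10, 0.91])

# Dense vectors support similarity comparisons, e.g. cosine similarity.
cosine = emb_king @ emb_queen / (np.linalg.norm(emb_king) * np.linalg.norm(emb_queen))
print(round(float(cosine), 3))               # close to 1.0 for related words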

  4. Word2Vec ■ An unsupervised NLP method developed by Google in 2013 ■ Quantifies the relationships between words ■ [Diagram: word2vec splits into the CBOW and Skip-gram architectures, each trainable with Hierarchical Softmax or Negative Sampling]

  5. Skip-gram ■ Input: the vector representation of a specific word ■ Output: the context word vectors corresponding to this word
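
A rough sketch of what the model is trained on (my own illustration, not the presentation's code; the helper name skipgram_pairs is hypothetical): for each center word, the surrounding words inside a window become the prediction targets.

def skipgram_pairs(tokens, window=2):
    """Yield (center, context) training pairs for the skip-gram model."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # skip the center word itself
                yield center, tokens[j]

# Tiny usage example.
sentence = "the quick brown fox jumps".split()
print(list(skipgram_pairs(sentence, window=1)))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ...]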

  6. DNN (Deep Neural Network)

  7. Huffman Tree ■ Leaf nodes denote all words in the vocabulary ■ The leaf nodes act as neurons in the output layer, and the internal nodes act as hidden neurons ■ Input: n weights f1, f2, ..., fn (the frequency of each word in the corpus) ■ Output: the corresponding Huffman tree ■ Benefit: common words get shorter Huffman codes

  8. ■ (1) Treat f1, f2, ..., fn as a forest of n trees (each tree has only one node); ■ (2) In the forest, select the two trees with the smallest weights and merge them as the left and right subtrees of a new tree, whose root weight is the sum of the weights of its left and right children; ■ (3) Delete the two selected trees from the forest and add the new tree to the forest; ■ (4) Repeat steps (2) and (3) until there is only one tree left in the forest
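
Steps (1)-(4) can be sketched with a priority queue; this is a generic Python illustration (the function name build_huffman is mine, not the presentation's code):

import heapq
from itertools import count

def build_huffman(freqs):
    """Build a Huffman tree from {word: frequency} and return {word: binary code}."""
    tie = count()                                         # tie-breaker so heapq never compares nodes
    heap = [(f, next(tie), w) for w, f in freqs.items()]  # step (1): forest of single-node trees
    heapq.heapify(heap)
    while len(heap) > 1:                                  # step (4): repeat until one tree remains
        f1, _, left = heapq.heappop(heap)                 # step (2): two smallest-weight trees...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))  # ...merged; step (3): reinsert
    codes = {}
    def assign(node, code):
        if isinstance(node, tuple):                       # internal node
            assign(node[0], code + "0")                   # left child  -> bit 0
            assign(node[1], code + "1")                   # right child -> bit 1
        else:
            codes[node] = code or "0"                     # leaf node -> word
    assign(heap[0][2], "")
    return codes

print(build_huffman({"the": 50, "cat": 10, "sat": 8, "mat": 3}))
# The frequent word "the" ends up with the shortest code.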

  9. Hierarchical Softmax

  10. HS Details

  11. HS Details ■ Use the sigmoid function at each internal node to decide whether to go left (+) or go right (-) ■ In the example above, w is "hierarchical".

  12. HS Target Function
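
The formula on this slide is an image that did not survive extraction, so here is the standard hierarchical-softmax formulation as a sketch (notation is my assumption: x_w is the input word vector, \theta_{j-1} the vector of the j-th internal node on the Huffman path to the target word, d_j \in \{0,1\} the j-th bit of its Huffman code, l_w the path length, and \sigma the sigmoid):

P(d_j \mid x_w, \theta_{j-1}) = \bigl[\sigma(x_w^{\top}\theta_{j-1})\bigr]^{1-d_j}\,\bigl[1-\sigma(x_w^{\top}\theta_{j-1})\bigr]^{d_j}

\mathcal{L} = \sum_{j=2}^{l_w}\Bigl\{(1-d_j)\,\log\sigma(x_w^{\top}\theta_{j-1}) + d_j\,\log\bigl[1-\sigma(x_w^{\top}\theta_{j-1})\bigr]\Bigr\}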

  13. HS Gradient
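
Again as a sketch with the same assumed notation, differentiating the j-th term of \mathcal{L} gives

\frac{\partial \mathcal{L}_j}{\partial \theta_{j-1}} = \bigl(1 - d_j - \sigma(x_w^{\top}\theta_{j-1})\bigr)\,x_w, \qquad \frac{\partial \mathcal{L}_j}{\partial x_w} = \bigl(1 - d_j - \sigma(x_w^{\top}\theta_{j-1})\bigr)\,\theta_{j-1},

so each node vector is updated by gradient ascent, \theta_{j-1} \leftarrow \theta_{j-1} + \eta\bigl(1 - d_j - \sigma(x_w^{\top}\theta_{j-1})\bigr)x_w, and x_w is updated with the sum of the second gradient over all internal nodes on the path.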

  14. Negative Sampling ■ An alternative method for training the Skip-gram model ■ Subsample frequent words to decrease the number of training examples ■ Have each training sample update only a small percentage of the model's weights
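
For the subsampling bullet, the original paper (Mikolov et al., 2013) discards each occurrence of a word w_i with a probability that grows with its frequency; with f(w_i) the word's relative frequency and t a small threshold (around 10^{-5} in the paper):

P(\text{discard } w_i) = 1 - \sqrt{\dfrac{t}{f(w_i)}}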

  15. Negative sample ■ Randomly select one word u from the words surrounding w, so u and w compose one "positive sample" ■ For a negative sample, keep the same u and randomly choose a word from the dictionary that is not w

  16. Sampling method ■ The unigram distribution is used to select negative words ■ The probability of a word being selected as a negative sample is related to its frequency of occurrence: the higher the frequency, the more likely it is to be chosen as a negative word
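
A small Python sketch of this sampling scheme (my own illustration; raising unigram counts to the 3/4 power follows the original paper):

import random

def build_negative_sampler(word_counts, power=0.75):
    """Return a closure that samples negative words from the unigram
    distribution raised to `power`; frequent words are drawn more often,
    but the exponent dampens their advantage."""
    words = list(word_counts)
    weights = [word_counts[w] ** power for w in words]
    def sample(k, exclude):
        negatives = []
        while len(negatives) < k:
            w = random.choices(words, weights=weights, k=1)[0]
            if w != exclude:                 # never return the true target word
                negatives.append(w)
        return negatives
    return sample

sampler = build_negative_sampler({"the": 1000, "cat": 50, "sat": 40, "mat": 5})
print(sampler(k=3, exclude="cat"))           # e.g. ['the', 'the', 'sat']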

  17. NS Details ■ Still use the sigmoid function to train the model ■ Suppose that through negative sampling we get neg negative samples (context(w), w_i), i = 1, 2, …, neg, so each training sample is (context(w), w, w_1, …, w_neg) ■ We expect the positive sample to satisfy: ■ We expect the negative samples to satisfy:
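
The two conditions on this slide are images in the original deck; written out as a sketch in my own notation (x_u for the input vector of the fixed context word and \theta^{w} for the output parameter vector of a candidate word), they amount to

\sigma(x_u^{\top}\theta^{w}) \approx 1 \quad \text{(positive sample)}, \qquad \sigma(x_u^{\top}\theta^{w_i}) \approx 0,\; i = 1,\dots,neg \quad \text{(negative samples)}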

  18. NS Details ■ We want to maximize the following log-likelihood: ■ Similarly, compute gradients to update the parameters
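
With the same assumed notation, the standard negative-sampling objective for one training sample is

\mathcal{L} = \log\sigma(x_u^{\top}\theta^{w}) + \sum_{i=1}^{neg}\log\bigl[1-\sigma(x_u^{\top}\theta^{w_i})\bigr],

and the gradient with respect to each output vector has the usual "label minus prediction" form, \partial\mathcal{L}/\partial\theta^{w_j} = \bigl(y_j - \sigma(x_u^{\top}\theta^{w_j})\bigr)\,x_u, with y_j = 1 for the positive word and y_j = 0 for the negatives.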

  19. Reference Mikolov et al., 2013, Distributed Representations of Words and Phrases and their Compositionality. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality The code for this implementation can be found on my GitHub repo: https://github.com/cassie1102
