OPTIMIZATION OF SKIP-GRAM MODEL
Chenxi Wu Final Presentation for STA 790
■ Map words to vectors of real numbers
■ The earliest word representation is the "one-hot representation"
■ Distributed representation: a dense, low-dimensional real-valued vector learned from data
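A minimal sketch of the two representations, using a made-up five-word vocabulary (the embedding values are random placeholders, not trained vectors):

```python
import numpy as np

# Hypothetical 5-word vocabulary for illustration.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot representation: a sparse vector with a single 1."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# Distributed representation: a dense, low-dimensional real vector.
# Here it is initialized randomly; training would learn meaningful values.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), 3))  # 3-dimensional embeddings

print(one_hot("cat"))                  # sparse, dimension = vocabulary size
print(embedding[word_to_idx["cat"]])   # dense, dimension chosen freely
```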
■ An unsupervised NLP method developed by Google in 2013 ■ Quantify the relationship between words
word2vec: two model architectures (Skip-gram, CBOW) and two training methods (Hierarchical Softmax, Negative Sampling)
■ Input: the vector representation of a specific word
■ Output: the context word vectors corresponding to this word
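A short sketch of how Skip-gram turns a sentence into (center, context) training pairs, where the model takes the center word as input and predicts each surrounding word (the sentence and window size are illustrative choices):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for Skip-gram:
    for each position, pair the center word with every word
    within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["I", "like", "natural", "language", "processing"]
print(skipgram_pairs(sentence, window=1))
```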
■ Leaf nodes denote all words in the vocabulary
■ The leaf nodes act as neurons in the output layer, and the internal nodes act as hidden neurons
■ Input: n weights f1, f2, ..., fn (the frequency of each word in the corpus)
■ Output: the corresponding Huffman tree
■ Benefit: common words have shorter Huffman codes
■ (1) Treat f1, f2, ..., fn as a forest of n trees (each tree has only one node);
■ (2) In the forest, select the two trees with the smallest weights and merge them as the left and right subtrees of a new tree; the weight of the new tree's root is the sum of the weights of its left and right children;
■ (3) Delete the two selected trees from the forest and add the new tree to the forest;
■ (4) Repeat steps (2) and (3) until only one tree is left in the forest
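The four steps above can be sketched with a heap-based implementation; the example frequencies are made up:

```python
import heapq

def build_huffman(freqs):
    """Build a Huffman tree from word frequencies and return each
    word's binary code, following the merge procedure above:
    repeatedly merge the two lowest-weight trees."""
    # Heap entry: (weight, tiebreaker, tree); a tree is either a word
    # (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, w) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # two smallest weights...
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))  # ...merged
        counter += 1
    codes = {}
    def walk(tree, code):
        if isinstance(tree, tuple):
            walk(tree[0], code + "0")     # left branch
            walk(tree[1], code + "1")     # right branch
        else:
            codes[tree] = code            # leaf = word
    walk(heap[0][2], "")
    return codes

codes = build_huffman({"the": 50, "cat": 10, "sat": 8, "mat": 5})
print(codes)  # the most frequent word gets the shortest code
```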
■ Use the sigmoid function at each internal node to decide whether to go left (+) or right (-)
■ In the example above, w is "hierarchical"
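A sketch of the resulting probability computation: under hierarchical softmax, a word's probability is the product of the sigmoid decisions along its root-to-leaf path. The vectors below are made-up placeholders, not trained parameters:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hs_probability(x, path):
    """Probability of a word under hierarchical softmax.
    `path` lists (theta, direction) for each internal node on the
    root-to-leaf path: direction +1 means go left (probability
    sigmoid(x . theta)), -1 means go right (probability
    1 - sigmoid(x . theta))."""
    p = 1.0
    for theta, d in path:
        s = sigmoid(sum(xi * ti for xi, ti in zip(x, theta)))
        p *= s if d == +1 else (1.0 - s)
    return p

x = [0.2, -0.1, 0.4]               # input word vector (illustrative)
path = [([0.5, 0.1, -0.2], +1),    # go left at the root
        ([-0.3, 0.4, 0.1], -1)]    # go right at the next node
print(hs_probability(x, path))
```

Because each node's left and right probabilities sum to 1, the probabilities of all leaves sum to 1 without computing a full softmax over the vocabulary.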
■ Alternative method for training the Skip-gram model
■ Subsample frequent words to decrease the number of training examples
■ Let each training sample update only a small percentage of the model's weights
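A sketch of the subsampling step, using the rule from Mikolov et al. (2013): a word with relative corpus frequency f(w) is kept with probability sqrt(t / f(w)), capped at 1, where t is a small threshold (the paper suggests around 1e-5). The example frequencies are assumptions for illustration:

```python
import math
import random

def keep_probability(word_freq, t=1e-5):
    """Probability of keeping a token of a word with relative
    frequency `word_freq`; very frequent words are often dropped."""
    return min(1.0, math.sqrt(t / word_freq))

def subsample(tokens, freqs, t=1e-5, seed=0):
    """Drop tokens at random according to keep_probability."""
    rng = random.Random(seed)
    return [w for w in tokens if rng.random() < keep_probability(freqs[w], t)]

# "the" occurs in 5% of tokens; a rare word occurs in 0.001%.
print(keep_probability(0.05))   # small: "the" is usually discarded
print(keep_probability(1e-5))   # 1.0: rare words are always kept
```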
■ Randomly select one word u from the surrounding words of w; u and w compose one "positive sample"
■ For a negative sample, keep the same u and randomly choose a word from the dictionary that is not w
■ The unigram distribution is used to select negative words
■ The probability of a word being selected as a negative sample depends on its frequency in the corpus: the higher the frequency, the more likely it is to be chosen as a negative word
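A sketch of the sampler; word2vec draws negatives from the unigram distribution raised to the 3/4 power, which dampens the advantage of very frequent words. The vocabulary counts below are made up:

```python
import random

def make_sampler(freqs, power=0.75, seed=0):
    """Negative-word sampler: a word is drawn with probability
    proportional to its unigram frequency raised to `power`
    (3/4 in word2vec)."""
    rng = random.Random(seed)
    words = list(freqs)
    weights = [freqs[w] ** power for w in words]
    def sample(k):
        return rng.choices(words, weights=weights, k=k)
    return sample

sample = make_sampler({"the": 1000, "cat": 50, "mat": 10})
print(sample(5))  # mostly "the", occasionally rarer words
```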
■ Still use the sigmoid function to train the model
■ Suppose negative sampling yields neg negative samples (context(w), w_i), i = 1, 2, ..., neg, so each training example is (context(w), w, w_1, ..., w_neg)
■ We expect the positive sample to satisfy: sigmoid(x_{context(w)}^T theta^{w}) close to 1
■ We expect the negative samples to satisfy: sigmoid(x_{context(w)}^T theta^{w_i}) close to 0
■ Want to maximize the following log-likelihood: sum over i = 0, ..., neg of [ y_i log sigmoid(x^T theta^{w_i}) + (1 - y_i) log(1 - sigmoid(x^T theta^{w_i})) ], where y_0 = 1 for the positive word w_0 = w and y_i = 0 for the negatives
■ Similarly, compute the gradient to update the parameters
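A minimal sketch of one stochastic gradient step for this objective, with made-up two-dimensional vectors (real implementations use much larger dimensions and vocabulary-sized parameter tables):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(x, thetas, labels, lr=0.025):
    """One SGD update for negative sampling. `x` is the input word
    vector; `thetas` holds one output vector per sample (the positive
    word plus the sampled negatives); `labels` marks them 1 (positive)
    or 0 (negative). Ascends the log-likelihood
    sum_i [y_i log s_i + (1 - y_i) log(1 - s_i)], s_i = sigmoid(x . theta_i)."""
    x_grad = [0.0] * len(x)
    for theta, y in zip(thetas, labels):
        g = y - sigmoid(sum(a * b for a, b in zip(x, theta)))
        for j in range(len(x)):
            x_grad[j] += g * theta[j]   # accumulate gradient w.r.t. x
            theta[j] += lr * g * x[j]   # update output vector in place
    for j in range(len(x)):
        x[j] += lr * x_grad[j]          # update input vector last
    return x, thetas

x = [0.1, -0.2]                       # input vector (illustrative)
thetas = [[0.05, 0.1], [-0.1, 0.2]]   # positive word, then one negative
labels = [1, 0]
for _ in range(2000):
    sgd_step(x, thetas, labels)
```

After training, the sigmoid score of the positive pair rises toward 1 and the negative pair's score falls toward 0, matching the expectations above.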
Mikolov et al., 2013, Distributed Representations of Words and Phrases and their Compositionality. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality
The code for this implementation can be found on my GitHub repo: https://github.com/cassie1102