

SLIDE 1

OPTIMIZATION OF SKIP-GRAM MODEL

Chenxi Wu Final Presentation for STA 790

SLIDE 2

Word Embedding

■ Map words to vectors of real numbers
■ The earliest word representation is the "one-hot representation"
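A minimal sketch of the one-hot representation (the toy vocabulary below is invented for illustration, not taken from the slides):

```python
import numpy as np

# Toy vocabulary; the indices are arbitrary.
vocab = {"king": 0, "queen": 1, "apple": 2}

def one_hot(word, vocab):
    """One-hot vector: as long as the vocabulary, with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

print(one_hot("queen", vocab))                           # [0. 1. 0.]
# Any two distinct one-hot vectors are orthogonal, so this
# representation carries no notion of similarity between words.
print(one_hot("king", vocab) @ one_hot("queen", vocab))  # 0.0
```

That limitation motivates the distributed representation on the next slide: dense, low-dimensional vectors whose geometry reflects word similarity.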

SLIDE 3

Word Embedding

■ Distributed representation

SLIDE 4

Word2Vec

■ An unsupervised NLP method developed by Google in 2013
■ Quantifies the relationship between words

[Diagram: word2vec comprises the Skip-gram and CBOW architectures, trained with either Hierarchical Softmax or Negative Sampling]
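As a practical illustration of the diagram (not part of the slides), here is a minimal sketch using the gensim library; parameter names assume gensim >= 4.0, and the toy corpus is invented:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real training needs far more text).
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["apples", "and", "oranges", "are", "fruit"],
]

# sg=1 selects the Skip-gram architecture (sg=0 would be CBOW);
# hs=1 trains with Hierarchical Softmax, while negative=5 would use Negative Sampling.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1, hs=1, negative=0)

# Quantify the relationship between words via cosine similarity of their vectors.
print(model.wv.similarity("king", "queen"))
print(model.wv.most_similar("king", topn=3))
```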

SLIDE 5

Skip-gram

■ Input: the vector representation of a specific word
■ Output: the vectors of the context words surrounding this word
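A small sketch (mine, not the slide's) of how Skip-gram forms its training examples: every word inside a window around the center word becomes a target the model should predict from that center word:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center_word, context_word) training pairs for Skip-gram."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentence = ["the", "quick", "brown", "fox", "jumps"]
print(list(skipgram_pairs(sentence, window=1)))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ...]
```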

SLIDE 6

DNN (Deep Neural Network)

SLIDE 7

Huffman Tree

■ Leaf nodes denote all words in the vocabulary
■ The leaf nodes act as neurons in the output layer, and the internal nodes act as hidden neurons
■ Input: n weights f1, f2, ..., fn (the frequency of each word in the corpus)
■ Output: the corresponding Huffman tree
■ Benefit: common words have shorter Huffman codes

SLIDE 8

■ (1) Treat f1, f2, ..., fn as a forest of n trees, each with a single node;
■ (2) Select the two trees with the smallest weights in the forest and merge them as the left and right subtrees of a new tree; the weight of the new tree's root is the sum of the weights of its two children;
■ (3) Delete the two selected trees from the forest and add the new tree to the forest;
■ (4) Repeat steps (2) and (3) until only one tree remains in the forest (a sketch of this procedure follows below)
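A compact sketch of the procedure above (word frequencies and helper names are mine, invented for illustration). A heap plays the role of the forest, and the two lightest trees are merged repeatedly:

```python
import heapq
import itertools

def build_huffman(freqs):
    """Build a Huffman tree from {word: frequency} and return {word: code}."""
    counter = itertools.count()  # tie-breaker keeps heap entries comparable
    # (1) a forest of single-node trees, one per word
    heap = [(f, next(counter), word) for word, f in freqs.items()]
    heapq.heapify(heap)
    codes = {w: "" for w in freqs}
    while len(heap) > 1:
        # (2) merge the two trees with the smallest weights
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        for w in _leaf_words(left):
            codes[w] = "0" + codes[w]
        for w in _leaf_words(right):
            codes[w] = "1" + codes[w]
        # (3) put the merged tree, weighted by the sum, back into the forest
        heapq.heappush(heap, (f1 + f2, next(counter), (left, right)))
    # (4) the loop ends when only one tree remains
    return codes

def _leaf_words(node):
    """Collect the leaf words under a node (a word or a nested pair)."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return _leaf_words(left) + _leaf_words(right)

print(build_huffman({"the": 50, "on": 20, "cat": 10, "sat": 8, "mat": 3}))
# Frequent words such as "the" receive the shortest codes.
```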

SLIDE 9

Hierarchical Softmax

SLIDE 10

HS Details

SLIDE 11

HS Details

■ Use the sigmoid function to decide whether to go left (+) or right (-) at each internal node
■ In the example above, w is "hierarchical"
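A rough sketch of that decision rule (notation and helper names are mine): at each internal node with parameter vector theta, the walk takes one branch with probability sigma(x . theta) and the other with 1 - sigma(x . theta), and the probability of the target word is the product along its Huffman path:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def path_probability(x, node_vectors, code):
    """Probability of reaching the leaf whose Huffman code is `code`.

    x            -- vector of the input word
    node_vectors -- parameter vector theta_j of each internal node on the path
    code         -- Huffman code of the target word, e.g. "010"
    """
    p = 1.0
    for theta, bit in zip(node_vectors, code):
        s = sigmoid(x @ theta)
        p *= s if bit == "0" else (1.0 - s)  # which bit means "left" is a convention
    return p

rng = np.random.default_rng(0)
x = rng.normal(size=5)
thetas = [rng.normal(size=5) for _ in range(3)]
print(path_probability(x, thetas, "010"))
```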

SLIDE 12

HS Target Function
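A standard form of the hierarchical-softmax objective for word2vec, which this slide presumably presents (the notation below is mine, not the slide's):

```latex
% For a target word u with Huffman code d_2^u \dots d_{l^u}^u and
% internal-node vectors \theta_1^u \dots \theta_{l^u-1}^u, given center word w:
p(u \mid w) = \prod_{j=2}^{l^u}
    \bigl[\sigma(\mathbf{v}_w^{\top}\theta_{j-1}^{u})\bigr]^{1-d_j^{u}}
    \bigl[1-\sigma(\mathbf{v}_w^{\top}\theta_{j-1}^{u})\bigr]^{d_j^{u}},
\qquad
\mathcal{L} = \sum_{w}\sum_{u \in \mathrm{context}(w)} \log p(u \mid w).
```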

SLIDE 13

HS Gradient
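A standard sketch of the gradients for one term of the objective above (notation mine); these are the updates applied during training:

```latex
% One term of the sum, for target word u and path position j:
L(w,u,j) = (1-d_j^{u})\,\log\sigma(\mathbf{v}_w^{\top}\theta_{j-1}^{u})
         + d_j^{u}\,\log\bigl(1-\sigma(\mathbf{v}_w^{\top}\theta_{j-1}^{u})\bigr)
% Gradient with respect to the internal-node vector:
\frac{\partial L(w,u,j)}{\partial \theta_{j-1}^{u}}
  = \bigl(1 - d_j^{u} - \sigma(\mathbf{v}_w^{\top}\theta_{j-1}^{u})\bigr)\,\mathbf{v}_w
% Gradient with respect to the input word vector:
\frac{\partial L(w,u,j)}{\partial \mathbf{v}_w}
  = \bigl(1 - d_j^{u} - \sigma(\mathbf{v}_w^{\top}\theta_{j-1}^{u})\bigr)\,\theta_{j-1}^{u}
```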

SLIDE 14

Negative Sampling

■ An alternative method for training the Skip-gram model
■ Subsamples frequent words to decrease the number of training examples
■ Lets each training sample update only a small percentage of the model's weights
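The subsampling rule itself is not spelled out on the slide; in the referenced Mikolov et al. (2013) paper, each occurrence of a word w_i is discarded with probability

```latex
% f(w_i): relative frequency of the word; t: threshold, typically around 10^{-5}
P(\text{discard } w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}
```

so very frequent words contribute far fewer training examples.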

SLIDE 15

Negative sample

■ For a word w, randomly select one word u from its surrounding words; u and w compose one "positive sample"
■ For a negative sample, keep the same u and randomly choose a word from the dictionary that is not w

SLIDE 16

Sampling method

■ The unigram distribution is used to select negative words
■ The probability of a word being selected as a negative sample depends on its frequency of occurrence: the higher the frequency, the more likely it is to be chosen as a negative word (see the sampler sketch below)
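A small sketch of such a sampler (mine, not the slide's; the 3/4 exponent is the choice made in the referenced word2vec paper rather than something stated on the slide):

```python
import numpy as np

def make_negative_sampler(word_freqs, power=0.75, seed=0):
    """Return a function drawing negative words from a unigram^power distribution."""
    words = list(word_freqs)
    probs = np.array([word_freqs[w] for w in words], dtype=float) ** power
    probs /= probs.sum()
    rng = np.random.default_rng(seed)

    def sample(exclude, k=5):
        """Draw k negative words, rejecting any draw equal to `exclude`."""
        out = []
        while len(out) < k:
            w = rng.choice(words, p=probs)
            if w != exclude:
                out.append(w)
        return out

    return sample

sampler = make_negative_sampler({"the": 500, "cat": 30, "sat": 20, "mat": 5})
print(sampler(exclude="cat", k=3))  # frequent words such as "the" dominate the draws
```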

SLIDE 17

NS Details

■ Still use the sigmoid function to train the model
■ Suppose that through negative sampling we obtain neg negative samples (context(w), w_i), i = 1, 2, ..., neg, so each training sample is (context(w), w, w_1, ..., w_neg)
■ We expect the positive sample and the negative samples to satisfy the two conditions below
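The two conditions referenced above appear on the slide as formulas; in the standard negative-sampling setup they can be written as follows (x is the input word vector and theta^w the output-side vector of word w; notation mine):

```latex
% Positive sample: the model output should be close to 1
\sigma(\mathbf{x}^{\top}\theta^{w}) \approx 1
% Negative samples, i = 1, \dots, \mathrm{neg}: the output should be close to 0
\sigma(\mathbf{x}^{\top}\theta^{w_i}) \approx 0
```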

SLIDE 18

NS Details

■ Want to maximize the following log-likelihood
■ Similarly, compute the gradient to update the parameters
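A standard form of that log-likelihood (presumably what the slide shows; notation follows the conditions above):

```latex
\mathcal{L} = \log\sigma(\mathbf{x}^{\top}\theta^{w})
  + \sum_{i=1}^{\mathrm{neg}} \log\bigl(1 - \sigma(\mathbf{x}^{\top}\theta^{w_i})\bigr)
```

Maximizing this pushes the positive pair's score toward 1 and each negative pair's score toward 0, and the parameter updates follow from its gradient just as in the hierarchical-softmax case.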

SLIDE 19

Reference

Mikolov et al., 2013, Distributed Representations of Words and Phrases and their Compositionality. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality

The code for this implementation can be found on my GitHub repo: https://github.com/cassie1102