
SLIDE 1

word2vec

Kuan-Ting Lai, 2020/5/28

SLIDE 2

Word2vec (Word Embeddings)

  • Embed one-hot encoded word vectors into dense vectors
  • Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." In Advances in Neural Information Processing Systems, pp. 3111-3119. 2013.

SLIDE 3

Why Word Embeddings?

https://www.tensorflow.org/tutorials/representation/word2vec

SLIDE 4

Vector Space Models for Natural Language

  • Count-based methods:
    − Count how often a word co-occurs with its neighbor words
    − Example: Latent Semantic Analysis (LSA)
  • Predictive methods:
    − Predict a word from its neighbors
    − Examples: Continuous Bag-of-Words (CBOW) and Skip-Gram models

SLIDE 5

Continuous Bag-of-Words vs. Skip-Gram

SLIDE 6

Word2Vec Tutorial

  • Word2Vec Tutorial - The Skip-Gram Model
  • Word2Vec Tutorial - Negative Sampling

Chris McCormick, http://mccormickml.com/tutorials/

SLIDE 7

N-Gram Model

  • Use a sequence of N words to predict the next word
  • Example (N = 3), as in the sketch below:
    − (The, quick, brown) -> fox

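As a small illustration (not from the slides; the sentence and variable names are just examples), one way to build such N = 3 context/target pairs in Python:

# Build (context, next-word) training pairs for N = 3
sentence = "the quick brown fox jumps over the lazy dog".split()
n = 3
pairs = [(tuple(sentence[i:i + n]), sentence[i + n])
         for i in range(len(sentence) - n)]
print(pairs[0])  # (('the', 'quick', 'brown'), 'fox')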

SLIDE 8

Skip-Gram Model

  • Window size of 2

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
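As an assumed sketch (not the tutorial's code), the following Python snippet generates the (center word, context word) training pairs that a window size of 2 produces:

# Generate (center, context) skip-gram training pairs with a window size of 2
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2
pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))
print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]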

SLIDE 9

Neural Network for Skip-Gram

No activation function on the hidden layer: it is a purely linear projection, and only the output layer applies a softmax (see the sketch below)
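A minimal NumPy sketch (not the slide's figure) of this architecture, with an assumed vocabulary size, embedding dimension, and input word index:

import numpy as np

vocab_size, embedding_dim = 10000, 300
W_in = np.random.randn(vocab_size, embedding_dim) * 0.01   # input -> hidden weights
W_out = np.random.randn(embedding_dim, vocab_size) * 0.01  # hidden -> output weights

x = np.zeros(vocab_size)
x[42] = 1.0                       # one-hot encoded input word (index chosen arbitrarily)

hidden = x @ W_in                 # linear hidden layer: no activation function
logits = hidden @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()              # softmax over the vocabulary: "nearby word" probabilities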

SLIDE 10

SLIDE 11

Hidden Layer as Look-up Table

  • One-hot vector selects the matrix row corresponding to the “1”
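A tiny NumPy sketch of this equivalence (the matrix size and the word index are assumed):

import numpy as np

vocab_size, embedding_dim = 10000, 300
W = np.random.randn(vocab_size, embedding_dim)   # hidden-layer weight matrix

one_hot = np.zeros(vocab_size)
one_hot[7] = 1.0                                 # hypothetical word index 7

# Multiplying by the one-hot vector just selects row 7 of the weight matrix
assert np.allclose(one_hot @ W, W[7])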
SLIDE 12

The Output Layer (Softmax)

  • Outputs the probability that each word in the vocabulary appears near the input word (e.g., “car” near “ants”)
  • The outputs sum to 1
SLIDE 13

Softmax Function

  • P(w_t | h) = softmax(score(w_t, h))
              = exp{score(w_t, h)} / Σ_{word w' in vocab} exp{score(w', h)}
  • score(w_t, h) computes the compatibility of word w_t with the context h (a dot product is used)
  • Train the model by maximizing its log-likelihood:
    − log P(w_t | h) = score(w_t, h) − log Σ_{word w' in vocab} exp{score(w', h)}
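As an assumed NumPy sketch of these two expressions (random vectors stand in for the learned embeddings; shapes are illustrative):

import numpy as np

def log_prob(target_vec, context_vec, all_word_vecs):
    scores = all_word_vecs @ context_vec       # score(w', h) for every word in the vocab
    target_score = target_vec @ context_vec    # score(w_t, h), a dot product
    # log P(w_t | h) = score(w_t, h) - log sum_w' exp{score(w', h)}
    return target_score - np.log(np.exp(scores).sum())

vocab_size, dim = 1000, 50
all_word_vecs = np.random.randn(vocab_size, dim)
h = np.random.randn(dim)                       # context vector
print(log_prob(all_word_vecs[3], h, all_word_vecs))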

SLIDE 14

Sampling Important Words

  • Subsample very frequent, non-informative words such as “the”
SLIDE 15

Probability of Keeping the Word

  • z(w_i) is the occurrence rate of word w_i (the fraction of the corpus it accounts for)
  • P(w_i) is the probability of keeping the word
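The slide does not show the formula itself; as a hedged sketch, the keep probability used in the word2vec C code (as described in McCormick's tutorial), with its default sampling parameter 0.001, can be written as:

import math

def keep_probability(z, sample=0.001):
    # z: occurrence rate of the word (fraction of all corpus words that are this word)
    return (math.sqrt(z / sample) + 1) * (sample / z)

print(keep_probability(0.0001))  # rare word: value >= 1, so it is always kept
print(keep_probability(0.01))    # very frequent word (like "the"): kept less than half the time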
SLIDE 16

Negative Sampling

  • Problem: too many parameters to learn during training
  • Solution:
    − Select only a few other words as negative samples (their target output is “0”)
    − The original paper selected 5-20 negative words for small datasets; 2-5 words work for large datasets
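A rough NumPy sketch of the negative-sampling idea (the shapes, word indices, and uniform sampling of negatives below are assumptions for illustration; the paper draws negatives from a smoothed unigram distribution):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vocab_size, dim, k = 10000, 300, 5
W_in = np.random.randn(vocab_size, dim) * 0.01    # input (center-word) embeddings
W_out = np.random.randn(vocab_size, dim) * 0.01   # output (context-word) embeddings

center, positive = 10, 42                         # hypothetical word indices
negatives = np.random.randint(0, vocab_size, k)   # k sampled negative words

h = W_in[center]
loss = -np.log(sigmoid(W_out[positive] @ h))          # push the true context word toward 1
loss -= np.log(sigmoid(-W_out[negatives] @ h)).sum()  # push the k negative words toward 0
print(loss)   # only k + 1 output rows are touched instead of the whole vocabulary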

SLIDE 17

Negative Sampling

  • P(w_t | h) = softmax(score(w_t, h))
              = exp{score(w_t, h)} / Σ_{word w' in vocab} exp{score(w', h)}
  • log P(w_t | h) = score(w_t, h) − log Σ_{word w' in vocab} exp{score(w', h)}
  • Negative sampling reduces the number of words summed over in the second term
SLIDE 18

Evaluate Word2Vec

SLIDE 19

Vector Addition & Subtraction

  • vec(“Russia”) + vec(“river”) ≈ vec(“Volga River”)
  • vec(“Germany”) + vec(“capital”) ≈ vec(“Berlin”)
  • vec(“King”) - vec(“man”) + vec(“woman”) ≈ vec(“Queen”)
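These analogies can be reproduced with pretrained vectors; as a hedged example using gensim's downloader (the model name below is the pretrained Google News word2vec model, a large download of roughly 1.6 GB, and the exact neighbors depend on the model):

import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # returns a KeyedVectors object
# "queen" typically appears at or near the top of this list
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))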
SLIDE 20

Embedding in Keras

  • Input dimension: size of the vocabulary, i.e. the number of distinct word indices (the dimension of the one-hot encoding)
  • Output dimension: dimension of the embedding vector

from keras.layers import Embedding

embedding_layer = Embedding(1000, 64)
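As a small usage sketch (assuming a TensorFlow backend; the indices below are arbitrary), the layer maps integer word indices of shape (samples, sequence_length) to vectors of shape (samples, sequence_length, output_dim):

import numpy as np
from keras.layers import Embedding

embedding_layer = Embedding(1000, 64)
batch = np.array([[4, 20, 7], [12, 1, 0]])   # two sequences of three word indices each
print(embedding_layer(batch).shape)          # (2, 3, 64)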

SLIDE 21

Using Embedding to Classify IMDB Data

from keras.datasets import imdb
from keras import preprocessing
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding

max_features = 10000  # Number of words to consider as features
maxlen = 20           # Select only the first 20 words of each text (for demo)

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Turn the lists of integers into a 2D integer tensor of shape (samples, maxlen)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
# Specify the max input length to the Embedding layer so we can later flatten the embedded
# inputs. After the Embedding layer, the activations have shape (samples, maxlen, 8).
model.add(Embedding(10000, 8, input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
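A possible follow-up step (not shown on the slide) is to evaluate the trained model on the held-out test split prepared above:

test_loss, test_acc = model.evaluate(x_test, y_test)
print(test_acc)   # accuracy well above chance, even with only 20 words per review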

SLIDE 22

GloVe: Global Vectors for Word Representation

  • Developed by Stanford in 2014
  • Based on Matrix Factorization of Word Co-occurrence
  • https://nlp.stanford.edu/projects/glove/
  • Assumption

− Ratios of word-word co-occurrence probabilities encode some form of meaning
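For example (from the GloVe paper), comparing the context words of “ice” and “steam”: the ratio P(k | ice) / P(k | steam) is large when k = “solid”, small when k = “gas”, and close to 1 for words related to both (“water”) or to neither (“fashion”).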

SLIDE 23

Using Pretrained Word Embedding Vectors (2-1)

import os
import numpy as np

# Preprocessing the embeddings
glove_dir = './glove/'

embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.' % len(embeddings_index))  # Found 400000 word vectors.

# Create a word embedding matrix
# (max_words and word_index come from the earlier tokenization step; see the linked notebook)
embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in the embedding index will be all zeros
            embedding_matrix[i] = embedding_vector

SLIDE 24

Using Pretrained Word Embedding Vectors (2-2)

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

# Load the GloVe embeddings into the model and freeze the embedding layer
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10, batch_size=32,
                    validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')

https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/6.1-using-word-embeddings.ipynb

SLIDE 25

Classifying IMDB Reviews

(Comparison of results: with pretrained model vs. without pretrained model)

SLIDE 26

Embedding Projector (projector.tensorflow.org)

SLIDE 27

SLIDE 28

Neighbors of “Learning”

SLIDE 29

Image Hashtag Recommendation

  • Hashtag => a word or phrase preceded by the symbol # that categorizes the accompanying text
  • Created by Twitter, now supported by all social networks
  • Instagram hashtag statistics (2017), in millions of posts:
    − #love 1,165; #instagood 659.6; #photooftheday 458.5; #fashion 426.9; #beautiful 424; #happy 396.5; #tbt 389.5; #like4like 389.3; #cute 389.3; #followme 360.5; #picoftheday 344.5; #follow 344.3; #me 334.1; #selfie 319.4; #summer 318.2
Latest stats: izea.com/2018/06/07/top-instagram-hashtags-2018

SLIDE 30

Difficulties of Predicting Image Hashtag

  • Abstraction: #love, #cute,...
  • Abbreviation: #ootd, #ootn,…
  • Emotion: #happy,…
  • Obscurity: #motivation, #lol,…
  • New-creation: #EvaChenPose,…
  • No-relevance: #tbt, #nofilter, #vscocam
  • Location: #NYC, #London

(Example images tagged #ootn, #ootd, #tbt, #FromWereIStand, #Selfie, #EvaChenPose)

SLIDE 31

Zero-Shot Learning

  • Identify objects that you’ve never seen before
  • More formal definition:

− Classify test classes Z with zero labeled data (Zero-shot!)

SLIDE 32

Zero-Shot Formulation

  • Describe objects by words

− Use attributes (semantic features)

SLIDE 33

DeViSE – Deep Visual Semantic Embedding

  • Google, NIPS, 2013
SLIDE 34

User Conditional Hashtag Prediction for Images

  • E. Denton, J. Weston, M. Paluri, L. Bourdev, and R. Fergus, “User Conditional Hashtag Prediction for Images,” ACM SIGKDD, 2015 (Facebook)
  • Hashtag embedding with three proposed models:
    1. Bilinear embedding model
    2. User-biased model
    3. User-multiplicative model

SLIDE 35

User Meta Data

User Profile and Locations

SLIDE 36

Facebook’s Experiments

  • 20 million images
  • 4.6 million hashtags, average 2.7 tags per image
  • Result
SLIDE 37

Real World Applications

mccormickml.com/2018/06/15/applying-word2vec-to-recommenders-and-advertising/

SLIDE 38

References

  1. Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.
  2. Goldberg, Yoav, and Omer Levy. "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method." arXiv preprint arXiv:1402.3722 (2014).
  3. https://www.tensorflow.org/tutorials/representation/word2vec
  4. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
  5. https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/