SLIDE 1

Word Embedding

Praveen Krishnan

CVIT, IIIT Hyderabad

June 22, 2017

SLIDE 2

Outline

◮ Introduction
◮ Classical Methods
◮ Language Modeling
◮ Neural Language Model
◮ Challenges of Softmax
◮ Hierarchical Softmax
◮ Margin Based Hinge Loss
◮ Sampling Based Approaches
◮ Word2Vec
◮ Noise Contrastive Estimation
◮ Negative Sampling

SLIDE 3

Philosophy of Language

“(...) the meaning of a word is its use in the language.”

– Ludwig Wittgenstein, Philosophical Investigations (1953)

Slide Credit: Christian Perone, Word Embeddings - Introduction

SLIDE 4

Word Embedding

◮ Word embeddings are dense representations of words in a low-dimensional vector space that encode the associated semantics.

◮ Introduced by Bengio et al., NIPS'01.

◮ The silver bullet for many NLP tasks.

SLIDE 5

Word Embedding

The Syntactic and Semantic Phenomenon

◮ Morphology, Tense, etc.
◮ Vocabulary Mismatch [Synonymy, Polysemy]
◮ Topic vs. Word Distribution

Reasoning vs. Analogy

w(athens) − w(greece) ≈ w(oslo) − ?
w(apples) − w(apple) ≈ w(oranges) − ?
w(walking) − w(walked) ≈ w(swimming) − ?

SLIDE 6

Word Embedding

The Syntactic and Semantic Phenomenon

◮ Morphology, Tense, etc.
◮ Vocabulary Mismatch [Synonymy, Polysemy]
◮ Topic vs. Word Distribution

Reasoning vs. Analogy

w(athens) − w(greece) ≈ w(oslo) − w(norway)
w(apples) − w(apple) ≈ w(oranges) − w(orange)
w(walking) − w(walked) ≈ w(swimming) − w(swam)
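A minimal sketch of how such analogies can be answered with plain vector arithmetic, assuming word vectors are available in a Python dict; the toy vectors and the helper analogy() below are illustrative, not from a trained model:

import numpy as np

# Toy embeddings for illustration only; real vectors come from a trained model.
emb = {
    "athens": np.array([0.9, 0.1, 0.0]),
    "greece": np.array([0.5, 0.1, 0.0]),
    "oslo":   np.array([0.9, 0.0, 0.4]),
    "norway": np.array([0.5, 0.0, 0.4]),
}

def analogy(a, b, c, emb):
    """Return the word w maximizing cos(w, a - b + c), excluding a, b, c."""
    query = emb[a] - emb[b] + emb[c]
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# w(athens) - w(greece) + w(norway) should land near w(oslo)
print(analogy("athens", "greece", "norway", emb))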

SLIDE 7

Word Embedding

◮ Sparse-to-Dense.
◮ Unsupervised learning.
◮ Typically learned as a by-product* of a language modeling problem.

SLIDE 8

Classical Methods - Topic Modeling

Latent Semantic Analysis [Deerwester et al., 1990]

Project terms and documents into a topic space using SVD on the term-document (co-occurrence) matrix.
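A minimal LSA sketch, assuming scikit-learn is available; the toy corpus, the choice of 2 topics, and the variable names are illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors trade stocks",
]

# Term-document co-occurrence counts (here arranged as a document-term matrix).
vec = CountVectorizer()
X = vec.fit_transform(docs)            # shape: (n_docs, n_terms)

# Truncated SVD projects documents (rows) into a low-dimensional topic space.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)      # (n_docs, 2): document-topic weights
term_topics = svd.components_.T        # (n_terms, 2): term-topic weights

print(doc_topics.round(2))
print(dict(zip(vec.get_feature_names_out(), term_topics.round(2))))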

SLIDE 9

Classical Methods - Topic Modeling

Latent Dirichlet Allocation [Blei et al., 2003]

◮ Assumes a generative probabilistic model of a corpus.
◮ Documents are represented as distributions over latent topics, where each topic is characterized by a distribution over words.

Figure 1: Plate notation representing the LDA model. Source: Wikipedia

SLIDE 10

Language Modeling

Probabilistic Language Modeling

Given a sequence of words, do they form a valid construct in the language?

p(w1, . . . , wT) = ∏i p(wi | w1, . . . , wi−1)

p('high', 'winds', 'tonight') > p('large', 'winds', 'tonight')

Using the Markov assumption:

p(w1, . . . , wT) = ∏i p(wi | wi−1, . . . , wi−n+1)

Applications

Spell correction, Machine translation, Speech recognition, OCR etc.

SLIDE 11

Language Modeling

Probabilistic Language Modeling

◮ n-gram based model:

p(wt | wt−1, . . . , wt−n+1) = count(wt−n+1, . . . , wt−1, wt) / count(wt−n+1, . . . , wt−1)
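A minimal count-based bigram sketch in plain Python of the MLE estimate above (toy corpus, no smoothing, so unseen n-grams get probability zero; the helper p_bigram() is an assumed name):

from collections import Counter

corpus = "high winds tonight . large winds are rare . high winds again".split()

# Count bigrams and their unigram prefixes.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_bigram(w, prev):
    """MLE estimate: count(prev, w) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("winds", "high"))   # 1.0: 'high' is always followed by 'winds' here
print(p_bigram("tonight", "winds"))  # 1/3: one of three 'winds' continuations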

SLIDE 12

Language Model

Neural Probabilistic Language Model

Bengio et al., JMLR'03

SLIDE 13

Language Model

Neural Probabilistic Language Model

p(wt | wt−1, . . . , wt−n+1) = exp(h⊤vwt) / ∑wi∈V exp(h⊤vwi)  ⇒  Softmax layer

Here h is the hidden representation of the input, vwi is the output word embedding of word wi, and V is the vocabulary.

Bengio et al., JMLR'03
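A minimal numpy sketch of this output layer, illustrating why each prediction costs O(|V|): the softmax denominator sums over the whole vocabulary (the sizes and random parameters below are illustrative):

import numpy as np

V, d = 100_000, 128                 # vocabulary size, hidden size (illustrative)
rng = np.random.default_rng(0)
out_emb = rng.normal(size=(V, d))   # output word embeddings v_w, one row per word
h = rng.normal(size=d)              # hidden representation of the context

# p(w | context) = exp(h^T v_w) / sum_{w_i in V} exp(h^T v_{w_i})
scores = out_emb @ h                # |V| dot products: this is the O(|V|) part
scores -= scores.max()              # stabilize before exponentiating
probs = np.exp(scores) / np.exp(scores).sum()

print(probs.shape, probs.sum())     # (100000,) 1.0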

SLIDE 14

Neural Probabilistic Language Model

◮ Associate each word in the vocabulary with a distributed feature vector.
◮ Learn both the embeddings and the parameters of the probability function jointly.

Bengio et al., JMLR'03

SLIDE 15

Softmax Classifier

Figure 2: Predicting the next word with softmax

Slide Credit: Sebastian Ruder, On word embeddings - Part 2: Approximating the Softmax.

SLIDE 16

Challenges of SoftMax

One of the major challenges of the previous formulation is the cost of computing the softmax, which is O(|V|), where |V| is typically > 100K.

Major works:

◮ Softmax-based approaches [bringing more efficiency]
  ◮ Hierarchical Softmax
  ◮ Differentiated Softmax
  ◮ CNN-Softmax
◮ Sampling-based approaches [approximating the softmax using a different loss function]
  ◮ Importance Sampling
  ◮ Margin Based Hinge Loss
  ◮ Noise Contrastive Estimation
  ◮ Negative Sampling
  ◮ ...

SLIDE 17

Hierarchical Softmax

◮ Uses a binary tree representation of the output layer, with one leaf per word in the vocabulary.
◮ Evaluates at most log2(|V|) nodes instead of |V| nodes.
◮ Parameters are stored only at internal nodes, hence the total number of parameters is about the same as for the regular softmax.

p(right | n, c) = σ(h⊤vn)
p(left | n, c) = 1 − p(right | n, c)

Figure 3: Hierarchical Softmax: Morin and Bengio, 2005
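A minimal numpy sketch of the per-word probability along a tree path, assuming the path is given as a list of (internal-node vector, direction) pairs; only O(log2 |V|) sigmoids are evaluated. The helper hs_word_prob() and the toy path are illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_word_prob(h, path):
    """p(word | h) along a binary-tree path.

    path: list of (v_n, direction) where v_n is the internal node's vector and
    direction is +1 for 'right' and -1 for 'left', so that
    p(right | n) = sigmoid(h^T v_n) and p(left | n) = 1 - p(right | n) = sigmoid(-h^T v_n).
    """
    p = 1.0
    for v_n, direction in path:
        p *= sigmoid(direction * np.dot(h, v_n))
    return p

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=d)
# A depth-3 path (illustrative): right, left, right.
path = [(rng.normal(size=d), +1), (rng.normal(size=d), -1), (rng.normal(size=d), +1)]
print(hs_word_prob(h, path))

In word2vec-style models the path for each word comes from the tree construction discussed on the next slide (e.g. a Huffman tree).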

SLIDE 18

Hierarchical Softmax

Figure 4: Hierarchical Softmax: Hugo Larochelle's YouTube lectures

◮ The structure of the tree is important for further efficiency in computation and performance.
◮ Examples:
  ◮ Morin and Bengio: use synsets in WordNet as the clusters of the tree.
  ◮ Mikolov et al.: use a Huffman tree, which takes into account the frequency of words.

SLIDE 19

Margin Based Hinge Loss

C&W Model

◮ Avoids computing the expensive softmax by reformulating the objective.
◮ Train a network to produce higher scores for correct word windows than for incorrect ones.
◮ The pairwise ranking criterion is given as:

Jθ = ∑x∈X ∑w∈V max{0, 1 − fθ(x) + fθ(x(w))}

Here x is a correct window, x(w) is an incorrect window created by replacing the center word with w, and fθ(x) is the score output by the model.
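A minimal numpy sketch of this pairwise ranking criterion, assuming some scoring function fθ; here fθ is a fixed random linear scorer over averaged word vectors, purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
d, V = 16, 50
emb = rng.normal(size=(V, d))        # word vectors (illustrative, untrained)
w_out = rng.normal(size=d)           # scorer parameters

def f_theta(window):
    """Score a window of word ids (here: a linear score of the averaged vectors)."""
    return float(w_out @ emb[window].mean(axis=0))

def cw_hinge_loss(correct_window, vocab_size, center=None):
    """sum_w max(0, 1 - f(x) + f(x^(w))) over corrupted center words w."""
    center = len(correct_window) // 2 if center is None else center
    loss = 0.0
    for w in range(vocab_size):
        corrupted = list(correct_window)
        corrupted[center] = w        # replace the center word with w
        loss += max(0.0, 1.0 - f_theta(correct_window) + f_theta(corrupted))
    return loss

window = [3, 7, 11, 2, 9]            # a 'correct' 5-word window of word ids
print(cw_hinge_loss(window, V))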

SLIDE 20

Sampling Based Approaches

Sampling-based approaches approximate the softmax with an alternative loss function that is cheaper to compute.

Interpreting the logistic loss function

Jθ = − log [ exp(h⊤v′w) / ∑wi∈V exp(h⊤v′wi) ]

Jθ = − h⊤v′w + log ∑wi∈V exp(h⊤v′wi)

Computing the gradient w.r.t. the model parameters, we get:

∇θJθ = ∇θE(w) − ∑wi∈V P(wi) ∇θE(wi), where −E(w) = h⊤v′w

SLIDE 21

Sampling Based Approaches

∇θJθ = ∇θE(w) − ∑wi∈V P(wi) ∇θE(wi)

The gradient has two parts:

◮ Positive reinforcement for the target word.
◮ Negative reinforcement for all other words, each weighted by its probability.

∑wi∈V P(wi) ∇θE(wi) = Ewi∼P[∇θE(wi)]

To avoid computing this term over the full vocabulary, all sampling-based approaches approximate the negative reinforcement.
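A minimal numpy sketch of that last identity: the negative reinforcement is an expectation under the model distribution P, so it can be estimated from a few samples instead of a sum over the full vocabulary (toy sizes, and a random stand-in for the gradient of E):

import numpy as np

rng = np.random.default_rng(0)
V = 10_000
P = rng.random(V)
P = P / P.sum()                        # model distribution over the vocabulary
grad_E = rng.normal(size=V)            # stand-in for (a scalar slice of) grad E(w_i)

# Exact negative reinforcement: a sum over the full vocabulary, O(|V|).
exact = np.sum(P * grad_E)

# Monte Carlo estimate: average over m samples w_i ~ P, O(m).
m = 50
samples = rng.choice(V, size=m, p=P)
estimate = grad_E[samples].mean()

print(exact, estimate)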

SLIDE 22

Word2Vec

◮ Proposed by Mikolov et al. and widely used for many NLP applications.
◮ Key features:
  ◮ Removed the hidden layer.
  ◮ Use of additional context for training LMs.
  ◮ Introduced newer training strategies that use huge databases of words efficiently.

Word Analogies

Mikolov et al., 2013

SLIDE 23

Word2Vec - Model Architectures

Continuous Bag-of-Words

◮ All words in the context get projected to the same position.
◮ The context is defined using both history and future words.
◮ The order of words in the context does not matter.
◮ Uses a log-linear classifier model.

Jθ = (1/T) ∑t=1..T log p(wt | wt−n, · · · , wt−1, wt+1, · · · , wt+n)

Mikolov et al., 2013

SLIDE 24

Word2Vec - Model Architectures

Continuous Skip-gram Model

◮ Given the current word, predict the words in the context within a certain range.
◮ The rest of the ideas follow CBOW.

Jθ = (1/T) ∑t=1..T ∑−n≤j≤n, j≠0 log p(wt+j | wt)

Here, p(wt+j | wt) = exp(vwt⊤ v′wt+j) / ∑wi∈V exp(vwt⊤ v′wi)

Mikolov et al., 2013
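A minimal sketch of how skip-gram forms (center, context) training pairs and scores p(wt+j | wt) with two embedding matrices (toy corpus, random untrained vectors; a real implementation replaces this full softmax with one of the approximations discussed next):

import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

def skipgram_pairs(tokens, n=2):
    """Yield (center_id, context_id) pairs for a window of n words on each side."""
    for t, w in enumerate(tokens):
        for j in range(-n, n + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield idx[w], idx[tokens[t + j]]

rng = np.random.default_rng(0)
d = 8
V_in = rng.normal(size=(len(vocab), d))    # input vectors v_w
V_out = rng.normal(size=(len(vocab), d))   # output vectors v'_w

def p_context_given_center(center_id, context_id):
    """p(w_{t+j} | w_t) with a full softmax over the vocabulary."""
    scores = V_out @ V_in[center_id]
    scores -= scores.max()                  # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context_id]

pairs = list(skipgram_pairs(corpus))
print(pairs[:5])                            # first few (center, context) pairs
print(p_context_given_center(*pairs[0]))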

SLIDE 25

Noise Contrastive Estimation

Key Idea

Similar to the margin-based hinge loss, learn a classifier that differentiates between a target word and noise.

◮ Formulated as a binary classification problem.
◮ Minimizes the cross-entropy (logistic) loss.
◮ Draws k noise samples from a noise distribution for each correct word.
◮ Approximates the softmax as k increases.

Gutmann et al., 2010

SLIDE 26

Noise Contrastive Estimation

Distributions

◮ Empirical (Ptrain): the actual distribution given by the training samples.
◮ Noise (Q):
  ◮ Easy to sample from.
  ◮ Allows an analytical expression for the log pdf.
  ◮ Close to the actual data distribution, e.g. uniform or empirical unigram.
◮ Model (P): an approximation to the empirical distribution.

Notation

For every correct word wi along with its context ci, we generate k noise samples w̃ij from a noise distribution Q. The label is y = 1 for all correct words and y = 0 for noise samples.

Gutmann et al., 2010

SLIDE 27

Noise Contrastive Estimation

Objective Function

Jθ = − ∑wi∈V [ log P(y = 1 | wi, ci) + k · Ew̃ij∼Q[ log P(y = 0 | w̃ij, ci) ] ]

Using a Monte Carlo approximation:

Jθ = − ∑wi∈V [ log P(y = 1 | wi, ci) + k ∑j=1..k (1/k) log P(y = 0 | w̃ij, ci) ]

Gutmann et al., 2010

SLIDE 28

Noise Contrastive Estimation

The conditional distribution is given as:

P(y = 1 | w, c) = [ (1/(k+1)) Ptrain(w | c) ] / [ (1/(k+1)) Ptrain(w | c) + (k/(k+1)) Q(w) ]

P(y = 1 | w, c) = Ptrain(w | c) / [ Ptrain(w | c) + k Q(w) ]

Using the model distribution:

P(y = 1 | w, c) = P(w | c) / [ P(w | c) + k Q(w) ]

Here P(w | c) = exp(h⊤v′w) / ∑wi∈V exp(h⊤v′wi) corresponds to the softmax function.

Gutmann et al., 2010

SLIDE 29

Noise Contrastive Estimation

NCE Trick

Treats the normalization denominator as a parameter that the model can learn; also called self-normalization.

P(w | c) = exp(h⊤v′w) / Z(c)

The modified conditional distribution using Z(c) = 1 [Mnih et al., 2012]:

P(y = 1 | w, c) = exp(h⊤v′w) / [ exp(h⊤v′w) + k Q(w) ]

NCE Loss

Jθ = − ∑wi∈V [ log ( exp(h⊤v′wi) / (exp(h⊤v′wi) + k Q(wi)) ) + ∑j=1..k log ( 1 − exp(h⊤v′w̃ij) / (exp(h⊤v′w̃ij) + k Q(w̃ij)) ) ]

Gutmann et al., 2010
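A minimal numpy sketch of this NCE loss for a single (context, target) pair, assuming the self-normalized score exp(h⊤v′w) with Z(c) = 1 and a random stand-in for the noise distribution Q (toy sizes; nce_loss_one() is an assumed name):

import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1_000, 16, 10
V_out = rng.normal(size=(V, d), scale=0.1)  # output embeddings v'_w
h = rng.normal(size=d)                      # hidden representation of the context
Q = rng.random(V)
Q = Q / Q.sum()                             # noise distribution (e.g. empirical unigram)

def s(word_id):
    """Self-normalized score exp(h^T v'_w), i.e. Z(c) = 1."""
    return np.exp(h @ V_out[word_id])

def nce_loss_one(target_id):
    """-[ log P(y=1 | target) + sum_j log P(y=0 | noise_j) ] for one target word."""
    noise_ids = rng.choice(V, size=k, p=Q)
    loss = -np.log(s(target_id) / (s(target_id) + k * Q[target_id]))
    for wj in noise_ids:
        loss -= np.log(1.0 - s(wj) / (s(wj) + k * Q[wj]))
    return loss

print(nce_loss_one(target_id=42))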

SLIDE 30

Negative Sampling

◮ Negative sampling (NEG) is a simplification of NCE.
◮ No longer approximates the softmax: the goal is to learn high-quality word embeddings, so it is not appropriate for language modeling.

Recall the conditional probability of NCE:

P(y = 1 | w, c) = exp(h⊤v′w) / [ exp(h⊤v′w) + k Q(w) ]

NEG sets k Q(w) = 1, which gives:

P(y = 1 | w, c) = exp(h⊤v′w) / [ exp(h⊤v′w) + 1 ] = 1 / (1 + exp(−h⊤v′w))

Mikolov et al., 2013

SLIDE 31

Negative Sampling

NEG Loss

Plugging back into the logistic regression loss, we get:

Jθ = − ∑wi∈V [ log ( 1 / (1 + exp(−h⊤v′wi)) ) + ∑j=1..k log ( 1 − 1 / (1 + exp(−h⊤v′w̃ij)) ) ]

Jθ = − ∑wi∈V [ log σ(h⊤v′wi) + ∑j=1..k log σ(−h⊤v′w̃ij) ]

Here σ(x) = 1 / (1 + exp(−x)). The NEG loss is equivalent to NCE when k = |V| and Q is uniform.

Mikolov et al., 2013
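A minimal numpy sketch of the NEG loss for a single (context, target) pair (toy sizes, random untrained parameters; neg_loss_one() is an assumed name):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, d, k = 1_000, 16, 5
V_out = rng.normal(size=(V, d), scale=0.1)  # output embeddings v'_w
h = rng.normal(size=d)                      # hidden representation of the context
Q = rng.random(V)
Q = Q / Q.sum()                             # noise distribution

def neg_loss_one(target_id):
    """-[ log sigma(h^T v'_target) + sum_j log sigma(-h^T v'_noise_j) ]"""
    noise_ids = rng.choice(V, size=k, p=Q)
    loss = -np.log(sigmoid(h @ V_out[target_id]))
    for wj in noise_ids:
        loss -= np.log(sigmoid(-(h @ V_out[wj])))
    return loss

print(neg_loss_one(target_id=42))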

SLIDE 32

References

Sebastian Ruder, On word embeddings - Part 1, http://sebastianruder.com/word-embeddings-1/index.html

Sebastian Ruder, On word embeddings - Part 2: Approximating the Softmax, http://sebastianruder.com/word-embeddings-softmax/index.html

SLIDE 33

Thanks for your attention.
