Natural Language Processing with Deep Learning – Word Embeddings



SLIDE 1

Institute of Computational Perception

Natural Language Processing with Deep Learning Word Embeddings

Navid Rekab-Saz

navid.rekabsaz@jku.at

SLIDE 2

Agenda

  • Introduction
  • Count-based word representation
  • Prediction-based word embedding
SLIDE 3

Agenda

  • Introduction
  • Count-based word representation
  • Prediction-based word embedding
SLIDE 4

Distributional Representation

§ An entity is represented with a vector of 𝑒 dimensions
§ Distributed Representations
  • Each dimension (unit) is a feature of the entity
  • Units in a layer are not mutually exclusive
  • Two units can be “active” at the same time

[Figure: a distributed representation 𝒚 = (y1, y2, y3, …, y_𝑒) with 𝑒 dimensions]

SLIDE 5

Word Embedding Model

[Figure: an embedding model maps each word w1, w2, …, wO to an 𝑒-dimensional vector]

When vector representations are dense, they are often called embeddings, e.g. word embeddings.

SLIDE 6

Word embeddings projected to a two-dimensional space

SLIDE 7

Word Embeddings – Nearest neighbors

  • frog: frogs, toad, litoria, leptodactylidae, rana
  • book: books, foreword, author, published, preface
  • asthma: bronchitis, allergy, allergies, arthritis, diabetes

https://nlp.stanford.edu/projects/glove/
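The nearest-neighbor lists above can be reproduced with pretrained GloVe vectors. A minimal sketch (not part of the slides), assuming the gensim library and its downloadable "glove-wiki-gigaword-100" vectors:

```python
# Querying nearest neighbors of a word by cosine similarity of GloVe embeddings.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # pretrained 100-dimensional GloVe vectors

for word in ["frog", "book", "asthma"]:
    neighbors = vectors.most_similar(word, topn=5)        # highest-cosine vocabulary words
    print(word, "->", [neighbor for neighbor, _ in neighbors])
```

The exact neighbor lists depend on the corpus and dimensionality of the downloaded vectors, so they may differ slightly from the slide.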

SLIDE 8

Word Embeddings – Linear substructures

§ Analogy task:
  • man to woman is like king to ? (queen)

  𝒚_woman − 𝒚_man + 𝒚_king = 𝒚*,   𝒚* ≈ 𝒚_queen

https://nlp.stanford.edu/projects/glove/
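A minimal sketch of the analogy computation (not part of the slides), again assuming gensim's pretrained "glove-wiki-gigaword-100" vectors; most_similar adds the positive vectors, subtracts the negative ones, and returns the nearest vocabulary word:

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")
# y_woman - y_man + y_king, then look up the closest vocabulary word
print(vectors.most_similar(positive=["woman", "king"], negative=["man"], topn=1))
# expected to rank "queen" first
```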

SLIDE 9

Agenda

  • Introduction
  • Count-based word representation
  • Prediction-based word embedding
SLIDE 10

Intuition for Computational Semantics

“You shall know a word by the company it keeps!”

– J. R. Firth, A synopsis of linguistic theory 1930–1955 (1957)

SLIDE 11

Tesgüino (Nida, 1975)

[Figure: context words observed around "tesgüino": drink, fermented, bottle, out of corn, sacred, beverage, Mexico, alcoholic]

SLIDE 12

Ale

[Figure: context words observed around "ale": drink, bar, grain, medieval, pale, bottle, brew, fermentation, alcoholic]

SLIDE 13

Tesgüino ←→ Ale

Algorithmic intuition:

Two words are related when they have common context words

SLIDE 14

Word-Document Matrix – recap

§ 𝔻 is a set of documents (plays of Shakespeare): 𝔻 = [d1, d2, …, dN]
§ 𝕎 is the set of words (the vocabulary) in the dictionary: 𝕎 = [w1, w2, …, wO]
§ Words as rows and documents as columns
§ Values: term count tc_{w,d}
§ Matrix size: O×N

            d1              d2             d3             d4
            As You Like It  Twelfth Night  Julius Caesar  Henry V
  battle     1               1              8             15
  soldier    2               2             12             36
  fool      37              58              1              5
  clown      6             117              …              …
  …          …               …              …              …

SLIDE 15

Cosine

§ Cosine is the normalized dot product of two vectors
  • Its result is between -1 and +1

  cos(𝒚, 𝒛) = (𝒚 / ‖𝒚‖) · (𝒛 / ‖𝒛‖) = Σ_i y_i z_i / ( √(Σ_i y_i²) · √(Σ_i z_i²) )

§ Example: 𝒚 = (1, 1, 0), 𝒛 = (4, 5, 6)

  cos(𝒚, 𝒛) = (1·4 + 1·5 + 0·6) / ( √(1²+1²+0²) · √(4²+5²+6²) ) = 9 / (√2 · √77) ≈ 9 / 12.4 ≈ 0.73

SLIDE 16

Word-Document Matrix

(using the word–document matrix from SLIDE 14)

§ Similarity between two words:

  similarity(soldier, clown) = cos(𝒚_soldier, 𝒚_clown)

SLIDE 17

Context

§ Context can be
  • a document
  • a paragraph, a tweet
  • a window of (2–10) context words on each side of the word
§ Word-Context matrix
  • Every word is a unit (dimension): ℂ = [c1, c2, …, cM]
  • Matrix size: O×M
  • Usually ℂ = 𝕎 and therefore M = O
SLIDE 18

Word-Context Matrix

Example contexts from a corpus:

  sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of,
  their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened
  well suited to programming on the digital computer. In finding the optimal R-stage policy from
  for the purpose of gathering data and information necessary for the study authorized in the

                   c1         c2        c3     c4      c5      c6
                   aardvark   computer  data   pinch   result  sugar
  w1 apricot                                    1               1
  w2 pineapple                                  1               1
  w3 digital                   2         1              1
  w4 information               1         6              4

§ Window context of 7 words
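A minimal sketch (not part of the slides) of how such a word-context count matrix is collected from a corpus with a symmetric context window; the toy corpus and the window size of 2 are illustrative assumptions:

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window=2):
    """Count how often each context word appears within `window` positions of each word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

corpus = ["a pinch of apricot jam with sugar",
          "data and information for the study"]
print(dict(cooccurrence_counts(corpus)["apricot"]))
# {'pinch': 1, 'of': 1, 'jam': 1, 'with': 1}
```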

SLIDE 19

Word-to-Word Relations

(using the word–context matrix from SLIDE 18)

§ First-order co-occurrence relation
  • Each cell of the word-context matrix
  • Words that appear in the proximity of each other
  • Like drink to beer, and drink to wine
§ Second-order similarity relation
  • Cosine similarity between the representation vectors
  • Words that appear in similar contexts
  • Like beer to wine, tesgüino to ale, and frog to toad
SLIDE 20

Pointwise Mutual Information

§ Problem with raw counting methods
  • Biased towards highly frequent words (“and”, “the”) although they do not carry much information
§ Pointwise Mutual Information (PMI)
  • Rooted in information theory
  • A better measure for the first-order relation in the word-context matrix, reflecting the informativeness of co-occurrences
  • Joint probability of two events (random variables) divided by their marginal probabilities

  PMI(Y, Z) = log [ q(Y, Z) / ( q(Y) · q(Z) ) ]

SLIDE 21

Pointwise Mutual Information

  PMI(w, c) = log [ q(w, c) / ( q(w) · q(c) ) ]

  q(w, c) = #(w, c) / T
  q(w) = Σ_{j=1..M} #(w, c_j) / T
  q(c) = Σ_{i=1..O} #(w_i, c) / T
  T = Σ_{i=1..O} Σ_{j=1..M} #(w_i, c_j)

§ Positive Pointwise Mutual Information (PPMI): PPMI(w, c) = max(PMI(w, c), 0)

SLIDE 22

Pointwise Mutual Information

(using the word–context matrix from SLIDE 18, with T = 19)

  q(w = information, c = data) = 6/19 = .32
  q(w = information) = 11/19 = .58
  q(c = data) = 7/19 = .37

  PPMI(w = information, c = data) = max(0, log( .32 / (.58 · .37) )) = .39
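A minimal NumPy sketch (not part of the slides) that reproduces the PPMI value above; the zero entries of the count matrix are an assumption filling in the cells the slide leaves empty, and the natural logarithm is used to match the .39 result:

```python
import numpy as np

words = ["apricot", "pineapple", "digital", "information"]
contexts = ["aardvark", "computer", "data", "pinch", "result", "sugar"]
counts = np.array([
    [0, 0, 0, 1, 0, 1],   # apricot
    [0, 0, 0, 1, 0, 1],   # pineapple
    [0, 2, 1, 0, 1, 0],   # digital
    [0, 1, 6, 0, 4, 0],   # information
], dtype=float)

T = counts.sum()                        # total number of co-occurrences (19)
q_wc = counts / T                       # joint probabilities q(w, c)
q_w = q_wc.sum(axis=1, keepdims=True)   # marginals q(w)
q_c = q_wc.sum(axis=0, keepdims=True)   # marginals q(c)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(q_wc / (q_w * q_c))
ppmi = np.maximum(pmi, 0.0)
ppmi = np.nan_to_num(ppmi)              # zero-count cells become 0

print(round(ppmi[words.index("information"), contexts.index("data")], 2))   # 0.39
```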

SLIDE 23

Singular Value Decomposition – Recap

§ An O×M matrix 𝒀 can be factorized into three matrices:

  𝒀 = 𝑽 𝜯 𝑾ᵀ

§ 𝑽 (left singular vectors) is an O×M unitary matrix
§ 𝜯 is an M×M diagonal matrix; its diagonal entries
  • are eigenvalues,
  • show the importance of the corresponding M dimensions in 𝒀,
  • are all positive and sorted from large to small values
§ 𝑾ᵀ (right singular vectors) is an M×M unitary matrix

* The definition of SVD is simplified. Refer to https://en.wikipedia.org/wiki/Singular_value_decomposition for the exact definition.

SLIDE 24

Singular Value Decomposition – Recap

[Figure: the original O×M matrix 𝒀 written as the product of the O×M left singular vectors 𝑽, the M×M diagonal matrix of eigenvalues 𝜯, and the M×M right singular vectors 𝑾ᵀ]

SLIDE 25

Applying SVD to Word-Context Matrix

[Figure: the (sparse) O×M word-context matrix 𝒀 decomposed into the O×M word vectors 𝑽, the M×M eigenvalues 𝜯, and the M×M context vectors 𝑾ᵀ]

§ Step 1: create a sparse PPMI matrix of size O×M
§ Apply SVD

SLIDE 26

Applying SVD to Term-Context Matrix

§ Step 2: keep only the top 𝑒 eigenvalues in 𝜯 and set the rest to zero
§ Truncate the 𝑽 and 𝑾ᵀ matrices based on the changes in 𝜯; the truncated matrices are denoted 𝑽̃ and 𝑾̃ᵀ respectively

SLIDE 27

Applying SVD to Term-Context Matrix

[Figure: after truncation, the O×𝑒 truncated word vectors 𝑽̃, the 𝑒×𝑒 truncated eigenvalues 𝜯̃, and the 𝑒×M truncated context word vectors 𝑾̃ᵀ]

§ The 𝑽̃ matrix contains the dense low-dimensional word vectors
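A minimal sketch (not part of the slides) of the truncated SVD step on the small PPMI matrix from the previous sketch, keeping 𝑒 = 2 dimensions; numpy.linalg.svd returns the left singular vectors (the slides' 𝑽), the diagonal values (the slides' 𝜯), and the right singular vectors (the slides' 𝑾ᵀ):

```python
import numpy as np

counts = np.array([[0, 0, 0, 1, 0, 1],    # apricot
                   [0, 0, 0, 1, 0, 1],    # pineapple
                   [0, 2, 1, 0, 1, 0],    # digital
                   [0, 1, 6, 0, 4, 0]], dtype=float)
p = counts / counts.sum()
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p / (p.sum(1, keepdims=True) * p.sum(0, keepdims=True)))
ppmi = np.nan_to_num(np.maximum(pmi, 0.0))           # sparse PPMI word-context matrix

e = 2
left, diag, right = np.linalg.svd(ppmi, full_matrices=False)
word_vectors = left[:, :e]                           # dense low-dimensional word vectors
context_vectors = right[:e, :]                       # truncated context vectors

# apricot and pineapple have identical PPMI rows, so their dense vectors coincide
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cosine(word_vectors[0], word_vectors[1]))      # ≈ 1.0
```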

SLIDE 28

Agenda

  • Introduction
  • Count-based word representation
  • Prediction-based word embedding
SLIDE 29

Word Embedding with Neural Networks

Recipe for creating (dense) word embeddings with neural networks:

§ Design a neural network architecture!
§ Loop over the training data (w, c) for some epochs
  • Pass the word w as input and execute the forward pass
  • Calculate the probability of observing the context word c at the output: q(c|w)
  • Optimize the network to maximize this likelihood

Details come next!

SLIDE 30

Training Data 𝒟

[Figure: (word, context) training pairs extracted from a sentence with a window of size 2]

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
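A minimal sketch (not part of the slides) of how the (word, context) training pairs 𝒟 are extracted with a window of size 2; the example sentence is an assumption:

```python
def skipgram_pairs(tokens, window=2):
    """All (input word, context word) pairs within the given window."""
    pairs = []
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(sentence)[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]
```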

SLIDE 31

Neural embeddings – architecture

Train sample: (Tesgüino, drink)

[Figure: skip-gram architecture. A 1×O one-hot input layer is mapped by the encoder embedding matrix 𝑭 (O×𝑒) to a 1×𝑒 hidden layer with linear activation, and then by the decoder embedding matrix 𝑽ᵀ (𝑒×O) to a 1×O output layer with softmax, which gives q(drink|Tesgüino). Training alternates a forward pass and backpropagation.]

https://web.stanford.edu/~jurafsky/slp3/

SLIDE 32

[Figure: embedding vectors, with "ale" and "tesgüino" highlighted]

SLIDE 33

[Figure: embedding vectors, with "ale" and "tesgüino" highlighted]

SLIDE 34

[Figure: embedding vectors and decoding vectors, with "ale" and "tesgüino" highlighted]

SLIDE 35

[Figure: embedding vectors of "ale" and "tesgüino" and the decoding vector of "drink"]

SLIDE 36

[Figure: embedding vectors of "ale" and "tesgüino" and the decoding vector of "drink"]

SLIDE 37

[Figure: embedding vectors of "ale" and "tesgüino" and the decoding vector of "drink"]

  • Train sample: (Tesgüino, drink)
  • Update vectors to maximize q(drink|Tesgüino)

SLIDE 38

Neural embeddings – prediction

§ Since the hidden layer is linear, the output vector is:

  𝒂 = 𝒇_w 𝑽ᵀ

§ Probability distribution over the output words using softmax (sketched below):

  q(c|w) = softmax(𝒂)_c = exp(𝒇_w 𝒗_cᵀ) / Σ_{c̃∈𝕎} exp(𝒇_w 𝒗_c̃ᵀ)

§ In this example:

  q(drink|Tesgüino) = exp(𝒇_Tesgüino 𝒗_drinkᵀ) / Σ_{c̃∈𝕎} exp(𝒇_Tesgüino 𝒗_c̃ᵀ)

The denominator (normalization) is expensive!
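A minimal NumPy sketch (not part of the slides) of this forward pass; the vocabulary size, embedding dimension, and random matrices 𝑭 and 𝑽 are toy assumptions standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
O, e = 1000, 50                               # toy vocabulary size and embedding dimension
F = rng.normal(scale=0.1, size=(O, e))        # encoder embeddings (one row f_w per word)
V = rng.normal(scale=0.1, size=(O, e))        # decoder embeddings (one row v_c per context word)

def q_context_given_word(w):
    a = F[w] @ V.T                            # logits a = f_w V^T, one per vocabulary word
    a -= a.max()                              # shift for numerical stability
    exp_a = np.exp(a)
    return exp_a / exp_a.sum()                # softmax: q(c|w) for every c

probs = q_context_given_word(42)
print(probs.shape, probs.sum())               # (1000,) 1.0 -- normalizing over all O words is the expensive part
```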

SLIDE 39

Neural embeddings – loss

§ Loss function: negative log-likelihood (NLL) over all training samples 𝒟

  ℒ = −𝔼_{(w,c)∼𝒟} [ log q(c|w) ]

  ℒ = −𝔼_{(w,c)∼𝒟} [ log ( exp(𝒇_w 𝒗_cᵀ) / Σ_{c̃∈𝕎} exp(𝒇_w 𝒗_c̃ᵀ) ) ]

  ℒ = −𝔼_{(w,c)∼𝒟} [ 𝒇_w 𝒗_cᵀ − log Σ_{c̃∈𝕎} exp(𝒇_w 𝒗_c̃ᵀ) ]
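A minimal sketch (not part of the slides) of an empirical estimate of this loss over a handful of toy (word, context) index pairs; it uses the log-softmax form from the last line above:

```python
import numpy as np

rng = np.random.default_rng(0)
O, e = 1000, 50
F = rng.normal(scale=0.1, size=(O, e))        # encoder embeddings
V = rng.normal(scale=0.1, size=(O, e))        # decoder embeddings

def log_q(w, c):
    a = F[w] @ V.T                                            # logits over the vocabulary
    m = a.max()
    return (a[c] - m) - np.log(np.exp(a - m).sum())           # f_w v_c^T - log-sum-exp, stabilized

pairs = [(42, 7), (42, 99), (13, 7)]          # toy samples standing in for D
loss = -np.mean([log_q(w, c) for w, c in pairs])
print(loss)                                   # negative log-likelihood estimate
```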

SLIDE 40

Neural embeddings – word embedding

§ word2vec uses the encoding matrix 𝑭 as the final word embeddings
§ Other possibilities (see the sketch below)
  • Mean of the vectors of 𝑭 and 𝑽 for each word
  • Concatenation of the vectors in 𝑭 and 𝑽 for each word
  • This results in vectors with 2𝑒 dimensions

[Figure: the encoder matrix 𝑭 (O×𝑒) and the decoder matrix 𝑽ᵀ (𝑒×O)]
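A minimal sketch (not part of the slides) of the three extraction options, using toy 𝑭 and 𝑽 matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
O, e = 1000, 50
F = rng.normal(size=(O, e))                   # encoder embeddings
V = rng.normal(size=(O, e))                   # decoder embeddings

w = 42
emb_word2vec = F[w]                           # word2vec's choice: the row of F
emb_mean = (F[w] + V[w]) / 2                  # mean of encoder and decoder vectors
emb_concat = np.concatenate([F[w], V[w]])     # concatenation: 2e dimensions
print(emb_word2vec.shape, emb_mean.shape, emb_concat.shape)   # (50,) (50,) (100,)
```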

SLIDE 41

word2vec (Skip-Gram) with Negative Sampling

§ word2vec is an efficient and effective algorithm that proposes the Negative Sampling method to define the loss
§ In Negative Sampling, instead of q(c|w), the network estimates q(z=1|w,c): the probability that the co-occurrence of (w, c) comes from a genuine distribution, namely from the training corpus
§ q(z=1|w,c) is defined using the sigmoid σ:

  q(z=1|w,c) = σ(𝒇_w · 𝒗_cᵀ)

SLIDE 42

word2vec (Skip-Gram) with Negative Sampling

§ When two words (w, c) appear in the training data (genuine distribution), they form a positive sample
§ Negative Sampling aims to distinguish between the co-occurrence probability of w in a positive sample, q(z=1|w,c), and its co-occurrences in l negative samples with context words c̃, q(z=1|w,c̃)

SLIDE 43

word2vec (Skip-Gram) with Negative Sampling

§ Negative samples are drawn from the noisy distribution 𝒟̃
  • 𝒟̃ represents a randomly created corpus → words co-occur randomly!
  • Why can a random co-occurrence be assumed to be a negative sample?
§ The noisy distribution 𝒟̃ is defined using a smoothed unigram distribution of the words in the corpus
  • In word2vec, 𝒟̃ is smoothed by raising the unigram counts to the power of β = 0.75 → Context Distribution Smoothing (see the sketch below)
§ The number of negative samples l is usually between 2 and 20
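A minimal sketch (not part of the slides) of the smoothed noise distribution and of drawing l negative context words from it; the unigram counts are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
unigram_counts = rng.integers(1, 1000, size=1000).astype(float)   # toy per-word corpus counts

beta = 0.75
noise_probs = unigram_counts ** beta          # context distribution smoothing
noise_probs /= noise_probs.sum()              # normalize to a probability distribution

l = 5                                         # number of negative samples per positive pair
negatives = rng.choice(len(noise_probs), size=l, p=noise_probs)
print(negatives)                              # indices of the sampled negative context words
```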

SLIDE 44

word2vec with Negative Sampling – Objective Function

§ The objective function
  • increases the probability for the positive sample (w, c)
  • decreases the probability for the l negative samples (w, c̃)
§ Loss function:

  ℒ = −𝔼_{(w,c)∼𝒟} [ log q(z=1|w,c) ] − Σ_{c̃∼𝒟̃, 1..l} log ( 1 − q(z=1|w,c̃) )

  ℒ = −𝔼_{(w,c)∼𝒟} [ log σ(𝒇_w 𝒗_cᵀ) ] − Σ_{c̃∼𝒟̃, 1..l} log σ(−𝒇_w 𝒗_c̃ᵀ)

  (first term: positive samples; second term: negative samples)
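A minimal NumPy sketch (not part of the slides) of this loss for one positive pair and l negatives; the indices and the random 𝑭 and 𝑽 matrices are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
O, e, l = 1000, 50, 5
F = rng.normal(scale=0.1, size=(O, e))        # encoder embeddings f_w
V = rng.normal(scale=0.1, size=(O, e))        # decoder embeddings v_c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w, c = 42, 7                                  # positive pair (w, c) from the corpus
negatives = rng.integers(0, O, size=l)        # negative context words, ideally drawn from the noise distribution

loss = (-np.log(sigmoid(F[w] @ V[c]))                          # positive-sample term
        - np.sum(np.log(sigmoid(-F[w] @ V[negatives].T))))     # one term per negative sample
print(loss)
```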

SLIDE 45

[Figure: the embedding vector of "tesgüino" and the decoding vector of "drink"]

  • Train sample: (Tesgüino, drink)

SLIDE 46

[Figure: the embedding vector of "tesgüino" and the decoding vector of "drink"]

  • Train sample: (Tesgüino, drink)
  • Sample l negative context words

SLIDE 47

[Figure: the embedding vector of "tesgüino" and the decoding vector of "drink"]

  • Train sample: (Tesgüino, drink)
  • Sample l negative context words
  • Update vectors to
    • Maximize q(z=1|Tesgüino, drink)
    • Minimize q(z=1|Tesgüino, c̃)

SLIDE 48

Negative Sampling – final words!

§ Negative Sampling turns the problem from multi-class classification into binary classification
§ Negative Sampling is a biased approximation of softmax
  • Noise Contrastive Estimation (the parent of Negative Sampling) is an unbiased approximation of softmax
§ Softmax is a good choice for training language models, namely to estimate q(c|w)
§ word2vec and Negative Sampling aim to train good embeddings