Word Embedding (CE-324: Modern Information Retrieval, Sharif University of Technology)

  1. Word Embedding. CE-324: Modern Information Retrieval, Sharif University of Technology. M. Soleymani, Fall 2018. Many slides have been adapted from Socher's lectures, cs224d, Stanford, 2017.

  2. One-hot coding
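
     A minimal sketch of what one-hot coding looks like in practice (the toy vocabulary, and the use of Python/numpy, are assumptions for illustration):

     ```python
     # One-hot coding: each vocabulary word maps to a sparse V-dimensional
     # vector with a single 1. The toy vocabulary is an assumption.
     import numpy as np

     vocab = ["the", "cat", "sat", "on", "floor"]
     word_to_id = {w: i for i, w in enumerate(vocab)}

     def one_hot(word, vocab_size=len(vocab)):
         """Return the one-hot vector for `word` (all zeros except one entry)."""
         v = np.zeros(vocab_size)
         v[word_to_id[word]] = 1.0
         return v

     print(one_hot("cat"))   # [0. 1. 0. 0. 0.]
     ```

     Any two distinct one-hot vectors are orthogonal, so this coding carries no notion of similarity between words.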

  3. Distributed similarity-based representations
     - Representing a word by means of its neighbors.
     - "You shall know a word by the company it keeps" (J. R. Firth, 1957: 11).
     - One of the most successful ideas of modern statistical NLP.

  4. Word embedding
     - Store "most" of the important information in a fixed, small number of dimensions: a dense vector.
     - Usually around 25-1000 dimensions.
     - Embeddings: distributional models with dimensionality reduction, based on prediction.

  5. How to make neighbors represent words?
     - Answer: with a co-occurrence matrix X.
     - Options: full document vs. windows.
     - Full word-document co-occurrence matrix: gives general topics (all sports terms will have similar entries), leading to "Latent Semantic Analysis".
     - Window around each word: captures both syntactic (POS) and semantic information.

  6. LSA: dimensionality reduction based on the word-document matrix
     - [Figure: SVD of the word-document matrix X (words x docs); the embedded word vectors are obtained by maintaining only the k largest singular values of X.]
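
     A small sketch of the LSA-style reduction the figure describes, assuming a toy random word-document count matrix and k = 2 kept dimensions:

     ```python
     # LSA-style dimensionality reduction: take the SVD of a word-document
     # matrix X and keep only the k largest singular values.
     import numpy as np

     V, D, k = 6, 4, 2                    # vocabulary size, number of docs, kept dims
     rng = np.random.default_rng(0)
     X = rng.integers(0, 5, size=(V, D)).astype(float)   # word-document counts

     U, s, Vt = np.linalg.svd(X, full_matrices=False)
     word_embeddings = U[:, :k] * s[:k]   # each row: a k-dimensional word vector

     print(word_embeddings.shape)         # (6, 2)
     ```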

  7. Problems with SVD
     - Its computational cost scales quadratically for an n x m matrix: O(mn^2) flops (when n < m); bad for millions of words or documents.
     - Hard to incorporate new words or documents.
     - Does not consider the order of words in the documents.

  8. Directly learn low-dimensional word vectors
     - An old idea. Relevant for this lecture:
       - Learning representations by back-propagating errors (Rumelhart et al., 1986).
       - NNLM: A neural probabilistic language model (Bengio et al., 2003).
       - NLP (almost) from scratch (Collobert & Weston, 2008).
     - A recent, even simpler and faster model: word2vec (Mikolov et al., 2013) -> introduced now.

  9. word2vec
     - Key idea: the word vector can predict surrounding words.
     - word2vec, as originally described (Mikolov et al., 2013), is a NN model that uses a two-layer network (i.e., not deep!) to perform dimensionality reduction.
     - Faster, and can easily incorporate a new sentence/document or add a word to the vocabulary.
     - Very computationally efficient, good all-round model (good hyper-parameters already selected).

  10. Skip-gram vs. CBOW
     - Two possible architectures:
       - CBOW (Continuous Bag of Words): given some context words, predict the center word, i.e., predict the center word from the sum of the surrounding word vectors.
       - Skip-gram: given a center word, predict the context words.
     - In short: CBOW uses a window of words to predict the middle word; skip-gram uses a word to predict the surrounding words.

  11. Continuous Bag of Words: Example
     - E.g., "The cat sat on floor", with window size = 2.
     - [Figure: the sentence "the cat sat on floor" with a window of 2 words on each side of the center word.]
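
     As a sketch of how such a window generates training examples, assuming nothing beyond the slide's sentence and window size:

     ```python
     # Enumerate (context, center) pairs from the example sentence with m = 2.
     sentence = "the cat sat on floor".split()
     m = 2

     for i, center in enumerate(sentence):
         context = [sentence[j]
                    for j in range(max(0, i - m), min(len(sentence), i + m + 1))
                    if j != i]
         print(context, "->", center)   # CBOW: context predicts center
         # Skip-gram would use the reverse pairs: center -> each context word.
     ```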

  12. Continuous Bag of Words: Example
     - [Figure: network diagram for predicting "sat" from its context. Input layer: one-hot V-dim vectors for the context words "cat" and "on"; a hidden layer; output layer: the one-hot V-dim vector for the center word "sat".]

  13. Continuous Bag of Words: Example
     - We must learn W and W'.
     - [Figure: the same network with weights. W (V x N) maps each one-hot context vector (V-dim) to the N-dim hidden layer; W' (N x V) maps the hidden layer to the V-dim output.]
     - N will be the size of the word vectors.

  14. Word embedding matrix
     - You get the word vector by left-multiplying a one-hot vector y by W:
       h = y^T W = W_{k,.} = w_k, i.e., the k-th row of the matrix W.
     - [Figure: W has one row per vocabulary word, from "aardvark" to "zebra".]
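
     A tiny numpy check of this identity, with an assumed vocabulary size and embedding dimension:

     ```python
     # Left-multiplying a one-hot row vector y by W simply selects row k of W.
     import numpy as np

     V, N = 5, 3                      # vocabulary size, embedding dimension
     rng = np.random.default_rng(1)
     W = rng.normal(size=(V, N))      # embedding matrix, one row per word

     k = 2                            # index of the word (e.g. "sat")
     y = np.zeros(V)
     y[k] = 1.0                       # one-hot vector

     h = y @ W                        # h = y^T W
     assert np.allclose(h, W[k])      # identical to just reading off row k
     ```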

  15. Continuous Bag of Words: Example
     - [Figure: W^T x_on = w_on, i.e., multiplying W^T by the one-hot vector of "on" picks out that word's row of W.]
     - The hidden-layer vector is the average of the context word vectors: v_hat = (w_cat + w_on) / 2.

  16. Continuous Bag of Words: Example
     - [Figure: likewise, W^T x_cat = w_cat.]
     - Again, the hidden-layer vector is v_hat = (w_cat + w_on) / 2.

  17. Continuous Bag of Words: Example
     - [Figure: the hidden vector v_hat is mapped to the output layer by W': z = W'^T v_hat, and y_hat = softmax(z).]
     - N will be the size of the word vectors.

  18. Continuous Bag of Words: Example
     - We would prefer y_hat to be close to y_sat, the one-hot vector of the true center word "sat".
     - [Figure: a sample softmax output (values such as 0.01, 0.02, ..., 0.7, ...) compared against the one-hot target vector.]

  19. Continuous Bag of Words: Example
     - [Figure: the matrix W^T; its columns (the rows of W) contain the words' vectors.]
     - We can consider either W or W' as the word's representation, or even take the average of the two.
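
     Putting slides 12-19 together, a minimal CBOW forward pass might look as follows; the random weights and toy vocabulary are assumptions, and no training is performed:

     ```python
     # CBOW forward pass: average the context words' rows of W, project with W',
     # and apply a softmax over the vocabulary.
     import numpy as np

     vocab = ["the", "cat", "sat", "on", "floor"]
     V, N = len(vocab), 3
     rng = np.random.default_rng(2)
     W = rng.normal(size=(V, N))          # input weights (word vectors as rows)
     W_out = rng.normal(size=(N, V))      # output weights W'

     def softmax(z):
         z = z - z.max()                  # numerical stability
         e = np.exp(z)
         return e / e.sum()

     context = ["cat", "on"]              # window around the target "sat"
     v_hat = np.mean([W[vocab.index(w)] for w in context], axis=0)   # hidden layer
     y_hat = softmax(v_hat @ W_out)       # predicted distribution over the vocabulary

     print(dict(zip(vocab, np.round(y_hat, 2))))
     ```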

  20. Skip-gram
     - Embeddings that are good at predicting neighboring words are also good at representing similarity.

  21. [Figure: the skip-gram network. Input layer: the one-hot V-dim vector of the center word "sat"; W (V x d) maps it to the d-dim hidden layer; W' (d x V) maps the hidden layer to V-dim output distributions, one per context position (e.g., "cat" and "on").]

  22. Details of word2vec
     - Learn to predict surrounding words in a window of length m around every word.
     - Objective function: maximize the log probability of any context word given the current center word:

       J(θ) = (1/T) ∑_{t=1}^{T} ∑_{-m ≤ j ≤ m, j ≠ 0} log p(x_{t+j} | x_t)

     - T: training set size; m: context size (usually 5-10); x_j: the j-th word of the corpus; θ: all parameters of the network.
     - Use a large training corpus to maximize it.

  23. Skip-gram
     - x_c: center (input) word; x_o: context or output (outside) word.
     - Every word x has two vectors: w_x when x is the center word, and v_x when x is an outside (context) word.
     - h = W_{c,.} = w_c (the row of W for the center word), W'_{.,o} = v_o, and
       score(x_o, x_c) = h^T W'_{.,o} = v_o^T w_c
     - P(x_o | x_c) = exp(score(x_o, x_c)) / ∑_w exp(score(x_w, x_c)) = exp(v_o^T w_c) / ∑_i exp(v_i^T w_c)
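
     A short sketch of this softmax probability with the two vector tables, using assumed random vectors:

     ```python
     # Skip-gram softmax: P(x_o | x_c) = exp(v_o . w_c) / sum_w exp(v_w . w_c),
     # where w_x are center vectors and v_x are outside vectors.
     import numpy as np

     V, d = 5, 3
     rng = np.random.default_rng(3)
     W_center = rng.normal(size=(V, d))    # w_x: used when x is the center word
     V_outside = rng.normal(size=(V, d))   # v_x: used when x is an outside word

     def p_context_given_center(o, c):
         scores = V_outside @ W_center[c]  # v_x . w_c for every word x
         scores -= scores.max()            # numerical stability
         probs = np.exp(scores) / np.exp(scores).sum()
         return probs[o]

     print(p_context_given_center(o=1, c=3))
     ```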

  24. Details of word2vec
     - Predict surrounding words in a window of length m around every word:

       P(x_o | x_c) = exp(v_o^T w_c) / ∑_w exp(v_w^T w_c)

       J(θ) = (1/T) ∑_{t=1}^{T} ∑_{-m ≤ j ≤ m, j ≠ 0} log p(x_{t+j} | x_t)
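
     A toy evaluation of this objective over a small assumed corpus, reusing the softmax above:

     ```python
     # Average over positions t of the log-probabilities of the words within a
     # window of size m around the center word, using the skip-gram softmax.
     import numpy as np

     corpus = [0, 1, 2, 3, 4, 1, 2]        # word indices of a toy corpus
     m, V, d = 2, 5, 3
     rng = np.random.default_rng(4)
     W_center = rng.normal(size=(V, d))    # center-word vectors w_x
     V_outside = rng.normal(size=(V, d))   # outside-word vectors v_x

     def log_p(o, c):
         scores = V_outside @ W_center[c]
         scores -= scores.max()
         return scores[o] - np.log(np.exp(scores).sum())

     J = 0.0
     for t, c in enumerate(corpus):
         for j in range(-m, m + 1):
             if j != 0 and 0 <= t + j < len(corpus):
                 J += log_p(corpus[t + j], c)
     J /= len(corpus)
     print("J(theta) =", J)
     ```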

  25. Parameters
     - [Figure: the parameter vector θ, collecting the center vector w_x and the outside vector v_x of every vocabulary word.]

  26. Review: iterative optimization of an objective function
     - Objective function: J(θ). Optimization problem: θ̂ = argmax_θ J(θ).
     - Steps:
       - Start from θ^0.
       - Repeat: update θ^t to θ^{t+1} in order to increase J; set t ← t + 1.
       - Until we hopefully end up at a maximum.

  27. Review: gradient ascent
     - A first-order optimization algorithm to find θ̂ = argmax_θ J(θ); also known as "steepest ascent".
     - In each step, it takes steps proportional to the gradient of the function at the current point θ^t: J(θ) increases fastest if one goes from θ^t in the direction of ∇_θ J(θ^t).
     - Assumption: J(θ) is defined and differentiable in a neighborhood of the point θ^t.

  28. Review: gradient ascent
     - To maximize J(θ), iterate

       θ^{t+1} = θ^t + η ∇_θ J(θ^t)

       where η is the step size (learning rate) and
       ∇_θ J(θ) = [∂J(θ)/∂θ_1, ∂J(θ)/∂θ_2, ..., ∂J(θ)/∂θ_d].

     - If η is small enough, then J(θ^{t+1}) ≥ J(θ^t).
     - η can be allowed to change at every iteration, as η_t.
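
     A minimal gradient-ascent loop following this update rule, applied to an assumed toy concave objective J(θ) = -||θ - target||^2:

     ```python
     # Gradient ascent: theta_{t+1} = theta_t + eta * grad J(theta_t).
     import numpy as np

     target = np.array([1.0, -2.0])
     def grad_J(theta):
         return -2.0 * (theta - target)   # gradient of -||theta - target||^2

     theta = np.zeros(2)                  # theta^0
     eta = 0.1                            # step size / learning rate
     for _ in range(100):
         theta = theta + eta * grad_J(theta)

     print(theta)                         # converges to the maximizer `target`
     ```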

  29. Gradient
     - ∂ log p(x_o | x_c) / ∂w_c
       = ∂/∂w_c log [ exp(v_o^T w_c) / ∑_w exp(v_w^T w_c) ]
       = ∂/∂w_c [ v_o^T w_c − log ∑_w exp(v_w^T w_c) ]
       = v_o − ∑_w [ exp(v_w^T w_c) / ∑_i exp(v_i^T w_c) ] v_w
       = v_o − ∑_w p(x_w | x_c) v_w
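
     A quick numerical check of this result, comparing the closed-form gradient with central finite differences on assumed random vectors:

     ```python
     # Verify: d/dw_c log p(x_o | x_c) = v_o - sum_w p(x_w | x_c) v_w.
     import numpy as np

     V, d = 5, 3
     rng = np.random.default_rng(5)
     V_outside = rng.normal(size=(V, d))   # outside vectors v_x
     w_c = rng.normal(size=d)              # center vector w_c
     o = 2                                 # index of the observed context word

     def log_p(w):
         scores = V_outside @ w
         return scores[o] - np.log(np.exp(scores).sum())

     probs = np.exp(V_outside @ w_c) / np.exp(V_outside @ w_c).sum()
     analytic = V_outside[o] - probs @ V_outside        # v_o - sum_w p(x_w|x_c) v_w

     eps = 1e-6
     numeric = np.array([(log_p(w_c + eps * e) - log_p(w_c - eps * e)) / (2 * eps)
                         for e in np.eye(d)])
     print(np.allclose(analytic, numeric, atol=1e-5))   # expect True
     ```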

  30. Training difficulties
     - With large vocabularies, this is not scalable:

       ∂ log p(x_o | x_c) / ∂w_c = v_o − ∑_{w=1}^{V} p(x_w | x_c) v_w

     - Negative sampling: instead of normalizing over the whole vocabulary, sample only a few words that do not appear in the context as "negatives".
     - Similar to focusing mostly on positive correlations.
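
     A rough sketch of the negative-sampling loss for one (center, context) pair; the uniform sampler and toy sizes are simplifying assumptions (the original work samples negatives from a unigram distribution raised to the 3/4 power):

     ```python
     # Negative sampling: score the observed context word against a handful of
     # sampled "negative" words instead of the whole vocabulary.
     import numpy as np

     V, d, K = 1000, 50, 5                 # vocabulary size, dimension, negatives per pair
     rng = np.random.default_rng(6)
     W_center = rng.normal(scale=0.1, size=(V, d))    # center vectors w_x
     V_outside = rng.normal(scale=0.1, size=(V, d))   # outside vectors v_x

     def sigmoid(x):
         return 1.0 / (1.0 + np.exp(-x))

     def neg_sampling_loss(center, context):
         negs = rng.integers(0, V, size=K)                        # a few negative words
         pos = np.log(sigmoid(V_outside[context] @ W_center[center]))
         neg = np.sum(np.log(sigmoid(-V_outside[negs] @ W_center[center])))
         return -(pos + neg)               # minimized per (center, context) pair

     print(neg_sampling_loss(center=3, context=7))
     ```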
