SLIDE 1

Word Embedding

CE-324: Modern Information Retrieval

Sharif University of Technology

M. Soleymani

Fall 2018

Many slides have been adapted from Socher's lectures, CS224d, Stanford, 2017.

slide-2
SLIDE 2

One-hot coding
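The slide itself is a figure; as a minimal sketch (the toy vocabulary below is made up, not from the slides), one-hot coding represents each word as a sparse |V|-dimensional vector with a single 1 at the word's index:

```python
# One-hot coding: each word gets a |V|-dimensional vector with a single 1.
import numpy as np

vocab = ["aardvark", "cat", "on", "sat", "the", "zebra"]   # toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

print(one_hot("cat"))   # [0. 1. 0. 0. 0. 0.]
```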

SLIDE 3

Distributed similarity based representations

} Representing a word by means of its neighbors
} “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
} One of the most successful ideas of modern statistical NLP

SLIDE 4

Word embedding

} Store “most” of the important information in a fixed, small number of dimensions: a dense vector
} Usually around 25 – 1000 dimensions
} Embeddings: distributional models with dimensionality reduction, based on prediction

SLIDE 5

How to make neighbors represent words?

} Answer: with a co-occurrence matrix X
} Options: full document vs. windows
} Full word-document co-occurrence matrix
} will give general topics (all sports terms will have similar entries), leading to “Latent Semantic Analysis”
} Window around each word
} captures both syntactic (POS) and semantic information

SLIDE 6

LSA: Dimensionality Reduction based on word-doc matrix

[Figure: SVD of the word-document matrix X (words × docs); embedded word vectors are obtained by maintaining only the k largest singular values]
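A minimal NumPy sketch of this idea (my own illustration, not from the slides): factorize a toy word-document count matrix and keep only the k largest singular values.

```python
# Truncated SVD for LSA-style word embeddings (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(1000, 200)).astype(float)   # toy word-document counts

U, s, Vt = np.linalg.svd(X, full_matrices=False)        # X = U diag(s) Vt

k = 50                                    # keep only the k largest singular values
word_vectors = U[:, :k] * s[:k]           # one k-dimensional embedding per word

print(word_vectors.shape)                 # (1000, 50)
```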

SLIDE 7

Problems with SVD

} Its computational cost scales quadratically for an n x m matrix: O(mn^2) flops (when n < m)
} Bad for millions of words or documents
} Hard to incorporate new words or documents
} Does not consider the order of words in the documents

SLIDE 8

Directly learn low-dimensional word vectors

} Old idea, relevant for this lecture:
} Learning representations by back-propagating errors (Rumelhart et al., 1986)
} NNLM: A neural probabilistic language model (Bengio et al., 2003)
} NLP (almost) from Scratch (Collobert & Weston, 2008)
} A recent, even simpler and faster model: word2vec (Mikolov et al., 2013) -> intro now

SLIDE 9

word2vec

} Key idea: the word vector can predict surrounding words
} word2vec, as originally described (Mikolov et al., 2013), is an NN model using a two-layer network (i.e., not deep!) to perform dimensionality reduction.
} Faster, and can easily incorporate a new sentence/document or add a word to the vocabulary
} Very computationally efficient, good all-round model (good hyper-parameters already selected).

SLIDE 10

Skip-gram vs. CBOW

} Two possible architectures:
} Given some context words, predict the center word (CBOW)
} Predict the center word from the sum of the surrounding word vectors
} Given a center word, predict the context words (Skip-gram)

CBOW uses a window of words to predict the middle word.
Skip-gram uses a word to predict the surrounding words.

SLIDE 11

Continuous Bag of Words: Example

} E.g. “The cat sat on floor”
} Window size = 2

[Figure: the context words “the”, “cat”, “on”, “floor” are used to predict the center word “sat”]

SLIDE 12

Continuous Bag of Words: Example

[Figure: CBOW network. Input layer: one-hot vectors for the context words “cat” and “on” (each with a 1 at that word’s index in the vocabulary). Hidden layer. Output layer: the center word “sat”.]

SLIDE 13

Continuous Bag of Words: Example

[Figure: the same CBOW network with its weight matrices. Each V-dim one-hot input is multiplied by $X_{V \times d}$ to give the d-dim hidden layer, which is multiplied by $X'_{d \times V}$ to give the V-dim output for “sat”.]

d will be the size of the word vectors. We must learn both $X$ and $X'$.

SLIDE 14

Word embedding matrix

} You will get the word vector by left-multiplying a one-hot vector by $X$

$y = (0, \ldots, 1, \ldots, 0)^\top$ with $y_l = 1$
$h = y^\top X = X_{l,:} = w_l$   (the $l$-th row of the matrix $X$)

[Figure: $X$ has one row per vocabulary word: a, aardvark, ..., zebra]

SLIDE 15

Continuous Bag of Words: Example

$\hat{w} = \frac{w_{cat} + w_{on}}{2}$

$X^\top \times y_{on} = w_{on}$

[Figure: multiplying $X^\top$ by the one-hot vector $y_{on}$ selects the column of $X^\top$ for “on”, e.g. $w_{on} = (4.5, 8.4, \ldots, 6.7)$]

SLIDE 16

Continuous Bag of Words: Example

$\hat{w} = \frac{w_{cat} + w_{on}}{2}$

$X^\top \times y_{cat} = w_{cat}$

[Figure: multiplying $X^\top$ by the one-hot vector $y_{cat}$ selects the column of $X^\top$ for “cat”, e.g. $w_{cat} = (1.5, 0.9, \ldots, 1.9)$]

SLIDE 17

Continuous Bag of Words: Example

[Figure: CBOW network with input matrix $X_{V \times d}$ and output matrix $X'_{d \times V}$; d is the size of the word vectors.]

$X'^\top \times \hat{w} = A$
$\hat{z} = \mathrm{softmax}(A)$

SLIDE 18

Continuous Bag of Words: Example

$X'^\top \times \hat{w} = A$
$\hat{z} = \mathrm{softmax}(A)$

[Figure: an example output distribution $\hat{z} = (0.01, 0.02, 0.00, 0.02, 0.01, 0.02, 0.01, 0.7, \ldots, 0.00)$]

We would prefer $\hat{z}$ to be close to $z_{sat}$, the one-hot vector of the true center word.

SLIDE 19

Continuous Bag of Words: Example

[Figure: the rows of $X$ (equivalently, the columns of $X^\top$) contain the word vectors; $X'_{d \times V}$ is the output matrix.]

We can consider either $X$ or $X'$ as the word representations, or even take the average.
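Putting slides 12-19 together, here is a small NumPy sketch of one CBOW forward pass (my own illustration with toy dimensions, not the lecture's code); `X` and `Xp` play the roles of $X$ and $X'$ above.

```python
# Toy CBOW forward pass: average the context word vectors, score every
# vocabulary word, and softmax to get a predicted distribution.
import numpy as np

V, d = 6, 3                          # vocabulary size and embedding size
rng = np.random.default_rng(0)
X  = rng.normal(size=(V, d))         # input word matrix (one row per word)
Xp = rng.normal(size=(d, V))         # output word matrix X'

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# context = {"cat", "on"}, center = "sat"; the indices here are arbitrary
y_cat, y_on, sat_idx = np.eye(V)[1], np.eye(V)[2], 3

w_hat = (y_cat @ X + y_on @ X) / 2   # average of the context word vectors
z_hat = softmax(w_hat @ Xp)          # predicted distribution over the vocabulary

print(z_hat[sat_idx])                # training pushes this probability up
```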

SLIDE 20

Skip-gram

} Embeddings that are good at predicting neighboring words are also good at representing similarity

SLIDE 21

[Figure: Skip-gram network. Input layer: the V-dim one-hot vector for the center word “sat”. Hidden layer: $X_{V \times d}$ gives the d-dim word vector $w$. Output layer: $X'_{d \times V}$ produces a V-dim prediction for each context word (“cat”, “on”).]

SLIDE 22

Details of Word2Vec

} Learn to predict surrounding words in a window of length m around every word.
} Objective function: maximize the log probability of any context word given the current center word:

$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log p(x_{t+j} \mid x_t)$

} Use a large training corpus to maximize it

T: training set size
m: context (window) size, usually 5–10
$x_j$: vector representation of the j-th word
$\theta$: all parameters of the network

SLIDE 23

Skip-gram

} $x_o$: context or output (outside) word
} $x_c$: center or input word

$p(x_o \mid x_c) = \frac{\exp(\mathrm{score}(x_o, x_c))}{\sum_{w} \exp(\mathrm{score}(x_w, x_c))}$

$\mathrm{score}(x_o, x_c) = h^\top X'_{:,o} = w_c \cdot v_o$
$h = y_c^\top X = X_{c,:} = w_c$
$X'_{:,o} = v_o$

Every word has 2 vectors:
} $w_x$: when $x$ is the center word
} $v_x$: when $x$ is the outside word (context word)

$p(x_o \mid x_c) = \frac{\exp(v_o \cdot w_c)}{\sum_{w} \exp(v_w \cdot w_c)}$
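A short NumPy sketch of this softmax (my own illustration; the matrices and indices are toy values, not from the slides):

```python
# Skip-gram softmax p(x_o | x_c): every word has a center vector w and an
# outside vector v; the probability of an outside word is a softmax of dot
# products with the center word's vector.
import numpy as np

V, d = 10, 4
rng = np.random.default_rng(0)
W_center  = rng.normal(size=(V, d))   # w_x: used when x is the center word
V_outside = rng.normal(size=(V, d))   # v_x: used when x is the outside word

def p_outside_given_center(o, c):
    scores = V_outside @ W_center[c]          # v_w . w_c for every word w
    scores -= scores.max()                    # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(p_outside_given_center(o=3, c=7))
```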

SLIDE 24

Details of Word2Vec

} Predict surrounding words in a window of length m around every word:

$p(x_o \mid x_c) = \frac{\exp(v_o \cdot w_c)}{\sum_{w} \exp(v_w \cdot w_c)}$

$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log p(x_{t+j} \mid x_t)$

SLIDE 25

Parameters

SLIDE 26

Review: Iterative optimization of objective function

} Objective function: $J(\boldsymbol{\theta})$
} Optimization problem: $\boldsymbol{\theta}^\ast = \arg\max_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
} Steps:
} Start from $\boldsymbol{\theta}^0$
} Repeat
} Update $\boldsymbol{\theta}^t$ to $\boldsymbol{\theta}^{t+1}$ in order to increase $J$
} $t \leftarrow t + 1$
} until we hopefully end up at a maximum

SLIDE 27

Review: Gradient ascent

} First-order optimization algorithm to find $\boldsymbol{\theta}^\ast = \arg\max_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
} Also known as “steepest ascent”
} In each step, takes steps proportional to the gradient vector of the function at the current point $\boldsymbol{\theta}^t$:
} $J(\boldsymbol{\theta})$ increases fastest if one goes from $\boldsymbol{\theta}^t$ in the direction of $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^t)$
} Assumption: $J(\boldsymbol{\theta})$ is defined and differentiable in a neighborhood of the point $\boldsymbol{\theta}^t$

SLIDE 28

Review: Gradient ascent

} Maximize $J(\boldsymbol{\theta})$

$\boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^t + \eta\, \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^t)$

$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \left[ \frac{\partial J(\boldsymbol{\theta})}{\partial \theta_1}, \frac{\partial J(\boldsymbol{\theta})}{\partial \theta_2}, \ldots, \frac{\partial J(\boldsymbol{\theta})}{\partial \theta_d} \right]$

} If the step size $\eta$ (the learning rate parameter) is small enough, then $J(\boldsymbol{\theta}^{t+1}) \ge J(\boldsymbol{\theta}^t)$.
} $\eta$ can be allowed to change at every iteration as $\eta^t$.
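A minimal gradient-ascent loop in NumPy (my own toy objective, just to make the update rule concrete):

```python
# Gradient ascent: repeatedly step in the direction of the gradient of J.
import numpy as np

def grad_J(theta):
    # gradient of a toy concave objective J(theta) = -||theta - [1, 2]||^2
    return -2.0 * (theta - np.array([1.0, 2.0]))

theta = np.zeros(2)          # theta^0
eta = 0.1                    # step size (learning rate)
for _ in range(100):
    theta = theta + eta * grad_J(theta)   # theta^{t+1} = theta^t + eta * grad J

print(theta)                 # converges towards the maximizer [1, 2]
```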

SLIDE 29

Gradient

$\frac{\partial \log p(x_o \mid x_c)}{\partial w_c} = \frac{\partial}{\partial w_c} \log \frac{\exp(v_o \cdot w_c)}{\sum_{w} \exp(v_w \cdot w_c)}$

$= \frac{\partial}{\partial w_c} \left[ \log \exp(v_o \cdot w_c) - \log \sum_{w} \exp(v_w \cdot w_c) \right]$

$= v_o - \frac{1}{\sum_{w} \exp(v_w \cdot w_c)} \sum_{w} v_w \exp(v_w \cdot w_c)$

$= v_o - \sum_{w} p(x_w \mid x_c)\, v_w$

SLIDE 30

Training difficulties

} With large vocabularies, it is not scalable!

$\frac{\partial \log p(x_o \mid x_c)}{\partial w_c} = v_o - \sum_{w=1}^{V} p(x_w \mid x_c)\, v_w$

} Define negative prediction that only samples a few words that do not appear in the context
} Similar to focusing on mostly positive correlations

SLIDE 31

Word2vec: Negative Sampling

} Computing $\sum_{w=1}^{V} p(x_w \mid x_c)\, v_w$ is very time consuming.
} Main idea: train a binary classifier for a true pair (the center word and a word in its context window) and a couple of random pairs (the center word with a random word)

SLIDE 32

Negative sampling

} k is the number of negative samples

$\log \sigma(v_o \cdot w_c) + \sum_{j \sim P(x)} \log \sigma(-v_j \cdot w_c)$

} Maximize the probability that the real outside word appears, and minimize the probability that random words appear around the center word
} $P(x) = U(x)^{3/4} / Z$
} the unigram distribution U(x) raised to the 3/4 power
} The power makes less frequent words be sampled more often

Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, 2013.
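A NumPy sketch of the negative-sampling objective for a single (center, outside) pair (my own illustration; the counts and vectors are toy values, not from the slides):

```python
# Negative sampling: one positive (center, outside) pair plus k "noise" words
# drawn from the unigram distribution raised to the 3/4 power.
import numpy as np

V, d, k = 10, 4, 5
rng = np.random.default_rng(0)
W_center  = rng.normal(size=(V, d))        # w_x (center-word vectors)
V_outside = rng.normal(size=(V, d))        # v_x (outside-word vectors)

counts = rng.integers(1, 100, size=V)      # toy unigram counts
P = counts ** 0.75
P = P / P.sum()                            # P(x) proportional to U(x)^(3/4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_objective(c, o):
    negatives = rng.choice(V, size=k, p=P)                      # k random words
    obj = np.log(sigmoid(V_outside[o] @ W_center[c]))           # real outside word
    obj += np.log(sigmoid(-V_outside[negatives] @ W_center[c])).sum()
    return obj                                                  # to be maximized

print(neg_sampling_objective(c=7, o=3))
```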

SLIDE 33

Example

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model

SLIDE 34

What to do with the two sets of vectors?

} We end up with U and V from all the vectors u and v (in columns)
} Both capture similar co-occurrence information.
} The best solution is to simply sum them up: X_final = U + V
} One of many hyperparameters explored in GloVe

Pennington et al., Global Vectors for Word Representation, 2014.

SLIDE 35

SLIDE 36

Summary of word2vec

} Go through each word of the whole corpus
} Predict surrounding words of each word
} This captures co-occurrence of words one at a time
} Why not capture co-occurrence counts directly?

SLIDE 37

LSI vs. Skip-gram

Slide by M. Korniyenko, S. Samson. http://www.sfs.uni-tuebingen.de/~ddekok/dl4nlp/glove-presentation.pdf

SLIDE 38

LSI disadvantages

} The co-occurrence matrix changes very often (new words are added).
} The matrix is extremely sparse since most words do not co-occur.
} Quadratic cost to train.
} Requires the incorporation of some hacks on X to account for the drastic imbalance in word frequency.
} Iteration-based methods solve many of these issues in a far more elegant manner.

Slide by M. Korniyenko, S. Samson. http://www.sfs.uni-tuebingen.de/~ddekok/dl4nlp/glove-presentation.pdf

SLIDE 39

Main Idea of word2vec

} Instead of capturing co-occurrence counts directly, predict surrounding words of every word
} For many tasks, word2vec (skip-gram) outperforms standard count-based vectors.
} But mainly due to the hyperparameters (see Levy et al.)

SLIDE 40

Window based co-occurrence matrix: Example

} Window length 1 (more common: 5 – 10)
} Symmetric (irrelevant whether left or right context)

[Figure: example corpus and the resulting co-occurrence matrix X]
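A small Python sketch of building such a matrix (my own illustration; the toy corpus below is made up, not the slide's):

```python
# Build a symmetric window-based co-occurrence matrix X (window length 1).
import numpy as np
from itertools import chain

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]

vocab = sorted(set(chain.from_iterable(corpus)))
idx = {w: i for i, w in enumerate(vocab)}
window = 1

X = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for t, word in enumerate(sentence):
        lo, hi = max(0, t - window), min(len(sentence), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                X[idx[word], idx[sentence[j]]] += 1   # count left and right contexts

print(vocab)
print(X)
```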

SLIDE 41

More about Word2Vec – relation to LSA

} LSA factorizes a matrix of co-occurrence counts
} (Levy and Goldberg, 2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!

SLIDE 42

GloVe

Pennington et al., Global Vectors for Word Representation, 2014.

$J(\theta) = \sum_{i,j} f(Y_{ij}) \left( v_i \cdot w_j - \log Y_{ij} \right)^2$
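A NumPy sketch of this objective (my own illustration; the weighting function f uses the common GloVe choice x_max = 100, α = 3/4, which is an assumption, and bias terms are omitted as in the formula above):

```python
# GloVe objective: weighted least-squares fit of dot products to log co-occurrence counts.
import numpy as np

def f(y, x_max=100.0, alpha=0.75):
    # weighting function: down-weights rare pairs, caps frequent ones at 1
    return np.where(y < x_max, (y / x_max) ** alpha, 1.0)

def glove_objective(Y, V_vec, W_vec):
    """J = sum over nonzero Y_ij of f(Y_ij) * (v_i . w_j - log Y_ij)^2."""
    J = 0.0
    for i, j in zip(*np.nonzero(Y)):
        J += f(Y[i, j]) * (V_vec[i] @ W_vec[j] - np.log(Y[i, j])) ** 2
    return J
```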

SLIDE 43

How to evaluate word vectors?

} Related to general evaluation in NLP: intrinsic vs. extrinsic
} Intrinsic:
} Evaluation on a specific/intermediate subtask
} Fast to compute
} Helps to understand that system
} Not clear if really helpful unless correlation to a real task is established
} Extrinsic:
} Evaluation on a real task
} Can take a long time to compute accuracy
} Unclear if the subsystem is the problem, or its interaction with other subsystems
} If replacing exactly one subsystem with another improves accuracy -> Winning!

SLIDE 44

Intrinsic evaluation: Word analogy tasks

} Performance in completing analogy tasks:
} Analogy queries
} Example: “man is to woman as king is to — ?”

$d^\ast = \arg\max_{x} \frac{(y_b - y_a + y_c)^\top y_x}{\lVert y_b - y_a + y_c \rVert}$

} Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions
} Discarding the input words from the search!
} Problem: what if the information is there but not linear?

$y_b - y_a \approx y_d - y_c \;\Rightarrow\; d^\ast = \arg\max_{x} \mathrm{sim}(y_b - y_a,\ y_x - y_c)$
$y_b - y_a + y_c \approx y_d \;\Rightarrow\; d^\ast = \arg\max_{x} \mathrm{sim}(y_b - y_a + y_c,\ y_x)$

Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, 2013.
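A small NumPy sketch of the additive analogy search (my own illustration; `vectors`, `word_to_idx`, and `idx_to_word` are assumed to come from a trained embedding):

```python
# Analogy "a is to b as c is to ?": d* = argmax_x sim(y_b - y_a + y_c, y_x),
# discarding the input words from the search.
import numpy as np

def analogy(a, b, c, vectors, word_to_idx, idx_to_word):
    query = vectors[word_to_idx[b]] - vectors[word_to_idx[a]] + vectors[word_to_idx[c]]
    sims = (vectors @ query) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query) + 1e-9)
    for w in (a, b, c):                       # discard the input words
        sims[word_to_idx[w]] = -np.inf
    return idx_to_word[int(np.argmax(sims))]

# expected behaviour with good vectors:
# analogy("man", "woman", "king", vectors, word_to_idx, idx_to_word) -> "queen"
```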

SLIDE 45

Word Analogies

The linearity of the skip-gram model makes its vectors more suitable for such linear analogical reasoning

SLIDE 46

Visualizations

SLIDE 47

GloVe Visualizations: Company - CEO

SLIDE 48

GloVe Visualizations: Superlatives

SLIDE 49

Other fun word2vec analogies

SLIDE 50

Analogy evaluation and hyperparameters

Pennington et al., Global Vectors for Word Representation, 2014.

SLIDE 51

Analogy evaluation and hyperparameters

} More data helps; Wikipedia is better than news text!

Pennington et al., Global Vectors for Word Representation, 2014.

SLIDE 52

Another intrinsic word vector evaluation

SLIDE 53

Closest words to “Sweden” (cosine similarity)

SLIDE 54

Extrinsic word vector evaluation

} Extrinsic evaluation of word vectors: all subsequent NLP tasks can be considered as downstream tasks
} One example where good word vectors should help directly: text classification

SLIDE 55

Word vectors: advantages

} They capture both syntactic (POS) and semantic information
} They scale
} Train on billion-word corpora in limited time
} Can easily incorporate a new sentence/document or add a word to the vocabulary
} Word embeddings trained by one group can be used by others.
} There is a nice Python module for word2vec
} Gensim (word2vec: http://radimrehurek.com/2014/02/word2vec-tutorial/)
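A minimal Gensim usage example (a sketch; the toy sentences and parameter values are my own choices, and the API shown is Gensim 4.x):

```python
# Train a small word2vec model with Gensim and query similar words.
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "floor"],
             ["the", "dog", "sat", "on", "the", "mat"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv.most_similar("cat", topn=3))
```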

SLIDE 56

Resources

} Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, 2013.
} Pennington et al., Global Vectors for Word Representation, 2014.