 
(Post-) Doctoral Seminar of Group ILES, LIMSI
10/04/2018

Graph-Based Word Embeddings Learning

Presenter: Zheng ZHANG
Supervisors: Pierre ZWEIGENBAUM & Yue MA

One year ago… (28/03/2017)
• Our plan: Using graph-of-words for word2vec training
• Difficulty: Optimization for big data

Graph-based word2vec training
• graph-of-words → word co-occurrence networks (matrices)
  • Definition: A graph whose vertices represent the unique terms of the document and whose edges represent co-occurrences between the terms within a fixed-size sliding window.
  • Networks and matrices are interchangeable.
• A new context → negative examples
  • word2vec already implicitly uses word co-occurrence statistics to select context words, but not to select negative examples.
Ref.: Rousseau F., Vazirgiannis M. (2015) Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction. https://safetyapp.shinyapps.io/GoWvis/

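
A minimal sketch of the graph-of-words definition above, in Python. The function names, the `max_distance` parameter (how far apart two terms may be and still share an edge) and the choice to skip self-loops are assumptions made for illustration; the slide only specifies a fixed-size sliding window.

```python
from collections import Counter


def graph_of_words(tokens, max_distance=2):
    """Return undirected, weighted edges between terms that co-occur
    at most `max_distance` positions apart (hypothetical helper)."""
    edges = Counter()
    for i, left in enumerate(tokens):
        # Look only to the right so that each pair is counted once.
        for right in tokens[i + 1 : i + 1 + max_distance]:
            if left != right:  # assumption: ignore self-loops
                edges[tuple(sorted((left, right)))] += 1
    return edges


def to_matrix(edges):
    """Same information as a symmetric co-occurrence matrix."""
    vocab = sorted({word for pair in edges for word in pair})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for (a, b), weight in edges.items():
        matrix[index[a]][index[b]] = weight
        matrix[index[b]][index[a]] = weight
    return vocab, matrix


tokens = "the cat sat on the mat".split()
edges = graph_of_words(tokens, max_distance=2)
vocab, matrix = to_matrix(edges)
```

The edge dictionary and the matrix hold the same counts, which is the sense in which networks and matrices are interchangeable here.
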
Skip-gram negative sampling
• Why?
  • Softmax calculation is too expensive → replace every $\log P(w_O \mid w_I)$ term in the Skip-gram objective.
• What?
  • Distinguish the target word from draws from the noise distribution using logistic regression, where there are k negative examples for each data sample.
• How?
  • [Formula on the original slide]
• Advantages:
  • Cheap to calculate.
  • All valid words could be selected as negative examples.
• Drawbacks:
  • Not targeted for training words.

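
For reference, the standard negative sampling objective from Mikolov et al. (2013), which is what each $\log P(w_O \mid w_I)$ term is replaced with. This is taken from that formulation, not from the slide text, so the notation ($v$ for input vectors, $v'$ for output vectors, $U(w)$ for the unigram distribution, $Z$ for the normalizing constant) is an assumption about what the slide showed.

```latex
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right],
\qquad
P_n(w) = \frac{U(w)^{3/4}}{Z}
```

The noise distribution $P_n(w)$ does not depend on the training word $w_I$.
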
Drawbacks of skip-gram negative sampling
• Negative sampling is not targeted for training words. It is only based on the word count.
[Figure: heat map of the negative examples distribution $\lg(P_n(w))$ over word_id × word_id, next to the word count per word_id. The distribution is the same for every training word.]

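
A small sketch of why the heat map rows are identical: the noise distribution is computed once from word counts and reused for every training word. The 0.75 smoothing power is the usual word2vec default and is an assumption here (the slide only says the distribution is based on the word count); the function names are hypothetical.

```python
import random
from collections import Counter


def unigram_noise_distribution(word_counts, power=0.75):
    """Noise distribution built from word counts only (assumed 0.75 power)."""
    weights = {word: count ** power for word, count in word_counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


def draw_negatives(noise, k, rng):
    """Draw k negative examples; the training word plays no role."""
    words = list(noise)
    return rng.choices(words, weights=[noise[w] for w in words], k=k)


counts = Counter("the cat sat on the mat".split())
noise = unigram_noise_distribution(counts)

rng = random.Random(0)
negatives_for_cat = draw_negatives(noise, k=5, rng=rng)
negatives_for_mat = draw_negatives(noise, k=5, rng=rng)  # same distribution
```
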
Graph-based negative sampling
• Based on the word co-occurrence network (matrix)
[Figure: heat map of the word co-occurrence distribution, lg(word co-occurrence), over word_id × word_id.]
• Three methods to generate the noise distribution:
  • Training-word context distribution
  • Difference between the unigram distribution and the training words' context distribution
  • Random walks on the word co-occurrence network

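
One possible reading of the three methods, as a rough sketch. The slide names the methods but gives no formulas, so everything below is an assumption: row normalization of the co-occurrence matrix, clipping negative differences at zero, and using t-step transition probabilities for the random walk. In each case, row u is the noise distribution for training word u.

```python
def normalize_rows(matrix):
    """Scale each row to sum to 1 (all-zero rows stay zero)."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([x / total if total else 0.0 for x in row])
    return result


def context_distribution(cooc):
    """Method 1 (assumed form): normalized co-occurrence counts per word."""
    return normalize_rows(cooc)


def difference_distribution(cooc, unigram):
    """Method 2 (assumed form): unigram minus context, clipped at 0."""
    context = normalize_rows(cooc)
    diff = [[max(u - c, 0.0) for u, c in zip(unigram, row)] for row in context]
    return normalize_rows(diff)


def random_walk_distribution(cooc, steps):
    """Method 3 (assumed form): `steps`-step random walk probabilities."""
    transition = normalize_rows(cooc)
    n = len(transition)
    result = transition
    for _ in range(steps - 1):
        result = [
            [sum(result[i][k] * transition[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return result


cooc = [[0, 2, 0], [2, 0, 1], [0, 1, 0]]  # toy 3-word co-occurrence matrix
unigram = [0.4, 0.4, 0.2]                 # toy unigram distribution

context = context_distribution(cooc)
difference = difference_distribution(cooc, unigram)
walk2 = random_walk_distribution(cooc, steps=2)
```

Unlike the count-based distribution, the rows now differ from one training word to the next.
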
Graph-based negative sampling
• Evaluation results
• Total time
  • Entire English Wikipedia corpus (~2.19 billion tokens), trained on the server prevert (50 threads used): 2.5h (word co-occurrence network (matrix) generation) + 8h (word2vec training)

How to generate a large word co-occurrence network within 3 hours?
• Difficulties:
  • Large corpus
  • NLP applications oriented (tokenizer, word preprocessing, POS-tagging, weighted word co-occurrences…)
  • Grid search of the window size
• We (joint work with Ruiqing YIN in group TLP) developed a tool for that!
  • Multiprocessing
  • Built-in methods to preprocess words, analyze sentences, extract word pairs and define edge weights.
  • User-customized functions supported.
  • Works with other graph libraries (igraph, NetworkX and graph-tool) as a front end providing data to boost network generation speed.

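
One way to keep a grid search over window sizes cheap, sketched in Python: count word pairs once per distance up to the largest window, then sum the per-distance counts for each smaller window. This is an illustrative assumption, not a description of the tool's internals, and the function names are hypothetical. Since sentences are counted independently and the counts simply add up, the counting step could also be split across worker processes.

```python
from collections import Counter, defaultdict


def count_pairs_by_distance(sentences, max_distance):
    """Count directed word pairs separately for each distance 1..max_distance."""
    by_distance = defaultdict(Counter)
    for tokens in sentences:
        for i, left in enumerate(tokens):
            for distance in range(1, max_distance + 1):
                if i + distance >= len(tokens):
                    break  # no token that far to the right
                by_distance[distance][(left, tokens[i + distance])] += 1
    return by_distance


def merge_up_to(by_distance, distance):
    """Co-occurrence counts for a window reaching `distance` tokens away."""
    merged = Counter()
    for d in range(1, distance + 1):
        merged.update(by_distance.get(d, Counter()))
    return merged


sentences = [["a", "b", "c", "d"], ["a", "c"]]
by_distance = count_pairs_by_distance(sentences, max_distance=3)

# One pass over the corpus serves every window size in the grid.
networks = {d: merge_up_to(by_distance, d) for d in (1, 2, 3)}
```
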
One year ago… (28/03/2017)
• Our plan: Using graph-of-words for word2vec training
• Difficulty: Optimization for big data

Progress and Future work
• Our plan: Using graph-of-words for word2vec training
  • GNEG: Graph-Based Negative Sampling for word2vec (submitted as a short paper to ACL 2018)
  • Future work:
    • Graph-based context selection
    • Re-assign the training order
    • Adapt for multi-lingual word embeddings training
• Difficulty: Optimization for big data
  • Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph (to appear in TextGraphs 2018)
  • An open-source NLP-application-oriented tool: a public version will be available on GitHub (https://github.com/zzcoolj/corpus2graph) by the end of this week.
  • Future work:
    • More built-in methods
    • Efficient graph processing

Thank you