

  1. PhD Mid-term Follow-up, 16/10/2018
 MID-TERM FOLLOW-UP
 Semantic mining: Unsupervised acquisition of multilingual semantic classes from texts
 Presenter: Zheng ZHANG
 Supervisors: Pierre ZWEIGENBAUM & Yue MA
 Evaluation committee: Vincent CLAVEAU & Alexandre ALLAUZEN

  2. OUTLINE
 • Introduction of Thesis Topic
 • Results Achieved
   - GNEG: Graph-Based Negative Sampling for word2vec
   - corpus2graph: Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph
 • Work in Progress
 • Future Work

  3. Introduction of Thesis Topic

  4. Introduction of Thesis Topic: Multilingual semantic classes
 • Semantic class: a group of words clustered by using distributional similarity measures
 [Figure: French words pomme, pêche, prunelle, poire grouped into the classes "fruits" and "couleurs"]
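A minimal sketch of how such a class could be induced, assuming pre-trained French word vectors are available (the vector file name and the word list are illustrative, not from the presentation):

    # Cluster word embeddings into semantic classes with k-means.
    from gensim.models import KeyedVectors
    from sklearn.cluster import KMeans

    vectors = KeyedVectors.load("fr_vectors.kv")  # hypothetical vector file
    words = ["pomme", "pêche", "prunelle", "poire", "rouge", "vert"]
    X = [vectors[w] for w in words]

    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    for w, c in zip(words, labels):
        print(c, w)  # words sharing a label form one semantic class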

  5. Introduction of Thesis Topic: Multilingual semantic classes
 [Figure: classes in two languages. FR: pomme, pêche, prunelle, poire in "fruits" / "couleurs"; EN: peach, pear, apple in "fruits" / "colors"]


  7. Introduction of Thesis Topic: Applications: unknown word "translation"
 [Figure: the aligned FR/EN classes, now including the unknown French word "roux", which falls into the class "couleurs" aligned with the English class "colors"]


  9. Introduction of Thesis Topic: Applications: universal classes extraction
 [Figure: the FR class (pomme, pêche, prunelle, poire) and the EN class (peach, pear, apple) merge into a universal class of "fruits"]

  10. Introduction of Thesis Topic: Cross-lingual Word Embeddings Learning
 Approaches differ along two axes: the level of the alignment data (word, sentence, document) and the training stage at which alignment happens (pre-processing / "count-based", training / "neural", post-embedding). Cited examples: mikolov2013exploiting, gouws2015simple, artetxe2017learning (word level); gouws2015bilbowa, luong2015bilingual, levy2017strong (sentence level); vulic2015bilingual (document level).
 Multilingual word embeddings learning can be seen as an extension of (monolingual) word embeddings learning.

  11. Results Achieved

  12. Results Achieved: GNEG*
 Skip-gram negative sampling
 • Why? The softmax calculation is too expensive, so negative sampling replaces every log-softmax term in the Skip-gram objective.
 • What? Distinguish the target word from draws from a noise distribution using logistic regression, with k negative examples for each data sample.
 • Advantages: cheap to calculate, and all valid words can be selected as negative examples.
 *GNEG: Graph-Based Negative Sampling for word2vec, Zheng ZHANG, Pierre ZWEIGENBAUM. In Proceedings of ACL 2018, Melbourne, Australia
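For reference, this is the standard per-sample SGNS objective (Mikolov et al., 2013) that the slide summarizes, with k negative words w_i drawn from the noise distribution P_n(w):

    \log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^{\top} v_{w_I}) \right]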

  13. Results Achieved: GNEG*
 Drawbacks of skip-gram negative sampling
 • Negative sampling is not targeted to the training word: it is based only on the word count.
 [Figures: word count per word_id, and heat map of the negative-example distribution lg(P_n(w)) per word_id: the two have the same shape!]
 *GNEG: Graph-Based Negative Sampling for word2vec, Zheng ZHANG, Pierre ZWEIGENBAUM. In Proceedings of ACL 2018, Melbourne, Australia
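A minimal sketch of that default noise distribution, which uses nothing but corpus counts (the 3/4 exponent is word2vec's standard choice; all names are illustrative):

    from collections import Counter
    import numpy as np

    def noise_distribution(tokens, power=0.75):
        """P_n(w) proportional to count(w)**power: identical for every training word."""
        counts = Counter(tokens)
        vocab = sorted(counts)
        weights = np.array([counts[w] for w in vocab], dtype=float) ** power
        return vocab, weights / weights.sum()

    vocab, p_n = noise_distribution("the cat sat on the mat".split())
    negatives = np.random.choice(vocab, size=5, p=p_n)  # same draw whatever the target word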

  14. Results Achieved: GNEG*
 Graph-based word2vec training
 • Word co-occurrence networks (matrices). Definition: a graph whose vertices represent unique terms of the document and whose edges represent co-occurrences between the terms within a fixed-size sliding window. Networks and matrices are interchangeable.
   Ref. Rousseau F., Vazirgiannis M. (2015) Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction. https://safetyapp.shinyapps.io/GoWvis/
 • A new context → negative examples: word2vec already implicitly uses the statistics of word co-occurrences for the context word selection, but not for the negative example selection.
 *GNEG: Graph-Based Negative Sampling for word2vec, Zheng ZHANG, Pierre ZWEIGENBAUM. In Proceedings of ACL 2018, Melbourne, Australia
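A minimal sketch of the definition above, using networkx for illustration (corpus2graph, presented later, is the authors' tool for doing this at scale):

    import networkx as nx

    def cooccurrence_network(tokens, window=2):
        """Add an edge (u, v) whenever v occurs within `window` positions after u."""
        g = nx.Graph()
        for i, u in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if u == v:
                    continue
                if g.has_edge(u, v):
                    g[u][v]["weight"] += 1
                else:
                    g.add_edge(u, v, weight=1)
        return g

    g = cooccurrence_network("the cat sat on the mat".split())
    print(g.edges(data=True))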

  15. Results Achieved: GNEG*
 Graph-based negative sampling
 • Based on the word co-occurrence network (matrix)
 [Figure: heat map of the word co-occurrence distribution, lg(word co-occurrence) per (word_id, word_id) pair]
 • Three methods to generate the noise distribution:
   - the training word's context distribution;
   - the difference between the unigram distribution and the training word's context distribution;
   - random walks on the word co-occurrence network (see the sketch below).
 *GNEG: Graph-Based Negative Sampling for word2vec, Zheng ZHANG, Pierre ZWEIGENBAUM. In Proceedings of ACL 2018, Melbourne, Australia
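A minimal sketch of the random-walk option: powers of the row-normalized co-occurrence matrix give each word its own noise distribution (the matrix values and the number of steps are illustrative; the paper's exact formulation may differ):

    import numpy as np

    def random_walk_noise(A, steps=2):
        """Row i of the result: distribution reached from word i after `steps`
        random-walk steps on the co-occurrence matrix A."""
        P = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
        return np.linalg.matrix_power(P, steps)

    A = np.array([[0, 2, 1],
                  [2, 0, 3],
                  [1, 3, 0]], dtype=float)
    noise = random_walk_noise(A)
    print(noise[0])  # noise distribution specific to word 0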

  16. Results Achieved: GNEG*
 Graph-based negative sampling: evaluation results
 [Table: evaluation results]
 • Total time for the entire English Wikipedia corpus on the server prevert (50 threads used): 2.5 h for word co-occurrence network (matrix) generation + 8 h for word2vec training.
 *GNEG: Graph-Based Negative Sampling for word2vec, Zheng ZHANG, Pierre ZWEIGENBAUM. In Proceedings of ACL 2018, Melbourne, Australia

  17. Results Achieved: corpus2graph*
 How to generate a large word co-occurrence network within 3 hours? "Tech Specs":
 • Works well with other graph libraries ("Don't reinvent the wheel.")
 • NLP-applications oriented (built-in tokenizer, stemmer, sentence analyzer…)
 • Handles large corpora (e.g. the entire English Wikipedia corpus) by using multiprocessing
 • Grid-search friendly (different window sizes, vocabulary sizes, sentence analyzers…)
 • Fast!
 *Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph, Zheng ZHANG, Ruiqing YIN, Pierre ZWEIGENBAUM. In Proceedings of the NAACL 2018 Workshop on Graph-Based Algorithms for Natural Language Processing, New Orleans, US

  18. Results Achieved: corpus2graph*
 [Diagram: corpus2graph pipeline: network generation with corpus2graph, then network processing with corpus2graph or igraph]
 *Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph, Zheng ZHANG, Ruiqing YIN, Pierre ZWEIGENBAUM. In Proceedings of the NAACL 2018 Workshop on Graph-Based Algorithms for Natural Language Processing, New Orleans, US

  19. Results Achieved: corpus2graph*
 Word Co-occurrence Network Generation: NLP-applications oriented
 • Word level: built-in word processor (tokenizer, stemmer, replacing numbers & removing punctuation marks and/or stop words); user-customized word processors are also supported (an illustrative sketch follows).
 Example: "The history of natural language processing generally started in the 1950s." → "The histori of natur languag process gener start in the 0000s"
 [Figure: resulting graph nodes h, n, l, p, g, s, 0]
 *Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph, Zheng ZHANG, Ruiqing YIN, Pierre ZWEIGENBAUM. In Proceedings of the NAACL 2018 Workshop on Graph-Based Algorithms for Natural Language Processing, New Orleans, US
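A minimal sketch of such a word-processing step, for illustration only (corpus2graph's actual API is not shown here); with NLTK's Porter stemmer the output nearly matches the slide, except that the plural s of "0000s" is also stripped:

    import re
    from nltk.stem import PorterStemmer

    def process_words(sentence):
        """Tokenize, replace digits by 0, then stem each token."""
        stemmer = PorterStemmer()
        tokens = re.findall(r"\w+", sentence)
        return [stemmer.stem(re.sub(r"\d", "0", t)) for t in tokens]

    print(" ".join(process_words(
        "The history of natural language processing generally started in the 1950s.")))
    # the histori of natur languag process gener start in the 0000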

  20. Results Achieved: corpus2graph*
 Word Co-occurrence Network Generation: NLP-applications oriented (continued)
 • Sentence level: the sentence analyzer extracts word pairs at every distance up to d_max (e.g. distance = 1 and distance = 2 for d_max = 2); user-customized sentence analyzers are also supported (an illustrative sketch follows).
 [Figure: distance-1 and distance-2 pairs over the nodes h, n, l, p, g, s, 0]
 *Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph, Zheng ZHANG, Ruiqing YIN, Pierre ZWEIGENBAUM. In Proceedings of the NAACL 2018 Workshop on Graph-Based Algorithms for Natural Language Processing, New Orleans, US
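A minimal sketch of the pair-extraction idea, again illustrative rather than corpus2graph's actual code:

    def word_pairs(tokens, d_max=2):
        """Yield (left word, right word, distance) for all distances 1..d_max."""
        for i, u in enumerate(tokens):
            for d in range(1, d_max + 1):
                if i + d < len(tokens):
                    yield u, tokens[i + d], d

    for pair in word_pairs(["h", "n", "l", "p", "g", "s", "0"]):
        print(pair)  # e.g. ('h', 'n', 1), ('h', 'l', 2), ('n', 'l', 1), ...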

  21. Results Achieved: corpus2graph*
 Word Co-occurrence Network Generation: NLP-applications oriented (continued)
 • Word pair level: the word pair analyzer computes each pair's weight w.r.t. the maximum distance, supports directed & undirected networks, and can be user-customized (an illustrative sketch follows).
 [Figure: resulting weighted graph over the nodes h, n, l, p, g, s, 0, all edge weights equal to 1]
 *Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph, Zheng ZHANG, Ruiqing YIN, Pierre ZWEIGENBAUM. In Proceedings of the NAACL 2018 Workshop on Graph-Based Algorithms for Natural Language Processing, New Orleans, US
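A minimal sketch of one plausible word pair analyzer, aggregating (word, word, distance) tuples into undirected edge weights; this aggregation rule is an assumption, not necessarily the paper's:

    from collections import Counter

    def edge_weights(pairs, d_max=2):
        """Sum counts over all distances up to d_max into undirected edge weights."""
        weights = Counter()
        for u, v, d in pairs:
            if d <= d_max:
                weights[tuple(sorted((u, v)))] += 1
        return weights

    pairs = [("h", "n", 1), ("h", "l", 2), ("n", "l", 1), ("n", "p", 2), ("l", "p", 1)]
    print(edge_weights(pairs))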

  22. Results Achieved: corpus2graph*
 Word Co-occurrence Network Generation: multiprocessing
 • 3 multiprocessing steps: word processing, sentence analyzing, word pair merging
 • MapReduce-like (an illustrative sketch follows)
 *Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph, Zheng ZHANG, Ruiqing YIN, Pierre ZWEIGENBAUM. In Proceedings of the NAACL 2018 Workshop on Graph-Based Algorithms for Natural Language Processing, New Orleans, US
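A minimal sketch of the MapReduce-like layout, assuming the corpus is split into shard files; the map and reduce helpers below are simplified stand-ins for the three steps above:

    from collections import Counter
    from multiprocessing import Pool

    def analyze_file(path):
        """Map step: word-process one corpus shard and count its word pairs."""
        counts = Counter()
        with open(path) as f:
            for line in f:
                tokens = line.lower().split()        # stand-in word processing
                for i, u in enumerate(tokens[:-1]):  # stand-in sentence analyzer, d_max=1
                    counts[(u, tokens[i + 1])] += 1
        return counts

    if __name__ == "__main__":
        files = ["part-0.txt", "part-1.txt"]  # hypothetical corpus shards
        with Pool() as pool:
            merged = sum(pool.map(analyze_file, files), Counter())  # reduce: merge counts
        print(merged.most_common(5))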

  23. Work in Progress

  24. Work in Progress
 Use of pre-computed graph information extracted from word co-occurrence networks:
 • Graph-based negative sampling for fastText
 • Word co-occurrence based matrix factorization for word embeddings learning

  25. Work in Progress: Matrix Factorization
 • [Levy and Goldberg, 2014] show that skip-gram with negative sampling is implicitly factorizing a word-context matrix whose cells hold max(PMI(w, c) − log k, 0).
 • Pipeline: word co-occurrence matrix → "enhanced" (shifted positive PMI) matrix → SVD → word embeddings.
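A minimal sketch of that factorization on a toy symmetric co-occurrence matrix (SPPMI, then SVD); the dimensions and the shift k are illustrative:

    import numpy as np

    def sppmi(C, k=5):
        """Shifted positive PMI matrix: max(PMI(w, c) - log k, 0)."""
        total = C.sum()
        pw = C.sum(axis=1, keepdims=True) / total
        pc = C.sum(axis=0, keepdims=True) / total
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log((C / total) / (pw * pc))
        return np.maximum(np.nan_to_num(pmi) - np.log(k), 0.0)

    C = np.array([[0, 4, 1],
                  [4, 0, 2],
                  [1, 2, 0]], dtype=float)
    U, S, Vt = np.linalg.svd(sppmi(C))
    embeddings = U[:, :2] * np.sqrt(S[:2])  # common choice: W = U_d * sqrt(S_d)
    print(embeddings)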
