
Efficient Generation and Processing of Word Co-occurrence Networks



  1. TextGraphs-2018, June 6, 2018. Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph. Zheng Zhang, Ruiqing Yin, Pierre Zweigenbaum.

  2. Why do we need this tool? “However, the landscape of natural language processing has changed significantly since the glorious days of TextRank and similar co-occurrence-based approaches. I believe the authors should provide more recent studies confirming that one should still be curious about co-occurrence networks. Otherwise, the present tool looks too specific because people are too focused on word/sense/sentence/other embeddings today.”

  3. Why do we need this tool? (Same reviewer quote as above, asking for more recent studies on co-occurrence networks.) Recent studies from NAACL 2018:
  • Learning Word Embeddings for Low-resource Languages by PU Learning. Chao Jiang, Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang.
  • Filling Missing Paths: Modeling Co-occurrences of Word Pairs and Dependency Paths for Recognizing Lexical Semantic Relations. Koki Washio and Tsuneaki Kato.

  4. Why do we need this tool? (Same reviewer quote as above.) An example of injecting word co-occurrence networks into word embedding learning:
  • GNEG: Graph-Based Negative Sampling for word2vec. Zheng Zhang, Pierre Zweigenbaum. In Proceedings of ACL 2018, Melbourne, Australia.

  5. Why do we need this tool? Negative sampling vs. graph-based negative sampling. [Figure: two word_id × word_id heat maps, one of the negative examples distribution lg(c(w)^{3/4}) and one of the word co-occurrence distribution.] In graph-based negative sampling, the negative examples distribution is based on the word co-occurrence network (matrix). c(w): word count.

  6. Why do we need this tool? Same comparison as the previous slide, with the annotation that corpus2graph is the tool that builds the word co-occurrence network (matrix) on which the graph-based negative examples distribution is based.

  7. What kind of tool? “Tech Specs”
  • Friendly to other graph libraries (“Don’t reinvent the wheel.”)
  • NLP-applications oriented (built-in tokenizer, stemmer, sentence analyzer…)
  • Handles large corpora (e.g. the entire English Wikipedia corpus, ~2.19 billion tokens, by using multiprocessing)
  • Grid-search friendly (different window sizes, vocabulary sizes, sentence analyzers…)
  • Fast!

  8. corpus2graph. [Diagram: network generation is done with corpus2graph; network processing can be done with corpus2graph or igraph.]

  9. Word Co-occurrence Network Generation: NLP-applications oriented
  • Word processor (built-in): tokenizer, stemmer, replacing numbers, and removing punctuation marks and/or stop words
  • User-customized word processor
  Example (sketched below): “The history of natural language processing generally started in the 1950s.” → “The histori of natur languag process gener start in the 0000s” → content tokens after stop-word removal, abbreviated to their initials: h n l p g s 0
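A minimal sketch of such a word-processing step, illustrative only and not the corpus2graph API; NLTK's PorterStemmer and the toy stop-word list are assumptions:

    import re
    from nltk.stem import PorterStemmer  # assumed dependency, for illustration

    stemmer = PorterStemmer()
    STOP_WORDS = {"the", "of", "in"}  # toy stop-word list

    def process(sentence):
        """Tokenize, replace digits with 0, drop stop words, stem."""
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        result = []
        for tok in tokens:
            tok = re.sub(r"\d", "0", tok)  # replace numbers: 1950s -> 0000s
            if tok not in STOP_WORDS:
                result.append(stemmer.stem(tok))
        return result

    print(process("The history of natural language processing "
                  "generally started in the 1950s."))
    # e.g. ['histori', 'natur', 'languag', 'process', 'gener', 'start', '0000']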

  10. Word Co-occurrence Network Generation: NLP-applications oriented
  • Word processor (built-in): tokenizer, stemmer, replacing numbers, and removing punctuation marks and/or stop words; user-customized word processor
  • Sentence analyzer: word pairs of different distances are extracted; user-customized sentence analyzer
  Example with d_max = 2 over the tokens h n l p g s 0: pairs at distance 1 and at distance 2 (see the sketch below)
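A minimal sketch of a distance-aware sentence analyzer, illustrative only and not the corpus2graph API:

    from collections import Counter

    def word_pairs(tokens, d_max=2):
        """Extract (w1, w2, distance) counts for all distances 1..d_max."""
        pairs = Counter()
        for i, w1 in enumerate(tokens):
            for d in range(1, d_max + 1):
                if i + d < len(tokens):
                    pairs[(w1, tokens[i + d], d)] += 1
        return pairs

    print(word_pairs(["h", "n", "l", "p", "g", "s", "0"], d_max=2))
    # Counter({('h', 'n', 1): 1, ('h', 'l', 2): 1, ('n', 'l', 1): 1, ...})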

  11. Word Co-occurrence Network Generation: NLP-applications oriented
  • Word processor (built-in); user-customized word processor
  • Sentence analyzer; user-customized sentence analyzer
  • Word pair analyzer: word pair weight w.r.t. the maximum distance; directed and undirected; user-customized word pair analyzer
  Example (d_max = 2, undirected): weight = 1 × number of co-occurrences at distance 1 + 1 × number of co-occurrences at distance 2. [Figure: the resulting undirected network over the nodes h, n, l, p, g, s, 0, all edge weights 1.] A sketch of this aggregation follows below.
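A sketch of the word pair analyzer's aggregation step, assuming (w1, w2, distance) counts like those from the previous sketch; the coeff parameter is a hypothetical hook letting each distance contribute its own coefficient (here 1, matching the slide's formula):

    from collections import Counter

    def edge_weights(pairs, d_max=2, coeff=lambda d: 1):
        """weight(w1, w2) = sum over d <= d_max of coeff(d) * count_d(w1, w2)."""
        weights = Counter()
        for (w1, w2, d), count in pairs.items():
            if d <= d_max:
                edge = tuple(sorted((w1, w2)))  # undirected: order-insensitive key
                weights[edge] += coeff(d) * count
        return weights

    pairs = Counter({("h", "n", 1): 1, ("n", "l", 1): 1, ("h", "l", 2): 1})
    print(edge_weights(pairs))
    # Counter({('h', 'n'): 1, ('l', 'n'): 1, ('h', 'l'): 1})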

  12. Word Co-occurrence Network Generation: Multiprocessing
  • 3 multiprocessing steps: word processing, sentence analyzing, word pair merging
  • MapReduce-like (see the sketch below)
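A minimal sketch of the MapReduce-like pattern, not corpus2graph's actual implementation: each worker counts word pairs for one preprocessed file (map), and the per-file counters are merged afterwards (reduce):

    from collections import Counter
    from functools import reduce
    from multiprocessing import Pool

    def count_file(path, d_max=2):
        """Map: count (w1, w2, distance) pairs in one preprocessed file."""
        pairs = Counter()
        with open(path) as f:
            for line in f:
                tokens = line.split()
                for i, w1 in enumerate(tokens):
                    for d in range(1, d_max + 1):
                        if i + d < len(tokens):
                            pairs[(w1, tokens[i + d], d)] += 1
        return pairs

    def count_corpus(paths, processes=4):
        """Reduce: merge the per-file counters into a single one."""
        with Pool(processes) as pool:
            partials = pool.map(count_file, paths)
        return reduce(lambda a, b: a + b, partials, Counter())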

  13. Word Co-occurrence Network Generation: Grid search
  • Word pair weights for different maximum distances (d_max)
  • Reuse of the intermediate data (see the sketch below):
  • 1st step: numerical-id-encoded text file after word processing
  • 2nd step: separate word pair files of different distances for each text file
  • 2nd step: distinct word count
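This is why per-distance intermediate files make the grid search cheap: changing d_max never re-reads the corpus, it only re-sums counts that already exist. A toy sketch, with counts_by_distance as a hypothetical mapping {distance: counter of word pairs}:

    from collections import Counter

    def network_for(counts_by_distance, d_max):
        """Build edge weights for a given d_max from per-distance counts."""
        weights = Counter()
        for d in range(1, d_max + 1):
            weights.update(counts_by_distance.get(d, Counter()))
        return weights

    counts_by_distance = {
        1: Counter({("h", "n"): 1, ("n", "l"): 1}),
        2: Counter({("h", "l"): 1}),
    }
    print(network_for(counts_by_distance, d_max=1))  # distance-1 edges only
    print(network_for(counts_by_distance, d_max=2))  # adds the distance-2 edge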

  14. Word Co-occurrence Network Generation: Speed
  [Table: word co-occurrence network generation speed, in seconds.]
  • The baseline: processing the corpus sentence by sentence, extracting word pairs, and adding them to the graph as edges through graph libraries (sketched below).
  • Small corpus, one core: corpus2graph is slower than the baseline when using NetworkX.
  • Large corpus, multiple cores: corpus2graph is much faster than the baseline when using NetworkX.
  • Example: entire English Wikipedia dump from April 2017 (~2.19 billion tokens), 50 logical cores on a server with 4 Intel Xeon E5-4620 processors: ~2.5 hours.
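A sketch of what such a baseline could look like with NetworkX, an illustration of the described setup rather than the exact benchmark code: every pair triggers an individual edge update on the graph, which is the per-edge overhead corpus2graph avoids by counting pairs first and building the graph afterwards.

    import networkx as nx

    def baseline_network(sentences, d_max=2):
        """Add each co-occurring pair to the graph one edge at a time."""
        g = nx.Graph()
        for tokens in sentences:
            for i, w1 in enumerate(tokens):
                for d in range(1, d_max + 1):
                    if i + d < len(tokens):
                        w2 = tokens[i + d]
                        if g.has_edge(w1, w2):
                            g[w1][w2]["weight"] += 1
                        else:
                            g.add_edge(w1, w2, weight=1)
        return g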

  15. Word Co-occurrence Network Processing
  • Networks and matrices are interchangeable (see the sketch below).
  [Table: graph loading and transition matrix calculation speed, in seconds.]
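A sketch of the network-to-matrix direction, illustrative and not corpus2graph's code: undirected edge weights become a sparse adjacency matrix, and row normalization turns it into a transition matrix. (Self-loops, if any, would be counted twice in this simplified version.)

    import numpy as np
    from scipy.sparse import coo_matrix

    def transition_matrix(weights, vocab):
        """Row-stochastic transition matrix from undirected edge weights."""
        idx = {w: i for i, w in enumerate(vocab)}
        rows, cols, vals = [], [], []
        for (w1, w2), weight in weights.items():
            rows += [idx[w1], idx[w2]]  # symmetric: add both directions
            cols += [idx[w2], idx[w1]]
            vals += [float(weight)] * 2
        n = len(vocab)
        adj = coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
        row_sums = np.asarray(adj.sum(axis=1)).ravel()
        row_sums[row_sums == 0] = 1.0  # keep isolated nodes division-safe
        return adj.multiply(1.0 / row_sums[:, None]).tocsr()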

  16. Open source: https://github.com/zzcoolj/corpus2graph

    graph_from_corpus all
        [--max_window_size=<max_window_size> --process_num=<process_num>
         --min_count=<min_count> --max_vocab_size=<max_vocab_size>
         --safe_files_number_per_processor=<safe_files_number_per_processor>]
        <data_dir> <output_dir>
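A hedged example invocation; the option values are illustrative only, and <data_dir> and <output_dir> stand for your corpus and output directories, as in the usage line above:

    graph_from_corpus all --max_window_size=5 --process_num=20 \
        --min_count=5 --max_vocab_size=10000 <data_dir> <output_dir>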

  17. Future work
  • Word co-occurrence network generation: a “Desktop mode” with lower memory consumption and fewer cores, but also less feasibility for grid search
  • Word co-occurrence network processing: support for more graph processing methods; GPU mode

  18. Thanks for your attention!

