Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph (PowerPoint presentation)



SLIDE 1

Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph

Zheng Zhang, Ruiqing Yin, Pierre Zweigenbaum

TextGraphs-2018, June 6, 2018

SLIDE 2

Why do we need this tool?

“However, the landscape of natural language processing has changed significantly since the glorious days of TextRank and similar co-occurrence-based approaches. I believe the authors should provide more recent studies confirming that one should still be curious about co-occurrence networks. Otherwise, the present tool looks too specific because people are too focused on word/sense/sentence/other embeddings today.”

SLIDE 3

Why do we need this tool?

Recent studies confirming interest in co-occurrence information (NAACL 2018):
  • Learning Word Embeddings for Low-resource Languages by PU Learning. Chao Jiang, Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang.
  • Filling Missing Paths: Modeling Co-occurrences of Word Pairs and Dependency Paths for Recognizing Lexical Semantic Relations. Koki Washio and Tsuneaki Kato.

SLIDE 4

Why do we need this tool?

Example of injecting word co-occurrence networks into word-embedding learning:
  • GNEG: Graph-Based Negative Sampling for word2vec. Zheng Zhang, Pierre Zweigenbaum. In Proceedings of ACL 2018, Melbourne, Australia.

SLIDE 5

Why do we need this tool?

Negative sampling vs. graph-based negative sampling: in the graph-based variant, the negative-example distribution is based on the word co-occurrence network (matrix).

[Figure: two word_id-by-word_id heat maps: the word co-occurrence distribution (log scale) and the negative-example distribution lg(f(w)^(3/4)), where f(w) is the word count.]
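The contrast between the two distributions can be sketched numerically. This is a simplified illustration with made-up counts, not the actual GNEG method (which is described in the ACL 2018 paper); the graph-based variant here just uses row sums of the co-occurrence matrix.

```python
import numpy as np

def unigram_negative_distribution(counts, power=0.75):
    """word2vec-style negative sampling: P(w) proportional to count(w)**0.75."""
    weights = np.asarray(counts, dtype=float) ** power
    return weights / weights.sum()

def graph_negative_distribution(cooc):
    """Graph-based variant (simplified): P(w) proportional to the
    total co-occurrence weight of w, i.e. the row sum of the matrix."""
    weights = np.asarray(cooc, dtype=float).sum(axis=1)
    return weights / weights.sum()

word_counts = [100, 10, 1]        # toy vocabulary of 3 words
cooc_matrix = [[0, 8, 2],         # toy word co-occurrence matrix
               [8, 0, 1],
               [2, 1, 0]]
p_unigram = unigram_negative_distribution(word_counts)
p_graph = graph_negative_distribution(cooc_matrix)
```

Both are valid probability distributions, but the graph-based one reflects a word's connectivity in the network rather than its raw frequency.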

SLIDE 6

corpus2graph

SLIDE 7

What kind of tool?

“Tech Specs”
  • Plays well with other graph libraries (“Don’t reinvent the wheel.”)
  • NLP-application oriented (built-in tokenizer, stemmer, sentence analyzer…)
  • Handles large corpora (e.g. the entire English Wikipedia corpus, ~2.19 billion tokens, by using multiprocessing)
  • Grid-search friendly (different window sizes, vocabulary sizes, sentence analyzers…)
  • Fast!

SLIDE 8

[Architecture diagram: corpus2graph generation and corpus2graph processing, interoperating with igraph.]

SLIDE 9

Word Co-occurrence Network Generation: NLP-applications oriented

Input sentence: “The history of natural language processing generally started in the 1950s.”

Word level
  • Word processor (built-in): tokenizer, stemmer, replacing numbers, and removing punctuation marks and/or stop words
  • User-customized word processor

After processing: “The histori of natur languag process gener start in the 0000s” (diagram tokens: h n l p g s)
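A user-customized word processor can be as simple as a function from a sentence to a token list. The sketch below is not the built-in processor (and omits stemming); it only illustrates the number-replacement and punctuation-removal steps:

```python
import re
import string

def word_processor(sentence):
    """Toy word processor: lowercase, strip punctuation marks,
    and replace every digit with 0 (so "1950s" becomes "0000s")."""
    tokens = []
    for tok in sentence.lower().split():
        tok = tok.strip(string.punctuation)   # remove punctuation marks
        tok = re.sub(r"\d", "0", tok)         # replace numbers
        if tok:                               # drop tokens that were pure punctuation
            tokens.append(tok)
    return tokens

word_processor("The history of NLP started in the 1950s.")
# → ['the', 'history', 'of', 'nlp', 'started', 'in', 'the', '0000s']
```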

SLIDE 10

Word Co-occurrence Network Generation: NLP-applications oriented

Word level
  • Word processor (built-in): tokenizer, stemmer, replacing numbers, and removing punctuation marks and/or stop words
  • User-customized word processor

Sentence level
  • Word pairs of different distances are extracted by the sentence analyzer
  • User-customized sentence analyzer

[Diagram: token sequence h n l p g s with d_max = 2; arcs mark word pairs at distance 1 and distance 2.]
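A sentence analyzer in this spirit enumerates, for each position, the words up to d_max tokens ahead. A minimal sketch (the function name is mine, not the corpus2graph API):

```python
def extract_pairs(tokens, d_max=2):
    """Emit (left_word, right_word, distance) for every ordered pair of
    tokens at positional distance 1 .. d_max within one sentence."""
    pairs = []
    for i, left in enumerate(tokens):
        for d in range(1, d_max + 1):
            if i + d < len(tokens):              # stay inside the sentence
                pairs.append((left, tokens[i + d], d))
    return pairs

extract_pairs(["h", "n", "l", "p"], d_max=2)
# → [('h', 'n', 1), ('h', 'l', 2), ('n', 'l', 1), ('n', 'p', 2), ('l', 'p', 1)]
```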

SLIDE 11

Word Co-occurrence Network Generation: NLP-applications oriented

Word level
  • Word processor (built-in): tokenizer, stemmer, replacing numbers, and removing punctuation marks and/or stop words
  • User-customized word processor

Sentence level
  • Word pairs of different distances are extracted by the sentence analyzer
  • User-customized sentence analyzer

Word-pair level
  • Word pair analyzer
  • Word pair weight w.r.t. the maximum distance
  • Directed & undirected
  • User-customized word pair analyzer

[Diagram: token sequence h n l p g s; undirected, with weight = 1 × (number of distance-1 pairs) + 1 × (number of distance-2 pairs), so every edge in the example has weight 1.]
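Aggregating per-distance pair counts into edge weights can be sketched as below. The function name and the per-distance coefficient dict are my illustration, not the corpus2graph API; uniform coefficients (all 1) match the slide's example.

```python
from collections import Counter

def edge_weights(pairs, d_max=2, coeff=None):
    """Aggregate (w1, w2, distance) triples into undirected edge weights:
    weight(u, v) = sum over d of coeff[d] * count_d(u, v)."""
    if coeff is None:
        coeff = {d: 1 for d in range(1, d_max + 1)}   # uniform coefficients
    weights = Counter()
    for w1, w2, d in pairs:
        if d <= d_max:
            edge = tuple(sorted((w1, w2)))   # undirected: order-insensitive key
            weights[edge] += coeff[d]
    return weights

edge_weights([("h", "n", 1), ("n", "h", 1), ("h", "l", 2)])
# → Counter({('h', 'n'): 2, ('h', 'l'): 1})
```

For a directed network, one would keep the (w1, w2) order instead of sorting the key.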

SLIDE 12

Word Co-occurrence Network Generation: Multiprocessing

  • 3 multiprocessing steps:
    • Word processing
    • Sentence analyzing
    • Word pair merging
  • MapReduce-like
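The MapReduce-like structure can be sketched with the standard library. Here a toy map step counts distance-1 pairs per sentence in worker processes, and a reduce step merges the partial counts; details such as on-disk intermediate files are omitted.

```python
from collections import Counter
from multiprocessing import Pool

def count_pairs(sentence):
    """Map step: one worker turns a sentence into local pair counts."""
    tokens = sentence.split()
    local = Counter()
    for i in range(len(tokens) - 1):
        local[(tokens[i], tokens[i + 1])] += 1   # distance-1 pairs only
    return local

def merge_counts(partials):
    """Reduce step: merge the per-worker counters into one network."""
    total = Counter()
    for part in partials:
        total.update(part)
    return total

if __name__ == "__main__":
    sentences = ["a b a b", "b a b"]
    with Pool(processes=2) as pool:
        partials = pool.map(count_pairs, sentences)   # parallel map phase
    network = merge_counts(partials)
    # network == Counter({('a', 'b'): 3, ('b', 'a'): 2})
```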
SLIDE 13

Word Co-occurrence Network Generation: Grid search

  • Word pair weights for different maximum distances (d_max)
  • Reuse of the intermediate data:
    • 1st step: numerical-id-encoded text file after word processing
    • 2nd step: separate word pair files of different distances for each text file
    • 2nd step: distinct word count
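Because word pairs are stored separately per distance, a network for any d_max within the stored range can be assembled without re-reading the corpus. A sketch, where an in-memory dict stands in for the per-distance intermediate files:

```python
from collections import Counter

# Stand-in for the intermediate word pair files: one count table per distance,
# produced once by the sentence-analysis step with the largest d_max of interest.
pairs_by_distance = {
    1: Counter({("a", "b"): 4, ("b", "c"): 2}),
    2: Counter({("a", "c"): 3}),
    3: Counter({("a", "d"): 1}),
}

def network_for(d_max):
    """Build the network for a given window size by merging the stored
    tables for distances 1 .. d_max -- no corpus re-processing needed."""
    net = Counter()
    for d in range(1, d_max + 1):
        net.update(pairs_by_distance[d])
    return net

network_for(2)
# → Counter({('a', 'b'): 4, ('a', 'c'): 3, ('b', 'c'): 2})
```

A grid search over window sizes then costs one merge per setting instead of one full corpus pass per setting.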

SLIDE 14

Word Co-occurrence Network Generation: Speed

  • The baseline: processing the corpus sentence by sentence, extracting word pairs, and adding them to the graph as edges through graph libraries; single-core.
  • Why is corpus2graph sometimes slower than the baseline when using NetworkX?
    • Small corpus, one core: corpus2graph is slower than the baseline when using NetworkX.
    • Large corpus, multiple cores: corpus2graph is much faster than the baseline when using NetworkX. Example: the entire English Wikipedia dump from April 2017 (~2.19 billion tokens), with 50 logical cores on a server with 4 Intel Xeon E5-4620 processors: ~2.5 hours.

[Table: word co-occurrence network generation speed (seconds).]

SLIDE 15

Word Co-occurrence Network Processing

  • Networks and matrices are interchangeable
  • Graph loading & transition matrix calculation speed (seconds)
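Interchangeability means ordinary matrix operations apply directly to the network. For instance, a transition (random-walk) matrix is just the row-normalized co-occurrence matrix; a minimal sketch:

```python
import numpy as np

def transition_matrix(adjacency):
    """Row-normalize a co-occurrence (adjacency) matrix:
    T[i, j] = A[i, j] / sum_j A[i, j]."""
    A = np.asarray(adjacency, dtype=float)
    row_sums = A.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0   # keep all-zero rows for isolated nodes
    return A / row_sums

T = transition_matrix([[0, 2, 2],
                       [1, 0, 0],
                       [0, 0, 0]])
# T[0] == [0.0, 0.5, 0.5]; T[1] == [1.0, 0.0, 0.0]; T[2] stays all zeros
```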

SLIDE 16

Open source

https://github.com/zzcoolj/corpus2graph

graph_from_corpus all [--max_window_size=<max_window_size> --process_num=<process_num> --min_count=<min_count> --max_vocab_size=<max_vocab_size> --safe_files_number_per_processor=<safe_files_number_per_processor>] <data_dir> <output_dir>

SLIDE 17

Future work

  • Word co-occurrence network generation
    • “Desktop mode”: less memory consumption and fewer cores, but also less flexibility for grid search
  • Word co-occurrence network processing
    • Support more graph processing methods
    • GPU mode

SLIDE 18

Thanks for your attention!