Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph
Zheng Zhang, Ruiqing Yin, Pierre Zweigenbaum
TextGraphs-2018, June 6, 2018
Why do we need this tool?

A reviewer's comment: "However, the landscape of natural language processing has changed significantly since the glorious days of TextRank and similar co-occurrence-based approaches. I believe the authors should provide more recent studies confirming that one should still be curious about co-occurrence [...] are too focused on word/sense/sentence/other embeddings today."
Yet recent studies at NAACL 2018 still rely on word co-occurrence:
- Learning Word Embeddings for Low-resource Languages by PU Learning. Chao Jiang, Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang.
- Filling Missing Paths: Modeling Co-occurrences of Word Pairs and Dependency Paths for Recognizing Lexical Semantic Relations. Koki Washio and Tsuneaki Kato.
Another example, of injecting word co-occurrence networks into word embedding learning:
- GNEG: Graph-Based Negative Sampling for word2vec. Zheng Zhang, Pierre Zweigenbaum. In Proceedings of ACL 2018, Melbourne, Australia.
[Slide: two heat maps (word_id × word_id, log scale): the word co-occurrence matrix and the negative-example distribution. In GNEG, the negative-example distribution is based on the word co-occurrence network (matrix).]
corpus2graph

- An open-source Python package for efficiently generating and processing word co-occurrence networks ("don't reinvent the wheel": it works with existing graph libraries rather than replacing them).
- Efficient: it can handle a large corpus (the entire English Wikipedia corpus, ~2.19 billion tokens) by using multiprocessing.
- Flexible: its processing steps are parameterizable (max window size, sentence analyzer, ...).
corpus2graph processing
Example sentence: "The history of natural language processing generally started in the 1950s."

Node level — word preprocessing, e.g. stemming ("The histori of natur languag process gener start in the 0000s", with digits replaced by 0) & removing punctuation marks and/or stop words. The remaining content words are abbreviated as the nodes h, n, l, p, g, s in the following slides.
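The node-level step above can be sketched in a few lines of plain Python. This is an illustrative stand-in, not corpus2graph's actual preprocessor (which is pluggable, so e.g. an NLTK stemmer can be supplied); stemming is omitted here to keep the sketch dependency-free, and the stop-word list is a small illustrative one.

```python
import re
import string

# Illustrative stop-word list (a real pipeline would use a fuller one).
STOP_WORDS = {"the", "of", "in", "a", "an", "and"}

def preprocess(sentence):
    """Lowercase, strip punctuation, replace digits with 0, drop stop words."""
    tokens = []
    for token in sentence.lower().split():
        token = token.strip(string.punctuation)   # remove punctuation marks
        token = re.sub(r"\d", "0", token)         # 1950s -> 0000s
        if token and token not in STOP_WORDS:     # remove stop words
            tokens.append(token)
    return tokens

print(preprocess("The history of natural language processing "
                 "generally started in the 1950s."))
# -> ['history', 'natural', 'language', 'processing', 'generally', 'started', '0000s']
```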
Local edge level — word pair extraction: starting from the preprocessed nodes h n l p g s, each word is paired with the following words up to a maximum distance dmax (here dmax = 2), and each pair's distance is recorded (distance = 1 for adjacent words, distance = 2 for words one position apart).
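The pair-extraction step can be sketched as follows; the function name and signature are illustrative, not corpus2graph's API. Recording the distance alongside each pair is what later allows any window size up to dmax to be derived without re-reading the corpus.

```python
def extract_pairs(tokens, d_max=2):
    """Enumerate (word, context_word, distance) tuples for every pair of
    tokens at most d_max positions apart (a sketch of the local edge level)."""
    pairs = []
    for i, w in enumerate(tokens):
        for d in range(1, d_max + 1):       # distance 1 .. d_max
            if i + d < len(tokens):
                pairs.append((w, tokens[i + d], d))
    return pairs

pairs = extract_pairs(["h", "n", "l", "p", "g", "s"], d_max=2)
print(len(pairs))  # -> 9 pairs for 6 tokens with d_max = 2
```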
Each extracted word pair initially contributes a weight of 1; the weights of identical pairs are then summed to form the edge weights of the co-occurrence network.
Matrix level — each text file of the corpus is processed independently (which enables multiprocessing); the per-file word-pair counts are then merged and added to the graph as edges through graph libraries.
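The merge step can be sketched with plain Counters; again this is an illustrative sketch of the idea, not corpus2graph's actual code. Each worker returns the edge counts for one text file, and element-wise addition yields the counts for the whole corpus.

```python
from collections import Counter

def merge_edge_counts(per_file_counts):
    """Sum per-file Counters of (word, context_word, distance) -> count."""
    total = Counter()
    for counts in per_file_counts:
        total.update(counts)           # element-wise addition of counts
    return total

file1 = Counter({("h", "n", 1): 3, ("n", "l", 1): 1})
file2 = Counter({("h", "n", 1): 2, ("l", "p", 2): 4})
merged = merge_edge_counts([file1, file2])
print(merged[("h", "n", 1)])  # -> 5
```

The resulting (word1, word2, weight) triples can then be handed to a graph library, e.g. NetworkX's `Graph.add_weighted_edges_from`.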
Speed comparison against a baseline that builds the network directly with NetworkX. Example: entire English Wikipedia dump from April 2017 (~2.19 billion tokens); 50 logical cores on a server with 4 Intel Xeon E5-4620 processors:
Word co-occurrence network generation speed (seconds)
https://github.com/zzcoolj/corpus2graph
graph_from_corpus all [--max_window_size=<max_window_size> --process_num=<process_num> --min_count=<min_count>] <data_dir> <output_dir>
A slower generation pipeline means not only longer processing time, but also less feasibility for grid search over the network-generation parameters.