Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph (PowerPoint presentation)



SLIDE 1

Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph

Zheng Zhang, Ruiqing Yin, Pierre Zweigenbaum

TextGraphs-2018, June 6, 2018

SLIDE 2

Why do we need this tool?

“However, the landscape of natural language processing has changed significantly since the glorious days of TextRank and similar co-occurrence-based approaches. I believe the authors should provide more recent studies confirming that one should still be curious about co-occurrence networks. Otherwise, the present tool looks too specific because people are too focused on word/sense/sentence/other embeddings today.”

SLIDE 3

Why do we need this tool?

Recent studies confirming interest in co-occurrence information (NAACL 2018):
  • Learning Word Embeddings for Low-resource Languages by PU Learning. Chao Jiang, Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang.
  • Filling Missing Paths: Modeling Co-occurrences of Word Pairs and Dependency Paths for Recognizing Lexical Semantic Relations. Koki Washio and Tsuneaki Kato.

SLIDE 4

Why do we need this tool?

Example of injecting word co-occurrence networks into word-embedding learning:
  • GNEG: Graph-Based Negative Sampling for word2vec. Zheng Zhang, Pierre Zweigenbaum. In Proceedings of ACL 2018, Melbourne, Australia.

SLIDE 5

Why do we need this tool?

Negative sampling vs. graph-based negative sampling: in the graph-based variant, the negative-example distribution is based on the word co-occurrence network (matrix).

[Figure: two word_id-by-word_id heat maps: the word co-occurrence distribution (log scale) and the negative-example distribution lg(f(w)^(3/4)), where f(w) is the word count.]
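The contrast between the two distributions can be sketched numerically. This is a simplified illustration with made-up counts, not the actual GNEG method (which is described in the ACL 2018 paper); the graph-based variant here just uses row sums of the co-occurrence matrix.

```python
import numpy as np

def unigram_negative_distribution(counts, power=0.75):
    """word2vec-style negative sampling: P(w) proportional to count(w)**0.75."""
    weights = np.asarray(counts, dtype=float) ** power
    return weights / weights.sum()

def graph_negative_distribution(cooc):
    """Graph-based variant (simplified): P(w) proportional to the
    total co-occurrence weight of w, i.e. the row sum of the matrix."""
    weights = np.asarray(cooc, dtype=float).sum(axis=1)
    return weights / weights.sum()

word_counts = [100, 10, 1]        # toy vocabulary of 3 words
cooc_matrix = [[0, 8, 2],         # toy word co-occurrence matrix
               [8, 0, 1],
               [2, 1, 0]]
p_unigram = unigram_negative_distribution(word_counts)
p_graph = graph_negative_distribution(cooc_matrix)
```

Both are valid probability distributions, but the graph-based one reflects a word's connectivity in the network rather than its raw frequency.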

SLIDE 6

corpus2graph

SLIDE 7

What kind of tool?

“Tech Specs”
  • Plays well with other graph libraries (“Don’t reinvent the wheel.”)
  • NLP-application oriented (built-in tokenizer, stemmer, sentence analyzer…)
  • Handles large corpora (e.g. the entire English Wikipedia corpus, ~2.19 billion tokens, by using multiprocessing)
  • Grid-search friendly (different window sizes, vocabulary sizes, sentence analyzers…)
  • Fast!

SLIDE 8

[Architecture diagram: corpus2graph generation and corpus2graph processing, interoperating with igraph.]

SLIDE 9

Word Co-occurrence Network Generation: NLP-applications oriented

Input sentence: “The history of natural language processing generally started in the 1950s.”

Word level
  • Word processor (built-in): tokenizer, stemmer, replacing numbers, and removing punctuation marks and/or stop words
  • User-customized word processor

After processing: “The histori of natur languag process gener start in the 0000s” (diagram tokens: h n l p g s)
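A user-customized word processor can be as simple as a function from a sentence to a token list. The sketch below is not the built-in processor (and omits stemming); it only illustrates the number-replacement and punctuation-removal steps:

```python
import re
import string

def word_processor(sentence):
    """Toy word processor: lowercase, strip punctuation marks,
    and replace every digit with 0 (so "1950s" becomes "0000s")."""
    tokens = []
    for tok in sentence.lower().split():
        tok = tok.strip(string.punctuation)   # remove punctuation marks
        tok = re.sub(r"\d", "0", tok)         # replace numbers
        if tok:                               # drop tokens that were pure punctuation
            tokens.append(tok)
    return tokens

word_processor("The history of NLP started in the 1950s.")
# → ['the', 'history', 'of', 'nlp', 'started', 'in', 'the', '0000s']
```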

SLIDE 10

Word Co-occurrence Network Generation: NLP-applications oriented

Word level
  • Word processor (built-in): tokenizer, stemmer, replacing numbers, and removing punctuation marks and/or stop words
  • User-customized word processor

Sentence level
  • Word pairs of different distances are extracted by the sentence analyzer
  • User-customized sentence analyzer

[Diagram: token sequence h n l p g s with d_max = 2; arcs mark word pairs at distance 1 and distance 2.]
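A sentence analyzer in this spirit enumerates, for each position, the words up to d_max tokens ahead. A minimal sketch (the function name is mine, not the corpus2graph API):

```python
def extract_pairs(tokens, d_max=2):
    """Emit (left_word, right_word, distance) for every ordered pair of
    tokens at positional distance 1 .. d_max within one sentence."""
    pairs = []
    for i, left in enumerate(tokens):
        for d in range(1, d_max + 1):
            if i + d < len(tokens):              # stay inside the sentence
                pairs.append((left, tokens[i + d], d))
    return pairs

extract_pairs(["h", "n", "l", "p"], d_max=2)
# → [('h', 'n', 1), ('h', 'l', 2), ('n', 'l', 1), ('n', 'p', 2), ('l', 'p', 1)]
```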

SLIDE 11

Word Co-occurrence Network Generation: NLP-applications oriented

Word level
  • Word processor (built-in): tokenizer, stemmer, replacing numbers, and removing punctuation marks and/or stop words
  • User-customized word processor

Sentence level
  • Word pairs of different distances are extracted by the sentence analyzer
  • User-customized sentence analyzer

Word-pair level
  • Word pair analyzer
  • Word pair weight w.r.t. the maximum distance
  • Directed & undirected
  • User-customized word pair analyzer

[Diagram: token sequence h n l p g s; undirected, with weight = 1 × (number of distance-1 pairs) + 1 × (number of distance-2 pairs), so every edge in the example has weight 1.]
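Aggregating per-distance pair counts into edge weights can be sketched as below. The function name and the per-distance coefficient dict are my illustration, not the corpus2graph API; uniform coefficients (all 1) match the slide's example.

```python
from collections import Counter

def edge_weights(pairs, d_max=2, coeff=None):
    """Aggregate (w1, w2, distance) triples into undirected edge weights:
    weight(u, v) = sum over d of coeff[d] * count_d(u, v)."""
    if coeff is None:
        coeff = {d: 1 for d in range(1, d_max + 1)}   # uniform coefficients
    weights = Counter()
    for w1, w2, d in pairs:
        if d <= d_max:
            edge = tuple(sorted((w1, w2)))   # undirected: order-insensitive key
            weights[edge] += coeff[d]
    return weights

edge_weights([("h", "n", 1), ("n", "h", 1), ("h", "l", 2)])
# → Counter({('h', 'n'): 2, ('h', 'l'): 1})
```

For a directed network, one would keep the (w1, w2) order instead of sorting the key.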

SLIDE 12

Word Co-occurrence Network Generation: Multiprocessing

  • 3 multiprocessing steps:
    • Word processing
    • Sentence analyzing
    • Word pair merging
  • MapReduce-like
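The MapReduce-like structure can be sketched with the standard library. Here a toy map step counts distance-1 pairs per sentence in worker processes, and a reduce step merges the partial counts; details such as on-disk intermediate files are omitted.

```python
from collections import Counter
from multiprocessing import Pool

def count_pairs(sentence):
    """Map step: one worker turns a sentence into local pair counts."""
    tokens = sentence.split()
    local = Counter()
    for i in range(len(tokens) - 1):
        local[(tokens[i], tokens[i + 1])] += 1   # distance-1 pairs only
    return local

def merge_counts(partials):
    """Reduce step: merge the per-worker counters into one network."""
    total = Counter()
    for part in partials:
        total.update(part)
    return total

if __name__ == "__main__":
    sentences = ["a b a b", "b a b"]
    with Pool(processes=2) as pool:
        partials = pool.map(count_pairs, sentences)   # parallel map phase
    network = merge_counts(partials)
    # network == Counter({('a', 'b'): 3, ('b', 'a'): 2})
```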
SLIDE 13

Word Co-occurrence Network Generation: Grid search

  • Word pair weights for different maximum distances (d_max)
  • Reuse of the intermediate data:
    • 1st step: numerical-id-encoded text file after word processing
    • 2nd step: separate word pair files of different distances for each text file
    • 2nd step: distinct word count
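Because word pairs are stored separately per distance, a network for any d_max within the stored range can be assembled without re-reading the corpus. A sketch, where an in-memory dict stands in for the per-distance intermediate files:

```python
from collections import Counter

# Stand-in for the intermediate word pair files: one count table per distance,
# produced once by the sentence-analysis step with the largest d_max of interest.
pairs_by_distance = {
    1: Counter({("a", "b"): 4, ("b", "c"): 2}),
    2: Counter({("a", "c"): 3}),
    3: Counter({("a", "d"): 1}),
}

def network_for(d_max):
    """Build the network for a given window size by merging the stored
    tables for distances 1 .. d_max -- no corpus re-processing needed."""
    net = Counter()
    for d in range(1, d_max + 1):
        net.update(pairs_by_distance[d])
    return net

network_for(2)
# → Counter({('a', 'b'): 4, ('a', 'c'): 3, ('b', 'c'): 2})
```

A grid search over window sizes then costs one merge per setting instead of one full corpus pass per setting.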

SLIDE 14

Word Co-occurrence Network Generation: Speed

  • The baseline: processing the corpus sentence by sentence, extracting word pairs, and adding them to the graph as edges through graph libraries; single-core.
  • Why is corpus2graph sometimes slower than the baseline when using NetworkX?
    • Small corpus, one core: corpus2graph is slower than the baseline when using NetworkX.
    • Large corpus, multiple cores: corpus2graph is much faster than the baseline when using NetworkX. Example: the entire English Wikipedia dump from April 2017 (~2.19 billion tokens), with 50 logical cores on a server with 4 Intel Xeon E5-4620 processors: ~2.5 hours.

[Table: word co-occurrence network generation speed (seconds).]

SLIDE 15

Word Co-occurrence Network Processing

  • Networks and matrices are interchangeable
  • Graph loading & transition matrix calculation speed (seconds)
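Interchangeability means ordinary matrix operations apply directly to the network. For instance, a transition (random-walk) matrix is just the row-normalized co-occurrence matrix; a minimal sketch:

```python
import numpy as np

def transition_matrix(adjacency):
    """Row-normalize a co-occurrence (adjacency) matrix:
    T[i, j] = A[i, j] / sum_j A[i, j]."""
    A = np.asarray(adjacency, dtype=float)
    row_sums = A.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0   # keep all-zero rows for isolated nodes
    return A / row_sums

T = transition_matrix([[0, 2, 2],
                       [1, 0, 0],
                       [0, 0, 0]])
# T[0] == [0.0, 0.5, 0.5]; T[1] == [1.0, 0.0, 0.0]; T[2] stays all zeros
```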

SLIDE 16

Open source

https://github.com/zzcoolj/corpus2graph

graph_from_corpus all [--max_window_size=<max_window_size> --process_num=<process_num> --min_count=<min_count> --max_vocab_size=<max_vocab_size> --safe_files_number_per_processor=<safe_files_number_per_processor>] <data_dir> <output_dir>

SLIDE 17

Future work

  • Word co-occurrence network generation
    • “Desktop mode”: less memory consumption and fewer cores, but also less flexibility for grid search
  • Word co-occurrence network processing
    • Support more graph processing methods
    • GPU mode

SLIDE 18

Thanks for your attention!