MID-TERM FOLLOW-UP
Semantic mining: Unsupervised acquisition of multilingual semantic classes from texts


SLIDE 1

PhD Mid-term Follow-up, 16/10/2018

MID-TERM FOLLOW-UP

Semantic mining: Unsupervised acquisition of multilingual semantic classes from texts

Presenter: Zheng ZHANG
 Supervisors: Pierre ZWEIGENBAUM & Yue MA
 Evaluation committee: Vincent CLAVEAU & Alexandre ALLAUZEN

SLIDE 2
  • Introduction of Thesis Topic
  • Results Achieved
  • GNEG: Graph-Based Negative Sampling for word2vec
  • corpus2graph: Efficient Generation and Processing of Word Co-occurrence Networks

  • Work in Progress
  • Future Work

OUTLINE

SLIDE 3

Introduction of Thesis Topic

SLIDE 4

[Figure: two French semantic classes, “fruits” (pomme, poire, pêche, prunelle) and “couleurs”]

  • Semantic class: a group of words clustered using distributional similarity measures (see the sketch below)

Multilingual semantic classes

Introduction of Thesis Topic

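A minimal sketch of this idea, under stand-in embeddings: words are grouped into semantic classes by clustering their vectors with a distributional (cosine) similarity measure. The vectors below are random placeholders, so the resulting clusters are arbitrary; in practice they would come from word2vec trained on a corpus.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Stand-in embeddings; real ones would come from word2vec.
words = ["pomme", "poire", "pêche", "rouge", "vert", "bleu"]
vectors = np.random.RandomState(0).randn(len(words), 50)

# L2-normalize so Euclidean k-means approximates cosine similarity.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(normalize(vectors))
for label in sorted(set(kmeans.labels_)):
    print(label, [w for w, l in zip(words, kmeans.labels_) if l == label])
```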

SLIDE 5

Multilingual semantic classes

[Figure: aligned semantic classes across languages: FR “fruits” (pomme, poire, pêche, prunelle) and “couleurs”; EN “fruits” (apple, pear, peach) and “colors”]

Introduction of Thesis Topic

SLIDE 7

Applications: “translation” of unknown words

[Figure: the unknown FR word “roux” falls into the FR class “couleurs”, which is aligned with the EN class “colors”, yielding a candidate “translation”]

Introduction of Thesis Topic

SLIDE 9

Applications: Universal class extraction

[Figure: the FR class “fruits” (pomme, poire, pêche, prunelle) and the EN class “fruits” (apple, pear, peach) are merged into one universal class]

  • Universal class: “fruits”

Introduction of Thesis Topic

SLIDE 10

Cross-lingual Word Embeddings Learning

[Figure: taxonomy of cross-lingual word embedding methods, by alignment data level (word, sentence, document) and by training stage (pre-processing, training, post-embedding); example methods: mikolov2013exploiting, artetxe2017learning, gouws2015bilbowa, gouws2015simple, luong2015bilingual, levy2017strong, vulic2015bilingual, spanning both “neural” and “count-based” approaches]

Multilingual word embedding learning can be seen as an extension of (monolingual) word embedding learning.

Introduction of Thesis Topic

SLIDE 11

Results Achieved

SLIDE 12
  • Why?
  • The softmax calculation is too expensive → replace every softmax term in the Skip-gram objective.
  • What?
  • Distinguish the target word from draws from the noise distribution using logistic regression, with k negative examples for each data sample (see the sketch below).
  • Advantages:
  • Cheap to compute.
  • Every valid word can be selected as a negative example.

Skip-gram negative sampling

Results Achieved: GNEG*

*GNEG: Graph-Based Negative Sampling for word2vec, Zheng ZHANG, Pierre ZWEIGENBAUM. In Proceedings of ACL 2018, Melbourne, Australia.
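A minimal sketch of the mechanics, on hypothetical toy counts: word2vec draws the k negatives from the smoothed unigram distribution P_n(w) ∝ count(w)^0.75 and scores each (target, context) pair with the logistic (NEG) objective.

```python
import numpy as np

# Hypothetical word counts; word2vec uses the real vocabulary counts.
counts = np.array([50, 30, 15, 5], dtype=float)
noise = counts ** 0.75
noise /= noise.sum()                    # noise distribution P_n(w)

rng = np.random.default_rng(0)
k = 5                                   # negatives per (target, context) pair
negatives = rng.choice(len(counts), size=k, p=noise)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# NEG loss for one pair: -log σ(v'_ctx·v_tgt) - Σ_k log σ(-v'_neg·v_tgt)
dim = 8
v_tgt = rng.standard_normal(dim)
v_ctx = rng.standard_normal(dim)
v_neg = rng.standard_normal((k, dim))
loss = -(np.log(sigmoid(v_ctx @ v_tgt))
         + np.log(sigmoid(-(v_neg @ v_tgt))).sum())
print(negatives, loss)
```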

SLIDE 13
  • Negative sampling is not targeted to the current training word: it is based only on word counts.

[Figure: word counts and a heat map of the negative-example distribution lg(Q_o(x)) over word_id × word_id: the noise distribution is the same for every training word]

Drawbacks of skip-gram negative sampling

Results Achieved: GNEG*


SLIDE 14
  • Word co-occurrence networks (matrices)
  • Definition: a graph whose vertices represent the unique terms of the document and whose edges represent co-occurrences between terms within a fixed-size sliding window (see the sketch below).
  • Networks and matrices are interchangeable.
  • A new context → negative examples.
  • word2vec already implicitly uses word co-occurrence statistics to select context words, but not to select negative examples.
  • Ref. Rousseau F., Vazirgiannis M. (2015). Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction. https://safetyapp.shinyapps.io/GoWvis/

Graph-based word2vec training

Results Achieved: GNEG*

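A minimal sketch of this definition, using networkx as a stand-in (corpus2graph's own API differs): vertices are unique terms, and an edge weight counts the co-occurrences inside a fixed-size sliding window.

```python
import networkx as nx

def cooccurrence_network(tokens, window=2):
    """Build a word co-occurrence network from a token sequence."""
    g = nx.Graph()
    for i, u in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            v = tokens[j]
            if g.has_edge(u, v):
                g[u][v]["weight"] += 1   # repeated co-occurrence
            else:
                g.add_edge(u, v, weight=1)
    return g

g = cooccurrence_network("the history of natural language processing".split())
print(g.edges(data=True))
```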

SLIDE 15

  • Based on the word co-occurrence network (matrix)
  • Three methods to generate the noise distribution:
  • Training-word context distribution
  • Difference between the unigram distribution and the training-word context distribution
  • Random walks on the word co-occurrence network (see the sketch below)

[Figure: heat map of the word co-occurrence distribution over word_id × word_id, color scale lg(word co-occurrence)]

Graph-based negative sampling

Results Achieved: GNEG*

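A minimal sketch of the random-walk variant, on a dense toy matrix: row-normalizing the co-occurrence matrix gives a transition matrix, and its t-th power gives a walk-based distribution per training word (the paper's exact weighting and normalization may differ).

```python
import numpy as np

# Toy co-occurrence matrix; GNEG builds it from the full corpus.
C = np.array([[0, 4, 1],
              [4, 0, 2],
              [1, 2, 0]], dtype=float)

P = C / C.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
P_t = np.linalg.matrix_power(P, 3)    # distribution after a 3-step walk

# Row i can then serve as a noise distribution targeted at training word i.
print(P_t[0])
```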

SLIDE 16
  • Evaluation results
  • Total time: entire English Wikipedia corpus ( tokens), trained on the server prevert (50 threads used):
    2.5 h word co-occurrence network (matrix) generation + 8 h word2vec training

Results Achieved: GNEG*


Graph-based negative sampling

SLIDE 17
  • “Tech Specs”
  • Plays well with other graph libraries (“Don't reinvent the wheel.”)
  • Oriented toward NLP applications (built-in tokenizer, stemmer, sentence analyzer…)
  • Handles a large corpus (e.g. the entire English Wikipedia corpus, tokens) by using multiprocessing
  • Grid-search friendly (different window sizes, vocabulary sizes, sentence analyzers…)
  • Fast!

How to generate a large word co-occurrence network within 3 hours?

Results Achieved: corpus2graph*

*Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph, Zheng ZHANG, Ruiqing YIN, Pierre ZWEIGENBAUM, In Proceedings of NAACL 2018 Workshop on Graph-Based Algorithms for Natural Language Processing, New Orleans, US

SLIDE 18

[Figure: corpus2graph pipeline: network generation, then network processing, interoperating with igraph]

Results Achieved: corpus2graph*


SLIDE 19

Word Co-occurrence Network Generation
 NLP applications oriented

The history of natural language processing generally started in the 1950s.

Word

  • Word processor (built-in)
  • Tokenizer, stemmer, replacing numbers, and removing punctuation marks and/or stop words (see the sketch below)
  • User-customized word processor

Processed output: “The histori of natur languag process gener start in the 0000s” (content words abbreviated h, n, l, p, g, s in the following figures)

Results Achieved: corpus2graph*


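A minimal sketch of the built-in word-processing step on the example sentence, using NLTK's Porter stemmer as a stand-in for corpus2graph's actual word processor: lowercase, tokenize, stem, and replace digits with zeros.

```python
import re
from nltk.stem import PorterStemmer  # stand-in stemmer

stemmer = PorterStemmer()

def process(sentence):
    tokens = re.findall(r"\w+", sentence.lower())
    # Stem, then replace every digit with 0 (exact stems depend on the stemmer).
    return [re.sub(r"\d", "0", stemmer.stem(t)) for t in tokens]

print(process("The history of natural language processing "
              "generally started in the 1950s."))
```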

SLIDE 20

Word

  • Word processor (built-in)
  • Tokenizer, stemmer, replacing numbers, and removing punctuation marks and/or stop words
  • User-customized word processor

Sentence

  • Word pairs at different distances are extracted by the sentence analyzer (see the sketch below)
  • User-customized sentence analyzer

[Figure: word pairs over the tokens h n l p g s at distance 1 and distance 2 (d_max = 2)]

Word Co-occurrence Network Generation
 NLP applications oriented

Results Achieved: corpus2graph*


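A minimal sketch of the sentence-analyzer step on the abbreviated tokens: emit every word pair together with its distance, up to a maximum distance d_max.

```python
def word_pairs(tokens, d_max=2):
    """Yield (word, word, distance) triples up to d_max."""
    for i, u in enumerate(tokens):
        for d in range(1, d_max + 1):
            if i + d < len(tokens):
                yield (u, tokens[i + d], d)

print(list(word_pairs(["h", "n", "l", "p", "g", "s"], d_max=2)))
```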

SLIDE 21

Word

  • Word processor (built-in)
  • Tokenizer, stemmer, replacing numbers, and removing punctuation marks and/or stop words
  • User-customized word processor

Sentence

  • Word pairs at different distances are extracted by the sentence analyzer
  • User-customized sentence analyzer

Word pair

  • Word pair analyzer (see the sketch below)
  • Word pair weight w.r.t. the maximum distance
  • Directed & undirected
  • User-customized word pair analyzer

[Figure: resulting network over the nodes h, n, l, p, g, s, each extracted pair contributing an edge with weight 1]

Word Co-occurrence Network Generation
 NLP applications oriented

Results Achieved: corpus2graph*


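A minimal sketch of the word-pair-analyzer step: aggregate the (word, word, distance) triples into undirected edge weights, here giving every distance the same weight 1 (a distance-dependent weighting w.r.t. d_max is the obvious variant).

```python
from collections import Counter

def edge_weights(pairs):
    """Aggregate (word, word, distance) triples into undirected edges."""
    weights = Counter()
    for u, v, d in pairs:
        weights[tuple(sorted((u, v)))] += 1   # weight 1 per co-occurrence
    return weights

pairs = [("h", "n", 1), ("n", "l", 1), ("h", "l", 2)]
print(edge_weights(pairs))
```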

SLIDE 22

Word Co-occurrence Network Generation
 Multi-processing

  • 3 multiprocessing steps (see the sketch below):
  • Word processing
  • Sentence analyzing
  • Word pair merging
  • MapReduce-like

Results Achieved: corpus2graph*

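A minimal sketch of the MapReduce-like scheme: map the pair counting over corpus chunks with multiprocessing.Pool, then merge the partial counters; corpus2graph's actual three-step pipeline differs in detail.

```python
from collections import Counter
from multiprocessing import Pool

def count_pairs(sentences, window=2):
    """Map step: count word pairs inside a sliding window, per chunk."""
    c = Counter()
    for toks in sentences:
        for i, u in enumerate(toks):
            for j in range(i + 1, min(i + 1 + window, len(toks))):
                c[tuple(sorted((u, toks[j])))] += 1
    return c

if __name__ == "__main__":
    chunks = [[["h", "n", "l"]], [["p", "g", "s"]], [["h", "l", "s"]]]
    with Pool(3) as pool:
        partials = pool.map(count_pairs, chunks)  # parallel map over chunks
    total = sum(partials, Counter())              # reduce: merge word pairs
    print(total)
```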

SLIDE 23

Work in Progress

SLIDE 24
  • Graph-based negative sampling for fastText
  • Word co-occurrence based matrix factorization for word embeddings learning

Use of pre-computed graph information extracted from word co-occurrence networks

Work in Progress

SLIDE 25
  • [Levy and Goldberg, 2014] shows that skip-gram with negative sampling is implicitly factorizing a word-context matrix.

Matrix Factorization

Work in Progress

[Figure: the word co-occurrence matrix is turned into an “enhanced” matrix via max(PMI(w, c) - log k, 0), which is then factorized with SVD]

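A minimal sketch of that factorization on a toy count matrix: compute PMI, shift by log k and clip at zero (the shifted positive PMI, i.e. the “enhanced” matrix), then take a truncated SVD to obtain embeddings.

```python
import numpy as np

# Toy word-context co-occurrence counts.
C = np.array([[10, 2, 0],
              [2, 8, 3],
              [0, 3, 6]], dtype=float)

total = C.sum()
pw = C.sum(axis=1, keepdims=True) / total
pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log((C / total) / (pw * pc))   # zero counts give -inf

k = 5
sppmi = np.maximum(pmi - np.log(k), 0)      # shifted positive PMI

U, S, Vt = np.linalg.svd(sppmi)
d = 2
word_vectors = U[:, :d] * np.sqrt(S[:d])    # symmetric SVD embedding
print(word_vectors)
```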

SLIDE 26
  • [Levy and Goldberg, 2014] shows that skip-gram with negative sampling is implicitly factorizing a word-context matrix.

Matrix Factorization

Work in Progress

[Figure: same pipeline, with the first-order (word co-occurrence) matrix replaced by the transition matrix of random walks before the max(PMI(w, c) - log k, 0) + SVD step]

SLIDE 27

Matrix Factorization

Work in Progress

[Figure: the first-order (word co-occurrence) matrix and the transition matrix of random walks are combined (“+”) into one “enhanced” matrix]

  • “+” option 1: matrix concatenation
  • “+” option 2: mask-based element-wise merge (both sketched below)

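A minimal sketch of the two “+” options on hypothetical matrices A (first-order) and B (random-walk based): concatenation along the context axis versus a mask-based element-wise merge; the actual masking rule is an open design choice here.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 4))
A[A < 0.3] = 0.0                     # sparsified first-order matrix
B = rng.random((4, 4))               # random-walk-based matrix

# Option 1: matrix concatenation along the context axis.
concat = np.hstack([A, B])           # shape (4, 8)

# Option 2: mask-based element-wise merge,
# e.g. fall back to B wherever A has no evidence.
merged = np.where(A == 0, B, A)      # shape (4, 4)
print(concat.shape, merged.shape)
```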

SLIDE 28

Future Work

SLIDE 29

Individual Training Plan

SLIDE 30

Timetable for the Second Half of My PhD

  • Jan. 2019: Finish the 3 works in progress
  • Jan. 2019: Finish future work (programming & experiment part)
  • Jul. 2019: Start writing the thesis
  • Sep. 2019: Thesis submission
  • End of Oct. 2019: Defence

SLIDE 31

Thank you!