SLIDE 1

Automatic construction of distributional thesaurus (for multiple languages)

Zheng ZHANG, 1st-year PhD student, ILES, LIMSI

28/03/2017

SLIDE 2

What is a distributional thesaurus?

  • For a given input, a distributional thesaurus identifies semantically similar words, based on the assumption that they share a similar distribution.
  • Distributional assumption: in practice, two words are considered similar if their occurrences share similar contexts.

  • Ref. Vincent Claveau, Ewa Kijak. Distributional Thesauri for Information Retrieval and vice versa.
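
To make the assumption concrete, here is a minimal sketch (not from the talk) that builds bag-of-words context vectors from a toy corpus and compares two words by cosine similarity; real thesaurus construction uses far larger corpora and association weighting such as PMI.

    # A minimal sketch of the distributional assumption (not from the
    # talk): words are compared through the contexts they occur in.
    from collections import Counter
    from math import sqrt

    def context_vectors(sentences, window=2):
        """Map each word to a frequency vector of its window contexts."""
        vectors = {}
        for sent in sentences:
            toks = sent.lower().split()
            for i, w in enumerate(toks):
                ctx = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
                vectors.setdefault(w, Counter()).update(ctx)
        return vectors

    def cosine(a, b):
        """Cosine similarity between two sparse count vectors."""
        dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
        na = sqrt(sum(v * v for v in a.values()))
        nb = sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    vecs = context_vectors(["the cat drinks fresh milk",
                            "the dog drinks fresh water"])
    print(cosine(vecs["cat"], vecs["dog"]))  # close to 1: shared contexts

Here "cat" and "dog" come out similar because they occur in near-identical contexts, which is exactly what the thesaurus exploits.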


SLIDE 3

Why do we need it?

  • It is useful for alleviating data sparseness in many NLP applications.
  • It is useful for completing lexical resources.
  • Ref. Enrique Henestroza Anguiano, Pascal Denis. FreDist: Automatic construction of distributional thesauri for French.


SLIDE 4

Contexts

  • These contexts are typically co-occurring words in a limited window around the considered words, or syntactically linked words.

  • Ref. http://nlp.stanford.edu:8080/corenlp/process



SLIDE 5

A new context: Graph-of-words

  • A graph whose vertices represent the unique terms of the document and whose edges represent co-occurrences between the terms within a fixed-size sliding window.
  • “This is an example about how to generate a graph.” (window size = 4)
  • Ref. Rousseau F., Vazirgiannis M. (2015) Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction.

https://safetyapp.shinyapps.io/GoWvis/
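
A minimal sketch of graph-of-words construction under the definition above, assuming simple whitespace tokenization and an undirected, unweighted graph (weighted or directed variants are also possible):

    # A minimal sketch of graph-of-words construction, assuming
    # whitespace tokenization and an undirected, unweighted graph.
    from itertools import combinations

    def graph_of_words(text, window_size=4):
        """Vertices are the unique terms; edges link terms that co-occur
        within a sliding window of `window_size` tokens."""
        tokens = text.lower().rstrip(". ").split()
        edges = set()
        for i in range(len(tokens)):
            for u, v in combinations(tokens[i:i + window_size], 2):
                if u != v:  # no self-loops
                    edges.add(tuple(sorted((u, v))))
        return set(tokens), edges

    vertices, edges = graph_of_words(
        "This is an example about how to generate a graph.", window_size=4)
    print(len(vertices), "vertices,", len(edges), "edges")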


SLIDE 6

Graph attributes: K-core

  • A subgraph Hk = (Vʹ, Eʹ), induced by the subset of vertices Vʹ ⊆ V (and a fortiori by the subset of edges Eʹ ⊆ E), is called a k-core, or a core of order k, iff ∀v ∈ Vʹ, degHk(v) ≥ k and Hk is the maximal subgraph with this property, i.e. it cannot be augmented without losing this property.
  • In other words, the k-core of a graph corresponds to the maximal connected subgraph whose vertices are all of degree at least k within the subgraph.

  • Ref. Rousseau F., Vazirgiannis M. (2015) Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction.

  • Ref. Text Mining – an introduction, Michalis Vazirgiannis, 2017 Data Science Winter School, Beijing, China.
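
The definition suggests a direct algorithm: repeatedly remove vertices of degree below k. A minimal sketch (not the paper's implementation, just the definition made executable); the "main core" used for keyword extraction is the k-core with the largest k for which it is non-empty.

    # A minimal sketch of k-core extraction by iterative peeling,
    # assuming the graph is an undirected adjacency dict.
    def k_core(adj, k):
        """Keep removing vertices of degree < k until every remaining
        vertex has degree >= k within the remaining subgraph."""
        adj = {v: set(nbrs) for v, nbrs in adj.items()}
        changed = True
        while changed:
            changed = False
            for v in list(adj):
                if len(adj[v]) < k:
                    for u in adj[v]:
                        adj[u].discard(v)
                    del adj[v]
                    changed = True
        return set(adj)

    # Triangle a-b-c with a pendant vertex d: the 2-core drops d.
    g = {"a": {"b", "c", "d"}, "b": {"a", "c"},
         "c": {"a", "b"}, "d": {"a"}}
    print(k_core(g, 2))  # {'a', 'b', 'c'}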

SLIDE 7

Why might graph-of-words be a good choice?

  • Graph-of-words:
  • Takes word co-occurrence and, optionally, word order into account (compared with bag-of-words).
  • K-core:
  • In one core, all neighborhoods contribute equally to the subgraph (compared with the centrality measures used by PageRank & HITS).
  • K-cores are adaptive.
  • The main core has been shown to perform well in information retrieval.

  • Ref. Rousseau F., Vazirgiannis M. (2015) Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction.


SLIDE 8

Difficulty: optimization for Big data

  • Texts: multiprocessing
  • Encode the texts with local word ids
  • Merge the local id-word dictionaries to get a universal id-word dictionary
  • Transfer the locally encoded texts to the universal ids
  • “MapReduce-like” multiprocessing to prepare the edge files
  • “This is an example about how to generate a graph.” (window size = 2)
  • Edges of window size n = edges of distance 2 + … + edges of distance n (see the sketch below)
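
A minimal sketch of the distance-based decomposition, assuming "distance d" counts both endpoints (adjacent words are at distance 2, matching "window size 2 = edges of distance 2" in the example above). Because each distance can be computed independently, the per-distance edge files can be produced by separate worker processes, MapReduce-style.

    # A minimal sketch of the distance-based edge decomposition,
    # assuming "distance d" counts both endpoints (adjacent = 2).
    from collections import Counter

    def edges_at_distance(tokens, d):
        """Edges between tokens exactly d positions apart, with counts.
        Each distance can be handled by an independent worker process."""
        return Counter(tuple(sorted((tokens[i], tokens[i + d - 1])))
                       for i in range(len(tokens) - d + 1))

    def edges_for_window(tokens, n):
        """Edges of window size n = edges of distance 2 + ... + n."""
        total = Counter()
        for d in range(2, n + 1):
            total += edges_at_distance(tokens, d)
        return total

    tokens = "this is an example about how to generate a graph".split()
    print(edges_for_window(tokens, 2))  # adjacent-word edges only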



SLIDE 9

Multiple languages (ideas)

  • Using a small dictionary to generate a mixed text (see the sketch below)
  • Finding common graph patterns across multiple languages
  • Ref. Stephan Gouws, Anders Søgaard. Simple task-specific bilingual word embeddings.
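
A minimal sketch of the mixed-text idea, in the spirit of Gouws & Søgaard: given a small bilingual dictionary (the toy entries and substitution rate below are hypothetical), randomly substitute translations so that a single corpus exposes contexts from both languages.

    # A minimal sketch of generating mixed text from a small bilingual
    # dictionary. Dictionary entries and rate p are hypothetical.
    import random

    def mix_text(tokens, bilingual_dict, p=0.5, seed=0):
        """Replace each word found in the dictionary with its
        translation with probability p."""
        rng = random.Random(seed)
        return [bilingual_dict[t]
                if t in bilingual_dict and rng.random() < p else t
                for t in tokens]

    fr_en = {"chat": "cat", "maison": "house"}  # toy dictionary
    print(" ".join(mix_text("le chat est dans la maison".split(), fr_en)))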
SLIDE 10

Future work

  • word2vec: a GoW-based model architecture
  • Using graph-of-words for other tasks (e.g. identifying parallel sentences in comparable corpora, the BUCC 2017 shared task)

  • From distributional thesaurus to semantic classes


SLIDE 11

Merci (Thank you)