SEMANTIC-BASED MULTILINGUAL DOCUMENT CLUSTERING VIA TENSOR MODELING
Salvatore Romeo (1), Andrea Tagarelli (1), and Dino Ienco (2)
(1) DIMES, University of Calabria, Rende, Italy; (2) IRSTEA - UMR TETIS, and LIRMM, Montpellier, France
EMNLP 2014: Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, October 25-29, 2014
國語文
English German
Content languages for websites. Source: W3Techs.com (March 12, 2014)
Internet users by language. Source: Internet World Stats (May 11, 2011)
[Figure: Wikipedia language editions with 1 million+ articles (English, Swedish, Dutch, German, French, Cebuano, Waray-Waray, Russian, Italian, Spanish, Vietnamese, Polish), and the corresponding numbers of registered users]
Source: Wikipedia (October 6, 2014)
“The Tower of Babel”, P. Bruegel (ca. 1563)
Eric Clapton: Italian Wikipage Eric Clapton: English Wikipage
knowledge base: BabelNet
the content semantics
portions of documents by content
“Tower of Babel”, M. C. Escher (1928)
Input: Multilingual Document Collection
Languages: English, French, Italian
Knowledge base: BabelNet
(1) Text Segmentation (output: Multilingual Segment Collection)
(2) Sentence Splitting, Lemmatization/POS Tagging
(3) Multilingual WSD
(4) Conceptual Representation
(5) Segment Clustering
(6) Tensor Decomposition
(7) Document Clustering
Tensor modes: documents, terms/synsets, segment clusters
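As a rough illustration of the tensor that steps (5) and (6) operate on, the sketch below builds a third-order array over documents, terms/synsets, and segment clusters, then flattens it along the document mode. The record layout, the additive weights, and the function names `build_tensor` and `unfold_mode1` are assumptions made for this example, not details taken from the slides; the decomposition itself is not shown.

```python
def build_tensor(records, n_docs, n_feats, n_clusters):
    """Build a dense documents x features x segment-clusters tensor.

    Each record is (doc_index, feature_index, cluster_index, weight).
    Weights are simply summed; the real weighting scheme is an assumption.
    """
    tensor = [[[0.0] * n_clusters for _ in range(n_feats)]
              for _ in range(n_docs)]
    for doc, feat, cluster, weight in records:
        tensor[doc][feat][cluster] += weight
    return tensor


def unfold_mode1(tensor):
    """Flatten the tensor into one row per document.

    Columns enumerate (feature, cluster) pairs with the cluster index
    varying fastest; the column order is only a convention.
    """
    return [[value for feat_row in doc_slice for value in feat_row]
            for doc_slice in tensor]


# Hypothetical toy data: 2 documents, 3 terms/synsets, 2 segment clusters.
records = [
    (0, 0, 0, 2.0),  # doc 0, feature 0, cluster 0
    (0, 1, 1, 1.0),
    (1, 0, 0, 1.0),
    (1, 2, 1, 3.0),
]
tensor = build_tensor(records, n_docs=2, n_feats=3, n_clusters=2)
doc_matrix = unfold_mode1(tensor)
```

Each row of `doc_matrix` describes one document across all (feature, segment cluster) pairs; a decomposition of the tensor would replace these long rows with a few latent components before the documents are clustered.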
world knowledge
psycholinguistic principles
version; there are more now) languages
topical associative relations
[Navigli & Ponzetto, Artificial Intelligence, 2012]
cosine-sim values for all pairs of adjacent blocks
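The adjacent-block similarity computation can be sketched as follows. This is a minimal illustration that assumes each block is a list of tokens represented by raw term frequencies; the block contents and the helper names `cosine` and `adjacent_block_similarities` are hypothetical, and the slides do not specify the weighting used.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty block is treated as dissimilar to anything
    return dot / (norm_a * norm_b)


def adjacent_block_similarities(blocks):
    """Return the similarity of block i and block i+1 for every i."""
    vectors = [Counter(block) for block in blocks]
    return [cosine(vectors[i], vectors[i + 1])
            for i in range(len(vectors) - 1)]


# Hypothetical toy blocks of lemmatized tokens.
blocks = [
    ["guitar", "blues", "guitar"],
    ["guitar", "blues", "album"],
    ["tour", "concert"],
]
sims = adjacent_block_similarities(blocks)
```

A low value between two neighbouring blocks, such as the second pair here, which shares no terms, suggests a topic shift and therefore a candidate segment boundary.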
1/3 on Italian
→ BabelNet provides more complete coverage for English documents
… needs explanation