MULTILINGUAL DOCUMENT CLASSIFICATION VIA TRANSDUCTIVE LEARNING
Salvatore Romeo – UNICAL
srome@dimes.unical.it
Dino Ienco - IRSTEA, LIRMM
dino.ienco@irstea.fr
Andrea Tagarelli – UNICAL
tagarelli@dimes.unical.it
國語文
English German
[Figure: bar charts of the Wikipedia language editions with 1 million+ articles (English, Swedish, Dutch, German, French, Cebuano, Waray-Waray, Russian, Italian, Spanish, Vietnamese, Polish) and their corresponding numbers of registered users]
Source: Wikipedia (October 6, 2014)
“The Tower of Babel”, P. Bruegel (ca. 1563)
knowledge base: BabelNet
“Tower of Babel”, M. C. Escher (1928)
[Navigli & Ponzetto, 2012] BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell. 2012
(training and test data available together)
[Joachims, 1999] Transductive Inference for Text Classification using Support Vector Machines. ICML, 1999.
[Joachims, 2003] Transductive learning via spectral graph partitioning. ICML, 2003.
propagate the class information from the labeled to the unlabeled documents
[Liu & Chang, 2009] W. Liu, S.-F. Chang: Robust multi-class transductive learning with graphs. CVPR, 2009.
[de Sousa et al., 2014] C. A. R. de Sousa, S. O. Rezende, G. E. A. P. A. Batista: Influence of Graph Construction on Semi-supervised
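The idea of propagating class information from labeled to unlabeled documents can be illustrated with a minimal sketch. It is a generic iterative label propagation with clamping over a document similarity graph, not the specific method of any of the works cited here. The function name `propagate_labels`, the toy similarity matrix `W`, and the convention of marking unlabeled documents with -1 are all assumptions made for illustration.

```python
import numpy as np

def propagate_labels(W, y, n_classes, n_iter=100):
    """Generic iterative label propagation with clamping (illustrative sketch).

    W: (n, n) symmetric, non-negative similarity matrix between documents.
    y: length-n integer array; class index for labeled documents, -1 otherwise.
    """
    W = np.asarray(W, dtype=float)
    y = np.asarray(y)
    labeled = y >= 0
    # Row-normalize similarities so each row sums to 1.
    P = W / W.sum(axis=1, keepdims=True)
    # One-hot scores for labeled documents, zeros for unlabeled ones.
    Y0 = np.zeros((len(y), n_classes))
    Y0[labeled, y[labeled]] = 1.0
    F = Y0.copy()
    for _ in range(n_iter):
        F = P @ F                 # spread scores along graph edges
        F[labeled] = Y0[labeled]  # clamp the known labels
    return F.argmax(axis=1)

# Toy graph: documents 0-2 and 3-5 form two clusters joined by one weak edge.
W = np.array([
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
])
y = np.array([0, -1, -1, -1, -1, 1])  # only documents 0 and 5 are labeled
predicted = propagate_labels(W, y, n_classes=2)
```

In this toy graph each unlabeled document ends up with the class of the labeled document in its own cluster, because the single weak edge carries little score across clusters.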
a) Machine Translation: MT-fr, MT-it, MT-en
b) Bag of Words (BoW): union of language-specific term vocabularies
c) BoW-LSA: Latent Semantic Analysis over the BoW space
d) Bag of Synsets (BoS)
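The difference between the BoW and BoS document models listed above can be shown with a small sketch. It assumes toy tokenized documents and a hypothetical term-to-synset table standing in for a BabelNet lookup; the identifiers `syn:house` and `syn:dog` are made up, and word sense disambiguation is ignored entirely.

```python
from collections import Counter

# Toy tokenized documents in three languages (illustrative data only).
docs = {
    "en": ["house", "dog", "house"],
    "fr": ["maison", "chien"],
    "it": ["casa", "cane", "cane"],
}

# Hypothetical (language, term) -> synset table standing in for a BabelNet
# lookup; the synset identifiers are invented for this sketch.
TOY_SYNSETS = {
    ("en", "house"): "syn:house",
    ("fr", "maison"): "syn:house",
    ("it", "casa"): "syn:house",
    ("en", "dog"): "syn:dog",
    ("fr", "chien"): "syn:dog",
    ("it", "cane"): "syn:dog",
}

def bow_vocabulary(docs):
    """Union of the language-specific vocabularies, sorted for stable columns."""
    return sorted({term for tokens in docs.values() for term in tokens})

def bow_vector(tokens, vocabulary):
    """Term-frequency vector over the shared multilingual vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def bos_vector(lang, tokens, synsets):
    """Synset-frequency vector; terms without a synset entry are skipped."""
    counts = Counter(
        TOY_SYNSETS[(lang, t)] for t in tokens if (lang, t) in TOY_SYNSETS
    )
    return [counts[s] for s in synsets]

vocabulary = bow_vocabulary(docs)
synsets = sorted(set(TOY_SYNSETS.values()))
bow = {lang: bow_vector(tokens, vocabulary) for lang, tokens in docs.items()}
bos = {lang: bos_vector(lang, tokens, synsets) for lang, tokens in docs.items()}
```

In the BoW space the English and French toy documents share no non-zero dimension, since their vocabularies do not overlap. In the BoS space both load on the same two synset dimensions, which is what makes them directly comparable across languages.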
fraction of words belonging to the document whose concepts are present as entries in BabelNet
RCV2 Wikipedia
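The coverage measure described above can be sketched as a simple ratio. This version assumes the fraction is taken over word tokens (not distinct words) and uses a hypothetical set `known_terms` in place of a real BabelNet lookup; both choices are assumptions, not details confirmed by the slides.

```python
def babelnet_coverage(tokens, known_terms):
    """Fraction of a document's word tokens that have an entry in the knowledge base.

    known_terms is a hypothetical stand-in for a BabelNet lookup.
    An empty document is given coverage 0.0 by convention.
    """
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if token in known_terms)
    return covered / len(tokens)

# Toy example: three of the four tokens have an entry.
known_terms = {"house", "dog"}
coverage = babelnet_coverage(["house", "dog", "xyzzy", "house"], known_terms)
```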
define richer multilingual document models