[PPT] - Ljubljana, Slovenia MultilingualWeb Workshop, Madrid, Oct 26th 2010 PowerPoint Presentation

SLIDE 1

Marko Grobelnik (marko.grobelnik@ijs.si)

Jozef Stefan Institute (http://www.ijs.si/)

Ljubljana, Slovenia

MultilingualWeb Workshop, Madrid, Oct 26th 2010

SLIDE 2

• Imagine a user understanding several languages

e.g. English, German, Italian, Croatian, Serbian, Slovenian

(…not so uncommon in Slovenia)

• Such a user would want to browse and search documents in all the known languages

• …but of course, a query can be provided only in one language

• We need a search engine which, given a query in one language, returns documents in selected languages

…this is called “cross-lingual information retrieval”

SLIDE 3

SLIDE 4

SLIDE 5

• …there are many research fields working with textual data and solving different problems: Computational Linguistics, Machine Translation, Information Retrieval, Text Mining, Semantic Web, …

• Each of the research fields “represents” text in a slightly different way

SLIDE 6

• Character (character n-grams and sequences)
• Words (stop-words, stemming, lemmatization)
• Phrases (word n-grams, proximity features)
• Part-of-speech tags
• Taxonomies / thesauri
• Vector-space model
• Correlated V.S.M.
• Language models
• Full-parsing
• Collaborative tagging / Web2.0
• Templates / Frames
• Ontologies / First order theories
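The first few representation levels in the list can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the tiny stop-word list, and the whitespace tokenization are hypothetical choices made here and do not come from the slides.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Character level: all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n=2):
    """Phrase level: all overlapping word n-grams of the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def bag_of_words(text, stop_words=frozenset({"the", "a", "of", "in"})):
    """Vector-space model: map each non-stop-word to its frequency.

    The stop-word list is a hypothetical placeholder, not a real resource.
    """
    return Counter(w for w in text.lower().split() if w not in stop_words)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

For example, `char_ngrams("web", 2)` yields `["we", "eb"]`, and two documents can be compared with `cosine(bag_of_words(a), bag_of_words(b))`. Stemming, lemmatization, and the richer levels further down the list would need dedicated linguistic resources and are left out of this sketch.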

SLIDE 7

Search (Inf. Retrieval), Categorization, Clustering, Summarization, …

SLIDE 8

Cross-lingual Inf. Retrieval, Connecting Text + Images, Search (Inf. Retrieval), Categorization, Clustering, Summarization, …

SLIDE 9

Machine translation

Spam filtering, …

Cross-lingual Inf. Retrieval, Connecting Text + Images, Search (Inf. Retrieval), Categorization, Clustering, Summarization, …

SLIDE 10


SLIDE 11

SLIDE 12

• Ideally, we would want to represent the text in a language neutral form

…so that document content would be comparable regardless of the language

• Having this, we can solve many problems still unaddressed on the market…

• Nowadays, we can solve this on a large scale…

…because of the availability of large amounts of “comparable corpora” like Wikipedia

SLIDE 13

English, German, French, Spanish, Italian, Slovenian, Slovak, Czech, Hungarian, Greek, Finnish, Swedish, Dutch, Lithuanian, Danish

Language Neutral Document Representation (trained with machine learning)

New document represented as text in any of the above languages

New document represented in Language Neutral way

…enables cross-lingual retrieval, categorization, clustering, …
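One simple way to picture a language neutral representation is to describe a document by how similar it is to each topic of an aligned corpus, using the articles written in the document's own language. The Python sketch below follows that idea. It is an assumption made for illustration and not the method used in this talk (the slides only say the representation is trained with machine learning); the toy corpus, the function names, and the two-language setup are all hypothetical.

```python
from collections import Counter
import math

# Toy "comparable corpus": each entry pairs articles on the same topic.
# The texts are invented for illustration only.
ALIGNED_ARTICLES = [
    {"en": "football match goal team player",
     "de": "fussball spiel tor mannschaft spieler"},
    {"en": "bank money loan interest credit",
     "de": "bank geld darlehen zins kredit"},
    {"en": "music concert orchestra violin",
     "de": "musik konzert orchester geige"},
]


def bag_of_words(text):
    """Word-frequency vector of a text (whitespace tokenization)."""
    return Counter(text.lower().split())


def sparse_cosine(u, v):
    """Cosine similarity between two word-frequency dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def language_neutral(text, lang):
    """One coordinate per aligned topic: similarity to that topic's `lang` article."""
    doc = bag_of_words(text)
    return [sparse_cosine(doc, bag_of_words(pair[lang])) for pair in ALIGNED_ARTICLES]


def dense_cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, query_lang, documents):
    """Rank (text, lang) documents against a query given in one language."""
    q = language_neutral(query, query_lang)
    scored = [(dense_cosine(q, language_neutral(text, lang)), text)
              for text, lang in documents]
    return sorted(scored, reverse=True)
```

With this toy data, `search("football team", "en", [("geld kredit bank", "de"), ("tor spieler fussball", "de")])` ranks the German football text first, although it shares no word with the English query. A real system would need a far larger aligned corpus and a learned, lower-dimensional mapping.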

SLIDE 14

Wikipedia Languages

№     Language       Articles
1     English        3,451,276
2     German         1,139,687
3     French         1,022,762
4     Polish         740,342
5     Italian        739,961
6     Japanese       711,765
7     Spanish        663,201
…     …
32    Bulgarian      107,739
33    Persian        107,564
34    Slovenian      101,731
35    Waray-Waray    100,454
…     …
92    Walloon        11,791
93    Irish          11,623
94    Chuvash        11,620
95    Armenian       11,197
96    Yoruba         10,167
…     …
192   Picard         1,092
193   Aymara         1,088
194   Wolof          1,082
195   Tumbuka        1,061

• With machine learning techniques we can learn “language neutral document representation”…

• …planned for ~200 Wikipedia languages having over 1000 articles

• The goal is to have an updated 200x200 matrix of languages for comparing document content

…trained statistical models + software will be open for use

…part of FP7 MetaNet Network of Excellence
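The size of such a language matrix can be worked out with a short Python sketch. The article counts are a few rows copied from the Wikipedia table on this slide; the function names are hypothetical, and treating the comparison as symmetric (so that each unordered pair needs only one model) is an assumption the slides do not state.

```python
from itertools import combinations

# A few rows from the Wikipedia table: language -> article count.
ARTICLE_COUNTS = {
    "English": 3451276,
    "Slovenian": 101731,
    "Irish": 11623,
    "Tumbuka": 1061,
}

MIN_ARTICLES = 1000  # threshold named on the slide


def eligible_languages(counts, min_articles=MIN_ARTICLES):
    """Languages with more than `min_articles` articles, in alphabetical order."""
    return sorted(lang for lang, n in counts.items() if n > min_articles)


def language_pairs(languages):
    """All unordered pairs of distinct languages (assumes symmetric comparison)."""
    return list(combinations(languages, 2))


def pair_count(n_languages):
    """Number of unordered pairs among n languages: n * (n - 1) / 2."""
    return n_languages * (n_languages - 1) // 2
```

A 200x200 matrix has 40,000 cells. Under the symmetry assumption, `pair_count(200)` gives 19,900 distinct language pairs off the diagonal.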

SLIDE 15

• Cross-lingual Information Retrieval is a technique for comparing documents written in different languages

…still largely unsolved for comparing a large number of languages

• We are introducing “language neutral document representation”

…based on statistical representation

• Using Wikipedia we are building a 200x200 matrix of languages within

…solution will be open source