Marko Grobelnik (marko.grobelnik@ijs.si) Jozef Stefan Institute (http://www.ijs.si/) Ljubljana, Slovenia
MultilingualWeb Workshop, Madrid, Oct 26th 2010
Imagine a user understanding several languages, e.g. English, German, Italian, …
Such a user would want to browse and search
…but of course, a query can be provided only in
We need a s
…there are many research fields working with
Each of the research fields “represents” text
Character (character n-grams and sequences)
Words (stop-words, stemming, lemmatization)
Phrases (word n-grams, proximity features)
Part-of-speech tags
Taxonomies / thesauri
Vector-space model
Correlated V.S.M.
Language models
Full-parsing
Collaborative tagging / Web2.0
Templates / Frames
Ontologies / First order theories
Machine translation
Spam filtering, …
Cross-lingual Inf. Retrieval, Connecting Text + Images
Search (Inf. Retrieval), Categorization, Clustering, Summarization, …
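To make the lowest representation levels concrete (characters, words, phrases), here is a minimal Python sketch. The function names, the regular-expression tokenizer, and the tiny stop-word list are assumptions made for illustration; they are not taken from the talk.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list used only for this example.
STOP_WORDS = frozenset({"the", "a", "of", "in"})


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def word_ngrams(text, n=2):
    """Return overlapping word n-grams (phrases) of `text`."""
    tokens = tokenize(text)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text):
    """Count word tokens after removing stop-words (a sparse vector-space view)."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)
```

For instance, `char_ngrams("web", 2)` yields the character bigrams of the word, and `bag_of_words` gives the term counts that a vector-space model would start from.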
Ideally, we would want to represent the text in a
Having this, we can solve many problems still
Nowadays, we can solve this on a large
English, German, French, Spanish, Italian, Slovenian, Slovak, Czech, Hungarian, Greek, Finnish, Swedish, Dutch, Lithuanian, Danish
Language Neutral Document Representation (trained with machine learning)
New document represented as text in any of the above languages
New document represented in Language Neutral way
…enables cross-lingual retrieval, categorization, clustering, …
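The slides do not say how the language-neutral representation is trained, so the following Python sketch swaps in one simple, well-known idea purely for illustration: describe a document by its similarity to each entry of an aligned multilingual corpus (the way Wikipedia articles on the same topic are linked across languages). The `ALIGNED` corpus, the function names, and the example sentences are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical aligned corpus: entry i describes the same concept in each language.
ALIGNED = {
    "en": ["the dog barks at the cat",
           "the train arrives at the station",
           "the bank lends money"],
    "de": ["der hund bellt die katze an",
           "der zug erreicht den bahnhof",
           "die bank verleiht geld"],
}


def bow(text):
    """Bag-of-words term counts for a text."""
    return Counter(re.findall(r"\w+", text.lower()))


def sparse_cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def neutral_vector(text, lang):
    """Map a text to one similarity score per aligned concept.

    Position i means the same concept in every language, so vectors
    built from different languages are directly comparable.
    """
    doc = bow(text)
    return [sparse_cosine(doc, bow(concept)) for concept in ALIGNED[lang]]


def neutral_similarity(text_a, lang_a, text_b, lang_b):
    """Cosine similarity of two texts in the shared concept space."""
    va = neutral_vector(text_a, lang_a)
    vb = neutral_vector(text_b, lang_b)
    dot = sum(x * y for x, y in zip(va, vb))
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(y * y for y in vb))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

With this toy corpus, an English query such as "dog and cat" scores higher against the German text "hund und katze" than against "zug und bahnhof", even though the query and the documents share no words. A real system would need a far larger aligned corpus and a learned projection rather than raw word overlap.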
№     Language       Articles
1     English        3,451,276
2     German         1,139,687
3     French         1,022,762
4     Polish           740,342
5     Italian          739,961
6     Japanese         711,765
7     Spanish          663,201
…     …
32    Bulgarian        107,739
33    Persian          107,564
34    Slovenian        101,731
35    Waray-Waray      100,454
…     …
92    Walloon           11,791
93    Irish             11,623
94    Chuvash           11,620
95    Armenian          11,197
96    Yoruba            10,167
…     …
192   Picard             1,092
193   Aymara             1,088
194   Wolof              1,082
195   Tumbuka            1,061
With machine learning techniques
…planned for ~200 Wikipedia
The goal is to have an updated
will be open to use
Excellence
Cross-lingual Information Retrieval is a technique
We are introducing “languag
Using Wikipedia, we are building a 200x200 matrix