Roberto Navigli
Babelplagiarism: what can BabelNet do for cross-language plagiarism detection?
Joint work with…
Mirella Lapata, Simone Ponzetto, Andrea Moro
21/09/2012 Babelplagiarism: What can BabelNet do for cross-language plagiarism detection? Roberto Navigli
Outline
It’s all about knowledge!
Machine Translation (Google Translate)
rock, is an important element but not necessarily central to the plot. Examples are Easy Rider (1969), The Graduate (1969), and Saturday Night Fever (1978).
roccia, è un elemento importante, ma non necessariamente al centro della trama.
Machine Translation (Google Translate)
densities can assist in interpreting subsurface geologic structure and rock type.
Danger here!
underground può aiutare a interpretare in sottosuolo struttura geologica e tipo di roccia.
It’s not that the “big data” approach is bad, it’s just that mere statistics is not enough
The Knowledge Acquisition Bottleneck
– Word Sense Disambiguation
– Named Entity Recognition
– Question Answering
– (your favourite NLP task here)
Plagiarism detection!
available in a machine readable format
– WordNet [Fellbaum, 1998]
– Open Mind Word Expert [Chklovski & Mihalcea, 2002]
– The WordNetPlus project [Boyd-Graber et al., 2006]
– OntoNotes [Hovy et al., 2006]
– EuroWordNet [Vossen, 1998], Multilingual Central Repository [Atserias et al., 2004], …
– Wikipedia (collaborative effort)
Word Sense Disambiguation in a Nutshell
spring (target word) “Spring water can be found at different altitudes” (context)
WSD system
knowledge
sense of target word
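The target word, context and knowledge pipeline above can be illustrated with a minimal sketch. It uses a simplified Lesk-style gloss overlap, which is a stand-in technique chosen for brevity and not the method of the talk. The sense labels, glosses and stopword list are hypothetical toy data written for this example.

```python
import re

# Hypothetical toy sense inventory: sense label -> gloss (invented for this sketch).
SENSES = {
    "spring": {
        "season": "the season of growth between winter and summer",
        "water": "a natural flow of ground water found at some altitudes",
        "coil": "a metal elastic device that returns to its shape",
    }
}

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "at", "to", "its", "that",
             "can", "be", "and", "between", "some"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def disambiguate(target, context, inventory=SENSES):
    """Score each sense by gloss/context word overlap and return the best one."""
    ctx = tokens(context) - {target}
    scores = {sense: len(ctx & tokens(gloss))
              for sense, gloss in inventory[target].items()}
    return max(scores, key=scores.get), scores


best, scores = disambiguate(
    "spring", "Spring water can be found at different altitudes")
```

Here the "knowledge" is nothing more than three glosses, so the outcome depends entirely on how rich those glosses are.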
Roberto Navigli: Word sense disambiguation: A survey. ACM Computing Surveys 41(2), 2009, pp. 1-69
The Richer, The Better
impact on knowledge-based WSD even in a fine-grained setting [Navigli & Lapata, IEEE TPAMI 2010]
[Figure labels: divergence point; nirvana point!!!; state-of-the-art WSD. Source: [Navigli and Lapata, 2010]]
Knowledge-based WSD NEEDS (a lot of) Knowledge!
– Lexical knowledge resources only partly available
lexical knowledge resource
State of the Art “in a nutshell”
– Lexical knowledge resources only partly available
– Only for few languages (e.g. not all 23 EU official languages)
– Heterogeneous and with low coverage
MultiWordNet, WordNet, WOLF, MCR, GermaNet, BalkaNet
This is where the ERC (and my project) comes into play
A 5-year ERC Starting Grant (2011-2016)
(http://lcl.uniroma1.it/multijedi)
Multilingual Joint Word Sense Disambiguation (MultiJEDI)
Key Objective 1: create knowledge for all languages
MultiWordNet, WOLF, BalkaNet, WordNet, MCR, GermaNet
Multilingual Joint Word Sense Disambiguation (MultiJEDI)
Key Objective 2: use all languages to disambiguate one
BabelNet [Navigli & Ponzetto, ACL 2010; AIJ 2012]
including both encyclopedic (from Wikipedia) and lexicographic (from WordNet) entries
[Figure legend: concepts from WordNet; concepts/N.E. from Wikipedia; concepts integrated from both resources]
BabelNet integrates the best of both worlds
balloon
WordNet Wikipedia
WordNet [Miller et al., 1990; Fellbaum, 1998]
[Figure: excerpt of the WordNet semantic network around {car, auto, automobile, machine, motorcar}, with is-a and has-part relations. Synsets shown: {wheeled vehicle}, {self-propelled vehicle}, {motor vehicle}, {tractor}, {wagon, waggon}, {locomotive, engine, locomotive engine, railway locomotive}, {golf cart, golfcart}, {convertible}, {brake}, {wheel}, {splasher}, {air bag}, {accelerator, accelerator pedal, gas pedal, throttle}, {car window}]
Wikipedia [the online community, 2001-today]
BabelNet: concepts and semantic relations (1)
WordNet and Wikipedia:
BabelNet: concepts and semantic relations (2)
BabelNet: objectives
– By establishing an automated mapping between Wikipedia pages and WordNet senses
– By collecting the lexicalizations of concepts in different languages using:
  a) Wikipedia interlanguage links
  b) Statistical Machine Translation
Building BabelNet: Mapping Wikipedia to WordNet (1)
Wikipedia pages as word senses
WordNet senses and performs lexical-sample WSD
between Wikipedia and WordNet
– We select the most likely WordNet sense s of a Wikipedia page w:
An example of mapping
Creation of the Wikipedia disambiguation contexts
Building BabelNet: Mapping Wikipedia to WordNet (2)
ctx(w):
– For each WordNet sense s of w, calculate score(s, w) as follows:
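The formula for score(s, w) is not reproduced in the slide text. The sketch below therefore assumes a simple bag-of-words overlap between the disambiguation context of the sense and that of the page, plus one for smoothing, and normalises over the candidate senses of a single page. Both the formula and the normalisation are assumptions made for illustration, and the sense labels and context words are hypothetical toy data.

```python
def score(sense_ctx, page_ctx):
    """Assumed scoring: size of the context overlap, plus one for smoothing."""
    return len(set(sense_ctx) & set(page_ctx)) + 1


def map_page(page_ctx, sense_ctxs):
    """Return the most likely sense for a page and the normalised scores.

    Normalising over the senses of one page only is a simplification
    made for this sketch.
    """
    scores = {sense: score(ctx, page_ctx) for sense, ctx in sense_ctxs.items()}
    total = sum(scores.values())
    probs = {sense: value / total for sense, value in scores.items()}
    return max(probs, key=probs.get), probs


# Hypothetical contexts for a page about the aircraft sense of "balloon".
page_ctx = {"aircraft", "gas", "aerostat", "flight", "hot-air"}
sense_ctxs = {
    "balloon_n_1": {"aircraft", "gas", "airship", "flight"},
    "balloon_n_2": {"toy", "plaything", "rubber"},
}

best_sense, probabilities = map_page(page_ctx, sense_ctxs)
```

The add-one term keeps every candidate sense at a non-zero score even when its context shares nothing with the page.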
Building BabelNet: Translating Babel synsets
globo aerostàtico
Ballon
pallone aerostatico
Building BabelNet: Translating Babel synsets
Translation system to translate the English lexicalizations of a concept
world's first hydrogen [[Balloon (aircraft)|balloon]] flight.
ballon d'hydrogène.
Google Translate
– sentences from SemCor (a corpus annotated with WordNet senses) which contain s
– sentences from Wikipedia linked to the Wikipage of s
selected for each target language
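The selection step can be sketched as follows, assuming that the translation chosen for each target language is the most frequent one among the translated sentences. The slide text only says a translation is "selected", so the frequency criterion is an assumption. The translations are hypothetical toy data, not actual machine translation output.

```python
from collections import Counter

# Hypothetical translations of the English term "balloon", one per translated
# sentence, grouped by target language code.
TRANSLATED = {
    "it": ["pallone aerostatico", "pallone aerostatico", "palloncino"],
    "fr": ["ballon", "ballon", "montgolfière"],
}


def select_translations(translations_by_language):
    """Pick the most frequent translation for each language (assumed criterion)."""
    return {
        lang: Counter(candidates).most_common(1)[0][0]
        for lang, candidates in translations_by_language.items()
        if candidates  # skip languages with no translated sentences
    }


selected = select_translations(TRANSLATED)
```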
BabelNet: an encyclopedic dictionary!
For research purposes…
Anatomy of BabelNet
Evaluation of the Wikipedia-WordNet mapping
corresponding WordNet sense, if available
Evaluation of BabelNet against gold standard resources: Coverage
GermaNet, Multilingual Central Repository, WOrdNet Libre du Français
Evaluation of BabelNet against gold standard resources: Extra-coverage
Coarse-grained Word Sense Disambiguation with BabelNet
Main alternatives to BabelNet
– a multilingual semantic network built from Wikipedia and including semantic relations between Wikipedia entities collected from the category network, infoboxes and article bodies
– bootstrapped from WordNet and built by collecting evidence extracted from existing wordnets, translation dictionaries, and parallel corpora
– multilingual taxonomy containing 5.4 million entities, also built from WordNet and Wikipedia using a number of heuristics
BabelNetXplorer: A Java API and a Visual Explorer [Navigli & Ponzetto, WWW 2012 DEMO]
multilingual semantic networks such as BabelNet
– A Java API based on Apache Lucene
– Available at: http://babelnet.org
semantic networks
– Based on Cytoscape Web, a state-of-the-art visualization software
The BabelNet API
– Retrieve all synsets with the English lemma “bank”
– Print information about each synset
– Print each German sense in the synset
– Get the synsets related by a given relation type
– Print the information of each related synset
– Get the (relation, synsets) map
BabelNetXplorer: semantic network exploration
Input word
Selected sense
Expand with the neighbours
node
BabelNetXplorer: search for connecting paths
Multilingual WSD with Just a Few Lines of Code [Navigli & Ponzetto, ACL 2012 DEMO]
Target words can even be in mixed languages!
Create a disambiguation graph for the target words
And disambiguate in 1 line!
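The two steps, building a disambiguation graph and then disambiguating, can be sketched on toy data. This is not the BabelNet API: the network, sense labels and function names are hypothetical. Scoring senses by their degree in the graph is one simple choice assumed here for illustration.

```python
from collections import defaultdict

# Hypothetical toy semantic network: undirected edges between sense labels.
EDGES = [
    ("bank#finance", "money#currency"),
    ("bank#finance", "loan#credit"),
    ("bank#river", "river#stream"),
]

# Candidate senses per target word. "denaro" is Italian for "money",
# so the target words are in mixed languages.
CANDIDATES = {
    "bank": ["bank#finance", "bank#river"],
    "denaro": ["money#currency"],
}


def build_disambiguation_graph(candidates, edges):
    """Keep only the edges whose two endpoints are candidate senses."""
    nodes = {sense for senses in candidates.values() for sense in senses}
    graph = defaultdict(set)
    for a, b in edges:
        if a in nodes and b in nodes:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def disambiguate(candidates, graph):
    """For each word, choose the candidate sense with the highest degree."""
    return {
        word: max(senses, key=lambda s: len(graph.get(s, ())))
        for word, senses in candidates.items()
    }


graph = build_disambiguation_graph(CANDIDATES, EDGES)
result = disambiguate(CANDIDATES, graph)
```

Because senses are language-independent nodes, the Italian word contributes evidence for the English one through the shared edge.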
Coming soon to your screens: BabelNet 1.1!
means: 40 languages + more accurate mappings and translations!
Now… why am I saying all this to YOU?!
He is trying to steal important secrets from us…
Plagiarism detection: the state of the art
[Stein et al., SIGIR 2007]
keyphrase extraction, n-grams, query formulation, search control, etc.
So, what can we do? [Examples from Vila et al. 2011]
– Google bought YouTube / Google purchased YouTube
– Google bought YouTube / YouTube was sold to Google
– I like eating chocolate / I like chocolate
– Bill flew across the ocean / Bill crossed the ocean by plane
Remember? BabelNet is multilingual!
Paolo is eating Parmesan / Paolo sta mangiando il parmigiano
Entities are multilingual!
– verbs, adjectives and adverbs only in English
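A minimal sketch of the idea: map the lemmas of each sentence to language-independent concept identifiers and compare the resulting sets. The lexicon, the concept identifiers and the Jaccard measure are assumptions made for this example, not BabelNet data or the method of the talk. Sentences are assumed to be lemmatised already. The Italian verb is deliberately left out of the lexicon to mirror the "only in English" limitation above.

```python
# Hypothetical lexicon: (language, lemma) -> language-independent concept id.
LEXICON = {
    ("en", "paolo"): "c:Paolo",
    ("it", "paolo"): "c:Paolo",
    ("en", "parmesan"): "c:Parmesan",
    ("it", "parmigiano"): "c:Parmesan",
    ("en", "eat"): "c:eat",  # verb covered in English only
    ("en", "ocean"): "c:ocean",
}


def concepts(lang, lemmas):
    """Map lemmas to concept ids, silently skipping lemmas not in the lexicon."""
    return {
        LEXICON[(lang, lemma.lower())]
        for lemma in lemmas
        if (lang, lemma.lower()) in LEXICON
    }


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


english = concepts("en", ["Paolo", "be", "eat", "Parmesan"])
italian = concepts("it", ["Paolo", "stare", "mangiare", "il", "parmigiano"])
similarity = jaccard(english, italian)

# An unrelated English sentence shares no concepts with the Italian one.
unrelated = jaccard(concepts("en", ["Bill", "fly", "across", "the", "ocean"]), italian)
```

The missing Italian verb lowers the score even though the two sentences are translations of each other, which is why the entities carry most of the cross-language signal in this sketch.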
Conclusions
semantics
What comes next…
– Increasing the accuracy of BabelNet (e.g. game with a purpose)
– Integrate more knowledge (Wikipedia categories, Wiktionary, adjectives, verbs, etc.)
– Labeling relatedness relations (see WiSeNet [Moro & Navigli, CIKM 2012])
– More languages (40+)
Thanks or…
(grazie)
Roberto Navigli
http://lcl.uniroma1.it
Joint work with: Simone Ponzetto, Mirella Lapata, Andrea Moro