The MultiJEDI ERC Project: Multilingual Joint Word Sense Disambiguation


The MultiJEDI ERC Project: Multilingual Joint Word Sense Disambiguation. Roberto Navigli (http://lcl.uniroma1.it), 5 July 2016, META-FORUM 2016


SLIDE 1

Roberto Navigli The MultiJEDI ERC Project: Multilingual Joint Word Sense Disambiguation

5 July 2016 – META-FORUM 2016 http://lcl.uniroma1.it

SLIDE 2

Simone Ponzetto, Tiziano Flati, Andrea Moro, Daniele Vannella, Taher Pilehvar, Francesco Cecconi

11.07.16 The MultiJEDI ERC Project Roberto Navigli 2

Federico Scozzafava, Alessandro Raganato, Ignacio Iacobacci, José Camacho Collados, Claudio Delli Bovi

SLIDE 3
SLIDE 4

You may say I'm a dreamer, but I am not the only one. I hope someday you'll join us. And the world will be as one!

  • John Lennon
SLIDE 5

A 5-year ERC Starting Grant (2011-2016)

  • Multilingual Word Sense Disambiguation

http://multijedi.org

SLIDE 6

INTEGRATING KNOWLEDGE

[Navigli & Ponzetto, ACL 2010; Pilehvar & Navigli, ACL 2014]

SLIDE 7

The resource diaspora

SLIDE 8

Multilingual Joint Word Sense Disambiguation (MultiJEDI)

Key Objective 1: create knowledge for all languages

WordNet, MultiWordNet, WOLF, MCR, GermaNet, BalkaNet

SLIDE 9

Merging entries from different resources into BabelNet

  • We collect lexicalizations, definitions, translations, images, etc. from each of the merged resources

WordNet

SLIDE 10

What is BabelNet?

  • A merger of resources of different kinds:
SLIDE 11

What is BabelNet?

  • A merger of resources of different kinds:
      – WordNet: the most popular computational lexicon of English
      – Open Multilingual WordNet: a collection of open wordnets
      – WoNeF: a French WordNet
      – Wikipedia: the largest collaborative encyclopedia
      – Wikidata: the largest collaborative knowledge base
      – Wiktionary: the largest collaborative dictionary
      – OmegaWiki: a medium-size collaborative multilingual dictionary
      – GeoNames: a worldwide geographical database
      – Microsoft Terminology: a computer science thesaurus
      – High-quality automatic sense-based translations

SLIDE 12

Why do we need BabelNet?

  • Multilinguality: the same concept is expressed in tens of languages

SLIDE 13

Why do we need BabelNet?

  • Multilinguality: the same concept is expressed in tens of languages

SLIDE 14

Why do we need BabelNet?

  • Multilinguality: the same concept is expressed in tens of languages

SLIDE 15

Why do we need BabelNet?

  • Multilinguality: the same concept is expressed in tens of languages
  • Coverage: 271 languages and 14 million entries!
      – 6M concepts and 7.7M named entities
      – 119M word senses
      – 378M semantic relations (27 relations per concept on avg.)
      – 11M images associated with concepts
      – 41M textual definitions
      – 2M concepts with domains associated

SLIDE 16

Why do we need BabelNet?

  • Multilinguality: the same concept is expressed in tens of languages
  • Coverage: 271 languages and 14 million entries!
  • Concepts and named entities together: dictionary and encyclopedic knowledge is semantically interconnected

SLIDE 17

Why do we need BabelNet?

  • Multilinguality: the same concept is expressed in tens of languages
  • Coverage: 271 languages and 14 million entries!
  • Concepts and named entities together: dictionary and encyclopedic knowledge is semantically interconnected
  • "Dictionary of the future": semantic network structure with labeled relations, pictures, multilingual synsets

SLIDE 18

Why do we need BabelNet?

  • Multilinguality: the same concept is expressed in tens of languages
  • Coverage: 271 languages and 14 million entries!
  • Concepts and named entities together: dictionary and encyclopedic knowledge is semantically interconnected
  • "Dictionary of the future": semantic network structure with labeled relations, pictures, multilingual synsets
  • Full-fledged taxonomy: is-a relations are available for both concepts and named entities (Wikipedia Bitaxonomy)
      – Ferrari Testarossa is-a sports car
      – BabelNet is-a semantic network & encyclopedic dictionary

SLIDE 19

Why do we need BabelNet?

  • Multilinguality: the same concept is expressed in tens of languages
  • Coverage: 271 languages and 14 million entries!
  • Concepts and named entities together: dictionary and encyclopedic knowledge is semantically interconnected
  • "Dictionary of the future": semantic network structure with labeled relations, pictures, multilingual synsets
  • Full-fledged taxonomy: is-a relations are available for both concepts and named entities (Wikipedia Bitaxonomy)
  • Easy access: Java and HTTP RESTful APIs; SPARQL endpoint (2 billion triples); downloadable indices for research purposes
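
As a rough illustration of the HTTP access mentioned on this slide, the Python sketch below only builds a request URL and sends nothing over the network. The base URL, the version segment, the `getSynsetIds` path and the `lemma`, `searchLang` and `key` parameter names are assumptions that do not appear on the slide; they should be checked against the official BabelNet API documentation before use.

```python
from urllib.parse import urlencode

# Assumption: host name and path layout are not given on the slide.
BASE_URL = "https://babelnet.io"


def synset_ids_url(lemma, search_lang, key, version):
    """Build a GET URL asking which synsets contain `lemma` in `search_lang`.

    All parameter names here are assumed, not taken from the slides.
    """
    query = urlencode({"lemma": lemma, "searchLang": search_lang, "key": key})
    return f"{BASE_URL}/{version}/getSynsetIds?{query}"


# "v9" and the key are placeholders for illustration only.
url = synset_ids_url("mouse", "EN", "YOUR_API_KEY", "v9")
print(url)
```

The resulting string could then be passed to any HTTP client; keeping URL construction separate from the network call makes the request easy to inspect before it is sent.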

SLIDE 20

The core of the Linguistic Linked Open Data cloud!

SLIDE 21

What can we do with BabelNet?

  • Search and translate:
SLIDE 22

What can we do with BabelNet?

SLIDE 23

What can we do with BabelNet?

  • Explore the network:
SLIDE 24

WordNet-Wikipedia mapping accuracy

  • Quality lower bound of the mapping: 87%
      – On the 6000 lowest-confidence mappings
      – Note: this concerns only 50k synsets in the intersection

SLIDE 25

Creating Datasets with BabelNet: all in one!

  • Annotating with BabelNet implies annotating with WordNet, Wikipedia, OmegaWiki, Open Multilingual WordNet, Wikidata and Wiktionary

Key fact!

BabelNet

7

SLIDE 26

ADDRESSING AMBIGUITY

[Moro, Raganato & Navigli, TACL 2014]

SLIDE 27

Motivation (1): hungry computers

  • EN - The mouse ate the cheese
SLIDE 28

Motivation (1): hungry computers

  • EN - The mouse ate the cheese
  • FR - La souris a mangé le fromage.
SLIDE 29

Motivation (1): hungry computers

  • EN - The mouse ate the cheese
  • FR - La souris a mangé le fromage.
  • IT - Il mouse ha mangiato il formaggio
SLIDE 30

Multilingual Joint Word Sense Disambiguation (MultiJEDI)

Key Objective 2: use all languages to disambiguate one

SLIDE 31

So what?

  • The first (and only) system that performs Word Sense Disambiguation (common nouns, verbs, adjectives) and Entity Linking together

  • In arbitrary languages (270+ languages)
  • In multiple languages at once
SLIDE 32

Step 4: Select the most reliable meanings

“Thomas and Mario are strikers playing in Munich”

Thomas (novel), Seth Thomas, Thomas Müller
Mario Gómez, Mario (Album), Mario (Character)
Striker (Movie), Striker (Video Game), striker (Sport)
Munich (City), FC Bayern Munich, Munich (Song)

SLIDE 33
SLIDE 34

Experimental Results: Fine-grained (Multilingual) Disambiguation

Senseval-3, SemEval-2007 task 17, SemEval-2013 task 12

SLIDE 35

Experimental Results: KORE50, AIDA-CoNLL

  • Two gold-standard Entity Linking datasets:
SLIDE 36

Babelfy "understands" 'the mouse ate the cheese'!

SLIDE 37

WSD and Entity Linking together win!

SLIDE 38

The Crazy Polyglot!

SLIDE 39

Live demo (2) – Crazy polyglot!

EN In today's knowledge and information society
FR le paysage lexicographique est plus hétérogène que jamais.
IT Possono le risorse stand-alone competere
ES con múltiples funciones, portale lexicográficas multilingüe y servicios web,
ZH Web服,定 制 的 喜 好 和 个 人 用 的 个 人 料 ?

SLIDE 40

BabelNet 3.6 is now a knowledge base!

  • Semantic relations from Wikidata + Infoboxes (superset of DBpedia) + relations extracted with Open Information Extraction techniques

SLIDE 41

SENSE AND CONCEPT REPRESENTATIONS

[Iacobacci et al., ACL 2015; Camacho-Collados et al., NAACL+ACL 2015]

SLIDE 42

Latent representation of word senses: SensEmbed

Iacobacci, Pilehvar, Navigli (ACL 2015)

SLIDE 43

Problem: word representations cannot capture polysemy

SLIDE 44

Problem: word representations cannot capture polysemy

SLIDE 45

Problem: word representations cannot capture polysemy

SLIDE 46

Our solution: distinct representation for each word’s meaning

SLIDE 47

Embeddings + Semantic Knowledge = SensEmbed

  • Achieved by training word2vec with text disambiguated with Babelfy (high precision, low recall)
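
The preprocessing idea behind this bullet can be sketched as follows: tokens that received a sense annotation are rewritten as sense-tagged tokens, and everything else is left as a plain word, which is what a high-precision, low-recall annotator would produce. The function name, the `word_SENSE` token format, the lower-casing step and the sense identifiers are all hypothetical choices made for this example, not the format used by SensEmbed or Babelfy.

```python
def sense_tag(tokens, annotations):
    """Rewrite annotated tokens as 'word_senseid'; keep other tokens as plain words.

    tokens: list of word strings.
    annotations: dict mapping a token index to a sense identifier.
    Lower-casing is a simplification made for this sketch.
    """
    return [
        f"{tok.lower()}_{annotations[i]}" if i in annotations else tok.lower()
        for i, tok in enumerate(tokens)
    ]


tokens = ["The", "mouse", "ate", "the", "cheese"]
# Placeholder sense IDs, not real BabelNet synset identifiers.
annotations = {1: "SENSE_ANIMAL", 4: "SENSE_FOOD"}

tagged = sense_tag(tokens, annotations)
print(tagged)
```

Sentences rewritten this way could then be given to any word2vec implementation, so that each sense-tagged token gets its own vector.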

SLIDE 48

Explicit representation of concepts: NASARI

Camacho Collados, Pilehvar and Navigli (NAACL 2015 + ACL 2015 + Artificial Intelligence Journal 2016)

SLIDE 49

Motivation

SLIDE 50

NASARI: human-interpretable semantic vectors

  • A vector represents the meaning of a Babel synset
  • Its components are
      – words
      – Babel synsets (concepts and named entities)
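
A vector whose components are words (or synsets) can be pictured as a sparse mapping from dimension name to weight. The sketch below compares such vectors with plain cosine similarity; the slides do not say which similarity measure is used, so cosine is a stand-in here, and the dimension names and weights are invented purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {dimension: weight}."""
    dot = sum(w * v[d] for d, w in u.items() if d in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


# Made-up weights: each key is a human-readable dimension of the vector.
mouse_animal = {"rodent": 3.0, "tail": 2.0, "cheese": 1.0}
mouse_device = {"computer": 3.0, "click": 2.0, "cursor": 1.0}
rat = {"rodent": 3.0, "tail": 1.0}

print(cosine(mouse_animal, rat))
print(cosine(mouse_device, rat))
```

Because every dimension is a readable word or synset, the shared keys show directly why two meanings are judged similar.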

SLIDE 51

NASARI: human-interpretable semantic vectors

  • The semantic part of the vector enables cross-lingual semantic alignments and comparison of text

SLIDE 52

Semantic similarity: results

  • NASARI is the concatenation of lexical and unified vectors:
SLIDE 53

Cross-lingual Word similarity: Results

Spearman (ρ) and Pearson (r) correlation performance of different systems on multilingual editions of the RG-65 dataset.
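
For readers unfamiliar with the two measures named above, here is a minimal pure-Python sketch of how Pearson and Spearman correlation between human similarity judgments and system scores might be computed. The score lists are invented and are not results from the talk; the code also assumes neither list is constant, since the correlation is undefined in that case.

```python
import math


def pearson(x, y):
    """Pearson correlation r between two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


human = [0.1, 1.2, 2.5, 3.9]     # invented gold similarity judgments
system = [0.2, 0.3, 0.9, 0.95]   # invented system similarity scores

print(pearson(human, system), spearman(human, system))
```

Spearman depends only on the ordering of the word pairs, while Pearson is also sensitive to how the scores are scaled, which is one reason to report both.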

SLIDE 54

Babelscape: bringing our multilingual technologies to the market

  • The MultiJEDI project is now over
  • However, much work remains to be done in this direction
      – BabelNet live
      – Increase coverage
  • Babelscape brings our multilingual technologies to the market
  • Key objective: making BabelNet and all other HLT sustainable
  • BabelNet will always be available (and downloadable) for research purposes

SLIDE 55

Summarizing…

+ latent and explicit representations of meanings
+ sustainability plan for improving our systems over time

SLIDE 56
SLIDE 57

Thanks or…

m i

(grazie)

SLIDE 58

Roberto Navigli

Linguistic Computing Laboratory http://lcl.uniroma1.it

@RNavigli