Multilinguality in Wikidata, Lucie-Aimée Kaffee, kaffee@soton.ac.uk

slide-1
SLIDE 1

Multilinguality in Wikidata

Lucie-Aimée Kaffee kaffee@soton.ac.uk

slide-2
SLIDE 2

About Me

PhD Student, WAIS, University of Southampton
Previously worked as a Software Developer at Wikimedia Deutschland, in the Wikidata team
Interest in (under-resourced) languages
From Berlin, Germany

slide-3
SLIDE 3

What we will talk about

Wikidata
Multilinguality in Wikidata
My work

slide-4
SLIDE 4

Wikidata

➔ Knowledge base maintained and edited by a community of users
➔ 48,775,926 items
➔ Each entity can have labels in >400 languages

LOD cloud

slide-5
SLIDE 5
slide-6
SLIDE 6

Multilinguality in Wikidata

slide-7
SLIDE 7

A Glimpse into Babel: An Analysis of Multilinguality in Wikidata

Lucie-Aimée Kaffee, Alessandro Piscopo, Pavlos Vougiouklis, Elena Simperl, Leslie Carr, Lydia Pintscher OpenSym 2017

slide-8
SLIDE 8

Multilinguality in Wikidata

Q7259 P106 Q82594

Q7259 rdfs:label "Ada Lovelace"@en, "آدا لوفلايس"@ar
P106 rdfs:label "occupation"@en, "المهنة"@ar
Q82594 rdfs:label "computer scientist"@en, "عالم حاسوب"@ar
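A minimal Python sketch of the idea on this slide: each entity or property ID carries one label per language, and a triple can be shown to a reader by looking up the labels in their language. The dictionary layout, the function names `get_label` and `verbalise`, and the English fallback are assumptions made for illustration; they are not Wikidata's API.

```python
from typing import Optional

# Labels keyed by ID and language code, as in the rdfs:label example above.
LABELS = {
    "Q7259": {"en": "Ada Lovelace", "ar": "آدا لوفلايس"},
    "P106": {"en": "occupation", "ar": "المهنة"},
    "Q82594": {"en": "computer scientist", "ar": "عالم حاسوب"},
}


def get_label(entity_id: str, lang: str, fallback: Optional[str] = "en") -> str:
    """Return the label in `lang`, else in the fallback language, else the raw ID."""
    labels = LABELS.get(entity_id, {})
    if lang in labels:
        return labels[lang]
    if fallback is not None and fallback in labels:
        return labels[fallback]
    return entity_id


def verbalise(subject: str, prop: str, obj: str, lang: str) -> str:
    """Render one triple as a space-separated string of labels (hypothetical helper)."""
    return " ".join(get_label(x, lang) for x in (subject, prop, obj))


print(verbalise("Q7259", "P106", "Q82594", "en"))
```

When a label is missing in the requested language, the reader falls back to another language or to the bare ID, which is why label coverage matters for access.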

slide-9
SLIDE 9

Multilinguality in Wikidata - Why do we care?

  • Labels are the access point for humans
  • Give language communities access to existing knowledge
  • Central storage for translations for (under-resourced) languages
  • Semantic Web in NLP and NLG
  • Reuse: Wikipedia, translation, question answering, chat bots, ...
slide-10
SLIDE 10

Research Questions

  • What is the state of Wikidata with regard to multilinguality?
  • How does Wikidata's label distribution relate to the real world and Wikipedia's language distribution?
  • Is there a difference in the multilinguality of the properties, compared to the overall multilinguality of the knowledge base?
slide-11
SLIDE 11
slide-12
SLIDE 12

slide-13
SLIDE 13

[Chart values: 11.04%, 6.5%, 6%, 5%, 4%]

slide-14
SLIDE 14

Comparison of distribution of languages in Wikidata and first language speakers in the world
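One way such a comparison could be computed, as a rough sketch: count labels per language, turn the counts into percentage shares, and subtract the share of first-language speakers. The function names and the input shapes are assumptions for illustration, and any numbers passed in would have to come from real dumps and speaker statistics.

```python
from collections import Counter


def language_shares(label_langs):
    """Percentage share of each language code in an iterable of label languages."""
    counts = Counter(label_langs)
    total = sum(counts.values())
    return {lang: 100.0 * n / total for lang, n in counts.most_common()}


def coverage_gap(label_shares, speaker_shares):
    """Label share minus speaker share per language; negative means under-covered."""
    langs = set(label_shares) | set(speaker_shares)
    return {
        lang: label_shares.get(lang, 0.0) - speaker_shares.get(lang, 0.0)
        for lang in sorted(langs)
    }
```

A language with a large negative gap has many speakers but few labels, which is the pattern the next slide points out for Chinese.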

slide-15
SLIDE 15

The most spoken language in the world, Chinese, is not well covered in Wikidata.

slide-16
SLIDE 16

Bot edits can make a difference in content coverage (Cebuano and Swedish)

slide-17
SLIDE 17

Dedicated communities change language representation (German and Dutch)

slide-18
SLIDE 18

Wikidata Properties

Q7259 P106 Q82594

Q7259 rdfs:label "Ada Lovelace"@en, "آدا لوفلايس"@ar
P106 rdfs:label "occupation"@en, "المهنة"@ar
Q82594 rdfs:label "computer scientist"@en, "عالم حاسوب"@ar

slide-19
SLIDE 19

slide-20
SLIDE 20
slide-21
SLIDE 21
slide-22
SLIDE 22

Ranking of number of Wikipedia articles by language, all labels in Wikidata, and labels for properties in Wikidata
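A small sketch of how the three rankings could be put side by side, assuming per-language counts are already available for each source. The helper names and the toy counts in the usage lines are made up for illustration and are not figures from the talk.

```python
def rank_languages(counts):
    """Map each language to its rank (1 = highest count); ties broken alphabetically."""
    ordered = sorted(counts, key=lambda lang: (-counts[lang], lang))
    return {lang: i + 1 for i, lang in enumerate(ordered)}


def rank_table(**sources):
    """For every language, collect its rank in each named source (None if absent)."""
    ranks = {name: rank_languages(counts) for name, counts in sources.items()}
    langs = sorted(set().union(*[r.keys() for r in ranks.values()]))
    return {lang: {name: ranks[name].get(lang) for name in ranks} for lang in langs}


# Toy counts, purely illustrative.
table = rank_table(
    wikipedia={"en": 10, "sv": 8, "de": 5},
    labels={"en": 9, "de": 7, "sv": 3},
)
print(table["sv"])
```

A language that ranks high for articles but low for property labels suggests bot-imported content rather than community editing, which is the contrast the following slides draw.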

slide-23
SLIDE 23

German is widely used in Wikipedia and Wikidata
High coverage through active community

slide-24
SLIDE 24

As with German, an active community brings high coverage of labels

slide-25
SLIDE 25

High coverage of labels on Wikidata through a high number of bot-imported Wikipedia articles, but a low number of community-edited properties

slide-26
SLIDE 26

Even more extreme than Swedish: Not in top 25 of community-edited properties by language

slide-27
SLIDE 27

Users in Wikidata

(Work in Progress)

slide-28
SLIDE 28
slide-29
SLIDE 29

Native Languages of Wikidata users

slide-30
SLIDE 30

Language coverage of labels is not reflected in the languages Wikidata's users speak

Native Languages of Wikidata users

slide-31
SLIDE 31

From Wikidata to Wikipedia

slide-32
SLIDE 32

Wikipedia is available in 285 languages, but the content is unevenly distributed

English

Articles: 5,656,303 Editors: 132,781

Arabic

Articles: 576,376 Editors: 4,809

Esperanto

Articles: 247,215 Editors: 361

slide-33
SLIDE 33

Content From Wikidata to Wikipedia

slide-34
SLIDE 34

Learning to Generate Wikipedia Summaries for Underserved Languages from Wikidata

Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl NAACL 2018

slide-35
SLIDE 35

Esperanto

➔ Esperanto is an artificial language
➔ Easy to learn
➔ Engaged Wikipedia community
➔ A good starting point

Arabic

➔ Arabic is the 5th most spoken language in the world
➔ Content online in Arabic is sparse, however

slide-36
SLIDE 36

Sample Input

Q490900 Floridia | P17 country | Q38 Italy
Q490900 Floridia | P31 instance of | Q747074 comune of Italy
Q30025755 Floridia (town) | P36 capital | Q490900 Floridia

...
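The sample input can be held in a simple structure, sketched below in Python. Note that the set includes a triple in which Floridia is the object, not only the subject. The `Triple` type and the `triples_about` helper are hypothetical names used for illustration.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    prop: str
    obj: str


# IDs and labels copied from the sample input above.
LABELS = {
    "Q490900": "Floridia",
    "Q30025755": "Floridia (town)",
    "Q38": "Italy",
    "Q747074": "comune of Italy",
    "P17": "country",
    "P31": "instance of",
    "P36": "capital",
}

TRIPLES = [
    Triple("Q490900", "P17", "Q38"),
    Triple("Q490900", "P31", "Q747074"),
    Triple("Q30025755", "P36", "Q490900"),
]


def triples_about(entity_id: str, triples: List[Triple]) -> List[Triple]:
    """Triples in which the entity appears as subject or as object."""
    return [t for t in triples if entity_id in (t.subject, t.obj)]


for t in triples_about("Q490900", TRIPLES):
    print(LABELS[t.subject], "|", LABELS[t.prop], "|", LABELS[t.obj])
```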

slide-37
SLIDE 37

Neural Text Generation

Arabic and Esperanto output text
Feed-forward architecture encodes Wikidata triples into a vector of fixed dimensionality
RNN-based decoder generates text summaries, one token at a time
Property placeholders to deal with out-of-vocabulary words
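The property placeholder idea can be illustrated with a small sketch. It is an assumption about how such a scheme might work, not the implementation from the paper: rare entity names in the training text are swapped for a token naming the property that links to them, and the token is swapped back for the label after generation. The `[[P17]]` token format and both function names are hypothetical, and only single-token labels are handled.

```python
def add_placeholders(tokens, triples):
    """Replace tokens that equal a triple's object label with a [[property]] token.

    `triples` is a list of (property_id, object_label) pairs for one entity.
    """
    label_to_prop = {obj_label: prop for prop, obj_label in triples}
    return [f"[[{label_to_prop[t]}]]" if t in label_to_prop else t for t in tokens]


def fill_placeholders(tokens, triples):
    """Inverse step: turn [[property]] tokens back into the object label."""
    prop_to_label = {prop: obj_label for prop, obj_label in triples}
    out = []
    for t in tokens:
        if t.startswith("[[") and t.endswith("]]") and t[2:-2] in prop_to_label:
            out.append(prop_to_label[t[2:-2]])
        else:
            out.append(t)
    return out


triples = [("P17", "Italy")]
masked = add_placeholders("Floridia is a town in Italy".split(), triples)
print(masked)
print(fill_placeholders(masked, triples))
```

With this kind of substitution the decoder only needs the placeholder token in its vocabulary, not every rare entity name.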
slide-38
SLIDE 38

Q106693 Group 14 (chemical series)

Arabic: مجموعة الكربون هي العناصر الموجودة الموجودة في الجدول الدوري للعناصر
Esperanto: Karbongrupo estas elemento en grupo 0 de la perioda tabelo laŭ la IUPAC-sistemo .
English: The carbon group is a periodic table group consisting of carbon, silicon, germanium, tin, lead, and flerovium.

Q16885 Thelxinoe (natural satellite)

Arabic: ثيليكسيون هو قمر طبيعي غير نظامي يتحرك بحركة تراجعية تابع لكوكب المشتري .
Esperanto: Telksino estas neregula satelito de Jupitero , kiu havas retrogradan orbiton .
English: Thelxinoe (/θɛlkˈsɪnoʊˌiː/ thelk-SIN-o-ee; Greek: Θελξινόη), also known as Jupiter XLII, is a natural satellite of Jupiter.

slide-39
SLIDE 39

Automatic Evaluation

  • Baselines

○ Machine Translation, Information Retrieval Based, Kneser-Ney

  • Automatic Evaluation

○ BLEU 1, BLEU 2, BLEU 3, BLEU 4, METEOR, ROUGE
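For readers unfamiliar with BLEU, here is a simplified sentence-level sketch with a single reference: clipped n-gram precision for n = 1..max_n, combined by geometric mean and multiplied by a brevity penalty. It is illustrative only; the evaluation in the talk would have used standard tooling, and details such as smoothing and corpus-level aggregation are left out.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as often as in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU-max_n for one candidate and one reference (token lists)."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalise candidates that are not longer than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_mean)
```

BLEU 1 to BLEU 4 in the list above correspond to `max_n` values of 1 to 4 in this sketch.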

slide-40
SLIDE 40

Results of the automatic evaluation: Our network outperforms all baselines

slide-41
SLIDE 41

Mind the (Language) Gap: Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders

Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl ESWC 2018

slide-42
SLIDE 42

ArticlePlaceholders display Wikidata triples on Wikipedia in a tabular way
Currently deployed on 14 Wikipedias

slide-43
SLIDE 43
slide-44
SLIDE 44
slide-45
SLIDE 45

Enriching ArticlePlaceholder with textual summaries generated from Wikidata triples
Working with Arabic and Esperanto

slide-46
SLIDE 46

Community Study

➔ Two 15-day online surveys, aimed at readers and editors in Esperanto and Arabic
➔ Aiming to test our work with the actual Wikipedia community, outreach on Wikipedia platforms
➔ Reader:
    ◆ Fluency: Is the text understandable and grammatically correct?
    ◆ Appropriateness: Does the summary ‘feel’ like a Wikipedia article?
➔ Editor:
    ◆ Editors were asked to edit the article starting from our summary (2-3 sentences)
    ◆ How much of the text was reused?
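The question of how much text was reused can be approximated with a simple token-overlap measure, sketched below. This is a stand-in for illustration and not the measure used in the study; the function names and the 0.66 and 0.33 thresholds are assumptions chosen only to show how scores could be mapped to the wholly derived, partially derived, and non derived categories.

```python
from collections import Counter


def reuse_ratio(generated: str, edited: str) -> float:
    """Fraction of generated tokens that also occur in the edited text (multiset overlap)."""
    gen = Counter(generated.split())
    ed = Counter(edited.split())
    total = sum(gen.values())
    if total == 0:
        return 0.0
    kept = sum((gen & ed).values())
    return kept / total


def classify(generated: str, edited: str, wholly: float = 0.66, partially: float = 0.33) -> str:
    """Map a reuse ratio to one of three categories (thresholds are illustrative)."""
    score = reuse_ratio(generated, edited)
    if score >= wholly:
        return "wholly derived"
    if score >= partially:
        return "partially derived"
    return "non derived"
```

A bag-of-words overlap ignores word order, so a measure that matches contiguous spans would be stricter.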

slide-47
SLIDE 47

Results of the reader study

Kaffee, Elsahar, Vougiouklis et al.: Mind the (Language) Gap: Neural Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders

slide-48
SLIDE 48

Results of the reader study: We generate sentences of comparable fluency, that “feel” like Wikipedia sentences

Kaffee, Elsahar, Vougiouklis et al.: Mind the (Language) Gap: Neural Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders

slide-49
SLIDE 49

slide-50
SLIDE 50

Results of the editor study: We generate sentences that are highly reused by editors

[Chart legend: wholly derived, partially derived, non derived]

78.78% 94.77%

slide-51
SLIDE 51

Our algorithms can only ever be as good as the information in our data.
slide-52
SLIDE 52

Our algorithms can only ever be as good as the information in our data.
Severe lack of data in Arabic in Wikidata.
slide-53
SLIDE 53

Future Work: Label Extraction From Wikipedia For Wikidata

slide-54
SLIDE 54

Example Wikidata Triple

English   Berlin   Capital Of   Germany
ID        Q64      P1376        Q183
German    Berlin   Hauptstadt   X
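A first step towards label extraction is finding which parts of a triple lack a label in the target language, like the X in the example. The sketch below mirrors the slide's illustration (it does not reflect the live state of Wikidata); the `missing_labels` helper and the dictionary layout are hypothetical.

```python
# Labels as shown in the example; the German label for Q183 is the missing "X".
LABELS = {
    "Q64": {"en": "Berlin", "de": "Berlin"},
    "P1376": {"en": "Capital Of", "de": "Hauptstadt"},
    "Q183": {"en": "Germany"},
}


def missing_labels(triple, lang, labels):
    """IDs in the triple that have no label in `lang`."""
    return [entity for entity in triple if lang not in labels.get(entity, {})]


print(missing_labels(("Q64", "P1376", "Q183"), "de", LABELS))
```

The IDs returned are the candidates for which a label would have to be extracted from Wikipedia text in that language.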

slide-55
SLIDE 55

slide-56
SLIDE 56
slide-57
SLIDE 57
slide-58
SLIDE 58
slide-59
SLIDE 59
slide-60
SLIDE 60
  • Mind the (Language) Gap: Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders, Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl, ESWC 2018
  • Learning to Generate Wikipedia Summaries for Underserved Languages from Wikidata, Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl, NAACL 2018
  • The Human Face of the Web of Data: A Cross-Sectional Study of Labels, Lucie-Aimée Kaffee, Elena Simperl, SEMANTiCS 2018
  • Property Label Stability in Wikidata, Thomas Pellissier Tanon, Lucie-Aimée Kaffee, Wiki Workshop at The Web Conference 2018
  • A Glimpse into Babel: An Analysis of Multilinguality in Wikidata, Lucie-Aimée Kaffee, Alessandro Piscopo, Pavlos Vougiouklis, Elena Simperl, Leslie Carr, Lydia Pintscher, OpenSym 2017
  • Provenance Information in a Collaborative Knowledge Graph: an Evaluation of Wikidata External References, Alessandro Piscopo, Lucie-Aimée Kaffee, Chris Phethean, Elena Simperl, ISWC 2017
  • What do Wikidata and Wikipedia Have in Common?: An Analysis of their Use of External References, Alessandro Piscopo, Pavlos Vougiouklis, Lucie-Aimée Kaffee, Christopher Phethean, Jonathon Hare, Elena Simperl, OpenSym 2017
  • Neural Wikipedian: Generating Textual Summaries from Knowledge Base Triples, Pavlos Vougiouklis, Hady Elsahar, Lucie-Aimée Kaffee, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl (under submission)

slide-61
SLIDE 61

Thanks!

kaffee@soton.ac.uk luciekaffee.github.io @frimelle
