
Multilinguality in Wikidata Lucie-Aimée Kaffee kaffee@soton.ac.uk



  1. Multilinguality in Wikidata Lucie-Aimée Kaffee kaffee@soton.ac.uk

  2. About Me: PhD Student, WAIS, University of Southampton. Previously worked as a Software Developer at Wikimedia Deutschland, in the Wikidata team. Interest in (under-resourced) languages. From Berlin, Germany

  3. What we will talk about: Wikidata; Multilinguality in Wikidata; My work

  4. LOD cloud. Wikidata ➔ Knowledge base maintained and edited by a community of users ➔ 48,775,926 items ➔ Each entity can have labels in >400 languages

  5. Multilinguality in Wikidata

  6. A Glimpse into Babel: An Analysis of Multilinguality in Wikidata. Lucie-Aimée Kaffee, Alessandro Piscopo, Pavlos Vougiouklis, Elena Simperl, Leslie Carr, Lydia Pintscher. OpenSym 2017

  7. Multilinguality in Wikidata. Triple: Q7259 P106 Q82594. rdfs:label of Q7259: "Ada Lovelace"@en, "آدا لوفلايس"@ar. rdfs:label of P106: "occupation"@en, "المهنة"@ar. rdfs:label of Q82594: "computer scientist"@en, "عالم حاسوب"@ar
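The labelled triple on this slide can be sketched as a small data structure. This is only an illustration, not Wikidata's API: the `LABELS` dictionary and the `render_triple` helper are hypothetical names, and the labels are the ones shown on the slide.

```python
# Labels keyed by entity or property ID, then by language code (as on the slide).
LABELS = {
    "Q7259": {"en": "Ada Lovelace", "ar": "آدا لوفلايس"},
    "P106": {"en": "occupation", "ar": "المهنة"},
    "Q82594": {"en": "computer scientist", "ar": "عالم حاسوب"},
}


def render_triple(triple, lang):
    """Return the labels of (subject, property, object) in `lang`.

    A missing label is returned as None, so gaps stay visible.
    """
    return tuple(LABELS.get(term, {}).get(lang) for term in triple)


triple = ("Q7259", "P106", "Q82594")
english = render_triple(triple, "en")
german = render_triple(triple, "de")  # this toy data holds no German labels
```

The IDs stay the same in every language; only the labels change, which is why a missing label leaves the statement unreadable for speakers of that language.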

  8. Multilinguality in Wikidata - Why do we care? ● Labels are the access point for humans ● Give language communities access to existing knowledge ● Central storage for translations for (under-resourced) languages ● Semantic Web in NLP and NLG ● Reuse: Wikipedia, translation, question answering, chat bots, ...

  9. Research Questions ● What is the state of Wikidata with regard to multilinguality? ● How does Wikidata's label distribution relate to the real world and Wikipedia's language distribution? ● Is there a difference in the multilinguality of the properties, compared to the overall multilinguality of the knowledge base?
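One way to make the first research question concrete is to count labels per language. The sketch below runs on a handful of toy items, not on real Wikidata statistics, and the function names `label_share_by_language` and `item_coverage` are hypothetical.

```python
from collections import Counter

# Toy items: item ID -> {language code: label}. Not real Wikidata statistics.
ITEMS = {
    "Q1": {"en": "universe", "de": "Universum"},
    "Q2": {"en": "Earth", "de": "Erde", "ar": "الأرض"},
    "Q3": {"en": "life"},
    "Q4": {},
}


def label_share_by_language(items):
    """Share of all labels that is in each language (shares sum to 1.0)."""
    counts = Counter(lang for labels in items.values() for lang in labels)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()} if total else {}


def item_coverage(items, lang):
    """Fraction of items that have a label in `lang`."""
    if not items:
        return 0.0
    return sum(1 for labels in items.values() if lang in labels) / len(items)


shares = label_share_by_language(ITEMS)
english_coverage = item_coverage(ITEMS, "en")
```

The two numbers answer different questions: the share compares languages with each other, while the coverage says how much of the knowledge base a single language can access.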

  10. 11.04%

  11. 4% 11.04% 6.5% 6% 5%

  12. Comparison of distribution of languages in Wikidata and first language speakers in the world

  13. The most spoken language in the world, Chinese, is not well covered in Wikidata.

  14. Bot edits can make a difference in content coverage (Cebuano and Swedish)

  15. Dedicated communities change language representation (German and Dutch)

  16. Wikidata Properties. Triple: Q7259 P106 Q82594. rdfs:label of Q7259: "Ada Lovelace"@en, "آدا لوفلايس"@ar. rdfs:label of P106: "occupation"@en, "المهنة"@ar. rdfs:label of Q82594: "computer scientist"@en, "عالم حاسوب"@ar

  20. Ranking of number of Wikipedia articles by language, all labels in Wikidata, and labels for properties in Wikidata

  21. German is widely used in Wikipedia and Wikidata. High coverage through an active community

  22. As with German, an active community brings high coverage of labels

  23. High label coverage on Wikidata through a high number of bot-imported Wikipedia articles, but a low number of community-edited properties

  24. Even more extreme than Swedish: not in the top 25 of community-edited properties by language

  25. Users in Wikidata (Work in Progress)

  26. Native Languages of Wikidata users

  27. Native Languages of Wikidata users: The language coverage of labels does not reflect the languages Wikidata’s users speak

  28. From Wikidata to Wikipedia

  29. Wikipedia is available in 285 languages, but the content is unevenly distributed. English: 5,656,303 articles, 132,781 editors. Arabic: 576,376 articles, 4,809 editors. Esperanto: 247,215 articles, 361 editors.

  30. Content From Wikidata to Wikipedia

  31. Learning to Generate Wikipedia Summaries for Underserved Languages from Wikidata. Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl. NAACL 2018

  32. Esperanto ➔ Esperanto is an artificial language ➔ Easy to learn ➔ Engaged Wikipedia community ➔ A good starting point. Arabic ➔ Arabic is the 5th most spoken language in the world ➔ Content online in Arabic is sparse, however. [Diagram labels: en, ar, eo]

  33. Sample Input
      Q490900 P17 Q38: Floridia, country, Italy
      Q490900 P31 Q747074: Floridia, instance of, comune of Italy
      ...
      Q30025755 P36 Q490900: Floridia (town), capital, Floridia

  34. Neural Text Generation. Output text: Arabic and Esperanto. A feed-forward architecture encodes Wikidata triples into a vector of fixed dimensionality. An RNN-based decoder generates text summaries, one token at a time. Property placeholders deal with out-of-vocabulary words.
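The slide only names the property placeholder, so the following is a sketch of one way such a mechanism can work, not the authors' implementation. It assumes that an out-of-vocabulary token matching the object label of an input triple is replaced by a marker naming that triple's property, and that the marker is swapped back to the label after generation. The `[[P17]]` marker format, the helper names, the single-token labels and the toy vocabulary are all assumptions made for illustration; English is used only for readability.

```python
def add_placeholders(tokens, triples, vocab):
    """Replace out-of-vocabulary tokens that match a triple's object label
    with a placeholder naming that triple's property."""
    by_label = {obj_label: prop for prop, obj_label in triples}
    out = []
    for tok in tokens:
        if tok not in vocab and tok in by_label:
            out.append(f"[[{by_label[tok]}]]")
        else:
            out.append(tok)
    return out


def fill_placeholders(tokens, triples):
    """Swap each placeholder back to the object label of its property."""
    by_prop = {f"[[{prop}]]": obj_label for prop, obj_label in triples}
    return [by_prop.get(tok, tok) for tok in tokens]


# (property, object label) pairs for the subject Floridia (Q490900).
# Labels are shortened to single tokens for this sketch.
triples = [("P17", "Italy"), ("P31", "comune")]
vocab = {"Floridia", "is", "a", "in", "comune", "."}

masked = add_placeholders("Floridia is a comune in Italy .".split(), triples, vocab)
restored = fill_placeholders(masked, triples)
```

With this kind of scheme the decoder only needs to learn one token per property, while the rare word itself is copied from the input triples.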

  35. Q106693 Group 14 (chemical series)
      Arabic: مجموعة الكربون هي العناصر الموجودة الموجودة في الجدول الدوري للعناصر
      Esperanto: Karbongrupo estas elemento en grupo 0 de la perioda tabelo laŭ la IUPAC-sistemo .
      English: The carbon group is a periodic table group consisting of carbon, silicon, germanium, tin, lead, and flerovium.
      Q16885 Thelxinoe (natural satellite)
      Arabic: ثيليكسيون هو قمر طبيعي غير نظامي يتحرك بحركة تراجعية تابع لكوكب المشتري .
      Esperanto: Telksino estas neregula satelito de Jupitero , kiu havas retrogradan orbiton .
      English: Thelxinoe (/θɛlkˈsɪnoʊˌiː/ thelk-SIN-o-ee; Greek: Θελξινόη), also known as Jupiter XLII, is a natural satellite of Jupiter.

  36. Automatic Evaluation ● Baselines ○ Machine Translation, Information Retrieval Based, Kneser-Ney ● Automatic Evaluation ○ BLEU 1, BLEU 2, BLEU 3, BLEU 4, METEOR, ROUGE
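To show what the BLEU 1 to BLEU 4 scores measure, here is a minimal sentence-level sketch with a single reference, uniform weights and no smoothing. It is not the evaluation script used in the study, whose exact setup is not given on the slides; the function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU up to max_n, uniform weights, no smoothing."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if not candidate or min(precisions) == 0.0:
        return 0.0
    # The brevity penalty punishes candidates shorter than the reference.
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "Telksino estas neregula satelito de Jupitero".split()
candidate = "Telksino estas satelito de Jupitero".split()
bleu_1 = bleu(candidate, reference, max_n=1)
bleu_2 = bleu(candidate, reference, max_n=2)
```

Higher orders reward longer matching word sequences, so BLEU 4 is stricter about word order than BLEU 1.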

  37. Results of the automatic evaluation: Our network outperforms all baselines

  38. Mind the (Language) Gap: Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders. Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl. ESWC 2018

  39. ArticlePlaceholder displays Wikidata triples on Wikipedia in a tabular way. Currently deployed on 14 Wikipedias

  40. Enriching ArticlePlaceholder with textual summaries generated from Wikidata triples. Working with Arabic and Esperanto

  41. Community Study ➔ Two 15-day online surveys, aimed at readers and editors in Esperanto and Arabic ➔ Aiming to test our work with the actual Wikipedia community, outreach on Wikipedia platforms ➔ Reader: ◆ Fluency: Is the text understandable and grammatically correct? ◆ Appropriateness: Does the summary ‘feel’ like a Wikipedia article? ➔ Editor: ◆ Editors were asked to edit the article starting from our summary (2-3 sentences) ◆ How much of the text was reused?
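The editor question, how much of the text was reused, can be approximated with a token-level comparison. The sketch below is a rough stand-in built on Python's `difflib.SequenceMatcher`; the slides do not state which measure the study used, and `reused_share` is a hypothetical name.

```python
from difflib import SequenceMatcher


def reused_share(summary, edited):
    """Fraction of the summary's tokens that survive, in order, in the edit."""
    src, dst = summary.split(), edited.split()
    if not src:
        return 0.0
    matcher = SequenceMatcher(a=src, b=dst, autojunk=False)
    # Matching blocks are runs of tokens present in both texts, in the same order.
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(src)


summary = "Telksino estas neregula satelito de Jupitero"
edited = "Telksino estas malgranda neregula satelito de Jupitero"
share = reused_share(summary, edited)
```

A share near 1 means the editor kept the generated summary and built on it; a share near 0 means the text was rewritten from scratch.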

  42. Results of the reader study (Kaffee, Elsahar, Vougiouklis et al.: Mind the (Language) Gap: Neural Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders)

  43. Results of the reader study: We generate sentences of comparable fluency that “feel” like Wikipedia sentences

  44. Results of the editor study: We generate sentences that are highly reused by editors. Chart categories (two charts): wholly derived, partially derived, non derived.

  45. Results of the editor study: We generate sentences that are highly reused by editors. Chart categories: wholly derived, partially derived, non derived. Wholly derived: 78.78% (first chart), 94.77% (second chart).

  46. Our algorithms can only ever be as good as the information in our data.

  47. Our algorithms can only ever be as good as the information in our data. Severe lack of data in Arabic in Wikidata.

  48. Future Work: Label Extraction From Wikipedia For Wikidata

  49. Example Wikidata Triple: Q64 P1376 Q183. English: Berlin, Capital Of, Germany. German: Berlin, Hauptstadt, X
