OD2WD: From Open Data to Wikidata through Patterns



slide-1
SLIDE 1

OD2WD: From Open Data to Wikidata through Patterns

Muhammad Faiz, Gibran M.F. Wisesa, Adila Krisnadhi, and Fariz Darari

Faculty of Computer Science, Universitas Indonesia, Depok, Indonesia

slide-2
SLIDE 2

Outline

  • Motivation
  • The OD2WD system
  • Emerging patterns
  • Discussion and Future Work
slide-3
SLIDE 3

Motivation

  • Worldwide open data adoption
  • Indonesia: several open data portals with a total of >50,000 CSV/Excel tables

slide-4
SLIDE 4

Motivation

  • Many portals stop at publishing CSV files, which prevents the data from being FAIR (findable, accessible, interoperable, reusable)
  • Linked Data is a solution, but it is difficult to adopt for technical, budgetary, or policy reasons

slide-5
SLIDE 5

Proposed Solution

Idea: Make use of existing linked data infrastructure

  • Transform and republish tabular data to a repository of choice: Wikidata
  • Upside #1: Allows further edits by the public
  • Upside #2: Wikidata is enriched further

slide-6
SLIDE 6

OD2WD: Open Data to Wikidata

  • Online at: http://od2wd.id
  • Currently implemented for Satu Data Indonesia portal, Jakarta Open Data portal, and Bandung Open Data portal.
  • Challenge #1: triple extraction from tabular cell values
  • Challenge #2: alignment with Wikidata vocabulary

slide-7
SLIDE 7

OD2WD: Open Data to Wikidata

Triple Extraction

slide-8
SLIDE 8

OD2WD: Open Data to Wikidata

Triple Extraction, Vocabulary Alignment

slide-9
SLIDE 9

OD2WD Architecture

slide-10
SLIDE 10

Reengineering Pattern

  • Currently only handling vertical listing tables.
  • Other table types are left as future work, e.g., horizontal listings, enumeration, matrix.
  • Protagonist column: the one with the highest number of unique cell values, with leftmost position winning the tiebreaker.
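
The protagonist-column rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OD2WD implementation: the function name `find_protagonist_column` and the sample table (including its numbers) are made up for the example.

```python
import csv
import io


def find_protagonist_column(rows):
    """Return the index of the column with the most unique cell values.

    `rows` is a list of equal-length lists (header row excluded).
    Ties go to the leftmost column.
    """
    if not rows:
        raise ValueError("table has no data rows")
    n_cols = len(rows[0])
    unique_counts = [len({row[i] for row in rows}) for i in range(n_cols)]
    # max() returns the first maximal element, so the leftmost column wins ties.
    return max(range(n_cols), key=lambda i: unique_counts[i])


# Hypothetical sample table; the values are invented for illustration.
sample_csv = """kelurahan,kota,jumlah
Kalisari,Jakarta,10
Slipi,Jakarta,20
Krukut,Jakarta,20
"""
reader = csv.reader(io.StringIO(sample_csv))
header = next(reader)
data = list(reader)
protagonist = find_protagonist_column(data)
print(header[protagonist])
```

Counting distinct values per column and relying on `max()` returning the first maximum keeps the tiebreaker implicit; a real table would also need handling for ragged rows and empty cells, which this sketch leaves out.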

slide-11
SLIDE 11

Datatype Detection

slide-12
SLIDE 12

Mapping/Linking: Disambiguation Challenge

City: Depok, Jakarta, Bandung, Semarang, Aceh, Medan, Bogor

Source: (https://wikidata.org)

slide-13
SLIDE 13

Mapping/Linking: Disambiguation Challenge

City: Depok, Jakarta, Bandung, Semarang, Aceh, Medan, Bogor

Source: (https://wikidata.org)

slide-14
SLIDE 14

Wikidata Alignment

Mapping

slide-15
SLIDE 15

Wikidata Alignment

Mapping

Disambiguation

  • Similarity Score
  • Data Type
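
One way the two disambiguation signals for mapping could be combined is sketched below. Everything here is an assumption for illustration: the slide only names the signals, so the order (filter by data type, then rank by similarity), the `Candidate` class, the placeholder identifiers `P_a`, `P_b`, `P_c`, and the use of `difflib` as a stand-in similarity measure are not taken from OD2WD.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    pid: str       # placeholder identifier, not a real Wikidata property ID
    label: str
    datatype: str


def similarity(a, b):
    # Stand-in string similarity in [0, 1]; an embedding-based score could be swapped in.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_property(column_name, column_datatype, candidates):
    """Keep candidates whose datatype matches the column, then take the most similar label."""
    compatible = [c for c in candidates if c.datatype == column_datatype]
    if not compatible:
        return None
    return max(compatible, key=lambda c: similarity(column_name, c.label))


# Hypothetical candidate list for a column named "population".
candidates = [
    Candidate("P_a", "population", "quantity"),
    Candidate("P_b", "populated place", "item"),
    Candidate("P_c", "area", "quantity"),
]
best = pick_property("population", "quantity", candidates)
print(best.pid if best else "no match")
```

Filtering on data type first means a lexically close candidate with an incompatible type can never win, which is the point of using both signals together.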

slide-16
SLIDE 16

Wikidata Alignment

Entity Linking

slide-17
SLIDE 17

Wikidata Alignment

Entity Linking

Disambiguation

  • Column Name
  • Similarity Score

slide-18
SLIDE 18

Wikidata Alignment

Context in Entity Linking

Kelurahan (urban village): Kalisari, Wijaya Kusuma, Cengkareng Barat, Cipinang Cempedak, Kelapa Gading Barat, Slipi, Krukut

Source: (https://wikidata.org)

SELECT ?item ?itemLabel WHERE {
  wd:X wdt:P31 ?item .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "id" }
}
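
The query above asks for the classes (`wdt:P31`, "instance of") of a candidate entity `wd:X`, with labels in Indonesian. A minimal sketch of filling in that template and preparing a request URL for the public Wikidata Query Service follows. The helper names `build_class_query` and `build_request_url` are hypothetical, `Q42` is used only as a syntactically valid item ID, and the request itself is not sent here.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_class_query(qid, language="id"):
    """Return the SPARQL text of the query above with wd:X replaced by `qid`."""
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    return (
        "SELECT ?item ?itemLabel WHERE { "
        f"wd:{qid} wdt:P31 ?item . "
        f'SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}" }} '
        "}"
    )


def build_request_url(qid, language="id"):
    """Return a GET URL that asks the endpoint for JSON results."""
    query = build_class_query(qid, language)
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


print(build_class_query("Q42"))
```

Validating the ID before interpolation keeps arbitrary cell text from ending up inside the query string.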

slide-19
SLIDE 19

Wikidata Alignment

Class Linking

slide-20
SLIDE 20

Wikidata Alignment

Class Linking

Disambiguation

  • Class Filtering
  • Similarity Score

slide-21
SLIDE 21

Alignment Patterns

  • AP1: applied to non-protagonist column headers
  • AP2: applied to cell values
  • AP3: applied to protagonist column headers
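
The alignment patterns address three parts of a vertical listing table: the protagonist column header, the non-protagonist column headers, and the cell values. A rough sketch of separating those parts and emitting unaligned (subject, header, value) triples is given below. The function name `split_vertical_listing`, the sample data, and the choice to skip empty cells are assumptions made for this example rather than documented OD2WD behaviour.

```python
def split_vertical_listing(header, rows, protagonist_index):
    """Separate a vertical listing table into the parts that alignment works on.

    Returns (protagonist_header, other_headers, raw_triples), where each raw
    triple is (protagonist cell, column header, cell value) before any
    alignment with Wikidata vocabulary.
    """
    protagonist_header = header[protagonist_index]
    other_headers = [h for i, h in enumerate(header) if i != protagonist_index]
    raw_triples = []
    for row in rows:
        subject = row[protagonist_index]
        for i, value in enumerate(row):
            if i == protagonist_index or value == "":
                continue  # skip the subject cell itself and empty cells
            raw_triples.append((subject, header[i], value))
    return protagonist_header, other_headers, raw_triples


# Hypothetical table; the values are invented for illustration.
header = ["kelurahan", "kota", "jumlah"]
rows = [
    ["Kalisari", "Jakarta", "10"],
    ["Slipi", "Jakarta", ""],
]
p_header, o_headers, triples = split_vertical_listing(header, rows, 0)
for triple in triples:
    print(triple)
```

Each element of the result would then go to a different alignment step: headers to vocabulary lookup, cell values to entity linking.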

slide-22
SLIDE 22

Conversion Accuracy

Performance measurement on 50 CSV documents from Indonesia's open data portal (compared against human judgement)

  • 20,256 new statements have been added to Wikidata.
  • Inaccuracy causes: value irregularity, nested structure (minority), inadequate corpus coverage for embedding.

Chart: accuracy of each conversion phase (%)

  • Datatype Detection: 81.9
  • Protagonist Detection: 88
  • Mapping: 79.21
  • Entity Linking: 88.42
  • Class Linking: 70

slide-23
SLIDE 23

Future Work

  • Improvement on conversion accuracy by incorporating more context information
  • Handling more types of tables: horizontal listings, enumeration, matrix, etc.
  • Study better encoding of the patterns and their applicability and usage in other open data portals
  • Prototypical tool for converting tabular CSVs to RDF graphs and republishing them to Wikidata

slide-24
SLIDE 24

Acknowledgement

  • Wikimedia Indonesia project "Peningkatan Konten Wikidata"
  • Students at Universitas Indonesia as human evaluators
  • Raisha Abdillah from Wikimedia Indonesia for final quality checks prior to deploying data to Wikidata
  • 2019 PITTA B research grant "Analysis and Enrichment of Wikidata Knowledge Graph" from Universitas Indonesia

slide-25
SLIDE 25

Video demo: https://youtu.be/oOjJdOQ8dwM

Thank You