WDPlus: Leveraging Wikidata to Link and Extend Tabular Data


  1. WDPlus: Leveraging Wikidata to Link and Extend Tabular Data
     Daniel Garijo, Pedro Szekely
     Information Sciences Institute and Department of Computer Science
     @dgarijov, dgarijo@isi.edu

  2. Abundance of data sources on the Web
     Users of data face three challenges:
     • How do I find relevant datasets for my problem?
     • How do I augment my dataset with existing information?
     • How can I share my integrated results with the community?
     Daniel Garijo and Pedro Szekely. WDPlus: Leveraging Wikidata to Link and Extend Tabular Data. (Sciknow 2019)

  3. Popular initiatives for addressing these challenges
     • Search individual items
       • Search is manual, based on user input
     • LOD cloud of connected datasets
       • Knowledge engineers are needed to map and augment content
     • ETL frameworks (e.g., Karma, OpenRefine)
       • Pipelines are custom, expertise required
       • Often not shared
     Sources: https://lod-cloud.net/versions/2019-03-29/lod-cloud.png; https://panoply.io/data-warehouse-guide/3-ways-to-build-an-etl-process/

  4. WDPlus
     A framework designed to:
     • Discover data on the Web
     • Improve raw data to make it useful
     • Search, querying dataset structure
     • Download fresh data
     • Combine existing datasets
     • Share improved data and methods
     [Figure: architecture diagram. A core (example entity Harry Truman, with the labels 1884-05-08, 1972-12-26, male, President, 1945-04-12, 1953-01-20, Lamar, USA, Bress Truman), satellites (crime, shopping, sports, weather), and a metadata index.]

  5. WDPlus architecture
     Wikidata as a core KG
     • 60 million items
     • 700 million statements
     • 20,000+ contributors
     • +1 billion edits
     • Collaborative!
     [Figure: architecture diagram showing the core, with Harry Truman as the example entity.]

  6. WDPlus architecture
     Satellite organization
     • Detailed information on a domain
       • Crime records, sport events, etc.
     • Linked to the Wikidata core
     • Link first strategy
     • Custom properties and Qnodes
     • Extensions to core model
     • Synchronized with core
     • Decentralized
     • 1 satellite may be maintained by 1 community
     [Figure: architecture diagram showing the core and its satellites (crime, shopping, sports, weather).]

  7. WDPlus architecture
     Table models
     • Tables are not materialized
     • Able to become a satellite on demand
     • Described in a machine-readable metadata index
     • Indexing column names and relevant instances for fast retrieval
     • Link to table model is preserved
     [Figure: architecture diagram showing the core, satellites (crime, shopping, sports, weather), and the metadata index.]
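
The idea of indexing column names and relevant instances for fast retrieval can be illustrated with a small inverted index. This is a minimal sketch under stated assumptions, not the WDPlus implementation: the table identifiers, the `columns` and `sample_values` fields, and the `build_index` and `search` functions are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


def build_index(tables):
    """Map each lowercased column name and sample value to the tables that contain it."""
    index = defaultdict(set)
    for table_id, meta in tables.items():
        for column in meta["columns"]:
            index[column.lower()].add(table_id)
        for value in meta["sample_values"]:
            index[value.lower()].add(table_id)
    return index


def search(index, keyword):
    """Return the sorted ids of tables whose metadata mentions the keyword."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical metadata records; the tables themselves are never loaded.
tables = {
    "crime_2018": {"columns": ["city", "offense"], "sample_values": ["Los Angeles"]},
    "weather_2018": {"columns": ["city", "temperature"], "sample_values": ["Los Angeles", "Boston"]},
}
index = build_index(tables)
```

Because only metadata is indexed, a lookup such as `search(index, "city")` never touches the underlying tables, which is consistent with tables not being materialized.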

  8. Towards WDPlus
     [Figure: architecture diagram showing the core, satellites (crime, shopping, sports, weather), and the metadata index.]

  9. WDPlus framework: Metadata index and Table Augmentation
     • Search
       • Keywords, variables or content
       • Wikifier may be used in search
     • Download
       • Download a dataset or its metadata
     • Augment
       • Merge your dataset with contents from other datasets automatically
     • Upload
       • Add new datasets (automated metadata profiling and provenance)
     • Enrich
       • Header enrichment for search efficiency
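
The Augment operation merges a user's dataset with contents from another dataset. A minimal sketch of one way to do this is a left join on a shared key column. The `augment` function, the `city` key, and the sample rows are assumptions made for illustration; the slides do not describe how WDPlus chooses join columns.

```python
def augment(rows, other_rows, key):
    """Left-join other_rows onto rows using a shared key column.

    Columns already present in a row are kept; only missing columns are added.
    If other_rows repeats a key, the last occurrence wins.
    """
    lookup = {row[key]: row for row in other_rows}
    augmented = []
    for row in rows:
        merged = dict(row)
        for column, value in lookup.get(row[key], {}).items():
            merged.setdefault(column, value)
        augmented.append(merged)
    return augmented


# Hypothetical user dataset and a second dataset found through search.
mine = [{"city": "Boston", "crimes": 10}, {"city": "Lamar", "crimes": 1}]
other = [{"city": "Boston", "temperature": 12}]
result = augment(mine, other, key="city")
```

Rows without a match in the second dataset are returned unchanged, so the user's original data is never dropped.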

  10. WDPlus framework: T2WML Entity Linking
      • Cell-based mapping. This mapping is saved in WDPlus for future reference
      • Table overview
      • Easy to share!
      • Result sample
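
A cell-based mapping turns table cells into statements about linked entities. The sketch below illustrates the general idea in plain Python; it is not T2WML syntax, which the slides do not show. The `cells_to_statements` function, the `qnodes` lookup, and the identifier `Q_EXAMPLE_1` are hypothetical. `P569` (date of birth) and `P570` (date of death) are existing Wikidata properties.

```python
def cells_to_statements(table, item_column, property_columns, qnodes):
    """Turn table cells into (item, property, value) triples.

    table: list of rows, the first row being the header.
    item_column: header of the column holding the entity label.
    property_columns: maps a column header to a property id.
    qnodes: maps an entity label to an item id (the entity-linking result).
    """
    header, *rows = table
    item_idx = header.index(item_column)
    statements = []
    for row in rows:
        item = qnodes.get(row[item_idx])
        if item is None:
            continue  # skip rows whose entity could not be linked
        for column, prop in property_columns.items():
            value = row[header.index(column)]
            if value:  # skip empty cells
                statements.append((item, prop, value))
    return statements


table = [
    ["name", "born", "died"],
    ["Harry Truman", "1884-05-08", "1972-12-26"],
    ["Unknown Person", "1900-01-01", ""],
]
qnodes = {"Harry Truman": "Q_EXAMPLE_1"}  # placeholder id, not a real Qnode
statements = cells_to_statements(table, "name", {"born": "P569", "died": "P570"}, qnodes)
```

Keeping the mapping as data (the `property_columns` dictionary) rather than code is what makes it possible to save and share it for reuse on similarly shaped tables.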

  11. Creating Wikidata Satellites: Challenges
      • Identify new properties to model satellites
        • Currently done by hand by knowledge engineers
      • Creation of new Qnodes for satellite instances
        • Identified a schema for each satellite
      • Feedback loop to Wikidata
        • How to select a “trusty” statement when several values are available?
      • Namespace issues
        • Single namespace, or namespace per satellite?
      • Inter-satellite linkages
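
The question of how to select a statement when several values are available is left open in the slides. One possible policy, shown here only as an illustration and not as the authors' answer, is to reuse Wikidata's statement ranks (preferred, normal, deprecated) and break ties by the number of references. The `pick_statement` function and the dictionary layout of a candidate are assumptions.

```python
# Lower number means more trusted; deprecated statements are never selected.
RANK_ORDER = {"preferred": 0, "normal": 1}


def pick_statement(candidates):
    """Pick one statement: best rank first, then the most references."""
    usable = [c for c in candidates if c["rank"] in RANK_ORDER]
    if not usable:
        return None
    return min(usable, key=lambda c: (RANK_ORDER[c["rank"]], -c["references"]))


# Hypothetical competing values for a single property of a single item.
candidates = [
    {"value": "A", "rank": "normal", "references": 3},
    {"value": "B", "rank": "preferred", "references": 1},
    {"value": "C", "rank": "deprecated", "references": 9},
]
chosen = pick_statement(candidates)
```

A satellite could swap in a different key function, for example one that weighs the source of each reference, without changing the rest of the feedback loop.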

  12. Conclusions
      • Tabular data exists in heterogeneous formats
        • Difficult to find, use, augment and share
      • WDPlus is a framework to help discover, improve, search, augment, combine and share tabular data
        • WDPlus framework for profiling and enriching datasets
        • T2WML language to generate linked instances from tabular data
      • Encouraging early results on usability

  13. Help us extend WDPlus!
      Do you have comments, suggestions or use cases?
      Contact me at: dgarijo@isi.edu
