
Aligning and Integrating Data in Karma (Craig Knoblock)



  1. Aligning and Integrating Data in Karma. Craig Knoblock, University of Southern California

  2. Data Integration Approaches

  3. Data Integration Approaches: Data Warehousing

  4. Data Integration Approaches: Data Warehousing, Virtual Integration

  5. Domain Model

  6. Key Ingredient: Source Mappings. Diagram labels: Domain Model, Source Mappings

  7. Karma: A Data Integration Tool

  8. Karma: Interactive tool for rapidly extracting, cleaning, transforming, integrating, and publishing data. Diagram labels: Tabular Sources, Hierarchical Sources, Services, Database, CSV, RDF, …, Karma. @KarmaSemWeb, http://www.isi.edu/integration/karma

  9. Information Integration in Karma. Diagram labels: Domain Model, Samples of Source Data, Karma, Source Mappings

  10. Information Integration in Karma. Diagram labels: Domain Model, Samples of Source Data, Karma, Source Mappings

  11. Secret Sauce: Karma Understands Your Data. Karma semi-automatically builds a semantic model of your data. Diagram labels: Domain Model, Samples of Source Data, Karma, Semantic Model of the Data, Source Mappings

  12. What is a Semantic Model? Describe sources using classes & relationships in an ontology.
      Source:
             name           date      city      state  workplace
          1  Fred Collins   Oct 1959  Seattle   WA     Microsoft
          2  Tina Peterson  May 1980  New York  NY     Google
      Domain Model classes: Person, Place, City, State, Organization, Event
      Domain Model properties: bornIn, nearby, birthdate, isPartOf, livesIn, name, organizer, postalCode, location, ceo, worksFor, state, title, phone, startDate
      Legend: object property, data property, subClassOf

  13. Semantic Types
      Class:     Person         Person     City      State  Organization
      Property:  name           birthdate  name      name   name
      Column:    name           date       city      state  workplace
      1          Fred Collins   Oct 1959   Seattle   WA     Microsoft
      2          Tina Peterson  May 1980   New York  NY     Google

  14. Relationships: worksFor, bornIn, and state link the classes Person, City, State, and Organization, drawn above the same semantic types and source table as slide 13

  15. Semantic Model: the relationships worksFor, bornIn, and state together with the semantic types of slide 13 (Person name, Person birthdate, City name, State name, Organization name) over the same source table.
      • Semantic models will be formalized as Source Mappings
      • Key ingredient to automate source discovery, data integration, and publishing semantic data (RDF triples)
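
To make the idea of "publishing semantic data (RDF triples)" concrete, here is a minimal Python sketch of how a semantic model for the example source could turn one row into triples. It is not Karma's mapping format or API: the dictionaries `SEMANTIC_TYPES` and `RELATIONSHIPS`, the function `row_to_triples`, the blank-node naming, and the direction of each relationship (Person worksFor Organization, Person bornIn City, City state State) are all assumptions made for illustration.

```python
# Hypothetical encoding of the slide's semantic model (not Karma's own format).
SEMANTIC_TYPES = {  # source column -> (ontology class, data property)
    "name": ("Person", "name"),
    "date": ("Person", "birthdate"),
    "city": ("City", "name"),
    "state": ("State", "name"),
    "workplace": ("Organization", "name"),
}
RELATIONSHIPS = [  # (subject class, object property, object class); directions assumed
    ("Person", "worksFor", "Organization"),
    ("Person", "bornIn", "City"),
    ("City", "state", "State"),
]


def row_to_triples(row_id, row):
    """Return (subject, predicate, object) triples for one source row."""

    def node(cls):
        # One blank node per ontology class per row, e.g. _:person1
        return f"_:{cls.lower()}{row_id}"

    triples = []
    for cls in sorted({cls for cls, _ in SEMANTIC_TYPES.values()}):
        triples.append((node(cls), "rdf:type", cls))
    for column, (cls, prop) in SEMANTIC_TYPES.items():
        triples.append((node(cls), prop, f'"{row[column]}"'))
    for subj, prop, obj in RELATIONSHIPS:
        triples.append((node(subj), prop, node(obj)))
    return triples


row = {"name": "Fred Collins", "date": "Oct 1959", "city": "Seattle",
       "state": "WA", "workplace": "Microsoft"}
triples = row_to_triples(1, row)
for s, p, o in triples:
    print(s, p, o, ".")
```

Each row yields type triples for the four classes, one literal triple per column, and one triple per relationship, which is why the semantic types and the relationships are both needed before any RDF can be generated.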

  16. So what?

  17. Knowledge Graphs: Karma uses semantic models to create knowledge graphs

  18. Knowledge Graphs: Karma uses semantic models to create knowledge graphs. Karma semi-automatically builds semantic models

  19. Knowledge Graphs: Karma uses semantic models to create knowledge graphs. Karma semi-automatically builds semantic models … and provides a nice GUI to edit them

  20. Semi-automatically Building Semantic Models in Karma

  21. Approach [Knoblock et al., ESWC 2012]. Diagram labels: Sample Data, Domain Ontology, Learn Semantic Types, Construct a Graph, Steiner Tree, Extract Relationships

  22. Example: Find a semantic model for the source (map the source to the ontology).
      Source:
             name           date      city      state  workplace
          1  Fred Collins   Oct 1959  Seattle   WA     Microsoft
          2  Tina Peterson  May 1980  New York  NY     Google
      Domain Ontology legend: object property, data property, subClassOf

  23. Learning Semantic Types [Krishnamurthy et al., ESWC 2015]: class? property?

  24. Learning Semantic Types: CulturalHeritageObject, extent. (1) User specifies, (2) System learns

  25. Learning Semantic Types: CulturalHeritageObject, extent

  26. Learning Semantic Types: CulturalHeritageObject, extent (shown twice)

  27. Requirements
      • Learn from a small number of examples
      • Work on both textual and numeric values
      • Learn quickly and scale to a large number of semantic types

  28. Approach for Textual Data
      • Document: each column of data
      • Label: each semantic type
      • Use Apache Lucene to index the labeled documents
      • Compute TF/IDF vectors for documents
      • Compare documents using cosine similarity between TF/IDF vectors
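
The textual approach above can be sketched in a few lines of plain Python. This is only an illustration: the slides use Apache Lucene for indexing, whereas this sketch swaps in a hand-written TF-IDF with a smoothed IDF, and the function names (`tokens`, `tfidf`, `cosine`, `rank_semantic_types`) and the sample columns are hypothetical.

```python
import math
from collections import Counter


def tokens(values):
    """Treat a whole column as one document of lowercase whitespace tokens."""
    return [t for v in values for t in v.lower().split()]


def tfidf(doc_tokens, doc_freq, n_docs):
    """TF-IDF vector as a dict; the IDF is smoothed so unseen tokens stay finite."""
    counts = Counter(doc_tokens)
    return {
        t: (c / len(doc_tokens))
        * (math.log((1 + n_docs) / (1 + doc_freq.get(t, 0))) + 1.0)
        for t, c in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def rank_semantic_types(labeled_columns, new_column):
    """Rank semantic-type labels by similarity of their column to new_column."""
    docs = {label: tokens(vals) for label, vals in labeled_columns.items()}
    doc_freq = Counter(t for toks in docs.values() for t in set(toks))
    n = len(docs)
    query = tfidf(tokens(new_column), doc_freq, n)
    scores = {
        label: cosine(query, tfidf(toks, doc_freq, n)) for label, toks in docs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


labeled = {  # label (semantic type) -> column values already labeled by a user
    "City.name": ["Seattle", "New York", "Los Angeles"],
    "Organization.name": ["Microsoft", "Google", "Amazon"],
}
ranking = rank_semantic_types(labeled, ["New York", "Boston", "Seattle"])
print(ranking[0][0])  # best-matching semantic type for the new column
```

Treating each column as one document is what lets a handful of labeled columns act as training data: a new column is simply a query against the labeled documents.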

  29. Approach for Textual Data

  30. Approach for Numeric Data
      • Distribution of values in different semantic types is different, e.g., temperature vs. population
      • Use statistical hypothesis testing to see which distribution fits best
      • Welch's t-test, Mann-Whitney U test, and Kolmogorov-Smirnov test
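
As a rough illustration of the numeric approach, the sketch below compares a new column against labeled columns with the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions) and picks the closest label. It is a simplification under stated assumptions: it uses only the KS statistic as a distance, computes no p-values, omits Welch's t-test and the Mann-Whitney U test, and the names `ks_statistic`, `best_numeric_type`, and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over all sample points."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def best_numeric_type(labeled_columns, new_column):
    """Return the label whose values are distributed most like new_column."""
    return min(
        labeled_columns,
        key=lambda label: ks_statistic(labeled_columns[label], new_column),
    )


labeled = {  # illustrative values only
    "temperature": [12.5, 18.0, 21.3, 25.1, 30.2],
    "population": [52000, 120000, 640000, 2100000, 8400000],
}
new_column = [15.0, 19.5, 22.0, 28.4]
best_label = best_numeric_type(labeled, new_column)
print(best_label)
```

A statistic near 0 means the two columns look like draws from the same distribution, while a value of 1 means their ranges do not overlap at all, as with temperatures versus populations.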

  31. Approach for Numeric Data

  32. Similarity Features. Feature groups: attribute names similarity, value similarity, distribution similarity, histogram similarity. Measures shown: Jaccard, TF-IDF, Mann-Whitney test, Kolmogorov-Smirnov test

  33. Training machine learning model [Pham et al., ISWC 2016]

  34. Predicting new attribute

  35. Construct a Graph: Construct a graph from semantic types and ontology
      Class:     Person         Person     City      State  Organization
      Property:  name           birthdate  name      name   name
      Column:    name           date       city      state  workplace
      1          Fred Collins   Oct 1959   Seattle   WA     Microsoft
      2          Tina Peterson  May 1980   New York  NY     Google

  36. Construct a Graph: Construct a graph from semantic types and ontology

  37. Inferring the Relationships
      • Search for minimal explanation
      • Steiner tree connecting semantic types over ontology graph
      • Given graph $G = (V, E)$, nodes $S \subseteq V$, cost $c: E \to \mathbb{R}$
      • Find a tree of $G$ that spans $S$ with minimal total cost
      • Unfortunately, NP-complete
      • Approximation algorithm [Kou et al., 1981]
      • Worst-case time complexity: $O(|V|^2 |S|)$
      • Approximation ratio: less than 2
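
The following is a minimal Python sketch of a Kou-style Steiner tree approximation: take the minimum spanning tree of the shortest-path closure over the Steiner nodes, expand it back into graph edges, take a spanning tree again, and prune non-terminal leaves. It is a textbook-style illustration, not Karma's customized algorithm. It assumes a connected, undirected graph with positive weights; the toy ontology graph, its unit weights, and the pairing of property names with edges are hypothetical.

```python
import heapq
from itertools import combinations


def build_graph(edges):
    """Undirected weighted graph as a dict of dicts from (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = float(w)
        adj.setdefault(v, {})[u] = float(w)
    return adj


def dijkstra(adj, source):
    """Shortest-path distances and predecessors from source."""
    dist, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, prev


def mst(nodes, edges):
    """Kruskal's algorithm over (weight, u, v) edges with a simple union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


def steiner_tree(adj, terminals):
    """Approximate Steiner tree as a list of (weight, u, v) edges."""
    terminals = list(terminals)
    sp = {t: dijkstra(adj, t) for t in terminals}
    # Steps 1-2: spanning tree of the complete shortest-path graph on terminals.
    closure = [(sp[a][0][b], a, b) for a, b in combinations(terminals, 2)]
    # Step 3: replace each closure edge by the graph edges on its shortest path.
    sub_edges = set()
    for _, a, b in mst(terminals, closure):
        prev, node = sp[a][1], b
        while node != a:
            p = prev[node]
            sub_edges.add((adj[p][node], *sorted((p, node))))
            node = p
    # Step 4: spanning tree of the expanded subgraph.
    tree = mst({n for _, u, v in sub_edges for n in (u, v)}, sub_edges)
    # Step 5: repeatedly drop leaves that are not terminals.
    keep = set(terminals)
    while True:
        degree = {}
        for _, u, v in tree:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        leaves = {n for n, d in degree.items() if d == 1 and n not in keep}
        if not leaves:
            return tree
        tree = [e for e in tree if e[1] not in leaves and e[2] not in leaves]


ontology = build_graph([  # hypothetical edges and weights
    ("Person", "Organization", 1),  # worksFor
    ("Person", "City", 1),          # bornIn
    ("City", "State", 1),           # state
    ("Person", "Place", 1),         # livesIn
    ("City", "Place", 1),           # subClassOf
    ("State", "Place", 1),          # subClassOf
    ("Event", "Person", 1),         # organizer
    ("Event", "Place", 1),          # location
])
tree = steiner_tree(ontology, ["Person", "City", "State", "Organization"])
total_cost = sum(w for w, _, _ in tree)
print(tree, total_cost)
```

On this toy graph the tree keeps only the Person-Organization, Person-City, and City-State links, leaving out Place and Event because no Steiner node needs them.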

  38. Inferring the Relationships: Select minimal tree that connects all semantic types
      • A customized Steiner tree algorithm

  39. Result in Karma

  40. Refining the Model: Impose constraints on Steiner Tree Algorithm
      – Change weight of selected links to ε
      – Add source and target of selected link to Steiner nodes
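
The two constraints above amount to a small preprocessing step before the Steiner tree is recomputed. The sketch below shows one way to express it, assuming the graph is stored as a dict of dicts of weights; the function name `apply_user_link`, the value chosen for `EPSILON`, and the sample graph are hypothetical and not taken from Karma.

```python
import copy

EPSILON = 1e-6  # assumed small positive weight standing in for ε


def apply_user_link(adj, steiner_nodes, link, epsilon=EPSILON):
    """Return (new graph, new Steiner nodes) after the user selects a link.

    The selected link gets weight epsilon so any minimal tree prefers it, and
    both of its endpoints become Steiner nodes so the tree must reach them.
    """
    source, target = link
    if target not in adj.get(source, {}):
        raise KeyError(f"no link between {source!r} and {target!r}")
    new_adj = copy.deepcopy(adj)  # leave the original graph untouched
    new_adj[source][target] = epsilon
    new_adj[target][source] = epsilon
    return new_adj, set(steiner_nodes) | {source, target}


graph = {
    "Person": {"City": 1.0, "Place": 1.0},
    "City": {"Person": 1.0},
    "Place": {"Person": 1.0},
}
refined, nodes = apply_user_link(graph, {"Person", "City"}, ("Person", "Place"))
print(refined["Person"]["Place"], sorted(nodes))
```

Rerunning the tree search on the refined graph and node set then yields a model that keeps the user's choice while still minimizing the cost of everything else.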

  41. Final Semantic Model

  42. Karma Learns the Source Models [Taheriyan et al., ISWC 2013, ICSC 2014]. Diagram labels: Sample Data, Domain Ontology, Known Semantic Models, Learn Semantic Types, Construct a Graph, Generate Candidate Models, Rank Results

  43. Karma Use Cases. Pedro Szekely and Craig Knoblock, University of Southern California

  44. Source Mapping Phase. Diagram labels: Domain Expert, Domain Model, Samples of Source Data, Karma, Source Mappings, Mapping Phase

  45. Source Mapping and Query Time. Diagram labels: Domain Expert, Domain Model, Samples of Source Data, Karma, Source Mappings, Mapping Phase, Query Phase, Karma Query Runtime System, Analyst, Data Warehousing, Virtual Integration

  46. VIVO
      • VIVO is a system to build researcher networks across institutions
      • Used Karma to map the data about USC faculty to the VIVO ontology and publish it as RDF
      • VIVO ingests the RDF data
      • Video

  47. Smithsonian American Art Museum
      • Used Karma to convert data about 44,000 museum objects to Linked Open Data
      • Modeled according to the Europeana Data Model (EDM)
      • Linked the generated RDF to DBpedia, ULAN, and NY Times Linked Data
      • News: USC press, Viterbi
      • Video

  48. DIG
      • DIG: Domain-specific Insight Graphs
      • Building and using knowledge graphs to combat human trafficking
      • Used Karma to map extracted data and structured sources to a shared domain ontology
      • News: Forbes, Wired.co.uk

  49. Demo

  50. Using Karma to map museum data to the CIDOC CRM ontology: https://www.youtube.com/watch?v=h3_yiBhAJIc

  51. Discussion
      • Automatically build rich semantic descriptions of data sources
      • Exploit the background knowledge from (i) the domain ontology and (ii) the known source models
      • Semantic descriptions are the key ingredients to automate many tasks, e.g.:
          – Source Discovery
          – Data Integration
          – Service Composition
      Mohsen Taheriyan, University of Southern California

  52. More Info: karma.isi.edu
