Aligning and Integrating Data in Karma Craig Knoblock University - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Aligning and Integrating Data in Karma

Craig Knoblock University of Southern California

slide-2
SLIDE 2

Data Integration Approaches

slide-3
SLIDE 3

Data Integration Approaches


Data Warehousing

slide-4
SLIDE 4

Data Integration Approaches


Data Warehousing Virtual Integration

slide-5
SLIDE 5

Domain Model


slide-6
SLIDE 6

Key Ingredient: Source Mappings

Diagram labels: Domain Model, Source Mappings

slide-7
SLIDE 7

Karma: A Data Integration Tool

slide-8
SLIDE 8

Karma

Interactive tool for rapidly extracting, cleaning, transforming, integrating and publishing data

Diagram labels: Hierarchical Sources, Services, Tabular Sources, Database, RDF, CSV, …

http://www.isi.edu/integration/karma @KarmaSemWeb

slide-9
SLIDE 9

Information Integration in Karma

Diagram labels: Domain Model, Source Mappings, Karma, Samples of Source Data

slide-10
SLIDE 10

Information Integration in Karma

Diagram labels: Domain Model, Karma, Samples of Source Data, Source Mappings

slide-11
SLIDE 11

Secret Sauce: Karma Understands Your Data

Diagram labels: Domain Model, Source Mappings, Karma, Samples of Source Data

Karma semi-automatically builds a semantic model of your data

Semantic Model of the Data
slide-12
SLIDE 12

What is a Semantic Model?

Source

    name            date       city       state   workplace
1   Fred Collins    Oct 1959   Seattle    WA      Microsoft
2   Tina Peterson   May 1980   New York   NY      Google

Domain Model

Classes: Person, Organization, Place, State, City, Event
Properties: name, birthdate, bornIn, worksFor, state, phone, livesIn, ceo, location, organizer, nearby, startDate, title, isPartOf, postalCode
Legend: object property, data property, subClassOf

Describe sources using classes & relationships in an ontology

slide-13
SLIDE 13

Semantic Types

Semantic types: Person (name, birthdate), Organization (name), City (name), State (name)

    name            date       city       state   workplace
1   Fred Collins    Oct 1959   Seattle    WA      Microsoft
2   Tina Peterson   May 1980   New York   NY      Google

slide-14
SLIDE 14

Relationships

Classes and data properties: Person (name, birthdate), Organization (name), City (name), State (name)
Relationships: bornIn, worksFor, state

    name            date       city       state   workplace
1   Fred Collins    Oct 1959   Seattle    WA      Microsoft
2   Tina Peterson   May 1980   New York   NY      Google

slide-15
SLIDE 15

Semantic Model

Classes and data properties: Person (name, birthdate), Organization (name), City (name), State (name)
Relationships: bornIn, worksFor, state

    name            date       city       state   workplace
1   Fred Collins    Oct 1959   Seattle    WA      Microsoft
2   Tina Peterson   May 1980   New York   NY      Google

Key ingredient to automate source discovery, data integration, and publishing semantic data (RDF triples)

Semantic models will be formalized as Source Mappings
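As a rough illustration of how a semantic model turns a table row into RDF-style triples, here is a minimal Python sketch. It is not Karma's implementation or mapping format: the `ex:` prefix, the per-row node names, and the assignment of columns and relationships to classes are assumptions read off the example slides.

```python
# Assumed semantic types for the example table: column -> (class, data property).
SEMANTIC_TYPES = {
    "name": ("Person", "name"),
    "date": ("Person", "birthdate"),
    "city": ("City", "name"),
    "state": ("State", "name"),
    "workplace": ("Organization", "name"),
}

# Assumed relationships (object properties) between the classes.
RELATIONSHIPS = [
    ("Person", "bornIn", "City"),
    ("Person", "worksFor", "Organization"),
    ("City", "state", "State"),
]


def row_to_triples(row_id, row):
    """Apply the semantic model to one row and return (subject, predicate, object) tuples."""
    def node(cls):
        # Hypothetical naming scheme: one node per class per row.
        return f"ex:{cls}{row_id}"

    triples = []
    # Type statements, one per class used by the model.
    for cls in sorted({cls for cls, _ in SEMANTIC_TYPES.values()}):
        triples.append((node(cls), "rdf:type", f"ex:{cls}"))
    # Data properties: attach each cell value to the node of its class.
    for column, (cls, prop) in SEMANTIC_TYPES.items():
        triples.append((node(cls), f"ex:{prop}", row[column]))
    # Object properties: link the class nodes of the same row.
    for subject_cls, prop, object_cls in RELATIONSHIPS:
        triples.append((node(subject_cls), f"ex:{prop}", node(object_cls)))
    return triples


row = {"name": "Fred Collins", "date": "Oct 1959", "city": "Seattle",
       "state": "WA", "workplace": "Microsoft"}
for triple in row_to_triples(1, row):
    print(triple)
```

The same model applied to every row of the source yields one small graph per row, which is the sense in which a semantic model doubles as a source mapping.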

slide-16
SLIDE 16

So what?

slide-17
SLIDE 17

Knowledge Graphs

Karma uses semantic models to create knowledge graphs

slide-18
SLIDE 18

Knowledge Graphs

Karma uses semantic models to create knowledge graphs Karma semi-automatically builds semantic models

slide-19
SLIDE 19

Knowledge Graphs

Karma uses semantic models to create knowledge graphs Karma semi-automatically builds semantic models … and provides a nice GUI to edit them

slide-20
SLIDE 20

Semi-automatically Building Semantic Models in Karma

slide-21
SLIDE 21

Approach

[Knoblock et al., ESWC 2012]

Pipeline diagram: Domain Ontology, Sample Data, Learn Semantic Types, Construct a Graph, Extract Relationships, Steiner Tree

slide-22
SLIDE 22

Example

Source

    name            date       city       state   workplace
1   Fred Collins    Oct 1959   Seattle    WA      Microsoft
2   Tina Peterson   May 1980   New York   NY      Google

Domain Ontology

Legend: object property, data property, subClassOf

Find a semantic model for the source (map the source to the ontology)

slide-23
SLIDE 23

Learning Semantic Types

[Krishnamurthy et al., ESWC 2015]

class? property?

slide-24
SLIDE 24

Learning Semantic Types

CulturalHeritageObject extent

1. User specifies
2. System learns

slide-25
SLIDE 25

CulturalHeritageObject

Learning Semantic Types


extent

slide-26
SLIDE 26

CulturalHeritageObject CulturalHeritageObject

Learning Semantic Types


extent extent

slide-27
SLIDE 27

Requirements

  • Learn from a small number of examples
  • Work on both textual and numeric values
  • Learn quickly and scale to a large number of semantic types

slide-28
SLIDE 28

Approach for Textual Data

  • Document: each column of data
  • Label: each semantic type
  • Use Apache Lucene to index the labeled documents
  • Compute TF/IDF vectors for documents
  • Compare documents using Cosine Similarity between TF/IDF vectors
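The textual approach above can be sketched in a few lines of plain Python. This is an illustration only: it replaces the Apache Lucene index with in-memory dictionaries, uses a smoothed IDF formula chosen for the sketch, and the function names and labels such as `City.name` are hypothetical.

```python
import math
from collections import Counter


def tokenize(values):
    """Lowercase and whitespace-split every cell of a column."""
    return [tok for value in values for tok in value.lower().split()]


def vectorize(tokens, doc_freq, n_docs):
    """TF-IDF weights for one document (smoothed IDF is an assumption of this sketch)."""
    counts = Counter(tokens)
    return {
        tok: (count / len(tokens))
        * (math.log((1 + n_docs) / (1 + doc_freq.get(tok, 0))) + 1)
        for tok, count in counts.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_semantic_types(labeled_columns, new_column):
    """Rank known semantic types by similarity to an unlabeled column."""
    # One document per labeled column; the label is the semantic type.
    docs = {label: tokenize(values) for label, values in labeled_columns.items()}
    doc_freq = Counter(tok for tokens in docs.values() for tok in set(tokens))
    query = vectorize(tokenize(new_column), doc_freq, len(docs))
    scores = {
        label: cosine(query, vectorize(tokens, doc_freq, len(docs)))
        for label, tokens in docs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


labeled = {
    "Person.name": ["Fred Collins", "Tina Peterson"],
    "City.name": ["Seattle", "New York"],
    "Organization.name": ["Microsoft", "Google"],
}
print(rank_semantic_types(labeled, ["New York", "Seattle"]))
```

A search index becomes worthwhile once there are many labeled columns; the ranking logic stays the same.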

slide-29
SLIDE 29

Approach for Textual Data


slide-30
SLIDE 30

Approach for Numeric Data

  • Distribution of values in different semantic types is different, e.g., temperature vs. population
  • Use Statistical Hypothesis Testing to see which distribution fits best
  • Welch’s T-test, Mann-Whitney U-test and Kolmogorov-Smirnov Test
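To make the numeric idea concrete, the sketch below compares a new column against labeled columns using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions) and picks the closest label. It is a simplification: it uses the raw statistic as a distance rather than a p-value, and the function names and sample values are made up for illustration. In practice SciPy's `scipy.stats.ks_2samp`, `scipy.stats.mannwhitneyu`, and `scipy.stats.ttest_ind(..., equal_var=False)` provide the three tests named on the slide.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max distance between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def closest_semantic_type(labeled_columns, new_values):
    """Return the label whose value distribution is nearest to the new column."""
    return min(
        labeled_columns,
        key=lambda label: ks_statistic(labeled_columns[label], new_values),
    )


# Hypothetical training columns for two numeric semantic types.
labeled = {
    "temperature": [12.5, 18.0, 21.3, 25.1, 30.2],
    "population": [52000, 120000, 640000, 2100000, 8400000],
}
print(closest_semantic_type(labeled, [15.0, 22.4, 27.9]))
```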

slide-31
SLIDE 31

Approach for Numeric Data


slide-32
SLIDE 32

Similarity features

Similarity Features          Measure
Attribute name similarity    Jaccard
Value similarity             TF-IDF, Jaccard
Distribution similarity      Mann-Whitney test, Kolmogorov-Smirnov test
Histogram similarity         Mann-Whitney test
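As an example of the simplest feature in this list, the following sketch computes Jaccard similarity between two attribute names after splitting them into lowercase tokens. The tokenization rule and the example names are assumptions for illustration, not the feature extraction used in the cited work.

```python
import re


def name_tokens(attribute_name):
    """Split an attribute name into lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", attribute_name.lower()))


def jaccard(set_a, set_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# "birth_date" -> {birth, date}; "date of birth" -> {date, of, birth}
print(jaccard(name_tokens("birth_date"), name_tokens("date of birth")))
```

The same `jaccard` function applies unchanged to sets of cell values, which is how the value-similarity row could reuse it.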

slide-33
SLIDE 33

Training machine learning model

[Pham et al., ISWC 2016]

slide-34
SLIDE 34

Predicting new attribute

slide-35
SLIDE 35

Construct a Graph

Construct a graph from semantic types and ontology

Classes and data properties: Person (name, birthdate), Organization (name), City (name), State (name)

    name            date       city       state   workplace
1   Fred Collins    Oct 1959   Seattle    WA      Microsoft
2   Tina Peterson   May 1980   New York   NY      Google

slide-36
SLIDE 36

Construct a Graph

Construct a graph from semantic types and ontology

date

slide-37
SLIDE 37

Inferring the Relationships

  • Search for minimal explanation
  • Steiner tree connecting semantic types over ontology graph
  • Given graph G = (V, E), nodes S ⊆ V, cost c: E → ℝ⁺
  • Find a tree of G that spans S with minimal total cost
  • Unfortunately, NP-complete
  • Approximation Algorithm [Kou et al., 1981]
  • Worst-case time complexity: O(|S||V|²)
  • Approximation Ratio: less than 2
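The approximation cited above follows a well-known recipe: build the shortest-path closure over the Steiner nodes S, take its minimum spanning tree, expand each edge back into a shortest path in G, take a spanning tree of the result, and prune non-terminal leaves. The sketch below is a plain-Python version under simplifying assumptions: the graph is undirected, connected, and stored as a dict of dicts, and the tiny ontology graph with unit weights is made up for illustration. It is not the customized algorithm used in Karma.

```python
import heapq
from collections import Counter
from itertools import combinations


def dijkstra(graph, source):
    """Shortest distances and predecessor links from source."""
    dist, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, prev


def kruskal(nodes, edges):
    """Minimum spanning tree (or forest) over (weight, u, v) edges."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges, key=lambda e: e[0]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((w, u, v))
    return tree


def steiner_tree(graph, terminals):
    """Approximate Steiner tree in the style of Kou et al.; returns (weight, u, v) edges."""
    terminals = list(terminals)
    paths = {t: dijkstra(graph, t) for t in terminals}
    # Steps 1-2: minimum spanning tree of the shortest-path closure over the terminals.
    closure = [(paths[a][0][b], a, b) for a, b in combinations(terminals, 2)]
    closure_tree = kruskal(terminals, closure)
    # Step 3: expand each closure edge into the underlying shortest path.
    sub_edges = set()
    for _, a, b in closure_tree:
        prev, node = paths[a][1], b
        while node != a:
            parent = prev[node]
            sub_edges.add((graph[parent][node], *sorted((parent, node))))
            node = parent
    # Step 4: spanning tree of the expanded subgraph.
    sub_nodes = {n for _, u, v in sub_edges for n in (u, v)}
    tree = kruskal(sub_nodes, sub_edges)
    # Step 5: repeatedly drop leaves that are not terminals.
    while True:
        degree = Counter(n for _, u, v in tree for n in (u, v))
        leaves = {n for n, d in degree.items() if d == 1 and n not in terminals}
        if not leaves:
            return tree
        tree = [e for e in tree if e[1] not in leaves and e[2] not in leaves]


# Hypothetical ontology graph with unit link weights.
ontology = {
    "Person": {"City": 1, "Organization": 1, "Event": 1},
    "Organization": {"Person": 1, "Place": 1},
    "City": {"Person": 1, "State": 1, "Place": 1},
    "State": {"City": 1},
    "Place": {"Organization": 1, "City": 1},
    "Event": {"Person": 1},
}
print(steiner_tree(ontology, ["Person", "Organization", "City", "State"]))
```

On this toy graph the result connects the four semantic-type classes with three links and leaves Place and Event out, which is the "minimal explanation" the slide refers to.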

slide-38
SLIDE 38

Inferring the Relationships

Select minimal tree that connects all semantic types

  • A customized Steiner tree algorithm


date

slide-39
SLIDE 39

Result in Karma


slide-40
SLIDE 40

Refining the Model

Impose constraints on Steiner Tree Algorithm

– Change weight of selected links to ε
– Add source and target of selected link to Steiner nodes
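The two constraints above amount to a small preprocessing step before the tree search is rerun. A minimal sketch, assuming the graph is an undirected dict of dicts and the Steiner nodes are a set; the function name, the value of `EPSILON`, and the example link are hypothetical.

```python
EPSILON = 1e-6  # assumed stand-in for the slide's ε: small enough to always be preferred


def apply_user_link(graph, steiner_nodes, source, target, epsilon=EPSILON):
    """Return copies of the graph and Steiner node set with one user-selected link forced."""
    # Copy so the original graph can still be used for other refinements.
    new_graph = {node: dict(neighbors) for node, neighbors in graph.items()}
    # Constraint 1: the selected link becomes almost free.
    new_graph.setdefault(source, {})[target] = epsilon
    new_graph.setdefault(target, {})[source] = epsilon
    # Constraint 2: both endpoints must appear in the resulting tree.
    return new_graph, set(steiner_nodes) | {source, target}


graph = {
    "Person": {"City": 1, "Organization": 1},
    "Organization": {"Person": 1, "City": 1},
    "City": {"Person": 1, "Organization": 1},
}
refined_graph, refined_nodes = apply_user_link(
    graph, {"Person", "City"}, "Organization", "City"
)
print(refined_graph["Organization"]["City"], sorted(refined_nodes))
```

Rerunning the tree search on the refined graph and node set then has to include the user's link, since both endpoints are required and the link between them costs next to nothing.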

date

slide-41
SLIDE 41

Final Semantic Model


slide-42
SLIDE 42

Karma Learns the Source Models

Taheriyan et al., ISWC 2013, ICSC 2014

Pipeline diagram: Domain Ontology, Sample Data, Known Semantic Models, Learn Semantic Types, Construct a Graph, Generate Candidate Models, Rank Results

slide-43
SLIDE 43

Karma Use Cases

Pedro Szekely and Craig Knoblock University of Southern California

slide-44
SLIDE 44

Source Mapping Phase

Domain Model Source Mappings

Karma

Domain Expert

Mapping Phase


Samples of Source Data

slide-45
SLIDE 45

Source Mapping and Query Time

Domain Model Source Mappings

Karma

Samples of Source Data Domain Expert

Mapping Phase Karma Runtime System Query Phase

Analyst

Query

Virtual Integration Data Warehousing


slide-46
SLIDE 46

VIVO

  • VIVO is a system to build researcher networks across institutions
  • Used Karma to map the data about USC faculty to the VIVO ontology and publish it as RDF
  • VIVO ingests the RDF data
  • Video

slide-47
SLIDE 47

Smithsonian American Art Museum

  • Used Karma to convert data of 44,000 museum objects to Linked Open Data
  • Modeled according to the Europeana Data Model (EDM)
  • Linked the generated RDF to DBpedia, ULAN, and NY Times Linked Data
  • News: USC press, Viterbi
  • Video

slide-48
SLIDE 48

DIG

  • DIG: Domain-specific Insight Graphs
  • Building and using knowledge graphs to combat human trafficking
  • Used Karma to map extracted data and structured sources to a shared domain ontology
  • News: Forbes, Wired.co.uk

slide-49
SLIDE 49

Demo

slide-50
SLIDE 50

Using Karma to map museum data to the CIDOC CRM ontology

https://www.youtube.com/watch?v=h3_yiBhAJIc

slide-51
SLIDE 51

Discussion

  • Automatically build rich semantic descriptions of data sources
  • Exploit the background knowledge from (i) the domain ontology, and (ii) the known source models
  • Semantic descriptions are the key ingredients to automate many tasks, e.g.,
      – Source Discovery
      – Data Integration
      – Service Composition

Mohsen Taheriyan University of Southern California

slide-52
SLIDE 52

More Info

karma.isi.edu