
slide-1
SLIDE 1

Populating a Linked Data Entity Name System

Mayank Kejriwal


slide-2
SLIDE 2
Linked Data

  • A set of four best practices for publishing and connecting structured data on the Web

Bizer et al. (2009, 2014)

slide-3
SLIDE 3

Instance Matching

  • Connecting pairs of instances that refer to the same underlying entity
  • Also known as ‘entity resolution’, ‘entity matching’, ‘co-reference resolution’, ‘merge-purge’...

Jaffri et al. (2008); Papadakis et al. (2010); Nikolov et al. (2011)

slide-4
SLIDE 4

Entity Name System: a thesaurus for entities

  • Populating an ENS requires solutions to instance matching
  • Many applications

[Figure: ENS entries as identifier groups. Paul Gardner Allen: freebase:Paul_G._Allen, dbpedia:Allen_,Paul, ... Microsoft: dbpedia:Microsoft Corp., freebase:Microsoft, ...]

Bouquet et al. (2008)
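One way to picture the "thesaurus" view is to group identifiers that instance matching has linked together. The following is a minimal sketch, not the system described in the talk: it assumes the matcher outputs pairs of identifiers, and `build_ens` is a hypothetical helper that merges them with a union-find structure. The identifiers come from the slide; the pairs themselves are illustrative.

```python
from collections import defaultdict


def build_ens(same_as_pairs):
    """Group identifiers connected by matched pairs into ENS entries."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for identifier in list(parent):
        clusters[find(identifier)].add(identifier)
    return sorted(sorted(c) for c in clusters.values())


# Identifiers taken from the slide; the pairing is an assumption for illustration.
pairs = [
    ("freebase:Paul_G._Allen", "dbpedia:Allen_,Paul"),
    ("dbpedia:Microsoft Corp.", "freebase:Microsoft"),
]
ens = build_ens(pairs)
```

Each inner list of `ens` is one entry of the thesaurus: all known names for one underlying entity.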

slide-5
SLIDE 5

Data Integration: Example from e-commerce

[Figure: e-commerce data integration. Labels: Seller 1, Seller 2, ..., Seller n; Entity Name System; mediated schema/target ontology; Product X; aggregated results]

Doan et al. (2012)

slide-6
SLIDE 6

Emerald: Data Integration for RDF and Linked Data


slide-7
SLIDE 7

Resource Description Framework (RDF)

  • An RDF dataset is a set of triples, visualized as a directed labeled graph
  • A triple is a 3-element tuple (subject, property, object) and represents an edge in the graph
  • Subjects and properties are necessarily URIs
  • Objects may be URIs or literals

http://www.w3.org/RDF Bizer et al. (2009)
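The triple model above can be sketched with plain tuples. This is only an illustration under stated assumptions: the `http://example.org/` URIs are placeholders, `is_uri` is a crude stand-in for real URI validation, and a production system would use an RDF library rather than this hand-rolled structure.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "property", "object"])

# Hypothetical example data; the URIs are placeholders, not real resources.
EX = "http://example.org/"
dataset = {
    Triple(EX + "Paul_Allen", EX + "founded", EX + "Microsoft"),   # URI object
    Triple(EX + "Paul_Allen", EX + "name", "Paul Gardner Allen"),  # literal object
}


def is_uri(term):
    """Crude check used only for this sketch."""
    return term.startswith("http://") or term.startswith("https://")


def outgoing_edges(dataset, subject):
    """Edges of the directed labeled graph that leave `subject`."""
    return sorted((t.property, t.object) for t in dataset if t.subject == subject)


edges = outgoing_edges(dataset, EX + "Paul_Allen")
```

Each triple is one labeled edge: the property is the edge label, and the object is either another node (a URI) or a literal value.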


slide-8
SLIDE 8

From a Web of Linked ‘Documents’...


slide-9
SLIDE 9

...to a Web of Linked ‘Data’

  • ‘Linked Open Data’ started in 2007 with just 12 RDF datasets
  • At last survey (2014), contains:
      • Millions of resources
      • 1000 datasets
      • 900,000 documents
      • 500 million inter-dataset links
  • Many domains!
  • Applications include schema.org, Google Knowledge Graph, Constitute...

[Figure: domain labels include Media, Social Networking, Cross-domain, Publications]

Cyganiak and Jentzsch (2014) Linkeddata.org

slide-10
SLIDE 10

Research question

What requirements need to be fulfilled in order to populate a Linked Data Entity Name System?


slide-11
SLIDE 11

Returning to our example...


slide-12
SLIDE 12

Linked Open Data

  • ‘Linked Open Data’ started in 2007 with just a handful of datasets
  • At last survey (2014), contains:
      • Millions of resources
      • 1000 datasets
      • 900,000 documents
      • 500 million inter-dataset links
  • Many domains!

[Figure: domain labels include Media, Social Networking, Cross-domain, Publications]

Cyganiak and Jentzsch (2014) Linkeddata.org

slide-13
SLIDE 13

Thesis statement

Populating a Linked Data Entity Name System requires simultaneously fulfilling the four DASH requirements of domain-independence, automation, scalability and heterogeneity


Kejriwal and Miranker (2014)

slide-14
SLIDE 14

Step 1: Type alignment


Kejriwal and Miranker (2014) Euzenat and Shvaiko (2007)

slide-15
SLIDE 15

Step 2: Property alignment


Euzenat and Shvaiko (2007)

slide-16
SLIDE 16

Step 3: Similarity prediction?


slide-17
SLIDE 17

Step 3: blocking and similarity

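  • Apply a blocking key, e.g. Tokens(LastName), to place the records of Dataset 1 and Dataset 2 into blocks
  • Generate the candidate set (7 pairs) and apply the similarity function on each pair
  • ‘Exhaustive’ set: 4 × 6 = 24 pairs

[Figure: records of Dataset 1 and Dataset 2 assigned to blocks 1-5; each candidate pair is marked ‘?’]

Christen (2012)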
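The blocking step can be sketched as an inverted index over blocking key values. This is a minimal illustration, not the talk's implementation: the records are hypothetical (they are not the ones pictured on the slide), and the tokenizer is a simple lowercase alphanumeric split.

```python
import re
from collections import defaultdict
from itertools import product


def tokens(value):
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", value.lower()))


def block(dataset1, dataset2, field):
    """Index both datasets by the tokens of `field`, then pair records per block."""
    blocks = defaultdict(lambda: (set(), set()))
    for rid, rec in dataset1.items():
        for tok in tokens(rec[field]):
            blocks[tok][0].add(rid)
    for rid, rec in dataset2.items():
        for tok in tokens(rec[field]):
            blocks[tok][1].add(rid)
    candidates = set()
    for left, right in blocks.values():
        candidates.update(product(left, right))
    return candidates


# Hypothetical records: 4 in the first dataset, 6 in the second.
d1 = {"a1": {"LastName": "Smith"}, "a2": {"LastName": "Lee"},
      "a3": {"LastName": "Garcia-Lopez"}, "a4": {"LastName": "Khan"}}
d2 = {"b1": {"LastName": "Smith"}, "b2": {"LastName": "Smyth"},
      "b3": {"LastName": "Lee"}, "b4": {"LastName": "Lopez"},
      "b5": {"LastName": "Garcia"}, "b6": {"LastName": "Brown"}}

candidates = block(d1, d2, "LastName")
exhaustive = len(d1) * len(d2)  # every cross-dataset pair
```

Only pairs that share a block reach the (more expensive) similarity function, so the candidate set is much smaller than the exhaustive set. A misspelling such as "Smyth" is missed by this key, which is why the choice of blocking key matters.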

slide-18
SLIDE 18

Final output


slide-19
SLIDE 19

Supervised schematic (post type-alignment)

Elmagarmid et al. (2007)

[Figure: supervised pipeline schematic. Labels: training set of duplicates/non-duplicates; learn property alignment; aligned training set; learn blocking key; blocking key; learn similarity function; trained classifier; RDF dataset 1, RDF dataset 2; execute blocking; candidate set; execute similarity; :sameAs links]

  • Applied mainly to static tabular datasets; not viable for dynamic linked datasets
slide-20
SLIDE 20

Semi-supervised schematic (post type-alignment)

Kejriwal and Miranker (2015)

[Figure: semi-supervised pipeline schematic. Labels: seed training set of duplicates/non-duplicates; learn property alignment; aligned training set; learn blocking key; blocking key; learn similarity function; trained classifier; most confident samples; RDF dataset 1, RDF dataset 2; execute blocking; candidate set; execute similarity; :sameAs links]

  • Hard to realize in practice, both because of class imbalance and because graphs are hard to explore
slide-21
SLIDE 21

Unsupervised schematic?

[Figure: the semi-supervised pipeline schematic repeated, from the seed training set of duplicates/non-duplicates through to :sameAs links, including the most confident samples]

slide-22
SLIDE 22

Unsupervised schematic?

Kejriwal and Miranker (2013-2015)

[Figure: pipeline schematic with a new component. Labels: training set generator?; noisy seed training set of duplicates/non-duplicates; learn property alignment; aligned training set; learn blocking key; blocking key; learn similarity function; trained classifier; most confident samples; RDF dataset 1, RDF dataset 2; execute blocking; candidate set; execute similarity; :sameAs links]

slide-23
SLIDE 23

Our system: a complete, unsupervised schematic

  • Implemented both serially and in MapReduce (using standard cloud services)
  • Feasible for linking large, cross-domain graphs like DBpedia and Freebase
  • Does not ‘assume away’ any of the DASH requirements (e.g. property heterogeneity)

Kejriwal and Miranker (2015)
slide-24
SLIDE 24

Specific algorithmic contributions

[Figure: timeline of publications, 2013-2016]

  • Venues: ICDM, 2013; ISWC, 2014; OM, 2014; ESWC, 2015; Know@LOD, 2015; JWS, 2015; ISWC, 2015; ISWC, 2016 (submitted)
  • Topics: Motivation; Type Heterogeneity; Automation; Blocking and similarity; Property Heterogeneity; Full system (serial); Scalability

slide-25
SLIDE 25

First contribution: Unsupervised training set generation


Kejriwal and Miranker (2013-2015)

slide-26
SLIDE 26

Training Set Generator (TSG): Intuition

  • Generate a seed training set by locating a few easy examples using fast, unsupervised heuristics

Kejriwal and Miranker (2013-2015)

[Figure: the complete pipeline schematic, with the training set generator producing the noisy seed training set of duplicates/non-duplicates]

slide-27
SLIDE 27

What’s considered ‘easy’?

[Figure: an entity from RDF dataset 1 next to an entity from RDF dataset 2]

  • Operational definition: a pair on which a token-based heuristic (e.g. Jaccard) gives a high score
  • Tokens can be extracted by using an RDF-specific tokenizer
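The slide does not say how the RDF-specific tokenizer works, so the following is only a guess at the general idea. It assumes tokens come from the objects of an entity's triples, with literals split into words and URI objects reduced to their local name first. `local_name` and `entity_tokens` are hypothetical helpers, and the URIs are placeholders.

```python
import re


def local_name(uri):
    """Part of a URI after the last '/' or '#'."""
    return re.split(r"[/#]", uri.rstrip("/#"))[-1]


def entity_tokens(triples, entity):
    """Bag of lowercase tokens from the objects of all triples about `entity`."""
    bag = []
    for subject, _prop, obj in triples:
        if subject != entity:
            continue
        text = local_name(obj) if obj.startswith("http") else obj
        bag.extend(re.findall(r"[a-z0-9]+", text.lower()))
    return bag


# Hypothetical triples with placeholder URIs.
triples = [
    ("http://example.org/e1", "http://example.org/name", "Paul G. Allen"),
    ("http://example.org/e1", "http://example.org/founded",
     "http://example.org/Microsoft_Corp"),
]
bag = entity_tokens(triples, "http://example.org/e1")
```

The result is a bag (a list with repeats allowed), which suits frequency-based scores; wrapping it in `set(...)` gives the set form needed by Jaccard.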
slide-28
SLIDE 28

Step 1: Fast heuristic that is ‘recall-favoring’ with respect to easy examples

  • Given two bags of tokens (‘words’), $S_1$ and $S_2$:

$$\mathrm{LogTFIDF}(S_1, S_2) = \sum_{q \in S_1 \cap S_2} w(S_1, q)\, w(S_2, q)$$

$$w(S, q) = \frac{w'(S, q)}{\sqrt{\sum_{q' \in S} w'(S, q')^2}}, \qquad w'(S, q) = \log(\mathit{tf}_{S,q} + 1)\, \log\!\left(\frac{P}{\mathit{df}_q} + 1\right)$$

  • Found LogTFIDF with cosine similarity to work well for this step
  • Prunes much of the quadratic space in slightly super-linear time

Cohen (2000)
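A direct transcription of the LogTFIDF score into code might look as follows. It is a sketch under assumptions: each entity's bag of tokens is treated as one "document", `P` is taken to be the number of bags, and `df` counts the bags containing a token. The slide does not define these quantities, and the example bags are hypothetical.

```python
import math
from collections import Counter


def log_tfidf_weights(bag, doc_freq, num_docs):
    """Unit-normalized weights w(S, q) for one non-empty bag of tokens S."""
    tf = Counter(bag)
    raw = {
        q: math.log(tf[q] + 1) * math.log(num_docs / doc_freq[q] + 1)
        for q in tf
    }
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {q: v / norm for q, v in raw.items()}


def log_tfidf(bag1, bag2, doc_freq, num_docs):
    """Sum of w(S1, q) * w(S2, q) over the tokens q shared by both bags."""
    w1 = log_tfidf_weights(bag1, doc_freq, num_docs)
    w2 = log_tfidf_weights(bag2, doc_freq, num_docs)
    return sum(w1[q] * w2[q] for q in w1.keys() & w2.keys())


# Hypothetical bags of tokens, one per entity.
bags = [["paul", "allen", "microsoft"], ["paul", "g", "allen"], ["microsoft", "corp"]]
doc_freq = Counter(q for bag in bags for q in set(bag))  # assumed meaning of df
num_docs = len(bags)                                     # assumed meaning of P

score_same = log_tfidf(bags[0], bags[0], doc_freq, num_docs)
score_near = log_tfidf(bags[0], bags[1], doc_freq, num_docs)
score_far = log_tfidf(bags[1], bags[2], doc_freq, num_docs)
```

Because the weights are unit-normalized, the score is a cosine similarity: identical bags score 1, and bags with no shared token score 0. This pairwise function is quadratic if applied to every pair; the pruning claimed on the slide would need an inverted index over tokens so that only pairs sharing a token are ever scored.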

slide-29
SLIDE 29

Step 2: ‘Precision-favoring’ heuristic

  • Given two sets of tokens (‘words’), $S_1$ and $S_2$:

$$\mathrm{Jaccard}(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$$

  • Found Jaccard to work well for this ‘re-ranking’ step

Christen (2012)
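The re-ranking step can be illustrated with a few lines of code. This is a sketch, not the authors' procedure: the candidate pairs are assumed to come from the earlier recall-favoring step, `top_k` is an illustrative cutoff, and the token sets are hypothetical.

```python
def jaccard(s1, s2):
    """|S1 ∩ S2| / |S1 ∪ S2| for two token sets; 0.0 if both are empty."""
    s1, s2 = set(s1), set(s2)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


def rerank(candidates, token_sets, top_k):
    """Keep the `top_k` candidate pairs with the highest Jaccard score."""
    scored = [(jaccard(token_sets[a], token_sets[b]), a, b) for a, b in candidates]
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))  # best first, ties by id
    return scored[:top_k]


# Hypothetical token sets and candidate pairs.
token_sets = {
    "a1": {"paul", "g", "allen"},
    "b1": {"paul", "allen"},
    "b2": {"paul", "smith"},
}
best = rerank([("a1", "b1"), ("a1", "b2")], token_sets, top_k=1)
```

Here the pair (a1, b1) shares 2 of 3 distinct tokens while (a1, b2) shares 1 of 4, so only (a1, b1) survives. Jaccard penalizes every unshared token, which is what makes it precision-favoring.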

slide-30
SLIDE 30

Unsupervised RDF Training Set Generator (TSG)

[Figure: training set generator (TSG) pipeline]

  • Use TF-IDF to prune space and favor recall
  • Use Jaccard to favor precision
  • Make every sample count
  • Generate non-duplicates

Kejriwal and Miranker (2015)

slide-31
SLIDE 31

Baseline and Metrics

  • We evaluate the training set generator using Precision vs. Recall graphs

$$\mathrm{Precision} = \frac{|\text{True positives}|}{|\text{True positives} \cup \text{False positives}|}$$

$$\mathrm{Recall} = \frac{|\text{True positives}|}{|\text{True positives} \cup \text{False negatives}|}$$

$$\text{F-Measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

  • Use Dumas TSG (just uses LogTFIDF) as baseline
  • Why not an RDF instance-matching TSG? There were none!

Bilke and Naumann (2005)
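The three metrics can be computed from two sets of pairs. In this sketch the pairs are hypothetical; it relies on the fact that true and false positives together make up the predicted set, and true positives and false negatives together make up the ground truth.

```python
def precision_recall_f(predicted, truth):
    """Precision, recall and F-measure of `predicted` pairs against `truth`."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical ground truth and generated duplicate pairs.
truth = {("a1", "b1"), ("a2", "b3"), ("a3", "b4")}
predicted = {("a1", "b1"), ("a2", "b3"), ("a4", "b6"), ("a2", "b2")}
p, r, f = precision_recall_f(predicted, truth)
```

With 2 of 4 predictions correct and 2 of 3 true pairs found, precision is 1/2, recall is 2/3, and the F-measure is 4/7.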

slide-32
SLIDE 32

Serial Evaluations: Test suite

Test case (pair of datasets) | Domain                       | Number of properties | Number of instances | Number of duplicate pairs
-----------------------------|------------------------------|----------------------|---------------------|--------------------------
Persons 1                    | People                       | 15/14                | 2000/1000           | 500
Persons 2                    | People                       | 15/14                | 2400/800            | 400
Restaurants                  | Restaurants                  | 8/8                  | 339/2256            | 89
Eprints-Rexa                 | Publications                 | 24/115               | 1130/18,492         | 171
IM-Similarity                | Books                        | 9/9                  | 181/180             | 496
IIMB-059                     | Movies                       | 31/25                | 1549/519            | 412
IIMB-062                     | Movies                       | 31/34                | 1549/265            | 264
Libraries                    | Point of Interest, Addresses | 4/10                 | 17,636/26,583       | 16,789
Parks                        | Point of Interest, Addresses | 3/10                 | 567/359             | 322
Video Game                   | Point of Interest, Addresses | 11/4                 | 20,000/16,755       | 10,000

Kejriwal and Miranker (2015)

slide-33
SLIDE 33

Some Results


Kejriwal and Miranker (2015)

slide-34
SLIDE 34

How does it scale?

  • Implemented in MapReduce on Microsoft Azure
  • Scales near-linearly, even with millions of entities
  • Designed to avoid data skew and ‘curse of the last reducer’ problems
slide-35
SLIDE 35

‘Using’ the generated training set for learning

[Figure: the complete pipeline schematic, from the training set generator and noisy seed training set through learning and execution to :sameAs links]

slide-36
SLIDE 36

Second contribution: Property Alignment


slide-37
SLIDE 37

Property Alignment


  • Experiments show that recall of property alignment has very high correlation with success of later steps
  • Existing systems do not always achieve this
slide-38
SLIDE 38

Intuition

  • Combine several matchers and use a (max.) combiner to favor recall, but without trivially degrading precision

[Figure: Dataset 1 and Dataset 2 are passed to Matcher 1, ..., Matcher n; a combiner merges the matcher outputs into the property alignment]
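A max combiner can be sketched as an element-wise maximum over the matchers' score tables. The details below are assumptions made for illustration: the matcher names, the scores, and especially the fixed `threshold` are hypothetical. The talk describes the actual algorithm as parameter-free, so this is not the authors' selection rule.

```python
def combine_max(score_tables):
    """Element-wise maximum over the score tables produced by the matchers."""
    combined = {}
    for table in score_tables:
        for pair, score in table.items():
            combined[pair] = max(score, combined.get(pair, 0.0))
    return combined


def align(combined, threshold):
    """Property pairs whose combined score reaches `threshold` (illustrative only)."""
    return sorted(pair for pair, score in combined.items() if score >= threshold)


# Hypothetical matcher outputs: (property in dataset 1, property in dataset 2) -> score.
name_matcher = {("name", "label"): 0.2, ("addr", "address"): 0.9}
instance_matcher = {("name", "label"): 0.8, ("addr", "phone"): 0.1}

combined = combine_max([name_matcher, instance_matcher])
alignment = align(combined, threshold=0.5)
```

Taking the maximum means a property pair is kept if any single matcher is confident about it, which favors recall. The pair ("name", "label") survives here even though the name-based matcher scores it low.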

slide-39
SLIDE 39

A ‘recall-friendly’, parameter-free algorithm

[Figure: property aligner]

  • Instance-driven; use positive and negative samples
  • Don’t ignore names of properties
  • Instance-based measure
  • Matching cardinality is flexible

Kejriwal and Miranker (2015)

slide-40
SLIDE 40

Evaluations

Systems compared: Our algorithm, Kejriwal and Miranker (2015); Dumas, Bilke and Naumann (2005); Column Matcher, Tian et al. (2014)

              | Our algorithm         | Dumas                 | Column Matcher
Name          | Recall  Prec.  FM     | Recall  Prec.  FM     | Recall  Prec.  FM
--------------|-----------------------|-----------------------|----------------------
Persons 1     | 80.00   100    88.89  | 93.33   100    96.55  | 73.33   100    84.61
Persons 2     | 85.71   80.00  82.76  | 92.86   86.67  89.66  | 83.33   66.67  74.07
Restaurants   | 85.71   100    92.31  | 71.43   62.50  66.67  | 71.43   71.43  71.43
Eprints-Rexa  | 100     92.31  96.00  | 33.33   33.33  33.33  | 4.17    100    8.00
IM-Similarity | 100     81.82  90.00  | 100     100    100    | 88.89   61.54  72.73
IIMB-059      | 100     82.14  90.19  | 78.26   72.00  75.00  | 60.87   60.87  60.87
IIMB-062      | 100     100    100    | 16.67   16.13  16.40  | 10.00   100    18.18
Libraries     | 100     22.50  36.73  | 33.33   75.00  46.15  | 55.55   62.50  58.82
Parks         | 100     26.67  42.11  | 37.50   100    54.55  | 37.50   100    54.55
Video Game    | 75.00   75.00  75.00  | 100     100    100    | 50.00   100    66.67
Average       | 92.60   76.04  79.40  | 65.67   74.56  67.83  | 53.51   82.30  56.99

slide-41
SLIDE 41

Next stop: Learning Blocking and Similarity


slide-42
SLIDE 42

Third contribution: Blocking key learning

[Figure: the complete pipeline schematic, from the training set generator and noisy seed training set through learning and execution to :sameAs links]

slide-43
SLIDE 43

Disjunctive Normal Form (DNF) blocking for RDF

Bilenko, Kamath and Mooney (2006); Michelson and Knoblock (2006); Kejriwal and Miranker (2015)

  • Intuitive to understand; may be arbitrarily expressive
  • Learning from training data can be framed in terms of set selection
  • Empirically robust
  • Before our work, only developed for homogeneous relational databases

ID | Name                | Address              | City           | Cuisine
---|---------------------|----------------------|----------------|-------------
1  | Fenix               | 8358 Sunset Blvd.    | West Hollywood | American
2  | Art’s Delicatessen  | 12224 Ventura Blvd.  | Studio City    | American
3  | Hotel Bel-Air       | 701 Stone Canyon Rd. | Bel Air        | Californian
4  | Art Deli            | 12224 Ventura Blvd.  | Studio City    | Delis
5  | Fenix at the Argyle | 8359 Sunset Blvd.    | W. Hollywood   | French (new)

Example of DNF blocking scheme:

$$\mathrm{CommonToken}(\mathit{Name}) \lor \mathrm{CommonInteger}(\mathit{Address})$$
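The example scheme can be executed on the five restaurant records with an inverted index. This is a sketch of how such a disjunction might be evaluated, not the authors' implementation: each predicate is modeled as a function that emits key values for a record, and two records become a candidate pair if they share a key value under any predicate. The tokenization rules are assumptions.

```python
import re
from collections import defaultdict
from itertools import combinations

# Records from the table above: ID -> (Name, Address).
records = {
    1: ("Fenix", "8358 Sunset Blvd."),
    2: ("Art's Delicatessen", "12224 Ventura Blvd."),
    3: ("Hotel Bel-Air", "701 Stone Canyon Rd."),
    4: ("Art Deli", "12224 Ventura Blvd."),
    5: ("Fenix at the Argyle", "8359 Sunset Blvd."),
}


def name_tokens(rec):
    """Key values for CommonToken(Name)."""
    return {("CommonToken(Name)", t) for t in re.findall(r"[a-z0-9]+", rec[0].lower())}


def address_integers(rec):
    """Key values for CommonInteger(Address)."""
    return {("CommonInteger(Address)", n) for n in re.findall(r"\d+", rec[1])}


def dnf_block(records, predicates):
    """Records sharing a key value under any predicate land in the same block."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        for predicate in predicates:
            for key in predicate(rec):
                blocks[key].add(rid)
    candidates = set()
    for members in blocks.values():
        candidates.update(combinations(sorted(members), 2))
    return candidates


candidates = dnf_block(records, [name_tokens, address_integers])
```

Records 1 and 5 are paired through the shared name token "fenix", even though their street numbers differ (8358 vs 8359). Records 2 and 4 are paired both by the token "art" and by the integer 12224. Record 3 shares nothing with the others and is never compared.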

slide-44
SLIDE 44

Final stop: Executing Blocking and Similarity


slide-45
SLIDE 45

Execution of blocking

[Figure: the complete pipeline schematic, from the training set generator and noisy seed training set through learning and execution to :sameAs links]

slide-46
SLIDE 46

Evaluations: DNF Blocking for RDF

Systems compared: DNF blocking for RDF, Kejriwal and Miranker (2013-2015); Attribute Clustering (AC), Papadakis et al. (2013)

              | DNF blocking for RDF | Attribute Clustering (AC)
Name          | PC     RR     FM     | PC     RR     FM
--------------|----------------------|--------------------------
Persons 1     | 100    99.75  99.88  | 100    98.86  99.43
Persons 2     | 99.00  99.79  99.39  | 99.75  99.02  99.38
Restaurants   | 100    99.73  99.87  | 100    95.57  99.79
Eprints-Rexa  | 98.16  99.28  98.72  | 99.60  99.37  99.48
IM-Similarity | 100    98.14  99.06  | 100    62.79  77.14
IIMB-059      | 99.76  93.35  96.45  | 97.33  73.09  83.49
IIMB-062      | 47.73  98.11  64.22  | 77.27  90.80  83.49
Libraries     | 97.96  99.99  98.96  | 99.99  99.87  99.93
Parks         | 95.96  94.41  95.18  | 99.07  88.27  93.36
Video Game    | 98.73  99.96  99.34  | 99.72  99.85  99.79
Average       | 93.73  98.25  95.11  | 97.27  91.15  93.53
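The slide does not expand the column headers. The sketch below assumes the usual blocking metrics: PC as pairs completeness (the share of true duplicate pairs that survive blocking), RR as reduction ratio (the share of the exhaustive pair space that blocking removes), and FM as their harmonic mean. The Persons 1 row for DNF blocking is consistent with the harmonic-mean reading. The pairs are hypothetical.

```python
def pairs_completeness(candidates, true_pairs):
    """Fraction of true duplicate pairs present in the candidate set."""
    true_pairs = set(true_pairs)
    return len(set(candidates) & true_pairs) / len(true_pairs)


def reduction_ratio(candidates, n1, n2):
    """Fraction of the n1 * n2 exhaustive pairs that blocking avoids."""
    return 1 - len(set(candidates)) / (n1 * n2)


def harmonic_mean(a, b):
    """Harmonic mean of two scores; 0.0 if both are zero."""
    return 2 * a * b / (a + b) if a + b else 0.0


# Hypothetical example: 4 x 6 records, 3 true duplicate pairs, 4 candidate pairs.
true_pairs = {("a1", "b1"), ("a2", "b3"), ("a3", "b4")}
candidates = {("a1", "b1"), ("a2", "b3"), ("a3", "b5"), ("a4", "b2")}

pc = pairs_completeness(candidates, true_pairs)
rr = reduction_ratio(candidates, 4, 6)
fm = harmonic_mean(pc, rr)

# Consistency check against the Persons 1 row (PC = 100, RR = 99.75, FM = 99.88).
persons1_fm = round(harmonic_mean(100, 99.75), 2)
```

PC plays the role of recall for the blocking step, while RR measures how much work is saved; a good blocking scheme keeps both high at once.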

slide-47
SLIDE 47

Execution of similarity

[Figure: the complete pipeline schematic, from the training set generator and noisy seed training set through learning and execution to :sameAs links]

slide-48
SLIDE 48

Similarity: Comparison of ‘highest’ F-scores


slide-49
SLIDE 49

Summary

  • A system that has DASH (domain independence, automation, scalability and heterogeneity) properties and can be used to populate a Linked Data ENS
  • Many modular improvements possible
slide-50
SLIDE 50

Capstone evaluation: DBpedia and Freebase

  • Provides a real-world ‘extreme’ case study of DASH; also has applications beyond data integration
slide-51
SLIDE 51

Some results: DBpedia-Freebase

  • Serialized DBpedia (about 10 GB) and Freebase (400 GB+) in MapReduce using a NoSQL property table representation; executed algorithms on Microsoft Azure HDInsight clusters
  • Terminates on a small cluster (~60-70 cores) in less than 15 hours

[Figure panels: Type alignment results; Similarity results]

slide-52
SLIDE 52

Future and Current Work

  • Transfer learning
      • Using sources already linked to train new models (in potentially new domains)
  • Schema-free features
      • Bypassing alignment altogether, while still extracting some structural features
  • Integrating natural language documents and co-reference resolution into the pipeline

slide-53
SLIDE 53

Concluding notes

  • For this dissertation, we proposed, built and evaluated an instance matcher that fulfills DASH and can be used to populate a Linked Data ENS
  • We contributed...
      • An RDF training set generator to bootstrap an otherwise semi-supervised learning process
      • A recall-favoring property alignment algorithm
      • An RDF DNF blocking formalism and algorithms
      • Other things: (simple) type alignment, cross-domain applications (transfer learning, new ground-truths), model-selection bias, schema-free approaches
  • We’ve made datasets and experimental runs public