Populating a Linked Data Entity Name System
Mayank Kejriwal
Linked Data: a set of four best practices for publishing and connecting structured data on the Web
Bizer et al. (2009, 2014)
Instance Matching: connecting pairs of
1
Bizer et al. (2009, 2014)
2
Jaffri et al. (2008) Papadakis et al. (2010) Nikolov et al. (2011)
3
Paul Gardner Allen: freebase:Paul_G._Allen, dbpedia:Allen_,Paul, ...
Microsoft: dbpedia:Microsoft Corp., freebase:Microsoft, ...
Bouquet et al. (2008)
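One way to picture an entity name system is as groups of URIs that have been linked as referring to the same entity. The sketch below is only an illustration under that assumption, not the design of Bouquet et al. (2008); the class `EntityNameSystem` and its methods are hypothetical, and the URIs are the example identifiers from this slide.

```python
class EntityNameSystem:
    """Toy ENS: groups URIs connected by sameAs links (union-find)."""

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        # Register unseen URIs as their own representative.
        self._parent.setdefault(uri, uri)
        while self._parent[uri] != uri:
            # Path halving keeps the trees shallow.
            self._parent[uri] = self._parent[self._parent[uri]]
            uri = self._parent[uri]
        return uri

    def add_same_as(self, uri_a, uri_b):
        """Record that two URIs refer to the same entity."""
        root_a, root_b = self._find(uri_a), self._find(uri_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def aliases(self, uri):
        """Return every known URI in the same group as `uri`."""
        root = self._find(uri)
        return {u for u in list(self._parent) if self._find(u) == root}


ens = EntityNameSystem()
ens.add_same_as("freebase:Paul_G._Allen", "dbpedia:Allen_,Paul")
ens.add_same_as("freebase:Microsoft", "dbpedia:Microsoft Corp.")
paul_aliases = ens.aliases("dbpedia:Allen_,Paul")
```

Looking up either Paul Allen URI returns both of them, while the Microsoft URIs stay in a separate group.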
4
5
[Figure labels: Seller 1, Seller 2, ..., Seller n; Entity Name System; Mediated schema/Target; Product X; Aggregated Results]
Doan et al. (2012)
6
http://www.w3.org/RDF Bizer et al. (2009)
7
8
Cyganiak and Jentzsch (2014) Linkeddata.org
2007 with just 12 RDF datasets
links
schema.org, Google Knowledge Graph, Constitute...
[Figure labels: Media, Social Networking, Cross-domain, Publications]
9
10
11
Cyganiak and Jentzsch (2014) Linkeddata.org
2007 with just a handful of datasets
links
[Figure labels: Media, Social Networking, Cross-domain, Publications]
12
13
Kejriwal and Miranker (2014)
14
Kejriwal and Miranker (2014) Euzenat and Shvaiko (2007)
15
Euzenat and Shvaiko (2007)
16
17
[Figure: blocking example]
Dataset 1 and Dataset 2: the 'exhaustive' set has 4 × 6 = 24 pairs
Apply blocking key, e.g. Tokens(LastName), to form blocks 1 to 5
Generate candidate set (7 pairs), apply similarity function
Christen (2012)
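The blocking step can be sketched in a few lines. This is a minimal illustration, assuming a whitespace tokenizer as the Tokens(LastName) key; the last names are invented (they are not on the slide) and are sized only to mirror the 4 × 6 setting, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def tokens(value):
    """Blocking key: lowercase whitespace-delimited tokens of a field."""
    return set(value.lower().split())


def candidate_pairs(last_names_1, last_names_2):
    """Pair records across the two datasets that fall in a common block."""
    # Each block (one per token) holds record indices from both datasets.
    blocks = defaultdict(lambda: (set(), set()))
    for i, name in enumerate(last_names_1):
        for tok in tokens(name):
            blocks[tok][0].add(i)
    for j, name in enumerate(last_names_2):
        for tok in tokens(name):
            blocks[tok][1].add(j)
    pairs = set()
    for left, right in blocks.values():
        pairs.update(product(left, right))
    return pairs


# Invented last names (not from the slides): 4 records vs. 6 records.
dataset_1 = ["Smith", "Allen", "Garcia Lopez", "Kim"]
dataset_2 = ["Smith", "Smith Jones", "Allen", "Lopez", "Garcia", "Allen Kim"]

exhaustive = len(dataset_1) * len(dataset_2)
candidates = candidate_pairs(dataset_1, dataset_2)
print(f"exhaustive: {exhaustive} pairs, after blocking: {len(candidates)} pairs")
```

Only the candidate pairs are passed to the (more expensive) similarity function; pairs that share no block are never compared.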
18
19
Elmagarmid et al. (2007)
[Figure: pipeline. Inputs: RDF dataset 1, RDF dataset 2, training set of duplicates/non-duplicates. Learning steps: Learn Property Alignment, Learn blocking key, Learn Similarity function. Execution steps: Execute blocking, Execute similarity. Intermediate results: aligned training set, trained classifier, blocking key, candidate set. Output: :sameAs links.]
20
Kejriwal and Miranker (2015)
[Figure: pipeline. Inputs: RDF dataset 1, RDF dataset 2, seed training set of duplicates/non-duplicates. Learning steps: Learn Property Alignment, Learn blocking key, Learn Similarity function. Execution steps: Execute blocking, Execute similarity. Intermediate results: aligned training set, trained classifier, blocking key, candidate set, most confident samples. Output: :sameAs links.]
21
22
Kejriwal and Miranker (2013-2015)
[Figure: pipeline. Inputs: RDF dataset 1, RDF dataset 2, noisy seed training set of duplicates/non-duplicates. Learning steps: Learn Property Alignment, Learn blocking key, Learn Similarity function. Execution steps: Execute blocking, Execute similarity. Intermediate results: aligned training set, trained classifier, blocking key, candidate set, most confident samples. Output: :sameAs links. Open question: Training set generator?]
23
Kejriwal and Miranker (2015)
24
[Figure: timeline, 2013 to 2016]
Venues: ICDM 2013; ISWC 2014; OM 2014; ESWC 2015; Know@LOD 2015; JWS 2015; ISWC 2015; ISWC 2016 (submitted)
Topics: Motivation; Type Heterogeneity; Automation; Blocking and similarity; Property Heterogeneity; Full system (serial); Scalability
25
Kejriwal and Miranker (2013-2015)
26
Kejriwal and Miranker (2013-2015)
[Figure: pipeline. Inputs: RDF dataset 1, RDF dataset 2. Training set generator: produces the noisy seed training set of duplicates/non-duplicates. Learning steps: Learn Property Alignment, Learn blocking key, Learn Similarity function. Execution steps: Execute blocking, Execute similarity. Intermediate results: aligned training set, trained classifier, blocking key, candidate set, most confident samples. Output: :sameAs links.]
27
Entity from RDF dataset 1 Entity from RDF dataset 2
28
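\[
\mathrm{LogTFIDF}(S_1, S_2) = \sum_{q \in S_1 \cap S_2} w(S_1, q)\, w(S_2, q),
\]
where
\[
w(S, q) = \frac{w'(S, q)}{\sqrt{\sum_{q \in S} w'(S, q)^2}},
\qquad
w'(S, q) = \log\left(\mathit{tf}_{S,q} + 1\right)\,\log\left(\frac{P}{\mathit{df}_q} + 1\right)
\]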
Cohen (2000)
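As a hedged illustration of the log TF-IDF measure, the sketch below scores two token lists by summing, over shared tokens, the product of their normalized weights, where each raw weight is log(tf + 1) · log(P / df + 1). It assumes that P is the number of token lists in the corpus and that weights are normalized by the Euclidean norm; the function name `log_tfidf` and the whitespace tokenization are choices made here, not taken from the slides.

```python
import math
from collections import Counter


def log_tfidf(tokens_1, tokens_2, corpus):
    """Log TF-IDF similarity between two token lists.

    corpus: list of token lists used to count document frequencies.
    Assumption: P is the number of token lists in the corpus.
    """
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))

    def weights(tokens):
        term_freq = Counter(tokens)
        # Raw weight w'(S, q); tokens unseen in the corpus are skipped.
        raw = {
            tok: math.log(tf + 1) * math.log(n_docs / doc_freq[tok] + 1)
            for tok, tf in term_freq.items()
            if doc_freq[tok] > 0
        }
        norm = math.sqrt(sum(w * w for w in raw.values()))
        return {tok: w / norm for tok, w in raw.items()} if norm else {}

    w1, w2 = weights(tokens_1), weights(tokens_2)
    return sum(w1[tok] * w2[tok] for tok in w1.keys() & w2.keys())


# Restaurant names that appear later in the deck, used here as a tiny corpus.
names = ["Fenix", "Art's Delicatessen", "Hotel Bel-Air", "Art Deli",
         "Fenix at the Argyle"]
corpus = [name.lower().split() for name in names]
same = log_tfidf(corpus[0], corpus[0], corpus)
partial = log_tfidf(corpus[0], corpus[4], corpus)
disjoint = log_tfidf(corpus[0], corpus[2], corpus)
```

Identical token lists score 1, lists with no shared token score 0, and partially overlapping lists fall in between, with rare shared tokens counting for more than common ones.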
29
\[
\mathrm{Jaccard}(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}
\]
Christen (2012)
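A small worked example of the Jaccard measure, using token sets built from two restaurant names that appear later in the deck. Returning 0.0 for two empty sets is a convention assumed here, not something the slide specifies.

```python
def jaccard(set_1, set_2):
    """|S1 ∩ S2| / |S1 ∪ S2|; taken to be 0.0 when both sets are empty."""
    union = set_1 | set_2
    return len(set_1 & set_2) / len(union) if union else 0.0


name_a = {"fenix"}
name_b = {"fenix", "at", "the", "argyle"}
score = jaccard(name_a, name_b)  # 1 shared token out of 4 distinct tokens
```

Unlike TF-IDF, every token counts equally, so the score drops quickly when one set contains extra tokens.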
30
Training set generator (TSG)
Use TF-IDF to prune space and favor recall
Use Jaccard to favor precision
Make every sample count
Kejriwal and Miranker (2015)
Generate non-duplicates
31
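\[
\mathrm{Precision} = \frac{|\text{True positives}|}{|\text{True positives} \cup \text{False positives}|}
\]
\[
\mathrm{Recall} = \frac{|\text{True positives}|}{|\text{True positives} \cup \text{False negatives}|}
\]
\[
\text{F-Measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]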
Bilke and Naumann (2005)
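The three metrics can be computed directly from sets of pairs. In this sketch the predicted and true pairs are invented for illustration, and the function name `precision_recall_f` is hypothetical; returning 0.0 for empty inputs is an assumed convention.

```python
def precision_recall_f(predicted, truth):
    """Precision, recall and F-measure for sets of predicted and true pairs."""
    true_pos = len(predicted & truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Invented example: 5 true duplicate pairs, 4 predicted, all 4 correct.
truth = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)}
predicted = {(1, 1), (2, 2), (3, 3), (4, 4)}
p, r, f = precision_recall_f(predicted, truth)
print(f"precision={p:.2%} recall={r:.2%} f-measure={f:.2%}")
```

Here precision is 100%, recall is 80% and the F-measure is 88.89%, since 2 × 1.0 × 0.8 / 1.8 ≈ 0.8889.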
32
| Test case (pair of datasets) | Domain | Number of properties | Number of instances | Number of duplicate pairs |
|---|---|---|---|---|
| Persons 1 | People | 15/14 | 2000/1000 | 500 |
| Persons 2 | People | 15/14 | 2400/800 | 400 |
| Restaurants | Restaurants | 8/8 | 339/2256 | 89 |
| Eprints-Rexa | Publications | 24/115 | 1130/18,492 | 171 |
| IM-Similarity | Books | 9/9 | 181/180 | 496 |
| IIMB-059 | Movies | 31/25 | 1549/519 | 412 |
| IIMB-062 | Movies | 31/34 | 1549/265 | 264 |
| Libraries | Point of Interest, Addresses | 4/10 | 17,636/26,583 | 16,789 |
| Parks | Point of Interest, Addresses | 3/10 | 567/359 | 322 |
| Video Game | Point of Interest, Addresses | 11/4 | 20,000/16,755 | 10,000 |
Kejriwal and Miranker (2015)
33
Kejriwal and Miranker (2015)
34
35
36
37
without trivially degrading precision
38
[Figure labels: Dataset 1, Dataset 2; Matcher 1, ..., Matcher n; Combiner; Property Alignment]
Property Aligner
Instance-driven; use positive and negative samples
Don’t ignore names of properties
Instance-based measure
Matching cardinality is flexible
39
Kejriwal and Miranker (2015)
40
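Our algorithm: Kejriwal and Miranker (2015); Dumas: Bilke and Naumann (2005); Column Matcher: Tian et al. (2014)

| Name | Our algorithm Recall | Our algorithm Prec. | Our algorithm FM | Dumas Recall | Dumas Prec. | Dumas FM | Column Matcher Recall | Column Matcher Prec. | Column Matcher FM |
|---|---|---|---|---|---|---|---|---|---|
| Persons 1 | 80.00 | 100 | 88.89 | 93.33 | 100 | 96.55 | 73.33 | 100 | 84.61 |
| Persons 2 | 85.71 | 80.00 | 82.76 | 92.86 | 86.67 | 89.66 | 83.33 | 66.67 | 74.07 |
| Restaurants | 85.71 | 100 | 92.31 | 71.43 | 62.50 | 66.67 | 71.43 | 71.43 | 71.43 |
| Eprints-Rexa | 100 | 92.31 | 96.00 | 33.33 | 33.33 | 33.33 | 4.17 | 100 | 8.00 |
| IM-Similarity | 100 | 81.82 | 90.00 | 100 | 100 | 100 | 88.89 | 61.54 | 72.73 |
| IIMB-059 | 100 | 82.14 | 90.19 | 78.26 | 72.00 | 75.00 | 60.87 | 60.87 | 60.87 |
| IIMB-062 | 100 | 100 | 100 | 16.67 | 16.13 | 16.40 | 10.00 | 100 | 18.18 |
| Libraries | 100 | 22.50 | 36.73 | 33.33 | 75.00 | 46.15 | 55.55 | 62.50 | 58.82 |
| Parks | 100 | 26.67 | 42.11 | 37.50 | 100 | 54.55 | 37.50 | 100 | 54.55 |
| Video Game | 75.00 | 75.00 | 75.00 | 100 | 100 | 100 | 50.00 | 100 | 66.67 |
| Average | 92.60 | 76.04 | 79.40 | 65.67 | 74.56 | 67.83 | 53.51 | 82.30 | 56.99 |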
41
42
43
Bilenko, Kamath and Mooney (2006) Michelson and Knoblock (2006) Kejriwal and Miranker (2015)
| ID | Name | Address | City | Cuisine |
|---|---|---|---|---|
| 1 | Fenix | 8358 Sunset Blvd. | West Hollywood | American |
| 2 | Art’s Delicatessen | 12224 Ventura Blvd. | Studio City | American |
| 3 | Hotel Bel-Air | 701 Stone Canyon Rd. | Bel Air | Californian |
| 4 | Art Deli | 12224 Ventura Blvd. | Studio City | Delis |
| 5 | Fenix at the Argyle | 8359 Sunset Blvd. | | French (new) |

Example of DNF blocking scheme: CommonToken(Name) ∨ CommonInteger(Address)
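The scheme CommonToken(Name) ∨ CommonInteger(Address) can be read as a predicate over pairs of records: a pair becomes a candidate if the names share a token or the addresses share an integer. The sketch below applies it to the five restaurant records from this slide. The tokenization rules (lowercased alphanumeric runs, digit runs) and the function names are assumptions made for illustration, not the authors' implementation.

```python
import re
from itertools import combinations

# Name and address fields of the five restaurant records on this slide.
records = {
    1: {"name": "Fenix", "address": "8358 Sunset Blvd."},
    2: {"name": "Art's Delicatessen", "address": "12224 Ventura Blvd."},
    3: {"name": "Hotel Bel-Air", "address": "701 Stone Canyon Rd."},
    4: {"name": "Art Deli", "address": "12224 Ventura Blvd."},
    5: {"name": "Fenix at the Argyle", "address": "8359 Sunset Blvd."},
}


def word_tokens(text):
    """Lowercased alphanumeric tokens (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def integer_tokens(text):
    """Runs of digits found in the text."""
    return set(re.findall(r"\d+", text))


def common_token(a, b, field):
    return bool(word_tokens(a[field]) & word_tokens(b[field]))


def common_integer(a, b, field):
    return bool(integer_tokens(a[field]) & integer_tokens(b[field]))


def dnf_scheme(a, b):
    """CommonToken(Name) OR CommonInteger(Address)."""
    return common_token(a, b, "name") or common_integer(a, b, "address")


candidates = [
    (i, j)
    for i, j in combinations(sorted(records), 2)
    if dnf_scheme(records[i], records[j])
]
print(candidates)
```

Of the 10 possible pairs, only (1, 5) and (2, 4) survive: the first through the shared name token "fenix", the second through the shared name token "art" and the shared address integer 12224.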
44
45
46
Kejriwal and Miranker (2013-2015) Papadakis et al. (2013)
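| Name | DNF blocking for RDF PC | DNF blocking for RDF RR | DNF blocking for RDF FM | Attribute Clustering (AC) PC | Attribute Clustering (AC) RR | Attribute Clustering (AC) FM |
|---|---|---|---|---|---|---|
| Persons 1 | 100 | 99.75 | 99.88 | 100 | 98.86 | 99.43 |
| Persons 2 | 99.00 | 99.79 | 99.39 | 99.75 | 99.02 | 99.38 |
| Restaurants | 100 | 99.73 | 99.87 | 100 | 95.57 | 99.79 |
| Eprints-Rexa | 98.16 | 99.28 | 98.72 | 99.60 | 99.37 | 99.48 |
| IM-Similarity | 100 | 98.14 | 99.06 | 100 | 62.79 | 77.14 |
| IIMB-059 | 99.76 | 93.35 | 96.45 | 97.33 | 73.09 | 83.49 |
| IIMB-062 | 47.73 | 98.11 | 64.22 | 77.27 | 90.80 | 83.49 |
| Libraries | 97.96 | 99.99 | 98.96 | 99.99 | 99.87 | 99.93 |
| Parks | 95.96 | 94.41 | 95.18 | 99.07 | 88.27 | 93.36 |
| Video Game | 98.73 | 99.96 | 99.34 | 99.72 | 99.85 | 99.79 |
| Average | 93.73 | 98.25 | 95.11 | 97.27 | 91.15 | 93.53 |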
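The slide gives only the abbreviations PC, RR and FM. The sketch below assumes the usual blocking definitions: pairs completeness (the share of true matches kept in the candidate set), reduction ratio (the share of the exhaustive pair set that blocking avoids), and their harmonic mean. The function names and the small example are hypothetical.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers (0.0 if both are zero)."""
    return 2 * a * b / (a + b) if a + b else 0.0


def blocking_metrics(candidates, true_matches, n_exhaustive):
    """Return (PC, RR, FM) as fractions, under the assumed definitions.

    PC = true matches retained by blocking / all true matches
    RR = 1 - |candidate set| / |exhaustive set|
    FM = harmonic mean of PC and RR
    """
    pc = len(candidates & true_matches) / len(true_matches) if true_matches else 0.0
    rr = 1 - len(candidates) / n_exhaustive
    return pc, rr, harmonic_mean(pc, rr)


# Invented example: 7 candidate pairs out of 24, retaining both true matches.
candidates = {(0, 0), (0, 1), (1, 2), (1, 5), (2, 3), (2, 4), (3, 5)}
true_matches = {(0, 0), (1, 2)}
pc, rr, fm = blocking_metrics(candidates, true_matches, n_exhaustive=24)

# Consistency check against one reported row (IM-Similarity, AC):
# PC = 100 and RR = 62.79 are listed with FM = 77.14.
fm_check = harmonic_mean(100, 62.79)
```

The harmonic-mean reading of FM agrees with that row to two decimal places, which supports the assumed definition but does not confirm it.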
47
48
49
heterogeneity) properties and can be used to populate a Linked Data ENS
50
51
property table representation, executed algorithms in Microsoft Azure HDInsight clusters
Type alignment results Similarity results
pipeline
52
fulfills DASH and can be used to populate a Linked Data ENS
ground-truths), model-selection bias, schema-free approaches
53