Online and Scalable
Semantic Data Analytics
Themis Palpanas
Paris Descartes University Institut Universitaire de France
Federated Semantic Data Management Seminar Dagstuhl, June 2017
Our Work
large scale data, streaming data
Themis Palpanas - June 2017
funded by: European Commission, CNRS, Facebook, IBM Research, FMJH, Inria, Hewlett-Packard Labs, Telecom Italia, Autonomous Province of Trento
▫ Managing and Analyzing Very Large Scientific Data
▫ infrastructure monitoring, motion capture, genome sequences, fMRI (neuroscience), astronomy
▫ Real Time Analysis of Data Streams
▫ continuous monitoring, online pattern identification
▫ Fuse Data from Different Sources
▫ entity resolution, query answering using knowledge graphs, subjectivity analysis
▫ Processing and Mining Uncertain Data
▫ uncertain data series (e.g., sensor measurements)
▫ uncertain graphs (e.g., biological networks)
entity resolution in large, heterogeneous data spaces
problem
develop framework and techniques for entity resolution in very large and highly heterogeneous data spaces (i.e., loose schema binding, high levels of heterogeneity and noise, missing attribute names or values)
scale to web size
applications:
web-scale data integration
“which entities in these two web datasets are the same?”
entity resolution for heterogeneous web data
query answering
return a set of unique entities in response to a user query
produce high-quality results
novel blocking techniques that are resilient to heterogeneity can be the basis for efficient entity resolution
develop block building methods that lead to blocks with a low number of missed matches (high recall), and block processing methods that reduce the number of required pair-wise entity comparisons (high efficiency)
we propose a framework for entity resolution in heterogeneous data spaces at web scale
efficient and effective algorithms for:
blocking
block purging
duplicates propagation
block scheduling
block pruning
comparisons propagation
comparisons pruning
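To make the first few items of this list concrete, here is a minimal sketch of schema-agnostic token blocking, followed by a size-based purge and a pass that emits each candidate pair only once. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `max_size` threshold and the four toy profiles are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import re


def token_blocking(profiles):
    """Map every token found in any attribute value to the ids of the profiles containing it."""
    blocks = defaultdict(set)
    for pid, profile in profiles.items():
        # Attribute names are ignored on purpose: only the values matter.
        for value in profile.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(pid)
    # A block with a single profile yields no comparison, so drop it.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


def purge_blocks(blocks, max_size):
    """Discard oversized blocks, which usually come from very frequent tokens."""
    return {key: ids for key, ids in blocks.items() if len(ids) <= max_size}


def candidate_pairs(blocks):
    """Collect each pair once, even when it co-occurs in several blocks."""
    seen = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            seen.add((a, b))
    return seen


# Hypothetical profiles with no shared schema (different attribute names).
profiles = {
    "e1": {"name": "Themis Palpanas", "org": "Paris Descartes University"},
    "e2": {"fullName": "T. Palpanas", "affiliation": "Univ. Paris Descartes"},
    "e3": {"title": "JedAI toolkit", "lang": "Java"},
    "e4": {"about": "Java toolkit for entity resolution"},
}

blocks = purge_blocks(token_blocking(profiles), max_size=3)
pairs = candidate_pairs(blocks)
print(sorted(pairs))
```

On this toy input only two of the six possible pairs survive as candidates, which is the efficiency gain blocking aims for; whether recall stays high depends on the data and on the purging threshold.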
Tutorial, links for Papers, Demo, Code, Datasets:
http://www.mi.parisdescartes.fr/~themisp/publications/PapadakisPalpanas-TutorialScaDS-LeipsigSummerSchool2016v2.pptx
JedAI implements the following schema-agnostic, end-to-end workflow for both Clean-Clean and Dirty ER:
Step 1, Data Reading: Reads files containing the entity profiles and the golden standard.
Step 2, Block Building: Creates blocks.
Step 3, Block Cleaning: Optional step that cleans blocks from useless comparisons (repeated, superfluous).
Step 4, Comparison Cleaning: Optional step that operates on the level of individual comparisons to remove the useless ones.
Step 5, Entity Matching: Executes all retained comparisons.
Step 6, Entity Clustering: Partitions the similarity graph into equivalence clusters.
Step 7, Evaluation & Storing: Stores and presents performance results w.r.t. numerous measures.
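Steps 5 and 6 can be sketched in a few lines. The example below is not the JedAI API (JedAI is a Java library); it is a hypothetical Python illustration that assumes Jaccard similarity over token sets for matching and connected components for clustering, two simple choices among many. All names and the 0.5 threshold are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match(token_sets, pairs, threshold):
    """Step 5 (sketch): run the retained comparisons, keep pairs at or above the threshold."""
    return [(a, b) for a, b in pairs
            if jaccard(token_sets[a], token_sets[b]) >= threshold]


def cluster(nodes, edges):
    """Step 6 (sketch): connected components of the similarity graph via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical token sets and retained comparisons.
token_sets = {
    "e1": {"themis", "palpanas", "paris"},
    "e2": {"t", "palpanas", "paris"},
    "e3": {"jedai", "toolkit"},
}
retained = [("e1", "e2"), ("e1", "e3")]

edges = match(token_sets, retained, threshold=0.5)
clusters = cluster(token_sets, edges)
print(clusters)
```

Connected components is the most permissive clustering choice: a single spurious match merges two clusters, so stricter methods may be preferable on noisy data.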
JedAI supports several established methods for each workflow step:
Step 1, Data Reading: Possible to read CSV, RDF/XML files & relational DBs in any combination!
Step 2, Block Building: Choose 1 out of 8 methods.
Step 3, Block Cleaning: Specify any combination of 3 (4) complementary methods for Dirty (Clean-Clean) ER.
Step 4, Comparison Cleaning: Choose 1 out of 7 methods (including Meta-blocking).
Step 5, Entity Matching: Combine 1 out of 2 methods with 12 textual representation models and 10 similarity measures.
Step 6, Entity Clustering: Choose 1 out of 6 methods for Dirty ER. For Clean-Clean ER, 1 method is available.
Step 7, Evaluation & Storing: Store results as a CSV file.
Block Building: Token Blocking, Sorted Neighborhood, Extended Sorted Neighborhood, Attribute Clustering, Q-Grams Blocking, Extended Q-Grams Blocking, Suffix Arrays, Extended Suffix Arrays
Block Cleaning: Block Filtering, Size-based Block Purging, Cardinality-based Block Purging, Block Scheduling
Comparison Cleaning: Comparison Propagation, Cardinality Edge Pruning (CEP), Cardinality Node Pruning (CNP), Weighted Edge Pruning (WEP), Weighted Node Pruning (WNP), Reciprocal CNP, Reciprocal WNP
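As an illustration of what a comparison cleaning method does, the sketch below follows the general idea of Weighted Edge Pruning: weight every candidate pair, then drop the pairs whose weight falls below a global threshold. It assumes the weight is the number of blocks a pair shares and that the threshold is the mean weight; the weighting schemes and thresholds actually used in JedAI may differ, and the function name and toy blocks are hypothetical.

```python
from collections import Counter
from itertools import combinations


def weighted_edge_pruning(blocks):
    """Keep only the pairs whose co-occurrence weight reaches the mean edge weight."""
    weights = Counter()
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            weights[pair] += 1  # assumed weight: number of blocks the pair shares
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean}


# Hypothetical blocks, e.g. as produced by token blocking.
blocks = {
    "palpanas": {"e1", "e2"},
    "paris": {"e1", "e2", "e3"},
    "descartes": {"e1", "e2"},
    "java": {"e3", "e4"},
}

kept = weighted_edge_pruning(blocks)
print(kept)
```

Here four candidate pairs have weights 3, 1, 1 and 1, so the mean is 1.5 and only the pair sharing three blocks is retained for matching.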
– JedAI Library: https://github.com/scify/JedAIToolkit
– JedAI Desktop Application and Workbench: https://github.com/scify/jedai-ui
– All code is implemented using Java 8.
– All code is publicly available under Apache License V2.0.
– Documentation: https://github.com/scify/JedAIToolkit/tree/master/documentation
George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas and Manolis Koubarakis: "JedAI: The Force behind Entity Resolution", in ESWC 2017.
Clean-Clean ER (real)
Dataset        D1 Entities   D2 Entities
Abt-Buy        1,076         1,076
DBLP-ACM       2,616         2,294
DBLP-Scholar   2,516         61,353
Amazon-GP      1,354         3,039
Movies         27,615        23,182
DBPedia        1,190,733     2,164,040

Dirty ER (synthetic)
Dataset   Entities
10K       10,000
50K       50,000
100K      100,000
200K      200,000
300K      300,000
1M        1,000,000
2M        2,000,000
Can be used for Dirty ER, as well.
Several datasets are available for testing at https://github.com/scify/JedAIToolkit .
exemplar queries: query answering using examples and knowledge graphs
problem
given an example element (subgraph) of interest, return a ranked set of similar elements
scale to full-size knowledge graphs, provide answers in real time
applications:
data exploration for non-expert users
“find company acquisitions like the one of YouTube by Google”
fast and easy discovery of facts with same semantics
complex similarity queries made easy
“find other legal cases where the actors had relationships similar to this”
pain-free information search for specialized users
formulation of and algorithms for the Exemplar Query problem
develop query answering methods that can efficiently prune the search space, and produce relevant and diverse results
we propose techniques for real-time Exemplar Query answering using real-world knowledge graphs
efficient and effective algorithms using as similarity measures:
subgraph isomorphism
simulation
Papers, Code, Datasets:
http://www.mi.parisdescartes.fr/~themisp/exemplarquery-ext/
Traditional Query Answering
based=California produces=Mobiles
Database
Exemplar Queries
[Figure: the user knows one example (an "acquired" edge) but does not know how to formulate a query that searches the database for other acquisitions]
A different need
Existing Search Engines
acquisitions like Google Youtube
Yahoo!-Tumblr or Microsoft-Skype not present as interesting acquisitions.
The Exemplar Query perspective
Our Approach Exemplar Queries
General Solution
Input: User Query Q, an example of the expected results
Output: Set of expected results
Procedure:
Data Model: Knowledge graph
Strict equality: Edge Isomorphism
Why is Yahoo!-Tumblr not present?
More freedom: Simulation
Tumblr matches both an acquisition and a website
Match edge-label sequences instead of structures
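The difference from isomorphism can be illustrated with the textbook notion of graph simulation on edge-labeled graphs: a data node simulates a query node if, for every outgoing query edge, it has an outgoing edge with the same label to a node that simulates the target. The sketch below computes the maximal such relation by a shrinking fixpoint. It is a generic formulation that may differ from the variant used in this work; the function name, the `?`-prefixed query node names and the toy graph are hypothetical.

```python
def simulation(query_edges, data_edges):
    """Return, for each query node, the data nodes that simulate it (maximal relation)."""
    query_nodes = {n for s, _, t in query_edges for n in (s, t)}
    data_nodes = {n for s, _, t in data_edges for n in (s, t)}

    # Index outgoing data edges by (source, label).
    out = {}
    for s, label, t in data_edges:
        out.setdefault((s, label), set()).add(t)

    # Start from the full relation and shrink it until nothing changes.
    sim = {q: set(data_nodes) for q in query_nodes}
    changed = True
    while changed:
        changed = False
        for qs, label, qt in query_edges:
            for v in list(sim[qs]):
                # v needs an outgoing edge with this label to a node simulating qt.
                if not (out.get((v, label), set()) & sim[qt]):
                    sim[qs].discard(v)
                    changed = True
    return sim


# Hypothetical knowledge graph as (source, label, target) triples.
data = [
    ("Google", "acquired", "YouTube"), ("YouTube", "isA", "Website"),
    ("Yahoo!", "acquired", "Tumblr"), ("Tumblr", "isA", "Website"),
    ("CBS", "acquired", "CNET"),
]
# Pattern derived from the example: a buyer acquires a target that has a type.
query = [("?buyer", "acquired", "?target"), ("?target", "isA", "?kind")]

sim = simulation(query, data)
print(sorted(sim["?buyer"]), sorted(sim["?target"]))
```

Unlike isomorphism, simulation does not require a one-to-one mapping between query and data nodes, so one data node may play several roles at once. That extra freedom tends to admit more answers, at the price of some that are structurally looser.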
Ranking results
[Figure: User Query with panels S, A1, A2; entities shown: Google, Yahoo!, CBS]
Combination of two factors
Simulation vs Isomorphism
[Plot: Count (k) vs. τ (0.005, 0.01), log scale; series: Found Answers ISO, Found Answers SIM, Visited Edges, Visited Vertices]
[Plot: Time (s) vs. τ (0.005, 0.01), log scale; series: Total Time SIM, Total Time ISO, Analysis]
Qualitative Evaluation
Query: Google – YouTube – Menlo Park
Methods compared: Approximate Graph Query Answering [Khan13], Edge Isomorphism, Simulation
Answers are collapsed
More interesting answers
for Entity Resolution
develop out of the box
automatic tuning
easy to use solutions
guide the user to choose among the alternatives
that can cope with big data characteristics
volume, variety, velocity
for Exemplar Queries
extend to multiple exemplar queries in the input
take into account user preferences
employ more semantics
for our Dagstuhl Seminar
think of applications that can be built on top of a federated semantic data management system
and the associated requirements/challenges…
Data-Intensive and Knowledge-Oriented systems
google: Themis Palpanas