Online and Scalable Semantic Data Analytics

Themis Palpanas, Paris Descartes University, Institut Universitaire de France. Federated Semantic Data Management Seminar, Dagstuhl, June 2017.


slide-1
SLIDE 1

Online and Scalable Semantic Data Analytics

Themis Palpanas

Paris Descartes University
Institut Universitaire de France

Federated Semantic Data Management Seminar
Dagstuhl, June 2017

slide-2
SLIDE 2

Our Work

  • large scale data
  • streaming data
  • heterogeneous data
  • private data
  • uncertain data

Themis Palpanas - June 2017

funded by: European Commission, CNRS, Facebook, IBM Research, FMJH, Inria, Hewlett-Packard Labs, Telecom Italia, Autonomous Province of Trento

slide-3
SLIDE 3

Our Work

  • large scale data

▫ Managing and Analyzing Very Large Scientific Data

▫ infrastructure monitoring, motion capture, genome sequences, fMRI (neuroscience), astronomy

  • streaming data
  • heterogeneous data
  • private data
  • uncertain data

slide-4
SLIDE 4

Our Work

  • large scale data
  • streaming data

▫ Real Time Analysis of Data Streams

▫ continuous monitoring, online pattern identification

  • heterogeneous data
  • private data
  • uncertain data

slide-5
SLIDE 5

Our Work

  • large scale data
  • streaming data
  • heterogeneous data

▫ Fuse Data from Different Sources

▫ entity resolution, query answering using knowledge graphs

▫ subjectivity analysis

  • private data
  • uncertain data

slide-6
SLIDE 6

Our Work

  • large scale data
  • streaming data
  • heterogeneous data
  • private data
  • uncertain data

▫ Processing and Mining Uncertain Data

▫ uncertain data series (e.g., sensor measurements)

▫ uncertain graphs (e.g., biological networks)

slide-7
SLIDE 7

entity resolution in large, heterogeneous data spaces

slide-9
SLIDE 9

Entity Resolution in Large, Heterogeneous Data Spaces

problem

develop a framework and techniques for entity resolution in very large and highly heterogeneous data spaces (i.e., loose schema binding, high levels of heterogeneity and noise, missing attribute names or values)

scale to web size

applications:

  • web-scale data integration

▫ “which entities in these two web datasets are the same?”

▫ entity resolution for heterogeneous web data

  • query answering

▫ return a set of unique entities in response to a user query

▫ produce high-quality results

slide-12
SLIDE 12

Our Work

novel blocking techniques that are resilient to heterogeneity can be the basis for efficient entity resolution

develop block building methods that lead to blocks with a low number of missed matches (high recall), and block processing methods that reduce the number of required pairwise entity comparisons (high efficiency)

we propose a framework for entity resolution in heterogeneous data spaces at web scale

efficient and effective algorithms for:

  • blocking
  • block purging
  • duplicates propagation
  • block scheduling
  • block pruning
  • comparisons propagation
  • comparisons pruning

Tutorial, links for Papers, Demo, Code, Datasets:

http://www.mi.parisdescartes.fr/~themisp/publications/PapadakisPalpanas-TutorialScaDS-LeipsigSummerSchool2016v2.pptx
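
To make the idea of heterogeneity-resilient blocking concrete, here is a minimal Python sketch of schema-agnostic token blocking followed by a simple size-based purge. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy profiles, and the `max_size` parameter are all hypothetical.

```python
import re
from collections import defaultdict


def token_blocking(profiles):
    """Put two profiles in the same block whenever they share a token.

    Attribute names are ignored, so profiles with different schemas
    can still end up in a common block.
    """
    blocks = defaultdict(set)
    for pid, profile in profiles.items():
        for value in profile.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(pid)
    # A block with a single profile produces no comparison, so drop it.
    return {t: ids for t, ids in blocks.items() if len(ids) > 1}


def purge_blocks(blocks, max_size):
    """Size-based purge (assumed rule): drop blocks larger than max_size."""
    return {t: ids for t, ids in blocks.items() if len(ids) <= max_size}


def candidate_pairs(blocks):
    """Distinct profile pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs


# Toy profiles with deliberately different attribute names.
profiles = {
    "p1": {"name": "Apple iPhone 7", "maker": "Apple"},
    "p2": {"title": "iphone 7 by apple"},
    "p3": {"label": "Samsung Galaxy S8"},
    "p4": {"product": "galaxy s8", "brand": "samsung"},
}
blocks = token_blocking(profiles)
pairs = candidate_pairs(blocks)
print(sorted(pairs))
```

On this toy input only two candidate pairs remain, compared with the six pairs a brute-force comparison of four profiles would require.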

slide-13
SLIDE 13

What is the JedAI Toolkit?

JedAI can be used in three ways:

  • 1. As an open source library that implements numerous state-of-the-art methods for all steps of an established end-to-end ER workflow.
  • 2. As a desktop application for ER with an intuitive Graphical User Interface that is suitable for both expert and lay users.
  • 3. As a workbench for comparing all performance aspects of various (configurations of) end-to-end ER workflows.

slide-14
SLIDE 14

How does the JedAI Toolkit work?

JedAI implements the following schema-agnostic, end-to-end workflow for both Clean-Clean and Dirty ER:

  • Step 1, Data Reading: Reads files containing the entity profiles and the golden standard.
  • Step 2, Block Building: Creates overlapping blocks.
  • Step 3, Block Cleaning: Optional step that cleans blocks from useless comparisons (repeated, superfluous).
  • Step 4, Comparison Cleaning: Optional step that operates on the level of individual comparisons to remove the useless ones.
  • Step 5, Entity Matching: Executes all retained comparisons.
  • Step 6, Entity Clustering: Partitions the similarity graph into equivalence clusters.
  • Step 7, Evaluation & Storing: Stores and presents performance results w.r.t. numerous measures.
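
As an illustration of the Entity Clustering step, the sketch below partitions a similarity graph into equivalence clusters by taking connected components over the edges whose similarity reaches a threshold. Connected components is only one possible clustering method and is chosen here as an assumption; the function name, the threshold, and the scored pairs are hypothetical and are not taken from JedAI.

```python
def connected_components(scored_pairs, threshold):
    """Cluster entities linked by comparisons with similarity >= threshold."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving; unseen nodes start as singletons.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, similarity in scored_pairs:
        root_a, root_b = find(a), find(b)
        if similarity >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Output of a hypothetical Entity Matching step: (entity, entity, similarity).
scored = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.2), ("e", "f", 0.95)]
clusters = connected_components(scored, threshold=0.5)
print(clusters)
```

Entities that only appear in low-similarity comparisons, such as `d` here, end up in singleton clusters.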

slide-15
SLIDE 15

How is the JedAI Toolkit structured?

  • Modular architecture: one module per workflow step.
  • Extensible architecture (e.g., ontology matching)

slide-16
SLIDE 16

How can I build an ER workflow?

JedAI supports several established methods for each workflow step:

  • Step 1, Data Reading: Possible to read CSV, RDF/XML files & relational DBs in any combination!
  • Step 2, Block Building: Choose 1 out of 8 methods.
  • Step 3, Block Cleaning: Specify any combination of 3 (4) complementary methods for Dirty (Clean-Clean) ER.
  • Step 4, Comparison Cleaning: Choose 1 out of 7 methods (including Meta-blocking).
  • Step 5, Entity Matching: Combine 1 out of 2 methods with 12 textual representation models and 10 similarity measures.
  • Step 6, Entity Clustering: Choose 1 out of 6 methods for Dirty ER. For Clean-Clean ER, 1 method is available.
  • Step 7, Evaluation & Storing: Store results as a CSV file.
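
The Entity Matching step pairs a textual representation model with a similarity measure. As a rough illustration, the Python sketch below combines two simple representation models (token sets and character 3-grams) with Jaccard similarity. These particular choices and all function names are assumptions made for the example; the slides do not enumerate the 12 models and 10 measures, and this is not JedAI's API.

```python
def tokens(text):
    """Token-set representation: lower-cased, whitespace-separated words."""
    return set(text.lower().split())


def qgrams(text, q=3):
    """Character q-gram representation; short strings become one gram."""
    s = text.lower()
    grams = {s[i:i + q] for i in range(len(s) - q + 1)}
    return grams or {s}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


token_sim = jaccard(tokens("Apple iPhone 7"), tokens("iphone 7 by apple"))
gram_sim = jaccard(qgrams("night"), qgrams("nacht"))
print(token_sim, gram_sim)
```

The same pair of strings can score very differently under different representation models, which is one reason a workbench for comparing configurations is useful.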

slide-17
SLIDE 17

Which Blocking Methods are included?

Block Building:

  • Token Blocking
  • Sorted Neighborhood
  • Extended Sorted Neighborhood
  • Attribute Clustering
  • Q-Grams Blocking
  • Extended Q-Grams Blocking
  • Suffix Arrays
  • Extended Suffix Arrays

Block Cleaning:

  • Block Filtering
  • Size-based Block Purging
  • Cardinality-based Block Purging
  • Block Scheduling

Comparison Cleaning:

  • Comparison Propagation
  • Cardinality Edge Pruning (CEP)
  • Cardinality Node Pruning (CNP)
  • Weighted Edge Pruning (WEP)
  • Weighted Node Pruning (WNP)
  • Reciprocal CNP
  • Reciprocal WNP
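
To give a feel for the edge-pruning family of Comparison Cleaning methods, here is a simplified Python sketch in the spirit of Weighted Edge Pruning. It assumes that each candidate pair is weighted by the number of blocks the two profiles share, and that pairs weighted below the mean are discarded. Both the weighting scheme and the threshold are assumptions made for this example, and JedAI's actual implementation may differ.

```python
from collections import Counter
from itertools import combinations


def weighted_edge_pruning(blocks):
    """Keep only the candidate pairs whose weight reaches the mean weight."""
    weights = Counter()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1  # one more block shared by a and b
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean}


# Hypothetical blocks, keyed by token, as produced by a Block Building step.
toy_blocks = {
    "apple": {"p1", "p2"},
    "iphone": {"p1", "p2"},
    "7": {"p1", "p2", "p3"},
    "galaxy": {"p3", "p4"},
    "s8": {"p3", "p4"},
}
retained = weighted_edge_pruning(toy_blocks)
print(retained)
```

Here the pairs (p1, p3) and (p2, p3) share a single block, fall below the mean weight of 1.75, and are pruned, while the two pairs that share several blocks are retained.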

slide-18
SLIDE 18

Where can I find JedAI Toolkit?

  • Project website: http://jedai.scify.org .
  • GitHub repositories:

– JedAI Library: https://github.com/scify/JedAIToolkit .
– JedAI Desktop Application and Workbench: https://github.com/scify/jedai-ui .
– All code is implemented using Java 8.
– All code is publicly available under Apache License V2.0.

  • Documentation (slides, videos, etc.) available at https://github.com/scify/JedAIToolkit/tree/master/documentation .

  • When using JedAI, please cite:

George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas and Manolis Koubarakis: "JedAI: The Force behind Entity Resolution", in ESWC 2017.

slide-19
SLIDE 19

Which datasets are available for testing?

Clean-Clean ER (real); can be used for Dirty ER as well:

Dataset      | D1 Entities | D2 Entities
Abt-Buy      | 1,076       | 1,076
DBLP-ACM     | 2,616       | 2,294
DBLP-Scholar | 2,516       | 61,353
Amazon-GP    | 1,354       | 3,039
Movies       | 27,615      | 23,182
DBPedia      | 1,190,733   | 2,164,040

Dirty ER (synthetic):

Dataset | Entities
10K     | 10,000
50K     | 50,000
100K    | 100,000
200K    | 200,000
300K    | 300,000
1M      | 1,000,000
2M      | 2,000,000

Several datasets are available for testing at https://github.com/scify/JedAIToolkit .

slide-20
SLIDE 20

exemplar queries: query answering using examples and knowledge graphs

slide-22
SLIDE 22

Exemplar Queries

problem

given an example element (subgraph) of interest, return a ranked set of similar elements

scale to full-size knowledge graphs, provide answers in real time

applications:

  • data exploration for non-expert users

▫ “find company acquisitions like the one of YouTube by Google”

▫ fast and easy discovery of facts with same semantics

  • complex similarity queries made easy

▫ “find other legal cases where the actors had relationships similar to this”

▫ pain-free information search for specialized users

slide-25
SLIDE 25

Our Work

formulation of and algorithms for the Exemplar Query problem

develop query answering methods that can efficiently prune the search space, and produce relevant and diverse results

we propose techniques for real-time Exemplar Query answering using real-world knowledge graphs

efficient and effective algorithms that use the following similarity measures:

  • subgraph isomorphism
  • simulation

Papers, Code, Datasets:

http://www.mi.parisdescartes.fr/~themisp/exemplarquery-ext/

slide-26
SLIDE 26

Traditional Query Answering

[Figure: a structured query (owns=Search Engine, based=California, produces=Mobiles) issued against a Database]

slide-27
SLIDE 27

Exemplar Queries

[Figure: an example edge labeled "acquired"; "Query???" posed to the Database]

Does not know how to search for other acquisitions

slide-28
SLIDE 28

A different need

slide-29
SLIDE 29

Existing Search Engines

Search query: “acquisitions like Google Youtube”

Yahoo!-Tumblr and Microsoft-Skype are not present among the results as interesting acquisitions.

slide-30
SLIDE 30

The Exemplar Query perspective

slide-31
SLIDE 31

Our Approach: Exemplar Queries

  • The user query is an indication of the structure of the answers

slide-32
SLIDE 32

General Solution

Input: User Query Q, an example of the expected results

Output: Set of expected results

Procedure:

  • Detect the sample for the query Q
  • Find the structures similar to the sample
  • Rank the results
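
The procedure above can be sketched in Python on a toy knowledge graph. This is a brute-force illustration under several assumptions: the sample is given directly as a list of triples (so the "detect the sample" step is trivial), similarity is taken to be a label-preserving, one-to-one match of the sample's edges, and ranking is omitted. The function names and the toy triples are hypothetical; a real system would need to prune the search space rather than enumerate it.

```python
def find_matches(graph, sample):
    """Enumerate one-to-one node mappings that preserve every sample edge."""
    answers = []

    def bind(mapping, q_node, g_node):
        # Bind a sample node to a graph node, keeping the mapping one-to-one.
        if q_node in mapping:
            return mapping[q_node] == g_node
        if g_node in mapping.values():
            return False
        mapping[q_node] = g_node
        return True

    def extend(i, mapping):
        if i == len(sample):
            answers.append(mapping)
            return
        q_src, label, q_dst = sample[i]
        for g_src, g_label, g_dst in sorted(graph):
            if g_label != label:
                continue
            candidate = dict(mapping)
            if bind(candidate, q_src, g_src) and bind(candidate, q_dst, g_dst):
                extend(i + 1, candidate)

    extend(0, {})
    return answers


# Toy knowledge graph as (subject, edge label, object) triples.
graph = {
    ("Google", "acquired", "YouTube"), ("Google", "based_in", "California"),
    ("Facebook", "acquired", "Instagram"), ("Facebook", "based_in", "California"),
    ("Microsoft", "acquired", "Skype"), ("Microsoft", "based_in", "Washington"),
}
# The sample: the user's example of the expected results.
sample = [("Google", "acquired", "YouTube"), ("Google", "based_in", "California")]

matches = find_matches(graph, sample)
# Drop the trivial answer that maps the sample onto itself.
similar = [m for m in matches if any(q != g for q, g in m.items())]
acquisitions = [(m["Google"], m["YouTube"]) for m in similar]
print(acquisitions)
```

Each answer maps the nodes of the sample to other nodes connected by the same edge labels, which is how the example itself conveys the structure of the desired answers.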

slide-33
SLIDE 33

Data Model: Knowledge graph

slide-34
SLIDE 34

Strict equality: Edge Isomorphism

[Figure: knowledge graph with regions labeled S, A1, A2]

slide-35
SLIDE 35

Strict equality: Edge Isomorphism

[Figure: knowledge graph with regions labeled S, A1, A2]

Why is Yahoo!-Tumblr not present?

slide-36
SLIDE 36

More freedom: Simulation

[Figure: knowledge graph with regions labeled S, A1, A2]

Tumblr matches both an acquisition and a website

Match edge-label sequences instead of structures
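
To illustrate why simulation gives more freedom than isomorphism, the Python sketch below computes a textbook dual simulation relation by fixpoint iteration: a graph node stays a candidate for a query node only while every query edge touching that query node can be followed through matching candidates. This is a generic definition chosen as an assumption; the exact simulation variant used in the authors' work may differ, and the node names and edge labels are hypothetical.

```python
def dual_simulation(query_edges, graph_edges):
    """Return, for each query node, the graph nodes that can simulate it."""
    q_nodes = {n for s, _, o in query_edges for n in (s, o)}
    g_nodes = {n for s, _, o in graph_edges for n in (s, o)}
    sim = {u: set(g_nodes) for u in q_nodes}  # start from all candidates

    changed = True
    while changed:
        changed = False
        for u, label, u2 in query_edges:
            # Graph edges with this label whose endpoints are still candidates.
            sources = {s for s, l, o in graph_edges
                       if l == label and s in sim[u] and o in sim[u2]}
            targets = {o for s, l, o in graph_edges
                       if l == label and s in sources and o in sim[u2]}
            before = (len(sim[u]), len(sim[u2]))
            sim[u] = sim[u] & sources
            sim[u2] = sim[u2] & targets
            if (len(sim[u]), len(sim[u2])) != before:
                changed = True
    return sim


# Query: S acquired A1, and S owns A2 (two distinct query nodes).
query = [("S", "acquired", "A1"), ("S", "owns", "A2")]
# Graph: node y is both acquired and owned by x; z only acquired w.
graph_edges = [("x", "acquired", "y"), ("x", "owns", "y"), ("z", "acquired", "w")]

sim = dual_simulation(query, graph_edges)
print(sim)
```

In this toy graph the single node `y` simulates both `A1` and `A2`. A one-to-one (isomorphic) match would reject `x` entirely, because `A1` and `A2` would have to map to two different nodes.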

slide-37
SLIDE 37

Ranking results

[Figure: User Query and graph regions S, A1, A2, with nodes Google, Yahoo!, CBS]

Combination of two factors

  • 1. Structural: similarity of two nodes in terms of neighbor relationships
  • 2. Distance-based: personalized PageRank starting from query nodes
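
The distance-based factor can be illustrated with a small power-iteration sketch of personalized PageRank in Python. It covers only that factor: the structural factor and the way the two are combined are not specified on the slide and are omitted here. The restart probability `alpha`, the iteration count, and the handling of nodes without outgoing edges are assumptions made for the example.

```python
def personalized_pagerank(edges, seeds, alpha=0.15, iterations=50):
    """Power iteration for PageRank with restarts to the seed (query) nodes."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)

    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            if out[n]:
                share = (1 - alpha) * rank[n] / len(out[n])
                for m in out[n]:
                    nxt[m] += share
            else:
                # Assumed rule: a node without out-edges returns its mass to the seeds.
                for m in nodes:
                    nxt[m] += (1 - alpha) * rank[n] * restart[m]
        rank = nxt
    return rank


# Toy directed path a -> b -> c -> d, with the query node a as the only seed.
path = [("a", "b"), ("b", "c"), ("c", "d")]
ranks = personalized_pagerank(path, seeds={"a"})
ordering = sorted(ranks, key=ranks.get, reverse=True)
print(ordering)
```

Nodes closer to the query node receive a higher score, so candidate answers near the user's example would be preferred by this factor.
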
slide-38
SLIDE 38

Simulation vs Isomorphism

[Figure: two log-scale plots against τ (0.005, 0.01). Count (k): Found Answers ISO, Found Answers SIM, Visited Edges, Visited Vertices. Time (s): Total Time SIM, Total Time ISO.]

Analysis

  • Simulation finds more answers (up to 48%) but aggregates results
  • Isomorphism runs faster than simulation (fewer operations on simple queries)
slide-39
SLIDE 39

Qualitative Evaluation

Query: Google – YouTube – Menlo Park

Compared methods: Approximate Graph Query Answering [Khan13], Edge Isomorphism, Simulation

Figure annotations: “Answers are collapsed”, “More interesting answers”

slide-42
SLIDE 42

Open Research Directions

for Entity Resolution

  • develop out-of-the-box solutions

▫ automatic tuning

  • that are easy to use

▫ guide the user to choose among the alternatives

  • and that can cope with big data characteristics

▫ volume, variety, velocity

for Exemplar Queries

  • extend to multiple exemplar queries in the input
  • take into account user preferences
  • employ more semantics

for our Dagstuhl Seminar

  • think of applications that can be built on top of a federated semantic data management system, and the associated requirements/challenges…

slide-43
SLIDE 43

Data-Intensive and Knowledge-Oriented systems

slide-44
SLIDE 44

thank you!

google: Themis Palpanas