
Online and Scalable Semantic Data Analytics
Themis Palpanas
Paris Descartes University, Institut Universitaire de France
Federated Semantic Data Management Seminar
Dagstuhl, June 2017

Our Work
• large scale data
  ▫ Managing and Analyzing Very Large Scientific Data
  ▫ infrastructure monitoring, motion capture, genome sequences, fMRI (neuroscience), astronomy
• streaming data
  ▫ Real Time Analysis of Data Streams
  ▫ continuous monitoring, online pattern identification
• heterogeneous data
  ▫ Fuse Data from Different Sources
  ▫ entity resolution, query answering using knowledge graphs, subjectivity analysis
• private data
• uncertain data
  ▫ Processing and Mining Uncertain Data
  ▫ uncertain data series (e.g., sensor measurements)
  ▫ uncertain graphs (e.g., biological networks)
funded by: European Commission, CNRS, Facebook, IBM Research, FMJH, Inria, Hewlett-Packard Labs, Telecom Italia, Autonomous Province of Trento
Themis Palpanas - June 2017

Entity resolution in large, heterogeneous data spaces

Entity Resolution in Large, Heterogeneous Data Spaces
problem
• develop a framework and techniques for entity resolution in very large and highly heterogeneous data spaces (i.e., loose schema binding, high levels of heterogeneity and noise, missing attribute names or values)
• scale to web size
applications:
• web-scale data integration
  ▫ “which entities in these two web datasets are the same?”
  ▫ entity resolution for heterogeneous web data
• query answering
  ▫ return a set of unique entities in response to a user query
  ▫ produce high-quality results

Our Work
• novel blocking techniques that are resilient to heterogeneity can be the basis for efficient entity resolution
• develop block building methods that lead to blocks with a low number of missed matches (high recall), and block processing methods that reduce the number of required pair-wise entity comparisons (high efficiency)
• we propose a framework for entity resolution in heterogeneous data spaces at web scale
• efficient and effective algorithms for:
  ▫ blocking
  ▫ block purging
  ▫ duplicates propagation
  ▫ block scheduling
  ▫ block pruning
  ▫ comparisons propagation
  ▫ comparisons pruning
Tutorial, links for Papers, Demo, Code, Datasets: http://www.mi.parisdescartes.fr/~themisp/publications/PapadakisPalpanas-TutorialScaDS-LeipsigSummerSchool2016v2.pptx
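To make the recall/efficiency trade-off above concrete, here is a minimal sketch of schema-agnostic token blocking followed by size-based block purging. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `max_size` threshold, and the toy profiles are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import re


def token_blocking(profiles):
    """Map every token found in any attribute value to the ids of the profiles containing it."""
    blocks = defaultdict(set)
    for pid, profile in profiles.items():
        for value in profile.values():
            # Attribute names are ignored, so differing schemas do not matter.
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(pid)
    # A block with a single profile yields no comparison, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def purge_blocks(blocks, max_size):
    """Size-based purging: discard blocks larger than max_size (a hypothetical threshold)."""
    return {tok: ids for tok, ids in blocks.items() if len(ids) <= max_size}


def candidate_pairs(blocks):
    """Distinct profile pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy profiles (made up): a1/a2 and a3/a4 describe the same products
# under different attribute names.
profiles = {
    "a1": {"name": "Canon EOS 5D camera", "price": "2500"},
    "a2": {"title": "EOS 5D by Canon"},
    "a3": {"name": "Nikon D750 camera"},
    "a4": {"label": "nikon d750", "colour": "black"},
}

blocks = purge_blocks(token_blocking(profiles), max_size=3)
pairs = candidate_pairs(blocks)
brute_force = len(profiles) * (len(profiles) - 1) // 2
print(f"{len(pairs)} candidate comparisons instead of {brute_force}")
```

Both true matches survive as candidate pairs (recall is preserved on this toy input) while half of the brute-force comparisons are never executed.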

What is the JedAI Toolkit?
JedAI can be used in three ways:
1. As an open source library that implements numerous state-of-the-art methods for all steps of an established end-to-end ER workflow.
2. As a desktop application for ER with an intuitive Graphical User Interface that is suitable for both expert and lay users.
3. As a workbench for comparing all performance aspects of various (configurations of) end-to-end ER workflows.

How does the JedAI Toolkit work?
JedAI implements the following schema-agnostic, end-to-end workflow for both Clean-Clean and Dirty ER:
Step 1. Data Reading: Reads files containing entity profiles and the golden standard.
Step 2. Block Building: Creates overlapping blocks.
Step 3. Block Cleaning: Optional step that cleans blocks from useless comparisons (repeated, superfluous).
Step 4. Comparison Cleaning: Optional step that operates on the level of individual comparisons to remove the useless ones.
Step 5. Entity Matching: Executes all retained comparisons.
Step 6. Entity Clustering: Partitions the similarity graph into equivalence clusters.
Step 7. Evaluation & Storing: Stores and presents the performance results w.r.t. numerous measures.
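The workflow above can be sketched for Dirty ER in a few plain-Python functions. This is not the JedAI API: every function name, the Jaccard threshold, and the toy data are assumptions made for illustration. Step 1 is replaced by an in-memory dictionary and the optional Step 3 is skipped.

```python
from collections import defaultdict
from itertools import combinations
import re


def tokens(profile):
    """Schema-agnostic view of a profile: all tokens of all attribute values."""
    return {t for v in profile.values() for t in re.findall(r"\w+", str(v).lower())}


def build_blocks(profiles):
    """Step 2: one overlapping block per token."""
    blocks = defaultdict(set)
    for pid, profile in profiles.items():
        for tok in tokens(profile):
            blocks[tok].add(pid)
    return blocks


def clean_comparisons(blocks):
    """Step 4: keep each pair once, however many blocks repeat it."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def match_entities(profiles, pairs, threshold):
    """Step 5: execute the retained comparisons with Jaccard similarity."""
    sets = {pid: tokens(p) for pid, p in profiles.items()}
    matches = []
    for a, b in sorted(pairs):
        sim = len(sets[a] & sets[b]) / len(sets[a] | sets[b])
        if sim >= threshold:
            matches.append((a, b))
    return matches


def cluster_entities(ids, matches):
    """Step 6: connected components of the match graph (union-find)."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)
    clusters = defaultdict(set)
    for i in ids:
        clusters[find(i)].add(i)
    return sorted(sorted(c) for c in clusters.values())


def evaluate(clusters, golden_pairs):
    """Step 7: precision and recall of the clustered pairs w.r.t. the golden standard."""
    found = {pair for c in clusters for pair in combinations(c, 2)}
    tp = len(found & golden_pairs)
    precision = tp / len(found) if found else 0.0
    recall = tp / len(golden_pairs) if golden_pairs else 0.0
    return precision, recall


# Toy input standing in for Step 1 (made-up profiles and golden standard).
profiles = {
    "a1": {"name": "Canon EOS 5D camera", "price": "2500"},
    "a2": {"title": "EOS 5D by Canon"},
    "a3": {"name": "Nikon D750 camera"},
    "a4": {"label": "nikon d750", "colour": "black"},
}
golden = {("a1", "a2"), ("a3", "a4")}

pairs = clean_comparisons(build_blocks(profiles))
matches = match_entities(profiles, pairs, threshold=0.4)
clusters = cluster_entities(profiles, matches)
precision, recall = evaluate(clusters, golden)
print(clusters, precision, recall)
```

Keeping one function per step mirrors the one-module-per-step idea, so any single step can be swapped for a different method without touching the others.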

How is the JedAI Toolkit structured?
• Modular architecture: one module per workflow step.
• Extensible architecture (e.g., ontology matching)

How can I build an ER workflow?
JedAI supports several established methods for each workflow step:
Step 1. Data Reading: Possible to read CSV, RDF/XML files & relational DBs in any combination!
Step 2. Block Building: Choose 1 out of 8 methods.
Step 3. Block Cleaning: Specify any combination of 3 (4) complementary methods for Dirty (Clean-Clean) ER.
Step 4. Comparison Cleaning: Choose 1 out of 7 methods (including Meta-blocking).
Step 5. Entity Matching: Combine 1 out of 2 methods with 12 textual representation models and 10 similarity measures.
Step 6. Entity Clustering: Choose 1 out of 6 methods for Dirty ER. For Clean-Clean ER, 1 method is available.
Step 7. Evaluation & Storing: Store results as a CSV file.

Which Blocking Methods are included?
Block Building
• Token Blocking
• Sorted Neighborhood
• Extended Sorted Neighborhood
• Attribute Clustering
• Q-Grams Blocking
• Extended Q-Grams Blocking
• Suffix Arrays
• Extended Suffix Arrays
Block Cleaning
• Block Filtering
• Size-based Block Purging
• Cardinality-based Block Purging
• Block Scheduling
Comparison Cleaning
• Comparison Propagation
• Cardinality Edge Pruning (CEP)
• Cardinality Node Pruning (CNP)
• Weighted Edge Pruning (WEP)
• Weighted Node Pruning (WNP)
• Reciprocal CNP
• Reciprocal WNP
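As an illustration of the comparison cleaning family, the following is a minimal sketch of Weighted Edge Pruning in one common formulation: build a blocking graph whose edge weights count the blocks two profiles share, then keep only edges whose weight reaches the mean edge weight. The weighting scheme, the mean threshold, the function names, and the toy blocks are assumptions; JedAI's own implementation may differ.

```python
from collections import Counter
from itertools import combinations


def blocking_graph(blocks):
    """One edge per co-occurring pair, weighted by the number of shared blocks."""
    weights = Counter()
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            weights[pair] += 1
    return weights


def weighted_edge_pruning(weights):
    """Keep the edges whose weight is at least the mean edge weight."""
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean}


# Toy blocks (made up), e.g. as produced by token blocking.
blocks = {
    "canon": {"a1", "a2"},
    "eos": {"a1", "a2"},
    "5d": {"a1", "a2"},
    "camera": {"a1", "a3"},
    "nikon": {"a3", "a4"},
    "d750": {"a3", "a4"},
}

graph = blocking_graph(blocks)
retained = weighted_edge_pruning(graph)
print(retained)
```

The pair that shares only the generic token "camera" falls below the mean weight and is pruned, while the pairs sharing several tokens are retained for entity matching.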

Where can I find JedAI Toolkit?
• Project website: http://jedai.scify.org
• Github repositories:
  ▫ JedAI Library: https://github.com/scify/JedAIToolkit
  ▫ JedAI Desktop Application and Workbench: https://github.com/scify/jedai-ui
  ▫ All code is implemented using Java 8.
  ▫ All code is publicly available under Apache License V2.0.
• Documentation (slides, videos, etc.) available at https://github.com/scify/JedAIToolkit/tree/master/documentation
• When using JedAI, please cite: George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas and Manolis Koubarakis: "JedAI: The Force behind Entity Resolution", in ESWC 2017.

Which datasets are available for testing?
Several datasets are available for testing at https://github.com/scify/JedAIToolkit

Clean-Clean ER (real); can be used for Dirty ER as well.
Dataset         D1 Entities   D2 Entities
Abt-Buy               1,076         1,076
DBLP-ACM              2,616         2,294
DBLP-Scholar          2,516        61,353
Amazon-GP             1,354         3,039
Movies               27,615        23,182
DBPedia           1,190,733     2,164,040

Dirty ER (synthetic)
Dataset    Entities
10K          10,000
50K          50,000
100K        100,000
200K        200,000
300K        300,000
1M        1,000,000
2M        2,000,000

Exemplar queries: query answering using examples and knowledge graphs

Exemplar Queries
problem
• given an example element (subgraph) of interest, return a ranked set of similar elements
• scale to full-size knowledge graphs, provide answers in real-time
applications:
• data exploration for non-expert users
  ▫ “find company acquisitions like the one of YouTube by Google”
  ▫ fast and easy discovery of facts with same semantics
• complex similarity queries made easy
  ▫ “find other legal cases where the actors had relationships similar to this”
  ▫ pain-free information search for specialized users
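A minimal sketch of the idea, assuming that "similar" means an edge-label-preserving, one-to-one mapping of the example subgraph onto another part of the knowledge graph. The brute-force search below only works on tiny graphs and does no ranking; the function name and the toy triples are hypothetical, and the techniques that scale to full-size knowledge graphs are not shown.

```python
from itertools import permutations


def exemplar_answers(graph, example):
    """Return subgraphs of `graph` with the same shape and edge labels as `example`."""
    graph_edges = set(graph)
    example_edges = set(example)
    example_nodes = sorted({n for s, _, o in example for n in (s, o)})
    graph_nodes = sorted({n for s, _, o in graph for n in (s, o)})
    answers = set()
    # Try every injective assignment of example nodes to graph nodes.
    for image in permutations(graph_nodes, len(example_nodes)):
        mapping = dict(zip(example_nodes, image))
        mapped = {(mapping[s], p, mapping[o]) for s, p, o in example}
        # Keep the assignment if all mapped edges exist and it is not the example itself.
        if mapped <= graph_edges and mapped != example_edges:
            answers.add(tuple(sorted(mapped)))
    return sorted(answers)


# Toy knowledge graph as (subject, label, object) triples, for illustration only.
graph = [
    ("Google", "acquired", "YouTube"),
    ("Google", "headquartered_in", "California"),
    ("YouTube", "headquartered_in", "California"),
    ("Facebook", "acquired", "Instagram"),
    ("Facebook", "headquartered_in", "California"),
    ("Instagram", "headquartered_in", "California"),
    ("Microsoft", "acquired", "Skype"),
    ("Microsoft", "headquartered_in", "Washington"),
    ("Skype", "headquartered_in", "Luxembourg"),
]

# "Find company acquisitions like the one of YouTube by Google."
example = [
    ("Google", "acquired", "YouTube"),
    ("Google", "headquartered_in", "California"),
    ("YouTube", "headquartered_in", "California"),
]

answers = exemplar_answers(graph, example)
for answer in answers:
    print(answer)
```

On this toy graph the only answer is the Facebook and Instagram acquisition; the Microsoft and Skype acquisition is rejected because the two companies do not share a headquarters node, so the example's structure cannot be mapped onto it.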
