
Online and Scalable Semantic Data Analytics

Themis Palpanas, Paris Descartes University, Institut Universitaire de France. Federated Semantic Data Management Seminar, Dagstuhl, June 2017.


  1. Online and Scalable Semantic Data Analytics
     Themis Palpanas, Paris Descartes University, Institut Universitaire de France
     Federated Semantic Data Management Seminar, Dagstuhl, June 2017

  2. Our Work
     • large scale data
     • streaming data
     • heterogeneous data
     • private data
     • uncertain data
     Funded by: European Commission, CNRS, Facebook, IBM Research, FMJH, Inria, Hewlett-Packard Labs, Telecom Italia, Autonomous Province of Trento
     Themis Palpanas - June 2017

  3. Our Work
     • large scale data
       ▫ Managing and Analyzing Very Large Scientific Data
       ▫ infrastructure monitoring, motion capture, genome sequences, fMRI (neuroscience), astronomy
     • streaming data
     • heterogeneous data
     • private data
     • uncertain data

  4. Our Work
     • large scale data
     • streaming data
       ▫ Real Time Analysis of Data Streams
       ▫ continuous monitoring, online pattern identification
     • heterogeneous data
     • private data
     • uncertain data

  5. Our Work
     • large scale data
     • streaming data
     • heterogeneous data
       ▫ Fuse Data from Different Sources
       ▫ entity resolution, query answering using knowledge graphs, subjectivity analysis
     • private data
     • uncertain data

  6. Our Work
     • large scale data
     • streaming data
     • heterogeneous data
     • private data
     • uncertain data
       ▫ Processing and Mining Uncertain Data
       ▫ uncertain data series (e.g., sensor measurements)
       ▫ uncertain graphs (e.g., biological networks)

  7. Entity resolution in large, heterogeneous data spaces

  8. Entity Resolution in Large, Heterogeneous Data Spaces
     Problem:
     • develop a framework and techniques for entity resolution in very large and highly heterogeneous data spaces (i.e., loose schema binding, high levels of heterogeneity and noise, missing attribute names or values)
     • scale to web size

  9. Entity Resolution in Large, Heterogeneous Data Spaces
     Problem:
     • develop a framework and techniques for entity resolution in very large and highly heterogeneous data spaces (i.e., loose schema binding, high levels of heterogeneity and noise, missing attribute names or values)
     • scale to web size
     Applications:
     • web-scale data integration
       ▫ “which entities in these two web datasets are the same?”
       ▫ entity resolution for heterogeneous web data
     • query answering
       ▫ return a set of unique entities in response to a user query
       ▫ produce high-quality results

  10. Our Work
      • novel blocking techniques that are resilient to heterogeneity can be the basis for efficient entity resolution
      • develop block building methods that lead to blocks with a low number of missed matches (high recall), and block processing methods that reduce the number of required pair-wise entity comparisons (high efficiency)

  11. Our Work
      • novel blocking techniques that are resilient to heterogeneity can be the basis for efficient entity resolution
      • develop block building methods that lead to blocks with a low number of missed matches (high recall), and block processing methods that reduce the number of required pair-wise entity comparisons (high efficiency)
      • we propose a framework for entity resolution in heterogeneous data spaces at web scale
      • efficient and effective algorithms for:
        ▫ blocking
        ▫ block purging
        ▫ duplicates propagation
        ▫ block scheduling
        ▫ block pruning
        ▫ comparisons propagation
        ▫ comparisons pruning

  12. Our Work
      • novel blocking techniques that are resilient to heterogeneity can be the basis for efficient entity resolution
      • develop block building methods that lead to blocks with a low number of missed matches (high recall), and block processing methods that reduce the number of required pair-wise entity comparisons (high efficiency)
      • we propose a framework for entity resolution in heterogeneous data spaces at web scale
      • efficient and effective algorithms for:
        ▫ blocking
        ▫ block purging
        ▫ duplicates propagation
        ▫ block scheduling
        ▫ block pruning
        ▫ comparisons propagation
        ▫ comparisons pruning
      • Tutorial, links for Papers, Demo, Code, Datasets: http://www.mi.parisdescartes.fr/~themisp/publications/PapadakisPalpanas-TutorialScaDS-LeipsigSummerSchool2016v2.pptx
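
The blocking idea on this slide can be illustrated with a minimal Python sketch. It assumes the simplest schema-agnostic variant (one block per token, regardless of attribute name) followed by size-based purging; the function names, the toy profiles, and the `max_block_size` parameter are hypothetical and are not taken from the talk or from any library.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(profiles):
    """Schema-agnostic blocking: one block per token found in any attribute value."""
    blocks = defaultdict(set)
    for entity_id, attributes in profiles.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(entity_id)
    # A block with a single entity yields no comparisons, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def purge_blocks(blocks, max_block_size):
    """Size-based purging: discard oversized blocks (often stop-word-like tokens)."""
    return {token: ids for token, ids in blocks.items() if len(ids) <= max_block_size}


def candidate_pairs(blocks):
    """Distinct entity pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy profiles with different attribute names (loose schema binding).
profiles = {
    "e1": {"name": "Apple iPhone 7", "type": "smartphone"},
    "e2": {"title": "iphone 7 by apple", "category": "smartphone"},
    "e3": {"name": "Samsung Galaxy S7", "type": "smartphone"},
    "e4": {"label": "galaxy s7 samsung smartphone"},
}

blocks = token_blocking(profiles)
pairs = candidate_pairs(purge_blocks(blocks, max_block_size=2))
```

On this toy input the unpurged blocks suggest all 6 possible pairs, because every profile contains the token "smartphone"; after purging that oversized block only the pairs (e1, e2) and (e3, e4) remain. The attribute names never need to agree, which is why this style of blocking tolerates heterogeneous schemas.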

  13. What is the JedAI Toolkit?
      JedAI can be used in three ways:
      1. As an open source library that implements numerous state-of-the-art methods for all steps of an established end-to-end ER workflow.
      2. As a desktop application for ER with an intuitive Graphical User Interface that is suitable for both expert and lay users.
      3. As a workbench for comparing all performance aspects of various (configurations of) end-to-end ER workflows.

  14. How does the JedAI Toolkit work?
      JedAI implements the following schema-agnostic, end-to-end workflow for both Clean-Clean and Dirty ER:
      Step 1, Data Reading: Reads files containing the entity profiles and the golden standard.
      Step 2, Block Building: Creates overlapping blocks.
      Step 3, Block Cleaning: Optional step that cleans blocks from useless comparisons (repeated, superfluous).
      Step 4, Comparison Cleaning: Optional step that operates on the level of individual comparisons to remove the useless ones.
      Step 5, Entity Matching: Executes all retained comparisons.
      Step 6, Entity Clustering: Partitions the similarity graph into equivalence clusters.
      Step 7, Evaluation & Storing: Stores and presents the performance results w.r.t. numerous measures.
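
The seven steps can be walked through with a compact Python sketch. This is not the JedAI API: every name here (`tokens`, `run_workflow`, `max_block_size`, `threshold`) is hypothetical, and each step uses the simplest stand-in technique (token blocking, size-based purging, Jaccard similarity, connected components), chosen for brevity under the assumption of Dirty ER over a single small collection.

```python
from collections import defaultdict
from itertools import combinations


def tokens(profile):
    """Schema-agnostic view of a profile: the set of tokens in all its values."""
    return {t for value in profile.values() for t in str(value).lower().split()}


def run_workflow(profiles, ground_truth, max_block_size=3, threshold=0.5):
    # Step 1, Data Reading: profiles and ground truth are passed in memory here.
    token_sets = {eid: tokens(p) for eid, p in profiles.items()}

    # Step 2, Block Building: one block per token.
    blocks = defaultdict(set)
    for eid, toks in token_sets.items():
        for t in toks:
            blocks[t].add(eid)

    # Step 3, Block Cleaning: drop singleton and oversized blocks.
    kept = [ids for ids in blocks.values() if 1 < len(ids) <= max_block_size]

    # Step 4, Comparison Cleaning: a set removes repeated comparisons.
    comparisons = {pair for ids in kept for pair in combinations(sorted(ids), 2)}

    # Step 5, Entity Matching: Jaccard similarity over token sets.
    matches = []
    for a, b in comparisons:
        sim = len(token_sets[a] & token_sets[b]) / len(token_sets[a] | token_sets[b])
        if sim >= threshold:
            matches.append((a, b))

    # Step 6, Entity Clustering: connected components via union-find.
    parent = {eid: eid for eid in profiles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)
    clusters = defaultdict(set)
    for eid in profiles:
        clusters[find(eid)].add(eid)

    # Step 7, Evaluation: precision and recall of the pairs implied by the clusters.
    predicted = {p for ids in clusters.values() for p in combinations(sorted(ids), 2)}
    true_pos = len(predicted & ground_truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    return sorted(sorted(c) for c in clusters.values()), precision, recall


profiles = {
    "e1": {"name": "Apple iPhone 7", "type": "smartphone"},
    "e2": {"title": "iphone 7 by apple", "category": "smartphone"},
    "e3": {"name": "Samsung Galaxy S7", "type": "smartphone"},
    "e4": {"label": "galaxy s7 samsung smartphone"},
}
ground_truth = {("e1", "e2"), ("e3", "e4")}
clusters, precision, recall = run_workflow(profiles, ground_truth)
```

For this toy input the sketch produces the clusters [e1, e2] and [e3, e4] with precision and recall of 1.0. Steps 3 and 4 only affect how many comparisons reach Step 5, which matches their description as optional cleaning steps.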

  15. How is the JedAI Toolkit structured?
      • Modular architecture: one module per workflow step.
      • Extensible architecture (e.g., ontology matching)

  16. How can I build an ER workflow?
      JedAI supports several established methods for each workflow step:
      Step 1, Data Reading: Possible to read CSV, RDF/XML files & relational DBs in any combination!
      Step 2, Block Building: Choose 1 out of 8 methods.
      Step 3, Block Cleaning: Specify any combination of 3 (4) complementary methods for Dirty (Clean-Clean) ER.
      Step 4, Comparison Cleaning: Choose 1 out of 7 methods (including Meta-blocking).
      Step 5, Entity Matching: Combine 1 out of 2 methods with 12 textual representation models and 10 similarity measures.
      Step 6, Entity Clustering: Choose 1 out of 6 methods for Dirty ER. For Clean-Clean ER, 1 method is available.
      Step 7, Evaluation & Storing: Store results as a CSV file.

  17. Which Blocking Methods are included?
      Block Building:
      • Token Blocking
      • Sorted Neighborhood
      • Extended Sorted Neighborhood
      • Attribute Clustering
      • Q-Grams Blocking
      • Extended Q-Grams Blocking
      • Suffix Arrays
      • Extended Suffix Arrays
      Block Cleaning:
      • Block Filtering
      • Size-based Block Purging
      • Cardinality-based Block Purging
      • Block Scheduling
      Comparison Cleaning:
      • Comparison Propagation
      • Cardinality Edge Pruning (CEP)
      • Cardinality Node Pruning (CNP)
      • Weighted Edge Pruning (WEP)
      • Weighted Node Pruning (WNP)
      • Reciprocal CNP
      • Reciprocal WNP
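
As an illustration of one Comparison Cleaning method, the sketch below shows the general idea behind Weighted Edge Pruning in Python: treat every candidate comparison as a weighted edge and keep only edges whose weight reaches a global threshold. Two assumptions are made for brevity and are not taken from the slides: the edge weight is the number of blocks two entities share, and the threshold is the mean edge weight. The function name and the toy blocks are hypothetical, and JedAI's own implementation may differ.

```python
from collections import Counter
from itertools import combinations


def weighted_edge_pruning(blocks):
    """Keep comparisons whose weight is at least the mean edge weight.

    Edge weight = number of blocks the two entities share (an assumed,
    simple weighting scheme).
    """
    weights = Counter()
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            weights[pair] += 1
    if not weights:
        return {}
    mean_weight = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean_weight}


blocks = {
    "apple": {"e1", "e2"},
    "iphone": {"e1", "e2"},
    "7": {"e1", "e2"},
    "samsung": {"e3", "e4"},
    "galaxy": {"e3", "e4"},
    "smartphone": {"e1", "e2", "e3", "e4"},
}
retained = weighted_edge_pruning(blocks)
```

Here the six candidate edges have weights 4, 3, 1, 1, 1, 1, so the mean is 11/6 and only (e1, e2) and (e3, e4) are retained. Unlike block purging, the oversized "smartphone" block is kept, but the weak comparisons it introduces are pruned individually.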

  18. Where can I find JedAI Toolkit?
      • Project website: http://jedai.scify.org
      • GitHub repositories:
        – JedAI Library: https://github.com/scify/JedAIToolkit
        – JedAI Desktop Application and Workbench: https://github.com/scify/jedai-ui
        – All code is implemented using Java 8.
        – All code is publicly available under Apache License V2.0.
      • Documentation (slides, videos, etc.) available at https://github.com/scify/JedAIToolkit/tree/master/documentation
      • When using JedAI, please cite: George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas and Manolis Koubarakis: "JedAI: The Force behind Entity Resolution", in ESWC 2017.

  19. Which datasets are available for testing?
      Several datasets are available for testing at https://github.com/scify/JedAIToolkit
      Clean-Clean ER (real datasets; can be used for Dirty ER, as well):
        Dataset         D1 Entities   D2 Entities
        Abt-Buy               1,076         1,076
        DBLP-ACM              2,616         2,294
        DBLP-Scholar          2,516        61,353
        Amazon-GP             1,354         3,039
        Movies               27,615        23,182
        DBPedia           1,190,733     2,164,040
      Dirty ER (synthetic datasets):
        Dataset    Entities
        10K          10,000
        50K          50,000
        100K        100,000
        200K        200,000
        300K        300,000
        1M        1,000,000
        2M        2,000,000

  20. Exemplar queries: query answering using examples and knowledge graphs

  21. Exemplar Queries
      Problem:
      • given an example element (subgraph) of interest, return a ranked set of similar elements
      • scale to full-size knowledge graphs, provide answers in real-time

  22. Exemplar Queries
      Problem:
      • given an example element (subgraph) of interest, return a ranked set of similar elements
      • scale to full-size knowledge graphs, provide answers in real-time
      Applications:
      • data exploration for non-expert users
        ▫ “find company acquisitions like the one of YouTube by Google”
        ▫ fast and easy discovery of facts with same semantics
      • complex similarity queries made easy
        ▫ “find other legal cases where the actors had relationships similar to this”
        ▫ pain-free information search for specialized users
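
To make the "YouTube by Google" example concrete, here is a deliberately simplified Python sketch. It assumes the example is a single edge and treats another edge as similar when it has the same label and the same endpoint types; matching of larger example subgraphs and the ranking of answers, both mentioned on the slide, are left out. The function name, the tiny graph, and the type labels are hypothetical illustrations, not the authors' algorithm.

```python
def similar_edges(triples, node_types, example):
    """Return triples sharing the example's predicate and endpoint types."""
    subj, pred, obj = example
    wanted = (node_types[subj], pred, node_types[obj])
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (s, p, o) != example and (node_types[s], p, node_types[o]) == wanted
    ]


# Toy knowledge graph: typed nodes and labeled edges.
node_types = {
    "Google": "Company",
    "YouTube": "Company",
    "Facebook": "Company",
    "Instagram": "Company",
    "Larry Page": "Person",
}
triples = [
    ("Google", "acquired", "YouTube"),
    ("Facebook", "acquired", "Instagram"),
    ("Larry Page", "founded", "Google"),
]

answers = similar_edges(triples, node_types, ("Google", "acquired", "YouTube"))
```

With this input the only answer is the Facebook and Instagram acquisition; the "founded" edge is rejected because both its label and its subject type differ from the example. The user never writes a structured query, which is the point of querying by example.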
