Mining Large Datasets: Case of Mining Graph Data in the Cloud - PowerPoint PPT Presentation

Mining Large Datasets: Case of Mining Graph Data in the Cloud
Sabeur Aridhi, PhD in Computer Science
with Laurent d’Orazio, Mondher Maddouri and Engelbert Mephu Nguifo
16/05/2014
Sabeur Aridhi - Mining Large Datasets - Big Data Forum - Lyon

Context and motivations

Application domains:
- Computer networks
- Social networks
- Bioinformatics
- Chemoinformatics

Graph representation:
- Data modeling.
- Identifying relationship patterns and rules.

[Figures: protein structure, chemical compound, social network]

Mining graph data

Graph mining aims to find patterns, hidden relations and behaviors in data.

Graph mining goals:
- Computing graph properties: density, diameter, radius, ...
- Mining substructures from graph databases.
  - Substructures: paths, trees, subgraphs.
  - Frequent Subgraph Mining (FSM) task.
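
The diameter and radius listed among the graph properties can be sketched with breadth-first search. This is only an illustration and not taken from the talk: it assumes an unweighted, undirected, connected graph stored as an adjacency dictionary, and the names `eccentricity`, `diameter_and_radius` and `path` are hypothetical.

```python
from collections import deque


def eccentricity(adj, source):
    """Largest BFS distance from `source` to any other node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Assumption: the graph is connected; otherwise distances are undefined.
    if len(dist) != len(adj):
        raise ValueError("graph is not connected")
    return max(dist.values())


def diameter_and_radius(adj):
    """Diameter = largest eccentricity, radius = smallest eccentricity."""
    ecc = [eccentricity(adj, u) for u in adj]
    return max(ecc), min(ecc)


# Hypothetical example: a path on four nodes, 1 - 2 - 3 - 4.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
result = diameter_and_radius(path)
```

Running one BFS per node costs O(|V| * (|V| + |E|)), which is fine for small graphs but is one reason such properties become expensive on the large datasets the talk targets.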

Availability of graph data

Exponential growth in both the size and the number of graphs in databases.

Availability of graph data sources:
- The Protein Data Bank (PDB) contains 95,280 protein 3D structures.
- Facebook loads 60 terabytes of new data every day [Thusoo 2010].
- Google processes 20 petabytes of data per day [Dean 2008].

3Vs of Big Data (Volume, Velocity and Variety).

Availability of cloud computing environments.

In this work

We are interested in frequent subgraph mining (FSM) from graph databases.

Frequent subgraph mining algorithms:
- Various FSM approaches exist.
- Existing approaches are mainly:
  - tested on centralized computing systems;
  - evaluated on relatively small databases.
- Few works address FSM in the cloud.

Goals

Questions:
- Distributed FSM from a large graph database.
- Data/computation distribution.
- Tuning cloud parameters.

Outline

1 Background
  - Graph mining
  - Cloud computing
  - Frameworks for large data processing in the cloud
  - Related works
2 Contributions
  - Distributed subgraph mining in the cloud
3 Conclusion
  - Contributions
  - Prospects

Background

Graph: a graph is denoted as $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of edges.

Subgraph: a graph $G' = (V', E')$ is a subgraph of another graph $G = (V, E)$ iff $V' \subseteq V$ and $E' \subseteq E \cap (V' \times V')$.

Density: the density of a graph $G = (V, E)$ is calculated by $\mathrm{density}(G) = \frac{2 \cdot |E|}{|V| \cdot (|V| - 1)}$.
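
The subgraph and density definitions above translate almost directly into code. The following is a minimal sketch, assuming an undirected simple graph (which is what the factor 2 in the density formula suggests); the helper names `normalize`, `density` and `is_subgraph` are hypothetical and not from the talk.

```python
def normalize(edges):
    """Store each undirected edge as a frozenset so (u, v) equals (v, u)."""
    return {frozenset(e) for e in edges}


def density(nodes, edges):
    """density(G) = 2|E| / (|V| (|V| - 1)); defined here as 0.0 when |V| < 2."""
    n = len(set(nodes))
    if n < 2:
        return 0.0
    return 2 * len(normalize(edges)) / (n * (n - 1))


def is_subgraph(sub_nodes, sub_edges, nodes, edges):
    """Check V' subset of V and E' subset of E restricted to V' x V'."""
    sub_nodes, nodes = set(sub_nodes), set(nodes)
    sub_e, all_e = normalize(sub_edges), normalize(edges)
    return (
        sub_nodes <= nodes
        and sub_e <= all_e
        # Every edge of the subgraph must join two nodes of the subgraph.
        and all(edge <= sub_nodes for edge in sub_e)
    )


# Hypothetical example: a triangle on nodes 1, 2, 3.
triangle_nodes = {1, 2, 3}
triangle_edges = [(1, 2), (2, 3), (1, 3)]
triangle_density = density(triangle_nodes, triangle_edges)
```

A complete graph such as the triangle has density 1, and removing edges lowers the value toward 0.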

Cloud computing

- A large number of computers connected via the Internet.
- Applications delivered as services.
- Hardware and system software delivered as services.
- Pay as you go.
- Cloud services can be rapidly and elastically provisioned.

Service models

- Software as a Service (SaaS)
- Platform as a Service (PaaS)
- Infrastructure as a Service (IaaS)

MapReduce framework

- A framework for processing huge datasets.
- Runs on a large number of computers and handles task/node failures.
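
The MapReduce programming model can be illustrated with a single-process simulation of its map, shuffle and reduce phases. This is a sketch under stated assumptions, not the Hadoop API and not the method presented in the talk: the toy graph database, the edge-label counting task and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are all hypothetical, and distribution and fault tolerance are not modeled.

```python
from collections import defaultdict


def map_phase(graph_id, edges):
    """Emit (edge_pattern, 1) once per distinct labeled edge in one graph."""
    seen = set()
    for u_label, v_label in edges:
        key = tuple(sorted((u_label, v_label)))  # undirected: (C, O) == (O, C)
        if key not in seen:
            seen.add(key)
            yield key, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts: the number of graphs that contain this edge pattern."""
    return key, sum(values)


def run_job(database):
    pairs = [kv for gid, edges in database.items() for kv in map_phase(gid, edges)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


# Hypothetical toy database: each graph is a list of edges given by node labels.
database = {
    "g1": [("C", "O"), ("C", "C")],
    "g2": [("O", "C"), ("C", "N")],
    "g3": [("C", "C"), ("C", "C")],
}
counts = run_job(database)
```

In a real deployment each `map_phase` call would run on the node holding that part of the data, and the framework would re-run failed tasks. Here everything runs sequentially to keep the data flow visible.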

SPARK framework

- A general engine for large-scale data processing.
- Combines SQL, streaming, and complex analytics.
- Offers several high-level operators that make it easy to build parallel applications.

SHARK framework

- A distributed SQL query engine for Hadoop.
- Based on SPARK; uses the existing Hive client and metastore.
