Towards a property graph generator for benchmarking



slide-1
SLIDE 1

Towards a property graph generator for benchmarking

Arnau Prat-Pérez Joan Guisado-Gámez Xavier Fernández-Salas Davide Basilio Bartolini Siegfried Depner Petr Koupy

slide-2
SLIDE 2

Why a property graph generator?

  • Graph-based analysis is becoming more and more popular

[Figure: logos of graph processing systems, including TOTEM and GraphMat]

slide-3
SLIDE 3

Why a property graph generator?

  • For the field to advance, many benchmarking initiatives have appeared

Graphalytics, Social Network Benchmark, LUBM, LinkBench, gMark

slide-4
SLIDE 4

Why a property graph generator?

  • Benchmarks need datasets, preferably real ones
slide-5
SLIDE 5

Why a property graph generator?

  • But ...
slide-6
SLIDE 6

Why a property graph generator?

  • But ...

OR

slide-7
SLIDE 7

Why a property graph generator?

  • Synthetic graph generators
  • However, each benchmark has specific data needs

Each benchmark designer implements their own generator

A time-consuming task that sometimes means reinventing the wheel

slide-8
SLIDE 8

Why a property graph generator?

  • A tool that, given some “graph specification”, produces a synthetic graph with the specified characteristics

  • DataSynth

https://github.com/DAMA-UPC/DataSynth

Written in Scala

Uses Apache Spark

slide-9
SLIDE 9

Architecture Overview

Frontend: DSL, Parser, Optimizer

Backend: Apache Spark Runtime

  • Scala-based DSL with extensive use of code generation
  • Execution plan: optimizations possible for certain types of graphs
  • State-of-the-art Big Data framework

slide-10
SLIDE 10

What features should DataSynth have?

  • But what characteristics should a property graph generator be able to reproduce?

slide-11
SLIDE 11

What features should DataSynth have?

  • But what characteristics should a property graph generator be able to reproduce?

Properties and correlations/dependencies between them

  • e.g. name is correlated with country
slide-12
SLIDE 12

What features should DataSynth have?

  • But what characteristics should a property graph generator be able to reproduce?

Varied structure

  • degree distributions
  • community structure
  • low diameter
  • large connected component
  • etc.

Properties and correlations/dependencies between them

  • e.g. name is correlated with country
slide-13
SLIDE 13

What features should DataSynth have?

  • But what characteristics should a property graph generator be able to reproduce?

Varied structure

  • degree distributions
  • community structure
  • low diameter
  • large connected component
  • etc.

Properties and correlations/dependencies between them

  • e.g. name is correlated with country

Property-structure correlations/dependencies

  • e.g. Chinese people tend to connect to Chinese people
  • represented as the probability P(X,Y) of observing X and Y on a randomly picked edge

slide-14
SLIDE 14

What features should DataSynth have?

  • But what characteristics should a property graph generator be able to reproduce?

Varied structure

  • degree distributions
  • community structure
  • low diameter
  • large connected component
  • etc.

Property-structure correlations/dependencies

  • e.g. Chinese people tend to connect to Chinese people
  • represented as the probability P(X,Y) of observing X and Y on a randomly picked edge

Properties and correlations/dependencies between them

  • e.g. name is correlated with country

SCALE
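To make the P(X,Y) notion concrete, here is a minimal sketch of how such a distribution could be measured on an existing graph. It is not DataSynth code: the function name, the toy edge list, and the choice to treat edges as undirected (so (X,Y) and (Y,X) count as the same pair) are assumptions made for illustration.

```python
from collections import Counter

def edge_joint_distribution(edges, prop):
    """Estimate P(X, Y) over unordered value pairs from an undirected edge list.

    edges: iterable of (u, v) node-id pairs
    prop:  dict mapping node id -> property value (e.g. country)
    """
    counts = Counter()
    for u, v in edges:
        # Sort the two values so (X, Y) and (Y, X) land in the same bucket.
        pair = tuple(sorted((prop[u], prop[v])))
        counts[pair] += 1
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

# Toy data (hypothetical): four people and four "knows" edges.
country = {1: "China", 2: "Japan", 3: "China", 4: "Germany"}
edges = [(1, 3), (1, 2), (3, 4), (2, 4)]
p = edge_joint_distribution(edges, country)
```

A generator that reproduces property-structure correlations would take a distribution like `p` as input and produce a graph whose measured distribution is close to it.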

slide-15
SLIDE 15

But...

  • Having a single algorithm that generates so many things seems too complex

Properties and property correlations

Realistic graph structure

Property-structure correlations

  • There are tens of metrics for measuring the structure of a graph. Which ones should we target? (The choice possibly depends on the algorithms used.)

slide-16
SLIDE 16

DataSynth's approach

[Figure: schema with node type Person (properties: Country, Name) and edge type knows (property: date), drawn above a TIME axis]

slide-17
SLIDE 17

DataSynth's approach

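[Figure: schema with node type Person (properties: Country, Name) and edge type knows (property: date)]

Id   Country
1    China
2    Japan
3    China
...  ...
17   Germany

Id   Name
1    Lee
2    Hiroshi
3    Yang
...  ...
17   Wolfgang

Steps along the TIME axis: node property generation, then structure generation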

slide-18
SLIDE 18

DataSynth's approach

[Figure: schema with node type Person (properties: Country, Name) and edge type knows (property: date)]

Id   Country
1    China
2    Japan
3    China
...  ...
17   Germany

Id   Name
1    Lee
2    Hiroshi
3    Yang
...  ...
17   Wolfgang

[Figure: generated graph structure with 17 nodes, labeled 1 to 17]

Steps along the TIME axis: node property generation, then structure generation

Matching: preserves the given joint probability distributions, e.g. P(China,China) ≈ 0.2

slide-19
SLIDE 19

DataSynth's approach

[Figure: schema with node type Person (properties: Country, Name) and edge type knows (property: date)]

Id   Country
1    China
2    Japan
3    China
...  ...
17   Germany

Id   Name
1    Lee
2    Hiroshi
3    Yang
...  ...
17   Wolfgang

[Figure: generated graph structure with 17 nodes, labeled 1 to 17]

Id   date
1    30/01/2015
2    4/06/2016
3    12/11/2016
...  ...
30   03/03/2017

Steps along the TIME axis: node property generation, then structure generation, then edge property generation

Matching: preserves the given joint probability distributions, e.g. P(China,China) ≈ 0.2
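The ordering of the steps can be sketched as a toy pipeline. This is only an illustration under stated assumptions: every function name is hypothetical, a uniform random graph stands in for a real structure generator, and the matching step is replaced by a random permutation that does not try to preserve any P(X,Y). The sizes (17 nodes, 30 edges) mirror the tables on the slide.

```python
import random

def generate_node_properties(num_nodes, rng):
    """Step 1: a property table for Person.country (id -> value)."""
    countries = ["China", "Japan", "Germany"]
    return {node_id: rng.choice(countries) for node_id in range(1, num_nodes + 1)}

def generate_structure(num_nodes, num_edges, rng):
    """Step 2: property-agnostic structure over anonymous nodes 0..n-1.
    A uniform random graph is a stand-in for a real generator."""
    edges = set()
    while len(edges) < num_edges:
        u, v = rng.sample(range(num_nodes), 2)  # two distinct nodes
        edges.add((min(u, v), max(u, v)))       # undirected, no duplicates
    return sorted(edges)

def match(num_nodes, rng):
    """Step 3 (placeholder): map structure nodes to entity ids.
    A random permutation; a real matcher would target a given P(X,Y)."""
    ids = list(range(1, num_nodes + 1))
    rng.shuffle(ids)
    return dict(enumerate(ids))

def generate_edge_properties(num_edges, rng):
    """Step 4: a property table for knows.date (edge id -> dd/mm/yyyy)."""
    return {
        edge_id: f"{rng.randint(1, 28):02d}/{rng.randint(1, 12):02d}/{rng.randint(2015, 2017)}"
        for edge_id in range(1, num_edges + 1)
    }

rng = random.Random(42)
country = generate_node_properties(17, rng)
structure = generate_structure(17, 30, rng)
mapping = match(17, rng)
knows = [(mapping[u], mapping[v]) for u, v in structure]
dates = generate_edge_properties(len(knows), rng)
```

Because the structure is generated without looking at properties, any structure generator can be plugged into step 2; all the property-awareness is concentrated in the matching step.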

slide-20
SLIDE 20

DataSynth's approach

  • Pros:

Accurate distributions of property values and correlations between properties

Does not limit us to a single way of generating the structure of a graph

  • We can use existing techniques and leave the door open to new contributions

Pay for what we get

  • Cons:

Relies heavily on a sophisticated matching approach to achieve accurate property-structure correlation

slide-21
SLIDE 21

Property Generation

  • We have a “Property Table” for each <type,property> pair
  • We use a similar technique to that proposed by Myriad [1]

Highly parallel

Allows in-place data generation

  • Given the Id of an entity, we can generate its properties

[1] Alexander Alexandrov, Kostas Tzoumas, and Volker Markl. 2012. Myriad: scalable and expressive data generation. PVLDB 5, 12 (2012), 1890–1893.
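The in-place idea can be illustrated with a small sketch: each value is a pure function of a seed and the entity Id, so any worker can compute any row of a property table without coordination. The hash-based seeding below is an assumption chosen for simplicity, not the mechanism Myriad or DataSynth actually uses, and all function names are hypothetical. The name and country values are the ones shown in the example tables.

```python
import hashlib
import random

COUNTRIES = ["China", "Japan", "Germany"]
NAMES_BY_COUNTRY = {
    "China": ["Lee", "Yang"],
    "Japan": ["Hiroshi"],
    "Germany": ["Wolfgang"],
}

def rng_for(seed, entity_type, prop, entity_id):
    """Derive an independent, reproducible RNG for one cell of a property table."""
    key = f"{seed}/{entity_type}/{prop}/{entity_id}".encode()
    digest = hashlib.sha256(key).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

def person_country(seed, person_id):
    """Country depends only on (seed, id)."""
    return rng_for(seed, "Person", "country", person_id).choice(COUNTRIES)

def person_name(seed, person_id):
    """Name is correlated with country: it is drawn from that country's
    name list, and the country is recomputed in place from the id."""
    country = person_country(seed, person_id)
    return rng_for(seed, "Person", "name", person_id).choice(NAMES_BY_COUNTRY[country])

# Any id can be generated directly, in any order, on any machine.
row = (12345, person_country(7, 12345), person_name(7, 12345))
```

Since nothing is stored, dependent properties such as the name simply recompute the properties they depend on.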

slide-22
SLIDE 22

Structure Generation

  • We can use existing scalable graph generation techniques: BTER [1], Darwini [2], etc.
  • A Hadoop implementation of BTER is available:

https://github.com/DAMA-UPC/BTERonH

[1] Tamara G. Kolda et al. 2014. A scalable generative graph model with community structure. SISC 36, 5 (2014), C424–C452.

[2] Sergey Edunov et al. 2016. Darwini: Generating realistic large-scale social graphs. arXiv:1610.00664 (2016).
slide-23
SLIDE 23

Property-to-Structure Matching

Input: P(X,Y)

[Figure: matrix of joint probabilities over three property values, with entries 0.3, 0.33, 0.17 and 0.067]

slide-24
SLIDE 24

Property-to-Structure Matching

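Input: P(X,Y)

[Figure: matrix of joint probabilities over three property values, with entries 0.3, 0.33, 0.17 and 0.067]

Block Model

[Figure: matrix of edge counts with entries 9, 10, 5 and 2, and group sizes 6, 7, 4; drawn as a graph with node labels 7,10; 4,5; 6,9 and edge labels 2, 2, 2]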

slide-25
SLIDE 25

Property-to-Structure Matching

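Input: P(X,Y)

[Figure: matrix of joint probabilities over three property values, with entries 0.3, 0.33, 0.17 and 0.067]

Block Model

[Figure: matrix of edge counts with entries 9, 10, 5 and 2, and group sizes 6, 7, 4; drawn as a graph with node labels 7,10; 4,5; 6,9 and edge labels 2, 2, 2]

Graph Partitioning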
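The step from P(X,Y) to a block model can be sketched as follows. This is an interpretation of the figure, not DataSynth code: it assumes the probabilities are over unordered value pairs (0.3, 0.33 and 0.17 for same-value pairs, 0.067 for each mixed pair, which sum to roughly 1) and that the graph has 30 edges. The labels A, B, C and all function names are hypothetical. Under these assumptions the target counts come out as 9, 10, 5 and 2, the numbers shown in the block model.

```python
def block_model(joint, num_edges):
    """Target number of edges for each unordered pair of property values."""
    return {pair: round(p * num_edges) for pair, p in joint.items()}

def observed_blocks(edges, assignment):
    """Edge counts per unordered value pair for a concrete node -> value assignment."""
    counts = {}
    for u, v in edges:
        pair = tuple(sorted((assignment[u], assignment[v])))
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def mismatch(target, observed):
    """Total absolute difference in edge counts; 0 means a perfect match."""
    pairs = set(target) | set(observed)
    return sum(abs(target.get(p, 0) - observed.get(p, 0)) for p in pairs)

joint = {
    ("A", "A"): 0.3, ("B", "B"): 0.33, ("C", "C"): 0.17,
    ("A", "B"): 0.067, ("A", "C"): 0.067, ("B", "C"): 0.067,
}
target = block_model(joint, 30)

# Scoring a (tiny, hypothetical) assignment against the target:
toy_edges = [(1, 2), (2, 3)]
toy_assignment = {1: "A", 2: "A", 3: "B"}
score = mismatch(target, observed_blocks(toy_edges, toy_assignment))
```

A partitioning-based matcher can then be seen as a search for the assignment of structure nodes to groups that drives this mismatch towards zero, subject to the group sizes fixed by the property tables.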

slide-26
SLIDE 26

Next Steps

  • Further investigate the performance and quality of our matching approach

Multithreaded/distributed

Efficient for high-cardinality values

Understand when it works well and when it does not

  • Push for the DSL
  • Integrate more existing structure generators

Bipartite graphs

  • Long term: work towards “DGaaS” (Data Generation as a Service)