Towards a property graph generator for benchmarking



  1. Towards a property graph generator for benchmarking. Arnau Prat-Pérez, Davide Basilio Bartolini, Joan Guisado-Gámez, Siegfried Depner, Xavier Fernández-Salas, Petr Koupy

  2. Why a property graph generator? Graph-based analysis is becoming more and more popular (logos shown: GraphMat, TOTEM).

  3. Why a property graph generator? For the field to advance, many benchmarking initiatives have appeared: gMark, Graphalytics, Social Network Benchmark, LinkBench, LUBM.

  4. Why a property graph generator? Benchmarks need datasets, preferably real ones.

  5. Why a property graph generator? But ...

  6. Why a property graph generator? But ... OR

  7. Why a property graph generator? Synthetic graph generators. However, each benchmark has specific data needs, so each benchmark designer implements their own generator: a time-consuming task that sometimes means reinventing the wheel.

  8. Why a property graph generator? A tool that, given some “graph specification”, produces a synthetic graph with the specified characteristics. DataSynth: https://github.com/DAMA-UPC/DataSynth; written in Scala; uses Apache Spark.

  9. Architecture Overview. Frontend (DSL, Parser): Scala-based DSL with extensive use of code generation. Execution Plan Optimizer: optimizations possible for certain types of graphs. Backend (Apache Spark Runtime): state-of-the-art Big Data framework.

  10. What features should DataSynth have? But what characteristics should a property graph generator be able to reproduce?

  11. What features should DataSynth have? But what characteristics should a property graph generator be able to reproduce? Properties and correlations/dependencies between them (e.g. name is correlated with country).

  12. What features should DataSynth have? But what characteristics should a property graph generator be able to reproduce? Properties and correlations/dependencies between them (e.g. name is correlated with country). Varied structure: degree distributions, community structure, low diameter, large connected component, etc.

  13. What features should DataSynth have? But what characteristics should a property graph generator be able to reproduce? Properties and correlations/dependencies between them (e.g. name is correlated with country). Varied structure: degree distributions, community structure, low diameter, large connected component, etc. Property-structure correlations/dependencies (e.g. Chinese people tend to connect to Chinese people), represented as a joint probability P(X,Y) of observing X and Y on a randomly picked edge.

  14. What features should DataSynth have? But what characteristics should a property graph generator be able to reproduce? Properties and correlations/dependencies between them (e.g. name is correlated with country). Varied structure: degree distributions, community structure, low diameter, large connected component, etc. Property-structure correlations/dependencies (e.g. Chinese people tend to connect to Chinese people), represented as a joint probability P(X,Y) of observing X and Y on a randomly picked edge. Scale.
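
As a rough illustration of the P(X,Y) representation described in the slide above, the sketch below estimates the joint distribution from an edge list and a node-to-value map. The function name `edge_joint_distribution`, the toy graph, and the choice to treat edges as unordered pairs are assumptions made for this example; none of it is DataSynth's actual API.

```python
from collections import Counter


def edge_joint_distribution(edges, value_of):
    """Estimate P(X, Y): the probability of observing property values X and Y
    at the two endpoints of a randomly picked edge.

    Edges are treated as undirected, so (X, Y) and (Y, X) are the same pair.
    """
    counts = Counter()
    for u, v in edges:
        # Sort the two values so that each unordered pair has a single key.
        pair = tuple(sorted((value_of[u], value_of[v])))
        counts[pair] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical toy graph: four persons, each with a country.
country = {1: "China", 2: "Japan", 3: "China", 4: "Germany"}
edges = [(1, 3), (1, 2), (3, 4), (2, 4)]

joint = edge_joint_distribution(edges, country)
```

A generator could compare such an observed distribution against the requested one to judge how well the property-structure correlation is preserved.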

  15. But... Having a single algorithm for generating so many things seems too complex: properties and property correlations; realistic graph structure; property-structure correlations. There are tens of metrics to measure the structure of a graph. Which ones should we take (possibly depending on the algorithms used)?

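  16. DataSynth's approach. Example schema: Person nodes with properties Country and Name; “knows” edges with property date. [Figure: generation steps laid out along a TIME axis]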

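  17. DataSynth's approach. Example schema: Person nodes with properties Country and Name; “knows” edges with property date. Steps along the TIME axis: node property generation and structure generation. Name table (Id, Name): (1, Lee), (2, Hiroshi), (3, Yang), ..., (17, Wolfgang). Country table (Id, Country): (1, China), (2, Japan), (3, China), ..., (17, Germany).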

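  18. DataSynth's approach. Example schema: Person nodes with properties Country and Name; “knows” edges with property date. Steps along the TIME axis: node property generation and structure generation, followed by matching that preserves the given joint probability distributions, e.g. P(China,China) ≈ 0.2. Name table (Id, Name): (1, Lee), (2, Hiroshi), (3, Yang), ..., (17, Wolfgang). Country table (Id, Country): (1, China), (2, Japan), (3, China), ..., (17, Germany). [Figure: generated graph structure with nodes labelled 1 to 17]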

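  19. DataSynth's approach. Example schema: Person nodes with properties Country and Name; “knows” edges with property date. Steps along the TIME axis: node property generation and structure generation, followed by matching that preserves the given joint probability distributions, e.g. P(China,China) ≈ 0.2, and finally edge property generation. Name table (Id, Name): (1, Lee), (2, Hiroshi), (3, Yang), ..., (17, Wolfgang). Country table (Id, Country): (1, China), (2, Japan), (3, China), ..., (17, Germany). Edge date table (Id, date): (1, 30/01/2015), (2, 4/06/2016), (3, 12/11/2016), ..., (30, 03/03/2017). [Figure: generated graph structure with nodes labelled 1 to 17]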

  20. DataSynth's Approach. Pros: accurate distributions of property values and correlations between properties; does not limit us to a single way of generating the structure of a graph; we can use existing techniques and leave the door open to new contributions; pay for what we get. Cons: relies heavily on a sophisticated matching approach to achieve accurate property-structure correlation.

  21. Property Generation. We have a “Property Table” for each <type, property> pair. We use a technique similar to that proposed by Myriad [1]: highly parallel; allows in-place data generation; given the Id of an entity, its properties can be generated. [1] Alexander Alexandrov, Kostas Tzoumas, and Volker Markl. 2012. Myriad: scalable and expressive data generation. PVLDB 5, 12 (2012), 1890–1893.
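
To make the idea of generating an entity's properties from its Id alone more concrete, here is a minimal sketch. It seeds a pseudo-random generator from a hash of the property name and the Id, which is a stand-in chosen for this example and not Myriad's or DataSynth's actual scheme. The function names, the weights, and the per-country name lists are hypothetical; only the sample names and countries come from the slides.

```python
import hashlib
import random


def property_value(property_name, entity_id, values, weights, seed=0):
    """Return the value of `property_name` for `entity_id`.

    The result depends only on (seed, property_name, entity_id), so any worker
    can compute any row independently and in any order.
    """
    key = f"{seed}:{property_name}:{entity_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.choices(values, weights=weights, k=1)[0]


# Hypothetical value domains and weights.
COUNTRIES = ["China", "Japan", "Germany"]
COUNTRY_WEIGHTS = [0.5, 0.3, 0.2]
NAMES_BY_COUNTRY = {
    "China": ["Lee", "Yang"],
    "Japan": ["Hiroshi"],
    "Germany": ["Wolfgang"],
}


def person_row(entity_id):
    """Generate (id, name, country), with the name conditioned on the country."""
    country = property_value("country", entity_id, COUNTRIES, COUNTRY_WEIGHTS)
    names = NAMES_BY_COUNTRY[country]
    name = property_value("name", entity_id, names, [1] * len(names))
    return entity_id, name, country


row_17 = person_row(17)
```

Because no state is shared between rows, calling `person_row(17)` twice, or on different machines, yields the same row, which is what makes this style of generation easy to parallelize.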

  22. Structure Generation. We can use existing scalable graph generation techniques: BTER [1], Darwini [2], etc. A Hadoop implementation of BTER is available: https://github.com/DAMA-UPC/BTERonH. [1] Tamara G. Kolda et al. 2014. A scalable generative graph model with community structure. SISC 36, 5 (2014), C424–C452. [2] Sergey Edunov et al. 2016. Darwini: Generating realistic large-scale social graphs. arXiv:1610.00664 (2016).

  23. Property-to-Structure Matching. Input: P(X,Y), a 3×3 matrix with rows (0.3, 0.067, 0.067), (0.067, 0.33, 0.067), (0.067, 0.067, 0.17).

  24. Property-to-Structure Matching. Input: P(X,Y), a 3×3 matrix with rows (0.3, 0.067, 0.067), (0.067, 0.33, 0.067), (0.067, 0.067, 0.17). Block Model (figure labels): diagonal blocks 6,9; 7,10; 4,5; off-diagonal entries 2.

  25. Property-to-Structure Matching. Input: P(X,Y), a 3×3 matrix with rows (0.3, 0.067, 0.067), (0.067, 0.33, 0.067), (0.067, 0.067, 0.17). Block Model (figure labels): diagonal blocks 6,9; 7,10; 4,5; off-diagonal entries 2. Next step: Graph Partitioning.
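
One plausible reading of how the block model relates to P(X,Y) is that each entry is scaled by the number of edges to give a target edge count per pair of groups. The sketch below works under two assumptions that the slides do not state explicitly: the graph has 30 edges (the edge table shown earlier ends at Id 30), and the second number in each diagonal block label is that group's internal edge count. The function name is hypothetical.

```python
def block_model_edge_counts(p, num_edges):
    """Scale a joint distribution over unordered value pairs into target
    edge counts per block, rounding to the nearest integer."""
    return [[round(prob * num_edges) for prob in row] for row in p]


# P(X, Y) as given on the slide (symmetric matrix).
p_xy = [
    [0.3, 0.067, 0.067],
    [0.067, 0.33, 0.067],
    [0.067, 0.067, 0.17],
]

counts = block_model_edge_counts(p_xy, 30)

# Count each unordered pair once: the diagonal plus the upper triangle.
total = sum(counts[i][j] for i in range(3) for j in range(i, 3))
```

Under these assumptions the diagonal comes out as 9, 10 and 5 and every off-diagonal entry as 2, which agrees with the figure labels. In general, independent rounding is not guaranteed to sum exactly to the requested number of edges.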

  26. Next Steps. Investigate further the performance/quality of our matching approach: multithreaded/distributed; efficient for high-cardinality values; understand when it works well and when it does not. Push for the DSL. Integrate more existing structure generators: bi-partite graphs. Long term: work towards “DGaaS” (Data Generation as a Service).
