SKETCHING DATA STRUCTURES FOR MASSIVE GRAPH PROBLEMS
Juan P. A. Lopes (1), Fabiano S. Oliveira (2), Paulo E. D. Pinto (2), Valmir C. Barbosa (1)
(1) Federal University of Rio de Janeiro (UFRJ)
(2) State University of Rio de Janeiro (UERJ)
VLDB Workshop Poly'18, August 31st, 2018

Agenda
● Motivation
● Probabilistic Implicit Representations
● Graph streams
● Conclusion

Motivation
Why are sketching data structures relevant to graph problems?

Some real-life graphs are massive
Observing global structures is hard
● Facebook: 2.2 billion active users, 2018.
● Twitter: 233 billion estimated directed edges, 2018.
● Routers: 128 MB of RAM in a typical router.
● Internet: 23 billion connected devices, 2018.
● Metagenomic assemblies: hundreds of billions of base pairs in a typical metagenomic sample.

SOME REAL-LIFE GRAPHS ARE MASSIVE AND DYNAMIC
How to deal with them?

Probabilistic Implicit Representations
Use less memory by allowing errors

Space Optimal Representations
● A representation is said to be space optimal if it requires O(f(n)) bits to represent a class containing 2^Θ(f(n)) graphs on n vertices;
● Optimality depends on the represented class.
[Figure: three example classes: general graphs, complete graphs, trees.]
Adjacency matrix: O(n^2). Adjacency list: O(m log n).
Spinrad, J. P. (2003). Efficient graph representations. American Mathematical Society.

Implicit Representations
A representation is said to be implicit if it has the following properties:
● Space optimal: O(f(n)) bits to represent a class containing 2^Θ(f(n)) graphs on n vertices;
● Distributes information: each vertex stores O(f(n)/n) bits;
● Local adjacency test: only local vertex information is required to test adjacency.
Spinrad, J. P. (2003). Efficient graph representations. American Mathematical Society.

Probabilistic Implicit Representations
For probabilistic implicit representations, we introduce a fourth property:
● Space optimal: O(f(n)) bits to represent a class containing 2^Θ(f(n)) graphs on n vertices;
● Distributes information: each vertex stores O(f(n)/n) bits;
● Local adjacency test: only local vertex information is required to test adjacency;
● Probabilistic adjacency test: constant relative probability of false positives or false negatives.

Bloom filter
Represents sets, allowing membership tests with a probability of false positives.
● There are no false negatives;
● 10 bits per element are enough to ensure a false positive probability of less than 1%.
Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM.
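The "10 bits per element" figure can be checked with the standard approximation for the false positive rate of a Bloom filter, p ≈ (1 - e^(-k/b))^k, where b is the number of bits per element and k the number of hash functions. The sketch below is illustrative only: the function name and the choice k = round(b ln 2) are assumptions, not something stated on the slide.

```python
import math

def bloom_false_positive_rate(bits_per_element, num_hashes):
    """Standard approximation (1 - e^(-k/b))^k for a Bloom filter
    with b bits per stored element and k hash functions."""
    return (1.0 - math.exp(-num_hashes / bits_per_element)) ** num_hashes

bits_per_element = 10
# The rate is minimized near k = b * ln 2 (assumed choice, rounded).
best_k = round(bits_per_element * math.log(2))
rate = bloom_false_positive_rate(bits_per_element, best_k)

print(f"k = {best_k}, false positive rate = {rate:.4%}")
```

With 10 bits per element this gives a rate a little above 0.8%, consistent with the "less than 1%" claim.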

Bloom filter
Idea: replace each vertex set in an adjacency list with a Bloom filter.
● Each edge would require only O(1) bits, instead of O(log n);
● By using Bloom filters, there would be no false negatives, only false positives.
● Similarly, a single Bloom filter could be used to store the entire edge set, but technically this would not be an implicit representation.
[Figure: a regular adjacency list next to its Bloom filter representation, one bit array per vertex.]
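A minimal sketch of this idea, assuming one small Bloom filter per vertex. The class name, the filter size (64 bits), the number of hash functions (4), the SHA-256 double-hashing scheme and the example edge list are all illustrative choices, not taken from the slides.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, m_bits=64, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0

    def _positions(self, item):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(repr(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + j * h2) % self.m for j in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

# Hypothetical example graph.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"), ("C", "D")]
vertices = sorted({v for e in edges for v in e})

# One filter per vertex replaces that vertex's neighbour list.
filters = {v: BloomFilter() for v in vertices}
for u, v in edges:
    filters[u].add(v)
    filters[v].add(u)

def adjacent(u, v):
    """Local test: only the filter stored at u is consulted."""
    return v in filters[u]

print(adjacent("A", "B"), adjacent("A", "E"))
```

True edges are always reported as adjacent; a non-edge may occasionally be reported as adjacent, which is the false positive behaviour described above.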

MinHash
Represents sets through a constant-sized signature and allows computing the Jaccard coefficient between two or more sets.
[Figure: example signatures MinHash(A) and MinHash(B), compared position by position.]
Broder, A. Z. (1997). On the resemblance and containment of documents. In Compression and complexity of sequences.
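A minimal sketch of a MinHash signature and the resulting Jaccard estimate. The salted SHA-256 hash family, the function names and the example sets are assumptions made for illustration; the signature size k = 128 mirrors the value quoted in the experiments.

```python
import hashlib

def _hash(seed, item):
    """One member of a salted hash family (illustrative choice)."""
    data = f"{seed}:{item!r}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def minhash_signature(items, k=128):
    """Signature = minimum hash value of the set under each of k hash functions."""
    return [min(_hash(seed, x) for x in items) for seed in range(k)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

set_a = set(range(0, 60))
set_b = set(range(30, 90))          # true Jaccard coefficient: 30 / 90 = 1/3
estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print(f"estimated Jaccard coefficient: {estimate:.3f}")
```

The estimate is a random variable centred on the true coefficient; larger signatures reduce its variance.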

MinHash
Idea: construct a set for each vertex, such that the Jaccard index between any pair of vertices encodes their adjacency.
[Figure: the interval from 0 to 1 with the thresholds δ_A and δ_B marked.]

MinHash
Example of sets construction for δ_A = ⅓ and δ_B = ½.
root: {1, 2, 3, 4, 5, 6, 7, 8}
selection: {1, 3, 5, 7}, {1, 4, 5, 8}
extension: {1, 3, 5, 7, 13, 14, 15, 16}, {1, 3, 5, 7, 9, 10, 11, 12}, {1, 4, 5, 8, 17, 18, 19, 20}
selection: {1, 5, 9, 11}, {1, 5, 17, 19}, {1, 5, 18, 20}, {1, 8, 17, 20}
O(n) bits
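The exact Jaccard indices of some of the sets in this example can be computed directly. The snippet below is only a check of the arithmetic; the variable names are illustrative, and the pairing of sets (which set is derived from which) is an assumption read off the set contents.

```python
from fractions import Fraction

def jaccard(a, b):
    """Exact Jaccard index |a ∩ b| / |a ∪ b| as a fraction."""
    return Fraction(len(a & b), len(a | b))

root = set(range(1, 9))
sel_1 = {1, 3, 5, 7}
sel_2 = {1, 4, 5, 8}
ext_1 = {1, 3, 5, 7, 13, 14, 15, 16}
ext_2 = {1, 3, 5, 7, 9, 10, 11, 12}

print(jaccard(root, sel_1))   # a set and the selection taken from it
print(jaccard(sel_1, ext_1))  # a set and one of its extensions
print(jaccard(sel_1, sel_2))  # two selections of the same root
print(jaccard(ext_1, ext_2))  # two extensions of the same set
```

In this example the first two pairs evaluate to ½ and the last two to ⅓, which are exactly the two threshold values δ_B and δ_A.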

Experimental Results
For MinHash-based representation
Observations
1. The experiment was run with k=128 hash functions and a graph with n=200 vertices.
2. Increasing the threshold seems to increase the rate of false negatives and decrease the rate of false positives.
3. The perfect threshold depends on the application's tolerance for false positives and false negatives.

Experimental Results
For MinHash-based representation
Observations
1. The experiment was run with δ = 0.375 and a graph with n=200 vertices.
2. Increasing the signature size seems to have more effect on the rate of false negatives than on the rate of false positives.
3. This effect appears to be the same whatever the choice of threshold.

Other results
Any efficient representation for bipartite, co-bipartite or split graphs can be used to represent general graphs efficiently.
[Figure: example on a graph with vertices 1 to 5.]

Other results
Modeling this problem through integer programming allows proving the infeasibility of specific configurations.
● Each possible subset of vertices is modelled as a variable.
● Each variable describes the size of the set intersection between those vertices.
[Figure: Venn diagram of the sets S_A, S_B and S_C, with regions labelled x_A, x_B, x_C, x_AB, x_AC, x_BC and x_ABC.]

Other results
Modeling this problem through integer programming allows proving the infeasibility of specific configurations.
● Each possible subset of vertices is modelled as a variable.
● Each variable describes the size of the set intersection between those vertices.
● Impossible for δ_A = 0.4 and δ_B = 0.6.
● Possible for δ_A = ⅓ and δ_B = ½.
● Do all threshold values have an infeasible bipartite graph? Still an open problem.
[Figure: the graph K3,3.]

Graph Streams
How to represent dynamic graphs in sublinear space?

Graph Streams
Graph streams are graphs represented in the data stream model, i.e. a single pass through a stream of edge insertions and deletions.
Can we compute global parameters in sublinear space?
[Figure: a graph on vertices A to F with a stream of edge updates such as +BC, -DF, -BD, +AE, +DF, -BC, +BE, +AC.]
Ahn, K. J., Guha, S., and McGregor, A. (2012). Analyzing graph structure via linear measurements. In Proceedings of SODA'12.
McGregor, A. (2014). Graph stream algorithms: a survey. ACM SIGMOD.

Graph Streams
Can we construct a full spanning forest of the graph in sublinear space?
[Figure: example graph on vertices A to F.]

Graph Streams
Idea: we can sample an edge from each vertex and merge its endpoints into a single “super-vertex”. Repeat. This procedure finishes in O(log n) steps.
[Figure: example graph on vertices A to F.]
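The contraction idea can be sketched offline, with the whole edge list in memory, as a Borůvka-style loop over a union-find structure. This is only an illustration of why O(log n) rounds suffice: in the streaming setting the edge picked for each super-vertex would come from a sketch rather than from an explicit list. The function name and the example edge list (the edges AB, AC, BD, BE, CD, CE, CF, DF on vertices A to F) are illustrative.

```python
import random

def spanning_forest(vertices, edges, seed=0):
    """Each round, every super-vertex picks one random outgoing edge
    and is merged with the super-vertex at the other endpoint."""
    rng = random.Random(seed)
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    forest, rounds = [], 0
    while True:
        # Collect, for every super-vertex, the edges leaving it.
        outgoing = {}
        for u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                outgoing.setdefault(ru, []).append((u, v))
                outgoing.setdefault(rv, []).append((u, v))
        if not outgoing:
            break
        rounds += 1
        for candidates in outgoing.values():
            u, v = rng.choice(candidates)
            ru, rv = find(u), find(v)
            if ru != rv:  # skip if an earlier merge in this round joined them
                parent[ru] = rv
                forest.append((u, v))
    return forest, rounds

EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"),
         ("C", "D"), ("C", "E"), ("C", "F"), ("D", "F")]
forest, rounds = spanning_forest("ABCDEF", EDGES)
print(forest, rounds)
```

Every super-vertex that still has an outgoing edge is merged with at least one other in each round, so the number of super-vertices at least halves per round.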

Graph Streams
A simpler problem: is it possible to sample a random edge from any cut-set [S, V\S] in a graph stream storing less than O(n^2) bits?
[Figure: example graph on vertices A to F.]

Sampling edges from cut-set
Idea: represent the graph through a modified incidence matrix, where each edge is represented twice (once in each “direction”).

     AB  BA  AC  CA  BD  DB  BE  EB  CD  DC  CE  EC  CF  FC  DF  FD
A     1  -1   1  -1   0   0   0   0   0   0   0   0   0   0   0   0
B    -1   1   0   0   1  -1   1  -1   0   0   0   0   0   0   0   0
C     0   0  -1   1   0   0   0   0   1  -1   1  -1   1  -1   0   0
D     0   0   0   0  -1   1   0   0  -1   1   0   0   0   0   1  -1
E     0   0   0   0   0   0  -1   1   0   0  -1   1   0   0   0   0
F     0   0   0   0   0   0   0   0   0   0   0   0  -1   1  -1   1

[Figure: example graph on vertices A to F with edges AB, AC, BD, BE, CD, CE, CF and DF.]

Sampling edges from cut-set
The main benefit of this representation is the ability to sum incidence vectors to find the corresponding vector of a cut-set. Being able to sample nonzero coordinates from this vector implies sampling edges from that cut-set.

           AB  BA  AC  CA  BD  DB  BE  EB  CD  DC  CE  EC  CF  FC  DF  FD
A           1  -1   1  -1   0   0   0   0   0   0   0   0   0   0   0   0
+B         -1   1   0   0   1  -1   1  -1   0   0   0   0   0   0   0   0
+D          0   0   0   0  -1   1   0   0  -1   1   0   0   0   0   1  -1
{A, B, D}   0   0   1  -1   0   0   1  -1  -1   1   0   0   0   0   1  -1

The nonzero coordinates of the sum correspond to the edges AC, BE, CD and DF, which are exactly the edges crossing the cut between {A, B, D} and {C, E, F}.
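The row sums can be reproduced with a few lines of code. This is a sketch using the edge list and sign convention of the matrix on these slides; the function and constant names are illustrative.

```python
EDGES = ["AB", "AC", "BD", "BE", "CD", "CE", "CF", "DF"]
# Each edge contributes two columns, one per "direction": AB, BA, AC, CA, ...
COLUMNS = [c for e in EDGES for c in (e, e[::-1])]

def incidence_row(v):
    """Row of the modified incidence matrix for vertex v."""
    row = []
    for first, second in EDGES:
        if v == first:
            row += [1, -1]
        elif v == second:
            row += [-1, 1]
        else:
            row += [0, 0]
    return row

def cut_vector(subset):
    """Coordinate-wise sum of the rows of all vertices in the subset."""
    rows = [incidence_row(v) for v in subset]
    return [sum(column) for column in zip(*rows)]

def cut_edges(subset):
    """Edges whose coordinates survive the sum, i.e. edges crossing the cut."""
    vec = cut_vector(subset)
    return [COLUMNS[i] for i in range(0, len(vec), 2) if vec[i] != 0]

print(cut_vector("ABD"))
print(cut_edges("ABD"))
```

Edges with both endpoints inside the subset cancel out (+1 and -1 in the same column), so only crossing edges remain nonzero.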

What is ℓ0-sampling?
Sampling, with uniform probability, of a nonzero coordinate from a vector a, represented incrementally by a stream of updates.
● Some updates may cancel others;
● Must be done in sublinear space;
● Known lower bound: Ω(log^2 n).
[Figure: a vector a = (1, 8, -4, 0, 0, -7, -15, 9, -1, 0) indexed from 1 to 10, with incoming updates such as (9, +3), (10, -5), (10, -1), (3, +8), (1, +1) and (4, -4).]
Cormode, G., Muthukrishnan, S., and Rozenbaum, I. (2005). Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In Proceedings of VLDB'05.
Jowhari, H., Saglam, M., and Tardos, G. (2011). Tight bounds for lp-samplers, finding duplicates in streams, and related problems. In Proceedings of PODS'11.
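A simplified sketch of how an ℓ0-sampler can be organised, assuming the common "subsample at geometric rates, then try 1-sparse recovery" layout. It is not the construction from the cited papers: the class names, the per-coordinate pseudo-random level assignment and the fingerprint modulus are illustrative assumptions, and the sample it returns is only roughly uniform.

```python
import random

P = (1 << 61) - 1  # prime modulus for fingerprints (illustrative choice)

class OneSparse:
    """Recovers (index, value) when exactly one coordinate is nonzero."""

    def __init__(self, z):
        self.z = z
        self.w = 0  # sum of values
        self.s = 0  # sum of index * value
        self.f = 0  # fingerprint: sum of value * z^index mod P

    def update(self, i, delta):
        self.w += delta
        self.s += i * delta
        self.f = (self.f + delta * pow(self.z, i, P)) % P

    def recover(self):
        if self.w == 0 or self.s % self.w != 0:
            return None
        i = self.s // self.w
        if i < 0 or (self.w * pow(self.z, i, P)) % P != self.f:
            return None  # fingerprint mismatch: not 1-sparse
        return i, self.w

class L0Sampler:
    """Level j keeps each coordinate with probability about 2^-j."""

    def __init__(self, n, seed=0):
        rng = random.Random(seed)
        self.levels = n.bit_length() + 1
        self.salt = rng.getrandbits(64)
        z = rng.randrange(2, P)
        self.cells = [OneSparse(z) for _ in range(self.levels)]

    def _depth(self, i):
        # Deterministic pseudo-random depth for coordinate i.
        h = random.Random(self.salt + i).getrandbits(32)
        d = 0
        while d < self.levels - 1 and (h >> d) & 1 == 0:
            d += 1
        return d

    def update(self, i, delta):
        for level in range(self._depth(i) + 1):
            self.cells[level].update(i, delta)

    def sample(self):
        # Try the sparsest levels first; may return None on failure.
        for cell in reversed(self.cells):
            hit = cell.recover()
            if hit is not None:
                return hit
        return None

sampler = L0Sampler(10, seed=1)
for index, delta in [(3, 8), (1, 1), (4, -4), (3, -8), (4, 4)]:
    sampler.update(index, delta)
print(sampler.sample())  # only coordinate 1 is still nonzero
```

Because every cell is a linear function of the vector, updates that cancel each other leave no trace, and two samplers built with the same seed can be added cell by cell, which is what makes this kind of structure usable on sums of incidence vectors.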
