Generating Large Graphs for Benchmarking
Ali Pinar, Tamara G. Kolda, C. Seshadhri, Todd Plantenga
U.S. Department of Energy, Office of Advanced Scientific Computing Research
U.S. Department of Defense, Defense Advanced Research Projects Agency
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
2/21/2014, Pinar @ SIAM PP 14

Modeling graphs is a crucial challenge
• Our understanding of network structure is still limited.
  • We do not have the first principles.
• Why model graphs?
  • Real data will rarely be available.
  • Understanding normal helps identify abnormal.
  • Benchmarking requires controlled experiments.
• Challenges
  • Data analysis: identifying metrics that can help in characterization (e.g., degree distribution, clustering coefficients)
  • Theoretical analysis: understanding the structure inferred by these metrics
  • Algorithms: designing algorithms to compute these metrics, generate graphs, etc.
[Diagram labels: Useful, Real Data, Measurements, Calibration, Inherent Properties, Mathematical Generative Model, Generated Data, Measurements]

A Good Network Model…
• Encapsulates underlying driving principles
  • “Physics”
• Captures measurable characteristics of real-world data
  • Degree distribution
  • Clustering coefficients
  • Community structure
  • Connectedness, Diameter
  • Eigenvalues
• Calibrates to specific data sets
  • Quantitative vs. qualitative
• Surrogate for real data, protecting privacy and security
  • Provides results “like” the real data
  • Easy to share, reproduce
• Yields understanding
  • Serves as null model
  • Statistical sampling guidance
  • Predictive capabilities

Story-driven models
Example: Preferential Attachment (Barabasi & Albert, Science, 1999)
• New nodes join the graph one at a time, in sequence
• Each new node chooses k new neighbors, according to degree
• Node degrees updated after each addition: rich get richer!
[Figure: a new node and its new edge(s), k = 1]

Structure-driven models
Example: CL (aka Configuration) (Chung & Lu, PNAS, 2002)
• Desired node degrees specified in advance
• New edges inserted, choosing endpoints by desired degree
• Higher-degree nodes are more likely to be selected
[Figure: nodes labeled with desired degrees and a new edge]
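The two example models above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the single-edge seed graph, and the choice to allow self-loops and repeated edges in the CL sketch are all assumptions made here for brevity.

```python
import random

def preferential_attachment(n, k=1, seed=0):
    """Story-driven: nodes join one at a time and pick k neighbors by degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]      # assumed seed graph: a single edge
    endpoints = [0, 1]    # each node appears once per incident edge
    for new in range(2, n):
        targets = set()
        while len(targets) < min(k, new):
            # A uniform pick from the endpoint list is degree-proportional.
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))   # degrees updated after each addition
    return edges

def chung_lu(degrees, seed=0):
    """Structure-driven: insert m edges, choosing endpoints by desired degree."""
    rng = random.Random(seed)
    nodes = range(len(degrees))
    m = sum(degrees) // 2
    us = rng.choices(nodes, weights=degrees, k=m)
    vs = rng.choices(nodes, weights=degrees, k=m)
    # Every edge is drawn independently, so self-loops and repeats can occur.
    return list(zip(us, vs))

pa_edges = preferential_attachment(50, k=1)
cl_edges = chung_lu([3, 2, 2, 1, 0])
```

The preferential attachment loop is inherently sequential because each choice depends on the degrees produced by all earlier insertions, whereas every CL edge is drawn independently of the others.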

Degree Dist. Measures Connectivity
The degree distribution is one way to characterize a graph.
Barabasi & Albert, Science, 1999: “A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution”
[Figure: example graph with nodes A through L]
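As a small illustration of the measurement, the degree distribution can be tabulated directly from an edge list. The function name and the toy edge list below are hypothetical; nodes with no edges do not appear in the list and are therefore not counted.

```python
from collections import Counter

def degree_distribution(edges):
    """Return (degree per node, number of nodes per degree) for an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg, Counter(deg.values())

# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
deg, hist = degree_distribution([(0, 1), (1, 2), (0, 2), (2, 3)])
# hist maps degree -> node count: one node of degree 1, two of degree 2, one of degree 3
```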

Clustering Coeff. Measures Cohesion
The clustering coefficient measures the rate of wedge closure.
In social networks, the clustering coefficients decrease smoothly as the degree increases. High-degree nodes generally have little social cohesion.
[Figure: example graph with nodes A through L]
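A minimal sketch of the per-degree clustering coefficient, read here as closed wedges divided by all wedges centered at nodes of a given degree. The function name is hypothetical and the brute-force wedge enumeration is only meant for small graphs.

```python
from collections import defaultdict
from itertools import combinations

def clustering_by_degree(edges):
    """Fraction of wedges that are closed, grouped by the degree of the center node."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                 # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    closed = defaultdict(int)
    wedges = defaultdict(int)
    for v, nbrs in adj.items():
        d = len(nbrs)
        if d < 2:
            continue               # a node of degree 0 or 1 centers no wedge
        wedges[d] += d * (d - 1) // 2
        # A wedge a-v-b is closed when a and b are themselves adjacent.
        closed[d] += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return {d: closed[d] / wedges[d] for d in wedges}

# Triangle 0-1-2 plus pendant node 3 on node 2.
cc = clustering_by_degree([(0, 1), (1, 2), (0, 2), (2, 3)])
# Degree-2 nodes close their only wedge; node 2 closes 1 of its 3 wedges.
```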

Current State-of-the-Art Falls Short

Story-Driven Models
• Examples
  • Preferential Attachment: Barabasi & Albert, Science 1999
  • Forest Fire: Leskovec, Kleinberg, Faloutsos, KDD 2005
  • Random Walk: Vazquez, Phys. Rev. E 2003
• Pros & Cons
  • Poor fits to real data
  • Expensive to calibrate to real data
  • Do not scale: inherently sequential
• Survey: Sala et al., WWW 2010

Structure-Driven Models
• Examples
  • CL: Chung-Lu; aka Configuration Model, Weighted Erdös-Rényi: PNAS 2002
  • SKG: Stochastic Kronecker Graphs; R-MAT is a special case: Leskovec et al., JMLR 2010; Chakrabarti, Zhan, Faloutsos, SDM 2004
  • Graph 500 Generator!
• Pros & Cons
  • Do not capture clustering coefficients
  • SKG expensive to calibrate
  • Scales: generation cost O(m log n)
  • CL & SKG very similar in behavior: Pinar, Seshadhri, Kolda, SDM 2012
[Figure: clustering coefficient vs. degree]

Stochastic Kronecker Graph (SKG) as Graph 500 Generator
• Pros
  • Only 5 parameters
    • 2x2 generator matrix (sums to 1)
    • n = 2^L = # nodes
    • m = 16n = # edges
  • O(m log n) generation cost
  • Edge generation fully parallelizable
    • Except de-duplication
• Cons
  • Oscillations in degree distribution (fixed by adding special noise)
  • Limited degree distribution (noisy version is lognormal)
  • Half the nodes are isolated!
  • Tiny clustering coefficients!

  L    Isolated   d avg
  26   51%        32
  29   57%        37
  32   62%        41
  36   67%        49
  39   71%        55
  42   74%        62

Seshadhri, Pinar, Kolda, Journal of the ACM, April 2012
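The O(m log n) cost comes from drawing each edge with L independent quadrant choices, one per bit of the node index. The sketch below illustrates this under stated assumptions: the function name is hypothetical, edges are kept as directed pairs, de-duplication is done with a Python set, and the generator values 0.57, 0.19, 0.19, 0.05 are taken from the Graph 500 specification rather than from these slides.

```python
import random

def skg_edges(T, L, edge_factor=16, seed=0):
    """Draw edge_factor * 2**L edges from a 2x2 generator T whose entries sum to 1."""
    rng = random.Random(seed)
    n = 2 ** L
    quadrants = [(0, 0), (0, 1), (1, 0), (1, 1)]
    weights = [T[0][0], T[0][1], T[1][0], T[1][1]]
    edges = set()                      # the set performs the de-duplication step
    for _ in range(edge_factor * n):   # each iteration is independent of the others
        u = v = 0
        # One quadrant per level fixes one bit of the row and one bit of the column.
        for i, j in rng.choices(quadrants, weights=weights, k=L):
            u, v = 2 * u + i, 2 * v + j
        edges.add((u, v))
    return n, edges

n, edges = skg_edges([[0.57, 0.19], [0.19, 0.05]], L=6)
touched = {x for e in edges for x in e}
isolated_fraction = 1 - len(touched) / n   # share of nodes with no incident edge
```

Only the final de-duplication needs communication between workers; the inner loop can run anywhere with its own random stream.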

The Physics of Graphs
Random graph:
(1) Formed according to CL Model
(2) “High” clustering coefficient
Thm: Must contain a “substantive” subgraph that is a dense Erdös-Rényi graph.
A heavy-tailed network with a high clustering coefficient contains many Erdös-Rényi affinity blocks. (The distribution of the block sizes is also heavy tailed.)
Basic measurements lead to inferences about larger structures (communities) that are consistent with literature.
[Diagram labels: CL Model, Global Clustering Coefficient, Dense Erdös-Rényi Subgraph]
Seshadhri, Kolda, Pinar, Phys. Rev. E, 2012

BTER: Block Two-Level Erdös-Rényi

Preprocessing
• Create affinity blocks of nodes with (nearly) same degree, determined by degree distribution
• Connectivity per block based on clustering coefficient
• For each node, compute desired
  • within-block degree
  • excess degree

Phase 1
• Erdös-Rényi graphs in each block
• Need to insert extra links to ensure enough unique links per block

Phase 2
• CL model on excess degree (a sort of weighted Erdös-Rényi)
• Creates connections across blocks

Phase 1 and Phase 2 occur independently.
Seshadhri, Kolda, Pinar, Phys. Rev. E, 2012
Kolda, Pinar, Plantenga, Seshadhri, arXiv:1302.6636, Feb. 2013
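The preprocessing step can be sketched as follows. The slide only says that blocks group nodes of nearly equal degree and that block connectivity depends on the clustering coefficient; the specific choices below (blocks of size d + 1 for minimum degree d, connectivity equal to the cube root of the target clustering coefficient, degree-1 nodes left out of blocks) are assumptions based on the cited BTER papers, and the function name is hypothetical.

```python
def bter_preprocess(degrees, ccd):
    """degrees: desired degree per node; ccd: degree -> target clustering coefficient."""
    # Nodes of degree 0 or 1 are not placed in blocks (assumption).
    order = sorted((i for i, d in enumerate(degrees) if d > 1),
                   key=lambda i: degrees[i])
    blocks, within, excess = [], {}, {}
    pos = 0
    while pos < len(order):
        d_min = degrees[order[pos]]
        block = order[pos:pos + d_min + 1]        # assumed block size: d_min + 1
        rho = ccd.get(d_min, 0.0) ** (1 / 3)      # assumed block connectivity
        blocks.append((block, rho))
        for i in block:
            within[i] = rho * (len(block) - 1)    # expected Phase 1 degree
            excess[i] = max(degrees[i] - within[i], 0.0)  # left for Phase 2 (CL)
        pos += len(block)
    for i, d in enumerate(degrees):
        if d <= 1:
            within[i], excess[i] = 0.0, float(d)
    return blocks, within, excess

blocks, within, excess = bter_preprocess([2, 2, 2, 3, 3, 3, 3, 1],
                                         {2: 1.0, 3: 0.125})
```

In this toy input the three degree-2 nodes form a fully connected block with no excess degree, while the four degree-3 nodes keep roughly half of their degree for Phase 2.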

BTER vs. SKG: Co-authorship
[Figure panels: Degree Distribution; Clustering Coefficients]
SKG & CL lacking enough triangles
SKG parameters from Leskovec et al., JMLR, 2010

BTER vs. SKG: Social Website
[Figure panels: Degree Distribution; Clustering Coefficients]
Note oscillations in SKG
SKG parameters from Leskovec et al., JMLR, 2010

Community Structure of BTER Improves Eigenvalue Fit
[Figure: two panels, each titled "Leading E-vals of Adjacency Matrix"]

Making BTER Scalable
• Requirements:
  • Extreme scalability requires independent edge insertion.
  • Data structures should be o(|V|) to be duplicated at each processor.
• Data Structures:
  • Given the degree distribution, compute <block size, #blocks>, which requires O(dmax) memory.
  • Given the clustering coefficients, compute the number of edges per block, hence the Phase 1 degrees.
  • Given Phase 1 degrees, we can compute residual (Phase 2) degrees.
• Challenge: Adjust for repetitions
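One way to picture the <block size, #blocks> structure is to derive it from the degree histogram alone, without any per-node array. The sketch below is an assumption-laden illustration, not the authors' data structure: it assumes blocks of size d + 1 for minimum degree d, ignores nodes of degree 0 or 1, and tops up a partially filled block with nodes of the next higher degrees. The function name is hypothetical.

```python
def block_counts(hist):
    """hist: degree -> number of nodes. Returns a list of (block_size, num_blocks)."""
    out = []
    carry = 0   # open slots in the most recent partially filled block
    for d in sorted(k for k in hist if k > 1):
        n_d = hist[d]
        fill = min(carry, n_d)        # complete the open block first
        carry -= fill
        n_d -= fill
        full, rest = divmod(n_d, d + 1)
        if full:
            out.append((d + 1, full))
        if rest:
            out.append((d + 1, 1))    # block to be completed by higher degrees
            carry = d + 1 - rest
    return out

summary = block_counts({2: 7, 3: 6})
# summary == [(3, 2), (3, 1), (4, 1)]
```

The output holds at most two entries per distinct degree, so its size is bounded by the maximum degree rather than by the number of nodes.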

Adjusting for repeated edges
• Parallel edge insertion leads to multiple edges.
• This is negligible if edge probabilities are small.
  • This is the case for SKG, CL
  • But not for BTER.
• BTER has dense blocks, hence many repeats.
• We add extra edges to guarantee the number of unique items is as expected.
  • Coupon collector problem.
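The coupon collector adjustment can be made concrete with a short calculation. If a block has N possible node pairs and edges are drawn uniformly with replacement, the expected number of distinct edges after t draws is N(1 - (1 - 1/N)^t); solving for the t that yields a target fraction rho of the pairs gives the number of insertions needed. The function names are hypothetical, and uniform sampling within a block is an assumption of this sketch.

```python
import math

def expected_unique(num_pairs, draws):
    """Expected number of distinct pairs after uniform draws with replacement."""
    return num_pairs * (1 - (1 - 1 / num_pairs) ** draws)

def draws_needed(num_pairs, rho):
    """Draws required so that the expected unique count equals rho * num_pairs."""
    if num_pairs < 2 or not 0 <= rho < 1:
        raise ValueError("need num_pairs >= 2 and 0 <= rho < 1")
    return math.log(1 - rho) / math.log(1 - 1 / num_pairs)

# A block of 10 nodes has 45 possible pairs; target 90% of them as unique edges.
t = draws_needed(45, 0.9)
# t is well above 0.9 * 45, which is why dense blocks need extra insertions.
```

For large N the expression approaches N ln(1 / (1 - rho)), so the overhead grows quickly as the target density approaches 1.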
