Generating Large Graphs for Benchmarking
Ali Pinar, Tamara G. Kolda, C. Seshadhri, Todd Plantenga
SIAM PP 14, 2/21/2014


  1. Generating Large Graphs for Benchmarking
Ali Pinar, Tamara G. Kolda, C. Seshadhri, Todd Plantenga
Supported by the U.S. Department of Energy Office of Advanced Scientific Computing Research and the U.S. Department of Defense, Defense Advanced Research Projects Agency.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

  2. Modeling graphs is a crucial challenge
• Our understanding of network structure is still limited.
• We do not have the first principles.
• Why model graphs?
  • Real data will rarely be available.
  • Understanding normal helps identify abnormal.
  • Benchmarking requires controlled experiments.
• Challenges:
  • Data analysis: identifying metrics that can help in characterization (e.g., degree distribution, clustering coefficients)
  • Theoretical analysis: understanding the structure inferred by these metrics
  • Algorithms: designing algorithms to compute these metrics, generate graphs, etc.
[Diagram: Real Data → Measurements → Inherent Properties → Mathematical Generative Model → Generated Data → Measurements, with Calibration closing the loop]

  3. A Good Network Model…
A good network model:
• Encapsulates underlying driving principles (“physics”)
• Captures measurable characteristics of real-world data: degree distribution, clustering coefficients, community structure, connectedness, diameter, eigenvalues
• Calibrates to specific data sets (quantitative vs. qualitative)
• Serves as a surrogate for real data, protecting privacy and security: provides results “like” the real data and is easy to share and reproduce
• Yields understanding
• Serves as a null model
• Provides statistical sampling guidance
• Has predictive capabilities

Story-driven models. Example: Preferential Attachment (Barabasi & Albert, Science, 1999)
• New nodes join the graph one at a time, in sequence.
• Each new node chooses k new neighbors, according to degree (the illustration uses k = 1).
• Node degrees are updated after each addition: rich get richer!

Structure-driven models. Example: CL, aka the Configuration Model (Chung & Lu, PNAS, 2002)
• Desired node degrees are specified in advance.
• New edges are inserted, choosing endpoints by desired degree.
• Higher-degree nodes are more likely to be selected.
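The structure-driven idea above is easy to prototype. Below is a minimal, hypothetical Python sketch of the Chung-Lu (CL) model as described on the slide: both endpoints of each edge are drawn with probability proportional to the desired degrees. The function name and the set-based de-duplication are my own choices, not code from the authors.

```python
import random

def chung_lu_edges(desired_degrees, seed=0):
    """Sketch of the CL model: both endpoints of every edge are drawn
    independently, with probability proportional to the desired degree."""
    rng = random.Random(seed)
    nodes = list(range(len(desired_degrees)))
    num_draws = sum(desired_degrees) // 2       # each edge consumes two degree "slots"
    edges = set()
    for _ in range(num_draws):
        # higher-degree nodes are more likely to be selected as endpoints
        u, v = rng.choices(nodes, weights=desired_degrees, k=2)
        if u != v:
            edges.add((min(u, v), max(u, v)))   # drop self-loops, de-duplicate
    return edges

# Example: a small heavy-tailed degree sequence
print(len(chung_lu_edges([7, 4, 3, 3, 2, 2, 1, 1, 1, 1])))
```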

  4. A Good Network Model… (repeat of slide 3)

  5. Degree Dist. Measures Connectivity
The degree distribution is one way to characterize a graph.
Barabasi & Albert, Science, 1999: “A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution.”
[Figure: example graph with labeled nodes]

  6. Clustering Coeff. Measures Cohesion
The clustering coefficient measures the rate of wedge closure. In social networks, the clustering coefficients decrease smoothly as the degree increases. High-degree nodes generally have little social cohesion.
[Figure: example graph with labeled nodes]
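To make the wedge-closure idea concrete, here is a small, self-contained Python sketch (my own illustration, not the authors’ code) that computes each node’s local clustering coefficient and averages it per degree, which is the kind of degree-vs-clustering curve the slide refers to.

```python
from collections import defaultdict
from itertools import combinations

def clustering_by_degree(adj):
    """Local clustering coefficient (fraction of closed wedges centered at each
    node), averaged over nodes of the same degree.
    `adj` maps node -> set of neighbors (undirected graph)."""
    cc_by_degree = defaultdict(list)
    for v, nbrs in adj.items():
        d = len(nbrs)
        if d < 2:
            continue                                   # no wedges centered at v
        closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        cc_by_degree[d].append(closed / (d * (d - 1) / 2))
    return {d: sum(vals) / len(vals) for d, vals in cc_by_degree.items()}

# Example: a triangle with one pendant node attached
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
print(clustering_by_degree(adj))   # degree-2 nodes fully closed, degree-3 node only 1/3
```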

  7. Current State-of-the-Art Falls Short
Story-driven models
• Examples:
  • Preferential Attachment (Barabasi & Albert, Science 1999)
  • Forest Fire (Leskovec, Kleinberg, Faloutsos, KDD 2005)
  • Random Walk (Vazquez, Phys. Rev. E 2003)
• Pros & cons:
  • Poor fits to real data
  • Expensive to calibrate to real data
  • Do not scale: inherently sequential
  • Survey: Sala et al., WWW 2010

Structure-driven models
• Examples:
  • CL: Chung-Lu, aka Configuration Model or weighted Erdös-Rényi (Chung & Lu, PNAS 2002)
  • SKG: Stochastic Kronecker Graphs; R-MAT is a special case (Leskovec et al., JMLR 2010; Chakrabarti, Zhan, Faloutsos, SDM 2004); the Graph 500 generator!
• Pros & cons:
  • Do not capture clustering coefficients
  • SKG expensive to calibrate
  • Scales: generation cost O(m log n)
  • CL & SKG very similar in behavior (Pinar, Seshadhri, Kolda, SDM 2012)
[Figure: clustering coefficient vs. degree]

  8. Stochastic Kronecker Graph (SKG) as Graph 500 Generator
Pros:
• Only 5 parameters: a 2x2 generator matrix (entries sum to 1) and n = 2^L = # nodes; m = 16n = # edges
• O(m log n) generation cost
• Edge generation fully parallelizable (except de-duplication)
Cons:
• Oscillations in the degree distribution (fixed by adding special noise)
• Limited degree distribution (the noisy version is lognormal)
• Half the nodes are isolated!
• Tiny clustering coefficients!

  L    Isolated    d_avg
  26   51%         32
  29   57%         37
  32   62%         41
  36   67%         49
  39   71%         55
  42   74%         62

Seshadhri, Pinar, Kolda, Journal of the ACM, April 2012
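As a rough illustration of how an SKG / R-MAT edge is drawn, the sketch below descends L levels of the implicit 2^L x 2^L adjacency matrix, picking a quadrant at each level according to the 2x2 generator. This is my own simplified rendering, not the Graph 500 reference code, and the example generator entries (0.57, 0.19, 0.19, 0.05) are used only as an illustration.

```python
import random

def skg_edge(T, L, rng):
    """One SKG / R-MAT edge: descend L levels of the implicit 2^L x 2^L adjacency
    matrix, choosing a quadrant at each level according to the 2x2 generator T
    (whose four entries sum to 1)."""
    (a, b), (c, d) = T
    row = col = 0
    for _ in range(L):
        r = rng.random()
        if r < a:
            i, j = 0, 0
        elif r < a + b:
            i, j = 0, 1
        elif r < a + b + c:
            i, j = 1, 0
        else:
            i, j = 1, 1
        row, col = 2 * row + i, 2 * col + j
    return row, col

def skg_graph(T, L, edges_per_node=16, seed=0):
    """Draw m = 16n edges independently; in a parallel setting each worker can
    draw its own share, with de-duplication handled as a separate step."""
    rng = random.Random(seed)
    m = edges_per_node * (2 ** L)
    return [skg_edge(T, L, rng) for _ in range(m)]

# Example with an illustrative generator matrix
edges = skg_graph(((0.57, 0.19), (0.19, 0.05)), L=10)
print(len(edges))
```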

  9. The Physics of Graphs
Thm: A random graph that (1) is formed according to the CL model and (2) has a “high” clustering coefficient must contain a “substantive” subgraph that is a dense Erdös-Rényi graph.
A heavy-tailed network with a high clustering coefficient contains many Erdös-Rényi affinity blocks. (The distribution of the block sizes is also heavy tailed.)
Basic measurements lead to inferences about larger structures (communities) that are consistent with the literature.
[Figure: CL-model random graph, global clustering coefficient, dense Erdös-Rényi subgraph]
Seshadhri, Kolda, Pinar, Phys. Rev. E, 2012

  10. BTER: Block Two-Level Erdös-Rényi
Preprocessing:
• Create affinity blocks of nodes with (nearly) the same degree, determined by the degree distribution
• Set the connectivity per block based on the clustering coefficient
• For each node, compute the desired within-block degree and the excess degree
Phase 1:
• Erdös-Rényi graphs in each block
• Need to insert extra links to ensure enough unique links per block
Phase 2:
• CL model on the excess degrees (a sort of weighted Erdös-Rényi)
• Creates connections across blocks
Phases 1 and 2 occur independently.
Seshadhri, Kolda, Pinar, Phys. Rev. E, 2012
Kolda, Pinar, Plantenga, Seshadhri, arXiv:1302.6636, Feb. 2013
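A minimal sketch of the two BTER phases described above, assuming the blocks, per-block connection probabilities, and excess degrees have already been computed in preprocessing. The names and data layout are my own simplifications of the published description, not the released BTER code.

```python
import random

def bter_sketch(blocks, block_conn, excess_degrees, seed=0):
    """Minimal BTER sketch. Inputs (assumed precomputed in preprocessing):
    blocks:         list of node lists, each holding nodes of (nearly) the same degree
    block_conn:     per-block Erdos-Renyi connection probability (from clustering coeff.)
    excess_degrees: dict node -> degree left over for cross-block (Phase 2) edges
    """
    rng = random.Random(seed)
    edges = set()

    # Phase 1: an Erdos-Renyi graph inside each affinity block
    for block, p in zip(blocks, block_conn):
        for i, u in enumerate(block):
            for v in block[i + 1:]:
                if rng.random() < p:
                    edges.add((min(u, v), max(u, v)))

    # Phase 2: Chung-Lu on the excess degrees, creating cross-block connections
    nodes = list(excess_degrees)
    weights = [excess_degrees[v] for v in nodes]
    for _ in range(sum(weights) // 2):
        u, v = rng.choices(nodes, weights=weights, k=2)
        if u != v:
            edges.add((min(u, v), max(u, v)))
    return edges

# Example: two 3-node blocks, dense inside, one unit of excess degree per node
print(len(bter_sketch([[0, 1, 2], [3, 4, 5]], [0.9, 0.9], {v: 1 for v in range(6)})))
```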

  11. BTER vs. SKG: Co-authorship
[Figures: degree distribution and clustering coefficients]
SKG & CL lack enough triangles.
SKG parameters from Leskovec et al., JMLR, 2010.

  12. BTER vs. SKG: Social Website
[Figures: degree distribution and clustering coefficients]
Note the oscillations in SKG.
SKG parameters from Leskovec et al., JMLR, 2010.

  13. Community Structure of BTER Improves Eigenvalue Fit
[Figures: leading eigenvalues of the adjacency matrix]

  14. Making BTER Scalable
Requirements:
• Extreme scalability requires independent edge insertion.
• Data structures should be o(|V|) so that they can be duplicated at each processor.
Data structures:
• Given the degree distribution, compute <block size, #blocks>, which requires O(d_max) memory.
• Given the clustering coefficients, compute the number of edges per block, hence the Phase 1 degrees.
• Given the Phase 1 degrees, compute the residual (Phase 2) degrees.
Challenge: adjust for repetitions.
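A simplified sketch of the compact <block size, #blocks> data structure: working only from per-degree node counts (O(d_max) storage), one can derive how many full affinity blocks each degree contributes. The actual BTER preprocessing groups nodes in increasing-degree order and handles blocks that mix adjacent degrees, which this illustration omits.

```python
def blocks_from_degree_counts(degree_counts):
    """From counts of nodes per degree (O(d_max) storage), derive
    (block_size, number_of_blocks) per degree. In this simplified view a block
    for degree d holds d + 1 nodes, so every node can reach its within-block
    degree; any remainder forms one partial block."""
    layout = {}
    for d, count in sorted(degree_counts.items()):
        if d < 1:
            continue                               # degree-0 nodes get no block
        block_size = d + 1
        full_blocks, leftover = divmod(count, block_size)
        layout[d] = {"block_size": block_size,
                     "num_full_blocks": full_blocks,
                     "leftover_nodes": leftover}
    return layout

# Example: a small heavy-tailed degree histogram {degree: #nodes}
print(blocks_from_degree_counts({1: 50, 2: 20, 3: 10, 10: 2}))
```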

  15. Adjusting for repeated edges
• Parallel edge insertion leads to multiple (repeated) edges.
• This is negligible if edge probabilities are small, as they are for SKG and CL, but not for BTER.
• BTER has dense blocks, hence many repeats.
• We add extra edges to guarantee that the number of unique edges is as expected: a coupon collector problem.
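The coupon-collector correction can be sketched as follows: if s edges are drawn uniformly and independently from E possible slots in a block, the expected number of distinct edges is E(1 - (1 - 1/E)^s); inverting this tells us how many draws to issue so that the expected number of unique edges hits the target. This is a generic illustration of the idea, not the exact adjustment used in the paper.

```python
import math

def draws_for_expected_unique(num_slots, target_unique):
    """Number of independent uniform draws s such that the expected number of
    distinct edges, num_slots * (1 - (1 - 1/num_slots)**s), reaches target_unique."""
    if target_unique >= num_slots:
        raise ValueError("cannot expect more unique edges than there are slots")
    return math.ceil(math.log(1 - target_unique / num_slots)
                     / math.log(1 - 1 / num_slots))

# Example: a dense 20-node block has 190 possible edges; to end up with ~150
# unique edges in expectation, we must over-sample noticeably.
print(draws_for_expected_unique(190, 150))   # well over 150 draws needed
```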
