a foray into graph mining

A foray into graph mining, Neil Shah, April 15th, 2019



  1. A foray into graph mining Neil Shah April 15th, 2019

  2. (Graph) data is prevalent • 2.5 exabytes of data produced every day • 90% generated in the last 2 years • Data is produced as the product of a highly interconnected world [Figure: platform statistics, platform logos not recoverable: 244 million users; 1.3 billion users; 187 million daily actives; 480 million products; 1 billion daily mobile views; 3.5 billion daily snaps]

  3. (Graph) data shapes perspectives [Figure: diagram whose rotated labels were scrambled in extraction and are not recoverable]

  4. What’s in a graph? • Graphs consist of nodes, edges and attributes • ex: Facebook social network where • nodes = individuals • edges = friendship • attributes = gender (node), # of messages exchanged (edge) • Graphs can easily model relationships between entities • Who-follows-whom on a social network • Who-buys-what on an e-commerce platform • Who-calls-whom using a certain cellular provider
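The node/edge/attribute vocabulary above can be sketched with plain dictionaries. This is a minimal illustration, not a representation the talk prescribes: the user names, attribute values, and the `friends` helper are all hypothetical.

```python
# Hypothetical toy social network; names and values are illustrative only.
# Node attributes: one dict per individual.
nodes = {
    "alice": {"gender": "F"},
    "bob": {"gender": "M"},
    "carol": {"gender": "F"},
}

# Undirected friendship edges, keyed by the (unordered) pair of endpoints.
# Edge attribute: number of messages exchanged.
edges = {
    frozenset({"alice", "bob"}): {"messages": 12},
    frozenset({"bob", "carol"}): {"messages": 3},
}


def friends(user):
    """Return the set of users sharing a friendship edge with `user`."""
    return {other for edge in edges if user in edge for other in edge if other != user}
```

Using `frozenset` keys means the edge alice-bob and the edge bob-alice are the same entry, which matches an undirected friendship relation.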

  5. Roadmap • Preliminaries • Notable graph properties • Cool applications • Recommendation and ranking • Clustering • Anomaly detection • Takeaways

  6. Roadmap • Preliminaries • Notable graph properties • Cool applications • Recommendation and ranking • Clustering • Anomaly detection • Takeaways

  7. Graph preliminaries – directionality [Figure: two users-by-users graphs over nodes u1 to u11]

  8. Graph preliminaries – degree • Degree: # of adjacent edges • Degree(u7) = 2 [Figure: users-by-users graph over nodes u1 to u11]

  9. Graph preliminaries – out- and in-degree • Degree: # of adjacent edges • Out-degree: # outgoing edges • In-degree: # incoming edges • Out-degree(u4) = 1 • In-degree(u6) = 2 [Figure: users-by-users graph over nodes u1 to u11]

  10. Graph preliminaries – weighted degree • Weighted degree: total sum of adjacent edge weights • i.e. “how many times did two users communicate” • Weighted-degree(u6) = 7 [Figure: users-by-users graph over nodes u1 to u11 with edge weights 3, 4, 1, 2, 9, 6, 1]
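The degree definitions on the last three slides can be computed in one pass over an edge list. The sketch below is an illustration only: the edge list is made up (the figure's actual edges did not survive extraction) and was chosen so that the results agree with the values quoted on the slides for u4 and u6.

```python
from collections import defaultdict

# Hypothetical directed, weighted edges: (source, target, weight).
# These are NOT the edges from the slide figure.
edges = [("u1", "u2", 3), ("u2", "u3", 4), ("u4", "u6", 6), ("u5", "u6", 1)]

out_degree = defaultdict(int)       # number of outgoing edges per node
in_degree = defaultdict(int)        # number of incoming edges per node
weighted_degree = defaultdict(int)  # sum of weights over all adjacent edges

for src, dst, weight in edges:
    out_degree[src] += 1
    in_degree[dst] += 1
    weighted_degree[src] += weight
    weighted_degree[dst] += weight


def degree(node):
    """Total degree: every adjacent edge, regardless of direction."""
    return out_degree[node] + in_degree[node]
```

With these edges, `out_degree["u4"]` is 1, `in_degree["u6"]` is 2, and `weighted_degree["u6"]` is 6 + 1 = 7.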

  11. Graph preliminaries – ego(net) • Ego: single, central node • Ego network (egonet): nodes and edges within one “hop” from ego • Egonet(u7) = • Nodes {u7, u3, u5} • Edges {u7-u3, u7-u5} [Figure: users-by-users graph over nodes u1 to u11]
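A minimal sketch of egonet extraction follows. The edge list and the `egonet` function name are assumptions for illustration; only the u7 edges are taken from the slide's stated result.

```python
# Hypothetical undirected edge list; only the two u7 edges come from the slide.
edges = [("u7", "u3"), ("u7", "u5"), ("u1", "u8"), ("u8", "u2")]


def egonet(edges, ego):
    """Return (nodes, edges) within one hop of `ego`."""
    # Neighbors: the other endpoint of every edge touching the ego.
    neighbors = {b if a == ego else a for a, b in edges if ego in (a, b)}
    nodes = {ego} | neighbors
    # Keep every edge whose endpoints both lie inside the egonet.
    ego_edges = {frozenset(edge) for edge in edges if set(edge) <= nodes}
    return nodes, ego_edges


ego_nodes, ego_edges = egonet(edges, "u7")
```

One design choice to be aware of: this version keeps edges between two neighbors of the ego as well (there are none in this toy input), which is a common convention but not something the slide spells out.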

  12. Graph preliminaries – connectivity • Two nodes are connected if there is a path between them. • A graph is fully connected if all node pairs are connected. • u1 and u8 are connected • u3 and u5 are connected • u1 and u9 are not connected • This graph is not fully connected [Figure: users-by-users graph over nodes u1 to u11]
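Connectivity can be checked with a breadth-first search. The sketch below assumes an undirected graph and uses a made-up edge list, chosen only so that the three example statements on the slide hold; the function names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical undirected edges over nodes u1..u11 (not the slide's figure).
edges = [("u1", "u8"), ("u3", "u7"), ("u7", "u5"), ("u9", "u4")]
all_nodes = [f"u{i}" for i in range(1, 12)]

# Adjacency sets: each undirected edge is stored in both directions.
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)


def reachable(start):
    """Breadth-first search: every node with a path from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def connected(a, b):
    """Two nodes are connected if a path exists between them."""
    return b in reachable(a)


def fully_connected():
    """True only if a single search reaches every node in the graph."""
    return len(reachable(all_nodes[0])) == len(all_nodes)
```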

  13. Graph preliminaries – node and edge types • A heterogeneous graph has multiple node and/or edge types. • Users and products • Who-buys-what and who-rates-what [Figure: users-by-products graph with users u1 to u6 and products p1 to p5]

  14. Graph preliminaries – matrix representation • Graph connectivity can be summarized in an adjacency matrix. • $A_{i,j}$ = # (or weight) of edges from node i to j • A is usually very sparse (makes compact representations possible!) [Figure: users-by-users graph over nodes u1 to u11 next to its users-by-users adjacency matrix, whose nonzero entries are 1]
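To make the sparsity remark concrete, here is a small sketch comparing a dense adjacency matrix with a dictionary-of-keys representation that stores only the nonzero entries. The edge list is hypothetical, and the dictionary layout is one of several possible compact formats.

```python
# Hypothetical directed edges (i, j) over 11 users, 1-indexed as on the slide.
edges = [(1, 7), (2, 8), (3, 7), (4, 10)]
n = 11

# Dense form: an n-by-n grid, almost entirely zeros.
dense = [[0] * n for _ in range(n)]

# Sparse form: only the nonzero entries, keyed by (i, j).
sparse = {}

for i, j in edges:
    dense[i - 1][j - 1] += 1                  # A[i][j] counts edges from i to j
    sparse[(i, j)] = sparse.get((i, j), 0) + 1

dense_cells = n * n          # cells stored by the dense matrix
sparse_cells = len(sparse)   # cells stored by the sparse dictionary
```

Here the dense matrix stores 121 cells while the sparse form stores 4, and the gap widens quickly as the number of nodes grows.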

  15. Roadmap • Preliminaries • Notable graph properties • Cool applications • Recommendation and ranking • Clustering • Anomaly detection • Takeaways

  16. Key question: What does a graph “look like”? • At first look… large, unwieldy and seemingly random. • Spoiler: In actuality, most real-world graphs are far from random. [Figure: trace-route paths on the internet, Lyon ’03]

  17. A quick detour: “Random” graphs • Erdős–Rényi random graph model: graph G(n,p) • n = number of nodes • p = probability of an edge between two nodes (independent edges) • Expected # of edges: $p \binom{n}{2} = p \, n(n-1)/2$ • Degree distribution (binomial): $P(\deg = k) = \binom{n-1}{k} p^{k} (1-p)^{n-1-k}$ [Figure: Babaoglu ’18]
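A minimal sketch of sampling from G(n,p), assuming an undirected graph without self-loops. The function name, the seed, and the parameter values are illustrative choices, not part of the talk.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=0):
    """Sample an undirected G(n, p) graph as a list of edges (i, j), i < j."""
    rng = random.Random(seed)
    # Each of the n*(n-1)/2 node pairs gets an edge independently with prob. p.
    return [(i, j) for i, j in combinations(range(n), 2) if rng.random() < p]


n, p = 200, 0.05
edges = erdos_renyi(n, p)

# Expected number of edges: p * n * (n - 1) / 2
expected_edges = p * n * (n - 1) / 2
```

For these values the expectation is 995 edges; a sampled graph will land near that number rather than exactly on it.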

  18. What about real graphs? [Figure: three log-log plots: log(# visitors) vs. log(# sites); log(# posts) vs. log(# users); log(# peers) vs. log(# routers). Sources: Faloutsos ‘99, Viswanath ‘09, Adamic ‘02] • X-axis: degree, Y-axis: frequency/probability • Degree distributions of real graphs are not “random” • What exactly are they, then?

  19. The “scale-free” property • Real-world graphs are often scale-free, meaning that their degree distribution obeys a power law: $P(k) \propto k^{-\gamma}$ • Scaling the input by a multiple simply results in proportional scaling of the whole function • Power laws are linear in log-log scales • Typically $2 \le \gamma \le 3$ [Figure: log(# visitors) vs. log(# sites)]
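Both bullet claims follow in one line each. The derivation below assumes the power-law form $P(k) = C k^{-\gamma}$, where the constant $C$, the scaling factor $c$, and the exponent name $\gamma$ are notation chosen here for illustration.

```latex
% Scale invariance: rescaling the input k by a factor c only rescales the output.
P(ck) = C\,(ck)^{-\gamma} = c^{-\gamma}\,P(k)

% Linearity on log-log axes: a straight line with slope -\gamma.
\log P(k) = \log C - \gamma \log k
```

So the slope of the straight line in a log-log degree plot gives an estimate of the exponent.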

  20. Scale-freeness is evident in many domains Newman ‘05

  21. Why are many real graphs scale-free? • Hypothesis: preferential attachment, or a “rich-get-richer” effect • Generative process to construct a network: • Start with $m_0$ nodes, each with at least 1 edge • At each timestep, add a new node with $m$ edges connecting it to $m$ already existing nodes • Probability of the new node connecting to node $i$ depends on the degree $k_i$ as $p_i = k_i / \sum_j k_j$ • Many real-world variants of this effect: academic citations, recommendation, virality [Figure: log(# visitors) vs. log(# sites)]
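The generative process above can be simulated in a few lines. This is a rough sketch under stated assumptions: the seed graph is a clique on m + 1 nodes (so every starting node has at least one edge), targets are drawn until m distinct ones are found (which deviates slightly from exact degree-proportional sampling), and the function name is hypothetical.

```python
import random
from collections import Counter


def preferential_attachment(n, m, seed=0):
    """Grow a graph to n nodes, each new node attaching to m existing nodes."""
    rng = random.Random(seed)
    m0 = m + 1  # assumed seed graph: a clique on m + 1 nodes
    edges = [(i, j) for i in range(m0) for j in range(i + 1, m0)]
    # Each node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list picks node i with probability k_i / sum_j k_j.
    endpoints = [node for edge in edges for node in edge]
    for new in range(m0, n):
        targets = set()
        while len(targets) < m:  # redraw until m distinct targets are found
            targets.add(rng.choice(endpoints))
        for target in targets:
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges


edges = preferential_attachment(n=1000, m=2)
degrees = Counter(node for edge in edges for node in edge)
```

Inspecting `degrees` shows the rich-get-richer effect: most nodes stay near degree m, while a few early nodes accumulate far more edges.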

  22. Real graphs have “small-world” effects • How “far apart” are nodes in real graphs? • Interestingly, not very far! The typical number is 6. You may have heard of the “six degrees of separation”. • Milgram ‘69: avg. # of hops for a letter to travel from Nebraska to Boston was 6.2 (sample size 64) • Leskovec ‘08: distances between node pairs on MSN Messenger have mode 6 (graph of 180M nodes and 1.3B edges)

  23. What causes the small-world effect? • Hypothesis: The abundance of hubs, or high-degree nodes • Even though most nodes aren’t connected to most other nodes, they are connected to hubs, which facilitate paths [Figure: log(# visitors) vs. log(# sites)]

  24. How do real graphs “grow” over time? • Consider a time-evolving graph $G$ • If it has $N(t)$ nodes and $E(t)$ edges at time t… • Suppose that $N(t+1) = 2N(t)$ • What is $E(t+1)$? • Not only is it $> 2E(t)$; the growth is actually superlinear and follows $E(t) \propto N(t)^{a}$ (power law!) with $1 \le a \le 2$, generally
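The doubling question can be answered directly from the power law. The worked step below writes the node count as $N(t)$, the edge count as $E(t)$, and the densification exponent as $a$; the value $a = 1.5$ is an arbitrary illustration, not a figure from the talk.

```latex
% If E(t) \propto N(t)^a and the node count doubles:
\frac{E(t+1)}{E(t)} = \left(\frac{N(t+1)}{N(t)}\right)^{a} = 2^{a},
\qquad 2 \le 2^{a} \le 4 \quad \text{for } 1 \le a \le 2

% Example with a = 1.5: doubling the nodes multiplies the edges by about 2.83.
2^{1.5} \approx 2.83
```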

  25. Real graphs exhibit densification [Figure panels: avg. out-degree increases over time; power law in # edges vs. # nodes (over time)]

  26. Moreover, the graph diameter shrinks • Graph diameter = max(distance between node pairs) • Leskovec ‘05 shows that diameter actually shrinks over time, instead of growing. In other words, nodes tend to get closer • Hypothesis: Once again due to prevalence and growth of hubs

  27. Much more work done on graph behaviors • Generative graph models (Leskovec ‘05) • Patterns in sizes of connected components (Kang ‘10) • Node in-degree (popularity) over time (McGlohon ‘07) • Duration of calls in phone-call networks (Vaz de Melo ‘10) • Temporal structure evolution (Shah ‘15) … the list goes on

  28. Roadmap • Preliminaries • Notable graph properties • Cool applications • Recommendation and ranking • Clustering • Anomaly detection • Takeaways

  29. Key question: how can we leverage graphs for recommendation/ranking tasks? • Measuring webpage importance • Link prediction and recommendation • Local methods • Global methods

  30. PageRank for large-scale search engines • Key problem: how to prioritize/curate a large (ever-growing) hyperlinked body of pages by importance and relevance? • Key idea: leverage the hyperlink citation graph (page-links-page) to rank page importance according to connectivity patterns • 150 million web pages → 1.7 billion links • Backlinks and forward links: A and B are C’s backlinks; C is A and B’s forward link. Content adapted from Li ‘09

  31. Simplified PageRank • Idea: each page equally distributes its own PageRank to its forward links, recursively. “An important page has many important pages pointing to it” • $R(u) = c \sum_{v \in B_u} R(v) / N_v$ • $u$: a web page • $B_u$: the set of u’s backlinks • $N_v$: the number of forward links of page v • $c$: the normalization factor to make $R$ a probability distribution • Simplified PageRank is the stationary probability distribution of a random walk on the graph; a surfer keeps clicking successive pages at random.
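The recursive definition can be sketched as a simple iteration in which every page repeatedly hands equal shares of its score to its forward links. The three-page link structure below is an assumption for illustration: only Amazon's links to Yahoo and Microsoft are stated in the talk, and the function name and iteration count are arbitrary.

```python
# Hypothetical three-page web. Only Amazon's two forward links are taken from
# the talk; Yahoo's and Microsoft's links are assumed for illustration.
links = {
    "Yahoo": ["Yahoo", "Amazon"],
    "Amazon": ["Yahoo", "Microsoft"],
    "Microsoft": ["Amazon"],
}


def simplified_pagerank(links, iterations=100):
    """Iterate R(u) = c * sum over backlinks v of R(v) / N_v."""
    pages = list(links)
    rank = {page: 1 / len(pages) for page in pages}  # uniform initial scores
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for v, forward in links.items():
            share = rank[v] / len(forward)  # v splits its rank equally
            for u in forward:
                new_rank[u] += share
        total = sum(new_rank.values())      # normalization factor c = 1 / total
        rank = {page: r / total for page, r in new_rank.items()}
    return rank


rank = simplified_pagerank(links)
```

For this assumed graph the scores settle at 0.4 for Yahoo, 0.4 for Amazon and 0.2 for Microsoft, which is the stationary distribution of the random surfer on these three pages.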

  32. Simplified PageRank • Read as “Amazon gives ½ of its own PageRank to Yahoo and Microsoft each” • PageRank calculation: first iteration [Figure: adjacency matrix over Yahoo, Amzn and MS, transposed and column-normalized (accounts for equal neighbor distribution), shown with the initial PageRank scores]

  33. Simplified PageRank • Read as “Amazon gives ½ of its own PageRank to Yahoo and Microsoft each” • PageRank calculation: second iteration [Figure: adjacency matrix over Yahoo, Amzn and MS, transposed and column-normalized (accounts for equal neighbor distribution), shown with the initial PageRank scores]

  34. Simplified PageRank • Read as “Amazon gives ½ of its own PageRank to Yahoo and Microsoft each” • Convergence after some iterations [Figure: adjacency matrix over Yahoo, Amzn and MS, transposed and column-normalized (accounts for equal neighbor distribution), shown with the initial PageRank scores]

  35. Problem with Simplified PageRank A loop: During each iteration, the loop accumulates rank but never distributes rank to other pages!
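The loop problem is easy to reproduce on a tiny example. The sketch below uses a made-up three-page graph in which page A links into a two-page loop (B and C link only to each other), and applies three rounds of the equal-share update.

```python
# Hypothetical graph: A links to B; B and C form a loop with no way out.
links = {"A": ["B"], "B": ["C"], "C": ["B"]}

rank = {page: 1 / 3 for page in links}  # uniform initial scores

for _ in range(3):
    new_rank = dict.fromkeys(links, 0.0)
    for v, forward in links.items():
        for u in forward:
            new_rank[u] += rank[v] / len(forward)  # v shares its rank equally
    rank = new_rank
```

After the first round A's score is already zero, because nothing links back to it, and from then on the entire rank mass circulates between B and C without ever leaving the loop.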
