Mining Social Network Graphs
Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata
November 13, 17, 2014
Social Network: No introduction required. Really? We still …
Disclaimer: the brand logos are used here entirely for educational purposes.
A social network consists of:
• Entities
– Typically people, but could be something else too
• A relationship between entities
– For example: friends
– Sometimes boolean: two people are either friends or they are not
– May have a degree
– Discrete degree: friends, family, acquaintances, or none
– Real-valued degree: the fraction of the average day that two people spend talking to each other
• An assumption of nonrandomness (locality)
– Hard to formalize
– Intuition: relationships tend to cluster
– If entity A is related to both B and C, then the probability that B and C are related is higher than average (random)
– For a large random graph, the probability would be close to |E| / C(|V|, 2) = 9/21 ≈ 0.43
– Small graph: XY and XZ are already edges, so compute over the remaining pairs
– So the probability is (|E| − 2) / (C(|V|, 2) − 2) = 7/19 ≈ 0.37
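The arithmetic above can be checked with a short sketch. It assumes the example graph has 7 nodes and 9 edges, which is what the fractions 9/21 and 7/19 imply; the function name `random_edge_probability` is illustrative and not from the slides.

```python
from fractions import Fraction
from math import comb


def random_edge_probability(num_nodes, num_edges, known_edges=0):
    """Chance that a given node pair is an edge if edges were placed at random.

    Pairs already known to be edges are removed from both the edge count
    and the count of candidate pairs.
    """
    candidate_pairs = comb(num_nodes, 2) - known_edges
    return Fraction(num_edges - known_edges, candidate_pairs)


# Assumed sizes: 7 nodes and 9 edges, as implied by 9/21 and 7/19.
large_graph_estimate = random_edge_probability(7, 9)
small_graph_estimate = random_edge_probability(7, 9, known_edges=2)

print(round(float(large_graph_estimate), 2))
print(round(float(small_graph_estimate), 2))
```

Exact fractions are used so that rounding only happens when the values are printed.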
A graph with boolean (friends) relationship
Example courtesy: Leskovec, Rajaraman and Ullman
X      Pairs YZ                  Yes/Total
A      BC                        1/1
B      AC, AD, CD                1/3
C      AB                        1/1
D      BE, BG, BF, EF, EG, FG    2/6
E      DF                        1/1
F      DE, DG, EG                2/3
G      DF                        1/1
Total                            9/16 ≈ 0.56
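The table can be reproduced programmatically. The sketch below is illustrative only: the edge list is read off the neighbour pairs in the table (for example, D's pairs imply that D is adjacent to B, E, F and G), and the helper names are hypothetical.

```python
from itertools import combinations

# Edge list inferred from the table's neighbour pairs (7 nodes, 9 edges).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def locality_counts(adj):
    """Return (yes, total) over every node X and every pair Y, Z of its
    neighbours: total counts the pairs, yes counts those that are edges."""
    yes = total = 0
    for x, neighbours in adj.items():
        for y, z in combinations(sorted(neighbours), 2):
            total += 1
            if z in adj[y]:
                yes += 1
    return yes, total


yes, total = locality_counts(build_adjacency(EDGES))
print(yes, total, yes / total)
```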
The graph does have the locality property: 9/16 ≈ 0.56 is well above the 0.37 expected if edges were random.
– One-directional edges would allow spammers to have edges
– Likely to put two nodes with small distance in the same cluster
– Social network graphs would have cross-community edges
– Severe merging of communities likely
• Betweenness of an edge AB: the number of pairs of nodes (X, Y) such that AB lies on the shortest path between X and Y
– There can be more than one shortest path between X and Y
– Credit AB with the fraction of those paths that include the edge AB
• What does a high betweenness score mean?
– The edge runs "between" two communities
• Betweenness gives a better measure
– Edges such as BD get a higher score than edges such as AB
• Not a distance measure; it may not satisfy the triangle inequality. That doesn't matter!
Calculate betweenness of edges
[Figure: BFS levels 1 to 3 of the example graph, with node labels 1, 1, 1, 1, 2, 1, 1]
Step 3, credit sharing:
• Each leaf node gets credit 1
• Each non-leaf node gets credit 1 + sum(credits of the DAG edges to the level below)
• Credit of DAG edges: let Y_i (i = 1, …, k) be the parents of Z, and p_i = label(Y_i)
$$\mathrm{credit}(Y_i, Z) = \mathrm{credit}(Z) \times \frac{p_i}{p_1 + \cdots + p_k}$$
[Figure: the example DAG annotated with node and edge credits]
• Intuition: a DAG edge Y_iZ gets a share of the credit of Z proportional to the number of shortest paths from the root X to Z that go through Y_iZ
Finally: repeat Steps 1, 2 and 3 with each node as root. For each edge, betweenness = (sum of the credits obtained over all roots) / 2, since every shortest path is counted twice, once from each of its endpoints.
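The credit-sharing procedure can be sketched in Python as follows. Steps 1 and 2 are not spelled out in the surviving text, so this sketch assumes they are a BFS from the root and the labelling of each node with its number of shortest paths from the root, as in the usual Girvan-Newman formulation. The edge list is assumed to be the 7-node example graph (edges inferred from the locality table), and all function names are illustrative.

```python
from collections import defaultdict, deque

# Assumed example graph: 7 nodes, 9 edges.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_credits_from_root(adj, root):
    """Credit of every DAG edge for a single BFS root."""
    # Assumed Steps 1 and 2: BFS levels and shortest-path counts (labels).
    level = {root: 0}
    label = defaultdict(int)
    label[root] = 1
    order = []
    queue = deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
            if level[v] == level[u] + 1:  # u is a parent of v in the DAG
                label[v] += label[u]
    # Step 3: every node starts with credit 1; share it upwards, deepest first.
    credit = {u: 1.0 for u in order}
    edge_credit = {}
    for z in reversed(order):
        parents = [y for y in adj[z] if level[y] == level[z] - 1]
        label_sum = sum(label[y] for y in parents)
        for y in parents:
            share = credit[z] * label[y] / label_sum
            edge_credit[frozenset((y, z))] = share
            credit[y] += share
    return edge_credit


def edge_betweenness(adj):
    """Sum the credits over all roots and halve them."""
    totals = defaultdict(float)
    for root in adj:
        for edge, share in edge_credits_from_root(adj, root).items():
            totals[edge] += share
    return {edge: total / 2 for edge, total in totals.items()}


adj = build_adjacency(EDGES)
from_e = edge_credits_from_root(adj, "E")
betweenness = edge_betweenness(adj)
print(from_e[frozenset(("D", "E"))], betweenness[frozenset(("B", "D"))])
```

Edges are keyed by `frozenset` so that (B, D) and (D, B) refer to the same undirected edge.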
– BFS starting at each node: O(e)
– Do it for n nodes
– Total: O(ne) time
– Very expensive
– Choose a random subset W of the nodes
– Compute the credit of each edge starting at each node in W
– Sum and compute betweenness
– A reasonable approximation
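A minimal sketch of the sampling idea, under stated assumptions: the roots are drawn uniformly without replacement, and the summed credits are rescaled by n/|W| so that the estimate is comparable to the exact value. The rescaling is an assumption of this sketch, not something the slides specify. The graph is the assumed 7-node example and all names are illustrative.

```python
import random
from collections import defaultdict, deque

# Assumed example graph: 7 nodes, 9 edges.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def credits_from_root(adj, root):
    """Edge credits for one BFS root (path counting, then credit sharing)."""
    level, label, order = {root: 0}, defaultdict(int), []
    label[root] = 1
    queue = deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
            if level[v] == level[u] + 1:
                label[v] += label[u]
    credit = {u: 1.0 for u in order}
    edge_credit = {}
    for z in reversed(order):
        parents = [y for y in adj[z] if level[y] == level[z] - 1]
        label_sum = sum(label[y] for y in parents)
        for y in parents:
            share = credit[z] * label[y] / label_sum
            edge_credit[frozenset((y, z))] = share
            credit[y] += share
    return edge_credit


def approximate_betweenness(adj, sample_size, seed=0):
    """Estimate edge betweenness from a random subset W of the roots."""
    rng = random.Random(seed)
    roots = rng.sample(sorted(adj), sample_size)  # the subset W
    scale = len(adj) / sample_size  # assumed rescaling, not from the slides
    totals = defaultdict(float)
    for root in roots:
        for edge, share in credits_from_root(adj, root).items():
            totals[edge] += share
    return {edge: scale * total / 2 for edge, total in totals.items()}


adj = build_adjacency(EDGES)
exact = approximate_betweenness(adj, sample_size=len(adj))  # W = all nodes
estimate = approximate_betweenness(adj, sample_size=3)
print(sorted(estimate.items(), key=lambda item: -item[1])[:3])
```

With W equal to the full node set the estimate coincides with the exact betweenness; with a smaller W the cost drops from O(ne) to O(|W|e), at the price of sampling error.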
Method 1:
• Keep adding edges (among existing ones), starting from the lowest betweenness
• Gradually join small components to build large connected components
Method 2:
• Start from all existing edges. The graph may look like one big component.
• Keep removing edges, starting from the highest betweenness
• Gradually split large components to arrive at communities
At some point, removing the edge with the highest betweenness would split the graph into separate components.
– Method 2 is likely to take fewer operations. Why?
– There are fewer inter-community edges than intra-community edges
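Method 2 can be sketched as a loop that removes the highest-betweenness edge until the graph falls apart into the desired number of components. This is a hedged illustration: recomputing betweenness after every removal and stopping at a fixed component count are both assumptions of the sketch, the graph is the assumed 7-node example, and the function names are hypothetical.

```python
from collections import defaultdict, deque

# Assumed example graph: 7 nodes, 9 edges.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def edge_betweenness(adj):
    """Betweenness of every edge via BFS credit sharing from each root."""
    totals = defaultdict(float)
    for root in adj:
        level, label, order = {root: 0}, defaultdict(int), []
        label[root] = 1
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    queue.append(v)
                if level[v] == level[u] + 1:
                    label[v] += label[u]
        credit = {u: 1.0 for u in order}
        for z in reversed(order):
            parents = [y for y in adj[z] if level[y] == level[z] - 1]
            label_sum = sum(label[y] for y in parents)
            for y in parents:
                share = credit[z] * label[y] / label_sum
                totals[frozenset((y, z))] += share / 2
                credit[y] += share
    return dict(totals)


def connected_components(adj):
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in component:
                component.add(u)
                stack.extend(adj[u] - component)
        seen |= component
        components.append(component)
    return components


def split_by_betweenness(edges, target_components):
    """Remove highest-betweenness edges until enough components appear."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    removed = []
    while len(connected_components(adj)) < target_components:
        scores = edge_betweenness(adj)  # recomputed each round (assumption)
        if not scores:
            break  # no edges left to remove
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append(frozenset((u, v)))
    return connected_components(adj), removed


communities, removed = split_by_betweenness(EDGES, target_components=2)
print(sorted(sorted(c) for c in communities), [sorted(e) for e in removed])
```

On this graph a single removal is enough, which illustrates why removing the few inter-community edges tends to need fewer operations than adding the many intra-community ones.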
• If G is a complete graph with n nodes and m = C(n, 2) edges
• Number of triangles = C(n, 3) = Θ(n^3) = Θ(m^(3/2))
• Cannot even enumerate all triangles in less than O(m^(3/2)) time
• Hence m^(3/2) is a lower bound for computing all triangles
• Consider a complete graph G' with n nodes and m edges
• Note that m = C(n, 2) = O(n^2)
• Construct G from G' by adding a chain of length n^2
• The number of triangles remains the same, O(m^(3/2))
• The number of edges remains of the same order, O(m)
• G is quite sparse, since the chain lowers the edge-to-node ratio
• Still, the triangles cannot be computed in less than O(m^(3/2)) time
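The construction can be checked on a small instance. In this sketch the chain is attached to node 0 of the complete graph, which is an assumption since the slides do not say where the chain is attached, and the triangle counter is a simple neighbour-intersection method chosen for illustration rather than an optimal algorithm.

```python
from itertools import combinations
from math import comb


def complete_graph_edges(n):
    """All C(n, 2) edges of the complete graph on nodes 0 .. n-1."""
    return list(combinations(range(n), 2))


def add_chain(edges, n, length):
    """Attach a path of `length` new nodes to node 0 (assumed attachment)."""
    chain = [(0, n)] + [(i, i + 1) for i in range(n, n + length - 1)]
    return edges + chain


def count_triangles(edges):
    """Count each triangle a < b < c once, at its edge (a, b)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    count = 0
    for u, v in edges:
        low, high = min(u, v), max(u, v)
        count += sum(1 for w in adj[low] & adj[high] if w > high)
    return count


n = 6
dense_edges = complete_graph_edges(n)              # G'
sparse_edges = add_chain(dense_edges, n, n * n)    # G = G' plus the chain
nodes_in_g = n + n * n

print(len(dense_edges), count_triangles(dense_edges))
print(len(sparse_edges), count_triangles(sparse_edges))
print(len(dense_edges) / n, len(sparse_edges) / nodes_in_g)
```

The chain adds n^2 nodes and n^2 edges but no triangles, so the triangle count stays at C(n, 3) while the edge-to-node ratio drops.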