CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Networks of tightly connected groups
Network communities:
- Sets of nodes with lots of connections inside and few to outside (the rest of the network)
Communities, clusters, groups, modules
11/3/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
[Onnela et al. ‘07]
[Figure panels: edge strengths (call volume) in real network; edge betweenness in real network]
[Girvan‐Newman PNAS ‘02]
Divisive hierarchical clustering based on edge betweenness:
- Edge betweenness: number of shortest paths passing through the edge
Girvan‐Newman Algorithm:
- Repeat until no edges are left:
- Calculate betweenness of edges
- Remove edges with highest betweenness
- Connected components are communities
- Gives a hierarchical decomposition of the network
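The loop above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`edge_betweenness`, `components`, `girvan_newman`) and the adjacency-dict input are assumptions, the graph is taken to be undirected, unweighted and free of self-loops, and all edges tied for the highest betweenness are removed together.

```python
from collections import deque

def edge_betweenness(adj):
    """Betweenness of every edge; adj maps node -> set of neighbours."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        # BFS from source, counting shortest paths (sigma) to each node.
        dist, sigma, order = {source: 0}, {source: 1}, []
        queue = deque([source])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], sigma[v] = dist[u] + 1, 0
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
        # Work back up the BFS levels, splitting credit fractionally.
        credit = {v: 1.0 for v in order}
        for v in reversed(order):
            for u in adj[v]:
                if dist[u] == dist[v] - 1:
                    share = credit[v] * sigma[u] / sigma[v]
                    score[frozenset((u, v))] += share
                    credit[u] += share
    # Every path was counted once from each of its two endpoints.
    return {edge: s / 2 for edge, s in score.items()}

def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def girvan_newman(adj):
    """Yield the components each time an edge removal splits the graph."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    n_comps = len(components(adj))
    while any(adj.values()):
        scores = edge_betweenness(adj)
        top = max(scores.values())
        for edge, s in scores.items():
            if abs(s - top) < 1e-9:  # remove every edge tied for the top score
                u, v = tuple(edge)
                adj[u].discard(v)
                adj[v].discard(u)
        comps = components(adj)
        if len(comps) > n_comps:
            n_comps = len(comps)
            yield comps

# Two triangles joined by the bridge 2-3: the bridge is removed first.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
first_split = next(girvan_newman(graph))
```

Each yielded list is one level of the hierarchical decomposition; collecting all of them gives the full dendrogram from coarse to fine.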
Example:
[Newman‐Girvan PhysRevE ‘03]
Zachary’s Karate club: hierarchical decomposition
[Newman‐Girvan PhysRevE ‘03]
Communities in physics collaborations
Want to compute betweenness of paths starting at node A
Breadth‐first search starting from A:
Count the number of shortest paths from A to all other nodes of the network:
Compute betweenness by working up the tree:
If there are multiple paths count them fractionally
[Figure annotations: 1+1 paths to H, split evenly; 1+0.5 paths to J, split 1:2; 1 path to K, split evenly]
- Repeat the BFS procedure for each node of the network
- Add edge scores
- Runtime (all pairs shortest path):
  - Weighted graphs: O(N^3)
  - Unweighted graphs: O(N^2)
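The single-source step (count shortest paths going down, then split credit fractionally going back up) can be sketched as follows. The function name `single_source_edge_credit` and the small diamond-shaped graph are illustrative assumptions; this is not the graph shown on the slide.

```python
from collections import deque

def single_source_edge_credit(adj, source):
    """Return (sigma, flow) for one BFS root.

    sigma[v]     = number of shortest paths from source to v
    flow[(u, v)] = credit assigned to edge u-v, with u the endpoint nearer the source
    """
    dist, sigma, order = {source: 0}, {source: 1}, []
    queue = deque([source])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in dist:
                dist[v], sigma[v] = dist[u] + 1, 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]  # every shortest path to u extends to v
    credit = {v: 1.0 for v in order}  # each node starts with credit for its own path
    flow = {}
    for v in reversed(order):  # deepest nodes first
        for u in adj[v]:
            if dist[u] == dist[v] - 1:
                # Split v's credit among its parents in proportion to their path counts.
                flow[(u, v)] = credit[v] * sigma[u] / sigma[v]
                credit[u] += flow[(u, v)]
    return sigma, flow

# Hypothetical example: A reaches D by two shortest paths (via B and via C).
adj = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"},
       "D": {"B", "C", "E"}, "E": {"D"}}
sigma, flow = single_source_edge_credit(adj, "A")
```

Here D and E each have two shortest paths from A, so the credit arriving at D is split evenly between edges B-D and C-D. Summing `flow` over every choice of source (and halving, since each path is seen from both ends) gives the edge betweenness.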
Define modularity to be
Q = (number of edges within groups) − (expected number within groups)
- Actual number of edges between i and j is Aij
- Expected number of edges between i and j is ki kj / (2m)
Then, normalizing by 2m:
$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$
- m … number of edges
- Aij … 1 if (i,j) is edge, else 0
- ki … degree of node i
- ci … group id of node i
- δ(a, b) … 1 if a=b, else 0
Modularity lies in the range [−1,1]
- It is positive if the number of edges within groups exceeds the expected number
- 0.3<Q<0.7 means significant community structure
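A minimal sketch of the modularity computation, assuming the standard normalization Q = (1/2m) Σij [Aij − ki kj/(2m)] δ(ci, cj), an undirected simple graph given as an edge list, and a dict mapping each node to its group id. The function name `modularity` is an illustrative choice.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected simple graph."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # An edge inside a group contributes A_uv + A_vu = 2 to the double sum.
    inside = sum(2 for u, v in edges if community[u] == community[v])
    # Sum over same-group ordered pairs of k_i k_j / (2m) = (group degree sum)^2 / (2m).
    group_degree = {}
    for node, k in degree.items():
        c = community[node]
        group_degree[c] = group_degree.get(c, 0) + k
    expected = sum(d * d for d in group_degree.values()) / (2 * m)
    return (inside - expected) / (2 * m)

# Two triangles joined by a bridge, split into the two triangles.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, split)  # 5/14, about 0.357
```

Putting every node in a single group gives Q = 0, since the actual and expected edge counts then coincide.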
Modularity is useful for selecting the number of clusters:
Why not optimize modularity directly?
Consider splitting the graph in two communities
Modularity Q is:
$Q = \frac{1}{2m} \sum_{i,j \text{ in same group}} \left[ A_{ij} - \frac{k_i k_j}{2m} \right]$
Or we can write in matrix form as
$Q = \frac{1}{4m} s^T B s$
- s … vector of group memberships, si ∈ {+1, −1}
- B … modularity matrix, $B_{ij} = A_{ij} - \frac{k_i k_j}{2m}$
Note: each row (column) of B sums to 0
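Why the zero row sums matter: a short derivation of the matrix form, assuming the standard definitions $B_{ij} = A_{ij} - k_i k_j/(2m)$ and $s_i \in \{+1,-1\}$, so that $\delta(c_i, c_j) = (s_i s_j + 1)/2$ for a two-way split.

```latex
\begin{align*}
\sum_j B_{ij} &= \sum_j A_{ij} - \frac{k_i}{2m}\sum_j k_j
               = k_i - \frac{k_i}{2m}\,2m = 0 \\
Q &= \frac{1}{2m}\sum_{i,j} B_{ij}\,\frac{s_i s_j + 1}{2}
   = \frac{1}{4m}\sum_{i,j} B_{ij}\, s_i s_j + \frac{1}{4m}\sum_{i,j} B_{ij}
   = \frac{1}{4m}\, s^T B s
\end{align*}
```

The second sum vanishes because every row of B sums to 0, which is where the factor 1/(4m) in the matrix form comes from.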
Task: Find s ∈ {−1,+1}^n that maximizes Q
Rewrite Q in terms of eigenvalues βi of B (with orthonormal eigenvectors ui):
$Q = \frac{1}{4m} s^T \left[ \sum_{i=1}^{n} u_i \beta_i u_i^T \right] s = \frac{1}{4m} \sum_{i=1}^{n} s^T u_i \beta_i u_i^T s = \frac{1}{4m} \sum_{i=1}^{n} (s^T u_i)^2 \beta_i$
To maximize Q, easiest way is to make s = u1
- Assigns all weight in the sum to β1 (largest eigval)
- (all other sTui terms zero because of orthonormality)
- Unfortunately, elements of s must be ±1
- In general, finding optimal s is NP‐hard
$Q = \frac{1}{4m} \sum_{i=1}^{n} (s^T u_i)^2 \beta_i$
Heuristic: try to maximize only the β1 term
Similar in spirit to the spectral partitioning algorithm (we will explore it next time)
Continue the bisection hierarchically
Fast Modularity Optimization Algorithm:
- Find leading eigenvector u1 of modularity matrix B
- Divide the nodes by the signs of the elements of u1
- Repeat hierarchically until:
  - If a proposed split does not cause modularity to increase, declare community indivisible and do not split it
  - If all communities are indivisible, stop
How to find u1? Power method!
- Iterative multiplication, normalization
- Start with random v, until convergence:
  $v_{k+1} = \frac{B v_k}{\lVert B v_k \rVert}$
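One bisection step with the power method might look like the sketch below. It assumes nodes are numbered 0..n-1 and uses dense lists for clarity. One detail is an addition not stated on the slide: plain power iteration converges to the eigenvalue of largest magnitude, which for B can be negative, so the sketch iterates on B + cI with c equal to the largest absolute row sum. The function names are illustrative.

```python
import math
import random

def modularity_matrix(n, edges):
    """Dense B with B[i][j] = A_ij - k_i k_j / (2m); nodes are 0..n-1."""
    m = len(edges)
    k = [0] * n
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1.0
        k[u] += 1
        k[v] += 1
    return [[A[i][j] - k[i] * k[j] / (2 * m) for j in range(n)] for i in range(n)]

def leading_eigenvector(B, iters=1000, tol=1e-10, seed=0):
    """Power iteration on B + shift*I (assumption: the shift makes beta_1 dominant)."""
    n = len(B)
    shift = max(sum(abs(x) for x in row) for row in B)  # bounds every |eigenvalue|
    rng = random.Random(seed)
    v = [rng.random() - 0.5 for _ in range(n)]  # random start vector
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(n)) + shift * v[i] for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]  # normalization step
        if max(abs(a - b) for a, b in zip(v, w)) < tol:
            return w
        v = w
    return v

def spectral_bisection(n, edges):
    """Split nodes by the signs of the leading eigenvector's elements."""
    u1 = leading_eigenvector(modularity_matrix(n, edges))
    return [1 if x >= 0 else -1 for x in u1]

# Two triangles joined by a bridge: the signs separate the triangles.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
s = spectral_bisection(6, edges)
```

For large sparse graphs one would avoid forming B explicitly, since Bv can be computed as Av − k(kᵀv)/(2m); the dense version here only keeps the example short.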
Also, can combine with other methods:
- Randomly divide the nodes into two groups
- Move the node that, if moved, will increase Q the most
- Repeat for all nodes, with each node only moved once
- Once complete, find intermediate state with highest Q
- Start from this state and repeat until Q stops increasing
- Good results for “fine‐tuning” the spectral method
CNM Algorithm (Clauset‐Newman‐Moore ‘04):
- (1) Place each vertex in its own community (n communities)
- (2) Calculate the change in Q for all possible community pairs
- (3) Merge the pair with the largest increase in Q
- Repeat (2)&(3) until one community remains
- Cut the dendrogram where Q is maximum
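The greedy merging steps can be sketched as below. This is a naive illustration rather than the CNM implementation: it rescans all linked community pairs at every step instead of using the heap-based data structures that give CNM its speed. It assumes an undirected simple graph with comparable node labels, and it uses the standard merge gain ΔQ = (edges between c and d)/m − (deg c)(deg d)/(2m²). Only pairs joined by at least one edge are considered, since unlinked pairs always have negative gain. The name `cnm_greedy` is hypothetical.

```python
def cnm_greedy(edges):
    """Return (best_Q, communities) at the best cut of the merge dendrogram."""
    m = len(edges)
    nodes = sorted({x for edge in edges for x in edge})
    members = {v: {v} for v in nodes}  # (1) every vertex in its own community
    deg = {v: 0 for v in nodes}        # total degree per community
    links = {}                         # number of edges between two communities
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        key = frozenset((u, v))
        links[key] = links.get(key, 0) + 1
    q = -sum(k * k for k in deg.values()) / (4 * m * m)  # Q of all singletons
    best_q, best = q, [set(s) for s in members.values()]

    def gain(key):
        c, d = tuple(key)
        return links[key] / m - deg[c] * deg[d] / (2 * m * m)

    while links:
        # (2) + (3): merge the linked pair with the largest change in Q.
        key = max(sorted(links, key=sorted), key=gain)
        q += gain(key)
        c, d = sorted(key)
        del links[key]
        for old in [k for k in links if d in k]:  # re-point d's links to c
            (other,) = old - {d}
            new = frozenset((c, other))
            links[new] = links.get(new, 0) + links.pop(old)
        deg[c] += deg.pop(d)
        members[c] |= members.pop(d)
        if q > best_q + 1e-12:  # remember where Q peaks
            best_q, best = q, [set(s) for s in members.values()]
    return best_q, best

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
best_q, communities = cnm_greedy(edges)  # Q = 5/14 with the two triangles
```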
Fast modularity
- GN = Girvan‐Newman, O(n^3)
- CNM = Greedy merging, O(n log^2 n)
- DA = Extremal Optimization, O(n^2 log^2 n)
Issues with modularity:
- May not find communities with fewer than about √m links
- NP‐hard to optimize exactly [Brandes et al. ‘07]
[Kumar et al. ‘99]
Searching for small communities in a Web graph
(1) The signature of a community/discussion in the context of a Web graph
- A dense 2‐layer graph
- Intuition: a bunch of people all talking about the same things
(2) A more well‐defined problem:
Enumerate complete bipartite subgraphs Ks,t
- Where Ks,t = s nodes where each links to the same t other nodes
Two points:
- (1) The signature of a community/discussion
- (2) Complete bipartite subgraph Ks,t
  - Ks,t = graph on s nodes, each links to the same t other nodes
Plan:
- (A) From (2) get back to (1):
  - Via: Any dense enough graph contains a smaller Ks,t as a subgraph
- (B) How do we solve (2) in a giant graph?
  - What similar problems have been solved on giant non‐graph datasets?
  - (3) Frequent itemset enumeration [Agrawal‐Srikant ‘99]
[Agrawal‐Srikant ‘99]
Market basket analysis:
- What items are bought together in a store?
Setting:
- Universe U of n items
- m subsets of U: S1, S2, …, Sm ⊆ U (Si is a set of items one person bought)
- Frequency threshold f
Goal:
- Find all subsets T s.t. T ⊆ Si for at least f of the sets Si (items in T were bought together at least f times)
Example:
- Universe of items:
  - U={1,2,3,4,5}
- Itemsets:
  - S1={1,3,5}, S2={2,3,4}, S3={2,4,5}, S4={3,4,5}, S5={1,3,4,5}, S6={2,3,4,5}
- Minimum support:
  - f=3
Algorithm: Build up the lists
- Insight: for a frequent set of size k, all its subsets are also frequent
[Agrawal‐Srikant ‘99]
U={1,2,3,4,5}, f=3
S1={1,3,5}, S2={2,3,4}, S3={2,4,5}, S4={3,4,5}, S5={1,3,4,5}, S6={2,3,4,5}
[Agrawal‐Srikant ‘99]
For i = 1,…,k
- Find all frequent sets of size i by composing sets of size i‐1 that differ in 1 element
Open question:
- Efficiently find only maximal frequent sets
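The level-by-level build-up can be sketched on the example data (U={1,…,5}, S1..S6, f=3). This is a minimal Apriori-style illustration, assuming a set counts as frequent when it is contained in at least f baskets; the function name `frequent_itemsets` is an illustrative choice.

```python
from itertools import combinations

def frequent_itemsets(baskets, f):
    """Map every itemset contained in at least f baskets to its support."""
    baskets = [frozenset(b) for b in baskets]

    def support(candidate):
        return sum(1 for b in baskets if candidate <= b)

    items = sorted({x for b in baskets for x in b})
    level = {frozenset([x]): support(frozenset([x])) for x in items}
    level = {c: n for c, n in level.items() if n >= f}
    result = dict(level)
    size = 1
    while level:
        size += 1
        # Join: combine two frequent sets of the previous size that differ in one element.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Prune: all subsets one element smaller must themselves be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level for sub in combinations(c, size - 1))}
        level = {c: support(c) for c in candidates}
        level = {c: n for c, n in level.items() if n >= f}
        result.update(level)
    return result

baskets = [{1, 3, 5}, {2, 3, 4}, {2, 4, 5}, {3, 4, 5}, {1, 3, 4, 5}, {2, 3, 4, 5}]
frequent = frequent_itemsets(baskets, 3)
```

On this data the frequent sets are {2}, {3}, {4}, {5}, then {2,4}, {3,4}, {3,5}, {4,5}, and finally {3,4,5}, which appears in S4, S5 and S6.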
Claim: (3) (itemsets) solves (2) (bipartite graphs)
How?
- View each node i as a set Si of nodes i points to
- Ks,t = a set y of size t that occurs in s sets Si
- Looking for Ks,t: set the frequency threshold to s and look at layer t, i.e. all frequent sets of size t
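The correspondence can be made concrete with a brute-force stand-in for the itemset miner: treat each node's out-links as a basket, and report every size-t target set shared by at least s nodes. The function name `find_kst` and the toy link data are assumptions for illustration. Enumerating all size-t subsets is exponential in t, so a real trawl would rely on frequent-itemset pruning instead.

```python
from itertools import combinations

def find_kst(out_links, s, t):
    """Return (left, right) pairs with |right| = t and every node in left linking to all of right.

    out_links maps node i to the set Si of nodes that i points to.
    Only pairs with at least s left-hand nodes are returned.
    """
    buckets = {}
    for i, targets in out_links.items():
        # Node i supports every size-t subset of the nodes it points to.
        for right in combinations(sorted(targets), t):
            buckets.setdefault(right, set()).add(i)
    return [(left, set(right)) for right, left in buckets.items() if len(left) >= s]

# Hypothetical link data: pages a, b, c all point to both x and y.
out_links = {"a": {"x", "y", "z"}, "b": {"x", "y"}, "c": {"x", "y", "z"}, "d": {"z"}}
found = find_kst(out_links, s=3, t=2)
```

With s=3 and t=2 the only result is left = {a, b, c}, right = {x, y}, a K3,2.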
(2) → (1): Informally, every dense enough bipartite graph G contains a Ks,t subgraph, where s and t depend on size (# of nodes) and density (avg. degree) of G [Kovari‐Sos‐Turan ‘53]
Theorem: Let G=(X,Y,E), |X|=|Y|=n with avg. degree:
$\bar{d} = s^{1/t} n^{1-1/t} + t$
then G contains Ks,t as a subgraph
For the proof we will need the following fact
- Recall: $\binom{a}{b} = \frac{a(a-1)\cdots(a-b+1)}{b!}$
- Let f(x) = x(x‐1)(x‐2)…(x‐k)
  Once x ≥ k, f(x) curves upward (convex)
- Suppose a setting:
  - g(y) is convex
  - Want to minimize $\sum_{i=1}^{n} g(x_i)$
  - where $\sum_{i=1}^{n} x_i = x$
  - To minimize $\sum_{i=1}^{n} g(x_i)$ make each xi = x/n
Node i, degree di:
- Buckets: potential right‐hand sides of Ks,t (i.e., all size t subsets of Y)
- Put node i in buckets for all size t subsets of its neighbors
Notice: As soon as s nodes appear in a bucket we have a Ks,t
How many buckets does node i contribute to?
What is the total size of all buckets?
So, total height of all buckets is…
Recall: $\binom{a}{b} = \frac{a(a-1)\cdots(a-b+1)}{b!}$
Total height of all buckets:
How many buckets are there?
What is the average height of buckets?
So by pigeonhole principle, there must be a bucket with more than s nodes in it.
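The formulas answering the questions above did not survive in this transcript. The following is a sketch of how the counting presumably goes, following the standard Kovari‐Sos‐Turan argument and assuming $\bar{d} = s^{1/t} n^{1-1/t} + t$ is the average degree of the n nodes in X, with the binomial for non-integer arguments read through the product formula.

```latex
\begin{align*}
\text{buckets node } i \text{ contributes to:}\quad & \binom{d_i}{t} \\
\text{total height of all buckets:}\quad
  & \sum_{i \in X} \binom{d_i}{t} \;\ge\; n \binom{\bar{d}}{t}
    && \text{(convexity, each } d_i \text{ replaced by the average)} \\
  & n \binom{\bar{d}}{t} \;\ge\; n\,\frac{(\bar{d}-t)^t}{t!}
    \;=\; n\,\frac{s\, n^{t-1}}{t!} \;=\; \frac{s\, n^{t}}{t!} \\
\text{number of buckets:}\quad & \binom{n}{t} \;\le\; \frac{n^{t}}{t!} \\
\text{average height:}\quad
  & \frac{\sum_{i} \binom{d_i}{t}}{\binom{n}{t}} \;\ge\; s
\end{align*}
```

Since the average bucket holds at least s nodes, some bucket holds at least s nodes, and that bucket together with its t right-hand nodes forms a Ks,t.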
Girvan‐Newman:
- Based on strength of weak ties
- Remove edge of highest betweenness
Modularity:
- Useful to determine the number of clusters
- Direct approximate modularity optimization
Trawling (complete bipartite subgraphs):
- Frequent itemsets and dynamic programming
- Theorem that complete bipartite subgraphs are embedded in bigger graphs
- SCALABLE!!!