Networks of tightly connected groups


slide-1
SLIDE 1

CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University

http://cs224w.stanford.edu

slide-2
SLIDE 2

• Networks of tightly connected groups
• Network communities:
  • Sets of nodes with lots of connections inside and few to outside (the rest of the network)
• Communities, clusters, groups, modules

11/3/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

slide-3
SLIDE 3

[Onnela et al. ‘07]

[Figure: edge strengths (call volume) in real network; edge betweenness in real network]

slide-4
SLIDE 4

[Girvan‐Newman PNAS ‘02]

• Divisive hierarchical clustering based on edge betweenness:
  • Edge betweenness: number of shortest paths passing through the edge
• Girvan‐Newman Algorithm:
  • Repeat until no edges are left:
    • Calculate betweenness of edges
    • Remove edges with highest betweenness
  • Connected components are communities
  • Gives a hierarchical decomposition of the network
• Example:

slide-5
SLIDE 5

[Newman‐Girvan PhysRevE ‘03]

• Zachary’s Karate club: hierarchical decomposition

slide-6
SLIDE 6

[Newman‐Girvan PhysRevE ‘03]


Communities in physics collaborations

slide-7
SLIDE 7

• Want to compute betweenness of paths starting at node A
• Breadth‐first search starting from A:

slide-8
SLIDE 8

• Count the number of shortest paths from A to all other nodes of the network:

slide-9
SLIDE 9

• Compute betweenness by working up the tree: if there are multiple paths, count them fractionally
  • 1+1 paths to H: split evenly
  • 1+0.5 paths to J: split 1:2
  • 1 path to K: split evenly
• Repeat the BFS procedure for each node of the network
• Add edge scores
• Runtime (all pairs shortest path):
  • Weighted graphs: O(N^3)
  • Unweighted graphs: O(N^2)
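The BFS procedure above, and one round of Girvan‐Newman edge removal built on it, can be sketched roughly as follows. This is a minimal illustration, not the lecture's code: the graph representation (a dict of neighbor sets), the function names, and the small two-triangle example graph are all assumptions, and the graph is assumed to be undirected, unweighted, and to have at least one edge.

```python
from collections import defaultdict, deque


def edge_betweenness(adj):
    """Sum, over every start node, the fractional shortest-path credit on each edge."""
    score = defaultdict(float)
    for a in adj:
        # BFS from a: distance and number of shortest paths to every reached node.
        dist, paths, order = {a: 0}, {a: 1.0}, []
        queue = deque([a])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    paths[v] = 0.0
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    paths[v] += paths[u]
        # Work back up the tree: each node passes its credit to its parents,
        # split in proportion to the number of shortest paths through each parent.
        credit = {u: 1.0 for u in order}
        for v in reversed(order):
            for u in adj[v]:
                if dist[u] == dist[v] - 1:
                    share = credit[v] * paths[u] / paths[v]
                    score[frozenset((u, v))] += share
                    credit[u] += share
    # Every pair of endpoints was counted once from each end.
    return {edge: s / 2 for edge, s in score.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, found = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        found.append(comp)
    return found


def first_split(adj):
    """Remove highest-betweenness edges until the graph falls into two or more pieces."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    while len(components(adj)) == 1:
        scores = edge_betweenness(adj)
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Illustrative graph (not from the slides): two triangles joined by the bridge 2-3.
barbell = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = edge_betweenness(barbell)
communities = first_split(barbell)
```

On this toy graph the bridge carries all nine cross-triangle shortest paths, so it is removed first and the two triangles come out as communities. Recomputing betweenness after every removal, as done here, is what makes the full algorithm expensive.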

slide-10
SLIDE 10

Define modularity to be
Q = (number of edges within groups) – (expected number within groups)
  • Actual number of edges between i and j is $A_{ij}$
  • Expected number of edges between i and j is $k_i k_j / 2m$
  • m … number of edges, ki … degree of node i

slide-11
SLIDE 11

• Q = (number of edges within groups) – (expected number within groups)
• Then:

$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$$

  • m … number of edges
  • Aij … 1 if (i,j) is edge, else 0
  • ki … degree of node i
  • ci … group id of node i
  • δ(a, b) … 1 if a=b, else 0
• Modularity lies in the range [−1,1]
  • It is positive if the number of edges within groups exceeds the expected number
  • 0.3<Q<0.7 means significant community structure
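As a quick check on the definition, modularity can be computed directly from the double sum. The sketch below assumes the common 1/(2m) normalization, an undirected simple graph stored as a dict of neighbor sets, and a hypothetical `group` mapping from node to group id; the example graph is illustrative and not from the slides.

```python
def modularity(adj, group):
    """Q = 1/(2m) * sum over i,j of (A_ij - k_i*k_j/(2m)) * delta(c_i, c_j)."""
    degree = {u: len(neighbors) for u, neighbors in adj.items()}
    two_m = sum(degree.values())  # every edge contributes to two degrees
    total = 0.0
    for i in adj:
        for j in adj:
            if group[i] == group[j]:  # delta(c_i, c_j)
                a_ij = 1.0 if j in adj[i] else 0.0
                total += a_ij - degree[i] * degree[j] / two_m
    return total / two_m


# Illustrative graph: two triangles joined by the bridge 2-3 (7 edges).
barbell = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
q_triangles = modularity(barbell, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
q_single = modularity(barbell, {u: "all" for u in barbell})
```

Splitting along the bridge gives Q = 5/14 (about 0.36), while putting every node in one group gives Q = 0, because the actual and expected edge counts then cancel exactly.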

slide-12
SLIDE 12

• Modularity is useful for selecting the number of clusters:

Why not optimize modularity directly?

slide-13
SLIDE 13

• Consider splitting the graph in two communities
• Modularity Q is:

$$Q = \frac{1}{2m} \sum_{i,j \text{ in same group}} \left( A_{ij} - \frac{k_i k_j}{2m} \right)$$

• Or we can write in matrix form as

$$Q = \frac{1}{4m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) s_i s_j = \frac{1}{4m} s^T B s$$

  • s … vector of group memberships si={+1, ‐1}
  • B … modularity matrix, $B_{ij} = A_{ij} - k_i k_j / 2m$
  • Note: each row (column) of B sums to 0, so replacing the same‐group indicator by $(s_i s_j + 1)/2$ leaves only the $s_i s_j$ term

slide-14
SLIDE 14

• Task: Find $s \in \{-1,+1\}^n$ that maximizes Q
• Rewrite Q in terms of eigenvalues βi of B (ui … the corresponding orthonormal eigenvectors):

$$Q = \frac{1}{4m} s^T \left[ \sum_{i=1}^{n} u_i \beta_i u_i^T \right] s = \frac{1}{4m} \sum_{i=1}^{n} s^T u_i \beta_i u_i^T s = \frac{1}{4m} \sum_{i=1}^{n} \left( s^T u_i \right)^2 \beta_i$$

• To maximize Q, easiest way is to make s parallel to u1
  • Assigns all weight in the sum to β1 (largest eigval)
  • (all other sTui terms zero because of orthonormality)
  • Unfortunately, elements of s must be ±1
  • In general, finding optimal s is NP‐hard

slide-15
SLIDE 15

$$Q = \frac{1}{4m} \sum_{i=1}^{n} \left( s^T u_i \right)^2 \beta_i$$

• Heuristic: try to maximize only the β1 term
• Similar in spirit to the spectral partitioning algorithm (we will explore it next time)
• Continue the bisection hierarchically

slide-16
SLIDE 16

• Fast Modularity Optimization Algorithm:
  • Find leading eigenvector u1 of modularity matrix B
  • Divide the nodes by the signs of the elements of u1
  • Repeat hierarchically until:
    • If a proposed split does not cause modularity to increase, declare community indivisible and do not split it
    • If all communities are indivisible, stop
• How to find u1? Power method!
  • Iterative multiplication, normalization
  • Start with random v, until convergence:

$$v^{(k+1)} = \frac{B v^{(k)}}{\lVert B v^{(k)} \rVert}$$

slide-17
SLIDE 17

• Also, can combine with other methods:
  • Randomly divide the nodes into two groups
  • Move the node that, if moved, will increase Q the most
  • Repeat for all nodes, with each node only moved once
  • Once complete, find intermediate state with highest Q
  • Start from this state and repeat until Q stops increasing
  • Good results for “fine‐tuning” the spectral method
• CNM Algorithm (Clauset‐Newman‐Moore ‘04):
  • (1) Start with each vertex alone in its own community (n communities)
  • (2) Calculate the change in Q for all possible community pairs
  • (3) Merge the pair with the largest increase in Q
  • Repeat (2)&(3) until one community remains
  • Cut the dendrogram where Q is maximum
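The greedy merging steps (1) to (3) can be sketched naively as below. This is only an illustration of the idea: it recomputes Q from scratch for every candidate pair, whereas the actual CNM algorithm relies on specialized data structures to avoid that. The function names, the dict-of-neighbor-sets representation, and the example graph are assumptions.

```python
from itertools import combinations


def community_modularity(adj, communities):
    """Q = sum over communities of (internal edges / m) - (total degree / 2m)^2."""
    two_m = sum(len(neighbors) for neighbors in adj.values())
    q = 0.0
    for community in communities:
        # Each internal edge is seen from both endpoints, so this equals 2 * e_c.
        twice_internal = sum(1 for u in community for v in adj[u] if v in community)
        degree = sum(len(adj[u]) for u in community)
        q += twice_internal / two_m - (degree / two_m) ** 2
    return q


def greedy_merge(adj):
    """Merge the best pair until one community remains; return the best level seen."""
    communities = [frozenset([u]) for u in adj]       # (1) every vertex alone
    best_q, best = community_modularity(adj, communities), list(communities)
    while len(communities) > 1:
        candidates = []
        for a, b in combinations(communities, 2):     # (2) score every pair
            merged = [c for c in communities if c not in (a, b)] + [a | b]
            candidates.append((community_modularity(adj, merged), merged))
        q, communities = max(candidates, key=lambda item: item[0])  # (3) best merge
        if q > best_q:                                # remember where Q peaks
            best_q, best = q, list(communities)
    return best_q, best


# Illustrative graph: two triangles joined by the bridge 2-3.
barbell = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
best_q, best_communities = greedy_merge(barbell)
```

Keeping the partition with the highest Q seen along the way plays the role of cutting the dendrogram where Q is maximum.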

slide-18
SLIDE 18

Methods compared:
  • Fast modularity
  • GN = Girvan‐Newman, O(n^3)
  • CNM = Greedy merging, O(n log^2 n)
  • DA = Extremal Optimization, O(n^2 log^2 n)

• Issues with modularity:
  • May not find communities with fewer than about √m links
  • NP‐hard to optimize exactly [Brandes et al. ‘07]

slide-19
SLIDE 19


slide-20
SLIDE 20

[Kumar et al. ‘99]

• Searching for small communities in a Web graph
• (1) The signature of a community/discussion in the context of a Web graph
  • A dense 2‐layer graph
  • Intuition: a bunch of people all talking about the same things

slide-21
SLIDE 21

• (2) A more well‐defined problem: Enumerate complete bipartite subgraphs Ks,t
  • Where Ks,t = s nodes where each links to the same t other nodes

slide-22
SLIDE 22

• Two points:
  • (1) The signature of a community/discussion
  • (2) Complete bipartite subgraph Ks,t
    • Ks,t = graph on s nodes, each links to the same t other nodes
• Plan:
  • (A) From (2) get back to (1):
    • Via: Any dense enough graph contains a smaller Ks,t as a subgraph
  • (B) How do we solve (2) in a giant graph?
    • What similar problems have been solved on giant non‐graph datasets?
    • (3) Frequent itemset enumeration [Agrawal‐Srikant ‘99]

slide-23
SLIDE 23

[Agrawal‐Srikant ‘99]

• Market basket analysis:
  • What items are bought together in a store?
• Setting:
  • Universe U of n items
  • m subsets of U: S1, S2, …, Sm ⊆ U (Si is a set of items one person bought)
  • Frequency threshold f
• Goal:
  • Find all subsets T s.t. T ⊆ Si for at least f of the sets Si (items in T were bought together at least f times)

slide-24
SLIDE 24

• Example:
  • Universe of items:
    • U={1,2,3,4,5}
  • Itemsets:
    • S1={1,3,5}, S2={2,3,4}, S3={2,4,5}, S4={3,4,5}, S5={1,3,4,5}, S6={2,3,4,5}
  • Minimum support:
    • f=3
• Algorithm: Build up the lists
  • Insight: for a frequent set of size k, all its subsets are also frequent

slide-25
SLIDE 25

[Agrawal‐Srikant ‘99]

• U={1,2,3,4,5}, f=3
• S1={1,3,5}, S2={2,3,4}, S3={2,4,5}, S4={3,4,5}, S5={1,3,4,5}, S6={2,3,4,5}

slide-26
SLIDE 26

[Agrawal‐Srikant ‘99]

• For i = 1,…,k
  • Find all frequent sets of size i by composing sets of size i‐1 that differ in 1 element
• Open question:
  • Efficiently find only maximal frequent sets
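The level-by-level build-up can be sketched on the example data from the earlier slides (U={1,…,5}, S1…S6, f=3). The function name and the use of frozensets are assumptions made for this illustration; support is recounted by scanning all baskets, which a real implementation would organize more carefully.

```python
from itertools import combinations


def frequent_itemsets(baskets, f):
    """Return every itemset contained in at least f baskets, built level by level."""
    baskets = [frozenset(b) for b in baskets]

    def support(itemset):
        return sum(1 for basket in baskets if itemset <= basket)

    items = sorted({x for basket in baskets for x in basket})
    level = [frozenset([x]) for x in items if support(frozenset([x])) >= f]
    result = []
    while level:
        result.extend(level)
        frequent = set(level)
        # Compose two sets of the current size that differ in one element.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Keep a candidate only if all its one-smaller subsets are frequent
        # (the insight from the slides) and it reaches the threshold itself.
        level = [c for c in candidates
                 if all(c - {x} in frequent for x in c) and support(c) >= f]
    return result


# Example data from the slides.
S = [{1, 3, 5}, {2, 3, 4}, {2, 4, 5}, {3, 4, 5}, {1, 3, 4, 5}, {2, 3, 4, 5}]
found = frequent_itemsets(S, f=3)
```

With f=3, item 1 is dropped at the first level (it appears in only two baskets), and the only frequent set of size 3 is {3,4,5}.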

slide-27
SLIDE 27

• Claim: (3) (itemsets) solves (2) (bipartite graphs)
• How?
  • View each node i as a set Si of nodes i points to
  • Ks,t = a set Y of size t that occurs in s sets Si
  • Looking for Ks,t → set frequency threshold to s and look at layer t – all frequent sets of size t.
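The reduction can be illustrated with a small sketch that treats each node's out-links as a basket and reports every size-t set of targets shared by at least s nodes. For brevity it counts all size-t subsets directly rather than building them up level by level with pruning, so it is only practical for small out-degrees. The function name, the `out_links` dict, and the toy link data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def find_kst(out_links, s, t):
    """Map each size-t target set to the nodes linking to all of it, if there are >= s."""
    pointing_nodes = defaultdict(set)
    for node, targets in out_links.items():
        # Node i "contains" every size-t subset of the nodes it points to.
        for subset in combinations(sorted(targets), t):
            pointing_nodes[subset].add(node)
    return {right: left for right, left in pointing_nodes.items() if len(left) >= s}


# Hypothetical link data: pages a-d pointing to pages 1-4.
out_links = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {2, 3, 4}, "d": {1, 2}}
k32 = find_kst(out_links, s=3, t=2)
```

Here pages a, b, d all link to {1, 2} and pages a, b, c all link to {2, 3}, so two K3,2 subgraphs are reported.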

slide-28
SLIDE 28

• (2) → (1): Informally, every dense enough bipartite graph G contains a Ks,t subgraph, where s and t depend on size (# of nodes) and density (avg. degree) of G [Kovari‐Sos‐Turan ‘53]
• Theorem: Let G=(X,Y,E), |X|=|Y|=n with avg. degree:

$$d = s^{1/t} \, n^{1 - 1/t} + t$$

then G contains Ks,t as a subgraph

slide-29
SLIDE 29

• For the proof we will need the following fact
  • Recall:

$$\binom{a}{b} = \frac{a (a-1) \cdots (a-b+1)}{b!}$$

  • Let f(x) = x(x‐1)(x‐2)…(x‐k)
    • Once x ≥ k, f(x) curves upward (convex)
  • Suppose a setting:
    • g(y) is convex
    • Want to minimize $\sum_{i=1}^{n} g(x_i)$
    • where $\sum_{i=1}^{n} x_i = x$
  • To minimize $\sum_{i=1}^{n} g(x_i)$, make each xi = x/n

slide-30
SLIDE 30

• Node i, degree di:
  • Potential right‐hand sides of Ks,t (i.e., all size t subsets of Y)
• Put node i in buckets for all size t subsets of its neighbors

slide-31
SLIDE 31

• Notice: As soon as s people appear in a bucket we have a Ks,t
• How many buckets does node i contribute to?
• What is the total size of all buckets?

slide-32
SLIDE 32

• So, total height of all buckets is…
  • Recall:

$$\binom{a}{b} = \frac{a (a-1) \cdots (a-b+1)}{b!}$$

slide-33
SLIDE 33

• Total height of all buckets:
• How many buckets are there?
• What is the average height of buckets?
• So by pigeonhole principle, there must be a bucket with at least s nodes in it.
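The formulas for these steps did not survive in this transcript. A sketch of the standard counting argument, written with the symbols of the theorem (n nodes on each side, degrees d_i with average d = s^{1/t} n^{1-1/t} + t), is given below. It is a reconstruction under those assumptions, not the slides' own derivation, and it treats the binomial coefficient as the polynomial d(d-1)…(d-t+1)/t! so that the convexity fact applies.

```latex
\begin{align*}
\text{buckets that node } i \text{ contributes to}
  &= \binom{d_i}{t} \\
\text{total height of all buckets}
  &= \sum_{i=1}^{n} \binom{d_i}{t}
   \;\ge\; n \binom{d}{t}
   && \text{(convexity: spread degrees evenly)} \\
  &\ge n \, \frac{(d-t)^t}{t!}
   = n \, \frac{\left( s^{1/t} n^{1-1/t} \right)^t}{t!}
   = \frac{s \, n^t}{t!} \\
\text{number of buckets}
  &= \binom{n}{t} \;\le\; \frac{n^t}{t!} \\
\text{average height of a bucket}
  &\ge \frac{s \, n^t / t!}{n^t / t!} = s
\end{align*}
```

Since the average bucket holds at least s nodes, some bucket holds at least s nodes, and those s nodes together with the bucket's t targets form a Ks,t.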

slide-34
SLIDE 34

• Girvan‐Newman:
  • Based on strength of weak ties
  • Remove edge of highest betweenness
• Modularity:
  • Useful to determine the number of clusters
  • Direct approx submodularity optimization
• Trawling (complete bipartite subgraphs):
  • Frequent itemsets and dynamic programming
  • Theorem that complete bipartite subgraphs are embedded in bigger graphs
  • SCALABLE!!!