CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
[Liben-Nowell & Kleinberg '03]
Link prediction task:
- Given G[t0, t0'], a graph on edges up to time t0'
- Output a ranked list L of links (not in G[t0, t0']) that are predicted to appear in G[t1, t1']
Evaluation:
- n = |Enew|: the number of new edges that appear during the test period [t1, t1']
- Take the top n elements of L and count the correct edges
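The evaluation protocol above is easy to state as code. A minimal sketch, where the ranked list `L` and the new-edge set `E_new` are made-up illustrations:

```python
def link_prediction_precision(ranked_links, new_edges):
    """Take the top n = |E_new| predictions and count how many are correct."""
    n = len(new_edges)
    top_n = ranked_links[:n]
    correct = sum(1 for link in top_n if link in new_edges)
    return correct / n

# Hypothetical example: 4 new edges appeared during [t1, t1'];
# the ranked list L gets 2 of them into its top 4.
L = [("a", "b"), ("c", "d"), ("a", "e"), ("b", "c"), ("d", "e")]
E_new = {("a", "b"), ("b", "c"), ("a", "d"), ("c", "e")}
print(link_prediction_precision(L, E_new))  # 0.5
```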
12/01/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2
[Liben-Nowell & Kleinberg '03]
Predict links in an evolving collaboration network
Core: since the network data is very sparse,
- consider only nodes with in-degree and out-degree of at least 3
[Liben-Nowell & Kleinberg '03]
For every pair of nodes (x, y) compute a proximity score
- Γ(x) … the neighborhood (set of neighbors) of node x
Sort the pairs by score and predict the top n pairs as new links
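The score formulas on this slide did not survive extraction. As one standard example from Liben-Nowell & Kleinberg, the common-neighbors score |Γ(x) ∩ Γ(y)| can be sketched as follows; the toy graph is illustrative:

```python
from itertools import combinations

def common_neighbors_scores(adj):
    """Score every non-adjacent pair (x, y) by |Γ(x) ∩ Γ(y)|,
    where Γ(x) is the set of neighbors of x (adj[x])."""
    scores = {}
    for x, y in combinations(adj, 2):
        if y not in adj[x]:  # only score pairs not already linked
            scores[(x, y)] = len(adj[x] & adj[y])
    # sort the pairs by score, highest first
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy undirected graph as adjacency sets
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
ranked = common_neighbors_scores(adj)
print(ranked[0])  # (('a', 'd'), 1): a and d share the neighbor b
```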
[Liben-Nowell & Kleinberg '03]
Rank potential links (x, y) based on a proximity score
- Γ(x) … the neighborhood (set of neighbors) of node x
[Liben-Nowell & Kleinberg '03]
Improvement over #common neighbors
Recommend a list of possible friends
Supervised machine learning setting:
- Training example:
  - For every node s we have a list of nodes {v1, …, vk} she will create links to
- Problem:
  - Learn a model that, for a given node s, ranks the nodes {v1, …, vk} higher than other nodes in the network
How to combine node/edge attributes and network structure?
- Let's learn how to bias random walks!
[WSDM ’11]
Let s be the center node
Let fw(u,v) be a function that assigns a strength to each edge:
- auv = fw(u,v) = exp(-w·Ψuv)
- Ψuv is a feature vector:
  - features of node u
  - features of node v
  - features of edge (u,v)
- w is the parameter vector we want to learn
Do a random walk from s where transitions are according to the edge strengths
How to learn fw(u,v)?
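A minimal sketch of the edge-strength function and one biased-walk step. The feature vectors Ψuv and the parameter vector w below are made-up illustrations, not values from the paper:

```python
import math
import random

def edge_strength(w, psi):
    """a_uv = f_w(u,v) = exp(-w · Ψ_uv), with w and Ψ_uv as plain lists."""
    return math.exp(-sum(wi * pi for wi, pi in zip(w, psi)))

def transition_probs(w, out_edges):
    """Normalize the edge strengths over a node's out-edges into walk probabilities."""
    strengths = {v: edge_strength(w, psi) for v, psi in out_edges.items()}
    total = sum(strengths.values())
    return {v: a / total for v, a in strengths.items()}

# Hypothetical features Ψ_uv for the edges out of the center node s
out_edges = {"v1": [0.1, 1.0], "v2": [0.5, 0.0], "v3": [2.0, 3.0]}
w = [1.0, 0.5]  # the parameter vector to be learned
p = transition_probs(w, out_edges)
next_node = random.choices(list(p), weights=p.values())[0]  # one biased step
```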
[WSDM ’11]
Random walk transition matrix:
- Q'uv = auv / Σw auw if (u, v) is an edge, 0 otherwise
PageRank transition matrix:
- Quv = (1-α)·Q'uv + α·1[v = s]
- with prob. α jump back to s
Compute the PageRank vector: p = pT·Q
Rank nodes by pu
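The PageRank vector p = pT·Q with restart can be approximated by power iteration. A sketch, assuming a small hand-built row-stochastic transition dict:

```python
def random_walk_with_restart(Q, s, alpha=0.15, iters=100):
    """Power-iterate p = (1 - alpha) * p Q' + alpha * e_s.
    Q is a row-stochastic transition dict: Q[u][v] = prob. of stepping u -> v."""
    nodes = list(Q)
    p = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iters):
        # restart mass goes to the center node s
        new_p = {u: alpha if u == s else 0.0 for u in nodes}
        for u in nodes:
            for v, q in Q[u].items():
                new_p[v] += (1 - alpha) * p[u] * q
        p = new_p
    return p

# Toy walk: s steps to v1/v2 equally, v1 and v2 step back to s
Q = {"s": {"v1": 0.5, "v2": 0.5}, "v1": {"s": 1.0}, "v2": {"s": 1.0}}
p = random_walk_with_restart(Q, "s")
ranked = sorted(p, key=p.get, reverse=True)  # rank nodes by p_u
```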
[WSDM ’11]
Each node u has a score pu
- Destination nodes D = {v1, …, vk}
- No-link nodes L = {the rest}
What do we want? pl < pd for every l ∈ L and d ∈ D
These are hard constraints; make them soft
[WSDM ’11]
Want to minimize:
- F(w) = Σl∈L, d∈D h(pl - pd)
- Loss: h(x) = 0 if x < 0, x^2 otherwise
How to minimize F? pl and pd depend on w:
- Given w, assign the edge weights auv = fw(u,v)
- Using the transition matrix Q = [auv], compute the PageRank scores pu
- Want to set w such that pl < pd
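The soft objective can be sketched directly from the definitions above; the PageRank scores below are hypothetical:

```python
def h(x):
    """Soft penalty: zero when the constraint p_l < p_d holds (x < 0), x^2 otherwise."""
    return 0.0 if x < 0 else x * x

def F(p, D, L):
    """Sum the penalty h(p_l - p_d) over every (no-link, destination) pair."""
    return sum(h(p[l] - p[d]) for l in L for d in D)

# Hypothetical scores: the no-link node v3 outranks the destination v2
p = {"v1": 0.4, "v2": 0.1, "v3": 0.2}
print(round(F(p, D=["v1", "v2"], L=["v3"]), 6))  # 0.01: only (v3, v2) is violated
```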
[WSDM ’11]
How to minimize F?
- Take the derivative!
- ∂F/∂w = Σl,d h'(pl - pd)·(∂pl/∂w - ∂pd/∂w)
We know p = pT·Q, i.e. pu = Σj pj·Qju
So: ∂pu/∂w = Σj (Qju·∂pj/∂w + pj·∂Qju/∂w)
Looks like the PageRank equation!
[WSDM ’11]
Iceland Facebook network
- 174,000 nodes (55% of the population)
- Avg. degree 168
- Avg. person added 26 new friends/month
For every node s:
- Positive examples: D = { new friendships of s in Nov '09 }
- Negative examples: L = { other nodes s did not create new links to }
Node and edge features for learning:
- Node:
  - Age
  - Gender
  - Degree
- Edge:
  - Age of the edge
  - Communication
  - Profile visits
  - Co-tagged photos
Baselines:
- Decision trees and logistic regression:
  - the above features + 10 network features (PageRank, common friends)
Evaluation:
- AUC and precision at top 20
Facebook: predicting future friends
Arxiv Hep-Ph collaboration network
Results:
- 2.3x improvement over the previous FB-PYMK system
How to scale to FB size?
- FB network: >500 million people, >65 billion edges
- 40 machines, each with 72GB of RAM (2.8TB total)
- The system makes 8.6 million suggestions per second
Many social or information networks are implicit or hard to observe:
- Hidden/hard-to-reach populations:
  - Network of needle sharing between drug injection users
- Implicit connections:
  - Network of information propagation in online news media
But we can observe the results of the processes taking place on such (invisible) networks:
- Virus propagation:
  - Drug users get sick, and we observe when they see the doctor
- Information networks:
  - We observe when media sites mention information
Question: can we infer the hidden networks?
There is a directed social network over which diffusions take place:
- But we do not observe the edges of the network
- We only see the time when a node gets infected:
  - Cascade c1: (a, 1), (c, 2), (b, 6), (e, 9)
  - Cascade c2: (c, 1), (a, 4), (b, 5), (d, 8)
Task: infer the underlying network
Virus propagation:
- Process: viruses propagate through the network
- We observe: when people get sick
- But NOT who infected whom; it's hidden
Word of mouth & viral marketing:
- Process: recommendations and influence propagate through the network
- We observe: when people buy products
- But NOT who influenced whom; it's hidden
Can we infer the underlying network?
Continuous-time cascade diffusion model:
- Cascade c reaches node u at time tu and spreads to u's neighbors:
- With probability β the cascade propagates along edge (u, v), and we determine the infection time of node v: tv = tu + Δ, e.g. Δ ~ Exponential or Power-law
We assume each node v has only one parent!
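A sketch of generating one cascade under this model. The graph, β, and the exponential Δ parameter are illustrative, and spreading is approximated breadth-first so that each node keeps only its first (one) parent:

```python
import random

def simulate_cascade(graph, root, beta=0.5, mean_delta=1.0, seed=None):
    """Spread from `root`: each infected node u infects each still-uninfected
    neighbor v with prob. beta, at time t_v = t_u + Delta, Delta ~ Exponential.
    The first infector wins, so each node has exactly one parent."""
    rng = random.Random(seed)
    times = {root: 0.0}
    frontier = [root]
    while frontier:
        u = frontier.pop(0)
        for v in graph.get(u, []):
            if v not in times and rng.random() < beta:
                times[v] = times[u] + rng.expovariate(1.0 / mean_delta)
                frontier.append(v)
    return sorted(times.items(), key=lambda kv: kv[1])  # (node, time) pairs

# Toy directed network (hypothetical)
graph = {"a": ["b", "c"], "c": ["b", "e"], "b": ["d"]}
cascade = simulate_cascade(graph, "a", seed=0)
```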
Probability that cascade c propagates from node u to node v:
- Pc(u, v) ∝ P(tv - tu), with tv > tu
- Since not all nodes get infected by the diffusion process, we introduce the external influence node m: Pc(m, v) = ε
Probability that cascade c propagates in a tree pattern T:
- P(c|T) = Π(u,v)∈T Pc(u, v)
Tree pattern T on cascade c: (a, 1), (b, 2), (c, 4), (e, 8)
There are many possible propagation trees that are consistent with the observed data:
- c: (a, 1), (c, 2), (b, 3), (e, 4)
Need to consider all possible propagation trees T supported by the graph G:
- P(c|G) = ΣT P(c|T)
Likelihood of a set of cascades C:
- P(C|G) = Πc∈C P(c|G)
Want to find a graph: G* = argmax|G|≤k P(C|G)
We consider only the most likely tree:
- Maximum log-likelihood for a cascade c under a graph G: Fc(G) = maxT log P(c|T)
- Log-likelihood of G given a set of cascades C: FC(G) = Σc∈C Fc(G)
Given a cascade c, what is the most likely propagation tree?
- T* = argmaxT Σ(u,v)∈T log Pc(u, v)
A maximum directed spanning tree (MDST):
- The sub-graph of G induced by the nodes in the cascade c is a DAG, because edges point forward in time
- So for each node, just pick an in-edge of max weight
Greedy parent selection for each node gives the globally optimal tree!
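Because the cascade-induced subgraph is a DAG, the maximum spanning tree reduces to picking each node's best-weight in-edge. A sketch, assuming every earlier-infected node is a candidate parent and using a made-up edge log-probability that favors short delays:

```python
def best_propagation_tree(cascade_times, log_prob):
    """For each infected node (except the root), pick the in-edge of max weight
    among earlier-infected candidates. Greedy is optimal here because the
    cascade-induced graph is a DAG (edges point forward in time)."""
    nodes = sorted(cascade_times, key=cascade_times.get)
    tree = {}
    for v in nodes[1:]:
        candidates = [u for u in nodes if cascade_times[u] < cascade_times[v]]
        tree[v] = max(candidates, key=lambda u: log_prob(u, v))
    return tree  # maps each node to its chosen parent

# Toy cascade and a hypothetical log-likelihood: shorter delay = larger weight
c = {"a": 1, "c": 2, "b": 6, "e": 9}
tree = best_propagation_tree(c, lambda u, v: -(c[v] - c[u]))
```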
Theorem:
The log-likelihood Fc(G) of cascade c is monotonic and submodular in the edges of the graph G:
- For all A ⊆ B ⊆ V×V and any edge e: Fc(A ∪ {e}) - Fc(A) ≥ Fc(B ∪ {e}) - Fc(B)
- The gain of adding an edge to a "small" graph A is at least the gain of adding it to a "large" graph B
The log-likelihood FC(G) is a sum of submodular functions, so it is submodular too
Use greedy hill-climbing to maximize FC(G):
- For i = 1…k:
  - At every step, pick the edge that maximizes the marginal improvement
Benefits:
- 1. Approximation guarantee (≈ 0.63 of OPT)
- 2. Tight on-line bounds on the solution quality
- 3. Speed-ups:
  - Lazy evaluation (by submodularity)
  - Localized update (by the structure of the problem)
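Lazy evaluation can be sketched with a max-heap of cached marginal gains: by submodularity a stale gain upper-bounds the current one, so only the heap's top entry ever needs re-evaluation. The objective F and the toy coverage function below are illustrative stand-ins:

```python
import heapq

def lazy_greedy(elements, F, k):
    """Greedily pick k elements maximizing a submodular set function F,
    re-evaluating only the heap's top element each round."""
    selected = []
    # max-heap entries: (-cached_gain, element, round the gain was computed in)
    heap = [(-(F([e]) - F([])), e, 0) for e in elements]
    heapq.heapify(heap)
    for step in range(1, k + 1):
        while True:
            neg_gain, e, last = heapq.heappop(heap)
            if last == step:  # gain is fresh for this round: take the element
                selected.append(e)
                break
            gain = F(selected + [e]) - F(selected)  # lazy re-evaluation
            heapq.heappush(heap, (-gain, e, step))
    return selected

# Toy submodular objective: coverage of ground-set items
cover = {"e1": {1, 2}, "e2": {2, 3}, "e3": {3}}
F = lambda S: len(set().union(*(cover[e] for e in S)) if S else set())
print(lazy_greedy(list(cover), F, 2))  # ['e1', 'e2']
```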
We validate our method on:
- Synthetic data: generate a graph G on k edges, generate cascades, record node infection times, reconstruct G
- Real data: MemeTracker, 172M news articles from Aug '08 to Sept '09, 343M textual phrases (quotes)
Evaluation questions:
- How many edges of G can we find? (precision-recall, break-even point)
- How well do we optimize the likelihood Fc(G)?
- How many cascades do we need?
- How fast is the algorithm?
Performance does not depend on the network structure:
- Synthetic networks: 1024-node hierarchical Kronecker (exponential transmission model), 1000-node Forest Fire (α = 1.1, power-law transmission model), etc.
- Transmission time distribution: Exponential, Power Law
- Break-even point of > 90%
We achieve ≈ 90% of the best possible network!
With 2x as many infections as edges, the break-even point is already 0.8-0.9!
5,000 news sites: blogs and mainstream media
Poster session:
- Tuesday 3-6pm in the Gates lobby
- Come early to pick up poster boards in the lobby
- At least 2 (out of 5) course staff should see your poster
- There will be cookies
Project writeups:
- Due Wednesday midnight
CS246: Mining Massive Datasets (Winter 2011)
- How to deal with big datasets
- Emphasis on parallel processing, large-scale machine learning, web and social network data
CS341: Special Topics in Data Mining (Spring 2011)
- Project-oriented large-scale data mining
- Hadoop, MapReduce, unlimited access to Amazon's cloud
Workshop on Social Networks (SOC 317W, Ecu317X)
- Discussion-oriented class where students present their own research and provide feedback on others' work