CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
How to organize/navigate it?
First try: Web directories
- Yahoo
- DMOZ
- LookSmart
11/29/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2
SEARCH!
Find relevant docs in a small and trusted set:
- Newspaper articles
- Patents, etc.
Two traditional problems:
- Synonymy: buy – purchase, sick – ill
- Polysemy: jaguar
Do more documents mean better results?
What is the “best” answer to the query “Stanford”?
- Anchor Text: I go to Stanford where I study
What about the query “newspaper”?
- No single right answer
Scarcity (IR) vs. abundance (Web) of information
- Web: Many sources of information. Who to “trust”?
Trick:
- Pages that actually know about newspapers might all be pointing to many newspapers
Ranking!
Goal (back to the newspaper example):
- Don’t just find newspapers. Find “experts”: people who link in a coordinated way to good newspapers
Idea: Links as votes
- Page is more important if it has more links
- In‐coming links? Out‐going links?
Hubs and Authorities
- Quality as an expert (hub): total sum of votes of pages pointed to
- Quality as content (authority): total sum of votes of experts
- Principle of repeated improvement
[Figure: example vote counts. NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9]
[Kleinberg ‘98]
Each page i has 2 kinds of scores:
- Hub score: $h_i$
- Authority score: $a_i$
HITS algorithm:
- Initialize: $a_i = h_i = 1$
- Then keep iterating:
- Authority: $a_i = \sum_{j \to i} h_j$
- Hub: $h_i = \sum_{i \to j} a_j$
- Normalize: $\sum_i a_i = 1$, $\sum_i h_i = 1$
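The HITS updates can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `hits`, the toy graph and its page names are made up for the example, and the hub update here reuses the freshly computed authority scores, which is one of several reasonable update orders.

```python
def hits(out_links, iterations=50):
    """Run the HITS updates on a graph given as {node: [nodes it links to]}."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: a_i = sum of hub scores of pages j that link to i.
        new_auth = {n: 0.0 for n in nodes}
        for j, targets in out_links.items():
            for i in targets:
                new_auth[i] += hub[j]
        # Hub update: h_i = sum of authority scores of the pages that i links to.
        new_hub = {n: sum(new_auth[j] for j in out_links.get(n, [])) for n in nodes}
        # Normalize so that each score vector sums to 1.
        a_total = sum(new_auth.values())
        h_total = sum(new_hub.values())
        auth = {n: s / a_total for n, s in new_auth.items()}
        hub = {n: s / h_total for n, s in new_hub.items()}
    return hub, auth


# Hypothetical toy graph: two "expert" pages and a blog pointing at newspapers.
toy_graph = {
    "expert1": ["nyt", "wsj", "cnn"],
    "expert2": ["nyt", "wsj"],
    "blog": ["cnn"],
}
hub, auth = hits(toy_graph)
```

In this toy graph the pages linked by both experts end up with the higher authority scores, and the page that links to the most good authorities ends up as the best hub.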
[Kleinberg ‘98]
HITS converges to a single stable point
Slightly change the notation:
- Vector $a = (a_1, \ldots, a_n)$, $h = (h_1, \ldots, h_n)$
- Adjacency matrix (n x n): $M_{ij} = 1$ if $i \to j$
Then: $h_i = \sum_{i \to j} a_j \iff h_i = \sum_j M_{ij} a_j$
So: $h = M a$
And likewise: $a = M^T h$
Algorithm in new notation:
- Set: $a = h = 1^n$
- Repeat:
- $h = M a$, $a = M^T h$
- Normalize
Then: $a = M^T (M a)$ (the left-hand a is the new a, and $M a$ is the new h)
a is being updated (in 2 steps): $M^T (M a) = (M^T M) a$
h is updated (in 2 steps): $M (M^T h) = (M M^T) h$
Thus, in 2k steps:
$a = (M^T M)^k a$
$h = (M M^T)^k h$
Repeated matrix powering
Definition:
- Let $A x = \lambda x$ for some scalar $\lambda$, vector x and matrix A
- Then x is an eigenvector, and $\lambda$ is its eigenvalue
Fact:
- If A is symmetric ($A_{ij} = A_{ji}$) (in our case $M^T M$ and $M M^T$ are symmetric)
- Then A has n orthogonal unit eigenvectors $w_1 \ldots w_n$ that form a basis (coordinate system) with eigenvalues $\lambda_1 \ldots \lambda_n$ ($|\lambda_i| \ge |\lambda_{i+1}|$)
Write x in coordinate system $w_1 \ldots w_n$:
$x = \sum_i \alpha_i w_i$
- x has coordinates $(\alpha_1, \ldots, \alpha_n)$
Suppose: $\lambda_1 \ldots \lambda_n$ ($|\lambda_1| \ge |\lambda_2| \ge \ldots \ge |\lambda_n|$)
$A^k x = (\lambda_1^k \alpha_1, \lambda_2^k \alpha_2, \ldots, \lambda_n^k \alpha_n) = \sum_i \lambda_i^k \alpha_i w_i$
As $k \to \infty$, if we normalize:
$A^k x$ becomes proportional to $w_1$ (all other coordinates $\to 0$, assuming $|\lambda_1| > |\lambda_2|$ and $\alpha_1 \ne 0$)
So authority a is the eigenvector of $M^T M$ associated with the largest eigenvalue $\lambda_1$
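The eigenvector argument can be checked numerically. The sketch below is illustrative only: the 3-page adjacency matrix is hypothetical, and the helper names `mat_vec` and `power_iteration` are made up for this example. It forms $A = M^T M$, repeatedly applies A with normalization, and measures how far the result is from satisfying $A x = \lambda x$.

```python
def mat_vec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def power_iteration(A, iterations=200):
    """Repeatedly apply A to the all-ones vector, normalizing to unit length."""
    x = [1.0] * len(A)
    for _ in range(iterations):
        y = mat_vec(A, x)
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    # Rayleigh quotient x^T A x estimates the eigenvalue (x has unit length).
    eigenvalue = sum(xi * yi for xi, yi in zip(x, mat_vec(A, x)))
    return x, eigenvalue


# Hypothetical adjacency matrix: M[i][j] = 1 if page i links to page j.
M = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
n = len(M)
# A = M^T M, so A[i][j] = sum_k M[k][i] * M[k][j]; A is symmetric.
A = [[sum(M[k][i] * M[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

authority, lam = power_iteration(A)
# Residual of the eigenvector equation A x = lambda x.
residual = max(abs(y - lam * x) for x, y in zip(authority, mat_vec(A, authority)))
```

For this matrix the residual should be close to zero, and `lam` should approach the largest eigenvalue of `A`, which is $(3 + \sqrt{5})/2$.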
A vote from an important page is worth more
A page is important if it is pointed to by other important pages
Define a “rank” $r_j$ for node j
$r_j$ should be proportional to:
$r_j = \sum_{i \to j} \frac{r_i}{d_i}$, where $d_i$ is the out-degree of node i
[Figure: “The web in 1839”. Three pages y, a, m: y links to itself and to a (y/2 each), a links to y and to m (a/2 each), m links to a (m)]
Flow equations:
y = y/2 + a/2
a = y/2 + m
m = a/2
Stochastic adjacency matrix M
- Let page j have $d_j$ out‐links
- If $j \to i$, then $M_{ij} = 1/d_j$, else $M_{ij} = 0$
- M is a column stochastic matrix
- Columns sum to 1
Rank vector r: vector with 1 entry per page
- $r_i$ is the importance score of page i
- |r| = 1
The flow equations can be written
r = Mr
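As a small check of the matrix form, the sketch below builds M for the three-page y/a/m example implied by the flow equations and verifies that r = (2/5, 2/5, 1/5) satisfies r = Mr. The helper name `stochastic_matrix` is made up for this illustration, and exact fractions are used so the comparison is not affected by rounding.

```python
from fractions import Fraction

# The three-page example from the flow equations: y -> {y, a}, a -> {y, m}, m -> {a}.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
pages = ["y", "a", "m"]


def stochastic_matrix(out_links, pages):
    """Return M with M[i][j] = 1/d_j if page j links to page i, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    M = [[Fraction(0)] * len(pages) for _ in pages]
    for j, targets in out_links.items():
        for i in targets:
            M[index[i]][index[j]] = Fraction(1, len(targets))
    return M


M = stochastic_matrix(out_links, pages)
column_sums = [sum(M[i][j] for i in range(3)) for j in range(3)]

# r = (2/5, 2/5, 1/5) solves the flow equations with |r| = 1.
r = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
Mr = [sum(M[i][j] * r[j] for j in range(3)) for i in range(3)]
```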
11/29/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 16
Imagine a random web surfer:
- At any time t, surfer is on some page u
- At time t+1, the surfer follows an out‐link from u uniformly at random
- Ends up on some page v linked from u
- Process repeats indefinitely
Let:
- p(t) … vector whose ith coordinate is the prob. that the surfer is at page i at time t
- p(t) is a probability distribution over pages
Where is the surfer at time t+1?
- Follows a link uniformly at random
p(t+1) = Mp(t)
Suppose the random walk reaches a state p(t+1) = Mp(t) = p(t)
- then p(t) is a stationary distribution of the random walk
Our rank vector r satisfies r = Mr
- So it is a stationary distribution for the random surfer
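The random-surfer view can be illustrated by direct simulation. This is only a sketch: the function name `simulate_surfer`, the fixed seed and the number of steps are arbitrary choices for the example, and the graph is the three-page y/a/m example. The fraction of time spent on each page should come out close to the stationary distribution (0.4, 0.4, 0.2) of that graph.

```python
import random

out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}


def simulate_surfer(out_links, start, steps, seed=0):
    """Follow uniformly random out-links; return the fraction of time on each page."""
    rng = random.Random(seed)
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}


frequencies = simulate_surfer(out_links, start="y", steps=200_000)
```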
Power Iteration:
- Set $r_i = 1$
- $r_j = \sum_{i \to j} r_i / d_i$
- And iterate
Matrix M (rows and columns ordered y, a, m):
     y    a    m
y    ½    ½    0
a    ½    0    1
m    0    ½    0
Example:
y      1    1     5/4   9/8        6/5
a  =   1    3/2   1     11/8   …   6/5
m      1    ½     ¾     ½          3/5
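The example iterates can be reproduced with exact fractions. The sketch below is a minimal illustration (the helper name `power_iterate` is made up here); it starts from r = (1, 1, 1) and applies r ← Mr three times.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Column-stochastic matrix of the y/a/m example (rows and columns ordered y, a, m).
M = [[half, half, 0],
     [half, 0,    1],
     [0,    half, 0]]


def power_iterate(M, r, steps):
    """Apply r <- M r the given number of times, keeping every intermediate vector."""
    history = [r]
    for _ in range(steps):
        r = [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]
        history.append(r)
    return history


history = power_iterate(M, [Fraction(1)] * 3, steps=3)
```

Here `history[1]` is (1, 3/2, 1/2), `history[2]` is (5/4, 1, 3/4) and `history[3]` is (9/8, 11/8, 1/2), matching the first columns of the example.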
Some pages are “dead ends” (have no out‐links)
- Such pages cause importance to leak out
Spider traps (all out‐links are within the group)
- Eventually spider traps absorb all importance
Power Iteration:
- Set $r_i = 1$
- $r_j = \sum_{i \to j} r_i / d_i$
- And iterate
Matrix M (rows and columns ordered y, a, m; m has no out‐links):
     y    a    m
y    ½    ½    0
a    ½    0    0
m    0    ½    0
Example:
y      1    1    ¾    5/8
a  =   1    ½    ½    3/8   …
m      1    ½    ¼    ¼
Power Iteration:
- Set $r_i = 1$
- $r_j = \sum_{i \to j} r_i / d_i$
- And iterate
Matrix M (rows and columns ordered y, a, m; m links only to itself):
     y    a    m
y    ½    ½    0
a    ½    0    0
m    0    ½    1
Example:
y      1    1     ¾     5/8        0
a  =   1    ½     ½     3/8   …    0
m      1    3/2   7/4   2          3
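Both failure modes are easy to observe numerically. The sketch below is illustrative only (the helper name `iterate` and the choice of 100 steps are arbitrary); it runs the same iteration on the dead-end matrix and on the spider-trap matrix.

```python
def iterate(M, r, steps):
    """Apply r <- M r `steps` times and return the final vector."""
    for _ in range(steps):
        r = [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]
    return r


# Rows and columns are ordered y, a, m.
dead_end = [[0.5, 0.5, 0.0],     # m has no out-links: its column is all zeros
            [0.5, 0.0, 0.0],
            [0.0, 0.5, 0.0]]
spider_trap = [[0.5, 0.5, 0.0],  # m links only to itself
               [0.5, 0.0, 0.0],
               [0.0, 0.5, 1.0]]

leaked = iterate(dead_end, [1.0, 1.0, 1.0], steps=100)
trapped = iterate(spider_trap, [1.0, 1.0, 1.0], steps=100)
```

With the dead end, the total score shrinks toward 0 because the column for m no longer sums to 1. With the spider trap, the total stays at 3 but nearly all of it ends up on m.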
At each step, random surfer has two options:
- With probability $1-\beta$, follow a link at random
- With probability $\beta$, jump to some page uniformly at random
PageRank equation:
$r_j = (1-\beta) \sum_{i \to j} \frac{r_i}{d_i} + \beta \frac{1}{n}$
$d_i$ … outdegree of node i
PageRank as a principal eigenvector
$r = M r \iff r_j = \sum_{i \to j} r_i / d_i$ ($d_i$ … outdegree of node i)
But we really want:
$r_j = (1-\beta) \sum_{i \to j} \frac{r_i}{d_i} + \beta \frac{1}{n} \sum_i r_i$
Define:
$M'_{ij} = (1-\beta) M_{ij} + \beta \frac{1}{n}$
Then: $r = M' r$
What is $\beta$?
- In practice $\beta = 0.15$ (5 links and jump)
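PageRank with random jumps can be sketched as follows. This is a minimal illustration under stated assumptions: the jump probability is called `beta` here, the function name `pagerank` is made up, every page is assumed to have at least one out-link, and the graph is the spider-trap example in which m links only to itself.

```python
def pagerank(out_links, beta=0.15, iterations=100):
    """Iterate r_j = (1 - beta) * sum_{i -> j} r_i / d_i + beta / n.

    Assumes every page has at least one out-link (no dead ends).
    """
    pages = list(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives beta / n from the uniform random jump.
        new_r = {p: beta / n for p in pages}
        for i, targets in out_links.items():
            share = (1 - beta) * r[i] / len(targets)
            for j in targets:
                new_r[j] += share
        r = new_r
    return r


# Spider-trap example: m links only to itself.
spider_trap = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
ranks = pagerank(spider_trap)
```

With the jump, m still collects the largest share (roughly 0.69 of the total), but y and a keep nonzero scores instead of being drained completely.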
Goal: Evaluate pages not just by popularity but by how close they are to the topic
Teleporting can go to:
- Any page with equal probability (we used this so far)
- A topic‐specific set of “relevant” pages: topic‐specific (personalized) PageRank
$M'_{ij} = (1-\beta) M_{ij} + \beta c_i$
(c … teleport vector)
Graphs and web search:
- Ranks nodes by “importance”
Personalized PageRank:
- Ranks proximity of nodes to the teleport nodes c
Proximity on graphs:
- Q: What is the most related conference to ICDM?
- Random Walks with Restarts
- Teleport back: c=(0…0, 1, 0…0)
[Figure: conference-author graph. Conferences: IJCAI, ICDM, KDD, SDM, AAAI, NIPS, …; authors: Philip S. Yu, Ning Zhong, M. Jordan, R. Ramakrishnan, …]
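A random walk with restarts can be sketched as personalized PageRank whose teleport vector puts all mass on one start node. The example below is only an illustration: the conference names come from the figure, but the author nodes and all edges are made up, and the restart probability of 0.15 and the function name `random_walk_with_restart` are assumptions of this sketch.

```python
def random_walk_with_restart(neighbors, start, restart=0.15, iterations=200):
    """Personalized PageRank whose teleport vector c puts all mass on `start`."""
    p = {node: 0.0 for node in neighbors}
    p[start] = 1.0
    for _ in range(iterations):
        new_p = {node: 0.0 for node in neighbors}
        new_p[start] = restart  # teleport back to the start node
        for u, nbrs in neighbors.items():
            share = (1 - restart) * p[u] / len(nbrs)
            for v in nbrs:
                new_p[v] += share
        p = new_p
    return p


# Hypothetical conference-author edges, made up for illustration only.
edges = [("ICDM", "author1"), ("ICDM", "author2"), ("KDD", "author1"),
         ("KDD", "author2"), ("SDM", "author2"), ("NIPS", "author3"),
         ("KDD", "author3")]
neighbors = {}
for conf, author in edges:
    neighbors.setdefault(conf, []).append(author)
    neighbors.setdefault(author, []).append(conf)

scores = random_walk_with_restart(neighbors, start="ICDM")
conferences = [n for n in neighbors if not n.startswith("author")]
ranking = sorted((c for c in conferences if c != "ICDM"), key=scores.get, reverse=True)
```

On this made-up graph, the conference that shares the most authors with ICDM is ranked as the most related, and the one reachable only through a longer path is ranked last.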
Link Farms: networks of millions of pages designed to focus PageRank on a few undeserving webpages
To minimize their influence, use a teleport set of trusted webpages
- E.g., homepages of universities
[LibenNowell‐Kleinberg ‘03]
Link prediction task:
- Given G[t0,t0’], a graph on edges up to time t0’
- Output a ranked list L of links (not in G[t0,t0’]) that are predicted to appear in G[t1,t1’]
Evaluation:
- n=|Enew|: # new edges that appear during the test period [t1,t1’]
- Take top n elements of L and count correct edges
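The evaluation step can be written down directly. In the sketch below, the function name `top_n_correct` and all example data are hypothetical, and links are treated as undirected, which is an assumption suited to collaboration networks rather than something stated in the task definition.

```python
def top_n_correct(ranked_links, new_edges):
    """Count how many of the top n = |E_new| predictions are actual new edges."""
    # Treat links as undirected by storing each pair as a frozenset.
    truth = {frozenset(edge) for edge in new_edges}
    n = len(truth)
    top = ranked_links[:n]
    return sum(1 for link in top if frozenset(link) in truth), n


# Hypothetical predictions (best first) and hypothetical test-period edges.
ranked = [("a", "b"), ("c", "d"), ("a", "e"), ("b", "e")]
new_edges = [("b", "a"), ("a", "e"), ("d", "e")]
correct, n = top_n_correct(ranked, new_edges)
```

Here n is 3, and two of the top three predictions, (a, b) and (a, e), appear among the new edges.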
[LibenNowell‐Kleinberg ‘03]
Predict links in an evolving collaboration network
Core: Since network data is very sparse
- Consider only nodes with in‐degree and out‐degree of at least 3
[LibenNowell‐Kleinberg CIKM ‘03]
Rank potential links (x,y) based on:
Γ(x) … degree of node x
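As an illustration of scoring a candidate link, the sketch below computes a few neighborhood-based scores that are standard in the link-prediction literature: common neighbors, Jaccard coefficient, Adamic/Adar and preferential attachment. These are not necessarily the scores listed on the slide. The code takes Γ(x) to be the set of neighbors of x, so the degree is its size, and the graph and the function name `link_scores` are hypothetical.

```python
import math


def link_scores(neighbors, x, y):
    """Return a few neighborhood-based scores for the candidate link (x, y)."""
    gx, gy = neighbors[x], neighbors[y]
    common = gx & gy
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(gx | gy) if gx | gy else 0.0,
        # Adamic/Adar: shared neighbors with few links of their own count for more.
        "adamic_adar": sum(1 / math.log(len(neighbors[z])) for z in common),
        "preferential_attachment": len(gx) * len(gy),
    }


# Hypothetical undirected collaboration graph as {node: set of neighbors}.
neighbors = {
    "x": {"a", "b"},
    "y": {"a", "b", "c"},
    "a": {"x", "y"},
    "b": {"x", "y", "c"},
    "c": {"y", "b"},
}
scores = link_scores(neighbors, "x", "y")
```

Candidate pairs would then be sorted by one of these scores to produce the ranked list L.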
[LibenNowell‐Kleinberg CIKM ‘03]
Recommend a list of possible friends
Supervised machine learning setting:
- Training example: for every node s, have a list of nodes she will create links to {g1, …, gk}
- Problem: learn a model that will, for a given node s, rank nodes {g1, …, gk} higher than other nodes in the network
How to combine node/edge attributes and network structure?
- Let’s learn how to bias random walks!
[Figure: node s with positive examples g1, g2, g3 and negative examples (the other nodes)]
[WSDM ’11]
Let s be the center node
Let $f_w(u,v)$ be a function that assigns a strength to each edge:
$a_{uv} = f_w(u,v) = \exp(-w \cdot \Psi_{uv})$
- $\Psi_{uv}$ is a feature vector
- Features of node u
- Features of node v
- Features of edge (u,v)
- w is the parameter vector we want to learn
Do a random walk from s where transitions are according to edge strengths
How to learn $f_w(u,v)$?
[Figure: center node s with neighbors v1, v2, v3]
[WSDM ’11]
Random walk transition matrix:
PageRank transition matrix:
- with prob. α jump back to s
Compute PageRank vector: $p^T = p^T Q$
Rank nodes by $p_u$
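The pipeline from edge features to PageRank scores can be sketched as below. Several details are assumptions of this sketch rather than statements from the slides: edge strengths are row-normalized to obtain transition probabilities, the jump back to s is mixed in as (1 - α) times the normalized strength plus α on the column of s, and the graph, feature values, parameter vector and function names are all hypothetical.

```python
import math


def edge_strength(w, psi):
    """a_uv = exp(-w . psi_uv), the form given in the text."""
    return math.exp(-sum(wi * xi for wi, xi in zip(w, psi)))


def transition_matrix(features, w, s, alpha):
    """Row-normalize edge strengths, then mix in a jump back to s with prob. alpha.

    The row normalization and the mixing formula are assumptions of this sketch.
    """
    nodes = sorted(features)
    Q = {u: {v: 0.0 for v in nodes} for u in nodes}
    for u in nodes:
        strengths = {v: edge_strength(w, psi) for v, psi in features[u].items()}
        total = sum(strengths.values())
        for v, a_uv in strengths.items():
            Q[u][v] = (1 - alpha) * a_uv / total
        Q[u][s] += alpha
    return Q


def stationary(Q, s, iterations=200):
    """Iterate p^T <- p^T Q starting with all probability on s."""
    p = {u: 1.0 if u == s else 0.0 for u in Q}
    for _ in range(iterations):
        p = {v: sum(p[u] * Q[u][v] for u in Q) for v in Q}
    return p


# Hypothetical graph: features[u][v] is the feature vector psi_uv of edge (u, v).
features = {
    "s":  {"v1": [0.1], "v2": [2.0]},
    "v1": {"s": [0.1], "v2": [1.0]},
    "v2": {"s": [2.0], "v1": [1.0]},
}
w = [1.0]  # hypothetical parameter vector
Q = transition_matrix(features, w, s="s", alpha=0.2)
p = stationary(Q, s="s")
```

With these made-up features the edge from s to v1 is much stronger than the edge from s to v2, so v1 ends up with the higher score. Changing `w` changes the strengths and therefore the ranking, which is what the learning step exploits.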
[WSDM ’11]
Each node u has a score $p_u$
Destination nodes D={v1,…, vk}
No‐link nodes L={the rest}
What do we want?
Hard constraints, make them soft
[WSDM ’11]
Want to minimize:
- Loss: h(x) = 0 if x < 0, $x^2$ else
How to minimize F?
$p_l$ and $p_d$ depend on w:
- Given w, assign edge weights $a_{uv} = f_w(u,v)$
- Using transition matrix $Q = [a_{uv}]$, compute PageRank scores $p_u$
- Want to set w such that $p_l < p_d$
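The loss can be illustrated on a handful of scores. The objective F itself is not reproduced here, so the sketch assumes the loss is summed over all pairs of a no-link node l and a destination node d, and it omits any regularization term F may contain. The scores and the function name `total_loss` are hypothetical.

```python
def h(x):
    """Squared loss on violations: 0 if x < 0, x**2 otherwise."""
    return 0.0 if x < 0 else x * x


def total_loss(p, destinations, no_link):
    """Sum h(p_l - p_d) over all pairs of a no-link node l and a destination d."""
    return sum(h(p[l] - p[d]) for d in destinations for l in no_link)


# Hypothetical PageRank scores for one source node s.
p = {"v1": 0.30, "v2": 0.10, "v3": 0.25, "v4": 0.05}
loss = total_loss(p, destinations=["v1", "v2"], no_link=["v3", "v4"])
```

Only the pair (v3, v2) violates the constraint $p_l < p_d$ here, so the whole loss comes from that pair, about $0.15^2 = 0.0225$.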
[WSDM ’11]
How to minimize F?
- Take the derivative!
We know: $p^T = p^T Q$
i.e. $p_u = \sum_j p_j Q_{ju}$
So: $\frac{\partial p_u}{\partial w} = \sum_j \left( Q_{ju} \frac{\partial p_j}{\partial w} + p_j \frac{\partial Q_{ju}}{\partial w} \right)$
Looks like the PageRank equation!
[WSDM ’11]
Iceland Facebook network
- 174,000 nodes (55% of population)
- Avg. degree 168
- Avg. person added 26 new friends/month
For every node s:
- Positive examples: D={ new friendships of s in Nov ‘09 }
- Negative examples: L={ other nodes s did not create new links to }
Node and Edge features for learning:
- Node: age, gender, degree
- Edge: age of an edge, communication, profile visits, co‐tagged photos
Baselines:
- Decision trees and logistic regression: above features + 10 network features (PageRank, common friends)
Evaluation:
- AUC and precision at Top20
Facebook: predicting your future friends
Results:
- 2.3x improvement over the previous FB‐PYMK system
How to scale to FB size?
- FB network: >500 million people, >65 billion edges
- 40 machines, each 72GB of RAM (total 2.8TB)
- System makes 8.6 million suggestions per second