
SLIDE 1

CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University

http://cs224w.stanford.edu

SLIDE 2

• How to organize/navigate the Web?
• First try: Web directories
  - Yahoo
  - DMOZ
  - LookSmart

11/29/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2

SLIDE 3

• SEARCH!
• Find relevant docs in a small and trusted set:
  - Newspaper articles
  - Patents, etc.
• Two traditional problems:
  - Synonymy: buy – purchase, sick – ill
  - Polysemy: jaguar

SLIDE 4

• Do more documents mean better results?

SLIDE 5

• What is the “best” answer to the query “Stanford”?
  - Anchor text: I go to Stanford where I study
• What about the query “newspaper”?
  - No single right answer
• Scarcity (IR) vs. abundance (Web) of information
  - Web: many sources of information. Whom to “trust”?
• Trick:
  - Pages that actually know about newspapers might all be pointing to many newspapers
• Ranking!

SLIDE 6

• Goal (back to the newspaper example):
  - Don’t just find newspapers. Find “experts”: people who link in a coordinated way to good newspapers
• Idea: Links as votes
  - A page is more important if it has more links
  - In-coming links? Out-going links?
• Hubs and Authorities
  - Quality as an expert (hub): total sum of votes of pages pointed to
  - Quality as content (authority): total sum of votes of experts
  - Principle of repeated improvement

[Figure: example vote totals: NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9]

SLIDE 7


SLIDE 8


SLIDE 9


SLIDE 10

[Kleinberg '98]

• Each page i has 2 kinds of scores:
  - Hub score: $h_i$
  - Authority score: $a_i$
• HITS algorithm:
  - Initialize: $a_i = h_i = 1$
  - Then keep iterating:
    - Authority: $a_i = \sum_{j \to i} h_j$
    - Hub: $h_i = \sum_{i \to j} a_j$
    - Normalize: $\sum_i a_i = 1$, $\sum_i h_i = 1$
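The update rules above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `hits`, the edge-list input format, and the tiny example graph (two hypothetical hub pages `h1`, `h2` pointing at `nyt` and `wsj`) are assumptions made for the example.

```python
def hits(edges, num_iters=50):
    """Run HITS on a directed graph given as (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    auth = {v: 1.0 for v in nodes}  # a_i = 1
    hub = {v: 1.0 for v in nodes}   # h_i = 1
    for _ in range(num_iters):
        # Authority update: a_i = sum of hub scores of pages j with j -> i.
        new_auth = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: h_i = sum of (new) authority scores of pages j with i -> j.
        new_hub = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        # Normalize so that each score vector sums to 1.
        a_total = sum(new_auth.values())
        h_total = sum(new_hub.values())
        auth = {v: s / a_total for v, s in new_auth.items()}
        hub = {v: s / h_total for v, s in new_hub.items()}
    return hub, auth


# Hypothetical example graph: h1 links to both newspapers, h2 only to nyt.
edges = [("h1", "nyt"), ("h1", "wsj"), ("h2", "nyt")]
hub, auth = hits(edges)
```

In this toy graph `nyt` ends up with a higher authority score than `wsj` (two experts vote for it), and `h1` ends up as the better hub because it points to both authorities.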

SLIDE 11

[Kleinberg '98]

• HITS converges to a single stable point
• Slightly change the notation:
  - Vectors $a = (a_1, \ldots, a_n)$, $h = (h_1, \ldots, h_n)$
  - Adjacency matrix ($n \times n$): $M_{ij} = 1$ if $i \to j$
• Then: $h_i = \sum_{i \to j} a_j \iff h_i = \sum_j M_{ij} a_j$
• So: $h = M a$
• And likewise: $a = M^T h$

SLIDE 12

• Algorithm in new notation:
  - Set: $a = h = 1^n$
  - Repeat:
    - $h = M a$, $a = M^T h$
    - Normalize
• Then: $a = M^T (M a)$ (the inner $M a$ is the new h; the result is the new a)
  - a is being updated (in 2 steps): $M^T (M a) = (M^T M) a$
  - h is updated (in 2 steps): $M (M^T h) = (M M^T) h$
• Thus, in 2k steps:
  - $a = (M^T M)^k a$
  - $h = (M M^T)^k h$
• Repeated matrix powering

SLIDE 13

• Definition:
  - Let $A x = \lambda x$ for some scalar $\lambda$, vector x and matrix A
  - Then x is an eigenvector, and $\lambda$ is its eigenvalue
• Fact:
  - If A is symmetric ($A_{ij} = A_{ji}$) (in our case $M^T M$ and $M M^T$ are symmetric)
  - Then A has n orthogonal unit eigenvectors $w_1, \ldots, w_n$ that form a basis (coordinate system), with eigenvalues $\lambda_1, \ldots, \lambda_n$ ($|\lambda_i| \ge |\lambda_{i+1}|$)

SLIDE 14

• Write x in the coordinate system $w_1, \ldots, w_n$:
  - $x = \sum_i \alpha_i w_i$
  - x has coordinates $(\alpha_1, \ldots, \alpha_n)$
• Suppose: $\lambda_1, \ldots, \lambda_n$ ($|\lambda_1| \ge |\lambda_2| \ge \ldots \ge |\lambda_n|$)
  - $A^k x = (\lambda_1^k \alpha_1, \lambda_2^k \alpha_2, \ldots, \lambda_n^k \alpha_n) = \sum_i \lambda_i^k \alpha_i w_i$
• As $k \to \infty$, if we normalize:
  - $A^k x$ converges to a multiple of $w_1$ (all other coordinates $\to 0$), provided $|\lambda_1| > |\lambda_2|$ and $\alpha_1 \neq 0$
• So authority a is the eigenvector of $M^T M$ associated with the largest eigenvalue $\lambda_1$
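The eigenvector argument can be checked numerically. The sketch below repeatedly applies $M^T M$ to a vector and normalizes, then verifies that the result satisfies $A x = \lambda x$. The helper names and the 3-page example graph are assumptions made for this illustration; plain Python lists are used instead of a linear-algebra library to keep it self-contained.

```python
def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def mat_vec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def mat_mul(left, right):
    right_cols = transpose(right)
    return [[sum(x * y for x, y in zip(row, col)) for col in right_cols]
            for row in left]


def principal_eigen(sym, num_iters=200):
    """Power iteration: returns (eigenvalue, unit eigenvector) estimates."""
    x = [1.0] * len(sym)
    for _ in range(num_iters):
        y = mat_vec(sym, x)
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    # Rayleigh quotient x^T A x for the unit vector x.
    eigenvalue = sum(a * b for a, b in zip(x, mat_vec(sym, x)))
    return eigenvalue, x


# Hypothetical graph: page 0 links to pages 1 and 2, page 1 links to page 2.
# M[i][j] = 1 if i -> j, as in the HITS notation.
M = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
A = mat_mul(transpose(M), M)  # M^T M, the matrix that updates authorities
lam, a = principal_eigen(A)
```

Here `a` plays the role of the authority vector: page 2 (pointed to by both other pages) gets the largest entry, and page 0 (no in-links) gets zero.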

SLIDE 15

• A vote from an important page is worth more
• A page is important if it is pointed to by other important pages
• Define a “rank” $r_j$ for node j; $r_j$ should be proportional to:
  $r_j = \sum_{i \to j} \frac{r_i}{\text{outdegree}(i)}$
• Flow equations:
  - $y = y/2 + a/2$
  - $a = y/2 + m$
  - $m = a/2$

[Figure: “The web in 1839”: three nodes y, a, m with edge flows y/2, y/2, a/2, a/2, m]

SLIDE 16

• Stochastic adjacency matrix M
  - Let page j have $d_j$ out-links
  - If $j \to i$, then $M_{ij} = 1/d_j$, else $M_{ij} = 0$
  - M is a column stochastic matrix: columns sum to 1
• Rank vector r: vector with 1 entry per page
  - $r_i$ is the importance score of page i
  - $|r| = 1$
• The flow equations can be written
  - $r = M r$

11/29/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 16
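As a small worked check, the sketch below builds the column-stochastic matrix for the three-page y/a/m graph from the flow equations and confirms that a rank vector with $|r| = 1$ satisfies $r = Mr$. The function name `stochastic_matrix` and the dictionary input format are assumptions for this example; exact fractions are used so the equalities hold exactly.

```python
from fractions import Fraction


def stochastic_matrix(pages, out_links):
    """M[i][j] = 1/d_j if page j links to page i, else 0."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for j, targets in out_links.items():
        for i in targets:
            M[index[i]][index[j]] = Fraction(1, len(targets))
    return M


pages = ["y", "a", "m"]
# y links to y and a; a links to y and m; m links to a.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = stochastic_matrix(pages, out_links)

# Candidate rank vector with entries summing to 1.
r = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
M_times_r = [sum(M[i][j] * r[j] for j in range(3)) for i in range(3)]
column_sums = [sum(M[i][j] for i in range(3)) for j in range(3)]
```

Every entry of `column_sums` is 1, and `M_times_r` equals `r`, so this `r` solves the flow equations.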

SLIDE 17

• Imagine a random web surfer:
  - At any time t, the surfer is on some page u
  - At time t+1, the surfer follows an out-link from u uniformly at random
  - Ends up on some page v linked from u
  - Process repeats indefinitely
• Let:
  - $p(t)$ … vector whose i-th coordinate is the prob. that the surfer is at page i at time t
  - $p(t)$ is a probability distribution over pages

SLIDE 18

• Where is the surfer at time t+1?
  - Follows a link uniformly at random
  - $p(t+1) = M p(t)$
• Suppose the random walk reaches a state $p(t+1) = M p(t) = p(t)$
  - Then $p(t)$ is a stationary distribution of the random walk
• Our rank vector r satisfies $r = M r$
  - So it is a stationary distribution for the random surfer
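The random-surfer view can also be simulated directly. The sketch below walks the y/a/m graph for many steps and records how often each page is visited; the visit frequencies should approach the stationary distribution (2/5, 2/5, 1/5) of that graph. The function name, the fixed seed, and the step count are arbitrary choices for this illustration.

```python
import random


def simulate_surfer(out_links, start, num_steps, seed=0):
    """Follow out-links uniformly at random; return visit frequencies."""
    rng = random.Random(seed)
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(num_steps):
        page = rng.choice(out_links[page])  # uniform choice among out-links
        visits[page] += 1
    return {p: count / num_steps for p, count in visits.items()}


# y links to y and a; a links to y and m; m links to a.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(out_links, start="y", num_steps=200_000)
```

This only works as written when every page has at least one out-link; a page with an empty list would make `rng.choice` fail, which is the dead-end problem in code form.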

SLIDE 19

• Power Iteration:
  - Set $r_i = 1$
  - $r_j = \sum_{i \to j} r_i / d_i$
  - And iterate
• Matrix M (columns: source page, rows: destination page):

         y     a     m
    y   1/2   1/2    0
    a   1/2    0     1
    m    0    1/2    0

• Example:

    iteration   0     1     2     3     …   limit
    y           1     1    5/4   9/8    …   6/5
    a           1    3/2    1   11/8    …   6/5
    m           1    1/2   3/4   1/2    …   3/5
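The table of iterates can be reproduced with a short script. This is a sketch under the slide's setup (all ranks start at 1 and no normalization is applied); the function name `power_iteration` and the dictionary representation are assumptions for the example. Exact fractions make the intermediate values match the table.

```python
from fractions import Fraction


def power_iteration(out_links, num_iters):
    """Return the list of rank dictionaries r(0), r(1), ..., r(num_iters)."""
    ranks = {page: Fraction(1) for page in out_links}
    history = [dict(ranks)]
    for _ in range(num_iters):
        new_ranks = {page: Fraction(0) for page in out_links}
        for i, targets in out_links.items():
            for j in targets:
                # Page i splits its rank evenly among its d_i out-links.
                new_ranks[j] += ranks[i] / len(targets)
        ranks = new_ranks
        history.append(dict(ranks))
    return history


# y links to y and a; a links to y and m; m links to a.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
history = power_iteration(out_links, num_iters=50)
```

`history[3]` holds y = 9/8, a = 11/8, m = 1/2, and later entries approach 6/5, 6/5, 3/5.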

SLIDE 20

• Some pages are “dead ends” (have no out-links)
  - Such pages cause importance to leak out
• Spider traps (all out-links are within the group)
  - Eventually spider traps absorb all importance

SLIDE 21

• Power Iteration:
  - Set $r_i = 1$
  - $r_j = \sum_{i \to j} r_i / d_i$
  - And iterate
• Matrix M (m has no out-links, so its column is all zeros):

         y     a     m
    y   1/2   1/2    0
    a   1/2    0     0
    m    0    1/2    0

• Example:

    iteration   0     1     2     3     …
    y           1     1    3/4   5/8    …
    a           1    1/2   1/2   3/8    …
    m           1    1/2   1/4   1/4    …

SLIDE 22

• Power Iteration:
  - Set $r_i = 1$
  - $r_j = \sum_{i \to j} r_i / d_i$
  - And iterate
• Matrix M (m links only to itself):

         y     a     m
    y   1/2   1/2    0
    a   1/2    0     0
    m    0    1/2    1

• Example:

    iteration   0     1     2     3     …   limit
    y           1     1    3/4   5/8    …   0
    a           1    1/2   1/2   3/8    …   0
    m           1    3/2   7/4    2     …   3
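Both failure modes show up when tracking the total amount of rank in the system. The sketch below runs the same flow update on the dead-end variant (m has no out-links) and the spider-trap variant (m links only to itself) of the y/a/m graph. Function and variable names are assumptions for this illustration.

```python
from fractions import Fraction


def iterate_flow(out_links, num_iters):
    """Apply r_j = sum_{i -> j} r_i / d_i repeatedly, starting from r_i = 1."""
    ranks = {page: Fraction(1) for page in out_links}
    for _ in range(num_iters):
        new_ranks = {page: Fraction(0) for page in out_links}
        for i, targets in out_links.items():
            for j in targets:
                new_ranks[j] += ranks[i] / len(targets)
            # A page with no out-links passes nothing on: its rank is lost.
        ranks = new_ranks
    return ranks


dead_end = {"y": ["y", "a"], "a": ["y", "m"], "m": []}
spider_trap = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}

dead_3 = iterate_flow(dead_end, 3)      # y = 5/8, a = 3/8, m = 1/4
trap_3 = iterate_flow(spider_trap, 3)   # y = 5/8, a = 3/8, m = 2
dead_60 = iterate_flow(dead_end, 60)
trap_60 = iterate_flow(spider_trap, 60)
```

With the dead end, the total rank shrinks at every step and tends to zero; with the spider trap, the total stays at 3 but almost all of it ends up on m.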

SLIDE 23

• At each step, the random surfer has two options:
  - With probability $1-\beta$, follow a link at random
  - With probability $\beta$, jump to some page uniformly at random
• PageRank equation:
  - $r_j = (1-\beta) \sum_{i \to j} r_i / d_i + \beta$
  - $d_i$ … outdegree of node i
  - (ranks scaled so that $\sum_i r_i = n$, as in the power iteration examples that start from $r_i = 1$)

SLIDE 24

• PageRank as a principal eigenvector
  - $r = M r \iff r_j = \sum_{i \to j} r_i / d_i$
  - $d_i$ … outdegree of node i
• But we really want:
  - $r_j = (1-\beta) \sum_{i \to j} r_i / d_i + \beta \frac{1}{n} \sum_i r_i$
• Define:
  - $M'_{ij} = (1-\beta) M_{ij} + \beta \frac{1}{n}$
• Then: $r = M' r$
• What is $\beta$?
  - In practice $\beta = 0.15$ (5 links and jump)
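A minimal sketch of PageRank with teleportation, applied to the spider-trap graph, shows how the random jump keeps the trap from absorbing everything. It uses `beta` for the jump probability and normalizes ranks to sum to 1; the function name and input format are assumptions for the example, and dead ends are deliberately not handled.

```python
def pagerank(out_links, beta=0.15, num_iters=100):
    """Iterate r = M' r with M'_ij = (1 - beta) * M_ij + beta / n.

    Assumes every page has at least one out-link (no dead ends).
    """
    pages = list(out_links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(num_iters):
        total = sum(ranks.values())
        # Teleport term: each page receives beta/n of the total rank.
        new_ranks = {page: beta * total / n for page in pages}
        # Link-following term: page i passes (1 - beta) * r_i / d_i per link.
        for i, targets in out_links.items():
            for j in targets:
                new_ranks[j] += (1 - beta) * ranks[i] / len(targets)
        ranks = new_ranks
    return ranks


# Spider-trap graph: m links only to itself.
spider_trap = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
ranks = pagerank(spider_trap)
```

The trap page m still gets the largest score, but y and a keep a nonzero share instead of decaying to zero.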

SLIDE 25


SLIDE 26

• Goal: Evaluate pages not just by popularity but by how close they are to the topic
• Teleporting can go to:
  - Any page with equal probability (we used this so far)
  - A topic-specific set of “relevant” pages: topic-specific (personalized) PageRank
• $M'_{ij} = (1-\beta) M_{ij} + \beta c_i$ ($c$ … teleport vector)

SLIDE 27

• Graphs and web search:
  - Ranks nodes by “importance”
• Personalized PageRank:
  - Ranks proximity of nodes to the teleport nodes c
• Proximity on graphs:
  - Q: What is the most related conference to ICDM?
  - Random Walks with Restarts
  - Teleport back: $c = (0 \ldots 0, 1, 0 \ldots 0)$

[Figure: bipartite Conference-Author graph; conferences shown: IJCAI, ICDM, KDD, SDM, AAAI, NIPS; authors shown: Philip S. Yu, Ning Zhong, M. Jordan, R. Ramakrishnan]
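A random walk with restarts can be sketched on a toy conference-author graph. The edges below are made up for illustration (authors are given placeholder labels `A1` to `A4`, and the conference-author links do not reflect real publication data); only the conference names come from the slide. The walk teleports back to the start node with probability `beta`.

```python
def random_walk_with_restart(neighbors, start, beta=0.15, num_iters=200):
    """Personalized PageRank with teleport vector c = indicator of `start`."""
    scores = {node: 0.0 for node in neighbors}
    scores[start] = 1.0
    for _ in range(num_iters):
        new_scores = {node: 0.0 for node in neighbors}
        new_scores[start] = beta  # restart mass (total score stays at 1)
        for u, nbrs in neighbors.items():
            share = (1 - beta) * scores[u] / len(nbrs)
            for v in nbrs:
                new_scores[v] += share
        scores = new_scores
    return scores


# Hypothetical undirected conference-author edges.
edges = [("ICDM", "A1"), ("ICDM", "A2"), ("KDD", "A1"), ("KDD", "A2"),
         ("KDD", "A3"), ("SDM", "A1"), ("AAAI", "A3"), ("AAAI", "A4"),
         ("NIPS", "A4")]
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, []).append(v)
    neighbors.setdefault(v, []).append(u)

scores = random_walk_with_restart(neighbors, start="ICDM")
conferences = ["KDD", "SDM", "AAAI", "NIPS"]
most_related = max(conferences, key=scores.get)
```

In this made-up graph `most_related` is KDD, because it shares two authors with ICDM, while NIPS, reachable only through a long path, scores lowest among the conferences.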

SLIDE 28

• Link Farms: networks of millions of pages designed to focus PageRank on a few undeserving webpages
• To minimize their influence, use a teleport set of trusted webpages
  - E.g., homepages of universities

SLIDE 29

[Liben-Nowell & Kleinberg '03]

• Link prediction task:
  - Given $G[t_0, t_0']$, a graph on edges up to time $t_0'$
  - Output a ranked list L of links (not in $G[t_0, t_0']$) that are predicted to appear in $G[t_1, t_1']$
• Evaluation:
  - $n = |E_{new}|$: # new edges that appear during the test period $[t_1, t_1']$
  - Take top n elements of L and count correct edges
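The evaluation protocol is simple enough to state as code. The sketch below takes a ranked list of predicted links and the set of edges that actually appeared, and counts how many of the top n predictions are correct, where n is the number of new edges. Treating edges as undirected is an assumption of this example (it fits a collaboration network), as are the function name and the toy data.

```python
def count_correct(ranked_links, new_edges):
    """Count how many of the top n = |E_new| predictions are real new edges."""
    n = len(new_edges)
    # frozenset makes (a, b) and (b, a) compare equal (undirected edges).
    truth = {frozenset(edge) for edge in new_edges}
    top_n = ranked_links[:n]
    return sum(1 for link in top_n if frozenset(link) in truth)


# Hypothetical predictions (best first) and hypothetical new edges.
ranked_links = [("a", "b"), ("c", "d"), ("a", "e"), ("b", "e")]
new_edges = [("b", "a"), ("b", "e")]
correct = count_correct(ranked_links, new_edges)
```

Here n = 2, so only the first two predictions are examined; one of them matches a new edge, so `correct` is 1.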

SLIDE 30

[Liben-Nowell & Kleinberg '03]

• Predict links in an evolving collaboration network
• Core: since network data is very sparse, consider only nodes with in-degree and out-degree of at least 3

SLIDE 31

[Liben-Nowell & Kleinberg CIKM '03]

• Rank potential links (x,y) based on:
  - Γ(x) … degree of node x

SLIDE 32

[LibenNowell‐Kleinberg CIKM’ 03]


SLIDE 33

• Recommend a list of possible friends
• Supervised machine learning setting:
  - Training example: for every node s, have a list of nodes she will create links to, $\{g_1, \ldots, g_k\}$
  - Problem: learn a model that will, for a given node s, rank nodes $\{g_1, \ldots, g_k\}$ higher than other nodes in the network
• How to combine node/edge attributes and network structure?
  - Let’s learn how to bias random walks!

[Figure: node s with positive examples $g_1$, $g_2$, $g_3$ and negative examples (the other nodes)]

SLIDE 34

[WSDM ’11]

• Let s be the center node
• Let $f_w(u,v)$ be a function that assigns a strength to each edge:
  - $a_{uv} = f_w(u,v) = \exp(-w \Psi_{uv})$
  - $\Psi_{uv}$ is a feature vector:
    - Features of node u
    - Features of node v
    - Features of edge (u,v)
  - w is the parameter vector we want to learn
• Do a random walk from s where transitions are according to edge strengths
• How to learn $f_w(u,v)$?

[Figure: center node s with neighbors $v_1$, $v_2$, $v_3$]
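To make the edge-strength function concrete, the sketch below computes $a_{uv} = \exp(-w \cdot \Psi_{uv})$ for a few edges and turns the strengths into transition probabilities. Normalizing each node's outgoing strengths to sum to 1 is an assumption about how "transitions according to edge strengths" is realized; the feature values and the parameter vector are invented for the example.

```python
import math


def edge_strength(w, psi):
    """a_uv = exp(-w . psi_uv) for parameter vector w and feature vector psi."""
    return math.exp(-sum(wk * fk for wk, fk in zip(w, psi)))


def transition_probs(features, w):
    """Normalize edge strengths over each source node's outgoing edges."""
    strengths = {edge: edge_strength(w, psi) for edge, psi in features.items()}
    out_total = {}
    for (u, _), a in strengths.items():
        out_total[u] = out_total.get(u, 0.0) + a
    return {(u, v): a / out_total[u] for (u, v), a in strengths.items()}


# Hypothetical 2-dimensional feature vectors for the edges leaving s.
features = {
    ("s", "v1"): [0.0, 1.0],
    ("s", "v2"): [1.0, 0.0],
    ("s", "v3"): [1.0, 1.0],
}
w = [1.0, 2.0]  # hypothetical parameter vector
probs = transition_probs(features, w)
uniform = transition_probs(features, [0.0, 0.0])
```

With `w = [0, 0]` every edge has strength 1 and the walk is unbiased; a nonzero `w` shifts probability toward edges whose weighted feature sum is smallest, here the edge to `v2`.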

SLIDE 35

[WSDM ’11]

• Random walk transition matrix:
• PageRank transition matrix:
  - with prob. α jump back to s
• Compute PageRank vector: $p^T = p^T Q$
• Rank nodes by $p_u$

SLIDE 36

[WSDM ’11]

• Each node u has a score $p_u$
• Destination nodes $D = \{v_1, \ldots, v_k\}$
• No-link nodes $L = \{\text{the rest}\}$
• What do we want?
• Hard constraints, make them soft

SLIDE 37

[WSDM ’11]

• Want to minimize:
  - Loss: $h(x) = 0$ if $x < 0$, $x^2$ else
• How to minimize F?
  - $p_l$ and $p_d$ depend on w:
    - Given w, assign edge weights $a_{uv} = f_w(u,v)$
    - Using transition matrix $Q = [a_{uv}]$, compute PageRank scores $p_u$
    - Want to set w such that $p_l < p_d$
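The loss $h$ is easy to write down, and one plausible way to soften the constraint $p_l < p_d$ is to sum $h(p_l - p_d)$ over all pairs of a no-link node l and a destination node d. That pairwise sum is an assumption of this sketch, not a transcription of the slide's objective F (which may include further terms such as regularization); the scores below are invented.

```python
def h(x):
    """Loss from the slide: 0 if x < 0, x^2 otherwise."""
    return 0.0 if x < 0 else x * x


def pairwise_loss(scores, destinations, no_link):
    """Sum of h(p_l - p_d): penalizes no-link nodes that outscore destinations."""
    return sum(h(scores[l] - scores[d])
               for d in destinations
               for l in no_link)


# Hypothetical PageRank scores p_u.
scores = {"v1": 0.30, "v2": 0.20, "v3": 0.25, "v4": 0.10}
destinations = ["v1", "v2"]  # D: nodes s will link to
no_link = ["v3", "v4"]       # L: the rest
loss = pairwise_loss(scores, destinations, no_link)
```

Only the pair (v3, v2) violates the desired ordering, since v3 scores 0.25 against v2's 0.20, so it is the only pair that contributes to the loss.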

SLIDE 38

[WSDM ’11]

• How to minimize F?
  - Take the derivative!
• We know:
  - i.e.
• So:
• Looks like the PageRank equation!

SLIDE 39

[WSDM ’11]

• Iceland Facebook network
  - 174,000 nodes (55% of population)
  - Avg. degree 168
  - Avg. person added 26 new friends/month
• For every node s:
  - Positive examples: D = { new friendships of s in Nov ‘09 }
  - Negative examples: L = { other nodes s did not create new links to }

SLIDE 40

• Node and Edge features for learning:
  - Node: age, gender, degree
  - Edge: age of an edge, communication, profile visits, co-tagged photos
• Baselines:
  - Decision trees and logistic regression: above features + 10 network features (PageRank, common friends)
• Evaluation:
  - AUC and precision at Top20
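Both evaluation measures can be computed directly from per-node scores. The sketch below computes AUC as the fraction of (positive, negative) pairs ranked in the right order (ties count one half) and precision at the top k. Function names and the toy scores are assumptions for this illustration; the slide uses k = 20, while the example uses k = 2 to stay small.

```python
def auc(scores, positives, negatives):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if scores[p] > scores[n]:
                wins += 1.0
            elif scores[p] == scores[n]:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def precision_at_k(scores, positives, k):
    """Fraction of the k highest-scoring nodes that are positive examples."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sum(1 for node in ranked[:k] if node in positives) / k


# Hypothetical scores: g1, g2 are future friends; x1..x3 are not.
scores = {"g1": 0.9, "g2": 0.4, "x1": 0.6, "x2": 0.2, "x3": 0.1}
positives = {"g1", "g2"}
negatives = {"x1", "x2", "x3"}
auc_value = auc(scores, positives, negatives)
p_at_2 = precision_at_k(scores, positives, k=2)
```

In this example five of the six positive-negative pairs are ordered correctly, and one of the top two nodes is a positive.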

SLIDE 41

• Facebook: predicting your future friends

SLIDE 42

• Results:
  - 2.3x improvement over previous FB-PYMK system
• How to scale to FB size?
  - FB network: >500 million people, >65 billion edges
  - 40 machines, each 72GB of RAM (total 2.8TB)
  - System makes 8.6 million suggestions per second