Purnamrita Sarkar (Carnegie Mellon) Deepayan Chakrabarti (Yahoo! - - PowerPoint PPT Presentation

slide-1
SLIDE 1

Purnamrita Sarkar (Carnegie Mellon) Deepayan Chakrabarti (Yahoo! Research) Andrew W . Moore (Google, Inc.)

slide-2
SLIDE 2

Which pair of nodes {i,j} should be connected? Variant: node i is given

Friend suggestion in Facebook

Alice Bob Charlie

Movie recommendation in Netflix

slide-3
SLIDE 3

Predict link between nodes

  • With the minimum number of hops
  • With max common neighbors (length 2 paths)

Alice, Bob, Charlie: a prolific common friend (1000 followers) gives less evidence; a less prolific common friend (8 followers) gives much more evidence.

The Adamic/Adar score gives more weight to low degree common neighbors.
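The two scores on this slide can be sketched in a few lines. This is a minimal illustration on a hypothetical toy graph (the node names `hub`, `quiet`, `a`, `b`, `c`, `d` are made up); it uses the standard Adamic/Adar form, a sum of 1/log(degree) over common neighbors.

```python
import math
from collections import defaultdict

def build_graph(edges):
    """Undirected graph as a dict of neighbor sets."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph

def common_neighbors(graph, i, j):
    """Nodes adjacent to both i and j (length 2 paths between them)."""
    return graph[i] & graph[j]

def adamic_adar(graph, i, j):
    """Sum of 1/log(degree) over common neighbors.

    A common neighbor has degree >= 2, so log(degree) > 0.
    """
    return sum(1.0 / math.log(len(graph[k])) for k in common_neighbors(graph, i, j))

# Hypothetical example: "hub" has 10 neighbors, "quiet" has only 2.
edges = [("a", "hub"), ("b", "hub")] + [("hub", f"x{n}") for n in range(8)]
edges += [("c", "quiet"), ("d", "quiet")]
g = build_graph(edges)

score_via_hub = adamic_adar(g, "a", "b")    # one prolific common neighbor
score_via_quiet = adamic_adar(g, "c", "d")  # one low degree common neighbor
```

Both pairs have exactly one common neighbor, so the common neighbors score cannot separate them, while Adamic/Adar ranks the pair joined through the low degree node higher.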

slide-4
SLIDE 4

Predict link between nodes

  • With the minimum number of hops
  • With more common neighbors (length 2 paths)
  • With larger Adamic/Adar
  • With more short paths (e.g. length 3 paths )
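Counting short paths can be sketched with powers of the adjacency matrix. This is an illustrative sketch, not the authors' procedure: entry [i, j] of A^l counts l hop walks, which coincide with simple paths for l = 2 and, when i and j are not already linked, for l = 3. The helper name `walk_counts` and the toy edge list are hypothetical.

```python
import numpy as np

def walk_counts(adj, max_len=3):
    """Return {l: A^l} for l = 2..max_len; entry [i, j] counts l hop walks."""
    adj = np.asarray(adj, dtype=np.int64)
    counts = {}
    power = adj.copy()
    for length in range(2, max_len + 1):
        power = power @ adj
        counts[length] = power
    return counts

# Toy graph: two disjoint 3 hop routes from node 0 to node 3.
edges = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 3)]
A = np.zeros((6, 6), dtype=np.int64)
for u, v in edges:
    A[u, v] = A[v, u] = 1

counts = walk_counts(A)
two_hop = counts[2][0, 3]    # common neighbors of 0 and 3
three_hop = counts[3][0, 3]  # length 3 paths between 0 and 3
```

Here nodes 0 and 3 share no common neighbor, so only the length 3 count provides any evidence for a link. For l of 4 or more, walks may revisit nodes, so the matrix power overcounts simple paths.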
slide-5
SLIDE 5

Random Shortest Path Common Neighbors Adamic/Adar Ensemble of short paths

Link prediction accuracy*

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

How do we justify these observations?

Especially if the graph is sparse

slide-6
SLIDE 6

Nodes are uniformly distributed in a latent space

The problem of link prediction is to find the nearest neighbor who is not currently linked to the node. Equivalent to inferring distances in the latent space

Raftery et al.’s Model:

Unit volume universe

Points close in this space are more likely to be connected.

slide-7
SLIDE 7

[Figure: linkage probability versus distance; closer points have a higher probability of linking]

Two sources of randomness

  • Point positions: uniform in D dimensional space
  • Linkage probability: logistic with parameters α, r
  • α, r and D are known

r is the radius; α determines the steepness
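The generative model on this slide can be simulated directly. The sketch below assumes the logistic link probability has the form 1 / (1 + exp(α (d − r))), which matches "radius r" and "α determines the steepness" but is not spelled out on the slide; the function name and parameter values are illustrative only.

```python
import numpy as np

def sample_latent_graph(n, dim, alpha, r, seed=0):
    """Sample positions and an undirected graph from a latent space model.

    Assumed link probability: 1 / (1 + exp(alpha * (distance - r))).
    """
    rng = np.random.default_rng(seed)
    pos = rng.random((n, dim))                 # uniform in the unit cube (unit volume)
    diff = pos[:, None, :] - pos[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    prob = 1.0 / (1.0 + np.exp(alpha * (dist - r)))
    upper = np.triu(rng.random((n, n)) < prob, k=1)  # one coin flip per pair
    adj = upper | upper.T                      # symmetric, no self loops
    return pos, dist, adj

pos, dist, adj = sample_latent_graph(n=200, dim=2, alpha=40.0, r=0.15, seed=1)
pairs = np.triu_indices(200, k=1)
mean_linked = dist[pairs][adj[pairs]].mean()
mean_unlinked = dist[pairs][~adj[pairs]].mean()
```

Linked pairs sit much closer together than unlinked pairs, which is the property the link prediction heuristics try to exploit. Increasing `alpha` makes the link probability approach a step at distance `r`.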

slide-8
SLIDE 8

Random Shortest Path Common Neighbors Adamic/Adar Ensemble of short paths

Link prediction accuracy*

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

Especially if the graph is sparse

slide-9
SLIDE 9

Pr2(i,j) = Pr(common neighbor | dij)

Product of two logistic probabilities, integrated over a volume determined by dij

As α → ∞, the logistic → step function. Much easier to analyze!
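Written out, the statement above might look as follows. This is a sketch under assumptions: the logistic form is the usual one with radius r and steepness α, the integral runs over the position x_k of a third node k in the unit volume universe, and A(r, r, d) denotes the volume of intersection of two radius r balls whose centers are d apart.

```latex
\Pr(i \sim k \mid d_{ik}) = \frac{1}{1 + e^{\alpha (d_{ik} - r)}}
\qquad
\Pr\nolimits_2(i,j) = \int \Pr(i \sim k \mid d_{ik}) \, \Pr(j \sim k \mid d_{jk}) \, dx_k
\;\xrightarrow{\ \alpha \to \infty\ }\; A(r, r, d_{ij})
```

In the step function limit each factor becomes an indicator that k lies within distance r, so the integral reduces to the volume of the region within r of both i and j.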

slide-10
SLIDE 10

Everyone has the same radius r

Empirical Bernstein bounds on distance

V(r) = volume of a ball of radius r in D dims

η = number of common neighbors

Unit volume universe
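As a rough illustration of how a common neighbor count turns into a distance estimate, the sketch below works in D = 2 with a step function link model and ignores boundary effects. It computes a plain point estimate by inverting the intersection area, not the empirical Bernstein bounds mentioned on the slide; the function names are hypothetical.

```python
import math

def lens_area(r, d):
    """Area of intersection of two radius r disks whose centers are d apart."""
    if d >= 2 * r:
        return 0.0
    return 2 * r * r * math.acos(d / (2 * r)) - 0.5 * d * math.sqrt(4 * r * r - d * d)

def distance_from_common_neighbors(eta, n, r, tol=1e-9):
    """Solve lens_area(r, d) = eta / n for d by bisection.

    eta / n estimates the probability that a random node is a common
    neighbor, which equals the lens area in a unit volume universe.
    """
    target = eta / n
    lo, hi = 0.0, 2 * r
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if lens_area(r, mid) > target:
            lo = mid   # area still too large, so the distance must be larger
        else:
            hi = mid
    return (lo + hi) / 2

r, n, d_true = 0.2, 10_000, 0.1
eta_expected = n * lens_area(r, d_true)
d_est = distance_from_common_neighbors(eta_expected, n, r)
```

Because the lens area shrinks monotonically as the distance grows, more common neighbors always map to a smaller estimated distance.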

slide-11
SLIDE 11

OPT = node closest to i

MAX = node with max common neighbors with i

Theorem: dOPT ≤ dMAX ≤ dOPT + 2[…] w.h.p.

Common neighbors is an asymptotically optimal heuristic as N → ∞

slide-12
SLIDE 12

Node k has radius rk. i → k if dik ≤ rk (directed graph)

rk captures the popularity of node k

Two types of common neighbor k of i and j:

  • Type 1: i ← k → j, governed by ri and rj: A(ri, rj, dij)
  • Type 2: i → k ← j, governed by rk: A(rk, rk, dij)

slide-13
SLIDE 13

i

j

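Example graph: N1 nodes of radius r1 and N2 nodes of radius r2, with r1 << r2

η1 ~ Bin[N1, A(r1, r1, dij)]

η2 ~ Bin[N2, A(r2, r2, dij)]

(ηk = number of common neighbors of radius rk)

Maximize Pr[η1, η2 | dij] = product of two binomials

At the maximizing distance d*: w(r1) E[η1|d*] + w(r2) E[η2|d*] = w(r1) η1 + w(r2) η2

[Figure: LHS and RHS as functions of distance; they meet at d*]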
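The maximization on this slide can be sketched numerically. The example below assumes D = 2, a step function link model without boundary effects, and a simple grid search in place of the closed form weighted condition; all names and parameter values are hypothetical.

```python
import math

def lens_area(r, d):
    """Area of intersection of two radius r disks whose centers are d apart."""
    if d >= 2 * r:
        return 0.0
    return 2 * r * r * math.acos(d / (2 * r)) - 0.5 * d * math.sqrt(4 * r * r - d * d)

def log_likelihood(d, groups):
    """Log of the product of binomials; groups holds (N_k, r_k, eta_k) triples.

    Binomial coefficients are dropped because they do not depend on d.
    """
    total = 0.0
    for n, r, eta in groups:
        p = min(lens_area(r, d), 1.0 - 1e-12)
        if p <= 0.0:
            if eta > 0:
                return -math.inf   # observed common neighbors are impossible at this d
            continue
        total += eta * math.log(p) + (n - eta) * math.log(1.0 - p)
    return total

def mle_distance(groups, steps=2000):
    """Grid search for the distance that maximizes the likelihood."""
    r_max = max(r for _, r, _ in groups)
    grid = [2 * r_max * (k + 0.5) / steps for k in range(steps)]
    return max(grid, key=lambda d: log_likelihood(d, groups))

d_true, r1, r2, n1, n2 = 0.05, 0.05, 0.3, 5000, 5000
eta1 = round(n1 * lens_area(r1, d_true))
eta2 = round(n2 * lens_area(r2, d_true))
d_hat = mle_distance([(n1, r1, eta1), (n2, r2, eta2)])
```

Both radius classes contribute to the estimate, but not equally; how strongly each count should be weighted is what the w(r) terms on the slide express.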

slide-14
SLIDE 14

[Figure: weight as a function of radius r, with variance and Jacobian terms; curves labeled Adamic/Adar and 1/r]

  • Small variance: presence is more surprising
  • r is close to max radius: small variance, absence is more surprising

Real world graphs generally fall in this range

slide-15
SLIDE 15

Random Shortest Path Common Neighbors Adamic/Adar Ensemble of short paths

Link prediction accuracy*

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

Especially if the graph is sparse

slide-16
SLIDE 16

Common neighbors = 2 hop paths

Analysis of longer paths: two components

  • 1. Bounding E(ηl | dij) [ηl = # l hop paths]

Bounds Prl(i,j) by using the triangle inequality on a series of common neighbor probabilities.

  • 2. ηl ≈ E(ηl | dij)

Triangulation

slide-17
SLIDE 17

Common neighbors = 2 hop paths

Analysis of longer paths: two components

  • 1. Bounding E(ηl | dij) [ηl = # l hop paths]

Bounds Prl(i,j) by using the triangle inequality on a series of common neighbor probabilities.

  • 2. ηl ≈ E(ηl | dij)
  • Bounded dependence of ηl on the position of each node

Can use McDiarmid's inequality to bound |ηl − E(ηl | dij)|
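For reference, the standard statement of McDiarmid's inequality is given below. Here f is any function of independent random variables X_1, ..., X_n (in this setting, the node positions) such that changing the k-th argument alone changes f by at most c_k; the symbols f, X_k, c_k and t are generic and not taken from the slides.

```latex
\Pr\bigl( \lvert f(X_1,\dots,X_n) - \mathbb{E}[f(X_1,\dots,X_n)] \rvert \ge t \bigr)
\;\le\; 2 \exp\!\left( - \frac{2 t^2}{\sum_{k=1}^{n} c_k^2} \right)
```

The "bounded dependence on the position of each node" condition is what supplies the constants c_k when f is the number of l hop paths.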
slide-18
SLIDE 18

Bound dij as a function of ηl using McDiarmid's inequality.

For l' > l we need ηl' >> ηl to obtain similar bounds.

Also, we can obtain much tighter bounds for long paths if shorter paths are known to exist.

slide-19
SLIDE 19

  • Factor weak bound for logistic
  • Can be made tighter, as the logistic approaches the step function

slide-20
SLIDE 20

Three key ingredients

  • 1. Closer points are likelier to be linked.

Small World Model: Watts & Strogatz, 1998; Kleinberg, 2001

  • 2. Triangle inequality holds

Necessary to extend to l hop paths

  • 3. Points are spread uniformly at random

Otherwise properties will depend on location as well as distance

slide-21
SLIDE 21

Random Shortest Path Common Neighbors Adamic/Adar Ensemble of short paths

Link prediction accuracy*

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

  • The number of paths matters, not the length
  • For large dense graphs, common neighbors are enough
  • Differentiating between different degrees is important
  • In sparse graphs, length 3 or more paths help in prediction

slide-22
SLIDE 22
slide-23
SLIDE 23

Generative model: most likely neighbor of node i? (node a or node b)

Link prediction heuristics: compare a few properties

Can justify the empirical observations

We also offer some new prediction algorithms

slide-24
SLIDE 24

Combine bounds from different radii

But there might not be enough data to obtain individual bounds from each radius

New sweep estimator: Qr = fraction of nodes with radius r that are common neighbors

Higher Qr ⇒ smaller dij w.h.p.

slide-25
SLIDE 25

Qr = fraction of nodes with radius r that are common neighbors

  • Larger Qr ⇒ smaller dij w.h.p.

TR = fraction of nodes with radius R that are common neighbors

  • Smaller TR ⇒ larger dij w.h.p.
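The two fractions can be computed with a single sweep over the observed radii. The sketch below makes an assumption the slide does not state explicitly: that "sweep" means cumulative, so Q is taken over nodes with radius at most r and T over nodes with radius at least R. The function name and toy data are hypothetical.

```python
def sweep_fractions(radii, is_common):
    """For each distinct radius value, return (radius, Q, T).

    Assumption: Q = fraction of nodes with radius <= value that are common
    neighbors; T = fraction of nodes with radius >= value that are common
    neighbors.
    """
    pairs = list(zip(radii, is_common))
    result = []
    for value in sorted(set(radii)):
        below = [common for rad, common in pairs if rad <= value]
        above = [common for rad, common in pairs if rad >= value]
        result.append((value, sum(below) / len(below), sum(above) / len(above)))
    return result

# Toy data: four candidate nodes with their radii and common neighbor flags.
radii = [0.1, 0.1, 0.2, 0.3]
is_common = [True, False, True, False]
sweep = sweep_fractions(radii, is_common)
```

Pooling nodes across a range of radii is one way to address the concern that there may not be enough data at any single radius.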

slide-26
SLIDE 26

Qr = fraction of nodes with radius r that are common neighbors

TR = fraction of nodes with radius R that are common neighbors

[Figure: number of common neighbors of a given radius, plotted against r]

Large Qr ⇒ small dij; small TR ⇒ large dij