SLIDE 1
Purnamrita Sarkar (Carnegie Mellon), Deepayan Chakrabarti (Yahoo! Research), Andrew W. Moore (Google, Inc.)
Which pair of nodes {i,j} should be connected?
Variant: node i is given
[Figure: Alice, Bob, Charlie]
Friend suggestion in
SLIDE 2
SLIDE 3
Predict link between nodes
- With the minimum number of hops
- With max common neighbors (length 2 paths)
[Figure: Alice, Bob, Charlie. A common friend with 1000 followers is prolific: less evidence. A common friend with 8 followers is less prolific: much more evidence.]
The Adamic/Adar score gives more weight to low degree common neighbors.
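To make the contrast between the two heuristics concrete, here is a minimal Python sketch. It assumes the usual form of the Adamic/Adar score (sum of 1/log(degree) over common neighbors); the toy graph and the function names `common_neighbors` and `adamic_adar` are made up for illustration.

```python
import math

def common_neighbors(adj, i, j):
    """Return the set of nodes adjacent to both i and j."""
    return adj[i] & adj[j]

def adamic_adar(adj, i, j):
    """Sum 1/log(degree) over common neighbors of i and j.

    A common neighbor of two distinct nodes has degree >= 2, so log(degree) > 0.
    """
    return sum(1.0 / math.log(len(adj[k])) for k in common_neighbors(adj, i, j))

# Hypothetical toy graph: "hub" is a high-degree friend, "niche" a low-degree one.
edges = [("alice", "hub"), ("bob", "hub"), ("alice", "niche"), ("charlie", "niche"),
         ("hub", "x1"), ("hub", "x2"), ("hub", "x3"), ("hub", "x4")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Both pairs share exactly one common neighbor, so common-neighbor counts tie.
# Adamic/Adar breaks the tie in favor of the low-degree common friend:
# (alice, bob) scores 1/log(6), about 0.56; (alice, charlie) scores 1/log(2), about 1.44.
score_bob = adamic_adar(adj, "alice", "bob")
score_charlie = adamic_adar(adj, "alice", "charlie")
```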
SLIDE 4
Predict link between nodes
- With the minimum number of hops
- With more common neighbors (length 2 paths)
- With larger Adamic/Adar score
- With more short paths (e.g. length 3 paths )
- …
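The path-based heuristics in the list above can be sketched as a path counter. This is an illustrative Python sketch, not code from the talk: it counts simple paths (no repeated nodes) with an exact number of hops by depth-first search, and the graph and the name `count_paths` are hypothetical.

```python
def count_paths(adj, i, j, length):
    """Count simple paths with exactly `length` edges from i to j."""
    def dfs(node, remaining, visited):
        if remaining == 0:
            return 1 if node == j else 0
        total = 0
        for nxt in adj[node]:
            if nxt not in visited:
                total += dfs(nxt, remaining - 1, visited | {nxt})
        return total
    return dfs(i, length, {i})

# Hypothetical graph: one 2-hop path (i-a-j) and two 3-hop paths (i-b-c-j, i-d-c-j).
edges = [("i", "a"), ("a", "j"), ("i", "b"), ("b", "c"),
         ("c", "j"), ("i", "d"), ("d", "c")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

two_hop = count_paths(adj, "i", "j", 2)    # common neighbors of i and j
three_hop = count_paths(adj, "i", "j", 3)  # length 3 paths between i and j
```

The depth-first search is exponential in the path length in the worst case, which is acceptable here because the heuristics only use short paths.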
SLIDE 5
[Figure: link prediction accuracy* for Random, Shortest Path, Common Neighbors, Adamic/Adar, Ensemble of short paths]
*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
How do we justify these observations?
Especially if the graph is sparse
SLIDE 6
Nodes are uniformly distributed in a latent space
The problem of link prediction is to find the nearest neighbor who is not currently linked to the node. Equivalent to inferring distances in the latent space
Raftery et al.’s Model:
Unit volume universe
Points close in this space are more likely to be connected.
SLIDE 7
[Figure: linkage probability (up to 1) against distance. Labels: higher probability of linking; radius $r$; $\alpha$ determines the steepness]
Two sources of randomness
- Point positions: uniform in $D$-dimensional space
- Linkage probability: logistic with parameters $\alpha$, $r$
- $\alpha$, $r$ and $D$ are known
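The two sources of randomness can be made concrete with a small sampler. This is a minimal sketch, not the authors' code. It assumes the unit volume universe is the unit cube [0,1]^D (ignoring boundary effects) and that the logistic has the form 1 / (1 + exp(alpha * (d - r))), where alpha is the steepness parameter and r the radius. The names `link_probability` and `sample_graph` are made up for illustration.

```python
import math
import random

def link_probability(d, r, alpha):
    """Assumed logistic link function: 1/2 at d == r, steeper for larger alpha."""
    x = alpha * (d - r)
    if x > 700:  # avoid overflow in math.exp; the probability is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(x))

def sample_graph(n, dim, r, alpha, seed=0):
    """Sample n uniform points in the unit cube, then link each pair independently."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if rng.random() < link_probability(d, r, alpha):
                edges.add((i, j))
    return points, edges

points, edges = sample_graph(n=30, dim=2, r=0.2, alpha=50)
```

As alpha grows, `link_probability` approaches a step function at distance r, which corresponds to the deterministic case where every pair closer than r is linked.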
SLIDE 8
[Figure: link prediction accuracy* for Random, Shortest Path, Common Neighbors, Adamic/Adar, Ensemble of short paths]
*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
Especially if the graph is sparse
SLIDE 9
$\Pr_2(i,j) = \Pr(\text{common neighbor} \mid d_{ij})$
Product of two logistic probabilities, integrated over a volume determined by $d_{ij}$
As $\alpha \to \infty$, Logistic $\to$ Step function
Much easier to analyze!
SLIDE 10
Everyone has the same radius $r$
[Figure: nodes $i$ and $j$ in a unit volume universe]
$\eta$ = number of common neighbors
$V(r)$ = volume of a ball of radius $r$ in $D$ dims
Empirical Bernstein bounds on distance
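The link between the number of common neighbors and distance can be sketched numerically. The Python below is an illustration under stated assumptions, not the bound from the talk: it works in D = 2, where the region within radius r of both i and j is a lens with a closed-form area, and it uses a plain plug-in estimate (count / N) in place of an empirical Bernstein bound. The names `ball_volume`, `lens_area` and `distance_from_count` are hypothetical.

```python
import math

def ball_volume(r, dim):
    """Volume of a ball of radius r in `dim` dimensions."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * r ** dim

def lens_area(r, d):
    """Area shared by two radius-r disks whose centers are d apart (2-D only)."""
    if d >= 2 * r:
        return 0.0
    return 2 * r * r * math.acos(d / (2 * r)) - (d / 2) * math.sqrt(4 * r * r - d * d)

def distance_from_count(eta, n, r, tol=1e-9):
    """Solve eta / n = lens_area(r, d) for d by bisection.

    lens_area is decreasing in d, so more common neighbors means a smaller distance.
    """
    target = eta / n
    lo, hi = 0.0, 2 * r
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if lens_area(r, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# With the expected count at distance 0.05, the inversion recovers that distance.
n, r = 10000, 0.1
expected_eta = n * lens_area(r, 0.05)
d_hat = distance_from_count(expected_eta, n, r)
```

A real count is random around its expectation, which is where a concentration bound is needed to turn the point estimate into an interval for the distance.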
SLIDE 11
OPT = node closest to $i$
MAX = node with max common neighbors with $i$
Theorem: $d_{OPT} \le d_{MAX} \le d_{OPT} + 2[\,\ldots\,]$ w.h.p.
Common neighbors is an asymptotically optimal heuristic as $N \to \infty$
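A small simulation can illustrate the comparison between OPT and MAX when everyone has the same radius. This is a sketch under assumptions chosen here for convenience (points uniform in the unit square, arbitrary n, r and seed); it does not compute the theorem's bound. It only compares the distance of the node with the most common neighbors against the distance of the closest node. The name `simulate` is hypothetical.

```python
import math
import random

def simulate(n=400, r=0.15, seed=1, i=0):
    """Return (d_OPT, d_MAX, common-neighbor count of MAX) for node i."""
    rng = random.Random(seed)
    pts = [(rng.random(), rng.random()) for _ in range(n)]

    def dist(a, b):
        return math.dist(pts[a], pts[b])

    # Deterministic model: a and b are linked when they are within distance r.
    nbrs = [{b for b in range(n) if b != a and dist(a, b) <= r} for a in range(n)]
    others = [j for j in range(n) if j != i]
    opt = min(others, key=lambda j: dist(i, j))               # closest node to i
    best = max(others, key=lambda j: len(nbrs[i] & nbrs[j]))  # most common neighbors
    return dist(i, opt), dist(i, best), len(nbrs[i] & nbrs[best])

d_opt, d_max, eta_max = simulate()
```

By construction `d_opt <= d_max`. Whenever MAX shares at least one common neighbor with i, the triangle inequality also forces `d_max <= 2 * r`; the theorem on this slide gives a much tighter statement as N grows.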
SLIDE 12
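Node $k$ has radius $r_k$. $i \to k$ if $d_{ik} \le r_k$ (directed graph)
$r_k$ captures the popularity of node $k$
Type 1: $i \leftarrow k \rightarrow j$, with radii $r_i$, $r_j$: $A(r_i, r_j, d_{ij})$
Type 2: $i \rightarrow k \leftarrow j$, with radii $r_k$, $r_k$: $A(r_k, r_k, d_{ij})$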
SLIDE 13
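[Figure: nodes $i$, $j$ and common neighbor $k$]
Example graph: $N_1$ nodes of radius $r_1$ and $N_2$ nodes of radius $r_2$, $r_1 \ll r_2$
$\eta_1 \sim \mathrm{Bin}[N_1, A(r_1, r_1, d_{ij})]$
$\eta_2 \sim \mathrm{Bin}[N_2, A(r_2, r_2, d_{ij})]$
Maximize $\Pr[\eta_1, \eta_2 \mid d_{ij}]$ = product of two binomials
$w(r_1)\,E[\eta_1 \mid d^*] + w(r_2)\,E[\eta_2 \mid d^*] = w(r_1)\,\eta_1 + w(r_2)\,\eta_2$
[Figure: LHS and RHS of the equation above, with $d^*$ marked]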
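The maximum-likelihood step for the two-radius example can be sketched as a grid search. The Python below is an illustration, not the authors' derivation: it assumes D = 2 so that A(r, r, d) is the area of a lens between two radius-r disks, it writes the common-neighbor counts of the two radius classes as `eta`, and it searches only distances below twice the smaller radius so that both areas stay positive. The names `log_likelihood` and `mle_distance` are hypothetical.

```python
import math

def lens_area(r, d):
    """A(r, r, d) in 2-D: area shared by two radius-r disks with centers d apart."""
    if d >= 2 * r:
        return 0.0
    return 2 * r * r * math.acos(d / (2 * r)) - (d / 2) * math.sqrt(4 * r * r - d * d)

def log_likelihood(d, counts):
    """Sum of binomial log-likelihoods, up to a constant that does not depend on d.

    counts is a list of (eta, n, r): eta common neighbors among n nodes of radius r.
    Assumes 0 < lens_area(r, d) < 1 for every radius (small radii, unit volume).
    """
    total = 0.0
    for eta, n, r in counts:
        a = lens_area(r, d)
        total += eta * math.log(a) + (n - eta) * math.log(1.0 - a)
    return total

def mle_distance(counts, grid=2000):
    """Grid search for the distance that maximizes the product of binomials."""
    r_min = min(r for _, _, r in counts)
    candidates = [2 * r_min * (k + 0.5) / grid for k in range(grid)]
    return max(candidates, key=lambda d: log_likelihood(d, counts))

# Feed in the expected counts at a true distance of 0.06 and recover it.
d_true, r1, r2, n1, n2 = 0.06, 0.05, 0.2, 1000, 1000
counts = [(n1 * lens_area(r1, d_true), n1, r1),
          (n2 * lens_area(r2, d_true), n2, r2)]
d_hat = mle_distance(counts)
```

The grid search stands in for solving the weighted equation on the slide; both express that each radius class contributes evidence about the same underlying distance.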
SLIDE 14
[Figure: weight $w(r)$ against radius $r$, with terms labeled Variance and Jacobian]
- Small variance: presence is more surprising
- $r$ is close to max radius, small variance: absence is more surprising
- Weights shown: Adamic/Adar, $1/r$
Real world graphs generally fall in this range
SLIDE 15
[Figure: link prediction accuracy* for Random, Shortest Path, Common Neighbors, Adamic/Adar, Ensemble of short paths]
*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
Especially if the graph is sparse
SLIDE 16
Common neighbors = 2 hop paths
Analysis of longer paths: two components
- 1. Bounding $E(\eta_l \mid d_{ij})$ [$\eta_l$ = # $l$-hop paths]
Bounds $\Pr_l(i,j)$ by using triangle inequality on a series of common neighbor probabilities.
- 2. $\eta_l \approx E(\eta_l \mid d_{ij})$
[Figure: Triangulation]
SLIDE 17
Common neighbors = 2 hop paths
Analysis of longer paths: two components
- 1. Bounding $E(\eta_l \mid d_{ij})$ [$\eta_l$ = # $l$-hop paths]
Bounds $\Pr_l(i,j)$ by using triangle inequality on a series of common neighbor probabilities.
- 2. $\eta_l \approx E(\eta_l \mid d_{ij})$
- Bounded dependence of $\eta_l$ on position of each node
Can use McDiarmid’s inequality to bound $|\eta_l - E(\eta_l \mid d_{ij})|$
SLIDE 18
Bound $d_{ij}$ as a function of $\eta_l$ using McDiarmid’s inequality.
For $l' > l$ we need $\eta_{l'} \gg \eta_l$ to obtain similar bounds.
Also, we can obtain much tighter bounds for long paths if shorter paths are known to exist.
SLIDE 19
Factor weak bound for logistic
Can be made tighter as the logistic approaches the step function.
SLIDE 20
Three key ingredients
- 1. Closer points are likelier to be linked.
Small World Model: Watts & Strogatz, 1998; Kleinberg, 2001
- 2. Triangle inequality holds
necessary to extend to l hop paths
- 3. Points are spread uniformly at random
Otherwise properties will depend on location as well as distance
SLIDE 21
[Figure: link prediction accuracy* for Random, Shortest Path, Common Neighbors, Adamic/Adar, Ensemble of short paths]
*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
- The number of paths matters, not the length
- For large dense graphs, common neighbors are enough
- Differentiating between different degrees is important
- In sparse graphs, length 3 or more paths help in prediction
SLIDE 22
SLIDE 23
[Diagram: Generative model and Link Prediction Heuristics. Most likely neighbor of node i? Compare node a and node b.]
A few properties
Can justify the empirical observations
We also offer some new prediction algorithms
SLIDE 24
Combine bounds from different radii
But there might not be enough data to obtain individual bounds from each radius
New sweep estimator
$Q_r$ = fraction of nodes with radius $r$ which are common neighbors.
Higher $Q_r$ $\Rightarrow$ smaller $d_{ij}$ w.h.p.
SLIDE 25
$Q_r$ = fraction of nodes with radius $r$ which are common neighbors
- Larger $Q_r$ $\Rightarrow$ smaller $d_{ij}$ w.h.p.
$T_R$ := fraction of nodes with radius $R$ which are common neighbors.
Smaller $T_R$ $\Rightarrow$ larger $d_{ij}$ w.h.p.
SLIDE 26
$Q_r$ = fraction of nodes with radius $r$ which are common neighbors
$T_R$ = fraction of nodes with radius $R$ which are common neighbors
[Figure: number of common neighbors of a given radius]
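One possible reading of the two fractions, sketched in Python. The slides define them by a radius; the code below assumes a cumulative ("sweep") version, pooling nodes with radius at most r for the first fraction and at least R for the second, so that radius classes with little data share evidence. That pooling is an assumption made here, not something the slides state, and the names `q_fraction` and `t_fraction` are hypothetical.

```python
def q_fraction(radii, is_common, r):
    """Fraction of nodes with radius <= r (assumed pooling) that are common neighbors."""
    pool = [c for rad, c in zip(radii, is_common) if rad <= r]
    return sum(pool) / len(pool) if pool else None

def t_fraction(radii, is_common, big_r):
    """Fraction of nodes with radius >= big_r (assumed pooling) that are common neighbors."""
    pool = [c for rad, c in zip(radii, is_common) if rad >= big_r]
    return sum(pool) / len(pool) if pool else None

# Hypothetical data: one radius per node, and whether that node is a common
# neighbor of the pair (i, j) under consideration.
radii = [0.1, 0.1, 0.2, 0.3]
is_common = [True, False, True, False]

q_small = q_fraction(radii, is_common, 0.1)  # 1 of 2 nodes
q_mid = q_fraction(radii, is_common, 0.2)    # 2 of 3 nodes
t_mid = t_fraction(radii, is_common, 0.2)    # 1 of 2 nodes
t_large = t_fraction(radii, is_common, 0.3)  # 0 of 1 node
```

A high first fraction is evidence that i and j are close, and a low second fraction is evidence that they are far apart, matching the direction of the two statements on the slides.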