http://cs224w.stanford.edu 10/25/2010 Jure Leskovec, Stanford - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University

http://cs224w.stanford.edu

slide-2
SLIDE 2

10/25/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2

slide-3
SLIDE 3

[Faloutsos, Faloutsos and Faloutsos, 1999]

Internet domain topology

slide-4
SLIDE 4

[Barabasi‐Albert, 1999]

[Figure panels: Power‐grid, Web graph, Actor collaborations]

slide-5
SLIDE 5

[Broder, Kumar, Maghoul, Raghavan, Rajagopalan, Stata, Tomkins, Wiener, 2000]

slide-6
SLIDE 6

[Leskovec et al. KDD '08]

• Take a real network and plot a histogram of $p_k$ vs. k

Flickr social network: n = 584,207, m = 3,555,115
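A minimal sketch of how such a histogram could be computed, assuming the network is given as a list of undirected edges. The function name `degree_distribution` and the toy edge list are illustrative and not from the lecture. Nodes that appear in no edge are not counted.

```python
from collections import Counter


def degree_distribution(edges):
    """Return {k: p_k}, the fraction of nodes that have degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)                      # nodes that appear in at least one edge
    counts = Counter(degree.values())    # how many nodes have each degree
    return {k: c / n for k, c in sorted(counts.items())}


# Toy undirected graph (hypothetical data): node 0 has degree 3.
toy_edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
print(degree_distribution(toy_edges))
```

Plotting the resulting pairs (k, p_k) on linear axes gives the histogram described above.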

slide-7
SLIDE 7

[Leskovec et al. KDD '08]

• Plot the same data on log‐log axis:

Flickr social network: n = 584,207, m = 3,555,115

slide-8
SLIDE 8

• Degrees are heavily skewed:

Distribution P(X>x) is heavy tailed if:
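One common way to state the heavy-tail condition, given here as an assumption about the intended definition, is that the tail decays more slowly than any exponential:

```latex
\lim_{x \to \infty} \frac{P(X > x)}{e^{-\lambda x}} = \infty
\qquad \text{for every } \lambda > 0
```

Under this definition the normal and exponential distributions are not heavy tailed, while a power law is.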

slide-9
SLIDE 9

[Clauset‐Shalizi‐Newman 2007]

• Power‐law vs. exponential on log‐log scales
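A short worked comparison of why the two look different on log-log axes. The constants c, α and λ are generic symbols introduced for illustration:

```latex
\text{Power law: } p(x) = c\,x^{-\alpha}
  \;\Longrightarrow\; \log p(x) = \log c - \alpha \log x
  \quad (\text{a straight line in } \log x \text{ with slope } -\alpha)

\text{Exponential: } p(x) = c\,e^{-\lambda x}
  \;\Longrightarrow\; \log p(x) = \log c - \lambda\, e^{\log x}
  \quad (\text{bends downward ever faster in } \log x)
```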

slide-10
SLIDE 10

[Clauset‐Shalizi‐Newman 2007]

• Various names, kinds and forms:

  • Long tail, Heavy tail, Zipf's law, Pareto's law

• P(x) is proportional to:

slide-11
SLIDE 11

• In social systems – lots of power‐laws:

  • Pareto, 1897 – Wealth distribution
  • Lotka 1926 – Scientific output
  • Yule 1920s – Biological taxa and subtaxa
  • Zipf 1940s – Word frequency
  • Simon 1950s – City populations

slide-12
SLIDE 12

[Clauset‐Shalizi‐Newman 2007]

Many other quantities follow heavy‐tailed distributions

slide-13
SLIDE 13

[Chris Anderson, Wired, 2004]

slide-14
SLIDE 14

CMU grad students at the G20 meeting in Pittsburgh in Sept 2009

slide-15
SLIDE 15

• Power‐law degree exponent is typically 2 < α < 3

  • Web graph: α_in = 2.1, α_out = 2.4 [Broder et al. 00]
  • Autonomous systems: α = 2.4 [Faloutsos, Faloutsos and Faloutsos 99]
  • Actor‐collaborations: α = 2.3 [Barabasi‐Albert 00]
  • Citations to papers: α ≈ 3 [Redner 98]
  • Online social networks: α ≈ 2 [Leskovec et al. 07]

slide-16
SLIDE 16

[Clauset‐Shalizi‐Newman 2007]

• What is the normalizing constant?

$P(x) = c\,x^{-\alpha}$, c = ?
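One way to answer this, sketched under the assumptions (not stated in the surviving slide text) that x is continuous with a lower cutoff $x_{\min} > 0$ and that α > 1, is to require the density to integrate to 1:

```latex
\int_{x_{\min}}^{\infty} c\,x^{-\alpha}\,dx
  = \frac{c}{\alpha-1}\,x_{\min}^{-(\alpha-1)} = 1
\quad\Longrightarrow\quad
c = (\alpha-1)\,x_{\min}^{\alpha-1}
```

For α ≤ 1 the integral diverges, so no normalizing constant exists.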

slide-17
SLIDE 17

[Clauset‐Shalizi‐Newman 2007]

• What's the expectation of a power‐law random variable?

E[x] = ?
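A worked sketch of the expectation, assuming a continuous density $p(x) = (\alpha-1)\,x_{\min}^{\alpha-1}\,x^{-\alpha}$ on $x \ge x_{\min}$ (the cutoff $x_{\min}$ is an assumption, not part of the surviving slide text):

```latex
E[x] = \int_{x_{\min}}^{\infty} x\,p(x)\,dx
     = (\alpha-1)\,x_{\min}^{\alpha-1} \int_{x_{\min}}^{\infty} x^{1-\alpha}\,dx
     = \frac{\alpha-1}{\alpha-2}\,x_{\min}
\qquad (\alpha > 2)
```

For α ≤ 2 the integral diverges, so the expectation is infinite.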

slide-18
SLIDE 18

• Power‐laws: Infinite moments!

  • If α ≤ 2 : E[x] = ∞
  • If α ≤ 3 : Var[x] = ∞

• Sample average of n samples from a power‐law with exponent α:
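A small experiment that illustrates the behavior of the sample average, assuming a continuous power law with lower cutoff `x_min = 1` sampled by inverse-transform sampling. The helper names `sample_power_law` and `sample_mean` are illustrative. For α ≤ 2 the printed averages should keep drifting as n grows; for larger α they should settle near (α-1)/(α-2).

```python
import random


def sample_power_law(alpha, n, x_min=1.0, rng=None):
    """Draw n samples from p(x) proportional to x^(-alpha) on x >= x_min."""
    rng = rng or random.Random()
    exponent = -1.0 / (alpha - 1.0)
    # The CCDF is (x / x_min)^-(alpha - 1); invert it at a uniform u in (0, 1].
    return [x_min * (1.0 - rng.random()) ** exponent for _ in range(n)]


def sample_mean(xs):
    return sum(xs) / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)
    for alpha in (1.8, 2.5, 3.5):
        means = [sample_mean(sample_power_law(alpha, n, rng=rng))
                 for n in (10**2, 10**3, 10**4, 10**5)]
        print(alpha, [round(m, 2) for m in means])
```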

slide-19
SLIDE 19

[Clauset‐Shalizi‐Newman 2007]

• Estimating α from data:

  • 1. Fit a line on log‐log axis using least squares: BAD!

slide-20
SLIDE 20

[Clauset‐Shalizi‐Newman 2007]

• Estimating α from data:

  • 2. Plot Complementary CDF P(X>x): Ok

Then α = 1 + α', where α' is the magnitude of the slope of P(X>x) on log‐log axes. I.e., if $P(X=x) \propto x^{-\alpha}$ then $P(X>x) \propto x^{-(\alpha-1)}$.
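The relation between the two exponents follows from integrating the density, sketched here for a continuous power law with a generic constant c and α > 1:

```latex
P(X > x) = \int_{x}^{\infty} c\,y^{-\alpha}\,dy
         = \frac{c}{\alpha-1}\,x^{-(\alpha-1)}
```

So the CCDF is again a power law whose exponent is smaller by one.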

slide-21
SLIDE 21

[Clauset‐Shalizi‐Newman 2007]

• Estimating power‐law exponent α from data:

  • 3. Use MLE (Best): $\hat{\alpha} = 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min}} \right]^{-1}$

$x_i$ is the degree of node i, and $x_{\min}$ is the smallest degree from which the power law is assumed to hold
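A minimal sketch of the continuous maximum-likelihood estimator from Clauset, Shalizi and Newman, $\hat{\alpha} = 1 + n / \sum_i \ln(x_i / x_{\min})$. The function name `mle_exponent` and the synthetic data are illustrative. For integer degrees this continuous formula is only an approximation, and the choice of `x_min` is left to the caller.

```python
import math
import random


def mle_exponent(xs, x_min):
    """Continuous power-law MLE using only observations with x >= x_min."""
    tail = [x for x in xs if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0.0:
        raise ValueError("need at least one observation above x_min")
    return 1.0 + len(tail) / log_sum


# Synthetic check: draw from a power law with a known exponent and recover it.
rng = random.Random(42)
true_alpha = 2.5
synthetic = [(1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
             for _ in range(100_000)]
print(round(mle_exponent(synthetic, x_min=1.0), 2))
```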

slide-22
SLIDE 22

[Figure panels: linear scale; log scale, α=1.75; CCDF, log scale, α=1.75; CCDF, log scale, α=1.75, exp. cutoff]

slide-23
SLIDE 23

• Not well characterized by the mean:

  • Avg. U.S. city size: 165k, StdDev=410k
  • If human heights in the US were power‐law distributed:
  • Expect to have 60k people as tall as 2.72m (world record), 10k people as tall as a giraffe, 1 person as tall as the Empire State Building

• Cannot arise from sums of independent events

  • Recall: in $G_{np}$ each pair of nodes is connected independently with prob. p
  • X … degree of node v, $X_w$ … event that w links to v
  • $X = \sum_w X_w$, $E[X] = \sum_w E[X_w] = (n-1)p$
  • Now what is Pr[X=k]?
  • Central limit theorem:
  • $X_1, \dots, X_n$: rnd. vars with mean $\mu$, variance $\sigma^2$
  • $S_n = \sum_{i=1}^{n} X_i$: $E[S_n] = n\mu$, $\mathrm{var}[S_n] = n\sigma^2$, $\mathrm{std.dev}[S_n] = \sigma\sqrt{n}$
  • $P[S_n = E[S_n] + x \cdot \mathrm{std.dev}(S_n)] \sim \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$
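A small numerical sketch of the argument: in G_np the degree of a node is a sum of n-1 independent link indicators, so it is Binomial(n-1, p) and close to a normal distribution near its mean. The values n = 201 and p = 0.05 and the function names are illustrative choices.

```python
import math


def degree_pmf(k, n, p):
    """Pr[X = k] for the degree X of one node in G_np: Binomial(n - 1, p)."""
    trials = n - 1
    return math.comb(trials, k) * p**k * (1.0 - p) ** (trials - k)


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


n, p = 201, 0.05
mean = (n - 1) * p                           # expected degree (n-1)p
std = math.sqrt((n - 1) * p * (1.0 - p))
for k in (0, 5, 10, 15, 20, 40):
    print(k, degree_pmf(k, n, p), normal_pdf(k, mean, std))
```

The probabilities fall off extremely fast away from the mean, which is why such a sum cannot produce the very high degrees seen in a heavy-tailed distribution.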

slide-24
SLIDE 24

Random network (Erdos‐Renyi random graph): degree distribution is Binomial

Scale‐free (power‐law) network: degree distribution is Power‐law

Function is scale free if: f(ax) = c f(x)
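A short check that a power law satisfies the scale-free condition f(ax) = c f(x) while an exponential does not. The symbols α and λ are generic parameters introduced for illustration:

```latex
f(x) = x^{-\alpha}:\quad f(ax) = a^{-\alpha}\,x^{-\alpha} = c\,f(x), \qquad c = a^{-\alpha}

f(x) = e^{-\lambda x}:\quad f(ax) = e^{-\lambda a x} = f(x)^{a}
\quad (\text{not a constant multiple of } f(x) \text{ unless } a = 1)
```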

slide-25
SLIDE 25

• What is a good model that gives rise to power‐law degree distributions?

• What is the analog of central limit theorem for power‐laws?

slide-26
SLIDE 26

• Preferential attachment [Price 1965, Albert‐Barabasi 1999]:

  • Nodes arrive in order
  • A new node j creates m out‐links
  • Prob. of linking to a previous node i is proportional to its degree $d_i$:

$$P(j \to i) = \frac{d_i}{\sum_k d_k}$$
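A minimal simulation sketch of preferential attachment. Two details are assumptions made here and are not specified above: the process starts from a small clique of m+1 nodes, and each new node picks m distinct targets. The function name `preferential_attachment` is illustrative.

```python
import random
from collections import Counter


def preferential_attachment(n, m, seed=None):
    """Grow a graph to n nodes; each new node links to m earlier nodes
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Assumption: seed graph is a clique on nodes 0..m, so no degree is zero.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    endpoints = [v for e in edges for v in e]
    for j in range(m + 1, n):
        targets = set()
        while len(targets) < m:           # assumption: m distinct targets
            targets.add(rng.choice(endpoints))
        for i in targets:
            edges.append((j, i))
            endpoints.extend((j, i))
    return edges


edges = preferential_attachment(2000, 3, seed=0)
degree = Counter(v for e in edges for v in e)
print(max(degree.values()), min(degree.values()))
```

Keeping a flat list of edge endpoints avoids recomputing the normalizing sum of degrees at every step.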

slide-27
SLIDE 27

• New nodes are more likely to link to nodes that already have high degree

• Herbert Simon's result:

  • Power‐laws arise from "Rich get richer" (cumulative advantage)

• Examples [Price 65]:

  • Citations: new citations of a paper are proportional to the number it already has

slide-28
SLIDE 28

[Mitzenmacher, '03]

• Pages are created in order 1,2,3,…,n
• When node j is created it makes a single link to an earlier node i chosen:

1) With prob. p, j links to i chosen uniformly at random (from among all earlier nodes)
2) With prob. 1-p, node j chooses node i uniformly at random and links to the node i points to.

Note this is the same as saying:

2) With prob. 1-p, node j links to node u with prob. proportional to $d_u$ (the degree of u)
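A minimal simulation sketch of this model. One boundary case is not specified above and is handled by assumption: node 1 has no out-link, so when the copying step picks i = 1 the new node links to node 1 itself. The function name `copying_model` is illustrative.

```python
import random
from collections import Counter


def copying_model(n, p, seed=None):
    """Return {j: node that j links to} for nodes 2..n."""
    rng = random.Random(seed)
    target = {2: 1}                       # node 2 can only link to node 1
    for j in range(3, n + 1):
        i = rng.randint(1, j - 1)         # uniform over all earlier nodes
        if rng.random() < p or i == 1:
            target[j] = i                 # rule 1 (or the i = 1 fallback)
        else:
            target[j] = target[i]         # rule 2: copy the link of node i
    return target


links = copying_model(10_000, p=0.3, seed=7)
in_degree = Counter(links.values())
print(in_degree.most_common(5))
```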

slide-29
SLIDE 29

• Claim: The described model generates networks where the fraction of nodes with degree k scales as:

$$P(d_i = k) \propto k^{-\left(1 + \frac{1}{q}\right)}$$

where q = 1-p

slide-30
SLIDE 30

• Degree $d_i(t)$ of node i (i=1,2,…,n) is a continuous quantity and it grows deterministically as a function of time t

• Analyze $d_i(t)$ – continuous degree of node i at time t ≥ i

slide-31
SLIDE 31

• Initial condition:

  • $d_i(t) = 0$, when t = i (i just arrived)

• Expected change of $d_i(t)$ over time:

  • Node i gains an in‐link at step t+1 only if a link from a newly created node t+1 points to it.
  • What's the prob. of this event?
  • With prob. p node t+1 links to a random node:
  • links to i with prob. 1/t
  • With prob. 1-p node t+1 links preferentially:
  • links to i with prob. $d_i(t)/t$
  • So: prob. node t+1 links to i is:

$$p\,\frac{1}{t} + (1-p)\,\frac{d_i(t)}{t}$$

slide-32
SLIDE 32

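With q = 1-p, treat the link probability as the growth rate of $d_i$ and solve by separation of variables:

$$\frac{\mathrm{d}d_i}{\mathrm{d}t} = p\,\frac{1}{t} + q\,\frac{d_i}{t}$$

$$\int \frac{1}{p + q\,d_i}\,\mathrm{d}d_i = \int \frac{1}{t}\,\mathrm{d}t$$

$$\frac{1}{q}\ln(p + q\,d_i) = \ln t + c$$

$$d_i(t) = \frac{1}{q}\left(A\,t^{q} - p\right)$$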

slide-33
SLIDE 33

• We know: $d_i(i) = 0$

$$d_i(i) = \frac{1}{q}\left(A\,i^{q} - p\right) = 0 \;\Rightarrow\; A = \frac{p}{i^{q}}$$

$$d_i(t) = \frac{p}{q}\left[\left(\frac{t}{i}\right)^{q} - 1\right]$$
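A quick numerical sanity check of this solution: the closed form $d_i(t) = \frac{p}{q}[(t/i)^q - 1]$ with q = 1-p should start at zero when t = i and satisfy the growth equation $\mathrm{d}d_i/\mathrm{d}t = (p + q\,d_i)/t$. The function names and the test values are illustrative, and p < 1 is assumed.

```python
def degree_closed_form(t, i, p):
    """Continuous degree of node i at time t >= i (requires p < 1)."""
    q = 1.0 - p
    return (p / q) * ((t / i) ** q - 1.0)


def growth_rate(t, d, p):
    """Right-hand side of the growth equation: (p + q * d) / t."""
    return (p + (1.0 - p) * d) / t


p, i, t, h = 0.3, 5.0, 50.0, 1e-4
# Central finite difference approximates the derivative of the closed form.
slope = (degree_closed_form(t + h, i, p) - degree_closed_form(t - h, i, p)) / (2 * h)
print(degree_closed_form(i, i, p))
print(slope, growth_rate(t, degree_closed_form(t, i, p), p))
```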

slide-34
SLIDE 34

• What is F(d), the fraction of nodes that have degree at least d at time t?

$$d_i(t) = \frac{p}{q}\left[\left(\frac{t}{i}\right)^{q} - 1\right] \ge d \iff i \le t\left(\frac{q}{p}\,d + 1\right)^{-\frac{1}{q}}$$

• There are t nodes total at time t, so F(d):

$$F(d) = \left(\frac{q}{p}\,d + 1\right)^{-\frac{1}{q}}$$

slide-35
SLIDE 35

• What is the fraction of nodes with degree exactly d?

  • Take the derivative of F(d). Since F(d) counts nodes with degree at least d, the density is the negative of the derivative:

$$-F'(d) = \frac{1}{q}\cdot\frac{q}{p}\left(\frac{q}{p}\,d + 1\right)^{-\frac{1}{q}-1} = \frac{1}{p}\left(\frac{q}{p}\,d + 1\right)^{-\left(1+\frac{1}{q}\right)}$$

slide-36
SLIDE 36

• Two changes from the $G_{np}$ model

  • The network grows
  • Preferential attachment

• Do we need both? Yes!

  • If we just add growth to $G_{np}$ (p=1):
  • $x_j$ = degree of node j at the end
  • $x_j(u)$ = 1 if u links to j, else 0
  • $x_j = x_j(j+1) + x_j(j+2) + \dots + x_j(n)$
  • $E[x_j(u)] = P[u \text{ links to } j] = 1/(u-1)$
  • $E[x_j] = \sum_{u=j+1}^{n} \frac{1}{u-1} = \frac{1}{j} + \frac{1}{j+1} + \dots + \frac{1}{n-1} = H_{n-1} - H_{j-1}$, where $H_n$ is the n‐th harmonic number
  • $E[x_j] \approx \log(n-1) - \log(j) = \log((n-1)/j)$, NOT a power of (n/j)
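A small exact check of the harmonic-number bookkeeping for the growth-only case: summing 1/(u-1) over u = j+1, …, n gives $H_{n-1} - H_{j-1}$, which is close to log((n-1)/j) for large j. The function names and the values j = 5, n = 100 are illustrative.

```python
import math
from fractions import Fraction


def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n (H_0 = 0)."""
    return sum(Fraction(1, k) for k in range(1, n + 1))


def expected_degree(j, n):
    """E[x_j] when each later node u links to a uniformly random earlier node."""
    return sum(Fraction(1, u - 1) for u in range(j + 1, n + 1))


j, n = 5, 100
print(float(expected_degree(j, n)), math.log((n - 1) / j))
```

The expected degree grows only logarithmically in n/j, so growth alone does not produce a heavy tail.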

7/2/2009 Jure Leskovec, Stanford CS322: Network Analysis 36

slide-37
SLIDE 37

• Preferential attachment gives power‐law degrees
• Intuitively reasonable process
• Can tune p to get the observed exponent

  • On the web, P[node has degree d] ~ $d^{-2.1}$
  • 2.1 = 1 + 1/(1-p) → p ~ 0.1
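The arithmetic behind this value of p, written out for a general exponent α (here α = 2.1 for the web):

```latex
\alpha = 1 + \frac{1}{1-p}
\;\Longrightarrow\;
p = 1 - \frac{1}{\alpha - 1}
  = 1 - \frac{1}{1.1} \approx 0.09
```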

slide-38
SLIDE 38

• Preferential attachment is not so good at predicting network structure

  • Age‐degree correlation
  • Links among high degree nodes
  • On the web nodes sometimes avoid linking to each other

• Further questions:

  • What is a reasonable probabilistic model for how people sample through web‐pages and link to them?
  • Short+Random walks
  • Effect of search engines – reaching pages based on number of links to them

slide-39
SLIDE 39

• Preferential attachment is a key ingredient
• Extensions:

  • Early nodes have advantage: node fitness
  • Geometric preferential attachment

• Copying model [Kleinberg et al.]:

  • Picking a node proportional to the degree is the same as picking an edge at random (pick a node and then its neighbor)
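A tiny exact illustration of this equivalence, assuming each node has a single out-link as in the model described earlier: picking a linking node uniformly at random and following its link lands on node u with probability proportional to u's in-degree. The function name and the toy link map are illustrative.

```python
from collections import Counter
from fractions import Fraction


def follow_link_distribution(target):
    """target[i] is the node that i links to. Pick i uniformly at random
    among linking nodes, follow its link, and return {u: P(land on u)}."""
    n_links = len(target)
    hits = Counter(target.values())       # hits[u] is the in-degree of u
    return {u: Fraction(c, n_links) for u, c in hits.items()}


# Toy link map (hypothetical data): nodes 2, 3 and 5 link to 1; node 4 links to 2.
toy_links = {2: 1, 3: 1, 4: 2, 5: 1}
print(follow_link_distribution(toy_links))
```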

6/14/2009 Jure Leskovec, ICML '09 39

slide-40
SLIDE 40

• We observe how the connectivity (length of the paths) of the network changes as the vertices get removed [Albert et al. 00; Palmer et al. 01]

• Vertices can be removed:

  • Uniformly at random
  • In order of decreasing degree

• It is important for epidemiology

  • Removal of vertices corresponds to vaccination
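A minimal sketch of such a removal experiment. It swaps in a simpler connectivity measure than path length, namely the size of the largest connected component, and it ranks nodes by their initial degree without recomputing after each removal. Both choices, the function names, and the star-graph example are assumptions made for illustration.

```python
import random
from collections import deque


def largest_component_size(adj, removed):
    """Size of the largest connected component after deleting `removed`."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:                       # breadth-first search
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def removal_order(adj, targeted, rng):
    """Nodes in decreasing initial degree (targeted) or in random order."""
    nodes = list(adj)
    if targeted:
        return sorted(nodes, key=lambda u: len(adj[u]), reverse=True)
    rng.shuffle(nodes)
    return nodes


# Star graph: hub 0 connected to leaves 1..5.
star = {0: [1, 2, 3, 4, 5], **{leaf: [0] for leaf in range(1, 6)}}
rng = random.Random(0)
for targeted in (False, True):
    first = removal_order(star, targeted, rng)[0]
    print(targeted, first, largest_component_size(star, {first}))
```

On the star graph, removing the hub shatters the network while removing a leaf barely changes it, which mirrors the contrast between targeted and random removal.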

slide-41
SLIDE 41

• Real‐world networks are resilient to random attacks

  • One has to remove all web‐pages of degree > 5 to disconnect the web
  • But this is a very small percentage of web pages

• Random network has better resilience to targeted attacks

[Figure: mean path length vs. fraction of removed nodes, for a random network and for the Internet (autonomous systems), under random removal and preferential removal]