Quick detection of popular entities in large on-line networks

SLIDE 1

Quick detection of popular entities in large on-line networks

Nelly Litvak, University of Twente, Stochastic Operations Research group

Joint work with K. Avrachenkov (INRIA) and L. Ostroumova (Yandex)

Luchon 24-06-2014

SLIDE 4

Finding largest nodes in large complex networks

◮ Complex networks: Internet, World Wide Web, social networks, protein-protein interactions, citation networks.

◮ Many networks are very large.

  ◮ Facebook has more than 1 billion users. With an average user having 190 friends, the number of social links in Facebook is 190 billion.

  ◮ The static part of the web graph has more than 10 billion pages. With an average number of 38 hyper-links per page, the total number of hyper-links is 380 billion.

SLIDE 7

Finding top-k largest degree nodes

◮ Goal: Find top-k network nodes with largest degrees

◮ Some applications:

  ◮ Routing via large degree nodes
  ◮ Proxy for various centrality measures
  ◮ Node clustering and classification
  ◮ Epidemic processes on networks
  ◮ Finding most popular entities (e.g. interest groups)
  ◮ It is simply interesting!

SLIDE 13

Top-k largest degree nodes

If the adjacency list of the network is known, the top-k list of nodes can be found by HeapSort with complexity O(N + k log(N)), where N is the total number of nodes. Even this modest complexity can be demanding for large networks.

Questions:

◮ How to do this faster?
◮ How to do it when the network structure is not known (cannot be crawled without restrictions or stored in memory)?

Answer: Randomized algorithms.

Idea: Find a ‘good enough’ answer in a short time.

Avrachenkov, Litvak, Sokol, Towsley (2012); Cooper, Radzik, Siantos (2012); Borgs, Brautbar, Chayes, Khanna, Lucier (2012); Brautbar and Kearns (2010); Kumar, Lang, Marlow, Tomkins (2008)
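
As a rough illustration of the full-knowledge baseline on this slide, the Python sketch below builds a heap over all N degrees in O(N) and then pops the k largest in O(k log N). The function name `top_k_by_degree` and the toy degree dictionary are assumptions made for this example; they are not part of the talk.

```python
import heapq


def top_k_by_degree(degrees, k):
    """Return the k (node, degree) pairs with the largest degrees.

    degrees: dict mapping node id -> degree (assumed to be fully known).
    """
    # Negate degrees so that Python's min-heap behaves like a max-heap.
    heap = [(-deg, node) for node, deg in degrees.items()]
    heapq.heapify(heap)  # O(N)
    result = []
    for _ in range(min(k, len(heap))):  # k pops, O(log N) each
        neg_deg, node = heapq.heappop(heap)
        result.append((node, -neg_deg))
    return result


# Hypothetical toy network: node id -> degree.
toy_degrees = {"a": 3, "b": 10, "c": 7, "d": 1, "e": 8}
print(top_k_by_degree(toy_degrees, 2))
```

Even this simple routine has to touch every node once, which is exactly the cost the randomized approach tries to avoid.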

SLIDE 19

Finding most popular entities in directed on-line social networks

◮ Social networks are large
◮ The complete graph structure is only available to the owners
◮ Many companies maintain network statistics (twittercounter.com, followerwonk.com, twitaholic.com, www.insidefacebook.com, yavkontakte.ru)
◮ The network can be accessed only via API, with limited access
◮ Twitter API allows one access per minute. We need 950 years to crawl the current Twitter graph!

Goal: Find top-k most popular entities in social (directed) networks (nodes with highest in/out-degrees, largest interest groups, largest user categories), using the minimal number of API requests.
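
The 950-year figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes one request per user and about 500 million users (the network size quoted on the Twitter slide of this talk). Both are simplifying assumptions, since a user with more than 5000 connections would need several requests.

```python
MINUTES_PER_YEAR = 60 * 24 * 365


def crawl_years(num_users, requests_per_minute=1.0):
    """Years needed to issue one API request per user at the given rate."""
    return num_users / requests_per_minute / MINUTES_PER_YEAR


# Assumed graph size: 500 million users, one request per minute.
print(round(crawl_years(500_000_000)))
```

Under these assumptions the result lands close to the 950 years quoted on the slide.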

SLIDE 21

Problem formulation

◮ Consider a bipartite graph (V, W, E)
◮ V and W are sets of entities, |V| = M, |W| = N.
◮ A directed edge (v, w) ∈ E represents a relation between v ∈ V and w ∈ W.
◮ Goal: Quickly find entities in W with highest degrees.

Example. V = W is a set of Twitter users, (v, w) means that v follows w.

Example. V is a set of users, W is a set of interest groups, (v, w) means that user v is a member of interest group w.

SLIDE 22

Algorithm for finding top-k most popular entities

1. Choose a set A ⊂ V of n1 nodes sampled from V at random.
2. For each v ∈ A, retrieve the ids of the nodes in W that have an edge from v.
3. Compute Sw, the number of edges to w ∈ W from A.
4. Retrieve the actual degrees for the n2 nodes w with the largest values of Sw.
5. Return the identified top-k list of most popular entities in W.

In total, we use n = n1 + n2 requests to API (Step 2 and Step 4).
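
The five steps can be sketched in a few lines of Python. This is a minimal illustration on an in-memory graph, not the authors' implementation: the dictionaries `out_neighbors` and `in_degree` stand in for the two kinds of API request, and the synthetic graph built by `make_toy_graph` is a hypothetical test case.

```python
import random
from collections import Counter


def two_stage_top_k(out_neighbors, in_degree, n1, n2, k, rng):
    """Sketch of the two-stage algorithm on an in-memory graph.

    out_neighbors: dict v -> list of w with an edge (v, w); stands in for
                   the API request of Step 2.
    in_degree:     dict w -> true degree; stands in for the API request
                   of Step 4.
    """
    # Step 1: sample n1 nodes from V uniformly at random.
    sample = rng.sample(list(out_neighbors), n1)
    # Steps 2-3: count, for each w, the edges arriving from the sample.
    s = Counter()
    for v in sample:
        s.update(out_neighbors[v])
    # Step 4: look up true degrees of the n2 candidates with largest S_w.
    candidates = [w for w, _ in s.most_common(n2)]
    exact = {w: in_degree[w] for w in candidates}
    # Step 5: rank the candidates by their exact degree.
    return sorted(exact, key=exact.get, reverse=True)[:k]


def make_toy_graph(num_users, num_hubs, rng):
    """Hypothetical graph: each user follows hub h with probability
    0.9 / (h + 1) and, in addition, three random non-hub users."""
    out_neighbors = {}
    for v in range(num_users):
        follows = [h for h in range(num_hubs) if rng.random() < 0.9 / (h + 1)]
        follows += rng.sample(range(num_hubs, num_users), 3)
        out_neighbors[v] = follows
    in_degree = Counter(w for ws in out_neighbors.values() for w in ws)
    return out_neighbors, in_degree


rng = random.Random(0)
graph, degrees = make_toy_graph(5000, 20, rng)
found = two_stage_top_k(graph, degrees, n1=200, n2=40, k=5, rng=rng)
truth = [w for w, _ in degrees.most_common(5)]
print(found, truth)
```

Here only n1 + n2 = 240 simulated requests are spent on a graph of 5000 nodes; the second stage matters because it replaces the noisy counts Sw by exact degrees before the final ranking.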

SLIDE 26

Finding most followed users on Twitter

◮ Huge network (more than 500M users)
◮ Network accessed only through Twitter API
◮ The rate of requests is limited
◮ One request:
  ◮ IDs of at most 5000 followers of a node, or
  ◮ the number of followers of a node
◮ In a randomly chosen set of n1 Twitter users only a few users follow more than 5000 people. Thus, we retrieve at most 5000 followees of each node. This does not affect the results.
◮ Make a guess: We use 1000 requests to API. For which k can we identify a top-k list of most followed Twitter users with 90% precision?

SLIDE 27

Results

[Plot: fraction of correctly identified nodes (vertical axis, 0.1 to 1) versus n2 (horizontal axis, 100 to 1000), with curves for k = 50, k = 100 and k = 250.]

Figure: The fraction of correctly identified top-k most followed Twitter users as a function of n2, with n = 1000.

SLIDE 28

Most followed

SLIDE 29

Interest groups in VKontakte

◮ Popular social network in Russian, more than 200M users.

Rank   Number of participants   Topic
1      4.35M                    humor
2      4.1M                     humor
3      3.76M                    movies
4      3.69M                    humor
5      3.59M                    humor
6      3.58M                    facts
7      3.36M                    cookery
8      3.31M                    humor
9      3.14M                    humor
10     3.14M                    movies
100    1.65M                    success

◮ With n1 = 700, n2 = 300, our algorithm identifies on average 73.2 of the top-100 interest groups (averaged over 25 experiments). The standard deviation is 4.6.

SLIDE 31

Comparison to known algorithms

◮ Well-studied problem
◮ How does our algorithm compare to baselines?

SLIDE 35

Algorithm by Cooper, Radzik, Siantos (2012)

◮ Random-walk based
◮ Transition probabilities along undirected edges (x, y) are proportional to $(d(x)d(y))^b$, where d(x) is the degree of a vertex x and b > 0 is some parameter.

Problems?

◮ Designed for undirected and connected graphs (preferential attachment graphs)
◮ We need d(x) API requests to know the d(y)’s. All these resources are spent to make just ONE transition!
◮ Not implementable on Twitter

SLIDE 37

Random Walk

Avrachenkov, Litvak, Sokol, Towsley (2012)

◮ Random walk with uniform jumps:

$$p(x, y) = \begin{cases} \dfrac{\alpha/N + 1}{d(x) + \alpha}, & \text{if } x \text{ has a link to } y, \\[1ex] \dfrac{\alpha/N}{d(x) + \alpha}, & \text{if } x \text{ does not have a link to } y, \end{cases}$$

where N is the number of nodes in the graph and d(x) is the degree of a node x.

◮ Rationale: in undirected graphs the stationary distribution is given by

$$\pi_x(\alpha) = \frac{d(x) + \alpha}{2|E| + N\alpha}.$$

◮ Best to take α approximately equal to the average degree

Problems?
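
The stationary distribution formula can be checked numerically on a small undirected graph. The sketch below builds the transition matrix of the walk with uniform jumps and compares a power-iteration estimate with the closed form. The toy adjacency matrix, the value α = 2 and the helper names are assumptions chosen for this example.

```python
def jump_walk_matrix(adj, alpha):
    """Transition matrix of the random walk with uniform jumps.

    adj: symmetric 0/1 adjacency matrix (list of lists), undirected graph.
    """
    n = len(adj)
    rows = []
    for x in range(n):
        d = sum(adj[x])
        # (alpha/N + 1) / (d + alpha) for neighbours, (alpha/N) / (d + alpha) otherwise.
        rows.append([(alpha / n + adj[x][y]) / (d + alpha) for y in range(n)])
    return rows


def stationary_by_power_iteration(p, steps=2000):
    """Approximate the stationary distribution by repeated multiplication."""
    n = len(p)
    pi = [1.0 / n] * n
    for _ in range(steps):
        pi = [sum(pi[x] * p[x][y] for x in range(n)) for y in range(n)]
    return pi


# Hypothetical toy graph: a star with centre 0 plus the edge 1-2.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
alpha = 2.0
num_edges = sum(map(sum, adj)) // 2
pi = stationary_by_power_iteration(jump_walk_matrix(adj, alpha))
formula = [(sum(row) + alpha) / (2 * num_edges + len(adj) * alpha) for row in adj]
print(pi)
print(formula)
```

The two printed vectors should agree up to numerical error, so on an undirected graph ranking nodes by visit frequency amounts to ranking them by degree.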

SLIDE 42

Random Walk: problems

◮ Undirected graphs:

$$\pi_x(\alpha) = \frac{d(x) + \alpha}{2|E| + N\alpha}.$$

In directed graphs, the stationary distribution will not give the order according to degrees.

◮ Fix: make the graph undirected (symmetrized). Usually in-degrees are larger than out-degrees, so ordering by total degree and by in-degree should be similar.

More problems?

◮ We need to know the ids of all neighbors of x to decide where to go, but we can obtain only 5000 ids per API request.
  ◮ Strict: [one step of the algorithm] = [one API request]
  ◮ Relaxed: [one step of the algorithm] = [one considered vertex]

SLIDE 48

Crawl-AI and Crawl-GAI

Kumar, Lang, Marlow, Tomkins (2008)

◮ Designed for WWW crawl
◮ At every step all nodes have their apparent in-degrees Sj, j = 1, . . . , N: the number of discovered edges pointing to this node.
◮ Crawl-AI: the next node is chosen at random with probability proportional to its apparent in-degree
◮ Crawl-GAI: the next node is the node with the highest apparent in-degree

Problems?

◮ The resulting list is created according to the apparent in-degrees, a lot of randomness
◮ Crawl-GAI can get stuck in some densely connected cluster
◮ Can suffer from correlations between in- and out-degrees
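
The two selection rules can be illustrated with a short Python sketch. It follows only the description on this slide; the random start node, the exclusion of already crawled nodes and the random restart when no candidate is left are assumptions of this example and may differ from the original algorithms of Kumar et al.

```python
import random
from collections import Counter


def crawl(out_neighbors, budget, greedy, rng):
    """Sketch of Crawl-GAI (greedy=True) or Crawl-AI (greedy=False).

    out_neighbors: dict node -> list of out-neighbours (every node is a key).
    Returns the apparent in-degrees after at most `budget` crawled nodes.
    """
    apparent = Counter()
    crawled = set()
    current = rng.choice(list(out_neighbors))  # assumed: random start node
    for _ in range(budget):
        crawled.add(current)
        apparent.update(out_neighbors[current])  # discover out-edges
        frontier = [w for w in apparent if w not in crawled]
        if not frontier:
            # Assumed fallback: restart at a random node not yet crawled.
            remaining = [v for v in out_neighbors if v not in crawled]
            if not remaining:
                break
            current = rng.choice(remaining)
        elif greedy:
            # Crawl-GAI: highest apparent in-degree.
            current = max(frontier, key=apparent.__getitem__)
        else:
            # Crawl-AI: probability proportional to apparent in-degree.
            weights = [apparent[w] for w in frontier]
            current = rng.choices(frontier, weights=weights)[0]
    return apparent


# Hypothetical toy graph.
toy = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
rng = random.Random(1)
print(crawl(toy, budget=10, greedy=True, rng=rng))
print(crawl(toy, budget=10, greedy=False, rng=rng))
```

With a budget large enough to crawl the whole toy graph, the apparent in-degrees coincide with the true ones; with a small budget they are only noisy lower bounds, which is the first problem listed on the slide.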

SLIDE 51

HighestDegree

Borgs, Brautbar, Chayes, Khanna, Lucier (2012)

◮ Retrieve a random node
◮ Check in-degrees of its out-neighbors
◮ Proceed while resources are available

Problems?

◮ A lot of resources are spent on out-neighbors of random nodes

SLIDE 53

Comparison of the algorithms

Table: Percentage of correctly identified nodes from top-100 in Twitter averaged over 30 experiments, n = 1000

Algorithm               mean   standard deviation
Two-stage algorithm     92.6   4.7
Random walk (strict)    0.43   0.63
Random walk (relaxed)   8.7    2.4
Crawl-GAI               4.1    5.9
Crawl-AI                23.9   20.2
HighestDegree           24.7   11.8

Advantages of the two-stage algorithm:

◮ does not waste resources
◮ obtains exact degrees of the n2 ‘most promising’ nodes

SLIDE 54

Comparison of the algorithms

[Plot: fraction of correctly identified nodes (vertical axis, 20 to 100) versus n (horizontal axis, 100 to 5000), with curves for the Two-stage algorithm, Random walk (strict), Random walk (relaxed), Crawl-GAI, Crawl-AI and HighestDegree.]

Figure: The fraction of correctly identified top-100 most followed Twitter users as a function of n, averaged over 10 experiments.

SLIDE 55

Influence of graph size?

[Plot: fraction of correctly identified nodes (vertical axis, 0.2 to 1) versus n2 (horizontal axis, 100 to 900), with curves for k = 50, k = 100 and k = 250.]

Figure: The fraction of correctly identified top-k in-degree nodes in the CNR-2000 graph (law.di.unimi.it/webdata/cnr-2000) as a function of n2, with n = 1000. Note that the algorithm performs similarly on CNR-2000 (half a million nodes) and Twitter.

SLIDE 59

Hubs in complex networks

◮ degree of a node = number of links; [fraction of nodes with degree k] = $p_k$
◮ Power law: $p_k \approx \text{const} \cdot k^{-\gamma-1}$, $\gamma > 1$.
◮ Model for high variability, scale-free graph.
◮ Hubs are the nodes with extremely large degrees.

SLIDE 63

Formal view on the hubs

Let D be the degree of a random node. Regularly varying distribution:

$$P(D > x) = L(x)\, x^{-\gamma}, \qquad (1)$$

where L(x) is slowly varying, i.e. $\lim_{t\to\infty} L(tx)/L(t) = 1$ for $x > 0$.

Extreme value theory. Let $F_1 \ge F_2 \ge \cdots \ge F_N$ be the order statistics of the i.i.d. r.v.’s $D_1, D_2, \ldots, D_N$ as in (1). Then there are $(a_N)$ and $(b_N)$ such that for finite k

$$\left(\frac{F_1 - b_N}{a_N}, \ldots, \frac{F_k - b_N}{a_N}\right) \xrightarrow{d} \left(\frac{E_1^{-\delta} - 1}{\delta}, \ldots, \frac{\left(\sum_{i=1}^{k} E_i\right)^{-\delta} - 1}{\delta}\right),$$

where $\delta = 1/\gamma$ and the $E_i$ are i.i.d. exponential(1) r.v.’s.

Example. If $P(D > x) = C x^{-\gamma}$, then $a_N = \delta C^{\delta} N^{\delta}$, $b_N = C^{\delta} N^{\delta}$.

The largest degrees are ‘of the order’ $N^{1/\gamma}$.
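
To see where the constants a_N and b_N of the example come from, here is a short heuristic check (a sketch, not a proof). It assumes the exact Pareto tail of the example and uses the standard fact that, for N i.i.d. samples from a continuous distribution, N · P(D > F_1) is approximately exponential(1) when N is large.

```latex
\begin{align*}
N \cdot C F_1^{-\gamma} \approx E_1
  \quad &\Longrightarrow \quad
  F_1 \approx \left(\frac{CN}{E_1}\right)^{1/\gamma}
      = C^{\delta} N^{\delta} E_1^{-\delta}, \\
\frac{F_1 - b_N}{a_N}
  &\approx \frac{C^{\delta} N^{\delta} E_1^{-\delta} - C^{\delta} N^{\delta}}
               {\delta\, C^{\delta} N^{\delta}}
  = \frac{E_1^{-\delta} - 1}{\delta}.
\end{align*}
```

The first line also shows directly that the largest degree scales like N^{1/γ}, up to a random factor that does not depend on N.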

SLIDE 67

Performance prediction

◮ Number the nodes in W in decreasing order of their degrees: $F_1 \ge F_2 \ge \cdots \ge F_N$.
◮ $S_j$ is the number of followers of node j = 1, 2, . . . , N among the n1 randomly chosen nodes in V
◮ $S_j \sim \text{Binomial}(n_1, F_j/N)$
◮ Let $S_{i_1} \ge S_{i_2} \ge \ldots \ge S_{i_N}$ be the order statistics of $S_1, \ldots, S_N$.
◮ Performance measure:

$$E[\text{fraction of correctly identified top-}k\text{ entities}] = \frac{1}{k} \sum_{j=1}^{k} P(j \in \{i_1, \ldots, i_{n_2}\}). \qquad (2)$$

◮ Computation of $P(j \in \{i_1, \ldots, i_{n_2}\})$ is not feasible even if the degrees are known
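
Although (2) has no convenient closed form, it can be estimated by simulation when the degrees are known. The Monte Carlo sketch below is only an illustration: it draws the S_j independently (a simplification, since in reality they share the same sample), breaks ties at random, and uses a small hypothetical degree sequence.

```python
import random


def simulate_fraction(degrees, n1, n2, k, runs, rng):
    """Monte Carlo estimate of (2) for known degrees in decreasing order."""
    n = len(degrees)
    hits = 0
    for _ in range(runs):
        # S_j ~ Binomial(n1, F_j / N), drawn independently (simplification).
        s = [sum(rng.random() < f / n for _ in range(n1)) for f in degrees]
        # Order nodes by decreasing S_j, breaking ties at random.
        order = sorted(range(n), key=lambda j: (-s[j], rng.random()))
        # Count how many of the true top-k nodes are among the n2 candidates.
        hits += sum(1 for j in order[:n2] if j < k)
    return hits / (k * runs)


# Hypothetical degree sequence: three hubs and 197 low-degree nodes.
toy_degrees = [150, 120, 100] + [2] * 197
rng = random.Random(0)
print(simulate_fraction(toy_degrees, n1=50, n2=10, k=3, runs=20, rng=rng))
```

For realistic graph sizes this brute-force approach is too slow, which motivates the approximations developed on the following slides.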

SLIDE 70

Poisson prediction

◮ $P(j \in \{i_1, \ldots, i_{n_2}\}) = P(S_j > S_{i_{n_2}}) + P(S_j = S_{i_{n_2}},\ j \in \{i_1, \ldots, i_{n_2}\})$

◮ Example. Twitter graph, take n1 = n2 = 500. Then the average number of nodes i with $S_i = 1$ among the top-l nodes is

$$\sum_{i=1}^{l} P(S_i = 1) = \sum_{i=1}^{l} 500\, \frac{F_i}{5 \cdot 10^8} \left(1 - \frac{F_i}{5 \cdot 10^8}\right)^{499},$$

which is 2540.6 for l = 10,000 and 57.4 for l = n2 = 500. Hence, typically, $S_{i_{500}} = 1$. The event $[i \in \{i_1, \ldots, i_{n_2}\}]$ occurs only for a small fraction of nodes i with $[S_i = 1]$.

◮ Approximation:

$$P(j \in \{i_1, \ldots, i_{n_2}\}) \approx P(S_j > S_{i_{n_2}}) \approx P(S_j > S_{n_2})$$

◮ Assume $F_j$ and $F_{n_2}$ are known, then approximate $S_j \sim \text{Poisson}(n_1 F_j/N)$
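
A minimal Python sketch of the Poisson prediction follows. It treats S_j and S_{n2} as independent Poisson variables, which is an assumption of this example, and evaluates P(S_j > S_{n2}) by a truncated double sum. The degree values in the usage lines are made up for illustration and are not Twitter data.

```python
import math


def poisson_pmf(lam, i):
    """P(X = i) for X ~ Poisson(lam) with lam > 0, computed in log space."""
    return math.exp(i * math.log(lam) - lam - math.lgamma(i + 1))


def prob_greater(lam_j, lam_ref):
    """P(X > Y) for independent X ~ Poisson(lam_j), Y ~ Poisson(lam_ref)."""
    # Truncate the sum where the remaining Poisson mass is negligible.
    top = max(lam_j, lam_ref)
    upper = int(top + 12 * math.sqrt(top) + 30)
    cdf_y = 0.0  # running value of P(Y < x)
    total = 0.0
    for x in range(upper + 1):
        total += poisson_pmf(lam_j, x) * cdf_y
        cdf_y += poisson_pmf(lam_ref, x)
    return total


def predicted_fraction(top_degrees, ref_degree, n1, num_nodes):
    """Average of P(S_j > S_{n2}) over the top-k degrees, approximating (2)."""
    lam_ref = n1 * ref_degree / num_nodes
    probs = [prob_greater(n1 * f / num_nodes, lam_ref) for f in top_degrees]
    return sum(probs) / len(probs)


# Hypothetical degrees F_1..F_3 and F_{n2}, with n1 = 500 and N = 5 * 10^8.
print(predicted_fraction([4e7, 2e7, 1e7], ref_degree=5e5, n1=500, num_nodes=5e8))
```

The prediction still requires the degrees F_j and F_{n2} as input, which is the limitation addressed on the next slide.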

SLIDE 75

EVT predictions

◮ Poisson approximation is not realistic: degrees are unknown
◮ The algorithm finds a few highest degrees with accuracy almost 100%
◮ Let $\hat F_1 \ge \hat F_2 \ge \cdots \ge \hat F_m$ be the top-m degrees found by the algorithm, m < k
◮ The degrees follow a power law distribution with exponent γ
◮ Hill’s estimator:

$$\hat\gamma = \left(\frac{1}{m-1} \sum_{i=1}^{m-1} \left[\log(\hat F_i) - \log(\hat F_m)\right]\right)^{-1}. \qquad (3)$$

◮ Estimator for high degrees: Dekkers et al. (1989)

$$\hat f_j = \hat F_m \left(\frac{m}{j-1}\right)^{1/\hat\gamma}, \quad j > 1,\ j \ll N.$$

◮ Use $S_j \sim \text{Poisson}(n_1 \hat f_j/N)$
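
Both estimators are straightforward to compute. The sketch below implements Hill's estimator (3) and the high-degree estimator as written on this slide; the function names and the synthetic degree list, which follows an exact power law with γ = 2, are assumptions made for this example.

```python
import math


def hill_gamma(top_degrees):
    """Hill's estimator (3) from m >= 2 degrees sorted in decreasing order."""
    m = len(top_degrees)
    log_fm = math.log(top_degrees[-1])
    mean_excess = sum(math.log(f) - log_fm for f in top_degrees[:-1]) / (m - 1)
    return 1.0 / mean_excess


def estimate_degree(top_degrees, j, gamma_hat):
    """Estimated j-th largest degree for j > 1, using the formula above."""
    m = len(top_degrees)
    return top_degrees[-1] * (m / (j - 1)) ** (1.0 / gamma_hat)


# Hypothetical top-20 degrees following F_i = 1000 * i^(-1/2), i.e. gamma = 2.
top = [1000.0 * i ** -0.5 for i in range(1, 21)]
gamma_hat = hill_gamma(top)
print(gamma_hat)
print(estimate_degree(top, 50, gamma_hat))
```

With only 20 observations the estimate is close to, but not exactly, the true exponent; the estimated degrees can then be plugged into the Poisson prediction in place of the unknown F_j.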

SLIDE 76

Performance predictions on the Twitter graph

[Plot: horizontal axis from 100 to 1000, vertical axis from 0.2 to 1, with curves for Poisson+EVT based on top-20, Poisson, and Experiment.]

SLIDE 86

Sublinear complexity

◮ 1, . . . , k are the top-k nodes in W; $F_1, \ldots, F_k$ are their degrees
◮ $S_j \sim \text{Binomial}(n_1, F_j/N)$
◮ With the normal approximation and error probability α, we need that

$$\sqrt{\frac{n_1}{N}}\; \frac{F_k - F_{n_2}}{\sqrt{F_k + F_{n_2}}} > z_{1-\alpha}$$

◮ $F_k \gg F_{n_2}$
◮ Assuming i.i.d. degrees, by Extreme Value Theory, w.h.p., $\log(F_k) = \gamma^{-1} \log(N)(1 + o(1))$
◮ Roughly, $n_1 = O(N^{1-1/\gamma})$
◮ Since $\sum_w S_w = O(n_1)$ w.h.p., n2 is at most O(n1)
◮ We conclude that roughly $n = n_1 + n_2 = O(N^{1-1/\gamma})$
◮ Note that the complexity is in terms of |W| = N
◮ High variability helps a lot!
slide-87
SLIDE 87

Thank you!

[ Nelly Litvak, 24-06-2014 ] 28/28