Inference in OSNs via Lightweight Partial Crawls

Jithin K. Sreedharan

Inria, France

Bruno Ribeiro

Purdue University, USA

Konstantin Avrachenkov

Inria, France

Sigmetrics 2016, June 16

Jithin K. Sreedharan (jithin.sreedharan@inria.fr)

Motivation

  • Estimation and inference in Online Social Networks (OSNs)
  • Example: are OSN users more likely to form edges with those who have similar attributes?

Easy to answer if the graph is fully known beforehand. What if the network is not known?

  • Can only crawl the network
  • Few queries

Problem definition

Let G be a graph with the following properties:

  • Undirected graph
  • Nodes and edges have labels
  • Not necessarily connected, or contains the connected components of interest
  • Few seed nodes
  • Large graph

Problem definition (contd.)

Estimate μ(G)

  • Graph is unknown
  • Only local information available: seed nodes and their neighbor IDs; querying (visiting) a neighbor reveals the visited node and its neighbor IDs

How do we know in real time if our estimates are accurate?

Random walk based estimation

  • Goal: estimate μ(G)
  • How [Ribeiro and Towsley '10]: a random walk based estimator for μ(G), which converges asymptotically

A random walk has a unique stationary distribution if the graph G is connected and non-bipartite.

Extensions: [Lee et al. '12], [Gjoka et al. '11], [Ribeiro et al. '12]
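
The stationary-distribution condition can be checked numerically. The following is a minimal sketch, not the estimator from the talk: it assumes a simple random walk on a small hypothetical graph (`toy_graph` and the function names are illustrative), and compares the long-run visit frequency of each node with d_v / 2|E|, the stationary probability of a node v with degree d_v.

```python
import random
from collections import Counter


def toy_graph():
    # Hypothetical 5-node undirected graph. The triangle a-b-c makes it
    # non-bipartite, and it is connected.
    edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj, edges


def visit_frequencies(adj, start, steps, rng):
    # Simple random walk: move to a uniformly chosen neighbor at each step.
    counts = Counter()
    node = start
    for _ in range(steps):
        node = rng.choice(adj[node])
        counts[node] += 1
    return {v: counts[v] / steps for v in adj}


def stationary(adj, edges):
    # Stationary probability of node v is degree(v) / (2 * number of edges).
    return {v: len(adj[v]) / (2 * len(edges)) for v in adj}


if __name__ == "__main__":
    adj, edges = toy_graph()
    freq = visit_frequencies(adj, "a", 200_000, random.Random(0))
    for v, p in sorted(stationary(adj, edges).items()):
        print(v, round(p, 3), round(freq[v], 3))
```

The walk only ever needs the neighbor list of the current node, which matches the local-information constraint of a crawler.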

We get an estimate of μ(G), but how accurate is it?

Existing asymptotic techniques and issues

  • Asymptotic convergence: Ergodic theorem
  • Crawling the graph multiple times
  • Variety of convergence diagnostics for MCMCs

Roughly divided into:

  • Multiple walks to check convergence
      • Walks not independent (start at same seeds)
      • No guarantees
  • Break a long walk into "nearly" independent segments
      • Asymptotic & throws away most observations

Figure: walk samples X1, X2, ..., Xk marked as accepted or rejected; the rejected samples are thrown away.
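
The cost of breaking a long walk into "nearly" independent segments can be made concrete. This is a minimal sketch assuming the simplest form of thinning, where only every `gap`-th sample is accepted; the function names and the fixed-gap rule are illustrative assumptions, not a diagnostic taken from the talk.

```python
def thin(samples, gap):
    # Keep every `gap`-th sample of the walk; everything in between is rejected.
    return samples[gap - 1::gap]


def discarded_fraction(n_samples, gap):
    # Fraction of the crawled samples that thinning throws away.
    kept = len(range(gap - 1, n_samples, gap))
    return 1 - kept / n_samples


if __name__ == "__main__":
    walk = list(range(1, 11))            # stand-in for X1, ..., X10
    print(thin(walk, 5))                 # accepted samples
    print(discarded_fraction(len(walk), 5))
```

With a gap of 5, four out of every five crawled samples contribute nothing to the estimate, which is the inefficiency the slide points out.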

Idea of tours

Figure: example graph with nodes a, b, c, d, e, f, g, h, i, k, l, m, n, p, q, r.

Properties of tours:

  • Tours are independent
  • Fully distributed crawler implementation

Issues with tours:

  • Returning to the same node will take "forever" in a large network [Massoulié et al. '06]: the expected return time to a node i with degree d_i is 2|E|/d_i
  • Solution? Renewal from the most frequent node.
  • No, tours will be interdependent

Figure: random walk node sequence X1, X2, ... split into tours (Tour 1, ..., Tour 3) at the most frequent node in the sequence.
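
The return-time issue can be illustrated by simulation. This is a minimal sketch assuming a tour is a simple random walk that starts at a node and stops at its first return to that node; the toy graph and function names are hypothetical. The average tour length from a node of degree d should be close to 2|E|/d, so low-degree starting nodes in large graphs give very long tours.

```python
import random


def build_adj(edges):
    # Adjacency lists of an undirected graph.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def tour_length(adj, start, rng):
    # Number of steps until the walk first returns to `start`.
    node = rng.choice(adj[start])
    steps = 1
    while node != start:
        node = rng.choice(adj[node])
        steps += 1
    return steps


def mean_tour_length(adj, start, n_tours, rng):
    return sum(tour_length(adj, start, rng) for _ in range(n_tours)) / n_tours


if __name__ == "__main__":
    edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
    adj = build_adj(edges)
    rng = random.Random(1)
    for node in ("c", "e"):
        expected = 2 * len(edges) / len(adj[node])
        print(node, expected, mean_tour_length(adj, node, 20_000, rng))
```

Each tour is generated independently of the others, so the tours can be run in parallel by separate crawlers.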

The idea of Super-node

Figure: the original graph (nodes a, b, c, d, e, f, g, h, i, k, l, m, n, p, q, r) next to the contracted graph (nodes a, b, d, e, f, h, i, l, m, n, p, r and the super-node).

  • Tackling disconnected graph
  • Faster estimate with shorter crawls
  • Not related to lumpability

Super-node formation:

  • static and dynamic (will see later)

Estimator

Key property of tours. The terms of the equation are: the length of the k-th tour, the samples in the k-th tour, the degree of the super-node, and the true value of the contracted graph.

f(u, v) := g(u, v) except when u or v is S_n

Estimator

  • Unbiased (unlike the asymptotic estimator in [Ribeiro and Towsley '10])
  • Strongly consistent

Confidence interval (uses the sampled variance)
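
A tour-based estimator with a confidence interval can be sketched as follows. This is an illustration under stated assumptions, not the exact formula from the slides: it assumes the renewal property that the expected sum of an edge function f over one tour from the super-node equals 2 * (sum of f over the edges of the contracted graph) / (degree of the super-node), it drops edges internal to the super-node, and it uses a normal-approximation interval from the sampled variance of the per-tour values. All function names and the toy graph are hypothetical.

```python
import math
import random
import statistics


def contract(edges, super_nodes, label="S"):
    # Merge `super_nodes` into a single node `label`. Edges inside the
    # super-node are dropped; parallel edges to it are kept.
    contracted = []
    for u, v in edges:
        u = label if u in super_nodes else u
        v = label if v in super_nodes else v
        if not (u == label and v == label):
            contracted.append((u, v))
    return contracted


def adjacency(edges):
    # Adjacency lists; a parallel edge appears as a repeated neighbor.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def tour_sum(adj, f, start, rng):
    # Sum f over the edges traversed by one tour: start -> ... -> start.
    total, prev = 0.0, start
    while True:
        node = rng.choice(adj[prev])
        total += f(prev, node)
        if node == start:
            return total
        prev = node


def estimate_with_ci(adj, f, start, n_tours, rng, z=1.96):
    # Each tour gives one independent value (d / 2) * tour_sum whose mean is
    # the sum of f over the edges of the contracted graph.
    d = len(adj[start])
    per_tour = [d / 2 * tour_sum(adj, f, start, rng) for _ in range(n_tours)]
    mean = statistics.fmean(per_tour)
    half = z * statistics.stdev(per_tour) / math.sqrt(n_tours)
    return mean, (mean - half, mean + half)


if __name__ == "__main__":
    edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
             ("d", "e"), ("e", "f"), ("d", "f")]
    adj = adjacency(contract(edges, {"a", "c"}))
    # f = 1 on every edge, so the target is the contracted graph's edge count.
    print(estimate_with_ci(adj, lambda u, v: 1.0, "S", 20_000, random.Random(0)))
```

Because the per-tour values are independent and identically distributed, the interval can be recomputed after every finished tour, which is what allows accuracy to be assessed while the crawl is still running.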

Bayesian formulation

Find a posterior probability distribution with a suitable prior distribution.

Simulations on real-world networks

Dogster network: Online social network for dogs?

415K nodes, 8.27M edges

Percentage of graph covered: 2.72% (edges), 14.86% (nodes)

Figure: plots of the estimated value.

Simulations on real-world networks: Friendster network

64K nodes, 1.25M edges

Percentage of graph covered: 7.43% (edges), 18.52% (nodes)

Figure: plots of the estimated value.

Simulations on real-world networks: ADD Health data

A friendship network among high school students in the USA

1545 nodes, 4003 edges

Percentage of graph covered: 10.87% (edges), 19.76% (nodes)

Figure: plots of the estimated value.

What if the super-node is not that "super"?

Adaptive crawler: super-node gets bigger as crawling progresses

How to add nodes to super-node:

  • via any method, as long as it is independent of already observed tours
  • Emulates retrospectively adding new node i into super-node from the start
  • Checks previous tours. Breaks them when i is found.
  • Start k new tours from newly added node i; k ~ negative binomial distribution (function of the degrees of i and the number of tours)

Figure: an original tour (sample 1, sample 2, ..., sample k = S_n) that passes through node i is broken into Tour 1, Tour 2, Tour 3 and Tour 4.

"Correction" tours from i: start at i, end in i or S_n.

Theorem: dynamic and static super-node sample paths are equivalent in distribution.
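
The step "checks previous tours, breaks them when the new node is found" can be sketched as a simple list split. This is a minimal illustration with hypothetical names; it covers only the splitting of already observed tours and does not implement the negative binomial draw for the number of correction tours.

```python
def split_tour(tour, new_node):
    # `tour` is a node sequence that starts and ends at the super-node.
    # Every visit to `new_node` closes the current piece and opens a new one,
    # as if `new_node` had belonged to the super-node from the start.
    pieces = []
    current = [tour[0]]
    for node in tour[1:]:
        current.append(node)
        if node == new_node:
            pieces.append(current)
            current = [node]
    if len(current) > 1:
        pieces.append(current)
    return pieces


if __name__ == "__main__":
    original = ["S", "a", "i", "b", "i", "c", "S"]
    for piece in split_tour(original, "i"):
        print(piece)
```

A tour that never visits the new node is returned unchanged as a single piece.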

From the metric μ(G), does the network look random?

Estimation and hypothesis testing in Chung-Lu or configuration model

Assumption: edge labels can be written as a function of node labels

  • Does the true value of the given graph belong to the class of values obtained when the edges are formed purely at random?
  • Does the true value belong to the class when the connections are formed based on degrees alone, with no other influence?

Configuration model:

  • Assume the degree sequence is the same as that of G.
  • Edges are formed by uniformly selecting the half-edges of each node
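
The half-edge construction of the configuration model can be sketched as follows. This is a minimal sketch assuming the full degree sequence is available as a dictionary (in the talk's setting only the degrees of sampled nodes are known); `configuration_model` is an illustrative name, and the result may contain self-loops and parallel edges.

```python
import random


def configuration_model(degrees, rng):
    # One half-edge ("stub") per unit of degree.
    stubs = [node for node, d in degrees.items() for _ in range(d)]
    if len(stubs) % 2:
        raise ValueError("the sum of degrees must be even")
    # A uniformly random pairing of the stubs: shuffle, then pair neighbors.
    rng.shuffle(stubs)
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]


if __name__ == "__main__":
    degrees = {"a": 3, "b": 2, "c": 2, "d": 1}
    print(configuration_model(degrees, random.Random(0)))
```

Every generated graph has exactly the prescribed degree sequence, so any difference from the observed graph comes from how the edges are wired rather than from the degrees.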

Estimation in Chung-Lu or configuration model

Estimate the expected value and variance of the metric under the model

  • The entire degree sequence is unknown; only the degrees of sampled nodes are known

Random walk with jumps to estimate g(u, v), for (u, v) ∉ E
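
The reason node pairs outside E matter can be seen from the expected value under the Chung-Lu model. This is a sketch under assumptions that go beyond the slides: it uses the standard Chung-Lu connection probability min(1, d_u d_v / 2|E|) for each pair of distinct nodes, ignores self-loops, and takes the whole degree sequence as known, whereas the talk estimates the quantity from sampled nodes only. The function name and the toy inputs are hypothetical.

```python
from itertools import combinations


def chung_lu_expected_value(degrees, g):
    # Sum of the degrees equals 2|E|.
    two_m = sum(degrees.values())
    total = 0.0
    # The sum runs over ALL pairs of distinct nodes, not only observed edges.
    for u, v in combinations(sorted(degrees), 2):
        p_edge = min(1.0, degrees[u] * degrees[v] / two_m)
        total += p_edge * g(u, v)
    return total


if __name__ == "__main__":
    degrees = {"a": 1, "b": 1, "c": 2}
    print(chung_lu_expected_value(degrees, lambda u, v: 1.0))
```

Since a plain random walk only observes pairs joined by an edge, some mechanism for reaching non-adjacent pairs is needed, which is the role of the jumps mentioned above.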

Hypothesis testing with the Chung-Lu model

(Lindeberg central limit theorem)

Look for the value of a that satisfies the following. The terms of the condition are: the estimated value of the given graph, and the mean and variance of the Chung-Lu graph.

Dogster network: true and estimated values

Edge function               True value     Estimated value
1{same breed nodes}         8.12 × 10^6    8.066 × 10^6
1{different breed nodes}    2.17 × 10^5    1.995 × 10^5

Percentage of graph crawled: 8.9% (edges), 18.51% (nodes)

Conclusions

  • Unbiased estimator of μ(G)
  • Propose dynamic super-node:
      ✓ Short parallel random walk crawls
      ✓ Parameter-free crawling
  • Provides real-time assessment of estimation accuracy:
      ✓ Bayesian formulation: posterior distribution, matches the true histogram well
  • If the given network forms connections randomly:
      ✓ Estimation of the expected value and variance of the metric under the random model
      ✓ Check whether the original network's value is a sample from that distribution

Thank you!

Software and paper available at http://bit.do/Jithin