Estimating Properties of Social Networks via Random Walk considering - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Estimating Properties of Social Networks via Random Walk considering Private Nodes

Kazuki Nakajima Kazuyuki Shudo

Tokyo Institute of Technology

KDD 2020 Research track

slide-2
SLIDE 2

Graph Sampling on Social Networks

Purpose: Understand global social structure

Challenge: Accurate analysis of graph properties

  • Most researchers have only limited access to graph data.
  • Crawling-based sampling is effective.
  • ex) Breadth-first search, random walk, …
  • Estimates are biased by the sampling procedure.

[Figure: a social graph in which each node is a user and a crawler traverses from node to node.]

slide-3
SLIDE 3

Re-weighted Random Walk [Gjoka et al., 2010]

Effective scheme to obtain unbiased estimators

  • 1. Sample nodes via a random walk
  • 2. Derive sampling bias based on Markov chain analysis
  • 3. Re-weight each sample to correct the sampling bias

Unbiased estimators for several graph properties

  • size (number of nodes), average degree, degree distribution…

[Figure: degree distribution. After re-weighting, the estimate matches the true distribution (unbiased).]
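The three steps above can be sketched for the average degree. This is a minimal illustration, not the authors' code: the toy graph, the function names, and the walk length are assumptions. A simple random walk visits nodes in proportion to their degree, so weighting each sample by the inverse of its degree corrects the bias.

```python
import random

def random_walk(adj, seed, steps, rng):
    """Simple random walk: move to a uniformly chosen neighbor at each step."""
    walk = [seed]
    for _ in range(steps):
        walk.append(rng.choice(adj[walk[-1]]))
    return walk

def average_degree_estimate(adj, walk):
    """Re-weight each sample by 1/degree (harmonic mean of sampled degrees)."""
    return len(walk) / sum(1.0 / len(adj[v]) for v in walk)

# Hypothetical toy graph: triangle 0-1-2 plus a pendant node 3 attached to 0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
rng = random.Random(0)
walk = random_walk(adj, seed=0, steps=20000, rng=rng)

true_avg = sum(len(nbrs) for nbrs in adj.values()) / len(adj)
naive = sum(len(adj[v]) for v in walk) / len(walk)   # biased toward high degree
reweighted = average_degree_estimate(adj, walk)      # bias corrected

print(f"true={true_avg:.3f} naive={naive:.3f} reweighted={reweighted:.3f}")
```

On this toy graph the naive sample mean overshoots the true average degree, while the re-weighted value should land close to it.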

slide-4
SLIDE 4

Private Node Problem

  • Previous algorithms have ignored private nodes.
  • What problems arise?
  • 1. Private nodes inhibit a simple random walk.
  • A random walk that accounts for private nodes is required.
  • The samples must keep the Markov property.
  • 2. Private nodes cause estimation errors.
  • Conventional weighting only corrects sampling bias.

[Figure: example graph with ten nodes. Public nodes publish their neighbors; private nodes hide their neighbors.]

  • ex) Facebook, Twitter, Pokec, …
  • Private nodes accounted for 20-30% of nodes in real networks [Catanese et al., 2011], [Takac et al., 2012].

slide-5
SLIDE 5

Our Study: Addressing Private Node Problem

Contributions: Our study enables us to

  • 1. successfully perform re-weighted random walk algorithms in real social networks including private nodes
  • Discuss transition neighbor selection
  • Derive sampling bias of each node
  • Describe calculation of weights to correct sampling bias
  • 2. accurately estimate size and average degree of the whole social graph including private nodes
  • Propose weighting methods to reduce not only sampling bias but also estimation errors caused by private nodes
  • Theoretically explain that estimates obtained by the proposed weighting have smaller expected errors than those obtained by previous weighting

slide-6
SLIDE 6

Preliminaries (1/2)

Social graph: $G = (V, E)$

  • Each node has a privacy label: public or private.
  • Public node: provides its neighbor data
  • Private node: does not provide its neighbor data
  • Public-cluster $C$
  • connected subgraph consisting of public nodes
  • Public-degree $d_i^*$
  • number of public neighbors of node $v_i$

[Figure: example graph with ten nodes split into three public-clusters; one highlighted node has degree 2 and public-degree 1.]
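The two definitions on this slide can be made concrete with a short sketch. The adjacency-list representation, the function names, and the five-node path graph are assumptions chosen for illustration. Public-clusters are computed here as maximal connected groups of public nodes, found by a breadth-first search that never steps onto a private node.

```python
from collections import deque

def public_degree(adj, is_public, node):
    """Number of public neighbors of `node`."""
    return sum(1 for u in adj[node] if is_public[u])

def public_clusters(adj, is_public):
    """Maximal connected groups of public nodes (BFS over public nodes only)."""
    seen, clusters = set(), []
    for start in adj:
        if not is_public[start] or start in seen:
            continue
        seen.add(start)
        cluster, queue = [], deque([start])
        while queue:
            v = queue.popleft()
            cluster.append(v)
            for u in adj[v]:
                if is_public[u] and u not in seen:
                    seen.add(u)
                    queue.append(u)
        clusters.append(sorted(cluster))
    return clusters

# Hypothetical path graph 0-1-2-3-4 in which node 2 is private.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
is_public = {0: True, 1: True, 2: False, 3: True, 4: True}

clusters = public_clusters(adj, is_public)
print(clusters)                            # two public-clusters
print(len(adj[1]), public_degree(adj, is_public, 1))
```

Node 1 in this example has degree 2 but public-degree 1, because one of its two neighbors is private.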

slide-7
SLIDE 7

Preliminaries (2/2)

Three assumptions

  • 1. Indices of all neighbors of a queried public node are obtained.
  • 2. Each node independently becomes private with probability $p$; otherwise, it is public.
  • 3. The seed of a random walk is on the largest public-cluster (LPC).

Two access models

  • 1. Ideal model (ex. [Gjoka et al., 2011])
  • Obtain neighbor indices and privacy labels.
  • 2. Hidden privacy model (ex. Twitter API)
  • Obtain only neighbor indices.

[Figure: the data returned when a public node is queried under access models 1 and 2.]
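Assumption 2 is easy to simulate. The sketch below draws a privacy label for each node independently; the function name and the boolean encoding (True means public) are assumptions made for this illustration.

```python
import random

def assign_privacy_labels(nodes, p, rng):
    """Return {node: is_public}; each node is private with probability p."""
    return {v: rng.random() >= p for v in nodes}

rng = random.Random(42)
labels = assign_privacy_labels(range(100000), p=0.3, rng=rng)
private_share = sum(1 for public in labels.values() if not public) / len(labels)
print(f"share of private nodes: {private_share:.3f}")
```

With many nodes the share of private nodes should be close to the chosen probability.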

slide-8
SLIDE 8

Random Walk Sampling (1/2)

Random walk considering private nodes

  • Simple random walk: traverse to a randomly selected neighbor
  • It cannot simply continue once a private node is traversed.
  • It is difficult to correct the sampling bias of sampled private nodes.

Randomly traversing public neighbor [Gjoka et al., 2011]

  • 1. Randomly select a neighbor.
  • 2. Traverse to it if it is public; otherwise, randomly select again.

  • Sampling bias
  • Each node is sampled in proportion to public-degree.
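A minimal sketch of the select-and-retry walk follows. The toy graph, the function name, and the `is_public` lookup are assumptions; in a real crawl the privacy check would be an API query rather than a dictionary lookup. The sketch also assumes every visited node has at least one public neighbor, otherwise the retry loop would not terminate.

```python
import random

def public_random_walk(adj, is_public, seed, steps, rng):
    """Random walk that re-draws a neighbor until a public one is selected."""
    walk = [seed]
    for _ in range(steps):
        current = walk[-1]
        while True:
            candidate = rng.choice(adj[current])
            if is_public[candidate]:
                break  # traverse only to public neighbors
        walk.append(candidate)
    return walk

# Hypothetical graph: public nodes 0-3, private node 4 attached to 0 and 1.
adj = {0: [1, 2, 4], 1: [0, 2, 4], 2: [0, 1, 3], 3: [2], 4: [0, 1]}
is_public = {0: True, 1: True, 2: True, 3: True, 4: False}

walk = public_random_walk(adj, is_public, seed=0, steps=20000, rng=random.Random(1))
share_node2 = walk.count(2) / len(walk)
print(f"share of samples at node 2: {share_node2:.3f}")
```

Node 2 has public-degree 3 out of a total public-degree of 8 on this graph, so its share of samples should approach 3/8, which illustrates the stated sampling bias.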

slide-9
SLIDE 9

Random Walk Sampling (2/2)

Calculate public-degree $d_i^*$ to correct sampling bias

  • 1. Ideal model → Exact calculation
  • 2. Hidden privacy model → Proposed approximation
  • Record two values via designed random walk
  • $a_i$ = total number of successful public neighbor selections
  • $b_i$ = total number of neighbor selections
  • 1. When a private neighbor is selected: $b_i \leftarrow b_i + 1$
  • 2. When a public neighbor is selected: $a_i \leftarrow a_i + 1$, $b_i \leftarrow b_i + 1$

Theoretical result: the approximated value $\frac{a_i}{b_i} \times d_i$ converges to the true value $d_i^*$.
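The counter updates can be sketched as follows. This is an illustration under assumptions, not the published implementation: the toy graph and names are hypothetical, and the dictionary lookup stands in for discovering a neighbor's privacy by attempting to query it.

```python
import random
from collections import defaultdict

def walk_with_counters(adj, is_public, seed, steps, rng):
    """Select-and-retry walk that records the two per-node counters."""
    a = defaultdict(int)  # successful public-neighbor selections at each node
    b = defaultdict(int)  # all neighbor selections at each node
    current = seed
    for _ in range(steps):
        while True:
            candidate = rng.choice(adj[current])
            b[current] += 1
            if is_public[candidate]:  # stand-in for "the query succeeded"
                a[current] += 1
                break
        current = candidate
    return a, b

def approx_public_degree(adj, a, b, node):
    """(a / b) * degree; only defined for nodes the walk has visited."""
    return a[node] / b[node] * len(adj[node])

# Hypothetical graph: public nodes 0-3, private node 4 attached to 0 and 1.
adj = {0: [1, 2, 4], 1: [0, 2, 4], 2: [0, 1, 3], 3: [2], 4: [0, 1]}
is_public = {0: True, 1: True, 2: True, 3: True, 4: False}

a, b = walk_with_counters(adj, is_public, seed=0, steps=20000, rng=random.Random(2))
est0 = approx_public_degree(adj, a, b, 0)  # true public-degree is 2
est2 = approx_public_degree(adj, a, b, 2)  # true public-degree is 3
print(f"node 0: {est0:.3f}, node 2: {est2:.3f}")
```

The approximation needs no queries beyond those the walk already makes, which is the point of recording the counters during the walk.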

slide-10
SLIDE 10

Properties Estimation (1/2)

Problem of existing estimators

  • Conventional weighting only corrects sampling bias by using public-degree.
  • Estimates converge to properties of the largest public-cluster (LPC).
  • Errors of convergence values are caused by private nodes.
  • Case of size estimation

Theoretical result: under the condition that all public nodes belong to the LPC, $E_{pri}[n^*] = (1 - p)\,n$.

[Figure: the same ten-node graph under two sets of privacy labels, giving $n^* = 5$ and $n^* = 8$. The expectation $E_{pri}[n^*]$ is taken over the set of privacy labels.]
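The size result follows from linearity of expectation. The short derivation below is a reconstruction under the slide's stated condition, not a quotation of the paper's proof; the indicator variable $X_i$ is notation introduced here, and the private-node probability is written $p$.

```latex
% X_i = 1 if node v_i is public, 0 if private; P(X_i = 1) = 1 - p.
% If all public nodes belong to the LPC, the LPC size is the number of public nodes:
n^* = \sum_{i=1}^{n} X_i
% Taking the expectation over the privacy labels:
E_{pri}[n^*] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} (1 - p) = (1 - p)\, n
```

An estimator that converges to the LPC size therefore underestimates the true size by a factor of about $1 - p$.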

slide-11
SLIDE 11

Properties Estimation (2/2)

Proposed estimators

Goal: Reduce errors of convergence values.

  • The value of $p$ is unknown and difficult to estimate.

Idea: Weighting using public-degree $d_i^*$ and degree $d_i$

  • $d_i^*$ follows a binomial distribution with parameters $1 - p$ and $d_i$.
  • Modify the weight for each sample so that errors of convergence values are minimally reduced.

Generality:

  • Our goal and idea are shared by all random walk-based estimators for social networks.

Theoretical result: under the condition that all public nodes belong to the LPC, $E_{pri}[\tilde{n}] \approx n$, where $\tilde{n}$ is the convergence value of the proposed estimator.
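The binomial statement gives a hint of why the pair of degree and public-degree is useful when the private-node probability $p$ is unknown. The moments below are standard binomial facts; they are offered only as an illustration and are not the authors' weighting formula, which the slide does not spell out.

```latex
% Each of the d_i neighbors of v_i is public independently with probability 1 - p:
d_i^* \sim \mathrm{Binomial}(d_i,\, 1 - p)
E[d_i^*] = (1 - p)\, d_i, \qquad \mathrm{Var}[d_i^*] = p\,(1 - p)\, d_i
% Hence the observable ratio carries information about p for every sample:
E\!\left[\frac{d_i^*}{d_i}\right] = 1 - p
```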

slide-12
SLIDE 12

Experiments

Conduct four experiments:

  • 1. Estimation accuracy of size and average degree for various probabilities $p$
  • 2. Performance in real-world datasets including real private nodes
  • 3. Effectiveness of proposed public-degree calculation
  • 4. Number of queries performed in seed selection

slide-13
SLIDE 13

Experimental Setup

  • Publicly available datasets of social graphs
  • Accuracy measure: Normalized root mean square error

$NRMSE = \sqrt{\frac{1}{t} \sum_{i=1}^{t} \left( \frac{x - x_i}{x} \right)^2}$

  • Number of simulations: $t = 1000$
  • True value: $x$
  • Estimate: $x_i$

Network      Size       Average degree  Privacy label setting
YouTube      1,134,890   5.27           independently with probability p
Pokec        1,632,803  27.32           real labels
Orkut        3,072,441  76.28           independently with probability p
Facebook     3,097,165  15.28           independently with probability p
LiveJournal  3,997,962  17.35           independently with probability p
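The accuracy measure is straightforward to compute. The sketch below assumes the estimates from repeated simulations are held in a list; the function name and sample numbers are hypothetical.

```python
import math

def nrmse(estimates, true_value):
    """Normalized root mean square error over repeated simulation runs."""
    t = len(estimates)
    squared = sum(((true_value - x_i) / true_value) ** 2 for x_i in estimates)
    return math.sqrt(squared / t)

# Two hypothetical runs, each off by 10% of the true value.
score = nrmse([90.0, 110.0], true_value=100.0)
print(f"{score:.3f}")
```

Because each error is divided by the true value, scores are comparable across properties of different scale, such as size and average degree.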

slide-14
SLIDE 14

Experiment 1: Estimation accuracy for several probabilities $p$

NRMSEs of each estimator for several $p$ (1% sample)

  • Size
  • Average degree

[Figure: NRMSE versus $p$ (0.0 to 0.3) on YouTube, Orkut, Facebook, and LiveJournal. Size panels compare NC (existing) with Proposed; average-degree panels compare Smooth (existing) with Proposed. Lower is better. One annotation reads 88.1%.]

slide-15
SLIDE 15

Discussion on results in Experiment 1

  • 1. The improvement in estimation errors results from the improvement in convergence errors.
  • NRMSEs of converged size
  • NRMSEs of converged average degree

[Figure: NRMSE of the converged size (NC vs. Proposed) and of the converged average degree (Smooth vs. Proposed) versus $p$ (0.0 to 0.3) on YouTube, Orkut, Facebook, and LiveJournal. Lower is better.]

slide-16
SLIDE 16

Discussion on results in Experiment 1

  • 2. Estimation and convergence errors are affected by the relative size of the largest public-cluster (LPC).
  • NRMSEs of converged size
  • Relative size of LPC

[Figure: NRMSE of the converged size (NC vs. Proposed) and relative size of the LPC versus $p$ (0.0 to 0.3) on YouTube and Orkut.]

  • On Orkut, almost all public nodes belong to the LPC. Experimental results support the theoretical claims.
  • On YouTube, relatively many nodes do not belong to the LPC. NRMSEs are relatively larger.

slide-17
SLIDE 17

Experiment 2: Estimation on datasets with real private users

Estimation in Pokec graph including real privacy labels

  • NRMSEs of each estimator for several sample rates

[Figure: NRMSE versus sample rate (1% to 5%) for size (NC vs. Proposed) and average degree (Smooth vs. Proposed). Lower is better. Annotations read 92.6% and 86.8%.]

  • Relative error of convergence values

Algorithm  Size   Average degree
Existing   0.338  0.287
Proposed   0.009  0.036

Annotated improvements for the convergence values: 97.3% (size) and 87.5% (average degree).

slide-18
SLIDE 18

Experiment 2: Estimation on datasets with real private users

Data [Kurant et al., 2011]

  • 1,016,275 real public Facebook user samples obtained by a random walk during October 2010.
  • Contains ID, exact public-degree, and exact degree of each sampled public user.

Result

Algorithm  Size estimate  Average degree estimate
Existing   480,298,540    102.1
Proposed   656,874,081    137.0

Discussion

  • 1. In July 2010, Facebook had over 500 million active users.
  • 2. In August 2010, a previous study [Catanese et al., 2011] estimated the proportion of private users as 0.266 from a uniform sample.
  • $1 - \frac{480{,}298{,}540}{656{,}874{,}081} = 0.269$, $1 - \frac{102.1}{137.0} = 0.255$
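The two ratios in the discussion can be checked directly. The variable names are illustrative; the figures are the ones reported on this slide. If the existing estimator converges to roughly a (1 - p) fraction of the true value, as in the size result stated earlier in the deck, then one minus the ratio of existing to proposed estimates gives an implied private-node proportion.

```python
size_existing, size_proposed = 480_298_540, 656_874_081
degree_existing, degree_proposed = 102.1, 137.0

# Implied proportion of private users from each pair of estimates.
implied_from_size = 1 - size_existing / size_proposed
implied_from_degree = 1 - degree_existing / degree_proposed

print(round(implied_from_size, 3), round(implied_from_degree, 3))
```

Both implied proportions are close to the 0.266 estimated from a uniform sample in the earlier study.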

slide-19
SLIDE 19

Experiment 3: Effects of proposed public-degree calculation

Comparing size estimation accuracy and proportion of queried nodes (1% sample on Orkut)

  • Exact calculation queries all neighbors of each sample.
  • Almost same estimation accuracy
  • Proposed calculation queries approximately 1% of nodes
  • Exact calculation queries over 50% of nodes

[Figure: two panels versus $p$ (0.0 to 0.3), each comparing Exact with Proposed: NRMSEs of size estimates, and proportion of queried nodes (log scale). Lower is better in both.]

slide-20
SLIDE 20

Experiment 4: Number of queries in seed selection

  • Additional queries are required in two cases
  • 1. Seed is a private node.
  • 2. Seed is on an isolated public-cluster.
  • We believe the number of queries in each case is small.
  • Two properties of graphs with power-law degree distributions under random removal of nodes [Albert et al., 2000]
  • 1. Relative size of LPC: even if $p = 0.3$, 99.1% on Orkut and 76.7% on YouTube
  • 2. Average size of isolated public-clusters: average size ≈ 1

[Figure: relative size of the LPC and average size of isolated public-clusters versus $p$ (0.0 to 0.3).]

slide-21
SLIDE 21

Summary

  • We designed re-weighted random walk algorithms considering private nodes for the first time.
  • ✓ We designed a random walk-based sampling algorithm considering private nodes.
  • ✓ We proposed weighting methods to reduce both sampling bias and estimation errors due to private nodes.
  • ✓ We validated the theoretical claims and the effectiveness of the proposed algorithms in extensive experiments using real social network datasets.
  • Datasets and source code used in this study are publicly available at https://github.com/kazuibasou/KDD2020