Estimating Properties of Social Networks via Random Walk considering Private Nodes
KDD 2020 Research track
Kazuki Nakajima, Kazuyuki Shudo
Tokyo Institute of Technology
Graph Sampling on Social Networks
Purpose: Understand global social structure
Challenge: Accurate analysis of graph properties
- Most researchers have limited access to graph data.
- Crawling-based sampling is effective.
- e.g., breadth-first search, random walk, …
- Estimates are biased due to sampling.
[Figure: social graph; a crawler traverses from node (= user) to node]
Re-weighted Random Walk [Gjoka et al., 2010]
Effective scheme to obtain unbiased estimators
- 1. Sample nodes via a random walk
- 2. Derive sampling bias based on Markov chain analysis
- 3. Re-weight each sample to correct sampling bias
Unbiased estimators for several graph properties
- size (number of nodes), average degree, degree distribution…
[Figure: degree distribution, estimate vs. true; after re-weighting, the estimate is unbiased]
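The three steps of the re-weighted random walk can be sketched for the average degree as follows. This is a minimal illustration, not the authors' implementation: the toy graph, the function names, and the choice of the average-degree estimator are assumptions. A simple random walk on a connected, non-bipartite graph samples a node in proportion to its degree, so each sample is weighted by the inverse of its degree.

```python
import random

def random_walk(adj, seed, steps, rng):
    """Simple random walk: move to a uniformly chosen neighbor at each step."""
    node = seed
    samples = []
    for _ in range(steps):
        samples.append(node)
        node = rng.choice(adj[node])
    return samples

def reweighted_average_degree(adj, samples):
    """Weight each sample by 1/degree to cancel the degree-proportional bias.

    sum(w * d) / sum(w) with w = 1/d reduces to len(samples) / sum(1/d).
    """
    weights = [1.0 / len(adj[v]) for v in samples]
    return len(samples) / sum(weights)

# Hypothetical toy graph (adjacency lists), used only for illustration.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2, 4], 4: [3]}
rng = random.Random(0)
samples = random_walk(adj, seed=0, steps=20000, rng=rng)

naive = sum(len(adj[v]) for v in samples) / len(samples)   # biased upward
estimate = reweighted_average_degree(adj, samples)          # bias corrected
true_avg = sum(len(nbrs) for nbrs in adj.values()) / len(adj)
print(naive, estimate, true_avg)
```

The naive sample mean overweights high-degree nodes, while the re-weighted estimate should land close to the true average degree of the toy graph.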
Private Node Problem
- Previous algorithms have ignored private nodes
- What problems arise?
- 1. Private nodes inhibit a simple random walk
- Require a random walk considering private nodes
- Require samples keeping Markov property
- 2. Private nodes cause estimation errors
- Conventional weighting only corrects sampling bias
[Figure: example graph with nodes 1 to 10]
Public nodes publish their neighbors; private nodes hide their neighbors.
- e.g., Facebook, Twitter, Pokec, …
- Private nodes account for 20-30% of users in practice
[Catanese et al., 2011], [Takac et al., 2012]
Our Study:
Addressing Private Node Problem
Contributions: Our study enables us to
- 1. successfully perform re-weighted random walk algorithms
in real social networks including private nodes
- Discuss transition neighbor selection
- Derive sampling bias of each node
- Describe calculation of weights to correct sampling bias
- 2. accurately estimate the size and average degree
of the whole social graph, including private nodes
- Propose weighting methods to reduce not only sampling bias
but also estimation errors caused by private nodes
- Theoretically explain that estimates obtained by the proposed weighting
have smaller expected errors than those obtained by the previous weighting
Preliminaries (1/2)
Social graph: $G = (V, E)$
- Each node has a privacy label: public or private.
- Public node: provides its neighbor data
- Private node: does not provide its neighbor data
- Public-cluster $C$
- connected subgraph consisting of public nodes
- Public-degree $d_i^*$
- number of public neighbors of node $v_i$
[Figure: example graph with nodes 1 to 10 and three public-clusters; one example node has public-degree 1 and degree 2]
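The definitions of public-degree and public-cluster can be made concrete with a small sketch. The graph, the privacy labels, and the helper names below are hypothetical and are not the example from the slides; public-clusters are computed here as connected components of the subgraph induced by the public nodes.

```python
from collections import deque

def public_degree(adj, is_public, v):
    """Number of public neighbors of node v."""
    return sum(1 for u in adj[v] if is_public[u])

def public_clusters(adj, is_public):
    """Connected components of the subgraph induced by public nodes (BFS)."""
    seen, clusters = set(), []
    for start in adj:
        if not is_public[start] or start in seen:
            continue
        seen.add(start)
        queue, cluster = deque([start]), set()
        while queue:
            v = queue.popleft()
            cluster.add(v)
            for u in adj[v]:
                if is_public[u] and u not in seen:
                    seen.add(u)
                    queue.append(u)
        clusters.append(cluster)
    return clusters

# Hypothetical toy graph: node 4 is private and separates two public-clusters.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
is_public = {1: True, 2: True, 3: True, 4: False, 5: True, 6: True}

clusters = public_clusters(adj, is_public)
lpc = max(clusters, key=len)  # largest public-cluster (LPC)
print(clusters, lpc, public_degree(adj, is_public, 3), len(adj[3]))
```

In this toy graph node 3 has degree 3 but public-degree 2, because one of its neighbors is private.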
Preliminaries (2/2)
Three assumptions
- 1. Indices of all neighbors of a queried public node
are obtained.
- 2. Each node independently becomes private
with probability $p$; otherwise, it is public.
- 3. The seed of a random walk is on the largest public-cluster (LPC).
Two access models
- 1. Ideal model (e.g., [Gjoka et al., 2011])
- Obtain neighbor indices and privacy labels.
- 2. Hidden privacy model (e.g., Twitter API)
- Obtain only neighbor indices.
e.g., when public node 0 is queried.
[Figure: query results under access models 1 and 2]
Random Walk Sampling (1/2)
Random walk considering private nodes
- Simple random walk: traverse to a randomly selected neighbor
- Cannot simply continue when private nodes are traversed
- Difficult to correct sampling bias of sampled private nodes
Randomly traversing a public neighbor [Gjoka et al., 2011]
- 1. Randomly select a neighbor.
- 2. Traverse to it if it is public; otherwise, randomly select again.
- Sampling bias
- Each node is sampled in proportion to public-degree.
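A minimal sketch of this walk is shown below, assuming a hypothetical toy graph and an `is_public` lookup that stands in for checking whether a selected neighbor is public. Re-drawing until a public neighbor is found is equivalent to choosing uniformly among the public neighbors, so the visit frequency of each node should approach its public-degree divided by the sum of public-degrees over the public-cluster.

```python
import random
from collections import Counter

def public_neighbor_walk(adj, is_public, seed, steps, rng):
    """Random walk that only moves to public neighbors (re-draws otherwise).

    Assumes the seed is public and lies on a public-cluster with at least
    two nodes, so every visited node has at least one public neighbor.
    """
    node = seed
    samples = []
    for _ in range(steps):
        samples.append(node)
        while True:
            candidate = rng.choice(adj[node])   # 1. randomly select a neighbor
            if is_public[candidate]:            # 2. traverse only if public
                node = candidate
                break
    return samples

# Hypothetical toy graph: nodes 4 and 5 are private.
adj = {0: [1, 2, 3, 4], 1: [0, 2, 4], 2: [0, 1], 3: [0, 5], 4: [0, 1], 5: [3]}
is_public = {0: True, 1: True, 2: True, 3: True, 4: False, 5: False}

rng = random.Random(1)
samples = public_neighbor_walk(adj, is_public, seed=0, steps=40000, rng=rng)
freq = {v: c / len(samples) for v, c in Counter(samples).items()}
# Public-degrees are 3, 2, 2, 1 for nodes 0..3, so the expected
# visit frequencies are 3/8, 2/8, 2/8, 1/8.
print(freq)
```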
Random Walk Sampling (2/2)
Calculate public-degree $d_i^*$ to correct sampling bias
- 1. Ideal model → Exact calculation
- 2. Hidden privacy model → Proposed approximation
- Record two values via designed random walk
- $a_i$ = total number of successful public neighbor selections
- $b_i$ = total number of neighbor selections
- 1. When a private neighbor is selected: $b_i \leftarrow b_i + 1$
- 2. When a public neighbor is selected: $a_i \leftarrow a_i + 1$, $b_i \leftarrow b_i + 1$
Theoretical result: the approximated value $\frac{a_i}{b_i} \times d_i$ converges to the true value $d_i^*$.
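The counter-based approximation can be sketched as below. This is an illustration under assumptions, not the authors' code: the toy graph and function names are hypothetical, the counters are called `a` and `b`, and the `is_public` lookup stands in for the outcome of querying a selected neighbor in the hidden privacy model. For each visited node, the fraction of successful selections times the degree approximates the public-degree.

```python
import random
from collections import defaultdict

def walk_with_counters(adj, is_public, seed, steps, rng):
    """Public-neighbor random walk that records selection counters per node."""
    a = defaultdict(int)  # successful public-neighbor selections made at a node
    b = defaultdict(int)  # all neighbor selections made at a node
    node, samples = seed, []
    for _ in range(steps):
        samples.append(node)
        while True:
            candidate = rng.choice(adj[node])
            b[node] += 1                 # every selection is counted
            if is_public[candidate]:     # stands in for a successful query
                a[node] += 1
                node = candidate
                break
    return samples, a, b

def approx_public_degree(adj, a, b, v):
    """Approximate public-degree: (a / b) * degree."""
    return a[v] / b[v] * len(adj[v])

# Hypothetical toy graph: nodes 4 and 5 are private.
adj = {0: [1, 2, 3, 4], 1: [0, 2, 4], 2: [0, 1], 3: [0, 5], 4: [0, 1], 5: [3]}
is_public = {0: True, 1: True, 2: True, 3: True, 4: False, 5: False}

rng = random.Random(2)
samples, a, b = walk_with_counters(adj, is_public, seed=0, steps=40000, rng=rng)
approx = {v: approx_public_degree(adj, a, b, v) for v in set(samples)}
print(approx)  # true public-degrees in this toy graph: 3, 2, 2, 1
```

Unlike the exact calculation, this sketch never inspects all neighbors of a sampled node; it reuses the selections that the walk performs anyway.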
Properties Estimation (1/2)
Problem of existing estimators
- Conventional weighting only corrects sampling bias by using public-degree.
- Estimates converge to properties of the largest public-cluster (LPC).
- Errors of convergence values caused by private nodes
- Case of size estimation
- Derive the expectation of the LPC size $n^*$ over sets of privacy labels, $E_{pri}[n^*]$.
Theoretical result: under the condition that all public nodes belong to the LPC, $E_{pri}[n^*] = (1 - p)\,n$.
[Figure: example graph with nodes 1 to 10 under two sets of privacy labels, giving different LPC sizes $n^*$]
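The size result follows from linearity of expectation. As a short worked derivation, write $n$ for the number of nodes, $n^*$ for the size of the LPC, and $p$ for the probability that a node is private. Under the stated condition that all public nodes belong to the LPC, $n^*$ equals the number of public nodes, and each node is public independently with probability $1 - p$:

```latex
\begin{align*}
n^* &= \sum_{i=1}^{n} \mathbf{1}[v_i \text{ is public}], \\
E_{pri}[n^*] &= \sum_{i=1}^{n} \Pr[v_i \text{ is public}]
             = \sum_{i=1}^{n} (1 - p) = (1 - p)\,n .
\end{align*}
```

An estimator that converges to $n^*$ therefore underestimates $n$ by a relative factor of $p$ in expectation.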
Properties Estimation (2/2)
Proposed estimators
Goal: Reduce errors of convergence values.
- The value of $p$ is unknown and difficult to estimate.
Idea: Weighting using public-degree $d_i^*$ and degree $d_i$
- $d_i^*$ follows a binomial distribution with parameters $1 - p$ and $d_i$.
- Modify the weight for each sample
so that errors of convergence values are minimized.
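To see why the pair of public-degree and degree carries information about the unknown probability, write $d_i$ for the degree of node $v_i$, $d_i^*$ for its public-degree, and $p$ for the probability that a node is private. The binomial assumption alone gives the following; this only illustrates the idea and does not reproduce the proposed weights themselves.

```latex
\begin{align*}
d_i^* &\sim \mathrm{Binomial}(d_i,\, 1 - p), \\
E_{pri}[d_i^*] &= (1 - p)\, d_i, \qquad
E_{pri}\!\left[\frac{d_i^*}{d_i}\right] = 1 - p .
\end{align*}
```

Each sample's ratio of public-degree to degree is thus centered on the fraction of public nodes, without $p$ having to be estimated separately.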
Generality:
- Our goal and idea are shared in
all random walk-based estimators for social networks.
Theoretical result: under the condition that all public nodes belong to the LPC, $E_{pri}[\hat{n}] \approx n$,
where $\hat{n}$ is the convergence value of the proposed estimator.
Experiments
Conduct four experiments:
- 1. Estimation accuracy of size and average degree
for various probabilities $p$
- 2. Performance in real-world datasets including
real private nodes
- 3. Effectiveness of proposed public-degree calculation
- 4. Number of queries performed in seed selection
Experimental Setup
- Publicly available datasets of social graphs
Network      Size       Average degree  Privacy label setting
YouTube      1,134,890  5.27            independently with probability $p$
Pokec        1,632,803  27.32           real labels
Orkut        3,072,441  76.28           independently with probability $p$
Facebook     3,097,165  15.28           independently with probability $p$
LiveJournal  3,997,962  17.35           independently with probability $p$
- Accuracy measure: Normalized root mean square error
$\mathrm{NRMSE} = \sqrt{\frac{1}{t} \sum_{i=1}^{t} \left( \frac{x - x_i}{x} \right)^2}$
- Number of simulations: $t = 1000$
- True value: $x$
- Estimate: $x_i$
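The normalized root mean square error can be computed as in the sketch below. The function name and the sample numbers are illustrative assumptions; the formula is the root of the mean squared relative error over the simulation runs.

```python
import math

def nrmse(estimates, true_value):
    """Root of the mean squared relative error over all simulation runs."""
    t = len(estimates)
    squared = (((true_value - x_i) / true_value) ** 2 for x_i in estimates)
    return math.sqrt(sum(squared) / t)

# Made-up estimates from four hypothetical runs, with true value 100.
estimates = [90.0, 110.0, 100.0, 105.0]
score = nrmse(estimates, 100.0)
print(score)
```

Because each error is divided by the true value, scores for size and for average degree are comparable across graphs of very different scale.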
Experiment 1:
Estimation accuracy for several probabilities $p$
NRMSEs of each estimator for several $p$ (1% sample)
- Size
- Average degree
[Figure: NRMSE versus $p$ (0.0 to 0.3) on YouTube, Orkut, Facebook, and LiveJournal. Size panels compare NC (existing) with Proposed; average-degree panels compare Smooth (existing) with Proposed. Lower is better. One panel is annotated with 88.1%.]
Discussion on results in Experiment 1
- 1. The improvement in estimation errors results from
the improvement in convergence errors.
- NRMSEs of converged size
- NRMSEs of converged average degree
[Figure: NRMSE of converged size (NC vs. Proposed) and of converged average degree (Smooth vs. Proposed) versus $p$ (0.0 to 0.3) on YouTube, Orkut, Facebook, and LiveJournal; legend: Existing, Proposed; lower is better]
Discussion on results in Experiment 1
- 2. Estimation and convergence errors are affected by
the relative size of the largest public-cluster (LPC).
- NRMSEs of converged size
- Relative size of LPC
[Figure: NRMSE of converged size (NC vs. Proposed) and relative size of the LPC versus $p$ (0.0 to 0.3), on YouTube and Orkut]
- On Orkut, almost all public nodes belong to the LPC.
Experimental results support the theoretical claims.
- On YouTube, relatively many nodes do not belong to the LPC.
NRMSEs are relatively larger.
Experiment 2:
Estimation on datasets with real private users
Estimation in Pokec graph including real privacy labels
- NRMSEs of each estimator for several sample rates
- Relative error of convergence values
[Figure: NRMSE versus sample rate (1 to 5%) for size (NC vs. Proposed) and average degree (Smooth vs. Proposed); legend: Existing, Proposed; lower is better; annotated improvements: 92.6%, 86.8%]
Relative error of convergence values:
Algorithm  Size   Average degree
Existing   0.338  0.287
Proposed   0.009  0.036
Annotated improvements: 97.3% (size), 87.5% (average degree)
Experiment 2:
Estimation on datasets with real private users
Data [Kurant et al., 2011]
- 1,016,275 real public Facebook user samples
- Obtained by a random walk during October 2010.
- Contains the ID, exact public-degree, and exact degree of
each sampled public user.
Result
Algorithm  Size estimate  Average degree estimate
Existing   480,298,540    102.1
Proposed   656,874,081    137.0
Discussion
- 1. In July 2010, Facebook had over 500 million active users.
- 2. In August 2010, a previous study [Catanese et al., 2011] estimated
the proportion of private users as 0.266 from a uniform sample.
- $1 - \frac{480{,}298{,}540}{656{,}874{,}081} = 0.269$, $1 - \frac{102.1}{137.0} = 0.255$
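The two ratios in the discussion can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the size and average degree estimates reported for the existing and proposed algorithms. If the existing estimate reflects only the public part of the graph, one minus the ratio of existing to proposed gives an implied proportion of private users, which can be compared with the 0.266 reported from a uniform sample.

```python
# Estimates reported for the Facebook random walk sample.
existing_size, proposed_size = 480_298_540, 656_874_081
existing_degree, proposed_degree = 102.1, 137.0

# Implied proportion of private users from each pair of estimates.
ratio_from_size = 1 - existing_size / proposed_size
ratio_from_degree = 1 - existing_degree / proposed_degree

print(round(ratio_from_size, 3), round(ratio_from_degree, 3))
```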
Experiment 3:
Effects of proposed public-degree calculation
Comparing size estimation accuracy and proportion of queried nodes (1% sample on Orkut)
- Exact calculation queries all neighbors of each sample.
- Estimation accuracy is almost the same.
- Proposed calculation queries approximately 1% of nodes.
- Exact calculation queries over 50% of nodes.
[Figure: NRMSEs of size estimates and proportion of queried nodes (log scale) versus $p$ (0.0 to 0.3); legend: Exact, Proposed; lower is better]
Experiment 4:
Number of queries in seed selection
- Require additional queries in two cases
- 1. Seed is private node.
- 2. Seed is on isolated public-cluster.
- We expect the number of additional queries in each case to be small.
- Two properties of graphs with power-law degree distributions
under random removal of nodes [Albert et al., 2000]
[Figure: relative size of LPC and average size of isolated public-clusters versus $p$ (0.0 to 0.3)]
- 1. Relative size of LPC: even if $p = 0.3$, 99.1% on Orkut and 76.7% on YouTube
- 2. Average size of isolated public-clusters: average size ≈ 1
Summary
- We designed re-weighted random walk algorithms
considering private nodes for the first time.
- We designed a random walk-based sampling algorithm considering private nodes.
- We proposed weighting methods to reduce both sampling bias and estimation errors due to private nodes.
- We validated the theoretical claims and the effectiveness of the proposed algorithms in extensive experiments using real social network datasets.
- Datasets and source code used in this study