KDD 2020 Research track
Estimating Properties of Social Networks via Random Walk considering Private Nodes
Kazuki Nakajima, Kazuyuki Shudo
Tokyo Institute of Technology

Graph Sampling on Social Networks
Purpose: Understand global social structure.
Challenge: Accurate analysis of graph properties.
• Most researchers have only limited access to the graph data.
• Crawling-based sampling is effective.
  • e.g., breadth-first search, random walk, …
• Estimates are biased by the sampling procedure.
[Figure: a crawler traversing nodes (= users) of a social graph]

Re-weighted Random Walk [Gjoka et al., 2010]
Effective scheme to obtain unbiased estimators
1. Sample nodes via a random walk.
2. Derive the sampling bias based on Markov chain analysis.
3. Re-weight each sample to correct the sampling bias.
Unbiased estimators for several graph properties
• size (number of nodes), average degree, degree distribution, …
[Figure: degree distribution, true versus estimate; after re-weighting the estimate is unbiased]
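The three steps can be illustrated with a minimal sketch for the average degree. It assumes a connected, non-bipartite undirected graph with no private nodes, on which a simple random walk samples each node with probability proportional to its degree, so weighting each sample by the inverse of its degree corrects the bias. The toy graph and the function names are illustrative and do not come from the slides.

```python
import random

def random_walk(adj, seed_node, steps, rng):
    """Simple random walk: move to a uniformly random neighbor at each step."""
    node = seed_node
    samples = []
    for _ in range(steps):
        samples.append(node)
        node = rng.choice(adj[node])
    return samples

def estimate_average_degree(adj, samples):
    """Re-weighted estimate: weight each sample by 1/degree (harmonic mean)."""
    inverse_degree_sum = sum(1.0 / len(adj[v]) for v in samples)
    return len(samples) / inverse_degree_sum

# Hypothetical toy graph: a star centered on node 0, plus the edge 1-2.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0]}
true_average = sum(len(nbrs) for nbrs in adj.values()) / len(adj)

rng = random.Random(0)
samples = random_walk(adj, seed_node=0, steps=20000, rng=rng)

# Naive mean over samples is pulled toward high-degree nodes.
naive = sum(len(adj[v]) for v in samples) / len(samples)
reweighted = estimate_average_degree(adj, samples)
print(f"true={true_average:.2f} naive={naive:.2f} reweighted={reweighted:.2f}")
```

The naive sample mean overestimates the average degree because high-degree nodes are visited more often; the re-weighted estimate stays close to the true value.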

Private Node Problem
• Previous algorithms have ignored private nodes.
  • Public nodes publish their neighbors; private nodes hide their neighbors.
  • e.g., Facebook, Twitter, Pokec, …
  • In real networks, 20-30% of nodes were private [Catanese et al., 2011], [Takac et al., 2012].
• What problems arise?
1. Private nodes inhibit a simple random walk.
  • Requires a random walk that considers private nodes.
  • Requires samples that keep the Markov property.
2. Private nodes cause estimation errors.
  • Conventional weighting only corrects sampling bias.
[Figure: example social graph with ten nodes, some public and some private]

Our Study: Addressing Private Node Problem
Contributions: Our study enables us to
1. successfully perform re-weighted random walk algorithms in real social networks including private nodes
  • Discuss transition neighbor selection.
  • Derive the sampling bias of each node.
  • Describe the calculation of weights to correct sampling bias.
2. accurately estimate the size and average degree of the whole social graph including private nodes
  • Propose weighting methods that reduce not only sampling bias but also estimation errors caused by private nodes.
  • Theoretically explain that estimates obtained by the proposed weighting have smaller expected errors than those of the previous weighting.

Preliminaries (1/2)
Social graph: $G = (V, E)$
• Each node has a privacy label: public or private.
  • Public node: provides its neighbor data.
  • Private node: does not provide its neighbor data.
• Public-cluster $C$
  • connected subgraph consisting of public nodes
• Public-degree $d_i^*$
  • number of public neighbors of node $v_i$
[Figure: example social graph with ten nodes, three public-clusters, and the degree and public-degree of one node]
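The two definitions can be made concrete with a short sketch. It assumes that a public-cluster means a maximal connected component of the subgraph induced by public nodes; the toy graph, the privacy labels, and the function names are hypothetical.

```python
def public_degree(adj, is_public, node):
    """Number of public neighbors of `node`."""
    return sum(1 for u in adj[node] if is_public[u])

def public_clusters(adj, is_public):
    """Connected components of the subgraph induced by public nodes."""
    seen = set()
    clusters = []
    for start in adj:
        if not is_public[start] or start in seen:
            continue
        cluster = {start}
        seen.add(start)
        stack = [start]
        while stack:  # depth-first search restricted to public nodes
            v = stack.pop()
            for u in adj[v]:
                if is_public[u] and u not in seen:
                    seen.add(u)
                    cluster.add(u)
                    stack.append(u)
        clusters.append(cluster)
    return clusters

# Hypothetical path graph 0-1-2-3-4-5 in which node 3 is private.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
is_public = {0: True, 1: True, 2: True, 3: False, 4: True, 5: True}

clusters = public_clusters(adj, is_public)
largest = max(clusters, key=len)  # the largest public-cluster (LPC)
print(clusters, largest, public_degree(adj, is_public, 2))
```

Here the private node splits the public nodes into two clusters, and node 2 has degree 2 but public-degree 1.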

Preliminaries (2/2)
Three assumptions
1. Indices of all neighbors of a queried public node are obtained.
2. Each node independently becomes private with probability $p$; otherwise, it is public.
3. The seed of a random walk is on the largest public-cluster (LPC).
Two access models (example: when a public node is queried)
1. Ideal model (e.g., [Gjoka et al., 2011])
  • Obtain neighbor indices and privacy labels.
2. Hidden privacy model (e.g., Twitter API)
  • Obtain only neighbor indices.

Random Walk Sampling (1/2)
Random walk considering private nodes
• Simple random walk: traverse a randomly chosen neighbor.
  • Cannot simply continue when a private node is traversed.
  • Difficult to correct the sampling bias of sampled private nodes.
Randomly traversing a public neighbor [Gjoka et al., 2011]
1. Randomly select a neighbor.
2. Traverse it if it is public; otherwise, randomly select again.
• Sampling bias
  • Each node is sampled with probability proportional to its public-degree.
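The select-and-retry rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the graph is undirected, the seed is a public node with at least one public neighbor, and privacy labels are read from a dictionary. The toy graph and all names are hypothetical.

```python
import random
from collections import Counter

def public_random_walk(adj, is_public, seed_node, steps, rng):
    """Random walk that only ever moves to public neighbors."""
    if not any(is_public[u] for u in adj[seed_node]):
        raise ValueError("seed needs at least one public neighbor")
    node = seed_node
    samples = []
    for _ in range(steps):
        samples.append(node)
        while True:  # re-select until a public neighbor is drawn
            candidate = rng.choice(adj[node])
            if is_public[candidate]:
                break
        node = candidate
    return samples

# Hypothetical graph: public nodes 0..4, private node 5 adjacent to 0 and 3.
adj = {0: [1, 2, 5], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3], 5: [0, 3]}
is_public = {0: True, 1: True, 2: True, 3: True, 4: True, 5: False}

rng = random.Random(1)
samples = public_random_walk(adj, is_public, seed_node=0, steps=20000, rng=rng)
frequency = Counter(samples)

# Public-degrees are 2, 2, 3, 2, 1 (sum 10), so node 2 should appear
# in roughly 3/10 of the samples.
print({v: round(frequency[v] / len(samples), 3) for v in sorted(frequency)})
```

Because the graph is undirected, every public node the walk reaches has at least one public neighbor (the node it came from), so the retry loop always terminates.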

Random Walk Sampling (2/2)
Calculate the public-degree $d_i^*$ to correct sampling bias
1. Ideal model → Exact calculation
2. Hidden privacy model → Proposed approximation
• Record two values via the designed random walk:
  • $a_i$ = total number of successful public neighbor selections
  • $b_i$ = total number of neighbor selections
• Update rules:
  1. When a private neighbor is selected: $b_i \leftarrow b_i + 1$
  2. When a public neighbor is selected: $a_i \leftarrow a_i + 1$, $b_i \leftarrow b_i + 1$
Theoretical result: the approximated value $\frac{a_i}{b_i} \times d_i$ converges to the true value $d_i^*$.
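A minimal sketch of the two counters is shown below. It assumes that the privacy of a selected neighbor is learned only when the walk tries to move to it; the dictionary lookup `is_public[candidate]` stands in for that query. The toy graph and all names are hypothetical, and the sketch does not reproduce any details of the designed walk beyond the update rules on the slide.

```python
import random
from collections import defaultdict

def walk_with_counters(adj, is_public, seed_node, steps, rng):
    """Public-neighbor random walk that records selection counters per node."""
    successes = defaultdict(int)   # successful public neighbor selections
    selections = defaultdict(int)  # all neighbor selections
    node = seed_node
    for _ in range(steps):
        while True:
            candidate = rng.choice(adj[node])
            selections[node] += 1
            if is_public[candidate]:  # assumption: learned by querying candidate
                successes[node] += 1
                break
        node = candidate
    return successes, selections

def approx_public_degree(adj, successes, selections, node):
    """Success ratio times degree approximates the public-degree."""
    return successes[node] / selections[node] * len(adj[node])

# Hypothetical graph: public nodes 0..4, private node 5 adjacent to 0 and 3.
adj = {0: [1, 2, 5], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3], 5: [0, 3]}
is_public = {0: True, 1: True, 2: True, 3: True, 4: True, 5: False}

rng = random.Random(2)
successes, selections = walk_with_counters(adj, is_public, 0, 20000, rng)
estimate_node0 = approx_public_degree(adj, successes, selections, 0)
estimate_node1 = approx_public_degree(adj, successes, selections, 1)
print(estimate_node0, estimate_node1)  # true public-degrees are 2 and 2
```

The approximation improves with the number of selections made at a node, so rarely visited nodes have noisier values than frequently visited ones.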

Properties Estimation (1/2)
Problem of existing estimators
• Conventional weighting only corrects sampling bias by using the public-degree.
  → Estimates converge to properties of the largest public-cluster (LPC).
• Errors of convergence values are caused by private nodes.
• Case of size estimation: derive the expectation of the convergence value $n^*$ over the set of privacy labels, $E_{\mathrm{pri}}[n^*]$.
[Figure: two example privacy label assignments on a ten-node graph, giving $n^* = 5$ and $n^* = 8$]
Theoretical result: under the condition that all public nodes belong to the LPC, $E_{\mathrm{pri}}[n^*] = (1 - p)\,n$.
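The stated expectation follows from linearity of expectation. The derivation below is a sketch under the slide's own condition (every public node belongs to the LPC, so the convergence value equals the number of public nodes), writing $n$ for the number of nodes, $p$ for the probability that a node is private, and $n^*$ for the value the existing size estimator converges to.

```latex
\begin{align*}
E_{\mathrm{pri}}[n^*]
  &= E_{\mathrm{pri}}\Bigl[\sum_{i=1}^{n} \mathbf{1}\{v_i \text{ is public}\}\Bigr]
   = \sum_{i=1}^{n} \Pr[v_i \text{ is public}]
   = (1 - p)\,n, \\
\frac{n - E_{\mathrm{pri}}[n^*]}{n} &= p .
\end{align*}
```

Under this condition the existing estimator therefore underestimates the size by a fraction $p$ in expectation, for example by 20% when $p = 0.2$.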

Properties Estimation (2/2)
Proposed estimators
Goal: Reduce errors of convergence values.
• The value of $p$ is unknown and difficult to estimate.
Idea: Weighting using public-degree $d_i^*$ and degree $d_i$
• $d_i^*$ follows a binomial distribution with parameters $1 - p$ and $d_i$.
• Modify the weight for each sample so that errors of convergence values are reduced as much as possible.
Theoretical result: under the condition that all public nodes belong to the LPC, $E_{\mathrm{pri}}[\hat{n}] \approx n$, where $\hat{n}$ is the convergence value of the proposed estimator.
Generality:
• Our goal and idea apply to all random walk-based estimators for social networks.
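The binomial statement can be written out explicitly. The lines below are standard properties of the binomial distribution under the assumption that each neighbor is independently private with probability $p$, with $d_i$ the degree and $d_i^*$ the public-degree of a node; they are not the proposed weights themselves, which the slide does not spell out.

```latex
\begin{align*}
\Pr[d_i^* = k] &= \binom{d_i}{k} (1 - p)^{k}\, p^{\,d_i - k}, \qquad k = 0, 1, \dots, d_i, \\
E_{\mathrm{pri}}[d_i^*] &= (1 - p)\, d_i .
\end{align*}
```

The pair of public-degree and degree observed at each sample thus carries information about the privacy probability without requiring its value to be known in advance.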

Experiments
Conduct four experiments:
1. Estimation accuracy of size and average degree for various probabilities $p$
2. Performance on real-world datasets including real private nodes
3. Effectiveness of the proposed public-degree calculation
4. Number of queries performed in seed selection

Experimental Setup
• Publicly available datasets of social graphs

Network      Size       Average degree  Privacy label setting
YouTube      1,134,890   5.27           independently with probability $p$
Pokec        1,632,803  27.32           real labels
Orkut        3,072,441  76.28           independently with probability $p$
Facebook     3,097,165  15.28           independently with probability $p$
LiveJournal  3,997,962  17.35           independently with probability $p$

• Accuracy measure: normalized root mean square error
  $\mathrm{NRMSE} = \frac{1}{x} \sqrt{\frac{1}{t} \sum_{i=1}^{t} (x_i - x)^2}$
  • Number of simulations: $t = 1000$
  • True value: $x$
  • Estimate: $x_i$
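The accuracy measure can be computed as in the following sketch, which assumes the usual definition of the normalized root mean square error: the root of the mean squared error over all simulation runs, divided by the true value. The function name and the sample numbers are illustrative.

```python
import math

def nrmse(estimates, true_value):
    """Root mean square error of the estimates, normalized by the true value."""
    if not estimates:
        raise ValueError("need at least one estimate")
    mean_squared_error = sum((x - true_value) ** 2 for x in estimates) / len(estimates)
    return math.sqrt(mean_squared_error) / true_value

# Two hypothetical runs that miss a true value of 10 by one unit each.
example = nrmse([9.0, 11.0], 10.0)
print(example)
```

Unlike the relative bias, this measure captures both the bias and the variance of an estimator, so it does not vanish merely because over- and underestimates cancel out.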

Experiment 1: Estimation accuracy for several probabilities $p$
NRMSEs of each estimator for several $p$ (1% sample)
• Size
[Figure: NRMSE versus $p$ (0.0 to 0.3) for the existing estimator (NC) and the proposed estimator, one panel per dataset; lower is better]
• Average degree
[Figure: NRMSE versus $p$ (0.0 to 0.3) for the existing estimator (Smooth) and the proposed estimator on YouTube, Orkut, Facebook, and LiveJournal; one annotation reads 88.1%]

Discussion on results in Experiment 1
1. Improvement of estimation errors results from improvement of convergence errors.
• NRMSEs of converged size
[Figure: NRMSE versus $p$ (0.0 to 0.3) for the existing estimator (NC) and the proposed estimator, one panel per dataset; lower is better]
• NRMSEs of converged average degree
[Figure: NRMSE versus $p$ (0.0 to 0.3) for the existing estimator (Smooth) and the proposed estimator on YouTube, Orkut, LiveJournal, and Facebook]

Discussion on results in Experiment 1
2. Estimation and convergence errors are affected by the relative size of the largest public-cluster (LPC).
• Relative size of LPC
[Figure: relative size of the LPC versus $p$ (0.0 to 0.3) on Orkut and YouTube]
• NRMSEs of converged size
[Figure: NRMSE versus $p$ (0.0 to 0.3) for the existing estimator (NC) and the proposed estimator on Orkut and YouTube]
• On Orkut, almost all public nodes belong to the LPC. Experimental results support the theoretical claims.
• On YouTube, relatively many nodes do not belong to the LPC. NRMSEs are relatively larger.
