Community-Preserving Generalization of Social Networks


SLIDE 1

Introduction Preliminary concepts Graph Generalization Algorithm Experimental Set Up Results Conclusions

Community-Preserving Generalization of Social Networks

Jordi Casas-Roma¹ and François Rousseau²

¹Universitat Oberta de Catalunya, Barcelona, Spain

jcasasr@uoc.edu

²École Polytechnique, Palaiseau, France

rousseau@lix.polytechnique.fr

SoMeRis ’15, Paris, August 25, 2015

SLIDE 2

Overview

1. Introduction
2. Preliminary concepts
3. Graph Generalization Algorithm
4. Experimental Set Up
5. Results
6. Conclusions

SLIDE 3

Introduction

Scenario:
• Release data to third parties
• Preserve the privacy of users

SLIDE 4

Simple Anonymization

Simple anonymization does not work! User Dan can be re-identified using his structural properties.

Figure 1: Original network (vertices Amy, Tim, Bob, Lis, Ann, Dan, Tom, Eva, Joe)

Figure 2: Simple anonymization (names replaced by labels 1–9)

Figure 3: Dan's 1-neighbourhood

Figure 4: Dan is re-identified

SLIDE 5

Anonymization methods

Goals:
• Introduce noise to hinder re-identification processes: adding/removing edges, adding fake nodes, grouping nodes into clusters, …
• Preserve users' privacy vs. maximize data utility (minimize information loss).

Figure 5: Dan's 1-neighbourhood

Figure 6: Noise added

SLIDE 6

Anonymization methods

Basic approaches for anonymity on networks:

• Network modification approaches consist of modifying (adding and/or deleting) edges or vertices in a network:
  – Randomization
  – k-anonymity model
• Clustering-based approaches (also known as generalization) consist of clustering vertices and edges into groups and anonymizing each sub-network into a super-vertex, in order to publish aggregate information about structural properties.
• Differentially private approaches guarantee that individuals are protected under the definition of differential privacy, which imposes a guarantee on the data release mechanism rather than on the data itself. The goal is to provide statistical information about the data while preserving the privacy of users.

SLIDE 7

Graph degeneracy and k-shell

k-Core: Let k be an integer. A subgraph Hk = (V′, E′), induced by the subset of vertices V′ ⊆ V (and a fortiori by the subset of edges E′ ⊆ E), is called a k-core if and only if ∀ vi ∈ V′, degHk(vi) ≥ k, and Hk is the maximal subgraph with this property.

k-Shell: The k-shell is the subgraph induced by the set of vertices that belong to the k-core but not to the (k+1)-core, denoted Sk, such that Sk = {vi ∈ G : vi ∈ Hk ∧ vi ∉ Hk+1}.

Core number: The core number (or shell index) of a vertex vi is the highest order of a core that contains this vertex, denoted core(vi).
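As a plain-Python illustration of these definitions (the peeling procedure below is a standard way to compute core numbers, not code from the paper; the toy graph is illustrative):

```python
def core_numbers(adj):
    """adj: dict vertex -> set of neighbours. Returns dict vertex -> core(vi)."""
    deg = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel every vertex whose remaining degree is <= k; peeled
        # vertices get shell index k (they are in the k-core but not
        # the (k+1)-core).
        stack = [v for v in remaining if deg[v] <= k]
        while stack:
            v = stack.pop()
            if v not in remaining:
                continue
            remaining.discard(v)
            core[v] = k
            for u in adj[v]:
                if u in remaining:
                    deg[u] -= 1
                    if deg[u] <= k:
                        stack.append(u)
        k += 1
    return core

# Toy graph: a dense 4-vertex cluster (2-core) with a 2-vertex tail (1-shell).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

core = core_numbers(adj)   # core(vi) for every vertex
shells = {}                # the disjoint k-shells Sk
for v, c in core.items():
    shells.setdefault(c, set()).add(v)
```

Here the tail vertices 4 and 5 land in the 1-shell while the dense cluster {0, 1, 2, 3} forms the 2-shell, matching the peeling intuition of the definitions.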

SLIDE 8

Graph degeneracy and k-shell

Figure: graph G and its decomposition into disjoint k-shells (0-shell through 3-shell; vertices A–F).

SLIDE 9

Vertex similarity measures

Manhattan similarity measures how many neighbours the two vertices have in common, but also how many common non-neighbours they share:

SimManhattan(vi, vj) = 1 − (1/n) Σₖ₌₁ⁿ |(vi, vk) − (vj, vk)|    (1)

where (vi, vk) = 1 if (vi, vk) ∈ E and (vi, vk) = 0 otherwise.
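A minimal sketch of Eq. (1), treating each vertex as a row of a 0/1 adjacency matrix (the matrix A and the toy graph are illustrative, not from the slides):

```python
# Toy 4-vertex graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

def manhattan_similarity(A, i, j):
    # Eq. (1): 1 minus the normalized Hamming distance between the
    # adjacency rows of vi and vj.
    n = len(A)
    return 1 - sum(abs(A[i][k] - A[j][k]) for k in range(n)) / n
```

For instance, manhattan_similarity(A, 0, 1) is 0.5 here: the rows of vertices 0 and 1 agree on two of the four columns and disagree only on the mutual-adjacency entries.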

SLIDE 10

Vertex similarity measures

2-path similarity measures the number of paths of length 2 between two vertices:

Sim2-path(vi, vj) = (1/n) Σₖ₌₁ⁿ (vi, vk)(vj, vk)    (2)

where (vi, vk) = 1 if (vi, vk) ∈ E and (vi, vk) = 0 otherwise.
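A matching sketch of Eq. (2) on the same kind of 0/1 adjacency matrix (again an illustrative toy graph, not the paper's data):

```python
# Toy 4-vertex graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

def two_path_similarity(A, i, j):
    # Eq. (2): normalized count of common neighbours of vi and vj,
    # i.e. of paths of length 2 between them.
    n = len(A)
    return sum(A[i][k] * A[j][k] for k in range(n)) / n
```

Here two_path_similarity(A, 0, 1) is 0.25: vertices 0 and 1 have exactly one common neighbour (vertex 2) out of n = 4.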

SLIDE 11

Step 1 – Information gathering

Collect the information needed in the next step to define the partition groups:

1. Compute the k-shell decomposition of the original graph, since it preserves the graph decomposition and also the clustering structure;
2. Compute vertex similarity measures in order to define groups of vertices that share some structural properties:
• Manhattan similarity
• 2-path similarity
• Multilevel clustering algorithm
• Fastgreedy clustering algorithm

SLIDE 12

Step 2 – Super-vertex definition

Define super-vertices according to the previously collected information for each vertex:

1. For each k-shell in the graph, merge vertices belonging to the same group partition into the same super-vertex.
2. Additionally, a max fusion parameter is defined to avoid merging too many vertices into one super-vertex (an oversized super-vertex is split into two independent super-vertices).

As a result of this step, a set of super-vertices is defined and each vertex is assigned to one, and only one, super-vertex.
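The two rules above could be sketched as follows, assuming each vertex already carries a shell index and a partition-group label from Step 1 (the function name, the chunk-based splitting rule, and the max_fusion parameter handling are illustrative; the authors' exact splitting strategy may differ):

```python
def define_super_vertices(shell, group, max_fusion):
    """shell, group: dicts vertex -> label. Returns dict vertex -> super-vertex id."""
    # Rule 1: vertices sharing (k-shell, partition group) go to the same bucket.
    buckets = {}
    for v in shell:
        buckets.setdefault((shell[v], group[v]), []).append(v)
    assignment, sv_id = {}, 0
    for key in sorted(buckets):
        members = sorted(buckets[key])
        # Rule 2: buckets larger than max_fusion are split into chunks,
        # so no super-vertex absorbs too many vertices.
        for start in range(0, len(members), max_fusion):
            for v in members[start:start + max_fusion]:
                assignment[v] = sv_id
            sv_id += 1
    return assignment

shell = {v: 1 for v in range(6)}                        # all in the 1-shell
group = {0: "a", 1: "a", 2: "a", 3: "a", 4: "b", 5: "b"}
sv = define_super_vertices(shell, group, max_fusion=3)
```

With max_fusion = 3, group "a" (four vertices) is split into two super-vertices, and every vertex ends up in exactly one super-vertex, as the slide requires.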

SLIDE 13

Step 3 – Generalized graph creation

Create the new generalized graph according to the super-vertices defined in the previous step:

1. Define an empty, undirected, edge-labeled and vertex-labeled graph G̃ = (Ṽ, Ẽ).
2. The process iterates by adding each previously defined super-vertex svi ∈ Ṽ.
3. A super-edge between two super-vertices is created if there exists an edge between two vertices contained in each of the super-vertices: (svi, svj) ∈ Ẽ ↔ ∃ (vk, vp) ∈ E : vk ∈ svi ∧ vp ∈ svj.

Each super-vertex contains the number of vertices merged into it (IntraVertices) and the number of edges between the vertices contained in it (IntraEdges). Each super-edge carries a label indicating the number of edges between all vertices from its endpoints (InterEdges).
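A compact sketch of Step 3's bookkeeping, assuming vertices have already been assigned to super-vertices in Step 2 (the input names `edges` and `sv` are illustrative):

```python
from collections import Counter

def generalize(edges, sv):
    """edges: list of (u, v) pairs; sv: dict vertex -> super-vertex id."""
    intra_vertices = Counter(sv.values())   # IntraVertices per super-vertex
    intra_edges, inter_edges = Counter(), Counter()
    for a, b in edges:
        if sv[a] == sv[b]:
            intra_edges[sv[a]] += 1              # edge inside one super-vertex
        else:
            key = tuple(sorted((sv[a], sv[b])))  # undirected super-edge
            inter_edges[key] += 1                # InterEdges label
    return intra_vertices, intra_edges, inter_edges

# Toy graph: a triangle merged into super-vertex 0, an edge into super-vertex 1,
# and two edges crossing between them.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (2, 4)]
sv = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
iv, ie, se = generalize(edges, sv)
```

The result carries exactly the aggregate information the slide names: IntraVertices and IntraEdges per super-vertex, and an InterEdges count per super-edge.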

SLIDE 14

Example

Toy example of the generalization process.

(a) Original graph (3-shell and 2-shell highlighted)

(b) Generalized graph (super-vertices labeled IntraVertices-IntraEdges, e.g. 4-4, 3-3, 2-1, 1-0; super-edges labeled with their InterEdges counts)

SLIDE 15

Networks

Synthetic networks:
• ER-1000: the Erdős–Rényi model [3] is a classical random graph model. It defines a random graph as n vertices connected by m edges chosen randomly from the n(n − 1)/2 possible edges. In our experiments, n = 1,000 and m = 5,000.
• BA-1000: the Barabási–Albert model [2], also called the scale-free model, produces networks whose degree distribution follows a power law (for degree d, the probability density function is P(d) = d^(−γ)); n = 1,000 and γ = 1 in our experiments.

Real networks:
• Polblogs: political blogosphere data [1] compiling the links among US political blogs.
• URV email: the email communication network at the University Rovira i Virgili in Tarragona, Spain [4].
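For reproducibility, the two synthetic networks could be approximated with networkx (an assumption: the slides do not name the tooling; note that networkx's barabasi_albert_graph is parameterized by the number of attachment edges per new node rather than by the exponent γ, so it only approximates BA-1000):

```python
import networkx as nx

# ER-1000: n = 1,000 vertices, m = 5,000 uniformly random edges.
er = nx.gnm_random_graph(n=1000, m=5000, seed=42)

# Scale-free network with roughly 5,000 edges (5 attachment edges per node).
ba = nx.barabasi_albert_graph(n=1000, m=5, seed=42)
```

The ER graph matches the slide's n and m exactly; the BA graph has (n − m) · m = 4,975 edges, close to but not exactly 5,000.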

SLIDE 16

Generic information loss measures

Network metrics:
• Average distance (dist)
• Diameter (d)
• Harmonic mean of the shortest distance (h)
• Transitivity (T)

We compute the error on these network metrics as follows:

εm(G, G̃) = |m(G) − m(G̃)|    (3)
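Eq. (3) amounts to evaluating each metric on both graphs and taking the absolute difference; a hedged networkx example on two stand-in graphs (the path graphs are placeholders, not the paper's data):

```python
import networkx as nx

G  = nx.path_graph(5)   # stand-in for the original graph G
Gt = nx.path_graph(3)   # stand-in for the generalized graph G-tilde

def err(metric, G, Gt):
    # Eq. (3): absolute difference of a metric between the two graphs.
    return abs(metric(G) - metric(Gt))

e_dist = err(nx.average_shortest_path_length, G, Gt)  # average distance
e_diam = err(nx.diameter, G, Gt)                      # diameter
e_T    = err(nx.transitivity, G, Gt)                  # transitivity
```

The harmonic mean of shortest distances (h) has no one-line networkx call and would be computed from the pairwise shortest-path lengths in the same fashion.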

SLIDE 17

Clustering-specific information loss measures

Diagram: the original graph G is clustered with method c into the original clusters c(G); the anonymization process p produces G̃, which is clustered with the same method c into the perturbed clusters c(G̃); the precision index compares the two.

precision(G, G̃) = (1/n) Σᵢ₌₁ⁿ 𝟙[ltc(vi) = lpc(vi)]    (4)

where 𝟙[x = y] equals 1 if x = y and 0 otherwise.
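Assuming the cluster labels of c(G) and c(G̃) have already been aligned (a non-trivial matching step the slide glosses over), Eq. (4) reduces to a label-match rate; a minimal sketch with illustrative labels:

```python
def precision(true_labels, perturbed_labels):
    # Eq. (4): fraction of vertices whose perturbed cluster label
    # matches their original cluster label.
    n = len(true_labels)
    return sum(1 for v in true_labels
               if true_labels[v] == perturbed_labels[v]) / n

lt = {0: "a", 1: "a", 2: "b", 3: "b"}   # labels from c(G)
lp = {0: "a", 1: "b", 2: "b", 3: "b"}   # labels from c(G-tilde)
```

Here three of four vertices keep their original cluster label, so the precision index is 0.75.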

SLIDE 18

Clustering-specific information loss measures

Clustering algorithms:
• Multilevel (ML)
• Infomap (IM)
• Fast greedy modularity optimization (Fastgreedy or FG)
• Algorithm of Girvan and Newman (Girvan-Newman or GN)
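Two of these four algorithms have direct networkx counterparts (Fastgreedy corresponds to greedy modularity optimization and Girvan–Newman is built in; Multilevel/Louvain requires networkx ≥ 3.0 and Infomap a separate package), a sketch on an easy toy graph:

```python
import networkx as nx
from networkx.algorithms import community

# Two 5-cliques joined by a single bridge edge: an easy clustering target.
G = nx.barbell_graph(5, 0)

# Fastgreedy ~ greedy modularity optimization.
fg = community.greedy_modularity_communities(G)

# Girvan-Newman: iteratively remove the highest-betweenness edge;
# the first yielded split already separates the two cliques here.
gn = next(community.girvan_newman(G))
```

Both methods recover the two 5-cliques as communities on this graph, which is the behaviour the precision index of Eq. (4) measures on the real networks.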

SLIDE 19

Generic information loss measures

Network     Method      n      m       dist    d   h       T
ER-1000     (original)  1,000   4,969  3.263   5   3.083   0.010
            Manhattan     672   4,828  2.672   5   2.514   0.051
            2-Path        612   4,782  2.617   5   2.451   0.070
            Multilevel    135   2,772  1.741   4   1.549   0.489
            Fastgreedy    119   2,766  1.630   3   1.443   0.531
BA-1000     (original)  1,000   4,985  2.481   4   2.362   0.032
            Manhattan     483   4,383  2.144   4   2.047   0.108
            2-Path        436   2,690  1.991   3   1.957   0.070
            Multilevel    106   1,728  1.689   3   1.526   0.457
            Fastgreedy    104   1,682  1.685   2   1.522   0.451
Polblogs    (original)  1,222  16,714  2.737   8   2.519   0.225
            Manhattan     866  13,229  2.408   5   2.248   0.245
            2-Path      1,048   9,086  2.575   7   2.410   0.148
            Multilevel    171   3,071  1.944   6   1.737   0.532
            Fastgreedy    169   3,062  1.944   6   1.733   0.536
URV email   (original)  1,133   5,451  3.606   8   3.334   0.166
            Manhattan     745   5,274  2.886   6   2.683   0.149
            2-Path        944   4,444  3.334   7   3.091   0.135
            Multilevel    160   1,710  2.179   6   1.931   0.386
            Fastgreedy    157   1,763  2.187   6   1.922   0.420

SLIDE 20

Degree sequence reconstruction

Figure (degree sequence reconstruction; each panel plots the original degree sequence against the generalized one): (c) Manhattan (ER-1000), (d) Fastgreedy (ER-1000), (e) 2-Path (URV email), (f) Fastgreedy (URV email).

SLIDE 21

Clustering-specific information loss measures

Network     Method      n      m       ML      IM      FG      GN
ER-1000     (original)  1,000   4,969
            Manhattan     672   4,828  0.147   0.031   0.243   0.296
            2-Path        612   4,782  0.150   0.040   0.215   0.311
            Multilevel    135   2,772  0.394   0.036   0.229   0.189
            Fastgreedy    119   2,766  0.147   0.030   0.544   0.182
BA-1000     (original)  1,000   4,985
            Manhattan     483   4,383  0.157   1.000   0.185   0.433
            2-Path        436   2,690  0.132   1.000   0.176   0.321
            Multilevel    106   1,728  0.618   1.000   0.184   0.374
            Fastgreedy    104   1,682  0.176   1.000   0.477   0.374
Polblogs    (original)  1,222  16,714
            Manhattan     866  13,229  0.830   0.833   0.853   0.646
            2-Path      1,048   9,086  0.950   0.959   0.985   0.823
            Multilevel    171   3,071  0.993   0.517   0.967   0.767
            Fastgreedy    169   3,062  0.976   0.520   0.973   0.740
URV email   (original)  1,133   5,451
            Manhattan     745   5,274  0.420   0.463   0.533   0.352
            2-Path        944   4,444  0.586   0.682   0.555   0.517
            Multilevel    160   1,710  0.781   0.134   0.601   0.290
            Fastgreedy    157   1,763  0.468   0.147   0.862   0.217

SLIDE 22

Conclusions

• We have defined a novel approach to generalize a graph by capitalizing on the concept of graph degeneracy and on the similarity between vertices of the same k-shell.
• We have proposed four different methods to compute the similarity between vertices.
• We have conducted an empirical evaluation of these methods on several synthetic and real networks, comparing information loss based on different graph properties and also on clustering-specific processes.
• We have demonstrated that our approach preserves data privacy while simultaneously achieving better data utility through the generalization process.

SLIDE 23

References

[1] L. A. Adamic and N. Glance, "The political blogosphere and the 2004 U.S. election", in LinkKDD '05. ACM, 2005, pp. 36–43.
[2] A.-L. Barabási and R. Albert, "Emergence of Scaling in Random Networks". Science, vol. 286, no. 5439, pp. 509–512, 1999.
[3] P. Erdős and A. Rényi, "On Random Graphs I". Publicationes Mathematicae, vol. 6, pp. 290–297, 1959.
[4] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, "Self-similar community structure in a network of human interactions". Physical Review E, vol. 68:065103, pp. 1–4, 2003.
SLIDE 24

The End

Thanks for your attention!

Jordi Casas-Roma (UOC), jcasasr@uoc.edu
François Rousseau (LIX), rousseau@lix.polytechnique.fr