Learning to de-anonymize social networks: A machine learning approach (PowerPoint PPT Presentation)


Learning to de-anonymize social networks: A machine learning approach to social graph de-anonymization. Kumar Sharad. October 28, 2016. Royal Holloway, University of London. ACM Workshop on Artificial Intelligence and Security, Vienna, Austria.


SLIDE 1

Learning to de-anonymize social networks

A machine learning approach to social graph de-anonymization Kumar Sharad

October 28, 2016 Royal Holloway, University of London

ACM Workshop on Artificial Intelligence and Security, Vienna, Austria (AISec 2016)

SLIDE 2

This talk

  • 1. True Friends Let You Down: Benchmarking Social Graph Anonymization Schemes
  • 2. Change of Guard: The Next Generation of Social Graph De-anonymization Attacks

2


SLIDE 5

Overview

  • 1. Introduction
  • 2. Preliminaries
  • 3. Benchmarking social graph anonymization schemes
  • 4. The next generation of social graph de-anonymization attacks
  • 5. Conclusions

3

SLIDE 6

Introduction

SLIDE 7

The art of data anonymization

  • Goal: process data without jeopardizing privacy.
  • Popular: randomize identifiers and/or perturb data.
  • Pros: cheap, preserves utility, provides legal immunity.
  • Cons: lack of privacy guarantees.

4

SLIDE 8

Privacy challenges in anonymized social graphs

  • Social graphs are notoriously hard to anonymize.
  • How can we compare various anonymization schemes?
  • Can we measure privacy leakage purely from graph topology?
  • Could this lead to end-to-end graph de-anonymization?
  • Intuition: train a machine learning model to learn the de-anonymization function.

5

SLIDE 9

Preliminaries

SLIDE 10

Node features

  • Graph nodes represent individuals and edges represent the relationships among them.
  • The feature vector is based purely on topology (no edge weights or directionality).
  • Too generic: high false positives.
  • Too specific: low true positives.
  • Quantize the neighborhood degree distribution.

6

SLIDE 11

The 2-hop neighborhood of a node

Figure: the ego node surrounded by its 1-hop and 2-hop neighbors.

7

SLIDE 12

Node feature vector

Feature vector of a node with neighbors of degrees – [1, 1, 3, 3, 5, 6, 7, 13, 16, 20, 21, 30, 65, 69, 72, 1030, 1100].

Figure: the neighbor degrees are quantized into 70 bins of increasing size, giving counts c0 = 8, c1 = 4, c2 = 0, …, c4 = 3, …, c69 = 2.

8
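The quantization step can be sketched in Python. The geometric bin growth and the `growth` value are assumptions for illustration (the slides fix only the bin count per neighborhood); the principle shown is the one described: fine-grained bins for small degrees, coarse bins for large ones.

```python
from bisect import bisect_left

def degree_bins(num_bins=70, growth=1.1):
    """Right edges of bins whose width grows geometrically, so small
    degrees get fine-grained bins while large degrees are grouped
    coarsely (assumed layout; only the bin count comes from the slides)."""
    edges, width, edge = [], 1.0, 0.0
    for _ in range(num_bins - 1):
        edge += width
        edges.append(edge)
        width *= growth
    return edges

def node_feature(adj, node, num_bins=70):
    """Feature vector of `node`: a histogram of its neighbors' degrees.
    `adj` maps each node to the set of its neighbors."""
    edges = degree_bins(num_bins)
    counts = [0] * num_bins
    for nbr in adj[node]:
        # degrees beyond the last edge fall into the final bin
        counts[bisect_left(edges, len(adj[nbr]))] += 1
    return counts
```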


SLIDE 18

Node feature vector

Figure: the full feature vector concatenates the 1-hop and 2-hop neighborhood histograms into 140 bins, from c0 = 8, c1 = 4, c2 = 0 (1-hop nodes) to c137 = 3, c138 = 1, c139 = 46 (2-hop nodes).

8

SLIDE 19

The learning task

  • Distinguish whether a pair of graph-node feature vectors represents the same individual.
  • Given a node pair, classify it as identical or non-identical.
  • We use a random forest, a collection of decision trees, to classify node pairs.
  • Prediction: aggregate the decisions of all trees.

9
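One way to turn two node feature vectors into a single classifier instance (an assumed encoding for illustration, not necessarily the exact one used in the talk) is their element-wise absolute difference, which is near zero for identical nodes:

```python
def pair_instance(feat_u, feat_v):
    """Encode a candidate node pair as one training instance: the
    element-wise absolute difference of the two feature vectors
    (hypothetical encoding chosen for illustration)."""
    return [abs(a - b) for a, b in zip(feat_u, feat_v)]

# Such instances, labeled identical / non-identical, can then be fed to an
# off-the-shelf random forest (e.g. sklearn.ensemble.RandomForestClassifier),
# where each decision tree votes and the aggregated vote is the prediction.
```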

SLIDE 20

Benchmarking social graph anonymization schemes

SLIDE 21

Motivation

  • A large number of social graph anonymization schemes have been proposed with varied goals.
  • None of them provide any privacy guarantees.
  • Preserving privacy vs. preserving utility.
  • Ad-hoc development of schemes has created a skewed ecosystem.
  • Research gap: how to compare social graph anonymization schemes?

10

SLIDE 22

Approach

  • Compare social graph anonymization schemes based on anonymity provided vs. utility preserved.
  • Use a machine learning framework to benchmark perturbation-based social graph anonymization schemes.
  • Automates evaluation and levels the playing field.

11

SLIDE 23

The adversarial model [1] (1/2)

  • A sanitized social network is released.
  • The adversary obtains an auxiliary social network with some overlap.
  • The adversary uses graph topology to predict the true correspondences.
  • Sample nodes at random from the original graph G to generate two graphs G1 and G2 with an overlap.

  • [1] A. Narayanan and V. Shmatikov. De-anonymizing social networks. IEEE S&P 2009.

12

SLIDE 24

The adversarial model (2/2)

  • Perturb G1 and G2 to produce Gaux and Gsan.
  • Overlap between G1 and G2 is measured using the Jaccard Coefficient.

Jaccard Coefficient: for sets X and Y, at least one of which is non-empty, JC(X, Y) = |X ∩ Y| / |X ∪ Y|.

13
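The definition translates directly to code:

```python
def jaccard(x, y):
    """Jaccard Coefficient of two sets, defined when at least one is
    non-empty: |X intersect Y| / |X union Y|."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)
```

For example, jaccard({1, 2, 3}, {2, 3, 4}) gives 0.5: two shared elements out of four distinct ones.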

SLIDE 25

Figure: generating perturbed graphs. The original graph G is sampled into G1 and G2, which are then perturbed to produce Gaux and Gsan.

14


SLIDE 28

Schemes analyzed

  • 1. Random Sparsification (RSP)
  • 2. Random Edge Perturbation (REP)
  • 3. k-Degree Anonymous (KDA)
  • 4. 1-hop k-Anonymous (1HKA)
  • 5. Random Add/Delete (RAD)
  • 6. Random Switch (RSW)

15


SLIDE 30

Measuring utility

  • 1. Degree distribution (DD)
  • 2. Joint degree distribution (JDD)
  • 3. Average degree connectivity
  • 4. Degree centrality
  • 5. Eigenvector centrality

16


SLIDE 32

Measuring anonymity

  • Measured by the de-anonymization success achieved, as depicted by ROC curves with varying perturbation.
  • A higher AUC implies weaker anonymity.
  • An increase in perturbation should produce a commensurate decrease in de-anonymization success while minimizing damage to utility.

17
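The AUC used here can be read as the probability that a randomly chosen identical pair receives a higher classification score than a randomly chosen non-identical pair. A minimal sketch of that rank statistic:

```python
def auc(pos_scores, neg_scores):
    """AUC as a rank statistic: the fraction of (positive, negative)
    score pairs where the positive wins; ties count as half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))
```

An AUC of 1.0 means perfect separation of identical from non-identical pairs; 0.5 is no better than chance, i.e. the strongest anonymity under this measure.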

SLIDE 33

Training without ground truth

  • What is the best way to train a model given that the adversary only has access to Gaux and Gsan?
  • Ideal: generate Gaux and Gsan from G.
  • Practical: split Gaux and Gsan individually and merge the sampled data.
  • The datasets do not need to be further anonymized.
  • Identical and non-identical node pairs are extremely different.

18

SLIDE 34

Figure: training without ground truth by splitting the original graphs Gaux and Gsan.

19


SLIDE 37

Evaluation and results

Publicly available datasets used

  • Flickr (80 513 nodes, 5 899 882 edges).
  • Facebook New Orleans dataset (63 731 nodes, 817 090 edges).

20

SLIDE 38

Degree distribution

Figure: log-log degree distributions (frequency vs. degree) for Facebook under Random Sparsification (original vs. αE = 0.75, 0.50, 0.25), Random Edge Perturbation (original vs. µ = 10−4, 10−3, 10−2) and k-Degree Anonymity (original vs. k = 10, 50, 100).

21

SLIDE 39

Joint degree distribution: RSP

Figure: joint degree distribution heat maps for Facebook with no anonymization and under RSP with αE = 0.75, 0.50 and 0.25.

22

SLIDE 40

Joint degree distribution: REP

Figure: joint degree distribution heat maps for Facebook with no anonymization and under REP with µ = 10−4, 10−3 and 10−2.

23

SLIDE 41

Joint degree distribution: KDA

Figure: joint degree distribution heat maps for Facebook with no anonymization and under KDA with k = 10, 50 and 100.

24

SLIDE 42

ROC curves

Figure: ROC curves (true positive vs. false positive rate) for Facebook, with AUC values. Random Sparsification: GS 0.936, αE = 0.75: 0.926, αE = 0.50: 0.903, αE = 0.25: 0.850. Random Edge Perturbation: GS 0.936, µ = 10−4: 0.900, µ = 10−3: 0.761, µ = 10−2: 0.585. k-Degree Anonymity: GS 0.936, k = 10: 0.920, k = 50: 0.907, k = 100: 0.898.

25

SLIDE 43

Hellinger Distance between distributions vs. AUC

                   Flickr                  Facebook
                   DD     JDD    AUC       DD     JDD    AUC
RSP (αE = 0.75)    0.109  0.570  0.959     0.062  0.295  0.926
RSP (αE = 0.50)    0.130  0.567  0.950     0.100  0.340  0.903
RSP (αE = 0.25)    0.204  0.610  0.931     0.194  0.477  0.850
REP (µ = 10−4)     0.232  0.568  0.955     0.280  0.286  0.900
REP (µ = 10−3)     0.599  0.612  0.924     0.759  0.630  0.761
REP (µ = 10−2)     0.912  0.899  0.792     0.999  1.000  0.585
KDA (k = 10)       0.136  0.576  0.954     0.056  0.279  0.920
KDA (k = 50)       0.260  0.793  0.950     0.125  0.485  0.907
KDA (k = 100)      0.327  0.863  0.949     0.175  0.603  0.898

26
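The DD and JDD columns report the Hellinger Distance between the original and perturbed distributions; for discrete distributions p and q it is the standard H(p, q) = (1/√2) · sqrt(Σ (√pᵢ − √qᵢ)²), which runs from 0 (identical) to 1 (disjoint):

```python
from math import sqrt

def hellinger(p, q):
    """Hellinger Distance between two discrete distributions given as
    equal-length sequences of probabilities; 0 = identical, 1 = disjoint."""
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))) / sqrt(2)
```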


SLIDE 46

Summary

  • None of the schemes analyzed is fit for purpose.
  • Hard to provide anonymity while preserving utility.
  • It is possible to destroy utility without providing any anonymity.
  • Local graph perturbation fares worse than global graph perturbation.

27

SLIDE 47

Anonymizing high-dimensional datasets while maintaining their utility is hard.

27

SLIDE 48

The next generation of social graph de-anonymization attacks

SLIDE 49

The problem

  • End-to-end identity linkage across graphs is a very potent attack.
  • Attacks so far use seeds, heuristics and hand-picked parameters to reconcile identities across graphs.
  • Hard to adapt such attacks to changing adversarial models.
  • How can the success of machine learning classification be translated into discovering end-to-end mappings?

28

SLIDE 50

Unraveling anonymization: 3-phase seedless attack (1/2)

  • Same adversarial model as before, with RSP used for perturbation.
  • Attacks nodes in phases, starting with the high-degree nodes.
  • Generate initial mappings using the classification score.
  • Clean the mappings to remove contradictory ones.
  • Filter mappings using the similarity of mapped neighbors.

29

SLIDE 51

Figure: filtering node pairs across Gaux and Gsan using neighborhood similarity.

30
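The filtering step can be sketched as follows. This is a simplified illustration of the idea (the threshold and the exact acceptance rule are assumptions, not the talk's precise procedure): a candidate mapping (u in Gaux, v in Gsan) is kept only when enough of u's already-mapped neighbors land on neighbors of v.

```python
def filter_mappings(mappings, aux_adj, san_adj, threshold=0.5):
    """Keep mapping (u, v) only if a sufficient fraction of u's mapped
    neighbors in Gaux correspond to neighbors of v in Gsan.
    `aux_adj` / `san_adj` map each node to the set of its neighbors."""
    current = dict(mappings)
    kept = []
    for u, v in mappings:
        mapped_nbrs = {current[n] for n in aux_adj[u] if n in current}
        if not mapped_nbrs:
            continue  # no evidence either way; drop (a simplification)
        if len(mapped_nbrs & san_adj[v]) / len(mapped_nbrs) >= threshold:
            kept.append((u, v))
    return kept
```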


SLIDE 53

Unraveling anonymization: 3-phase seedless attack (2/2)

  • Propagate the mappings (a snowball effect).
  • After the mappings of a phase mature, they are frozen.
  • The process is repeated for subsequent phases, starting with the frozen mappings.

31

SLIDE 54

Feature vector

  • Same feature vector based on 2-hop neighborhood as before.
  • Group membership considered when available.
  • Six node categories for directed graphs:
  • successor
  • predecessor
  • successor-of-successor
  • successor-of-predecessor
  • predecessor-of-successor
  • predecessor-of-predecessor.

Twelve categories in total, counting in-degree and out-degree separately.

32
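For directed graphs, the six categories can be enumerated from successor/predecessor adjacency. A sketch, assuming `succ` and `pred` map each node to its out-neighbors and in-neighbors respectively:

```python
def two_hop_categories(succ, pred, node):
    """The six 2-hop node categories of a directed graph. Computing a
    degree histogram per category for both in- and out-degree yields
    the twelve categories mentioned above."""
    return {
        "successor": set(succ[node]),
        "predecessor": set(pred[node]),
        "successor-of-successor": {w for v in succ[node] for w in succ[v]},
        "successor-of-predecessor": {w for v in pred[node] for w in succ[v]},
        "predecessor-of-successor": {w for v in succ[node] for w in pred[v]},
        "predecessor-of-predecessor": {w for v in pred[node] for w in pred[v]},
    }
```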

SLIDE 55

Evaluation and results

Publicly available datasets used

  • Flickr (80 513 nodes, 5 899 882 edges) – undirected graph with group membership.
  • Epinions (75 879 nodes, 508 837 edges) – directed graph.
  • Enron (36 692 nodes, 183 831 edges) – undirected graph.

33

SLIDE 56

Accuracy

Figure: de-anonymization accuracy (percentage) per phase (Phase 1, Phase 2, Phase 3, Final) for Flickr (αE = αV = 0.33), Epinions (αE = 0.33, αV = 0.20), Epinions (αE = 0.50, αV = 0.35) and Enron (αE = 0.43, αV = 1).

34

SLIDE 57

Coverage

Figure: de-anonymization coverage (percentage) per phase (Phase 1, Phase 2, Phase 3, Final) for Flickr (αE = αV = 0.33), Epinions (αE = 0.33, αV = 0.20), Epinions (αE = 0.50, αV = 0.35) and Enron (αE = 0.43, αV = 1).

35

SLIDE 58

Evolution of node mappings (1/2)

Figure: accuracy and coverage percentages across iterations for Phases 1, 2 and 3 on Epinions (αE = 0.33, αV = 0.20).

36

SLIDE 59

Evolution of node mappings (2/2)

Figure: accuracy and coverage percentages across iterations for Phases 1, 2 and 3 on Epinions (αE = 0.50, αV = 0.35).

37

SLIDE 60

Error analysis: joint degree distribution of node pairs

Figure: joint degree distribution heat maps of node pairs for Epinions (αE = 0.50, αV = 0.35).

38

SLIDE 61

Error analysis: cumulative distribution of node pairs similarity

Figure: cumulative percentage of node pairs vs. Jaccard Coefficient for true, false and unidentified mappings, on Epinions (αE = 0.33, αV = 0.20) and Epinions (αE = 0.50, αV = 0.35).

39

SLIDE 62

Error analysis: shortest path length between node pairs

Figure: percentage of node pairs by shortest path length (1 to 4) for Epinions (αE = 0.50, αV = 0.35), per phase: Phase 1 (997 pairs), Phase 2 (2791), Phase 3 (1734), Final (5522).

40

SLIDE 63

Comparison with other prominent attacks [2]

Attack performance based on Enron and Facebook with αE = 0.43 and αV = 1. A higher percentage is better; ∅ denotes a seedless attack.

          Enron                  Facebook
          Coverage  Accuracy     Coverage  Accuracy
KL         15.96     15.96         5.99      5.99
JLS+       13.05     13.05        15.68     15.68
SH         12.77     12.77        15.63     15.63
JLSB∅      11.91     11.91        14.73     14.73
YG          3.10      3.10        28.32     28.32
PFG∅        7.39      7.39        10.87     10.87
NS          0.37      0.37         0.18      0.18
3PSL∅      12.61     41.91        >40       >65

  • [2] S. Ji et al. SecGraph: A uniform and open-source evaluation system for graph data anonymization and de-anonymization. USENIX Security 2015.

41


SLIDE 65

Summary

  • Replacing heuristics-based models with statistical models provides a significant improvement.
  • Optimizing parameters by training is better in adverse scenarios, as human error is costlier when information is limited.
  • The classification score can be used with improved filtering methods to construct better attacks.

42

SLIDE 66

Machine learning models beat heuristics-based attacks by a significant margin at a lower cost.

42

SLIDE 67

Conclusions

SLIDE 68

  • Starting from the simple task of classifying node pairs, we can build diverse and expressive graph de-anonymization models.
  • Anonymizing graphs while preserving privacy remains a challenge.
  • In the presence of big data, attacks based on heuristics will gradually be replaced by learning models because of their adaptability, automation and superior performance.
  • There is a place for anonymization schemes in social graph research, but not in the realm of preserving privacy.

k.sharad@gmail.com http://ksharad.com

43
