SLIDE 1

De-anonymizing Data

CompSci 590.03 Instructor: Ashwin Machanavajjhala


Source (http://xkcd.org/834/)

SLIDE 2

Announcements

  • Project ideas will be posted on the site by Friday.

– You are welcome to send me (or talk to me about) your own ideas.

SLIDE 3

Outline

  • Recap & Intro to Anonymization
  • Algorithmically De-anonymizing Netflix Data
  • Algorithmically De-anonymizing Social Networks

– Passive Attacks
– Active Attacks

SLIDE 4

Outline

  • Recap & Intro to Anonymization
  • Algorithmically De-anonymizing Netflix Data
  • Algorithmically De-anonymizing Social Networks

– Passive Attacks
– Active Attacks

SLIDE 5

Personal Big-Data

[Figure: Person 1 … Person N contribute records r1 … rN to databases (Google DB, Census DB, Hospital DB); the data is used by doctors, medical researchers, economists, information retrieval researchers, and recommendation algorithms.]

SLIDE 6

The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]

Medical Data:
  • Name
  • SSN
  • Visit Date
  • Diagnosis
  • Procedure
  • Medication
  • Total Charge

Voter List:
  • Name
  • Address
  • Date Registered
  • Party Affiliation
  • Date Last Voted
  • Zip
  • Birth Date
  • Sex

  • Governor of MA uniquely identified using Zip Code, Birth Date, and Sex. Name linked to Diagnosis.

SLIDE 7

The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002]

Medical Data:
  • Name
  • SSN
  • Visit Date
  • Diagnosis
  • Procedure
  • Medication
  • Total Charge

Voter List:
  • Name
  • Address
  • Date Registered
  • Party Affiliation
  • Date Last Voted
  • Zip
  • Birth Date
  • Sex

  • Governor of MA uniquely identified using Zip Code, Birth Date, and Sex.
  • {Zip Code, Birth Date, Sex} is a quasi identifier: it uniquely identifies 87% of the US population.
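To make the linkage attack concrete, here is a minimal Python sketch of joining "anonymized" medical records to a public voter list on the quasi-identifier. The records below are illustrative toy data (the DOB and zip are stand-ins echoing Sweeney's account), not values from the slides:

```python
# Toy linkage attack: join de-identified medical records to a voter
# list on the quasi-identifier (zip, birth date, sex).
medical = [  # names removed, but quasi-identifiers retained
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "..."},
]
voters = [  # public record, with names
    {"name": "William Weld", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
]

qid = lambda r: (r["zip"], r["dob"], r["sex"])
name_by_qid = {qid(v): v["name"] for v in voters}

for rec in medical:
    if qid(rec) in name_by_qid:
        # Name is now linked to diagnosis.
        print(name_by_qid[qid(rec)], "->", rec["diagnosis"])
```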

SLIDE 8

Statistical Privacy (Trusted Collector) Problem

[Figure: Individual 1 … Individual N send records r1 … rN to a trusted server holding database DB. Utility: …; Privacy: no breach about any individual.]

SLIDE 9

Statistical Privacy (Untrusted Collector) Problem

[Figure: Individual 1 … Individual N each apply a randomizing function f( ) to their record before sending it to an untrusted server's database DB.]

SLIDE 10

Randomized Response

  • Flip a coin

– heads with probability p, and
– tails with probability 1 – p (p > ½)

  • Answer the question according to the following table:


          True Answer = Yes   True Answer = No
  Heads   Yes                 No
  Tails   No                  Yes
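To make the mechanism concrete, here is a minimal Python sketch (the function names and the p = 0.75 default are illustrative, not from the slides). The collector can invert the noise, since the observed yes-rate is λ = pπ + (1 – p)(1 – π) when the true fraction of "Yes" is π:

```python
import random

def randomized_response(true_answer: bool, p: float = 0.75) -> bool:
    """Report the true answer on heads (probability p > 1/2),
    and its negation on tails."""
    heads = random.random() < p
    return true_answer if heads else not true_answer

def estimate_true_fraction(reports: list[bool], p: float = 0.75) -> float:
    """Invert the noise: observed yes-rate lam = p*pi + (1-p)*(1-pi),
    so pi = (lam - (1 - p)) / (2p - 1)."""
    lam = sum(reports) / len(reports)
    return (lam - (1 - p)) / (2 * p - 1)

# Example: 10,000 respondents, 30% of whom would truthfully answer "Yes".
truths = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomized_response(t) for t in truths]
print(estimate_true_fraction(reports))  # close to 0.3
```

No single report reveals an individual's true answer, yet the aggregate estimate is accurate.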

SLIDE 11

Statistical Privacy (Trusted Collector) Problem

[Figure: Individual 1 … Individual N send records r1 … rN to a trusted server holding database DB.]

SLIDE 12

Query Answering

[Figure: Individual 1 … Individual N contribute records r1 … rN to a hospital database DB; analysts pose queries such as "How many allergy patients?" and "Correlate genome to disease".]

SLIDE 13

Query Answering

  • Need to know the list of questions up front.
  • Each answer leaks some information about individuals. After answering a few questions, the server runs out of privacy budget and cannot answer any more.
  • We will see this in detail later in the course.

SLIDE 14

Anonymous/ Sanitized Data Publishing

[Figure: Individual 1 … Individual N contribute records r1 … rN to a hospital database DB; the analyst declares "I won't tell you what questions I am interested in!" (image: writingcenterunderground.wordpress.com)]

SLIDE 15

Anonymous/ Sanitized Data Publishing

[Figure: Individual 1 … Individual N contribute records r1 … rN to a hospital database DB, which is sanitized and published as DB′.]

Answer any number of questions directly on DB′ without any modifications.

SLIDE 16

Today’s class

  • Identifying individual records and their sensitive values from published data (with insufficient sanitization).

SLIDE 17

Outline

  • Recap & Intro to Anonymization
  • Algorithmically De-anonymizing Netflix Data
  • Algorithmically De-anonymizing Social Networks

– Passive Attacks
– Active Attacks

SLIDE 18

Terms

  • Coin tosses of an algorithm
  • Union Bound
  • Heavy Tailed Distribution
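For reference (this gloss is ours, not on the slide): the union bound, used in the proof of Theorem 1 below, states that for any events A1, …, An,

Pr[ A1 ∪ … ∪ An ] ≤ Pr[A1] + … + Pr[An]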

SLIDE 19

Terms (contd.)

  • Heavy Tailed Distribution

[Figure: the Normal distribution is not heavy tailed.]

SLIDE 20

Terms (contd.)

  • Heavy Tailed Distribution

[Figure: the Laplace distribution is heavy tailed.]

SLIDE 21

Terms (contd.)

  • Heavy Tailed Distribution

[Figure: the Zipf distribution is heavy tailed.]

SLIDE 22

Terms (contd.)

  • Cosine Similarity
  • Collaborative filtering

– Problem of recommending new items to a user based on their ratings on previously seen items.

[Figure: two rating vectors at angle θ; cosine similarity = cos θ.]
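As a concrete reference, a minimal sketch of cosine similarity over sparse rating vectors; the {movie_id: rating} dict representation is an assumption for illustration:

```python
import math

def cosine_similarity(u: dict[int, float], v: dict[int, float]) -> float:
    """cos(θ) between two sparse rating vectors stored as
    {movie_id: rating} dicts; 1.0 means identical direction."""
    dot = sum(r * v[m] for m, r in u.items() if m in v)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```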

SLIDE 23

Netflix Dataset

[Figure: the Netflix dataset as a sparse Users × Movies matrix; each non-null cell holds a Rating + TimeStamp; a row is a Record (r), a column is a Column/Attribute.]

SLIDE 24

Definitions

  • Support

– Set (or number) of non-null attributes in a record or column

  • Similarity
  • Sparsity
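The similarity and sparsity formulas on this slide were figures and did not survive extraction. As a hedged reconstruction from Narayanan & Shmatikov (SSP 2008), whose notation the slides follow: similarity between records averages per-attribute similarity over the union of their supports,

Sim(r, r′) = Σi Sim(ri, ri′) / |supp(r) ∪ supp(r′)|

and a dataset D is (ε, δ)-sparse if, for a randomly drawn record r,

Pr[ ∃ r′ ≠ r : Sim(r, r′) > ε ] ≤ δ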

SLIDE 25

Adversary Model

  • Aux(r) – the adversary's background knowledge: some subset of the attributes of the target record r

SLIDE 26

Privacy Breach

  • Definition 1: An algorithm A outputs a record r′ such that … [the formal condition was an image on the slide; roughly, r′ matches the target record r with high similarity].
  • Definition 2: The analogous condition when only a sample of the dataset is given as input.

SLIDE 27

Algorithm

ScoreBoard

  • For each record r′, compute Score(r′, aux) to be the minimum similarity of an attribute in aux to the same attribute in r′.
  • Pick the r′ with the maximum score

OR

  • Return all records with Score > α
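A minimal Python sketch of Scoreboard follows; the dict-of-dicts layout and the per-attribute similarity callback are our assumptions, since the slides leave the data representation unspecified:

```python
def scoreboard(dataset, aux, similarity, alpha=None):
    """Scoreboard matching, after the description above.

    dataset: {record_id: {attribute: value}}
    aux: {attribute: value} -- the adversary's Aux(r)
    similarity: per-attribute similarity in [0, 1]; it should
        return 0 for attributes the candidate record lacks (None).
    """
    def score(record):
        # Score(r', aux) = min over the attributes in aux.
        return min(similarity(v, record.get(a)) for a, v in aux.items())

    scores = {rid: score(rec) for rid, rec in dataset.items()}
    if alpha is not None:
        # Variant 2: all candidate records scoring above threshold alpha.
        return [rid for rid, s in scores.items() if s > alpha]
    # Variant 1: the single best-scoring record.
    return max(scores, key=scores.get)
```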

SLIDE 28

Analysis

Theorem 1: Suppose we use Scoreboard with α = 1 – ε. If Aux contains m randomly chosen attributes with m > log(N/ε) / log(1/(1 – δ)) (the bound derived in the proof below, where N is the number of records), then Scoreboard returns a record r′ such that

Pr[ Sim(Aux, r′) > 1 – ε – δ ] > 1 – ε

SLIDE 29

Proof of Theorem 1

  • Call r′ a false match if Sim(Aux, r′) < 1 – ε – δ.
  • For any false match and a randomly chosen attribute i, Pr[ Sim(Auxi, ri′) > 1 – ε ] < 1 – δ.
  • Sim(Aux, r′) = mini Sim(Auxi, ri′), so over m randomly chosen attributes, Pr[ Sim(Aux, r′) > 1 – ε ] < (1 – δ)^m.
  • By the union bound over at most N records, Pr[ some false match has similarity > 1 – ε ] < N(1 – δ)^m.
  • N(1 – δ)^m < ε when m > log(N/ε) / log(1/(1 – δ)).
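To make the bound concrete with illustrative numbers (ours, not the slides'): for a Netflix-scale N ≈ 5·10^5 records, ε = 0.05, and δ = 0.5, the requirement is m > log2(5·10^5 / 0.05) / log2(2) = log2(10^7) ≈ 23.3, so knowing m = 24 attributes already rules out all false matches with probability at least 1 – ε = 0.95.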

SLIDE 30

Other results

  • If the dataset D is (1 – ε – δ, ε)-sparse, then D can be (1, 1 – ε)-deanonymized.
  • Analogous results hold when a list of candidate records is returned.

SLIDE 31

Netflix Dataset

  • Slightly different algorithm: per the paper, attribute matches are weighted so that rarely rated movies count more (weight ~ 1/log of the movie's support), and a match is accepted only if the best score stands out sufficiently from the second best ("eccentricity").

SLIDE 32

Summary of Netflix Paper

  • An adversary can use a subset of the ratings made by a user to uniquely identify that user's record in the "anonymized" dataset with high probability.
  • The simple Scoreboard algorithm provably guarantees identification of records.
  • A variant of Scoreboard can de-anonymize the Netflix dataset.
  • The algorithms are robust to noise in the adversary's background knowledge.

SLIDE 33

Outline

  • Recap & Intro to Anonymization
  • Algorithmically De-anonymizing Netflix Data
  • Algorithmically De-anonymizing Social Networks

– Passive Attacks
– Active Attacks

SLIDE 34

Social Network Data

  • Social networks: graphs where each node represents a social entity and each edge represents a certain relationship between two entities
  • Examples: email communication graphs, social interactions as on Facebook, Yahoo! Messenger, etc.

SLIDE 35

Anonymizing Social Networks

  • Naïve anonymization

– Remove the label of each node and publish only the structure of the network

  • Information leaks

– Nodes may still be re-identified based on network structure

[Figure: example email graph on Alice, Bob, Cathy, Diane, Ed, Fred, Grace.]

SLIDE 36

Passive Attacks on an Anonymized Network

  • Consider the above email communication graph

– Each node represents an individual
– Each edge between two individuals indicates that they have exchanged emails

SLIDE 37

Passive Attacks on an Anonymized Network

  • Alice has sent emails to three individuals only

SLIDE 38

Passive Attacks on an Anonymized Network

  • Alice has sent emails to three individuals only
  • Only one node in the anonymized network has degree three
  • Hence, Alice can re-identify herself

SLIDE 39

Passive Attacks on an Anonymized Network

  • Cathy has sent emails to five individuals

SLIDE 40

Passive Attacks on an Anonymized Network

  • Cathy has sent emails to five individuals
  • Only one node has degree five
  • Hence, Cathy can re-identify herself
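A tiny sketch of this degree-based re-identification, on a hypothetical 7-node graph loosely mirroring the example (node 0 plays Alice with degree 3, node 2 plays Cathy with degree 5):

```python
from collections import Counter

# Toy anonymized graph: labels removed, only numeric node ids remain.
edges = [(0, 1), (0, 2), (0, 4), (2, 1), (2, 3), (2, 5), (2, 6)]
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# A node whose degree is unique in the graph is re-identifiable by
# anyone who knows that degree (Alice knows she emailed 3 people).
degree_counts = Counter(degree.values())
unique = sorted(n for n, d in degree.items() if degree_counts[d] == 1)
print(unique)  # [0, 1, 2]: includes "Alice" (0) and "Cathy" (2)
```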

SLIDE 41

Passive Attacks on an Anonymized Network

  • Now consider that Alice and Cathy share their knowledge about the anonymized network

  • What can they learn about the other individuals?

SLIDE 42

Passive Attacks on an Anonymized Network

  • First, Alice and Cathy know that only Bob has sent emails to both of them

SLIDE 43

Passive Attacks on an Anonymized Network

  • First, Alice and Cathy know that only Bob has sent emails to both of them
  • Bob can be identified

SLIDE 44

Passive Attacks on an Anonymized Network

  • Alice has sent emails to Bob, Cathy, and Ed only

SLIDE 45

Passive Attacks on an Anonymized Network

  • Alice has sent emails to Bob, Cathy, and Ed only
  • Ed can be identified

SLIDE 46

Passive Attacks on an Anonymized Network

  • Alice and Cathy can learn that Bob and Ed are connected

SLIDE 47

Passive Attacks on an Anonymized Network

  • The above attack is based on knowledge about the degrees of nodes. [Liu and Terzi, SIGMOD 2008]
  • More sophisticated attacks can be launched given additional knowledge about the network structure, e.g., a subgraph of the network. [Zhou and Pei, ICDE 2008; Hay et al., VLDB 2008]
  • Protecting privacy becomes even more challenging when the nodes in the anonymized network are labeled. [Pang et al., SIGCOMM CCR 2006]

SLIDE 48

Inferring Sensitive Values on a Network

  • Each individual has a single sensitive attribute.

– Some individuals share the sensitive attribute publicly, while others keep it private

  • GOAL: Infer the private sensitive attributes using

– Links in the social network
– Groups that the individuals belong to

  • Approach: Learn a predictive model (think: classifier) using public profiles as training data. [Zheleva and Getoor, WWW 2009]

SLIDE 49

Inferring Sensitive Values on a Network

  • Baseline: the most commonly appearing sensitive value amongst all public profiles.

SLIDE 50

Inferring Sensitive Values on a Network

  • LINK: Each node x has a list of binary features Lx, one for every node in the social network.

– Feature value Lx[y] = 1 if and only if (x, y) is an edge.
– Train a model on all pairs (Lx, sensitive value(x)), for x's with public sensitive values.
– Use the learnt model to predict private sensitive values.

SLIDE 51

Inferring Sensitive Values on a Network

  • GROUP: Each node x has a list of binary features Gx, one for every group in the social network.

– Feature value Gx[y] = 1 if and only if x belongs to group y.
– Train a model on all pairs (Gx, sensitive value(x)), where x's sensitive value is public.
– Use the learnt model to predict private sensitive values (a sketch follows below).
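A minimal sketch of LINK on a hypothetical toy network; all names, labels, and the choice of logistic regression are our assumptions for illustration (the paper evaluates several classifiers):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy network: adjacency and sensitive labels;
# None marks a private value to be inferred.
nodes = ["alice", "bob", "cathy", "diane", "ed"]
edges = {("alice", "bob"), ("alice", "cathy"), ("bob", "ed"), ("cathy", "diane")}
label = {"alice": "left", "bob": "right", "cathy": "left", "diane": None, "ed": None}

def link_features(x):
    """LINK: one binary feature per node y; Lx[y] = 1 iff (x, y) is an edge."""
    return [1 if (x, y) in edges or (y, x) in edges else 0 for y in nodes]

public = [x for x in nodes if label[x] is not None]
private = [x for x in nodes if label[x] is None]

# Train on the nodes whose sensitive value is public...
model = LogisticRegression().fit(
    np.array([link_features(x) for x in public]),
    [label[x] for x in public],
)
# ...then predict the private ones. GROUP is identical, with one binary
# feature per group instead (Gx[g] = 1 iff x belongs to group g).
preds = model.predict(np.array([link_features(x) for x in private]))
print(dict(zip(private, preds)))
```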

SLIDE 52

Inferring Sensitive Values on a Network

           Flickr (Location)   Facebook (Gender)   Facebook (Political View)   Dogster (Dog Breed)
Baseline   27.7%               50%                 56.5%                        28.6%
LINK       56.5%               68.6%               58.1%                        60.2%
GROUP      83.6%               77.2%               46.6%                        82.0%

[Zheleva and Getoor, WWW 2009]

SLIDE 53

Active Attacks on Social Networks

[Backstrom et al., WWW 2007]

  • Attacker may create a few nodes in the graph

– Create a few 'fake' Facebook user accounts.

  • Attacker may add edges from the new nodes.

– Create friendships using the 'fake' accounts.

  • Goal: Discover an edge between two legitimate users.

SLIDE 54

High Level View of Attack

  • Step 1: Create a graph structure with the 'fake' nodes such that it can be identified in the anonymized data.

SLIDE 55

High Level View of Attack

  • Step 2: Add edges from the ‘fake’ nodes to real nodes.

SLIDE 56

High Level View of Attack

  • Step 3: In the anonymized data, identify the fake subgraph by its special graph structure.

SLIDE 57

High Level View of Attack

  • Step 4: Deduce edges between targeted real users by following links from the identified fake nodes.

SLIDE 58

Details of the Attack

  • Choose k real users W = {w1, …, wk}
  • Create k fake users X = {x1, …, xk}
  • Create edges (xi, wi)
  • Create edges (xi, xi+1)
  • Create all other edges within X independently with probability 0.5.

[Figure: the fake subgraph on X attached to the targeted nodes inside the large graph.]
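A minimal sketch of this construction in Python; the edge representation and node ids are our assumptions, not from the slides or the paper:

```python
import random

def plant_attack_subgraph(existing_edges, targets, seed=None):
    """Sketch of the active-attack construction described above.

    existing_edges: set of frozensets (undirected edges) over node ids.
    targets: list of k real users w1..wk to attack.
    Returns the fake node ids x1..xk and the augmented edge set.
    """
    rng = random.Random(seed)
    k = len(targets)
    fakes = [f"fake_{i}" for i in range(k)]  # hypothetical node ids
    edges = set(existing_edges)
    for i in range(k):
        edges.add(frozenset((fakes[i], targets[i])))        # (xi, wi)
        if i + 1 < k:
            edges.add(frozenset((fakes[i], fakes[i + 1])))  # path (xi, xi+1)
    # Every other pair inside X independently with probability 0.5, so
    # G[X] is random enough to (w.h.p.) have no non-trivial automorphisms
    # and no second copy elsewhere in the graph.
    for i in range(k):
        for j in range(i + 2, k):
            if rng.random() < 0.5:
                edges.add(frozenset((fakes[i], fakes[j])))
    return fakes, edges
```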

SLIDE 59

Why does it work?

  • Given a graph G and a set of nodes S, G[S] = the subgraph induced by the nodes in S.
  • An isomorphism between two sets of nodes S, S′ is a function f mapping each node in S to a node in S′ such that

– (u, v) is an edge in G[S] if and only if (f(u), f(v)) is an edge in G[S′]

  • An isomorphism from S to S is called an automorphism.

– Think: permuting the nodes

SLIDE 60

Why does it work?

  • With high probability, there is no other set S ≠ X such that G[S] is isomorphic to G[X] (call G[X] H).
  • H can be efficiently found from G.
  • H has no non-trivial automorphisms.

[Figure: H planted inside the large graph of size N.]

SLIDE 61

Recovery

Subgraph isomorphism is NP-hard

– i.e., finding X could be hard in general.

But since X contains a path and has random internal edges, a simple brute-force search with pruning works. Run time: O(N · 2^O(log log N))

SLIDE 62

Works in Real Life!

  • LiveJournal: 4.4 million nodes, 77 million edges
  • Success is all but guaranteed by adding 10 nodes.
  • Recovery typically takes a second.

[Figure: probability of a successful attack vs. number of added nodes. Backstrom et al., WWW 2007]

SLIDE 63

Summary of Social Networks

  • Nodes in a graph can be re-identified using background knowledge of the structure of the graph.
  • Link and group structure provide valuable information for accurately inferring private sensitive values.
  • Active attacks that add nodes and edges are shown to be very successful.
  • Guarding against these attacks is an open area for research!

SLIDE 64

Next Class

  • K-Anonymity + Algorithms: How to limit de-anonymization?

SLIDE 65

References

  • L. Sweeney, "k-Anonymity: A Model for Protecting Privacy", IJUFKS 2002.
  • A. Narayanan & V. Shmatikov, "Robust De-anonymization of Large Sparse Datasets", IEEE S&P 2008.
  • L. Backstrom, C. Dwork & J. Kleinberg, "Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography", WWW 2007.
  • E. Zheleva & L. Getoor, "To Join or Not to Join: The Illusion of Privacy in Social Networks with Mixed Public and Private User Profiles", WWW 2009.
