De-anonymizing Data CompSci 590.03 Instructor: Ashwin

Source (http://xkcd.org/834/) De-anonymizing Data CompSci 590.03 Instructor: Ashwin Machanavajjhala Lecture 2 : 590.03 Fall 12 1

Announcements • Project ideas will be posted on the site by Friday. – You are welcome to send me (or talk to me about) your own ideas. Lecture 2 : 590.03 Fall 12 2

Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks – Passive Attacks – Active Attacks Lecture 2 : 590.03 Fall 12 3

Personal Big-Data Person 1 Person 2 Person 3 Person N r 1 r 2 r 3 r N Google Census Hospital DB DB DB Information Recommen- Medical Doctors Economists Retrieval dation Researchers Researchers Algorithms Lecture 2 : 590.03 Fall 12 5

The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002] • Governor of MA • Name • Name uniquely identified • SSN • Address • Zip using ZipCode, • Date • Visit Date Birth Date, and Sex. • Birth Registered • Diagnosis date • Party • Procedure Name linked to Diagnosis affiliation • Medication • Sex • Date last • Total Charge voted Medical Data Voter List Lecture 2 : 590.03 Fall 12 6

The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002] • Governor of MA 87 % of US population • Name • Name uniquely identified • SSN • Address • Zip using ZipCode, • Date • Visit Date Birth Date, and Sex. • Birth Registered • Diagnosis date • Party • Procedure affiliation • Medication • Sex • Date last • Total Charge voted Quasi Identifier Medical Data Voter List Lecture 2 : 590.03 Fall 12 7

Statistical Privacy (Trusted Collector) Problem Utility: Privacy: No breach about any individual Server D B Individual 1 Individual 2 Individual 3 Individual N r 1 r 2 r 3 r N Lecture 2 : 590.03 Fall 12 8

Statistical Privacy (Untrusted Collector) Problem Server f ( ) D B Individual 1 Individual 2 Individual 3 Individual N r 1 r 2 r 3 r N Lecture 2 : 590.03 Fall 12 9

Randomized Response • Flip a coin – heads with probability p, and – tails with probability 1-p (p > ½) • Answer question according to the following table: True Answer = Yes True Answer = No Heads Yes No Tails No Yes Lecture 2 : 590.03 Fall 12 10

Statistical Privacy (Trusted Collector) Problem Server D B Individual 1 Individual 2 Individual 3 Individual N r 1 r 2 r 3 r N Lecture 2 : 590.03 Fall 12 11

Query Answering How many allergy patients? Hospital ‘ D B Correlate Genome to disease Individual 1 Individual 2 Individual 3 Individual N r 1 r 2 r 3 r N Lecture 2 : 590.03 Fall 12 12

Query Answering • Need to know the list of questions up front • Each answer will leak some information about individuals. After answering a few questions, server will run out of privacy budget and not be able to answer any more questions. • Will see this in detail later in the course. Lecture 2 : 590.03 Fall 12 13

Anonymous/ Sanitized Data Publishing Hospital D B writingcenterunderground.wordpress.com I wont tell you what questions I am interested in! Individual 1 Individual 2 Individual 3 Individual N r 1 r 2 r 3 r N Lecture 2 : 590.03 Fall 12 14

Anonymous/ Sanitized Data Publishing Hospital Answer any # of questions directly on D B ’ without D’ B any modifications. D B Individual 1 Individual 2 Individual 3 Individual N r 1 r 2 r 3 r N Lecture 2 : 590.03 Fall 12 15

Today’s class • Identifying individual records and their sensitive values from data publishing (with insufficient sanitization). Lecture 2 : 590.03 Fall 12 16

Terms • Coin tosses of an algorithm • Union Bound • Heavy Tailed Distribution Lecture 2 : 590.03 Fall 12 18

Terms (contd.) • Heavy Tailed Distribution Normal Distribution Not heavy tailed. Lecture 2 : 590.03 Fall 12 19

Terms (contd.) • Heavy Tailed Distribution Laplace Distribution Heavy tailed. Lecture 2 : 590.03 Fall 12 20

Terms (contd.) • Heavy Tailed Distribution Zipf Distribution Heavy tailed. Lecture 2 : 590.03 Fall 12 21

Terms (contd.) • Cosine Similarity θ • Collaborative filtering – Problem of recommending new items to a user based on their ratings on previously seen items. Lecture 2 : 590.03 Fall 12 22

Netflix Dataset Column/Attribute Movies 3 4 2 1 5 Record (r) 1 1 1 5 5 1 Users 5 2 2 1 4 2 1 4 3 3 5 4 3 1 3 2 4 Rating + TimeStamp Lecture 2 : 590.03 Fall 12 23

Definitions • Support – Set (or number) of non-null attributes in a record or column • Similarity • Sparsity Lecture 2 : 590.03 Fall 12 24

Adversary Model • Aux(r) – some subset of attributes from r Lecture 2 : 590.03 Fall 12 25

Privacy Breach • Definition 1: An algorithm A outputs an r’ such that • Definition 2: (When only a sample of the dataset is input) Lecture 2 : 590.03 Fall 12 26

Algorithm ScoreBoard • For each record r’, compute Score(r’, aux) to be the minimum similarity of an attribute in aux to the same attribute in r’. • Pick r’ with the maximum score OR • Return all records with Score > α Lecture 2 : 590.03 Fall 12 27

Analysis Theorem 1: Suppose we use Scoreboard with α = 1 – ε . If Aux contains m randomly chosen attributes s.t. Then Scoreboard returns a record r’ such that Pr [ Sim (m, r’) > 1 – ε – δ ] > 1 – ε Lecture 2 : 590.03 Fall 12 28

Proof of Theorem 1 • Call r’ a false match if Sim (Aux, r’) < 1 – ε – δ . • For any false match, Pr[ Sim(Aux i , r i ’) > 1 – ε ] < 1 – δ • Sim (Aux, r’) = min Sim(Aux i , r i ’) • Therefore, Pr[ Sim (Aux, r’) > 1 – ε ] < (1 – δ ) m • Pr[some false match has similarity > 1- ε ] < N(1- δ ) m • N(1- δ ) m < ε when m > log(N/ ε ) / log(1/1- δ ) Lecture 2 : 590.03 Fall 12 29

Other results • If dataset D is (1- ε - δ , ε )-sparse, then D can be (1, 1- ε )- deanonymized. • Analogous results when a list of candidate records are returned Lecture 2 : 590.03 Fall 12 30

Netflix Dataset • Slightly different algorithm Lecture 2 : 590.03 Fall 12 31

Summary of Netflix Paper • Adversary can use a subset of ratings made by a user to uniquely identify the user’s record from the “ anonymized ” dataset with high probability • Simple Scoreboard algorithm provably guarantees identification of records. • A variant of Scoreboard can de-anonymize Netflix dataset. • Algorithms are robust to noise in the adversary’s background knowledge Lecture 2 : 590.03 Fall 12 32

Social Network Data • Social networks: graphs where each node represents a social entity, and each edge represents certain relationship between two entities • Example: email communication graphs, social interactions like in Facebook, Yahoo! Messenger, etc. Lecture 2 : 590.03 Fall 12 34

Anonymizing Social Networks Alice Bob Cathy Diane Ed Fred Grace • Naïve anonymization – removes the label of each node and publish only the structure of the network • Information Leaks – Nodes may still be re-identified based on network structure Lecture 2 : 590.03 Fall 12 35

Passive Attacks on an Anonymized Network Alice Bob Cathy Diane Ed Fred Grace • Consider the above email communication graph – Each node represents an individual – Each edge between two individuals indicates that they have exchanged emails Lecture 2 : 590.03 Fall 12 36

Passive Attacks on an Anonymized Network Alice Bob Cathy Diane Ed Fred Grace • Alice has sent emails to three individuals only Lecture 2 : 590.03 Fall 12 37

Passive Attacks on an Anonymized Network Alice Bob Cathy Diane Ed Fred Grace • Alice has sent emails to three individuals only • Only one node in the anonymized network has a degree three • Hence, Alice can re-identify herself Lecture 2 : 590.03 Fall 12 38

Passive Attacks on an Anonymized Network Alice Bob Cathy Diane Ed Fred Grace • Cathy has sent emails to five individuals Lecture 2 : 590.03 Fall 12 39

Passive Attacks on an Anonymized Network Alice Bob Cathy Diane Ed Fred Grace • Cathy has sent emails to five individuals • Only one node has a degree five • Hence, Cathy can re-identify herself Lecture 2 : 590.03 Fall 12 40

Passive Attacks on an Anonymized Network Alice Bob Cathy Diane Ed Fred Grace • Now consider that Alice and Cathy share their knowledge about the anonymized network • What can they learn about the other individuals? Lecture 2 : 590.03 Fall 12 41

De-anonymizing Data CompSci 590.03 Instructor: Ashwin - PowerPoint PPT Presentation

Source (http://xkcd.org/834/) De-anonymizing Data CompSci 590.03 Instructor: Ashwin Machanavajjhala Lecture 2 : 590.03 Fall 12 1 Announcements Project ideas will be posted on the site by Friday. You are welcome to send me (or talk to

VEA: Validating, Evolving & Anonymizing Data in Real Time Albert Franzi Cros, Data Engineer |

De-anonymizing D4D Datasets Kumar Sharad 1 and George Danezis 2 1 University of Cambridge Computer

De#anonymizing,Social,Networks, and,Inferring,Private,Attributes, Using,Knowledge,Graphs,

Tor: An Anonymizing Overlay Network for TCP Roger Dingledine The Free Haven Project

Tarzan: A Peer-to-Peer Anonymizing Network Layer Michael J. Freedman, NYU Robert Morris, MIT

De-Anonymizing Live CDs through Physical Memory Analysis Andrew Case Senior Security Analyst

Rumor Riding: Anonymizing Unstructured Peer-to-Peer Systems Narrated by Christo Wilson Table of

A Practical Congestion Attack on Tor Using Long Paths Towards De-anonymizing Tor Nathan S. Evans

De-anonymizing D4D Datasets Kumar Sharad 1 George Danezis 2 1 University of Cambridge 2 Microsoft

De-Anonymizing Live CDs through Physical Memory Analysis

Anonymizing your hacktop A brief tour of unique identifiers accessible by software @ Unique

DataCamp Data Types for Data Science DataCamp Data Types for Data Science Data types Data type

Database Indexes and K-anonymity Tochukwu Iwuchukwu Jeffrey F. Naughton Conclusion There

Environmental Health Science Data Streams Data Streams Health Data Health Data Brian S.

Diagnose data for cleaning Cleaning Data in Python Cleaning data Prepare data for analysis

Contents Introduction to data security Public key cryptosystems: Classical

Introduction to Security Attacks Services Mechanisms CSS441: Security and Cryptography

Handout 7 Summary of this handout: Key Exchange Protocols Wide-Mouth Frog Needham-Schroeder

Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference

Active Adversary Lecture 7 CCA Security MAC Active Adversary Active Adversary An active

BotNets BotNets- Cybe Cyber T r Torrirism orrirism Ba Batt ttling ling th the t e thr

Security Notions for Bidirectional Channels Giorgia Azzurra Marson Bertram Poettering FSE 2017

in Searchable Symmetric Encryption Kaoru Kurosawa and Yasuhiro Ohtaki Ibaraki University, Japan

Sambuz

Useful Links

Newsletter

Mail Us

De-anonymizing Data CompSci 590.03 Instructor: Ashwin - PowerPoint PPT Presentation

Source (http://xkcd.org/834/) De-anonymizing Data CompSci 590.03 Instructor: Ashwin Machanavajjhala Lecture 2 : 590.03 Fall 12 1 Announcements Project ideas will be posted on the site by Friday. You are welcome to send me (or talk to

VEA: Validating, Evolving &amp; Anonymizing Data in Real Time Albert Franzi Cros, Data Engineer |