Differential privacy and applications to location privacy
Catuscia Palamidessi INRIA & Ecole Polytechnique
Plan of the talk:
- General introduction to privacy issues
- A naive approach to privacy protection: anonymization, and why it fails
- Statistical databases
- Geo-indistinguishability
In the "Information Society", each individual constantly leaves digital traces of their actions that may allow others to infer a lot of information about them:
- IP address ⇒ location
- History of requests ⇒ interests
- Activity in social networks ⇒ political opinions, religion, hobbies, ...
- Power consumption (smart meters) ⇒ activities at home

Risk: the collection and use of digital traces for fraudulent purposes. Examples: targeted spam, identity theft, profiling, discrimination, ...
Nowadays, organizations and companies that collect data are usually obliged to sanitize them by making them anonymous, i.e., by removing all personal identifiers: name, address, SSN, …
"We don't have any raw data on the identifiable individual."
(CEO of NebuAd, a U.S. company that offers targeted advertising based on browsing histories)
Similar practices are used by Facebook, MySpace, Google, ...
However, anonymity-based sanitization has been shown to be highly ineffective: several de-anonymization attacks have been carried out in the last decade, succeeding in a large number of cases.

More refined anonymization techniques (such as k-anonymity, discussed below) take care of the quasi-identifiers, but they are still prone to composition attacks.
[Figure: a linking attack. DB 1 is an anonymized database containing sensitive information; DB 2 is a public collection of non-sensitive data; an algorithm uses background (auxiliary) information to link the two.]
Example: DB 1 contains medical data, DB 2 is a public voter list.

DB 1 (medical data): Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge
DB 2 (voter list): Name, Address, Date registered, Party affiliation, Date last voted
Attributes appearing in both (the quasi-identifier): ZIP, Birth date, Sex

87% of the US population is uniquely identifiable by ZIP, gender, and date of birth.
A quasi-identifier is a set of attributes that can be linked with external data to uniquely identify individuals.

k-anonymity: every record in the table is indistinguishable from at least k−1 other records with respect to the quasi-identifier. This is done by suppressing or generalizing attributes, so that there are at least k records for each possible value of the quasi-identifier.

Example: 4-anonymity w.r.t. the quasi-identifier {nationality, ZIP, age}, achieved by suppressing the nationality and generalizing ZIP and age (a sketch of this step follows).
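To make the generalization step concrete, here is a minimal Python sketch with hypothetical records and a hypothetical generalize helper (neither is from the talk); it suppresses the nationality and coarsens ZIP and age so that all records share the same quasi-identifier value:

```python
def generalize(record):
    """Suppress nationality, keep only a ZIP prefix, and map age to a decade range."""
    return {
        "nationality": "*",                               # suppressed
        "zip": record["zip"][:3] + "**",                  # generalized ZIP prefix
        "age": f"{(record['age'] // 10) * 10}-{(record['age'] // 10) * 10 + 9}",
        "disease": record["disease"],                     # sensitive attribute, kept
    }

records = [
    {"nationality": "Russian",  "zip": "13053", "age": 28, "disease": "Heart"},
    {"nationality": "American", "zip": "13068", "age": 29, "disease": "Flu"},
    {"nationality": "Japanese", "zip": "13068", "age": 21, "disease": "Cancer"},
    {"nationality": "American", "zip": "13053", "age": 23, "disease": "Flu"},
]

generalized = [generalize(r) for r in records]
# After generalization all four records share the same quasi-identifier value
# ("*", "130**", "20-29"), so the table is 4-anonymous w.r.t. {nationality, zip, age}.
```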
Two works by Narayanan and Shmatikov showed the limitations of k-anonymity:

Robust De-anonymization of Large Sparse Datasets. Arvind Narayanan and Vitaly Shmatikov, 2008.
They applied de-anonymization to the Netflix Prize dataset (which contained anonymous movie ratings of 500,000 Netflix subscribers), using the Internet Movie Database as the source of background knowledge. They demonstrated that an adversary who knows only a little bit about an individual subscriber can identify his record in the dataset, uncovering his apparent political preferences and other potentially sensitive information.
De-anonymizing Social Networks. Arvind Narayanan and Vitaly Shmatikov, 2009.
Using only the network topology, they showed that a third of the users who have accounts on both Twitter (a popular microblogging service) and Flickr (an online photo-sharing site) can be re-identified in the anonymous Twitter graph with only a 12% error rate.
Statistical databases

The purpose of statistical databases is to provide statistical (aka aggregated) information without violating the privacy of the people in the database.

Example: a medical database can be used to study the correlation between certain diseases and certain other attributes: age, sex, weight, etc.

A typical statistical query: "Among the people with the disease, what percentage is over 60?"

The kind of query we want to prevent: "Does Don have the disease?"
Unfortunately, it is not so easy to prevent such privacy breaches.

Example: we want to learn the correlation between a disease and the age, but we want to keep private whether a certain person has the disease.

name   age   disease
Alice  30    no
Bob    30    no
Don    40    yes
Ellie  50    no
Frank  50    yes

Query: What is the youngest age of a person with the disease?
Answer: 40
Problem: the adversary may know that Don is the only person in the database with age 40.
k-anonymity: the answer should correspond to at least k individuals.

name   age   disease
Alice  30    no
Bob    30    no
Carl   40    no
Don    40    yes
Ellie  50    no
Frank  50    yes

Now the answer 40 corresponds to two individuals (Carl and Don), so it no longer reveals that Don has the disease.
Unfortunately, k-anonymity is not robust under composition:
Consider the query: What is the minimal weight of a person with the disease?
Answer: 100

name   weight   disease
Alice  60       no
Bob    90       no
Carl   90       no
Don    100      yes
Ellie  60       no
Frank  100      yes
Combine the two queries: the minimal age and the minimal weight of a person with the disease.
Answers: 40, 100
Since the minimal weight of a person with the disease is 100, Carl (weight 90) cannot have the disease; hence the person of age 40 with the disease must be Don.
Solution: introduce some probabilistic noise on the answer, so that the reported answer can also be produced by other people, with different age and weight.
Noisy answer for the minimal age:
  40 with probability 1/2
  30 with probability 1/4
  50 with probability 1/4
Noisy answer for the minimal weight:
  100 with probability 4/7
  90 with probability 2/7
  60 with probability 1/7
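The slides do not specify which mechanism yields exactly these probabilities; one mechanism that reproduces them is a truncated geometric mechanism that weights each candidate answer by (1/2)^d, where d counts how many distinct values away the candidate is from the true answer. A minimal Python sketch (the function name and candidate sets are mine):

```python
import random

def truncated_geometric(true_answer, candidates, alpha=0.5):
    """Return a noisy answer: each candidate is weighted by alpha^d, where d is
    the number of value-steps between the candidate and the true answer."""
    ordered = sorted(set(candidates))
    i = ordered.index(true_answer)
    weights = [alpha ** abs(j - i) for j in range(len(ordered))]
    total = sum(weights)
    return random.choices(ordered, weights=[w / total for w in weights])[0]

# Minimal age: candidates {30, 40, 50}, true answer 40
#   -> 40 with prob 1/2, 30 and 50 with prob 1/4 each.
# Minimal weight: candidates {60, 90, 100}, true answer 100
#   -> 100 with prob 4/7, 90 with prob 2/7, 60 with prob 1/7.
print(truncated_geometric(40, [30, 40, 50]))
print(truncated_geometric(100, [60, 90, 100]))
```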
Combination of the answers: the adversary cannot tell for sure whether a certain person has the disease.
The noise is typically generated randomly on the basis of the true answer and of some probability distribution. The distribution must be chosen carefully, in order not to destroy the utility of the answer. There is therefore a trade-off between privacy and utility. Note that, for the same level of privacy, different mechanisms may provide different levels of utility.
Definition [Dwork 2006]: a randomized mechanism K provides ε-differential privacy if for all databases x, x′ which are adjacent (i.e., they differ in only one record), and for all z ∈ Z, we have

$$\frac{p(K = z \mid X = x)}{p(K = z \mid X = x')} \;\le\; e^{\varepsilon}$$
A typical mechanism is the Laplacian: given a numeric query f, the reported value z is drawn around the true answer y = f(x) with a probability density function defined as:

$$dP_y(z) = c\, e^{-\frac{|z-y|}{\Delta f}\,\varepsilon}$$

where Δf is the sensitivity of f:

$$\Delta f = \max_{x \sim x' \in \mathcal{X}} |f(x) - f(x')|$$

(x ∼ x′ means that x and x′ are adjacent, i.e., they differ in only one record), and c is a normalization factor:

$$c = \frac{\varepsilon}{2\,\Delta f}$$
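For illustration, here is a minimal Python sketch of such a Laplace mechanism; the counting-query usage in the last lines is my own choice of example (a counting query has sensitivity 1, since adding or removing one record changes it by at most 1):

```python
import random

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return a noisy answer drawn from the density
    (epsilon / (2 * sensitivity)) * exp(-|z - true_answer| * epsilon / sensitivity).
    The difference of two Exp(1) variables, scaled, is Laplace-distributed."""
    scale = sensitivity / epsilon
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    return true_answer + noise

# Example: "How many people have the disease?" -- in the running example the
# true answer is 2 (Don and Frank), and the sensitivity is 1.
noisy_count = laplace_mechanism(true_answer=2, sensitivity=1.0, epsilon=0.5)
```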
[Figure: two Laplace distributions, centered at y1 and y2, and a reported value z; at every z the ratio of the two densities is at most e^ε.]

The ratio between these distributions is bounded by e^ε. Note that the distance between y1 and y2 is greatest when it equals the sensitivity of f. In this case the ratio between the respective Laplace densities is e^ε. In all other cases, the distance between y1 and y2 is smaller, and therefore the ratio is also smaller. Similar considerations hold for the geometric mechanism.
Assume for example that the sensitivity is Δf = 10 and that the true answers on two adjacent databases are 10 and 20. Then the two densities are:

$$dP_{10}(z) = \frac{\varepsilon}{2 \cdot 10}\, e^{-\frac{|z-10|}{10}\,\varepsilon} \qquad\qquad dP_{20}(z) = \frac{\varepsilon}{2 \cdot 10}\, e^{-\frac{|z-20|}{10}\,\varepsilon}$$

and their ratio is at most e^ε for every reported value z (a quick numerical check follows).
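A small Python check of that bound; the value of ε and the grid of reported values are arbitrary choices for illustration:

```python
import math

def laplace_density(z, center, sensitivity, eps):
    """Density of the Laplace mechanism centred at the true answer `center`."""
    return eps / (2 * sensitivity) * math.exp(-abs(z - center) / sensitivity * eps)

eps, sensitivity = 0.5, 10.0
worst_ratio = max(
    laplace_density(z, 10, sensitivity, eps) / laplace_density(z, 20, sensitivity, eps)
    for z in (i / 10 for i in range(-500, 501))   # reported values from -50 to 50
)
# worst_ratio equals e^eps (up to rounding): the bound holds for every z.
print(worst_ratio, math.exp(eps))
```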
Differential privacy in practice:

- Used by the U.S. Census Bureau to give researchers access to the data of the agency while protecting the privacy of the citizens.
  http://www.scientificamerican.com/article/privacy-by-the-numbers-a-new-approach-to-safeguarding-data/

- Google's RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response), used for collecting statistics from end-user software.
  http://www.computerworld.com/article/2841954/googles-rappor-aims-to-preserve-privacy-while-snaring-software-stats.html
Generalization: d-privacy

Protection of the accuracy of the information.

A mechanism is ε-differentially private iff for every pair of databases x, x′ and every reported value z:

$$\frac{p(z \mid x)}{p(z \mid x')} \;\le\; e^{\varepsilon\, d_H(x, x')}$$

where d_H is the Hamming distance between x and x′, i.e., the number of records in which x and x′ differ.

d-privacy is the same condition on a generic domain X provided with a distance d: for all x, x′ ∈ X and all z,

$$\frac{p(z \mid x)}{p(z \mid x')} \;\le\; e^{\varepsilon\, d(x, x')}$$
Motivating example: using a location-based service to find a nearby restaurant. The service does not need the exact location of the user: an approximate location is ok.
Geo-indistinguishability: the instance of d-privacy where
  d : the Euclidean distance
  x : the exact location
  z : the reported location
d-privacy:

$$\frac{p(z \mid x)}{p(z \mid x')} \;\le\; e^{\varepsilon r}$$

where r is the distance between x and x′.

Alternative characterization:

$$\frac{p(x \mid z)}{p(x' \mid z)} \;\le\; e^{\varepsilon r}\, \frac{p(x)}{p(x')}$$
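The alternative characterization follows from Bayes' rule; this short derivation is not spelled out on the slides:

$$\frac{p(x \mid z)}{p(x' \mid z)} \;=\; \frac{p(z \mid x)\, p(x)/p(z)}{p(z \mid x')\, p(x')/p(z)} \;=\; \frac{p(z \mid x)}{p(z \mid x')} \cdot \frac{p(x)}{p(x')} \;\le\; e^{\varepsilon r}\, \frac{p(x)}{p(x')}$$

Intuitively, observing the reported location z changes the adversary's relative belief about x versus x′ by at most a factor e^{εr}.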
A d-private mechanism for LBS: the planar (bivariate) Laplacian, centered at the exact location x:

$$dp_x(z) = \frac{\varepsilon^2}{2\pi}\, e^{-\varepsilon\, d(x, z)}$$

There is an efficient method to draw points from this distribution based on polar coordinates. Some care needs to be taken when translating from polar to standard (latitude, longitude) coordinates: there is a degradation of the privacy level in single precision, but it is negligible in double precision.
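A possible implementation of the polar-coordinate sampling, sketched in Python under the assumption that locations are already expressed in planar (e.g., metre-based) coordinates; the latitude/longitude translation mentioned above is not handled here, and the function names are mine:

```python
import math
import random
from scipy.special import lambertw  # W_{-1} branch, used to invert the radius CDF

def planar_laplace_noise(epsilon):
    """Draw a noise vector with density (eps^2 / (2*pi)) * exp(-eps * ||v||)."""
    theta = random.uniform(0, 2 * math.pi)   # angle: uniform
    p = random.random()                       # radius: inverse-CDF sampling,
    # CDF of the radius: C(r) = 1 - (1 + eps*r) * exp(-eps*r)
    # => r = -(W_{-1}((p - 1)/e) + 1) / eps
    r = -(lambertw((p - 1) / math.e, k=-1).real + 1) / epsilon
    return r * math.cos(theta), r * math.sin(theta)

def report_location(x, y, epsilon):
    """Geo-indistinguishable reported location for a point (x, y) in planar coordinates."""
    dx, dy = planar_laplace_noise(epsilon)
    return x + dx, y + dy
```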
We have compared the utility-privacy trade-off of our mechanism (the planar Laplacian) with three other mechanisms in the literature.

The optimal mechanism of [Shokri et al., S&P 2012] is prior-dependent: it is specifically generated assuming a certain adversary (with a certain prior knowledge). Our mechanism, in contrast, is prior-independent. The optimal mechanism is obtained by linear programming techniques.

Cloaking: instead of reporting the point, we report the zone it belongs to.
Experimental setup: an "area of interest" containing 9×9 = 81 "locations", partitioned into 9 zones, indicated by the blue lines.

[Figure: the 9×9 grid of locations, numbered 1 to 81, divided into 9 zones.]

We applied the mechanisms to this setting and measured their privacy.
We measured the quality loss (utility loss) and the location privacy with the metrics of [Shokri et al., S&P 2012]. Note that we could not use differential privacy itself as the privacy metric, because our mechanism is the only one that provides differential privacy.
The four mechanisms: Cloaking, Optimal-unif, Planar Laplace, Optimal-rp.

[Figures (a), (b), (c): comparison of the four mechanisms under the privacy and utility metrics.]
With respect to the privacy measures proposed by [Shokri et al., S&P 2012], our mechanism performs better than the other mechanisms proposed in the literature that are independent of the prior (and therefore of the adversary). The only mechanism that outperforms ours is the optimal mechanism of [Shokri et al., S&P 2012] for the given prior, but that mechanism is adversary-dependent.
http://www.lix.polytechnique.fr/~kostas/software.html
The mechanism is implemented as an extension for Firefox, Chrome, and Opera. It was released about one year ago, and it now has about 60,000 active users.
[Figure: the extension in action: the user's area of interest, the noisy reported position, and the area of retrieval used to fetch results covering the area of interest.]