Data privacy: introduction
Vicenç Torra, January 15, 2018
PowerPoint presentation


SLIDE 1

Oslo, 2018

Data privacy: introduction
Vicenç Torra, January 15, 2018

Privacy, Information and Cyber-Security Center SAIL, School of Informatics, University of Skövde, Sweden
SLIDE 2

Outline

  • 1. Motivation
  • 2. Privacy models and disclosure risk assessment
  • 3. Data protection mechanisms
  • 4. Masking methods
  • 5. Summary

Oslo, 2018 1 / 45

SLIDE 3

Motivation

SLIDE 4

Introduction

  • Data privacy: core
  • Someone needs access to the data to perform an authorized analysis,

but access to the data and the results of the analysis should avoid disclosure.

E.g., you are authorized to compute the average stay in a hospital, but maybe you are not authorized to see the length of stay of your neighbor.

Vicenç Torra; Data privacy. Oslo, 2018. 3 / 45

SLIDE 5

Introduction

  • Data privacy: boundaries
  • Database in a computer or in a removable device

⇒ access control to avoid unauthorized access ⇒ access to address (admissions), access to blood tests (admissions?)

  • Data is transmitted

⇒ security technology to avoid unauthorized access ⇒ data from a blood glucose meter sent to the hospital; network sniffers

Transmission is sensitive: near-miss/near-hit reports to car manufacturers

[Figure: access control, security, and privacy]

SLIDE 6

Difficulties

  • Difficulties: Naive anonymization does not work

Passenger manifest for the Missouri, arriving February 15, 1882, Port of Boston¹: Names, Age, Sex, Occupation, Place of birth, Last place of residence, Yes/No, Condition (healthy?)

¹ https://www.sec.state.ma.us/arc/arcgen/genidx.htm

SLIDE 7

Difficulties

  • Difficulties: highly identifiable data
  • (Sweeney, 1997) on USA population

⋆ 87.1% (216 million of 248 million) had characteristics that likely made them unique based on 5-digit ZIP, gender, and date of birth
⋆ 3.7% (9.1 million) had characteristics that likely made them unique based on 5-digit ZIP, gender, and month and year of birth

SLIDE 8

Difficulties

  • Difficulties: highly identifiable data
  • Data from mobile devices:

⋆ two positions can make you unique (home and workplace)

  • AOL² and Netflix cases (search logs and movie ratings)

⇒ User No. 4417749: hundreds of searches over a three-month period, including queries such as 'landscapers in Lilburn, Ga' ⇒ Thelma Arnold identified! ⇒ individual users matched with film ratings on the Internet Movie Database.

  • Similar with credit card payments, shopping carts, ...

(i.e., high dimensional data)

² http://www.nytimes.com/2006/08/09/technology/09aol.html

SLIDE 12

Difficulties

  • Difficulties: highly identifiable data
  • Example #1:

⋆ University goal: learn how sickness is influenced by studies and by commuting distance
⋆ Data: where students live, what they study, whether they got sick
⋆ No “personal data”; is this OK?
⋆ NO!! How many in your degree live in your town?

  • Example #2:

⋆ Car company goal: study driving behaviour in the morning
⋆ Data: first drive (GPS origin + destination, time) × 30 days
⋆ No “personal data”; is this OK?
⋆ NO!! How many cars go from your parking spot to your university every morning? Are you exceeding the speed limit? Are you visiting a psychiatrist every Tuesday?

SLIDE 13

Difficulties

  • Data privacy is “impossible”, or is it?
  • Privacy vs. utility
  • Privacy vs. security
  • Computationally feasible

SLIDE 14

Privacy models and disclosure risk assessment

SLIDE 15

Disclosure risk assessment

Privacy models: what is a privacy model?

  • To build a protection mechanism we first need to define what we want to protect

SLIDE 16

Disclosure risk assessment

Disclosure risk. Disclosure: leakage of information.

  • Identity disclosure vs. Attribute disclosure
  • Attribute disclosure: (e.g. learn about Alice’s salary)

⋆ Increase knowledge about an attribute of an individual

  • Identity disclosure: (e.g. find Alice in the database)

⋆ Find/identify an individual in a database (e.g., a masked file)

Within machine learning, some attribute disclosure is expected.

SLIDE 17

Disclosure risk assessment

Disclosure risk.

  • Boolean vs. quantitative privacy models
  • Boolean: disclosure either takes place or not; check whether the definition holds. Includes definitions based on a threshold.
  • Quantitative: disclosure is a matter of degree that can be quantified; some risk is permitted.
  • Minimize information loss (maximize utility) vs. multiobjective optimization

SLIDE 18

Disclosure risk assessment

Privacy models. (selection)

  • Secure multiparty computation. Several parties want to compute

a function of their databases, but only sharing the result.

  • Reidentification privacy. Avoid finding a record in a database.
  • k-Anonymity. Each record is indistinguishable from k − 1 other records.
  • Differential privacy. The output of a query to a database should

not depend (much) on whether a record is in the database or not.

SLIDE 19

Disclosure risk assessment

Privacy model. Secure multiparty computation.

  • Several parties want to compute a function of their databases, but

only sharing the result.

  • Hospital A and hospital B,
  • two independent databases with:

age of patient, length of stay in hospital

  • How to compute a regression age → length with all data (both databases) without sharing the data?
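The regression age → length only needs a few aggregate sums (n, Σx, Σy, Σx², Σxy), so one way to meet this model is a secure-sum protocol over those sums. A toy sketch in Python (the records and the additive secret-sharing scheme are illustrative assumptions, not a production protocol):

```python
import random

PRIME = 2**61 - 1  # field modulus for the toy additive sharing scheme

def share(value, n_parties):
    """Split an integer into n additive shares summing to value mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

# Hypothetical records (age, length of stay); neither hospital reveals them.
hospital_a = [(30, 4), (45, 7), (60, 10)]
hospital_b = [(25, 3), (50, 8)]

def local_sums(records):
    """Sufficient statistics for the regression stay = a + b * age."""
    n = len(records)
    sx = sum(x for x, _ in records)
    sy = sum(y for _, y in records)
    sxx = sum(x * x for x, _ in records)
    sxy = sum(x * y for x, y in records)
    return [n, sx, sy, sxx, sxy]

# Each hospital secret-shares its local sums; each party then publishes
# only the sum of the shares it received, never the local sums themselves.
shares_a = [share(v, 2) for v in local_sums(hospital_a)]
shares_b = [share(v, 2) for v in local_sums(hospital_b)]
global_sums = []
for sa, sb in zip(shares_a, shares_b):
    partial_1 = (sa[0] + sb[0]) % PRIME  # published by party 1
    partial_2 = (sa[1] + sb[1]) % PRIME  # published by party 2
    global_sums.append((partial_1 + partial_2) % PRIME)

n, sx, sy, sxx, sxy = global_sums
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope of stay ~ age
a = (sy - b * sx) / n                          # intercept
```

Each published partial is uniformly random on its own, so neither hospital learns the other's local sums, yet together the partials reconstruct the global sums needed for the regression.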

SLIDE 20

Disclosure risk assessment

Privacy model. Reidentification privacy.

  • Avoid finding a record in a database.
  • hospital A has a database
  • a researcher asks for access to this database
  • how to prepare an anonymized database so that the researcher

cannot find a friend?

SLIDE 22

Disclosure risk assessment

Privacy model. k-Anonymity.

  • Avoid finding a record in a database

... by making each record indistinguishable from k − 1 other records.

  • Hospital A has a database
  • a researcher asks for access to this database
  • how to prepare an anonymized database so that the researcher

cannot find a friend?
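One way to reach k-anonymity is generalization of the quasi-identifiers. A minimal sketch (invented records; using a ZIP prefix and age decades as the generalization hierarchy is an assumption for illustration):

```python
from collections import Counter

def generalize(record):
    """Generalize quasi-identifiers: keep a 3-digit ZIP prefix, age decade."""
    zip_code, age = record
    return (zip_code[:3] + "**", (age // 10) * 10)

def is_k_anonymous(records, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return all(c >= k for c in Counter(records).values())

raw = [("54321", 34), ("54322", 37), ("54328", 31),
       ("98765", 62), ("98761", 65)]
masked = [generalize(r) for r in raw]

# raw: every record is unique, so it is 1-anonymous at best
# masked: groups ("543**", 30) x3 and ("987**", 60) x2 -> 2-anonymous
```

Real algorithms (e.g., microaggregation-based ones) search for the least generalization that still makes every group reach size k.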

SLIDE 23

Disclosure risk assessment

Privacy model. Differential privacy.

  • The output of a query to a database should not depend (much) on

whether a record is in the database or not.

  • hospital A has a database

age of patient, length of stay in hospital

  • How to compute an average length of stay in such a way that the

result does not depend (much) on whether or not we use the data of a particular person?
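A standard realization of this model is the Laplace mechanism: clip the values, compute the mean, and add noise scaled to sensitivity/ε. An illustrative sketch (the stay bounds, ε, and data are assumptions, not from the slides):

```python
import math
import random

def dp_average(values, epsilon, lower=0.0, upper=60.0):
    """Laplace mechanism for the mean of values clipped to [lower, upper].

    Changing one record moves the clipped mean by at most
    (upper - lower) / n, which is the sensitivity the noise is scaled to."""
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    true_avg = sum(clipped) / n
    scale = ((upper - lower) / n) / epsilon
    # Sample Laplace(0, scale) by inverse CDF from a uniform in [-0.5, 0.5)
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_avg + noise

stays = [4, 7, 10, 3, 8]                    # hypothetical lengths of stay
release = dp_average(stays, epsilon=0.5)    # noisy average, safe to publish
```

Smaller ε means more noise and stronger privacy; with a huge ε the released value is essentially the true clipped mean.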

SLIDE 25

Privacy models

  • Privacy models: quite a few competing models
  • differential privacy
  • secure multiparty computation
  • k-anonymity
  • computational anonymity
  • reidentification (record linkage)
  • uniqueness
  • result privacy
  • interval disclosure
  • integral privacy
  • ... and combined:
  • secure multiparty computation + differential privacy

SLIDE 27

Disclosure risk assessment

Disclosure risk.

  • Function known vs. unknown (ill-defined)
  • Identity disclosure vs. Attribute disclosure
  • Boolean vs. quantitative measures/models

Classification of privacy models (and measures)

                     | Boolean                                                       | Quantitative
Identity disclosure  | k-Anonymity                                                   | Re-identification (record linkage), Uniqueness
Attribute disclosure | Differential privacy, Result privacy, Secure multiparty comp. | Interval disclosure

SLIDE 28

Data protection mechanisms

SLIDE 29

Data protection mechanisms

  • Focus on respondent privacy
  • Classification w.r.t. our knowledge of the computation of a third party
  • Data-driven or general purpose (analysis not known)

→ anonymization methods / masking methods

  • Computation-driven or specific purpose (analysis known)

→ cryptographic protocols, differential privacy

  • Result-driven (analysis known: protection of its results)
  • Figure. Basic model (multiple/dynamic databases + multiple people)


SLIDE 30

Masking methods

SLIDE 31

Masking methods

Classification w.r.t. our knowledge of the computation of a third party

  • Data-driven or general purpose (analysis not known)

→ anonymization methods / masking methods

SLIDE 32

Masking methods

Anonymization/masking method: Given a data file X compute a file X′ with data of less quality.

[Figure: masking method transforms X into X′]
SLIDE 33

Masking methods: questions

[Figure: original microdata X → masking method → protected microdata X′; data analysis gives Result(X) and Result(X′); their comparison gives the information loss measure, and a disclosure risk measure is computed on X′]

SLIDE 37

Research questions I: Masking methods

Masking methods (anonymization methods). Build X′ from X.

  • Perturbative (less quality = erroneous data).

E.g., noise addition/multiplication, microaggregation, rank swapping

  • Non-perturbative (less quality = less detail).

E.g., generalization, suppression

  • Synthetic data generators (less quality = not real data).

E.g., (i) build a model from the data; (ii) generate data from the model
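Two of the perturbative methods just listed can be sketched in a few lines of Python (illustrative parameters; real tools tune the noise variance and the group size k against risk and information loss):

```python
import random
import statistics

def noise_addition(values, sd_fraction=0.1):
    """Additive noise proportional to the attribute's standard deviation."""
    sd = statistics.stdev(values) * sd_fraction
    return [v + random.gauss(0.0, sd) for v in values]

def microaggregate(values, k=3):
    """Univariate microaggregation: sort, form groups of at least k
    consecutive values, and replace each value by its group mean."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    masked = [0.0] * len(values)
    start = 0
    while start < len(order):
        # the last group absorbs the tail so every group has >= k records
        end = len(order) if len(order) - start < 2 * k else start + k
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
        start = end
    return masked

# Each masked value is now shared by at least k = 3 records
print(microaggregate([1, 2, 3, 10, 11, 12, 100], k=3))
# prints [2.0, 2.0, 2.0, 33.25, 33.25, 33.25, 33.25]
```

Microaggregation directly yields k-anonymity on the masked attribute, which is why it doubles as a masking method and a way to satisfy that privacy model.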

SLIDE 38

Research questions II: Information loss/Utility

Information loss measures. Compare X and X′ w.r.t. an analysis f: IL_f(X, X′) = divergence(f(X), f(X′))

  • f: generic vs. specific (data uses)
  • Statistics: mean, variance, regression
  • Machine learning: clustering, classification

For example, classification using decision trees

  • . . . specific measures for graphs

[Figure: X vs. X′: is f(X) = f(X′)?]
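The measure IL_f(X, X′) = divergence(f(X), f(X′)) is easy to instantiate. A toy sketch with invented data (echoing Anscombe's point that one statistic can be preserved while another is distorted):

```python
def information_loss(f, X, X_masked):
    """IL_f(X, X') = |f(X) - f(X')|, an absolute-divergence instance."""
    return abs(f(X) - f(X_masked))

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

X        = [4.0, 7.0, 10.0, 3.0, 8.0]   # hypothetical original attribute
X_masked = [5.0, 6.0, 11.0, 2.0, 8.0]   # hypothetical masked version

print(information_loss(mean, X, X_masked))      # 0.0: mean is preserved
print(information_loss(variance, X, X_masked))  # about 2.4: variance is not
```

A generic measure aggregates several such statistics; a specific measure uses exactly the analysis the data user intends to run.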

SLIDE 39

Research questions II: Information loss/Utility

Information loss measures. Compare X and X′ w.r.t. analysis (f)

  • f: generic vs. specific (data uses). E.g. regression
[Figure: Anscombe's datasets, the same regression line arising from very different data]

  • Comparison: IL_f(X, X′) = divergence(f(X), f(X′))

[Figure: original data X (private) → masking method → masked data X′ (public); the same statistical analysis / data mining is applied to both: is f(X) = f(X′)?]

SLIDE 40

Research questions II: Information loss/Utility

Information loss measures. Compare X and X′ w.r.t. analysis (f)

  • f: generic vs. specific (data uses). E.g. clustering
[Figure: two clusters of different size, original vs. masked data]

  • Comparison: IL_f(X, X′) = divergence(f(X), f(X′))

[Figure: original data X (private) → masking method → masked data X′ (public); the same statistical analysis / data mining is applied to both: is f(X) = f(X′)?]

SLIDE 41

Research questions III: Disclosure risk

Disclosure risk. One of the privacy models: reidentification (identity disclosure)

  • A: File with the protected data set
  • B: File with the data from the intruder (subset of original X)

[Figure: record linkage between the intruder's file B and the protected file X′ (= A)]
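Distance-based record linkage turns this into an empirical risk measure: link each intruder record to its nearest masked record and count the correct matches. A toy sketch (invented records; real attacks normalize attributes and use more refined distances):

```python
def nearest(record, candidates):
    """Index of the candidate record at minimal squared Euclidean distance."""
    dists = [sum((u - v) ** 2 for u, v in zip(record, c)) for c in candidates]
    return dists.index(min(dists))

# A: the protected file X', in the same order as the original records
A = [(31.0, 4.2), (46.0, 6.8), (58.0, 10.3)]
# B: the intruder's file, original values keyed by their true index in X
B = {0: (30.0, 4.0), 2: (60.0, 10.0)}

hits = sum(1 for true_idx, rec in B.items() if nearest(rec, A) == true_idx)
risk = hits / len(B)   # fraction of intruder records correctly re-identified
print(risk)            # 1.0 here: both records link back correctly
```

A low linkage rate suggests the masking perturbed the data enough; a high rate means identity disclosure is likely despite the masking.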

SLIDE 42

Tabular data

SLIDE 43

Tabular data

  • Aggregates of data with respect to a few variables. Ex. (Castro, 2012)

        P1    P2    P3    P4    P5   Total
M1       2    15    30    20    10     77
M2      72    20     1    30    10    133
M3      38    38    15    40     5    136
TOTAL  112    73    46    90    25    346

Cell (M2, P3): number of people with profession P3 living in municipality M2.

        P1    P2    P3    P4    P5   Total
M1     360   450   720   400   360   2290
M2    1440   540    22   570   320   2892
M3     722  1178   375   800   363   3438
TOTAL 2522  2168  1117  1770  1043   8620

Cell (M2, P3): total salary received by people with profession P3 living in M2.

SLIDE 46

Tabular data

  • Aggregates of data do not avoid disclosure
  • External attack. Combining the information of the two tables, the

adversary is able to infer some sensitive information. ⇒ (M2, P3): the count table shows a single contributor, so the salary table reveals that person's exact salary.

  • Internal attack. A person whose data is in the database is able to

use the information of the tables to infer some sensitive information about other individuals. A doctor infers the salary of another doctor. ⇒ (M1, P1): with two contributors, each can subtract their own salary from the cell total.

  • Internal attack with dominance. An internal attack where the

contribution of one person, say p0, to a cell is so high that it permits p0 to obtain accurate bounds on the contributions of the others. ⇒ (M3, P5) with 5 people: if salary(p0) = 350, then the total salary of the other four is at most 363 − 350 = 13.

SLIDE 47

Tabular data

  • Privacy model / disclosure risk measure
  • Data protection mechanism
  • Information loss

SLIDE 48

Tabular data: privacy model

  • Rule (n, k)-dominance.

A cell is sensitive when n contributions represent more than a fraction k of the total. That is, the cell is sensitive when

    ( ∑_{i=1}^{n} c_{σ(i)} ) / ( ∑_{i=1}^{t} c_i ) > k

where σ is a permutation of {1, ..., t} such that c_{σ(i−1)} ≥ c_{σ(i)} for all i ∈ {2, ..., t} (i.e., c_{σ(i)} is the ith largest element in the collection c_1, ..., c_t). This rule is used with n = 1 or n = 2 and k > 0.6.
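The rule is straightforward to check. A sketch using the (M3, P5) salary cell from the earlier table (the total 363 and the dominant contribution 350 come from the slides; the remaining split is invented for illustration):

```python
def dominance_sensitive(contributions, n=1, k=0.6):
    """(n, k)-dominance: sensitive if the n largest contributions
    exceed the fraction k of the cell total."""
    c = sorted(contributions, reverse=True)
    return sum(c[:n]) > k * sum(c)

cell_m3_p5 = [350, 5, 4, 3, 1]          # total 363, dominated by one person
print(dominance_sensitive(cell_m3_p5))  # True: 350/363 > 0.6

balanced = [80, 75, 73, 70, 65]         # total 363, no dominant contributor
print(dominance_sensitive(balanced))    # False
```

Sensitive cells are then handled by the protection mechanisms of the next slides (suppression, rounding, CTA).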

SLIDE 49

Tabular data: privacy model

  • Rule pq. This rule is also known as the prior/posterior rule. It is based

on two positive parameters p and q with p < q. Prior to the publication of the table, any intruder can estimate the contribution of contributors to within q percent. Then, a cell is considered sensitive if an intruder, in the light of the released table, can estimate the contribution of a contributor to within p percent.

  • Rule p%. This rule can be seen as a special case of the previous rule

when no prior knowledge is assumed on any cell. Because of that, it can be seen as equivalent to the previous rule with q = 100.
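A common reading of the p% rule can be sketched as follows (an illustrative formalization; consult an SDC handbook for the exact operational definition): the second-largest contributor subtracts its own value from the published total, and the cell is sensitive if the remainder bounds the largest contribution to within p percent:

```python
def p_percent_sensitive(contributions, p=10.0):
    """p% rule sketch: contributor 2 bounds contributor 1's value by
    total - c2; the cell is sensitive if that bound is within p% of c1."""
    c = sorted(contributions, reverse=True)
    upper_bound = sum(c) - c[1]       # what the 2nd contributor can deduce
    return upper_bound - c[0] < (p / 100.0) * c[0]

print(p_percent_sensitive([350, 5, 4, 3, 1]))    # True: bound 358 vs 350
print(p_percent_sensitive([80, 75, 73, 70, 65])) # False: bound 288 vs 80
```

The dominated cell from the previous slide is flagged here too: the second contributor can pin the dominant salary down to within about 2%.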

SLIDE 50

Tabular data: data protection mechanism

  • Protection of tabular data
  • Perturbative

⋆ Post-tabular: rounding, controlled tabular adjustment (CTA)
⋆ Pre-tabular

  • Non-perturbative: cell suppression

SLIDE 51

Tabular data: data protection mechanism

  • Protection of tabular data: cell suppression
  • Primary suppression is not enough:

        P1    P2    P3    P4    P5   Total
M1     360   450   720   400   360   2290
M2    1440   540     –   570   320   2892
M3     722  1178   375   800   363   3438
TOTAL 2522  2168  1117  1770  1043   8620

  • Secondary suppressions required:

        P1    P2    P3    P4    P5   Total
M1     360   450     –   400     –   2290
M2    1440   540     –   570     –   2892
M3     722  1178   375   800   363   3438
TOTAL 2522  2168  1117  1770  1043   8620

  • Solutions built using optimization

SLIDE 52

Tabular data: information loss

  • Minimal number of suppressions
  • Weights associated with cells: minimal total weight of suppressed cells

SLIDE 53

Summary

SLIDE 54

Summary

  • Privacy models
  • Microdata / standard databases
  • Tabular data

SLIDE 55

Thank you

SLIDE 56

References

  • V. Torra, Data Privacy: Foundations, New Developments and the Big Data Challenge, Springer, 2017.
  • T. Benschop, C. Machingauta, M. Welch, Statistical Disclosure Control for Microdata: A Practical Guide, 2016.
  • A. Hundepool, J. Domingo-Ferrer, L. Franconi, S. Giessing, E. S. Nordholt, K. Spicer, P.-P. de Wolf, Statistical Disclosure Control, Wiley, 2012.
  • M. Templ, Statistical Disclosure Control for Microdata: Methods and Applications in R, Springer, 2017.
  • J. Castro, Recent advances in optimization techniques for statistical tabular data protection, European Journal of Operational Research 216 (2012) 257-269.

SLIDE 57

Book

  • Vicenç Torra, Data Privacy: Foundations, New Developments and the Big Data Challenge, Springer, 2017.

Content: 1. Introduction. 2. Machine and statistical learning. 3. On the classification of protection procedures. 4. User's privacy. 5. Privacy models and disclosure risk measures. 6. Masking methods. 7. Information loss: evaluation and measures. 8. Selection of masking methods. 9. Conclusions.

Includes sections on masking methods and transparency, and variants for big data. User privacy for communications and information retrieval (PIR).
