Data privacy: introduction
Vicenç Torra, January 15, 2018
PowerPoint presentation


SLIDE 1

Oslo, 2018

Data privacy: introduction
Vicenç Torra, January 15, 2018

Privacy, Information and Cyber-Security Center SAIL, School of Informatics, University of Skövde, Sweden
SLIDE 2

Outline

  • 1. Motivation
  • 2. Privacy models and disclosure risk assessment
  • 3. Data protection mechanisms
  • 4. Masking methods
  • 5. Summary

Oslo, 2018 1 / 45

SLIDE 3

Motivation

SLIDE 4

Introduction

  • Data privacy: core
  • Someone needs access to the data to perform an authorized analysis,

but access to the data and the results of the analysis should avoid disclosure.

E.g., you are authorized to compute the average stay in a hospital, but maybe you are not authorized to see the length of stay of your neighbor.

Vicenç Torra; Data privacy. Oslo, 2018. 3 / 45

SLIDE 5

Introduction

  • Data privacy: boundaries
  • Database in a computer or in a removable device

⇒ access control to avoid unauthorized access ⇒ access to address (admissions), access to blood tests (admissions?)

  • Data is transmitted

⇒ security technology to avoid unauthorized access ⇒ data from a blood glucose meter sent to the hospital; network sniffers

Transmission is sensitive: near-miss/near-hit reports to car manufacturers

[Figure: access control, security, and privacy]

SLIDE 6

Difficulties

  • Difficulties: Naive anonymization does not work

Passenger manifest for the Missouri, arriving February 15, 1882, Port of Boston¹: Names, Age, Sex, Occupation, Place of birth, Last place of residence, Yes/No, Condition (healthy?)

¹ https://www.sec.state.ma.us/arc/arcgen/genidx.htm

SLIDE 7

Difficulties

  • Difficulties: highly identifiable data
  • (Sweeney, 1997) on USA population

⋆ 87.1% (216 million of 248 million) had characteristics that likely made them unique based on 5-digit ZIP, gender, and date of birth
⋆ 3.7% (9.1 million) had characteristics that likely made them unique based on 5-digit ZIP, gender, and month and year of birth

SLIDE 8

Difficulties

  • Difficulties: highly identifiable data
  • Data from mobile devices:

⋆ two positions can make you unique (home and workplace)

  • AOL² and Netflix cases (search logs and movie ratings)

⇒ User No. 4417749: hundreds of searches over a three-month period, including queries such as 'landscapers in Lilburn, Ga' ⇒ Thelma Arnold identified! ⇒ individual users matched with film ratings on the Internet Movie Database.

  • Similar with credit card payments, shopping carts, ...

(i.e., high dimensional data)

² http://www.nytimes.com/2006/08/09/technology/09aol.html

SLIDE 12

Difficulties

  • Difficulties: highly identifiable data
  • Example #1:

⋆ University goal: learn how sickness is influenced by studies and by commuting distance
⋆ Data: where students live, what they study, whether they got sick
⋆ No “personal data”; is this OK?
⋆ NO!! How many in your degree live in your town?

  • Example #2:

⋆ Car company goal: study driving behaviour in the morning
⋆ Data: first drive (GPS origin + destination, time) × 30 days
⋆ No “personal data”; is this OK?
⋆ NO!! How many cars go from your parking spot to your university every morning? Are you exceeding the speed limit? Are you visiting a psychiatrist every Tuesday?

SLIDE 13

Difficulties

  • Data privacy is “impossible”, or is it?
  • Privacy vs. utility
  • Privacy vs. security
  • Computationally feasible

SLIDE 14

Privacy models and disclosure risk assessment

SLIDE 15

Disclosure risk assessment

Privacy models: what is a privacy model?

  • To build a protection mechanism we first need to define what we want to protect

SLIDE 16

Disclosure risk assessment

Disclosure risk. Disclosure: leakage of information.

  • Identity disclosure vs. Attribute disclosure
  • Attribute disclosure: (e.g. learn about Alice’s salary)

⋆ Increase knowledge about an attribute of an individual

  • Identity disclosure: (e.g. find Alice in the database)

⋆ Find/identify an individual in a database (e.g., a masked file)

Within machine learning, some attribute disclosure is expected.

SLIDE 17

Disclosure risk assessment

Disclosure risk.

  • Boolean vs. quantitative privacy models
  • Boolean: disclosure either takes place or not; check whether the definition holds. Includes definitions based on a threshold.
  • Quantitative: disclosure is a matter of degree that can be quantified; some risk is permitted.
  • Minimize information loss (maximize utility) vs. multiobjective optimization

SLIDE 18

Disclosure risk assessment

Privacy models. (selection)

  • Secure multiparty computation. Several parties want to compute

a function of their databases, but only sharing the result.

  • Reidentification privacy. Avoid finding a record in a database.
  • k-Anonymity. Each record is indistinguishable from k − 1 other records.
  • Differential privacy. The output of a query to a database should

not depend (much) on whether a record is in the database or not.

SLIDE 19

Disclosure risk assessment

Privacy model. Secure multiparty computation.

  • Several parties want to compute a function of their databases, but

only sharing the result.

  • Hospital A and hospital B,
  • two independent databases with:

age of patient, length of stay in hospital

  • How to compute a regression age → length with all data (both databases) without sharing the data?
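The regression age → length only needs a few aggregate sums (n, Σx, Σy, Σx², Σxy), so one way to meet this model is a secure-sum protocol over those sums. A toy sketch in Python (the records and the additive secret-sharing scheme are illustrative assumptions, not a production protocol):

```python
import random

PRIME = 2**61 - 1  # field modulus for the toy additive sharing scheme

def share(value, n_parties):
    """Split an integer into n additive shares summing to value mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

# Hypothetical records (age, length of stay); neither hospital reveals them.
hospital_a = [(30, 4), (45, 7), (60, 10)]
hospital_b = [(25, 3), (50, 8)]

def local_sums(records):
    """Sufficient statistics for the regression stay = a + b * age."""
    n = len(records)
    sx = sum(x for x, _ in records)
    sy = sum(y for _, y in records)
    sxx = sum(x * x for x, _ in records)
    sxy = sum(x * y for x, y in records)
    return [n, sx, sy, sxx, sxy]

# Each hospital secret-shares its local sums; each party then publishes
# only the sum of the shares it received, never the local sums themselves.
shares_a = [share(v, 2) for v in local_sums(hospital_a)]
shares_b = [share(v, 2) for v in local_sums(hospital_b)]
global_sums = []
for sa, sb in zip(shares_a, shares_b):
    partial_1 = (sa[0] + sb[0]) % PRIME  # published by party 1
    partial_2 = (sa[1] + sb[1]) % PRIME  # published by party 2
    global_sums.append((partial_1 + partial_2) % PRIME)

n, sx, sy, sxx, sxy = global_sums
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope of stay ~ age
a = (sy - b * sx) / n                          # intercept
```

Each published partial is uniformly random on its own, so neither hospital learns the other's local sums, yet together the partials reconstruct the global sums needed for the regression.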

SLIDE 20

Disclosure risk assessment

Privacy model. Reidentification privacy.

  • Avoid finding a record in a database.
  • hospital A has a database
  • a researcher asks for access to this database
  • how to prepare an anonymized database so that the researcher

cannot find a friend?

SLIDE 22

Disclosure risk assessment

Privacy model. k-Anonymity.

  • Avoid finding a record in a database

... by making each record indistinguishable from k − 1 other records.

  • Hospital A has a database
  • a researcher asks for access to this database
  • how to prepare an anonymized database so that the researcher

cannot find a friend?
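One way to reach k-anonymity is generalization of the quasi-identifiers. A minimal sketch (invented records; using a ZIP prefix and age decades as the generalization hierarchy is an assumption for illustration):

```python
from collections import Counter

def generalize(record):
    """Generalize quasi-identifiers: keep a 3-digit ZIP prefix, age decade."""
    zip_code, age = record
    return (zip_code[:3] + "**", (age // 10) * 10)

def is_k_anonymous(records, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return all(c >= k for c in Counter(records).values())

raw = [("54321", 34), ("54322", 37), ("54328", 31),
       ("98765", 62), ("98761", 65)]
masked = [generalize(r) for r in raw]

# raw: every record is unique, so it is 1-anonymous at best
# masked: groups ("543**", 30) x3 and ("987**", 60) x2 -> 2-anonymous
```

Real algorithms (e.g., microaggregation-based ones) search for the least generalization that still makes every group reach size k.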

SLIDE 23

Disclosure risk assessment

Privacy model. Differential privacy.

  • The output of a query to a database should not depend (much) on

whether a record is in the database or not.

  • hospital A has a database

age of patient, length of stay in hospital

  • How to compute an average length of stay in such a way that the

result does not depend (much) on whether or not we use the data of a particular person?
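A standard realization of this model is the Laplace mechanism: clip the values, compute the mean, and add noise scaled to sensitivity/ε. An illustrative sketch (the stay bounds, ε, and data are assumptions, not from the slides):

```python
import math
import random

def dp_average(values, epsilon, lower=0.0, upper=60.0):
    """Laplace mechanism for the mean of values clipped to [lower, upper].

    Changing one record moves the clipped mean by at most
    (upper - lower) / n, which is the sensitivity the noise is scaled to."""
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    true_avg = sum(clipped) / n
    scale = ((upper - lower) / n) / epsilon
    # Sample Laplace(0, scale) by inverse CDF from a uniform in [-0.5, 0.5)
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_avg + noise

stays = [4, 7, 10, 3, 8]                    # hypothetical lengths of stay
release = dp_average(stays, epsilon=0.5)    # noisy average, safe to publish
```

Smaller ε means more noise and stronger privacy; with a huge ε the released value is essentially the true clipped mean.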

SLIDE 25

Privacy models

  • Privacy models: quite a few competing models
  • differential privacy
  • secure multiparty computation
  • k-anonymity
  • computational anonymity
  • reidentification (record linkage)
  • uniqueness
  • result privacy
  • interval disclosure
  • integral privacy
  • ... and combined:
  • secure multiparty computation + differential privacy

SLIDE 27

Disclosure risk assessment

Disclosure risk.

  • Function known vs. unknown (ill-defined)
  • Identity disclosure vs. Attribute disclosure
  • Boolean vs. quantitative measures/models

Classification of privacy models (and measures)

                     | Boolean                                                       | Quantitative
Identity disclosure  | k-Anonymity                                                   | Re-identification (record linkage), Uniqueness
Attribute disclosure | Differential privacy, Result privacy, Secure multiparty comp. | Interval disclosure

SLIDE 28

Data protection mechanisms

SLIDE 29

Data protection mechanisms

  • Focus on respondent privacy
  • Classification w.r.t. our knowledge of the computation of a third party
  • Data-driven or general purpose (analysis not known)

→ anonymization methods / masking methods

  • Computation-driven or specific purpose (analysis known)

→ cryptographic protocols, differential privacy

  • Result-driven (analysis known: protection of its results)
  • Figure. Basic model (multiple/dynamic databases + multiple people)


SLIDE 30

Masking methods

SLIDE 31

Masking methods

Classification w.r.t. our knowledge of the computation of a third party

  • Data-driven or general purpose (analysis not known)

→ anonymization methods / masking methods

SLIDE 32

Masking methods

Anonymization/masking method: Given a data file X compute a file X′ with data of less quality.

[Figure: masking method transforms X into X′]
SLIDE 33

Masking methods: questions

[Figure: original microdata X → masking method → protected microdata X′; data analysis gives Result(X) and Result(X′); their comparison gives the information loss measure, and a disclosure risk measure is computed on X′]

SLIDE 37

Research questions I: Masking methods

Masking methods (anonymization methods). Build X′ from X.

  • Perturbative (less quality = erroneous data).

E.g., noise addition/multiplication, microaggregation, rank swapping

  • Non-perturbative (less quality = less detail).

E.g., generalization, suppression

  • Synthetic data generators (less quality = not real data).

E.g., (i) build a model from the data; (ii) generate data from the model
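Two of the perturbative methods just listed can be sketched in a few lines of Python (illustrative parameters; real tools tune the noise variance and the group size k against risk and information loss):

```python
import random
import statistics

def noise_addition(values, sd_fraction=0.1):
    """Additive noise proportional to the attribute's standard deviation."""
    sd = statistics.stdev(values) * sd_fraction
    return [v + random.gauss(0.0, sd) for v in values]

def microaggregate(values, k=3):
    """Univariate microaggregation: sort, form groups of at least k
    consecutive values, and replace each value by its group mean."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    masked = [0.0] * len(values)
    start = 0
    while start < len(order):
        # the last group absorbs the tail so every group has >= k records
        end = len(order) if len(order) - start < 2 * k else start + k
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
        start = end
    return masked

# Each masked value is now shared by at least k = 3 records
print(microaggregate([1, 2, 3, 10, 11, 12, 100], k=3))
# prints [2.0, 2.0, 2.0, 33.25, 33.25, 33.25, 33.25]
```

Microaggregation directly yields k-anonymity on the masked attribute, which is why it doubles as a masking method and a way to satisfy that privacy model.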

SLIDE 38

Research questions II: Information loss/Utility

Information loss measures. Compare X and X′ w.r.t. an analysis f: IL_f(X, X′) = divergence(f(X), f(X′))

  • f: generic vs. specific (data uses)
  • Statistics: mean, variance, regression
  • Machine learning: clustering, classification

For example, classification using decision trees

  • . . . specific measures for graphs

[Figure: X vs. X′: is f(X) = f(X′)?]
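The measure IL_f(X, X′) = divergence(f(X), f(X′)) is easy to instantiate. A toy sketch with invented data (echoing Anscombe's point that one statistic can be preserved while another is distorted):

```python
def information_loss(f, X, X_masked):
    """IL_f(X, X') = |f(X) - f(X')|, an absolute-divergence instance."""
    return abs(f(X) - f(X_masked))

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

X        = [4.0, 7.0, 10.0, 3.0, 8.0]   # hypothetical original attribute
X_masked = [5.0, 6.0, 11.0, 2.0, 8.0]   # hypothetical masked version

print(information_loss(mean, X, X_masked))      # 0.0: mean is preserved
print(information_loss(variance, X, X_masked))  # about 2.4: variance is not
```

A generic measure aggregates several such statistics; a specific measure uses exactly the analysis the data user intends to run.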

SLIDE 39

Research questions II: Information loss/Utility

Information loss measures. Compare X and X′ w.r.t. analysis (f)

  • f: generic vs. specific (data uses). E.g. regression
[Figure: Anscombe's datasets, the same regression line arising from very different data]

  • Comparison: IL_f(X, X′) = divergence(f(X), f(X′))

[Figure: original data X (private) → masking method → masked data X′ (public); the same statistical analysis / data mining is applied to both: is f(X) = f(X′)?]

SLIDE 40

Research questions II: Information loss/Utility

Information loss measures. Compare X and X′ w.r.t. analysis (f)

  • f: generic vs. specific (data uses). E.g. clustering
[Figure: two clusters of different size, original vs. masked data]

  • Comparison: IL_f(X, X′) = divergence(f(X), f(X′))

[Figure: original data X (private) → masking method → masked data X′ (public); the same statistical analysis / data mining is applied to both: is f(X) = f(X′)?]

SLIDE 41

Research questions III: Disclosure risk

Disclosure risk. One of the privacy models: reidentification (identity disclosure)

  • A: File with the protected data set
  • B: File with the data from the intruder (subset of original X)

[Figure: record linkage between the intruder's file B and the protected file X′ (= A)]
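Distance-based record linkage turns this into an empirical risk measure: link each intruder record to its nearest masked record and count the correct matches. A toy sketch (invented records; real attacks normalize attributes and use more refined distances):

```python
def nearest(record, candidates):
    """Index of the candidate record at minimal squared Euclidean distance."""
    dists = [sum((u - v) ** 2 for u, v in zip(record, c)) for c in candidates]
    return dists.index(min(dists))

# A: the protected file X', in the same order as the original records
A = [(31.0, 4.2), (46.0, 6.8), (58.0, 10.3)]
# B: the intruder's file, original values keyed by their true index in X
B = {0: (30.0, 4.0), 2: (60.0, 10.0)}

hits = sum(1 for true_idx, rec in B.items() if nearest(rec, A) == true_idx)
risk = hits / len(B)   # fraction of intruder records correctly re-identified
print(risk)            # 1.0 here: both records link back correctly
```

A low linkage rate suggests the masking perturbed the data enough; a high rate means identity disclosure is likely despite the masking.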

SLIDE 42

Tabular data

SLIDE 43

Tabular data

  • Aggregates of data with respect to a few variables. Ex. (Castro, 2012)

        P1    P2    P3    P4    P5   Total
M1       2    15    30    20    10     77
M2      72    20     1    30    10    133
M3      38    38    15    40     5    136
TOTAL  112    73    46    90    25    346

Cell (M2, P3): number of people with profession P3 living in municipality M2.

        P1    P2    P3    P4    P5   Total
M1     360   450   720   400   360   2290
M2    1440   540    22   570   320   2892
M3     722  1178   375   800   363   3438
TOTAL 2522  2168  1117  1770  1043   8620

Cell (M2, P3): total salary received by people with profession P3 living in M2.

SLIDE 46

Tabular data

  • Aggregates of data do not avoid disclosure
  • External attack. Combining the information of the two tables, the

adversary is able to infer some sensitive information. ⇒ (M2, P3): the count table shows a single contributor, so the salary table reveals that person's exact salary.

  • Internal attack. A person whose data is in the database is able to

use the information of the tables to infer some sensitive information about other individuals. A doctor infers the salary of another doctor. ⇒ (M1, P1): with two contributors, each can subtract their own salary from the cell total.

  • Internal attack with dominance. An internal attack where the

contribution of one person, say p0, to a cell is so high that it permits p0 to obtain accurate bounds on the contributions of the others. ⇒ (M3, P5) with 5 people: if salary(p0) = 350, then the total salary of the other four is at most 363 − 350 = 13.

SLIDE 47

Tabular data

  • Privacy model / disclosure risk measure
  • Data protection mechanism
  • Information loss

SLIDE 48

Tabular data: privacy model

  • Rule (n, k)-dominance.

A cell is sensitive when n contributions represent more than a fraction k of the total. That is, the cell is sensitive when

    ( ∑_{i=1}^{n} c_{σ(i)} ) / ( ∑_{i=1}^{t} c_i ) > k

where σ is a permutation of {1, ..., t} such that c_{σ(i−1)} ≥ c_{σ(i)} for all i ∈ {2, ..., t} (i.e., c_{σ(i)} is the ith largest element in the collection c_1, ..., c_t). This rule is used with n = 1 or n = 2 and k > 0.6.
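The rule is straightforward to check. A sketch using the (M3, P5) salary cell from the earlier table (the total 363 and the dominant contribution 350 come from the slides; the remaining split is invented for illustration):

```python
def dominance_sensitive(contributions, n=1, k=0.6):
    """(n, k)-dominance: sensitive if the n largest contributions
    exceed the fraction k of the cell total."""
    c = sorted(contributions, reverse=True)
    return sum(c[:n]) > k * sum(c)

cell_m3_p5 = [350, 5, 4, 3, 1]          # total 363, dominated by one person
print(dominance_sensitive(cell_m3_p5))  # True: 350/363 > 0.6

balanced = [80, 75, 73, 70, 65]         # total 363, no dominant contributor
print(dominance_sensitive(balanced))    # False
```

Sensitive cells are then handled by the protection mechanisms of the next slides (suppression, rounding, CTA).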

SLIDE 49

Tabular data: privacy model

  • Rule pq. This rule is also known as the prior/posterior rule. It is based

on two positive parameters p and q with p < q. Prior to the publication of the table, any intruder can estimate the contribution of contributors to within q percent. Then, a cell is considered sensitive if an intruder, in the light of the released table, can estimate the contribution of a contributor to within p percent.

  • Rule p%. This rule can be seen as a special case of the previous rule

when no prior knowledge is assumed on any cell. Because of that, it can be seen as equivalent to the previous rule with q = 100.
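A common reading of the p% rule can be sketched as follows (an illustrative formalization; consult an SDC handbook for the exact operational definition): the second-largest contributor subtracts its own value from the published total, and the cell is sensitive if the remainder bounds the largest contribution to within p percent:

```python
def p_percent_sensitive(contributions, p=10.0):
    """p% rule sketch: contributor 2 bounds contributor 1's value by
    total - c2; the cell is sensitive if that bound is within p% of c1."""
    c = sorted(contributions, reverse=True)
    upper_bound = sum(c) - c[1]       # what the 2nd contributor can deduce
    return upper_bound - c[0] < (p / 100.0) * c[0]

print(p_percent_sensitive([350, 5, 4, 3, 1]))    # True: bound 358 vs 350
print(p_percent_sensitive([80, 75, 73, 70, 65])) # False: bound 288 vs 80
```

The dominated cell from the previous slide is flagged here too: the second contributor can pin the dominant salary down to within about 2%.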

SLIDE 50

Tabular data: data protection mechanism

  • Protection of tabular data
  • Perturbative

⋆ Post-tabular: rounding, controlled tabular adjustment (CTA)
⋆ Pre-tabular

  • Non-perturbative: cell suppression

SLIDE 51

Tabular data: data protection mechanism

  • Protection of tabular data: cell suppression
  • Primary suppression is not enough:

        P1    P2    P3    P4    P5   Total
M1     360   450   720   400   360   2290
M2    1440   540     –   570   320   2892
M3     722  1178   375   800   363   3438
TOTAL 2522  2168  1117  1770  1043   8620

  • Secondary suppressions required:

        P1    P2    P3    P4    P5   Total
M1     360   450     –   400     –   2290
M2    1440   540     –   570     –   2892
M3     722  1178   375   800   363   3438
TOTAL 2522  2168  1117  1770  1043   8620

  • Solutions built using optimization

SLIDE 52

Tabular data: information loss

  • Minimal number of suppressions
  • Weights associated with cells: minimal total weight of suppressed cells

SLIDE 53

Summary

SLIDE 54

Summary

  • Privacy models
  • Microdata / standard databases
  • Tabular data

SLIDE 55

Thank you

SLIDE 56

References

  • V. Torra, Data Privacy: Foundations, New Developments and the Big Data Challenge, Springer, 2017.
  • T. Benschop, C. Machingauta, M. Welch, Statistical Disclosure Control for Microdata: A Practical Guide, 2016.
  • A. Hundepool, J. Domingo-Ferrer, L. Franconi, S. Giessing, E. S. Nordholt, K. Spicer, P.-P. de Wolf, Statistical Disclosure Control, Wiley, 2012.
  • M. Templ, Statistical Disclosure Control for Microdata: Methods and Applications in R, Springer, 2017.
  • J. Castro, Recent advances in optimization techniques for statistical tabular data protection, European Journal of Operational Research 216 (2012) 257-269.

SLIDE 57

Book

  • Vicenç Torra, Data Privacy: Foundations, New Developments and the Big Data Challenge, Springer, 2017.

Content: 1. Introduction. 2. Machine and statistical learning. 3. On the classification of protection procedures. 4. User's privacy. 5. Privacy models and disclosure risk measures. 6. Masking methods. 7. Information loss: evaluation and measures. 8. Selection of masking methods. 9. Conclusions.

Includes sections on masking methods and transparency, and variants for big data. User privacy for communications and information retrieval (PIR).
