
SLIDE 1

IEEE Chapter Meeting, Skövde

Data privacy. A briefer.

Vicenç Torra (vtorra@ieee.org)

May 31st, 2018

Privacy, Information and Cyber-Security Center SAIL, School of Informatics, University of Skövde, Sweden
SLIDE 2

Outline

  • 1. Motivation
  • 2. Privacy models and disclosure risk assessment
  • 3. Data protection mechanisms
  • 4. Disclosure risk: The worst-case scenario
  • 5. Summary


SLIDE 3

Motivation

SLIDE 4

Motivation

  • Data privacy (for databases):
  • Someone needs to access the data to perform an authorized analysis, but neither the access to the data nor the result of the analysis should lead to disclosure.

E.g., you may be authorized to compute the average stay in a hospital, but not to see the length of stay of your neighbor.


SLIDE 5

Difficulties

  • Difficulties: naive anonymization does not work

Passenger manifest for the Missouri, arriving February 15, 1882, Port of Boston¹: names, age, sex, occupation, place of birth, last place of residence, yes/no, condition (healthy?)

¹ https://www.sec.state.ma.us/arc/arcgen/genidx.htm

SLIDE 6

Difficulties

  • Difficulties: highly identifiable data
  • (Sweeney, 1997), on the USA population:

⋆ 87.1% (216 million of 248 million) were likely unique given 5-digit ZIP, gender, and date of birth
⋆ 3.7% (9.1 million) were likely unique given 5-digit ZIP, gender, and month and year of birth
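As an illustration, here is a minimal sketch of such a uniqueness estimate on a toy table; the pandas usage is standard, but the table and the column names (zip, gender, dob) are hypothetical.

```python
# Estimate what fraction of records a set of quasi-identifiers makes unique.
import pandas as pd

def uniqueness_rate(df: pd.DataFrame, quasi_identifiers: list) -> float:
    """Fraction of records whose quasi-identifier combination occurs once."""
    sizes = df.groupby(quasi_identifiers).size()  # records per QI combination
    return (sizes == 1).sum() / len(df)

# Hypothetical toy data.
people = pd.DataFrame({
    "zip":    ["54128", "54128", "11745", "11745"],
    "gender": ["F", "M", "F", "F"],
    "dob":    ["1980-01-01", "1980-01-01", "1975-06-30", "1975-07-01"],
})
print(uniqueness_rate(people, ["zip", "gender", "dob"]))  # 1.0: all records unique
print(uniqueness_rate(people, ["zip", "gender"]))         # 0.5: dropping dob helps
```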


SLIDE 7

Difficulties

  • Difficulties: highly identifiable data
  • Data from mobile devices:

⋆ two positions can make you unique (home and workplace)

  • The AOL² and Netflix cases (search logs and movie ratings)

⇒ User No. 4417749: hundreds of searches over a three-month period, including queries such as "landscapers in Lilburn, Ga" ⇒ Thelma Arnold identified!

⇒ individual Netflix users matched with film ratings on the Internet Movie Database

  • Similar with credit card payments, shopping carts, ... (i.e., high-dimensional data)

² http://www.nytimes.com/2006/08/09/technology/09aol.html

SLIDE 8–11

Difficulties

  • Difficulties: highly identifiable data
  • Example #1:

⋆ University goal: learn how sickness is influenced by studies and by commuting distance
⋆ Data: where students live, what they study, whether they got sick
⋆ No "personal data", so is this OK?
⋆ NO!: how many students in your degree live in your town?

  • Example #2:

⋆ Car company goal: study driving behaviour in the morning
⋆ Data: first drive (GPS origin + destination, time) × 30 days
⋆ No "personal data", so is this OK?
⋆ NO!: how many cars go from your parking place to your university every morning? Are you exceeding the speed limit? Are you visiting a psychiatrist every Tuesday?
slide-12
SLIDE 12

Motivation Outline

Difficulties

  • Data privacy is “impossible”, or not ?
  • Privacy vs. utility
  • Privacy vs. security
  • Computationally feasible


SLIDE 13

Privacy models and disclosure risk assessment

SLIDE 14

Privacy models

Privacy models: what is a privacy model?

  • To build a program we need to know what we want to protect


SLIDE 15

Privacy models

Disclosure risk. Disclosure: leakage of information.

  • Identity disclosure vs. attribute disclosure
  • Attribute disclosure (e.g., learn about Alice's salary):

⋆ increase knowledge about an attribute of an individual

  • Identity disclosure (e.g., find Alice in the database):

⋆ find/identify an individual in a database (e.g., a masked file)

Within machine learning, some attribute disclosure is expected.


SLIDE 16

Privacy models

Disclosure risk.

  • Boolean vs. quantitative privacy models
  • Boolean: disclosure either takes place or not; check whether the definition holds. Includes definitions based on a threshold.
  • Quantitative: disclosure is a matter of degree that can be quantified; some risk is permitted.
  • Minimize information loss (maximize utility) vs. multiobjective optimization


SLIDE 17–18

Privacy models

Privacy models: quite a few competing models

  • Secure multiparty computation. Several parties want to compute a function of their databases, sharing only the result.
  • Reidentification privacy. Avoid finding a record in a database.
  • k-Anonymity. Each record is indistinguishable from k − 1 other records.
  • Differential privacy. The output of a query to a database should not depend (much) on whether a given record is in the database or not.
  • Computational anonymity
  • Uniqueness
  • Result privacy
  • Interval disclosure

... and combined:

  • secure multiparty computation + differential privacy
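To make the Boolean flavour of one of these models concrete, here is a minimal k-anonymity check on a toy masked table; the pandas usage is standard, but the table and column names are hypothetical.

```python
# Check the k-anonymity level of a table w.r.t. a set of quasi-identifiers.
import pandas as pd

def k_anonymity_level(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Smallest group size over the QI combinations: the table is
    k-anonymous exactly for k up to this value."""
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical masked table: ages generalized to intervals.
masked = pd.DataFrame({
    "age":  ["20-30", "20-30", "20-30", "30-40", "30-40"],
    "town": ["Skovde", "Skovde", "Skovde", "Skovde", "Skovde"],
})
print(k_anonymity_level(masked, ["age", "town"]))  # 2: each record has at least one QI twin
```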

SLIDE 19

Data protection mechanisms: Masking methods

SLIDE 20

Data protection mechanisms

  • Focus on respondent privacy (in databases)
  • Classification w.r.t. knowledge of the computation a third party will perform:
  • Data-driven or general purpose (analysis not known) → anonymization methods / masking methods
  • Computation-driven or specific purpose (analysis known) → cryptographic protocols, differential privacy
  • Result-driven (analysis known: protection of its results)

[Figure: basic model (multiple/dynamic databases + multiple people)]


SLIDE 21

Masking methods

Anonymization/masking method: given a data file X, compute a file X′ whose data is of lower quality.

[Figure: X → masking method → X′]


SLIDE 22

Masking methods: questions

[Figure: original microdata X → masking method → protected microdata X′; data analysis yields Result(X) and Result(X′), compared by an information loss measure; a disclosure risk measure is computed on the protected data]


SLIDE 23

Research questions I: Masking methods

Masking methods (anonymization methods): X′ = ρ(X)

  • Perturbative (less quality = erroneous data)

E.g., noise addition/multiplication, microaggregation, rank swapping

  • Non-perturbative (less quality = less detail)

E.g., generalization, suppression

  • Synthetic data generators (less quality = not real data)

E.g., (i) build a model from the data; (ii) generate data from the model
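For the perturbative family, here are minimal sketches of noise addition and of a univariate fixed-size microaggregation; the parameter values and data are hypothetical, and real microaggregation variants (e.g., multivariate MDAV) are more involved.

```python
# Two perturbative masking methods, sketched with NumPy only.
import numpy as np

rng = np.random.default_rng(0)

def noise_addition(x: np.ndarray, k: float = 0.1) -> np.ndarray:
    """Add zero-mean Gaussian noise with variance k * Var(x)."""
    return x + rng.normal(0.0, np.sqrt(k * x.var()), size=x.shape)

def microaggregation(x: np.ndarray, k: int = 3) -> np.ndarray:
    """Sort the values, cut them into groups of k, and replace each
    value by its group mean (univariate, fixed-size variant)."""
    order = np.argsort(x)
    out = np.empty_like(x, dtype=float)
    for start in range(0, len(x), k):
        idx = order[start:start + k]
        out[idx] = x[idx].mean()
    return out

salaries = rng.normal(30000, 5000, size=12)   # hypothetical attribute
print(noise_addition(salaries)[:3])
print(microaggregation(salaries, k=3)[:3])
```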


SLIDE 24

Research questions II: Information loss / utility

Information loss measures. Compare X and X′ w.r.t. an analysis f.

  • f: generic vs. specific (data uses), e.g. regression

[Figure: Anscombe's datasets (different data, same regression line)]

  • Comparison: IL_f(X, X′) = divergence(f(X), f(X′))

[Figure: original data X (private) → masking method → masked data X′ (public); statistical analysis / data mining on both; does f(X) = f(X′)?]
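A minimal sketch of such a specific-purpose measure for f = linear regression: fit the least-squares line on X and on X′ and take the distance between the coefficient vectors as the divergence. The helper names and the noise-added X′ are assumptions for illustration.

```python
# Specific-purpose information loss: compare regression fits on X and X'.
import numpy as np

def regression_coeffs(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Slope and intercept of the least-squares line y = a*x + b."""
    return np.polyfit(x, y, deg=1)

def il_regression(x, y, x_masked, y_masked) -> float:
    """IL_f(X, X') as the Euclidean distance between f(X) and f(X')."""
    return float(np.linalg.norm(regression_coeffs(x, y)
                                - regression_coeffs(x_masked, y_masked)))

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)
x_masked = x + rng.normal(0, 0.5, 50)    # hypothetical noise-added version of x
print(il_regression(x, y, x_masked, y))  # small value: the fit is roughly preserved
```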


SLIDE 25

Research questions II: Information loss / utility

Information loss measures. Compare X and X′ w.r.t. an analysis f.

  • f: generic vs. specific (data uses), e.g. clustering

[Figure: two clusters of different size (tableX vs. tableY)]

  • Comparison: IL_f(X, X′) = divergence(f(X), f(X′))

[Figure: original data X (private) → masking method → masked data X′ (public); statistical analysis / data mining on both; does f(X) = f(X′)?]


SLIDE 26

Research questions III: Disclosure risk

Disclosure risk. One of the privacy models: reidentification (identity disclosure)

  • A: file with the protected data set
  • B: file with the intruder's data (a subset of the original X)

[Figure: record linkage between the intruder's file B and the protected file A = X′]
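A minimal sketch of this risk estimate via distance-based record linkage: link each intruder record to its nearest protected record and count how often the link is correct. The data, the masking, and the function names are hypothetical.

```python
# Reidentification rate via nearest-neighbour (distance-based) record linkage.
import numpy as np

def reidentification_rate(A: np.ndarray, B: np.ndarray,
                          true_match: np.ndarray) -> float:
    """A: protected records; B: intruder records; true_match[i] is the
    index in A that really corresponds to B[i]."""
    correct = 0
    for i, b in enumerate(B):
        dists = np.linalg.norm(A - b, axis=1)  # distance to every protected record
        correct += int(np.argmin(dists) == true_match[i])
    return correct / len(B)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))              # original microdata
A = X + rng.normal(0, 0.1, size=X.shape)   # masked by light noise addition
B = X[:20]                                 # the intruder holds 20 original records
print(reidentification_rate(A, B, np.arange(20)))  # close to 1.0: high risk
```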


SLIDE 27

Disclosure risk: The worst-case scenario

SLIDE 28

Disclosure Risk

Disclosure risk (DR)

  • The worst-case scenario:
  • DR using the largest data set: the original file
  • DR using the best reidentification method: optimal attacks (machine learning in reidentification)
  • DR under the transparency principle: transparency attacks


SLIDE 29

Optimal attacks

Machine learning for distance-based record linkage

  • Supervised approach: maximize the number of correct links
  • Use: metric learning
  • Goal (files A and B aligned): learn a distance under which each record's nearest neighbour is its true match; a sketch follows below
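A minimal sketch of the idea, with a crude grid search standing in for a real metric-learning method (the cited work uses richer formulations, e.g., a symmetric bilinear form); the data and names are hypothetical.

```python
# Learn attribute weights for a weighted distance that maximizes correct links.
import itertools
import numpy as np

def correct_links(A: np.ndarray, B: np.ndarray, w: np.ndarray) -> int:
    """Aligned files: the true match of B[i] is A[i]."""
    hits = 0
    for i, b in enumerate(B):
        d = np.sqrt((((A - b) ** 2) * w).sum(axis=1))  # weighted Euclidean distance
        hits += int(np.argmin(d) == i)
    return hits

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
A = X.copy()
A[:, 1] += rng.normal(0, 2.0, size=50)     # second attribute is heavily masked
B = X                                      # intruder holds the original records

grid = [np.array(w) for w in itertools.product([0.0, 0.5, 1.0], repeat=2) if any(w)]
best = max(grid, key=lambda w: correct_links(A, B, w))
print(best)  # downweights the heavily masked attribute
```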


SLIDE 30

Transparency

Transparency.

  • "the release of information about processes and even parameters used to alter data" (Karr, 2009)

Transparency principle (similar to Kerckhoffs's principle in cryptography):

  • "Given a privacy model, a masking method should be compliant with this privacy model even if everything about the method is public knowledge" (Torra, 2017, p. 17)


SLIDE 31

Transparency

Effect.

  • Information loss. Positive effect: less loss / improved inference

E.g., noise addition ρ(X) = X + ε, where ε is such that E(ε) = 0 and Var(ε) = k·Var(X); then Var(X′) = Var(X) + k·Var(X) = (1 + k)·Var(X).
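A minimal sketch of why publishing the parameter helps the analyst: if k is known, the variance inflation can be undone. The data and the value of k are hypothetical.

```python
# Transparency improves inference for noise addition: with k published,
# the analyst corrects the inflated variance Var(X') = (1 + k) Var(X).
import numpy as np

rng = np.random.default_rng(4)
k = 0.2                                     # published noise parameter
X = rng.normal(50.0, 10.0, size=10_000)     # original attribute (Var ~ 100)
Xp = X + rng.normal(0.0, np.sqrt(k * X.var()), size=X.size)  # masked file X'

naive = Xp.var()                # biased upward by the added noise (~120)
corrected = Xp.var() / (1 + k)  # uses the published k (~100)
print(f"Var(X)={X.var():.1f}  naive={naive:.1f}  corrected={corrected:.1f}")
```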

  • Disclosure risk. Negative effect: larger risk
  • Attack on single-ranking microaggregation (Winkler, 2002)
  • Formalization of the transparency attack (Nin, Herranz, Torra, 2008)
  • Attacks on microaggregation and rank swapping (Nin, Herranz, Torra, 2008)

⇒ Transparency-aware masking methods


SLIDE 32

Summary

SLIDE 33

Summary

  • Short introduction to data privacy (focus on databases)
  • Worst-case scenario and transparency


SLIDE 34

Thank you

SLIDE 35

References

References.

  • Worst-case scenario:
  • D. Abril, G. Navarro-Arribas, V. Torra, Supervised Learning Using a Symmetric Bilinear Form for Record Linkage, Information Fusion 26 (2015) 144-153.
  • Transparency attacks and transparency-aware methods:
  • J. Nin, J. Herranz, V. Torra, On the Disclosure Risk of Multivariate Microaggregation, Data and Knowledge Engineering 67 (2008) 399-412.
  • J. Nin, J. Herranz, V. Torra, Rethinking Rank Swapping to Decrease Disclosure Risk, Data and Knowledge Engineering 64:1 (2008) 346-364.
  • V. Torra, Fuzzy Microaggregation for the Transparency Principle, Journal of Applied Logic 23 (2017) 70-80.
  • Book:
  • V. Torra, Data Privacy: Foundations, New Developments and the Big Data Challenge, Springer, 2017.