Data privacy: an introduction (part II) — Vicenç Torra — February 2017 — PowerPoint PPT Presentation


SLIDE 1

Skövde 2017

Data privacy: an introduction (part II)

Vicenç Torra
February, 2017

School of Informatics, University of Skövde, Sweden
SLIDE 2

Outline

  • 1. Basics
  • 2. A classification – Dimensions
  • 3. Masking methods
  • 4. Privacy models and disclosure risk assessment

SLIDE 3

Basics

SLIDE 4

Introduction

  • Data privacy (technological / computer science perspective)
  • Avoid the disclosure of sensitive information when processing data.

Vicenç Torra; Data privacy — Skövde 2017

SLIDE 5

Introduction

  • Data privacy: boundaries
  • Database in a computer or in a removable device
    ⇒ access control to avoid unauthorized access
  • Data is transmitted
    ⇒ security technology to avoid unauthorized access

(Diagram: access control – Privacy – security)

SLIDE 6

Introduction

  • Data privacy: boundaries
  • Database in a computer or in a removable device
    ⇒ access control to avoid unauthorized access
  • Data is transmitted
    ⇒ security technology to avoid unauthorized access
  • Data privacy: core
  • Data is/needs to be processed:
    ⇒ statistics, data mining, machine learning
    ⇒ compute indices, find patterns, build models
  • Someone needs to access the data to perform an authorized analysis,
    but neither the access to the data nor the result of the analysis should lead to disclosure.

SLIDE 7

Introduction

  • Data privacy: core
  • Someone needs to access the data to perform an authorized analysis,
    but neither the access to the data nor the result of the analysis should lead to disclosure.

SLIDE 8

Difficulties

  • Difficulties: naive anonymization does not work.
    Passenger manifest for the Missouri, arriving February 15, 1882, Port of Boston:
    names, age, sex, occupation, place of birth, last place of residence,
    yes/no, condition (healthy?)

SLIDE 9

Difficulties

  • Difficulties: highly identifiable data
  • (Sweeney, 1997) on the USA population:
    ⋆ 87.1% (216 million of 248 million) were likely unique given 5-digit ZIP, gender, and date of birth
    ⋆ 3.7% had characteristics that likely made them unique given 5-digit ZIP, gender, and month and year of birth
  • Data from mobile devices:
    ⋆ two positions can make you unique (home and working place)
  • AOL and Netflix cases (search logs and movie ratings)
  • Similar with credit card payments, shopping carts, search logs, ...
    (i.e., high-dimensional data)
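The uniqueness figures above can be illustrated with a small sketch: count how often each quasi-identifier combination occurs and report the fraction of records that are unique on it. The toy (ZIP, gender, birth date) records below are invented for illustration.

```python
# Sketch of Sweeney-style uniqueness: how many records are unique
# on the quasi-identifier (ZIP, gender, birth date)?
from collections import Counter

records = [
    ("53715", "F", "1971-07-14"),
    ("53715", "M", "1971-07-14"),
    ("53703", "F", "1965-03-02"),
    ("53703", "F", "1965-03-02"),  # shares its quasi-identifier: not unique
    ("53706", "M", "1988-11-30"),
]

counts = Counter(records)                       # occurrences per combination
unique = [r for r in records if counts[r] == 1]
fraction_unique = len(unique) / len(records)
print(f"{fraction_unique:.0%} of records are unique on (ZIP, gender, birth date)")
```

On real data the same counting, applied to an entire census file, yields figures like the 87.1% quoted on the slide.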

SLIDE 10

Difficulties

  • Data privacy is "impossible", or not?
  • Privacy vs. utility
  • Privacy vs. security
  • Computational feasibility

SLIDE 11

A classification – Dimensions

SLIDE 12

Dimensions: 1st

  • Dimension 1. Whose privacy is being sought
  • Respondents' (passive data supplier)
  • Holder's (or owner's)
  • User's (active)

SLIDE 13

Dimensions: 1st

  • Ex. 3.1. A hospital collects data from patients and prepares a server
    to be used by researchers to explore the data.

SLIDE 14

Dimensions: 1st

  • Ex. 3.1. A hospital collects data from patients and prepares a server
    to be used by researchers to explore the data.
  • Actors (database of patients):
  • Holder: the hospital
  • Respondents: the patients

SLIDE 15

Dimensions: 1st

  • Ex. 3.1. A hospital collects data from patients and prepares a server
    to be used by researchers to explore the data.
  • Actors (database of patients):
  • Holder: the hospital
  • Respondents: the patients
  • Actors (database of queries):
  • Holder: the hospital
  • Respondents: the researchers
  • Users: the researchers, if they want to protect their queries

SLIDE 16

Dimensions: 1st

  • Ex. 3.2. An insurance company collects data from customers for internal
    use. A software company develops new software. A fraction of the
    database is transferred to the software company for software testing.
  • Actors:
  • Holder: the insurance company
  • Respondents: the customers

SLIDE 17

Dimensions: 1st

  • Ex. 3.4. Two supermarkets with fidelity cards record all transactions
    of their customers. The two directors want to mine relevant association
    rules from their databases. To the extent possible, each director does
    not want the other to access their own records.
  • Actors:
  • Holder: the supermarkets
  • Respondents: the customers

SLIDE 18

Dimensions: 1st

  • Dimension 1. Whose privacy is being sought (revisited)
  • Respondents' privacy (passive data supplier)
  • Holder's (or owner's) privacy
  • User's (active) privacy

⇒ Respondents' and holder's privacy are implemented by the holder, with a
different focus: respondents worry about their individual records, while
companies worry about general inferences (e.g., to be used by competitors).
E.g., protection of Ebenezer Scrooge's data (E. Scrooge | misanthropic,
tightfisted, money addict); the hospital may be interested in hiding the
number of addiction relapses.
⇒ User's privacy is implemented by the user.

SLIDE 19

Dimensions: 2nd

  • Dimension 2. Knowledge on the analysis to be done
  • Full knowledge: average length of stay for hospital in-patients
  • Partial or null knowledge: a model for mortgage risk prediction
    (but we do not know what kind of model will be used)

SLIDE 20

Dimensions: 2nd

  • Dimension 2. Knowledge on the analysis to be done
  • Data-driven or general purpose (analysis not known)
  • Computation-driven or specific purpose (analysis known)
  • Result-driven (analysis known: protection of its results)

SLIDE 21

Dimensions: 3rd

  • Dimension 3. Number of data sources
  • Single data source (single owner)
  • Multiple data sources (multiple owners)

SLIDE 22

1st–3rd Dimensions: Summary

(Diagram: respondent and holder privacy divides into data-driven (general-purpose), computation-driven (specific-purpose), and result-driven approaches; user privacy divides into protecting the identity of the user and protecting the data generated by the activity of the user; number of sources: single vs. multiple data sources.)

SLIDE 23

Masking methods

SLIDE 24

Masking methods

Respondent and holder privacy, according to knowledge on the analysis:

  • Data-driven or general purpose (analysis not known)
    → masking methods / anonymization methods (one data source)
  • Computation-driven or specific purpose (analysis known)
    → cryptographic protocols (multiple data sources)
    → masking methods (single data source, differential privacy)
  • Result-driven (analysis known: protection of its results)
    → masking methods (one data source, holder's privacy)
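For the computation-driven, single-data-source setting above, the slides name differential privacy. A minimal sketch of the Laplace mechanism for a counting query follows; the data, the age threshold, and `epsilon` are assumptions for illustration, not from the slides.

```python
# Sketch of the Laplace mechanism: a counting query has sensitivity 1,
# so adding Laplace(1/epsilon) noise gives epsilon-differential privacy.
import math
import random

def laplace_noise(scale: float) -> float:
    # Sample Laplace(0, scale) by inverse-CDF sampling.
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_count(data, predicate, epsilon: float) -> float:
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [28, 31, 62, 64, 58]
noisy = dp_count(ages, lambda a: a >= 60, epsilon=1.0)
print(noisy)  # randomized, but close to the true count
```

Each query answer is perturbed independently, so repeated queries consume privacy budget; a full treatment tracks the cumulative epsilon.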

SLIDE 25

Masking methods

Anonymization/masking method: given a data file X, compute a file X′ with data of less quality.

(Diagram: X → X′)

SLIDE 26

Masking methods

Anonymization/masking method: given a data file X, compute a file X′ with data of less quality.

  • Original X:

    Respondent  City        Age  Illness
    ABD         Skövde      28   Cancer
    COL         Mariestad   31   Cancer
    GHE         Stockholm   62   AIDS
    CIO         Stockholm   64   AIDS
    HYU         Göteborg    58   Heart attack

  • Protected X′:

    Respondent  City                 Age  Illness
    ABD         Skövde or Mariestad  30   Cancer
    COL         Skövde or Mariestad  30   Cancer
    GHE         Tarragona            60   AIDS
    CIO         Tarragona            60   AIDS
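One way a protected file like the example above could be produced is sketched below: ages generalized by rounding to the nearest multiple of 10, and cities replaced by broader labels. The city mapping is read off the example table; the code itself is an illustrative assumption, not the method actually used to build it.

```python
# Sketch: generalize the quasi-identifiers City and Age.
original = [
    ("ABD", "Skövde", 28, "Cancer"),
    ("COL", "Mariestad", 31, "Cancer"),
    ("GHE", "Stockholm", 62, "AIDS"),
    ("CIO", "Stockholm", 64, "AIDS"),
]

# City generalization copied from the protected table on the slide.
city_generalization = {
    "Skövde": "Skövde or Mariestad",
    "Mariestad": "Skövde or Mariestad",
    "Stockholm": "Tarragona",
}

def mask(record):
    rid, city, age, illness = record
    # round(age, -1) rounds to the nearest multiple of 10: 28 -> 30, 62 -> 60.
    return (rid, city_generalization[city], round(age, -1), illness)

protected = [mask(r) for r in original]
for row in protected:
    print(row)
```

Note that after masking, the two Cancer records share identical quasi-identifiers, as do the two AIDS records, which is what makes re-identification harder.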

SLIDE 27

Masking methods

Approach valid for different types of data:

  • databases, documents, search logs, social networks, ...
    (also masking taking into account semantics: WordNet, ODP)

(Diagram: X → X′)

SLIDE 28

Masking methods

(Diagram: original microdata X, with identifier attributes (id), non-confidential quasi-identifier attributes (Xnc), and confidential attributes (Xc), is transformed by anonymization / data masking into protected microdata X′, with protected X′nc and the confidential attributes Xc.)

SLIDE 29

Research questions

(Diagram: original microdata X → masking method → protected microdata X′; data analysis yields Result(X) and Result(X′), compared by an information loss measure; a disclosure risk measure is computed on X′.)

SLIDE 30

Masking methods

Masking methods (anonymization methods).

SLIDE 31

Masking methods

Masking methods (anonymization methods).

  • Perturbative (less quality = erroneous data)
    E.g. noise addition/multiplication, microaggregation, rank swapping

SLIDE 32

Masking methods

Masking methods (anonymization methods).

  • Perturbative (less quality = erroneous data)
    E.g. noise addition/multiplication, microaggregation, rank swapping
  • Non-perturbative (less quality = less detail)
    E.g. generalization, suppression

SLIDE 33

Masking methods

Masking methods (anonymization methods).

  • Perturbative (less quality = erroneous data)
    E.g. noise addition/multiplication, microaggregation, rank swapping
  • Non-perturbative (less quality = less detail)
    E.g. generalization, suppression
  • Synthetic data generators (less quality = not real data)
    E.g. (i) model from the data; (ii) generate data from model
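As a concrete instance of a perturbative method, here is a minimal sketch of univariate microaggregation: sort the values, form groups of k consecutive values, and replace each value by its group mean. The data and k are illustrative, and a full implementation would merge a final group smaller than k into its neighbor.

```python
# Minimal univariate microaggregation sketch (perturbative masking).
def microaggregate(values, k=3):
    # Indices sorted by value, so groups contain consecutive values.
    order = sorted(range(len(values)), key=lambda i: values[i])
    masked = [0.0] * len(values)
    for start in range(0, len(order), k):
        group = order[start:start + k]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean  # every group member gets the group mean
    return masked

ages = [28, 31, 62, 64, 58, 33]
print(microaggregate(ages, k=3))
```

With k = 3, each masked value is shared by at least three records, which directly limits identity disclosure on that attribute.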

SLIDE 34

Masking methods

Information loss measures. Compare X and X′ w.r.t. an analysis f:

    ILf(X, X′) = divergence(f(X), f(X′))

  • f: generic vs. specific (data uses)
  • Statistics
  • Machine learning: clustering and classification
    (for example, classification using decision trees)
  • ... specific measures for graphs

(Diagram: X → X′; is f(X) = f(X′)?)
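The generic formula ILf(X, X′) = divergence(f(X), f(X′)) can be instantiated very simply, e.g. with f = attribute mean and absolute difference as the divergence; both choices, and the data, are assumptions for illustration.

```python
# Sketch of an information loss measure: IL_f(X, X') = divergence(f(X), f(X')),
# here with f = mean and divergence = absolute difference.
def info_loss(X, Xp, f=lambda xs: sum(xs) / len(xs)):
    return abs(f(X) - f(Xp))

ages = [28, 31, 62, 64, 58]          # original attribute values
masked_ages = [30, 30, 60, 60, 60]   # after masking
print(info_loss(ages, masked_ages))  # mean shifts from 48.6 to 48.0, IL ≈ 0.6
```

Swapping in a different f (a variance, a fitted classifier's accuracy, a graph statistic) gives the generic vs. specific measures listed above.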

SLIDE 35

Privacy models and disclosure risk assessment

SLIDE 36

Disclosure risk assessment

Disclosure risk.

  • Identity disclosure vs. attribute disclosure
  • Attribute disclosure (e.g., learn about Alice's salary):
    ⋆ increase knowledge about an attribute of an individual
  • Identity disclosure (e.g., find Alice in the database):
    ⋆ find/identify an individual in a masked file
    Within machine learning, some attribute disclosure is expected.

SLIDE 37

Disclosure risk assessment

Disclosure risk.

  • Identity disclosure vs. attribute disclosure
  • Boolean vs. quantitative measures

SLIDE 38

Disclosure risk assessment

Disclosure risk.

  • Identity disclosure vs. attribute disclosure
  • Boolean vs. quantitative measures
    (minimize information loss vs. multiobjective optimization)

SLIDE 39

Disclosure risk assessment

Disclosure risk.

  • Identity disclosure vs. attribute disclosure
  • Boolean vs. quantitative measures
    (minimize information loss vs. multiobjective optimization)
  • Examples: privacy models / disclosure risk measures

                            Boolean                Quantitative
    Identity disclosure     k-Anonymity            Re-identification (record linkage), Uniqueness
    Attribute disclosure    Differential privacy   Interval disclosure
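A Boolean privacy model such as k-anonymity can be checked directly: every combination of quasi-identifier values must appear in at least k records. A minimal sketch, where the quasi-identifier columns and the sample table are illustrative assumptions:

```python
# Boolean check: does a masked table satisfy k-anonymity on its
# quasi-identifiers (here the City and Age columns)?
from collections import Counter

def is_k_anonymous(rows, qi_indices, k):
    counts = Counter(tuple(r[i] for i in qi_indices) for r in rows)
    return all(c >= k for c in counts.values())

protected = [
    ("ABD", "Skövde or Mariestad", 30, "Cancer"),
    ("COL", "Skövde or Mariestad", 30, "Cancer"),
    ("GHE", "Tarragona", 60, "AIDS"),
    ("CIO", "Tarragona", 60, "AIDS"),
]
print(is_k_anonymous(protected, qi_indices=(1, 2), k=2))  # → True
```

A quantitative measure would instead report, e.g., the fraction of records an intruder can re-identify by record linkage, rather than a yes/no answer.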

SLIDE 40

Thank you

SLIDE 41

References

Related references.

  • Torra, V. (2017) Data Privacy: Foundations, New Developments, and the Big Data Challenge, Springer, forthcoming.
  • Aggarwal, C. C., Yu, P. S. (2008) (eds.) Privacy-Preserving Data Mining: Models and Algorithms, Springer.
  • Duncan, G. T., Elliot, M., Salazar, J. J. (2011) Statistical Confidentiality, Springer.
  • Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Nordholt, E. S., Spicer, K., de Wolf, P.-P. (2012) Statistical Disclosure Control, Wiley.
  • Navarro-Arribas, G., Torra, V. (2015) (eds.) Advanced Research in Data Privacy, Springer.
  • Vaidya, J., Clifton, C. W., Zhu, Y. M. (2006) Privacy Preserving Data Mining, Springer.