Data Anonymization Introduction Li Xiong CS573 Data Privacy and - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Data Anonymization – Introduction

Li Xiong

CS573 Data Privacy and Security

slide-2
SLIDE 2

Outline

Problem definition Principles Disclosure Control Methods

slide-3
SLIDE 3

Inference Control

NO FOUL PLAY

Access control: protecting information and information systems from unauthorized access and use.

Inference control: protecting private data while publishing useful information.

slide-4
SLIDE 4

Problem: Disclosure Control

  • Disclosure control is the discipline concerned with modifying data that contain confidential information about individual entities (persons, households, businesses, etc.) in order to prevent third parties working with these data from recognizing individuals in the data

  • Related terms: privacy-preserving data publishing, anonymization, de-identification

Types of disclosure

  • Identity disclosure - identification of an entity (person, institution)
  • Attribute disclosure - the intruder finds something new about the target person
  • Disclosure - identity disclosure, attribute disclosure, or both
slide-5
SLIDE 5

Microdata and External Information

  • Microdata: a series of records, each containing information on an individual unit such as a person, a firm, or an institution. In contrast to computed tables (macrodata).

  • Masked microdata: microdata from which names and other identifying information have been removed or modified

  • External information: any information known to a presumptive intruder about some individuals in the initial microdata

slide-6
SLIDE 6

Disclosure Risk and Information Loss

  • Disclosure risk - the risk that a given form of disclosure will arise if masked microdata is released

  • Information loss - the quantity of information that exists in the initial microdata but not in the masked microdata because of disclosure control methods

slide-7
SLIDE 7

Disclosure Control Problem

[Diagram: individuals submit data to the data owner, who collects it, applies a masking process, and releases masked data; the researcher and the intruder both receive the masked data.]

slide-8
SLIDE 8

Disclosure Control Problem

[Diagram: the same flow, annotated with two competing goals. The masking process must protect the confidentiality of individuals (disclosure risk / anonymity properties) while preserving data utility (information loss). The researcher uses masked data for statistical analysis; the intruder uses masked data and external data to disclose confidential information.]

slide-9
SLIDE 9

Disclosure Control for Tables vs. Microdata

Microdata

Macrodata - precomputed statistics tables

slide-10
SLIDE 10

Disclosure Control for Microdata

slide-11
SLIDE 11

Disclosure Control for Tables

slide-12
SLIDE 12

Anonymization

Microdata release

  • Guidelines
  • Cases and controversies
  • Current research

Macrodata release

slide-21
SLIDE 21

Anonymization

Microdata release

  • Guidelines
  • Cases and controversies
  • Current research

Macrodata release

slide-22
SLIDE 22

Massachusetts GIC Incident

The Massachusetts Group Insurance Commission (GIC) released “anonymized” data on state employees’ hospital visits.

Then-Governor William Weld assured the public of privacy.

[Tables: sample GIC records before and after anonymization.]

slide-23
SLIDE 23

Massachusetts GIC

Then-graduate student Sweeney linked the data with the voter roll for Cambridge and identified Governor Weld’s record.

[Tables: the anonymized GIC data and the voter roll for Cambridge, joined on the attributes they share.]
slide-24
SLIDE 24

Re-identification

1/24/2012 24

slide-25
SLIDE 25

AOL Query Log Release

AnonID  Query                 QueryTime            ItemRank  ClickURL
217     lottery               2006-03-01 11:58:51  1         http://www.calottery.com
217     lottery               2006-03-27 14:10:38  1         http://www.calottery.com
1268    gall stones           2006-05-11 02:12:51
1268    gallstones            2006-05-11 02:13:02  1         http://www.niddk.nih.gov
1268    ozark horse blankets  2006-03-01 17:39:28  8         http://www.blanketsnmore.com

20 million Web search queries by AOL

(Source: AOL Query Log)

slide-26
SLIDE 26

User No. 4417749

Queries issued by user 4417749:

  • “numb fingers”
  • “60 single men”
  • “dog that urinates on everything”
  • “landscapers in Lilburn, Ga”
  • Several people's names with the last name Arnold
  • “homes sold in shadow lake subdivision gwinnett county georgia”

slide-27
SLIDE 27

User No. 4417749

Queries issued by user 4417749:

  • “numb fingers”
  • “60 single men”
  • “dog that urinates on everything”
  • “landscapers in Lilburn, Ga”
  • Several people's names with the last name Arnold
  • “homes sold in shadow lake subdivision gwinnett county georgia”

Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her dogs.

slide-28
SLIDE 28

Anonymization

Microdata release

  • Guidelines
  • Cases and controversies
  • Current research
      • Principles
      • Anonymization methods

Macrodata release

slide-29
SLIDE 29

K-Anonymity

  • The term was introduced in 1998 by Samarati and Sweeney.
  • Important papers:
      • Sweeney L. (2002), K-Anonymity: A Model for Protecting Privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, Vol. 10, No. 5, 557-570
      • Sweeney L. (2002), Achieving K-Anonymity Privacy Protection using Generalization and Suppression, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, Vol. 10, No. 5, 571-588
      • Samarati P. (2001), Protecting Respondents Identities in Microdata Release, IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 6, 1010-1027
  • Many new research papers in the last 10 years:
      • Theoretical results
      • Many algorithms achieving k-anonymity
      • Many improved principles and algorithms
slide-30
SLIDE 30

Motivating Example


slide-31
SLIDE 31

Motivating Example (continued)

  • Published Data: Alice publishes data without the Name

  • Attacker’s Knowledge: Voter registration list

#  Name   Zip    Age  Nationality
1  John   13067  45   US
2  Paul   13067  22   US
3  Bob    13067  29   US
4  Chris  13067  23   US

slide-32
SLIDE 32

Motivating Example (continued)

  • Published Data: Alice publishes data without the Name
  • Attacker’s Knowledge: Voter registration list

Data Leak!

slide-33
SLIDE 33

Source of the Problem

Even if we do not publish the individuals' names:

  • There are some fields that may uniquely identify some individual
  • The attacker can use them to join with other sources and identify the individuals

Such a set of fields is called a quasi-identifier.
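The join described above can be sketched in a few lines of Python. This is only an illustration: the voter list uses the values from the Motivating Example slides, while the published records, the `condition` field, and the helper name `link` are invented for the example.

```python
# Voter registration list (values taken from the Motivating Example slides).
voters = [
    {"name": "John", "zip": "13067", "age": 45, "nationality": "US"},
    {"name": "Paul", "zip": "13067", "age": 22, "nationality": "US"},
    {"name": "Bob", "zip": "13067", "age": 29, "nationality": "US"},
    {"name": "Chris", "zip": "13067", "age": 23, "nationality": "US"},
]

# Hypothetical published records with names removed; the conditions are invented.
published = [
    {"zip": "13067", "age": 29, "nationality": "US", "condition": "flu"},
    {"zip": "13067", "age": 45, "nationality": "US", "condition": "asthma"},
]

QUASI_IDENTIFIER = ("zip", "age", "nationality")


def link(published_rows, external_rows, qid):
    """Join two tables on the quasi-identifier; keep only unambiguous matches."""
    index = {}
    for row in external_rows:
        index.setdefault(tuple(row[a] for a in qid), []).append(row)
    matches = []
    for row in published_rows:
        candidates = index.get(tuple(row[a] for a in qid), [])
        if len(candidates) == 1:  # exactly one external record fits: re-identified
            matches.append((candidates[0]["name"], row["condition"]))
    return matches


reidentified = link(published, voters, QUASI_IDENTIFIER)
```

Because every voter in this small list has a distinct (zip, age, nationality) combination, each published record matches exactly one name.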

slide-34
SLIDE 34

Attribute Classification

I1, I2, ..., Im - identifier attributes

  • Ex: Name and SSN
  • Information that leads to a specific entity.

K1, K2, ..., Kp - key attributes (quasi-identifiers)

  • Ex: Zip Code and Age
  • May be known by an intruder.

S1, S2, ..., Sq - confidential attributes

  • Ex: Principal Diagnosis and Annual Income
  • Assumed to be unknown to an intruder.

slide-35
SLIDE 35

Attribute Types

Identifier, Key (Quasi-Identifiers) and Confidential Attributes

[Table: sample microdata with its columns labeled as identifier, key (quasi-identifier), and confidential attributes.]

slide-36
SLIDE 36

K-Anonymity Definition

The k-anonymity property for a masked microdata (MM) is satisfied with respect to a quasi-identifier set (QID) if every count in the frequency set of MM with respect to QID is greater than or equal to k.

slide-37
SLIDE 37

K-Anonymity Example

[Table: example Patient relation with attributes including Age, Zip, and Sex.]

  • QID = { Age, Zip, Sex }
  • SELECT COUNT(*) FROM Patient GROUP BY Sex, Zip, Age;
  • If the results include any group with a count less than k, the relation Patient does not have the k-anonymity property with respect to QID.
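The same check can be sketched in Python by building the frequency set with a counter. The records below are hypothetical, and the function names `frequency_set` and `is_k_anonymous` are chosen for this illustration only.

```python
from collections import Counter


def frequency_set(records, qid):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[a] for a in qid) for r in records)


def is_k_anonymous(records, qid, k):
    """True if every count in the frequency set is at least k."""
    return all(count >= k for count in frequency_set(records, qid).values())


# Hypothetical Patient records (not the values from the slide's table).
patients = [
    {"age": 50, "zip": "41076", "sex": "Male"},
    {"age": 50, "zip": "41076", "sex": "Male"},
    {"age": 30, "zip": "41076", "sex": "Female"},
    {"age": 30, "zip": "41076", "sex": "Female"},
    {"age": 30, "zip": "41076", "sex": "Female"},
]

QID = ("age", "zip", "sex")
two_anonymous = is_k_anonymous(patients, QID, 2)
three_anonymous = is_k_anonymous(patients, QID, 3)
```

Here the smallest group has two records, so the data is 2-anonymous but not 3-anonymous with respect to QID.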

slide-38
SLIDE 38

Homogeneity Attack

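k-Anonymity can create groups that leak information because of a lack of diversity in the sensitive attribute. If every record in a group shares the same sensitive value, an intruder learns that value for anyone known to be in the group, without having to single out a record.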

slide-39
SLIDE 39

Anonymization

Microdata release

  • Guidelines
  • Cases and controversies
  • Current research
      • Principles
      • Anonymization methods

Macrodata release

slide-40
SLIDE 40

L-diversity

Each equivalence group must have l “well-represented” sensitive values
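“Well-represented” can be interpreted in several ways. A minimal sketch of the simplest reading, distinct l-diversity (each group contains at least l different sensitive values), might look as follows; the records and the function name are hypothetical.

```python
from collections import defaultdict


def is_distinct_l_diverse(records, qid, sensitive, l):
    """True if every equivalence group has at least l distinct sensitive values."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[a] for a in qid)].add(r[sensitive])
    return all(len(values) >= l for values in groups.values())


# Hypothetical 2-anonymous release: the first group is homogeneous.
release = [
    {"age": "2*", "zip": "130**", "condition": "Heart disease"},
    {"age": "2*", "zip": "130**", "condition": "Heart disease"},
    {"age": "4*", "zip": "130**", "condition": "Flu"},
    {"age": "4*", "zip": "130**", "condition": "Cancer"},
]

diverse_2 = is_distinct_l_diverse(release, ("age", "zip"), "condition", 2)
diverse_1 = is_distinct_l_diverse(release, ("age", "zip"), "condition", 1)
```

The first group is exactly the situation a homogeneity attack exploits: it satisfies 2-anonymity, yet everyone in it has the same condition, so the release fails distinct 2-diversity.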

slide-41
SLIDE 41

More attacks and principles

  • t-closeness - skewed data
  • m-invariance - incremental releases
  • …

slide-42
SLIDE 42

Disclosure Control Techniques

  • Remove Identifiers
  • Generalization
  • Suppression
  • Sampling
  • Microaggregation
  • Perturbation / randomization
  • Rounding
  • Data Swapping
  • Etc.

slide-43
SLIDE 43

Disclosure Control Techniques

  • Different disclosure control techniques are applied to the following initial microdata:

[Table: initial microdata.]

slide-44
SLIDE 44

Remove Identifiers

  • Identifiers such as Name, SSN, etc. are removed

[Table: the initial microdata with the identifier columns removed.]
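As a rough sketch, removing identifiers amounts to dropping the identifier columns from every record. The field names and values below are hypothetical, not those of the slide's table.

```python
def remove_identifiers(records, identifiers):
    """Return copies of the records without the identifier attributes."""
    drop = set(identifiers)
    return [{k: v for k, v in r.items() if k not in drop} for r in records]


# Hypothetical initial microdata.
initial = [
    {"name": "Alice", "ssn": "000-00-0001", "age": 44, "income": 48000},
    {"name": "Bob", "ssn": "000-00-0002", "age": 55, "income": 63000},
]

masked = remove_identifiers(initial, ["name", "ssn"])
```

Note that the key attributes (here `age`) are still present, which is why removing identifiers alone does not prevent linking.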

slide-45
SLIDE 45

Sampling

  • Sampling is the disclosure control method in which only a subset of records is released
  • If n is the number of elements in the initial microdata and t is the number of released elements, we call sf = t / n the sampling factor
  • Simple random sampling is more frequently used. In this technique, each individual is chosen entirely by chance and each member of the population has an equal chance of being included in the sample

[Table: a sample of records drawn from the initial microdata.]
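A minimal sketch of simple random sampling with a given sampling factor, using Python's standard `random` module. The function name, the rounding of t, and the `seed` parameter are assumptions made for the illustration.

```python
import random


def simple_random_sample(records, sf, seed=None):
    """Release t = round(sf * n) records chosen uniformly without replacement."""
    if not 0 < sf <= 1:
        raise ValueError("sampling factor must be in (0, 1]")
    t = round(sf * len(records))
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return rng.sample(records, t)


# Hypothetical record identifiers standing in for full records.
record_ids = list(range(1, 11))
released = simple_random_sample(record_ids, sf=0.5, seed=0)
```

With n = 10 and sf = 0.5, five records are released; which five depends on the random draw.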

slide-46
SLIDE 46

Microaggregation

  • Order records from the initial microdata by an attribute, create groups of consecutive values, and replace those values by the group average
  • Example: microaggregation for attribute Income with minimum group size 3
  • The total sum of all Income values remains the same

[Table: the microdata ordered by Income, with each Income value replaced by its group average.]
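The procedure can be sketched as follows. This is one simple grouping rule (fixed-size groups in sorted order, with leftover records folded into the last group); it is an assumption of this sketch and need not match the grouping used in the slide's table. The income values are hypothetical.

```python
def microaggregate(values, k):
    """Replace each value by the mean of its group of at least k sorted neighbors."""
    n = len(values)
    if k < 1 or n < k:
        raise ValueError("need at least k values and k >= 1")
    order = sorted(range(n), key=lambda i: values[i])  # indices in value order
    result = [0.0] * n
    n_groups = n // k
    for g in range(n_groups):
        start = g * k
        # The last group absorbs the remainder so no group is smaller than k.
        end = n if g == n_groups - 1 else start + k
        members = order[start:end]
        mean = sum(values[i] for i in members) / len(members)
        for i in members:
            result[i] = mean
    return result


# Hypothetical incomes, in record order.
incomes = [40, 10, 70, 20, 60, 30, 50]
masked_incomes = microaggregate(incomes, k=3)
```

Replacing values by their group mean leaves each group's sum unchanged, which is why the total over all records is preserved.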

slide-47
SLIDE 47

Data Swapping

  • In this disclosure control method, a sequence of so-called elementary swaps is applied to the microdata
  • An elementary swap consists of two actions:
      • A random selection of two records i and j from the microdata
      • A swap (interchange) of the values of the attribute being swapped for records i and j

[Table: the microdata after swapping values of one attribute between records.]
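A sequence of elementary swaps can be sketched as below. The records, the attribute name, and the `seed` parameter are hypothetical and serve only to illustrate the two actions.

```python
import random


def swap_attribute(records, attribute, n_swaps, seed=None):
    """Apply n_swaps elementary swaps of one attribute; the input is not modified."""
    rng = random.Random(seed)
    masked = [dict(r) for r in records]
    for _ in range(n_swaps):
        # Action 1: randomly select two distinct records i and j.
        i, j = rng.sample(range(len(masked)), 2)
        # Action 2: interchange their values of the swapped attribute.
        masked[i][attribute], masked[j][attribute] = (
            masked[j][attribute],
            masked[i][attribute],
        )
    return masked


# Hypothetical microdata.
microdata = [
    {"id": 1, "age": 44, "income": 48000},
    {"id": 2, "age": 55, "income": 37900},
    {"id": 3, "age": 44, "income": 21000},
    {"id": 4, "age": 65, "income": 90000},
]

swapped = swap_attribute(microdata, "income", n_swaps=3, seed=1)
```

Swapping only moves values between records, so the overall distribution of the swapped attribute is unchanged while its link to the other attributes is perturbed.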

slide-48
SLIDE 48

Generalization and Suppression

  • Generalization: replace the value with a less specific but semantically consistent value
  • Suppression: do not release a value at all

slide-49
SLIDE 49

Domain and Value Generalization Hierarchies

Domain generalization hierarchies:

Z0 = {41075, 41076, 41095, 41099}
Z1 = {4107*, 4109*}
Z2 = {410**}

S0 = {Male, Female}
S1 = {Person}

Value generalization hierarchies:

410**
  4107*: 41075, 41076
  4109*: 41095, 41099

Person: Male, Female
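For these two attributes, moving up one level of the hierarchy can be sketched as masking trailing ZIP digits and collapsing Sex to Person. The function names are invented for the illustration.

```python
def generalize_zip(zip_code, level):
    """Level 0 keeps the ZIP code (Z0); each further level masks one more trailing digit."""
    if not 0 <= level <= len(zip_code):
        raise ValueError("level out of range")
    return zip_code[: len(zip_code) - level] + "*" * level


def generalize_sex(value, level):
    """Level 0 keeps Male/Female (S0); level 1 maps every value to Person (S1)."""
    if level not in (0, 1):
        raise ValueError("level out of range")
    return value if level == 0 else "Person"


Z0 = ["41075", "41076", "41095", "41099"]
Z1 = sorted({generalize_zip(z, 1) for z in Z0})
Z2 = sorted({generalize_zip(z, 2) for z in Z0})
```

Masking all five digits would correspond to suppressing the ZIP code entirely.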

slide-50
SLIDE 50

Generalization Lattice

Z0 = {41075, 41076, 41095, 41099}
Z1 = {4107*, 4109*}
Z2 = {410**}

S0 = {Male, Female}
S1 = {Person}

Generalization lattice, with the distance vector of each node:

<S0, Z0>  [0, 0]
<S1, Z0>  [1, 0]
<S0, Z1>  [0, 1]
<S1, Z1>  [1, 1]
<S0, Z2>  [0, 2]
<S1, Z2>  [1, 2]

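The distance vectors of the lattice can be enumerated mechanically. The sketch below assumes the two hierarchies above (Sex with levels 0-1, Zip with levels 0-2); the function names are invented for the illustration.

```python
from itertools import product

MAX_LEVELS = (1, 2)  # highest generalization level for (Sex, Zip)


def lattice_nodes(max_levels):
    """All distance vectors: one generalization level per attribute."""
    return list(product(*(range(m + 1) for m in max_levels)))


def direct_generalizations(node, max_levels):
    """Nodes reached by generalizing exactly one attribute by one level."""
    result = []
    for pos, level in enumerate(node):
        if level < max_levels[pos]:
            result.append(node[:pos] + (level + 1,) + node[pos + 1:])
    return result


nodes = lattice_nodes(MAX_LEVELS)
above_bottom = direct_generalizations((0, 0), MAX_LEVELS)
above_top = direct_generalizations((1, 2), MAX_LEVELS)
```

The bottom node [0, 0] is the original data and the top node [1, 2] is the most general table; an anonymization algorithm searches this lattice for a node that satisfies the chosen principle with little information loss.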
slide-51
SLIDE 51

Generalization Tables

slide-52
SLIDE 52

Coming up

  • Guest lecture by James Gardner
  • Improved principles and anonymization algorithms