Data Anonymization – Introduction
Li Xiong
CS573 Data Privacy and Security
Outline
Problem definition
Principles
Disclosure Control Methods
Inference Control
NO FOUL PLAY
Access control: protecting information and information systems
from unauthorized access and use.
Inference control: protecting private data while publishing useful
information
Disclosure control: the modification of data containing confidential information about individual entities (persons, households, businesses, etc.) in order to prevent third parties working with these data from recognizing individuals in the data
Types of disclosure
target person
information on an individual unit such as a person, a firm, an institution, etc
In contrast to computed tables (Macrodata)
removed or modified from microdata
intruder related to some individuals from initial microdata
Disclosure risk - the risk that a given form of disclosure will arise if a masked microdata is released
Information loss - the quantity of information which exists in the initial microdata but not in the masked microdata, due to disclosure control methods
Figure: the disclosure control problem. Individuals submit data; the data owner collects it, applies a masking process, and releases masked data, which both the researcher and the intruder receive.
Data owner: confidentiality; preserve data utility.
Researcher: use masked data for statistical analysis (information loss).
Intruder: use masked data and external data to disclose confidential information (disclosure risk / anonymity properties).
Disclosure Control For Microdata
Disclosure Control for Tables
Massachusetts GIC released “anonymized” data on
Then Governor William Weld assured public on
Then graduate student Sweeney linked the data with
Voter roll for Cambridge
1/24/2012
AnonID  Query        QueryTime            ItemRank  ClickURL
217     lottery      2006-03-01 11:58:51  1         http://www.calottery.com
217     lottery      2006-03-27 14:10:38  1         http://www.calottery.com
1268    gall stones  2006-05-11 02:12:51
1268    gallstones   2006-05-11 02:13:02  1         http://www.niddk.nih.gov
1268                 2006-03-01 17:39:28  8         http://www.blanketsnmore.com
20 million Web search queries by AOL
(Source: AOL Query Log)
“numb fingers”, “60 single men”
“dog that urinates on everything”
“landscapers in Lilburn, Ga”
Several people's names with last name Arnold
“homes sold in shadow lake subdivision gwinnett county georgia”
Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her dogs
Principles Anonymization methods
and Suppression, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, Vol. 10, No. 5, 571-588
Transactions on Knowledge and Data Engineering, Vol. 13, No. 6, 1010-1027
Data
Modify Data
Attacker’s Knowledge: Voter registration list
#   Zip     Age   Nationality   Name
1   13067   45    US            John
2   13067   22    US            Paul
3   13067   29    US            Bob
4   13067   23    US            Chris
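The linking step an attacker performs with such a list can be sketched as a join on the shared attributes. This is a minimal illustration, not the procedure used in any real attack: the voter rows follow the table above, while the `released` records, their `condition` values, and the function name `link` are hypothetical.

```python
# Voter registration list (attacker's knowledge), as in the table above.
voter_list = [
    {"name": "John", "zip": "13067", "age": 45, "nationality": "US"},
    {"name": "Paul", "zip": "13067", "age": 22, "nationality": "US"},
    {"name": "Bob", "zip": "13067", "age": 29, "nationality": "US"},
    {"name": "Chris", "zip": "13067", "age": 23, "nationality": "US"},
]

# Hypothetical released data: names removed, other attributes kept.
released = [
    {"zip": "13067", "age": 45, "nationality": "US", "condition": "heart disease"},
    {"zip": "13067", "age": 23, "nationality": "US", "condition": "flu"},
]

QID = ("zip", "age", "nationality")


def link(released, external, qid):
    """Pair each released record with the external person it matches.

    A pair is reported only when exactly one external person shares
    the record's quasi-identifier values.
    """
    index = {}
    for person in external:
        index.setdefault(tuple(person[a] for a in qid), []).append(person)
    pairs = []
    for rec in released:
        candidates = index.get(tuple(rec[a] for a in qid), [])
        if len(candidates) == 1:
            pairs.append((candidates[0]["name"], rec))
    return pairs


for name, rec in link(released, voter_list, QID):
    print(name, "->", rec["condition"])
```

Because every voter in this small list has a distinct (zip, age, nationality) combination, each released record is re-identified even though no names were published.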
Modify Data
Data Leak!
Even if we do not publish the individuals:
Quasi Identifier
I1, I2, ..., Im - identifier attributes
    Ex: Name and SSN
    Information that leads to a specific entity.
K1, K2, ..., Kp - key attributes (quasi-identifiers)
    Ex: Zip Code and Age
    May be known by an intruder.
S1, S2, ..., Sq - confidential attributes
    Ex: Principal Diagnosis and Annual Income
    Assumed to be unknown to an intruder.
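One way to make the three attribute classes concrete is to tag each column with its role and drop the identifier columns before release. The sketch below is only an illustration: the column names, the sample record, and the helper names `attributes_with_role` and `remove_identifiers` are assumptions, not part of the slides.

```python
# Role of each attribute, following the three classes listed above.
# The attribute names are illustrative.
ROLES = {
    "name": "identifier",
    "ssn": "identifier",
    "zip": "key",
    "age": "key",
    "diagnosis": "confidential",
    "income": "confidential",
}


def attributes_with_role(role):
    """Return the attribute names tagged with the given role."""
    return [attr for attr, r in ROLES.items() if r == role]


def remove_identifiers(records):
    """Return copies of the records without identifier attributes."""
    identifiers = set(attributes_with_role("identifier"))
    return [
        {attr: value for attr, value in rec.items() if attr not in identifiers}
        for rec in records
    ]


# Hypothetical record, for illustration only.
initial = [
    {"name": "A. Person", "ssn": "000000000", "zip": "41076",
     "age": 50, "diagnosis": "Asthma", "income": 40000},
]
masked = remove_identifiers(initial)
print(masked)
```

Removing identifiers leaves the key attributes in place, which is why the quasi-identifiers still need separate treatment.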
#    SSN         Age   Amount
1    123456789   44    1,200
2    323232323   44    2,500
3    232345656   55    3,000
4    333333333   44    1,000
5    444444444   55      900
6    666666666   45      750
7    777777777   25    1,200
8    888888888   35    2,200
9    999999999   55    4,200
10   100000000   45    3,100
Age   Zip
50    41076
30    41076
30    41076
20    41076
20    41076
50    41076
50    41076
not have k-anonymity property with respect to QID.
k-Anonymity can create groups that leak information due to a lack of diversity in the sensitive attribute.
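Both points can be checked mechanically: group the records by their quasi-identifier (QID) values, take the smallest group size as k, and count the distinct sensitive values per group. The following is a minimal sketch; the records, the `diagnosis` attribute, and the function names are hypothetical.

```python
from collections import defaultdict


def group_by_qid(records, qid):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[a] for a in qid)].append(rec)
    return groups


def k_anonymity(records, qid):
    """Largest k for which the records are k-anonymous w.r.t. qid."""
    return min(len(g) for g in group_by_qid(records, qid).values())


def min_distinct_sensitive(records, qid, sensitive):
    """Smallest number of distinct sensitive values in any QID group."""
    return min(
        len({rec[sensitive] for rec in g})
        for g in group_by_qid(records, qid).values()
    )


# Hypothetical masked microdata.
records = [
    {"age": 50, "zip": "41076", "diagnosis": "Diabetes"},
    {"age": 50, "zip": "41076", "diagnosis": "Asthma"},
    {"age": 30, "zip": "41076", "diagnosis": "Asthma"},
    {"age": 30, "zip": "41076", "diagnosis": "Asthma"},
]
qid = ("age", "zip")
print(k_anonymity(records, qid))                          # group sizes are 2 and 2
print(min_distinct_sensitive(records, qid, "diagnosis"))  # the age-30 group has one value
```

Here the data is 2-anonymous, yet anyone known to be 30 years old and living in 41076 is revealed to have asthma, because that group contains a single diagnosis.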
Principles Anonymization methods
microdata:
#    SSN         Age   Amount
1    123456789   44    1,200
2    323232323   44    2,500
3    232345656   55    3,000
4    333333333   44    1,000
5    444444444   55      900
6    666666666   45      750
7    777777777   25    1,200
8    888888888   35    2,200
9    999999999   55    4,200
10   100000000   45    3,100
#    Age   Amount
1    44    1,200
2    44    2,500
3    55    3,000
4    44    1,000
5    55      900
6    45      750
7    25    1,200
8    35    2,200
9    55    4,200
10   45    3,100
records is released
number of elements we call sf = t / n the sampling factor
individual is chosen entirely by chance and each member of the population has an equal chance of being included in the sample
#    Age   Amount
5    55      900
4    44    1,000
8    35    2,200
9    55    4,200
7    25    1,200
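Simple random sampling with a sampling factor sf = t / n can be sketched with the standard library. This is an illustrative sketch: the function names, the `seed` parameter, and the ten-record population are assumptions.

```python
import random


def sampling_factor(t, n):
    """sf = t / n: the fraction of the n records that is released."""
    return t / n


def simple_random_sample(records, t, seed=None):
    """Draw t records without replacement; every record is equally likely."""
    rng = random.Random(seed)  # a seed is only used to make the run repeatable
    return rng.sample(records, t)


# Hypothetical population of 10 record numbers.
population = list(range(1, 11))
sample = simple_random_sample(population, 5, seed=0)
print(sample, sampling_factor(len(sample), len(population)))
```

An intruder who finds a matching record can no longer be sure the target is in the released sample at all, which is the source of the protection.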
consecutive values, replace those values by the group average
#    Age   Amount
2    44    2,500
4    44    1,000
10   45    3,100
1    44    1,200
6    45      750
7    25    1,200
3    55    3,000
5    55      900
8    35    2,200
9    55    4,200
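The group-average replacement can be sketched as follows for a single numeric attribute. The group size, the rule that the last group absorbs any leftover records, and the sample values are assumptions made for this illustration; the slides do not fix them.

```python
def microaggregate(values, group_size):
    """Replace each value by the mean of its group of consecutive values.

    Values are ordered, cut into groups of `group_size` consecutive
    values (the last group takes any remainder), and each value is
    replaced by its group average. Results keep the input order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = max(1, len(values) // group_size)
    result = [0.0] * len(values)
    for g in range(n_groups):
        start = g * group_size
        end = (g + 1) * group_size if g < n_groups - 1 else len(values)
        members = order[start:end]
        mean = sum(values[i] for i in members) / len(members)
        for i in members:
            result[i] = mean
    return result


# Hypothetical attribute values.
values = [40, 10, 100, 20, 60, 30, 50]
print(microaggregate(values, 3))
# [62.5, 20.0, 62.5, 20.0, 62.5, 20.0, 62.5]
```

The total (and therefore the overall mean) of the attribute is unchanged, while each released value is shared by at least `group_size` records.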
to a microdata
#    Age   Amount
1    44    1,200
2    44    2,500
3    55    3,000
4    44    1,000
5    55      900
6    45      750
7    25    1,200
8    35    2,200
9    55    4,200
10   45    3,100
Replace the value with a less specific but semantically consistent value
Suppression
Do not release a value at all
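Both operations can be sketched in a few lines. This is a minimal illustration that assumes zip codes are generalized by masking trailing digits with `*`, sex is generalized to `Person`, and a suppressed value is represented as `None`; the function names and the sample record are hypothetical.

```python
def generalize_zip(zip_code, level):
    """Mask the last `level` digits of a zip code with '*'."""
    keep = len(zip_code) - level
    return zip_code[:keep] + "*" * level


def generalize_sex(sex, level):
    """Level 0 keeps Male/Female; level 1 replaces both with Person."""
    return sex if level == 0 else "Person"


def suppress(record, attribute):
    """Return a copy of the record with one value withheld (None)."""
    masked = dict(record)
    masked[attribute] = None
    return masked


# Hypothetical record.
record = {"sex": "Male", "zip": "41075", "age": 50}
print(generalize_zip(record["zip"], 1))  # 4107*
print(generalize_zip(record["zip"], 2))  # 410**
print(generalize_sex(record["sex"], 1))  # Person
print(suppress(record, "age"))
```

Generalization keeps the value truthful but coarser, whereas suppression removes it entirely, so suppression loses more information per value.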
Z0 = {41075, 41076, 41095, 41099}
Z1 = {4107*, 4109*}
Z2 = {410**}
Zip hierarchy: 410** -> 4107* -> 41075, 41076
               410** -> 4109* -> 41095, 41099

S0 = {Male, Female}
S1 = {Person}
Sex hierarchy: Person -> Male, Female

Generalization Lattice        Distance Vector Generalization Lattice
<S1, Z2>                      [1, 2]
<S1, Z1>   <S0, Z2>           [1, 1]   [0, 2]
<S1, Z0>   <S0, Z1>           [1, 0]   [0, 1]
<S0, Z0>                      [0, 0]
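The lattice can be enumerated and searched programmatically. The sketch below lists the distance vectors [s, z] for the sex and zip hierarchies and returns the lowest nodes at which a table becomes k-anonymous. Picking nodes by minimal height s + z is a simplification chosen for this example, and the records and function names are hypothetical.

```python
from collections import Counter
from itertools import product


def lattice_nodes():
    """Distance vectors (s, z) for S0..S1 and Z0..Z2, lowest height first."""
    return sorted(product(range(2), range(3)), key=sum)


def generalize(record, vector):
    """Generalize a (sex, zip) record to the levels given by the vector."""
    sex, zip_code = record
    s, z = vector
    g_sex = sex if s == 0 else "Person"
    g_zip = zip_code[:len(zip_code) - z] + "*" * z
    return (g_sex, g_zip)


def is_k_anonymous(records, vector, k):
    """True if every generalized (sex, zip) value occurs at least k times."""
    counts = Counter(generalize(r, vector) for r in records)
    return min(counts.values()) >= k


def lowest_k_anonymous_nodes(records, k):
    """Satisfying nodes of minimal height in the lattice."""
    ok = [v for v in lattice_nodes() if is_k_anonymous(records, v, k)]
    if not ok:
        return []
    height = sum(ok[0])
    return [v for v in ok if sum(v) == height]


# Hypothetical (sex, zip) records.
records = [
    ("Male", "41075"), ("Male", "41076"),
    ("Female", "41075"), ("Female", "41076"),
    ("Male", "41095"), ("Male", "41099"),
]
print(lattice_nodes())
print(lowest_k_anonymous_nodes(records, 2))  # [(0, 1)], i.e. <S0, Z1>
```

For these records, generalizing only the zip code by one level is enough for 2-anonymity, while generalizing only sex is not, since zip 41095 would still be unique.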