Fast Data Anonymization with Low Information Loss
Gabriel Ghinita 1, Panagiotis Karras 2, Panos Kalnis 1, Nikos Mamoulis 2
1 National University of Singapore
{ghinitag,kalnis}@comp.nus.edu.sg
2 Hong Kong University
{pkarras,nikos}@cs.hku.hk
! Large amounts of public data
" Research or statistical purposes
" e.g. distribution of disease by age, city
! Data may contain sensitive information
" Ensure data privacy
(a) Microdata

Age  ZipCode  Disease
42   52000    Ulcer
47   43000    Pneumonia
51   32000    Flu
55   27000    Gastritis
62   41000    Dyspepsia
67   55000    Dyspepsia

(b) Voting Registration List (public)

Name  Age  ZipCode
Andy  42   52000
Bill  47   43000
Ken   51   32000
Nash  55   27000
Mike  62   41000
Sam   67   55000
(a) 2-anonymous microdata

Age    ZipCode      Disease
42-47  43000-52000  Ulcer
42-47  43000-52000  Pneumonia
51-55  27000-32000  Flu
51-55  27000-32000  Gastritis
62-67  41000-55000  Dyspepsia
62-67  41000-55000  Dyspepsia

(b) Voting Registration List (public)

Name  Age  ZipCode  Disease
Andy  42   52000    Ulcer or Pneumonia
Bill  47   43000    Ulcer or Pneumonia
Ken   51   32000    Flu or Gastritis
Nash  55   27000    Flu or Gastritis
Mike  62   41000    Dyspepsia
Sam   67   55000    Dyspepsia

[Sam01] P. Samarati, "Protecting Respondents' Identities in Microdata Release," in IEEE TKDE, vol. 13, no. 6, November/December 2001, pp. 1010-1027.
! QID generalization or suppression
! At least ℓ sensitive attribute (SA) values in each group
" e.g. frequency of an SA value in a group at most 1/ℓ
[MGKV06] A. Machanavajjhala et al. ℓ-diversity: Privacy Beyond k-anonymity, Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006
! Find k-anonymous/ℓ-diverse transformation
! Minimize information loss
! Incur reduced anonymization overhead
! 1D QID
" Linear, optimal k-anonymous partitioning
" Polynomial, optimal ℓ-diverse partitioning
" Linear heuristic for ℓ-diverse partitioning
! Generalization to multi-dimensional QID
" Multi-to-1D mapping
! Hilbert Space-Filling Curve
! i-Distance
" Apply 1D algorithms
! Dimensionality Mapping
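The multi-to-1D mapping can be illustrated with a minimal sketch of the Hilbert curve for a 2D QID. This is only an illustration under stated assumptions: the function names, the 16x16 grid, the value ranges, and the sample (Age, Weight) records are hypothetical and do not come from the slides. The index computation is the standard bitwise Hilbert-curve conversion.

```python
def hilbert_index(n, x, y):
    """Position of grid cell (x, y) along the Hilbert curve of an n-by-n grid.

    n must be a power of two; x and y must lie in [0, n - 1].
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def to_cell(value, lo, hi, n):
    """Scale a QID value from [lo, hi] to a grid coordinate in [0, n - 1]."""
    return min(n - 1, int((value - lo) / (hi - lo) * n))


# Hypothetical (Age, Weight) records and attribute ranges, for illustration only.
N = 16
records = [(22, 50), (63, 85), (31, 65), (55, 70), (24, 55), (61, 80)]
ordered = sorted(
    records,
    key=lambda r: hilbert_index(N, to_cell(r[0], 20, 70, N), to_cell(r[1], 40, 100, N)),
)
```

Records that are close on the curve tend to be close in the original QID space, which is why a 1D partitioning of `ordered` can then be applied to multi-dimensional data.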
! Generalization-based
" data-space partitioning
" similar to k-d-trees
! split recursively as long as privacy condition holds (k = 2 in the example)
[FWR06] K. LeFevre et al. Mondrian Multidimensional k-anonymity, Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006
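The recursive splitting idea can be sketched as follows. This is a simplified illustration of k-d-tree-style partitioning under k-anonymity, not the implementation from [FWR06]: it picks the dimension with the widest raw extent (no normalization), splits at the median, and stops when no split leaves at least k records on both sides. The function name and the sample records are hypothetical.

```python
def mondrian_like_partition(records, k):
    """Recursively split numeric QID tuples; every returned group has >= k records."""
    if len(records) < 2 * k:
        return [records]  # cannot split into two groups of size >= k
    dims = range(len(records[0]))
    # Try dimensions from widest to narrowest extent.
    by_extent = sorted(
        dims,
        key=lambda d: max(r[d] for r in records) - min(r[d] for r in records),
        reverse=True,
    )
    for d in by_extent:
        ordered = sorted(records, key=lambda r: r[d])
        median = ordered[len(ordered) // 2][d]
        left = [r for r in ordered if r[d] < median]
        right = [r for r in ordered if r[d] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian_like_partition(left, k) + mondrian_like_partition(right, k)
    return [records]  # no allowable split: this is a final group


# Hypothetical (Age, Weight) records, for illustration only.
sample = [(22, 50), (24, 55), (30, 60), (31, 65), (55, 70), (56, 75), (61, 80), (63, 85)]
groups = mondrian_like_partition(sample, k=2)
```

Each final group would then be published with its QID values generalized to the group's bounding ranges.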
[Figure: Mondrian partitioning of the Age-Weight QID space]
[Figure: partitionings of the Age-Weight QID space; panels: Mondrian, k-anonymity, k = 4; Our Method, k-anonymity, k = 4; ℓ-diversity, ℓ = 3; ℓ-diversity, ℓ = 3, Our Method]
! Permutation-based method
" discloses exact QID values
" vulnerable to presence attacks
“Anatomized” table

Age  ZipCode  |  Disease
42   52000    |  Pneumonia(1)
47   43000    |  Ulcer(1)

51   32000    |  Dyspepsia(1)
62   41000    |  Flu(1)

55   27000    |  Dyspepsia(1)
67   55000    |  Gastritis(1)

[XT06] X. Xiao and Y. Tao. Anatomy: simple and effective privacy preservation, Proceedings
|G|! permutations
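A minimal sketch of the permutation-based idea: within each group, exact QID values are published, while SA values are published only as per-group counts. Reading "|G|! permutations" as the number of ways the SA values of a group with |G| records can be assigned to those records is an interpretation, and the grouping, function names, and table layout below are assumptions for illustration.

```python
from collections import Counter
from math import factorial

# Microdata rows (Age, ZipCode, Disease), grouped in pairs (assumed grouping).
example_groups = [
    [(42, 52000, "Ulcer"), (47, 43000, "Pneumonia")],
    [(51, 32000, "Flu"), (62, 41000, "Dyspepsia")],
    [(55, 27000, "Gastritis"), (67, 55000, "Dyspepsia")],
]


def anatomize(groups):
    """Split grouped microdata into a QID table and a per-group SA count table."""
    qid_table, sa_table = [], []
    for gid, group in enumerate(groups, start=1):
        for age, zipcode, _ in group:
            qid_table.append((age, zipcode, gid))
        counts = Counter(disease for _, _, disease in group)
        for disease, count in sorted(counts.items()):
            sa_table.append((gid, disease, count))
    return qid_table, sa_table


def num_permutations(group_size):
    """Number of orderings of the SA values within a group of the given size."""
    return factorial(group_size)


qid_table, sa_table = anatomize(example_groups)
```

Sorting the SA values inside each group breaks the row-by-row link between a record and its disease, which is the point of publishing two separate tables.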
[Figure: records in the Age-Weight QID space with group G2 highlighted; labels: QID, SA, D1, D2, D3, Alzheimer]
IL =
! Properties of optimal solution
" Groups do not overlap in QID space
" Group size bounded by 2k-1
! DP Formulation O(kN)
" optimal cost up to the previous group, plus the cost for the current group
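The two properties above suggest a dynamic program over the sorted 1D values: each group is a run of consecutive records of size between k and 2k-1, so each position needs only k candidate group sizes, giving O(kN) work after sorting. The sketch below is an illustration, not the authors' code; in particular, the cost of a group (its size times its extent) is an assumed stand-in for the information loss metric, and the function name is hypothetical.

```python
def optimal_1d_k_anonymity(values, k):
    """Return (total_cost, groups) for an optimal partitioning of 1D values."""
    xs = sorted(values)
    n = len(xs)
    if n < k:
        raise ValueError("need at least k records")
    inf = float("inf")
    best = [inf] * (n + 1)  # best[i]: minimal cost of partitioning xs[:i]
    back = [0] * (n + 1)    # back[i]: start index of the last group in that solution
    best[0] = 0.0
    for i in range(k, n + 1):
        for size in range(k, min(2 * k - 1, i) + 1):
            j = i - size  # the previous groups cover xs[:j]
            if best[j] == inf:
                continue
            # Assumed group cost: size times extent of the current group xs[j:i].
            cost = best[j] + size * (xs[i - 1] - xs[j])
            if cost < best[i]:
                best[i], back[i] = cost, j
    groups, i = [], n
    while i > 0:
        groups.append(xs[back[i]:i])
        i = back[i]
    groups.reverse()
    return best[n], groups


# The Age column of the microdata example, with k = 2.
total_cost, age_groups = optimal_1d_k_anonymity([42, 47, 51, 55, 62, 67], k=2)
```

For this input the sketch groups the ages as {42, 47}, {51, 55}, {62, 67}, matching the 2-anonymous example earlier in the slides.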
! Properties of optimal solution
" Group size bounded by 2ℓ-1
" But groups MAY overlap in QID space
! SA Domain Representation
! Optimal algorithm is polynomial
" But may be costly in practice
! Linear heuristic algorithm
" Considers single “frontier of search”
" Frontier consists of first non-assigned record in
! use “frontier” of search
! check “eligibility condition” (for termination)
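The slides do not spell out the eligibility condition. A commonly used form, assumed here, is that a set of records can be split into ℓ-diverse groups only if no SA value accounts for more than a 1/ℓ fraction of the records. The sketch below checks that condition; the function name is hypothetical and the exact condition used by the heuristic may differ.

```python
from collections import Counter


def is_eligible(sa_values, l):
    """True if no SA value occurs in more than len(sa_values) / l records."""
    if not sa_values:
        return True  # nothing left to assign
    most_common_count = max(Counter(sa_values).values())
    return most_common_count * l <= len(sa_values)


# Disease column of the microdata example.
diseases = ["Ulcer", "Pneumonia", "Flu", "Gastritis", "Dyspepsia", "Dyspepsia"]
```

A heuristic that forms groups greedily could run such a check on the records not yet assigned, so that it never leaves behind a remainder that cannot be made ℓ-diverse.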
! Census dataset
" Data about 500,000 individuals
! General purpose information loss metric
" Based on group extent in QID space
! OLAP query accuracy
" KL-divergence pdf distance
! Distance between actual and approximate
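As a reminder of how the KL-divergence between two discrete distributions is computed, here is a minimal sketch. The function name and the two sample distributions are hypothetical; how the slides derive the distributions from the original and the anonymized data is not shown here.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * ln(p_i / q_i) for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of cells")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # Q assigns zero mass where P does not
            total += pi * math.log(pi / qi)
    return total


# Hypothetical distributions over two cells, for illustration only.
actual = [0.5, 0.5]
approximate = [0.25, 0.75]
divergence = kl_divergence(actual, approximate)
```

The divergence is zero when the two distributions coincide and grows as the approximation drifts away from the actual distribution, so lower values indicate better query accuracy.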
! Framework for k-anonymity and ℓ-diversity
" Transform the multi-D QID problem to 1-D
" Apply linear optimal/heuristic 1D algorithms
! Results
" Clearly superior utility to Mondrian, with
" Similar (or better) utility as Anatomy for