HIDE: Privacy Preserving Medical Data Publishing
James Gardner Department of Mathematics and Computer Science Emory University jgardn3@emory.edu
Motivation
De-identification is critical in any health informatics system
framework for data custodians and publishers
de-identify data
equivalent geocodes, except for the initial three digits of a zip code, if according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.
discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;
Vehicle identifiers and serial numbers, including license plate numbers;
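The zip code and age rules quoted above can be sketched in a few lines of Python. This is only an illustration: the `ZIP3_POPULATION` table holds made-up numbers standing in for the Census data, the function names are hypothetical, and treating an unknown three-digit prefix as small is an assumption chosen to be conservative.

```python
# Hypothetical ZIP3 -> population lookup; real values would come from
# current Bureau of the Census data, not from this table.
ZIP3_POPULATION = {"303": 450_000, "036": 15_000}


def generalize_zip(zipcode, populations=ZIP3_POPULATION):
    """Keep the initial three digits only if that area has more than 20,000 people."""
    zip3 = zipcode[:3]
    # Assumption: a prefix missing from the table is treated as a small area.
    if populations.get(zip3, 0) > 20_000:
        return zip3
    return "000"


def generalize_age(age):
    """Aggregate all ages over 89 into a single '90 or older' category."""
    return "90+" if age > 89 else str(age)


print(generalize_zip("30322"), generalize_zip("03601"), generalize_age(93))
```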
the investigator to code the data)
detect PHI
are used to detect PHI
Pittsburgh and approved by IRB
in an approved list of non-identifying terms
text into predefined categories such as person, organization, location, expressions of time, quantities, etc.
RB good/JJ idea/NN ./.
gender> with history of <disease>B-cell lymphoma</disease> (Marginal zone, <mrn>SH-04-4444</mrn>)
dictionaries
closed class with an exhaustive list, e.g. geographic locations
terms that follow certain syntactic patterns, e.g. phone numbers
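A minimal sketch of the two rule-based ideas above, dictionary lookup for a closed class and pattern matching for syntactic terms. The word list, the phone pattern, and the `find_phi` name are illustrative assumptions, not HIDE's actual rules.

```python
import re

# Hypothetical closed-class dictionary; a real list would be exhaustive.
LOCATIONS = {"atlanta", "pittsburgh"}

# Illustrative US-style phone pattern such as 404-555-0100.
PHONE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def find_phi(text):
    """Return (term, category) pairs found by pattern and dictionary lookup."""
    hits = [(m.group(), "phone") for m in PHONE.finditer(text)]
    for word in re.findall(r"[A-Za-z]+", text):
        if word.lower() in LOCATIONS:
            hits.append((word, "location"))
    return hits


print(find_phi("Call 404-555-0100 from Atlanta"))
```

Rules like these are easy to audit but only cover terms that are listed or that follow the expected pattern.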
where each token is assigned a label
attributes) for training and classification of the sequence
MEMM, SVM, and CRF
than “coding”
according to a given anonymization principle
differential privacy principle
databases
database
Table 1: Illustration of Anonymization

Original Data
Name   Age  Gender  Zipcode  Diagnosis
Henry  25   Male    53710    Influenza
Irene  28   Female  53712    Lymphoma
Dan    28   Male    53711    Bronchitis
Erica  26   Female  53712    Influenza

Anonymized Data
Name  Age      Gender  Zipcode        Disease
*     [25-28]  Male    [53710-53711]  Influenza
*     [25-28]  Female  53712          Lymphoma
*     [25-28]  Male    [53710-53711]  Bronchitis
*     [25-28]  Female  53712          Influenza
k-1 other records with the same quasi-identifier set
specific record through QID is at most 1/k
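The k-anonymity condition can be checked by grouping records on their quasi-identifier (QID) values and taking the smallest group size. The sketch below applies this to the anonymized rows of Table 1; the function name `anonymity_level` and the choice of Age, Gender, and Zipcode as the QID are assumptions made for illustration.

```python
from collections import Counter


def anonymity_level(records, qid):
    """Return the size of the smallest group of records sharing the same QID values."""
    groups = Counter(tuple(r[a] for a in qid) for r in records)
    return min(groups.values())


anonymized = [
    {"Age": "[25-28]", "Gender": "Male", "Zipcode": "[53710-53711]", "Disease": "Influenza"},
    {"Age": "[25-28]", "Gender": "Female", "Zipcode": "53712", "Disease": "Lymphoma"},
    {"Age": "[25-28]", "Gender": "Male", "Zipcode": "[53710-53711]", "Disease": "Bronchitis"},
    {"Age": "[25-28]", "Gender": "Female", "Zipcode": "53712", "Disease": "Influenza"},
]

k = anonymity_level(anonymized, ["Age", "Gender", "Zipcode"])
# Each record shares its QID values with k-1 others, so linking a person
# to a specific record through the QID succeeds with probability at most 1/k.
print(k, 1 / k)
```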
nearly identical output when performed on nearly identical input
that maximize utility for a random query workload
[Figure: Differentially Private Interface between the Original Data and the User. Components: Query Strategy, Diff. Private Histogram, Workload of pre-designed queries; the user submits queries and receives answers.]
sensitive information
generalization to provide a k-anonymized view of the data
statistics from the patient-centric view
precision or recall
Token     Label
77        B-age
year      O
old       O
female    B-gender
with      O
history   O
of        O
B         B-disease
-         I-disease
cell      I-disease
lymphoma  I-disease
(         O
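The token labels above follow the usual B-/I-/O scheme: the first token of a PHI span gets `B-<type>`, later tokens of the same span get `I-<type>`, and everything else gets `O`. A minimal sketch of deriving such labels from inline-tagged text is shown below; the tokenizer (word runs or single punctuation marks) and the function name are assumptions, not HIDE's own code.

```python
import re

TAG = re.compile(r"<(\w+)>(.*?)</\1>")   # e.g. <disease>...</disease>
TOKEN = re.compile(r"\w+|[^\w\s]")        # assumed tokenizer: words or single symbols


def bio_labels(tagged):
    """Convert inline-tagged text into (token, label) pairs."""
    out, pos = [], 0
    for m in TAG.finditer(tagged):
        # Text before the tagged span is outside any PHI entity.
        out += [(t, "O") for t in TOKEN.findall(tagged[pos:m.start()])]
        for i, t in enumerate(TOKEN.findall(m.group(2))):
            out.append((t, ("B-" if i == 0 else "I-") + m.group(1)))
        pos = m.end()
    out += [(t, "O") for t in TOKEN.findall(tagged[pos:])]
    return out


for token, label in bio_labels("with history of <disease>B-cell lymphoma</disease> ("):
    print(token, label)
```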
Regular Expression                         Name
^[A-Za-z]$                                 ALPHA
^[A-Z].*$                                  INITCAPS
^[A-Z][a-z].*$                             UPPER-LOWER
^[A-Z]+$                                   ALLCAPS
^[A-Z][a-z]+[A-Z][A-Za-z]*$                MIXEDCAPS
^[A-Za-z]$                                 SINGLECHAR
^[0-9]$                                    SINGLEDIGIT
^[0-9][0-9]$                               DOUBLEDIGIT
^[0-9][0-9][0-9]$                          TRIPLEDIGIT
^[0-9][0-9][0-9][0-9]$                     QUADDIGIT
^[0-9,]+$                                  NUMBER
[0-9]                                      HASDIGIT
^.*[0-9].*[A-Za-z].*$                      ALPHANUMERIC
^.*[A-Za-z].*[0-9].*$                      ALPHANUMERIC
^[0-9]+[A-Za-z]$                           NUMBERS LETTERS
^[A-Za-z]+[0-9]+$                          LETTERS NUMBERS
’                                          HASQUOTE
/                                          HASSLASH
‘~!@#$%\^&*()\-=_+\[\]{}|;’:\",./<>?]+$    ISPUNCT
(-|\+)?[0-9,]+(\.[0-9]*)?%?$               REALNUMBER
^-.*                                       STARTMINUS
^\+.*$                                     STARTPLUS
^.*%$                                      ENDPERCENT
^[IVXDLCM]+$                               ROMAN
^\s+$                                      ISSPACE
Table 1: List of regular expression features used in HIDE
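As a rough illustration of how such patterns become token features, the sketch below compiles a handful of the expressions listed above and reports which ones fire for a token. The subset chosen and the `regex_features` helper are assumptions for the example; how HIDE encodes the matches internally is not shown in these slides.

```python
import re

# A subset of the listed patterns, copied as written in the table.
PATTERNS = {
    "INITCAPS": r"^[A-Z].*$",
    "ALLCAPS": r"^[A-Z]+$",
    "DOUBLEDIGIT": r"^[0-9][0-9]$",
    "NUMBER": r"^[0-9,]+$",
    "HASDIGIT": r"[0-9]",
    "ROMAN": r"^[IVXDLCM]+$",
}
COMPILED = {name: re.compile(pattern) for name, pattern in PATTERNS.items()}


def regex_features(token):
    """Return the sorted names of all patterns that match the token."""
    return sorted(name for name, rx in COMPILED.items() if rx.search(token))


for tok in ["77", "B", "lymphoma", "IV"]:
    print(tok, regex_features(tok))
```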
Token     CAPS?  SPECIAL?  PREVIOUS  NEXT      LABEL
77        N      Y         ?         year      B-age
year      N      N         77        old       O
old       N      N         year      female    O
female    N      N         old       with      B-gender
with      N      N         female    history   O
history   N      N         with      of        O
of        N      N         history   B         O
B         Y      N         of        -         B-disease
-         N      Y         B         cell      I-disease
cell      N      N         -         lymphoma  I-disease
lymphoma  N      N         cell      (         I-disease
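The feature rows above pair each token with simple surface flags and its neighbours. A minimal sketch of building such rows follows. The definitions used here (CAPS means the token starts with an uppercase letter, SPECIAL means it is not purely alphabetic, and `?` marks a missing neighbour) are assumptions inferred from the example rows, not a documented HIDE specification.

```python
def feature_rows(tokens):
    """Build one feature dictionary per token, including previous/next context."""
    rows = []
    for i, tok in enumerate(tokens):
        rows.append({
            "token": tok,
            "caps": "Y" if tok[:1].isupper() else "N",
            # Assumed meaning of SPECIAL: anything that is not purely letters.
            "special": "N" if tok.isalpha() else "Y",
            "previous": tokens[i - 1] if i > 0 else "?",
            "next": tokens[i + 1] if i + 1 < len(tokens) else "?",
        })
    return rows


for row in feature_rows(["77", "year", "old", "female"]):
    print(row)
```

Rows of this kind, together with the labels, form the training sequences for the classifier.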
Feature set  Precision  Recall  F-Score
d            0.562      0.623   0.591
r            0.745      0.832   0.786
rd           0.749      0.839   0.792
a            0.788      0.847   0.816
ad           0.792      0.853   0.821
ra           0.811      0.868   0.838
rad          0.81       0.868   0.838
c            0.944      0.967   0.955
cd           0.948      0.969   0.958
ac           0.956      0.975   0.965
acd          0.958      0.977   0.967
rc           0.961      0.982   0.971
rac          0.962      0.982   0.972
racd         0.962      0.982   0.972
rcd          0.963      0.984   0.973
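The F-scores reported above are consistent with the balanced F-measure, the harmonic mean of precision and recall. The slides do not state the formula, so treating it as F1 is an assumption, checked here against the first and last reported values.

```python
def f_score(precision, recall):
    """Balanced F-measure (harmonic mean); assumes precision + recall > 0."""
    return 2 * precision * recall / (precision + recall)


print(round(f_score(0.562, 0.623), 3))  # 0.591, matching the first reported F-score
print(round(f_score(0.963, 0.984), 3))  # 0.973, matching the last reported F-score
```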
[Figures: precision, recall, and F-score versus sample probability (two panels)]
[Figures: precision, recall, and F-score versus history size (two panels)]
detecting PHI
in great detail
recall with minimal impact on precision
[Figure: partitioning of patients over age (25 to 28) and zip code (53710 to 53712): (a) Patients, (b) Single-Dimensional, (c) Strict Multidimensional]
[Figure: (b) Greedy strict multidimensional partitioning, shown for k = 50, k = 25, and k = 10]
[Figure: query precision (%) versus k for statistical, partial, and full de-identification]
module in HIDE
each dimension represents a statistic over the patient-centric view
gain to maximize level of utility of differentially private data cube
best satisfies the workload of queries
data cube
accessed we use some of the privacy budget
differentially private manner
data is queried to minimize the amount of noise we must add to the results
into its individual cells and release a perturbed count for each cell
strategy, where each split value selection maximizes the information gain and ensures the uniformity of the data points in the partition
cubes that will increase the accuracy of the released data cubes
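The cell-based release mentioned above (a perturbed count for each cell) is commonly implemented with the Laplace mechanism. The sketch below assumes each patient falls into exactly one cell, so a count query has sensitivity 1 and noise of scale 1/epsilon per cell suffices. The cell keys, the placement of the counts 90, 50, 10, 50 on those keys, and the function name are illustrative assumptions.

```python
import random


def perturb_counts(counts, epsilon, rng=None):
    """Add Laplace(0, 1/epsilon) noise to every cell count (sensitivity 1 assumed)."""
    rng = rng or random.Random()
    noisy = {}
    for cell, count in counts.items():
        # The difference of two exponentials with rate epsilon is
        # Laplace-distributed with scale 1/epsilon.
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        noisy[cell] = count + noise
    return noisy


# Illustrative Age x Income histogram; which count belongs to which cell is assumed.
cells = {(20, "40K"): 90, (20, "50K"): 50, (30, "40K"): 10, (30, "50K"): 50}
print(perturb_counts(cells, epsilon=0.5))
```

A smaller epsilon gives stronger privacy but noisier counts, which is why merging cells with similar counts into one partition before perturbation can improve accuracy.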
[Figure: cell histogram over Age (20, 30) and Income (40K, 50K) with cell counts 90, 50, 10, 50 and perturbed counts 90', 50', 10', 50'. Example queries: Q1: count() where Age = 20, Income = 40K; Q2: count() where Age = 20, Income = 50K; …]
[Figure: Multi-dimensional partitioning of the Age (20, 30) by Income (40K, 50K) histogram with cell counts 90, 50, 10, 50; perturbed partition counts 90', 10', 100']
[Figure: Partitioning of the Age (20, 30) by Income (40K, 50K) histogram. Cell counts 90, 50, 10, 50 with perturbed counts 90', 50', 10', 50'; combining the two 50' cells (50' + 50') gives partition counts 90', 10', 100']
underlying classifier
CRFSuite
Information (demo paper). In 28th IEEE International Conference on Data Engineering (ICDE), 2012
sets and sampling techniques for statistical de-identification of medical records. In 1st ACM International Health Informatics Symposium, 2010 (to appear).
Discovery on EHRs. In Information Discovery on Electronic Health Records, Ed. Vagelis Hristidis. Chapman and Hall/CRC, pp. 197–225, 2009.
Data and Knowledge Engineering, 68(12), pp. 1441–1451, 2009, doi:10.1016/j.datak.2009.07.006.
identification (demo track). 12th International Conference on Extending Database Technology (EDBT), March, 2009.
International Symposium on Computer-Based Medical Systems (CBMS), June, 2008