HIDE: Privacy Preserving Medical Data Publishing - James Gardner (PowerPoint presentation)


SLIDE 1

HIDE: Privacy Preserving Medical Data Publishing

James Gardner Department of Mathematics and Computer Science Emory University jgardn3@emory.edu

SLIDE 2

Motivation

  • De-identification is critical in any health informatics system
  • Research
  • Sharing
  • Need an easy-to-use interface and framework for data custodians and publishers
  • Understanding data is necessary to de-identify data

SLIDE 3

HIPAA

  • 1. Names;
  • 2. All geographical subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code, if according to the current publicly available data from the Bureau of the Census: (1) the geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and (2) the initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.
  • 3. All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;
  • 4. Phone numbers;
  • 5. Fax numbers;
  • 6. Electronic mail addresses;
  • 7. Social Security numbers;
  • 8. Medical record numbers;
  • 9. Health plan beneficiary numbers;
  • 10. Account numbers;
  • 11. Certificate/license numbers;
  • 12. Vehicle identifiers and serial numbers, including license plate numbers;
  • 13. Device identifiers and serial numbers;
  • 14. Web Universal Resource Locators (URLs);
  • 15. Internet Protocol (IP) address numbers;
  • 16. Biometric identifiers, including finger and voice prints;
  • 17. Full face photographic images and any comparable images; and
  • 18. Any other unique identifying number, characteristic, or code (note this does not mean the unique code assigned by the investigator to code the data)

SLIDE 4

PHI Summary

  • Protected Health Information (PHI) is defined by HIPAA as individually identifiable health information
  • Direct identifiers include name, SSN, etc.
  • Indirect identifiers include gender, age, address information, etc.

SLIDE 5

Research Challenges

  • Detect PHI in heterogeneous medical data
  • Apply structured anonymization principles on heterogeneous medical data (micro-privacy)
  • Release differentially private aggregated statistics (macro-privacy)

SLIDE 6

HIDE

  • Health Information DE-identification
  • Uses techniques from:
    • Information Extraction
    • Data linking
    • Structured Anonymization
    • Differential Privacy
    • Data Mining
SLIDE 7

HIDE

slide-8
SLIDE 8

Outline

  • Background and related work
    • Existing de-identification approaches
    • Named entity recognition
    • Privacy preserving data publishing
  • Proposed Work
    • HIDE framework
    • Identifying and sensitive information extraction
    • Micro-data publishing
    • Macro-data publishing
  • Software
SLIDE 9

Alternative Systems

  • Scrub System - rules and dictionaries are used to detect PHI
  • Semantic Lexicon System - rules and dictionaries are used to detect PHI
  • DE-ID - rules and dictionaries, developed at Pittsburgh and approved by IRB
  • Concept-Match Scrubber - removes every word not in an approved list of non-identifying terms
  • Carafe - uses a CRF to detect PHI
SLIDE 10

Limitations of Most Systems

  • Lack portability
  • Don't give formal privacy guarantees
  • Don't utilize the latest work from structured data anonymization
  • Focus only on removing PHI
SLIDE 11

Named Entity Recognition

  • Locate and classify atomic elements in text into predefined categories such as person, organization, location, expressions of time, quantities, etc.
  • NER systems can be classified into either:
    • Rule-based
    • Machine Learning-based
SLIDE 12

NER Examples

  • Part-of-speech (POS) Tagging
    • I/PRP think/VBP it/PRP 's/BES a/DT pretty/RB good/JJ idea/NN ./.
  • Personal Health Identifier Detection
    • <age>77</age> year old <gender>female</gender> with history of <disease>B-cell lymphoma</disease> (Marginal zone, <mrn>SH-04-4444</mrn>)

SLIDE 13

NER Metrics

  • Precision
    • TP / (TP + FP)
  • Recall
    • TP / (TP + FN)
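These token-level metrics can be sketched in a few lines of Python (the deck's own implementation language); the counts below are made up for illustration:

```python
def precision(tp, fp):
    # Fraction of tokens predicted as PHI that really are PHI.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of true PHI tokens that the system detected.
    return tp / (tp + fn)

# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p = precision(90, 10)          # 0.9
r = recall(90, 30)             # 0.75
f_score = 2 * p * r / (p + r)  # harmonic mean, the F-Score used later in the deck
```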
SLIDE 14

Rule-based

  • Rely on hand-coded rules and dictionaries
  • Dictionaries can be used for terms in a closed class with an exhaustive list, e.g. geographic locations
  • Regular expressions are used to detect terms that follow certain syntactic patterns, e.g. phone numbers
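A minimal sketch of this rule-based approach in Python, with a hypothetical two-entry dictionary and a single phone-number pattern (real systems use far larger rule sets):

```python
import re

# Hypothetical mini rule set: a dictionary for a closed class (states)
# and a regular expression for a syntactic pattern (phone numbers).
STATES = {"Georgia", "Ohio"}
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

def rule_based_phi(text):
    # Pattern hits first, then dictionary hits.
    hits = [(m.group(), "PHONE") for m in PHONE_RE.finditer(text)]
    hits += [(w.strip(".,"), "STATE") for w in text.split()
             if w.strip(".,") in STATES]
    return hits

rule_based_phi("Transferred from Ohio, call 404-555-1212.")
# → [('404-555-1212', 'PHONE'), ('Ohio', 'STATE')]
```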

SLIDE 15

Machine learning-based

  • Model NER as a sequence labeling task where each token is assigned a label
  • Train classifiers to label each token
  • Classifiers use a list of features (or attributes) for training and classification of the sequence
  • Frequently applied classifiers are HMMs, MEMMs, SVMs, and CRFs

SLIDE 16

Conditional Random Field

  • A Conditional Random Field (CRF) provides a probabilistic framework for labeling and segmenting sequential data
  • A CRF defines a conditional probability of a label sequence given an observation sequence
SLIDE 17

Comparison

  • Rule-based
    • Accurate
    • Require experts to modify
    • Not portable
  • Machine learning-based
    • Accurate
    • Modification of models is done through training rather than "coding"
    • Portable
SLIDE 18

Privacy Preserving Data Publishing

  • Weak privacy (Micro)
    • release a modified version of each record according to a given anonymization principle
    • assumes a level of background knowledge
  • Differential privacy (Macro)
    • release perturbed statistics that satisfy the differential privacy principle
    • no assumptions of background knowledge
SLIDE 19

Micro-data publishing

  • Prevent linking of records in separate databases
    • k-anonymization
  • Prevent discovery of sensitive values
    • l-diversity
  • Prevent discovery of presence or absence in a database
    • delta-presence
SLIDE 20

Micro-data publishing

Table 1: Illustration of Anonymization

Original Data
  Name  | Age | Gender | Zipcode | Diagnosis
  Henry | 25  | Male   | 53710   | Influenza
  Irene | 28  | Female | 53712   | Lymphoma
  Dan   | 28  | Male   | 53711   | Bronchitis
  Erica | 26  | Female | 53712   | Influenza

Anonymized Data
  Name | Age     | Gender | Zipcode       | Disease
  ∗    | [25-28] | Male   | [53710-53711] | Influenza
  ∗    | [25-28] | Female | 53712         | Lymphoma
  ∗    | [25-28] | Male   | [53710-53711] | Bronchitis
  ∗    | [25-28] | Female | 53712         | Influenza

SLIDE 21

k-anonymization

  • Quasi-identifier set
  • Sensitive attributes
  • A table is k-anonymous if every record has k-1 other records with the same quasi-identifier set
  • The probability of linking a victim to a specific record through the QID is at most 1/k
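The k-anonymity property is easy to check mechanically; a Python sketch, using the anonymized table from the slides as the test data:

```python
from collections import Counter

def is_k_anonymous(records, qid, k):
    # Every combination of quasi-identifier values must occur >= k times,
    # so each record is hidden among at least k-1 others.
    groups = Counter(tuple(r[a] for a in qid) for r in records)
    return all(n >= k for n in groups.values())

rows = [
    {"age": "[25-28]", "gender": "Male",   "zip": "[53710-53711]", "dx": "Influenza"},
    {"age": "[25-28]", "gender": "Female", "zip": "53712",         "dx": "Lymphoma"},
    {"age": "[25-28]", "gender": "Male",   "zip": "[53710-53711]", "dx": "Bronchitis"},
    {"age": "[25-28]", "gender": "Female", "zip": "53712",         "dx": "Influenza"},
]
is_k_anonymous(rows, ("age", "gender", "zip"), 2)  # → True
```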

SLIDE 22

l-diversity

  • Extension of k-anonymization
  • Also ensures that each group has at least l distinct sensitive values
  • Prevents disclosure of sensitive values
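A matching l-diversity check, sketched in Python on a made-up two-column table:

```python
from collections import defaultdict

def is_l_diverse(records, qid, sensitive, l):
    # Each quasi-identifier group must carry at least l distinct
    # sensitive values, so a group never pins down one diagnosis.
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[a] for a in qid)].add(r[sensitive])
    return all(len(vals) >= l for vals in groups.values())

rows = [
    {"zip": "537**", "dx": "Influenza"},
    {"zip": "537**", "dx": "Lymphoma"},
    {"zip": "538**", "dx": "Influenza"},
    {"zip": "538**", "dx": "Influenza"},
]
is_l_diverse(rows, ("zip",), "dx", 2)  # → False: the 538** group is uniform
```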
SLIDE 23

Macro-data publishing

  • Differential Privacy is a strong privacy notion
  • Requires that a randomized computation yields nearly identical output when performed on nearly identical input
  • Interactive model
    • limited to a specific number of queries
  • Non-interactive model
    • needs query strategies to build noisy data cubes that maximize utility for a random query workload
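For count queries, the standard way to satisfy differential privacy is the Laplace mechanism; a stdlib-only Python sketch using inverse-CDF sampling (the counts and epsilon are illustrative):

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon):
    # A count query has sensitivity 1, so adding Lap(1/epsilon) noise
    # satisfies epsilon-differential privacy.
    return true_count + laplace_noise(1.0 / epsilon)

private_count(100, 1.0)  # a noisy version of the true count, e.g. 101.3
```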

SLIDE 24

Differentially Private Interface

[Diagram: Original Data → (pre-designed query workload / query strategy) → Diff. Private Histogram → user queries answered with differentially private answers]

SLIDE 25

HIDE Framework

  • Identifying and Sensitive Information Extraction
    • uses a state-of-the-art CRF model to extract PHI and sensitive information
  • Data linking
    • provides a structured patient-centric view of the data
  • De-identification and Anonymization
    • Micro-data publication - uses data suppression and generalization to provide a k-anonymized view of the data
    • Macro-data publication - releases perturbed aggregated statistics from the patient-centric view

SLIDE 26

HIDE

SLIDE 27

Identifying and sensitive information extraction

  • Use a CRF classifier to extract information
  • Studied the impact of features including:
    • regular expressions
    • affixes
    • dictionaries
    • context
  • Sampling techniques to adjust the classifier for higher precision or recall

SLIDE 28

Example

  Token    | Label
  77       | B-age
  year     | O
  old      | O
  female   | B-gender
  with     | O
  history  | O
  of       | O
  B        | B-disease
  -        | I-disease
  cell     | I-disease
  lymphoma | I-disease
  (        | O

SLIDE 29

Regular Expressions

  Regular Expression                        | Name
  ^[A-Za-z]$                                | ALPHA
  ^[A-Z].*$                                 | INITCAPS
  ^[A-Z][a-z].*$                            | UPPER-LOWER
  ^[A-Z]+$                                  | ALLCAPS
  ^[A-Z][a-z]+[A-Z][A-Za-z]*$               | MIXEDCAPS
  ^[A-Za-z]$                                | SINGLECHAR
  ^[0-9]$                                   | SINGLEDIGIT
  ^[0-9][0-9]$                              | DOUBLEDIGIT
  ^[0-9][0-9][0-9]$                         | TRIPLEDIGIT
  ^[0-9][0-9][0-9][0-9]$                    | QUADDIGIT
  ^[0-9,]+$                                 | NUMBER
  [0-9]                                     | HASDIGIT
  ^.*[0-9].*[A-Za-z].*$                     | ALPHANUMERIC
  ^.*[A-Za-z].*[0-9].*$                     | ALPHANUMERIC
  ^[0-9]+[A-Za-z]$                          | NUMBERS LETTERS
  ^[A-Za-z]+[0-9]+$                         | LETTERS NUMBERS
  -                                         | HASDASH
  '                                         | HASQUOTE
  /                                         | HASSLASH
  ^[`~!@#$%\^&*()\-=_+\[\]{}|;':\",./<>?]+$ | ISPUNCT
  ^(-|\+)?[0-9,]+(\.[0-9]*)?%?$             | REALNUMBER
  ^-.*$                                     | STARTMINUS
  ^\+.*$                                    | STARTPLUS
  ^.*%$                                     | ENDPERCENT
  ^[IVXDLCM]+$                              | ROMAN
  ^\s+$                                     | ISSPACE

Table 1: List of regular expression features used in HIDE
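In a CRF feature function, each matching pattern typically becomes one binary feature of the token; a Python sketch using a handful of the Table 1 patterns:

```python
import re

# A few of the Table 1 patterns, as (name, compiled regex) pairs.
FEATURES = [
    ("INITCAPS",  re.compile(r"^[A-Z].*$")),
    ("ALLCAPS",   re.compile(r"^[A-Z]+$")),
    ("QUADDIGIT", re.compile(r"^[0-9][0-9][0-9][0-9]$")),
    ("HASDIGIT",  re.compile(r"[0-9]")),
    ("ROMAN",     re.compile(r"^[IVXDLCM]+$")),
]

def regex_features(token):
    # Each matching pattern contributes one binary feature for the CRF.
    return [name for name, rx in FEATURES if rx.search(token)]

regex_features("2009")  # → ['QUADDIGIT', 'HASDIGIT']
regex_features("MRI")   # → ['INITCAPS', 'ALLCAPS']
```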

SLIDE 30

Affixes

  • Prefixes
  • Suffixes
  • All affixes up to size 3
SLIDE 31

Dictionaries

  • Company Names
  • Male First Names
  • Female First Names
  • Last Names
  • State Names
  • State Abbreviations
  • Hospital Names
SLIDE 32

Context

  • Previous 4 words
  • Next 4 words
  • Occurrence counts
SLIDE 33

Feature Vectors

  Token    | CAPS? | SPECIAL? | PREVIOUS | NEXT     | LABEL
  77       | N     | Y        | ?        | year     | B-age
  year     | N     | N        | 77       | old      | O
  old      | N     | N        | year     | female   | O
  female   | N     | N        | old      | with     | B-gender
  with     | N     | N        | female   | history  | O
  history  | N     | N        | with     | of       | O
  of       | N     | N        | history  | B        | O
  B        | Y     | N        | of       | -        | B-disease
  -        | N     | Y        | B        | cell     | I-disease
  cell     | N     | N        | -        | lymphoma | I-disease
  lymphoma | N     | N        | cell     | (        | I-disease

SLIDE 34
Feature Set Results

  • 220 re-identified pathology reports for the i2b2 task
  • 10-fold cross-validation

SLIDE 35

Feature Set Results

  Feature Set | Precision | Recall | F-Score
  d           | 0.562     | 0.623  | 0.591
  r           | 0.745     | 0.832  | 0.786
  rd          | 0.749     | 0.839  | 0.792
  a           | 0.788     | 0.847  | 0.816
  ad          | 0.792     | 0.853  | 0.821
  ra          | 0.811     | 0.868  | 0.838
  rad         | 0.81      | 0.868  | 0.838
  c           | 0.944     | 0.967  | 0.955
  cd          | 0.948     | 0.969  | 0.958
  ac          | 0.956     | 0.975  | 0.965
  acd         | 0.958     | 0.977  | 0.967
  rc          | 0.961     | 0.982  | 0.971
  rac         | 0.962     | 0.982  | 0.972
  racd        | 0.962     | 0.982  | 0.972
  rcd         | 0.963     | 0.984  | 0.973

(Feature codes: r = regular expressions, a = affixes, d = dictionaries, c = context)

SLIDE 36

Sampling

  • Honest brokers are concerned more about recall than precision
  • Cost-proportionate rejection sampling is often used for boosting
  • Training examples are selected based on the associated cost of missing that label

SLIDE 37
Random O-Sampling

  • Keep all non-"O" labels
  • Select "O" labels with given probability
  • Biases the classifier to select a label other than "O"
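Random O-sampling is essentially a one-line filter over (token, label) pairs; a Python sketch:

```python
import random

def o_sample(tagged_tokens, p):
    # Keep every non-"O" (PHI) token; keep an "O" token only with
    # probability p, biasing the trained classifier toward recall.
    return [(tok, lab) for tok, lab in tagged_tokens
            if lab != "O" or random.random() < p]

random.seed(0)
seq = [("77", "B-age"), ("year", "O"), ("old", "O"), ("female", "B-gender")]
o_sample(seq, 0.5)  # → [('77', 'B-age'), ('female', 'B-gender')]
```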

SLIDE 38

Random O-Sampling

[Charts: precision, recall, and F-score vs. sample probability on the i2b2 and PhysioNet corpora]

SLIDE 39

Window Sampling

  • Select all non-"O" labels
  • Select all terms within a given window of any non-"O" label
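Window sampling keeps every token near a PHI label; a Python sketch with a hypothetical window parameter w:

```python
def window_sample(tagged_tokens, w):
    # Keep every token within w positions of any non-"O" label.
    keep = set()
    for i, (_, lab) in enumerate(tagged_tokens):
        if lab != "O":
            keep.update(range(max(0, i - w), min(len(tagged_tokens), i + w + 1)))
    return [tagged_tokens[i] for i in sorted(keep)]

seq = [("a", "O"), ("b", "O"), ("77", "B-age"),
       ("year", "O"), ("old", "O"), ("female", "B-gender")]
window_sample(seq, 1)  # → everything except ("a", "O")
```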

SLIDE 40

Window Sampling

[Charts: precision, recall, and F-score vs. history (window) size on the i2b2 and PhysioNet corpora]

SLIDE 41

Information Extraction Conclusion

  • HIDE has a fast and accurate CRF for detecting PHI
  • Feature engineering has been explored in great detail
  • Window Sampling can be used to adjust recall with minimal impact on precision
  • Impact of training data size
SLIDE 42

Micro-data publishing

  • Release a patient-centric view of the original data with suppressed or generalized values
  • Apply k-anonymization and l-diversity principles to unstructured data
  • Evaluate query accuracy on real medical data

SLIDE 43

Micro-data publishing

  • Full de-identification
    • Remove all identifiers
  • Partial de-identification
    • Remove direct identifiers
  • Statistical de-identification
    • Statistical anonymization
SLIDE 44

Statistical Anonymization

  • Partition the original data points into groups that will all share the same values with respect to the QID
  • Use the multi-dimensional Mondrian algorithm for releasing a k-anonymized and l-diverse version of the structured patient-centric view

SLIDE 45

Partitioning

[Figure: patient points plotted on age (25-28) vs. zipcode (53710-53712) axes: (a) Patients; (b) Single-Dimensional partitioning; (c) Strict Multidimensional partitioning]

SLIDE 46

Mondrian algorithm

  • Greedy top-down partitioning approach
  • Choose the dimension with the maximum range
  • Split at the median if each newly created partition still satisfies k-anonymization and l-diversity
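A minimal sketch of the greedy split in Python; unlike the real Mondrian algorithm it skips range normalization and generalization hierarchies and checks only the k-anonymity constraint. The points are the age/zipcode pairs from the running example:

```python
def mondrian(points, k):
    # Greedy top-down strict multidimensional partitioning (sketch).
    # points: tuples of numeric quasi-identifier values.
    dims = range(len(points[0]))
    # Choose the dimension with the maximum range of values.
    d = max(dims, key=lambda j: max(p[j] for p in points) - min(p[j] for p in points))
    median = sorted(p[d] for p in points)[len(points) // 2]
    left = [p for p in points if p[d] < median]
    right = [p for p in points if p[d] >= median]
    # Split at the median only if both new partitions stay k-anonymous.
    if len(left) >= k and len(right) >= k:
        return mondrian(left, k) + mondrian(right, k)
    return [points]

pts = [(25, 53710), (28, 53712), (28, 53711), (26, 53712)]
mondrian(pts, 2)  # → [[(25, 53710), (26, 53712)], [(28, 53712), (28, 53711)]]
```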

SLIDE 47

Example

[Figure: greedy strict multidimensional partitioning of the same point set for k = 50, k = 25, and k = 10]

  • More precise partitions are possible with smaller k

SLIDE 48

Query Accuracy

  • 100 pathology reports
  • 10,000 random queries
  • age > n
  • age < n
SLIDE 49

Query Accuracy

[Chart: query precision (%) vs. k (10-100) for statistical, partial, and full de-identification]

SLIDE 50

Macro-data publishing

  • Differentially private data publishing (DPDP) module in HIDE
  • Create a differentially private data cube where each dimension represents a statistic over the patient-centric view
  • Partitioning algorithm based on information gain to maximize the utility of the differentially private data cube
  • Consistency algorithm to enhance utility
SLIDE 51

Differentially Private Interface

[Diagram: Original Data → (pre-designed query workload / query strategy) → Diff. Private Histogram → user queries answered with differentially private answers]

SLIDE 52

DPDP

  • DPDP considers:
    • Access to original data
    • Partitioning of the original database that best satisfies the workload of queries
    • Level of differential privacy of the data cube
    • Level of utility (or noise) in the released data cube

SLIDE 53

Access to original database

  • Every time the original database is accessed we use some of the privacy budget
  • Access the original database in a differentially private manner
  • Minimize the number of times the original data is queried to minimize the amount of noise we must add to the results

SLIDE 54

Query Strategy

  • Develop a query strategy that will allow the most utility given random queries from the user
  • This query strategy is accomplished by partitioning the data according to information gain

SLIDE 55

Partitioning of the original database

  • Release two data cubes
    • One using a cell-based algorithm that partitions the database into its individual cells and releases a perturbed count for each cell
    • One using a top-down multi-dimensional partitioning strategy, where each split value selection maximizes the information gain and ensures the uniformity of the data points in the partition
  • A consistency algorithm will be applied to the two data cubes that will increase the accuracy of the released data cubes

SLIDE 56

Cell partitioning

[Figure: a 2x2 data cube over Age {20, 30} x Income {40K, 50K} with true counts 90, 50, 10, 50; cell partitioning releases a perturbed count (90', 50', 10', 50') for each individual cell]

  Q1: count() where Age = 20, Income = 40K
  Q2: count() where Age = 20, Income = 50K
  ...

  • Select count where age > 20 and age < 30
  • alpha is the differential privacy parameter
SLIDE 57

Multi-dimensional partitioning

[Figure: the same 2x2 data cube; multi-dimensional partitioning merges cells into partitions and releases perturbed partition counts (90', 10', 100')]

  • Select count where age > 20 and age < 30
  • Noise is divided
SLIDE 58

Goals of partitioning strategy

  • Large partitions to minimize aggregated perturbation error
  • Uniform partitions to minimize approximation error
  • Minimize the number of times we access the original data

SLIDE 59

Proposed Approach

[Figure: true counts (90, 50, 10, 50) over Age {20, 30} x Income {40K, 50K}]

  • 1. Cell partitioning queries (alpha/2): perturbed cell counts 90', 50', 10', 50'
  • 2. Multi-dim Partitioning
  • 3. Multi-dim partitioning queries (alpha/2): perturbed partition counts 90', 10', 100'
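Under sequential composition, the two rounds of queries can each spend half of the overall budget alpha; a Python sketch in which, for brevity, a single all-cells partition stands in for the multi-dimensional cube:

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def two_cube_release(cell_counts, alpha):
    # Sequential composition: the cell-partitioning queries (step 1) and
    # the multi-dim partitioning queries (step 3) each spend alpha/2 of
    # the budget, so each answer is perturbed with Lap(2/alpha) noise.
    noisy_cells = [c + laplace_noise(2.0 / alpha) for c in cell_counts]
    noisy_total = sum(cell_counts) + laplace_noise(2.0 / alpha)
    return noisy_cells, noisy_total
```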
SLIDE 60

Utility of release

  • The level of utility is measured by comparing the value of a query workload on the released differentially private data cubes and a non-perturbed data cube generated from the original data
  • We empirically evaluate the level of error as a function of the given privacy budget

SLIDE 61

Software

  • Web application
    • Python, Django, and CouchDB
  • Interface
    • Iterative labeling of documents and training of the underlying classifier
    • Analyze accuracy of the classifier on validation sets
  • Classifier is the super-fast CRF provided by CRFsuite

SLIDE 62

Publications

  • Y. Xiao, J. Gardner, L. Xiong. DPCube: Releasing Differentially Private Data Cubes for Health Information (demo paper). In 28th IEEE International Conference on Data Engineering (ICDE), 2012.
  • James Gardner, Li Xiong, Fusheng Wang, Andrew Post and Joel Saltz. An evaluation of feature sets and sampling techniques for statistical de-identification of medical records. In 1st ACM International Health Informatics Symposium, 2010 (to appear).
  • Li Xiong, James Gardner, Pawel Jurczyk and James J. Lu. Privacy Preserving Information Discovery on EHRs. In Information Discovery on Electronic Health Records, Ed. Vagelis Hristidis. Chapman and Hall/CRC, pp. 197-225, 2009.
  • James Gardner and Li Xiong. An integrated framework for de-identifying unstructured medical data. Data and Knowledge Engineering, 68(12), pp. 1441-1451, 2009, doi:10.1016/j.datak.2009.07.006.
  • James Gardner, Kanwei Li, Li Xiong and James J. Lu. HIDE: Heterogeneous Information DE-identification (demo track). 12th International Conference on Extending Database Technology (EDBT), March 2009.
  • James Gardner and Li Xiong. HIDE: A Health Information DE-identification System. In 21st IEEE International Symposium on Computer-Based Medical Systems (CBMS), June 2008.