HIDE: Privacy Preserving Medical Data Publishing - James Gardner (PowerPoint presentation)


SLIDE 1

HIDE: Privacy Preserving Medical Data Publishing

James Gardner Department of Mathematics and Computer Science Emory University jgardn3@emory.edu

SLIDE 2

Motivation

  • De-identification is critical in any health informatics system
  • Research
  • Sharing
  • Need an easy-to-use interface and framework for data custodians and publishers
  • Understanding data is necessary to de-identify data

SLIDE 3

HIPAA

  • 1. Names;
  • 2. All geographical subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code, if according to the current publicly available data from the Bureau of the Census: (1) the geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and (2) the initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.
  • 3. All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;
  • 4. Phone numbers;
  • 5. Fax numbers;
  • 6. Electronic mail addresses;
  • 7. Social Security numbers;
  • 8. Medical record numbers;
  • 9. Health plan beneficiary numbers;
  • 10. Account numbers;
  • 11. Certificate/license numbers;
  • 12. Vehicle identifiers and serial numbers, including license plate numbers;
  • 13. Device identifiers and serial numbers;
  • 14. Web Universal Resource Locators (URLs);
  • 15. Internet Protocol (IP) address numbers;
  • 16. Biometric identifiers, including finger and voice prints;
  • 17. Full face photographic images and any comparable images; and
  • 18. Any other unique identifying number, characteristic, or code (note this does not mean the unique code assigned by the investigator to code the data)

SLIDE 4

PHI Summary

  • Protected Health Information (PHI) is defined by HIPAA as individually identifiable health information
  • Direct identifiers include name, SSN, etc.
  • Indirect identifiers include gender, age, address information, etc.

SLIDE 5

Research Challenges

  • Detect PHI in heterogeneous medical data
  • Apply structured anonymization principles on heterogeneous medical data (micro-privacy)
  • Release differentially private aggregated statistics (macro-privacy)

SLIDE 6

HIDE

  • Health Information DE-identification
  • Uses techniques from:
    • Information Extraction
    • Data linking
    • Structured Anonymization
    • Differential Privacy
    • Data Mining
SLIDE 7

HIDE

slide-8
SLIDE 8

Outline

  • Background and related work
    • Existing de-identification approaches
    • Named entity recognition
    • Privacy preserving data publishing
  • Proposed Work
    • HIDE framework
    • Identifying and sensitive information extraction
    • Micro-data publishing
    • Macro-data publishing
  • Software
SLIDE 9

Alternative Systems

  • Scrub System - rules and dictionaries are used to detect PHI
  • Semantic Lexicon System - rules and dictionaries are used to detect PHI
  • DE-ID - rules and dictionaries, developed at Pittsburgh and approved by IRB
  • Concept-Match Scrubber - removes every word not in an approved list of non-identifying terms
  • Carafe - uses a CRF to detect PHI
SLIDE 10

Limitations of Most Systems

  • Lack portability
  • Don't give formal privacy guarantees
  • Don't utilize the latest work from structured data anonymization
  • Focus only on removing PHI
SLIDE 11

Named Entity Recognition

  • Locate and classify atomic elements in text into predefined categories such as person, organization, location, expressions of time, quantities, etc.
  • NER systems can be classified into either:
    • Rule-based
    • Machine Learning-based
SLIDE 12

NER Examples

  • Part-of-speech (POS) Tagging
    • I/PRP think/VBP it/PRP 's/BES a/DT pretty/RB good/JJ idea/NN ./.
  • Personal Health Identifier Detection
    • <age>77</age> year old <gender>female</gender> with history of <disease>B-cell lymphoma</disease> (Marginal zone, <mrn>SH-04-4444</mrn>)

SLIDE 13

NER Metrics

  • Precision
    • TP / (TP + FP)
  • Recall
    • TP / (TP + FN)
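These token-level metrics can be sketched in a few lines of Python (the deck's own implementation language); the counts below are made up for illustration:

```python
def precision(tp, fp):
    # Fraction of tokens predicted as PHI that really are PHI.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of true PHI tokens that the system detected.
    return tp / (tp + fn)

# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p = precision(90, 10)          # 0.9
r = recall(90, 30)             # 0.75
f_score = 2 * p * r / (p + r)  # harmonic mean, the F-Score used later in the deck
```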
SLIDE 14

Rule-based

  • Rely on hand-coded rules and dictionaries
  • Dictionaries can be used for terms in a closed class with an exhaustive list, e.g. geographic locations
  • Regular expressions are used to detect terms that follow certain syntactic patterns, e.g. phone numbers
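A minimal sketch of this rule-based approach in Python, with a hypothetical two-entry dictionary and a single phone-number pattern (real systems use far larger rule sets):

```python
import re

# Hypothetical mini rule set: a dictionary for a closed class (states)
# and a regular expression for a syntactic pattern (phone numbers).
STATES = {"Georgia", "Ohio"}
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

def rule_based_phi(text):
    # Pattern hits first, then dictionary hits.
    hits = [(m.group(), "PHONE") for m in PHONE_RE.finditer(text)]
    hits += [(w.strip(".,"), "STATE") for w in text.split()
             if w.strip(".,") in STATES]
    return hits

rule_based_phi("Transferred from Ohio, call 404-555-1212.")
# → [('404-555-1212', 'PHONE'), ('Ohio', 'STATE')]
```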

SLIDE 15

Machine learning-based

  • Model NER as a sequence labeling task where each token is assigned a label
  • Train classifiers to label each token
  • Classifiers use a list of features (or attributes) for training and classification of the sequence
  • Frequently applied classifiers are HMMs, MEMMs, SVMs, and CRFs

SLIDE 16

Conditional Random Field

  • A Conditional Random Field (CRF) provides a probabilistic framework for labeling and segmenting sequential data
  • A CRF defines a conditional probability of a label sequence given an observation sequence
SLIDE 17

Comparison

  • Rule-based
    • Accurate
    • Require experts to modify
    • Not portable
  • Machine learning-based
    • Accurate
    • Modification of models is done through training rather than "coding"
    • Portable
SLIDE 18

Privacy Preserving Data Publishing

  • Weak privacy (Micro)
    • release a modified version of each record according to a given anonymization principle
    • assumes a level of background knowledge
  • Differential privacy (Macro)
    • release perturbed statistics that satisfy the differential privacy principle
    • no assumptions of background knowledge
SLIDE 19

Micro-data publishing

  • Prevent linking of records in separate databases
    • k-anonymization
  • Prevent discovery of sensitive values
    • l-diversity
  • Prevent discovery of presence or absence in a database
    • delta-presence
SLIDE 20

Micro-data publishing

Table 1: Illustration of Anonymization

Original Data
  Name  | Age | Gender | Zipcode | Diagnosis
  Henry | 25  | Male   | 53710   | Influenza
  Irene | 28  | Female | 53712   | Lymphoma
  Dan   | 28  | Male   | 53711   | Bronchitis
  Erica | 26  | Female | 53712   | Influenza

Anonymized Data
  Name | Age     | Gender | Zipcode       | Disease
  ∗    | [25-28] | Male   | [53710-53711] | Influenza
  ∗    | [25-28] | Female | 53712         | Lymphoma
  ∗    | [25-28] | Male   | [53710-53711] | Bronchitis
  ∗    | [25-28] | Female | 53712         | Influenza

SLIDE 21

k-anonymization

  • Quasi-identifier set
  • Sensitive attributes
  • A table is k-anonymous if every record has k-1 other records with the same quasi-identifier set
  • The probability of linking a victim to a specific record through the QID is at most 1/k
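The k-anonymity property is easy to check mechanically; a Python sketch, using the anonymized table from the slides as the test data:

```python
from collections import Counter

def is_k_anonymous(records, qid, k):
    # Every combination of quasi-identifier values must occur >= k times,
    # so each record is hidden among at least k-1 others.
    groups = Counter(tuple(r[a] for a in qid) for r in records)
    return all(n >= k for n in groups.values())

rows = [
    {"age": "[25-28]", "gender": "Male",   "zip": "[53710-53711]", "dx": "Influenza"},
    {"age": "[25-28]", "gender": "Female", "zip": "53712",         "dx": "Lymphoma"},
    {"age": "[25-28]", "gender": "Male",   "zip": "[53710-53711]", "dx": "Bronchitis"},
    {"age": "[25-28]", "gender": "Female", "zip": "53712",         "dx": "Influenza"},
]
is_k_anonymous(rows, ("age", "gender", "zip"), 2)  # → True
```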

SLIDE 22

l-diversity

  • Extension of k-anonymization
  • Also ensures that each group has at least l distinct sensitive values
  • Prevents disclosure of sensitive values
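A matching l-diversity check, sketched in Python on a made-up two-column table:

```python
from collections import defaultdict

def is_l_diverse(records, qid, sensitive, l):
    # Each quasi-identifier group must carry at least l distinct
    # sensitive values, so a group never pins down one diagnosis.
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[a] for a in qid)].add(r[sensitive])
    return all(len(vals) >= l for vals in groups.values())

rows = [
    {"zip": "537**", "dx": "Influenza"},
    {"zip": "537**", "dx": "Lymphoma"},
    {"zip": "538**", "dx": "Influenza"},
    {"zip": "538**", "dx": "Influenza"},
]
is_l_diverse(rows, ("zip",), "dx", 2)  # → False: the 538** group is uniform
```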
SLIDE 23

Macro-data publishing

  • Differential Privacy is a strong privacy notion
  • Requires that a randomized computation yields nearly identical output when performed on nearly identical input
  • Interactive model
    • limited to a specific number of queries
  • Non-interactive model
    • needs query strategies to build noisy data cubes that maximize utility for a random query workload
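For count queries, the standard way to satisfy differential privacy is the Laplace mechanism; a stdlib-only Python sketch using inverse-CDF sampling (the counts and epsilon are illustrative):

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon):
    # A count query has sensitivity 1, so adding Lap(1/epsilon) noise
    # satisfies epsilon-differential privacy.
    return true_count + laplace_noise(1.0 / epsilon)

private_count(100, 1.0)  # a noisy version of the true count, e.g. 101.3
```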

SLIDE 24

Differentially Private Interface

[Diagram: Original Data → (pre-designed query workload / query strategy) → Diff. Private Histogram → user queries answered with differentially private answers]

SLIDE 25

HIDE Framework

  • Identifying and Sensitive Information Extraction
    • uses a state-of-the-art CRF model to extract PHI and sensitive information
  • Data linking
    • provides a structured patient-centric view of the data
  • De-identification and Anonymization
    • Micro-data publication - uses data suppression and generalization to provide a k-anonymized view of the data
    • Macro-data publication - releases perturbed aggregated statistics from the patient-centric view

SLIDE 26

HIDE

SLIDE 27

Identifying and sensitive information extraction

  • Use a CRF classifier to extract information
  • Studied the impact of features including:
    • regular expressions
    • affixes
    • dictionaries
    • context
  • Sampling techniques to adjust the classifier for higher precision or recall

SLIDE 28

Example

  Token    | Label
  77       | B-age
  year     | O
  old      | O
  female   | B-gender
  with     | O
  history  | O
  of       | O
  B        | B-disease
  -        | I-disease
  cell     | I-disease
  lymphoma | I-disease
  (        | O

SLIDE 29

Regular Expressions

  Regular Expression                        | Name
  ^[A-Za-z]$                                | ALPHA
  ^[A-Z].*$                                 | INITCAPS
  ^[A-Z][a-z].*$                            | UPPER-LOWER
  ^[A-Z]+$                                  | ALLCAPS
  ^[A-Z][a-z]+[A-Z][A-Za-z]*$               | MIXEDCAPS
  ^[A-Za-z]$                                | SINGLECHAR
  ^[0-9]$                                   | SINGLEDIGIT
  ^[0-9][0-9]$                              | DOUBLEDIGIT
  ^[0-9][0-9][0-9]$                         | TRIPLEDIGIT
  ^[0-9][0-9][0-9][0-9]$                    | QUADDIGIT
  ^[0-9,]+$                                 | NUMBER
  [0-9]                                     | HASDIGIT
  ^.*[0-9].*[A-Za-z].*$                     | ALPHANUMERIC
  ^.*[A-Za-z].*[0-9].*$                     | ALPHANUMERIC
  ^[0-9]+[A-Za-z]$                          | NUMBERS LETTERS
  ^[A-Za-z]+[0-9]+$                         | LETTERS NUMBERS
  -                                         | HASDASH
  '                                         | HASQUOTE
  /                                         | HASSLASH
  ^[`~!@#$%\^&*()\-=_+\[\]{}|;':\",./<>?]+$ | ISPUNCT
  ^(-|\+)?[0-9,]+(\.[0-9]*)?%?$             | REALNUMBER
  ^-.*$                                     | STARTMINUS
  ^\+.*$                                    | STARTPLUS
  ^.*%$                                     | ENDPERCENT
  ^[IVXDLCM]+$                              | ROMAN
  ^\s+$                                     | ISSPACE

Table 1: List of regular expression features used in HIDE
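In a CRF feature function, each matching pattern typically becomes one binary feature of the token; a Python sketch using a handful of the Table 1 patterns:

```python
import re

# A few of the Table 1 patterns, as (name, compiled regex) pairs.
FEATURES = [
    ("INITCAPS",  re.compile(r"^[A-Z].*$")),
    ("ALLCAPS",   re.compile(r"^[A-Z]+$")),
    ("QUADDIGIT", re.compile(r"^[0-9][0-9][0-9][0-9]$")),
    ("HASDIGIT",  re.compile(r"[0-9]")),
    ("ROMAN",     re.compile(r"^[IVXDLCM]+$")),
]

def regex_features(token):
    # Each matching pattern contributes one binary feature for the CRF.
    return [name for name, rx in FEATURES if rx.search(token)]

regex_features("2009")  # → ['QUADDIGIT', 'HASDIGIT']
regex_features("MRI")   # → ['INITCAPS', 'ALLCAPS']
```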

SLIDE 30

Affixes

  • Prefixes
  • Suffixes
  • All affixes up to size 3
SLIDE 31

Dictionaries

  • Company Names
  • Male First Names
  • Female First Names
  • Last Names
  • State Names
  • State Abbreviations
  • Hospital Names
SLIDE 32

Context

  • Previous 4 words
  • Next 4 words
  • Occurrence counts
SLIDE 33

Feature Vectors

  Token    | CAPS? | SPECIAL? | PREVIOUS | NEXT     | LABEL
  77       | N     | Y        | ?        | year     | B-age
  year     | N     | N        | 77       | old      | O
  old      | N     | N        | year     | female   | O
  female   | N     | N        | old      | with     | B-gender
  with     | N     | N        | female   | history  | O
  history  | N     | N        | with     | of       | O
  of       | N     | N        | history  | B        | O
  B        | Y     | N        | of       | -        | B-disease
  -        | N     | Y        | B        | cell     | I-disease
  cell     | N     | N        | -        | lymphoma | I-disease
  lymphoma | N     | N        | cell     | (        | I-disease

SLIDE 34
Feature Set Results

  • 220 re-identified pathology reports for the i2b2 task
  • 10-fold cross-validation

SLIDE 35

Feature Set Results

  Feature Set | Precision | Recall | F-Score
  d           | 0.562     | 0.623  | 0.591
  r           | 0.745     | 0.832  | 0.786
  rd          | 0.749     | 0.839  | 0.792
  a           | 0.788     | 0.847  | 0.816
  ad          | 0.792     | 0.853  | 0.821
  ra          | 0.811     | 0.868  | 0.838
  rad         | 0.81      | 0.868  | 0.838
  c           | 0.944     | 0.967  | 0.955
  cd          | 0.948     | 0.969  | 0.958
  ac          | 0.956     | 0.975  | 0.965
  acd         | 0.958     | 0.977  | 0.967
  rc          | 0.961     | 0.982  | 0.971
  rac         | 0.962     | 0.982  | 0.972
  racd        | 0.962     | 0.982  | 0.972
  rcd         | 0.963     | 0.984  | 0.973

(Feature codes: r = regular expressions, a = affixes, d = dictionaries, c = context)

SLIDE 36

Sampling

  • Honest brokers are concerned more about recall than precision
  • Cost-proportionate rejection sampling is often used for boosting
  • Training examples are selected based on the associated cost of missing that label

SLIDE 37
Random O-Sampling

  • Keep all non-"O" labels
  • Select "O" labels with given probability
  • Biases the classifier to select a label other than "O"
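Random O-sampling is essentially a one-line filter over (token, label) pairs; a Python sketch:

```python
import random

def o_sample(tagged_tokens, p):
    # Keep every non-"O" (PHI) token; keep an "O" token only with
    # probability p, biasing the trained classifier toward recall.
    return [(tok, lab) for tok, lab in tagged_tokens
            if lab != "O" or random.random() < p]

random.seed(0)
seq = [("77", "B-age"), ("year", "O"), ("old", "O"), ("female", "B-gender")]
o_sample(seq, 0.5)  # → [('77', 'B-age'), ('female', 'B-gender')]
```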

SLIDE 38

Random O-Sampling

[Charts: precision, recall, and F-score vs. sample probability on the i2b2 and PhysioNet corpora]

SLIDE 39

Window Sampling

  • Select all non-"O" labels
  • Select all terms within a given window of any non-"O" label
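Window sampling keeps every token near a PHI label; a Python sketch with a hypothetical window parameter w:

```python
def window_sample(tagged_tokens, w):
    # Keep every token within w positions of any non-"O" label.
    keep = set()
    for i, (_, lab) in enumerate(tagged_tokens):
        if lab != "O":
            keep.update(range(max(0, i - w), min(len(tagged_tokens), i + w + 1)))
    return [tagged_tokens[i] for i in sorted(keep)]

seq = [("a", "O"), ("b", "O"), ("77", "B-age"),
       ("year", "O"), ("old", "O"), ("female", "B-gender")]
window_sample(seq, 1)  # → everything except ("a", "O")
```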

SLIDE 40

Window Sampling

[Charts: precision, recall, and F-score vs. history (window) size on the i2b2 and PhysioNet corpora]

SLIDE 41

Information Extraction Conclusion

  • HIDE has a fast and accurate CRF for detecting PHI
  • Feature engineering has been explored in great detail
  • Window Sampling can be used to adjust recall with minimal impact on precision
  • Impact of training data size
SLIDE 42

Micro-data publishing

  • Release a patient-centric view of the original data with suppressed or generalized values
  • Apply k-anonymization and l-diversity principles to unstructured data
  • Evaluate query accuracy on real medical data

SLIDE 43

Micro-data publishing

  • Full de-identification
    • Remove all identifiers
  • Partial de-identification
    • Remove direct identifiers
  • Statistical de-identification
    • Statistical anonymization
SLIDE 44

Statistical Anonymization

  • Partition the original data points into groups that will all share the same values with respect to the QID
  • Use the multi-dimensional Mondrian algorithm for releasing a k-anonymized and l-diverse version of the structured patient-centric view

SLIDE 45

Partitioning

[Figure: patient points plotted on age (25-28) vs. zipcode (53710-53712) axes: (a) Patients; (b) Single-Dimensional partitioning; (c) Strict Multidimensional partitioning]

SLIDE 46

Mondrian algorithm

  • Greedy top-down partitioning approach
  • Choose the dimension with the maximum range
  • Split at the median if each newly created partition still satisfies k-anonymization and l-diversity
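A minimal sketch of the greedy split in Python; unlike the real Mondrian algorithm it skips range normalization and generalization hierarchies and checks only the k-anonymity constraint. The points are the age/zipcode pairs from the running example:

```python
def mondrian(points, k):
    # Greedy top-down strict multidimensional partitioning (sketch).
    # points: tuples of numeric quasi-identifier values.
    dims = range(len(points[0]))
    # Choose the dimension with the maximum range of values.
    d = max(dims, key=lambda j: max(p[j] for p in points) - min(p[j] for p in points))
    median = sorted(p[d] for p in points)[len(points) // 2]
    left = [p for p in points if p[d] < median]
    right = [p for p in points if p[d] >= median]
    # Split at the median only if both new partitions stay k-anonymous.
    if len(left) >= k and len(right) >= k:
        return mondrian(left, k) + mondrian(right, k)
    return [points]

pts = [(25, 53710), (28, 53712), (28, 53711), (26, 53712)]
mondrian(pts, 2)  # → [[(25, 53710), (26, 53712)], [(28, 53712), (28, 53711)]]
```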

SLIDE 47

Example

[Figure: greedy strict multidimensional partitioning of the same point set for k = 50, k = 25, and k = 10]

  • More precise partitions are possible with smaller k

SLIDE 48

Query Accuracy

  • 100 pathology reports
  • 10,000 random queries
  • age > n
  • age < n
SLIDE 49

Query Accuracy

[Chart: query precision (%) vs. k (10-100) for statistical, partial, and full de-identification]

SLIDE 50

Macro-data publishing

  • Differentially private data publishing (DPDP) module in HIDE
  • Create a differentially private data cube where each dimension represents a statistic over the patient-centric view
  • Partitioning algorithm based on information gain to maximize the utility of the differentially private data cube
  • Consistency algorithm to enhance utility
SLIDE 51

Differentially Private Interface

[Diagram: Original Data → (pre-designed query workload / query strategy) → Diff. Private Histogram → user queries answered with differentially private answers]

SLIDE 52

DPDP

  • DPDP considers:
    • Access to original data
    • Partitioning of the original database that best satisfies the workload of queries
    • Level of differential privacy of the data cube
    • Level of utility (or noise) in the released data cube

SLIDE 53

Access to original database

  • Every time the original database is accessed we use some of the privacy budget
  • Access the original database in a differentially private manner
  • Minimize the number of times the original data is queried to minimize the amount of noise we must add to the results

SLIDE 54

Query Strategy

  • Develop a query strategy that will allow the most utility given random queries from the user
  • This query strategy is accomplished by partitioning the data according to information gain

SLIDE 55

Partitioning of the original database

  • Release two data cubes
    • One using a cell-based algorithm that partitions the database into its individual cells and releases a perturbed count for each cell
    • One using a top-down multi-dimensional partitioning strategy, where each split value selection maximizes the information gain and ensures the uniformity of the data points in the partition
  • A consistency algorithm will be applied to the two data cubes that will increase the accuracy of the released data cubes

SLIDE 56

Cell partitioning

[Figure: a 2x2 data cube over Age {20, 30} x Income {40K, 50K} with true counts 90, 50, 10, 50; cell partitioning releases a perturbed count (90', 50', 10', 50') for each individual cell]

  Q1: count() where Age = 20, Income = 40K
  Q2: count() where Age = 20, Income = 50K
  ...

  • Select count where age > 20 and age < 30
  • alpha is the differential privacy parameter
SLIDE 57

Multi-dimensional partitioning

[Figure: the same 2x2 data cube; multi-dimensional partitioning merges cells into partitions and releases perturbed partition counts (90', 10', 100')]

  • Select count where age > 20 and age < 30
  • Noise is divided
SLIDE 58

Goals of partitioning strategy

  • Large partitions to minimize aggregated perturbation error
  • Uniform partitions to minimize approximation error
  • Minimize the number of times we access the original data

SLIDE 59

Proposed Approach

[Figure: true counts (90, 50, 10, 50) over Age {20, 30} x Income {40K, 50K}]

  • 1. Cell partitioning queries (alpha/2): perturbed cell counts 90', 50', 10', 50'
  • 2. Multi-dim Partitioning
  • 3. Multi-dim partitioning queries (alpha/2): perturbed partition counts 90', 10', 100'
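Under sequential composition, the two rounds of queries can each spend half of the overall budget alpha; a Python sketch in which, for brevity, a single all-cells partition stands in for the multi-dimensional cube:

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def two_cube_release(cell_counts, alpha):
    # Sequential composition: the cell-partitioning queries (step 1) and
    # the multi-dim partitioning queries (step 3) each spend alpha/2 of
    # the budget, so each answer is perturbed with Lap(2/alpha) noise.
    noisy_cells = [c + laplace_noise(2.0 / alpha) for c in cell_counts]
    noisy_total = sum(cell_counts) + laplace_noise(2.0 / alpha)
    return noisy_cells, noisy_total
```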
SLIDE 60

Utility of release

  • The level of utility is measured by comparing the value of a query workload on the released differentially private data cubes and a non-perturbed data cube generated from the original data
  • We empirically evaluate the level of error as a function of the given privacy budget

SLIDE 61

Software

  • Web application
    • Python, Django, and CouchDB
  • Interface
    • Iterative labeling of documents and training of the underlying classifier
    • Analyze accuracy of the classifier on validation sets
  • Classifier is the super-fast CRF provided by CRFsuite

SLIDE 62

Publications

  • Y. Xiao, J. Gardner, L. Xiong. DPCube: Releasing Differentially Private Data Cubes for Health Information (demo paper). In 28th IEEE International Conference on Data Engineering (ICDE), 2012.
  • James Gardner, Li Xiong, Fusheng Wang, Andrew Post and Joel Saltz. An evaluation of feature sets and sampling techniques for statistical de-identification of medical records. In 1st ACM International Health Informatics Symposium, 2010 (to appear).
  • Li Xiong, James Gardner, Pawel Jurczyk and James J. Lu. Privacy Preserving Information Discovery on EHRs. In Information Discovery on Electronic Health Records, Ed. Vagelis Hristidis. Chapman and Hall/CRC, pp. 197-225, 2009.
  • James Gardner and Li Xiong. An integrated framework for de-identifying unstructured medical data. Data and Knowledge Engineering, 68(12), pp. 1441-1451, 2009, doi:10.1016/j.datak.2009.07.006.
  • James Gardner, Kanwei Li, Li Xiong and James J. Lu. HIDE: Heterogeneous Information DE-identification (demo track). 12th International Conference on Extending Database Technology (EDBT), March 2009.
  • James Gardner and Li Xiong. HIDE: A Health Information DE-identification System. In 21st IEEE International Symposium on Computer-Based Medical Systems (CBMS), June 2008.