De-identification of the HHP Data
Khaled El Emam, CHEO RI & uOttawa

Today’s Presentation
• Provide an overview of the rationale and methods used to de-identify the HHP data set, as well as lessons learnt
• The complete details have been published in a recent article in JMIR: http://www.jmir.org/2012/1/e33/
• Address questions from different communities:
– entrants in the competition
– disclosure control community
– other competition organizers

Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca

Caveats
• Certain manipulations are not revealed
• We do not represent HPN or Kaggle – questions about the competition rules should be posted on the HHP forum for the Kaggle team to respond to

Basic Principles
• The HHP data set had to be compliant with the HIPAA Privacy Rule; this defined the basic parameters that guided the de-identification
• Many versions of the de-identified data set were created, and data utility was evaluated through modeling to see how data quality was affected, with the aim of achieving a balance
• Extensive discussions were held with other disclosure control experts along the way
• De-identification was informed by known re-identification attacks

The Data Set
• The original data set had a lot of missingness in it – this is real data that was pulled out of production systems
• We do not have the names or identities of any of the patients – therefore risk assessments had to be done with estimates and simulations
• The competition data set represents a small sample of HPN members – the sub-sampling has a big impact on re-identification risk

HIPAA Privacy Rule
• HIPAA defines two standards for the de-identification of health information:
– Safe Harbor
– Statistical method
• HIPAA has tended to be more precise about de-identification than privacy legislation in other jurisdictions

HIPAA Safe Harbor
Safe Harbor Direct Identifiers and Quasi-identifiers
1. Names
2. ZIP Codes (except first three)
3. All elements of dates (except year)
4. Telephone numbers
5. Fax numbers
6. Electronic mail addresses
7. Social security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers, including license plate numbers
13. Device identifiers and serial numbers
14. Web Universal Resource Locators (URLs)
15. Internet Protocol (IP) address numbers
16. Biometric identifiers, including finger and voice prints
17. Full face photographic images and any comparable images
18. Any other unique identifying number, characteristic, or code
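Two of the items above (ZIP codes and dates) are generalized rather than removed outright. The following is a minimal sketch of that generalization, not the procedure used for the HHP data. The function names, the record layout, and the `**` masking convention are hypothetical, and the sketch ignores further conditions in the regulation (for example, sparsely populated ZIP areas and ages over 89).

```python
from datetime import date


def generalize_zip(zip_code: str) -> str:
    """Keep only the first three digits of a ZIP code and mask the rest."""
    return zip_code.strip()[:3] + "**"


def generalize_date(d: date) -> int:
    """Drop month and day, keeping only the year."""
    return d.year


# Hypothetical record; the field names are illustrative only.
record = {"zip": "90210", "service_date": date(2011, 3, 14)}

safe_record = {
    "zip3": generalize_zip(record["zip"]),
    "service_year": generalize_date(record["service_date"]),
}
print(safe_record)
```

Direct identifiers such as names or telephone numbers would simply be dropped from the record; only the quasi-identifiers are coarsened.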

Reasonableness Criterion
• “Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information.”
• “… generally accepted statistical and scientific principles …”
• “… the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information …”

Statistical Method
• Need to ensure that the risk of re-identification is very small:

$$\frac{1}{N} \sum_{i=1}^{N} I(\theta_i > \tau) \le \alpha$$

where $\theta_i$ is the probability of re-identifying record $i$, $\tau$ is the risk threshold, $I(\cdot)$ is the indicator function, and $\alpha$ is the largest acceptable proportion of the $N$ records whose risk exceeds the threshold.
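The statistical criterion can be sketched as a check on the share of records whose re-identification probability exceeds a threshold. This is an illustration under stated assumptions, not the HHP risk model: the symbol names (`thetas`, `tau`, `alpha`) are chosen here, and the per-record probability is estimated as one over the size of the record's equivalence class on the quasi-identifiers, which is one common estimator and not necessarily the one used for HHP. The records are made up.

```python
from collections import Counter


def record_risks(quasi_identifiers):
    """Estimate each record's risk as 1 / (number of records sharing its
    quasi-identifier values). This estimator is an assumption of the sketch."""
    class_sizes = Counter(quasi_identifiers)
    return [1.0 / class_sizes[q] for q in quasi_identifiers]


def proportion_above(thetas, tau):
    """Share of records whose risk is strictly greater than tau."""
    if not thetas:
        raise ValueError("need at least one record")
    return sum(theta > tau for theta in thetas) / len(thetas)


# Hypothetical (age group, sex) pairs standing in for quasi-identifiers.
records = [("40-49", "F")] * 25 + [("50-59", "M")] * 30 + [("80+", "M")] * 2

thetas = record_risks(records)
share = proportion_above(thetas, tau=0.05)
alpha = 0.008  # at most 0.8% of records may exceed the threshold

print(f"share above threshold: {share:.4f}")
print("criterion met" if share <= alpha else "criterion not met")
```

In this toy data the two records in the small class have an estimated risk of 0.5, so the share above the threshold is 2 out of 57 and the criterion fails; generalizing or suppressing those records would be the next step.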

“Reasonable” Risk Thresholds

Precedents - I
• The value of $\tau$ represents the probability that a record can be correctly re-identified
• There are many precedents for setting this value to 0.2, 0.1, and 0.05 for the public release of health data (as well as other types of data)
• For the HHP data it was decided to err on the conservative side and use a threshold value of 0.05
• This is under ideal conditions – the real value is likely lower

Precedents - II
• The estimated risk under HIPAA Safe Harbor is that 0.04% of the population is unique:

$$\frac{1}{N} \sum_{i=1}^{N} I(\theta_i = 1) = 1 - 0.9996 = 0.0004$$

where $\theta_i$ is the probability of re-identifying record $i$ and $I(\cdot)$ is the indicator function.

Risk Exposure

$$\text{Risk Exposure} = \text{Loss} \times \text{Probability}$$

• In the case of Safe Harbor:

$$\text{Risk Exposure} = N \times 0.0004 \times 1$$

• Equivalent HHP risk exposure:

$$\text{Risk Exposure} = N \times 0.008 \times 0.05$$
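The equivalence between the two risk exposures can be checked with a few lines of arithmetic. In this sketch the function name and the population size `n` are hypothetical; only the proportions (0.0004 and 0.008) and the probabilities (1 and 0.05) come from the slides.

```python
import math


def risk_exposure(n, proportion_at_risk, probability):
    """Number of records at risk (the loss) times the re-identification probability."""
    return n * proportion_at_risk * probability


n = 100_000  # hypothetical population size; it cancels out in the comparison

safe_harbor = risk_exposure(n, 0.0004, 1.0)  # 0.04% unique, re-identified with certainty
hhp = risk_exposure(n, 0.008, 0.05)          # 0.8% of records at a probability of 0.05

# Both exposures agree up to floating-point rounding.
print(math.isclose(safe_harbor, hhp))

# Proportion allowed above a 0.05 threshold for the same exposure as Safe Harbor.
equivalent_proportion = 0.0004 * 1.0 / 0.05
print(round(equivalent_proportion, 6))
```

Read this way, the 0.8% figure is the proportion that keeps the HHP exposure equal to the Safe Harbor exposure once the per-record probability is capped at 0.05.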

Risk Management
• Ensure that no more than 0.8% of members have a probability of re-identification greater than 0.05
• A combination of technical and legal approaches was used to manage the overall risk
• Legal limits:
– Prohibition on re-identification
– Agreements with HPN service providers (e.g., labs and insurers)

Data Set
• Age (members)
• Sex (members)
• Days in Hospital (Outcome)
• Specialty of provider (claim)
• Place of service (claim)
• CPT Code (claim)
• Date of claim (claim)
• Diagnosis (claim)
• Length of stay (claim)
• Provider ID (claim)
• Vendor ID (claim)
• Pay delay (claim)

Pre-processing
• Creating pseudonyms for the IDs
• Top coding pay delay and days in hospital (99th percentile)
• Removal of high-risk (re-identification and stigmatization) patients and claims:
– rare and visible diagnoses
– sensitive diagnoses and procedures (e.g., HIV, abortions, substance abuse, sex change)
• Suppression of unique provider and vendor patterns

Truncation of Claims
• Some patients had an unusually large number of claims per year – they stand out
• The distribution of the number of claims has a very long tail
• Used a score to identify which claims to truncate – those that are unique among the patients
• Truncation at the 95th percentile
• Out of 113,000 patients, 9,556 patients had at least one claim truncated
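Top coding and claim truncation can be sketched as follows. The slides do not reveal the actual score or percentile method, so everything here is an assumption for illustration: the nearest-rank percentile, the function names, and a score that ranks each claim by how many patients share its code (rarer codes are dropped first). The data are made up.

```python
from collections import Counter


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100 (an assumed method)."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def top_code(values, pct=99):
    """Cap every value at the given percentile; returns (capped values, cap)."""
    cap = nearest_rank_percentile(values, pct)
    return [min(v, cap) for v in values], cap


def truncate_claims(claims_by_patient, cap):
    """Keep at most `cap` claims per patient, dropping the rarest codes first.

    Rarity is measured by the number of patients who have the code, a
    hypothetical stand-in for the undisclosed score.
    """
    patients_with_code = Counter(
        code for claims in claims_by_patient.values() for code in set(claims)
    )
    truncated = {}
    for patient, claims in claims_by_patient.items():
        # sorted() is stable, so equally common codes keep their original order
        ranked = sorted(claims, key=lambda c: patients_with_code[c], reverse=True)
        truncated[patient] = ranked[:cap]
    return truncated


# Hypothetical pay-delay values with one extreme outlier.
pay_delay = list(range(1, 100)) + [400]
capped, cap_value = top_code(pay_delay, pct=99)

# Hypothetical claim codes per pseudonymous patient ID.
claims = {"p1": ["A", "B", "A", "Z"], "p2": ["A", "B"], "p3": ["A"]}
kept = truncate_claims(claims, cap=3)
print(cap_value, kept)
```

In practice the per-patient cap would itself be derived from the distribution of claims per patient, for example with `nearest_rank_percentile(claim_counts, 95)`; the fixed `cap=3` above only keeps the example small.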
