
The De-identification of Longitudinal and Geospatial Data
Khaled El Emam, CHEO RI & uOttawa

Context
• The disclosure of health information for secondary purposes, such as research
• Many practical challenges to obtaining express individual consent from patients for large databases
• Even if express consent can be obtained, there is compelling evidence that consenters differ from non-consenters, which introduces bias
  – age, sex, race, marital status, educational level, socioeconomic status, health status, mortality, lifestyle factors, functioning

Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca

Complex Data
• Simplest data: cross-sectional
• Longitudinal data: data about individuals over time
  – Patients with multiple visits
  – Patients with multiple insurance claims
• Geospatial data: data that contains location information
  – Residence postal codes or dissemination areas

Longitudinal Data
• This will usually have higher re-identification risk than cross-sectional data
• We need to make assumptions about the background information of the adversary (how many visits/claims they will have information about)
• Multiple attributes collected per visit/claim

Residence Trails
• In previous work we examined the re-identification risk from residence trails
• The model was based on RAMQ data
• Only considers location information over time and does not take the specifics of a data set into account
• Looked at uniqueness only
• For example, …

Representation - Tree
{7/8/2000, M}
  {1/1/2009, K7G2C3}
  {14/1/2009, K7G2C3}
  {18/4/2009, K7G2C4}
• Each patient can be represented as a tree (multiple levels are possible)
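The tree representation can be sketched in code as a two-level structure: patient-level attributes at the root and one child per visit. The following is a minimal illustration only; the class names `Patient` and `Visit` and the helper `residence_trail` are hypothetical and do not come from the slides. The values are those of the example patient shown in the tree.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Visit:
    """Second level of the tree: one node per visit (hypothetical name)."""
    date: str          # dd/mm/yyyy, as written on the slide
    postal_code: str


@dataclass
class Patient:
    """Root of the tree: attributes that do not change between visits."""
    dob: str
    gender: str
    visits: List[Visit] = field(default_factory=list)


# The example patient from the tree slide.
patient = Patient(
    dob="7/8/2000",
    gender="M",
    visits=[
        Visit("1/1/2009", "K7G2C3"),
        Visit("14/1/2009", "K7G2C3"),
        Visit("18/4/2009", "K7G2C4"),
    ],
)


def residence_trail(p: Patient) -> List[str]:
    """Postal codes in visit order, with consecutive repeats collapsed."""
    trail: List[str] = []
    for visit in p.visits:
        if not trail or trail[-1] != visit.postal_code:
            trail.append(visit.postal_code)
    return trail


print(residence_trail(patient))
```

Deeper trees (for example, several claims under each visit) would follow the same pattern, with each node holding a list of child nodes.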

Representation - Tables

PID   DoB         Gender
10    7/8/2000    M
11    1/1/1975    F
12    24/6/1975   F
13    17/8/1975   F
14    18/9/1975   F
15    12/2/2000   M

PID   Visit Date   Postal Code
10    1/1/2009     K7G2C3
10    14/1/2009    K7G2C3
10    18/4/2009    K7G2C4
11    1/1/2009     K1V7E6
11    20/1/2009    K1V7E8
11    22/2/2009    K1V7E8
12    15/12/2008   K1Y4L5
12    20/1/2009    K1V7E8
13    22/12/2008   K1Z5H9
14    13/1/2009    K1Y4L5
15    20/4/2009    K7G2G5

Reduction in Information Loss

Dataset      Reduction in Entropy
California   71%
Florida      71%
New York     64%
Washington   80%

• Comparison of two assumptions about adversary background knowledge
• Incomplete knowledge has significantly less information loss
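To make the effect of adversary background knowledge concrete, the sketch below uses the two example tables and counts how many patients are consistent with what an adversary knows. It is an illustration under stated assumptions, not the model used for the reported results: the quasi-identifiers are taken to be birth year, gender, and visit postal codes (visit dates are ignored), and `power` is a hypothetical parameter for the number of visits the adversary knows about.

```python
from collections import Counter
from itertools import combinations

# Patient-level table from the slide: PID -> (DoB, Gender).
demographics = {
    10: ("7/8/2000", "M"),
    11: ("1/1/1975", "F"),
    12: ("24/6/1975", "F"),
    13: ("17/8/1975", "F"),
    14: ("18/9/1975", "F"),
    15: ("12/2/2000", "M"),
}

# Visit-level table from the slide: (PID, Visit Date, Postal Code).
visits = [
    (10, "1/1/2009", "K7G2C3"), (10, "14/1/2009", "K7G2C3"),
    (10, "18/4/2009", "K7G2C4"), (11, "1/1/2009", "K1V7E6"),
    (11, "20/1/2009", "K1V7E8"), (11, "22/2/2009", "K1V7E8"),
    (12, "15/12/2008", "K1Y4L5"), (12, "20/1/2009", "K1V7E8"),
    (13, "22/12/2008", "K1Z5H9"), (14, "13/1/2009", "K1Y4L5"),
    (15, "20/4/2009", "K7G2G5"),
]


def birth_year_and_gender(pid):
    dob, gender = demographics[pid]
    return dob.split("/")[-1], gender


def visit_codes(pid):
    return [code for p, _date, code in visits if p == pid]


def is_consistent(known_codes, candidate):
    """True when the candidate's visits include every known postal code."""
    return not (Counter(known_codes) - Counter(visit_codes(candidate)))


def smallest_match_set(pid, power):
    """Worst case over all choices of `power` known visits for this patient:
    the smallest number of patients (target included) the adversary
    cannot tell apart."""
    codes = visit_codes(pid)
    k = min(power, len(codes))
    peers = [c for c in demographics
             if birth_year_and_gender(c) == birth_year_and_gender(pid)]
    return min(
        sum(is_consistent(known, c) for c in peers)
        for known in combinations(codes, k)
    )


for power in (1, 2):
    print(power, {pid: smallest_match_set(pid, power) for pid in demographics})
```

In this toy data, patient 12 is indistinguishable from one other patient when the adversary knows a single visit, but becomes unique when both visits are known. Less assumed knowledge leaves larger match sets, so less generalization is needed to reach the same risk level.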

Geographic Areas
• Postal codes are most often collected from patients
• Some disadvantages: they change over time (number and boundaries)
• Census geography is more stable and postal codes can be converted to these
• For now we are focusing on postal codes

Postal Code Population Sizes

P/T      # PC      Min   25th   50th   75th   Max
AB       77,348    1     5      24     50     7,084
BC       113,222   1     6      19     40     13,537
MB       24,015    1     6      25     49     6,298
NB       57,389    1     3      8      17     1,971
NL       10,376    2     7      18     39     5,506
NS       25,332    1     5      13     29     8,983
NU/NWT   535       2     14     33     82     5,794
ON       270,277   1     7      21     47     17,165
PE       3,165     2     5      12     26     8,327
QC       203,637   1     5      17     39     12,635
SK       21,563    1     6      22     36     6,939
YT       935       2     2      12     33     2,107

[Figure: population for rural and urban areas, one panel per province (AB, BC, MB, NB, NL, NS, ON, PE, QC, SK); y-axis: POPULATION, 0 to 6000; x-axis: Rural, Urban]

Cropping
• The most common way of de-identifying areas was to crop them (i.e., remove characters/digits from the end)
• This works for postal codes because the areas are hierarchical in structure
• Larger areas will have more people living in them, reducing the re-identification risk
• But the effectiveness of this will depend on the data set itself and the variables that are being disclosed
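Cropping as described above amounts to keeping a prefix of the postal code. A minimal sketch follows; the function name `crop_postal_code` and its `keep` parameter are hypothetical, and the normalization step (dropping spaces, upper-casing) is an assumption about how codes are stored.

```python
def crop_postal_code(postal_code: str, keep: int) -> str:
    """Keep the first `keep` characters of a postal code.

    Spaces are removed and letters upper-cased first, so "K7G 2C3" and
    "k7g2c3" are treated the same way (an assumption about the input).
    """
    code = postal_code.replace(" ", "").upper()
    if not 1 <= keep <= len(code):
        raise ValueError(f"keep must be between 1 and {len(code)}")
    return code[:keep]


# Cropping a 6-character code down to 5, 4, and 3 characters.
for keep in (6, 5, 4, 3):
    print(keep, crop_postal_code("K7G2C3", keep))
```

Because the code is hierarchical, every record sharing the same prefix falls into the same, larger area after cropping.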

Cropping Example

Cropping      Postal Code   Postal Code + Age
6 character   69%           97%
5 character   27.5%         93%
4 character   1%            39%
3 character   0.3%          9%

• Ontario birth registry example for one year
• Table shows the percentage of records with a high risk at a 0.2 threshold (prosecutor risk)

Aggregation
• We have developed an algorithm that would aggregate small postal codes into larger ones
• Finds much smaller areas that maintain acceptable re-identification risk levels (compared to cropping)
• Allows much higher geographic specificity
• We have shown that disease outbreak cluster detection is affected minimally by the aggregation
• Currently implemented for postal codes; being developed for dissemination areas and ZCTAs
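The kind of percentage reported in the Cropping Example table can be computed as sketched below. Prosecutor risk for a record is taken as 1 divided by the size of its equivalence class (the number of records sharing the same quasi-identifier values). Treating "high risk" as strictly greater than the 0.2 threshold is an assumption, and the sample records are invented for illustration; they are not the Ontario birth registry data.

```python
from collections import Counter


def percent_high_risk(records, threshold=0.2):
    """Percentage of records whose prosecutor risk exceeds the threshold.

    Each record is a hashable tuple of quasi-identifier values, e.g.
    (cropped postal code, age). Risk = 1 / equivalence class size.
    """
    class_sizes = Counter(records)
    high = sum(1 for r in records if 1 / class_sizes[r] > threshold)
    return 100.0 * high / len(records)


# Hypothetical records: (3-character postal code, age).
records = (
    [("K1V", 30)] * 5    # class of 5, risk 0.2, not above the threshold
    + [("K1Y", 25)] * 2  # class of 2, risk 0.5
    + [("K7G", 41)]      # unique record, risk 1.0
)

print(percent_high_risk(records))
```

With a 0.2 threshold, a record counts as high risk when fewer than five records share its quasi-identifier values, which is why adding age to the postal code raises the percentages so sharply in the table.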

kelemam@uottawa.ca
www.ehealthinformation.ca
www.ehealthinformation.ca/knowledgebase

References
• El Emam K, Buckeridge D, Tamblyn R, Neisa A, Jonker E, Verma A: The Re-identification Risk of Canadians from Longitudinal Demographics. BMC Medical Informatics and Decision Making, 11:46, 2011. DOI:10.1186/1472-6947-11-46.
• El Emam K, Brown A, AbdelMalik P, Neisa A, Walker M, Bottomley J, Roffey T: A method for managing re-identification risk from small geographic areas in Canada. BMC Medical Informatics and Decision Making, 10, 2010.
• El Emam K, Brown A, AbdelMalik P: Evaluating Predictors of Geographic Area Population Size Cutoffs to Manage Re-identification Risk. Journal of the American Medical Informatics Association, 16:256-266, 2009.
• El Emam K, Dankar F, Issa R, Jonker E, Amyot D, Cogo E, Corriveau J-P, Walker M, Chowdhury S, Vaillancourt R, Roffey T, Bottomley J: A Globally Optimal k-Anonymity Method for the De-identification of Health Data. Journal of the American Medical Informatics Association, 16(5):670-682, 2009.
