
The De-identification of Longitudinal and Geospatial Data

Khaled El Emam, CHEO RI & uOttawa


Context

  • The disclosure of health information for secondary purposes, such as research
  • Many practical challenges to obtaining express individual consent from patients for large databases
  • Even if express consent can be obtained, there is compelling evidence that consenters differ from non-consenters – introduces bias
      – age, sex, race, marital status, educational level, socioeconomic status, health status, mortality, lifestyle factors, functioning

Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca


Complex Data

  • Simplest data: cross-sectional
  • Longitudinal data: data about individuals over time
      – Patients with multiple visits
      – Patients with multiple insurance claims
  • Geospatial data: data that contains location information
      – Residence postal codes or dissemination areas


Longitudinal Data

  • This will usually have higher re-identification risk than cross-sectional data
  • We need to make assumptions about the background information of the adversary (how many visits/claims will they have information about)
  • Multiple attributes collected per visit/claim
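
One way to make the background-knowledge assumption concrete is sketched below. For an assumed number p of visits known to the adversary, it counts the patients who can be singled out by some combination of p of their visit values. The function name `singled_out_fraction`, the toy trails (postal codes taken from the example visit table in this deck), and the plain uniqueness criterion are illustrative assumptions, not the risk model used by the authors.

```python
from itertools import combinations

# Toy data (assumption): patient id -> postal codes seen across that patient's visits.
trails = {
    10: ["K7G2C3", "K7G2C3", "K7G2C4"],
    11: ["K1V7E6", "K1V7E8", "K1V7E8"],
    12: ["K1Y4L5", "K1V7E8"],
    13: ["K1Z5H9"],
    14: ["K1Y4L5"],
    15: ["K7G2G5"],
}


def singled_out_fraction(trails, p):
    """Fraction of patients that some set of up to p known visit values matches uniquely."""
    singled_out = 0
    for pid, values in trails.items():
        own = set(values)
        others = [set(v) for other, v in trails.items() if other != pid]
        k = min(p, len(own))  # the adversary cannot know more values than exist
        for known in combinations(sorted(own), k):
            # Unique if no other patient's trail contains everything the adversary knows.
            if not any(set(known) <= other for other in others):
                singled_out += 1
                break
    return singled_out / len(trails)


for p in (1, 2):
    print(p, round(singled_out_fraction(trails, p), 2))
```

With this toy data, assuming one known visit singles out 4 of the 6 patients, and assuming two known visits singles out 5 of the 6. The more visits the adversary is assumed to know, the more records need protection.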


Residence Trails

  • In previous work we examined the re-identification risk from residence trails
  • The model was based on RAMQ data
  • Only considers location information over time and does not take the specifics of a data set into account
  • Looked at uniqueness only
  • For example, …

Representation - Tree

{7/8/2000, M}
    {1/1/2009, K7G2C3}
    {14/1/2009, K7G2C3}
    {18/4/2009, K7G2C4}

  • Each patient can be represented as a tree (multiple levels are possible)
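
The tree view can be sketched in a few lines of Python. The `Node` class and the `flatten` helper are hypothetical names introduced only for illustration. The root holds the patient-level values (date of birth, gender) and each child holds one visit (visit date, postal code), as in the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One level of a patient tree: a tuple of values plus child nodes."""
    values: tuple
    children: list = field(default_factory=list)


# Patient from the example: demographics at the root, one child per visit.
patient = Node(
    ("7/8/2000", "M"),
    [
        Node(("1/1/2009", "K7G2C3")),
        Node(("14/1/2009", "K7G2C3")),
        Node(("18/4/2009", "K7G2C4")),
    ],
)


def flatten(node, prefix=()):
    """Yield one tuple per root-to-leaf path (the joined-table view of the tree)."""
    path = prefix + node.values
    if not node.children:
        yield path
    else:
        for child in node.children:
            yield from flatten(child, path)


rows = list(flatten(patient))
for row in rows:
    print(row)
```

Because `flatten` recurses through every level, deeper trees (for example, several claims under each visit) would be handled the same way.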


Representation - Tables

PID   DoB          Gender
10    7/8/2000     M
11    1/1/1975     F
12    24/6/1975    F
13    17/8/1975    F
14    18/9/1975    F
15    12/2/2000    M

PID   Visit Date   Postal Code
10    1/1/2009     K7G2C3
10    14/1/2009    K7G2C3
10    18/4/2009    K7G2C4
11    1/1/2009     K1V7E6
11    20/1/2009    K1V7E8
11    22/2/2009    K1V7E8
12    15/12/2008   K1Y4L5
12    20/1/2009    K1V7E8
13    22/12/2008   K1Z5H9
14    13/1/2009    K1Y4L5
15    20/4/2009    K7G2G5

Reduction in Information Loss

Dataset      Reduction in Entropy
California   71%
Florida      71%
New York     64%
Washington   80%

  • Comparison of two assumptions about adversary background knowledge
  • Incomplete knowledge has significantly less information loss


Geographic Areas

  • Postal codes are most often collected from patients
  • Some disadvantages: they change over time (number and boundaries)
  • Census geography is more stable and postal codes can be converted to these
  • For now we are focusing on postal codes

Postal Code Population Sizes

P/T      # PC      Min   25th   50th   75th   Max
AB       77,348    1     5      24     50     7,084
BC       113,222   1     6      19     40     13,537
MB       24,015    1     6      25     49     6,298
NB       57,389    1     3      8      17     1,971
NL       10,376    2     7      18     39     5,506
NS       25,332    1     5      13     29     8,983
NU/NWT   535       2     14     33     82     5,794
ON       270,277   1     7      21     47     17,165
PE       3,165     2     5      12     26     8,327
QC       203,637   1     5      17     39     12,635
SK       21,563    1     6      22     36     6,939
YT       935       2     2      12     33     2,107

(P/T = province/territory; # PC = number of postal codes; Min to Max = population per postal code)


[Figure: postal code population (axis label POPULATION, ticks 1000 to 6000) for rural versus urban postal codes, one panel per province: AB, BC, MB, NB, NL, NS, ON, PE, QC, SK]

Cropping

  • The most common way of de-identifying areas was to crop them (i.e., remove characters/digits from the end)
  • This works for postal codes because the areas are hierarchical in structure
  • Larger areas will have more people living in them – reducing the re-identification risk
  • But the effectiveness of this will depend on the data set itself and the variables that are being disclosed
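
Cropping as described above can be sketched as follows. The helper names `crop` and `area_counts` and the toy list of codes are assumptions for illustration. The counts are records per cropped area in the data set, not census population.

```python
from collections import Counter


def crop(postal_code, keep):
    """Keep only the first `keep` characters of a postal code (spaces removed)."""
    return postal_code.replace(" ", "").upper()[:keep]


# Toy records (assumption): one six-character postal code per record.
codes = ["K7G2C3", "K7G2C3", "K7G2C4", "K7G2G5",
         "K1V7E6", "K1V7E8", "K1Y4L5", "K1Z5H9"]


def area_counts(codes, keep):
    """Number of records falling in each cropped area."""
    return Counter(crop(code, keep) for code in codes)


for keep in (6, 5, 3):
    print(keep, dict(area_counts(codes, keep)))
```

Each character removed merges sibling areas in the hierarchy, so the groups can only grow. How fast they grow depends on the data set, which is why the effect of cropping has to be measured rather than assumed.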


Cropping Example

Cropping      Postal Code   Postal Code + Age
6 character   69%           97%
5 character   27.5%         93%
4 character   1%            39%
3 character   0.3%          9%

  • Ontario birth registry example for one year
  • Table shows the percentage of records with a high risk at a 0.2 threshold (prosecutor risk)
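
A minimal sketch of how such a percentage could be computed is given below. It assumes that a record's prosecutor risk is 1 divided by the number of records sharing its quasi-identifier values, and that "high risk" means a risk above the 0.2 threshold. The function name `high_risk_fraction` and the toy records are hypothetical and are not the registry data.

```python
from collections import Counter


def high_risk_fraction(records, quasi_identifiers, threshold=0.2):
    """Fraction of records whose prosecutor risk (1 / class size) exceeds the threshold."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    class_size = Counter(keys)  # size of each equivalence class
    high = sum(1 for key in keys if 1 / class_size[key] > threshold)
    return high / len(keys)


# Toy records (assumption): cropped postal code plus age.
records = (
    [{"area": "K1V", "age": 30}] * 5
    + [{"area": "K1V", "age": 31}]
    + [{"area": "K7G", "age": 40}, {"area": "K7G", "age": 41}]
)

print(high_risk_fraction(records, ["area"]))
print(high_risk_fraction(records, ["area", "age"]))
```

Adding age as a second quasi-identifier splits the equivalence classes and raises the high-risk share, which is the pattern visible in the two columns of the table.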

Aggregation

  • We have developed an algorithm that would aggregate small postal codes into larger ones
  • Finds much smaller areas that maintain acceptable re-identification risk levels (compared to cropping)
  • Allows much higher geographic specificity
  • We have shown that disease outbreak cluster detection is affected minimally by the aggregation
  • Currently implemented for postal codes – being developed for dissemination areas and ZCTAs
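
The slides do not describe the aggregation algorithm itself. The sketch below is therefore only a simple greedy stand-in, not the authors' method: it repeatedly merges the least populated group of areas with its least populated neighbouring group until every group reaches a minimum population. The area labels, populations, adjacency map, and the `min_pop` cutoff are all hypothetical.

```python
def aggregate(population, neighbours, min_pop):
    """Greedily merge adjacent areas until each group has at least min_pop people."""
    groups = {area: {area} for area in population}  # group id -> member areas

    def pop(group_id):
        return sum(population[a] for a in groups[group_id])

    def adjacent(group_id):
        # Groups containing at least one area that borders this group.
        touching = set()
        for area in groups[group_id]:
            touching |= neighbours.get(area, set())
        return [g for g in sorted(groups) if g != group_id and groups[g] & touching]

    while True:
        small = [g for g in sorted(groups) if pop(g) < min_pop and adjacent(g)]
        if not small:
            return list(groups.values())
        smallest = min(small, key=pop)
        target = min(adjacent(smallest), key=pop)
        merged = groups.pop(smallest)
        groups[target] |= merged


# Hypothetical areas laid out in a chain: A - B - C - D.
population = {"A": 12, "B": 30, "C": 70, "D": 150}
neighbours = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}

result = aggregate(population, neighbours, min_pop=100)
print(result)
```

In this toy case the three small areas are merged into one group of 112 people while area D stays on its own, so only areas that fall below the cutoff lose geographic detail.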


www.ehealthinformation.ca

www.ehealthinformation.ca/knowledgebase

kelemam@uottawa.ca


References

  • El Emam K, Buckeridge D, Tamblyn R, Neisa A, Jonker E, Verma A: The Re-identification Risk of Canadians from Longitudinal Demographics. BMC Medical Informatics and Decision Making, 11:46, 2011. DOI:10.1186/1472-6947-11-46.
  • El Emam K, Brown A, AbdelMalik P, Neisa A, Walker M, Bottomley J, Roffey T: A method for managing re-identification risk from small geographic areas in Canada. BMC Medical Informatics and Decision Making, 10, 2010.
  • El Emam K, Brown A, AbdelMalik P: Evaluating Predictors of Geographic Area Population Size Cutoffs to Manage Re-identification Risk. Journal of the American Medical Informatics Association, 16:256-266, 2009.
  • El Emam K, Dankar F, Issa R, Jonker E, Amyot D, Cogo E, Corriveau J-P, Walker M, Chowdhury S, Vaillancourt R, Roffey T, Bottomley J: A Globally Optimal k-Anonymity Method for the De-identification of Health Data. Journal of the American Medical Informatics Association, 16(5):670-682, 2009.