Does De-identification Work?

Khaled El Emam, CHEO RI & uOttawa

Key Points

  • The evidence that it is easy to re-identify health data or that current de-identification methods do not work is quite weak
  • There are powerful de-identification techniques in use today that can provide strong guarantees and are defensible under existing standards
  • It is very difficult to re-identify data that has been properly de-identified
  • We have a “poor de-identification” problem
  • There are defensible de-identification methods that retain data utility

Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca

Broad Claims

  • The “easy re-identification” hypothesis
  • This has had some impact on policy makers
  • Such claims need to be examined in a systematic way because the implications are very serious:
    – May make it necessary to obtain patient consent every time data is used for secondary purposes
    – Discourages de-identification, and therefore more identifiable information will be used and disclosed
    – The likelihood of reportable data breaches would increase, leading to erosion of patient trust

Examples Often Used

Governor Weld of MA

  • Insurance claims data was matched with the voter registration list
  • Both databases had full date of birth, ZIP5, and gender
  • Both databases were publicly available for free or a nominal fee
  • The claims belonging to the governor were re-identified
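
A linkage of this kind can be illustrated with a minimal Python sketch. It is not the method used in the actual attack; the field names (`dob`, `zip5`, `gender`), the function `link_records`, and all records are made up for illustration. A claim counts as re-identified here only when its quasi-identifiers match exactly one voter.

```python
from collections import defaultdict

def link_records(claims, voters, keys=("dob", "zip5", "gender")):
    """Return (name, claim) pairs where a claim matches exactly one voter."""
    # Index the identified data set by its quasi-identifier values.
    index = defaultdict(list)
    for voter in voters:
        index[tuple(voter[k] for k in keys)].append(voter)

    matches = []
    for claim in claims:
        candidates = index.get(tuple(claim[k] for k in keys), [])
        if len(candidates) == 1:  # ambiguous or missing matches are skipped
            matches.append((candidates[0]["name"], claim))
    return matches

# Hypothetical toy data, not real records.
claims = [
    {"dob": "1950-01-02", "zip5": "02139", "gender": "M", "diagnosis": "A"},
    {"dob": "1961-07-15", "zip5": "02140", "gender": "F", "diagnosis": "B"},
    {"dob": "1972-03-09", "zip5": "02141", "gender": "F", "diagnosis": "C"},
]
voters = [
    {"name": "Voter One", "dob": "1950-01-02", "zip5": "02139", "gender": "M"},
    {"name": "Voter Two", "dob": "1961-07-15", "zip5": "02140", "gender": "F"},
    {"name": "Voter Three", "dob": "1961-07-15", "zip5": "02140", "gender": "F"},
]

linked = link_records(claims, voters)
```

In this toy data only the first claim is linked: the second matches two voters, so it stays ambiguous, and the third matches none.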

Examples Often Used

AOL Search Queries

  • AOL researcher made search queries available to the research community on-line
  • USERIDs were replaced with persistent pseudonyms
  • NYT reporters were able to re-identify Thelma Arnold based on her search queries
  • There are various other unsubstantiated claims of other people being re-identified

Examples Often Used

Neuroblastoma Registry (Illinois)

  • Newspaper made an access request for the cancer registry (rare and in children)
  • Public health unit argued that this is identifiable information
  • They went to court, all the way to the state supreme court
  • An expert witness was apparently able to re-identify most of the records in the cancer registry

Examples Often Used

Movie Ratings Data Competition

  • Netflix holds a $1m competition for a recommendation system that is better than theirs
  • A large data set of movie ratings is made publicly available for the entrants
  • Researchers claim to have re-identified individuals in the data set
  • Netflix cancels a second competition

Examples Often Used

Adverse Drug Event Database

  • CBC obtained the ADE database through an access request
  • Health Canada claims that CBC matched the database with obituaries to re-identify a record, and broadcast that in a program
  • CBC asked for more data; Health Canada reduced the details
  • They went to federal court

Our Objectives

  • We wanted to examine the empirical evidence of re-identification attacks on health information through a systematic review
  • The objectives were:
    – characterize known re-identification attacks on health data and contrast that to re-identification attacks on other kinds of data,
    – compute the overall proportion of records that have been correctly re-identified in these attacks, and
    – assess whether these demonstrate weaknesses in current de-identification methods

Methodological Considerations

  • Systematic reviews and meta-analyses are a very common way of combining evidence across multiple studies; they have been in use for many decades in multiple disciplines
  • The methodology is quite standardized and addresses issues around publication bias and heterogeneity in studies
  • It is well known that single studies are unreliable; practice recommendations should come from systematic reviews
  • Computer scientists and lawyers do not get a free pass

PRISMA

[Figure not recoverable from the extracted text]

Main Observations - I

  • Up to October 2010 there were 14 published re-identification attacks on health and non-health data
  • Only 10 described the methodology used in the attack
  • 6/14 were on health data
  • 11/14 were conducted by researchers as demonstration attacks
  • Most attacks were conducted in the US

Main Observations - II

  • 2/14 attacks were on data that followed existing standards; most of the attacked data was not de-identified in a defensible way
  • The only study that followed existing standards had a success rate of 0.00013, i.e. 0.013% (ONC study)
  • All attacks are identity disclosure

HIPAA Safe Harbor

Safe Harbor Direct Identifiers and Quasi-identifiers

  1. Names
  2. ZIP Codes (except first three)
  3. All elements of dates (except year)
  4. Telephone numbers
  5. Fax numbers
  6. Electronic mail addresses
  7. Social security numbers
  8. Medical record numbers
  9. Health plan beneficiary numbers
  10. Account numbers
  11. Certificate/license numbers
  12. Vehicle identifiers and serial numbers, including license plate numbers
  13. Device identifiers and serial numbers
  14. Web Universal Resource Locators (URLs)
  15. Internet Protocol (IP) address numbers
  16. Biometric identifiers, including finger and voice prints
  17. Full face photographic images and any comparable images
  18. Any other unique identifying number, characteristic, or code
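
As a rough illustration of how the ZIP Code and date rules in this list translate into data handling, the following Python sketch drops a few direct identifiers, truncates ZIP Codes to their first three digits, and keeps only the year of dates. The field names, the `DIRECT_IDENTIFIERS` set, and the function `generalize` are assumptions made for this example, dates are assumed to be ISO `YYYY-MM-DD` strings, and the full Safe Harbor rule has additional conditions that are not modeled here.

```python
# Hypothetical field names for a subset of the direct identifiers.
DIRECT_IDENTIFIERS = {"name", "phone", "fax", "email", "ssn"}

def generalize(record):
    """Apply simplified Safe Harbor-style rules to one record (a dict)."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # drop direct identifiers entirely
        if field == "zip":
            out[field] = value[:3]  # keep only the first three digits
        elif field.endswith("_date"):
            out[field] = value[:4]  # keep only the year of "YYYY-MM-DD"
        else:
            out[field] = value
    return out

# Made-up record for illustration.
record = {
    "name": "Jane Doe",
    "phone": "555-0100",
    "zip": "02139",
    "birth_date": "1950-01-02",
    "gender": "F",
}
result = generalize(record)
```

For the made-up record above, `result` keeps only `zip` (as `"021"`), `birth_date` (as `"1950"`), and `gender`.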

Poor De-identification Practices

Governor Weld of MA

  • Insurance claims data had date of birth, gender, and ZIP Code
  • This would not pass the basic HIPAA Safe Harbor de-identification standard
  • This re-identification attack occurred before HIPAA came into effect

Poor De-identification Practices

AOL Search Queries

  • It is known that many individuals run vanity queries and that it is possible to determine geographic location from searches
  • Thelma ran queries on her last name and on her county, as well as queries indicating that she is female, is in her mid-sixties, and has a dog
  • Raw search queries cannot be considered de-identified

Poor De-identification Practices

Neuroblastoma Registry (Illinois)

  • The registry had ZIP5 information about the patients, and therefore would not meet the HIPAA Safe Harbor de-identification standard
  • The re-identification methods used were sealed in the court case, and therefore the exact methodology used is not known

Poor De-identification Practices

Movie Ratings Data Competition

  • The Netflix data set contained dates, and therefore would not meet the Safe Harbor standard
  • The researchers who launched the re-identification attack did not believe the data had any meaningful de-identification done to it

Poor De-identification Practices

Adverse Drug Event Database

  • Database had report date (e.g., death as the adverse event would mean the date is close to the date of death) and therefore would not meet the Safe Harbor requirement
  • Court ruled that inclusion of province was identifying information in this case

What Did We Learn?

  • The evidence for re-identification attacks is weak and difficult to draw strong conclusions from:
    – Not surprising the results are alarming because the mean is quite high
    – In most cases the data was not properly de-identified to start off with
    – Considerable heterogeneity across studies
    – Properly de-identified data has a low re-identification rate


De-identification Process

  • Set Risk Threshold: Based on the characteristics of the data recipient, the data, and precedents, a quantitative threshold is set.
  • Measure Risk: Based on plausible re-identification attacks, appropriate metrics are selected and used to measure actual re-identification risk from the data.
  • Apply Transformations: If the measured risk is above the threshold, specific transformations (such as generalization and suppression) are applied to reduce the risk.
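
The threshold, measurement, and transformation steps can be sketched as a loop. The Python below is only an illustration under stated assumptions: risk is measured as 1 divided by the size of the smallest group of records sharing the same quasi-identifier values (one possible metric, not necessarily the one used in practice), the only transformation is generalizing a hypothetical `age` field into wider bands, suppression is not modeled, and the threshold of 0.25 is a toy value.

```python
from collections import Counter

def max_risk(records, quasi_ids):
    """Highest per-record risk: 1 / size of the smallest equivalence class."""
    classes = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return 1.0 / min(classes.values())

def generalize_age(records, width):
    """Replace exact age with the lower bound of a band of the given width."""
    return [{**r, "age": (r["age"] // width) * width} for r in records]

def deidentify(records, quasi_ids, threshold, widths=(5, 10, 20)):
    """Generalize with wider bands until measured risk meets the threshold."""
    current, risk = records, max_risk(records, quasi_ids)
    for width in widths:
        if risk <= threshold:
            break  # risk is acceptable, no further transformation needed
        current = generalize_age(records, width)  # always start from the raw data
        risk = max_risk(current, quasi_ids)
    return current, risk

# Made-up records for illustration.
records = [
    {"age": 31, "sex": "F"}, {"age": 33, "sex": "F"},
    {"age": 36, "sex": "F"}, {"age": 38, "sex": "F"},
    {"age": 42, "sex": "M"}, {"age": 44, "sex": "M"},
    {"age": 47, "sex": "M"}, {"age": 49, "sex": "M"},
]
released, risk = deidentify(records, ["age", "sex"], threshold=0.25)
```

With this toy data every raw record is unique, 5-year bands leave groups of two, and 10-year bands leave groups of four, at which point the measured risk meets the toy threshold and the loop stops.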

“Reasonable” Risk Thresholds

Basis of Thresholds

  • Having a continuous scale for setting thresholds had lower acceptability than four pre-specified bins
  • There is a tremendous amount of precedent going back more than 20 years: census data, health data, educational data
  • Precedent in: government policies, government and expert guidelines, scientific literature, court orders, and regulator orders
  • These are well established and accepted risk levels

Assessment of Background Risk

[Figure: three dimensions of risk; recommended threshold]


Summary - I

  • It is important to continue de-identifying personal health information for secondary uses and disclosures
  • But this has to be done properly using known best practices; homegrown methods are often problematic

Summary - II

  • Re-identification risk needs to take into account all of the fields and not just examine one field at a time
  • The amount of acceptable risk should be context dependent
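
The point about examining all fields together can be shown with a small Python sketch. The data and the helper `unique_fraction` are hypothetical: each field on its own looks harmless because no value is unique, yet the combination of fields singles out every record.

```python
from collections import Counter

def unique_fraction(records, fields):
    """Fraction of records whose values on `fields` occur exactly once."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)

# Made-up records for illustration.
records = [
    {"year": 1950, "zip3": "021", "sex": "F"},
    {"year": 1950, "zip3": "022", "sex": "M"},
    {"year": 1961, "zip3": "021", "sex": "M"},
    {"year": 1961, "zip3": "022", "sex": "F"},
]

# Risk looks negligible when each field is examined in isolation...
per_field = {f: unique_fraction(records, [f]) for f in ("year", "zip3", "sex")}
# ...but every record is unique once the fields are combined.
combined = unique_fraction(records, ["year", "zip3", "sex"])
```

Here `per_field` is 0.0 for every field while `combined` is 1.0, which is why a field-by-field check can badly understate risk.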

www.ehealthinformation.ca

www.ehealthinformation.ca/knowledgebase

kelemam@uottawa.ca