De-identification 101 Khaled El Emam, CHEO RI & uOttawa Trends - - PDF document



SLIDE 1

De-identification 101

Khaled El Emam, CHEO RI & uOttawa

Trends

  • Increasing demands for health data for secondary purposes
  • Stronger enforcement of regulations with severe penalties
  • One consequence is the need for more defensible methods for the de-identification of the data

Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca

SLIDE 2

Most Common Secondary ‘Uses’

Source: PricewaterhouseCoopers survey, 2009. Transforming healthcare through secondary use of health data.

Variable Distinctions

  • Directly identifying
    – Can uniquely identify an individual by itself or in conjunction with other readily available information
  • Quasi-identifiers
    – Can identify an individual by itself or in conjunction with other information
  • Sensitive variables

SLIDE 3

Examples of Direct Identifiers

  • Name, address, telephone number, fax number, MRN, health card number, health plan beneficiary number, license plate number, email address, photograph, biometrics, SSN, SIN, implanted device number

Examples of Quasi-Identifiers

  • Sex, date of birth or age, geographic locations (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, aboriginal identity, total years of schooling, marital status, criminal history, total income, visible minority status, activity difficulties/reductions, profession, event dates (such as admission, discharge, procedure, death, specimen collection, visit/encounter), codes (such as diagnosis codes, procedure codes, and adverse event codes), country of birth, birth weight, and birth plurality

SLIDE 4

Attribute vs Identity Disclosure

  • Attribute disclosure: discover something new about an individual in the database without knowing which record belongs to that individual
  • Identity disclosure: determine which record in the database belongs to a particular individual (for example, determining that record number 7 belongs to Bob Smith is identity disclosure)

Attribute Disclosure - I

  • For example:

                      HPV Vaccinated    NOT HPV Vaccinated
  Religion A                5                  40
  Not Religion A           40                   5

  • Statistically significant relationship (chi-square, p<0.05)
  • High risk of attribute disclosure

SLIDE 5

Attribute Disclosure - II

  • Use suppression to eliminate risk:

                      HPV Vaccinated    NOT HPV Vaccinated
  Religion A                5                   6
  Not Religion A            6                   5

  • Not statistically significant relationship (chi-square)
  • Low risk of attribute disclosure
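
The significance claims for the two HPV tables can be checked with a short calculation. This is a minimal sketch, assuming the uncorrected Pearson chi-square statistic for a 2x2 table (the slides do not say which variant was used); the function name and the constant are illustrative.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Critical value of the chi-square distribution for p = 0.05 with 1 degree of freedom.
CRITICAL_VALUE_P05_DF1 = 3.841

# Rows: Religion A / Not Religion A; columns: vaccinated / not vaccinated.
original = chi_square_2x2(5, 40, 40, 5)    # about 54.44, far above the critical value
suppressed = chi_square_2x2(5, 6, 6, 5)    # about 0.18, well below the critical value

print(original > CRITICAL_VALUE_P05_DF1)    # True: significant relationship
print(suppressed > CRITICAL_VALUE_P05_DF1)  # False: no significant relationship
```

After suppression the association between religion and vaccination status can no longer be detected, which is what lowers the attribute disclosure risk.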

SLIDE 6

Attribute Disclosure - III

  • Attribute disclosure is an important outcome of analytics; it is arguably more of an ethics question whether it is acceptable to ask certain questions or discover certain things about individuals with certain characteristics
  • The HIPAA regulations do not require one to address risks from attribute disclosure; only identity disclosure risks need to be addressed
  • All known re-identification attacks were identity disclosure

Tables vs Records

  • Often disclosure control guidelines are stated in terms of tables or ‘aggregate’ tables
  • Tables of counts are exactly the same thing as individual-level data and can be converted from one to the other
  • When data is released in tabular format, however, additional issues can be raised:
    – Often small cells are suppressed. Using iterative algorithms it is easy to recover suppressed cells if the totals for (some of) the marginals are known
    – Tables with overlapping dimensions can leak information that is useful for recovering small cells
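
Both points can be illustrated in a few lines. The sketch below assumes a two-way table of counts with known row and column totals; the function names are hypothetical, and the recovery loop is a deliberately simple stand-in for the iterative algorithms mentioned on the slide, not a specific published method.

```python
from collections import Counter

def table_to_records(table):
    """Expand a {(row_label, col_label): count} table into one record per individual."""
    return [key for key, count in table.items() for _ in range(count)]

def records_to_table(records):
    """Collapse individual-level records back into a table of counts."""
    return dict(Counter(records))

def recover_suppressed(cells, row_totals, col_totals):
    """Fill suppressed cells (None) by repeatedly solving any row or
    column that has exactly one unknown value left."""
    cells = [row[:] for row in cells]
    progress = True
    while progress:
        progress = False
        for r, row in enumerate(cells):
            unknown = [c for c, v in enumerate(row) if v is None]
            if len(unknown) == 1:
                row[unknown[0]] = row_totals[r] - sum(v for v in row if v is not None)
                progress = True
        for c in range(len(col_totals)):
            column = [row[c] for row in cells]
            unknown = [r for r, v in enumerate(column) if v is None]
            if len(unknown) == 1:
                cells[unknown[0]][c] = col_totals[c] - sum(v for v in column if v is not None)
                progress = True
    return cells

table = {("Religion A", "Vaccinated"): 5, ("Religion A", "Not vaccinated"): 40,
         ("Not Religion A", "Vaccinated"): 40, ("Not Religion A", "Not vaccinated"): 5}
records = table_to_records(table)              # 90 individual-level records
assert records_to_table(records) == table      # the conversion is lossless

# Three of four cells suppressed, but the marginal totals were published.
recovered = recover_suppressed([[None, None], [None, 5]], [45, 45], [45, 45])
print(recovered)  # [[5, 40], [40, 5]]
```

Larger tables may need more passes or leave some cells undetermined; the point is only that published marginals constrain suppressed cells.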

SLIDE 7

Does De-identification Work?

  • Existing evidence shows that data sets that have been properly de-identified have a low probability of being re-identified
  • All publicly known examples of serious re-identification attacks were done on data sets that were not properly de-identified (i.e., it is possible to show that their risk of re-identification was quite high and did not meet HIPAA de-identification standards)
  • As far as we know, proper de-identification works effectively in managing risk

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028071

SLIDE 8

De-identification Standards

  • The HIPAA Privacy Rule specified two de-identification standards:
    – Safe Harbor
    – Statistical method

HIPAA Safe Harbor

Safe Harbor Direct Identifiers and Quasi-identifiers

  • 1. Names
  • 2. ZIP Codes (except first three)
  • 3. All elements of dates (except year)
  • 4. Telephone numbers
  • 5. Fax numbers
  • 6. Electronic mail addresses
  • 7. Social security numbers
  • 8. Medical record numbers
  • 9. Health plan beneficiary numbers
  • 10. Account numbers
  • 11. Certificate/license numbers
  • 12. Vehicle identifiers and serial numbers, including license plate numbers
  • 13. Device identifiers and serial numbers
  • 14. Web Universal Resource Locators (URLs)
  • 15. Internet Protocol (IP) address numbers
  • 16. Biometric identifiers, including finger and voice prints
  • 17. Full face photographic images and any comparable images
  • 18. Any other unique identifying number, characteristic, or code
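
As a rough illustration of elements 2 and 3, the sketch below drops a few direct identifiers, truncates ZIP codes to three digits, and reduces dates to the year. The column names and the record are hypothetical, and the regulation text contains further conditions that this slide summary omits and that the sketch does not implement.

```python
from datetime import date

# Hypothetical column names standing in for some of the direct identifiers on the list.
DIRECT_IDENTIFIERS = {"name", "phone", "fax", "email", "ssn", "mrn"}

def generalize_zip(zip_code: str) -> str:
    """Keep only the first three digits of a ZIP code."""
    return zip_code.strip()[:3]

def safe_harbor_sketch(record: dict) -> dict:
    """Drop listed direct identifiers, truncate ZIP codes, reduce dates to the year."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue                              # remove the field entirely
        if field == "zip":
            out["zip3"] = generalize_zip(value)
        elif isinstance(value, date):
            out[field + "_year"] = value.year     # all date elements except the year are dropped
        else:
            out[field] = value
    return out

record = {"name": "Bob Smith", "zip": "11201", "admission": date(2007, 3, 14), "sex": "M"}
print(safe_harbor_sketch(record))  # {'zip3': '112', 'admission_year': 2007, 'sex': 'M'}
```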

SLIDE 9

Names (Element 1)

  • Covers only the names of the individuals or of relatives, employers, or household members of the individual
  • Names of providers would not be considered as part of the Safe Harbor list and therefore, strictly speaking, it would not be necessary to remove them from the data set

Element 18

  • Any other unique identifying number, characteristic, or code:
    – Number: clinical record number
    – Characteristic: rare and visible diagnosis
    – Code: hashed SSN (derived from a direct identifier without a salt)
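
The reason an unsalted hash of an SSN still counts as an identifying code is that the space of possible SSNs is small enough to enumerate. The sketch below assumes SHA-256 and a made-up SSN, and restricts the search to 10,000 candidates to keep the example fast; a real attacker would simply enumerate the whole space.

```python
import hashlib

def unsalted_hash(ssn: str) -> str:
    """Hash an SSN with no salt or secret key; anyone can repeat this computation."""
    return hashlib.sha256(ssn.encode("ascii")).hexdigest()

def dictionary_attack(target_hash, candidates):
    """Hash every candidate value until one matches the released code."""
    for candidate in candidates:
        if unsalted_hash(candidate) == target_hash:
            return candidate
    return None

# The code that appears in the released data set (made-up SSN).
released_code = unsalted_hash("123-45-6789")

# Reduced candidate space for the demonstration only.
candidates = (f"123-45-{serial:04d}" for serial in range(10000))
recovered = dictionary_attack(released_code, candidates)
print(recovered)  # 123-45-6789
```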

SLIDE 10

Codes for Re-identification

  • A covered entity may assign a code or other means of record identification to allow information de-identified under this section to be re-identified by the covered entity, provided that:
    – The code or other means of record identification is not derived from or related to information about the individual and is not otherwise capable of being translated so as to identify the individual; and
    – The covered entity does not use or disclose the code or other means of record identification for any other purpose, and does not disclose the mechanism of re-identification.

What to do about codes?

  • There is some confusion between HIPAA and the Common Rule on this point
  • Create a pseudonym that is not derived from the original data and secure the linking table
  • Rely on the second de-identification standard to argue that the risk of re-identification is acceptably small when using appropriate hashing mechanisms
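
The pseudonym option can be sketched as follows: each identifier gets a random code that carries no information about the original value, and the only way back is the linking table, which the data custodian keeps secured. The function name, the code length, and the MRN values are illustrative assumptions.

```python
import secrets

def assign_pseudonyms(identifiers):
    """Map each distinct identifier to a random pseudonym.
    The returned linking table is the only route back to the original values."""
    linking_table = {}
    used = set()
    for identifier in identifiers:
        if identifier in linking_table:
            continue                          # same person keeps the same pseudonym
        pseudonym = secrets.token_hex(8)      # 16 random hex characters, not derived from the data
        while pseudonym in used:              # extremely unlikely, but keep codes distinct
            pseudonym = secrets.token_hex(8)
        used.add(pseudonym)
        linking_table[identifier] = pseudonym
    return linking_table

linking_table = assign_pseudonyms(["MRN-001", "MRN-002", "MRN-001"])
released_ids = [linking_table[mrn] for mrn in ["MRN-001", "MRN-002", "MRN-001"]]
```

Only `released_ids` would leave the organization; `linking_table` stays behind under access control.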

SLIDE 11

Actual Knowledge

  • The covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information
  • For example:
    – Staff with relatives in the database
    – Famous people in the database
    – Individuals with unique characteristics or occupations


SLIDE 12

Applicability of Safe Harbor

  • Safe Harbor ensures that the risk of re-identification is low if:
    – The data is a random sample from the population (common assumption)
    – The only quasi-identifiers in the data are dates and ZIP codes
    – The adversary does not know who is in the data set
    – The data is cross-sectional

High Risk Safe Harbor - I

  • Inclusion of provider names makes it possible to determine additional information about the patient (with a high probability):
    – The facility where treatment was received
    – General area where the patient lives (likely close to the facility)
    – The range of diagnoses (if the provider is a specialist or if multiple providers with different specialties see a patient)

SLIDE 13

High Risk Safe Harbor - II

  • If an adversary knows that Bob is in the database:

  Gender    Age    ZIP    Lab Test
  M         55     112    Albumin, Serum
  F         53     114    Alkaline Phosphatase
  M         24     134    Creatine Kinase

High Risk Safe Harbor - III

  • Longitudinal data can have a high risk of re-identification even if it meets the stipulations of Safe Harbor
  • For example, using the State Inpatient Database for NY (2007) with 2 million visits:

  Quasi-identifiers            % of patients unique
  age, gender, ZIP3            1.8%
  age, gender, ZIP3, LOS       20.75%

SLIDE 14

High Risk Safe Harbor - IV

  • Safe Harbor ignores the impact of sub-sampling on re-identification risk
  • If a Safe Harbor compliant data set is a 10% sample it will have a lower risk than a 50% sample from exactly the same population
  • Therefore, there can be significant variation in re-identification risk for different sub-samples from the same data set

SLIDE 15

Statistical Method

  • A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:
    I. Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and
    II. Documents the methods and results of the analysis that justify such determination

Documentation

  • It is important to have complete documentation of the methods used to evaluate re-identification risk as OCR may ask for that at a later date
  • This documentation should include:
    – The attack scenarios assumed for the data
    – Risk thresholds used and their justifications
    – Assumptions made about the adversary
    – How re-identification risk was actually measured and evidence of validity of the metrics

SLIDE 16

Who is an “Expert”?

  • This is not explicitly defined, but some reasonable requirements can be considered:
    – Training
    – Experience doing de-identification of health data
    – Publications and research in the area of de-identification
    – Education in methods relevant to de-identification
  • There are on-going efforts to develop “certification” schemes for de-identification experts

Uniqueness As a Measure of Risk

  • Uniqueness is typically defined as the percentage of records in the population that are unique on the combination of quasi-identifier values
  • Unique records can be re-identified with certainty
  • Non-unique records can also contribute to a high probability of re-identification
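
Uniqueness as defined here is straightforward to compute. The sketch below uses four made-up records and hypothetical column names; it also shows how adding one more quasi-identifier (length of stay) can raise uniqueness, in the spirit of the State Inpatient Database example.

```python
from collections import Counter

def percent_unique(records, quasi_identifiers):
    """Percentage of records whose combination of quasi-identifier values occurs only once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return 100.0 * unique / len(keys)

# Made-up records for illustration only.
records = [
    {"age": 55, "gender": "M", "zip3": "112", "los": 3},
    {"age": 55, "gender": "M", "zip3": "112", "los": 7},
    {"age": 53, "gender": "F", "zip3": "114", "los": 2},
    {"age": 24, "gender": "M", "zip3": "134", "los": 2},
]

print(percent_unique(records, ["age", "gender", "zip3"]))         # 50.0
print(percent_unique(records, ["age", "gender", "zip3", "los"]))  # 100.0
```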

SLIDE 17

Uniqueness Measures

  • Based on the 2010 census, a data set that is strictly compliant with Safe Harbor will not have any unique records
  • But the average risk of re-identification due to the non-unique records is 0.05% of the population
  • As another example, in a birth registry data set (126k records) we de-identified to eliminate uniques, but the average risk of re-identification was still ~15%

Other Ways of Measuring Risk

  • It is important to go beyond uniqueness:
    – Maximum re-identification probability
    – Average re-identification probability
  • Maximum risk assumes that the adversary is targeting a single individual
  • Maximum risk is always at least as high as average risk

SLIDE 18

http://genomemedicine.com/content/3/4/25/abstract

Universe of Attacks

SLIDE 19

What is “very small”?

  • With reference to:
    – The data set only
    – The overall context of the disclosure and use
  • The disclosure control community has in practice taken the overall context of the disclosure and use into account when deciding what is acceptable risk, and this is evident in writings and guidance going back a number of decades; this is also a more realistic assessment of risk
  • Acceptable risk will also depend on the nature of the attack (which one of the three)

SLIDE 20

Re-identification Risk Spectrum

Managing Re-identification Risk

SLIDE 21

De-identification Process

  • Set Risk Threshold: Based on the characteristics of the data recipient, the data, and precedents, a quantitative threshold is set.
  • Measure Risk: Based on plausible re-identification attacks, appropriate metrics are selected and used to measure actual re-identification risk from the data.
  • Apply Transformations: If the measured risk is above the threshold, specific transformations (such as generalization and suppression) are applied to reduce the risk.
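
The three steps can be read as a loop: measure, compare against the threshold, transform, and measure again. The sketch below is one possible reading under stated assumptions: risk is measured as maximum risk (one over the smallest equivalence class), generalization widens age bands, and suppression drops whatever records still exceed the threshold. The function names, band widths, and data are hypothetical.

```python
from collections import Counter

def class_sizes(records, quasi_identifiers):
    """Return each record's quasi-identifier key and the count of records per key."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    return keys, Counter(keys)

def max_risk(records, quasi_identifiers):
    """Measure risk as 1 / size of the smallest equivalence class."""
    keys, sizes = class_sizes(records, quasi_identifiers)
    return max(1.0 / sizes[k] for k in keys)

def generalize_age(record, width):
    """Replace an exact age with a band such as '20-39'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}

def deidentify(records, quasi_identifiers, threshold, widths=(5, 10, 20)):
    """Generalize in increasing steps until risk meets the threshold,
    then suppress any records that are still too risky."""
    current = list(records)
    for width in (None,) + tuple(widths):
        if width is not None:
            current = [generalize_age(r, width) for r in records]
        if max_risk(current, quasi_identifiers) <= threshold:
            return current
    keys, sizes = class_sizes(current, quasi_identifiers)
    return [r for r, k in zip(current, keys) if 1.0 / sizes[k] <= threshold]

# Made-up data; the threshold of 0.5 is an arbitrary illustration.
records = [{"age": 31, "sex": "F"}, {"age": 33, "sex": "F"}, {"age": 36, "sex": "F"},
           {"age": 38, "sex": "F"}, {"age": 52, "sex": "M"}]
released = deidentify(records, ["age", "sex"], threshold=0.5)
print(len(released), max_risk(released, ["age", "sex"]))  # 4 0.25
```

In this example generalization alone is not enough, because one record stays unique at every band width, so it is suppressed and the remaining four records share a single age band.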

How Long is De-id Valid For?

  • We have done some experiments on longitudinal data sets to see how long a de-identification can be valid for:
    – For registries with relatively stable processes and patient populations (e.g., birth) the validity can be 5 years
    – For business processes that may change more often the interval is likely shorter; we generally recommend 18 to 24 months

SLIDE 22

www.ehealthinformation.ca

www.ehealthinformation.ca/knowledgebase

kelemam@uottawa.ca