Quantitative Methods to Measure the Risk of Re-identification: Methodology Review
Bradley Malin, Ph.D.
Professor (Biomedical Informatics, Biostatistics, & Computer Science), Vanderbilt University
November 30, 2017

A Very Simplified View on Risk
P(reid) ≈ a function of the data's
• Distinguishability
• Replicability
• Availability

Risk is Contextual
[Figure: a Sample containing Bill, Mike, and Brad]

Risk is Contextual
[Figure: the Sample (Bill, Mike, Brad) drawn inside a Population (Multiple clinical trials) that also contains Mitch, Will, and Abe]

Risk is Contextual
[Figure: the Sample (Bill, Mike, Brad) drawn inside a Population (All Eligible People) that also contains Mitch, Will, and Abe]

Risk Measures
● Worst Case Risk Measures
  ● Prosecutor: Most risky record in the sample
  ● Journalist: Most risky record in the population
● Amortized (or Average) Measure
  ● Marketer: Expected risk of an arbitrary record

A Very Simplified View on Risk
P(reid) ≈ P(knowledge) * P(reid | knowledge)
P(knowledge) is the Prior (pick a framework):
• Neighbor (friend)
• Prosecutor
• Journalist
• …
P(reid | knowledge) depends on the data's:
• Distinguishability
• Replicability
• Availability

Prosecutor Risk = 1 (Because of Mike)
[Figure: Sample containing Bill, Mike, and Brad]

Journalist Risk = ½ = 0.5 (Because of Bill & Brad)
[Figure: Sample (Bill, Mike, Brad) inside a Population that also contains Mitch, Will, and Abe]

Marketer Risk (Sample) = (½ + ½ + 1) / 3 = 2/3
[Figure: Sample with per-record risks Bill ½, Brad ½, Mike 1]

Marketer Risk (Population) = (.5 + .5 + .33) / 3 = .44
[Figure: Sample inside Population with per-record risks Bill 1/2, Brad 1/2, Mike 1/3; Will and Abe appear in the Population]
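The prosecutor, journalist, and marketer figures above can be reproduced in a few lines of Python. This is a minimal sketch, not the presenter's code. It assumes each record's risk is 1 divided by the number of people who share its quasi-identifier values, and the group labels g1, g2, and g3 are hypothetical, chosen only so the numbers match the slides (Bill and Brad look alike; Mike is unique in the sample but looks like Will and Abe in the population).

```python
from collections import Counter
from fractions import Fraction

# Hypothetical quasi-identifier groups (assumed for illustration only).
SAMPLE = {"Bill": "g1", "Brad": "g1", "Mike": "g2"}
POPULATION = {**SAMPLE, "Will": "g2", "Abe": "g2", "Mitch": "g3"}


def record_risks(sample, reference):
    """Risk of each sample record: 1 / (reference records in the same group)."""
    sizes = Counter(reference.values())
    return {name: Fraction(1, sizes[group]) for name, group in sample.items()}


def prosecutor_risk(sample):
    """Worst case when the adversary knows the target is in the sample."""
    return max(record_risks(sample, sample).values())


def journalist_risk(sample, population):
    """Worst case over sample records, matched against the whole population."""
    return max(record_risks(sample, population).values())


def marketer_risk(sample, reference):
    """Average risk over the sample records."""
    risks = record_risks(sample, reference)
    return sum(risks.values()) / len(risks)


print(prosecutor_risk(SAMPLE))              # 1
print(journalist_risk(SAMPLE, POPULATION))  # 1/2
print(marketer_risk(SAMPLE, SAMPLE))        # 2/3
print(marketer_risk(SAMPLE, POPULATION))    # 4/9
```

The last value, 4/9, is the exact form of the .44 shown on the slide.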

Central Dogma of Re-identification
● Anonymised Data (Necessary Condition)
● Linking Mechanism (Necessary Condition)
● Identified Data (Necessary Condition)
Malin, Benitez, Loukides, & Clayton. Human Genetics. 2011.

A Famous Linkage Attack
High Profile Re-identification
● U.S. Hospital Discharge Data: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge
● U.S. Voter List: Name, Address, Date registered, Party affiliation, Date last voted
● Shared by both: ZIP Code, Birthdate, Gender
Sweeney, JLME. 1997
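The linkage above amounts to a join on the shared fields. The sketch below is only an illustration under stated assumptions: every record is invented, the field names (zip, birthdate, gender) and the function name link are hypothetical, and a match is claimed only when exactly one identified record shares the quasi-identifiers.

```python
from collections import defaultdict

# Invented toy records; none of these values come from real data.
HOSPITAL = [
    {"zip": "00001", "birthdate": "1950-01-01", "gender": "M", "diagnosis": "dx-A"},
    {"zip": "00002", "birthdate": "1960-02-02", "gender": "F", "diagnosis": "dx-B"},
    {"zip": "00003", "birthdate": "1970-03-03", "gender": "F", "diagnosis": "dx-C"},
]
VOTERS = [
    {"name": "Voter One", "zip": "00001", "birthdate": "1950-01-01", "gender": "M"},
    {"name": "Voter Two", "zip": "00002", "birthdate": "1960-02-02", "gender": "F"},
    {"name": "Voter Three", "zip": "00002", "birthdate": "1960-02-02", "gender": "F"},
]
QUASI_IDENTIFIERS = ("zip", "birthdate", "gender")


def link(deidentified, identified, keys=QUASI_IDENTIFIERS):
    """Return (name, record) pairs where the quasi-identifiers match one person."""
    index = defaultdict(list)
    for person in identified:
        index[tuple(person[k] for k in keys)].append(person)

    matches = []
    for record in deidentified:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # ambiguous or missing matches are skipped
            matches.append((candidates[0]["name"], record))
    return matches


for name, record in link(HOSPITAL, VOTERS):
    print(name, "->", record["diagnosis"])  # Voter One -> dx-A
```

Only the first hospital record is linked: the second matches two voters, and the third matches none.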

But availability of Demographics Varies…
State | WHO | Format | Cost
IL | Registered Political Committees (ANYONE – In Person) | Disk | $500
MN | MN Voters | Disk | $46; “use ONLY for elections, political activities, or law enforcement”
TN | Anyone | Disk | $2500
WA | Anyone | Disk | $30
WI | Anyone | Disk | $12,500
Fields compared across states: Name, Address, Date of Birth, Sex, Race, Phone Number
Benitez & Malin. JAMIA. 2010.

Adversaries are Not All Knowing
● Research subject demographics
  ● Unique (potential)
  ● Replicable
  ● Available
● Series of drug doses administered
  ● Unique (potential)
  ● Not replicable
  ● Not available

Adversaries are Not All Knowing
(Assume knowledge between x and y features)
[Figure: repeated copies of one record's feature table]
Feature | Observation
Date of Birth | 1/1/1970
Gender | Male
Race | White
Country | France
Death | Yes
Occupation | Teacher
Married | Yes

… Which Means No Single Risk Score
[Figure: repeated copies of the same feature table (Date of Birth, Gender, Race, Country, Death, Occupation, Married) alongside a plot of Frequency against Re-identification Risk]
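One way to see why there is no single risk score is to enumerate every subset of features an adversary might know and compute the risk for each. The sketch below is an assumption-laden illustration: the four records are invented, only three of the features are used to keep it short, and risk is taken as 1 divided by the number of records that match the target on the known features.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Invented toy population; the first record is the target.
RECORDS = [
    {"gender": "Male", "country": "France", "occupation": "Teacher"},
    {"gender": "Male", "country": "France", "occupation": "Nurse"},
    {"gender": "Male", "country": "Spain", "occupation": "Teacher"},
    {"gender": "Female", "country": "France", "occupation": "Teacher"},
]
TARGET = RECORDS[0]


def risk_by_known_features(records, target, min_known, max_known):
    """Map each feature subset of size min_known..max_known to the target's risk."""
    risks = {}
    features = sorted(target)
    for k in range(min_known, max_known + 1):
        for subset in combinations(features, k):
            # Count records indistinguishable from the target on this subset.
            matches = sum(all(r[f] == target[f] for f in subset) for r in records)
            risks[subset] = Fraction(1, matches)
    return risks


risks = risk_by_known_features(RECORDS, TARGET, min_known=1, max_known=3)
distribution = Counter(risks.values())
for risk, count in sorted(distribution.items()):
    print(f"risk {risk}: {count} feature subset(s)")
```

With this toy data the result is a distribution (three subsets at 1/3, three at 1/2, one at 1), which is the frequency-versus-risk picture the slide describes.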

You May Not See Everything
● Field structured databases – you can issue exact queries!
● Semi-structured reports and narratives – you must rely on a mix of
  ● Artificial intelligence
  ● Human review
  ● Sampling
● Problem has become a bit more tricky because leaks can now include explicitly identifying values

A Natural Language Setting
● Create a Gold Standard Dataset
  ● Ask X humans to read and label a selection of records
  ● Ensure concordance between human annotations (e.g., via a Cohen’s Kappa)
● Apply (manual or automated) identifier detection strategy
● Compute performance in terms of:
  ● (R)ecall – rate at which real identifier instances were detected
  ● (P)recision – rate at which claimed identifiers are in fact real
  ● F-measure – weighted harmonic mean of R and P
Original PHI: Smith, 61 yo ... daughter, Lynn, to ... oncologist Dr. White ... 5/13/10 to consider ... SWOG protocol 1811, ... was randomized 5/10 ... to call Mr. Smith on ... PLAN: Dr White and I ...
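The evaluation quantities above are short enough to compute directly. The following is a minimal sketch with invented inputs: identifier instances are represented as hypothetical (start, end) character spans, annotator labels as 0/1 per token, and the function names are illustrative rather than taken from any tool mentioned in the talk.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Precision, recall, and F-measure over sets of identifier spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2  # beta > 1 weights recall more heavily than precision
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented example: spans marked in the gold standard vs. spans detected.
gold_spans = {(0, 5), (30, 34), (50, 59), (70, 77)}
detected_spans = {(0, 5), (50, 59), (70, 77), (90, 95)}
p, r, f = precision_recall_f(gold_spans, detected_spans)
print(p, r, f)  # 0.75 0.75 0.75

# Invented example: per-token labels (1 = identifier) from two annotators.
annotator_a = [1, 1, 0, 0, 1, 0, 0, 0]
annotator_b = [1, 1, 0, 0, 0, 0, 0, 0]
print(round(cohens_kappa(annotator_a, annotator_b), 3))  # 0.714
```

For de-identification, recall usually matters more than precision because a missed identifier is a leak, so a beta above 1 is a common choice for the F-measure.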

A Natural Language Setting
● A leak is not necessarily a re-identification
● Need to assess the re-identification potential given the leak rates
● Can assess on a per research subject level (if labels are person specific)
● Alternatively, may assume each feature leaks at random:
  P(date of birth leak, date of death leak) ≈ P(date of birth leak) * P(date of death leak)
Original PHI: Smith, 61 yo ... daughter, Lynn, to ... oncologist Dr. White ... 5/13/10 to consider ... SWOG protocol 1811, ... was randomized 5/10 ... to call Mr. Smith on ... PLAN: Dr White and I ...
**Redacted PHI & Leaked PHI: **pt_name<A>, **age<60s> yo ... daughter, Lynn, to ... oncologist Dr. **MD_name<C> ... **date<5/28/10> to consider ... SWOG protocol **other_id, ... was randomized 5/10 ... to call Mr. **pt_name<A> on ... PLAN: Dr White and I ...
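Under the independence assumption above, a joint leak probability is just a product of per-feature leak rates. The sketch below assumes, for illustration only, that a feature's leak rate is one minus the detector's recall for that feature; the recall values are hypothetical and do not come from the talk.

```python
import math

# Hypothetical per-feature recall of an identifier detector (assumed values).
RECALL = {"date of birth": 0.98, "date of death": 0.95}


def leak_rate(recall):
    """Probability that an instance of the feature is missed, i.e. leaks."""
    return 1.0 - recall


def joint_leak_probability(features, recall_by_feature):
    """P(all listed features leak), assuming each feature leaks independently."""
    return math.prod(leak_rate(recall_by_feature[f]) for f in features)


joint = joint_leak_probability(["date of birth", "date of death"], RECALL)
print(f"{joint:.4f}")  # 0.0010
```

If labels are person specific, the per-subject assessment mentioned above avoids the independence assumption entirely, since co-occurring leaks can be counted directly.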

Hiding in Plain Sight (Carrell et al, 2013)
● Must be careful – this is a relatively new technique
● Also evidence to suggest a computer may mimic the initial detection strategy… and redact the fakes situated in a pattern (Li et al, 2016)
● In this case, we would need two risk measures
  ● Human recognition risk
  ● Computer-assisted recognition risk
● Recent approach to prevent this problem, but comes at a massive loss in precision (Li et al, 2017)
Original PHI: Smith, 61 yo ... daughter, Lynn, to ... oncologist Dr. White ... 5/13/10 to consider ... SWOG protocol 1811, ... was randomized 5/10 ... to call Mr. Smith on ... PLAN: Dr White and I ...
**Redacted PHI & Leaked PHI: **pt_name<A>, **age<60s> yo ... daughter, Lynn, to ... oncologist Dr. **MD_name<C> ... **date<5/28/10> to consider ... SWOG protocol **other_id, ... was randomized 5/10 ... to call Mr. **pt_name<A> on ... PLAN: Dr White and I ...
Surrogate PHI & Hidden PHI: Jones, a 64 yo ... daughter, Lynn, for ... oncologist Dr. Howe ... 5/28/10 to consider ... SWOG protocol 1798, ... was randomized 5/10 ... to call Mr. Jones on ... PLAN: Dr White and I ...

If Time Permits… If Not…

Latest Development
● Methods presented model data sharer and adversary separately
● New approaches use game theory to consider their interactions (Wan et al 2015)
● Game theory requires robust estimates of many parameters, such as
  ● benefit the sharer gets in providing data
  ● benefit the attacker gets in re-identifying the data

Stackelberg Game
[Figure sequence: the Publisher chooses a sharing strategy, then the Recipient chooses an attack strategy]
● Publisher considers Sharing Strategy 1 (Utility 1, Risk ???). Strategies: Generalize Age, Suppress Dates, Perturb Geography, …
● Recipient can answer with Attack Strategy A (Utility A, Risk A), Attack Strategy B (Utility B, Risk B), or Attack Strategy C (Utility C, Risk C)
● Recipient's Best Strategy against Sharing Strategy 1 is Attack Strategy B, so Sharing Strategy 1 carries Risk B
● Publisher considers Sharing Strategy 2 (Utility 2, Risk ???); the Recipient's Best Strategy is Attack Strategy A, so Sharing Strategy 2 carries Risk A
● Publisher's options become: Sharing Strategy 1 (Utility 1, Risk B), Sharing Strategy 2 (Utility 2, Risk A), …, Sharing Strategy Z (Utility Z, Risk Z)
● Choose strategy that maximizes overall benefit
● Optimizes the Risk-Utility tradeoff

Demographic Case Study
● ~30,000 Census records
● $1200: Benefit per record
● $300: Cost per violation
● $4: Access cost per record
[Figure: Average Payoff Per Record, Attacker ($0.00 to $3.00) plotted against Publisher ($0.00 to $1,500.00), with US Safe Harbor marked]
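The leader-follower logic of the Stackelberg game can be sketched for a single record. This is a simplified illustration, not the model of Wan et al: the dollar figures are the case study's, but the sharing strategies, their retained-utility fractions, and their re-identification probabilities are hypothetical, and the sketch assumes the attacker's gain from a successful re-identification equals the publisher's cost per violation.

```python
BENEFIT_PER_RECORD = 1200.0   # publisher's benefit (from the case study)
COST_PER_VIOLATION = 300.0    # publisher's loss; assumed to be attacker's gain
ACCESS_COST = 4.0             # attacker's cost to attempt one record

# Hypothetical strategies: (fraction of benefit retained, P(reid) if attacked).
SHARING_STRATEGIES = {
    "share as-is": (1.00, 0.20),
    "generalize age": (0.97, 0.01),
    "suppress dates": (0.80, 0.001),
}


def attacker_best_response(p_reid):
    """Follower's move: attack only when the expected payoff is positive."""
    attack_payoff = p_reid * COST_PER_VIOLATION - ACCESS_COST
    if attack_payoff > 0:
        return "attack", attack_payoff
    return "abstain", 0.0


def publisher_payoff(retained, p_reid):
    """Leader's payoff, anticipating the attacker's best response."""
    action, _ = attacker_best_response(p_reid)
    expected_loss = p_reid * COST_PER_VIOLATION if action == "attack" else 0.0
    return retained * BENEFIT_PER_RECORD - expected_loss


def best_sharing_strategy(strategies):
    """Pick the strategy with the highest anticipated publisher payoff."""
    return max(strategies, key=lambda name: publisher_payoff(*strategies[name]))


for name, (retained, p_reid) in SHARING_STRATEGIES.items():
    action, _ = attacker_best_response(p_reid)
    print(f"{name}: attacker will {action}, "
          f"publisher payoff {publisher_payoff(retained, p_reid):.2f}")
print("chosen:", best_sharing_strategy(SHARING_STRATEGIES))  # chosen: generalize age
```

With these invented numbers, a modest generalization is enough to make an attack unprofitable, so the publisher keeps most of the benefit without paying the expected violation cost.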
