iDASH Healthcare Privacy Protection Challenge Fei Yu - PowerPoint PPT Presentation

Overview Method Scoring functions Summary iDASH Healthcare Privacy Protection Challenge Fei Yu feiy@stat.cmu.edu Carnegie Mellon University 24 March 2014

Overview
Task: select the K most significant SNPs in a differentially private way.
* Setting: case-control study.
* Input data: genotype data (e.g., AA, AT, TT) for cases; minor allele frequencies for controls.
* Ranking significance: p-value of Pearson's χ² test of association between SNP and phenotype.
* Performance evaluation: the proportion of significant SNPs recovered.

Overview
* The method is based on the exponential mechanism.
* Two variations of the method, with their pros and cons.

Definitions
Differential privacy: Let $\mathcal{D}$ denote the set of all data sets. Write $D \sim D'$ if $D$ and $D'$ differ in one individual. A randomized mechanism $\mathcal{K}$ is $\epsilon$-differentially private if, for all $D \sim D'$ and for any measurable set $S \subset \mathbb{R}$,
$$\frac{\Pr(\mathcal{K}(D) \in S)}{\Pr(\mathcal{K}(D') \in S)} \le e^{\epsilon}.$$
Sensitivity: The sensitivity of a function $f : \mathcal{D}^N \to \mathbb{R}^d$, where $\mathcal{D}^N$ denotes the set of all databases with $N$ individuals, is the smallest number $S(f)$ such that
$$\| f(D) - f(D') \|_1 \le S(f)$$
for all data sets $D, D' \in \mathcal{D}^N$ such that $D \sim D'$.

Exponential mechanism
McSherry and Talwar (2007): Given $D = \{\mathrm{SNP}_i\}_{i=1}^{M}$, $\mathcal{E}_q^{\epsilon}$ is a random variable with
$$\Pr(\mathcal{E}_q^{\epsilon}(D) = i) \propto \exp\left(\frac{\epsilon\, q(D,i)}{2\Delta q}\right)\mu(i) \propto \exp\left(\frac{\epsilon\, q(D,i)}{2s}\right),$$
where
* $q(D,i)$ = the score for SNP $i$,
* $s$ = the sensitivity of $q(D,\cdot)$,
* $\mu(i) = 1/M$.
$\mathcal{E}_q^{\epsilon}$ is $\epsilon$-differentially private.
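
The sampling rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names `exponential_mechanism_probs` and `exponential_mechanism` are made up here, and the scores and sensitivity are assumed to be supplied by the caller.

```python
import math
import random


def exponential_mechanism_probs(scores, epsilon, sensitivity):
    """Return selection probabilities proportional to exp(eps * q_i / (2 * s))."""
    scale = epsilon / (2.0 * sensitivity)
    top = max(scores)  # subtracting the max leaves the ratios unchanged but avoids overflow
    weights = [math.exp(scale * (q - top)) for q in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism(scores, epsilon, sensitivity, rng=random):
    """Draw one index (one SNP) according to the probabilities above."""
    probs = exponential_mechanism_probs(scores, epsilon, sensitivity)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


if __name__ == "__main__":
    # Hypothetical scores for three SNPs.
    print(exponential_mechanism_probs([3.0, 1.0, 0.0], epsilon=1.0, sensitivity=2.0))
    print(exponential_mechanism([3.0, 1.0, 0.0], epsilon=1.0, sensitivity=2.0))
```

The uniform base measure μ(i) = 1/M cancels in the normalization, which is why it does not appear in the code.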

Exponential mechanism
We can use any scoring function $q(D,\cdot)$ with the exponential mechanism. Examples:
1. χ² statistic
2. Hamming distance (Johnson and Shmatikov 2013)

Extending the exponential mechanism
Johnson and Shmatikov (2013): selecting the K most significant SNPs (LocSig).
1. Initialize $\mathcal{S} = \emptyset$ and $q_i$ = score of SNP $i$.
2. Set $w_i = \exp\left(\frac{\epsilon\, q_i}{2Ks}\right)$ and $\Pr(\mathcal{E}_q^{\epsilon}(D) = i) = w_i / \sum_{j=1}^{M} w_j$.
3. Sample $j \sim \mathcal{E}_q^{\epsilon}(D)$. Add SNP $j$ to $\mathcal{S}$. Set $q_j = -\infty$.
4. If $|\mathcal{S}| < K$, return to Step 2. Otherwise, output $\mathcal{S}$.
LocSig is $\epsilon$-differentially private (Yu et al. 2014).
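
The four LocSig steps can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' implementation: the name `loc_sig` and its signature are hypothetical, and the scores and their sensitivity `s` are assumed to be computed elsewhere.

```python
import math
import random


def loc_sig(scores, k, epsilon, sensitivity, rng=None):
    """Pick k distinct indices; each draw uses weights exp(eps * q_i / (2 * k * s))."""
    rng = rng or random.Random()
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of SNPs")
    q = list(scores)                        # Step 1: working copy of the scores
    selected = []
    scale = epsilon / (2.0 * k * sensitivity)
    while len(selected) < k:                # Step 4: repeat until k SNPs are chosen
        top = max(q)                        # shift by the max for numerical stability
        weights = [math.exp(scale * (qi - top)) for qi in q]   # Step 2
        j = rng.choices(range(len(q)), weights=weights, k=1)[0]  # Step 3: sample
        selected.append(j)
        q[j] = float("-inf")                # weight becomes 0, so j is never redrawn
    return selected


if __name__ == "__main__":
    # Hypothetical scores for five SNPs.
    print(loc_sig([9.1, 0.4, 7.7, 1.2, 3.3], k=2, epsilon=1.0, sensitivity=1.0))
```

Each of the K draws spends ε/K of the privacy budget, which is where the factor K in the denominator of the weights comes from.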

Performance of different scoring functions
* Hamming distance outperforms χ² when ε is small.
* The utility of Hamming distance may plateau before it reaches 1.0. (Why?)

Setup
Assumptions:
* Number of cases = number of controls = N/2.
* Case data are private, but control data are known.

Setup
Summarizing a SNP:
* The genotype table is not available. We only know the genotypes of the cases:

  Genotype    0      1      2      Total
  Case        g_0    g_1    g_2    N/2

* Derived allelic table:

  Allele      0      1      Total
  Case        n_00   n_01   N
  Control     n_10   n_11   N
  Total       n_0    n_1    2N

Using the χ² statistic as score
* Pearson's χ² statistics are used to rank the significance of SNPs.
* Higher utility is attainable by increasing ε.
* The sensitivity of Pearson's χ² statistic of an allelic table with positive margins, N/2 cases and N/2 controls is
  $$\frac{8N^2}{(N+3)(N+1)}\left(1 - \frac{2}{N}\right) \quad \text{when } N \ge 3.$$
  See Yu et al. (2014).

χ² statistic vs. ranking

Using Hamming distance as score

  D      ∼   D_1    ∼   ...   ∼   D_{n-1}   ∼   D_n
  ⇓          ⇓                    ⇓             ⇓
  p          p_1        ...       p_{n-1}       p_n
  (sig)      (sig)                (sig)         (not sig)

* Score > 0 only when $D \in \mathcal{D}$ is significant.
* The SNP significance ordering resulting from Hamming distance could differ from that resulting from the χ² statistic.
* Sensitive to the choice of the threshold p-value.
* No genotype data for controls: it is necessary to assume that the controls are known.

Finding the Hamming distance

  D      ∼   D_1    ∼   ...   ∼   D_{n-1}   ∼   D_n
  ⇓          ⇓                    ⇓             ⇓
  p          p_1        ...       p_{n-1}       p_n
  (sig)      (sig)                (sig)         (not sig)

* Instead of examining all possible paths, follow the path of the greatest ascent or descent.
* The resulting path may not have the shortest Hamming distance.
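
A greedy descent of this kind might look like the Python sketch below. It covers only the descent direction (a significant SNP made non-significant) and rests on several assumptions that are not in the slides: the function names are hypothetical, one step changes a single case's genotype, the p-value is taken from the χ² distribution with one degree of freedom, and a table with an empty allele column is scored as 0.

```python
import math


def allelic_chi2(g0, g1, n10, n_alleles):
    """chi^2 = 2N (n00 - n10)^2 / (n0 * n1), with n00 = 2*g0 + g1 and N = n_alleles."""
    n00 = 2 * g0 + g1
    n0 = n00 + n10
    n1 = 2 * n_alleles - n0
    if n0 == 0 or n1 == 0:
        return 0.0  # assumption: the numerator is also 0 here, so score it as 0
    return 2.0 * n_alleles * (n00 - n10) ** 2 / (n0 * n1)


def p_value(chi2):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def greedy_hamming_distance(genotypes, n10, threshold):
    """Count greedy single-case changes until the p-value reaches the threshold."""
    g = list(genotypes)                 # case counts (g0, g1, g2)
    n_alleles = 2 * sum(g)              # N alleles for N/2 cases
    current = allelic_chi2(g[0], g[1], n10, n_alleles)
    steps = 0
    while p_value(current) < threshold:
        best = None
        for src in range(3):            # move one case from genotype src ...
            if g[src] == 0:
                continue
            for dst in range(3):        # ... to genotype dst
                if dst == src:
                    continue
                cand = list(g)
                cand[src] -= 1
                cand[dst] += 1
                value = allelic_chi2(cand[0], cand[1], n10, n_alleles)
                if best is None or value < best[0]:
                    best = (value, cand)
        if best is None or best[0] >= current:
            raise RuntimeError("no single change lowers the statistic")
        current, g = best               # take the steepest-descent step
        steps += 1
    return steps


if __name__ == "__main__":
    # Hypothetical SNP: 50 cases with genotype counts (40, 10, 0), control count n10 = 50.
    print(greedy_hamming_distance((40, 10, 0), n10=50, threshold=0.05))
```

Because each step only looks one change ahead, the count returned is an upper bound on the true Hamming distance, in line with the caveat on the slide.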

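Finding the Hamming distance

Derived allelic table:

  Allele      0      1      Total
  Case        n_00   n_01   N
  Control     n_10   n_11   N
  Total       n_0    n_1    2N

Partial genotype table:

  Genotype    0      1      2      Total
  Case        g_0    g_1    g_2    N/2

$$\chi^2 = \frac{2N (n_{00} - n_{10})^2}{n_0 n_1} = \frac{2N (2g_0 + g_1 - n_{10})^2}{(2g_0 + g_1 + n_{10})(2N - 2g_0 - g_1 - n_{10})}$$
$$\nabla \chi^2 = \left(\frac{\partial}{\partial g_0}\chi^2,\ \frac{\partial}{\partial g_1}\chi^2\right), \qquad \frac{\partial}{\partial g_0}\chi^2 = 2\,\frac{\partial}{\partial g_1}\chi^2$$
$$\frac{\partial}{\partial g_1}\chi^2 \propto \left(\frac{n_{00}}{n_0}\,\frac{n_{11}}{n_1} - \frac{n_{01}}{n_1}\,\frac{n_{10}}{n_0}\right)\left(\frac{n_{10}}{n_0} + \frac{n_{11}}{n_1}\right)$$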
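
As a quick numerical check of the closed form for χ², the sketch below compares it with Pearson's statistic computed directly from observed and expected counts of the 2×2 allelic table. The function names and the example counts are hypothetical; `n` stands for the row total N of the allelic table.

```python
def pearson_chi2(table):
    """Pearson's chi-square statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def closed_form_chi2(g0, g1, n10, n):
    """chi^2 = 2N (2 g0 + g1 - n10)^2 / (n0 * n1), with n0 = 2 g0 + g1 + n10."""
    n00 = 2 * g0 + g1
    n0 = n00 + n10
    n1 = 2 * n - n0
    return 2.0 * n * (n00 - n10) ** 2 / (n0 * n1)


if __name__ == "__main__":
    # Hypothetical SNP: 50 cases with genotype counts (30, 15, 5), so N = 100 alleles per row.
    g0, g1, g2 = 30, 15, 5
    n = 2 * (g0 + g1 + g2)
    n10 = 50
    table = [[2 * g0 + g1, g1 + 2 * g2], [n10, n - n10]]
    print(pearson_chi2(table), closed_form_chi2(g0, g1, n10, n))
```

Since n_00 = 2 g_0 + g_1, a unit change in g_0 moves n_00 twice as far as a unit change in g_1, which is the source of the factor 2 between the two partial derivatives.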

Comparison of scoring functions

                 χ²                                  Hamming
  Computation    Trivial                             Expensive
  Sensitivity    Nontrivial; may use upper bounds    1
  Stable         Yes                                 Not always

Summary
* Extending the exponential mechanism: LocSig
* χ² statistic as score
* Hamming distance as score
* Comparison of different scoring functions

References
Johnson, Aaron, and Vitaly Shmatikov. 2013. "Privacy-preserving data exploration in genome-wide association studies." In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1079–1087.
McSherry, Frank, and Kunal Talwar. 2007. "Mechanism Design via Differential Privacy." In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07), 94–103.
Yu, Fei, et al. 2014. "Scalable Privacy-Preserving Data Sharing Methodology for Genome-Wide Association Studies." Journal of Biomedical Informatics. arXiv:1401.5193.
