iDASH Healthcare Privacy Protection Challenge (Fei Yu)

  1. iDASH Healthcare Privacy Protection Challenge
     Fei Yu (feiy@stat.cmu.edu), Carnegie Mellon University, 24 March 2014
     Outline: Overview, Method, Scoring functions, Summary

  2. Overview
     Task: select the $K$ most significant SNPs in a differentially private way.
     * Setting: case-control study.
     * Input data: genotype data (e.g., AA, AT, TT) for cases; minor allele frequencies for controls.
     * Ranking significance: $p$-value from Pearson's $\chi^2$ test of association between SNP and phenotype.
     * Performance evaluation: the proportion of significant SNPs recovered.

  3. Overview
     * The method is based on the exponential mechanism.
     * Two variations of the method are presented, each with pros and cons.

  4. Definitions
     Differential privacy: Let $\mathcal{D}$ denote the set of all data sets. Write $D \sim D'$ if $D$ and $D'$ differ in one individual. A randomized mechanism $\mathcal{K}$ is $\epsilon$-differentially private if, for all $D \sim D'$ and for any measurable set $S \subset \mathbb{R}$,
     $$\frac{\Pr(\mathcal{K}(D) \in S)}{\Pr(\mathcal{K}(D') \in S)} \le e^{\epsilon}.$$
     Sensitivity: The sensitivity of a function $f : \mathcal{D}^N \to \mathbb{R}^d$, where $\mathcal{D}^N$ denotes the set of all databases with $N$ individuals, is the smallest number $S(f)$ such that
     $$\| f(D) - f(D') \|_1 \le S(f)$$
     for all data sets $D, D' \in \mathcal{D}^N$ such that $D \sim D'$.

  5. Exponential mechanism
     McSherry and Talwar (2007): Given $D = \{\mathrm{SNP}_i\}_{i=1}^{M}$, $\mathcal{E}^{\epsilon}_{q}$ is a random variable with
     $$\Pr\left(\mathcal{E}^{\epsilon}_{q}(D) = i\right) \propto \exp\left(\frac{\epsilon\, q(D, i)}{2 \Delta q}\right) \mu(i) \propto \exp\left(\frac{\epsilon\, q(D, i)}{2 s}\right),$$
     where
     * $q(D, i)$ = the score for SNP $i$,
     * $s$ = the sensitivity of $q(D, \cdot)$,
     * $\mu(i) = 1/M$.
     $\mathcal{E}^{\epsilon}_{q}$ is $\epsilon$-differentially private.
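
The sampling rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names, the example scores, and the sensitivity value are all assumptions chosen for the demonstration. Because the base measure is uniform, it cancels in the normalization and only the exponential weights remain.

```python
import math
import random


def exponential_mechanism_probs(scores, epsilon, sensitivity):
    """Selection probabilities proportional to exp(epsilon * q_i / (2 * s))."""
    top = max(scores)  # shift by the maximum so exp() cannot overflow
    weights = [math.exp(epsilon * (q - top) / (2.0 * sensitivity)) for q in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Draw one index according to the exponential-mechanism probabilities."""
    probs = exponential_mechanism_probs(scores, epsilon, sensitivity)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


# Hypothetical scores for three SNPs; sensitivity 4.0 is an arbitrary example value.
scores = [10.0, 2.0, 0.5]
probs = exponential_mechanism_probs(scores, epsilon=1.0, sensitivity=4.0)
picked = exponential_mechanism(scores, 1.0, 4.0, random.Random(0))
```

Higher-scoring SNPs are more likely to be picked, and the ratio of any two probabilities depends only on the score difference scaled by $\epsilon / (2s)$.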

  6. Exponential mechanism
     We can use any scoring function $q(D, \cdot)$ with the exponential mechanism. Examples:
     1. $\chi^2$ statistic
     2. Hamming distance (Johnson and Shmatikov 2013)

  7. Extending the exponential mechanism
     Johnson and Shmatikov (2013): selecting the $K$ most significant SNPs (LocSig).
     1. Initialize $\mathcal{S} = \emptyset$ and $q_i$ = score of SNP $i$.
     2. Set $w_i = \exp\left(\frac{\epsilon\, q_i}{2 K s}\right)$ and $\Pr\left(\mathcal{E}^{\epsilon}_{q}(D) = i\right) = w_i \big/ \sum_{j=1}^{M} w_j$.
     3. Sample $j \sim \mathcal{E}^{\epsilon}_{q}(D)$. Add SNP $j$ to $\mathcal{S}$. Set $q_j = -\infty$.
     4. If $|\mathcal{S}| < K$, return to Step 2. Otherwise, output $\mathcal{S}$.
     LocSig is $\epsilon$-differentially private (Yu et al. 2014).
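
The four LocSig steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the reference implementation: the name `locsig` and the example scores are hypothetical, and removing a chosen index from the candidate list stands in for setting $q_j = -\infty$, which gives that SNP zero weight in later rounds.

```python
import math
import random


def locsig(scores, k, epsilon, sensitivity, rng):
    """Select k distinct indices; each round uses weights exp(eps * q_i / (2 * k * s))."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of SNPs")
    remaining = list(range(len(scores)))  # Step 1: nothing selected yet
    selected = []
    while len(selected) < k:  # Step 4: repeat until k SNPs are chosen
        top = max(scores[i] for i in remaining)  # shift for numerical stability
        # Step 2: weights over the SNPs that are still eligible
        weights = [
            math.exp(epsilon * (scores[i] - top) / (2.0 * k * sensitivity))
            for i in remaining
        ]
        # Step 3: sample one SNP and exclude it from later rounds
        j = rng.choices(remaining, weights=weights, k=1)[0]
        selected.append(j)
        remaining.remove(j)
    return selected


# Hypothetical scores; with a very large epsilon the top scores win almost surely.
example_scores = [50.0, 40.0, 1.0, 0.5, 0.2]
noisy_pick = locsig(example_scores, 2, 5.0, 1.0, random.Random(1))
sharp_pick = locsig(example_scores, 2, 1e4, 1.0, random.Random(0))
```

Dividing by $K$ in the exponent spreads the privacy budget over the $K$ draws, so each round spends $\epsilon / K$.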

  8. Performance of different scoring functions
     * Hamming distance outperforms $\chi^2$ when $\epsilon$ is small.
     * Utility of Hamming distance may plateau before it reaches 1.0. (Why?)

  9. Setup
     Assumptions:
     * Number of cases = number of controls = $N/2$.
     * Case data are private but control data are known.

  10. Setup
      Summarizing a SNP:
      * The full genotype table is not available. We only know the genotypes of the cases:

        | Genotype | 0 | 1 | 2 | Total |
        |---|---|---|---|---|
        | Case | $g_0$ | $g_1$ | $g_2$ | $N/2$ |

      * Derived allelic table:

        | Allele | 0 | 1 | Total |
        |---|---|---|---|
        | Case | $n_{00}$ | $n_{01}$ | $N$ |
        | Control | $n_{10}$ | $n_{11}$ | $N$ |
        | Total | $n_0$ | $n_1$ | $2N$ |
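
As a rough illustration of how the allelic table follows from the inputs, the sketch below counts alleles from the case genotype counts ($n_{00} = 2g_0 + g_1$, $n_{01} = g_1 + 2g_2$) and fills the control row from a minor allele frequency. The function name, the example numbers, and the choice to treat allele 1 as the minor allele are assumptions made for this example only.

```python
def allelic_table(case_genotypes, control_maf):
    """Build the 2x2 allelic table from case genotype counts and a control MAF."""
    g0, g1, g2 = case_genotypes
    n_alleles = 2 * (g0 + g1 + g2)  # N alleles carried by the N/2 cases
    n00 = 2 * g0 + g1  # copies of allele 0 among cases
    n01 = g1 + 2 * g2  # copies of allele 1 among cases
    # Assumption: the same number of controls, with allele 1 as the minor allele.
    n11 = control_maf * n_alleles
    n10 = n_alleles - n11
    return (n00, n01), (n10, n11)


# Hypothetical SNP: 50 cases with genotype counts (30, 15, 5), control MAF 0.2.
case_row, control_row = allelic_table((30, 15, 5), 0.2)
```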

  11. Using $\chi^2$ statistic as score
      * Pearson's $\chi^2$ statistics are used to rank significance of SNPs.
      * Higher utility is attainable by increasing $\epsilon$.
      * The sensitivity of Pearson's $\chi^2$ statistic of an allelic table with positive margins, $N/2$ cases and $N/2$ controls is
        $$\frac{8 N^2}{(N + 3)(N + 1)} \left(1 - \frac{2}{N}\right)$$
        when $N \ge 3$. See Yu et al. (2014).
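
For concreteness, here is a small sketch of the score itself: Pearson's $\chi^2$ statistic of a 2x2 allelic table and its $p$-value with one degree of freedom. The function names and the example table are hypothetical, and the table is assumed to have positive margins. The $p$-value uses the identity that the upper tail of a $\chi^2_1$ variable at $x$ equals $\operatorname{erfc}(\sqrt{x/2})$.

```python
import math


def pearson_chi2_2x2(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    (a, b), (c, d) = table
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    total = a + b + c + d
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def chi2_pvalue_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical allelic table: rows are (case, control), columns are (allele 0, allele 1).
table = ((75, 25), (80, 20))
stat = pearson_chi2_2x2(table)
p_value = chi2_pvalue_1df(stat)
```

A larger statistic means a smaller $p$-value, so ranking SNPs by the statistic and ranking them by significance give the same order.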

  12. $\chi^2$ statistic vs. ranking

  14. Using Hamming distance as score
      $$\begin{array}{ccccccccc}
      D & \sim & D_1 & \sim & \cdots & \sim & D_{n-1} & \sim & D_n \\
      \Downarrow & & \Downarrow & & & & \Downarrow & & \Downarrow \\
      p & & p_1 & & \cdots & & p_{n-1} & & p_n \\
      \text{(sig)} & & \text{(sig)} & & & & \text{(sig)} & & \text{(not sig)}
      \end{array}$$
      * Score > 0 only when $D \in \mathcal{D}$ is significant.
      * The SNP significance ordering resulting from Hamming distance may differ from that resulting from the $\chi^2$ statistic.
      * Sensitive to the choice of the threshold $p$-value.
      * No genotype data for controls: necessary to assume controls are known.

  18. Finding the Hamming distance
      $$\begin{array}{ccccccccc}
      D & \sim & D_1 & \sim & \cdots & \sim & D_{n-1} & \sim & D_n \\
      \Downarrow & & \Downarrow & & & & \Downarrow & & \Downarrow \\
      p & & p_1 & & \cdots & & p_{n-1} & & p_n \\
      \text{(sig)} & & \text{(sig)} & & & & \text{(sig)} & & \text{(not sig)}
      \end{array}$$
      * Instead of examining all possible paths, follow the path of the greatest ascent or descent.
      * The resulting path may not have the shortest Hamming distance.
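
A greedy path search of this kind might look like the sketch below. It is only an illustration under several assumptions that are not in the slides: neighbors are formed by changing one case's genotype, controls are fixed, significance means the $\chi^2$ statistic exceeds a supplied critical value, and a non-significant data set receives the negated step count. All function names and example numbers are hypothetical.

```python
def chi2_allelic(g, n10, n11):
    """Pearson chi-square of the allelic table derived from case genotypes g."""
    g0, g1, g2 = g
    n00, n01 = 2 * g0 + g1, g1 + 2 * g2
    n0, n1 = n00 + n10, n01 + n11
    if n0 == 0 or n1 == 0:
        return 0.0  # degenerate column margin: treat as no association
    total = n0 + n1
    return total * (n00 * n11 - n01 * n10) ** 2 / ((n00 + n01) * (n10 + n11) * n0 * n1)


def neighbors(g):
    """All genotype tables reachable by changing one case's genotype."""
    out = []
    for a in range(3):
        for b in range(3):
            if a != b and g[a] > 0:
                h = list(g)
                h[a] -= 1
                h[b] += 1
                out.append(tuple(h))
    return out


def greedy_hamming_score(g, n10, n11, critical):
    """Count greedy steps until significance flips; None if no flip is found."""
    significant = chi2_allelic(g, n10, n11) > critical
    pick = min if significant else max  # descend if significant, ascend otherwise
    current, steps, max_steps = g, 0, sum(g)
    while (chi2_allelic(current, n10, n11) > critical) == significant:
        if steps == max_steps:
            return None  # greedy path did not cross the threshold
        current = pick(neighbors(current), key=lambda h: chi2_allelic(h, n10, n11))
        steps += 1
    return steps if significant else -steps  # sign convention is an assumption


CRITICAL_05 = 3.841458820694124  # chi-square critical value, 1 df, p = 0.05
score_significant = greedy_hamming_score((10, 20, 20), 80, 20, CRITICAL_05)
score_not_significant = greedy_hamming_score((30, 15, 5), 80, 20, CRITICAL_05)
```

Because only the locally best neighbor is followed, the step count is an upper bound on the true Hamming distance, which matches the caveat in the second bullet.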

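  19. Finding the Hamming distance
      Derived allelic table:

      | Allele | 0 | 1 | Total |
      |---|---|---|---|
      | Case | $n_{00}$ | $n_{01}$ | $N$ |
      | Control | $n_{10}$ | $n_{11}$ | $N$ |
      | Total | $n_0$ | $n_1$ | $2N$ |

      Partial genotype table:

      | Genotype | 0 | 1 | 2 | Total |
      |---|---|---|---|---|
      | Case | $g_0$ | $g_1$ | $g_2$ | $N/2$ |

      With $n_{00} = 2 g_0 + g_1$, $n_0 = n_{00} + n_{10}$ and $n_1 = 2N - n_0$:
      $$\chi^2 = \frac{2N (n_{00} - n_{10})^2}{n_0 n_1} = \frac{2N (2 g_0 + g_1 - n_{10})^2}{(2 g_0 + g_1 + n_{10})(2N - 2 g_0 - g_1 - n_{10})}$$
      $$\nabla \chi^2 = \left(\frac{\partial \chi^2}{\partial g_0}, \frac{\partial \chi^2}{\partial g_1}\right), \qquad \frac{\partial \chi^2}{\partial g_0} = 2\, \frac{\partial \chi^2}{\partial g_1}$$
      $$\frac{\partial \chi^2}{\partial g_1} \propto \left(\frac{n_{00}}{n_0} \frac{n_{11}}{n_1} - \frac{n_{01}}{n_1} \frac{n_{10}}{n_0}\right) \left(\frac{n_{10}}{n_0} + \frac{n_{11}}{n_1}\right)$$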
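
The gradient relations can be checked numerically. The sketch below treats $g_0$ and $g_1$ as continuous, differentiates the closed-form $\chi^2$ by central differences, and compares the result with an analytic expression derived here from that closed form, $\partial \chi^2 / \partial g_1 = \frac{4N (n_{00} - n_{10})}{n_0 n_1} \left(\frac{n_{10}}{n_0} + \frac{n_{11}}{n_1}\right)$. The function names and example values are assumptions for illustration.

```python
def chi2_from_genotypes(g0, g1, n10, n_alleles):
    """Closed-form chi-square with n00 = 2*g0 + g1 and n_alleles = N per group."""
    n00 = 2.0 * g0 + g1
    n0 = n00 + n10
    n1 = 2.0 * n_alleles - n0
    return 2.0 * n_alleles * (n00 - n10) ** 2 / (n0 * n1)


def analytic_d_g1(g0, g1, n10, n_alleles):
    """Derivative with respect to g1, derived from the closed form above."""
    n00 = 2.0 * g0 + g1
    n11 = n_alleles - n10
    n0 = n00 + n10
    n1 = 2.0 * n_alleles - n0
    return 4.0 * n_alleles * (n00 - n10) / (n0 * n1) * (n10 / n0 + n11 / n1)


def central_difference(f, x, h=1e-5):
    """Second-order finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# Hypothetical values: 50 cases (N = 100 alleles), 80 control copies of allele 0.
g0, g1, n10, n_alleles = 30.0, 15.0, 80.0, 100.0
d_g0 = central_difference(lambda x: chi2_from_genotypes(x, g1, n10, n_alleles), g0)
d_g1 = central_difference(lambda x: chi2_from_genotypes(g0, x, n10, n_alleles), g1)
d_g1_exact = analytic_d_g1(g0, g1, n10, n_alleles)
```

Since the two partial derivatives always differ by a factor of 2, the direction of steepest ascent or descent is fixed and only its sign, set by $n_{00} - n_{10}$, varies.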

  20. Performance of different scoring functions
      * Hamming distance outperforms $\chi^2$ when $\epsilon$ is small.
      * Utility of Hamming distance may plateau before it reaches 1.0. (Why?)

  21. Comparison of scoring functions

      | | $\chi^2$ | Hamming |
      |---|---|---|
      | Computation | Trivial | Expensive |
      | Sensitivity | Nontrivial; may use upper bounds | 1 |
      | Stable | Yes | Not always |

  22. Summary
      * Extending the exponential mechanism: LocSig
      * $\chi^2$ statistic as score
      * Hamming distance as score
      * Comparison of different scoring functions

  23. References
      * Johnson, Aaron, and Vitaly Shmatikov. 2013. "Privacy-preserving data exploration in genome-wide association studies". In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1079–1087.
      * McSherry, Frank, and Kunal Talwar. 2007. "Mechanism Design via Differential Privacy". 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07): 94–103.
      * Yu, Fei, et al. 2014. "Scalable Privacy-Preserving Data Sharing Methodology for Genome-Wide Association Studies". Journal of Biomedical Informatics. arXiv: 1401.5193.
