iDASH Healthcare Privacy Protection Challenge (Fei Yu)


SLIDE 1

Overview Method Scoring functions Summary

iDASH Healthcare Privacy Protection Challenge

Fei Yu feiy@stat.cmu.edu

Carnegie Mellon University

24 March 2014

SLIDE 2

Overview

Task: Select the K most significant SNPs in a differentially private manner.
* Setting: case-control study.
* Input data: genotype data (e.g., AA, AT, TT) for cases; minor allele frequencies for controls.
* Ranking significance: the p-value of Pearson's χ² test of association between SNP and phenotype.
* Performance evaluation: the proportion of significant SNPs recovered.
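The performance measure can be sketched as a set overlap. This is a minimal illustration, assuming that "recovered" means the privately selected SNPs are compared against the true top-K list; the function name and the example indices are hypothetical.

```python
def recovered_fraction(selected, true_top_k):
    """Proportion of the truly most significant SNPs that appear in the output.

    `selected` and `true_top_k` are collections of SNP indices (assumed inputs).
    """
    true_set = set(true_top_k)
    return len(true_set & set(selected)) / len(true_set)


# Hypothetical example: 2 of the 3 truly significant SNPs were recovered.
utility = recovered_fraction(selected=[3, 7, 1], true_top_k=[1, 2, 3])
```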

SLIDE 3

Overview

* The method is based on the exponential mechanism.
* Two variations of the method, with their pros and cons.

SLIDE 4

Definitions

Differential privacy: Let $\mathcal{D}$ denote the set of all data sets. Write $D \sim D'$ if $D$ and $D'$ differ in one individual. A randomized mechanism $K$ is ε-differentially private if, for all $D \sim D'$ and for any measurable set $S \subset \mathbb{R}$,

\[
\frac{\Pr(K(D) \in S)}{\Pr(K(D') \in S)} \le e^{\epsilon}.
\]

Sensitivity: The sensitivity of a function $f : \mathcal{D}^N \to \mathbb{R}^d$, where $\mathcal{D}^N$ denotes the set of all databases with $N$ individuals, is the smallest number $S(f)$ such that

\[
\lVert f(D) - f(D') \rVert_1 \le S(f)
\]

for all data sets $D, D' \in \mathcal{D}^N$ such that $D \sim D'$.
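To make the sensitivity definition concrete, the sketch below brute-forces it for a simple function on a tiny database. The function chosen here (the number of copies of allele 0, with genotypes coded 0, 1, 2 and genotype 0 carrying two copies) is an illustrative assumption, not one of the scoring functions of the talk.

```python
from itertools import product


def allele0_count(genotypes):
    """Copies of allele 0, assuming genotype g in {0, 1, 2} carries 2 - g copies."""
    return sum(2 - g for g in genotypes)


def brute_force_sensitivity(f, n_individuals, values=(0, 1, 2)):
    """Largest |f(D) - f(D')| over all pairs D ~ D' that differ in one individual."""
    worst = 0
    for data in product(values, repeat=n_individuals):
        for i in range(n_individuals):
            for v in values:
                neighbor = data[:i] + (v,) + data[i + 1:]
                worst = max(worst, abs(f(data) - f(neighbor)))
    return worst


# Changing one genotype moves the allele count by at most 2.
sensitivity = brute_force_sensitivity(allele0_count, n_individuals=4)
```

Exhaustive enumeration is only feasible for toy sizes; for real scoring functions the sensitivity has to be derived analytically.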

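SLIDE 5

Exponential mechanism

McSherry and Talwar (2007): Given $D = \{\mathrm{SNP}_i\}_{i=1}^{M}$, $\mathcal{E}_q^{\epsilon}$ is a random variable with

\[
\Pr\big(\mathcal{E}_q^{\epsilon}(D) = i\big) \;\propto\; \exp\!\left(\frac{\epsilon\, q(D, i)}{2\Delta q}\right)\mu(i) \;\propto\; \exp\!\left(\frac{\epsilon\, q(D, i)}{2s}\right),
\]

where
* $q(D, i)$ = the score for $\mathrm{SNP}_i$,
* $s = \Delta q$ = the sensitivity of $q(D, \cdot)$,
* $\mu(i) = 1/M$.

$\mathcal{E}_q^{\epsilon}$ is ε-differentially private.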
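A minimal sketch of sampling from the exponential mechanism with a uniform base measure follows. The function names and the example scores are illustrative assumptions; the score vector and its sensitivity are taken as given.

```python
import math
import random


def exponential_mechanism_probs(scores, epsilon, sensitivity):
    """Selection probabilities proportional to exp(epsilon * q_i / (2 * s))."""
    scale = epsilon / (2.0 * sensitivity)
    # Shift by the largest score before exponentiating to avoid overflow;
    # the shift cancels when the weights are normalized.
    top = max(scores)
    weights = [math.exp(scale * (q - top)) for q in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism(scores, epsilon, sensitivity, rng=random):
    """Draw one index i with the probabilities above."""
    probs = exponential_mechanism_probs(scores, epsilon, sensitivity)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


# Hypothetical scores for M = 4 SNPs.
scores = [12.0, 3.5, 0.2, 7.1]
probs = exponential_mechanism_probs(scores, epsilon=1.0, sensitivity=4.0)
draw = exponential_mechanism(scores, epsilon=1.0, sensitivity=4.0)
```

Because $\mu(i) = 1/M$ is constant, it drops out of the normalized probabilities, which is why the code never refers to it.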

SLIDE 6

Exponential mechanism

We can use any scoring function $q(D, \cdot)$ with the exponential mechanism. Examples:

1. χ² statistic
2. Hamming distance (Johnson and Shmatikov 2013)

SLIDE 7

Extending the exponential mechanism

Johnson and Shmatikov (2013): selecting the K most significant SNPs (LocSig).

1. Initialize $S = \emptyset$ and $q_i$ = score of $\mathrm{SNP}_i$.
2. Set $w_i = \exp\!\left(\frac{\epsilon q_i}{2Ks}\right)$ and $\Pr\big(\mathcal{E}_q^{\epsilon}(D) = i\big) = w_i \Big/ \sum_{j=1}^{M} w_j$.
3. Sample $j \sim \mathcal{E}_q^{\epsilon}(D)$. Add $\mathrm{SNP}_j$ to $S$. Set $q_j = -\infty$.
4. If $|S| < K$, return to Step 2. Otherwise, output $S$.

LocSig is ε-differentially private (Yu et al. 2014).
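The four steps above can be sketched as a loop that samples without replacement, spending ε/K of the privacy budget per draw. This is an illustrative sketch, not the authors' implementation; the function name, the example scores, and the choice to delete a selected index (equivalent to setting its score to −∞) are assumptions.

```python
import math
import random


def locsig(scores, k, epsilon, sensitivity, rng=random):
    """Select k indices by repeated exponential-mechanism draws (Steps 1-4)."""
    remaining = dict(enumerate(scores))      # Step 1: index -> score
    selected = []
    scale = epsilon / (2.0 * k * sensitivity)
    while len(selected) < k and remaining:
        indices = list(remaining)
        top = max(remaining.values())        # shift for numerical stability
        weights = [math.exp(scale * (remaining[i] - top)) for i in indices]
        j = rng.choices(indices, weights=weights, k=1)[0]   # Steps 2-3
        selected.append(j)
        del remaining[j]                     # same effect as q_j = -infinity
    return selected                          # Step 4


# Hypothetical scores for M = 5 SNPs; select K = 3 of them.
example_scores = [10.0, 1.0, 2.0, 8.0, 0.5]
picked = locsig(example_scores, k=3, epsilon=1.0, sensitivity=1.0)
```

With a very large ε the weights concentrate on the highest remaining score, so the output approaches the true top-K list; with a small ε the draws are close to uniform.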

SLIDE 8

Performance of different scoring functions

* Hamming (distance) outperforms χ² when ε is small.
* Utility of Hamming may plateau before it reaches 1.0. (Why?)

SLIDE 9

Setup

Assumptions:
* Number of cases = number of controls = N/2.
* Case data are private but control data are known.

SLIDE 10

Setup

Summarizing a SNP:
* The full genotype table is not available. We only know the genotypes of the cases:

Genotype   0     1     2     Total
Case       g0    g1    g2    N/2

* Derived allelic table:

Allele     0     1     Total
Case       n00   n01   N
Control    n10   n11   N
Total      n0    n1    2N
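The allelic table can be derived from the case genotype counts and the control allele count as sketched below. The sketch assumes that genotype 0 carries two copies of allele 0 (so n00 = 2·g0 + g1), that the control count n10 has already been obtained from the known control allele frequency, and that there are as many controls as cases; the function name and numbers are hypothetical.

```python
def allelic_table(g0, g1, g2, n10):
    """Return [[n00, n01], [n10, n11]] from case genotype counts and control n10."""
    n_alleles = 2 * (g0 + g1 + g2)   # N alleles carried by the N/2 cases
    n00 = 2 * g0 + g1                # copies of allele 0 among cases
    n01 = g1 + 2 * g2                # copies of allele 1 among cases
    n11 = n_alleles - n10            # controls also carry N alleles (assumption)
    return [[n00, n01], [n10, n11]]


# Hypothetical SNP: 50 cases with genotype counts (30, 15, 5), control n10 = 40.
table = allelic_table(30, 15, 5, n10=40)
```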

SLIDE 11

Using χ² statistic as score

* Pearson's χ² statistics are used to rank the significance of SNPs.
* Higher utility is attainable by increasing ε.
* The sensitivity of Pearson's χ² statistic of an allelic table with positive margins, N/2 cases and N/2 controls is

\[
\frac{8N^2}{(N + 3)(N + 1)}\left(1 - \frac{2}{N}\right)
\]

when N ≥ 3. See Yu et al. (2014).

SLIDE 12

χ² statistic vs. ranking

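SLIDE 14

Using Hamming distance as score

\[
\begin{array}{ccccccccc}
D & \sim & D_1 & \sim & \cdots & \sim & D_{n-1} & \sim & D_n \\
\Downarrow & & \Downarrow & & & & \Downarrow & & \Downarrow \\
p & & p_1 & & \cdots & & p_{n-1} & & p_n \\
\text{(sig)} & & \text{(sig)} & & & & \text{(sig)} & & \text{(not sig)}
\end{array}
\]

Here n, the number of single-individual changes needed to move D from significant to not significant, is the Hamming distance.

* Score > 0 only when $D \in \mathcal{D}$ is significant.
* The SNP significance ordering resulting from the Hamming distance may differ from the ordering resulting from the χ² statistic.
* The score is sensitive to the choice of the threshold p-value.
* No genotype data are available for controls, so it is necessary to assume that the controls are known.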

SLIDE 18

Finding the Hamming distance

Chain of neighboring data sets: $D \sim D_1 \sim \cdots \sim D_{n-1} \sim D_n$, where the p-values $p, p_1, \ldots, p_{n-1}$ are significant and $p_n$ is not significant.

* Instead of examining all possible paths, follow the path of greatest ascent or descent.
* The resulting path may not have the shortest Hamming distance.
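A greedy search of this kind can be sketched as follows. This is an illustrative sketch under several assumptions, not the procedure used in the talk: one step changes a single case's genotype, the allelic χ² statistic (1 degree of freedom) defines significance, controls are fixed, and non-significant SNPs receive a negative step count. All function names and numbers are hypothetical.

```python
import math


def chi2_allelic(g, n10):
    """Allelic chi-square from case genotype counts g = (g0, g1, g2) and control n10."""
    n = 2 * sum(g)                   # N alleles among cases (and among controls)
    n00 = 2 * g[0] + g[1]
    n0 = n00 + n10
    n1 = 2 * n - n0
    if n0 == 0 or n1 == 0:
        return 0.0                   # degenerate margin: treated as no association
    return 2 * n * (n00 - n10) ** 2 / (n0 * n1)


def p_value(chi2):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def neighbors(g):
    """All genotype-count vectors reachable by changing one case's genotype."""
    for a in range(3):
        if g[a] == 0:
            continue
        for b in range(3):
            if b != a:
                h = list(g)
                h[a] -= 1
                h[b] += 1
                yield tuple(h)


def greedy_hamming_score(g, n10, p_threshold):
    """Steps of steepest descent (or ascent) in chi-square until significance flips."""
    def is_sig(h):
        return p_value(chi2_allelic(h, n10)) < p_threshold

    start_sig = is_sig(g)
    pick = min if start_sig else max     # descend if significant, else ascend
    current, steps = tuple(g), 0
    max_steps = 2 * sum(g)               # guard in case the flip is unreachable
    while is_sig(current) == start_sig and steps < max_steps:
        current = pick(neighbors(current), key=lambda h: chi2_allelic(h, n10))
        steps += 1
    return steps if start_sig else -steps


# Hypothetical SNPs with 50 cases each and p-value threshold 0.05.
score_sig = greedy_hamming_score((40, 8, 2), n10=60, p_threshold=0.05)
score_not_sig = greedy_hamming_score((30, 15, 5), n10=70, p_threshold=0.05)
```

Each greedy step evaluates at most six neighbors, so the search is cheap compared with enumerating all paths, at the cost of possibly overestimating the distance.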

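SLIDE 19

Finding the Hamming distance

Partial genotype table

Genotype   0     1     2     Total
Case       g0    g1    g2    N/2

Derived allelic table

Allele     0     1     Total
Case       n00   n01   N
Control    n10   n11   N
Total      n0    n1    2N

With $n_{00} = 2g_0 + g_1$, $n_0 = n_{00} + n_{10}$ and $n_1 = 2N - n_0$:

\[
\chi^2 = \frac{2N(n_{00} - n_{10})^2}{n_0 n_1} = \frac{2N(2g_0 + g_1 - n_{10})^2}{(2g_0 + g_1 + n_{10})(2N - 2g_0 - g_1 - n_{10})}
\]

\[
\nabla \chi^2 = \left(\frac{\partial}{\partial g_0}\chi^2,\ \frac{\partial}{\partial g_1}\chi^2\right), \qquad \frac{\partial}{\partial g_0}\chi^2 = 2\,\frac{\partial}{\partial g_1}\chi^2
\]

\[
\frac{\partial}{\partial g_1}\chi^2 \;\propto\; \left(\frac{n_{00}}{n_0}\,\frac{n_{11}}{n_1} - \frac{n_{01}}{n_1}\,\frac{n_{10}}{n_0}\right)\left(\frac{n_{10}}{n_0} + \frac{n_{11}}{n_1}\right)
\]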
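The relation between the two partial derivatives can be checked numerically. The sketch below treats g0 and g1 as continuous, uses n1 = 2N − n0 from the allelic table, and approximates the derivatives by central differences; the function names, step size, and example numbers are assumptions made for illustration.

```python
def chi2_from_genotypes(g0, g1, n10, n_alleles):
    """Allelic chi-square as a function of case genotype counts (n_alleles = N)."""
    n00 = 2 * g0 + g1
    n0 = n00 + n10
    n1 = 2 * n_alleles - n0
    return 2 * n_alleles * (n00 - n10) ** 2 / (n0 * n1)


def numerical_gradient(g0, g1, n10, n_alleles, h=1e-6):
    """Central-difference estimates of the partial derivatives in g0 and g1."""
    d_g0 = (chi2_from_genotypes(g0 + h, g1, n10, n_alleles)
            - chi2_from_genotypes(g0 - h, g1, n10, n_alleles)) / (2 * h)
    d_g1 = (chi2_from_genotypes(g0, g1 + h, n10, n_alleles)
            - chi2_from_genotypes(g0, g1 - h, n10, n_alleles)) / (2 * h)
    return d_g0, d_g1


# Hypothetical SNP: N = 100 alleles per group, g0 = 30, g1 = 15, n10 = 70.
d_g0, d_g1 = numerical_gradient(30.0, 15.0, n10=70.0, n_alleles=100.0)
```

Here n00·n11 − n01·n10 = 75·30 − 25·70 > 0, so both partial derivatives are positive and the derivative in g0 is twice the derivative in g1, since g0 enters n00 with coefficient 2.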

SLIDE 21

Comparison of scoring functions

              χ²                                  Hamming
Computation   Trivial                             Expensive
Sensitivity   Nontrivial; may use upper bounds    1
Stable        Yes                                 Not always

SLIDE 22

Summary

* Extending the exponential mechanism: LocSig
* χ² statistic as score
* Hamming distance as score
* Comparison of the different scoring functions

SLIDE 23

References

Johnson, Aaron, and Vitaly Shmatikov. 2013. "Privacy-preserving data exploration in genome-wide association studies". In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1079–1087.

McSherry, Frank, and Kunal Talwar. 2007. "Mechanism Design via Differential Privacy". In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07), 94–103.

Yu, Fei, et al. 2014. "Scalable Privacy-Preserving Data Sharing Methodology for Genome-Wide Association Studies". Journal of Biomedical Informatics. arXiv: 1401.5193.