
Homomorphic Encryption Based Secure Genome Data Analysis Miran Kim - PowerPoint PPT Presentation

  1. Homomorphic Encryption Based Secure Genome Data Analysis. Miran Kim (Seoul National University) and Kristin Lauter (Microsoft Research). iDASH Privacy & Security Workshop, March 16, 2015.

  2. Secure Outsourcing GWAS

  3. Minor Allele Frequency
     There are 200 people, and each of them has 311 genotypes. Each genotype has two kinds of SNPs.
     Data Encoding
     For a fixed genotype, suppose that the 200 people have "AT, AT, AA, ..., TT". The encoding method is as follows:
     ◮ If the pair consists of different SNPs (AT), encode it as '1'.
     ◮ The first pair with the same SNP (AA) is encoded as '0'.
     ◮ The other one (TT) is encoded as '2'.
     (⇒ the encoded value is the number of 'T' in the individual's SNPs.)
     [Figure: table of encoded values (0, 1, or 2) with one row per person, P_1, ..., P_200, and one column per genotype, G_1, ..., G_311.]
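The encoding rule above can be sketched in a few lines. This is only an illustration, assuming each genotype arrives as a two-letter string and that 'T' is the allele being counted; the function name `encode_genotype` is hypothetical and does not come from the slides.

```python
def encode_genotype(pair: str, counted_allele: str = "T") -> int:
    """Return how many copies of `counted_allele` a two-letter genotype holds."""
    if len(pair) != 2:
        raise ValueError("a genotype is a pair of alleles, e.g. 'AT'")
    return pair.count(counted_allele)


# One genotype column, as in the slide's example "AT, AT, AA, ..., TT".
column = ["AT", "AT", "AA", "TT"]
encoded = [encode_genotype(g) for g in column]
print(encoded)  # [1, 1, 0, 2]
```

Counting one fixed allele is what makes the later aggregation a plain sum: adding the encoded values over all people gives the total count of that allele.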

  6. Minor Allele Frequency
     Encryption & Evaluation
     [Figure: each person's row of encoded genotypes (G_1, ..., G_311) is encrypted into one ciphertext, P_i → C_i for i = 1, ..., 200; the ciphertexts are added to obtain $\sum_{i=1}^{200} C_i$, which decrypts to the counts #(T).]
     (We can perform the aggregate operations simultaneously for all the genotypes.)
     Decryption
     ◮ Decrypt the ciphertext $\sum_{i=1}^{200} C_i$ with the secret key.
     ◮ Let $\ell_i$ be the value in the $i$-th slot.
     Decoding
     ◮ For $1 \le i \le 311$, if $\ell_i > 200$, then $\ell_i \leftarrow 400 - \ell_i$.
     ◮ The minor allele frequency of the genotype $G_i$ is $\ell_i / 400$.
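The aggregation and decoding steps can be simulated on plaintext values. The sketch below is not an encryption example: the slot-wise sum over plain lists merely stands in for the homomorphic addition of the ciphertexts, and the names `aggregate` and `minor_allele_frequencies` are hypothetical.

```python
from typing import List


def aggregate(rows: List[List[int]]) -> List[int]:
    """Slot-wise sum of all rows (plaintext stand-in for adding C_1, ..., C_n)."""
    return [sum(column) for column in zip(*rows)]


def minor_allele_frequencies(counts: List[int], n_people: int) -> List[float]:
    """Decode per-genotype allele counts into minor allele frequencies."""
    total_alleles = 2 * n_people  # 400 in the slides, where n_people = 200
    mafs = []
    for count in counts:
        if count > n_people:  # the counted allele is the major one
            count = total_alleles - count
        mafs.append(count / total_alleles)
    return mafs


# Toy data: 4 people, 3 genotypes, values already encoded as 0/1/2.
rows = [[1, 2, 0], [2, 2, 0], [1, 2, 1], [0, 2, 0]]
counts = aggregate(rows)
mafs = minor_allele_frequencies(counts, n_people=4)
print(counts, mafs)  # [4, 8, 1] [0.5, 0.0, 0.125]
```

With the slides' parameters the threshold is 200 and the denominator is 400, which matches the decoding rule stated above.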

  8. Chi-squared Test
     Data Encoding
     ◮ For each genotype, encode the given SNPs of the case group and the control group.
     * Note that the result of the chi-squared test is
       $$\frac{n(ad - bc)^2}{r \cdot s \cdot g \cdot k} = \frac{800\,\bigl(a(400 - c) - c(400 - a)\bigr)^2}{400 \cdot 400 \cdot g \cdot k} = \frac{800\,(a - c)^2}{(a + c)\bigl(800 - (a + c)\bigr)}$$
       where $a$ and $c$ are the allele counts of some SNP in the case and control groups.
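The simplification can be checked numerically with exact rational arithmetic. The sketch assumes the usual 2×2 contingency-table reading, which the slide does not spell out: $b = 400 - a$ and $d = 400 - c$ are the counts of the other allele, $r$ and $s$ are the row totals (400 each), and $g = a + c$, $k = 800 - (a + c)$ are the column totals. Function names are hypothetical.

```python
from fractions import Fraction


def chi2_general(a: int, b: int, c: int, d: int) -> Fraction:
    """n(ad - bc)^2 / (r s g k) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    r, s = a + b, c + d  # row totals (case, control)
    g, k = a + c, b + d  # column totals (counted allele, other allele)
    return Fraction(n * (a * d - b * c) ** 2, r * s * g * k)


def chi2_simplified(a: int, c: int, group_alleles: int = 400) -> Fraction:
    """800 (a - c)^2 / ((a + c)(800 - (a + c))) when both groups hold 400 alleles."""
    n = 2 * group_alleles
    return Fraction(n * (a - c) ** 2, (a + c) * (n - (a + c)))


a, c = 150, 100
general = chi2_general(a, 400 - a, c, 400 - c)
simplified = chi2_simplified(a, c)
print(general, simplified)  # 160/11 160/11
```

The key step is $ad - bc = a(400 - c) - (400 - a)c = 400(a - c)$, so the factor $400^2$ cancels against $r \cdot s$. Both forms are undefined when $a + c$ is 0 or 800.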

  10. Chi-squared Test
     Evaluation
     Let $C_i$ and $C'_i$ denote the ciphertexts for the case and control groups.
     ◮ Evaluate $C_{\text{case}} := \sum_{i=1}^{200} C_i$ and $C_{\text{cont}} := \sum_{i=1}^{200} C'_i$.
     ◮ Compute $C_{\text{case}} - C_{\text{cont}}$ and $C_{\text{case}} + C_{\text{cont}}$.
     Decryption
     For the message space $\mathbb{Z}_t = [0, t)$, let
     ◮ $\text{den} := \text{Dec}(C_{\text{case}} + C_{\text{cont}}) = a + c$ $(< t)$
     ◮ $\text{num} := \text{Dec}(C_{\text{case}} - C_{\text{cont}})$, which equals $a - c$ if $a \ge c$ and $(a - c) + t$ otherwise.
     Decoding
     ◮ If $\text{num} > t/2$, then $\text{num} \leftarrow \text{num} - t$.
     ◮ The result of the chi-squared test is $\dfrac{800\,(\text{num})^2}{(\text{den})(800 - \text{den})}$.
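The decoding step amounts to mapping a residue modulo $t$ back to a signed integer. A minimal plaintext sketch follows; the modulus `t = 1009` is a hypothetical value chosen only so that $a + c < t$ and $|a - c| < t/2$ hold, and the helper names are not from the slides.

```python
from fractions import Fraction


def centered(value: int, t: int) -> int:
    """Map a residue in [0, t) to its signed representative in (-t/2, t/2]."""
    return value - t if 2 * value > t else value


def chi2_from_decryptions(den: int, num: int, t: int, total_alleles: int = 800) -> Fraction:
    """Apply the decoding rule to the two decrypted values."""
    num = centered(num, t)
    return Fraction(total_alleles * num ** 2, den * (total_alleles - den))


t = 1009              # hypothetical plaintext modulus
a, c = 100, 150       # allele counts with a < c, so a - c wraps around mod t
den = (a + c) % t     # what Dec(C_case + C_cont) would return
num = (a - c) % t     # what Dec(C_case - C_cont) would return
result = chi2_from_decryptions(den, num, t)
print(den, num, centered(num, t), result)  # 250 959 -50 160/11
```

Because the statistic only uses the square of num, the sign recovered here does not change the result, but the wrap-around by $t$ must still be undone before squaring.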

  12. Secure Comparison between Genomic Data

  13. Hamming Distance
     Two individuals have genotypes over many SNPs. For a fixed genotype,
     $d = 1$ if $(S_1 = \text{null}) \,\|\, (S_2 = \text{null}) \,\|\, (S_1.\text{alt} \ne S_2.\text{alt})$, and $d = 0$ otherwise.
     $x[j]$ := the $j$-th bit of $x$, starting with the least significant bit of $x$.
     $\oplus$: XOR gate (= addition over $\mathbb{Z}_2$), $\wedge$: AND gate (= multiplication over $\mathbb{Z}_2$).

     SVTYPE                              d
     SV_1 or SV_2 = INS/DEL              0
     SV_1 or SV_2 = null                 1
     SV_1 and SV_2 = SNP/SUB             EQU(S_1, S_2) ⊕ 1

     where $\text{EQU}(S_1, S_2) = \bigwedge_{j=1}^{\mu} (S_1[j] \oplus S_2[j] \oplus 1)$, which is 1 if $S_1 = S_2$ and 0 otherwise.
     We need the encodings to determine 'null' and 'INS/DEL'.
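The equality circuit uses only XOR and AND, so it can be written directly with Python's bit operators. This is a plaintext sketch of the formula for EQU, with bit strings represented as lists of 0/1 integers; the function names are hypothetical.

```python
from functools import reduce
from typing import List


def equ(s1: List[int], s2: List[int]) -> int:
    """AND over all j of (S1[j] XOR S2[j] XOR 1): 1 iff the bit strings are equal."""
    if len(s1) != len(s2):
        raise ValueError("both bit strings must have the same length")
    return reduce(lambda acc, pair: acc & (pair[0] ^ pair[1] ^ 1), zip(s1, s2), 1)


def snp_distance(s1: List[int], s2: List[int]) -> int:
    """d for the SNP/SUB row of the table: EQU(S1, S2) XOR 1."""
    return equ(s1, s2) ^ 1


same = snp_distance([0, 1, 1, 0], [0, 1, 1, 0])
different = snp_distance([0, 1, 1, 0], [0, 1, 0, 0])
print(same, different)  # 0 1
```

Each factor $S_1[j] \oplus S_2[j] \oplus 1$ is 1 exactly when the two bits agree, so the AND over all positions is 1 only if every bit agrees.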

  15. Hamming Distance
     Data Encoding
     ◮ Clean the two datasets using POS, then make the merged list $L$.
     ◮ For $i \in [1, \#(L)]$, define $m_i = 1$ if $\text{POS}_i \in L$ and $m_i = 0$ otherwise, and $h_i = 0$ if $\text{SV}_i = \text{INS/DEL}$ and $h_i = 1$ otherwise.
       ⇒ $(m_1 \oplus m_2) = 1$ iff ($\text{SV}_{1,i}$ = null) or ($\text{SV}_{2,i}$ = null)
       ⇒ $(h_1 \wedge h_2) = 0$ iff ($\text{SV}_{1,i}$ = INS/DEL) or ($\text{SV}_{2,i}$ = INS/DEL)
     ◮ Encode the SNP string as follows: A → 00, G → 01, C → 10, T → 11.
       ⋆ Each SNP is encoded, and the encodings are concatenated.
       ⋆ Pad '1' at the end of the string, then pad with '0's to make a 21-bit string, say $S_i$.
       ⋆ A missing genotype is encoded as the all-'0' string.
       ⋆ For example, 'GTA' is encoded as '01 || 11 || 00 || 1 || 0...0', where the final run consists of 14 zeros.
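The string encoding can be sketched as follows. The base-to-bits table and the 21-bit length come from the slide; the function name `encode_snp_string`, the use of `None` for a missing genotype, and the error on over-long inputs are assumptions made for this illustration.

```python
from typing import Optional

BASE_BITS = {"A": "00", "G": "01", "C": "10", "T": "11"}
ENCODED_LENGTH = 21


def encode_snp_string(snp: Optional[str]) -> str:
    """Encode a SNP string as 2 bits per base, a '1' terminator, then '0' padding."""
    if snp is None:  # missing genotype: all-zero string
        return "0" * ENCODED_LENGTH
    bits = "".join(BASE_BITS[base] for base in snp) + "1"
    if len(bits) > ENCODED_LENGTH:
        raise ValueError("SNP string too long for a 21-bit encoding")
    return bits.ljust(ENCODED_LENGTH, "0")


gta = encode_snp_string("GTA")
print(gta)  # 011100100000000000000
```

A plausible reason for the '1' terminator, not stated on the slide, is that it keeps strings of different lengths distinct after zero padding (for example 'A' versus 'AA') and separates every real string from the all-zero encoding of a missing genotype.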

  17. Hamming Distance
     Encryption
     ◮ Embed the data of $P_1$ (= $m_{1,i}, h_{1,i}, S_{1,i}$) and $P_2$ (= $m_{2,i}, h_{2,i}, S_{2,i}$) into the plaintext slots in a bit-by-bit manner.
     ◮ Encrypt the slots with the public key.
     [Figure: slot layout with one row each for $m_1$, $h_1$, $S_1[1]$, ..., $S_1[21]$; each row spans $\#(L)$ slots.]
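One way to read the bit-by-bit layout is as a bit-sliced packing: one plaintext vector per bit position, with slot $i$ holding the value for the $i$-th position of $L$. The sketch below builds those vectors on plaintext data only. It is an interpretation of the figure, not the authors' code; `pack_bit_sliced` is a hypothetical name, and the order in which the 21 bits are laid out is an arbitrary choice here.

```python
from typing import List, Tuple

Record = Tuple[int, int, str]  # (m_i, h_i, 21-character bit string S_i)


def pack_bit_sliced(records: List[Record], n_bits: int = 21) -> List[List[int]]:
    """Return n_bits + 2 rows: the m bits, the h bits, then one row per bit of S."""
    m_row = [m for m, _, _ in records]
    h_row = [h for _, h, _ in records]
    s_rows = [[int(s[j]) for _, _, s in records] for j in range(n_bits)]
    return [m_row, h_row] + s_rows


# Toy data for one person over a merged list L with 3 positions.
records = [
    (1, 1, "0111001" + "0" * 14),  # 'GTA'
    (0, 1, "0" * 21),              # position missing for this person
    (1, 0, "001" + "0" * 18),      # 'A', flagged as INS/DEL via h = 0
]
rows = pack_bit_sliced(records)
print(len(rows), len(rows[0]))  # 23 3
```

In this reading, each of the 23 rows would be encrypted as its own ciphertext, so that a gate applied to two ciphertexts acts on all positions of $L$ at once.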

  18. Hamming Distance
     Evaluation
     ◮ Evaluate the following binary circuit over encrypted data:
       $$(h_{1,i} \wedge h_{2,i}) \wedge \Bigl( (m_{1,i} \oplus m_{2,i}) \oplus \bigl( (m_{1,i} \oplus m_{2,i} \oplus 1) \wedge (\text{EQU}(S_{1,i}, S_{2,i}) \oplus 1) \bigr) \Bigr)$$
     ◮ Take $m = 8191$ so that we can embed 630 messages into one ciphertext and perform the operations simultaneously for all the messages.
     Decryption
     ◮ Decrypt the evaluated value and let $\ell_i$ be the value in the $i$-th slot.
     Decoding
     ◮ Note that $\ell_i$ is the Hamming distance result of the $i$-th genotype.
     ◮ Compute $\sum_{i=1}^{\#(L)} \ell_i$.
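The circuit and the slot count can both be checked on plaintext bits. The sketch below makes two assumptions that the slide does not state: the parenthesization of the circuit is the one that reproduces the SVTYPE table, and $m = 8191$ is the index of the cyclotomic ring in a BGV-style scheme with plaintext modulus 2, for which the number of slots is $\varphi(m)$ divided by the multiplicative order of 2 modulo $m$. The function names are hypothetical.

```python
from math import gcd


def hamming_bit(h1: int, h2: int, m1: int, m2: int, equ: int) -> int:
    """d = (h1 & h2) & ((m1 ^ m2) ^ ((m1 ^ m2 ^ 1) & (equ ^ 1))) on single bits."""
    one_missing = m1 ^ m2
    return (h1 & h2) & (one_missing ^ ((one_missing ^ 1) & (equ ^ 1)))


def slot_count(m: int, p: int = 2) -> int:
    """phi(m) / ord_m(p), assuming gcd(p, m) = 1."""
    phi = sum(1 for x in range(1, m) if gcd(x, m) == 1)
    order, power = 1, p % m
    while power != 1:
        power = (power * p) % m
        order += 1
    return phi // order


print(hamming_bit(0, 1, 1, 1, 0))  # INS/DEL on one side      -> 0
print(hamming_bit(1, 1, 1, 0, 0))  # one side is null          -> 1
print(hamming_bit(1, 1, 1, 1, 1))  # both present, S_1 = S_2   -> 0
print(hamming_bit(1, 1, 1, 1, 0))  # both present, S_1 != S_2  -> 1
print(slot_count(8191))            # 630
```

Since $8191 = 2^{13} - 1$ is prime, $\varphi(8191) = 8190$ and the order of 2 is 13, which gives $8190 / 13 = 630$ slots and agrees with the figure quoted on the slide. Here $m$ denotes the ring parameter and is unrelated to the presence bits $m_{1,i}$ and $m_{2,i}$.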
