Homomorphic Encryption Based Secure Genome Data Analysis
Miran Kim⋆ and Kristin Lauter†
⋆Seoul National University †Microsoft Research
iDASH Privacy&Security Workshop, March 16, 2015
1 / 16
Secure Outsourcing GWAS
2 / 16
◮ If the pair consists of different SNPs (AT), then encode it as ‘1’.
◮ The first pair with the same SNP (AA) is encoded as ‘0’.
◮ Then the other one (TT) is encoded as ‘2’.
[Table: genotype columns Gi (G1, G2, G3, …, G311) with example entries AT and TT]
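The encoding rule on this slide can be sketched in plain Python. This is only an illustration: the function name `encode_genotype` and the `zero_allele` parameter (the allele whose homozygous pair maps to 0) are hypothetical and do not come from the slides.

```python
def encode_genotype(pair: str, zero_allele: str) -> int:
    """Encode a two-letter genotype as 0, 1, or 2.

    zero_allele is a hypothetical parameter: the allele whose homozygous
    pair is mapped to 0 ('A' in the slide's example).
    """
    first, second = pair
    if first != second:
        return 1                                # different SNPs, e.g. AT
    return 0 if first == zero_allele else 2     # AA -> 0, TT -> 2


codes = [encode_genotype(g, "A") for g in ["AT", "AA", "TT", "TA"]]
print(codes)  # [1, 0, 2, 1]
```

With this encoding, the value stored for one individual equals the number of T alleles in the pair, which is what makes a plain sum over individuals meaningful.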
3 / 16
[Figure: the slots G1, G2, …, G311 of $\sum_{i=1}^{200} C_i$ hold the counts #(T)]
(We can perform the aggregate operations simultaneously for all the genotypes.)
◮ Decrypt the ciphertext “$\sum_{i=1}^{200} C_i$” with the secret key.
◮ Let ℓi be the value in the i’th slot.
◮ For 1 ≤ i ≤ 311, if ℓi > 200, then ℓi ← (400 − ℓi).
◮ The minor allele frequency of the genotype Gi is $\ell_i / 400$.
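The post-decryption steps can be checked on unencrypted data. The sketch below simulates the slot-wise sum in the clear, so no homomorphic encryption library is involved; the function name and the randomly generated input are assumptions made for illustration.

```python
import random

NUM_PEOPLE = 200      # one ciphertext C_i per individual
NUM_GENOTYPES = 311   # one slot per genotype G_1..G_311


def minor_allele_frequencies(encoded):
    """encoded[i][j] in {0, 1, 2} is genotype j of individual i."""
    n_alleles = 2 * len(encoded)                  # 400 alleles for 200 people
    # Slot-wise sum: the plaintext analogue of adding the ciphertexts.
    counts = [sum(column) for column in zip(*encoded)]
    # Fold counts above one half so that the minor allele is the one counted.
    folded = [n_alleles - c if c > n_alleles // 2 else c for c in counts]
    return [c / n_alleles for c in folded]


random.seed(0)  # synthetic data, not real genotypes
data = [[random.choice([0, 1, 2]) for _ in range(NUM_GENOTYPES)]
        for _ in range(NUM_PEOPLE)]
maf = minor_allele_frequencies(data)
```

Because every slot is summed at once, a single pass over the 200 rows yields all 311 frequencies, mirroring the remark that the aggregate operations run simultaneously for all genotypes.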
4 / 16
◮ For each genotype, encode the given SNPs of the case group and the control group.
5 / 16
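◮ Let $C_i$ and $C'_i$ be the ciphertexts for the case and control groups.
◮ Evaluate $\sum_{i=1}^{200} C_i$ (let $C_{case}$) and $\sum_{i=1}^{200} C'_i$ (let $C_{cont}$).
◮ Compute “$C_{case} - C_{cont}$” and “$C_{case} + C_{cont}$”.
◮ den := the slot value of the decrypted $C_{case} + C_{cont}$.
◮ num := the slot value of the decrypted $C_{case} - C_{cont}$.
◮ If $num > t/2$, then $num \leftarrow (num - t)$.
◮ The result of the chi-squared test is $\dfrac{800\,(num)^2}{(den)(800 - den)}$.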
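The formula can be sanity-checked in the clear. The sketch below assumes that t is the plaintext modulus (the slides do not define it), that num and den are the decrypted difference and sum of the two allele counts, and it uses a hypothetical value t = 1009. The result is compared against the usual 2x2 contingency-table statistic.

```python
def chi_squared(case_count, control_count, t, total_alleles=800):
    """Chi-squared statistic from slot values reduced mod t (t is assumed
    to be the plaintext modulus; this is not stated in the slides)."""
    num = (case_count - control_count) % t   # value as seen after decryption
    den = (case_count + control_count) % t
    if 2 * num > t:                          # same test as num > t/2
        num -= t                             # recover a negative difference
    return total_alleles * num ** 2 / (den * (total_alleles - den))


def chi_squared_table(a, c, per_group=400):
    """Reference value from the 2x2 table [[a, 400-a], [c, 400-c]]."""
    b, d = per_group - a, per_group - c
    n = 2 * per_group
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


T = 1009  # hypothetical plaintext modulus, chosen larger than 800
stat = chi_squared(120, 150, T)
print(round(stat, 2))  # 5.03
```

The centring step matters because a negative difference such as 120 − 150 decrypts to a residue close to t. The statistic is undefined when den is 0 or 800, which the sketch does not guard against.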
6 / 16
7 / 16
d =
if (S1 = null) || (S2 = null) || (S1.alt = S2.alt)
SVTYPE | d
SV1 or SV2 = INS/DEL |
SV1 or SV2 = null | 1
SV1 and SV2 = SNP/SUB | EQU(S1, S2) ⊕ 1
where $\mathrm{EQU}(S_1, S_2) = \bigwedge_{j \ge 1} (S_1[j] \oplus S_2[j] \oplus 1)$
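The equality test EQU can be illustrated on plain bit lists. The sketch assumes that the terms (S1[j] ⊕ S2[j] ⊕ 1) are combined with a bitwise AND over all positions j; the helper names `equ` and `d_snp` are hypothetical.

```python
from functools import reduce


def equ(s1, s2):
    """Return 1 if two equal-length bit lists agree everywhere, else 0."""
    if len(s1) != len(s2):
        raise ValueError("bit strings must have the same length")
    # Each term S1[j] XOR S2[j] XOR 1 is 1 exactly when the two bits match;
    # the AND over all positions is 1 only if every position matches.
    return reduce(lambda acc, pair: acc & (pair[0] ^ pair[1] ^ 1),
                  zip(s1, s2), 1)


def d_snp(s1, s2):
    """Distance contribution for the SNP/SUB case: EQU(S1, S2) XOR 1."""
    return equ(s1, s2) ^ 1


print(d_snp([0, 1, 1], [0, 1, 1]), d_snp([0, 1, 1], [0, 0, 1]))  # 0 1
```

Only XOR and AND appear, which is why the same expression can be evaluated as a binary circuit over encrypted bits.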
8 / 16
◮ Clean the two datasets using POS, then make the merged list L.
◮ For i ∈ [1, #(L)],
◮ Encode the SNP string as follows:
⋆ Each SNP is encoded, and the encodings are concatenated.
⋆ Pad ‘1’ at the end of the string, then pad with ‘0’ to make a 21-bit string, say Si.
⋆ A missing genotype is encoded as the all-‘0’ string.
⋆ For example, ‘GTA’ is encoded as ‘01||11||00||1||0…0’, where the final run consists of 14 zeros.
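A minimal sketch of this string encoding follows. The 2-bit codes for G, T, and A are read off the ‘GTA’ example; the code for C is an assumption (the only 2-bit pattern left over), and the function name is hypothetical.

```python
# Codes for G, T, A come from the 'GTA' example on the slide.
# The code for C is an assumption: it is the only remaining 2-bit pattern.
NUCLEOTIDE_BITS = {"A": "00", "G": "01", "C": "10", "T": "11"}
WIDTH = 21


def encode_snp_string(snp):
    """Encode a nucleotide string as a 21-character bit string."""
    if not snp:                               # missing genotype
        return "0" * WIDTH
    bits = "".join(NUCLEOTIDE_BITS[ch] for ch in snp) + "1"  # end marker
    if len(bits) > WIDTH:
        raise ValueError("SNP string does not fit into 21 bits")
    return bits.ljust(WIDTH, "0")             # zero padding on the right


print(encode_snp_string("GTA"))  # 0111001 followed by 14 zeros
```

The trailing ‘1’ marks the end of the string, so strings of different lengths never collide after zero padding, and the all-zero string stays reserved for a missing genotype. With 21 bits, the longest string that fits has 10 nucleotides.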
9 / 16
◮ Embed the data of P1(= m1,i, h1,i, S1,i) and P2 (= m2,i, h2,i, S2,i) into
◮ Encrypt the slots with the public key.
10 / 16
◮ Evaluate the following binary circuit over encrypted data:
◮ Decrypt the evaluated value and let ℓi be the value in the i’th slot.
◮ Note that ℓi is the Hamming distance result of the i’th genotype.
◮ Compute $\sum_{i=1}^{\#(L)} \ell_i$.
11 / 16
SVTYPE | d
(SV1 = INS, SV2 = INS)||(SV1 = INS, SV2 = INS) | max(n1, n2)
Otherwise | max(n1, n2) ∧ (EQU(S1.alt, S2.alt) ⊕ 1)
◮ We don’t need the reference comparison anymore.
◮ We need an encoding which determines whether the genotype is INS or
12 / 16
◮ Clean the two datasets using POS, then make the merged list L.
◮ For i ∈ [1, #(L)], define ei =
◮ Encode the SNP string as Si. (A missing genotype is encoded as ‘0’.)
◮ Encode the length of the SNP string, say ni.
◮ Embed the data of P1 (= e1,i, S1,i, n1,i) and P2 (= e2,i, S2,i, n2,i) into
◮ Encrypt the slots with the public key.
13 / 16
◮ C(x, y) =
⋆ c1 = (1 ⊕ x[1]) ∧ y[1],
⋆ cj =
◮ max(x, y)[j] =
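The definitions of cj and max(x, y)[j] are cut off on this slide, so the sketch below does not reproduce them. It shows one standard way to build a comparison circuit and a bitwise maximum from XOR and AND only, starting from the same c1 as above and assuming that bit 1 is the least significant bit. The recurrence and all function names are assumptions.

```python
def less_than(x, y):
    """x, y: equal-length bit lists, least significant bit first.
    Returns 1 if x < y as integers, else 0."""
    c = 0
    for xb, yb in zip(x, y):
        # (1^xb)&yb is 1 when this bit decides x < y; (1^xb^yb) is 1 when
        # the bits are equal, in which case the lower bits decide. The two
        # cases exclude each other, so XOR acts as OR.
        c = ((1 ^ xb) & yb) ^ ((1 ^ xb ^ yb) & c)
    return c


def bit_max(x, y):
    """Bitwise selection of the larger operand: x if c = 0, y if c = 1."""
    c = less_than(x, y)
    return [xb ^ (c & (xb ^ yb)) for xb, yb in zip(x, y)]


def to_bits(n, width):
    return [(n >> j) & 1 for j in range(width)]


def from_bits(bits):
    return sum(b << j for j, b in enumerate(bits))


print(from_bits(bit_max(to_bits(5, 4), to_bits(9, 4))))  # 9
```

After the first loop iteration c equals (1 ⊕ x[1]) ∧ y[1], which matches the c1 given on the slide.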
14 / 16
◮ Decrypt the evaluated values, and let ℓi,j be the i’th value in the j’th slot.
◮ $\ell_i := \sum_{j \ge 1} \ell_{i,j} \cdot 2^{j-1}$ is the edit distance result of the i’th genotype.
◮ Compute $\sum_{i=1}^{\#(L)} \ell_i$.
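The recombination of decrypted bits into integers can be sketched as follows. The input layout (one list of bits per genotype, least significant bit first) and the function names are assumptions made for illustration.

```python
def recompose(bits):
    """bits[j-1] holds the decrypted bit l_{i,j}; the weight of bit j
    is 2^(j-1), so j = 1 is the least significant bit."""
    return sum(bit << j for j, bit in enumerate(bits))


def total_edit_distance(per_genotype_bits):
    """Sum the recomposed per-genotype distances over the merged list L."""
    return sum(recompose(bits) for bits in per_genotype_bits)


decrypted = [[1, 0, 1], [0, 1, 0], [0, 0, 0]]   # toy values: 5, 2, 0
print(total_edit_distance(decrypted))  # 7
```

The per-genotype distances come out as separate bits because the circuit is evaluated over binary plaintexts; the integer sum is then taken in the clear after decryption.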
15 / 16
16 / 16