Providing Input-Discriminative Protection for Local Differential Privacy
Xiaolan Gu*, Ming Li*, Li Xiong# and Yang Cao†
*University of Arizona #Emory University †Kyoto University
IEEE International Conference on Data Engineering (ICDE), April 2020
(Figure: the LDP model. Each user applies a randomized mechanism M to the raw data and uploads the perturbed data to an untrusted server for analysis.)
An adversary cannot infer whether the input is x or x′ with high confidence (controlled by ϵ). [Duchi et al., FOCS '13]
Google: enabling developers and organizations to use differential privacy
Source: https://developers.googleblog.com/2019/09/enabling-developers-and-organizations.html
Apple: discovering popular Emojis under LDP
Source: https://machinelearning.apple.com/2017/12/06/learning-with-privacy-at-scale.html
Scenario                | High sensitiveness | Low sensitiveness
Website-click records   | Politics-related   | Facebook and Amazon
Medical records         | HIV and cancer     | Anemia and headache
r(⋅, ⋅) is a function of two privacy budgets.
Intuition: for any pair of inputs x, x′, MinID-LDP guarantees that the adversary's capability of distinguishing them does not exceed the bound controlled by both ϵx and ϵx′ (thus achieving differentiated privacy protection for each pair).
MinID-LDP satisfies sequential composition like LDP, which guarantees the overall privacy for a sequence of mechanisms.
ϵx is the privacy budget of input x. The factor 2 is due to the symmetric property.
[Murakami and Kawamoto, USENIX Security '19]
Privacy budget of a pair of inputs in several related notions:
- PLDP (user-discriminative): ϵv is the privacy budget of a user v for all pairs of inputs (different users may have different ϵv).
- GI or CLDP (distance-discriminative): ϵ·d is the privacy budget for a pair of inputs x, x′, where d is the distance between x and x′.
- ID-LDP (input-discriminative): ϵx is the privacy budget of input x; r(ϵx, ϵx′) is the privacy budget of a pair.
  MinID-LDP: r(ϵx, ϵx′) = min(ϵx, ϵx′).
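As a quick illustration of the MinID-LDP pairwise bound, a minimal sketch (the budget values and function name below are made up for illustration):

```python
import itertools

def minid_pairwise_bounds(budgets):
    """r(eps_i, eps_j) = min(eps_i, eps_j): the bound for each pair of distinct inputs."""
    return {(i, j): min(budgets[i], budgets[j])
            for i, j in itertools.permutations(range(len(budgets)), 2)}

# Hypothetical budgets: input 0 (e.g. HIV) is the most sensitive.
bounds = minid_pairwise_bounds([0.5, 1.0, 2.0])
print(bounds[(0, 1)], bounds[(1, 2)])  # 0.5 1.0
```

Every pair involving the most sensitive input is protected at that input's tighter budget, which is how the notion discriminates between inputs.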
ULDP does not guarantee indistinguishability between sensitive and non-sensitive inputs when observing some outputs; thus ULDP does not guarantee LDP.
(Figure: in ULDP, sensitive inputs 𝒴S map to protected outputs 𝒵P under budget ϵ, while non-sensitive inputs 𝒴N may map to invertible outputs 𝒵I.)
The optimization problem for a general ID-LDP mechanism can be very large (especially for a large domain or item-set data). Our protocol satisfying ID-LDP is based on this optimization.
Example: assume domain size m; then there are m² variables and m³ constraints.
ID-LDP protocols perturb inputs with different probabilities.
Randomized Response (RR): given true bit x, report y = x w.p. p and y = 1 − x w.p. 1 − p.
To satisfy ϵ-LDP: p = e^ϵ / (e^ϵ + 1) (since p / (1 − p) = e^ϵ).
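A minimal sketch of binary randomized response with p = e^ϵ / (e^ϵ + 1); the function name is illustrative, not from the paper:

```python
import math
import random

def rr_perturb(x: int, eps: float, rng=random) -> int:
    """Binary randomized response: keep the true bit w.p. p = e^eps / (e^eps + 1)."""
    p = math.exp(eps) / (math.exp(eps) + 1)
    return x if rng.random() < p else 1 - x

eps = 1.0
p = math.exp(eps) / (math.exp(eps) + 1)
print(round(p, 3))            # 0.731
print(round(p / (1 - p), 3))  # 2.718, i.e. e^eps, so the eps-LDP bound is tight
```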
Frequency estimation: let f* be the true frequency and f the frequency of response y = 1.
Unbiasedness: E[f] = f*·p + (1 − f*)(1 − p) = (2p − 1)f* + (1 − p), so f̂* = (f − (1 − p)) / (2p − 1) is an unbiased estimate.
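The unbiasedness identity E[f] = (2p − 1)f* + (1 − p) can be inverted directly; a small sketch (illustrative names, using an exact-expectation sanity check rather than sampling):

```python
import math

def rr_unbiased_estimate(f_observed: float, eps: float) -> float:
    """Invert E[f] = (2p - 1) f* + (1 - p) to get an unbiased estimate of f*."""
    p = math.exp(eps) / (math.exp(eps) + 1)
    return (f_observed - (1 - p)) / (2 * p - 1)

# Exact-expectation sanity check: with true frequency 0.3, the expected
# observed frequency is (2p - 1)*0.3 + (1 - p); inverting recovers 0.3.
eps, f_true = 1.0, 0.3
p = math.exp(eps) / (math.exp(eps) + 1)
f_expected = (2 * p - 1) * f_true + (1 - p)
print(round(rr_unbiased_estimate(f_expected, eps), 10))  # 0.3
```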
Advanced versions: Unary Encoding, Generalized RR, …
Encode input x into a one-hot vector of length m (the domain size): x[k] = 1 at the input's position and 0 elsewhere; then perturb each bit independently.
RAPPOR [Erlingsson et al., CCS '14]: report y[k] = x[k] w.p. p and y[k] = 1 − x[k] w.p. 1 − p.
OUE [Wang et al., USENIX Security '17]: if x[k] = 1, report y[k] = 1 w.p. 0.5; if x[k] = 0, report y[k] = 1 w.p. q.
To satisfy ϵ-LDP: p = e^(ϵ/2) / (e^(ϵ/2) + 1) and q = 1 / (e^ϵ + 1).
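A sketch of OUE's per-bit perturbation of a one-hot vector, assuming the probabilities above (names and parameters are illustrative):

```python
import math
import random

def oue_perturb(x_index: int, m: int, eps: float, rng=random):
    """OUE: a 1-bit stays 1 w.p. 1/2; a 0-bit flips to 1 w.p. q = 1/(e^eps + 1)."""
    q = 1 / (math.exp(eps) + 1)
    return [
        (1 if rng.random() < 0.5 else 0) if k == x_index
        else (1 if rng.random() < q else 0)
        for k in range(m)
    ]

# Encode item 2 of a domain of size 5, then perturb.
y = oue_perturb(x_index=2, m=5, eps=1.0, rng=random.Random(0))
print(len(y), all(bit in (0, 1) for bit in y))  # 5 True
```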
By minimizing the approximate MSE of frequency estimation, we obtain the perturbation probabilities; the resulting problem has only 2m variables and m² constraints (symmetry can further reduce the problem complexity).
Recall the two challenges: 1) high complexity of the optimization problem; 2) the MSE depends on the unknown true frequencies.
IDUE: if x[k] = 1, report y[k] = 1 w.p. ak; if x[k] = 0, report y[k] = 1 w.p. bk.
Privacy constraints: ai(1 − bj) / (bi(1 − aj)) ⩽ e^r(ϵi, ϵj) (∀i, j)
Estimator: ĉi = (Σu yu[i] − n·bi) / (ai − bi)
This has 2m variables and m² constraints.
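The estimator and the pairwise privacy constraint can be sketched as follows (the probabilities, budgets, and reports below are made up for illustration, not optimized):

```python
import math

def idue_estimate(reports, a, b):
    """Per-item estimator: c_hat_i = (sum_u y_u[i] - n*b_i) / (a_i - b_i)."""
    n, m = len(reports), len(a)
    return [(sum(y[i] for y in reports) - n * b[i]) / (a[i] - b[i]) for i in range(m)]

def satisfies_id_ldp(a, b, pair_bound):
    """Check a_i(1 - b_j) / (b_i(1 - a_j)) <= e^{r(eps_i, eps_j)} for all i, j."""
    m = len(a)
    return all(a[i] * (1 - b[j]) <= math.exp(pair_bound(i, j)) * b[i] * (1 - a[j])
               for i in range(m) for j in range(m))

# Hypothetical setup: m = 2 items, MinID-LDP bound r = min(eps_i, eps_j).
a, b, eps = [0.6, 0.62], [0.4, 0.38], [1.0, 2.0]
print(satisfies_id_ldp(a, b, lambda i, j: min(eps[i], eps[j])))  # True

reports = [[1, 0], [0, 1], [1, 0]]  # toy perturbed one-hot reports
est = idue_estimate(reports, a, b)
print(round(est[0], 6))  # 4.0
```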
MSE_i = Var[ĉi] = n·bi(1 − bi) / (ai − bi)² + c*i(1 − ai − bi) / (ai − bi)
where n is the number of users; ai, bi are the perturbation probabilities; c*i is the true frequency; ĉi is the estimated frequency.
The second term depends on the true frequencies.
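The closed form above can be cross-checked against the direct binomial variance of the estimator; a numerical sanity check (all parameters made up):

```python
def mse_closed_form(n, a, b, c_star):
    """MSE_i = n*b*(1 - b)/(a - b)**2 + c_star*(1 - a - b)/(a - b)."""
    return n * b * (1 - b) / (a - b) ** 2 + c_star * (1 - a - b) / (a - b)

def mse_from_binomials(n, a, b, c_star):
    """Direct variance of c_hat_i: c_star reports are Bernoulli(a), the other
    n - c_star are Bernoulli(b); dividing the sum by (a - b) scales the variance."""
    return (c_star * a * (1 - a) + (n - c_star) * b * (1 - b)) / (a - b) ** 2

n, a, b, c_star = 1000, 0.6, 0.38, 200  # made-up parameters
print(abs(mse_closed_form(n, a, b, c_star) - mse_from_binomials(n, a, b, c_star)) < 1e-6)
# True
```

The two forms agree because a(1 − a) − b(1 − b) = (a − b)(1 − a − b), which turns the frequency-dependent part into the second term of the closed form.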
Example: a health organization runs a survey asking participants to return a response perturbed from the categories {HIV, anemia, headache, stomachache, toothache}, where HIV (i = 1) is more sensitive; thus we set different privacy budgets, giving HIV a smaller budget than the other categories.
The total variance of IDUE lies in a range because it depends on the distribution of the true input data; even its upper bound is still less than the variance of RAPPOR and OUE.
More perturbation noise for i = 1; less perturbation noise for i ≠ 1.
(Figure: comparison of empirical (dashed lines) and theoretical (solid lines) results on synthetic data with single-item inputs; series: RAPPOR, OUE, IDUE-opt0, IDUE-opt1, IDUE-opt2; more private vs. more accurate.)
Empirical results are very close to the theoretical results, and IDUE has smaller MSE than RAPPOR and OUE.
We compare the frequency estimation results of our mechanisms (IDUE and IDUE-PS) with RAPPOR and OUE using two synthetic datasets and three real-world datasets.
(Figure: MSE and RE results on real-world datasets; series: RAPPOR, OUE, IDUE.)
IDUE has the smallest MSE and RE (relative error), better than RAPPOR and OUE.
(Figure: MSE and RE vs. the portion of sensitive inputs (2%–80%); series: MSE and RE for RAPPOR, OUE, IDUE.)
If only a small portion of inputs are more sensitive (i.e., have the smallest privacy budget), then IDUE has smaller estimation error; otherwise, IDUE has similar performance to OUE.
RE = (1/|S|) Σ_{i∈S} |ĉi − c*i| / c*i
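The RE definition can be computed directly; a small sketch with made-up counts and a hypothetical sensitive set S:

```python
def relative_error(c_hat, c_star, sensitive_set):
    """RE = (1/|S|) * sum_{i in S} |c_hat_i - c_star_i| / c_star_i."""
    return sum(abs(c_hat[i] - c_star[i]) / c_star[i]
               for i in sensitive_set) / len(sensitive_set)

c_star = [100, 200, 400]  # made-up true counts
c_hat = [110, 190, 440]   # made-up estimates
print(round(relative_error(c_hat, c_star, sensitive_set=[0, 1]), 6))  # 0.075
```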
(Figure: varying ℓ and ϵ on the Kosarak item-set data.)
The optimal ℓ (the parameter of the Padding-and-Sampling protocol) depends on both the data distribution and the privacy budget (the original paper only mentioned data dependence). We leave this as future work.
Future work: