Differential Privacy (Part III)
Approximate (or (ε,δ))-differential privacy
- Generalized definition of differential privacy allowing for a
(supposedly small) additive factor
- Used in a variety of applications
A query mechanism M is (ε, δ)-differentially private if, for any two adjacent databases D and D′ (differing in just one entry) and any C ⊆ range(M): Pr(M(D) ∈ C) ≤ e^ε · Pr(M(D′) ∈ C) + δ
The Gaussian mechanism
The ℓ₂-sensitivity of f : ℕ^|X| → ℝ^k is defined as ∆₂(f) = max ||f(x) − f(y)||₂ over all x, y ∈ ℕ^|X| with ||x − y||₁ = 1
For c² > 2 ln(1.25/δ), the Gaussian mechanism with parameter σ ≥ c·∆₂(f)/ε is (ε, δ)-differentially private
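As a concrete illustration, here is a minimal Python sketch of the Gaussian mechanism with σ calibrated as in the theorem above (the function name and the example query value are illustrative):

```python
import math
import random

def gaussian_mechanism(true_answer, l2_sensitivity, eps, delta):
    """Add N(0, sigma^2) noise with sigma = c * sensitivity / eps,
    using (just above) the smallest c with c^2 > 2 ln(1.25/delta)."""
    c = math.sqrt(2 * math.log(1.25 / delta)) + 1e-9
    sigma = c * l2_sensitivity / eps
    return true_answer + random.gauss(0.0, sigma)

# e.g., a counting query has l2-sensitivity 1
noisy = gaussian_mechanism(true_answer=42.0, l2_sensitivity=1.0,
                           eps=0.5, delta=1e-5)
```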
Sparse Vector Technique
✦ [Hardt-Rothblum, FOCS’10] study the problem of k
adaptively chosen, low-sensitivity queries where
- only a very small number of these queries (say c)
take values above a certain threshold T
- the data analyst is only interested in such queries
- useful to learn correlations, e.g., whether there is
a dependency between smoking and cancer
✦ The data analyst could ask only the significant
queries, but she does not know them in advance!
✦ Goal: answer only the significant queries, pay only
for them, and ignore the others
Histograms and linear queries
✦ A histogram x ∈ ℝN represents a database (or a distribution)
- over a universe U of size |U| = N
- Databases have support of size n, whereas distributions do
not necessarily have a small support
✦ We assume x is normalized so that Σ_{i∈U} xᵢ = 1
✦ Here we focus on linear queries f : ℝ^N → [0, 1]
- can be seen as the inner product ⟨x, f⟩ for f ∈ [0, 1]^N
- counting queries (i.e., how many elements in the database
fulfill a certain predicate) are a special case
✦ Example: U = {1,2,3}, D = [1,2,2,3,1]
- x = (2,2,1), after normalization (2/5, 2/5, 1/5)
- “how many entries ≤ 2” ⇒ f = (1,1,0)
✦ By normalization, linear queries have sensitivity 1/n
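The example above can be reproduced in a few lines of Python (helper names are illustrative):

```python
from collections import Counter

def normalized_histogram(db, universe):
    """Histogram x over the universe, normalized so its entries sum to 1."""
    counts = Counter(db)
    n = len(db)
    return [counts[u] / n for u in universe]

def linear_query(x, f):
    """A linear query is the inner product <x, f> with f in [0,1]^N."""
    return sum(xi * fi for xi, fi in zip(x, f))

universe = [1, 2, 3]
x = normalized_histogram([1, 2, 2, 3, 1], universe)  # (2/5, 2/5, 1/5)
frac = linear_query(x, [1, 1, 0])  # fraction of entries <= 2
```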
SVT: algorithm
✦ Intuition: answer only those queries whose sanitized
result is above the sanitized threshold
We pay only for c queries. We need to sanitize the threshold, otherwise the conditional branch would leak information
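A plaintext sketch of the Sparse idea, written as an AboveThreshold-style loop; the noise scales 2c/ε and 4c/ε follow the standard textbook presentation and are illustrative, not necessarily the slides' exact calibration:

```python
import random

def lap(scale):
    # Laplace noise as the difference of two exponential draws
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def sparse(queries, db, T, c, eps):
    """Answer only queries whose noisy value exceeds a noisy threshold;
    stop after c above-threshold ("significant") answers."""
    answers = []
    t_hat = T + lap(2 * c / eps)          # sanitized threshold
    remaining = c
    for q in queries:
        if remaining == 0:
            break                          # budget for significant queries spent
        val = q(db)
        if val + lap(4 * c / eps) >= t_hat:
            answers.append(val + lap(4 * c / eps))  # report noisy answer
            remaining -= 1
            t_hat = T + lap(2 * c / eps)   # refresh threshold after a hit
        else:
            answers.append(None)           # "below threshold" (⊥)
    return answers
```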
SVT: accuracy
- α captures the distance between the sanitized result
and the real result
- β captures the error probability
We say Sparse is (α, β)-accurate for a sequence of k queries Q1, . . . , Qk, if except with probability at most β, the algorithm does not abort before Qk, and for all ai ∈ R: |ai − Qi(D)| ≤ α and for all ai =⊥: Qi(D) ≤ T + α
SVT: accuracy theorem
- The larger β, the smaller α
- The accuracy loss is logarithmic in the number of
queries
For any sequence of k queries Q₁, . . . , Q_k such that L(T) = |{i : Qᵢ(D) ≥ T − α}| ≤ c, Sparse(D, {Qᵢ}, T, c) is (α, β)-accurate for:
α = 4c (log k + log(2/β)) / (εn)
SVT: privacy theorem
- So, what did we prove in the end?
- You can estimate the actual answers and report only
those in this range:
- We can fish out insignificant queries almost “for free”,
paying only logarithmically for them in terms of accuracy
The Sparse vector algorithm is ✏-differentially private
[Figure: the reporting range above the sanitized threshold, from T to T + α and on to ∞]
SVT: approximate differential privacy
✦ Setting the noise parameter σ = √(32 c ln(1/δ)) / (εn), we get the following theorems:
The Sparse vector algorithm is (✏, )-differentially private
For any sequence of k queries Q₁, . . . , Q_k such that L(T) = |{i : Qᵢ(D) ≥ T − α}| ≤ c, Sparse(D, {Qᵢ}, T, c) is (α, β)-accurate for:
α = √(128 c ln(1/δ)) (log k + log(2/β)) / (εn)
Limitations
✦ Differential privacy is a general-purpose privacy
definition, originally conceived for databases and later applied to a variety of different settings
✦ At the moment, it is considered the state of the art
✦ Still, it is not the holy grail and it is not immune from
concerns, criticisms, and limitations
✦ Typically accompanied by some over-claims
No free lunch in data privacy
✦ Privacy and utility cannot be provided without
making assumptions about how data are generated (no free lunch theorem)
✦ Privacy means hiding the evidence of participation of
an individual in the data generating process
✦ If database rows are not independent, this is
different from removing one row
- Bob’s participation in a social network may cause
new edges between pairs of his friends
✦ If there is group structure, differential privacy may
not work very well...
No free lunch in data privacy (cont’d)
✦ This work disputes three popular over-claims
✦ “DP requires no assumptions on the data”
- database rows must actually be independent,
- otherwise removing one row does not suffice to
remove the individual’s participation
✦ If rows are not independent, deciding how many
entries should be removed and which ones is far from being easy...
No free lunch in data privacy (cont’d)
✦ The attacker knows all entries of
the database except for one, so “the more an attacker knows, the greater the privacy risks”
✦ Thus we should protect against the
strongest attacker
✦ Careful! In DP, the more the
attacker knows, the less noise we actually add
- intuitively, this is due to the fact
that we have less to hide
No free lunch in data privacy (cont’d)
✦ “DP is robust to arbitrary background knowledge”
✦ Actually, DP is robust when certain subsets of the
tuples are known to the attacker
✦ Other types of background knowledge may instead
be harmful
- e.g., previous exact query answers
✦ DP composes well with itself, but not necessarily with
- other privacy definitions or release mechanisms
✦ One can get a new, more generic, DP privacy
guarantee if, after releasing exact query answers, a set of tuples (not just one), called neighbours, is altered in a way that is still consistent with previously answered queries (plausible deniability)
Geo-indistinguishability
- Goal: protect user’s exact location, while allowing
approximate information (typically needed to obtain a certain desired service) to be released
- Idea: protect the user’s location within a radius r with
a level of privacy that depends on r
- corresponds to a generalized version of the well-
known concept of differential privacy.
Pictorially…
- Achieve l-privacy within r
- the provider cannot easily infer the user’s location
within, say, the 7th arrondissement of Paris
- the provider can infer with high probability that the user
is located in Paris instead of, say, London
More formally…
- A mechanism K satisfies ε-geo-indistinguishability if, for any two locations x, x′ and any set Z of reported locations: Pr(K(x) ∈ Z) ≤ e^{ε·d(x,x′)} · Pr(K(x′) ∈ Z)
- Here K(x) denotes the distribution (of locations)
generated by the mechanism K applied to location x
- Achieved through a variant of the Laplace mechanism
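One standard way to realize this Laplace variant is the planar Laplace mechanism: the density is proportional to e^{−εr} in the distance r, so the angle is uniform and the radius has density ∝ r·e^{−εr}, i.e., Gamma(shape 2, scale 1/ε). A minimal sketch, treating coordinates as planar (an approximation for latitude/longitude):

```python
import math
import random

def planar_laplace(x, y, eps):
    """Report a noisy location: uniform angle, radius ~ Gamma(2, 1/eps)
    (equivalently, the sum of two Exp(eps) draws)."""
    theta = random.uniform(0, 2 * math.pi)
    r = random.gammavariate(2, 1 / eps)
    return x + r * math.cos(theta), y + r * math.sin(theta)

# e.g., privacy level eps = 0.1 per unit of distance
noisy_x, noisy_y = planar_laplace(48.8566, 2.3522, eps=0.1)
```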
Browser extension
Malicious aggregators
- So far we focused on malicious analysts…
- …but aggregators can be malicious (or at least
curious) too!
[Figure: Users send inputs x₁, …, x_n to the Aggregator; the Analyst receives f(x₁, …, x_n)]
Existing approaches
- Secure hardware (or trusted server)-based mechanisms
- Fully distributed mechanisms with individual noise
Distributed Differential Privacy
How to compute differentially private queries in a distributed setting (attacker model, cryptographic protocols…)? “What’s the average age
of your self-help group?”
Smart-metering
✦ Fine-grained smart-metering has multiple uses:
- time-of-use billing, providing energy advice, settlement,
forecasting, demand response, and fraud detection
✦ USA: Energy Independence and Security Act of 2007
- American Recovery and Reinvestment Act (2009, $4.5bn)
✦ EU: Directive 2009/72/EC
✦ UK: deployment of 47 million smart meters by 2020
✦ Remote reads: one read every 15-30 min
✦ Manual reads: one read every 3
months to 1 year
Smart-metering: privacy issues
✦ Meter readings are sensitive
- Were you in last night?
- You do like watching TV, don’t you?
- Another ready meal in the microwave?
- Has your boyfriend moved in?
Smart-metering: privacy issues (cont’d)
Privacy-friendly smart metering
✦ Goals:
- precise billing of
consumption while revealing no consumption information to third parties
- privacy-friendly real-
time aggregation
Protocol overview
✦ ri answer from client i ✦ kij key shared between
client i and aggregator j
✦ t label classifying the
kind of reading
✦ wi weight given to i’s
answers
Protocol overview
✦ Geometric distribution,
Geom(α), with α > 1, is the discrete distribution with support ℤ and probability mass function
Pr[X = k] = ((α − 1)/(α + 1)) · α^{−|k|}
✦ Discrete counterpart of
Laplace distribution
Let f : D → ℤ be a function with sensitivity ∆f. Then g = f(X) + Geom(e^{ε/∆f}) is ε-differentially private.
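A plaintext sketch of this geometric mechanism, sampling the two-sided geometric as the difference of two one-sided geometric draws (function names are illustrative):

```python
import math
import random

def geom_noise(alpha):
    """Two-sided geometric noise with pmf ((alpha-1)/(alpha+1)) * alpha^(-|k|),
    alpha > 1, as the difference of two one-sided Geometric(1 - 1/alpha) draws."""
    p = 1 - 1 / alpha
    u1, u2 = 1 - random.random(), 1 - random.random()   # in (0, 1]
    g1 = math.floor(math.log(u1) / math.log(1 - p))     # inverse-CDF sampling
    g2 = math.floor(math.log(u2) / math.log(1 - p))
    return g1 - g2

def geometric_mechanism(true_count, sensitivity, eps):
    """f(X) + Geom(alpha) with alpha = e^(eps/sensitivity) is eps-DP."""
    return true_count + geom_noise(math.exp(eps / sensitivity))
```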
Protocol overview
✦ In terms of utility, the
noise added to the aggregate has mean 0 and variance
✦ P is the number of
aggregators
✦ The protocol guarantees
ε-differential privacy even if all except for one aggregators are dishonest
P · Σ_{k∈ℤ} ((α − 1)/(α + 1)) α^{−|k|} k² = 2Pα/(α − 1)²
The noise increases with the number
of aggregators (each adds noise that
suffices to get ε-differential privacy).
On the other hand, this seems to be necessary to protect from malicious aggregators… we will see a more elegant and precise solution based on SMPC
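The closed form for the per-aggregator variance can be checked numerically by truncating the series (a quick sanity check, not part of the protocol):

```python
import math

def geom_second_moment(alpha, kmax=2000):
    """Truncated sum of k^2 * pmf(k) for the two-sided geometric."""
    c = (alpha - 1) / (alpha + 1)
    return sum(c * alpha ** (-abs(k)) * k * k
               for k in range(-kmax, kmax + 1))

alpha = math.exp(0.5)                       # e.g., eps = 0.5, sensitivity 1
closed_form = 2 * alpha / (alpha - 1) ** 2  # per aggregator; P of them give P times this
```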
Limitations of Existing Approaches
- Privacy vs utility tradeoff
- Lack of generality (and scalability)
- Inefficiency:
significant computational effort on user’s side
- Answer pollution:
single entity can pollute result by excessive noise
PrivaDA: Idea and Design
Secure Multi-Party Computation
✦ Computation parties
- Inputs are shared among
computation parties
- Computation parties jointly
compute differentially private statistics
- Required noise is generated in a
distributed fashion
- No party learns the individual
inputs
Our Contributions (PrivaDA)
- We leverage recent advances on SMPC for arithmetic
operations
- uses SMPC to compose user data
- uses SMPC to jointly compute the sanitization mechanism
- We support three sanitization mechanisms
- Lap, DLap, exponential mechanism; more are possible
- We employ β computation parties
- We employ zero-knowledge proofs
- First publicly available library for efficient arithmetic SMPC
operations in the malicious setting
Properties: strong privacy, optimal utility, efficiency, scalability, generality, malicious setting, no answer pollution
PrivaDA 101: Differentially Private Year of Birth
[Figure: users secret-share their years of birth; the shares sum to the true answer 1978, and with noise the released result is approximate, ≈ 1979]
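The idea of the example can be sketched with plain additive secret sharing (modulus and helper names are illustrative; the real protocol additionally generates noise in a distributed fashion and protects against malicious parties):

```python
import random

MOD = 2 ** 31

def share(secret, num_parties):
    """Additively secret-share an integer: random shares summing to the
    secret mod MOD; any proper subset of shares reveals nothing."""
    shares = [random.randrange(MOD) for _ in range(num_parties - 1)]
    shares.append((secret - sum(shares)) % MOD)
    return shares

def reconstruct(all_shares):
    return sum(all_shares) % MOD

# Three users share their birth years; each party sums its shares locally,
# so the aggregate is recovered without revealing any single input
years = [1978, 1980, 1975]
party_totals = [0, 0, 0]
for y in years:
    for i, s in enumerate(share(y, 3)):
        party_totals[i] = (party_totals[i] + s) % MOD
total = reconstruct(party_totals)
```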
SMPC for Distributed Sanitization Mechanisms
- We employ recent SMPC for arithmetic operations
- fixed-point numbers [Catrina & Saxena, FC’10]
- floating point numbers [Aliasgari et al., NDSS’13]
- integers [From & Jakobsen, 2006]
- Key SMPC primitives
- RandInt(k)
- IntAdd, FPAdd, FLAdd, FLMul, FLDiv
- FL2Int, Int2FL, FL2FP, FP2FL
- FLExp, FLLog, FLLT, FLRound
Algorithms for Sanitization Mechanisms
- We provide algorithms for Laplace, Discrete Laplace, and Exponential
- Trick: reduce the problem to random number generation
- Lap(λ) = Exp(1/λ) − Exp(1/λ) with Exp(λ) = −ln 𝒱(0,1] / λ
- DLap(λ) = Geo(1−λ) − Geo(1−λ) with Geo(λ) = ⌊Exp(−ln(1−λ))⌋
- Exp(ε/2) = draw r ∈ 𝒱(0,1] and check
r · Σ_{j=1}^{m} e^{εq(D,aⱼ)} ∈ ( Σ_{k=1}^{j−1} e^{εq(D,aₖ)}, Σ_{k=1}^{j} e^{εq(D,aₖ)} ]

(a) LM
In: d₁, . . . , d_n; λ = ∆f/ε
Out: (Σ_{i=1}^{n} dᵢ) + Lap(λ)
1: d = Σ_{i=1}^{n} dᵢ
2: r_x ← U(0,1]; r_y ← U(0,1]
3: r_z = λ(ln r_x − ln r_y)
4: w = d + r_z
5: return w

(b) DLM
In: d₁, . . . , d_n; λ = e^{−ε/∆f}
Out: (Σ_{i=1}^{n} dᵢ) + DLap(λ)
1: d = Σ_{i=1}^{n} dᵢ
2: r_x ← U(0,1]; r_y ← U(0,1]
3: α = 1/ln(1/λ) = ∆f/ε
4: r_z = ⌊α ln r_x⌋ − ⌊α ln r_y⌋
5: w = d + r_z
6: return w

(c) EM
In: d₁, . . . , d_n; a₁, . . . , a_m; λ = ε/2
Out: winning a_k
1: I₀ = 0
2: for j = 1 to m do
3: z_j = Σ_{i=1}^{n} dᵢ(j)
4: δ_j = e^{λ·z_j}
5: I_j = δ_j + I_{j−1}
6: r ← U(0,1]; r′ = r · I_m
7: k = binary_search(r′, I₀, . . . , I_m)
8: return a_k
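The three reductions can be sketched in plaintext Python (no secret sharing; names and the λ parameterization follow the slide, and the exponential mechanism uses the cumulative-sum selection of algorithm (c)):

```python
import math
import random

def exp_draw(lam):
    """Exp(lam): -ln U(0,1] / lam (inverse-CDF sampling)."""
    return -math.log(1 - random.random()) / lam

def lap_noise(scale):
    """Lap(scale) as Exp(1/scale) - Exp(1/scale)."""
    return exp_draw(1 / scale) - exp_draw(1 / scale)

def dlap_noise(lam):
    """DLap(lam), 0 < lam < 1: difference of two Geometric(1 - lam) draws,
    each sampled as floor(Exp(-ln lam))."""
    geo = lambda: math.floor(exp_draw(-math.log(lam)))
    return geo() - geo()

def exp_mechanism(scores, eps):
    """Pick index j with probability proportional to exp(eps * score / 2),
    i.e., lam = eps/2, via a uniform draw against the cumulative sums."""
    weights = [math.exp(eps * s / 2) for s in scores]
    r = random.random() * sum(weights)   # r' = r * I_m
    acc = 0.0
    for j, w in enumerate(weights):
        acc += w
        if r <= acc:
            return j
    return len(scores) - 1
```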
Protocol for Distributed Laplace Noise
- For β computation parties:
In: shared fixed-point (γ, f) inputs [d₁], . . . , [d_n]; λ = ∆f/ε
Out: w = (Σ_{i=1}^{n} dᵢ) + Lap(λ) in fixed-point form
1: [d] = [d₁]
2: for i = 2 to n do
3: [d] = FPAdd([d], [dᵢ])
4: [r_x] = RandInt(γ + 1); [r_y] = RandInt(γ + 1)
5: ⟨[v_x], [p_x], 0, 0⟩ = FP2FL([r_x], γ, f = γ, ℓ, k)
6: ⟨[v_y], [p_y], 0, 0⟩ = FP2FL([r_y], γ, f = γ, ℓ, k)
7: ⟨[v_{x/y}], [p_{x/y}], 0, 0⟩ = FLDiv(⟨[v_x], [p_x], 0, 0⟩, ⟨[v_y], [p_y], 0, 0⟩)
8: ⟨[v_ln], [p_ln], [z_ln], [s_ln]⟩ = FLLog2(⟨[v_{x/y}], [p_{x/y}], 0, 0⟩)
9: ⟨[v_z], [p_z], [z_z], [s_z]⟩ = FLMul(λ / log₂ e, ⟨[v_ln], [p_ln], [z_ln], [s_ln]⟩)
10: [z] = FL2FP(⟨[v_z], [p_z], [z_z], [s_z]⟩, ℓ, k, γ)
11: [w] = FPAdd([d], [z])
12: return w = Rec([w])
(computes the same quantity as the centralized LM algorithm)
Protocol for Distributed Discrete Laplace Noise
- For β computation parties:
In: shared integer (γ) inputs [d₁], . . . , [d_n]; λ = e^{−ε/∆f}; α = 1/(ln(1/λ) · log₂ e)
Out: integer w = (Σ_{i=1}^{n} dᵢ) + DLap(λ)
1: [d] = [d₁]
2: for i = 2 to n do
3: [d] = IntAdd([d], [dᵢ])
4: [r_x] = RandInt(γ + 1); [r_y] = RandInt(γ + 1)
5: ⟨[v_x], [p_x], 0, 0⟩ = FP2FL([r_x], γ, f = γ, ℓ, k)
6: ⟨[v_y], [p_y], 0, 0⟩ = FP2FL([r_y], γ, f = γ, ℓ, k)
7: ⟨[v_lnx], [p_lnx], [z_lnx], [s_lnx]⟩ = FLLog2(⟨[v_x], [p_x], 0, 0⟩)
8: ⟨[v_lny], [p_lny], [z_lny], [s_lny]⟩ = FLLog2(⟨[v_y], [p_y], 0, 0⟩)
9: ⟨[v_αlnx], [p_αlnx], [z_αlnx], [s_αlnx]⟩ = FLMul(α, ⟨[v_lnx], [p_lnx], [z_lnx], [s_lnx]⟩)
10: ⟨[v_αlny], [p_αlny], [z_αlny], [s_αlny]⟩ = FLMul(α, ⟨[v_lny], [p_lny], [z_lny], [s_lny]⟩)
11: ⟨[v_z1], [p_z1], [z_z1], [s_z1]⟩ = FLRound(⟨[v_αlnx], [p_αlnx], [z_αlnx], [s_αlnx]⟩, 0)
12: ⟨[v_z2], [p_z2], [z_z2], [s_z2]⟩ = FLRound(⟨[v_αlny], [p_αlny], [z_αlny], [s_αlny]⟩, 0)
13: [z₁] = FL2Int(⟨[v_z1], [p_z1], [z_z1], [s_z1]⟩, ℓ, k, γ)
14: [z₂] = FL2Int(⟨[v_z2], [p_z2], [z_z2], [s_z2]⟩, ℓ, k, γ)
15: [w] = IntAdd([d], IntAdd([z₁], [z₂]))
16: return w = Rec([w])
(similar to the centralized DLM algorithm)
Protocol for Distributed Exponential Mechanism
- For β computation parties:
In: [d₁], . . . , [d_n]; the number m of candidates; λ = ε/2
Out: m-bit w, s.t. the smallest i for which w(i) = 1 denotes the winning candidate aᵢ
1: I₀ = ⟨0, 0, 1, 0⟩
2: for j = 1 to m do
3: [z_j] = 0
4: for i = 1 to n do
5: [z_j] = IntAdd([z_j], [dᵢ(j)])
6: ⟨[v_zj], [p_zj], [z_zj], [s_zj]⟩ = Int2FL([z_j], γ, ℓ)
7: ⟨[v_z′j], [p_z′j], [z_z′j], [s_z′j]⟩ = FLMul(λ · log₂ e, ⟨[v_zj], [p_zj], [z_zj], [s_zj]⟩)
8: ⟨[v_j], [p_j], [z_j], [s_j]⟩ = FLExp2(⟨[v_z′j], [p_z′j], [z_z′j], [s_z′j]⟩)
9: ⟨[v_Ij], [p_Ij], [z_Ij], [s_Ij]⟩ = FLAdd(⟨[v_Ij−1], [p_Ij−1], [z_Ij−1], [s_Ij−1]⟩, ⟨[v_j], [p_j], [z_j], [s_j]⟩)
10: [r] = RandInt(γ + 1)
11: ⟨[v_r], [p_r], 0, 0⟩ = FP2FL([r], γ, f = γ, ℓ, k)
12: ⟨[v′_r], [p′_r], [z′_r], [s′_r]⟩ = FLMul(⟨[v_r], [p_r], 0, 0⟩, ⟨[v_Im], [p_Im], [z_Im], [s_Im]⟩)
13: j_min = 1; j_max = m
14: while j_min < j_max do
15: j_M = ⌊(j_min + j_max)/2⌋
16: if FLLT(⟨[v_IjM], [p_IjM], [z_IjM], [s_IjM]⟩, ⟨[v′_r], [p′_r], [z′_r], [s′_r]⟩) then
17: j_min = j_M + 1 else j_max = j_M
18: return w_{j_min}
(similar to the centralized EM algorithm)
Attacker Model and Privacy Guarantees
- We consider two settings:
- honest-but-curious (HbC) computation parties:
- we assume that less than t < β/2 of β parties collude
- malicious computation parties:
- we assume that less than t < β/2 of β parties collude
- we modify our SMPC such that correctness of each computation
step is proved by zero-knowledge proofs
Main results:
✦ The SMPC protocols for LM, DLM, and EM are differentially private in the honest-but-curious setting.
✦ The SMPC protocols for LM, DLM, and EM are differentially private in the malicious setting under the strong RSA and decisional Diffie-Hellman assumptions.
Performance of SMPC Operations (in sec)
Libraries: GMP, Relic, Boost, and OpenSSL Setup: 3.20 GHz (Intel i5) Linux machine with 16 GB RAM, using a 1 Gbps LAN
Type     Protocol   HbC                 Malicious
                    β=3,t=1   β=5,t=2   β=3,t=1   β=5,t=2
Float    FLAdd      0.48      0.76      14.6      29.2
         FLMul      0.22      0.28      3.35      7.54
         FLScMul    0.20      0.28      3.35      7.50
         FLDiv      0.54      0.64      4.58      10.2
         FLLT       0.16      0.23      2.82      6.22
         FLRound    0.64      0.85      11.4      23.4
Convert  FP2FL      0.83      1.21      25.7      50.9
         Int2FL     0.85      1.22      25.7      50.9
         FL2Int     1.35      1.91      26.3      54.3
         FL2FP      1.40      1.96      26.8      55.3
Log      FLLog2     12.0      17.0      274       566
Exp      FLExp2     7.12      9.66      120       265
Performance of LM, DLM and EM
- For β = 3 and t = 1 and number of users n = 100,000
- The HbC setting
- Distributed LM protocol: 15.5 sec
- Distributed DLM protocol: 31.3 sec
- Distributed EM protocol: 42.3 sec
(for number of candidates m = 5)
- The malicious setting
- Distributed LM protocol: 344 sec
Caveats with number representations
- Careful with finite representation of real numbers!
- E.g., the porosity of the floating-point representation breaks the Laplace mechanism
- In the above papers, solutions based on suitable
rounding and truncation mechanisms
- Can be easily integrated in our framework
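A toy illustration of such finite-representation issues (not the actual attack on the Laplace mechanism, just the underlying porosity):

```python
import math

# Floating-point numbers form a discrete grid, so even trivial sums fall
# off the grid, and noise computed as lam * ln(u) from finite-precision
# randomness can only land on a sparse set of output values.
assert 0.1 + 0.2 != 0.3                 # classic representation gap

# ln(u) over a 16-bit uniform grid yields at most 2**16 - 1 distinct
# noise values, however fine the real-valued Laplace density is:
outputs = {math.log(i / 2 ** 16) for i in range(1, 2 ** 16)}
```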
in sec
✦ Operations performed by computation parties ✦ No critical timing restrictions on DDP computations in most real-life scenarios ✦ Users simply forward their shared values to the computation parties (< 1 sec) Demonstrates practicality of PrivaDA (even on computationally limited devices, such as smartphones) Implementation and Performance