Mobile Data Collection and Analysis with Local Differential Privacy - Part 1
Ninghui Li (Purdue University)
Outline
Motivation of Differential Privacy and Local Differential Privacy (LDP)
Frequency Oracles in LDP
Tradeoff between
6/13/2019
AOL searcher #4417749 (NYT)
“landscapers in Lilburn, GA”
queries on last name “Arnold”
“homes sold in shadow lake subdivision Gwinnett County, GA”
“numb fingers”
“60 single men”
“dog that urinates on everything”
Thelma Arnold, a 62-year-old widow who lives in Lilburn, GA, has three dogs and frequently searches her friends’ medical ailments.
Re-identification occurs!
$\frac{\Pr[A(D) = t]}{\Pr[A(D') = t]} \le e^{\varepsilon}$
considered to be privacy violation
Classical/centralized setting: Data → Database (+Noise) → Data mining, Statistical queries
Differential Privacy Interpretation: The decision to include/exclude an individual’s record has limited (ε) influence on the outcome. Smaller ε ➔ Stronger Privacy
[Figure: Data → Database (Trusted) → +Noise → Data mining, Statistical queries, with the trust boundary marked]
[Figure: Data+Noise → Database → Data mining, Statistical queries, with the trust boundary marked]
No worry about untrusted server
P takes input value v from domain D and outputs y.
The aggregator takes reports {y} from all users and outputs estimations c(v) for any value v in domain D.
FO is ε-LDP iff for any v and v′ from D, and any valid output y,
$\frac{\Pr[P(v) = y]}{\Pr[P(v') = y]} \le e^{\varepsilon}$
$E[I_v] = 0.75\, n_v + 0.25\,(n - n_v)$ “yes” answers
$c(n_v) = \frac{I_v - 0.25\, n}{0.75 - 0.25}$ is the unbiased estimation of the number of patients
Provide deniability: Seeing answer, not certain about the secret.
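An individual whose truth is “yes” will answer “yes” w/p 75%, and “no” w/p 25%

truth   count   Expected yes   Expected no
yes     80      60             20
no      20      5              15
total   100     65             35

estimate: $c(n_v) = \frac{I_v - 0.25\, n}{0.75 - 0.25}$ gives 80 “yes” and 20 “no” from the 65 and 35 expected answers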
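The random response example can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function names `randomized_response` and `estimate_yes` are hypothetical, and the simulated population of 100,000 users is an assumption chosen only to make the estimate stable.

```python
import random

def randomized_response(truth, p_truth=0.75, rng=random):
    """Report the true answer w/p p_truth, otherwise the opposite answer."""
    return truth if rng.random() < p_truth else (not truth)

def estimate_yes(num_yes_reports, n, p_truth=0.75):
    """Unbiased estimate of the number of true "yes": (I_v - q*n) / (p - q)."""
    q = 1.0 - p_truth
    return (num_yes_reports - q * n) / (p_truth - q)

# Expected counts from the table: 80 true "yes" and 20 true "no".
expected_yes = 0.75 * 80 + 0.25 * 20
print(expected_yes, estimate_yes(expected_yes, n=100))  # 65.0 80.0

# Simulation with hypothetical data: 80,000 true "yes" and 20,000 true "no".
rng = random.Random(0)
truths = [True] * 80_000 + [False] * 20_000
reports = [randomized_response(t, rng=rng) for t in truths]
print(round(estimate_yes(sum(reports), n=len(truths))))
```

Each individual report stays deniable, while the aggregate estimate concentrates around the true count as the number of users grows.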
Generalized Random Response
Unary Encoding
Local Hash
RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal … Pihur, A. Korolova, CCS 2014
Local, Private, Efficient Protocols for Succinct Histograms. R. Bassily, A. …
Locally Differentially Private Protocols for Frequency Estimation. T. Wang, J. Blocki, N. Li, S. Jha, USENIX Security 2017
Report the true value w/p p, or one of the other d − 1 values (uniformly at random) w/p q each
$p = \frac{e^{\varepsilon}}{e^{\varepsilon} + d - 1},\quad q = \frac{1}{e^{\varepsilon} + d - 1} \;\Rightarrow\; \frac{\Pr[P(v) = v]}{\Pr[P(v') = v]} = \frac{p}{q} = e^{\varepsilon}$
$c(v) = \frac{I_v - n \cdot q}{p - q}$
Intuitively, the higher p, the more accurate
However, when d is large, p becomes small (for the same ε)
ε      p (d = 2)   p (d = 8)   p (d = 128)   p (d = 1024)
0.1    0.52        0.13        0.016         0.001
1      0.73        0.27        0.027         0.002
2      0.88        0.51        0.057         0.007
4      0.98        0.88        0.307         0.05
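Generalized Random Response can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the helper names (`grr_probs`, `grr_perturb`, `grr_estimate`) and the sample data are hypothetical, and the estimator is the usual (I_v − n·q)/(p − q) correction.

```python
import math
import random
from collections import Counter

def grr_probs(epsilon, d):
    """Return (p, q) with p = e^eps / (e^eps + d - 1), q = 1 / (e^eps + d - 1)."""
    e = math.exp(epsilon)
    return e / (e + d - 1), 1.0 / (e + d - 1)

def grr_perturb(v, domain, epsilon, rng):
    """Keep v w/p p; otherwise report one of the other values uniformly at random."""
    p, _ = grr_probs(epsilon, len(domain))
    if rng.random() < p:
        return v
    return rng.choice([x for x in domain if x != v])

def grr_estimate(reports, domain, epsilon):
    """Estimate the count of each value as (I_v - n*q) / (p - q)."""
    n = len(reports)
    p, q = grr_probs(epsilon, len(domain))
    counts = Counter(reports)
    return {v: (counts[v] - n * q) / (p - q) for v in domain}

# Hypothetical data: 10,000 users over a domain of size 8.
rng = random.Random(1)
domain = list(range(8))
values = [0] * 6000 + [1] * 3000 + [2] * 1000
reports = [grr_perturb(v, domain, 1.0, rng) for v in values]
estimates = grr_estimate(reports, domain, 1.0)
print({v: round(c) for v, c in estimates.items()})
```

Calling `grr_probs(epsilon, d)` for growing d shows p shrinking toward 0 at a fixed ε, which is the domain-size dependency the table illustrates.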
To get rid of dependency on domain size, we move to the other protocols.
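$p_{1 \to 1} = p_{0 \to 0} = p = \frac{e^{\varepsilon/2}}{e^{\varepsilon/2} + 1}$
$p_{1 \to 0} = p_{0 \to 1} = q = \frac{1}{e^{\varepsilon/2} + 1}$
$\frac{\Pr[P(E(v)) = x]}{\Pr[P(E(v')) = x]} \le \frac{p_{1 \to 1}}{p_{0 \to 1}} \times \frac{p_{0 \to 0}}{p_{1 \to 0}} = e^{\varepsilon}$
reducing d in each location to 2. (But privacy budget is halved.)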
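The unary encoding protocol above can be sketched as follows, assuming a one-hot encoding E(v) of length d and independent per-bit perturbation. The helper names and sample data are hypothetical.

```python
import math
import random

def sue_probs(epsilon):
    """Symmetric unary encoding: p = e^(eps/2) / (e^(eps/2) + 1), q = 1 - p."""
    h = math.exp(epsilon / 2.0)
    return h / (h + 1.0), 1.0 / (h + 1.0)

def ue_encode(v, d):
    """One-hot encode value v (an index in range(d))."""
    bits = [0] * d
    bits[v] = 1
    return bits

def ue_perturb(bits, epsilon, rng):
    """A 1 stays 1 w/p p; a 0 becomes 1 w/p q; bits are perturbed independently."""
    p, q = sue_probs(epsilon)
    return [int(rng.random() < (p if b else q)) for b in bits]

def ue_estimate(reports, epsilon):
    """Estimate the count of each value as (I_v - n*q) / (p - q)."""
    n, d = len(reports), len(reports[0])
    p, q = sue_probs(epsilon)
    return [(sum(r[v] for r in reports) - n * q) / (p - q) for v in range(d)]

# Hypothetical data: 5,000 users over a domain of size 4.
rng = random.Random(2)
values = [0] * 3000 + [1] * 1500 + [2] * 500
reports = [ue_perturb(ue_encode(v, 4), 1.0, rng) for v in values]
print([round(c) for c in ue_estimate(reports, 1.0)])
```

Two encodings differ in exactly two positions, so the worst-case ratio is (p/q) × (p/q) = e^ε, which is why each bit only gets ε/2.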
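equivalent description
$p = \frac{e^{\varepsilon}}{e^{\varepsilon} + 1},\quad q = \frac{1}{e^{\varepsilon} + 1} \;\Rightarrow\; \frac{\Pr[P(E(v)) = b]}{\Pr[P(E(v')) = b]} = \frac{p}{q} = e^{\varepsilon}$
$E[I_v] = n_v \cdot p + (n - n_v) \cdot (\tfrac{1}{2} q + \tfrac{1}{2} p)$
$c(v) = \frac{I_v - n \cdot \frac{1}{2}}{p - \frac{1}{2}}$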
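The estimator (I_v − n/2)/(p − 1/2) is consistent with a binary local hash protocol, so the sketch below assumes that setting: each user picks a random hash function from the domain to one bit, perturbs the hashed bit, and reports the pair. Deriving the hash from SHA-256 with a random seed is an illustrative choice, and all names and data are hypothetical.

```python
import hashlib
import math
import random

def hash_bit(seed, v):
    """Hypothetical hash family: map value v to one bit, keyed by seed."""
    return hashlib.sha256(f"{seed}:{v}".encode()).digest()[0] & 1

def blh_report(v, epsilon, rng):
    """Hash v to one bit, keep the bit w/p p = e^eps / (e^eps + 1), else flip it."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    seed = rng.getrandbits(32)
    b = hash_bit(seed, v)
    if rng.random() >= p:
        b = 1 - b
    return seed, b

def blh_estimate(reports, domain, epsilon):
    """I_v counts reports whose bit matches the hash of v; c(v) = (I_v - n/2) / (p - 1/2)."""
    n = len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    result = {}
    for v in domain:
        support = sum(1 for seed, b in reports if hash_bit(seed, v) == b)
        result[v] = (support - n / 2.0) / (p - 0.5)
    return result

# Hypothetical data: 20,000 users over a domain of size 4.
rng = random.Random(3)
values = [0] * 10_000 + [1] * 6_000 + [2] * 4_000
reports = [blh_report(v, 1.0, rng) for v in values]
print({v: round(c) for v, c in blh_estimate(reports, range(4), 1.0).items()})
```

A report for a different value supports v with probability 1/2 regardless of p, which is where the n/2 correction comes from.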
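$\mathrm{Var}[c(v)] = \mathrm{Var}\left[\frac{I_v - n \cdot q}{p - q}\right] = \frac{\mathrm{Var}[I_v]}{(p - q)^2} \approx \frac{n \cdot q \cdot (1 - q)}{(p - q)^2}$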
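The approximation step can be spelled out as follows. This is a short derivation sketch, assuming the reports are independent and that the true count $n_v$ is small relative to $n$; both assumptions are added here for the illustration.

```latex
% I_v is a sum of independent Bernoulli indicators:
% n_v users holding v support v with probability p,
% the other n - n_v users support v with probability q.
\begin{align*}
\mathrm{Var}[I_v] &= n_v\, p\,(1 - p) + (n - n_v)\, q\,(1 - q) \\
                  &\approx n\, q\,(1 - q) \qquad \text{when } n_v \ll n, \\
\mathrm{Var}[c(v)] &= \frac{\mathrm{Var}[I_v]}{(p - q)^2}
                   \approx \frac{n\, q\,(1 - q)}{(p - q)^2}.
\end{align*}
```

Subtracting the constant $n \cdot q$ does not change the variance, and dividing by $p - q$ scales it by $1/(p - q)^2$.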
value with a corresponding 1
p′, and into a value not supporting it with q′
$\min_{q'} \mathrm{Var}[c(v)] \approx \min_{q'} \frac{n \cdot q' \cdot (1 - q')}{(p' - q')^2}$, where p′, q′ satisfy ε-LDP
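The minimization can be explored numerically. The sketch below assumes the unary-encoding form of the ε-LDP constraint, p′(1 − q′) / ((1 − p′) q′) = e^ε, which is taken from the unary-encoding setting and is not spelled out here; the function names and the grid search are illustrative only.

```python
import math

def ue_p_from_q(q, epsilon):
    """Solve p(1 - q) / ((1 - p) q) = e^eps for p (assumed constraint)."""
    e = math.exp(epsilon)
    return e * q / (1.0 - q + e * q)

def per_user_variance(q, epsilon):
    """The factor q(1 - q) / (p - q)^2 that multiplies n in the variance."""
    p = ue_p_from_q(q, epsilon)
    return q * (1.0 - q) / (p - q) ** 2

def best_q(epsilon, steps=20_000):
    """Grid search over q in (0, 1) for the smallest per-user variance."""
    grid = (i / steps for i in range(1, steps))
    return min(grid, key=lambda q: per_user_variance(q, epsilon))

epsilon = 1.0
q_star = best_q(epsilon)
p_star = ue_p_from_q(q_star, epsilon)
print(q_star, p_star, 1.0 / (math.exp(epsilon) + 1.0))
```

Under this assumed constraint the search lands near q′ = 1/(e^ε + 1) with p′ close to 1/2, which matches what setting the derivative of the variance to zero gives analytically.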
evasive answer bias
Ordinal Response.
Estimation
perturbed and transmitted.
two steps:
information
[−1, +1]
1524]
with Local Differential Privacy. ACM CCS 2016
Locally Differentially Private Frequent Itemset Mining. IEEE Symposium on Security and Privacy 2018
time
current result
User    Gender   Smoke
Alice   female   smoker
Bob     male     non-smoker
Tom     male     smoker
…
Lily    female   non-smoker

2-way marginal
v                         F(v)
<female, non-smoker>      0.35
<female, smoker>          0.15
<male, non-smoker>        0.1
<male, smoker>            0.4

1-way marginal
v              F(v)
<female, *>    0.5
<male, *>      0.5

v                  F(v)
<*, non-smoker>    0.45
<*, smoker>        0.55
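A 1-way marginal is obtained by summing the 2-way table over the attribute that is wildcarded. The sketch below illustrates this with the 2-way values shown above; the `marginal` helper and the dictionary layout are hypothetical.

```python
from collections import defaultdict

# 2-way marginal over (Gender, Smoke), values as listed in the table.
two_way = {
    ("female", "non-smoker"): 0.35,
    ("female", "smoker"): 0.15,
    ("male", "non-smoker"): 0.1,
    ("male", "smoker"): 0.4,
}

def marginal(table, keep):
    """Sum out every attribute position not in `keep`, replacing it with '*'."""
    result = defaultdict(float)
    for cell, freq in table.items():
        key = tuple(val if i in keep else "*" for i, val in enumerate(cell))
        result[key] += freq
    return dict(result)

gender = marginal(two_way, keep={0})
smoke = marginal(two_way, keep={1})
for key, freq in {**gender, **smoke}.items():
    print(key, round(freq, 2))
```

The same helper works for any k-way marginal of a larger contingency table by passing the k attribute positions to keep.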
Dataset:
as a frequency oracle problem
Aggregator: Calculate all k-way marginals
Users (each reports through the FO):
Alice   female   …   smoker
Bob     male     …   non-smoker
Tom     male     …   smoker
Sally   female   …   non-smoker
Lily    female   …   non-smoker

All attributes → Frequency Oracle → Full contingency table → All k-way marginals
v                         F(v)
<female, non-smoker>      0.35
<female, smoker>          0.15
<male, non-smoker>        0.1
<male, smoker>            0.4
$\mathrm{Var} = O(2^d)$
Gender   Smoke
female   smoker
male     non-smoker
male     smoker
…        …
female   non-smoker

Attributes corresponding to each k-way marginal → Frequency Oracle → All k-way marginals
v: <female, *>, <male, *>
v: <*, smoker>, <*, non-smoker>
All k-way marginals: $\binom{d}{k}$
k disjoint groups
k becomes large, each user contributes less information to each marginal
$\mathrm{Var} = O(2^k \cdot \binom{d}{k})$
Fourier domain (values in marginals → Fourier coefficients)
be estimated.
[Figure: Fourier Transformation, All k-way marginals, All attributes, Sample and randomize, Unary encoding, FO]
$\mathrm{Var} = O\left(\sum_{s=0}^{k} \binom{d}{s}\right)$
[Figure legend: CALM, Fourier Transformation, Full Contingency Table, K-way Marginal Table]
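The three variance orders quoted for the marginal methods, O(2^d), O(2^k · C(d, k)) and O(Σ_{s=0..k} C(d, s)), can be compared for concrete d and k. The sketch below only evaluates the expressions inside the O(·); constants are ignored and the function names are hypothetical.

```python
from math import comb

def full_contingency_order(d):
    """Order of the variance when estimating the full contingency table: 2^d."""
    return 2 ** d

def all_marginals_order(d, k):
    """Order when users are split across all k-way marginals: 2^k * C(d, k)."""
    return 2 ** k * comb(d, k)

def fourier_order(d, k):
    """Order for the Fourier approach: sum of C(d, s) for s = 0..k."""
    return sum(comb(d, s) for s in range(k + 1))

for d, k in [(8, 2), (16, 3), (32, 3)]:
    print(d, k, full_contingency_order(d), all_marginals_order(d, k), fourier_order(d, k))
```

The full contingency table grows exponentially in d, while the other two expressions grow polynomially in d for a fixed k.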
network, transportation and knowledge base.
that only differ in one bit
[Figure: adjacency bit vectors $l_i$ and $l_j$]
Adjacency bit vector
fake edges
Facebook: 4,039 nodes, 88,234 edges → 4,039 nodes, 4,427,047 edges (ε = 1)
98% fake edges
(a single degree → a degree vector)
RNL DGG
LDPGen
partitions users into k groups
to these groups
k = 2
partitions users with similar degree distribution into new groups
links to the new groups
k = 3
corresponding graph from BTER model
< Key, Value >
2.1h 2.8h 3.2h 1.5h 0.5h 0.1h 0.2h 0.1h 0.5h 2.2h 1.6h 1.1h
Disease   Domain
Cancer    [0, 0.35]
HIV       [0.3, 0.6]
Fever     [0.5, 1.0]

<Cancer, 0.2>
Mean Oracle
Cancer 0.2, Fever 0.4
<Fever, 0.4>: 0.4 ∉ [0.5, 1.0]
Frequency Oracle
Users   Item
Alice   <0, 0>
Bob     <1, 0.6>
Chris   <0, 0>
Tom     <1, 0.8>

[Figure: perturbation of key-value pairs with probabilities p and 1−p; pairs shown: <1, 0.6>, <0, 0>, <1, ?>]
v*
the ground truth.
transmission overhead
[Figure: Users and Aggregator exchange perturbed data and mean, per batch and per iteration; panels labeled “Real iteration” and “Virtual iterations” with mean prediction]
similar distribution as the real mean. Deviate from the true distribution
→ multiple rounds of interactions
Estimation accuracy Communication bandwidth
logistic regression and support vector machine [ICDE’ 19]
Statistics
Machine learning
to the privacy definition? $\Pr[A(s) = s^*] \le e^{\varepsilon} \cdot \Pr[A(s') = s^*] + \delta$
user to a single bit?
challenging problem.
has been widely adopted.