

SLIDE 1

Providing Input-Discriminative Protection for Local Differential Privacy

Xiaolan Gu*, Ming Li*, Li Xiong# and Yang Cao†

*University of Arizona #Emory University †Kyoto University

IEEE International Conference on Data Engineering (ICDE), April 2020

SLIDE 2

Overview

  • Background on LDP
  • Our Privacy Notion: ID-LDP
  • Our Privacy Mechanism on ID-LDP
  • Evaluation
  • Conclusion
SLIDE 3

Background

  • Companies are collecting our private data to provide better services (Google, Facebook, Apple, Yahoo, Uber, …)
  • However, privacy concerns arise:
    • Yahoo: massive data breaches impacted 3 billion user accounts (2013)
    • Facebook: 267 million users’ data was reportedly leaked (2019)
  • Possible solution: the locally private data collection model

[Figure: the local model — each user i perturbs raw data xi into yi via a randomized mechanism M and uploads the perturbed data to an untrusted server for analysis.]

SLIDE 4

Local Differential Privacy (LDP)

A mechanism M satisfies ϵ-LDP if and only if, for any pair of inputs x, x′ and any output y:

Pr(M(x) = y) / Pr(M(x′) = y) ⩽ e^ϵ

  • x, x′: the possible input (raw) data (generated by the user)
  • y: the output (perturbed) data (public and known by the adversary)
  • ϵ: the privacy budget (a smaller ϵ indicates stronger privacy)

An adversary cannot infer whether the input is x or x′ with high confidence (controlled by ϵ).

[Duchi et al., FOCS ’13]

SLIDE 5

Applications of LDP

Google: open-source differential privacy library

Source: https://developers.googleblog.com/2019/09/enabling-developers-and-organizations.html

Apple: discovering popular Emojis under LDP

Source: https://machinelearning.apple.com/2017/12/06/learning-with-privacy-at-scale.html

SLIDE 6

Limitations of LDP

  • The LDP notion requires the same privacy budget for all pairs of possible inputs.
  • Existing LDP protocols perturb the data in the same way for all inputs.
  • However, in many practical scenarios, different inputs have different degrees of sensitiveness and thus require distinct levels of privacy protection.
  • LDP protocols can provide excessive protection for inputs that do not need such strong privacy, leading to an inferior privacy-utility tradeoff.

| Scenarios             | High sensitiveness | Low sensitiveness   |
| Website-click records | Politics-related   | Facebook and Amazon |
| Medical records       | HIV and cancer     | Anemia and headache |

SLIDE 7

Our Privacy Notion: Input-Discriminative LDP (ID-LDP)

  • Given a privacy budget set ℰ = {ϵx}x∈𝒳, a randomized mechanism M satisfies ℰ-ID-LDP if and only if, for any pair of inputs x, x′ ∈ 𝒳 and any output y ∈ Range(M):

Pr(M(x) = y) / Pr(M(x′) = y) ⩽ e^r(ϵx, ϵx′)

where r(⋅, ⋅) is a function of the two privacy budgets, and ϵx is the privacy budget of an input x.

  • In this paper, we focus on an instantiation called MinID-LDP with r(ϵx, ϵx′) = min{ϵx, ϵx′}.

Intuition: for any pair of inputs x, x′, MinID-LDP guarantees that the adversary’s capability of distinguishing them does not exceed the bound controlled by both ϵx and ϵx′ (thus achieving differentiated privacy protection for each pair).

MinID-LDP has sequential composition like LDP, which guarantees the overall privacy for a sequence of mechanisms.
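The MinID-LDP condition can be checked mechanically for any finite mechanism given as a perturbation matrix. Below is a minimal sketch (the domain size, budgets, and mechanism are illustrative assumptions, not from the paper): it verifies the pairwise constraints for a generalized randomized response calibrated to the smallest budget in ℰ.

```python
import math
from itertools import product

def satisfies_min_id_ldp(P, eps):
    """Check Pr[M(x)=y] <= e^{min(eps_x, eps_x')} * Pr[M(x')=y] for all x, x', y."""
    m = len(P)
    for x, x2 in product(range(m), repeat=2):
        bound = math.exp(min(eps[x], eps[x2]))
        for y in range(len(P[x])):
            if P[x][y] > bound * P[x2][y] + 1e-12:  # small float tolerance
                return False
    return True

# Toy domain of size 3 with per-input budgets (a more sensitive input gets ln 2).
eps = [math.log(2), math.log(4), math.log(4)]

# Generalized RR calibrated to the smallest budget eps_min = ln 2:
# diagonal p = e^eps/(e^eps + m - 1), off-diagonal q = 1/(e^eps + m - 1).
e, m = 2.0, 3
p, q = e / (e + m - 1), 1.0 / (e + m - 1)
P = [[p if y == x else q for y in range(m)] for x in range(m)]

print(satisfies_min_id_ldp(P, eps))  # True: eps_min-LDP implies MinID-LDP
```

This also illustrates one direction of the relationship with LDP: a mechanism satisfying ϵ-LDP with ϵ ⩽ min{ℰ} automatically satisfies ℰ-MinID-LDP.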

SLIDE 8

Relationships with LDP

  • 1. If ϵx = ϵ for all x ∈ 𝒳, then ℰ-MinID-LDP ⇔ ϵ-LDP
  • 2. If min{ℰ} ⩾ ϵ, then ϵ-LDP ⇒ ℰ-MinID-LDP
  • 3. If ϵ ⩾ min{max{ℰ}, 2 min{ℰ}}, then ℰ-MinID-LDP ⇒ ϵ-LDP

MinID-LDP can be regarded as a relaxation of LDP. It captures users’ fine-grained privacy requirements when LDP is too strong (i.e., provides overprotection).

The factor 2 in relationship 3 is due to the symmetric property of the indistinguishability definition.
SLIDE 9

Related Privacy Notions

  • Personalized LDP (PLDP) [Chen et al., ICDE ’16] — user-discriminative: ϵv is the privacy budget of a user v for all pairs of inputs (different users may have different ϵv)
  • Geo-indistinguishability (GI) [Andres et al., CCS ’13] and Condensed LDP (CLDP) [Gursoy et al., TDSC ’19] — distance-discriminative: the privacy budget for a pair of inputs x, x′ is ϵ·d, where d is the distance between x and x′
  • ID-LDP (ours) — input-discriminative: ϵx is the privacy budget of an input x, and r(ϵx, ϵx′) is the privacy budget of a pair of inputs x, x′ for all users; MinID-LDP uses r(ϵx, ϵx′) = min{ϵx, ϵx′}
  • Utility-optimized LDP (ULDP) [Murakami and Kawamoto, USENIX Security ’19] — partitions the inputs into sensitive inputs 𝒴S and non-sensitive inputs 𝒴N. ULDP does not guarantee indistinguishability between sensitive and non-sensitive inputs when observing some outputs; thus ULDP does not guarantee LDP.

SLIDE 10

Privacy Mechanism Design under ID-LDP

Problem Statement

  • Data types: categorical (two cases: each user has only one item, or an item-set)
  • Analysis task/application: frequency estimation (a building block for many applications)
  • Objective: minimize the MSE of frequency estimation while satisfying ID-LDP

Preliminaries: LDP protocols

  • Randomized Response
  • Unary Encoding (our protocol satisfying ID-LDP is based on this)

Challenges

  • The number of variables (perturbation parameters) and privacy constraints (to be satisfied for any x, x′, y) can be very large, especially for a large domain or item-set data. For example, for domain size m, a general mechanism has m² variables and m³ constraints.
  • The objective function (MSE) depends on the unknown true frequencies.

Note: ID-LDP protocols perturb different inputs with different probabilities.

SLIDE 11

LDP Protocol: Randomized Response

  • Randomized Response (RR) [Warner, 1965]: report the truth with some probability (for a binary yes-or-no answer)
  • Example: Is your annual income more than 100k?

The true bit x is kept (y = x) w.p. p and flipped (y = 1 − x) w.p. 1 − p.

To satisfy ϵ-LDP: p = e^ϵ / (e^ϵ + 1) (since p / (1 − p) = e^ϵ)

Frequency estimation: f̂ = (f − (1 − p)) / (2p − 1), where f is the frequency of the response y = 1 and f* is the true frequency.

Unbiasedness: 𝔼[f̂] = f*, since 𝔼[f] = f*p + (1 − f*)(1 − p) = (2p − 1)f* + (1 − p).

Advanced versions: Unary Encoding, Generalized RR, …
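The RR perturbation and its debiasing step can be sketched in a few lines of Python (a simulation for illustration; the seed, sample size, and true frequency below are arbitrary choices, not from the paper):

```python
import math
import random

def randomized_response(x: int, eps: float) -> int:
    """Report the true bit w.p. p = e^eps/(e^eps + 1), flip it otherwise."""
    p = math.exp(eps) / (math.exp(eps) + 1)
    return x if random.random() < p else 1 - x

def estimate_frequency(responses, eps):
    """Debias: f_hat = (f - (1 - p)) / (2p - 1), with f the fraction of 1s."""
    p = math.exp(eps) / (math.exp(eps) + 1)
    f = sum(responses) / len(responses)
    return (f - (1 - p)) / (2 * p - 1)

random.seed(0)
eps, n, true_f = math.log(3), 200_000, 0.3
data = [1 if random.random() < true_f else 0 for _ in range(n)]
reports = [randomized_response(x, eps) for x in data]
print(round(estimate_frequency(reports, eps), 2))  # close to the true frequency 0.3
```

With ϵ = ln 3 each user reports the truth with probability 3/4, yet after debiasing the aggregate estimate concentrates around f* as n grows.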

SLIDE 12

LDP Protocol: Unary Encoding (UE)

  • To handle the more general case (domain size d), UE represents the input/output by multiple bits.
  • Step 1: encode the input x = i into a vector x = [0,⋯,0,1,0,⋯,0] of length d (only the i-th bit is 1).
  • Step 2: perturb each bit independently:
    • RAPPOR [Erlingsson et al., CCS ’14]: y[k] = x[k] w.p. p and y[k] = 1 − x[k] w.p. 1 − p
    • OUE [Wang et al., USENIX Security ’17]: a 1-bit is reported as 1 w.p. 0.5, and a 0-bit is reported as 1 w.p. q

To satisfy ϵ-LDP: p = e^(ϵ/2) / (e^(ϵ/2) + 1) and q = 1 / (e^ϵ + 1), where OUE’s parameters are obtained by minimizing the approximate MSE of frequency estimation.
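The OUE encode-perturb-estimate pipeline described above can be sketched as follows (an illustrative simulation; the domain size, budget, and uniform input data are assumptions made for the example):

```python
import math
import random

def oue_perturb(x: int, d: int, eps: float):
    """OUE: encode x in {0,..,d-1} as a one-hot vector, then perturb each bit:
    a 1-bit stays 1 w.p. 1/2; a 0-bit becomes 1 w.p. q = 1/(e^eps + 1)."""
    q = 1.0 / (math.exp(eps) + 1)
    return [
        (1 if random.random() < 0.5 else 0) if k == x
        else (1 if random.random() < q else 0)
        for k in range(d)
    ]

def oue_estimate(reports, d, eps):
    """Unbiased count estimate per item: c_i = (sum_u y_u[i] - n*q) / (1/2 - q)."""
    q = 1.0 / (math.exp(eps) + 1)
    n = len(reports)
    return [(sum(y[i] for y in reports) - n * q) / (0.5 - q) for i in range(d)]

random.seed(1)
d, eps, n = 4, math.log(4), 100_000
data = [random.choice(range(d)) for _ in range(n)]
reports = [oue_perturb(x, d, eps) for x in data]
print([round(c / n, 2) for c in oue_estimate(reports, d, eps)])  # each near 0.25
```

The debiasing mirrors RR: 𝔼[∑u yu[i]] = c*i · 0.5 + (n − c*i) · q, so subtracting n·q and dividing by 0.5 − q yields an unbiased count.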

SLIDE 13

Overview of Our Protocol for ID-LDP

Recall the two challenges: 1) high complexity of the optimization problem; 2) the MSE depends on unknown true frequencies.

For single-item data: IDUE (Input-Discriminative Unary Encoding)

  • 1. We propose a Unary Encoding based protocol with only 2m variables and m² constraints.
  • 2. We address the second challenge by developing three variants of optimization models (some of which can further reduce the problem complexity).

For item-set data: IDUE-PS (with the Padding-and-Sampling protocol)

  • 1. We extend IDUE to item-set data (by combining it with a sampling protocol) to solve the scalability issue.
  • 2. We show that IDUE-PS also satisfies MinID-LDP (if the base protocol IDUE satisfies MinID-LDP).
SLIDE 14

Privacy Mechanism for Single-Item Data

  • Step 1: encode the input x = i into x = [0,⋯,0,1,0,⋯,0] (only the i-th bit is 1).
  • Step 2: perturb each bit independently (with different probabilities): y[k] = 1 w.p. ak if x[k] = 1, and y[k] = 1 w.p. bk if x[k] = 0.
  • Step 3: estimate the count of item i by ĉi = (∑u yu[i] − n·bi) / (ai − bi).

Privacy constraints: ai(1 − bj) / (bi(1 − aj)) ⩽ e^r(ϵi, ϵj) (∀i, j)

Benefits

  • 1. The optimization problem only has 2m variables and m² constraints.
  • 2. The frequency estimator is unbiased, and its MSE decomposes into two terms, where only the second term depends on the true frequencies:

MSE(ĉi) = Var[ĉi] = n·bi(1 − bi) / (ai − bi)² + c*i(1 − ai − bi) / (ai − bi)

where n is the number of users, ai and bi are the perturbation probabilities, c*i is the true count, and ĉi is the estimated count.
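The three steps can be sketched as runnable Python. Note the paper obtains (ak, bk) by solving an optimization problem; as a loudly labeled assumption, the sketch below instead plugs in hand-picked RAPPOR-style probabilities (per-bit budgets ln(8/3) for the sensitive item and ln 6 for the rest) and verifies that they satisfy the MinID-LDP constraints, without any claim of optimality:

```python
import math
import random

def idue_params(bit_eps):
    """Illustrative RAPPOR-style per-bit probabilities (NOT the paper's
    optimized solution): a_k = e^{e_k/2}/(e^{e_k/2}+1), b_k = 1 - a_k."""
    a = [math.exp(e / 2) / (math.exp(e / 2) + 1) for e in bit_eps]
    return a, [1 - ak for ak in a]

def check_constraints(a, b, eps):
    """Verify a_i(1-b_j) / (b_i(1-a_j)) <= e^{min(eps_i, eps_j)} for all i, j."""
    m = len(a)
    return all(
        a[i] * (1 - b[j]) <= math.exp(min(eps[i], eps[j])) * b[i] * (1 - a[j]) + 1e-12
        for i in range(m) for j in range(m)
    )

def idue_perturb(x, a, b):
    """One-hot encode x; report bit k as 1 w.p. a_k (true bit) or b_k (other bits)."""
    return [1 if random.random() < (a[k] if k == x else b[k]) else 0
            for k in range(len(a))]

def idue_estimate(reports, a, b):
    """Unbiased count estimate: c_i = (sum_u y_u[i] - n*b_i) / (a_i - b_i)."""
    n = len(reports)
    return [(sum(y[i] for y in reports) - n * b[i]) / (a[i] - b[i])
            for i in range(len(a))]

# Budgets as in the paper's survey example: input 0 (e.g. HIV) is more sensitive.
eps = [math.log(4)] + [math.log(6)] * 4
bit_eps = [math.log(8 / 3)] + [math.log(6)] * 4  # hand-picked, constraint-checked
a, b = idue_params(bit_eps)
assert check_constraints(a, b, eps)

random.seed(2)
n = 100_000
data = [random.choice(range(5)) for _ in range(n)]
reports = [idue_perturb(x, a, b) for x in data]
print([round(c / n, 2) for c in idue_estimate(reports, a, b)])  # each near 0.2
```

The sensitive bit gets a smaller gap a0 − b0 (more noise), the other bits a larger gap (less noise), which is exactly the input-discriminative behavior the slide describes.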

SLIDE 15

Comparison with LDP Protocols

Example: a health organization runs a survey which asks n participants to return a response perturbed from the categories {HIV, anemia, headache, stomachache, toothache}, where HIV (i = 1) is more sensitive; thus we set different privacy budgets, such as ϵ1 = ln 4 and ϵi = ln 6 (i = 2,⋯,5). IDUE then adds more perturbation noise for i = 1 and less perturbation noise for i ≠ 1.

The total variance of IDUE lies in a range because it depends on the distribution of the true input data, and its upper bound is still less than that of RAPPOR and OUE.
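The closed-form variances make this comparison easy to reproduce numerically. The sketch below is an illustration under stated assumptions: IDUE uses hand-picked, non-optimized RAPPOR-style parameters (per-bit budgets ln(8/3) for the sensitive item and ln 6 for the rest, chosen to satisfy the MinID-LDP constraints), while RAPPOR and OUE, bound by plain LDP, must be calibrated to the smallest budget ln 4; the paper's optimized parameters do better still.

```python
import math

def idue_var_per_item(a, b):
    """Frequency-independent variance term of Var[c_i] = b(1-b)/(a-b)^2 per user;
    for the symmetric choice a + b = 1 the term c_i*(1-a-b)/(a-b) vanishes."""
    return b * (1 - b) / (a - b) ** 2

def oue_var_per_item(eps):
    q = 1 / (math.exp(eps) + 1)  # 0-bits flip to 1 w.p. q; 1-bits stay w.p. 1/2
    return q * (1 - q) / (0.5 - q) ** 2

def rappor_var_per_item(eps):
    p = math.exp(eps / 2) / (math.exp(eps / 2) + 1)  # each bit kept w.p. p
    return p * (1 - p) / (2 * p - 1) ** 2

m = 5                                 # domain size of the survey example
eps_min = math.log(4)                 # plain LDP must use the smallest budget
bit_eps = [math.log(8 / 3)] + [math.log(6)] * 4

idue_total = sum(
    idue_var_per_item(math.exp(e / 2) / (math.exp(e / 2) + 1),
                      1 / (math.exp(e / 2) + 1))
    for e in bit_eps)
oue_total = m * oue_var_per_item(eps_min)
rappor_total = m * rappor_var_per_item(eps_min)

print(idue_total < oue_total < rappor_total)  # True
```

Even this unoptimized parameter choice yields a smaller total variance (≈8.74 per user) than OUE (≈8.89) and RAPPOR (10) calibrated to ln 4, consistent with the slide's claim.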

SLIDE 16

Evaluation

We compare the frequency estimation results of our mechanisms (IDUE and IDUE-PS) with RAPPOR and OUE on two synthetic datasets and three real-world datasets.

[Figure: comparison of empirical (dashed lines) and theoretical (solid lines) MSE on synthetic data (single-item input), varying the privacy budget and the domain size. Empirical results are very close to theoretical results, and IDUE has smaller MSE than RAPPOR and OUE.]

  • opt0: has the smallest MSE
  • opt1 and opt2: not as good as opt0, but better than RAPPOR and OUE

SLIDE 17

Real-World Data (Single-Item)

[Figure: MSE and RE on real-world single-item data for RAPPOR, OUE, and IDUE, varying the privacy budget and the portion of more-sensitive inputs. IDUE has the smallest MSE and RE (relative error).]

If only a small portion of inputs are more sensitive (i.e., have the smallest privacy budget), then IDUE has smaller estimation error; otherwise, IDUE has similar performance to OUE.

RE = (1/|S|) ∑i∈S |ĉi − c*i| / c*i

SLIDE 18

Item-Set Data

[Figure: results on the Kosarak item-set data, varying ℓ and ϵ.]

The optimal ℓ (the parameter of the Padding-and-Sampling protocol) depends on both the data distribution and the privacy budget (the original paper only mentioned that it is data-dependent). We leave this as future work.

SLIDE 19

Conclusion

  • 1. Privacy notion ID-LDP provides input-discriminative protection in the local setting
  • 2. Its instantiation MinID-LDP is a fine-grained version of LDP
  • 3. The proposed protocol IDUE outperforms LDP protocols
  • 4. The advanced version IDUE-PS solves the scalability problem for item-set data

Future work:

  • Extend our work to handle more complex data types and analysis tasks;
  • Study the strategy of finding the optimal ℓ based on the data distribution and privacy budget.

SLIDE 20

Thanks for your attention!

Q&A