SLIDE 1 Privacy guarantees in statistical estimation: How to formalize the problem?
Martin Wainwright
UC Berkeley, Departments of Statistics and EECS
van Dantzig Seminar, University of Leiden
Martin Wainwright (UC Berkeley) Privacy and statistics October 2015 1 / 22
SLIDE 5 The modern landscape

Modern data sets are often very large:
- biological data (genes, proteins, etc.)
- medical imaging (MRI, fMRI, etc.)
- astronomy datasets
- social network data
- recommender systems (Amazon, Netflix, etc.)

Statistical considerations interact with:
1. Computational constraints: (low-order) polynomial-time is essential!
2. Communication/storage constraints: distributed implementations are essential.
3. Privacy constraints: tension between hiding and sharing data.
SLIDE 8 From Classical Minimax Risk...

Choose the estimator to minimize the worst-case risk:

    Classical minimax risk = inf_{θ̂_n} sup_{θ∈Ω} E[ L(θ̂_n, θ) ]

Two-party game:
- Nature chooses the parameter θ ∈ Ω in a potentially adversarial manner.
- The statistician takes the infimum over all estimators (X_1, ..., X_n) → θ̂_n ∈ Ω, an arbitrary measurable function.

Abraham Wald, 1902-1950

Classical questions about the minimax risk:
- How fast does it decay as a function of the sample size n?
- How does it depend on dimensionality, smoothness, etc.?
- Can we characterize optimal estimators?
SLIDE 11 ....to Constrained Minimax Risk

The classical framework imposes no constraints on the choice of estimator θ̂_n:
- unbounded memory and computational power;
- centralized access to all n samples;
- data fully revealed: no privacy-preserving properties.

On-going research: statistical minimax with constraints
- Computationally-constrained estimators (e.g., Rigollet & Berthet, 2013; Ma & Wu, 2014; Zhang, W. & Jordan, 2014)
- Communication constraints (e.g., Zhang et al., 2013; Ma et al., 2014; Braverman et al., 2015)
- Privacy constraints (e.g., Dwork, 2006; Hardt & Rothblum, 2010; Hall et al., 2011; Duchi, W. & Jordan, 2013)
SLIDE 14 Why be concerned with privacy?

Many sources of data have both statistical utility and privacy concerns.
[Figures: (a) Personal genome project; (b) Privacy breach, Scientific American, August 2013]

Question: How can we obtain principled tradeoffs between these competing criteria?
SLIDE 15 Basic model of local privacy

[Diagram: private data X_1, X_2, ..., X_n passes through a channel Q(Z_1^n | X_1^n) to produce public data Z_1^n]

- Each individual i ∈ {1, 2, ..., n} has personal data X_i ~ P_{θ*}.
- A conditional distribution Q links the private data X_1^n to the public data Z_1^n.
- An estimator Z_1^n → θ̂ of the unknown parameter θ* is applied to the public data.
SLIDE 16 Local privacy at level α
log Q(· | x) log Q(· | ¯ x) z Log likelihood Definition Conditional distribution Q is locally α-differentially private if e−α ≤ sup
z
Q(z | xn
1)
Q(z | ¯ xn
1) ≤ eα
for all xn
1 and ¯
xn
1 such that dHAM(xn 1, ¯
xn
1) = 1.
(Dwork et al., 2006)
SLIDE 18 Illustration of Laplacian mechanism

Add α-Laplacian noise (Dwork et al., 2006):

    Z = x + W, where W has density ∝ e^{-α|w|}.

For all x, x' ∈ [−1/2, 1/2]:

    sup_z Q(z | x) / Q(z | x') = sup_z e^{α(|z − x'| − |z − x|)} ≤ e^{α|x − x'|} ≤ e^{α}.
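This likelihood-ratio bound can be checked numerically. The sketch below (the grid resolution and the value α = 0.5 are arbitrary illustrative choices) compares Laplace densities centered at all pairs of points in [−1/2, 1/2] and confirms that the worst-case ratio is exactly e^α.

```python
import numpy as np

def laplace_density(z, x, alpha):
    # Density of Z = x + W with W proportional to exp(-alpha |w|)
    return (alpha / 2.0) * np.exp(-alpha * np.abs(z - x))

alpha = 0.5
z = np.linspace(-10, 10, 20001)       # fine grid over the output space
xs = np.linspace(-0.5, 0.5, 21)       # private values in [-1/2, 1/2]

# Worst-case likelihood ratio over all pairs of private values
worst = max(
    np.max(laplace_density(z, x, alpha) / laplace_density(z, xp, alpha))
    for x in xs for xp in xs
)
print(worst, np.exp(alpha))           # worst-case ratio equals e^alpha
```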
SLIDE 21 Various mechanisms for α-privacy

Choices from past work:
- randomized response in survey questions (Warner, 1965)
- Laplacian noise (Dwork et al., 2006)
- exponential mechanism (McSherry & Talwar, 2007)

Some past work on privacy and estimation:
- local differential privacy and PAC learning (Kasiviswanathan et al., 2008)
- linear queries over discrete-valued data sets (Hardt & Rothblum, 2010)
- global differential privacy and histogram estimators (Hall et al., 2011)
- lower bounds for certain 1-D statistics (Chaudhuri & Hsu, 2012)

Questions:
- Can we give a general characterization of the trade-offs between α-privacy and statistical utility?
- Can we identify optimal "mechanisms" for privacy?
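Warner's randomized response, the oldest mechanism on this list, fits in a few lines. In this sketch (the true proportion θ = 0.3 and privacy level α = 1 are illustrative choices), each respondent reports their bit truthfully with probability e^α/(1 + e^α), which makes the report α-differentially private, and the population proportion is recovered by a simple debiasing step.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 1.0
p_keep = np.exp(alpha) / (1 + np.exp(alpha))   # P(report truthfully); the ratio of the
                                               # two response probabilities is e^alpha

n = 200_000
theta = 0.3                                    # true proportion with X_i = 1
x = rng.random(n) < theta                      # private bits
flip = rng.random(n) > p_keep
z = np.where(flip, ~x, x)                      # public, alpha-private reports

# Debias: E[Z] = (1 - p_keep) + theta * (2 * p_keep - 1)
theta_hat = (z.mean() - (1 - p_keep)) / (2 * p_keep - 1)
print(theta_hat)                               # close to 0.3
```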
SLIDE 24 Minimax optimality with α-privacy

- family of distributions {P ∈ F}, and functional P → θ(P)
- samples X_1^n ≡ {X_1, ..., X_n} ~ P and estimator X_1^n → θ̂(X_1^n)
- loss function L (e.g., squared error, 0-1 error, ℓ_1 error): L(θ̂, θ) measures the quality of θ̂ as an estimate of θ

Ordinary minimax risk:

    M_n(F) := inf_{θ̂} sup_{P∈F} E[ L(θ̂(X_1^n), θ(P)) ]

Minimax risk with α-privacy: estimators now depend only on the privatized samples Z_1^n:

    M_n(α; F) := inf_{Q∈Q_α} inf_{θ̂} sup_{P∈F} E[ L(θ̂(Z_1^n), θ(P)) ]

where the outer infimum is over the set Q_α of all α-private channels, i.e., over the best α-private channel.
SLIDE 28 Vignette A: α-private location estimation

Consider estimation of the mean functional θ(P) = E[X] over the family

    F_k := { distributions P such that E[X] ∈ [−1, 1] and E[|X|^k] ≤ 1 }.

For k ≥ 2 in the non-private setting, the sample mean θ̂ = (1/n) Σ_{i=1}^n X_i achieves the rate 1/n.

Theorem. For all k ≥ 2 and α ∈ (0, 1/4], the α-private minimax risk scales as

    M_n(α; F_k) ≍ min{ 1, (1/(α^2 n))^{(k−1)/k} }.

Examples:
- For two moments (k = 2), the rate is reduced from the parametric 1/n to 1/(α√n).
- As k → ∞ (roughly, bounded random variables), the private rate converges to the parametric one, with a pre-factor of 1/α^2.
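The k → ∞ regime (bounded data) can be illustrated with the Laplacian mechanism from the earlier slide. In this simulation sketch (the uniform data distribution and the constants are illustrative choices, not part of the theorem), the mean of the privatized samples is unbiased, and its MSE scales as (Var(X) + 2/α^2)/n, exhibiting the 1/(α^2 n) behavior.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, theta, trials = 0.5, 0.2, 2000

def private_mean(n):
    # Bounded samples with mean theta, privatized by alpha-Laplacian noise
    x = rng.uniform(theta - 0.5, theta + 0.5, size=n)
    w = rng.laplace(scale=1.0 / alpha, size=n)   # density proportional to exp(-alpha|w|)
    return np.mean(x + w)                        # unbiased estimate of theta

scaled_mse = {}
for n in (100, 400, 1600):
    errs = np.array([private_mean(n) - theta for _ in range(trials)])
    scaled_mse[n] = n * np.mean(errs ** 2)       # hovers near Var(X) + 2/alpha^2 = 1/12 + 8
    print(n, scaled_mse[n])
```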
SLIDE 32 Sample size reduction: n → α^2 n

Given an α-private channel, any pair of distributions {P_0, P_1} induces marginals

    M_j^n(A) := ∫ Q^n(A | x_1, ..., x_n) dP_j^n(x_1, ..., x_n)   for j = 0, 1.

Question: How much "contraction" is induced by local α-privacy?

Theorem (Duchi, W. & Jordan, 2013). Given n i.i.d. samples from any α-private channel with α ∈ (0, 1/2], we have

    (1/n) [ D(M_1^n ‖ M_0^n) + D(M_0^n ‖ M_1^n) ] ≤ 4 (e^α − 1)^2 ‖P_1 − P_0‖_TV^2,

where the left-hand side is the normalized symmetrized KL divergence. Note that (e^α − 1)^2 ≤ 2α^2 for α ∈ (0, 1/4].
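For the simplest α-private channel, binary randomized response, the contraction in the theorem can be checked in closed form for a single sample. The sketch below (the Bernoulli parameters 0.3 and 0.6 are arbitrary illustrative choices) computes the marginals of Z under two data distributions and verifies that their symmetrized KL divergence sits below the stated bound.

```python
import numpy as np

def kl_bern(a, b):
    # KL divergence between Bernoulli(a) and Bernoulli(b)
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

alpha = 0.5
p_keep = np.exp(alpha) / (1 + np.exp(alpha))   # binary alpha-private channel (randomized response)

p0, p1 = 0.3, 0.6                              # two candidate data distributions Bern(p_j)
tv = abs(p1 - p0)                              # total variation distance between them

# Marginals of the private output Z under each data distribution
m0 = p0 * p_keep + (1 - p0) * (1 - p_keep)
m1 = p1 * p_keep + (1 - p1) * (1 - p_keep)

sym_kl = kl_bern(m1, m0) + kl_bern(m0, m1)
bound = 4 * (np.exp(alpha) - 1) ** 2 * tv ** 2
print(sym_kl, bound)                           # symmetrized KL is below the bound
```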
SLIDE 34 Vignette B: Non-parametric density estimation

Suppose that we want to estimate the functional P → θ(P) ≡ density f, based on samples X_1, ..., X_n.

Ordinary minimax rates depend on the number of derivatives β > 1/2 of the density f:

    M_n ≍ (1/n)^{2β/(2β+1)}.

(Ibragimov & Hasminskii, 1978; Stone, 1980)
SLIDE 39 Optimal rates for α-private density estimation

Consider density estimation based on α-private views (Z_1, ..., Z_n) of the original samples (X_1, ..., X_n).

Theorem (Duchi, W. & Jordan, 2013). For all privacy levels α ∈ (0, 1/4] and smoothness levels β > 1/2:

    M_n(α) ≍ (1/(α^2 n))^{2β/(2β+2)}.

- A simple, explicit scheme achieves this optimal rate.
- Contrast with the classical rate (1/n)^{2β/(2β+1)}: the penalty for privacy can be significant!

Example: How many samples N(ε) are needed to achieve error ε = 0.01 for Lipschitz densities (β = 1)? Classical case: N ≈ 1,000. Private case: N ≈ 10,000.
SLIDE 42 How to achieve a matching upper bound?

Naive approach: add Laplacian noise directly to the samples:

    Z_i = X_i + W_i, with W_i drawn from the density (α/2) e^{-α|w|}.

This transforms the problem into non-parametric deconvolution.

Lower bound for this mechanism. For any estimator f̂ based on (Z_1, ..., Z_n):

    sup_{f* ∈ F(β)} E[ ‖f̂ − f*‖_2^2 ] ≳ (1/n)^{2β/(2β+5)}.

This follows from known lower bounds for deconvolution (Carroll & Hall, 1988).
SLIDE 46 An optimal mechanism

1. For a given orthonormal basis {φ_j}_{j=1}^∞ of L^2[0, 1], individual i computes the vector

       Φ^D(X_i) := (φ_1(X_i), φ_2(X_i), ..., φ_D(X_i))

   for a dimension D to be chosen.

2. The D-dimensional vector is privatized by a hypercube sampling scheme with E[Z_i | X_i] = Φ^D(X_i).

3. The statistician computes noisy versions of the D basis-expansion coefficients,

       θ̂_j = (1/n) Σ_{i=1}^n Z_{ij},   and forms f̂ = Σ_{j=1}^D θ̂_j φ_j.

Upper bound. For any D ≥ 1, the privatized density estimate satisfies

    E[ ‖f̂ − f‖_2^2 ] ≲ D^2/(n α^2) + 1/D^{2β}.
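A simplified end-to-end sketch of steps 1-3 follows. It is illustrative only: Laplace noise on the D basis evaluations stands in for the hypercube scheme of the next slide, the cosine basis and the Beta(2, 2) target density are arbitrary choices, and the ℓ_1-sensitivity bound used to calibrate the noise is crude.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, n, D = 1.0, 50_000, 5

def phi(j, x):
    # Cosine orthonormal basis of L^2[0,1]: phi_1 = 1, phi_j = sqrt(2) cos((j-1) pi x)
    return np.ones_like(x) if j == 1 else np.sqrt(2.0) * np.cos((j - 1) * np.pi * x)

x = rng.beta(2, 2, size=n)                 # private samples; true density is 6x(1-x) on [0,1]

# Each individual releases the D basis evaluations plus Laplace noise; since each
# |phi_j| <= sqrt(2), the l1-sensitivity of the released vector is at most 2 sqrt(2) D.
sens = 2.0 * np.sqrt(2.0) * D
z = np.stack([phi(j, x) for j in range(1, D + 1)], axis=1)
z = z + rng.laplace(scale=sens / alpha, size=z.shape)

theta_hat = z.mean(axis=0)                 # noisy basis-expansion coefficients
grid = np.linspace(0.0, 1.0, 201)
f_hat = sum(theta_hat[j - 1] * phi(j, grid) for j in range(1, D + 1))
mise = np.mean((f_hat - 6.0 * grid * (1.0 - grid)) ** 2)
print(mise)                                # mean squared error of f_hat on the grid
```

The noise scale grows with D, which is why the upper bound pays a D^2/(nα^2) term and the truncation pays 1/D^{2β}; choosing D to balance the two yields the optimal rate.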
SLIDE 47 Hypercube sampling: Optimal privacy mechanism
V 1 1+eα eα 1+eα Given V = ΦD
1 (X) with V ∞ ≤ C,
form D-dimensional random vector
with prob.
1 2 + Vj 2C
−C with prob.
1 2 − Vj 2C .
Draw T ∼ Ber
1+eα
Z ∼
V > 0
Uni
V ≤ 0
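The first, unbiasedness-preserving step of this scheme (randomized rounding of V to the hypercube vertices) is easy to verify by simulation; the subsequent resampling step driven by T ~ Ber(e^α/(1 + e^α)) supplies the privacy guarantee and is omitted here. The vector v and the constant C below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
C = 2.0
v = np.array([0.7, -1.3, 0.0, 1.9])            # requires ||v||_inf <= C

# Round each coordinate to +C or -C with P(+C) = 1/2 + v_j/(2C), so that
# E[vhat_j] = C * (2 * P(+C) - 1) = v_j exactly.
draws = 200_000
vhat = np.where(rng.random((draws, v.size)) < 0.5 + v / (2.0 * C), C, -C)
vhat_mean = vhat.mean(axis=0)
print(vhat_mean)                               # approaches [0.7, -1.3, 0.0, 1.9]
```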
SLIDE 49 Lower bounds via metric entropy

[Figure: a 2δ-separated packing {f^1, f^2, ..., f^M} of a function class]   Andrey Kolmogorov, 1903-1987

Packing number. Given a metric ρ and a function class F, a 2δ-packing is a collection {f^1, ..., f^M} contained in F such that ρ(f^j, f^k) > 2δ for all j ≠ k.
SLIDE 51 From metric entropy to hypothesis testing

Two-person game:
- Nature chooses a random index J ∈ {1, ..., M}.
- The statistician estimates the density based on n i.i.d. samples from f^J.

Reduction to hypothesis testing. Any estimator f̂ for which ρ(f̂, f^J) < δ with high probability can be used to decode the index J.
SLIDE 53 A quantitative data processing inequality

Markov chain J → X → Z through the channel Q:
- packing index J ∈ {1, 2, ..., M}
- non-private variables (X | J = j) ~ P_j
- mixture distribution P̄ = (1/M) Σ_{j=1}^M P_j

Theorem (Duchi, W. & Jordan, 2013). For any non-interactive α-private channel Q, we have

    I(Z_1, ..., Z_n; J)/n ≤ (e^α − 1)^2 sup_{‖γ‖_∞ ≤ 1} (1/M) Σ_{j=1}^M ( ∫_X γ(x) (dP_j − dP̄)(x) )^2,

a dimension-dependent contraction of the mutual information.
SLIDE 57 High-level and extensions

High-level: the two main theorems are forms of "information contraction":
1. Pairwise contraction: consequences for Le Cam's method.
2. Mutual information contraction: consequences for Fano's method.

Some extensions:
1. Matching rates for linear regression (n → nα^2).
2. Matching rates for multinomial estimation (n → nα^2/d).
3. Convex risk minimization: dimension-dependent effects. Sparse optimization no longer depends logarithmically on dimension.
4. The Laplacian mechanism can be sub-optimal; one needs to consider the geometry of the underlying set.
SLIDE 60 Summary

- Interesting trade-offs between privacy and statistical utility.
- A new notion of locally α-private minimax risk.
- Some general bounds and techniques: bounds on total variation useful for Le Cam's method; bounds on mutual information useful for Fano's method; sharp bounds for several parametric/non-parametric problems.
- Many open problems and issues:
  ◮ benefits of partially local privacy?
  ◮ other models for privacy?
  ◮ privacy for multiple statistical objectives?

Some papers:
- Duchi, W. & Jordan (2013). Local privacy and statistical minimax rates. http://arxiv.org/abs/1302.3203, February 2013.
- W. (2015). Constrained forms of statistical minimax: Computation, communication, and privacy. Proceedings of the International Congress of Mathematicians.