SLIDE 1 Privacy guarantees in statistical estimation: How to formalize the problem?
Martin Wainwright
UC Berkeley, Departments of Statistics and EECS
van Dantzig Seminar, University of Leiden
Martin Wainwright (UC Berkeley) Privacy and statistics October 2015 1 / 22
SLIDE 5 The modern landscape

Modern data sets are often very large:
- biological data (genes, proteins, etc.)
- medical imaging (MRI, fMRI, etc.)
- astronomy datasets
- social network data
- recommender systems (Amazon, Netflix, etc.)

Statistical considerations interact with:
1. Computational constraints: (low-order) polynomial-time is essential!
2. Communication/storage constraints: distributed implementations are essential.
3. Privacy constraints: tension between hiding and sharing data.
SLIDE 8 From Classical Minimax Risk...

Choose the estimator to minimize the worst-case risk:

    Classical minimax risk = inf_{θ̂_n} sup_{θ∈Ω} E[ L(θ̂_n, θ) ]

Two-party game:
- Nature chooses the parameter θ ∈ Ω in a potentially adversarial manner.
- The statistician takes the infimum over all estimators (X_1, ..., X_n) → θ̂_n ∈ Ω, an arbitrary measurable function.

Abraham Wald, 1902-1950

Classical questions about the minimax risk:
- How fast does it decay as a function of the sample size n?
- How does it depend on dimensionality, smoothness, etc.?
- Can we characterize optimal estimators?
SLIDE 11 ....to Constrained Minimax Risk

The classical framework imposes no constraints on the choice of estimator θ̂_n:
- unbounded memory and computational power;
- centralized access to all n samples;
- data fully revealed: no privacy-preserving properties.

On-going research: statistical minimax with constraints
- Computationally-constrained estimators (e.g., Rigollet & Berthet, 2013; Ma & Wu, 2014; Zhang, W. & Jordan, 2014)
- Communication constraints (e.g., Zhang et al., 2013; Ma et al., 2014; Braverman et al., 2015)
- Privacy constraints (e.g., Dwork, 2006; Hardt & Rothblum, 2010; Hall et al., 2011; Duchi, W. & Jordan, 2013)
SLIDE 14 Why be concerned with privacy?

Many sources of data have both statistical utility and privacy concerns.
[Figures: (a) Personal genome project; (b) Privacy breach, Scientific American, August 2013]

Question: How can we obtain principled tradeoffs between these competing criteria?
SLIDE 15 Basic model of local privacy

[Diagram: private data X_1, X_2, ..., X_n passes through a channel Q(Z_1^n | X_1^n) to produce public data Z_1^n]

- Each individual i ∈ {1, 2, ..., n} has personal data X_i ~ P_{θ*}.
- A conditional distribution Q links the private data X_1^n to the public data Z_1^n.
- An estimator Z_1^n → θ̂ of the unknown parameter θ* is applied to the public data.
SLIDE 16 Local privacy at level α
log Q(· | x) log Q(· | ¯ x) z Log likelihood Definition Conditional distribution Q is locally α-differentially private if e−α ≤ sup
z
Q(z | xn
1)
Q(z | ¯ xn
1) ≤ eα
for all xn
1 and ¯
xn
1 such that dHAM(xn 1, ¯
xn
1) = 1.
(Dwork et al., 2006)
SLIDE 18 Illustration of Laplacian mechanism

Add α-Laplacian noise (Dwork et al., 2006):

    Z = x + W, where W has density ∝ e^{-α|w|}.

For all x, x' ∈ [−1/2, 1/2]:

    sup_z Q(z | x) / Q(z | x') = sup_z e^{α(|z − x'| − |z − x|)} ≤ e^{α|x − x'|} ≤ e^{α}.
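This likelihood-ratio bound can be checked numerically. The sketch below (the grid resolution and the value α = 0.5 are arbitrary illustrative choices) compares Laplace densities centered at all pairs of points in [−1/2, 1/2] and confirms that the worst-case ratio is exactly e^α.

```python
import numpy as np

def laplace_density(z, x, alpha):
    # Density of Z = x + W with W proportional to exp(-alpha |w|)
    return (alpha / 2.0) * np.exp(-alpha * np.abs(z - x))

alpha = 0.5
z = np.linspace(-10, 10, 20001)       # fine grid over the output space
xs = np.linspace(-0.5, 0.5, 21)       # private values in [-1/2, 1/2]

# Worst-case likelihood ratio over all pairs of private values
worst = max(
    np.max(laplace_density(z, x, alpha) / laplace_density(z, xp, alpha))
    for x in xs for xp in xs
)
print(worst, np.exp(alpha))           # worst-case ratio equals e^alpha
```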
SLIDE 21 Various mechanisms for α-privacy

Choices from past work:
- randomized response in survey questions (Warner, 1965)
- Laplacian noise (Dwork et al., 2006)
- exponential mechanism (McSherry & Talwar, 2007)

Some past work on privacy and estimation:
- local differential privacy and PAC learning (Kasiviswanathan et al., 2008)
- linear queries over discrete-valued data sets (Hardt & Rothblum, 2010)
- global differential privacy and histogram estimators (Hall et al., 2011)
- lower bounds for certain 1-D statistics (Chaudhuri & Hsu, 2012)

Questions:
- Can we give a general characterization of the trade-offs between α-privacy and statistical utility?
- Can we identify optimal "mechanisms" for privacy?
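Warner's randomized response, the oldest mechanism on this list, fits in a few lines. In this sketch (the true proportion θ = 0.3 and privacy level α = 1 are illustrative choices), each respondent reports their bit truthfully with probability e^α/(1 + e^α), which makes the report α-differentially private, and the population proportion is recovered by a simple debiasing step.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 1.0
p_keep = np.exp(alpha) / (1 + np.exp(alpha))   # P(report truthfully); the ratio of the
                                               # two response probabilities is e^alpha

n = 200_000
theta = 0.3                                    # true proportion with X_i = 1
x = rng.random(n) < theta                      # private bits
flip = rng.random(n) > p_keep
z = np.where(flip, ~x, x)                      # public, alpha-private reports

# Debias: E[Z] = (1 - p_keep) + theta * (2 * p_keep - 1)
theta_hat = (z.mean() - (1 - p_keep)) / (2 * p_keep - 1)
print(theta_hat)                               # close to 0.3
```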
SLIDE 24 Minimax optimality with α-privacy

- family of distributions {P ∈ F}, and functional P → θ(P)
- samples X_1^n ≡ {X_1, ..., X_n} ~ P and estimator X_1^n → θ̂(X_1^n)
- loss function L (e.g., squared error, 0-1 error, ℓ_1 error): L(θ̂, θ) measures the quality of θ̂ as an estimate of θ

Ordinary minimax risk:

    M_n(F) := inf_{θ̂} sup_{P∈F} E[ L(θ̂(X_1^n), θ(P)) ]

Minimax risk with α-privacy: estimators now depend only on the privatized samples Z_1^n:

    M_n(α; F) := inf_{Q∈Q_α} inf_{θ̂} sup_{P∈F} E[ L(θ̂(Z_1^n), θ(P)) ]

where the outer infimum is over the set Q_α of all α-private channels, i.e., over the best α-private channel.
SLIDE 28 Vignette A: α-private location estimation

Consider estimation of the mean functional θ(P) = E[X] over the family

    F_k := { distributions P such that E[X] ∈ [−1, 1] and E[|X|^k] ≤ 1 }.

For k ≥ 2 in the non-private setting, the sample mean θ̂ = (1/n) Σ_{i=1}^n X_i achieves the rate 1/n.

Theorem. For all k ≥ 2 and α ∈ (0, 1/4], the α-private minimax risk scales as

    M_n(α; F_k) ≍ min{ 1, (1/(α^2 n))^{(k−1)/k} }.

Examples:
- For two moments (k = 2), the rate is reduced from the parametric 1/n to 1/(α√n).
- As k → ∞ (roughly, bounded random variables), the private rate converges to the parametric one, with a pre-factor of 1/α^2.
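The k → ∞ regime (bounded data) can be illustrated with the Laplacian mechanism from the earlier slide. In this simulation sketch (the uniform data distribution and the constants are illustrative choices, not part of the theorem), the mean of the privatized samples is unbiased, and its MSE scales as (Var(X) + 2/α^2)/n, exhibiting the 1/(α^2 n) behavior.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, theta, trials = 0.5, 0.2, 2000

def private_mean(n):
    # Bounded samples with mean theta, privatized by alpha-Laplacian noise
    x = rng.uniform(theta - 0.5, theta + 0.5, size=n)
    w = rng.laplace(scale=1.0 / alpha, size=n)   # density proportional to exp(-alpha|w|)
    return np.mean(x + w)                        # unbiased estimate of theta

scaled_mse = {}
for n in (100, 400, 1600):
    errs = np.array([private_mean(n) - theta for _ in range(trials)])
    scaled_mse[n] = n * np.mean(errs ** 2)       # hovers near Var(X) + 2/alpha^2 = 1/12 + 8
    print(n, scaled_mse[n])
```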
SLIDE 32 Sample size reduction: n → α^2 n

Given an α-private channel, any pair of distributions {P_0, P_1} induces marginals

    M_j^n(A) := ∫ Q^n(A | x_1, ..., x_n) dP_j^n(x_1, ..., x_n)   for j = 0, 1.

Question: How much "contraction" is induced by local α-privacy?

Theorem (Duchi, W. & Jordan, 2013). Given n i.i.d. samples from any α-private channel with α ∈ (0, 1/2], we have

    (1/n) [ D(M_1^n ‖ M_0^n) + D(M_0^n ‖ M_1^n) ] ≤ 4 (e^α − 1)^2 ‖P_1 − P_0‖_TV^2,

where the left-hand side is the normalized symmetrized KL divergence. Note that (e^α − 1)^2 ≤ 2α^2 for α ∈ (0, 1/4].
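For the simplest α-private channel, binary randomized response, the contraction in the theorem can be checked in closed form for a single sample. The sketch below (the Bernoulli parameters 0.3 and 0.6 are arbitrary illustrative choices) computes the marginals of Z under two data distributions and verifies that their symmetrized KL divergence sits below the stated bound.

```python
import numpy as np

def kl_bern(a, b):
    # KL divergence between Bernoulli(a) and Bernoulli(b)
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

alpha = 0.5
p_keep = np.exp(alpha) / (1 + np.exp(alpha))   # binary alpha-private channel (randomized response)

p0, p1 = 0.3, 0.6                              # two candidate data distributions Bern(p_j)
tv = abs(p1 - p0)                              # total variation distance between them

# Marginals of the private output Z under each data distribution
m0 = p0 * p_keep + (1 - p0) * (1 - p_keep)
m1 = p1 * p_keep + (1 - p1) * (1 - p_keep)

sym_kl = kl_bern(m1, m0) + kl_bern(m0, m1)
bound = 4 * (np.exp(alpha) - 1) ** 2 * tv ** 2
print(sym_kl, bound)                           # symmetrized KL is below the bound
```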
SLIDE 34 Vignette B: Non-parametric density estimation

Suppose that we want to estimate the functional P → θ(P) ≡ density f, based on samples X_1, ..., X_n.

Ordinary minimax rates depend on the number of derivatives β > 1/2 of the density f:

    M_n ≍ (1/n)^{2β/(2β+1)}.

(Ibragimov & Hasminskii, 1978; Stone, 1980)
SLIDE 39 Optimal rates for α-private density estimation

Consider density estimation based on α-private views (Z_1, ..., Z_n) of the original samples (X_1, ..., X_n).

Theorem (Duchi, W. & Jordan, 2013). For all privacy levels α ∈ (0, 1/4] and smoothness levels β > 1/2:

    M_n(α) ≍ (1/(α^2 n))^{2β/(2β+2)}.

- A simple, explicit scheme achieves this optimal rate.
- Contrast with the classical rate (1/n)^{2β/(2β+1)}: the penalty for privacy can be significant!

Example: How many samples N(ε) are needed to achieve error ε = 0.01 for Lipschitz densities (β = 1)? Classical case: N ≈ 1,000. Private case: N ≈ 10,000.
SLIDE 42 How to achieve a matching upper bound?

Naive approach: add Laplacian noise directly to the samples:

    Z_i = X_i + W_i, with W_i drawn from the density (α/2) e^{-α|w|}.

This transforms the problem into non-parametric deconvolution.

Lower bound for this mechanism. For any estimator f̂ based on (Z_1, ..., Z_n):

    sup_{f* ∈ F(β)} E[ ‖f̂ − f*‖_2^2 ] ≳ (1/n)^{2β/(2β+5)}.

This follows from known lower bounds for deconvolution (Carroll & Hall, 1988).
SLIDE 46 An optimal mechanism

1. For a given orthonormal basis {φ_j}_{j=1}^∞ of L^2[0, 1], individual i computes the vector

       Φ^D(X_i) := (φ_1(X_i), φ_2(X_i), ..., φ_D(X_i))

   for a dimension D to be chosen.

2. The D-dimensional vector is privatized by a hypercube sampling scheme with E[Z_i | X_i] = Φ^D(X_i).

3. The statistician computes noisy versions of the D basis-expansion coefficients,

       θ̂_j = (1/n) Σ_{i=1}^n Z_{ij},   and forms f̂ = Σ_{j=1}^D θ̂_j φ_j.

Upper bound. For any D ≥ 1, the privatized density estimate satisfies

    E[ ‖f̂ − f‖_2^2 ] ≲ D^2/(n α^2) + 1/D^{2β}.
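A simplified end-to-end sketch of steps 1-3 follows. It is illustrative only: Laplace noise on the D basis evaluations stands in for the hypercube scheme of the next slide, the cosine basis and the Beta(2, 2) target density are arbitrary choices, and the ℓ_1-sensitivity bound used to calibrate the noise is crude.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, n, D = 1.0, 50_000, 5

def phi(j, x):
    # Cosine orthonormal basis of L^2[0,1]: phi_1 = 1, phi_j = sqrt(2) cos((j-1) pi x)
    return np.ones_like(x) if j == 1 else np.sqrt(2.0) * np.cos((j - 1) * np.pi * x)

x = rng.beta(2, 2, size=n)                 # private samples; true density is 6x(1-x) on [0,1]

# Each individual releases the D basis evaluations plus Laplace noise; since each
# |phi_j| <= sqrt(2), the l1-sensitivity of the released vector is at most 2 sqrt(2) D.
sens = 2.0 * np.sqrt(2.0) * D
z = np.stack([phi(j, x) for j in range(1, D + 1)], axis=1)
z = z + rng.laplace(scale=sens / alpha, size=z.shape)

theta_hat = z.mean(axis=0)                 # noisy basis-expansion coefficients
grid = np.linspace(0.0, 1.0, 201)
f_hat = sum(theta_hat[j - 1] * phi(j, grid) for j in range(1, D + 1))
mise = np.mean((f_hat - 6.0 * grid * (1.0 - grid)) ** 2)
print(mise)                                # mean squared error of f_hat on the grid
```

The noise scale grows with D, which is why the upper bound pays a D^2/(nα^2) term and the truncation pays 1/D^{2β}; choosing D to balance the two yields the optimal rate.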
SLIDE 47 Hypercube sampling: Optimal privacy mechanism
V 1 1+eα eα 1+eα Given V = ΦD
1 (X) with V ∞ ≤ C,
form D-dimensional random vector
with prob.
1 2 + Vj 2C
−C with prob.
1 2 − Vj 2C .
Draw T ∼ Ber
1+eα
Z ∼
V > 0
Uni
V ≤ 0
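The first, unbiasedness-preserving step of this scheme (randomized rounding of V to the hypercube vertices) is easy to verify by simulation; the subsequent resampling step driven by T ~ Ber(e^α/(1 + e^α)) supplies the privacy guarantee and is omitted here. The vector v and the constant C below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
C = 2.0
v = np.array([0.7, -1.3, 0.0, 1.9])            # requires ||v||_inf <= C

# Round each coordinate to +C or -C with P(+C) = 1/2 + v_j/(2C), so that
# E[vhat_j] = C * (2 * P(+C) - 1) = v_j exactly.
draws = 200_000
vhat = np.where(rng.random((draws, v.size)) < 0.5 + v / (2.0 * C), C, -C)
vhat_mean = vhat.mean(axis=0)
print(vhat_mean)                               # approaches [0.7, -1.3, 0.0, 1.9]
```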
SLIDE 49 Lower bounds via metric entropy

[Figure: a 2δ-separated packing {f^1, f^2, ..., f^M} of a function class]   Andrey Kolmogorov, 1903-1987

Packing number. Given a metric ρ and a function class F, a 2δ-packing is a collection {f^1, ..., f^M} contained in F such that ρ(f^j, f^k) > 2δ for all j ≠ k.
SLIDE 51 From metric entropy to hypothesis testing

Two-person game:
- Nature chooses a random index J ∈ {1, ..., M}.
- The statistician estimates the density based on n i.i.d. samples from f^J.

Reduction to hypothesis testing. Any estimator f̂ for which ρ(f̂, f^J) < δ with high probability can be used to decode the index J.
SLIDE 53 A quantitative data processing inequality

Markov chain J → X → Z through the channel Q:
- packing index J ∈ {1, 2, ..., M}
- non-private variables (X | J = j) ~ P_j
- mixture distribution P̄ = (1/M) Σ_{j=1}^M P_j

Theorem (Duchi, W. & Jordan, 2013). For any non-interactive α-private channel Q, we have

    I(Z_1, ..., Z_n; J)/n ≤ (e^α − 1)^2 sup_{‖γ‖_∞ ≤ 1} (1/M) Σ_{j=1}^M ( ∫_X γ(x) (dP_j − dP̄)(x) )^2,

a dimension-dependent contraction of the mutual information.
SLIDE 57 High-level and extensions

High-level: the two main theorems are forms of "information contraction":
1. Pairwise contraction: consequences for Le Cam's method.
2. Mutual information contraction: consequences for Fano's method.

Some extensions:
1. Matching rates for linear regression (n → nα^2).
2. Matching rates for multinomial estimation (n → nα^2/d).
3. Convex risk minimization: dimension-dependent effects. Sparse optimization no longer depends logarithmically on dimension.
4. The Laplacian mechanism can be sub-optimal; one needs to consider the geometry of the underlying set.
SLIDE 60 Summary

- Interesting trade-offs between privacy and statistical utility.
- A new notion of locally α-private minimax risk.
- Some general bounds and techniques: bounds on total variation useful for Le Cam's method; bounds on mutual information useful for Fano's method; sharp bounds for several parametric/non-parametric problems.
- Many open problems and issues:
  ◮ benefits of partially local privacy?
  ◮ other models for privacy?
  ◮ privacy for multiple statistical objectives?

Some papers:
- Duchi, W. & Jordan (2013). Local privacy and statistical minimax rates. http://arxiv.org/abs/1302.3203, February 2013.
- W. (2015). Constrained forms of statistical minimax: Computation, communication, and privacy. Proceedings of the International Congress of Mathematicians.