Learning from Untrusted Data
Moses Charikar, Jacob Steinhardt, Gregory Valiant
Symposium on the Theory of Computing, June 19, 2017
(Icon credit: Annie Lin)

Motivation: data poisoning attacks
Question: what concepts can be learned in the presence of arbitrarily corrupted data?

Related Work
- 60 years of work on robust statistics...
PCA:
- XCM ’10, CLMW ’11, CSPW ’11
Mean estimation:
- LRV ’16, DKKLMS ’16, DKKLMS ’17, L ’17, DBS ’17, SCV ’17
Regression:
- NTN ’11, NT ’13, CCM ’13, BJK ’15
Classification:
- FHKP ’09, GR ’09, KLS ’09, ABL ’14
Semi-random graphs:
- FK ’01, C ’07, MMV ’12, S ’17
Other:
- HM ’13, C ’14, C ’16, DKS ’16, SCV ’16
Problem Setting
- Observe n points x1, . . . , xn
- An unknown subset of αn points is drawn i.i.d. from p∗
- The remaining (1 − α)n points are arbitrary
Goal: estimate a parameter of interest θ(p∗)
- assuming p∗ ∈ P (e.g. bounded moments)
- θ(p∗) could be the mean, the best-fit line, a ranking, etc.
New regime: α ≪ 1

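The setting above can be simulated in a few lines. A minimal sketch in numpy, where the Gaussian p∗, the planted-cluster adversary, and all parameter values are illustrative assumptions (none come from the talk):

```python
import numpy as np

# Of n points in R^d, an unknown alpha-fraction is drawn i.i.d. from p*
# (here a Gaussian with mean mu); the remaining points are adversarial.
rng = np.random.default_rng(0)
n, d, alpha = 1000, 5, 0.2
mu = np.ones(d)                      # theta(p*): the parameter to estimate
n_good = int(alpha * n)

good = rng.normal(loc=mu, scale=1.0, size=(n_good, d))
# One simple adversary: a tight far-away cluster that drags the naive
# mean off course.
bad = np.full((n - n_good, d), 10.0)
x = rng.permutation(np.vstack([good, bad]))  # observer sees an unlabeled mix

naive_err = np.linalg.norm(x.mean(axis=0) - mu)
print(naive_err)   # large: with alpha << 1 the naive mean is useless
```

Since the adversary controls a (1 − α)-majority of the points, no amount of averaging or simple trimming recovers µ, which is what motivates the relaxed guarantees on the next slide.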
Why Is This Possible?
If e.g. α = 1/3, estimation seems impossible: the αn good points could be any one of three equal-sized groups. But we can narrow the answer down to 3 possibilities!
List-decodable learning [Balcan, Blum, Vempala ’08]
- output O(1/α) answers, one of which is approximately correct
Semi-verified learning
- observe O(1) verified points from p∗

Why Care?
Practical problem: data poisoning attacks
- How can we build learning algorithms that are provably secure against manipulation?
Fundamental problem in robust statistics
- What can be learned in the presence of arbitrary outliers?
Agnostic learning of mixtures
- When is it possible to learn about one mixture component, with no assumptions about the other components?

Main Theorem
Observed functions: f1, . . . , fn. Want to minimize an unknown target function f̄.
Key quantity: a spectral norm bound on a subset I:

    (1/√|I|) · max_{w ∈ ℝ^d} ‖[∇fi(w) − ∇f̄(w)]_{i∈I}‖_op ≤ S.

Meta-Theorem: Given a spectral norm bound on an unknown subset of αn functions, learning is possible:
- in the semi-verified model (for convex fi)
- in the list-decodable model (for strongly convex fi)
All results are direct corollaries of the meta-theorem!

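To make the key quantity concrete: for mean estimation one can take fi(w) = ½‖w − xi‖₂², so that ∇fi(w) − ∇f̄(w) = µ − xi independently of w, and S is the top singular value of the centered good data scaled by 1/√|I|. A small numpy check, under an assumed Gaussian p∗ with σ = 1 (the quadratic fi and the distribution are illustrative choices, not part of the theorem statement):

```python
import numpy as np

# With f_i(w) = 0.5 * ||w - x_i||_2^2 we have grad f_i(w) = w - x_i and
# grad fbar(w) = w - mu, so the deviation matrix [grad f_i - grad fbar]
# has columns mu - x_i.  S is its operator norm divided by sqrt(|I|).
rng = np.random.default_rng(0)
d, m = 5, 4000                       # m = |I|, the number of good points
mu = np.zeros(d)
good = rng.normal(loc=mu, scale=1.0, size=(m, d))

G = (mu - good).T                    # d x |I| gradient-deviation matrix
S = np.linalg.norm(G, ord=2) / np.sqrt(m)   # ord=2: top singular value
print(S)   # concentrates near sigma = 1 for i.i.d. data
```

For i.i.d. data with bounded covariance this quantity stays O(σ) no matter what the adversary does to the points outside I, which is what makes it a useful handle on the good subset.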
Corollary: Mean Estimation
Setting: a distribution p∗ on ℝ^d with mean µ and bounded 1st moments: E_{p∗}[|⟨x − µ, v⟩|] ≤ σ‖v‖₂ for all v ∈ ℝ^d.
Observe αn samples from p∗ and (1 − α)n arbitrary points; want to estimate µ.
Theorem (Mean Estimation): If αn ≥ d, it is possible to output estimates µ̂1, . . . , µ̂m of the mean µ such that
- m ≤ 2/α, and
- min_{j=1,…,m} ‖µ̂j − µ‖₂ = Õ(σ/√α) w.h.p.
Alternately, it is possible to output a single estimate µ̂ given a single verified point from p∗.

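A toy illustration of why a list of O(1/α) candidates is the right kind of guarantee (this is ordinary k-means on well-separated clusters, not the paper's algorithm; the cluster placement and all constants are assumptions for the demo): with α = 1/3 the good data could be any one of three clusters, so no single estimate can succeed, but one of three candidate means does.

```python
import numpy as np

# Three well-separated clusters of equal size; only the first is "real".
rng = np.random.default_rng(0)
d, n_per = 3, 200
mu = np.array([5.0, 0.0, 0.0])                    # true mean of p*
fakes = [np.array([-5.0, 0.0, 0.0]), np.array([0.0, 5.0, 0.0])]
x = np.vstack([rng.normal(c, 0.5, size=(n_per, d)) for c in [mu] + fakes])

# Farthest-point seeding, then Lloyd iterations with k = 3.
cand = [x[0]]
for _ in range(2):
    dists = np.min([np.linalg.norm(x - c, axis=1) for c in cand], axis=0)
    cand.append(x[np.argmax(dists)])
cand = np.stack(cand)
for _ in range(20):
    labels = np.argmin(((x[:, None] - cand[None]) ** 2).sum(-1), axis=1)
    cand = np.stack([x[labels == j].mean(axis=0) for j in range(3)])

best = min(np.linalg.norm(c - mu) for c in cand)
print(best)   # one of the 3 candidates is close to the true mean
```

A single verified point from p∗ then suffices to pick the right candidate from the list, which is exactly the semi-verified model.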
Comparisons
Mean estimation:

                 Bound        Regime     Assumption    Samples
    LRV ’16      σ√(1 − α)    α > 1 − c  4th moments   d
    DKKLMS ’16   σ(1 − α)     α > 1 − c  sub-Gaussian  d^3
    CSV ’17      σ/√α         α > 0      1st moments   d

Estimating mixtures:

                 Separation     Robust?
    AM ’05       σ(k + 1/√α)    no
    KK ’10       σk             no
    AS ’12       σ√k            no
    CSV ’17      σ/√α           yes

Other Results
Stochastic block model (sparse regime; cf. GV ’14, LLV ’15, RT ’15, RV ’16):

                 Average Degree   Robust?
    GV ’14       1/α^4            no
    AS ’15       1/α^2            no
    CSV ’17      1/α^3            yes

Others:
- discrete product distributions
- exponential families
- ranking

Proof Overview (Mean Estimation)
Recall the goal: given n points x1, . . . , xn, of which αn are drawn from p∗, estimate the mean µ of p∗.
Key tension: balance adversarial and statistical error.
High-level strategy: solve a convex optimization problem
- if the cost is low, estimation succeeds (spectral norm bound)
- if the cost is high, identify and remove outliers

Algorithm
First pass: minimize over µ: Σ_{i=1}^n ‖xi − µ‖₂²
Second pass: minimize over µ1, . . . , µn: Σ_{i=1}^n ‖xi − µi‖₂²
Final pass: minimize over µ1, . . . , µn: Σ_{i=1}^n ‖xi − µi‖₂² + λF(µ1, . . . , µn)
(more generally, Σ_{i=1}^n fi(µi) + λF(µ1, . . . , µn) for losses fi)
Choices for F:
- nuclear norm: error σ/α
- maximum nuclear norm over subsets: error σ/√α (intractable)
- minimum trace ellipsoid: error σ/√α (tractable)
Clean-up: remove outliers, cluster the µi, output the cluster means
- padded decompositions [FRT ’03]

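As a rough sketch of the final pass, consider the first (σ/α-error) choice F = nuclear norm with squared loss; the minimizer then has a closed form via singular-value soft-thresholding. Everything below — the toy adversary, the λ choice, and the diagnostics — is illustrative; the paper's tractable regularizer is the minimum-trace ellipsoid, followed by the clean-up step:

```python
import numpy as np

# min over M of ||X - M||_F^2 + lam * ||M||_* is solved by soft-
# thresholding the singular values of X at lam / 2 (the prox of the
# nuclear norm, applied to the matrix M whose rows are the mu_i).
def nuclear_norm_prox(X, lam):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam / 2.0, 0.0)) @ Vt

rng = np.random.default_rng(0)
n, d, alpha = 300, 10, 0.5
mu = np.full(d, 3.0)
n_good = int(alpha * n)
X = np.vstack([rng.normal(mu, 1.0, size=(n_good, d)),        # good points
               rng.normal(-mu, 1.0, size=(n - n_good, d))])  # toy adversary

M = nuclear_norm_prox(X, lam=2.0 * np.sqrt(n))   # heuristic lam choice
# The rows mu_i of M are pulled toward low-rank structure; the average of
# the good rows of M lands far closer to mu than the naive mean of X.
err_reg = np.linalg.norm(M[:n_good].mean(axis=0) - mu)
err_naive = np.linalg.norm(X.mean(axis=0) - mu)
print(err_reg, err_naive)
```

Clustering the rows µi (the clean-up step above) would then surface µ as one of the cluster means, without ever knowing which points were good.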
Summary
A method for robustness to a large fraction of adversarial data
Can handle arbitrary convex loss functions
- based on a spectral norm bound on the gradients
Strong bounds in many concrete settings
- mixtures, stochastic block model
Open questions:
- Can larger amounts of verified data yield stronger bounds?
- Can we exploit strong convexity / gradient bounds in other norms?
- Can we obtain guarantees in the online setting?

Main Theorem
Meta-Theorem: Let f1, . . . , fn : ℝ^d → ℝ be a collection of κ-strongly convex functions, and let f̄ : ℝ^d → ℝ be an unknown target function minimized at w∗. Suppose there is an (unknown) subset I ⊆ [n] of size αn such that

    (1/√|I|) · max_{w ∈ ℝ^d} ‖[∇fi(w) − ∇f̄(w)]_{i∈I}‖_op ≤ S.

Then there is an algorithm outputting m = 2/α candidates ŵ1, . . . , ŵm such that min_{j=1,…,m} ‖ŵj − w∗‖₂ = Õ(S/(κ√α)).
- Strong convexity can be removed (semi-verified model)