SLIDE 1

Learning from Untrusted Data

Moses Charikar, Jacob Steinhardt, Gregory Valiant

Symposium on the Theory of Computing, June 19, 2017

SLIDES 2–3

Motivation: data poisoning attacks

Question: what concepts can be learned in the presence of arbitrarily corrupted data?

(Icon credit: Annie Lin)

SLIDE 4

Related Work

  • 60 years of work on robust statistics...

PCA:

  • XCM ’10, CLMW ’11, CSPW ’11

Mean estimation:

  • LRV ’16, DKKLMS ’16, DKKLMS ’17, L ’17, DBS ’17, SCV ’17

Regression:

  • NTN ’11, NT ’13, CCM ’13, BJK ’15

Classification:

  • FHKP ’09, GR ’09, KLS ’09, ABL ’14

Semi-random graphs:

  • FK ’01, C ’07, MMV ’12, S ’17

Other:

  • HM ’13, C ’14, C ’16, DKS ’16, SCV ’16


SLIDES 5–9

Problem Setting

Observe n points x1, . . . , xn
Unknown subset of αn points drawn i.i.d. from p∗
Remaining (1 − α)n points are arbitrary

Goal: estimate parameter of interest θ(p∗)

  • assuming p∗ ∈ P (e.g. bounded moments)
  • θ(p∗) could be mean, best fit line, ranking, etc.

New regime: α ≪ 1
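The setting is easy to simulate. Below is a minimal sketch in Python (the Gaussian p∗, the planted fake clusters, and all constants are illustrative assumptions, not from the talk); it also shows why naive averaging fails when α ≪ 1. Later sketches reuse these variables.

```python
import numpy as np

# Minimal sketch of the setting (the Gaussian p* and the planted fake
# clusters are illustrative assumptions, not part of the talk).
rng = np.random.default_rng(0)
n, d, alpha, sigma = 3000, 10, 0.2, 1.0

mu_true = rng.normal(size=d)                            # unknown mean of p*
n_good = int(alpha * n)
good = mu_true + sigma * rng.normal(size=(n_good, d))   # alpha*n i.i.d. samples

# The remaining (1 - alpha)*n points are arbitrary; this adversary plants
# three fake clusters far from mu_true.
fakes = 10 * rng.normal(size=(3, d))
bad = np.vstack([f + sigma * rng.normal(size=((n - n_good) // 3, d))
                 for f in fakes])

X = rng.permutation(np.vstack([good, bad]))             # the n observed points

# The naive empirical mean is dominated by the corrupted points.
print(np.linalg.norm(X.mean(axis=0) - mu_true))         # large
```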

SLIDES 10–15

Why Is This Possible?

If e.g. α = 1/3, estimation seems impossible (the slide's figure showed three equal-size clusters, any of which could be the real data). But we can narrow down to 3 possibilities!

List-decodable learning [Balcan, Blum, Vempala ’08]

  • output O(1/α) answers, one of which is approximately correct

Semi-verified learning

  • observe O(1) verified points from p∗
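A minimal sketch of the list-decodable interface, reusing X, alpha, and mu_true from the sketch above. Plain k-means stands in for the paper's algorithm purely as an illustration: output O(1/α) candidates and judge success by the best one.

```python
from sklearn.cluster import KMeans

k = int(np.ceil(1 / alpha))        # O(1/alpha) candidate answers
candidates = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_

# List-decodable success criterion: the *best* candidate is close to mu_true.
print(min(np.linalg.norm(c - mu_true) for c in candidates))
```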

SLIDES 16–18

Why Care?

Practical problem: data poisoning attacks

  • How can we build learning algorithms that are provably secure against manipulation?

Fundamental problem in robust statistics

  • What can be learned in the presence of arbitrary outliers?

Agnostic learning of mixtures

  • When is it possible to learn about one mixture component, with no assumptions about the other components?

SLIDES 19–22

Main Theorem

Observed functions: f1, . . . , fn
Want to minimize an unknown target function f̄

Key quantity: spectral norm bound on a subset I:

    max_{w ∈ R^d} (1/√|I|) · ‖[∇fi(w) − ∇f̄(w)]_{i∈I}‖_op ≤ S

Meta-Theorem: Given a spectral norm bound on an unknown subset of αn functions, learning is possible:

  • in the semi-verified model (for convex fi)
  • in the list-decodable model (for strongly convex fi)

All results are direct corollaries of the meta-theorem!
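To make the key quantity concrete: for mean estimation with fi(w) = ½‖xi − w‖₂² (a standard instantiation, assumed here), the gradient differences ∇fi(w) − ∇f̄(w) = µ − xi are constant in w, so the max over w is trivial and S reduces to a scaled top singular value of the centered data matrix. A minimal sketch, reusing good and mu_true from the first sketch:

```python
# Gradient differences on the good set I, one per row; for the squared
# loss they do not depend on w.
G = good - mu_true
S = np.linalg.norm(G, 2) / np.sqrt(len(good))   # top singular value, scaled
print(S)                                        # ~ sigma once alpha*n >= d
```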

SLIDES 23–26

Corollary: Mean Estimation

Setting: distribution p∗ on R^d with mean µ and bounded 1st moments:

    E_{p∗}[ |⟨x − µ, v⟩| ] ≤ σ‖v‖₂  for all v ∈ R^d.

Observe αn samples from p∗ and (1 − α)n arbitrary points; want to estimate µ.

Theorem (Mean Estimation): If αn ≥ d, it is possible to output estimates µ̂1, . . . , µ̂m of the mean µ such that

  • m ≤ 2/α, and
  • min_{j≤m} ‖µ̂j − µ‖₂ = Õ(σ/√α) w.h.p.

Alternately, it is possible to output a single estimate µ̂ given a single verified point from p∗.
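A minimal sketch of the semi-verified step; the nearest-candidate rule is an assumed simplification of how a single verified point could be used. Reuses candidates, mu_true, sigma, d, and rng from the sketches above.

```python
verified = mu_true + sigma * rng.normal(size=d)   # one trusted sample from p*

# Keep the candidate closest to the verified point.
mu_hat = min(candidates, key=lambda c: np.linalg.norm(c - verified))
print(np.linalg.norm(mu_hat - mu_true))           # small if the list is good
```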

SLIDES 27–28

Comparisons

Mean estimation:

                Bound        Regime      Assumption     Samples
    LRV ’16     σ√(1 − α)    α > 1 − c   4th moments    d
    DKKLMS ’16  σ(1 − α)     α > 1 − c   sub-Gaussian   d³
    CSV ’17     σ/√α         α > 0       1st moments    d

Estimating mixtures:

                Separation     Robust?
    AM ’05      σ(k + 1/√α)    no
    KK ’10      σk             no
    AS ’12      σ√k            no
    CSV ’17     σ/√α           yes

SLIDES 29–30

Other Results

Stochastic Block Model (sparse regime; cf. GV ’14, LLV ’15, RT ’15, RV ’16):

                Average Degree   Robust?
    GV ’14      1/α⁴             no
    AS ’15      1/α²             no
    CSV ’17     1/α³             yes

Others:

  • discrete product distributions
  • exponential families
  • ranking

SLIDES 31–36

Proof Overview (Mean Estimation)

Recall goal: given n points x1, . . . , xn, with αn drawn from p∗, estimate the mean µ of p∗

Key tension: balance adversarial and statistical error

High-level strategy: solve a convex optimization problem

  • if the cost is low, estimation succeeds (spectral norm bound)
  • if the cost is high, identify and remove outliers (the certify-or-remove pattern is sketched below)
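The certify-or-remove pattern can be illustrated with the classical filtering heuristic from the robust mean estimation literature; this is the large-α variant, not this paper's convex program, and the threshold and weight update below are illustrative assumptions. Certify when the top eigenvalue of the weighted covariance is small; otherwise downweight points that stick out along the top eigenvector. On X from the first sketch it typically fails (α = 0.2 is far too small for filtering), which is exactly why the list-decodable machinery is needed.

```python
def filter_mean(X, sigma, rounds=50):
    """Certify-or-remove loop: classical filtering, needs alpha near 1."""
    w = np.ones(len(X))
    for _ in range(rounds):
        mu = (w[:, None] * X).sum(0) / w.sum()               # weighted mean
        C = (X - mu).T @ (w[:, None] * (X - mu)) / w.sum()   # weighted covariance
        vals, vecs = np.linalg.eigh(C)
        if vals[-1] <= 4 * sigma**2:          # low cost: certified, done
            return mu
        p2 = ((X - mu) @ vecs[:, -1]) ** 2    # spread along worst direction
        w = w * (1 - p2 / p2.max())           # downweight the extremes
    return mu

print(np.linalg.norm(filter_mean(X, sigma) - mu_true))   # fails when alpha << 1
```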

SLIDES 37–42

Algorithm

First pass:  minimize_µ  Σ_{i=1}^n ‖xi − µ‖₂²

Second pass: minimize_{µ1,...,µn}  Σ_{i=1}^n ‖xi − µi‖₂²

Final pass:  minimize_{µ1,...,µn}  Σ_{i=1}^n fi(µi) + λF(µ1, . . . , µn)
             (for mean estimation, fi(µi) = ‖xi − µi‖₂²)

Choices for F:

  • nuclear norm: error σ/α
  • maximum nuclear norm over subsets: error σ/√α (intractable)
  • minimum trace ellipsoid: error σ/√α (tractable)

Clean-up: remove outliers, cluster the µi, output the cluster means (see the sketch below)

  • padded decompositions [FRT ’03]
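A minimal sketch of the final pass with the simplest choice of F, the nuclear norm (the σ/α variant; the paper's σ/√α regularizers are more involved). For the squared loss, minimize_M ½‖X − M‖_F² + λ‖M‖_* has a closed form: soft-threshold the singular values of X. The λ value, and k-means as the clean-up clustering, are illustrative assumptions; reuses X, mu_true, sigma, n, and k from the sketches above.

```python
from sklearn.cluster import KMeans

def svt(X, lam):
    """Closed-form minimizer of 1/2 ||X - M||_F^2 + lam * ||M||_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

M = svt(X, lam=2 * sigma * np.sqrt(n))   # row i is mu_i; pushed toward low rank

# Clean-up (simplified): cluster the mu_i and output the cluster means.
centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(M).cluster_centers_
print(min(np.linalg.norm(c - mu_true) for c in centers))
```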

SLIDES 43–46

Summary

Method for robustness to a large fraction of adversarial data

Can handle arbitrary convex loss functions

  • based on a spectral norm bound on gradients

Strong bounds in many concrete settings

  • mixtures, stochastic block model

Open questions:

  • Can larger amounts of verified data yield stronger bounds?
  • Can we exploit strong convexity / gradient bounds in other norms?
  • Can we obtain guarantees in the online setting?

SLIDES 47–48

Main Theorem

Meta-Theorem: Let f1, . . . , fn : R^d → R be a collection of κ-strongly convex functions, and let f̄ : R^d → R be an unknown target function minimized at w∗. Suppose there is an (unknown) subset I ⊆ [n] of size αn such that

    max_{w ∈ R^d} (1/√|I|) · ‖[∇fi(w) − ∇f̄(w)]_{i∈I}‖_op ≤ S.

Then there is an algorithm outputting m = 2/α candidates ŵ1, . . . , ŵm such that

    min_{j≤m} ‖ŵj − w∗‖₂ = Õ(S/(κ√α)).

  • Can remove strong convexity (semi-verified model)