SLIDE 1

Resilience: A Criterion for Learning in the Presence of Arbitrary Outliers

Jacob Steinhardt, Moses Charikar, Gregory Valiant

ITCS 2018 · January 14, 2018

SLIDE 2

Motivation: Robust Learning

Question: What concepts can be learned robustly, even if some of the data is arbitrarily corrupted?

SLIDE 3

Example: Mean Estimation

Problem: Given data x1, . . . , xn ∈ ℝ^d, of which (1 − ε)n come from a distribution p* (and the remaining εn are arbitrary outliers), estimate the mean µ of p*.

Issue: high dimensions.

SLIDE 8

Mean Estimation: Gaussian Example

Suppose the clean data is Gaussian: xi ∼ N(µ, I).

[Figure: a Gaussian with mean µ and variance 1 in each coordinate; the annotations mark √d, the typical distance of a sample from µ, and ε√d, the shift of the mean that the outliers can induce.]

    ‖xi − µ‖2 ≈ √(1² + · · · + 1²) = √d

Cannot filter independently even if we know the true density!
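
The obstruction is easy to see numerically. Below is a minimal numpy sketch (the dimensions, ε, and the specific outlier placement are our own illustrative choices): εn outliers whose every coordinate looks like a typical N(0,1) value still shift the empirical mean by ≈ ε√d.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eps = 400, 20000, 0.05
mu = np.zeros(d)                           # true mean

n_bad = int(eps * n)
good = rng.normal(size=(n - n_bad, d))     # x_i ~ N(mu, I): ||x_i - mu||_2 ~ sqrt(d)
v = np.ones(d) / np.sqrt(d)                # unit vector with coordinates 1/sqrt(d)
bad = np.tile(np.sqrt(d) * v, (n_bad, 1))  # outliers at distance sqrt(d) from mu,
                                           # yet every coordinate equals 1, a
                                           # perfectly typical N(0, 1) value
X = np.vstack([good, bad])

shift = X.mean(axis=0) - mu
print(np.linalg.norm(shift))               # ~ eps * sqrt(d) = 1.0: grows with d
print(np.abs(shift).max())                 # ~ eps per coordinate: nothing to filter
```

Coordinate by coordinate, each outlier is indistinguishable from a clean sample, which is why no per-coordinate filter can remove them.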

SLIDE 12

History

Progress in high dimensions came only recently:

  • Tukey median [1975]: robust, but NP-hard to compute
  • Donoho estimator [1982]: high error
  • [DKKLMS16, LRV16]: first dimension-independent error bounds
  • large body of work since then [CSV17, DKKLMS17, L17, DBS17]
  • many other problems, including PCA [XCM10], regression [NTN11], classification [FHKP09], etc.

SLIDE 14

This Talk

Question: What general and simple properties enable robust estimation?

New information-theoretic criterion: resilience.

SLIDE 16

Resilience

Suppose {xi}i∈S is a set of points in ℝ^d.

Definition (Resilience): A set S is (σ, ε)-resilient in a norm ‖·‖ around a point µ if, for all subsets T ⊆ S of size at least (1 − ε)|S|,

    ‖ (1/|T|) ∑_{i∈T} (xi − µ) ‖ ≤ σ.

Intuition: all large subsets have similar mean.
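
To make the definition concrete, here is a small numpy sketch (the function name and the direction-sampling heuristic are ours, not from the talk) that numerically lower-bounds the ℓ2 resilience parameter σ of a finite point set. For a fixed unit direction v, the subset T of size (1 − ε)|S| that maximizes the projected mean shift simply keeps the points with the largest projections ⟨xi − µ, v⟩, so maximizing over sampled directions gives a valid lower bound on σ.

```python
import numpy as np

def resilience_lower_bound(X, eps, mu, n_dirs=500, seed=0):
    """Lower-bound the sigma for which X is (sigma, eps)-resilient
    around mu in the ell_2 norm, by sampling directions.

    The true sigma is a supremum over all unit directions, so
    sampling finitely many of them can only undershoot it.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = int(np.ceil((1 - eps) * n))      # size of the subsets T
    best = 0.0
    for _ in range(n_dirs):
        v = rng.normal(size=d)
        v /= np.linalg.norm(v)
        proj = (X - mu) @ v
        # worst T for this direction: keep the k largest projections
        best = max(best, np.sort(proj)[-k:].mean())
    return best

# For Gaussian data the bound is small even though individual points
# sit at distance ~ sqrt(d) from the mean.
X = np.random.default_rng(1).normal(size=(5000, 50))
print(resilience_lower_bound(X, eps=0.1, mu=np.zeros(50)))
```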

SLIDE 17

Main Result

Let S ⊆ ℝ^d be a set of (1 − ε)n “good” points. Let S_out be a set of εn arbitrary outliers. We observe S̃ = S ∪ S_out.

Theorem: If S is (σ, ε/(1−ε))-resilient around µ, then it is possible to output µ̂ such that ‖µ̂ − µ‖ ≤ 2σ. In fact, outputting the center of any resilient subset of S̃ of size (1 − ε)n will work!

SLIDE 18

Pigeonhole Argument

Claim: If S and S′ are (σ, ε/(1−ε))-resilient around µ and µ′ respectively, and each has size (1 − ε)n, then ‖µ − µ′‖ ≤ 2σ.

Proof:

  • Let µ_{S∩S′} be the mean of S ∩ S′.
  • By pigeonhole, |S ∩ S′| ≥ (1 − ε/(1−ε))|S′|.
  • Then ‖µ′ − µ_{S∩S′}‖ ≤ σ by resilience.
  • Similarly, ‖µ − µ_{S∩S′}‖ ≤ σ.
  • The claim follows by the triangle inequality.
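
Spelled out, the two steps are (a short derivation consistent with the definitions above, using |S̃| = n):

```latex
% pigeonhole: S and S' both sit inside \tilde{S}, which has n points
|S \cap S'| \;\ge\; |S| + |S'| - |\tilde{S}| \;=\; (1 - 2\epsilon)\,n
            \;=\; \Bigl(1 - \tfrac{\epsilon}{1-\epsilon}\Bigr)\,|S'|.

% hence S \cap S' is a valid subset T in the resilience definition for
% both S and S', and the triangle inequality gives
\|\mu - \mu'\| \;\le\; \|\mu - \mu_{S \cap S'}\| + \|\mu_{S \cap S'} - \mu'\|
               \;\le\; \sigma + \sigma \;=\; 2\sigma.
```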

SLIDE 23

Implication: Mean Estimation

Lemma: If a dataset has bounded covariance, it is (O(√ε), ε)-resilient in the ℓ2-norm.

Proof sketch: If εn points were ≫ 1/√ε away from the mean, they would force the variance to be ≫ 1. Therefore deleting εn points changes the mean by at most ≈ ε · 1/√ε = √ε.

Corollary: If the clean data has bounded covariance, its mean can be estimated to ℓ2-error O(√ε) in the presence of εn outliers.

Corollary: If the clean data has bounded kth moments, its mean can be estimated to ℓ2-error O(ε^{1−1/k}) in the presence of εn outliers.
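
To fill in the constants behind the proof sketch, here is the deletion argument written out (our derivation, assuming the normalization Cov(S) ⪯ I and writing R = S ∖ T for the deleted points, |R| ≤ ε|S|):

```latex
% deleting R shifts the mean of the remainder T by
\mu_T - \mu \;=\; -\frac{|R|}{|T|}\,(\mu_R - \mu).

% for any unit vector v, Cauchy--Schwarz and the covariance bound
% \frac{1}{|S|} \sum_{i \in S} \langle x_i - \mu, v \rangle^2 \le 1 give
\langle \mu_R - \mu,\, v \rangle
  \;\le\; \sqrt{\frac{1}{|R|} \sum_{i \in R} \langle x_i - \mu, v \rangle^2}
  \;\le\; \sqrt{\frac{|S|}{|R|}}.

% combining the two, with |T| \ge (1 - \epsilon)|S|:
\|\mu_T - \mu\|_2 \;\le\; \frac{|R|}{|T|} \sqrt{\frac{|S|}{|R|}}
  \;=\; \frac{\sqrt{|R|\,|S|}}{|T|}
  \;\le\; \frac{\sqrt{\epsilon}}{1 - \epsilon} \;=\; O(\sqrt{\epsilon}).
```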

SLIDE 27

Implication: Learning Discrete Distributions

Suppose we observe samples from a distribution π on {1, . . . , m}. Samples come in r-tuples, which are either all good or all outliers.

Corollary: The distribution π can be estimated (in TV distance) to error O(ε√(log(1/ε)/r)) in the presence of εn outliers.

  • follows from resilience in the ℓ1-norm (see the simulation sketch below)
  • see also [Qiao & Valiant, 2018] later in this session!
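
Where the 1/√r factor comes from: averaging within a tuple shrinks the ℓ1 fluctuations of each good "data point" (now a tuple-level empirical distribution) by about √r. The following simulation is our own illustration, not the paper's estimator; it measures the ℓ1 resilience of the good tuples exactly, by enumerating the sign vectors {±1}^m that are the extreme points of the dual ball of the ℓ1-norm.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
m, N, eps = 8, 2000, 0.1
pi = np.full(m, 1 / m)                   # true distribution: uniform on {1,...,m}
k = int(np.ceil((1 - eps) * N))          # size of the retained subsets T
signs = np.array(list(itertools.product([-1, 1], repeat=m)))  # dual directions

for r in [1, 4, 16, 64]:
    counts = rng.multinomial(r, pi, size=N)  # each row: one good r-tuple
    P = counts / r                           # tuple-level empirical distributions
    proj = (P - pi) @ signs.T                # projections onto every sign vector
    # for each direction, the worst subset T keeps the k largest projections
    sigma = np.sort(proj, axis=0)[-k:].mean(axis=0).max()
    print(r, round(sigma, 4))                # shrinks roughly like 1/sqrt(r)
```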

SLIDE 30

A Majority of Outliers

Can also handle the case where the clean set has size only αn (α < 1/2):

  • cover S̃ by resilient sets
  • at least one set S′ in the cover must have high overlap with S...
  • ...and hence ‖µ′ − µ‖ ≤ 2σ, as before
  • this gives recovery in the list-decodable model [BBV08]

SLIDE 35

Implication: Stochastic Block Models

Set of αn good and (1 − α)n bad vertices:

  • good ↔ good: dense (avg. degree a)
  • good ↔ bad: sparse (avg. degree b)
  • bad ↔ bad: arbitrary

Question: when can the good set be recovered (in terms of α, a, b)?

SLIDE 39

Implication: Stochastic Block Models

Using resilience in a “truncated ℓ1-norm”, we can show:

Corollary: The set of good vertices can be approximately recovered whenever (a − b)²/a ≫ log(2/α)/α².

Matches the Kesten-Stigum threshold up to log factors!

For planted clique (a = n, b = n/2), this recovers cliques of size Ω(√(n log n)).

  • this is tight [S'17]

SLIDE 42

Algorithmic Results

Can (sometimes) turn information-theoretic results into algorithmic ones. Most existing algorithmic results rely on bounded covariance. We show:

  • for strongly convex norms, resilient sets can be “pruned” to have bounded covariance
  • if the injective norm is approximable, bounded covariance → efficient algorithm with O(√ε) error (a sketch of the filtering route appears below)
  • both conditions hold for ℓp-norms (p ∈ [2, ∞])!

See [Li, 2017] and [Du, Balakrishnan, & Singh, 2017] for a non-ℓp norm.
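
For intuition, here is a minimal sketch of spectral filtering, the standard way the robust-estimation literature turns bounded covariance into an efficient mean estimator (this is the generic approach, not the paper's specific pruning procedure; the stopping threshold and removal rate are arbitrary sketch choices, and practical filters instead remove points randomly with probability proportional to their scores):

```python
import numpy as np

def filtered_mean(X, eps, cov_bound=1.0, max_iter=50):
    """Spectral-filtering sketch for robust mean estimation.

    Assumes the clean (1 - eps)-fraction of rows has covariance
    bounded by cov_bound * I. While the empirical covariance has a
    large eigenvalue, outliers must be responsible for it, so we
    drop the points with the largest projections onto the top
    eigenvector.
    """
    X = np.asarray(X, dtype=float)
    for _ in range(max_iter):
        mu = X.mean(axis=0)
        Z = X - mu
        evals, evecs = np.linalg.eigh(Z.T @ Z / len(X))
        lam, v = evals[-1], evecs[:, -1]       # top eigenpair
        if lam <= 4 * cov_bound:               # covariance small: mu is trustworthy
            return mu
        scores = (Z @ v) ** 2
        k = max(1, len(X) // 100)              # shave off the top 1% of scores
        X = X[np.argsort(scores)[:-k]]
    return X.mean(axis=0)

# usage: clean N(0, I) data plus an eps-fraction of coordinated outliers
rng = np.random.default_rng(0)
d, n, eps = 50, 5000, 0.1
X = rng.normal(size=(n, d))
X[: int(eps * n)] = 3.0                        # outliers: the all-3s point
print(np.linalg.norm(X.mean(axis=0)))          # naive mean: badly shifted
print(np.linalg.norm(filtered_mean(X, eps)))   # filtered mean: close to 0
```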

SLIDE 46

Other Results

  • Finite-sample bounds
  • Extension to SVD

SLIDE 47

Summary

Information-theoretic criterion (resilience) yielding (tight?) robust recovery bounds.

  • based on simple pigeonhole arguments

Benefit: turns the statistical problem into an algorithmic one.

Open questions:

  • resilience for other problems (e.g. regression)
  • efficient algorithms under other assumptions
  • matching lower bounds?
