

SLIDE 1

Robustness and Generalization

Huan Xu

The University of Texas at Austin Department of Electrical and Computer Engineering

COLT, June 29, 2010

Joint work with Shie Mannor

SLIDE 2–4

What is Robustness?

  • Robustness is the property that the performance on a training sample and on a similar testing sample is close.

SLIDE 5–7

What is Robustness?

  • Robust decision making/optimization:
  • Consider a general decision problem: find v such that ℓ(v, ξ) is small.
  • If for ξ′ ≈ ξ, ℓ(v, ξ′) is also small, then v is robust to perturbation of the parameter.
  • Robust optimization: $\min_v \max_{\xi' \approx \xi} \ell(v, \xi')$ (a toy numerical sketch appears after this slide).
  • Robustness in machine learning:
  • Robust optimization was introduced to machine learning to handle observation noise (e.g., [Lanckriet et al. 2003]; [Lebret and El Ghaoui 1997]; [Shivaswamy et al. 2006]).
  • It was then discovered that SVM and Lasso can both be rewritten as robust optimization (of the empirical loss), and the RO formulation implies consistency [HX, Caramanis and SM 2009; 2010].
  • This paper formalizes this observation for general learning algorithms.
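A minimal numerical sketch of the robust optimization idea above, $\min_v \max_{\xi' \approx \xi} \ell(v, \xi')$. The quadratic loss, the interval uncertainty set, and the grid search are illustrative assumptions, not part of the talk.

```python
# Robust optimization sketch: minimize the worst-case loss over an
# uncertainty set around the nominal parameter (all values assumed).
import numpy as np

def loss(v, xi):
    return (v - xi) ** 2  # a toy decision loss

xi_nominal = 1.0
radius = 0.3  # uncertainty set: |xi' - xi_nominal| <= radius (assumed)

v_grid = np.linspace(-2.0, 4.0, 601)
xi_grid = np.linspace(xi_nominal - radius, xi_nominal + radius, 61)

# Worst-case loss over the uncertainty set, for each candidate decision v.
worst_case = np.array([max(loss(v, xi) for xi in xi_grid) for v in v_grid])
v_robust = v_grid[np.argmin(worst_case)]
print(f"robust decision v* = {v_robust:.3f}")  # = xi_nominal for this symmetric loss
```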

SLIDE 8

Difference with Stability

Non-stable algorithm: [figure]

SLIDE 9

Difference with Stability

Stable algorithm: [figure]

SLIDE 10

Difference with Stability

Non-robust algorithm: [figure]

SLIDE 11

Difference with Stability

Robust algorithm: [figure]

SLIDE 12–13

Outline

  • 1. Algorithmic Robustness and Generalization Bound
  • 2. Robust Algorithms
  • 3. (Weak) Robustness is Necessary and Sufficient for (Asymptotic) Generalizability

SLIDE 14–16

Notations

  • Training sample set s of n training samples $(s_1, \cdots, s_n)$.
  • Z and H are the set from which each sample is drawn and the hypothesis set, respectively.
  • $A_s$ is the hypothesis learned given training set s.
  • For each hypothesis h ∈ H and a point z ∈ Z, there is an associated loss ℓ(h, z) ∈ [0, M].
  • In supervised learning, we decompose Z = Y × X, and use $\cdot|_x$ and $\cdot|_y$ to denote the x-component and y-component of a point.
  • The covering number of a metric space T: N(ε, T, ρ) (a greedy sketch follows below).
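Since covering numbers recur throughout, here is a greedy estimate of N(ε, T, ρ) for a finite point set under the Euclidean metric; the greedy centers form a valid ε-cover, hence an upper bound on N(ε, T, ρ). The random point set is an illustrative assumption.

```python
# Greedy covering-number estimate for a finite set T (Euclidean metric).
import numpy as np

def greedy_covering_number(points, eps):
    uncovered = list(range(len(points)))
    n_centers = 0
    while uncovered:
        c = uncovered[0]                 # take any uncovered point as a new center
        n_centers += 1
        uncovered = [i for i in uncovered
                     if np.linalg.norm(points[i] - points[c]) > eps]
    return n_centers

rng = np.random.default_rng(0)
T = rng.uniform(0.0, 1.0, size=(500, 2))  # 500 points in the unit square (assumed)
print(greedy_covering_number(T, eps=0.25))
```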
SLIDE 17–20

Motivating example 1: Large Margin Classifier

An algorithm $A_s$ has a margin γ if for $j = 1, \cdots, n$:

$$A_s(x) = A_s(s_j|_x), \quad \forall x:\ \|x - s_j|_x\|_2 < \gamma.$$

Example

Fix γ > 0 and put $K = 2N(\gamma/2, X, \|\cdot\|_2)$ (the factor 2 accounts for the two labels). If $A_s$ has a margin γ, then Z can be partitioned into K disjoint sets, denoted by $\{C_i\}_{i=1}^{K}$, such that if $s_j$ and $z \in Z$ belong to the same $C_i$, then $|\ell(A_s, s_j) - \ell(A_s, z)| = 0$.

SLIDE 21

Motivating example 2: Linear Regression

The norm-constrained linear regression algorithm is

$$A_s = \arg\min_{w \in \mathbb{R}^m:\ \|w\|_2 \le c}\ \sum_{i=1}^{n} \bigl|s_i|_y - w^{\top} s_i|_x\bigr|. \tag{0.1}$$

Example

Fix ε > 0 and let $K = N(\epsilon/2, X, \|\cdot\|_2) \times N(\epsilon/2, Y, |\cdot|)$. Consider the norm-constrained linear regression algorithm as in (0.1). The set Z can be partitioned into K disjoint sets, such that if $s_j$ and $z \in Z$ belong to the same $C_i$, then $|\ell(A_s, s_j) - \ell(A_s, z)| \le (c + 1)\epsilon$.
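Below is a sketch of solving (0.1) by projected subgradient descent; the method, step sizes, iteration count, and data are all assumptions for illustration, not the talk's prescribed solver.

```python
# Norm-constrained least absolute deviations, as in (0.1):
# minimize sum_i |y_i - w^T x_i| subject to ||w||_2 <= c.
import numpy as np

def norm_constrained_lad(X, y, c, steps=2000):
    n, m = X.shape
    w = np.zeros(m)
    for t in range(1, steps + 1):
        residuals = X @ w - y
        g = X.T @ np.sign(residuals)      # subgradient of the absolute loss
        w -= g / (np.sqrt(t) * n)         # diminishing step size (assumed schedule)
        norm = np.linalg.norm(w)
        if norm > c:                      # project back onto the L2 ball of radius c
            w *= c / norm
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 0.05 * rng.normal(size=100)
print(norm_constrained_lad(X, y, c=1.0))
```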

SLIDE 22–23

Algorithmic Robustness

Definition (Algorithmic Robustness)

Algorithm A is (K, ε(s))-robust if

  • Z can be partitioned into K disjoint sets, denoted by $\{C_i\}_{i=1}^{K}$;
  • such that ∀s ∈ s,

$$s, z \in C_i \Longrightarrow |\ell(A_s, s) - \ell(A_s, z)| \le \epsilon(s). \tag{0.2}$$

Remark:

  • The definition requires that the loss on a testing sample "similar to" a training sample is close to the loss on that training sample (an empirical check is sketched below).
  • The property jointly depends on the solution of the algorithm and the training set.
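A sketch of estimating ε(s) empirically for a given partition: take the largest loss gap between each training sample and any test point sharing its cell. The partition, loss, and data below are illustrative assumptions.

```python
# Empirical estimate of eps(s) in definition (0.2) for a fixed partition.
import numpy as np

def robustness_eps(train, test, cell, loss):
    """Max over training s and test z in the same cell of |loss(s) - loss(z)|."""
    eps = 0.0
    for s in train:
        cs, ls = cell(s), loss(s)
        for z in test:
            if cell(z) == cs:
                eps = max(eps, abs(ls - loss(z)))
    return eps

rng = np.random.default_rng(1)
train = rng.uniform(0.0, 1.0, size=20)
test = rng.uniform(0.0, 1.0, size=200)
gamma = 0.1
cell = lambda z: int(z / gamma)            # partition [0, 1] into gamma-width cells
loss = lambda z: min((z - 0.5) ** 2, 1.0)  # a bounded toy loss, M = 1 (assumed)
print(robustness_eps(train, test, cell, loss))
```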

SLIDE 24–25

Generalization property of robust algorithms – the main theorem

Theorem

Let $\hat{\ell}(\cdot)$ and $\ell_{\mathrm{emp}}(\cdot)$ denote the expected loss and the training loss. If s consists of n i.i.d. samples, and A is (K, ε(s))-robust, then for any δ > 0, with probability at least 1 − δ,

$$\bigl|\hat{\ell}(A_s) - \ell_{\mathrm{emp}}(A_s)\bigr| \le \epsilon(s) + M\sqrt{\frac{2K\ln 2 + 2\ln(1/\delta)}{n}}.$$

Remark: The bound depends on the partitioning of the sample space.
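To make the bound concrete, here is a small calculation plugging assumed illustrative values of K, ε(s), M, and δ into the theorem; the K ln 2 term means the bound only becomes meaningful once n is large relative to K.

```python
# Evaluate the main theorem's bound
#   |l_hat(A_s) - l_emp(A_s)| <= eps + M * sqrt((2K ln 2 + 2 ln(1/delta)) / n)
# for assumed illustrative values of K, eps, M, and delta.
import math

def robustness_bound(K, eps, M, n, delta):
    return eps + M * math.sqrt((2 * K * math.log(2) + 2 * math.log(1 / delta)) / n)

for n in (10_000, 100_000, 1_000_000):
    print(n, round(robustness_bound(K=256, eps=0.05, M=1.0, n=n, delta=0.01), 3))
```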

SLIDE 26–28

Proof of the Main Theorem

  • Let $N_i$ be the set of indices of points of s that fall into $C_i$. Then $(|N_1|, \cdots, |N_K|)$ is a multinomial random variable with parameters n and $(\mu(C_1), \cdots, \mu(C_K))$.
  • The Bretagnolle–Huber–Carol inequality gives

$$\Pr\Bigl(\,\sum_{i=1}^{K}\Bigl|\frac{|N_i|}{n} - \mu(C_i)\Bigr| \ge \lambda\Bigr) \le 2^{K}\exp\Bigl(-\frac{n\lambda^{2}}{2}\Bigr).$$

  • Hence, with probability at least 1 − δ,

$$\sum_{i=1}^{K}\Bigl|\frac{|N_i|}{n} - \mu(C_i)\Bigr| \le \sqrt{\frac{2K\ln 2 + 2\ln(1/\delta)}{n}}. \tag{0.3}$$
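The step from the tail inequality to (0.3) is just inverting the bound; spelled out (a standard computation, not on the original slide):

```latex
% Set the right-hand side equal to \delta and solve for \lambda:
2^{K} e^{-n\lambda^{2}/2} = \delta
\iff \frac{n\lambda^{2}}{2} = K\ln 2 + \ln(1/\delta)
\iff \lambda = \sqrt{\frac{2K\ln 2 + 2\ln(1/\delta)}{n}}.
```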

SLIDE 29–31

Proof of the Main Theorem (Cont.)

Furthermore,

$$
\begin{aligned}
\bigl|\hat{\ell}(A_s) - \ell_{\mathrm{emp}}(A_s)\bigr|
&= \Bigl|\sum_{i=1}^{K} \mathbb{E}\bigl[\ell(A_s, z) \mid z \in C_i\bigr]\,\mu(C_i) - \frac{1}{n}\sum_{i=1}^{n} \ell(A_s, s_i)\Bigr| \\
&\le \Bigl|\sum_{i=1}^{K} \mathbb{E}\bigl[\ell(A_s, z) \mid z \in C_i\bigr]\,\frac{|N_i|}{n} - \frac{1}{n}\sum_{i=1}^{n} \ell(A_s, s_i)\Bigr| \\
&\quad + \Bigl|\sum_{i=1}^{K} \mathbb{E}\bigl[\ell(A_s, z) \mid z \in C_i\bigr]\,\mu(C_i) - \sum_{i=1}^{K} \mathbb{E}\bigl[\ell(A_s, z) \mid z \in C_i\bigr]\,\frac{|N_i|}{n}\Bigr|
\end{aligned}
$$

  • The first term is bounded by $\frac{1}{n}\sum_{i=1}^{K}\sum_{j \in N_i} \max_{z_2 \in C_i} |\ell(A_s, s_j) - \ell(A_s, z_2)| \le \epsilon(s)$.
  • The second term is bounded by $\max_{z \in Z}|\ell(A_s, z)|\,\sum_{i=1}^{K} \bigl|\frac{|N_i|}{n} - \mu(C_i)\bigr| \le M \sum_{i=1}^{K} \bigl|\frac{|N_i|}{n} - \mu(C_i)\bigr|$.
  • Combining the two bounds with (0.3) proves the theorem.
SLIDE 32–34

Additional Results: Pseudo Robustness

  • Robustness – "similar performance" around each training sample.
  • Pseudo robustness – "similar performance" around some training samples:

Definition (Pseudo robustness)

Algorithm A is (K, ε(s), n̂(s)) pseudo robust if

  • Z can be partitioned into K disjoint sets, denoted by $\{C_i\}_{i=1}^{K}$,
  • and there exists a subset ŝ of the training samples with |ŝ| = n̂(s),
  • such that ∀s ∈ ŝ: $s, z \in C_i \Longrightarrow |\ell(A_s, s) - \ell(A_s, z)| \le \epsilon(s)$.

SLIDE 35–36

Additional Results: Pseudo Robustness

Theorem

If s consists of n i.i.d. samples, and A is (K, ε(s), n̂(s)) pseudo robust, then for any δ > 0, with probability at least 1 − δ,

$$\bigl|\hat{\ell}(A_s) - \ell_{\mathrm{emp}}(A_s)\bigr| \le \frac{\hat{n}(s)}{n}\,\epsilon(s) + M\Bigl(\frac{n - \hat{n}(s)}{n} + \sqrt{\frac{2K\ln 2 + 2\ln(1/\delta)}{n}}\Bigr).$$

  • The additional term is due to the "non-robust" training samples.
slide-37
SLIDE 37

Outline

  • 1. Algorithmic Robustness and Generalization Bound
  • 2. Robust Algorithms
  • 3. (Weak) Robustness is Necessary and Sufficient to

(Asymptotic) Generalizability

SLIDE 38

Which algorithms are robust?

Example (Majority Voting)

Let Y = {−1, +1}. Partition X into $C_1, \cdots, C_K$, and use C(x) to denote the set to which x belongs. A new sample $x_a \in X$ is labeled by

$$A_s(x_a) = \begin{cases} 1, & \text{if } \sum_{s_i \in C(x_a)} \mathbf{1}(s_i|_y = 1) \ge \sum_{s_i \in C(x_a)} \mathbf{1}(s_i|_y = -1); \\ -1, & \text{otherwise}. \end{cases}$$

If the loss function is $\ell(A_s, z) = f(z|_y, A_s(z|_x))$ for some function f, then MV is (2K, 0)-robust (a sketch follows below).
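A sketch of the majority-voting rule above: a new point takes the majority label of training points in its cell, with ties going to +1 to match the "≥" in the definition. The 1-D partition and data are illustrative assumptions.

```python
# Majority voting over a fixed partition of X.
import numpy as np

def majority_vote_predict(xa, train_x, train_y, cell):
    votes = [y for x, y in zip(train_x, train_y) if cell(x) == cell(xa)]
    # ">=" breaks ties (and empty cells) toward +1, as in the definition
    return 1 if sum(v == 1 for v in votes) >= sum(v == -1 for v in votes) else -1

rng = np.random.default_rng(0)
train_x = rng.uniform(0.0, 1.0, size=50)
train_y = np.where(train_x > 0.5, 1, -1)   # a toy labeling rule (assumed)
cell = lambda x: int(x * 10)               # partition [0, 1] into 10 cells
print(majority_vote_predict(0.73, train_x, train_y, cell))  # -> 1
```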

SLIDE 39–41

Which algorithms are robust?

Theorem

Fix γ > 0 and a metric ρ on Z. Suppose A satisfies

$$|\ell(A_s, z_1) - \ell(A_s, z_2)| \le \epsilon(s) \quad \forall z_1, z_2:\ z_1 \in s,\ \rho(z_1, z_2) \le \gamma,$$

and $N(\gamma/2, Z, \rho) < \infty$. Then A is $\bigl(N(\gamma/2, Z, \rho),\ \epsilon(s)\bigr)$-robust.

Example (Lipschitz continuous functions)

If Z is compact w.r.t. metric ρ and $\ell(A_s, \cdot)$ is Lipschitz continuous with Lipschitz constant c(s), i.e., $|\ell(A_s, z_1) - \ell(A_s, z_2)| \le c(s)\,\rho(z_1, z_2)$ for all $z_1, z_2 \in Z$, then A is $\bigl(N(\gamma/2, Z, \rho),\ c(s)\gamma\bigr)$-robust for all γ > 0 (a tuning sketch follows below).

  • Similarly, SVM, Lasso, feed-forward neural networks and PCA are robust.
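Since the Lipschitz example is robust for every γ, one can tune γ to trade ε(s) = c(s)γ against K = N(γ/2, Z, ρ) in the main bound. A sketch for Z = [0, 1], where a γ/2-cover uses length-γ intervals so N(γ/2, Z) ≈ ⌈1/γ⌉; the values of c, M, n, and δ are assumptions.

```python
# Tune gamma to minimize eps + M * sqrt((2K ln 2 + 2 ln(1/delta)) / n)
# with eps = c * gamma and K = ceil(1/gamma) on Z = [0, 1].
import math

def bound(gamma, c=2.0, M=1.0, n=100_000, delta=0.01):
    K = math.ceil(1.0 / gamma)   # covering number of [0, 1] at radius gamma/2
    return c * gamma + M * math.sqrt((2 * K * math.log(2) + 2 * math.log(1 / delta)) / n)

best = min((bound(g), g) for g in (0.5, 0.2, 0.1, 0.05, 0.02, 0.01))
print(f"best bound {best[0]:.4f} at gamma = {best[1]}")
```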

SLIDE 42

Which algorithms are robust?

A large margin classifier is a classification rule such that most of the training samples are "far away" from the classification boundary. We denote the distance of a point x to a classification rule ∆ by D(x, ∆).

Example (Large-margin classifier)

If there exist γ and n̂ such that

$$\sum_{i=1}^{n} \mathbf{1}\bigl(D(s_i|_x, A_s) > \gamma\bigr) \ge \hat{n},$$

then algorithm A is $(2N(\gamma/2, X, \rho),\ 0,\ \hat{n})$ pseudo robust, provided that $N(\gamma/2, X, \rho) < \infty$.

SLIDE 43

Outline

  • 1. Algorithmic Robustness and Generalization Bound
  • 2. Robust Algorithms
  • 3. (Weak) Robustness is Necessary and Sufficient for (Asymptotic) Generalizability

SLIDE 44–46

(Asymptotic) generalizability

Finite sample bound → asymptotic property

Definition

  • 1. A learning algorithm A generalizes w.r.t. s if

$$\limsup_{n}\ \Bigl\{\mathbb{E}_t\bigl[\ell(A_{s(n)}, t)\bigr] - \frac{1}{n}\sum_{i=1}^{n} \ell(A_{s(n)}, s_i)\Bigr\} \le 0.$$

  • 2. A learning algorithm A generalizes w.p. 1 if it generalizes w.r.t. almost every s.

SLIDE 47–49

Weak robustness

Robustness → weak robustness

  • Robustness requires that the sample space can be partitioned into disjoint subsets such that if a testing sample belongs to the same partitioning set as a training sample, then they have similar loss.
  • Weak robustness generalizes this notion by considering the average loss of testing and training samples: if for a large (in the probabilistic sense) subset of $Z^n$ the testing error is close to the training error, then the algorithm is weakly robust.

SLIDE 50

Weak robustness (cont.)

Definition

  • 1. A learning algorithm A is weakly robust w.r.t. s if there exists a sequence of sets $\{\mathcal{D}_n \subseteq Z^n\}$ such that $\Pr(t(n) \in \mathcal{D}_n) \to 1$, where t(n) are n i.i.d. testing samples, and

$$\limsup_{n}\ \Bigl\{\max_{\hat{s}(n) \in \mathcal{D}_n}\ \Bigl[\frac{1}{n}\sum_{i=1}^{n} \ell(A_{s(n)}, \hat{s}_i) - \frac{1}{n}\sum_{i=1}^{n} \ell(A_{s(n)}, s_i)\Bigr]\Bigr\} \le 0.$$

  • 2. A learning algorithm A is a.s. weakly robust if it is weakly robust w.r.t. almost every s.

SLIDE 51

All Learning is Robust!

Theorem

  • 1. An algorithm A generalizes w.r.t. s if and only if it is weakly robust w.r.t. s.
  • 2. An algorithm A generalizes w.p. 1 if and only if it is a.s. weakly robust.

SLIDE 52–53

Conclusion

Summary:

  • Propose algorithmic robustness.
  • Present a finite sample bound based on algorithmic robustness.
  • Show that weak robustness is necessary and sufficient for generalizability.

Future Directions:

  • Adaptive partitions?
  • Other robust algorithms?
  • Better rates?