SLIDE 1

Generalization theory

Daniel Hsu

Columbia TRIPODS Bootcamp

SLIDE 2

Motivation

SLIDE 3

Support vector machines

$\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{-1, +1\}$.

◮ Return solution $\hat{w} \in \mathbb{R}^d$ to the following optimization problem:
$$\min_{w \in \mathbb{R}^d} \ \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^n [1 - y_i w^\top x_i]_+ .$$
◮ The loss function is the hinge loss $\ell(\hat{y}, y) = [1 - \hat{y}y]_+ = \max\{1 - \hat{y}y, 0\}$. (Here, we are okay with a real-valued prediction.)
◮ The $\frac{\lambda}{2}\|w\|_2^2$ term is called Tikhonov regularization, which we'll discuss later.
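The objective above is easy to state in code. A minimal sketch assuming NumPy; `svm_objective` is a name chosen here, not from the slides:

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """SVM objective: (lam/2)*||w||_2^2 + average hinge loss over (X, y)."""
    margins = y * (X @ w)                      # y_i * w^T x_i
    hinge = np.maximum(0.0, 1.0 - margins)     # [1 - y_i w^T x_i]_+
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```

Note that at $w = 0$ every margin is 0, so the objective value is exactly 1, a fact used later in the regularization analysis.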


SLIDE 5

Basic statistical model for data

IID model of data
◮ Training data and test example are independent and identically distributed $(\mathcal{X} \times \mathcal{Y})$-valued random variables: $(X_1, Y_1), \dots, (X_n, Y_n), (X, Y) \sim_{\text{iid}} P$.

SVM in the iid model
◮ Return solution $\hat{w}$ to the following optimization problem:
$$\min_{w \in \mathbb{R}^d} \ \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^n [1 - Y_i w^\top X_i]_+ .$$
◮ Therefore, $\hat{w}$ is a random variable, depending on $(X_1, Y_1), \dots, (X_n, Y_n)$.


SLIDE 7

Convergence of empirical risk

For $w$ that does not depend on the training data, the empirical risk
$$R_n(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top X_i, Y_i)$$
is a sum of iid random variables. The Law of Large Numbers gives an asymptotic result:
$$R_n(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top X_i, Y_i) \xrightarrow{p} \mathbb{E}[\ell(w^\top X, Y)] = R(w).$$
(This can be made non-asymptotic.)
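The convergence of $R_n(w)$ for a fixed $w$ can be seen in simulation. A sketch assuming NumPy; the data model (standard normal features, uniformly random labels) is an arbitrary choice for the demo, under which the true hinge risk of $w = (1, -1)$ works out analytically to about 1.1996:

```python
import numpy as np

# LLN illustration: the empirical hinge risk of a FIXED w (chosen without
# looking at the data) approaches the true risk R(w) as n grows.
rng = np.random.default_rng(0)
w = np.array([1.0, -1.0])

def empirical_hinge_risk(n):
    X = rng.normal(size=(n, 2))                 # demo distribution: N(0, I_2)
    y = rng.choice([-1.0, 1.0], size=n)         # labels independent of X
    return np.maximum(0.0, 1.0 - y * (X @ w)).mean()

risks = [empirical_hinge_risk(n) for n in (100, 10_000, 1_000_000)]
```

Here $y\,w^\top X \sim N(0, 2)$, so $R(w) = \mathbb{E}[(1 - Z)_+] \approx 1.1996$ for $Z \sim N(0, 2)$, and the larger samples land progressively closer to it.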


SLIDE 9

Uniform convergence of empirical risk

However, $\hat{w}$ does depend on the training data, so the empirical risk of $\hat{w}$ is not a sum of iid random variables:
$$R_n(\hat{w}) = \frac{1}{n}\sum_{i=1}^n \ell(\hat{w}^\top X_i, Y_i).$$
Idea: $\hat{w}$ could conceivably take any value $w$, but if
$$\sup_w |R_n(w) - R(w)| \xrightarrow{p} 0, \qquad (1)$$
then $R_n(\hat{w}) \xrightarrow{p} R(\hat{w})$ as well. (1) is called uniform convergence.

SLIDE 10

Detour: Concentration inequalities



SLIDE 13

Symmetric random walk

Rademacher random variables $\varepsilon_1, \dots, \varepsilon_n$ iid with $P(\varepsilon_i = -1) = P(\varepsilon_i = +1) = 1/2$. Symmetric random walk: position after $n$ steps is $S_n = \sum_{i=1}^n \varepsilon_i$.

How far from the origin?
◮ By independence, $\operatorname{var}(S_n) = \sum_{i=1}^n \operatorname{var}(\varepsilon_i) = n$.
◮ So the expected distance from the origin is $\mathbb{E}|S_n| \le \sqrt{\operatorname{var}(S_n)} = \sqrt{n}$.

How many realizations are $\gg \sqrt{n}$ from the origin?
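The $\sqrt{n}$ scaling is easy to see by simulation. A sketch assuming NumPy (the sample sizes are arbitrary demo choices):

```python
import numpy as np

# Simulate the symmetric random walk S_n = eps_1 + ... + eps_n with
# Rademacher steps, and check that E|S_n| is on the order of sqrt(n).
rng = np.random.default_rng(0)
n, trials = 10_000, 2_000
steps = rng.choice([-1, 1], size=(trials, n))
S_n = steps.sum(axis=1)
mean_abs_distance = np.abs(S_n).mean()   # near sqrt(2n/pi), i.e. ~0.8*sqrt(n)
```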


SLIDE 15

Markov's inequality

For any random variable $X$ and any $t \ge 0$, $P(|X| \ge t) \le \frac{\mathbb{E}|X|}{t}$.

◮ Proof: $t \cdot \mathbf{1}\{|X| \ge t\} \le |X|$; take expectations of both sides.

Application to the symmetric random walk:
$$P(|S_n| \ge c\sqrt{n}) \le \frac{\mathbb{E}|S_n|}{c\sqrt{n}} \le \frac{1}{c}.$$
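Markov's bound is valid but loose here, which a simulation makes concrete. A sketch assuming NumPy; the parameter values are demo choices:

```python
import numpy as np

# Compare the Markov bound P(|S_n| >= c*sqrt(n)) <= 1/c against the
# simulated tail probability of the symmetric random walk.
rng = np.random.default_rng(1)
n, trials, c = 2_500, 20_000, 2.0
S = rng.choice([-1, 1], size=(trials, n)).sum(axis=1)
empirical_tail = np.mean(np.abs(S) >= c * np.sqrt(n))
markov_bound = 1.0 / c   # = 0.5; the true tail is about 0.046 by the CLT
```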


SLIDE 17

Hoeffding's inequality

If $X_1, \dots, X_n$ are independent random variables, with $X_i$ taking values in $[a_i, b_i]$, then for any $t \ge 0$,
$$P\left(\sum_{i=1}^n (X_i - \mathbb{E}X_i) \ge t\right) \le \exp\left(\frac{-2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$
E.g., Rademacher random variables have $[a_i, b_i] = [-1, +1]$, so $P(S_n \ge t) \le \exp(-2t^2/(4n)) = \exp(-t^2/(2n))$.
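The exponential tail bound can be checked against simulation for the Rademacher sum. A sketch assuming NumPy; $n$, the trial count, and the threshold $t = 2\sqrt{n}$ are demo choices:

```python
import numpy as np

# Check Hoeffding's bound for the Rademacher sum: with [a_i, b_i] = [-1, 1],
# P(S_n >= t) <= exp(-t^2 / (2n)).
rng = np.random.default_rng(2)
n, trials = 400, 50_000
S = rng.choice([-1, 1], size=(trials, n)).sum(axis=1)
t = 2.0 * np.sqrt(n)
empirical_tail = np.mean(S >= t)            # about P(Z >= 2) ~ 0.023
hoeffding_bound = np.exp(-t**2 / (2 * n))   # exp(-2) ~ 0.135
```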




SLIDE 21

Applying Hoeffding's inequality to the symmetric random walk

Union bound: For any events $A$ and $B$, $P(A \cup B) \le P(A) + P(B)$.

1. Apply Hoeffding to $\varepsilon_1, \dots, \varepsilon_n$: $P(S_n \ge c\sqrt{n}) \le \exp(-c^2/2)$.
2. Apply Hoeffding to $-\varepsilon_1, \dots, -\varepsilon_n$: $P(-S_n \ge c\sqrt{n}) \le \exp(-c^2/2)$.
3. Therefore, by the union bound,
$$P(|S_n| \ge c\sqrt{n}) \le 2\exp(-c^2/2).$$
(Compare to the bound from Markov's inequality: $1/c$.)
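The gap between the two bounds is easy to tabulate. A small sketch (the grid of $c$ values is a demo choice):

```python
import math

# Compare the two bounds on P(|S_n| >= c*sqrt(n)):
# Markov gives 1/c; Hoeffding plus the union bound gives 2*exp(-c^2/2).
rows = []
for c in (1.0, 2.0, 3.0, 4.0):
    rows.append((c, 1.0 / c, 2.0 * math.exp(-c**2 / 2)))
```

For large $c$ the Hoeffding bound is dramatically smaller (at $c = 4$: about $6.7 \times 10^{-4}$ versus $0.25$), though at $c = 1$ Markov's bound is actually the smaller of the two.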

SLIDE 22

Equivalent form of Hoeffding's inequality

Let $X_1, \dots, X_n$ be independent random variables, with $X_i$ taking values in $[a_i, b_i]$, and let $S_n = \sum_{i=1}^n X_i$. For any $\delta \in (0,1)$,
$$P\left(S_n - \mathbb{E}[S_n] < \sqrt{\frac{1}{2}\sum_{i=1}^n (b_i - a_i)^2 \,\ln(1/\delta)}\right) \ge 1 - \delta.$$
This is a "high probability" upper bound on $S_n - \mathbb{E}[S_n]$.
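The "inversion" producing this form can be spelled out: set the one-sided Hoeffding tail bound equal to $\delta$ and solve for $t$.

```latex
\delta = \exp\!\left(\frac{-2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)
\;\Longleftrightarrow\;
\ln(1/\delta) = \frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}
\;\Longleftrightarrow\;
t = \sqrt{\tfrac{1}{2}\sum_{i=1}^n (b_i - a_i)^2 \,\ln(1/\delta)} .
```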

SLIDE 23

Uniform convergence: Finite classes



SLIDE 26

Back to statistical learning

Cast of characters:
◮ feature and outcome spaces: $\mathcal{X}$, $\mathcal{Y}$
◮ function class: $\mathcal{F} \subset \mathcal{Y}^{\mathcal{X}}$
◮ loss function: $\ell \colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ (assume bounded by 1)
◮ training and test data: $(X_1, Y_1), \dots, (X_n, Y_n), (X, Y) \sim_{\text{iid}} P$

We let $\hat{f} \in \arg\min_{f \in \mathcal{F}} R_n(f)$ be a minimizer of the empirical risk
$$R_n(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i).$$
Our worry: over-fitting, i.e., $R(\hat{f}) \gg R_n(\hat{f})$.



SLIDE 29

Convergence of empirical risk for a fixed function

For any fixed function $f \in \mathcal{F}$,
$$\mathbb{E}[R_n(f)] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i)\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[\ell(f(X_i), Y_i)] = R(f).$$
Since $R_n(f)$ is a sum of $n$ independent $[0, \frac{1}{n}]$-valued random variables,
$$P\big(|R_n(f) - R(f)| \ge t\big) \le 2\exp\left(\frac{-2t^2}{\sum_{i=1}^n (1/n)^2}\right) = 2\exp(-2nt^2)$$
for any $t > 0$, by Hoeffding's inequality and the union bound.

This argument does not apply to $\hat{f}$, because $\hat{f}$ depends on $(X_1, Y_1), \dots, (X_n, Y_n)$.



SLIDE 32

Uniform convergence

We cannot directly apply Hoeffding's inequality to $\hat{f}$, since its empirical risk $R_n(\hat{f})$ is not an average of iid random variables. One possible solution: ensure the empirical risk of every $f \in \mathcal{F}$ is close to its expected value. This is called uniform convergence.

◮ How much data is needed to ensure this?


SLIDE 34

Uniform convergence for all functions in a finite class

If $|\mathcal{F}| < \infty$, then by Hoeffding's inequality and the union bound,
$$P\big(\exists f \in \mathcal{F} \text{ s.t. } |R_n(f) - R(f)| \ge t\big) = P\left(\bigcup_{f \in \mathcal{F}} \big\{|R_n(f) - R(f)| \ge t\big\}\right) \le \sum_{f \in \mathcal{F}} P\big(|R_n(f) - R(f)| \ge t\big) \le |\mathcal{F}| \cdot 2\exp(-2nt^2).$$
Choose $t$ so that the RHS is $\delta$, and "invert".

Theorem. For any $\delta \in (0,1)$,
$$P\left(\forall f \in \mathcal{F} : |R_n(f) - R(f)| < \sqrt{\frac{\ln(2|\mathcal{F}|/\delta)}{2n}}\right) \ge 1 - \delta.$$
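The deviation bound in the theorem is a one-liner worth playing with; note how mildly it grows with the class size (only through the logarithm). A sketch; `finite_class_bound` is a name chosen here:

```python
import math

def finite_class_bound(n, size_f, delta):
    """Uniform deviation bound sqrt(ln(2|F|/delta) / (2n)) for a finite class F."""
    return math.sqrt(math.log(2 * size_f / delta) / (2 * n))
```

For example, squaring the class size from $10^3$ to $10^6$ functions inflates the bound by well under a factor of 2.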


SLIDE 36

What we get from uniform convergence

If $n \gg \log|\mathcal{F}|$, then with high probability, no function $f \in \mathcal{F}$ will over-fit the training data.

Also: an empirical risk minimizer (ERM), like $\hat{f}$, is near optimal!

Theorem. With probability at least $1 - \delta$,
$$R(\hat{f}) - R(f^*) = \underbrace{R(\hat{f}) - R_n(\hat{f})}_{\le\,\epsilon} + \underbrace{R_n(\hat{f}) - R_n(f^*)}_{\le\,0} + \underbrace{R_n(f^*) - R(f^*)}_{\le\,\epsilon} \le 2\epsilon,$$
where $f^* \in \arg\min_{f \in \mathcal{F}} R(f)$ and $\epsilon = \sqrt{\frac{\ln(2|\mathcal{F}|/\delta)}{2n}}$.

SLIDE 37

Uniform convergence: General case



SLIDE 40

Uniform convergence: General case

Let $\mathcal{F} \subset \mathbb{R}^{\mathcal{X}}$ be a class of real-valued functions, and let $P$ be a probability distribution on $\mathcal{X}$. Notation:
◮ Let $Pf = \mathbb{E}[f(X)]$ for $X \sim P$.
◮ Let $P_n$ be the empirical distribution on $X_1, \dots, X_n \sim_{\text{iid}} P$, which assigns probability mass $1/n$ to each $X_i$.
◮ So $P_n f = \frac{1}{n}\sum_{i=1}^n f(X_i)$.

We are interested in the maximum (or supremum) deviation:
$$\sup_{f \in \mathcal{F}} |P_n f - Pf|.$$
The arguments from before show that for any finite class of bounded functions $\mathcal{F}$, $\sup_{f \in \mathcal{F}} |P_n f - Pf| \xrightarrow{p} 0$, and also give a non-asymptotic rate of convergence.



SLIDE 43

Infinite classes

For which classes $\mathcal{F} \subset \mathbb{R}^{\mathcal{X}}$ does uniform convergence hold?

Example: $\mathcal{F} = \{f_S(x) = \mathbf{1}\{x \in S\} : S \subset \mathbb{R}, |S| < \infty\}$, i.e., $\{0,1\}$-valued functions that take value 1 on a finite set.
◮ If $P$ is continuous, then $Pf = 0$ for all $f \in \mathcal{F}$.
◮ But $\sup_{f \in \mathcal{F}} P_n f = 1$ for all $n$ (take $S = \{X_1, \dots, X_n\}$).
◮ So $\sup_{f \in \mathcal{F}} |P_n f - Pf| = 1$ for all $n$.

What is the appropriate "complexity" measure of a function class?


SLIDE 45

Rademacher complexity

Let $\varepsilon_1, \dots, \varepsilon_n$ be independent Rademacher random variables. Uniform convergence for $\mathcal{F}$ holds iff
$$\lim_{n\to\infty} \underbrace{\mathbb{E}\,\mathbb{E}_\varepsilon\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\right|\right]}_{=\,\mathrm{Rad}_n(\mathcal{F})} = 0$$
(where $\mathbb{E}_\varepsilon$ is expectation with respect to $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$).

$\mathrm{Rad}_n(\mathcal{F})$ is the Rademacher complexity of $\mathcal{F}$, which measures how well vectors in the (random) set $\mathcal{F}(X_{1:n}) = \{(f(X_1), \dots, f(X_n)) : f \in \mathcal{F}\}$ can correlate with uniformly random signs $\varepsilon_1, \dots, \varepsilon_n$.
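For a finite class, the inner expectation over signs can be estimated by Monte Carlo: represent $\mathcal{F}$ by the matrix of value vectors $\mathcal{F}(X_{1:n})$ and average the best correlation with random signs. A sketch assuming NumPy; the function name is ours:

```python
import numpy as np

def rademacher_complexity(values, trials=5_000, seed=0):
    """Monte Carlo estimate of E_eps sup_f |(1/n) sum_i eps_i f(X_i)|.

    `values` has one row per function f, holding (f(X_1), ..., f(X_n)).
    """
    rng = np.random.default_rng(seed)
    n = values.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(trials, n))
    correlations = eps @ values.T / n       # shape: trials x num_functions
    return np.abs(correlations).max(axis=1).mean()
```

A single $\pm 1$-valued function gives roughly $\sqrt{2/(\pi n)} \le 1/\sqrt{n}$, while the class of all sign vectors on $n$ points gives exactly 1, matching the two extreme cases discussed in these slides.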



SLIDE 48

Extreme cases of Rademacher complexity

For simplicity, assume $X_1, \dots, X_n$ are distinct (e.g., $P$ continuous).

◮ $\mathcal{F}$ contains a single function $f_0 \colon \mathcal{X} \to \{-1,+1\}$:
$$\mathrm{Rad}_n(\mathcal{F}) = \mathbb{E}\,\mathbb{E}_\varepsilon\left[\left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f_0(X_i)\right|\right] \le \frac{1}{\sqrt{n}}.$$
◮ $\mathcal{F}$ contains all functions $\mathcal{X} \to \{-1,+1\}$:
$$\mathrm{Rad}_n(\mathcal{F}) = \mathbb{E}\,\mathbb{E}_\varepsilon\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\right|\right] = 1.$$

SLIDE 49

Uniform convergence via Rademacher complexity

Theorem.
1. Uniform convergence in expectation: for any $\mathcal{F} \subset \mathbb{R}^{\mathcal{X}}$,
$$\mathbb{E}\left[\sup_{f\in\mathcal{F}} |P_n f - Pf|\right] \le 2\,\mathrm{Rad}_n(\mathcal{F}).$$
2. Uniform convergence with high probability: for any $\mathcal{F} \subset [-1,+1]^{\mathcal{X}}$ and $\delta \in (0,1)$, with probability $\ge 1-\delta$,
$$\sup_{f\in\mathcal{F}} |P_n f - Pf| \le 2\,\mathrm{Rad}_n(\mathcal{F}) + \sqrt{\frac{2\ln(1/\delta)}{n}}.$$



SLIDE 52

Step 1: Symmetrization by "ghost sample"

Let $P'_n$ be the empirical distribution on independent copies $X'_1, \dots, X'_n$ of $X_1, \dots, X_n$. Write $\mathbb{E}'$ for expectation with respect to $X'_{1:n}$.

Then
$$\mathbb{E}\Big[\sup_{f\in\mathcal{F}} |P_n f - Pf|\Big] = \mathbb{E}\Big[\sup_{f\in\mathcal{F}} \Big|\mathbb{E}'\Big[\tfrac{1}{n}\sum_{i=1}^n \big(f(X_i) - f(X'_i)\big)\Big]\Big|\Big] \le \mathbb{E}\,\mathbb{E}'\Big[\sup_{f\in\mathcal{F}} \Big|\tfrac{1}{n}\sum_{i=1}^n \big(f(X_i) - f(X'_i)\big)\Big|\Big] = \mathbb{E}\,\mathbb{E}'\Big[\sup_{f\in\mathcal{F}} |P_n f - P'_n f|\Big].$$
The random variable $P_n f - P'_n f$ is arguably nicer than $P_n f - Pf$ because it is symmetric.


SLIDE 54

Step 2: Symmetrization by random signs

Consider any $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n) \in \{-1,+1\}^n$. The distribution of
$$P_n f - P'_n f = \frac{1}{n}\sum_{i=1}^n \big(f(X_i) - f(X'_i)\big)$$
is the same as the distribution of
$$\frac{1}{n}\sum_{i=1}^n \varepsilon_i \big(f(X_i) - f(X'_i)\big).$$
Thus, this is also true for the uniform average over all $\varepsilon \in \{-1,+1\}^n$ (i.e., expectation over Rademacher $\varepsilon$):
$$\mathbb{E}\,\mathbb{E}'\Big[\sup_{f\in\mathcal{F}} |P_n f - P'_n f|\Big] = \mathbb{E}\,\mathbb{E}'\,\mathbb{E}_\varepsilon\Big[\sup_{f\in\mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \big(f(X_i) - f(X'_i)\big)\Big|\Big].$$


SLIDE 56

Step 3: Back to a single sample

By the triangle inequality,
$$\sup_{f\in\mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \big(f(X_i) - f(X'_i)\big)\Big| \le \sup_{f\in\mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\Big| + \sup_{f\in\mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X'_i)\Big|.$$
The two terms on the RHS have the same distribution. So
$$\mathbb{E}\,\mathbb{E}'\,\mathbb{E}_\varepsilon\Big[\sup_{f\in\mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \big(f(X_i) - f(X'_i)\big)\Big|\Big] \le 2\,\mathbb{E}\,\mathbb{E}_\varepsilon\Big[\sup_{f\in\mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\Big|\Big] = 2\,\mathrm{Rad}_n(\mathcal{F}).$$
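The end-to-end inequality $\mathbb{E}[\sup_f |P_n f - Pf|] \le 2\,\mathrm{Rad}_n(\mathcal{F})$ can be checked numerically on a small finite class. A sketch assuming NumPy; the class of threshold functions and the uniform distribution are our choices for the demo:

```python
import numpy as np

# Numeric check of E[sup_f |P_n f - P f|] <= 2 * Rad_n(F) for the class of
# threshold functions f_t(x) = 1{x <= t}, t on a small grid, with
# P = Uniform[0, 1] so that P f_t = t exactly.
rng = np.random.default_rng(0)
n, trials = 50, 2_000
thresholds = np.linspace(0.1, 0.9, 9)

sup_devs, sign_correlations = [], []
for _ in range(trials):
    X = rng.uniform(size=n)
    values = (X[None, :] <= thresholds[:, None]).astype(float)  # F(X_{1:n})
    sup_devs.append(np.abs(values.mean(axis=1) - thresholds).max())
    eps = rng.choice([-1.0, 1.0], size=n)
    sign_correlations.append(np.abs(values @ eps / n).max())

lhs = np.mean(sup_devs)                # ~ E[sup_f |P_n f - P f|]
rhs = 2 * np.mean(sign_correlations)   # ~ 2 * Rad_n(F)
```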



SLIDE 59

Recap

For any $\mathcal{F} \subset \mathbb{R}^{\mathcal{X}}$,
$$\mathbb{E}\left[\sup_{f\in\mathcal{F}} |P_n f - Pf|\right] \le 2\,\mathrm{Rad}_n(\mathcal{F}).$$
For any $\mathcal{F} \subset [-1,+1]^{\mathcal{X}}$ and $\delta \in (0,1)$, with probability $\ge 1-\delta$,
$$\sup_{f\in\mathcal{F}} |P_n f - Pf| \le 2\,\mathrm{Rad}_n(\mathcal{F}) + \sqrt{\frac{2\ln(1/\delta)}{n}}.$$
Conclusion: if $\mathrm{Rad}_n(\mathcal{F}) \to 0$, then uniform convergence holds. (Can also show: if uniform convergence holds, then $\mathrm{Rad}_n(\mathcal{F}) \to 0$.)

SLIDE 60

Analysis of SVM



SLIDE 63

Loss class

Back to classes of prediction functions $\mathcal{F} \subset \mathbb{R}^{\mathcal{X}}$. Consider a loss function $\ell \colon \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$ that satisfies $\ell(0, y) \le 1$ for all $y \in \mathcal{Y}$, and is 1-Lipschitz in its first argument: for all $\hat{y}, \hat{y}' \in \mathbb{R}$,
$$|\ell(\hat{y}, y) - \ell(\hat{y}', y)| \le |\hat{y} - \hat{y}'|.$$
(Example: hinge loss $\ell(\hat{y}, y) = [1 - \hat{y}y]_+$.)

Define the associated loss class by $\ell_{\mathcal{F}} = \{(x, y) \mapsto \ell(f(x), y) : f \in \mathcal{F}\}$. Then
$$\mathrm{Rad}_n(\ell_{\mathcal{F}}) \le 2\,\mathrm{Rad}_n(\mathcal{F}) + \sqrt{\frac{2\ln 2}{n}}.$$
So uniform convergence holds for $\ell_{\mathcal{F}}$ if it holds for $\mathcal{F}$.


SLIDE 65

Rademacher complexity of linear predictors

Linear functions $\mathcal{F}_{\text{lin}} = \{x \mapsto w^\top x : w \in \mathbb{R}^d\}$. What is the Rademacher complexity of $\mathcal{F}_{\text{lin}}$?
$$\mathrm{Rad}_n(\mathcal{F}_{\text{lin}}) = \mathbb{E}\,\mathbb{E}_\varepsilon\left[\sup_{w\in\mathbb{R}^d}\left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i w^\top X_i\right|\right].$$
Inside the $\mathbb{E}\,\mathbb{E}_\varepsilon$:
$$\sup_{w\in\mathbb{R}^d}\left|w^\top\left(\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_i\right)\right| = \sup_{w\in\mathbb{R}^d} \|w\|_2 \left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_i\right\|_2.$$
As long as $\sum_{i=1}^n \varepsilon_i X_i \ne 0$, this is unbounded! :-(



SLIDE 68

Regularization

Recall the SVM optimization problem:
$$\min_{w\in\mathbb{R}^d} \ \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^n [1 - y_i w^\top x_i]_+ .$$
The objective value at $w = 0$ is 1, so the objective value at the minimizer $\hat{w}$ is no worse than this:
$$\frac{\lambda}{2}\|\hat{w}\|_2^2 + \frac{1}{n}\sum_{i=1}^n [1 - y_i \hat{w}^\top x_i]_+ \le 1.$$
Therefore $\|\hat{w}\|_2^2 \le \frac{2}{\lambda}$.
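The norm bound can be sanity-checked on synthetic data. The crude subgradient-descent solver below is our own sketch (not the slides' algorithm, and the data model is a demo choice); the key fact is that any $w$ whose objective value is at most 1 must satisfy $\|w\|_2^2 \le 2/\lambda$:

```python
import numpy as np

# Crude subgradient descent on the SVM objective, then check the bound
# ||w||_2^2 <= 2/lambda implied by objective(w) <= objective(0) = 1.
rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=n))   # nearly separable labels

w = np.zeros(d)
for _ in range(2_000):
    active = y * (X @ w) < 1                      # nonzero hinge subgradient
    grad = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
    w -= 0.1 * grad

objective = 0.5 * lam * (w @ w) + np.maximum(0.0, 1 - y * (X @ w)).mean()
```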



SLIDE 71

Rademacher complexity of bounded linear predictors

Bounded linear functions $\mathcal{F}_{\ell_2,B} = \{x \mapsto w^\top x : \|w\|_2 \le B\}$. What is the Rademacher complexity of $\mathcal{F}_{\ell_2,B}$?
$$\mathrm{Rad}_n(\mathcal{F}_{\ell_2,B}) = \mathbb{E}\,\mathbb{E}_\varepsilon\left[\sup_{\|w\|_2 \le B}\left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i w^\top X_i\right|\right] = B\,\mathbb{E}\,\mathbb{E}_\varepsilon\left[\sup_{\|u\|_2 \le 1}\left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i u^\top X_i\right|\right] = B\,\mathbb{E}\,\mathbb{E}_\varepsilon\left[\left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_i\right\|_2\right] \le B\sqrt{\mathbb{E}\,\mathbb{E}_\varepsilon\left[\left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_i\right\|_2^2\right]}.$$
This is a $d$-dimensional random walk, where the $i$-th step is $\pm X_i$.


SLIDE 73

Rademacher complexity of bounded linear predictors (2)

$$\mathbb{E}\,\mathbb{E}_\varepsilon\left[\left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_i\right\|_2^2\right] = \frac{1}{n^2}\,\mathbb{E}\,\mathbb{E}_\varepsilon\left[\sum_{i=1}^n \|X_i\|_2^2 + \sum_{i\ne j} \varepsilon_i\varepsilon_j X_i^\top X_j\right] = \frac{1}{n^2}\,\mathbb{E}\left[\sum_{i=1}^n \|X_i\|_2^2\right] = \frac{1}{n}\,\mathbb{E}\|X\|_2^2.$$
Conclusion: the Rademacher complexity of $\mathcal{F}_{\ell_2,B} = \{w \in \mathbb{R}^d : \|w\|_2 \le B\}$ satisfies
$$\mathrm{Rad}_n(\mathcal{F}_{\ell_2,B}) \le B\sqrt{\frac{\mathbb{E}\|X\|_2^2}{n}}.$$
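This bound is tight up to a modest constant, which Monte Carlo makes visible. A sketch assuming NumPy; the choice $X \sim N(0, I_d)$ (so $\mathbb{E}\|X\|_2^2 = d$) and all parameter values are ours:

```python
import numpy as np

# Monte Carlo check of Rad_n(F_{l2,B}) = B * E|| (1/n) sum_i eps_i X_i ||_2
# against the bound B * sqrt(E||X||_2^2 / n), with X ~ N(0, I_d).
rng = np.random.default_rng(0)
trials, n, d, B = 4_000, 100, 10, 3.0
X = rng.normal(size=(trials, n, d))
eps = rng.choice([-1.0, 1.0], size=(trials, n))
walk = np.einsum('tn,tnd->td', eps, X) / n        # (1/n) sum_i eps_i X_i
rad_estimate = B * np.linalg.norm(walk, axis=1).mean()
bound = B * np.sqrt(d / n)                        # E||X||_2^2 = d here
```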


SLIDE 75

Risk bound for SVM

$$\mathbb{E}[R(\hat{w}) - R(w^*)] = \underbrace{\mathbb{E}[R(\hat{w}) - R_n(\hat{w})]}_{\le\,\epsilon} + \underbrace{\mathbb{E}\big[\tfrac{\lambda}{2}\|\hat{w}\|_2^2 + R_n(\hat{w}) - \tfrac{\lambda}{2}\|w^*\|_2^2 - R_n(w^*)\big]}_{\le\,0} + \underbrace{\mathbb{E}[R_n(w^*) - R(w^*)]}_{=\,0} + \mathbb{E}\big[\tfrac{\lambda}{2}\|w^*\|_2^2 - \tfrac{\lambda}{2}\|\hat{w}\|_2^2\big] \le \epsilon + \frac{\lambda}{2}\|w^*\|_2^2,$$
where $w^* \in \arg\min_{w\in\mathbb{R}^d} \frac{\lambda}{2}\|w\|_2^2 + R(w)$ and
$$\epsilon = O\left(\sqrt{\frac{\mathbb{E}\|X\|_2^2}{\lambda n}} + \frac{1}{\sqrt{n}}\right).$$
This suggests we should use $\lambda \to 0$ such that $\lambda n \to \infty$ as $n \to \infty$.


SLIDE 77

Kernels

The excess risk bound has no explicit dependence on the dimension $d$. In particular, it holds in infinite-dimensional inner product spaces.

◮ SVM can be applied in such spaces as long as there is an algorithm for computing inner products.
◮ This is the kernel trick, and the corresponding spaces are called Reproducing Kernel Hilbert Spaces (RKHS).

Universal approximation: with some RKHS, we can approximate any function arbitrarily well:
$$\lim_{\lambda\to 0} \ \inf_{w\in\mathcal{F}} \left(\frac{\lambda}{2}\|w\|^2 + R(w)\right) = \inf_{g\colon\mathcal{X}\to\mathbb{R}} R(g).$$


SLIDE 79

Other regularizers

Instead of SVM, suppose $\hat{w}$ is the solution to
$$\min_{w\in\mathbb{R}^d} \ \lambda\|w\|_1 + R_n(w).$$
So $\hat{w} \in \mathcal{F}_{\ell_1,B} = \{w \in \mathbb{R}^d : \|w\|_1 \le B\}$ for $B = 1/\lambda$.

What is the Rademacher complexity of $\mathcal{F}_{\ell_1,B}$?
$$\mathrm{Rad}_n(\mathcal{F}_{\ell_1,B}) = \mathbb{E}\,\mathbb{E}_\varepsilon\left[\sup_{\|w\|_1 \le B}\left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i w^\top X_i\right|\right] = B\,\mathbb{E}\,\mathbb{E}_\varepsilon\left[\sup_{\|u\|_1 \le 1}\left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i u^\top X_i\right|\right] = B\,\mathbb{E}\,\mathbb{E}_\varepsilon\left[\left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_i\right\|_\infty\right].$$




SLIDE 83

Rademacher complexity of ℓ1-bounded linear predictors

Can show, using a martingale argument,
$$\mathbb{E}\,\mathbb{E}_\varepsilon\left[\left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_i\right\|_\infty\right] \le \sqrt{\frac{O(\log d)\cdot\mathbb{E}\|X\|_\infty^2}{n}}.$$
Rademacher complexity of $\mathcal{F}_{\ell_1,B} = \{w \in \mathbb{R}^d : \|w\|_1 \le B\}$:
$$\mathrm{Rad}_n(\mathcal{F}_{\ell_1,B}) \le B\sqrt{\frac{O(\log d)\cdot\mathbb{E}\|X\|_\infty^2}{n}}.$$
Let $\mathcal{X} = \{-1,+1\}^d$. Then $\|x\|_2^2 = d$ but $\|x\|_\infty^2 = 1$ for all $x \in \mathcal{X}$. The dependence on $d$ is much better than using the bound for $\ell_2$-bounded linear predictors, which would have looked like $B\sqrt{d/n}$.

This kind of bound is used to study the generalization of AdaBoost.
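The $\ell_\infty$ vs. $\ell_2$ gap driving this comparison shows up directly in simulation with sign-vector features. A sketch assuming NumPy; the parameter values are demo choices:

```python
import numpy as np

# With X uniform on {-1,+1}^d, compare the l-infinity norm of the sign
# random walk (which drives the l1-class bound and scales like
# sqrt(log d / n)) with its l2 norm (which drives the l2-class bound and
# scales like sqrt(d / n)).
rng = np.random.default_rng(0)
trials, n, d = 500, 100, 64
X = rng.choice([-1.0, 1.0], size=(trials, n, d))
eps = rng.choice([-1.0, 1.0], size=(trials, n))
walk = np.einsum('tn,tnd->td', eps, X) / n
linf = np.abs(walk).max(axis=1).mean()     # small: log-d dependence
l2 = np.linalg.norm(walk, axis=1).mean()   # much larger: ~ sqrt(d/n)
```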

SLIDE 84

Other examples of Rademacher complexity

◮ $\mathcal{F}$ = any class of $\{0,1\}$-valued functions with VC dimension $V$: $\mathrm{Rad}_n(\mathcal{F}) = O\left(\sqrt{\frac{V}{n}}\right)$.
◮ $\mathcal{F}$ = ReLU networks of depth $D$ with parameter matrices of Frobenius norm $\le 1$: $\mathrm{Rad}_n(\mathcal{F}) = O\left(\sqrt{\frac{D\cdot\mathbb{E}\|X\|_2^2}{n}}\right)$.
◮ $\mathcal{F}$ = Lipschitz functions from $[0,1]^d$ to $\mathbb{R}$: $\mathrm{Rad}_n(\mathcal{F}) = O\big(n^{-1/(2+d)}\big)$.
◮ $\mathcal{F}$ = functions from $[0,1]^d$ to $\mathbb{R}$ with Lipschitz $k$-th derivatives: $\mathrm{Rad}_n(\mathcal{F}) = O\big(n^{-(k+1)/(2(k+1)+d)}\big)$.




SLIDE 88

Questions

Are these the "right" notions of complexity?
◮ For SVM, the complexity of $\ell_2$-bounded linear predictors is relevant because $\ell_2$-regularization explicitly ensures the solution to the SVM problem is $\ell_2$-bounded.
◮ Do training algorithms for neural nets lead to Frobenius norm-bounded parameter matrices?

Do complexity bounds suggest different algorithms?

SLIDE 89

Beyond uniform convergence

SLIDE 90

Deficiencies of uniform convergence analysis

◮ For certain loss functions, if $R(f)$ is small, then the variance of $R_n(f)$ is also small, and the bound should reflect this.
  ◮ Instead of Hoeffding's inequality, use a concentration inequality that involves variance information (e.g., Bernstein's inequality).
◮ It is overkill to require all functions in $\mathcal{F}$ to not over-fit.
  ◮ We just need to worry about the $f$'s with, e.g., small empirical risk.
  ◮ Solution: local Rademacher complexity.


SLIDE 92

Example: Occam's razor bound

Suppose $\mathcal{F}$ is countable and we fix (a priori) a probability distribution $\pi = (\pi_f : f \in \mathcal{F})$ on $\mathcal{F}$.
◮ Think of $\pi$ as placing bets on which functions are likely to be the one picked by your learning algorithm.

For any fixed $f \in \mathcal{F}$,
$$P\big(|R_n(f) - R(f)| \ge t_f\big) \le 2\exp(-2nt_f^2)$$
for any $t_f > 0$, by Hoeffding's inequality and the union bound. Note: we can choose the $t_f$'s non-uniformly.



SLIDE 95

Occam's razor bound (continued)

Let $t_f = \sqrt{\frac{\ln(1/\pi_f) + \ln(2/\delta)}{2n}}$. By the union bound,
$$P\big(\exists f \in \mathcal{F} \text{ s.t. } |R_n(f) - R(f)| \ge t_f\big) \le \sum_{f\in\mathcal{F}} P\big(|R_n(f) - R(f)| \ge t_f\big) \le \sum_{f\in\mathcal{F}} 2\exp(-2nt_f^2) = \sum_{f\in\mathcal{F}} \pi_f\,\delta = \delta.$$
Theorem. For any $\delta \in (0,1)$,
$$P\left(\forall f \in \mathcal{F} : |R_n(f) - R(f)| < \sqrt{\frac{\ln(1/\pi_f) + \ln(2/\delta)}{2n}}\right) \ge 1 - \delta.$$
Better bound for functions $f$ with higher "prior probability" $\pi_f$!
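The per-function deviation radius is again a one-liner. A sketch; `occam_bound` is a name chosen here:

```python
import math

def occam_bound(pi_f, n, delta):
    """Deviation bound sqrt((ln(1/pi_f) + ln(2/delta)) / (2n)) for function f."""
    return math.sqrt((math.log(1.0 / pi_f) + math.log(2.0 / delta)) / (2.0 * n))
```

With the uniform prior $\pi_f = 1/|\mathcal{F}|$ over a finite class, this recovers exactly the finite-class bound $\sqrt{\ln(2|\mathcal{F}|/\delta)/(2n)}$ from before.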

SLIDE 96

Other forms of generalization analysis

◮ Stability
  ◮ If a learning algorithm's output does not change much when a single data point is changed, then its output will generalize.
  ◮ Connections to differential privacy and regularization.
◮ Compression bounds
  ◮ If a learning algorithm's output is invariant to all but a small number $k \ll n$ of training data (e.g., the number of support vectors in SVM), then we get a bound of the form $\sqrt{k/(n-k)}$.
◮ Direct analyses
  ◮ Some well-known learning algorithms do not fit the mold of the typical (regularized) ERM algorithm, and seem to require a direct analysis.
  ◮ E.g., the nearest neighbor rule.
◮ Many others

SLIDE 97

Many active areas of research in learning theory

◮ Implicit bias of optimization algorithms
  ◮ E.g., gradient descent for least squares linear regression converges to the solution of smallest norm.
  ◮ What about for other problems?
◮ Efficient algorithms for non-linear models
  ◮ E.g., polynomials, neural networks, kernel machines.
  ◮ Understand if/why existing algorithms work!
◮ Learning algorithms with robustness guarantees
  ◮ Noisy labels, missing / malformed data, heavy-tailed distributions, adversarial corruptions, etc.
◮ Interactive learning
  ◮ Learning algorithms that interact with an external environment (e.g., bandits, active learning, reinforcement learning).
◮ More: see the proceedings of the Conference on Learning Theory!