
Summary

◮ Overfitting arises when we evaluate and train on the same data.
◮ We can bound the error of a fixed function with Hoeffding's inequality.
◮ Next lecture we'll get a version sensitive to function class size.

41 / 61


Part 3. . .


Overfitting, in pictures

With SVM, the model size scales with C. We had our best test error in the middle.

[Figure: misclassification rate (train and test) vs. C, with three decision-boundary plots:
  C = 1.481101, λ = 0.006752; train 0.010000, test 0.060000
  C = 0.040000, λ = 0.250000; train 0.050000, test 0.076000
  C = 20.480000, λ = 0.000488; train 0.000000, test 0.076000]

42 / 61
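Below is a minimal sketch of the kind of sweep behind this figure, assuming scikit-learn and a synthetic two-moons dataset rather than the slide's actual data; the C grid and all numbers are illustrative only.

```python
# Sketch (not the slide's exact experiment): sweep the SVM's C on synthetic data
# and compare train/test misclassification rates. Assumes scikit-learn.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = SVC(C=C, kernel="rbf").fit(X_tr, y_tr)
    err_tr = 1 - clf.score(X_tr, y_tr)   # training misclassification rate
    err_te = 1 - clf.score(X_te, y_te)   # test misclassification rate
    print(f"C={C:7.2f}  train={err_tr:.3f}  test={err_te:.3f}")
```

On such data, training error typically keeps falling as C grows while test error bottoms out somewhere in the middle, which is the picture above.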


Bernoulli walks

Let $Z_i$ be Bernoulli with $\mathbb{E} Z_i = 1/2$; consider the walk $\sum_{i=1}^{t} (2 Z_i - 1)$.

[Figure: sample walks for t up to 1000.]

Fact: with probability ≥ 1 − 1/e, the final position is at most $\sqrt{2n}$.
Thus: with probability ≥ 1 − 1/e, $R(h) - \hat{R}(h) \le \sqrt{1/(2n)}$.

43 / 61
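A quick simulation sketch of the Fact (the seed, n, and trial count are arbitrary): the fraction of walks whose endpoint exceeds √(2n) should come out below 1/e.

```python
# Sketch: simulate walks sum_{i<=n}(2 Z_i - 1) with Z_i ~ Bernoulli(1/2) and check
# how often the endpoint exceeds sqrt(2 n), per the "Fact" on this slide.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 10000
Z = rng.integers(0, 2, size=(trials, n))          # Bernoulli(1/2) draws
endpoints = (2 * Z - 1).sum(axis=1)               # walk position at time n
frac_exceed = np.mean(endpoints > np.sqrt(2 * n))
print(f"fraction with position > sqrt(2n): {frac_exceed:.4f}  (1/e = {np.exp(-1):.4f})")
```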


Two ways to get that Bernoulli walk “Fact”

Theorem (via Chebyshev). Given IID $Z_i \in [a, b]$, with probability ≥ 1 − δ,
  $\mathbb{E} Z_1 - \frac{1}{n} \sum_{i=1}^{n} Z_i \;\le\; (b - a) \sqrt{\frac{1/\delta}{4n}}$.

Theorem (via Hoeffding). Given IID $Z_i \in [a, b]$, with probability ≥ 1 − δ,
  $\mathbb{E} Z_1 - \frac{1}{n} \sum_{i=1}^{n} Z_i \;\le\; (b - a) \sqrt{\frac{\ln(1/\delta)}{2n}}$.

Remarks.
◮ Defining $Z_i := 1[h(X_i) \neq Y_i]$ for a fixed $h$ chosen without seeing $((X_i, Y_i))_{i=1}^{n}$, the left-hand side becomes $R(h) - \hat{R}(h)$.

44 / 61
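A small sketch comparing the two deviation bounds for b − a = 1 (i.e. Z_i ∈ [0, 1]) at a few confidence levels; Chebyshev's √(1/δ) dependence blows up much faster than Hoeffding's √(ln(1/δ)).

```python
# Sketch: compare the Chebyshev-style and Hoeffding deviation bounds for the mean
# of n bounded variables, as functions of the confidence parameter delta.
import numpy as np

def chebyshev_dev(n, delta, a=0.0, b=1.0):
    # (b - a) * sqrt((1/delta) / (4 n))
    return (b - a) * np.sqrt((1.0 / delta) / (4 * n))

def hoeffding_dev(n, delta, a=0.0, b=1.0):
    # (b - a) * sqrt(ln(1/delta) / (2 n))
    return (b - a) * np.sqrt(np.log(1.0 / delta) / (2 * n))

n = 1000
for delta in [0.1, 0.01, 0.001]:
    print(f"delta={delta:6.3f}  Chebyshev={chebyshev_dev(n, delta):.4f}  "
          f"Hoeffding={hoeffding_dev(n, delta):.4f}")
```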


Overfitting

◮ These bounds require IID $(Z_i)_{i=1}^{n}$ where $Z_i := 1[h(X_i) \neq Y_i]$.
◮ If $h$ depends on $((X_i, Y_i))_{i=1}^{n}$, we can't guarantee independence of the $Z_i$.
◮ E.g., suppose $h$ memorizes the training data and outputs "bear" on new data; we can force $\hat{R}(h) = 0$ and $R(h) = 0$, and also $\hat{R}(h) = 0$ but $R(h) = 1$ (see the sketch below).

45 / 61
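A hedged sketch of the memorization remark: random labels, a dictionary lookup standing in for "memorizes the training data", and a constant default answer standing in for outputting "bear" on new data. Training error is 0 by construction, while test error is whatever the default answer earns (about 1/2 here).

```python
# Sketch: a "memorizer" has zero training error by construction, but its test
# error depends entirely on the default answer it gives on unseen points.
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(100, 2))
y_tr = rng.integers(0, 2, size=100)          # labels are pure noise here

memory = {tuple(x): y for x, y in zip(X_tr, y_tr)}

def memorizer(x, default=0):
    return memory.get(tuple(x), default)     # recall if seen, else guess

train_err = np.mean([memorizer(x) != y for x, y in zip(X_tr, y_tr)])
X_te = rng.normal(size=(1000, 2))
y_te = rng.integers(0, 2, size=1000)
test_err = np.mean([memorizer(x) != y for x, y in zip(X_te, y_te)])
print(f"train error = {train_err:.2f}, test error = {test_err:.2f}")  # ~0.00 vs ~0.50
```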

11. Finite classes
Controlling k predictors

Theorem. Let $Z_{i,j} \in [a, b]$ be given, where $(Z_{i,j})_{i=1}^{n}$ are independent for each $j$ (but nothing is said across $j$). With probability at least 1 − δ,
  $\max_{j \in \{1, \ldots, k\}} \; \mathbb{E} Z_{1,j} - \frac{1}{n} \sum_{i=1}^{n} Z_{i,j} \;\le\; (b - a) \sqrt{\frac{\ln k + \ln(1/\delta)}{2n}}$.

Theorem. Let predictors $(h_1, \ldots, h_k)$ be given. With probability ≥ 1 − δ over an IID draw $((X_i, Y_i))_{i=1}^{n}$,
  $R(h_j) \;\le\; \hat{R}(h_j) + \sqrt{\frac{\ln k + \ln(1/\delta)}{2n}} \qquad \forall j \in \{1, \ldots, k\}$.

Remarks.
◮ We pick $(h_1, \ldots, h_k)$ without seeing the data!
◮ This is how all our generalization guarantees will go: we prove a guarantee on all possible things the algorithm can output, and thus avoid the issue of "$h$ depends on the data". This is called "uniform deviations" or a "uniform law of large numbers".
◮ For this approach to work, we must build the tightest possible estimate of what the algorithm considers (on particular data).

46 / 61
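A simulation sketch of the first theorem: k columns of independent Bernoulli draws play the role of the Z_{i,j}, and the worst deviation of the empirical means is compared against the √(ln k/(2n)) scaling (with b − a = 1 and the δ term dropped). The values of n, k, p are arbitrary.

```python
# Sketch: for k fixed "predictors" (here, k independent Bernoulli columns), the
# worst deviation of the empirical mean grows roughly like sqrt(ln k / (2 n)),
# matching the finite-class bound with (b - a) = 1 and the delta term dropped.
import numpy as np

rng = np.random.default_rng(0)
n, k, p = 500, 1000, 0.3
Z = rng.binomial(1, p, size=(n, k))                  # Z[i, j] in {0, 1}
max_dev = np.max(np.abs(Z.mean(axis=0) - p))         # worst-case over j
print(f"observed max deviation: {max_dev:.4f}")
print(f"sqrt(ln k / (2 n))    : {np.sqrt(np.log(k) / (2 * n)):.4f}")
```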

Proof of finite class bound

Proof. Fix any $h_j$ and some confidence level $\delta_j > 0$. Define a failure event
  $F_j := \left[ R(h_j) > \hat{R}(h_j) + \epsilon_j \right]$, where $\epsilon_j := \sqrt{\frac{\ln(1/\delta_j)}{2n}}$.
By Hoeffding's inequality (which requires independence!), $\Pr(F_j) \le \delta_j$.
The events $(F_1, \ldots, F_k)$ are not independent, but it doesn't matter:
  $\Pr(\forall j \; \neg F_j) \;=\; 1 - \Pr(\exists j \; F_j) \;\ge\; 1 - \sum_{j=1}^{k} \Pr(F_j) \;\ge\; 1 - \sum_{j=1}^{k} \delta_j$.
To finish the proof, set $\delta_j = \delta/k$.

◮ We can also prove the first (abstract) Theorem, and then plug in the random variables $Z_{i,j} := 1[h_j(X_i) \neq Y_i]$.

47 / 61
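For completeness, the last step written out: with δ_j = δ/k the failure probabilities sum to exactly δ, and each ε_j becomes the quantity in the theorem statement.

```latex
% The union bound leaves total failure probability sum_j delta_j; the choice
% delta_j = delta/k makes this exactly delta, and epsilon_j matches the theorem:
\[
\sum_{j=1}^{k} \delta_j = k \cdot \frac{\delta}{k} = \delta,
\qquad
\epsilon_j
  = \sqrt{\frac{\ln(1/\delta_j)}{2n}}
  = \sqrt{\frac{\ln(k/\delta)}{2n}}
  = \sqrt{\frac{\ln k + \ln(1/\delta)}{2n}} .
\]
```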

Picture behind proof

Each predictor $h_j$ had a failure event $F_j := [R(h_j) > \hat{R}(h_j) + \epsilon_j]$.
Concretely, $F_j$ is a subset of all possible $((X_i, Y_i))_{i=1}^{n}$.
Some other $h_l$ might have a different failure event $F_l$! (But $F_l$ is still a subset of the same sample space!)

[Figure: the sample space drawn as a box, with (possibly overlapping) failure regions $F_1, \ldots, F_5$ inside it.]

Looking ahead: for infinitely many predictors, the picture still works if the failure events overlap!

48 / 61

Finite class bound — summary

Theorem. Let predictors $(h_1, \ldots, h_k)$ be given. With probability ≥ 1 − δ over an IID draw $((X_i, Y_i))_{i=1}^{n}$,
  $R(h_j) \;\le\; \hat{R}(h_j) + \sqrt{\frac{\ln k + \ln(1/\delta)}{2n}} \qquad \forall j$.

Remarks.
◮ If we choose $(h_1, \ldots, h_k)$ before seeing $((X_i, Y_i))_{i=1}^{n}$, we can use this bound.
◮ Example: train k classifiers, then pick the best on a validation set (see the sketch after this slide)!
◮ This approach, "produce a bound for all possible algorithm outputs", may seem sloppy, but it's the best we have!
◮ Letting $F = (h_1, \ldots, h_k)$ denote our set of predictors, the bound reads: with probability ≥ 1 − δ, every $f \in F$ satisfies
  $R(f) \;\le\; \hat{R}(f) + \sqrt{\frac{\ln |F| + \ln(1/\delta)}{2n}}$.
In the next sections, we'll handle $|F| = \infty$ by replacing $\ln |F|$ with $\mathrm{complexity}(F)$, whose meaning will vary.

49 / 61
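A sketch of the validation-set example from the remarks, assuming scikit-learn, a synthetic dataset, and an arbitrary grid of candidate C values: the k models are fixed before the validation split is examined, so the finite-class bound applies to their validation errors.

```python
# Sketch: train k classifiers on a training split, then apply the finite-class
# bound to their errors on a held-out validation split (the k predictors are
# "fixed" relative to the validation data). Assumes scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1200, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

Cs = [0.01, 0.1, 1.0, 10.0, 100.0]                      # k = 5 candidate predictors
models = [SVC(C=C).fit(X_tr, y_tr) for C in Cs]
val_errs = [1 - m.score(X_val, y_val) for m in models]

k, n, delta = len(models), len(y_val), 0.05
slack = np.sqrt((np.log(k) + np.log(1 / delta)) / (2 * n))
best = int(np.argmin(val_errs))
print(f"best C = {Cs[best]}, validation error = {val_errs[best]:.3f}")
print(f"with prob >= {1 - delta}, true error <= {val_errs[best] + slack:.3f}")
```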

12. VC Dimension
VC dimension overview

Theorem. With probability at least 1 − δ, every $f \in F$ satisfies
  $R(f) \;\le\; \hat{R}(f) + O\left( \sqrt{\frac{\mathrm{VC}(F) + \ln(1/\delta)}{n}} \right)$,
where $\mathrm{VC}(F)$, the Vapnik-Chervonenkis dimension of $F$, is the largest number of points where $F$ can realize all labelings:
  $\mathrm{VC}(F) := \sup\left\{ n \in \mathbb{Z} : \exists (x_1, \ldots, x_n), \; \forall (y_1, \ldots, y_n), \; \exists f \in F, \; \hat{R}_{0/1}(f) = 0 \right\}$.

Remarks.
◮ $|F|$ can be infinite!
◮ The definition only requires some set of points we can label in every way; this set is unrelated to the IID sample in the bound.
◮ We say that $F$ shatters $(x_1, \ldots, x_n)$ when it can realize all labelings.

50 / 61

VC example: intervals

$\mathrm{VC}(\{ x \mapsto 1[x \in [a, b]] : a, b \in \mathbb{R} \}) = 2$.

Usual approach to VC bounds: establish upper and lower bounds separately.
◮ Upper bound: no set of 3 points can be shattered. Order the points so that $x_1 \le x_2 \le x_3$; we can't label them according to $(+1, -1, +1)$.
◮ Lower bound: we can shatter $\{0, 1\}$ (see the brute-force check after this slide). (Indeed, we can shatter any two distinct points, but that doesn't matter.)

51 / 61
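A brute-force sketch of both directions: an interval classifier realizes a labeling exactly when the points labeled +1 are contiguous in sorted order, so shattering of a candidate point set can be checked by enumerating all labelings.

```python
# Sketch: brute-force check of which point sets intervals can shatter.
# An interval classifier x -> 1[a <= x <= b] realizes a labeling iff the
# points labeled 1 are contiguous once the points are sorted.
from itertools import product

def interval_shatters(points):
    pts = sorted(points)
    for labels in product([0, 1], repeat=len(pts)):
        ones = [x for x, lab in zip(pts, labels) if lab == 1]
        if not ones:
            continue                      # all-zero labeling: use an interval containing no point
        a, b = min(ones), max(ones)       # tightest candidate interval
        realized = tuple(1 if a <= x <= b else 0 for x in pts)
        if realized != labels:
            return False                  # e.g. (+1, -1, +1) on three points
    return True

print(interval_shatters([0.0, 1.0]))        # True: VC >= 2
print(interval_shatters([0.0, 0.5, 1.0]))   # False: no 3 points shattered
```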

Linear (affine) classifiers

$\mathrm{VC}\left( \{ x \mapsto 1[a^\mathsf{T} x + b \ge 0] : a \in \mathbb{R}^d, b \in \mathbb{R} \} \right) = d + 1$.

Again proceed with separate upper and lower bounds.
◮ Upper bound: can't shatter any set of d + 2 points. Geometric fact: any d + 2 points can be grouped into two sets whose convex hulls intersect ("Radon's lemma"), and this gives a labeling which linear classifiers cannot realize.
◮ Lower bound: there exists a set of d + 1 points we can shatter. It suffices to consider $(e_1, \ldots, e_d, 0)$ (see the sketch below).

52 / 61
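A sketch verifying the lower-bound construction numerically: for any labeling of (e_1, ..., e_d, 0), taking a_i = ±1 from the desired label of e_i and b = ±1/2 from the desired label of the origin realizes that labeling.

```python
# Sketch: the lower-bound construction. For points (e_1, ..., e_d, 0) and any
# desired labeling, set a_i = +1 if e_i should be positive else -1, and set
# b = +1/2 or -1/2 according to the origin's label; then 1[a^T x + b >= 0]
# realizes the labeling (a^T e_i + b = a_i + b keeps the sign of a_i).
from itertools import product
import numpy as np

d = 3
points = list(np.eye(d)) + [np.zeros(d)]              # e_1, ..., e_d, 0
shattered = True
for labels in product([0, 1], repeat=d + 1):
    a = np.array([1.0 if lab == 1 else -1.0 for lab in labels[:d]])
    b = 0.5 if labels[-1] == 1 else -0.5              # handles the origin
    realized = tuple(int(a @ x + b >= 0) for x in points)
    shattered &= (realized == labels)
print(f"(e_1,...,e_{d}, 0) shattered: {shattered}")   # True, so VC >= d + 1
```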

Deep networks

◮ Binary activation networks: O(p ln p), where p is #parameters.
◮ ReLU networks: O(pL ln(pL)), where L is #layers.
◮ Sigmoid networks: O(p²m²), where m is #nodes.
◮ Arbitrary continuous convex-concave activation: ∞, even with one node.

Remarks.
◮ These bounds are "tight", but alone they are garbage. E.g., often p ≥ n, but we need VC(F) = o(n).
◮ With deep networks, we don't have a handle on the set of functions investigated by standard methods on standard data with standard algorithms; if we had this, we could plug it into VC and also into the tool in the next section.

53 / 61

VC dimension summary

Theorem. With probability at least 1 − δ, every $f \in F$ satisfies
  $R(f) \;\le\; \hat{R}(f) + O\left( \sqrt{\frac{\mathrm{VC}(F) + \ln(1/\delta)}{n}} \right)$,
where $\mathrm{VC}(F)$, the Vapnik-Chervonenkis dimension of $F$, is the largest number of points where $F$ can realize all labelings:
  $\mathrm{VC}(F) := \sup\left\{ n \in \mathbb{Z} : \exists (x_1, \ldots, x_n), \; \forall (y_1, \ldots, y_n), \; \exists f \in F, \; \hat{R}_{0/1}(f) = 0 \right\}$.

Remarks.
◮ To determine VC(F), prove upper and lower bounds separately.
◮ Examples: intervals, linear separators, deep networks.
◮ Lower bounds: one can hand-craft distributions for which the above theorem's inequality is reversed.

54 / 61

13. Rademacher complexity
What we have so far:
◮ A bound for finitely many predictors.
◮ A bound for classifiers with "finite VC dimension".

Things we are missing:
◮ Infinite classes and non-classification!
◮ Fine-grained sensitivity to model classes; e.g., affine classifier VC dimension is d + 1, which is insensitive to the SVM's C.

55 / 61

Rademacher complexity

Definition. Given examples $(x_1, \ldots, x_n)$ and functions $F$,
  $\mathrm{Rad}(F) = \mathbb{E}_\epsilon \max_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(x_i)$,
where $(\epsilon_1, \ldots, \epsilon_n)$ are IID Rademacher random variables ($\Pr[\epsilon_i = 1] = \Pr[\epsilon_i = -1] = \frac{1}{2}$).

Remarks.
◮ Interpretation: the ability of F to fit random signs (see the Monte Carlo sketch after this slide).
◮ We should make F as tight as possible; e.g., for the SVM, we'll incorporate C.
◮ Compared to VC: depends on $(x_1, \ldots, x_n)$, doesn't require classification, and is sensitive to the scale of f.
◮ The general form (not presented here) can handle labels and multiclass, amongst other things.

56 / 61
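A Monte Carlo sketch of the "fit random signs" interpretation, on a toy finite class of threshold functions over fixed one-dimensional points (all sizes and values arbitrary): draw many sign vectors ε and average the best normalized correlation any f ∈ F achieves with them.

```python
# Sketch: Monte Carlo estimate of Rad(F) as "ability to fit random signs",
# here for a tiny finite class of threshold functions on fixed 1-d points.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(size=n))                        # fixed examples x_1..x_n
thresholds = np.linspace(0, 1, 11)                      # finite class: f_t(x) = +/-1 thresholds
F = np.array([np.where(x >= t, 1.0, -1.0) for t in thresholds])   # shape (|F|, n)

trials = 5000
eps = rng.choice([-1.0, 1.0], size=(trials, n))         # Rademacher signs
# For each sign draw, the best correlation any f in F achieves with the signs:
rad_hat = np.mean(np.max(eps @ F.T, axis=1) / n)
print(f"Monte Carlo Rad(F) ~ {rad_hat:.4f}")
```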

Generalization via Rademacher complexity

Theorem (simplified Rademacher generalization). Let predictors F and a distribution on (X, Y) be given. Suppose for (almost) any (x, y), the loss ℓ satisfies:
◮ There exists ρ ≥ 0 so that for any $f, g \in F$, $|\ell(f(x), y) - \ell(g(x), y)| \le \rho |f(x) - g(x)|$ ("ρ-Lipschitz").
◮ There exists [a, b] so that $\ell(f(x), y) \in [a, b]$ for any $f \in F$.
With probability ≥ 1 − δ, every $f \in F$ satisfies
  $R_\ell(f) \;\le\; \hat{R}_\ell(f) + 2\rho \, \mathrm{Rad}(F) + 3(b - a) \sqrt{\frac{\ln(2/\delta)}{2n}}$.

Remarks.
◮ Can get a bound in terms of $1/(\lambda n)$ for linear SVM and ridge regression. (Homework problems?)
◮ Kernel SVM is okay as well.

57 / 61
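A small sketch that just evaluates the right-hand side of this bound; every number below (ρ, Rad(F), [a, b], δ, n, the empirical risk) is hypothetical and chosen only to show the plug-in.

```python
# Sketch: plugging illustrative numbers into the simplified Rademacher bound.
# All quantities here (rho, rad, a, b, delta, n, emp_risk) are hypothetical.
import numpy as np

def rademacher_bound(emp_risk, rho, rad, a, b, delta, n):
    # hat-R_l(f) + 2*rho*Rad(F) + 3*(b - a)*sqrt(ln(2/delta) / (2 n))
    return emp_risk + 2 * rho * rad + 3 * (b - a) * np.sqrt(np.log(2 / delta) / (2 * n))

# e.g. a 1-Lipschitz loss bounded in [0, 1], and Rad(F) = RW/sqrt(n) with R = W = 1:
n = 10_000
print(f"risk bound: {rademacher_bound(0.10, 1.0, 1/np.sqrt(n), 0.0, 1.0, 0.05, n):.4f}")
```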

Rademacher complexity examples

Definition. Given examples $(x_1, \ldots, x_n)$ and functions $F$,
  $\mathrm{Rad}(F) = \mathbb{E}_\epsilon \max_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(x_i)$,
where $(\epsilon_1, \ldots, \epsilon_n)$ are IID Rademacher random variables ($\Pr[\epsilon_i = 1] = \Pr[\epsilon_i = -1] = \frac{1}{2}$).

Examples.
◮ If $\|x\| \le R$, then $\mathrm{Rad}(\{ x \mapsto x^\mathsf{T} w : \|w\| \le W \}) \le \frac{RW}{\sqrt{n}}$. For the SVM, we can set $W = \sqrt{2/\lambda}$ (see the sketch after this slide).
◮ For deep networks, we have $\mathrm{Rad}(F) \le \mathrm{Lipschitz} \cdot \sqrt{\mathrm{Junk}/n}$; still very loose.

58 / 61
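A sketch checking the first example: for the bounded linear class the inner maximum has a closed form, max_{‖w‖≤W} (1/n)Σ_i ε_i wᵀx_i = (W/n)·‖Σ_i ε_i x_i‖, so a Monte Carlo average can be compared directly to RW/√n. Data and sizes are arbitrary, with R = 1 enforced by normalizing the x_i.

```python
# Sketch: Monte Carlo estimate of Rad(F) for the bounded linear class
# F = { x -> w^T x : ||w|| <= W }, using the closed form of the inner max.
import numpy as np

rng = np.random.default_rng(0)
n, d, W = 200, 5, 1.0
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)   # ||x_i|| <= R = 1

trials = 2000
eps = rng.choice([-1.0, 1.0], size=(trials, n))                    # Rademacher signs
rad_hat = np.mean(W / n * np.linalg.norm(eps @ X, axis=1))         # (W/n) * ||sum_i eps_i x_i||
print(f"Monte Carlo Rad(F) ~ {rad_hat:.4f},  bound RW/sqrt(n) = {W / np.sqrt(n):.4f}")
```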

14. ML Theory summary / retrospective
No free lunch

Theorem (informal). For any n, any sample space with at least 2n points, and any learning algorithm, there is a distribution over data so that:
◮ some function g is perfect, meaning $R_{0/1}(g) = 0$,
◮ the algorithm returns f with $R_{0/1}(f) = 1/4$.

Getting around this.
◮ This does not require g to be in the learning method's model class; learning methods operate optimistically under some inductive bias.
◮ Inductive bias is a consequence of the model class and the learning algorithm.
◮ Ideally, we know something about the data, and use it to drive these choices.

59 / 61

Decomposition of error

◮ We decomposed the error of a learning problem into many terms; the main terms were approximation error (which improves as the model class grows) and generalization error (which worsens as the model class grows).
◮ Approximation error can be made arbitrarily small with most machine learning model classes (RBF SVM with C ↑ ∞, ReLU networks with depth or width ↑ ∞).
◮ Generalization requires algorithmic care.
◮ For the SVM, we have to be careful about C.
◮ For deep networks, careful architecture choice plus gradient descent often seems magically sufficient; this is not well understood, and there is good practical guidance only in some settings (e.g., computer vision).

60 / 61

Summary; what to know

◮ Remember always that the goal in ML is low test error, not low training error.
◮ The main sources of error: approximation (model class too weak) and generalization (model class too powerful).
◮ How to control model size (e.g., with the SVM, we shrink C or change kernels).
◮ Hoeffding's inequality; the concept of concentration of measure.
◮ A concrete source of overfitting (IID violation).
◮ Ways to establish generalization: finite classes, VC dimension, Rademacher complexity.

61 / 61