

slide-1
SLIDE 1

Machine Learning Theory

CS 446

slide-2
SLIDE 2
  • 1. SVM risk
slide-3
SLIDE 3

SVM risk

Consider the empirical and true/population risk of SVM: given f, the population risk is R(f) = E[ℓ(Y f(X))], the empirical risk is R̂(f) = (1/n) ∑_{i=1}^n ℓ(y_i f(x_i)), and furthermore define the excess risk R(f) − R̂(f).

[Figure: training (“tr”) and test (“te”) misclassification rate (0.0–0.6) versus regularization parameter C ∈ [10^{-1}, 10^{1}], for SVMs with affine (“aff”), quadratic (“quad”), degree-10 polynomial (“poly10”), and RBF kernels with bandwidths 1 (“rbf1”) and 0.1 (“rbf01”).]

What’s going on here? (I just tricked you into caring about theory.)

1 / 61
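The two risks above can be estimated numerically for a fixed classifier; the following is a minimal pure-Python sketch (the distribution, the 10% noise rate, and the classifier are invented for illustration), showing that for a fixed f the empirical risk on a large IID sample lands near the population risk.

```python
import random

random.seed(0)

def sample(n):
    # toy distribution (invented for illustration): x uniform on [-1, 1],
    # y = sign(x), with the label flipped with probability 0.1
    out = []
    for _ in range(n):
        x = random.uniform(-1.0, 1.0)
        y = 1 if x >= 0 else -1
        if random.random() < 0.1:
            y = -y
        out.append((x, y))
    return out

def f(x):
    # a fixed classifier; here sign(f(x)) = sign(x), the noiseless rule
    return x

def empirical_risk(data):
    # \hat{R}(f) = (1/n) sum_i loss(y_i f(x_i)) with the 0/1 loss
    return sum(1.0 for x, y in data if y * f(x) <= 0) / len(data)

# the population risk R(f) of this particular f is exactly the flip
# probability 0.1, so the empirical risk on a large sample should be close
print(empirical_risk(sample(100_000)))
```

Note the caveat coming later in the lecture: this closeness is for a *fixed* f, not for the data-dependent ERM f̂.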

slide-7
SLIDE 7

Decomposing excess risk

ERM f̂ is an approximate ERM: R̂(f̂) ≈ min_{f∈F} R̂(f). Let’s also define the true/population risk minimizer f̄ := arg min_{f∈F} R(f).

(Question: What is F? Answer: depends on the kernel!)

(Question: is f̄ = arg min_{f∈F} R̂(f)? Answer: no; in general R̂(f̄) ≥ R̂(f̂)!)

Nature labels according to some g (not necessarily inside F!):

R(f̂) = R(g) (inherent unpredictability)
 + R(f̄) − R(g) (approximation gap)
 + R̂(f̄) − R(f̄) (estimation gap)
 + R̂(f̂) − R̂(f̄) (optimization gap)
 + R(f̂) − R̂(f̂) (generalization gap).

Let’s go through this step by step.

2 / 61
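On a small discrete example, all five terms can be computed exactly and checked to telescope; a pure-Python sketch (the conditional probabilities eta and the sample size are invented for illustration; here F is all labelings of X, so the approximation gap happens to be zero, and exact ERM makes the optimization gap nonpositive).

```python
import itertools
import random

random.seed(0)

# tiny discrete setup: X = {0, 1, 2} uniform, labels in {-1, +1},
# eta[x] = Pr[Y = +1 | X = x] (values invented for illustration)
eta = {0: 0.9, 1: 0.4, 2: 0.2}
X = list(eta)

def pop_risk(f):
    # R(f): expected 0/1 loss under the true distribution
    return sum((1 - eta[x]) if f[x] == 1 else eta[x] for x in X) / len(X)

def emp_risk(f, data):
    return sum(1 for x, y in data if f[x] != y) / len(data)

# F: every sign function on X; g: nature's (Bayes-optimal) labeling
F = [dict(zip(X, labs)) for labs in itertools.product([-1, 1], repeat=len(X))]
g = {x: (1 if eta[x] >= 0.5 else -1) for x in X}

data = [(x, 1 if random.random() < eta[x] else -1)
        for x in random.choices(X, k=200)]

fhat = min(F, key=lambda f: emp_risk(f, data))   # ERM
fbar = min(F, key=pop_risk)                      # population minimizer

terms = [
    pop_risk(g),                                  # inherent unpredictability
    pop_risk(fbar) - pop_risk(g),                 # approximation gap
    emp_risk(fbar, data) - pop_risk(fbar),        # estimation gap
    emp_risk(fhat, data) - emp_risk(fbar, data),  # optimization gap
    pop_risk(fhat) - emp_risk(fhat, data),        # generalization gap
]
# the five terms telescope back to R(f_hat)
print(sum(terms), pop_risk(fhat))
```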

slide-14
SLIDE 14

Inherent unpredictability

Nature labels according to some g (not necessarily inside F!):

R(g) (inherent unpredictability)

◮ If g is the function with lowest classification error, we can write down an explicit form: g(x) := sign(Pr[Y = +1 | X = x] − 1/2).
◮ If g minimizes R with convex ℓ, again we can write down g pointwise via Pr[Y = +1 | X = x].

3 / 61
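The explicit form above can be checked by brute force on a tiny discrete domain; a pure-Python sketch (the conditional probabilities eta are invented for illustration): sign(Pr[Y = +1 | X = x] − 1/2) attains the minimum misclassification risk over every possible labeling.

```python
import itertools

# eta[x] = Pr[Y = +1 | X = x] on a tiny domain; X uniform (invented values)
eta = {0: 0.9, 1: 0.4, 2: 0.2}
X = list(eta)

def risk(f):
    # misclassification risk of a labeling f: X -> {-1, +1}
    return sum((1 - eta[x]) if f[x] == 1 else eta[x] for x in X) / len(X)

# the slide's explicit form: g(x) = sign(eta(x) - 1/2)
g = {x: (1 if eta[x] - 0.5 >= 0 else -1) for x in X}

# exhaustively check that g attains the minimum over all 2^|X| labelings
best = min(risk(dict(zip(X, labs)))
           for labs in itertools.product([-1, 1], repeat=len(X)))
print(risk(g), best)
```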

slide-16
SLIDE 16

Approximation gap

f̄ minimizes R over F, and g is chosen by nature; consider R(f̄) − R(g) (approximation gap).

◮ We’ve shown that if R is misclassification, F is affine classifiers, and g is quadratic, the gap can be 1/4.
◮ We can make this gap arbitrarily small if F is: a wide 2-layer network, RBF kernel SVM, polynomial classifier with arbitrary degree, . . .
◮ What is F for SVM?

4 / 61

slide-18
SLIDE 18

Approximation gap

Consider SVM with no kernel. Can we only say F := {x → wᵀx : w ∈ ℝ^d}?

Note, for ŵ := arg min_w R̂(w) + (λ/2)‖w‖²,

(λ/2)‖ŵ‖² ≤ R̂(ŵ) + (λ/2)‖ŵ‖² ≤ R̂(0) + (λ/2)‖0‖² = (1/n) ∑_{i=1}^n [1 − 0ᵀx_i y_i]_+ = 1,

and so SVM is working with the finer set F_λ := {x → wᵀx : ‖w‖² ≤ 2/λ}.

5 / 61
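The bound ‖ŵ‖² ≤ 2/λ can be checked numerically; a pure-Python sketch using Pegasos-style subgradient descent as a stand-in for any SVM solver (the 2-d data, λ = 0.5, and step sizes are invented for illustration).

```python
import random

random.seed(0)
lam = 0.5  # regularization strength (lambda in the slide)
n = 100

# toy 2-d data (invented for illustration): label is the sign of coordinate 0
data = []
for _ in range(n):
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    data.append((x, 1 if x[0] >= 0 else -1))

# subgradient descent on w -> \hat{R}(w) + (lam/2)||w||^2 with hinge loss
w = [0.0, 0.0]
for t in range(1, 2001):
    g = [lam * w[0], lam * w[1]]  # gradient of the regularizer
    for x, y in data:
        if y * (w[0] * x[0] + w[1] * x[1]) < 1:  # hinge subgradient term
            g[0] -= y * x[0] / n
            g[1] -= y * x[1] / n
    step = 1.0 / (lam * t)
    w = [w[0] - step * g[0], w[1] - step * g[1]]

# the slide's bound: (lam/2)||w_hat||^2 <= 1, i.e. ||w_hat||^2 <= 2/lam
print(w[0] ** 2 + w[1] ** 2, 2.0 / lam)
```

The bound only uses R̂(0) ≤ 1 and nonnegativity of the hinge loss, so it holds for any data; the sketch just makes that concrete.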

slide-20
SLIDE 20

Approximation gap

What about kernel SVM? Now working with

F_k := {x → ∑_{i=1}^n α_i y_i k(x_i, x) : α ∈ ℝ^n},

which is a random variable! ((x_i, y_i))_{i=1}^n given by data.

This function class is called a reproducing kernel Hilbert space (RKHS). We can use it to develop a refined notion F_{k,λ}.

Going forward: we always try to work with the tightest possible function class defined by the data and algorithm.

6 / 61
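An element of F_k is easy to write down directly; a pure-Python sketch with a Gaussian/RBF kernel (the data, bandwidth, and coefficients α are invented for illustration — and note that a different draw of the data gives a different class F_k).

```python
import math
import random

random.seed(0)

def k_rbf(u, v, bandwidth=1.0):
    # Gaussian/RBF kernel: k(u, v) = exp(-||u - v||^2 / (2 * bandwidth^2))
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2 * bandwidth ** 2))

# the training sample ((x_i, y_i))_{i=1}^n that defines F_k
xs = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(20)]
ys = [1 if x[0] >= 0 else -1 for x in xs]
alpha = [random.uniform(0, 1) for _ in xs]  # any alpha in R^n works

def f(x):
    # an element of F_k: x -> sum_i alpha_i y_i k(x_i, x)
    return sum(a * y * k_rbf(xi, x) for a, y, xi in zip(alpha, ys, xs))

print(f((0.5, 0.0)))
```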

slide-24
SLIDE 24

Estimation gap

f̄ minimizes R over F.

R̂(f̄) − R(f̄) (estimation gap)

◮ If ((x_i, y_i))_{i=1}^n is drawn IID from the same distribution as the expectation in R, then by the central limit theorem, R̂(f̄) → R(f̄) as n → ∞.
◮ Next week, we’ll discuss high probability bounds for finite n.

7 / 61
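Since f̄ is a fixed function (not data-dependent), R̂(f̄) is an average of IID bounded losses, so its fluctuations around R(f̄) shrink roughly like 1/√n; a pure-Python sketch (the loss distribution, a Bernoulli with mean 0.3, is invented for illustration).

```python
import random
import statistics

random.seed(0)

def emp_risk(n):
    # \hat{R}(f_bar) for a fixed f_bar whose 0/1 loss is Bernoulli(0.3)
    return sum(1 for _ in range(n) if random.random() < 0.3) / n

for n in (100, 10_000):
    draws = [emp_risk(n) for _ in range(200)]
    # the spread around R(f_bar) = 0.3 shrinks roughly like 1/sqrt(n)
    print(n, statistics.pstdev(draws))
```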

slide-26
SLIDE 26

Optimization gap

f̂ ∈ F minimizes R̂, and f̄ ∈ F minimizes R.

R̂(f̂) − R̂(f̄) (optimization gap)

◮ This is algorithmic: we reduce this number by optimizing better.
◮ We’ve advocated the use of gradient descent.
◮ Many of these problems are NP-hard even in trivial cases. (Linear separator with noise and 3-node neural net are NP-hard.)
◮ If R̂ uses a convex loss and f̂ has at least one training mistake, relating R̂ and test set misclassifications can be painful.

Specifically considering SVM:
◮ This is a convex optimization problem.
◮ We can solve it in many ways (primal, dual, projected gradient descent, coordinate descent, Newton, etc.); it doesn’t really matter so long as we end up close; the primal solutions are unique.

8 / 61
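For a convex objective, the optimization gap is purely a matter of running the solver longer; a minimal gradient-descent sketch on an invented one-dimensional convex objective with known minimizer, tracking the gap at each step.

```python
# gradient descent on a simple convex objective; the optimization gap
# objective(w_t) - objective(w*) shrinks geometrically here
def objective(w):
    # 1-d quadratic (invented for illustration) with known minimizer w* = 2
    return (w - 2.0) ** 2

def grad(w):
    return 2.0 * (w - 2.0)

w = 0.0
gaps = []
for _ in range(50):
    gaps.append(objective(w) - objective(2.0))  # current optimization gap
    w -= 0.1 * grad(w)

# more steps of a reasonable solver => smaller optimization gap
print(gaps[0], gaps[-1])
```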

slide-29
SLIDE 29

Generalization

f̂ is returned by ERM.

R(f̂) − R̂(f̂)

This quantity is excess risk; when it is small, we say we generalize, otherwise we overfit.

◮ Before, we said “By CLT, R̂(f̄) → R(f̄) as n → ∞”. Is this quantity the same?
◮ No! f̂ is a random variable!
◮ Controlling this quantity will be the main topic next week!
◮ Basic gist: this quantity degrades the “larger” F is. Measuring the “size” of F is tricky business! We can get something pretty nice by considering F_{k,λ} (kernel and regularizer).

9 / 61
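The distinction matters: for the data-dependent f̂, the empirical risk can be wildly optimistic. A pure-Python sketch (the setup is invented for illustration) in which the ERM over the class of *all* functions memorizes pure-noise labels: zero empirical risk, yet true risk near 1/2.

```python
import random

random.seed(0)

# pure-noise labels: there is nothing to learn, the best possible risk is 1/2
def sample(n):
    return [(random.random(), random.choice([-1, 1])) for _ in range(n)]

train, test = sample(200), sample(2000)

# "ERM" over the class of all functions: memorize the training set
table = dict(train)
def fhat(x):
    return table.get(x, 1)

def risk(f, data):
    return sum(1 for x, y in data if f(x) != y) / len(data)

# empirical risk of the memorizer is zero, but its true risk stays near 1/2
print(risk(fhat, train), risk(fhat, test))
```

For a fixed f the CLT argument applies; for f̂ it does not, because f̂ was chosen using the very same sample — exactly the gap between the estimation and generalization terms.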

slide-34
SLIDE 34

Decomposing excess risk, revisited

Nature labels according to some g (not necessarily inside F!); f̂ ∈ F is the ERM, f̄ ∈ F is best for R.

R(f̂) = R(g) (inherent unpredictability)
 + R(f̄) − R(g) (approximation gap)
 + R̂(f̄) − R(f̄) (estimation gap)
 + R̂(f̂) − R̂(f̄) (optimization gap)
 + R(f̂) − R̂(f̂) (generalization gap).

◮ Key point: reason about the tightest F defined by data and algorithm.
◮ For SVM, this means F_{k,λ}, a random variable depending on the data, regularization, and kernel.
◮ Smaller F worsens approximation, but improves generalization (and might improve optimization).

10 / 61

slide-36
SLIDE 36

SVM risk, revisited

[Figure: the same plot as before — training (“tr”) and test (“te”) misclassification rate versus C for the affine, quadratic, poly10, rbf1, and rbf01 SVMs.]

11 / 61

slide-37
SLIDE 37
  • 2. k-nn
slide-38
SLIDE 38

Consider k-nn with varying k.

We can also interpret this in terms of function class size! 1-nn breaks ℝ^d into n data-dependent regions; k-nn breaks ℝ^d into potentially n^k data-dependent regions!

12 / 61
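The data-dependence of 1-nn's regions is visible directly: each training point sits in its own region and predicts its own label, so 1-nn always has zero training error. A pure-Python sketch in one dimension (the data and noise level are invented for illustration).

```python
import random

random.seed(0)

def knn_predict(train, x, k):
    # majority vote among the k nearest training points (1-d for simplicity)
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return 1 if sum(y for _, y in nearest) >= 0 else -1

# toy data (invented for illustration): mostly +1 labels with some noise
train = [(random.uniform(-1, 1), 1 if random.random() < 0.8 else -1)
         for _ in range(30)]

def boundaries(k):
    # count prediction changes along a grid: a proxy for region count
    preds = [knn_predict(train, i / 100.0, k) for i in range(-100, 101)]
    return sum(1 for a, b in zip(preds, preds[1:]) if a != b)

# on this noisy 1-d sample, larger k averages away the noise points,
# yielding fewer prediction regions
print(boundaries(1), boundaries(15))
```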

slide-41
SLIDE 41
  • 3. Deep networks
slide-42
SLIDE 42

Do things still work?

Some parts of the story seem off. E.g., fiddling with regularization isn’t the way to control excess risk.

13 / 61

slide-46
SLIDE 46

Conflated concerns

◮ For SVM, we could decouple the components; e.g., the optimization problem is convex with a unique optimum, and any reasonable solver suffices.
◮ In deep networks, the choice of solver affects all other aspects of the problem, and moreover the solver is affected by the data and architecture/approximation.

14 / 61

slide-48
SLIDE 48
  • 4. Summary
slide-49
SLIDE 49

Decomposing excess risk

ERM f̂ is an approximate ERM: R̂(f̂) ≈ min_{f∈F} R̂(f). Now it depends highly on the algorithm, not just the data! Let’s also pick the true/population risk minimizer f̄ := arg min_{f∈F} R(f); this choice is now more careful as well.

Nature labels according to some g (not necessarily inside F!):

R(f̂) = R(g) (inherent unpredictability)
 + R(f̄) − R(g) (approximation gap)
 + R̂(f̄) − R(f̄) (estimation gap)
 + R̂(f̂) − R̂(f̄) (optimization gap)
 + R(f̂) − R̂(f̂) (generalization gap).

15 / 61

slide-51
SLIDE 51

Approximation gap

f̄ minimizes R over F, and g is chosen by nature; consider R(f̄) − R(g) (approximation gap).

◮ We know that there exist wide shallow networks that make this arbitrarily small. But what can we say about the networks (and parameters) used in practice?
◮ We must consider the optimization and the data when constructing this class. It will be more complicated than F_{k,λ} with kernel SVM. . .

16 / 61

slide-54
SLIDE 54

Optimization gap

f̂ ∈ F minimizes R̂, and f̄ ∈ F minimizes R.

R̂(f̂) − R̂(f̄) (optimization gap)

◮ This problem is NP-hard in general, but we can find global optima with gradient descent and other solvers.
◮ How do we know? We get 0 error.
◮ The solutions are no longer unique, and depend on the optimization algorithm.
◮ We pick an optimizer with an eye towards test error!
◮ Network architecture choices and data seem to heavily influence optimization.

17 / 61

slide-56
SLIDE 56

Generalization

f̂ is returned by ERM.

R(f̂) − R̂(f̂)

This quantity is excess risk; when it is small, we say we generalize, otherwise we overfit.

◮ It seems all reasonable deep networks will get 0 training error on many tasks (certainly true in computer vision?).
◮ The generalization question is currently a mess, and we have no idea how to measure “size”.
◮ E.g., sometimes increasing the number of nodes or edges can decrease excess risk.
◮ Convolutions and other magical choices like batch norm might be helping in ways no one understands.
◮ “Normal” regularization (e.g., weight decay) can hurt other aspects (e.g., optimization).

18 / 61

slide-63
SLIDE 63

Test error?

Is test error all we should discuss and analyze?

◮ Adversarial robustness; consider self-driving cars! Theory questions: certified robustness, algorithmic defenses, . . .
◮ Deep reinforcement learning; many aspects are very finicky.
◮ Architecture search and tradeoffs.
◮ . . .

19 / 61

slide-68
SLIDE 68
  • 5. Closing comments on theory
slide-69
SLIDE 69

ML theory vs CS theory

CS Theory.
◮ Design and analysis of algorithms.
◮ Time complexity, space complexity, . . .
◮ Often worst-case.

ML Theory.
◮ Design and analysis of ML algorithms.
◮ Time complexity, space complexity, . . . sample complexity, label complexity, covariate shift, . . .
◮ Often average-case.

20 / 61

slide-71
SLIDE 71

Why?

◮ Some old methods like SVM and AdaBoost were rooted in theory from their earliest genesis.
◮ Now, the pendulum has swung (and broken off?) in applied work; it has been a while since theory contributed a genuine algorithm, although theory sometimes guides practice.
◮ Why do theory?
◮ For some problems in DL, it might help. (E.g., adversarial robustness.)
◮ Personally, I do it because I want to (curiosity/fun/etc.).

21 / 61

slide-75
SLIDE 75

Summary

◮ Decomposition of risk into estimation, approximation, generalization, optimization.
◮ Decoupling of concerns for some problems (SVM) versus tangled concerns for others (DL).
◮ General thought process: (a) we carefully identify the tightest class of predictors F considered by the algorithm on this particular data; (b) generalization seems to worsen with some intricate notion of the size of F.
◮ Next week: statistical learning theory, focusing on estimation and generalization.

22 / 61

slide-76
SLIDE 76

Part 2. . .