

SLIDE 1

Machine Learning Theory

CS 446

SLIDE 2

1. SVM risk
SLIDES 3-7

SVM risk

Consider the empirical and true/population risk of SVM: given $f$, the population risk is $R(f) = \mathbb{E}\,\ell(Y f(X))$, the empirical risk is $\widehat{R}(f) = \frac{1}{n}\sum_{i=1}^n \ell(y_i f(x_i))$, and furthermore define the excess risk $R(f) - \widehat{R}(f)$. What's going on here? (I just tricked you into caring about theory.)

1 / 22
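To make the two risks concrete, here is a minimal numeric sketch; the 1-d distribution, the predictor $f$, and the use of the hinge loss for $\ell$ are all illustrative assumptions, not the lecture's setup. $\widehat{R}$ averages the loss over a finite sample, while $R$ is approximated here by a very large fresh sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distribution (assumed for illustration): x ~ N(0,1),
# label y = sign(x), flipped with probability 0.1.
def sample(n):
    x = rng.normal(size=n)
    flip = np.where(rng.random(n) < 0.1, -1.0, 1.0)
    return x, np.sign(x) * flip

hinge = lambda m: np.maximum(0.0, 1.0 - m)   # stands in for the loss ell
f = lambda x: 2.0 * x                        # some fixed predictor

x, y = sample(50)
emp_risk = hinge(y * f(x)).mean()            # \hat R(f): average over n points
X, Y = sample(200_000)
pop_risk = hinge(Y * f(X)).mean()            # Monte Carlo stand-in for R(f)
print(emp_risk, pop_risk)
```

The two numbers differ: the empirical risk is a random quantity that fluctuates with the sample, which is exactly the gap the rest of the lecture decomposes.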

SLIDES 8-14

Decomposing excess risk

ERM $\hat f$ is an approximate ERM: $\widehat{R}(\hat f) \approx \min_{f\in\mathcal{F}} \widehat{R}(f)$. Let's also define the true/population risk minimizer $\bar f := \arg\min_{f\in\mathcal{F}} R(f)$.

(Question: What is $\mathcal{F}$? Answer: depends on the kernel!)

(Question: is $\bar f = \arg\min_{f\in\mathcal{F}} \widehat{R}(f)$? Answer: no; in general $\widehat{R}(\bar f) \ge \widehat{R}(\hat f)$!)

Nature labels according to some $g$ (not necessarily inside $\mathcal{F}$!):

$R(\hat f) = R(g)$ (inherent unpredictability)
$\quad + \; R(\bar f) - R(g)$ (approximation gap)
$\quad + \; \widehat{R}(\bar f) - R(\bar f)$ (estimation gap)
$\quad + \; \widehat{R}(\hat f) - \widehat{R}(\bar f)$ (optimization gap)
$\quad + \; R(\hat f) - \widehat{R}(\hat f)$ (generalization gap)

Let's go through this step by step.

2 / 22
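The five gap terms telescope exactly back to $R(\hat f)$: each intermediate risk appears once with a plus and once with a minus. A quick numeric sanity check with made-up risk values (the numbers are arbitrary; only the cancellation matters):

```python
# Hypothetical risk values, chosen only to exercise the algebra.
R_g       = 0.10   # R(g)
R_fbar    = 0.15   # R(fbar)
Rhat_fbar = 0.14   # Rhat(fbar)
Rhat_fhat = 0.13   # Rhat(fhat)
R_fhat    = 0.18   # R(fhat)

total = (R_g                          # inherent unpredictability
         + (R_fbar - R_g)             # approximation gap
         + (Rhat_fbar - R_fbar)       # estimation gap
         + (Rhat_fhat - Rhat_fbar)    # optimization gap
         + (R_fhat - Rhat_fhat))      # generalization gap
print(abs(total - R_fhat))            # adjacent terms cancel, leaving R(fhat)
```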

SLIDES 15-16

Inherent unpredictability

Nature labels according to some $g$ (not necessarily inside $\mathcal{F}$!): $R(g)$ (inherent unpredictability). If $g$ is the function with lowest classification error, we can write down an explicit form: $g(x) := \mathrm{sign}(\Pr[Y=+1 \mid X=x] - 1/2)$. If $g$ minimizes $R$ with a convex loss $\ell$, we can again write down $g$ pointwise via $\Pr[Y=+1 \mid X=x]$.

3 / 22
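A sketch of the explicit form for the misclassification-optimal $g$, under an assumed toy model where $\eta(x) := \Pr[Y=+1 \mid X=x] = x$ with $X$ uniform on $[0,1]$ (the model and all names here are illustrative):

```python
import numpy as np

eta = lambda x: x                          # assumed Pr[Y=+1 | X=x]
g = lambda x: np.sign(eta(x) - 0.5)        # g(x) := sign(eta(x) - 1/2)

# Its risk -- the inherent unpredictability -- is E[min(eta(X), 1-eta(X))],
# estimated here by Monte Carlo:
x = np.random.default_rng(1).random(1_000_000)   # X ~ Uniform[0,1]
bayes_risk = np.minimum(eta(x), 1.0 - eta(x)).mean()
print(round(bayes_risk, 3))                # analytically 1/4 for this eta
```

No classifier, inside $\mathcal{F}$ or not, can beat this risk; it is the floor the decomposition starts from.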

SLIDES 17-18

Approximation gap

$\bar f$ minimizes $R$ over $\mathcal{F}$, and $g$ is chosen by nature; consider $R(\bar f) - R(g)$ (approximation gap). We've shown that if $R$ uses misclassification loss, $\mathcal{F}$ is the class of affine classifiers, and $g$ is quadratic, the gap can be $1/4$. We can make this gap arbitrarily small if $\mathcal{F}$ is: a wide 2-layer network, an RBF kernel SVM, a polynomial classifier of arbitrary degree, . . . What is $\mathcal{F}$ for SVM?

4 / 22

SLIDES 19-20

Approximation gap

Consider SVM with no kernel. Can we only say $\mathcal{F} := \{x \mapsto w^\top x : w \in \mathbb{R}^d\}$? Note, for $\hat w := \arg\min_w \widehat{R}(w) + \frac{\lambda}{2}\|w\|^2$,

$\frac{\lambda}{2}\|\hat w\|^2 \le \widehat{R}(\hat w) + \frac{\lambda}{2}\|\hat w\|^2 \le \widehat{R}(0) + \frac{\lambda}{2}\|0\|^2 = \frac{1}{n}\sum_{i=1}^n [1 - 0^\top x_i y_i]_+ = 1,$

and so SVM is working with the finer set $\mathcal{F}_\lambda := \{x \mapsto w^\top x : \|w\|^2 \le 2/\lambda\}$.

5 / 22
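The norm bound can be checked numerically. A minimal sketch, assuming random data and a plain subgradient method with the standard $1/(\lambda t)$ step size; everything here (data, dimensions, step-size schedule) is an illustrative assumption, not the lecture's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 5, 0.1
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))

def objective(w):
    # \hat R(w) + (lam/2) ||w||^2 with the hinge loss
    return np.maximum(0.0, 1.0 - y * (X @ w)).mean() + 0.5 * lam * (w @ w)

w = np.zeros(d)
for t in range(1, 5001):
    active = y * (X @ w) < 1.0                    # points with positive hinge
    grad = -(y[active, None] * X[active]).sum(0) / n + lam * w
    w -= grad / (lam * t)                         # 1/(lam t) step size

# The slide's chain: (lam/2)||w||^2 <= objective(w) <= objective(0) = 1,
# hence ||w||^2 <= 2/lam near the minimizer.
print(objective(w), w @ w, 2.0 / lam)
```

The point is that regularization, not the nominal parameterization, determines the class the algorithm actually searches.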

SLIDES 21-24

Approximation gap

What about kernel SVM? Now we are working with $\mathcal{F}_k := \{x \mapsto \sum_{i=1}^n \alpha_i y_i k(x_i, x) : \alpha \in \mathbb{R}^n\}$, which is a random variable: $((x_i, y_i))_{i=1}^n$ is given by the data.

This function class is called a reproducing kernel Hilbert space (RKHS). We can use it to develop a refined notion $\mathcal{F}_{k,\lambda}$. Going forward: we always try to work with the tightest possible function class defined by the data and algorithm.

6 / 22
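A member of $\mathcal{F}_k$ is just a data-weighted sum of kernel evaluations. A small sketch with an RBF kernel, where the data, labels, and coefficient vector $\alpha$ are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
Xtr = rng.normal(size=(20, 2))          # ((x_i, y_i))_{i=1}^n given by data
ytr = np.sign(Xtr[:, 0])                # toy labels

def rbf(A, B, gamma=1.0):
    # pairwise k(a, b) = exp(-gamma * ||a - b||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

alpha = rng.random(20)                  # any alpha in R^n picks a member of F_k

def f(X):
    # f(x) = sum_i alpha_i y_i k(x_i, x), evaluated at each row of X
    return rbf(X, Xtr) @ (alpha * ytr)

print(f(Xtr[:3]))
```

Resampling `Xtr` changes the basis functions themselves, which is exactly why $\mathcal{F}_k$ is a random class rather than a fixed one.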

SLIDES 25-26

Estimation gap

$\bar f$ minimizes $R$ over $\mathcal{F}$: $\widehat{R}(\bar f) - R(\bar f)$ (estimation gap). If $((x_i, y_i))_{i=1}^n$ is drawn IID from the same distribution as the expectation in $R$, then by the central limit theorem, $\widehat{R}(\bar f) \xrightarrow{n\to\infty} R(\bar f)$. Next week, we'll discuss high probability bounds for finite $n$.

7 / 22
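Since $\bar f$ is fixed (not data-dependent), $\widehat{R}(\bar f)$ is an average of IID terms and concentrates around $R(\bar f)$. A quick illustration, where the distribution, predictor, and loss are assumptions for the sake of the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
fbar = lambda x: x                           # a FIXED predictor, chosen before any data
hinge = lambda m: np.maximum(0.0, 1.0 - m)

def emp_risk(n):
    # \hat R(fbar) computed on a fresh IID sample of size n
    x = rng.normal(size=n)
    y = np.sign(x + 0.5 * rng.normal(size=n))   # noisy labels
    return hinge(y * fbar(x)).mean()

ests = {n: emp_risk(n) for n in (100, 10_000, 1_000_000)}
for n, r in ests.items():
    print(n, round(r, 4))                    # fluctuations shrink like 1/sqrt(n)
```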

SLIDES 27-29

Optimization gap

$\hat f \in \mathcal{F}$ minimizes $\widehat{R}$, and $\bar f \in \mathcal{F}$ minimizes $R$: $\widehat{R}(\hat f) - \widehat{R}(\bar f)$ (optimization gap). This is algorithmic: we reduce this number by optimizing better. We've advocated the use of gradient descent. Many of these problems are NP-hard even in trivial cases. (Linear separators with noise and 3-node neural nets are NP-hard.) If $\widehat{R}$ uses a convex loss and $\hat f$ has at least one training mistake, relating $\widehat{R}$ and test set misclassifications can be painful.

Specifically considering SVM: this is a convex optimization problem. We can solve it in many ways (primal, dual, projected gradient descent, coordinate descent, Newton, etc.); it doesn't really matter so long as we end up close; the primal solutions are unique.

8 / 22

SLIDES 30-34

Generalization

$\hat f$ is returned by ERM: $R(\hat f) - \widehat{R}(\hat f)$. This quantity is excess risk; when it is small, we say we generalize, otherwise we overfit. Before, we said "By CLT, $\widehat{R}(\bar f) \xrightarrow{n\to\infty} R(\bar f)$". Is this quantity the same? No! $\hat f$ is a random variable! Controlling this quantity will be the main topic next week!

Basic gist: this quantity degrades the "larger" $\mathcal{F}$ is. Measuring the "size" of $\mathcal{F}$ is tricky business! We can get something pretty nice by considering $\mathcal{F}_{k,\lambda}$ (kernel and regularizer).

9 / 22
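To see why the CLT argument does not apply to $\hat f$: the ERM is chosen using the sample, so $\widehat{R}(\hat f)$ is optimistically biased downward. A small sketch with ERM over 1-d threshold classifiers; the data model and class are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
pred = lambda x, t: np.where(x > t, 1.0, -1.0)   # threshold classifier

def labels(x):
    return np.sign(x + 0.8 * rng.normal(size=x.shape))   # noisy sign labels

def erm_threshold(x, y):
    # ERM over F = { x -> sign(x - t) }: it suffices to try data thresholds
    ts = np.concatenate(([x.min() - 1.0], x))
    errs = [(pred(x, t) != y).mean() for t in ts]
    return ts[int(np.argmin(errs))]

x = rng.normal(size=20)
y = labels(x)
t = erm_threshold(x, y)
train_err = (pred(x, t) != y).mean()     # \hat R(\hat f)

X = rng.normal(size=200_000)
Y = labels(X)
test_err = (pred(X, t) != Y).mean()      # Monte Carlo stand-in for R(\hat f)
print(train_err, test_err)               # test error is typically the larger
```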

SLIDES 35-36

Decomposing excess risk, revisited

Nature labels according to some $g$ (not necessarily inside $\mathcal{F}$!); $\hat f \in \mathcal{F}$ is the ERM, $\bar f \in \mathcal{F}$ is best for $R$:

$R(\hat f) = R(g)$ (inherent unpredictability)
$\quad + \; R(\bar f) - R(g)$ (approximation gap)
$\quad + \; \widehat{R}(\bar f) - R(\bar f)$ (estimation gap)
$\quad + \; \widehat{R}(\hat f) - \widehat{R}(\bar f)$ (optimization gap)
$\quad + \; R(\hat f) - \widehat{R}(\hat f)$ (generalization gap)

Key point: reason about the tightest $\mathcal{F}$ defined by the data and algorithm. For SVM, this means $\mathcal{F}_{k,\lambda}$, a random variable depending on the data, regularization, and kernel. A smaller $\mathcal{F}$ worsens approximation, but improves generalization (and might improve optimization).

10 / 22

SLIDE 37

SVM risk, revisited

11 / 22
SLIDE 38

2. k-nn
SLIDES 39-41

Consider k-nn with varying $k$. We can also interpret this in terms of function class size! 1-nn breaks $\mathbb{R}^d$ into $n$ data-dependent regions; k-nn breaks $\mathbb{R}^d$ into potentially $n^k$ data-dependent regions!

12 / 22
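A minimal k-nn predictor in pure NumPy, to make the data-dependent regions concrete (the data and labels are illustrative):

```python
import numpy as np

def knn_predict(Xtr, ytr, x, k):
    # majority vote among the k nearest training points to x
    d2 = ((Xtr - x) ** 2).sum(axis=1)
    nearest = np.argsort(d2)[:k]
    return 1.0 if ytr[nearest].sum() >= 0 else -1.0

rng = np.random.default_rng(0)
Xtr = rng.normal(size=(30, 2))
ytr = np.sign(Xtr[:, 0])                 # toy labels: sign of first coordinate

print(knn_predict(Xtr, ytr, np.array([3.0, 0.0]), k=1))
print(knn_predict(Xtr, ytr, np.array([3.0, 0.0]), k=5))
```

Each prediction depends only on which k-subset of the training set is nearest, so the induced partition of $\mathbb{R}^d$, and hence the effective class size, is governed by the number of distinct k-subsets.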

SLIDE 42

3. Deep networks
SLIDES 43-46

Do things still work?

Some parts of the story seem off. E.g., fiddling with regularization isn't the way to control excess risk.

13 / 22

SLIDES 47-48

Conflated concerns

For SVM, we could decouple the components; e.g., the optimization problem is convex with a unique optimum, so any reasonable solver suffices. In deep networks, the choice of solver affects all other aspects of the problem, and moreover the solver is affected by the data and architecture/approximation.

14 / 22

SLIDE 49

4. Summary
SLIDES 50-51

Decomposing excess risk

ERM $\hat f$ is an approximate ERM: $\widehat{R}(\hat f) \approx \min_{f\in\mathcal{F}} \widehat{R}(f)$. Now it depends highly on the algorithm, not just the data! Let's also pick the true/population risk minimizer $\bar f := \arg\min_{f\in\mathcal{F}} R(f)$; this choice is now more careful as well.

Nature labels according to some $g$ (not necessarily inside $\mathcal{F}$!):

$R(\hat f) = R(g)$ (inherent unpredictability)
$\quad + \; R(\bar f) - R(g)$ (approximation gap)
$\quad + \; \widehat{R}(\bar f) - R(\bar f)$ (estimation gap)
$\quad + \; \widehat{R}(\hat f) - \widehat{R}(\bar f)$ (optimization gap)
$\quad + \; R(\hat f) - \widehat{R}(\hat f)$ (generalization gap)

15 / 22

SLIDES 52-54

Approximation gap

$\bar f$ minimizes $R$ over $\mathcal{F}$, and $g$ is chosen by nature; consider $R(\bar f) - R(g)$ (approximation gap). We know that there exist wide shallow networks that make this arbitrarily small. But what can we say about the networks (and parameters) used in practice? We must consider the optimization and the data when constructing this class. It will be more complicated than $\mathcal{F}_{k,\lambda}$ with kernel SVM. . .

16 / 22

SLIDES 55-56

Optimization gap

$\hat f \in \mathcal{F}$ minimizes $\widehat{R}$, and $\bar f \in \mathcal{F}$ minimizes $R$: $\widehat{R}(\hat f) - \widehat{R}(\bar f)$ (optimization gap). This problem is NP-hard in general, but we can find global optima with gradient descent and other solvers. How do we know? We get 0 error. The solutions are no longer unique, and depend on the optimization algorithm. We pick an optimizer with an eye towards test error! Network architecture choices and data seem to heavily influence optimization.

17 / 22

SLIDES 57-63

Generalization

$\hat f$ is returned by ERM: $R(\hat f) - \widehat{R}(\hat f)$. This quantity is excess risk; when it is small, we say we generalize, otherwise we overfit. It seems all reasonable deep networks will get 0 training error on many tasks (certainly true in computer vision?). The generalization question is currently a mess, and we have no idea how to measure "size". E.g., sometimes increasing the number of nodes or edges can decrease excess risk. Convolutions and other magical choices like batch norm might be helping in ways no one understands. "Normal" regularization (e.g., weight decay) can hurt other aspects (e.g., optimization).

18 / 22

SLIDE 64

corrected from lecture

SLIDES 65-69

Test error?

Is test error all we should discuss and analyze?

  • Adversarial robustness; consider self-driving cars! Theory questions: certified robustness, algorithmic defenses, . . .
  • Deep reinforcement learning; many aspects are very finicky.
  • Architecture search and tradeoffs.
  • . . .

19 / 22

SLIDE 70

5. Closing comments on theory
SLIDES 71-72

ML theory vs CS theory

CS Theory. Design and analysis of algorithms. Time complexity, space complexity, . . . Often worst-case.

ML Theory. Design and analysis of ML algorithms. Time complexity, space complexity, . . . sample complexity, label complexity, covariate shift, . . . Often average-case.

20 / 22

SLIDES 73-76

Why?

Some old methods like SVM and AdaBoost were rooted in theory from their earliest genesis. Now, the pendulum has swung (and broken off?) toward applied work; it has been a while since theory contributed a genuine algorithm, although theory sometimes guides practice. Why do theory? For some problems in DL, it might help. (E.g., adversarial robustness.) Personally, I do it because I want to (curiosity/fun/etc.).

21 / 22

SLIDE 77

Summary

Decomposition of risk into estimation, approximation, generalization, optimization. Decoupling of concerns for some problems (SVM) versus tangled concerns for others (DL). General thought process: (a) carefully identify the tightest class of predictors $\mathcal{F}$ considered by the algorithm on this particular data; (b) generalization seems to worsen with some intricate notion of the size of $\mathcal{F}$. Next week: statistical learning theory, focusing on estimation and generalization.

22 / 22