Learning Theory & Regularization

Shan-Hung Wu

shwu@cs.nthu.edu.tw

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 1 / 44

Outline

1. Learning Theory
2. Point Estimation: Bias and Variance; Consistency*
3. Decomposing Generalization Error
4. Regularization: Weight Decay; Validation


Which Polynomial Degree Is Better? I

Given a training set X = {(x^(i), y^(i))}_{i=1}^N i.i.d. sampled from P(x, y)

Assume P(x, y) = P(y|x)P(x), where
  P(x) ∼ Uniform(−1, 1)
  P(y|x): y = sin(πx) + ε, ε ∼ N(0, σ²)
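The data-generating process above can be sketched in a few lines of NumPy (a hypothetical helper; the noise level `sigma` and the seed are illustrative choices, not from the slides):

```python
import numpy as np

def make_dataset(n, sigma=0.2, seed=0):
    """Draw n i.i.d. pairs with x ~ Uniform(-1, 1), y = sin(pi*x) + eps."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    eps = rng.normal(0.0, sigma, size=n)   # eps ~ N(0, sigma^2)
    return x, np.sin(np.pi * x) + eps

x, y = make_dataset(10)
```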

Which Polynomial Degree Is Better? II

Consider 3 unregularized polynomial regressors of degrees P = 1, 3, and 10. Which one would you pick?

Probably not P = 1 nor P = 10. Note that P = 10 has zero training error:
any N points can be perfectly fitted by a polynomial of degree N − 1.
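The claim that any N points (with distinct inputs) are fitted exactly by a degree-(N − 1) polynomial is easy to check numerically (a sketch using `numpy.polyfit`; the specific N and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
x = np.linspace(-1.0, 1.0, N)                # N distinct inputs
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, N)
coef = np.polyfit(x, y, deg=N - 1)           # degree N-1 interpolant
train_mse = np.mean((np.polyval(coef, x) - y) ** 2)
print(train_mse)                             # zero up to floating point
```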

Empirical Error vs. Generalization Error

In ML, we usually "learn" a function by minimizing the empirical error/risk defined over a training set of size N:

  C_N(w) or C_N[f] = (1/N) ∑_{i=1}^N loss(f(x^(i); w), y^(i))

E.g., C_N(w) = (1/2) ∑_{i=1}^N (y^(i) − w^T x^(i))² in linear regression

But our goal is to have a low generalization error/risk defined over the underlying data distribution:

  C(w) or C[f] = ∫ loss(f(x; w), y) dP(x, y)

It can be estimated by the testing error

  C_{N'}(w) = (1/N') ∑_{i=1}^{N'} loss(f(x'^(i); w), y'^(i))

defined over the testing set X' = {(x'^(i), y'^(i))}_{i=1}^{N'}

Does a low C_N[f] imply a low C[f]? No, as P = 10 indicates.
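The P = 10 example can be reproduced with a small simulation (a sketch; the training-set size, noise level, and seed are illustrative assumptions): a large held-out set stands in for C[f], and the degree-10 fit drives the training error to near zero while its testing error blows up.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, sigma=0.3):
    x = rng.uniform(-1.0, 1.0, n)
    return x, np.sin(np.pi * x) + rng.normal(0.0, sigma, n)

x_tr, y_tr = sample(11)        # small training set
x_te, y_te = sample(10_000)    # large testing set estimates C[f]

errors = {}
for P in (1, 3, 10):
    coef = np.polyfit(x_tr, y_tr, deg=P)         # empirical-error minimizer
    tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)   # C_N estimate
    te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)   # C[f] estimate
    errors[P] = (tr, te)
    print(P, (tr, te))
```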

No-Free-Lunch Theorem

Why is C[f] defined over a particular data-generating distribution P?

Theorem (No-Free-Lunch Theorem [4])
Averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying unseen points.

No machine learning algorithm is universally better than any other. The goal of ML is therefore not to seek a universally good learning algorithm, but an algorithm that performs well on data drawn from the particular P we care about.

Learning Theory

Let f* = argmin_f C[f] be the best possible function we can get.

Since we are seeking a prediction function in a model (hypothesis space) F, this is the best we can have:

  f*_F = argmin_{f∈F} C[f]

But we only minimize empirical errors on limited examples of size N, so this is what we actually have:

  f_N = argmin_{f∈F} C_N[f]

(ignoring numerical errors due to, e.g., numerical optimization)

Learning theory: how to characterize

  C[f_N] = ∫ loss(f_N(x; w), y) dP(x, y)?

Do not confuse C[f_N] with C_N[f]. Two families of approaches: bounding methods and decomposition methods.

Bounding Methods I

min_f C[f] = C[f*] is called the Bayes error.

It is larger than 0 when there is randomness in P(y|x); e.g., in our regression problem, y = f*(x; w) + ε, ε ∼ N(0, σ²). It cannot be avoided even if we know the ground-truth P(x, y).

So, our target is to make C[f_N] as close to C[f*] as possible.

Bounding Methods II

Let E = C[f_N] − C[f*] be the excess error. We have

  E = (C[f*_F] − C[f*]) + (C[f_N] − C[f*_F])
    =        E_app       +        E_est

E_app is called the approximation error; E_est is called the estimation error.

How to reduce these errors? We can reduce E_app by choosing a more complex F: a complex F has a larger capacity, e.g., a larger polynomial degree P in polynomial regression.

How to reduce E_est?

Bounding Methods III

Bounds of E_est for, e.g., binary classifiers [1, 2, 3]:

  E_est = O( ((Complexity(F) · log N) / N)^α ),  α ∈ [1/2, 1],

with high probability.

So, to reduce E_est, we should either have a simpler model (e.g., smaller polynomial degree P), or a larger training set.

Model Complexity, Overfit, and Underfit

Too simple a model leads to high E_app due to underfitting:
  f_N fails to capture the shape of f*
  High training error; high testing error (given a sufficiently large N)

Too complex a model leads to high E_est due to overfitting:
  f_N captures not only the shape of f* but also some spurious patterns (e.g., noise) local to a particular training set
  Low training error; high testing error

Sample Complexity and Learning Curves

How many training examples (N) are sufficient? Different models/algorithms may have different sample complexity, i.e., the N required to learn a target function with a specified generalizability.

This can be visualized using learning curves. Too small an N results in overfitting regardless of model complexity.
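The learning-curve picture can be sketched numerically (the training sizes, noise level, and seed are illustrative assumptions): for a fixed model, training error rises and testing error falls toward each other as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3
x_te = rng.uniform(-1.0, 1.0, 20_000)                 # held-out set
y_te = np.sin(np.pi * x_te) + rng.normal(0.0, sigma, x_te.size)

def errors_at(N, P=3):
    """Train a degree-P regressor on N fresh points; return (train, test) MSE."""
    x = rng.uniform(-1.0, 1.0, N)
    y = np.sin(np.pi * x) + rng.normal(0.0, sigma, N)
    coef = np.polyfit(x, y, deg=P)
    tr = np.mean((np.polyval(coef, x) - y) ** 2)
    te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    return tr, te

curve = {N: errors_at(N) for N in (10, 100, 1000)}
print(curve)
```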

Decomposition Methods

Bounding methods analyze C[f_N] qualitatively. They are general, as no (or only weak) assumptions about the data distribution are made. However, in practice, these bounds are too loose to quantify C[f_N].

In some particular situations, we can decompose C[f_N] into multiple meaningful terms. This assumes a particular loss function loss(·) and data-generating distribution P(x, y), and requires knowledge about point estimation.


Sample Mean and Variance

Point estimation is the attempt to estimate some fixed but unknown quantity θ of a random variable by using sample data.

Let X = {x^(1), …, x^(n)} be a set of n i.i.d. samples of a random variable x. A point estimator or statistic is a function of the data:

  θ̂_n = g(x^(1), …, x^(n))

The value θ̂_n is called the estimate of θ.

Sample mean: μ̂_x = (1/n) ∑_i x^(i)
Sample variance: σ̂²_x = (1/n) ∑_i (x^(i) − μ̂_x)²

How good are these estimators?

Bias & Variance

Bias of an estimator: bias(θ̂_n) = E_X(θ̂_n) − θ

Here, the expectation is taken over all possible X's of size n, i.e., E_X(θ̂_n) = ∫ θ̂_n dP(X).

We call a statistic an unbiased estimator iff it has zero bias.

Variance of an estimator: Var_X(θ̂_n) = E_X[(θ̂_n − E_X[θ̂_n])²]

Is μ̂_x = (1/n) ∑_i x^(i) an unbiased estimator of μ_x? Yes [Homework].

What is Var_X(μ̂_x)?

Variance of μ̂_x

Var_X(μ̂) = E_X[(μ̂ − E_X[μ̂])²] = E[μ̂² − 2μ̂μ + μ²] = E[μ̂²] − μ²
  = E[(1/n²) ∑_{i,j} x^(i) x^(j)] − μ² = (1/n²) ∑_{i,j} E[x^(i) x^(j)] − μ²
  = (1/n²) ( ∑_{i=j} E[x^(i) x^(j)] + ∑_{i≠j} E[x^(i) x^(j)] ) − μ²
  = (1/n²) ( ∑_i E[x^(i)²] + n(n−1) E[x^(i)] E[x^(j)] ) − μ²
  = (1/n) E[x²] + ((n−1)/n) μ² − μ² = (1/n) (E[x²] − μ²)
  = (1/n) σ²_x

The variance of μ̂_x diminishes as n → ∞.
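A Monte-Carlo check of Var_X(μ̂_x) = σ²_x / n (the parameters are illustrative): draw many samples of size n, take each sample's mean, and compare the variance of those means against σ²/n.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, trials = 2.0, 50, 50_000
samples = rng.normal(5.0, sigma, size=(trials, n))   # trials samples of size n
mu_hats = samples.mean(axis=1)                       # one sample mean per trial
print(mu_hats.var(), sigma**2 / n)                   # both close to 0.08
```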

Unbiased Estimator of σ²_x

Is σ̂²_x = (1/n) ∑_i (x^(i) − μ̂_x)² an unbiased estimator of σ²_x? No:

E_X[σ̂²] = E[(1/n) ∑_i (x^(i) − μ̂)²] = E[(1/n)(∑_i x^(i)² − 2 ∑_i x^(i) μ̂ + ∑_i μ̂²)]
  = E[(1/n)(∑_i x^(i)² − n μ̂²)] = (1/n)(∑_i E[x^(i)²] − n E[μ̂²])
  = E[x²] − E[μ̂²] = E[(x − μ)² + 2xμ − μ²] − E[μ̂²] = (σ² + μ²) − (Var[μ̂] + E[μ̂]²)
  = σ² + μ² − (1/n)σ² − μ² = ((n−1)/n) σ² ≠ σ²

What is the unbiased estimator of σ²_x?

  σ̂²_x = (n/(n−1)) · (1/n) ∑_i (x^(i) − μ̂_x)² = (1/(n−1)) ∑_i (x^(i) − μ̂_x)²
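The (n−1)/n bias is easy to see by simulation (illustrative parameters): average the 1/n estimator over many samples and compare with σ² = 4; rescaling by n/(n−1) removes the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, trials = 4.0, 5, 200_000
x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
dev = x - x.mean(axis=1, keepdims=True)      # deviations from each sample mean
biased = (dev ** 2).mean(axis=1)             # (1/n)   * sum (x - mu_hat)^2
unbiased = biased * n / (n - 1)              # (1/(n-1)) * sum (x - mu_hat)^2
print(biased.mean(), unbiased.mean())        # ~ (n-1)/n * 4 = 3.2 vs ~ 4.0
```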

Mean Square Error

Mean square error of an estimator: MSE(θ̂_n) = E_X[(θ̂_n − θ)²]

It can be decomposed into the bias and variance:

E_X[(θ̂_n − θ)²] = E[(θ̂_n − E[θ̂_n] + E[θ̂_n] − θ)²]
  = E[(θ̂_n − E[θ̂_n])² + (E[θ̂_n] − θ)² + 2(θ̂_n − E[θ̂_n])(E[θ̂_n] − θ)]
  = E[(θ̂_n − E[θ̂_n])²] + (E[θ̂_n] − θ)² + 2 · E[θ̂_n − E[θ̂_n]] · (E[θ̂_n] − θ)
  = E[(θ̂_n − E[θ̂_n])²] + (E[θ̂_n] − θ)² + 2 · 0 · (E[θ̂_n] − θ)
  = Var_X(θ̂_n) + bias(θ̂_n)²

The MSE of an unbiased estimator is its variance.
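The decomposition can be verified numerically for the biased 1/n variance estimator (illustrative parameters); with empirical moments, MSE = Var + bias² holds exactly up to floating point.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, trials = 4.0, 5, 100_000
x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
est = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)  # 1/n estimator
mse = np.mean((est - sigma2) ** 2)   # MSE(theta_hat)
var = est.var()                      # Var_X(theta_hat)
bias = est.mean() - sigma2           # bias(theta_hat), about -sigma2/n
print(mse, var + bias**2)            # equal up to floating point
```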


Consistency

So far, we discussed the "goodness" of an estimator based on samples of fixed size. If we have more samples, will the estimate become more accurate?

An estimator is (weakly) consistent iff

  θ̂_n →^Pr θ as n → ∞,

where →^Pr means "converges in probability". It is strongly consistent iff θ̂_n converges to θ almost surely.

Law of Large Numbers

Theorem (Weak Law of Large Numbers)
The sample mean μ̂_x = (1/n) ∑_i x^(i) is a consistent estimator of μ_x, i.e., lim_{n→∞} Pr(|μ̂_{x,n} − μ_x| < ε) = 1 for any ε > 0.

Theorem (Strong Law of Large Numbers)
In addition, μ̂_x is a strongly consistent estimator: Pr(lim_{n→∞} μ̂_{x,n} = μ_x) = 1.
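A quick numerical illustration of the law of large numbers (the distribution and sample size are illustrative): the running sample mean drifts toward the true mean μ as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 3.0
x = rng.exponential(mu, size=100_000)            # i.i.d. draws, true mean 3.0
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)   # mu_hat_{x,n} for each n
print(running_mean[9], running_mean[-1])         # the latter is near 3.0
```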


slide-85
SLIDE 85

Expected Generalization Error

In ML, we get f_N = argmin_{f∈F} C_N[f] by minimizing the empirical error over a training set of size N.

How to decompose the generalization error C[f_N]? Regard f_N(x) as an estimate of the true label y given x:

f_N is an estimator mapped from the i.i.d. samples in the training set X.

To evaluate the estimator f_N, we consider the expected generalization error:

E_X(C[f_N]) = E_X[∫ loss(f_N(x), y) dP(x, y)]
= E_{X,x,y}[loss(f_N(x), y)]
= E_x[E_{X,y}[loss(f_N(x), y) | x]]

There's a simple decomposition of E_{X,y}[loss(f_N(x), y) | x] for
linear/polynomial regression

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 25 / 44

slide-92
SLIDE 92

Example: Linear/Polynomial Regression

In linear/polynomial regression, we have:

loss(·) = (·)², the squared loss
y = f*(x) + ε, ε ~ N(0, σ²); thus E_y[y|x] = f*(x) and Var_y[y|x] = σ²

We can decompose the mean squared error:

E_{X,y}[loss(f_N(x), y) | x] = E_{X,y}[(f_N(x) − y)² | x]
= E_{X,y}[y² + f_N(x)² − 2 f_N(x) y | x]
= E_y[y² | x] + E_X[f_N(x)² | x] − 2 E_{X,y}[f_N(x) y | x]
= (Var_y[y|x] + E_y[y|x]²) + (Var_X[f_N(x)|x] + E_X[f_N(x)|x]²) − 2 E_y[y|x] E_X[f_N(x)|x]
= Var_y[y|x] + Var_X[f_N(x)|x] + (E_X[f_N(x)|x] − E_y[y|x])²
= Var_y[y|x] + Var_X[f_N(x)|x] + E_X[f_N(x) − f*(x) | x]²
= σ² + Var_X[f_N(x)|x] + bias[f_N(x)|x]²

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 26 / 44
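The variance and squared-bias terms can be estimated by Monte Carlo under the slides' setup (y = sin(πx) + ε, ε ~ N(0, σ²)): fit an unregularized degree-P polynomial on many independent training sets and inspect the spread and the offset of its predictions at a fixed query point. A hedged sketch, with σ, N, x₀, and the trial count chosen for illustration:

```python
# Monte Carlo estimate of Var_X[f_N(x0)|x0] and bias[f_N(x0)|x0]^2
# for polynomial regressors of degree P, under y = sin(pi x) + noise.
import numpy as np

rng = np.random.default_rng(1)
sigma, N, x0 = 0.3, 30, 0.5

def f_star(x):
    return np.sin(np.pi * x)

def fit_and_predict(P):
    # draw one training set, fit a degree-P polynomial, predict at x0
    x = rng.uniform(-1, 1, N)
    y = f_star(x) + rng.normal(0, sigma, N)
    coef = np.polyfit(x, y, deg=P)
    return np.polyval(coef, x0)

def bias2_and_var(P, trials=300):
    preds = np.array([fit_and_predict(P) for _ in range(trials)])
    bias2 = (preds.mean() - f_star(x0)) ** 2
    return bias2, preds.var()

b1, v1 = bias2_and_var(1)     # too simple: large bias, small variance
b10, v10 = bias2_and_var(10)  # too complex: small bias, large variance
```

This makes the tradeoff of the next slide concrete: the simple model's error is dominated by bias², the complex model's by variance.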

slide-95
SLIDE 95

Bias-Variance Tradeoff I

E_X(C[f_N]) = E_x[E_{X,y}[loss(f_N(x), y) | x]] = E_x[σ² + Var_X[f_N(x)|x] + bias[f_N(x)|x]²]

The first term cannot be avoided when P(y|x) is stochastic.

Model complexity controls the tradeoff between variance and bias. E.g., polynomial regressors (figure omitted; dotted line = average training error).

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 27 / 44

slide-98
SLIDE 98

Bias-Variance Tradeoff II

Provides another way to understand the generalization/testing error:

Too simple a model leads to high bias, i.e., underfitting: high training error and high testing error (given a sufficiently large N).

Too complex a model leads to high variance, i.e., overfitting: low training error but high testing error.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 28 / 44

slide-99
SLIDE 99

Outline

1. Learning Theory
2. Point Estimation: Bias and Variance; Consistency*
3. Decomposing Generalization Error
4. Regularization: Weight Decay; Validation

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 29 / 44

slide-103
SLIDE 103

Regularization

We get f_N = argmin_{f∈F} C_N[f] by minimizing the empirical error, but what we really care about is the generalization error C[f_N].

Regularization refers to any technique designed to improve the generalizability of f_N. Any idea inspired by the learning theory?

Regularization in the cost function: weight decay.
Regularization during the training process: validation.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 30 / 44

slide-104
SLIDE 104

Outline

1. Learning Theory
2. Point Estimation: Bias and Variance; Consistency*
3. Decomposing Generalization Error
4. Regularization: Weight Decay; Validation

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 31 / 44

slide-106
SLIDE 106

Penalizing Complex Functions

Occam's razor: among equally performing models, the simplest one should be selected.

Idea: add a term to the cost function that penalizes complex functions. Then, with a sufficiently complex F:

Minimizing the empirical error term reduces bias.
Minimizing the penalty term reduces variance.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 32 / 44

slide-111
SLIDE 111

What to Penalize?

What impacts Complexity(F) in a model?

Some constants in the model F:
E.g., the degree P in polynomial regression.
These restrict the capacity of F; however, they cannot be penalized in the cost function since they are fixed.

Alternatively, the function parameters:
E.g., the parameter w of a function f(·; w) ∈ F.
These also restrict the capacity of F, and they can be penalized.

But which w implies a complex model?

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 33 / 44

slide-117
SLIDE 117

Weight Decay

In practice, w = 0 is usually the "simplest" function.

E.g., in binary classification with labels {−1, 1}, a perceptron with w = 0 means random guessing.

Weight decay: penalize the norm of w, which is nonnegative and equals 0 when w = 0. E.g., Ridge regression:

argmin_{w,b} (1/2) ‖y − (Xw + b1)‖²  subject to ‖w‖² ≤ T

for some constant T > 0. In practice, we usually solve a simpler problem:

argmin_{w,b} (1/2N) ‖y − (Xw + b1)‖² + (α/2) ‖w‖²

where α > 0 is a constant representing both T and the KKT multiplier.

What does a larger α mean? That we prefer a simpler function.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 34 / 44
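The unconstrained Ridge objective has a closed-form solution: setting the gradient with respect to w to zero gives (XᵀX/N + αI) w = Xᵀ(y − b1)/N. A minimal sketch on illustrative data (centering is used here to handle the unregularized bias b, an assumption consistent with the "flat regressor" discussion below):

```python
# Closed-form Ridge regression; larger alpha shrinks ||w|| toward 0.
import numpy as np

rng = np.random.default_rng(2)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 1.0 + rng.normal(0, 0.1, N)

def ridge(X, y, alpha):
    n, d = X.shape
    # center features and labels so b is recovered as the mean offset
    Xc, yc = X - X.mean(0), y - y.mean()
    w = np.linalg.solve(Xc.T @ Xc / n + alpha * np.eye(d), Xc.T @ yc / n)
    b = y.mean() - X.mean(0) @ w
    return w, b

w_small, _ = ridge(X, y, alpha=1e-3)  # close to the unregularized solution
w_large, _ = ridge(X, y, alpha=10.0)  # heavily shrunk weights
```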

slide-120
SLIDE 120

Flat Regressors

argmin_{w,b} (1/2) ‖y − (Xw + b1)‖² + α‖w‖²

The bias b is not regularized. Why?

We want the simplest function, and w = 0 means "a dummy regressor that predicts the average". Remember R² (the coefficient of determination)? However, the labels y may not be standardized to have zero mean.

This explains why we prefer a "flat" hyperplane in the previous lecture. We have discussed how to solve the Ridge regression problem.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 35 / 44
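A quick numerical check of the "dummy regressor" claim: with w = 0 the cost reduces to (1/2N)‖y − b1‖², which is minimized at b = mean(y), the averaging baseline from R². Toy labels and grid points are illustrative:

```python
# With w = 0, min_b (1/2N)||y - b 1||^2 is attained at b = mean(y).
import numpy as np

y = np.array([1.0, 3.0, 2.0, 6.0])          # mean(y) = 3.0
costs = {b: 0.5 * np.mean((y - b) ** 2) for b in (0.0, 2.0, 3.0, 4.0)}
b_best = min(costs, key=costs.get)           # 3.0, i.e., mean(y)
```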

slide-121
SLIDE 121

Sparse Weight Decay

Alternatively, we can minimize the L1-norm in weight decay. E.g., LASSO (least absolute shrinkage and selection operator):

argmin_{w,b} (1/2N) ‖y − (Xw + b1)‖² + α‖w‖₁

for some constant α > 0. This usually results in a sparse w that has many zero entries. Why?

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 36 / 44
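One way to see the sparsity mechanically is a proximal-gradient (ISTA) solver for the LASSO objective: each iteration takes a gradient step on the squared loss, then soft-thresholds, which sets small coordinates exactly to zero. This solver and its data are an illustrative sketch, not the slides' method (the slides argue sparsity geometrically on the next slide); the bias b is omitted for brevity:

```python
# ISTA for min_w (1/2N)||y - Xw||^2 + alpha*||w||_1 on synthetic data
# where only the first 2 of 10 features are relevant.
import numpy as np

rng = np.random.default_rng(3)
N, D = 100, 10
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:2] = [3.0, -2.0]
y = X @ w_true + rng.normal(0, 0.1, N)

def soft_threshold(v, t):
    # prox of t*||.||_1: shrink toward 0, snapping |v| <= t to exactly 0
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, alpha, iters=500):
    n, d = X.shape
    w = np.zeros(d)
    step = n / np.linalg.norm(X, 2) ** 2  # 1/L for the (1/2N)||.||^2 loss
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * alpha)
    return w

w_lasso = lasso_ista(X, y, alpha=0.5)  # irrelevant coordinates end up at 0
```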

slide-122
SLIDE 122

Sparsity

argmin_{w,b} (1/2N) ‖y − (Xw + b1)‖² + α‖w‖₁

The surface of the cost function is the sum of the SSE (blue contours) and the L1-norm (red contours) in the omitted figure. The optimal point tends to lie on an axis, where some coordinates of w are exactly zero.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 37 / 44

slide-126
SLIDE 126

Elastic Net**

LASSO can be used as a feature selection technique: the sparse w selects the explanatory variables that are most correlated to the target variable.

Limitations:

1. Selects at most N variables if D > N.
2. No group selection, which is important in some applications, e.g., gene selection problems.

Elastic net combines Ridge and LASSO:

argmin_{w,b} (1/2N) ‖y − (Xw + b1)‖² + α(β‖w‖₁ + ((1−β)/2)‖w‖²)

for some constant β ∈ (0, 1). It still gives a sparse w, and highly correlated variables will have similar values in w.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 38 / 44
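Under the parameterization above, the proximal step for the elastic-net penalty factors into soft-thresholding by step·α·β followed by scaling by 1/(1 + step·α·(1−β)). A hedged sketch reusing the ISTA idea; the data and constants are illustrative, and the bias b is again omitted:

```python
# Proximal gradient for the elastic net
#   (1/2N)||y - Xw||^2 + alpha*(beta*||w||_1 + (1-beta)/2*||w||^2).
import numpy as np

rng = np.random.default_rng(4)
N, D = 100, 10
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:2] = [3.0, -2.0]
y = X @ w_true + rng.normal(0, 0.1, N)

def enet_prox(v, step, alpha, beta):
    # prox of step*alpha*(beta*|.|_1 + (1-beta)/2*|.|^2):
    # soft-threshold, then shrink multiplicatively
    s = np.sign(v) * np.maximum(np.abs(v) - step * alpha * beta, 0.0)
    return s / (1.0 + step * alpha * (1.0 - beta))

def elastic_net(X, y, alpha, beta, iters=500):
    n, d = X.shape
    w = np.zeros(d)
    step = n / np.linalg.norm(X, 2) ** 2
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        w = enet_prox(w - step * grad, step, alpha, beta)
    return w

w_enet = elastic_net(X, y, alpha=0.5, beta=0.8)  # sparse, like LASSO
```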

slide-127
SLIDE 127

Outline

1. Learning Theory
2. Point Estimation: Bias and Variance; Consistency*
3. Decomposing Generalization Error
4. Regularization: Weight Decay; Validation

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 39 / 44

slide-131
SLIDE 131

Tuning Hyperparameters

In ML, we call the constants that are fixed in a model the hyperparameters:

The degree P in polynomial regression
The coefficient α of the weight decay term in the cost function of Ridge, LASSO, etc.

They usually reflect some assumptions about the model. Changing their values changes the model complexity, and therefore the generalization performance.

How to set appropriate values? Train the model many times with different hyperparameters, and choose the function with the best generalizability. This is very time consuming; can we have heuristics to speed up the process?

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 40 / 44

slide-132
SLIDE 132

Structured Risk Minimization

Consider again Occam's razor. Structured risk minimization: start from the simplest model, gradually increase its complexity, and stop when overfitting begins.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 41 / 44
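A toy rendition of this heuristic: grow the polynomial degree from 1 upward and stop as soon as the held-out error increases. The data mirror the slides' setup (sin(πx) + noise); the split sizes and the simple first-increase stopping rule are illustrative choices:

```python
# Structured risk minimization sketch: increase polynomial degree P
# until the validation error starts to rise (overfitting sets in).
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 60)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, 60)
x_tr, y_tr, x_va, y_va = x[:40], y[:40], x[40:], y[40:]

def val_error(P):
    coef = np.polyfit(x_tr, y_tr, deg=P)
    return np.mean((np.polyval(coef, x_va) - y_va) ** 2)

best_P, best_err = 1, val_error(1)
for P in range(2, 11):
    err = val_error(P)
    if err > best_err:   # validation error rose: stop growing the model
        break
    best_P, best_err = P, err
```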

slide-136
SLIDE 136

Validation Set

Pitfall: we peek at the testing set during the training process.

The final function will overfit the testing set, giving an optimistic testing error.

Fix? Split a validation set from the training set and use it for hyperparameter selection.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 42 / 44
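A minimal sketch of this recipe for choosing Ridge's α: hold out part of the training set, fit on the rest for each candidate α, and keep the α with the lowest validation error; the testing set is never touched. Data, split sizes, and the α grid are illustrative:

```python
# Hyperparameter selection on a validation split (no peeking at test data).
import numpy as np

rng = np.random.default_rng(6)
N, D = 80, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + rng.normal(0, 0.5, N)
X_tr, y_tr, X_va, y_va = X[:60], y[:60], X[60:], y[60:]

def ridge_fit(X, y, alpha):
    n, d = X.shape
    Xc, yc = X - X.mean(0), y - y.mean()   # center to handle the bias b
    w = np.linalg.solve(Xc.T @ Xc / n + alpha * np.eye(d), Xc.T @ yc / n)
    return w, y.mean() - X.mean(0) @ w

val_err = {}
for alpha in (1e-4, 1e-2, 1.0, 100.0):
    w, b = ridge_fit(X_tr, y_tr, alpha)
    val_err[alpha] = np.mean((X_va @ w + b - y_va) ** 2)
best_alpha = min(val_err, key=val_err.get)  # over-shrinking loses badly here
```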

slide-137
SLIDE 137

Reference I

[1] Olivier Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. Ph.D. thesis, École Polytechnique, Palaiseau, France, 2002.
[2] Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des sciences de Toulouse: Mathématiques, 9:245–303, 2000.
[3] Vladimir N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity, pages 11–30. Springer, 2015.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 43 / 44

slide-138
SLIDE 138

Reference II

[4] David H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 44 / 44