Applied Machine Learning: Regularization - PowerPoint PPT Presentation



slide-1
SLIDE 1

Applied Machine Learning

Regularization

Siamak Ravanbakhsh

COMP 551 (Winter 2020)

slide-2
SLIDE 2

Learning objectives

  • basic idea of overfitting and underfitting
  • regularization (L1 & L2)
  • MLE vs. MAP estimation
  • bias and variance trade-off
  • evaluation metrics & cross-validation

slide-3
SLIDE 3

Previously...

  • linear regression and logistic regression
  • is linear too simple? what if it's not a good fit?
  • how to increase the model's expressiveness? create new nonlinear features
  • is there a downside?

slide-4
SLIDE 4

Recall: nonlinear basis functions

Replace the original features in

f(x) = w_0 + \sum_d w_d x_d

with nonlinear bases:

f(x) = w_0 + \sum_d w_d \phi_d(x)

Replacing X with \Phi gives the linear least squares solution

w = (\Phi^\top \Phi)^{-1} \Phi^\top y

where each row of \Phi holds the features of one instance and each column is a (nonlinear) feature:

\Phi = \begin{bmatrix}
\phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_D(x^{(1)}) \\
\phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_D(x^{(2)}) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_1(x^{(N)}) & \phi_2(x^{(N)}) & \cdots & \phi_D(x^{(N)})
\end{bmatrix}
slide-8
SLIDE 8

Recall: nonlinear basis functions

Examples, for scalar input x ∈ R:

  • polynomial bases: \phi_k(x) = x^k
  • Gaussian bases: \phi_k(x) = \exp\left(-\frac{(x-\mu_k)^2}{s^2}\right)
  • sigmoid bases: \phi_k(x) = \frac{1}{1 + \exp\left(-\frac{x-\mu_k}{s}\right)}

slide-9
SLIDE 9

Example: Gaussian bases

\phi_k(x) = \exp\left(-\frac{(x-\mu_k)^2}{s^2}\right)

The data is generated as

y^{(n)} = \sin(x^{(n)}) + \cos(\sqrt{|x^{(n)}|}) + \epsilon

Our fit to the data uses 10 Gaussian bases. The prediction for a new instance x' is found using linear least squares:

f(x') = \phi(x')^\top (\Phi^\top \Phi)^{-1} \Phi^\top y

where \phi(x') is the vector of features evaluated at the new point and (\Phi^\top \Phi)^{-1} \Phi^\top y = w.

slide-13
SLIDE 13

Example: Gaussian bases

\phi_k(x) = \exp\left(-\frac{(x-\mu_k)^2}{s^2}\right)

import numpy as np
import matplotlib.pyplot as plt
# x: (N,) inputs, y: (N,) targets
mu = np.linspace(0, 10, 10)                     # 10 Gaussian bases
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-(x - mu)**2)
Phi = phi(x[:, None], mu[None, :])              # N x 10
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

Our fit to the data uses 10 Gaussian bases. Why not more?

slide-14
SLIDE 14

Example: Gaussian bases

\phi_k(x) = \exp\left(-\frac{(x-\mu_k)^2}{s^2}\right)

mu = np.linspace(0, 10, 50)                     # 50 Gaussian bases
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-(x - mu)**2)
Phi = phi(x[:, None], mu[None, :])              # N x 50
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

Using 50 bases.

slide-15
SLIDE 15

Example: Gaussian bases

\phi_k(x) = \exp\left(-\frac{(x-\mu_k)^2}{s^2}\right)

mu = np.linspace(0, 10, 200)                    # 200 Gaussian bases
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-((x - mu) / .1)**2) # thinner bases (s = .1)
Phi = phi(x[:, None], mu[None, :])              # N x 200
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

Using 200, thinner bases (s = .1): the cost function J(w) is small and we have a "perfect" fit!

slide-17
SLIDE 17

Generalization and overfitting

Which one of these models performs better at test time?

[Figure: fits with D = 5, D = 10, D = 50 and D = 200 Gaussian bases; larger D gives lower training error. A second figure shows the predictions f(x') of the 4 models for the same new input x': the model with the lowest test error is neither the most complex (overfitting) nor the simplest (underfitting).]

slide-21
SLIDE 21

Model selection

How to pick the model with the lowest expected loss / test error?

  • bound the test error by bounding the training error plus a measure of model complexity: regularization
  • use a validation set for model selection (and a separate test set for final model assessment)

slide-24
SLIDE 24

An observation

When overfitting, we often see large weights. The dashed lines in the figures are w_d \phi_d(x) for each d, shown for D = 10, D = 15 and D = 20.

Idea: penalize large parameter values.

slide-29
SLIDE 29

Ridge regression

L2-regularized linear least squares regression:

J(w) = \frac{1}{2} ||Xw - y||_2^2 + \frac{\lambda}{2} ||w||_2^2

  • the first term is the sum of squared errors, \frac{1}{2} \sum_n (y^{(n)} - w^\top x^{(n)})^2
  • the second term is the (squared) L2 norm of w:  w^\top w = \sum_d w_d^2
  • the regularization parameter \lambda > 0 controls the strength of regularization
  • a good practice is to not penalize the intercept, i.e., use \lambda (||w||_2^2 - w_0^2)

slide-34
SLIDE 34

Ridge regression

We can set the derivative to zero:

J(w) = \frac{1}{2} (Xw - y)^\top (Xw - y) + \frac{\lambda}{2} w^\top w

\nabla J(w) = X^\top (Xw - y) + \lambda w = 0
(X^\top X + \lambda I) w = X^\top y
w = (X^\top X + \lambda I)^{-1} X^\top y

The \lambda I term is the only part that differs due to regularization, and it makes X^\top X + \lambda I invertible: we can have linearly dependent features (e.g., D > N) and the solution will still be unique.

When using gradient descent, the \lambda w term reduces the weights at each step (weight decay).
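Below is a minimal numpy sketch (not from the original slides; the function names are my own) of the two routes just described: the closed-form ridge solution, and a single gradient-descent step in which the \lambda w term acts as weight decay.

import numpy as np

def ridge_fit(X, y, lam):
    # closed form: w = (X^T X + lam * I)^{-1} X^T y
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

def ridge_gd_step(w, X, y, lam, lr=0.01):
    # one gradient-descent step; the lam * w term shrinks the weights (weight decay)
    grad = X.T @ (X @ w - y) + lam * w
    return w - lr * grad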

slide-40
SLIDE 40

Example: polynomial bases

\phi_k(x) = x^k

Without regularization: fits with degree 2 (D=3), degree 4 (D=5) and degree 9 (D=10). Using D=10 we can perfectly fit the data (and get high test error).

slide-44
SLIDE 44

Example: polynomial bases

\phi_k(x) = x^k

With regularization: fixing D=10 and changing the amount of regularization, \lambda = 0, \lambda = .1, \lambda = 10.
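To reproduce the flavour of this experiment, here is a small sketch (not from the slides) on synthetic data: degree-9 polynomial features with the ridge solution for a few values of \lambda. The data-generating function below is an arbitrary stand-in for the slide's dataset.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)                        # synthetic 1-D inputs
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=30)

Phi = np.vander(x, N=10, increasing=True)        # degree-9 polynomial features, D = 10
for lam in [0.0, 0.1, 10.0]:
    w = np.linalg.lstsq(Phi.T @ Phi + lam * np.eye(10), Phi.T @ y, rcond=None)[0]
    print(lam, round(float(np.linalg.norm(w)), 2))   # the weight norm shrinks as lam grows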

slide-48
SLIDE 48

Data normalization

What if we scale the input features, using a different factor per feature?

\tilde{x}_d^{(n)} = \gamma_d \, x_d^{(n)}   \forall d, n

If we have no regularization, everything remains the same: the weights simply rescale as \tilde{w}_d = \frac{1}{\gamma_d} w_d for all d, because

||Xw - y||_2^2 = ||\tilde{X}\tilde{w} - y||_2^2

With regularization, ||\tilde{w}||_2^2 \neq ||w||_2^2, so the optimal w will be different! Features of different mean and variance will be penalized differently.

Normalization makes sure all features have the same mean and variance:

\mu_d = \frac{1}{N} \sum_n x_d^{(n)},   \sigma_d^2 = \frac{1}{N-1} \sum_n (x_d^{(n)} - \mu_d)^2,   x_d^{(n)} \leftarrow \frac{x_d^{(n)} - \mu_d}{\sigma_d}
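A minimal numpy sketch of this normalization (not from the slides; the train/test handling is my own addition, with statistics computed on the training set only):

import numpy as np

def standardize(X_train, X_test):
    # scale every feature to zero mean and unit variance using training statistics
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0, ddof=1)          # matches the (N-1) denominator above
    return (X_train - mu) / sigma, (X_test - mu) / sigma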

slide-52
SLIDE 52

Maximum likelihood

Previously: linear regression & logistic regression maximize the log-likelihood.

Linear regression:

w = \arg\max_w p(y|w) = \arg\max_w \prod_{n=1}^N \mathcal{N}(y^{(n)}; w^\top \phi(x^{(n)}), \sigma^2) \equiv \arg\min_w \sum_n L_2(y^{(n)}, w^\top \phi(x^{(n)}))

Logistic regression:

w = \arg\max_w p(y|x, w) = \arg\max_w \prod_{n=1}^N \mathrm{Bernoulli}(y^{(n)}; \sigma(w^\top \phi(x^{(n)}))) \equiv \arg\min_w \sum_n L_{CE}(y^{(n)}, \sigma(w^\top \phi(x^{(n)})))

Idea: maximize the posterior instead of the likelihood:

p(w|y) = \frac{p(w)\, p(y|w)}{p(y)}

slide-56
SLIDE 56

Maximum a Posteriori (MAP)

Use Bayes rule and find the parameters with maximum posterior probability:

p(w|y) = \frac{p(w)\, p(y|w)}{p(y)}

The denominator p(y) is the same for all choices of w, so we can ignore it. The MAP estimate is

w = \arg\max_w p(w)\, p(y|w) \equiv \arg\max_w \log p(y|w) + \log p(w)

where the first term is the likelihood (the original objective) and the second is the prior. Even better would be to estimate the full posterior distribution p(w|y); more on this later in the course!

slide-62
SLIDE 62

Gaussian prior

Gaussian likelihood and Gaussian prior:

w = \arg\max_w p(w)\, p(y|w) \equiv \arg\max_w \log p(y|w) + \log p(w)

\equiv \arg\max_w \log \mathcal{N}(y \mid w^\top x, \sigma^2) + \sum_{d=1}^D \log \mathcal{N}(w_d \mid 0, \tau^2)

assuming an independent Gaussian prior (one per weight),

\equiv \arg\max_w \; -\frac{1}{2\sigma^2} (y - w^\top x)^2 - \frac{1}{2\tau^2} \sum_{d=1}^D w_d^2

\equiv \arg\min_w \; \frac{1}{2} (y - w^\top x)^2 + \frac{\sigma^2}{2\tau^2} \sum_{d=1}^D w_d^2

and with multiple data points,

\equiv \arg\min_w \; \frac{1}{2} \sum_n (y^{(n)} - w^\top x^{(n)})^2 + \frac{\lambda}{2} \sum_{d=1}^D w_d^2,   with \lambda = \frac{\sigma^2}{\tau^2}

This is L2 regularization: L2 regularization amounts to assuming a Gaussian prior on the weights. The same is true for logistic regression (or any other cost function).
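For reference, the step from the log-prior to the quadratic penalty uses only the form of the Gaussian log-density; the terms that do not depend on w_d are constants and can be dropped:

\log \mathcal{N}(w_d \mid 0, \tau^2) = -\frac{w_d^2}{2\tau^2} - \frac{1}{2}\log(2\pi\tau^2)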

slide-71
SLIDE 71

Laplace prior

Another notable choice of prior is the Laplace distribution:

p(w; \beta) = \frac{1}{2\beta} e^{-\frac{|w|}{\beta}}

Notice the peak around zero.

Minimizing the negative log of this prior contributes a term proportional to the L1 norm of w:

-\sum_d \log p(w_d) = \frac{1}{\beta} \sum_d |w_d| + \text{const} = \frac{1}{\beta} ||w||_1 + \text{const}

L1 regularization:

J(w) \leftarrow J(w) + \lambda ||w||_1

This is also called lasso (least absolute shrinkage and selection operator).

image: https://stats.stackexchange.com/questions/177210/why-is-laplace-prior-producing-sparse-solutions
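Not from the slides: a small scikit-learn sketch (assuming scikit-learn is available) that makes lasso's sparsity visible next to ridge on synthetic data where only a few features matter.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]                    # only 3 relevant features
y = X @ true_w + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge nonzero weights:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))   # typically all 20, just small
print("lasso nonzero weights:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))   # typically only a few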

slide-76
SLIDE 76

L1 vs. L2 regularization

The regularization path shows how the weights {w_d} change as we change \lambda (the horizontal axis is a decreasing regularization coefficient \lambda).

[Figure: regularization paths for ridge regression and for lasso. Lasso produces sparse weights (many are exactly zero, rather than just small); the red line marks the optimal \lambda found by cross-validation.]
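One way to trace such paths, sketched here with scikit-learn on synthetic data (not from the slides):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)     # only two relevant features

lambdas = np.logspace(2, -3, 30)                            # strong to weak regularization
ridge_coefs = np.array([Ridge(alpha=l).fit(X, y).coef_ for l in lambdas])
lasso_coefs = np.array([Lasso(alpha=l, max_iter=10000).fit(X, y).coef_ for l in lambdas])
# each row is the weight vector at one lambda; plotting columns against lambdas gives
# the regularization paths (the lasso rows contain exact zeros, the ridge rows do not)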

slide-79
SLIDE 79

L1 vs. L2 regularization

min_w J(w) + \lambda ||w||_p^p   is equivalent to   min_w J(w) subject to ||w||_p^p \le \tilde{\lambda}

for an appropriate choice of \tilde{\lambda}; here J(w) can be any convex cost function.

[Figure: the constraint regions ||w||_2^2 \le \tilde{\lambda} and ||w||_1 \le \tilde{\lambda} in the (w_1, w_2) plane, together with the isocontours of J(w); w^{MLE} marks the unconstrained optimum and w^{MAP} the constrained one.]

The optimal solution with L1 regularization is more likely to have zero components, since the corners of the L1 constraint region sit on the axes.

slide-82
SLIDE 82

Subset selection

[Figure: level sets of the penalties \sum_d w_d^4, \sum_d w_d^2, \sum_d |w_d|, \sum_d |w_d|^{1/2} and \sum_d |w_d|^{1/10}; the smaller the exponent, the closer the shape gets to that of the 0-norm.]

  • p-norms with p \ge 1 are convex (easier to optimize)
  • p-norms with p \le 1 induce sparsity

The L0 norm penalizes the number of non-zero features:

J(w) + \lambda ||w||_0 = J(w) + \lambda \sum_d \mathbb{I}(w_d \neq 0)

This performs feature selection, with a penalty of \lambda for each feature used. However, optimizing it is a difficult combinatorial problem: a search over all 2^D subsets. L1 regularization is a viable alternative to L0 regularization.
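For intuition only, here is what the combinatorial L0 search would look like as code (not from the slides; the function name is my own, and this is only feasible for small D):

import numpy as np
from itertools import combinations

def best_subset(X, y, lam):
    # exhaustive search over all 2^D feature subsets, scoring J(w) + lam * ||w||_0
    N, D = X.shape
    best_cost, best_S = np.inf, ()
    for k in range(D + 1):
        for S in combinations(range(D), k):
            if k == 0:
                resid = y
            else:
                Xs = X[:, list(S)]
                w, *_ = np.linalg.lstsq(Xs, y, rcond=None)
                resid = y - Xs @ w
            cost = 0.5 * np.sum(resid**2) + lam * k
            if cost < best_cost:
                best_cost, best_S = cost, S
    return best_S, best_cost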
slide-90
SLIDE 90

Bias-variance decomposition

For the L2 loss:

  • assume a true distribution p(x, y)
  • the regression function is f(x) = E_p[y | x]
  • assume a dataset D = {(x^{(n)}, y^{(n)})}_n is sampled from p(x, y)
  • let \hat{f}_D be our model fit to the dataset D

What we care about is the expected loss (aka risk):

E[(\hat{f}_D(x) - y)^2]

where both y and \hat{f}_D (through the random dataset D) are random variables.

slide-96
SLIDE 96

Bias-variance decomposition

For the L2 loss, write y = f(x) + \epsilon and add and subtract E_D[\hat{f}_D(x)] inside the expected loss:

E[(\hat{f}_D(x) - y)^2]
= E[(\hat{f}_D(x) - E_D[\hat{f}_D(x)] + E_D[\hat{f}_D(x)] - y)^2]
= E[(\hat{f}_D(x) - E_D[\hat{f}_D(x)])^2]   (variance)
  + E[(f(x) - E_D[\hat{f}_D(x)])^2]         (bias, squared)
  + E[\epsilon^2]                           (unavoidable noise error)

The remaining cross terms evaluate to zero (check for yourself!).

slide-102
SLIDE 102

Bias-variance decomposition

For the L2 loss, the expected loss decomposes into:

  • bias: E[(f(x) - E_D[\hat{f}_D(x)])^2], how the average model over all datasets differs from the regression function
  • variance: E[(\hat{f}_D(x) - E_D[\hat{f}_D(x)])^2], how a change of dataset affects the prediction
  • noise error: E[\epsilon^2], the error even if we used the true model f(x)

Different models vary in their trade-off between error due to bias and error due to variance: simple models are often more biased, while complex models often have more variance.

image: P. Domingos' posted article
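Not from the slides: a small simulation in the spirit of this decomposition, estimating bias^2 and variance by refitting a deliberately flexible model on many datasets sampled from an assumed true function. Polynomial features stand in here for the Gaussian bases used on the slides.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)                          # assumed "true" regression function
xs = np.linspace(0, 10, 200)                     # points at which the fits are compared

def fit_one_dataset(D, N=25):
    # sample one dataset of size N and fit a degree-(D-1) polynomial by least squares
    x = rng.uniform(0, 10, N)
    y = f(x) + 0.3 * rng.normal(size=N)
    w = np.polyfit(x, y, deg=D - 1)
    return np.polyval(w, xs)

fits = np.array([fit_one_dataset(D=10) for _ in range(200)])   # 200 random datasets
avg_fit = fits.mean(axis=0)
bias2 = np.mean((f(xs) - avg_fit)**2)
variance = np.mean((fits - avg_fit)**2)
print(bias2, variance)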

slide-108
SLIDE 108

Example: bias vs. variance

[Figure: models \hat{f}_D fit to different random datasets of size N = 25 (the instances themselves are not shown), using Gaussian bases; alongside them, their average E_D[\hat{f}_D] and the true model f.]

Bias is the difference (in L2 norm) between the average curve and the true model; variance is the average difference (in squared L2 norm) between the individual curves and their average.

slide-112
SLIDE 112

Example: bias vs. variance

Using a larger regularization penalty gives higher bias and lower variance.

Side note: the average fit is very good despite the high variance; model averaging uses the "average" prediction of expressive models to prevent overfitting.

As regularization is varied we trade increasing variance for increasing bias; the lowest expected loss (test error) is somewhere between the two extremes. In reality we don't have access to the true model, so how do we decide which model to use?

slide-118
SLIDE 118

Big picture!

[Figure: prediction error vs. model complexity D, showing the error for individual random datasets together with the average training error and the average test error.]

High variance in more complex models means that test and training error can be very different; high bias in simplistic models means that even the training error can be high.

slide-121
SLIDE 121

Model selection

How to pick the model with the lowest expected loss / test error?

  • bound the test error by bounding the training error plus a measure of model complexity: regularization
  • use a validation set for model selection (and a separate test set for final model assessment)

In the end we may have to use a validation set to find the right amount of regularization.

slide-123
SLIDE 123

Cross validation

To get a more reliable estimate of the test error from a validation set, use K-fold cross-validation (CV):

  • randomly partition the data into K folds
  • use K-1 folds for training, and 1 for validation
  • report the average/std of the validation error over all folds

Leave-one-out CV is the extreme case of K = N.

Once the hyper-parameters are selected, we can use the whole set for training; use the test set for the final assessment.

image credit: Thanh Nguyen et al'19
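A minimal numpy sketch of K-fold CV for choosing the ridge \lambda (not from the slides; the helper name is my own):

import numpy as np

def kfold_cv_ridge(X, y, lam, K=5, seed=0):
    # average validation error of ridge regression over K folds
    idx = np.random.default_rng(seed).permutation(X.shape[0])
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        val = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        A = X[tr].T @ X[tr] + lam * np.eye(X.shape[1])
        w = np.linalg.solve(A, X[tr].T @ y[tr])
        errs.append(np.mean((X[val] @ w - y[val])**2))
    return np.mean(errs), np.std(errs)

# pick the lambda with the lowest average validation error, then retrain on all the data
# best_lam = min([0.01, 0.1, 1, 10], key=lambda l: kfold_cv_ridge(X, y, l)[0])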

slide-127
SLIDE 127

Evaluation

The evaluation metric can be different from the optimization objective. The confusion matrix is a CxC table that compares truth vs. prediction; for binary classification its off-diagonal counts correspond to type I vs. type II errors. Some evaluation metrics based on the confusion table:

  • Accuracy = (TP + TN) / (P + N)
  • Error rate = (FP + FN) / (P + N)
  • Recall = TP / P
  • Precision = TP / RP   (RP: the number of instances predicted positive)
  • F1 score = 2 (Precision x Recall) / (Precision + Recall)
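Not from the slides: a short sketch that computes these metrics directly from 0/1 labels and predictions.

import numpy as np

def binary_metrics(y_true, y_pred):
    # confusion counts and the metrics above, for 0/1 arrays
    TP = np.sum((y_pred == 1) & (y_true == 1))
    TN = np.sum((y_pred == 0) & (y_true == 0))
    FP = np.sum((y_pred == 1) & (y_true == 0))
    FN = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    recall = TP / (TP + FN)                      # TP / P
    precision = TP / (TP + FP)                   # TP / RP
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1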

slide-130
SLIDE 130

Evaluation

If we produce a class score (probability) p(y = 1|x), we can trade off between type I & type II errors by moving the decision threshold.

Goal: evaluate the class scores/probabilities independently of the choice of threshold. The Receiver Operating Characteristic (ROC) curve plots, as the threshold varies,

  • TPR = TP / P   (recall, sensitivity)
  • FPR = FP / N   (fallout, false alarm)
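A small sketch (not from the slides) of how the ROC curve is traced from class scores:

import numpy as np

def roc_points(scores, y_true):
    # TPR and FPR at every threshold, sweeping the scores from high to low
    thresholds = np.sort(np.unique(scores))[::-1]
    P = np.sum(y_true == 1)
    N = np.sum(y_true == 0)
    tpr = np.array([np.sum((scores >= t) & (y_true == 1)) / P for t in thresholds])
    fpr = np.array([np.sum((scores >= t) & (y_true == 0)) / N for t in thresholds])
    return fpr, tpr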

slide-133
SLIDE 133

Summary

  • complex models can have very different training and test error (generalization gap)
  • regularization bounds this gap by penalizing model complexity
  • L1 & L2 regularization; probabilistic interpretation: different priors on the weights
  • L1 produces sparse solutions (useful for feature selection)
  • bias-variance trade-off: formalizes the relation between training error (bias), complexity (variance) and test error (bias + variance); not so elegant beyond the L2 loss
  • (cross-)validation for model selection