SLIDE 1

Regularization

Jia-Bin Huang Virginia Tech

Spring 2019

ECE-5424G / CS-5824

SLIDE 2

Administrative

  • Women in Data Science Blacksburg
  • Location: Holtzman Alumni Center
  • Welcome, 3:30 - 3:40, Assembly hall
  • Keynote Speaker: Milinda Lakkam, "Detecting automation on LinkedIn's platform," 3:40 - 4:05, Assembly hall
  • Career Panel, 4:05 - 5:00, Assembly hall
  • Break, 5:00 - 5:20, Grand hall
  • Keynote Speaker: Sally Morton, "Bias," 5:20 - 5:45, Assembly hall
  • Dinner with breakout discussion groups, 5:45 - 7:00, Museum
  • Introductory track tutorial: Jennifer Van Mullekom, "Data Visualization," 7:00 - 8:15, Assembly hall
  • Advanced track tutorial: Cheryl Danner, "Focal-loss-based Deep Learning for Object Detection," 7:00 - 8:15, 2nd floor board room

SLIDE 3

k-NN (Classification/Regression)

  • Model

๐‘ฆ 1 , ๐‘ง 1 , ๐‘ฆ 2 , ๐‘ง 2 , โ‹ฏ , ๐‘ฆ ๐‘› , ๐‘ง ๐‘›

  • Cost function

None

  • Learning

Do nothing

  • Inference

เทœ ๐‘ง = โ„Ž ๐‘ฆtest = ๐‘ง(๐‘™), where ๐‘™ = argmin๐‘— ๐ธ(๐‘ฆtest, ๐‘ฆ(๐‘—))

SLIDE 4

Linear regression (Regression)

  • Model

โ„Ž๐œ„ ๐‘ฆ = ๐œ„0 + ๐œ„1๐‘ฆ1 + ๐œ„2๐‘ฆ2 + โ‹ฏ + ๐œ„๐‘œ๐‘ฆ๐‘œ = ๐œ„โŠค๐‘ฆ

  • Cost function

๐พ ๐œ„ = 1 2๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘—

2

  • Learning

1) Gradient descent: Repeat { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }

2) Solving the normal equation: $\theta = (X^\top X)^{-1} X^\top y$

  • Inference

เทœ ๐‘ง = โ„Ž๐œ„ ๐‘ฆtest = ๐œ„โŠค๐‘ฆtest

SLIDE 5

Naïve Bayes (Classification)

  • Model

โ„Ž๐œ„ ๐‘ฆ = ๐‘„(๐‘|๐‘Œ1, ๐‘Œ2, โ‹ฏ , ๐‘Œ๐‘œ) โˆ ๐‘„ ๐‘ ฮ ๐‘—๐‘„ ๐‘Œ๐‘— ๐‘)

  • Cost function

Maximum likelihood estimation: $J(\theta) = -\log P(\text{Data} \mid \theta)$
Maximum a posteriori estimation: $J(\theta) = -\log P(\text{Data} \mid \theta) \, P(\theta)$

  • Learning

๐œŒ๐‘™ = ๐‘„(๐‘ = ๐‘ง๐‘™) (Discrete ๐‘Œ๐‘—) ๐œ„๐‘—๐‘˜๐‘™ = ๐‘„(๐‘Œ๐‘— = ๐‘ฆ๐‘—๐‘˜๐‘™|๐‘ = ๐‘ง๐‘™) (Continuous ๐‘Œ๐‘—) mean ๐œˆ๐‘—๐‘™, variance ๐œ๐‘—๐‘™

2 , ๐‘„ ๐‘Œ๐‘— ๐‘ = ๐‘ง๐‘™) = ๐’ช(๐‘Œ๐‘—|๐œˆ๐‘—๐‘™, ๐œ๐‘—๐‘™

2 )

  • Inference

เท  ๐‘ โ† argmax

๐‘ง๐‘™

๐‘„ ๐‘ = ๐‘ง๐‘™ ฮ ๐‘—๐‘„ ๐‘Œ๐‘—

test ๐‘ = ๐‘ง๐‘™)

SLIDE 6

Logistic regression (Classification)

  • Model

โ„Ž๐œ„ ๐‘ฆ = ๐‘„ ๐‘ = 1 ๐‘Œ1, ๐‘Œ2, โ‹ฏ , ๐‘Œ๐‘œ =

1 1+๐‘“โˆ’๐œ„โŠค๐‘ฆ

  • Cost function

๐พ ๐œ„ = 1 ๐‘› เท

๐‘—=1 ๐‘›

Cost(โ„Ž๐œ„(๐‘ฆ ๐‘— ), ๐‘ง(๐‘—))) Cost(โ„Ž๐œ„ ๐‘ฆ , ๐‘ง) = เตโˆ’log โ„Ž๐œ„ ๐‘ฆ if ๐‘ง = 1 โˆ’log 1 โˆ’ โ„Ž๐œ„ ๐‘ฆ if ๐‘ง = 0

  • Learning

Gradient descent: Repeat { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }

  • Inference

เท  ๐‘ = โ„Ž๐œ„ ๐‘ฆtest = 1 1 + ๐‘“โˆ’๐œ„โŠค๐‘ฆtest

SLIDE 7

Logistic Regression

  • Hypothesis representation
  • Cost function
  • Logistic regression with gradient descent
  • Regularization
  • Multi-class classification

โ„Ž๐œ„ ๐‘ฆ = 1 1 + ๐‘“โˆ’๐œ„โŠค๐‘ฆ Cost(โ„Ž๐œ„ ๐‘ฆ , ๐‘ง) = เตโˆ’log โ„Ž๐œ„ ๐‘ฆ if ๐‘ง = 1 โˆ’log 1 โˆ’ โ„Ž๐œ„ ๐‘ฆ if ๐‘ง = 0 ๐œ„

๐‘˜ โ‰” ๐œ„ ๐‘˜ โˆ’ ๐›ฝ 1

๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง(๐‘—) ๐‘ฆ๐‘˜

(๐‘—)

SLIDE 8

How about MAP?

  • Maximum conditional likelihood estimate (MCLE)
  • Maximum conditional a posteriori estimate (MCAP)

$\theta_{\mathrm{MCLE}} = \mathrm{argmax}_\theta \; \prod_{i=1}^{m} P_\theta \left( y^{(i)} \mid x^{(i)} \right)$

$\theta_{\mathrm{MCAP}} = \mathrm{argmax}_\theta \; \left[ \prod_{i=1}^{m} P_\theta \left( y^{(i)} \mid x^{(i)} \right) \right] P(\theta)$

SLIDE 9

Prior ๐‘„(๐œ„)

  • Common choice of ๐‘„(๐œ„):
  • Normal distribution, zero mean, identity covariance
  • โ€œPushesโ€ parameters towards zeros
  • Corresponds to Regularization
  • Helps avoid very large weights and overfitting

Slide credit: Tom Mitchell

SLIDE 10

MLE vs. MAP

  • Maximum conditional likelihood estimate (MCLE)

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

  • Maximum conditional a posteriori estimate (MCAP)

$\theta_j := \theta_j - \alpha \lambda \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

SLIDE 11

Logistic Regression

  • Hypothesis representation
  • Cost function
  • Logistic regression with gradient descent
  • Regularization
  • Multi-class classification
SLIDE 12

Multi-class classification

  • Email foldering/tagging: Work, Friends, Family, Hobby
  • Medical diagnosis: Not ill, Cold, Flu
  • Weather: Sunny, Cloudy, Rain, Snow

Slide credit: Andrew Ng

SLIDE 13

Binary classification

๐‘ฆ2 ๐‘ฆ1

Multiclass classification

๐‘ฆ2 ๐‘ฆ1

SLIDE 14

One-vs-all (one-vs-rest)

๐‘ฆ2 ๐‘ฆ1

Class 1: Class 2: Class 3:

โ„Ž๐œ„

๐‘— ๐‘ฆ = ๐‘„ ๐‘ง = ๐‘— ๐‘ฆ; ๐œ„

(๐‘— = 1, 2, 3) ๐‘ฆ2 ๐‘ฆ1 ๐‘ฆ2 ๐‘ฆ1 ๐‘ฆ2 ๐‘ฆ1

โ„Ž๐œ„

1 ๐‘ฆ

โ„Ž๐œ„

2 ๐‘ฆ

โ„Ž๐œ„

3 ๐‘ฆ

Slide credit: Andrew Ng

SLIDE 15

One-vs-all

  • Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$
  • Given a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$ (see the sketch below)

Slide credit: Andrew Ng
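A self-contained sketch of the one-vs-all recipe (the gradient-descent trainer mirrors the earlier logistic-regression snippet; hyperparameters are placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.1, iters=2000):
    """Plain logistic regression via batch gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta

def one_vs_all_train(X, y, num_classes):
    """Train one classifier per class: class i vs. the rest."""
    return np.array([train_binary(X, (y == i).astype(float))
                     for i in range(num_classes)])

def one_vs_all_predict(X, thetas):
    """Pick the class whose classifier reports the highest probability."""
    return np.argmax(sigmoid(X @ thetas.T), axis=1)
```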

SLIDE 16

Generative Approach

Ex: Naïve Bayes

Estimate ๐‘„(๐‘) and ๐‘„(๐‘Œ|๐‘) Prediction

เทœ ๐‘ง = argmax๐‘ง ๐‘„ ๐‘ = ๐‘ง ๐‘„(๐‘Œ = ๐‘ฆ|๐‘ = ๐‘ง)

Discriminative Approach

Ex: Logistic regression

Estimate ๐‘„(๐‘|๐‘Œ) directly (Or a discriminant function: e.g., SVM) Prediction

เทœ ๐‘ง = ๐‘„(๐‘ = ๐‘ง|๐‘Œ = ๐‘ฆ)

SLIDE 17

Further readings

  • Tom M. Mitchell

Generative and Discriminative Classifiers: Naïve Bayes and Logistic Regression. http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf

  • Andrew Ng, Michael Jordan

On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf

SLIDE 18

Regularization

  • Overfitting
  • Cost function
  • Regularized linear regression
  • Regularized logistic regression
SLIDE 19

Regularization

  • Overfitting
  • Cost function
  • Regularized linear regression
  • Regularized logistic regression
SLIDE 20

Example: Linear regression

[Three plots: Price ($) in 1000's vs. Size in feet^2]

$h_\theta(x) = \theta_0 + \theta_1 x$ (Underfitting)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ (Just right)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$ (Overfitting)

Slide credit: Andrew Ng

SLIDE 21

Overfitting

  • If we have too many features (i.e., a complex model), the learned hypothesis may fit the training set very well,

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \approx 0$

but fail to generalize to new examples (e.g., predicting prices on new examples).

Slide credit: Andrew Ng

SLIDE 22

Example: Linear regression

[Three plots: Price ($) in 1000's vs. Size in feet^2]

$h_\theta(x) = \theta_0 + \theta_1 x$ (Underfitting: high bias)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ (Just right)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$ (Overfitting: high variance)

Slide credit: Andrew Ng

SLIDE 23

Bias-Variance Tradeoff

  • Bias: difference between what you expect to learn and the truth
  • Measures how well you expect to represent the true solution
  • Decreases with more complex model
  • Variance: difference between what you expect to learn and what you learn from a particular dataset
  • Measures how sensitive the learner is to the specific dataset
  • Increases with more complex model
SLIDE 24

[Dartboard illustration: combinations of low/high bias and low/high variance]

SLIDE 25

Bias–variance decomposition

  • Training set $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$
  • $y = f(x) + \epsilon$
  • We want $\hat{f}(x)$ that minimizes $E\left[ \left( y - \hat{f}(x) \right)^2 \right]$

$E\left[ \left( y - \hat{f}(x) \right)^2 \right] = \mathrm{Bias}\left[ \hat{f}(x) \right]^2 + \mathrm{Var}\left[ \hat{f}(x) \right] + \sigma^2$

$\mathrm{Bias}\left[ \hat{f}(x) \right] = E\left[ \hat{f}(x) \right] - f(x)$, $\quad \mathrm{Var}\left[ \hat{f}(x) \right] = E\left[ \hat{f}(x)^2 \right] - E\left[ \hat{f}(x) \right]^2$

https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
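The slide states the decomposition without the intermediate steps; it follows by adding and subtracting $E[\hat{f}]$ inside the square, using that $\epsilon$ is independent of $\hat{f}$ with $E[\epsilon] = 0$ and $\mathrm{Var}[\epsilon] = \sigma^2$ (a standard derivation, sketched here for completeness):

```latex
\begin{aligned}
E\big[(y-\hat f)^2\big]
  &= E\big[(f+\epsilon-\hat f)^2\big]
   = E\big[(f-\hat f)^2\big] + \underbrace{2\,E[\epsilon]\,E[f-\hat f]}_{=\,0} + E[\epsilon^2] \\
  &= E\big[(f - E[\hat f] + E[\hat f] - \hat f)^2\big] + \sigma^2 \\
  % the cross term 2(f - E[\hat f]) E[E[\hat f] - \hat f] vanishes since E[E[\hat f] - \hat f] = 0
  &= \underbrace{\big(f - E[\hat f]\big)^2}_{\mathrm{Bias}[\hat f]^2}
   + \underbrace{E\big[(E[\hat f] - \hat f)^2\big]}_{\mathrm{Var}[\hat f]}
   + \sigma^2 .
\end{aligned}
```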

SLIDE 26

Overfitting

[Three plots: Age vs. Tumor Size with decision boundaries of increasing complexity]

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ (Underfitting)

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2)$

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$ (Overfitting)

Slide credit: Andrew Ng

SLIDE 27

Addressing overfitting

  • $x_1$ = size of house
  • $x_2$ = no. of bedrooms
  • $x_3$ = no. of floors
  • $x_4$ = age of house
  • $x_5$ = average income in neighborhood
  • $x_6$ = kitchen size
  • ⋮
  • $x_{100}$

[Plot: Price ($) in 1000's vs. Size in feet^2]

Slide credit: Andrew Ng

SLIDE 28

Addressing overfitting

  • 1. Reduce number of features.
  • Manually select which features to keep.
  • Model selection algorithm (later in course).
  • 2. Regularization.
  • Keep all the features, but reduce magnitude/values of parameters $\theta_j$.
  • Works well when we have a lot of features, each of which contributes a bit to predicting $y$.

Slide credit: Andrew Ng

SLIDE 29

Overfitting Thriller

  • https://www.youtube.com/watch?v=DQWI1kvmwRg
SLIDE 30

Regularization

  • Overfitting
  • Cost function
  • Regularized linear regression
  • Regularized logistic regression
SLIDE 31

Intuition

  • Suppose we penalize and make $\theta_3$, $\theta_4$ really small.

$\min_\theta \; J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000\, \theta_3^2 + 1000\, \theta_4^2$

$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$

[Two plots: Price ($) in 1000's vs. Size in feet^2]

Slide credit: Andrew Ng

SLIDE 32

Regularization

  • Small values for parameters $\theta_1, \theta_2, \cdots, \theta_n$
  • "Simpler" hypothesis
  • Less prone to overfitting
  • Housing:
  • Features: $x_1, x_2, \cdots, x_{100}$
  • Parameters: $\theta_0, \theta_1, \theta_2, \cdots, \theta_{100}$

$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$

Slide credit: Andrew Ng

SLIDE 33

Regularization

๐พ ๐œ„ = 1 2๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘—

2 + ๐œ‡ เท ๐‘˜=1 ๐‘œ

๐œ„

๐‘˜ 2

min

๐œ„ ๐พ(๐œ„)

Price ($) in 1000โ€™s Size in feet^2

๐œ‡: Regularization parameter

Slide credit: Andrew Ng
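A direct transcription of this cost function as a sketch (the slide keeps the penalty inside the $\frac{1}{2m}$ factor, and $\theta_0$ is excluded from the penalty, as the following slides make explicit):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) [ sum_i (h(x^(i)) - y^(i))^2 + lam * sum_{j>=1} theta_j^2 ]."""
    m = len(y)
    sq_err = np.sum((X @ theta - y) ** 2)
    penalty = lam * np.sum(theta[1:] ** 2)  # theta_0 is not penalized
    return (sq_err + penalty) / (2 * m)
```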

SLIDE 34

Question

๐พ ๐œ„ = 1 2๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘—

2 + ๐œ‡ เท ๐‘˜=1 ๐‘œ

๐œ„

๐‘˜ 2

What if ๐œ‡ is set to an extremely large value (say ๐œ‡ = 1010)?

  • 1. Algorithm works fine; setting $\lambda$ to be very large can't hurt it.
  • 2. Algorithm fails to eliminate overfitting.
  • 3. Algorithm results in underfitting. (Fails to fit even training data well).
  • 4. Gradient descent will fail to converge.

Slide credit: Andrew Ng

SLIDE 35

Question

๐พ ๐œ„ = 1 2๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘—

2 + ๐œ‡ เท ๐‘˜=1 ๐‘œ

๐œ„

๐‘˜ 2

What if ๐œ‡ is set to an extremely large value (say ๐œ‡ = 1010)?

Price ($) in 1000โ€™s Size in feet^2

โ„Ž๐œ„ ๐‘ฆ = ๐œ„0 + ๐œ„1๐‘ฆ1 + ๐œ„2๐‘ฆ2 + โ‹ฏ + ๐œ„๐‘œ๐‘ฆ๐‘œ = ๐œ„โŠค๐‘ฆ

Slide credit: Andrew Ng

SLIDE 36

Regularization

  • Overfitting
  • Cost function
  • Regularized linear regression
  • Regularized logistic regression
SLIDE 37

Regularized linear regression

๐พ ๐œ„ = 1 2๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘—

2 + ๐œ‡ เท ๐‘˜=1 ๐‘œ

๐œ„

๐‘˜ 2

min

๐œ„ ๐พ(๐œ„)

๐‘œ: Number of features ๐œ„0 is not panelized

Slide credit: Andrew Ng

SLIDE 38

Gradient descent (Previously)

Repeat { ๐œ„0 โ‰” ๐œ„0 โˆ’ ๐›ฝ 1 ๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘— ๐œ„

๐‘˜ โ‰” ๐œ„ ๐‘˜ โˆ’ ๐›ฝ 1

๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘— ๐‘ฆ๐‘˜

๐‘—

} (๐‘˜ = 1, 2, 3, โ‹ฏ , ๐‘œ)

Slide credit: Andrew Ng

(๐‘˜ = 0)

SLIDE 39

Gradient descent (Regularized)

Repeat { ๐œ„0 โ‰” ๐œ„0 โˆ’ ๐›ฝ 1 ๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘— ๐œ„

๐‘˜ โ‰” ๐œ„ ๐‘˜ โˆ’ ๐›ฝ 1

๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘— ๐‘ฆ๐‘˜

๐‘— + ๐œ‡๐œ„ ๐‘˜

}

๐œ„

๐‘˜ โ‰” ๐œ„ ๐‘˜(1 โˆ’ ๐›ฝ ๐œ‡

๐‘›) โˆ’ ๐›ฝ 1 ๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘— ๐‘ฆ๐‘˜

๐‘—

Slide credit: Andrew Ng
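A sketch of this regularized update loop (hyperparameter values are placeholders; note that the decay factor touches only $\theta_1, \ldots, \theta_n$):

```python
import numpy as np

def ridge_gd(X, y, alpha=0.1, lam=1.0, iters=1000):
    """Gradient descent for L2-regularized linear regression.

    Implements theta_j := theta_j * (1 - alpha*lam/m) - alpha * (1/m) * sum(...) x_j,
    with the intercept theta_0 left unshrunk.
    """
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m  # unregularized gradient
        decay = (lam / m) * theta
        decay[0] = 0.0                    # theta_0 is not penalized
        theta -= alpha * (grad + decay)
    return theta
```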

SLIDE 40

Comparison

Regularized linear regression

$\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

Un-regularized linear regression

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

$1 - \alpha \frac{\lambda}{m} < 1$: Weight decay

SLIDE 41

Normal equation

  • $X = \begin{bmatrix} (x^{(1)})^\top \\ (x^{(2)})^\top \\ \vdots \\ (x^{(m)})^\top \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}$, $\quad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m}$
  • $\min_\theta J(\theta)$
  • $\theta = \left( X^\top X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \right)^{-1} X^\top y$, where the penalty matrix is $(n+1) \times (n+1)$

Slide credit: Andrew Ng
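The regularized normal equation in code (a sketch; `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """theta = (X^T X + lam * diag(0, 1, ..., 1))^{-1} X^T y."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0  # the 0 in the top-left corner exempts theta_0 from the penalty
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```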

SLIDE 42

Regularization

  • Overfitting
  • Cost function
  • Regularized linear regression
  • Regularized logistic regression
SLIDE 43

Regularized logistic regression

  • Cost function:

๐พ ๐œ„ = 1 ๐‘› เท

๐‘—=1 ๐‘›

๐‘ง ๐‘— log โ„Ž๐œ„ ๐‘ฆ ๐‘— + (1 โˆ’ ๐‘ง ๐‘— ) log 1 โˆ’ โ„Ž๐œ„ ๐‘ฆ ๐‘— + ๐œ‡ 2 เท

๐‘˜=1 ๐‘œ

๐œ„

๐‘˜ 2

Tumor Size Age

โ„Ž๐œ„ ๐‘ฆ = ๐‘•(๐œ„0 + ๐œ„1๐‘ฆ + ๐œ„2๐‘ฆ2 + ๐œ„3๐‘ฆ1

2 + ๐œ„4๐‘ฆ2 2 + ๐œ„5๐‘ฆ1๐‘ฆ2 +

๐œ„6๐‘ฆ1

3๐‘ฆ2 + ๐œ„7๐‘ฆ1๐‘ฆ2 3 + โ‹ฏ )

Slide credit: Andrew Ng

SLIDE 44

Gradient descent (Regularized)

Repeat { ๐œ„0 โ‰” ๐œ„0 โˆ’ ๐›ฝ 1 ๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘— ๐œ„

๐‘˜ โ‰” ๐œ„ ๐‘˜ โˆ’ ๐›ฝ 1

๐‘› เท

๐‘—=1 ๐‘›

โ„Ž๐œ„ ๐‘ฆ ๐‘— โˆ’ ๐‘ง ๐‘— ๐‘ฆ๐‘˜

๐‘— โˆ’ ๐œ‡๐œ„ ๐‘˜

} โ„Ž๐œ„ ๐‘ฆ = 1 1 + ๐‘“โˆ’๐œ„โŠค๐‘ฆ ๐œ– ๐œ–๐œ„

๐‘˜

๐พ(๐œ„)

Slide credit: Andrew Ng
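Putting the pieces together, a self-contained sketch of regularized logistic regression (hyperparameters are placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd_regularized(X, y, alpha=0.1, lam=1.0, iters=2000):
    """Gradient descent on the L2-regularized logistic cost J(theta)."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m  # gradient of the data term
        reg = (lam / m) * theta
        reg[0] = 0.0                               # intercept is not penalized
        theta -= alpha * (grad + reg)
    return theta
```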

SLIDE 45

$\|\theta\|_1$: Lasso regularization

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$

LASSO: Least Absolute Shrinkage and Selection Operator

SLIDE 46

Single predictor: Soft Thresholding

  • minimize๐œ„

1 2๐‘› ฯƒ๐‘—=1 ๐‘›

๐‘ฆ(๐‘—)๐œ„ โˆ’ ๐‘ง ๐‘—

2 + ๐œ‡ ๐œ„ 1

เท  ๐œ„ = 1 ๐‘› < ๐’š, ๐’› > โˆ’๐œ‡ if 1 ๐‘› < ๐’š, ๐’› > > ๐œ‡ if 1 ๐‘› | < ๐’š, ๐’› > | โ‰ค ๐œ‡ 1 ๐‘› < ๐’š, ๐’› > +๐œ‡ if 1 ๐‘› < ๐’š, ๐’› > < โˆ’๐œ‡

เท  ๐œ„ = ๐‘‡๐œ‡(1 ๐‘› < ๐’š, ๐’› >) Soft Thresholding operator ๐‘‡๐œ‡ ๐‘ฆ = sign ๐‘ฆ ๐‘ฆ โˆ’ ๐œ‡ +

SLIDE 47

Multiple predictors: Cyclic Coordinate Descent

  • minimize๐œ„

1 2๐‘› ฯƒ๐‘—=1 ๐‘›

๐‘ฆ๐‘˜

๐‘— ๐œ„ ๐‘˜ + ฯƒ๐‘™โ‰ ๐‘˜ ๐‘ฆ๐‘—๐‘˜ ๐‘— ๐œ„๐‘™ โˆ’ ๐‘ง ๐‘— 2

+ ๐œ‡ เท

๐‘™โ‰ ๐‘˜

|๐œ„๐‘™| + ๐œ‡ ๐œ„

๐‘˜ 1

For each ๐‘˜, update ๐œ„

๐‘˜ with

minimize๐œ„ 1 2๐‘› เท

๐‘—=1 ๐‘›

๐‘ฆ๐‘˜

๐‘— ๐œ„ ๐‘˜ โˆ’ ๐‘  ๐‘˜ (๐‘—) 2

+ ๐œ‡ ๐œ„

๐‘˜ 1

where ๐‘ 

๐‘˜ (๐‘—) = ๐‘ง ๐‘— โˆ’ ฯƒ๐‘™โ‰ ๐‘˜ ๐‘ฆ๐‘—๐‘˜ ๐‘— ๐œ„๐‘™
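A sketch of the full cyclic loop (it assumes each column of `X` is standardized so that $\frac{1}{m} \sum_i (x_j^{(i)})^2 = 1$, which makes each one-dimensional update exactly the soft threshold above; the sweep count is a placeholder):

```python
import numpy as np

def soft_threshold(t, lam):
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Cyclic coordinate descent for the lasso (standardized columns assumed)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(sweeps):
        for j in range(n):
            # partial residual r_j: what remains after the other features' fit
            r_j = y - X @ theta + X[:, j] * theta[j]
            theta[j] = soft_threshold(X[:, j] @ r_j / m, lam)
    return theta
```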

SLIDE 48

L1 and L2 balls

Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf

SLIDE 49

Terminology

Regularization function | Name | Solver
$\|\theta\|_2^2 = \sum_{j=1}^{n} \theta_j^2$ | Tikhonov regularization / Ridge regression | Closed form
$\|\theta\|_1 = \sum_{j=1}^{n} |\theta_j|$ | LASSO regression | Proximal gradient descent, least angle regression
$\alpha \|\theta\|_1 + (1 - \alpha) \|\theta\|_2^2$ | Elastic net regularization | Proximal gradient descent

SLIDE 50

Things to remember

  • Overfitting
  • Cost function
  • Regularized linear regression
  • Regularized logistic regression