

slide-1
SLIDE 1

Applied Machine Learning

Gradient Descent Methods

Siamak Ravanbakhsh

COMP 551 (Winter 2020)

1

slide-2
SLIDE 2

Learning objectives

  • basic idea of gradient descent
  • stochastic gradient descent
  • method of momentum
  • using adaptive learning rates
  • sub-gradients
  • application to linear regression and classification

2

slide-3
SLIDE 3

Optimization in ML

Inference and learning of a model often involves optimization:

  • Optimization is a huge field

3

slide-4
SLIDE 4

Optimization in ML

Inference and learning of a model often involves optimization:

  • Optimization is a huge field
  • discrete (combinatorial) vs. continuous variables
  • constrained vs. unconstrained
  • for continuous optimization in ML: convex vs. non-convex
  • looking for local vs. global optima?
  • analytic gradient? analytic Hessian?
  • stochastic vs. batch
  • smooth vs. non-smooth

bold: the setting considered in this class

3

slide-5
SLIDE 5

Gradient

For a multivariate function $J(w_1, w_2)$ we have partial derivatives instead of the derivative:

$\frac{\partial}{\partial w_1} J(w_1, w_2) \triangleq \lim_{\epsilon \to 0} \frac{J(w_1 + \epsilon, w_2) - J(w_1, w_2)}{\epsilon}$

= the derivative when the other variables are fixed; we can estimate this numerically if needed
(use a small epsilon in the formula above)

4 . 1

slide-6
SLIDE 6

Gradient

For a multivariate function $J(w_1, w_2)$ we have partial derivatives instead of the derivative:

$\frac{\partial}{\partial w_1} J(w_1, w_2) \triangleq \lim_{\epsilon \to 0} \frac{J(w_1 + \epsilon, w_2) - J(w_1, w_2)}{\epsilon}$

= the derivative when the other variables are fixed

gradient: the vector of all partial derivatives

$\nabla J(w) = \left[ \frac{\partial}{\partial w_1} J(w), \cdots, \frac{\partial}{\partial w_D} J(w) \right]^T$

we can estimate this numerically if needed (use a small epsilon in the formula above)

4 . 1
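A minimal sketch (not from the slides) of the numerical estimate mentioned above: each partial derivative is approximated with a small finite epsilon. The function name and the example cost are assumptions for illustration.

import numpy as np

def numerical_gradient(J, w, eps=1e-6):
    # estimate each partial derivative of J at w with a forward finite difference
    grad = np.zeros_like(w)
    for d in range(len(w)):
        w_plus = w.copy()
        w_plus[d] += eps
        grad[d] = (J(w_plus) - J(w)) / eps
    return grad

# example: J(w) = w1^2 + 3*w2, so the gradient at (1, 2) is approximately (2, 3)
J = lambda w: w[0]**2 + 3*w[1]
print(numerical_gradient(J, np.array([1.0, 2.0])))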

slide-7
SLIDE 7

Gradient descent

An iterative algorithm for optimization:

  • starts from some initial $w^{\{0\}}$
  • updates using the gradient (the negative gradient is the steepest descent direction)
  • converges to a local minimum

$w^{\{t+1\}} \leftarrow w^{\{t\}} - \alpha \nabla J(w^{\{t\}})$

image: https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html

4 . 2

slide-8
SLIDE 8

Gradient descent

An iterative algorithm for optimization:

  • starts from some initial $w^{\{0\}}$
  • updates using the gradient (the negative gradient is the steepest descent direction)
  • converges to a local minimum

$w^{\{t+1\}} \leftarrow w^{\{t\}} - \alpha \nabla J(w^{\{t\}})$    ($\alpha$ is the learning rate)

image: https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html

4 . 2

slide-9
SLIDE 9

Gradient descent

An iterative algorithm for optimization:

  • starts from some initial $w^{\{0\}}$
  • updates using the gradient (the negative gradient is the steepest descent direction)
  • converges to a local minimum

$w^{\{t+1\}} \leftarrow w^{\{t\}} - \alpha \nabla J(w^{\{t\}})$    ($\alpha$ is the learning rate)

$\nabla J(w) = \left[ \frac{\partial}{\partial w_1} J(w), \cdots, \frac{\partial}{\partial w_D} J(w) \right]^T$

image: https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html

4 . 2

slide-10
SLIDE 10

Gradient descent

An iterative algorithm for optimization:

  • starts from some initial $w^{\{0\}}$
  • updates using the gradient (the negative gradient is the steepest descent direction)
  • converges to a local minimum

$w^{\{t+1\}} \leftarrow w^{\{t\}} - \alpha \nabla J(w^{\{t\}})$    ($\alpha$ is the learning rate, $J$ is the cost function; for maximization: the objective function)

$\nabla J(w) = \left[ \frac{\partial}{\partial w_1} J(w), \cdots, \frac{\partial}{\partial w_D} J(w) \right]^T$

image: https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html

4 . 2

slide-11
SLIDE 11

Convex function

A convex subset of $\mathbb{R}^N$ intersects any line in at most one line segment.

(figure: a convex set vs. a not convex set)

4 . 3

slide-12
SLIDE 12

Convex function

A convex subset of $\mathbb{R}^N$ intersects any line in at most one line segment. (figure: convex vs. not convex)

A convex function is a function for which the epigraph is a convex set.

epigraph: the set of all points above the graph

4 . 3

slide-13
SLIDE 13

Convex function

A convex subset of $\mathbb{R}^N$ intersects any line in at most one line segment. (figure: convex vs. not convex)

A convex function is a function for which the epigraph is a convex set.

epigraph: the set of all points above the graph

$f(\lambda w + (1 - \lambda) w') \leq \lambda f(w) + (1 - \lambda) f(w')$ for $0 < \lambda < 1$

4 . 3
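A tiny sketch (not from the slides) that checks the convexity inequality above numerically for the convex function $f(w) = w^2$; the variable names are illustrative assumptions.

import numpy as np

f = lambda w: w**2            # a convex function
w, w_prime = -1.0, 3.0
for lam in np.linspace(0.01, 0.99, 5):
    lhs = f(lam*w + (1-lam)*w_prime)
    rhs = lam*f(w) + (1-lam)*f(w_prime)
    assert lhs <= rhs          # the chord lies above the function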

slide-14
SLIDE 14

Convex function

Convex functions are easier to minimize:

  • critical points are global minima
  • gradient descent can find them

$w^{\{t+1\}} \leftarrow w^{\{t\}} - \alpha \nabla J(w^{\{t\}})$

non-convex: gradient descent may find a local optimum

image: https://www.willamette.edu/~gorr/classes/cs449/momrate.html

4 . 4

slide-15
SLIDE 15

Convex function

Convex functions are easier to minimize:

  • critical points are global minima
  • gradient descent can find them

$w^{\{t+1\}} \leftarrow w^{\{t\}} - \alpha \nabla J(w^{\{t\}})$

non-convex: gradient descent may find a local optimum

A concave function is the negative of a convex function (easy to maximize).

image: https://www.willamette.edu/~gorr/classes/cs449/momrate.html

4 . 4

slide-16
SLIDE 16

Recognizing convex functions

  • a linear function is convex: $w^T x$

4 . 5

slide-17
SLIDE 17

Recognizing convex functions

  • convex if the second derivative is positive everywhere: $\frac{d^2}{dx^2} f \geq 0$
    examples: $x^{2d}, \; e^x, \; -\log(x), \; -\sqrt{x}$
  • a linear function is convex: $w^T x$

4 . 5

slide-18
SLIDE 18

Recognizing convex functions

  • convex if the second derivative is positive everywhere: $\frac{d^2}{dx^2} f \geq 0$
    examples: $x^{2d}, \; e^x, \; -\log(x), \; -\sqrt{x}$
  • a linear function is convex: $w^T x$
  • the sum of convex functions is convex
    example: $\|Xw - y\|_2^2 + \lambda \|w\|_2^2$

4 . 5

slide-19
SLIDE 19

Recognizing convex functions

  • convex if the second derivative is positive everywhere: $\frac{d^2}{dx^2} f \geq 0$
    examples: $x^{2d}, \; e^x, \; -\log(x), \; -\sqrt{x}$
  • a linear function is convex: $w^T x$
  • the sum of convex functions is convex
    example: $\|Xw - y\|_2^2 + \lambda \|w\|_2^2$
  • the maximum of convex functions is convex
    example: $f(y) = \max_{x \in [1,5]} x^4 y$ (note this is not convex in $x$)

4 . 5

slide-20
SLIDE 20

Recognizing convex functions

  • convex if the second derivative is positive everywhere: $\frac{d^2}{dx^2} f \geq 0$
    examples: $x^{2d}, \; e^x, \; -\log(x), \; -\sqrt{x}$
  • a linear function is convex: $w^T x$
  • the sum of convex functions is convex
    example: $\|Xw - y\|_2^2 + \lambda \|w\|_2^2$
  • the maximum of convex functions is convex
    example: $f(y) = \max_{x \in [1,5]} x^4 y$ (note this is not convex in $x$)
  • the composition of convex functions is generally not convex
    example: $(-\log(x))^2$

4 . 5

slide-21
SLIDE 21

Recognizing convex functions

  • convex if the second derivative is positive everywhere: $\frac{d^2}{dx^2} f \geq 0$
    examples: $x^{2d}, \; e^x, \; -\log(x), \; -\sqrt{x}$
  • a linear function is convex: $w^T x$
  • the sum of convex functions is convex
    example: $\|Xw - y\|_2^2 + \lambda \|w\|_2^2$
  • the maximum of convex functions is convex
    example: $f(y) = \max_{x \in [1,5]} x^4 y$ (note this is not convex in $x$)
  • the composition of convex functions is generally not convex
    example: $(-\log(x))^2$
  • however, if $f$ and $g$ are convex and $g$ is non-decreasing, then $g(f(x))$ is convex
    example: $e^{f(x)}$ for convex $f$

4 . 5

slide-22
SLIDE 22

Gradient

For both linear and logistic regression the gradient has the same form:

$\nabla J(w) = X^T (\hat{y} - y)$

  • linear regression: $\hat{y} = Xw$
  • logistic regression: $\hat{y} = \sigma(Xw)$

dimensions: $X^T$ is $D \times N$, $(\hat{y} - y)$ is $N \times 1$, $X$ is $N \times D$, $w$ is $D \times 1$

5 . 1

slide-23
SLIDE 23

Gradient

For both linear and logistic regression the gradient has the same form:

$\nabla J(w) = X^T (\hat{y} - y)$

  • linear regression: $\hat{y} = Xw$
  • logistic regression: $\hat{y} = \sigma(Xw)$

dimensions: $X^T$ is $D \times N$, $(\hat{y} - y)$ is $N \times 1$, $X$ is $N \times D$, $w$ is $D \times 1$

Each gradient evaluation costs $O(ND)$ (two matrix multiplications). Compared to the direct solution for linear regression, which costs $O(ND^2 + D^3)$, gradient descent can be much faster for large $D$.

5 . 1
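As a point of comparison, here is a minimal sketch (not from the slides) of the $O(ND^2 + D^3)$ direct solution for linear regression; it uses np.linalg.solve rather than an explicit matrix inverse.

import numpy as np

def direct_solution(X, y):
    # solve the normal equations (X^T X) w = X^T y
    # forming X^T X costs O(N D^2); solving the D x D system costs O(D^3)
    return np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))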

slide-24
SLIDE 24

Gradient

For both linear and logistic regression the gradient has the same form:

$\nabla J(w) = X^T (\hat{y} - y)$

  • linear regression: $\hat{y} = Xw$
  • logistic regression: $\hat{y} = \sigma(Xw)$

dimensions: $X^T$ is $D \times N$, $(\hat{y} - y)$ is $N \times 1$, $X$ is $N \times D$, $w$ is $D \times 1$

Each gradient evaluation costs $O(ND)$ (two matrix multiplications). Compared to the direct solution for linear regression, which costs $O(ND^2 + D^3)$, gradient descent can be much faster for large $D$.

def gradient(X, y, w):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    return grad

5 . 1

slide-25
SLIDE 25

Gradient Descent

def GradientDescent(X,          # N x D
                    y,          # N
                    lr=.01,     # learning rate
                    eps=1e-2,   # termination condition
                    ):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        g = gradient(X, y, w)
        w = w - lr*g
    return w

(gradient: the code on the previous page)

Implementing gradient descent is easy!

5 . 2

slide-26
SLIDE 26

Gradient Descent

def GradientDescent(X,          # N x D
                    y,          # N
                    lr=.01,     # learning rate
                    eps=1e-2,   # termination condition
                    ):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        g = gradient(X, y, w)
        w = w - lr*g
    return w

(gradient: the code on the previous page)

Implementing gradient descent is easy!

Some termination conditions:

  • some maximum number of iterations
  • a small gradient
  • a small change in the objective
  • increasing error on a validation set: early stopping (one way to avoid overfitting)

5 . 2

slide-27
SLIDE 27

Example: GD for Linear Regression

def GradientDescent(X,          # N x D
                    y,          # N
                    lr=.01,     # learning rate
                    eps=1e-2,   # termination condition
                    ):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        g = gradient(X, y, w)
        w = w - lr*g
    return w

def gradient(X, y, w):
    N, D = X.shape
    yh = np.dot(X, w)
    grad = np.dot(X.T, yh - y) / N
    return grad

Applying this to fit toy data.

5 . 3

slide-28
SLIDE 28

Example: GD for Linear Regression

single feature (intercept is zero)

# D = 1
N = 20
X = np.linspace(1, 10, N)[:, None]
y_truth = np.dot(X, np.array([-3.]))
y = y_truth + 10*np.random.randn(N)

Applying this to fit toy data.

5 . 4

slide-29
SLIDE 29

Example: GD for Linear Regression

single feature (intercept is zero); the data points are $(x^{(n)}, -3x^{(n)} + \text{noise})$

# D = 1
N = 20
X = np.linspace(1, 10, N)[:, None]
y_truth = np.dot(X, np.array([-3.]))
y = y_truth + 10*np.random.randn(N)

Applying this to fit toy data.

5 . 4

slide-30
SLIDE 30

Example: GD for Linear Regression

single feature (intercept is zero); the data points are $(x^{(n)}, -3x^{(n)} + \text{noise})$

(figure: the true line $y = -3x$ and the model $y = wx$)

# D = 1
N = 20
X = np.linspace(1, 10, N)[:, None]
y_truth = np.dot(X, np.array([-3.]))
y = y_truth + 10*np.random.randn(N)

Applying this to fit toy data.

5 . 4

slide-31
SLIDE 31

Example: GD for Linear Regression

single feature (intercept is zero); the data points are $(x^{(n)}, -3x^{(n)} + \text{noise})$

(figure: the true line $y = -3x$ and the model $y = wx$)

Using the direct solution method: $w = (X^T X)^{-1} X^T y \approx -3.2$

# D = 1
N = 20
X = np.linspace(1, 10, N)[:, None]
y_truth = np.dot(X, np.array([-3.]))
y = y_truth + 10*np.random.randn(N)

Applying this to fit toy data.

5 . 4

slide-32
SLIDE 32

Example: GD for Linear Regression

After 22 iterations of gradient descent:

$w^{\{t+1\}} \leftarrow w^{\{t\}} - .01 \nabla J(w^{\{t\}})$

5 . 5

slide-33
SLIDE 33

Example: GD for Linear Regression

After 22 iterations of gradient descent:

$w^{\{t+1\}} \leftarrow w^{\{t\}} - .01 \nabla J(w^{\{t\}})$

starting from $w^{\{0\}} = 0$, we reach $w^{\{22\}} \approx -3.2$

(figure: the cost function $J(w)$ over $w$)

5 . 5

slide-34
SLIDE 34

Example: GD for Linear Regression

After 22 iterations of gradient descent:

$w^{\{t+1\}} \leftarrow w^{\{t\}} - .01 \nabla J(w^{\{t\}})$

starting from $w^{\{0\}} = 0$, we reach $w^{\{22\}} \approx -3.2$

(figures: the cost function $J(w)$ over $w$, and the fitted line $y = wx$ in data space)

5 . 5

slide-35
SLIDE 35

Learning rate

The learning rate has a significant effect on GD:

  • too small: may take a long time to converge
  • too large: it overshoots

(figure: $J(w)$ over $w$ for $\alpha = .01$ and $\alpha = .05$)

5 . 6
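A small sketch (not from the slides) of how one might compare a few learning rates on the toy data above; it reuses X, y, N from the toy-data snippet, and the fixed iteration count is an assumption.

import numpy as np

# run a fixed number of gradient-descent steps for each candidate learning rate
for lr in [.001, .01, .05]:
    w = np.zeros(1)
    for t in range(100):
        yh = np.dot(X, w)
        w = w - lr * np.dot(X.T, yh - y) / N   # one gradient-descent step
    print(lr, w)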

slide-36
SLIDE 36

GD for Logistic Regression

def GradientDescent(X,          # N x D
                    y,          # N
                    lr=.01,     # learning rate
                    eps=1e-2,   # termination condition
                    ):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        g = gradient(X, y, w)
        w = w - lr*g
    return w

example: logistic regression for the Iris dataset (D=2, lr=.01)

5 . 7

slide-37
SLIDE 37

GD for Logistic Regression

def GradientDescent(X,          # N x D
                    y,          # N
                    lr=.01,     # learning rate
                    eps=1e-2,   # termination condition
                    ):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        g = gradient(X, y, w)
        w = w - lr*g
    return w

def gradient(X, y, w):
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y)
    return grad

example: logistic regression for the Iris dataset (D=2, lr=.01)

5 . 7

slide-38
SLIDE 38

GD for Logistic Regression

def GradientDescent(X,          # N x D
                    y,          # N
                    lr=.01,     # learning rate
                    eps=1e-2,   # termination condition
                    ):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        g = gradient(X, y, w)
        w = w - lr*g
    return w

def gradient(X, y, w):
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y)
    return grad

example: logistic regression for the Iris dataset (D=2, lr=.01)

5 . 7

slide-39
SLIDE 39

Stochastic Gradient Descent

We can write the cost function as an average over instances:

$J(w) = \frac{1}{N} \sum_{n=1}^{N} J_n(w)$

where $J_n$ is the cost for a single data point, e.g. for linear regression:

$J_n(w) = \frac{1}{2} (w^T x^{(n)} - y^{(n)})^2$

6 . 1

slide-40
SLIDE 40

Stochastic Gradient Descent

We can write the cost function as an average over instances:

$J(w) = \frac{1}{N} \sum_{n=1}^{N} J_n(w)$

where $J_n$ is the cost for a single data point, e.g. for linear regression:

$J_n(w) = \frac{1}{2} (w^T x^{(n)} - y^{(n)})^2$

The same is true for the partial derivatives:

$\frac{\partial}{\partial w_j} J(w) = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial}{\partial w_j} J_n(w)$

6 . 1

slide-41
SLIDE 41

Stochastic Gradient Descent

We can write the cost function as an average over instances:

$J(w) = \frac{1}{N} \sum_{n=1}^{N} J_n(w)$

where $J_n$ is the cost for a single data point, e.g. for linear regression:

$J_n(w) = \frac{1}{2} (w^T x^{(n)} - y^{(n)})^2$

The same is true for the partial derivatives:

$\frac{\partial}{\partial w_j} J(w) = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial}{\partial w_j} J_n(w)$

therefore $\nabla J(w) = \mathbb{E}[\nabla J_n(w)]$

6 . 1
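A short sketch (not from the slides) verifying numerically that the full-batch gradient equals the average of per-example gradients for linear regression; the array names and sizes are assumptions.

import numpy as np

N, D = 20, 3
X = np.random.randn(N, D)
y = np.random.randn(N)
w = np.random.randn(D)

full = np.dot(X.T, np.dot(X, w) - y) / N                       # batch gradient
per_example = [x_n * (np.dot(w, x_n) - y_n) for x_n, y_n in zip(X, y)]
assert np.allclose(full, np.mean(per_example, axis=0))         # E[grad J_n] = grad J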

slide-42
SLIDE 42

Stochastic Gradient Descent

Idea: use the stochastic approximation $\nabla J_n(w)$ in gradient descent.

6 . 2

slide-43
SLIDE 43

Stochastic Gradient Descent

Idea: use the stochastic approximation $\nabla J_n(w)$ in gradient descent.

(figure: contour plot of the cost function with batch gradient updates; with a small learning rate there is a guaranteed improvement at each step)

batch gradient update: $w \leftarrow w - \alpha \nabla J(w)$

image: https://jaykanidan.wordpress.com

6 . 2

slide-44
SLIDE 44

Stochastic Gradient Descent

Idea: use the stochastic approximation $\nabla J_n(w)$ in gradient descent.

using the stochastic gradient: $w \leftarrow w - \alpha \nabla J_n(w)$

image: https://jaykanidan.wordpress.com

6 . 3

slide-45
SLIDE 45

Stochastic Gradient Descent

Idea: use the stochastic approximation $\nabla J_n(w)$ in gradient descent.

using the stochastic gradient: $w \leftarrow w - \alpha \nabla J_n(w)$

  • the steps are "on average" in the right direction

image: https://jaykanidan.wordpress.com

6 . 3

slide-46
SLIDE 46

Stochastic Gradient Descent

Idea: use the stochastic approximation $\nabla J_n(w)$ in gradient descent.

using the stochastic gradient: $w \leftarrow w - \alpha \nabla J_n(w)$

  • the steps are "on average" in the right direction
  • each step uses the gradient of a different cost $J_n(w)$

image: https://jaykanidan.wordpress.com

6 . 3

slide-47
SLIDE 47

Stochastic Gradient Descent

Idea: use the stochastic approximation $\nabla J_n(w)$ in gradient descent.

using the stochastic gradient: $w \leftarrow w - \alpha \nabla J_n(w)$

  • the steps are "on average" in the right direction
  • each step uses the gradient of a different cost $J_n(w)$
  • each update is (1/N) of the cost of a batch gradient update

image: https://jaykanidan.wordpress.com

6 . 3

slide-48
SLIDE 48

Stochastic Gradient Descent

Idea: use the stochastic approximation $\nabla J_n(w)$ in gradient descent.

using the stochastic gradient: $w \leftarrow w - \alpha \nabla J_n(w)$

  • the steps are "on average" in the right direction
  • each step uses the gradient of a different cost $J_n(w)$
  • each update is (1/N) of the cost of a batch gradient update
    e.g., for linear regression $\nabla J_n(w) = x^{(n)} (w^T x^{(n)} - y^{(n)})$, which is $O(D)$

image: https://jaykanidan.wordpress.com

6 . 3

slide-49
SLIDE 49

Example: SGD for Logistic Regression

setting 1: using the batch gradient

def GradientDescent(X,          # N x D
                    y,          # N
                    lr=.01,     # learning rate
                    eps=1e-2,   # termination condition
                    ):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        g = gradient(X, y, w)
        w = w - lr*g
    return w

def gradient(X, y, w):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    return grad

logistic regression for the Iris dataset (D=2, $\alpha = .1$), $w^{\{t=0\}} = (0, 0)$, after 8000 iterations

6 . 4

slide-50
SLIDE 50

Example: SGD for Logistic Regression

setting 2: using the stochastic gradient

def StochasticGradientDescent(X,          # N x D
                              y,          # N
                              lr=.01,     # learning rate
                              eps=1e-2,   # termination condition
                              ):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        n = np.random.randint(N)
        g = gradient(X[[n],:], y[[n]], w)
        w = w - lr*g
    return w

def gradient(X, y, w):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    return grad

logistic regression for the Iris dataset (D=2, $\alpha = .1$), $w^{\{t=0\}} = (0, 0)$

6 . 5

slide-51
SLIDE 51

Convergence of SGD

Stochastic gradients are not zero at the optimum; how do we guarantee convergence?

6 . 6

slide-52
SLIDE 52

Convergence of SGD

Stochastic gradients are not zero at the optimum; how do we guarantee convergence?

Schedule the learning rate to become smaller over time.

6 . 6

slide-53
SLIDE 53

Convergence of SGD

Stochastic gradients are not zero at the optimum; how do we guarantee convergence?

Schedule the learning rate to become smaller over time. The sequence of learning rates should satisfy the Robbins–Monro conditions:

  • $\sum_{t=0}^{\infty} \alpha^{\{t\}} = \infty$ : otherwise, for a large initial distance $\|w^{\{0\}} - w^*\|$ we can't reach the minimum
  • $\sum_{t=0}^{\infty} (\alpha^{\{t\}})^2 < \infty$ : the steps should go to zero

6 . 6

slide-54
SLIDE 54

Convergence of SGD

Stochastic gradients are not zero at the optimum; how do we guarantee convergence?

Schedule the learning rate to become smaller over time. The sequence of learning rates should satisfy the Robbins–Monro conditions:

  • $\sum_{t=0}^{\infty} \alpha^{\{t\}} = \infty$ : otherwise, for a large initial distance $\|w^{\{0\}} - w^*\|$ we can't reach the minimum
  • $\sum_{t=0}^{\infty} (\alpha^{\{t\}})^2 < \infty$ : the steps should go to zero

examples: $\alpha^{\{t\}} = \frac{10}{t}$, $\alpha^{\{t\}} = t^{-.51}$

6 . 6
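A minimal sketch (not from the slides) of SGD with one of the decaying schedules above; the fixed iteration count is an assumption, and gradient is the per-example gradient function defined earlier in the deck.

import numpy as np

def SGD_decaying(X, y, T=10000):
    # SGD with the Robbins-Monro-style schedule alpha_t = t^(-0.51)
    N, D = X.shape
    w = np.zeros(D)
    for t in range(1, T + 1):
        lr = t ** (-0.51)                        # decaying learning rate
        n = np.random.randint(N)
        w = w - lr * gradient(X[[n], :], y[[n]], w)
    return w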

slide-55
SLIDE 55

Minibatch SGD

Use a minibatch to produce gradient estimates:

$\nabla J_B = \sum_{n \in B} \nabla J_n(w)$

where $B \subseteq \{1, \ldots, N\}$ is a subset (minibatch) of the dataset.

6 . 7

slide-56
SLIDE 56

Minibatch SGD

Use a minibatch to produce gradient estimates:

$\nabla J_B = \sum_{n \in B} \nabla J_n(w)$

where $B \subseteq \{1, \ldots, N\}$ is a subset (minibatch) of the dataset.

def MinibatchSGD(X, y, lr=.01, eps=1e-2, bsize=8):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        minibatch = np.random.randint(N, size=(bsize))
        g = gradient(X[minibatch,:], y[minibatch], w)
        w = w - lr*g
    return w

6 . 7

slide-57
SLIDE 57

Minibatch SGD

Use a minibatch to produce gradient estimates:

$\nabla J_B = \sum_{n \in B} \nabla J_n(w)$

where $B \subseteq \{1, \ldots, N\}$ is a subset (minibatch) of the dataset.

(figure: GD, full batch)

def MinibatchSGD(X, y, lr=.01, eps=1e-2, bsize=8):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        minibatch = np.random.randint(N, size=(bsize))
        g = gradient(X[minibatch,:], y[minibatch], w)
        w = w - lr*g
    return w

6 . 7

slide-58
SLIDE 58

Minibatch SGD

Use a minibatch to produce gradient estimates:

$\nabla J_B = \sum_{n \in B} \nabla J_n(w)$

where $B \subseteq \{1, \ldots, N\}$ is a subset (minibatch) of the dataset.

(figures: GD, full batch; SGD, minibatch-size=1)

def MinibatchSGD(X, y, lr=.01, eps=1e-2, bsize=8):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        minibatch = np.random.randint(N, size=(bsize))
        g = gradient(X[minibatch,:], y[minibatch], w)
        w = w - lr*g
    return w

6 . 7

slide-59
SLIDE 59

Minibatch SGD

Use a minibatch to produce gradient estimates:

$\nabla J_B = \sum_{n \in B} \nabla J_n(w)$

where $B \subseteq \{1, \ldots, N\}$ is a subset (minibatch) of the dataset.

(figures: GD, full batch; SGD, minibatch-size=1; SGD, minibatch-size=16)

def MinibatchSGD(X, y, lr=.01, eps=1e-2, bsize=8):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    while np.linalg.norm(g) > eps:
        minibatch = np.random.randint(N, size=(bsize))
        g = gradient(X[minibatch,:], y[minibatch], w)
        w = w - lr*g
    return w

6 . 7

slide-60
SLIDE 60

Momentum

To help with the oscillations of SGD (or even full-batch GD): use a running average of gradients, where more recent gradients have higher weights.

7 . 1

slide-61
SLIDE 61

Momentum

To help with the oscillations of SGD (or even full-batch GD): use a running average of gradients, where more recent gradients have higher weights.

$\Delta w^{\{t\}} \leftarrow \beta \Delta w^{\{t-1\}} + (1 - \beta) \nabla J_B(w^{\{t-1\}})$

$w^{\{t\}} \leftarrow w^{\{t-1\}} - \alpha \Delta w^{\{t\}}$

7 . 1

slide-62
SLIDE 62

Momentum

To help with the oscillations of SGD (or even full-batch GD): use a running average of gradients, where more recent gradients have higher weights.

$\Delta w^{\{t\}} \leftarrow \beta \Delta w^{\{t-1\}} + (1 - \beta) \nabla J_B(w^{\{t-1\}})$

$w^{\{t\}} \leftarrow w^{\{t-1\}} - \alpha \Delta w^{\{t\}}$

a momentum of $\beta = 0$ reduces to SGD; a common value is $\beta > .9$

7 . 1

slide-63
SLIDE 63

Momentum

To help with the oscillations of SGD (or even full-batch GD): use a running average of gradients, where more recent gradients have higher weights.

$\Delta w^{\{t\}} \leftarrow \beta \Delta w^{\{t-1\}} + (1 - \beta) \nabla J_B(w^{\{t-1\}})$

$w^{\{t\}} \leftarrow w^{\{t-1\}} - \alpha \Delta w^{\{t\}}$

a momentum of $\beta = 0$ reduces to SGD; a common value is $\beta > .9$

$\Delta w$ is effectively an exponential moving average:

$\Delta w^{\{T\}} = \sum_{t=1}^{T} \beta^{T-t} (1 - \beta) \nabla J_B(w^{\{t\}})$

7 . 1

slide-64
SLIDE 64

Momentum

To help with the oscillations of SGD (or even full-batch GD): use a running average of gradients, where more recent gradients have higher weights.

$\Delta w^{\{t\}} \leftarrow \beta \Delta w^{\{t-1\}} + (1 - \beta) \nabla J_B(w^{\{t-1\}})$

$w^{\{t\}} \leftarrow w^{\{t-1\}} - \alpha \Delta w^{\{t\}}$

a momentum of $\beta = 0$ reduces to SGD; a common value is $\beta > .9$

$\Delta w$ is effectively an exponential moving average:

$\Delta w^{\{T\}} = \sum_{t=1}^{T} \beta^{T-t} (1 - \beta) \nabla J_B(w^{\{t\}})$

There are other variations of momentum with a similar idea.

7 . 1

slide-65
SLIDE 65

Momentum

To help with the oscillations of SGD (or even full-batch GD): use a running average of gradients, where more recent gradients have higher weights.

def MinibatchSGD(X, y, lr=.01, eps=1e-2, bsize=8, beta=.99):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    dw = 0
    while np.linalg.norm(g) > eps:
        minibatch = np.random.randint(N, size=(bsize))
        g = gradient(X[minibatch,:], y[minibatch], w)
        dw = (1-beta)*g + beta*dw
        w = w - lr*dw
    return w

7 . 2

slide-66
SLIDE 66

Momentum

Example: logistic regression with $\alpha = .5, \beta = 0, |B| = 8$ (no momentum)

7 . 3

slide-67
SLIDE 67

Momentum

Example: logistic regression

  • $\alpha = .5, \beta = 0, |B| = 8$ (no momentum)
  • $\alpha = .5, \beta = .99, |B| = 8$ (with momentum)

$\Delta w^{\{t\}} \leftarrow \beta \Delta w^{\{t-1\}} + (1 - \beta) \nabla J_B(w^{\{t-1\}})$

$w^{\{t\}} \leftarrow w^{\{t-1\}} - \alpha \Delta w^{\{t\}}$

7 . 3

slide-68
SLIDE 68

Momentum

Example: logistic regression

  • $\alpha = .5, \beta = 0, |B| = 8$ (no momentum)
  • $\alpha = .5, \beta = .99, |B| = 8$ (with momentum)

$\Delta w^{\{t\}} \leftarrow \beta \Delta w^{\{t-1\}} + (1 - \beta) \nabla J_B(w^{\{t-1\}})$

$w^{\{t\}} \leftarrow w^{\{t-1\}} - \alpha \Delta w^{\{t\}}$

See the beautiful demo at Distill: https://distill.pub/2017/momentum/

7 . 3

slide-69
SLIDE 69

Adagrad (Adaptive gradient)

  • use a different learning rate for each parameter $w_d$
  • also make the learning rate adaptive

8 . 1

slide-70
SLIDE 70

Adagrad (Adaptive gradient)

  • use a different learning rate for each parameter $w_d$
  • also make the learning rate adaptive

$S_d^{\{t\}} \leftarrow S_d^{\{t-1\}} + \left( \frac{\partial}{\partial w_d} J(w^{\{t-1\}}) \right)^2$

the sum of squares of the derivatives over all iterations so far (for an individual parameter)

8 . 1

slide-71
SLIDE 71

Adagrad (Adaptive gradient)

  • use a different learning rate for each parameter $w_d$
  • also make the learning rate adaptive

$S_d^{\{t\}} \leftarrow S_d^{\{t-1\}} + \left( \frac{\partial}{\partial w_d} J(w^{\{t-1\}}) \right)^2$

the sum of squares of the derivatives over all iterations so far (for an individual parameter)

$w_d^{\{t\}} \leftarrow w_d^{\{t-1\}} - \frac{\alpha}{\sqrt{S_d^{\{t\}} + \epsilon}} \frac{\partial}{\partial w_d} J(w^{\{t-1\}})$

the learning rate is adapted to previous updates; $\epsilon$ is there to avoid numerical issues

8 . 1

slide-72
SLIDE 72

Adagrad (Adaptive gradient)

  • use a different learning rate for each parameter $w_d$
  • also make the learning rate adaptive

$S_d^{\{t\}} \leftarrow S_d^{\{t-1\}} + \left( \frac{\partial}{\partial w_d} J(w^{\{t-1\}}) \right)^2$

the sum of squares of the derivatives over all iterations so far (for an individual parameter)

$w_d^{\{t\}} \leftarrow w_d^{\{t-1\}} - \frac{\alpha}{\sqrt{S_d^{\{t\}} + \epsilon}} \frac{\partial}{\partial w_d} J(w^{\{t-1\}})$

the learning rate is adapted to previous updates; $\epsilon$ is there to avoid numerical issues

useful when parameters are updated at different rates (e.g., NLP)

8 . 1
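The deck gives code for RMSprop later but not for Adagrad; here is a minimal sketch of the update above (an assumption, written in the same minibatch style and reusing the gradient function defined earlier).

import numpy as np

def Adagrad(X, y, lr=.1, eps=1e-2, bsize=1, epsilon=1e-8):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    S = np.zeros(D)                              # per-parameter sum of squared derivatives
    while np.linalg.norm(g) > eps:
        minibatch = np.random.randint(N, size=(bsize))
        g = gradient(X[minibatch,:], y[minibatch], w)
        S = S + g**2
        w = w - lr * g / np.sqrt(S + epsilon)    # per-parameter adaptive step
    return w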

slide-73
SLIDE 73

Adagrad (Adaptive gradient)

  • a different learning rate for each parameter $w_d$
  • make the learning rate adaptive

(figure: SGD with $\alpha = .1, |B| = 1, T = 80{,}000$)

8 . 2

slide-74
SLIDE 74

Adagrad (Adaptive gradient)

  • a different learning rate for each parameter $w_d$
  • make the learning rate adaptive

(figures: SGD with $\alpha = .1, |B| = 1, T = 80{,}000$; Adagrad with $\alpha = .1, |B| = 1, T = 80{,}000, \epsilon = 1e{-}8$)

8 . 2

slide-75
SLIDE 75

Adagrad (Adaptive gradient)

  • a different learning rate for each parameter $w_d$
  • make the learning rate adaptive

problem: the learning rate goes to zero too quickly

(figures: SGD with $\alpha = .1, |B| = 1, T = 80{,}000$; Adagrad with $\alpha = .1, |B| = 1, T = 80{,}000, \epsilon = 1e{-}8$)

8 . 2

slide-76
SLIDE 76

RMSprop (Root Mean Squared propagation)

Solves the problem of diminishing step-size with Adagrad: use an exponential moving average instead of a sum (similar to momentum); otherwise identical to Adagrad.

$S^{\{t\}} \leftarrow \gamma S^{\{t-1\}} + (1 - \gamma) \nabla J(w^{\{t-1\}})^2$

$w^{\{t\}} \leftarrow w^{\{t-1\}} - \frac{\alpha}{\sqrt{S^{\{t\}} + \epsilon}} \nabla J(w^{\{t-1\}})$

def RMSprop(X, y, lr=.01, eps=1e-2, bsize=8, gamma=.9, epsilon=1e-8):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    S = 0
    while np.linalg.norm(g) > eps:
        minibatch = np.random.randint(N, size=(bsize))
        g = gradient(X[minibatch,:], y[minibatch], w)
        S = (1-gamma)*g**2 + gamma*S
        w = w - lr*g/np.sqrt(S + epsilon)
    return w

9 . 1

slide-77
SLIDE 77

Adam (Adaptive Moment Estimation)

Two ideas so far, both using exponential moving averages:

  • 1. use momentum to smooth out the oscillations
  • 2. an adaptive per-parameter learning rate

Adam combines the two:

$M^{\{t\}} \leftarrow \beta_1 M^{\{t-1\}} + (1 - \beta_1) \nabla J(w^{\{t-1\}})$    (moving average of the first moment; identical to the method of momentum)

$S^{\{t\}} \leftarrow \beta_2 S^{\{t-1\}} + (1 - \beta_2) \nabla J(w^{\{t-1\}})^2$    (moving average of the second moment; identical to RMSProp)

$w^{\{t\}} \leftarrow w^{\{t-1\}} - \frac{\alpha \hat{M}^{\{t\}}}{\sqrt{\hat{S}^{\{t\}}}}$

9 . 2

slide-78
SLIDE 78

Adam (Adaptive Moment Estimation)

$M^{\{t\}} \leftarrow \beta_1 M^{\{t-1\}} + (1 - \beta_1) \nabla J(w^{\{t-1\}})$    (moving average of the first moment; identical to the method of momentum)

$S^{\{t\}} \leftarrow \beta_2 S^{\{t-1\}} + (1 - \beta_2) \nabla J(w^{\{t-1\}})^2$    (moving average of the second moment; identical to RMSProp)

Since $M$ and $S$ are initialized to zero, at early stages they are biased towards zero; Adam corrects for this bias:

$\hat{M}^{\{t\}} \leftarrow \frac{M^{\{t\}}}{1 - \beta_1^t}$,   $\hat{S}^{\{t\}} \leftarrow \frac{S^{\{t\}}}{1 - \beta_2^t}$

for large time steps the correction has no effect; for small $t$ it scales up the estimates

$w^{\{t\}} \leftarrow w^{\{t-1\}} - \frac{\alpha \hat{M}^{\{t\}}}{\sqrt{\hat{S}^{\{t\}}}}$

9 . 3
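A minimal sketch of the Adam update with bias correction (an assumption, written in the same minibatch style as the earlier code and reusing the gradient function from the deck); the default β1, β2 and epsilon placement are common choices, not values taken from the slides.

import numpy as np

def Adam(X, y, lr=.01, eps=1e-2, bsize=8, beta1=.9, beta2=.999, epsilon=1e-8):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    M, S, t = 0, 0, 0
    while np.linalg.norm(g) > eps:
        t += 1
        minibatch = np.random.randint(N, size=(bsize))
        g = gradient(X[minibatch,:], y[minibatch], w)
        M = beta1*M + (1-beta1)*g           # first-moment moving average (momentum)
        S = beta2*S + (1-beta2)*g**2        # second-moment moving average (RMSProp)
        Mhat = M / (1 - beta1**t)           # bias correction for early iterations
        Shat = S / (1 - beta2**t)
        w = w - lr * Mhat / (np.sqrt(Shat) + epsilon)   # epsilon added for numerical stability
    return w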

slide-79
SLIDE 79

In practice

The list of methods is growing...

  • they have recommended ranges of hyper-parameters (learning rate, momentum, etc.), but may still need some hyper-parameter tuning
  • these are all first-order methods: they only need the first derivative
  • 2nd-order methods can be much more effective, but also much more expensive

(figure: logistic regression example; image: Alec Radford)

10

slide-80
SLIDE 80

Adding regularization: $L_2$

do not penalize the bias

11 . 1

slide-81
SLIDE 81

Adding regularization: $L_2$

do not penalize the bias

def gradient(X, y, w, lambdaa):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    grad[1:] += lambdaa * w[1:]   # weight decay
    return grad

11 . 1

slide-82
SLIDE 82

Adding regularization: $L_2$

do not penalize the bias

The L2 penalty makes the optimization easier too!

def gradient(X, y, w, lambdaa):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    grad[1:] += lambdaa * w[1:]   # weight decay
    return grad

11 . 1

slide-83
SLIDE 83

Adding regularization: $L_2$

do not penalize the bias

The L2 penalty makes the optimization easier too!

(figure: cost contours for $\lambda = 0$)

def gradient(X, y, w, lambdaa):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    grad[1:] += lambdaa * w[1:]   # weight decay
    return grad

11 . 1

slide-84
SLIDE 84

Adding regularization: $L_2$

do not penalize the bias

The L2 penalty makes the optimization easier too!

(figures: cost contours for $\lambda = 0$ and $\lambda = .01$)

def gradient(X, y, w, lambdaa):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    grad[1:] += lambdaa * w[1:]   # weight decay
    return grad

11 . 1

slide-85
SLIDE 85

Adding regularization: $L_2$

do not penalize the bias

The L2 penalty makes the optimization easier too!

(figures: cost contours for $\lambda = 0$, $\lambda = .01$, and $\lambda = .1$)

def gradient(X, y, w, lambdaa):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    grad[1:] += lambdaa * w[1:]   # weight decay
    return grad

11 . 1

slide-86
SLIDE 86

Adding regularization: $L_2$

do not penalize the bias

The L2 penalty makes the optimization easier too!

(figures: cost contours for $\lambda = 0$, $\lambda = .01$, and $\lambda = .1$; note that the optimal $w_1$ shrinks)

def gradient(X, y, w, lambdaa):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    grad[1:] += lambdaa * w[1:]   # weight decay
    return grad

11 . 1

slide-87
SLIDE 87

Subderivatives

The L1 penalty is no longer smooth or differentiable (at 0); we extend the notion of derivative to non-smooth functions.

11 . 2

slide-88
SLIDE 88

Subderivatives

The L1 penalty is no longer smooth or differentiable (at 0); we extend the notion of derivative to non-smooth functions.

The sub-differential is the set of all sub-derivatives at a point:

$\partial f(\hat{w}) = \left[ \lim_{w \to \hat{w}^-} \frac{f(w) - f(\hat{w})}{w - \hat{w}}, \ \lim_{w \to \hat{w}^+} \frac{f(w) - f(\hat{w})}{w - \hat{w}} \right]$

11 . 2

slide-89
SLIDE 89

Subderivatives

The L1 penalty is no longer smooth or differentiable (at 0); we extend the notion of derivative to non-smooth functions.

The sub-differential is the set of all sub-derivatives at a point:

$\partial f(\hat{w}) = \left[ \lim_{w \to \hat{w}^-} \frac{f(w) - f(\hat{w})}{w - \hat{w}}, \ \lim_{w \to \hat{w}^+} \frac{f(w) - f(\hat{w})}{w - \hat{w}} \right]$

If $f$ is differentiable at $\hat{w}$, then the sub-differential has a single member: $\frac{d}{dw} f(\hat{w})$.

11 . 2

slide-90
SLIDE 90

Subderivatives

The L1 penalty is no longer smooth or differentiable (at 0); we extend the notion of derivative to non-smooth functions.

The sub-differential is the set of all sub-derivatives at a point:

$\partial f(\hat{w}) = \left[ \lim_{w \to \hat{w}^-} \frac{f(w) - f(\hat{w})}{w - \hat{w}}, \ \lim_{w \to \hat{w}^+} \frac{f(w) - f(\hat{w})}{w - \hat{w}} \right]$

If $f$ is differentiable at $\hat{w}$, then the sub-differential has a single member: $\frac{d}{dw} f(\hat{w})$.

Another expression for the sub-differential:

$\partial f(\hat{w}) = \{ g \in \mathbb{R} \mid f(w) \geq f(\hat{w}) + g(w - \hat{w}) \ \forall w \}$

11 . 2

slide-91
SLIDE 91

Subgradient

example: the sub-differential of the absolute value $f(w) = |w|$ is $\partial f(0) = [-1, 1]$ and $\partial f(w \neq 0) = \{\mathrm{sign}(w)\}$

image credit: G. Gordon

11 . 3

slide-92
SLIDE 92

Subgradient

The subgradient is a vector of sub-derivatives (recall, the gradient was the vector of partial derivatives).

example: the sub-differential of the absolute value $f(w) = |w|$ is $\partial f(0) = [-1, 1]$ and $\partial f(w \neq 0) = \{\mathrm{sign}(w)\}$

image credit: G. Gordon

11 . 3

slide-93
SLIDE 93

Subgradient

The subgradient is a vector of sub-derivatives (recall, the gradient was the vector of partial derivatives).

example: the sub-differential of the absolute value $f(w) = |w|$ is $\partial f(0) = [-1, 1]$ and $\partial f(w \neq 0) = \{\mathrm{sign}(w)\}$

sub-differential for functions of multiple variables:

$\partial f(\hat{w}) = \{ g \in \mathbb{R}^D \mid f(w) \geq f(\hat{w}) + g^T (w - \hat{w}) \ \forall w \}$

image credit: G. Gordon

11 . 3

slide-94
SLIDE 94

Subgradient

The subgradient is a vector of sub-derivatives (recall, the gradient was the vector of partial derivatives).

We can use the sub-gradient with a diminishing step-size for optimization.

example: the sub-differential of the absolute value $f(w) = |w|$ is $\partial f(0) = [-1, 1]$ and $\partial f(w \neq 0) = \{\mathrm{sign}(w)\}$

sub-differential for functions of multiple variables:

$\partial f(\hat{w}) = \{ g \in \mathbb{R}^D \mid f(w) \geq f(\hat{w}) + g^T (w - \hat{w}) \ \forall w \}$

image credit: G. Gordon

11 . 3

slide-95
SLIDE 95

Adding regularization: $L_1$

  • L1-regularized linear regression has efficient solvers
  • here: the subgradient method for L1-regularized logistic regression

11 . 4

slide-96
SLIDE 96

Adding regularization: $L_1$

  • L1-regularized linear regression has efficient solvers
  • here: the subgradient method for L1-regularized logistic regression
  • do not penalize the bias
  • using a diminishing learning rate

11 . 4

slide-97
SLIDE 97

Adding regularization: $L_1$

  • L1-regularized linear regression has efficient solvers
  • here: the subgradient method for L1-regularized logistic regression
  • do not penalize the bias
  • using a diminishing learning rate

def gradient(X, y, w, lambdaa):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    grad[1:] += lambdaa * np.sign(w[1:])   # subgradient of the L1 penalty
    return grad

11 . 4

slide-98
SLIDE 98

Adding regularization: $L_1$

  • L1-regularized linear regression has efficient solvers
  • here: the subgradient method for L1-regularized logistic regression
  • do not penalize the bias
  • using a diminishing learning rate

(figure: cost contours for $\lambda = 0$)

def gradient(X, y, w, lambdaa):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    grad[1:] += lambdaa * np.sign(w[1:])   # subgradient of the L1 penalty
    return grad
11 . 4

slide-99
SLIDE 99

Adding regularization: $L_1$

  • L1-regularized linear regression has efficient solvers
  • here: the subgradient method for L1-regularized logistic regression
  • do not penalize the bias
  • using a diminishing learning rate

(figures: cost contours for $\lambda = 0$, $\lambda = .1$, and $\lambda = 1$; note that the optimal $w_1$ becomes 0)

def gradient(X, y, w, lambdaa):
    N, D = X.shape
    yh = logistic(np.dot(X, w))
    grad = np.dot(X.T, yh - y) / N
    grad[1:] += lambdaa * np.sign(w[1:])   # subgradient of the L1 penalty
    return grad

11 . 4

slide-100
SLIDE 100

Summary

  • learning: optimizing the model parameters (minimizing a cost function)
  • use gradient descent to find a local minimum
  • easy to implement (esp. using automated differentiation)
  • for convex functions it gives the global minimum

12

slide-101
SLIDE 101

Summary

  • learning: optimizing the model parameters (minimizing a cost function)
  • use gradient descent to find a local minimum
  • easy to implement (esp. using automated differentiation)
  • for convex functions it gives the global minimum
  • Stochastic GD: for large datasets, use a mini-batch for a noisy but fast estimate of the gradient
  • Robbins–Monro condition: reduce the learning rate to help with the noise
  • better (stochastic) gradient optimization:
      • Momentum: an exponential running average to help with the noise
      • Adagrad & RMSProp: per-parameter adaptive learning rates
      • Adam: combining these two ideas

12

slide-102
SLIDE 102

Summary

  • learning: optimizing the model parameters (minimizing a cost function)
  • use gradient descent to find a local minimum
  • easy to implement (esp. using automated differentiation)
  • for convex functions it gives the global minimum
  • Stochastic GD: for large datasets, use a mini-batch for a noisy but fast estimate of the gradient
  • Robbins–Monro condition: reduce the learning rate to help with the noise
  • better (stochastic) gradient optimization:
      • Momentum: an exponential running average to help with the noise
      • Adagrad & RMSProp: per-parameter adaptive learning rates
      • Adam: combining these two ideas
  • Adding regularization can also help with optimization

12

slide-103
SLIDE 103

Adadelta

Solves the problem of diminishing step-size with Adagrad:

  • use an exponential moving average instead of a sum (similar to momentum)
  • also gets rid of a "learning rate" altogether: use another moving average for that!

$S^{\{t\}} \leftarrow \gamma S^{\{t-1\}} + (1 - \gamma) \nabla J(w^{\{t-1\}})^2$    (moving average of the squared gradient)

$U^{\{t\}} \leftarrow \gamma U^{\{t-1\}} + (1 - \gamma) (\Delta w^{\{t-1\}})^2$    (moving average of the squared updates)

$\Delta w^{\{t\}} \leftarrow -\sqrt{\frac{U^{\{t-1\}} + \epsilon}{S^{\{t\}} + \epsilon}} \, \nabla J(w^{\{t-1\}})$    (the square root of the ratio of the two moving averages is used as the adaptive learning rate)

$w^{\{t\}} \leftarrow w^{\{t-1\}} + \Delta w^{\{t\}}$

13
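A minimal sketch of the Adadelta updates above (an assumption, in the same minibatch style as the RMSprop code and reusing the gradient function defined earlier in the deck).

import numpy as np

def Adadelta(X, y, eps=1e-2, bsize=8, gamma=.9, epsilon=1e-8):
    N, D = X.shape
    w = np.zeros(D)
    g = np.inf
    S, U = 0, 0
    while np.linalg.norm(g) > eps:
        minibatch = np.random.randint(N, size=(bsize))
        g = gradient(X[minibatch,:], y[minibatch], w)
        S = gamma*S + (1-gamma)*g**2                        # moving average of squared gradients
        dw = -np.sqrt((U + epsilon) / (S + epsilon)) * g    # adaptive step, no global learning rate
        U = gamma*U + (1-gamma)*dw**2                       # moving average of squared updates
        w = w + dw
    return w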