SLIDE 1

On Gradient Descent and Local vs. Global Optimum

"We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found are local minima of high quality measured by the test error. ... it is in practice irrelevant as global minimum often leads to overfitting."

Note: Critical points are maxima, minima, and saddle points.

T. Stibor (GSI), ML for Beginners, 21st–25th September 2020

SLIDE 2

Activation functions

Discriminant functions of the form y(x) = w^T x + w_0 are simple linear functions of the input variables x, where distances are measured by means of the dot product.

Let us consider the non-linear logistic sigmoid activation function g(·) for limiting the output to (0, 1), that is, y(x) = g(w^T x + w_0), where

g(a) = 1 / (1 + exp(−a))

[Figure: logistic sigmoid g(a) for a ∈ (−4, 4); the output rises smoothly from 0 to 1.]

A single-layer network with a logistic sigmoid activation function can also output probabilities (rather than geometric distances).
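As an illustration (not part of the original slides), a minimal NumPy sketch of the logistic sigmoid:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid g(a) = 1 / (1 + exp(-a)); output lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-4.0, 4.0, 9)
print(np.round(sigmoid(a), 3))  # rises monotonically from ~0.018 to ~0.982
```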

SLIDE 3

Activation functions (cont.)

Heaviside step function: g(a) = 0 if a < 0, 1 if a ≥ 0

[Figure: Heaviside step function g(a) for a ∈ (−4, 4); the output jumps from 0 to 1 at a = 0.]

Hyperbolic tangent function:

g(a) = tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a))

Note: tanh(a) ∈ (−1, 1).

[Figure: tanh(a) for a ∈ (−4, 4); values lie in (−1, 1).]
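A short sketch of these two activations (my own illustration):

```python
import numpy as np

def heaviside(a):
    """Step function: 0 for a < 0, 1 for a >= 0."""
    return np.where(a < 0, 0.0, 1.0)

def tanh(a):
    """tanh(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)); output in (-1, 1)."""
    return np.tanh(a)

a = np.array([-2.0, 0.0, 2.0])
print(heaviside(a))           # [0. 1. 1.]
print(np.round(tanh(a), 3))   # [-0.964  0.     0.964]
```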

SLIDE 4

Activation functions (cont.)

Rectified Linear Unit (ReLU) function: g(a) = max(0, a)

Leaky ReLU: g(a) = max(0.1 · a, a)
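And the same for ReLU and Leaky ReLU (again a minimal sketch, not from the slides):

```python
import numpy as np

def relu(a):
    """ReLU: g(a) = max(0, a)."""
    return np.maximum(0.0, a)

def leaky_relu(a, slope=0.1):
    """Leaky ReLU: g(a) = max(slope * a, a); the small negative slope keeps
    a non-zero gradient for a < 0."""
    return np.maximum(slope * a, a)

a = np.array([-3.0, -0.5, 2.0])
print(relu(a))        # [0. 0. 2.]
print(leaky_relu(a))  # [-0.3  -0.05  2.  ]
```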

SLIDE 5

Online/Mini-Batch/Batch Learning

Online learning: update the weight

w^(i+1) = w^(i) − η ∂E^(n)/∂w   (pattern by pattern).

This type of online learning is also called stochastic gradient descent (SGD); it is an approximation of the true gradient.

Mini-batch learning: partition X randomly into subsets B_1, B_2, ..., B_S and update the weight

w^(i+1) = w^(i) − η (1/|B_s|) Σ_{n ∈ B_s} ∂E^(n)/∂w

by computing the derivatives for each pattern in subset B_s separately and then summing over all patterns in B_s.

Batch learning: update the weight

w^(i+1) = w^(i) − η (1/N) Σ_{n=1}^{N} ∂E^(n)/∂w

by computing the derivatives for each pattern separately and then summing over all patterns.
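To make the update rules concrete, here is a small self-contained sketch (my own example, not from the slides) using a linear model w·x with squared error on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # 100 patterns, 3 features
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + 0.1 * rng.normal(size=100)  # noisy linear targets

def grad(w, x, t_n):
    """Gradient of E^(n) = 1/2 (w.x - t_n)^2 with respect to w."""
    return (x @ w - t_n) * x

eta = 0.05

# Online / SGD: update pattern by pattern
w = np.zeros(3)
for epoch in range(50):
    for n in rng.permutation(len(X)):
        w -= eta * grad(w, X[n], t[n])

# Mini-batch: average per-pattern gradients over each subset B_s
# (batch learning is the special case of a single subset of size N)
w_mb = np.zeros(3)
for epoch in range(200):
    for B in np.array_split(rng.permutation(len(X)), 10):
        w_mb -= eta * np.mean([grad(w_mb, X[n], t[n]) for n in B], axis=0)

print(np.round(w, 2), np.round(w_mb, 2))  # both approach w_true
```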

SLIDE 6

Learning in Neural Networks with Backpropagation

[Figure: fully connected network with inputs x_1, x_2, ..., x_D, hidden layers a^(1)_1, ..., a^(1)_{N_1} and a^(2)_1, ..., a^(2)_{N_2}, and outputs y_1, y_2.]

Parameters to fit: W^(1), b^(1), W^(2), b^(2), W^(3), b^(3). Minimize

(1/2) ‖f(W^(3) f(W^(2) f(W^(1) X + b^(1)) + b^(2)) + b^(3)) − Y‖^2

Core idea: Calculate the error of the loss function and change the weights and biases based on the output. These "error" measurements for each unit can be used to calculate the partial derivatives. Use the partial derivatives with gradient descent to update the weights and biases and minimize the loss function.

Problem: By which magnitude shall one change, e.g., weight W^(1)_{ij} based on the error of y_2?

SLIDE 7

Learning in Neural Networks with Backpropagation (cont.)

Input: x_1, x_2; output: a^(3)_1, a^(3)_2; target: y_1, y_2; and g(·) is the activation function. The NN calculates g(W^(2) g(W^(1) x)). (Notation adapted from Andrew Ng's slides.)

[Figure: network with input layer L1 (units a^(1)_1 = x_1, a^(1)_2 = x_2), hidden layer L2 (units z^(2)_1..z^(2)_3, a^(2)_1..a^(2)_3), and output layer L3 (units z^(3)_1, z^(3)_2, a^(3)_1, a^(3)_2); weight matrices W^(1) between L1 and L2, W^(2) between L2 and L3.]

z^(2)_1 = W^(1)_{10} x_0 + W^(1)_{11} x_1 + W^(1)_{12} x_2,   a^(2)_1 = g(z^(2)_1)
z^(2)_2 = W^(1)_{20} x_0 + W^(1)_{21} x_1 + W^(1)_{22} x_2,   a^(2)_2 = g(z^(2)_2)
z^(2)_3 = W^(1)_{30} x_0 + W^(1)_{31} x_1 + W^(1)_{32} x_2,   a^(2)_3 = g(z^(2)_3)

In matrix form: z^(2)_{3×1} = W^(1)_{3×3} x_{3×1},   a^(2) = g(z^(2))

z^(3)_1 = W^(2)_{10} a^(2)_0 + W^(2)_{11} a^(2)_1 + W^(2)_{12} a^(2)_2 + W^(2)_{13} a^(2)_3,   a^(3)_1 = g(z^(3)_1)
z^(3)_2 = W^(2)_{20} a^(2)_0 + W^(2)_{21} a^(2)_1 + W^(2)_{22} a^(2)_2 + W^(2)_{23} a^(2)_3,   a^(3)_2 = g(z^(3)_2)

In matrix form: z^(3)_{2×1} = W^(2)_{2×4} a^(2)_{4×1},   a^(3) = g(z^(3))

Here x_0 = 1 and a^(2)_0 = 1 are the bias units.

Forward pass:

E(W) = (1/2) [ (a^(3)_1 − y_1)^2 + (a^(3)_2 − y_2)^2 ] = (1/2) ‖a^(3) − y‖^2
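A self-contained NumPy sketch of this forward pass (illustrative only: the weights are random and the sigmoid is assumed as g):

```python
import numpy as np

def g(a):
    """Assumed activation: logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 3))  # maps (1, x1, x2) -> z(2), three hidden units
W2 = rng.normal(size=(2, 4))  # maps (1, a(2)) -> z(3), two outputs

x = np.array([1.0, 0.5, -0.2])        # x0 = 1 is the bias unit
z2 = W1 @ x                           # z(2) = W(1) x
a2 = np.concatenate(([1.0], g(z2)))   # prepend bias unit a(2)_0 = 1
z3 = W2 @ a2                          # z(3) = W(2) a(2)
a3 = g(z3)                            # network output

y = np.array([1.0, 0.0])
E = 0.5 * np.sum((a3 - y) ** 2)       # E(W) = 1/2 ||a(3) - y||^2
print(np.round(a3, 3), round(E, 3))
```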

SLIDE 8

Learning in Neural Networks with Backpropagation (cont.)

For each node we calculate δ^(l)_j, that is, the error of unit j in layer l, because

∂E(W)/∂W^(l)_{ij} = a^(l)_j δ^(l+1)_i.

Note: ⊙ denotes element-wise multiplication.

[Figure: the same network diagram as on the previous slide.]

δ^(3) = (a^(3) − y) ⊙ g′(z^(3))
δ^(2) = (W^(2))^T δ^(3) ⊙ g′(z^(2))

Note: layer 1 is the input layer, so there is no δ^(1) term.

Backward pass:

E(W) = (1/2) [ (a^(3)_1 − y_1)^2 + (a^(3)_2 − y_2)^2 ] = (1/2) ‖a^(3) − y‖^2
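Continuing the sketch (self-contained again; with the sigmoid assumed as g, its derivative is g′(a) = g(a)(1 − g(a))):

```python
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))

def g_prime(a):
    """Derivative of the sigmoid: g'(a) = g(a) * (1 - g(a))."""
    s = g(a)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(3, 3)), rng.normal(size=(2, 4))
x = np.array([1.0, 0.5, -0.2])
y = np.array([1.0, 0.0])

# Forward pass (as on the previous slide)
z2 = W1 @ x
a2 = np.concatenate(([1.0], g(z2)))
z3 = W2 @ a2
a3 = g(z3)

# Backward pass
delta3 = (a3 - y) * g_prime(z3)              # delta(3) = (a(3) - y) . g'(z(3))
delta2 = (W2.T @ delta3)[1:] * g_prime(z2)   # drop the bias row, then . g'(z(2))

# Partial derivatives: dE/dW(l)_ij = a(l)_j * delta(l+1)_i
dW2 = np.outer(delta3, a2)
dW1 = np.outer(delta2, x)
print(dW1.shape, dW2.shape)  # (3, 3) (2, 4)
```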

SLIDE 9

Learning in Neural Networks with Backpropagation (cont.)

Backpropagation = forward pass & backward pass.

Given labeled training data (x_1, y_1), ..., (x_N, y_N). Set Δ^(l)_{ij} = 0 for all l, i, j. The values Δ will be used as accumulators for computing the partial derivatives.

For n = 1 to N:
  Forward pass: compute z^(2), a^(2), z^(3), a^(3), ..., z^(L), a^(L)
  Backward pass: compute δ^(L), δ^(L−1), ..., δ^(2)
  Accumulate the partial derivative terms: Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))^T

Finally, the calculated partial derivatives for each parameter,

∂E(W)/∂W^(l)_{ij} = (1/N) Δ^(l)_{ij},

are used in gradient descent. See interactive demo.
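Putting the pieces together, a compact self-contained sketch of this training loop (my own illustration: sigmoid activation assumed, bias units folded into the weight matrices, toy two-class data):

```python
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))

def g_prime(a):
    s = g(a)
    return s * (1.0 - s)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))                             # 50 patterns (x1, x2)
Y = np.stack([X[:, 0] > 0, X[:, 0] <= 0], axis=1) * 1.0  # toy two-class targets
W1 = 0.5 * rng.normal(size=(3, 3))                       # layer 1 weights (incl. bias column)
W2 = 0.5 * rng.normal(size=(2, 4))                       # layer 2 weights (incl. bias column)

eta = 0.5
for epoch in range(500):
    D1, D2 = np.zeros_like(W1), np.zeros_like(W2)  # Delta accumulators
    for x_n, y_n in zip(X, Y):
        x = np.concatenate(([1.0], x_n))           # bias unit x0 = 1
        z2 = W1 @ x
        a2 = np.concatenate(([1.0], g(z2)))
        z3 = W2 @ a2
        a3 = g(z3)                                 # forward pass
        d3 = (a3 - y_n) * g_prime(z3)              # backward pass
        d2 = (W2.T @ d3)[1:] * g_prime(z2)
        D2 += np.outer(d3, a2)                     # Delta := Delta + delta (a)^T
        D1 += np.outer(d2, x)
    W1 -= eta * D1 / len(X)                        # gradient descent step
    W2 -= eta * D2 / len(X)
# After training, the outputs for patterns with x1 > 0 approach (1, 0).
```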

SLIDE 10

Bayes Decision Region vs. Neural Network

[Figure: blue and red class points in the (x, y) plane, x ∈ (2, 10), y ∈ (0, 2.5), with the Bayes-optimal boundary (black) and two neural network boundaries (gray).]

Points from the blue and the red class are generated by a mixture of Gaussians. The black curve shows the optimal separation in the Bayes sense. The gray curves show the neural network separation from two independent backpropagation learning runs.

SLIDE 11

Neural Network (Density) Decision Region

SLIDE 12

Overfitting/Underfitting & Generalization

Consider the problem of polynomial curve fitting, where we fit the data using a polynomial function of the form

y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = Σ_{j=0}^{M} w_j x^j.

We measure the misfit of our predictive function y(x, w) by means of an error function which we would like to minimize:

E(w) = (1/2) Σ_{i=1}^{N} (y(x_i, w) − t_i)^2,

where t_i is the corresponding target value in the given training data set.
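A short sketch of this least-squares fit (my own illustration; the sine-plus-noise data mirrors Bishop's running example):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=N)  # noisy targets

M = 3                                        # polynomial order
Phi = np.vander(x, M + 1, increasing=True)   # columns 1, x, x^2, ..., x^M
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # minimizes the sum of squared errors

E = 0.5 * np.sum((Phi @ w - t) ** 2)  # E(w) = 1/2 sum_i (y(x_i, w) - t_i)^2
print(np.round(w, 2), round(E, 4))
```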

SLIDE 13

Polynomial Curve Fitting

[Figure: fits of polynomials of order M = 0, 1, 3, and 9 to the training data, plotted as t against x.]

SLIDE 14

Polynomial Curve Fitting (cont.)

         M = 0    M = 1    M = 3          M = 9
w⋆_0      0.19     0.82     0.31           0.35
w⋆_1              −1.27     7.99         232.37
w⋆_2                      −25.43       −5321.83
w⋆_3                       17.37       48568.31
w⋆_4                                 −231639.30
w⋆_5                                  640042.26
w⋆_6                                −1061800.52
w⋆_7                                 1042400.18
w⋆_8                                 −557682.99
w⋆_9                                  125201.43

Table: Coefficients w⋆ obtained from polynomials of various order. Observe the dramatic increase in coefficient magnitude as the order of the polynomial increases (this table is taken from Bishop's book).

SLIDE 15

Polynomial Curve Fitting (cont.)

Observe: if M is too small, the model underfits the data; if M is too large, the model overfits the data. When M is too large, the model is more flexible and becomes increasingly tuned to the random noise on the target values. It is interesting to note that the overfitting problem becomes less severe as the size of the data set increases.

[Figure: M = 9 fits for data set sizes N = 15 and N = 100, plotted as t against x.]

From ImageNet Classification with Deep Convolutional Neural Networks: "The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations."

SLIDE 16

Polynomial Curve Fitting (cont.)

One technique that can be used to control the overfitting phenomenon is regularization. Regularization involves adding a penalty term to the error function in order to discourage the coefficients from reaching large values. The modified error function has the form

Ẽ(w) = (1/2) Σ_{i=1}^{N} (y(x_i, w) − t_i)^2 + (λ/2) w^T w.

By means of the penalty term one reduces the value of the coefficients (shrinkage method).
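A sketch of this regularized (ridge) fit, reusing the toy data from the earlier sketch; the closed-form solution w = (Φ^T Φ + λI)^(−1) Φ^T t that minimizes Ẽ(w) is standard, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=10)

M, lam = 9, np.exp(-18)  # ln(lambda) = -18, as on the next slide
Phi = np.vander(x, M + 1, increasing=True)

# Minimize 1/2 sum_i (y(x_i, w) - t_i)^2 + lambda/2 * w.w  (closed form)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
print(np.round(w, 2))  # far smaller coefficients than the unregularized M = 9 fit
```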

SLIDE 17

Regularized Polynomial Curve Fitting (M = 9)

[Figure: M = 9 polynomial fit with regularization ln λ = −18, plotted as t against x.]

SLIDE 18

Regularization in Neural Networks

The number of input/output units is generally determined by the dimensionality of the data set. The number of hidden units M is a free parameter that can be adjusted to obtain the best predictive performance.

The generalization error is not a simple function of M due to the presence of local minima in the error function.

One straightforward way to deal with this problem is to increase the value of M stepwise and to choose the specific solution having the smallest test error, as sketched below.
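A sketch of this selection procedure (illustrative only; `train_and_test_error` is a hypothetical helper standing in for a full training run):

```python
def select_hidden_units(candidates, train_and_test_error):
    """Train networks of increasing size and keep the one with the smallest
    test error. train_and_test_error(M) is assumed to train a network with
    M hidden units and return its error on held-out test data."""
    best_M, best_err = None, float("inf")
    for M in candidates:
        err = train_and_test_error(M)
        if err < best_err:
            best_M, best_err = M, err
    return best_M, best_err

# Usage: select_hidden_units(range(1, 21), my_training_function)
```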

SLIDE 19

Regularization in Neural Networks (cont.)

Analogous to the regularized curve-fitting approach, we can choose a relatively large value for M and control the complexity by adding a regularization term to the error function.

Ẽ(w) = E(w) + (λ/2) w^T w

This form of regularization in neural networks is known as weight decay. Weight decay encourages weight values to decay towards zero, unless supported by the data. It can be considered an example of a parameter shrinkage method, because parameter values are shrunk towards zero. It can also be interpreted as the removal of non-useful connections during training.
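In a gradient step the penalty contributes λw to the gradient, so each update multiplies the weights by (1 − ηλ) before the usual correction, which is where the name comes from. A minimal sketch (variable names are my own):

```python
import numpy as np

def weight_decay_step(w, grad_E, eta=0.1, lam=0.01):
    """One gradient step on E~(w) = E(w) + lam/2 * w.w.
    The penalty adds lam * w to the gradient, shrinking w towards zero."""
    return w - eta * (grad_E + lam * w)

w = np.array([2.0, -3.0])
print(weight_decay_step(w, grad_E=np.zeros(2)))  # pure decay: [ 1.998 -2.997]
```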

SLIDE 20

An Overfitted Neural Network Model

[Figure: fit with 20 hidden units, weight decay 0.]

SLIDE 21

An Underfitted Neural Network Model

[Figure: fit with 20 hidden units, weight decay 2.]

SLIDE 22

Model Complexity is Properly Penalized

[Figure: fit with 20 hidden units, weight decay 0.3.]

SLIDE 23

Regularization by Early Stopping

Another alternative for controlling the effective complexity of a network is the procedure of early stopping: training is stopped at the point where the error on held-out test data is smallest.

[Figure: training and test error versus epochs; the stopping point is where the test error is minimal while the training error keeps decreasing.]
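A generic sketch of the procedure (hypothetical `train_one_epoch` and `test_error` helpers; the `patience` parameter is a common practical addition, not from the slides):

```python
def early_stopping(model, train_one_epoch, test_error, max_epochs=1000, patience=10):
    """Train until the test error has not improved for `patience` epochs,
    then report the best epoch. The helpers are assumed, not a real API."""
    best_err, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)
        err = test_error(model)
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break  # test error has been rising: stop
    return best_epoch, best_err
```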

SLIDE 24

Example: Early Stopping after 10 Epochs

[Figure: fit with 20 hidden units, weight decay 0, early stopping after 10 epochs.]

SLIDE 25

Example: Early Stopping after 50 Epochs

[Figure: fit with 20 hidden units, weight decay 0, early stopping after 50 epochs.]

SLIDE 26

Example: Early Stopping after 100 Epochs

[Figure: fit with 20 hidden units, weight decay 0, early stopping after 100 epochs.]
