
Concise Introduction to Deep Neural Networks

Outline:

  • Classification problems
  • Motivating Deep (large) Neural Network (DNN) classifiers
  • Neurons and DNN architectures
  • Numerical training of DNNs (supervised deep learning)
  • Spiking and gated neurons
  • Concluding remarks


Glossary

  • N: dimension of the sample (classifier input pattern) space, R^N
  • T: finite set of labelled training samples s ∈ R^N, i.e., T ⊂ R^N
  • C: the (finite) number of classes
  • c(s) ∈ {1, 2, ..., C}: true class label of s ∈ R^N
  • I: finite set of unlabelled test/production data samples s ∈ R^N on which to perform class inference, I ⊂ R^N
  • ĉ(s): class of sample s inferred by the neural network
  • w: edge weights of the neural network
  • b: neuron (or "unit") parameters
  • x = (w, b): collective parameters of the neural network
  • v: neuron output (activation)
  • f, g: neuron activation functions
  • ℓ: a set of neurons comprising a network layer
  • ℓ(n): the network layer prior to that in which neuron n resides
  • L: loss function used for training
  • η: learning rate or step size
  • α, γ: gradient momentum parameter, forgetting/fading factor
  • λ: Lagrange multiplier


Classification problems

  • Consider many data samples in a large feature space.
  • The samples may be, e.g., images, segments of speech, documents, or the current state of an online game.

  • Suppose that, based on each sample, one of a finite number of decisions must be made.
  • Many samples may be associated with the same decision, e.g.,
    – the type of animal in an image,
    – the word that is being spoken in a segment of speech,
    – the sentiment or topic of some text, or
    – the action that is to be taken by a particular player at a particular state in the game.

  • Thus, we can define a class of samples as all of those associated with the same decision.


Classifier

  • A sample s is an input pattern to a classifier.
  • The output ĉ(s) is the inferred class label (decision) for the sample s.
  • The classifier parameters x = (w, b) need to be learned so that the inferred class decisions are mostly accurate.


Types of data

  • The samples themselves may have features that are of different types, e.g., categorical, discrete numerical, continuous numerical.

  • There are different ways to transform data of all types to continuous numerical.
  • How this is done may significantly affect classification performance.
  • This is part of an often complex, initial data-preparation phase of DNN training.
  • In the following, we assume all samples s ∈ R^N for some feature dimension N.


Training and test datasets for classification

  • Consider a finite training dataset T ⊂ R^N with true class labels c(s) for all s ∈ T.
  • T has representative samples of all C classes, c : T → {1, 2, ..., C}.
  • Using T and c, the goal is to create a classifier ĉ : R^N → {1, 2, ..., C} that
    – accurately classifies on T, i.e., ∀s ∈ T, ĉ(s) = c(s), and
    – hopefully generalizes well to an unlabelled production/test set I encountered in the field with the same distribution as T, i.e., hopefully for most s ∈ I, ĉ(s) = c(s).
  • That is, the classifier "infers" the class label of the test samples s ∈ I.
  • To learn decision-making hyperparameters, a held-out subset of the training set, H, with representatives from all classes, may be used to ascertain the accuracy of a classifier ĉ on H as
    ( Σ_{s∈H} 1{ĉ(s) = c(s)} / |H| ) × 100%.
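As a concrete illustration (a minimal sketch, not from the slides), the held-out accuracy above can be computed as follows; `classify`, `H`, and `labels` are hypothetical names for the classifier ĉ, the held-out samples, and their true labels.

```python
def heldout_accuracy(classify, H, labels):
    """classify: s -> inferred class; H: held-out samples; labels: true classes c(s)."""
    correct = sum(1 for s, c in zip(H, labels) if classify(s) == c)
    return 100.0 * correct / len(H)

# e.g. acc_percent = heldout_accuracy(my_classifier, H, labels)
```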


Optimal Bayes error rate

  • The test/production set I is not available or known during training.
  • There may be some ambiguity when deciding about some samples.
  • For each sample/input-pattern s, there is a true posterior distribution on the classes p(i|s), where p(i|s) ≥ 0 and Σ_{i=1}^C p(i|s) = 1.
  • This gives the Bayes error (misclassification) rate, e.g.,
    B := ∫_{R^N} (1 − p(c(s)|s)) π(s) ds,
    where π is the (true) prior density on the input sample-space R^N.
  • A given classifier ĉ trained on a finite training dataset T (hopefully sampled according to π) may have normalized outputs for each class, p̂(i|s) ≥ 0, cf. softmax output layers.
  • The classifier will have error rate
    ∫_{R^N} (1 − p̂(ĉ(s)|s)) π(s) ds ≥ B.

  • See Duda, Hart and Stork. Pattern Classification. 2nd Ed. Wiley, 2001.


Motivating Deep (large) Neural Network (DNN) classifiers

  • Consider a large training set T ⊂ R^N (|T| ≫ 1) in a high-dimensional feature space (N ≫ 1) with a possibly large number of associated classes (C ≫ 1).
  • In such cases, class decision boundaries may be nonconvex, and each class may consist of multiple disjoint regions (components) in feature space R^N.
  • So a highly parameterized classifier, e.g., a Deep (large) artificial Neural Network (DNN), is warranted.
  • Note: A ⊂ R^N is a convex set iff ∀x, y ∈ A and ∀r ∈ [0, 1], rx + (1 − r)y ∈ A.


Non-convex classes ⊂ R^N

[Figure: example class regions — single-component convex classes versus classes (A, B, D) with non-convex or multi-component regions.]

Some alternative classification frameworks:

  • Gaussian Mixture Models (GMMs) with a BIC training objective to select the number of components
  • Support-Vector Machines (SVMs)

Cover’s theorem

Theorem: If the classes represented in T ⊂ R^N are not linearly separable, then there is a nonlinear mapping μ such that μ(T) = {μ(s) | s ∈ T} are linearly separable.

Proof:

  • Choose an enumeration T = {s^(1), s^(2), ..., s^(K)} where K = |T|.
  • Continuously map each sample s to a different unit vector ∈ R^K;
  • that is, ∀k, μ(s^(k)) = e^(k), where e^(k)_k = 1 and e^(k)_j = 0 ∀j ≠ k.
  • For example, use Lagrange interpolating polynomials with 2-norm ‖·‖ in R^N:
    ∀k, μ_k(s) = ∏_{j=1, j≠k}^K ‖s − s^(j)‖ / ‖s^(k) − s^(j)‖,
    where μ = [μ_1, ..., μ_K]^T : R^N → R^K.
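A minimal NumPy sketch (not from the slides) of this Lagrange-interpolating-polynomial construction; the function and argument names are illustrative.

```python
import numpy as np

def cover_map(T):
    """Map mu: R^N -> R^K from the proof; T is a (K, N) array of distinct samples."""
    T = np.asarray(T, dtype=float)
    K = len(T)

    def mu(s):
        out = np.empty(K)
        for k in range(K):
            num = np.prod([np.linalg.norm(s - T[j]) for j in range(K) if j != k])
            den = np.prod([np.linalg.norm(T[k] - T[j]) for j in range(K) if j != k])
            out[k] = num / den   # mu_k(s): equals 1 at s = s^(k), 0 at the other samples
        return out

    return mu

# Sanity check: mu(T[k]) is (approximately) the k-th unit vector e^(k) in R^K.
# T = np.random.randn(5, 3); mu = cover_map(T); print(np.round(mu(T[2]), 6))
```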


Proof of Cover’s theorem (cont)

  • Every partition of the samples μ(T) = {μ(s) | s ∈ T} into two different sets (classes) S_1 and S_2 is separable by the hyperplane with parameters
    w = Σ_{k∈S_1} e^(k) − Σ_{k∈S_2} e^(k)
    (so w ∈ R^K has entries ±1).
  • Thus, ∀k ∈ S_1, w^T e^(k) = 1 > 0, and ∀k ∈ S_2, w^T e^(k) = −1 < 0.
  • We can build a classifier for C > 2 classes from C such linear, binary classifiers:
    – Consider a partition S_1, S_2, ..., S_C of μ(T).
    – The ith binary classifier separates S_i from ∪_{j≠i} S_j, i.e., "one versus rest".
  • Q.E.D.


Cover’s theorem - Remarks

  • Here, μ(s) may be analogous to the DNN's mapping from input s to an internal layer.
  • One can roughly conclude from Cover's theorem that:
  • If the feature dimension is already much larger than the number of samples (i.e., N ≫ K as in, e.g., some genome datasets), then the data T will likely already be linearly separable.


DNN architectures

Outline:

  • Some types of neurons/units (activation functions)
  • Some types of layers
  • Example DNN architectures especially for image classification


Illustrative 4-layer, 2-class neural network (with softmax layer)


Some types of neurons

  • Consider a neuron/unit n in a layer ℓ, n ∈ ℓ, with input edge-weights w_{i,n}, where the neurons i are in the layer prior (closer to the input) to that of n, i ∈ ℓ(n).
  • The activation of neuron n is
    v_n = f( Σ_{i∈ℓ(n)} v_i w_{i,n} , b_n ),
    where b_n are additional parameters of the activation itself.
  • Neurons of the linear type have activation functions of the form
    f(z, b_n) = b_{n,1} z + b_{n,0},
    where the slope b_{n,1} > 0 and b_{n,0} is a "bias" parameter.
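A minimal NumPy sketch (not from the slides) of a single neuron's activation v_n = f(Σ_i v_i w_{i,n}, b_n) for the linear type above and the sigmoid/ReLU types discussed next; all names are illustrative.

```python
import numpy as np

def linear(z, b):    # b = (b_1, b_0), slope b_1 > 0
    return b[0] * z + b[1]

def sigmoid(z, b):   # logistic sigmoid, values in (0, 1)
    return 1.0 / (1.0 + np.exp(-(b[0] * z + b[1])))

def relu(z, b):      # rectified linear unit
    return np.maximum(b[0] * z + b[1], 0.0)

def neuron_activation(v_prev, w, b, f=relu):
    """v_prev: previous-layer activations v_i; w: input edge weights w_{i,n}."""
    return f(np.dot(v_prev, w), b)

# e.g. neuron_activation(np.array([0.2, -1.0, 0.5]), np.array([1.0, 0.3, -0.7]), (1.0, 0.1))
```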


Sigmoid activation function

[Plot of the logistic sigmoid activation: horizontal axis from −10 to 10, output values between 0 and 1.]


Some types of neurons (cont)

  • Neurons of the sigmoid type have activation functions that include
    f(z, b_n) = tanh(z b_{n,1} + b_{n,0}) ∈ (−1, 1), or
    f(z, b_n) = 1 / (1 + exp(−z b_{n,1} − b_{n,0})) ∈ (0, 1),
    where b_{n,1} > 0.
  • Rectified Linear Unit (ReLU) type activation functions include
    f(z, b_n) = (b_{n,1} z + b_{n,0})⁺ = max{b_{n,1} z + b_{n,0}, 0}.
  • Note that ReLUs are not continuously differentiable at z = −b_{n,0}/b_{n,1}.
  • Also, both linear and ReLU activations are not necessarily bounded, whereas sigmoids are.
  • "Hard threshold" neural activations involving unit-step functions u(x) = 1{x ≥ 0}, e.g., f(z, b_n) = b_{n,0} u(z − b_{n,1}) ≥ 0, obviously are not differentiable.

  • Spiking and gated neuron types are discussed later.


Some types of layers - fully connected

  • Consider neurons n in a layer ℓ.
  • If it is possible that w_{i,n} ≠ 0 for all i ∈ ℓ(n) and n ∈ ℓ, then layer ℓ is said to be fully interconnected.


Max-pooling layer - example with two partition elements A, A′

  • Pooling layers are intended to downsample from a large layer ℓ′ to a smaller one ℓ, i.e., |ℓ| ≪ |ℓ′|.
  • [Figure: a 4×4 grid of activations (left) max-pooled to a 2×2 grid (right).]
  • Each number in the figure is a neural network activation of a max-pooling layer, where
  • |ℓ′| = 16 (left), |ℓ| = 4, and
  • the window of size 4 slides across the larger representation (ℓ′ at left) according to the stride parameter (2) to take |ℓ| = 4 different maximum readings and form the downsampled layer ℓ.
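A minimal NumPy sketch (not from the slides) of 2-D max pooling with a 2×2 window and stride 2, matching the 4×4 → 2×2 example described above.

```python
import numpy as np

def max_pool2d(a, window=2, stride=2):
    """a: 2-D array of activations of the larger layer; returns the pooled (smaller) layer."""
    rows = (a.shape[0] - window) // stride + 1
    cols = (a.shape[1] - window) // stride + 1
    out = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            out[r, c] = a[r*stride:r*stride+window, c*stride:c*stride+window].max()
    return out

# e.g. max_pool2d(np.arange(16).reshape(4, 4)) -> [[ 5.,  7.], [13., 15.]]
```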


Convolutional layers

  • Consider neurons n ∈ ℓ and suppose the neurons in layer ℓ and in the previous layer ℓ′ (or ℓ(n)) are somehow ordered and enumerated.
  • Let K = max{|ℓ|, |ℓ′|}.
  • Layer ℓ is said to be convolutional if, ∀n ∈ ℓ, its activations are
    v_n = f( Σ_{i∈ℓ(n)} v_i h_{(n−i) mod K} , b_n ),
    where h is the K-dimensional convolution kernel.
  • Two-dimensional convolutions are used in cases where the data are images.
  • Compared to fully connected layers with |ℓ| · |ℓ′| parameters, convolutional layers are "regularized" with only K parameters.
  • Convolutions are characteristic of linear, time-invariant transformations, which were used for decades in data processing prior to their incorporation into neural networks, and continue to be used today.


Graphical layers

  • Consider a layer ℓ with activations x_i, i ∈ ℓ.
  • Suppose that there is a sense of Boolean adjacency, A_{i,j} ∈ {0, 1} ∀ i ≠ j ∈ ℓ.
  • For each node i ∈ ℓ, we can consider a neighborhood N_i(r) of radius r,
  • i.e., for all k ∈ N_i(r), there are j_0, j_1, ..., j_r ∈ ℓ such that j_0 = i, j_r = k, and A_{j_m, j_{m+1}} = 1 for all m = 0, ..., r − 1 (there is a path from i to k of length ≤ r).
  • For an example graphical layer of radius r, we can define the activations of the next layer ℓ′ as, ∀j ∈ ℓ′,
    x_j = Σ_{i∈ℓ} w_{i,j} f( Σ_{k∈N_i(r)} b_{i,k} x_k ),
    where, e.g., f is a sigmoid or ReLU.
  • Here, the parameters to be learned, w and b, may be simplified so that, e.g., ∀i, k, b_{i,k} = b_{|i−k|}, where |i − k| ≤ r is the minimum path length between i and k.


Nearest-prototype final layer

  • Assuming a penultimate layer with activations z ∈ R^K, the idea is to learn a prototype b_i ∈ R^K for each class i ∈ {1, 2, ..., C}.
  • The final-layer activations are, e.g.,
    f_i(z) = w_i φ(‖z − b_i‖_2²),
    where φ is a smooth, positive, increasing function with φ(0) = 0, and w_i > 0.
  • The use of Euclidean Radial Basis Functions (RBFs), i.e., φ(x) ≡ x, in this layer is equivalent to a C-component Gaussian Mixture Model (GMM) with identity covariances and hard assignments to components.
  • For a nearest-prototype final layer, the class decision minimizes the final-layer activations, ĉ = arg min_i f_i(z).
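A minimal NumPy sketch (not from the slides) of a nearest-prototype decision with Euclidean RBFs (φ(x) ≡ x); names are illustrative.

```python
import numpy as np

def nearest_prototype_decision(z, prototypes, w=None):
    """z: penultimate-layer activations; prototypes: (C, K) array, one prototype b_i per class."""
    w = np.ones(len(prototypes)) if w is None else w
    f = w * np.sum((prototypes - z) ** 2, axis=1)   # f_i(z) = w_i * ||z - b_i||^2
    return int(np.argmin(f)), f                     # class decision and final-layer activations
```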


Softmax class decisions based on the final layer

  • Again, suppose the DNN has C outputs f_i for class i ∈ {1, 2, ..., C}.
  • If f_i(s) ≥ 0 for all (DNN inputs) s, then we may define, e.g.,
    p_i(s) = f_i(s) / Σ_{k=1}^C f_k(s),
    else, e.g.,
    p_i(s) = exp(b f_i(s)) / Σ_{k=1}^C exp(b f_k(s)), with b > 0.
  • These terms are sometimes interpreted as posterior probabilities of the classes, p_i(s) = p(i|s).
  • So, we can add a softmax output layer, p_1(s), ..., p_C(s), indicating "confidence" in the class decision ĉ(s) = arg max_i p_i(s) made for each input pattern s.
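A minimal NumPy sketch (not from the slides) of the exponential (softmax) normalization and the resulting class decision; the max-subtraction is only for numerical stability and does not change p.

```python
import numpy as np

def softmax_decision(f, b=1.0):
    """f: final-layer activations (f_1, ..., f_C); returns (c_hat, p)."""
    z = b * np.asarray(f, dtype=float)
    z -= z.max()                        # numerical stability; p is unchanged
    p = np.exp(z) / np.exp(z).sum()     # p_i(s) = exp(b f_i(s)) / sum_k exp(b f_k(s))
    return int(np.argmax(p)), p

# e.g. softmax_decision([2.0, 0.5, 0.1]) -> class 0 with p ~= [0.73, 0.16, 0.11]
```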


Softmax layer (cont)

  • The class decision for s may not be accepted unless it has some "margin" μ > 0, i.e., unless
    p_{ĉ(s)}(s) − max_{i≠ĉ(s)} p_i(s) > μ.
  • Test samples I are not necessarily close to training samples T, so a large classification margin on T obviously does not imply the same for test samples.
  • See, e.g., https://arxiv.org/abs/1910.08032


Example DNN architectures - LeNet-5*

  • The final layer is nearest-prototype using Euclidean RBFs (“Gaussian”).
  • See the “explanation” near Equ. (8) of [Y. LeCun et al., 1998].

*Figure 2 of Y. LeCun et al. Gradient Based Learning Applied to Document Recognition. Proc. IEEE, Nov. 1998.

Example DNN architectures - ResNet*


*K. He et al. Deep Residual Learning for Image Recognition. https://arxiv.org/pdf/1512.03385.pdf


DNN architectures - Discussion

  • The front end performs abstract feature extraction, e.g., convolutional layers.
  • The back end makes class decisions based on combinations of abstracted features, e.g., fully connected layers.

  • The final softmax layer allows for interpretation of class-decision confidence.


Optimization methods for training

Outline:

  • Types of training objectives
  • Background on gradient based methods
  • Stochastic Gradient Descent (SGD) with momentum
  • Background on first-order autoregressive (AR-1) estimators
  • Overfitting and DNN regularization
  • Training dataset augmentation and batch normalization
  • Held-out validation set for hyperparameters


Types of training objectives

  • The objective is to choose the classifier parameters x = (w, b), i.e., train the classifier, to minimize the following "loss" expressions over the DNN parameters, on which the final-layer activations f_i, i ∈ {1, 2, ..., C}, implicitly depend.
  • Minimizing a non-negative loss may coincide with achieving zero loss.
  • The misclassification-rate objective,
    L(x) = (1/|T|) Σ_{s∈T} 1{ĉ(s) ≠ c(s)},
    where ĉ(s) depends on the DNN parameters x, is not differentiable, and so does not lend itself to training by gradient-based methods.
  • A Mean Square Error (MSE) loss objective is
    L(x) = (1/|T|) Σ_{s∈T} |ĉ(s) − c(s)|²,
    where ĉ(s) = arg max_i f_i(s).


Differentiable training objective

  • A cross-entropy loss objective is, e.g.,
    L(x) = −(1/|T|) Σ_{s∈T} 1{ĉ(s) = c(s)} log p̂(c(s)),
    where p̂(c(s)) = f_{c(s)}(s) / Σ_k f_k(s) and the activations f_i ≥ 0 are differentiable w.r.t. the DNN parameters x.
  • Cross-entropy loss objectives are commonly used and optimized using gradient-based methods.
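A minimal NumPy sketch (not from the slides) of a standard cross-entropy loss of this kind, computed from non-negative final-layer activations; for simplicity it omits the indicator weighting shown above.

```python
import numpy as np

def cross_entropy_loss(F, labels, eps=1e-12):
    """F: (|T|, C) array of non-negative activations f_i(s); labels: true classes c(s)."""
    P = F / F.sum(axis=1, keepdims=True)           # normalized outputs p_hat(i|s)
    correct = P[np.arange(len(labels)), labels]    # p_hat(c(s)|s) for each sample
    return -np.mean(np.log(correct + eps))

# e.g. cross_entropy_loss(np.array([[0.7, 0.3], [0.2, 0.8]]), np.array([0, 1]))
```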


Training objectives (cont)

  • Suppose it is desired that
    ∀s ∈ T, f_{c(s)}(s) ≥ max_{i≠c(s)} f_i(s) + μ,
    i.e., correct classification occurs on all training samples with prescribed margin μ > 0.
  • A margin-based "dual" loss objective is, e.g.,
    L(x) = Σ_{s∈T} λ_s ( max_{i≠c(s)} f_i(s) + μ − f_{c(s)}(s) ).
  • If a margin constraint is not satisfied, retraining may be required with larger corresponding dual parameters λ_s > 0.
  • Though margin-based training may achieve a kind of "robustness", it may also overfit to the training set, resulting in reduced generalization performance.


Promoting sparsity in DNN parameters

  • Initially, a DNN may have excess parameters for the particular training task under consideration.
  • Using excess parameters may result in overfitting.
  • Note that
    lim_{p↓0} Σ_i |x_i|^p = Σ_i 1{x_i ≠ 0},
    i.e., the number of non-zero elements of x.
  • So, one way to promote sparsity among excess parameters (i.e., zeroing them out) is to suitably penalize the optimization objective with an approximate "0-norm" penalty term, e.g.,
    L(x) + γ Σ_i |x_i|^p, where 0 < p ≪ 1.
  • Here, reducing p > 0 and increasing the penalty parameter γ > 0 promotes more sparsity in the DNN parameters x.
  • The number of non-zero elements itself is not as useful since it is not differentiable, and so does not lend itself to training by gradient-based methods.
  • Cf. the discussion of overfitting and DNN regularization.
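A minimal NumPy sketch (not from the slides) of the approximate "0-norm" penalty term γ Σ_i |x_i|^p; the symbol γ for the penalty parameter follows the glossary.

```python
import numpy as np

def sparsity_penalty(x, p=0.1, gamma=1e-3):
    """Approximate 0-norm penalty gamma * sum_i |x_i|^p, with 0 < p << 1."""
    return gamma * np.sum(np.abs(np.asarray(x, dtype=float)) ** p)

# As p -> 0 this approaches gamma times the number of non-zero entries of x:
# sparsity_penalty([0.0, 2.0, -0.5], p=0.01, gamma=1.0) is approximately 2.0
```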


Back propagation to compute the gradient

  • Back propagation is just the chain rule for differentiation to compute the gradient of composed functions.
  • Consider a function of two real variables g(z_1, z_2) and define
    ∂_1 g = ∂g/∂z_1 and ∂_2 g = ∂g/∂z_2.
  • Now consider the following composed function of three variables,
    L(x_1, x_2, x_3) = g_3(x_3, g_2(x_2, g_1(x_1))),
    where g_3, g_2 are functions of two variables.
  • L represents a loss function of a simple neural network consisting of just three consecutive neurons (one per layer) having differentiable activations g_k.
  • Here, x_k is the variable associated with "layer" k and g_k is the output of layer k, and
  • the DNN output layer is layer 3 and layers 1, 2 are further back (toward the input).


Back propagation (cont)

  • Again, L(x_1, x_2, x_3) = g_3(x_3, g_2(x_2, g_1(x_1))).
  • Simply by the chain rule, the gradient of L is
    ∇L = [ ∂L/∂x_3, ∂L/∂x_2, ∂L/∂x_1 ]^T = [ ∂_1 g_3, ∂_2 g_3 · ∂_1 g_2, ∂_2 g_3 · ∂_2 g_2 · ∂_1 g_1 ]^T.
  • Note that to compute ∂L/∂x_k for k = 1, 2, one needs to compute ∂_2 g_3, i.e., this quantity needs to be propagated back from layer 3.
  • In a similar way, for a more complex feed-forward neural network, to compute the partial derivative of a loss function with respect to parameters in layer ℓ of a DNN, the partial derivatives with respect to parameters from layers closer to the output need to be propagated back to layer ℓ.
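A minimal NumPy sketch (not from the slides) of the three-neuron chain-rule gradient above, checked against finite differences; the particular activations g_1, g_2, g_3 are illustrative choices, not taken from the slides.

```python
import numpy as np

def g1(x1):        return np.tanh(x1)
def g2(x2, v):     return np.tanh(x2 * v)
def g3(x3, v):     return (x3 * v) ** 2

def loss(x):
    x1, x2, x3 = x
    return g3(x3, g2(x2, g1(x1)))

def grad_backprop(x):
    x1, x2, x3 = x
    v1 = g1(x1)                          # output of layer 1
    v2 = g2(x2, v1)                      # output of layer 2
    d1g3 = 2.0 * (x3 * v2) * v2          # dL/dx3 = partial_1 g3
    d2g3 = 2.0 * (x3 * v2) * x3          # partial_2 g3, propagated back from layer 3
    d1g2 = (1.0 - v2 ** 2) * v1          # partial_1 g2
    d2g2 = (1.0 - v2 ** 2) * x2          # partial_2 g2
    d1g1 = 1.0 - v1 ** 2                 # g1'(x1)
    return np.array([d2g3 * d2g2 * d1g1,   # dL/dx1
                     d2g3 * d1g2,          # dL/dx2
                     d1g3])                # dL/dx3

x = np.array([0.3, -0.7, 1.2])
num = np.array([(loss(x + 1e-6 * e) - loss(x - 1e-6 * e)) / 2e-6 for e in np.eye(3)])
assert np.allclose(grad_backprop(x), num, atol=1e-5)
```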


Background on gradient based methods

Outline:

  • Directional derivative and descent directions
  • First and second order optimality conditions
  • Gradient methods for local optimality
  • Reference: E. Polak. Notes on Fundamentals of Optimization For Engineers. U.C. Berkeley, Spring, 1990.


Gradients of continuously differentiable functions

  • Consider a continuously differentiable function L : R^n → R for integer n ≥ 1.
  • Our objective is to find a local minimum x̂ ∈ R^n of L.
  • The gradient of L is
    ∇L(x) = [ ∂L/∂x_1(x), ∂L/∂x_2(x), ..., ∂L/∂x_n(x) ]^T.
  • Note that ∇L : R^n → R^n.


Directional derivatives

  • The directional derivative of L at x in the direction h is
    (∇L(x))^T h = ⟨∇L(x), h⟩ = lim_{η→0} [L(x + ηh) − L(x)] / η.
  • Here, η ∈ R and x, h ∈ R^n.
  • h is a descent direction at x if ⟨∇L(x), h⟩ < 0.
  • Obviously, −∇L(x) is a descent direction at x unless ∇L(x) = 0.
  • Theorem: If h is a descent direction of L at x, then there is an η > 0 such that L(x + ηh) < L(x).
  • Proof: By the previous display, there is a sufficiently small η > 0 such that
    [L(x + ηh) − L(x)] / η ≤ (1/2)⟨∇L(x), h⟩ ⟹ L(x + ηh) − L(x) ≤ (η/2)⟨∇L(x), h⟩ < 0.


Optimality conditions - necessity

  • x̂ is a local minimum of L if there is an r > 0 such that L(x̂) ≤ L(x) for all x ∈ B(x̂, r) = {y : ‖y − x̂‖_2 < r} (the open ball centered at x̂ with radius r).
  • Theorem: If x̂ is a local minimizer of L, then ∇L(x̂) = 0.
  • Proof: Assume ∇L(x̂) ≠ 0, use the descent direction h = −∇L(x̂), and argue as in the previous theorem (with η < r) to contradict the local minimality of x̂.
  • The Hessian of (twice continuously differentiable) L is the n × n matrix
    H = ∂²L/∂x² = [ ∂²L/∂x_i∂x_j ]_{i,j=1}^n.
  • Note that H : R^n → R^{n×n}.


Optimality conditions - necessity (cont)

Theorem: If x̂ is a local minimizer of L, then ∀h, ⟨h, H(x̂)h⟩ ≥ 0, i.e., H(x̂) is positive semi-definite.

Proof:

  • For x, y ∈ R^n and s ∈ [0, 1], let g(s) = L(x + s(y − x)).
  • Integrating g″(s)(1 − s) = (g′(s)(1 − s) + g(s))′ gives
    L(y) − L(x) = ⟨∇L(x), y − x⟩ + ∫_0^1 (1 − s)⟨y − x, H(x + s(y − x))(y − x)⟩ ds.
  • Substitute y = x̂ + ηh, x = x̂, and ∇L(x̂) = 0.
  • Finally, divide by η² > 0 and let η → 0.


Optimality conditions - sufficiency

  • Theorem: If ∇L(x̂) = 0 and ∀h ≠ 0, ⟨h, H(x̂)h⟩ > 0, then x̂ is a local minimizer.
  • To prove this, assume x̂ is not a local minimizer.
    – So there is a sequence x_i → x̂ such that L(x_i) < L(x̂) for all i ∈ N.
    – Then argue as in the previous theorems to show a contradiction with the hypothesis.
  • For n = 1, recall that if L′(x̂) = 0 and L″(x̂) > 0, then x̂ is a local minimum of L.


Local minima, maxima and (when n > 1) saddle points*

  • local minimum: ∇L = 0 and H is positive definite
  • local maximum: ∇L = 0 and H is negative definite
  • saddle: ∇L = 0 and H is neither (dimension n > 1)

*Figure from C.K. Reddy and H.-D. Chiang. Stability Boundary Based Method for Finding Saddle Points on Potential Energy Surfaces. J. Computational Biology 13(3):745-766, 2006.

Gradient methods for local optimization

  • To find a local minimizer, we could:
    1. try to solve ∇L(x) = 0 by Newton-Raphson,
    2. then assess whether the solution is a local minimum, maximum, or saddle by considering the Hessian,
    3. if not a local minimum, or if the iteration doesn't converge, restart Newton-Raphson at another initial point (perhaps chosen at random).
  • An advantage of gradient-based methods is that they do not require higher-order derivatives (H), or estimates of them (as in BFGS or DFP quasi-Newton methods).
  • A disadvantage of gradient-based methods is that they tend to converge slowly compared to Newton-Raphson and may converge to saddle points.


Steepest Descent

  1. Initially, x_0 ∈ R^n, iteration index k = 0, small ε > 0.
  2. If ‖∇L(x_k)‖ < ε, then stop.
  3. Search (descent) direction h_k = −∇L(x_k).
  4. Line search to find the step size: η* = arg min_{η>0} L(x_k + η h_k).
  5. Update x_{k+1} = x_k + η* h_k.
  6. k++ and go to step 2.

Note that determining η* is a one-dimensional optimization problem.
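A minimal NumPy sketch (not from the slides) of the steepest-descent loop, with a crude backtracking search standing in for the exact one-dimensional arg-min step size.

```python
import numpy as np

def steepest_descent(L, grad_L, x0, eps=1e-6, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_L(x)
        if np.linalg.norm(g) < eps:                       # step 2: stop near a stationary point
            break
        h = -g                                            # step 3: steepest-descent direction
        eta = 1.0
        while L(x + eta * h) >= L(x) and eta > 1e-12:     # step 4: crude backtracking search
            eta *= 0.5
        x = x + eta * h                                   # step 5: update
    return x

# e.g. minimize L(x) = ||x - 1||^2:
# steepest_descent(lambda x: ((x - 1)**2).sum(), lambda x: 2*(x - 1), np.zeros(3))
```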


Line search terminates at a point where −η∇L(x_k) is tangent to a level set of L



Steepest descent - convergence

  • One can show by contradiction that any accumulation point x̂ (the limit of a convergent subsequence) of the sequence x_k must satisfy ∇L(x̂) = 0.
  • So, if H(x̂) is positive definite (i.e., x̂ is not a saddle), then x̂ is a local minimum.
  • An additional Wolfe condition on the curvature of L guarantees convergence of gradient descent to a local minimum: ∃c > 0 s.t. ∀k, |⟨h_k, ∇L(x_k + η_k h_k)⟩| ≤ c|⟨h_k, ∇L(x_k)⟩|.


Optimization heuristics for Deep Neural Networks

  • Approaches that leverage second-order derivatives (Newton-Raphson) or their approximations (BFGS or DFP quasi-Newton methods) converge more rapidly than gradient descent, but these are too complex for deep learning.
  • There are simpler approaches to line search (Armijo), which may still be too complex for the DNN setting.
  • In the following, we describe some heuristics that are used to train DNNs.
  • Note that, to minimize non-negative loss objectives L ≥ 0, one may terminate gradient descent when L ≤ ε for some small ε ≥ 0.


Constant learning rate instead of optimal step-size

  • Rather than attempting to compute an optimal step size per iteration of gradient descent, one can simply take a constant step size η > 0, i.e., x_k = x_{k−1} + η h_k.
  • Here, η > 0 is also called the learning rate.
  • This said, η may change dynamically; in particular, η becomes smaller as the iteration index k increases, for greater "depth" of search,
  • as opposed to greater "breadth" with larger η initially (when k is small).
  • Typically, the chosen learning rate η ∈ [0.01, 0.99], e.g., η = 0.1.


Stochastic Gradient Descent (SGD)

  • Suppose an additive loss objective is to be minimized,
    L(x) = (1/J) Σ_{j=1}^J g_j(x).
  • When J ≫ 1, computing ∇L(x) at each x can be very costly.
  • Instead, at step k use, e.g., the search direction h_k = −∇g_{(k mod J)}(x_k),
  • or choose h_k = −∇g_j(x_k) for a randomly chosen j.
  • Note that such an h_k might not be a descent direction for L!
  • Alternatively, compute the average gradient over a small random batch B_k ⊂ {1, 2, ..., J} (|B_k| ≪ J and the B_k are i.i.d.), so that
    h_k = −(1/|B_k|) Σ_{j∈B_k} ∇g_j(x_k).
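A minimal NumPy sketch (not from the slides) of mini-batch SGD with a constant learning rate η on an additive loss; grad_g is a hypothetical per-term gradient ∇g_j supplied by the caller.

```python
import numpy as np

def sgd(grad_g, x0, J, eta=0.1, batch_size=32, steps=1000, rng=None):
    """grad_g(j, x): gradient of g_j at x; J: number of loss terms; x0: initial parameters."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        batch = rng.choice(J, size=min(batch_size, J), replace=False)
        h = -np.mean([grad_g(j, x) for j in batch], axis=0)   # averaged stochastic gradient
        x = x + eta * h                                       # constant-step update
    return x
```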


Review of first-order autoregressive estimators

  • Suppose we want to iteratively estimate the mean of a possibly nonstationary sequence X_n, for n ∈ {0, 1, 2, ...}, possibly with an unknown (stationary) limiting distribution.
  • Since the distribution of X_n may change with n, one may want to weight the recent samples X_k (i.e., k ≤ n and k ≈ n) more significantly in the computation of the estimate X̄_n.
  • An order-1 autoregressive estimator (AR-1) is
    X̄_n = α X̄_{n−1} + (1 − α) X_n,
    where 0 < α < 1 is the forgetting/fading factor and X̄_0 = X_0.
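A minimal NumPy sketch (not from the slides) of the AR-1 estimator X̄_n = α X̄_{n−1} + (1 − α) X_n.

```python
import numpy as np

def ar1_estimate(samples, alpha=0.8):
    """Return the sequence of AR-1 estimates Xbar_n for the given samples X_n."""
    xbar = np.empty(len(samples))
    xbar[0] = samples[0]                 # Xbar_0 = X_0
    for n in range(1, len(samples)):
        xbar[n] = alpha * xbar[n - 1] + (1.0 - alpha) * samples[n]
    return xbar

# e.g. for the example that follows (mean jumps from 0.5 to 3.5 at n = 20):
# X = np.concatenate([np.random.rand(20), 3 + np.random.rand(30)])
# ar1_estimate(X, alpha=0.2) reacts faster but is noisier than with alpha=0.8
```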


AR-1 estimators (cont)

  • Note that all past values of X contribute to the current value of this autoregressive process according to weights that exponentially diminish:
    X̄_n = α^n X_0 + (1 − α)(α^{n−1} X_1 + α^{n−2} X_2 + ... + α X_{n−1} + X_n).
  • Also, if 1 − α is an inverse power of 2, then the autoregressive update
    X̄_n = X̄_{n−1} + (1 − α)(X_n − X̄_{n−1})
    is simply implemented with two additive operations and one bit-shift (the latter to multiply by 1 − α).
  • There is a simple trade-off in the choice of α.
  • A small α implies that X̄_n is more responsive to the recent samples X_k (k < n, k ≈ n), but this can lead to undesirable oscillations in the AR-1 process X̄.
  • A large value of α means that the AR-1 process will have diminished oscillations ("low-pass" filter) but will be less responsive to changes in the distribution of the samples X_k.


AR-1 estimators - example

  • Suppose the initial distribution is uniform on the interval [0, 1] (i.e., E X = 0.5), but for n ≥ 20 the distribution is uniform on the interval [3, 4] (i.e., E X changes to 3.5).
  • When α = 0.2, a sample path of the first-order AR-1 process X̄ responds much more quickly to the change in mean (at n = 20), but is more oscillatory than the corresponding sample path of the AR-1 process when α = 0.8.


SGD with momentum*

  • Momentum incorporates information from prior "stochastic gradients" h_j for j < k to try to improve possibly crude approximations h_k of −∇L(x_k).
  • For example, using simple first-order autoregression with forgetting/fading factor α ∈ (0, 1), take the search direction
    H_k = α H_{k−1} + (1 − α) h_k,
    where H_{k−1} is the search direction used for the previous set of DNN parameters, x_{k−1}.
  • Thus, x_k = x_{k−1} + η H_k.
  • To further simplify in this land of heuristics, take
    x_k = x_{k−1} + H_k with H_k = α H_{k−1} + η h_k.
  • SGD's randomness and momentum may avoid zigzagging through "ravines" associated with shallow local minima of L,
  • where zigzagging is indicated by a persistently negative sign of ⟨h_{k−1}, h_k⟩.
  • Typically, the chosen momentum parameter α ∈ [0.1, 0.9], e.g., α = 0.8.
  • The commonly used "Adam" optimizer and RMS techniques normalize an autoregressive estimate of the gradient by an autoregressive estimate of its (uncentered) second moment.
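A minimal NumPy sketch (not from the slides) of the simplified momentum update x_k = x_{k−1} + H_k, H_k = α H_{k−1} + η h_k; names are illustrative.

```python
import numpy as np

def sgd_momentum_step(x, H, h, alpha=0.8, eta=0.1):
    """One parameter update; h is the current (mini-batch) search direction h_k."""
    H_new = alpha * H + eta * h
    return x + H_new, H_new

# usage inside a training loop (minibatch_gradient is a hypothetical helper):
# H = np.zeros_like(x)
# for each step: h = -minibatch_gradient(x); x, H = sgd_momentum_step(x, H, h)
```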

*An overview is here: https://arxiv.org/pdf/1609.04747.pdf


Overfitting and DNN regularization

  • "We may assume the superiority ... of the demonstration which derives from fewer postulates or hypotheses." Aristotle, Posterior Analytics (as Occam's Razor).
  • That is, the best generalization performance is obtained if a minimum number of parameters is used to explain the training data, i.e., avoid overfitting to the training set T.
  • Note that, a priori, one has no idea how many parameters are suitable for very complex training datasets (a large number of samples in a large feature dimension), and the number of DNN parameters can be very large.
  • So a DNN may (initially) be overparameterized.
  • One can heuristically reduce the number of parameters, e.g., by random dropout of neurons or edges during or even after training (the latter may require retraining).
  • Note the reproducibility issues with random dropout.
  • Recall the discussion of promoting neuron/edge sparsity.


Overfitting illustrated*

*Figure from Shubham Jain. An Overview of Regularization Techniques in Deep Learning, https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/, April 19, 2018.


Training dataset augmentation & batch normalization

  • Consider the training dataset of an image classifier including images that belong to, say, the cat class.
  • To improve generalization performance on the test/production set, the training set can be augmented with a version of each cat image that is rotated, tint/color adjusted, contrast adjusted, etc.
  • But in some cases, augmenting with samples that are close to training samples (e.g., augmenting with "adversarial" samples in an attempt to be robust to test-time evasion attacks) may cause overfitting to training samples and degrade generalization performance.
  • Also to improve generalization performance, the training dataset can be batch normalized:
    – The mean μ_i = |T|⁻¹ Σ_{s∈T} s_i and variance σ_i² = (|T| − 1)⁻¹ Σ_{s∈T} (s_i − μ_i)² of all sample features indexed by i are computed across the training dataset T, and
    – each training sample s, with features denoted s_i, is replaced or augmented with one whose features are (s_i − μ_i)/σ_i for all i.
  • Batch normalization can also be done on internal DNN layers, where all neural activations x of a layer ℓ are adjusted by subtracting their mean and dividing by their standard deviation (as computed over the training dataset), but this complicates the neural activation functions.
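A minimal NumPy sketch (not from the slides) of feature-wise batch normalization of a training set; the same (μ, σ) would then be applied to held-out or test samples.

```python
import numpy as np

def batch_normalize(T, eps=1e-8):
    """T: (|T|, N) array of training samples; returns the normalized copy plus (mu, sigma)."""
    T = np.asarray(T, dtype=float)
    mu = T.mean(axis=0)                    # mu_i over the training set
    sigma = T.std(axis=0, ddof=1)          # unbiased (|T| - 1 denominator) std dev
    return (T - mu) / (sigma + eps), mu, sigma
```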


Held-out validation set for hyperparameters

  • Some training samples are used to set the "main" classifier parameters x = (w, b) (by gradient-based learning), while a held-out validation set H is used to tune, e.g.,
    – training hyperparameters, e.g., initial weights, learning rate, forgetting factors, bounds on or normalizations of classifier parameters,
    – parameters controlling how the classifier structure is simplified (random drop-out rate during training and/or drop-out post training), and
    – parameters for representing and preprocessing data (before training).
  • Note that retraining may be required.
  • Again, the validation set is uniformly sampled from the training set so that it is unbiased.
  • The validation set H is not the unlabelled production/test set I, and only the rest of the training set (T \ H) is used to learn the main classifier parameters x.


Spiking and Gated Neurons

Outline:

  • Spiking neurons
  • Dependent training samples and gated neurons
  • LSTM neurons
  • Back propagation through time


Spiking neuron types - example

  • Consider a thin rectangular pulse (spike) at the origin,
    π(t) = u(t) − u(t − ε),
    where u is the unit step and the positive width ε ≪ 1.
  • Suppose time-varying activations v_n(t) are given by the solution to the following first-order ODE,
    dξ_n/dt (t) = a · ( Σ_{i∈ℓ(n)} v_i(t) w_{i,n} − ξ_n(t) ),
    v_n(t) = Σ_k π( t − k / f(ξ_n(t), b_n) ),
    where the parameter a ∈ (0, 1] and f is a positive sigmoid.
  • So, the activation v_n(t) at time t is a pulse train π with rate f(ξ_n(t), b_n).


Spiking neuron types - example (cont)

  • Note that the solution to dξ/dt = a(y − ξ) is
    ξ(t) = e^{−at} ξ(0) + a ∫_0^t e^{−a(t−τ)} y(τ) dτ.
  • So, the superposed pulse train y is "smoothed out" to determine ξ and, in turn, the activation frequency f(ξ, b).
  • The constant-rate neural activations v of the input layer directly correspond to the features of the current sample s, which was applied at time zero.
  • In practice, the spiking activation may be numerically simulated, e.g., by Euler's method, to solve the ODE.


Spiking neuron types (cont)

  • One can train a DNN of such spiking neurons by porting the parameters of a trained DNN with the same topology and activation functions but with the usual constant (non-spiking) signals.
  • In this case, the parameters ε, a, ξ(0) (which may be neuron (n) dependent) may be tuned, e.g., to ensure over the training set that every pulse train's duty cycle is smaller than its period, i.e.,
    ε < min_{n,t} 1 / f(ξ_n(t), b_n).
  • For distributed inference, a DNN may be partitioned into multiple elements, e.g., along edge-cuts with the least average-signal magnitudes during training.
  • A potential benefit of spiking DNNs is that, for high-speed inference, such distributed elements need not be very carefully synchronized.


Dependent training samples

  • Classifiers may be subjected to a dependent sequence of input patterns.
  • For example, a sequence of images of a video, sounds of speech, or words of a document.
  • Classical approaches include Hidden Markov Models (HMMs).
  • That is, each sample s ∈ T may itself be a time series of dependent input patterns to the DNN, i.e., s = {s(1), s(2), ..., s(T_s)} = {s(t)}_{t=1}^{T_s}.
  • The final class decision for s is that which is made, e.g., when the last input pattern s(T_s) is applied to the DNN.


Gated neurons

  • An example memoried neuron n using a simple first-order autoregressive mechanism with forgetting/fading factor γ_n ∈ (0, 1) is:
    v_n(t) = γ_n v_n(t − 1) + (1 − γ_n) f( Σ_{i∈ℓ(n)} v_i(t) w_{i,n} , b_n ),
    where ℓ(n) is the layer prior to that of n.


Long/Short-Term Memoried (LSTM) neurons

  • For neurons n ∈ ℓ, now suppose that the forgetting factor γ_n itself is dynamic (changes from sample to sample as indexed by t) and potentially depends on all activations v_i(t) for i ∈ ℓ(n) (current sample t, previous layer) and v_k(t − 1) for k ∈ ℓ, k ≠ n (previous sample t − 1, current layer ℓ).
  • That is, the activations of the previous sample are stored and used (gated) when it comes time to compute the next sample.
  • Consider an LSTM layer with "Minimal Gated Units" (MGUs) using a positive sigmoid, e.g.,
    f(z, b) = 1 / (1 + exp(−(b_1 z + b_0))) ∈ (0, 1), with b_1 > 0,
    and some other activation function g.
  • For all neurons n ∈ ℓ,
    γ_n(t) = f( Σ_{i∈ℓ(n)} v_i(t) w^(γ)_{i,n} + Σ_{k∈ℓ, k≠n} v_k(t − 1) w^(γ)_{k,n} , b^(γ)_n ),
    v_n(t) = γ_n(t) v_n(t − 1) + (1 − γ_n(t)) g( Σ_{i∈ℓ(n)} v_i(t) w^(v)_{i,n} + v_n(t − 1) γ_n(t) w^(v)_{n,n} , b^(v)_n ).
  • Higher-order autoregression and more complex LSTM neurons are in use.


Training LSTMs

  • Recall the cross-entropy loss function,
    L(x) = −(1/|T|) Σ_{s∈T} 1{ĉ(s) = c(s)} log p̂(c(s)).
  • Note that the activations for the previous sample t − 1, v_n(t − 1), are needed to compute the gradient summand at time t, i.e., "back-propagation through time".


Concluding comments

  • DNNs and the datasets they classify are extremely complex and large-scale.
  • DNNs have highly heterogeneous architectures and are highly nonconvex and nonlinear.
  • Class partitions in the "raw" input feature space (R^N) are highly nonconvex.
  • In practice, the optimization mechanisms used, and the neural and network-architectural choices made, are heuristic, trial-and-error affairs*, when they are not based on classical ideas (e.g., regression, convolutions, AR-1, gradient descent, residual signals).
  • Data representation, formatting and curating to produce T and I, requiring actual domain expertise, may be much more time-consuming and costly than DNN training/inference.†

  • An interesting history of neural networks is here:

http://people.idsia.ch/~juergen/deep-learning-conspiracy.html


*S. Higginbotham. Show Your Machine-Learning Work. IEEE Spectrum, Dec. 2019.
†E.g., E. Strickland. How IBM Watson Overpromised and Underdelivered on AI Health Care. IEEE Spectrum, Apr. 2019; J. Murdock. Google's AI Health Screening Tool Claimed 90 Percent Accuracy, but Failed to Deliver in Real World Tests. Newsweek, 4/28/20.