SLIDE 1

Machine Learning

Lecture 06: Deep Feedforward Networks Nevin L. Zhang lzhang@cse.ust.hk

Department of Computer Science and Engineering The Hong Kong University of Science and Technology

This set of notes is based on various sources on the internet and Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT press. www.deeplearningbook.org

SLIDE 2

Introduction

So far, probabilistic models for supervised learning: {x_i, y_i}_{i=1}^N → P(y|x).

Next, deep learning: {x_i, y_i}_{i=1}^N → h = f(x), P(y|h)

h = f(x) is a feature transformation represented by a neural network; P(y|h) is a probabilistic model on the transformed features. Regarded as one whole model: P(y|x).

This lecture: h = f(x) as a feedforward neural network (FNN). Next lecture: h = f(x) as a convolutional neural network (CNN or ConvNet).

SLIDE 3

Feedforward Neural Network as Function Approximator

Outline

1 Feedforward Neural Network as Function Approximator
2 Feedforward Neural Network as Probabilistic Model
3 Backpropagation
4 Dropout
5 Optimization Algorithms

SLIDE 4

Feedforward Neural Network as Function Approximator

Deep Feedforward Networks

Deep feedforward networks, also often called feedforward neural networks (FNNs) or multilayer perceptrons (MLPs), are the quintessential deep learning models. A feedforward network defines a function y = f(x, θ). During learning, the parameters θ are optimized so that f(x, θ) approximates some target function f*(x).

SLIDE 5

Feedforward Neural Network as Function Approximator

Feedforward Neural Networks

Networks of simple computing elements (units, neurons). Each unit is connected to units on the previous layer and units on the next layer. Parameters include a bias for each unit and a weight for each link. Units are divided into input units, hidden units, and output units.

Feedforward networks: inputs enter the input units and propagate through the network to the output units.

SLIDE 6

Feedforward Neural Network as Function Approximator

The Units

A unit accepts a vector of inputs x, computes an affine transformation z = W⊤x + b, where W = (W1, W2, . . . , Wn)⊤ are the link weights and b is the bias of the unit (z is sometimes called the net input of the unit), then applies a nonlinear activation function g and gives output g(z) = g(W⊤x + b). Different types of units have different activation functions. Initialize all weights to small random values, and biases to zero or to small positive values.
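As an illustration (a minimal NumPy sketch, not part of the original slides; the input, weights, and sizes below are made up), a single unit with a ReLU activation can be computed as:

import numpy as np

def unit_output(x, W, b, g):
    """Compute a unit's output: activation applied to the net input W^T x + b."""
    z = W @ x + b          # net input (affine transformation)
    return g(z)

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5])
W = 0.01 * rng.standard_normal(3)   # small random initial weights
b = 0.1                             # small positive bias keeps a ReLU unit initially active
relu = lambda z: np.maximum(0.0, z)
print(unit_output(x, W, b, relu))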

SLIDE 7

Feedforward Neural Network as Function Approximator

Rectified Linear Units (ReLU)

General form of activation function: g(z) = g(W⊤x + b). ReLU: g(z) = max{0, z}. Constant gradient when z > 0, which leads to faster learning. Probably the most commonly used activation function. Zero gradient when z < 0: the neuron is dead if z < 0 for all training examples, and dead neurons cannot be revived.

To mitigate the problem somewhat, initialize b to a small positive value, e.g. 0.1, so that the unit is initially active (z > 0).

SLIDE 8

Feedforward Neural Network as Function Approximator

Variations of ReLU

g(z, α) = max{0, z} + α min{0, z}:
Absolute value rectification: α = −1.
Leaky ReLU: α is a small value like 0.01.
Parametric ReLU: α is learned.
Non-zero gradient even when z < 0, which mitigates the dead neuron problem.

SLIDE 9

Feedforward Neural Network as Function Approximator

Sigmoid

Sigmoid activation function: g(z) = σ(z) = 1 / (1 + exp(−z)). A sigmoidal unit has small gradients across most of its domain (vanishing gradient), and hence leads to slow learning, especially in deep models. It is said to saturate easily, where saturation refers to the state in which a neuron predominantly outputs values close to the asymptotic ends of the bounded activation function. Saturation implies slow learning. Not recommended as an internal (hidden) unit.

SLIDE 10

Feedforward Neural Network as Function Approximator

Hyperbolic Tangent

Hyperbolic tangent activation function: g(z) = tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)). Tanh has larger gradients than sigmoid and is smoother than ReLU. However, it can still saturate. A popular choice in practice.

SLIDE 11

Feedforward Neural Network as Function Approximator

Swish

Swish activation, proposed in 2017: g(z) = zσ(βz), where β is a parameter that is either fixed or learned. Non-zero gradients when z < 0; smooth and non-monotonic. Unbounded above, which avoids saturation. Outperforms ReLU in deep networks.

SLIDE 12

Feedforward Neural Network as Function Approximator

Softplus and Mish

Softplus activation function: g(z) = ζ(z) = log(1 + exp(z)). Softplus is a smooth version of ReLU, but empirically not as good as ReLU. Mish activation function: g(z) = z tanh(log(1 + exp(z))) = z tanh(ζ(z)). Recently proposed; similar to Swish, and better.
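For concreteness, a minimal NumPy sketch (not part of the original slides) of the activation functions discussed above:

import numpy as np

def relu(z):               return np.maximum(0.0, z)
def leaky_relu(z, a=0.01): return np.maximum(0.0, z) + a * np.minimum(0.0, z)
def sigmoid(z):            return 1.0 / (1.0 + np.exp(-z))
def tanh(z):               return np.tanh(z)
def swish(z, beta=1.0):    return z * sigmoid(beta * z)   # beta fixed here
def softplus(z):           return np.log1p(np.exp(z))
def mish(z):               return z * np.tanh(softplus(z))

z = np.linspace(-3, 3, 7)
for g in (relu, leaky_relu, sigmoid, tanh, swish, softplus, mish):
    print(g.__name__, np.round(g(z), 3))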

SLIDE 13

Feedforward Neural Network as Function Approximator

Computation by Feedforward Neural Network

h(i) — column vector of units on layer i; b(i) — biases for units on layer i; g(i) — activation functions for units on layer i; W(i) — matrix of weights for units on layer i, with the weights for unit j in the j-th column.

h(1) = g(1)(W(1)⊤x + b(1))
h(2) = g(2)(W(2)⊤h(1) + b(2))
y = g(3)(W(3)⊤h(2) + b(3))
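A minimal NumPy sketch of this layer-by-layer computation (illustrative only; the layer sizes and activations are made up, not from the slides):

import numpy as np

def forward(x, params, activations):
    """Forward pass: h = g(W^T h_prev + b) for each layer, as in the formulas above."""
    h = x
    for (W, b), g in zip(params, activations):
        h = g(W.T @ h + b)
    return h

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                       # input, two hidden layers, output
params = [(0.01 * rng.standard_normal((m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
relu = lambda z: np.maximum(0.0, z)
activations = [relu, relu, lambda z: z]    # identity on the output layer
print(forward(rng.standard_normal(4), params, activations))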

SLIDE 14

Feedforward Neural Network as Function Approximator

Universal Approximation

Theorem: A single layer of sigmoid hidden units suffices to approximate any well-behaved function (e.g., bounded continuous functions) to arbitrary precision.

Deep learning is useful when you need a complex function and have abundant data.

SLIDE 15

Feedforward Neural Network as Probabilistic Model

Outline

1 Feedforward Neural Network as Function Approximator
2 Feedforward Neural Network as Probabilistic Model
3 Backpropagation
4 Dropout
5 Optimization Algorithms

SLIDE 16

Feedforward Neural Network as Probabilistic Model

An FNN can be used to define a probabilistic model

The first through second-to-last layers define a feature transformation: h = f(x). The last layer defines a probabilistic model on the features: P(y|h). Note: here y is a vector of output variables, while so far we have been talking about only one output variable y. The whole network defines a probabilistic model: P(y|x, θ), where θ consists of the weights in f and the parameters in P(y|h).

SLIDE 17

Feedforward Neural Network as Probabilistic Model

Logits

To define a probabilistic model on the features h, first define a logit vector z = W⊤h + b, where W is a weight matrix and b is a bias vector; they are the parameters of the last layer. Write z = (z1, z2, . . .)⊤.

Then we can define various probabilistic models using z; the corresponding output models are viewed as the units of the last layer.

SLIDE 18

Feedforward Neural Network as Probabilistic Model

Linear-Gaussian Output Unit

When y is a real-valued vector, we can assume that y follows a Gaussian distribution with mean z and identity covariance matrix I: p(y|x) = N(y|z, I). In this case, the per-sample loss is

L(x, y, θ) = − log p(y|x, θ) ∝ (1/2)||y − z||₂²

(See p. 12 of L02.) Partial derivative of the per-sample loss with respect to logit zk: ∂L/∂zk = zk − yk. It is the error.

SLIDE 19

Feedforward Neural Network as Probabilistic Model

Sigmoid Output Unit

When there is only one binary output variable y ∈ {0, 1}, we can define a distribution p(y|x) using a sigmoid unit: P(y|x) = Ber(y|σ(z)), where z is a scalar. In this case, the per-sample loss is L(x, y, θ) = −[y log σ(z) + (1 − y) log(1 − σ(z))]. (See p. 10 of L03.) Partial derivative of the per-sample loss with respect to the logit z: ∂L/∂z = σ(z) − y. (See p. 21 of L03.) It is the error.

SLIDE 20

Feedforward Neural Network as Probabilistic Model

Sigmoid Output Unit

We said earlier that sigmoid units are not recommended as hidden units because they saturate across most of their domain. They are fine as output units because the negative log-likelihood in the cost function helps to avoid the problem. In fact, σ(z) − y ≈ 0 means: z ≫ 0 and y = 1, or z ≪ 0 and y = 0. In words, saturation occurs only when the model already has the right answer: when y = 1 and z is very positive, or y = 0 and z is very negative.

SLIDE 21

Feedforward Neural Network as Probabilistic Model

Softmax Output Unit

When y is a variable with C possible values {1, 2, . . . , C}, we can define a distribution p(y|x) using a softmax unit. Here z = W⊤h + b is a vector of C components: z = (z1, z2, . . . , zC)⊤. The following formula gives a distribution over the domain of y:

P(y = k|x) = exp(zk) / Σ_{j=1}^C exp(zj)

Partial derivative of the per-sample cross-entropy loss with respect to logit zk: ∂L/∂zk = P(y = k|x) − 1(y = k). (See p. 40 of L03.) It is the error.
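A minimal NumPy sketch (illustrative, not from the slides) of the softmax output unit and the gradient ∂L/∂zk = P(y = k|x) − 1(y = k):

import numpy as np

def softmax(z):
    z = z - z.max()                    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss_and_grad(z, k):
    """Per-sample loss -log P(y=k|x) and its gradient w.r.t. the logits z."""
    p = softmax(z)
    loss = -np.log(p[k])
    grad = p.copy()
    grad[k] -= 1.0                     # gradient = probabilities minus one-hot target
    return loss, grad

z = np.array([2.0, -1.0, 0.5])
print(cross_entropy_loss_and_grad(z, k=0))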

SLIDE 22

Backpropagation

Outline

1 Feedforward Neural Network as Function Approximator
2 Feedforward Neural Network as Probabilistic Model
3 Backpropagation
4 Dropout
5 Optimization Algorithms

SLIDE 23

Backpropagation

Training Feedforward Neural Networks

Given data {x_i, y_i}_{i=1}^N, we want to learn the parameters θ by minimizing the cross-entropy loss:

J(θ) = (1/N) Σ_{i=1}^N L(x_i, y_i, θ) = (1/N) Σ_{i=1}^N (− log P(y_i|x_i, θ))

Algorithm: Stochastic gradient descent (SGD)
Initialize θ.
Repeat for a predetermined number of epochs:
For each minibatch B, θ ← θ − α (1/|B|) Σ_{i∈B} ∇L(x_i, y_i, θ)

Question: How to compute the gradient ∇L(x_i, y_i, θ) for each training example?
Answer: Backpropagation.
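A minimal sketch of this minibatch SGD loop in NumPy (the model, loss, and per-example gradient function below are placeholders, not the lecture's actual code):

import numpy as np

def sgd(theta, X, Y, grad_fn, lr=0.1, batch_size=32, epochs=10, seed=0):
    """Minibatch SGD: theta <- theta - lr * average gradient over the minibatch."""
    rng = np.random.default_rng(seed)
    N = len(X)
    for _ in range(epochs):
        order = rng.permutation(N)                 # shuffle the examples each epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            grads = [grad_fn(theta, X[i], Y[i]) for i in idx]
            theta = theta - lr * np.mean(grads, axis=0)
    return theta

# Toy example: linear model y ~ w.x with squared loss; per-example gradient = (w.x - y) x
grad_fn = lambda w, x, y: (w @ x - y) * x
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3)); true_w = np.array([1.0, -2.0, 0.5])
Y = X @ true_w
print(sgd(np.zeros(3), X, Y, grad_fn))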

SLIDE 24

Backpropagation

Backpropagation: Output Layer

Let yk, uj and ui each be the output of a unit at one of the last three layers, and zk, zj, zi be the net inputs before applying the activation function:

zk = Σ_j uj Wkj,   uj = g(zj),   zj = Σ_i ui Wji   (biases ignored for simplicity)

Let L = L(x, y, θ) for a particular training example (x, y). The partial derivative of L w.r.t. a weight in the last layer:

∂L/∂Wkj = (∂L/∂zk)(∂zk/∂Wkj) = uj δk,   where δk = ∂L/∂zk

SLIDE 25

Backpropagation

Backpropagation: Hidden Layer

zk = Σ_j uj Wkj,   uj = g(zj),   zj = Σ_i ui Wji

The partial derivative of L w.r.t. a weight in the second-to-last layer (similar for weights in other hidden layers):

∂L/∂Wji = Σ_k (∂L/∂zk)(∂zk/∂Wji)
        = Σ_k δk (∂zk/∂uj)(∂uj/∂zj)(∂zj/∂Wji)
        = Σ_k δk Wkj (∂uj/∂zj) ui
        = ui δj,   where δj = (∂uj/∂zj) Σ_k Wkj δk

SLIDE 26

Backpropagation

Backpropagation of error

SLIDE 27

Backpropagation

Backpropagation

Summary: For a weight Wkj in the output layer:

∂L/∂Wkj = uj δk,   where δk = ∂L/∂zk

uj is the output of unit j for input x; δk is the “error value” for unit k, obtained by comparing the model output z = f(x) and the observed output y in L.

For a weight Wji in a hidden layer:

∂L/∂Wji = ui δj,   where δj = (∂uj/∂zj) Σ_k Wkj δk

ui is the output of unit i for input x; δj is the “error value” for unit j, obtained by backpropagating the errors (δk) from the next layer.

SLIDE 28

Backpropagation

The Backpropagation Algorithm

Objective: compute ∂L/∂Wji for all weights (and biases) for each training example:

1 Propagate forward to compute the net input and output of each unit (i.e., ui, uj, . . .).
2 Propagate the “error” backward:
For each output unit k: δk ← ∂L/∂zk
For each hidden unit j: δj ← (∂uj/∂zj) Σ_k Wkj δk
3 Get the gradient for each weight and bias: ∂L/∂Wji ← ui δj,   ∂L/∂bj ← δj
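A minimal NumPy sketch of these three steps for a one-hidden-layer network with ReLU hidden units and a sigmoid output (sizes and names are illustrative, not the lecture's code):

import numpy as np

def backprop(x, y, W1, b1, W2, b2):
    """Forward pass, backward pass, and gradients for one training example."""
    # 1. Forward: net inputs and outputs of each unit
    z1 = W1.T @ x + b1
    u1 = np.maximum(0.0, z1)              # ReLU hidden layer
    z2 = W2.T @ u1 + b2
    p = 1.0 / (1.0 + np.exp(-z2))         # sigmoid output P(y=1|x)
    # 2. Backward: error of the output unit, then of the hidden units
    delta2 = p - y                        # dL/dz2 for the cross-entropy loss
    delta1 = (z1 > 0) * (W2 @ delta2)     # dL/dz1 = g'(z1) * sum_k W_kj delta_k
    # 3. Gradients for weights and biases
    return np.outer(x, delta1), delta1, np.outer(u1, delta2), delta2

rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.standard_normal((3, 4)), np.zeros(4)
W2, b2 = 0.1 * rng.standard_normal((4, 1)), np.zeros(1)
print([g.shape for g in backprop(rng.standard_normal(3), 1.0, W1, b1, W2, b2)])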

SLIDE 29

Backpropagation

Backprop Example

SLIDE 30

Backpropagation

The Backpropagation Algorithm

The backpropagation algorithm and SGD are implemented in deep learning packages such as TensorFlow. Weights are packed into tensors ([aijk]) for efficiency: a 1-D tensor is a vector, a 2-D tensor is a matrix, and a 3-D tensor is several matrices stacked on top of each other. Although you do not need to know everything about backpropagation and SGD, some basic understanding will be very helpful when you work with deep learning.

SLIDE 31

Backpropagation

Demonstrations http://playground.tensorflow.org/

ReLU units are easier to learn than Sigmoid and Tanh units

SLIDE 32

Backpropagation

Demonstrations http://playground.tensorflow.org/

Deep structure helps. Need to adjust the learning rate during training.

SLIDE 33

Dropout

Outline

1 Feedforward Neural Network as Function Approximator
2 Feedforward Neural Network as Probabilistic Model
3 Backpropagation
4 Dropout
5 Optimization Algorithms

SLIDE 34

Dropout

Introduction

DL models are complex, so overfitting is an important issue. One way to address it is to regularize (weight decay):

J̃(θ) = J(θ) + λ||θ||₂²

Almost always do this. Dropout is another technique for avoiding overfitting, proposed specifically for deep learning.
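A minimal sketch of weight decay added to a loss and its gradient (the names and the toy objective are illustrative assumptions, not from the slides):

import numpy as np

def regularized_loss_and_grad(theta, loss, grad, lam=1e-4):
    """L2-regularized objective: J~(theta) = J(theta) + lam * ||theta||^2."""
    return loss(theta) + lam * np.sum(theta ** 2), grad(theta) + 2 * lam * theta

# Toy example: J(theta) = ||theta - 1||^2
loss = lambda th: np.sum((th - 1.0) ** 2)
grad = lambda th: 2 * (th - 1.0)
print(regularized_loss_and_grad(np.zeros(3), loss, grad))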

SLIDE 35

Dropout

Bagging (Bootstrap Aggregating)

A technique for reducing generalization error by combining several models: train several different models separately on different randomly selected subsets of the data, called bootstrap samples. Classification: have all of the models vote on the output for test examples. Bagging reduces variance: variance is error due to randomness, and since different bootstrap samples contain different randomness, the models are unlikely to make the same errors on the test set.

SLIDE 36

Dropout

One Way to Use Bagging for Feedforward Networks

Start with a large network. Obtain subnetworks and train them separately on different randomly selected subsets of data. To classify future examples, let the subnetworks vote. Problem: Too expensive. Dropout is an inexpensive approximation of the idea.

SLIDE 37

Dropout

Subnetwork Selection via Masking

Associate a binary mask variable with each input and hidden unit. Sample their values randomly and independently (e.g., with probability 0.5 for hidden units and 0.8 for input units). Multiply the output of each unit by its mask value before passing it to the next layer. Units with 0 mask values are effectively removed from the network.
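A minimal NumPy sketch of the masking step (the keep probability below mirrors the example value above; this is an illustration, not the lecture's code):

import numpy as np

def dropout_mask(h, keep_prob, rng):
    """Sample a binary mask and multiply unit outputs by it before the next layer."""
    mask = rng.random(h.shape) < keep_prob    # 1 with probability keep_prob, else 0
    return h * mask, mask

rng = np.random.default_rng(0)
h = np.ones(8)                                # outputs of a hidden layer
dropped, mask = dropout_mask(h, keep_prob=0.5, rng=rng)
print(mask.astype(int), dropped)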

SLIDE 38

Dropout

Dropout

To be used with minibatch-based algorithms such as SGD. For each minibatch: randomly sample values for the mask variables, then carry out one step of gradient descent.

(Only the parameters for units with mask value 1 are updated; the others have gradient zero and hence are not updated.)

There is only one network, not separate subnetworks: at each step, training is conducted in a part of the network (a subnetwork). Dropout is a widely used regularization method for deep learning. It reduces overfitting by preventing complex co-adaptations of parameters on the training data. See the example in the CNN code.

SLIDE 39

Optimization Algorithms

Outline

1 Feedforward Neural Network as Function Approximator
2 Feedforward Neural Network as Probabilistic Model
3 Backpropagation
4 Dropout
5 Optimization Algorithms

SLIDE 40

Optimization Algorithms

Our task is: Find θ to minimize the loss function J(θ). There are multiple algorithms that we can use. Stochastic gradient descent is one of them. There are others ....

SLIDE 41

Optimization Algorithms

Performances of Different Optimizers Vary a Lot

SLIDE 42

Optimization Algorithms

Momentum

Momentum is a technique to accelerate SGD. View θ as a “particle” that travels in the space of parameter values. In SGD, its movement is determined by the gradient and the learning rate. In SGD with momentum, the movement is additionally affected by its velocity v, an exponentially decaying average of past negative gradients. The momentum hyperparameter α is usually set to 0.5, 0.9 or 0.99.
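A minimal sketch of the SGD-with-momentum update (following the common convention v ← αv − ε∇J; the toy objective and names are illustrative):

import numpy as np

def momentum_step(theta, v, grad, lr=0.01, alpha=0.9):
    """One SGD-with-momentum update: velocity accumulates past negative gradients."""
    v = alpha * v - lr * grad        # exponentially decaying average of -gradients
    theta = theta + v
    return theta, v

theta, v = np.array([1.0, -2.0]), np.zeros(2)
grad = lambda th: 2 * th             # gradient of ||theta||^2
for _ in range(3):
    theta, v = momentum_step(theta, v, grad(theta))
print(theta)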

SLIDE 43

Optimization Algorithms

Momentum

The incorporation of momentum into stochastic gradient descent reduces the variation in overall gradient directions and speeds up learning.

Online demo: https://medium.com/@ramrajchandradevan/the-evolution-of-gradient-descend-optimization-algorithm-4106a6702d39

SLIDE 44

Optimization Algorithms

Algorithms with Adaptive Learning Rates

The learning rate should be reduced gradually. Several common algorithms: AdaGrad, RMSProp, Adam. Key idea: parameters that have changed a lot should be allowed to change less in the future.

SLIDE 45

Optimization Algorithms

AdaGrad (Adaptive Gradient Algorithm)

AdaGrad scales the learning rate of each model parameter inversely proportionally to the square root of the sum of its historical squared gradients. The net effect is greater progress in the more gently sloped directions of parameter space.
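A minimal sketch of the AdaGrad update (illustrative; the toy gradient and hyperparameter values are assumptions):

import numpy as np

def adagrad_step(theta, r, grad, lr=0.01, delta=1e-7):
    """One AdaGrad update: accumulate squared gradients, scale the step by 1/sqrt(r)."""
    r = r + grad ** 2                          # sum of all historical squared gradients
    theta = theta - lr * grad / (np.sqrt(r) + delta)
    return theta, r

theta, r = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(3):
    theta, r = adagrad_step(theta, r, 2 * theta)   # gradient of ||theta||^2
print(theta)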

SLIDE 46

Optimization Algorithms

RMSProp (Root Mean Square Propagation)

RMSProp uses an exponentially decaying average to discard history from the extreme past
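A minimal sketch of the RMSProp update (illustrative; the decay rate and toy gradient are assumptions):

import numpy as np

def rmsprop_step(theta, r, grad, lr=0.001, rho=0.9, delta=1e-6):
    """One RMSProp update: exponentially decaying average of squared gradients."""
    r = rho * r + (1 - rho) * grad ** 2        # old history decays away
    theta = theta - lr * grad / (np.sqrt(r) + delta)
    return theta, r

theta, r = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(3):
    theta, r = rmsprop_step(theta, r, 2 * theta)   # gradient of ||theta||^2
print(theta)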

SLIDE 47

Optimization Algorithms

AdaGrad vs RMSProp

SLIDE 48

Optimization Algorithms

Adam (Adaptive Moments)

Roughly a combination of RMSProp and momentum, with bias correction. “Insofar, Adam might be the best overall choice.”
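A minimal sketch of the Adam update with bias correction (illustrative; the default hyperparameters follow the common convention):

import numpy as np

def adam_step(theta, m, v, t, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: decaying averages of gradients (m) and squared gradients (v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)               # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)               # bias correction for the second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 4):
    theta, m, v = adam_step(theta, m, v, t, 2 * theta)   # gradient of ||theta||^2
print(theta)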

SLIDE 49

Optimization Algorithms

Adam: Bias Correction

SLIDE 50

Optimization Algorithms

Adam (Adaptive Moments)

SLIDE 51

Optimization Algorithms

Adam (Adaptive Moments)

Results of different optimizers
