

slide-1
SLIDE 1

Data Mining and Machine Learning: Fundamental Concepts and Algorithms

dataminingbook.info
Mohammed J. Zaki¹  Wagner Meira Jr.²

¹Department of Computer Science

Rensselaer Polytechnic Institute, Troy, NY, USA

²Department of Computer Science

Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 25: Neural Networks

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 1 / 60

slide-2
SLIDE 2

Artificial Neural Networks

Artificial neural networks, or simply neural networks, are inspired by biological neuronal networks. A real biological neuron, or nerve cell, comprises dendrites, a cell body, and an axon that leads to synaptic terminals. A neuron transmits information via electrochemical signals. When there is enough concentration of ions at the dendrites of a neuron, it generates an electric pulse along its axon called an action potential, which in turn activates the synaptic terminals, releasing more ions and thus causing the information to flow to the dendrites of other neurons. A human brain has on the order of 100 billion neurons, with each neuron having between 1,000 and 10,000 connections to other neurons. Artificial neural networks comprise abstract neurons that mimic real neurons at a very high level. They can be described via a weighted directed graph G = (V, E), with each node vi ∈ V representing a neuron, and each directed edge (vi, vj) ∈ E representing a synaptic-to-dendritic connection from vi to vj. The weight wij of the edge denotes the synaptic strength.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 2 / 60

slide-3
SLIDE 3

Artificial neuron: aggregation and activation.

[Figure: inputs x0 = 1, x1, x2, ..., xd feed into neuron zk via weights w1k, w2k, ..., wdk and bias bk; the neuron aggregates netk = Σ_{i=1}^d wik · xi + bk and then applies an activation.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 3 / 60

slide-4
SLIDE 4

Artificial Neuron

An artificial neuron acts as a processing unit that first aggregates the incoming signals via a weighted sum, and then applies some function to generate an output. For example, a binary neuron outputs 1 whenever the combined signal exceeds a threshold, and 0 otherwise.

netk = bk + Σ_{i=1}^d wik · xi = bk + wkᵀx   (1)

where wk = (w1k, w2k, ···, wdk)ᵀ ∈ Rᵈ and x = (x1, x2, ···, xd)ᵀ ∈ Rᵈ is an input point. Notice that x0 is a special bias neuron whose value is always fixed at 1, and the weight from x0 to zk is bk, which specifies the bias term for the neuron. Finally, the output value of zk is given as some activation function, f(·), applied to the net input at zk:

zk = f(netk)
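
The aggregation in Eq. (1) followed by the activation zk = f(netk) translates directly into code. Below is a minimal NumPy sketch (not from the book); the sigmoid activation and the numeric values are illustrative assumptions.

```python
import numpy as np

def neuron_output(x, w, b, f):
    """Aggregate the inputs as net = b + w^T x, then apply the activation f."""
    net = b + np.dot(w, x)
    return f(net)

# Illustrative example: a neuron with d = 3 inputs and a sigmoid activation.
sigmoid = lambda net: 1.0 / (1.0 + np.exp(-net))
x = np.array([0.5, -1.0, 2.0])   # input point x
w = np.array([0.2, 0.4, -0.1])   # weights w1k, w2k, w3k
b = 0.3                          # bias term bk
z = neuron_output(x, w, b, sigmoid)
```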

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 4 / 60

slide-5
SLIDE 5

Linear Function

Function: f(netk) = netk

Derivative: ∂f(netj)/∂netj = 1

[Plot: zk = f(netk) versus wᵀx; the output is unbounded, ranging over (−∞, +∞), with netk = 0 at wᵀx = −bk.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 5 / 60

slide-6
SLIDE 6

Step Function

Function: f(netk) = 0 if netk ≤ 0, and 1 if netk > 0

Derivative: ∂f(netj)/∂netj = 0

[Plot: zk versus wᵀx; the output steps from 0 to 1 at wᵀx = −bk.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 6 / 60

slide-7
SLIDE 7

Rectified Linear Function

Function: f(netk) = 0 if netk ≤ 0, and netk if netk > 0

Derivative: ∂f(netj)/∂netj = 0 if netj ≤ 0, and 1 if netj > 0

[Plot: zk versus wᵀx; the output is 0 up to wᵀx = −bk and grows linearly (unbounded) beyond it.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 7 / 60

slide-8
SLIDE 8

Sigmoid Function

Function: f(netk) = 1 / (1 + exp{−netk})

Derivative: ∂f(netj)/∂netj = f(netj) · (1 − f(netj))

[Plot: zk versus wᵀx; the output rises smoothly from 0 to 1, with value 0.5 at wᵀx = −bk.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 8 / 60

slide-9
SLIDE 9

Hyperbolic Tangent Function

Function: f(netk) = (exp{netk} − exp{−netk}) / (exp{netk} + exp{−netk}) = (exp{2·netk} − 1) / (exp{2·netk} + 1)

Derivative: ∂f(netj)/∂netj = 1 − f(netj)²

[Plot: zk versus wᵀx; the output rises smoothly from −1 to 1, with value 0 at wᵀx = −bk.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 9 / 60

slide-10
SLIDE 10

Softmax Function

Function: f(netk | net) = exp{netk} / Σ_{i=1}^p exp{neti}

Derivative: ∂f(netj | net)/∂netj = f(netj) · (1 − f(netj))

[Figure: output neuron zk computes the softmax of its net input netk relative to the net inputs netj of all output neurons.]
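
The activation functions on the preceding slides, together with their derivatives, translate directly into NumPy. The sketch below is illustrative (not the book's code); the derivatives are written in terms of the activation value f = f(net), as on the slides, and the softmax subtracts the maximum net input purely for numerical stability.

```python
import numpy as np

# Activation functions, applied to the net input.
linear  = lambda net: net
step    = lambda net: np.where(net > 0, 1.0, 0.0)
relu    = lambda net: np.maximum(net, 0.0)
sigmoid = lambda net: 1.0 / (1.0 + np.exp(-net))
tanh    = np.tanh

def softmax(net):
    """Softmax over the vector of net inputs at the output layer."""
    e = np.exp(net - net.max())   # shift by the max for numerical stability
    return e / e.sum()

# Derivatives, expressed in terms of the activation value f = f(net).
d_linear  = lambda f: np.ones_like(f)
d_step    = lambda f: np.zeros_like(f)
d_relu    = lambda f: (f > 0).astype(float)
d_sigmoid = lambda f: f * (1.0 - f)
d_tanh    = lambda f: 1.0 - f**2
```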

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 10 / 60

slide-11
SLIDE 11

Linear and Logistic Regression via Neural Networks

[Figure: two network architectures. Left: inputs x0 = 1, x1, ..., xd connected to a single output neuron o via weights w1, ..., wd and bias b (linear or logistic regression). Right: the same inputs connected to p output neurons o1, ..., op via weights w11, ..., wdp and biases b1, ..., bp (multivariate or multiclass models).]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 11 / 60

slide-12
SLIDE 12

ANN for Multiple and Multivariate Regression

Example

Consider the multiple regression of sepal length and petal length on the dependent attribute petal width for the Iris dataset with n = 150 points. The solution is given as

ŷ = −0.014 − 0.082 · x1 + 0.45 · x2

The squared error for this optimal solution is 6.179 on the training data. Using the presented neural network, with linear activation for the output and minimizing the squared error via gradient descent, results in the following learned parameters: b = 0.0096, w1 = −0.087 and w2 = 0.452, yielding the regression model

ŷ = 0.0096 − 0.087 · x1 + 0.452 · x2

with a squared error of 6.18, which is very close to the optimal solution.
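
A minimal sketch of this setup, training a single linear-output neuron by stochastic gradient descent on the squared error. The data here is synthetic (generated from coefficients close to the fit above, plus noise), since the Iris values themselves are not reproduced here; all names and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(150, 2))               # stand-in for (sepal length, petal length)
y = -0.014 - 0.082 * X[:, 0] + 0.45 * X[:, 1]      # synthetic targets from illustrative coefficients
y += rng.normal(scale=0.1, size=150)               # plus noise

w, b, eta = np.zeros(2), 0.0, 0.01
for epoch in range(500):
    for xi, yi in zip(X, y):
        o = b + w @ xi                 # linear activation: o = net
        grad = o - yi                  # dE/dnet for squared error with identity output
        w -= eta * grad * xi
        b -= eta * grad
sse = np.sum((b + X @ w - y) ** 2)     # squared error on the training data
```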

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 12 / 60

slide-13
SLIDE 13

ANN for Multiple and Multivariate Regression

Example

Multivariate Linear Regression: For multivariate regression, we use the neural network architecture presented to learn the weights and bias for the Iris dataset, where we use sepal length and sepal width as the independent attributes, and petal length and petal width as the response or dependent attributes. Therefore, each input point xi is 2-dimensional, and the true response vector yi is also 2-dimensional. That is, d = 2 and p = 2 specify the size of the input and output layers. Minimizing the squared error via gradient descent yields the following parameters:

[ b1   b2  ]   [ −1.83  −1.47 ]
[ w11  w12 ] = [  1.72   0.72 ]
[ w21  w22 ]   [ −1.46  −0.50 ]

ŷ1 = −1.83 + 1.72 · x1 − 1.46 · x2
ŷ2 = −1.47 + 0.72 · x1 − 0.50 · x2

The squared error on the training set is 84.9. Optimal least squares multivariate regression yields a squared error of 84.16 with the following parameters:

ŷ1 = −2.56 + 1.78 · x1 − 1.34 · x2
ŷ2 = −1.59 + 0.73 · x1 − 0.48 · x2

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 13 / 60

slide-14
SLIDE 14

Neural networks for multiclass logistic regression

Iris principal components data. Misclassified points are shown in dark gray. Points in class c1 and c2 are shown displaced with respect to the base class c3 only for illustration.

[Figure: plot over X1 and X2 showing the fitted class probability surfaces π1(x), π2(x), and π3(x), together with the data points from the three classes.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 14 / 60

slide-15
SLIDE 15

Logistic Regression: Binary and Multiclass

Example

We applied the neural network presented, with logistic activation at the output neuron and cross-entropy error function, on the Iris principal components dataset. The output is a binary response indicating Iris-virginica (Y = 1) or one of the other Iris types (Y = 0). As expected, the neural network learns an identical set of weights and bias as shown for the logistic regression model, namely:

net = b + wᵀx = −6.79 − 5.07 · x1 − 3.29 · x2

Next, we applied the neural network presented, using a softmax activation and cross-entropy error function, to the Iris principal components data with three classes: Iris-setosa (Y = 1), Iris-versicolor (Y = 2) and Iris-virginica (Y = 3). Thus, we need K = 3 output neurons, o1, o2, and o3. Further, to obtain the same model as in the multiclass logistic regression example, we fix the incoming weights and bias for output neuron o3 to be zero. The model is given as

net1 = −3.49 + 3.61 · x1 + 2.65 · x2
net2 = −6.95 − 5.18 · x1 − 3.40 · x2
net3 = 0 + 0 · x1 + 0 · x2

which is essentially the same as presented before.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 15 / 60

slide-16
SLIDE 16

Logistic Regression: Binary and Multiclass

Example

If we do not constrain the weights and bias for o3 we obtain the following model:

net1 = −0.89 + 4.54 · x1 + 1.96 · x2
net2 = −3.38 − 5.11 · x1 − 2.88 · x2
net3 = 4.24 + 0.52 · x1 + 0.92 · x2

The classification decision surface for each class is illustrated in the figure. The points in class c1 are shown as squares, c2 as circles, and c3 as triangles. This figure should be contrasted with the decision boundaries shown for multiclass logistic regression, which has the weights and bias set to 0 for the base class c3.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 16 / 60

slide-17
SLIDE 17

Error Functions

Squared Error: Given an input vector x ∈ Rᵈ, the squared error loss function measures the squared deviation between the predicted output vector o ∈ Rᵖ and the true response y ∈ Rᵖ, defined as follows:

Ex = 1/2 ‖y − o‖² = 1/2 Σ_{j=1}^p (yj − oj)²   (2)

where Ex denotes the error on input x. The partial derivative of the squared error function with respect to a particular output neuron oj is

∂Ex/∂oj = 1/2 · 2 · (yj − oj) · (−1) = oj − yj   (3)

Across all the output neurons, we can write this as

∂Ex/∂o = o − y   (4)

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 17 / 60

slide-18
SLIDE 18

Error Functions

Cross-Entropy Error: For classification tasks with K classes {c1, c2, ···, cK}, we usually set the number of output neurons p = K, with one output neuron per class. Furthermore, each of the classes is coded as a one-hot vector, with class ci encoded as the ith standard basis vector ei = (ei1, ei2, ···, eiK)ᵀ ∈ {0,1}ᴷ, with eii = 1 and eij = 0 for all j ≠ i. Thus, given input x ∈ Rᵈ, with the true response y = (y1, y2, ···, yK)ᵀ, where y ∈ {e1, e2, ···, eK}, the cross-entropy loss is defined as

Ex = − Σ_{i=1}^K yi · ln(oi) = − ( y1 · ln(o1) + ··· + yK · ln(oK) )   (5)

Note that only one element of y is 1 and the rest are 0 due to the one-hot encoding. That is, if y = ei, then only yi = 1, and the other elements yj = 0 for j ≠ i. The partial derivative of the cross-entropy error function with respect to a particular output neuron oj is

∂Ex/∂oj = −yj / oj   (6)

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 18 / 60

slide-19
SLIDE 19

Error Functions

The vector of partial derivatives of the error function with respect to the output neurons is therefore given as

∂Ex/∂o = ( ∂Ex/∂o1, ∂Ex/∂o2, ···, ∂Ex/∂oK )ᵀ = ( −y1/o1, −y2/o2, ···, −yK/oK )ᵀ   (7)

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 19 / 60

slide-20
SLIDE 20

Error Functions

Binary Cross-Entropy Error: For classification tasks with binary classes, it is typical to encode the positive class as 1 and the negative class as 0, as opposed to using a one-hot encoding as in the general K-class case. Given an input x ∈ Rᵈ, with true response y ∈ {0,1}, there is only one output neuron o. Therefore, the binary cross-entropy error is defined as

Ex = − ( y · ln(o) + (1 − y) · ln(1 − o) )   (8)

Here y is either 1 or 0. The partial derivative of the binary cross-entropy error function with respect to the output neuron o is

∂Ex/∂o = ∂/∂o [ −y · ln(o) − (1 − y) · ln(1 − o) ]
       = −y/o + (1 − y)/(1 − o)
       = ( −y · (1 − o) + (1 − y) · o ) / ( o · (1 − o) )
       = (o − y) / ( o · (1 − o) )   (9)
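
The three error functions and their output-layer gradients fit in a few lines of NumPy. This is an illustrative sketch (not the book's code); each function returns both the error Ex and the gradient ∂Ex/∂o for a single input.

```python
import numpy as np

def squared_error(y, o):
    """Ex = 1/2 ||y - o||^2 ; gradient dEx/do = o - y."""
    return 0.5 * np.sum((y - o) ** 2), o - y

def cross_entropy(y, o):
    """Ex = -sum_i y_i ln(o_i) for one-hot y; gradient dEx/do_j = -y_j / o_j."""
    return -np.sum(y * np.log(o)), -y / o

def binary_cross_entropy(y, o):
    """Ex = -(y ln o + (1-y) ln(1-o)); gradient dEx/do = (o - y) / (o (1 - o))."""
    return -(y * np.log(o) + (1 - y) * np.log(1 - o)), (o - y) / (o * (1 - o))
```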

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 20 / 60

slide-21
SLIDE 21

Multilayer Perceptron: One Hidden Layer

A multilayer perceptron (MLP) is a neural network that has distinct layers of neurons. The inputs to the neural network comprise the input layer, and the final outputs from the MLP comprise the output layer. Any intermediate layer is called a hidden layer, and an MLP can have one or many hidden layers. Networks with many hidden layers are called deep neural networks. An MLP is also a feed-forward network. That is, information flows in the forward direction, and from a layer only to the subsequent layer.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 21 / 60

slide-22
SLIDE 22

Multilayer perceptron with one hidden layer

[Figure: MLP with an input layer (bias neuron x0 = 1 and inputs x1, ..., xi, ..., xd), a hidden layer (bias neuron z0 = 1 and hidden neurons z1, ..., zk, ..., zm), and an output layer (o1, ..., oj, ..., op). Weights wik (with bias bk) connect input xi to hidden neuron zk, and weights wkj connect hidden neuron zk to output oj.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 22 / 60

slide-23
SLIDE 23

Feed-forward Phase

Given the input neuron values, we compute the output value for each hidden neuron zk as follows:

zk = f(netk) = f( bk + Σ_{i=1}^d wik · xi )

where f is some activation function, and wik denotes the weight between input neuron xi and hidden neuron zk. Next, given the hidden neuron values, we compute the value for each output neuron oj as follows:

oj = f(netj) = f( bj + Σ_{i=1}^m wij · zi )

where wij denotes the weight between hidden neuron zi and output neuron oj.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 23 / 60

slide-24
SLIDE 24

Feed-Forward Phase

The output vector can then be computed as follows:

neto = bo + Woᵀ z   (10)

o = f(neto) = f( bo + Woᵀ z )   (11)

To summarize, for a given input x ∈ D with desired response y, an MLP computes the output vector via the feed-forward process, as follows:

o = f( bo + Woᵀ z ) = f( bo + Woᵀ · f( bh + Whᵀ x ) )   (12)

where o = (o1, o2, ···, op)ᵀ is the vector of predicted outputs from the single hidden layer MLP.
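
Eq. (12) is two matrix-vector products with elementwise activations. A small NumPy sketch follows; the layer sizes, random initialization, and sigmoid activations are illustrative assumptions.

```python
import numpy as np

def feed_forward(x, Wh, bh, Wo, bo, f_hidden, f_out):
    """Eq. (12): o = f(bo + Wo^T z), with z = f(bh + Wh^T x)."""
    z = f_hidden(bh + Wh.T @ x)   # hidden layer values
    o = f_out(bo + Wo.T @ z)      # output layer values
    return z, o

# Illustrative sizes: d inputs, m hidden neurons, p outputs.
d, m, p = 4, 3, 2
rng = np.random.default_rng(1)
Wh, bh = rng.normal(scale=0.1, size=(d, m)), np.zeros(m)
Wo, bo = rng.normal(scale=0.1, size=(m, p)), np.zeros(p)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
z, o = feed_forward(rng.normal(size=d), Wh, bh, Wo, bo, sigmoid, sigmoid)
```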

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 24 / 60

slide-25
SLIDE 25

Backpropagation Phase

Backpropagation is the algorithm used to learn the weights between successive layers in an MLP. The name comes from the manner in which the error gradient is propagated backwards from the output to input layers via the hidden layers. For a given input pair (x, y) in the training data, the MLP first computes the output vector o via the feed-forward step. Next, it computes the error in the predicted output vis-a-vis the true response y using the squared error function

Ex = 1/2 ‖y − o‖² = 1/2 Σ_{j=1}^p (yj − oj)²   (13)

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 25 / 60

slide-26
SLIDE 26

Backpropagation Phase

The basic idea behind backpropagation is to examine the extent to which an output neuron, say oj, deviates from the corresponding target response yj, and to modify the weights wij between each hidden neuron zi and oj as some function of the error: a large error should cause a correspondingly large change in the weight, and a small error should result in smaller changes. The weight update is done via a gradient descent approach to minimize the error. Let ∇wij be the gradient of the error function with respect to wij, or simply the weight gradient at wij. Given the previous weight estimate wij, a new weight is computed by taking a small step η in a direction that is opposite to the weight gradient at wij:

wij = wij − η · ∇wij   (14)

In a similar manner, the bias term bj is also updated via gradient descent:

bj = bj − η · ∇bj   (15)

where ∇bj is the gradient of the error function with respect to bj, which we call the bias gradient at bj.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 26 / 60

slide-27
SLIDE 27

Updating Parameters Between Hidden and Output Layer

Consider the weight wij between hidden neuron zi and output neuron oj, and the bias term bj between z0 and oj. Using the chain rule of differentiation, we compute the weight gradient at wij and the bias gradient at bj, as follows:

∇wij = ∂Ex/∂wij = (∂Ex/∂netj) · (∂netj/∂wij) = δj · zi
∇bj = ∂Ex/∂bj = (∂Ex/∂netj) · (∂netj/∂bj) = δj   (16)

where we use the symbol δj to denote the partial derivative of the error with respect to the net signal at oj, which we also call the net gradient at oj. Next, we need to compute δj, the net gradient at oj. This can also be computed via the chain rule:

δj = ∂Ex/∂netj = (∂Ex/∂f(netj)) · (∂f(netj)/∂netj)   (17)

Note that f(netj) = oj.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 27 / 60

slide-28
SLIDE 28

Updating Parameters Between Hidden and Output Layer

Using the squared error function for the former, we have

∂Ex/∂f(netj) = ∂Ex/∂oj = ∂/∂oj [ 1/2 Σ_{k=1}^p (yk − ok)² ] = (oj − yj)

where we used the observation that all ok for k ≠ j are constants with respect to oj. Since we assume a sigmoid activation function, for the latter, we have

∂f(netj)/∂netj = oj · (1 − oj)

Putting it all together, we get

δj = (oj − yj) · oj · (1 − oj)

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 28 / 60

slide-29
SLIDE 29

Updating Parameters Between Input and Hidden Layer

Consider the weight wij between input neuron xi and hidden neuron zj, and the bias term between x0 and zj.

∇wij = ∂Ex/∂wij = (∂Ex/∂netj) · (∂netj/∂wij) = δj · xi
∇bj = ∂Ex/∂bj = (∂Ex/∂netj) · (∂netj/∂bj) = δj   (18)

To compute the net gradient δj at the hidden neuron zj we have to consider the error gradients that flow back from all the output neurons to zj. Applying the chain rule, we get:

δj = ∂Ex/∂netj = Σ_{k=1}^p (∂Ex/∂netk) · (∂netk/∂zj) · (∂zj/∂netj)
   = (∂zj/∂netj) · Σ_{k=1}^p (∂Ex/∂netk) · (∂netk/∂zj)
   = zj · (1 − zj) · Σ_{k=1}^p δk · wjk

where ∂zj/∂netj = zj · (1 − zj), since we assume a sigmoid activation function for the hidden neurons.
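
Putting the two delta computations together, here is a minimal NumPy sketch (illustrative, assuming sigmoid activations at both layers and the squared error, as in the derivation above):

```python
import numpy as np

def backprop_deltas(y, z, o, Wo):
    """Net gradients for a one-hidden-layer MLP with sigmoid units and squared error.
    delta_o[j] = (o_j - y_j) * o_j * (1 - o_j)
    delta_h[j] = z_j * (1 - z_j) * sum_k delta_o[k] * w_jk
    """
    delta_o = (o - y) * o * (1.0 - o)          # net gradients at the output neurons
    delta_h = z * (1.0 - z) * (Wo @ delta_o)   # net gradients at the hidden neurons
    return delta_o, delta_h
```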

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 29 / 60

slide-30
SLIDE 30

Backpropagation of gradients from output to hidden layer

[Figure: hidden neuron zj (with bias bj and incoming weight wij from input xi) receives the backpropagated net gradients δ1, ..., δk, ..., δp from the output neurons o1, ..., ok, ..., op via the weights wj1, ..., wjk, ..., wjp, and aggregates them as Σ_{k=1}^p δk · wjk.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 30 / 60

slide-31
SLIDE 31

MLP Training: Stochastic Gradient Descent

MLP-Training (D, m, η, maxiter):
    // Initialize bias vectors
    bh ← random m-dimensional vector with small values
    bo ← random p-dimensional vector with small values
    // Initialize weight matrices
    Wh ← random d × m matrix with small values
    Wo ← random m × p matrix with small values
    t ← 0 // iteration counter

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 31 / 60

slide-32
SLIDE 32

MLP Training: Stochastic Gradient Descent

repeat
    foreach (xi, yi) ∈ D in random order do
        // Feed-forward phase
        zi ← f(bh + Whᵀ xi)
        oi ← f(bo + Woᵀ zi)
        // Backpropagation phase: net gradients
        δo ← oi ⊙ (1 − oi) ⊙ (oi − yi)
        δh ← zi ⊙ (1 − zi) ⊙ (Wo · δo)
        // Gradient descent for bias vectors
        ∇bo ← δo; bo ← bo − η · ∇bo
        ∇bh ← δh; bh ← bh − η · ∇bh
        // Gradient descent for weight matrices
        ∇Wo ← zi · δoᵀ; Wo ← Wo − η · ∇Wo
        ∇Wh ← xi · δhᵀ; Wh ← Wh − η · ∇Wh
        t ← t + 1
until t ≥ maxiter
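
A NumPy rendering of this pseudocode, assuming sigmoid activations at both layers and the squared-error net gradients derived earlier; the function and variable names are illustrative, not the book's code.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def mlp_train(X, Y, m, eta, maxiter, seed=0):
    """Stochastic gradient descent for a one-hidden-layer MLP with sigmoid units."""
    n, d = X.shape
    p = Y.shape[1]
    rng = np.random.default_rng(seed)
    bh, bo = rng.normal(scale=0.01, size=m), rng.normal(scale=0.01, size=p)
    Wh, Wo = rng.normal(scale=0.01, size=(d, m)), rng.normal(scale=0.01, size=(m, p))
    t = 0
    while t < maxiter:
        for i in rng.permutation(n):                 # points in random order
            x, y = X[i], Y[i]
            z = sigmoid(bh + Wh.T @ x)               # feed-forward phase
            o = sigmoid(bo + Wo.T @ z)
            delta_o = o * (1 - o) * (o - y)          # net gradients
            delta_h = z * (1 - z) * (Wo @ delta_o)
            bo -= eta * delta_o                      # gradient descent: biases
            bh -= eta * delta_h
            Wo -= eta * np.outer(z, delta_o)         # gradient descent: weights
            Wh -= eta * np.outer(x, delta_h)
            t += 1
            if t >= maxiter:
                break
    return Wh, bh, Wo, bo
```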

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 32 / 60

slide-33
SLIDE 33

MLP with one hidden layer

Example

We now illustrate an MLP with a hidden layer using a non-linear activation function to learn the sine curve. The figure shows the training data (the gray points on the curve), which comprises n = 25 points xi sampled randomly in the range [−10, 10], with yi = sin(xi). The testing data comprises 1000 points sampled uniformly from the same range. The figure also shows the desired output curve (thin line). We used an MLP with one input neuron (d = 1), ten hidden neurons (m = 10) and one output neuron (p = 1). The hidden neurons use tanh activations, whereas the output unit uses an identity activation. The step size is η = 0.005. The input to hidden weight matrix Wh ∈ R^{1×10} and the corresponding bias vector bh ∈ R^{10×1} are given as:

Wh = (−0.68, 0.77, −0.42, −0.72, −0.93, −0.42, −0.66, −0.70, −0.62, −0.50)
bh = (−4.36, 2.43, −0.52, 2.35, −1.64, 3.98, 0.31, 4.45, 1.03, −4.77)ᵀ
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 33 / 60

slide-34
SLIDE 34

MLP with one hidden layer

Example

The hidden to output weight matrix Wo ∈ R^{10×1} and the bias term bo ∈ R are given as:

Wo = (−1.82, −1.69, −0.82, 1.37, 0.14, 2.37, −1.64, −1.92, 0.78, 2.17)ᵀ
bo = −0.16

The figures show the output of the MLP on the test set for various numbers of iterations. The final SSE is 1.45 over the 1000 test points. We can observe that, even with a very small training set of 25 points sampled randomly from the sine curve, the MLP is able to learn the desired function. However, it is also important to recognize that the MLP model has not really learned the sine function; rather, it has learned to approximate it only in the specified range [−10, 10]. We can also see that when we try to predict values outside this range, the MLP does not yield a good fit.
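
A sketch of this experiment under the stated setup (one input, ten tanh hidden units, one identity output, η = 0.005, stochastic updates on the squared error). The random seed, initialization scale, and exact results are illustrative assumptions and will not match the slide's learned weights exactly.

```python
import numpy as np

rng = np.random.default_rng(7)
X_train = rng.uniform(-10, 10, size=25)          # n = 25 random training points
y_train = np.sin(X_train)
X_test = np.linspace(-10, 10, 1000)              # 1000 uniformly spaced test points

m, eta = 10, 0.005                               # 1-10-1 architecture, step size from the slides
Wh, bh = rng.normal(scale=0.5, size=(1, m)), np.zeros(m)
Wo, bo = rng.normal(scale=0.5, size=(m, 1)), np.zeros(1)

for t in range(30000):
    i = rng.integers(25)
    x, y = X_train[i:i+1], y_train[i]
    z = np.tanh(bh + Wh.T @ x)                   # tanh hidden layer
    o = bo + Wo.T @ z                            # identity (linear) output
    delta_o = o - y                              # net gradient for squared error, linear output
    delta_h = (1 - z**2) * (Wo @ delta_o)        # tanh derivative times backpropagated gradient
    bo -= eta * delta_o;  Wo -= eta * np.outer(z, delta_o)
    bh -= eta * delta_h;  Wh -= eta * np.outer(x, delta_h)

pred = np.array([(bo + Wo.T @ np.tanh(bh + Wh.T @ np.array([x])))[0] for x in X_test])
sse = np.sum((pred - np.sin(X_test)) ** 2)       # test SSE over the 1000 points
```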

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 34 / 60

slide-35
SLIDE 35

MLP for sine curve

t = 1

[Figure: MLP output (Y) over the test range (X), together with the n = 25 training points and the true sine curve.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 35 / 60

slide-36
SLIDE 36

MLP for sine curve

t = 1000

[Figure: MLP output (Y) over the test range (X), together with the n = 25 training points and the true sine curve.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 36 / 60

slide-37
SLIDE 37

MLP for sine curve

t = 5000

[Figure: MLP output (Y) over the test range (X), together with the n = 25 training points and the true sine curve.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 37 / 60

slide-38
SLIDE 38

MLP for sine curve

t = 10000

[Figure: MLP output (Y) over the test range (X), together with the n = 25 training points and the true sine curve.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 38 / 60

slide-39
SLIDE 39

MLP for sine curve

t = 15000

[Figure: MLP output (Y) over the test range (X), together with the n = 25 training points and the true sine curve.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 39 / 60

slide-40
SLIDE 40

MLP for sine curve

t = 30000

[Figure: MLP output (Y) over the test range (X), together with the n = 25 training points and the true sine curve.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 40 / 60

slide-41
SLIDE 41

MLP for sine curve

Test range [−20,20]

[Figure: MLP output (Y) over the extended test range [−20, 20] (X), together with the training points and the true sine curve.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 41 / 60

slide-42
SLIDE 42

MLP for handwritten digit classification

Example

In this example, we apply an MLP with one hidden layer for the task of predicting the correct label for a hand-written digit from the MNIST database, which contains 60,000 training images that span the 10 digit labels, from 0 to 9. Each (grayscale) image is a 28 × 28 matrix of pixels, with values between 0 and 255. Each pixel is converted to a value in the interval [0,1] by dividing by 255. Figure shows an example of each digit from the MNIST dataset. Since images are 2-dimensional matrices, we first flatten them into a vector x ∈ R784 with dimensionality d = 28 × 28 = 784. This is done by simply concatenating all of the rows of the images to obtain one long vector. Next, since the output labels are categorical values that denote the digits from 0 to 9, we need to convert them into binary (numerical) vectors, using one-hot encoding. Thus, the label 0 is encoded as e1 = (1,0,0,0,0,0,0,0,0,0)T ∈ R10, the label 1 as e2 = (0,1,0,0,0,0,0,0,0,0)T ∈ R10, and so on, and finally the label 9 is encoded as e10 = (0,0,0,0,0,0,0,0,0,1)T ∈ R10. That is, each input image vector x has a corresponding target response vector y ∈ {e1,e2,··· ,e10}. Thus, the input layer for the MLP has d = 784 neurons, and the output layer has p = 10 neurons.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 42 / 60

slide-43
SLIDE 43

MLP for handwritten digit classification

Example

For the hidden layer, we consider several MLP models, each with a different number of hidden neurons m. We try m = 0, 7, 10, 49, 98, 196, 392, to study the effect of increasing the number of hidden neurons, from small to large. For the hidden layer, we use the ReLU activation function, and for the output layer, we use softmax activation, since the target response vector has only one neuron with value 1, with the rest being 0. Note that m = 0 means that there is no hidden layer – the input layer is directly connected to the output layer, which is equivalent to a multiclass logistic regression model. We train each MLP for t = 15 epochs, using step size η = 0.25. During training, we plot the number of misclassified images after each epoch, on the separate MNIST test set comprising 10,000 images. The figure shows the number of errors from each of the models (with a different number of hidden neurons m), after each epoch.
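
For comparison, the same experiment can be sketched with an off-the-shelf library. The Keras snippet below is an illustrative stand-in, not the book's implementation: it uses mini-batch SGD with Keras defaults rather than the per-point updates of the pseudocode, so the exact error counts will differ.

```python
import tensorflow as tf

# Load MNIST, flatten each image to a 784-dimensional vector, scale pixels to [0, 1],
# and one-hot encode the labels 0..9.
(Xtr, ytr), (Xte, yte) = tf.keras.datasets.mnist.load_data()
Xtr = Xtr.reshape(-1, 784).astype("float32") / 255.0
Xte = Xte.reshape(-1, 784).astype("float32") / 255.0
Ytr = tf.keras.utils.to_categorical(ytr, 10)
Yte = tf.keras.utils.to_categorical(yte, 10)

m = 196  # number of hidden neurons (one of the values tried above)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(m, activation="relu"),      # hidden layer, ReLU
    tf.keras.layers.Dense(10, activation="softmax"),  # output layer, softmax
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.25),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(Xtr, Ytr, epochs=15, validation_data=(Xte, Yte))
test_errors = int((1 - model.evaluate(Xte, Yte, verbose=0)[1]) * len(Xte))
```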

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 43 / 60

slide-44
SLIDE 44

MLP for handwritten digit classification

Example

The final test error at the end of training is given as

m        0     7     10    49    98    196   392
errors   1677  901   792   546   495   470   454

We can observe that adding a hidden layer significantly improves the prediction accuracy. Using even a small number of hidden neurons helps, compared to the logistic regression model (m = 0). For example, using m = 7 results in 901 errors (or error rate 9.01%) compared to using m = 0, which results in 1677 errors (or error rate 16.77%). On the other hand, as we increase the number of hidden neurons, the error rate decreases, though with diminishing returns. Using m = 196, the error rate is 4.70%, but even after doubling the number of hidden neurons (m = 392), the error rate goes down to only 4.54%. Further increasing m does not reduce the error rate.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 44 / 60

slide-45
SLIDE 45

MNIST dataset: sample handwritten digits

[Figure: one sample 28 × 28 grayscale image for each of the ten digits 0–9 from the MNIST dataset.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 45 / 60

slide-46
SLIDE 46

MNIST: Prediction error as a function of epochs

[Figure: number of test errors (y-axis, 500–3,000) versus training epoch 1–15 (x-axis), one curve per model, for m = 0, 7, 10, 49, 98, 196, 392 hidden neurons.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 46 / 60

slide-47
SLIDE 47

Deep Multilayer Perceptron

[Figure: deep MLP with input layer l = 0 (bias neuron x0 = 1 and inputs x1, ···, xi, ···, xd), hidden layers l = 1, 2, ···, h (each layer l with a bias neuron and neurons z^l_1, ···, z^l_{nl}), and output layer l = h + 1 (outputs o1, ···, op). In vector form, layer l is connected to layer l + 1 via weight matrix W_l and bias vector b_l, giving the chain x → z^1 → z^2 → ··· → z^h → o.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 47 / 60

slide-48
SLIDE 48

Feed-forward Phase

Typically in a deep MLP, the same activation function f^l is used for all neurons in a given layer l. The input layer always uses the identity activation, so f^0 is the identity function. All bias neurons likewise use the identity function, with a fixed value of 1. The output layer typically uses sigmoid or softmax activations for classification tasks, or identity activations for regression tasks. The hidden layers typically use sigmoid, tanh, or ReLU activations. For a given input pair (x, y) ∈ D, the deep MLP computes the output vector via the feed-forward process:

o = f^{h+1}( bh + Whᵀ · z^h )
  = f^{h+1}( bh + Whᵀ · f^h( b_{h−1} + W_{h−1}ᵀ · z^{h−1} ) )
  = ···
  = f^{h+1}( bh + Whᵀ · f^h( b_{h−1} + W_{h−1}ᵀ · f^{h−1}( ··· f^2( b1 + W1ᵀ · f^1( b0 + W0ᵀ · x ) ) ) ) )
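
The nested expression above is just a loop over layers. A minimal sketch (illustrative; the weight matrices, bias vectors, and per-layer activation functions are passed in as lists):

```python
import numpy as np

def deep_feed_forward(x, weights, biases, activations):
    """Feed-forward through layers l = 0,...,h:  z^{l+1} = f^{l+1}(b_l + W_l^T z^l)."""
    z = x
    for W, b, f in zip(weights, biases, activations):
        z = f(b + W.T @ z)
    return z   # o = z^{h+1}
```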

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 48 / 60

slide-49
SLIDE 49

Backpropagation Phase

Consider the weight update between a given layer and another, including between the input and hidden layer, between two hidden layers, or between the last hidden layer and the output layer. Let z^l_i be a neuron in layer l, and z^{l+1}_j a neuron in the next layer l + 1. Let w^l_ij be the weight between z^l_i and z^{l+1}_j, and let b^l_j denote the bias term between z^l_0 and z^{l+1}_j. The weight and bias are updated using the gradient descent approach

w^l_ij = w^l_ij − η · ∇w^l_ij
b^l_j = b^l_j − η · ∇b^l_j

where ∇w^l_ij is the weight gradient and ∇b^l_j is the bias gradient, i.e., the partial derivative of the error function with respect to the weight and bias, respectively.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 49 / 60

slide-50
SLIDE 50

Backpropagation Phase

We can use the chain rule to write the weight and bias gradients, as follows:

∇w^l_ij = ∂Ex/∂w^l_ij = (∂Ex/∂netj) · (∂netj/∂w^l_ij) = δ^{l+1}_j · z^l_i = z^l_i · δ^{l+1}_j
∇b^l_j = ∂Ex/∂b^l_j = (∂Ex/∂netj) · (∂netj/∂b^l_j) = δ^{l+1}_j

In summary, the update of the weights and biases is

W_l = W_l − η · ∇W_l
b_l = b_l − η · ∇b_l   (19)

where η is the step size. However, we observe that to compute the weight and bias gradients for layer l we need to compute the net gradients δ^{l+1} at layer l + 1.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 50 / 60

slide-51
SLIDE 51

Net Gradients at Output Layer

Let us consider how to compute the net gradients at the output layer h + 1. If all of the output neurons are independent (for example, when using linear or sigmoid activations), the net gradient is obtained by differentiating the error function with respect to the net signal at the output neurons. That is,

δ^{h+1}_j = ∂Ex/∂netj = (∂Ex/∂f^{h+1}(netj)) · (∂f^{h+1}(netj)/∂netj) = (∂Ex/∂oj) · (∂f^{h+1}(netj)/∂netj)

Thus, the gradient depends on two terms: the partial derivative of the error function with respect to the output neuron value, and the derivative of the activation function with respect to its argument. On the other hand, if the output neurons are not independent (for example, when using a softmax activation), then we have to modify the computation of the net gradient at each output neuron as follows:

δ^{h+1}_j = ∂Ex/∂netj = Σ_{i=1}^p (∂Ex/∂f^{h+1}(neti)) · (∂f^{h+1}(neti)/∂netj)

Typically, for regression tasks, we use the squared error function with a linear activation function at the output neurons, whereas for logistic regression and classification, we use the cross-entropy error function with a sigmoid activation for binary classes, and softmax activation for multiclass problems.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 51 / 60

slide-52
SLIDE 52

Net Gradients at Hidden Layers

Let us assume that we have already computed the net gradients at layer l + 1, namely δ^{l+1}. Since neuron z^l_j in layer l is connected to all of the neurons in layer l + 1 (except for the bias neuron z^{l+1}_0), to compute the net gradient at z^l_j we have to account for the error from each neuron in layer l + 1, as follows:

δ^l_j = ∂Ex/∂netj = Σ_{k=1}^{n_{l+1}} (∂Ex/∂netk) · (∂netk/∂f^l(netj)) · (∂f^l(netj)/∂netj)
      = (∂f^l(netj)/∂netj) · Σ_{k=1}^{n_{l+1}} δ^{l+1}_k · w^l_jk

So the net gradient at z^l_j in layer l depends on the derivative of the activation function with respect to its net input netj, and the weighted sum of the net gradients from all the neurons z^{l+1}_k at the next layer l + 1.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 52 / 60

slide-53
SLIDE 53

Net Gradients at Hidden Layers

For the commonly used activation functions at the hidden layers, we have

∂f^l = 1                 for linear
∂f^l = z^l ⊙ (1 − z^l)   for sigmoid
∂f^l = (1 − z^l ⊙ z^l)   for tanh

The net gradients are computed recursively, starting from the output layer h + 1, then hidden layer h, and so on, until we finally compute the net gradients at the first hidden layer l = 1. That is,

δ^h = ∂f^h ⊙ (W_h · δ^{h+1})
δ^{h−1} = ∂f^{h−1} ⊙ (W_{h−1} · δ^h) = ∂f^{h−1} ⊙ (W_{h−1} · (∂f^h ⊙ (W_h · δ^{h+1})))
···
δ^1 = ∂f^1 ⊙ (W_1 · (∂f^2 ⊙ (W_2 · ··· (∂f^h ⊙ (W_h · δ^{h+1})))))

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 53 / 60

slide-54
SLIDE 54

Deep MLP Training: Stochastic Gradient Descent

Deep-MLP-Training (D, h, η, maxiter, n1, n2, ···, nh, f^1, f^2, ···, f^{h+1}):
    n0 ← d // input layer size
    n_{h+1} ← p // output layer size
    // Initialize weight matrices and bias vectors
    for l = 0, 1, 2, ···, h do
        b_l ← random n_{l+1}-dimensional vector with small values
        W_l ← random n_l × n_{l+1} matrix with small values
    t ← 0 // iteration counter

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 54 / 60

slide-55
SLIDE 55

Deep MLP Training: Stochastic Gradient Descent

repeat
    foreach (xi, yi) ∈ D in random order do
        // Feed-forward phase
        z^0 ← xi
        for l = 0, 1, 2, ···, h do z^{l+1} ← f^{l+1}(b_l + W_lᵀ · z^l)
        oi ← z^{h+1}
        // Backpropagation phase
        if independent outputs then
            δ^{h+1} ← ∂f^{h+1} ⊙ ∂Exi // net gradients at output
        else
            δ^{h+1} ← ∂F^{h+1} · ∂Exi // net gradients at output
        for l = h, h − 1, ···, 1 do δ^l ← ∂f^l ⊙ (W_l · δ^{l+1}) // net gradients
        // Gradient descent step
        for l = 0, 1, ···, h do
            ∇W_l ← z^l · (δ^{l+1})ᵀ // weight gradient matrix at layer l
            ∇b_l ← δ^{l+1} // bias gradient vector at layer l
        for l = 0, 1, ···, h do
            W_l ← W_l − η · ∇W_l // update W_l
            b_l ← b_l − η · ∇b_l // update b_l
        t ← t + 1
until t ≥ maxiter
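
A compact NumPy sketch of this training loop, assuming sigmoid activations at every layer and independent outputs with squared error (so that δ^{h+1} = o ⊙ (1 − o) ⊙ (o − y)); names and initialization are illustrative, not the book's code.

```python
import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

def deep_mlp_train(X, Y, hidden_sizes, eta, maxiter, seed=0):
    """SGD sketch of Deep-MLP-Training; hidden_sizes = [n1, ..., nh]."""
    rng = np.random.default_rng(seed)
    dims = [X.shape[1]] + list(hidden_sizes) + [Y.shape[1]]   # n0 = d, ..., n_{h+1} = p
    W = [rng.normal(scale=0.01, size=(dims[l], dims[l+1])) for l in range(len(dims) - 1)]
    b = [rng.normal(scale=0.01, size=dims[l+1]) for l in range(len(dims) - 1)]
    t = 0
    while t < maxiter:
        for i in rng.permutation(len(X)):
            zs = [X[i]]                                       # feed-forward phase
            for Wl, bl in zip(W, b):
                zs.append(sigmoid(bl + Wl.T @ zs[-1]))
            o = zs[-1]
            delta = o * (1 - o) * (o - Y[i])                  # net gradient at the output layer
            for l in range(len(W) - 1, -1, -1):               # backpropagate and update
                grad_W, grad_b = np.outer(zs[l], delta), delta
                if l > 0:
                    delta = zs[l] * (1 - zs[l]) * (W[l] @ delta)  # net gradient at layer l
                W[l] -= eta * grad_W
                b[l] -= eta * grad_b
            t += 1
            if t >= maxiter:
                break
    return W, b
```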

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 55 / 60

slide-56
SLIDE 56

Vanishing or Exploding Gradients

In the vanishing gradient problem, the norm of the net gradient can decay exponentially with the distance from the output layer, that is, as we backpropagate the gradients from the output layer to the input layer. In this case the network will learn extremely slowly, if at all, since the gradient descent method will make minuscule changes to the weights and biases. On the other hand, in the exploding gradient problem, the norm of the net gradient can grow exponentially with the distance from the output layer. In this case, the weights and biases will become exponentially large, resulting in a failure to learn. The gradient explosion problem can be mitigated to some extent by gradient thresholding, that is, by resetting the value if it exceeds an upper bound. The vanishing gradients problem is more difficult to address. Typically sigmoid activations are more susceptible to this problem, and one solution is to use alternative activation functions such as ReLU. In general, recurrent neural networks, which are deep neural networks with feedback connections, are more prone to vanishing and exploding gradients.
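
A minimal sketch of the gradient thresholding mentioned above: the first function resets (clips) individual gradient values that exceed an upper bound, as described on the slide, and the second shows a common alternative that rescales the whole gradient by its norm. The threshold value is an arbitrary illustrative choice.

```python
import numpy as np

def threshold_gradient(grad, bound=5.0):
    """Gradient thresholding: reset values whose magnitude exceeds the bound (elementwise clip)."""
    return np.clip(grad, -bound, bound)

def clip_by_norm(grad, bound=5.0):
    """Alternative: rescale the whole gradient when its norm exceeds the bound."""
    norm = np.linalg.norm(grad)
    return grad if norm <= bound else grad * (bound / norm)
```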

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 56 / 60

slide-57
SLIDE 57

MNIST: Deep MLPs

Prediction error as a function of epochs.

[Figure: number of test errors (y-axis, 1,000–5,000) versus training epoch 1–15 (x-axis), one curve per model: n1 = 392; n1 = 196, n2 = 49; n1 = 392, n2 = 196, n3 = 49; n1 = 392, n2 = 196, n3 = 98, n4 = 49.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 57 / 60

slide-58
SLIDE 58

Deep MLP

Example

We now examine deep MLPs for predicting the labels of the MNIST handwritten digit images. Recall that this dataset has n = 60,000 grayscale images of size 28 × 28 that we treat as d = 784 dimensional vectors. The pixel values between 0 and 255 are converted to the range [0, 1] by dividing each value by 255. The target response vector is a one-hot encoded vector for the class labels {0, 1, ..., 9}. Thus, the input to the MLP xi has dimensionality d = 784, and the output layer has dimensionality p = 10. We use softmax activation for the output layer. We use ReLU activation for the hidden layers, and consider several deep models with different numbers and sizes of hidden layers. We use step size η = 0.3 and train for t = 15 epochs. Training was done using minibatches, with a batch size of 1000.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 58 / 60

slide-59
SLIDE 59

Deep MLP

Example

During the training of each of the deep MLPs, we evaluate its performance on the separate MNIST test dataset that contains 10,000 images. The figure plots the number of errors after each epoch for the different deep MLP models. The final test error at the end of training is given as

hidden layers                              errors
n1 = 392                                   396
n1 = 196, n2 = 49                          303
n1 = 392, n2 = 196, n3 = 49                290
n1 = 392, n2 = 196, n3 = 98, n4 = 49       278

We can observe that as we increase the number of layers, we do get performance improvements. The deep MLP with four hidden layers of sizes n1 = 392, n2 = 196, n3 = 98, n4 = 49 results in an error rate of 2.78% on the test set, whereas the MLP with a single hidden layer of size n1 = 392 has an error rate of 3.96%. Thus, the deeper MLP significantly improves the prediction accuracy. However, adding more layers does not necessarily reduce the error rate further, and can also lead to performance degradation.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 59 / 60

slide-60
SLIDE 60

Data Mining and Machine Learning: Fundamental Concepts and Algorithms

dataminingbook.info
Mohammed J. Zaki¹  Wagner Meira Jr.²

¹Department of Computer Science

Rensselaer Polytechnic Institute, Troy, NY, USA

²Department of Computer Science

Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 25: Neural Networks

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 60 / 60