
Introduction to (shallow) Neural Networks, Pr. Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL, Nov.2019 1

Introduction to (shallow) Neural Networks

Pr. Fabien MOUTARDE
Center for Robotics, MINES ParisTech, PSL Université Paris

Fabien.Moutarde@mines-paristech.fr
http://people.mines-paristech.fr/fabien.moutarde

Neural Networks: from biology to engineering

  • Understanding and modelling of brain
  • Imitation to reproduce high-level functions
  • Mathematical tool for engineers

Application domains

Modelling any input-output function by “learning” from examples:

  • Pattern recognition
  • Voice recognition
  • Classification, diagnosis
  • Identification
  • Forecasting
  • Control, regulation

Biological neurons

  • Electric signal: dendrites → cell body → axon → synapses

[Figure: biological neuron with labels: dendrite, cell body, axon, synapse]

Empirical model of neurons

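[Figure: sigmoid curve of the output frequency f (saturating at ~500 Hz) versus the membrane (electric) potential θ; labels: input, frequency f]

$\theta \approx \sum_i C_i f_i$

⇒ Neuron output = periodic signal with frequency $f \approx \mathrm{sigmoid}(\theta) = \mathrm{sigmoid}\left(\sum_i C_i f_i\right)$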

“Birth” of formal Neuron

  • McCulloch & Pitts (1943)
  • Simple model of neuron
  • goal: model the brain

[Figure: formal neuron: inputs x1 … xD with weights W1 … WD, a summation Σ with threshold W0, and output y (+1)]

Linear separation by a single neuron

[Figure: linear separation of two classes according to the value of W.X]

Linear separation: hyperplane with W.X – W0 = 0
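
As a minimal sketch of this linear separation, the snippet below implements a single threshold neuron that answers +1 on one side of the hyperplane W.X – W0 = 0 and -1 on the other. The function name, the -1 value for the other class, and the hand-picked weights (chosen so that the neuron computes a logical AND) are illustrative assumptions, not taken from the slides.

```python
def threshold_neuron(weights, threshold, x):
    """Return +1 if W.X - W0 > 0, else -1 (sign-like activation, assumed here)."""
    potential = sum(w * xi for w, xi in zip(weights, x))
    return 1 if potential - threshold > 0 else -1

# Hypothetical weights chosen by hand: the line x1 + x2 = 1.5 isolates the point (1, 1)
AND_WEIGHTS = [1.0, 1.0]
AND_THRESHOLD = 1.5

and_outputs = [threshold_neuron(AND_WEIGHTS, AND_THRESHOLD, x)
               for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Only the input (1, 1) lies on the positive side of the hyperplane, so it is the only one mapped to +1.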

Theoretical model for learning

  • Hebb rule (1949)

“Cells that fire together wire together”, i.e. the synaptic weight increases between neurons that activate simultaneously

[Figure: two neurons with outputs yi and yj, connected by the weight Wij]

$W_{ij}(t + dt) = W_{ij}(t) + \lambda \, y_i(t) \, y_j(t)$
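
The Hebb update can be sketched in a few lines. This is only an illustration: the function name, the rate value `lam=0.1` and the binary activity sequence are assumptions made for the example.

```python
def hebb_update(w_ij, y_i, y_j, lam):
    """One Hebbian step: W_ij(t+dt) = W_ij(t) + lam * y_i(t) * y_j(t)."""
    return w_ij + lam * y_i * y_j

w = 0.0
# Hypothetical activities (y_i, y_j) of the two neurons at successive time steps
activity = [(1, 1), (1, 0), (0, 1), (1, 1)]
for y_i, y_j in activity:
    w = hebb_update(w, y_i, y_j, lam=0.1)
```

The weight only grows at the two time steps where both neurons are active together.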

First formal Neural Networks

(in French: Réseaux de Neurones)

Formal neuron of McCulloch & Pitts + Hebb rule for learning

  • PERCEPTRON (Rosenblatt, 1957)
  • ADALINE (Widrow, 1962)

Possible to “learn” Boolean functions by training from examples

Training of Perceptron

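⇒ Linear separation: Wk.X > W0 on one side of the hyperplane, Wk.X < W0 on the other

Training algorithm:

– Wk+1 = Wk + vX if X is incorrectly classified (v: target value)
– Wk+1 = Wk if X is correctly classified

[Figure: perceptron with input X, weights W, potential W.X and output y]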

  • Convergence if linearly-separable problem
  • ?? if NOT linearly-separable
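
The perceptron rule can be sketched as follows on a linearly-separable toy problem (logical AND with targets v = ±1). Two choices here are assumptions for the illustration: the threshold W0 is folded into the weight vector through a constant extra input of -1, and training stops after a full pass without errors.

```python
def predict(w, x):
    # x is augmented with a constant -1 input, so w[-1] plays the role of W0
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else -1

def train_perceptron(samples, max_epochs=100):
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        errors = 0
        for x, v in samples:
            if predict(w, x) != v:
                # W(k+1) = W(k) + v X, applied only to misclassified examples
                w = [wi + v * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:   # every example correctly classified: stop
            break
    return w

# Logical AND: (x1, x2, -1) -> target v in {-1, +1}
and_samples = [((0, 0, -1), -1), ((0, 1, -1), -1),
               ((1, 0, -1), -1), ((1, 1, -1), 1)]
w_and = train_perceptron(and_samples)
```

On a problem that is not linearly separable (XOR, for instance), the same loop would simply run until `max_epochs` without reaching zero errors.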

Limits of first models, necessity of “hidden” neurons

  • PERCEPTRONS, book by Minsky & Papert (1969)

Detailed study of Perceptrons and their intrinsic limits:

  • can NOT learn some types of Boolean functions (even simple ones like XOR)
  • can do ONLY LINEAR separations

But many classes cannot be linearly separated (by a single hyperplane)

[Figure: two classes, CLASS 1 and CLASS 2]

⇒ Necessity of several layers in the Neural Network
⇒ Requires a new training algorithm

1st revival of Neural Nets

  • GRADIENT BACK-PROPAGATION (Rumelhart 1986, Le Cun 85)

(in French: Rétro-propagation du gradient)

→ Overcomes the Credit Assignment Problem by training Neural Networks with HIDDEN layers

  • Empirical solutions for MANY real-world applications
  • Some strong theoretical results: Multi-Layer Perceptrons are UNIVERSAL (and parsimonious) approximators
  • Around the 2000s: still used, but much less popular than SVMs and boosting

USE OF DIFFERENTIABLE NEURONS + APPLY GRADIENT DESCENT METHOD

2nd recent « revival »: Deep-Learning

  • Since ~2006, rising interest in, and excellent results with, “deep” neural networks, consisting of MANY layers:

– Unsupervised “intelligent” initialization of weights
– Standard gradient descent, and/or fine-tuning from initial values of weights
– Hidden layers ⇒ learnt hierarchy of features

  • In particular, since ~2013, dramatic progress in visual recognition (and voice recognition), with deep Convolutional Neural Networks

What is a FORMAL neuron?

DEFINITIONS OF FORMAL NEURONS

In general: a processing “unit” applying a simple operation to its inputs, and which can be “connected” to others to build a network able to realize any input-output function

“Usual” definition: a “unit” computing a weighted sum of its inputs, and then applying some non-linearity (sigmoid, ReLU, Gaussian, …)

General formal neuron

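ei: inputs of neuron
sj: potential of neuron
Oj: output of neuron
Wij: (synaptic) weights
h: input function (computation of potential = Σ, dist, kernel, …)
f: activation (or transfer) function

[Figure: inputs ei with weights Wij enter the input function h, which gives the potential sj; the activation f then gives the output Oj]

$s_j = h(e_i, \{W_{ij},\ i = 0 \text{ to } k_j\})$

$O_j = f(s_j)$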

The combination of particular h and f functions defines the type of formal neuron

Summating artificial “neurons”

PRINCIPLE

$O_j = f\left(W_{0j} + \sum_{i=1}^{n_j} W_{ij} \, e_i\right)$

W0j = "bias"

[Figure: summating neuron: inputs ei weighted by Wij, summation Σ, activation f, output Oj]

ACTIVATION FUNCTIONS

  • Threshold (Heaviside or sign) → binary neurons
  • Sigmoid (logistic or tanh) → most common for MLPs
  • Gaussian
  • Identity → linear neurons
  • Saturation
  • ReLU (Rectified Linear Unit)
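
A possible sketch of a summating neuron with several of these activation functions is given below. The exact shapes chosen for the Gaussian (exp(-s²)) and for the saturation (identity clipped to [-1, 1]) are assumptions, since the slides only name them; the weights and inputs are hypothetical.

```python
import math

def heaviside(s):
    return 1.0 if s > 0 else 0.0

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

def gaussian(s):
    return math.exp(-s * s)           # assumed unit-width Gaussian

def identity(s):
    return s

def saturation(s):
    return max(-1.0, min(1.0, s))     # assumed: identity clipped to [-1, 1]

def relu(s):
    return max(0.0, s)

def summating_neuron(inputs, weights, bias, f):
    """O_j = f(W_0j + sum_i W_ij * e_i)."""
    return f(bias + sum(w * e for w, e in zip(weights, inputs)))

# Hypothetical neuron whose potential is exactly 0 for this input
e, W, W0 = [1.0, 2.0], [0.5, 0.25], -1.0
outputs = {f.__name__: summating_neuron(e, W, W0, f)
           for f in (heaviside, logistic, math.tanh, gaussian,
                     identity, saturation, relu)}
```

Only the activation changes from one neuron type to the next; the weighted sum with bias is shared.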

“Distance” formal neurons

The potential of these neurons is the Euclidean DISTANCE between input vector (ei)i and weight vector (Wij)i

Input function:

$h\big((e_1, \dots, e_k),\ (W_{1j}, \dots, W_{kj})\big) = \sum_i (e_i - W_{ij})^2$

Activation function = Identity or Gaussian

[Figure: distance neuron: inputs ei and weights Wij feed a DIST block, followed by the activation f, giving the output Oj]
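
A minimal sketch of a distance neuron follows. It assumes that the potential is the sum of squared differences between inputs and weights (take its square root to get the distance itself) and that the Gaussian activation has a width parameter `sigma`; both are assumptions of this example.

```python
import math

def distance_potential(e, w):
    """Sum of squared differences between input vector e and weight vector w."""
    return sum((ei - wi) ** 2 for ei, wi in zip(e, w))

def distance_neuron(e, w, sigma=1.0):
    """Distance neuron with an (assumed) Gaussian activation of width sigma."""
    return math.exp(-distance_potential(e, w) / (2.0 * sigma ** 2))

centre = (1.0, 2.0)                                # hypothetical weight vector
at_centre = distance_neuron((1.0, 2.0), centre)    # maximal response
far_away = distance_neuron((4.0, 6.0), centre)     # response decays with distance
```

Such a neuron responds most strongly when the input coincides with its weight vector, unlike a summating neuron, which responds to one side of a hyperplane.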

Kernel-type formal neurons

Input function:

$h(e, w) = K(e, w)$

Activation function = Identity

with K symmetric and “positive” in the Mercer sense: $\forall \psi$ such that $\int \psi^2(x)\,dx < \infty$, $\int K(u,v)\,\psi(u)\,\psi(v)\,du\,dv \ge 0$

[Figure: kernel neuron: input e and weights w feed a kernel block K, followed by the identity activation Id, giving the output Oj]

Examples of possible kernels:

– Polynomial: $K(u,v) = [u.v + 1]^p$
– Radial Basis Function: $K(u,v) = \exp(-\|u-v\|^2 / 2\sigma^2)$
  ⇒ equivalent to distance-neuron + Gaussian activation
– Sigmoid: $K(u,v) = \tanh(u.v + \theta)$
  ⇒ equivalent to summating-neurons + sigmoid activation
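
The three example kernels can be written directly, as in the sketch below. Function names and default parameter values (`p=2`, `sigma=1.0`, `theta=0.0`) are assumptions for the illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, p=2):
    return (dot(u, v) + 1.0) ** p

def rbf_kernel(u, v, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def sigmoid_kernel(u, v, theta=0.0):
    return math.tanh(dot(u, v) + theta)

def kernel_neuron(e, w, kernel):
    """Kernel-type neuron: potential K(e, w), identity activation."""
    return kernel(e, w)

e, w = (1.0, 2.0), (3.0, 4.0)    # hypothetical input and weight vectors
poly_out = kernel_neuron(e, w, polynomial_kernel)
rbf_out = kernel_neuron(e, w, rbf_kernel)
```

With the RBF kernel, the output depends only on the squared distance between `e` and `w`, which is why it matches a distance neuron with Gaussian activation.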

Networks of formal neurons

TWO FAMILIES OF NETWORKS

  • FEED-FORWARD NETWORKS (in French, “réseaux non bouclés”):

NO feedback connection; the output depends only on the current input (NO memory)

  • FEEDBACK OR RECURRENT NETWORKS (in French, “réseaux bouclés”):

Some internal feedback/backwards connection ⇒ the output depends on the current input AND ON ALL PREVIOUS INPUTS (some memory inside!)

Feed-forward networks

(in French: réseau “NON-bouclé”)

Neurons can be ordered so that there is NO “backwards” connection

Time is NOT a functional variable, i.e. there is NO MEMORY, and the output depends only on the current input

[Figure: feed-forward network with input neurons X1, X2, X3, X4, neurons numbered 1 to 5, and outputs Y1, Y2]

Neurons 1, 3 and 4 are said to be “hidden”

Feed-forward Multi-layer Neural Networks

[Figure: layered network: input layer (X1, X2, X3), hidden layers (0, 1 or more), output layer (Y1, Y2); connections with weights between successive layers]

In a “Multi-Layer Perceptron” (MLP), neurons are generally of the “summating with sigmoid activation” type

[French term for MLP: “Réseau Neuronal à couches”]

Recurrent Neural Networks

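A time-delay is associated with each connection

[Figure: recurrent network: input x1, two summating neurons (Σ, f) producing x2 and x3, an output, and a time-delay on each connection]

Equivalent form

[Figure: equivalent form using unit delays: input, signals x1(t), x2(t), x3(t) and delayed copies x2(t-1), x2(t-2), x3(t-1), three summations Σ, output]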

Canonical form of Recurrent Neural Networks

[Figure: a feed-forward network receives the external inputs ui(t) and the state variables xk(t-1); it produces the outputs yj(t) and the state variables xk(t), which are fed back through unit delays]

The output at time t depends not only on the external inputs U(t), but also (via internal “state variables”) on the whole sequence of previous inputs (and on the initialization of the state variables)
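
The canonical form amounts to a simple loop around a feed-forward function, as sketched below. The particular `step` function (one state variable, one output, tanh non-linearity, hand-picked coefficients) is a hypothetical stand-in for the feed-forward network.

```python
import math

def step(u, x_prev):
    """Hypothetical feed-forward part: (u(t), x(t-1)) -> (y(t), x(t))."""
    x = math.tanh(0.5 * x_prev + u)   # new state variable x(t)
    y = 2.0 * x                       # output y(t)
    return y, x

def run(inputs, x0=0.0):
    """Feed the state back through a unit delay at each time step."""
    x, outputs = x0, []
    for u in inputs:
        y, x = step(u, x)
        outputs.append(y)
    return outputs

out_a = run([1.0, 0.0])            # a past input of 1.0, then 0.0
out_b = run([0.0, 0.0])            # same current input at t=2, different history
out_c = run([0.0], x0=1.0)         # same input as out_b, different initial state
```

At the second time step both sequences receive the same input 0.0, yet the outputs differ because the state still carries the earlier input; the third run shows the dependence on the initial state.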

Use of a Neural Network

  • Two modes:

– training: based on examples of (input, output) pairs, the network modifies
  § its parameters W (synaptic weights of connections)
  § and also potentially its architecture A (by creating/eliminating neurons or connections)

– recognition: computation of the output associated with a given input (architecture and weights remaining frozen)

$y = F_{A,W}(x)$

[Figure: network with architecture A and weights W, mapping the input x to the output y = F_{A,W}(x)]

Training principle for Neural Networks

  • Supervised training = adaptation of the synaptic weights of the network so that its output is close to the target value for each example
  • Given n examples (Xp; Dp), and the network outputs Yp = NN(Xp), the average quadratic error is

$E(W) = \sum_p (Y_p - D_p)^2$

Training ~ finding W* = ArgMin(E), i.e. minimizing the cost function E(W)

  • Generally this is done by using gradient descent (total, partial or stochastic):

$W(t+1) = W(t) - \lambda \, \mathrm{grad}_W(E) \ [+ \mu(t)\,(W(t) - W(t-1))]$
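
To make the update rule concrete, here is a minimal sketch of total gradient descent with the optional momentum term, on a deliberately tiny model Y = w·X with a single weight. The model, the data, and the values of the learning rate `lam` and momentum `mu` are assumptions chosen for the example.

```python
def quadratic_error(w, examples):
    """E(W) = sum_p (Y_p - D_p)^2 for the toy model Y = w * X."""
    return sum((w * x - d) ** 2 for x, d in examples)

def gradient(w, examples):
    """dE/dw for the toy model above."""
    return sum(2.0 * (w * x - d) * x for x, d in examples)

def gradient_descent(examples, lam=0.01, mu=0.5, steps=200):
    w, w_prev = 0.0, 0.0
    for _ in range(steps):
        # W(t+1) = W(t) - lam * grad_W(E) + mu * (W(t) - W(t-1))
        w_next = w - lam * gradient(w, examples) + mu * (w - w_prev)
        w_prev, w = w, w_next
    return w

examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]   # hypothetical data, D = 3 X
w_star = gradient_descent(examples)
```

For this convex one-parameter problem the iterations converge to the minimizer w* = 3; a too large `lam` would make them diverge instead.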

Usual training algo for Multi Layer Perceptrons (MLP)

  • Training by Stochastic Gradient Descent (SGD), using back-propagation:

– Input 1 (or a few) random training sample(s)
– Propagate
– Calculate error (loss)
– Back-propagate through all layers from end to input, to compute gradient and update weights

Back-propagation principle

Smart method for efficiently computing the gradient (w.r.t. weights) of a Neural Network cost function, based on the chain rule for differentiation.

Cost function is $Q(t) = \sum_m \mathrm{loss}(Y_m, D_m)$, where m runs over the training set examples

Usually, $\mathrm{loss}(Y_m, D_m) = \|Y_m - D_m\|^2$ [quadratic error]

Total gradient: $W(t+1) = W(t) - \lambda(t)\,\mathrm{grad}_W(Q(t)) + \mu(t)\,(W(t) - W(t-1))$

Stochastic gradient: $W(t+1) = W(t) - \lambda(t)\,\mathrm{grad}_W(Q_m(t)) + \mu(t)\,(W(t) - W(t-1))$

where $Q_m = \mathrm{loss}(Y_m, D_m)$ is the error computed on only ONE example randomly drawn from the training set at every iteration, and λ(t) = learning rate (fixed, decreasing or adaptive), μ(t) = momentum

Now, how to compute $\partial Q_m / \partial W_{ij}$?

Backprop through fully-connected layers: use of chain rule derivative computation

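[Figure: chain of neurons i → j → k: outputs yi, yj; potentials si, sj, sk; activation f; weights wij, wjk]

$\partial E_m / \partial W_{ij} = (\partial E_m / \partial s_j)(\partial s_j / \partial W_{ij}) = (\partial E_m / \partial s_j)\, y_i$

Let $\delta_j = \partial E_m / \partial s_j$. Then $W_{ij}(t+1) = W_{ij}(t) - \lambda(t)\, y_i\, \delta_j$ (and $W_{0j}(t+1) = W_{0j}(t) - \lambda(t)\, \delta_j$)

If neuron j is an output, $\delta_j = \partial E_m / \partial s_j = (\partial E_m / \partial y_j)(\partial y_j / \partial s_j)$ with $E_m = \|Y_m - D_m\|^2$, so $\delta_j = 2\,(y_j - D_j)\, f'(s_j)$

Otherwise, $\delta_j = \partial E_m / \partial s_j = \sum_k (\partial E_m / \partial s_k)(\partial s_k / \partial s_j) = \sum_k \delta_k\,(\partial s_k / \partial s_j) = \sum_k \delta_k\, W_{jk}\,(\partial y_j / \partial s_j)$, so $\delta_j = \left(\sum_k W_{jk}\, \delta_k\right) f'(s_j)$ if neuron j is “hidden”

⇒ all the $\delta_j$ can be computed successively from the last layer to upstream layers by “error backpropagation” from the output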
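
The sketch below applies these formulas to a very small network (2 inputs, 2 hidden neurons, 1 output, logistic activations) for one training example. The network size, the weight values and the example (x, d) are hypothetical; in each weight list, index 0 holds the bias W0j.

```python
import math

def f(s):
    return 1.0 / (1.0 + math.exp(-s))

def f_prime(s):
    return f(s) * (1.0 - f(s))

def forward(x, W_hid, W_out):
    # W_hid[j] = [W0j, W1j, W2j]; W_out = [W0, W1, W2]
    s_hid = [w[0] + w[1] * x[0] + w[2] * x[1] for w in W_hid]
    y_hid = [f(s) for s in s_hid]
    s_out = W_out[0] + W_out[1] * y_hid[0] + W_out[2] * y_hid[1]
    return s_hid, y_hid, s_out, f(s_out)

def loss(x, d, W_hid, W_out):
    return (forward(x, W_hid, W_out)[3] - d) ** 2

def backprop(x, d, W_hid, W_out):
    s_hid, y_hid, s_out, y = forward(x, W_hid, W_out)
    # Output neuron: delta = 2 (y - D) f'(s)
    delta_out = 2.0 * (y - d) * f_prime(s_out)
    # Hidden neurons: delta_j = (sum_k W_jk delta_k) f'(s_j), single k here
    delta_hid = [W_out[j + 1] * delta_out * f_prime(s_hid[j]) for j in range(2)]
    # dE/dW_ij = y_i * delta_j (bias: the "input" is the constant 1)
    grad_out = [delta_out, delta_out * y_hid[0], delta_out * y_hid[1]]
    grad_hid = [[dj, dj * x[0], dj * x[1]] for dj in delta_hid]
    return grad_hid, grad_out

W_hid = [[0.1, 0.4, -0.3], [-0.2, 0.25, 0.6]]
W_out = [0.05, -0.7, 0.9]
x, d = (1.0, 0.5), 1.0
grad_hid, grad_out = backprop(x, d, W_hid, W_out)

# One stochastic gradient step with a hypothetical learning rate lam
lam = 0.5
W_out_new = [w - lam * g for w, g in zip(W_out, grad_out)]
W_hid_new = [[w - lam * g for w, g in zip(ws, gs)]
             for ws, gs in zip(W_hid, grad_hid)]
```

A useful sanity check for any backpropagation code is to compare one of these derivatives with a finite-difference estimate of the loss.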

Universal approximation theorem

  • For any continuous function F defined and bounded on a bounded set, and for any ε, there exists a layered Neural Network with ONLY ONE HIDDEN LAYER (of sigmoid neurons) which approximates F with error < ε (Cybenko 1989)

…But the theorem does not provide any clue about how to find this one-hidden-layer NN, nor about its size! And the size of the hidden layer might be huge…

  • The set of MLPs with ONE hidden layer of sigmoid neurons is a family of PARSIMONIOUS approximators: for an equal number of parameters, more functions can be correctly approximated than with polynomials (Sussman 92)

Multi-layer (MLP) vs. single-layer (perceptron)

Single-layer → one linear separation per neuron

[Figure: single linear separation according to the value of W.X]

Multi-layer: any shape of boundary possible

Pros and cons of MLPs

ADVANTAGES

  • Universal and parsimonious approximators (& classifiers)
  • Fast to compute
  • Robustness to data noise
  • Rather easy to train and program

DRAWBACKS

  • Choice of ARCHITECTURE (# of neurons in hidden layer) is CRITICAL, and empirical!
  • Many other critical hyper-parameters (learning rate, # of iterations, initialization of weights, etc.)

  • Many local minima in cost function
  • Blackbox: difficult to interpret the model

Why does gradient descent work despite non-convexity?

  • Local minima dominate in low dimension…
  • …but recent work has shown that saddle points dominate in high dimension
  • Furthermore, most local minima are close to the global minimum

Saddle points in training curves

  • Oscillating between two behaviors:

– Slowly approaching a saddle point
– Escaping it

METHODOLOGY FOR SUPERVISED TRAINING OF MULTI-LAYER NEURAL NETWORKS

Training set vs. TEST set

  • Need to collect enough representative examples
  • Essential to keep aside a subset of examples that shall be used only as TEST SET for estimating final generalization (when training is finished)
  • Need also to use some “validation set” independent from the training set, in order to tune all hyper-parameters (layer sizes, number of iterations, etc.)
  • Space of possible input values usually infinite, and the training set is only a FINITE subset
  • Zero error on all training examples ≠ good results on the whole space of possible inputs (cf. generalization error ≠ empirical error…)

Optimize hyper-parameters by "VALIDATION"

To avoid over-fitting and maximize generalization, it is absolutely essential to use some VALIDATION estimation, for optimizing training hyper-parameters (and stopping criterion):

– either use a separate validation dataset (random split of data into Training-set + Validation-set)
– or use CROSS-VALIDATION:
  • Repeat k times: train on a (k-1)/k proportion of the data + estimate the error on the remaining 1/k portion
  • Average the k error estimations

3-fold cross-validation (data split into S1, S2, S3):

  • Train on S1∪S2 then estimate error errS3 on S3
  • Train on S1∪S3 then estimate error errS2 on S2
  • Train on S2∪S3 then estimate error errS1 on S1
  • Average validation error: (errS1+errS2+errS3)/3
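
A minimal sketch of k-fold cross-validation is shown below. The helper names and the trivial "model" (predicting the mean training target) are assumptions; in practice `train_fn` would train a neural network, and the data would typically be shuffled before splitting.

```python
def k_fold_splits(data, k):
    """Yield (training subset, held-out subset) for each of the k folds."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, folds[i]

def cross_validate(data, k, train_fn, error_fn):
    """Average over the k folds of the error measured on the held-out part."""
    errors = []
    for train, held_out in k_fold_splits(data, k):
        model = train_fn(train)
        errors.append(error_fn(model, held_out))
    return sum(errors) / k

# Hypothetical stand-ins for a real training procedure and error measure
def train_mean(train):
    return sum(d for _, d in train) / len(train)

def mean_squared_error(model, examples):
    return sum((model - d) ** 2 for _, d in examples) / len(examples)

data = [(x, 2.0) for x in range(9)]    # toy examples (X, D) with constant target
cv_error = cross_validate(data, 3, train_mean, mean_squared_error)
```

The same `cross_validate` call can be repeated for each candidate hyper-parameter value, keeping the value with the lowest average validation error.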

Some Neural Networks training "tricks"

  • Importance of input normalization (zero mean, unit variance)
  • Importance of weights initialization: random but SMALL and proportional to 1/sqrt(nbInputs)
  • Decreasing (or adaptive) learning rate
  • Importance of training set size: if a Neural Net has a LARGE number of free parameters ⇒ train it with a sufficiently large training-set!
  • Avoid overfitting by Early Stopping of training iterations
  • Avoid overfitting by use of L1 or L2 regularization
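
The first two tricks can be sketched as follows. The uniform distribution used for the random weights is an assumption (the slides only ask for small random values scaled by 1/sqrt(nbInputs)), and `normalize` assumes the feature is not constant.

```python
import math
import random

def normalize(values):
    """Rescale one input feature to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)          # assumed non-zero
    return [(v - mean) / std for v in values]

def init_weights(nb_inputs, nb_neurons, rng):
    """Small random weights, with a scale proportional to 1/sqrt(nb_inputs)."""
    scale = 1.0 / math.sqrt(nb_inputs)
    return [[rng.uniform(-scale, scale) for _ in range(nb_inputs)]
            for _ in range(nb_neurons)]

feature = [2.0, 4.0, 6.0, 8.0]                     # hypothetical raw input values
normalized = normalize(feature)
weights = init_weights(nb_inputs=16, nb_neurons=3, rng=random.Random(0))
```

The mean and standard deviation should be computed on the training set only and then reused unchanged on validation and test data.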

Avoid overfitting by EARLY STOPPING

  • For Neural Networks, a first method to avoid overfitting is to STOP LEARNING iterations as soon as the validation_error stops decreasing
  • Generally, not a good idea to decide the number of iterations beforehand. Better to ALWAYS USE EARLY STOPPING
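
Early stopping can be sketched as a generic loop around a training step and a validation measure. The `patience` parameter (number of iterations tolerated without improvement) is an assumption added to absorb noise in the validation curve, and the fake validation curve below only simulates a minimum at iteration 10.

```python
def train_with_early_stopping(train_step, validation_error,
                              max_iters=1000, patience=5):
    """Stop once the validation error has not improved for `patience` iterations."""
    best_error, best_iter = float("inf"), 0
    for it in range(1, max_iters + 1):
        train_step()
        err = validation_error()
        if err < best_error:
            best_error, best_iter = err, it
        elif it - best_iter >= patience:
            break
    return best_iter, best_error

# Hypothetical stand-ins: a counter instead of real training, and a
# validation error that decreases until iteration 10, then increases
state = {"it": 0}

def fake_train_step():
    state["it"] += 1

def fake_validation_error():
    return (state["it"] - 10) ** 2 + 1.0

best_iter, best_error = train_with_early_stopping(fake_train_step,
                                                  fake_validation_error)
```

A real implementation would also save a copy of the weights at the best iteration and restore it when the loop stops.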

Avoid overfitting using regularization (« weight decay »)

Trying to fit too many free parameters with not enough information can lead to overfitting

Regularization = penalizing too complex models
Often done by adding a special term to the cost function

For neural networks, the regularization term is just the L2 or L1 norm of the vector of all weights:

$K = \sum_m \mathrm{loss}(Y_m, D_m) + \beta \sum_{ij} |W_{ij}|^p$ with p=2 (L2) or p=1 (L1)

→ name “Weight decay”
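
As an illustration of the regularized cost and of why the L2 case is called weight decay, the sketch below computes the penalized cost and one gradient step with p=2. The function names and the numerical values of `lam` (learning rate) and `beta` are assumptions for the example.

```python
def regularized_cost(data_loss, weights, beta, p=2):
    """K = data loss + beta * sum |W|^p, with p=2 (L2) or p=1 (L1)."""
    return data_loss + beta * sum(abs(w) ** p for w in weights)

def sgd_step_with_weight_decay(weights, loss_grad, lam, beta):
    """One gradient step on K for p=2: the penalty adds 2*beta*W to the gradient."""
    return [w - lam * (g + 2.0 * beta * w) for w, g in zip(weights, loss_grad)]

weights = [1.0, -2.0]                       # hypothetical flattened weight vector
cost_l2 = regularized_cost(0.0, weights, beta=0.5, p=2)
cost_l1 = regularized_cost(0.0, weights, beta=0.5, p=1)

# With a zero data gradient, each weight simply shrinks towards 0
decayed = sgd_step_with_weight_decay(weights, [0.0, 0.0], lam=0.1, beta=0.5)
```

With no data gradient, every weight is multiplied by (1 - 2·lam·beta) at each step, hence the name.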

MLP hyper-parameters

  • Number and sizes of hidden layers!!
  • Activation functions
  • Learning rate (& momentum) [optimizer]
  • Number of gradient iterations!! (& early_stopping)
  • Regularization factor
  • Weight initialization
