Deep learning - Jérémy Fix, CentraleSupélec


slide-1
SLIDE 1

Introduction Feedforward Neural Networks

Deep learning

Jérémy Fix

CentraleSupélec, jeremy.fix@centralesupelec.fr

2016

1 / 94

slide-2
SLIDE 2

Introduction Feedforward Neural Networks

Introduction and historical perspective [Schmidhuber, 2015]: Deep Learning in Neural Networks: An Overview, Jürgen Schmidhuber, Neural Networks (61), pages 85-117

2 / 94

slide-3
SLIDE 3

Introduction Feedforward Neural Networks

Historical perspective on neural networks

  • Perceptron (Rosenblatt, 1962): linear classifier
  • AdaLinE (Widrow, Hoff, 1962): linear regressor
  • Minsky/Papert (1969): first winter
  • Convolutional Neural Networks (1980, 1998): great!
  • Multilayer Perceptron and backprop (Rumelhart, 1986): great!
  • but hard to train, and SVMs came in the 1990s...: second winter
  • 2006: pretraining!
  • 2012: AlexNet on ImageNet (10% better on test than the 2nd)
  • since then: lots of state-of-the-art neural networks

3 / 94

slide-4
SLIDE 4

Introduction Feedforward Neural Networks

Some reasons for the current success

  • GPUs (speed of processing) / Data (regularizing)
  • theoretical understanding of the difficulty of training deep networks

Which libs?

  • Torch (Lua) / PyTorch, Caffe (Python/C++)
  • Theano/Lasagne (Python, RIP 2017), TensorFlow (Google, Python, C++), Keras (wrapper over TensorFlow and Theano), CNTK, MXNet, Chainer, DyNet, ...

4 / 94

slide-5
SLIDE 5

Introduction Feedforward Neural Networks

What to read ?

www.deeplearningbook.org Goodfellow, Bengio, Courville(2016)

Who to follow?

  • N-1: LeCun, Bengio, Hinton, Schmidhuber
  • N: Goodfellow, Dauphin, Graves, Sutskever, Karpathy, Krizhevsky, ...
  • Obviously, the community is much larger.

Which conferences?

ICML, NIPS, ICLR, ...

https://github.com/terryum/awesome-deep-learning-papers

5 / 94

slide-6
SLIDE 6

Introduction Feedforward Neural Networks

What is a neural network

The tree

A neural network is a directed graph:

  • edges: weighted connections
  • nodes: computational units
  • no cycle: feedforward neural networks (FNN)
  • with cycles: recurrent neural networks (RNN)

hides the jungle

What is a convolutional neural network with a softmax output, ReLU hidden activations, batch normalization layers, trained with RMSprop with Nesterov momentum and regularized with dropout?

6 / 94

slide-7
SLIDE 7

Introduction Feedforward Neural Networks

Feedforward Neural Networks (FNN)

[Figure: a feedforward network with an input layer, hidden layers and an output layer, plus a skip-layer connection from the input to the output.]

7 / 94

slide-8
SLIDE 8

Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

  • classification, Given (xi, yi), yi ∈ {−1, 1}
  • SAR Architecture, Basis functions φj(x) with φ0(x) = 1
  • Algorithm
  • Geometrical interpretation

[Figure: Sensory-Associative-Result (SAR) architecture: sensory inputs x0..x3, associative units a_j = φ_j(x), weighted sums Σ with weights w_ij, and a threshold g producing the results r0..r2.]

8 / 94

slide-9
SLIDE 9

Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Classifier

Given feature functions φj, with φ0(x) = 1, the perceptron classifies x as:

y = g(wᵀφ(x))    (1)

g(x) = −1 if x < 0, +1 if x ≥ 0    (2)

with φ(x) ∈ R^(na+1), φ(x) = [1, φ1(x), φ2(x), ...]ᵀ

9 / 94

slide-10
SLIDE 10

Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Online Training algorithm

Given (xi, yi), yi ∈ {−1, 1}, the perceptron learning rule operates online:

w = w           if the input is correctly classified
w = w + φ(xi)   if the input is incorrectly classified as −1
w = w − φ(xi)   if the input is incorrectly classified as +1    (3)

10 / 94

slide-11
SLIDE 11

Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Geometrical interpretation

y = g(wᵀφ(x)). Cases when a sample is correctly classified:

[Figure: for yi = +1, φ(xi) lies on the same side as w; for yi = −1, φ(xi) lies on the opposite side.]

11 / 94

slide-12
SLIDE 12

Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Geometrical interpretation

y = g(wᵀφ(x)). Cases when a sample is misclassified:

[Figure: for yi = +1, the update w + φ(xi) rotates w toward φ(xi); for yi = −1, the update w − φ(xi) rotates w away from φ(xi).]

12 / 94

slide-13
SLIDE 13

Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

The cone of feasible solutions

Consider two samples x1, x2 with y1 = +1, y2 = −1

[Figure: the hyperplanes vᵀφ(x1) = 0 and vᵀφ(x2) = 0 delimit the cone of weight vectors that classify both samples correctly.]

13 / 94

slide-14
SLIDE 14

Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Online Training algorithm

Given (xi, yi), yi ∈ {−1, 1}, the perceptron learning rule operates online:

w = w           if the input is correctly classified
w = w + φ(xi)   if the input is incorrectly classified as −1
w = w − φ(xi)   if the input is incorrectly classified as +1    (4)

Equivalently:

w = w           if g(wᵀφ(xi)) = yi
w = w + φ(xi)   if g(wᵀφ(xi)) = −1 and yi = +1
w = w − φ(xi)   if g(wᵀφ(xi)) = +1 and yi = −1    (5)

14 / 94

slide-15
SLIDE 15

Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Online Training algorithm

Given (xi, yi), yi ∈ {−1, 1}, the perceptron learning rule operates online:

w = w           if g(wᵀφ(xi)) = yi
w = w + φ(xi)   if g(wᵀφ(xi)) = −1 and yi = +1
w = w − φ(xi)   if g(wᵀφ(xi)) = +1 and yi = −1

i.e.

w = w              if g(wᵀφ(xi)) = yi
w = w + yi φ(xi)   if g(wᵀφ(xi)) ≠ yi

which can be written compactly (see the sketch below):

w = w + ½ (yi − ŷi) φ(xi),   with ŷi = g(wᵀφ(xi))
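As an illustration of the compact update above, here is a minimal NumPy sketch of the online perceptron, assuming the feature map φ simply prepends the constant 1 to x (the toy dataset and loop length are illustrative, not from the slides):

```python
import numpy as np

def phi(x):
    # Illustrative feature map: prepend the constant feature phi_0(x) = 1.
    return np.concatenate(([1.0], x))

def perceptron_train(X, y, n_epochs=100):
    """Online perceptron: X is (N, d), labels y in {-1, +1}."""
    w = np.zeros(X.shape[1] + 1)
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            y_hat = 1.0 if w @ phi(xi) >= 0 else -1.0   # y_hat = g(w^T phi(x))
            # Compact form of the rule: w <- w + 1/2 (y - y_hat) phi(x)
            w += 0.5 * (yi - y_hat) * phi(xi)
    return w

# Toy linearly separable problem (AND-like labels)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., -1., -1., 1.])
w = perceptron_train(X, y)
print(w, [1 if w @ phi(x) >= 0 else -1 for x in X])
```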

15 / 94

slide-16
SLIDE 16

Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Definition (Linear separability)

A binary classification problem (xi, yi) ∈ Rd × {−1, 1}, i ∈ [1..N] is said to be linearly separable if there exists w ∈ Rd such that : ∀i, sign(wTxi) = yi with ∀x < 0, sign(x) = −1, ∀x ≥ 0, sign(x) = +1.

Theorem (Perceptron convergence theorem)

A classification problem (xi, yi) ∈ Rd × {−1, 1}, i ∈ [1..N] is linearly separable if and only if the perceptron learning rule converges to an optimal solution in a finite number of steps.
⇐: easy; ⇒: we upper/lower bound ||w(t)||₂²

16 / 94

slide-17
SLIDE 17

Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

  • wt = w0 + Σ_{i∈I(t)} yi φ(xi), with I(t) the set of misclassified samples
  • it minimizes a loss: J(w) = (1/N) Σi max(0, −yi wᵀφ(xi))
  • the solution can be written as wt = w0 + Σi ½ (yi − ŷi) φ(xi); (yi − ŷi) is the prediction error

17 / 94

slide-18
SLIDE 18

Introduction Feedforward Neural Networks

Kernel Perceptron

Any linear predictor involving only scalar products can be kernelized (kernel trick, cf. SVM). Given w(t) = w0 + Σ_{i∈I} yi xi:

⟨w, x⟩ = ⟨w0, x⟩ + Σ_{i∈I} yi ⟨xi, x⟩  ⇒  k(w, x) = k(w0, x) + Σ_{i∈I} yi k(xi, x)

[Figure: decision boundary of a kernel perceptron on a 2D dataset that is not linearly separable.]

18 / 94

slide-19
SLIDE 19

Introduction Feedforward Neural Networks

Adaptive Linear Elements (Widrow, Hoff, 1962)

Linear regression, analytically

  • Given (xi, yi), yi ∈ R
  • minimize J(w) = (1/N) Σi ||yi − wᵀxi||²
  • Analytically: ∇w J(w) = 0 ⇒ X Xᵀ w = X y
  • X Xᵀ non-singular: w = (X Xᵀ)⁻¹ X y
  • X Xᵀ singular (e.g. points along a line in 2D): infinitely many solutions
  • regularized least squares: min G(w) = J(w) + α wᵀw
  • ∇w G(w) = 0 ⇒ (X Xᵀ + αI) w = X y
  • as soon as α > 0, (X Xᵀ + αI) is not singular

Needs to compute X Xᵀ, i.e. over the whole training set... (a sketch follows below)
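A minimal NumPy sketch of the analytic solutions above, following the slide's column convention (X holds one sample per column, so the normal equations read (X Xᵀ + αI) w = X y); the data and α are illustrative:

```python
import numpy as np

np.random.seed(0)
N, d = 100, 3
X = np.random.randn(d, N)                  # one sample per column, as on the slide
w_true = np.array([1.0, -2.0, 0.5])
y = w_true @ X + 0.01 * np.random.randn(N)

# Ordinary least squares: solve (X X^T) w = X y  (X X^T assumed non-singular)
w_ols = np.linalg.solve(X @ X.T, X @ y)

# Regularized (ridge) least squares: (X X^T + alpha I) w = X y, non-singular for alpha > 0
alpha = 0.1
w_ridge = np.linalg.solve(X @ X.T + alpha * np.eye(d), X @ y)

print(w_ols, w_ridge)
```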

19 / 94

slide-20
SLIDE 20

Introduction Feedforward Neural Networks

Adaptive Linear Elements (Widrow, Hoff, 1962)

Linear regression with stochastic gradient descent

  • start at w0
  • take each sample one after the other (online): xi, yi
  • denote ŷi = wᵀxi the prediction
  • update wt+1 = wt − ε ∇w J(wt) = wt + ε (yi − ŷi) xi
  • delta rule: with the prediction error δ = (yi − ŷi), the update reads wt+1 = wt + ε δ xi
  • note the similarity with the perceptron learning rule

The samples xi are supposed to be "extended" with one dimension set to 1 (a sketch follows below).
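A minimal sketch of the delta rule on a toy linear regression, with the inputs extended by a constant dimension as noted above (the data generation and learning rate are illustrative):

```python
import numpy as np

np.random.seed(0)
N = 30
x = np.random.uniform(-10, 10, N)
y = 3 * x + 2 + np.random.uniform(-0.1, 0.1, N)

# Extend each sample with a constant dimension set to 1 (bias)
X = np.stack([x, np.ones(N)], axis=1)

w = np.zeros(2)
eps = 0.005
for epoch in range(100):
    for xi, yi in zip(X, y):
        y_hat = w @ xi                # prediction
        delta = yi - y_hat            # prediction error
        w += eps * delta * xi         # delta rule: w <- w + eps * delta * x
print(w)   # close to [3, 2]
```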

20 / 94

slide-21
SLIDE 21

Introduction Feedforward Neural Networks

Batch/Minibatch/Stochastic gradient descent

J(w, x, y) = (1/N) Σ_{i=1..N} L(w, xi, yi),   e.g. L(w, xi, yi) = ||yi − wᵀxi||²

Batch gradient descent

  • compute the gradient of the loss J(w) over the whole training set
  • perform one step in the direction of −∇w J(w, x, y):
    wt+1 = wt − εt ∇w J(w, x, y)
  • ε: learning rate

21 / 94

slide-22
SLIDE 22

Introduction Feedforward Neural Networks

Batch/Minibatch/Stochastic gradient descent

J(w, x, y) = (1/N) Σ_{i=1..N} L(w, xi, yi),   e.g. L(w, xi, yi) = ||yi − wᵀxi||²

Stochastic gradient descent (SGD)

  • one sample at a time, noisy estimate of ∇w J
  • perform one step in the direction of −∇w L(w, xi, yi):
    wt+1 = wt − εt ∇w L(w, xi, yi)
  • faster to converge than batch gradient descent

22 / 94

slide-23
SLIDE 23

Introduction Feedforward Neural Networks

Batch/Minibatch/Stochastic gradient descent

J(w, x, y) = (1/N) Σ_{i=1..N} L(w, xi, yi),   e.g. L(w, xi, yi) = ||yi − wᵀxi||²

Minibatch

  • noisy estimate of the true gradient with M samples (e.g. M = 64, 128); M is the minibatch size
  • draw a random subset J with |J| = M, one minibatch at a time:
    wt+1 = wt − εt (1/M) Σ_{j∈J} ∇w L(w, xj, yj)
  • smoother estimate than SGD
  • great for parallel architectures (GPU); a sketch follows below
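A generic minibatch loop written for the same linear regression loss; the gradient function, minibatch size M and learning rate are illustrative:

```python
import numpy as np

def grad_L(w, Xb, yb):
    # Gradient of the L2 loss ||y - w^T x||^2 averaged over the minibatch
    return -2.0 * Xb.T @ (yb - Xb @ w) / len(yb)

np.random.seed(0)
N, d, M, eps = 256, 2, 64, 0.05
X = np.random.randn(N, d)
y = X @ np.array([3.0, -1.0]) + 0.01 * np.random.randn(N)

w = np.zeros(d)
for epoch in range(50):
    perm = np.random.permutation(N)          # randomize the minibatches
    for start in range(0, N, M):
        batch = perm[start:start + M]
        w -= eps * grad_L(w, X[batch], y[batch])
print(w)   # close to [3, -1]
```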

23 / 94

slide-24
SLIDE 24

Introduction Feedforward Neural Networks

Does it make sense to use gradient descent ?

Convex function

A function f: Rⁿ → R is convex:

1. ⇔ ∀x1, x2 ∈ Rⁿ, ∀t ∈ [0, 1], f(t x1 + (1 − t) x2) ≤ t f(x1) + (1 − t) f(x2)
2. with f twice differentiable, ⇔ ∀x ∈ Rⁿ, H = ∇²f(x) is positive semidefinite, i.e. ∀x ∈ Rⁿ, xᵀHx ≥ 0

For a convex function f, all local minima are global minima. Our losses are lower bounded, so these minima exist. Under mild conditions, gradient descent and stochastic gradient descent converge, typically with Σt εt = ∞, Σt εt² < ∞ (cf. lectures on convex optimization).

24 / 94

slide-25
SLIDE 25

Introduction Feedforward Neural Networks

Does it make sense to use gradient descent ?

Linear regression with L2 loss is convex

Indeed,

  • Given xi, yi, L(w) = ½ (wᵀxi − yi)² is convex:
    ∇w L = (wᵀxi − yi) xi
    ∇²w L = xi xiᵀ
    ∀x ∈ Rⁿ, xᵀ xi xiᵀ x = (xiᵀ x)² ≥ 0
  • a non-negative weighted sum of convex functions is convex

25 / 94

slide-26
SLIDE 26

Introduction Feedforward Neural Networks

Linear regression, synthesis

Linear regression

  • samples (xi, yi), yi ∈ R
  • extend xi by adding a constant dimension equal to 1, which accounts for the bias
  • Linear model: ŷi = wᵀxi
  • L2 loss: L(ŷ, y) = ½ ||ŷ − y||²
  • by gradient descent:
    ∇w L(w, xi, yi) = (∂L/∂ŷ)(∂ŷ/∂w) = −(yi − ŷi) xi

26 / 94

slide-27
SLIDE 27

Introduction Feedforward Neural Networks

Linear classification, synthesis

Linear binary classification (logistic regression)

  • samples (xi, yi), yi ∈ {0, 1}
  • extend xi by adding a constant dimension equal to 1, which accounts for the bias
  • Linear model: wᵀx
  • sigmoid transfer function: ŷi = σ(wᵀxi), with σ(x) = 1 / (1 + exp(−x)), σ(x) ∈ [0, 1], and dσ/dx = σ(x)(1 − σ(x))
  • Cross entropy loss: L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)
  • by gradient descent (a sketch follows below):
    ∇w L(w, xi, yi) = (∂L/∂ŷ)(∂ŷ/∂w) = −(yi − ŷi) xi
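A minimal sketch of logistic regression trained with SGD and the cross-entropy loss; thanks to the cancellation above, the per-sample gradient is simply −(yi − ŷi) xi (the data and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
N = 200
X = np.random.randn(N, 2)
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)      # labels in {0, 1}

Xe = np.hstack([X, np.ones((N, 1))])               # extended with a constant 1 (bias)
w = np.zeros(3)
eps = 0.1
for epoch in range(50):
    for xi, yi in zip(Xe, y):
        y_hat = sigmoid(w @ xi)
        w += eps * (yi - y_hat) * xi               # minus the cross entropy gradient
acc = np.mean((sigmoid(Xe @ w) > 0.5) == y)
print(w, acc)
```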

27 / 94

slide-28
SLIDE 28

Introduction Feedforward Neural Networks

Linear classification, synthesis

Logistic regression is convex

Indeed,

  • Given xi, yi = 1:
    L1(w) = −log(σ(wᵀxi)) = log(1 + exp(−wᵀxi))
    ∇w L1 = −(1 − σ(wᵀxi)) xi
    ∇²w L1 = σ(wᵀxi)(1 − σ(wᵀxi)) xi xiᵀ, with σ(wᵀxi)(1 − σ(wᵀxi)) > 0
  • Given xi, yi = 0:
    L2(w) = −log(1 − σ(wᵀxi))
    ∇w L2 = σ(wᵀxi) xi
    ∇²w L2 = σ(wᵀxi)(1 − σ(wᵀxi)) xi xiᵀ, with σ(wᵀxi)(1 − σ(wᵀxi)) > 0
  • a non-negative weighted sum of convex functions is convex

28 / 94

slide-29
SLIDE 29

Introduction Feedforward Neural Networks

Why L2 loss for linear classification with SGD is bad

Compute the gradient to see why...

  • Take the L2 loss: L(ŷ, y) = ½ ||ŷ − y||²
  • Take the "linear" model: ŷi = σ(wᵀxi)
  • Check that dσ/dx = σ(x)(1 − σ(x))
  • Compute the gradient w.r.t. w:
    ∇w L(w, xi, yi) = (∂L/∂ŷ)(∂ŷ/∂w) = −(yi − ŷi) σ(wᵀxi)(1 − σ(wᵀxi)) xi
  • If xi is strongly misclassified (e.g. yi = 1 and wᵀxi very negative), then σ(wᵀxi)(1 − σ(wᵀxi)) ≈ 0, i.e. ∇w L(w, xi, yi) ≈ 0 ⇒ the step size is very small while the sample is still misclassified.

With a cross entropy loss, ∇w L(w, xi, yi) is proportional to the error.

29 / 94

slide-30
SLIDE 30

Introduction Feedforward Neural Networks

Linear classification, synthesis

Linear multiclass classification

  • samples (xi, yi), labels yi ∈ {0, ..., k − 1}
  • extend xi by adding a constant dimension equal to 1, which accounts for the bias
  • Linear model for each class: wjᵀ x
  • softmax transfer function: P(y = j | x) = ŷj = exp(wjᵀ x) / Σk exp(wkᵀ x)
  • generalization of the sigmoid to a vectorial output
  • Cross entropy loss: L(ŷ, y) = −log ŷ_y
  • by gradient descent (a sketch follows below):
    ∇wj L(w, x, y) = Σk (∂L/∂ŷk)(∂ŷk/∂wj) = −(δ_{j,y} − ŷj) x
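A sketch of one gradient step for the multiclass case; the gradient w.r.t. the row wj is −(δ_{j,y} − ŷj) x, i.e. in matrix form (ŷ − one_hot(y)) xᵀ (the shapes and names are mine, not from the slides):

```python
import numpy as np

def softmax(a):
    a = a - a.max()                      # for numerical stability
    e = np.exp(a)
    return e / e.sum()

def grad_step(W, x, y, eps=0.1):
    """One SGD step of softmax regression. W is (k, d), x is (d,), y in {0..k-1}."""
    y_hat = softmax(W @ x)               # class probabilities
    target = np.zeros(len(y_hat))
    target[y] = 1.0                      # delta_{j,y}
    grad = np.outer(y_hat - target, x)   # dL/dW, with L = -log y_hat[y]
    return W - eps * grad

W = np.zeros((3, 4))
x = np.array([1.0, -0.5, 2.0, 1.0])      # already extended with a constant 1
W = grad_step(W, x, y=2)
print(softmax(W @ x))                    # probability of class 2 increased
```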

30 / 94

slide-31
SLIDE 31

Introduction Feedforward Neural Networks

Perceptron and linear separability

Perceptrons perform linear separation in a predefined, fixed feature space.

The XOR

[Table: truth table of xor(x1, x2); it is not linearly separable in the inputs (x1, x2), but xor(x1, x2) = x1 x̄2 + x̄1 x2 becomes linearly separable in the features (x1 x̄2, x̄1 x2).]

Can we learn the φj(x)?

31 / 94

slide-32
SLIDE 32

Introduction Feedforward Neural Networks

Radial basis functions(RBF)

RBF (Broomhead, 1988)

  • RBF kernel: φ0(x) = 1, φj(x) = exp(−||x − µj||² / (2σj²))
  • for regression (L2 loss) or classification (cross entropy loss)
  • e.g. for regression:
    ŷ(x) = wᵀφ(x),   L(w, xi, yi) = ||yi − wᵀφ(xi)||²
  • What about the centers and variances? [Schwenker, 2001]
  • place them uniformly, randomly, or by vector quantization (k-means, GNG [Fritzke, 1994])
  • two phases: fix the centers/variances, fit the weights
  • three phases: fix the centers/variances, fit the weights, then fit everything (∇µL, ∇σL, ∇wL)

32 / 94

slide-33
SLIDE 33

Introduction Feedforward Neural Networks

Radial basis functions(RBF)

RBF are universal approximators [Park, Sandberg (1991)]

Denote by S the family of functions based on RBF in Rd:
S = {g: Rd → R, g(x) = Σi wi φi(x), w ∈ R^N}
Then S is dense in Lp(Rd) for every p ∈ [1, ∞).
Actually, the theorem applies to a larger class of functions φi.

33 / 94

slide-34
SLIDE 34

Introduction Feedforward Neural Networks

Feedforward neural networks (or MLP [Rumelhart, 1986])

Input, Hidden layers, Output
Layer 0, Layer 1, ..., Layer L-1, Layer L

[Figure: fully connected feedforward network with inputs x0, x1, x2, x3; each layer also receives a constant input 1 for its biases, and the weights of layer l are denoted w(l)_ij.]

a(1)_i = Σ_j w(1)_ij x_j,              y(1)_i = g(a(1)_i)
a(L-1)_i = Σ_j w(L-1)_ij y(L-2)_j,     y(L-1)_i = g(a(L-1)_i)
a(L)_i = Σ_j w(L)_ij y(L-1)_j,         y(L)_i = f(a(L)_i)

Named MLP for historical reasons. Should be called FNN.

34 / 94

slide-35
SLIDE 35

Introduction Feedforward Neural Networks

Feedforward neural networks

Input, Hidden layers, Output
Layer 0, Layer 1, ..., Layer L-1, Layer L

[Same network diagram and layer equations as on the previous slide.]

Architecture

  • Depth: number of layers, not counting the input; deep = large depth
  • Width: number of units per layer
  • weights and biases for each unit
  • Hidden transfer function f, output transfer function g (a sketch follows below)
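A minimal forward pass matching the layer equations of the previous slides, with ReLU hidden activations and a linear output (the layer sizes and names are illustrative):

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def forward(x, weights, biases):
    """weights[l], biases[l] parameterize layer l+1; hidden activation = ReLU, output = identity."""
    y = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = W @ y + b          # a^(l) = sum_j w_ij^(l) y_j^(l-1) + bias
        y = relu(a)            # y^(l) = hidden transfer function of a^(l)
    W, b = weights[-1], biases[-1]
    return W @ y + b           # output layer with a linear transfer function

rng = np.random.default_rng(0)
sizes = [3, 5, 5, 2]                                   # input, two hidden layers, output
weights = [rng.normal(size=(n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
print(forward(rng.normal(size=3), weights, biases))
```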

35 / 94

slide-36
SLIDE 36

Introduction Feedforward Neural Networks

Feedforward neural networks

Architecture

Hidden transfer function

  • Historically, f was taken as a sigmoid or tanh.
  • Now, mainly Rectified Linear Units (ReLU) or similar: f(x) = max(x, 0)
  • ReLUs are more favorable for the gradient flow than saturating functions [Krizhevsky(2012), Nair(2010), Jarrett(2009)]

[Figure: tanh, sigmoid, and ReLU transfer functions.]

36 / 94

slide-37
SLIDE 37

Introduction Feedforward Neural Networks

Feedforward neural networks

Architecture

Output transfer function and loss

  • for regression:
    • linear f(x) = x
    • L2 loss L(ŷ, y) = ||y − ŷ||²
  • for multiclass classification:
    • softmax ŷj = e^{aj} / Σk e^{ak}
    • negative log-likelihood loss L(ŷ, y) = −log(ŷ_y)

37 / 94

slide-38
SLIDE 38

Introduction Feedforward Neural Networks

FNN training : error backpropagation

Training by gradient descent

  • initialize weights and biases w0
  • at every iteration, compute: w ← w − ε ∇w J

The partial derivatives ∂J/∂wi?

Fundamentally, use the chain rule within the computational graph linking any variable (inputs, weights, biases) to the output of the loss.

Backprop is usually attributed to [Rumelhart, 1986] but [Werbos, 1981] already introduced the idea.

38 / 94

slide-39
SLIDE 39

Introduction Feedforward Neural Networks

Computing partial derivatives

Computational graph

A computational graph is a directed graph

  • nodes : variables (weights, inputs, outputs, targets,..)
  • edges : operations (ReLu, Softmax, wTx + b, .., Losses,..)

We only need to know, for each operation:

  • the partial derivatives wrt its parameters
  • the partial derivatives wrt its inputs

39 / 94

slide-40
SLIDE 40

Introduction Feedforward Neural Networks

Computing partial derivatives

∂J/∂wi

The chain rule: single path

Suppose there is a single path, e.g. xi → u1 → u2 → u3. Applying the chain rule:

∂u3/∂xi = (∂u3/∂u2)(∂u2/∂u1)(∂u1/∂xi) = ∂/∂xi (yi − wxi − b)²

with u3 = u2², u2 = yi − u1, u1 = wxi + b, so that

∂u3/∂xi = 2u2 · (−1) · w = −2w(yi − wxi − b)

40 / 94

slide-41
SLIDE 41

Introduction Feedforward Neural Networks

Computing partial derivatives

∂J/∂wi

The chain rule: multiple paths

Sum over all the paths (e.g. u3 = w1 xi + w2 xi):

∂u3/∂xi = Σ_{j∈{1,2}} (∂u3/∂uj)(∂uj/∂xi)

with u3 = u1 + u2, u2 = w2 xi, u1 = w1 xi, so that ∂u3/∂xi = 1·w2 + 1·w1

41 / 94

slide-42
SLIDE 42

Introduction Feedforward Neural Networks

But, it is computationally expensive

There are a lot of paths...

There are 4 paths from xi to u5

42 / 94

slide-43
SLIDE 43

Introduction Feedforward Neural Networks

Let us be more efficient: Forward-mode differentiation

Forward differentiation

Idea: to compute ∂u5/∂xi, propagate ∂/∂xi forward:

∂u1/∂xi = (∂u1/∂z1)(∂z1/∂xi) + (∂u1/∂z2)(∂z2/∂xi) = z2·1 + z1·0 = z2 = w1

But how to compute ∂u5/∂w1? Well, propagate ∂/∂w1. And ∂u5/∂w2? Propagate again... or...

Griewank (2010), Who Invented the Reverse Mode of Differentiation?
http://colah.github.io/posts/2015-08-Backprop/

43 / 94

slide-44
SLIDE 44

Introduction Feedforward Neural Networks

Let us be even more efficient: reverse-mode differentiation

Reverse differentiation

Idea: to compute ∂u5/∂xi, backpropagate ∂u5/∂· :

∂u5/∂u1 = (∂u5/∂u3)(∂u3/∂u1) + (∂u5/∂u4)(∂u4/∂u1) = 1·w3 + 1·w6

We get ∂u5/∂xi, but also ∂u5/∂w1, ∂u5/∂w2, ... all in a single pass!

Griewank (2010), Who Invented the Reverse Mode of Differentiation?
http://colah.github.io/posts/2015-08-Backprop/

44 / 94

slide-45
SLIDE 45

Introduction Feedforward Neural Networks

FNN training : error backpropagation

In neural networks, reverse-mode differentiation is called error backpropagation.

Training in two phases

  • Evaluation of the output: forward propagation
  • Evaluation of the gradient: reverse-mode differentiation

Careful: reverse-mode differentiation reuses the activations computed during the forward propagation. Libraries like Theano augment the computational graph with nodes computing the gradient numerically by reverse-mode differentiation. A minimal sketch follows below.
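A tiny sketch of the two phases on the single-path example of the earlier slides, L = (y − wx − b)²: a forward pass that stores intermediate values, then a reverse pass that reuses them (everything here is illustrative):

```python
# Forward pass: store the intermediate activations
def forward(w, b, x, y):
    u1 = w * x + b
    u2 = y - u1
    u3 = u2 ** 2          # the loss
    return u1, u2, u3

# Reverse pass: chain rule, reusing the stored activations
def backward(w, b, x, y):
    u1, u2, u3 = forward(w, b, x, y)
    du3_du2 = 2 * u2
    du2_du1 = -1.0
    du1_dw, du1_db = x, 1.0
    dL_dw = du3_du2 * du2_du1 * du1_dw     # dL/dw
    dL_db = du3_du2 * du2_du1 * du1_db     # dL/db, obtained in the same backward pass
    return dL_dw, dL_db

print(backward(w=2.0, b=1.0, x=3.0, y=10.0))   # (-18.0, -6.0)
```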

45 / 94

slide-46
SLIDE 46

Introduction Feedforward Neural Networks

Universal approximator

Any well-behaved function can be arbitrarily well approximated with a single hidden layer FNN.

Intuition

  • Take a sigmoid transfer function f(x) = 1 / (1 + exp(−α(x − bi))): this is the hidden layer
  • subtract two such activations to get Gaussian-like kernels

[Figure: the difference of two shifted sigmoids produces a localized bump.]

  • weight such subtractions, and you are back to the RBFs

46 / 94

slide-47
SLIDE 47

Introduction Feedforward Neural Networks

But then, why deep networks ??

Going deeper

  • Single hidden layer FNNs are universal approximators, but the hidden layer can be arbitrarily large
  • A deep network (large number of layers) builds high-level features by composing lower-level features
  • A shallow network directly learns these high-level features
  • Image analogy:
    • first layers: extract oriented contours (e.g. Gabors)
    • second layers: learn corners by combining contours
    • next layers: build up more and more complex features
  • Theoretical works compare the expressiveness of depth-d FNNs with depth-(d-1) FNNs

Learning deep architectures for AI, Bengio (2009), chap. 2; Benefits of depth in ...

47 / 94

slide-48
SLIDE 48

Introduction Feedforward Neural Networks

And why ReLu ?

Vanishing/exploding gradient [Hochreiter(1991), Bengio(1994)]

Consider u2 = f(w u1)

  • Remember that when the gradient is "backpropagated", it involves
    ∂J/∂u1 = (∂J/∂u2)(∂u2/∂u1) = (∂J/∂u2) · w · f′(w u1)
  • backpropagated through L layers, this gives a factor (w·f′)^L
  • with f(x) = 1 / (1 + e^{−x}), f′(x) ≤ 1/4 < 1, with the maximum at x = 0
  • If w·f′ ≠ 1, (w·f′)^L → 0 or ∞
  • ⇒ the gradient vanishes or explodes

With ReLU, f′(x) ∈ {0, 1}. But you can get dead units.

48 / 94

slide-49
SLIDE 49

Introduction Feedforward Neural Networks

But the ReLus can die....

Why do they die?

If the input to a ReLU is negative, the gradient is 0; that's it... "forever" lost.

And then?

  • Add a linear component for negative x: Leaky ReLU, Parametric ReLU [He(2015)]
  • Exponential Linear Units [Clevert, Hochreiter(2016)]

[Figure: ReLU, Leaky ReLU, and ELU transfer functions.]

49 / 94

slide-50
SLIDE 50

Introduction Feedforward Neural Networks

How to deal with the vanishing/exploding gradient

Preventing vanishing gradients by preserving gradient flow

  • Use ReLU, Leaky ReLU, PReLU, ELU to ensure a good gradient flow
  • Specific architectures:
    • ResNet (CNN): shortcut connections
    • LSTM (RNN): constant error carousel

Preventing exploding gradients

Gradient clipping [Pascanu, 2013]: clip the norm of the gradient (a sketch follows below).

[Figure: loss surface of a network with 1 unit, σ(x), 50 layers.]
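A minimal sketch of gradient clipping by norm, in the spirit of [Pascanu, 2013]; the threshold is illustrative:

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])          # norm 50
print(clip_by_norm(g))               # rescaled to norm 5: [3., -4.]
```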

50 / 94

slide-51
SLIDE 51

Introduction Feedforward Neural Networks

Regularization. Like kids, FNNs can do a lot of things, but we must focus their expressiveness. Chap. 7, [Bengio et al.(2016)]

51 / 94

slide-52
SLIDE 52

Introduction Feedforward Neural Networks

Regularization

L2 regularization

Add an L2 penalty on the weights, with α > 0:

J(w) = L(w) + (α/2) ||w||₂² = L(w) + (α/2) wᵀw
∇w J = ∇w L + α w

Example: RBF, 1 kernel per sample, N = 30, noisy sinus.

[Figure: shown are the fits for α = 0 and α = 2, and the weights w⋆.]

Chap. 7 of [Bengio et al.(2016)] for a geometrical interpretation.

52 / 94

slide-53
SLIDE 53

Introduction Feedforward Neural Networks

Regularization

L2 regularization

In principle, we should not regularize the bias. Example:

J(w) = (1/N) Σ_{i=1..N} ||yi − w0 − Σ_{k≥1} wk xi,k||²

∇w0 J = 0 ⇒ w0 = (1/N) Σi yi − Σ_{k≥1} wk (1/N) Σi xi,k

e.g. if your data are centered, i.e. (1/N) Σi xi,k = 0, then w0 = (1/N) Σi yi.

Regularizing the bias might lead to underfitting.

53 / 94

slide-54
SLIDE 54

Introduction Feedforward Neural Networks

Regularization

L1 regularization promotes sparsity

Add an L1 penalty on the weights:

J(w) = L(w) + α ||w||₁ = L(w) + α Σk |wk|

∂J/∂wk = ∂L/∂wk + α sign(wk)

Example: RBF, 1 kernel per sample, N = 30, noisy sinus, α = 0.003.

[Figure: shown are the fits for α = 0 and α = 0.003, and the weights w⋆.]

54 / 94

slide-55
SLIDE 55

Introduction Feedforward Neural Networks

Regularization

Drop-out regularization [Srivastava(2014)]

Idea 1: prevent co-adaptation. A unit must be good by itself, not because others are doing part of the job.
Idea 2: combine an exponential number of networks (ensemble).

How:

  • For each minibatch, keep hidden and input activations with probability p (p = 0.5 for hidden, p = 0.8 for inputs). At test time, multiply all the activations by p.
  • "Inverted" dropout: scale by 1/p at training time; no scaling at test time (a sketch follows below).
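A minimal sketch of "inverted" dropout applied to one layer's activations: at training time, keep each unit with probability p and scale by 1/p; at test time, do nothing (p and shapes are illustrative):

```python
import numpy as np

def dropout(activations, p=0.5, train=True, rng=np.random.default_rng(0)):
    """Inverted dropout: scale by 1/p at training time, identity at test time."""
    if not train:
        return activations
    mask = (rng.random(activations.shape) < p) / p   # keep each unit with probability p
    return activations * mask

h = np.ones(10)
print(dropout(h, p=0.5, train=True))    # roughly half zeros, kept units scaled to 2.0
print(dropout(h, train=False))          # unchanged at test time
```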

55 / 94

slide-56
SLIDE 56

Introduction Feedforward Neural Networks

Regularization (dropout)

Srivastava(2014), Hinton(2012). Usually applied after FC layers (p = 0.5) and the input layer. Can be interpreted as training/averaging all the possible subnetworks.

56 / 94

slide-57
SLIDE 57

Introduction Feedforward Neural Networks

Regularization

Split your data in three sets :

  • training set : for training..
  • validation set : for choosing hyperparameters
  • test set : for estimating the generalization error

Early stopping

Idea : monitor your error on the validation set, U-shaped performance. Keep the model with the lowest validation error during training.

57 / 94

slide-58
SLIDE 58

Introduction Feedforward Neural Networks

Training by some form of gradient descent: w(t+1) ← w(t) − ε ∇w J(w(t))

* Chap 8 [Bengio et al. (2016)] * A. Karpathy : http://cs231n.github.io/neural-networks-3/

58 / 94

slide-59
SLIDE 59

Introduction Feedforward Neural Networks

But wait...

Does it make sense to apply gradient descent to neural networks?

  • we cannot get better than a local minimum?!
  • and neural networks lead to non-convex optimization problems, i.e. a lot of local minima (think about the symmetries)

But empirically, most local minima are close to the global minimum with large/deep networks.

Choromanska, 2015: The Loss Surface of Multilayer Nets
Dauphin, 2014: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
Pascanu, 2014: On the saddle point problem for non-convex optimization

59 / 94

slide-60
SLIDE 60

Introduction Feedforward Neural Networks

Identifying the critical points

The Hessian matrix

  • Matrix of second-order derivatives; informs on the local curvature:
    H(θ) = ∇²θ J, with entries [H]_ij = ∂²J / (∂θi ∂θj)
  • for a convex function: H is symmetric, positive semidefinite

60 / 94

slide-61
SLIDE 61

Introduction Feedforward Neural Networks

Identifying the type of critical points

Eigenvalues of H

  • a critical point is where ∇θ J = 0
  • if all eigenvalues(H) > 0: local minimum
  • if all eigenvalues(H) < 0: local maximum
  • if eigenvalues(H) are both positive and negative: saddle point

[Figure: the surfaces x² + y², −(x² + y²) and x² − y².]

61 / 94

slide-62
SLIDE 62

Introduction Feedforward Neural Networks

Identifying the type of critical points

And if H is degenerate

If H is degenerate (some eigenvalues = 0, det(H) = 0), we can have:

  • a local maximum: f(x, y) = −(x⁴ + y⁴), H(x, y) = diag(−12x², −12y²)
  • a local minimum: f(x, y) = x⁴ + y⁴, H(x, y) = diag(12x², 12y²)
  • a saddle point: f(x, y) = x³ + y², H(x, y) = diag(6x, 2)

62 / 94
slide-63
SLIDE 63

Introduction Feedforward Neural Networks

Local minima are not an issue with deep networks

Local minima and their loss

Experiment: one hidden layer, MNIST, trained with SGD. It converges mostly to local minima and some saddle points. The distribution of the test loss of the local minima tends to shrink. Index α: fraction of negative eigenvalues of the Hessian.
Choromanska, 2015: The Loss Surface of Multilayer Nets

63 / 94

slide-64
SLIDE 64

Introduction Feedforward Neural Networks

Saddle points seem to be the issue

Saddle points and their loss

Experiment: a "small MLP" trained with saddle-free Newton (which converges to critical points); Newton's method was used to discover the critical points. High-loss critical points are saddle points; low-loss critical points are local minima. Index α: fraction of negative eigenvalues of the Hessian.
Dauphin, 2014: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

64 / 94

slide-65
SLIDE 65

Introduction Feedforward Neural Networks

Training: 1st order methods

J(θ) ≈ J(θ0) + (θ − θ0)ᵀ ∇θJ(θ0),    θ ← θ − ε ∇θJ

Rationale: first-order approximation

Ĵ(θ) = J(θ0) + (θ − θ0)ᵀ ∇θJ(θ0)
(θ − θ0) = −ε ∇θJ(θ0)  ⇒  Ĵ(θ) = J(θ0) − ε ||∇θJ(θ0)||²

For small ||ε ∇θJ(θ0)||:  J(θ0 − ε ∇θJ(θ0)) ≤ J(θ0)

65 / 94

slide-66
SLIDE 66

Introduction Feedforward Neural Networks

Training, 1st order methods

Minibatch stochastic gradient descent

  • start at θ0
  • for every minibatch:
    θ(t+1) = θ(t) − ε ∇θ ((1/M) Σi Ji(θ))
  • M = 1: very noisy estimate, stochastic gradient descent
  • M = N: true gradient, batch gradient descent
  • (minibatch) SGD converges faster
  • The trajectory may converge slowly or diverge if ε is not appropriate

66 / 94

slide-67
SLIDE 67

Introduction Feedforward Neural Networks

Training, 1st order methods

Stochastic gradient descent: example

Take N = 30 samples with y = 3x + 2 + U(−0.1, 0.1). Let us perform linear regression (ŷ = wx + b, L2 loss) with SGD.

[Figure: the 30 samples (x, y) and the fitted line.]

67 / 94

slide-68
SLIDE 68

Introduction Feedforward Neural Networks

Training, 1st order methods

Stochastic gradient descent: zigzag

ε = 0.005, b0 = 10, w0 = 5. Converges to w⋆ = 2.9975, b⋆ = 1.9882.

[Figure: trajectory in the (b, w) plane and log(J) over 1000 iterations.]

68 / 94

slide-69
SLIDE 69

Introduction Feedforward Neural Networks

Training, 1st order methods

Momentum

Idea: damp the oscillations with a low-pass filter on ∇θ (a sketch follows below).

  • Start at θ0, v = 0
  • for every minibatch:
    v(t+1) = α v(t) − ε ∇θJ
    θ(t+1) = θ(t) + v(t+1)

Usually, α ≈ 0.9. Experiment on http://distill.pub/2017/momentum/
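A minimal sketch of the momentum update on the linear regression example of the previous slides; the hyperparameters follow the next slide (ε = 0.005, α = 0.6), while the data generation is illustrative:

```python
import numpy as np

np.random.seed(0)
x = np.random.uniform(-10, 10, 30)
y = 3 * x + 2 + np.random.uniform(-0.1, 0.1, 30)
X = np.stack([x, np.ones_like(x)], axis=1)           # parameters theta = [w, b]

def grad(theta, xi, yi):
    return -(yi - theta @ xi) * xi                    # per-sample L2 loss gradient

theta = np.array([5.0, 10.0])                         # w0 = 5, b0 = 10
v = np.zeros(2)
eps, alpha = 0.005, 0.6
for epoch in range(100):
    for xi, yi in zip(X, y):
        v = alpha * v - eps * grad(theta, xi, yi)     # low-pass filter on the gradient
        theta = theta + v
print(theta)    # close to [3, 2]
```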

69 / 94

slide-70
SLIDE 70

Introduction Feedforward Neural Networks

Training, 1st order methods

Stochastic Gradient descent with momentum

ǫ = 0.005, α = 0.6, b0 = 10, w0 = 5 Converges to w⋆ = 2.9933, b⋆ = 1.9837

5 5 10 b 5 5 10 w 200 400 600 800 1000 iteration 5 4 3 2 1 1 2 log(J)

Adviced : set α ∈ {0.5, 0.9, 0.99}

70 / 94

slide-71
SLIDE 71

Introduction Feedforward Neural Networks

Training, 1st order methods

SGD without/with momentum

[Figure: log(J) over 1000 iterations, without momentum (left) and with momentum (right).]

71 / 94

slide-72
SLIDE 72

Introduction Feedforward Neural Networks

Training, 1st order methods

Nesterov momentum [Sutskever, PhD thesis]

Idea: look ahead to potentially correct the update. Based on Nesterov's Accelerated Gradient.

  • Start at θ0, v = 0
  • for every minibatch:
    θ̃(t+1) = θ(t) + α v(t)
    v(t+1) = α v(t) − ε ∇θJ(θ̃(t+1))
    θ(t+1) = θ(t) + v(t+1)

72 / 94

slide-73
SLIDE 73

Introduction Feedforward Neural Networks

Training, 1st order methods

SGD with Nesterov momentum

ε = 0.005, α = 0.8, b0 = 10, w0 = 5. Converges to w⋆ = 2.9914, b⋆ = 1.9738.

[Figure: trajectory in the (b, w) plane and log(J) over 1000 iterations.]

In this experiment, Nesterov momentum allowed a larger momentum coefficient; with α = 0.8, plain momentum oscillates strongly.

73 / 94

slide-74
SLIDE 74

Introduction Feedforward Neural Networks

Training, 1st order methods

SGD/ SGD momentum / SGD nesterov momentum

[Figure: log(J) over 1000 iterations for SGD, SGD with momentum, and SGD with Nesterov momentum.]

74 / 94

slide-75
SLIDE 75

Introduction Feedforward Neural Networks

Training 1st order methods with adaptive learning rates

75 / 94

slide-76
SLIDE 76

Introduction Feedforward Neural Networks

Training : adapting the learning rate

Learning rate annealing

Some possible schedules :

  • Linear decay between ǫ0 and ǫτ.
  • halve the learning rate when validation error stops improving

76 / 94

slide-77
SLIDE 77

Introduction Feedforward Neural Networks

Training : adapting the learning rate

Adagrad [Duchi, 2011]

  • Accumulate the square of the gradients:
    r(t+1) = r(t) + ∇θJ(θ(t)) ⊙ ∇θJ(θ(t))
  • Scale the learning rates individually:
    θ(t+1) = θ(t) − (ε / (δ + √r(t+1))) ⊙ ∇θJ(θ(t))

The square root is experimentally critical. δ ≈ [1e−8, 1e−4], for numerical stability.
Small gradients ⇒ bigger learning rate; big gradients ⇒ smaller learning rate.
Accumulating from the beginning is too aggressive: the learning rates decrease too fast.

77 / 94

slide-78
SLIDE 78

Introduction Feedforward Neural Networks

Training : adapting the learning rate

RMSProp [Hinton, unpublished]

Idea: use an exponentially decaying average of the squared gradients (a sketch follows below).

  • Accumulate the square of the gradients:
    r(t+1) = ρ r(t) + (1 − ρ) ∇θJ(θ(t)) ⊙ ∇θJ(θ(t))
  • Scale the learning rates individually:
    θ(t+1) = θ(t) − (ε / (δ + √r(t+1))) ⊙ ∇θJ(θ(t))

ρ ≈ 0.9. And some others: Adadelta [Zeiler, 2012], Adam [Kingma, 2014], ...
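A minimal sketch of the RMSProp update for a parameter vector theta, given a function returning its gradient; the toy objective and hyperparameters are illustrative:

```python
import numpy as np

def rmsprop_step(theta, r, grad, eps=1e-3, rho=0.9, delta=1e-8):
    """One RMSProp update; r is the running average of squared gradients."""
    r = rho * r + (1 - rho) * grad * grad              # exponentially decaying average
    theta = theta - eps / (delta + np.sqrt(r)) * grad  # per-coordinate learning rates
    return theta, r

# Toy quadratic J(theta) = 0.5 * theta^T diag(1, 100) theta: badly conditioned
theta, r = np.array([1.0, 1.0]), np.zeros(2)
for t in range(2000):
    grad = np.array([1.0, 100.0]) * theta
    theta, r = rmsprop_step(theta, r, grad)
print(theta)     # close to [0, 0]
```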

78 / 94

slide-79
SLIDE 79

Introduction Feedforward Neural Networks

Training with 1st order methods

So, which one do I use?

[Bengio et al. 2016]: "There is currently no consensus [...] no single best algorithm has emerged [...] the most popular and actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, Adadelta and Adam."

  • A. Karpathy
  • Schaul (2014). Unit Tests for Stochastic Optimization

79 / 94

slide-80
SLIDE 80

Introduction Feedforward Neural Networks

Training: a glimpse into 2nd order methods

J(θ) ≈ J(θ0) + (θ − θ0)ᵀ ∇θJ(θ0) + ½ (θ − θ0)ᵀ ∇²θJ(θ0) (θ − θ0)

∇θJ(θ0): gradient vector;   ∇²θJ(θ0): Hessian matrix

Idea: use a better local approximation to make a more informed update.

80 / 94

slide-81
SLIDE 81

Introduction Feedforward Neural Networks

Training : 2nd order methods

Newton method

From a 2nd order Taylor approximation:
J(θ) ≈ J(θ0) + (θ − θ0)ᵀ ∇θJ(θ0) + ½ (θ − θ0)ᵀ ∇²J(θ0) (θ − θ0)
Critical point at ∇θJ(θ) = 0 ⇒ θ = θ0 − H⁻¹ ∇θJ(θ0)

Critical points (min, max, saddle) are attractors for Newton!

  • cool: we can locate critical points
  • but: do not use it for optimizing a neural network!

Dauphin (2014): Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

81 / 94

slide-82
SLIDE 82

Introduction Feedforward Neural Networks

Training : 2nd order methods

Second order methods require a larger batch size.

Some algorithms

  • Conjugate gradient: no need to compute the Hessian; guaranteed to converge in k steps for a k-dimensional quadratic function
  • Saddle-free Newton [Dauphin, 2014]
  • Hessian-free optimization (truncated Newton) [Martens, 2010]
  • BFGS (quasi-Newton): approximation of H⁻¹, which must be stored and is large for deep networks
  • L-BFGS: limited-memory BFGS

82 / 94

slide-83
SLIDE 83

Introduction Feedforward Neural Networks

Initialization and the importance of good activation distributions

83 / 94

slide-84
SLIDE 84

Introduction Feedforward Neural Networks

Preprocessing your inputs

Gradient descent converges faster if your data are normalized and decorrelated. Denote by xi ∈ Rd your input data and by x̂i its normalized version.

Input normalization

  • Min-max scaling:
    ∀i, j:  x̂i,j = (xi,j − mink xk,j) / (maxk xk,j − mink xk,j + ε)
  • Z-score normalization (goal: µ̂j = 0, σ̂j = 1):
    ∀i, j:  x̂i,j = (xi,j − µj) / (σj + ε)
  • ZCA whitening (goal: µ̂j = 0, σ̂j = 1, (1/(n−1)) X̂ X̂ᵀ = I):
    X̂ = W X, with W ∝ (X Xᵀ)^(−1/2)

A minimal sketch follows below.
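A minimal sketch of min-max scaling and z-score normalization on a data matrix with one sample per row (ZCA whitening is omitted; the small ε avoids divisions by zero, as in the formulas above):

```python
import numpy as np

def min_max_scale(X, eps=1e-8):
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + eps)

def z_score(X, eps=1e-8):
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

np.random.seed(0)
X = np.random.randn(100, 3) * np.array([1.0, 10.0, 100.0]) + np.array([0.0, 5.0, -50.0])
Xz = z_score(X)
print(Xz.mean(axis=0).round(6), Xz.std(axis=0).round(6))   # ~[0 0 0] and ~[1 1 1]
print(min_max_scale(X).min(axis=0), min_max_scale(X).max(axis=0).round(6))
```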

84 / 94

slide-85
SLIDE 85

Introduction Feedforward Neural Networks

Z-score normalization / Standardizing the inputs

Remember our linear regression: y = 3x + 2 + U(−0.1, 0.1), L2 loss, 30 1D samples.

[Figure: loss contours in the (b, w) plane with the raw input (left) and the standardized input (right).]

With standardized inputs, the gradient always points to the minimum!

85 / 94

slide-86
SLIDE 86

Introduction Feedforward Neural Networks

The starting point of training is critical

Pretraining

Historically, training deep FNNs was known to be hard, i.e. to give bad generalization errors. The starting point of a gradient descent has a dramatic impact.

  • neural history compressors [Schmidhuber, 1991]
  • competitive learning [Maclin and Shavlik, 1995]
  • unsupervised pretraining based on Boltzmann machines [Hinton, 2006]
  • unsupervised pretraining based on autoencoders [Bengio, 2006]

86 / 94

slide-87
SLIDE 87

Introduction Feedforward Neural Networks

For example, pretraining with autoencoders

Idea: extract features that allow reconstructing the previous layer's activities. Followed by fine-tuning with gradient descent. Does not appear to be that critical nowadays (because of the ReLU variants and initialization strategies).

87 / 94

slide-88
SLIDE 88

Introduction Feedforward Neural Networks

Initializing the weights/biases

Thoughts

  • initially behave as a linear predictor; non-linearities should be activated by the learning algorithm only if necessary
  • units should not extract the same features: symmetry breaking, otherwise they receive the same gradients

Suppose the inputs are standardized; make the outputs and the gradients standardized too (a sketch follows below):

  • sigmoid: b = 0, w ∼ N(0, 1/√fanin) ⇒ in the linear part
  • sigmoid, tanh: b = 0, w ∼ U(−√6/√(ni + no), √6/√(ni + no)) [Glorot, 2010]
  • ReLU: b = 0, w ∼ N(0, √(2/fanin)) [He(2015)]
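A minimal sketch of these initialization recipes for one fully connected layer of shape (fan_out, fan_in), following the Glorot and He formulas of the next slides (the function names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def glorot_normal(fan_in, fan_out):
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_out, fan_in))

def he_normal(fan_in, fan_out):
    # For ReLU layers: Var[w] = 2 / fan_in
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

W = he_normal(fan_in=256, fan_out=128)
b = np.zeros(128)                       # biases set to 0
print(W.std())                          # ~ sqrt(2/256) ~ 0.088
```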

88 / 94

slide-89
SLIDE 89

Introduction Feedforward Neural Networks

LeCun initialization

Initialization in the linear regime for the forward pass

Aim: initialize the weights so that f acts in its linear part, i.e. w close to 0.

  • Use the symmetric transfer function f(x) = 1.7159 tanh(2x/3) ⇒ f(1) = 1, f(−1) = −1
  • Center, normalize (unit variance) and decorrelate the input dimensions
  • initialize the weights from a distribution with µ = 0, σ = 1/√ni
  • set the biases to 0
  • This ensures the output of the layer is zero mean, unit variance

Efficient Backprop, LeCun et al. (1998); Generalization and network design strategies, LeCun (1989)

89 / 94

slide-90
SLIDE 90

Introduction Feedforward Neural Networks

Glorot initialization strategy

Keep the same distribution for the forward and backward passes

  • The activations and the gradients should initially have similar distributions across the layers, to avoid vanishing/exploding gradients
  • The input dimensions should be centered, normalized, uncorrelated
  • With a transfer function f such that f′(0) = 1, this gives:
    ∀i, Var[W^i] = 2 / (fanin + fanout)

Glorot (Xavier) uniform: W ∼ U[−√6/√(ni + no), √6/√(ni + no)], b = 0
Glorot (Xavier) normal: W ∼ N(0, √2/√(fanin + fanout)), b = 0

Understanding the difficulty of training deep feedforward neural networks, Glorot, Bengio, JMLR (2010).

90 / 94

slide-91
SLIDE 91

Introduction Feedforward Neural Networks

He initialization strategy

Designed for rectifier non-linearities (ReLU, PReLU).

Keep the same distribution for the forward and backward passes

  • The activations and the gradients should initially have similar distributions across the layers
  • The input dimensions should be centered, normalized, uncorrelated
  • With a ReLU transfer function and convolutions of d filters of size k × k over c channels:
    ½ k² c Var[wl] = 1 (forward),   ½ k² d Var[wl] = 1 (backward)

He uniform: W ∼ U[−√6/√ni, √6/√ni], b = 0
He normal: W ∼ N(0, √2/√ni)

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, He et al., ICCV (2015).

91 / 94

slide-92
SLIDE 92

Introduction Feedforward Neural Networks

Batch normalization [Ioffe, Szegedy(2015)]

Internal covariate shift

Def [Ioffe(2015)]: the change in the distribution of network activations due to the change in network parameters during training.
Exp: 3 FC layers (100 units), sigmoid, softmax output, MNIST. Measure: the distribution of the activations of the last hidden layer during training, {15, 50, 85}th percentiles.

92 / 94

slide-93
SLIDE 93

Introduction Feedforward Neural Networks

Batch normalization [Ioffe, Szegedy(2015)]

Batch normalization to prevent covariate shift

Idea: standardize the activations of every layer to keep the same distributions during training (a sketch follows below).

  • The gradient must be aware of this normalization, otherwise we may get parameter explosion (see Ioffe(2015))
  • Introduces a differentiable BN normalization layer:
    z = g(W u + b) → z = g(BN(W u + b))

yi = BN_{γ,β}(xi) = γ x̂i + β
x̂i = (xi − µB) / √(σB² + ε)

µB, σB²: minibatch mean and variance
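A minimal sketch of the BN transform at training time for a minibatch of activations x of shape (batch, features); γ and β are the learnable parameters, and the running statistics needed at test time are omitted:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """y = gamma * (x - mu_B) / sqrt(sigma_B^2 + eps) + beta, computed per feature."""
    mu = x.mean(axis=0)                    # minibatch mean
    var = x.var(axis=0)                    # minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardized activations
    return gamma * x_hat + beta

np.random.seed(0)
x = 5.0 + 3.0 * np.random.randn(64, 10)    # a minibatch of pre-activations
y = batch_norm_train(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))   # ~0 and ~1 per feature
```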

93 / 94

slide-94
SLIDE 94

Introduction Feedforward Neural Networks

Batch Normalization

Train and test time

  • Where: everywhere along the network, before the ReLUs
  • at training time: standardize each unit's activations over a minibatch
  • at test time, several options:
    • with one sample, standardize over the population
    • use the mean/variance computed on the training set
    • standardize over a batch of test samples

Learning is much faster, with better generalization.

94 / 94

slide-95
SLIDE 95

Convolutional Neural Networks (CNN) Neocognitron [Fukushima(1980)] LeNet5 [LeCun(1998)]

1 / 35

slide-96
SLIDE 96

Idea : Exploiting the structure of the inputs

Ideas

  • Features detected by convolutions with local kernels
  • parameter sharing, sparse weights ⇒ strongly regularized FNN (e.g. detecting an oriented edge is translation invariant)

2 / 35

slide-97
SLIDE 97

The CNN of LeCun(1998)

Architecture

  • (Conv/NonLinear/Pool) * n
  • followed by fully connected layers

3 / 35

slide-98
SLIDE 98

General architecture of a CNN

Architecture: Conv/ReLU/Pool

[Figure: an input with 3 channels is convolved with K kernels (plus a bias), producing K feature maps; a ReLU is applied, then a max pooling of stride s, keeping K feature maps.]

  • Convolution: depth, size (3x3, 5x5), pad, stride
  • Max pooling: size, stride (e.g. (2,2))

4 / 35

slide-99
SLIDE 99

Recent CNN

Multi-column DNN, Ciresan (2012)

Ensemble of convolutional neural networks trained with dataset augmentation. 0.23% test misclassification on MNIST. 1.5 million parameters.

5 / 35

slide-100
SLIDE 100

Recent CNN

SuperVision, Krizhevsky(2012)

  • top 5 error of 16% compared to runner-up with 26% error.
  • several convolutions were stacked without pooling,
  • trained on 2 GPUs, for a week
  • 60 Millions parameters, dropout, momentum, L2 penalty,

dataset augmentation (trans, reflections, PCA)

6 / 35

slide-101
SLIDE 101

Recent CNN

SuperVision, Krizhevsky(2012)

  • top 5 error of 16% compared to runner-up with 26% error.
  • several convolutions were stacked without pooling
  • 60 Millions parameters, dropout, momentum, L2 penalty,

dataset augmentation (trans, reflections, PCA)

7 / 35

slide-102
SLIDE 102

Recent CNN

VGG, Simonyan (2014)

  • 16 layers: 13 convolutional, 3 fully connected
  • 3x3 convolutions, 2x2 pooling
  • two stacked 3x3 convolutions ⇒ the receptive field of a 5x5 convolution, with fewer parameters:
    • K input channels, K output channels, one 5x5 convolution ⇒ 25K² parameters
    • K input channels, K output channels, two 3x3 convolutions ⇒ 18K² parameters
  • 140 million parameters, dropout, momentum, L2 penalty, learning rate annealing, trained progressively

8 / 35

slide-103
SLIDE 103

Recent CNN

Inception, GoogLeNet (Szegedy,2015)

Idea : decrease the number of parameters by using 1x1 convolutions for cross-channel interactions.

  • dramatic decrease in the number of parameters ≈ 6 Million
  • multi-level feature extraction

9 / 35

slide-104
SLIDE 104

Recent CNN

Residual Networks (He,2015)

Idea : shortcut connections, no fully connected layers

  • L2 penalty, batch normalization (NO dropout), momentum
  • up to 150 layers, for only 2 Million parameters

10 / 35

slide-105
SLIDE 105

Recent CNN

Striving for simplicity: The all convolutional Net (Springenberg,2014)

Observation: you can get rid of pooling and only use convolutions (All-CNN-C).

  • training with SGD, momentum, L2 penalty, dropout
  • only 3x3 convolutions with various stride
  • last layers use 3x3 and 1x1 convolutions instead of FC layers

11 / 35

slide-106
SLIDE 106

An attempt at synthesizing CNN design principles

12 / 35

slide-107
SLIDE 107

Design principles for Convolutional Neural networks

Increase the number of filters through the network

Rationale :

  • first layers extract low level features
  • higher layers combine the previous features

Number of filters

  • LeNet-5 (1998) : 6 5x5 - 16 5x5
  • AlexNet(2012) : 96 11x11, 256 5x5, (384 3x3)*2, 256 3x3
  • VGG (2014) : 64 - 128 - 256 - 512; all 3x3
  • ResNet (2015) : 64 - 128 - 256 - 512; all 3x3
  • Inception (2015) : 64 → 1024, 1x1, 3x3, “5x5”

13 / 35

slide-108
SLIDE 108

Design principles for Convolutional Neural networks

Effective Receptive Field size

(Conv 3x3 - Conv 3x3 - Max Pool) blocks

Step                         Representation size   Input RF size
Input                        28x28                 1x1
Conv 3x3, same, stride 1     28x28                 3x3
Conv 3x3, same, stride 1     28x28                 5x5
Max 2x2, stride 2            14x14                 6x6
Conv 3x3, same, stride 1     14x14                 10x10

14 / 35

slide-109
SLIDE 109

Design principles for Convolutional Neural networks

Effective Receptive Field size

(Conv 3x3 - Conv 3x3 - Max Pool) blocks

Step                         Representation size   Input RF size
Input                        28x28                 1x1
Conv 3x3, same, stride 1     28x28                 3x3
Conv 3x3, same, stride 1     28x28                 5x5
Max 2x2, stride 2            14x14                 6x6
Conv 3x3, same, stride 1     14x14                 10x10
Conv 3x3, same, stride 1     14x14                 14x14
Max 2x2, stride 2            7x7                   15x15
Conv 3x3, same, stride 1     7x7                   23x23

⇒ Stack layers to ensure the RF covers the objects to detect.
https://github.com/vdumoulin/conv_arithmetic

15 / 35

slide-110
SLIDE 110

Design principles for Convolutional Neural networks

Stacking small kernels (VGG, Inception)

Szegedy (2015)

n input filters, αn output filters:

  • one 5x5 convolution with αn filters: 25αn² params
  • a 3x3 convolution with √α·n filters followed by a 3x3 convolution with αn filters: 9√α·n² + 9α√α·n² params; α = 2 ⇒ 24% saving

n input filters, αn output filters:

  • one 3x3 convolution with αn filters: 9αn² params
  • a 1x3 convolution with √α·n filters followed by a 3x1 convolution with αn filters: 3√α·n² + 3α√α·n² params; α = 2 ⇒ 30% saving

16 / 35

slide-111
SLIDE 111

Design principles for Convolutional Neural Networks

Depthwise convolutions, MobileNets [Howard, 2017]

Decrease the number of parameters by decoupling feature extraction in space (depthwise convolutions) from feature combination across channels (1x1 convolutions). See also Xception [Chollet(2016)].

17 / 35

slide-112
SLIDE 112

Design principles for Convolutional Neural networks

Multiscale feature extraction

18 / 35

slide-113
SLIDE 113

Design principles for Convolutional Neural networks

Dimensionality reduction with 1x1 convolutions

[Figure: a 1x1 convolution with m filters maps a (width x height x n) volume to a (width x height x m) volume, followed by a ReLU.]

Equivalent to a single-layer FNN slid over the pixels; multiple 1x1 convolutions ↔ an MLP. A trainable non-linear transformation of the channels. Network in Network (Lin, 2013).

19 / 35

slide-114
SLIDE 114

Design principles for Convolutional Neural networks

Ease the gradient flow with shortcuts

20 / 35

slide-115
SLIDE 115

Design principles for Convolutional Neural networks

Do we need max pooling and fully connected layers ?

The All Convolutional Net [Springenberg, 2015], ResNet [He, 2015]. Advantage: you can slide the network over larger images to produce a volume of class probabilities.

21 / 35

slide-116
SLIDE 116

Useful tricks

Using pre-trained models

Already trained models can be found at:

  • https://github.com/tensorflow/models : TensorFlow model zoo
  • https://keras.io/applications/ : Keras pre-trained models
  • Caffe Model Zoo

e.g. use a VGG pretrained on ImageNet: 1) replace the softmax, 2) fine-tune some of the deepest layers (a sketch follows below).
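A sketch of that recipe with tf.keras, assuming TensorFlow 2.x is installed and a 10-class target task; the head layers and the number of unfrozen layers are illustrative choices, not the slides' prescription:

```python
import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                       # freeze the pretrained convolutional layers

# 1) replace the softmax: add a new classification head
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)   # train the head first

# 2) fine-tune some of the deepest layers with a small learning rate
for layer in base.layers[-4:]:
    layer.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)
```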

22 / 35

slide-117
SLIDE 117

Useful tricks

Dataset augmentation

Regularize your network by providing more samples:

  • with small perturbations (rotation, shift, zoom, ...)
  • by altering the RGB pixel values with a PCA [Krizhevsky et al., 2012]
  • by learning a generator (see Generative Adversarial Networks)

23 / 35

slide-118
SLIDE 118

Useful tricks

Model averaging

1. Train several models with different initializations and architectures
2. Average the responses of these models

e.g. on CIFAR-100, 11 models with loss ≈ 1.3, acc ≈ 70%; averaged: loss ≈ 0.82, acc ≈ 77%. All challenge winners use model averaging.

Model compression: speeding up inference time

  • Binarized Neural Networks [Courbariaux(2016)]
  • Knowledge distillation (train a small model with the soft targets of a big model) [Hinton(2015)]

24 / 35

slide-119
SLIDE 119

Viewing and understanding deep networks

Demo of Deep Visualization Toolbox http://yosinski.com/deepvis

25 / 35

slide-120
SLIDE 120

Some applications of CNN

26 / 35

slide-121
SLIDE 121

Image classification

Aim : assign a label to an image

Some benchmarks

  • MNIST (28x28, 10 classes, grayscale, 60,000 training, 10,000 test images)
  • CIFAR-10, CIFAR-100
  • ImageNet Task 1 (256x256, 1000 classes, 1.2 million training, 50,000 validation, 100,000 test images)

Image from [He(2016)]

27 / 35

slide-122
SLIDE 122

Image classification

ImageNet

28 / 35

slide-123
SLIDE 123

Image classification

ImageNet

Image from [Canziani(2016)]

29 / 35

slide-124
SLIDE 124

Object detection

Aim : detect the objects and output bounding boxes Metrics : detected classes, bbox coverage

Some benchmarks

  • Pascal-VOC
  • ImageNet Task 3
  • Microsoft COCO

30 / 35

slide-125
SLIDE 125

Applications of CNN : Object detection

Region based CNN [Girshick,2014]

Using the model AlexNet [Krizhevsky(2012)] for classifying.

31 / 35

slide-126
SLIDE 126

Applications of CNN : Object detection

Fast RCNN (Girshick, 2015)

32 / 35

slide-127
SLIDE 127

Applications of CNN : Object detection

Faster RCNN (Ren, Girshick, 2015)

  • Introduces Region Proposal Network feeding a Fast-RCNN
  • end-to-end training

More recently : YOLO(2015), YOLO9000(2016)

33 / 35

slide-128
SLIDE 128

Applications of CNN : Semantic/Instance segmentation

Instance segmentation [He,2017]: Mask-RCNN

Predicts a binary mask in addition to the classes and boxes. Other approaches: SegNet(2015), FC-DenseNet(2017), UNet(2015), ENet(2016).

34 / 35

slide-129
SLIDE 129

Spatial Transformer Network (Jaderberg, 2016)

Learns a differentiable transformation Tθ (crop, translation, rotation, scale, and skew). Aim: decrease the number of degrees of freedom of the objects.

35 / 35

slide-130
SLIDE 130

Recurrent Neural Networks

Recurrent neural networks (RNN) Handling sequential data (speech, handwriting, language,...) Predicting in context

Input Output

1 / 15

slide-131
SLIDE 131

Recurrent Neural Networks

Handling sequences with FNN

Time delay neural networks [Waibel(1989)]

[Figure: a delay line feeding xt, xt−1, ..., xt−6 into hidden layers and an output layer.]

But which size for the time window? Must the history size always be the same? Do we need the data over the whole time span?

2 / 15

slide-132
SLIDE 132

Recurrent Neural Networks

Recurrent Neural Networks (RNN)

Architecture

[Figure: input, hidden and output layers with recurrent connections.]

  • W_in: inputs to hidden
  • W_back: outputs to hidden
  • W: hidden to hidden
  • W_out: hidden to output

Note that it applies the same weight matrices repeatedly (a sketch follows below).
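A minimal sketch of that repeated application, in the style of an Elman recurrence (the output-to-hidden feedback W_back of the figure is omitted for simplicity; all names and sizes are illustrative):

```python
import numpy as np

def rnn_forward(xs, W_in, W, W_out, h0):
    """Apply the same weights at every time step: h_t = tanh(W_in x_t + W h_{t-1})."""
    h, outputs = h0, []
    for x in xs:
        h = np.tanh(W_in @ x + W @ h)      # hidden-to-hidden recurrence
        outputs.append(W_out @ h)          # hidden-to-output
    return np.array(outputs), h

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 3, 5, 2, 4
xs = rng.normal(size=(T, d_in))
W_in, W, W_out = (rng.normal(size=s) * 0.5 for s in [(d_h, d_in), (d_h, d_h), (d_out, d_h)])
ys, h_last = rnn_forward(xs, W_in, W, W_out, h0=np.zeros(d_h))
print(ys.shape, h_last.shape)   # (4, 2) (5,)
```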

3 / 15

slide-133
SLIDE 133

Recurrent Neural Networks

Training a RNN: Forward mode differentation

Real Time Recurrent Learning (RTRL), [Williams(1989)]

Same idea as forward-mode differentiation for FNNs.

  • Computationally more expensive than reverse-mode differentiation
  • Online training

Sutskever (2013). Training recurrent neural networks, PhD thesis.

4 / 15

slide-134
SLIDE 134

Recurrent Neural Networks

Training a RNN : Reverse mode differentation

Backpropagation Through Time (BPTT), [Werbos(1990)]

  • Unfolds the computational graph in time ⇒ ∼ backprop in a deep FNN
  • computationally cheaper than RTRL
  • batch training

Sutskever (2013). Training recurrent neural networks, PhD thesis.

5 / 15

slide-135
SLIDE 135

Recurrent Neural Networks

Training a RNN is hard

Long-term dependencies

  • Exploding/vanishing gradient
  • if one output depends on an input seen a long time ago (long-term dependencies), that information may actually be lost or hard to be sensitive to
  • ⇒ introduce memory units, specifically designed to hold information

6 / 15

slide-136
SLIDE 136

Recurrent Neural Networks

Long-Short Term Memory

Architecture [Hochreiter, Schmidhuber(1997)][Gers(2000)]

Specifically designed to store information over long time delays. Gating units specify when to integrate, release, or forget.
See Jozefowicz(2015); "LSTM: A search space odyssey" [Greff(2017)]; Recurrent Highway Networks. Other possibilities: e.g. Gated Recurrent Units (GRUs) [Cho(2014)].

7 / 15

slide-137
SLIDE 137

Recurrent Neural Networks

Bidirectional LSTM

Speech to text

Bidirectional LSTM for speech recognition [Graves(2013)]

Both past and future contexts are used for classifying the current observation. When you speak, past and future phonemes influence the way you pronounce the current one.

8 / 15

slide-138
SLIDE 138

Recurrent Neural Networks

Applications of RNN: Language modelling

Char RNN [Karpathy(2015)]

Train an LSTM network to predict the next character. Then provide a seed and let it generate a sentence, character by character.
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

9 / 15

slide-139
SLIDE 139

Recurrent Neural Networks

Applications of RNN: Text to text

Mapping a sentence in one language to its translation.

Encoder/Decode [Sustkever(2014), Cho(2014)]

https://devblogs.nvidia.com/parallelforall/ introduction-neural-machine-translation-with-gpus/

10 / 15

slide-140
SLIDE 140

Recurrent Neural Networks

Applications of RNN: Text to text

The encoder/decoder suffers when the sentences are long. Idea: let the network decide which part of the sentence to attend to when translating. [Bahdanau(2015)]

Attention based LSTM translation

See also [Cho et al.(2015)] for image captioning.

11 / 15

slide-141
SLIDE 141

Recurrent Neural Networks

Applications of RNN: Multi language translation

Google’s Multilingual Neural Machine Translation System

A model trained with English↔Portuguese and English↔Spanish generalizes to Portuguese↔Spanish.

[Wu(2016); Johnson(2016)]

12 / 15

slide-142
SLIDE 142

Recurrent Neural Networks

Applications of RNN

Text to handwritten text

Handwriting [Graves(2013)]

Data: a sequence of characters and a sequence of pen positions (x, y, up/down).

1. learn a handwritten-text generator:
   (δx_{t+1}, δy_{t+1}, ud_t) = f({δx_k, δy_k, ud_k}_{k∈[0,t−1]})
2. condition the network by inputting a sequence of characters, and learn to attend to the right characters
3. prime the network with a given character/pen sequence to mimic a style

The network outputs the parameters of a mixture of Gaussians.
http://www.cs.toronto.edu/~graves/handwriting.html

13 / 15

slide-143
SLIDE 143

Recurrent Neural Networks

Combining CNN and RNN

Automatic captioning [Karpathy(2015)]

http://cs.stanford.edu/people/karpathy/deepimagesent/

14 / 15

slide-144
SLIDE 144

Recurrent Neural Networks

We did not speak about

  • Encoders/decoders, deconvolutional networks
  • Models for Natural Language Processing (e.g. word2vec, GloVe, recursive networks)
  • Probabilistic/energy-based models: Hopfield networks, Restricted Boltzmann Machines, deep belief networks
  • More generally: generative models (e.g. Generative Adversarial Networks [Goodfellow, 2014])
  • Neural Turing Machines, attention-based models

https://distill.pub/2016/augmented-rnns/

15 / 15