

slide-1
SLIDE 1

MIT 9.520/6.860, Fall 2018 Class 11: Neural networks – tips, tricks & software

Andrzej Banburski

slide-2
SLIDE 2

Last time - Convolutional neural networks

source: github.com/vdumoulin/conv_arithmetic

Large-scale Datasets · General Purpose GPUs · AlexNet [Krizhevsky et al., 2012]

  • A. Banburski
slide-3
SLIDE 3

Overview

Initialization & hyper-parameter tuning
Optimization algorithms
Batchnorm & Dropout
Finite dataset woes
Software

  • A. Banburski
slide-4
SLIDE 4

Initialization & hyper-parameter tuning

Consider the problem of training a neural network fθ(x) by minimizing a loss

L(θ, x) = Σ_{i=1}^N l_i(y_i, fθ(x_i)) + λ|θ|²

with SGD and mini-batch size b:

θ_{t+1} = θ_t − (η/b) Σ_{i∈B} ∇θ L(θ_t, x_i)   (1)

  • A. Banburski
slide-5
SLIDE 5

Initialization & hyper-parameter tuning

Consider the problem of training a neural network fθ(x) by minimizing a loss

L(θ, x) = Σ_{i=1}^N l_i(y_i, fθ(x_i)) + λ|θ|²

with SGD and mini-batch size b:

θ_{t+1} = θ_t − (η/b) Σ_{i∈B} ∇θ L(θ_t, x_i)   (1)

◮ How should we choose the initial set of parameters θ?

  • A. Banburski
slide-6
SLIDE 6

Initialization & hyper-parameter tuning

Consider the problem of training a neural network fθ(x) by minimizing a loss

L(θ, x) = Σ_{i=1}^N l_i(y_i, fθ(x_i)) + λ|θ|²

with SGD and mini-batch size b:

θ_{t+1} = θ_t − (η/b) Σ_{i∈B} ∇θ L(θ_t, x_i)   (1)

◮ How should we choose the initial set of parameters θ?
◮ How about the hyper-parameters η, λ and b?
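A minimal sketch of update rule (1) in PyTorch, using a toy network and random data as stand-ins for fθ and the (x_i, y_i); the mean reduction of the loss supplies the 1/b factor:

    import torch
    import torch.nn as nn

    # Toy stand-ins (assumed): a small network and random data in place of f_theta and (x_i, y_i).
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    X, Y = torch.randn(256, 10), torch.randn(256, 1)

    eta, lam, b = 0.1, 1e-4, 32                  # learning rate, weight decay, mini-batch size
    for t in range(100):
        idx = torch.randint(0, len(X), (b,))     # draw a mini-batch B
        loss = nn.functional.mse_loss(model(X[idx]), Y[idx])                 # (1/b) sum of per-sample losses
        loss = loss + lam * sum((p ** 2).sum() for p in model.parameters())  # + lambda |theta|^2
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                p -= eta * p.grad                # theta <- theta - eta * (mini-batch gradient)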

  • A. Banburski
slide-7
SLIDE 7

Weight Initialization

◮ First obvious observation: starting with 0 will make every weight

update in the same way. Similarly, too big and we can run into NaN.

  • A. Banburski
slide-8
SLIDE 8

Weight Initialization

◮ First obvious observation: starting with 0 will make every weight

update in the same way. Similarly, too big and we can run into NaN.

◮ What about θ0 = ε × N(0, 1), with ε ≈ 10^−2?

  • A. Banburski
slide-9
SLIDE 9

Weight Initialization

◮ First obvious observation: starting with 0 will make every weight

update in the same way. Similarly, too big and we can run into NaN.

◮ What about θ0 = ε × N(0, 1), with ε ≈ 10^−2?
◮ For a few layers this would seem to work nicely.

  • A. Banburski
slide-10
SLIDE 10

Weight Initialization

◮ First obvious observation: starting with 0 will make every weight

update in the same way. Similarly, too big and we can run into NaN.

◮ What about θ0 = ε × N(0, 1), with ε ≈ 10^−2?
◮ For a few layers this would seem to work nicely.
◮ If we go deeper however...

  • A. Banburski
slide-11
SLIDE 11

Weight Initialization

◮ First obvious observation: starting with 0 will make every weight update in the same way. Similarly, too big and we can run into NaNs.
◮ What about θ0 = ε × N(0, 1), with ε ≈ 10^−2?
◮ For a few layers this would seem to work nicely.
◮ If we go deeper however...
◮ Super slow update of earlier layers – gradients scale roughly like 10^(−2L) for sigmoid or tanh activations – vanishing gradients. ReLU activations do not suffer so much from this.

  • A. Banburski
slide-12
SLIDE 12

Xavier & He initializations

◮ For tanh and sigmoid activations, near the origin we deal with a nearly linear function y = Wx, with x = (x_1, . . . , x_{n_in}). To stop vanishing and exploding gradients we need

Var(y) = Var(Wx) = Var(w_1 x_1) + · · · + Var(w_{n_in} x_{n_in})

  • A. Banburski
slide-13
SLIDE 13

Xavier & He initializations

◮ For tanh and sigmoid activations, near the origin we deal with a nearly linear function y = Wx, with x = (x_1, . . . , x_{n_in}). To stop vanishing and exploding gradients we need

Var(y) = Var(Wx) = Var(w_1 x_1) + · · · + Var(w_{n_in} x_{n_in})

◮ If we assume that W and x are i.i.d. and have zero mean, then

Var(y) = n_in Var(w_i) Var(x_i)

  • A. Banburski
slide-14
SLIDE 14

Xavier & He initializations

◮ For tanh and sigmoid activations, near the origin we deal with a nearly linear function y = Wx, with x = (x_1, . . . , x_{n_in}). To stop vanishing and exploding gradients we need

Var(y) = Var(Wx) = Var(w_1 x_1) + · · · + Var(w_{n_in} x_{n_in})

◮ If we assume that W and x are i.i.d. and have zero mean, then

Var(y) = n_in Var(w_i) Var(x_i)

◮ If we want the inputs and outputs to have the same variance, this gives us Var(w_i) = 1/n_in.

  • A. Banburski
slide-15
SLIDE 15

Xavier & He initializations

◮ For tanh and sigmoid activations, near the origin we deal with a nearly linear function y = Wx, with x = (x_1, . . . , x_{n_in}). To stop vanishing and exploding gradients we need

Var(y) = Var(Wx) = Var(w_1 x_1) + · · · + Var(w_{n_in} x_{n_in})

◮ If we assume that W and x are i.i.d. and have zero mean, then

Var(y) = n_in Var(w_i) Var(x_i)

◮ If we want the inputs and outputs to have the same variance, this gives us Var(w_i) = 1/n_in.
◮ A similar analysis for the backward pass gives Var(w_i) = 1/n_out.

  • A. Banburski
slide-16
SLIDE 16

Xavier & He initializations

◮ For tanh and sigmoid activations, near the origin we deal with a nearly linear function y = Wx, with x = (x_1, . . . , x_{n_in}). To stop vanishing and exploding gradients we need

Var(y) = Var(Wx) = Var(w_1 x_1) + · · · + Var(w_{n_in} x_{n_in})

◮ If we assume that W and x are i.i.d. and have zero mean, then

Var(y) = n_in Var(w_i) Var(x_i)

◮ If we want the inputs and outputs to have the same variance, this gives us Var(w_i) = 1/n_in.
◮ A similar analysis for the backward pass gives Var(w_i) = 1/n_out.
◮ The compromise is the Xavier initialization [Glorot et al., 2010]:

Var(w_i) = 2 / (n_in + n_out)   (2)

  • A. Banburski
slide-17
SLIDE 17

Xavier & He initializations

◮ For tanh and sigmoid activations, near the origin we deal with a nearly linear function y = Wx, with x = (x_1, . . . , x_{n_in}). To stop vanishing and exploding gradients we need

Var(y) = Var(Wx) = Var(w_1 x_1) + · · · + Var(w_{n_in} x_{n_in})

◮ If we assume that W and x are i.i.d. and have zero mean, then

Var(y) = n_in Var(w_i) Var(x_i)

◮ If we want the inputs and outputs to have the same variance, this gives us Var(w_i) = 1/n_in.
◮ A similar analysis for the backward pass gives Var(w_i) = 1/n_out.
◮ The compromise is the Xavier initialization [Glorot et al., 2010]:

Var(w_i) = 2 / (n_in + n_out)   (2)

◮ Heuristically, ReLU is half of the linear function, so we can take

Var(w_i) = 4 / (n_in + n_out)   (3)

An analysis in [He et al., 2015] confirms this.
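Both schemes are available as standard initializers in PyTorch; a rough sketch (note that the library's Kaiming variant uses a single fan, e.g. 2/n_in, rather than the sum above):

    import torch.nn as nn

    layer = nn.Linear(256, 128)

    # Xavier/Glorot: Var(w) = 2 / (n_in + n_out), suited to tanh/sigmoid activations.
    nn.init.xavier_normal_(layer.weight)

    # He/Kaiming: larger variance for ReLU (PyTorch's fan_in mode uses 2 / n_in).
    nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

    nn.init.zeros_(layer.bias)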

  • A. Banburski
slide-18
SLIDE 18

Hyper-parameter tuning

How about the hyper-parameters η, λ and b

◮ How do we choose optimal η, λ and b?

  • A. Banburski
slide-19
SLIDE 19

Hyper-parameter tuning

How about the hyper-parameters η, λ and b

◮ How do we choose optimal η, λ and b? ◮ Basic idea: split your training dataset into a smaller training set and

a cross-validation set.

  • A. Banburski
slide-20
SLIDE 20

Hyper-parameter tuning

How about the hyper-parameters η, λ and b

◮ How do we choose optimal η, λ and b? ◮ Basic idea: split your training dataset into a smaller training set and

a cross-validation set.

– Run a coarse search (on a logarithmic scale) over the parameters for just a few epochs of SGD and evaluate on the cross-validation set.

  • A. Banburski
slide-21
SLIDE 21

Hyper-parameter tuning

How about the hyper-parameters η, λ and b

◮ How do we choose optimal η, λ and b? ◮ Basic idea: split your training dataset into a smaller training set and

a cross-validation set.

– Run a coarse search (on a logarithmic scale) over the parameters for just a few epochs of SGD and evaluate on the cross-validation set. – Perform a finer search.

  • A. Banburski
slide-22
SLIDE 22

Hyper-parameter tuning

How about the hyper-parameters η, λ and b

◮ How do we choose optimal η, λ and b? ◮ Basic idea: split your training dataset into a smaller training set and

a cross-validation set.

– Run a coarse search (on a logarithmic scale) over the parameters for just a few epochs of SGD and evaluate on the cross-validation set. – Perform a finer search.

◮ Interestingly, [Bergstra and Bengio, 2012] shows that it is better to

run the search randomly than on a grid.

  • A. Banburski
slide-23
SLIDE 23

Hyper-parameter tuning

How about the hyper-parameters η, λ and b

◮ How do we choose optimal η, λ and b?
◮ Basic idea: split your training dataset into a smaller training set and a cross-validation set.

– Run a coarse search (on a logarithmic scale) over the parameters for just a few epochs of SGD and evaluate on the cross-validation set.
– Perform a finer search.

◮ Interestingly, [Bergstra and Bengio, 2012] shows that it is better to run the search randomly than on a grid.

source: [Bergstra and Bengio, 2012]
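A minimal sketch of such a coarse random search on a log scale; run_short_training is a hypothetical helper standing in for a few epochs of SGD plus evaluation on the cross-validation split:

    import random

    def sample_hyperparams():
        # log-uniform samples for eta and lambda; batch size from powers of two
        eta = 10 ** random.uniform(-5, -1)
        lam = 10 ** random.uniform(-6, -2)
        b = random.choice([32, 64, 128, 256])
        return eta, lam, b

    def run_short_training(eta, lam, b):
        # Hypothetical helper: a few epochs of SGD, then accuracy on the cross-validation split.
        return random.random()  # stand-in for the validation accuracy

    trials = [(run_short_training(*hp), hp) for hp in (sample_hyperparams() for _ in range(20))]
    best_acc, best_hp = max(trials)
    print(best_acc, best_hp)  # then repeat with a narrower range around best_hp (finer search)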

  • A. Banburski
slide-24
SLIDE 24

Decaying learning rate

◮ To improve convergence of SGD, we have to use a decaying learning

rate.

  • A. Banburski
slide-25
SLIDE 25

Decaying learning rate

◮ To improve convergence of SGD, we have to use a decaying learning

rate.

◮ Typically we use a scheduler – decrease η after some fixed number of epochs.
  • A. Banburski
slide-26
SLIDE 26

Decaying learning rate

◮ To improve convergence of SGD, we have to use a decaying learning

rate.

◮ Typically we use a scheduler – decrease η after some fixed number of epochs.

◮ This allows the training loss to keep improving after it has plateaued.
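For instance, PyTorch's StepLR scheduler drops η by a fixed factor every few epochs; a sketch with a placeholder model:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)                       # placeholder model
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)  # eta <- 0.1 * eta every 30 epochs

    for epoch in range(90):
        # ... one epoch of training with opt.step() goes here ...
        scheduler.step()                           # decay the learning rate on the fixed schedule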

  • A. Banburski
slide-27
SLIDE 27

Batch-size & learning rate

An interesting linear scaling relationship seems to exist between the learning rate η and mini-batch size b:

◮ In the SGD update, they appear as the ratio η/b, with an additional implicit dependence of the sum of gradients on b.

  • A. Banburski
slide-28
SLIDE 28

Batch-size & learning rate

An interesting linear scaling relationship seems to exist between the learning rate η and mini-batch size b:

◮ In the SGD update, they appear as the ratio η/b, with an additional implicit dependence of the sum of gradients on b.

◮ If b ≪ N, we can approximate SGD by a stochastic differential equation with a noise scale g ≈ ηN/b [Smith & Le, 2017].

  • A. Banburski
slide-29
SLIDE 29

Batch-size & learning rate

An interesting linear scaling relationship seems to exist between the learning rate η and mini-batch size b:

◮ In the SGD update, they appear as the ratio η/b, with an additional implicit dependence of the sum of gradients on b.

◮ If b ≪ N, we can approximate SGD by a stochastic differential equation with a noise scale g ≈ ηN/b [Smith & Le, 2017].

◮ This means that instead of decaying η, we can increase the batch size dynamically.

  • A. Banburski
slide-30
SLIDE 30

Batch-size & learning rate

An interesting linear scaling relationship seems to exist between the learning rate η and mini-batch size b:

◮ In the SGD update, they appear as the ratio η/b, with an additional implicit dependence of the sum of gradients on b.

◮ If b ≪ N, we can approximate SGD by a stochastic differential equation with a noise scale g ≈ ηN/b [Smith & Le, 2017].

◮ This means that instead of decaying η, we can increase the batch size dynamically.

source: [Smith et al., 2018]

  • A. Banburski
slide-31
SLIDE 31

Batch-size & learning rate

An interesting linear scaling relationship seems to exist between the learning rate η and mini-batch size b:

◮ In the SGD update, they appear as the ratio η/b, with an additional implicit dependence of the sum of gradients on b.

◮ If b ≪ N, we can approximate SGD by a stochastic differential equation with a noise scale g ≈ ηN/b [Smith & Le, 2017].

◮ This means that instead of decaying η, we can increase the batch size dynamically.

source: [Smith et al., 2018]

◮ As b approaches N the dynamics become more and more deterministic and we would expect this relationship to vanish.

  • A. Banburski
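A tiny sketch of the resulting linear-scaling heuristic (keep η/b, and hence the noise scale, roughly constant); the numbers are illustrative only:

    base_lr, base_batch = 0.1, 256

    def scaled_lr(batch_size):
        # Linear scaling rule: keep eta/b (and hence the noise scale g ~ eta*N/b) constant.
        return base_lr * batch_size / base_batch

    print(scaled_lr(512), scaled_lr(1024))  # 0.2 0.4 - larger batches pair with larger learning rates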

slide-32
SLIDE 32

Batch-size & learning rate

source: [Goyal et al., 2017]

  • A. Banburski
slide-33
SLIDE 33

Overview

Initialization & hyper-parameter tuning
Optimization algorithms
Batchnorm & Dropout
Finite dataset woes
Software

  • A. Banburski
slide-34
SLIDE 34

SGD is kinda slow...

◮ GD – use all points each iteration to compute the gradient
◮ SGD – use one point each iteration to compute the gradient
◮ Faster: Mini-Batch – use a mini-batch of points each iteration to compute the gradient

  • A. Banburski
slide-35
SLIDE 35

Alternatives to SGD

Are there reasonable alternatives outside of Newton's method? Accelerations:

◮ Momentum
◮ Nesterov's method
◮ Adagrad
◮ RMSprop
◮ Adam
◮ . . .

  • A. Banburski
slide-36
SLIDE 36

SGD with Momentum

We can try accelerating SGD θt+1 = θt − η∇f(θt) by adding a momentum/velocity term:

  • A. Banburski
slide-37
SLIDE 37

SGD with Momentum

We can try accelerating SGD θ_{t+1} = θ_t − η ∇f(θ_t) by adding a momentum/velocity term:

v_{t+1} = µ v_t − η ∇f(θ_t),   θ_{t+1} = θ_t + v_{t+1}   (4)

µ is a new "momentum" hyper-parameter.

  • A. Banburski
slide-38
SLIDE 38

SGD with Momentum

We can try accelerating SGD θ_{t+1} = θ_t − η ∇f(θ_t) by adding a momentum/velocity term:

v_{t+1} = µ v_t − η ∇f(θ_t),   θ_{t+1} = θ_t + v_{t+1}   (4)

µ is a new "momentum" hyper-parameter.

source: cs231n.github.io
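A sketch of update (4) written out by hand on a placeholder objective; in practice this is simply the momentum argument of torch.optim.SGD:

    import torch

    theta = torch.randn(5, requires_grad=True)     # placeholder parameter
    v = torch.zeros_like(theta)                    # velocity
    eta, mu = 0.1, 0.9

    def f(th):
        return (th ** 2).sum()                     # placeholder objective

    for t in range(100):
        f(theta).backward()
        with torch.no_grad():
            v = mu * v - eta * theta.grad          # v_{t+1} = mu v_t - eta grad f(theta_t)
            theta += v                             # theta_{t+1} = theta_t + v_{t+1}
        theta.grad.zero_()

    # Built-in equivalent: torch.optim.SGD(params, lr=eta, momentum=mu)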

  • A. Banburski
slide-39
SLIDE 39

Nesterov Momentum

◮ Sometimes the momentum update can overshoot

  • A. Banburski
slide-40
SLIDE 40

Nesterov Momentum

◮ Sometimes the momentum update can overshoot ◮ We can instead evaluate the gradient at the point where momentum

takes us:

  • A. Banburski
slide-41
SLIDE 41

Nesterov Momentum

◮ Sometimes the momentum update can overshoot.
◮ We can instead evaluate the gradient at the point where momentum takes us:

v_{t+1} = µ v_t − η ∇f(θ_t + µ v_t),   θ_{t+1} = θ_t + v_{t+1}   (5)

  • A. Banburski
slide-42
SLIDE 42

Nesterov Momentum

◮ Sometimes the momentum update can overshoot.
◮ We can instead evaluate the gradient at the point where momentum takes us:

v_{t+1} = µ v_t − η ∇f(θ_t + µ v_t),   θ_{t+1} = θ_t + v_{t+1}   (5)

source: Geoff Hinton’s lecture
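In most frameworks this is just a flag; e.g. in PyTorch (a sketch with a placeholder model):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)  # placeholder model
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)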

  • A. Banburski
slide-43
SLIDE 43

AdaGrad

◮ An alternative is to automate the decay of the learning rate.

  • A. Banburski
slide-44
SLIDE 44

AdaGrad

◮ An alternative is to automate the decay of the learning rate.
◮ The Adaptive Gradient algorithm (AdaGrad) does this by accumulating the squared magnitudes of gradients.

  • A. Banburski
slide-45
SLIDE 45

AdaGrad

◮ An alternative is to automate the decay of the learning rate.
◮ The Adaptive Gradient algorithm (AdaGrad) does this by accumulating the squared magnitudes of gradients.

  • A. Banburski
slide-46
SLIDE 46

AdaGrad

◮ An alternative is to automate the decay of the learning rate.
◮ The Adaptive Gradient algorithm (AdaGrad) does this by accumulating the squared magnitudes of gradients.
◮ AdaGrad accelerates in flat directions of the optimization landscape and slows down in steep ones.
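A minimal sketch of the AdaGrad update on a placeholder objective: accumulate squared gradients and divide the step by their square root:

    import torch

    theta = torch.randn(5, requires_grad=True)     # placeholder parameter
    cache = torch.zeros_like(theta)                # running sum of squared gradients
    eta, eps = 0.1, 1e-8

    for t in range(100):
        loss = (theta ** 2).sum()                  # placeholder objective
        loss.backward()
        with torch.no_grad():
            cache += theta.grad ** 2               # accumulate squared gradient magnitudes
            theta -= eta * theta.grad / (cache.sqrt() + eps)  # per-parameter, ever-shrinking step
        theta.grad.zero_()

    # Built-in equivalent: torch.optim.Adagrad(params, lr=eta)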

  • A. Banburski
slide-47
SLIDE 47

RMSProp

Problem: The updates in AdaGrad always decrease the learning rate, so some of the parameters can become un-learnable.

  • A. Banburski
slide-48
SLIDE 48

RMSProp

Problem: The updates in AdaGrad always decrease the learning rate, so some of the parameters can become un-learnable.

◮ Fix by Hinton: use an exponentially weighted average of the squared magnitudes instead.

  • A. Banburski
slide-49
SLIDE 49

RMSProp

Problem: The updates in AdaGrad always decrease the learning rate, so some of the parameters can become un-learnable.

◮ Fix by Hinton: use an exponentially weighted average of the squared magnitudes instead.
◮ This assigns more weight to recent iterations. Useful if directions of steeper or shallower descent suddenly change.

  • A. Banburski
slide-50
SLIDE 50

RMSProp

Problem: The updates in AdaGrad always decrease the learning rate, so some of the parameters can become un-learnable.

◮ Fix by Hinton: use an exponentially weighted average of the squared magnitudes instead.
◮ This assigns more weight to recent iterations. Useful if directions of steeper or shallower descent suddenly change.
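The corresponding sketch for RMSProp replaces the running sum with an exponentially decaying average:

    import torch

    theta = torch.randn(5, requires_grad=True)     # placeholder parameter
    avg = torch.zeros_like(theta)                  # decaying average of squared gradients
    eta, rho, eps = 0.01, 0.9, 1e-8

    for t in range(100):
        loss = (theta ** 2).sum()                  # placeholder objective
        loss.backward()
        with torch.no_grad():
            avg = rho * avg + (1 - rho) * theta.grad ** 2   # recent gradients weigh more
            theta -= eta * theta.grad / (avg.sqrt() + eps)
        theta.grad.zero_()

    # Built-in equivalent: torch.optim.RMSprop(params, lr=eta, alpha=rho)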

  • A. Banburski
slide-51
SLIDE 51

Adam

Adaptive Moment – a combination of the previous approaches.

[Kingma and Ba, 2014]

  • A. Banburski
slide-52
SLIDE 52

Adam

Adaptive Moment – a combination of the previous approaches.

[Kingma and Ba, 2014]

◮ Ridiculously popular – more than 13K citations!

  • A. Banburski
slide-53
SLIDE 53

Adam

Adaptive Moment – a combination of the previous approaches.

[Kingma and Ba, 2014]

◮ Ridiculously popular – more than 13K citations!
◮ Probably because it comes with recommended parameters and came with a proof of convergence (which was later shown to be flawed).
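A sketch of the Adam update (first- and second-moment estimates with bias correction) on a placeholder objective, using the defaults recommended in [Kingma and Ba, 2014]:

    import torch

    theta = torch.randn(5, requires_grad=True)     # placeholder parameter
    m = torch.zeros_like(theta)                    # 1st moment (mean of gradients)
    v = torch.zeros_like(theta)                    # 2nd moment (mean of squared gradients)
    eta, b1, b2, eps = 1e-3, 0.9, 0.999, 1e-8      # recommended defaults

    for t in range(1, 101):
        loss = (theta ** 2).sum()                  # placeholder objective
        loss.backward()
        with torch.no_grad():
            m = b1 * m + (1 - b1) * theta.grad
            v = b2 * v + (1 - b2) * theta.grad ** 2
            m_hat = m / (1 - b1 ** t)              # bias correction
            v_hat = v / (1 - b2 ** t)
            theta -= eta * m_hat / (v_hat.sqrt() + eps)
        theta.grad.zero_()

    # Built-in equivalent: torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8)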

  • A. Banburski
slide-54
SLIDE 54

So what should I use in practice?

◮ Adam is a good default in many cases.

  • A. Banburski
slide-55
SLIDE 55

So what should I use in practice?

◮ Adam is a good default in many cases. ◮ There exist datasets in which Adam and other adaptive methods do

not generalize to unseen data at all! [Marginal Value of Adaptive Gradient Methods in Machine Learning]

  • A. Banburski
slide-56
SLIDE 56

So what should I use in practice?

◮ Adam is a good default in many cases. ◮ There exist datasets in which Adam and other adaptive methods do

not generalize to unseen data at all! [Marginal Value of Adaptive Gradient Methods in Machine Learning]

◮ SGD with Momentum and a decay rate often outperforms Adam

(but requires tuning).

  • A. Banburski
slide-57
SLIDE 57

So what should I use in practice?

◮ Adam is a good default in many cases. ◮ There exist datasets in which Adam and other adaptive methods do

not generalize to unseen data at all! [Marginal Value of Adaptive Gradient Methods in Machine Learning]

◮ SGD with Momentum and a decay rate often outperforms Adam (but requires tuning).

source: github.com/YingzhenLi

  • A. Banburski
slide-58
SLIDE 58

Overview

Initialization & hyper-parameter tuning
Optimization algorithms
Batchnorm & Dropout
Finite dataset woes
Software

  • A. Banburski
slide-59
SLIDE 59

Data pre-processing

Since our non-linearities change their behavior around the origin, it makes sense to pre-process to zero-mean and unit variance:

x̂_i = (x_i − E[x_i]) / √Var[x_i]   (6)

  • A. Banburski
slide-60
SLIDE 60

Data pre-processing

Since our non-linearities change their behavior around the origin, it makes sense to pre-process to zero-mean and unit variance:

x̂_i = (x_i − E[x_i]) / √Var[x_i]   (6)

source: cs231n.github.io
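A sketch of this standardization, both directly on a tensor and as the usual torchvision transform (the per-channel mean/std values shown are the commonly used ImageNet statistics, given only as an example):

    import torch
    import torchvision.transforms as T

    x = torch.randn(1000, 3)                       # placeholder data
    x_hat = (x - x.mean(dim=0)) / x.std(dim=0)     # zero mean, unit variance per feature, as in (6)

    # Typical image pipeline: per-channel normalization with dataset statistics.
    normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])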

  • A. Banburski
slide-61
SLIDE 61

Batch Normalization

A common technique is to repeat this throughout the deep network in a differentiable way:

  • A. Banburski
slide-62
SLIDE 62

Batch Normalization

A common technique is to repeat this throughout the deep network in a differentiable way: [Ioffe and Szegedy, 2015]

  • A. Banburski
slide-63
SLIDE 63

Batch Normalization

In practice, a batchnorm layer is added after a conv or fully-connected layer, but before activations.

  • A. Banburski
slide-64
SLIDE 64

Batch Normalization

In practice, a batchnorm layer is added after a conv or fully-connected layer, but before activations.

◮ In the original paper the authors claimed that this is meant to

reduce covariate shift.

  • A. Banburski
slide-65
SLIDE 65

Batch Normalization

In practice, a batchnorm layer is added after a conv or fully-connected layer, but before activations.

◮ In the original paper the authors claimed that this is meant to

reduce covariate shift.

◮ More obviously, this reduces 2nd-order correlations between layers.

Recently shown that it actually doesn’t change covariate shift! Instead it smooths out the landscape.

  • A. Banburski
slide-66
SLIDE 66

Batch Normalization

In practice, a batchnorm layer is added after a conv or fully-connected layer, but before activations.

◮ In the original paper the authors claimed that this is meant to

reduce covariate shift.

◮ More obviously, this reduces 2nd-order correlations between layers.

Recently shown that it actually doesn’t change covariate shift! Instead it smooths out the landscape.

  • A. Banburski

[Santurkar, Tsipras, Ilyas, Madry, 2018]

slide-67
SLIDE 67

Batch Normalization

In practice, a batchnorm layer is added after a conv or fully-connected layer, but before activations.

◮ In the original paper the authors claimed that this is meant to

reduce covariate shift.

◮ More obviously, this reduces 2nd-order correlations between layers.

Recently shown that it actually doesn’t change covariate shift! Instead it smooths out the landscape.

◮ In practice this reduces dependence on initialization and seems to

stabilize the flow of gradient descent.

  • A. Banburski
slide-68
SLIDE 68

Batch Normalization

In practice, a batchnorm layer is added after a conv or fully-connected layer, but before activations.

◮ In the original paper the authors claimed that this is meant to

reduce covariate shift.

◮ More obviously, this reduces 2nd-order correlations between layers. It was recently shown that it actually doesn't change covariate shift! Instead it smooths out the optimization landscape.

◮ In practice this reduces dependence on initialization and seems to

stabilize the flow of gradient descent.

◮ Using BN usually nets you a gain of a few % in test accuracy.
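A sketch of the usual placement, conv → batchnorm → activation:

    import torch.nn as nn

    block = nn.Sequential(
        nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),  # bias is redundant before BN
        nn.BatchNorm2d(128),        # normalize each channel over the mini-batch
        nn.ReLU(inplace=True),      # the non-linearity comes after the batchnorm layer
    )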

  • A. Banburski
slide-69
SLIDE 69

Dropout

Another common technique: during the forward pass, set some of the activations (neurons) to 0 randomly with probability p. A typical choice is p = 50%.

  • A. Banburski
slide-70
SLIDE 70

Dropout

Another common technique: during the forward pass, set some of the activations (neurons) to 0 randomly with probability p. A typical choice is p = 50%.

  • A. Banburski
slide-71
SLIDE 71

Dropout

Another common technique: during the forward pass, set some of the activations (neurons) to 0 randomly with probability p. A typical choice is p = 50%.

◮ The idea is to prevent co-adaptation of neurons.

  • A. Banburski
slide-72
SLIDE 72

Dropout

Another common technique: during the forward pass, set some of the activations (neurons) to 0 randomly with probability p. A typical choice is p = 50%.

◮ The idea is to prevent co-adaptation of neurons.
◮ At test time we want to remove the randomness. A good approximation is to multiply the neural network by p.

  • A. Banburski
slide-73
SLIDE 73

Dropout

Another common technique: during the forward pass, set some of the activations (neurons) to 0 randomly with probability p. A typical choice is p = 50%.

◮ The idea is to prevent co-adaptation of neurons.
◮ At test time we want to remove the randomness. A good approximation is to multiply the neural network by p.
◮ Dropout is more commonly applied to fully-connected layers, though its use is waning.
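A sketch with the built-in layer; note that PyTorch uses "inverted" dropout (survivors are rescaled by 1/(1−p) during training), so the layer is simply the identity at test time:

    import torch
    import torch.nn as nn

    drop = nn.Dropout(p=0.5)   # each activation is zeroed with probability 0.5 during training
    x = torch.randn(4, 16)

    drop.train()
    y_train = drop(x)          # random mask applied; survivors rescaled by 1/(1-p)

    drop.eval()
    y_test = drop(x)           # identity at test time: the randomness is removed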

  • A. Banburski
slide-74
SLIDE 74

Overview

Initialization & hyper-parameter tuning
Optimization algorithms
Batchnorm & Dropout
Finite dataset woes
Software

  • A. Banburski
slide-75
SLIDE 75

Finite dataset woes

While we are entering the Big Data age, in practice we often find ourselves with insufficient data to properly train our deep neural networks.

◮ What if collecting more data is slow/difficult?

  • A. Banburski
slide-76
SLIDE 76

Finite dataset woes

While we are entering the Big Data age, in practice we often find ourselves with insufficient data to properly train our deep neural networks.

◮ What if collecting more data is slow/difficult?
◮ Can we squeeze out more from what we already have?

  • A. Banburski
slide-77
SLIDE 77

Invariance problem

An often-repeated claim about CNNs is that they are invariant to small translations. Regardless of whether this is true, they are not invariant to most other types of transformations:

source: cs231n.github.io

  • A. Banburski
slide-78
SLIDE 78

Data augmentation

◮ Can greatly increase the amount of data by performing:

  • A. Banburski
slide-79
SLIDE 79

Data augmentation

◮ Can greatly increase the amount of data by performing:

– Translations – Rotations – Reflections – Scaling – Cropping – Adding Gaussian Noise – Adding Occlusion – Interpolation – etc.

  • A. Banburski
slide-80
SLIDE 80

Data augmentation

◮ Can greatly increase the amount of data by performing:

– Translations – Rotations – Reflections – Scaling – Cropping – Adding Gaussian Noise – Adding Occlusion – Interpolation – etc.

◮ Crucial for achieving state-of-the-art performance!

  • A. Banburski
slide-81
SLIDE 81

Data augmentation

◮ Can greatly increase the amount of data by performing:

– Translations – Rotations – Reflections – Scaling – Cropping – Adding Gaussian Noise – Adding Occlusion – Interpolation – etc.

◮ Crucial for achieving state-of-the-art performance! ◮ For example, ResNet improves from 11.66% to 6.41% error on

CIFAR-10 dataset and from 44.74% to 27.22% on CIFAR-100.

  • A. Banburski
slide-82
SLIDE 82

Data augmentation

source: github.com/aleju/imgaug
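A sketch of a typical augmentation pipeline with torchvision; the transform names are standard, the parameter values are illustrative:

    import torchvision.transforms as T

    train_transform = T.Compose([
        T.RandomCrop(32, padding=4),       # small translations via padded random crops
        T.RandomHorizontalFlip(),          # reflections
        T.RandomRotation(15),              # small rotations
        T.ColorJitter(brightness=0.2),     # mild photometric noise
        T.ToTensor(),
    ])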

  • A. Banburski
slide-83
SLIDE 83

Transfer Learning

What if you truly have too little data?

◮ If your data has sufficient similarity to a bigger dataset, then you're in luck!

  • A. Banburski
slide-84
SLIDE 84

Transfer Learning

What if you truly have too little data?

◮ If your data has sufficient similarity to a bigger dataset, then you're in luck!

◮ Idea: take a model trained for example on ImageNet.

  • A. Banburski
slide-85
SLIDE 85

Transfer Learning

What if you truly have too little data?

◮ If your data has sufficient similarity to a bigger dataset, then you're in luck!
◮ Idea: take a model trained, for example, on ImageNet.
◮ Freeze all but the last few layers and retrain on your small data. The bigger your dataset, the more layers you have to retrain.

  • A. Banburski
slide-86
SLIDE 86

Transfer Learning

What if you truly have too little data?

◮ If your data has sufficient similarity to a bigger dataset, then you're in luck!
◮ Idea: take a model trained, for example, on ImageNet.
◮ Freeze all but the last few layers and retrain on your small data. The bigger your dataset, the more layers you have to retrain.

source: [Haase et al., 2014]
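A sketch of the freeze-and-retrain recipe with a torchvision ResNet pretrained on ImageNet; the 10-class output layer is a placeholder for your small dataset:

    import torch.nn as nn
    import torchvision.models as models

    model = models.resnet18(pretrained=True)          # backbone trained on ImageNet
    for p in model.parameters():
        p.requires_grad = False                        # freeze all layers

    model.fc = nn.Linear(model.fc.in_features, 10)     # fresh final layer for a hypothetical 10-class task
    # Only model.fc is trainable now; with more data, unfreeze the last residual blocks as well.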

  • A. Banburski
slide-87
SLIDE 87

Overview

Initialization & hyper-parameter tuning
Optimization algorithms
Batchnorm & Dropout
Finite dataset woes
Software

  • A. Banburski
slide-88
SLIDE 88

Software overview

  • A. Banburski
slide-89
SLIDE 89

Software overview

  • A. Banburski
slide-90
SLIDE 90

Why use frameworks?

◮ You don’t have to implement everything yourself.

  • A. Banburski
slide-91
SLIDE 91

Why use frameworks?

◮ You don’t have to implement everything yourself. ◮ Many inbuilt modules allow quick iteration of ideas – building a

neural network becomes putting simple blocks together and computing backprop is a breeze.

  • A. Banburski
slide-92
SLIDE 92

Why use frameworks?

◮ You don’t have to implement everything yourself. ◮ Many inbuilt modules allow quick iteration of ideas – building a

neural network becomes putting simple blocks together and computing backprop is a breeze.

◮ Someone else already wrote CUDA code to efficiently run training on GPUs (or TPUs).
  • A. Banburski
slide-93
SLIDE 93

Main design difference

source: Introduction to Chainer

  • A. Banburski
slide-94
SLIDE 94

PyTorch concepts

Similar in code to numpy.

  • A. Banburski
slide-95
SLIDE 95

PyTorch concepts

Similar in code to numpy.

◮ Tensor: nearly identical to np.array, but can run on GPU just by moving it to a CUDA device.

  • A. Banburski
slide-96
SLIDE 96

PyTorch concepts

Similar in code to numpy.

◮ Tensor: nearly identical to np.array, but can run on GPU just by moving it to a CUDA device.
◮ Autograd: package for automatic computation of backprop and construction of computational graphs.

  • A. Banburski
slide-97
SLIDE 97

PyTorch concepts

Similar in code to numpy.

◮ Tensor: nearly identical to np.array, but can run on GPU just by moving it to a CUDA device.
◮ Autograd: package for automatic computation of backprop and construction of computational graphs.
◮ Module: neural network layer storing weights.

  • A. Banburski
slide-98
SLIDE 98

PyTorch concepts

Similar in code to numpy.

◮ Tensor: nearly identical to np.array, but can run on GPU just by moving it to a CUDA device.
◮ Autograd: package for automatic computation of backprop and construction of computational graphs.
◮ Module: neural network layer storing weights.
◮ Dataloader: class for simplifying efficient data loading.
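A minimal sketch tying these pieces together: a Tensor moved to the GPU when available, a Module, autograd, and a DataLoader over toy data:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Toy data wrapped in a Dataset/DataLoader
    data = TensorDataset(torch.randn(512, 10), torch.randn(512, 1))
    loader = DataLoader(data, batch_size=32, shuffle=True)

    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    for epoch in range(5):
        for x, y in loader:
            x, y = x.to(device), y.to(device)   # Tensors move to GPU with .to(device)
            loss = nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()                      # autograd computes all the gradients
            opt.step()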

  • A. Banburski
slide-99
SLIDE 99

PyTorch - optimization

  • A. Banburski
slide-100
SLIDE 100

PyTorch - ResNet in one page

@jeremyphoward

  • A. Banburski
slide-101
SLIDE 101

Tensorflow static graphs

source: cs231n.github.io

  • A. Banburski
slide-102
SLIDE 102

Keras wrapper - closer to PyTorch

source: cs231n.github.io

  • A. Banburski
slide-103
SLIDE 103

Tensorboard - very useful tool for visualization

  • A. Banburski
slide-104
SLIDE 104

Tensorflow overview

◮ Main difference – uses static graphs. Longer code, but more optimized. In practice PyTorch is faster to experiment on.
  • A. Banburski
slide-105
SLIDE 105

Tensorflow overview

◮ Main difference – uses static graphs. Longer code, but more optimized. In practice PyTorch is faster to experiment on.

◮ With the Keras wrapper, however, the code is more similar to PyTorch.

  • A. Banburski
slide-106
SLIDE 106

Tensorflow overview

◮ Main difference – uses static graphs. Longer code, but more optimized. In practice PyTorch is faster to experiment on.

◮ With the Keras wrapper, however, the code is more similar to PyTorch.
◮ Can use TPUs.

  • A. Banburski
slide-107
SLIDE 107

But

◮ Tensorflow has added dynamic batching, which makes dynamic

graphs possible.

  • A. Banburski
slide-108
SLIDE 108

But

◮ Tensorflow has added dynamic batching, which makes dynamic

graphs possible.

◮ PyTorch is merging with Caffe2, which will provide static graphs too!

  • A. Banburski
slide-109
SLIDE 109

But

◮ Tensorflow has added dynamic batching, which makes dynamic

graphs possible.

◮ PyTorch is merging with Caffe2, which will provide static graphs too! ◮ Which one to choose then?

  • A. Banburski
slide-110
SLIDE 110

But

◮ Tensorflow has added dynamic batching, which makes dynamic

graphs possible.

◮ PyTorch is merging with Caffe2, which will provide static graphs too! ◮ Which one to choose then?

– PyTorch is more popular in the research community for easy development and debugging.

  • A. Banburski
slide-111
SLIDE 111

But

◮ Tensorflow has added dynamic batching, which makes dynamic

graphs possible.

◮ PyTorch is merging with Caffe2, which will provide static graphs too! ◮ Which one to choose then?

– PyTorch is more popular in the research community for easy development and debugging.
– In the past a better choice for production was Tensorflow. Still the only choice if you want to use TPUs.
  • A. Banburski