slide-1
SLIDE 1

Deep learning 5.6. Architecture choice and training protocol

François Fleuret https://fleuret.org/ee559/ Nov 2, 2020

slide-8
SLIDE 8

Choosing the network structure is a difficult exercise. There is no silver bullet.

  • Re-use something “well known, that works”, or at least start from there,
  • split feature extraction / inference (although this is debatable),
  • modulate the capacity until it overfits a small subset, but does not overfit / underfit the full set,
  • capacity increases with more layers, more channels, larger receptive fields, or more units,
  • regularization to reduce the capacity or induce sparsity,
  • identify common paths for siamese-like architectures,
  • identify what path(s) or sub-parts need more/less capacity,
  • use prior knowledge about the “scale of meaningful context” to size filters / combinations of filters (e.g. knowing the size of objects in a scene, or the maximum duration of a sound snippet that matters),
  • grid-search all the variations that come to mind (and hopefully have farms of GPUs to do so).

We will re-visit this list with additional regularization / normalization methods.

François Fleuret Deep learning / 5.6. Architecture choice and training protocol 1 / 9
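The “overfit a small subset” check in the list above can be sketched without any framework. Below is a minimal NumPy version with a tiny two-layer network on a handful of random samples; all sizes, the step count, and the learning rate are illustrative, not part of the lecture. The point is the protocol: if a model with ample capacity cannot drive the training loss on five samples close to zero, the architecture or training loop is likely broken.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def overfit_check(n_samples=5, n_in=8, n_hidden=32, n_out=3,
                  steps=2000, lr=1e-1, seed=0):
    """Train a tiny two-layer tanh net on a few random samples with plain SGD
    and return the final mean cross-entropy, which should be close to zero."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_in))
    y = rng.integers(0, n_out, size=n_samples)

    W1 = rng.normal(scale=0.5, size=(n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_hidden, n_out)); b2 = np.zeros(n_out)

    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)                   # forward pass
        p = softmax(h @ W2 + b2)
        g = (p - np.eye(n_out)[y]) / n_samples     # dLoss/dlogits for cross-entropy
        dW2, db2 = h.T @ g, g.sum(0)               # backward pass
        dh = g @ W2.T * (1 - h ** 2)               # tanh' = 1 - tanh^2
        dW1, db1 = X.T @ dh, dh.sum(0)
        W1 -= lr * dW1; b1 -= lr * db1             # plain SGD step
        W2 -= lr * dW2; b2 -= lr * db2

    p = softmax(np.tanh(X @ W1 + b1) @ W2 + b2)
    return -np.log(p[np.arange(n_samples), y]).mean()
```

The same check applies unchanged to a real model: take 5–10 training samples, disable regularization, and verify the training loss collapses before worrying about the full set.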

slide-12
SLIDE 12

Regarding the learning rate, for training to succeed it has to

  • reduce the loss quickly ⇒ large learning rate,
  • not be trapped in a bad minimum ⇒ large learning rate,
  • not bounce around in narrow valleys ⇒ small learning rate, and
  • not oscillate around a minimum ⇒ small learning rate.

These constraints lead to a general policy of using a larger step size first, and a smaller one in the end. The practical strategy is to look at the losses and error rates across epochs, and to pick a learning rate and a learning-rate adaptation accordingly, for instance reducing it at discrete pre-defined steps, or with a geometric decay.

François Fleuret Deep learning / 5.6. Architecture choice and training protocol 2 / 9
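The two adaptation schemes just mentioned can be written as simple functions of the epoch. In the sketch below the cut-off epoch and rates mirror the CIFAR10 example on the following slides, but the function names and the decay factor are arbitrary choices for illustration.

```python
def step_lr(epoch, lr0=1e-1, drops=((25, 5e-2),)):
    """Piece-wise constant schedule: start at lr0, then switch to a new
    value at each pre-defined epoch in `drops`."""
    lr = lr0
    for e, new_lr in drops:
        if epoch >= e:
            lr = new_lr
    return lr

def geometric_lr(epoch, lr0=1e-1, gamma=0.97):
    """Geometric decay: multiply the rate by gamma every epoch."""
    return lr0 * gamma ** epoch
```

For example, `step_lr(24)` returns 1e-1 and `step_lr(25)` returns 5e-2. In PyTorch, `torch.optim.lr_scheduler.StepLR` and `ExponentialLR` provide the same two behaviors.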

slide-13
SLIDE 13

CIFAR10 data-set: 32 × 32 color images, 50,000 train samples, 10,000 test samples (Krizhevsky, 2009, chap. 3).

François Fleuret Deep learning / 5.6. Architecture choice and training protocol 3 / 9

slide-14
SLIDE 14

Small convnet on CIFAR10, cross-entropy, batch size 100, η = 1e−1.

[Figure: train loss (log scale, 10⁻³ to 10⁰) and test accuracy (0.40 to 0.75) vs. number of epochs, 1 to 50.]

François Fleuret Deep learning / 5.6. Architecture choice and training protocol 4 / 9

slide-15
SLIDE 15

Small convnet on CIFAR10, cross-entropy, batch size 100

[Figure: train loss (log scale) vs. number of epochs for lr = 2e-1, 1e-1, and 1e-2.]

François Fleuret Deep learning / 5.6. Architecture choice and training protocol 5 / 9

slide-16
SLIDE 16

Using η = 1e−1 for 25 epochs, then reducing it.

[Figure: train loss vs. number of epochs, with the learning rate either unchanged or reduced at epoch 25 to lr2 = 7e-2, 5e-2, or 2e-2.]

François Fleuret Deep learning / 5.6. Architecture choice and training protocol 6 / 9

slide-17
SLIDE 17

Using η = 1e−1 for 25 epochs, then η = 5e−2.

[Figure: train loss (log scale) and test accuracy vs. number of epochs.]

François Fleuret Deep learning / 5.6. Architecture choice and training protocol 7 / 9

slide-18
SLIDE 18

While the test error still goes down, the test loss may increase: the loss gets even worse on the misclassified examples, and decreases less on the ones getting fixed.

[Figure: train loss, test loss (log scale), and test accuracy vs. number of epochs.]

François Fleuret Deep learning / 5.6. Architecture choice and training protocol 8 / 9
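This decoupling between test error and test loss can be reproduced on paper. In the toy comparison below (all probabilities are made up for illustration), a later predictor fixes one of two mistakes but becomes far more confident on the remaining wrong one, so its error rate drops while its average cross-entropy rises.

```python
import numpy as np

def avg_ce_and_error(probs, y):
    """Average cross-entropy and error rate, given per-sample predicted
    class probabilities and true labels."""
    probs = np.asarray(probs, dtype=float)
    y = np.asarray(y)
    ce = -np.log(probs[np.arange(len(y)), y]).mean()
    err = (probs.argmax(axis=1) != y).mean()
    return ce, err

# Three samples, two classes, true labels all class 0.
y = [0, 0, 0]
early = [[0.60, 0.40],   # correct
         [0.45, 0.55],   # wrong, barely
         [0.45, 0.55]]   # wrong, barely
late  = [[0.90, 0.10],   # correct, confident
         [0.60, 0.40],   # now correct
         [0.01, 0.99]]   # still wrong, very confident

ce_early, err_early = avg_ce_and_error(early, y)
ce_late, err_late = avg_ce_and_error(late, y)
# The error rate goes from 2/3 down to 1/3, while the average loss goes up:
# the single confident mistake contributes -log(0.01) ≈ 4.6 on its own.
```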

slide-19
SLIDE 19

We can plot the train and test distributions of the per-sample loss

  ℓ = −log [ exp(f_Y(X; w)) / Σ_k exp(f_k(X; w)) ]

through epochs to visualize the over-fitting.

[Figure: histograms of the per-sample loss (log scale, 10⁻⁵ to 10¹) on the train and test sets, stepped through epochs 1 to 50.]

François Fleuret Deep learning / 5.6. Architecture choice and training protocol 9 / 9
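The per-sample loss above is just the negative log of the softmax probability of the true class. A numerically stable NumPy version that could feed such histograms might look as follows; the function name and the log-sum-exp formulation are implementation choices, not part of the slides.

```python
import numpy as np

def per_sample_loss(f, y):
    """Per-sample cross-entropy  ℓ = -log( exp(f_y) / Σ_k exp(f_k) ),
    computed stably as logsumexp(f) - f_y."""
    f = np.asarray(f, dtype=float)
    y = np.asarray(y)
    m = f.max(axis=1, keepdims=True)                      # stability shift
    log_z = m[:, 0] + np.log(np.exp(f - m).sum(axis=1))   # log Σ_k exp(f_k)
    return log_z - f[np.arange(len(y)), y]
```

For uniform logits over two classes the loss is log 2 ≈ 0.693, and it is always non-negative, since log Σ_k exp(f_k) ≥ f_y. Histogramming these values on the train and test sets at each epoch (e.g. with `np.histogram` on a log scale) reproduces the visualization above.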

slide-37
SLIDE 37

The end

slide-38
SLIDE 38

References

  • A. Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.