SLIDE 1

EE-559 – Deep learning

8. Under the hood

François Fleuret https://fleuret.org/dlc/

[version of: June 5, 2018]

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

SLIDE 2

Understanding a network’s behavior

SLIDE 3

Understanding what is happening in a deep architecture after training is complex, and the tools we have at our disposal are limited. In the case of convolutional feed-forward networks, we can look at

  • the network’s parameters, filters as images,
  • internal activations as images,
  • distributions of activations on a population of samples,
  • derivatives of the response(s) wrt the input,
  • maximum-response synthetic samples,
  • adversarial samples.

SLIDE 4–5

Given a one-hidden-layer fully connected network R^2 → R^2

nb_hidden = 20
model = nn.Sequential(
    nn.Linear(2, nb_hidden),
    nn.ReLU(),
    nn.Linear(nb_hidden, 2)
)

we can represent each of its internal units as a line corresponding to { x : w · x + b = 0 }. During training, these separations get organized so that their combination properly delimits the different value domains in the signal space.
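As a hedged sketch (our own illustration, not the lecture's code), the parameters of these lines can be read directly from the first nn.Linear: each row of its weight matrix, with the corresponding bias, defines one separation.

# Minimal sketch, assuming `model` is the nn.Sequential defined above:
# row k of the first layer's weights and its bias give { x : w_k · x + b_k = 0 }.
w = model[0].weight.data   # shape: nb_hidden x 2
b = model[0].bias.data     # shape: nb_hidden
for k in range(w.size(0)):
    print('unit {}: {:.3f} * x1 + {:.3f} * x2 + {:.3f} = 0'.format(
        k, float(w[k, 0]), float(w[k, 1]), float(b[k])))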

SLIDE 6–17

[Animation frames: the separating lines of the hidden units at iterations 1, 4, 7, 10, 16, 34, 77, 100, 703, 1407, 2789, and 4999.]
SLIDE 18–29

[Animation frames: a second example, at iterations 1, 4, 7, 10, 16, 34, 100, 272, 556, 887, 2222, and 4999.]
SLIDE 30

Convnet filters

SLIDE 31

A similar analysis is complicated to conduct with real-life networks, given the high dimension of the signal. The simplest approach for convnets consists of looking at the filters as images. While this is quite reasonable for the first layer, whose filters operate directly on the input image, it is far less so for the subsequent layers.

SLIDE 32

LeNet’s first convolutional layer (1 → 32), all filters

SLIDE 33

LeNet’s second convolutional layer (32 → 64), first 32 filters out of 64

SLIDE 34

AlexNet’s first convolutional layer (3 → 64), first 20 filters out of 64

SLIDE 35

AlexNet’s first convolutional layer (3 → 64), first 20 filters out of 64, as RGB images
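A possible way to produce such figures, sketched here as an assumption rather than the lecture's actual code: rescale the first convolution's weight tensor to [0, 1] and tile it with torchvision.

from torchvision import models, utils

model = models.alexnet(pretrained = True)
w = model.features[0].weight.data            # 64 x 3 x 11 x 11
w = (w - w.min()) / (w.max() - w.min())      # rescale to [0, 1] for display
utils.save_image(w[:20], 'alexnet-conv1-filters.png', nrow = 10)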

SLIDE 36

AlexNet’s second convolutional layer (64 → 192). First 15 channels (out of 64) of the first 20 filters (out of 192).

SLIDE 37

Convnet internal layer activations

SLIDE 38–39

An alternative approach is to look at the activations themselves. Since the convolutional layers maintain the 2d structure of the signal, the activations can be visualized as images, where the local coding at any location of an activation map is associated to the original content at that same location.

Given the large number of channels, we have to pick a few at random. Since the representation is distributed across multiple channels, individual channels usually have no clear semantic.
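A minimal sketch of how such activation images can be produced (the layer index and file name are assumptions, not the lecture's code): keep an internal output with a forward hook, then save a few random channels.

import torch
from torchvision import models, utils

model = models.alexnet(pretrained = True)
model.eval()
activations = {}

def keep_output(module, input, output):
    activations['a'] = output.data

# hypothetical choice: the output of AlexNet's second convolution
handle = model.features[3].register_forward_hook(keep_output)
model(img)   # img is assumed to be a 1 x 3 x H x W input tensor
handle.remove()

a = activations['a'][0]                      # channels x H x W
idx = torch.randperm(a.size(0))[:12]         # pick a few channels at random
utils.save_image(a[idx].unsqueeze(1), 'activations.png', normalize = True)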

SLIDE 40

A MNIST character with LeNet (LeCun et al., 1998).

SLIDE 41–45

An RGB image with AlexNet (Krizhevsky et al., 2012).
SLIDE 46–48

ILSVRC12 with ResNet152 (He et al., 2015).
SLIDE 49

Yosinski et al. (2015) developed analysis tools to explore a network and look at the internal activations for a given input signal. This allowed them in particular to find units with a clear semantic in an AlexNet-like network trained on ImageNet.

SLIDE 50

Figure 2. A view of the 13×13 activations of the 151st channel on the conv5 layer of a deep neural network trained on ImageNet, a dataset that does not contain a face class, but does contain many images with faces. The channel responds to human and animal faces and is robust to changes in scale, pose, lighting, and context, which can be discerned by a user by actively changing the scene in front of a webcam or by loading static images (e.g. of the lions) and seeing the corresponding response of the unit. Photo of lions via Flickr user arnolouise, licensed under CC BY-NC-SA 2.0.

(Yosinski et al., 2015)

SLIDE 51

Prediction of 2d dynamics with an 18-layer residual network. [Figure panels: G_n, S_n, R_n.] (Fleuret, 2016)

SLIDE 52

[Figure panels: S_n, G_n, R_n, Ψ(S_n, G_n).] (Fleuret, 2016)

SLIDE 53

[Figure: channels 1/1024, 2/1024, 3/1024, …, 511/1024, 512/1024, 513/1024, 514/1024, …] (Fleuret, 2016)

SLIDE 54–55

(Fleuret, 2016)
SLIDE 56

Layers as embeddings

SLIDE 57

In the classification case, the network can be seen as a series of processing steps aiming at disentangling the classes, to make them easily separable for the final decision. In this perspective, it makes sense to look at how the samples are distributed spatially after each layer.

SLIDE 58

The main issue in doing so is the dimensionality of the signal. If we look at the total number of dimensions in each layer:

  • a MNIST sample in LeNet goes from 784 to up to 18k dimensions,
  • an ILSVRC12 sample in ResNet152 goes from 150k to up to 800k dimensions.

This requires a means to project a [very] high-dimension point cloud into a 2d or 3d “human-brain accessible” representation.
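Such counts can be checked with a forward pass that records each layer's output size. A small sketch, with a LeNet-like convolutional stack that is an assumption, not the lecture's exact model:

import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2),
)
x = torch.randn(1, 1, 28, 28)
print('input', x.numel())          # 784
for layer in model:
    x = layer(x)
    print(layer.__class__.__name__, x.numel())
# the first convolution's output already has 32 x 24 x 24 = 18432 dimensions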

SLIDE 59–60

We have already seen PCA and k-means as two standard methods for dimension reduction, but they poorly convey the structure of a smooth, low-dimension, non-flat manifold. There exists a plethora of methods that aim at reflecting in low dimension the structure of data points in high dimension. A popular one is t-SNE, developed by van der Maaten and Hinton (2008).

SLIDE 61

Given data points in high dimension

  D = { x_n ∈ R^D, n = 1, …, N },

the objective of data visualization is to find a set of corresponding low-dimension points

  E = { y_n ∈ R^C, n = 1, …, N },

such that the positions of the y_n “reflect” those of the x_n.

SLIDE 62–63

The t-Distributed Stochastic Neighbor Embedding (t-SNE) proposed by van der Maaten and Hinton (2008) optimizes with SGD the y_n so that the distances to close neighbors of each point are preserved. It actually matches, in the sense of D_KL, two distance-dependent distributions: Gaussian in the original space, and Student t-distribution in the low-dimension one.
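An illustrative sketch of the two affinity kernels involved (a simplification with one global bandwidth σ; the actual algorithm sets per-point bandwidths from the perplexity):

import torch

def gaussian_affinities(x, sigma = 1.0):      # in the original space
    d2 = torch.cdist(x, x).pow(2)
    p = torch.exp(-d2 / (2 * sigma ** 2))
    p.fill_diagonal_(0)
    return p / p.sum()

def student_affinities(y):                    # in the embedding space
    d2 = torch.cdist(y, y).pow(2)
    q = 1 / (1 + d2)
    q.fill_diagonal_(0)
    return q / q.sum()

# t-SNE optimizes the y_n with SGD to minimize KL(p || q)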

SLIDE 64

The scikit-learn toolbox http://scikit-learn.org/ is built around SciPy, and provides many machine learning algorithms, in particular embeddings, among which an implementation of t-SNE. The only catch to using it from PyTorch is the conversion to and from numpy arrays.

from sklearn.manifold import TSNE

# x is the array of the original high-dimension points
x_np = x.numpy()
y_np = TSNE(n_components = 2, perplexity = 50).fit_transform(x_np)
y = torch.from_numpy(y_np)

n_components specifies the embedding dimension and perplexity states [crudely] how many points are considered neighbors of each point.

SLIDE 65–66

t-SNE unrolling of the swiss roll (with one noise dimension)
SLIDE 67–71

t-SNE for LeNet on MNIST [frames: the input, then layers #1, #4, #7, and #9].
SLIDE 72–85

t-SNE for a home-baked resnet (no pooling, 66 layers) on CIFAR10 [frames: the input, then layers #5, #10, #15, #20, #25, #30, #31, #32, #33, #34, #35, #36, and #37].
SLIDE 86

Occlusion sensitivity

SLIDE 87–88

Another approach to understanding the functioning of a network is to look at the behavior of the network “around” an image. For instance, we can get a simple estimate of the importance of a part of the input image by computing the difference between:

  1. the value of the maximally responding output unit on the image, and
  2. the value of the same unit with that part occluded.

This is computationally intensive, since it requires as many forward passes as there are locations of the occlusion mask, ideally one per pixel, as sketched below.
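A minimal sketch of the procedure (our own illustration, not the lecture's code; it assumes model is a trained convnet, img a 1 × C × H × W tensor, and a gray value of 0.5 for the mask):

import torch

def occlusion_map(model, img, mask_size = 32, stride = 2):
    output = model(img)
    k = output.data.max(1)[1][0]             # maximally responding unit
    ref = output.data[0, k]
    h, w = img.size(2), img.size(3)
    result = torch.zeros((h - mask_size) // stride + 1,
                         (w - mask_size) // stride + 1)
    for i in range(result.size(0)):
        for j in range(result.size(1)):
            occluded = img.clone()
            occluded[:, :, i*stride:i*stride+mask_size,
                           j*stride:j*stride+mask_size] = 0.5   # gray square
            # importance = drop of the unit's response when this part is hidden
            result[i, j] = ref - model(occluded).data[0, k]
    return result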

SLIDE 89–92

Original images and the 32 × 32 occlusion mask; occlusion sensitivity with a 32 × 32 mask and a stride of 2, for AlexNet, VGG16, and VGG19.
SLIDE 93

Saliency maps

SLIDE 94

An alternative is to compute the gradient of the maximally responding output unit with respect to the input (Erhan et al., 2009; Simonyan et al., 2013), e.g.

  ∇|x f(x; w),

where f is the activation of the output unit with maximum response, and |x stresses that the gradient is computed with respect to the input x, and not, as usual, with respect to the parameters w.

SLIDE 95

This can be implemented by specifying that we need the gradient with respect to the input. We use here the correct unit, not the maximum response one. Using torch.autograd.grad to compute the gradient wrt the input image instead of torch.autograd.backward has the advantage of not changing the model’s parameter gradients.

input = Variable(img, requires_grad = True)
output = model(input)
loss = nllloss(output, target)
grad_input, = torch.autograd.grad(loss, input)

Note that since torch.autograd.grad computes the gradient of a function with possibly multiple inputs, the returned result is a tuple.

SLIDE 96

The resulting maps are quite noisy. For instance with AlexNet:

SLIDE 97

This is due to the local irregularity of the network’s response as a function of the input.

Figure 2. The partial derivative of S_c with respect to the RGB values of a single pixel, as a fraction of the maximum entry in the gradient vector, max_i ∂S_c/∂x_i(t) (middle plot), as one slowly moves away from a baseline image x (left plot) to a fixed location x + ε (right plot). ε is one random sample from N(0, 0.01²). The final image (x + ε) is indistinguishable to a human from the original image x.

(Smilkov et al., 2017)

SLIDE 98

Smilkov et al. (2017) proposed to smooth the gradient with respect to the input image by averaging over slightly perturbed versions of the latter:

  ∇̃|x f_y(x; w) = (1/N) ∑_{n=1}^{N} ∇|x f_y(x + ε_n; w),

where ε_1, …, ε_N are i.i.d. of distribution N(0, σ²I), and σ is a fraction of the gap ∆ between the maximum and the minimum of the pixel values.

SLIDE 99

A simple version of this “SmoothGrad” approach can be implemented as follows

nb_smooth = 100
std = smooth_std * (img.max() - img.min())
acc_grad = img.new(img.size()).zero_()

for q in range(nb_smooth):  # This should be done with mini-batches ...
    noisy_input = img + img.new(img.size()).normal_(0, std)
    noisy_input = Variable(noisy_input, requires_grad = True)
    output = model(noisy_input)
    loss = nllloss(output, target)
    grad_input, = torch.autograd.grad(loss, noisy_input)
    acc_grad += grad_input.data

acc_grad = acc_grad.abs().sum(1)  # sum across channels

SLIDE 100–102

Original images, the gradient, and SmoothGrad with σ = ∆/4, for AlexNet, VGG16, and VGG19.
SLIDE 103

Deconvolution and guided back-propagation

SLIDE 104

Zeiler and Fergus (2014) proposed to invert the processing flow of a convolutional network by constructing a corresponding deconvolutional network to compute the “activating pattern” of a sample. As they point out, the resulting processing is identical to a standard backward pass, except when going through the ReLU layers.

SLIDE 105

Remember that if s is one of the inputs to a ReLU layer, and x the corresponding output, we have for the forward pass x = max(0, s), and for the backward pass

  ∂l/∂s = 1{s>0} ∂l/∂x.
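A two-line autograd check of this rule (illustrative, not part of the lecture):

import torch

s = torch.tensor([ -1.0, 2.0 ], requires_grad = True)
x = torch.relu(s)
x.backward(torch.tensor([ 0.5, 0.5 ]))   # plays the role of dl/dx
print(s.grad)                            # tensor([0.0000, 0.5000]) = 1{s>0} * dl/dx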

SLIDE 106–107

Zeiler and Fergus’s deconvolution can be seen as a backward pass where we propagate back through the ReLU layers the quantity

  max(0, ∂l/∂x) = 1{∂l/∂x > 0} ∂l/∂x,

instead of the usual

  ∂l/∂s = 1{s>0} ∂l/∂x.

This quantity is positive for units whose output has a positive contribution to the response, kills the others, and is not modulated by the pre-layer activation s.

SLIDE 108

Springenberg et al. (2014) improved upon the deconvolution with the guided back-propagation, which aims at the best of both worlds: discarding structures which would not contribute positively to the final response, and discarding structures which are not already present. It back-propagates through the ReLU layers the quantity

  1{s>0} 1{∂l/∂x > 0} ∂l/∂x,

which keeps only units which have a positive contribution and activation.

SLIDE 109

So these three visualization methods differ only in the quantities propagated back through the ReLU layers during the backward pass (a toy numeric comparison is sketched after the list):

  • back-propagation (Erhan et al., 2009; Simonyan et al., 2013): 1{s>0} ∂l/∂x,
  • deconvolution (Zeiler and Fergus, 2014): 1{∂l/∂x > 0} ∂l/∂x,
  • guided back-propagation (Springenberg et al., 2014): 1{s>0} 1{∂l/∂x > 0} ∂l/∂x.
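A toy comparison of the three rules at a single ReLU, with made-up values for the pre-activation s and the incoming gradient (an illustration, not the lecture's code):

import torch

s = torch.tensor([ -1.0, 2.0, 3.0, -0.5 ])   # pre-activations
g = torch.tensor([  0.7, -0.2, 0.5, 0.4 ])   # dl/dx arriving from above

backprop = (s > 0).float() * g                     # 1{s>0} dl/dx
deconv   = (g > 0).float() * g                     # 1{dl/dx>0} dl/dx
guided   = (s > 0).float() * (g > 0).float() * g   # both masks
print(backprop, deconv, guided)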

SLIDE 110

These procedures can be implemented simply in PyTorch by changing nn.ReLU’s backward pass. The class nn.Module provides methods to register “hook” functions that are called during the forward or the backward pass, and can implement a different computation for the latter.

SLIDE 111–113

For instance

>>> x = Variable(Tensor([ 1.23, -4.56 ]))
>>> m = nn.ReLU()
>>> m(x)
Variable containing:
 1.2300
 0.0000
[torch.FloatTensor of size 2]

>>> def my_hook(module, input, output):
...     print(str(m) + ' got ' + str(input[0].size()))
...
>>> handle = m.register_forward_hook(my_hook)
>>> m(x)
ReLU () got torch.Size([2])
Variable containing:
 1.2300
 0.0000
[torch.FloatTensor of size 2]

>>> handle.remove()
>>> m(x)
Variable containing:
 1.2300
 0.0000
[torch.FloatTensor of size 2]

SLIDE 114–116

Using hooks, we can implement the deconvolution as follows:

def relu_backward_deconv_hook(module, grad_input, grad_output):
    return F.relu(grad_output[0]),

def equip_model_deconv(model):
    for m in model.modules():
        if isinstance(m, nn.ReLU):
            m.register_backward_hook(relu_backward_deconv_hook)

def grad_view(model, image_name):
    to_tensor = transforms.ToTensor()
    img = to_tensor(PIL.Image.open(image_name))
    img = 0.5 + 0.5 * (img - img.mean()) / img.std()
    if torch.cuda.is_available(): img = img.cuda()
    input = Variable(img.view(1, img.size(0), img.size(1), img.size(2)),
                     requires_grad = True)
    output = model(input)
    result, = torch.autograd.grad(output.max(), input)
    result = result.data / result.data.max() + 0.5
    return result

model = models.vgg16(pretrained = True)
model.eval()
model = model.features
equip_model_deconv(model)
result = grad_view(model, 'blacklab.jpg')
utils.save_image(result, 'blacklab-vgg16-deconv.png')

SLIDE 117

The code is the same for the guided back-propagation, except the hooks themselves:

def relu_forward_gbackprop_hook(module, input, output):
    module.input_kept = input[0]

def relu_backward_gbackprop_hook(module, grad_input, grad_output):
    return F.relu(grad_output[0]) * F.relu(module.input_kept).sign(),

def equip_model_gbackprop(model):
    for m in model.modules():
        if isinstance(m, nn.ReLU):
            m.register_forward_hook(relu_forward_gbackprop_hook)
            m.register_backward_hook(relu_backward_gbackprop_hook)

SLIDE 118–126

Original images, and the max feature response visualized with the gradient, the deconvolution, and the guided back-propagation, for AlexNet, VGG16, and VGG19.
SLIDE 127

Experiments with an AlexNet-like network. Original images + deconvolution (or filters) for the top-9 activations for channels picked randomly. (Zeiler and Fergus, 2014)

SLIDE 128

(Zeiler and Fergus, 2014)

SLIDE 129

Maximum response samples

SLIDE 130

Another approach to get an intuition of the information actually encoded in the weights of a convnet consists of optimizing from scratch a sample to maximize the activation f of a chosen unit, or the sum over an activation map.

SLIDE 131

Doing so generates images with high frequencies, which tend to activate units a lot. For instance these images maximize the responses of the units “bathtub” and “lipstick” respectively (yes, this is strange, we will come back to it).

SLIDE 132–138

Since f is trained in a discriminative manner, there is no reason that a sample maximizing its response would be “realistic”.

[Figure: class 0 and class 1 samples, the response f, the penalized response f − h, and the resulting maximizer x̂.]

We can mitigate this by adding a penalty h corresponding to a “realistic” prior and compute in the end

  argmax_x f(x; w) − h(x)

by iterating a standard gradient update:

  x_{k+1} = x_k − η ∇|x (h(x_k) − f(x_k; w)).

SLIDE 139

A reasonable h penalizes too much energy in the high frequencies by integrating edge amplitude at multiple scales.

SLIDE 140

This can be formalized as a penalty function h of the form

  h(x) = ∑_{s≥0} ‖δ^s(x) − g ⊛ δ^s(x)‖²

where g is a Gaussian kernel, and δ is a factor-2 downscale operator.

SLIDE 141–142

We first implement h(x) = ∑_{s≥0} ‖δ^s(x) − g ⊛ δ^s(x)‖² as a module:

class MultiScaleEdgeEnergy(nn.Module):
    def __init__(self):
        super(MultiScaleEdgeEnergy, self).__init__()
        k = Tensor([[1, 4, 6, 4, 1]])
        k_5x5_pseudo_gaussian = k.t().mm(k).view(1, 1, 5, 5)
        k_5x5_pseudo_gaussian /= k_5x5_pseudo_gaussian.sum()
        self.k_5x5_pseudo_gaussian = Parameter(k_5x5_pseudo_gaussian)
        k_2x2_uniform = Tensor([[0.25, 0.25], [0.25, 0.25]]).view(1, 1, 2, 2)
        self.k_2x2_uniform = Parameter(k_2x2_uniform)

    def forward(self, x):
        if x.size(1) > 1:
            # dealing with multiple channels by unfolding them as as many
            # one-channel images
            result = self(x.view(x.size(0) * x.size(1), 1, x.size(2), x.size(3)))
            result = result.view(x.size(0), -1).sum(1)
        else:
            result = 0.0
            while x.size(2) > 5 and x.size(3) > 5:
                blurry = F.conv2d(x, self.k_5x5_pseudo_gaussian, padding = 2)
                result += (x - blurry).view(x.size(0), -1).pow(2).sum(1)
                x = F.conv2d(x, self.k_2x2_uniform, stride = 2, padding = 1)
        return result

SLIDE 143–144

Then, the optimization of the image per se is straightforward:

model = models.vgg16(pretrained = True)
model.eval()
edge_energy = MultiScaleEdgeEnergy()
input = Tensor(1, 3, 224, 224).normal_(0, 0.01)
if torch.cuda.is_available():
    model.cuda()
    edge_energy.cuda()
    input = input.cuda()
input = Variable(input, requires_grad = True)
optimizer = optim.Adam([ input ], lr = 1e-1)

for k in range(250):
    output = model(input)
    loss = edge_energy(input) - output[0, 700]  # paper towel
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

result = input.data
result = 0.5 + 0.1 * (result - result.mean()) / result.std()
torchvision.utils.save_image(result, 'result.png')

(take a second to think about the beauty of autograd)

SLIDE 145–148

VGG16, maximizing a channel of the 4th and of the 7th convolution layers, and a unit of the 10th and of the 13th (and last) convolution layers.
SLIDE 149–160

VGG16, maximizing a unit of the output layer: “King crab”, “Samoyed” (that’s a fluffy dog), “Hourglass”, “Paper towel”, “Ping-pong ball”, “Steel arch bridge”, “Sunglass”, and “Geyser”.
SLIDE 161

These results show that the parameters of a network trained for classification carry enough information to generate identifiable large-scale structures. Although the training is discriminative, the resulting model has strong generative capabilities. It also gives an intuition of the accuracy and shortcomings of the resulting global compositional model.

SLIDE 162

Adversarial examples

SLIDE 163–164

In spite of their good predictive capabilities, deep neural networks are quite sensitive to adversarial inputs, that is to inputs crafted to make them behave incorrectly (Szegedy et al., 2014). The simplest strategy to exhibit such behavior is to optimize the input to maximize the loss.

SLIDE 165–166

Let x be an image, y its proper label, f(x; w) the network’s prediction, and L the cross-entropy loss. We can construct an adversarial example by maximizing the loss. To do so, we iterate a “gradient ascent” step:

  x_{k+1} = x_k + η ∇|x L(f(x_k; w), y).

After a few iterations, this procedure will reach a sample x̌ whose class is not y. The counter-intuitive result is that the resulting misclassified images are indistinguishable from the original ones to a human eye.

SLIDE 167

input = Variable(input, requires_grad = True)
model = torchvision.models.alexnet(pretrained = True)
cross_entropy = nn.CrossEntropyLoss()
optimizer = optim.SGD([ input ], lr = 1e-1)

if torch.cuda.is_available():
    model.cuda()
    cross_entropy.cuda()

target = model(input).data.max(1)[1].view(-1)
if torch.cuda.is_available(): target = target.cuda()
target = Variable(target)

for k in range(15):
    output = model(input)
    loss = - cross_entropy(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

SLIDE 168

Original images, adversarial images, and their differences (magnified); ‖x − x̌‖/‖x‖ = 1.02% and 0.27% respectively.

SLIDE 169

Predicted classes

Nb. iterations | Image #1           | Image #2
0              | Weimaraner         | desktop computer
1              | Weimaraner         | desktop computer
2              | Labrador retriever | desktop computer
3              | Labrador retriever | desktop computer
4              | Labrador retriever | desktop computer
5              | brush kangaroo     | desktop computer
6              | brush kangaroo     | desktop computer
7              | sundial            | desktop computer
8              | sundial            | desktop computer
9              | sundial            | desktop computer
10             | sundial            | desktop computer
11             | sundial            | desktop computer
12             | sundial            | desktop computer
13             | sundial            | desktop computer
14             | sundial            | desk

SLIDE 170

Another counter-intuitive result is that if we sample 1,000 images on the sphere centered on x of radius 2‖x − x̌‖, we do not observe any change of label. [Figure: x, x̌, and the sampling sphere.] A sketch of this experiment follows.
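A hedged sketch of the sampling experiment (assuming x, its adversarial version x_adv, and model exist; not the lecture's code):

import torch

r = 2 * (x - x_adv).norm()
nb, changes = 1000, 0
y = model(x).data.max(1)[1]
for _ in range(nb):
    d = torch.empty_like(x).normal_()
    d = d / d.norm() * r                     # uniform direction, radius r
    if model(x + d).data.max(1)[1][0] != y[0]:
        changes += 1
print(changes, 'label changes out of', nb)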

SLIDE 171

Adversarial images can be pushed one step further by optimizing images from scratch with genetic optimization to maximize the network’s response (Nguyen et al., 2015).

SLIDE 172

Dilated convolution

SLIDE 173–174

Convolution operations admit one more standard parameter that we have not discussed yet: the dilation, which modulates the expansion of the filter support (Yu and Koltun, 2015). It is 1 for standard convolutions, but can be greater, in which case the resulting operation can be envisioned as a convolution with a regularly sparsified filter.

This notion comes from signal processing, where it is referred to as algorithme à trous, hence the term sometimes used of “convolution à trous”.
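This equivalence can be checked numerically; a quick sketch (our own illustration, not the lecture's code) comparing a dilated convolution with a standard convolution whose kernel has been inflated with zeros:

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 1, 20)
k = torch.randn(1, 1, 1, 3)

y_dilated = F.conv2d(x, k, dilation = 2)

k_sparse = torch.zeros(1, 1, 1, 5)           # size 1 + (3 - 1) * 2 = 5
k_sparse[..., ::2] = k                       # k's coefficients, one hole apart
y_sparse = F.conv2d(x, k_sparse)

print((y_dilated - y_sparse).abs().max())    # ~ 0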

SLIDE 175–182

[Animation frames: a convolution with dilation = 1 sliding over the input to produce the output.]
SLIDE 183–191

[Animation frames: the same convolution with dilation = 2.]
SLIDE 192

A convolution with a 1d kernel of size k and dilation d can be interpreted as a convolution with a filter of size 1 + (k − 1)d with only k non-zero coefficients. For instance, with k = 3 and d = 4, the difference between the input map size and the output map size is 1 + (3 − 1)4 − 1 = 8.

>>> from torch import nn, Tensor
>>> from torch.autograd import Variable
>>> x = Variable(Tensor(1, 1, 20, 30).normal_())
>>> l = nn.Conv2d(1, 1, kernel_size = 3, dilation = 4)
>>> l(x).size()
torch.Size([1, 1, 12, 22])

SLIDE 193–194

Having a dilation greater than one increases the units’ receptive field size without increasing the number of parameters. Convolutions with stride or dilation strictly greater than one reduce the activation map size, for instance to make a final classification decision, without employing pooling operators. Such networks have the advantage of simplicity:

  • non-linear operations are only in the activation function,
  • joint operations (combining multiple activations to produce one) are only in the convolutional layers.

SLIDE 195

The end

SLIDE 196–197

References

  • D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. Technical Report 1341, Département IRO, Université de Montréal, 2009.
  • F. Fleuret. Predicting the dynamics of 2d objects with a deep residual network. CoRR, abs/1610.04032, 2016.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
  • A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), 2012.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • A. M. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
  • D. Smilkov, N. Thorat, B. Kim, F. Viegas, and M. Wattenberg. Smoothgrad: removing noise by adding noise. CoRR, abs/1706.03825, 2017.
  • J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.
  • L. van der Maaten and G. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research (JMLR), 9:2579–2605, 2008.
  • J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. In Deep Learning Workshop, International Conference on Machine Learning (WS/ICML), 2015.
  • F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122v3, 2015.
  • M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), 2014.