SLIDE 1

EE-559 – Deep learning

8. Under the hood

François Fleuret https://fleuret.org/dlc/

[version of: June 5, 2018]

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

SLIDE 2

Understanding a network’s behavior

SLIDE 3

Understanding what is happening in a deep architecture after training is complex, and the tools we have at our disposal are limited. In the case of convolutional feed-forward networks, we can look at

  • the network’s parameters, filters as images,
  • internal activations as images,
  • distributions of activations on a population of samples,
  • derivatives of the response(s) wrt the input,
  • maximum-response synthetic samples,
  • adversarial samples.

SLIDE 4–5

Given a one-hidden-layer fully connected network R^2 → R^2

nb_hidden = 20
model = nn.Sequential(
    nn.Linear(2, nb_hidden),
    nn.ReLU(),
    nn.Linear(nb_hidden, 2)
)

we can represent each of its internal units as a line corresponding to { x : w · x + b = 0 }. During training, these separations get organized so that their combination properly delimits the different value domains in the signal space.
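As a hedged sketch (our own illustration, not the lecture's code), the parameters of these lines can be read directly from the first nn.Linear: each row of its weight matrix, with the corresponding bias, defines one separation.

# Minimal sketch, assuming `model` is the nn.Sequential defined above:
# row k of the first layer's weights and its bias give { x : w_k · x + b_k = 0 }.
w = model[0].weight.data   # shape: nb_hidden x 2
b = model[0].bias.data     # shape: nb_hidden
for k in range(w.size(0)):
    print('unit {}: {:.3f} * x1 + {:.3f} * x2 + {:.3f} = 0'.format(
        k, float(w[k, 0]), float(w[k, 1]), float(b[k])))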

SLIDE 6–17

[Animation frames: the separating lines of the hidden units at iterations 1, 4, 7, 10, 16, 34, 77, 100, 703, 1407, 2789, and 4999.]
SLIDE 18–29

[Animation frames: a second example, at iterations 1, 4, 7, 10, 16, 34, 100, 272, 556, 887, 2222, and 4999.]
SLIDE 30

Convnet filters

SLIDE 31

A similar analysis is complicated to conduct with real-life networks, given the high dimension of the signal. The simplest approach for convnets consists of looking at the filters as images. While this is quite reasonable for the first layer, whose filters operate directly on the input image, it is far less so for the subsequent layers.

SLIDE 32

LeNet’s first convolutional layer (1 → 32), all filters

SLIDE 33

LeNet’s second convolutional layer (32 → 64), first 32 filters out of 64

SLIDE 34

AlexNet’s first convolutional layer (3 → 64), first 20 filters out of 64

SLIDE 35

AlexNet’s first convolutional layer (3 → 64), first 20 filters out of 64, as RGB images
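A possible way to produce such figures, sketched here as an assumption rather than the lecture's actual code: rescale the first convolution's weight tensor to [0, 1] and tile it with torchvision.

from torchvision import models, utils

model = models.alexnet(pretrained = True)
w = model.features[0].weight.data            # 64 x 3 x 11 x 11
w = (w - w.min()) / (w.max() - w.min())      # rescale to [0, 1] for display
utils.save_image(w[:20], 'alexnet-conv1-filters.png', nrow = 10)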

SLIDE 36

AlexNet’s second convolutional layer (64 → 192). First 15 channels (out of 64) of the first 20 filters (out of 192).

SLIDE 37

Convnet internal layer activations

SLIDE 38–39

An alternative approach is to look at the activations themselves. Since the convolutional layers maintain the 2d structure of the signal, the activations can be visualized as images, where the local coding at any location of an activation map is associated to the original content at that same location.

Given the large number of channels, we have to pick a few at random. Since the representation is distributed across multiple channels, individual channels usually have no clear semantic.
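A minimal sketch of how such activation images can be produced (the layer index and file name are assumptions, not the lecture's code): keep an internal output with a forward hook, then save a few random channels.

import torch
from torchvision import models, utils

model = models.alexnet(pretrained = True)
model.eval()
activations = {}

def keep_output(module, input, output):
    activations['a'] = output.data

# hypothetical choice: the output of AlexNet's second convolution
handle = model.features[3].register_forward_hook(keep_output)
model(img)   # img is assumed to be a 1 x 3 x H x W input tensor
handle.remove()

a = activations['a'][0]                      # channels x H x W
idx = torch.randperm(a.size(0))[:12]         # pick a few channels at random
utils.save_image(a[idx].unsqueeze(1), 'activations.png', normalize = True)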

SLIDE 40

A MNIST character with LeNet (LeCun et al., 1998).

SLIDE 41–45

An RGB image with AlexNet (Krizhevsky et al., 2012).
SLIDE 46–48

ILSVRC12 with ResNet152 (He et al., 2015).
SLIDE 49

Yosinski et al. (2015) developed analysis tools to explore a network and look at the internal activations for a given input signal. This allowed them in particular to find units with a clear semantic in an AlexNet-like network trained on ImageNet.

SLIDE 50

Figure 2. A view of the 13×13 activations of the 151st channel on the conv5 layer of a deep neural network trained on ImageNet, a dataset that does not contain a face class, but does contain many images with faces. The channel responds to human and animal faces and is robust to changes in scale, pose, lighting, and context, which can be discerned by a user by actively changing the scene in front of a webcam or by loading static images (e.g. of the lions) and seeing the corresponding response of the unit. Photo of lions via Flickr user arnolouise, licensed under CC BY-NC-SA 2.0.

(Yosinski et al., 2015)

SLIDE 51

Prediction of 2d dynamics with an 18-layer residual network. [Figure panels: G_n, S_n, R_n.] (Fleuret, 2016)

SLIDE 52

[Figure panels: S_n, G_n, R_n, Ψ(S_n, G_n).] (Fleuret, 2016)

SLIDE 53

[Figure: channels 1/1024, 2/1024, 3/1024, …, 511/1024, 512/1024, 513/1024, 514/1024, …] (Fleuret, 2016)

SLIDE 54–55

(Fleuret, 2016)
SLIDE 56

Layers as embeddings

SLIDE 57

In the classification case, the network can be seen as a series of processing steps aiming at disentangling the classes, to make them easily separable for the final decision. In this perspective, it makes sense to look at how the samples are distributed spatially after each layer.

SLIDE 58

The main issue in doing so is the dimensionality of the signal. If we look at the total number of dimensions in each layer:

  • a MNIST sample in LeNet goes from 784 to up to 18k dimensions,
  • an ILSVRC12 sample in ResNet152 goes from 150k to up to 800k dimensions.

This requires a means to project a [very] high-dimension point cloud into a 2d or 3d “human-brain accessible” representation.
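Such counts can be checked with a forward pass that records each layer's output size. A small sketch, with a LeNet-like convolutional stack that is an assumption, not the lecture's exact model:

import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2),
)
x = torch.randn(1, 1, 28, 28)
print('input', x.numel())          # 784
for layer in model:
    x = layer(x)
    print(layer.__class__.__name__, x.numel())
# the first convolution's output already has 32 x 24 x 24 = 18432 dimensions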

SLIDE 59–60

We have already seen PCA and k-means as two standard methods for dimension reduction, but they poorly convey the structure of a smooth, low-dimension, non-flat manifold. There exists a plethora of methods that aim at reflecting in low dimension the structure of data points in high dimension. A popular one is t-SNE, developed by van der Maaten and Hinton (2008).

SLIDE 61

Given data points in high dimension

  D = { x_n ∈ R^D, n = 1, …, N },

the objective of data visualization is to find a set of corresponding low-dimension points

  E = { y_n ∈ R^C, n = 1, …, N },

such that the positions of the y_n “reflect” those of the x_n.

SLIDE 62–63

The t-Distributed Stochastic Neighbor Embedding (t-SNE) proposed by van der Maaten and Hinton (2008) optimizes with SGD the y_n so that the distances to close neighbors of each point are preserved. It actually matches, in the sense of D_KL, two distance-dependent distributions: Gaussian in the original space, and Student t-distribution in the low-dimension one.
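An illustrative sketch of the two affinity kernels involved (a simplification with one global bandwidth σ; the actual algorithm sets per-point bandwidths from the perplexity):

import torch

def gaussian_affinities(x, sigma = 1.0):      # in the original space
    d2 = torch.cdist(x, x).pow(2)
    p = torch.exp(-d2 / (2 * sigma ** 2))
    p.fill_diagonal_(0)
    return p / p.sum()

def student_affinities(y):                    # in the embedding space
    d2 = torch.cdist(y, y).pow(2)
    q = 1 / (1 + d2)
    q.fill_diagonal_(0)
    return q / q.sum()

# t-SNE optimizes the y_n with SGD to minimize KL(p || q)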

SLIDE 64

The scikit-learn toolbox http://scikit-learn.org/ is built around SciPy, and provides many machine learning algorithms, in particular embeddings, among which an implementation of t-SNE. The only catch to using it from PyTorch is the conversion to and from numpy arrays.

from sklearn.manifold import TSNE

# x is the array of the original high-dimension points
x_np = x.numpy()
y_np = TSNE(n_components = 2, perplexity = 50).fit_transform(x_np)
y = torch.from_numpy(y_np)

n_components specifies the embedding dimension and perplexity states [crudely] how many points are considered neighbors of each point.

SLIDE 65–66

t-SNE unrolling of the swiss roll (with one noise dimension)
SLIDE 67–71

t-SNE for LeNet on MNIST [frames: the input, then layers #1, #4, #7, and #9].
SLIDE 72–85

t-SNE for a home-baked resnet (no pooling, 66 layers) on CIFAR10 [frames: the input, then layers #5, #10, #15, #20, #25, #30, #31, #32, #33, #34, #35, #36, and #37].
SLIDE 86

Occlusion sensitivity

SLIDE 87–88

Another approach to understanding the functioning of a network is to look at the behavior of the network “around” an image. For instance, we can get a simple estimate of the importance of a part of the input image by computing the difference between:

  1. the value of the maximally responding output unit on the image, and
  2. the value of the same unit with that part occluded.

This is computationally intensive, since it requires as many forward passes as there are locations of the occlusion mask, ideally one per pixel, as sketched below.
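A minimal sketch of the procedure (our own illustration, not the lecture's code; it assumes model is a trained convnet, img a 1 × C × H × W tensor, and a gray value of 0.5 for the mask):

import torch

def occlusion_map(model, img, mask_size = 32, stride = 2):
    output = model(img)
    k = output.data.max(1)[1][0]             # maximally responding unit
    ref = output.data[0, k]
    h, w = img.size(2), img.size(3)
    result = torch.zeros((h - mask_size) // stride + 1,
                         (w - mask_size) // stride + 1)
    for i in range(result.size(0)):
        for j in range(result.size(1)):
            occluded = img.clone()
            occluded[:, :, i*stride:i*stride+mask_size,
                           j*stride:j*stride+mask_size] = 0.5   # gray square
            # importance = drop of the unit's response when this part is hidden
            result[i, j] = ref - model(occluded).data[0, k]
    return result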

SLIDE 89–92

Original images and the 32 × 32 occlusion mask; occlusion sensitivity with a 32 × 32 mask and a stride of 2, for AlexNet, VGG16, and VGG19.
SLIDE 93

Saliency maps

SLIDE 94

An alternative is to compute the gradient of the maximally responding output unit with respect to the input (Erhan et al., 2009; Simonyan et al., 2013), e.g.

  ∇|x f(x; w),

where f is the activation of the output unit with maximum response, and |x stresses that the gradient is computed with respect to the input x, and not, as usual, with respect to the parameters w.

SLIDE 95

This can be implemented by specifying that we need the gradient with respect to the input. We use here the correct unit, not the maximum response one. Using torch.autograd.grad to compute the gradient wrt the input image instead of torch.autograd.backward has the advantage of not changing the model’s parameter gradients.

input = Variable(img, requires_grad = True)
output = model(input)
loss = nllloss(output, target)
grad_input, = torch.autograd.grad(loss, input)

Note that since torch.autograd.grad computes the gradient of a function with possibly multiple inputs, the returned result is a tuple.

SLIDE 96

The resulting maps are quite noisy. For instance with AlexNet:

SLIDE 97

This is due to the local irregularity of the network’s response as a function of the input.

Figure 2. The partial derivative of S_c with respect to the RGB values of a single pixel, as a fraction of the maximum entry in the gradient vector, max_i ∂S_c/∂x_i(t) (middle plot), as one slowly moves away from a baseline image x (left plot) to a fixed location x + ε (right plot). ε is one random sample from N(0, 0.01²). The final image (x + ε) is indistinguishable to a human from the original image x.

(Smilkov et al., 2017)

SLIDE 98

Smilkov et al. (2017) proposed to smooth the gradient with respect to the input image by averaging over slightly perturbed versions of the latter:

  ∇̃|x f_y(x; w) = (1/N) ∑_{n=1}^{N} ∇|x f_y(x + ε_n; w),

where ε_1, …, ε_N are i.i.d. of distribution N(0, σ²I), and σ is a fraction of the gap ∆ between the maximum and the minimum of the pixel values.

SLIDE 99

A simple version of this “SmoothGrad” approach can be implemented as follows

nb_smooth = 100
std = smooth_std * (img.max() - img.min())
acc_grad = img.new(img.size()).zero_()

for q in range(nb_smooth):  # This should be done with mini-batches ...
    noisy_input = img + img.new(img.size()).normal_(0, std)
    noisy_input = Variable(noisy_input, requires_grad = True)
    output = model(noisy_input)
    loss = nllloss(output, target)
    grad_input, = torch.autograd.grad(loss, noisy_input)
    acc_grad += grad_input.data

acc_grad = acc_grad.abs().sum(1)  # sum across channels

SLIDE 100–102

Original images, the gradient, and SmoothGrad with σ = ∆/4, for AlexNet, VGG16, and VGG19.
SLIDE 103

Deconvolution and guided back-propagation

SLIDE 104

Zeiler and Fergus (2014) proposed to invert the processing flow of a convolutional network by constructing a corresponding deconvolutional network to compute the “activating pattern” of a sample. As they point out, the resulting processing is identical to a standard backward pass, except when going through the ReLU layers.

SLIDE 105

Remember that if s is one of the inputs to a ReLU layer, and x the corresponding output, we have for the forward pass x = max(0, s), and for the backward pass

  ∂l/∂s = 1{s>0} ∂l/∂x.
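A two-line autograd check of this rule (illustrative, not part of the lecture):

import torch

s = torch.tensor([ -1.0, 2.0 ], requires_grad = True)
x = torch.relu(s)
x.backward(torch.tensor([ 0.5, 0.5 ]))   # plays the role of dl/dx
print(s.grad)                            # tensor([0.0000, 0.5000]) = 1{s>0} * dl/dx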

SLIDE 106–107

Zeiler and Fergus’s deconvolution can be seen as a backward pass where we propagate back through the ReLU layers the quantity

  max(0, ∂l/∂x) = 1{∂l/∂x > 0} ∂l/∂x,

instead of the usual

  ∂l/∂s = 1{s>0} ∂l/∂x.

This quantity is positive for units whose output has a positive contribution to the response, kills the others, and is not modulated by the pre-layer activation s.

SLIDE 108

Springenberg et al. (2014) improved upon the deconvolution with the guided back-propagation, which aims at the best of both worlds: discarding structures which would not contribute positively to the final response, and discarding structures which are not already present. It back-propagates through the ReLU layers the quantity

  1{s>0} 1{∂l/∂x > 0} ∂l/∂x,

which keeps only units which have a positive contribution and activation.

SLIDE 109

So these three visualization methods differ only in the quantities propagated back through the ReLU layers during the backward pass (a toy numeric comparison is sketched after the list):

  • back-propagation (Erhan et al., 2009; Simonyan et al., 2013): 1{s>0} ∂l/∂x,
  • deconvolution (Zeiler and Fergus, 2014): 1{∂l/∂x > 0} ∂l/∂x,
  • guided back-propagation (Springenberg et al., 2014): 1{s>0} 1{∂l/∂x > 0} ∂l/∂x.
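A toy comparison of the three rules at a single ReLU, with made-up values for the pre-activation s and the incoming gradient (an illustration, not the lecture's code):

import torch

s = torch.tensor([ -1.0, 2.0, 3.0, -0.5 ])   # pre-activations
g = torch.tensor([  0.7, -0.2, 0.5, 0.4 ])   # dl/dx arriving from above

backprop = (s > 0).float() * g                     # 1{s>0} dl/dx
deconv   = (g > 0).float() * g                     # 1{dl/dx>0} dl/dx
guided   = (s > 0).float() * (g > 0).float() * g   # both masks
print(backprop, deconv, guided)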

SLIDE 110

These procedures can be implemented simply in PyTorch by changing nn.ReLU’s backward pass. The class nn.Module provides methods to register “hook” functions that are called during the forward or the backward pass, and can implement a different computation for the latter.

SLIDE 111–113

For instance

>>> x = Variable(Tensor([ 1.23, -4.56 ]))
>>> m = nn.ReLU()
>>> m(x)
Variable containing:
 1.2300
 0.0000
[torch.FloatTensor of size 2]

>>> def my_hook(module, input, output):
...     print(str(m) + ' got ' + str(input[0].size()))
...
>>> handle = m.register_forward_hook(my_hook)
>>> m(x)
ReLU () got torch.Size([2])
Variable containing:
 1.2300
 0.0000
[torch.FloatTensor of size 2]

>>> handle.remove()
>>> m(x)
Variable containing:
 1.2300
 0.0000
[torch.FloatTensor of size 2]

SLIDE 114–116

Using hooks, we can implement the deconvolution as follows:

def relu_backward_deconv_hook(module, grad_input, grad_output):
    return F.relu(grad_output[0]),

def equip_model_deconv(model):
    for m in model.modules():
        if isinstance(m, nn.ReLU):
            m.register_backward_hook(relu_backward_deconv_hook)

def grad_view(model, image_name):
    to_tensor = transforms.ToTensor()
    img = to_tensor(PIL.Image.open(image_name))
    img = 0.5 + 0.5 * (img - img.mean()) / img.std()
    if torch.cuda.is_available(): img = img.cuda()
    input = Variable(img.view(1, img.size(0), img.size(1), img.size(2)),
                     requires_grad = True)
    output = model(input)
    result, = torch.autograd.grad(output.max(), input)
    result = result.data / result.data.max() + 0.5
    return result

model = models.vgg16(pretrained = True)
model.eval()
model = model.features
equip_model_deconv(model)
result = grad_view(model, 'blacklab.jpg')
utils.save_image(result, 'blacklab-vgg16-deconv.png')

SLIDE 117

The code is the same for the guided back-propagation, except the hooks themselves:

def relu_forward_gbackprop_hook(module, input, output):
    module.input_kept = input[0]

def relu_backward_gbackprop_hook(module, grad_input, grad_output):
    return F.relu(grad_output[0]) * F.relu(module.input_kept).sign(),

def equip_model_gbackprop(model):
    for m in model.modules():
        if isinstance(m, nn.ReLU):
            m.register_forward_hook(relu_forward_gbackprop_hook)
            m.register_backward_hook(relu_backward_gbackprop_hook)

SLIDE 118–126

Original images, and the max feature response visualized with the gradient, the deconvolution, and the guided back-propagation, for AlexNet, VGG16, and VGG19.
SLIDE 127

Experiments with an AlexNet-like network. Original images + deconvolution (or filters) for the top-9 activations for channels picked randomly. (Zeiler and Fergus, 2014)

SLIDE 128

(Zeiler and Fergus, 2014)

SLIDE 129

Maximum response samples

SLIDE 130

Another approach to get an intuition of the information actually encoded in the weights of a convnet consists of optimizing from scratch a sample to maximize the activation f of a chosen unit, or the sum over an activation map.

SLIDE 131

Doing so generates images with high frequencies, which tend to activate units a lot. For instance these images maximize the responses of the units “bathtub” and “lipstick” respectively (yes, this is strange, we will come back to it).

SLIDE 132–138

Since f is trained in a discriminative manner, there is no reason that a sample maximizing its response would be “realistic”.

[Figure: class 0 and class 1 samples, the response f, the penalized response f − h, and the resulting maximizer x̂.]

We can mitigate this by adding a penalty h corresponding to a “realistic” prior and compute in the end

  argmax_x f(x; w) − h(x)

by iterating a standard gradient update:

  x_{k+1} = x_k − η ∇|x (h(x_k) − f(x_k; w)).

SLIDE 139

A reasonable h penalizes too much energy in the high frequencies by integrating edge amplitude at multiple scales.

SLIDE 140

This can be formalized as a penalty function h of the form

  h(x) = ∑_{s≥0} ‖δ^s(x) − g ⊛ δ^s(x)‖²

where g is a Gaussian kernel, and δ is a factor-2 downscale operator.

SLIDE 141–142

We first implement h(x) = ∑_{s≥0} ‖δ^s(x) − g ⊛ δ^s(x)‖² as a module:

class MultiScaleEdgeEnergy(nn.Module):
    def __init__(self):
        super(MultiScaleEdgeEnergy, self).__init__()
        k = Tensor([[1, 4, 6, 4, 1]])
        k_5x5_pseudo_gaussian = k.t().mm(k).view(1, 1, 5, 5)
        k_5x5_pseudo_gaussian /= k_5x5_pseudo_gaussian.sum()
        self.k_5x5_pseudo_gaussian = Parameter(k_5x5_pseudo_gaussian)
        k_2x2_uniform = Tensor([[0.25, 0.25], [0.25, 0.25]]).view(1, 1, 2, 2)
        self.k_2x2_uniform = Parameter(k_2x2_uniform)

    def forward(self, x):
        if x.size(1) > 1:
            # dealing with multiple channels by unfolding them as as many
            # one-channel images
            result = self(x.view(x.size(0) * x.size(1), 1, x.size(2), x.size(3)))
            result = result.view(x.size(0), -1).sum(1)
        else:
            result = 0.0
            while x.size(2) > 5 and x.size(3) > 5:
                blurry = F.conv2d(x, self.k_5x5_pseudo_gaussian, padding = 2)
                result += (x - blurry).view(x.size(0), -1).pow(2).sum(1)
                x = F.conv2d(x, self.k_2x2_uniform, stride = 2, padding = 1)
        return result

SLIDE 143–144

Then, the optimization of the image per se is straightforward:

model = models.vgg16(pretrained = True)
model.eval()
edge_energy = MultiScaleEdgeEnergy()
input = Tensor(1, 3, 224, 224).normal_(0, 0.01)
if torch.cuda.is_available():
    model.cuda()
    edge_energy.cuda()
    input = input.cuda()
input = Variable(input, requires_grad = True)
optimizer = optim.Adam([ input ], lr = 1e-1)

for k in range(250):
    output = model(input)
    loss = edge_energy(input) - output[0, 700]  # paper towel
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

result = input.data
result = 0.5 + 0.1 * (result - result.mean()) / result.std()
torchvision.utils.save_image(result, 'result.png')

(take a second to think about the beauty of autograd)

SLIDE 145–148

VGG16, maximizing a channel of the 4th and of the 7th convolution layers, and a unit of the 10th and of the 13th (and last) convolution layers.
SLIDE 149–160

VGG16, maximizing a unit of the output layer: “King crab”, “Samoyed” (that’s a fluffy dog), “Hourglass”, “Paper towel”, “Ping-pong ball”, “Steel arch bridge”, “Sunglass”, and “Geyser”.
SLIDE 161

These results show that the parameters of a network trained for classification carry enough information to generate identifiable large-scale structures. Although the training is discriminative, the resulting model has strong generative capabilities. It also gives an intuition of the accuracy and shortcomings of the resulting global compositional model.

SLIDE 162

Adversarial examples

SLIDE 163–164

In spite of their good predictive capabilities, deep neural networks are quite sensitive to adversarial inputs, that is to inputs crafted to make them behave incorrectly (Szegedy et al., 2014). The simplest strategy to exhibit such behavior is to optimize the input to maximize the loss.

SLIDE 165–166

Let x be an image, y its proper label, f(x; w) the network’s prediction, and L the cross-entropy loss. We can construct an adversarial example by maximizing the loss. To do so, we iterate a “gradient ascent” step:

  x_{k+1} = x_k + η ∇|x L(f(x_k; w), y).

After a few iterations, this procedure will reach a sample x̌ whose class is not y. The counter-intuitive result is that the resulting misclassified images are indistinguishable from the original ones to a human eye.

SLIDE 167

input = Variable(input, requires_grad = True)
model = torchvision.models.alexnet(pretrained = True)
cross_entropy = nn.CrossEntropyLoss()
optimizer = optim.SGD([ input ], lr = 1e-1)

if torch.cuda.is_available():
    model.cuda()
    cross_entropy.cuda()

target = model(input).data.max(1)[1].view(-1)
if torch.cuda.is_available(): target = target.cuda()
target = Variable(target)

for k in range(15):
    output = model(input)
    loss = - cross_entropy(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

SLIDE 168

Original images, adversarial images, and their differences (magnified); ‖x − x̌‖/‖x‖ = 1.02% and 0.27% respectively.

SLIDE 169

Predicted classes

Nb. iterations | Image #1           | Image #2
0              | Weimaraner         | desktop computer
1              | Weimaraner         | desktop computer
2              | Labrador retriever | desktop computer
3              | Labrador retriever | desktop computer
4              | Labrador retriever | desktop computer
5              | brush kangaroo     | desktop computer
6              | brush kangaroo     | desktop computer
7              | sundial            | desktop computer
8              | sundial            | desktop computer
9              | sundial            | desktop computer
10             | sundial            | desktop computer
11             | sundial            | desktop computer
12             | sundial            | desktop computer
13             | sundial            | desktop computer
14             | sundial            | desk

SLIDE 170

Another counter-intuitive result is that if we sample 1,000 images on the sphere centered on x of radius 2‖x − x̌‖, we do not observe any change of label. [Figure: x, x̌, and the sampling sphere.] A sketch of this experiment follows.
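A hedged sketch of the sampling experiment (assuming x, its adversarial version x_adv, and model exist; not the lecture's code):

import torch

r = 2 * (x - x_adv).norm()
nb, changes = 1000, 0
y = model(x).data.max(1)[1]
for _ in range(nb):
    d = torch.empty_like(x).normal_()
    d = d / d.norm() * r                     # uniform direction, radius r
    if model(x + d).data.max(1)[1][0] != y[0]:
        changes += 1
print(changes, 'label changes out of', nb)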

SLIDE 171

Adversarial images can be pushed one step further by optimizing images from scratch with genetic optimization to maximize the network’s response (Nguyen et al., 2015).

SLIDE 172

Dilated convolution

SLIDE 173–174

Convolution operations admit one more standard parameter that we have not discussed yet: the dilation, which modulates the expansion of the filter support (Yu and Koltun, 2015). It is 1 for standard convolutions, but can be greater, in which case the resulting operation can be envisioned as a convolution with a regularly sparsified filter.

This notion comes from signal processing, where it is referred to as algorithme à trous, hence the term sometimes used of “convolution à trous”.
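This equivalence can be checked numerically; a quick sketch (our own illustration, not the lecture's code) comparing a dilated convolution with a standard convolution whose kernel has been inflated with zeros:

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 1, 20)
k = torch.randn(1, 1, 1, 3)

y_dilated = F.conv2d(x, k, dilation = 2)

k_sparse = torch.zeros(1, 1, 1, 5)           # size 1 + (3 - 1) * 2 = 5
k_sparse[..., ::2] = k                       # k's coefficients, one hole apart
y_sparse = F.conv2d(x, k_sparse)

print((y_dilated - y_sparse).abs().max())    # ~ 0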

SLIDE 175–182

[Animation frames: a convolution with dilation = 1 sliding over the input to produce the output.]
SLIDE 183–191

[Animation frames: the same convolution with dilation = 2.]
SLIDE 192

A convolution with a 1d kernel of size k and dilation d can be interpreted as a convolution with a filter of size 1 + (k − 1)d with only k non-zero coefficients. For instance, with k = 3 and d = 4, the difference between the input map size and the output map size is 1 + (3 − 1)4 − 1 = 8.

>>> from torch import nn, Tensor
>>> from torch.autograd import Variable
>>> x = Variable(Tensor(1, 1, 20, 30).normal_())
>>> l = nn.Conv2d(1, 1, kernel_size = 3, dilation = 4)
>>> l(x).size()
torch.Size([1, 1, 12, 22])

SLIDE 193–194

Having a dilation greater than one increases the units’ receptive field size without increasing the number of parameters. Convolutions with stride or dilation strictly greater than one reduce the activation map size, for instance to make a final classification decision, without employing pooling operators. Such networks have the advantage of simplicity:

  • non-linear operations are only in the activation function,
  • joint operations (combining multiple activations to produce one) are only in the convolutional layers.

SLIDE 195

The end

SLIDE 196–197

References

  • D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. Technical Report 1341, Département IRO, Université de Montréal, 2009.
  • F. Fleuret. Predicting the dynamics of 2d objects with a deep residual network. CoRR, abs/1610.04032, 2016.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
  • A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), 2012.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • A. M. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
  • D. Smilkov, N. Thorat, B. Kim, F. Viegas, and M. Wattenberg. Smoothgrad: removing noise by adding noise. CoRR, abs/1706.03825, 2017.
  • J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.
  • L. van der Maaten and G. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research (JMLR), 9:2579–2605, 2008.
  • J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. In Deep Learning Workshop, International Conference on Machine Learning (WS/ICML), 2015.
  • F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122v3, 2015.
  • M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), 2014.