Assignment 5 Zahra Sheikhbahaee Zeou Hu & Colin Vandenhof - - PDF document

assignment 5
SMART_READER_LITE
LIVE PREVIEW

Assignment 5 Zahra Sheikhbahaee Zeou Hu & Colin Vandenhof - - PDF document

Assignment 5 Zahra Sheikhbahaee Zeou Hu & Colin Vandenhof April 2020 1 Convolutional Neural Networks Basics Consider the following CNN architecture: conv: conv: input: 100 filters 50 filters pooling: size 3 3 size 5 5 256


slide-1
SLIDE 1

Assignment 5

Zahra Sheikhbahaee Zeou Hu & Colin Vandenhof April 2020

1 Convolutional Neural Networks Basics

Consider the following CNN architecture:

conv: 100 filters size 3 × 3 stride 1 input: 256 × 256 × 3 conv: 50 filters size 5 × 5 stride 3 pooling: size 2 × 2 conv: 30 filters size 3 × 3 stride 1 pooling: size 2 × 4 fully connected: 10 outputs flattened

In all layers we do not add the bias term (for simplicity), and the nonlinear activations are omitted throughout. Note that the third dimension of each conv filter is omitted as it can be inferred from the previous layer, while pooling (with each pixel appearing exactly once in all pooling windows) is performed

  • n each channel/feature map separately. Answer the following questions. [For

(b) (c) (d), you will receive partial credit if you correctly derive some of the intermediate results.]

  • 1. (5 pts) What can we do to achieve the same output size (excluding the

dimension of channels/feature maps) as the input size for each convolu- tional layer ? Explain using a filter with size (2w+1)×(2h+1) and stride 1 on an input image with size W × H.

  • 2. (5 pts) Without padding what is the output size of each intermediate

layer (except the first input layer and the last fully connected layer)?

  • 3. (5 pts) What is the receptive field of a single (output) neuron on the last

convolutional layer (ignoring any boundary issue)?

  • 4. (5 pts) How many parameters do the above architecture have? For this

exercise we assume the output image size of each convolutional layer is the same as its input image size (except the number of channels, of course). 1

slide-2
SLIDE 2

2 CNN Implementation

Note: Please mention your Python version (and maybe the version of all other packages) in the code. In this exercise you are going to run some experiments involving CNNs. You need to know Python and install the following libraries: PyTorch, Numpy and all their dependencies. You can find detailed instructions and tutorials for each

  • f these libraries on the respective websites.

For all experiments, running on CPU is sufficient. You do not need to run the code on GPUs. Before start, we suggest you review what we learned about each layer in CNN, and read at least this tutorial.

  • 1. Train a VGG11 net on the MNIST dataset.

VGG11 was an earlier version of VGG16 and can be found as model A in Table 1 of this paper, whose Section 2.1 also gives you all the details about each layer. The goal is to get as close to 0 loss as possible. Note that our input dimension is different from the VGG paper. You need to resize each image in MNIST from its original size 28 × 28 to 32 × 32 [why?], and it might be necessary to change at least one other layer of VGG11. [This experiment will take up to 1 hour on a CPU, so please be cautious of your

  • time. If this running time is not bearable, you may cut the training set by

1/10, so only have ∼600 images per class instead of the regular ∼6000.]

  • 2. Once you’ve done the above, the next goal is to inspect the training pro-
  • cess. Create the following plots:

(a) test accuracy vs the number of epochs (say 3 ∼ 5) (b) training accuracy vs the number of epochs (c) test loss vs the number of epochs (d) training loss vs the number of epochs [If running more than 1 epoch is computationally infeasible, simply run 1 epoch and try to record the accuracy / loss every few minibatches.]

3 Recurrent Neural Network Implementation

In this question, you will implement a recurrent neural network to classify Twitter messages as positive or negative sentiment. You are provided two files, tweets train.csv and tweets test.csv that contain the training data and test data. Each line of the files consists of a tweet and its sentiment, either negative (0) or positive (1). To classify the tweets, implement an LSTM using PyTorch. Each tweet should be treated as a sequence of ASCII characters. The input to the LSTM at each step should be a one-hot encoding of the ASCII character. Since there are 128 ASCII characters, this should be a vector of size 128. 2

slide-3
SLIDE 3

The size of the hidden layer should also be 128. Use a linear layer to map the final hidden layer to the output (size 2). Finally, apply the softmax function to obtain class probabilities. Use negative log likelihood as the loss function with stochastic gradient descent or any related optimization algorithm. Train on the training set using a batch size of 32 and a learning rate of 0.0001 for at least 100000

  • iterations. Plot curves for (a) the test set accuracy vs. number of iterations

and (b) the training set accuracy vs. number of iterations. You may compute accuracy every 1000 iterations to speed up the process.

4 Variational Auto-Encoder

In Variational autoencoder, we optimize the evidence lower bound L(θ, φ; x) = −DKL[qφ(z|x)||pθ(z)] + Eqφ(z|x)

  • log pθ(x|z)
  • where qφ(z|x) is the variational distribution with variational parameter φ which

approximates the posterior p(z|x).

  • Given z ∈ R1, p(z) ∼ N(0, 1) and q(z|x) ∼ N(µz, σ2

z), write down

DKL[qφ(z|x)||pθ(z)] in terms of µz and σz.

  • Assuming qφ(z|x) is a Gaussian, the decoder network computes its mean

µz and its variance σ2. Why do we model σ2

z in log space using neural

networks instead of directly model σ2

z? Why do we need the reparame-

terization trick, instead of directly sampling from the latent distribution N(µz, σ2

z)?

  • For decoder, we use sometimes a Multi-Layer Perceptions with either

Bernoulli (in case of binary data) or Gaussian (in case of real-valued data) outputs. The expected reconstruction error (cross-entropy term) Eqφ(z|x(i))

  • log pθ(x(i)|z)
  • can be estimated by sampling, that is

Eqφ(z|x(i))

  • log pθ(x(i)|z)
  • = 1

L

L

  • l=1
  • log pθ(x(i)|z)
  • if data x given z follows a multivariate Bernoulli with dimension D, how

should this reconstruction loss term look like?

5 The Final Small-Project

This part of assignment should be done by the students who decided to attend the final exam. Collaboration policy: This assignment should be done individually. 3

slide-4
SLIDE 4

5.1 Implementation of VAE

In the penultimate lecture, we have learned about approximating posteriors with variational inference, using the reparameterization trick for VI, and deep gen- erative models for images using variational autoencoders. We also learnt more about convolutional neural networks in the final lecture and their applications in the computer vision. In this project, you will bring together both constructs to train a model that can generate images of different classes of galaxies. The data is available

  • here. The training images are JPG images of 61578 galaxies. The probability

distributions for the classifications for each of the training images are given in solutions training rev1. There are 37 different classes of galaxy types and at the end of training process, you should be able to generate these categories and compare with the original images. You’ll need to install PyTorch to use the starter code. The objective is to minimize the reconstruction error using cross-entropy and the Kullback–Leibler

  • divergence. In the following you’ll find a code snippet that you can start to

complete and make it work for this dataset. import torch import torch.utils.data from torch import nn, optim from torch.nn import functional as F from torchvision import datasets, transforms from torchvision.utils import save_image class VAE(nn.Module): def __init__(self, input_chanel=3, zdim=512, image_size=424): super(VAE, self).__init__() self.z_dim = z_dim # encoder part self.encoder_conv1 = nn.Conv2d(input_chanel, zdim//16, kernel_size=4, stride=2, padding=1) self.encoder_bn1 = nn.BatchNorm2d(zdim//16) #You must extend this part #decoder part self.decoder_conv1 = nn.ConvTranspose2d(zdim, zdim//2, kernel_size=4, stride=1, padding=0) self.decoder_bn1 = nn.BatchNorm2d(zdim//2) #You must extend this part def encode(self, x): x1 = F.leaky_relu(self.encoder_bn1(self.encoder_conv1(x)), negative_slope=0.2) #Fill this part with some CNN architecture def decode(self, z): #complete this part 4

slide-5
SLIDE 5

def reparameterize(self, z_means, z_log_var): # epsilon * variance + means #write down the appropriate code def forward(self, x): # encoder return x_recon, mu, logvar model = VAE() # Loss and Optimizer

  • ptimizer = #fill in this part

# Start training for epoch in range(epoch): for i, data in enumerate(training_data_loader): # Forward pass x = data # x_reconst, mu, log_var = model(x) # reconstruction loss between x_reconst and x #write the appropriate code # kl divergence kl_div = #write down the correct objective function # Backprop and optimize loss = reconst_loss + kl_div

  • ptimizer.zero_grad()

loss.backward()

  • ptimizer.step()

First, we want to train decoder parameters θ and encoder parameters φ to have accurate reconstructions. Second, we wish to build a probabilistic model

  • n top of an autoencoder, so that we can reason about our uncertainty over the

code space.

  • For goal 1, we will simply produce a point estimate of the encoder and

decoder parameters θ (following the principle of minimizing reconstruction error). We wish to find the decoder and encoder parameters that, for the training data at hand, minimize reconstruction error meaning that a good autoencoder must have low binary cross entropy (you can change the reconstruction loss). You should adjust the learning rate –lr and potentially other keyword arguments as well. Then train for at least 200 epochs (or more if you don’t see convergence).

  • For goal 2, we’d like to be ”more Bayesian”, so we’ll assume a full gener-

ative model for our data. 5

slide-6
SLIDE 6

Plot reconstruction error (y-axis) on both train and test sets versus the number of training iterations (x-axis). Show performance with 3 different VAE encoder/decoder architectures: deep feed-forward neural network with 512 hidden units. Then a five convolutional layers architecture which starts with 32 filters and double it each time and finally build up a ResNet block from scratch and add five ResNet blocks to your VAE model as well. For the further example you can take a look at this paper.

  • Compare sampled images of a particular class of galaxy with their original

images.

  • Illustrate the location of different categories of galaxies within the latent
  • space. One way to understand what is represented in the latent space is

to consider where it encodes elements of the data. Here you will produce a scatter plot in the latent space, where each point in the plot will be the mean vector for the distribution q(z|x) given by the encoder. Further, we will colour each point in the plot by the class label for the input data. You must include the visualization result of the weight matrix for different layer.

  • For this assignment, any new idea that would be tried and not mentioned

in the description which leads to a significant improvement of result or including the idea of this paper, would have extra bonus. 6