SLIDE 1

INTRODUCTION TO PYTORCH

Caio Corro

SLIDE 2

Components

➤ torch: tensors (with gradient computation ability)
➤ torch.nn.functional: functions that manipulate tensors
➤ torch.nn: neural network (sub-)components (e.g. affine transformation)
➤ torch.optim: optimizers

Interfaces

➤ Python
➤ C++ (somewhat experimental)

Computation Graph

➤ Dynamic: you re-build the computation graph for each input
➤ Eager: each operation is immediately computed (no lazy computation)

SLIDE 3
1. TENSORS
SLIDE 4

TENSORS

torch.Tensor

➤ dtype: type of elements
➤ shape: shape/size of the tensor
➤ device: device where the tensor is stored (i.e. cpu, gpu)
➤ requires_grad: do we want to backpropagate gradient to this tensor?

dtype

➤ torch.float/torch.float32 (default type)
➤ torch.double/torch.float64
➤ …
➤ torch.long (64-bit signed integer)
➤ torch.bool

https://pytorch.org/docs/stable/tensor_attributes.html#torch.torch.dtype
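A quick sanity check of these attributes and their defaults (a minimal sketch; printed values shown as comments):

import torch

t = torch.zeros((2, 3))
print(t.dtype)          # torch.float32 (the default)
print(t.shape)          # torch.Size([2, 3])
print(t.device)         # cpu (the default)
print(t.requires_grad)  # False (the default)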

SLIDE 5

CREATING TENSORS

import torch

t = torch.empty(
    (2, 4, 4),            # shape
    dtype=torch.float,
    device="cpu",
    requires_grad=True,
)

Creating an uninitialized tensor.

Default arguments:

➤ Float
➤ CPU
➤ No grad

SLIDE 6

CREATING TENSORS

import torch

t = torch.empty(
    (2, 4, 4),            # shape
    dtype=torch.float,
    device="cpu",
    requires_grad=True,
)

torch.zeros((2, 4, 4), dtype=torch.float, requires_grad=True)
torch.ones((2, 4, 4), dtype=torch.float, requires_grad=True)
torch.rand((2, 4, 4), dtype=torch.float, requires_grad=True)

https://pytorch.org/docs/stable/torch.html#creation-ops

Creating an uninitialized tensor vs. creating initialized tensors.

Default arguments:

➤ Float
➤ CPU
➤ No grad

SLIDE 7

CREATING TENSORS FROM OTHER TENSORS

t2 = torch.zeros_like(t)
t_bool = torch.zeros_like(t, dtype=torch.bool)

*_like() functions

Create a new tensor with the same attributes as the argument:

➤ Specific attributes can be overridden
➤ Shape cannot be changed

SLIDE 8

CREATING TENSORS FROM OTHER TENSORS

t2 = torch.zeros_like(t)
t_bool = torch.zeros_like(t, dtype=torch.bool)

*_like() functions

Create a new tensor with the same attributes as the argument:

➤ Specific attributes can be overridden
➤ Shape cannot be changed

clone()

Create a copy of a tensor:

t1 = torch.ones((1,))
t2 = t1.clone()
t2[0] = 3
print(t1, t2)
# tensor([1.]) tensor([3.])

SLIDE 9

CREATING TENSORS FROM DATA

From python data

t1 = torch.tensor([0, 1, 2, 3], dtype=torch.long)

➤ Creates a vector with integers 0, 1, 2, 3
➤ Elements are signed integers (longs)

SLIDE 10

CREATING TENSORS FROM DATA

From python data

t1 = torch.tensor([0, 1, 2, 3], dtype=torch.long)
t2 = torch.tensor(range(10))

➤ Creates a vector with integers 0, 1, 2, 3
➤ Elements are signed integers (longs)

Using iterables

➤ Vector with values from 0 to 9 (integer data, so the inferred dtype is torch.long, not float)

SLIDE 11

CREATING TENSORS FROM DATA

From python data

t1 = torch.tensor([0, 1, 2, 3], dtype=torch.long)
t2 = torch.tensor(range(10))
t3 = torch.tensor([[0, 1], [2, 3]])

➤ Creates a vector with integers 0, 1, 2, 3
➤ Elements are signed integers (longs)

Using iterables

➤ Vector with values from 0 to 9 (integer data, so the inferred dtype is torch.long, not float)

Creating matrices

➤ First row: 0, 1
➤ Second row: 2, 3
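A small sketch of how torch.tensor() infers the dtype from the Python data (the float example is added here for contrast and is not from the slides):

t1 = torch.tensor([0, 1, 2, 3], dtype=torch.long)  # explicit dtype
t2 = torch.tensor(range(10))                       # integer data: inferred torch.int64
t3 = torch.tensor([[0.0, 1.0], [2.0, 3.0]])        # float data: inferred torch.float32
print(t1.dtype, t2.dtype, t3.dtype)
# torch.int64 torch.int64 torch.float32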

SLIDE 12

OPERATIONS

https://pytorch.org/docs/stable/notes/autograd.html#in-place-operations-with-autograd

Out-of-place operations

➤ Create a new tensor, i.e. memory is allocated to store the results
➤ Set back-propagation information if required
  (i.e. if at least one of the inputs has requires_grad=True)

In-place operations

➤ Modify the data of the tensor (no memory allocation)
➤ Easy to identify: name ending with an underscore
➤ Can be problematic for gradient computation:
  • the forward value is forgotten
  • can break the backpropagation algorithm

Be careful when requires_grad=True!
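A minimal sketch contrasting the two kinds of operations (identifiers and values are illustrative):

t1 = torch.ones((2, 2))
t2 = torch.ones((2, 2))

t3 = t1.add(t2)      # out-of-place: new tensor allocated, t1 unchanged
t1.add_(t2)          # in-place: t1 itself now holds the sum
print(t3.equal(t1))  # True: same values, but t3 lives in separate memory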

SLIDE 13

OUT-OF-PLACE OPERATIONS

t1 = torch.rand((4, 4))
t2 = torch.rand((4, 4))

t3 = torch.add(t1, t2)
t3 = t1.add(t2)
t3 = t1 + t2

t3 = torch.sub(t1, t2)
t3 = t1.sub(t2)
t3 = t1 - t2

SLIDE 14

OUT-OF-PLACE OPERATIONS

t1 = torch.rand((4, 4))
t2 = torch.rand((4, 4))

t3 = torch.add(t1, t2)
t3 = t1.add(t2)
t3 = t1 + t2

t3 = torch.sub(t1, t2)
t3 = t1.sub(t2)
t3 = t1 - t2

!

This is not matrix multiplication:

t3 = torch.mul(t1, t2)
t3 = t1.mul(t2)
t3 = t1 * t2

t3 = torch.div(t1, t2)
t3 = t1.div(t2)
t3 = t1 / t2

Matrix multiplication:

t3 = torch.matmul(t1, t2)
t3 = t1.matmul(t2)
t3 = t1 @ t2
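A quick way to see the difference between the two products (a sketch; shapes are illustrative):

a = torch.ones((2, 3))
b = torch.ones((2, 3))
print((a * b).shape)      # torch.Size([2, 3]): element-wise product
print((a @ b.t()).shape)  # torch.Size([2, 2]): matrix product, inner dimensions must match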

SLIDE 15

IN-PLACE OPERATIONS

t1.add_(t2)
t1.sub_(t2)
t1.mul_(t2)
t1.div_(t2)

Note: no in-place matrix multiplication.

SLIDE 16

ELEMENT-WISE OPERATIONS: IN-PLACE

[Computation graph: a and b enter a × node producing tmp; tmp and c enter a × node producing d]

tmp = a × b
d = tmp × c

SLIDE 17

ELEMENT-WISE OPERATIONS: IN-PLACE

[Computation graph: a and b enter a × node producing tmp; tmp and c enter a × node producing d]

tmp = a × b
d = tmp × c

∂d/∂c = tmp
∂d/∂tmp = c

Back-propagation to tmp

➤ We need the value of "c"
➤ We don't need the value of "tmp"

SLIDE 18

ELEMENT-WISE OPERATIONS: IN-PLACE

[Computation graph: a and b enter a × node producing tmp; tmp and c enter a × node producing d]

tmp = a × b
d = tmp × c

∂d/∂c = tmp
∂d/∂tmp = c

Back-propagation to tmp

➤ We need the value of "c"
➤ We don't need the value of "tmp"

a = torch.rand((1,))
b = torch.rand((1,))
c = torch.rand((1,), requires_grad=True)

tmp = a * b
d = tmp * c

# erase the data of tmp
torch.zero_(tmp)

!

Backprop will FAIL: ∂d/∂c = tmp, but the value of tmp has been erased!

SLIDE 19

ELEMENT-WISE OPERATIONS: IN-PLACE

[Computation graph: a and b enter a × node producing tmp; tmp and c enter a × node producing d]

tmp = a × b
d = tmp × c

∂d/∂c = tmp
∂d/∂tmp = c

Back-propagation to tmp

➤ We need the value of "c"
➤ We don't need the value of "tmp"

a = torch.rand((1,), requires_grad=True)
b = torch.rand((1,), requires_grad=True)
c = torch.rand((1,))

tmp = a * b
d = tmp * c

# erase the data of tmp
torch.zero_(tmp)

This is OK! Here only a and b require gradients, and back-propagating to them needs the values of a, b and c, not the value of tmp.

SLIDE 20

ACTIVATION FUNCTIONS

import torch
import torch.nn
import torch.nn.functional as F

t1 = torch.rand((2, 10))

# "Standard" activations
t2 = torch.relu(t1)
t2 = torch.tanh(t1)
t2 = torch.sigmoid(t1)
torch.relu_(t1)
torch.tanh_(t1)
torch.sigmoid_(t1)

# Other activations
t2 = F.leaky_relu(t1)
t2 = F.leaky_relu_(t1)
t2 = F.elu(t1)
t2 = F.elu_(t1)

SLIDE 21

BROADCASTING

[Diagram: c = a + b, where a is a 3×3 matrix and b is a 1×3 row vector]

!

Invalid dimensions

SLIDE 22

BROADCASTING

[Diagram: c = a + b, where a is a 3×3 matrix and b is a 1×3 row vector]

Copy rows so that dimensions are correct

SLIDE 23

BROADCASTING

[Diagram: c = a + b, where a is a 3×3 matrix and b is a 1×3 row vector]

Copy rows so that dimensions are correct

Explicit broadcasting

a = torch.rand((3, 3))
b = torch.rand((1, 3))

# explicitly copy the data
b.repeat((3, 1))

# implicit construction
# (no duplicated memory)
b.expand((3, -1))

SLIDE 24

BROADCASTING

[Diagram: c = a + b, where a is a 3×3 matrix and b is a 1×3 row vector]

Copy rows so that dimensions are correct

Explicit broadcasting

a = torch.rand((3, 3))
b = torch.rand((1, 3))

# explicitly copy the data
b.repeat((3, 1))

# implicit construction
# (no duplicated memory)
b.expand((3, -1))

Implicit broadcasting

a = torch.rand((3, 3))
b = torch.rand((1, 3))
c = a + b

https://pytorch.org/docs/stable/notes/broadcasting.html#broadcasting-semantics https://pytorch.org/docs/stable/torch.html#torch.add

Many operations will automatically broadcast dimensions. RTFM!
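A short sketch of the implicit semantics (shapes shown as comments):

a = torch.rand((3, 3))
b = torch.rand((1, 3))
c = a + b             # b is broadcast along dim 0
print(c.shape)        # torch.Size([3, 3])

v = torch.rand((3, 1))
w = torch.rand((1, 4))
print((v + w).shape)  # torch.Size([3, 4]): both inputs are expanded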

SLIDE 25

GRADIENT COMPUTATION

➤ backward() launches the back-prop algorithm if and only if a gradient is required

a = torch.rand((1,))
b = torch.rand((1,))
c = a * b

# after the call a.grad contains the gradient
c.backward()

!

No gradient is required, so the call to backward() will fail!

SLIDE 26

GRADIENT COMPUTATION

a = torch.rand((1,), requires_grad=True)
b = torch.rand((1,))
c = a * b

# after the call a.grad contains the gradient
c.backward()

➤ backward() launches the back-prop algorithm if and only if a gradient is required


SLIDE 27

GRADIENT COMPUTATION

a = torch.rand((1,), requires_grad=True)
b = torch.rand((1,))
c = a * b

# after the call a.grad contains the gradient
c.backward()

# let's do something else...
b2 = torch.rand((1,))
c2 = a * b2

➤ backward() launches the back-prop algorithm if and only if a gradient is required


SLIDE 28

GRADIENT COMPUTATION

a = torch.rand((1,), requires_grad=True)
b = torch.rand((1,))
c = a * b

# after the call a.grad contains the gradient
c.backward()

# let's do something else...
b2 = torch.rand((1,))
c2 = a * b2

# by default gradient is accumulated,
# so if we want to recompute a gradient,
# we have to erase the previous one manually!
a.grad.zero_()
c2.backward(torch.tensor([2.]))

➤ backward() launches the back-prop algorithm if and only if a gradient is required
➤ Explicitly set the incoming sensitivity: the tensor passed to backward() (here torch.tensor([2.])) is the gradient seeded at c2
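A sketch of gradient accumulation in action (values chosen so the gradients are easy to check by hand):

a = torch.tensor([3.], requires_grad=True)
b = torch.tensor([2.])

c = a * b
c.backward()
print(a.grad)  # tensor([2.]) since dc/da = b

c2 = a * b
c2.backward()
print(a.grad)  # tensor([4.]): the new gradient was accumulated!

a.grad.zero_()
c3 = a * b
c3.backward()
print(a.grad)  # tensor([2.]) again after resetting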

SLIDE 29
2. NEURAL NETWORKS
SLIDE 30

MODULES AND PARAMETERS

torch.nn.Module

To build a neural network, we store in a module:

➤ Parameters of the network
➤ Other modules

Benefits

➤ Execution mode: we can set the network in training or in test mode
  (e.g. to automatically apply or discard dropout)
➤ Move the whole network to a device
➤ Retrieve all learnable parameters of the network
➤ …
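These benefits look like this in practice (a minimal sketch using a single torch.nn.Linear as the "network"):

nn = torch.nn.Linear(10, 2)

nn.train()                 # training mode
nn.eval()                  # test/evaluation mode
nn.to("cpu")               # move all parameters to a device ("cuda" for a GPU)

for p in nn.parameters():  # retrieve all learnable parameters
    print(p.shape)
# torch.Size([2, 10]) and torch.Size([2]): weight and bias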

SLIDE 31

SINGLE HIDDEN LAYER 1/2

class HiddenLayer(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.W = torch.nn.Parameter(
            torch.empty(output_dim, input_dim)
        )
        self.bias = torch.nn.Parameter(
            torch.empty(output_dim, 1)
        )

SLIDE 32

SINGLE HIDDEN LAYER 1/2

class HiddenLayer(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.W = torch.nn.Parameter(
            torch.empty(output_dim, input_dim)
        )
        self.bias = torch.nn.Parameter(
            torch.empty(output_dim, 1)
        )

    def forward(self, inputs):
        # Transpose everything because of the Pytorch data format
        z = inputs @ self.W.transpose(0, 1) \
            + self.bias.transpose(0, 1)
        # Non-linearity!
        return torch.relu(z)

SLIDE 33

SINGLE HIDDEN LAYER 1/2

class HiddenLayer(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.W = torch.nn.Parameter(
            torch.empty(output_dim, input_dim)
        )
        self.bias = torch.nn.Parameter(
            torch.empty(output_dim, 1)
        )

    def forward(self, inputs):
        z = inputs @ self.W.transpose(0, 1) \
            + self.bias.transpose(0, 1)
        return torch.relu(z)

nn = HiddenLayer(10, 2)

# Batch of 64 inputs
x = torch.rand((64, 10))

# Shape of y is (64, 2)
y = nn(x)

!

Do not call forward directly!


SLIDE 34

SINGLE HIDDEN LAYER 2/2

class HiddenLayer(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.linear = torch.nn.Linear(input_dim, output_dim)

    def forward(self, inputs):
        z = self.linear(inputs)
        return torch.relu(z)

nn = HiddenLayer(10, 2)
x = torch.rand((64, 10))
y = nn(x)

https://pytorch.org/docs/stable/nn.html#torch.nn.Linear

Module that implements an affine transformation
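For reference, torch.nn.Linear stores its weight with shape (output_dim, input_dim) and computes inputs @ weight.T + bias, which is exactly why the manual version needed the transposes (a short sketch):

linear = torch.nn.Linear(10, 2)
print(linear.weight.shape)  # torch.Size([2, 10])
print(linear.bias.shape)    # torch.Size([2])

x = torch.rand((64, 10))
print(linear(x).shape)      # torch.Size([64, 2])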

SLIDE 35

SIMPLE NEURAL NETWORK: 1 HIDDEN LAYER

class SimpleNetwork(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, n_classes):
        super().__init__()
        self.z_proj = torch.nn.Linear(input_dim, hidden_dim)
        self.output_proj = torch.nn.Linear(hidden_dim, n_classes)

    def forward(self, inputs):
        z = torch.relu(self.z_proj(inputs))
        o = self.output_proj(z)
        return o

SLIDE 36

SIMPLE NEURAL NETWORK: 2 HIDDEN LAYERS

class SimpleNetwork2(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, n_classes):
        super().__init__()
        self.z_proj1 = torch.nn.Linear(input_dim, hidden_dim1)
        self.z_proj2 = torch.nn.Linear(hidden_dim1, hidden_dim2)
        self.output_proj = torch.nn.Linear(hidden_dim2, n_classes)

    def forward(self, inputs):
        z1 = torch.relu(self.z_proj1(inputs))
        z2 = torch.relu(self.z_proj2(z1))
        o = self.output_proj(z2)
        return o

SLIDE 37

LIST OF MODULES

class SimpleNetwork2(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, n_classes):
        super().__init__()
        self.z_projs = []
        self.z_projs.append(torch.nn.Linear(input_dim, hidden_dim1))
        self.z_projs.append(torch.nn.Linear(hidden_dim1, hidden_dim2))

!

Do not do this!

SLIDE 38

LIST OF MODULES

class SimpleNetwork2(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, n_classes):
        super().__init__()
        self.z_projs = []
        self.z_projs.append(torch.nn.Linear(input_dim, hidden_dim1))
        self.z_projs.append(torch.nn.Linear(hidden_dim1, hidden_dim2))

!

Do not do this!

Module inspection

Pytorch will automatically inspect attributes of modules, e.g. to extract all parameters of a network. However, only appropriate containers will be recursively inspected:

For modules:

➤ torch.nn.Module
➤ torch.nn.ModuleList
➤ torch.nn.ModuleDict
➤ torch.nn.Sequential

For parameters:

➤ torch.nn.ParameterList
➤ torch.nn.ParameterDict
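A sketch of what goes wrong with a plain Python list (class names are illustrative):

class Good(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.ModuleList([torch.nn.Linear(10, 10)])

class Bad(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [torch.nn.Linear(10, 10)]  # invisible to Pytorch!

print(len(list(Good().parameters())))  # 2: weight and bias are found
print(len(list(Bad().parameters())))   # 0: nothing for the optimizer to update!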

SLIDE 39

LIST OF MODULES

class SimpleNetwork2(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, n_classes):
        super().__init__()
        self.z_projs = torch.nn.ModuleList()
        self.z_projs.append(torch.nn.Linear(input_dim, hidden_dim1))
        self.z_projs.append(torch.nn.Linear(hidden_dim1, hidden_dim2))
        self.output_proj = torch.nn.Linear(hidden_dim2, n_classes)

    def forward(self, inputs):
        z = inputs
        for nn in self.z_projs:
            z = torch.relu(nn(z))
        o = self.output_proj(z)
        return o

SLIDE 40

SEQUENTIAL CONTAINER

class SimpleNetwork2(torch.nn.Module):
    def __init__(self, input_dim, hdim1, hdim2, n_classes):
        super().__init__()
        self.seq = torch.nn.Sequential(
            torch.nn.Linear(input_dim, hdim1),
            torch.nn.ReLU(),
            torch.nn.Linear(hdim1, hdim2),
            torch.nn.ReLU(),
            torch.nn.Linear(hdim2, n_classes),
        )

    def forward(self, inputs):
        return self.seq(inputs)

https://pytorch.org/docs/stable/nn.html#torch.nn.Sequential

Note the different ReLU: torch.nn.ReLU() is a module, not the torch.relu function

torch.nn.Sequential

➤ Built with a list of modules
➤ Executed in order: the output of a module is given as input to the next one, so all modules are applied in the given order

SLIDE 41

LOSS FUNCTIONS

loss_builder = torch.nn.NLLLoss(reduction='mean')
loss = loss_builder(y, gold)
epoch_loss += loss.item()

https://pytorch.org/docs/stable/nn.html#loss-functions

item() converts the loss to a Python float: important for the garbage collector, since keeping the tensor alive would keep the whole computation graph alive!
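A minimal end-to-end sketch (note that torch.nn.NLLLoss expects log-probabilities, so a log_softmax is applied first; the shapes and random gold labels are illustrative):

import torch.nn.functional as F

y = torch.rand((64, 5))            # network outputs: batch of 64, 5 classes
gold = torch.randint(0, 5, (64,))  # gold class indices

loss_builder = torch.nn.NLLLoss(reduction='mean')
loss = loss_builder(F.log_softmax(y, dim=1), gold)
epoch_loss = loss.item()           # a plain Python float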

SLIDE 42
3. OPTIMIZATION
SLIDE 43

SINGLE PARAMETER UPDATE

nn = SimpleNetwork2(10, 250, 100, 2)

optimizer = torch.optim.SGD(
    nn.parameters(),     # get the parameters of the network
    lr=0.01,             # options
    weight_decay=0.0001,
    momentum=0.9,
)

SLIDE 44

SINGLE PARAMETER UPDATE

nn = SimpleNetwork2(10, 250, 100, 2)

optimizer = torch.optim.SGD(
    nn.parameters(),
    lr=0.01,
    weight_decay=0.0001,
    momentum=0.9,
)

# Forward
x = torch.rand((64, 10))
y = nn(x)

# Compute loss
loss = loss_builder(y, gold)

SLIDE 45

SINGLE PARAMETER UPDATE

nn = SimpleNetwork2(10, 250, 100, 2)

optimizer = torch.optim.SGD(
    nn.parameters(),
    lr=0.01,
    weight_decay=0.0001,
    momentum=0.9,
)

# Forward
x = torch.rand((64, 10))
y = nn(x)
loss = loss_builder(y, gold)

# Backward
nn.zero_grad()    # reset gradient!
loss.backward()
optimizer.step()  # update parameters
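Putting the pieces together, a typical epoch loop looks like the sketch below; train_loader stands for any iterable of (x, gold) batches and is an assumption, not something defined on the slides:

for epoch in range(10):
    epoch_loss = 0.0
    for x, gold in train_loader:      # hypothetical data iterator
        y = nn(x)                     # forward
        loss = loss_builder(y, gold)  # compute loss
        nn.zero_grad()                # reset gradients
        loss.backward()               # backward
        optimizer.step()              # update parameters
        epoch_loss += loss.item()
    print(epoch, epoch_loss)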

SLIDE 46

OPTIMIZERS

➤ Many optimizers are available out of the box
➤ More recent techniques are often available on GitHub (RAdam, …)

torch.optim.Adam(nn.parameters())
torch.optim.Adadelta(nn.parameters())
torch.optim.Adagrad(nn.parameters())

https://pytorch.org/docs/stable/optim.html

SLIDE 47

DROPOUT

class SimpleNetwork2(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, n_classes):
        super().__init__()
        self.seq = torch.nn.Sequential(
            torch.nn.Linear(input_dim, hidden_dim1),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.5),  # Dropout!
            torch.nn.Linear(hidden_dim1, hidden_dim2),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim2, n_classes),
        )

    def forward(self, inputs):
        return self.seq(inputs)

nn = SimpleNetwork2(10, 250, 100, 2)
x = torch.rand((64, 10))

SLIDE 48

DROPOUT

class SimpleNetwork2(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, n_classes):
        super().__init__()
        self.seq = torch.nn.Sequential(
            torch.nn.Linear(input_dim, hidden_dim1),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.5),
            torch.nn.Linear(hidden_dim1, hidden_dim2),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim2, n_classes),
        )

    def forward(self, inputs):
        return self.seq(inputs)

nn = SimpleNetwork2(10, 250, 100, 2)
x = torch.rand((64, 10))

# Train mode: the dropout will be applied
nn.train()
y = nn(x)

SLIDE 49

DROPOUT

class SimpleNetwork2(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, n_classes):
        super().__init__()
        self.seq = torch.nn.Sequential(
            torch.nn.Linear(input_dim, hidden_dim1),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.5),
            torch.nn.Linear(hidden_dim1, hidden_dim2),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim2, n_classes),
        )

    def forward(self, inputs):
        return self.seq(inputs)

nn = SimpleNetwork2(10, 250, 100, 2)
x = torch.rand((64, 10))

nn.train()
y = nn(x)

# Eval mode: the dropout will be ignored
nn.eval()
y = nn(x)
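A quick check of the mode switch on an isolated Dropout module (a sketch; the exact zeroed entries are random):

drop = torch.nn.Dropout(0.5)
x = torch.ones((1, 4))

drop.train()
print(drop(x))  # about half the entries zeroed, survivors scaled by 1/(1-0.5) = 2
print(drop(x))  # a different random mask on every call

drop.eval()
print(drop(x))  # tensor([[1., 1., 1., 1.]]): dropout is the identity in eval mode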

SLIDE 50
4. LAB EXERCISES
SLIDE 51

PART 1: MNIST CLASSIFICATION WITH PYTORCH

Todo

1. Build an MLP classifier with Pytorch

➤ Parametrizable number of layers (and hidden layer dimensions)
➤ Parametrizable dropout ratio
➤ Parametrizable activation function

2. Train the network in different settings

➤ Different network architectures
  (e.g., an overparametrized network which overfits, i.e. bad regularization)
➤ Different regularization methods (dropout, …)

Bonus

➤ Use a CNN instead of an MLP

https://pytorch.org/docs/stable/nn.html#torch.nn.Conv2d

SLIDE 52

PART 2: VARIATIONAL AUTO-ENCODER

[Diagram: latent variable z and observation x, with decoder pθ(x|z) and encoder qϕ(z|x)]

Recall

➤ Encoder: map image space to latent space
➤ Decoder: map latent space to image space
➤ Prior distribution on latent variable

Training loss: Evidence Lower Bound (ELBO)

Goal

➤ Generate new images!

max_{θ,ϕ}  𝔼_{qϕ(z|x)}[ log pθ(x|z) ] − KL[ qϕ(z|x) ‖ p(z) ]

SLIDE 53

Gaussian Random Variable

➤ μ: mean
➤ σ²: variance

Reparameterization trick

Differentiable sampling process ⇒ differentiable Monte-Carlo estimation of the ELBO!

To sample z ∼ 𝒩(μ, σ²):

e ∼ 𝒩(0, 1)
z = μ + e × σ

μ and σ² are given by the encoder.
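In Pytorch the trick takes a couple of lines. A sketch, assuming the encoder outputs mu and log_var (parameterizing the variance by its log is a common convention, not something stated on the slide):

def reparameterize(mu, log_var):
    e = torch.randn_like(mu)      # e ~ N(0, 1), same shape as mu
    sigma = torch.exp(0.5 * log_var)
    return mu + e * sigma         # differentiable w.r.t. mu and log_var

mu = torch.zeros((64, 20))        # illustrative encoder outputs
log_var = torch.zeros((64, 20))
z = reparameterize(mu, log_var)   # batch of 64 samples from N(mu, sigma^2)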