INTRODUCTION TO PYTORCH
Caio Corro
Components
➤ torch: tensors (with gradient computation ability)
➤ torch.nn.functional: functions that manipulate tensors
➤ torch.nn: neural network (sub-)components (e.g. affine transformation)
➤ torch.optim: optimizers
Interfaces
➤ Python
➤ C++ (somewhat experimental)
Computation Graph
➤ Dynamic: you re-build the computation graph for each input
➤ Eager: each operation is immediately computed (no lazy computation), see the sketch below
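A minimal sketch (not in the original slides) illustrating dynamic, eager execution: every operation returns a concrete tensor that can be inspected immediately, and a new graph is built for each input.

import torch

x = torch.rand((3,), requires_grad=True)
y = x * 2      # computed immediately (eager): y already holds concrete values
z = y.sum()
print(y, z)

z.backward()   # back-propagate through the graph built for this particular input
print(x.grad)  # a new input would re-build a new graph (dynamic)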
TENSORS
torch.Tensor
➤ dtype: type of elements
➤ shape: shape/size of the tensor
➤ device: device where the tensor is stored (i.e. cpu, gpu)
➤ requires_grad: do we want to backpropagate gradient to this tensor? (see the sketch below)
dtype
➤ torch.float / torch.float32 (default type)
➤ torch.double / torch.float64
➤ …
➤ torch.long (signed integer)
➤ torch.bool
https://pytorch.org/docs/stable/tensor_attributes.html#torch.torch.dtype
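A minimal sketch (not in the original slides) showing how these attributes can be inspected:

import torch

t = torch.zeros((2, 3), dtype=torch.float, requires_grad=True)
print(t.dtype)          # torch.float32
print(t.shape)          # torch.Size([2, 3])
print(t.device)         # cpu (by default)
print(t.requires_grad)  # True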
CREATING TENSORS

Creating an uninitialized tensor:

import torch
t = torch.empty(
    (2, 4, 4),  # shape
    dtype=torch.float,
    device="cpu",
    requires_grad=True
)

Creating an initialized tensor:

torch.zeros((2, 4, 4), dtype=torch.float, requires_grad=True)
torch.ones((2, 4, 4), dtype=torch.float, requires_grad=True)
torch.rand((2, 4, 4), dtype=torch.float, requires_grad=True)

Default arguments
➤ Float
➤ CPU
➤ No grad

https://pytorch.org/docs/stable/torch.html#creation-ops
CREATING TENSORS FROM OTHER TENSORS

*_like() functions

Create a new tensor with the same attributes as the argument:
➤ Specific attributes can be overridden
➤ Shape cannot be changed

t2 = torch.zeros_like(t)
t_bool = torch.zeros_like(t, dtype=torch.bool)

clone()

Create a copy of a tensor:

t1 = torch.ones((1,))
t2 = t1.clone()
t2[0] = 3
print(t1, t2)
# tensor([1.]) tensor([3.])
CREATING TENSORS FROM DATA

From Python data

t1 = torch.tensor([0, 1, 2, 3], dtype=torch.long)
➤ Creates a vector with integers 0, 1, 2, 3
➤ Elements are signed integers (longs)

Using iterables

t2 = torch.tensor(range(10))
➤ Vector with values from 0 to 9 (the dtype is inferred from the data: signed integers here)

Creating matrices

t3 = torch.tensor([[0, 1], [2, 3]])
➤ First row: 0, 1
➤ Second row: 2, 3
OPERATIONS
https://pytorch.org/docs/stable/notes/autograd.html#in-place-operations-with-autograd
Out-of-place operations
➤ Create a new tensor, i.e. memory is allocated to store the results
➤ Set back-propagation information if required (i.e. if at least one of the inputs has requires_grad=True)

In-place operations
➤ Modify the data of the tensor (no memory allocation)
➤ Easy to identify: name ending with an underscore
➤ Can be problematic for gradient computation: be careful when requires_grad=True
OUT-OF-PLACE OPERATIONS

t1 = torch.rand((4, 4))
t2 = torch.rand((4, 4))

t3 = torch.add(t1, t2)
t3 = t1.add(t2)
t3 = t1 + t2

t3 = torch.sub(t1, t2)
t3 = t1.sub(t2)
t3 = t1 - t2

# ! element-wise product: this is not matrix multiplication
t3 = torch.mul(t1, t2)
t3 = t1.mul(t2)
t3 = t1 * t2

t3 = torch.div(t1, t2)
t3 = t1.div(t2)
t3 = t1 / t2

t3 = torch.matmul(t1, t2)
t3 = t1.matmul(t2)
t3 = t1 @ t2
IN-PLACE OPERATIONS
t1.add_(t2)
t1.sub_(t2)
t1.mul_(t2)
t1.div_(t2)

Note: no in-place matrix multiplication
ELEMENT-WISE OPERATIONS: IN-PLACE

[Computation graph: a, b → (×) → tmp; tmp, c → (×) → d]

tmp = a × b
d = tmp × c

∂d/∂c = tmp
∂d/∂tmp = c

Back-propagation to tmp
➤ We need the value of « c »
➤ We don't need the value of « tmp »

a = torch.rand((1,))
b = torch.rand((1,))
c = torch.rand((1,), requires_grad=True)

tmp = a * b
d = tmp * c

# erase the data of tmp
torch.zero_(tmp)

! Backprop will FAIL! (the gradient with respect to c requires the value of « tmp », which was erased)
a = torch.rand((1,), requires_grad=True)
b = torch.rand((1,), requires_grad=True)
c = torch.rand((1,))

tmp = a * b
d = tmp * c

# erase the data of tmp
torch.zero_(tmp)

This is OK! (here the value of « tmp » is not needed for back-propagation)
ACTIVATION FUNCTIONS
import torch
import torch.nn
import torch.nn.functional as F

t1 = torch.rand((2, 10))

# « Standard » activations
t2 = torch.relu(t1)
t2 = torch.tanh(t1)
t2 = torch.sigmoid(t1)
torch.relu_(t1)
torch.tanh_(t1)
torch.sigmoid_(t1)

# Other activations
t2 = F.leaky_relu(t1)
t2 = F.leaky_relu_(t1)
t2 = F.elu(t1)
t2 = F.elu_(t1)
BROADCASTING

[Figure: a (3×3 matrix) + b (1×3 row vector) = c]

! Invalid dimensions: the shapes of a and b do not match
BROADCASTING

[Figure: a (3×3 matrix) + b (1×3 row vector) = c]

Copy rows so that dimensions are correct

Explicit broadcasting

a = torch.rand((3, 3))
b = torch.rand((1, 3))

# explicitly copy the data
b.repeat((3, 1))

# implicit construction
# (no duplicated memory)
b.expand((3, -1))

Implicit broadcasting

a = torch.rand((3, 3))
b = torch.rand((1, 3))
c = a + b

Many operations will automatically broadcast dimensions ⇒ RTFM!

https://pytorch.org/docs/stable/notes/broadcasting.html#broadcasting-semantics
https://pytorch.org/docs/stable/torch.html#torch.add
GRADIENT COMPUTATION

➤ backward() launches the back-prop algorithm if and only if a gradient is required

a = torch.rand((1,))
b = torch.rand((1,))
c = a * b

# after the call a.grad contains the gradient
c.backward()

! No gradient is required so the call to backward will fail!
GRADIENT COMPUTATION

➤ backward() launches the back-prop algorithm if and only if a gradient is required

a = torch.rand((1,), requires_grad=True)
b = torch.rand((1,))
c = a * b

# after the call a.grad contains the gradient
c.backward()

# let's do something else...
b2 = torch.rand((1, ))
c2 = a * b2

# by default gradient is accumulated,
# so if we want to recompute a gradient,
# we have to erase the previous one manually!
a.grad.zero_()

# explicitly set the incoming sensitivity
c2.backward(torch.tensor([2.]))
MODULES AND PARAMETERS
torch.nn.Module
To build a neural network, we store in a module:
➤ Parameters of the network
➤ Other modules
Benefits
➤ Execution mode: we can set the network in training or in test mode (e.g. to automatically apply or discard dropout)
➤ Move the whole network to a device
➤ Retrieve all learnable parameters of the network
➤ … (see the sketch below)
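A minimal sketch (not in the original slides) of these benefits, using the built-in torch.nn.Linear module as an example:

import torch

net = torch.nn.Linear(10, 2)   # any torch.nn.Module behaves the same way

net.train()                    # training mode (dropout applied, etc.)
net.eval()                     # evaluation mode (dropout discarded, etc.)

net.to("cpu")                  # move all parameters to a device ("cuda" for a GPU)

for p in net.parameters():     # retrieve all learnable parameters
    print(p.shape, p.requires_grad)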
SINGLE HIDDEN LAYER 1/2

class HiddenLayer(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.W = torch.nn.Parameter(
            torch.empty(output_dim, input_dim)
        )
        self.bias = torch.nn.Parameter(
            torch.empty(output_dim, 1)
        )

    def forward(self, inputs):
        # transpose everything because of the PyTorch data format
        z = inputs @ self.W.transpose(0, 1) \
            + self.bias.transpose(0, 1)
        # non-linearity!
        return torch.relu(z)

nn = HiddenLayer(10, 2)

# batch of 64 inputs
x = torch.rand((64, 10))

# shape of y is (64, 2)
y = nn(x)

! Do not call forward directly!
SINGLE HIDDEN LAYER 2/2
class HiddenLayer(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.linear = torch.nn.Linear(input_dim, output_dim)

    def forward(self, inputs):
        z = self.linear(inputs)
        return torch.relu(z)

nn = HiddenLayer(10, 2)
x = torch.rand((64, 10))
y = nn(x)
https://pytorch.org/docs/stable/nn.html#torch.nn.Linear
Module that implements an affine transformation
SIMPLE NEURAL NETWORK: 1 HIDDEN LAYER
class SimpleNetwork(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, n_classes):
        super().__init__()
        self.z_proj = torch.nn.Linear(input_dim, hidden_dim)
        self.output_proj = torch.nn.Linear(hidden_dim, n_classes)

    def forward(self, inputs):
        z = torch.relu(self.z_proj(inputs))
        o = self.output_proj(z)
        return o
SIMPLE NEURAL NETWORK: 2 HIDDEN LAYERS
class SimpleNetwork2(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, n_classes):
        super().__init__()
        self.z_proj1 = torch.nn.Linear(input_dim, hidden_dim1)
        self.z_proj2 = torch.nn.Linear(hidden_dim1, hidden_dim2)
        self.output_proj = torch.nn.Linear(hidden_dim2, n_classes)

    def forward(self, inputs):
        z1 = torch.relu(self.z_proj1(inputs))
        z2 = torch.relu(self.z_proj2(z1))
        o = self.output_proj(z2)
        return o
LIST OF MODULES
class SimpleNetwork2(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, n_classes):
        super().__init__()
        self.z_projs = []
        self.z_projs.append(torch.nn.Linear(input_dim, hidden_dim1))
        self.z_projs.append(torch.nn.Linear(hidden_dim1, hidden_dim2))

! Do not do this!
Module inspection
PyTorch will automatically inspect attributes of modules, e.g. to extract all parameters of a network. However, only appropriate containers will be recursively inspected:
For modules:
➤ torch.nn.Module
➤ torch.nn.ModuleList
➤ torch.nn.ModuleDict
➤ torch.nn.Sequential

For parameters:
➤ torch.nn.ParameterList
➤ torch.nn.ParameterDict
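A minimal sketch (not in the original slides) illustrating the consequence: parameters stored in a plain Python list are invisible to the module, while a torch.nn.ModuleList exposes them.

import torch

class BadNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [torch.nn.Linear(10, 10)]                        # plain list: not inspected

class GoodNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.ModuleList([torch.nn.Linear(10, 10)])  # inspected

print(len(list(BadNet().parameters())))   # 0: the Linear parameters are "lost"
print(len(list(GoodNet().parameters())))  # 2: weight and bias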
LIST OF MODULES
class SimpleNetwork2(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, n_classes):
        super().__init__()
        self.z_projs = torch.nn.ModuleList()
        self.z_projs.append(torch.nn.Linear(input_dim, hidden_dim1))
        self.z_projs.append(torch.nn.Linear(hidden_dim1, hidden_dim2))
        self.output_proj = torch.nn.Linear(hidden_dim2, n_classes)

    def forward(self, inputs):
        z = inputs
        for nn in self.z_projs:
            z = torch.relu(nn(z))
        o = self.output_proj(z)
        return o
SEQUENTIAL CONTAINER
class SimpleNetwork2(torch.nn.Module):
    def __init__(self, input_dim, hdim1, hdim2, n_classes):
        super().__init__()
        self.seq = torch.nn.Sequential(
            torch.nn.Linear(input_dim, hdim1),
            torch.nn.ReLU(),
            torch.nn.Linear(hdim1, hdim2),
            torch.nn.ReLU(),
            torch.nn.Linear(hdim2, n_classes)
        )

    def forward(self, inputs):
        return self.seq(inputs)
https://pytorch.org/docs/stable/nn.html#torch.nn.Sequential
Note the different ReLU: torch.nn.ReLU() is a module, while torch.relu is a function
torch.nn.Sequential
➤ Built with a list of modules
➤ Executed in order: the output of a module is given as input to the next one (all modules are applied in the given order)
LOSS FUNCTIONS
loss_builder = torch.nn.NLLLoss(reduction='mean')

loss = loss_builder(y, gold)
epoch_loss += loss.item()
https://pytorch.org/docs/stable/nn.html#loss-functions
.item() converts the loss to a Python number: important for the garbage collector! (see the sketch below)
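A minimal usage sketch (the raw scores and gold labels below are hypothetical placeholders, not from the original slides). torch.nn.NLLLoss expects log-probabilities, so log_softmax is applied first; torch.nn.CrossEntropyLoss combines both steps.

import torch
import torch.nn.functional as F

scores = torch.rand((64, 2), requires_grad=True)   # raw network outputs (batch of 64, 2 classes)
gold = torch.randint(0, 2, (64,))                   # hypothetical gold class indices

loss_builder = torch.nn.NLLLoss(reduction='mean')
loss = loss_builder(F.log_softmax(scores, dim=1), gold)

epoch_loss = 0.
epoch_loss += loss.item()   # .item() returns a Python float, so epoch_loss does not keep the graph alive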
SINGLE PARAMETER UPDATE

nn = SimpleNetwork2(10, 250, 100, 2)

# get the parameters of the network + options
optimizer = torch.optim.SGD(
    nn.parameters(),
    lr=0.01,
    weight_decay=0.0001,
    momentum=0.9
)

# forward
x = torch.rand((64, 10))
y = nn(x)

# compute loss
loss = loss_builder(y, gold)

# backward
nn.zero_grad()   # reset gradient!
loss.backward()

# update parameters
optimizer.step()
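A minimal sketch (not in the original slides) of how these steps are usually repeated over epochs and mini-batches, assuming the SimpleNetwork2 module defined above; the random x and gold tensors are placeholders for real data, and CrossEntropyLoss is used here because SimpleNetwork2 returns raw scores.

import torch

nn = SimpleNetwork2(10, 250, 100, 2)
loss_builder = torch.nn.CrossEntropyLoss(reduction='mean')
optimizer = torch.optim.SGD(nn.parameters(), lr=0.01, momentum=0.9)

for epoch in range(5):
    epoch_loss = 0.
    for _ in range(100):                      # loop over mini-batches
        x = torch.rand((64, 10))              # placeholder inputs
        gold = torch.randint(0, 2, (64,))     # placeholder gold labels

        y = nn(x)                             # forward
        loss = loss_builder(y, gold)

        nn.zero_grad()                        # reset gradients
        loss.backward()                       # backward
        optimizer.step()                      # update parameters

        epoch_loss += loss.item()
    print(epoch, epoch_loss)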
OPTIMIZERS
➤ Many optimizers are available out of the box
➤ More recent techniques are often available on GitHub (RAdam, …)

torch.optim.Adam(nn.parameters())
torch.optim.Adadelta(nn.parameters())
torch.optim.Adagrad(nn.parameters())
https://pytorch.org/docs/stable/optim.html
DROPOUT

class SimpleNetwork2(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, n_classes):
        super().__init__()
        self.seq = torch.nn.Sequential(
            torch.nn.Linear(input_dim, hidden_dim1),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.5),  # dropout!
            torch.nn.Linear(hidden_dim1, hidden_dim2),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim2, n_classes)
        )

    def forward(self, inputs):
        return self.seq(inputs)

nn = SimpleNetwork2(10, 250, 100, 2)
x = torch.rand((64, 10))

# train mode: the dropout will be applied
nn.train()
y = nn(x)

# eval mode: the dropout will be ignored
nn.eval()
y = nn(x)
PART 1: MNIST CLASSIFICATION WITH PYTORCH
Todo
➤ Parametrizable number of layers (and hidden layer dimensions)
➤ Parametrizable dropout ratio
➤ Parametrizable activation function
➤ Different network architectures, e.g. an over-parametrized network that overfits (i.e. bad regularization)
➤ Different regularization methods (dropout, …)
Bonus
➤ Use a CNN instead of an MLP
https://pytorch.org/docs/stable/nn.html#torch.nn.Conv2d
PART 2: VARIATIONAL AUTO-ENCODER
[Graphical model: latent variable z and observed variable x, with decoder pθ(x|z) and encoder qϕ(z|x)]
Recall
➤ Encoder: map image space to latent space
➤ Decoder: map latent space to image space
➤ Prior distribution on latent variable
Goal
➤ Generate new images!

Training loss: Evidence Lower Bound (ELBO)

max_{θ,ϕ}  𝔼_{qϕ(z|x)} [ log pθ(x|z) ] − KL[ qϕ(z|x), p(z) ]
Gaussian Random Variable
➤ μ: mean
➤ σ²: variance
Reparameterization trick

Differentiable sampling process ⇒ differentiable Monte-Carlo estimation of the ELBO!

z ∼ 𝒩(μ, σ²)   ⟺   e ∼ 𝒩(0, 1),  z = μ + e × σ

μ and σ² are given by the encoder
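A minimal PyTorch sketch (not in the original slides) of the reparameterization trick, with hypothetical mu and sigma tensors standing in for the encoder output (sigma is the standard deviation, i.e. the square root of σ²):

import torch

# hypothetical encoder outputs: mean and standard deviation for a batch of 64 latent vectors
mu = torch.rand((64, 20), requires_grad=True)
sigma = torch.rand((64, 20), requires_grad=True)

e = torch.randn_like(sigma)   # e ~ N(0, 1): the sampling itself is outside the gradient path
z = mu + e * sigma            # z ~ N(mu, sigma^2), differentiable w.r.t. mu and sigma

z.sum().backward()            # gradients flow back to mu and sigma (i.e. to the encoder)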