  1. Deep learning 10.2. Causal convolutions. François Fleuret, https://fleuret.org/dlc/, Dec 20, 2020

  2. If we use an autoregressive model with a masked input

$$f : \{0, 1\}^T \times \mathbb{R}^T \to \mathbb{R}^C,$$

the input differs from one position to another. During training, even though the full input is known, computation that could be shared across positions is lost.

  3. We can avoid having the mask itself as input if the model predicts a distribution for every position of the sequence, that is

$$f : \mathbb{R}^T \to \mathbb{R}^{T \times C}.$$

It can be used for synthesis with

$$\begin{aligned}
x_1 &\leftarrow \mathrm{sample}(f_1(0, \dots, 0)) \\
x_2 &\leftarrow \mathrm{sample}(f_2(x_1, 0, \dots, 0)) \\
x_3 &\leftarrow \mathrm{sample}(f_3(x_1, x_2, 0, \dots, 0)) \\
&\;\;\vdots \\
x_T &\leftarrow \mathrm{sample}(f_T(x_1, x_2, \dots, x_{T-1}, 0))
\end{aligned}$$

where the 0s simply fill in for unknown values.
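A minimal sketch of this zero-filled sampling loop, assuming a hypothetical causal model f that maps a length-T tensor to (T, C) logits; the concrete version for this lecture's toy model appears on the Synthesis slide later in the deck:

    import torch
    from torch.distributions import Categorical

    T = 32                   # sequence length (assumed)
    x = torch.zeros(T)       # zeros stand in for the not-yet-generated values
    for t in range(T):
        logits = f(x)        # f is the hypothetical causal model, output (T, C)
        x[t] = Categorical(logits = logits[t]).sample()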

  4. If, additionally, the model is such that "future values" do not influence the prediction at a certain time, that is

$$\forall t,\ x_1, \dots, x_t,\ \alpha_1, \dots, \alpha_{T-t},\ \beta_1, \dots, \beta_{T-t},$$
$$f_{t+1}(x_1, \dots, x_t, \alpha_1, \dots, \alpha_{T-t}) = f_{t+1}(x_1, \dots, x_t, \beta_1, \dots, \beta_{T-t}),$$

then we have in particular

$$\begin{aligned}
f_1(0, \dots, 0) &= f_1(x_1, \dots, x_T) \\
f_2(x_1, 0, \dots, 0) &= f_2(x_1, \dots, x_T) \\
f_3(x_1, x_2, 0, \dots, 0) &= f_3(x_1, \dots, x_T) \\
&\;\;\vdots \\
f_T(x_1, x_2, \dots, x_{T-1}, 0) &= f_T(x_1, \dots, x_T)
\end{aligned}$$
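This property can be checked empirically. The sketch below, with the same hypothetical f as above, feeds two inputs that agree on their first t entries and differ afterwards, and verifies that the predictions up to position t + 1 coincide:

    import torch

    T, t = 8, 4
    a = torch.randn(T)
    b = a.clone()
    b[t:] = torch.randn(T - t)   # a and b agree on the first t entries only
    with torch.no_grad():
        # f_1, ..., f_{t+1} depend only on x_1, ..., x_t
        assert torch.allclose(f(a)[:t + 1], f(b)[:t + 1])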

  5. This provides a tremendous computational advantage during training, since

$$\ell(f, x) = \sum_u \ell\big(f_u(x_1, \dots, x_{u-1}, 0, \dots, 0),\, x_u\big) = \sum_u \ell\big(\underbrace{f_u(x_1, \dots, x_T)}_{\text{computed once}},\, x_u\big).$$

Such models are referred to as causal, since the future cannot affect the past.

  6. We can illustrate this with convolutional models. Standard convolutions let information flow "to the past," and the masked input was a way to condition only on already generated values.

[Figure: a 1d convolutional model over a zero-padded input x_1, ..., x_6. Connections through which information would flow from future positions to the past are forbidden; masking the input hides them.]

  7. Such a model can be made causal with convolutions that let information flow only to the future, combined with a first convolution that hides the present.

[Figure: a causal 1d convolutional model over a left-padded input x_1, ..., x_6; every unit depends only on strictly earlier input positions.]

  8. Another option for the first layer is to shift the input by one entry to hide the present.

[Figure: the padded input x_1, ..., x_6 and the padded, right-shifted input 0, x_1, ..., x_5 feeding a causal convolutional model.]

  9. PyTorch's convolutional layers do not accept asymmetric padding, but we can do it with F.pad, which even accepts negative padding to remove entries. For an n-dim tensor, the padding specification is

$$(\mathit{start}_n,\ \mathit{end}_n,\ \mathit{start}_{n-1},\ \mathit{end}_{n-1},\ \dots,\ \mathit{start}_{n-k},\ \mathit{end}_{n-k}).$$

>>> x = torch.randint(10, (2, 1, 5))
>>> x
tensor([[[1, 6, 3, 9, 1]],
        [[4, 8, 2, 2, 9]]])
>>> F.pad(x, (-1, 1))
tensor([[[6, 3, 9, 1, 0]],
        [[8, 2, 2, 9, 0]]])
>>> F.pad(x, (0, 0, 2, 0))
tensor([[[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [1, 6, 3, 9, 1]],
        [[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [4, 8, 2, 2, 9]]])

Similar processing can be achieved with the modules nn.ConstantPad1d, nn.ConstantPad2d, or nn.ConstantPad3d.
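The right shift used to hide the present (slide 8) can be obtained the same way: F.pad(x, (1, -1)) prepends a zero and drops the last entry, which is exactly what the model below does in its first layer.

>>> x = torch.arange(1, 7).view(1, 1, -1)
>>> x
tensor([[[1, 2, 3, 4, 5, 6]]])
>>> F.pad(x, (1, -1))
tensor([[[0, 1, 2, 3, 4, 5]]])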

  10. Some train sequences.

[Figure: twelve training sequences plotted over 32 time steps, values roughly between 10 and 50.]

  11. Model

class NetToy1d(nn.Module):
    def __init__(self, nb_classes, ks = 2, nc = 32):
        super().__init__()
        self.pad = (ks - 1, 0)
        self.conv0 = nn.Conv1d(1, nc, kernel_size = 1)
        self.conv1 = nn.Conv1d(nc, nc, kernel_size = ks)
        self.conv2 = nn.Conv1d(nc, nc, kernel_size = ks)
        self.conv3 = nn.Conv1d(nc, nc, kernel_size = ks)
        self.conv4 = nn.Conv1d(nc, nc, kernel_size = ks)
        self.conv5 = nn.Conv1d(nc, nb_classes, kernel_size = 1)

    def forward(self, x):
        # shift right by one entry to hide the present
        x = F.relu(self.conv0(F.pad(x, (1, -1))))
        # left-only padding keeps each convolution causal
        x = F.relu(self.conv1(F.pad(x, self.pad)))
        x = F.relu(self.conv2(F.pad(x, self.pad)))
        x = F.relu(self.conv3(F.pad(x, self.pad)))
        x = F.relu(self.conv4(F.pad(x, self.pad)))
        x = self.conv5(x)
        return x.permute(0, 2, 1).contiguous()
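A quick shape check (the number of classes here is an assumption): the model maps a (batch, 1, T) input to (batch, T, nb_classes) logits, one distribution per position.

>>> model = NetToy1d(nb_classes = 64)
>>> model(torch.randn(2, 1, 32)).size()
torch.Size([2, 32, 64])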

  12. Training loop

for sequences in train_input.split(args.batch_size):
    input = (sequences - mean) / std

    output = model(input)
    loss = cross_entropy(
        output.view(-1, output.size(-1)),
        sequences.view(-1)
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
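The loop relies on a setup that is not on the slide; a plausible version, where every concrete choice is an assumption:

# Hypothetical setup for the loop above; the exact values are guesses.
# train_input is assumed to be an (N, 1, T) integer tensor of class indices.
mean, std = train_input.float().mean(), train_input.float().std()
model = NetToy1d(nb_classes = 64)
cross_entropy = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-3)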

  13. Synthesis

generated = train_input.new_zeros((48,) + train_input.size()[1:])
flat = generated.view(generated.size(0), -1)

for t in range(flat.size(1)):
    input = (generated.float() - mean) / std
    output = model(input)
    logits = output.view(flat.size() + (-1,))[:, t]
    dist = torch.distributions.categorical.Categorical(logits = logits)
    flat[:, t] = dist.sample()

  14. Some generated sequences.

[Figure: twelve generated sequences plotted over 32 time steps, values between 0 and 60.]

  15. The global structure may not be properly generated.

[Figure: eight generated sequences whose overall shape is inconsistent with the training data.]

This can be fixed with dilated convolutions to have a larger context.
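A minimal sketch of the idea, not the lecture's actual model: a causal convolution with dilation d needs a left padding of (ks - 1) * d, and doubling the dilation at each layer makes the receptive field grow exponentially with depth instead of linearly.

class DilatedToy1d(nn.Module):
    # Hypothetical variant of NetToy1d with dilations 1, 2, 4, 8
    def __init__(self, nb_classes, ks = 2, nc = 32):
        super().__init__()
        self.conv_in = nn.Conv1d(1, nc, kernel_size = 1)
        self.convs = nn.ModuleList(
            [nn.Conv1d(nc, nc, kernel_size = ks, dilation = 2**k) for k in range(4)]
        )
        self.conv_out = nn.Conv1d(nc, nb_classes, kernel_size = 1)

    def forward(self, x):
        # shift right by one entry to hide the present
        x = F.relu(self.conv_in(F.pad(x, (1, -1))))
        for conv in self.convs:
            # left padding of (ks - 1) * dilation keeps the convolution causal
            pad = (conv.kernel_size[0] - 1) * conv.dilation[0]
            x = F.relu(conv(F.pad(x, (pad, 0))))
        x = self.conv_out(x)
        return x.permute(0, 2, 1).contiguous()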
