Applied Machine Learning: Convolutional Neural Networks
Siamak Ravanbakhsh
COMP 551 (Winter 2020)

Learning objectives: understand the convolution layer


Convolution (1D)

Cross-correlation is similar to convolution. Ignoring the activation (for simpler notation), and assuming $w$ and $x$ are zero for any index outside the input and filter bounds:

Cross-correlation ($w \star x$):
$y(c) = \sum_{k=-\infty}^{\infty} w(k)\, x(c+k)$

$w$ is called the filter or kernel. Note that in general $w \star x \neq x \star w$.

Convolution flips $w$ or $x$ (to be commutative):
$y(c) = \sum_{k=-\infty}^{\infty} w(k)\, x(c-k) = \sum_{d=-\infty}^{\infty} w(c-d)\, x(d)$ (change of variable), so $w * x = x * w$.

Since we learn $w$, flipping it makes no difference; in practice, we use cross-correlation rather than convolution.

Convolution is equivariant wrt translation, i.e., shifting $x$ shifts $w * x$.
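To make the flip concrete, here is a minimal NumPy sketch (not from the slides; the array values are arbitrary) contrasting the two operations:

import numpy as np

x = np.array([1., 2., 3., 4., 5.])   # input signal
w = np.array([1., 0., -1.])          # filter / kernel

corr = np.correlate(x, w, mode='valid')                  # y(c) = sum_k w(k) x(c+k)
conv = np.convolve(x, w, mode='valid')                   # y(c) = sum_k w(k) x(c-k)
conv_via_flip = np.correlate(x, w[::-1], mode='valid')   # flipping w turns one into the other

print(corr)            # [-2. -2. -2.]
print(conv)            # [2. 2. 2.]
print(conv_via_flip)   # matches conv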

Convolution (2D)

The same idea of parameter-sharing and locality extends to 2 dimensions (i.e., image data):

$y_{d_1,d_2} = \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2} w_{k_1,k_2}\, x_{d_1+k_1-1,\; d_2+k_2-1}$

Depending on its position, an input element may participate in all outputs or in only a single output; this is related to how the borders are handled.

image credit: Vincent Dumoulin, Francesco Visin

There are different ways of handling the borders (illustrated with a 3x3 kernel):

- no padding at all (valid): the output is smaller than the input (by how much?)
- zero-pad the input to keep the output dims similar to the input (same)
- zero-pad the input and produce all non-zero outputs (full): the output is larger than the input (by how much?), and each input participates in the same number of output elements

Output length (for one dimension): $\lfloor D + \text{padding} - K + 1 \rfloor$

image credit: Vincent Dumoulin, Francesco Visin
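As a sanity check on the valid case, a minimal NumPy sketch (conv2d_valid is a hypothetical helper, not from the slides) that loops over output positions:

import numpy as np

def conv2d_valid(x, w):
    # y_{d1,d2} = sum_{k1,k2} w_{k1,k2} x_{d1+k1-1, d2+k2-1}, with no padding
    D1, D2 = x.shape
    K1, K2 = w.shape
    out = np.zeros((D1 - K1 + 1, D2 - K2 + 1))   # output shrinks by K-1 per dimension
    for d1 in range(out.shape[0]):
        for d2 in range(out.shape[1]):
            out[d1, d2] = np.sum(x[d1:d1+K1, d2:d2+K2] * w)
    return out

x = np.arange(25.).reshape(5, 5)
w = np.ones((3, 3)) / 9.0                 # 3x3 averaging kernel
print(conv2d_valid(x, w).shape)           # (3, 3): a 5x5 input shrinks to 3x3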

Pooling

Sometimes we would like to reduce the size of the output, e.g., from D x D to D/2 x D/2. A combination of pooling and downsampling is used:

1. calculate the output: $\tilde{y}_d = g\left(\sum_{k=1}^{K} w_k\, x_{d+k-1}\right)$
2. aggregate the output over different regions: $y_d = \mathrm{pool}\{\tilde{y}_d, \ldots, \tilde{y}_{d+p}\}$; two common aggregation functions are max and mean
3. often this is followed by subsampling using the same step size

Pooling results in some degree of invariance to translation (e.g., a small left translation of the input changes many pooled values only slightly). The same idea extends to higher dimensions.
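A minimal sketch (assumed, not from the slides) of 1D max pooling where aggregation and subsampling use the same step size:

import numpy as np

def max_pool1d(y_tilde, p):
    # keep only full windows; each output aggregates p consecutive values
    Dp = len(y_tilde) // p
    return np.array([np.max(y_tilde[d*p:(d+1)*p]) for d in range(Dp)])

y_tilde = np.array([1., 3., 2., 8., 5., 4.])
print(max_pool1d(y_tilde, 2))   # [3. 8. 5.]
# shifting y_tilde by one position often leaves many pooled values unchanged,
# which is the (partial) translation invariance mentioned above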

Strided convolution

Alternatively, we can directly subsample the output. Computing $\tilde{y}_d = g\left(\sum_{k=1}^{K} w_k\, x_{(d-1)+k}\right)$ for every position and then keeping every $p$-th value of $\tilde{y}$ is equivalent to computing only the needed outputs directly:

$y_d = g\left(\sum_{k=1}^{K} w_k\, x_{p(d-1)+k}\right)$
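A minimal sketch (assumed, not from the slides) checking that convolving everywhere and then subsampling every p-th output matches computing the strided outputs directly:

import numpy as np

x = np.array([1., 2., 3., 4., 5., 6., 7.])
w = np.array([1., -1., 2.])
p = 2  # stride
K = len(w)

y_full = np.array([np.sum(x[d:d+K] * w) for d in range(len(x) - K + 1)])
subsampled = y_full[::p]                                   # keep every p-th output
strided = np.array([np.sum(x[p*d:p*d+K] * w)               # compute only those outputs
                    for d in range((len(x) - K) // p + 1)])
print(np.allclose(subsampled, strided))                    # True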

The same idea extends to higher dimensions, with possibly different step-sizes for different dimensions (and optional padding):

$y_{d_1,d_2} = \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2} w_{k_1,k_2}\, x_{p_1(d_1-1)+k_1,\; p_2(d_2-1)+k_2}$

Output length (for one dimension): $\left\lfloor \frac{D + \text{padding} - K}{\text{stride}} + 1 \right\rfloor$

image: Dumoulin & Visin'16
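A minimal helper (assumed, not from the slides) implementing the output-length formula above for one dimension, with padding counted as the total amount added across both sides:

def conv_output_length(D, K, padding=0, stride=1):
    return (D + padding - K) // stride + 1

print(conv_output_length(5, 3))                       # valid: 3
print(conv_output_length(5, 3, padding=2))            # same:  5
print(conv_output_length(5, 3, padding=4))            # full:  7
print(conv_output_length(7, 3, padding=0, stride=2))  # strided: 3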

Channels

So far we assumed a single input and output sequence or image.

- With RGB data, we have 3 input channels (M = 3); the example in the figure has 2 input channels: $x \in \mathbb{R}^{M \times D_1 \times D_2}$
- Similarly, we can produce multiple output channels (here M' = 3): $y \in \mathbb{R}^{M' \times D_1' \times D_2'}$
- We have one $K_1 \times K_2$ filter per input-output channel combination, $w \in \mathbb{R}^{M' \times M \times K_1 \times K_2}$, and we add the results of the convolutions from the different input channels.

image: Dumoulin & Visin'16

We can also add a bias parameter (b), one per output channel: $b \in \mathbb{R}^{M'}$.

$y_{m',d_1,d_2} = g\left(b_{m'} + \sum_{m=1}^{M} \sum_{k_1} \sum_{k_2} w_{m',m,k_1,k_2}\, x_{m,\; d_1+k_1-1,\; d_2+k_2-1}\right)$

with $x \in \mathbb{R}^{M \times D_1 \times D_2}$, $y \in \mathbb{R}^{M' \times D_1' \times D_2'}$, and $w \in \mathbb{R}^{M' \times M \times K_1 \times K_2}$ (in the figure, the input channels are RGB).

image: https://cs231n.github.io/convolutional-networks/
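A minimal shape check (assumed, not from the slides) of the multi-channel case using PyTorch; the channel counts (M = 3, M' = 5) are chosen arbitrarily for illustration:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=3, bias=True)
x = torch.randn(1, 3, 32, 32)      # (batch, M, D1, D2)
y = conv(x)

print(conv.weight.shape)   # torch.Size([5, 3, 3, 3])   -> w in R^{M' x M x K1 x K2}
print(conv.bias.shape)     # torch.Size([5])            -> b in R^{M'}, one per output channel
print(y.shape)             # torch.Size([1, 5, 30, 30]) -> valid convolution shrinks D by K-1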

Convolutional Neural Network (CNN)

A CNN or convnet is a neural network with convolutional layers (so it's a special type of MLP). It can be applied to 1D sequence, 2D image, or 3D volumetric data.

Example: a convnet architecture (derived from AlexNet) for image classification, ending in fully connected layers with one output per class.

Visualization of the convolution kernels at the first layer (11x11x3x96): 96 filters, each one 11x11x3; each of these is responsible for one of the 96 feature maps in the second layer. Deeper units represent more abstract features.
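A toy sketch (assumed, heavily simplified; layer sizes are illustrative, not the exact architecture in the figure) of the conv-then-fully-connected pattern in PyTorch:

import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),   # 96 filters, each 11x11x3
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.AdaptiveAvgPool2d((6, 6)),                            # fixed spatial size before the FC layers
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),                 # fully connected layers
            nn.Linear(4096, num_classes),                            # number of classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = TinyConvNet()(torch.randn(2, 3, 224, 224))
print(logits.shape)   # torch.Size([2, 1000])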

Application: image classification

Convnets have achieved super-human performance in image classification. ImageNet challenge: > 1M images, 1000 classes. A variety of increasingly deeper architectures have been proposed.

image credit: He et al'15, https://semiengineering.com/new-vision-technologies-for-real-world-applications/

Training: backpropagation through convolution

Consider the strided 1D convolution op.:

$y_{m',d'} = \sum_{m} \sum_{k} w_{m',m,k}\, x_{m,\; p(d'-1)+k}$

where $m'$ is the output channel index, $m$ the input channel index, $k$ the filter index, and $p$ the stride.

Using backprop, we have $\frac{\partial J}{\partial y_{m',d'}}$ so far, and we need:

1) $\frac{\partial y_{m',d'}}{\partial w_{m',m,k}}$, so as to get the gradients
$\frac{\partial J}{\partial w_{m',m,k}} = \sum_{d'} \frac{\partial J}{\partial y_{m',d'}} \frac{\partial y_{m',d'}}{\partial w_{m',m,k}}$, where $\frac{\partial y_{m',d'}}{\partial w_{m',m,k}} = x_{m,\; p(d'-1)+k}$

2) $\frac{\partial y_{m',d'}}{\partial x_{m,d}}$, to backpropagate to the previous layer:
$\frac{\partial J}{\partial x_{m,d}} = \sum_{d',m'} \frac{\partial J}{\partial y_{m',d'}} \frac{\partial y_{m',d'}}{\partial x_{m,d}}$, where $\frac{\partial y_{m',d'}}{\partial x_{m,d}} = \sum_{k} w_{m',m,k}$ such that $p(d'-1)+k = d$

This operation is similar to multiplication by the transpose of the parameter-sharing matrix (transposed convolution).

Naive implementation

Consider the 1D convolution op. with stride 1 and a single input and output channel: $y_d = \sum_{k} w_k\, x_{d+k-1}$. In practice, the most efficient implementation depends on the filter size (using FFT for large filters).

Forward pass:

import numpy as np

def Conv1D(
    x,  # D (length)
    w,  # K (filter length)
):
    D, = x.shape
    K, = w.shape
    Dp = D - K + 1  # output length
    y = np.zeros((Dp))
    for dp in range(Dp):
        y[dp] = np.sum(x[dp:dp+K] * w)
    return y

Backward pass:

def Conv1DBackProp(
    x,     # D (length)
    w,     # K
    dJdy,  # Dp: error from layer above
):
    D, = x.shape
    K, = w.shape
    Dp, = dJdy.shape
    dw = np.zeros_like(w)
    dJdx = np.zeros_like(x)
    for dp in range(Dp):
        dw += dJdy[dp] * x[dp:dp+K]        # accumulate gradient wrt the filter
        dJdx[dp:dp+K] += dJdy[dp] * w      # scatter gradient back to the input
    return dJdx, dw  # error to layer below and weight update
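A small usage sketch (assumed, not from the slides) using the Conv1D and Conv1DBackProp functions above: compare the weight gradient against a finite-difference estimate for J = sum(y).

import numpy as np

x = np.random.randn(8)
w = np.random.randn(3)

y = Conv1D(x, w)
dJdy = np.ones_like(y)                 # dJ/dy for J = sum(y)
dJdx, dw = Conv1DBackProp(x, w, dJdy)

eps = 1e-6
dw_numeric = np.zeros_like(w)
for k in range(len(w)):
    w_plus = w.copy(); w_plus[k] += eps
    w_minus = w.copy(); w_minus[k] -= eps
    dw_numeric[k] = (Conv1D(x, w_plus).sum() - Conv1D(x, w_minus).sum()) / (2 * eps)

print(np.allclose(dw, dw_numeric, atol=1e-5))   # True if the backward pass is correct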

Transposed Convolution

Transposed convolution (aka deconvolution) recovers the shape of the original input.

Convolution with no stride and its transpose:
- no padding in the original convolution corresponds to full padding in the transposed version
- full padding in the original convolution corresponds to no padding in the transposed version

Convolution with stride and its transpose: this can be used for up-sampling (the opposite of stride/pooling). As expected, the transpose of a transposed convolution is the original convolution.

image: Dumoulin & Visin'16
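A minimal shape sketch (assumed, not from the slides): a strided convolution downsamples, and a transposed convolution with matching settings recovers the original spatial shape (useful for up-sampling).

import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)
down = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
up = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=1, output_padding=1)

y = down(x)
x_rec = up(y)
print(y.shape, x_rec.shape)   # torch.Size([1, 1, 4, 4]) torch.Size([1, 1, 8, 8])
# only the *shape* of the input is recovered, not its values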

Dilated Convolution

Dilated (aka atrous) convolution can be used to create an exponentially large receptive field in a few layers:

- dilation = 1 (i.e., no dilation), size of receptive field = 3
- dilation = 2, size of receptive field = 7
- dilation = 4, size of receptive field = 15
- dilation = 8, size of receptive field = 31

In contrast to stride, dilation does not lose resolution.

Output length (for one dimension): $\left\lfloor \frac{D + \text{padding} - \text{dilation} \times (K-1) - 1}{\text{stride}} + 1 \right\rfloor$

torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros')

image credits: Kalchbrenner et al'17, Dumoulin & Visin'16
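A minimal check (assumed, not from the slides) of the output-length formula against torch.nn.Conv1d. Note that a single dilated layer covers dilation x (K-1) + 1 input positions; the receptive fields of 3, 7, 15, 31 listed above come from stacking such layers.

import torch
import torch.nn as nn

D, K = 32, 3
for dilation in [1, 2, 4, 8]:
    conv = nn.Conv1d(1, 1, kernel_size=K, dilation=dilation)
    out_len = conv(torch.randn(1, 1, D)).shape[-1]
    formula = (D + 0 - dilation * (K - 1) - 1) // 1 + 1       # padding=0, stride=1
    print(dilation, out_len, formula, dilation * (K - 1) + 1)  # last value: extent of one dilated kernel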

Structured Prediction

The output itself may have (image) structure (e.g., predicting text, audio, or images). For example, in (semantic) segmentation we classify each pixel, and the loss is the sum of the cross-entropy loss across the whole image.

A variety of architectures exist; one that performs well is U-Net. Transposed convolution (upconv), concatenation, and skip connections are common in architecture design. Architecture search (i.e., combinatorial hyper-parameter search) is an expensive process and an active research area.

image: https://sthalles.github.io/deep_segmentation_network/
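A minimal sketch (assumed, not from the slides) of the segmentation loss: classify every pixel and sum the cross-entropy over the whole image; the class count and image size are arbitrary.

import torch
import torch.nn as nn

num_classes = 21
logits = torch.randn(2, num_classes, 64, 64)          # per-pixel class scores (B, C, H, W)
target = torch.randint(0, num_classes, (2, 64, 64))   # per-pixel ground-truth labels (B, H, W)

loss = nn.CrossEntropyLoss(reduction='sum')(logits, target)   # summed over all pixels
print(loss.item())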
