Convolutional Neural Networks
Rachel Hu and Zhi Zhang, Amazon AI (d2l.ai)


SLIDE 1

Convolutional Neural Networks

Rachel Hu and Zhi Zhang Amazon AI

SLIDE 2

Outline

  • GPUs
  • Convolutions
  • Pooling, Padding and Stride
  • Convolutional Neural Networks (LeNet)
  • Deep ConvNets (AlexNet)
  • Networks using Blocks (VGG)
  • Residual Neural Networks (ResNet)
SLIDE 3

GPUs

NVIDIA Turing TU102

SLIDE 4

Intel i7-6700K

  • 4 physical cores
  • Per core: 64KB L1 cache, 256KB L2 cache
  • Shared 8MB L3 cache
  • 30 GB/s bandwidth to RAM
SLIDE 5

GPU performance

SLIDE 6

High-end Gaming / Deep Learning PC

  • DDR4 RAM: 32 GB
  • Nvidia Titan RTX: 12 TFLOPS (130 TFLOPS for FP16 on Tensor Cores), 24 GB
  • Intel i7: 0.15 TFLOPS

SLIDE 7

High-end Gaming / Deep Learning PC

  • DDR4 RAM: 32 GB
  • Nvidia Titan RTX: 12 TFLOPS (130 TFLOPS for FP16 on Tensor Cores), 24 GB
  • Intel i7: 0.15 TFLOPS

ctx = npx.cpu()
ctx = npx.gpu(0)
x.copyto(ctx)

SLIDE 8

GPU Notebook

SLIDE 9

From fully connected to convolutions

SLIDE 10

Classifying Dogs and Cats in Images

  • Use a good camera: an RGB image easily has 36M elements
  • A single-hidden-layer MLP with hidden size 100 has 3.6 billion parameters
  • That exceeds the population of dogs and cats on earth (900M dogs + 600M cats)

SLIDE 11

Flashback - Network with one hidden layer

36M features → 100 neurons

h = σ(Wx + b)

3.6B parameters ≈ 14 GB
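The parameter and memory figures above follow from simple arithmetic; a quick check in plain Python (float32 storage assumed):

```python
# Dense layer from a 36M-element image to 100 hidden units.
input_features = 36_000_000
hidden_units = 100

params = hidden_units * input_features + hidden_units  # weights W plus biases b
print(params)               # 3,600,000,100 -- about 3.6B parameters

bytes_needed = params * 4   # 4 bytes per float32 parameter
print(bytes_needed / 1e9)   # ~14.4 GB, matching the slide's 14 GB
```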

SLIDE 12

Where is Waldo?

SLIDE 13

Two Principles

  • Translation invariance
  • Locality

SLIDE 14

Rethinking Dense Layers

  • Reshape input and output into matrices (width, height)
  • Reshape weights into a 4-D tensor, mapping (h, w) to (h′, w′)

h_{i,j} = Σ_{k,l} w_{i,j,k,l} x_{k,l} = Σ_{a,b} v_{i,j,a,b} x_{i+a,j+b}

where V is a re-indexing of W such that v_{i,j,a,b} = w_{i,j,i+a,j+b}

SLIDE 15

Idea #1 - Translation Invariance

  • A shift in x also leads to a shift in h
  • v should not depend on (i, j); fix this via v_{i,j,a,b} = v_{a,b}

h_{i,j} = Σ_{a,b} v_{i,j,a,b} x_{i+a,j+b}   becomes   h_{i,j} = Σ_{a,b} v_{a,b} x_{i+a,j+b}

That's a 2-D convolution (strictly speaking, a cross-correlation)

SLIDE 16

Idea #2 - Locality

  • We shouldn't look very far from x_{i,j} in order to assess what's going on at h_{i,j}
  • Outside a range Δ the parameters vanish: v_{a,b} = 0 for |a|, |b| > Δ

h_{i,j} = Σ_{a=−Δ}^{Δ} Σ_{b=−Δ}^{Δ} v_{a,b} x_{i+a,j+b}

SLIDE 17

Convolution

SLIDE 18

2-D Cross Correlation

(vdumoulin@ Github)

0 × 0 + 1 × 1 + 3 × 2 + 4 × 3 = 19
1 × 0 + 2 × 1 + 4 × 2 + 5 × 3 = 25
3 × 0 + 4 × 1 + 6 × 2 + 7 × 3 = 37
4 × 0 + 5 × 1 + 7 × 2 + 8 × 3 = 43
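The four sums above are exactly what a stride-1, no-padding cross-correlation computes. A minimal NumPy sketch (the helper name corr2d is ours):

```python
import numpy as np

def corr2d(X, K):
    """2-D cross-correlation: slide K over X with stride 1 and no padding."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = np.arange(9).reshape(3, 3)   # [[0,1,2],[3,4,5],[6,7,8]]
K = np.arange(4).reshape(2, 2)   # [[0,1],[2,3]]
Y = corr2d(X, K)
print(Y)                         # [[19. 25.] [37. 43.]]
```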

SLIDE 19

2-D Convolution Layer

  • Input matrix X: n_h × n_w
  • Kernel matrix W: k_h × k_w
  • Scalar bias b
  • Output matrix Y: (n_h − k_h + 1) × (n_w − k_w + 1)
  • W and b are learnable parameters

Y = X ⋆ W + b

SLIDE 20

Examples

Edge detection, sharpen, Gaussian blur (Wikipedia)

SLIDE 21

Examples

(Rob Fergus)

SLIDE 22

Convolutions Notebook

SLIDE 23

Padding and Stride

SLIDE 24

Padding

  • Given a 32 x 32 input image
  • Apply convolutional layer with 5 x 5 kernel
  • 28 x 28 output with 1 layer
  • 4 x 4 output with 7 layers
  • Shape decreases faster with larger kernels
  • Shape reduces from n_h × n_w to (n_h − k_h + 1) × (n_w − k_w + 1)

SLIDE 25

Padding

Padding adds rows/columns around input

0 × 0 + 0 × 1 + 0 × 2 + 0 × 3 = 0

SLIDE 26

Padding

  • With p_h padding rows and p_w padding columns, the output shape is (n_h − k_h + p_h + 1) × (n_w − k_w + p_w + 1)
  • A common choice is p_h = k_h − 1 and p_w = k_w − 1
  • Odd k_h: pad p_h/2 on both sides
  • Even k_h: pad ⌈p_h/2⌉ on top, ⌊p_h/2⌋ on bottom

SLIDE 27

Stride

  • Even with padding, the shape shrinks only linearly with the number of layers
  • Given a 224 x 224 input with a 5 x 5 kernel, it takes 55 layers to reduce the shape to 4 x 4
  • That requires a large amount of computation
SLIDE 28

Stride

  • Stride is the number of rows/columns stepped per slide of the window

Strides of 3 (height) and 2 (width):
0 × 0 + 0 × 1 + 1 × 2 + 2 × 3 = 8
0 × 0 + 6 × 1 + 0 × 2 + 0 × 3 = 6

SLIDE 29

Stride

  • Given stride s_h for the height and stride s_w for the width, the output shape is
    ⌊(n_h − k_h + p_h + s_h)/s_h⌋ × ⌊(n_w − k_w + p_w + s_w)/s_w⌋
  • With p_h = k_h − 1 and p_w = k_w − 1, this becomes
    ⌊(n_h + s_h − 1)/s_h⌋ × ⌊(n_w + s_w − 1)/s_w⌋
  • If the input height/width are divisible by the strides:
    (n_h/s_h) × (n_w/s_w)
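The floor formula above is easy to wrap in a helper; a small sketch in Python (the function name is ours):

```python
def conv_out(n, k, p=0, s=1):
    """Output size along one dimension: floor((n - k + p + s) / s)."""
    return (n - k + p + s) // s

# 32x32 input, 5x5 kernel, no padding: shape shrinks to 28x28
print(conv_out(32, 5))            # 28
# "same" padding p = k - 1 keeps the size
print(conv_out(32, 5, p=4))       # 32
# stride 2 with same padding halves it
print(conv_out(32, 5, p=4, s=2))  # 16
```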

SLIDE 30

Multiple Input and Output Channels

SLIDE 31

Multiple Input Channels

  • Color image may have three RGB channels
  • Converting to grayscale loses information
SLIDE 33

Multiple Input Channels

  • Have a kernel for each channel, and then sum the results over channels

(1 × 1 + 2 × 2 + 4 × 3 + 5 × 4) + (0 × 0 + 1 × 1 + 3 × 2 + 4 × 3) = 37 + 19 = 56
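The 56 above can be reproduced with a small NumPy sketch: cross-correlate each channel with its own kernel and sum (helper names are ours):

```python
import numpy as np

def corr2d(X, K):
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def corr2d_multi_in(X, K):
    """Per-channel cross-correlation, summed over the c_i input channels."""
    return sum(corr2d(x, k) for x, k in zip(X, K))

X = np.array([np.arange(9).reshape(3, 3),        # channel 0: values 0..8
              np.arange(1, 10).reshape(3, 3)])   # channel 1: values 1..9
K = np.array([[[0, 1], [2, 3]],
              [[1, 2], [3, 4]]])
out = corr2d_multi_in(X, K)
print(out[0, 0])   # 56.0, matching the slide
```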

SLIDE 34

Multiple Input Channels

  • Input X: c_i × n_h × n_w
  • Kernel W: c_i × k_h × k_w
  • Output Y: m_h × m_w

Y = Σ_i X_{i,:,:} ⋆ W_{i,:,:}   (sum over the c_i input channels)

SLIDE 35

Multiple Output Channels

  • No matter how many input channels, so far we always get a single output channel
  • We can have multiple 3-D kernels, each one generating its own output channel
  • Input X: c_i × n_h × n_w
  • Kernel W: c_o × c_i × k_h × k_w
  • Output Y: c_o × m_h × m_w

Y_{i,:,:} = X ⋆ W_{i,:,:,:}   for i = 1, …, c_o
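With c_o such 3-D kernels stacked into a 4-D weight tensor, the output gains a channel dimension; a shape-level NumPy sketch (the function name is ours):

```python
import numpy as np

def corr2d_multi_in_out(X, K):
    """X: (c_i, n_h, n_w), K: (c_o, c_i, k_h, k_w) -> Y: (c_o, m_h, m_w)."""
    co, _, kh, kw = K.shape
    mh, mw = X.shape[1] - kh + 1, X.shape[2] - kw + 1
    Y = np.zeros((co, mh, mw))
    for o in range(co):             # one 3-D kernel per output channel
        for i in range(mh):
            for j in range(mw):
                Y[o, i, j] = (X[:, i:i + kh, j:j + kw] * K[o]).sum()
    return Y

X = np.random.rand(3, 8, 8)        # c_i = 3 input channels
K = np.random.rand(16, 3, 5, 5)    # c_o = 16 output channels
print(corr2d_multi_in_out(X, K).shape)   # (16, 4, 4)
```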

SLIDE 36

Multiple Input/Output Channels

  • Each output channel may recognize a particular pattern
  • The input-channel kernels recognize and combine patterns in the input

SLIDE 37

1 x 1 Convolutional Layer

k_h = k_w = 1 is a popular choice. It doesn't recognize spatial patterns, but fuses channels.
Equivalent to a dense layer with an n_h n_w × c_i input and a c_o × c_i weight.
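The equivalence to a dense layer can be checked numerically; a sketch in NumPy (variable names are ours):

```python
import numpy as np

ci, co, nh, nw = 3, 5, 4, 4
X = np.random.rand(ci, nh, nw)
W = np.random.rand(co, ci)       # a 1x1 kernel is just a c_o x c_i matrix

# 1x1 convolution: mix channels independently at every pixel
Y_conv = np.einsum('oc,chw->ohw', W, X)

# Same computation as a dense layer on the (n_h*n_w) x c_i matrix of pixels
Y_dense = (X.reshape(ci, -1).T @ W.T).T.reshape(co, nh, nw)

print(np.allclose(Y_conv, Y_dense))   # True
```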

SLIDE 38

2-D Convolution Layer Summary

  • Input X: c_i × n_h × n_w
  • Kernel W: c_o × c_i × k_h × k_w
  • Bias B: c_o × c_i
  • Output Y: c_o × m_h × m_w
    Y = X ⋆ W + B
  • Complexity (number of floating point operations, FLOP): O(c_i c_o k_h k_w m_h m_w)
    For c_i = c_o = 100, k_h = k_w = 5, m_h = m_w = 64: about 1 GFLOP per layer
  • 10 layers, 1M examples: 10 PFLOP
    (CPU at 0.15 TFLOPS: 18h; GPU at 12 TFLOPS: 14min)
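The timing figures follow directly from the complexity formula (counting one multiply-accumulate as one FLOP, as the slide does):

```python
ci = co = 100
kh = kw = 5
mh = mw = 64

flops = ci * co * kh * kw * mh * mw   # ~1.02e9: about 1 GFLOP per layer
total = flops * 10 * 1_000_000        # 10 layers x 1M examples: ~10 PFLOP

print(total / 0.15e12 / 3600)         # CPU at 0.15 TFLOPS: ~19 hours
print(total / 12e12 / 60)             # GPU at 12 TFLOPS: ~14 minutes
```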

SLIDE 39

Pooling Layer

SLIDE 40

Pooling

  • Convolution is sensitive to position
  • Detect vertical edges
  • We need some degree of invariance to translation
  • Lighting, object positions, scales, and appearance vary among images

(Figure: input X and edge-detector output Y; the output becomes 0 after a 1-pixel shift)

SLIDE 41

2-D Max Pooling

  • Returns the maximal value in the sliding window

max(0,1,3,4) = 4

SLIDE 42

2-D Max Pooling

  • Returns the maximal value in the sliding window

Conv output → 2 x 2 max pooling: the vertical edge detection becomes tolerant to a 1-pixel shift

SLIDE 43

Padding, Stride, and Multiple Channels

  • Pooling layers have similar padding and stride as convolutional layers
  • No learnable parameters
  • Apply pooling to each input channel to obtain the corresponding output channel

#output channels = #input channels

SLIDE 44

Average Pooling

  • Max pooling: the strongest pattern signal in a window
  • Average pooling: replace max with mean in max pooling
  • The average signal strength in a window

Max pooling vs. average pooling
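Both pooling modes fit in one small helper; a NumPy sketch with stride-1 windows (pool2d is our name; frameworks typically default the stride to the window size):

```python
import numpy as np

def pool2d(X, k, mode='max'):
    """k x k pooling window, stride 1, no padding."""
    Y = np.zeros((X.shape[0] - k + 1, X.shape[1] - k + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            window = X[i:i + k, j:j + k]
            Y[i, j] = window.max() if mode == 'max' else window.mean()
    return Y

X = np.arange(9).reshape(3, 3)
print(pool2d(X, 2))          # [[4. 5.] [7. 8.]]  -- max(0,1,3,4) = 4
print(pool2d(X, 2, 'avg'))   # [[2. 3.] [5. 6.]]  -- mean(0,1,3,4) = 2
```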

SLIDE 45

Pooling Notebook

SLIDE 46

LeNet

SLIDE 47

Handwritten Digit Recognition

SLIDE 48

MNIST

  • Centered and scaled
  • 60,000 training examples
  • 10,000 test examples
  • 28 x 28 grayscale images
  • 10 classes
SLIDE 49

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, 1998: Gradient-based learning applied to document recognition

SLIDE 50

Expensive if we have many outputs
SLIDE 51

LeNet Notebook

SLIDE 52

AlexNet

SLIDE 53

AlexNet

  • AlexNet won the ImageNet competition in 2012
  • Deeper and bigger LeNet
  • Key modifications
  • Dropout (regularization)
  • ReLU (training)
  • MaxPooling
  • Paradigm shift for computer vision

Before: manually engineered features + SVM
After: features learned by a CNN + softmax regression

SLIDE 54

AlexNet Architecture

LeNet AlexNet

Larger kernel size and stride because of the increased image size, and more output channels.

Larger pool size, change to max pooling

SLIDE 55

AlexNet Architecture

LeNet AlexNet

More output channels; 3 additional convolutional layers

SLIDE 56

AlexNet Architecture

LeNet AlexNet

Increase hidden size from 120 to 4096; 1000-class output

SLIDE 57

More Tricks

  • Change activation function from sigmoid to ReLU (no more vanishing gradient)
  • Add a dropout layer after the two hidden dense layers (better robustness / regularization)

  • Data augmentation
SLIDE 58

Complexity

          #parameters           FLOP
          AlexNet   LeNet       AlexNet   LeNet
Conv1     35K       150         101M      1.2M
Conv2     614K      2.4K        415M      2.4M
Conv3-5   3M        —           445M      —
Dense1    26M       0.48M       26M       0.48M
Dense2    16M       0.1M        16M       0.1M
Total     46M       0.6M        1G        4M
Increase  11x       1x          250x      1x

SLIDE 59

AlexNet Notebook

SLIDE 60

Inception

SLIDE 61

Picking the best convolution …

LeNet, AlexNet, VGG, and NiN each commit to particular choices: 1x1, 3x3, or 5x5 convolutions, max pooling, multiple 1x1 convolutions

SLIDE 62

Why choose? Just pick them all.

SLIDE 63

Inception Blocks

4 paths extract information from different aspects, then concatenate along the output-channel dimension

The paths extract features with convolutions of different spatial sizes; one path extracts spatial information with pooling; all paths keep the same width/height as the input

SLIDE 64

Inception Blocks

  • Allocate different capacity to each path
  • Reduce the channel size to lower model capacity

The first inception block, with channel sizes specified

SLIDE 65

Inception Blocks

           #parameters   FLOPS
Inception  0.16M         128M
3x3 Conv   0.44M         346M
5x5 Conv   1.22M         963M

Inception blocks have fewer parameters and lower computational complexity than a single 3x3 or 5x5 convolutional layer

  • Mix of different functions (powerful function class)
  • Memory and compute efficiency (good generalization)
SLIDE 66

GoogLeNet

  • 5 stages with 9 inception blocks

Stage 1 → Stage 2 → Stage 3 (2 blocks) → Stage 4 (5 blocks) → Stage 5 (2 blocks) → Output

SLIDE 67

The many flavors of Inception Networks

  • Inception-BN (v2) - Add batch normalization
  • Inception-V3 - Modified the inception block
  • Replace 5x5 by multiple 3x3 convolutions
  • Replace 5x5 by 1x7 and 7x1 convolutions
  • Replace 3x3 by 1x3 and 3x1 convolutions
  • Generally deeper stack
  • Inception-V4 - Add residual connections (more later)
SLIDE 68

GluonCV Model Zoo: https://gluon-cv.mxnet.io/model_zoo/classification.html (Inception V3)

SLIDE 69

Batch Normalization


SLIDE 70

Batch Normalization

  • The loss occurs at the last layer
  • Last layers learn quickly
  • Data is inserted at the bottom layer
  • When bottom layers change, everything changes
  • Last layers need to relearn many times
  • Slow convergence
  • This is similar to covariate shift

Can we avoid changing the last layers while learning the first layers?

SLIDE 71

Batch Normalization

  • Can we avoid changing the last layers while learning the first layers?
  • Fix the mean and variance, and adjust them separately:

μ_B = (1/|B|) Σ_{i∈B} x_i
σ_B² = (1/|B|) Σ_{i∈B} (x_i − μ_B)² + ε

x_{i+1} = γ (x_i − μ_B)/σ_B + β

where γ and β are learnable scale and shift parameters
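The normalization step is a few lines of NumPy; a per-feature sketch over a minibatch (training-mode statistics only; the running averages used at inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale by gamma, beta."""
    mu = x.mean(axis=0)                    # per-feature minibatch mean
    var = x.var(axis=0)                    # per-feature minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=7.0, scale=3.0, size=(64, 10))   # shifted, scaled batch
y = batch_norm(x, gamma=1.0, beta=0.0)
print(y.mean(), y.std())                   # ~0 and ~1
```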

SLIDE 72

This was the original motivation …

SLIDE 73

What Batch Norm Really Does

  • Doesn't really reduce covariate shift (Lipton et al., 2018)
  • Regularization by noise injection
  • Random shift per minibatch
  • Random scale per minibatch
  • No need to mix with dropout (both are capacity control)
  • Ideal minibatch size of 64 to 256

x_{i+1} = γ (x_i − μ̂_B)/σ̂_B + β

The empirical mean μ̂_B acts as a random offset, the empirical variance σ̂_B² as a random scale

SLIDE 74

Residual Networks

SLIDE 75

Does adding layers improve accuracy?


SLIDE 76

Residual Networks

  • Adding a layer changes the function class
  • We want adding a layer to strictly add to the function class
  • 'Taylor expansion'-style parametrization: f(x) = x + g(x)

He et al., 2015

SLIDE 77

ResNet Block in detail

SLIDE 78

In code

def forward(self, X):
    Y = npx.relu(self.bn1(self.conv1(X)))
    Y = self.bn2(self.conv2(Y))
    if self.conv3:
        X = self.conv3(X)
    return npx.relu(Y + X)
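Stripped of the convolutions and batch norm, the skip connection itself is easy to illustrate; a NumPy sketch of f(x) = relu(x + g(x)) (names are ours, and this is a simplification of the block above):

```python
import numpy as np

def residual_forward(X, g):
    """f(X) = relu(X + g(X)): the identity path is added to the residual branch."""
    return np.maximum(X + g(X), 0)

X = np.abs(np.random.randn(4, 8))   # nonnegative input, so relu(X) == X

# If the residual branch g outputs zero, the whole block is exactly the identity:
out = residual_forward(X, lambda x: np.zeros_like(x))
print(np.allclose(out, X))          # True: adding the layer cannot shrink the function class
```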

SLIDE 79

The many flavors of ResNet blocks

Try every permutation

SLIDE 80

ResNet Module

  • Downsample per module (stride=2)
  • Enforce some nontrivial nonlinearity per module (via 1x1 convolution)
  • Stack up in blocks

blk = nn.Sequential()
for i in range(num_residuals):
    if i == 0 and not first_block:
        blk.add(Residual(num_channels, use_1x1conv=True, strides=2))
    else:
        blk.add(Residual(num_channels))

SLIDE 81

Putting it all together

  • Same block structure as e.g. VGG or GoogLeNet
  • Residual connections add expressiveness
  • Pooling/stride for dimensionality reduction
  • Batch normalization for capacity control

… train it at scale …

SLIDE 82

GluonCV Model Zoo: https://gluon-cv.mxnet.io/model_zoo/classification.html (ResNet 152)

SLIDE 83

Jupyter Notebook

SLIDE 84

More Ideas

SLIDE 85

DenseNet (Huang et al., 2016)

  • ResNet combines x and f(x) by addition
  • DenseNet uses a higher-order 'Taylor series' expansion: concatenate instead of add
  • Occasionally need to reduce the resolution (transition layer)

x_{i+1} = [x_i, f_i(x_i)]

x_1 = x
x_2 = [x, f_1(x)]
x_3 = [x, f_1(x), f_2([x, f_1(x)])]
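Concatenation (rather than addition) is the whole trick; a toy sketch of how the feature count grows (names are ours):

```python
import numpy as np

def dense_forward(x, layers):
    """Each layer's output is concatenated onto its input, never added."""
    for f in layers:
        x = np.concatenate([x, f(x)])
    return x

x = np.ones(4)
# Two toy layers that each emit 4 new features: width grows 4 -> 8 -> 12
out = dense_forward(x, [lambda v: v[:4] * 2, lambda v: v[:4] * 3])
print(out.shape)   # (12,)
```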

SLIDE 86

Squeeze-Excite Net (Hu et al., 2017)

  • Learn global weighting function per channel
  • Allows for fast information transfer between pixels in different locations of the image

SLIDE 87

Separable Convolutions - all channels separate

  • Break up channels to the extreme: no mixing between channels
  • Parameters: k_h ⋅ k_w ⋅ c_i ⋅ c_o → k_h ⋅ k_w ⋅ c
  • Computation: m_h ⋅ m_w ⋅ k_h ⋅ k_w ⋅ c_i ⋅ c_o → m_h ⋅ m_w ⋅ k_h ⋅ k_w ⋅ c
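The savings come from dropping the c_i × c_o channel-mixing factor; a quick count for an assumed 3x3 layer with 128 channels:

```python
kh = kw = 3
ci = co = c = 128

full = kh * kw * ci * co   # standard convolution: 147,456 parameters
depthwise = kh * kw * c    # one k x k filter per channel: 1,152 parameters
print(full // depthwise)   # 128x fewer parameters (but no channel mixing)
```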

SLIDE 88

ShuffleNet (Zhang et al., 2018)

  • ResNeXt breaks the convolution into channel groups
  • ShuffleNet mixes channels by grouped shuffling (very efficient for mobile)
SLIDE 89

Outline

  • GPUs
  • Convolutions
  • Pooling, Padding and Stride
  • Convolutional Neural Networks (LeNet)
  • Deep ConvNets (AlexNet)
  • Networks using Blocks (VGG)
  • Residual Neural Networks (ResNet)