

SLIDE 1

Applied Machine Learning
Convolutional Neural Networks

Siamak Ravanbakhsh
COMP 551 (Winter 2020)

SLIDE 2

Learning objectives

understand the convolution layer and the architecture of conv-nets:
• its inductive bias
• its derivation from the fully connected layer
• different types of convolution

SLIDES 3-6

MLP and image data

we can apply an MLP to image data

image: https://medium.com/@rajatjain0807/machine-learning-6ecde3bfd2f4

softmax ∘ W^{L} ∘ … ∘ ReLU ∘ W^{1} vec(x)

first vectorize the input x → vec(x) ∈ R^784, feed it to the MLP (with L layers), and predict the labels

the model knows nothing about the image structure: we could shuffle all pixels and learn an MLP with similar performance

how do we bias the model so that it "knows" its input is an image? an image is like a 2D version of sequence data, so let's find the right model for sequences first...
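To make the pixel-shuffling point concrete, here is a minimal numpy sketch (mine, not from the slides): a fixed permutation of the input pixels can be absorbed into the first weight matrix, so the MLP's hypothesis class is unchanged.

import numpy as np

rng = np.random.default_rng(0)
D = 784                             # e.g., a flattened 28x28 image
x = rng.standard_normal(D)          # vec(x)
W1 = rng.standard_normal((128, D))  # first-layer weights

perm = rng.permutation(D)           # a fixed pixel shuffle
x_shuffled = x[perm]
W1_shuffled = W1[:, perm]           # permute the columns the same way

# identical pre-activations: the "shuffled" MLP computes the same function
assert np.allclose(W1 @ x, W1_shuffled @ x_shuffled)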

SLIDES 7-9

Parameter-sharing

suppose we want to convert one sequence to another, R^D → R^D

suppose we have a dataset of input-output pairs {(x^(n), y^(n))}_n
e.g., convert one voice to another

consider only a single layer: y = g(Wx)

[figure: the weight matrix W connecting the input sequence to the output sequence]

we may assume each output unit is the same function, shifted along the sequence
(when is this a good assumption?)

[figure: elements of W of the same color are tied together (parameter-sharing)]
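A small numpy sketch of this tying (my own, simplified to circular shifts, which the slide does not assume): every row of W is the same weight vector, shifted, and the resulting layer shifts its output whenever the input is shifted.

import numpy as np

rng = np.random.default_rng(0)
D = 6
w = rng.standard_normal(D)      # one shared set of weights
W = np.empty((D, D))
for c in range(D):
    W[c] = np.roll(w, c)        # row c is w shifted by c positions

x = rng.standard_normal(D)
y = W @ x                       # each y[c] is the same function of a shifted input
# shifting the input (circularly) shifts the output the same way
assert np.allclose(np.roll(y, 1), W @ np.roll(x, 1))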

SLIDES 10-13

Locality & sparse weight

we may assume each output unit is the same function, shifted along the sequence

we may further assume each output is a local function of the input; stacking multiple layers gives a larger receptive field

[figure: with one layer of a size-3 filter the receptive field is 3; with two layers it grows to 5]
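A tiny helper (my own, assuming stride 1 and no dilation) for the receptive-field claim: stacking L layers of size-K filters grows the receptive field by K − 1 per layer.

def receptive_field(num_layers: int, K: int = 3) -> int:
    return 1 + num_layers * (K - 1)

assert receptive_field(1) == 3   # one layer of a size-3 filter
assert receptive_field(2) == 5   # two layers, as in the figure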

SLIDES 14-20

Cross-correlation (1D)

we may further assume each output is a local function of the input

with parameter-sharing, W is very sparse: instead of the whole matrix we can keep the one set of nonzero values

w = [w_1, …, w_K] = [W_{c,c−⌊K/2⌋}, …, W_{c,c+⌊K/2⌋}]

we can then write the matrix multiplication as a cross-correlation of w and x:

y_c = g(∑_{d=1}^{D} W_{c,d} x_d) = g(∑_{k=1}^{K} w_k x_{c−⌊K/2⌋+k})

slide the filter along the input, calculate the inner product, and apply the nonlinearity

[figure: the banded weight matrix W mapping the input sequence to the output sequence]
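The equivalence is easy to verify in numpy (a sketch of mine, assuming 'valid' borders rather than the centered indexing above):

import numpy as np

rng = np.random.default_rng(0)
D, K = 8, 3
x = rng.standard_normal(D)
w = rng.standard_normal(K)

Dp = D - K + 1
W = np.zeros((Dp, D))
for c in range(Dp):
    W[c, c:c + K] = w    # banded matrix: row c holds w at columns c..c+K-1

assert np.allclose(W @ x, np.correlate(x, w, mode='valid'))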

SLIDES 21-29

Convolution (1D)

cross-correlation is similar to convolution

ignoring the activation (for simpler notation) and assuming w and x are zero for any index outside the input and filter bounds:

Cross-correlation:  y(c) = (w ⋆ x)(c) = ∑_{k=−∞}^{∞} w(k) x(c + k)

w is called the filter or kernel

Convolution:  y(c) = (w ∗ x)(c) = ∑_{k=−∞}^{∞} w(k) x(c − k)

convolution flips w or x (to be commutative):

(x ∗ w)(c) = ∑_{d=−∞}^{∞} w(c − d) x(d)   (change of variable)

[figure: plots of w, x, w ⋆ x, x ⋆ w, w ∗ x, and x ∗ w]

since we learn w, flipping it makes no difference; in practice, we use cross-correlation rather than convolution

convolution is equivariant wrt translation, i.e., shifting x shifts w ∗ x
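These relations can be checked directly in numpy (my own sketch):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(7)
w = rng.standard_normal(3)

# convolution is cross-correlation with a flipped filter...
assert np.allclose(np.convolve(x, w, mode='full'),
                   np.correlate(x, w[::-1], mode='full'))
# ...and convolution is commutative
assert np.allclose(np.convolve(x, w, mode='full'),
                   np.convolve(w, x, mode='full'))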

SLIDES 30-37

Convolution (2D)

a similar idea of parameter-sharing and locality extends to 2 dimensions (i.e., image data):

y_{d_1,d_2} = ∑_{k_1=1}^{K_1} ∑_{k_2=1}^{K_2} x_{d_1+k_1−1, d_2+k_2−1} w_{k_1,k_2}

image credit: Vincent Dumoulin, Francesco Visin
[figure: a 3x3 kernel w sliding over the input x to produce the output y; an interior pixel participates in all overlapping outputs, while a corner pixel participates in a single output — this is related to the borders]

there are different ways of handling the borders:
• no padding at all (valid): the output is smaller than the input (by how much?)
• zero-pad the input, to keep the output dims similar to the input (same)
• zero-pad the input, and produce all nonzero outputs (full): the output is larger than the input (by how much?); each input participates in the same number of output elements

output length (for one dimension): ⌊D + padding − K + 1⌋

SLIDES 38-44

Pooling

sometimes we would like to reduce the size of the output, e.g., from D x D to D/2 x D/2; a combination of pooling and downsampling is used:

• 1. calculate the output: ỹ_d = g(∑_{k=1}^{K} x_{d+k−1} w_k)
• 2. aggregate the output over different regions: y_d = pool{ỹ_d, …, ỹ_{d+p}}; two common aggregation functions are max and mean
• 3. often this is followed by subsampling using the same step size

pooling results in some degree of invariance to translation
[figure: after a left translation of the input, most max-pooled values are unchanged]

the same idea extends to higher dimensions
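A minimal 1D max-pooling sketch (mine; the window size and subsampling step are both p, matching step 3 above):

import numpy as np

def max_pool1d(y_tilde: np.ndarray, p: int) -> np.ndarray:
    Dp = len(y_tilde) // p                             # pooled length
    return y_tilde[:Dp * p].reshape(Dp, p).max(axis=1)

y_tilde = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
print(max_pool1d(y_tilde, 2))   # [3. 5. 4.]: length D=6 reduced to D/2=3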

SLIDES 45-47

Strided convolution

alternatively, we can directly subsample the output:

ỹ_d = g(∑_{k=1}^{K} x_{(d−1)+k} w_k),   y_d = ỹ_{p(d−1)+1}

[figure: ỹ_1, …, ỹ_5 computed at stride 1, then subsampled to y_1, y_2, y_3]

equivalent to computing only every p-th output directly:

y_d = g(∑_{k=1}^{K} x_{p(d−1)+k} w_k)
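The equivalence is easy to verify (my numpy sketch, stride p = 2, 0-indexed):

import numpy as np

rng = np.random.default_rng(0)
x, w, p = rng.standard_normal(9), rng.standard_normal(3), 2

y_tilde = np.correlate(x, w, mode='valid')   # all stride-1 outputs
y_sub = y_tilde[::p]                         # then subsample
y_strided = np.array([x[p * d:p * d + len(w)] @ w
                      for d in range(len(y_sub))])   # direct strided outputs
assert np.allclose(y_sub, y_strided)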

SLIDES 48-50

Strided convolution

the same idea extends to higher dimensions:

y_{d_1,d_2} = ∑_{k_1=1}^{K_1} ∑_{k_2=1}^{K_2} x_{p_1(d_1−1)+k_1, p_2(d_2−1)+k_2} w_{k_1,k_2}

different step sizes can be used for different dimensions

image: Dumoulin & Visin'16
[figures: a strided 2D convolution mapping the input to the output, without and with padding]

output length (for one dimension): ⌊(D + padding − K) / stride + 1⌋
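A tiny helper (mine; "padding" counts the total padding added across both ends of a dimension) implementing the output-length formula:

import math

def out_len(D: int, K: int, padding: int = 0, stride: int = 1) -> int:
    return math.floor((D + padding - K) / stride + 1)

assert out_len(5, 3) == 3              # 'valid', stride 1
assert out_len(5, 3, padding=2) == 5   # 'same': pad 1 on each side
assert out_len(7, 3, stride=2) == 3    # strided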

SLIDES 51-54

Channels

so far we assumed a single input and output sequence or image

with RGB data, we have M = 3 input channels: x ∈ R^{M×D_1×D_2}
(the figure's example has 2 input channels)

similarly, we can produce multiple output channels, e.g. M′ = 3: y ∈ R^{M′×D_1′×D_2′}

we have one K_1 × K_2 filter per input-output channel combination, w ∈ R^{M×M′×K_1×K_2}, and add the results of the convolutions from the different input channels

image: Dumoulin & Visin'16

SLIDES 55-58

Channels

we can also add a bias parameter b ∈ R^{M′}, one per output channel:

y_{m′,d_1,d_2} = g(∑_{m=1}^{M} ∑_{k_1} ∑_{k_2} w_{m,m′,k_1,k_2} x_{m,d_1+k_1−1,d_2+k_2−1} + b_{m′})

where x ∈ R^{M×D_1×D_2}, w ∈ R^{M×M′×K_1×K_2}, and y ∈ R^{M′×D_1′×D_2′}

image: https://cs231n.github.io/convolutional-networks/
[figure: an RGB input (M = 3) mapped to M′ output feature maps, with D_1, D_2, K_1, K_2 annotated]
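A direct (naive) numpy rendering of this formula, written for illustration; 'valid' borders, stride 1, and the activation g omitted:

import numpy as np

def conv2d_channels(x, w, b):
    M, D1, D2 = x.shape             # x: (M, D1, D2)
    M2, Mp, K1, K2 = w.shape        # w: (M, M', K1, K2)
    assert M == M2
    y = np.zeros((Mp, D1 - K1 + 1, D2 - K2 + 1))
    for mp in range(Mp):
        for d1 in range(y.shape[1]):
            for d2 in range(y.shape[2]):
                patch = x[:, d1:d1 + K1, d2:d2 + K2]      # (M, K1, K2)
                y[mp, d1, d2] = np.sum(w[:, mp] * patch) + b[mp]
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 5))        # M = 3, e.g. RGB
w = rng.standard_normal((3, 4, 3, 3))     # M' = 4 output channels, 3x3 filters
b = rng.standard_normal(4)
print(conv2d_channels(x, w, b).shape)     # (4, 3, 3)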

SLIDES 59-64

Convolutional Neural Network (CNN)

a CNN or conv-net is a neural network with convolutional layers (so it's a special type of MLP)

it can be applied to 1D sequence, 2D image, or 3D volumetric data

example: a conv-net architecture (derived from AlexNet) for image classification, ending in fully connected layers whose output size is the number of classes

visualization of the convolution kernels at the first layer (11x11x3x96): 96 filters, each 11x11x3; each is responsible for one of the 96 feature maps in the second layer

deeper units represent more abstract features
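A rough AlexNet-flavored sketch in PyTorch (my own simplification, not the exact architecture on the slide; it assumes a recent PyTorch with nn.LazyLinear): convolution + ReLU + pooling blocks followed by fully connected layers.

import torch.nn as nn

num_classes = 1000
convnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),   # 96 filters, each 11x11x3
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),                                 # to the fully connected layers
    nn.LazyLinear(4096),                          # infers the flattened size
    nn.ReLU(),
    nn.Linear(4096, num_classes),                 # number of classes
)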

SLIDES 65-67

Application: image classification

conv-nets have achieved super-human performance in image classification

ImageNet challenge: > 1M images, 1000 classes

image credit: He et al'15, https://semiengineering.com/new-vision-technologies-for-real-world-applications/

SLIDES 68-69

Application: image classification

a variety of increasingly deep architectures have been proposed

image credit: He et al'15, https://semiengineering.com/new-vision-technologies-for-real-world-applications/

SLIDES 70-78

Training: backpropagation through convolution

consider the strided 1D convolution op.

y_{m′,d} = ∑_m ∑_k w_{m,m′,k} x_{m,p(d−1)+k}

(m′ is the output channel index, m the input channel index, k the filter index, and p the stride)

using backprop, we have ∂J/∂y_{m′,d′} so far, and we need:

1) ∂y_{m′,d′}/∂w_{m,m′,k} = x_{m,p(d′−1)+k}, so as to get the gradients

   ∂J/∂w_{m,m′,k} = ∑_{d′} (∂J/∂y_{m′,d′}) (∂y_{m′,d′}/∂w_{m,m′,k})

2) ∂y_{m′,d′}/∂x_{m,d} = ∑_k w_{m,m′,k} such that p(d′ − 1) + k = d, to backpropagate to the previous layer:

   ∂J/∂x_{m,d} = ∑_{d′,m′} (∂J/∂y_{m′,d′}) (∂y_{m′,d′}/∂x_{m,d})

this operation is similar to multiplication by the transpose of the parameter-sharing matrix (transposed convolution)

SLIDES 79-82

Naive implementation

consider the 1D convolution op. with stride 1 and single input and output channels:

y_d = ∑_k w_k x_{d+k−1}

in practice, the most efficient implementation depends on the filter size (e.g., using FFT for large filters)

forward pass:

import numpy as np

def Conv1D(
    x,  # D (length)
    w,  # K (filter length)
):
    D, = x.shape
    K, = w.shape
    Dp = D - K + 1  # output length
    y = np.zeros(Dp)
    for dp in range(Dp):
        y[dp] = np.sum(x[dp:dp+K] * w)
    return y

backward pass:

def Conv1DBackProp(
    x,     # D (length)
    w,     # K
    dJdy,  # Dp: error from the layer above
):
    D, = x.shape
    K, = w.shape
    Dp, = dJdy.shape
    dw = np.zeros_like(w)
    dJdx = np.zeros_like(x)
    for dp in range(Dp):
        dw += dJdy[dp] * x[dp:dp+K]     # accumulate the shared filter's gradient
        dJdx[dp:dp+K] += dJdy[dp] * w   # scatter the error back onto the input window
    return dJdx, dw  # error to the layer below, and the weight gradient
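A finite-difference check (my own) that Conv1DBackProp matches numerical gradients of J = sum(Conv1D(x, w)):

rng = np.random.default_rng(0)
x, w, eps = rng.standard_normal(6), rng.standard_normal(3), 1e-6
dJdy = np.ones(len(x) - len(w) + 1)     # J = sum(y)  =>  dJ/dy = 1
dJdx, dw = Conv1DBackProp(x, w, dJdy)

e0 = np.eye(3)[0]                       # perturb only w[0]
num = (np.sum(Conv1D(x, w + eps * e0)) - np.sum(Conv1D(x, w - eps * e0))) / (2 * eps)
assert np.isclose(dw[0], num)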

SLIDES 83-87

Transposed Convolution

transposed convolution (aka deconvolution) recovers the shape of the original input

• convolution with no stride and its transpose: no padding in the original convolution corresponds to full padding in the transposed version, and full padding in the original corresponds to no padding in the transposed version
• convolution with stride also has a transpose

this can be used for up-sampling (the opposite of stride/pooling); as expected, the transpose of a transposed convolution is the original convolution

image: Dumoulin & Visin'16
[figures: convolutions and their transposes, with and without stride and padding]
slide-88
SLIDE 88

Dilated Convolution Dilated Convolution

Dilated (aka atrous) convolution

image credits: Kalchbrenner et al'17, Dumoulin & Visin'16

7 . 4

slide-89
SLIDE 89

Dilated Convolution Dilated Convolution

Dilated (aka atrous) convolution this can be used to create exponentially large receptive field in few layers

image credits: Kalchbrenner et al'17, Dumoulin & Visin'16

7 . 4

slide-90
SLIDE 90

Dilated Convolution Dilated Convolution

Dilated (aka atrous) convolution this can be used to create exponentially large receptive field in few layers

dilation = 1 (i.e., no dilation), size of receptive field = 3

image credits: Kalchbrenner et al'17, Dumoulin & Visin'16

7 . 4

slide-91
SLIDE 91

Dilated Convolution Dilated Convolution

Dilated (aka atrous) convolution this can be used to create exponentially large receptive field in few layers

dilation = 1 (i.e., no dilation), size of receptive field = 3 dilation = 2, size of receptive field = 7

image credits: Kalchbrenner et al'17, Dumoulin & Visin'16

7 . 4

slide-92
SLIDE 92

Dilated Convolution Dilated Convolution

Dilated (aka atrous) convolution this can be used to create exponentially large receptive field in few layers

dilation = 1 (i.e., no dilation), size of receptive field = 3 dilation = 2, size of receptive field = 7 dilation = 4, size of receptive field = 15

image credits: Kalchbrenner et al'17, Dumoulin & Visin'16

7 . 4

slide-93
SLIDE 93

Dilated Convolution Dilated Convolution

Dilated (aka atrous) convolution this can be used to create exponentially large receptive field in few layers

dilation = 1 (i.e., no dilation), size of receptive field = 3 dilation = 2, size of receptive field = 7 dilation = 4, size of receptive field = 15 dilation = 8, size of receptive field = 31

image credits: Kalchbrenner et al'17, Dumoulin & Visin'16

7 . 4

slide-94
SLIDE 94

Dilated Convolution Dilated Convolution

Dilated (aka atrous) convolution this can be used to create exponentially large receptive field in few layers

dilation = 1 (i.e., no dilation), size of receptive field = 3 dilation = 2, size of receptive field = 7 dilation = 4, size of receptive field = 15 dilation = 8, size of receptive field = 31

image credits: Kalchbrenner et al'17, Dumoulin & Visin'16

in contrast to stride, dilation does not lose resolution

⌊ +

stride D+padding−dilation×(K−1)−1

1⌋

  • utput length (for one dimension)

7 . 4

slide-95
SLIDE 95

Winter 2020 | Applied Machine Learning (COMP551)

Dilated Convolution Dilated Convolution

Dilated (aka atrous) convolution this can be used to create exponentially large receptive field in few layers

dilation = 1 (i.e., no dilation), size of receptive field = 3 dilation = 2, size of receptive field = 7 dilation = 4, size of receptive field = 15 dilation = 8, size of receptive field = 31

image credits: Kalchbrenner et al'17, Dumoulin & Visin'16

torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros') 1

in contrast to stride, dilation does not lose resolution

⌊ +

stride D+padding−dilation×(K−1)−1

1⌋

  • utput length (for one dimension)

7 . 4
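A small check (mine) of the receptive-field sizes listed above: with K = 3, stacking layers with dilations 1, 2, 4, 8 grows the receptive field by dilation × (K − 1) per layer.

K = 3
rf = 1
for dilation in [1, 2, 4, 8]:
    rf += dilation * (K - 1)
    print(dilation, rf)   # 1 3, 2 7, 4 15, 8 31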

SLIDES 96-100

Structured Prediction

the output itself may have (image) structure (e.g., predicting text, audio, or images)

example: in (semantic) segmentation we classify each pixel; the loss is the sum of the cross-entropy losses across the whole image

a variety of architectures exist... one that performs well is U-Net; transposed convolution (upconv), concatenation, and skip connections are common in architecture design

architecture search (i.e., combinatorial hyper-parameter search) is an expensive process and an active research area

image: https://sthalles.github.io/deep_segmentation_network/
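A toy U-Net-style module in PyTorch (my own drastic simplification of the slide's idea): one downsampling stage, one upconv stage, and a skip connection by concatenation, producing per-pixel class scores.

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch_in=3, ch=16, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(ch_in, ch, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1)  # downsample
        self.up = nn.ConvTranspose2d(2 * ch, ch, 2, stride=2)      # upconv
        self.head = nn.Conv2d(2 * ch, num_classes, 1)  # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)
        d = self.up(self.down(e))
        d = torch.cat([d, e], dim=1)   # skip connection: concatenate features
        return self.head(d)            # (N, num_classes, H, W)

print(TinyUNet()(torch.zeros(1, 3, 32, 32)).shape)  # torch.Size([1, 2, 32, 32])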

SLIDES 101-103

Summary

the convolution layer introduces an inductive bias to the MLP:
• equivariance as an inductive bias: a translation of the same model is applied to produce different outputs (pixels)
• the layer is equivariant to translation, achieved through parameter-sharing

conv-nets use combinations of:
• convolution layers
• ReLU (or similar) activations
• pooling and/or stride for down-sampling
• skip-connections and/or batch-norm to help with optimization / regularization
• potentially fully connected layers at the end

training:
• backpropagation (similar to MLP)
• SGD or its improved variations with adaptive learning rates
• monitor the validation error for early stopping