Lecture 18: Concluding Convolutional Neural Networks, Graphical Models as Foundation for Recurrent Neural Networks and Bayesian Networks
Reference: We will be referring to sections, etc. of 'Deep Learning' by Ian Goodfellow, Yoshua Bengio and Aaron Courville
https://youtu.be/4PtaZVUbilI?list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa&t=1187

The Lego Blocks in Modern Deep Learning
1. Depth/Feature Map [Eg: Red, Green and Blue feature maps]
2. Patches/Filters (provide for spatial interpolations)
3. Non-linear Activation unit (provides for detection/classification)
4. Strides (enable downsampling)
5. Padding (shrinking across layers)
6. Pooling (non-linear downsampling)
7. Inception [Optional: Extra slides]
8. RNN, Attention and LSTM (Backpropagation through time and Memory cell) [Optional: Extra slides]
9. Embeddings (Unsupervised learning) [Optional: Extra slides]

Convolution: Sparse Interactions through Filters K(.) (for Single Feature Map)
[Figure: inputs $x_1, \dots, x_5$ in the input/$(l-1)$th layer connected to units $h_1, \dots, h_5$ in the $l$th layer; each edge from $x_m$ to $h_i$ carries its own weight $w^l_{mi}$, and only edges with $|m - i| \le 1$ are drawn.]
$h_i = \sum_m x_m \, w_{mi} \, K(i - m)$
In the figure, $K(i - m) = 1$ iff $|m - i| \le 1$.
For 2-D inputs (such as images): $h_{ij} = \sum_m \sum_n x_{mn} \, w_{ij,mn} \, K(i - m, j - n)$
Intuition: Neighboring signals $x_m$ (or pixels $x_{mn}$) are more relevant than ones further away; this reduces prediction time.
Can be viewed as multiplication with a Toeplitz matrix $K$ (which has each row as the row above shifted by one element).
Further, $K$ is sparse with respect to a parameter $\theta$ (eg: $K(i - m) = 1$ iff $|m - i| \le \theta$).
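The 1-D formula above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `sparse_layer` and `active_connections`, the 0-based indexing and the all-ones weight table are hypothetical choices, not part of the lecture.

```python
def sparse_layer(x, w, theta=1):
    """Compute h_i = sum_m x_m * w[m][i] * K(i - m), with K(i - m) = 1 iff |m - i| <= theta."""
    n = len(x)
    h = []
    for i in range(n):
        total = 0.0
        for m in range(n):
            k = 1 if abs(m - i) <= theta else 0  # sparsity mask K(i - m)
            total += x[m] * w[m][i] * k
        h.append(total)
    return h


def active_connections(n, theta=1):
    """Count the (m, i) pairs that the mask K leaves switched on."""
    return sum(1 for i in range(n) for m in range(n) if abs(m - i) <= theta)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
# Hypothetical weights: every w[m][i] is 1, so each h_i is a plain neighborhood sum.
w = [[1.0] * 5 for _ in range(5)]

print(sparse_layer(x, w))        # [3.0, 6.0, 9.0, 12.0, 9.0]
print(active_connections(5), "of", 5 * 5, "possible connections are used")  # 13 of 25
```

With five inputs and $\theta = 1$, only 13 of the 25 possible connections contribute, which is the source of the reduced prediction time mentioned on the slide.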

Convolution: Shared parameters and Patches (for Single Feature Map)
[Figure: the same two five-unit layers, now with only three shared weights: every $h_i$ uses $w^l_0$ for $x_i$ and $w^l_1$, $w^l_{-1}$ for its two neighbors.]
$h_i = \sum_m x_m \, w_{i-m} \, K(i - m)$
In the figure, $K(i - m) = 1$ iff $|m - i| \le 1$.
For 2-D inputs (such as images): $h_{ij} = \sum_m \sum_n x_{mn} \, w_{i-m,\,j-n} \, K(i - m, j - n)$
Intuition: Neighboring signals $x_m$ (or pixels $x_{mn}$) affect the output in a similar way irrespective of location (i.e., value of $m$ or $n$).
More Intuition: Corresponds to moving patches around the image.
Further reduces storage requirement; does not affect prediction time.
Further, $K$ is often sparse (eg: $K(i - m) = 1$ iff $|m - i| \le \theta$).
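A minimal sketch of the shared-parameter formula, together with the Toeplitz-matrix view of the same operation. The names `shared_conv1d` and `toeplitz_matrix`, the dictionary keyed by the offset $i - m$ and the particular weight values are assumptions made for illustration only.

```python
def shared_conv1d(x, w, theta=1):
    """Compute h_i = sum_m x_m * w[i - m] * K(i - m); w maps the offset (i - m) to a shared weight."""
    n = len(x)
    h = []
    for i in range(n):
        total = 0.0
        # Only m with |m - i| <= theta contribute, i.e. where K(i - m) = 1.
        for m in range(max(0, i - theta), min(n, i + theta + 1)):
            total += x[m] * w[i - m]
        h.append(total)
    return h


def toeplitz_matrix(n, w, theta=1):
    """Row i holds w[i - m] in column m (0 outside the band): each row is the row above shifted by one."""
    return [[w[i - m] if abs(i - m) <= theta else 0.0 for m in range(n)]
            for i in range(n)]


x = [1.0, 2.0, 4.0, 7.0, 11.0]
w = {-1: 1.0, 0: 0.0, 1: -1.0}  # hypothetical shared weights: h_i = x_{i+1} - x_{i-1}

h = shared_conv1d(x, w)
print(h)  # [2.0, 3.0, 5.0, 7.0, -7.0]

# The same result as a matrix-vector product with the banded Toeplitz matrix.
T = toeplitz_matrix(len(x), w)
h_matrix = [sum(T[i][m] * x[m] for m in range(len(x))) for i in range(len(x))]
print(h_matrix == h)  # True
```

Only $2\theta + 1 = 3$ weights are stored here regardless of the input length, while the amount of arithmetic per output is unchanged, matching the slide's remark that sharing reduces storage but not prediction time.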

Convolution: Strides and Padding (for Single Feature Map)
[Figure: the same shared-weight layers, inputs $x_1, \dots, x_5$ and units $h_1, \dots, h_5$ with weights $w^l_{-1}$, $w^l_0$, $w^l_1$.]
Consider only $h_i$'s where $i$ is a multiple of $s$.
Intuition: Stride of $s$ corresponds to moving the patch by $s$ steps at a time.
More Intuition: Stride of $s$ corresponds to downsampling by $s$.
What to do at the corners? Ans: Pad with 0's at the edges to create an output of the same size as the input (same padding), or have no padding at all and let the next layer have fewer nodes (valid).
Reduces storage requirement as well as prediction time.
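The effect of stride and zero padding on the output length can be sketched as follows. The function name `conv1d`, the list-based filter and the all-ones weights are hypothetical; the filter is applied without flipping, which is the usual convention in CNN layers and is assumed here.

```python
def conv1d(x, w, stride=1, pad=0):
    """Slide a filter w of length P over x, with `pad` zeros added on each side."""
    p = len(w)
    padded = [0.0] * pad + list(x) + [0.0] * pad
    out = []
    # Move the patch `stride` steps at a time; stop when it no longer fits.
    for start in range(0, len(padded) - p + 1, stride):
        out.append(sum(padded[start + k] * w[k] for k in range(p)))
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0, 1.0, 1.0]  # hypothetical filter: sum of each 3-wide patch

print(conv1d(x, w))                    # valid: [6.0, 9.0, 12.0]
print(conv1d(x, w, pad=1))             # same:  [3.0, 6.0, 9.0, 12.0, 9.0]
print(conv1d(x, w, stride=2, pad=1))   # same padding, stride 2: [3.0, 9.0, 9.0]
```

With no padding the five inputs shrink to three outputs, with one zero on each side the output keeps the input length, and a stride of 2 keeps roughly every second output.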

Examples of Convolutional Filters: Guess what each does

Sobel Vertical edge detector:
+1  0 -1
+2  0 -2
+1  0 -1

Sobel Horizontal edge detector:
+1 +2 +1
 0  0  0
-1 -2 -1

Image blurring filter:
1/9 1/9 1/9
1/9 1/9 1/9
1/9 1/9 1/9

Image sharpening filter:
 0 -1  0
-1  3 -1
 0 -1  0

Illustration at https://www.saama.com/blog/different-kinds-convolutional-filters/
In CNNs, these filters [5] (i.e. weights $w_{i-m,\,j-n}$) are generally learnt from the data.
Filter size ⇒ Strong prior, Filter value ⇒ Posterior
[5] Also referred to as kernels, but not to be confused with the positive semi-definite kernel.
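To see what the two Sobel filters respond to, here is a small sketch that slides a 3 × 3 filter over a toy image. The helper `filter2d` and the 4 × 4 test image are hypothetical; the filter is applied without flipping and only at positions where it fits entirely inside the image (no padding).

```python
def filter2d(image, kernel):
    """Sum of elementwise products of the kernel with each image patch (valid positions only)."""
    rows, cols = len(image), len(image[0])
    kr, kc = len(kernel), len(kernel[0])
    out = []
    for i in range(rows - kr + 1):
        row = []
        for j in range(cols - kc + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kr) for b in range(kc)))
        out.append(row)
    return out


SOBEL_VERTICAL = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
SOBEL_HORIZONTAL = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]

# Hypothetical test image: dark left half (0), bright right half (10), i.e. one vertical edge.
image = [[0, 0, 10, 10] for _ in range(4)]

print(filter2d(image, SOBEL_VERTICAL))    # [[-40, -40], [-40, -40]]
print(filter2d(image, SOBEL_HORIZONTAL))  # [[0, 0], [0, 0]]
```

The vertical-edge filter gives a large-magnitude response wherever its window straddles the edge, while the horizontal-edge filter gives zero because every row of this image is identical.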

The Convolutional Filter


Question: MLP Vs CNN
Convolution leverages three important ideas that can help improve a machine learning system: (a) sparse interactions, (b) parameter sharing and (c) equivariant representations: $f(g(x)) = g(f(x))$ when $f$ is convolution and $g$ is a shift function. We just saw these in action:
Input Image Size: 200 × 200 × 3
MLP: Hidden layer has 40k neurons, resulting in 4.8 billion parameters.
CNN: Say, hidden layer has 20 feature-maps each of size 5 × 5 × 3 with stride = 1 and zero padding of 4 on each side, i.e., maximum overlapping of convolution windows.
A feature map corresponds to one set of weights $w^l_{ij}$. $F$ feature maps ⇒ $F$ times the number of weight parameters.
Question: How many parameters? Answer:
Question: How many neurons (location specific)? Answer:
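The equivariance property $f(g(x)) = g(f(x))$ stated above can be checked numerically. The sketch below assumes a circular (wrap-around) boundary so that the identity holds exactly at every position; with zero padding it holds only away from the borders. The names `circular_conv` and `shift` and the weight values are hypothetical.

```python
def circular_conv(x, w):
    """f: h_i = sum over offsets d = i - m of x[(i - d) mod n] * w[d], with wrap-around at the ends."""
    n = len(x)
    return [sum(x[(i - d) % n] * wd for d, wd in w.items()) for i in range(n)]


def shift(x, k=1):
    """g: circularly shift the signal k positions to the right."""
    n = len(x)
    return [x[(i - k) % n] for i in range(n)]


x = [1, 2, 4, 7, 11, 16]
w = {-1: 1, 0: -2, 1: 1}  # hypothetical shared weights

conv_then_shift = shift(circular_conv(x, w))
shift_then_conv = circular_conv(shift(x), w)
print(conv_then_shift == shift_then_conv)  # True
```

The check succeeds because the same weights are used at every location, so shifting the input merely shifts the pattern of responses.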

Answer: MLP Vs CNN
MLP: Hidden layer has 40k neurons, so it has 200 × 200 × 3 × 40,000 = 4.8 billion parameters.
CNN: Hidden layer has 20 feature-maps each of size 5 × 5 × 3 with stride = 1, and zero padding of 4 on each side, i.e., maximum overlapping of convolution windows.
Question: How many parameters? Answer: Just 20 × 5 × 5 × 3 = 1500.
Question: How many neurons (location specific)? Let $M \times N \times 3$ be the dimension of the image and $P \times Q \times 3$ be the dimension of the filter for convolution. Let $D$ be the number of zero paddings and $s$ be the stride length.
Answer: Output size $= \left(\frac{M - P + 2D}{s} + 1\right) \times \left(\frac{N - Q + 2D}{s} + 1\right)$.
In the current case, $D = P - 1$ ⇒ Output size $= \left(\frac{M + P - 2}{s} + 1\right) \times \left(\frac{N + Q - 2}{s} + 1\right)$, which for $s = 1$ is $(M + P - 1) \times (N + Q - 1)$.
$20 \times (200 + 5 - 1) \times (200 + 5 - 1) = 832320$ (around 830 thousand which can increase with max-pooling).
If $D = (P - 1)/2$ and $s = 1$,
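The counts on this slide can be reproduced with a few lines of arithmetic. This is a sketch under assumptions: bias terms are ignored, the helper name `conv_output_size` is hypothetical, and integer (floor) division is used so that strides which do not divide evenly still give a whole number of outputs.

```python
def conv_output_size(m, p, d, s):
    """Output length along one dimension: (M - P + 2D) / s + 1, using floor division."""
    return (m - p + 2 * d) // s + 1


M, N, C = 200, 200, 3      # input image: 200 x 200 x 3
P, Q = 5, 5                # filter: 5 x 5 x 3
F = 20                     # number of feature maps
HIDDEN = 40_000            # MLP hidden-layer neurons

mlp_params = (M * N * C) * HIDDEN      # every input connects to every hidden neuron
cnn_params = F * P * Q * C             # one 5 x 5 x 3 filter per feature map

out_h = conv_output_size(M, P, d=4, s=1)
out_w = conv_output_size(N, Q, d=4, s=1)
cnn_neurons = F * out_h * out_w

print(mlp_params)     # 4800000000
print(cnn_params)     # 1500
print(out_h, out_w)   # 204 204
print(cnn_neurons)    # 832320

# With D = (P - 1) / 2 = 2 and s = 1, each dimension keeps its input length.
print(conv_output_size(M, P, d=2, s=1))  # 200
```

The convolutional layer needs about three million times fewer weights than the fully connected one, even though it produces more location-specific outputs.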
