  1. Convolutional Neural Networks Basics Praveen Krishnan

  2. Overview
  • Paradigm Shift
  • Simple Network
  • Convolutional Network
  • Layers
  • Case Study 1: AlexNet
  • Training
  • Generalization
  • Visualizations
  • Transfer Learning
  • Case Study 2: JAZ Net
  • Practical Aspects
    - Gradient checks
    - Data
    - GPU Coding/Libraries

  3. Paradigm Shift
  • Traditional pipeline: feature extraction (SIFT, HoG, ...) → coding → pooling → classifier.
  • Learned pipeline: feature learning (CNN, RBM, ...) → classifier.
  [Diagram: hierarchical decomposition of an image (a sparrow) across layers L1, L2, L3, L4.]

  4. A simple network
  [Diagram: a chain x_0 → f_1 → x_1 → f_2 → ... → f_{n-1} → x_{n-1} → f_n → x_n, with parameters w_1, w_2, ..., w_n.]
  Each output x_j depends on the previous output x_{j-1} through a function f_j with parameters w_j, i.e. x_j = f_j(x_{j-1}; w_j).
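
A minimal sketch of this chain in code (the functions and weight shapes below are hypothetical, just to make the composition concrete):

```python
import numpy as np

def forward(x0, functions, weights):
    """Compose the chain x_j = f_j(x_{j-1}; w_j) for j = 1..n."""
    x = x0
    for f, w in zip(functions, weights):
        x = f(x, w)
    return x

# Hypothetical example: an affine + ReLU layer followed by an affine layer.
f1 = lambda x, w: np.maximum(0.0, x @ w)
f2 = lambda x, w: x @ w
w1 = np.random.randn(4, 8)
w2 = np.random.randn(8, 3)
x_n = forward(np.random.randn(1, 4), [f1, f2], [w1, w2])
print(x_n.shape)  # (1, 3)
```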

  5. Feed forward neural network
  [Diagram: input units x_01, ..., x_0d connected through weight matrices W_1, ..., W_n to output units x_n1, ..., x_nc; "zooming in" on a single unit.]

  6. Feed forward neural network
  [Diagram: the same network, with its output z fed into a LOSS layer that compares it to a one-hot target y = [0, 0, …, 1, …, 0].]

  7. Feed forward neural network
  [Diagram: gradients of the loss z flowing backward through W_n, ..., W_1.]
  Weight updates are computed using back-propagation of gradients.

  8. Convolutional Network
  Fully connected layer (200x200x3 input):
  • #Hidden units: 120,000
  • #Params: 12 billion
  • Needs huge training data to prevent over-fitting!
  Locally connected layer (3x3x3 receptive fields):
  • #Hidden units: 120,000
  • #Params: 1.08 million
  • Useful when the image is highly registered.

  9. Convolutional Network
  Convolutional layer (a single 3x3x3 filter shared across all locations):
  • #Hidden units: 120,000
  • #Params: 27
  • #Feature maps: 1
  • Exploits the stationarity property of images.

  10. Convolutional Network
  [Diagram: a 3x3 receptive field sliding over a 200x200 input, producing several feature maps.]
  • Use of multiple feature maps.
  • Sharing of parameters.
  • Exploits stationarity of statistics.
  • Preserves locality of pixel dependencies.

  11. Convolutional Network
  Input image size: W1 x H1 x D1 (e.g. 200x200x3)
  Receptive field size: F x F
  #Feature maps: K
  Q. Find the output dimensions W2, H2 and D2.

  12. Convolutional Network
  Input image size: W1 x H1 x D1 (e.g. 200x200x3); receptive field size: F x F; #feature maps: K; stride: S.
  W2 = (W1 - F)/S + 1
  H2 = (H1 - F)/S + 1
  D2 = K
  It is also better to zero-pad the input so that the spatial size is preserved.
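
A small helper (illustrative, not from the deck) that applies these formulas, with an optional zero-padding term 2P added; setting P = 0 recovers the slide's formulas, and P = (F-1)/2 with stride 1 preserves the spatial size:

```python
def conv_output_size(W1, H1, D1, F, K, S=1, P=0):
    """W2 = (W1 - F + 2P)/S + 1, H2 = (H1 - F + 2P)/S + 1, D2 = K."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K  # one output channel per feature map
    return W2, H2, D2

print(conv_output_size(200, 200, 3, F=3, K=16))        # (198, 198, 16) without padding
print(conv_output_size(200, 200, 3, F=3, K=16, P=1))   # (200, 200, 16) with zero padding
```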

  13. Convolutional Layer
  [Diagram: input feature maps x_1^(n-1), x_2^(n-1), x_3^(n-1) pass through a conv. layer to produce output maps y_1^(n), y_2^(n).]
  Each output map has the form y_k^(n) = f( Σ_{i=1..F} w_{ik}^(n) * x_i^(n-1) + b_k^(n) ), where "f" is a non-linear activation function, F is the number of input feature maps, n is the layer index, and "*" represents convolution/correlation.
  Q. Is there a difference between correlation and convolution in a learned network?
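
On the question: since the filter weights are learned, it makes no practical difference whether the layer implements correlation or convolution; a learned "convolution" kernel is simply the flipped correlation kernel. A quick check with SciPy (not part of the slides):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

x = np.random.randn(5, 5)   # a toy input map
w = np.random.randn(3, 3)   # a toy kernel

corr = correlate2d(x, w, mode='valid')
conv = convolve2d(x, np.flip(w), mode='valid')  # convolution with the flipped kernel

print(np.allclose(corr, conv))  # True: the two differ only by a flip of the kernel
```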

  14. Activation Functions: sigmoid, tanh, ReLU, maxout, Leaky ReLU.
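
Rough NumPy sketches of these activations (maxout is shown with two linear pieces; the 0.01 slope for Leaky ReLU is a common default, not specified on the slide):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def maxout(x, W1, b1, W2, b2):
    # Maxout over two learned linear pieces: max(xW1 + b1, xW2 + b2).
    return np.maximum(x @ W1 + b1, x @ W2 + b2)
```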

  15. A Typical Supervised CNN Architecture
  A typical deep convolutional network stacks layers such as: CONV → NORM → POOL → CONV → NORM → POOL → FC → SOFTMAX.
  Other layers:
  • Pooling
  • Normalization
  • Fully connected
  • etc.

  16. Pooling Layer
  Example: pool size 2x2, stride 2, type: max.
  Input:        Max-pooled output:
  2 8 9 4       8 9
  3 6 5 7       5 7
  3 1 6 4
  2 5 7 3
  • Aggregation over space or feature type.
  • Gives invariance to image transformations and increases compactness of the representation.
  • Pooling types: Max, Average, L2, etc.
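
A minimal NumPy version of 2x2, stride-2 max pooling, checked on the slide's example:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    H, W = x.shape
    out = np.empty(((H - size) // stride + 1, (W - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[2, 8, 9, 4],
              [3, 6, 5, 7],
              [3, 1, 6, 4],
              [2, 5, 7, 3]])
print(max_pool(x))  # [[8. 9.]
                    #  [5. 7.]]
```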

  17. Normalization
  • Local contrast normalization (Jarrett et al., ICCV'09)
    - reduces illumination artifacts.
    - performs local subtractive and divisive normalization.
  • Local response normalization (Krizhevsky et al., NIPS'12)
    - a form of lateral inhibition across channels.
  • Batch normalization (more later)

  18. Fully connected
  • Multi-layer perceptron.
  • Plays the role of a classifier.
  • Generally used in the final layers to classify the object, represented in terms of discriminative parts and higher-level semantic entities.

  19. Case Study: AlexNet
  • Winner of ImageNet LSVRC-2012.
  • Trained on 1.2M images using SGD with regularization.
  • Deep architecture (60M parameters).
  • Optimized GPU implementation (cuda-convnet).
  Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012. (Cited by 11,915.)

  20. Case Study: AlexNet
  CONV 11x11x96 → LRN → MAX POOL 2x2 → CONV 5x5x256 → LRN → MAX POOL 2x2 → CONV 3x3x384 → MAX POOL 2x2 → CONV 3x3x384 → MAX POOL 2x2 → CONV 3x3x256 → MAX POOL 2x2 → FC-4096 → FC-4096 → SOFTMAX-1000
  Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012.
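
As a quick sanity check on the "60M parameters" figure (assuming torchvision is available; its AlexNet variant differs slightly from the original cuda-convnet model):

```python
from torchvision import models

model = models.alexnet()  # randomly initialized AlexNet-style network
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # roughly 61M
```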

  21. Training
  • Learning: minimizing the loss function (incl. regularization) w.r.t. the parameters of the network (the filter weights).
  • Mini-batch stochastic gradient descent:
    - Sample a batch of data.
    - Forward propagation.
    - Backward propagation.
    - Parameter update.
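
The four steps above, written out as a runnable toy example: mini-batch SGD for a single soft-max (linear) layer on synthetic data, standing in for the full network (all sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))      # 1000 samples, 20 features
y = rng.integers(0, 5, size=1000)        # 5 classes
W = 0.01 * rng.standard_normal((20, 5))  # the parameters
lr, batch_size = 0.1, 64

for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)        # 1. sample a batch
    xb, yb = X[idx], y[idx]
    scores = xb @ W                                       # 2. forward propagation
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(batch_size), yb]).mean()
    dscores = (p - np.eye(5)[yb]) / batch_size            # 3. backward propagation
    dW = xb.T @ dscores
    W -= lr * dW                                          # 4. parameter update
    if step % 100 == 0:
        print(step, round(loss, 3))
```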

  22. Training
  • Back propagation
  Consider a layer f with parameters w, and let z be the scalar loss computed from the loss function h. The derivative of the loss w.r.t. the parameters is given by a recursive equation that is applicable to each layer.
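
The recursive equation referred to here is just the chain rule applied layer by layer; in standard form (reproduced here since the slide's formula is an image), for a layer x_l = f_l(x_{l-1}; w_l) and scalar loss z:

```latex
\frac{\partial z}{\partial w_l}
  = \frac{\partial z}{\partial x_l}\,
    \frac{\partial f_l(x_{l-1}; w_l)}{\partial w_l},
\qquad
\frac{\partial z}{\partial x_{l-1}}
  = \frac{\partial z}{\partial x_l}\,
    \frac{\partial f_l(x_{l-1}; w_l)}{\partial x_{l-1}}.
```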

  23. Training
  • Parameter update
  • Stochastic gradient descent: here η is the learning rate and θ is the set of all parameters.
  • Stochastic gradient descent with momentum (more in the coming slides…).
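
The update rules themselves are images on the slide; their standard forms (with μ denoting the momentum coefficient, a symbol introduced here) are:

```latex
% Vanilla SGD:
\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(\theta)
% SGD with momentum (velocity v, momentum coefficient \mu):
v \leftarrow \mu v - \eta \, \nabla_{\theta} L(\theta), \qquad
\theta \leftarrow \theta + v
```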

  24. Training
  • Loss functions: measure the compatibility between the prediction and the ground truth.
  • One vs. rest classification: soft-max classifier (cross-entropy loss).
  • Derivative w.r.t. x_i (proof?).
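
For reference (the equations on the slide are images), the soft-max cross-entropy loss for scores x with correct class y, and its well-known derivative, are:

```latex
p_i = \frac{e^{x_i}}{\sum_j e^{x_j}}, \qquad
L = -\log p_y, \qquad
\frac{\partial L}{\partial x_i} = p_i - \mathbb{1}[i = y].
```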

  25. Training
  • Loss functions
  • One vs. rest classification: hinge loss.
  • Hinge loss is a convex function but not differentiable; a sub-gradient exists, however.
  • Sub-gradient w.r.t. x_i.
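
In the standard multi-class form (with margin Δ, commonly 1), the hinge loss and one valid sub-gradient are:

```latex
L = \sum_{i \neq y} \max\!\bigl(0,\; x_i - x_y + \Delta\bigr), \qquad
\frac{\partial L}{\partial x_i} = \mathbb{1}\!\bigl[x_i - x_y + \Delta > 0\bigr] \;\; (i \neq y), \qquad
\frac{\partial L}{\partial x_y} = -\!\!\sum_{i \neq y} \mathbb{1}\!\bigl[x_i - x_y + \Delta > 0\bigr].
```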

  26. Training
  • Loss functions
  • Regression: Euclidean loss / squared loss.
  • Derivative w.r.t. x_i.
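
The squared loss and its derivative, in the conventional form with a 1/2 factor that keeps the gradient clean:

```latex
L = \tfrac{1}{2}\,\lVert x - y \rVert_2^2 = \tfrac{1}{2}\sum_i (x_i - y_i)^2, \qquad
\frac{\partial L}{\partial x_i} = x_i - y_i.
```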

  27. Training
  • Visualization of the loss function: typically viewed as a highly non-convex function, but more recently it is believed to have smoother surfaces, though with many saddle regions!
  [Plot: loss vs. θ, annotated with the initialization, step direction, step size/learning rate and momentum.]

  28. Training
  • Momentum
    - Better convergence rates.
    - Physical perspective: affects the velocity of the update.
    - Higher velocity in the consistent direction of the gradient.
  • Momentum update: the velocity is updated from the gradient, and the position (the parameters) is updated from the velocity.

  29. Training
  • Learning rate (η)
    - Controls the kinetic energy of the updates.
    - It is important to know when to decay η.
  • Common methods (annealing):
    - Step decay
    - Exponential / log-space decay
    - Manual
  • Adaptive learning methods:
    - Adagrad
    - RMSprop
  Figure courtesy: Fei-Fei et al., cs231n
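
Two of the annealing schedules mentioned above, as small illustrative helpers (the decay constants are arbitrary examples, not values from the talk):

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Multiply the learning rate by `drop` every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def exp_decay(lr0, epoch, k=0.05):
    """Exponential decay: lr = lr0 * exp(-k * epoch)."""
    return lr0 * math.exp(-k * epoch)

print(step_decay(0.1, epoch=25))  # 0.025
print(exp_decay(0.1, epoch=25))   # ~0.0287
```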

  30. Training
  • Initialization
  • Never initialize the weights to all zeros or to the same value. (Why?)
  • Popular techniques:
    - Random values sampled from N(0,1).
    - Xavier (Glorot et al., JMLR'10): the scale of the initialization depends on the number of input (fan-in) and output (fan-out) neurons; initial weights are sampled from N(0, var(w)).
    - Pre-training, e.g. using RBMs (Hinton et al., Science 2006).
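
A NumPy sketch of Xavier initialization as described; the variance var(w) = 2 / (fan_in + fan_out) follows Glorot et al.:

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    """Sample W ~ N(0, var(w)) with var(w) = 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(fan_in, fan_out) * std

W = xavier_init(4096, 1000)
print(W.std())  # close to sqrt(2 / 5096) ≈ 0.0198
```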

  31. Training: Generalization
  [Plot: training accuracy vs. val-1/val-2 accuracy (top-5 error) over epochs, illustrating over-fitting.]
  • Underfitting: use deeper networks.
  • Overfitting: how to prevent it?
    - Stopping at the right time.
    - Weight penalties: L1, L2, max norm.
    - Dropout.
    - Model ensembles (e.g. the same model with different initializations).

  32. Generalization
  • Dropout
    - Stochastic regularization; the idea is applicable to many other networks.
    - Hidden units are dropped out randomly with a fixed probability 'p' (say 0.5), temporarily, while training.
    - While testing, all units are preserved but their activations are scaled by 'p'.
    - Dropout together with a max-norm constraint is found to be useful.
  [Figure: the same network before and after dropout.]
  Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. "Dropout: a simple way to prevent neural networks from overfitting." JMLR 2014.
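
A small NumPy sketch of the train/test behaviour described above. Following the JMLR paper's convention, p here is the probability of retaining a unit (0.5 on the slide), and test-time activations are scaled by p:

```python
import numpy as np

def dropout_forward(h, p=0.5, train=True):
    """h: hidden activations; p: probability of keeping a unit."""
    if train:
        mask = np.random.rand(*h.shape) < p   # drop each unit with probability 1 - p
        return h * mask
    return h * p                              # test time: keep all units, scale by p

h = np.random.randn(3, 5)
print(dropout_forward(h, train=True))   # roughly half the activations zeroed out
print(dropout_forward(h, train=False))  # all activations, scaled by 0.5
```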

  33. Generalization
  [Figure: features learned by a one-hidden-layer auto-encoder on the MNIST dataset, without dropout vs. with dropout; dropout yields sparser features.]
  Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. "Dropout: a simple way to prevent neural networks from overfitting." JMLR 2014.

  34. Generalization
  • Batch Normalization
  • Covariate shift: defined as a change in the distribution of a function's domain.
    - Randomized mini-batches reduce the effect of covariate shift.
  • Internal covariate shift: the current layer's parameters change the distribution of the input to successive layers.
    - This slows down training and requires careful initialization.
  Image credit: https://gab41.lab41.org/batch-normalization-what-the-hey-d480039a9e3b

  35. Generalization
  • Batch Normalization
    - Fixes the distribution of each layer's input as training progresses.
    - Faster convergence.
  Ioffe and Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." arXiv 2015.
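
A bare-bones NumPy version of the batch-norm transform from the paper (training-time statistics only; the running averages used at test time are omitted). gamma and beta are the learned scale and shift:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the mini-batch,
    then apply the learned scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(64, 10) * 3.0 + 5.0  # a badly scaled mini-batch
out = batch_norm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(3))  # ≈ 0 per feature
print(out.std(axis=0).round(3))   # ≈ 1 per feature
```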

  36. Some results on ImageNet
  [Chart: top-5 classification accuracy on ImageNet for AlexNet, Clarifai and GoogLeNet. Source: Krizhevsky et al., NIPS'12.]
