

  1. A Shallow Introduction to Deep Learning for Computer Vision Ramprasaath

  2. Lecture Outline
  • Computer Vision
    • Before the (Image/Alex)Net era (Summer 1956 – 2012)
    • After the (Image/Alex)Net era (2012 – present)
  • Neural Networks (Brief Introduction)
  • Need for CNNs
  • Visualizing, Understanding and Analyzing ConvNets
  • Transfer Learning
  • Going beyond Classification:
    • Localization
    • Detection
    • Segmentation
    • Depth Estimation
    • Video Classification
    • Image Ranking and Retrieval
    • Image Captioning
    • Visual Question Answering

  3. Where are we • Computer Vision • Before (Image/Alex)Net era – (Summer 1956-2012)

  4. 1956 Dartmouth AI Project: “We propose that a 2 month, 10 man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.” http://www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html

  5. 1956 Dartmouth AI Project: Five of the attendees of the 1956 Dartmouth Summer Research Project on AI reunited in 2006: Trenchard More, John McCarthy, Marvin Minsky, Oliver Selfridge, and Ray Solomonoff. Missing were: Arthur Samuel, Herbert Simon, Allen Newell, Nathaniel Rochester and Claude Shannon.

  6. The beginning of Computer Vision • In the summer of 1966, the late Prof. Marvin Minsky (then at MIT) asked a student to attach a camera to a computer and write a program that would allow the computer to describe what it sees.

  7. Example: Color (Hue) Histogram: quantize the hue values into bins and add +1 to a bin for each pixel. Slide credit: Fei-Fei Li & Andrej Karpathy, Lecture 4, 7 Jan 2015.
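The hue-histogram idea can be sketched in a few lines of numpy. This is a minimal illustration, not code from the lecture; the bin count and toy hue values are made up for the example.

```python
import numpy as np

def hue_histogram(hues, n_bins=12):
    """Quantize hue angles (0..360 degrees) into n_bins and count +1 per pixel.
    `hues` is a flat array of per-pixel hue values."""
    bins = np.linspace(0.0, 360.0, n_bins + 1)
    hist, _ = np.histogram(hues, bins=bins)
    return hist

# toy "image": 4 reddish pixels (hue near 0) and 2 greenish ones (hue near 120)
hues = np.array([5.0, 10.0, 15.0, 20.0, 125.0, 130.0])
hist = hue_histogram(hues, n_bins=12)  # 30-degree-wide bins
print(hist)  # -> [4 0 0 0 2 0 0 0 0 0 0 0]
```

The resulting fixed-length vector can be fed to any classifier, which is exactly the role hand-designed features played before ConvNets.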

  8. Example: HOG features: in each 8x8 pixel region, quantize the edge orientation into 9 bins. (Images from vlfeat.org.) Slide credit: Fei-Fei Li & Andrej Karpathy, Lecture 4, 7 Jan 2015.

  9. Example: Bag of Words. 1. Resize each patch to a fixed size (e.g. 32x32 pixels). 2. Extract HOG on the patch (144 numbers). Repeat for each detected feature; this gives a matrix of size [number_of_features x 144]. Slide credit: Fei-Fei Li & Andrej Karpathy, Lecture 4, 7 Jan 2015.

  10. Example: Bag of Words. Learn k-means centroids over the 144-D descriptors to build a “vocabulary” of visual words (e.g. 1000 centroids); each image then becomes a 1000-D histogram of visual words. Slide credit: Fei-Fei Li & Andrej Karpathy, Lecture 4, 7 Jan 2015.
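The two-slide pipeline (cluster descriptors into a vocabulary, then histogram assignments) can be sketched in plain numpy. This is a toy version under assumed sizes (50 descriptors, 8 visual words instead of 1000); a real system would use a proper k-means implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, n_iter=20):
    """Minimal k-means: learn a 'vocabulary' of k centroids from descriptors X."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each descriptor to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def bow_vector(descriptors, centroids):
    """Histogram of visual words: count descriptors mapped to each centroid."""
    d = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=-1)
    words = d.argmin(axis=1)
    return np.bincount(words, minlength=len(centroids))

# toy setup: 50 descriptors of dim 144 (as if HOG on 32x32 patches), 8 words
X = rng.normal(size=(50, 144))
vocab = kmeans(X, k=8)
v = bow_vector(X, vocab)
print(v.shape, v.sum())  # (8,) 50: an 8-D histogram covering all 50 descriptors
```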

  11. Traditional object recognition pipeline: fixed feature extraction (SIFT, HoG) → unsupervised coding (K-means, Sparse Coding) → Pooling → supervised Classifier.

  12. Most recognition systems are built on the same architecture. CNNs: end-to-end models. Slide credit: Yann LeCun, via Fei-Fei Li & Andrej Karpathy, Lecture 4, 7 Jan 2015.

  13. Lecture Outline • Computer Vision • After (Image/Alex)Net era – (2012 – present)

  14. ImageNet challenge entries over the years:
  • 2010 – NEC-UIUC: dense grid descriptors (HOG, LBP), coding (local coordinate, super-vector), pooling + SPM, linear SVM [Lin CVPR 2011]
  • 2012 – SuperVision (AlexNet): convolution, pooling, softmax [Krizhevsky NIPS 2012]
  • 2014 – GoogLeNet [Szegedy arXiv 2014]; VGG [Simonyan arXiv 2014]
  • 2015 – MSRA: Deep Residual Learning for Image Recognition, Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun [He arXiv 2015]
  Slide credit: Fei-Fei Li & Andrej Karpathy, Lecture 1.

  15. Lecture Outline • Brief Introduction to Neural Networks

  16. Neural Networks: Architectures. Neurons arranged in “fully-connected” layers. Slide credit: Fei-Fei Li & Andrej Karpathy, Lecture 5, 21 Jan 2015.

  17. Neural Networks: Architectures. Naming convention for “fully-connected” layers: a “2-layer Neural Net” is also called a “1-hidden-layer Neural Net”; a “3-layer Neural Net” is a “2-hidden-layer Neural Net”. Slide credit: Fei-Fei Li & Andrej Karpathy, Lecture 5, 21 Jan 2015.

  18. Where are we • Need for CNNs

  19. Fully Connected Layer. Example: 200x200 image, 40K hidden units → ~2B parameters!!! Spatial correlation is local, so this is a waste of resources, and we don't have enough training samples anyway. Slide credit: Marc'Aurelio Ranzato.

  20. Locally Connected Layer. Example: 200x200 image, 40K hidden units, filter size 10x10 → 4M parameters. Note: this parameterization is good when the input image is registered (e.g., face recognition). Slide credit: Marc'Aurelio Ranzato.

  21. Locally Connected Layer. STATIONARITY? Statistics are similar at different locations. Same example: 200x200 image, 40K hidden units, filter size 10x10 → 4M parameters. Slide credit: Marc'Aurelio Ranzato.

  22. Convolutional Layer. Locality: nearby pixels are correlated. Share the same parameters across different locations (assuming the input is stationary): convolutions with learned kernels. Slide credit: Marc'Aurelio Ranzato.
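The weight-sharing idea can be made concrete with a toy single-kernel "valid" convolution in numpy (the 5x5 image and averaging kernel here are illustrative, not from the slides):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one learned kernel over the image; the SAME weights are applied
    at every location (weight sharing), unlike a locally connected layer."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25.0).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0   # a 3x3 averaging "filter"
out = conv2d(image, kernel)
print(out.shape)  # (3, 3): one response per spatial location
```

Because the kernel is reused everywhere, the parameter count depends only on the kernel size, not on the image size.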

  23. Convolutional Layer. Learn multiple filters. E.g.: 200x200 image, 100 filters, filter size 10x10 → 10K parameters. (C) Dhruv Batra. Slide credit: Marc'Aurelio Ranzato.
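The three parameter counts quoted on slides 19-23 follow from simple arithmetic (bias terms ignored, as on the slides):

```python
# Parameter counts for the three connectivity schemes on a 200x200 image
# with 40K hidden units, using the numbers from the slides.
n_in = 200 * 200           # input pixels
n_hidden = 40_000          # hidden units
filt = 10 * 10             # 10x10 filter

fully_connected   = n_in * n_hidden   # every unit sees every pixel
locally_connected = n_hidden * filt   # each unit sees its own 10x10 patch
conv_100_filters  = 100 * filt        # 100 shared 10x10 kernels

print(fully_connected)    # 1600000000 (the "~2B" on the slide)
print(locally_connected)  # 4000000
print(conv_100_filters)   # 10000
```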

  24. Pooling Layer. Assume a filter is an “eye” detector. Q: how can we make the detection robust to the exact location of the eye? Slide credit: Marc'Aurelio Ranzato.

  25. Pooling Layer. By “pooling” (e.g., taking the max of) filter responses at different locations, we gain robustness to the exact spatial location of features. Slide credit: Marc'Aurelio Ranzato.
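The "eye detector" robustness argument can be demonstrated with a tiny max-pooling sketch (the 4x4 response maps are made up for the example):

```python
import numpy as np

def max_pool(x, size=2):
    """Non-overlapping max pooling: keep the strongest response per window."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]
    return x.reshape(H // size, size, W // size, size).max(axis=(1, 3))

# an "eye detector" response map with one strong activation
a = np.zeros((4, 4)); a[0, 0] = 1.0   # eye detected at (0, 0)
b = np.zeros((4, 4)); b[1, 1] = 1.0   # same eye, shifted by one pixel
# after 2x2 max pooling, both maps are identical:
print(np.array_equal(max_pool(a), max_pool(b)))  # True
```

Small translations of the input leave the pooled output unchanged, which is exactly the robustness the slide describes.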

  26. Hyperparameters to play with:
  - network architecture
  - learning rate, its decay schedule, update type
  - regularization (L2/L1/Maxnorm/Dropout)
  - loss to use (e.g. SVM/Softmax)
  - initialization
  [Cartoon: the neural networks practitioner as a musician at a mixing board; music = loss function.] Slide credit: Fei-Fei Li & Andrej Karpathy, Lecture 6, 21 Jan 2015.

  27. A typical ConvNet stage (Y. LeCun, M.A. Ranzato): [Filter Bank → Non-Linearity → Feature Pooling → Norm], repeated, followed by a Classifier.
  - Filter Bank: matrix multiplication (convolution)
  - Non-Linearity: e.g. ReLU
  - Pooling: aggregation over space or feature type
  - Normalization: e.g. contrast normalization

  28. Fast-forward to today. [From recent Yann LeCun slides]

  29. A map of learning methods from SHALLOW to DEEP and unsupervised to supervised (Y. LeCun, M.A. Ranzato): shallow supervised: Perceptron, SVM, Boosting, Decision Tree; shallow unsupervised / probabilistic models: GMM, Sparse Coding, RBM, BayesNP; deep supervised: Neural Net, Conv. Net, RNN; deep unsupervised: Autoencoder (AE), Denoising AE (D-AE), DBN, DBM.

  30. Demo: convnetjs example of training on CIFAR-10. Slide credit: Fei-Fei Li & Andrej Karpathy, Lecture 7, 21 Jan 2015.

  31. Convolutional Layer: just like a normal hidden layer, BUT: neurons connect to the input only within a local receptive field, and all neurons in a single depth slice share weights. Slide credit: Fei-Fei Li & Andrej Karpathy, Lecture 8, 2 Feb 2015.

  32. The weights of this neuron, visualized. Slide credit: Fei-Fei Li & Andrej Karpathy, Lecture 8, 2 Feb 2015.

  33. Convolving the first filter with the input gives the first depth slice of the output volume. Slide credit: Fei-Fei Li & Andrej Karpathy, Lecture 8, 2 Feb 2015.

  34. Visualizing Learned Filters. Figure credit: [Zeiler & Fergus ECCV14]

  35. Visualizing Learned Filters (continued). Figure credit: [Zeiler & Fergus ECCV14]

  36. Visualizing Learned Filters (continued). Figure credit: [Zeiler & Fergus ECCV14]

  37. Q: What is the learned CNN representation? A CNN transforms the image into 4096 numbers (the “CNN code”) that are then linearly classified. The final layers of the network:
  • POOL2: [14x14x512], memory: 14*14*512 = 100K, params: 0
  • CONV3-512: [14x14x512], memory: 100K, params: (3*3*512)*512 = 2,359,296
  • CONV3-512: [14x14x512], memory: 100K, params: 2,359,296
  • CONV3-512: [14x14x512], memory: 100K, params: 2,359,296
  • POOL2: [7x7x512], memory: 7*7*512 = 25K, params: 0
  • FC: [1x1x4096], memory: 4096, params: 7*7*512*4096 = 102,760,448
  • FC: [1x1x4096], memory: 4096, params: 4096*4096 = 16,777,216
  • FC: [1x1x1000], memory: 1000, params: 4096*1000 = 4,096,000
  TOTAL memory: 24M * 4 bytes ≈ 93MB / image (forward only; roughly ×2 for backward)
  TOTAL params: 138M parameters
  Slide credit: Fei-Fei Li & Andrej Karpathy, Lecture 8, 2 Feb 2015.
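The per-layer parameter counts quoted on the slide can be verified by direct arithmetic (this assumes the 224x224-input configuration implied by the 14x14 and 7x7 map sizes):

```python
# Tallying the parameter counts from the slide (tail end of the network).
conv3_512 = 3 * 3 * 512 * 512   # one 3x3 conv, 512 -> 512 channels
fc1 = 7 * 7 * 512 * 4096        # flattened pool output -> 4096 units
fc2 = 4096 * 4096
fc3 = 4096 * 1000

print(conv3_512)        # 2359296
print(fc1)              # 102760448
print(fc2)              # 16777216
print(fc3)              # 4096000
print(fc1 + fc2 + fc3)  # the three FC layers alone hold ~124M parameters
```

This is why the slide's 138M total is dominated by the fully connected layers, even though the conv layers do most of the compute.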

  38. Visualizing the CNN code representation (“CNN code” = the 4096-D vector before the classifier): for a query image, find its nearest neighbors in the “code” space. (But we'd like a more global way to visualize the distances.) Slide credit: Fei-Fei Li & Andrej Karpathy, Lecture 8, 2 Feb 2015.
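The nearest-neighbor lookup in code space is a plain L2 search over 4096-D vectors. A toy sketch with random stand-in codes (a real pipeline would extract them from the network's last FC layer):

```python
import numpy as np

def nearest_neighbors(query_code, codes, k=3):
    """Rank a database of 4096-D 'CNN codes' by L2 distance to a query code."""
    d = np.linalg.norm(codes - query_code, axis=1)
    return np.argsort(d)[:k]

rng = np.random.default_rng(0)
codes = rng.normal(size=(10, 4096))              # pretend database of 10 images
query = codes[4] + 0.01 * rng.normal(size=4096)  # slightly perturbed copy of image 4
idx = nearest_neighbors(query, codes, k=3)
print(idx[0])  # 4: the nearest neighbor is the image the query came from
```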
