A Shallow Introduction to Deep Learning for Computer Vision
Ramprasaath
Lecture Outline:
- Computer Vision before the (Image/Alex)Net era (Summer 1956-2012)
- Computer Vision after the (Image/Alex)Net era (2012-present)
- Neural Networks (brief)
“We propose that a 2 month, 10 man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.”
http://www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html
Five of the attendees of the 1956 Dartmouth Summer Research Project on AI reunited in 2006: Trenchard More, John McCarthy, Marvin Minsky, Oliver Selfridge, and Ray Solomonoff. Missing were: Arthur Samuel, Herbert Simon, Allen Newell, Nathaniel Rochester and Claude Shannon.
Minsky asked a student to attach a camera to a computer and write an algorithm that would allow the computer to describe what it sees.
Fei-Fei Li & Andrej Karpathy, Lecture 4 - 7 Jan 2015
(Figure: a color histogram feature: each pixel adds +1 to its hue bin.)
HOG: take each 8x8 pixel region and quantize the edge orientations into a histogram.
(images from vlfeat.org)
For each detected feature:
1. Resize the patch to a fixed size (e.g. 32x32 pixels)
2. Extract HOG on the patch (get 144 numbers)
This gives a matrix of size [number_of_features x 144].
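One plausible breakdown of the 144 numbers for a 32x32 patch is 4x4 cells of 8x8 pixels, each contributing a 9-bin orientation histogram (4*4*9 = 144). A minimal numpy sketch of such a simplified HOG-style descriptor (no block normalization, unlike full HOG; the cell/bin choice here is an assumption to match the count):

```python
import numpy as np

def hog_like_descriptor(patch, cell=8, bins=9):
    """Simplified HOG-style descriptor: per-cell histograms of
    gradient orientations, weighted by gradient magnitude."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)        # unsigned orientation in [0, pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    h, w = patch.shape
    feats = []
    for y in range(0, h, cell):
        for x in range(0, w, cell):
            hist = np.bincount(bin_idx[y:y+cell, x:x+cell].ravel(),
                               weights=mag[y:y+cell, x:x+cell].ravel(),
                               minlength=bins)
            feats.append(hist)
    return np.concatenate(feats)

patch = np.random.rand(32, 32)    # stand-in for a resized image patch
desc = hog_like_descriptor(patch)
print(desc.shape)                 # (144,): 4x4 cells x 9 orientation bins
```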
Bag of visual words: run k-means on the 144-d descriptors to learn centroids (e.g. 1000 centroids, a "vocabulary" of visual words), then represent each image as a 1000-d histogram of visual words.
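A toy sketch of this pipeline, with a 20-word vocabulary instead of 1000 and random 144-d vectors standing in for real HOG descriptors:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=10):
    """Plain k-means: learn k centroids over the descriptors."""
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centroids[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(0)
    return centroids

def bow_histogram(descriptors, centroids):
    """Assign each descriptor to its nearest visual word,
    then build a normalized histogram of word counts."""
    d = ((descriptors[:, None, :] - centroids[None]) ** 2).sum(-1)
    words = d.argmin(1)
    hist = np.bincount(words, minlength=len(centroids)).astype(float)
    return hist / hist.sum()

descs = rng.random((500, 144))            # "training" descriptors
vocab = kmeans(descs, 20)                 # vocabulary of 20 visual words
hist = bow_histogram(rng.random((50, 144)), vocab)   # one image's histogram
print(hist.shape)                         # (20,)
```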
The classical pipeline: fixed feature extraction (SIFT, HOG), unsupervised feature learning and pooling (k-means, sparse coding), then a supervised classifier.
CNNs: end-to-end models
(slide from Yann LeCun)
(Diagram legend: Convolution, Pooling, Softmax, Other.)
ImageNet entries: SuperVision [Krizhevsky NIPS 2012] (Year 2012); GoogLeNet, VGG, MSRA (Year 2014).
NEC-UIUC (Year 2010): dense-grid descriptors (HOG, LBP); coding (local coordinate, super-vector); pooling, SPM; linear SVM.
[Lin CVPR2011] [Szegedy arxiv 2014] [Simonyan arxiv 2014] [He arxiv 2014]
Year 2010 → Year 2015:
[He arxiv 2015]
Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Fei-Fei Li & Andrej Karpathy, Lecture 5 - 21 Jan 2015
“Fully-connected” layers
Neurons
“2-layer Neural Net”, or “1-hidden-layer Neural Net”
“3-layer Neural Net”, or “2-hidden-layer Neural Net” “Fully-connected” layers
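A minimal forward pass through a "2-layer Neural Net" (one hidden layer of fully-connected neurons; all sizes here are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(3072)                         # e.g. a flattened 32x32x3 image
W1, b1 = rng.random((100, 3072)) * 0.01, np.zeros(100)   # input -> hidden
W2, b2 = rng.random((10, 100)) * 0.01, np.zeros(10)      # hidden -> scores

h = np.maximum(0, W1 @ x + b1)               # hidden layer with ReLU
scores = W2 @ h + b2                         # class scores (10 classes)
print(scores.shape)                          # (10,)
```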
Example: 200x200 image, 40K hidden units → ~2B parameters!!! Far too many, and we don't have enough training samples anyway.
Slide Credit: Marc'Aurelio Ranzato
Locally connected: 200x200 image, 40K hidden units, filter size 10x10 → 4M parameters. Note: this parameterization is good when the input image is registered (e.g., face recognition).
Stationarity? Statistics are similar at different locations.
Share the same parameters across different locations (assuming input is stationary): Convolutions with learned kernels
Locality? Nearby pixels are correlated
Learn multiple filters.
E.g.: 200x200 image, 100 filters, filter size 10x10 → 10K parameters.
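The three parameter counts above can be checked with simple arithmetic (biases ignored, numbers from the slides):

```python
# Parameter counts for the three connectivity schemes.
image = 200 * 200            # input pixels
hidden = 40_000              # hidden units
filt = 10 * 10               # a 10x10 filter

fully_connected = image * hidden        # every unit sees every pixel
locally_connected = hidden * filt       # every unit sees one 10x10 patch
convolutional = 100 * filt              # 100 shared 10x10 filters

print(fully_connected)    # 1600000000 (~2B on the slide)
print(locally_connected)  # 4000000
print(convolutional)      # 10000
```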
(C) Dhruv Batra
Let us assume the filter is an "eye" detector. Q: how can we make the detection robust to the exact location of the eye?
By “pooling” (e.g., taking max) filter responses at different locations we gain robustness to the exact spatial location of features.
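A minimal numpy sketch of max pooling over a 2-D map of filter responses (non-overlapping 2x2 windows; window size is illustrative):

```python
import numpy as np

def max_pool(responses, size=2):
    """Max-pool a 2-D response map over non-overlapping size x size
    windows: keep only the strongest response in each window."""
    h, w = responses.shape
    r = responses[:h - h % size, :w - w % size]      # trim ragged edges
    return r.reshape(h // size, size, w // size, size).max(axis=(1, 3))

resp = np.arange(16.0).reshape(4, 4)   # toy 4x4 response map
print(max_pool(resp))
# [[ 5.  7.]
#  [13. 15.]]
```

Shifting the eye by a pixel or two changes which cell inside the window fires, but usually not the pooled maximum; that is where the robustness comes from.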
Fei-Fei Li & Andrej Karpathy, Lecture 6 - 21 Jan 2015
To the neural networks practitioner, music = the loss function: you monitor it during training the way a musician listens.
(Slide credit: Y. LeCun, M.A. Ranzato)
Normalization: e.g. contrast normalization
Filter bank: matrix multiplication
Non-linearity: e.g. ReLU
Pooling: aggregation over space or feature type
A typical stage, repeated: Norm → Filter Bank → Non-Linearity → Pooling → feature; stack such stages, then a Classifier.
[From recent Yann LeCun slides]
(Diagram: a map of models from SHALLOW to DEEP and supervised to unsupervised: Perceptron, SVM, Boosting, Decision Tree, Neural Net, RNN on the supervised side; GMM, Sparse Coding, BayesNP, RBM, AE, D-AE, DBN, DBM and other probabilistic models on the unsupervised side.)
Fei-Fei Li & Andrej Karpathy, Lecture 7 - 21 Jan 2015
Fei-Fei Li & Andrej Karpathy, Lecture 8 - 2 Feb 2015
Just like a normal hidden layer, BUT: each neuron looks only at a local receptive field, and all neurons in a depth slice share weights.
The weights of this neuron visualized
Convolving the first filter over the input gives the first slice of depth in the output volume.
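A naive numpy sketch of this: sliding one filter over a single-channel input produces one 2-D depth slice (real conv layers span all input channels and apply many filters, one slice each; sizes here are illustrative):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution of one filter over one channel:
    slide the kernel over the image, taking a dot product at
    each position. One filter -> one depth slice of the output."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = (image[y:y+kh, x:x+kw] * kernel).sum()
    return out

img = np.random.rand(32, 32)       # single-channel toy "image"
filt = np.random.rand(5, 5)        # one 5x5 filter (weights would be learned)
slice0 = conv2d_valid(img, filt)   # first depth slice of the output volume
print(slice0.shape)                # (28, 28)
```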
Figure Credit: [Zeiler & Fergus ECCV14]
(end of the VGGNet layer-by-layer tally)
...
POOL2: [14x14x512] memory: 14*14*512 = 100K, params: 0
CONV3-512: [14x14x512] memory: 14*14*512 = 100K, params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512 = 100K, params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512 = 100K, params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512 = 25K, params: 0
FC: [1x1x4096] memory: 4096, params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096, params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000, params: 4096*1000 = 4,096,000
TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
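The per-layer parameter counts above can be verified directly:

```python
# Sanity-check the VGGNet parameter tally.
conv3_512 = 3 * 3 * 512 * 512   # 3x3 filters over 512 channels, 512 filters
fc1 = 7 * 7 * 512 * 4096        # first FC layer, on the flattened 7x7x512 volume
fc2 = 4096 * 4096
fc3 = 4096 * 1000

print(conv3_512)        # 2359296
print(fc1)              # 102760448
print(fc2)              # 16777216
print(fc3)              # 4096000
print(fc1 + fc2 + fc3)  # 123633664: the 3 FC layers alone hold ~124M of the 138M params
```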
A CNN transforms the image to 4096 numbers that are then linearly classified.
(“CNN code” = the 4096-D vector just before the classifier.) Given a query image, we can retrieve its nearest neighbors in the “code” space.
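A sketch of the retrieval step, with random vectors standing in for real 4096-D CNN codes:

```python
import numpy as np

rng = np.random.default_rng(0)

codes = rng.random((1000, 4096))   # database of per-image "CNN codes"
query = rng.random(4096)           # code of the query image

# Nearest neighbors by Euclidean distance in code space.
dists = np.linalg.norm(codes - query, axis=1)
top5 = np.argsort(dists)[:5]       # indices of the 5 most similar images
print(top5.shape)                  # (5,)
```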
t-SNE [van der Maaten & Hinton]: embed high-dimensional points so that locally, pairwise distances are conserved, i.e. similar things end up in similar places.
Right: Example embedding of MNIST digits (0-9) in 2D
http://cs.stanford.edu/people/karpathy/cnnembed/
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014
Remember: Score for class c (before Softmax)
Regularization
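The class-model visualization objective (maximize the class score S_c(I) minus an L2 regularizer on the image) can be sketched as gradient ascent. Here a toy linear score w_c·I stands in for the CNN so the gradient is available in closed form; in the paper the gradient of S_c comes from backprop through the network:

```python
import numpy as np

rng = np.random.default_rng(0)
w_c = rng.standard_normal(64)      # stand-in for the class-c scorer
I = np.zeros(64)                   # start from a zero "image"
lam, lr = 0.1, 0.5                 # regularization strength, step size

for _ in range(100):
    grad = w_c - 2 * lam * I       # d/dI [ w_c.I - lam * ||I||^2 ]
    I += lr * grad                 # gradient ASCENT on the score

# For this toy objective, the optimum is I = w_c / (2 * lam).
print(np.allclose(I, w_c / (2 * lam), atol=1e-3))  # True
```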
Visualizing and Understanding Convolutional Networks Zeiler & Fergus, 2013
Visualizing arbitrary neurons along the way to the top...
Rich feature hierarchies for accurate object detection and semantic segmentation [Girshick, Donahue, Darrell, Malik]
Understanding Deep Image Representations by Inverting Them [Mahendran and Vedaldi, 2014]
reconstructions from the 1000 log probabilities for ImageNet (ILSVRC) classes
(immediately before the first Fully Connected layer)
Intriguing properties of neural networks [Szegedy et al.]
(Figure: pairs of a correctly classified image and its slightly distorted, misclassified version.)
Exploring the Representation Capabilities of the HOG Descriptor [Tatu et al., 2011]
Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images [Nguyen, Yosinski, Clune] >99.6% confidences
[What I learned from competing against a ConvNet on ImageNet] Karpathy, 2014: http://bit.ly/humanvsconvnet Try it out yourself: http://cs.stanford.edu/people/karpathy/ilsvrc/
Where is the catch??
To the rescue…
Imagenet
If the new dataset is small: freeze all weights (treat the CNN as a fixed feature extractor) and retrain only the classifier, i.e. swap the Softmax layer at the end.
If you have a bigger dataset, “finetune” instead: use the old weights as initialization and retrain a bigger portion of the network (only some of the higher layers, or even all of it).
Tip: when finetuning, use only ~1/10th of the original learning rate on the top layer, and ~1/100th on intermediate layers.
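A toy sketch of the fixed-feature-extractor option: the pretrained CNN is frozen, and only a new softmax classifier is trained on its 4096-D codes (features and labels here are synthetic stand-ins, and the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

codes = rng.standard_normal((200, 4096))   # frozen "CNN codes" for 200 images
labels = rng.integers(0, 5, 200)           # 5 new target classes

W = np.zeros((5, 4096))                    # the ONLY weights we train
lr = 0.1
for _ in range(50):
    scores = codes @ W.T
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(200), labels] -= 1            # softmax-loss gradient
    W -= lr * (probs.T @ codes) / 200             # gradient step

acc = (codes @ W.T).argmax(1) == labels
print(acc.mean())   # near 1.0: the linear classifier fits the small toy set
```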
Very Deep Convolutional Networks for Large-Scale Image Recognition, Simonyan et al., 2014
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, Sermanet et al., 2014
Idea: train a localization net. Take the classification network, swap the Softmax loss at the end for an L2 (regression) loss over box coordinates, and fine-tune.
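A toy illustration of the swapped-in loss: the final layer now regresses box coordinates instead of class scores, trained with L2 (all numbers here are made up):

```python
import numpy as np

pred = np.array([48.0, 40.0, 120.0, 150.0])    # predicted box (x, y, w, h)
target = np.array([50.0, 42.0, 118.0, 155.0])  # ground-truth box

l2_loss = ((pred - target) ** 2).sum()   # the regression loss
grad = 2 * (pred - target)               # its gradient w.r.t. the outputs,
                                         # backpropagated just like before
print(l2_loss)                           # 37.0
```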
The model must output a set of detections; each detection has bounding box coordinates.
Rich feature hierarchies for accurate object detection and semantic segmentation [Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik]
Idea: turn a Detection problem into an Image Classification problem (but over image regions). The content of every labeled bounding box is a positive example for its class; every other bounding box in the image is a special negative class.
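A sketch of how regions can be labeled for this classification problem, using intersection-over-union against a ground-truth box (the 0.5 threshold and the boxes are illustrative, not R-CNN's exact settings):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

# High overlap with a labeled "cat" box -> positive example for "cat";
# low overlap -> the special negative class.
gt_cat = (10, 10, 50, 50)
proposals = [(12, 8, 52, 48), (100, 100, 140, 140)]
labels = ["cat" if iou(p, gt_cat) >= 0.5 else "negative" for p in proposals]
print(labels)   # ['cat', 'negative']
```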
Fully Convolutional Networks for Semantic Segmentation Long, Shelhamer, Darrell
Depth Map Prediction from a Single Image using a Multi-Scale Deep Network
[Eigen et al.], 2014
Two-Stream Convolutional Networks for Action Recognition in Videos [Simonyan et al.], 2014
Large-scale Video Classification with Convolutional Neural Networks [Karpathy et al.], 2014
Long-term Recurrent Convolutional Networks for Visual Recognition and Description [Donahue et al.], 2014
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models [Kiros, Salakhutdinov, Zemel, 2014]
Lecture 11 - 74
Fei-Fei Li &Andrej Karpathy Lecture 8 - 2 Feb 2015
Lecture 12 - 20
Fei-Fei Li &Andrej Karpathy Lecture 8 - 2 Feb 2015
Lecture 12 - 21
Fei-Fei Li &Andrej Karpathy Lecture 8 - 2 Feb 2015
Lecture 12 - 22
Questions?? Comments?? Thoughts?? Ideas??