Common recognition tasks
Adapted from Fei-Fei Li Slide from L. Lazebnik.
Common recognition tasks Adapted from Slide from L. Lazebnik. - - PowerPoint PPT Presentation
Common recognition tasks Adapted from Slide from L. Lazebnik. Fei-Fei Li Image classification and tagging outdoor mountains city Asia Lhasa Adapted from Slide from L. Lazebnik. Fei-Fei Li Object detection find
Adapted from Fei-Fei Li Slide from L. Lazebnik.
Adapted from Fei-Fei Li Slide from L. Lazebnik.
Adapted from Fei-Fei Li Slide from L. Lazebnik.
Adapted from Fei-Fei Li Slide from L. Lazebnik.
Adapted from Fei-Fei Li Slide from L. Lazebnik.
mountain building tree umbrella person lamp sky building market stall lamp person person person person ground umbrella
Adapted from Fei-Fei Li Slide from L. Lazebnik.
This is a busy street in an Asian city. Mountains and a large palace or fortress loom in the background. In the foreground, we see colorful souvenir stalls and people walking around and shopping. One person in the lower left is pushing an empty cart, and a couple of people in the middle are sitting, possibly posing for a photograph.
Adapted from Fei-Fei Li Slide from L. Lazebnik.
image to get the desired output:
Slide from L. Lazebnik.
{(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error on the training set
the predicted value y = f(x)
prediction function feature representation
Slide from L. Lazebnik.
Prediction
Training Labels Training Images Training
Training
Image Features Image Features
Testing
Test Image Learned model Learned model
Slide credit: D. Hoiem
Feature representation Trainable classifier Image Pixels
Class label
Slide from L. Lazebnik.
image Fully connected layer
Slide from L. Lazebnik.
image
Slide from L. Lazebnik.
image feature map learned weights
Slide from L. Lazebnik.
image another feature map another set
weights
Slide from L. Lazebnik.
image feature map . . . bank of K filters K feature maps
Slide from L. Lazebnik.
K feature maps K filters
convolutional layer image Spatial resolution: (roughly) the same if stride of 1 is used, reduced by 1/S if stride of S is used
Slide from L. Lazebnik.
image L feature maps in the next layer convolutional layer + ReLU
F x F x K filter L filters K feature maps
Slide from L. Lazebnik.
Input Image Convolution (Learned) Non-linearity Spatial pooling Feature maps
Input Feature Map
. . .
Source: R. Fergus, Y. LeCun
Input Image Convolution (Learned) Non-linearity Spatial pooling Feature maps
Source: R. Fergus, Y. LeCun Source: Stanford 231n
Input Image Convolution (Learned) Non-linearity Spatial pooling Feature maps
Max (or Avg)
Source: R. Fergus, Y. LeCun
K feature maps, resolution 1/S
F x F pooling filter, stride S K feature maps
max value
Usually: F=2 or 3, S=2
Slide from L. Lazebnik.
P(c | x) = exp(wc ⋅x) exp(wk ⋅x)
k=1 C
Softmax layer:
Slide from L. Lazebnik.
true and estimated labels of training examples:
𝐱(𝐲!) − 𝑧! #
𝐱 𝑧! | 𝐲!
𝐱 𝐲! )
Slide from L. Lazebnik.
conv filters subsample subsample conv linear filters weights
Slide from B. Hariharan
Convolutional network
min
θ
1 N
N
X
i=1
L(h(xi; θ), yi)
θ(t+1) = θ(t) λ 1 N
N
X
i=1
rL(h(xi; θ), yi)
Gradient descent update
Slide from B. Hariharan
rL(h(x; θ), y)
rθL(z, y) = ∂L(z, y) ∂z ∂z ∂θ
Slide from B. Hariharan
conv filters subsample subsample conv linear filters weights
Slide from B. Hariharan
f1 f2 f3 f4 f5 x w1 w2 w3 w4 w5 z1 z2 z3 z4 z5 = z
Slide from B. Hariharan
f1 f2 f3 f4 f5 x w1 w2 w3 w4 w5 z1 z2 z3 z4 z5 = z
∂z ∂w5 = ∂f5(z4, w5) ∂w5
Slide from B. Hariharan
∂z ∂w4 = ∂z ∂z4 ∂z4 ∂w4 = ∂f5(z4, w5) ∂z4 ∂f4(z3, w4) ∂w4
f1 f2 f3 f4 f5 x w1 w2 w3 w4 w5 z1 z2 z3 z4 z5 = z
Slide from B. Hariharan
∂z ∂w4 = ∂z ∂z4 ∂z4 ∂w4 = ∂f5(z4, w5) ∂z4 ∂f4(z3, w4) ∂w4
f1 f2 f3 f4 f5 x w1 w2 w3 w4 w5 z1 z2 z3 z4 z5 = z
Slide from B. Hariharan
∂z ∂w4 = ∂z ∂z4 ∂z4 ∂w4 = ∂f5(z4, w5) ∂z4 ∂f4(z3, w4) ∂w4
f1 f2 f3 f4 f5 x w1 w2 w3 w4 w5 z1 z2 z3 z4 z5 = z
Slide from B. Hariharan
∂z ∂w4 = ∂z ∂z4 ∂z4 ∂w4 = ∂f5(z4, w5) ∂z4 ∂f4(z3, w4) ∂w4
f1 f2 f3 f4 f5 x w1 w2 w3 w4 w5 z1 z2 z3 z4 z5 = z
Slide from B. Hariharan
∂z ∂w4 = ∂z ∂z4 ∂z4 ∂w4 = ∂f5(z4, w5) ∂z4 ∂f4(z3, w4) ∂w4
f1 f2 f3 f4 f5 x w1 w2 w3 w4 w5 z1 z2 z3 z4 z5 = z
Slide from B. Hariharan
∂z ∂w4 = ∂z ∂z4 ∂z4 ∂w4 = ∂f5(z4, w5) ∂z4 ∂f4(z3, w4) ∂w4
f1 f2 f3 f4 f5 x w1 w2 w3 w4 w5 z1 z2 z3 z4 z5 = z
Slide from B. Hariharan
f1 f2 f3 f4 f5 x w1 w2 w3 w4 w5 z1 z2 z3 z4 z5 = z
∂z ∂w3 = ∂z ∂z3 ∂z3 ∂w3
Slide from B. Hariharan
f1 f2 f3 f4 f5 x w1 w2 w3 w4 w5 z1 z2 z3 z4 z5 = z
∂z ∂w3 = ∂z ∂z3 ∂z3 ∂w3
Slide from B. Hariharan
f1 f2 f3 f4 f5 x w1 w2 w3 w4 w5 z1 z2 z3 z4 z5 = z
∂z ∂w3 = ∂z ∂z3 ∂z3 ∂w3
Slide from B. Hariharan
f1 f2 f3 f4 f5 x w1 w2 w3 w4 w5 z1 z2 z3 z4 z5 = z
∂z ∂w3 = ∂z ∂z3 ∂z3 ∂w3 ∂z ∂z3 = ∂z ∂z4 ∂z4 ∂z3
Slide from B. Hariharan
f1 f2 f3 f4 f5 x w1 w2 w3 w4 w5 z1 z2 z3 z4 z5 = z
∂z ∂w3 = ∂z ∂z3 ∂z3 ∂w3 ∂z ∂z3 = ∂z ∂z4 ∂z4 ∂z3
Slide from B. Hariharan
f1 f2 f3 f4 f5 x w1 w2 w3 w4 w5 z1 z2 z3 z4 z5 = z
∂z ∂z2 = ∂z ∂z3 ∂z3 ∂z2 ∂z ∂w2 = ∂z ∂z2 ∂z2 ∂w2
Recurrence going backward!!
Slide from B. Hariharan
f1 f2 f3 f4 f5 x w1 w2 w3 w4 w5 z1 z2 z3 z4 z5 = z
Slide from B. Hariharan
module
g(wi) = g(zi) ∂zi ∂wi
g(zi−1) = g(zi) ∂zi ∂zi−1
Slide from B. Hariharan
fi zi-1 zi wi
Slide from B. Hariharan
fi g(zi-1) g(zi) g(wi)
Slide from B. Hariharan
Slide from B. Hariharan
fi a d c b
Slide from B. Hariharan
fi
∂z ∂d
∂z ∂a
∂z ∂b ∂z ∂c
Slide from B. Hariharan
fi a d c b
Slide from B. Hariharan
fi
∂z ∂d
∂z ∂a
∂z ∂b ∂z ∂c
Slide from B. Hariharan
Slide from B. Hariharan
Adapted from Fei-Fei Li
5 10 15 20 25 30 2010 2011 2012 2014 2015 Challenge winner’s top-5 error rates
7-layer Convolutional Networks 19 layers 152 layers Slide from B. Hariharan
Horse
Trained feature extractor Linear classifier
Slide from B. Hariharan
new image classification problem?
in feature extractor
Trained feature extractor Linear classifier
Slide from B. Hariharan
Horse Slide from B. Hariharan
Bakery Initialize with pre- trained, then train with low learning rate Slide from B. Hariharan
Adapted from Fei-Fei Li
Input image Extract region proposals (~2k / image) Compute CNN features Classify regions (linear SVM)
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014 Figure credit : Ross Girshick
Input image Extract region proposals (~2k / image) Compute CNN features
Slide credit : Ross Girshick
Input image Extract region proposals (~2k / image) Compute CNN features
227 x 227
Slide credit : Ross Girshick
Input image Extract region proposals (~2k / image) Compute CNN features
Output: “fc7” features
Slide credit : Ross Girshick
Input image Extract region proposals (~2k / image) Compute CNN features
Warped proposal 4096-dimensional fc7 feature vector linear classifiers (SVM or softmax)
person? 1.6 horse? -0.3
... ...
Classify regions
Slide credit : Ross Girshick
Linear regression
Original proposal Predicted
Bounding-box regression
Slide credit : Ross Girshick
5 15 19 29 36 39 49
Faster R-CNN (VGG-16) Late 2018 Early 2015
DPM (Pre DL) Fast R-CNN (AlexNet) Fast R-CNN (VGG-16) Faster R-CNN (ResNet-50) Faster R-CNN (R-101-FPN) Mask R-CNN (X-152-FPN)
Past (best circa 2012) Progress within DL methods: > 3x! ~4 years
5 15 19 29 36 39 46
Faster R-CNN (VGG-16)
DPM (Pre DL) Fast R-CNN (AlexNet) Fast R-CNN (VGG-16) Faster R-CNN (ResNet-50) Faster R-CNN (R-101-FPN) Mask R-CNN (X-152-FPN)
Ø R-CNN [Girshick et al. 2014] Ø SPP-net [He et al. 2014] Ø Fast R-CNN [Girshick. 2015] Ø Faster R-CNN [Ren et al. 2015] Ø R-FCN [Dai et al. 2016] Ø Feature Pyramid Networks + Faster R-CNN [Lin et al. 2017] Ø Mask R-CNN [He et al. 2017] Ø Training with Large Minibatches (MegDet) [Peng, Xiao, Li, et al. 2017] Ø Cascade R-CNN [Cai & Vasconcelos 2018]
Ø …
mountain building tree umbrella person lamp sky building market stall lamp person person person person ground umbrella
Adapted from Fei-Fei Li
h w 3 Slide from B. Hariharan
h/ 4 w/4 c Slide from B. Hariharan
c h/ 4 w/4 Slide from B. Hariharan
h/ 4 w/4 c
Can be considered as a feature vector for a pixel
Slide from B. Hariharan
c Convolve with #classes 1x1 filters
#classes
h/ 4 w/4 Slide from B. Hariharan
layers
Slide from B. Hariharan
person bicycle Slide from B. Hariharan
This is a busy street in an Asian city. Mountains and a large palace or fortress loom in the background. In the foreground, we see colorful souvenir stalls and people walking around and shopping. One person in the lower left is pushing an empty cart, and a couple of people in the middle are sitting, possibly posing for a photograph.
Adapted from Fei-Fei Li
h1 x1 h0
CNN
𝑔
𝐽 ∈ 𝑆 START C
Dog in
x2 h2
C Dog
x3 h3
END
x4 h4 x5 h5
C C C
a hat
in a hat
Slide from D. Fouhey See also: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Additional Resources:
http://saurabhg.web.illinois.edu/teaching/ece549/sp2020/
https://www.doc.ic.ac.uk/~ajd/Robotics/