Common recognition tasks. Adapted from slides by L. Lazebnik (PowerPoint presentation).



SLIDE 1

Common recognition tasks

Adapted from Fei-Fei Li; slide from L. Lazebnik.

SLIDE 2

Image classification and tagging

  • outdoor
  • mountains
  • city
  • Asia
  • Lhasa

Adapted from Fei-Fei Li; slide from L. Lazebnik.

SLIDE 3

Object detection

  • find pedestrians

Adapted from Fei-Fei Li; slide from L. Lazebnik.

SLIDE 4

Activity recognition

  • walking
  • shopping
  • rolling a cart
  • sitting
  • talking

Adapted from Fei-Fei Li; slide from L. Lazebnik.

SLIDE 5

Semantic segmentation

Adapted from Fei-Fei Li; slide from L. Lazebnik.

SLIDE 6

Semantic segmentation

[Figure: street scene labeled per pixel: sky, mountain, building, tree, market stall, umbrella, person, lamp, ground]

Adapted from Fei-Fei Li; slide from L. Lazebnik.

SLIDE 7

Image description

This is a busy street in an Asian city. Mountains and a large palace or fortress loom in the background. In the foreground, we see colorful souvenir stalls and people walking around and shopping. One person in the lower left is pushing an empty cart, and a couple of people in the middle are sitting, possibly posing for a photograph.

Adapted from Fei-Fei Li; slide from L. Lazebnik.

SLIDE 8

The statistical learning framework

  • Apply a prediction function to a feature representation of the image to get the desired output:

f([apple image]) = “apple”, f([tomato image]) = “tomato”, f([cow image]) = “cow”

Slide from L. Lazebnik.

SLIDE 9

The statistical learning framework

y = f(x)

  • Training: given a training set of labeled examples {(x1, y1), …, (xN, yN)}, estimate the prediction function f by minimizing the prediction error on the training set
  • Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x)

In y = f(x): y is the output, f is the prediction function, and x is the feature representation.

Slide from L. Lazebnik.
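The training/testing steps above can be sketched with a minimal stand-in for f; here a nearest-centroid rule over toy feature vectors (the rule and data are illustrative assumptions, not the slide's method):

```python
# Minimal sketch of the statistical learning framework: estimate a
# prediction function f from labeled examples {(x1, y1), ..., (xN, yN)},
# then apply it to a never-before-seen test example.

def train(examples):
    """examples: list of (feature_vector, label). Returns a prediction function f."""
    sums, counts = {}, {}
    for x, y in examples:
        s = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            s[j] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def f(x):  # prediction function: label of the nearest class centroid
        def dist2(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(centroids, key=lambda y: dist2(centroids[y]))
    return f

# Training
f = train([([1.0, 0.0], "apple"), ([0.9, 0.1], "apple"),
           ([0.0, 1.0], "tomato"), ([0.1, 0.9], "tomato")])

# Testing on an unseen example
print(f([0.8, 0.2]))  # "apple": nearest to the apple centroid
```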

SLIDE 10

Steps

Training: training images → image features; with training labels → training → learned model
Testing: test image → image features → learned model → prediction

Slide credit: D. Hoiem

SLIDE 11

“Classic” recognition pipeline

Image pixels → feature representation → trainable classifier → class label

  • Hand-crafted feature representation
  • Off-the-shelf trainable classifier

Slide from L. Lazebnik.

SLIDE 12

Neural networks for images

[Figure: image fed into a fully connected layer]

Slide from L. Lazebnik.

SLIDE 13

Neural networks for images

[Figure: image input]

Slide from L. Lazebnik.

SLIDE 14

Neural networks for images

[Figure: image convolved with learned weights to produce a feature map]

Slide from L. Lazebnik.

SLIDE 15

Neural networks for images

[Figure: image convolved with another set of learned weights to produce another feature map]

Slide from L. Lazebnik.

SLIDE 16

Convolution as feature extraction

[Figure: image convolved with a bank of K filters to produce K feature maps]

Slide from L. Lazebnik.

SLIDE 17

Convolutional layer

[Figure: image passed through a convolutional layer with K filters, producing K feature maps]

Spatial resolution: (roughly) the same if a stride of 1 is used; reduced by a factor of S if a stride of S is used.

Slide from L. Lazebnik.
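The stride rule above follows the standard convolution output-size formula; a quick sketch (the padding parameter P is an assumption beyond the slide):

```python
def conv_output_size(w, f, stride=1, pad=0):
    """Spatial output size of a convolution: floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * pad) // stride + 1

# Stride 1 (with "same"-style padding) keeps the resolution roughly unchanged:
print(conv_output_size(32, 3, stride=1, pad=1))  # 32

# Stride S reduces the resolution by about a factor of S:
print(conv_output_size(32, 3, stride=2, pad=1))  # 16
```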

SLIDE 18

Convolutional layer

[Figure: K feature maps passed through a convolutional layer + ReLU with L filters of size F × F × K, producing L feature maps in the next layer]

Slide from L. Lazebnik.
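A naive pure-Python sketch of this layer: L filters of size F × F × K turn K feature maps into L feature maps (toy sizes and constant values are assumptions for illustration):

```python
def conv_layer(maps, filters):
    """maps: K feature maps, each H x W (lists of lists). filters: L filters,
    each F x F x K, indexed filters[l][i][j][k]. Valid convolution + ReLU."""
    K, H, W = len(maps), len(maps[0]), len(maps[0][0])
    L, F = len(filters), len(filters[0])
    out_h, out_w = H - F + 1, W - F + 1
    out = []
    for l in range(L):                      # one output feature map per filter
        fmap = [[0.0] * out_w for _ in range(out_h)]
        for y in range(out_h):
            for x in range(out_w):
                s = sum(filters[l][i][j][k] * maps[k][y + i][x + j]
                        for i in range(F) for j in range(F) for k in range(K))
                fmap[y][x] = max(0.0, s)    # ReLU non-linearity
        out.append(fmap)
    return out

# K = 2 input maps (4x4), L = 3 filters of size 2x2x2 -> 3 maps of size 3x3
maps = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
filters = [[[[0.1] * 2 for _ in range(2)] for _ in range(2)] for _ in range(3)]
out = conv_layer(maps, filters)
print(len(out), len(out[0]), len(out[0][0]))  # 3 3 3
```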

SLIDE 19

CNN pipeline

[Figure: input image → convolution (learned) → non-linearity → spatial pooling → feature maps]

Source: R. Fergus, Y. LeCun

SLIDE 20

CNN pipeline

[Figure: input image → convolution (learned) → non-linearity → spatial pooling → feature maps]

Sources: R. Fergus, Y. LeCun; Stanford CS231n

SLIDE 21

CNN pipeline

[Figure: input image → convolution (learned) → non-linearity → spatial pooling (max or average) → feature maps]

Source: R. Fergus, Y. LeCun

SLIDE 22

Max pooling layer

[Figure: K feature maps pooled with an F × F filter and stride S, taking the max value in each window, producing K feature maps at 1/S resolution]

Usually: F = 2 or 3, S = 2

Slide from L. Lazebnik.
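The pooling rule above (typical case F = 2, S = 2), as a short pure-Python sketch:

```python
def max_pool(fmap, f=2, stride=2):
    """Max pooling over an H x W feature map (list of lists):
    take the max value in each f x f window, stepping by `stride`."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[y + i][x + j] for i in range(f) for j in range(f))
             for x in range(0, w - f + 1, stride)]
            for y in range(0, h - f + 1, stride)]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]
# 4x4 map -> 2x2 map: spatial resolution reduced by the stride S = 2
print(max_pool(fmap))  # [[4, 2], [2, 8]]
```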

SLIDE 23

Summary: CNN pipeline

Softmax layer:

P(c | x) = exp(w_c · x) / Σ_{k=1}^{C} exp(w_k · x)

Slide from L. Lazebnik.
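The softmax layer P(c | x) = exp(w_c · x) / Σ_k exp(w_k · x) can be sketched directly (weights and features below are toy values):

```python
import math

def softmax_probs(W, x):
    """P(c | x) = exp(w_c . x) / sum_k exp(w_k . x), one row w_c of W per class."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(wc, x)) for wc in W]
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

W = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one weight row per class (C = 3)
p = softmax_probs(W, [1.0, 2.0])
print(sum(p))  # 1.0: a probability distribution over the C classes
```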

SLIDE 24

Training of multi-layer networks

  • Find network weights w to minimize the prediction loss between true and estimated labels of training examples:
    F(w) = Σ_i ℓ(y_i, z_i; w)
  • Possible losses (for binary problems, with network output g_w(y_i) and true label z_i):
    • Quadratic loss: ℓ(y_i, z_i; w) = (g_w(y_i) − z_i)²
    • Log-likelihood loss: ℓ(y_i, z_i; w) = −log Q_w(z_i | y_i)
    • Hinge loss: ℓ(y_i, z_i; w) = max(0, 1 − z_i g_w(y_i))

Slide from L. Lazebnik.
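The three binary losses above can be written out directly; in this sketch, g stands for the network output g_w(y_i), z for the true label, and q for the probability the network assigns to the true label (names are mine, not the slide's):

```python
import math

def quadratic_loss(g, z):
    """(g_w(y_i) - z_i)^2"""
    return (g - z) ** 2

def log_likelihood_loss(q):
    """-log Q_w(z_i | y_i), where q is the probability of the true label."""
    return -math.log(q)

def hinge_loss(g, z):
    """max(0, 1 - z_i * g_w(y_i)), for labels z in {-1, +1}."""
    return max(0.0, 1.0 - z * g)

print(quadratic_loss(0.5, 1.0))  # 0.25
print(hinge_loss(0.5, 1))        # 0.5: inside the margin, still penalized
print(hinge_loss(1.5, 1))        # 0.0: confidently correct, no penalty
```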

SLIDE 25

Convolutional networks

[Figure: conv (filters) → subsample → conv (filters) → subsample → linear (weights)]

Slide from B. Hariharan

SLIDE 26

Empirical Risk Minimization

Convolutional network training objective:

min_θ (1/N) Σ_{i=1}^{N} L(h(x_i; θ), y_i)

Gradient descent update:

θ^(t+1) = θ^(t) − λ (1/N) Σ_{i=1}^{N} ∇L(h(x_i; θ), y_i)

Slide from B. Hariharan
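The update θ^(t+1) = θ^(t) − λ (1/N) Σ ∇L, sketched on a toy one-parameter least-squares problem (for a real network the gradient would come from backprop through h):

```python
# Gradient descent on a toy empirical risk: h(x; theta) = theta * x,
# L = (h(x; theta) - y)^2, so dL/dtheta = 2 * (theta * x - y) * x.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2x, so theta* = 2
theta, lam = 0.0, 0.05                        # initial parameter, learning rate

for _ in range(200):
    # mean gradient over the N training examples
    grad = sum(2 * (theta * x - y) * x for x, y in data) / len(data)
    theta = theta - lam * grad                # theta(t+1) = theta(t) - lam * mean grad

print(round(theta, 4))  # 2.0
```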

SLIDE 27

Computing the gradient of the loss

∇L(h(x; θ), y), with z = h(x; θ):

∇_θ L(z, y) = (∂L(z, y)/∂z) (∂z/∂θ)

Slide from B. Hariharan

SLIDE 28

Convolutional networks

[Figure: conv (filters) → subsample → conv (filters) → subsample → linear (weights)]

Slide from B. Hariharan

SLIDE 29

The gradient of convnets

[Figure: chain x → f1 → z1 → f2 → z2 → f3 → z3 → f4 → z4 → f5 → z5 = z, with weights w1…w5]

Slide from B. Hariharan

SLIDE 30

The gradient of convnets

[Figure: chain x → f1 → z1 → … → f5 → z5 = z, with weights w1…w5]

∂z/∂w5 = ∂f5(z4, w5)/∂w5

Slide from B. Hariharan

SLIDE 31

The gradient of convnets

[Figure: chain x → f1 → z1 → … → f5 → z5 = z, with weights w1…w5]

∂z/∂w4 = (∂z/∂z4)(∂z4/∂w4) = (∂f5(z4, w5)/∂z4)(∂f4(z3, w4)/∂w4)

Slide from B. Hariharan


SLIDE 37

The gradient of convnets

[Figure: chain x → f1 → z1 → … → f5 → z5 = z, with weights w1…w5]

∂z/∂w3 = (∂z/∂z3)(∂z3/∂w3)

Slide from B. Hariharan


SLIDE 40

The gradient of convnets

[Figure: chain x → f1 → z1 → … → f5 → z5 = z, with weights w1…w5]

∂z/∂w3 = (∂z/∂z3)(∂z3/∂w3), where ∂z/∂z3 = (∂z/∂z4)(∂z4/∂z3)

Slide from B. Hariharan


SLIDE 42

The gradient of convnets

[Figure: chain x → f1 → z1 → … → f5 → z5 = z, with weights w1…w5]

∂z/∂z2 = (∂z/∂z3)(∂z3/∂z2), ∂z/∂w2 = (∂z/∂z2)(∂z2/∂w2)

Recurrence going backward!

Slide from B. Hariharan

SLIDE 43

The gradient of convnets: back-propagation

[Figure: chain x → f1 → z1 → … → f5 → z5 = z, with weights w1…w5]

Slide from B. Hariharan

SLIDE 44

Backpropagation for a sequence of functions f_i

  • Each “function” has a “forward” and a “backward” module
  • Forward module for f_i:
    • takes z_{i-1} and weight w_i as input
    • produces z_i as output
  • Backward module for f_i:
    • takes g(z_i) as input
    • produces g(z_{i-1}) and g(w_i) as output

g(w_i) = g(z_i) ∂z_i/∂w_i

g(z_{i-1}) = g(z_i) ∂z_i/∂z_{i-1}

Slide from B. Hariharan
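The forward/backward recurrences above can be run by hand on a toy chain where each step is f_i(z, w) = w · z (an assumed simple form, chosen so the analytic gradients are easy to check):

```python
# Backpropagation through a sequence z_i = f_i(z_{i-1}, w_i) = w_i * z_{i-1}.
# Backward modules: g(w_i) = g(z_i) * dz_i/dw_i,  g(z_{i-1}) = g(z_i) * dz_i/dz_{i-1}.

def forward(x, weights):
    zs = [x]
    for w in weights:
        zs.append(w * zs[-1])       # forward module: (z_{i-1}, w_i) -> z_i
    return zs

def backward(zs, weights):
    g_z = 1.0                       # dz/dz = 1 at the output
    g_ws = [0.0] * len(weights)
    for i in range(len(weights) - 1, -1, -1):
        g_ws[i] = g_z * zs[i]       # g(w_i) = g(z_i) * z_{i-1}
        g_z = g_z * weights[i]      # g(z_{i-1}) = g(z_i) * w_i  (the recurrence)
    return g_ws

weights = [2.0, 3.0, 4.0]
zs = forward(1.0, weights)          # z = 2 * 3 * 4 * 1 = 24
g_ws = backward(zs, weights)
print(zs[-1], g_ws)                 # 24.0 [12.0, 8.0, 6.0]
```

Each g(w_i) matches the analytic derivative of z = w1·w2·w3·x with respect to w_i, which is the point of the backward recurrence: one sweep computes all of them.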

SLIDE 45

Backpropagation for a sequence of functions f_i

[Figure: forward module f_i maps (z_{i-1}, w_i) to z_i]

Slide from B. Hariharan

SLIDE 46

Backpropagation for a sequence of functions f_i

[Figure: backward module f_i maps g(z_i) to g(z_{i-1}) and g(w_i)]

Slide from B. Hariharan

SLIDE 47

Computation graph: functions

  • Each node implements two functions
    • A “forward”: computes output given input
    • A “backward”: computes the derivative of z w.r.t. input, given the derivative of z w.r.t. output

Slide from B. Hariharan

SLIDE 48

Computation graphs

[Figure: node f_i with inputs a, b, c and output d]

Slide from B. Hariharan

SLIDE 49

Computation graphs

[Figure: node f_i in the backward pass, turning ∂z/∂d into ∂z/∂a, ∂z/∂b, ∂z/∂c]

Slide from B. Hariharan


SLIDE 52

Neural network frameworks

Slide from B. Hariharan

SLIDE 53

Image classification and tagging

  • outdoor
  • mountains
  • city
  • Asia
  • Lhasa

Adapted from Fei-Fei Li

SLIDE 54

[Figure: ImageNet challenge winners’ top-5 error rates, 2010–2015 (y-axis roughly 5–30%), dropping sharply with convolutional networks: 7 layers (2012), 19 layers (2014), 152 layers (2015)]

Slide from B. Hariharan

SLIDE 55

Transfer learning with convolutional networks

[Figure: image → trained feature extractor → linear classifier → “Horse”]

Slide from B. Hariharan

SLIDE 56

Transfer learning with convolutional networks

  • What do we do for a new image classification problem?
  • Key idea:
    • Freeze the parameters in the feature extractor
    • Retrain the classifier

[Figure: trained feature extractor feeding a linear classifier]

Slide from B. Hariharan
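A pure-Python sketch of the key idea: the feature extractor's parameters stay frozen while only the linear classifier is retrained. The "extractor" here is a fixed toy linear map standing in for pre-trained conv layers, and the perceptron update is just one simple choice of classifier training:

```python
# Transfer learning sketch: frozen feature extractor + retrained linear classifier.

FROZEN = [[1.0, -1.0], [0.5, 2.0]]           # frozen parameters: never updated

def extract(x):                               # stand-in for a pre-trained CNN
    return [sum(w * v for w, v in zip(row, x)) for row in FROZEN]

def train_classifier(data, epochs=50, lr=0.1):
    """Perceptron-style training of the classifier only; FROZEN is untouched."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in data:                 # label in {-1, +1}
            feats = extract(x)                # frozen: no update flows back here
            score = sum(wi * f for wi, f in zip(w, feats)) + b
            if label * score <= 0:            # misclassified: update classifier only
                w = [wi + lr * label * f for wi, f in zip(w, feats)]
                b += lr * label
    return w, b

data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
w, b = train_classifier(data)
score = lambda x: sum(wi * f for wi, f in zip(w, extract(x))) + b
print(score([1.0, 0.0]) > 0, score([0.0, 1.0]) < 0)  # True True
```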

SLIDE 57

Finetuning

[Figure: network predicting “Horse”]

Slide from B. Hariharan

SLIDE 58

Finetuning

  • Initialize with the pre-trained network, then train the whole network with a low learning rate

[Figure: network predicting “Bakery”]

Slide from B. Hariharan

SLIDE 59

Object detection

  • find pedestrians

Adapted from Fei-Fei Li

SLIDE 60

R-CNN: Regions with CNN features

[Figure: input image → extract region proposals (~2k / image) → compute CNN features → classify regions (linear SVM)]

“Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” R. Girshick, J. Donahue, T. Darrell, J. Malik. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. Figure credit: Ross Girshick

SLIDE 61

R-CNN at test time: Step 2

[Figure: input image → extract region proposals (~2k / image) → compute CNN features]

  • a. Crop

Slide credit : Ross Girshick

SLIDE 62

R-CNN at test time: Step 2

[Figure: input image → extract region proposals (~2k / image) → compute CNN features]

  • a. Crop
  • b. Scale (anisotropic) to 227 × 227

Slide credit : Ross Girshick

SLIDE 63

R-CNN at test time: Step 2

[Figure: input image → extract region proposals (~2k / image) → compute CNN features]

  • a. Crop
  • b. Scale (anisotropic)
  • c. Forward propagate; output: “fc7” features

Slide credit : Ross Girshick

SLIDE 64

R-CNN at test time: Step 3

[Figure: input image → extract region proposals (~2k / image) → compute CNN features → classify regions]

Warped proposal → 4096-dimensional fc7 feature vector → linear classifiers (SVM or softmax): person? 1.6, horse? −0.3, …

Slide credit : Ross Girshick

SLIDE 65

R-CNN at test time: Step 4: Object proposal refinement

Bounding-box regression: linear regression on CNN features maps the original proposal to a predicted object bounding box.

[Figure: original proposal vs. predicted object bounding box]

Slide credit : Ross Girshick

SLIDE 66

COCO Object Detection Average Precision (%)

[Figure: bar chart, early 2015 to late 2018: DPM (pre-DL) 5, Fast R-CNN (AlexNet) 15, Fast R-CNN (VGG-16) 19, Faster R-CNN (VGG-16) 29, Faster R-CNN (ResNet-50) 36, Faster R-CNN (R-101-FPN) 39, Mask R-CNN (X-152-FPN) 49]

Past (best circa 2012) vs. progress within DL methods: > 3× in ~4 years

SLIDE 67

Steady Progress on Boxes and Masks

[Figure: COCO detection AP by method: DPM (pre-DL) 5, Fast R-CNN (AlexNet) 15, Fast R-CNN (VGG-16) 19, Faster R-CNN (VGG-16) 29, Faster R-CNN (ResNet-50) 36, Faster R-CNN (R-101-FPN) 39, Mask R-CNN (X-152-FPN) 46]

  • R-CNN [Girshick et al. 2014]
  • SPP-net [He et al. 2014]
  • Fast R-CNN [Girshick 2015]
  • Faster R-CNN [Ren et al. 2015]
  • R-FCN [Dai et al. 2016]
  • Feature Pyramid Networks + Faster R-CNN [Lin et al. 2017]
  • Mask R-CNN [He et al. 2017]
  • Training with Large Minibatches (MegDet) [Peng, Xiao, Li, et al. 2017]
  • Cascade R-CNN [Cai & Vasconcelos 2018]
  • …

SLIDE 68

Semantic segmentation

[Figure: street scene labeled per pixel: sky, mountain, building, tree, market stall, umbrella, person, lamp, ground]

Adapted from Fei-Fei Li

SLIDE 69

Semantic segmentation using convolutional networks

[Figure: input image of size h × w × 3]

Slide from B. Hariharan

SLIDE 70

Semantic segmentation using convolutional networks

[Figure: feature map of size h/4 × w/4 × c after convolution and subsampling]

Slide from B. Hariharan

SLIDE 71

Semantic segmentation using convolutional networks

[Figure: feature map of size h/4 × w/4 × c]

Slide from B. Hariharan

SLIDE 72

Semantic segmentation using convolutional networks

[Figure: h/4 × w/4 × c feature map; each c-dimensional column can be considered a feature vector for a pixel]

Slide from B. Hariharan

SLIDE 73

Semantic segmentation using convolutional networks

[Figure: h/4 × w/4 × c feature map convolved with #classes 1×1 filters, producing h/4 × w/4 × #classes scores]

Slide from B. Hariharan

SLIDE 74

Semantic segmentation using convolutional networks

  • Pass the image through convolution and subsampling layers
  • Final convolution with #classes outputs
  • Get scores for the subsampled image
  • Upsample back to the original size

Slide from B. Hariharan
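The steps above can be sketched in pure Python on toy sizes: per-pixel class scores via a 1×1 convolution over the c-channel feature map, then nearest-neighbor upsampling back to the original resolution (the feature values and filters below are invented for illustration):

```python
# Semantic segmentation head sketch: 1x1 convolution over an h/4 x w/4 x c
# feature map to get per-pixel class scores, then nearest-neighbor upsampling.

def conv1x1(feat, filters):
    """feat[y][x] is a c-vector; filters is (#classes x c). Per-pixel dot products."""
    return [[[sum(f * v for f, v in zip(flt, feat[y][x])) for flt in filters]
             for x in range(len(feat[0]))]
            for y in range(len(feat))]

def upsample_nn(scores, factor):
    """Nearest-neighbor upsampling back toward the original resolution."""
    return [[scores[y // factor][x // factor]
             for x in range(len(scores[0]) * factor)]
            for y in range(len(scores) * factor)]

def predict(scores):
    """Per-pixel argmax over class scores."""
    return [[max(range(len(px)), key=lambda k: px[k]) for px in row]
            for row in scores]

# Toy h/4 x w/4 = 2x2 feature map with c = 3 channels, 2 classes.
feat = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
        [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]]
filters = [[1.0, 0.0, 0.0],   # class 0 responds to channel 0
           [0.0, 1.0, 1.0]]   # class 1 responds to channels 1 and 2
labels = predict(upsample_nn(conv1x1(feat, filters), 4))
print(len(labels), len(labels[0]))  # 8 8: back to the full h x w grid
```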

SLIDE 75

Semantic segmentation using convolutional networks

[Figure: predicted segmentation: person, bicycle]

Slide from B. Hariharan

SLIDE 76

Image description

This is a busy street in an Asian city. Mountains and a large palace or fortress loom in the background. In the foreground, we see colorful souvenir stalls and people walking around and shopping. One person in the lower left is pushing an empty cart, and a couple of people in the middle are sitting, possibly posing for a photograph.

Adapted from Fei-Fei Li

SLIDE 77

Captioning

[Figure: a CNN encodes the image into the initial hidden state h0 of a recurrent network, which then generates the caption one word at a time, START → “Dog” → “in” → “a” → “hat” → END, passing through hidden states h1…h5]

Slide from D. Fouhey. See also: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

SLIDE 78

Vision Review

  • The vision problem
  • Why is vision hard?
  • 3D reconstruction (basis from SLAM)
  • Recognition

Additional resources:

  • CS 543 / ECE 549 at UIUC: http://saurabhg.web.illinois.edu/teaching/ece549/sp2020/
  • CS 231n at Stanford: http://cs231n.stanford.edu/
  • A. Davison’s Robotics course: https://www.doc.ic.ac.uk/~ajd/Robotics/