Lecture 13: Segmentation and Attention
Fei-Fei Li & Andrej Karpathy & Justin Johnson
24 Feb 2016


SLIDE 1

Lecture 13: Segmentation and Attention

SLIDE 2

Administrative

  • Assignment 3 due tonight!
  • We are reading your milestones

SLIDE 3

Last time: Software Packages


Caffe, Torch, Theano, Lasagne, Keras, TensorFlow

SLIDE 4

Today

  • Segmentation
      ○ Semantic Segmentation
      ○ Instance Segmentation
  • (Soft) Attention
      ○ Discrete locations
      ○ Continuous locations (Spatial Transformers)

SLIDE 5

But first….

SLIDE 6

But first….


Szegedy et al, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016

New ImageNet Record today!

SLIDE 7

Inception-v4


Szegedy et al, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016

Figure annotations: V = valid convolution (no padding); 1x7 and 7x1 filters; 9 layers; strided convolution AND max pooling

SLIDE 8

Inception-v4


Szegedy et al, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016

Figure annotations: x4; 9 layers; 4 x 3 layers; 3 layers

SLIDE 9

Inception-v4


Szegedy et al, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016

Figure annotations: x7; 9 layers; 4 x 3 layers; 3 layers; 5 x 7 layers; 4 layers

SLIDE 10

Inception-v4


Szegedy et al, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016

Figure annotations: x3; 9 layers; 4 x 3 layers; 3 layers; 5 x 7 layers; 4 layers; 3 x 4 layers; 75 layers

SLIDE 11

Inception-ResNet-v2


9 layers

SLIDE 12

Inception-ResNet-v2


Figure annotations: 9 layers; x7; 5 x 4 layers; 3 layers

SLIDE 13

Inception-ResNet-v2


Figure annotations: 9 layers; x10; 5 x 4 layers; 3 layers; 3 layers; 10 x 4 layers

SLIDE 14

Inception-ResNet-v2


Figure annotations: 9 layers; x5; 5 x 3 layers; 3 layers; 3 layers; 10 x 4 layers; 5 x 4 layers; 75 layers

SLIDE 15

Inception-ResNet-v2


Residual and non-residual versions converge to a similar value, but the residual version learns faster

SLIDE 16

Today

  • Segmentation
      ○ Semantic Segmentation
      ○ Instance Segmentation
  • (Soft) Attention
      ○ Discrete locations
      ○ Continuous locations (Spatial Transformers)

SLIDE 17

Segmentation

SLIDE 18

Computer Vision Tasks

Classification (single object): CAT
Classification + Localization (single object): CAT
Object Detection (multiple objects): CAT, DOG, DUCK
Segmentation (multiple objects): CAT, DOG, DUCK

SLIDE 19

Computer Vision Tasks

The same four tasks; Classification and Classification + Localization were covered in Lecture 8

SLIDE 20

Computer Vision Tasks

Today: Segmentation

SLIDE 21

Semantic Segmentation

Label every pixel!
Don’t differentiate instances (the cows)
Classic computer vision problem

Figure credit: Shotton et al, “TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context”, IJCV 2007

SLIDE 22

Instance Segmentation


Figure credit: Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

Detect instances, give category, label pixels
“Simultaneous detection and segmentation” (SDS)
Lots of recent work (MS-COCO)

SLIDE 23

Semantic Segmentation


SLIDE 24

Semantic Segmentation


SLIDE 25

Semantic Segmentation


Extract patch

SLIDE 26

Semantic Segmentation


Extract patch → run through a CNN

SLIDE 27

Semantic Segmentation


COW

Extract patch → run through a CNN → classify center pixel

SLIDE 28

Semantic Segmentation


COW

Extract patch → run through a CNN → classify center pixel → repeat for every pixel
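This per-pixel pipeline can be sketched as a (very slow) sliding-window loop. `classify_patch` below is a hypothetical stand-in for the CNN classifier, just so the loop structure is concrete:

```python
import numpy as np

def classify_patch(patch):
    # Stand-in for a CNN: "classify" the center pixel by the patch's mean intensity.
    return 0 if patch.mean() < 0.5 else 1

def sliding_window_segment(img, patch_size=5):
    # Pad with edge values so every pixel has a full patch centered on it.
    r = patch_size // 2
    padded = np.pad(img, r, mode="edge")
    H, W = img.shape
    labels = np.zeros((H, W), dtype=int)
    for i in range(H):          # repeat for every pixel...
        for j in range(W):
            patch = padded[i:i + patch_size, j:j + patch_size]  # extract patch
            labels[i, j] = classify_patch(patch)                # classify center pixel
    return labels

img = np.zeros((8, 8)); img[:, 4:] = 1.0   # toy image: right half "cow"
seg = sliding_window_segment(img)
```

One CNN forward pass per pixel makes this extremely expensive, which motivates the fully convolutional approach on the next slide.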

SLIDE 29

Semantic Segmentation


Run a “fully convolutional” network to get all pixels at once
Smaller output due to pooling
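Why the output is smaller: every stride-2 pooling layer halves the spatial dimensions. A small sketch of the arithmetic (the five-pool count is an illustrative VGG-like assumption, not part of the slide):

```python
def output_size(h, w, num_pool, stride=2):
    # Each stride-2 pooling layer halves the spatial dimensions (floor division),
    # so a fully convolutional net with k pools predicts an H/2^k x W/2^k label map.
    for _ in range(num_pool):
        h, w = h // stride, w // stride
    return h, w

# A 224 x 224 input through 5 stride-2 pools gives a 7 x 7 output map,
# which must be upsampled back to 224 x 224 for per-pixel labels.
print(output_size(224, 224, 5))  # (7, 7)
```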

SLIDE 30

Semantic Segmentation: Multi-Scale


Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013

SLIDE 31

Semantic Segmentation: Multi-Scale


Resize image to multiple scales

SLIDE 32

Semantic Segmentation: Multi-Scale


Resize image to multiple scales
Run one CNN per scale

SLIDE 33

Semantic Segmentation: Multi-Scale


Resize image to multiple scales
Run one CNN per scale
Upscale outputs and concatenate

SLIDE 34

Semantic Segmentation: Multi-Scale


Resize image to multiple scales
Run one CNN per scale
Upscale outputs and concatenate
External “bottom-up” segmentation

SLIDE 35

Semantic Segmentation: Multi-Scale


Resize image to multiple scales
Run one CNN per scale
Upscale outputs and concatenate
External “bottom-up” segmentation
Combine everything for final outputs
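The first three steps can be sketched as follows; `fake_cnn` and the nearest-neighbor `resize_nn` are simplified stand-ins (the real pipeline uses a trained CNN and proper interpolation):

```python
import numpy as np

def fake_cnn(img):
    # Stand-in for a CNN: produce a 1-channel feature map at input resolution.
    return img[..., None] * 2.0

def resize_nn(img, out_h, out_w):
    # Nearest-neighbor resize on the first two axes; enough for a sketch.
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows][:, cols]

def multiscale_features(img, scales=(1.0, 0.5, 0.25)):
    H, W = img.shape
    maps = []
    for s in scales:
        small = resize_nn(img, max(1, int(H * s)), max(1, int(W * s)))  # resize input
        feat = fake_cnn(small)                                          # one CNN per scale
        maps.append(resize_nn(feat, H, W))                              # upscale output
    return np.concatenate(maps, axis=-1)                                # concatenate

feats = multiscale_features(np.random.rand(16, 16))   # (16, 16, 3) stacked feature map
```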

SLIDE 36

Semantic Segmentation: Refinement


Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014

Apply CNN once to get labels

SLIDE 37

Semantic Segmentation: Refinement


Apply CNN once to get labels
Apply AGAIN to refine labels

SLIDE 38

Semantic Segmentation: Refinement


Apply CNN once to get labels
Apply AGAIN to refine labels
And again!

SLIDE 39

Semantic Segmentation: Refinement


Apply CNN once to get labels
Apply AGAIN to refine labels
And again!
Same CNN weights: recurrent convolutional network

SLIDE 40

Semantic Segmentation: Refinement


Apply CNN once to get labels
Apply AGAIN to refine labels
And again!
Same CNN weights: recurrent convolutional network
More iterations improve results
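A minimal sketch of the recurrent idea: the same `refine` function (a toy stand-in for the shared-weight CNN, not the paper's architecture) is applied repeatedly to its own output, always conditioned on the input image:

```python
import numpy as np

def refine(labels, image):
    # Toy stand-in for one CNN pass: smooth the current label scores with a
    # 3x3 local average, then mix back in evidence from the image.
    padded = np.pad(labels, 1, mode="edge")
    out = np.zeros_like(labels)
    H, W = labels.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + 3, j:j + 3].mean()
    return 0.5 * out + 0.5 * image   # the SAME weights are reused every iteration

image = np.random.rand(8, 8)
labels = np.zeros((8, 8))
for _ in range(3):                   # apply the same "network" again and again
    labels = refine(labels, image)
```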

SLIDE 41

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015

Semantic Segmentation: Upsampling

SLIDE 42


Semantic Segmentation: Upsampling

Learnable upsampling!

SLIDE 43


Semantic Segmentation: Upsampling

SLIDE 44


Semantic Segmentation: Upsampling

“skip connections”

SLIDE 45


Semantic Segmentation: Upsampling

Skip connections = better results

SLIDE 46

Learnable Upsampling: “Deconvolution”


Typical 3 x 3 convolution, stride 1, pad 1
Input: 4 x 4 → Output: 4 x 4

SLIDE 47

Learnable Upsampling: “Deconvolution”


Typical 3 x 3 convolution, stride 1, pad 1
Input: 4 x 4 → Output: 4 x 4
Dot product between filter and input

SLIDE 48

Learnable Upsampling: “Deconvolution”


Typical 3 x 3 convolution, stride 1, pad 1
Input: 4 x 4 → Output: 4 x 4
Dot product between filter and input

SLIDE 49

Learnable Upsampling: “Deconvolution”


Typical 3 x 3 convolution, stride 2, pad 1
Input: 4 x 4 → Output: 2 x 2
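Both output sizes follow from standard convolution arithmetic, out = floor((in + 2·pad − kernel) / stride) + 1, which can be checked directly:

```python
def conv_output_size(n, kernel, stride, pad):
    # Standard convolution arithmetic: floor((n + 2*pad - kernel) / stride) + 1
    return (n + 2 * pad - kernel) // stride + 1

# 3 x 3 conv, stride 1, pad 1 keeps a 4 x 4 input at 4 x 4:
print(conv_output_size(4, kernel=3, stride=1, pad=1))  # 4
# 3 x 3 conv, stride 2, pad 1 shrinks 4 x 4 down to 2 x 2:
print(conv_output_size(4, kernel=3, stride=2, pad=1))  # 2
```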

SLIDE 50

Learnable Upsampling: “Deconvolution”


Typical 3 x 3 convolution, stride 2, pad 1
Input: 4 x 4 → Output: 2 x 2
Dot product between filter and input

SLIDE 51

Learnable Upsampling: “Deconvolution”


Typical 3 x 3 convolution, stride 2, pad 1
Input: 4 x 4 → Output: 2 x 2
Dot product between filter and input

SLIDE 52

Learnable Upsampling: “Deconvolution”


3 x 3 “deconvolution”, stride 2, pad 1
Input: 2 x 2 → Output: 4 x 4

SLIDE 53

Learnable Upsampling: “Deconvolution”


3 x 3 “deconvolution”, stride 2, pad 1
Input: 2 x 2 → Output: 4 x 4
Input gives weight for filter

SLIDE 54

Learnable Upsampling: “Deconvolution”


3 x 3 “deconvolution”, stride 2, pad 1
Input: 2 x 2 → Output: 4 x 4
Input gives weight for filter

SLIDE 55

Learnable Upsampling: “Deconvolution”


3 x 3 “deconvolution”, stride 2, pad 1
Input: 2 x 2 → Output: 4 x 4
Input gives weight for filter
Sum where output overlaps

SLIDE 56

Learnable Upsampling: “Deconvolution”


3 x 3 “deconvolution”, stride 2, pad 1
Input: 2 x 2 → Output: 4 x 4
Input gives weight for filter
Sum where output overlaps

Same as backward pass for normal convolution!

SLIDE 57

Learnable Upsampling: “Deconvolution”


3 x 3 “deconvolution”, stride 2, pad 1
Input: 2 x 2 → Output: 4 x 4
Input gives weight for filter
Sum where output overlaps

Same as backward pass for normal convolution!
“Deconvolution” is a bad name: the term is already defined as the inverse of convolution. Better names: convolution transpose, backward strided convolution, 1/2 strided convolution, upconvolution

SLIDE 58

Learnable Upsampling: “Deconvolution”


“Deconvolution” is a bad name: the term is already defined as the inverse of convolution. Better names: convolution transpose, backward strided convolution, 1/2 strided convolution, upconvolution

Im et al, “Generating images with recurrent adversarial networks”, arXiv 2016 Radford et al, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016

SLIDE 59

Learnable Upsampling: “Deconvolution”



Great explanation in appendix

SLIDE 60

Semantic Segmentation: Upsampling


Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

SLIDE 61

Semantic Segmentation: Upsampling


Normal VGG + “upside down” VGG
6 days of training on a Titan X…

SLIDE 62

Instance Segmentation


SLIDE 63

Instance Segmentation


Figure credit: Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

Detect instances, give category, label pixels
“Simultaneous detection and segmentation” (SDS)
Lots of recent work (MS-COCO)

SLIDE 64

Instance Segmentation


Hariharan et al, “Simultaneous Detection and Segmentation”, ECCV 2014

Similar to R-CNN, but with segments

SLIDE 65

Instance Segmentation


External segment proposals
Similar to R-CNN, but with segments

SLIDE 66

Instance Segmentation


External segment proposals
Similar to R-CNN, but with segments

SLIDE 67

Instance Segmentation


External segment proposals
Mask out background with mean image
Similar to R-CNN, but with segments
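The mean-image masking step is simple to sketch: background pixels in the cropped region are replaced by the dataset mean, so after mean subtraction inside the CNN the background is roughly zero. The shapes and mask here are hypothetical toy values:

```python
import numpy as np

def mask_with_mean(crop, mask, mean_image):
    # Keep foreground pixels (mask == True), replace background with the
    # dataset mean image so mean subtraction zeroes it out downstream.
    out = crop.copy()
    out[~mask] = mean_image[~mask]
    return out

crop = np.random.rand(4, 4, 3)                   # cropped proposal region
mean_image = np.full((4, 4, 3), 0.5)             # toy dataset mean
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                            # hypothetical segment proposal
masked = mask_with_mean(crop, mask, mean_image)
```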

SLIDE 68

Instance Segmentation


External segment proposals
Mask out background with mean image
Similar to R-CNN, but with segments

SLIDE 69

Instance Segmentation


External segment proposals
Mask out background with mean image
Similar to R-CNN, but with segments

SLIDE 70

Instance Segmentation: Hypercolumns


Hariharan et al, “Hypercolumns for Object Segmentation and Fine-grained Localization”, CVPR 2015

SLIDE 71

Instance Segmentation: Hypercolumns


SLIDE 72

Instance Segmentation: Cascades


Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

Similar to Faster R-CNN
Won COCO 2015 challenge (with ResNet)

SLIDE 73

Instance Segmentation: Cascades


Similar to Faster R-CNN
Won COCO 2015 challenge (with ResNet)

SLIDE 74

Instance Segmentation: Cascades


Similar to Faster R-CNN

Region proposal network (RPN)

Won COCO 2015 challenge (with ResNet)

SLIDE 75

Instance Segmentation: Cascades


Similar to Faster R-CNN

Region proposal network (RPN)
Reshape boxes to fixed size; figure/ground logistic regression

Won COCO 2015 challenge (with ResNet)

SLIDE 76

Instance Segmentation: Cascades


Similar to Faster R-CNN
Won COCO 2015 challenge (with ResNet)

Region proposal network (RPN)
Reshape boxes to fixed size; figure/ground logistic regression
Mask out background, predict object class

SLIDE 77

Instance Segmentation: Cascades


Similar to Faster R-CNN
Won COCO 2015 challenge (with ResNet)

Region proposal network (RPN)
Reshape boxes to fixed size; figure/ground logistic regression
Mask out background, predict object class
Learn entire model end-to-end!

SLIDE 78

Instance Segmentation: Cascades


[Figure: predictions vs. ground truth]

SLIDE 79

Segmentation Overview

  • Semantic segmentation
      ○ Classify all pixels
      ○ Fully convolutional models, downsample then upsample
      ○ Learnable upsampling: fractionally strided convolution
      ○ Skip connections can help
  • Instance segmentation
      ○ Detect instance, generate mask
      ○ Similar pipelines to object detection

SLIDE 80

Attention Models


SLIDE 81

Recall: RNN for Captioning


Image: H x W x 3

SLIDE 82

Recall: RNN for Captioning


Image: H x W x 3 Features: D

SLIDE 83

Recall: RNN for Captioning


Image: H x W x 3 Features: D

h0

Hidden state: H

SLIDE 84

Recall: RNN for Captioning


Image: H x W x 3 Features: D

h0

Hidden state: H

h1 y1

First word

d1: distribution over vocab
SLIDE 85

Recall: RNN for Captioning


Image: H x W x 3 Features: D

h0

Hidden state: H

h1 y1 h2 y2

First word Second word

d1, d2: distributions over vocab

SLIDE 86

Recall: RNN for Captioning


Image: H x W x 3 Features: D

h0

Hidden state: H

h1 y1 h2 y2

First word Second word

d1, d2: distributions over vocab

The RNN only looks at the whole image, once

SLIDE 87

Recall: RNN for Captioning


Image: H x W x 3 Features: D

h0

Hidden state: H

h1 y1 h2 y2

First word Second word

d1, d2: distributions over vocab

The RNN only looks at the whole image, once. What if the RNN looks at different parts of the image at each timestep?

SLIDE 88

Image: H x W x 3 Features: L x D

Soft Attention for Captioning

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

SLIDE 89

Image: H x W x 3 Features: L x D

h0

Soft Attention for Captioning

SLIDE 90

Image: H x W x 3 Features: L x D

h0 a1

Distribution over L locations

Soft Attention for Captioning

SLIDE 91

Image: H x W x 3 Features: L x D

h0 a1

Weighted combination of features

Distribution over L locations

Soft Attention for Captioning

z1

Weighted features: D

SLIDE 92

Image: H x W x 3 Features: L x D

h0 a1 z1

Weighted combination of features

h1

Distribution over L locations

Soft Attention for Captioning

Weighted features: D

y1

First word

SLIDE 93

Image: H x W x 3 Features: L x D

h0 a1 z1

Weighted combination of features

y1 h1

First word Distribution over L locations

Soft Attention for Captioning

a2 d1

Weighted features: D
Distribution over vocab

SLIDE 94

Image: H x W x 3 Features: L x D

h0 a1 z1

Weighted combination of features

y1 h1

First word Distribution over L locations

Soft Attention for Captioning

a2 d1 z2

Weighted features: D
Distribution over vocab

SLIDE 95

Image: H x W x 3 Features: L x D

h0 a1 z1

Weighted combination of features

y1 h1

First word Distribution over L locations

Soft Attention for Captioning

a2 d1 h2 z2 y2

Weighted features: D
Distribution over vocab

SLIDE 96

Image: H x W x 3 Features: L x D

h0 a1 z1

Weighted combination of features

y1 h1

First word Distribution over L locations

Soft Attention for Captioning

a2 d1 h2 a3 d2 z2 y2

Weighted features: D
Distribution over vocab

SLIDE 97

Image: H x W x 3 Features: L x D

h0 a1 z1

Weighted combination of features

y1 h1

First word Distribution over L locations

Soft Attention for Captioning

a2 d1 h2 a3 d2 z2 y2

Weighted features: D
Distribution over vocab

Guess which framework was used to implement this?

SLIDE 98

Image: H x W x 3 Features: L x D

h0 a1 z1

Weighted combination of features

y1 h1

First word Distribution over L locations

Soft Attention for Captioning

a2 d1 h2 a3 d2 z2 y2

Weighted features: D
Distribution over vocab

Guess which framework was used to implement this? Crazy RNN = Theano

SLIDE 99

Soft vs Hard Attention


Image: H x W x 3 → grid of features a, b, c, d (each D-dimensional)
From RNN: distribution over grid locations pa, pb, pc, pd with pa + pb + pc + pd = 1

SLIDE 100

Soft vs Hard Attention


Image: H x W x 3 → grid of features a, b, c, d (each D-dimensional)
From RNN: distribution over grid locations pa, pb, pc, pd with pa + pb + pc + pd = 1
Context vector z (D-dimensional)

SLIDE 101

Soft vs Hard Attention


Image: H x W x 3 → grid of features a, b, c, d (each D-dimensional)
From RNN: distribution over grid locations pa, pb, pc, pd with pa + pb + pc + pd = 1
Soft attention: summarize ALL locations
z = pa*a + pb*b + pc*c + pd*d
The derivative dz/dp is nice! Train with gradient descent
Context vector z (D-dimensional)
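The soft-attention context vector is just a probability-weighted sum of the grid features; a minimal numpy sketch (the scores here are arbitrary toy values standing in for the RNN's output):

```python
import numpy as np

D = 5
feats = np.random.rand(4, D)                 # rows are the grid features a, b, c, d

scores = np.array([2.0, 0.5, 0.1, -1.0])     # unnormalized scores from the RNN (toy values)
p = np.exp(scores) / np.exp(scores).sum()    # softmax: pa + pb + pc + pd = 1

z = p @ feats                                # z = pa*a + pb*b + pc*c + pd*d
```

Because z is a smooth function of p, dz/dp exists everywhere and the whole model trains with ordinary backpropagation.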

SLIDE 102

Soft vs Hard Attention


Image: H x W x 3 → grid of features a, b, c, d (each D-dimensional)
From RNN: distribution over grid locations pa, pb, pc, pd with pa + pb + pc + pd = 1
Soft attention: summarize ALL locations; z = pa*a + pb*b + pc*c + pd*d; the derivative dz/dp is nice! Train with gradient descent
Hard attention: sample ONE location according to p; z = that feature vector. With argmax, dz/dp is zero almost everywhere… Can’t use gradient descent; need reinforcement learning
Context vector z (D-dimensional)
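Hard attention replaces the weighted sum with a single sample; a toy sketch (the distribution p is an arbitrary stand-in for the RNN's output):

```python
import numpy as np

rng = np.random.default_rng(0)
feats = np.random.rand(4, 5)            # grid features a, b, c, d (each D=5)
p = np.array([0.7, 0.1, 0.1, 0.1])      # attention distribution from the RNN (toy values)

idx = rng.choice(4, p=p)                # sample ONE location according to p
z = feats[idx]                          # z = that single feature vector

# The sampling step is not differentiable w.r.t. p (the gradient is zero almost
# everywhere), so training uses REINFORCE-style reinforcement learning instead
# of plain backpropagation.
```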

SLIDE 103

Soft Attention for Captioning


[Figure: soft attention maps vs. hard attention maps]

SLIDE 104

Soft Attention for Captioning


SLIDE 105

Soft Attention for Captioning


Attention is constrained to a fixed grid! We’ll come back to this…

SLIDE 106

“Mi gato es el mejor” -> “My cat is the best”

Soft Attention for Translation


Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015

SLIDE 107

Soft Attention for Translation

“Mi gato es el mejor” -> “My cat is the best”


Distribution over input words

SLIDE 108

Soft Attention for Translation

“Mi gato es el mejor” -> “My cat is the best”


Distribution over input words

SLIDE 109

Soft Attention for Translation

“Mi gato es el mejor” -> “My cat is the best”


Distribution over input words

SLIDE 110

Soft Attention for Everything!


Speech recognition, attention over input sounds:
  • Chan et al, “Listen, Attend, and Spell”, arXiv 2015
  • Chorowski et al, “Attention-Based Models for Speech Recognition”, NIPS 2015

Image + question to answer, attention over image:
  • Xu and Saenko, “Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering”, arXiv 2015
  • Zhu et al, “Visual7W: Grounded Question Answering in Images”, arXiv 2015

Video captioning, attention over input frames:
  • Yao et al, “Describing Videos by Exploiting Temporal Structure”, ICCV 2015

Machine translation, attention over input words:
  • Luong et al, “Effective Approaches to Attention-based Neural Machine Translation”, EMNLP 2015

slide-111
SLIDE 111


Attending to arbitrary regions?


Image: H x W x 3
Features: L x D

The attention mechanism from Show, Attend and Tell only lets us softly attend to fixed grid positions … can we do better?
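A minimal sketch of this fixed-grid soft attention, with toy dimensions and a dot-product score standing in for the learned scoring function:

```python
import numpy as np

rng = np.random.default_rng(1)

# CNN features over a fixed grid, flattened: L = 7*7 = 49 locations, D channels.
L, D = 49, 16
features = rng.normal(size=(L, D))

# A hidden state (e.g. from the captioning RNN) scores each grid location;
# a dot product stands in for the learned scoring function.
h = rng.normal(size=(D,))
scores = features @ h                 # shape (L,)

# Softmax over the L positions -- attention is restricted to this fixed grid.
p = np.exp(scores - scores.max())
p /= p.sum()

# Expected feature vector under the attention distribution.
z = p @ features                      # shape (D,)
```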

slide-112
SLIDE 112


Attending to Arbitrary Regions


Graves, “Generating Sequences with Recurrent Neural Networks”, arXiv 2013

  • Read text, generate handwriting using an RNN
  • Attend to arbitrary regions of the output by predicting params of a mixture model
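The per-step attention window can be sketched as below. The mixture parameters are hardcoded toy values here, whereas Graves' RNN emits fresh (alpha, beta, kappa) at every pen step, and the character encodings are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy character sequence of length U with made-up embeddings.
U = 10
u = np.arange(U)
chars = rng.normal(size=(U, 4))            # stand-in character encodings

alpha = np.array([1.0, 0.5])               # component importances
beta = np.array([2.0, 1.0])                # component widths (inverse)
kappa = np.array([3.0, 3.5])               # component centres along the text

# Window weight for each character position u:
#   phi(u) = sum_k alpha_k * exp(-beta_k * (kappa_k - u)^2)
phi = (alpha[:, None]
       * np.exp(-beta[:, None] * (kappa[:, None] - u[None, :]) ** 2)).sum(axis=0)

# Soft window over the text: weighted sum of character encodings.
# Note phi is not normalized -- it is a mixture of Gaussians, not a softmax.
window = phi @ chars                       # shape (4,)
```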

slide-113
SLIDE 113


Attending to Arbitrary Regions


Graves, “Generating Sequences with Recurrent Neural Networks”, arXiv 2013

  • Read text, generate handwriting using an RNN
  • Attend to arbitrary regions of the output by predicting params of a mixture model

Which are real and which are generated?

slide-114
SLIDE 114


Attending to Arbitrary Regions


Graves, “Generating Sequences with Recurrent Neural Networks”, arXiv 2013

  • Read text, generate handwriting using an RNN
  • Attend to arbitrary regions of the output by predicting params of a mixture model

Which are real and which are generated?

REAL / GENERATED

slide-115
SLIDE 115


Attending to Arbitrary Regions: DRAW


Gregor et al, “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015

Classify images by attending to arbitrary regions of the input

slide-116
SLIDE 116


Attending to Arbitrary Regions: DRAW


Generate images by attending to arbitrary regions of the output

Gregor et al, “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015

Classify images by attending to arbitrary regions of the input


slide-118
SLIDE 118


Attending to Arbitrary Regions: Spatial Transformer Networks

Attention mechanism similar to DRAW, but easier to explain

Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015

slide-119
SLIDE 119


Spatial Transformer Networks


Input image: H x W x 3
Box coordinates: (xc, yc, w, h)
Cropped and rescaled image: X x Y x 3

Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015

slide-120
SLIDE 120


Spatial Transformer Networks


Input image: H x W x 3
Box coordinates: (xc, yc, w, h)
Cropped and rescaled image: X x Y x 3

Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015

Can we make this function differentiable?

slide-121
SLIDE 121


Spatial Transformer Networks


Input image: H x W x 3
Box coordinates: (xc, yc, w, h)
Cropped and rescaled image: X x Y x 3

Can we make this function differentiable?

Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015

Idea: A function mapping pixel coordinates (xt, yt) of the output to pixel coordinates (xs, ys) of the input

slide-122
SLIDE 122


Spatial Transformer Networks


Input image: H x W x 3
Box coordinates: (xc, yc, w, h)
Cropped and rescaled image: X x Y x 3

Can we make this function differentiable?

Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015

Idea: A function mapping pixel coordinates (xt, yt) of the output to pixel coordinates (xs, ys) of the input


slide-124
SLIDE 124


Spatial Transformer Networks


Input image: H x W x 3
Box coordinates: (xc, yc, w, h)
Cropped and rescaled image: X x Y x 3

Can we make this function differentiable?

Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015

Idea: A function mapping pixel coordinates (xt, yt) of the output to pixel coordinates (xs, ys) of the input

Repeat for all pixels in the output to get a sampling grid
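For the crop-and-rescale special case given by the box (xc, yc, w, h), the sampling grid can be sketched as below. The function name and values are illustrative; the full spatial transformer predicts a general affine transform rather than just a box.

```python
import numpy as np

def sampling_grid(xc, yc, w, h, X, Y):
    """Map each output pixel (xt, yt) of an X x Y crop to input coords (xs, ys)."""
    # Normalized output coordinates in [-1, 1].
    xt, yt = np.meshgrid(np.linspace(-1.0, 1.0, X), np.linspace(-1.0, 1.0, Y))
    # Affine map: scale by half the box size, shift to the box centre.
    xs = xc + (w / 2.0) * xt
    ys = yc + (h / 2.0) * yt
    return xs, ys                      # each of shape (Y, X)

# A 4 x 3 sampling grid for a 20 x 10 box centred at (50, 40):
# xs spans [40, 60] and ys spans [35, 45], i.e. exactly the box.
xs, ys = sampling_grid(xc=50.0, yc=40.0, w=20.0, h=10.0, X=4, Y=3)
```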

slide-125
SLIDE 125


Spatial Transformer Networks


Input image: H x W x 3
Box coordinates: (xc, yc, w, h)
Cropped and rescaled image: X x Y x 3

Can we make this function differentiable?

Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015

Idea: A function mapping pixel coordinates (xt, yt) of the output to pixel coordinates (xs, ys) of the input

Repeat for all pixels in the output to get a sampling grid, then use bilinear interpolation to compute the output
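Bilinear sampling can be sketched as below; `bilinear_sample` is an illustrative helper, and the key point is that the output is a smooth function of the sampling coordinates, so gradients can flow back through them.

```python
import numpy as np

def bilinear_sample(img, xs, ys):
    """Sample img (H x W) at real-valued coords; differentiable in xs, ys."""
    H, W = img.shape
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    x1, y1 = x0 + 1, y0 + 1
    # Fractional offsets inside the pixel cell.
    dx = np.clip(xs, 0, W - 1) - x0
    dy = np.clip(ys, 0, H - 1) - y0
    # Weighted sum of the four neighbouring pixels.
    return (img[y0, x0] * (1 - dx) * (1 - dy)
          + img[y0, x1] * dx       * (1 - dy)
          + img[y1, x0] * (1 - dx) * dy
          + img[y1, x1] * dx       * dy)

img = np.arange(16, dtype=float).reshape(4, 4)
# Sampling exactly at a pixel centre recovers that pixel value...
v = bilinear_sample(img, np.array([1.0]), np.array([2.0]))   # img[2, 1] == 9
# ...and sampling halfway between two pixels gives their average.
m = bilinear_sample(img, np.array([1.5]), np.array([2.0]))   # (9 + 10) / 2
```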

slide-126
SLIDE 126


Spatial Transformer Networks


Input image: H x W x 3
Box coordinates: (xc, yc, w, h)
Cropped and rescaled image: X x Y x 3

Can we make this function differentiable?

Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015

Idea: A function mapping pixel coordinates (xt, yt) of the output to pixel coordinates (xs, ys) of the input

Repeat for all pixels in the output to get a sampling grid, then use bilinear interpolation to compute the output

The network attends to the input by predicting the box coordinates (xc, yc, w, h)

slide-127
SLIDE 127


Spatial Transformer Networks


Input: Full image
Output: Region of interest from the input

slide-128
SLIDE 128


Spatial Transformer Networks


Input: Full image
A small localization network predicts a transform θ
Output: Region of interest from the input

slide-129
SLIDE 129


Spatial Transformer Networks


Input: Full image
A small localization network predicts a transform θ
A grid generator uses θ to compute the sampling grid
Output: Region of interest from the input

slide-130
SLIDE 130


Spatial Transformer Networks


Input: Full image
A small localization network predicts a transform θ
A grid generator uses θ to compute the sampling grid
A sampler uses bilinear interpolation to produce the output
Output: Region of interest from the input

slide-131
SLIDE 131


Spatial Transformer Networks


A differentiable “attention / transformation” module

Insert spatial transformers into a classification network and it learns to attend to and transform the input


slide-133
SLIDE 133


Attention Recap

  • Soft attention:

○ Easy to implement: produce a distribution over input locations, reweight the features, and feed them as input
○ Attend to arbitrary input locations using spatial transformer networks

  • Hard attention:

○ Attend to a single input location
○ Can’t use gradient descent!
○ Need reinforcement learning!
