SLIDE 1

AMMI – Introduction to Deep Learning 7.3. Networks for object detection

François Fleuret https://fleuret.org/ammi-2018/ Wed Aug 29 16:58:03 CAT 2018

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

SLIDES 2–22 (incremental builds of one slide)

The simplest strategy to move from image classification to object detection is to classify local regions, at multiple scales and locations.

[Figure: parsing at a fixed scale, then the final list of detections.]

This “sliding window” approach evaluates a classifier many times, and its computational cost increases with the desired prediction accuracy.

François Fleuret AMMI – Introduction to Deep Learning / 7.3. Networks for object detection 1 / 15
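The sliding-window strategy above can be sketched in a few lines. This is a minimal illustration, not code from the lecture: `classify(patch)` is a hypothetical scorer returning a confidence in [0, 1], and the window size, stride, scales, and threshold are arbitrary choices.

```python
import numpy as np

def sliding_window_detect(image, classify, win=32, stride=16, scales=(1.0, 0.5)):
    """Score a window classifier at multiple scales and locations.

    `classify(patch)` is a hypothetical scorer returning a confidence in [0, 1];
    detections are (x, y, w, h, score) tuples in original-image coordinates.
    """
    detections = []
    for s in scales:
        k = int(round(1 / s))          # subsampling factor for this scale
        scaled = image[::k, ::k]       # crude nearest-neighbour downscaling
        H, W = scaled.shape[:2]
        for y in range(0, H - win + 1, stride):
            for x in range(0, W - win + 1, stride):
                score = classify(scaled[y:y + win, x:x + win])
                if score > 0.5:
                    detections.append((x * k, y * k, win * k, win * k, score))
    return detections
```

With a toy brightness classifier such as `lambda p: float(p.mean())`, the scan over a 64 × 64 image already evaluates the classifier dozens of times, which is exactly the cost the slide refers to.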

SLIDES 23–24

This was mitigated in Overfeat (Sermanet et al., 2013) by adding a regression part to predict the object’s bounding box.

[Figure: input image → conv layers → max-pooling → 1000d FC layers for classification, and 4d FC layers for localization.]

François Fleuret AMMI – Introduction to Deep Learning / 7.3. Networks for object detection 2 / 15

SLIDE 25

In the single-object case, the convolutional layers are frozen, and the localization layers are trained with an ℓ2 loss.

“Figure 7: Examples of bounding boxes produced by the regression network, before being combined into final predictions. The examples shown here are at a single scale. Predictions may be more optimal at other scales depending on the objects. Here, most of the bounding boxes which are initially organized as a grid, converge to a single location and scale. This indicates that the network is very confident in the location of the object, as opposed to being spread out randomly. The top left image shows that it can also correctly identify multiple location if several objects are present. The various aspect ratios of the predicted bounding boxes shows that the network is able to cope with various object poses.” (Sermanet et al., 2013)

Combining the multiple boxes is done with an ad hoc greedy algorithm.

François Fleuret AMMI – Introduction to Deep Learning / 7.3. Networks for object detection 3 / 15
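The slide does not detail Overfeat's merging heuristic, but the standard greedy stand-in for combining overlapping boxes is non-maximum suppression: keep the highest-scoring box, discard boxes that overlap it too much, and repeat. A sketch, with boxes in (x1, y1, x2, y2) corner form:

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, thresh=0.5):
    """Keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    kept = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= thresh for j in kept):
            kept.append(i)
    return kept
```

Note that Overfeat's actual procedure merges boxes by averaging rather than discarding them; the code above only illustrates the greedy structure such algorithms share.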

SLIDES 26–27

This architecture can be applied directly to detection by adding a class “Background” to the object classes. Negative samples are taken in each scene either at random, or by selecting the ones with the worst misclassification.

Surprisingly, using class-specific localization layers did not provide better results than having a single one shared across classes (Sermanet et al., 2013).

François Fleuret AMMI – Introduction to Deep Learning / 7.3. Networks for object detection 4 / 15

SLIDES 28–29

Other approaches evolved from AlexNet, relying on region proposals:

  • generate thousands of proposal bounding boxes with a non-CNN “objectness” approach such as selective search (Uijlings et al., 2013),
  • feed sub-images, cropped and warped from the input image, to an AlexNet-like network (“R-CNN”, Girshick et al., 2013), or crop from the convolutional feature maps to share computation (“Fast R-CNN”, Girshick, 2015).

These methods suffer from the cost of the region-proposal computation, which is neither convolutional nor GPU-friendly. They were improved by Ren et al. (2015) in “Faster R-CNN”, which replaces the region-proposal algorithm with convolutional processing similar to Overfeat’s.

François Fleuret AMMI – Introduction to Deep Learning / 7.3. Networks for object detection 5 / 15
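Cropping “in the feature maps” amounts to mapping an image-space box to feature-map coordinates through the network's cumulative stride, then pooling the region to a fixed size so it can feed fixed-size fully connected layers. A simplified RoI max-pooling sketch (the stride, output size, and function name are illustrative choices, not Fast R-CNN's exact implementation):

```python
import numpy as np

def roi_max_pool(fmap, box, stride=16, out=2):
    """Crop an image-space (x1, y1, x2, y2) box out of a C×H×W feature map
    and max-pool it to a fixed out×out grid (simplified RoI pooling)."""
    x1, y1, x2, y2 = (int(round(v / stride)) for v in box)
    x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)      # keep at least one cell
    region = fmap[:, y1:y2, x1:x2]
    C, h, w = region.shape
    ys = np.linspace(0, h, out + 1).astype(int)    # row bin boundaries
    xs = np.linspace(0, w, out + 1).astype(int)    # column bin boundaries
    pooled = np.zeros((C, out, out))
    for i in range(out):
        for j in range(out):
            cell = region[:,
                          ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[:, i, j] = cell.max(axis=(1, 2))
    return pooled
```

Because the convolutional features are computed once per image and only this cheap pooling is done per proposal, thousands of proposals no longer require thousands of forward passes.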

SLIDES 30–32

The most famous algorithm from this lineage is “You Only Look Once” (YOLO, Redmon et al., 2015). It comes back to a classical architecture, with a series of convolutional layers followed by a few fully connected layers. It is sometimes described as “one shot”, since a single information pathway suffices.

YOLO’s network is not a pre-existing one. It uses leaky ReLU, and its convolutional layers make use of 1 × 1 bottleneck filters (Lin et al., 2013) to control the memory footprint and computational cost.

François Fleuret AMMI – Introduction to Deep Learning / 7.3. Networks for object detection 6 / 15
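A 1 × 1 convolution is simply the same linear map across channels applied at every spatial position, which is why it can shrink the channel count cheaply before an expensive 3 × 3 convolution. A minimal sketch (my own illustration, not YOLO code):

```python
import numpy as np

def conv1x1(fmap, weight):
    """1×1 convolution: the same (C_out, C_in) linear map applied at every
    spatial position of a (C_in, H, W) feature map."""
    C_in, H, W = fmap.shape
    # Flatten spatial positions, apply the channel map, restore the shape.
    return (weight @ fmap.reshape(C_in, H * W)).reshape(-1, H, W)
```

A 3 × 3 convolution costs about 9 · C_in · C_out multiply-adds per pixel, so inserting a 1 × 1 bottleneck that reduces, say, 1024 channels to 256 before the 3 × 3 layer cuts the cost by roughly an order of magnitude.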

SLIDE 33

[Figure: S × S grid on input → bounding boxes + confidence, and class probability map → final detections.]

(Redmon et al., 2015)

François Fleuret AMMI – Introduction to Deep Learning / 7.3. Networks for object detection 7 / 15

SLIDES 34–36

The output corresponds to splitting the image into a regular S × S grid, with S = 7, and for each cell, to predict a 30d vector:

  • B = 2 bounding box coordinates and confidences,
  • C = 20 class probabilities, corresponding to the classes of Pascal VOC.

[Figure: the YOLO architecture (Redmon et al., 2015). A 448 × 448 × 3 input goes through: a 7×7×64-s-2 conv layer and a 2×2-s-2 maxpool; a 3×3×192 conv layer and a 2×2-s-2 maxpool; conv layers 1×1×128, 3×3×256, 1×1×256, 3×3×512 and a 2×2-s-2 maxpool; (1×1×256, 3×3×512) ×4, then 1×1×512, 3×3×1024 and a 2×2-s-2 maxpool; (1×1×512, 3×3×1024) ×2, then 3×3×1024 and 3×3×1024-s-2; conv layers 3×3×1024 and 3×3×1024; then two fully connected layers, of size 4096 and 7 × 7 × 30.]

The 30d vector of cell i is

$$\big(\hat{x}_{i,1}, \hat{y}_{i,1}, \hat{w}_{i,1}, \hat{h}_{i,1}, \hat{c}_{i,1}, \dots, \hat{x}_{i,B}, \hat{y}_{i,B}, \hat{w}_{i,B}, \hat{h}_{i,B}, \hat{c}_{i,B}, \hat{p}_{i,1}, \dots, \hat{p}_{i,C}\big),$$

that is 5B values followed by C values.

François Fleuret AMMI – Introduction to Deep Learning / 7.3. Networks for object detection 8 / 15
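The per-cell vector above can be decoded into image-level detections. A sketch under stated assumptions: the layout follows the slide (5B box values, then C class probabilities per cell), box centres are cell-relative offsets while w, h are image-relative, a box's score is its confidence times the best class probability, and the threshold value is my own choice:

```python
import numpy as np

def decode_yolo(output, S=7, B=2, C=20, conf_thresh=0.25):
    """Decode an S×S×(5B+C) YOLO output into (x, y, w, h, score, class) tuples.

    x, y are box centres in [0, 1] image coordinates (cell origin plus
    cell-relative offset); w, h are relative to the whole image.
    """
    detections = []
    for i in range(S):
        for j in range(S):
            cell = output[i, j]
            class_probs = cell[5 * B:]
            c = int(np.argmax(class_probs))
            for b in range(B):
                x, y, w, h, conf = cell[5 * b:5 * b + 5]
                score = conf * class_probs[c]
                if score > conf_thresh:
                    detections.append(((j + x) / S, (i + y) / S, w, h, score, c))
    return detections
```

The surviving boxes would then go through non-maximum suppression to produce the final list of detections.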

SLIDES 37–38

So the network predicts class scores and bounding-box regressions, and although the output comes from fully connected layers, it has a 2D structure. This in particular allows YOLO to leverage absolute location in the image to improve performance (e.g. vehicles tend to be at the bottom, umbrellas at the top), which may or may not be desirable.

François Fleuret AMMI – Introduction to Deep Learning / 7.3. Networks for object detection 9 / 15

SLIDE 39

During training, YOLO makes the assumption that any of the S² cells contains at most [the center of] a single object. We define, for every image, cell index i = 1, …, S², predicted box index j = 1, …, B and class index c = 1, …, C:

  • 1^obj_i is 1 if there is an object in cell i, and 0 otherwise,
  • 1^obj_{i,j} is 1 if there is an object in cell i and predicted box j is the most fitting one, and 0 otherwise,
  • p_{i,c} is 1 if there is an object of class c in cell i, and 0 otherwise,
  • x_i, y_i, w_i, h_i is the annotated object bounding box (defined only if 1^obj_i = 1, and relative in location and scale to the cell),
  • c_{i,j} is the IoU between predicted box j and the ground-truth target.

François Fleuret AMMI – Introduction to Deep Learning / 7.3. Networks for object detection 10 / 15
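The quantity c_{i,j} is a plain intersection-over-union. For axis-aligned boxes in centre form (cx, cy, w, h), as used by YOLO, it can be computed as follows (a minimal sketch, with the function name my own):

```python
def iou_cxcywh(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, w, h)."""
    # Convert centre form to corner form.
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    # Intersection extents are clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0
```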

SLIDE 40

The training procedure first computes, on each image, the values of the 1^obj_{i,j} and c_{i,j}, and then does one step to minimize

$$
\begin{aligned}
& \lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbf{1}^{\text{obj}}_{i,j}
\left[ (x_i - \hat{x}_{i,j})^2 + (y_i - \hat{y}_{i,j})^2
+ \left(\sqrt{w_i} - \sqrt{\hat{w}_{i,j}}\right)^2
+ \left(\sqrt{h_i} - \sqrt{\hat{h}_{i,j}}\right)^2 \right] \\
& + \lambda_{\text{obj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbf{1}^{\text{obj}}_{i,j} \, (c_{i,j} - \hat{c}_{i,j})^2
+ \lambda_{\text{noobj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \left(1 - \mathbf{1}^{\text{obj}}_{i,j}\right) \hat{c}_{i,j}^{\,2} \\
& + \lambda_{\text{classes}} \sum_{i=1}^{S^2} \mathbf{1}^{\text{obj}}_{i} \sum_{c=1}^{C} \left(p_{i,c} - \hat{p}_{i,c}\right)^2,
\end{aligned}
$$

where p̂_{i,c}, x̂_{i,j}, ŷ_{i,j}, ŵ_{i,j}, ĥ_{i,j}, ĉ_{i,j} are the network’s outputs (slightly re-written from Redmon et al., 2015).

François Fleuret AMMI – Introduction to Deep Learning / 7.3. Networks for object detection 11 / 15
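The loss above translates almost term-by-term into NumPy. A sketch with an assumed array layout (the names and the (S², 5B + C) arrangement mirroring the per-cell vector are my own; predicted widths and heights are assumed non-negative, as after YOLO's normalization):

```python
import numpy as np

def yolo_loss(pred, target, obj_ij, obj_i, iou,
              l_coord=5.0, l_obj=1.0, l_noobj=0.5, l_classes=1.0, B=2):
    """YOLO loss of the slide's formula, over one image.

    pred, target: (S*S, 5*B + C) arrays laid out like the 30d cell vector;
    obj_ij: (S*S, B) values of 1^obj_{i,j}; obj_i: (S*S,) values of 1^obj_i;
    iou: (S*S, B) values of c_{i,j}.
    """
    loss = 0.0
    for b in range(B):
        xh, yh, wh, hh, ch = (pred[:, 5 * b + k] for k in range(5))
        x, y, w, h = (target[:, 5 * b + k] for k in range(4))
        m = obj_ij[:, b]
        # Coordinate term, with square roots damping large boxes.
        loss += l_coord * np.sum(m * ((x - xh) ** 2 + (y - yh) ** 2
                                      + (np.sqrt(w) - np.sqrt(wh)) ** 2
                                      + (np.sqrt(h) - np.sqrt(hh)) ** 2))
        loss += l_obj * np.sum(m * (iou[:, b] - ch) ** 2)   # responsible boxes
        loss += l_noobj * np.sum((1 - m) * ch ** 2)         # empty predictions
    # Class term, only on cells that contain an object.
    loss += l_classes * np.sum(obj_i[:, None] * (target[:, 5 * B:] - pred[:, 5 * B:]) ** 2)
    return float(loss)
```

The default weights λ_coord = 5 and λ_noobj = 0.5 follow Redmon et al. (2015); in their formulation λ_obj and λ_classes are 1.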

SLIDES 41–43

Training YOLO relies on many engineering choices that illustrate well how involved deep learning is “in practice”:

  • pre-train the first 20 convolutional layers on ImageNet classification,
  • use 448 × 448 input for detection, instead of 224 × 224,
  • use leaky ReLU for all layers,
  • use dropout after the first fully connected layer,
  • normalize the bounding-box parameters in [0, 1],
  • use a quadratic loss not only for the bounding-box coordinates, but also for the confidence and the class scores,
  • reduce the weight of large bounding boxes by using the square roots of the sizes in the loss,
  • reduce the importance of empty cells by weighting their confidence-related loss less,
  • use momentum 0.9 and weight decay 5e−4,
  • use data augmentation with scaling, translation, and HSV transformation.

A critical technical point is the design of a loss function that articulates both classification and regression objectives.

François Fleuret AMMI – Introduction to Deep Learning / 7.3. Networks for object detection 12 / 15

SLIDE 44

The Single Shot MultiBox Detector (SSD, Liu et al., 2015) improves upon YOLO with a fully convolutional architecture and multi-scale maps.

[Figure (Liu et al., 2015): SSD takes a 300 × 300 × 3 image through VGG-16 up to the Conv5_3 layer (38 × 38 × 512), then extra feature layers — Conv6 (FC6) 19 × 19 × 1024, Conv7 (FC7) 19 × 19 × 1024, Conv8_2 10 × 10 × 512, Conv9_2 5 × 5 × 256, Conv10_2 3 × 3 × 256, Conv11_2 1 × 1 × 256 — with a 3×3×(k×(Classes+4)) convolutional classifier attached to each scale, yielding 8732 detections per class before non-maximum suppression (74.3 mAP at 59 FPS). YOLO, in comparison, takes a 448 × 448 × 3 image through its customized architecture to fully connected layers producing a 7 × 7 × 30 output, yielding 98 detections per class before non-maximum suppression (63.4 mAP at 45 FPS).]

François Fleuret AMMI – Introduction to Deep Learning / 7.3. Networks for object detection 13 / 15

SLIDE 45

To summarize roughly how “one shot” deep detection can be achieved:

  • networks trained on image classification capture localization information,
  • regression layers can be attached to classification-trained networks,
  • object localization does not have to be class-specific,
  • multiple detections are estimated at each location to account for different aspect ratios and scales.

François Fleuret AMMI – Introduction to Deep Learning / 7.3. Networks for object detection 14 / 15

SLIDE 46

Object detection networks:

  AlexNet (Krizhevsky et al., 2012)
  Overfeat (Sermanet et al., 2013):    box regression
  R-CNN (Girshick et al., 2013):       region proposal + crop in image
  Fast R-CNN (Girshick, 2015):         crop in feature maps
  Faster R-CNN (Ren et al., 2015):     convolutional region proposal
  YOLO (Redmon et al., 2015):          no crop
  SSD (Liu et al., 2015):              fully convolutional + multi-scale maps, multi-scale convolutions + multi boxes

François Fleuret AMMI – Introduction to Deep Learning / 7.3. Networks for object detection 15 / 15

SLIDE 47

The end

SLIDE 48

References

  • R. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013.
  • A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), 2012.
  • M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: Single shot multibox detector. CoRR, abs/1512.02325, 2015.
  • J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
  • S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
  • P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
  • J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.