SLIDE 1

CSC2548: Machine Learning in Computer Vision

YOLO9000: Better, Faster, Stronger

1 Haris Khan

Date: January 24, 2018 Prepared by Haris Khan (University of Toronto)

SLIDE 2


Overview

  • 1. Motivation for one-shot object detection and weakly-supervised learning
  • 2. YOLO
  • 3. YOLOv2 / YOLO9000
  • 4. Future Work


SLIDE 3


One-Shot Detection

Motivation:

  • Develop object detection methods that predict bounding boxes and class probabilities at the same time
  • Eliminate the region proposal step used in R-CNN [4], Fast R-CNN [5] and Faster R-CNN [6]
  • Achieve real-time detection speeds
  • Maintain / exceed the accuracy benchmarks set by previous region proposal methods


SLIDE 4


Improving Detection Datasets

VOC 2007 / 2012:

  • 20 classes
  • e.g. person, cat, dog, car, chair, bottle

ImageNet1000:

  • 1000 classes
  • e.g. German shepherd, golden retriever, European fire salamander

MS COCO:

  • 80 classes
  • e.g. book, apple, teddy bear, scissors

Motivation:

  • Increase the number and detail of classes that can be learned during training by combining existing detection and classification datasets

SLIDE 5


You Only Look Once (YOLO) [1]

  • 1. Divide the input image into an S × S grid; each grid cell predicts C bounding boxes
  • 2. Bounding box feature vector = [x, y, w, h, confidence]
  • 3. Each grid cell also predicts D object class probabilities
  • 4. Merge predictions into an S × S × (5C + D) output tensor

PASCAL VOC:

  • S = 7, C = 2, D = 20
  • Output tensor size = 7 × 7 × 30

Image Credit: [1]
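The tensor arithmetic above (7 × 7 × (5·2 + 20) = 7 × 7 × 30) is easy to sanity-check; a minimal sketch in plain Python (the function name is illustrative, not from the paper):

```python
# Each of the S x S grid cells predicts C bounding boxes
# ([x, y, w, h, confidence] = 5 numbers each) plus D class probabilities.
def yolo_output_shape(S, C, D):
    return (S, S, 5 * C + D)

# PASCAL VOC configuration: S = 7, C = 2 boxes, D = 20 classes
print(yolo_output_shape(7, 2, 20))  # (7, 7, 30)
```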
SLIDE 6


YOLO - Architecture

  • Inspired by GoogLeNet
  • 24 convolutional layers + 2 FC layers


Grid creation, bounding box & class predictions. Image Credit: [1]

SLIDE 7


YOLO - Training Loss


  • Only back-propagate loss if object is present

Image Credit: [1]
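The paper's multi-part squared-error loss can be sketched for a single predictor (one box in one grid cell); the dict interface and names here are illustrative, and the real loss sums this over all grid cells and boxes:

```python
import math

LAMBDA_COORD = 5.0   # loss weights from the YOLO paper
LAMBDA_NOOBJ = 0.5

def predictor_loss(pred, target, object_present):
    """Squared-error loss for one bounding-box predictor (illustrative)."""
    if not object_present:
        # no object in this cell: penalize only confidence, down-weighted
        return LAMBDA_NOOBJ * (pred["conf"] - 0.0) ** 2
    coord = (pred["x"] - target["x"]) ** 2 + (pred["y"] - target["y"]) ** 2
    # square roots make errors on small boxes count more than on large ones
    size = (math.sqrt(pred["w"]) - math.sqrt(target["w"])) ** 2 \
         + (math.sqrt(pred["h"]) - math.sqrt(target["h"])) ** 2
    conf = (pred["conf"] - 1.0) ** 2
    probs = sum((p - t) ** 2 for p, t in zip(pred["probs"], target["probs"]))
    return LAMBDA_COORD * (coord + size) + conf + probs
```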

SLIDE 8


YOLO - Test Results

  • Primary evaluation done on VOC 2007 & 2012 test sets


VOC 2007 & VOC 2012 Test Results. Table Credits: [1]. *Speed measured on a Titan X GPU.

SLIDE 9


YOLO - Limitations

  • Produces more localization errors than Fast R-CNN
  • Struggles to detect small, repeated objects (e.g. flocks of birds)
  • Bounding box priors are not used during training


Image Credit: [1]

SLIDE 10


YOLO9000 - Paper Overview

YOLOv2 [2]:

  • Modified version of the original YOLO that increases detection speed and accuracy

YOLO9000 [2]:

  • Training method that increases the number of classes a detection network can learn by using weakly-supervised training on the union of detection (e.g. VOC, COCO) and classification (e.g. ImageNet) datasets


SLIDE 11


YOLOv2 - Modifications


Bounding Boxes:

  • Anchor boxes -> 7% recall increase
  • Dimension clusters + new bounding box parameterization -> 4.8% mAP increase

Architecture:

  • New Darknet-19 replaces GoogLeNet -> 33% computation decrease, 0.4% mAP increase
  • Convolutional prediction layer -> 0.3% mAP increase

Training:

  • Batch normalization -> 2% mAP increase
  • High resolution fine-tuning of weights -> 4% mAP increase
  • Multi-scale images -> 1.1% mAP increase
  • Passthrough for fine-grained features -> 1% mAP increase

SLIDE 12


YOLOv2 - Bounding Boxes

  • Anchor boxes allow multiple objects of various aspect ratios to be detected in a single grid cell
  • Anchor box sizes are determined by k-means clustering of the VOC 2007 training set
  • k = 5 provides the best trade-off between average IOU and model complexity
  • Average IOU = 61.0%
  • Feature vector parameterization directly predicts the bounding box centre point, width and height

Image Credit: [2]
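The clustering step can be sketched as follows, assuming boxes are given as (width, height) pairs. The paper's distance metric is d(box, centroid) = 1 − IOU(box, centroid), which amounts to assigning each box to the centroid with the highest IOU; the helper names below are illustrative, not the authors' code:

```python
import random

def iou_wh(a, b):
    # IOU of two boxes aligned at a shared corner, given (width, height) only
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=20, seed=0):
    random.seed(seed)
    centroids = random.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # assign each box to the centroid with the highest IOU
            # (i.e. the smallest 1 - IOU distance)
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        for i, cl in enumerate(clusters):
            if cl:  # recompute centroid as the mean width / height
                centroids[i] = (sum(b[0] for b in cl) / len(cl),
                                sum(b[1] for b in cl) / len(cl))
    return centroids
```

The resulting centroids serve as the anchor box priors; the paper picks k = 5 by plotting average IOU against k.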

SLIDE 13


YOLOv2 - DarkNet-19

  • 19 convolutional layers and 5 max-pooling layers
  • Reduced number of FLOPs:
    • VGG-16 -> 30.67 billion
    • YOLO -> 8.52 billion
    • YOLOv2 -> 5.58 billion

DarkNet-19 for Image Classification. Table Credit: [2]

SLIDE 14


YOLOv2 - Example


Video link: https://youtu.be/Cgxsv1riJhI?t=290

SLIDE 15


YOLO9000 - Concept


Image Credits: [2]


SLIDE 16


Slide Credit: Joseph Redmon [3]

SLIDE 17


YOLO9000 - WordTree


Image Credit: [2]
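The WordTree figure on this slide encodes the key idea: the absolute score of a class is the product of conditional probabilities along the path from the root. A minimal illustrative sketch (the toy tree and probabilities below are made up, not from the paper):

```python
# Toy WordTree: each node stores its parent and P(node | parent).
parent = {"terrier": "dog", "dog": "animal", "animal": None}
cond_prob = {"terrier": 0.8, "dog": 0.9, "animal": 1.0}

def wordtree_score(node):
    """Absolute probability of a node = product of conditionals up to the root."""
    score = 1.0
    while node is not None:
        score *= cond_prob[node]
        node = parent[node]
    return score

print(round(wordtree_score("terrier"), 4))  # 0.72
```

This structure is what lets classification labels at different levels of specificity (ImageNet's "Norfolk terrier" vs. COCO's "dog") train the same network.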

SLIDE 18


Slide Credit: Joseph Redmon [3]

SLIDE 19


Image Credit: Joseph Redmon [3]

SLIDE 20


Image Credit: Joseph Redmon [3]

SLIDE 21


YOLOv2 - Detection Training

Datasets:

  • VOC 2007+2012, COCO trainval35k

Data Augmentation:

  • Random crops, colour shifting

Hyperparameters:

  • # of epochs = 160
  • Learning rate = 0.001
  • Weight decay = 0.0005
  • Momentum = 0.9


Training Enhancements:

  • Batch normalization
  • High resolution fine-tuning
  • Multi-scale images
  • Three 3x3 & 1x1 convolutional layers replace the last convolutional layer of the DarkNet-19 base model
  • Passthrough connection between the 3x3x512 and second-to-last convolutional layers, adding fine-grained features to the prediction layer
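One of the enhancements above, multi-scale training, is simple to sketch: because the network is fully convolutional, the input resolution can be re-drawn every 10 batches from multiples of 32 between 320 and 608, as described in the paper. The function below is an illustrative sketch, not the authors' code:

```python
import random

# multiples of 32 from 320 to 608, per the YOLOv2 paper
SCALES = list(range(320, 609, 32))  # [320, 352, ..., 608]

def pick_input_size(batch_idx, current=416):
    # every 10 batches, switch the whole network to a new input resolution
    if batch_idx % 10 == 0:
        return random.choice(SCALES)
    return current
```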

SLIDE 22


YOLO9000 - Detection Training

Datasets:

  • 9418 classes
  • ImageNet (top 9000 classes)
  • COCO detection dataset
  • ImageNet detection challenge

Bounding Boxes:

  • Minimum IOU threshold = 0.3
  • # of dimension clusters = 3

Backpropagating Loss:

  • For detection images, backpropagate as in YOLOv2
  • For unsupervised classification images, only backpropagate the classification loss, using the best matching bounding box from the WordTree predictions
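The loss routing described above can be sketched as follows (the dict interface and helper names are illustrative, not the paper's implementation):

```python
def joint_loss(sample, box_preds, detection_loss, classification_loss):
    # detection image (e.g. VOC / COCO): backpropagate the full YOLOv2 loss
    if sample["has_boxes"]:
        return detection_loss(box_preds, sample)
    # classification-only image (e.g. ImageNet): find the box that assigns
    # the highest probability to the image-level label, and backpropagate
    # only the classification loss for that box
    best_box = max(box_preds, key=lambda b: b["probs"][sample["label"]])
    return classification_loss(best_box, sample["label"])
```

The routing by dataset type is the core of the joint training scheme: detection images supervise both localization and classification, while classification images supervise only the WordTree class probabilities.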

SLIDE 23


YOLOv2 - Test Results


VOC 2007 Test Results. Image Credit: Joseph Redmon [3]

SLIDE 24


VOC 2012 Test Results & COCO Test-Dev 2015 Results. Table Credits: [2]

SLIDE 25


YOLO9000 - Test Results

  • Evaluated on the ImageNet detection task
  • 200 classes total
  • 44 detection-labelled classes shared between ImageNet and COCO
  • 156 unsupervised classes
  • Overall detection accuracy = 19.7% mAP
  • 16.0% mAP achieved on the unsupervised classes

Best and Worst Classes on ImageNet. Table Credit: [2]

SLIDE 26


YOLO9000 - Paper Evaluation

Strengths:

  • Speed performance of YOLOv2 far exceeds competitors (e.g. SSD)
  • Anchor box priors learned via clustering allow the detector to learn ideal aspect ratios from the training data
  • The WordTree method increases the number of learnable classes using existing datasets

Weaknesses:

  • Detection performance of YOLOv2 on COCO is well below state-of-the-art
  • The description of how the loss function uses unsupervised training examples is vague
  • Results from the YOLO9000 tests are inconclusive
  • The paper does not compare its method with alternative weakly-supervised techniques


SLIDE 27


Future Work

  • Improve the accuracy of one-shot detectors in dense object scenes
    • RetinaNet [7]
  • Investigate the transferability of weakly-supervised training to other domains, such as image segmentation or dense captioning


SLIDE 28


Questions?


SLIDE 29


References

[1] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.

[2] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," arXiv preprint arXiv:1612.08242, 2016.

[3] J. Redmon, "YOLO9000: Better, Faster, Stronger," presented at CVPR, 2017.

[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.

[5] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.

[6] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.

[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," arXiv preprint arXiv:1708.02002, 2017.
