Category-level localization
Cordelia Schmid - PowerPoint PPT Presentation


SLIDE 1

Category-level localization

Cordelia Schmid

SLIDE 2

Recognition

  • Classification
    – Object present/absent in an image
    – Often presence of a significant amount of background clutter

  • Localization / Detection
    – Localize object within the frame
    – Bounding box or pixel-level segmentation

SLIDE 3

Pixel-level object classification

SLIDE 4

Difficulties

  • Intra-class variations
  • Scale and viewpoint change
  • Multiple aspects of categories
SLIDE 5

Approaches

  • Intra-class variation
    => modeling of the variations, mainly by learning from a large dataset, for example by SVMs

  • Scale + limited viewpoint changes
    => multi-scale approach or invariant local features

  • Multiple aspects of categories
    => separate detectors for each aspect (e.g. front/profile face), or build an approximate 3D "category" model

SLIDE 6

Outline

  • 1. Sliding window detectors
  • 2. Features and adding spatial information
  • 3. Histogram of Oriented Gradients (HOG)
  • 4. State of the art algorithms and PASCAL VOC
SLIDE 7

Sliding window detector

  • Basic component: binary classifier

Window -> Car/non-car classifier -> "Yes, a car" / "No, not a car"

SLIDE 8

Sliding window detector

  • Detect objects in clutter by search

Car/non-car Classifier

  • Sliding window: exhaustive search over position and scale
SLIDE 9

Sliding window detector

  • Detect objects in clutter by search

Car/non-car Classifier

  • Sliding window: exhaustive search over position and scale
SLIDE 10

Detection by Classification

  • Detect objects in clutter by search

Car/non-car classifier

  • Sliding window: exhaustive search over position and scale (can use same size window over a spatial pyramid of images)
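The exhaustive position-and-scale scan can be sketched as follows. The toy image, window size, scale set and `classify` predicate are illustrative stand-ins, not part of the slides:

```python
# Sliding window search: scan every position at several scales and keep
# the windows the binary classifier accepts. Larger scales are simulated
# here by growing the window (equivalently one could shrink the image in
# a pyramid and keep the window size fixed, as the slide notes).
def sliding_window(image, classify, window=(2, 2), scales=(1, 2), stride=1):
    h, w = len(image), len(image[0])
    detections = []
    for s in scales:                                # search over scale
        wh, ww = window[0] * s, window[1] * s
        for y in range(0, h - wh + 1, stride):      # search over position
            for x in range(0, w - ww + 1, stride):
                patch = [row[x:x + ww] for row in image[y:y + wh]]
                if classify(patch):
                    detections.append((x, y, s))
    return detections
```

For example, with a classifier that fires only on all-ones patches, a 4x4 image with a bright 2x2 block yields a single detection at that block.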

SLIDE 11

Feature Extraction

Classification vs. Detection: does the image contain a car?

  • Classification: unknown location + clutter => lots of invariance
  • Detection: uncluttered, normalized image window => more "detail"
SLIDE 12

Window (Image) Classification

Training data -> Feature extraction -> Classifier -> Car/Non-car

  • Features usually engineered
  • Classifier learnt from data
SLIDE 13

Problems with sliding windows ...

  • aspect ratio
  • granularity (finite grid)
  • partial occlusion
  • multiple responses
SLIDE 14

Outline

  • 1. Sliding window detectors
  • 2. Features and adding spatial information
  • 3. Histogram of Oriented Gradients (HOG)
  • 4. State of the art algorithms and PASCAL VOC
SLIDE 15

BOW + Spatial pyramids

Start from BoW for region of interest (ROI)

  • no spatial information recorded
  • sliding window detector

Bag of Words -> Feature Vector
SLIDE 16

Adding Spatial Information to Bag of Words

Bag of Words -> Concatenate -> Feature Vector

Keeps fixed length feature vector for a window

SLIDE 17

Spatial Pyramid – represent correspondence

1 BoW / 4 BoW / 16 BoW
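A minimal sketch of the 1 + 4 + 16 BoW concatenation described on the two slides above. The normalized window coordinates and the vocabulary size are assumptions for illustration:

```python
# Spatial pyramid BoW: divide the window into 1x1, 2x2 and 4x4 grids of
# cells, build a visual-word histogram per cell, and concatenate them.
# This keeps a fixed-length feature vector for any window.
def spatial_pyramid(words, grids=(1, 2, 4), vocab=10):
    """words: list of (x, y, word_id) with x, y in [0, 1) window coordinates."""
    feature = []
    for g in grids:                                # 1x1, 2x2, 4x4 -> 1 + 4 + 16 BoW
        hists = [[0] * vocab for _ in range(g * g)]
        for x, y, w in words:
            cell = int(y * g) * g + int(x * g)     # which spatial bin holds this word
            hists[cell][w] += 1
        for h in hists:
            feature.extend(h)                      # concatenate -> fixed length
    return feature
```

The resulting vector always has (1 + 4 + 16) x vocabulary-size entries, independent of how many features fall in the window.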

 

SLIDE 18

Dense Visual Words

  • Why extract only sparse image fragments?
  • Good where lots of invariance is needed, but not relevant to sliding window detection?
  • Extract dense visual words on an overlapping grid

Patch / SIFT -> Quantize -> Word

  • More "detail" at the expense of invariance
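The quantization step above can be sketched as nearest-codeword assignment; the toy 2-D descriptors below stand in for real SIFT patch descriptors:

```python
# Dense visual words: each densely sampled patch descriptor is quantized
# to the index of the nearest codebook entry (its "visual word").
def quantize(descriptor, codebook):
    """Return the index of the nearest visual word (squared Euclidean distance)."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist(descriptor, codebook[i]))
```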
SLIDE 19

Outline

  • 1. Sliding window detectors
  • 2. Features and adding spatial information
  • 3. Histogram of Oriented Gradients + linear SVM classifier
  • 4. State of the art algorithms and PASCAL VOC
SLIDE 20

Feature: Histogram of Oriented Gradients (HOG)

[Figure: image, dominant direction, HOG; histogram of frequency vs. orientation]

  • tile 64 x 128 pixel window into 8 x 8 pixel cells
  • each cell represented by a histogram over 8 orientation bins (i.e. angles in range 0-180 degrees)
SLIDE 21

Histogram of Oriented Gradients (HOG) continued

  • Adds a second level of overlapping spatial bins, re-normalizing orientation histograms over a larger spatial area
  • Feature vector dimension (approx) = 16 x 8 (for tiling) x 8 (orientations) x 4 (for blocks) = 4096
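The dimension arithmetic can be checked directly. The sketch below reproduces the slide's approximate count, and also the exact count when the overlapping blocks are 2x2 cells with a one-cell stride (an assumption about the block layout, in the style of Dalal and Triggs):

```python
# HOG feature dimensionality for a 64x128 window with 8x8-pixel cells and
# 8 orientation bins. `approx` is the slide's estimate (every cell counted
# in 4 blocks); `exact` counts the actual overlapping 2x2-cell blocks.
def hog_dims(win_w=64, win_h=128, cell=8, bins=8, block=2):
    cells_x, cells_y = win_w // cell, win_h // cell            # 8 x 16 cells
    approx = cells_x * cells_y * bins * block * block          # 16 x 8 x 8 x 4
    n_blocks = (cells_x - block + 1) * (cells_y - block + 1)   # 7 x 15 = 105 blocks
    exact = n_blocks * block * block * bins                    # 105 x 32
    return approx, exact
```

The approximate count is 4096, as on the slide; the exact overlapping-block count is slightly smaller because edge cells belong to fewer blocks.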

SLIDE 22

Window (Image) Classification

Training data -> HOG feature extraction -> Classifier -> pedestrian/non-pedestrian

  • HOG features
  • Linear SVM classifier
SLIDE 23

SLIDE 24

Averaged examples

SLIDE 25

Dalal and Triggs, CVPR 2005

SLIDE 26

Learned model

f(x) = w^T x + b

average over positive training data

SLIDE 27

Training a sliding window detector

  • Unlike training an image classifier, there are a (virtually) infinite number of possible negative windows
  • Training (learning) generally proceeds in three distinct stages:
  • 1. Bootstrapping: learn an initial window classifier from positives and random negatives
  • 2. Hard negatives: use the initial window classifier for detection on the training images (inference) and identify false positives with a high score
  • 3. Retraining: use the hard negatives as additional training data

SLIDE 28

Training a sliding window detector

  • Object detection is inherently asymmetric: much more "non-object" than "object" data
  • Classifier needs to have very low false positive rate
  • Non-object category is very complex – need lots of data
SLIDE 29

Bootstrapping

  • 1. Pick negative training set at random
  • 2. Train classifier
  • 3. Run on training data
  • 4. Add false positives to training set
  • 5. Repeat from 2

  • Collect a finite but diverse set of non-object windows
  • Force classifier to concentrate on hard negative examples
  • For some classifiers can ensure equivalence to training on entire data set
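The five-step loop above can be sketched as follows. The centroid-based `train` scorer is a toy stand-in for a real classifier such as a linear SVM, and the 1-D "windows" are hypothetical features:

```python
# Bootstrapping with hard-negative mining (steps 1-5 of the slide).
import random

def train(positives, negatives):
    """Toy scorer: higher when closer to the positive mean than the negative mean."""
    pos_mean = sum(positives) / len(positives)
    neg_mean = sum(negatives) / len(negatives)
    return lambda x: abs(x - neg_mean) - abs(x - pos_mean)

def bootstrap(positives, negative_pool, n_random=5, rounds=3, seed=0):
    rng = random.Random(seed)
    negatives = rng.sample(negative_pool, n_random)       # 1. random negatives
    clf = train(positives, negatives)                     # 2. train classifier
    for _ in range(rounds):
        # 3. run on training data; 4. add high-scoring false positives
        hard = [x for x in negative_pool if clf(x) > 0 and x not in negatives]
        if not hard:
            break                                         # no more hard negatives
        negatives += hard
        clf = train(positives, negatives)                 # 5. repeat from 2
    return clf, negatives
```

Each round forces the classifier to concentrate on the negatives it currently mistakes for positives, without ever training on the full (huge) negative set.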

SLIDE 30

Example: train an upper body detector

– Training data – used for training and validation sets
  • 33 Hollywood2 training movies
  • 1122 frames with upper bodies marked

– First stage training (bootstrapping)
  • 1607 upper body annotations jittered to 32k positive samples
  • 55k negatives sampled from the same set of frames

– Second stage training (retraining)
  • 150k hard negatives found in the training data
SLIDE 31

Training data – positive annotations

SLIDE 32

Positive windows

Note: common size and alignment

SLIDE 33

Jittered positives

SLIDE 34

Jittered positives

SLIDE 35

Random negatives

SLIDE 36

Random negatives

SLIDE 37

Window (Image) first stage classification

Jittered positives + random negatives -> HOG feature extraction -> Linear SVM classifier

f(x) = w^T x + b

  • find high-scoring false positive detections
  • these are the hard negatives for the next round of training
  • cost = # training images x inference on each image
SLIDE 38

Hard negatives

SLIDE 39

Hard negatives

SLIDE 40

First stage performance on validation set

SLIDE 41

Precision – Recall curve

  • Precision: % of returned windows that are correct
  • Recall: % of correct windows that are returned
  • the curve is traced out as the classifier score threshold decreases

[Figure: precision vs. recall plot, both axes 0 to 1]
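Sweeping the score threshold down a ranked detection list gives one precision/recall point per rank. A minimal sketch, with hypothetical (score, is_correct) detection tuples:

```python
# Precision-recall curve from a ranked list of detections.
# Each detection is (score, is_correct); walking down the ranked list is
# equivalent to lowering the score threshold one detection at a time.
def precision_recall_curve(detections, n_ground_truth):
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    points, tp = [], 0
    for rank, (score, correct) in enumerate(ranked, start=1):
        tp += correct
        precision = tp / rank              # % of returned windows that are correct
        recall = tp / n_ground_truth       # % of correct windows that are returned
        points.append((precision, recall))
    return points
```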

SLIDE 42

Effects of retraining

SLIDE 43

Side by side

before retraining after retraining

SLIDE 44

Side by side

before retraining after retraining

SLIDE 45

Accelerating Sliding Window Search

  • Sliding window search is slow because so many windows are needed, e.g. x × y × scale ≈ 100,000 for a 320×240 image
  • Most windows are clearly not the object class of interest
  • Can we speed up the search?
SLIDE 46

Cascaded Classification

  • Build a sequence of classifiers with increasing complexity (more complex, slower, lower false positive rate)

Window -> Classifier 1 -> possibly a face -> Classifier 2 -> ... -> Classifier N -> Face
(each stage may reject: Non-face)

  • Reject easy non-objects using simpler and faster classifiers
SLIDE 47

Cascaded Classification

  • Slow expensive classifiers only applied to a few windows => significant speed-up
  • Controlling classifier complexity/speed:
    – Number of support vectors [Romdhani et al, 2001]
    – Number of features [Viola & Jones, 2001]
    – Two-layer approach [Harzallah et al, 2009]
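The early-rejection control flow of the cascade can be sketched as below; the stage classifiers and their thresholds are hypothetical stand-ins:

```python
# Classification cascade: cheap stages run first and reject most windows,
# so the expensive later stages run only on the few surviving candidates.
def cascade(window, stages):
    """stages: list of (classifier, threshold) pairs, cheapest first.
    Returns False as soon as any stage rejects; True if all stages accept."""
    for classify, threshold in stages:
        if classify(window) < threshold:
            return False          # rejected early: later stages never run
    return True                   # survived every stage => detection
```

For example, with a cheap edge-density test before an expensive SVM score, a window failing the first test never pays for the second.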

SLIDE 48

Summary: Sliding Window Detection

  • Can convert any image classifier into an object detector by sliding window. Efficient search methods available.
  • Requirements for invariance are reduced by searching over e.g. translation and scale
  • Spatial correspondence can be "engineered in" by spatial tiling

SLIDE 49

Outline

  • 1. Sliding window detectors
  • 2. Features and adding spatial information
  • 3. HOG + linear SVM classifier
  • 4. State of the art algorithms and PASCAL VOC
SLIDE 50

PASCAL VOC dataset - Content

  • 20 classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, TV
  • Real images downloaded from flickr, not filtered for "quality"
  • Complex scenes, scale, pose, lighting, occlusion, ...
SLIDE 51

Annotation

  • Complete annotation of all objects
  • Annotated in one session with written guidelines

  – Occluded: object is significantly occluded within BB
  – Truncated: object extends beyond BB
  – Difficult: not scored in evaluation
  – Pose: e.g. facing left
SLIDE 52

Examples

Aeroplane Bicycle Bird Boat Bottle Bus Car Cat Chair Cow

SLIDE 53

Examples

Dining Table Dog Horse Motorbike Person Potted Plant Sheep Sofa Train TV/Monitor

SLIDE 54

Main Challenge Tasks

  • Classification
    – Is there a dog in this image?
    – Evaluation by precision/recall

  • Detection
    – Localize all the people (if any) in this image
    – Evaluation by precision/recall based on bounding box overlap

SLIDE 55

Detection: Evaluation of Bounding Boxes

  • Area of Overlap (AO) Measure

Ground truth Bgt, predicted Bp:
AO = area(Bgt ∩ Bp) / area(Bgt ∪ Bp)

Detection if AO > threshold = 50%
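The overlap measure can be computed directly from box coordinates; the corner format (x1, y1, x2, y2) is an assumption for illustration:

```python
# Area of overlap (intersection over union) between ground-truth box bgt
# and predicted box bp, each given as (x1, y1, x2, y2) with x1 < x2, y1 < y2.
def area_of_overlap(bgt, bp):
    ix1, iy1 = max(bgt[0], bp[0]), max(bgt[1], bp[1])
    ix2, iy2 = min(bgt[2], bp[2]), min(bgt[3], bp[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # zero if boxes don't overlap
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(bgt) + area(bp) - inter
    return inter / union

def is_detection(bgt, bp, threshold=0.5):
    """PASCAL-style criterion: overlap must exceed 50%."""
    return area_of_overlap(bgt, bp) > threshold
```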

SLIDE 56

Classification/Detection Evaluation

  • Average Precision [TREC] averages precision over the entire range of recall
    – Curve interpolated to reduce influence of "outliers"
    – A good score requires both high recall and high precision
    – Application-independent
    – Penalizes methods giving high precision but low recall

[Figure: interpolated precision/recall curve with AP as the area under it]
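One way to compute interpolated average precision from ranked (precision, recall) points. This is a common interpolation scheme (take the best precision at any equal-or-higher recall), not necessarily the exact PASCAL protocol of a given year:

```python
# Interpolated average precision: area under the precision-recall curve,
# with each precision replaced by the maximum precision achievable at
# that recall level or beyond, which smooths out "outlier" dips.
def average_precision(points):
    """points: list of (precision, recall), recall non-decreasing
    (as produced by sweeping a ranked detection list)."""
    ap, prev_recall = 0.0, 0.0
    for i, (precision, recall) in enumerate(points):
        p_interp = max(p for p, r in points[i:])   # interpolated precision
        ap += p_interp * (recall - prev_recall)    # rectangle under the curve
        prev_recall = recall
    return ap
```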

SLIDE 57

Object Detection with Discriminatively Trained Part Based Models

Pedro F. Felzenszwalb, David McAllester, Deva Ramanan, Ross Girshick, PAMI 2010

Matlab code available online: http://www.cs.brown.edu/~pff/latent/

SLIDE 58

Approach

  • Mixture of deformable part-based models
    – One component per "aspect", e.g. front/side view
  • Each component has global template + deformable parts
  • Discriminative training from bounding boxes alone
SLIDE 59

Example Model

  • One component of person model

[Figure: root filter (coarse resolution), part filters (finer resolution), deformation models]
SLIDE 60

Starting Point: HOG Filter

Score of filter F at position p is F ⋅ φ(p, H), where φ(p, H) = concatenation of HOG features from the subwindow of HOG pyramid H specified by p

  • Search: sliding window over position and scale
  • Feature extraction: HOG descriptor
  • Classifier: Linear SVM

Dalal & Triggs [2005]
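The filter score F ⋅ φ(p, H) is just a dot product between the filter weights and the HOG cells under the window. A sketch, where nested lists stand in for one level of a real HOG pyramid and the filter shape (rows x cols x orientation bins) is a hypothetical toy size:

```python
# Score of a HOG filter at position (y, x) of one pyramid level: the dot
# product of filter weights with the HOG features of the subwindow there.
def filter_score(hog_level, filt, y, x):
    fh, fw = len(filt), len(filt[0])
    score = 0.0
    for i in range(fh):
        for j in range(fw):
            # dot product over the orientation bins of one cell
            score += sum(a * b for a, b in zip(filt[i][j], hog_level[y + i][x + j]))
    return score
```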

SLIDE 61

Object Hypothesis

  • Position of root + each part
  • Each part: HOG filter (at higher resolution)

z = (p0, ..., pn); p0: location of root, p1, ..., pn: locations of parts
Score is sum of filter scores minus deformation costs
SLIDE 62

Score of a Hypothesis

score(p0, ..., pn) = Σ_i Fi ⋅ φ(H, pi) − Σ_i di ⋅ (dxi, dyi, dxi², dyi²)
(appearance term − spatial prior: filters, deformation parameters, displacements)

= β ⋅ ψ(H, z), with β = concatenation of filters and deformation parameters, and ψ(H, z) = concatenation of HOG features and part displacement features

  • Linear classifier applied to feature subset defined by hypothesis
SLIDE 63

Training

  • Training data = images + bounding boxes
  • Need to learn: model structure, filters, deformation costs
SLIDE 64

Latent SVM (MI-SVM)

Classifiers that score an example x using

f_β(x) = max_z β ⋅ Φ(x, z)

β are model parameters, z are latent values:

  • Which component?
  • Where are the parts?

We would like to find β such that y_i f_β(x_i) > 0 for the training data. Minimize the SVM objective

λ ||β||² + Σ_i max(0, 1 − y_i f_β(x_i))

(regularizer + "hinge loss" on each training example)

SLIDE 65

Latent SVM Training

  • Convex if we fix z for positive examples
  • Optimization: initialize β and iterate (alternation strategy):
    – Pick best z for each positive example
    – Optimize β with z fixed
  • Local minimum: needs good initialization
    – Parts initialized heuristically from root

SLIDE 66

Person Model

root filters (coarse resolution), part filters (finer resolution), deformation models

Handles partial occlusion/truncation

SLIDE 67

Car Model

root filters (coarse resolution), part filters (finer resolution), deformation models

SLIDE 68

Car Detections

high scoring true positives / high scoring false positives

SLIDE 69

Person Detections

high scoring true positives / high scoring false positives (not enough overlap)

SLIDE 70

Segmentation Driven Object Detection with Fisher Vectors

Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid, ICCV 2013 (student presentation)

SLIDE 71

Approach

  • Pre-select class-independent candidate image windows using image segmentation

[van de Sande et al., Segmentation as selective search for object recognition, ICCV'11]

Guarantees ~95% recall for any object class in Pascal VOC with only 1500 windows per image

SLIDE 72

Approach

  • Local features + feature re-weighting based on object segmentation masks
  • Represent windows with Fisher Vector (FV) encoding
  • Compressed FV descriptors for efficiency
  • Linear SVM classifier with hard negative mining