Deformable Part Models
Ross Girshick, UC Berkeley
CS231B, Stanford University, Guest Lecture, April 16, 2013
Image understanding
Snack time in the lab
photo by "thomas pix" http://www.flickr.com/photos/thomaspix/2591427106
What objects are where?
I see twinkies!
robot: “I see a table with twinkies, pretzels, fruit, and some mysterious chocolate things...”
DPM lecture overview
AP over the years: 12% (2005), 27% (2008), 36% (2009), 45% (2010), 49% (2011)
Part 1: modeling
Part 2: learning
Formalizing the object detection task
Many possible ways; this one is popular:
- Input: an image
- Desired output: bounding boxes labeled with object classes (cat, dog, chair, cow, person, motorbike, car, ...), e.g. boxes around each person and motorbike
- Performance summary: Average Precision (AP); 0 is worst, 1 is perfect
Benchmark datasets
PASCAL VOC 2005 – 2012
- 54k objects in 22k images
- 20 object classes
- annual competition
Reduction to binary classification
pos = { ... }  neg = { ... background patches ... }
HOG features + SVM → "sliding window" detector [Dalal & Triggs, CVPR'05]
Sliding window detection
- Compute HOG of the whole image at multiple resolutions
- Score every subwindow of the feature pyramid
- Apply non-maximum suppression
[Figure: image pyramid → HOG feature pyramid]
score(x, p) = w · φ(x, p), where p is a position in the feature pyramid
Detection
- number of locations p ≈ 250,000 per image
- test set has ≈ 5,000 images ⇒ more than 1.3×10⁹ windows to classify
- typically only ≈ 1,000 true positive locations
- ⇒ extremely unbalanced binary classification
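The scoring step above — a linear filter applied to every subwindow of a HOG feature pyramid — can be sketched as plain cross-correlation over feature maps. This is an illustrative toy, not the lecture's actual code; the function name, array shapes, and random "features" are mine.

```python
import numpy as np

def score_pyramid(feature_pyramid, w):
    """Score every subwindow of a feature pyramid with a linear filter.

    feature_pyramid: list of (H, W, D) arrays (e.g., HOG at several scales).
    w: (fh, fw, D) linear filter ("template"), e.g., learned by an SVM.
    Returns one (H-fh+1, W-fw+1) score map per pyramid level.
    """
    fh, fw, fd = w.shape
    score_maps = []
    for level in feature_pyramid:
        H, W, D = level.shape
        assert D == fd
        scores = np.empty((H - fh + 1, W - fw + 1))
        for y in range(scores.shape[0]):
            for x in range(scores.shape[1]):
                # score(x, p) = w · φ(x, p): dot product between the filter
                # and the features of the subwindow at p = (level, y, x)
                scores[y, x] = np.sum(w * level[y:y + fh, x:x + fw, :])
        score_maps.append(scores)
    return score_maps

# Toy example: 2 pyramid levels of random "features", a 3x3x4 filter.
rng = np.random.default_rng(0)
pyramid = [rng.standard_normal((10, 12, 4)), rng.standard_normal((6, 8, 4))]
w = rng.standard_normal((3, 3, 4))
maps = score_pyramid(pyramid, w)
print([m.shape for m in maps])  # [(8, 10), (4, 6)]
```

The quadratic inner loops make the ~250,000-windows-per-image count concrete; real implementations vectorize this as convolution.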
Dalal & Triggs detector on INRIA
[Figure: recall-precision curves for different descriptors on the INRIA static person database: kernel R-HOG, linear R-HOG, linear R2-HOG, wavelet, PCA-SIFT, linear E-ShapeC]
- AP = 75% (79% in my implementation)
- Very good!
- Declare victory and go home?
Dalal & Triggs on PASCAL VOC 2007
AP = 12% (using my implementation)
How can we do better?
Revisit an old idea: part-based models (“pictorial structures”)
- Fischler & Elschlager ‘73, Felzenszwalb & Huttenlocher ’00
Combine with modern features and machine learning
Part-based models
- Parts — local appearance templates
- “Springs” — spatial connections between parts (geom. prior)
Image: [Felzenszwalb and Huttenlocher 05]
Part-based models
- Local appearance is easier to model than global appearance
- Training data shared across deformations
- “part” can be local or global depending on resolution
- Generalizes to previously unseen configurations
General formulation
- Model: a graph G = (V, E) with parts V = (v1, ..., vn) and springs E ⊆ V × V
- Configuration: part locations (p1, ..., pn) ∈ Pⁿ in the image (or feature pyramid)
Part configuration score function
score(p1, ..., pn) = Σ_{i=1..n} m_i(p_i) − Σ_{(i,j)∈E} d_ij(p_i, p_j)
(part match scores minus spring costs)
[Figure: highest scoring configurations]
Part configuration score function
score(p1, ..., pn) = Σ_{i=1..n} m_i(p_i) − Σ_{(i,j)∈E} d_ij(p_i, p_j)
(part match scores minus spring costs)
- Objective: maximize the score over p1, ..., pn
- hⁿ configurations! (h = |P|, about 250,000)
- Dynamic programming:
  - if G = (V, E) is a tree, O(nh²) general algorithm
  - O(nh) with some restrictions on d_ij
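For a star-shaped graph (every part connected only to the root), the maximization decouples: for each root location p0, each part's best placement can be found independently. The toy below makes that concrete by evaluating the explicit max — the O(nh²) algorithm the slide mentions. Function and variable names are mine, not from the lecture.

```python
import numpy as np

def star_model_scores(root_scores, part_scores, def_costs):
    """Score each root location of a star-structured part model.

    Because each part connects only to the root, the max over part
    placements decouples:
        score(p0) = m0(p0) + sum_i max_{pi} [ m_i(pi) - d_i(p0, pi) ]
    root_scores: (h,) match scores m0, one per root location.
    part_scores: list of (h,) match scores m_i, one array per part.
    def_costs: list of (h, h) arrays with def_costs[i][p0, pi] = d_i(p0, pi).
    The explicit max below is O(h^2) per part; generalized distance
    transforms reduce it to O(h) for quadratic d_i.
    """
    total = root_scores.copy()
    for m_i, d_i in zip(part_scores, def_costs):
        # For every root location p0 (rows), best placement of part i (cols).
        total += np.max(m_i[None, :] - d_i, axis=1)
    return total

# Toy example: h = 5 locations on a line, quadratic spring cost
# d(p0, pi) = 0.5 * (p0 - pi)^2, one part with a strong match at pi = 1.
h = 5
locs = np.arange(h)
quad = 0.5 * (locs[:, None] - locs[None, :]) ** 2
root = np.zeros(h)
parts = [np.array([0.0, 3.0, 0.0, 0.0, 0.0])]
scores = star_model_scores(root, parts, [quad])
print(scores)  # [2.5 3.  2.5 1.  0. ]
```

Note how the part's evidence at pi = 1 "spreads" to nearby root locations, discounted by the spring cost — exactly the transformed responses in the detection pipeline.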
Star-structured deformable part models
[Figure: test image; "star" model = root filter + part filters; detection]
Recall the Dalal & Triggs detector
- HOG feature pyramid
- Linear filter / sliding-window detector
- SVM training to learn parameters w
[Figure: image pyramid → HOG feature pyramid]
score(x, p) = w · φ(x, p)
D&T + parts
- Add parts to the Dalal & Triggs detector
- HOG features
- Linear filters / sliding-window detector
- Discriminative training
[FMR CVPR’08] [FGMR PAMI’10]
[Figure: image pyramid → HOG feature pyramid, with root location p0 and part placements z]
Sliding window DPM score function
z = (p0, ..., pn)  (root and part placements)
score(x, p0) = max_{p1,...,pn} Σ_{i=0..n} m_i(x, p_i) − Σ_{i=1..n} d_i(p0, p_i)
(filter scores minus spring costs)
[Figure: image pyramid → HOG feature pyramid, root at p0]
Detection in a slide
[Figure: pipeline. The feature map of the test image (and a feature map at 2× the resolution) is filtered with the root filter and each of the n part filters; each part response R_i is transformed by
    max_{p_i} [ R_i(p_i) − d_i(p0, p_i) ]
and the transformed responses are summed with the root filter response, giving a detection score for each root location p0 (color encoding: low to high filter response values).]
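The O(nh) claim on the earlier slide comes from computing the transformed responses with a generalized distance transform (Felzenszwalb & Huttenlocher). Below is a sketch of the 1D min-form for quadratic costs; the max-form used in detection follows by negating the scores, and 2D follows by applying it separably in x and y. The function name and the cost weight `a` are my notation.

```python
import numpy as np

def distance_transform_1d(f, a=1.0):
    """Lower-envelope distance transform:
        D(p) = min_q [ f(q) + a * (p - q)^2 ]
    for all p in O(h) time instead of the naive O(h^2).
    """
    n = len(f)
    D = np.empty(n)
    v = np.zeros(n, dtype=int)   # locations of parabolas in the lower envelope
    z = np.empty(n + 1)          # boundaries between adjacent parabolas
    k = 0
    z[0], z[1] = -np.inf, np.inf
    for q in range(1, n):
        while True:
            # Intersection of the parabola rooted at q with the one at v[k].
            s = ((f[q] + a * q * q) - (f[v[k]] + a * v[k] * v[k])) \
                / (2 * a * q - 2 * a * v[k])
            if s <= z[k]:
                k -= 1           # parabola v[k] is fully hidden; discard it
            else:
                break
        k += 1
        v[k] = q
        z[k], z[k + 1] = s, np.inf
    k = 0
    for p in range(n):           # read off the lower envelope
        while z[k + 1] < p:
            k += 1
        D[p] = f[v[k]] + a * (p - v[k]) ** 2
    return D

f = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0])
D = distance_transform_1d(f)
```

The result matches brute-force minimization over all q, which is how the sketch can be checked.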
What are the parts?
Aspect soup
General philosophy: enrich models to better represent the data
Mixture models
- Data driven: aspect, occlusion modes, subclasses
- FMR CVPR'08: AP = 0.27 (person) → FGMR PAMI'10: AP = 0.36 (person)
Pushmi–pullyu?
Good generalization properties on Doctor Dolittle's farm
This was supposed to detect horses:
(left-facing horse template + right-facing horse template) / 2 = pushmi-pullyu
Latent orientation
Unsupervised left/right orientation discovery
- FGMR PAMI'10: AP = 0.36 (person) → voc-release5: AP = 0.45 (person)
- horse AP: 0.42 → 0.47 → 0.57
- Publicly available code for the whole system: current voc-release5
Summary of results
[DT'05] AP 0.12 → [FMR'08] AP 0.27 → [FGMR'10] AP 0.36 → [GFM voc-release5] AP 0.45 → [GFM'11] AP 0.49
Part 2: DPM parameter learning
[Figure: fixed model structure with unknown filters — component 1, component 2]
- Given: training images with labels y = +1 or y = −1
Parameters to learn:
- biases (per component)
- deformation costs (per part)
- filter weights
Linear parameterization
z = (p0, ..., pn)
score(x, p0) = max_{p1,...,pn} Σ_{i=0..n} m_i(x, p_i) − Σ_{i=1..n} d_i(p0, p_i)
(filter scores minus spring costs), with
- filter scores m_i(x, p_i) = w_i · φ(x, p_i)
- spring costs d_i(p0, p_i) = d_i · (dx, dy, dx², dy²)
Collecting the parameters into a single vector w:
score(x, p0) = max_z w · Φ(x, z)
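The point of the linear parameterization is that once the placements z are fixed, the whole DPM score is one dot product between a parameter vector and a feature vector. A tiny numeric check, with made-up filter weights and features (all numbers are illustrative, not from any trained model):

```python
import numpy as np

# Hypothetical tiny model: root filter + one part.
w_root = np.array([1.0, -0.5])             # root filter weights
w_part = np.array([0.3, 0.7])              # part filter weights
d_part = np.array([0.1, 0.1, 0.05, 0.05])  # deformation costs for (dx, dy, dx^2, dy^2)

phi_root = np.array([2.0, 1.0])            # HOG features at the root location
phi_part = np.array([0.5, 1.5])            # HOG features at the part location
dx, dy = 1.0, -2.0                         # part displacement from its anchor
phi_def = np.array([dx, dy, dx * dx, dy * dy])

# Score written as filter scores minus spring costs...
score = w_root @ phi_root + w_part @ phi_part - d_part @ phi_def

# ...equals a single dot product w · Φ(x, z) after concatenation,
# with the deformation features negated inside Φ.
w = np.concatenate([w_root, w_part, d_part])
Phi = np.concatenate([phi_root, phi_part, -phi_def])
assert np.isclose(score, w @ Phi)
print(score)  # 2.55
```

This is what lets SVM machinery learn filters, deformation costs, and biases jointly: they are all just coordinates of w.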
Positive examples (y = +1)
- x specifies an image and a bounding box (e.g. a person)
- We want f_w(x) = max_{z∈Z(x)} w · Φ(x, z) to score ≥ +1
- Z(x) includes all z with more than 70% overlap with ground truth
Negative examples (y = −1)
- x specifies an image and a HOG pyramid location p0
- We want f_w(x) = max_{z∈Z(x)} w · Φ(x, z) to score ≤ −1
- Z(x) restricts the root to p0 and allows any placement of the other filters
Typical dataset
- 300 to 8,000 positive examples
- 500 million to 1 billion negative examples (not including latent configurations!)
- Large-scale*
*unless someone from Google is here
How we learn parameters: latent SVM

E(w) = ½||w||² + C Σ_{i=1..n} max{0, 1 − y_i f_w(x_i)}
     = ½||w||² + C Σ_{i∈pos} max{0, 1 − max_{z∈Z(x_i)} w · Φ(x_i, z)}
              + C Σ_{i∈neg} max{0, 1 + max_{z∈Z(x_i)} w · Φ(x_i, z)}

[Figure: the negative-example loss, as a function of w, is a hinge of a piecewise-linear max over placements z1, z2, z3, z4 — convex; the positive-example loss is a hinge of a negated max — not convex :(]

Observations
- The latent SVM objective is convex in the negative examples but not in the positives ⇒ "semi-convex"
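Semi-convexity can be checked numerically on a toy model. Below, a scalar w and two latent placements with features +1 and −1, so f_w(x) = max(w, −w) = |w| — the setup is my own illustration. The negative-example hinge stays below its chords; the positive-example hinge does not.

```python
# Toy model: scalar w, latent placements with features +1 and -1,
# so f_w(x) = max_z w * phi_z = max(w, -w) = |w| (convex in w).
def f(w):
    return max(w, -w)

def pos_loss(w):  # hinge on a positive example: max{0, 1 - f_w(x)}
    return max(0.0, 1.0 - f(w))

def neg_loss(w):  # hinge on a negative example: max{0, 1 + f_w(x)}
    return max(0.0, 1.0 + f(w))

w1, w2 = -2.0, 2.0
mid = 0.5 * (w1 + w2)

# Negative loss is convex: its value at the midpoint is below the chord.
assert neg_loss(mid) <= 0.5 * (neg_loss(w1) + neg_loss(w2))

# Positive loss violates convexity: midpoint value is above the chord.
print(pos_loss(mid), 0.5 * (pos_loss(w1) + pos_loss(w2)))  # 1.0 0.0
```

Intuitively: a positive example is happy with either orientation of the template, so both w = −2 and w = +2 have zero loss, but their average w = 0 scores nothing.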
Convex upper bound on loss
For any fixed choice z_Pi ∈ Z(x_i) (e.g. the best placement under the current w):
max{0, 1 − max_{z∈Z(x_i)} w · Φ(x_i, z)} ≤ max{0, 1 − w · Φ(x_i, z_Pi)}
and the right-hand side is convex in w
Auxiliary objective
Let Z_P = {z_P1, z_P2, ...}
E(w, Z_P) = ½||w||² + C Σ_{i∈pos} max{0, 1 − w · Φ(x_i, z_Pi)}
          + C Σ_{i∈neg} max{0, 1 + max_{z∈Z(x_i)} w · Φ(x_i, z)}
Note that E(w, Z_P) ≥ min_{Z_P} E(w, Z_P) = E(w)
and w* = argmin_{w,Z_P} E(w, Z_P) = argmin_w E(w)
Auxiliary objective: w* = argmin_{w,Z_P} E(w, Z_P) = argmin_w E(w)
This isn't any easier to optimize directly. Find a stationary point by coordinate descent on E(w, Z_P):
- Initialization: pick a w(0) (or a Z_P)
- Step 1: z_Pi = argmax_{z∈Z(x_i)} w(t) · Φ(x_i, z)  for all i ∈ positives
- Step 2: w(t+1) = argmin_w E(w, Z_P)
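The two coordinate-descent steps can be sketched end to end on toy data, where each example is just a small bag of candidate feature vectors Φ(x, z). This is a pedagogical stand-in, not the DPM trainer: the subgradient inner loop, learning rate, and data are all my assumptions.

```python
import numpy as np

def latent_svm(pos_bags, neg_bags, C=1.0, outer_iters=5, inner_iters=200, lr=0.01):
    """Toy latent-SVM coordinate descent.

    pos_bags / neg_bags: lists of (k, d) arrays; each row is Φ(x, z)
    for one latent placement z ∈ Z(x).
    """
    d = pos_bags[0].shape[1]
    w = np.zeros(d)
    for _ in range(outer_iters):
        # Step 1: fix latent placements for the positives ("detection").
        zp = [bag[np.argmax(bag @ w)] for bag in pos_bags]
        # Step 2: minimize the convex auxiliary objective (subgradient descent).
        for _ in range(inner_iters):
            g = w.copy()                      # gradient of (1/2)||w||^2
            for phi in zp:                    # positives: latent z held fixed
                if 1.0 - w @ phi > 0:
                    g -= C * phi
            for bag in neg_bags:              # negatives: max over all z
                phi = bag[np.argmax(bag @ w)]
                if 1.0 + w @ phi > 0:
                    g += C * phi
            w -= lr * g
    return w

# Toy data: each positive bag has one placement near (+1, +1) and one
# distractor near (-1, -1); negatives cluster near (-1, -1).
rng = np.random.default_rng(0)
pos = [np.vstack([rng.normal(1, 0.1, 2), rng.normal(-1, 0.1, 2)]) for _ in range(20)]
neg = [rng.normal(-1, 0.1, (3, 2)) for _ in range(20)]
w = latent_svm(pos, neg)
```

After a few outer rounds, the best placement of each positive scores well above the negatives, even though the correct placement was never labeled.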
Step 1
This is just detection:
z_Pi = argmax_{z∈Z(x_i)} w(t) · Φ(x_i, z)  for all i ∈ positives
[Figure: the same root/part filter response pipeline shown in "Detection in a slide"]
Step 2
min_w ½||w||² + C Σ_{i∈pos} max{0, 1 − w · Φ(x_i, z_Pi)}
    + C Σ_{i∈neg} max{0, 1 + max_{z∈Z(x_i)} w · Φ(x_i, z)}
- Convex; similar to a structural SVM
- But, recall: 500 million to 1 billion negative examples!
- Can be solved by a working set method
  - "bootstrapping" / "data mining" / "constraint generation"
  - requires a bit of engineering to make this fast
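A working-set loop of the kind described above can be sketched as: mine the negatives the current model gets wrong, retrain on positives plus the cached hard negatives, then shrink the cache. This is a toy illustration under my own assumptions — `neg_stream` stands in for running the detector over background images, and the gradient inner loop stands in for a real SVM solver.

```python
import numpy as np

def train_working_set(pos_feats, neg_stream, C=1.0, rounds=4, iters=300, lr=0.01):
    """Working-set ("bootstrapping" / "hard negative mining") sketch.

    pos_feats: (m, d) fixed positive feature vectors Φ(x, z_P).
    neg_stream: callable w -> (k, d) array of candidate negative windows
        (in DPM: the detector run over background images).
    """
    d = pos_feats.shape[1]
    w = np.zeros(d)
    cache = np.empty((0, d))
    for _ in range(rounds):
        # Mine hard negatives: windows the current model scores above -1.
        hard = neg_stream(w)
        cache = np.vstack([cache, hard[hard @ w > -1.0]])
        # Retrain on positives + cached negatives (convex hinge objective).
        for _ in range(iters):
            g = w.copy()
            g -= C * pos_feats[pos_feats @ w < 1.0].sum(axis=0)
            g += C * cache[cache @ w > -1.0].sum(axis=0)
            w -= lr * g
        # Shrink the cache: drop easy negatives (score below the margin).
        cache = cache[cache @ w >= -1.0]
    return w

# Toy data: positives near (+1, +1), negative stream near (-1, -1).
rng = np.random.default_rng(1)
pos_feats = rng.normal(1, 0.1, (30, 2))
def neg_stream(w):
    return rng.normal(-1, 0.1, (30, 2))
w = train_working_set(pos_feats, neg_stream)
```

The cache keeps memory bounded: only negatives near or inside the margin survive a round, which is what makes the billion-window negative set tractable.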
Comments
- Latent SVM is mathematically equivalent to MI-SVM (Andrews et al. NIPS 2003)
- Latent SVM can be written as a latent structural SVM (Yu and Joachims ICML 2009)
  - natural optimization algorithm is the concave-convex procedure
  - similar to, but not exactly the same as, coordinate descent
[Figure: a bag of instances x_i1, x_i2, x_i3 for example x_i, with latent labels z1, z2, z3]
What about the model structure?
[Figure: fixed model structure with unknown filters — component 1, component 2; training images with labels y = +1 or −1]