Deformable part models (Ross Girshick, CS231B Stanford guest lecture)


SLIDE 1

Deformable part models

Ross Girshick UC Berkeley CS231B Stanford University Guest Lecture April 16, 2013

SLIDE 2

Image understanding

photo by “thomas pix” http://www.flickr.com/photos/thomaspix/2591427106

Snack time in the lab

SLIDE 3

What objects are where?


I see twinkies!

robot: “I see a table with twinkies, pretzels, fruit, and some mysterious chocolate things...”

SLIDE 4

DPM lecture overview

AP over the years: 12% (2005), 27% (2008), 36% (2009), 45% (2010), 49% (2011)

Part 1: modeling
Part 2: learning

SLIDE 5

Formalizing the object detection task

Many possible ways

SLIDE 6

Formalizing the object detection task

Many possible ways; this one is popular:
  Input: an image
  Desired output: labeled bounding boxes (e.g., person, motorbike)

Object classes: cat, dog, chair, cow, person, motorbike, car, ...

SLIDE 7

Formalizing the object detection task

Many possible ways; this one is popular:
  Input: an image
  Desired output: labeled bounding boxes (e.g., person, motorbike)

Object classes: cat, dog, chair, cow, person, motorbike, car, ...

Performance summary: Average Precision (AP); 0 is worst, 1 is perfect
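AP summarizes the whole precision-recall trade-off in one number. A minimal sketch of the computation (simple step integration; the official VOC evaluation additionally matches detections to ground truth by an overlap test and interpolates precision, both omitted here):

```python
import numpy as np

def average_precision(scores, labels):
    """Area under the precision-recall curve, by simple step integration.
    scores: detection confidences; labels: 1 if the detection is a true
    positive, 0 if a false positive (matching to ground truth omitted)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                  # true positives so far
    fp = np.cumsum(1 - labels)              # false positives so far
    recall = tp / max(int(labels.sum()), 1)
    precision = tp / (tp + fp)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):     # sum precision * change in recall
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

For detections scored 0.9, 0.8, 0.7, 0.6 with labels 1, 1, 0, 1 this yields 11/12 ≈ 0.92.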

SLIDE 8

Benchmark datasets

PASCAL VOC 2005 – 2012

  • 54k objects in 22k images
  • 20 object classes
  • annual competition
SLIDE 10

Reduction to binary classification

pos = { ... }   neg = { ... background patches ... } → HOG → SVM

"Sliding window" detector: Dalal & Triggs (CVPR '05)

SLIDE 11

Sliding window detection

  • Compute HOG of the whole image at multiple resolutions
  • Score every subwindow of the feature pyramid
  • Apply non-maximum suppression

[Figure: image pyramid → HOG feature pyramid]

score(x, p) = w · φ(x, p)
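The non-maximum suppression step can be sketched as greedy IoU-based suppression (a generic version, not the exact variant of any particular release):

```python
import numpy as np

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box, discard
    boxes that overlap it by more than `thresh` IoU, repeat.
    boxes: (N, 4) array of [x1, y1, x2, y2]; returns indices of kept boxes."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float))
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of box i with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= thresh]         # drop heavily overlapping boxes
    return keep
```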

SLIDE 12

Detection

number of locations p ≈ 250,000 per image

SLIDE 13

Detection

number of locations p ≈ 250,000 per image
test set has ≈ 5,000 images
⇒ over 1.3 × 10⁹ windows to classify

SLIDE 14

Detection

number of locations p ≈ 250,000 per image
test set has ≈ 5,000 images
⇒ over 1.3 × 10⁹ windows to classify
typically only ≈ 1,000 true positive locations

SLIDE 15

Detection

number of locations p ≈ 250,000 per image
test set has ≈ 5,000 images
⇒ over 1.3 × 10⁹ windows to classify
typically only ≈ 1,000 true positive locations
⇒ extremely unbalanced binary classification

SLIDE 16

Dalal & Triggs detector on INRIA

[Figure: recall-precision curves for different descriptors on the INRIA static person database: Ker. R-HOG, Lin. R-HOG, Lin. R2-HOG, Wavelet, PCA-SIFT, Lin. E-ShapeC]

  • AP = 75% (79% in my implementation)
  • Very good
  • Declare victory and go home?
SLIDE 17

Dalal & Triggs on PASCAL VOC 2007

AP = 12% (using my implementation)

SLIDE 18

How can we do better?

Revisit an old idea: part-based models (“pictorial structures”)

  • Fischler & Elschlager ’73; Felzenszwalb & Huttenlocher ’00

Combine with modern features and machine learning

SLIDE 19

Part-based models

  • Parts — local appearance templates
  • “Springs” — spatial connections between parts (geom. prior)

Image: [Felzenszwalb and Huttenlocher 05]

SLIDE 20

Part-based models

  • Local appearance is easier to model than global appearance
  • Training data is shared across deformations
  • A “part” can be local or global depending on resolution
  • Generalizes to previously unseen configurations
SLIDE 21

General formulation

G = (V, E),  V = (v1, . . . , vn),  E ⊆ V × V

(p1, . . . , pn),  pi ∈ P:  part locations in the image (or feature pyramid)

SLIDE 22

Part configuration score function

score(p1, . . . , pn) = Σi mi(pi) − Σ(i,j)∈E dij(pi, pj)

part match scores − spring costs

Highest scoring configurations

SLIDE 23

Part configuration score function

  • Objective: maximize score over p1, . . . , pn
  • h^n configurations! (h = |P|, about 250,000)
  • Dynamic programming:
    – if G = (V, E) is a tree: O(nh²) general algorithm
    – O(nh) with some restrictions on dij

score(p1, . . . , pn) = Σi mi(pi) − Σ(i,j)∈E dij(pi, pj)

part match scores − spring costs
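For a star graph the dynamic program is especially simple: given the root location, each part's best placement is an independent max. A toy sketch in the slide's notation (mi and di supplied as arrays; a real implementation replaces the inner max with a generalized distance transform to reach O(nh)):

```python
import numpy as np

def star_model_scores(m_root, part_scores, spring_costs):
    """Maximize score(p0,...,pn) = m0(p0) + sum_i [m_i(p_i) - d_i(p0, p_i)]
    for a star-shaped model, for every root location p0 at once.
    m_root: (h,) root match scores; part_scores: list of (h,) arrays m_i;
    spring_costs: list of (h, h) arrays d_i[p0, p_i]. Naive O(n h^2) DP."""
    total = np.asarray(m_root, dtype=float).copy()
    for m_i, d_i in zip(part_scores, spring_costs):
        # message from part i to the root: max over p_i of m_i(p_i) - d_i(p0, p_i)
        total += np.max(np.asarray(m_i)[None, :] - np.asarray(d_i), axis=1)
    return total
```

With h = 2 candidate locations, m_root = [1, 0], one part with m1 = [0, 2], and a spring that heavily penalizes disagreement, the best totals per root location come out to [1, 2].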

SLIDE 24

Star-structured deformable part models

[Figure: test image → “star” model (root filter + part filters) → detection]

SLIDE 25

Recall the Dalal & Triggs detector

  • HOG feature pyramid
  • Linear filter / sliding-window detector
  • SVM training to learn parameters w

[Figure: image pyramid → HOG feature pyramid]

score(x, p) = w · φ(x, p)

SLIDE 26

D&T + parts

  • Add parts to the Dalal & Triggs detector
  • HOG features
  • Linear filters / sliding-window detector
  • Discriminative training

[FMR CVPR’08] [FGMR PAMI’10]

[Figure: image pyramid → HOG feature pyramid; root filter location p0, full placement z]

SLIDE 27

Sliding window DPM score function

z = (p0, . . . , pn)

score(x, p0) = max p1,...,pn [ Σi=0..n mi(x, pi) − Σi=1..n di(p0, pi) ]

filter scores: mi(x, pi);  spring costs: di(p0, pi)

[Figure: image pyramid → HOG feature pyramid; root location p0]

SLIDE 28

Detection in a slide

[Figure: test image → feature map (and feature map at 2× resolution) → responses of root filter and part filters → transformed part responses → combined detection scores for each root location (color encoding: low to high filter response values)]

score(x, p0) = m0(x, p0) + Σi max pi [ mi(x, pi) − di(p0, pi) ]
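The “transformed responses” can be sketched in 1-D: a naive O(h²) max over displacements with a quadratic spring cost (a real implementation uses the O(h) generalized distance transform, 2-D positions, and per-part anchors, all omitted here):

```python
import numpy as np

def transform_response(R, a=1.0):
    """Transformed part response T(p0) = max_p [ R(p) - a * (p - p0)^2 ]
    in 1-D with a quadratic spring. Naive O(h^2); the O(h) generalized
    distance transform of Felzenszwalb & Huttenlocher computes the same thing."""
    R = np.asarray(R, dtype=float)
    p = np.arange(len(R))
    disp = p[None, :] - p[:, None]          # disp[p0, p] = p - p0
    return np.max(R[None, :] - a * disp ** 2, axis=1)

def detection_scores(R0, part_responses):
    """Root response plus transformed part responses: one combined score per
    root location (1-D toy; the 2x-resolution part lookup is omitted)."""
    total = np.asarray(R0, dtype=float).copy()
    for R in part_responses:
        total += transform_response(R)
    return total
```

A part response of [0, 10, 0] added to a flat root response of [1, 1, 1] gives [10, 11, 10]: the strong part detection boosts nearby root locations, discounted by the spring.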
SLIDE 29

What are the parts?

SLIDE 30

Aspect soup

General philosophy: enrich models to better represent the data

SLIDE 31

Mixture models

Data driven: aspect, occlusion modes, subclasses
FMR CVPR ’08: AP = 0.27 (person)
FGMR PAMI ’10: AP = 0.36 (person)

SLIDE 32

Pushmi–pullyu?

This was supposed to detect horses...
(left-facing horse template + right-facing horse template) / 2 = pushmi-pullyu
Good generalization properties on Doctor Dolittle’s farm

SLIDE 33

Latent orientation

Unsupervised left/right orientation discovery
FGMR PAMI ’10: AP = 0.36 (person) → voc-release5: AP = 0.45 (person)
Publicly available code for the whole system: voc-release5
horse AP: 0.42 / 0.47 / 0.57 (current → voc-release5)

SLIDE 34

Summary of results

[DT ’05]            AP 0.12
[FMR ’08]           AP 0.27
[FGMR ’10]          AP 0.36
[GFM voc-release5]  AP 0.45
[GFM ’11]           AP 0.49

SLIDE 35

Part 2: DPM parameter learning

[Figure: fixed model structure (components 1 and 2) with unknown (“?”) filter and deformation parameters]

SLIDE 36

Part 2: DPM parameter learning

[Figure: fixed model structure (components 1 and 2) with unknown (“?”) parameters; training images labeled y = +1]

SLIDE 37

Part 2: DPM parameter learning

[Figure: fixed model structure (components 1 and 2) with unknown (“?”) parameters; training images labeled y = +1 and y = −1]
SLIDE 38

Part 2: DPM parameter learning

[Figure: fixed model structure (components 1 and 2) with unknown (“?”) parameters; training images labeled y = +1 and y = −1]

Parameters to learn:
  – biases (per component)
  – deformation costs (per part)
  – filter weights

SLIDE 39

Linear parameterization

score(x, p0) = max p1,...,pn [ Σi mi(x, pi) − Σi di(p0, pi) ]

filter scores:  mi(x, pi) = wi · φ(x, pi)
spring costs:  di(p0, pi) = di · ψ(p0, pi),  with ψ(p0, pi) = (dx, dy, dx², dy²)

⇒ score(x, p0) = max z w · Φ(x, z)
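The point of the linear parameterization is that the spring cost is a dot product with fixed displacement features, so the full score collapses into a single dot product w · Φ(x, z). A sketch (the anchor offset and the numeric values are made up for illustration):

```python
import numpy as np

def spring_features(p_root, p_part, anchor):
    """psi = (dx, dy, dx^2, dy^2): displacement of a part from its anchor
    position relative to the root. The spring cost d_i · psi is linear in
    the parameters d_i, so it can be folded into w · Phi(x, z)."""
    dx = p_part[0] - (p_root[0] + anchor[0])
    dy = p_part[1] - (p_root[1] + anchor[1])
    return np.array([dx, dy, dx * dx, dy * dy], dtype=float)

# deformation parameters d_i are learned in practice; these are made up
d_i = np.array([0.0, 0.0, 0.05, 0.05])
cost = float(d_i @ spring_features((10, 10), (14, 12), anchor=(3, 1)))
```

Here the part lands one cell right and one cell below its anchor, so ψ = (1, 1, 1, 1) and the spring cost is 0.1.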
SLIDE 40

Positive examples (y = +1)

x specifies an image and a bounding box (e.g., a person)

We want fw(x) = max z∈Z(x) w · Φ(x, z) to score ≥ +1

Z(x) includes all z with more than 70% overlap with the ground truth

SLIDE 41

Negative examples (y = -1)

x specifies an image and a HOG pyramid location p0

We want fw(x) = max z∈Z(x) w · Φ(x, z) to score ≤ −1

Z(x) restricts the root to p0 and allows any placement of the other filters

SLIDE 42

Typical dataset

300 – 8,000 positive examples
500 million to 1 billion negative examples (not including latent configurations!)
Large-scale*

*unless someone from Google is here

SLIDE 43

How we learn parameters: latent SVM

E(w) = (1/2)||w||² + C Σi max{0, 1 − yi fw(xi)}
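The objective can be written directly in code; a toy sketch where each example carries a matrix of candidate placement features Φ(x, z), one row per z (the names f_w, Phi_z, pos, neg are mine, not from the released system):

```python
import numpy as np

def f_w(w, Phi_z):
    """f_w(x) = max over latent placements z of w · Phi(x, z).
    Phi_z: (num_placements, dim) array, one feature vector per candidate z."""
    return float(np.max(Phi_z @ w))

def lsvm_objective(w, pos, neg, C=1.0):
    """E(w) = 1/2 ||w||^2 + C * sum_i max{0, 1 - y_i f_w(x_i)}.
    pos / neg: lists of (num_placements, dim) arrays, one per example."""
    obj = 0.5 * float(w @ w)
    obj += C * sum(max(0.0, 1.0 - f_w(w, P)) for P in pos)   # y = +1 terms
    obj += C * sum(max(0.0, 1.0 + f_w(w, N)) for N in neg)   # y = -1 terms
    return obj
```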
SLIDE 44

How we learn parameters: latent SVM

E(w) = (1/2)||w||² + C Σi max{0, 1 − yi fw(xi)}

E(w) = (1/2)||w||²
       + C Σi∈pos max{0, 1 − max z∈Z(xi) w · Φ(xi, z)}
       + C Σi∈neg max{0, 1 + max z∈Z(xi) w · Φ(xi, z)}

SLIDE 45

How we learn parameters: latent SVM

E(w) = (1/2)||w||² + C Σi max{0, 1 − yi fw(xi)}

E(w) = (1/2)||w||²
       + C Σi∈pos max{0, 1 − max z∈Z(xi) w · Φ(xi, z)}
       + C Σi∈neg max{0, 1 + max z∈Z(xi) w · Φ(xi, z)}

[Figure: +score vs. w is a pointwise max of the linear functions w · Φ(x, z1), . . . , w · Φ(x, z4) ⇒ convex]

SLIDE 46

How we learn parameters: latent SVM

E(w) = (1/2)||w||² + C Σi max{0, 1 − yi fw(xi)}

E(w) = (1/2)||w||²
       + C Σi∈pos max{0, 1 − max z∈Z(xi) w · Φ(xi, z)}
       + C Σi∈neg max{0, 1 + max z∈Z(xi) w · Φ(xi, z)}

[Figure: +score vs. w is a pointwise max of linear functions (z1, . . . , z4) ⇒ convex; −score vs. w is a pointwise min ⇒ concave :(]

SLIDE 47

Observations

[Figure: +score vs. w is convex (max of linear functions); −score vs. w is concave]

The latent SVM objective is convex in the negatives but not in the positives ⇒ “semi-convex”

SLIDE 48

Convex upper bound on loss

[Figure: fixing the latent placement at the one chosen by the current w linearizes the positive score]

ZPi = argmax z∈Z(xi) w(current) · Φ(xi, z)

max{0, 1 − max z∈Z(x) w · Φ(x, z)} ≤ max{0, 1 − w · Φ(x, ZPi)}   (convex)

SLIDE 49

Auxiliary objective

Let ZP = {ZP1, ZP2, . . . }

E(w, ZP) = (1/2)||w||²
           + C Σi∈pos max{0, 1 − w · Φ(xi, ZPi)}
           + C Σi∈neg max{0, 1 + max z∈Z(xi) w · Φ(xi, z)}

SLIDE 50

Auxiliary objective

Let ZP = {ZP1, ZP2, . . . }

E(w, ZP) = (1/2)||w||²
           + C Σi∈pos max{0, 1 − w · Φ(xi, ZPi)}
           + C Σi∈neg max{0, 1 + max z∈Z(xi) w · Φ(xi, z)}

Note that E(w, ZP) ≥ min ZP E(w, ZP) = E(w)

SLIDE 51

Auxiliary objective

Let ZP = {ZP1, ZP2, . . . }

E(w, ZP) = (1/2)||w||²
           + C Σi∈pos max{0, 1 − w · Φ(xi, ZPi)}
           + C Σi∈neg max{0, 1 + max z∈Z(xi) w · Φ(xi, z)}

Note that E(w, ZP) ≥ min ZP E(w, ZP) = E(w)
and w* = argmin w [ min ZP E(w, ZP) ] = argmin w E(w)

SLIDE 52

Auxiliary objective

w* = argmin w [ min ZP E(w, ZP) ] = argmin w E(w)

This isn’t any easier to optimize

SLIDE 53

Auxiliary objective

w* = argmin w [ min ZP E(w, ZP) ] = argmin w E(w)

This isn’t any easier to optimize
Find a stationary point by coordinate descent on E(w, ZP)

SLIDE 54

Auxiliary objective

w* = argmin w [ min ZP E(w, ZP) ] = argmin w E(w)

This isn’t any easier to optimize
Find a stationary point by coordinate descent on E(w, ZP)
Initialization: either by picking a w(0) (or a ZP(0))

SLIDE 55

Auxiliary objective

w* = argmin w [ min ZP E(w, ZP) ] = argmin w E(w)

This isn’t any easier to optimize
Find a stationary point by coordinate descent on E(w, ZP)
Initialization: either by picking a w(0) (or a ZP(0))
Step 1:  ZPi := argmax z∈Z(xi) w(t) · Φ(xi, z)   ∀ i ∈ pos

SLIDE 56

Auxiliary objective

w* = argmin w [ min ZP E(w, ZP) ] = argmin w E(w)

This isn’t any easier to optimize
Find a stationary point by coordinate descent on E(w, ZP)
Initialization: either by picking a w(0) (or a ZP(0))
Step 1:  ZPi := argmax z∈Z(xi) w(t) · Φ(xi, z)   ∀ i ∈ pos
Step 2:  w(t+1) := argmin w E(w, ZP)
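The two coordinate-descent steps can be sketched end to end on a toy representation where each example is a matrix of candidate placement features (plain subgradient descent stands in for the actual solver; the real system uses SGD with data mining, and the names here are mine):

```python
import numpy as np

def lsvm_coordinate_descent(w0, pos, neg, C=1.0, lr=0.01, iters=5, steps=200):
    """Coordinate descent on the auxiliary objective E(w, ZP).
    Step 1 fixes the best placement ZP_i for each positive (this is detection);
    Step 2 minimizes the now-convex objective in w by subgradient descent.
    pos / neg: lists of (num_placements, dim) arrays, one per example."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(iters):
        # Step 1: relabel each positive with its highest-scoring placement
        ZP = [P[np.argmax(P @ w)] for P in pos]
        # Step 2: subgradient descent on the convex upper bound E(w, ZP)
        for _ in range(steps):
            g = w.copy()                         # gradient of (1/2)||w||^2
            for phi in ZP:                       # positive hinge terms
                if 1.0 - w @ phi > 0:
                    g -= C * phi
            for N in neg:                        # negative hinge terms (max over z)
                phi = N[np.argmax(N @ w)]
                if 1.0 + w @ phi > 0:
                    g += C * phi
            w -= lr * g
    return w
```

On a tiny separable problem the learned w scores every positive's best placement above every negative's best placement.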

SLIDE 57

Step 1

This is just detection:

[Figure: test image → feature map (and feature map at 2× resolution) → responses of root and part filters → transformed responses → detection scores for each root location]

ZPi := argmax z∈Z(xi) w(t) · Φ(xi, z)   ∀ i ∈ pos

SLIDE 58

Step 2

min w  (1/2)||w||²
       + C Σi∈pos max{0, 1 − w · Φ(xi, ZPi)}
       + C Σi∈neg max{0, 1 + max z∈Z(xi) w · Φ(xi, z)}

Convex

SLIDE 59

Step 2

min w  (1/2)||w||²
       + C Σi∈pos max{0, 1 − w · Φ(xi, ZPi)}
       + C Σi∈neg max{0, 1 + max z∈Z(xi) w · Φ(xi, z)}

Convex
Similar to a structural SVM

SLIDE 60

Step 2

min w  (1/2)||w||²
       + C Σi∈pos max{0, 1 − w · Φ(xi, ZPi)}
       + C Σi∈neg max{0, 1 + max z∈Z(xi) w · Φ(xi, z)}

Convex
Similar to a structural SVM
But, recall: 500 million to 1 billion negative examples!

SLIDE 61

Step 2

min w  (1/2)||w||²
       + C Σi∈pos max{0, 1 − w · Φ(xi, ZPi)}
       + C Σi∈neg max{0, 1 + max z∈Z(xi) w · Φ(xi, z)}

Convex
Similar to a structural SVM
But, recall: 500 million to 1 billion negative examples!
Can be solved by a working set method
  – “bootstrapping”
  – “data mining”
  – “constraint generation”
  – requires a bit of engineering to make this fast
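One round of the working-set loop might look like this sketch (the margin test and the eviction rule are simplified, and rows stand in for single feature vectors rather than latent placement sets; all names are mine):

```python
import numpy as np

def mine_hard_negatives(w, neg_pool, cache, margin=1.0, cache_limit=5000):
    """One round of the working-set ('bootstrapping' / 'data mining') loop:
    scan the huge negative pool for margin violators (score > -margin), add
    them to the cache, and evict cached negatives the current model already
    scores safely below the margin. neg_pool / cache: (N, dim) arrays."""
    hard = neg_pool[neg_pool @ w > -margin]      # newly mined hard negatives
    if cache.size:
        cache = cache[cache @ w > -margin]       # keep only still-hard examples
    else:
        cache = np.empty((0, neg_pool.shape[1]))
    return np.vstack([cache, hard])[:cache_limit]
```

The SVM solver then only ever touches the bounded cache, so the billion-example negative set never has to fit in memory at once.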

SLIDE 62

Comments

Latent SVM is mathematically equivalent to MI-SVM (Andrews et al., NIPS 2003)
Latent SVM can be written as a latent structural SVM (Yu and Joachims, ICML 2009)
  – the natural optimization algorithm is the concave-convex procedure
  – similar to, but not exactly the same as, coordinate descent

[Figure: xi as a bag of instances xi1, xi2, xi3 with latent labels z1, z2, z3]

SLIDE 63

What about the model structure?

[Figure: fixed model structure (components 1 and 2) with unknown (“?”) parameters; training images labeled y = +1 and y = −1]

Model structure:
  – # components
  – # parts per component
  – root and part filter shapes
  – part anchor locations

SLIDE 64

Learning model structure

Split positives by aspect ratio
Warp to a common size
Train a Dalal & Triggs model for each aspect ratio on its own

SLIDE 65

Learning model structure

Use the D&T filters as the initial w for LSVM training
Merge components
Root filter placement and component choice are latent

SLIDE 66

Learning model structure

Add parts to cover high-energy areas of the root filters
Continue training the model with LSVM

SLIDE 67

Learning model structure

without orientation clustering with orientation clustering

SLIDE 68

Learning model structure

In summary:
  – repeated application of LSVM training to models of increasing complexity
  – structure learning involves many heuristics (and vision insight!)