Features for Computer Vision - Alex Berg, Computer Science Department, Columbia University - PowerPoint PPT Presentation


SLIDE 1

Features for Computer Vision

Alex Berg
Computer Science Department
Columbia University

SLIDE 2

Why Vision? Light!

It is how we see other people, navigate our environment, communicate ideas, entertain, and measure the world around us.

SLIDE 3

Why is light good for measurement?

  • Plentiful, sometimes free
  • Interacts with many things, but not too many
  • Goes generally straight over distance
  • Very small - high spatial resolution
  • Easy to detect - cameras work, are cheap
  • Fast, but not too fast - time of flight sensors
  • Comes in flavors (wavelengths)
  • But…

Microscopy, Surveillance, 3D Analysis / Navigation, Remote Sensing

We need to know which bits to measure!


SLIDE 4

Recognition in Computer Vision

Deciding which bits to measure…


SLIDE 5

Range of Difficulty in Recognition

vs

SLIDE 6

Range of Difficulty in Recognition

vs

Recognizing the same image

SLIDE 7

Range of Difficulty in Recognition

vs

Recognizing the same object - small change

SLIDE 8

Range of Difficulty in Recognition

vs

Recognizing the same object - large change

SLIDE 9

Range of Difficulty in Recognition

vs

Recognizing the same object

SLIDE 10

Range of Difficulty in Recognition

vs

Recognizing the same object category - Chair

SLIDE 11

Range of Difficulty in Recognition

vs

Recognizing the same object category


SLIDE 12

Where do Features Fit?

Objects in the world -> Illumination -> Lens -> Sensor -> Post Processing -> Pixels -> Features ("SIFT", "HOG", etc.) -> Higher level Vision Algorithms -> Recognition / Decision

SLIDE 13

Where do Features Fit?

Objects in the world -> Illumination -> Lens -> Sensor -> Post Processing -> Pixels -> Features ("SIFT", "HOG", etc.) -> Higher level Vision Algorithms -> Recognition / Decision

SLIDE 14

Where do Features Fit?

Objects in the world -> Illumination -> Lens -> Sensor -> Post Processing -> Pixels -> Features ("SIFT", "HOG", etc.) -> Higher level Vision Algorithms -> Recognition / Decision


SLIDE 15

Control objects and lighting (if possible)

Retroreflective balls; illumination from near the cameras.


SLIDE 16

Here raw pixels are almost enough…

vs

brighter / darker: multiple nearby pixels in a circle agreeing probably suffice.


SLIDE 17

Cows May Be Less Cooperative

Cows come in many brightnesses, as does the background.


SLIDE 18

Looking at shape

Similar shapes have very different pixel values.

SLIDE 19

Looking at shape

Similar shapes have very different pixel values.


SLIDE 20

Vision is difficult

Different images of the same thing often look different. Sometimes images of different things look the same. Despite all its useful qualities, light only tells us about objects indirectly…


SLIDE 21

The usual suspects for problems are:

Pose, Illumination, Articulation, Intra-category variation


SLIDE 22

When are raw pixels enough?

Given sufficient training data and a powerful classifier, patches or windows of pixels would be enough - we wouldn't need any high-level features. For a toy 10x10-pixel image with 10 brightness levels there are 10^100 possibilities; you might imagine labeling all of them as face or not… There is almost never enough training data for this approach. Exceptions are when we can enumerate the positive examples.


SLIDE 23

Simple example with translation

0,0,0,0,1,0,0,0,0
0,0,0,0,0,0,0,1,0
0,0,0,0,0,0,0,0,0

If not careful, this isn't even differentiable. Need to be careful about smoothing and representation. (Axes: translation vs. pixel value.)

SLIDE 24

Simple example with translation

0,0,0,0,1,0,0,0,0
0,0,0,0,0,0,0,1,0
0,0,0,0,0,0,0,0,0

If not careful, this isn't even differentiable. Need to be careful about smoothing and representation. Of course we can get around simple translation by evaluating comparisons everywhere.
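The brittleness of raw pixels under translation is easy to see numerically. A minimal sketch (my illustration, not from the slides), assuming simple 1D signals:

```python
import numpy as np

def gaussian_smooth(signal, sigma=1.5):
    """Blur a 1D signal with a normalized Gaussian kernel ('same' length)."""
    radius = int(3 * sigma)
    xs = np.arange(-radius, radius + 1)
    kernel = np.exp(-xs**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    return np.convolve(signal, kernel, mode="same")

n = 9
a = np.zeros(n); a[4] = 1.0   # impulse at position 4
b = np.zeros(n); b[7] = 1.0   # the same impulse translated by 3

# Raw pixels: the two vectors do not overlap at all.
raw_similarity = float(a @ b)

# After smoothing, the similarity decays gradually with translation,
# so a small shift produces a small change in the representation.
smooth_similarity = float(gaussian_smooth(a) @ gaussian_smooth(b))
print(raw_similarity, smooth_similarity)
```

With the raw representation the inner product jumps discontinuously from 1 to 0 under any shift; the smoothed representation varies gradually, which is the "careful about smoothing" point above.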


SLIDE 25

Can look at features as trying to straighten out appearance manifolds

Important to keep track of what you are throwing away.


SLIDE 26

One attempt: edges

SLIDE 27

One attempt: edges

SLIDE 28

One attempt: edges

SLIDE 29

One attempt: edges

Image -> Orientation -> Edge Energy

SLIDE 30

One attempt: edges

Works because it indicates something physical about the object that is conserved across images. Image -> Orientation -> Edge Energy


SLIDE 31

Edges help deal with variation in illumination

Illumination fields are often soft, so sharp changes may indicate something about the object, not the illumination. Orientation and phase are often preserved under lighting variations.

Possible sources of edges:
- Albedo variations on the object surface
- Surface structure on the object (changing surface normal, creases, holes)
- Boundaries of the object

Work from Simoncelli et al. on the importance of orientation / phase for human perception.
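As a concrete illustration of the Image -> Orientation -> Edge Energy idea, here is a minimal numpy sketch (mine, not code from the talk) using finite-difference gradients:

```python
import numpy as np

def edge_energy_and_orientation(image):
    """Per-pixel edge energy and gradient orientation (radians)."""
    gy, gx = np.gradient(image.astype(float))  # derivatives along rows, columns
    energy = gx**2 + gy**2                     # edge energy: squared gradient magnitude
    orientation = np.arctan2(gy, gx)           # gradient direction
    return energy, orientation

# A vertical step edge: dark left half, bright right half.
img = np.zeros((8, 8))
img[:, 4:] = 1.0

energy, orientation = edge_energy_and_orientation(img)
# Energy concentrates on the columns around the step, and the gradient
# there points along +x (orientation near 0); flat regions have zero energy.
print(energy[4, 3:5], orientation[4, 4])
```

A uniform brightness change adds a constant to `img`, which leaves the gradients (and hence energy and orientation) unchanged - the illumination-robustness the slide describes.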


SLIDE 32

Versions of edges

Brightness gradients, Haar wavelets; multiple scales, elongated filters. Color compass edges (Ruzon & Tomasi). Texture ("compass") (Martin, Fowlkes, Malik).

χ²( , )   EMD( , )   r_i = I ∗ k_i

SLIDE 33

Is it the same? What?

SLIDE 34

Is it the same? Optical Flow

Color and texture do not match. Too low resolution to extract edges. Coarse optical flow (spatio-temporal gradient direction) features match. Efros, Berg, Mori, Malik, ICCV 2003.


SLIDE 35

Review

I(x, y) -> F(I(x, y))

Image -> feature. Oriented edge detection may be helpful. Transformed signals: I(T(x, y)) = I(x, y).

SLIDE 36

Review

I(x, y) -> F(I(x, y))

Image -> feature. Oriented edge detection may be helpful. Transformed signals: I(T(x, y)) = I(x, y). Region of interest operators. "Histograms".


SLIDE 37

Region of interest operators

Slide from Lazebnik


SLIDE 38

Regions of Interest - Invariance

F(I(T(x, y))) = F(I(x, y))

SLIDE 39

Regions of Interest - Invariance

F(I(T(x, y))) = F(I(x, y))

Compute features in light blue region

SLIDE 40

Regions of Interest - Invariance

F(I(T(x, y))) = F(I(x, y))

Compute features. Adapt region to image content (boxes).

SLIDE 41

Regions of Interest - Invariance

F(I(T(x, y))) = F(I(x, y))

Transform to canonical pose. Compute features. Adapt region to image content (boxes).

SLIDE 42

Regions of Interest - Invariance & Co-Variance

F(I(T(x, y))) = F(I(x, y))
F(I(T(x, y))) = T′(F(I(x, y)))

Transform to canonical pose. Compute features. Adapt region to image content (boxes).

SLIDE 43

Regions of Interest - Invariance & Co-Variance

F(I(T(x, y))) = F(I(x, y))
F(I(T(x, y))) = T′(F(I(x, y)))

Transform to canonical pose. Compute features. Adapt region to image content (boxes). Schmid & Mohr, Lowe.


SLIDE 44

Region of Interest Operators

SLIDE 45

Region of Interest Operators

Look for local maxima

SLIDE 46

Region of Interest Operators

Cross section looks like -> Look for local maxima, blobs

SLIDE 47

Region of Interest Operators

Cross section looks like ->


slide-48
SLIDE 48

Extract
affine
regions
 Normalize
regions
 Eliminate
rotaHonal

 ambiguity


Example
Feature
Pipeline


Edge
 OrientaHon

 Histograms


SLIDE 49

SIFT (Lowe '04)

Example Feature Pipeline: Extract affine regions -> Normalize regions -> Eliminate rotational ambiguity -> Edge Orientation Histograms

Harris-Affine Region of Interest Operator. Lowe's Descriptor.

SLIDE 50

SIFT (Lowe '04)

Example Feature Pipeline: Extract affine regions -> Normalize regions -> Eliminate rotational ambiguity -> Edge Orientation Histograms

Harris-Affine Region of Interest Operator. Lowe's Descriptor. Features!
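A minimal sketch of the last stage of this pipeline - edge orientation histograms over a normalized patch. This is my simplification, not Lowe's actual descriptor, which adds Gaussian weighting, trilinear interpolation, and careful normalization:

```python
import numpy as np

def orientation_histograms(patch, grid=4, bins=8):
    """SIFT-flavored descriptor: a grid x grid array of magnitude-weighted
    gradient-orientation histograms over a square patch, concatenated."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)  # in [-pi, pi]
    bin_index = ((orientation + np.pi) / (2 * np.pi) * bins).astype(int) % bins

    cell = patch.shape[0] // grid
    descriptor = np.zeros((grid, grid, bins))
    for i in range(grid):
        for j in range(grid):
            sl = (slice(i * cell, (i + 1) * cell), slice(j * cell, (j + 1) * cell))
            # Histogram of orientations in this cell, weighted by edge strength.
            descriptor[i, j] = np.bincount(
                bin_index[sl].ravel(), weights=magnitude[sl].ravel(), minlength=bins
            )
    v = descriptor.ravel()
    return v / (np.linalg.norm(v) + 1e-12)  # unit-normalize

patch = np.random.default_rng(0).random((16, 16))
d = orientation_histograms(patch)
print(d.shape)  # (128,) - the familiar 4 x 4 x 8 layout
```

Pooling orientations over cells rather than keeping per-pixel values is what buys tolerance to the small misalignments left after normalization.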


SLIDE 51

Matching for Alignment

Use descriptors to compare features and enforce geometric constraints.

SLIDE 52

Match a few points

SLIDE 53

Dense Alignment

SLIDE 54

Sift 2


slide-55
SLIDE 55

Eliminate
rotaHonal

 ambiguity
 Edge
 OrientaHon

 Histograms


Example
Feature
Pipeline


Remaining
variaHon
 here
 Needs
to
be
handled
 here


SLIDE 56

Matching affine covariant regions

Note that they still don't look exactly the same even on easy images! Lowe's orientation histogram helps, but Grauman & Darrell and Lazebnik et al. have a neat alternative.


SLIDE 57

Embedding


SLIDE 58

Grauman's Pyramid Match Kernel

"Match" score for sets X, Y of features.

Idea from statistics: Mallows 1972 included the method of quantizing feature space, which was rediscovered by Rubner et al. 1998 as the Earth Mover's Distance.

SLIDE 59

Grauman's Pyramid Match Kernel

"Match" score for sets X, Y of features.

Idea from statistics: Mallows 1972 included the method of quantizing feature space, which was rediscovered by Rubner et al. 1998 as the Earth Mover's Distance (EMD). Indyk and Thaper 2003 showed how to embed points in a multiscale pyramid so that the l2 norm on the embedding approximated EMD.

SLIDE 60

Grauman's Pyramid Match Kernel

"Match" score for sets X, Y of features.

Indyk and Thaper 2003 showed how to embed points in a multiscale pyramid so that the l2 norm on the embedding approximated EMD. Grauman replaced l2 with histogram intersection. The Histogram Intersection / Min Kernel is positive definite, so we can use it for a kernelized SVM.

SLIDE 61

Grauman's Pyramid Match Kernel

"Match" score for sets X, Y of features.

Indyk and Thaper 2003 showed how to embed points in a multiscale pyramid so that the l2 norm on the embedding approximated EMD. Grauman replaced l2 with histogram intersection. The Histogram Intersection / Min Kernel is positive definite, so we can use it for a kernelized SVM.
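The positive-definiteness claim can be checked numerically. A small sketch (mine, not from the talk), assuming nonnegative histogram-style features:

```python
import numpy as np

def min_kernel(A, B):
    """Histogram intersection (min) kernel: K[p, q] = sum_i min(A[p, i], B[q, i])."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

rng = np.random.default_rng(0)
X = rng.random((20, 5))        # 20 nonnegative "histograms", 5 bins each
K = min_kernel(X, X)

# A positive-definite kernel yields a symmetric PSD Gram matrix,
# which is exactly what a kernelized SVM needs.
eigenvalues = np.linalg.eigvalsh(K)
print(eigenvalues.min() >= -1e-9)  # True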


SLIDE 62

Spatial Pyramid Match (Lazebnik)

Only use pyramid for the spatial coordinates of features.

SLIDE 63

Spatial Pyramid Match (Lazebnik)

Applied to large region or whole image; no interest point operator.


SLIDE 64

Rotation / scale invariance not always needed.

Airplanes on the runway are level.


SLIDE 65

Spatial Pyramid Kernel (Lazebnik)

Distribution of edge features: x, y, orientation, energy.

E(x, y, o) = edge energy at x, y in orientation o. Histograms are just sums of different slices of E (just a linear projection if E is represented discretely).

SLIDE 66

Spatial Pyramid Kernel (Lazebnik)

Distribution of edge features: x, y, orientation, energy.

E(x, y, o) = edge energy at x, y in orientation o. Histograms are just sums of different slices of E (just a linear projection if E is represented discretely). Same for GIST, Shape Contexts, Geometric Blur, HOG, etc. The only impediment to an understanding of all of these features as simple projections of something like E() above is the min kernel…
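The "histograms are linear projections" point can be made concrete. A small sketch (my illustration, not from the talk) with a discretized E(x, y, o) array:

```python
import numpy as np

rng = np.random.default_rng(1)
E = rng.random((8, 8, 4))  # discretized edge energy: 8x8 positions, 4 orientations

# Histogram for one spatial-pyramid cell: sum E over a slice (cell x all orientations).
cell_hist = E[0:4, 0:4, :].sum(axis=(0, 1))  # 4 orientation bins

# The same histogram as an explicit linear projection of the flattened E.
P = np.zeros((4, E.size))
for o in range(4):
    mask = np.zeros_like(E)
    mask[0:4, 0:4, o] = 1.0          # indicator of the slice being summed
    P[o] = mask.ravel()
projected = P @ E.ravel()

print(np.allclose(cell_hist, projected))  # True
```

Every cell at every pyramid level just adds more rows to P, so the whole spatial pyramid descriptor is one linear map applied to E.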


SLIDE 67

Unified Feature Pipeline

Image -> Edges/filter responses -> Contrast Normalization -> Projection -> Comparison (L2, inner product, min kernel)


SLIDE 68

Max-Margin Additive Classifiers for Detection

Subhransu Maji (UC Berkeley), Alex Berg (Columbia University)

Will be a talk at ICCV 2009 in Kyoto


SLIDE 69

Detection

Find pedestrians

SLIDE 70

Detection

Find pedestrians

SLIDE 71

Detection

Find pedestrians

SLIDE 72

Detection

Find pedestrians

SLIDE 73

Detection

Find pedestrians. 10^4 to 10^6 or more windows per image.

SLIDE 74

Detection

Find pedestrians. 10^4 to 10^6 or more windows per image.

Boosting + Decision Trees - Viola & Jones (faces)
Linear Classifier - Dalal & Triggs (pedestrians)
Neural Networks - Rowley et al. (faces)


SLIDE 75

Classification

What is this?

SLIDE 76

Classification

Choose from many categories. What is this?

SLIDE 77

Classification

Choose from many categories. What is this?

~10^5 example images (training)

SLIDE 78

Classification

Choose from many categories. What is this?

~10^5 example images (training)

Nearest Neighbor - Berg (Caltech 101)
Kernelized SVM - Grauman et al. (Caltech 101)
Combination of SVMs - Varma et al. (Caltech 101)
(skipping model-based methods)

SLIDE 79

Classification

Choose from many categories. What is this?

~10^5 example images (training)

Nearest Neighbor - Berg (Caltech 101)
Kernelized SVM - Grauman et al. (Caltech 101)
Combination of SVMs - Varma et al. (Caltech 101)
(skipping model-based methods)

3 sec / comparison vs. 0.001 sec / comparison. Slow?

Caltech 101 - Fei-Fei Li, Pietro Perona 2004


SLIDE 80

Detection: Linear Classifier
Classification: Kernelized SVM Classifier

SLIDE 81

Detection - Linear Classifier:
h(x) = Σ_{i=1}^{#dimensions} w_i x_i + b
Decision function is sign(h).

Classification - Kernelized SVM Classifier:
h(x) = Σ_{j=1}^{#sv} α_j K(x, x_j) + b
Decision function is sign(h).

SLIDE 82

Detection - Linear Classifier:
h(x) = Σ_{i=1}^{#dimensions} w_i x_i + b
(x_i: one coordinate of the feature vector.) Cost: O(#dims).

Classification - Kernelized SVM Classifier:
h(x) = Σ_{j=1}^{#sv} α_j K(x, x_j) + b
(x: test feature vector; x_j: support vector, a training example; K: kernel function, a comparison.) Cost: O(#dims × #sv).

SLIDE 83

Detection - Linear Classifier:
h(x) = Σ_{i=1}^{#dimensions} w_i x_i + b
(x_i: one coordinate of the feature vector.) Cost: O(#dims).

Classification - Kernelized SVM Classifier:
h(x) = Σ_{j=1}^{#sv} α_j K(x, x_j) + b
(x: feature vector; x_j: support vector, a training example; K: kernel function, a comparison.) Cost: O(#dims × #sv).


SLIDE 84

An SVM with an Additive Kernel can be Evaluated Efficiently

Maji, Berg, Malik CVPR 2008

If K(a, b) = Σ_{i=1}^{#dimensions} K_i(a_i, b_i), then

h(x) = Σ_{j=1}^{#sv} α_j K(x, x_j) + b
     = Σ_{j=1}^{#sv} α_j Σ_{i=1}^{#dimensions} K_i(x_i, x_i^j) + b
     = Σ_{i=1}^{#dimensions} h_i(x_i) + b

SLIDE 85

An SVM with an Additive Kernel can be Evaluated Efficiently

Maji, Berg, Malik CVPR 2008

If K(a, b) = Σ_{i=1}^{#dimensions} K_i(a_i, b_i), then

h(x) = Σ_{j=1}^{#sv} α_j K(x, x_j) + b
     = Σ_{j=1}^{#sv} α_j Σ_{i=1}^{#dimensions} K_i(x_i, x_i^j) + b
     = Σ_{i=1}^{#dimensions} h_i(x_i) + b

If you have an additive kernel… then the SVM decision function is additive.

SLIDE 86

An SVM with an Additive Kernel can be Evaluated Efficiently

Maji, Berg, Malik CVPR 2008

If K(a, b) = Σ_{i=1}^{#dimensions} K_i(a_i, b_i), then

h(x) = Σ_{j=1}^{#sv} α_j K(x, x_j) + b
     = Σ_{j=1}^{#sv} α_j Σ_{i=1}^{#dimensions} K_i(x_i, x_i^j) + b
     = Σ_{i=1}^{#dimensions} h_i(x_i) + b

If you have an additive kernel… then the SVM decision function is additive.

Evaluate these 1D functions efficiently using a lookup table or spline (exact or approximate).
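A minimal sketch of the lookup-table idea (my illustration, not the paper's code), using the min kernel as the additive kernel: precompute each h_i on a dense grid, then classify with #dimensions interpolations instead of a sum over all support vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sv, n_dims = 20, 5
sv = rng.random((n_sv, n_dims))     # support vectors, features in [0, 1]
alpha = rng.standard_normal(n_sv)   # signed coefficients (alpha_j y_j)
b = 0.1

def h_naive(x):
    """Direct kernelized evaluation with the min kernel: O(#sv * #dims)."""
    return alpha @ np.minimum(x[None, :], sv).sum(axis=1) + b

# Lookup tables: tabulate each 1D function
# h_i(s) = sum_j alpha_j * min(s, sv[j, i]) on a dense grid, once.
grid = np.linspace(0.0, 1.0, 1001)
tables = [alpha @ np.minimum(grid[None, :], sv[:, [i]]) for i in range(n_dims)]

def h_table(x):
    """Approximate additive evaluation: one 1D interpolation per dimension, O(#dims)."""
    return sum(np.interp(x[i], grid, tables[i]) for i in range(n_dims)) + b

x = rng.random(n_dims)
print(h_naive(x), h_table(x))  # close; the table is an approximation
```

The per-query cost no longer depends on the number of support vectors, which is what makes sliding-window use feasible.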



SLIDE 87

Intersection or Min Kernel

Maji, Berg, Malik CVPR 2008

The Intersection or Min Kernel:
K_min(a, b) = Σ_{i=1}^{#dimensions} min(a_i, b_i)

Grauman et al. use this on multiscale histograms to approximate the linear assignment problem (and do recognition). Lazebnik et al. refine this approach to only use multiple scales for position, and not for the features. Much follow-on work.

SLIDE 88

Intersection or Min Kernel

Maji, Berg, Malik CVPR 2008

K_min(a, b) = Σ_{i=1}^{#dimensions} min(a_i, b_i)

h(x) = Σ_{j=1}^{#sv} α_j K_min(x, x_j) + b
     = Σ_{j=1}^{#sv} α_j Σ_{i=1}^{#dimensions} min(x_i, x_i^j) + b
     = Σ_{i=1}^{#dimensions} h_i(x_i) + b

where h_i(x_i) = Σ_{j=1}^{#sv} α_j min(x_i, x_i^j)

SLIDE 89

Intersection or Min Kernel

Maji, Berg, Malik CVPR 2008

K_min(a, b) = Σ_{i=1}^{#dimensions} min(a_i, b_i)

h(x) = Σ_{j=1}^{#sv} α_j K_min(x, x_j) + b
     = Σ_{j=1}^{#sv} α_j Σ_{i=1}^{#dimensions} min(x_i, x_i^j) + b
     = Σ_{i=1}^{#dimensions} h_i(x_i) + b

where h_i(x_i) = Σ_{j=1}^{#sv} α_j min(x_i, x_i^j)

The support vectors are constants; min(x_i, constant) is piecewise linear, so h_i(x_i) is piecewise linear.

SLIDE 90

Intersection or Min Kernel

Maji, Berg, Malik CVPR 2008

K_min(a, b) = Σ_{i=1}^{#dimensions} min(a_i, b_i)

h(x) = Σ_{j=1}^{#sv} α_j K_min(x, x_j) + b
     = Σ_{j=1}^{#sv} α_j Σ_{i=1}^{#dimensions} min(x_i, x_i^j) + b
     = Σ_{i=1}^{#dimensions} h_i(x_i) + b

where h_i(x_i) = Σ_{j=1}^{#sv} α_j min(x_i, x_i^j)

The support vectors are constants; min(x_i, constant) is piecewise linear, so h_i(x_i) is piecewise linear.

O(#dims × #sv) becomes O(#dims × log(#sv)) exact, or O(#dims) approximate.
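A sketch of the exact O(log #sv) evaluation for one dimension (my reconstruction of the idea, assuming plain numpy arrays): since h_i(s) = Σ_{c_j ≤ s} α_j c_j + s Σ_{c_j > s} α_j, sort the support-vector coordinates once, keep running sums, and answer each query with a binary search.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sv = 500
c = rng.random(n_sv)               # i-th coordinate of each support vector
alpha = rng.standard_normal(n_sv)  # signed coefficients

def h_naive(s):
    """h_i(s) = sum_j alpha_j * min(s, c_j): O(#sv) per query."""
    return float(np.sum(alpha * np.minimum(s, c)))

# Preprocessing: sort once, precompute prefix/suffix sums.
order = np.argsort(c)
c_sorted, a_sorted = c[order], alpha[order]
prefix_ac = np.concatenate([[0.0], np.cumsum(a_sorted * c_sorted)])
suffix_a = np.concatenate([[0.0], np.cumsum(a_sorted[::-1])])[::-1]

def h_fast(s):
    """Same value in O(log #sv): split min(s, c_j) into the terms with
    c_j <= s (contributing c_j) and c_j > s (contributing s)."""
    k = int(np.searchsorted(c_sorted, s, side="right"))
    return float(prefix_ac[k] + s * suffix_a[k])

s = 0.37
print(abs(h_naive(s) - h_fast(s)) < 1e-9)  # True
```

The piecewise-linear structure from the previous slide is exactly what makes the prefix-sum split exact rather than approximate.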

SLIDE 91

Time to Perform Classification

Maji, Berg, Malik CVPR 2008

Times in seconds to classify 10,000 test vectors.

SLIDE 92

Multiscale HOG features (Very Similar to Spatial Pyramids)

Based on histograms of response to eight oriented edge detections. Non-overlapping windows of integration and fixed-size windows for contrast normalization allow efficient computation.


SLIDE 93

Example h_i(x_i) and Approximations


SLIDE 94

Min Kernel "Better" than Linear

SLIDE 95

Min Kernel "Better" than Linear

Caltech 101 with "simple features", 15 training examples per category. Accuracy of Min Kernel vs. Linear on text classification. Linear SVM: 40% correct. Min Kernel (IK) SVM: 52% correct.


SLIDE 96

Now we can use the Min Kernel for Detection in Seconds Instead of Hours


SLIDE 97

Direct Training

It is possible to directly train classifiers with the same structure as the approximation, without using support vectors at all. The formulation is very similar to a linear classifier, with different regularization. Can be trained efficiently using stochastic (sub)gradient descent. Linear vs. piecewise linear.

H =
[  1 -1          ]
[ -1  2 -1       ]
[     .  .  .    ]
[       -1  2 -1 ]
[          -1  1 ]

Piecewise linear: minimize ŵ′Hŵ + c Σ_j ξ_j subject to y_j(ŵ′x̂_j + b) ≥ 1 − ξ_j, ξ_j ≥ 0
Linear: minimize w′w + c Σ_j ξ_j subject to y_j(w′x_j + b) ≥ 1 − ξ_j, ξ_j ≥ 0

SLIDE 98

Slightly different formulation

Linear:
min_w (λ/2) w′w + (1/m) Σ_i ℓ(w; (x_i, y_i))

Piecewise linear:
min_ŵ (λ/2) ŵ′Hŵ + (1/m) Σ_i ℓ(ŵ; (x̂_i, y_i))
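A minimal stochastic subgradient sketch for the linear form above, taking ℓ to be the hinge loss (this is my toy illustration, not the paper's solver; the piecewise-linear variant swaps w′w for ŵ′Hŵ in the regularizer):

```python
import numpy as np

def sgd_hinge(X, y, lam=0.01, epochs=50, seed=0):
    """Stochastic subgradient descent on (lam/2)||w||^2 + mean hinge loss."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(m):
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            margin = y[i] * (w @ X[i] + b)
            w *= (1 - eta * lam)           # gradient of the regularizer
            if margin < 1:                 # subgradient of the hinge term
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b

# Linearly separable toy data: label = sign of the first coordinate.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
w, b = sgd_hinge(X, y)
accuracy = np.mean(np.sign(X @ w + b) == y)
print(accuracy)
```

Each update touches one example, which is what makes this style of training scale to the large sparse problems mentioned on the following slides.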

slide-99
SLIDE 99

for







accuracy


Shalev‐Schwartz,
Singer,
Srebro
ICML
2007


O d λǫ

  • ǫ
SLIDE 100

Shalev-Shwartz, Singer, Srebro ICML 2007: O(d / (λε)) for accuracy ε.

Norm: √(w′Hw); iterates w′_1 H w_1, w′_{t+1/2} H w_{t+1/2}; update factor (1 − η_t λ H).

Maji, Berg, ICCV 2009


slide-101
SLIDE 101

Shalev-Shwartz, Singer, Srebro ICML 2007: O(d / (λε)) for accuracy ε.

Norm: √(w′Hw); iterates w′_1 H w_1, w′_{t+1/2} H w_{t+1/2}; update factor (1 − η_t λ H).

w and x are large but sparse, so we can get computation to scale with the number of non-zeros.



SLIDE 102

Conclusions

  • SVM with an additive kernel is just a sum of functions of each coordinate separately
  • We can evaluate these classifiers fast enough to do sliding window detection
  • Training time 100x faster than Viola-Jones; testing time ~10x slower than Viola-Jones, but may work for a broader range of categories
  • Current work: reducing training time to interactive
  • May have applications beyond computer vision -- with the correct additional projections this can approach an arbitrary classifier.


SLIDE 103

Thank you - and my co-authors so far…


SLIDE 104

Attribute and Simile Classifiers for Face Verification

Neeraj Kumar, Alex Berg, Peter Belhumeur, Shree Nayar, Columbia University

Will be a talk at ICCV 2009 in Kyoto


SLIDE 105

Faces in the wild, as in Names and Faces (T. Berg et al., CVPR 2004)

Large variation in pose, illumination, expression, lighting, etc.

SLIDE 106

Faces in the wild, as in Names and Faces (T. Berg et al., CVPR 2004)

Large variation in pose, illumination, expression, lighting, etc. Typical measures of face similarity (e.g. PCA+LDA) would say these faces are very different.

SLIDE 107

Faces in the wild, as in Names and Faces (T. Berg et al., CVPR 2004)

Large variation in pose, illumination, expression, lighting, etc. Typical measures of face similarity (e.g. PCA+LDA) would say these faces are very different. Ferencz et al. learned (discriminative) generalized linear models; some other groups (Caltech, UMass) have looked at similar approaches.

SLIDE 108

Faces in the wild, as in Names and Faces (T. Berg et al., CVPR 2004)

Large variation in pose, illumination, expression, lighting, etc. Erik Learned-Miller and the UMass group labeled a broader version of the Names and Faces data, called Labeled Faces in the Wild (LFW). Results from many groups are available -- the best are from Wolf et al. and Nowak and Jurie.

SLIDE 109

Faces in the wild, as in Names and Faces (T. Berg et al., CVPR 2004)

Large variation in pose, illumination, expression, lighting, etc. We look at attributes of the faces that may be robust with respect to identity, but are also common to many people, so that we can collect very large training sets for each attribute.



SLIDE 110

We look at attributes of the faces that may be robust with respect to identity, but are also common to many people, so that we can collect very large training sets for each attribute.

Features: RGB, HSV, gradient, gradient direction; moments, histograms; no normalization, or l1 or "l2"; various subparts of faces (requires alignment to a single generic face).


SLIDE 111

Some Training Data Collected Using Amazon's Mechanical Turk


SLIDE 112

SLIDE 113

~2000… An individual person may have distinctive features (e.g. eyes), so learn a classifier to recognize their eyes (e.g. "she had Bette Davis Eyes").



SLIDE 114

SLIDE 115

Humans are really good!

SLIDE 116

Humans are really good!

SLIDE 117

Humans are really good! They don't even need to see the face! Other algorithms have access to all this background, and we do substantially better looking only at the face… But there is a long way to go to human performance, even on a tight crop of the face….

SLIDE 118

New Large Dataset of Celebrities

This portion was used to train the similarity classifiers… In total, over 200 people with 100+ images of each.

SLIDE 119

Related To "Straightening Manifolds"

It is difficult to collect many example images of particular people, but easy to collect millions of examples of Asians, or smiling people, etc., with huge variation including pose and settings. The similarity of a face to any of these groups defines a simplified coordinate for its position in the space of all faces.



slide-120
SLIDE 120

Deformable Templates – Discrete Optimization

Model of Car Image

Appearance: cost of matching two local features (geometric blur).
Binary indicator vector: X_ij = 1 iff i matches to j.
Geometry: cost of matching two pairs of features.
Integer Quadratic Programming.

A.C. Berg, T.L. Berg, J. Malik CVPR 05

slide-121
SLIDE 121

Keep quadratic framework, but use parallel edges as local features

  • X. Ren, A.C. Berg, J. Malik

ICCV 05

Edges -> Parallel Edges -> Configuration. The only completely non-geometric-blur-related work in the talk, for now…

slide-122
SLIDE 122

Keep patch features (geometric blur), but modify geometry model

Align faces before representation

Cluster using face appearance and names from captions, get a very large data set of faces automatically 90% (or more) accurate

T.L Berg, A.C. Berg,

  • J. Edwards, D.A. Forsyth

NIPS 04

slide-123
SLIDE 123

Deformable Template Matching with Exemplars for Recognition

  • Use exemplars as deformable templates
  • Find a correspondence between the query image and

each template, best one wins

Query Image Database of Templates

slide-124
SLIDE 124

Deformable Template Matching with Exemplars for Recognition

Query Image Database of Templates Best matching template is a helicopter

  • Use exemplars as deformable templates
  • Find a correspondence between the query image and

each template, best one wins

slide-125
SLIDE 125

Many Results on Caltech-101

slide-126
SLIDE 126

Many Results on Caltech-101

Templates give correspondence

slide-127
SLIDE 127

Many Results on Caltech-101

Bags of features just classify images. Templates give correspondence: “Real Object Recognition”.

slide-128
SLIDE 128

Many Results on Caltech-101

Bags of features just classify images. Templates give correspondence: “Real Object Recognition”. Current best: knn-svm.

slide-129
SLIDE 129

Deconstruction...

geometric blur descriptors A.C. Berg J. Malik CVPR 2001

To classify images, forget about the objects, just match features in the whole image.

  • 1-NN classifier for each feature, most votes wins -> 54%

No Geometry Berg Thesis 2005

  • Bag of features using max-margin learning for weights -> 60%

No Geometry Frome Singer Malik NIPS 2006

  • Bag of features using SVM -> 60%, knn-svm -> 62%

Rough position in image Zhang, Berg, Maire, Malik CVPR 2006

  • Doesn’t matter how you use geometry

Spatial pyramid matching kernel, Lazebnik et al. -> 59%

slide-130
SLIDE 130

Deformable Templates – Discrete Optimization

Model of Car Image

Appearance: cost of matching two local features (geometric blur).
Binary indicator vector: X_ij = 1 iff i matches to j.
Geometry: cost of matching two pairs of features.
Integer Quadratic Programming.

A.C. Berg, T.L. Berg, J. Malik CVPR 05

slide-131
SLIDE 131

Drop the Templates –> EASY Optimization

Model of Car Image

Appearance: cost of matching two local features (geometric blur).
Binary indicator vector: X_ij = 1 iff i matches to j.
Geometry: cost of matching two pairs of features.
Easy Linear Programming.

A.C. Berg, T.L. Berg, J. Malik CVPR 05

No Geometry 54-60%

slide-132
SLIDE 132

Add back in a little geometry & knn-svm

  • Classify whole image
  • Use geometric blur to

model appearance similarity

  • Match to approximately the

right part of the image

  • Use k-nearest

neighbors to train a svm / query

  • Get the current

best results on image classification

query k nearest neighbors local decision boundary

  • H. Zhang, A.C. Berg, M. Maire, J. Malik

CVPR 2006

Basically, appending position to the feature vector + an SVM gives 60%; +knn-svm gives another 2%.

slide-133
SLIDE 133

Samples from Caltech-101 Split into Animals & Non Animals

Since we are recognizing the whole image anyway

slide-134
SLIDE 134

Samples from Caltech-101 Split into Animals & Non Animals

Classification rate ~90% correct. Same range as humans! (Thorpe et al.) but…

slide-135
SLIDE 135

Beyond Caltech 101 – TRECVID

  • Hundreds of thousands of key frames from shots in the TRECVID dataset
  • Broadcast video in English (US), Chinese, Arabic
  • Training data labeled with >= 40 labels

weather, sports, person, car, business leader, etc.

  • Goal is to rank new shots by whether they contain these labels....

Slav Petrov, Arlo Faria, Pascal Michaillat, Alexander Berg, Andreas Stolcke, Dan Klein, Jitendra Malik

tired of looking at the same 9144 images?

slide-136
SLIDE 136

TRECVID Results ’06

mAP = 0.11

Results ’05: Berkeley-Shape mAP = 0.38; Best ’05 (IBM) mAP = 0.34.

Best Berkeley-Shape Median

slide-137
SLIDE 137

Evaluating Similar Appearance

Extract Sparse Channels (Edges) Apply Spatially Varying Blur Subsample

Sparse Non-Negative Channels

(e.g. oriented edge energy)

geometric blur of each channel

geometric blur descriptor

geometric blur based descriptor

Berg, Malik CVPR 2001 Berg, Berg, Malik CVPR 2005

  • Works well for recognition tasks
  • Berg Thesis 05 has theoretical and ecological validation
  • Performance can be comparable to SIFT on wide-baseline matching
slide-138
SLIDE 138

Scenes different than objects.

  • Scenes are made up of stuff not things
  • Often not distinguishable locally: sky vs wall vs street
  • Template style geometric models may be too rigid for scenes
  • Scene specific previous work
  • Contextual Guidance of Attention in Natural scenes: The role of Global

features on object search. Torralba, Oliva, Castelhano & Henderson, Psychological Review 2006. Object-dependent saliency maps.

  • Integrated Models for Scenes Objects and Parts

Sudderth, Torralba, Freeman & Willsky, NIPS, ICCV 2005, and others. Local features, simple rigorous models for their arrangement.

  • Geometric Context from a Single Image

Hoiem, Efros, Hebert IJCV 2006, and others

  • So how bad is it anyway?
slide-139
SLIDE 139

Quantifying “Badness”

"When you can measure what you are speaking about and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of the meager and unsatisfactory kind."

  • -Lord Kelvin

Mutual Information Relative Mutual Information Entropy

  • 1. Hand Label

Sky, Foliage, Building, Street

  • 2. Split into training and test
  • 3. Compute features
  • 4. Build classifiers /

perform regression

  • 5. Evaluate how much the

features and classifiers tell about the labels
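The evaluation step above can be sketched with discrete mutual information (a minimal numpy illustration under my own assumptions, not the authors' exact measure): build a joint histogram of (predicted, true) labels and measure how many bits the prediction carries about the label.

```python
import numpy as np

def mutual_information(a, b, n_labels):
    """Discrete mutual information (in bits) between two label sequences."""
    joint = np.zeros((n_labels, n_labels))
    for x, y in zip(a, b):
        joint[x, y] += 1
    joint /= joint.sum()                  # joint distribution p(x, y)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=2000)    # e.g. sky/foliage/building/street

perfect = mutual_information(labels, labels, 4)                   # ~2 bits
useless = mutual_information(labels, rng.permutation(labels), 4)  # ~0 bits
print(perfect, useless)
```

A perfect predictor recovers the full label entropy (about 2 bits for four equally likely labels), while a shuffled predictor carries essentially none, which is the "how much do the features tell about the labels" quantity the slide asks for.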

slide-140
SLIDE 140

I stand at the window and see a house, trees, sky. Theoretically I might say there were 327 brightnesses and nuances of colour. Do I have “327”?

  • No. I have sky, house, and trees.

– Max Wertheimer 1923

An old problem

slide-141
SLIDE 141

An old problem

I stand at the window and see a house, trees, sky. Theoretically I might say there were 327 brightnesses and nuances of colour. Do I have “327”?

  • No. I have sky, house, and trees.

– Max Wertheimer 1923

slide-142
SLIDE 142

An old problem

Sky Tree Building Street

I stand at the window and see a house, trees, sky. Theoretically I might say there were 327 brightnesses and nuances of colour. Do I have “327”?

  • No. I have sky, house, and trees.

– Max Wertheimer 1923

slide-143
SLIDE 143

An old problem

Sky Tree Building Street Use this coarse parsing for more detailed parsing of buildings

I stand at the window and see a house, trees, sky. Theoretically I might say there were 327 brightnesses and nuances of colour. Do I have “327”?

  • No. I have sky, house, and trees.

– Max Wertheimer 1923

slide-144
SLIDE 144

An old problem

Sky Tree Building Street Use this coarse parsing for more detailed parsing of buildings Building Boundary Roofline Window Roof Building Color Roof Color

I stand at the window and see a house, trees, sky. Theoretically I might say there were 327 brightnesses and nuances of colour. Do I have “327”?

  • No. I have sky, house, and trees.

– Max Wertheimer 1923

slide-145
SLIDE 145

Why Architectural Scenes?

Sky Tree Building Street Building Boundary Roofline Window Roof Building Color Roof Color

  • Make up a decent portion of our surroundings
  • Microsoft, Amazon, etc. are collecting and

trying to use a great deal of this type of data, anything automatic is helpful

  • Stress current computational approaches

to visual recognition...

slide-146
SLIDE 146

Our Method

Work with: Floraine Grabler ETH Zurich, Jitendra Malik U.C. Berkeley

slide-147
SLIDE 147

Our Method : Features in a Patch

  • Color of central pixel
  • Color histogram
  • Total edge energy
  • Oriented edge energy
  • Height in Image
  • Lengths/Orientations of contours

KNN density estimates → prob. of label (sky, tree, etc.) given feature(s), except 2 SVMs for central pixel color and edge energy: one for sky, one for trees
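The KNN density estimate on this slide can be sketched as follows; the toy 2-D features, string labels, and the choice of `k` are hypothetical stand-ins for the actual patch features.

```python
import math

def knn_label_probs(train_feats, train_labels, query, k=3):
    # Estimate P(label | feature) at `query` as the label fractions
    # among the k nearest training patches (a simple KNN density
    # estimate; distances are Euclidean in feature space).
    dists = sorted(
        (math.dist(f, query), lab)
        for f, lab in zip(train_feats, train_labels)
    )
    nearest = [lab for _, lab in dists[:k]]
    return {lab: nearest.count(lab) / k for lab in set(nearest)}
```

The returned dictionary directly gives the per-patch label distribution from which the most likely category is read off.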

slide-148
SLIDE 148

Our Method : Features in a Patch


KNN density estimates → prob. of label (sky, tree, etc.) given feature → most likely category

Tree/Foliage Building Sky Mixed Sky Street

slide-151
SLIDE 151

Our Method : Features in a Patch


Sky (red) or not (blue): SVM on central pixel color and edge energy, to predict sky
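A minimal stand-in for the per-class SVM: a linear classifier trained by sub-gradient descent on the hinge loss. The 4-D features ([R, G, B, edge energy]) and all hyperparameters below are hypothetical; the actual system presumably used a standard SVM package.

```python
def train_linear_svm(X, y, epochs=200, lam=0.01, lr=0.1):
    # Sub-gradient descent on the regularized hinge loss
    # (labels are +1 for sky, -1 otherwise).
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin or misclassified: hinge + regularizer step.
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with margin: regularizer step only.
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

def predict(w, b, x):
    # Sign of the decision function: +1 = sky, -1 = not sky.
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

The same recipe with foliage labels gives the second per-class SVM mentioned on the next slide.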

slide-152
SLIDE 152

Our Method : Features in a Patch


Foliage (red) or not (blue): SVM on central pixel color and edge energy, to predict foliage

slide-153
SLIDE 153

Our Method : Features in a Patch


Most likely category

Tree/Foliage Building Sky Mixed Sky Street

Combination

Still some confusion; after all, the building is street-colored. So use a first pass of a detailed parse to make a building and sky model for this image…

slide-154
SLIDE 154

Our Method : Features in a Patch++


Most likely category

Tree/Foliage Building Sky Mixed Sky Street

Image specific model

slide-155
SLIDE 155

Our Method : Features in a Patch++


Most likely category

Tree/Foliage Building Sky Mixed Sky Street

Image specific model

With some spatial smoothing (driven by training data)

slide-156
SLIDE 156

Our Method : Details

This rough parsing helps find features defined by the parsing:

  • rooflines, sides of buildings
  • power lines and dead trees
  • windows and doors
  • building color

Most likely category

Tree/Foliage Building Sky Mixed Sky Street

slide-157
SLIDE 157

Our Method : Detailed Parsing

using windows as an example

Evaluate a hypothesis about window location and size by:

  • The size and aspect ratio
  • The surrounding rough labels
  • The estimated building color
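The hypothesis score can be sketched as a product of the three cues listed above. The functional forms, weights, and target aspect ratio below are hypothetical illustrations, not the authors' actual model.

```python
import math

def score_window(w, h, surround_labels, patch_color, building_color,
                 target_aspect=1.5):
    # Aspect-ratio cue: windows are usually taller than wide.
    aspect_term = math.exp(-abs(h / w - target_aspect))
    # Context cue: fraction of surrounding rough labels that say "building".
    context_term = surround_labels.count("building") / len(surround_labels)
    # Color cue: the patch around the window should match the
    # estimated building color.
    color_term = math.exp(-math.dist(patch_color, building_color))
    return aspect_term * context_term * color_term
```

A hypothesis surrounded by building-labeled patches of the right color scores near 1; one floating in the sky scores 0.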
slide-158
SLIDE 158

Our Method : Detailed Parsing

slide-159
SLIDE 159

Finding Windows

Various ways to form window hypotheses. Combine with a model of building color and spatial context.

slide-160
SLIDE 160

Finding Windows

Various ways to form window hypotheses

  • Hypotheses are reinforced by regularity
  • Windows should be under rooflines

Without configuration cue vs. with configuration cue
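One way to sketch the configuration cue: window hypotheses that line up in rows or columns with other hypotheses get their scores boosted. The tolerance and boost weight below are hypothetical.

```python
def regularity_boost(windows, tol=5, weight=0.1):
    # `windows` is a list of (x, y, w, h, score) tuples; each window's
    # score is multiplied by (1 + weight * n_aligned), where n_aligned
    # counts other hypotheses sharing roughly the same top or left edge.
    boosted = []
    for x, y, w, h, s in windows:
        aligned = sum(
            1 for x2, y2, _, _, _ in windows
            if abs(y2 - y) <= tol or abs(x2 - x) <= tol
        ) - 1  # subtract 1 to exclude the window itself
        boosted.append((x, y, w, h, s * (1 + weight * aligned)))
    return boosted
```

Isolated false positives keep their raw score, while grids of true windows reinforce each other.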

slide-161
SLIDE 161

Look at many results…

slide-162
SLIDE 162

(Table: mutual information / relative mutual information / entropy, per label)

How bad is it?

Simple patch features tell us about as much about the coarse-scale parsing as the geometric context work.
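The comparison above can be computed from paired per-patch label sequences; a small sketch (the label names are illustrative):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # I(X;Y) in bits from two aligned label sequences, e.g. a feature's
    # predicted coarse label vs. the ground-truth label per patch.
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )
```

Dividing I(X;Y) by the entropy of the true labels presumably gives the relative mutual information in the table.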

slide-163
SLIDE 163

How bad is it?

Simple patch features without segmentation tell us about as much about the coarse-scale parsing as the output of the geometric context work, and more for buildings and foliage.

slide-164
SLIDE 164

How bad is it?

Simple patch features (color histogram here) also give a fair amount of information about the geometric structure.

slide-165
SLIDE 165

How bad is it?

geometric blur: mutual information 0.28, relative mutual information 0.26, entropy 0.36

slide-166
SLIDE 166

Where to Next?

  • Forget being clever about local smoothing
  • Small number of global parameters, e.g. sky color, building color, color constancy
  • Soft representation for spatial layout
  • Geometry
  • Surfaces?
  • Occluding boundaries (rooflines, building sides) are still important; we need to harness this
  • Segmentation should help; need to find out how

Thank you.
slide-167
SLIDE 167

Thank you, and my co-authors so far…