The Three R s of Vision Recognition Reconstruction Reorganization - - PowerPoint PPT Presentation




slide-1
SLIDE 1

The Three R’s of Vision

Recognition Reorganization Reconstruction

Jitendra Malik UC Berkeley

slide-2
SLIDE 2

Recognition, Reconstruction & Reorganization

Recognition Reorganization Reconstruction

slide-3
SLIDE 3

Fifty years of computer vision 1963-2013

  • 1960s: Beginnings in artificial intelligence, image processing and pattern recognition
  • 1970s: Foundational work on image formation: Horn, Koenderink, Longuet-Higgins …
  • 1980s: Vision as applied mathematics: geometry, multi-scale analysis, probabilistic modeling, control theory, optimization
  • 1990s: Geometric analysis largely completed, vision meets graphics, statistical learning approaches resurface
  • 2000s: Significant advances in visual recognition, range of practical applications

slide-4
SLIDE 4

Different aspects of vision

  • Perception: study the “laws of seeing”, i.e. predict what a human would perceive in an image.
  • Neuroscience: understand the mechanisms in the retina and the brain
  • Function: how laws of optics, and the statistics of the world we live in, make certain interpretations of an image more likely to be valid

The match between human and computer vision is strongest at the level of function. But since the results of computer vision are typically meant to be conveyed to humans, it is useful to be consistent with human perception. Neuroscience is a source of ideas, but being bio-mimetic is not a requirement.

slide-5
SLIDE 5

The Three R’s of Vision

Recognition Reconstruction Reorganization

slide-6
SLIDE 6

The Three R’s of Vision

Each of the 6 directed arcs in this diagram is a useful direction of information flow

Recognition Reconstruction Reorganization

slide-7
SLIDE 7

Review

  • Reconstruction
    – Feature matching + multiple view geometry has led to city scale point cloud reconstructions
  • Recognition
    – 2D problems such as handwriting recognition, face detection successfully fielded in applications.
    – Partial progress on 3D object category recognition
  • Reorganization
    – Progress on bottom-up segmentation hitting diminishing returns
    – Semantic segmentation is the key problem now

slide-8
SLIDE 8

Image-based Modeling

  • Façade (1996) Debevec, Taylor & Malik
    – Acquire photographs
    – Recover geometry (explicit or implicit)
    – Texture map

slide-9
SLIDE 9

Campus Model of UC Berkeley

Campanile + 40 Buildings (Debevec et al, 1997)

slide-10
SLIDE 10

Arc de Triomphe

slide-11
SLIDE 11

The Taj Mahal

Taj Mahal modeled from one photograph by G. Borshukov

slide-12
SLIDE 12

State of the Art in Reconstruction

  • Multiple photographs
  • Range Sensors

Agarwal et al (2010) Kinect (PrimeSense) Velodyne Lidar

Semantic Segmentation is needed to make this more useful…

Frahm et al, (2010)

slide-13
SLIDE 13

Shape, Albedo, and Illumination from Shading

Jonathan Barron Jitendra Malik UC Berkeley

slide-14
SLIDE 14

Far Near shape / depth

Forward Optics

slide-15
SLIDE 15

Far Near shape / depth

Forward Optics

illumination

slide-16
SLIDE 16

Far Near log-shading image of Z and L shape / depth

Forward Optics

illumination

slide-17
SLIDE 17

Far Near log-shading image of Z and L shape / depth log-albedo / log-reflectance

Forward Optics

illumination

slide-18
SLIDE 18

Far Near log-shading image of Z and L shape / depth log-albedo / log-reflectance illumination Lambertian reflectance in log-intensity

Forward Optics

slide-19
SLIDE 19

Far Near

?

?

?

Shape, Albedo, and Illumination from Shading

SAIFS (“safes”)

?

log-shading image of Z and L shape / depth log-albedo / log-reflectance illumination Lambertian reflectance in log-intensity

slide-20
SLIDE 20

Known Lighting Problem Formulation:

“Find the most likely explanation (shape Z and log-albedo A) that together exactly reconstructs log-image I, given rendering engine S() and known illumination L.”

? ? ?
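The known-lighting formulation on this slide can be illustrated with a small sketch. The slides describe Lambertian reflectance in log-intensity, so the log-image is modeled as I = A + S(Z, L). Everything else below is an assumption for illustration: a single distant light stands in for the illumination L, normals come from finite differences of the depth map, and the function names are hypothetical. It is not the SAIFS implementation.

```python
import numpy as np

def surface_normals(Z):
    """Unit surface normals of a depth map Z (H x W) via finite differences."""
    dZdy, dZdx = np.gradient(Z)
    n = np.stack([-dZdx, -dZdy, np.ones_like(Z)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def log_shading(Z, L, eps=1e-6):
    """S(Z, L): log of Lambertian shading under one distant light (assumption)."""
    L = np.asarray(L, dtype=float)
    L = L / np.linalg.norm(L)
    shading = np.clip(surface_normals(Z) @ L, eps, None)  # avoid log(0)
    return np.log(shading)

def render_log_image(Z, A, L):
    """Forward model: log-image I = A + S(Z, L)."""
    return A + log_shading(Z, L)

def log_albedo_from_shape(I, Z, L):
    """With L known, a candidate shape Z fixes the log-albedo: A = I - S(Z, L)."""
    return I - log_shading(Z, L)

# Toy scene (made-up values): a smooth bump with a two-tone albedo.
yy, xx = np.mgrid[-1:1:32j, -1:1:32j]
Z = np.exp(-(xx ** 2 + yy ** 2))
A = np.log(0.2 + 0.6 * (xx > 0))
L = [0.3, 0.2, 1.0]
I = render_log_image(Z, A, L)
A_recovered = log_albedo_from_shape(I, Z, L)
```

Because the reconstruction must be exact, the search is effectively over Z alone: each candidate shape implies one log-albedo, and the priors on shape and reflectance decide which pair is most likely.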

slide-21
SLIDE 21

Demo!

slide-22
SLIDE 22

What do we know about reflectance?

1) Piecewise smooth (variation is small and sparse)
2) Palette is small (distribution is low-entropy)
3) Some colors are common (maximize likelihood under density model)

slide-23
SLIDE 23

Reflectance: Absolute Color

slide-24
SLIDE 24

What do we know about shapes?

1) Piecewise smooth (variation in mean curvature is small and sparse)
2) Face outward at the occluding contour
3) Tend to be fronto-parallel (slant tends to be small)

slide-25
SLIDE 25

Real World Images Evaluation:

slide-26
SLIDE 26

Real World Images Evaluation:

slide-27
SLIDE 27

Recognition helps reconstruction Blanz & Vetter (1999)

Geometric Context (Hoiem, Efros, Hebert) for outdoor scenes; recent work on rooms (CMU, UIUC) is another example

slide-28
SLIDE 28

The Three R’s of Vision

Recognition Reconstruction Reorganization

slide-29
SLIDE 29

Caltech-101 [Fei-Fei et al. 04]

  • 102 classes, 31-300 images/class
slide-30
SLIDE 30

Caltech 101 classification results (even better by combining cues..)

slide-31
SLIDE 31

ICCV '99, Corfu, Greece

Texton Histogram Model for Recognition (Leung & Malik, 1999), cf. Bag of Words

Terrycloth, Rough Plastic, Pebbles, Plaster-b

slide-32
SLIDE 32

Lazebnik, Schmid & Ponce (2006)

They proposed using vector-quantized SIFT descriptors as “words”
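A minimal sketch of the "visual words" idea: each local descriptor is assigned to its nearest codeword, and the image is summarized by a histogram of word counts. The function names and the tiny 2-D descriptors are assumptions for illustration. In practice the descriptors would be 128-dimensional SIFT vectors and the codebook would typically come from k-means on training descriptors.

```python
import numpy as np

def quantize(descriptors, codebook):
    """Index of the nearest codeword (K x D) for each descriptor (N x D)."""
    diff = descriptors[:, None, :] - codebook[None, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)

def bag_of_words(descriptors, codebook):
    """L1-normalized histogram of visual-word counts for one image."""
    words = quantize(descriptors, codebook)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy example: two codewords, four descriptors (made-up values).
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [9.0, 9.0], [11.0, 10.0], [0.0, 1.0]])
hist = bag_of_words(descriptors, codebook)
```

The histogram discards where the descriptors occurred in the image, which is the same simplification the texton histogram model makes for textures.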

slide-33
SLIDE 33

PASCAL Visual Object Challenge

(Everingham et al)

slide-34
SLIDE 34
slide-35
SLIDE 35
slide-36
SLIDE 36

A good building block is a linear SVM trained on HOG features (Dalal & Triggs)
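To make this building block concrete, here is a heavily simplified sketch. The descriptor is a single magnitude-weighted histogram of unsigned gradient orientations, without the cells, blocks and block normalization of the full Dalal & Triggs HOG. The linear SVM is trained by plain subgradient descent on the regularized hinge loss. All names, hyperparameters and the toy patches are assumptions for illustration.

```python
import numpy as np

def orientation_histogram(patch, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)           # unsigned, in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (np.linalg.norm(hist) + 1e-9)

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Subgradient descent on the L2-regularized hinge loss; y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        active = y * (X @ w + b) < 1                  # margin violators
        grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / len(y)
        grad_b = -y[active].sum() / len(y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data: a vertical-edge patch (positive) and a horizontal-edge patch (negative).
vertical = np.zeros((8, 8))
vertical[:, 4:] = 1.0
horizontal = vertical.T
X = np.stack([orientation_histogram(vertical), orientation_histogram(horizontal)])
y = np.array([1.0, -1.0])
w, b = train_linear_svm(X, y)
scores = X @ w + b   # detection score is a dot product with the learned template
```

At test time the learned weight vector acts as a template that is slid over the feature map, which is why the combination is cheap enough to serve as a component of larger models.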
slide-37
SLIDE 37
slide-38
SLIDE 38

AP=0.23

slide-39
SLIDE 39
slide-40
SLIDE 40

Problems with current recognition approaches

  • Performance is quite poor compared to that on 2D recognition tasks, and relative to the needs of many applications.
  • Pose Estimation / Localization of parts or keypoints is even worse. We can’t isolate decent stick figures from radiance images, making use of depth data necessary.
  • Progress has slowed down. Variations of HOG/Deformable part models dominate.

slide-41
SLIDE 41

PCA Results on APs of 20 VOC classes

[Scatter plots of Principal Component 1 against Principal Component 2]

Methods plotted: NECUIUC(09), OXFORD(09), UoCTTI(09), BONN_FGT(10), BONN_SVR(10), NLPR(10), UCLA(10), NUS(10), UoCTTI(10), UVA(10), UMNECUIUC(10), FGBG(10), BROOKES(11), CORNELL(11), MISSOURI(11), NLPR(11), NUS(11), UCLA(11), OXFORD(11), UoCTTI(11), UVA(11)
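One way such a plot could be produced is sketched below, assuming a matrix with one row per method and one column per VOC class holding the per-class AP. The slides do not say how the PCA was set up, so the layout, the function name and especially the random placeholder values are assumptions. The numbers are not real VOC results.

```python
import numpy as np

def pca_2d(ap):
    """Project rows of an (n_methods x n_classes) AP matrix onto the top 2 PCs."""
    centered = ap - ap.mean(axis=0, keepdims=True)
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ Vt[:2].T                  # 2-D coordinates per method
    explained = (S ** 2) / (S ** 2).sum()         # variance fraction per component
    return coords, explained[:2]

rng = np.random.default_rng(0)
ap = rng.uniform(0.1, 0.6, size=(21, 20))         # placeholder APs, NOT real results
coords, explained = pca_2d(ap)
```

If methods cluster tightly in such a plot, their per-class strengths and weaknesses are similar, which is consistent with the observation that variations of one model family dominate.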

slide-42
SLIDE 42

Next steps in recognition

  • Richer features than SIFT/HOG (deep learning?)
  • Incorporate the “shape bias” known from child development literature to improve generalization
    – This requires monocular computation of shape, as once posited in the 2.5D sketch, and distinguishing albedo and illumination changes from geometric contours
  • Top down templates should predict keypoint locations and image support, not just information about category
  • Recognition and figure-ground inference need to co-evolve. Occlusion is signal, not noise.
slide-43
SLIDE 43
slide-44
SLIDE 44

High-Level Computer Vision

slide-45
SLIDE 45

Object Recognition

High-Level Computer Vision

person person van dog

slide-46
SLIDE 46

Object Recognition Semantic Segmentation

High-Level Computer Vision

person person van dog

slide-47
SLIDE 47

Object Recognition Semantic Segmentation Pose Estimation

High-Level Computer Vision

Facing the camera
Facing back, head to the right
In a back view

slide-48
SLIDE 48

Object Recognition Semantic Segmentation Pose Estimation Action Recognition

High-Level Computer Vision

talking Walking away

slide-49
SLIDE 49

Object Recognition Semantic Segmentation Pose Estimation Action Recognition Attribute Classification

High-Level Computer Vision

blue GMC van
Entlebucher mountain dog
Man with glasses and a coat
elderly white man with a baseball hat

slide-50
SLIDE 50

Object Recognition Semantic Segmentation Pose Estimation Action Recognition Attribute Classification

High-Level Computer Vision

“A man with glasses and a coat, facing back, walking away”
“An entlebucher mountain dog sitting in a bag”
“An elderly man with a hat and glasses, facing the camera and talking”
“A blue GMC van parked, in a back view”

slide-51
SLIDE 51

Trying to extract stick figures is hard (and unnecessary!)

Generalized cylinders (Marr & Nishihara, Binford)
Pictorial Structures (Felzenszwalb & Huttenlocher)

slide-52
SLIDE 52

All the wrong limbs…

slide-53
SLIDE 53

Motivation

slide-54
SLIDE 54

various images submitted to the CMU on-lin http://www.vasc.ri.cmu.edu/cgi-bin/demos/findface.cgi

Face Detection

Carnegie Mellon University

slide-55
SLIDE 55

Examples of poselets (Bourdev & Malik , 2009)

Patches are often far visually, but they are close semantically

slide-56
SLIDE 56

How do we train a poselet for a given pose configuration?

slide-57
SLIDE 57

Finding Correspondences

Given part of a human pose, how do we find a similar pose configuration in the training set?

slide-58
SLIDE 58

Finding Correspondences

We use keypoints to annotate the joints, eyes, nose, etc. of people

Left Hip Left Shoulder

slide-59
SLIDE 59

Finding Correspondences

Residual Error

slide-60
SLIDE 60

Training poselet classifiers

Residual Error: 0.15 0.20 0.10 0.35 0.15 0.85

  1. Given a seed patch
  2. Find the closest patch for every other person
  3. Sort them by residual error
  4. Threshold them
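The four steps can be sketched as follows. The slides do not define the residual error, so this sketch assumes it is the normalized squared keypoint distance left after the best translation, rotation and scale alignment (orthogonal Procrustes). The function names, the threshold and the toy keypoints are hypothetical.

```python
import numpy as np

def residual_error(seed, cand):
    """Normalized squared distance after aligning cand to seed (both N x 2).

    Uses orthogonal Procrustes with scale; reflections are not excluded.
    Assumes neither keypoint set is degenerate (all points identical).
    """
    a = seed - seed.mean(axis=0)
    b = cand - cand.mean(axis=0)
    U, S, Vt = np.linalg.svd(b.T @ a)
    R = U @ Vt                                   # best orthogonal transform
    scale = S.sum() / (b ** 2).sum()             # best isotropic scale
    return float(((a - scale * b @ R) ** 2).sum() / (a ** 2).sum())

def collect_examples(seed, candidates_per_person, threshold):
    """Steps 2-4: closest patch per person, sort by residual, threshold."""
    best = []
    for person_id, cands in enumerate(candidates_per_person):
        errs = [residual_error(seed, c) for c in cands]
        k = int(np.argmin(errs))
        best.append((errs[k], person_id, k))
    best.sort()
    return [(pid, k, err) for err, pid, k in best if err <= threshold]

# Step 1: a seed configuration (toy keypoints, made-up values).
seed = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
theta = np.radians(30)
rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
line = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
similar = 2.0 * seed @ rot + np.array([5.0, 3.0])   # same pose, moved and scaled
people = [[line, similar], [line]]
selected = collect_examples(seed, people, threshold=0.3)
```

The patches that survive the threshold become positive training examples for that poselet's classifier. They can look very different as images while sharing the same keypoint configuration.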
slide-61
SLIDE 61

Male or female?

slide-62
SLIDE 62

How do we train attribute classifiers “in the wild”?

  • Effective prediction requires inferring the pose and camera view
  • Pose reconstruction is itself a hard problem, but we don’t need a perfect solution.
  • We train attribute classifiers for each poselet
  • Poselets implicitly decompose the pose

slide-63
SLIDE 63

Gender classifier per poselet is much easier to train

slide-64
SLIDE 64

Is male

slide-65
SLIDE 65

Has long hair

slide-66
SLIDE 66

Wears a hat

slide-67
SLIDE 67

Wears glasses

slide-68
SLIDE 68

Wears long pants

slide-69
SLIDE 69

Wears long sleeves

slide-70
SLIDE 70

Some discriminative poselets (Maji et al)

slide-71
SLIDE 71

Armlets (Gkioxari et al, CVPR 2013)

slide-72
SLIDE 72

Right Arm Left Arm Yang & Ramanan Our method

Multiple Instances

slide-73
SLIDE 73

Results

  • Results of Augmented Armlets and Comparison with baseline[1]

PCP            Yang & Ramanan [1]   Our model
R_Upper Arm    38.9                 50.2
R_Lower Arm    21.0                 25.0
L_Upper Arm    36.9                 49.2
L_Lower Arm    19.1                 25.4
Average        29.0                 37.5

[1] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. CVPR, 2011

slide-74
SLIDE 74

The Three R’s of Vision

Recognition Reconstruction Reorganization

slide-75
SLIDE 75


Berkeley Segmentation DataSet [BSDS]

D. Martin, C. Fowlkes, D. Tal, J. Malik. "A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics", ICCV, 2001

slide-76
SLIDE 76

State of the Art in Reorganization

  • Interactive segmentation using graph cuts
  • Berkeley gPb edges & regions

Rother, Kolmogorov & Blake (2004), Boykov & Jolly (2001), Boykov, Veksler & Zabih (2001), Arbelaez et al (2009), Martin, Fowlkes, Malik (2004), Shi & Malik (2000)

We may be hitting the limits of bottom-up segmentation…

slide-77
SLIDE 77

What boundaries do you see?

slide-78
SLIDE 78

Motion Boundaries

Sundberg et al, CVPR 2011; Brox & Malik, ECCV 2010

slide-79
SLIDE 79

Recognition Helps Reorganization

slide-80
SLIDE 80

The Three R’s of Vision

Recognition Reconstruction Reorganization

Superpixel assemblies as candidates

slide-81
SLIDE 81

Semantic Segmentation using Regions and Parts

P. Arbeláez, B. Hariharan, S. Gupta, C. Gu, L. Bourdev and J. Malik
slide-82
SLIDE 82

This Work

Top-down Part/Object Detectors 0.93 Cat Segmenter 0.57 0.32 Bottom-up Region Segmentation

slide-83
SLIDE 83

Results on PASCAL VOC

slide-84
SLIDE 84

Perceptual Robotics

Using RGBD images to semantically parse scenes

S. Gupta, P. Arbeláez & J. Malik (CVPR 2013)
slide-85
SLIDE 85

Using RGBD Images to Semantically Parse Scenes

Input (from Kinect-like depth sensors)

  • Color Image
  • Depth Image, visualized in pseudo color (blue is close, orange is far)
  • Normal Image, visualized in pseudo color (blue are surfaces facing up)

Reorganization

  • Bottom-up segmentation into superpixels
  • Long-range linking

Semantic Segmentation

  • Compute features on superpixels, classify using an SVM classifier

slide-86
SLIDE 86

Semantic Segmentation

Super Pixel Classification

Classifier: IK SVM

Category    Pr
wall        0.90
cabinet     0.05
window      0.05
chair       0.0
table       0.0

slide-87
SLIDE 87

Semantic Segmentation

Affordance Based Features

  • Geocentric Pose
    – Orientation Features
    – Height above ground
  • Size Features
    – Spatial extent
    – Surface Area
    – Is clipped/occluded
  • Shape Features
    – Planarity
    – Strength of local geometric gradients

Category Specific Features

  • Scores of one-versus-rest SVMs using histogram of
    – Vector Quantized SIFT
    – Geocentric Textons

Use orientation with respect to gravity, heights above ground, actual sizes
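The geocentric pose features listed on this slide can be illustrated with a short sketch that computes height above ground and orientation with respect to gravity for one superpixel. It assumes the gravity direction and floor height are already known and that 3D points and unit normals are available per superpixel. The function name, the returned keys and the toy values are hypothetical and are not taken from the paper.

```python
import numpy as np

def geocentric_features(points, normals, gravity, floor_height):
    """Geocentric pose features for one superpixel.

    points, normals: (N, 3) arrays of 3D points and unit surface normals.
    gravity: 3-vector pointing down; floor_height: floor position along 'up'.
    """
    up = -np.asarray(gravity, dtype=float)
    up /= np.linalg.norm(up)
    heights = points @ up - floor_height           # height above ground per point
    mean_normal = normals.mean(axis=0)
    mean_normal /= np.linalg.norm(mean_normal)
    angle = np.degrees(np.arccos(np.clip(mean_normal @ up, -1.0, 1.0)))
    return {
        "min_height": float(heights.min()),
        "max_height": float(heights.max()),
        "angle_to_up_deg": float(angle),           # 0 = facing up, 90 = vertical
    }

gravity = [0.0, -1.0, 0.0]                          # assumed: y axis points up
table_pts = np.array([[0.0, 0.7, 1.0], [1.0, 0.7, 1.0], [0.0, 0.7, 2.0]])
table_nrm = np.tile([0.0, 1.0, 0.0], (3, 1))
wall_pts = np.array([[0.0, 0.0, 3.0], [0.0, 2.5, 3.0], [1.0, 1.0, 3.0]])
wall_nrm = np.tile([0.0, 0.0, -1.0], (3, 1))
table = geocentric_features(table_pts, table_nrm, gravity, floor_height=0.0)
wall = geocentric_features(wall_pts, wall_nrm, gravity, floor_height=0.0)
```

Features like these are only available because depth gives metric 3D coordinates: a horizontal surface at table height and a vertical surface spanning floor to ceiling are easy to tell apart before any appearance cue is used.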

slide-88
SLIDE 88

Semantic Segmentation

slide-89
SLIDE 89

Semantic Segmentation

Category wise performance

Category      [NYU]    Our
wall          55.25    62.2
floor         73.08    75.9
cabinet       31.4     44.5
bed           38.87    49.4
chair         28.94    37.9
sofa          24.52    39.3
table         20.13    31.2
door          5.59     10.4
window        26.35    32.4
bookshelf     20.6     19
picture       34.31    39.5
counter       32.03    47.4
blinds        39.01    42.1
desk          4.52     9.4
shelves       3.07     3.3
curtain       26.43    32
dresser       13.08    19.9
pillow        18.34    27.1
mirror        4.08     18.9
floor mat     7.11     20.8

Aggregate Performance

              [NYU]    Our
              35.26    42.04

NYU [Silberman et al ECCV12] Indoor segmentation and support inference from RGBD images.

slide-90
SLIDE 90

Semantic Segmentation

Performance – some more categories

Category          [NYU]    Our
clothes           6.27     8.5
ceiling           62.99    58.3
books             5.34     3.4
refrigerator      1.28     17.3
television        5.66     19.1
paper             12.6     12.5
towel             0.11     8
shower curtain    3.55     15
box               0.12     3.3
person            6.35     16.7
night stand       5.95     29
toilet            26.49    39.4
sink              24.66    25.2
lamp              14.99    23.5
otherstructure    5.75     2.6
otherfurniture    3.66     19.8
otherprop         20.29    25.5

Rows with a single surviving value: whiteboard 31.2, bathtub 20.5, bag 0.1

[NYU] Silberman et al, ECCV12, Indoor segmentation and support inference from RGBD images.

slide-91
SLIDE 91

The Three R’s of Vision

Recognition Reconstruction Reorganization

slide-92
SLIDE 92

Thank You