Make3D: Learning 3D Scene Structure from a Single Still Image

SLIDE 1

Make3D: Learning 3D Scene Structure from a Single Still Image Ashutosh Saxena, Min Sun, and Andrew Ng

Ian Endres CS598 February 5, 2009

SLIDE 2

Overview

Goal: Infer 3D models from monocular cues

◮ Segment into planar patches
◮ Build model from depth maps
◮ Estimate orientation/location of patches
◮ Construct 3D model

SLIDE 3

Properties to Model

◮ Single Image
  Depth from features
  Connectedness
  Coplanarity
◮ Multiple Images
  Depths from triangulation
◮ Objects
  Object A is above object B
  Object orientation

SLIDE 4

Superpixel Model

◮ Superpixel as a plane
◮ Model as a 3D mesh of polygons
◮ Use Felzenszwalb and Huttenlocher's segmenter
◮ Goal: Determine location and orientation of each superpixel

SLIDE 5

Superpixel Parameters

◮ α ∈ R^3
  α̂ = α / |α| is the unit normal of the plane
  1 / |α| is the distance from the camera center to the plane
◮ Thus, q^T α = 1 for any point q ∈ R^3 on the plane
◮ R_i ∈ R^3
  Unit length ray pointing from the camera center to pixel i on the image plane (using a "reasonable guess" of the camera's intrinsic parameters).
◮ d_i = 1 / (R_i^T α) is the distance of point i (having ray R_i) from the camera center if it lies on the plane described by α
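
A minimal sketch of this parameterization, assuming camera coordinates and illustrative helper names (not the authors' code):

```python
import numpy as np

# alpha in R^3 defines the plane {q : q^T alpha = 1}.
def plane_normal_and_distance(alpha):
    """Unit normal alpha/|alpha| and distance 1/|alpha| from the camera center to the plane."""
    norm = np.linalg.norm(alpha)
    return alpha / norm, 1.0 / norm

def depth_along_ray(alpha, R_i):
    """d_i = 1 / (R_i^T alpha): distance to where the unit ray R_i meets the plane."""
    return 1.0 / (R_i @ alpha)

alpha = np.array([0.0, 0.0, 0.2])        # the plane z = 5 in camera coordinates
R_i = np.array([0.0, 0.0, 1.0])          # unit ray along the optical axis
print(plane_normal_and_distance(alpha))  # (array([0., 0., 1.]), 5.0)
print(depth_along_ray(alpha, R_i))       # 5.0
```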

SLIDE 6

Features

◮ Monocular features: x_i ∈ R^524
  Filter responses + shape, computed for each superpixel
  Additional contextual information from neighbors, at 3 scales
  Uses features from the largest superpixel neighbor in each bin (i.e. S1C)
◮ Boundary features: ε_ij ∈ {0, 1}^14
  Segmentations based on 7 different properties at 2 scales
  Properties include color, texture, and edges
  For each segmentation k, if superpixels i, j fall on the same segment, ε_ij(k) = 1, otherwise 0 (sketched below)
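
A minimal sketch of how ε_ij could be assembled, assuming each of the 14 over-segmentations is stored as an array mapping superpixel index to segment id (the representation and names are illustrative, not the paper's):

```python
import numpy as np

def boundary_features(i, j, seg_labels):
    """eps_ij(k) = 1 if superpixels i and j fall on the same segment of segmentation k.

    seg_labels: list of 14 arrays; seg_labels[k][i] is the segment id of superpixel i
    under over-segmentation k.
    """
    return np.array([int(labels[i] == labels[j]) for labels in seg_labels])
```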

SLIDE 7

Models

◮ P(y_ij | ε_ij; ψ) - models the confidence that superpixels i, j belong to the same planar surface (0 for boundary/fold, 1 for planar)
◮ P(α | X, v, y, R; θ) - models the depth and orientation parameters of the superpixels, composed of:
  f1(α_i | X_i, v_i, R_i; θ) - plane parameters as a function of single superpixel i's features
  f2(α_i, α_j | y_ij, R_i, R_j) - plane parameters as a function of edge features between superpixels i, j

◮ P(v_i | x_i; φ_r) - models each pixel's ability to predict the parameters of its associated superpixel
SLIDE 8

Occlusion Boundary and Fold Model

◮ A simple edge detector is not sufficient for detecting 3D discontinuities (consider a shadow)
◮ y_ij ∈ [0, 1], where 0 indicates a boundary/fold and 1 indicates a planar surface
◮ y_ij hand labeled in 50 images
◮ P(y_ij | ε_ij; ψ) = 1 / (1 + exp(−ψ^T ε_ij)), learned using logistic regression
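
A minimal sketch of this classifier using scikit-learn's logistic regression; the data below is a random stand-in for the hand-labeled pairs, not the paper's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
eps_train = rng.integers(0, 2, size=(1000, 14))  # epsilon_ij in {0,1}^14 (placeholder data)
y_train = rng.integers(0, 2, size=1000)          # 0 = boundary/fold, 1 = same planar surface

clf = LogisticRegression()                       # fits psi in P(y|eps) = sigmoid(psi^T eps)
clf.fit(eps_train, y_train)

# The soft y_ij later weights the pairwise terms f2
eps_ij = np.ones((1, 14))                        # the two superpixels agree in every segmentation
y_ij = clf.predict_proba(eps_ij)[0, 1]           # probability of "same planar surface"
```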

SLIDE 9

Unary Depth Model (f1)

◮ Predict depth d̂ as a function of features x
◮ Penalize using the relative error d̂/d − 1, where d̂ = x^T θ_r. Note: 1/d = R_{i,s_i}^T α_i
◮ f1(α_i | X_i, v_i, R_i; θ) = exp( −Σ_{s_i} v_{i,s_i} | (R_{i,s_i}^T α_i)(x_{i,s_i}^T θ_r) − 1 | )
◮ The r in θ_r indicates one of 11 rows in the image
◮ Parameters are learned from the pseudo log-likelihood of P(α | . . .). Since f2(·) does not depend on θ_r, this gives (sketched below):
  θ_r* = argmin_{θ_r} Σ_i Σ_{s_i} v_{i,s_i} | (1/d_{i,s_i})(x_{i,s_i}^T θ_r) − 1 |
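
A minimal sketch of the unary term and the per-row training objective, with illustrative array shapes (x: S×524 pixel features, R: S×3 unit rays, v: S confidences); this is one reading of the formulas above, not the released code:

```python
import numpy as np

def f1_log_potential(alpha_i, x, R, v, theta_r):
    """log f1 = -sum_s v_s * | (R_s^T alpha_i) * (x_s^T theta_r) - 1 |"""
    pred_depth = x @ theta_r          # d_hat_s = x_s^T theta_r
    inv_depth = R @ alpha_i           # 1/d_s = R_s^T alpha_i
    return -np.sum(v * np.abs(inv_depth * pred_depth - 1.0))

def theta_training_loss(theta_r, x, d, v):
    """sum_s v_s * | (x_s^T theta_r) / d_s - 1 |, minimized over theta_r for each image row."""
    return np.sum(v * np.abs((x @ theta_r) / d - 1.0))
```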

SLIDE 10

Depth Prediction Confidence (v)

◮ Given a model d̂ = x_i^T θ_r for predicting depth, build a model to predict the expected error
◮ Thus, learn |d_i − x_i^T θ_r| / d_i = 1 / (1 + exp(−φ_r^T x_i))
◮ This (ideally) can predict how well a feature predicts the depth of a pixel
◮ Presumably, v = 1 − 1 / (1 + exp(−φ_r^T x_i)), indicating confidence in the prediction ability
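
A minimal sketch of that reading of the confidence model (illustrative names, following the "presumably" above):

```python
import numpy as np

def confidence(x_i, phi_r):
    """v = 1 - sigmoid(phi_r^T x_i): high when the expected relative depth error is low."""
    expected_rel_error = 1.0 / (1.0 + np.exp(-(phi_r @ x_i)))  # models |d - x^T theta_r| / d
    return 1.0 - expected_rel_error
```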

SLIDE 11

Superpixel Interaction Models (f2)

◮ f2(α_i, α_j | y_ij, R_i, R_j) = Π_{(s_i,s_j)∈N} h_{s_i,s_j}(α_i, α_j | y_ij, R_i, R_j)
◮ s_i, s_j are pixels from superpixels i, j respectively, chosen according to the figure, depending on the property to be modeled (i.e. connectivity, planarity, linearity)
◮ h(·) also depends on the property

SLIDE 12

Connectivity and Co-planarity

Neighboring superpixels tend to be connected if there is no occlusion

◮ Uses pairs of neighboring pixels (s_i, s_j) chosen along the boundaries of superpixels i, j
◮ h_{s_i,s_j} = exp( −y_ij |d_{i,s_i} − d_{j,s_j}| / √(d_{i,s_i} d_{j,s_j}) )

Neighboring superpixels tend to belong to the same plane if there is no fold

◮ A pair (s''_i, s''_j) is chosen from the centers of superpixels i, j respectively
◮ h_{s''_j} = exp( −y_ij |(R_{j,s''_j}^T α_i − R_{j,s''_j}^T α_j) d̂_{s''_j}| )
◮ Penalizes the distance between s''_j and its projection onto plane i
◮ h_{s''_i,s''_j} = h_{s''_j}(·) h_{s''_i}(·) (both terms are sketched below)
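
A minimal sketch of the two h(·) terms (illustrative helper names; the point pairs s_i, s_j and s''_i, s''_j are assumed to have been chosen as described above):

```python
import numpy as np

def h_connected(alpha_i, alpha_j, y_ij, R_i_si, R_j_sj):
    """Connectivity: penalize depth mismatch at the boundary pixels s_i, s_j."""
    d_i = 1.0 / (R_i_si @ alpha_i)
    d_j = 1.0 / (R_j_sj @ alpha_j)
    return np.exp(-y_ij * np.abs(d_i - d_j) / np.sqrt(d_i * d_j))

def h_coplanar(alpha_i, alpha_j, y_ij, R_j_sj2, d_hat):
    """Co-planarity: penalize the pixel s_j'' lying at different depths on planes i and j."""
    return np.exp(-y_ij * np.abs((R_j_sj2 @ alpha_i - R_j_sj2 @ alpha_j) * d_hat))
```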

SLIDE 13

Co-linearity

Superpixels lying on a straight line are likely to lie on the same plane

◮ Same penalty as the co-planar term, except superpixels i, j aren't adjacent
◮ Also, y_ij is computed from lines in the image instead of from the occlusion boundary/fold model
SLIDE 14

Inference

◮ α* = argmax_α log P(α | X, v, y, R; θ)
     = argmax_α log (1/Z) Π_i f1(α_i | X_i, v_i, R_i; θ) Π_{i,j} f2(α_i, α_j | y_ij, R_i, R_j)
◮ Each term results in an L1 norm of a linear function of α
◮ Solved via a Newton method with a smooth approximation of the L1 norm (a generic version is sketched below)
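
A generic sketch of this kind of inference, assuming the weighted L1 terms have been stacked into rows of a matrix A and a vector b so the objective is Σ_k |A_k α − b_k|; the sqrt(t² + ε) smoothing and L-BFGS solver here are stand-ins for the paper's exact Newton scheme:

```python
import numpy as np
from scipy.optimize import minimize

def map_plane_parameters(A, b, alpha0, eps=1e-6):
    """Minimize sum_k |A[k] @ alpha - b[k]| over the stacked plane parameters alpha."""
    def objective(alpha):
        r = A @ alpha - b
        return np.sum(np.sqrt(r**2 + eps))       # smooth approximation of the L1 norm

    def gradient(alpha):
        r = A @ alpha - b
        return A.T @ (r / np.sqrt(r**2 + eps))

    return minimize(objective, alpha0, jac=gradient, method="L-BFGS-B").x
```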

SLIDE 15

Experiments

◮ Depth maps from a laser scanner, plus the corresponding images (400 training, 134 test)
◮ Images of urban and natural scenes, taken during the daytime
◮ 588 additional test images from the internet (no depth map)
◮ Evaluation: predict depths, then render the 3D model
  % qualitatively correct
  % major planes correctly identified
  Average log depth error: |log10 d − log10 d̂|
  Relative depth error: |d − d̂| / d
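
A minimal sketch of the two quantitative depth metrics (d: ground-truth depths, d_hat: predictions):

```python
import numpy as np

def log10_error(d, d_hat):
    return np.mean(np.abs(np.log10(d) - np.log10(d_hat)))

def relative_error(d, d_hat):
    return np.mean(np.abs(d - d_hat) / d)
```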

SLIDE 16

Performance

SLIDE 17

Results 1

SLIDE 18

Other Tasks

◮ 3D model from multiple images

Adds an extra term (f3) which penalizes depth discrepancies when 3D correspondences exist between images

◮ Incorporating Object Information

Object A is on top of object B
Object A is connected to object B - such as a person's feet on the ground
Object A has a known orientation - such as people standing upright