Make3D: Learning 3D Scene Structure from a Single Still Image
Ashutosh Saxena, Min Sun, and Andrew Ng
Ian Endres, CS598, February 5, 2009
Overview
Goal: Infer 3D models from monocular cues
◮ Segment into planar patches
◮ Build model from depth maps
◮ Estimate orientation/location of patches
◮ Construct 3D model
Properties to Model
◮ Single Image
  ◮ Depth from features
  ◮ Connectedness
  ◮ Coplanarity
◮ Multiple Images
  ◮ Depths from triangulation
◮ Objects
  ◮ Object A is above B
  ◮ Object Orientation
Superpixel Model
◮ Superpixel as a plane
◮ Model as a 3D mesh of polygons
◮ Use Felzenszwalb and Huttenlocher’s segmenter
◮ Goal: Determine location and orientation of each superpixel
Superpixel Parameters
◮ $\alpha \in \mathbb{R}^3$
◮ $\hat{\alpha} = \alpha / |\alpha|$: unit normal of plane
◮ $1 / |\alpha|$: distance from camera center to plane
◮ Thus, $q^T \alpha = 1$ for any point $q \in \mathbb{R}^3$ on plane
◮ $R_i \in \mathbb{R}^3$: unit-length ray pointing from camera center to pixel $i$ on image plane (using a “reasonable guess” of camera’s intrinsic parameters)
◮ $d_i = 1 / (R_i^T \alpha)$ is distance of point $i$ (having ray $R_i$) from camera center if it lies on plane described by $\alpha$
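A small numerical check of the plane parameterization may help here. The sketch below assumes NumPy; the function name `depth_along_ray` and the example plane (a wall facing the camera, 4 units away) are hypothetical and chosen only for illustration.

```python
import numpy as np

def depth_along_ray(alpha, ray):
    """Distance d = 1 / (R^T alpha) from the camera center to the plane along ray R."""
    ray = np.asarray(ray, dtype=float)
    ray = ray / np.linalg.norm(ray)  # make sure the ray has unit length
    return 1.0 / float(ray @ np.asarray(alpha, dtype=float))

# Hypothetical plane: unit normal (0, 0, 1), 4 units from the camera center,
# so alpha = normal / distance.
alpha = np.array([0.0, 0.0, 1.0]) / 4.0
unit_normal = alpha / np.linalg.norm(alpha)    # alpha_hat
plane_distance = 1.0 / np.linalg.norm(alpha)   # 1 / |alpha|

d_center = depth_along_ray(alpha, [0.0, 0.0, 1.0])   # ray along the normal
d_oblique = depth_along_ray(alpha, [3.0, 0.0, 4.0])  # ray tilted away from the normal

# The 3D point reached along the oblique ray should satisfy q^T alpha = 1.
q = d_oblique * np.array([0.6, 0.0, 0.8])
```

The ray along the normal reaches the plane at the plane distance itself, while the tilted ray travels farther before hitting the same plane.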
Features
◮ Monocular Features: $x_i \in \mathbb{R}^{524}$
  ◮ Filter response + shape computed for each superpixel
  ◮ Additional contextual information from neighbors, at 3 scales
    Uses features from largest superpixel neighbor in each bin (i.e. S1C)
◮ Boundary Features: $\epsilon_{ij} \in \{0, 1\}^{14}$
  ◮ Segmentations based on 7 different properties at 2 scales
    Properties include color, texture, and edges
  ◮ For each segmentation $k$, if superpixels $i, j$ fall on same segment, $\epsilon_{ij}(k) = 1$, otherwise 0
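The boundary feature vector can be sketched as follows. This is an illustration only: the function name `boundary_features`, the toy 2x4 label maps, and the rule that a superpixel belongs to the segment covering most of its pixels are all assumptions, and the toy uses 2 segmentations rather than the 14 described above.

```python
import numpy as np

def boundary_features(segmentations, mask_i, mask_j):
    """Return eps_ij with one 0/1 entry per segmentation k."""
    eps = []
    for labels in segmentations:
        # Assumption: assign each superpixel to the segment covering most of its pixels.
        seg_i = np.bincount(labels[mask_i]).argmax()
        seg_j = np.bincount(labels[mask_j]).argmax()
        eps.append(int(seg_i == seg_j))
    return np.array(eps)

# Toy image of 2x4 pixels: superpixel i is the left half, superpixel j the right half.
mask_i = np.zeros((2, 4), dtype=bool)
mask_i[:, :2] = True
mask_j = ~mask_i

seg_a = np.zeros((2, 4), dtype=int)               # one segment covers everything
seg_b = np.array([[0, 0, 1, 1], [0, 0, 1, 1]])    # left and right are separate segments

eps_ij = boundary_features([seg_a, seg_b], mask_i, mask_j)
```

In this toy case the first segmentation merges the two superpixels and the second splits them, so the two entries of `eps_ij` differ.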
Models
◮ $P(y_{ij} | \epsilon_{ij}; \psi)$ - models the confidence of superpixels $i, j$ belonging to same planar surface (0 for boundary/fold, 1 for planar)
◮ $P(\alpha | X, v, y, R; \theta)$ - models depth and orientation parameters of superpixels, composed of:
  ◮ $f_1(\alpha_i | X_i, v_i, R_i; \theta)$ - plane parameters as a function of single superpixel $i$ features
  ◮ $f_2(\alpha_i, \alpha_j | y_{ij}, R_i, R_j)$ - plane parameters as a function of edge features between superpixels $i, j$
◮ $P(v_i | x_i; \phi_r)$ - models each pixel’s ability to predict parameters of associated superpixel
Occlusion Boundary and Fold Model
◮ Simple edge detector not sufficient for detecting 3D discontinuities (consider a shadow)
◮ $y_{ij} \in [0, 1]$, where 0 indicates boundary/fold, 1 indicates planar surface
◮ $y_{ij}$ hand labeled in 50 images
◮ $P(y_{ij} | \epsilon_{ij}; \psi) = \frac{1}{1 + \exp(-\psi^T \epsilon_{ij})}$ learned using logistic regression
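Evaluating the logistic model for one pair of superpixels could look like the sketch below. The weight vector `psi` here is made up for illustration (the real one is learned from the hand-labeled images), and `planar_confidence` is a hypothetical name.

```python
import numpy as np

def planar_confidence(eps_ij, psi):
    """Logistic model: 1 / (1 + exp(-psi^T eps_ij))."""
    return 1.0 / (1.0 + np.exp(-(psi @ eps_ij)))

psi = np.full(14, 0.5)       # hypothetical weights, NOT learned values
eps_same = np.ones(14)       # i, j share a segment in all 14 segmentations
eps_split = np.zeros(14)     # i, j are separated in all 14 segmentations

y_same = planar_confidence(eps_same, psi)
y_split = planar_confidence(eps_split, psi)
```

With the formula as written on the slide there is no bias term, so an all-zero feature vector always maps to 0.5 regardless of `psi`.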
Unary Depth Model (f1)
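◮ Predict depth $\hat{d}$ as a function of features $x$
◮ Penalize using relative error $\frac{\hat{d}}{d} - 1$, where $\hat{d} = x^T \theta_r$
  Note: $\frac{1}{d} = R_{i,s_i}^T \alpha_i$
$f_1(\alpha_i | X_i, v_i, R_i; \theta) = \exp\left( -\sum_{s_i} v_{i,s_i} \left| R_{i,s_i}^T \alpha_i \, (x_{i,s_i}^T \theta_r) - 1 \right| \right)$
◮ The $r$ in $\theta_r$ indicates one of 11 rows in the image
◮ Parameters learned from pseudo log-likelihood of $P(\alpha | \ldots)$
  Since $f_2(\cdot)$ does not depend on $\theta_r$, this gives:
  $\theta_r^* = \arg\min_{\theta_r} \sum_i \sum_{s_i} v_{i,s_i} \left| \frac{1}{d_{i,s_i}} (x_{i,s_i}^T \theta_r) - 1 \right|$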
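To make the unary term concrete, the sketch below evaluates $\log f_1$ for one superpixel. The two-dimensional features, the weights `theta_r`, and the function name are hypothetical (the deck uses 524-dimensional features); only the formula itself comes from the slide.

```python
import numpy as np

def unary_log_potential(alpha, rays, feats, theta_r, v):
    """log f1 = -sum_s v_s * | (R_s^T alpha) * (x_s^T theta_r) - 1 |."""
    inv_depth = rays @ alpha       # R^T alpha = 1/d for each pixel s
    pred_depth = feats @ theta_r   # d_hat = x^T theta_r for each pixel s
    return -float(np.sum(v * np.abs(inv_depth * pred_depth - 1.0)))

# Hypothetical superpixel on a plane 4 units away, sampled at two pixels
# whose true depths along their rays are 4 and 5.
alpha = np.array([0.0, 0.0, 0.25])
rays = np.array([[0.0, 0.0, 1.0], [0.6, 0.0, 0.8]])
feats = np.array([[4.0, 0.0], [0.0, 5.0]])   # toy 2-D features
v = np.array([1.0, 1.0])                     # full confidence in both pixels

log_f1_exact = unary_log_potential(alpha, rays, feats, np.array([1.0, 1.0]), v)
log_f1_off = unary_log_potential(alpha, rays, feats, np.array([1.5, 1.0]), v)
```

When the predicted depths agree with the plane, the log-potential is zero; overestimating the first pixel's depth by 50% lowers it by 0.5, the relative error weighted by `v`.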
Depth Prediction Confidence (v)
◮ Given a model $\hat{d} = x_i^T \theta_r$ for predicting depth, build a model to predict expected error
◮ Thus learn $\frac{|d_i - x_i^T \theta_r|}{d_i} = \frac{1}{1 + \exp(-\phi_r^T x_i)}$
◮ This (ideally) can predict how well a feature predicts the depth of a pixel
◮ Presumably, $v = 1 - \frac{1}{1 + \exp(-\phi_r^T x_i)}$, indicating confidence of prediction ability
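A rough sketch of the two quantities involved: the relative-error targets the confidence model is fit to, and the resulting confidence $v$. It follows the presenter's reading that $v$ is one minus the predicted error; the toy features and the weights `theta_r` and `phi_r` are hypothetical, not learned values.

```python
import numpy as np

def relative_error_targets(d, feats, theta_r):
    """Regression targets |d - x^T theta_r| / d for the confidence model."""
    return np.abs(d - feats @ theta_r) / d

def confidence(feats, phi_r):
    """v = 1 - 1 / (1 + exp(-phi_r^T x)), per pixel."""
    return 1.0 - 1.0 / (1.0 + np.exp(-(feats @ phi_r)))

d = np.array([4.0, 5.0])                     # ground-truth depths
feats = np.array([[4.0, 0.0], [0.0, 5.0]])   # toy 2-D features
theta_r = np.array([1.5, 1.0])               # hypothetical depth weights
phi_r = np.array([0.0, -1.0])                # hypothetical confidence weights

targets = relative_error_targets(d, feats, theta_r)
v = confidence(feats, phi_r)
```

Here the first pixel's depth is overestimated by half, so its target error is 0.5, while the second pixel is predicted exactly and gets target 0.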
Superpixel Interaction Models (f2)
◮ $f_2(\alpha_i, \alpha_j | y_{ij}, R_i, R_j) = \prod_{\{s_i, s_j\} \in N} h_{s_i,s_j}(\alpha_i, \alpha_j | y_{ij}, R_i, R_j)$
◮ $s_i, s_j$ are pixels from superpixels $i, j$ respectively, chosen according to the figure depending on property to be modeled (i.e. connectivity, planarity, linearity)
◮ $h(\cdot)$ also depends on property
Connectivity and Co-planarity
Neighboring superpixels tend to be connected if no occlusion
◮ Uses pairs of neighboring pixels $(s_i, s_j)$ chosen along boundaries of superpixels $i, j$
◮ $h_{s_i,s_j} = \exp\left( -y_{ij} \frac{|d_{i,s_i} - d_{j,s_j}|}{\sqrt{d_{i,s_i} d_{j,s_j}}} \right)$
Neighboring superpixels tend to belong to the same plane if no fold
◮ A pair $(s''_i, s''_j)$ is chosen from the centers of each superpixel $i, j$ respectively
◮ $h_{s''_j} = \exp\left( -y_{ij} \left| (R_{j,s''_j}^T \alpha_i - R_{j,s''_j}^T \alpha_j) \hat{d} \right| \right)$
◮ Penalizes distance between $s''_j$ and $s''_j$ projected onto plane $i$
◮ $h_{s''_j,s''_i} = h_{s''_j}(\cdot) \, h_{s''_i}(\cdot)$
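The two pairwise terms can be evaluated directly from their formulas, as in the sketch below. The function names and numbers are hypothetical, and `d_hat` is simply passed in as a given depth estimate, since the slide does not say how it is chosen.

```python
import numpy as np

def connectivity_h(y_ij, d_i, d_j):
    """exp(-y_ij * |d_i - d_j| / sqrt(d_i * d_j)) for a pair of boundary pixels."""
    return float(np.exp(-y_ij * abs(d_i - d_j) / np.sqrt(d_i * d_j)))

def coplanarity_h(y_ij, ray_j, alpha_i, alpha_j, d_hat):
    """exp(-y_ij * |(R_j^T alpha_i - R_j^T alpha_j) * d_hat|) at the center of superpixel j."""
    return float(np.exp(-y_ij * abs((ray_j @ alpha_i - ray_j @ alpha_j) * d_hat)))

# Connectivity: equal depths are not penalized; a depth jump is, unless y_ij = 0.
h_touching = connectivity_h(1.0, 4.0, 4.0)
h_jump = connectivity_h(1.0, 4.0, 9.0)        # |4 - 9| / sqrt(36) = 5/6
h_occluded = connectivity_h(0.0, 4.0, 9.0)    # y_ij = 0 switches the penalty off

# Co-planarity: identical plane parameters give no penalty, different ones do.
ray = np.array([0.0, 0.0, 1.0])
alpha_a = np.array([0.0, 0.0, 0.25])
alpha_b = np.array([0.0, 0.0, 0.20])
h_same_plane = coplanarity_h(1.0, ray, alpha_a, alpha_a, 4.0)
h_diff_plane = coplanarity_h(1.0, ray, alpha_a, alpha_b, 4.0)
```

Setting $y_{ij}$ near 0 (a detected occlusion or fold) removes the penalty, which is how the boundary model lets neighboring planes separate.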
Co-linearity
Superpixels lying on a straight line are likely to lie on the same plane
◮ Same penalty as Co-planar term, except superpixels $i, j$ aren’t adjacent
◮ Also, $y_{ij}$ is computed from lines in the image instead of the occlusion/fold model
Inference
◮ $\alpha^* = \arg\max_\alpha \log P(\alpha | X, v, y, R; \theta_r)$
  $= \arg\max_\alpha \log \frac{1}{Z} \prod_i f_1(\alpha_i | X_i, v_i, R_i; \theta) \prod_{i,j} f_2(\alpha_i, \alpha_j | y_{ij}, R_i, R_j)$
◮ Each term results in L1 norm of a linear function of $\alpha$
◮ Solved via a Newton method with smooth approximation of L1 norm
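The slide does not say which smooth approximation is used. As a stand-in, the sketch below uses $\sqrt{r^2 + \epsilon^2}$, one common smooth surrogate for $|r|$; this choice, the value of `eps`, and the function names are assumptions, not the authors' method. The point is that the surrogate has a well-defined derivative at $r = 0$, which a Newton-type solver needs.

```python
import numpy as np

def smooth_abs(r, eps=1e-3):
    """Smooth surrogate for |r|: sqrt(r^2 + eps^2). An assumed choice, not the paper's."""
    return np.sqrt(r * r + eps * eps)

def smooth_abs_grad(r, eps=1e-3):
    """Derivative of the surrogate: r / sqrt(r^2 + eps^2); defined even at r = 0."""
    return r / np.sqrt(r * r + eps * eps)

value_far = smooth_abs(3.0)        # close to |3| away from zero
value_zero = smooth_abs(0.0)       # equals eps at zero instead of having a kink
grad_zero = smooth_abs_grad(0.0)   # zero slope at the origin
grad_far = smooth_abs_grad(3.0)    # close to sign(3) = 1
```

Smaller `eps` tracks the L1 norm more tightly but makes the curvature near zero sharper, so the solver has to trade accuracy against conditioning.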
Experiments
◮ Depth maps from laser scanner, plus corresponding image (400 training, 134 test)
◮ Images of urban and natural scenes, taken in daytime
◮ 588 additional test images from the internet (no depth map)
◮ Evaluation: Predict depths, then render 3D model
  ◮ % qualitatively correct
  ◮ % major planes correctly identified
  ◮ Average depth error ($\log_{10}$): $|\log d - \log \hat{d}|$
  ◮ Relative depth error: $\frac{|d - \hat{d}|}{d}$
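The two quantitative metrics could be computed as in this sketch. Averaging over pixels with a plain mean is an assumption (the slide only gives the per-pixel expressions), and the depth values are made up.

```python
import numpy as np

def log10_error(d, d_hat):
    """Mean of |log10(d) - log10(d_hat)| over pixels."""
    return float(np.mean(np.abs(np.log10(d) - np.log10(d_hat))))

def relative_error(d, d_hat):
    """Mean of |d - d_hat| / d over pixels."""
    return float(np.mean(np.abs(d - d_hat) / d))

d = np.array([10.0, 100.0])      # hypothetical ground-truth depths
d_hat = np.array([1.0, 100.0])   # hypothetical predictions: first is off by 10x

err_log10 = log10_error(d, d_hat)
err_rel = relative_error(d, d_hat)
```

The log error treats a tenfold under- or overestimate symmetrically, while the relative error is bounded by 1 for underestimates and unbounded for overestimates.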