Towards Deep Multi-View Stereo
Silvano Galliani October 2, 2017
1 / 40
Multi-View Stereo
Outline
1 Gipuma: massively parallel multi-view stereo
2 Unsupervised normal prediction for improved multi-view reconstruction
3 Learned multi-patch similarity
4 Conclusions
Gipuma: massively parallel multi-view stereo
1 Accurate multi-view stereo reconstruction
2 Highly efficient open-source GPU implementation: correspondence over ten 2 MPix images in 1.6 sec
Our approach:
1 Estimate depth and fit a patch per view by consecutively treating each view as the reference camera
2 Fuse the depth maps in space to obtain the final reconstruction
Approximate randomized search for the best depth & normal minimizing a local matching error:
Initialize all pixels with a random plane (depth + normal)
Then:
Diffuse planes locally and keep a candidate when the cost decreases
Locally optimize the normal
Repeat (8 iterations are enough)
Similar to belief propagation
1 Red-black diffusion of planes → maximum parallelization on GPU
2 Candidates from a larger neighborhood → faster convergence
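As a minimal sketch of the red-black propagation idea (this simplifies Gipuma: fronto-parallel depths instead of full planes, a 4-neighborhood instead of the larger one, and the cost function is a placeholder the caller supplies):

```python
import numpy as np

def red_black_sweep(depth, cost, match_cost, color):
    """One half-iteration of checkerboard diffusion: every pixel whose
    (row + col) parity equals `color` tries the depth hypotheses of its
    opposite-colour neighbours and keeps one only if the cost decreases.
    On a GPU all pixels of one colour can be updated in parallel."""
    h, w = depth.shape
    for y in range(h):
        for x in range(w):
            if (x + y) % 2 != color:
                continue
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w:
                    c = match_cost(y, x, depth[ny, nx])
                    if c < cost[y, x]:          # keep only improving candidates
                        depth[y, x] = depth[ny, nx]
                        cost[y, x] = c
```

Running the two colours alternately for a handful of iterations lets a good hypothesis spread across the whole image, which is why 8 iterations suffice on the slide.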
Fusion of depth & normal maps from different views into one 3D point cloud:
Consistency check on depth (fε) and normal (fang) over at least fcon views
Average of reliable points (depth + normal)
Tunable trade-off between a more accurate or a more complete result by adjusting fε, fang and fcon
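A hedged sketch of the consistency test for a single pixel (the function name and the exact agreement criteria are ours, not Gipuma's implementation; fε is treated as a relative depth tolerance and fang as an angle in degrees):

```python
import numpy as np

def fuse_point(depths, normals, f_eps=0.1, f_ang=30.0, f_con=3):
    """Fuse per-view hypotheses for one surface point (illustrative sketch).

    depths  : (V,) depths of the same point as seen by V views,
              already reprojected into the reference view.
    normals : (V, 3) unit surface normals.
    Returns (depth, normal) averaged over consistent views, or None
    when fewer than f_con views agree (point rejected)."""
    ref_d, ref_n = depths[0], normals[0]
    ok_d = np.abs(depths - ref_d) / ref_d < f_eps          # depth check (f_eps)
    ang = np.degrees(np.arccos(np.clip(normals @ ref_n, -1.0, 1.0)))
    ok = ok_d & (ang < f_ang)                              # normal check (f_ang)
    if ok.sum() < f_con:
        return None                                        # too few views agree
    n = normals[ok].mean(axis=0)
    return depths[ok].mean(), n / np.linalg.norm(n)
```

Loosening fε/fang or lowering fcon keeps more points (completeness); tightening them keeps only well-confirmed points (accuracy), which is the trade-off the slide describes.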
Large-scale multi-view dataset: 80 different objects, each covered by 49–64 images of resolution 1600 × 1200 pixels (≈ 2 million pixels)
≈1.6 seconds per depth map with fast settings
≈13 seconds per depth map with accurate settings
              Acc.            Comp.
              Mean    Med.    Mean    Med.
Points
  —           0.273   0.196   0.687   0.260
  —           0.379   0.234   0.400   0.188
  —           0.291   0.208   0.825   0.279
  tola [Tol-10] 0.307 0.198   1.097   0.456
  furu [Fur-10] 0.605 0.321   0.842   0.431
  camp [Cam-08] 0.753 0.480   0.540   0.179
Surfaces
  —           0.363   0.215   0.766   0.329
  —           0.631   0.262   0.519   0.309
  —           0.366   0.223   0.900   0.347
  tola [Tol-10] 0.488 0.244   0.974   0.382
  furu [Fur-10] 1.299 0.534   0.702   0.405
  camp [Cam-08] 1.411 0.579   0.562   0.322
Figure: Ground truth, textured reconstruction, reconstructed triangulation
New (multi-view) stereo and video benchmark on unstructured scenes:
SLR camera images
Multi-field-of-view stereo rig videos and images
Training dataset available
Presented at CVPR 2017: eth3d.net
Unsupervised normal prediction for improved multi-view reconstruction
Common failure modes for MVS are ambiguous matches, caused by:
Occlusions
Lack of texture in homogeneous regions
Dichotomy:
Stereo correspondences: more accurate in textured regions with many large image gradients
Shape-from-shading: typically more robust in flat regions with no albedo variations
Idea
Explicit modeling of surface, light and material properties is an under-constrained problem (light positions, colors and intensities, the reflectance function).
Two observations:
1 Shading affects surface orientation, not depth
2 Specific light interactions can be view-dependent: we rule out edges, etc.
We learn the relation between image and surface normal
We train a single model for each view
We start from a reliable MVS reconstruction with Gipuma
For every image we use it as training data to learn a CNN that predicts the surface normal from the RGB patch around each point
We use a convolutional neural network that minimizes the error between training and predicted normals
Accurate results w.r.t. the training data
Joint training of one model across all views did not work
Mean error: 18° predicted vs. 11° MVS
Mean of median error: 16° predicted vs. 9° MVS
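The angular errors quoted above are simply the mean angle between predicted and reference (MVS) normals; a minimal sketch:

```python
import numpy as np

def mean_angular_error_deg(pred, ref):
    """Mean angle, in degrees, between corresponding unit normals.

    pred, ref : (N, 3) arrays of unit-length surface normals."""
    # Dot product of unit vectors = cosine of the angle; clip for safety
    cos = np.clip(np.sum(pred * ref, axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```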
Normals are dense but carry no depth information:
1 Integrate the new normals with a masked Poisson equation
2 Fuse all the new dense depth maps to obtain the final point cloud
The vector field g consists of the gradients of both functions: the reliable MVS depth f₀ on A, and the gradients derived from the predicted normals elsewhere (the symbol f₀ is ours):

∀x ∈ Ω : g(x) = ∇f₀(x) if x ∈ A, else the normal-derived gradient   (1)

Find an interpolant f over Ω \ A that minimizes the squared error

min_f ‖∇f − g‖²   (2)

This leads to the Poisson equation

Δf = div g   (3)

solved with Gauss–Seidel + Successive Over-Relaxation (SOR) in a few seconds.
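A hedged sketch of the masked solve (function name and discretization details are ours): the 5-point Laplacian is relaxed only on the free pixels Ω \ A, while pixels in the reliable set A keep their MVS depth fixed, and over-relaxation accelerates plain Gauss–Seidel:

```python
import numpy as np

def poisson_sor(f, div_g, free, omega=1.9, iters=2000):
    """Solve the discrete Poisson equation lap(f) = div_g in place.

    f     : (H, W) depth map; pixels where `free` is False stay fixed (set A).
    free  : (H, W) bool mask of unknowns (Omega \\ A), assumed off the border.
    div_g : (H, W) divergence of the guidance gradient field g."""
    ys, xs = np.nonzero(free)
    for _ in range(iters):
        for y, x in zip(ys, xs):
            # Gauss-Seidel update from the 5-point stencil:
            #   f[y-1,x] + f[y+1,x] + f[y,x-1] + f[y,x+1] - 4 f[y,x] = div_g
            s = f[y - 1, x] + f[y + 1, x] + f[y, x - 1] + f[y, x + 1]
            f_new = (s - div_g[y, x]) / 4.0
            f[y, x] = (1 - omega) * f[y, x] + omega * f_new  # SOR step
    return f
```

With 0 < omega < 2 SOR converges for this problem; omega near 2 is what makes the sequential Gauss–Seidel sweep fast enough to finish in seconds.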
Learned multi-patch similarity
ICCV 2017
A crucial component of stereo reconstruction is the matching function: similarity
In 2-view stereo matching, similarity is uniquely determined: left vs. right
But what about multi-view stereo? No direct solution → it is common and robust to average pairwise scores
Idea
We train a CNN that directly learns a similarity score from multiple patches
Multi-branch Siamese network with shared weights and average aggregation
Cast as a binary classification problem
Figure: network architecture. Per branch (shared weights): conv1 → TanH1 → pool1 → conv2 → TanH2 → pool2; branch outputs averaged (mean); then Convolutional Layers 3–5 with ReLU 3, ReLU 4 and a final Softmax, producing a similarity score in [0, 1].
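The key structural property can be sketched without any deep-learning framework (the function names here are ours; `branch` and `head` stand in for the learned convolutional stages):

```python
import numpy as np

def multi_patch_similarity(patches, branch, head):
    """Order- and count-invariant multi-patch similarity (illustrative sketch).

    patches : list of (h, w) image patches, one per view.
    branch  : shared per-branch feature extractor, patch -> (d,) vector
              (same function for every branch = shared weights).
    head    : shared classifier mapping the mean feature to a score in [0, 1]."""
    feats = np.stack([branch(p) for p in patches])
    return head(feats.mean(axis=0))   # mean aggregation over branches
```

Because the branch outputs are averaged, the score does not depend on the order of the views, and (as the later slide on branch averaging notes) the same network can take a different number of views without retraining.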
We directly extract a set of patches from 3D data points backprojected onto the images:
Positive examples are obtained by cropping a rectangle around the backprojection of the correct 3D depth in the other views
Negative examples are extracted at points far from the real depth, but still on the epipolar lines
Roughly 15 million positive and negative examples are used
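The geometry behind this sampling can be sketched as follows (a simplified pinhole model with our own function name; the paper's actual cropping and sampling code is not shown here): a reference pixel backprojected at the true depth lands on the matching patch in another view, while a wrong depth slides the projection along the same epipolar line, yielding a negative.

```python
import numpy as np

def project_at_depth(u, v, d, K, R, t, K2):
    """Project reference pixel (u, v), hypothesized at depth d, into a
    second view with intrinsics K2 and relative pose (R, t).
    Varying d moves the result along the epipolar line, which is how
    negatives 'far from the real depth but still on the epipolar line' arise."""
    X = d * np.linalg.inv(K) @ np.array([u, v, 1.0])   # backproject to 3D
    x = K2 @ (R @ X + t)                               # project into view 2
    return x[:2] / x[2]                                # dehomogenize
```

A positive patch is cropped around the projection at the true depth; a negative around the projection at a clearly wrong depth.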
To compare our method directly, we modified a standard plane-sweeping algorithm to use our similarity score
For each point, to find the correct depth we test planes at all candidate depth values and pick the one with the highest similarity
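The winner-take-all selection over swept planes reduces to an argmax; a minimal sketch (the scoring function is supplied by the caller, here standing in for the learned multi-patch similarity):

```python
import numpy as np

def plane_sweep_depth(depth_candidates, score_fn):
    """Pick, per pixel, the depth whose swept plane scores highest.

    depth_candidates : (D,) tested depth values.
    score_fn : d -> (H, W) similarity of every pixel when all views are
               warped onto the plane at depth d."""
    scores = np.stack([score_fn(d) for d in depth_candidates])  # (D, H, W)
    return depth_candidates[np.argmax(scores, axis=0)]          # winner take all
```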
Benefits of joint similarity computation:
Direct similarity computation across all the patches at once
The reference camera does not have a privileged role → robust to occlusions w.r.t. the reference view
Benefits of branch averaging Matching different numbers of viewpoints with the similarity network can be done without retraining:
Figure: Input, 3, 5, 9 views
The learned similarity generalizes to a different test environment:
Figure: Fountain from Strecha dataset
             Accuracy         Completeness
             Mean    Median   Mean    Median
BIRD
  SAD        2.452   0.380    4.035   1.105
  ZNCC       1.375   0.365    4.253   1.332
  SIFT       1.594   0.415    5.269   1.845
  LIFT       1.844   0.562    4.387   1.410
  OUR concat 1.605   0.305    4.358   1.133
  OUR        1.881   0.271    4.167   1.044
FLOWER
  SAD        2.537   1.143    2.768   1.407
  ZNCC       2.018   1.106    2.920   1.467
  SIFT       2.795   1.183    4.747   2.480
  LIFT       3.049   1.420    4.224   2.358
  OUR concat 2.033   0.843    2.609   1.267
  OUR        1.973   0.771    2.609   1.208
CAN
  SAD        1.824   0.664    2.283   1.156
  ZNCC       1.187   0.628    2.092   1.098
  SIFT       1.769   0.874    3.067   1.726
  LIFT       2.411   1.207    3.003   1.823
  OUR concat 1.082   0.477    1.896   0.833
  OUR        1.123   0.478    1.982   0.874
BUDDHA
  SAD        0.849   0.250    1.119   0.561
  ZNCC       0.688   0.299    1.208   0.656
  SIFT       0.696   0.263    1.347   0.618
  LIFT       0.688   0.299    1.208   0.656
  OUR concat 0.682   0.231    1.017   0.473
  OUR        0.637   0.206    1.057   0.475
Conclusions
Unsupervised normal estimation works to improve MVS
Dataset-specific models are better than generic ones when we are self-supervised
A similarity score can be trained jointly and proves better than hand-crafted features
A pure end-to-end multi-view stereo network is not there yet
Deep learning applied to 3D reconstruction is an unsolved and open problem
gipuma code at http://github.com/kysucix/gipuma
eth3d benchmark at http://eth3d.net