Unsupervised Monocular Depth Estimation CNN Robust to Training Data Diversity
Valery Anisimovskiy, Andrey Shcherbinin, Sergey Turko and Ilya Kurilin
15 May, 2020

Problem Statement: Depth Sensor Limitations
(Figure: RGB image vs. Kinect depth map; sensors shown: Kinect 2 IR depth sensor, YDLiDAR X1, ZED stereo camera)
IR depth sensor
+ Dense depth map
+ Instantaneous
- Noisy
- Short range

Lidar
+ Long range
+ Good accuracy
- Sparse
- Non-instantaneous
- Expensive

Stereo camera
+ Dense
+ Instantaneous
- Unreliable
- Long baseline
- Occlusions
Problem Statement: RGB Camera + CNN
Unsupervised Monocular CNN
+ Dense and instantaneous depth map
+ Cheap monocular camera
+ Adaptable to scenery by training on a relevant dataset
+ Easy training data collection
- High computational overhead
(Pipeline: photo by monocular camera → unsupervised monocular CNN → predicted depth)
Existing Approaches: Supervised Monocular CNN
CNN model trained on a dataset containing monocular input images and ground-truth depth maps (e.g. KITTI, Cityscapes)
+ Best depth estimation accuracy
- Requires hard-to-get precise ground-truth depth maps
- Costly training datasets
- Trained model is scene dependent
Existing Approaches: Unsupervised CNN Trained on Video Sequences
CNN model trained on video sequences using camera pose prediction along with depth estimation
+ Leverages readily available video sequence data for training
- Worse depth estimation accuracy
- Requires camera intrinsics for training
Existing Approaches: Unsupervised CNN Trained on Stereo Pairs
Unsupervised CNN model trained on stereo pairs with a loss based on opposite-image reconstruction and left-right consistency
+ Good depth estimation accuracy
- Lacks robustness to training on hybrid datasets containing data from different stereo rigs
Possible sources of easily available training data: existing stereo datasets (KITTI, Cityscapes, …)
Suggested Approach: Unsupervised CNN Trained on Stereo Pairs + Camera Parameter Estimation
Unsupervised CNN model trained on stereo pairs using camera parameter estimation along with depth estimation
+ State-of-the-art depth estimation accuracy (among unsupervised monocular methods)
+ Easy adaptivity to any scene category by routine training data collection
+ No expensive ground-truth depth
+ Robustness to training data diversity via camera parameter estimation
Possible sources of easily available training data: high-quality commercial stereo movies, stereo web-datasets, custom stereo video
Proposed Model
1. The input stereo pair images (I_L and I_R) are separately processed by the Siamese depth estimation CNN to produce inverse depth maps (ẑ_L and ẑ_R), high-level feature maps, and correction maps ΔI_L and ΔI_R.
2. The high-level feature maps of the left and right images are processed by the camera parameter estimation CNN to produce the stereo camera parameters, gain and bias (G and B), which transform ẑ_L and ẑ_R into disparity maps d_L and d_R: d_L = G(ẑ_L + B), d_R = −G(ẑ_R + B).
3. The disparity maps d_L and d_R, along with the corrected input images Î_L and Î_R, are used to reconstruct the counterpart images I′_R and I′_L as well as the counterpart disparity maps d′_L and d′_R.
4. The generated disparity maps d′_L and d′_R are further used to reconstruct the auxiliary images I″_L and I″_R.
Data flow
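The gain/bias disparity transform (d_L = G(ẑ_L + B), d_R = −G(ẑ_R + B)) can be sketched in a few lines. Squashing the two raw network outputs with sigmoid/tanh mirrors the sigmoid/tanh final layer of the camera parameters CNN, but the exact parameterization is an assumption here, not the authors' code.

```python
import numpy as np

def disparity_from_inverse_depth(z_left, z_right, raw_gain, raw_bias):
    """Turn predicted inverse depth maps into signed disparity maps.

    Sketch of d_L = G(z_L + B), d_R = -G(z_R + B). `raw_gain` and
    `raw_bias` stand in for the two outputs of the camera parameter
    estimation CNN; the sigmoid/tanh squashing below is an assumption.
    """
    gain = 1.0 / (1.0 + np.exp(-raw_gain))   # sigmoid -> positive gain G
    bias = np.tanh(raw_bias)                 # tanh -> bounded bias B
    d_left = gain * (z_left + bias)
    d_right = -gain * (z_right + bias)       # opposite sign for the right view
    return d_left, d_right
```

The sign flip on d_R reflects that left and right views shift in opposite directions along the baseline.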
Proposed Model
Loss terms (formulas shown on slide): primary reconstruction, auxiliary reconstruction, correction regularization, left-right consistency, and the total loss combining them.
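A hedged sketch of how the four named loss terms could combine into a total loss. The slide text does not carry the concrete formulas, so the L1 photometric/consistency terms and the weights w_* below are illustrative assumptions, not the authors' exact losses.

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference, used as a stand-in photometric term."""
    return float(np.mean(np.abs(a - b)))

def total_loss(I_L, I_R, I_rec_R, I_rec_L, I_aux_L, I_aux_R,
               corr_L, corr_R, d_L, d_R_warped,
               w_prim=1.0, w_aux=1.0, w_corr=0.1, w_lr=1.0):
    """Weighted sum of the four loss terms named on the slide (sketch)."""
    primary = l1(I_R, I_rec_R) + l1(I_L, I_rec_L)    # counterpart image reconstruction
    auxiliary = l1(I_L, I_aux_L) + l1(I_R, I_aux_R)  # auxiliary image reconstruction
    correction = float(np.mean(np.abs(corr_L))
                       + np.mean(np.abs(corr_R)))    # keep correction maps small
    lr = l1(d_L, d_R_warped)                         # left-right disparity consistency
    return (w_prim * primary + w_aux * auxiliary
            + w_corr * correction + w_lr * lr)
```

The correction regularization penalizes the magnitude of ΔI_L and ΔI_R so the correction maps cannot trivially absorb photometric differences between the views.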
Depth Estimation CNN Architecture
Encoder-decoder with skip connections, input W x H x 3:
- Encoder: seven scales with two convolutions per scale (conv9x9 at 32 channels, conv7x7 at 64, conv5x5 at 128, then conv3x3 at 256, 512, 512 and 512 channels), with 2x2 pooling between scales, reducing resolution from W x H down to W/64 x H/64; all convolutions use ELU activations.
- Decoder: successive upsample2x2 + conv3x3 (ELU) stages with skip connections from the matching encoder scales, restoring resolution from W/64 x H/64 back to W x H.
- Outputs: an inverse depth pyramid (1-channel conv3x3 heads at W/8 x H/8, W/4 x H/4, W/2 x H/2 and W x H) and correction maps (1-channel conv3x3 with tanh activation at W/2 x H/2 and W x H).
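The encoder halves resolution six times while the channel count grows from 32 to 512. A small sketch of that shape bookkeeping, using the per-scale channel widths listed on the slide:

```python
def encoder_shapes(width, height):
    """Return the (W, H, C) tensor shape at each encoder scale.

    Channel widths follow the slide's encoder (32 ... 512); the 2x2
    pooling before every scale but the first halves the resolution.
    """
    channels = [32, 64, 128, 256, 512, 512, 512]  # per-scale widths from the slide
    shapes = []
    for level, c in enumerate(channels):
        factor = 2 ** level                # cumulative downsampling at this scale
        shapes.append((width // factor, height // factor, c))
    return shapes
```

For a 512 x 256 input the deepest feature map is 8 x 4 x 512, which is what the decoder then upsamples back to full resolution through the skip connections.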
Camera Parameters Estimation CNN Architecture
The left and right high-level feature maps produced by the Siamese depth estimation CNN are concatenated and passed through a stack of conv3x3 + ELU layers, average pooling, a fully-connected ELU layer, and a final 2-unit fully-connected layer with sigmoid/tanh activations, yielding the stereo camera gain g and bias b. These transform the left and right inverse depth maps ẑ_L and ẑ_R into disparity maps:
d_L = g(ẑ_L + b)
d_R = −g(ẑ_R + b)
Training Datasets
- Stereo Movies dataset (SM)
- Hybrid city driving dataset: Cityscapes + KITTI (CS+K)
Quantitative Results on KITTI Dataset (Trained on CS+K)
(Table: comparison with unsupervised and (semi-)supervised methods)
Quantitative Results on DIW Dataset (Trained on SM)
WHDR = Weighted Human Disagreement Rate
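WHDR scores a predicted depth map against human ordinal annotations: pairs of points labelled by which one is closer. A generic sketch of the metric, not the authors' evaluation code; uniform annotation weights are assumed by default.

```python
import numpy as np

def whdr(pred_depth, pairs, labels, weights=None):
    """Weighted Human Disagreement Rate (generic sketch).

    pairs: list of ((y1, x1), (y2, x2)) pixel coordinate pairs.
    labels: +1 if humans judged point 1 closer, -1 if point 2 closer.
    weights: optional per-pair annotation weights (uniform if omitted).
    """
    if weights is None:
        weights = np.ones(len(pairs))
    disagree = 0.0
    for ((y1, x1), (y2, x2)), label, w in zip(pairs, labels, weights):
        # Smaller depth means closer to the camera.
        pred_label = 1 if pred_depth[y1, x1] < pred_depth[y2, x2] else -1
        if pred_label != label:
            disagree += w
    return disagree / np.sum(weights)
```

Lower is better; a rate of 0.5 corresponds to chance-level ordinal predictions under uniform weights.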
Test of Robustness to Training Dataset Diversity
CSK: training on Cityscapes, then fine-tuning on KITTI
CS+K: training on a mixture of Cityscapes and KITTI
Camera Parameters Predicted for KITTI/Cityscapes
Qualitative Results on KITTI/Cityscapes Datasets
(Figure columns: input image, correction map, depth map by our method, depth map by Godard et al. [14]; rows: Cityscapes, KITTI)
Qualitative Results for Uncontrolled Street Views (CS+K)
(Figure columns: input image, depth map by our method, depth map by Godard et al. [14])
*Our model trained on the CS+K dataset
Qualitative Results for Uncontrolled People Images (SM)
(Figure: input images and depth maps by our method)
Conclusion
- State-of-the-art accuracy among unsupervised monocular depth estimation methods
- Robustness to dataset diversity and variability allows efficient training on hybrid datasets combining data from different stereo rigs