Unsupervised Monocular Depth Estimation CNN Robust to Training Data Diversity
Valery Anisimovskiy, Andrey Shcherbinin, Sergey Turko and Ilya Kurilin
15 May, 2020

Problem Statement: Depth Sensor Limitations
(Figure: RGB image vs. Kinect depth map; sensors shown: Kinect 2 IR depth sensor, YDLiDAR X1, ZED stereo camera)
IR depth sensor
+ Dense depth map
+ Instantaneous
- Noisy
- Short range

Lidar
+ Long range
+ Good accuracy
- Sparse
- Non-instantaneous
- Expensive

Stereo camera
+ Dense
+ Instantaneous
- Unreliable
- Long baseline
- Occlusions
Problem Statement: RGB Camera + CNN
Unsupervised Monocular CNN
+ Dense and instantaneous depth map
+ Cheap monocular camera
+ Adaptable to scenery by training on a relevant dataset
+ Easy training data collection
- High computational overhead
(Pipeline: photo by monocular camera → unsupervised monocular CNN → predicted depth)
Existing Approaches: Supervised Monocular CNN
CNN model trained on a dataset containing monocular input images and ground-truth depth maps (e.g. KITTI, Cityscapes)
+ Best depth estimation accuracy
- Requires hard-to-get precise ground-truth depth maps
- Costly training datasets
- Trained model is scene dependent
Existing Approaches: Unsupervised CNN Trained on Video Sequences
CNN model trained on video sequences using camera pose prediction along with depth estimation
+ Leverages readily available video sequence data for training
- Worse depth estimation accuracy
- Requires camera intrinsics for training
Existing Approaches: Unsupervised CNN Trained on Stereo Pairs
Unsupervised CNN model trained on stereo pairs with a loss based on opposite-image reconstruction and left-right consistency
+ Good depth estimation accuracy
- Lacks robustness to training on hybrid datasets containing data from different stereo rigs
Possible sources of easily available training data: existing stereo datasets (KITTI, Cityscapes, …)
Suggested Approach: Unsupervised CNN Trained on Stereo Pairs + Camera Parameter Estimation
Unsupervised CNN model trained on stereo pairs using camera parameter estimation along with depth estimation
+ State-of-the-art depth estimation accuracy (among unsupervised monocular methods)
+ Easy adaptivity to any scene category by routine training data collection
+ No expensive ground-truth depth
+ Robustness to training data diversity via camera parameter estimation
Possible sources of easily available training data: high-quality commercial stereo movies, stereo web-datasets, custom stereo video
Proposed Model
1. The input stereo pair images (I_L and I_R) are separately processed by the Siamese depth estimation CNN to produce inverse depth maps (ẑ_L and ẑ_R), high-level feature maps, and correction maps ΔI_L and ΔI_R.
2. The high-level feature maps of the left and right images are processed by the camera parameter estimation CNN to produce the stereo camera parameters, gain and bias (G and B), which transform ẑ_L and ẑ_R into disparity maps d_L and d_R: d_L = G(ẑ_L + B), d_R = −G(ẑ_R + B).
3. The disparity maps d_L and d_R, along with the corrected input images Î_L and Î_R, are used to reconstruct the counterpart images I′_R and I′_L as well as the counterpart disparity maps d′_L and d′_R.
4. The generated disparity maps d′_L and d′_R are further used to reconstruct the auxiliary images I″_L and I″_R.
Data flow
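The gain/bias disparity transform (d_L = G(ẑ_L + B), d_R = −G(ẑ_R + B)) can be sketched in a few lines. Squashing the two raw network outputs with sigmoid/tanh mirrors the sigmoid/tanh final layer of the camera parameters CNN, but the exact parameterization is an assumption here, not the authors' code.

```python
import numpy as np

def disparity_from_inverse_depth(z_left, z_right, raw_gain, raw_bias):
    """Turn predicted inverse depth maps into signed disparity maps.

    Sketch of d_L = G(z_L + B), d_R = -G(z_R + B). `raw_gain` and
    `raw_bias` stand in for the two outputs of the camera parameter
    estimation CNN; the sigmoid/tanh squashing below is an assumption.
    """
    gain = 1.0 / (1.0 + np.exp(-raw_gain))   # sigmoid -> positive gain G
    bias = np.tanh(raw_bias)                 # tanh -> bounded bias B
    d_left = gain * (z_left + bias)
    d_right = -gain * (z_right + bias)       # opposite sign for the right view
    return d_left, d_right
```

The sign flip on d_R reflects that left and right views shift in opposite directions along the baseline.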
Proposed Model
Loss terms (formulas shown on slide): primary reconstruction, auxiliary reconstruction, correction regularization, left-right consistency, and the total loss combining them.
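A hedged sketch of how the four named loss terms could combine into a total loss. The slide text does not carry the concrete formulas, so the L1 photometric/consistency terms and the weights w_* below are illustrative assumptions, not the authors' exact losses.

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference, used as a stand-in photometric term."""
    return float(np.mean(np.abs(a - b)))

def total_loss(I_L, I_R, I_rec_R, I_rec_L, I_aux_L, I_aux_R,
               corr_L, corr_R, d_L, d_R_warped,
               w_prim=1.0, w_aux=1.0, w_corr=0.1, w_lr=1.0):
    """Weighted sum of the four loss terms named on the slide (sketch)."""
    primary = l1(I_R, I_rec_R) + l1(I_L, I_rec_L)    # counterpart image reconstruction
    auxiliary = l1(I_L, I_aux_L) + l1(I_R, I_aux_R)  # auxiliary image reconstruction
    correction = float(np.mean(np.abs(corr_L))
                       + np.mean(np.abs(corr_R)))    # keep correction maps small
    lr = l1(d_L, d_R_warped)                         # left-right disparity consistency
    return (w_prim * primary + w_aux * auxiliary
            + w_corr * correction + w_lr * lr)
```

The correction regularization penalizes the magnitude of ΔI_L and ΔI_R so the correction maps cannot trivially absorb photometric differences between the views.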
Depth Estimation CNN Architecture
Encoder-decoder with skip connections, input W x H x 3:
- Encoder: seven scales with two convolutions per scale (conv9x9 at 32 channels, conv7x7 at 64, conv5x5 at 128, then conv3x3 at 256, 512, 512 and 512 channels), with 2x2 pooling between scales, reducing resolution from W x H down to W/64 x H/64; all convolutions use ELU activations.
- Decoder: successive upsample2x2 + conv3x3 (ELU) stages with skip connections from the matching encoder scales, restoring resolution from W/64 x H/64 back to W x H.
- Outputs: an inverse depth pyramid (1-channel conv3x3 heads at W/8 x H/8, W/4 x H/4, W/2 x H/2 and W x H) and correction maps (1-channel conv3x3 with tanh activation at W/2 x H/2 and W x H).
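The encoder halves resolution six times while the channel count grows from 32 to 512. A small sketch of that shape bookkeeping, using the per-scale channel widths listed on the slide:

```python
def encoder_shapes(width, height):
    """Return the (W, H, C) tensor shape at each encoder scale.

    Channel widths follow the slide's encoder (32 ... 512); the 2x2
    pooling before every scale but the first halves the resolution.
    """
    channels = [32, 64, 128, 256, 512, 512, 512]  # per-scale widths from the slide
    shapes = []
    for level, c in enumerate(channels):
        factor = 2 ** level                # cumulative downsampling at this scale
        shapes.append((width // factor, height // factor, c))
    return shapes
```

For a 512 x 256 input the deepest feature map is 8 x 4 x 512, which is what the decoder then upsamples back to full resolution through the skip connections.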
Camera Parameters Estimation CNN Architecture
The left and right high-level feature maps produced by the Siamese depth estimation CNN are concatenated and passed through a stack of conv3x3 + ELU layers, average pooling, a fully-connected ELU layer, and a final 2-unit fully-connected layer with sigmoid/tanh activations, yielding the stereo camera gain g and bias b. These transform the left and right inverse depth maps ẑ_L and ẑ_R into disparity maps:
d_L = g(ẑ_L + b)
d_R = −g(ẑ_R + b)
Training Datasets
- Stereo Movies dataset (SM)
- Hybrid city driving dataset: Cityscapes + KITTI (CS+K)
Quantitative Results on KITTI Dataset (Trained on CS+K)
(Table: comparison with unsupervised and (semi-)supervised methods)
Quantitative Results on DIW Dataset (Trained on SM)
WHDR = Weighted Human Disagreement Rate
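WHDR scores a predicted depth map against human ordinal annotations: pairs of points labelled by which one is closer. A generic sketch of the metric, not the authors' evaluation code; uniform annotation weights are assumed by default.

```python
import numpy as np

def whdr(pred_depth, pairs, labels, weights=None):
    """Weighted Human Disagreement Rate (generic sketch).

    pairs: list of ((y1, x1), (y2, x2)) pixel coordinate pairs.
    labels: +1 if humans judged point 1 closer, -1 if point 2 closer.
    weights: optional per-pair annotation weights (uniform if omitted).
    """
    if weights is None:
        weights = np.ones(len(pairs))
    disagree = 0.0
    for ((y1, x1), (y2, x2)), label, w in zip(pairs, labels, weights):
        # Smaller depth means closer to the camera.
        pred_label = 1 if pred_depth[y1, x1] < pred_depth[y2, x2] else -1
        if pred_label != label:
            disagree += w
    return disagree / np.sum(weights)
```

Lower is better; a rate of 0.5 corresponds to chance-level ordinal predictions under uniform weights.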
Test of Robustness to Training Dataset Diversity
CSK: training on Cityscapes, then fine-tuning on KITTI
CS+K: training on a mixture of Cityscapes and KITTI
Camera Parameters Predicted for KITTI/Cityscapes
Qualitative Results on KITTI/Cityscapes Datasets
(Figure columns: input image, correction map, depth map by our method, depth map by Godard et al. [14]; rows: Cityscapes, KITTI)
Qualitative Results for Uncontrolled Street Views (CS+K)
(Figure columns: input image, depth map by our method, depth map by Godard et al. [14])
*Our model trained on the CS+K dataset
Qualitative Results for Uncontrolled People Images (SM)
(Figure: input images and depth maps by our method)
Conclusion
- State-of-the-art accuracy among unsupervised monocular depth estimation methods
- Robustness to dataset diversity and variability allows efficient training on hybrid datasets combining data from different stereo rigs