Unsupervised Monocular Depth Estimation CNN Robust to Training Data Diversity



SLIDE 1

Unsupervised Monocular Depth Estimation CNN Robust to Training Data Diversity

Valery Anisimovskiy, Andrey Shcherbinin, Sergey Turko and Ilya Kurilin

15 May, 2020

SLIDE 2

Problem Statement: Depth Sensor Limitations

[Figure: example depth sensors and their outputs: Kinect 2, YDLiDAR X1; RGB image, Kinect depth map, ZED RGB stereo camera]

IR depth sensor
  Pros: dense depth map; instantaneous
  Cons: noisy; short range

Lidar
  Pros: long range; good accuracy
  Cons: sparse; non-instantaneous; expensive

Stereo camera
  Pros: dense; instantaneous
  Cons: unreliable; long baseline; occlusions
SLIDE 3

Problem Statement: RGB Camera + CNN

Unsupervised Monocular CNN

  Pros: dense and instantaneous depth map; cheap monocular camera; adaptable to scenery by training on a relevant dataset; easy training data collection
  Cons: high computational overhead

[Figure: photo by monocular camera -> unsupervised monocular CNN -> predicted depth]

SLIDE 4


Existing Approaches: Supervised Monocular CNN

CNN model trained on a dataset containing monocular input images and ground-truth depth maps (e.g., the KITTI and Cityscapes datasets).

  Pros: best depth estimation accuracy
  Cons: requires hard-to-get precise ground-truth depth maps; costly training datasets; trained model is scene-dependent
SLIDE 5


Existing Approaches: Unsupervised CNN Trained on Video Sequences

CNN model trained on video sequences, predicting camera pose along with depth.

  Pros: leverages readily available video sequence data for training
  Cons: worse depth estimation accuracy; requires camera intrinsics for training

SLIDE 6


Existing Approaches: Unsupervised CNN Trained on Stereo Pairs

Unsupervised CNN model trained on stereo pairs, with a loss based on opposite-image reconstruction and left-right consistency.

  Pros: good depth estimation accuracy; available stereo datasets (KITTI, Cityscapes,…)
  Cons: lacks robustness to training on hybrid datasets containing data from different stereo rigs
SLIDE 7

Suggested Approach: Unsupervised CNN Trained on Stereo Pairs + Camera Parameter Estimation

Unsupervised CNN model trained on stereo pairs, estimating camera parameters along with depth.

  Pros: state-of-the-art depth estimation accuracy (among unsupervised monocular methods); easy adaptivity to any scene category via routine training data collection; no expensive ground-truth depth required; robustness to training data diversity via camera parameter estimation

Possible sources of easily available training data: high-quality commercial stereo movies, stereo web datasets, custom stereo video.

SLIDE 8


Proposed Model

Data flow:

1. The input stereo pair images (IL and IR) are separately processed by the Siamese depth estimation CNN to produce inverse depth maps (ẑL and ẑR), high-level feature maps, and correction maps ΔIL and ΔIR.
2. The high-level feature maps of both images are processed by the camera parameter estimation CNN to produce the stereo camera parameters, gain G and bias B, which transform ẑL and ẑR into disparity maps dL and dR:
   dL = G(ẑL + B)
   dR = -G(ẑR + B)
3. The disparity maps dL and dR, together with the corrected input images ÎL and ÎR, are used to reconstruct the counterpart images I'R and I'L as well as the counterpart disparity maps d'L and d'R.
4. The reconstructed disparity maps d'L and d'R are further used to reconstruct the auxiliary images I''L and I''R.
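To make the data flow concrete, below is a minimal PyTorch sketch of steps 2 and 3: building disparity maps from inverse depth with the predicted gain and bias, and reconstructing a counterpart view by horizontal warping. The function names and the bilinear-warp details are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def disparities_from_inverse_depth(z_hat_l, z_hat_r, gain, bias):
    # dL = G(ẑL + B), dR = -G(ẑR + B), as defined above.
    d_l = gain * (z_hat_l + bias)
    d_r = -gain * (z_hat_r + bias)
    return d_l, d_r

def warp_horizontal(img, disp):
    # Reconstruct a counterpart view by horizontally resampling `img`
    # (B x C x H x W) with per-pixel disparity `disp` (B x 1 x H x W, in pixels).
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h, device=img.device),
        torch.linspace(-1.0, 1.0, w, device=img.device),
        indexing="ij",
    )
    # Shift x-coordinates by disparity, converted to normalized [-1, 1] units.
    x_shifted = xs.unsqueeze(0) + 2.0 * disp.squeeze(1) / w
    grid = torch.stack((x_shifted, ys.unsqueeze(0).expand(b, -1, -1)), dim=-1)
    return F.grid_sample(img, grid, padding_mode="border", align_corners=True)

# For example, I'L could be obtained by sampling the corrected right image with dL:
# i_l_rec = warp_horizontal(i_r_corrected, d_l)
```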

SLIDE 9


Proposed Model: Training Losses

Loss terms (the equations shown on the slide are not preserved in this transcript):
  • Primary reconstruction
  • Auxiliary reconstruction
  • Correction regularization
  • Left-right consistency
  • Total loss
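As illustration only, here is a hedged sketch of how the five named terms might combine; the L1 photometric terms and the weights (w_aux, w_corr, w_lr) are assumptions, not the authors' formulas.

```python
import torch.nn.functional as F

def total_loss(i_l, i_r, i_l_rec, i_r_rec, i_l_aux, i_r_aux,
               delta_l, delta_r, d_l, d_r, d_l_rec, d_r_rec,
               w_aux=0.5, w_corr=0.01, w_lr=1.0):
    # Primary reconstruction: counterpart images I' vs. the inputs.
    primary = F.l1_loss(i_l_rec, i_l) + F.l1_loss(i_r_rec, i_r)
    # Auxiliary reconstruction: images I'' rebuilt from reconstructed disparities.
    aux = F.l1_loss(i_l_aux, i_l) + F.l1_loss(i_r_aux, i_r)
    # Correction regularization: keep correction maps ΔI small.
    corr = delta_l.abs().mean() + delta_r.abs().mean()
    # Left-right consistency: predicted vs. reconstructed disparity maps.
    lr = F.l1_loss(d_l_rec, d_l) + F.l1_loss(d_r_rec, d_r)
    # Total loss: weighted sum of the four terms.
    return primary + w_aux * aux + w_corr * corr + w_lr * lr
```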

SLIDE 10


Depth Estimation CNN Architecture

Encoder (each convolution followed by ELU):
  Input: W x H x 3
  conv9x9 -> W x H x 32; conv9x9 -> W x H x 32
  pool2x2 + conv7x7 -> W/2 x H/2 x 64; conv7x7 -> W/2 x H/2 x 64
  pool2x2 + conv5x5 -> W/4 x H/4 x 128; conv5x5 -> W/4 x H/4 x 128
  pool2x2 + conv3x3 -> W/8 x H/8 x 256; conv3x3 -> W/8 x H/8 x 256
  pool2x2 + conv3x3 -> W/16 x H/16 x 512; conv3x3 -> W/16 x H/16 x 512
  pool2x2 + conv3x3 -> W/32 x H/32 x 512; conv3x3 -> W/32 x H/32 x 512
  pool2x2 + conv3x3 -> W/64 x H/64 x 512; conv3x3 -> W/64 x H/64 x 512

Decoder (per stage: upsample2x2 + conv3x3 + ELU, concatenation with the skip connection from the encoder stage of matching resolution, then conv3x3 + ELU):
  W/64 x H/64 x 512 -> W/32 x H/32 x 512 -> W/16 x H/16 x 256 -> W/8 x H/8 x 128 -> W/4 x H/4 x 64 -> W/2 x H/2 x 32 -> W x H x 16

Output heads (conv3x3 + tanh, 1 channel each):
  Inverse depth maps at W/8, W/4, W/2, and full resolution (the inverse depth pyramid)
  Correction maps at W/2 and full resolution
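For concreteness, a minimal PyTorch sketch of one decoder stage following the shape annotations above; the module name, nearest-neighbor upsampling, and exact skip wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    # One upsample2x2 + conv3x3 + skip-merge stage, with an optional
    # conv3x3 + tanh head producing a 1-channel map (inverse depth or correction).
    def __init__(self, in_ch, skip_ch, out_ch, with_head=False):
        super().__init__()
        self.up_conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.merge_conv = nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1)
        self.head = nn.Conv2d(out_ch, 1, 3, padding=1) if with_head else None

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="nearest")  # upsample2x2
        x = F.elu(self.up_conv(x))                            # conv3x3, ELU
        x = torch.cat([x, skip], dim=1)                       # skip connection
        x = F.elu(self.merge_conv(x))                         # conv3x3, ELU
        out = torch.tanh(self.head(x)) if self.head is not None else None
        return x, out
```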

SLIDE 11


Camera Parameters Estimation CNN Architecture

The high-level feature maps of the left and right images (taken from the Siamese depth estimation CNN) are concatenated and passed through conv3x3 + ELU layers, average pooling, a fully-connected ELU layer, and a final 2-unit fully-connected layer with sigmoid/tanh activations, yielding the camera parameters g (gain) and b (bias). These convert each inverse depth map into a disparity map:

  dL = g(ẑL + b)
  dR = -g(ẑR + b)
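A minimal PyTorch sketch of this estimator; channel widths follow the slide's annotations where legible, and applying sigmoid to the gain and tanh to the bias is an assumption based on the "sigmoid/tanh" label.

```python
import torch
import torch.nn as nn

class CameraParamsNet(nn.Module):
    # Concatenated left/right high-level feature maps -> convs -> global
    # average pooling -> fully-connected layers -> camera parameters (g, b).
    def __init__(self, feat_ch=512):
        super().__init__()
        self.conv1 = nn.Conv2d(2 * feat_ch, 256, 3, padding=1)
        self.conv2 = nn.Conv2d(256, 256, 3, padding=1)
        self.fc1 = nn.Linear(256, 256)
        self.fc2 = nn.Linear(256, 2)  # two outputs: gain g and bias b
        self.elu = nn.ELU()

    def forward(self, feat_l, feat_r):
        x = torch.cat([feat_l, feat_r], dim=1)
        x = self.elu(self.conv1(x))
        x = self.elu(self.conv2(x))
        x = x.mean(dim=(2, 3))        # global average pooling
        x = self.elu(self.fc1(x))
        g_raw, b_raw = self.fc2(x).unbind(dim=1)
        g = torch.sigmoid(g_raw)      # gain constrained to (0, 1)
        b = torch.tanh(b_raw)         # bias constrained to (-1, 1)
        return g, b
```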

SLIDE 12


Training Datasets

  • Stereo Movies dataset (SM)
  • Hybrid city driving dataset: Cityscapes + KITTI (CS+K)

SLIDE 13


Quantitative Results on KITTI Dataset (Trained on CS+K)

[Table: accuracy comparison against unsupervised and (semi-)supervised methods; not preserved in transcript]

SLIDE 14


Quantitative Results on DIW Dataset (Trained on SM)

WHDR = Weighted Human Disagreement Rate
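For reference, a minimal sketch of how WHDR can be computed on DIW-style ordinal point pairs; the pair layout, label convention, and uniform default weighting are assumptions for illustration.

```python
def whdr(pred_depth, pairs, weights=None):
    # pairs: iterable of ((y1, x1), (y2, x2), label), label in {-1, 0, 1}
    # meaning point 1 is closer / equal / farther relative to point 2.
    disagree, total = 0.0, 0.0
    for i, ((y1, x1), (y2, x2), label) in enumerate(pairs):
        w = 1.0 if weights is None else weights[i]
        d1, d2 = pred_depth[y1][x1], pred_depth[y2][x2]
        pred = -1 if d1 < d2 else (1 if d1 > d2 else 0)
        disagree += w * (pred != label)
        total += w
    return disagree / total if total else 0.0
```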

SLIDE 15


Test of Robustness to Training Dataset Diversity

CSK: Training on Cityscapes and fine-tuning on KITTI CS+K : Training on mixture of Cityscapes and KITTI

SLIDE 16


Camera Parameters Predicted for KITTI/Cityscapes

SLIDE 17


Qualitative Results on KITTI/Cityscapes Datasets

[Figure: Cityscapes and KITTI examples; columns: input image, correction map, depth map by our method, depth map by Godard et al. [14]]

SLIDE 18


Qualitative Results for Uncontrolled Street Views (CS+K)

[Figure columns: input image, depth map by our method, depth map by Godard et al. [14]]

*Our model trained on CS+K dataset

SLIDE 19


Qualitative Results for Uncontrolled People Images (SM)

[Figure: input images and depth maps by our method]

SLIDE 20


Conclusion

  • State-of-the-art accuracy among unsupervised monocular depth estimation methods
  • Robustness to dataset diversity and variability allows efficient training on hybrid datasets combining data from different stereo rigs
  • The smallest CNN model size among high-accuracy methods
  • Relaxed requirements for training data allow easy and routine gathering of large, representative datasets

SLIDE 21

Thank you
