Tsinghua University – Monocular Depth-Pose Prediction



SLIDE 1

Wang Zhao, Shaohui Liu, Yezhi Shu, Yong-Jin Liu Tsinghua University

SLIDE 2

Monocular Depth-Pose Prediction

[Diagram: RGB frames → network → depth map and relative pose [R, t]]

SLIDE 3

PoseNet Fails to Generalize!

  • Depth estimation in indoor environments with complex camera motions and low texture
  • Visual odometry with unseen camera ego-motions

All Drift!

SLIDE 4

Joint Learning without PoseNet

Built on top of two-frame structure-from-motion.

[Pipeline diagram: FlowNet → sampled correspondences → normalized 8-point solver → relative pose [R, t] and inlier mask; sample & triangulation → sparse triangulated depth; DepthNet → scale alignment → loss]

SLIDE 5

Joint Learning without PoseNet

[Diagram: FlowNet → sampled correspondences → normalized 8-point solver → relative pose [R, t] and inlier mask]

  • Correspondences are sampled based on the occlusion mask and the forward-backward consistency score produced by the optical flow network.
  • The 8-point algorithm is run inside a RANSAC loop to robustly recover the relative pose.
  • The epipolar distance (inlier mask) is computed and used to further filter out incorrect matches and non-rigid objects.
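The three steps above can be sketched in NumPy as follows. This is an illustrative sketch, not the paper's implementation: the function names, the 1-pixel inlier threshold, and the iteration count are our own assumptions. It fits a fundamental matrix with the normalized 8-point algorithm inside a RANSAC loop and uses the symmetric epipolar distance as the inlier criterion.

```python
import numpy as np

def normalize_points(pts):
    """Hartley normalization: move centroid to origin, mean distance sqrt(2)."""
    centroid = pts.mean(axis=0)
    scale = np.sqrt(2.0) / np.linalg.norm(pts - centroid, axis=1).mean()
    T = np.array([[scale, 0.0, -scale * centroid[0]],
                  [0.0, scale, -scale * centroid[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    return (T @ pts_h.T).T, T

def eight_point(x1, x2):
    """Normalized 8-point estimate of the fundamental matrix (N >= 8 matches)."""
    n1, T1 = normalize_points(x1)
    n2, T2 = normalize_points(x2)
    # One row of the homogeneous system A f = 0 per correspondence.
    A = np.column_stack([n2[:, 0] * n1[:, 0], n2[:, 0] * n1[:, 1], n2[:, 0],
                         n2[:, 1] * n1[:, 0], n2[:, 1] * n1[:, 1], n2[:, 1],
                         n1[:, 0], n1[:, 1], np.ones(len(x1))])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt   # enforce rank 2
    return T2.T @ F @ T1                      # undo the normalization

def epipolar_distance(F, x1, x2):
    """Symmetric distance of each match to its epipolar lines, in pixels."""
    h1 = np.hstack([x1, np.ones((len(x1), 1))])
    h2 = np.hstack([x2, np.ones((len(x2), 1))])
    l2 = h1 @ F.T            # epipolar lines in image 2
    l1 = h2 @ F              # epipolar lines in image 1
    num = np.abs(np.sum(h2 * l2, axis=1))
    return 0.5 * (num / np.linalg.norm(l2[:, :2], axis=1)
                  + num / np.linalg.norm(l1[:, :2], axis=1))

def ransac_eight_point(x1, x2, thresh=1.0, iters=200, seed=0):
    """8-point solver inside a RANSAC loop; returns F and the inlier mask."""
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(x1), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(x1), 8, replace=False)
        mask = epipolar_distance(eight_point(x1[idx], x2[idx]), x1, x2) < thresh
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return eight_point(x1[best_mask], x2[best_mask]), best_mask
```

The relative pose [R, t] would then be recovered by decomposing the corresponding essential matrix; that step is omitted here.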

SLIDE 6

Joint Learning without PoseNet


  • We sample 6k matches from the flow to triangulate, according to the occlusion mask, forward-backward score, and the inlier mask.
  • We use mid-point triangulation for its convenience; it is naturally differentiable.
  • A match is abandoned if the angle between the two rays is too small.

[Diagram: flow correspondences + relative pose [R, t] → sample & triangulation → sparse triangulated depth]
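The triangulation step can be sketched as below, under our own conventions (camera 1 at the origin, frame-2 coordinates X2 = R @ X1 + t; the 1-degree minimum ray angle is an assumed value). The mid-point of the shortest segment between the two viewing rays has a closed form, which is why it is naturally differentiable, and matches with near-parallel rays are rejected because their depth is ill-conditioned:

```python
import numpy as np

def midpoint_triangulate(K, R, t, x1, x2, min_angle_deg=1.0):
    """Mid-point triangulation of pixel matches x1 (frame 1) and x2 (frame 2).

    Returns the 3D points in frame-1 coordinates and a validity mask that
    discards matches whose viewing rays are nearly parallel.
    """
    Kinv = np.linalg.inv(K)
    h1 = np.hstack([x1, np.ones((len(x1), 1))])
    h2 = np.hstack([x2, np.ones((len(x2), 1))])
    d1 = (Kinv @ h1.T).T                        # camera-1 rays, frame-1 coords
    d1 /= np.linalg.norm(d1, axis=1, keepdims=True)
    d2 = (R.T @ Kinv @ h2.T).T                  # camera-2 rays, frame-1 coords
    d2 /= np.linalg.norm(d2, axis=1, keepdims=True)
    C1 = np.zeros(3)
    C2 = -R.T @ t                               # camera-2 center in frame 1
    c = np.sum(d1 * d2, axis=1)                 # cosine of the ray angle
    valid = c < np.cos(np.deg2rad(min_angle_deg))
    b = C2 - C1
    denom = np.where(valid, 1.0 - c ** 2, 1.0)  # guard the parallel case
    s = (d1 @ b - c * (d2 @ b)) / denom         # distance along ray 1
    u = (c * (d1 @ b) - d2 @ b) / denom         # distance along ray 2
    mid = 0.5 * (C1 + s[:, None] * d1 + C2 + u[:, None] * d2)
    return mid, valid
```

The closed form comes from the normal equations of minimizing the distance between a point on each ray; no iterative solver is needed.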

SLIDE 7

Joint Learning without PoseNet

  • The predicted depth is aligned with the triangulated depth map to have a consistent scale.
  • The triangulation loss, depth re-projection loss, and depth smoothness loss are used to supervise the DepthNet.

[Diagram: DepthNet prediction → scale alignment → compared with the sparse triangulated depth → loss]
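The alignment-and-loss step could look like the following sketch. Median-ratio alignment and an edge-unaware smoothness term are assumptions for illustration; the paper's exact alignment scheme, loss forms, and weights may differ, and the re-projection loss is omitted:

```python
import numpy as np

def align_scale(pred_depth, tri_depth, tri_mask):
    """Rescale the predicted depth map to match the sparse triangulated depth.

    A single global factor is estimated over the triangulated pixels (ratio
    of medians here), so prediction and triangulation share one scale before
    any loss is computed.
    """
    scale = np.median(tri_depth[tri_mask]) / np.median(pred_depth[tri_mask])
    return scale * pred_depth

def triangulation_loss(pred_depth, tri_depth, tri_mask):
    """Mean L1 error between the aligned prediction and the sparse depth."""
    aligned = align_scale(pred_depth, tri_depth, tri_mask)
    return np.abs(aligned - tri_depth)[tri_mask].mean()

def smoothness_loss(pred_depth):
    """First-order depth smoothness (illustrative, edge-unaware)."""
    dx = np.abs(np.diff(pred_depth, axis=1)).mean()
    dy = np.abs(np.diff(pred_depth, axis=0)).mean()
    return dx + dy
```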

SLIDE 8

Scale Disentanglement

  • 1. The translation t of the estimated pose [R, t] from monocular video is up-to-scale!
  • 2. The monocular depth prediction D from the network has a learnt scale.
  • 3. Joint training losses require a consistent scale across the learnt depth and pose.
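A tiny numeric illustration (ours, not from the paper) of why explicit alignment disentangles these scales: the triangulated depth inherits the arbitrary translation scale, but after alignment a scale-normalized loss is unchanged, so the DepthNet's learnt scale never has to agree with the pose scale. The median-ratio alignment and the normalization are assumed choices:

```python
import numpy as np

def aligned_loss(pred, tri):
    """Median-ratio scale alignment followed by a scale-normalized L1 loss."""
    s = np.median(tri) / np.median(pred)
    return np.abs(s * pred - tri).mean() / np.median(tri)

rng = np.random.default_rng(0)
tri_unit = rng.uniform(1.0, 5.0, 500)               # triangulated depth at unit baseline
pred = 2.0 * tri_unit + rng.normal(0.0, 0.01, 500)  # DepthNet output, own learnt scale

# Rescaling the up-to-scale translation rescales the triangulated depth by k,
# yet the aligned, normalized loss is identical for every k.
losses = [aligned_loss(pred, k * tri_unit) for k in (0.1, 1.0, 10.0)]
```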

SLIDE 9

Scale Disentanglement

PoseNet-based learning system:
[Diagram: RGB input → DepthNet → depth D; RGB input → PoseNet → pose [R, t]; both feed the loss]
PoseNet needs to learn a translation scale consistent with DepthNet.

Our system:
[Diagram: RGB input → DepthNet → depth D; FlowNet + solver → pose [R, t]; scale alignment yields D′ before the loss]
No need for the network to learn a translation scale consistent with DepthNet.

SLIDE 10

Quantitative Results on KITTI dataset

Our method achieves state-of-the-art performance on KITTI depth and optical flow estimation.

SLIDE 11

Robustness Improved – KITTI

[Trajectory plots: visual odometry with unseen camera ego-motions, PoseNet-based vs. our system]

SLIDE 12

Robustness Improved – TUM

[Trajectory plots: visual odometry in indoor environments, PoseNet-based vs. our system]

SLIDE 13

Robustness Improved – NYUv2

[Qualitative results: depth estimation in indoor environments; input image, PoseNet-based, ours]

SLIDE 14

Robustness Improved – NYUv2

[Comparison: depth estimation in indoor environments, PoseNet-based vs. our system]

Best performance on NYUv2 among unsupervised methods!

SLIDE 15

Code and models are available here:
https://github.com/B1ueber2y/TrianFlow

Check our paper for more details!