SLIDE 1

Single-View Depth Image Estimation

Fangchang Ma, PhD Candidate at MIT (Sertac Karaman Group)

  • homepage: www.mit.edu/~fcma/
  • code: github.com/fangchangma
SLIDE 2

Depth sensing is key to robotics advancement

1979, Multi-view vision and the Stanford Cart

SLIDE 3

Depth sensing is key to robotics advancement

2007, Velodyne LiDAR and the DARPA Urban Challenge

SLIDE 4

Depth sensing is key to robotics advancement

2010, Kinect and aggressive drone maneuvers

SLIDE 5

Impact of depth sensing beyond robotics

Face ID by Apple

SLIDE 6

Existing depth sensors have limited effective spatial resolutions

  • Stereo Cameras
  • Structured-light sensors
  • Time-of-flight sensors (e.g., LiDARs)
SLIDE 7

Existing depth sensors have limited effective spatial resolutions

Stereo: triangulation is accurate only at texture-rich regions

SLIDE 8

Existing depth sensors have limited effective spatial resolutions

Structured-light sensors: short range, high power consumption

SLIDE 9

Existing depth sensors have limited effective spatial resolutions

LiDARs: extremely sparse measurements

SLIDE 10

Single-View Depth Image Estimation

Two sub-problems: depth completion and depth prediction

SLIDE 11

Application 1: Sensor Enhancement

[Examples: Kinect, Velodyne LiDAR]

SLIDE 12

Application 2: Sparse Map Densification

State-of-the-art, real-time SLAM algorithms are mostly (semi-)feature-based, resulting in a sparse map representation.

[Examples: LSD-SLAM, PTAM]


Depth completion can serve as a downstream, post-processing step for sparse SLAM algorithms, creating a dense map representation.

SLIDE 13

Single-View Depth Image Estimation

  • Why is the problem challenging?
  • How to solve the problem?
  • How to train a model without ground truth?
  • How fast can we run on embedded systems?
  • How to obtain performance guarantees with DL?
  • What to do if you “hate” deep learning?

SLIDE 15

Challenges in Depth Completion

  • An ill-posed inverse problem
  • High-dimensional, continuous prediction
SLIDE 16

Challenges in Depth Completion

  • Biased / adversarial sampling
  • Varying number of measurements
SLIDE 17

Challenges in Depth Completion

  • Cross-modality fusion (RGB + Depth)
SLIDE 18

Challenges in Depth Completion

  • Lack of ground truth data (category vs. distance)
SLIDE 19

Single-View Depth Image Estimation

  • Why is the problem challenging?
  • How to solve the problem?
  • How to train a model without ground truth?
  • How fast can we run on embedded systems?
  • How to obtain performance guarantees with DL?
  • What to do if you “hate” deep learning?

SLIDE 20

Sparse-to-Dense: Deep Regression Neural Networks

  • Direct encoding: use 0s to represent missing measurements
  • Early-fusion strategy: concatenate RGB and sparse depth at the input level
  • Network architecture: standard convolutional neural network
  • Train end-to-end using ground-truth depth (see the sketch below)
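To make the encoding concrete, here is a minimal PyTorch sketch of the early-fusion input and one supervised training step. The tiny network, tensor sizes, and sampling rate are illustrative stand-ins, not the paper's architecture.

```python
# Minimal sketch of early fusion and supervised training (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySparseToDense(nn.Module):
    """Stand-in for the encoder-decoder regression network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),   # 4 = RGB + sparse depth
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),              # dense depth output
        )

    def forward(self, rgb, sparse_depth):
        # Early fusion: concatenate RGB and sparse depth at the input level.
        # Zeros in sparse_depth directly encode "no measurement".
        x = torch.cat([rgb, sparse_depth], dim=1)        # (B, 4, H, W)
        return self.net(x)

model = TinySparseToDense()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

rgb = torch.rand(2, 3, 64, 64)                   # dummy batch
gt_depth = torch.rand(2, 1, 64, 64)              # ground-truth depth
mask = (torch.rand(2, 1, 64, 64) < 0.05).float()
sparse_depth = gt_depth * mask                   # keep ~5% of pixels, zero the rest

loss = F.mse_loss(model(rgb, sparse_depth), gt_depth)   # end-to-end supervision
loss.backward()
opt.step()
```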

SLIDE 21

Results on NYU Dataset

21

  • RGB only: RMS=51cm
  • RGB + 20 measurements: RMS=35cm
  • RGB + 50 measurements: RMS=28cm
  • RGB + 200 measurements: RMS=23cm
SLIDE 22

Scaling of Accuracy vs. Samples

[Plot: REL (mean relative error) vs. number of depth samples, 10^0 to 10^4, comparing RGBd, sparse-depth-only, and RGB-only inputs]

SLIDE 23

Application to Sparse Point Clouds

SLIDE 25

Sparse-to-Dense: Depth Prediction from Sparse Depth Samples and a Single Image

Fangchang Ma, Sertac Karaman. ICRA 2018.
code: github.com/fangchangma/sparse-to-dense

SLIDE 26

Single-View Depth Image Estimation

  • Why is the problem challenging?
  • How to solve the problem?
  • How to train a model without ground truth?
  • How fast can we run on embedded systems?
  • How to obtain performance guarantees with DL?
  • What to do if you “hate” deep learning?

SLIDE 27

Experiment 1: Supervised Training (Baseline). RMSE = 0.814 m (ranked 1st on KITTI).

[Figure: input point cloud, predicted point cloud, and predicted depth image]

SLIDES 28-35

Self-supervision: enforce temporal photometric consistency

  • Take two nearby frames, real RGB1 and real RGB2.
  • Estimate the relative camera pose from LiDAR and RGB.
  • Inverse-warp RGB2 into frame 1, using both depth and pose, to produce warped RGB2.
  • Penalize photometric differences between real RGB1 and warped RGB2.

SLIDE 36

Supervised training requires ground truth depth labels, which are hard to acquire in practice

Self-supervision: temporal photometric consistency

[Figure: RGB1, warped RGB1, and the resulting photometric error map]
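A hedged PyTorch sketch of this loss, assuming camera intrinsics K and a relative pose (R, t) already estimated from LiDAR and RGB; occlusion and out-of-view handling are omitted, and none of the names here come from the paper's actual implementation.

```python
# Sketch of inverse warping plus an L1 photometric penalty (illustrative only).
import torch
import torch.nn.functional as F

def inverse_warp(rgb2, depth1, K, R, t):
    """Sample frame-2 colors at the pixels where frame-1 points project.

    rgb2: (B,3,H,W), depth1: (B,1,H,W), K: (3,3), R: (B,3,3), t: (B,3,1)
    """
    B, _, H, W = depth1.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)]).reshape(1, 3, -1)  # (1,3,HW)
    pts1 = (torch.linalg.inv(K) @ pix) * depth1.reshape(B, 1, -1)    # 3-D in cam 1
    pts2 = R @ pts1 + t                                              # into cam 2
    uvw = K @ pts2                                                   # project
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)                     # pixel coords
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,                  # to [-1, 1]
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    return F.grid_sample(rgb2, grid.reshape(B, H, W, 2), align_corners=True)

def photometric_loss(rgb1, rgb2, depth1, K, R, t):
    warped_rgb2 = inverse_warp(rgb2, depth1, K, R, t)
    return (rgb1 - warped_rgb2).abs().mean()   # penalize photometric differences
```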

SLIDE 37

Experiment 2: Self-Supervised Training. RMSE = 1.30 m.

[Figure: input point cloud, predicted point cloud, and predicted depth image]

SLIDE 38

Self-supervised Sparse-to-Dense: Self-supervised Depth Completion from LiDAR and Monocular Camera

Fangchang Ma, Guilherme Venturelli Cavalheiro, Sertac Karaman. ICRA 2019.
code: github.com/fangchangma/self-supervised-depth-completion

SLIDE 39

Single-View Depth Image Estimation

  • Why is the problem challenging?
  • How to solve the problem?
  • How to train a model without ground truth?
  • How fast can we run on embedded systems?
  • How to obtain performance guarantees with DL?
  • What to do if you “hate” deep learning?

SLIDE 40
FastDepth

  • An efficient, lightweight encoder-decoder network architecture with a low-latency design incorporating depthwise separable layers and additive skip connections (see the sketch below)
  • Network pruning applied to the whole encoder-decoder network
  • Platform-specific compilation targeting embedded systems
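A short PyTorch sketch of the two named ingredients, a depthwise separable convolution and an additive skip connection; channel sizes are invented for illustration, and this is not the published FastDepth network.

```python
# Depthwise separable convolution + additive skip connection (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv followed by a 1x1 pointwise conv.

    Cost is roughly in_ch*9 + in_ch*out_ch weights, versus in_ch*out_ch*9
    for a standard 3x3 conv: about an 8-9x reduction for wide layers.
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return F.relu(self.pointwise(self.depthwise(x)))

class DecoderStage(nn.Module):
    """Upsample, refine, then *add* the matching encoder feature map.

    Additive skips avoid the channel growth (and extra compute)
    of concatenative skips.
    """
    def __init__(self, ch):
        super().__init__()
        self.conv = DepthwiseSeparableConv(ch, ch)

    def forward(self, x, encoder_feat):
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        return self.conv(x) + encoder_feat
```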

SLIDE 41

FastDepth is the first demonstration of real-time depth estimation on embedded systems

SLIDE 43

Achieved fast runtime through network design, pruning, and hardware-specific compilation

SLIDE 44

FastDepth performs comparably to more complex models while running 65x faster

[Figure: RGB input, ground truth, baseline (ResNet-50 with UpProj, 2.7 fps on TX2 GPU), and this work (178 fps on TX2 GPU)]

SLIDE 45

FastDepth: Fast Monocular Depth Estimation on Embedded Systems

Diana Wofk*, Fangchang Ma*, Tien-Ju Yang, Sertac Karaman, Vivienne Sze. ICRA 2019.
project page: fastdepth.mit.edu
code: https://github.com/dwofk/fast-depth

SLIDE 46

Single-View Depth Image Estimation

  • Why is the problem challenging?
  • How to solve the problem?
  • How to train a model without ground truth?
  • How fast can we run on embedded systems?
  • How to obtain performance guarantees with DL?
  • What to do if you “hate” deep learning?

SLIDE 47

Assumption: image can be modeled by a convolutional generative neural network

SLIDE 48

Sub-sampling Process
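The diagram for this slide did not survive extraction. A standard formalization of the setup used on the following slides, with notation assumed here rather than taken from the slide (G the generative network, z its latent code, M a binary sub-sampling mask, ⊙ the elementwise product):

```latex
x = G(z), \qquad y = M \odot x
```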

SLIDE 49

Rephrasing the depth-completion / image-inpainting problems

Question: can you find x (or, equivalently, z), given only y?

SLIDE 50

Rephrasing the depth-completion / image-inpainting problems

If z is recovered, then we can reconstruct x as G(z) using a single forward pass.

SLIDE 51

The latent code z can be computed efficiently using gradient descent
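A minimal PyTorch sketch of that recovery, assuming a pretrained generator G and measurements y = M ⊙ x; the optimizer choice, initialization, step count, and names are illustrative, not the paper's exact procedure.

```python
# Recover the latent code by gradient descent on the measurement residual.
import torch

def recover_latent(G, M, y, latent_dim=128, steps=500, lr=0.05):
    z = torch.zeros(1, latent_dim, requires_grad=True)   # initial latent code
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((M * G(z) - y) ** 2).sum()   # empirical loss on observed pixels
        loss.backward()
        opt.step()
    return z.detach()

# Once z is recovered, reconstruct the full image in a single forward pass:
# x_hat = G(recover_latent(G, M, y))
```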

SLIDE 52

Main Theorem

For a 2-layer network, the latent code z can be recovered from the undersampled measurements y using gradient descent (with high probability) by minimizing the empirical loss function below.
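The loss itself was lost in extraction; in the notation assumed above (measurements y = M ⊙ G(z*) for some true latent code z*), the empirical loss plausibly takes the form:

```latex
\min_{z}\; f(z) \;=\; \tfrac{1}{2}\,\bigl\| M \odot G(z) - y \bigr\|_2^2
```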

SLIDE 53

Experimental Results

[Figure: undersampled measurements, reconstructed images, and ground truth]
SLIDE 54

Invertibility of Convolutional Generative Networks from Partial Measurements

Fangchang Ma*, Ulas Ayaz*, Sertac Karaman. NeurIPS 2018 (previously known as NIPS).
code: github.com/fangchangma/invert-generative-networks

SLIDE 55

Single-View Depth Image Estimation

  • Why is the problem challenging?
  • How to solve the problem?
  • How to train a model without ground truth?
  • How fast can we run on embedded systems?
  • How to obtain performance guarantees with DL?
  • What to do if you “hate” deep learning?

SLIDE 56

Depth Completion: Linear model with planar assumption

Input: only sparse depth
Output: dense depth

Fangchang Ma, Luca Carlone, Ulas Ayaz, Sertac Karaman. "Sparse Sensing for Resource-Constrained Depth Reconstruction." IROS 2016.
Fangchang Ma, Luca Carlone, Ulas Ayaz, Sertac Karaman. "Sparse Depth Sensing for Resource-Constrained Robots." The International Journal of Robotics Research (IJRR).

SLIDE 57

Depth Completion: Linear model with planar assumption

Planar Assumption: a relatively structured environment can be well approximated by a small number of planar surfaces.
Observation: the 2nd derivative of a planar surface is zero, so the 2nd derivative of a piecewise-planar depth image is sparse.
Implication: the 2nd derivative of a structured environment is approximately sparse (the sparsity of the 2nd derivative is a measure of scene complexity).

SLIDE 58

Depth Completion: Linear model with planar assumption

Planar Assumption: sparse 2nd derivative in a typical depth image.
Goal: find the simplest depth image (the one with the sparsest 2nd derivative) that is aligned with our measurements.
Convex Relaxation (Linear Programming): minimize the L1 norm of the 2nd derivative subject to consistency with the measurements (a sketch follows).
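A small cvxpy sketch of that relaxation on a 1-D signal; the paper's formulation is over 2-D depth images, and the signal, sample count, and solver defaults here are illustrative.

```python
# L1-minimization of the 2nd derivative, subject to agreeing with the samples.
import cvxpy as cp
import numpy as np

n = 100
rng = np.random.default_rng(0)

# Piecewise-linear "depth" profile: two planar segments meeting at a crease.
truth = np.concatenate([np.linspace(2.0, 4.0, 50), np.linspace(4.0, 3.0, 50)])
idx = rng.choice(n, size=10, replace=False)        # 10% sparse measurements
y = truth[idx]

x = cp.Variable(n)
objective = cp.Minimize(cp.norm1(cp.diff(cp.diff(x))))   # sparsest 2nd derivative
constraints = [x[idx] == y]                              # align with measurements
cp.Problem(objective, constraints).solve()

# With samples on both segments, recovery is typically near-exact:
print(np.abs(x.value - truth).max())
```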

SLIDE 59

Depth Completion: Linear model with planar assumption

SLIDE 60

Sparse Depth Sensing for Resource-Constrained Robots

Fangchang Ma*, Ulas Ayaz*, Sertac Karaman. The International Journal of Robotics Research (IJRR).
code: github.com/sparse-depth-sensing/sparse-depth-sensing

SLIDE 61

Single-View Depth Image Estimation

Fangchang Ma
homepage: www.mit.edu/~fcma/
code: github.com/fangchangma

  • Why is the problem challenging?
  • How to solve the problem?
  • How to train a model without ground truth?
  • How fast can we run on embedded systems?
  • How to obtain performance guarantees with DL?
  • What to do if you “hate” deep learning?