Generative Sparse Detection Networks for 3D Single-shot Object Detection

JunYoung Gwak, Christopher Choy, Silvio Savarese


Key Challenge of 3D Object Detection

Disjoint input and output space:

  • Input 3D scan: the surface of the object
  • Output anchor space: the center of the bounding box

Sparse convolution / PointNet: features are learned only on the surface of the object ⇒ the output space is unreachable!


Possible solutions? (previous works)

  • Ignore the problem and make predictions at the surface of the object

Nontrivial to decide which part of the surface is responsible for the prediction

  • Convert the sparse tensor to a dense tensor

Gives up the efficiency of sparsity

  • For every point, predict the relative center of the instance

Requires center aggregation (clustering), which is inefficient

Key observation: object centers are close to the object surface.
⇒ Can we generate object centers efficiently?


Method Overview


Hierarchical Sparse Tensor Encoder


  • Generates hierarchical sparse tensor features with a sparse 3D ResNet
  • Analogous to the ResNet encoders commonly used in 2D detectors
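The encoder operates on sparse tensors: only occupied surface voxels carry features. As a rough illustration (not the authors' implementation, which uses learned sparse 3D convolutions via a sparse-convolution library such as the Minkowski Engine), a sparse tensor can be stored as a coordinate-to-feature map, with each pyramid level halving the resolution by quantizing coordinates:

```python
# Minimal sketch of a sparse tensor and the stride-2 coordinate
# downsampling a hierarchical sparse encoder performs at each level.
# Illustrative only: the real encoder uses learned sparse convolutions.

def downsample(sparse_tensor, stride=2):
    """Average-pool features of voxels that fall in the same coarse cell."""
    pooled, counts = {}, {}
    for (x, y, z), feat in sparse_tensor.items():
        key = (x // stride * stride, y // stride * stride, z // stride * stride)
        pooled[key] = pooled.get(key, 0.0) + feat
        counts[key] = counts.get(key, 0) + 1
    return {k: pooled[k] / counts[k] for k in pooled}

# Surface voxels only (sparse): coordinate -> feature
level0 = {(0, 0, 0): 1.0, (1, 0, 0): 3.0, (4, 2, 0): 2.0}
level1 = downsample(level0)  # two fine voxels merge into one coarse voxel
```

Note that the number of stored entries shrinks with resolution but never fills in empty space, which is what keeps the hierarchy sparse.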


Generative Sparse Tensor Decoder


Transposed Convolution + Sparsity Pruning


  • Sparse transposed convolution

Takes the outer product of the convolution kernel shape with the input coordinates

Generates coordinates surrounding the input coordinates (expands the support)
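The coordinate-generation step can be sketched as follows: each input coordinate, combined with every offset of a 3×3×3 kernel, produces one output coordinate, i.e. the outer product of the input coordinates with the kernel shape (convolution weights are omitted; this only shows how the support expands):

```python
from itertools import product

# Sketch of sparse transposed convolution's coordinate expansion:
# every input coordinate spawns kernel_size**3 surrounding output
# coordinates at the upsampled resolution. Weights are omitted.

def expand_coordinates(coords, kernel_size=3, stride=2):
    offsets = list(product(range(kernel_size), repeat=3))
    out = set()
    for (x, y, z) in coords:
        for (dx, dy, dz) in offsets:
            out.add((x * stride + dx, y * stride + dy, z * stride + dz))
    return out

generated = expand_coordinates({(0, 0, 0)})
# A single input coordinate generates 3**3 = 27 output coordinates.
```

Without the pruning step described next, repeated expansion like this would grow the coordinate set cubically.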


  • Sparsity pruning

For each generated point, predict whether to prune the coordinate

Prune coordinates that are not bounding box centers
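The pruning step can be sketched as thresholding a per-coordinate keep score (the scores below are made up, standing in for the network's predicted probabilities):

```python
# Sketch of sparsity pruning: a per-coordinate keep score decides
# whether a generated coordinate survives to the next layer, so the
# generated set never grows cubically. Scores here are placeholders
# for the network's predictions, not real model output.

def prune(coord_scores, threshold=0.5):
    return {c for c, score in coord_scores.items() if score >= threshold}

scores = {(0, 0, 0): 0.9, (1, 0, 0): 0.2, (0, 1, 0): 0.7}
kept = prune(scores)  # only coordinates likely to be box centers remain
```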


Bounding box prediction


  • For every point that is not pruned, predict:

Anchor classification

Bounding box regression

Semantic classification

  • Hierarchical multi-scale prediction on a pyramid network

Advantages of Our Method

Full 3D search space

  • Searches for object centers up to ±1.6m from any observable surface

Fully sparse: minimal runtime and memory footprint

  • Sparse convolution encoder
  • Transposed convolution and pruning to generate only anchor centers

Fully convolutional

  • Simple architecture
  • No clustering, no crop-and-merge, just convolutions


Losses

  • Sparsity prediction: balanced cross-entropy
  • Anchor prediction: balanced cross-entropy
  • Semantic prediction: cross-entropy
  • Bounding box regression: Huber loss

Balanced cross-entropy: overcomes heavy label imbalance by equally penalizing positive and negative samples
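A minimal sketch of balanced cross-entropy, assuming "balanced" means averaging the loss over positives and negatives separately before combining them (the exact weighting in the paper may differ):

```python
import math

# Sketch of balanced binary cross-entropy: average over positives and
# negatives separately, then combine, so the vastly more numerous
# negative coordinates cannot dominate the gradient.

def balanced_bce(probs, labels, eps=1e-7):
    pos = [-math.log(p + eps) for p, y in zip(probs, labels) if y == 1]
    neg = [-math.log(1 - p + eps) for p, y in zip(probs, labels) if y == 0]
    pos_term = sum(pos) / len(pos) if pos else 0.0
    neg_term = sum(neg) / len(neg) if neg else 0.0
    return 0.5 * (pos_term + neg_term)

# One positive among many easy negatives: the positive sample still
# contributes half of the total loss.
loss = balanced_bce([0.9, 0.1, 0.1, 0.1, 0.1], [1, 0, 0, 0, 0])
```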



Comparison with previous SOTA - ScanNet

  • Outperforms the previous state of the art by 4.2 mAP@0.25, while being a single-shot detector
  • 3.7× faster, with runtime linear in the number of points and sublinear in floor area ⇒ free from the curse of dimensionality!
  • Minimal memory footprint: 6× more efficient than its dense counterpart
  • Maintains constant input density: consistent information for scalability


Comparison with previous SOTA - S3DIS

  • Achieves state-of-the-art results
  • Our method doesn’t require crop-and-stitch post-processing, unlike Yang et al.


Ablation study

Train without sparsity pruning
➔ Fails to train due to an out-of-memory error

Train without the Generative Sparse Tensor Decoder

Scalability and generalization - S3DIS

Train on small rooms, test on the entire Building 5 of S3DIS:

  • 78M points, 13,984 m³ of volume, and 53 rooms
  • A single fully-convolutional network feed-forward
  • Takes 20 seconds, including data pre-processing and post-processing
  • Uses 5 GB of GPU memory to detect 573 instances of 3D objects

How does our method achieve high scalability and generalization capacity?

Consistent information regardless of the size of the input:

  • Fully convolutional: translation invariant
  • Consistent input density: voxels, with no fixed-size random subsampling

Minimal runtime and memory footprint:

  • Fully sparse

Sparse encoder: sparse convolution

Sparse decoder: pruning to prevent cubic growth of generated coordinates
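The consistent-density point can be illustrated with fixed-size voxel quantization: the number of voxels grows with the scene, but the sampling density does not, unlike fixed-count random subsampling whose density drops as scenes grow (the 5 cm voxel size below is an arbitrary example):

```python
# Sketch of voxelization at a fixed voxel size: points within the same
# cell collapse to one voxel coordinate, so local input density stays
# constant no matter how large the scene is.

def voxelize(points, voxel_size=0.05):
    """Quantize (x, y, z) points in meters to unique voxel coordinates."""
    return {(int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
            for (x, y, z) in points}

# Two points inside the same 5 cm voxel collapse into one coordinate;
# the third point lands in a different voxel.
voxels = voxelize([(0.01, 0.02, 0.0), (0.04, 0.01, 0.0), (0.30, 0.0, 0.0)])
```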

Conclusion

We propose Generative Sparse Detection Networks, which

  • efficiently process large-scale 3D scenes using sparse convolution
  • generate and prune new coordinates to support anchor box centers

and which achieve

  • a 4.2 mAP@0.25 improvement over the previous state of the art
  • 3.7× faster runtime, growing sublinearly with scene volume
  • a minimal memory footprint, 6× more efficient than the dense counterpart
  • processing of unprecedentedly large scenes in a single network feed-forward

Thank you!

Collaborators: JunYoung Gwak (Stanford University), Chris Choy (NVIDIA), Silvio Savarese (Stanford University)