Point-Voxel CNN for Efficient 3D Deep Learning



SLIDE 1

Point-Voxel CNN for Efficient 3D Deep Learning

Zhijian Liu*, Haotian Tang*, Yujun Lin, and Song Han

Hardware, AI and Neural-nets

Project Page: http://pvcnn.mit.edu/

SLIDE 2

3D Deep Learning

3D deep learning has been used in various applications on resource-constrained edge devices.

3D Part Segmentation (for Robotic Systems)
3D Semantic Segmentation (for VR/AR Headsets)
3D Object Detection (for Self-Driving Cars)

SLIDE 3

Efficient 3D Deep Learning

Off-chip DRAM access is much more expensive than arithmetic operation! Random memory access is inefficient due to the potential bank conflicts!

[Figure: Energy (pJ) and Bandwidth (GB/s) for a 32b Mult and Add, a 32b SRAM Read, and a 32b DRAM Read. Plotted values: 30, 167, 668, 640, 5, 3.]

[Figure: Address Bus and Data Bus timing diagrams showing the waits for DRAM access, w/o Bank Conflicts vs. w/ Bank Conflicts.]

Efficient 3D deep learning models should have small memory footprints and avoid random memory access.

SLIDE 4

Voxel-Based Models: Cubically-Growing Memory

Low resolutions lead to significant information loss. High resolutions lead to large memory consumption.

VoxNet [IROS’15], 3D ShapeNets [CVPR’15], 3D U-Net [MICCAI’16]

[Figure: Distinguishable Points (%) and GPU Memory (GB) vs. Voxel Resolution (8 to 256).]

64 x 64 x 64 resolution: 11 GB (Titan XP x 1), 42% information loss
128 x 128 x 128 resolution: 83 GB (Titan XP x 7), 7% information loss
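
The resolution trade-off can be illustrated with a minimal sketch. It assumes that a point counts as distinguishable when it is the only point in its voxel, which is an assumption about the metric rather than something stated on the slide, and it uses a synthetic random cloud instead of real data. The helper names `voxel_count` and `distinguishable_fraction` are hypothetical.

```python
import numpy as np

def voxel_count(resolution: int) -> int:
    """Number of cells in a dense cubic grid: grows as resolution**3."""
    return resolution ** 3

def distinguishable_fraction(points: np.ndarray, resolution: int) -> float:
    """Fraction of points that are alone in their voxel (assumed metric)."""
    # Scale coordinates into [0, 1) so every point maps to a valid cell.
    mins = points.min(axis=0)
    span = (points.max(axis=0) - mins).max() + 1e-9
    cells = np.floor((points - mins) / span * resolution).astype(np.int64)
    cells = np.clip(cells, 0, resolution - 1)
    # Flatten the (x, y, z) cell index into a single id per point.
    ids = (cells[:, 0] * resolution + cells[:, 1]) * resolution + cells[:, 2]
    _, inverse, counts = np.unique(ids, return_inverse=True, return_counts=True)
    return float(np.mean(counts[inverse] == 1))

rng = np.random.default_rng(0)
cloud = rng.random((2048, 3))  # synthetic stand-in for a real point cloud

for r in (8, 32, 128):
    print(r, voxel_count(r), round(distinguishable_fraction(cloud, r), 3))
```

Doubling the resolution from 64 to 128 multiplies the number of cells by 8, which is in line with the reported growth from 11 GB to 83 GB (a factor of about 7.5).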

SLIDE 5

Point-Based Models: Sparsity Overheads

Up to 80% of the time is wasted on structuring the sparse data, not on the actual feature extraction.

[Figure: Runtime (%) split into Irregular Access, Dynamic Kernel, and Actual Computation for DGCNN, PointCNN, SpiderCNN, and Ours. Plotted values: 95.1, 0.0, 4.9; 15.6, 27.0, 57.4; 12.2, 51.5, 36.3; 45.3, 2.9, 51.8.]

PointCNN [NeurIPS’18], PointNet [CVPR’17], DGCNN [SIGGRAPH’19]
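
To make the overhead concrete, the sketch below separates the two phases of a generic point-based layer: structuring the data (neighbor search and gather, which index memory irregularly) and the actual feature computation. It is a brute-force k-nearest-neighbor illustration with hypothetical function names, not the implementation of any model named on this slide.

```python
import numpy as np

def knn_indices(coords: np.ndarray, k: int) -> np.ndarray:
    """Brute-force k nearest neighbors (self excluded): the structuring step."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)   # squared distances, (N, N)
    np.fill_diagonal(dist2, np.inf)                # a point is not its own neighbor
    return np.argsort(dist2, axis=1)[:, :k]        # (N, k)

def point_layer(coords, feats, weight, k=8):
    idx = knn_indices(coords, k)       # neighbor search
    gathered = feats[idx]              # irregular gather: (N, k, C_in)
    transformed = gathered @ weight    # actual computation: (N, k, C_out)
    return transformed.max(axis=1)     # max-pool over neighbors: (N, C_out)

rng = np.random.default_rng(0)
coords = rng.random((256, 3))
feats = rng.normal(size=(256, 16))
weight = rng.normal(size=(16, 32))
out = point_layer(coords, feats, weight)  # shape (256, 32)
```

Only the matrix product is feature extraction; the search and the gather exist solely to bring scattered neighbors together.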

SLIDE 6

Point-Voxel Convolution (PVConv)

Voxel-Based Feature Aggregation (Coarse-Grained): Normalize, Voxelize, Convolve, Devoxelize
Point-Based Feature Transformation (Fine-Grained): Multi-Layer Perceptron
Fuse

PVCNN combines the advantages of point-based models (small memory footprint) and voxel-based models (regularity).
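
The two branches can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: voxelization averages the features in each voxel, devoxelization is a nearest-neighbor lookup, the point branch is a single linear layer with ReLU, and fusion is element-wise addition. The slide names the steps but does not specify these choices, and all function names are hypothetical.

```python
import numpy as np

def normalize(coords: np.ndarray) -> np.ndarray:
    """Center the cloud and scale it into the unit cube [0, 1]^3."""
    centered = coords - coords.mean(axis=0)
    radius = np.linalg.norm(centered, axis=1).max() + 1e-9
    return centered / (2.0 * radius) + 0.5

def voxelize(norm_coords, feats, r):
    """Average the features of all points that fall into the same voxel."""
    idx = np.clip(np.floor(norm_coords * r).astype(np.int64), 0, r - 1)
    grid = np.zeros((r, r, r, feats.shape[1]))
    count = np.zeros((r, r, r, 1))
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), feats)
    np.add.at(count, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    return grid / np.maximum(count, 1.0), idx

def convolve(grid, kernel):
    """Zero-padded 3x3x3 convolution; kernel has shape (3, 3, 3, C_in, C_out)."""
    r = grid.shape[0]
    padded = np.pad(grid, ((1, 1), (1, 1), (1, 1), (0, 0)))
    out = np.zeros(grid.shape[:3] + (kernel.shape[-1],))
    for dx in range(3):
        for dy in range(3):
            for dz in range(3):
                window = padded[dx:dx + r, dy:dy + r, dz:dz + r]
                out += window @ kernel[dx, dy, dz]
    return out

def devoxelize(grid, idx):
    """Nearest-neighbor lookup: each point reads back its own voxel's feature."""
    return grid[idx[:, 0], idx[:, 1], idx[:, 2]]

def pvconv(coords, feats, kernel, mlp_weight, r=8):
    # Voxel-based branch: coarse-grained neighborhood aggregation.
    voxels, idx = voxelize(normalize(coords), feats, r)
    voxel_feats = devoxelize(convolve(voxels, kernel), idx)
    # Point-based branch: fine-grained per-point transformation.
    point_feats = np.maximum(feats @ mlp_weight, 0.0)
    # Fuse: element-wise addition (assumed here).
    return voxel_feats + point_feats

rng = np.random.default_rng(0)
coords = rng.normal(size=(1024, 3))
feats = rng.normal(size=(1024, 16))
kernel = rng.normal(size=(3, 3, 3, 16, 32)) * 0.1
mlp_weight = rng.normal(size=(16, 32)) * 0.1
out = pvconv(coords, feats, kernel, mlp_weight)  # shape (1024, 32)
```

The dense grid is only `r x r x r`, so it can stay at a low resolution: the per-point branch keeps the fine detail, while the convolution reads contiguous memory instead of searching for neighbors.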

SLIDE 7

Point-Voxel Convolution (PVConv)

Features from Voxel-Based Branch: the voxel-based branch captures large, contiguous parts.
Features from Point-Based Branch: the point-based branch captures isolated, discontinuous details.

SLIDE 8

Results: 3D Part Segmentation (ShapeNet)

PVCNN outperforms PointCNN with 2.7x measured speedup and 1.5x memory reduction (on a GTX 1080Ti GPU).

[Figure: Mean IoU vs. GPU Latency (ms) and Mean IoU vs. GPU Memory (GB) for PVCNN, PointCNN, DGCNN, RSNet, 3D-UNet, SpiderCNN, PointNet++, and PointNet.]

SLIDE 9

Results: 3D Part Segmentation (ShapeNet)

[Figure: Objects per Second on Jetson Nano, Jetson TX2, and Jetson AGX Xavier. PointCNN (86.1 mIoU) vs. PVCNN (86.2 mIoU), plotted values: 20.2, 7.7, 3.3, 9.5, 2.5, 1.4. PointNet (83.7 mIoU) vs. 0.25 PVCNN (85.2 mIoU), plotted values: 139.9, 42.6, 19.9, 76.0, 20.3, 8.2.]

0.25 PVCNN runs with real-time performance (20 FPS) on the lightweight edge device (Jetson Nano).

SLIDE 10

Results: 3D Semantic Segmentation (S3DIS)

PVCNN++ outperforms PointCNN with 6.9x measured speedup and 5.7x memory reduction (on a GTX 1080Ti GPU).

[Figure: Mean IoU vs. GPU Latency (ms) and Mean IoU vs. GPU Memory (GB) for PVCNN, PVCNN++, 3D-UNet, PointCNN, RSNet, DGCNN, and PointNet.]

SLIDE 11

Results: 3D Semantic Segmentation (S3DIS)

[Figure: qualitative comparison with panels Input Scene, PointNet, 0.25 PVCNN (Ours), and Ground Truth.]

0.25 PVCNN outperforms PointNet with 1.8x measured speedup and 1.4x memory reduction (on a GTX 1080Ti GPU).

SLIDE 12

Results: 3D Object Detection (KITTI)

Efficiency
Model               Latency (GPU)     Memory (GPU)
F-PointNet++        105.2 ms          2.0 GB
PVCNN (efficient)   58.9 ms (1.8x)    1.4 GB (1.4x)
PVCNN (complete)    69.6 ms (1.5x)    1.4 GB (1.4x)

Car
Model               Easy           Moderate       Hard
F-PointNet++        83.8           70.9           63.7
PVCNN (efficient)   84.2 (+0.4)    71.1 (+0.2)    63.6 (-0.1)
PVCNN (complete)    84.0 (+0.2)    71.5 (+0.6)    63.8 (+0.1)

Pedestrian
Model               Easy           Moderate       Hard
F-PointNet++        70.0           61.3           53.6
PVCNN (efficient)   69.2 (-0.8)    60.3 (-1.0)    52.5 (-1.1)
PVCNN (complete)    73.2 (+3.2)    64.7 (+3.4)    56.8 (+3.2)

Cyclist
Model               Easy           Moderate       Hard
F-PointNet++        77.2           56.5           53.4
PVCNN (efficient)   78.7 (+1.5)    57.8 (+1.3)    54.2 (+1.2)
PVCNN (complete)    81.4 (+4.2)    60.0 (+3.5)    56.3 (+2.9)

PVCNN outperforms F-PointNet++ by 2.4% mAP with 1.5x measured speedup and 1.4x memory reduction.
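
The headline figures can be re-derived from the numbers reported on this slide. The sketch below assumes the comparison is against PVCNN (complete) and that the mAP gain is the unweighted mean of the nine per-category AP differences (Car, Pedestrian, Cyclist, each at Easy, Moderate, Hard); the slide does not state how the average is formed, but this reading reproduces its numbers.

```python
# Values copied from the slide; AP order is Car, Pedestrian, Cyclist,
# each at Easy, Moderate, Hard.
baseline = {"latency_ms": 105.2, "memory_gb": 2.0,
            "ap": [83.8, 70.9, 63.7, 70.0, 61.3, 53.6, 77.2, 56.5, 53.4]}
complete = {"latency_ms": 69.6, "memory_gb": 1.4,
            "ap": [84.0, 71.5, 63.8, 73.2, 64.7, 56.8, 81.4, 60.0, 56.3]}

speedup = baseline["latency_ms"] / complete["latency_ms"]
memory_reduction = baseline["memory_gb"] / complete["memory_gb"]
# Unweighted mean of per-category AP differences (assumed definition).
map_gain = sum(c - b for c, b in zip(complete["ap"], baseline["ap"])) / len(baseline["ap"])

print(f"{speedup:.1f}x speedup, {memory_reduction:.1f}x memory, +{map_gain:.1f} mAP")
# prints: 1.5x speedup, 1.4x memory, +2.4 mAP
```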

SLIDE 13

Results: 3D Object Detection (KITTI)

[Figure: qualitative detection results with panels F-PointNet++ and PVCNN (Ours).]

PVCNN outperforms F-PointNet++ by 2.4% mAP with 1.5x measured speedup and 1.4x memory reduction.

SLIDE 14

Point-Voxel CNN for Efficient 3D Deep Learning

Bottleneck Analysis
Hardware-Efficient Primitive

Hardware, AI and Neural-nets

Project Page: http://pvcnn.mit.edu/