Project Page: http://pvcnn.mit.edu/
Hardware, AI and Neural-nets
Zhijian Liu*, Haotian Tang*, Yujun Lin, and Song Han
Point-Voxel CNN for Efficient 3D Deep Learning
3D deep learning has been used in various applications: 3D semantic segmentation (for VR/AR headsets), 3D object detection (for self-driving cars), and 3D part segmentation (for robotic systems).
Off-chip DRAM access is much more expensive than an arithmetic operation! Random memory access is inefficient due to potential bank conflicts!
[Chart: energy per 32b operation — Mult and Add: 3 pJ; SRAM Read: 5 pJ; DRAM Read: 640 pJ (bandwidth axis: 668 / 167 / 30 GB/s)]
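To make the gap concrete, here is a back-of-the-envelope sketch using the approximate per-operation energy figures from the chart above; `layer_energy_pj` is a hypothetical toy energy model added for illustration, not part of any real profiler.

```python
# Approximate energy per operation, as on the slide (picojoules):
MAC_PJ = 3     # 32-bit multiply-add
SRAM_PJ = 5    # 32-bit on-chip SRAM read
DRAM_PJ = 640  # 32-bit off-chip DRAM read

# One off-chip DRAM read costs as much energy as ~213 multiply-adds:
macs_per_dram_read = DRAM_PJ // MAC_PJ

def layer_energy_pj(n_macs, reads_per_mac, read_pj):
    """Hypothetical energy model: each MAC plus its operand fetches."""
    return n_macs * (MAC_PJ + reads_per_mac * read_pj)

# The same layer, operands streamed from DRAM vs. cached in SRAM:
dram_bound = layer_energy_pj(10**6, 2, DRAM_PJ)  # (3 + 2*640) pJ per MAC
sram_bound = layer_energy_pj(10**6, 2, SRAM_PJ)  # (3 + 2*5) pJ per MAC
```

Under this toy model the DRAM-bound layer burns nearly 100x more energy, which is why small memory footprints matter as much as FLOP counts.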
[Timing diagram: address bus and data bus activity without vs. with bank conflicts — without conflicts, consecutive DRAM accesses overlap across banks; with conflicts, each access must wait for the previous DRAM access to finish, serializing the data bus.]
Efficient 3D deep learning models should have small memory footprints and avoid random memory access.
Low resolutions lead to significant information loss. High resolutions lead to large memory consumption.
VoxNet [IROS’15] 3D ShapeNets [CVPR’15] 3D U-Net [MICCAI’16]
[Plot: distinguishable points (%) and GPU memory (GB) vs. voxel resolution]
64 x 64 x 64 resolution: 11 GB (1x Titan XP), 42% information loss
128 x 128 x 128 resolution: 83 GB (7x Titan XP), 7% information loss
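The cubic growth is easy to verify with a quick sketch. This is a single-tensor estimate added for illustration (the 11 GB / 83 GB figures above are totals over all activations of a full network, so this is only a lower bound, and `channels=32` is an assumed value):

```python
def voxel_grid_bytes(resolution, channels, dtype_bytes=4):
    """Memory for one dense float32 feature volume of shape
    (channels, R, R, R): grows as the cube of the resolution."""
    return resolution ** 3 * channels * dtype_bytes

lo = voxel_grid_bytes(64, 32)    # one activation tensor at 64^3
hi = voxel_grid_bytes(128, 32)   # doubling resolution -> 2^3 = 8x memory
```

Every doubling of resolution multiplies memory by 8, so high-resolution dense voxel grids quickly exhaust even multi-GPU memory budgets.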
Up to 80% of the time is wasted on structuring the sparse data, not on the actual feature extraction.
[Stacked bar chart: runtime (%) split into irregular access, dynamic kernel, and actual computation for DGCNN, PointCNN, SpiderCNN, and ours; per-model breakdowns: 95.1 / 0.0 / 4.9, 15.6 / 27.0 / 57.4, 12.2 / 51.5 / 36.3, 45.3 / 2.9 / 51.8]
PointCNN [NeurIPS’18] PointNet [CVPR’17] DGCNN [SIGGRAPH’19]
[PVConv block: the coarse-grained voxel-based branch (normalize -> voxelize -> convolve -> devoxelize) aggregates neighborhood features, while the fine-grained point-based branch (a multi-layer perceptron) transforms individual point features; the two outputs are then fused.]
PVCNN combines the advantages of point-based models (small memory footprint) and voxel-based models (regularity).
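A minimal NumPy sketch of this idea — my own simplification, not the paper's CUDA implementation: the 3D convolution on the grid is omitted, devoxelization is nearest-neighbor rather than the paper's trilinear interpolation, and the point branch is a single shared linear layer with ReLU.

```python
import numpy as np

def voxelize(coords, feats, r):
    """Scatter point features into an r^3 grid by averaging (coarse branch).
    coords: (N, 3) points normalized to [0, 1); feats: (N, C)."""
    idx = np.clip((coords * r).astype(int), 0, r - 1)     # (N, 3) voxel indices
    flat = idx[:, 0] * r * r + idx[:, 1] * r + idx[:, 2]  # (N,) flattened cell id
    grid = np.zeros((r * r * r, feats.shape[1]))
    count = np.zeros(r * r * r)
    np.add.at(grid, flat, feats)                          # sum features per voxel
    np.add.at(count, flat, 1.0)
    grid /= np.maximum(count, 1.0)[:, None]               # mean per occupied voxel
    return grid, flat

def devoxelize(grid, flat):
    """Gather each point's feature back from its voxel (nearest-neighbor)."""
    return grid[flat]

def pvconv(coords, feats, w_point, r=8):
    """One PVConv-style step: the voxel branch aggregates neighborhoods
    (regular memory access), the point branch keeps per-point detail
    (small footprint); the outputs are fused by addition."""
    grid, flat = voxelize(coords, feats, r)
    voxel_feats = devoxelize(grid, flat)                  # 3D conv omitted here
    point_feats = np.maximum(feats @ w_point, 0.0)        # shared MLP + ReLU
    return voxel_feats + point_feats                      # fuse
```

Points falling into the same voxel share one coarse feature (the large, contiguous parts), while the MLP branch preserves fine per-point detail; addition fuses the two.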
Features from the voxel-based branch capture large, contiguous parts; features from the point-based branch capture isolated, discontinuous details.
PVCNN outperforms PointCNN with 2.7x measured speedup and 1.5x memory reduction (on a GTX 1080Ti GPU).
[Plot: mean IoU (83.5-86.0) vs. GPU latency (ms) and GPU memory (GB) for PointNet, PointNet++, DGCNN, RSNet, 3D-UNet, SpiderCNN, PointCNN, and PVCNN]
Objects per second (Jetson AGX Xavier / Jetson TX2 / Jetson Nano):
PVCNN (86.2 mIoU): 20.2 / 7.7 / 3.3 vs. PointCNN (86.1 mIoU): 9.5 / 2.5 / 1.4
0.25 PVCNN (85.2 mIoU): 139.9 / 42.6 / 19.9 vs. PointNet (83.7 mIoU): 76.0 / 20.3 / 8.2
0.25 PVCNN runs with real-time performance (20 FPS) on Jetson Nano.
PVCNN++ outperforms PointCNN with 6.9x measured speedup and 5.7x memory reduction (on a GTX 1080Ti GPU).
[Plot: mean IoU (42.5-57.5) vs. GPU latency (ms) and GPU memory (GB) for PointNet, DGCNN, RSNet, PointCNN, 3D-UNet, PVCNN, and PVCNN++]
[Qualitative results: input scene / PointNet prediction / 0.25 PVCNN (ours) prediction / ground truth]
0.25 PVCNN outperforms PointNet with 1.8x measured speedup and 1.4x memory reduction (on a GTX 1080Ti GPU).
KITTI 3D object detection:

Model              | Latency (GPU)  | Memory (GPU)  | Car: Easy / Moderate / Hard             | Pedestrian: Easy / Moderate / Hard      | Cyclist: Easy / Moderate / Hard
F-PointNet++       | 105.2 ms       | 2.0 GB        | 83.8 / 70.9 / 63.7                      | 70.0 / 61.3 / 53.6                      | 77.2 / 56.5 / 53.4
PVCNN (efficient)  | 58.9 ms (1.8x) | 1.4 GB (1.4x) | 84.2 (+0.4) / 71.1 (+0.2) / 63.6 (-0.1) | 69.2 (-0.8) / 60.3 (-1.0) / 52.5 (-1.1) | 78.7 (+1.5) / 57.8 (+1.3) / 54.2 (+1.2)
PVCNN (complete)   | 69.6 ms (1.5x) | 1.4 GB (1.4x) | 84.0 (+0.2) / 71.5 (+0.6) / 63.8 (+0.1) | 73.2 (+3.2) / 64.7 (+3.4) / 56.8 (+3.2) | 81.4 (+4.2) / 60.0 (+3.5) / 56.3 (+2.9)
PVCNN outperforms F-PointNet++ by 2.4% mAP with 1.5x measured speedup and 1.4x memory reduction.
[Qualitative detection results: F-PointNet++ vs. PVCNN (ours)]
Summary: Bottleneck Analysis; Hardware-Efficient Primitive.
Project Page: http://pvcnn.mit.edu/