Sven Middelberg, Developer Technology Engineer smiddelberg@nvidia.com
2
VIRTUAL REALITY DEPTH FUSION
3
THE SETUP
Intel RealSense D435 + Vive Tracker: 90 FPS, 848x480 depth stream
NVIDIA GP100: 3584 CUDA cores, 16 GB HBM2 memory
Vive: 90 Hz update rate
6
MIXED REALITY FUSION
Symbiosis of VR and depth fusion
TAKEAWAYS
How can we take advantage of the VR system to make depth fusion more robust?
Which optimizations are necessary to simultaneously reconstruct a 90 fps depth stream and visualize it in stereo VR?
7
AGENDA
DEPTH FUSION IN A NUTSHELL
ROBUST MIXED REALITY FUSION
CUDA IMPLEMENTATION & OPTIMIZATIONS
8
DEPTH FUSION IN A NUTSHELL
12
[Pipeline diagram: NEW FRAME → Pose Estimation → POSE → Volumetric Fusion → VOLUMETRIC RECONSTRUCTION → Raycasting → VERTEX & NORMAL MAP → Pose Estimation of the next frame]
14
RECONSTRUCTION DATA STRUCTURE
Voxel Grid $V_i$
Truncated Signed Distance Field $D_i$
Truncation Size μ
[Figure: 2D voxel grid annotated with signed distance values; μ = 1.0]
15
RECONSTRUCTION DATA STRUCTURE
Voxel Grid $V_i$
Truncated Signed Distance Field $D_i$
Truncation Size μ
Number of samples $C_i$
6 x 6 x 4 m³ volume, 4 mm voxel size
1500 x 1500 x 1000 voxel grid
2.25 billion voxels ⇒ 16.76 GByte
[Figure: the same voxel grid with distance values truncated at μ = 1.0]
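The grid size and memory figure on this slide follow from simple arithmetic, which the host-side C++ sketch below reproduces. The 8 bytes per voxel is an assumption, not stated on the slide: it is the value that reproduces the quoted 16.76 GByte (for example a 4-byte distance $D_i$ plus a 4-byte sample count $C_i$). The function names are illustrative only.

```cpp
#include <cstdint>

// Voxels along one axis; extent and voxel size are both given in millimetres
// so the division stays exact in integer arithmetic.
constexpr std::uint64_t voxelsPerAxis(std::uint64_t extentMm, std::uint64_t voxelMm) {
    return extentMm / voxelMm;
}

// Total voxel count of a dense grid covering xMm x yMm x zMm.
constexpr std::uint64_t denseVoxelCount(std::uint64_t xMm, std::uint64_t yMm,
                                        std::uint64_t zMm, std::uint64_t voxelMm) {
    return voxelsPerAxis(xMm, voxelMm) * voxelsPerAxis(yMm, voxelMm) *
           voxelsPerAxis(zMm, voxelMm);
}

// Memory footprint in GiB (2^30 bytes).
// ASSUMPTION: bytesPerVoxel = 8 is inferred from the quoted total, not given in the deck.
inline double denseGridGiB(std::uint64_t voxelCount, std::uint64_t bytesPerVoxel = 8) {
    return static_cast<double>(voxelCount * bytesPerVoxel) / (1024.0 * 1024.0 * 1024.0);
}
```

Calling `denseVoxelCount(6000, 6000, 4000, 4)` gives 2.25 billion voxels, and `denseGridGiB` of that count is roughly 16.76. That exceeds the 16 GB of the GP100 in the setup, which is presumably why a dense grid is not used.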
16
SPARSE VOXEL HASHING
Voxel brick: N³ voxel cube
Hash function h(bx, by, bz) that maps from brick space to hash bucket
Hash entry references actual brick memory within a preallocated brick atlas
Hash entry: 3D brick position, overflow list offset, brick atlas pointer
Nießner et al., 2013
[Figure: hash table with entries i-5 ... i+4, each pointing into the brick atlas]
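As a rough illustration of the hash entry and of h(bx, by, bz), here is a minimal host-side C++ sketch. It assumes the XOR-of-primes spatial hash used in Nießner et al., 2013; the deck does not say which hash function it uses, and the field widths, the `-1` convention for a free entry, and all names are hypothetical.

```cpp
#include <cstdint>

// One hash entry, mirroring the three fields named on the slide.
// ASSUMPTION: 32-bit fields and -1 as "no brick allocated" are illustrative choices.
struct HashEntry {
    std::int32_t bx, by, bz;        // 3D brick position
    std::int32_t overflowOffset;    // offset to the next entry in the overflow list
    std::int32_t atlasIndex;        // index of the brick in the brick atlas, -1 if free
};

// Map a voxel coordinate to its brick coordinate (floor division, valid for negatives).
inline std::int32_t voxelToBrick(std::int32_t v, std::int32_t brickSize) {
    return (v >= 0) ? v / brickSize : -((-v + brickSize - 1) / brickSize);
}

// Spatial hash over brick coordinates.
// ASSUMPTION: the primes are those published with the voxel hashing paper.
inline std::uint32_t hashBrick(std::int32_t bx, std::int32_t by, std::int32_t bz,
                               std::uint32_t numBuckets) {
    const std::uint32_t p1 = 73856093u, p2 = 19349669u, p3 = 83492791u;
    // Unsigned arithmetic avoids signed-overflow undefined behaviour.
    std::uint32_t h = (static_cast<std::uint32_t>(bx) * p1) ^
                      (static_cast<std::uint32_t>(by) * p2) ^
                      (static_cast<std::uint32_t>(bz) * p3);
    return h % numBuckets;
}
```

A lookup would hash the brick coordinate, compare the stored brick position in that bucket, and follow `overflowOffset` on a mismatch.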
22
VOLUMETRIC FUSION
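Project voxel onto image plane
Find nearest depth
Compute TSDF sample $d$
Update $D_i$, $C_i$:
$D_i \leftarrow \frac{C_i D_i + d}{C_i + 1}$
$C_i \leftarrow \min(C_i + 1, C_{\max})$
[Figure: voxel with stored $D_i$, $C_i$ and the new TSDF sample $d$]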
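The update rule is a running average with a capped sample count. A minimal host-side C++ sketch follows. `fuse` implements the update from the slide; `tsdfSample` is an assumption, since the deck does not spell out how $d$ is computed. Here it is the measured depth minus the voxel depth, clamped to $[-\mu, \mu]$, with positive values in front of the surface. All names are illustrative.

```cpp
#include <algorithm>

// Per-voxel state: truncated signed distance D and number of fused samples C.
struct Voxel {
    float D;
    int   C;
};

// ASSUMPTION: TSDF sample as depth difference along the viewing ray, clamped to [-mu, mu].
// Real systems often skip voxels far behind the surface instead of clamping them.
inline float tsdfSample(float measuredDepth, float voxelDepth, float mu) {
    return std::max(-mu, std::min(mu, measuredDepth - voxelDepth));
}

// Running-average update from the slide:
//   D <- (C * D + d) / (C + 1),   C <- min(C + 1, cMax)
inline void fuse(Voxel& v, float d, int cMax) {
    v.D = (static_cast<float>(v.C) * v.D + d) / static_cast<float>(v.C + 1);
    v.C = std::min(v.C + 1, cMax);
}
```

Capping the count at `cMax` keeps new samples from being averaged away entirely, so the reconstruction can still adapt after many frames.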
34
VERTEX & NORMAL MAP CONSTRUCTION
Raycasting
Two-stage raycasting
1st stage: March ray in steps of truncation region size μ
2nd stage: March voxel by voxel
V*: Ray position at zero-crossing
N*: Gradient of TSDF at V*
[Figure: ray marching through the truncation region from μ to -μ, with V* and N* at the zero-crossing]
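The two stages can be sketched along a single ray in host-side C++. This is a simplified illustration, not the deck's kernel. The TSDF is passed in as a callable of the ray parameter `t`; the switch to the fine stage once the TSDF drops below μ and the linear interpolation of the zero-crossing are assumptions; and the normal N* is omitted. `planeTsdf` is a hypothetical test scene.

```cpp
#include <algorithm>

// Demo TSDF along a ray: planar surface at ray parameter surfaceT,
// positive in front of the surface, truncated to [-mu, mu].
inline float planeTsdf(float t, float surfaceT, float mu) {
    return std::max(-mu, std::min(mu, surfaceT - t));
}

// Returns the ray parameter of the zero-crossing (V* = origin + t * direction),
// or -1 if no front-to-back crossing is found before tMax.
template <typename Tsdf>
inline float raycastTwoStage(Tsdf tsdf, float mu, float voxelSize, float tMax) {
    float t = 0.0f;
    // Stage 1: while the TSDF is saturated at +mu the surface is at least mu away,
    // so a step of size mu cannot jump over it.
    while (t < tMax && tsdf(t) >= mu) t += mu;

    // Stage 2: march voxel by voxel until the sign changes from >= 0 to < 0.
    float prevT = t;
    float prevD = tsdf(t);
    while (t < tMax) {
        t += voxelSize;
        float d = tsdf(t);
        if (prevD >= 0.0f && d < 0.0f) {
            // Linear interpolation of the zero-crossing between the two samples.
            return prevT + (t - prevT) * prevD / (prevD - d);
        }
        prevT = t;
        prevD = d;
    }
    return -1.0f;
}
```

For a plane at t = 2.3 with μ = 0.5 and 0.1 voxel steps, the coarse stage takes four steps to t = 2.0 and the fine stage then locates the crossing near 2.3.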
39
POSE ESTIMATION
Given: depth image $I$, raycast pose $P^*$, raycast maps $V^*$, $N^*$
Find pose $P = (R|t)$ of $I$
Construct depth pyramid $I_j$, $0 \le j < L$
Extract camera space vertices & normals $V_j$, $N_j$
Iterative coarse-to-fine minimization of distance between $V_j$, $N_j$ and $V^*$, $N^*$
Initialize $P$ with $P^*$
[Figure: pyramid levels ($V_0$, $N_0$), ($V_1$, $N_1$), ($V_2$, $N_2$) aligned against $V^*$, $N^*$]
44
POSE ESTIMATION
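Point-Plane ICP
Find correspondences $(V_k^j, N_k^j, V_k^*, N_k^*)$
Minimize sum of squared point-plane distances for $P_\Delta$:
$E(P_\Delta) = \sum_k \mathrm{dist}(P_\Delta P V_k^j, V_k^*, N_k^*)^2$
Update $P \leftarrow P_\Delta P$
[Figure: transformed point $P_\Delta P V_k^j$ and its distance $\mathrm{dist}(P_\Delta P V_k^j, V_k^*, N_k^*)$ to the plane through $V_k^*$ with normal $N_k^*$]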
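To make the energy concrete, the sketch below evaluates it on the host in C++. It assumes the usual point-plane distance $\mathrm{dist}(p, V^*, N^*) = N^* \cdot (p - V^*)$ with unit normals, which the slide does not define explicitly. The types `Pose` and `Correspondence` and all function names are illustrative.

```cpp
#include <array>
#include <vector>

using Vec3 = std::array<float, 3>;

// Rigid pose P = (R|t): rotation stored as three rows, plus translation.
struct Pose {
    std::array<Vec3, 3> R;
    Vec3 t;
};

inline Pose identityPose() {
    Pose P;
    P.R[0] = {1.0f, 0.0f, 0.0f};
    P.R[1] = {0.0f, 1.0f, 0.0f};
    P.R[2] = {0.0f, 0.0f, 1.0f};
    P.t    = {0.0f, 0.0f, 0.0f};
    return P;
}

inline float dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Apply the pose to a camera-space vertex: R * v + t.
inline Vec3 transform(const Pose& P, const Vec3& v) {
    return {dot(P.R[0], v) + P.t[0], dot(P.R[1], v) + P.t[1], dot(P.R[2], v) + P.t[2]};
}

// ASSUMPTION: signed distance of p to the plane through vStar with unit normal nStar.
inline float pointPlaneDist(const Vec3& p, const Vec3& vStar, const Vec3& nStar) {
    Vec3 diff = {p[0] - vStar[0], p[1] - vStar[1], p[2] - vStar[2]};
    return dot(nStar, diff);
}

struct Correspondence {
    Vec3 v;       // vertex from the new frame (camera space)
    Vec3 vStar;   // raycast vertex V*
    Vec3 nStar;   // raycast normal N*
};

// E(P) = sum_k dist(P * v_k, vStar_k, nStar_k)^2
inline float icpEnergy(const Pose& P, const std::vector<Correspondence>& cs) {
    float e = 0.0f;
    for (const Correspondence& c : cs) {
        float d = pointPlaneDist(transform(P, c.v), c.vStar, c.nStar);
        e += d * d;
    }
    return e;
}
```

Because only the component along the normal is penalized, points may slide along the surface without cost. This is the usual argument for point-plane over point-point ICP.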
48
POSE ESTIMATION
Lie-Algebraic Parameterization
$SE(3)$: Lie group of poses (12 parameters)
$\mathfrak{se}(3)$: Lie algebra (6 parameters), a minimal parameterization
Mapping between $SE(3)$ and $\mathfrak{se}(3)$:
$\exp : \mathfrak{se}(3) \to SE(3)$
$\log : SE(3) \to \mathfrak{se}(3)$
Substitute $P_\Delta = \exp(\delta)$:
$E(\delta) = \sum_k \mathrm{dist}(\exp(\delta) P V_k^j, V_k^*, N_k^*)^2$
51
ROBUST MIXED REALITY FUSION
55
[Pipeline diagram: NEW FRAME and TRACKED POSE → Pose Estimation → Volumetric Fusion → VOLUMETRIC RECONSTRUCTION → Raycasting (for the next frame) and Stereo Raycast (from the DISPLAY POSE)]
56
ROBUST MIXED REALITY FUSION
Cheap relocalization using the VR-tracked pose $P^{VR}$
Requires registration of the reconstruction and VR coordinate systems
Why not use $P^{VR}$ directly?
57
ROBUST MIXED REALITY FUSION
[Figure, left: using $P^{VR}$ directly; right: regularized optimization of $P^{VR}$]
59
REGISTRATION
Find transformation $T$ such that $P_i = T \cdot P_i^{VR}$
Lie-algebraic approach: "Lie-Algebraic Averaging for Globally Consistent Motion Estimation", Govindu, CVPR 2004
[Figure: reconstruction poses $P_0, \dots, P_3$ and VR poses $P_0^{VR}, \dots, P_3^{VR}$ related by $T$]
61
RELOCALIZATION & REGULARIZATION
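Initialization of pose estimation with $P^{VR}$
Penalize distance of $P^{VR}$ and $\exp(\delta) P$:
$\mathrm{dist}(\delta) = \log(\exp(\delta) P (P^{VR})^{-1}) \approx \delta + \log(P (P^{VR})^{-1})$
$E'(\delta) = E(\delta) + \mathrm{dist}(\delta)^T S^{-1} \mathrm{dist}(\delta)$
Low computational overhead
For $x, y \approx 0$: $\log(\exp(x) \exp(y)) \approx x + y$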
62
CUDA IMPLEMENTATION & OPTIMIZATIONS
64
POSE ESTIMATION
System Setup (GPU)
Find correspondences
Set up linear system $\sum_k J_k^T J_k \, \delta = -\sum_k J_k^T r_k$
Regularization & Solving (CPU)
Add regularization term to system
Solve for $\delta$
Update $P \leftarrow \exp(\delta) P$
[Diagram: System Setup and Regularization & Solving stages, with inputs $V_j$, $N_j$, $V^*$, $N^*$, $P^{VR}$ and pose $P$]
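One way to see why adding the regularization term is cheap is to write out the regularized normal equations. The derivation below is a sketch under two assumptions not stated on the slides: each residual is linearized to first order as $r_k + J_k \delta$, and $S$ is symmetric positive definite. The shorthand $\xi$ is introduced here for convenience.

```latex
% Assumption: first-order linearization of each point-plane residual around \delta = 0
\mathrm{dist}(\exp(\delta) P V_k^j, V_k^*, N_k^*) \approx r_k + J_k \delta

% Regularized energy, with the shorthand \xi = \log(P (P^{VR})^{-1})
E'(\delta) \approx \sum_k (r_k + J_k \delta)^2 + (\delta + \xi)^T S^{-1} (\delta + \xi)

% Setting the gradient with respect to \delta to zero (S symmetric):
\Big( \sum_k J_k^T J_k + S^{-1} \Big) \, \delta = - \sum_k J_k^T r_k - S^{-1} \xi
```

Under these assumptions the regularization only adds a 6x6 matrix to the left-hand side and a 6-vector to the right-hand side of the system that is accumulated anyway.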
73
POSE ESTIMATION
Baseline
__global__ void setupSystem(float* gSys, ...) {
    int x = getPixX();
    int y = getPixY();
    if (findCorrespondence(x, y, ...)) {
        float lSys[27];
        computeLocalSystem(lSys, ...);
        // Every thread with a correspondence issues 27 atomics to global memory.
        #pragma unroll
        for (int i = 0; i < 27; ++i) {
            atomicAdd(gSys + i, lSys[i]);
        }
    }
}

void solve(float* gSys, ...) {
    float hostSys[27];
    cudaMemcpyAsync(hostSys, gSys, ...);
    cudaStreamSynchronize(stream);      // CPU blocks until the copy has finished
    cudaMemsetAsync(gSys, 0, ...);      // reset the accumulator for the next iteration
    addRegularization(hostSys, ...);
    float delta[6];
    solve(hostSys, delta);
    pose = exp(delta) * pose;
}
74
POSE ESTIMATION
Baseline
[Chart: baseline, three measurements: 19.5, 26.5, 36.4]
81
POSE ESTIMATION
Warp-Aggregated Atomics
__device__ __forceinline__
void warpReduceSystem(float* lSys, int lane) {
    #pragma unroll
    for (int i = 0; i < 27; ++i) {
        warpReduce(lSys[i], lane);
        if (lane == i) lSys[0] = lSys[i];
    }
}

__global__ void setupSystem(float* gSys, ...) {
    int x = getPixX();
    int y = getPixY();
    float lSys[27];
    initZero(lSys);
    if (findCorrespondence(x, y, ...)) {
        computeLocalSystem(lSys, ...);
    }
    int lane = getLane();
    warpReduceSystem(lSys, lane);
    if (lane < 27) {
        atomicAdd(gSys + lane, lSys[0]);
    }
}
82
POSE ESTIMATION
Warp-Aggregated Atomics
[Chart: baseline vs. warp-aggregated atomics, three measurements: 19.5 vs. 3.7, 26.5 vs. 3, 36.4 vs. 3.6]
SPEEDUP 8.0
83
POSE ESTIMATION
Minimizing CPU Overhead
126.5 μs
84
POSE ESTIMATION
Minimizing CPU Overhead
79.5 μs Launch of next system setup kernel
85
POSE ESTIMATION
Regularize & solve on GPU
✓ Removes CPU ↔ GPU synchronization
✓ Removes CPU ↔ GPU copies
✓ Keeps GPU busy
Minimizing CPU Overhead
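__constant__ SE3 cPose;

void trackingStep(float* gSys, SE3* gPose, ...) {
    // Refresh the constant-memory copy of the pose from gPose.
    cudaMemcpyToSymbolAsync(cPose, gPose, ...);
    setupSystem<<<...>>>(gSys, ...);
    // Regularize, solve, and update the pose on the GPU.
    solveAndUpdate<<<...>>>(gSys, gPose);
}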
86
POSE ESTIMATION
Minimizing CPU Overhead
43.3 μs
87
POSE ESTIMATION
Minimizing CPU Overhead
Measurement   Baseline   Warp-aggregated atomics   GPU solver
1             19.5       3.7                       3
2             26.5       3                         2.4
3             36.4       3.6                       3
SPEEDUP 8.0 (warp-aggregated atomics), SPEEDUP 9.7 (GPU solver)
90
DATA STRUCTURE
Three raycasts per frame
Vast number of trilinear interpolations → brick atlas in texture memory
Problem: Interpolation at brick boundaries
Solution: Apron voxels
[Figure: reconstruction space and the corresponding texture atlas with apron voxels]
92
VOLUMETRIC FUSION
Brick Allocation: Reserve texture atlas memory
Brick Selection: Select bricks that are within the new frame's camera frustum
Brick Update: Fuse the selected bricks' TSDFs and weights with the new frame
Apron Update: Propagate boundary voxels of updated bricks to the neighboring bricks' aprons
[Diagram: New Frame & Pose → Brick Allocation → Brick Selection → Brick Update → Apron Update]
95
VOLUMETRIC FUSION
Update horizontally Update vertically
96
VOLUMETRIC FUSION
Baseline
[Chart: baseline, three measurements: 3.4, 7.2, 8.1]
97
VOLUMETRIC FUSION
8³ bricks ⇒ 600 updates for 488 apron voxels ⇒ 23% overhead
Update horizontally Update vertically
98
VOLUMETRIC FUSION
Low Overhead Apron Update
Works only for 8³ bricks
Requires at least 488 threads / block
Classify apron voxels into three types; each warp updates a single type:
Face aprons: 64 threads / face, updated by warps 1-12
Edge aprons: 8 threads / edge, updated by warps 13-15
Corner aprons: 1 thread / corner, updated by warp 16
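The thread and warp counts above, and the 600-versus-488 figure from the earlier overhead slide, can be checked with a little arithmetic. The host-side C++ sketch below does so for a brick of edge length n. Reading the 600 updates as six full (n+2)² slabs is an assumption, consistent with the numbers but not spelled out in the deck. All names are illustrative.

```cpp
// Apron voxel counts for an n^3 brick surrounded by a one-voxel apron.
struct ApronCounts {
    int faceVoxels;     // 6 faces, n * n voxels each
    int edgeVoxels;     // 12 edges, n voxels each
    int cornerVoxels;   // 8 corners
    int total;          // (n + 2)^3 - n^3
    int slabUpdates;    // ASSUMPTION: naive scheme writes six full (n + 2)^2 slabs
};

constexpr ApronCounts apronCounts(int n) {
    return ApronCounts{6 * n * n, 12 * n, 8,
                       (n + 2) * (n + 2) * (n + 2) - n * n * n,
                       6 * (n + 2) * (n + 2)};
}

// Number of 32-thread warps needed to give each voxel its own thread.
constexpr int warpsNeeded(int threads) {
    return (threads + 31) / 32;
}
```

For n = 8 this gives 384 face, 96 edge, and 8 corner voxels, 488 in total, which map to 12 + 3 + 1 = 16 warps. Six 10x10 slabs amount to 600 writes, about 23% more than 488.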
99
VOLUMETRIC FUSION
Low Overhead Apron Update
[Chart: baseline vs. low overhead apron update, three measurements: 3.4 vs. 2.5, 7.2 vs. 4.8, 8.2 vs. 5.4]
SPEEDUP 1.5
106
VOLUMETRIC FUSION
Hiding Latency
▪ Kernel latency-bound
▪ Selected blocks written once, loaded twice from global memory
▪ Boundary voxels' TSDFs loaded twice from atlas
__global__ void updateAprons(Brick* sel, int nSel) {
    for (int i = 0; i < nSel; ++i) {
        Brick b = loadSelected(sel, i);       // Latency
        Brick nb = loadNeighbor(b, ...);
        float tsdf = loadSrcVoxel(b, ...);    // Latency
        writeApron(nb, tsdf, ...);
    }
}
113
VOLUMETRIC FUSION
Combined Selection / Update
✓ Much of the latency hidden
✓ Selected blocks neither written to nor read from global memory
✓ Boundary voxels' TSDFs loaded only once
__global__ void combinedSelectAndUpdate(...) {
    __shared__ Brick sel[MAX_SEL_PER_BLOCK];
    __shared__ int nSel;
    __shared__ float brickTsdfs[512];

    select(sel, nSel);
    for (int i = 0; i < nSel; ++i) {
        Brick b = sel[i];
        Brick nb = loadNeighbor(b, ...);
        brickTsdfs[tid()] = updateVoxel(b, ...);
        __syncthreads();   // all updated TSDFs of this brick are visible
        float apronTsdf = getApronTsdf(brickTsdfs, tid());
        writeApron(nb, apronTsdf, ...);
        __syncthreads();   // keep brickTsdfs intact until every thread has read it
    }
}
114
VOLUMETRIC FUSION
Combined Selection / Update
Measurement   Baseline   Low overhead apron update   Combined select & update
1             3.4        2.5                         1.4
2             7.2        4.8                         2.4
3             8.2        5.4                         2.7
SPEEDUP 1.5 (low overhead apron update), SPEEDUP 2.9 (combined select & update)
116
RAYCASTING
One raycast for the next frame (848x480)
Two raycasts for the stereo HMD (540x600)
Two-stage approach
Full use of GPU texture units
[Figure: ray marching through the truncation region from μ to -μ]
117
RAYCASTING
Baseline
[Chart: baseline, three measurements: 2.3, 4.4, 4.1]
119
ACCELERATION STRUCTURE
Empty-Space Skipping
Fast leaping over empty space
Define supergrid of octrees
Efficient update and query?
[Figure: supergrid of octrees $O_{i,j}$ with its neighbors $O_{i \pm 1, j \pm 1}$]
121
ACCELERATION STRUCTURE
Binary Encoded 3-Level Occupancy Octree
Level 2 Cell
122
ACCELERATION STRUCTURE
Binary Encoded 3-Level Occupancy Octree
Level 1 Cells
123
ACCELERATION STRUCTURE
Binary Encoded 3-Level Occupancy Octree
Level 0 Cell
124
ACCELERATION STRUCTURE
Binary Encoded 3-Level Occupancy Octree
64-bit descriptor d
Bit i is set if level 0 cell i is occupied
Bits ordered corresponding to the octree hierarchy
Level 2: Check if d is zero
Level 1: Check if corresponding byte of d is zero
Level 0: Check if corresponding bit of d is zero
Update: atomicOr during brick allocation
[Figure: 4 x 4 slice of cells numbered in hierarchy order, row by row: 1 2 5 6 / 3 4 7 8 / 9 10 13 14 / 11 12 15 16]
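The three occupancy checks map directly onto integer operations on the 64-bit descriptor. The host-side C++ sketch below illustrates this, using `std::atomic::fetch_or` as a stand-in for CUDA's `atomicOr`. It assumes a 4 x 4 x 4 arrangement of level 0 cells and a Morton-style bit order, in which each level 1 cell owns one byte. The exact bit layout used in the talk is not given, and all names are illustrative.

```cpp
#include <atomic>
#include <cstdint>

// Bit index of level 0 cell (x, y, z), with x, y, z in [0, 4).
// ASSUMPTION: Morton-style order; the upper three bits select the level 1 cell (byte),
// the lower three bits select the level 0 cell inside it.
inline int cellBit(int x, int y, int z) {
    int parent = (x >> 1) | ((y >> 1) << 1) | ((z >> 1) << 2);   // level 1 cell, 0..7
    int child  = (x & 1)  | ((y & 1) << 1)  | ((z & 1) << 2);    // cell within it, 0..7
    return parent * 8 + child;
}

// Update: set the occupancy bit atomically (host analogue of atomicOr).
inline void markOccupied(std::atomic<std::uint64_t>& d, int x, int y, int z) {
    d.fetch_or(std::uint64_t{1} << cellBit(x, y, z), std::memory_order_relaxed);
}

// Level 2: the whole octree is empty if the descriptor is zero.
inline bool level2Empty(std::uint64_t d) { return d == 0; }

// Level 1: a level 1 cell is empty if its byte is zero.
inline bool level1Empty(std::uint64_t d, int parent) {
    return ((d >> (parent * 8)) & 0xFFu) == 0;
}

// Level 0: a single cell is empty if its bit is zero.
inline bool level0Empty(std::uint64_t d, int x, int y, int z) {
    return ((d >> cellBit(x, y, z)) & 1u) == 0;
}
```

A raycaster can test the coarsest level first and skip the whole octree, or one of its eight sub-cells, with a single comparison before falling back to per-cell checks.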
130
RAYCASTING
Binary Encoded Occupancy Octree
[Chart: baseline vs. binary encoded occupancy octree, three measurements: 2.3 vs. 1.8, 4.4 vs. 3.4, 4.1 vs. 3.5]
SPEEDUP 1.3
132
OVERALL TIMINGS
[Chart: AVG TIME / MAX TIME for three measurements: 6.2 / 10.2, 7.9 / 12.2, 9.1 / 17; chart labels: 100%, 97.2%, 78.6%]
133
NEXT STEPS
COLOR INTEGRATION: RGB data for reconstruction & texturing
MULTI-GPU OPTIMIZATION: 2nd GPU to perform on-the-fly optimization of reconstruction
VR INTERACTION: VR controller for interactive reconstruction
DEEP LEARNING: Filtering & Optimization