RTX-RSim
Accelerated Vulkan Room Response Simulation for Time-of-Flight Imaging
Peter Thoman, Markus Wippler, Robert Hranitzky, and Thomas Fahringer peter.thoman@uibk.ac.at
Background and Motivation
IWOCL 2020 – RTX-RSim 2
In room response simulation for time-of-flight imaging, we are interested in
computing the propagation of light
from a light source (L) through a room
(defined by some geometry and surface properties G)
to a sensor array (S)
In the real world, L and S are part of a Time-of-flight (ToF) camera assembly.
Unlike in e.g. image rendering or lighting
computations, the goal of the simulation is to compute a radiosity time series for each geometric primitive
Based on this time series, which simulates the actual
photons received by a ToF camera sensor, scene depth can be reconstructed
With RSim, since the exact depth is known, different
scenes and reconstruction schemes can be easily evaluated
Use during development of better ToF hardware implementations or software algorithms
1. Read input data, including geometric primitives ($G$), their surface material information ($\rho$), and the initial impulse
2. Pre-computation of the per-triangle area ($A_i$)
3. Mutual signal delay computation, storing the signal delay for each triangle pair $(i,j)$ in $\tau_{ij}$
4. Mutual visibility computation, evaluating the energy transfer between each triangle pair stochastically and storing it in $K_{ij}$
5. For each timestep $t \in [0,T)$: propagate radiosity, computing $\mathit{rad}_{t,i}$ for each triangle $i$ in all pairs $(i,j)$, based on $K_{ij}$ and $\mathit{rad}_{t-1,i}$
6. Compute the distance from the light/sensor position to each triangle $i$, based on $\mathit{rad}_{[0,T),i}$
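As a rough illustration of steps 2 and 3, the sketch below computes a triangle area and a discretized signal delay for a triangle pair. It assumes (this is not stated in the slides) that the delay is the centroid-to-centroid distance divided by the speed of light and rounded to whole timesteps of length `dt`; all function names are hypothetical.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(a[0] ** 2 + a[1] ** 2 + a[2] ** 2)

def triangle_area(tri):
    """Step 2: area of a triangle given as three (x, y, z) vertices."""
    v0, v1, v2 = tri
    return 0.5 * norm(cross(sub(v1, v0), sub(v2, v0)))

def centroid(tri):
    return tuple(sum(v[k] for v in tri) / 3.0 for k in range(3))

def signal_delay_steps(tri_i, tri_j, dt):
    """Step 3 (assumed model): centroid distance / c, in whole timesteps."""
    d = norm(sub(centroid(tri_i), centroid(tri_j)))
    return round(d / (C * dt))

tri_a = ((0, 0, 0), (1, 0, 0), (0, 1, 0))
tri_b = ((0, 0, 3), (1, 0, 3), (0, 1, 3))
print(triangle_area(tri_a), signal_delay_steps(tri_a, tri_b, 1e-9))
```

Because the delay is this cheap to evaluate, it can be recomputed on demand rather than stored for all pairs.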
Analyse time complexity for each step of the algorithm.
Steps 1 and 2 iterate over $N$ triangles, with simple I/O operations and area computation for each element. Readily identified as $O(N)$ complexity.
Computing the propagation delay for each pair of triangles: $O(N^2)$. However, the fixed factor is low, and compared to the remaining phases, even $O(N^2)$ complexity is largely negligible.
Stochastically evaluate the visibility between every pair of triangles. A naïve implementation requires a ray-triangle intersection check against all other triangles in the scene. With $S$ stochastic samples: $O(N^3 \cdot S)$.
In practice, use a geometric acceleration structure. The current RSim on CPU uses octrees, resulting in a reduction of average-case query complexity from $O(N)$ to $O(\log N)$: $O(N^2 \cdot \log N \cdot S)$
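The naïve variant can be sketched as follows: for one triangle pair, sample `samples` segments between the two triangles and test each against every other triangle (the linear query that an octree or hardware acceleration structure would replace). This is a simplified illustration, not the RSim code. It returns only the unoccluded fraction and omits the energy transfer term, and it uses the standard Möller-Trumbore intersection test as a stand-in; all names are hypothetical.

```python
import random

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def segment_hits_triangle(origin, direction, tri, eps=1e-9):
    """Moller-Trumbore test; True if tri cuts origin + t*direction for 0 < t < 1."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return False  # segment is parallel to the triangle plane
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return False
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return False
    t = dot(e2, q) * inv
    return eps < t < 1.0 - eps

def sample_point(tri, r1, r2):
    """Uniform point on a triangle from two random numbers in [0, 1)."""
    if r1 + r2 > 1.0:
        r1, r2 = 1.0 - r1, 1.0 - r2
    v0, v1, v2 = tri
    return tuple(v0[k] + r1 * (v1[k] - v0[k]) + r2 * (v2[k] - v0[k])
                 for k in range(3))

def visibility(i, j, tris, samples, rng):
    """Fraction of sampled segments from triangle i to j that are unoccluded."""
    visible = 0
    for _ in range(samples):
        a = sample_point(tris[i], rng.random(), rng.random())
        b = sample_point(tris[j], rng.random(), rng.random())
        d = sub(b, a)
        blocked = any(k not in (i, j) and segment_hits_triangle(a, d, tris[k])
                      for k in range(len(tris)))  # O(N) query per sample
        visible += not blocked
    return visible / samples

tri_a = ((0, 0, 0), (1, 0, 0), (0, 1, 0))
tri_b = ((0, 0, 2), (1, 0, 2), (0, 1, 2))
occluder = ((-10, -10, 1), (10, -10, 1), (0, 20, 1))
print(visibility(0, 1, [tri_a, tri_b, occluder], 16, random.Random(0)))
```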
Uses signal delay $\tau_{ij}$ and mutual visibility information $K_{ij}$, as well as the previous radiosity up to the currently computed timestep, $\mathit{rad}_{[0,t),i}$. For each timestep $t$ and each pair $(i,j)$: propagate energy between the triangles in the pair from time $t - \tau_{ij}$ according to mutual visibility as well as their surface properties. $O(N^2 \cdot T)$
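A minimal sketch of this propagation loop is shown below. The update rule is an assumption made for illustration: each destination triangle receives the delayed radiosity of every source triangle, scaled by the mutual visibility entry and a scalar reflectance `rho`; the slides do not give the exact formula. Delays are assumed to be at least one timestep so that only earlier timesteps are read.

```python
def propagate(K, tau, rho, impulse, T):
    """Return rad[t][i], the radiosity of triangle i at timestep t.

    K[i][j]   : mutual visibility / transfer weight of the pair (i, j)
    tau[i][j] : signal delay in timesteps (assumed >= 1 for i != j)
    rho[i]    : scalar reflectance of triangle i (simplified material model)
    impulse   : impulse[t][i], light arriving directly from the source
    """
    n = len(rho)
    rad = [[0.0] * n for _ in range(T)]
    for t in range(T):                 # timesteps are inherently sequential
        for i in range(n):             # destination triangles
            acc = impulse[t][i]
            for j in range(n):         # source triangles
                if j == i:
                    continue
                t_src = t - tau[i][j]  # data-dependent read of earlier radiosity
                if t_src >= 0:
                    acc += rho[i] * K[i][j] * rad[t_src][j]
            rad[t][i] = acc
    return rad

K = [[0.0, 0.5], [0.5, 0.0]]
tau = [[0, 2], [2, 0]]
impulse = [[0.0, 0.0] for _ in range(5)]
impulse[0][0] = 1.0
rad = propagate(K, tau, [1.0, 1.0], impulse, 5)
print(rad[2][1], rad[4][0])
```

The three nested loops make the $N^2 \cdot T$ cost visible directly.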
Distance computation is usually based on cross-correlation of the radiosity time series: $O(N \cdot T^2)$. T is usually much smaller than N, and the fixed factor is very small as well. Usually negligible overall, similar to step 3.
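The quadratic cost in T comes from evaluating a correlation score for every candidate lag. The sketch below is a generic cross-correlation peak search, not the reconstruction scheme used in RSim. It converts the best lag into a path length, and whether that length must be halved for a round trip depends on how the time series was recorded, which is left to the caller.

```python
C = 299_792_458.0  # speed of light in m/s

def best_lag(emitted, received):
    """Lag in timesteps that maximizes the cross-correlation; O(T^2)."""
    T = len(received)
    best, best_score = 0, float("-inf")
    for lag in range(T):
        score = sum(emitted[t] * received[t + lag] for t in range(T - lag))
        if score > best_score:
            best, best_score = lag, score
    return best

def path_length(emitted, received, dt):
    """Optical path length in metres implied by the best lag."""
    return best_lag(emitted, received) * dt * C

emitted = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
received = [0.0, 0.0, 0.0, 0.8, 0.1, 0.0]
print(best_lag(emitted, received), path_length(emitted, received, 1e-9))
```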
The measured scaling trend matches the algorithmic complexity
Clearly, mutual visibility computation and radiosity simulation are the main priority
[Figure: relative performance (Small = 1) for the Small, Medium and Large scenes, split into Mutual Visibility, Radiosity Simulation and Other]
A Vulkan implementation needs to be massively data-parallel to be efficient
And we are constrained in the amount of data we can store on a GPU
Data-centric view of the algorithm
Contents                     Format                  Size
Triangles (G)                Indexed vertex buffer   $N$
Material information (ρ)     3 * FP32                $N$
Raytracing Buffers           Internal / opaque       $O(N)$
Sample Coordinates           2 * FP32                $S$
Mutual Visibility ($K_{ij}$) FP16                    $N^2$
Radiosity ($\mathit{rad}$)   4 * FP32                $N \cdot T$
Distance                     FP32                    $N$
Generally, $S \ll T \ll N$, therefore $K_{ij}$ dominates. FP16 is sufficient! Signal delay $\tau_{ij}$ is recomputed instead of stored.
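To make the dominance of the mutual visibility buffer concrete, the buffer formats listed above can be turned into a quick size estimate. The scene parameters in the example are hypothetical and were chosen only for illustration; the vertex and raytracing buffers are left out because their exact sizes are not given.

```python
def buffer_bytes(n, t, s):
    """Approximate buffer sizes in bytes for n triangles, t timesteps, s samples."""
    fp16, fp32 = 2, 4
    return {
        "material": n * 3 * fp32,
        "samples": s * 2 * fp32,
        "mutual_visibility": n * n * fp16,
        "radiosity": n * t * 4 * fp32,
        "distance": n * fp32,
    }

# Hypothetical scene: 100,000 triangles, 1,000 timesteps, 64 samples.
sizes = buffer_bytes(100_000, 1_000, 64)
for name, size in sizes.items():
    print(f"{name:>18}: {size / 1e9:8.3f} GB")
```

For these example values the mutual visibility buffer alone needs 20 GB, more than ten times the radiosity buffer, even though it uses the smallest element type.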
Schematic representation of HW raytracing process
[Figure: hardware raytracing process. The input geometry is used to build the bottom-level and top-level acceleration structures (AS). A descriptor set binds the buffers, and the shader binding table references the raygen, hit and miss shaders. Ray generation starts the fixed-function acceleration structure traversal, which invokes the closest hit shader on a hit and the miss shader otherwise.]
Geometry is static, so we can optimize the AS build for traversal speed rather than build/update performance
Descriptor Set: our RT shaders require read-only access to $G$, $\rho$, and the Sample Coordinates buffer, as well as write access to $K_{ij}$
Shaders: only require ray generation and a single hit and miss shader
Ray generation: generate $S$ rays for every pair of triangles (order independent, thus $N^2/2 - N$ required size, 1D grid)
Aggregate results and write to $K_{ij}$
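One way to drive such a 1D launch is to map each linear invocation index to an unordered triangle pair. The mapping below is an illustrative assumption, not the indexing used by the authors: it enumerates pairs $(i, j)$ with $j < i$, which gives $N(N-1)/2$ entries, so the exact launch size may differ from the figure stated above.

```python
import math

def pair_count(n):
    """Number of unordered pairs of distinct triangles."""
    return n * (n - 1) // 2

def pair_from_index(k):
    """Map a linear index k >= 0 to the pair (i, j) with j < i."""
    i = (1 + math.isqrt(1 + 8 * k)) // 2  # largest i with i*(i-1)/2 <= k
    j = k - i * (i - 1) // 2
    return i, j

pairs = [pair_from_index(k) for k in range(pair_count(4))]
print(pairs)
```

Using integer square roots keeps the mapping exact for large indices, where floating-point rounding could otherwise select the wrong row.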
Miss shader: trivial, simply set visible=false for use in the raygen shader
Closest hit: check whether the expected triangle was hit
Second compute-intensive phase, based on the mutual visibility result from HW raytracing
Implemented using Vulkan compute shaders
One shader invocation per time step
Important: parallelized in 1D over $N$, not 2D over $N^2$: slightly lower potential at small sizes, but less synchronization
Excerpt of core loop over
destination triangles
Note data-dependent access
to previous radiosity buffer
Recall that the mutual visibility buffer $K_{ij}$ requires $N^2$ entries
Therefore, GPU memory limits us to low triangle counts
Recomputation is not desirable: slowdown by at least a factor of 10
Solution: asynchronous streaming
Minimize performance impact by suitable chunking and latency hiding
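[Figure: double-buffered streaming timeline across timesteps $t_{n-1}$, $t_n$ and $t_{n+1}$. The mutual visibility data is split into chunks I, II and III. Each chunk of $K$ is copied into one of two buffers, $\mathit{buf}_A$ or $\mathit{buf}_B$, while a compute pass on the other buffer produces the radiosity for its chunk, e.g. copy $K_{II,j} \to \mathit{buf}_B$ runs alongside compute($\mathit{buf}_A$) $\to \mathit{rad}_{t_n,I}$. Arrows mark true dependences (a compute pass waits for its copy) and anti-dependences (a copy waits until the previous compute pass has finished reading the buffer).]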
Requires two extra chunk buffers for double buffering
Linear rather than quadratic in size!
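The double-buffering scheme can be sketched on the host side as follows. This is only an analogy under stated assumptions: a Python worker thread and list copies stand in for the GPU transfer queue and the two chunk buffers, and `compute` is a hypothetical callback for the per-chunk radiosity pass. It is not the Vulkan implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def stream_chunks(host_chunks, compute):
    """Run compute on each chunk while the next chunk is copied in the background."""
    if not host_chunks:
        return []
    results = []
    buffers = [None, None]  # the two chunk buffers, A and B
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(list, host_chunks[0])  # copy chunk 0 into A
        for c in range(len(host_chunks)):
            slot = c % 2
            # True dependence: compute may only start once its copy is done.
            buffers[slot] = pending.result()
            if c + 1 < len(host_chunks):
                # Anti-dependence: the other buffer was last read by the
                # previous compute call, which has finished, so the next
                # copy may overwrite it while this chunk is processed.
                pending = copier.submit(list, host_chunks[c + 1])
            results.append(compute(buffers[slot]))
    return results

print(stream_chunks([[1, 2], [3, 4], [5]], sum))
```

Only two chunk-sized buffers are alive at any time, independent of the number of chunks.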
The mutual visibility step, which generates $K_{ij}$, also requires streaming
Implementation is simpler: only need to stream the finished data out once
Also less performance critical, since mutual visibility computation has a higher per-element cost
We actually see a speedup with streaming in some results!
[Figure: stream-out timeline. Compute $K_{I,j}$, then stream out $K_{I,j}$ while computing $K_{II,j}$, then stream out $K_{II,j}$ while computing $K_{III,j}$, and finally stream out $K_{III,j}$.]
All results on an AMD Ryzen TR 2920X + NVIDIA GeForce RTX 2070 system
Note that CPU results are fully parallelized
[Figure: runtime for the Small, Medium and Large scenes, RTX-RSim (GPU) vs. RSim (CPU); logarithmic axis from 1 s to 10000 s]
CPU results roughly linear on logarithmic scale
GPU result worse at “Small” size (insufficient parallelism in radiosity computation)
Factor ~20 improvement over CPU at “Medium” and larger
Very high speedup in mutual visibility phase with hardware raytracing
Radiosity simulation limited by:
lack of parallelism at “Small” size
streaming requirements at “Large” size
[Figure: GPU speedup over CPU per phase; logarithmic axis from 1.0 to 100.0]
Phase                  Small   Medium  Large
Mutual Visibility      112.1   137.3   155.2
Radiosity Simulation   6.7     22.0    16.6
Other                  8.8     10.0    10.2
Raytracing actually benefits from streaming (hiding some transfer latency)
Roughly 40% performance impact on radiosity simulation due to streaming
Not ideal, but an order of magnitude better than recomputation
[Figure: impact of streaming on Raytracing, Simulation and Total for the Small and Medium scenes; axis from 0.0 to 1.2]
Using new raytracing hardware for accelerating room response simulation is both viable and effective
More than a factor of 100 improvement in raytracing-heavy phases compared to CPU
Vulkan compute shaders are a good cross-platform and cross-vendor alternative to e.g. CUDA, OpenCL and SYCL if direct interaction with graphics features is required
Streaming with full latency hiding allows overcoming GPU memory limits for this algorithm with moderate performance impact
But it is still limited by PCIe bandwidth
peter.thoman@uibk.ac.at
Partially funded by the FFG INPACT project.