slide-1
SLIDE 1

RTX-RSim

Accelerated Vulkan Room Response Simulation for Time-of-Flight Imaging

Peter Thoman, Markus Wippler, Robert Hranitzky, and Thomas Fahringer

peter.thoman@uibk.ac.at

IWOCL 2020

slide-2
SLIDE 2

Background and Motivation

IWOCL 2020 – RTX-RSim 2

slide-3
SLIDE 3

The Basic Idea

• In room response simulation for time-of-flight imaging, we are interested in computing the propagation of light
  • from a light source (L)
  • through a room (defined by some geometry and surface properties G)
  • to a sensor array (S)

[Figure: light source L, room geometry G, and sensor array S]

In the real world, L and S are part of a time-of-flight (ToF) camera assembly.

slide-4
SLIDE 4

The Goal

• Unlike in e.g. image rendering or lighting computations, the goal of the simulation is to compute a radiosity time series for each geometric primitive
• Based on this time series, which simulates the actual photons received by a ToF camera sensor, scene depth can be reconstructed
• With RSim, since the exact depth is known, different scenes and reconstruction schemes can be easily evaluated

[Figure: radiosity r plotted over time t]

→ Use during development of better ToF hardware implementations or software algorithms

slide-5
SLIDE 5

Algorithm Overview

1. Read input data, including geometric primitives ($G$), their surface material information ($\rho$), and initial impulse

2. Pre-computation of the per-triangle area ($A_i$)

3. Mutual signal delay computation, storing the signal delay for each triangle pair $(g_i, g_j)$ in $\tau_{ij}$

4. Mutual visibility computation, evaluating the energy transfer between each triangle pair stochastically and storing it in $K_{ij}$

5. For each timestep $t \in [0, T)$:
   • Propagate radiosity, computing $\mathit{rad}_{t,i}$ for each triangle $g_i$ in all pairs $(g_i, g_j)$ based on $K_{ij}$ and $\mathit{rad}_{t-1,i}$

6. Compute the distance from the light/sensor position to each triangle $g_i$, based on $\mathit{rad}_{[0,T),i}$

[Figure: triangle area $A_i$; signal delay $\tau_{ij}$ between triangles $g_i$ and $g_j$]

slide-6
SLIDE 6

Algorithm Performance and Data Requirement Analysis


slide-7
SLIDE 7
Algorithm Steps

  • 1. Input data prep.
  • 2. Pre-compute $A_i$
  • 3. Pre-compute $\tau_{ij}$
  • 4. Mutual visibility comp. → $K_{ij}$
  • 5. Radiosity propagation → $\mathit{rad}_{[0,T),i}$
  • 6. Compute distance

Analyse time complexity for each step of the algorithm.

slide-8
SLIDE 8
Algorithm Steps

Steps 1 and 2 iterate over $N$ triangles, with simple I/O operations and an area computation for each element. Readily identified as $O(N)$ complexity.
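As a minimal sketch of the per-triangle area pre-computation in step 2, assuming each triangle is given as three 3D vertex tuples (the function name `triangle_area` and the input layout are illustrative and not taken from RSim):

```python
import math

def triangle_area(a, b, c):
    """Area of triangle (a, b, c): half the length of the cross product of two edges."""
    ux, uy, uz = (b[k] - a[k] for k in range(3))
    vx, vy, vz = (c[k] - a[k] for k in range(3))
    cx, cy, cz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)

# One pass over all N triangles, hence O(N) work.
triangles = [
    ((0, 0, 0), (1, 0, 0), (0, 1, 0)),
    ((0, 0, 0), (2, 0, 0), (0, 2, 0)),
]
areas = [triangle_area(*tri) for tri in triangles]
```

Each triangle is handled independently of all others, which is why this step is linear in the triangle count.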

slide-9
SLIDE 9
Algorithm Steps

Step 3: computing the propagation delay for each pair of triangles → $O(N^2)$

However, the fixed factor is low, and compared to the remaining phases, even $O(N^2)$ complexity is largely negligible.
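The pairwise delay computation might look like the sketch below. It assumes that the delay is measured between triangle centroids and rounded to whole timesteps of length `dt`; the slides do not state how the distance between two triangles is defined, so both choices, as well as all function names, are assumptions made for illustration.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def centroid(tri):
    """Centroid of a triangle given as three 3D vertex tuples."""
    return tuple(sum(v[k] for v in tri) / 3.0 for k in range(3))

def signal_delay_steps(tri_i, tri_j, dt):
    """Delay between two triangles in whole timesteps (assumed: centroid distance)."""
    distance = math.dist(centroid(tri_i), centroid(tri_j))
    return round(distance / (SPEED_OF_LIGHT * dt))

def all_delays(triangles, dt):
    """Delay for every ordered pair: N * N cheap evaluations, i.e. O(N^2)."""
    n = len(triangles)
    return [[signal_delay_steps(triangles[i], triangles[j], dt) for j in range(n)]
            for i in range(n)]

near = ((0, 0, 0), (3, 0, 0), (0, 3, 0))
far = ((0, 0, 3), (3, 0, 3), (0, 3, 3))
delays = all_delays([near, far], dt=1e-9)  # centroids are 3 m apart, 1 ns steps
```

Because a single delay is only a handful of arithmetic operations, evaluating `signal_delay_steps` on demand is a plausible alternative to storing the full $N \times N$ table.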

slide-10
SLIDE 10
Algorithm Steps

Step 4: stochastically evaluate the visibility between every pair of triangles. A naïve implementation requires a ray-triangle intersection check against all other triangles in the scene. With $S$ stochastic samples:

→ $O(N^3 \cdot S)$

In practice, use a geometric acceleration structure. The current RSim on CPU uses octrees, resulting in a reduction of average-case query complexity from $O(N)$ to $O(\log N)$.

→ $O(N^2 \cdot \log N \cdot S)$
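To make the naïve variant concrete, the following sketch estimates the unoccluded fraction between two triangles by sampling point pairs and testing the connecting segment against every other triangle with the Möller–Trumbore intersection test. It is a simplified stand-in: it only counts blocked and unblocked samples and ignores any geometric weighting of the energy transfer, and all names are hypothetical rather than taken from RSim.

```python
import random

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])

def sample_point(tri, rng):
    """Uniformly distributed random point on a triangle."""
    u, v = rng.random(), rng.random()
    if u + v > 1.0:
        u, v = 1.0 - u, 1.0 - v
    a, b, c = tri
    return tuple(a[k] + u * (b[k] - a[k]) + v * (c[k] - a[k]) for k in range(3))

def segment_hits(p, q, tri, eps=1e-9):
    """Moeller-Trumbore test: does the open segment p -> q intersect tri?"""
    a, b, c = tri
    d = sub(q, p)
    e1, e2 = sub(b, a), sub(c, a)
    h = cross(d, e2)
    det = dot(e1, h)
    if abs(det) < eps:
        return False  # segment is parallel to the triangle plane
    s = sub(p, a)
    u = dot(s, h) / det
    if u < 0.0 or u > 1.0:
        return False
    qv = cross(s, e1)
    v = dot(d, qv) / det
    if v < 0.0 or u + v > 1.0:
        return False
    t = dot(e2, qv) / det
    return eps < t < 1.0 - eps  # hit strictly between the two endpoints

def visible_fraction(i, j, triangles, samples, rng):
    """Fraction of sampled segments between triangles i and j that are unoccluded."""
    visible = 0
    for _ in range(samples):
        p, q = sample_point(triangles[i], rng), sample_point(triangles[j], rng)
        # Naive O(N) occlusion query: test against every other triangle.
        blocked = any(segment_hits(p, q, triangles[k])
                      for k in range(len(triangles)) if k not in (i, j))
        visible += not blocked
    return visible / samples
```

Calling `visible_fraction` for all pairs performs on the order of $N^2 \cdot S$ occlusion queries of $O(N)$ cost each; an octree or a hardware acceleration structure replaces the inner `any(...)` scan.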

slide-11
SLIDE 11
Algorithm Steps

Step 5: uses signal delay $\tau_{ij}$ and mutual visibility information $K_{ij}$, as well as the previous radiosity up to the currently computed timestep, $\mathit{rad}_{[0,t),i}$.

For each timestep $t$ and each pair $(g_i, g_j)$: propagate energy between the triangles in the pair from time $t - \tau_{ij}$ according to mutual visibility as well as their surface properties.

→ $O(N^2 \cdot T)$
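A possible shape of this propagation loop is sketched below. The update rule (reflectance times the visibility-weighted sum of delayed source radiosity, plus an emission term), the scalar radiosity values, and all names are assumptions chosen for illustration; the slides do not give the exact formula.

```python
def propagate(K, tau, rho, emit, steps):
    """Return rad[t][i], the radiosity of triangle i at timestep t (scalar for brevity).

    K[i][j]    mutual visibility weight between triangles i and j
    tau[i][j]  signal delay between i and j in timesteps (assumed >= 1 for i != j)
    rho[i]     reflectance of triangle i (assumed scalar)
    emit[t][i] directly emitted impulse arriving at triangle i at timestep t
    """
    n = len(K)
    rad = [[0.0] * n for _ in range(steps)]
    for t in range(steps):
        for i in range(n):  # destination triangles are independent of each other
            gathered = 0.0
            for j in range(n):
                src_t = t - tau[i][j]  # data-dependent index into earlier radiosity
                if i != j and 0 <= src_t < t:
                    gathered += K[i][j] * rad[src_t][j]
            rad[t][i] = emit[t][i] + rho[i] * gathered
    return rad

K = [[0.0, 0.5], [0.5, 0.0]]
tau = [[0, 2], [2, 0]]
rho = [1.0, 0.5]
emit = [[1.0, 0.0]] + [[0.0, 0.0] for _ in range(4)]  # impulse on triangle 0 at t = 0
rad = propagate(K, tau, rho, emit, steps=5)
```

The three nested loops make the $O(N^2 \cdot T)$ cost visible. Each destination triangle `i` only writes its own entry `rad[t][i]`, so the middle loop can be distributed over parallel work items without synchronization inside a timestep.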

slide-12
SLIDE 12
Algorithm Steps

Step 6: distance computation is usually based on cross-correlation of radiosity time series.

→ $O(N \cdot T^2)$

T is usually much smaller than N, and the fixed factor is very small as well. Usually negligible overall, similar to step 3.
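One way to turn a radiosity time series into a distance is sketched here: find the lag that maximizes the cross-correlation with the emitted impulse and convert it to metres. This is only an illustration under stated assumptions; the actual reconstruction scheme used with RSim is not shown on the slides, and whether the lag corresponds to a one-way or a round-trip path depends on how the series is defined, so that choice is exposed as the hypothetical `round_trip` parameter.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def best_lag(emitted, received):
    """Lag in timesteps that maximizes the cross-correlation of two equal-length series.

    Tries T lags with up to T products each, i.e. O(T^2) per triangle.
    """
    T = len(received)

    def corr(lag):
        return sum(emitted[t] * received[t + lag] for t in range(T - lag))

    return max(range(T), key=corr)

def distance_from_series(emitted, received, dt, round_trip=False):
    """Convert the best lag to a distance; halve it if the lag covers a round trip."""
    path = SPEED_OF_LIGHT * dt * best_lag(emitted, received)
    return 0.5 * path if round_trip else path

emitted = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
received = [0.0, 0.0, 0.0, 0.8, 0.1, 0.0]
lag = best_lag(emitted, received)
distance = distance_from_series(emitted, received, dt=1e-9)
```

Running this once per triangle gives the $O(N \cdot T^2)$ bound stated above.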

slide-13
SLIDE 13

Measured Performance

• Scaling trend matches observations on algorithmic complexity
• Clearly, mutual visibility computation and radiosity simulation are the main priority

[Figure: bar chart of relative performance (Small = 1) for the Small, Medium, and Large scenes, broken down into Mutual Visibility, Radiosity Simulation, and Other]

slide-14
SLIDE 14

Vulkan Raytracing and Compute for Room Response Simulation


slide-15
SLIDE 15

Data Management

• A Vulkan implementation needs to be massively data-parallel to be efficient
• And we are constrained in the amount of data we can store on a GPU

→ Data-centric view of the algorithm

slide-16
SLIDE 16

Data Management

Contents                       Format                  Size
Triangles ($G$)                Indexed vertex buffer   $N$
Material information ($\rho$)  3 * FP32                $N$
Raytracing Buffers             Internal / opaque       $O(N)$
Sample Coordinates             2 * FP32                $S$
Mutual Visibility ($K_{ij}$)   FP16                    $N^2$
Radiosity ($\mathit{rad}$)     4 * FP32                $N \cdot T$
Distance                       FP32                    $N$

• Generally, $S \ll T \ll N$, therefore $K_{ij}$ dominates. → FP16 sufficient!
• Signal delay $\tau_{ij}$ recomputed instead of stored.
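A quick back-of-the-envelope calculation shows how strongly the quadratic buffer dominates. The sketch below covers only the buffers whose element format is listed in the table; the triangle and raytracing buffers are left out because their exact layout is not given. The scene parameters are hypothetical values picked for illustration.

```python
FP16_BYTES = 2
FP32_BYTES = 4

def buffer_bytes(n, t, s):
    """Sizes in bytes for n triangles, t timesteps and s samples."""
    return {
        "material": n * 3 * FP32_BYTES,
        "sample_coordinates": s * 2 * FP32_BYTES,
        "mutual_visibility": n * n * FP16_BYTES,
        "radiosity": n * t * 4 * FP32_BYTES,
        "distance": n * FP32_BYTES,
    }

# Hypothetical scene: 100,000 triangles, 1,000 timesteps, 64 samples.
sizes = buffer_bytes(n=100_000, t=1_000, s=64)
largest = max(sizes, key=sizes.get)
```

For these numbers the mutual visibility buffer needs 20 GB even at FP16, more than ten times the radiosity buffer (1.6 GB) and far beyond the memory of a typical consumer GPU.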

slide-17
SLIDE 17

Hardware Raytracing for Mutual Visibility

• Schematic representation of HW raytracing process

[Figure: flow diagram. Input Geometry → Build Acceleration Structures (Bottom-level AS, Top-level AS). Descriptor Set (buffers) and Shader Binding Table (Raygen, Hit, Miss) feed the Raytracing stage: Ray Generation → Acceleration Structure Traversal → Hit? → Closest Hit (yes) or Miss (no) → $K_{ij}$. Legend: GPU data structures, RT shader, dataset, operation, RT shader invocation, fixed-function GPU operation]

slide-18
SLIDE 18

Hardware Raytracing for Mutual Visibility

• Geometry is static → we can optimize the AS build for traversal speed rather than build/update performance

slide-19
SLIDE 19

Hardware Raytracing for Mutual Visibility

• Descriptor Set: our RT shaders require read-only access to $G$, $\rho$, and the Sample Coordinates buffer, as well as write access to $K_{ij}$
• Shaders: only require ray generation and a single hit and miss shader

slide-20
SLIDE 20

Hardware Raytracing for Mutual Visibility

Raytracing

• Ray generation: generate $S$ rays for every pair of triangles (order independent, thus $N^2/2 - N$ required size, 1D grid)
• Aggregate results and write to $K_{ij}$
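A 1D launch over unordered pairs needs a mapping from the linear invocation index to a triangle pair. The sketch below shows one common way to do this with integer arithmetic only; it is an assumption about how such an index could be decoded, not the mapping used in RTX-RSim. It enumerates the $N(N-1)/2$ unordered pairs of distinct triangles, which has the same order of magnitude as the grid size quoted above.

```python
import math

def pair_count(n):
    """Number of unordered pairs of distinct triangles."""
    return n * (n - 1) // 2

def pair_from_index(k):
    """Map a linear index k >= 0 to a pair (i, j) with j < i.

    Pairs are ordered as (1,0), (2,0), (2,1), (3,0), ... so that
    k = i * (i - 1) / 2 + j.
    """
    i = (1 + math.isqrt(8 * k + 1)) // 2
    j = k - i * (i - 1) // 2
    return i, j

n = 5
pairs = [pair_from_index(k) for k in range(pair_count(n))]
```

Using `math.isqrt` avoids floating-point rounding problems for large indices; a shader would need an equivalent exact integer square root or a corrected floating-point one.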

slide-21
SLIDE 21

Hardware Raytracing for Mutual Visibility

Raytracing

• Miss shader: trivial, simply set visible=false for use in the raygen shader
• Closest hit: check whether the expected triangle was hit

slide-22
SLIDE 22

Compute Shader Radiosity Simulation

• Second compute-intensive phase, based on the mutual visibility result from HW raytracing
• Implemented using Vulkan compute shaders
• One shader invocation per time step
• Important: parallelized in 1D over $N$, not 2D over $N^2$
  → slightly lower potential at small sizes, but less synchronization

slide-23
SLIDE 23

Simplified Radiosity Compute Shader

• Excerpt of core loop over destination triangles
• Note data-dependent access to previous radiosity buffer

slide-24
SLIDE 24

Data Streaming with Latency Hiding


slide-25
SLIDE 25

Streaming Motivation

• Recall that the mutual visibility buffer $K_{ij}$ requires $N^2$ entries
• Therefore, GPU memory limits the simulation to low triangle counts
• Recomputation is not desirable → slowdown by at least a factor of 10
• Solution: asynchronous streaming
  → Minimize performance impact by suitable chunking and latency hiding

slide-26
SLIDE 26

RTX-RSim Streaming Scheme

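[Figure: timeline of the streaming scheme over chunks I, II, III of $K_{ij}$ and timesteps $t_{n-1}$, $t_n$, $t_{n+1}$, with arrows marking true dependences and anti-dependences]

Slot       Copy                          Compute
…          … ⟶ buf_B                     … ⟶ $\mathit{rad}_{t_{n-1},III}$
chunk I    copy $K_{II,j}$ ⟶ buf_A       compute(buf_B) ⟶ $\mathit{rad}_{t_n,I}$
chunk II   copy $K_{III,j}$ ⟶ buf_B      compute(buf_A) ⟶ $\mathit{rad}_{t_n,II}$
chunk III  copy $K_{I,j}$ ⟶ buf_A        compute(buf_B) ⟶ $\mathit{rad}_{t_n,III}$
chunk I    copy $K_{II,j}$ ⟶ buf_B       compute(buf_A) ⟶ $\mathit{rad}_{t_{n+1},I}$

• Requires two extra chunk buffers for double buffering
  → Linear rather than quadratic in size!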
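The double-buffering idea can be illustrated on the host side with a background copy thread: while one buffer is being processed, the next chunk is copied into the other buffer. This is only an analogy under stated assumptions. It uses Python threads instead of Vulkan transfer and compute queues, and the names `stream_and_compute`, `copy_in`, and `compute` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def stream_and_compute(host_chunks, compute, timesteps):
    """Process every chunk once per timestep using two alternating buffers."""
    schedule = [(t, c) for t in range(timesteps) for c in range(len(host_chunks))]
    if not schedule:
        return []
    bufs = [None, None]  # the two chunk-sized "device" buffers

    def copy_in(slot, chunk_id):
        # Stands in for an asynchronous host-to-device transfer.
        bufs[slot] = list(host_chunks[chunk_id])

    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(copy_in, 0, schedule[0][1])
        for step, (t, chunk_id) in enumerate(schedule):
            pending.result()  # true dependence: the chunk must have arrived
            slot = step % 2
            if step + 1 < len(schedule):
                # Anti-dependence: the other buffer is free again because the
                # compute that read it finished in the previous iteration.
                pending = copier.submit(copy_in, 1 - slot, schedule[step + 1][1])
            results.append((t, chunk_id, compute(bufs[slot])))
    return results

chunks = [[1, 2], [3, 4], [5, 6]]
out = stream_and_compute(chunks, compute=sum, timesteps=2)
```

Only two chunk-sized buffers exist at any time, regardless of how many chunks the full data set is split into.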

slide-27
SLIDE 27

Streaming for Mutual Visibility

• Mutual Visibility step generates $K_{ij}$ → also requires streaming
• Implementation simpler, only need to stream the finished data out once
• Also less performance critical, since mutual visibility computation has higher per-element cost
• We actually see speedup with streaming in some results!

[Figure: timeline with a compute row (Compute $K_{I,j}$, Compute $K_{II,j}$, Compute $K_{III,j}$) and a stream-out row (Stream out $K_{I,j}$, Stream out $K_{II,j}$, Stream out $K_{III,j}$)]

slide-28
SLIDE 28

Performance Evaluation

All results on an AMD Ryzen TR 2920X + NVIDIA GeForce RTX 2070 system.

Note that CPU results are fully parallelized.

slide-29
SLIDE 29

Overall CPU vs. RTX-RSim Comparison

[Figure: execution time on a logarithmic axis (1 s to 10000 s) for the Small, Medium, and Large scenes, RTX-RSim (GPU) vs. RSim (CPU)]

• CPU results roughly linear on logarithmic scale
• GPU result worse at “Small” size (insufficient parallelism in radiosity comp.)
• Factor ~20 improvement over CPU at “Medium” and larger

slide-30
SLIDE 30

Speedup of individual phases

• Very high speedup in mutual visibility phase with hardware raytracing
• Radiosity simulation limited by:
  • lack of parallelism at “Small” size
  • streaming requirements at “Large” size

[Figure data: speedup per phase, logarithmic axis from 1.0 to 100.0]

Phase                  Small   Medium  Large
Mutual Visibility      112.1   137.3   155.2
Radiosity Simulation     6.7    22.0    16.6
Other                    8.8    10.0    10.2

slide-31
SLIDE 31

Streaming Performance Impact

• Raytracing actually benefits from streaming (hiding some transfer latency)
• Roughly 40% performance impact on radiosity simulation due to streaming
• Not ideal, but an order of magnitude better than recomputation

[Figure: relative performance (1.0 = no streaming, axis from 0.0 to 1.2) of Raytracing, Simulation, and Total for the Small and Medium scenes]

slide-32
SLIDE 32

Summary & Conclusion


slide-33
SLIDE 33

Conclusion

• Using new raytracing hardware for accelerating room response simulation is both viable and effective
  • Over factor 100 improvement in raytracing-heavy phases compared to CPU
• Vulkan compute shaders are a good cross-platform and cross-vendor alternative to e.g. CUDA, OpenCL and SYCL if direct interaction with graphics features is required
• Streaming with full latency hiding allows overcoming GPU memory limits for this algorithm with moderate performance impact
  • But is still limited by PCIe bandwidth

slide-34
SLIDE 34

Thank you for your attention!

peter.thoman@uibk.ac.at

Partially funded by the FFG INPACT project.