Blasting Sand with CUDA: MPM Sand Simulation for VFX
Gergely Klár
DreamWorks Animation
[Figure: three panels, each showing a step from t^n to t^(n+1)]
Grid influence
Naïve Particles-to-Grid
Gather Particles-to-Grid
Our Solution
- Each particle is read only once,
- We efficiently use shared memory for the grids,
- We significantly reduce the number of atomic
operations,
- And our secret sauce: a special data structure
for particle queries.
[Figure: grid regions, each labeled "1 CUDA Block"]
CellBins → ParticleIDs → Actual particle data
TileBins → CellBins → ParticleIDs → Actual particle data
- In each block/tile:
  – Get blockIdx
  – Cells in the tile are TileBins[blockIdx-1]..TileBins[blockIdx]-1
  – Get a cellId for each warp from this list
- Each thread works on two affected grid nodes
- Particles of a cell are
CellBins[cellId-1]..CellBins[cellId]-1
- Compute the contribution from the particle
- Store in shared memory
  – Write back to global memory
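The traversal above can be sketched on the CPU. This is a serial stand-in, not the kernel from the talk: the outer loop plays the role of one CUDA block per tile, the middle loop the role of one warp per cell, and a local buffer stands in for shared memory. To stay short it only sums particle mass per non-empty cell instead of computing weighted contributions to grid nodes. The names `binBegin` and `massPerCell` are hypothetical, and bins are assumed to be inclusive running sums, so the first bin starts at 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using u32 = std::uint32_t;

// Start of bin i in an inclusive running sum: bins[i-1], or 0 for the first bin.
inline u32 binBegin(const std::vector<u32>& bins, std::size_t i) {
    return i == 0 ? 0u : bins[i - 1];
}

// Sums particle mass per non-empty cell, in cell-bin order (simplified payload).
std::vector<float> massPerCell(const std::vector<u32>& tileBins,
                               const std::vector<u32>& cellBins,
                               const std::vector<u32>& particleIds,
                               const std::vector<float>& mass) {
    // The last cell bin must account for every particle.
    assert(cellBins.empty() || cellBins.back() == particleIds.size());
    std::vector<float> global(cellBins.size(), 0.0f);

    for (std::size_t tile = 0; tile < tileBins.size(); ++tile) {  // one block per tile
        const u32 firstCell = binBegin(tileBins, tile);
        const u32 endCell = tileBins[tile];
        std::vector<float> shared(endCell - firstCell, 0.0f);     // stand-in for shared memory

        for (u32 cell = firstCell; cell < endCell; ++cell) {      // one warp per cell
            // Particles of this cell: CellBins[cell-1] .. CellBins[cell]-1.
            for (u32 k = binBegin(cellBins, cell); k < cellBins[cell]; ++k) {
                shared[cell - firstCell] += mass[particleIds[k]]; // each particle read once
            }
        }
        for (u32 c = 0; c < endCell - firstCell; ++c) {           // write back to global
            global[firstCell + c] = shared[c];
        }
    }
    return global;
}
```

Because each tile owns a disjoint range of cells in this simplified version, the write-back needs no atomics. In the real scatter, particles near a tile border also affect nodes of neighboring tiles, which is presumably where the remaining atomic operations occur.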
Tile & Cell Keys
- Particle coordinates: (px, py, pz)
- Cell coordinates:
(ci, cj, ck) = ⌊(px, py, pz)/Δx⌋
- Tile and in-tile coordinates:
(ci, cj, ck) = (ti, tj, tk)∙TILE_SIZE + (ri, rj, rk)
Key layout (32-bit unsigned integer):
tj      ti      tk      rj      rk      ri
7 bits  7 bits  7 bits  3 bits  3 bits  3 bits
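A minimal sketch of how such a key could be packed. Two things here are assumptions, not stated on the slide: TILE_SIZE = 8 (inferred from the 3 in-tile bits per axis), and the fields being stored from most to least significant in the order listed (tj, ti, tk, rj, rk, ri). The names `cellCoord`, `makeKey`, and `TILE_MASK` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

using u32 = std::uint32_t;

constexpr u32 TILE_BITS = 7;                     // bits per tile coordinate
constexpr u32 LOCAL_BITS = 3;                    // bits per in-tile coordinate
constexpr u32 TILE_SIZE = 1u << LOCAL_BITS;      // assumed: 8 cells per tile edge
constexpr u32 LOCAL_MASK = TILE_SIZE - 1;
// Keeps only the 21 tile bits of a key (bits 9..29).
constexpr u32 TILE_MASK = ((1u << (3 * TILE_BITS)) - 1) << (3 * LOCAL_BITS);

// Cell coordinate along one axis: floor(p / dx). Assumes p >= 0.
inline u32 cellCoord(float p, float dx) {
    assert(p >= 0.0f && dx > 0.0f);
    return static_cast<u32>(std::floor(p / dx));
}

// Split each cell coordinate into tile and in-tile parts, then pack them.
inline u32 makeKey(u32 ci, u32 cj, u32 ck) {
    const u32 maxCell = TILE_SIZE << TILE_BITS;  // 1024 cells per axis
    assert(ci < maxCell && cj < maxCell && ck < maxCell);
    const u32 ti = ci >> LOCAL_BITS, ri = ci & LOCAL_MASK;
    const u32 tj = cj >> LOCAL_BITS, rj = cj & LOCAL_MASK;
    const u32 tk = ck >> LOCAL_BITS, rk = ck & LOCAL_MASK;
    // Tile bits occupy the high end so that sorting groups keys by tile.
    return (tj << 23) | (ti << 16) | (tk << 9) | (rj << 6) | (rk << 3) | ri;
}
```

With this layout, `key & TILE_MASK` is identical for all cells of one tile, which is the property the masked RLE relies on.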
Binning pipeline:
- sort (Tile & Cell Keys, Initial Particle IDs) → Particle IDs
- RLE → inc. sum → Cell Bins
- masked RLE → inc. sum → Tile Bins
Tile & Cell Keys
- When sorted as uint32s, keys of the
same tile will be consecutive
- Run-length encoding (RLE) counts the number of
particles per cell
- The running sum of the counts gives
the offsets to particles
- RLE with a mask for the
tile bits counts the number of non-empty cells per tile
- The running sum of these counts
gives the offsets to cells
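The sort-RLE-scan steps above can be sketched serially as follows. This is a CPU illustration under stated assumptions, not the GPU implementation: the masked RLE is applied to the list of unique cell keys produced by the first RLE (one reading of the slide that yields cells per tile rather than particles per tile), and the names `Bins`, `runLengths`, and `buildBins` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

using u32 = std::uint32_t;

struct Bins {
    std::vector<u32> particleIds;  // particle indices sorted by key
    std::vector<u32> cellKeys;     // one key per non-empty cell (assumed helper array)
    std::vector<u32> cellBins;     // inclusive sum of particles per cell
    std::vector<u32> tileBins;     // inclusive sum of non-empty cells per tile
};

// Run-length encode `keys` under `mask`: returns the length of each run and,
// if requested, appends the first key of each run to `firstKeys`.
static std::vector<u32> runLengths(const std::vector<u32>& keys, u32 mask,
                                   std::vector<u32>* firstKeys = nullptr) {
    std::vector<u32> counts;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i == 0 || (keys[i] & mask) != (keys[i - 1] & mask)) {
            counts.push_back(0);
            if (firstKeys) firstKeys->push_back(keys[i]);
        }
        ++counts.back();
    }
    return counts;
}

Bins buildBins(const std::vector<u32>& keys, u32 tileMask) {
    Bins b;
    // Initial particle IDs 0..n-1, then sort them by key.
    b.particleIds.resize(keys.size());
    std::iota(b.particleIds.begin(), b.particleIds.end(), 0u);
    std::stable_sort(b.particleIds.begin(), b.particleIds.end(),
                     [&](u32 a, u32 c) { return keys[a] < keys[c]; });

    std::vector<u32> sorted(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i) sorted[i] = keys[b.particleIds[i]];

    // RLE over full keys: particles per cell; running sum gives particle offsets.
    b.cellBins = runLengths(sorted, 0xFFFFFFFFu, &b.cellKeys);
    std::partial_sum(b.cellBins.begin(), b.cellBins.end(), b.cellBins.begin());

    // RLE over tile bits only: non-empty cells per tile; running sum gives cell offsets.
    b.tileBins = runLengths(b.cellKeys, tileMask);
    std::partial_sum(b.tileBins.begin(), b.tileBins.end(), b.tileBins.begin());
    return b;
}
```

On the GPU each step maps to a standard parallel primitive (sort by key, run-length encode, inclusive scan). Libraries such as CUB and Thrust provide these, though the slides do not say which implementation was used.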
Results
Overall
[Chart: milliseconds per time step (axis 200 to 1000) vs. # of particles (262K, 884K, 2,097K, 7,000K); series: GPU, CPU]
Milliseconds per time step. Smaller is better.
GPU: NVIDIA Quadro K5200
CPU: Intel Xeon CPU E5-2697 v3 @ 2.60GHz w/ 28 cores
Particles to Grids
[Chart: milliseconds per time step (axis 100 to 600) vs. # of particles (262K, 884K, 2,097K, 7,000K)]
Grids to Particles
[Chart: milliseconds per time step (axis 100 to 600) vs. # of particles (262K, 884K, 2,097K, 7,000K)]
Milliseconds per time step. Smaller is better.
Summary
- Particle binning with sort-RLE-scan
- Breaking the domain into tiles that fit in shared
memory
- Processing particles of a cell by a single warp
Special thanks to:
- Ken Museth
- Stephen Jones
- Jeff Budsberg
- Lawrence Lee
- Rob Tesdahl
- David Tonnesen
- Ibrahim Sani