High-Performance Physics Solver Design for Next Generation Consoles
Vangelis Kokkevis Steven Osman Eric Larsen
Simulation Technology Group Sony Computer Entertainment America US R&D
This Talk
Optimizing physics simulation on a next-generation console
Focus on CELL architecture
Cloth, Rigid Bodies, Fluids, Particles
Looking for opportunities to parallelize processing
High Level – Many independent solvers on multiple cores
Low Level – One solver, one/multiple cores
Coding with small memory in mind
Streaming
Batching up work
Software Caching
Speeding up processing within each unit
SIMD processing, instruction scheduling
Double-buffering
Parallelizing/optimizing existing code
Too much material for a 1-hour talk
Will provide references to techniques
Concentrate on actual results
Concepts should be applicable beyond CELL
[Diagram: the CELL processor's eight SPEs — each SPE contains an SPU, a 256K Local Store (LS), and a DMA controller]
There’s more of them!
SPUs have greater bandwidth & performance
PPU is busy doing other stuff
[Diagram: the eight SPEs (SPU + DMA + 256K LS each) and the PPU with its L1/L2 caches, all connected to Main Memory]
Limited LS size, shared between code and data
Random accesses of main memory are slow
[Diagram: a cloth mesh triangle — masses m1, m2, m3 connected by distance constraints d1, d2, d3]
T. Jakobsen, Advanced Character Physics, GDC 2001
A. Meggs, Taking Real-Time Cloth Beyond Curtains, GDC 2005
A lot, surprisingly…
Look for opportunities to stream data
Keep in LS only data required for each step
[Diagram: double-buffered streaming — while batch B0 is processed and DMA'd out, batch B1 is DMA'd in; the pattern repeats (DMA_OUT B0 / Process B1 / DMA_IN B2, then DMA_OUT B1 / Process B2 / DMA_IN B3), so DMA transfers overlap computation]
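A minimal sketch of this double-buffered streaming loop (not from the talk), written against the Cell SDK MFC intrinsics in spu_mfcio.h; the batch size, effective-address layout, and process_batch() kernel are assumptions for illustration:

#include <spu_mfcio.h>
#include <stdint.h>

#define BATCH_SIZE 4096   // bytes per batch (illustrative); must be a multiple of 16

static char buf[2][BATCH_SIZE] __attribute__((aligned(128)));   // two LS buffers

extern void process_batch(char* data, unsigned int size);       // assumed work kernel

void stream_batches(uint64_t ea_in, uint64_t ea_out, unsigned int num_batches)
{
    unsigned int cur = 0;
    // Prime the pipeline: start fetching batch 0 into buffer 0 (tag = buffer index).
    mfc_get(buf[0], ea_in, BATCH_SIZE, 0, 0, 0);

    for (unsigned int i = 0; i < num_batches; ++i) {
        unsigned int next = cur ^ 1;
        if (i + 1 < num_batches) {
            // Start fetching batch i+1 into the other buffer. The barrier form (getb)
            // keeps it ordered after the earlier put that wrote that buffer out.
            mfc_getb(buf[next], ea_in + (uint64_t)(i + 1) * BATCH_SIZE,
                     BATCH_SIZE, next, 0, 0);
        }

        // Wait only on the current buffer's tag: its incoming get is now complete.
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process_batch(buf[cur], BATCH_SIZE);

        // Write the processed batch back; completion is checked the next time
        // this buffer's tag is waited on.
        mfc_put(buf[cur], ea_out + (uint64_t)i * BATCH_SIZE, BATCH_SIZE, cur, 0, 0);

        cur = next;
    }

    // Drain any outstanding write-backs before returning.
    mfc_write_tag_mask((1 << 0) | (1 << 1));
    mfc_read_tag_status_all();
}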
Number of verts processed per frame depends on …
Fixing edge lengths requires random memory accesses
struct Edge { int v1; int v2; float restLen; };

[Diagram: an edge constraint between particles p[v1] and p[v2]]

// Move both endpoints half the error along the edge so its length returns to restLen
Vector3 d = p[v2] - p[v1];
float len = sqrt(dot(d, d));
float diff = (len - restLen) / len;
p[v1] += d * 0.5f * diff;
p[v2] -= d * 0.5f * diff;
Keep all particle positions in LS
Stream in edge data
In 200K we can fit 200KB / 16B > 12K vertices
Assumes easy random access to memory
Processes tasks sequentially (no parallelism)
[Diagram: the rigid body pipeline — current body positions → Broadphase Collision Detection → potentially colliding body pairs → Narrowphase Collision Detection → points of contact between bodies → Constraint Prep → constraint equations → Constraint Solve → updated body velocities → Integration → new body positions]
[Chart: cumulative frame time in microseconds over ~2000 frames, broken down into Broadphase, Narrowphase, Constraint Prep, Solver, Integration, and Other]
“Gather” operation (extract data from PhysX data structures and pack it in MM)
DMA packed data from MM to LS
Process data and store output in LS
DMA output to MM
“Scatter” operation (unpack output data and put it back in PhysX data structures)
Required PhysX data is not conveniently packed
Data is often not aligned
We need to use PhysX data structures to avoid …
Solutions:
Use list DMAs to bring in data
Modify existing code to force alignment (see the sketch below)
Change PhysX code to work with new data structures
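For the alignment point, a minimal illustration (GCC-style attribute as used by the SPU/PPU toolchains; ContactPoint is a made-up structure, not a PhysX type): DMA works best when a structure starts on a 16-byte (ideally 128-byte) boundary and its size is a multiple of 16 bytes.

// Hypothetical structure padded and aligned so it can be DMA'd to an SPU directly.
struct ContactPoint {
    float position[3];
    float normal[3];
    float depth;
    float pad;                      // pad the size up to a multiple of 16 bytes (32 here)
} __attribute__((aligned(16)));     // force 16-byte alignment of every instance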
[Diagram: SPU task descriptions reference work batch buffers in MM (batch inputs/outputs), which are gathered from and scattered back to the PhysX data structures]
A list of object pairs that may be colliding
Want to do contact processing on SPUs
Pairs list has references to geometry
Same bodies may be in several pairs
Geometry may be instanced for different bodies
Can only access main memory with DMA
No hardware cache
Data reuse must be explicit
Cache entry is one geometric object
Entries have variable size
SPU checks cache for object (see the code sketch below)
If not in cache, object fetched with DMA
Cache returns a local address for object
Two entry buffers
New entries appended to “current” buffer
Hash-table used to record and find loaded entries
[Diagram: the current cache buffer holding entries A, B, C, with the next DMA appended after them]
When space runs out in a buffer
Overwrite data in second buffer
Considerations
Does not fragment memory
No searches for free space
But does not prefer frequently used data
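A simplified sketch of the lookup path described above (not the PhysX implementation); the dma_fetch() wrapper, table size, and the "forget everything on buffer switch" policy are all simplifications made up for illustration:

#include <stdint.h>
#include <string.h>

// Hypothetical blocking DMA wrapper: copy 'size' bytes from effective address
// 'ea' in main memory into local store at 'ls' and wait for completion.
extern void dma_fetch(void* ls, uint64_t ea, uint32_t size);

#define CACHE_BUFFER_SIZE (32 * 1024)   // illustrative
#define HASH_SLOTS        64

struct CacheSlot { uint64_t ea; void* lsAddr; };

static char      g_buffer[2][CACHE_BUFFER_SIZE] __attribute__((aligned(128)));
static uint32_t  g_current = 0;                  // buffer new entries are appended to
static uint32_t  g_offset  = 0;                  // next free byte in the current buffer
static CacheSlot g_table[HASH_SLOTS];            // ea -> LS address (ea == 0 means empty)

// Look up a geometric object by its main-memory address, fetching it on a miss.
// Returns the local-store address the contact code can use directly.
void* cache_lookup(uint64_t ea, uint32_t size)
{
    uint32_t slot = (uint32_t)(ea >> 4) % HASH_SLOTS;        // trivial hash
    for (uint32_t probe = 0; probe < HASH_SLOTS; ++probe) {
        CacheSlot& s = g_table[(slot + probe) % HASH_SLOTS];
        if (s.ea == ea) return s.lsAddr;                     // hit: object already in LS
        if (s.ea == 0) { slot = (slot + probe) % HASH_SLOTS; break; }  // first empty slot
    }

    // Miss: append to the current buffer; when it fills up, switch buffers and
    // overwrite the other one (here we simply forget all mappings, a simplification).
    if (g_offset + size > CACHE_BUFFER_SIZE) {
        g_current ^= 1;
        g_offset = 0;
        memset(g_table, 0, sizeof(g_table));
    }
    void* lsAddr = &g_buffer[g_current][g_offset];
    g_offset += (size + 15) & ~15u;                          // keep entries 16-byte aligned
    dma_fetch(lsAddr, ea, size);

    g_table[slot].ea = ea;
    g_table[slot].lsAddr = lsAddr;
    return lsAddr;
}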
Double-buffering
Start DMA for un-cached entries
Process previously DMA’d entries
Process/pre-fetch batches
Fetch and compute times vary
Batching may improve balance
DMA-lists useful
One DMA command
Multiple chunks of data gathered (see the sketch below)
[Diagram: one DMA list gathers scattered chunks (A, B, C, D, E, F) from main memory into the local store]
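A sketch of the list-DMA idea (not from the talk): one mfc_getl command gathers several scattered main-memory chunks into a contiguous LS region. The element layout follows the Cell SDK's mfc_list_element_t; the chunk addresses and sizes would come from the cache-miss list in practice.

#include <spu_mfcio.h>
#include <stdint.h>

#define MAX_CHUNKS 16

// List elements must live in the LS, 8 bytes each.
static mfc_list_element_t g_list[MAX_CHUNKS] __attribute__((aligned(8)));

// Gather 'count' scattered chunks into one contiguous LS destination with a
// single list-DMA command. eaHigh supplies the upper 32 bits of the effective
// address shared by all chunks; eaLow[] and sizes[] describe each chunk.
void gather_chunks(void* lsDest, uint64_t eaHigh,
                   const uint32_t* eaLow, const uint32_t* sizes,
                   uint32_t count, uint32_t tag)
{
    for (uint32_t i = 0; i < count; ++i) {
        g_list[i].notify = 0;          // no stall-and-notify
        g_list[i].size   = sizes[i];   // transfer size for this chunk (multiple of 16)
        g_list[i].eal    = eaLow[i];   // low 32 bits of this chunk's effective address
    }
    // One DMA command; the MFC walks the list and packs the chunks back to back in LS.
    mfc_getl(lsDest, eaHigh, g_list, count * sizeof(mfc_list_element_t), tag, 0, 0);

    // Wait for the whole gather to complete.
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}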
Simple cache is practical
Used for small convex objects in PhysX
Design considerations
Tradeoff of cache-logic cycles vs. bandwidth saved
Pre-fetching important to include
[Chart: Convex Stack (500 boxes) — frame time in microseconds over ~1300 frames for PPU-only, 1-SPU, 2-SPU, 3-SPU, and 4-SPU configurations, with PreP/PostP overheads shown]
Rigid bodies (ducks & boats)
Height-field water surface
Cloth with ripping (sails)
Particle based fluids (splashes + cups)
Dedicate one SPU for each subsystem
PPU – main loop
SPU thread synchronization, draw calls
SPU0 – height field water (<50%)
SPU1 – splashes iso-surface (<50%)
SPU2 – cloth sails for boat 1 (<50%)
SPU3 – cloth sails for boat 2 (<50%)
SPU4 – rigid body collision/response (95%)
[Diagram: one frame's timeline — HF water, iso-surface, cloth (boat 1), cloth (boat 2), and rigid bodies each running on its own SPU in parallel]
[Diagram: three serial-vs-parallel scheduling comparisons]
We can execute all three of these simultaneously
Overhead of launching smaller jobs
Synchronization when a stage is done
e.g. Gather results from all collision detection before solving
[Diagram: subsystem dependency graph — Fluid Simulation, Fluid Surface, Rigid Bodies, Cloth Sails; a dependency exists between the fluid simulation and the fluid surface, and the cloth sails split into Cloth Boat 1 and Cloth Boat 2]
Note that the parts didn’t take an equal amount of time to run. We could have done better given more time! But the cloth was for multiple boats.
[Diagram: 600-bodies example — the bodies are split into groups A, B, and C; Broad Phase Collision Detection, Narrow Phase Collision Detection, and Constraint Solving are pipelined across three processors (Proc 1–3), each processor working on a different group during each stage]
Array of Structures (AoS): each vector holds one point — X0 Y0 Z0 W0 | X1 Y1 Z1 W1 | … for Data[0]…Data[7]
Structure of Arrays (SoA): each vector holds one component of four points — X0 X1 X2 X3 | Y0 Y1 Y2 Y3 | …
1 AoS vector = one point; 1 SoA vector = the same component of four points
Bonus! Since W is almost always 0 or 1, we can eliminate it with a clever math library and save 25% memory and bandwidth!
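The two layouts in code (illustrative types, not the talk's math library):

// Array of Structures: one 16-byte vector per point, so a SIMD register holds
// the X, Y, Z, W of a single point and most operations waste lanes.
struct PointAoS { float x, y, z, w; };
PointAoS dataAoS[8];

// Structure of Arrays: each array (and each SIMD register) holds the same
// component of four points, so one vector instruction advances four points at once.
struct PointsSoA4 {
    float x[4];
    float y[4];
    float z[4];
    float w[4];          // can be dropped entirely when W is always 0 or 1
};
PointsSoA4 dataSoA[2];   // the same 8 points, grouped four at a time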
p1(ti) = p1(ti-1) + v1(ti-1) * dt + 0.5 * G * dt²
p2(ti) = p2(ti-1) + v2(ti-1) * dt + 0.5 * G * dt²
…
p{1-4}(ti) = p{1-4}(ti-1) + v{1-4}(ti-1) * dt + 0.5 * G * dt²
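In SoA form the four updates collapse into one set of vector operations. A sketch using SPU SIMD intrinsics (spu_splats, spu_madd from spu_intrinsics.h); gravity acting only on Y and the Particles4 layout are assumptions for illustration:

#include <spu_intrinsics.h>

// Four particles in SoA form: one vector per component.
struct Particles4 {
    vec_float4 px, py, pz;   // positions
    vec_float4 vx, vy, vz;   // velocities
};

// p{1-4}(ti) = p{1-4}(ti-1) + v{1-4}(ti-1)*dt + 0.5*G*dt^2, four particles per call.
void integrate4(Particles4* p, float dt, float g)
{
    vec_float4 vdt  = spu_splats(dt);
    vec_float4 gdt2 = spu_splats(0.5f * g * dt * dt);

    p->px = spu_madd(p->vx, vdt, p->px);                  // px += vx*dt
    p->py = spu_add(spu_madd(p->vy, vdt, p->py), gdt2);   // py += vy*dt + 0.5*G*dt^2
    p->pz = spu_madd(p->vz, vdt, p->pz);                  // pz += vz*dt
}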
p = current positions of all objects
solve(cn, p) takes p and constraint cn and computes a new p that satisfies cn
p = solve(c0, p)
p = solve(c1, p)
…
Solution: Arrange constraints such that no four constraints solved together share an object
But they need more memory (pnew and pcurrent)
And may need more iterations to converge
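One way (a greedy sketch, not the talk's method) to group constraints so that the four solved together never touch the same object; the Constraint struct and its body indices are assumptions:

#include <vector>
#include <set>

struct Constraint { int bodyA, bodyB; /* ... */ };

// Greedily pack constraint indices into groups of up to four that reference
// disjoint bodies; each group can then be solved in a single SIMD pass.
std::vector<std::vector<int> > buildBatches(const std::vector<Constraint>& cs)
{
    std::vector<std::vector<int> > batches;
    std::vector<bool> placed(cs.size(), false);
    size_t remaining = cs.size();

    while (remaining > 0) {
        std::vector<int> batch;
        std::set<int> used;    // bodies already referenced by this batch
        for (size_t i = 0; i < cs.size() && batch.size() < 4; ++i) {
            if (placed[i]) continue;
            if (used.count(cs[i].bodyA) || used.count(cs[i].bodyB)) continue;
            batch.push_back((int)i);
            used.insert(cs[i].bodyA);
            used.insert(cs[i].bodyB);
            placed[i] = true;
            --remaining;
        }
        batches.push_back(batch);   // may hold fewer than 4 near the end (pad when solving)
    }
    return batches;
}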
Mass distributed around a point
Density falls to 0 at a radius h
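The talk doesn't name the smoothing kernel; a commonly used choice with exactly this property is the poly6 kernel of Müller et al. (2003):
W(r, h) = 315 / (64 * π * h⁹) * (h² - r²)³ for 0 ≤ r ≤ h, and 0 otherwise
The density at particle i is then ρi = Σj mj * W(|pi - pj|, h), summed over the neighbors j within radius h.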
Put particles in grid cells
Process on different SPUs (not used in duck demo)
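A sketch of the binning step (cell size equal to the smoothing radius h is a common choice; the names and the clamping policy are illustrative):

#include <vector>
#include <cmath>

struct Particle { float x, y, z; };

// Bin particle indices into a uniform grid with cell size h, so a particle's
// neighbors within radius h lie in its own cell or the 26 adjacent ones.
// Blocks of cells can then be handed to different SPUs.
std::vector<std::vector<int> > binParticles(const std::vector<Particle>& ps,
                                            float h, int nx, int ny, int nz,
                                            float minX, float minY, float minZ)
{
    std::vector<std::vector<int> > cells(nx * ny * nz);
    for (int i = 0; i < (int)ps.size(); ++i) {
        int cx = (int)std::floor((ps[i].x - minX) / h);
        int cy = (int)std::floor((ps[i].y - minY) / h);
        int cz = (int)std::floor((ps[i].z - minZ) / h);
        // Clamp to the grid bounds (illustrative; a real solver might hash instead).
        if (cx < 0) cx = 0; if (cx >= nx) cx = nx - 1;
        if (cy < 0) cy = 0; if (cy >= ny) cy = ny - 1;
        if (cz < 0) cz = 0; if (cz >= nz) cz = nz - 1;
        cells[(cz * ny + cy) * nx + cx].push_back(i);
    }
    return cells;
}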
SIMD and dual-issue on SPU
Large n per cell may be better
Less grid overhead
Loops fast on SPU
E.g., taken from neighbor grid cells
O(n²) problem
Increase SIMD throughput with structure-of-arrays
Transpose and produce combinations
Conversion instructions can dual-issue with math
[Diagram: software pipelining — loading element i+1 and converting it to SoA while storing and converting element i-1 back from SoA, overlapping with the math on element i]
Across subsystems or within subsystems?
Across iterations or within iterations?
Data level independence?
Instruction level independence?
How about “bandwidth level” independence?
Sometimes running serially wins over the overhead of parallelizing
Vangelis Kokkevis: vangelis_kokkevis@playstation.sony.com
Eric Larsen: eric_larsen@playstation.sony.com
Steven Osman: steven_osman@playstation.sony.com