High-Performance Physics Solver Design for Next Generation Consoles


slide-1
SLIDE 1

High-Performance Physics Solver Design for Next Generation Consoles

Vangelis Kokkevis, Steven Osman, Eric Larsen

Simulation Technology Group, Sony Computer Entertainment America, US R&D

slide-2
SLIDE 2

This Talk

Optimizing physics simulation on a multi-core architecture.

Focus on CELL architecture

Variety of simulation domains

Cloth, Rigid Bodies, Fluids, Particles

Practical advice based on real case studies. Demos!

slide-3
SLIDE 3

Basic Issues

Looking for opportunities to parallelize processing

High Level – Many independent solvers on multiple cores
Low Level – One solver on one or multiple cores

Coding with small memory in mind

Streaming
Batching up work
Software caching

Speeding up processing within each unit

SIMD processing, instruction scheduling
Double-buffering

Parallelizing/optimizing existing code

slide-4
SLIDE 4

What is not in this talk?

Details on specific physics algorithms

Too much material for a 1-hour talk
Will provide references to techniques

Much insight on non-CELL platforms

Concentrate on actual results
Concepts should be applicable beyond CELL

slide-5
SLIDE 5

The Cell Processor Model

[Diagram: the Cell processor — a PPU with L1/L2 cache and main memory, plus eight SPEs (SPU0–SPU7), each with its own DMA engine and 256K local store (LS).]

slide-6
SLIDE 6

Physics on CELL

Physics should happen mostly on SPUs

There are more of them!
SPUs have greater bandwidth & performance
PPU is busy doing other stuff

[Diagram: the Cell layout again — PPU, L1/L2, main memory, and eight SPEs (SPU0–SPU7), each with DMA and 256K LS.]

slide-7
SLIDE 7

SPU Performance Recipe

Large bandwidth to and from main memory
Quick (1-cycle) LS memory access
SIMD instruction set
Concurrent DMA and processing

Challenges:

Limited LS size, shared between code and data
Random accesses of main memory are slow

slide-8
SLIDE 8

Cloth Simulation

slide-9
SLIDE 9

Cloth Simulation

Cloth mesh simulated as point masses (vertices) connected via distance constraints (edges).

[Diagram: a mesh triangle with point masses m1, m2, m3 at its vertices and distance constraints d1, d2, d3 along its edges.]

References:

T. Jakobsen, Advanced Character Physics, GDC 2001
A. Meggs, Taking Real-Time Cloth Beyond Curtains, GDC 2005

slide-10
SLIDE 10

Simulation Step

1. Compute external forces, f_E, per vertex

2. Compute new vertex positions [Integration]:

   p_{t+1} = (2p_t − p_{t−1}) + ½ · f_E · (1/m) · Δt²

3. Fix edge lengths
   • Adjust vertex positions

4. Correct penetrations with collision geometry
   • Adjust vertex positions
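As a concrete illustration, a minimal sketch of steps 1–2 in code. The Vertex layout and the Vector3 operators are assumptions (the deck only gives the formula and, later, a Vector3/dot-style API); this is not the presenters' actual implementation.

    struct Vertex {
        Vector3 pos;       // p_t
        Vector3 prevPos;   // p_(t-1)
        Vector3 extForce;  // f_E, accumulated in step 1
        float   invMass;   // 1/m
    };

    // Step 2 (integration), written per the formula above.
    void integrate(Vertex* v, int count, float dt)
    {
        for (int i = 0; i < count; ++i) {
            Vector3 cur = v[i].pos;
            v[i].pos = cur * 2.0f - v[i].prevPos
                     + v[i].extForce * (0.5f * v[i].invMass * dt * dt);
            v[i].prevPos = cur;   // current position becomes p_(t-1)
        }
    }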

slide-11
SLIDE 11

How many vertices?

How many vertices fit in 256K (less actually)?

A lot, surprisingly…

Tips:

Look for opportunities to stream data
Keep in LS only data required for each step

slide-12
SLIDE 12

Integration Step

Fewer than 4000 verts fit in 200K of memory
We don’t need to keep them all in LS
Keep vertex data in main memory and bring it in, in blocks

16 + 16 + 16 + 4 = 52 bytes / vertex  (p_t, p_{t−1}, f_E, and 1/m)

p_{t+1} = (2p_t − p_{t−1}) + ½ · f_E · (1/m) · Δt²

slide-13
SLIDE 13

Streaming Integration

[Diagram: main memory holds the arrays p_t, p_{t−1}, f_E, and 1/m, each split into blocks B0–B3; the SPU local store starts out empty.]

slide-14
SLIDE 14

Streaming Integration

[Diagram: DMA_IN B0 — block B0 of each array is transferred from main memory into the local store.]

slide-15
SLIDE 15

Streaming Integration

[Diagram: Process B0 — the SPU integrates the vertices of block B0 in the local store.]

slide-16
SLIDE 16

Streaming Integration

[Diagram: DMA_OUT B0 — the processed block B0 is written back to main memory.]

slide-17
SLIDE 17

Streaming Integration

[Diagram: DMA_IN B1 — block B1 is brought into the local store.]

slide-18
SLIDE 18

Streaming Integration

[Diagram: Process B1 — the SPU integrates block B1.]

slide-19
SLIDE 19

Streaming Integration

[Diagram: DMA_OUT B1 — the processed block B1 is written back to main memory.]

slide-20
SLIDE 20

Streaming Integration

[Diagram: the same pattern continues for the remaining blocks B2 and B3.]

slide-21
SLIDE 21

Double-buffering

Take advantage of concurrent DMA and processing to hide transfer times

Without double-buffering (everything serial):

DMA_IN B0 → Process B0 → DMA_OUT B0 → DMA_IN B1 → Process B1 → DMA_OUT B1 → …

With double-buffering (transfers overlap processing):

DMA_IN B0 → [Process B0 ∥ DMA_IN B1] → [Process B1 ∥ DMA_OUT B0, DMA_IN B2] → [Process B2 ∥ DMA_OUT B1, DMA_IN B3] → …
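A rough sketch of the double-buffered loop, assuming the Vertex struct from the earlier sketch, <stdint.h> integer types, and hypothetical dma_in / dma_out / dma_wait helpers that stand in for the SPU's tag-based DMA facilities (this and the later sketches reuse these helpers). Block size and integrate_block are illustrative.

    enum { BLOCK_VERTS = 512 };
    static Vertex lsBuf[2][BLOCK_VERTS];     // two local-store buffers

    void integrate_stream(uint64_t eaVerts, int numBlocks, float dt)
    {
        const uint32_t blockBytes = sizeof(Vertex) * BLOCK_VERTS;
        int cur = 0;
        dma_in(lsBuf[cur], eaVerts, blockBytes, /*tag*/ cur);

        for (int b = 0; b < numBlocks; ++b) {
            int next = cur ^ 1;
            if (b + 1 < numBlocks) {
                dma_wait(next);              // that buffer's write-back finished?
                dma_in(lsBuf[next], eaVerts + (uint64_t)(b + 1) * blockBytes,
                       blockBytes, next);    // prefetch the next block
            }
            dma_wait(cur);                   // our block has arrived
            integrate_block(lsBuf[cur], BLOCK_VERTS, dt);
            dma_out(lsBuf[cur], eaVerts + (uint64_t)b * blockBytes,
                    blockBytes, cur);        // write back while the next block processes
            cur = next;
        }
        dma_wait(0);
        dma_wait(1);                         // drain remaining transfers
    }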

slide-22
SLIDE 22

Streaming Data

Streaming is possible when the data access pattern is simple and predictable (e.g. linear)

The number of verts processed per frame depends on processing speed and bandwidth, but not on LS size

Unfortunately, not every step in the cloth solver can be fully streamed

Fixing edge lengths requires random memory access…

slide-23
SLIDE 23

Fixing Edge Lengths

Points coming out of the integration step don’t necessarily satisfy the edge distance constraints

struct Edge { int v1; int v2; float restLen; };

[Diagram: the edge between p[v1] and p[v2] before and after the fix.]

Vector3 d = p[v2] - p[v1];
float len = sqrt(dot(d, d));
float diff = (len - restLen) / len;
p[v1] += d * 0.5f * diff;
p[v2] -= d * 0.5f * diff;

slide-24
SLIDE 24

Fixing Edge Lengths

An iterative process: fix one edge at a time by adjusting its 2 vertex positions

Requires random access to the particle positions array

Solution:

Keep all particle positions in LS
Stream in edge data
In 200K we can fit 200KB / 16B > 12K vertices
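A sketch of this arrangement, reusing the hypothetical dma helpers from the streaming sketch: positions stay resident in LS, the Edge array is streamed in blocks, and each edge is relaxed as on the previous slide. Block size and helper names are illustrative, not the actual solver.

    enum { EDGE_BLOCK = 1024 };

    void fix_edges(Vector3* p /* all positions, resident in LS */,
                   uint64_t eaEdges, int numEdges)
    {
        static Edge buf[EDGE_BLOCK];
        for (int base = 0; base < numEdges; base += EDGE_BLOCK) {
            int n = numEdges - base;
            if (n > EDGE_BLOCK) n = EDGE_BLOCK;

            dma_in(buf, eaEdges + (uint64_t)base * sizeof(Edge),
                   n * sizeof(Edge), 0);
            dma_wait(0);

            for (int i = 0; i < n; ++i) {
                Vector3 d  = p[buf[i].v2] - p[buf[i].v1];
                float len  = sqrtf(dot(d, d));
                float diff = (len - buf[i].restLen) / len;
                p[buf[i].v1] += d * (0.5f * diff);   // move the endpoints
                p[buf[i].v2] -= d * (0.5f * diff);   // toward the rest length
            }
        }
    }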

slide-25
SLIDE 25

Rigid Bodies

Our group is currently porting the AGEIA™ PhysX™ SDK to CELL

Large codebase written with a PC architecture in mind

Assumes easy random access to memory
Processes tasks sequentially (no parallelism)

Interesting example of how to port existing code to a multi-core architecture

slide-26
SLIDE 26

Starting the Port

Determine all the stages of the rigid body pipeline

Look for stages that are good candidates for parallelizing/optimizing

Profile code to make sure we are focusing on the right parts

slide-27
SLIDE 27

Rigid Body Pipeline

[Diagram: the pipeline — Broadphase Collision Detection → Narrowphase Collision Detection → Constraint Prep → Constraint Solve → Integration. Data flowing between stages: current body positions → potentially colliding body pairs → points of contact between bodies → constraint equations → updated body velocities → new body positions.]

slide-28
SLIDE 28

Rigid Body Pipeline

[Diagram: the same pipeline with broadphase collision detection as a single stage, and the narrowphase (NP), constraint prep (CP), constraint solve (CS), and integration (I) stages each split into multiple instances that can run in parallel.]

slide-29
SLIDE 29

Profiling Scenario

slide-30
SLIDE 30

Profiling Results

[Chart: cumulative frame time (microseconds) over roughly 2000 frames, broken down into Broadphase, Narrowphase, Constraint Prep, Solver, Integration, and Other.]

slide-31
SLIDE 31

Running on the SPUs

Three steps:

1. (PPU) Pre-process
   “Gather” operation (extract data from PhysX data structures and pack it in MM)

2. (SPU) Execute
   DMA packed data from MM to LS
   Process data and store output in LS
   DMA output to MM

3. (PPU) Post-process
   “Scatter” operation (unpack output data and put it back in PhysX data structures)
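Schematically, and with illustrative names (this is not the actual PhysX port), the flow looks something like this: the PPU packs a batch descriptor plus inputs, the SPU streams the batch through LS, and the PPU scatters the results back.

    struct BatchDesc {            // written by the PPU pre-process ("gather") step
        uint64_t inputEA;         // packed, aligned inputs in main memory
        uint64_t outputEA;        // where the SPU should write its results
        uint32_t numItems;
    };

    enum { MAX_BATCH = 256 };     // illustrative batch size

    // SPU side of step 2: DMA the packed batch in, process it, DMA results out.
    void spu_execute(const BatchDesc& desc)
    {
        static PackedItem in[MAX_BATCH];   // LS buffers; PackedItem, Result and
        static Result     out[MAX_BATCH];  // process() are placeholders

        dma_in(in, desc.inputEA, desc.numItems * sizeof(PackedItem), 0);
        dma_wait(0);

        for (uint32_t i = 0; i < desc.numItems; ++i)
            out[i] = process(in[i]);       // e.g. narrowphase or constraint prep

        dma_out(out, desc.outputEA, desc.numItems * sizeof(Result), 0);
        dma_wait(0);                       // PPU post-process then scatters 'out'
    }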

slide-32
SLIDE 32

Why Involve the PPU?

Required PhysX data is not conveniently packed
Data is often not aligned
We need to use PhysX data structures to avoid breaking features we haven’t ported

Solutions:

Use list DMAs to bring in data
Modify existing code to force alignment
Change PhysX code to work with new data structures

slide-33
SLIDE 33

Batching Up Work

Create work batches for each task

[Diagram: the PPU pre-process step reads the PhysX data structures and writes, for each task, a task description plus batch inputs/outputs into work batch buffers in MM; the SPU execute step consumes those batches; the PPU post-process step writes the results back into the PhysX data structures.]

slide-34
SLIDE 34

Narrow-phase Collision Detection

Problem:

A list of object pairs that may be colliding
Want to do contact processing on SPUs
Pairs list has references to geometry

[Diagram: three bodies A, B, C and their pair list (A,B), (A,C), (B,C), …]

slide-35
SLIDE 35

Narrow-phase Collision Detection

Data locality

Same bodies may be in several pairs
Geometry may be instanced for different bodies

SPU memory access

Can only access main memory with DMA
No hardware cache
Data reuse must be explicit

slide-36
SLIDE 36

Software Cache

Idea: make a (read-only) software cache

Cache entry is one geometric object
Entries have variable size

Basic operation

SPU checks the cache for the object
If not in cache, the object is fetched with DMA
Cache returns a local address for the object

slide-37
SLIDE 37

Software Cache

Data Structures

Two entry buffers
New entries appended to the “current” buffer
Hash-table used to record and find loaded entries

[Diagram: entries A, B, C appended to Buffer 0, with the next DMA landing after them; Buffer 1 is the second entry buffer.]
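A minimal sketch of the lookup path, with illustrative sizes and, for brevity, a linear scan standing in for the hash table; a real version would only invalidate the entries living in the buffer being overwritten, and would overlap the miss DMA with other work.

    enum { BUF_SIZE = 32 * 1024, MAX_ENTRIES = 128 };

    struct CacheEntry { uint64_t ea; void* ls; };

    static char       g_buf[2][BUF_SIZE];     // the two entry buffers
    static int        g_cur = 0, g_used = 0;  // current buffer and bytes used
    static CacheEntry g_entries[MAX_ENTRIES]; // loaded entries (a hash table in
    static int        g_numEntries = 0;       //  the real thing, a scan here)

    void* cache_lookup(uint64_t ea, int size)
    {
        for (int i = 0; i < g_numEntries; ++i)
            if (g_entries[i].ea == ea)
                return g_entries[i].ls;        // hit: object already in LS

        if (g_used + size > BUF_SIZE || g_numEntries == MAX_ENTRIES) {
            g_cur ^= 1;                        // out of space: switch buffers and
            g_used = 0;                        //  overwrite the other one
            g_numEntries = 0;                  // (over-aggressive: drops all entries)
        }
        void* ls = &g_buf[g_cur][g_used];
        g_used += size;
        dma_in(ls, ea, size, 0);               // miss: fetch the object
        dma_wait(0);                           // (pre-fetching would defer this wait)
        g_entries[g_numEntries].ea = ea;
        g_entries[g_numEntries].ls = ls;
        ++g_numEntries;
        return ls;
    }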

slide-38
SLIDE 38

Software Cache

Data Replacement

When space runs out in a buffer
Overwrite the data in the second buffer

Considerations

Does not fragment memory
No searches for free space
But does not prefer frequently used data

slide-39
SLIDE 39

Software Cache

Hiding the DMA latency

Double-buffering

Start DMA for un-cached entries
Process previously DMA’d entries

Process/pre-fetch batches

Fetch and compute times vary
Batching may improve balance

DMA-lists useful

One DMA command
Multiple chunks of data gathered

[Diagram: while entries A, B, C in the current buffer are being processed, the DMA for entries D, E, F is already in flight.]

slide-40
SLIDE 40

Software Caching

Conclusions

Simple cache is practical

Used for small convex objects in PhysX

Design considerations

Tradeoff of cache-logic cycles vs. bandwidth saved
Pre-fetching important to include

slide-41
SLIDE 41

Single SPU Performance

[Diagram: PPU only — the PPU spends the whole frame in Exec. PPU + SPU — the PPU only does Pre-Process and Post-Process (the rest of its time is free) while the SPU runs Exec; SPU Exec < PPU Exec thanks to SIMD and fast memory access.]

slide-42
SLIDE 42

Multiple SPU Performance

Pre- and post-processing times determine how many SPUs can be used effectively

slide-43
SLIDE 43

Multiple SPU Performance

[Diagram: timelines for four work batches (1–4) on the PPU with 1, 2, and 3 SPUs; with more SPUs the SPU work overlaps, but the PPU pre-/post-processing between batches limits how many SPUs can be used effectively.]

slide-44
SLIDE 44

PPU vs SPU comparisons

Convex Stack (500 boxes)

[Chart: frame time in microseconds over roughly 1300 frames, comparing PPU-only, 1-SPU, 2-SPU, 3-SPU, and 4-SPU runs.]

slide-45
SLIDE 45

Duck Demo

One of our first CELL demos (spring 2005)
Several interacting physics systems:

Rigid bodies (ducks & boats)
Height-field water surface
Cloth with ripping (sails)
Particle-based fluids (splashes + cups)

slide-46
SLIDE 46

Duck Demo (Lots of Ducks)

slide-47
SLIDE 47

Duck Demo

Ambitious project with a short deadline
Early PC prototypes of some pieces
Most straightforward way to parallelize:

Dedicate one SPU for each subsystem

Each piece could be developed and tested individually

slide-48
SLIDE 48

Duck Demo Resource Allocation

PPU – main loop

SPU thread synchronization, draw calls

SPU0 – height field water (<50%)
SPU1 – splashes iso-surface (<50%)
SPU2 – cloth sails for boat 1 (<50%)
SPU3 – cloth sails for boat 2 (<50%)
SPU4 – rigid body collision/response (95%)

[Diagram: one frame's timeline with the height-field water, iso-surface, two cloth, and rigid-body jobs each running on its own SPU.]

slide-49
SLIDE 49

Parallelization Recipe

One three-step approach to code parallelization:

1. Find independent components
2. Run them side-by-side
3. Recursively apply the recipe to the components

slide-50
SLIDE 50

Challenges

Step 1: Find independent components

Where do you look?

Maybe you need to break apart and overlap your data?
e.g. broad phase collision detection

Maybe you need to break apart your loop into individual iterations?
e.g. solving cloth constraints

slide-51
SLIDE 51

Broad Phase Collision Detection

Need to test 600 rigid bodies against each other.

[Diagram: the 600 objects are split into three groups of 200 (A, B, C); the three cross-group tests — A vs B, A vs C, and C vs B — can all be executed simultaneously.]

slide-52
SLIDE 52

Cloth Solving

for (i = 1 to 5) { cloth = solve(cloth); }

With the cloth split into independent pieces A, B, and C:

for (i = 1 to 5) {
    solve_on_proc1(a);
    solve_on_proc2(b);
    wait_for_all();
    solve_on_proc1(c);
    wait_for_all();
}

slide-53
SLIDE 53

…challenges

Step 2: Run them side-by-side

Bandwidth and cache issues

Need good data layout to avoid thrashing the cache or bus

Processor issues

Need an efficient processor management scheme

What if the job sizes are very different?

e.g. a suit of cloth and a separate neck tie
Need further refinement of large jobs, or you only save the small neck-tie time

slide-54
SLIDE 54

…challenges

Step 3: Recurse

When do you stop?

Overhead of launching smaller jobs
Synchronization when a stage is done
e.g. gather results from all collision detection before solving

But this can go down to the instruction level
e.g. using Structure-of-Arrays, transform four independent vectors at once

slide-55
SLIDE 55

High Level Parallelization:

Duck Demo

[Diagram: the demo's subsystems — fluid simulation, fluid surface, rigid bodies, and cloth sails — laid out side by side; a dependency exists between the fluid simulation and the fluid surface, and the cloth work is further split into Cloth Boat 1 and Cloth Boat 2.]

Note that the parts didn’t take an equal amount of time to run. We could have done better given time! But the cloth was for multiple boats.

slide-56
SLIDE 56

Lower Level Parallelization

Rigid Body Simulation

600 bodies example

[Diagram: a single broad phase collision detection produces object groups A, B, and C; narrow phase collision detection and constraint solving for the groups then run in parallel on Proc 1, Proc 2, and Proc 3.]

slide-57
SLIDE 57

Structure of Arrays

[Diagram: Data[0]…Data[7] stored two ways. Array of Structures, or “AoS”: each element’s X, Y, Z, W are stored together, so one AoS vector holds X, Y, Z, W of a single element. Structure of Arrays, or “SoA”: all the Xs are stored together, then all the Ys, and so on, so one SoA vector holds the same component of four consecutive elements.]

Bonus! Since W is almost always 0 or 1, we can eliminate it with a clever math library and save 25% memory and bandwidth!
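In struct form, an illustrative sketch of the two layouts (eight elements, as in the figure):

    struct AoS {                 // "Array of Structures"
        float x, y, z, w;        // Data[i] = { Xi, Yi, Zi, Wi }
    };
    AoS aosData[8];

    struct SoA {                 // "Structure of Arrays"
        float x[8];              // X0 .. X7 contiguous
        float y[8];              // Y0 .. Y7 contiguous
        float z[8];              // Z0 .. Z7 contiguous
        float w[8];              // W0 .. W7 (often droppable when W is 0 or 1)
    };
    SoA soaData;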

slide-58
SLIDE 58

Lowest Level Parallelization:

Structure-of-Array processing of Particles

Given:

p_n(t) = position of particle n at time t
v_n(t) = velocity of particle n at time t

p_1(t_i) = p_1(t_{i−1}) + v_1(t_{i−1}) * dt + 0.5 * G * dt²
p_2(t_i) = p_2(t_{i−1}) + v_2(t_{i−1}) * dt + 0.5 * G * dt²

Note they are independent of each other, so we can run four together using SoA:

p_{1-4}(t_i) = p_{1-4}(t_{i−1}) + v_{1-4}(t_{i−1}) * dt + 0.5 * G * dt²
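A plain-C sketch of the SoA update; gravity is applied along y only for brevity, and the inner four-lane loop is what maps onto one 4-wide SIMD operation per component on the SPU (vector intrinsics omitted). Names are illustrative.

    void integrate_particles(float* px, float* py, float* pz,
                             float* vx, float* vy, float* vz,
                             int n, float dt, float gy /* gravity */)
    {
        float halfG = 0.5f * gy * dt * dt;
        for (int i = 0; i < n; i += 4) {
            for (int k = 0; k < 4; ++k) {        // four independent lanes:
                px[i+k] += vx[i+k] * dt;         //  each line becomes a single
                py[i+k] += vy[i+k] * dt + halfG; //  SIMD multiply-add across
                pz[i+k] += vz[i+k] * dt;         //  particles i .. i+3
            }
        }
    }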

slide-59
SLIDE 59

Failure Case

Gauss-Seidel Solver

Consider a simple position-based solver that uses distance constraints. Given:

p = current positions of all objects
solve(cn, p) takes p and constraint cn and computes a new p that satisfies cn

p = solve(c0, p)
p = solve(c1, p)
…

Note that to solve c1, we need the result of c0. We can’t solve c0 and c1 concurrently!

slide-60
SLIDE 60

Failure Case

Possible Solutions

Generally you’re out of luck, but…

Some cases have very limited dependencies
e.g. particle-based cloth solving
Solution: arrange constraints such that no four adjacent constraints share cloth particles

Consider a different solver
e.g. Jacobi solvers don’t use updated values until all constraints have been processed once
But they need more memory (p_new and p_current)
And may need more iterations to converge
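To make the contrast concrete, a hedged sketch of the two update orders; solve_one() and the Constraint type are placeholders, and a real Jacobi step must also accumulate or average corrections when several constraints touch the same particle.

    void gauss_seidel(Vector3* p, const Constraint* c, int n)
    {
        for (int i = 0; i < n; ++i)
            solve_one(c[i], p, p);        // reads and writes p: c1 sees c0's result
    }

    void jacobi(Vector3* pCur, Vector3* pNew, const Constraint* c, int n)
    {
        for (int i = 0; i < n; ++i)
            solve_one(c[i], pCur, pNew);  // reads old positions only: constraints
                                          //  are independent within one iteration
        // swap pCur/pNew before the next iteration; usually needs more iterations
    }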

slide-61
SLIDE 61

Duck Demo (EyeToy + SPH)

slide-62
SLIDE 62

Smoothed Particle Hydrodynamics (SPH) Fluid Simulation

Smoothed particles

Mass distributed around a point
Density falls to 0 at a radius h

Forces between particles closer than 2h

[Diagram: a smoothed particle with its smoothing radius h.]

slide-63
SLIDE 63

SPH Fluid Simulation

High-level parallelism

Put particles in grid cells
Process on different SPUs
(Not used in the duck demo)

Low-level parallelism

SIMD and dual-issue on the SPU
Large n per cell may be better

Less grid overhead
Loops fast on SPU
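A tiny sketch of the grid binning behind the high-level split: a uniform grid with cell size 2h, so interacting particles always lie in the same or an adjacent cell, and ranges of cells can be handed to different SPUs. The grid dimensions and the Vector3 members are assumptions; positions are assumed already offset into the positive octant.

    enum { GRID_DIM = 32 };

    int cell_index(const Vector3& p, float cellSize /* = 2h */)
    {
        int ix = (int)(p.x / cellSize);
        int iy = (int)(p.y / cellSize);
        int iz = (int)(p.z / cellSize);
        return (iz * GRID_DIM + iy) * GRID_DIM + ix;
    }

    // Count particles per cell; force computation for ranges of cells can
    // then be dispatched to different SPUs.
    void bin_particles(const Vector3* pos, int n, float cellSize, int* cellCount)
    {
        for (int c = 0; c < GRID_DIM * GRID_DIM * GRID_DIM; ++c) cellCount[c] = 0;
        for (int i = 0; i < n; ++i)
            ++cellCount[cell_index(pos[i], cellSize)];
    }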

slide-64
SLIDE 64

SPH Loop

Consider two sets of particles P and Q

E.g., taken from neighboring grid cells
An O(n²) problem

Can unroll (e.g., by 4):

for (i = 0; i < numP; i++)
    for (j = 0; j < numQ; j += 4) {
        Compute force (pi, qj)
        Compute force (pi, qj+1)
        Compute force (pi, qj+2)
        Compute force (pi, qj+3)
    }

slide-65
SLIDE 65

SPH Loop, SoA

Idea:

Increase SIMD throughput with structure-of-arrays
Transpose and produce combinations

[Diagram: the x, y, z components of p_i are replicated into SoA form and the components of q_j…q_{j+3} are transposed into SoA form, so four p–q pairs are processed at once.]
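A sketch of that transpose step in plain C (on the SPU the four lanes would sit in 4-wide vector registers); the function name and the Vector3 members are assumptions.

    void make_soa_pairs(const Vector3& pi, const Vector3* q, int j,
                        float px[4], float py[4], float pz[4],
                        float qx[4], float qy[4], float qz[4])
    {
        for (int k = 0; k < 4; ++k) {
            px[k] = pi.x;  py[k] = pi.y;  pz[k] = pi.z;              // splat p_i
            qx[k] = q[j+k].x;  qy[k] = q[j+k].y;  qz[k] = q[j+k].z;  // transpose q
        }
        // force(px,py,pz, qx,qy,qz) can now evaluate 4 distances/kernels per op
    }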

slide-66
SLIDE 66

SPH Loop, Software Pipelined

Add software pipelining

Conversion instructions can dual-issue with math

[Diagram: the per-iteration chain Load[i] → To SoA[i] → Compute[i] → From SoA[i] → Store[i] is software-pipelined so that Compute[i] on one pipe overlaps Load[i+1], To SoA[i+1], From SoA[i−1], and Store[i−1] on the other (Pipe 0 / Pipe 1).]

slide-67
SLIDE 67

Recap

Finding independence is hard!

Across subsystems or within subsystems?
Across iterations or within iterations?
Data-level independence? Instruction-level independence?
How about “bandwidth-level” independence?

Parallelization overhead

Sometimes running serially wins over the overhead of parallelization

slide-68
SLIDE 68

Particle Simulation Demo

slide-69
SLIDE 69

Questions?

http://www.research.scea.com/

Contacts:

Vangelis Kokkevis: vangelis_kokkevis@playstation.sony.com
Eric Larsen: eric_larsen@playstation.sony.com
Steven Osman: steven_osman@playstation.sony.com