High-Performance Physics Solver Design for Next Generation Consoles


slide-1
SLIDE 1

High-Performance Physics Solver Design for Next Generation Consoles

Vangelis Kokkevis, Steven Osman, Eric Larsen

Simulation Technology Group, Sony Computer Entertainment America, US R&D

slide-2
SLIDE 2

This Talk

Optimizing physics simulation on a multi-core architecture.

Focus on CELL architecture

Variety of simulation domains

Cloth, Rigid Bodies, Fluids, Particles

Practical advice based on real case studies. Demos!

slide-3
SLIDE 3

Basic Issues

Looking for opportunities to parallelize processing

High Level – Many independent solvers on multiple cores
Low Level – One solver on one or multiple cores

Coding with small memory in mind

Streaming
Batching up work
Software caching

Speeding up processing within each unit

SIMD processing, instruction scheduling
Double-buffering

Parallelizing/optimizing existing code

slide-4
SLIDE 4

What is not in this talk?

Details on specific physics algorithms

Too much material for a 1-hour talk
Will provide references to techniques

Much insight on non-CELL platforms

Concentrate on actual results
Concepts should be applicable beyond CELL

slide-5
SLIDE 5

The Cell Processor Model

[Diagram: the Cell processor — a PPU with L1/L2 cache and main memory, plus eight SPEs (SPU0–SPU7), each with its own DMA engine and 256K local store (LS).]

slide-6
SLIDE 6

Physics on CELL

Physics should happen mostly on SPUs

There are more of them!
SPUs have greater bandwidth & performance
PPU is busy doing other stuff

[Diagram: the Cell layout again — PPU, L1/L2, main memory, and eight SPEs (SPU0–SPU7), each with DMA and 256K LS.]

slide-7
SLIDE 7

SPU Performance Recipe

Large bandwidth to and from main memory
Quick (1-cycle) LS memory access
SIMD instruction set
Concurrent DMA and processing

Challenges:

Limited LS size, shared between code and data
Random accesses of main memory are slow

slide-8
SLIDE 8

Cloth Simulation

slide-9
SLIDE 9

Cloth Simulation

Cloth mesh simulated as point masses (vertices) connected via distance constraints (edges).

[Diagram: a mesh triangle with point masses m1, m2, m3 at its vertices and distance constraints d1, d2, d3 along its edges.]

References:

T. Jakobsen, Advanced Character Physics, GDC 2001
A. Meggs, Taking Real-Time Cloth Beyond Curtains, GDC 2005

slide-10
SLIDE 10

Simulation Step

1. Compute external forces, f_E, per vertex

2. Compute new vertex positions [Integration]:

   p_{t+1} = (2p_t − p_{t−1}) + ½ · f_E · (1/m) · Δt²

3. Fix edge lengths
   • Adjust vertex positions

4. Correct penetrations with collision geometry
   • Adjust vertex positions
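As a concrete illustration, a minimal sketch of steps 1–2 in code. The Vertex layout and the Vector3 operators are assumptions (the deck only gives the formula and, later, a Vector3/dot-style API); this is not the presenters' actual implementation.

    struct Vertex {
        Vector3 pos;       // p_t
        Vector3 prevPos;   // p_(t-1)
        Vector3 extForce;  // f_E, accumulated in step 1
        float   invMass;   // 1/m
    };

    // Step 2 (integration), written per the formula above.
    void integrate(Vertex* v, int count, float dt)
    {
        for (int i = 0; i < count; ++i) {
            Vector3 cur = v[i].pos;
            v[i].pos = cur * 2.0f - v[i].prevPos
                     + v[i].extForce * (0.5f * v[i].invMass * dt * dt);
            v[i].prevPos = cur;   // current position becomes p_(t-1)
        }
    }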

slide-11
SLIDE 11

How many vertices?

How many vertices fit in 256K (less actually)?

A lot, surprisingly…

Tips:

Look for opportunities to stream data
Keep in LS only data required for each step

slide-12
SLIDE 12

Integration Step

Fewer than 4000 verts fit in 200K of memory
We don’t need to keep them all in LS
Keep vertex data in main memory and bring it in, in blocks

16 + 16 + 16 + 4 = 52 bytes / vertex  (p_t, p_{t−1}, f_E, and 1/m)

p_{t+1} = (2p_t − p_{t−1}) + ½ · f_E · (1/m) · Δt²

slide-13
SLIDE 13

Streaming Integration

[Diagram: main memory holds the arrays p_t, p_{t−1}, f_E, and 1/m, each split into blocks B0–B3; the SPU local store starts out empty.]

slide-14
SLIDE 14

Streaming Integration

[Diagram: DMA_IN B0 — block B0 of each array is transferred from main memory into the local store.]

slide-15
SLIDE 15

Streaming Integration

[Diagram: Process B0 — the SPU integrates the vertices of block B0 in the local store.]

slide-16
SLIDE 16

Streaming Integration

[Diagram: DMA_OUT B0 — the processed block B0 is written back to main memory.]

slide-17
SLIDE 17

Streaming Integration

[Diagram: DMA_IN B1 — block B1 is brought into the local store.]

slide-18
SLIDE 18

Streaming Integration

[Diagram: Process B1 — the SPU integrates block B1.]

slide-19
SLIDE 19

Streaming Integration

[Diagram: DMA_OUT B1 — the processed block B1 is written back to main memory.]

slide-20
SLIDE 20

Streaming Integration

[Diagram: the same pattern continues for the remaining blocks B2 and B3.]

slide-21
SLIDE 21

Double-buffering

Take advantage of concurrent DMA and processing to hide transfer times

Without double-buffering (everything serial):

DMA_IN B0 → Process B0 → DMA_OUT B0 → DMA_IN B1 → Process B1 → DMA_OUT B1 → …

With double-buffering (transfers overlap processing):

DMA_IN B0 → [Process B0 ∥ DMA_IN B1] → [Process B1 ∥ DMA_OUT B0, DMA_IN B2] → [Process B2 ∥ DMA_OUT B1, DMA_IN B3] → …
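A rough sketch of the double-buffered loop, assuming the Vertex struct from the earlier sketch, <stdint.h> integer types, and hypothetical dma_in / dma_out / dma_wait helpers that stand in for the SPU's tag-based DMA facilities (this and the later sketches reuse these helpers). Block size and integrate_block are illustrative.

    enum { BLOCK_VERTS = 512 };
    static Vertex lsBuf[2][BLOCK_VERTS];     // two local-store buffers

    void integrate_stream(uint64_t eaVerts, int numBlocks, float dt)
    {
        const uint32_t blockBytes = sizeof(Vertex) * BLOCK_VERTS;
        int cur = 0;
        dma_in(lsBuf[cur], eaVerts, blockBytes, /*tag*/ cur);

        for (int b = 0; b < numBlocks; ++b) {
            int next = cur ^ 1;
            if (b + 1 < numBlocks) {
                dma_wait(next);              // that buffer's write-back finished?
                dma_in(lsBuf[next], eaVerts + (uint64_t)(b + 1) * blockBytes,
                       blockBytes, next);    // prefetch the next block
            }
            dma_wait(cur);                   // our block has arrived
            integrate_block(lsBuf[cur], BLOCK_VERTS, dt);
            dma_out(lsBuf[cur], eaVerts + (uint64_t)b * blockBytes,
                    blockBytes, cur);        // write back while the next block processes
            cur = next;
        }
        dma_wait(0);
        dma_wait(1);                         // drain remaining transfers
    }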

slide-22
SLIDE 22

Streaming Data

Streaming is possible when the data access pattern is simple and predictable (e.g. linear)

The number of verts processed per frame depends on processing speed and bandwidth, but not on LS size

Unfortunately, not every step in the cloth solver can be fully streamed

Fixing edge lengths requires random memory access…

slide-23
SLIDE 23

Fixing Edge Lengths

Points coming out of the integration step don’t necessarily satisfy the edge distance constraints

struct Edge { int v1; int v2; float restLen; };

[Diagram: the edge between p[v1] and p[v2] before and after the fix.]

Vector3 d = p[v2] - p[v1];
float len = sqrt(dot(d, d));
float diff = (len - restLen) / len;
p[v1] += d * 0.5f * diff;
p[v2] -= d * 0.5f * diff;

slide-24
SLIDE 24

Fixing Edge Lengths

An iterative process: fix one edge at a time by adjusting its 2 vertex positions

Requires random access to the particle positions array

Solution:

Keep all particle positions in LS
Stream in edge data
In 200K we can fit 200KB / 16B > 12K vertices
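A sketch of this arrangement, reusing the hypothetical dma helpers from the streaming sketch: positions stay resident in LS, the Edge array is streamed in blocks, and each edge is relaxed as on the previous slide. Block size and helper names are illustrative, not the actual solver.

    enum { EDGE_BLOCK = 1024 };

    void fix_edges(Vector3* p /* all positions, resident in LS */,
                   uint64_t eaEdges, int numEdges)
    {
        static Edge buf[EDGE_BLOCK];
        for (int base = 0; base < numEdges; base += EDGE_BLOCK) {
            int n = numEdges - base;
            if (n > EDGE_BLOCK) n = EDGE_BLOCK;

            dma_in(buf, eaEdges + (uint64_t)base * sizeof(Edge),
                   n * sizeof(Edge), 0);
            dma_wait(0);

            for (int i = 0; i < n; ++i) {
                Vector3 d  = p[buf[i].v2] - p[buf[i].v1];
                float len  = sqrtf(dot(d, d));
                float diff = (len - buf[i].restLen) / len;
                p[buf[i].v1] += d * (0.5f * diff);   // move the endpoints
                p[buf[i].v2] -= d * (0.5f * diff);   // toward the rest length
            }
        }
    }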

slide-25
SLIDE 25

Rigid Bodies

Our group is currently porting the AGEIA™ PhysX™ SDK to CELL

Large codebase written with a PC architecture in mind

Assumes easy random access to memory
Processes tasks sequentially (no parallelism)

Interesting example of how to port existing code to a multi-core architecture

slide-26
SLIDE 26

Starting the Port

Determine all the stages of the rigid body pipeline

Look for stages that are good candidates for parallelizing/optimizing

Profile code to make sure we are focusing on the right parts

slide-27
SLIDE 27

Rigid Body Pipeline

[Diagram: the pipeline — Broadphase Collision Detection → Narrowphase Collision Detection → Constraint Prep → Constraint Solve → Integration. Data flowing between stages: current body positions → potentially colliding body pairs → points of contact between bodies → constraint equations → updated body velocities → new body positions.]

slide-28
SLIDE 28

Rigid Body Pipeline

[Diagram: the same pipeline with broadphase collision detection as a single stage, and the narrowphase (NP), constraint prep (CP), constraint solve (CS), and integration (I) stages each split into multiple instances that can run in parallel.]

slide-29
SLIDE 29

Profiling Scenario

slide-30
SLIDE 30

Profiling Results

[Chart: cumulative frame time (microseconds) over roughly 2000 frames, broken down into Broadphase, Narrowphase, Constraint Prep, Solver, Integration, and Other.]

slide-31
SLIDE 31

Running on the SPUs

Three steps:

1. (PPU) Pre-process
   “Gather” operation (extract data from PhysX data structures and pack it in MM)

2. (SPU) Execute
   DMA packed data from MM to LS
   Process data and store output in LS
   DMA output to MM

3. (PPU) Post-process
   “Scatter” operation (unpack output data and put it back in PhysX data structures)
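Schematically, and with illustrative names (this is not the actual PhysX port), the flow looks something like this: the PPU packs a batch descriptor plus inputs, the SPU streams the batch through LS, and the PPU scatters the results back.

    struct BatchDesc {            // written by the PPU pre-process ("gather") step
        uint64_t inputEA;         // packed, aligned inputs in main memory
        uint64_t outputEA;        // where the SPU should write its results
        uint32_t numItems;
    };

    enum { MAX_BATCH = 256 };     // illustrative batch size

    // SPU side of step 2: DMA the packed batch in, process it, DMA results out.
    void spu_execute(const BatchDesc& desc)
    {
        static PackedItem in[MAX_BATCH];   // LS buffers; PackedItem, Result and
        static Result     out[MAX_BATCH];  // process() are placeholders

        dma_in(in, desc.inputEA, desc.numItems * sizeof(PackedItem), 0);
        dma_wait(0);

        for (uint32_t i = 0; i < desc.numItems; ++i)
            out[i] = process(in[i]);       // e.g. narrowphase or constraint prep

        dma_out(out, desc.outputEA, desc.numItems * sizeof(Result), 0);
        dma_wait(0);                       // PPU post-process then scatters 'out'
    }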

slide-32
SLIDE 32

Why Involve the PPU?

Required PhysX data is not conveniently packed
Data is often not aligned
We need to use PhysX data structures to avoid breaking features we haven’t ported

Solutions:

Use list DMAs to bring in data
Modify existing code to force alignment
Change PhysX code to work with new data structures

slide-33
SLIDE 33

Batching Up Work

Create work batches for each task

[Diagram: the PPU pre-process step reads the PhysX data structures and writes, for each task, a task description plus batch inputs/outputs into work batch buffers in MM; the SPU execute step consumes those batches; the PPU post-process step writes the results back into the PhysX data structures.]

slide-34
SLIDE 34

Narrow-phase Collision Detection

Problem:

A list of object pairs that may be colliding
Want to do contact processing on SPUs
Pairs list has references to geometry

[Diagram: three bodies A, B, C and their pair list (A,B), (A,C), (B,C), …]

slide-35
SLIDE 35

Narrow-phase Collision Detection

Data locality

Same bodies may be in several pairs
Geometry may be instanced for different bodies

SPU memory access

Can only access main memory with DMA
No hardware cache
Data reuse must be explicit

slide-36
SLIDE 36

Software Cache

Idea: make a (read-only) software cache

Cache entry is one geometric object
Entries have variable size

Basic operation

SPU checks the cache for the object
If not in cache, the object is fetched with DMA
Cache returns a local address for the object

slide-37
SLIDE 37

Software Cache

Data Structures

Two entry buffers
New entries appended to the “current” buffer
Hash-table used to record and find loaded entries

[Diagram: entries A, B, C appended to Buffer 0, with the next DMA landing after them; Buffer 1 is the second entry buffer.]
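A minimal sketch of the lookup path, with illustrative sizes and, for brevity, a linear scan standing in for the hash table; a real version would only invalidate the entries living in the buffer being overwritten, and would overlap the miss DMA with other work.

    enum { BUF_SIZE = 32 * 1024, MAX_ENTRIES = 128 };

    struct CacheEntry { uint64_t ea; void* ls; };

    static char       g_buf[2][BUF_SIZE];     // the two entry buffers
    static int        g_cur = 0, g_used = 0;  // current buffer and bytes used
    static CacheEntry g_entries[MAX_ENTRIES]; // loaded entries (a hash table in
    static int        g_numEntries = 0;       //  the real thing, a scan here)

    void* cache_lookup(uint64_t ea, int size)
    {
        for (int i = 0; i < g_numEntries; ++i)
            if (g_entries[i].ea == ea)
                return g_entries[i].ls;        // hit: object already in LS

        if (g_used + size > BUF_SIZE || g_numEntries == MAX_ENTRIES) {
            g_cur ^= 1;                        // out of space: switch buffers and
            g_used = 0;                        //  overwrite the other one
            g_numEntries = 0;                  // (over-aggressive: drops all entries)
        }
        void* ls = &g_buf[g_cur][g_used];
        g_used += size;
        dma_in(ls, ea, size, 0);               // miss: fetch the object
        dma_wait(0);                           // (pre-fetching would defer this wait)
        g_entries[g_numEntries].ea = ea;
        g_entries[g_numEntries].ls = ls;
        ++g_numEntries;
        return ls;
    }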

slide-38
SLIDE 38

Software Cache

Data Replacement

When space runs out in a buffer
Overwrite the data in the second buffer

Considerations

Does not fragment memory
No searches for free space
But does not prefer frequently used data

slide-39
SLIDE 39

Software Cache

Hiding the DMA latency

Double-buffering

Start DMA for un-cached entries
Process previously DMA’d entries

Process/pre-fetch batches

Fetch and compute times vary
Batching may improve balance

DMA-lists useful

One DMA command
Multiple chunks of data gathered

[Diagram: while entries A, B, C in the current buffer are being processed, the DMA for entries D, E, F is already in flight.]

slide-40
SLIDE 40

Software Caching

Conclusions

Simple cache is practical

Used for small convex objects in PhysX

Design considerations

Tradeoff of cache-logic cycles vs. bandwidth saved
Pre-fetching important to include

slide-41
SLIDE 41

Single SPU Performance

[Diagram: PPU only — the PPU spends the whole frame in Exec. PPU + SPU — the PPU only does Pre-Process and Post-Process (the rest of its time is free) while the SPU runs Exec; SPU Exec < PPU Exec thanks to SIMD and fast memory access.]

slide-42
SLIDE 42

Multiple SPU Performance

Pre- and post-processing times determine how many SPUs can be used effectively

slide-43
SLIDE 43

Multiple SPU Performance

[Diagram: timelines for four work batches (1–4) on the PPU with 1, 2, and 3 SPUs; with more SPUs the SPU work overlaps, but the PPU pre-/post-processing between batches limits how many SPUs can be used effectively.]

slide-44
SLIDE 44

PPU vs SPU comparisons

Convex Stack (500 boxes)

[Chart: frame time in microseconds over roughly 1300 frames, comparing PPU-only, 1-SPU, 2-SPU, 3-SPU, and 4-SPU runs.]

slide-45
SLIDE 45

Duck Demo

One of our first CELL demos (spring 2005)
Several interacting physics systems:

Rigid bodies (ducks & boats)
Height-field water surface
Cloth with ripping (sails)
Particle-based fluids (splashes + cups)

slide-46
SLIDE 46

Duck Demo (Lots of Ducks)

slide-47
SLIDE 47

Duck Demo

Ambitious project with a short deadline
Early PC prototypes of some pieces
Most straightforward way to parallelize:

Dedicate one SPU for each subsystem

Each piece could be developed and tested individually

slide-48
SLIDE 48

Duck Demo Resource Allocation

PPU – main loop

SPU thread synchronization, draw calls

SPU0 – height field water (<50%)
SPU1 – splashes iso-surface (<50%)
SPU2 – cloth sails for boat 1 (<50%)
SPU3 – cloth sails for boat 2 (<50%)
SPU4 – rigid body collision/response (95%)

[Diagram: one frame's timeline with the height-field water, iso-surface, two cloth, and rigid-body jobs each running on its own SPU.]

slide-49
SLIDE 49

Parallelization Recipe

One three-step approach to code parallelization:

1. Find independent components
2. Run them side-by-side
3. Recursively apply the recipe to the components

slide-50
SLIDE 50

Challenges

Step 1: Find independent components

Where do you look?

Maybe you need to break apart and overlap your data?
e.g. broad phase collision detection

Maybe you need to break apart your loop into individual iterations?
e.g. solving cloth constraints

slide-51
SLIDE 51

Broad Phase Collision Detection

Need to test 600 rigid bodies against each other.

[Diagram: the 600 objects are split into three groups of 200 (A, B, C); the three cross-group tests — A vs B, A vs C, and C vs B — can all be executed simultaneously.]

slide-52
SLIDE 52

Cloth Solving

for (i = 1 to 5) { cloth = solve(cloth); }

With the cloth split into independent pieces A, B, and C:

for (i = 1 to 5) {
    solve_on_proc1(a);
    solve_on_proc2(b);
    wait_for_all();
    solve_on_proc1(c);
    wait_for_all();
}

slide-53
SLIDE 53

…challenges

Step 2: Run them side-by-side

Bandwidth and cache issues

Need good data layout to avoid thrashing the cache or bus

Processor issues

Need an efficient processor management scheme

What if the job sizes are very different?

e.g. a suit of cloth and a separate neck tie
Need further refinement of large jobs, or you only save the small neck-tie time

slide-54
SLIDE 54

…challenges

Step 3: Recurse

When do you stop?

Overhead of launching smaller jobs
Synchronization when a stage is done
e.g. gather results from all collision detection before solving

But this can go down to the instruction level
e.g. using Structure-of-Arrays, transform four independent vectors at once

slide-55
SLIDE 55

High Level Parallelization:

Duck Demo

[Diagram: the demo's subsystems — fluid simulation, fluid surface, rigid bodies, and cloth sails — laid out side by side; a dependency exists between the fluid simulation and the fluid surface, and the cloth work is further split into Cloth Boat 1 and Cloth Boat 2.]

Note that the parts didn’t take an equal amount of time to run. We could have done better given time! But the cloth was for multiple boats.

slide-56
SLIDE 56

Lower Level Parallelization

Rigid Body Simulation

600 bodies example

[Diagram: a single broad phase collision detection produces object groups A, B, and C; narrow phase collision detection and constraint solving for the groups then run in parallel on Proc 1, Proc 2, and Proc 3.]

slide-57
SLIDE 57

Structure of Arrays

[Diagram: Data[0]…Data[7] stored two ways. Array of Structures, or “AoS”: each element’s X, Y, Z, W are stored together, so one AoS vector holds X, Y, Z, W of a single element. Structure of Arrays, or “SoA”: all the Xs are stored together, then all the Ys, and so on, so one SoA vector holds the same component of four consecutive elements.]

Bonus! Since W is almost always 0 or 1, we can eliminate it with a clever math library and save 25% memory and bandwidth!
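In struct form, an illustrative sketch of the two layouts (eight elements, as in the figure):

    struct AoS {                 // "Array of Structures"
        float x, y, z, w;        // Data[i] = { Xi, Yi, Zi, Wi }
    };
    AoS aosData[8];

    struct SoA {                 // "Structure of Arrays"
        float x[8];              // X0 .. X7 contiguous
        float y[8];              // Y0 .. Y7 contiguous
        float z[8];              // Z0 .. Z7 contiguous
        float w[8];              // W0 .. W7 (often droppable when W is 0 or 1)
    };
    SoA soaData;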

slide-58
SLIDE 58

Lowest Level Parallelization:

Structure-of-Array processing of Particles

Given:

p_n(t) = position of particle n at time t
v_n(t) = velocity of particle n at time t

p_1(t_i) = p_1(t_{i−1}) + v_1(t_{i−1}) * dt + 0.5 * G * dt²
p_2(t_i) = p_2(t_{i−1}) + v_2(t_{i−1}) * dt + 0.5 * G * dt²

Note they are independent of each other, so we can run four together using SoA:

p_{1-4}(t_i) = p_{1-4}(t_{i−1}) + v_{1-4}(t_{i−1}) * dt + 0.5 * G * dt²
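A plain-C sketch of the SoA update; gravity is applied along y only for brevity, and the inner four-lane loop is what maps onto one 4-wide SIMD operation per component on the SPU (vector intrinsics omitted). Names are illustrative.

    void integrate_particles(float* px, float* py, float* pz,
                             float* vx, float* vy, float* vz,
                             int n, float dt, float gy /* gravity */)
    {
        float halfG = 0.5f * gy * dt * dt;
        for (int i = 0; i < n; i += 4) {
            for (int k = 0; k < 4; ++k) {        // four independent lanes:
                px[i+k] += vx[i+k] * dt;         //  each line becomes a single
                py[i+k] += vy[i+k] * dt + halfG; //  SIMD multiply-add across
                pz[i+k] += vz[i+k] * dt;         //  particles i .. i+3
            }
        }
    }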

slide-59
SLIDE 59

Failure Case

Gauss-Seidel Solver

Consider a simple position-based solver that uses distance constraints. Given:

p = current positions of all objects
solve(cn, p) takes p and constraint cn and computes a new p that satisfies cn

p = solve(c0, p)
p = solve(c1, p)
…

Note that to solve c1, we need the result of c0. We can’t solve c0 and c1 concurrently!

slide-60
SLIDE 60

Failure Case

Possible Solutions

Generally you’re out of luck, but…

Some cases have very limited dependencies
e.g. particle-based cloth solving
Solution: arrange constraints such that no four adjacent constraints share cloth particles

Consider a different solver
e.g. Jacobi solvers don’t use updated values until all constraints have been processed once
But they need more memory (p_new and p_current)
And may need more iterations to converge
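To make the contrast concrete, a hedged sketch of the two update orders; solve_one() and the Constraint type are placeholders, and a real Jacobi step must also accumulate or average corrections when several constraints touch the same particle.

    void gauss_seidel(Vector3* p, const Constraint* c, int n)
    {
        for (int i = 0; i < n; ++i)
            solve_one(c[i], p, p);        // reads and writes p: c1 sees c0's result
    }

    void jacobi(Vector3* pCur, Vector3* pNew, const Constraint* c, int n)
    {
        for (int i = 0; i < n; ++i)
            solve_one(c[i], pCur, pNew);  // reads old positions only: constraints
                                          //  are independent within one iteration
        // swap pCur/pNew before the next iteration; usually needs more iterations
    }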

slide-61
SLIDE 61

Duck Demo (EyeToy + SPH)

slide-62
SLIDE 62

Smoothed Particle Hydrodynamics (SPH) Fluid Simulation

Smoothed particles

Mass distributed around a point
Density falls to 0 at a radius h

Forces between particles closer than 2h

[Diagram: a smoothed particle with its smoothing radius h.]

slide-63
SLIDE 63

SPH Fluid Simulation

High-level parallelism

Put particles in grid cells
Process on different SPUs
(Not used in the duck demo)

Low-level parallelism

SIMD and dual-issue on the SPU
Large n per cell may be better

Less grid overhead
Loops fast on SPU
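A tiny sketch of the grid binning behind the high-level split: a uniform grid with cell size 2h, so interacting particles always lie in the same or an adjacent cell, and ranges of cells can be handed to different SPUs. The grid dimensions and the Vector3 members are assumptions; positions are assumed already offset into the positive octant.

    enum { GRID_DIM = 32 };

    int cell_index(const Vector3& p, float cellSize /* = 2h */)
    {
        int ix = (int)(p.x / cellSize);
        int iy = (int)(p.y / cellSize);
        int iz = (int)(p.z / cellSize);
        return (iz * GRID_DIM + iy) * GRID_DIM + ix;
    }

    // Count particles per cell; force computation for ranges of cells can
    // then be dispatched to different SPUs.
    void bin_particles(const Vector3* pos, int n, float cellSize, int* cellCount)
    {
        for (int c = 0; c < GRID_DIM * GRID_DIM * GRID_DIM; ++c) cellCount[c] = 0;
        for (int i = 0; i < n; ++i)
            ++cellCount[cell_index(pos[i], cellSize)];
    }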

slide-64
SLIDE 64

SPH Loop

Consider two sets of particles P and Q

E.g., taken from neighboring grid cells
An O(n²) problem

Can unroll (e.g., by 4):

for (i = 0; i < numP; i++)
    for (j = 0; j < numQ; j += 4) {
        Compute force (pi, qj)
        Compute force (pi, qj+1)
        Compute force (pi, qj+2)
        Compute force (pi, qj+3)
    }

slide-65
SLIDE 65

SPH Loop, SoA

Idea:

Increase SIMD throughput with structure-of-arrays
Transpose and produce combinations

[Diagram: the x, y, z components of p_i are replicated into SoA form and the components of q_j…q_{j+3} are transposed into SoA form, so four p–q pairs are processed at once.]
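A sketch of that transpose step in plain C (on the SPU the four lanes would sit in 4-wide vector registers); the function name and the Vector3 members are assumptions.

    void make_soa_pairs(const Vector3& pi, const Vector3* q, int j,
                        float px[4], float py[4], float pz[4],
                        float qx[4], float qy[4], float qz[4])
    {
        for (int k = 0; k < 4; ++k) {
            px[k] = pi.x;  py[k] = pi.y;  pz[k] = pi.z;              // splat p_i
            qx[k] = q[j+k].x;  qy[k] = q[j+k].y;  qz[k] = q[j+k].z;  // transpose q
        }
        // force(px,py,pz, qx,qy,qz) can now evaluate 4 distances/kernels per op
    }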

slide-66
SLIDE 66

SPH Loop, Software Pipelined

Add software pipelining

Conversion instructions can dual-issue with math

[Diagram: the per-iteration chain Load[i] → To SoA[i] → Compute[i] → From SoA[i] → Store[i] is software-pipelined so that Compute[i] on one pipe overlaps Load[i+1], To SoA[i+1], From SoA[i−1], and Store[i−1] on the other (Pipe 0 / Pipe 1).]

slide-67
SLIDE 67

Recap

Finding independence is hard!

Across subsystems or within subsystems?
Across iterations or within iterations?
Data-level independence? Instruction-level independence?
How about “bandwidth-level” independence?

Parallelization overhead

Sometimes running serially wins over the overhead of parallelization

slide-68
SLIDE 68

Particle Simulation Demo

slide-69
SLIDE 69

Questions?

http://www.research.scea.com/

Contacts:

Vangelis Kokkevis: vangelis_kokkevis@playstation.sony.com
Eric Larsen: eric_larsen@playstation.sony.com
Steven Osman: steven_osman@playstation.sony.com