
Complex Systems Simulations with CUDA (S5133) Dr Paul Richmond - PowerPoint PPT Presentation

From Biological Cells to Populations of Individuals: Complex Systems Simulations with CUDA (S5133). Dr Paul Richmond, Research Fellow, University of Sheffield (NVIDIA CUDA Research Centre). GTC 2015.


  1. From Biological Cells to Populations of Individuals: Complex Systems Simulations with CUDA (S5133) Dr Paul Richmond Research Fellow University of Sheffield (NVIDIA CUDA Research Centre) GTC 2015

  2. Overview
     • Complex Systems
     • A Framework for Modelling Agents
     • Degrees of Parallelisation
     • Agent Communication
     • Putting it all together

  3. Complex Systems
     • Many individuals
     • Interact and behave according to simple rules
     • System level behaviour emerges

  4. Agent Based Modelling
     • A method for specification and simulation of a complex system
     • Model is a set of autonomous communicating agents
     • Simulation helps to understand complex systems
       • Interventions and prediction
     • Presents a computational challenge!
       • Especially for real time or faster

  5. Difficulties in Applying GPUs
     • Agents are heterogeneous, i.e. they diverge
     • Agents are born and agents die
       • Leads to sparse populations and non-coalesced access
     • Agents communicate
       • No global mechanism for GPU thread communication
     • Agents don't stay still
       • Acceleration structures used for simulation need to be rebuilt

  6. Section: A Framework for Modelling Agents
     • Complex Systems
     • A Framework for Modelling Agents
     • Degrees of Parallelisation
     • Agent Communication
     • Putting it all together

  7. A Formal Model of an Agent
     • Abstract the underlying architecture
     • Let modellers write models, not parallel programs
     • Describe agents as a form of state machine (X-Machine)
       • Minimises divergence
     • Describe state transition functions (agent functions) using high level script
     • Describe communication as message dependencies between agent functions
       • Results in a Directed Acyclic Graph
       • Identifies synchronisation points for scheduling

  8. FLAME GPU: A Code Generation Framework
     • XML Model File
       • Describe agents and communication (messages) as a model in XML
     • XSLT Templates
       • Code generate a simulation API from agent descriptions
     • Scripted Behaviour
       • Scripted behaviour links with dynamic simulation API
     • Simulation Program
       • Loads initial data and provides I/O or interactive visualisation

  9. Code Generation using XSLT
     • Powerful technique for code generation from a declarative XML model
     • Full functional programming language

     XML model (agent description):

         <xagents>
           <gpu:xagent>
             <name>Circle</name>
             <memory>
               <gpu:variable> <type>int</type> <name>id</name> </gpu:variable>
               <gpu:variable> <type>float</type> <name>x</name> </gpu:variable>
               <gpu:variable> <type>float</type> <name>y</name> </gpu:variable>
               <gpu:variable> <type>float</type> <name>z</name> </gpu:variable>
               <gpu:variable> <type>float</type> <name>fx</name> </gpu:variable>
               <gpu:variable> <type>float</type> <name>fy</name> </gpu:variable>
             </memory>
             ...

     XSLT template:

         <xsl:for-each select="xagents/gpu:xagent">
         struct __align__(16) xmachine_memory_<xsl:value-of select="name"/>
         {<xsl:for-each select="memory/gpu:variable">
           <xsl:value-of select="type"/><xsl:text> </xsl:text><xsl:if test="arrayLength">*</xsl:if><xsl:value-of select="name"/>;
         </xsl:for-each>
         };
         </xsl:for-each>

     Generated code:

         struct __align__(16) xmachine_memory_Circle
         {
           int id;
           float x;
           float y;
           float z;
           float fx;
           float fy;
         };

  10. Mapping an Agent to the GPU
     • Each agent function corresponds to a single GPU kernel
     • Each CUDA thread represents a single agent instance
     • Agent functions use a dynamically generated API
     • Agent data is transparently loaded from structures of arrays

     Agent function:

         __FLAME_GPU_FUNC__ int read_locations(
             xmachine_memory_bird* xmemory,
             xmachine_message_location_list* location_messages)
         {
           /* Get the first message */
           xmachine_message_location* location_message =
               get_first_location_message(location_messages);

           /* Repeat until there are no more messages */
           while (location_message)
           {
             /* Process the message */
             if (distance_check(xmemory, location_message))
             {
               updateSteerVelocity(xmemory, location_message);
             }

             /* Get the next message */
             location_message =
                 get_next_location_message(location_message, location_messages);
           }

           /* Update any other xmemory variables */
           xmemory->x += xmemory->vel_x*TIME_STEP;
           ...
           return 0;
         }

     Array of structures:

         typedef struct agent{
           float x;
           float y;
         } xm_memory_agent_list [N];

     Structure of arrays:

         typedef struct agent_list{
           float x[N];
           float y[N];
         } xm_memory_agent_list;

     [Figure: agent variables stored as parallel arrays indexed 0, 1, 2, 3, ... N]
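The two memory layouts on this slide can be contrasted in a small host-side C++ sketch. This is not FLAME GPU's generated code: the names `Agent`, `AgentList`, `to_soa` and `load_agent` are hypothetical, and the per-agent load is only an assumption about what "transparently loaded" might involve.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures (AoS) element: one record per agent.
struct Agent {
    float x;
    float y;
};

// Structure of arrays (SoA): one contiguous array per agent variable, so
// that agents i, i+1, ... keep the same variable at neighbouring addresses.
struct AgentList {
    std::vector<float> x;
    std::vector<float> y;
};

// Convert an AoS population into the SoA layout.
AgentList to_soa(const std::vector<Agent>& agents) {
    AgentList list;
    list.x.reserve(agents.size());
    list.y.reserve(agents.size());
    for (const Agent& a : agents) {
        list.x.push_back(a.x);
        list.y.push_back(a.y);
    }
    return list;
}

// Gather agent i from the SoA layout into a per-agent record
// (hypothetical stand-in for a generated loading step).
Agent load_agent(const AgentList& list, std::size_t i) {
    assert(i < list.x.size() && i < list.y.size());
    return Agent{list.x[i], list.y[i]};
}
```

Calling `to_soa` on a population and then `load_agent(list, i)` returns the same values as the original record `i`; only the storage order differs. On a GPU, the SoA form is what lets consecutive threads read consecutive addresses.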

  11. Section: Degrees of Parallelisation
     • Complex Systems
     • A Framework for Modelling Agents
     • Degrees of Parallelisation
     • Agent Communication
     • Putting it all together

  12. Agent Divergence and Sparsity
     • Divergence: must group agents (threads)
     • Good news: agents are already grouped by agent function state
     • Bad news: agents change states, so we are left with sparse lists
     • Avoid sparse lists by using parallel prefix sum compaction
       • Thrust C++ library

     [Figure: an agent list with flags 1 0 1 1 0 1 0 1 is prefix summed and compacted into a new agent list]
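Prefix sum compaction can be illustrated with a sequential C++ sketch. The slides use Thrust on the GPU; the version below is a plain CPU stand-in, and the function names `exclusive_prefix_sum` and `compact` are hypothetical. Agents are represented by integer ids for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] = flags[0] + ... + flags[i-1].
std::vector<int> exclusive_prefix_sum(const std::vector<int>& flags) {
    std::vector<int> out(flags.size());
    int running = 0;
    for (std::size_t i = 0; i < flags.size(); ++i) {
        out[i] = running;
        running += flags[i];
    }
    return out;
}

// Keep only the agents whose flag is 1, preserving their order.
std::vector<int> compact(const std::vector<int>& agents,
                         const std::vector<int>& flags) {
    assert(agents.size() == flags.size());
    const std::vector<int> pos = exclusive_prefix_sum(flags);
    const std::size_t kept =
        flags.empty() ? 0 : static_cast<std::size_t>(pos.back() + flags.back());
    std::vector<int> out(kept);
    for (std::size_t i = 0; i < agents.size(); ++i) {
        // Scatter step: each surviving agent writes to its own slot, so the
        // iterations are independent and map naturally onto GPU threads.
        if (flags[i]) out[pos[i]] = agents[i];
    }
    return out;
}
```

With the flag row from the slide, `1 0 1 1 0 1 0 1`, the exclusive prefix sum is `0 1 1 2 3 3 4 4`, and compacting agents with ids 10 to 17 leaves `10 12 13 15 17`.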

  13. Parallelism within the model
     • Behaviour consists of function layers
     • Each layer is a synchronisation barrier
     • Synchronisation between agents only required when a dependency exists (communication or agent memory)
     • This creates parallelism within the function layers of the model
     • CUDA Streams can be used to execute independent functions

     [Figure: agent functions arranged in Layer 1, Layer 2 and Layer 3]
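One way to derive function layers from dependencies is to give each function the length of its longest dependency chain. The C++ sketch below assumes this scheme for illustration only; the slides do not say how FLAME GPU assigns layers, and `assign_layers` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// deps holds pairs (a, b) meaning "function b depends on function a".
// Returns layer[f] = length of the longest dependency chain ending at f.
// Assumes the dependency graph is acyclic.
std::vector<int> assign_layers(std::size_t n,
                               const std::vector<std::pair<int, int>>& deps) {
    std::vector<int> layer(n, 0);
    // Relax every edge repeatedly. In a DAG the longest chain has at most
    // n - 1 edges, so n - 1 passes are enough.
    for (std::size_t pass = 0; pass + 1 < n; ++pass) {
        bool changed = false;
        for (const auto& d : deps) {
            assert(d.first >= 0 && static_cast<std::size_t>(d.first) < n);
            assert(d.second >= 0 && static_cast<std::size_t>(d.second) < n);
            if (layer[d.second] < layer[d.first] + 1) {
                layer[d.second] = layer[d.first] + 1;
                changed = true;
            }
        }
        if (!changed) break;  // all layers are final
    }
    return layer;
}
```

For four functions with dependencies 0 → 1 → 2 and an unrelated function 3, the result is layers `0 1 2 0`. Functions 0 and 3 share a layer, so they have no dependency on each other and are candidates for concurrent execution, for example on separate CUDA streams.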

  14. High Divergence Example
     • Single agent 'cell' type
       • 5 types of cell within
     • Single message type
     • Advantages
       • Large population counts (good utilisation)
       • Simple modelling (but complicated agent transition functions)
     • Disadvantages
       • Lots of code divergence
       • Unnecessary message reading

  15. Low Divergence Example
     • Multiple agent types
       • Different agent type for each cell type
     • Distinction between message types
     • Advantages
       • Less divergent code
       • More parallelism within the model
       • Less message reading
     • Disadvantages
       • Complex dependencies
       • More complex (looking) model
       • Smaller population sizes

  16. Parallelism within the model - performance
     [Figure: two plots against population size (500 to 512000), comparing High Divergence and Low Divergence: simulation speedup of cell behaviour (log scale) and average iteration time of cell behaviour (ms)]

  17. Section: Agent Communication
     • Complex Systems
     • A Framework for Modelling Agents
     • Degrees of Parallelisation
     • Agent Communication
     • Putting it all together

  18. Agent Communication
     • Brute Force Messaging (N-Body problem)
       • Tile messages into shared memory
     • Spatially Distributed Agents
       • Build data structure to bin agents
       • CUDA Particles
       • Use counting sort to improve performance
     • Discrete Space Limited Range (Cellular Automaton)
       • Cache results via texture cache (good locality)

  19. Spatially Distributed Communication
     • Radix Sorting
       • Hash Message
       • Sort using Thrust (Sort by Key): sort keys
       • Reorder: scatter messages
     • Count Sort
       • Hash Message: atomic add to bin
       • Prefix Sum: global index of bins
       • Reorder: scatter messages
     • Build partition matrix
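The count sort pipeline (hash, count, prefix sum, scatter) can be sketched sequentially in C++. This is an illustration under assumptions, not FLAME GPU's implementation: the uniform square grid, the names `Message`, `BinnedMessages`, `hash_message` and `counting_sort`, and the clamping of out-of-range positions are all hypothetical choices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Message {
    float x;
    float y;
};

struct BinnedMessages {
    std::vector<Message> sorted;  // messages grouped by bin
    std::vector<int> bin_start;   // bin b occupies [bin_start[b], bin_start[b + 1])
};

// Hash: map a position to a cell of a grid_dim x grid_dim grid (assumed layout).
int hash_message(const Message& m, float cell_size, int grid_dim) {
    int cx = static_cast<int>(m.x / cell_size);
    int cy = static_cast<int>(m.y / cell_size);
    cx = std::min(std::max(cx, 0), grid_dim - 1);  // clamp to the grid
    cy = std::min(std::max(cy, 0), grid_dim - 1);
    return cy * grid_dim + cx;
}

BinnedMessages counting_sort(const std::vector<Message>& msgs,
                             float cell_size, int grid_dim) {
    assert(cell_size > 0.0f && grid_dim > 0);
    const std::size_t bins = static_cast<std::size_t>(grid_dim) * grid_dim;
    std::vector<int> count(bins, 0), key(msgs.size()), offset(msgs.size());

    // 1. Hash each message and count per bin. On a GPU this increment would
    //    be an atomic add, whose returned old value is the in-bin offset.
    for (std::size_t i = 0; i < msgs.size(); ++i) {
        key[i] = hash_message(msgs[i], cell_size, grid_dim);
        offset[i] = count[key[i]]++;
    }

    // 2. Exclusive prefix sum over the counts gives each bin's global start.
    BinnedMessages out;
    out.bin_start.assign(bins + 1, 0);
    for (std::size_t b = 0; b < bins; ++b)
        out.bin_start[b + 1] = out.bin_start[b] + count[b];

    // 3. Reorder: scatter every message to bin start + in-bin offset.
    out.sorted.resize(msgs.size());
    for (std::size_t i = 0; i < msgs.size(); ++i)
        out.sorted[out.bin_start[key[i]] + offset[i]] = msgs[i];
    return out;
}
```

After sorting, the messages of bin `b` are the contiguous range `sorted[bin_start[b]]` up to, but not including, `sorted[bin_start[b + 1]]`, so an agent only needs to read the ranges of its own and neighbouring cells.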

  20. Counting Sort Performance Study
     [Figure: sorting performance for 1M elements on Tesla K20, Tesla K40 and GTX 980; time (ms) against element range, comparing Thrust Sort and Counting Sort]

  21. Performance Breakdown
     [Figure: performance breakdown for 16k agents and for 4M agents; time (ms) for Count Sort and Thrust Sort, split into send_locations, read_locations and move]
     [Figure: performance improvement using Count Sort (GTX 980); speedup against population size]
     • Counting sort best suited to smaller population sizes
     • Message reading is the bottleneck

  22. Spatially Distributed Communication Benchmark
     [Figure: time (ms, log scale) against population size (4096 to 4194304) for GTX 980, K40 and FLAME CPU]
     • 27k times faster than FLAME on CPU with 50k agents (apples != oranges)
     • 700x faster than FLAME II with 50k agents on 16 cores (using MPI, vector splitting)

  23. Section: Putting it all together
     • Complex Systems
     • A Framework for Modelling Agents
     • Degrees of Parallelisation
     • Agent Communication
     • Putting it all together

  24. Pedestrian Dynamics
     • Pedestrian agents
       • Social Repulsion (Social Forces)
       • Reynolds steering forces
       • Reciprocal Velocity Obstacles
     • Navigation agents
       • Global Vector Field
       • Navigation Graph
     • Environment and goals are calculated as a weighted influence
     • An extension: navigation graphs

  25. Conclusions
     • Agent based modelling can be used to represent complex systems at differing biological scales
     • FLAME GPU is a framework for model description and CUDA code generation
     • Using state based representation avoids divergence and allows parallelism within a model to be exploited
     • Counting sort helpful for highly divergent population
     • Visualisation is extremely cheap
