From Biological Cells to Populations of Individuals: Complex Systems Simulations with CUDA (S5133)
Dr Paul Richmond Research Fellow University of Sheffield (NVIDIA CUDA Research Centre)
GTC 2015
Overview: Complex Systems; A Framework
i.e. they diverge
Leads to sparse populations and non-coalesced access
No global mechanism for GPU thread communication
Acceleration structures used for simulation need to be rebuilt
model in XML
visualisation
<xagents>
  <gpu:xagent>
    <name>Circle</name>
    <memory>
      <gpu:variable> <type>int</type> <name>id</name> </gpu:variable>
      <gpu:variable> <type>float</type> <name>x</name> </gpu:variable>
      <gpu:variable> <type>float</type> <name>y</name> </gpu:variable>
      <gpu:variable> <type>float</type> <name>z</name> </gpu:variable>
      <gpu:variable> <type>float</type> <name>fx</name> </gpu:variable>
      <gpu:variable> <type>float</type> <name>fy</name> </gpu:variable>
    </memory>
    ...

<xsl:for-each select="xagents/gpu:xagent">
struct __align__(16) xmachine_memory_<xsl:value-of select="name"/>
{<xsl:for-each select="memory/gpu:variable">
    <xsl:value-of select="type"/><xsl:text> </xsl:text><xsl:if test="arrayLength">*</xsl:if><xsl:value-of select="name"/>;
</xsl:for-each>
};
</xsl:for-each>

struct __align__(16) xmachine_memory_Circle
{
    int id;
    float x;
    float y;
    float z;
    float fx;
    float fy;
};
kernel
/* Array of structures (AoS) */
typedef struct agent {
    float x;
    float y;
} xm_memory_agent_list[N];

/* Structure of arrays (SoA) */
typedef struct agent_list {
    float x[N];
    float y[N];
} xm_memory_agent_list;

__FLAME_GPU_FUNC__ int read_locations(
    xmachine_memory_bird* xmemory,
    xmachine_message_location_list* location_messages)
{
    /* Get the first message */
    xmachine_message_location* location_message =
        get_first_location_message(location_messages);

    /* Repeat until there are no more messages */
    while (location_message) {
        /* Process the message */
        if (distance_check(xmemory, location_message)) {
            updateSteerVelocity(xmemory, location_message);
        }
        /* Get the next message */
        location_message = get_next_location_message(location_message, location_messages);
    }

    /* Update any other xmemory variables */
    xmemory->x += xmemory->vel_x*TIME_STEP;
    ...

    return 0;
}
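The two typedefs above contrast an array-of-structures layout with a structure-of-arrays layout. As a minimal host-side sketch of the difference (plain C++ rather than CUDA, and not FLAME GPU's generated code), the example below converts one layout into the other and updates every agent. The names `AgentAoS`, `AgentListSoA`, `to_soa` and `move_all` are hypothetical and chosen only for illustration. The loop index `i` stands in for a GPU thread index: in the SoA layout, neighbouring indices touch neighbouring floats in the same array, which is the access pattern that coalesces on a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures: the variables of one agent are adjacent in memory.
struct AgentAoS {
    float x;
    float y;
};

// Structure of arrays: the same variable of all agents is adjacent in memory.
struct AgentListSoA {
    std::vector<float> x;
    std::vector<float> y;
};

// Hypothetical helper: copy an AoS agent list into the SoA layout.
AgentListSoA to_soa(const std::vector<AgentAoS>& agents) {
    AgentListSoA list;
    list.x.reserve(agents.size());
    list.y.reserve(agents.size());
    for (const AgentAoS& a : agents) {
        list.x.push_back(a.x);
        list.y.push_back(a.y);
    }
    return list;
}

// Hypothetical helper: displace every agent by (dx, dy).
// Index i plays the role of a thread index; consecutive i read
// consecutive elements of list.x and list.y.
void move_all(AgentListSoA& list, float dx, float dy) {
    assert(list.x.size() == list.y.size());
    for (std::size_t i = 0; i < list.x.size(); ++i) {
        list.x[i] += dx;
        list.y[i] += dy;
    }
}
```

In a CUDA kernel the loop body would run once per thread, with `i` derived from the block and thread indices.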
state
with sparse lists
[Figure: compaction of a sparse agent list. The Agent Function flags and their Prefix Sum (1 2 2 3 4 4) are used to Compact the Agent List into a New Agent List]
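The compaction of a sparse agent list with a prefix sum can be sketched sequentially as below. This is a plain C++ stand-in for what would be a parallel scan and scatter on the GPU, not the framework's actual implementation. The function names `exclusive_scan` and `compact` are hypothetical, agents are represented by integer ids for brevity, and the sketch assumes each flag is 0 or 1 and uses an exclusive scan so that the scan value is directly the output position.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] is the sum of flags[0..i-1].
std::vector<int> exclusive_scan(const std::vector<int>& flags) {
    std::vector<int> out(flags.size());
    int running = 0;
    for (std::size_t i = 0; i < flags.size(); ++i) {
        out[i] = running;
        running += flags[i];
    }
    return out;
}

// Keep only the agents whose flag is 1, preserving their order.
// Assumes every flag is 0 or 1.
std::vector<int> compact(const std::vector<int>& agents,
                         const std::vector<int>& flags) {
    assert(agents.size() == flags.size());
    std::vector<int> pos = exclusive_scan(flags);
    // Number of surviving agents: last scan value plus last flag.
    std::size_t live = flags.empty()
        ? 0
        : static_cast<std::size_t>(pos.back() + flags.back());
    std::vector<int> out(live);
    // Scatter: each surviving agent writes itself to its scanned position.
    for (std::size_t i = 0; i < agents.size(); ++i) {
        if (flags[i]) {
            out[static_cast<std::size_t>(pos[i])] = agents[i];
        }
    }
    return out;
}
```

Because every surviving agent knows its destination from the scan alone, the scatter step needs no communication between threads when run in parallel.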
layers
barrier
exists (communication or agent memory)
function layers of the model
execute independent functions
[Figure: function layers of the model: Layer 1, Layer 2, Layer 3]
utilisation)
transition functions)
[Figure: Average iteration time of cell behaviour. Time (ms) against population size (500 to 512,000)]
[Figure: Simulation speedup. Speedup of cell behaviour (log scale) against population size (500 to 512,000), for high divergence and low divergence]
Radix Sorting
    Hash Message
    Sort using Thrust (Sort by Key): sort keys
    Reorder: scatter messages, build partition matrix

Count Sort
    Hash Message: atomic add to bin
    Prefix Sum: global index of bins
    Reorder: scatter messages
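The count sort steps listed above (hash each message to a bin with an atomic add, prefix sum over the bin counts, then scatter) can be sketched sequentially as follows. This is a hedged illustration in plain C++, not the framework's kernel code. The names `BinnedOrder` and `count_sort` are hypothetical, and the sketch assumes each message has already been hashed to a bin index in `[0, num_bins)`. On the GPU, the `count[b]++` step would be an `atomicAdd`, which returns the previous counter value and so gives each message its offset within its bin.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of the sort: where each bin starts, and where each message goes.
struct BinnedOrder {
    std::vector<int> bin_start;  // size num_bins + 1; bin b occupies [bin_start[b], bin_start[b + 1])
    std::vector<int> dest;       // dest[i] is the output index of message i
};

// Hypothetical helper: bins[i] is the (already hashed) bin index of message i.
BinnedOrder count_sort(const std::vector<int>& bins, int num_bins) {
    assert(num_bins > 0);
    std::vector<int> count(static_cast<std::size_t>(num_bins), 0);
    std::vector<int> local(bins.size());

    // Step 1: count messages per bin and record each message's offset in its bin.
    // (Stand-in for atomicAdd, which returns the old counter value.)
    for (std::size_t i = 0; i < bins.size(); ++i) {
        assert(bins[i] >= 0 && bins[i] < num_bins);
        local[i] = count[static_cast<std::size_t>(bins[i])]++;
    }

    // Step 2: exclusive prefix sum over the counts gives the global start of each bin.
    BinnedOrder result;
    result.bin_start.assign(static_cast<std::size_t>(num_bins) + 1, 0);
    for (std::size_t b = 0; b < static_cast<std::size_t>(num_bins); ++b) {
        result.bin_start[b + 1] = result.bin_start[b] + count[b];
    }

    // Step 3: scatter position = start of the bin + offset within the bin.
    result.dest.resize(bins.size());
    for (std::size_t i = 0; i < bins.size(); ++i) {
        result.dest[i] = result.bin_start[static_cast<std::size_t>(bins[i])] + local[i];
    }
    return result;
}
```

Unlike a key sort, this approach never compares or reorders keys; its cost depends on the number of messages and the number of bins.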
[Figure: Sorting Performance (1M elements). Time (ms) against element range for Thrust Sort and Counting Sort, with one panel each for Tesla K20, Tesla K40 and GTX 980]
population sizes
[Figure: Performance Breakdown for 16k agents. Time (ms) for Count Sort and Thrust Sort, broken down by send_locations, read_locations and move]
[Figure: Performance Breakdown for 4M agents. Time (ms) for Count Sort and Thrust Sort, broken down by send_locations, read_locations and move]
[Figure: Performance Improvement using Count Sort (GTX 980). Speedup (1 to 1.35) against population size]
[Figure: Time (ms, log scale) against population size (4,096 to 4,194,304) for GTX 980, K40 and FLAME on CPU]
27,000x faster than FLAME on CPU with 50k agents (apples != oranges)
700x faster than FLAME II with 50k agents on 16 cores (using MPI, vector splitting)
weighted influence
Get the code for free from:
http://www.flamegpu.com
www.github.com/FLAMEGPU

Contact Me:
p.richmond@sheffield.ac.uk
http://www.paulrichmond.staff.shef.ac.uk

Please complete the Presenter Evaluation sent to you by email or through the GTC Mobile App. Your feedback is important!