Complex Systems Simulations with CUDA (S5133), Dr Paul Richmond



SLIDE 1

From Biological Cells to Populations of Individuals: Complex Systems Simulations with CUDA (S5133)

Dr Paul Richmond Research Fellow University of Sheffield (NVIDIA CUDA Research Centre)

GTC 2015

SLIDE 2

Overview

  • Complex Systems
  • A Framework for Modelling Agents
  • Degrees of Parallelisation
  • Agent Communication
  • Putting it all together

SLIDE 3

Complex Systems

  • Many individuals
  • Interact and behave according to simple rules
  • System level behaviour emerges

SLIDE 4

Agent Based Modelling

  • A method for specification and simulation of a complex system
  • Model is a set of autonomous communicating agents
  • Simulation helps to understand complex systems
  • Interventions and prediction
  • Presents a computational challenge!
  • Especially for real time or faster

SLIDE 5

Difficulties in Applying GPUs

  • Agents are heterogeneous
      • i.e. they diverge
  • Agents are born and agents die
      • Leads to sparse populations and non-coalesced access
  • Agents communicate
      • No global mechanism for GPU thread communication
  • Agents don't stay still
      • Acceleration structures used for simulation need to be rebuilt

SLIDE 6
  • Complex Systems
  • A Framework for Modelling Agents
  • Degrees of Parallelisation
  • Agent Communication
  • Putting it all together

SLIDE 7

A Formal Model of an Agent

  • Abstract the underlying architecture
  • Let modellers write models, not parallel programs
  • Describe agents as a form of state machine (X-Machine)
  • Minimises divergence
  • Describe state transition functions (agent functions) using a high-level script
  • Describe communication as message dependencies between agent functions
  • Results in a directed acyclic graph
  • Identifies synchronisation points for scheduling
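One way to read the last two bullets: once the message dependencies form a directed acyclic graph, each agent function can be given a layer number, and synchronisation is only needed between layers. The following is a minimal sequential C++ sketch of that idea, not FLAME GPU's scheduler. The function name `assign_layers` and the integer function ids are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Assign each agent function to a layer so that every function sits in a
// later layer than all functions it depends on. Functions sharing a layer
// have no dependency path between them. Returns an empty vector if the
// dependency graph contains a cycle (i.e. it is not a DAG).
std::vector<int> assign_layers(std::size_t n,
                               const std::vector<std::pair<int, int>>& deps) {
    std::vector<std::vector<int>> successors(n);
    std::vector<int> indegree(n, 0);
    for (const auto& d : deps) {  // d.first must run before d.second
        assert(d.first >= 0 && static_cast<std::size_t>(d.first) < n);
        assert(d.second >= 0 && static_cast<std::size_t>(d.second) < n);
        successors[d.first].push_back(d.second);
        ++indegree[d.second];
    }

    std::vector<int> layer(n, 0);
    std::queue<int> ready;
    for (std::size_t f = 0; f < n; ++f)
        if (indegree[f] == 0) ready.push(static_cast<int>(f));

    std::size_t visited = 0;
    while (!ready.empty()) {
        const int f = ready.front();
        ready.pop();
        ++visited;
        for (int g : successors[f]) {
            // g must come after its latest predecessor.
            if (layer[g] < layer[f] + 1) layer[g] = layer[f] + 1;
            if (--indegree[g] == 0) ready.push(g);
        }
    }
    if (visited != n) return {};  // a cycle was found
    return layer;
}
```

For example, with functions 0 and 1 both feeding function 2, which feeds function 3, functions 0 and 1 share the first layer and could run concurrently.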

SLIDE 8

FLAME GPU: A Code Generation Framework

  • XML Model File
      • Describe agents and communication (messages) as a model in XML
  • XSLT Templates
      • Code generate a simulation API from agent descriptions
  • Scripted Behaviour
      • Scripted behaviour links with the dynamic simulation API
  • Simulation Program
      • Loads initial data and provides I/O or interactive visualisation

SLIDE 9

Code Generation using XSLT

  • Powerful technique for code generation from Declarative XML model
  • Full functional programming language

XML model (excerpt):

    <xagents>
      <gpu:xagent>
        <name>Circle</name>
        <memory>
          <gpu:variable> <type>int</type> <name>id</name> </gpu:variable>
          <gpu:variable> <type>float</type> <name>x</name> </gpu:variable>
          <gpu:variable> <type>float</type> <name>y</name> </gpu:variable>
          <gpu:variable> <type>float</type> <name>z</name> </gpu:variable>
          <gpu:variable> <type>float</type> <name>fx</name> </gpu:variable>
          <gpu:variable> <type>float</type> <name>fy</name> </gpu:variable>
        </memory>
        ...

XSLT template:

    <xsl:for-each select="xagents/gpu:xagent">
    struct __align__(16) xmachine_memory_<xsl:value-of select="name"/>
    {<xsl:for-each select="memory/gpu:variable">
        <xsl:value-of select="type"/><xsl:text> </xsl:text><xsl:if test="arrayLength">*</xsl:if><xsl:value-of select="name"/>;
    </xsl:for-each>
    };
    </xsl:for-each>

Generated code:

    struct __align__(16) xmachine_memory_Circle
    {
        int id;
        float x;
        float y;
        float z;
        float fx;
        float fy;
    };
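The template step can be mimicked in ordinary code to see what it does: walk the agent's variable list and emit one struct member per variable. This is a rough C++ sketch of that transformation only. It is not how FLAME GPU generates code (that is done with XSLT), and the types `Variable` and `AgentDescription` and the function `generate_agent_struct` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// In-memory stand-in for one <gpu:variable> element (hypothetical type).
struct Variable {
    std::string type;
    std::string name;
};

// In-memory stand-in for one <gpu:xagent> element (hypothetical type).
struct AgentDescription {
    std::string name;
    std::vector<Variable> memory;
};

// Emit the text of a struct holding one member per agent variable,
// following the naming pattern xmachine_memory_<agent name>.
std::string generate_agent_struct(const AgentDescription& agent) {
    assert(!agent.name.empty());
    std::string code =
        "struct __align__(16) xmachine_memory_" + agent.name + " {\n";
    for (const Variable& v : agent.memory)
        code += "    " + v.type + " " + v.name + ";\n";
    code += "};\n";
    return code;
}
```

Unlike the template on the slide, this sketch ignores the arrayLength case.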

SLIDE 10

Mapping an Agent to the GPU

  • Each agent function corresponds to a single GPU kernel
  • Each CUDA thread represents a single agent instance
  • Agent functions use a dynamically generated API
  • Agent data is transparently loaded from structures of arrays

Array of structures:

    typedef struct agent { float x; float y; } xm_memory_agent_list[N];

Structure of arrays:

    typedef struct agent_list { float x[N]; float y[N]; } xm_memory_agent_list;

[Figure: element indices 0 … N for the two memory layouts]

Agent function:

    __FLAME_GPU_FUNC__ int read_locations(
        xmachine_memory_bird* xmemory,
        xmachine_message_location_list* location_messages)
    {
        /* Get the first message */
        xmachine_message_location* location_message =
            get_first_location_message(location_messages);

        /* Repeat until there are no more messages */
        while (location_message)
        {
            /* Process the message */
            if (distance_check(xmemory, location_message))
            {
                updateSteerVelocity(xmemory, location_message);
            }

            /* Get the next message */
            location_message =
                get_next_location_message(location_message, location_messages);
        }

        /* Update any other xmemory variables */
        xmemory->x += xmemory->vel_x*TIME_STEP;
        ...

        return 0;
    }
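The two layouts differ in how far apart the same variable of neighbouring agents sits in memory. In a structure of arrays, consecutive threads reading x touch consecutive floats, which is what coalesced access needs. Below is a small host-side C++ sketch of the conversion and of the resulting strides. The names `AgentAoS`, `AgentListSoA`, `to_soa`, `x_stride_aos` and `x_stride_soa` are illustrative and do not come from the FLAME GPU API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures element: all variables of one agent are adjacent.
struct AgentAoS {
    float x;
    float y;
};

// Structure of arrays: all x values are adjacent, then all y values.
struct AgentListSoA {
    std::vector<float> x;
    std::vector<float> y;
};

// Copy an array-of-structures population into structure-of-arrays form.
AgentListSoA to_soa(const std::vector<AgentAoS>& agents) {
    AgentListSoA list;
    list.x.reserve(agents.size());
    list.y.reserve(agents.size());
    for (const AgentAoS& a : agents) {
        list.x.push_back(a.x);
        list.y.push_back(a.y);
    }
    assert(list.x.size() == list.y.size());
    return list;
}

// Distance in bytes between the x values of two consecutive agents.
std::size_t x_stride_aos() { return sizeof(AgentAoS); }
std::size_t x_stride_soa() { return sizeof(float); }
```

With two float variables the stride doubles in the array-of-structures layout, and it grows with every variable added to the agent.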

SLIDE 11
  • Complex Systems
  • A Framework for Modelling Agents
  • Degrees of Parallelisation
  • Agent Communication
  • Putting it all together

SLIDE 12

Agent Divergence and Sparsity

  • Divergence: Must group agents (threads)
  • Good News: Agents are already grouped by state
  • Bad News: Agents change states so we are left with sparse lists
  • Avoid sparse lists by using parallel compaction
      • Thrust C++ library

[Figure: compaction of an agent list; labels: Agent List, Agent Function, Prefix Sum, Compact, New Agent List]
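Compaction by prefix sum can be sketched sequentially: an exclusive prefix sum over the keep flags gives each surviving agent its slot in the dense list, and a scatter moves it there. This is a plain C++ illustration of the technique, not the Thrust-based code used in FLAME GPU. The function name `compact` and the use of integer agent ids are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build a dense list from the agents whose keep flag is non-zero.
// On a GPU the scan and the scatter would each run in parallel.
std::vector<int> compact(const std::vector<int>& agents,
                         const std::vector<int>& keep) {
    assert(agents.size() == keep.size());

    // Exclusive prefix sum of the flags: position[i] is the number of
    // surviving agents before index i, i.e. the output slot of agent i.
    std::vector<std::size_t> position(keep.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < keep.size(); ++i) {
        position[i] = running;
        if (keep[i]) ++running;
    }

    // Scatter each surviving agent to its slot.
    std::vector<int> dense(running);
    for (std::size_t i = 0; i < agents.size(); ++i)
        if (keep[i]) dense[position[i]] = agents[i];
    return dense;
}
```

The order of the surviving agents is preserved, so the compacted list stays consistent with the original ordering.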

SLIDE 13

Parallelism within the model

  • Behaviour consists of function layers
  • Each layer is a synchronisation barrier
  • Synchronisation between agents is only required when a dependency exists (communication or agent memory)
  • This creates parallelism within the function layers of the model
  • CUDA Streams can be used to execute independent functions

[Figure: agent functions arranged in Layer 1, Layer 2 and Layer 3]

SLIDE 14

High Divergence Example

  • Single agent ‘cell’ type
  • 5 types of cell within the single agent type
  • Single message type
  • Advantages
      • Large population counts (good utilisation)
      • Simple modelling (but complicated agent transition functions)
  • Disadvantages
      • Lots of code divergence
      • Unnecessary message reading

SLIDE 15

Low Divergence Example

  • Multiple agent types
  • Different agent type for each cell type
  • Distinction between message types
  • Advantages
      • Less divergent code
      • More parallelism within the model
      • Less message reading
  • Disadvantages
      • Complex dependencies
      • More complex (looking) model
      • Smaller population sizes

SLIDE 16

Parallelism within the model - performance

[Figure: Average iteration time of cell behaviour. Time (ms) against population size (500 to 512000), High Divergence and Low Divergence]

[Figure: Simulation Speedup. Speedup of cell behaviour against population size (500 to 512000)]

SLIDE 17
  • Complex Systems
  • A Framework for Modelling Agents
  • Degrees of Parallelisation
  • Agent Communication
  • Putting it all together

SLIDE 18

Agent Communication

  • Brute Force Messaging (N-Body problem)
      • Tile messages into shared memory
  • Spatially Distributed Agents
      • Build data structure to bin agents
      • CUDA Particles
      • Use counting sort to improve performance
  • Discrete Space Limited Range (Cellular Automaton)
      • Cache results via texture cache (good locality)

SLIDE 19

Spatially Distributed Communication

Radix Sorting
  • Hash Message
  • Sort using Thrust (Sort by Key): sort keys
  • Reorder: scatter messages
  • Build partition matrix

Count Sort
  • Hash Message: atomic add to bin
  • Prefix Sum: global index of bins
  • Reorder: scatter messages
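The count sort steps (hash, count per bin, prefix sum, scatter) can be written out sequentially to show how the bin index is built. This is a hedged C++ sketch rather than the FLAME GPU implementation. The `Message` type, the uniform square grid, and the names `bin_of`, `SortedMessages` and `counting_sort` are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A message carrying a 2D location (hypothetical type).
struct Message {
    float x;
    float y;
};

// Hash a location to a bin in a grid_dim x grid_dim grid of square cells.
std::size_t bin_of(const Message& m, float cell_size, std::size_t grid_dim) {
    assert(m.x >= 0.0f && m.y >= 0.0f && cell_size > 0.0f && grid_dim > 0);
    std::size_t cx = static_cast<std::size_t>(m.x / cell_size);
    std::size_t cy = static_cast<std::size_t>(m.y / cell_size);
    if (cx >= grid_dim) cx = grid_dim - 1;  // clamp to the grid edge
    if (cy >= grid_dim) cy = grid_dim - 1;
    return cy * grid_dim + cx;
}

struct SortedMessages {
    std::vector<Message> messages;       // messages ordered by bin
    std::vector<std::size_t> bin_start;  // bin b is [bin_start[b], bin_start[b + 1])
};

SortedMessages counting_sort(const std::vector<Message>& in, float cell_size,
                             std::size_t grid_dim) {
    const std::size_t bins = grid_dim * grid_dim;

    // Hash each message and count messages per bin
    // (on the GPU this is the "atomic add to bin" step).
    std::vector<std::size_t> bin(in.size());
    std::vector<std::size_t> count(bins, 0);
    for (std::size_t i = 0; i < in.size(); ++i) {
        bin[i] = bin_of(in[i], cell_size, grid_dim);
        ++count[bin[i]];
    }

    // Exclusive prefix sum over the counts gives each bin's start index.
    SortedMessages out;
    out.bin_start.assign(bins + 1, 0);
    for (std::size_t b = 0; b < bins; ++b)
        out.bin_start[b + 1] = out.bin_start[b] + count[b];

    // Scatter every message to the next free slot of its bin.
    std::vector<std::size_t> next(out.bin_start.begin(), out.bin_start.end() - 1);
    out.messages.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out.messages[next[bin[i]]++] = in[i];
    return out;
}
```

A reader that only needs nearby messages can then visit the bins around its own cell and iterate over the matching `bin_start` ranges instead of reading every message.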

SLIDE 20

Counting Sort Performance Study

[Figure: Sorting Performance (1M elements), Thrust Sort against Counting Sort. Three panels: Tesla K20, Tesla K40, GTX 980. Time (ms) against element range]

SLIDE 21
  • Counting sort best suited to smaller population sizes
  • Message reading is the bottleneck

[Figure: Performance Breakdown for 16k agents. Time (ms) for Count Sort and Thrust Sort, broken down into send_locations, read_locations and move]

[Figure: Performance Breakdown for 4M agents. Time (ms) for Count Sort and Thrust Sort, broken down into send_locations, read_locations and move]

[Figure: Performance Improvement using Count Sort (GTX980). Speedup (1 to 1.35) against population size]

SLIDE 22

Spatially Distributed Communication Benchmark

[Figure: Time (ms) against population size (4096 to 4194304) for GTX 980, K40 and FLAME CPU]

  • 27k times faster than FLAME on CPU with 50k agents (apples != oranges)
  • 700x faster than FLAME II with 50k agents on 16 cores (using MPI, vector splitting)

SLIDE 23
  • Complex Systems
  • A Framework for Modelling Agents
  • Degrees of Parallelisation
  • Agent Communication
  • Putting it all together

SLIDE 24

Pedestrian Dynamics

  • Pedestrian agents
      • Social Repulsion (Social Forces)
      • Reynolds steering forces
      • Reciprocal Velocity Obstacles
  • Navigation agents
      • Global Vector Field
      • Navigation Graph
  • Environment and goals are calculated as a weighted influence
  • An extension: Navigation graphs

SLIDE 25

Conclusions

  • Agent based modelling can be used to represent complex systems at differing biological scales
  • FLAME GPU is a framework for model description and CUDA code generation
  • Using a state based representation avoids divergence and allows parallelism within a model to be exploited
  • Counting sort is helpful for highly divergent populations
  • Visualisation is extremely cheap

SLIDE 26

Thank You

Get the code for free from:
  • http://www.flamegpu.com
  • www.github.com/FLAMEGPU

Contact Me:
  • p.richmond@sheffield.ac.uk
  • http://www.paulrichmond.staff.shef.ac.uk
