

SLIDE 1

Exploring GPGPU Acceleration of Process-Oriented Simulations

Communicating Process Architectures 2013

Fred Barnes School of Computing, University of Kent, Canterbury

F.R.M.Barnes@kent.ac.uk http://www.cs.kent.ac.uk/~frmb/

SLIDE 2

Contents

Process-oriented programming.
The boids simulation (shop manual).
GPUs.
Boids with GPU.
Better boids, with and without the GPU.
Going even faster.
Exploring the results.
Conclusions and future work.

SLIDES 3-7

Process-orientation

Process-Oriented Programming

Building systems with concurrent processes as the bricks.

processes communicate and synchronise using channels and barriers (the mortar).
communication is synchronised, unidirectional and unbuffered.

We use the occam-π language [1] for implementation.

based heavily on the semantics of CSP [2].
ideas of dynamics and mobility from the π-calculus [3].

[Diagram, built up over these slides: processes A, B, C and D connected by channels; a single my.process with input channel in?, output channel out! and barrier sync.]


SLIDES 8-10

Process-orientation

Process-Oriented Programming

Channels are first-class types, so we can have channels carrying channels (or rather, channel ends).

enables networks of processes to reconfigure themselves dynamically.
can have shared channel-ends, whose mutually exclusive access is protected by a fair-queueing semaphore.

Processes can alternate (select) between multiple channel inputs and timeouts, with optional priority.

external choice in CSP, more or less.

Can build large systems (10^4 – 10^6 processes) using layered networks of communicating processes that grow, shrink and evolve at run-time.

need to be aware of dangers such as deadlock, livelock and starvation (good design).


SLIDES 11-13

Process-orientation

Not a Talk About occam-π

For the purpose of this talk, pictures are sufficient.

the graphical representation we have for process networks maps cleanly to and from code.

Not entirely dissimilar languages Erlang (Ericsson) and Go (Google) do similar things: some intersection of features.

no assumption about sequential execution in occam-π: equal syntactic standing with concurrent execution (SEQ vs. PAR).

Perhaps more relevant is the tool-chain and the run-time system (CCSP [4]).

compiled to native code for fast execution (though not optimal).
small overheads for channels (4 bytes) and processes (32 bytes minimum).


SLIDES 14-15

Boids

The Boids Simulation

A good case study: it is not trivially parallelisable.

fractal generators and Conway's game-of-life are trivially parallelisable and give the expected speedups when running on the GPU (×300 or more).

An n-body problem, but where n is kept manageable by partitioning the world into a regular grid.

Produced originally as part of the CoSMoS project [5, 6].

based on Reynolds' "boids" [7].


SLIDES 16-25

Boids

The Boids Simulation

[Diagram, built up over these slides: a grid of locations, with viewers, abstract agents, boids, a barrier, an updater, a display with framebuffers, and an interactive agent.]

SLIDES 26-38

Boids

Simulation Operation

World is defined using a grid of location processes.

each location has a viewer, and each viewer has an updater.

Boid processes do not interact with locations and viewers directly.

instead they interact with an abstract agent, which in turn handles interaction with the world (and its particular geometry).

The barrier divides simulation execution into two phases.

Phase 1: processes synchronise on the barrier; via the abstract agent and viewer, neighbour discovery; compute new acceleration and velocity; move if needed.

Phase 2: processes synchronise on the barrier; viewers update from locations.


SLIDE 39

Boids

From the Boids' Perspective

 1: procedure boid (space link, barrier t)
 2:     state me = initial_state ()
 3:     while True do
 4:         sync t                          ⊲ enter observation phase
 5:         all = get_viewable (link)
 6:         vis, obs = prune_visible (all, me)
 7:         me = centre_of_mass (vis, me)
 8:         me = repulsion (vis, me)
 9:         me = mean_velocity (vis, me)
10:         me = obstacles (obs, me)
11:         update (link, me)
12:         sync t                          ⊲ enter update phase
13:     end while
14: end procedure

SLIDE 40

Boids

Performance

For 2048 boids and 9 obstacles in an 8×6 grid.

test machine is an Intel quad-core i7 (2600K) running at 3.4 GHz (fixed); 4 real cores and 4 hyperthreads.

[Graphs: throughput (k-boids/sec) vs. iteration for 1, 2, 4 and 8 cores; speedup vs. cores for 128 to 2048 boids.]

Performance drops as flocks start to form (n-body effect).

levels out to around 50 cycles/sec.

SLIDES 41-43

Boids

Visualisation

Some of the process plumbing is used for a display:

uses SDL to display 2D framebuffers on a host display (and, separately, allows capture to files).
in interactive mode, can adjust simulation parameters and move an obstacle around.

This is about as good as the original version will manage.

could tweak it for more performance based on parameter values, but not expecting substantial improvements.

Solution: use the GPU to speed things up!


SLIDES 44-53

History lesson

A Brief History of GPUs

1983: Intel release the iSBX 275 multibus board, providing accelerated drawing of lines, arcs, rectangles and character bitmaps.

1985: the first personal-computer graphics processor appears in the Commodore Amiga: line drawing, area fill and blitter; included a graphics co-processor with a primitive instruction set.

1987: IBM release the 8514/A for the PS/2 (MCA bus): line drawing, area fill and blitter.

1991: S3 Graphics introduce the S3 86C911, designed to accelerate specific software; responsible for many similar (accelerator) cards.

1992: SGI develop and publish OpenGL, an API for graphics processing.

1995: DirectX introduced with the release of Windows 95 and NT 4.0.

1999: NVIDIA popularise the term Graphics Processing Unit.

2001: NVIDIA release the GeForce 3, which included a programmable shader; the start of the GPGPU era.

2002: ATI try to introduce Visual Processing Unit (VPU) into the lexicon, unsuccessfully. GPUs become able to handle looping and floating-point intensive shader 'mini-programs'.

SLIDES 54-55

History lesson

A Brief History of GPUs

Specific graphics co-processors existed in the 1980s and 1990s, but not in the general consumer market.

fixed-feature hardware accelerators (DirectX) were cheaper and faster.

Recent GPU cards offer significant computational ability, driven largely by the HPC and gaming industries.

fundamentally still graphics processors, not high-performance scientific calculators.


SLIDES 56-58

GPUs

General GPU Structure

A bunch of different hardware units:

memory (VRAM) and host interfaces.
a large cache memory area.
thread-scheduling logic.
a number of stream processors.

Logical interpretation is SIMD: data is fixed (in a large register-file) and instructions are pumped through a number of processing cores.

NVIDIA Fermi [8] is used in GF100 and GF110 GPUs.

available on cards such as the Tesla C2050 and GeForce GTX 580.
around 3 billion transistors in 512 CUDA cores.
more optimisations for double-precision arithmetic.

Resulting silicon on a 40nm process is about the size of a stamp.

hard to fabricate, but the regular structure means that parts can be disabled where defective, e.g. the GTX 570 has 1 of the 16 stream processors disabled.

SLIDE 59

GPUs

NVIDIA Fermi Architecture

SLIDES 60-62

GPUs

On GPU Programming

The GPU programming model, for CUDA [9] and OpenCL [10], is somewhat abstracted from the real hardware.

CUDA is used for these experiments: more mature and well documented, but less portable.

Programmer writes a kernel: a piece of code that is executed in parallel across the CUDA cores.

single threads are organised into thread blocks (max. 512/1024).
blocks are arranged into grids that can be huge (64k/2G × 2/3D).
threads are scheduled in groups of 32 called warps; execution is interleaved (based on available resources).

Arrangement of threads, blocks and grids can be tweaked for performance (a launch sketch follows below).

balanced with register and cache memory use.
"better" GPUs can do shared memory and synchronisation within thread blocks.

SLIDES 63-67

GPUs

GPU Programming

For doing typical scientific calculations (e.g. boid algorithms) over a set of things (e.g. boid state), it is simplest to treat the work as a 1D problem:

[Diagram: an input array of elements b1 .. bn-1 is mapped through the kernel, one thread per element, to an output array.]

    typedef struct { ... stuff } gpu_in;
    typedef struct { ... stuff } gpu_out;

    __global__ void my_kernel (const gpu_in *in, gpu_out *out, const int count)
    {
        int idx = (blockDim.x * blockIdx.x) + threadIdx.x;

        if (idx < count) {
            out[idx] = sums (in, idx);
        }
    }

On the host side:

    ... allocate device memory
    while (busy) {
        ... copy data to GPU
        my_kernel <<< 512, blks >>> (args);
        ... copy results from GPU
    }
    ... free device memory
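Expanding the elided host steps, a hedged sketch using plain synchronous copies (h_in, h_out, d_in, d_out and the buffer sizes are illustrative assumptions, not the talk's actual code):

    /* Sketch only: allocate once, then copy in / launch / copy out per cycle.
       cudaMemcpy is synchronous, so results are ready when it returns. */
    gpu_in *d_in;
    gpu_out *d_out;
    cudaMalloc (&d_in, count * sizeof (gpu_in));
    cudaMalloc (&d_out, count * sizeof (gpu_out));
    while (busy) {
        cudaMemcpy (d_in, h_in, count * sizeof (gpu_in), cudaMemcpyHostToDevice);
        my_kernel <<< 512, blks >>> (d_in, d_out, count);
        cudaMemcpy (h_out, d_out, count * sizeof (gpu_out), cudaMemcpyDeviceToHost);
    }
    cudaFree (d_in);
    cudaFree (d_out);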

SLIDES 68-74

GPU server

GPU Server Approach

As a starting point, a GPU server process is introduced.

clean abstraction: other processes send computation requests and collect results.
the server collects requests and dispatches them in fixed-size batches to the GPU (a batching sketch follows below).
only a few parts of the boid algorithm to start with.

[Diagram: location and viewer processes connected to a gpu.server process, which drives the CUDA library.]

Despite the additional infrastructure, overheads are not too significant.

but performance is not too great either.
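A rough sketch of the server's batching loop (entirely illustrative: request_t, receive_request, run_batch_on_gpu and send_result are hypothetical helpers; the talk does not show server code):

    /* Hypothetical GPU-server loop: gather a fixed-size batch of requests,
       run one GPU pass over the whole batch, then reply to each requester. */
    #define BATCH 2048

    void gpu_server (void)
    {
        request_t req[BATCH];
        result_t res[BATCH];

        for (;;) {
            for (int n = 0; n < BATCH; n++) {
                req[n] = receive_request ();          /* channel input */
            }
            run_batch_on_gpu (req, res, BATCH);       /* copy in, kernel, copy out */
            for (int n = 0; n < BATCH; n++) {
                send_result (req[n].reply, &res[n]);  /* channel output */
            }
        }
    }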

SLIDES 75-76

GPU server

GPU Server Approach

[Graphs: throughput (k-boids/sec) vs. iteration, comparing no server against 256-, 1024- and 2048-request batches: serialised CPU performance vs. GPU performance.]

The original choice of which parts of the algorithm to implement on the GPU was not brilliant:

the most computationally expensive part is the splitting of viewable agents into visible boids and obstacles.

SLIDES 77-80

GPU server

GPU Server Approach: More GPU

Putting more of the boid algorithm onto the GPU does not help:

[Graph: throughput (k-boids/sec) vs. iteration for no server; CPU with 128 requests; GPU with 128 requests; GPU with 2048 requests.]

Significant increase in the amount of data (all) copied to the GPU.

for typical parameter sets, the number of visible agents (vis) is around 3–5% of those viewable (all): circa 13 MB for 2048 boids.

SLIDE 81

GPU server

GPU Server Approach: More Optimisations

Various attempts to further optimise the system (without changing anything too substantially) did not produce anything better than the CPU-only version.

limited by the memory bandwidth between host and GPU: might improve with host-stolen video-RAM.
strategies included page-locked memory on the host (directly sharable over the PCIe bus) and the use of streams on the device to overlap memory copies with kernel execution.
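Those two strategies might look like the following (a hedged sketch; h_in, h_out, d_in, d_out, blks and threads are assumed names):

    /* Sketch: page-locked host memory plus a stream, so the async copies and
       the kernel launch can overlap with work queued on other streams. */
    gpu_in *h_in;
    cudaHostAlloc (&h_in, count * sizeof (gpu_in), cudaHostAllocDefault);

    cudaStream_t s;
    cudaStreamCreate (&s);

    cudaMemcpyAsync (d_in, h_in, count * sizeof (gpu_in),
                     cudaMemcpyHostToDevice, s);
    my_kernel <<< blks, threads, 0, s >>> (d_in, d_out, count);
    cudaMemcpyAsync (h_out, d_out, count * sizeof (gpu_out),
                     cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize (s);        /* wait for this stream to drain */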
SLIDES 82-89

Shared data

Refactoring: Shared Data

As a moderate change, introduce some shared data to the system.

in principle, this means the actual boid (and other agent) state only needs to be copied to the GPU once each cycle.
barrier phases can be used to coordinate access to this shared state safely (CREW: concurrent read, exclusive write).

Requires some less subtle changes in the system:

mostly absolute positioning and agent IDs, not state.

Phase 1: boids read global state and compute new (local) velocity.
Phase 2: boids update global state and move.
Phase 3: updates to viewable states occur (as before).

(A sketch of this phased access follows below.)

[Diagram: location and viewer processes above a shared agent-state array (b b b).]
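Schematically, the phased CREW discipline for boid i might look like this (barrier_sync, steer and move are illustrative placeholders, not the occam-π primitives):

    /* Hypothetical barrier-phased cycle over shared agent state. */
    for (;;) {
        barrier_sync (&b);                   /* phase 1: concurrent reads  */
        vel[i] = steer (shared_state, i);    /* new velocity, local only   */
        barrier_sync (&b);                   /* phase 2: exclusive writes  */
        shared_state[i] = move (shared_state[i], vel[i]);
        barrier_sync (&b);                   /* phase 3: viewers update    */
    }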
SLIDE 90

Shared data

Shared Data: Performance

Considering a CPU-only version to start with (based on the original), performance is significantly improved.

the downside is that our existing GPU results now look even worse...

[Graphs: throughput (k-boids/sec) vs. iteration for the original and shared-data versions (1 and 8 cores, 2048 boids); speedup vs. CPU cores for the original and shared-data versions (1024 and 4096 boids).]

SLIDES 91-94

Shared data

Shared Data: Reintroducing the GPU

Next, add a GPU-server process, operating on shared agent data.

still copying around arrays of viewable agents, but only integers now.

[Diagram: location and viewer processes, the shared agent-state array (b b b), and a gpu.server process driving the CUDA library.]

at the start of the GPU cycle (for a given batch size), all agent state is copied over.
results are collected locally and global state (on the host) updated before the second phase.

SLIDES 95-96

Shared data

Reintroducing the GPU: Performance

[Graphs: throughput (k-boids/sec) vs. iteration, CPU vs. GPU, for batches of 512 jobs and batches of 2048 jobs.]

Performance is unimpressive.

worse than the shared-data CPU-only version in all cases.

Still a lot of viewable state manipulation.

SLIDES 97-101

More shared data

Sharing the Viewable State

Sharing the viewable state (in each viewer) requires some changes in the boid algorithm.

a single pass over the viewable agents, instead of sorting into visible boids and obstacles.
just as functional, but the boid algorithm is a little harder to follow.
mostly just reducing the amount of data copied around.
contents are updated during the update phase.

[Diagram: shared agent-state array (b b b), gpu.server driving the CUDA library, and the updater.]

SLIDE 102

More shared data

Sharing the Viewable State: Performance

[Graphs: throughput (k-boids/sec) vs. iteration, CPU vs. GPU, for batches of 512 jobs and batches of 2048 jobs.]

For batches of 2048 jobs, we start seeing some performance gain for the first time!

the slow-down for 4096 boids is partially due to increased density (still in an 8×6 grid).

SLIDES 103-104

More shared data

Parallel GPU Servers

An obvious (and fairly straightforward) next step is to parallelise the GPU server.

to take advantage of multiple GPUs.
or to allow a mix of GPU and CPU execution.

[Graphs: average throughput at t = 500 vs. number of GPU server processes; throughput vs. iteration for multiple GPUs (no farmer vs. 2 servers, 2048 and 4096 boids).]

Improvement in throughput for 4096 boids is significant.

so worth doing when multiple GPUs are present.

SLIDES 105-107

Further optimisation

Further Optimisation: Less Channel I/O

Each cycle, the viewer processes update their viewable arrays from the contents of the 9 connected locations.

means agent IDs are duplicated 8 times (although that's not a huge overhead).

Each boid goes through a sequence of communications with the GPU server process.

when dealing with large numbers of boids, this creates significant overheads (for something that is largely straightforward).

Solutions to these damage the clarity of the system.

largely by breaking the abstractions of delegated computation (the GPU server process) and viewable state (in the viewer processes).

SLIDES 108-113

Further optimisation

Further Optimisation: Less Channel I/O

[Diagram: boid processes (b b b) over a shared agent-state array; a gpu.engine process driving the CUDA library; the updater; viewable IDs and a neighbourhood map.]

Three phases of execution:

state copied to the GPU, computations performed, results collected and written back.
boids initiate movement, moving if needed.
global viewable state updated.

SLIDE 114

Further optimisation

Less Channel I/O: Performance

Improvement in performance is substantial.

for 16384 boids, vary the density and execution mode.

[Graphs: throughput (k-boids/sec) vs. iteration at 38 boids/location; average throughput (k-boids/sec) vs. density (boids/location), comparing in-boid, 8x cpu, 1x gpu and 2x gpu.]

Boids   Grid    Density (boids/location)
16384   50×38     9
16384   26×20    32
16384   20×14    59
16384   16×12    85
16384   12×9    152

slide-115
SLIDE 115

Further optimisation

Centralising Movement

Boid processes still (and only) initiate the move between locations.

centralising this makes the boid processes, locations and updaters redundant.

We keep the locations for interaction, however.

and double-buffer the agent state for performance.

(Diagram: render, gpu.engine, move.engine and interact.agent synchronised on a barrier and sharing double-buffered agent state arrays; render produces framebuffers from the viewable IDs, and interact.agent handles GUI events. Successive builds of the slide showed the phase 1 and phase 2 read/write access to the two state arrays.)
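
A minimal sketch of the double-buffering discipline, with illustrative names only (agent_t, step, move stand in for the real engines): each cycle reads one copy of the agent state and writes the other, swapping roles at the phase boundary so readers never see partial updates.

    /* Hypothetical double-buffer cycle; the explicit swap plays the role
     * that the barrier-separated phases play in the process-oriented design. */
    #include <string.h>

    #define N 30720                          /* e.g. the 30720-boid run */

    typedef struct { float px, py, vx, vy; } agent_t;

    static agent_t state[2][N];              /* double-buffered agent state */

    static void step (const agent_t *rd, agent_t *wr)
    {
        /* gpu.engine's work: compute new state from the read-only copy */
        memcpy (wr, rd, sizeof (agent_t) * N);   /* placeholder computation */
    }

    static void move (agent_t *wr)
    {
        /* move.engine's work: apply moves between locations */
        (void) wr;
    }

    void run (int iterations)
    {
        int cur = 0;
        for (int i = 0; i < iterations; i++) {
            /* phase 1: read state[cur], write state[1 - cur] */
            step (state[cur], state[1 - cur]);
            /* phase 2: movement (and rendering) act on the new copy */
            move (state[1 - cur]);
            cur = 1 - cur;                   /* swap read/write roles */
        }
    }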

slide-120
SLIDE 120

Further optimisation

Centralising Movement: Performance

Squeeze a little more performance out of the GPU(s).

(Two plots: average throughput (k-boids/sec) against the number of boids at a fixed density of 40, and throughput (k-boids/sec) per iteration for 30720 boids on a 32×24 grid; series 1x gpu and 2x gpu in both.)

Could manage more in theory, but visualisation creates overheads.

from about 60 cycles/sec without visualisation to 25 cycles/sec with it (synchronised display).

slide-121
SLIDE 121

Experimenting

Experimenting with Different Parameters

The boid algorithm uses a number of different parameters internally:

repulsion radius and fraction, viewing angle and distance, centre-of-mass fraction, mean-velocity fraction, and a few other things (one way of bundling these is sketched below).

Playing around produces substantially different behaviours.

previously difficult to explore with large numbers of agents.
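
Purely as an illustration, the parameters the slide lists might be bundled into a single structure so a run can be reconfigured in one place (and shipped to the GPU in one copy); the field names and example values below are assumptions, not the talk's actual settings.

    /* Hypothetical flocking-parameter bundle; the values are made up. */
    typedef struct {
        float repulsion_radius;     /* how close before boids push apart       */
        float repulsion_fraction;   /* strength of the repulsion term          */
        float view_angle;           /* field of view (radians)                 */
        float view_distance;        /* how far a boid can see                  */
        float com_fraction;         /* pull towards the local centre of mass   */
        float mean_vel_fraction;    /* match the neighbourhood's mean velocity */
    } flock_params_t;

    static const flock_params_t example_params = {
        .repulsion_radius   = 0.5f,
        .repulsion_fraction = 0.125f,
        .view_angle         = 3.5f,
        .view_distance      = 4.0f,
        .com_fraction       = 0.01f,
        .mean_vel_fraction  = 0.125f,
    };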

slide-123
SLIDE 123

Conclusions

Conclusions

Have gone from a basic occam-π-only implementation (managing around 110,000 boid-cycles per second) to a hybrid CPU/GPU implementation with good performance (2,000,000 boid-cycles per second).

could still improve though (future work).

A process of step-by-step change, not a new implementation.

unlikely to have come up with this design from a fresh start.

Despite the differences from the original, it still retains nice high-level features:

can have other agents (e.g. the interactive one) in the system too, executing on the CPU, GPU or something else. Distribution is still possible, through the use of locations (even if these are just data).

slide-126
SLIDE 126

Conclusions

Future Work

Now that we can have large numbers of boids, a 3D version.

and perhaps an opportunity to do something interesting with the haptics interface.

Absolutely no attempt (because of lack of time) has been made to optimise the code that runs on the GPU, other than getting it to work.

expect to squeeze a bit of performance out; we have not even experimented with different threads-per-block settings and similar (a sweep of the kind sketched below).

A total GPU implementation, to give a “best case” benchmark.

if one does not exist already; handling the moves on the GPU is non-trivial.
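
For illustration only, a threads-per-block sweep of the kind alluded to above might look like this; the kernel is a stand-in workload, not the boid code.

    /* Hypothetical timing sweep over threads-per-block, using CUDA events. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void work (float *x, int n)
    {
        int i = (blockIdx.x * blockDim.x) + threadIdx.x;
        if (i < n) {
            x[i] = (x[i] * 2.0f) + 1.0f;     /* stand-in workload */
        }
    }

    int main (void)
    {
        const int n = 1 << 20;
        float *d;

        cudaMalloc ((void **) &d, n * sizeof (float));
        cudaMemset (d, 0, n * sizeof (float));

        for (int tpb = 32; tpb <= 1024; tpb *= 2) {
            cudaEvent_t t0, t1;
            float ms;

            cudaEventCreate (&t0);
            cudaEventCreate (&t1);
            cudaEventRecord (t0);
            work<<<(n + tpb - 1) / tpb, tpb>>> (d, n);
            cudaEventRecord (t1);
            cudaEventSynchronize (t1);
            cudaEventElapsedTime (&ms, t0, t1);
            printf ("%4d threads/block: %.3f ms\n", tpb, ms);
            cudaEventDestroy (t0);
            cudaEventDestroy (t1);
        }
        cudaFree (d);
        return 0;
    }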

slide-127
SLIDE 127

Conclusions

Acknowledgements

Hardware:

NVIDIA GTX-570, GTX-590 and ATI Radeon 7970 funded by the Faculty of Sciences (REF fund 2012/2013, Tranche 1). Fast desktop (quad-core 2600K) funded by the School of Computing.

Early experiments with occam-π and CUDA/OpenCL done by Tom Pressnell and Brendan Le Foll (graduated). Images, in no particular order:

Intel Corporation, Kaiiv (de.wikipedia), Editing by Pixel8, IBM Corporation, pcmag.com, IXBT Labs, anandtech.com, NVIDIA Corporation.

Additional history/etc.: Wikipedia.

slide-128
SLIDE 128

Conclusions

Questions?

slide-129
SLIDE 129

References

References

[1] P.H. Welch and F.R.M. Barnes. Communicating mobile processes: introducing occam-pi. In 25 Years of CSP, volume 3525 of LNCS. Springer, 2005.

[2] C.A.R. Hoare. Communicating Sequential Processes. Prentice-Hall, London, 1985. ISBN 0-13-153271-5.

[3] R. Milner. Communicating and Mobile Systems: the Pi-Calculus. Cambridge University Press, 1999. ISBN 0-52165-869-1.

[4] C.G. Ritson, A.T. Sampson, and F.R.M. Barnes. Multicore scheduling for lightweight communicating processes. Science of Computer Programming, 77(6):727–740, June 2012.

[5] Fiona A.C. Polack, Tim Hoverd, Adam T. Sampson, Susan Stepney, and Jon Timmis. Complex systems models: engineering simulations. In S. Bullock, J. Noble, R. Watson, and M.A. Bedau, editors, Artificial Life XI: Proceedings of the Eleventh International Conference on the Simulation and Synthesis of Living Systems, pages 482–489. MIT Press, Cambridge, MA, 2008.

[6] Adam T. Sampson, John Markus Bjørndalen, and Paul S. Andrews. Birds on the wall: distributing a process-oriented simulation. In 2009 IEEE Congress on Evolutionary Computation (CEC 2009), pages 225–231. IEEE Press, 2009.

[7] Craig W. Reynolds. Flocks, herds and schools: a distributed behavioral model. SIGGRAPH Computer Graphics, 21(4):25–34, August 1987.

[8] NVIDIA Corporation. Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi, 2009.

[9] NVIDIA. CUDA C Programming Guide 4.2, April 2012. http://www.nvidia.com/content/cuda/cuda-developer-resources.html

[10] Khronos OpenCL Working Group. The OpenCL Specification 1.2, November 2011. http://www.khronos.org/registry/cl/