GPU-Based Simulation of Spiking Neural Networks with Real-Time Performance & High Accuracy - PowerPoint PPT Presentation



SLIDE 1

GPU-Based Simulation of Spiking Neural Networks with Real-Time Performance & High Accuracy

Dmitri Yudanov, Muhammad Shaaban, Roy Melton, Leon Reznik

Department of Computer Engineering Rochester Institute of Technology United States WCCI 2010, IJCNN, July 23

SLIDE 2

Agenda

- Motivation
- Neural network models
- Simulation systems of neural networks
- Parker-Sochacki numerical integration method
- CUDA GPU architecture
- Implementation: software architecture, computation phases
- Verification
- Results
- Conclusion and future work
- Q&A

SLIDE 3

Motivation

- Other works: the accuracy and verification problem
- To provide scalable accuracy
- To perform direct verification

Based on:

- J. Nageswaran, N. Dutt, J. Krichmar, A. Nicolau, and A. Veidenbaum, "A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors," Neural Networks, Jul. 2009.
- A. K. Fidjeland, E. B. Roesch, M. P. Shanahan, and W. Luk, "NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs," in Proc. IEEE Int. Conf. on Application-Specific Systems, Architectures and Processors, pp. 137-144, 2009.
- J.-P. Tiesel and A. S. Maida, "Using parallel GPU architecture for simulation of planar I/F networks," 2009, pp. 754-759.
- R. Stewart and W. Bair, "Spiking neural network simulation: numerical integration with the Parker-Sochacki method," Journal of Computational Neuroscience, vol. 27, no. 1, pp. 115-133, Aug. 2009.
SLIDE 4

Neuron Models: IF, HH, IZ

- IF (integrate-and-fire): simple, but has a poor spiking response
- HH (Hodgkin-Huxley): has a rich response, but is complex
- IZ (Izhikevich): simple and has a rich response, but is phenomenological

[Figure: response traces for the IZ, HH, and IF models]
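To make the IZ model concrete, here is a minimal sketch of its two-variable dynamics with forward-Euler stepping. The parameters a, b, c, d are the standard regular-spiking values from Izhikevich (2003); the input current, time step, and function name are illustrative choices, not taken from the talk.

```python
# Izhikevich (IZ) neuron: a simple two-variable ODE with a rich spiking
# repertoire. Forward-Euler sketch; dt and I are illustrative choices.

def izhikevich(I=10.0, dt=0.5, steps=2000, a=0.02, b=0.2, c=-65.0, d=8.0):
    v, u = c, b * c          # membrane potential (mV) and recovery variable
    spikes = []
    for step in range(steps):
        v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + I)
        u += dt * a * (b * v - u)
        if v >= 30.0:        # spike threshold: record, reset v, bump u
            spikes.append(step * dt)
            v, u = c, u + d
    return spikes

print(len(izhikevich()))  # spike count over 1 s of simulated time
```

Changing (a, b, c, d) reproduces other firing patterns (fast-spiking, bursting, etc.), which is why the model is called phenomenological: it fits observed behavior rather than ion-channel physics.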

SLIDE 5

System Modeling: Synchronous Systems

Order of computation per second of simulated time (R. Brette, et al.), where N is the network size, F the average firing rate of a neuron, and p the average number of target neurons per spike source.

- Time quantization error is introduced by dt
- Smaller dt → more precise, but computationally hungry
- May result in missed events → STDP-unfriendly
- Aligned events → good for parallel computing

SLIDE 6

System Modeling: Asynchronous Systems

Order of computation per second of simulated time (R. Brette, et al.), where N is the network size, F the average firing rate of a neuron, and p the average number of target neurons per spike source.

- Events are processed sequentially
- More computation per unit of time
- Spike predictor-corrector → excessive re-computation
- Assumes an analytical solution
- Small computation order
- Events are unique in time → no quantization error → more accurate, STDP-friendly

SLIDE 7

System Modeling: Hybrid Systems

Order of computation per second of simulated time (R. Brette, et al.), where N is the network size, F the average firing rate of a neuron, and p the average number of target neurons per spike source.

- Events are processed sequentially
- The largest possible dt is limited by the minimum delay and the highest possible transient
- Refreshes every dt → more structured than event-driven → good for parallel computing
- Events are unique in time → no quantization error → more accurate, STDP-friendly
- Does not require an analytical solution

SLIDE 8

Choice of Numerical Integration Method

Motivation: need to solve an IVP (initial value problem).

- Euler: compute the next y from the tangent at the current y
- Modified Euler: predict with Euler, correct with the average slope
- Runge-Kutta 4th order: evaluate four slopes and average them
- Bulirsch-Stoer: modified midpoint method with an error-tolerance check using extrapolation with rational functions; adaptive order; generally better suited for smooth functions
- Parker-Sochacki: express the IVP as a power series; adaptive order
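The accuracy gap between the fixed-step methods above can be seen on a test equation with a known solution. A minimal sketch, using y' = y with exact solution e^t as an illustrative example (the test equation and step size are my choices, not the talk's):

```python
import math

# Compare fixed-step Euler and classical RK4 on y' = y, y(0) = 1,
# whose exact solution is e^t. Same step count, very different error.

def euler(f, y, t, h):
    return y + h * f(t, y)

def rk4(f, y, t, h):
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

f = lambda t, y: y
h, steps = 0.1, 10
ye = yr = 1.0
for i in range(steps):
    ye = euler(f, ye, i * h, h)
    yr = rk4(f, yr, i * h, h)

exact = math.e
print(abs(ye - exact), abs(yr - exact))  # RK4 error is far smaller
```

Euler pays one slope evaluation per step but has first-order error; RK4 pays four evaluations for fourth-order error. Parker-Sochacki goes further by letting the order itself adapt to a requested tolerance.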

SLIDE 9

Parker-Sochacki Method

A typical IVP:

  y'(t) = f(y(t)),  y(0) = y0

Assume that the solution can be represented as a power series, y(t) = Σ_{j≥0} y_j t^j. By the properties of the Maclaurin series, its derivative is y'(t) = Σ_{j≥0} (j+1) y_{j+1} t^j, so matching coefficients against the series of f(y(t)) gives a recurrence for y_{j+1}.

SLIDE 10

Parker-Sochacki Method

If f is linear, f(y) = a·y + b:

- Shift it to eliminate the constant term: z = y + b/a, so that z' = a·z
- As a result, the coefficient equation becomes z_{j+1} = a·z_j / (j+1)
- With finite order N, the solution is the truncated series Σ_{j=0}^{N} z_j t^j, shifted back

- LLP → parallel reduction
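The linear case above can be sketched directly: the coefficient recurrence is sequential, but summing the series is a reduction, which is the part a GPU can parallelize. A minimal sketch; the function name and test problem are illustrative.

```python
# Parker-Sochacki for the linear IVP y' = a*y + b, y(0) = y0.
# Shift z = y + b/a so that z' = a*z; the Maclaurin coefficients then
# obey z[j+1] = a * z[j] / (j + 1), and evaluating the truncated series
# is a reduction (the parallelizable step on a GPU).

def ps_linear(a, b, y0, t, order):
    z = y0 + b / a                       # shifted initial coefficient z[0]
    coeffs = [z]
    for j in range(order):               # recurrence: inherently sequential
        z = a * z / (j + 1)
        coeffs.append(z)
    # sum(coeffs[j] * t**j) is the reduction; then undo the shift.
    return sum(c * t**j for j, c in enumerate(coeffs)) - b / a

# y' = y + 1, y(0) = 0  =>  y(t) = e^t - 1
print(ps_linear(1.0, 1.0, 0.0, 1.0, 20))  # ~ e - 1
```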

SLIDE 11

Parker-Sochacki Method

If f is quadratic, f(y) = a·y² + b·y + c:

- Shift it to eliminate the constant term
- The quadratic term can be converted with series multiplication (the Cauchy product): (y²)_j = Σ_{i=0}^{j} y_i · y_{j−i}

SLIDE 12

Parker-Sochacki Method

The equation then becomes a recurrence over the Cauchy-product coefficients; with finite order N the series is truncated after N terms.

- Loop-carried circular dependence on d → only partial parallelism is possible
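The quadratic case can be sketched the same way. Here the Cauchy product makes each coefficient depend on all lower-order ones, which is the loop-carried dependence that limits parallelism. A minimal sketch on y' = y² (my choice of test problem, with known solution 1/(1−t)):

```python
# Parker-Sochacki with a quadratic right-hand side: y' = y^2, y(0) = y0.
# The quadratic term becomes a Cauchy product of the series with itself:
#   y[j+1] = (sum_{i=0..j} y[i] * y[j-i]) / (j + 1)
# The inner sum depends on every lower-order coefficient, so only the
# sum itself (not the outer recurrence) can be parallelized.

def ps_quadratic(y0, t, order):
    y = [y0]
    for j in range(order):
        cauchy = sum(y[i] * y[j - i] for i in range(j + 1))  # (y^2)_j
        y.append(cauchy / (j + 1))
    return sum(c * t**k for k, c in enumerate(y))

# y' = y^2, y(0) = 1  =>  y(t) = 1 / (1 - t)
print(ps_quadratic(1.0, 0.5, 30))  # ~ 2.0
```

The Izhikevich voltage equation has exactly this quadratic form in v, which is why PS applies so naturally to it.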

SLIDE 13

Parker-Sochacki Method

- Power series representation → adaptive order → error tolerance control

Limitations:

- The Cauchy product reduces parallelism
- The local Lipschitz constant determines the number of iterations needed to achieve a given error tolerance

SLIDE 14

CUDA: SW

- Kernel: separate code, task division
- Thread → Block (1D, 2D, or 3D) → Grid (1D or 2D)
- Computation is divided based on thread and block IDs
- Granularity: down to bit level (after warp broadcast access)
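The ID-based division of work can be illustrated without a GPU. This is a pure-Python emulation of a 1D kernel launch; the function names and dimensions are illustrative, not CUDA API calls.

```python
# Emulate how CUDA divides work by IDs: each (block, thread) pair maps
# to a global index, and each index owns one element of the problem.

def launch(grid_dim, block_dim, kernel, *args):
    """Emulate a 1D kernel launch by visiting every (block, thread) ID."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

def scale_kernel(block_idx, thread_idx, block_dim, data, factor):
    gid = block_idx * block_dim + thread_idx   # global thread ID
    if gid < len(data):                        # guard: grid may overshoot
        data[gid] *= factor

data = list(range(10))
launch(3, 4, scale_kernel, data, 2)   # 12 threads cover 10 elements
print(data)
```

On real hardware all these "iterations" run concurrently, which is why the bounds guard matters: the grid is usually rounded up past the data size.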

SLIDE 15

CUDA: HW

Scheduling:

- Parallel and sequential scheduling
- Scalability → blocks are required to be independent

Warp:

- Warp = 32 threads
- Warp divergence
- Warp-level synchronization

Active blocks and threads:

- Active threads per SM: maximum 1024
- Goal: full occupancy = 1024 active threads per SM

SLIDE 16

Software Architecture

SLIDE 17

Update Phase

- Adaptive order p according to the required error tolerance (Stewart and Bair)
- Can be processed in parallel for each neuron
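The adaptive-order idea in the update phase can be sketched on a simple test equation: grow the power series until the newest term drops below the tolerance. This uses y' = a·y as a stand-in for the neuron equations (an assumption for illustration; the real update applies the same stopping rule to the Izhikevich series).

```python
# Adaptive-order Parker-Sochacki step: add series terms until the newest
# term is below the error tolerance, then stop. Sketched on y' = a*y.

def ps_adaptive(a, y0, dt, tol, max_order=64):
    term, total, order = y0, y0, 0
    while order < max_order:
        order += 1
        term = a * term * dt / order     # next Maclaurin term at step dt
        total += term
        if abs(term) < tol:              # converged to requested tolerance
            break
    return total, order

value, order_used = ps_adaptive(a=-1.0, y0=1.0, dt=0.1, tol=1e-12)
print(value, order_used)   # tighter tolerance -> higher order used
```

Because each neuron runs this loop independently, the phase maps onto one thread (or thread group) per neuron, with per-neuron order p varying at run time.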

SLIDE 18

Propagation Phase

- Translate spikes into synaptic events: global communication is required
- Encoded spikes are written to global memory: a bit mask plus time values
- A propagation block reads and filters all spikes, decodes them, fetches synaptic data, and distributes the events into time slots
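The "bit mask + time values" encoding can be sketched as follows. The field layout here is illustrative only, not the paper's exact in-memory format: one bit per neuron marks who fired, and a compact array holds the within-step spike times for the firing neurons.

```python
# Spike encoding sketch: a bit mask of firing neurons plus a dense array
# of spike times, stored in neuron-id order. Layout is illustrative.

def encode_spikes(spike_times):
    """spike_times: {neuron_id: time} for neurons that fired this step."""
    mask, times = 0, []
    for nid in sorted(spike_times):
        mask |= 1 << nid                 # set bit nid in the mask
        times.append(spike_times[nid])   # dense: only firing neurons
    return mask, times

def decode_spikes(mask, times):
    out, i, nid = {}, 0, 0
    while mask >> nid:                   # scan until no higher bits remain
        if (mask >> nid) & 1:
            out[nid] = times[i]
            i += 1
        nid += 1
    return out

fired = {2: 0.125, 5: 0.5, 7: 0.875}
mask, times = encode_spikes(fired)
print(bin(mask), decode_spikes(mask, times) == fired)
```

The appeal of this scheme on a GPU is bandwidth: the mask is scanned with cheap bit operations, and only neurons that actually fired consume a time-value slot.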

SLIDE 19

Sorting Phase

- Based on the GPU sorting algorithms of Satish et al.
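Satish et al.'s GPU work centers on radix sort. As a rough illustration of the underlying algorithm (a CPU sketch, not the GPU implementation, and the event keys/values here are hypothetical), synaptic events can be ordered by an integer key such as delivery time:

```python
# Minimal LSD radix sort over (key, event) pairs, 8 bits per pass.
# Stable: events with equal keys keep their input order, which matters
# when delivery time is the sort key.

def radix_sort(pairs, key_bits=32):
    for shift in range(0, key_bits, 8):
        buckets = [[] for _ in range(256)]
        for key, val in pairs:
            buckets[(key >> shift) & 0xFF].append((key, val))
        pairs = [p for b in buckets for p in b]   # concatenate, stable
    return pairs

events = [(42, "syn-a"), (7, "syn-b"), (300, "syn-c"), (7, "syn-d")]
print(radix_sort(events))
```

Radix sort suits GPUs because each pass is a histogram plus a scan, both of which parallelize well across thousands of threads.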

SLIDE 20

Software Architecture

SLIDE 21

Results: Verification

GPU device: GTX 260

- 24 streaming multiprocessors (SMs)
- Shared memory size: 16 KB / SM
- Global memory size: 938 MB
- Clock rate: 1.3 GHz

CPU device: AMD Opteron 285

- Dual core
- L2 cache size: 1 MB / core
- RAM size: 4 GB
- Clock rate: 2.6 GHz

Input conditions

- Random parameter allocation
- Random connectivity
- Zero PS error tolerance

Output

- Membrane potential traces
- Passed the test for equality

SLIDE 22

Results: Simulation Time vs. Network Size

Conditions: 80% excitatory / 20% inhibitory synapses, zero tolerance, 10 s of simulation, initially excited by a 0-200 pA current.

Results: GPU simulation is 8-9 times faster; real-time performance for 2-4%-connected networks of 2048-4096 neurons.

Major limiting factors: shared memory size and number of SMs.

[Figure: simulation time (s) vs. network size (×1000 neurons) at 2%, 4%, 8%, and 16% connectivity]

SLIDE 23

Results: Simulation Time vs. Event Throughput

Conditions: excitatory/inhibitory ratio increased from 0.80/0.20 to 0.98/0.02, network of 4096 neurons, zero tolerance, 10 s of simulation, initially excited by a 0-200 pA current.

Results: GPU simulation is 6-9 times faster, with up to 10,000 events per second per neuron. Real-time performance for 0-2%-connected networks of 2048-4096 neurons.

Major limiting factors: shared memory size and number of SMs.

[Figure: simulation time (s) vs. mean event throughput (×1000 events/(s·neuron)) at 2%, 4%, 8%, and 16% connectivity]

SLIDE 24

Results: Comparison with Other Works

Metric                  | This Work                | Other Works | Reason
------------------------|--------------------------|-------------|-------
Increase in speed       | 6-9x, RT                 | 10-35x, RT  | GPU device, complexity of computation, numerical integration methods, simulation type, time scale
Network size            | 2K - 8K                  | 16K - 200K  |
Connectivity per neuron | 100 - 1.3K               | 100 - 1K    |
Accuracy                | Full single-precision FP | Undefined   | Numerical integration method
Verification            | Direct                   | Indirect    |

SLIDE 25

Conclusion

- Implemented a high-accuracy PS-based hybrid system of spiking neural networks with IZ neurons on the GPU
- Directly verified the implementation

Future Work

- Add an accurate STDP implementation
- Characterize accuracy in relation to signal processing, network size, network speed, and learning
- Provide an example application
- Port to OpenCL
- Further optimization

SLIDE 26

Essential Bibliography

- R. Brette, et al., "Simulation of networks of spiking neurons: A review of tools and strategies," Journal of Computational Neuroscience, vol. 23, no. 3, pp. 349-398, 2007.
- R. Stewart and W. Bair, "Spiking neural network simulation: numerical integration with the Parker-Sochacki method," Journal of Computational Neuroscience, vol. 27, no. 1, pp. 115-133, Aug. 2009.
- G. E. Parker and J. S. Sochacki, "Implementing the Picard iteration," Neural, Parallel Sci. Comput., vol. 4, pp. 97-112, 1996.
- E. M. Izhikevich, "Simple model of spiking neurons," IEEE Transactions on Neural Networks, vol. 14, pp. 1569-1572, 2003.
- N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore GPUs," 2009, pp. 1-10.
- (2010, Apr.) CUDA Data Parallel Primitives Library. [Accessed online 04/30/2010]. http://code.google.com/p/cudpp/
- (2008) NVIDIA CUDA Programming Guide 2.3. [Accessed online 04/30/2010]. http://developer.nvidia.com

Q&A

Other Works

- J. Nageswaran, N. Dutt, J. Krichmar, A. Nicolau, and A. Veidenbaum, "A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors," Neural Networks, Jul. 2009.
- A. K. Fidjeland, E. B. Roesch, M. P. Shanahan, and W. Luk, "NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs," in Proc. IEEE Int. Conf. on Application-Specific Systems, Architectures and Processors, pp. 137-144, 2009.
- J.-P. Tiesel and A. S. Maida, "Using parallel GPU architecture for simulation of planar I/F networks," 2009, pp. 754-759.