SLIDE 1

Munich, 09-11-2018

Pedro Valero-Lara, Ivan Martínez-Pérez, Antonio J. Peña, Xavier Martorell, Raül Sirvent, and Jesús Labarta
www.bsc.es

Simulating the Behavior of the Human Brain on NVIDIA GPUs

(Human Brain Project)

SLIDE 2

H2020 FET Flagship Project

  • Accelerate the fields of neuroscience, computing and brain-related medicine
  • 8 Different Sub-Projects

➔ Sub-Project 7: High Performance Analytics and Computing

  • WP 7.5: Providing support for the migration of simulation codes to hybrid and/or

accelerator-enabled architectures

  • 86×10⁹ (86 billion) neurons

➔ ~80,000 Volta GPUs

  • Steps:

➔ Neurons generator → run once at the very beginning
➔ Solving voltage capacitance
➔ Synapses (spiking) → communication

Human Brain Project (HBP)

SLIDE 3

Hines Method

  • Ax=b,

➔ where A is a Hines matrix (3 vectors)
➔ Similar to a tridiagonal system (Thomas method)
➔ 8×N operations
➔ Vector p → branches (parent compartment of each row)

Solving Voltage Capacitance – Hines Method


void hines_solver(double *a, double *b, double *d, double *rhs,
                  int *p, int cell_size)
{
    int i;
    double factor;
    // backward sweep
    for (i = cell_size - 1; i > 0; --i) {
        factor = a[i] / d[i];
        d[p[i]]   -= factor * b[i];
        rhs[p[i]] -= factor * rhs[i];
    }
    rhs[0] /= d[0];
    // forward sweep
    for (i = 1; i < cell_size; ++i) {
        rhs[i] -= b[i] * rhs[p[i]];
        rhs[i] /= d[i];
    }
}
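As a sanity check, the solver can be exercised on the smallest interesting case. The sketch below repeats the solver so the snippet is self-contained; the 3-compartment example and the convention that a[i] and b[i] hold the upper and lower links between row i and its parent p[i] are inferred from the code, not stated on the slide:

```c
#include <math.h>

/* The Hines solver from the slide, in cleaned-up form. a[i] = A[p[i]][i]
   and b[i] = A[i][p[i]] link compartment i to its parent p[i]; d is the
   diagonal. The backward sweep eliminates each row into its parent row,
   the forward sweep back-substitutes from the root down the branches. */
static void hines_solver(double *a, double *b, double *d, double *rhs,
                         int *p, int cell_size)
{
    int i;
    double factor;
    for (i = cell_size - 1; i > 0; --i) {   /* backward sweep */
        factor = a[i] / d[i];
        d[p[i]]   -= factor * b[i];
        rhs[p[i]] -= factor * rhs[i];
    }
    rhs[0] /= d[0];
    for (i = 1; i < cell_size; ++i) {       /* forward sweep */
        rhs[i] -= b[i] * rhs[p[i]];
        rhs[i] /= d[i];
    }
}

/* Demo: an unbranched 3-compartment cell (p[i] = i-1), so A is the
   tridiagonal matrix [2 -1 0; -1 2 -1; 0 -1 2]. With x = (1,1,1) the
   right-hand side is (1,0,1); returns max |x[i] - 1| after solving. */
static double hines_demo_error(void)
{
    double a[]   = { 0.0, -1.0, -1.0 };  /* A[p[i]][i], a[0] unused */
    double b[]   = { 0.0, -1.0, -1.0 };  /* A[i][p[i]], b[0] unused */
    double d[]   = { 2.0,  2.0,  2.0 };
    double rhs[] = { 1.0,  0.0,  1.0 };
    int    p[]   = { 0, 0, 1 };          /* parent indices, p[0] unused */
    double err = 0.0;

    hines_solver(a, b, d, rhs, p, 3);
    for (int i = 0; i < 3; ++i) {
        double e = fabs(rhs[i] - 1.0);
        if (e > err) err = e;
    }
    return err;
}
```

For an unbranched cell p[i] = i-1 and the method is exactly the Thomas algorithm; branches only change which parent row each elimination targets, which is why the cost stays linear in N.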

SLIDE 4

cuHinesBatch

  • Saturate the GPU with a high number of neurons

➔ 1 thread per neuron
➔ No synchronizations
➔ No atomic operations

  • Data Layouts

➔ Flat

No coalescing

➔ Full-Interleaved

Coalescing

Big jumps in memory

➔ Block-Interleaved

  • Coalescing
  • Small jumps in memory
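The three layouts differ only in how element e of neuron n is mapped to linear memory. Below is a host-side sketch of plausible index functions; the formulas are an assumed reconstruction, since the slides show the layouts only pictorially:

```c
/* Illustrative index functions for the three data layouts.
   n = neuron, e = element within the neuron's Hines system,
   N = number of neurons (batch size), SIZE = elements per neuron,
   BS = neuron-block size of the Block-Interleaved layout. */

/* Flat: each neuron contiguous. Threads n and n+1 access addresses
   SIZE apart for the same e -> no coalescing. */
static long idx_flat(long n, long e, long N, long SIZE)
{
    (void)N;
    return n * SIZE + e;
}

/* Full-Interleaved: element e of every neuron stored together. Threads
   n and n+1 access consecutive addresses (coalesced), but moving from
   e to e+1 jumps N elements -> big jumps in memory. */
static long idx_full_interleaved(long n, long e, long N, long SIZE)
{
    (void)SIZE;
    return e * N + n;
}

/* Block-Interleaved: interleave only inside blocks of BS neurons.
   Accesses stay coalesced within a block, and the e -> e+1 jump
   shrinks from N to BS -> small jumps in memory. */
static long idx_block_interleaved(long n, long e, long SIZE, long BS)
{
    return (n / BS) * (BS * SIZE) + e * BS + (n % BS);
}
```

With one thread per neuron, consecutive threads load the same e for consecutive n, so Full-Interleaved gives stride-1 (coalesced) warp accesses, while Block-Interleaved keeps the coalescing but bounds the jump between successive elements by BS instead of N.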

Implementation of cuHinesBatch

SLIDE 5

Test Case: Flat

  • K80 NVIDIA GPU (Kepler)

➔ 4992 CUDA cores
➔ 24 GB GDDR5

  • Input (Hines Matrices)

➔ 300 elements and 2 branches
➔ Double precision
➔ BatchSize: 512; 5,120; 51,200; 512,000

  • Setting

➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
➔ numactl --interleave=all
➔ -O3 -openmp -arch=compute_35

Performance of cuHinesBatch: Flat

SLIDE 6

Test Case: Full-Interleaved

  • K80 NVIDIA GPU (Kepler)

➔ 4992 CUDA cores
➔ 24 GB GDDR5

  • Input (Hines Matrices)

➔ 300 elements and 2 branches
➔ Double precision
➔ BatchSize: 512; 5,120; 51,200; 512,000

  • Setting

➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
➔ numactl --interleave=all
➔ -O3 -openmp -arch=compute_35

Performance of cuHinesBatch: Full-Interleaved

SLIDE 7

Test Case: Block-Interleaved

  • K80 NVIDIA GPU (Kepler)

➔ 4992 CUDA cores
➔ 24 GB GDDR5

  • Input (Hines Matrices)

➔ 300 elements and 2 branches
➔ Double precision
➔ BatchSize = 512,000

  • Setting

➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
➔ numactl --interleave=all
➔ -O3 -openmp -arch=compute_35

Performance of cuHinesBatch: Block-Interleaved

SLIDE 8

Test Case: Real Neurons

  • K80 NVIDIA GPU (Kepler)

➔ 4992 CUDA cores
➔ 24 GB GDDR5

  • Input (Hines Matrices)

➔ 6 different morphologies

Small, medium, and big

Low (10%) and high (50%) #branches

  • http://www.neuromorpho.org/

➔ BatchSize

256; 2,560; 25,600; 256,000

  • Setting

➔ Block-Interleaved
➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
➔ numactl --interleave=all
➔ -O3 -openmp -arch=compute_35

Performance of cuHinesBatch on Real Neurons

SLIDE 9

Test Case: Pascal

  • P100 NVIDIA GPU (Pascal)

➔ 3584 CUDA cores
➔ 16 GB HBM2

  • Input (Hines Matrices)

➔ Medium (size)
➔ Low (% of branches)
➔ BatchSize = 256,000

  • Setting

➔ Full-Interleaved
➔ numactl --interleave=all
➔ -O3 -openmp -arch=compute_62

  • NVPROF

➔ High occupancy (99.5%)
➔ High bandwidth (500 GB/s)
➔ No memory issues

Performance of cuHinesBatch on Pascal

SLIDE 10

Test Case: cuThomasBatch

  • 1 logical GPU (K40-equivalent) of the K80 NVIDIA GPU (Kepler)

➔ 2496 CUDA cores
➔ 12 GB GDDR5

  • Input (Hines Matrices)

➔ System Size

64; 128; 256; 512

1,024; 2,048; 4,096; 8,192

➔ BatchSize

256; 2,560; 25,600; 256,000

20; 200; 2,000; 20,000

  • Setting

➔ cusparseDgtsvStridedBatch
➔ cuThomasBatch

  • Results

1.2-2.8x faster

4x more precise

2x smaller memory footprint
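What one cuThomasBatch thread does can be sketched sequentially in plain C (an illustrative reconstruction, not the actual cuThomasBatch source): solve system s of a batch of B tridiagonal systems stored full-interleaved, overwriting the coefficients in place, which is where the roughly 2x memory saving over a workspace-based routine such as cusparseDgtsvStridedBatch comes from:

```c
#include <math.h>

/* One "thread" of a cuThomasBatch-style kernel, written as plain C.
   Element i of system s lives at index i*B + s (full-interleaved).
   a = sub-diagonal, b = diagonal, c = super-diagonal; c and rhs are
   overwritten in place, and the solution ends up in rhs. */
static void thomas_one_system(const double *a, const double *b, double *c,
                              double *rhs, int n, int B, int s)
{
    /* forward elimination */
    c[s]   = c[s] / b[s];
    rhs[s] = rhs[s] / b[s];
    for (int i = 1; i < n; ++i) {
        long k = (long)i * B + s, kp = k - B;  /* this row, previous row */
        double m = 1.0 / (b[k] - a[k] * c[kp]);
        c[k]   = c[k] * m;
        rhs[k] = (rhs[k] - a[k] * rhs[kp]) * m;
    }
    /* back substitution */
    for (int i = n - 2; i >= 0; --i) {
        long k = (long)i * B + s;
        rhs[k] -= c[k] * rhs[k + B];
    }
}

/* Demo: batch of B = 2 copies of the 3x3 system
   [2 -1 0; -1 2 -1; 0 -1 2] x = (1,0,1), whose solution is x = (1,1,1).
   Returns max |x - 1| over both systems. */
static double thomas_demo_error(void)
{
    /* full-interleaved storage: element i of system s at index i*2 + s */
    double a[]   = {  0.0,  0.0, -1.0, -1.0, -1.0, -1.0 };
    double b[]   = {  2.0,  2.0,  2.0,  2.0,  2.0,  2.0 };
    double c[]   = { -1.0, -1.0, -1.0, -1.0,  0.0,  0.0 };
    double rhs[] = {  1.0,  1.0,  0.0,  0.0,  1.0,  1.0 };
    double err = 0.0;

    for (int s = 0; s < 2; ++s)          /* one "thread" per system */
        thomas_one_system(a, b, c, rhs, 3, 2, s);
    for (int k = 0; k < 6; ++k) {
        double e = fabs(rhs[k] - 1.0);
        if (e > err) err = e;
    }
    return err;
}
```

Each system is handled by an independent thread with no synchronization, exactly as in the one-thread-per-neuron cuHinesBatch scheme; the interleaved index i*B + s is what makes the warp accesses coalesced.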

Performance of cuHinesBatch: cuThomasBatch

SLIDE 11

Test Case: Multi-Morphology

  • 1 logical GPU (K40-equivalent) of the K80 NVIDIA GPU (Kepler)

➔ 2496 CUDA cores
➔ 12 GB GDDR5

  • Input (Hines Matrices)

➔ Different morphologies

Mono-Morphology

Multi-Morphology

  • Same size
  • Different size

1,024; 2,048; 4,096; 8,192

BatchSize = 25,600

10% and 50% of #Branches

  • Setting

Full-Interleaved

Padding
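With padding, every system in a multi-morphology batch occupies max-size slots so the full-interleaved index stays uniform across threads. A hypothetical helper (not from the slides) quantifies what that costs:

```c
/* Illustrative helper: with padding, each of the B systems is stored
   with max_size slots so the full-interleaved index i*B + s stays
   uniform. Returns the fraction of allocated slots that are padding,
   i.e. wasted memory and wasted per-thread work. */
static double padding_overhead(const long *sizes, long B)
{
    long max_size = 0, used = 0;
    for (long s = 0; s < B; ++s) {
        if (sizes[s] > max_size) max_size = sizes[s];
        used += sizes[s];
    }
    return 1.0 - (double)used / ((double)max_size * (double)B);
}
```

Mixing 1,024- and 8,192-element systems in one batch already wastes about 44% of the allocated slots, which is consistent with the fall in performance the conclusions report for multi-morphology batches.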

Performance of cuHinesBatch: Multi-Morphology

SLIDE 12

cuHinesBatch

  • High performance (50x faster than a sequential CPU)
  • Good scaling even with a very high number of neurons

➔ 1 thread per neuron (Hines system)
➔ Full-Interleaved data layout
➔ Faster than using one CUDA block per system

cuThomasBatch

  • Data Layout transformation (from flat to full-interleaved)

➔ Once, at the very beginning of the simulation

  • Fall in performance for multi-morphology batches

➔ 2 approaches: 

cuThomasBatch per segment

cusparseDgtsvStridedBatch per segment

Conclusions & Future Work

SLIDE 13

Pedro Valero-Lara, Ivan Martínez-Pérez, Antonio J. Peña, Xavier Martorell, Raül Sirvent, and Jesús Labarta: cuHinesBatch: Solving Multiple Hines Systems on GPUs (Human Brain Project). ICCS 2017: 566-575

Pedro Valero-Lara, Ivan Martínez-Pérez, Raül Sirvent, Xavier Martorell, and Antonio J. Peña: NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems: Implementation of cuThomasBatch. PPAM 2017

cuHinesBatch repository: https://pm.bsc.es/gitlab/imartin1/cuHinesBatch

Acknowledgements:

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 720270 (HBP SGA1), from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P), and from the Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya under project MPEXPAR: Models de Programació i Entorns d'Execució Paral·lels (2014-SGR-1051). We thank the support of NVIDIA through the BSC/UPC NVIDIA GPU Center of Excellence. Antonio J. Peña is cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266.

References & Acknowledgments

SLIDE 14

Thank you!

For further information please contact pedro.valero@bsc.es

www.bsc.es