Simulating the Behavior of the Human Brain on NVIDIA GPUs (Human - PowerPoint PPT Presentation

Simulating the Behavior of the Human Brain on NVIDIA GPUs (Human Brain Project)
Pedro Valero-Lara, Ivan Martínez-Pérez, Antonio J. Peña, Xavier Martorell, Raül Sirvent, and Jesús Labarta
Munich, 09-11-2018
www.bsc.es

Human Brain Project (HBP)
H2020 FET Flagship Project
● Accelerate the fields of neuroscience, computing and brain-related medicine
● 8 different Sub-Projects
  ➔ Sub-Project 7: High Performance Analytics and Computing
● WP 7.5: Providing support for the migration of simulation codes to hybrid and/or accelerator-enabled architectures
● 86×10⁹ (86 billion) neurons
  ➔ ~80,000 Volta GPUs
● Steps:
  ➔ Neurons Generator → once at the very beginning
  ➔ Solving voltage capacitance
  ➔ Synapses (Spiking) → communication
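The two figures on this slide imply a per-GPU load. A minimal sketch of that arithmetic follows, assuming the neurons are spread evenly over the GPUs; the even split and the function name `neurons_per_gpu` are illustrative assumptions, not something stated in the talk.

```c
#include <assert.h>

/* Even split of neurons across GPUs, rounded up so every neuron is placed.
   Both inputs come from the slide; the even split is an assumption. */
long long neurons_per_gpu(long long neurons, long long gpus)
{
    assert(neurons >= 0 && gpus > 0);
    return (neurons + gpus - 1) / gpus;
}
```

With 86×10⁹ neurons and 80,000 GPUs, `neurons_per_gpu(86000000000LL, 80000LL)` gives 1,075,000 neurons per GPU, the same order of magnitude as the largest batch sizes in the test cases (256,000 and 512,000).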

Solving Voltage Capacitance – Hines Method
● Ax=b,
  ➔ where A is a Hines (3 vectors) Matrix
  ➔ Similar to Tridiagonal System (Thomas Method)
  ➔ 8xN operations
  ➔ Vector p → branches

Hines Method

    void hines_solver(double *a, double *b, double *d, double *rhs, int *p, int cell_size)
    {
        int i;
        double factor;

        // backward sweep: fold each node into its parent p[i]
        for (i = cell_size - 1; i > 0; --i) {
            factor = a[i] / d[i];
            d[p[i]]   -= factor * b[i];
            rhs[p[i]] -= factor * rhs[i];
        }

        rhs[0] /= d[0];

        // forward sweep: propagate the solution from parent to child
        for (i = 1; i < cell_size; ++i) {
            rhs[i] -= b[i] * rhs[p[i]];
            rhs[i] /= d[i];
        }
    }
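To make the role of the vector p concrete, the sketch below solves a small 5-node cell in which node 1 has two children, then checks the result by multiplying back. The placement of a and b in the matrix (a[i] in the parent's row, b[i] in the child's row) is inferred from the two sweeps rather than stated on the slide, and the names `hines_solve`, `hines_matvec` and `hines_demo_max_residual`, as well as the numeric values, are illustrative.

```c
#include <assert.h>

/* Hines solver as on the slide; rhs is overwritten with the solution.
   Requires p[i] < i for i > 0, i.e. a parent is numbered before its children. */
static void hines_solve(const double *a, const double *b, double *d,
                        double *rhs, const int *p, int n)
{
    assert(n > 0);
    for (int i = n - 1; i > 0; --i) {            /* backward sweep */
        double factor = a[i] / d[i];
        d[p[i]]   -= factor * b[i];
        rhs[p[i]] -= factor * rhs[i];
    }
    rhs[0] /= d[0];
    for (int i = 1; i < n; ++i) {                /* forward sweep */
        rhs[i] -= b[i] * rhs[p[i]];
        rhs[i] /= d[i];
    }
}

/* y = A*x with A[i][i] = d[i], A[p[i]][i] = a[i], A[i][p[i]] = b[i]
   (assumed layout of the three Hines vectors). */
static void hines_matvec(const double *a, const double *b, const double *d,
                         const int *p, const double *x, double *y, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = d[i] * x[i];
    for (int i = 1; i < n; ++i) {
        y[p[i]] += a[i] * x[i];      /* upper entry: row of the parent */
        y[i]    += b[i] * x[p[i]];   /* lower entry: row of the child  */
    }
}

/* Solve a 5-node cell in which node 1 has two children (2 and 3),
   then return the largest |A*x - rhs| as a correctness check. */
double hines_demo_max_residual(void)
{
    enum { N = 5 };
    const int    p[N]    = { 0, 0, 1, 1, 3 };    /* p[0] is unused */
    const double a[N]    = { 0.0, -1.0, -1.0, -1.0, -1.0 };
    const double b[N]    = { 0.0, -1.0, -1.0, -1.0, -1.0 };
    const double d0[N]   = { 4.0, 4.0, 4.0, 4.0, 4.0 };
    const double rhs0[N] = { 1.0, 2.0, 3.0, 4.0, 5.0 };
    double d[N], x[N], y[N], worst = 0.0;

    for (int i = 0; i < N; ++i) { d[i] = d0[i]; x[i] = rhs0[i]; }
    hines_solve(a, b, d, x, p, N);               /* x now holds the solution */
    hines_matvec(a, b, d0, p, x, y, N);
    for (int i = 0; i < N; ++i) {
        double r = y[i] - rhs0[i];
        if (r < 0.0) r = -r;
        if (r > worst) worst = r;
    }
    return worst;
}
```

Because the solver overwrites d and rhs, the demo keeps untouched copies (`d0`, `rhs0`) for the check. A chain with p[i] = i - 1 for every i reduces this to an ordinary tridiagonal solve.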

Implementation of cuHinesBatch
● Saturate the GPU with a high number of neurons
  ➔ 1 thread per neuron
  ➔ No synchronizations
  ➔ No atomic operations
● Data Layouts
  ➔ Flat
     - No coalescing
  ➔ Full-Interleaved
     - Coalescing
     - Big jumps in memory
  ➔ Block-Interleaved
     - Coalescing
     - Small jumps in memory
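The three layouts differ only in where element i of system s is stored. The index functions below are a sketch of that mapping; the flat and full-interleaved forms follow directly from the names, while the block-interleaved formula (interleaving within groups of `bs` systems) is an assumed form consistent with "small jumps in memory". All function and parameter names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Position of element i of system s in a batch of `batch` systems,
   each with n elements. */

/* Flat: each system is contiguous; neighbouring threads are n apart. */
size_t idx_flat(size_t s, size_t i, size_t n)
{
    assert(i < n);
    return s * n + i;
}

/* Full-interleaved: element i of all systems is contiguous; neighbouring
   threads are adjacent, but one thread jumps `batch` between elements. */
size_t idx_full_interleaved(size_t s, size_t i, size_t batch)
{
    assert(s < batch);
    return i * batch + s;
}

/* Block-interleaved (assumed form): interleave only within groups of
   `bs` systems, so one thread jumps just `bs` between elements. */
size_t idx_block_interleaved(size_t s, size_t i, size_t n, size_t bs)
{
    assert(i < n && bs > 0);
    return (s / bs) * (bs * n) + i * bs + (s % bs);
}
```

With one thread per system, all threads touch element i at the same step. In the flat layout those accesses are n elements apart, so they cannot be coalesced; in both interleaved layouts neighbouring threads read neighbouring addresses.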

Performance of cuHinesBatch: Flat
Test Case: Flat
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines Matrices)
  ➔ 300 elements and 2 branches
  ➔ double precision
  ➔ BatchSize
     - 512; 5,120; 51,120; 512,000
● Setting
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35

Performance of cuHinesBatch: Full-Interleaved
Test Case: Full-Interleaved
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines Matrices)
  ➔ 300 elements and 2 branches
  ➔ double precision
  ➔ BatchSize
     - 512; 5,120; 51,120; 512,000
● Setting
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35

Performance of cuHinesBatch: Block-Interleaved
Test Case: Block-Interleaved
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines Matrices)
  ➔ 300 elements and 2 branches
  ➔ double precision
  ➔ BatchSize = 512,000
● Setting
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35

Performance of cuHinesBatch on Real Neurons
Test Case: Real Neurons
● K80 NVIDIA GPU (Kepler)
  ➔ 4992 CUDA cores
  ➔ 24 GB GDDR5
● Input (Hines Matrices)
  ➔ 6 different morphologies
     - Small, medium, and big
     - Low (10%) and high (50%) #branches
     - http://www.neuromorpho.org/
  ➔ BatchSize
     - 256; 2,560; 25,600; 256,000
● Setting
  ➔ Block-Interleaved
  ➔ cudaFuncSetCacheConfig(LBM, cudaFuncCachePreferL1)
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_35

Performance of cuHinesBatch on Pascal
Test Case: Pascal
● P100 NVIDIA GPU (Pascal)
  ➔ 3584 CUDA cores
  ➔ 16 GB HBM2
● Input (Hines Matrices)
  ➔ Medium (size)
  ➔ Low (% #branches)
  ➔ BatchSize = 256,000
● Setting
  ➔ Full-Interleaved
  ➔ numactl --interleave=all
  ➔ -O3 -openmp -arch=compute_62
● NVPROF
  ➔ High occupancy (99.5%)
  ➔ High bandwidth (500 GB/s)
  ➔ No memory issues

Performance of cuHinesBatch: cuThomasBatch
Test Case: cuThomasBatch
● 1 logical GPU (K40) of the K80 NVIDIA GPU (Kepler)
  ➔ 2496 CUDA cores
  ➔ 16 GB GDDR5
● Input (Hines Matrices)
  ➔ System Size
     - 64; 128; 256; 512
     - 1,024; 2,048; 4,096; 8,192
  ➔ BatchSize
     - 256; 2,560; 25,600; 256,000
     - 20; 200; 2,000; 20,000
● Setting
  ➔ cusparseDgtsvStridedBatch
  ➔ cuThomasBatch
● Results
  ➔ 1.2-2.8x faster
  ➔ 4x more precise
  ➔ 2x less memory occupancy
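The Thomas method that cuThomasBatch applies to each system can be sketched on the CPU as follows. This is not the authors' kernel: it is a plain C analog in which the outer loop over `s` stands in for "one GPU thread per system", the arrays use the full-interleaved layout, and the names (`thomas_batch_interleaved`, `dl`, `dg`, `du`) are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Solve `batch` independent tridiagonal systems of size n with the Thomas
   method. Arrays use the full-interleaved layout: element i of system s is
   at [i * batch + s]. dl = sub-diagonal (first element of each system is
   ignored), dg = diagonal, du = super-diagonal (last element of each system
   does not affect the result but must be initialized). du and rhs are
   overwritten; on return rhs holds the solutions. No pivoting is performed. */
void thomas_batch_interleaved(const double *dl, const double *dg, double *du,
                              double *rhs, size_t n, size_t batch)
{
    assert(n > 0 && batch > 0);
    for (size_t s = 0; s < batch; ++s) {         /* one GPU thread per s */
        /* forward elimination */
        du[s]  /= dg[s];
        rhs[s] /= dg[s];
        for (size_t i = 1; i < n; ++i) {
            size_t k = i * batch + s, km = k - batch;
            double m = dg[k] - dl[k] * du[km];
            du[k]  = du[k] / m;
            rhs[k] = (rhs[k] - dl[k] * rhs[km]) / m;
        }
        /* back substitution, from row n-2 down to row 0 */
        for (size_t i = n - 1; i-- > 0; ) {
            size_t k = i * batch + s;
            rhs[k] -= du[k] * rhs[k + batch];
        }
    }
}
```

Each system is solved without reference to any other, which is why this approach needs neither synchronization nor atomic operations when mapped to one thread per system.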

Performance of cuHinesBatch: Multi-Morphology
Test Case: Multi-Morphology
● 1 logical GPU (K40) of the K80 NVIDIA GPU (Kepler)
  ➔ 2496 CUDA cores
  ➔ 16 GB GDDR5
● Input (Hines Matrices)
  ➔ Different morphologies
     - Mono-Morphology
        • Same size
     - Multi-Morphology
        • Different size
     - 1,024; 2,048; 4,096; 8,192
  ➔ BatchSize = 25,600
  ➔ 10% and 50% of #Branches
● Setting
  ➔ Full-Interleaved
  ➔ Padding
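The slide lists padding as the way to fit systems of different size into one interleaved batch, but does not say how the padding rows are filled. One possible choice, sketched below under that assumption, is to append decoupled identity rows so that every system reaches the same length; the function name `hines_pad` and its parameters are hypothetical.

```c
#include <assert.h>

/* Extend a Hines system from n to n_max rows with decoupled identity rows
   (d = 1, a = b = rhs = 0), so that every system in a batch has the same
   length. How the talk fills its padding is not stated; this is one
   choice that leaves the first n unknowns unchanged. All arrays must have
   room for n_max entries. */
void hines_pad(double *a, double *b, double *d, double *rhs, int *p,
               int n, int n_max)
{
    assert(n > 0 && n <= n_max);
    for (int i = n; i < n_max; ++i) {
        a[i]   = 0.0;     /* no coupling into the parent row      */
        b[i]   = 0.0;     /* no coupling from the parent solution */
        d[i]   = 1.0;     /* keeps the divisions well defined     */
        rhs[i] = 0.0;     /* padded unknowns solve to zero        */
        p[i]   = i - 1;   /* any valid earlier row works          */
    }
}
```

With a[i] = 0 the backward-sweep factor for a padded row is zero, so the row changes nothing in its parent, and the forward sweep leaves the padded unknown at zero. The cost is that every thread still iterates over n_max rows, which may be part of the performance drop reported for the multi-morphology case.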

Conclusions & Future Work
cuHinesBatch
● High performance (50x faster than sequential CPU)
● Scales well even with a very high number of neurons
  ➔ 1 thread per neuron (Hines System)
  ➔ Full-Interleaved Data Layout
  ➔ Faster than using one CUDA Block per system
     - cuThomasBatch
● Data Layout transformation (from flat to full-interleaved)
  ➔ Once at the very beginning of the simulation
● Performance drop for multi-morphology
  ➔ 2 approaches:
     - cuThomasBatch per segment
     - cusparseDgtsvStridedBatch per segment
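The flat to full-interleaved transformation mentioned here amounts to a transpose of the batch. A minimal host-side sketch, assuming each of the `batch` systems has the same length `n` (the function name is illustrative, not taken from the cuHinesBatch code):

```c
#include <assert.h>
#include <stddef.h>

/* Copy `batch` systems of n elements each from the flat layout
   (src[s * n + i]) to the full-interleaved layout (dst[i * batch + s]).
   src and dst must not overlap. */
void flat_to_full_interleaved(const double *src, double *dst,
                              size_t n, size_t batch)
{
    assert(src != dst);
    for (size_t s = 0; s < batch; ++s)
        for (size_t i = 0; i < n; ++i)
            dst[i * batch + s] = src[s * n + i];
}
```

The same routine would be applied to each of the vectors a, b, d and rhs. Since it runs once before the time-stepping loop, its cost is amortized over the whole simulation.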

References & Acknowledgments
Pedro Valero-Lara, Ivan Martínez-Pérez, Antonio J. Peña, Xavier Martorell, Raül Sirvent, Jesús Labarta: cuHinesBatch: Solving Multiple Hines systems on GPUs Human Brain Project*. ICCS 2017: 566-575
Pedro Valero-Lara, Ivan Martínez-Pérez, Raül Sirvent, Xavier Martorell, and Antonio J. Peña: NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems Implementation of cuThomasBatch. PPAM 2017
cuHinesBatch repository: https://pm.bsc.es/gitlab/imartin1/cuHinesBatch
Acknowledgements: This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 720270 (HBP SGA1), from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P) and the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels (2014-SGR-1051). We thank the support of NVIDIA through the BSC/UPC NVIDIA GPU Center of Excellence. Antonio J. Peña is cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266.

www.bsc.es Thank you! For further information please contact pedro.valero@bsc.es
