  1. Toward Real-Time Simulation of Cardiac Dynamics Ezio Bartocci SUNY Stony Brook Joint work with E. Cherry, J. Glimm, R. Grosu, S. A. Smolka, F. Fenton

  2. Outline • Motivation • Cardiac Models as Reaction Diffusion Systems • CUDA Programming Model • Reaction Diffusion in CUDA • Case Studies • Work in Progress

  3. Motivation for our Work — Simulation-Based Analysis
  • Spiral Formation
  • Spiral Breakup
  • Tip Tracking
  • Front Wave Tracking
  • Curvature Analysis
  • Conduction Velocity
  • Restitution Analysis

  4. Cardiac Models as Reaction Diffusion Systems
  [Figure: schematic action potential over time, showing the threshold, a failed initiation below threshold, and the resting potential]
  The membrane's action potential (AP) depends on:
  • Stimulus (voltage or current): external / neighboring cells
  • Cell's state
  The AP has nonlinear (threshold) behavior.
  • Reaction-diffusion system: ∂u/∂t = R(u) + ∇·(D∇u)

  5. Cardiac Models
  • Minimal Model (Fenton-Cherry) (4 variables), Human
  • Beeler-Reuter (8 variables), Canine
  • ten Tusscher-Panfilov (19 variables), Human
  • Iyer (67 variables), Human

  6. Available Technologies: CPU-based vs. GPU-based. The GPU devotes more transistors to data processing. (Image from the CUDA Programming Guide.)

  7. GPU vs CPU

  8. GPU Architecture — Each GPU consists of a set of multiprocessors.
  [Diagram: multiprocessors, each with registers, stream processors (SP), and shared memory, plus a constant cache, a texture cache, and global memory]

  9. GPU Architecture — Each multiprocessor can have 8/32 stream processors (SP, also called cores by NVIDIA), which share access to local memory.

  10. GPU Architecture — Each stream processor (core) contains a fused multiply-adder capable of single-precision arithmetic. It can complete 3 floating-point operations per cycle: a fused MADD and a MUL.

  11. GPU Architecture — Each multiprocessor can contain one or more 64-bit fused multiply-adders for double-precision operations.

  12. Memory Hierarchy — The fastest memory available for GPU computation is the device registers. Each multiprocessor contains 16 KB of registers, which are partitioned among the MP-resident threads.

  13. Memory Hierarchy — Shared memory (16 KB) is primarily intended to provide fast communication between threads executed by the same multiprocessor, although, due to its speed, it can also be used as a programmer-controlled memory cache.

  14. Memory Hierarchy — GPUs also have DRAM (global memory); its latency is about 150x worse than that of registers and shared memory.

  15. Memory Hierarchy — Constant memory, as the name implies, is a read-only region which also has a small cache.

  16. Memory Hierarchy — Texture memory is read-only, with a small cache optimized for manipulation of textures. It also provides built-in linear interpolation of the data.

  17. Memory Hierarchy — Global memory is available to all threads and is persistent between GPU calls.

  18. CUDA Programming Model
  Single Instruction Multiple Threads (SIMT), similar to the CPU's Single Instruction Multiple Data (SIMD).
  The host runs serial code; a kernel invocation launches a grid of thread blocks (Block (0,0) … Block (5,1)) on the GPU device.
  [Diagram: four threads (ID 0–3), each executing on its own element of a vector: A[ID]=ID; if (ID%2) { A[ID]+=2; A[ID]*=2; } else { A[ID]=0; } A[ID]+=2;]
  When branches occur in the code (e.g. due to if statements), the divergent threads become inactive until the conforming threads complete their separate execution. When execution merges, the threads continue to operate in parallel.

  19. CUDA Programming Model
  Different threads are multiplexed onto and executed by the same core in order to hide the latency of memory accesses. Each thread block is executed by one multiprocessor. The maximum number of threads per thread block is 512, and the achievable number also depends on the amount of registers each thread needs.
  [Diagram: GPU device with a grid of thread blocks; each thread block uses registers and shared memory, with access to global memory]

  20. GPU Specifications
  • Tesla C1060: 30 multiprocessors, 240 cores, processor core clock 1.296 GHz, 933 Gigaflops (single precision), 78 Gigaflops (double precision), max bandwidth 102 GB/s, 4 GB of DRAM, cost: $1000.
  • Fermi C2070: 14 multiprocessors, 448 cores, processor core clock 1.15 GHz, 1030 Gigaflops (single precision), 515 Gigaflops (double precision), max bandwidth 144 GB/s, 6 GB of DRAM, cost: $3200.

  21. Reaction-Diffusion in CUDA
  ∂u/∂t = R(u) + ∇·(D∇u)
  For each time step, a set of Ordinary Differential Equations (ODEs) and Partial Differential Equations (PDEs) must be solved:

  for (timestep = 1; timestep < nend; timestep++) {
      solveODEs<<<grid, block>>>(….);
      calcLaplacian<<<grid, block>>>(….);
  }

  Solving the ODEs uses a different method depending on the model: Runge-Kutta 4th order, Forward Euler, Backward Euler, or Semi-Implicit.
  Solving the PDE (calculating the Laplacian): (∇·(D∇u))_{i,j} = (D·dt/dx²)·(u_{i−1,j} + u_{i,j−1} + u_{i+1,j} + u_{i,j+1} − 4·u_{i,j})

  22. Optimize the Reaction Term in CUDA
  Heaviside Simplification
  • To avoid divergent branches among threads, we substitute the if-then-else form of the Heaviside functions:

  if (x > a) {
      y = b;          becomes      y = c + (x > a) * b_c;
  } else {
      y = c;
  }

  where a, b, c, and b_c = (b − c) are constants.
  • Precomputing Lookup Tables using Textures: we build tables in which we precompute the nonlinear parts of the ODEs, bind these tables to textures, and exploit the built-in linear interpolation feature of the texture hardware.
