Toward Real-Time Simulation of Cardiac Dynamics Ezio Bartocci SUNY - - PowerPoint PPT Presentation

toward real time simulation of cardiac dynamics
SMART_READER_LITE
LIVE PREVIEW

Toward Real-Time Simulation of Cardiac Dynamics Ezio Bartocci SUNY - - PowerPoint PPT Presentation

Toward Real-Time Simulation of Cardiac Dynamics Ezio Bartocci SUNY Stony Brook Joint work with E. Cherry, J. Glimm, R. Grosu, S. A. Smolka, F. Fenton Outline Motivation Cardiac Models as Reaction Diffusion Systems CUDA Programming


slide-1
SLIDE 1

Ezio Bartocci

Toward Real-Time Simulation

  • f Cardiac Dynamics

Joint work with

  • E. Cherry, J. Glimm, R. Grosu, S. A. Smolka, F. Fenton

SUNY Stony Brook

slide-2
SLIDE 2

Outline

  • Motivation
  • Cardiac Models as Reaction Diffusion Systems
  • CUDA Programming Model
  • Reaction Diffusion in CUDA
  • Case Studies
  • Work in Progress
slide-3
SLIDE 3

Motivation for our Work

  • Spiral Formation
  • Spiral Breakup
  • Tip Tracking
  • Front Wave Tracking
  • Curvature Analysis
  • Conduction Velocity
  • Restitution Analysis

Simulation-Based Analysis

  • Spiral Formation
  • Spiral Breakup
  • Tip Tracking
  • Front Wave Tracking
  • Curvature Analysis
  • Conduction Velocity
  • Restitution Analysis
  • Spiral Formation
  • Spiral Breakup
  • Tip Tracking
  • Front Wave Tracking
  • Curvature Analysis
  • Conduction Velocity
  • Restitution Analysis
  • Spiral Formation
  • Spiral Breakup
  • Tip Tracking
  • Front Wave Tracking
  • Curvature Analysis
  • Conduction Velocity
  • Restitution Analysis
  • Spiral Formation
  • Spiral Breakup
  • Tip Tracking
  • Front Wave Tracking
  • Curvature Analysis
  • Conduction Velocity
  • Restitution Analysis
slide-4
SLIDE 4

Cardiac Models as Reaction Diffusion Systems

Membrane’s AP depends on:

  • Stimulus (voltage or current):

– External / Neighboring cells

  • Cell’s state

time voltage

S t i m u l u s failed initiation Threshold Resting potential

Schematic Action Potential

AP has nonlinear behavior!

  • Reaction diffusion system:

∂u ∂t = R(u) + ∇(D∇u)

Behavior In time Reaction Diffusion

slide-5
SLIDE 5

Cardiac Models

  • Minimal Model (Flavio-Cherry) (4 v) Human
  • Beeler-Reuter (8 v) Canine
  • Ten Tussher Panfilov (19 v) Human
  • Iyer (67 v) Human
slide-6
SLIDE 6

Available Technologies CPU based GPU based

The GPU devotes more transistors to data processing

This image is from CUDA programming guide

slide-7
SLIDE 7

GPU vs CPU

slide-8
SLIDE 8

GPU Architecture

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

Global Memory Texture Cache Constant Cache Each GPU consists of a Set of multiprocessors. MULTIPROCESSORS

slide-9
SLIDE 9

GPU Architecture

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

Global Memory Texture Cache Constant Cache Each Multiprocesssor can have 8/32 Stream Processors (SP) (called by NVIDIA also cores) which share access to local memory. MULTIPROCESSORS

slide-10
SLIDE 10

GPU Architecture

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

Global Memory Texture Cache Constant Cache MULTIPROCESSORS Each Stream Processor (core) contains a fused multiply-adder capable

  • f single precision
  • arithmetic. It is capable
  • f completing 3 floating

point operations per cycle - a fused MADD and a MUL.

slide-11
SLIDE 11

GPU Architecture

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

Global Memory Texture Cache Constant Cache MULTIPROCESSORS Each multiprocessor can contain one or more 64- bit fused multiple adder for double precision

  • perations.
slide-12
SLIDE 12

Memory Hierarchy

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

Global Memory Texture Cache Constant Cache The fastest available Memory for GPU computation is device

  • registers. Each

multiprocessor contains 16KB of registers. The registers are partitioned among the MP-resident threads MULTIPROCESSORS …

slide-13
SLIDE 13

Memory Hierarchy

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

Global Memory Texture Cache Constant Cache MULTIPROCESSORS … Shared memory (16KB) is primarily intended as a means to provide fast communication between threads of the executed by the same multiprocessor, although, due to its speed, it can also be used as a programmer controlled memory cache.

slide-14
SLIDE 14

Memory Hierarchy

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

Global Memory Texture Cache Constant Cache MULTIPROCESSORS … GPUs have also DRAM The latency is 150x is slower then registers and shared memory

slide-15
SLIDE 15

Memory Hierarchy

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

Global Memory Texture Cache Constant Cache MULTIPROCESSORS … Constant memory, as the name implies, is a read-only region which also has a small cache.

slide-16
SLIDE 16

Memory Hierarchy

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

Global Memory Texture Cache Constant Cache MULTIPROCESSORS … Texture memory is read-only with a small cache optimized for manipulation of textures. It also provides built-in linear interpolation of the data. also provides built-in linear interpolation of the data.

slide-17
SLIDE 17

Memory Hierarchy

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

SP SP SP SP SP SP SP SP

Registers

Shared Memory

Global Memory Texture Cache Constant Cache MULTIPROCESSORS … Global memory is available to all threads and is persistent between GPU calls.

slide-18
SLIDE 18

CUDA Programming Model

CPU Host Serial Code i Serial Code j Kernel Invocation GPU Device Block (0,0) Block (0,1) Block (1,0) Block (1,1) Block (2,0) Block (2,1) Block (3,0) Block (3,1) Block (4,0) Block (4,1) Block (5,0) Block (5,1) Grid k

A[ID]=ID

Single Instruction Multiple Threads (SIMT) similar to Single Instruction Multiple Data (SIMD)

THREAD ID 0 THREAD ID 1 THREAD ID 2 THREAD ID 3

A[ID]=ID A[ID]=ID A[ID]=ID if (ID%2) if (ID%2) if (ID%2) if (ID%2)

A vector

A[ID]+=2; A[ID]+=2; A[ID]+=2; A[ID]+=2; A[ID]*=2; A[ID]*=2; A[ID]*=2; A[ID]*=2; else else else else A[ID]=0; A[ID]=0; A[ID]=0; A[ID]=0; endif endif endif endif

A[ID]+=2; A[ID]+=2; A[ID]+=2; A[ID]+=2;

1 2 3 2 4 4 8 When branches occur in the code (e.g. due to if statements) the divergent threads will become inactive until the conforming threads complete their separate execution. 6 2 10 2 When execution merges, the threads can continue to operate in parallel. 1 2 3 4 8

slide-19
SLIDE 19

CUDA Programming Model

Global Memory GPU DEVICE GRID BLOCK THREAD BLOCK

SHARED MEMORY REGISTERS

The max number of threads for a thread block is 512 and it depends on the amount of registers that each thread may need. Each Thread block is executed by a multiprocessors Different threads are multiplexed and executed by the same core in order to reduce the latency of memory access.

slide-20
SLIDE 20

Tesla C1060 Fermi C2070

30 Multiprocessors 240 Cores Processor core clock: 1.296 GHz 933 Gigaflops (Single precision) 78 Gigaflops (Double Precision) Max Bandwidth(102 Gigabytes/sec) 4 GB of DRAM 14 Multiprocessors 448 Cores Processor core clock: 1.15 GHz 1030 Gigaflops (Single precision) 515 Gigaflops (Double precision) Max Bandwidth (144 GBytes/sec) 6 GB of DRAM Cost: 1000 $ Cost: 3200 $

slide-21
SLIDE 21

Reaction-Diffusion in CUDA

∂u ∂t = R(u) + ∇(D∇u)

For each time step, a set of (ODEs) and Partial Differential Equations (PDEs) must be solved. for (timestep=1; timestep < nend; i++){ solveODEs <<grid, block>> (….); calcLaplacian <<grid, block>> (….); } Solving ODEs using different method depending on the model:

  • Runge-Kutta 4th order
  • Forward Euclid
  • Backward Euclid
  • Semi-Implicit

…… Solving PDEs (Calc the Laplacian)

∇(D∇u)i, j = Ddt dx2 ui−1, j + ui, j−1 + ui+1, j + ui, j+1 − 4ui, j

( )

slide-22
SLIDE 22

Optimize the Reaction Term in CUDA

Heaviside Simplification

  • In order to avoid divergent branches among threads we

substitute if then else condition of Heaviside functions in this way: Precomputing Lookup Tables using Texture

  • We build tables where we precompute nonlinear part of

the ODEs, we bind these table to textures and we exploit the built-in linear interpolation feature of the texture.

If (x > a){ y = b; }else{ y = c; } y =b + (x > a)*(b_c) a, b, b_c (b-c) are constant

slide-23
SLIDE 23

Optimize the Reaction Term in CUDA

Kernel splitting:

  • For complex models, we need to split the ODEs solving

in many Kernels in order to have enough registers for thread to perform our calculation. Use –use_fast_math compiler option to substitute

  • In the models that use log, exp, sqrt, functions we

substitute them with the GPU built-in functions. Using Costant memory for common parameters (dt, dx, ..)

slide-24
SLIDE 24

Optimize the Diffusion Term in CUDA

Solving PDEs (Calc the Laplacian)

∇(D∇u)i, j = Ddt dx2 ui−1, j + ui, j−1 + ui+1, j + ui, j+1 − 4ui, j

( )

Each location is a float (4 bytes) The global memory latency is very slow. The memory is accessed in multiples of 64 bytes Using texture we can reduce the latency In texture the data is cached (optimize for 2D Locality) Drawback: It supports only single precision

slide-25
SLIDE 25

Optimize the Diffusion Term in CUDA

∇(D∇u)i, j = Ddt dx2 ui−1, j + ui, j−1 + ui+1, j + ui, j+1 − 4ui, j

( )

Drawback: The number of threads is greater than the number of elements Another Technique is using SHARED MEMORY THREAD BLOCK

SHARED MEMORY REGISTERS

Step 1 Step 2 Step 3 The yellow and red threads read the location from the global memory into the shared memory. The red threads calculates the laplacian using the values in the shared memory. This technique supports single and double precision SYNCH

slide-26
SLIDE 26

Case Study 1: Minimal Model (4V)

slide-27
SLIDE 27

Perfomances

CPU Computation GPU Computation

slide-28
SLIDE 28

Naïve Implementation

5.4 x 512x512 2048x2048 82.92 x

slide-29
SLIDE 29

Reaction optimized (Diffusion with Shared Memory)

520x520 2074x2074 1.95 x 27.71 x

slide-30
SLIDE 30

Reaction optimized (Diffusion with Texture)

512x512 1.7x 2048x2048 22.46x

slide-31
SLIDE 31

Beeler-Reuter Model (8V)

slide-32
SLIDE 32

2074x2074 520x520

Reaction optimized (Diffusion with Shared Memory)

12.87x 165.46x

slide-33
SLIDE 33

Reaction optimized (Diffusion with Texture)

11.34x 125.62x

slide-34
SLIDE 34

Ten Tusscher Panfilov Model (19V)

slide-35
SLIDE 35

Reaction optimized (Diffusion with Shared Memory)

520x520 2074x2074 47x 815x

slide-36
SLIDE 36

Reaction optimized (Diffusion with Texture)

2048x2048 36x 515x 512x512

slide-37
SLIDE 37

Double vs Single

After 10 minutes

  • f simulation:

Nai

slide-38
SLIDE 38

Double vs Single

After 10 minutes

  • f simulation:

Ki

slide-39
SLIDE 39

Double vs Single

After 10 minutes of simulation:

slide-40
SLIDE 40

Work in Progress Iyer Model (67V)

slide-41
SLIDE 41

Work in Progress 3D Models

slide-42
SLIDE 42

Conclusions

  • Many other challenge problems of CMACS expedition

can take advantage of GPU technologies.

  • The curve of developing of these technologies seems

very promising for the future years.

  • We are definitely interesting to collaborate with the
  • ther teams of the CMACS expedition in order to

develop new revolutionary highly scalable GPU-based analysis tools for complex systems.

slide-43
SLIDE 43

Thank you