Red Blood Cell Simulations with Chemical Transport Properties

SLIDE 1

Red Blood Cell Simulations with Chemical Transport Properties

Ansel L. Blumers

Karniadakis Group, Brown University, Rhode Island, USA
San José || GTC 2017 || May, 2017

SLIDE 2

Ansel L. Blumers || ansel_blumers@brown.edu || GTC 2017 || May, 2017

Scientific Inquiries

Aim to investigate chemical-driven plaque and thrombus formation.

Model chemical transport:
  released from red and white blood cells to plasma,
  sieved through the vessel wall to surrounding tissue.

Model red blood cells:
  red blood cell dynamics in the blood.

SLIDE 3

Dissipative Particle Dynamics

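$$\mathbf{F}_i = \sum_{i \neq j} \left( \mathbf{F}^C_{ij} + \mathbf{F}^D_{ij} + \mathbf{F}^R_{ij} \right)$$

$$\mathbf{F}^C_{ij} = \alpha_{ij}\,\omega_C(r_{ij})\,\mathbf{e}_{ij} \quad \text{(conservative)}$$

$$\mathbf{F}^D_{ij} = -\gamma_{ij}\,\omega_D(r_{ij})\,(\mathbf{e}_{ij} \cdot \mathbf{v}_{ij})\,\mathbf{e}_{ij} \quad \text{(dissipative)}$$

$$\mathbf{F}^R_{ij} = \sigma_{ij}\,\omega_R(r_{ij})\,\xi_{ij}\,\delta t^{-1/2}\,\mathbf{e}_{ij} \quad \text{(random)}$$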

A coarse-grained particle method for mesoscopic simulations.

Pairwise Force Interaction

Groot, R.D., Warren, P.B., The Journal of Chemical Physics, 1997
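The pairwise force above can be illustrated with a minimal Python sketch for a single pair of particles. The slide does not state the weight functions or parameter values, so the sketch assumes the common Groot-Warren choices: ω_C = 1 − r/r_c, ω_D = ω_R², and σ² = 2γk_BT. The function name `dpd_pair_force` and the default values of `a`, `gamma`, `kBT`, `rc` and `dt` are illustrative, not taken from USERMESO.

```python
import math
import random


def dpd_pair_force(ri, rj, vi, vj, a=25.0, gamma=4.5, kBT=1.0,
                   rc=1.0, dt=0.01, rng=random):
    """Force on particle i due to particle j (3D tuples in, 3D tuple out).

    Assumed weights: w_C = 1 - r/rc, w_R = w_C, w_D = w_R**2.
    Assumed fluctuation-dissipation relation: sigma**2 = 2*gamma*kBT.
    """
    rij = [x - y for x, y in zip(ri, rj)]
    r = math.sqrt(sum(c * c for c in rij))
    if r >= rc or r == 0.0:
        return (0.0, 0.0, 0.0)          # outside the cutoff: no interaction

    e = [c / r for c in rij]             # unit vector e_ij
    vij = [x - y for x, y in zip(vi, vj)]

    w_c = 1.0 - r / rc
    w_r = w_c
    w_d = w_r * w_r
    sigma = math.sqrt(2.0 * gamma * kBT)
    xi = rng.gauss(0.0, 1.0)             # symmetric random number, xi_ij = xi_ji

    f_c = a * w_c                                             # conservative
    f_d = -gamma * w_d * sum(x * y for x, y in zip(e, vij))   # dissipative
    f_r = sigma * w_r * xi / math.sqrt(dt)                    # random

    mag = f_c + f_d + f_r
    return tuple(mag * c for c in e)
```

All three contributions act along e_ij, so swapping i and j (with the same random number) flips the sign of the force and momentum is conserved pair by pair.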

SLIDE 4

Transport Properties

[Equation labels: Fickian flux, random flux, source / external]

Pairwise Chemical Transport

Li, Z., Yazdani, A., Tartakovsky, A., Karniadakis, G.E., The Journal of Chemical Physics, 2015

Solves the Advection-Diffusion-Reaction equation.
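The slide's transport equation did not survive extraction, so the following is only a sketch of the Fickian part of a pairwise concentration exchange. It assumes a flux of the form q_ij = −κ·ω(r_ij)·(C_i − C_j) with weight ω = (1 − r/r_c)², in the spirit of the cited Li et al. paper; the random and source terms are omitted. The function name `fickian_step`, the 1D positions, and all parameter values are hypothetical.

```python
def fickian_step(conc, pos, kappa=1.0, rc=1.0, dt=0.01):
    """Advance particle concentrations by one explicit step of pairwise
    Fickian exchange (assumed form, 1D positions for brevity)."""
    n = len(conc)
    rate = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            r = abs(pos[i] - pos[j])
            if r >= rc:
                continue                      # no exchange beyond the cutoff
            w = (1.0 - r / rc) ** 2           # assumed weight function
            q = -kappa * w * (conc[i] - conc[j])
            rate[i] += q                      # what i loses ...
            rate[j] -= q                      # ... j gains
    return [c + dt * q for c, q in zip(conc, rate)]
```

Because each pair exchanges equal and opposite amounts, the total concentration carried by the particles is unchanged by the step, while differences between neighbors decay.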

SLIDE 5

Modeling Red Blood Cell (RBC)

Discretized with 500 vertices and held together by 3 types of bonded potentials.

Bending rigidity
Visco-elastic + hydrostatic-elastic
Global area and volume constraints + local area constraint

Fedosov, D.A., Caswell, B., Karniadakis, G.E., Biophysical Journal, 2010

SLIDE 6

Need a Fast & Robust Program

Fusion

SLIDE 7

USERMESO 2.0

[Diagram: USERMESO + RBC, Concen. & ADR = USERMESO 2.0]

Injects new capabilities into USERMESO*, combining Open MPI, OpenMP and CUDA.

Simultaneously simulates:
  chemical concentration field,
  advection-diffusion-reaction processes,
  red blood cell dynamics.

* S4518, GTC 2014, Y.-H. Tang, Accelerating Dissipative Particle Dynamics Simulation on Kepler: Algorithm, Numerics and Application
USERMESO 2.0: Blumers, A., Tang, Y.-H., Li, Z., Li, X., Karniadakis, G.E., Computer Physics Communications, 2017
USERMESO: Tang, Y.-H., Karniadakis, G.E., Computer Physics Communications, 2014
https://github.com/AnselGitAccount

SLIDE 8

Finding Neighbors

Particles are binned into cells. Neighbors are found by calculating the relative distance to other particles in adjacent cells.

[Figure: 3 x 3 cells, with N particles in the Center Cell and M particles in the Neighboring Cells, giving N x M predicates]

Strategy #1: Each warp takes on particles in Center Cell.
Strategy #2: Each warp takes on particles in Neighboring Cells.
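The binning idea can be sketched serially in Python. This is a 2D, non-periodic illustration with cell size equal to the cutoff, so each particle only needs to be tested against particles in its own cell and the eight surrounding ones. The names `build_cells` and `find_neighbors` are hypothetical; the warp-level work distribution of the two strategies is not modeled here.

```python
import math
from collections import defaultdict
from itertools import product


def build_cells(positions, rc):
    """Bin particle indices into square cells of edge length rc."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        cells[(math.floor(x / rc), math.floor(y / rc))].append(idx)
    return cells


def find_neighbors(positions, rc):
    """Return {particle: [neighbors within rc]} using a 3 x 3 cell stencil."""
    cells = build_cells(positions, rc)
    neighbors = {i: [] for i in range(len(positions))}
    for (cx, cy), center in cells.items():
        # Gather the candidates from the center cell and its 8 neighbors.
        candidates = []
        for dx, dy in product((-1, 0, 1), repeat=2):
            candidates.extend(cells.get((cx + dx, cy + dy), ()))
        # N center particles x M candidates: one distance predicate each.
        for i in center:
            xi, yi = positions[i]
            for j in candidates:
                if i == j:
                    continue
                xj, yj = positions[j]
                if (xi - xj) ** 2 + (yi - yj) ** 2 < rc * rc:
                    neighbors[i].append(j)
    return neighbors
```

The double loop at the end is the N x M predicate evaluation from the slide; on the GPU the two strategies differ in which of the two loops is spread across the lanes of a warp.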

SLIDE 9

Neighbor Kernel Optimization

S4518, GTC 2014, Y.-H. Tang, Accelerating Dissipative Particle Dynamics Simulation on Kepler: Algorithm, Numerics and Application

Optimize: Save particle IDs to a neighbor list in order.

Solution: Atomics-free and parallel committing with warp-level dependency.

1. Use __ballot(int), which returns a bit mask called ballot.
2. Use __popc(int), which returns the number of bits set.
3. Return value of __popc(int) is broadcast to each thread and further masked by a lane-specific mask.
4. Place ID accordingly.
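The ballot-and-popcount idea can be emulated in plain Python to show how each lane finds its own slot without atomics. This follows the standard warp-level compaction idiom, in which each lane counts the set ballot bits below its own lane index; it is an illustration under that assumption, not the USERMESO kernel itself. The helper names `ballot`, `popc` and `warp_compact` are hypothetical stand-ins for the CUDA intrinsics and the kernel logic.

```python
WARP_SIZE = 32


def ballot(predicates):
    """Emulate a warp vote: bit `lane` is set if that lane's predicate holds."""
    mask = 0
    for lane, p in enumerate(predicates):
        if p:
            mask |= 1 << lane
    return mask


def popc(x):
    """Emulate population count: number of bits set in x."""
    return bin(x).count("1")


def warp_compact(ids, predicates, out, base):
    """Write ids whose predicate holds into out[base:], preserving lane order.

    Returns the new end of the list (base + number of IDs written).
    """
    b = ballot(predicates)                 # same value seen by every lane
    for lane in range(len(ids)):           # on a GPU these run in parallel
        if predicates[lane]:
            lane_mask_lt = (1 << lane) - 1     # bits of all lower lanes
            offset = popc(b & lane_mask_lt)    # passing lanes before this one
            out[base + offset] = ids[lane]
    return base + popc(b)
```

Each lane's write position depends only on the shared ballot and its own lane index, so no two lanes collide and the IDs land in lane order. In recent CUDA versions the vote intrinsic is spelled `__ballot_sync` and takes an explicit participation mask.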

SLIDE 10

Data Layout – Chemical Concentration

Hardware: K20X

Each particle carries an arbitrary number of chemical species. Flux is calculated using particle location and chemical concentration.

                          Global Load   2D Texture
Stall – Data Request      ~23%          ~44%
Texture Cache Hit Rate    ~52%          ~36%
Kernel Duration           33.7 ms       43.7 ms

Culprit: texture cache depletion.
Data locality in coordinates is optimized for texture cache hit rate. The concentration texture depletes the cache and thus disrupts the data locality optimization. The overall texture cache hit rate is therefore reduced significantly.

2D Texture Implementation: Each layer holds the concentration of all particles for one species.

Kernel duration is about 30% longer with the 2D texture (43.7 ms vs. 33.7 ms).

SLIDE 11

Parallel GPU/MPI Synchronization

Domain decomposition: The domain is broken up into sub-domains. Each sub-domain is computed by one CPU-GPU pair.

Steps for RBC computation:
1) Compute total area and volume of local RBCs.
2) All-reduce across all nodes.
3) Enforce area and volume constraints for each RBC.

Complication: Domain decomposition causes large synchronization overhead.

[Timeline figure: Naive approach]
K-Gather: Compute total area and volume of each RBC.
K-Apply: Enforce area and volume constraints.
Prior: Processes prior to RBC computation.
Subsequent: Processes subsequent to RBC computation.
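The three steps for RBC computation can be sketched in Python for a cell whose triangles are split across sub-domains. Each "rank" here is just a list of triangles, and the all-reduce is emulated by summing dictionaries; a real run would use MPI. The function names and the relative-deviation helper are hypothetical, and the actual constraint forces of the RBC model are not reproduced.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def k_gather(local_triangles):
    """Step 1: partial (area, volume) per RBC from the triangles this rank owns.

    local_triangles: list of (rbc_id, v0, v1, v2), outward-oriented faces.
    """
    totals = {}
    for rbc_id, v0, v1, v2 in local_triangles:
        n = _cross(_sub(v1, v0), _sub(v2, v0))
        area = 0.5 * math.sqrt(_dot(n, n))
        volume = _dot(v0, _cross(v1, v2)) / 6.0   # signed tetrahedron volume
        a, v = totals.get(rbc_id, (0.0, 0.0))
        totals[rbc_id] = (a + area, v + volume)
    return totals


def all_reduce(per_rank_totals):
    """Step 2 (emulated): sum the partial totals of every rank."""
    merged = {}
    for totals in per_rank_totals:
        for rbc_id, (a, v) in totals.items():
            ma, mv = merged.get(rbc_id, (0.0, 0.0))
            merged[rbc_id] = (ma + a, mv + v)
    return merged


def relative_deviation(total, target):
    """Input to step 3: how far a global quantity is from its target value."""
    return (total - target) / target
```

Because an RBC can straddle several sub-domains, no single rank can evaluate its global area or volume alone, which is why step 2 sits between the two kernels and creates the synchronization point.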

SLIDE 12

Attempted Optimization #1: Multi-stream

Strategy: Overlapping K-Gather and K-Apply. Uses the total area and volume from the previous time step.

Complication: Streaming multiprocessor saturation, with kernels competing for computing resources. The resulting higher cache refresh rate is detrimental to efficiency.

Consequence: Worse performance.

SLIDE 13

Attempted Optimization #2: Multi-stream + Non-blocking

Strategy: Overlapping data transfer and computation + utilizing non-blocking communication.

Algorithm:
1) Wait for the completion of MPI-Iallreduce from last time step.
2) Upload data to device with asynchronous Memcpy-HtD.
3) Compute the total area and volume of each RBC in K-Gather.
4) Place asynchronous Memcpy-DtH in execution queue.
5) Download data to host with asynchronous Memcpy-DtH.
6) Enforce the area and volume constraints in K-Apply.
7) Wait for the completion of Memcpy-DtH.
8) Sum total area and volume with MPI-Iallreduce.
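The key property of this scheme is that the constraints at one time step are enforced with totals reduced during the previous step, so the reduction never blocks the current step. A minimal Python emulation of that ordering is shown here; a background thread stands in for the non-blocking all-reduce, and device transfers and kernels are not modeled. The function name `simulate` and its inputs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(partials_by_step, initial_total):
    """Return the global total that each time step's K-Apply would use.

    partials_by_step[t] holds the per-rank partial sums computed at step t.
    """
    used = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        total = initial_total
        for partials in partials_by_step:
            if pending is not None:
                total = pending.result()      # wait for last step's reduction
            used.append(total)                # K-Apply uses the lagged total
            # Start this step's reduction and continue without waiting for it.
            pending = pool.submit(sum, partials)
    return used
```

For example, `simulate([[1, 2], [3, 4], [5, 6]], 0)` yields the totals 0, 3 and 7: every step consumes the sum produced one step earlier, and the final reduction is never consumed.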

SLIDE 14

Benchmark – Weak & Strong Scalings

[Plots: Strong, Weak]

Metrics: (million particles) × (steps per second) → MPS/second

For example:
Hct 7% for a system volume of 32,768 translates to 24 RBCs and 131,072 pure fluid particles.
Hct 35% for a system volume of 32,768 translates to 123 RBCs and 131,072 pure fluid particles.

Global volume: 2,097,152
Local volume (per node): 32,768
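The particle counts and the throughput metric can be checked with a few lines of Python. The sketch assumes 500 particles per RBC (the vertex count given for the RBC model) and a solvent count of four particles per unit volume, which is inferred from the quoted numbers rather than stated on the slide. The function names and any steps-per-second value passed in are hypothetical.

```python
PARTICLES_PER_RBC = 500      # vertex count of the RBC model
SOLVENT_PER_VOLUME = 4       # inferred: 131,072 solvent particles / 32,768 volume


def solvent_particles(system_volume):
    """Number of pure fluid particles for a given system volume."""
    return SOLVENT_PER_VOLUME * system_volume


def total_particles(system_volume, rbc_count):
    """Solvent particles plus RBC vertices."""
    return solvent_particles(system_volume) + PARTICLES_PER_RBC * rbc_count


def mps(particle_count, steps_per_second):
    """Throughput metric: (million particles) x (steps per second)."""
    return particle_count / 1e6 * steps_per_second
```

With a system volume of 32,768, 24 RBCs give 143,072 particles in total and 123 RBCs give 192,572, matching the totals quoted in the single-node benchmark.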

SLIDE 15

Benchmark - Speedup

Speedup = Time (Blocking impl.) / Time (Non-blocking impl.)

Hardware: Titan, 8 Opteron 6274 clusters + 1 K20X / Cray XK7 node.
Runtime: 1 rank with 8 OpenMP threads per node.

[Plots: Strong, Weak]

SLIDE 16

Benchmark – Single Node

For example: Hct 7% for a system volume of 8,192 translates to 6 RBCs and 32,768 pure fluid particles.

Speedup = Time (CPU only) / Time (CPU-GPU hybrid)

Hct   System Volume   Solvent Particles   RBC Count   Total Particle Count   Speedup
7%    8,192           32,768              6           35,768                 3.8
7%    16,384          65,536              12          71,536                 5.1
7%    32,768          131,072             24          143,072                5.4
7%    65,536          262,144             49          286,644                5.7
35%   8,192           32,768              30          47,768                 4.5
35%   16,384          65,536              61          96,036                 5.3
35%   32,768          131,072             123         192,572                5.9
35%   65,536          262,144             246         385,144                6.7

SLIDE 17

Bonus Benchmark

System volume   Total particle count   TITAN X Maxwell speedup   TITAN X Pascal speedup
8,192           47,768                 4.8                       6.5
16,384          96,036                 5.8                       9.2
32,768          192,572                7.2                       9.9
65,536          385,144                7.2                       10.1

Hardware: Two Intel Xeon E5-2630L CPUs at 2.0 GHz, GeForce TITAN X Maxwell or TITAN X Pascal.
Runtime: 1 rank with 8 OpenMP threads.

SLIDE 18

~12x speedup over the CPU-only version

720,778 particles, of which 5% represent RBCs at 500 particles per RBC

SLIDE 19

Summary

USERMESO 2.0 offers a GPU-accelerated ability to model advection-diffusion-reaction of chemicals and RBC dynamics.

Multi-stream scheduling and non-blocking communication are employed to maximize GPU-CPU concurrency.

Compared with its CPU counterpart, USERMESO 2.0 achieves up to a 10.1 times speedup on one GPU over 16 cores in a single node.

It achieves a weak scaling efficiency of 91% across 256 nodes and almost linear strong scaling.

SLIDE 20

Thank you for listening!

Acknowledgement: This work is supported by NIH and the US Army Research Laboratory. The benchmarks were performed on TITAN at Oak Ridge National Laboratory through the Innovative and Novel Computational Impact on Theory and Experiment program under project BIP118. A.L.B. would like to acknowledge Wayne Joubert (ORNL) for his effort to coordinate machine reservations. We would also like to acknowledge the support of NVIDIA Corporation with the donation of a TITAN X Pascal GPU.
