GPU Accelerated Virtual Cell Biology and SIMD Enhanced High Throughput Computational Biology



slide-1
SLIDE 1

GPU Accelerated Virtual Cell Biology and SIMD Enhanced High Throughput Computational Biology

Narayan Ganesan, Assistant Professor, Department of Electrical and Computer Engineering
Hanyu Jiang, Ph.D. student and research assistant, Department of Electrical and Computer Engineering

slide-2
SLIDE 2

PART I: Agent Based Virtual Cell Biology

slide-3
SLIDE 3

Advantages of Process Simulation in 3D Space

  • Serves as a computational microscope into the behavior of the cell.
  • Helps observe behavioral patterns, such as the expected time to DNA transcription and the induced variance in protein translation and decay.
  • Helps model and study noise in biological processes.
  • Helps study crosstalk between several large and complex pathways.

slide-4
SLIDE 4

Computational Tasks Involved

  • Each particle maintains its own identity and attributes.
  • The list of interactions between the different chemical species is specified at the start of the simulation.
  • The particles are allowed to diffuse independently, at the given rate of diffusion and with the given variance in their velocities (random walk in 3D space).
  • When two particles that can react with each other come within the vicinity of each other (radius of reaction), the scheduler schedules a reaction between them and marks the particles as inactive.
  • The reaction is executed, wherein new particles (products of the reaction) are added to the system, along with their new identity and attributes.
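
The tasks listed on this slide can be illustrated with a small sequential sketch. It is not the GPU implementation from the talk: the names `Particle`, `diffuse` and `react_step` are hypothetical, the Gaussian step and the all-pairs search are simplifying assumptions, and placing the product at the midpoint of the two reactants is an arbitrary choice made only for illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Particle:
    species: str          # identity of the particle
    pos: tuple            # (x, y, z) position
    active: bool = True   # set to False once a reaction is scheduled


def diffuse(p, sigma, rng):
    """One random-walk step: independent Gaussian displacement per axis (assumed model)."""
    p.pos = tuple(c + rng.gauss(0.0, sigma) for c in p.pos)


def react_step(particles, reactions, radius):
    """Schedule and execute reactions between particles within the reaction radius.

    reactions maps frozenset({A, B}) to the product species (hypothetical format).
    """
    products = []
    for i, a in enumerate(particles):
        for b in particles[i + 1:]:
            if not (a.active and b.active):
                continue
            product = reactions.get(frozenset((a.species, b.species)))
            if product and math.dist(a.pos, b.pos) <= radius:
                a.active = b.active = False            # mark reactants as inactive
                mid = tuple((x + y) / 2 for x, y in zip(a.pos, b.pos))
                products.append(Particle(product, mid))  # add the product particle
    return [p for p in particles if p.active] + products


rng = random.Random(0)
cell = [Particle("R", (0.0, 0.0, 0.0)),
        Particle("JAK", (0.3, 0.0, 0.4)),
        Particle("IFN", (5.0, 5.0, 5.0))]
cell = react_step(cell, {frozenset(("R", "JAK")): "RJ"}, radius=1.0)
for p in cell:
    diffuse(p, sigma=0.1, rng=rng)
```

In a real run the diffusion and reaction steps would alternate for the requested simulation time, and the neighbor search would use a spatial data structure rather than all pairs.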

slide-5
SLIDE 5

Computational Challenge: Parallel Selection

  • Each thread is assigned to a particle along with its identity and attributes.
  • Thus each particle is an independent and autonomous agent within the 3D space.
  • The feasible list returns the set of neighboring particles that a particle can react with, based on all the reactions within the system.

[Figure: example reactions.] The generated feasible list includes the particles feasible for all reactions.
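
As a rough illustration of feasible-list generation, the sketch below filters a particle's neighbors down to those it can react with under any reaction in the system. The function name `feasible_list` and the data layout (a `species` list, a `neighbor_list` of index lists, and reactions stored as a set of `frozenset` species pairs) are assumptions for this example, not the format used in the talk.

```python
def feasible_list(i, species, neighbor_list, reactions):
    """Return the neighbors of particle i whose species can react with species[i].

    reactions is a set of frozenset({A, B}) reactant pairs (hypothetical format).
    """
    return [j for j in neighbor_list[i]
            if frozenset((species[i], species[j])) in reactions]


species = ["R", "JAK", "IFN", "RJ"]
neighbor_list = [[1, 2, 3], [0, 2], [0, 1, 3], [0, 2]]
reactions = {frozenset(("R", "JAK")), frozenset(("IFN", "RJ"))}

print(feasible_list(0, species, neighbor_list, reactions))
print(feasible_list(2, species, neighbor_list, reactions))
```

On a GPU, each thread would run this filter for its own particle.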

slide-6
SLIDE 6

Computational Challenge: Parallel Selection

[Figure: a computational workgroup of threads. Panels: Feasible List Generation; Inconsistent reaction selection; Consistent reaction selection.]

slide-7
SLIDE 7

Algorithm For Consistent Parallel Selection

1) Build the feasibleList for each particle, which is a subset of neighborList and contains the set of particles capable of reacting with the current particle.
2) Sort the feasibleList according to the Euclidean metric in order to set the reaction priority.
3) Each particle selects the first available particle in its sorted feasibleList for reaction.
4) If the selection is mutual, schedule the corresponding reaction in the reaction pipeline and mark the particle as not available for any more selections; otherwise, mark the particle as still available for selection by other particles.

Perform steps 3) and 4) until converged or until no more available particles remain in the feasibleList. The algorithm converges within 6 iterations.
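
The selection steps can be sketched sequentially as follows. This is a minimal illustration under stated assumptions, not the authors' kernel: the function name `consistent_selection`, the dictionary-based inputs and the `max_iters` cap are hypothetical, feasibility is assumed to be symmetric, and ties in distance are broken by particle id.

```python
import math


def consistent_selection(pos, feasible, max_iters=32):
    """Return mutually agreed reaction pairs.

    pos:      {particle id: (x, y, z)}
    feasible: {particle id: [ids it can react with]} (step 1, assumed symmetric)
    """
    # Step 2: sort each feasible list by Euclidean distance, closest first.
    ranked = {i: sorted(f, key=lambda j: (math.dist(pos[i], pos[j]), j))
              for i, f in feasible.items()}
    available = set(feasible)
    pairs = []
    for _ in range(max_iters):
        # Step 3: every available particle picks its first available candidate.
        choice = {}
        for i in available:
            for j in ranked[i]:
                if j in available:
                    choice[i] = j
                    break
        # Step 4: only mutual selections are scheduled; the rest stay available.
        matched = [(i, j) for i, j in choice.items()
                   if i < j and choice.get(j) == i]
        if not matched:
            break  # converged: no further mutual selections
        for i, j in matched:
            pairs.append((i, j))
            available -= {i, j}
    return sorted(pairs)


pos = {0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0), 2: (1.5, 0.0, 0.0), 3: (4.0, 0.0, 0.0)}
feasible = {i: [j for j in pos if j != i] for i in pos}
print(consistent_selection(pos, feasible))
```

In this example particles 1 and 2 select each other in the first round, while 0 and 3 both lose their first choice and pair up in the second round, so no particle ends up in two reactions.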

slide-8
SLIDE 8

Example: JAK-STAT Signaling Mechanism

1. JAK binds to the IFN-γ receptor and forms the IFNR-JAK complex (RJ).
2. IFN-γ binds to the extracellular domain of the RJ complex and forms the IFNRJ complex.
3. Dimerization of IFNRJ leads to the formation of IFNRJ2.
4. IFNRJ2 is phosphorylated and IFNRJ2* is formed.
5. STAT1c binds to IFNRJ2* and is phosphorylated (STAT1c*).
6. Phosphorylated STAT1c (STAT1c*) forms a homo-dimer (STAT1c*-STAT1c*).
7. The homo-dimer (STAT1c*-STAT1c*) is translocated to the nucleus (STAT1n*-STAT1n*).
8. STAT1n*-STAT1n* works as a transcription factor.
9. SOCS1 is induced by the JAK/STAT pathway.
10. SOCS1 binds to the activated receptor (IFNRJ2*) and inhibits its activity.

slide-9
SLIDE 9

ODEs for JAK-STAT Signaling Pathway

slide-10
SLIDE 10

Computing Framework – Input Config. File

#----------------------------------------------------------------
Regions
# Regionid, x_orig, y_orig, z_orig, x_length, y_length, z_length
#----------------------------------------------------------------
4 0.0 0.0 57.0 60.0 60.0 3.0 #extracellular medium
3 0.0 0.0 54.0 60.0 60.0 3.0 #cellplasma membrane
2 0.0 0.0 8.0 60.0 60.0 46.0 #cytoplasm
1 0.0 0.0 5.0 60.0 60.0 3.0 #nuclear membrane
0 0.0 0.0 0.0 60.0 60.0 5.0 #nucleus
# all concentrations are in nM/L. 1nM/L = 602.3*VOL*conc parts in Cell, VOL = 3.3 ncc.
#----------------------------------------
Reagents
# Reagent inertia, init_cond, region_id
#----------------------------------------
R, 0.5, 12.0
JAK, 0.5, 12.0
RJ, 0.5, 0.0
IFN, 0.5, 15.0
IFNRJ, 1.0, 0.0
IFNRJ2, 1.0, 0.0
IFNRJ2x, 1.0, 0.0
STAT1c, 1.0, 300.0
#-----------------------------------------------
Reactions
# reaction, forward_rate, reverse_rate
#-----------------------------------------------
IFNRJ2 = IFNRJ2x, 0.005, 0.0,
IFNRJ2x + STAT1c = IFNRJ2x-STAT1c, 1.0, 0.1,
IFNRJ2x-STAT1c = IFNRJ2x + STAT1cx, 0.4, 0.0,
IFNRJ2x + STAT1cx = IFNRJ2x-STAT1cx, 1.0, 0.1,
STAT1cx + STAT1cx = STAT1cx2, 1.0, 0.005,
….
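
To show how a line of the Reactions section might be read, here is a small parsing sketch. The function name `parse_reaction` and the returned tuple layout are hypothetical; the only things taken from the configuration are the comma-separated `reaction, forward_rate, reverse_rate` fields and the use of `+` and `=` inside the reaction equation.

```python
def parse_reaction(line):
    """Parse 'A + B = C, forward_rate, reverse_rate,' into its parts."""
    # Split on commas and drop the empty field left by the trailing comma.
    fields = [f.strip() for f in line.split(",") if f.strip()]
    equation, forward_rate, reverse_rate = fields[0], float(fields[1]), float(fields[2])
    lhs, rhs = equation.split("=")
    reactants = [s.strip() for s in lhs.split("+")]
    products = [s.strip() for s in rhs.split("+")]
    return reactants, products, forward_rate, reverse_rate


print(parse_reaction("IFNRJ2x + STAT1c = IFNRJ2x-STAT1c, 1.0, 0.1,"))
print(parse_reaction("IFNRJ2 = IFNRJ2x, 0.005, 0.0,"))
```

Because species names such as `IFNRJ2x-STAT1c` contain a hyphen, the sketch splits only on `+`, `=` and `,`.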

slide-11
SLIDE 11

Process Simulation Framework: Workflow

Input:
  • Configuration
  • Time to simulate

Process Simulation Framework

Output:
  • 3D trajectory and snapshots of particles within the biological cell
  • Particle concentrations

Sample Output:

Step   R       JAK     RJ    IFN     IFNRJ   IFNRJ2   IFNRJ2x   STAT1c
(1)    (2)     (3)     (4)   (5)     (6)     (7)      (8)       (9)
0      24140   24140   0     60350   0       0        0         603504
. . .
100    15913   15913   153   52276   584     3649     15        603372
. . .
200    8855    8855    69    45134   666     6928     28        602761
. . .
300    5963    5963    25    42198   768     7967     69        601335
. . .
400    4485    4485    25    40720   700     8351     72        599184
. . .

slide-12
SLIDE 12

GPU Enabled Virtual Cell Biology Simulation

The particle concentration is output as a function of time.

slide-13
SLIDE 13

Performance and Scalability

  • Strong linear scaling w.r.t. the number of agents
  • Weak scaling w.r.t. the number of processors

slide-14
SLIDE 14

PART II: SIMD Enhanced Protein Motif Detection

slide-15
SLIDE 15

Hidden Markov Model and hmmsearch of HMMER

Each Sample Path follows a set of predefined transition probabilities between the states

slide-16
SLIDE 16

HMM Model & Sequence Database

[Figure panels: HMM model; protein sequence database.]

slide-17
SLIDE 17

Dependencies and Computational Hotspot

  • Match states:

$$V_M(i, j) = \varepsilon(R_i, M_j) + \max\{\, V_M(i-1, j-1) + T_{MM}(j-1, j),\; V_I(i-1, j-1) + T_{IM}(j-1, j),\; V_D(i-1, j-1) + T_{DM}(j-1, j),\; B + T_{BM}(M_j) \,\}$$

  • Insert states:

$$V_I(i, j) = \max\{\, V_M(i-1, j) + T_{MI}(j, j),\; V_I(i-1, j) + T_{II}(j, j) \,\}$$

  • Delete states:

$$V_D(i, j) = \max\{\, V_M(i, j-1) + T_{MD}(j-1, j),\; V_D(i, j-1) + T_{DD}(j-1, j) \,\}$$
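
The match, insert and delete recurrences on this slide can be sketched as a plain row-by-row table fill. This is an illustrative sequential version, not the hmmsearch or GPU kernel: the function name `viterbi_fill`, the array names `VM`, `VI`, `VD`, the layout of the transition table `T` (one list per edge type, indexed by the destination state j) and the precomputed `entry[j]` standing for the begin-to-match term are all assumptions, and scores are treated as additive log-odds values.

```python
NEG_INF = float("-inf")


def viterbi_fill(seq, M, emit, T, entry):
    """Fill the match/insert/delete score tables for one sequence.

    emit[j][r]: match emission score of residue r in state j (1-based j)
    T[kind][j]: transition score for edge `kind` into or at state j
    entry[j]:   score of entering match state j from the begin state
    """
    N = len(seq)
    VM = [[NEG_INF] * (M + 1) for _ in range(N + 1)]
    VI = [[NEG_INF] * (M + 1) for _ in range(N + 1)]
    VD = [[NEG_INF] * (M + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):          # rows: residues of the sequence
        r = seq[i - 1]
        for j in range(1, M + 1):      # columns: model states
            # Match: previous row, previous column, or a fresh entry.
            VM[i][j] = emit[j][r] + max(VM[i - 1][j - 1] + T["MM"][j],
                                        VI[i - 1][j - 1] + T["IM"][j],
                                        VD[i - 1][j - 1] + T["DM"][j],
                                        entry[j])
            # Insert: previous row, same column.
            VI[i][j] = max(VM[i - 1][j] + T["MI"][j], VI[i - 1][j] + T["II"][j])
            # Delete: same row, previous column (the serial dependency).
            VD[i][j] = max(VM[i][j - 1] + T["MD"][j], VD[i][j - 1] + T["DD"][j])
    return VM, VI, VD


M = 2
emit = [None, {"A": 2, "B": -1}, {"A": -1, "B": 2}]
T = {k: [-1] * (M + 1) for k in ("MM", "IM", "DM", "MI", "II", "MD", "DD")}
entry = [0, 0, -2]
VM, VI, VD = viterbi_fill("AB", M, emit, T, entry)
print(VM[2][2], VI[2][1], VD[1][2])
```

The delete update reads a value from the same row, which is the dependency that makes the Viterbi row harder to parallelize than the match and insert updates.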
slide-18
SLIDE 18

What the computational kernel looks like…

[Figure: dynamic-programming matrix for one sequence. Labels: HMM states (M), one sequence (N), XE. Legend: Match, Insert, Delete.]

Computing the maximum probability that the sequence was generated by the model costs O(M×N).

  • MSV needs the Match score and XE of the previous row.
  • Viterbi needs the adjacent Delete score in the current row.
  • The dependence on XE imposes a row-major order of computation.
slide-19
SLIDE 19

Multi-tiered Parallel Framework for Acceleration

slide-20
SLIDE 20

Detail #1: Synchronize-free Execution

[Figure: Warp #1, Warp #2 and Warp #3 each take sequences from the sequence database until done.]

  • Each warp picks up one sequence.
  • Once done, the warp moves on to the next scheduled sequence automatically.
  • Eliminates the block-scoped __syncthreads() caused by:
      • intra-state dependencies of the HMM model
      • unbalanced sequence data
  • Keeps threads active.
  • High throughput.
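
The benefit of letting each warp move on without a block-wide barrier can be illustrated with a toy timing model. This is only a simulation under simplifying assumptions: the cost of a sequence is taken to be proportional to its length, and the function names `lockstep_time` and `sync_free_time` are hypothetical.

```python
import heapq


def lockstep_time(lengths, n_warps):
    """Warps process batches of n_warps sequences and wait at a barrier after each batch."""
    return sum(max(lengths[k:k + n_warps])
               for k in range(0, len(lengths), n_warps))


def sync_free_time(lengths, n_warps):
    """Each warp takes the next sequence as soon as it finishes its current one."""
    finish = [0] * n_warps              # min-heap of per-warp finish times
    for length in lengths:
        t = heapq.heappop(finish)       # the warp that becomes free first
        heapq.heappush(finish, t + length)
    return max(finish)


lengths = [10, 1, 1, 1, 1, 10]          # unbalanced sequence lengths
print(lockstep_time(lengths, n_warps=2), sync_free_time(lengths, n_warps=2))
```

With unbalanced lengths, the barrier version spends time waiting for the longest sequence in every batch, while the barrier-free version keeps both warps busy.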
slide-21
SLIDE 21

Detail #2: Striped Layout vs. Sequential Layout

  • Sequential Layout
      • Straightforward
      • Private data dependence across adjacent threads
      • More sequential overhead and thread idling
  • Striped Layout
      • Only one reordering request per DP row
      • All parallel execution
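
A small sketch can show why a striped layout needs only one reordering per DP row. It assumes the usual interleaved striping (position j goes to vector j mod Q, lane j div Q, where Q is the number of vectors per row), which the slide does not spell out; the function name `striped_row` and the use of `None` as padding are hypothetical.

```python
def striped_row(scores, lanes):
    """Arrange one DP row into Q vectors of `lanes` entries in striped order."""
    Q = -(-len(scores) // lanes)                      # ceil(len / lanes)
    padded = scores + [None] * (Q * lanes - len(scores))
    return [[padded[q + lane * Q] for lane in range(lanes)] for q in range(Q)]


row = list(range(8))                                  # model positions 0..7
vectors = striped_row(row, lanes=4)
print(vectors)

# The predecessor (j - 1) of every entry in vector q > 0 sits in vector q - 1, same lane.
assert all(a == b - 1 for a, b in zip(vectors[0], vectors[1]))

# Only vector 0 needs its predecessors from the last vector shifted by one lane.
shifted_last = [None] + vectors[-1][:-1]
print(shifted_last)
```

In a sequential layout, by contrast, every vector boundary would require data from the neighboring thread.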
slide-22
SLIDE 22

Detail #3: PTX assembly for Reordering

  • Reorder 128 scores within one warp:
      • Shifting
      • Exchange (intra-warp shuffle)
      • Merge
  • Ready to go next!
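
The shift, exchange and merge steps can be emulated on the host to see what they accomplish. The sketch assumes that the 128 scores are 8-bit values packed four to a 32-bit register across the 32 threads of a warp, and that the reordering is a shift by one score position; neither detail is stated on the slide, and the helper names `pack`, `unpack` and `shift_by_one` are hypothetical. The real kernel would use PTX instructions rather than Python integers.

```python
WARP_SIZE = 32


def pack(scores):
    """Pack 128 8-bit scores into 32 registers, lowest byte first (assumed layout)."""
    return [scores[4 * l] | (scores[4 * l + 1] << 8)
            | (scores[4 * l + 2] << 16) | (scores[4 * l + 3] << 24)
            for l in range(WARP_SIZE)]


def unpack(regs):
    """Inverse of pack: return the 128 scores in order."""
    return [(r >> (8 * b)) & 0xFF for r in regs for b in range(4)]


def shift_by_one(regs, fill=0):
    """Move every score one position up across the whole warp."""
    out = []
    for lane in range(WARP_SIZE):
        shifted = (regs[lane] << 8) & 0xFFFFFFFF             # Shifting
        carry = regs[lane - 1] >> 24 if lane > 0 else fill   # Exchange with the previous lane
        out.append(shifted | carry)                          # Merge
    return out


scores = list(range(128))
reordered = unpack(shift_by_one(pack(scores), fill=255))
print(reordered[:6])
```

Each lane only needs one byte from its neighbor, which is why a single intra-warp shuffle is enough for the exchange step.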
slide-23
SLIDE 23

Detail #4: PTX assembly for Max-Reduction

  • Max-reduction:
      • SIMD max
      • Intra-warp shuffle
      • Broadcast
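
A host-side emulation of a shuffle-based max-reduction may help make the three steps concrete. It assumes a shuffle-down pattern with halving offsets followed by a broadcast from lane 0, which is a common way to do this but is not confirmed by the slide; the function name `warp_max` is hypothetical, and the per-lane `max` stands in for a packed SIMD max instruction.

```python
def warp_max(values):
    """Reduce one value per lane to the warp-wide maximum and broadcast it.

    len(values) must be a power of two (32 for a warp).
    """
    vals = list(values)
    offset = len(vals) // 2
    while offset:
        # Each lane reads the value held `offset` lanes above it (shuffle down)
        # and keeps the larger of the two.
        vals = [max(v, vals[lane + offset]) if lane + offset < len(vals) else v
                for lane, v in enumerate(vals)]
        offset //= 2
    return [vals[0]] * len(vals)        # broadcast lane 0 to every lane


lane_scores = [(7 * i) % 32 for i in range(32)]
print(warp_max(lane_scores)[0])
```

After five shuffle steps lane 0 holds the maximum of all 32 lanes, and the broadcast makes it visible to every thread in the warp.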
slide-24
SLIDE 24

Benchmark Performance

  • Overusing shared memory hurts occupancy and overall performance.
  • The larger local-memory capacity available to each thread is good news.
  • Given constraints such as pipeline usage and available registers, more threads/warps may not result in further speedup. On the contrary, they may cause stalls and register spills.

slide-25
SLIDE 25

Benchmark Performance (cont.)

  • GCUPS = Giga Cell Updates Per Second
  • Larger model, better performance.
  • About 5x faster than highly optimized CPU implementations.
  • Complex algorithms bring intense register pressure and off-chip data transfer.
  • A lower hit ratio on the L1, L2 and read-only caches is the performance killer.
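
For reference, GCUPS is commonly computed as the number of dynamic-programming cells updated divided by the run time. The sketch below assumes that one cell corresponds to one (model position, residue) pair; the function name `gcups` and the example numbers are illustrative and are not measurements from the talk.

```python
def gcups(model_length, total_residues, seconds):
    """Giga cell updates per second for one model searched against a database."""
    cells = model_length * total_residues   # one DP cell per (state, residue) pair
    return cells / seconds / 1e9


# Hypothetical numbers: a 400-state model against 5 billion residues in 20 s.
print(gcups(400, 5_000_000_000, 20.0))
```

Because the metric normalizes by model length and database size, it allows runs with different models to be compared.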

slide-26
SLIDE 26

Acknowledgements

  • NVIDIA-Professor Partnership
  • Xilinx University Program (XUP)
  • Stevens Institute of Technology Start-up Foundation

Contact Information

  • Narayan Ganesan
    Email: nganesan@stevens.edu
  • Hanyu Jiang
    Email: hjiang5@stevens.edu