SLIDE 1

Scaling Challenges in NAMD: Past and Future

NAMD Team: Abhinav Bhatele, Eric Bohm, Sameer Kumar, David Kunzman, Chao Mei, Chee Wai Lee, Kumaresh P., James Phillips, Gengbin Zheng, Laxmikant Kale, Klaus Schulten

SLIDE 2

Outline

  • NAMD: An Introduction
  • Past Scaling Challenges

– Conflicting Adaptive Runtime Techniques
– PME Computation
– Memory Requirements

  • Performance Results
  • Comparison with other MD codes
  • Future Challenges:

– Load Balancing
– Parallel I/O
– Fine-grained Parallelization

SLIDE 3

What is NAMD?

  • A parallel molecular dynamics application
  • Simulates the life of a bio-molecule
  • How is the simulation performed?

– Simulation window broken down into a large number of time steps (typically 1 fs each)
– Forces on every atom calculated every time step
– Velocities and positions updated and atoms migrated to their new positions (sketched below)
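A rough illustration of the loop described above, in plain C++ (not NAMD's actual integrator; the atom type and the force routine are simplified placeholders):

```cpp
#include <vector>

struct Atom { double x[3], v[3], f[3], mass; };

// Placeholder for NAMD's bonded + non-bonded (+ PME) force evaluation.
void computeForces(std::vector<Atom>& atoms) {
    for (Atom& a : atoms) a.f[0] = a.f[1] = a.f[2] = 0.0;  // stub
}

void runSimulation(std::vector<Atom>& atoms, int numSteps, double dt /* ~1 fs */) {
    for (int step = 0; step < numSteps; ++step) {
        computeForces(atoms);                    // forces on every atom, every step
        for (Atom& a : atoms) {
            for (int d = 0; d < 3; ++d) {
                a.v[d] += dt * a.f[d] / a.mass;  // update velocities
                a.x[d] += dt * a.v[d];           // update positions
            }
        }
        // In NAMD, atoms that have left their patch's spatial region
        // migrate to the neighboring patch at this point in the step.
    }
}
```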

SLIDE 4

How is NAMD parallelized?

HYBRID DECOMPOSITION
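A minimal sketch of the hybrid idea: spatial decomposition of the box into patches, plus a separate set of compute objects, one per pair of neighboring patches, that do the force work. The types and the neighbor test are simplified stand-ins, not NAMD's data structures:

```cpp
#include <cstdlib>
#include <vector>

// Spatial decomposition: each patch owns the atoms in one region of the box.
struct Patch { int ix, iy, iz; std::vector<int> atomIds; };

// Force decomposition: one compute object per pair of neighboring patches,
// responsible for the non-bonded interactions between their atoms.
struct Compute { int patchA, patchB; };

static bool areNeighbors(const Patch& a, const Patch& b) {
    return std::abs(a.ix - b.ix) <= 1 && std::abs(a.iy - b.iy) <= 1 &&
           std::abs(a.iz - b.iz) <= 1;
}

// Computes (not patches) are the units the load balancer later places on
// processors, which is where the extra parallelism comes from.
std::vector<Compute> buildComputes(const std::vector<Patch>& patches) {
    std::vector<Compute> computes;
    for (std::size_t i = 0; i < patches.size(); ++i)
        for (std::size_t j = i; j < patches.size(); ++j)
            if (areNeighbors(patches[i], patches[j]))
                computes.push_back({static_cast<int>(i), static_cast<int>(j)});
    return computes;
}
```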

SLIDE 5

SLIDE 6

What makes NAMD efficient?

  • Charm++ runtime support

– Asynchronous message-driven model
– Adaptive overlap of communication and computation

SLIDE 7

[Diagram: flow of data between patch integration and the bonded and non-bonded computes via multicasts, point-to-point messages, reductions, and PME; time profile legend: non-bonded work, bonded work, integration, PME, communication]

SLIDE 8

What makes NAMD efficient?

  • Charm++ runtime support

– Asynchronous message-driven model
– Adaptive overlap of communication and computation

  • Load balancing support

– Difficult problem: balancing heterogeneous computation
– Measurement-based load balancing (see the sketch below)
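A minimal sketch of what "measurement-based" means in practice: times measured for each compute during earlier steps drive a greedy re-assignment, heaviest compute first, onto the currently least loaded processor. This illustrates the idea only; it is not Charm++'s load balancer interface, and real strategies also weigh communication and topology:

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// One migratable compute object and its execution time measured at runtime.
struct Work { int id; double measuredTime; };  // ids assumed to be 0..N-1

// Returns procOf[id] = processor assigned to compute `id`.
std::vector<int> rebalance(std::vector<Work> work, int numProcs) {
    // Heaviest computes first.
    std::sort(work.begin(), work.end(),
              [](const Work& a, const Work& b) { return a.measuredTime > b.measuredTime; });

    using Proc = std::pair<double, int>;  // (current load, processor id)
    std::priority_queue<Proc, std::vector<Proc>, std::greater<Proc>> leastLoaded;
    for (int p = 0; p < numProcs; ++p) leastLoaded.push({0.0, p});

    std::vector<int> procOf(work.size(), -1);
    for (const Work& w : work) {
        auto [load, p] = leastLoaded.top();   // least loaded processor so far
        leastLoaded.pop();
        procOf[w.id] = p;
        leastLoaded.push({load + w.measuredTime, p});
    }
    return procOf;
}
```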

SLIDE 9

What makes NAMD highly scalable?

  • Hybrid decomposition scheme
  • Variants of this hybrid scheme used by Blue Matter and Desmond

SLIDE 10

Scaling Challenges

  • Scaling few-thousand-atom simulations to tens of thousands of processors

– Interaction of adaptive runtime techniques
– Optimizing the PME implementation

  • Running multi-million-atom simulations on machines with limited memory

– Memory Optimizations

SLIDE 11

Conflicting Adaptive Runtime Techniques

  • Patches multicast data to computes
  • At each load balancing step, computes are re-assigned to processors

  • Spanning tree re-built after computes have migrated

SLIDE 12

SLIDE 13

SLIDE 14

  • Solution

– Persistent spanning trees (sketched below)
– Centralized spanning tree creation

  • Unifying the two techniques
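A simplified picture of the persistent spanning tree used for a patch's multicast: the destination processors are arranged into a shallow tree with a fixed branching factor, so the patch's processor sends only to its children and they forward onward. The branching factor and layout below are assumptions for illustration; in NAMD the tree is created centrally and kept until computes migrate at the next load balancing step:

```cpp
#include <vector>

// Spanning tree over the processors holding computes that need one patch's atoms.
// childrenOf[i] lists the tree children of destination i (dests[0] is the root).
struct SpanningTree {
    std::vector<int> dests;
    std::vector<std::vector<int>> childrenOf;
};

SpanningTree buildTree(const std::vector<int>& dests, int branchingFactor = 4) {
    SpanningTree t;
    t.dests = dests;
    t.childrenOf.resize(dests.size());
    for (std::size_t i = 0; i < dests.size(); ++i) {
        for (int c = 1; c <= branchingFactor; ++c) {
            std::size_t child = branchingFactor * i + c;
            if (child < dests.size()) t.childrenOf[i].push_back(dests[child]);
        }
    }
    return t;  // reused every step until the next load balancing phase
}
```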
SLIDE 15

PME Calculation

  • Particle Mesh Ewald (PME) method used for long-range interactions

– 1D decomposition of the FFT grid

  • PME is a small portion of the total computation

– 1D is better than a 2D decomposition for small numbers of processors

  • On larger partitions

– Use a 2D decomposition
– More parallelism and better overlap (see the sketch below)
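The difference between the two schemes is easiest to see in how grid points map to processors: a 1D (slab) decomposition hands out whole planes, so parallelism is capped at the number of planes, while a 2D (pencil) decomposition hands out lines of the grid, exposing far more parallelism and smaller messages that overlap better with the rest of the step. A simplified sketch (grid dimensions and mapping are illustrative, not NAMD's code):

```cpp
// FFT grid of gx * gy * gz points.
struct FftGrid { int gx, gy, gz; };

// 1D (slab) decomposition: a processor owns whole x-y planes,
// so at most gz processors can share the PME work.
int slabOwner(const FftGrid& g, int z, int numPes) {
    return z * numPes / g.gz;
}

// 2D (pencil) decomposition: a processor owns lines of grid points along x,
// so up to gy * gz processors can participate, with more but smaller
// transpose messages that overlap better with other computation.
int pencilOwner(const FftGrid& g, int y, int z, int numPes) {
    long long pencilIndex = static_cast<long long>(z) * g.gy + y;
    return static_cast<int>(pencilIndex * numPes /
                            (static_cast<long long>(g.gy) * g.gz));
}
```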

SLIDE 16

Automatic Runtime Decisions

  • Use of 1D or 2D algorithm for PME
  • Use of spanning trees for multicast
  • Splitting of patches for fine-grained parallelism
  • Depend on:

– Characteristics of the machine
– No. of processors
– No. of atoms in the simulation
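Conceptually these choices reduce to a few startup-time tests on the machine, the processor count, and the system size. The thresholds below are invented placeholders purely to show the shape of such logic; they are not the rules NAMD actually applies:

```cpp
struct RunConfig {
    bool use2dPme;          // 2D (pencil) instead of 1D (slab) PME decomposition
    bool useSpanningTrees;  // tree-based multicast from patches to computes
    bool splitPatches;      // split patches for finer-grained parallelism
};

// numPes = number of processors; numAtoms = size of the molecular system.
// All thresholds are illustrative assumptions, not NAMD's actual values.
RunConfig chooseAlgorithms(int numPes, long numAtoms, bool torusNetwork) {
    RunConfig cfg;
    cfg.use2dPme         = numPes > 1024;                 // need more PME parallelism
    cfg.useSpanningTrees = numPes > 512 && torusNetwork;  // worthwhile on large partitions
    cfg.splitPatches     = numAtoms / numPes < 500;       // too few atoms per processor
    return cfg;
}
```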

SLIDE 17

Reducing the memory footprint

  • Exploit the fact that building blocks for a bio-molecule have common structures
  • Store information about a particular kind of atom only once
SLIDE 18

[Diagram: two water molecules; instead of storing absolute atom indices (14332-14334 and 14495-14497) for each bond, a shared signature records the partners as relative offsets (-1, +1)]

SLIDE 19

Reducing the memory footprint

  • Exploit the fact that building blocks for a bio-molecule have common structures
  • Store information about a particular kind of atom only once (sketched below)
  • Static atom information increases only with the addition of unique proteins in the simulation
  • Allows simulation of the 2.8 M-atom Ribosome on Blue Gene/L
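A sketch of the compression idea: the static data for each kind of atom (mass, charge, bonded-term offsets) is stored once in a signature table, and every atom instance keeps only its dynamic state plus an index into that table. Bonded partners are recorded as relative offsets, so all copies of the same building block (e.g., every water) share one entry. The structure names are illustrative, not NAMD's:

```cpp
#include <cstdint>
#include <vector>

// Stored once per unique kind of atom (per building block of the molecule).
struct AtomSignature {
    double mass;
    double charge;
    std::vector<int> bondPartnerOffsets;  // relative to the atom's own index, e.g. -1, +1
};

// Stored per atom: dynamic state plus a reference to the shared static data.
struct AtomInstance {
    double position[3];
    double velocity[3];
    std::uint16_t signatureId;            // index into the signature table
};

struct CompressedMolecule {
    std::vector<AtomSignature> signatures;  // grows only with unique building blocks
    std::vector<AtomInstance>  atoms;       // grows with atom count, but each entry is small
};
```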

SLIDE 20

Memory Reduction

[Chart: memory usage (MB, log scale from 0.01 to 1000) for the IAPP, DHFR, Lysozyme, ApoA1, F1-ATPase, STMV, Bar Domain, and Ribosome benchmarks, comparing the original and new versions; the new version stays below 0.5 MB for the smallest benchmarks]

SLIDE 21

NAMD on Blue Gene/L

1 million atom simulation on 64K processors (LLNL BG/L)

SLIDE 22

NAMD on Cray XT3/XT4

5570 atom simulation on 512 processors at 1.2 ms/step

SLIDE 23

Comparison with Blue Matter

  • Blue Matter developed specifically for Blue Gene/L

Time for ApoA1 (ms/step)

  • NAMD running on 4K cores of XT3 is comparable to BM running on 32K cores of BG/L

SLIDE 24

Time for ApoA1 (ms/step):

Number of Nodes            512     1024    2048    4096    8192    16384
Blue Matter (2 pes/node)   38.42   18.95   9.97    5.39    3.14    2.09
NAMD CO mode (No MTS)      19.59   11.42   7.48    5.52    4.2     3.46
NAMD CO mode (1 pe/node)   16.83   9.73    5.8     3.78    2.71    2.04
NAMD VN mode (No MTS)      11.99   9.99    5.62    5.3     3.7     -
NAMD VN mode (2 pes/node)  9.82    6.26    4.06    3.06    2.29    2.11

SLIDE 25

Comparison with Desmond

  • Desmond is a proprietary MD program

Time (ms/step) for Desmond on 2.4 GHz Opterons and NAMD on 2.6 GHz Xeons

  • Uses single precision and exploits SSE instructions
  • Low-level InfiniBand primitives tuned for MD

SLIDE 26

Time (ms/step):

Number of Cores   8       16      32      64      128     256     512     1024    2048
NAMD DHFR         27.3    14.9    8.09    4.3     2.4     1.5     1.1     1.0     -
Desmond DHFR      41.4    21.0    11.5    6.3     3.7     2.0     1.4     -       -
NAMD ApoA1        199.3   104.9   50.7    26.5    13.4    7.1     4.2     2.5     1.9
Desmond ApoA1     256.8   126.8   64.3    33.5    18.2    9.4     5.2     3.0     2.0

SLIDE 27

NAMD on Blue Gene/P

SLIDE 28

Future Work

  • Optimizing PME computation

– Use of one-sided puts between FFTs

  • Reducing communication and other overheads with increasing fine-grained parallelism

  • Running NAMD on Blue Waters

– Improved distributed load balancers
– Parallel Input/Output

SLIDE 29

Summary

  • NAMD is a highly scalable and portable MD program

– Runs on a variety of architectures
– Available free of cost on machines at most supercomputing centers
– Supports a range of sizes of molecular systems

  • Uses adaptive runtime techniques for high scalability
  • Automatic runtime selection of the algorithms best suited to the scenario
  • With new optimizations, NAMD is ready for the next generation of parallel machines

SLIDE 30

Questions?