

SLIDE 1

NAMD on BlueWaters

Presented by: Eric Bohm
Team: Eric Bohm, Chao Mei, Osman Sarood, David Kunzman, Yanhua Sun, Jim Phillips, John Stone, LV Kale

SLIDE 2

NSF/NCSA Blue Waters Project

 Sustained-petaflops system funded by NSF, to be ready in 2011.
− System expected to exceed 300,000 processor cores.
 NSF acceptance test: 100-million-atom BAR domain simulation using NAMD.
 NAMD PRAC: The Computational Microscope
− Systems from 10 to 100 million atoms
 A recently submitted PRAC from an independent group wishes to use NAMD
− 1 billion atoms!

SLIDE 3

NAMD

 Molecular dynamics simulation of biological systems
 Uses the Charm++ idea:
− Decompose the computation into a large number of objects
− Have an intelligent runtime system (Charm++) assign objects to processors for dynamic load balancing
 Hybrid of spatial and force decomposition:
− Spatial decomposition of atoms into cubes (called patches)
− For every pair of interacting patches, create one object for calculating electrostatic interactions (see the sketch below)
− Recent codes (Blue Matter, Desmond, etc.) use this idea in some form
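
A minimal C++ sketch of this decomposition idea, under stated assumptions: Patch and ComputePair are hypothetical names (in NAMD proper these are Charm++ migratable objects placed by the runtime), the grid is a toy 4x4x4 box, and periodic boundaries are ignored.

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// Space is cut into cubic "patches" one cutoff wide; one compute object is
// created per interacting patch pair, and the runtime is free to place each
// compute on any processor for load balance.
struct Patch { int x, y, z; };              // integer grid coordinates
struct ComputePair { int a, b; };           // non-bonded work for one patch pair

int main() {
    const int dim = 4;                      // toy 4x4x4 patch grid
    std::vector<Patch> patches;
    for (int x = 0; x < dim; ++x)
        for (int y = 0; y < dim; ++y)
            for (int z = 0; z < dim; ++z)
                patches.push_back({x, y, z});

    // 1-away: patches interact iff they are adjacent in every dimension.
    std::vector<ComputePair> computes;
    for (int i = 0; i < (int)patches.size(); ++i)
        for (int j = i; j < (int)patches.size(); ++j) {
            const Patch &p = patches[i], &q = patches[j];
            if (std::abs(p.x - q.x) <= 1 && std::abs(p.y - q.y) <= 1 &&
                std::abs(p.z - q.z) <= 1)
                computes.push_back({i, j});
        }
    std::printf("%zu patches -> %zu compute objects\n",
                patches.size(), computes.size());
    return 0;
}
```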

SLIDE 4

BW Challenges and Opportunities

 Support systems of >= 100 million atoms
 Performance requirements for 100-million-atom systems
 Scale to over 300,000 cores
 Power7 hardware
− PPC architecture
− Wide nodes: at least 32 cores with 128 hardware threads
 BlueWaters Torrent interconnect
 Doing research under NDA

SLIDE 5

BlueWaters Architecture

 IBM Power7
 Peak performance ~10 PF
 Sustained ~1 PF
 300,000+ cores
 1.2+ PB memory
 18+ PB disk
 8 cores/chip
 4 chips/MCM
 8 MCMs/drawer
 4 drawers/SuperNode
 1024 cores/SuperNode
 Linux OS

SLIDE 6

Power7

 64-bit PowerPC
 3.7-4 GHz
 Up to 8 FLOPs/cycle
 4-way SMT
 128-byte cache lines
 32 KB L1, 256 KB L2
 4 MB local region in a shared 32 MB L3 cache
 2 fixed-point and 2 load/store units
 1 VMX unit, 1 decimal FP unit
 2 VSX units
− 4 FLOPs/cycle each
 6-wide in-order dispatch, 8-wide out-of-order issue
 Prefetch of up to 12 data streams

SLIDE 7

Hub Chip Module

 Connects 8 QCMs via L-local links (copper)
− 24 GB/s
 Connects 4 P7-IH drawers via L-remote links (optical)
− 6 GB/s
 Connects up to 512 SuperNodes via D links (optical)
− 10 GB/s

SLIDE 8

Availability

 NCSA has the BlueDrop machine
− Linux
− IBM 780 (MR) POWER7, 3.8 GHz
− Login node: 2 x 8-core processors
− Compute node: 4 x 8-core in 2 enclosures
 BlueBioU
− Linux
− 18 IBM 750 (HV32) nodes, 3.55 GHz
− InfiniBand 4x DDR (Galaxy)

SLIDE 9

NAMD on BW

 Use SMT=4 effectively
 Use Power7 effectively
− Shared-memory topology
− Prefetch (see the sketch after this list)
− Loop unrolling
− SIMD via VSX
 Use Torrent effectively
− LAPI/XMI
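
An illustrative sketch of the kind of prefetch hint involved, not NAMD source: it uses the GCC-style __builtin_prefetch on an indirectly indexed array, the case where Power7's hardware stream prefetcher cannot help.

```cpp
#include <cstddef>

// Power7 has 128-byte cache lines and a 12-stream hardware prefetcher;
// explicit hints like this help for irregular, gather-style accesses
// that the hardware streams cannot predict.
void scale_forces(double *f, const int *index, std::size_t n, double s) {
    const std::size_t ahead = 16;  // ~one 128-byte line of doubles ahead
    for (std::size_t i = 0; i < n; ++i) {
        if (i + ahead < n)
            __builtin_prefetch(&f[index[i + ahead]],
                               1 /*for write*/, 1 /*low temporal locality*/);
        f[index[i]] *= s;
    }
}
```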

SLIDE 10

Petascale Scalability Concerns

 Centralized load balancer – solved
 I/O
− Unscalable file formats – solved
− Input read at startup – solved
− Sequential output – in progress
 Fine-grain overhead – in progress
 Non-bonded multicasts – being studied
 Particle Mesh Ewald (see the back-of-envelope sketch after this list)
− Largest grid target <= 1024
− Communication overhead is the primary issue
− Considering Multilevel Summation as an alternative
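
To make the PME communication concern concrete, a back-of-envelope sketch under my own assumptions (not project data): it models a pencil-decomposed 3D FFT on a 1024^3 grid, where each transpose between 1D FFT phases is an all-to-all among sqrt(P) processors, so per-message size shrinks as cores grow while message counts rise.

```cpp
#include <cmath>
#include <cstdio>

// Assumed model: P processors in a square sqrt(P) x sqrt(P) pencil layout;
// each transpose message carries roughly N^3 / (P * sqrt(P)) grid points.
// Tiny messages at large P are why communication, not arithmetic, limits PME.
int main() {
    const long N = 1024;  // largest grid target from the slide
    for (long P : {4096L, 65536L, 262144L}) {
        long root = (long)std::llround(std::sqrt((double)P));
        long pointsPerMsg = (N * N * N) / P / root;
        std::printf("P=%7ld: ~%ld grid points per transpose message\n",
                    P, pointsPerMsg);
    }
}
```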

SLIDE 11

NAMD and SMT=4

 P7 hardware threads are prioritized
− 0,1 highest
− 2,3 lowest
 The Charm++ runtime measures processor performance
− The load balancer operates accordingly (see the sketch after this list)
 NAMD at SMT=4 is 35% faster than at SMT=1
− No new code required!
 At the limit it requires 4x more decomposition
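
A simplified greedy sketch in the spirit of Charm++'s measurement-based load balancers; the loads, speeds, and structure are illustrative, not the actual LB code. The point is that speeds are measured, not assumed, so slower SMT threads (2,3) automatically receive less work.

```cpp
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

int main() {
    std::vector<double> objLoad = {9, 7, 6, 4, 3, 2, 1, 1}; // measured object loads
    std::vector<double> speed   = {1.0, 1.0, 0.6, 0.6};     // measured thread speeds
    using Proc = std::pair<double, int>;                    // (normalized load, id)
    std::priority_queue<Proc, std::vector<Proc>, std::greater<Proc>> pq;
    for (int p = 0; p < (int)speed.size(); ++p) pq.push({0.0, p});

    std::sort(objLoad.rbegin(), objLoad.rend());            // heaviest objects first
    std::vector<double> assigned(speed.size(), 0.0);
    for (double w : objLoad) {
        int p = pq.top().second;                            // least normalized load
        pq.pop();
        assigned[p] += w;
        pq.push({assigned[p] / speed[p], p});               // normalize by speed
    }
    for (int p = 0; p < (int)speed.size(); ++p)
        std::printf("thread %d (speed %.1f): load %.1f\n", p, speed[p], assigned[p]);
}
```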

SLIDE 12

NAMD on Power7 HV 32 AIX

[Figure: NAMD ApoA1 on Power7 HV32 (AIX) – relative parallel efficiency for 1, 2, 4, 8, 16, and 32 cores, with SMT=2 and SMT=4 variants at 1 and 8 cores.]

SLIDE 13

SIMD -> VSX

 VSX adds double-precision support to VMX
 SSE2 is already in use in 2 NAMD functions
 An MD-SIMD implementation of the non-bonded MD benchmark is available from Kunzman
 Translate SSE to VSX (see the sketch after this list)
 Add VSX support to MD-SIMD
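
A hedged sketch of what the SSE-to-VSX translation looks like on a toy kernel (y += a*x), not NAMD's actual inner loop. It assumes an even n and current GCC/XL-style vec_xl/vec_xst/vec_madd builtins from <altivec.h>. Both ISAs hold two doubles per register, so the mapping is largely one-to-one, with VSX gaining a fused multiply-add.

```cpp
#ifdef __SSE2__
#include <emmintrin.h>
void axpy_sse2(double *y, const double *x, double a, int n) {
    __m128d va = _mm_set1_pd(a);
    for (int i = 0; i < n; i += 2) {             // two doubles per iteration
        __m128d vy = _mm_loadu_pd(y + i);
        __m128d vx = _mm_loadu_pd(x + i);
        _mm_storeu_pd(y + i, _mm_add_pd(vy, _mm_mul_pd(va, vx)));
    }
}
#endif

#ifdef __VSX__
#include <altivec.h>
void axpy_vsx(double *y, const double *x, double a, int n) {
    __vector double va = vec_splats(a);
    for (int i = 0; i < n; i += 2) {             // two doubles per iteration
        __vector double vy = vec_xl(0, y + i);
        __vector double vx = vec_xl(0, x + i);
        vec_xst(vec_madd(va, vx, vy), 0, y + i); // VSX fused multiply-add
    }
}
#endif
```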

SLIDE 14

MD-SIMD performance

SLIDE 15

Support for Large Molecular Systems

 New compressed PSF file format
− Supports >100 million atoms
− Supports parallel startup
− Supports the MEM_OPT molecule representation
 The MEM_OPT molecule format reduces data replication through signatures (see the sketch after this list)
 Parallelized reading of input at startup
− Cannot support the legacy PDB format
− Uses the binary coordinates format
 Changes in VMD courtesy of John Stone
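
A sketch of the signature idea with a hypothetical per-atom record (NAMD's actual MEM_OPT layout differs): most atoms share identical static data, so the molecule stores each distinct tuple once and keeps only a small index per atom instead of replicating the full record 100M+ times.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <tuple>
#include <vector>

struct AtomStatic { int type; float charge; int bondedPattern; }; // illustrative fields
using Key = std::tuple<int, float, int>;

int main() {
    std::vector<AtomStatic> atoms = {
        {1, -0.8f, 7}, {2, 0.4f, 3}, {2, 0.4f, 3}, {1, -0.8f, 7}};

    std::vector<AtomStatic> signatures;    // one entry per distinct tuple
    std::vector<std::uint32_t> sigIndex;   // per-atom index into signatures
    std::map<Key, std::uint32_t> seen;
    for (const AtomStatic &a : atoms) {
        Key k{a.type, a.charge, a.bondedPattern};
        auto it = seen.find(k);
        if (it == seen.end()) {            // first occurrence: record it
            it = seen.emplace(k, (std::uint32_t)signatures.size()).first;
            signatures.push_back(a);
        }
        sigIndex.push_back(it->second);
    }
    std::printf("%zu atoms, %zu signatures\n", atoms.size(), signatures.size());
}
```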

SLIDE 16

Parallel Startup

[Table 1: Parallel startup for 1 million water on BlueGene/P – startup time (sec) and memory (MB) by node count, from 1 to 1024 nodes.]

[Table 2: Parallel startup for the 116-million-atom BAR domain on Abe – startup time (sec) and memory (MB) by node count.]

SLIDE 17

Fine-grain overhead

 End-user targets are all fixed-size problems
 Strong-scaling performance dominates
− Maximize the number of nanoseconds/day of simulation
 The non-bonded cutoff distance determines the patch size
− A patch can be subdivided along the x, y, and z dimensions
 2-away X, 2-away XY, 2-away XYZ (worked example after this list)
− Theoretically k-away...
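
A worked count of how the decomposition grows, assuming patches interact with all neighbors within k patches on each axis: 1-away gives a 3x3x3 neighborhood, 2-away XYZ a 5x5x5 one, and k-away generally (2k+1)^3. Counting each unordered pair once (plus the self compute) shows why fine-grain overhead matters.

```cpp
#include <cstdio>

int main() {
    for (int k = 1; k <= 3; ++k) {                   // k-away along all three axes
        long nbrs = 1;
        for (int d = 0; d < 3; ++d) nbrs *= (2L * k + 1);
        long computesPerPatch = (nbrs - 1) / 2 + 1;  // unordered pairs + self
        std::printf("%d-away XYZ: %ld neighbors, ~%ld computes per patch\n",
                    k, nbrs, computesPerPatch);      // 1-away: 14, 2-away: 63
    }
}
```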

SLIDE 18

1-away vs 2-away X

SLIDE 19

Fine-grain overhead reduction

 Distant computes have little or no interaction
− The long-diagonal opposites of 2-away XYZ are mostly outside the cutoff
 Optimizations
− Don't migrate tiny computes
− Sort pairlists to truncate computation (see the sketch after this list)
− Increase the margin and do not create redundant compute objects
 Slight (<5%) reduction in step time
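
A simplified sketch, with hypothetical types, of how a distance-sorted pairlist truncates computation: once the list is ordered by squared distance, the force loop stops at the first entry beyond the cutoff instead of testing every pair, so mostly-out-of-range 2-away computes do little work.

```cpp
#include <algorithm>
#include <vector>

struct PairEntry { int j; double dist2; };  // partner index, squared distance

double pairlistEnergy(std::vector<PairEntry> &pl, double cutoff) {
    std::sort(pl.begin(), pl.end(),
              [](const PairEntry &a, const PairEntry &b) { return a.dist2 < b.dist2; });
    const double c2 = cutoff * cutoff;
    double e = 0.0;
    for (const PairEntry &p : pl) {
        if (p.dist2 > c2) break;            // all later entries are outside too
        double ir2 = 1.0 / p.dist2, ir6 = ir2 * ir2 * ir2;
        e += ir6 * ir6 - ir6;               // toy Lennard-Jones-like term
    }
    return e;
}
```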

SLIDE 20

Future work

 Integrate parallel output into CVS NAMD
 Consolidate small compute objects
 Leverage the native communication API
 Improve or replace Particle Mesh Ewald
 Parallel I/O optimization study on multiple platforms
 High-scaling (>16k) study on multiple platforms