namd on bluewaters

NAMD on BlueWaters. Presented by: Eric Bohm


  1. NAMD on BlueWaters
     Presented by: Eric Bohm
     Team: Eric Bohm, Chao Mei, Osman Sarood, David Kunzman, Yanhua Sun, Jim Phillips, John Stone, LV Kale

  2. NSF/NCSA Blue Waters Project
     • Sustained Petaflops system funded by NSF to be ready in 2011.
       − System expected to exceed 300,000 processor cores.
     • NSF Acceptance test: 100 million atom BAR domain simulation using NAMD.
     • NAMD PRAC: The Computational Microscope
       − Systems from 10 to 100 million atoms
     • A recently submitted PRAC from an independent group wishes to use NAMD
       − 1 Billion atoms!

  3. NAMD
     • Molecular Dynamics simulation of biological systems
     • Uses the Charm++ idea:
       − Decompose the computation into a large number of objects
       − Have an intelligent run-time system (of Charm++) assign objects to processors for dynamic load balancing
     • Hybrid of spatial and force decomposition:
       − Spatial decomposition of atoms into cubes (called patches)
       − For every pair of interacting patches, create one object for calculating electrostatic interactions
       − Recent: Blue Matter, Desmond, etc. use this idea in some form
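The hybrid decomposition on this slide can be sketched in a few lines. This is only an illustration, not NAMD's implementation: the function names `build_patches` and `pair_computes` are hypothetical, the box is assumed periodic, and each patch is assumed to interact only with itself and its immediate neighbours.

```python
from itertools import product


def build_patches(box, cutoff):
    """Return the patch grid (nx, ny, nz) for a box, using cubes at least `cutoff` wide."""
    return tuple(max(1, int(length // cutoff)) for length in box)


def pair_computes(dims):
    """Return one compute object per patch (self) and per pair of neighbouring patches.

    Assumption: periodic boundaries, so neighbour indices wrap around the grid.
    """
    computes = set()
    for p in product(*(range(n) for n in dims)):
        for d in product((-1, 0, 1), repeat=3):
            q = tuple((p[i] + d[i]) % dims[i] for i in range(3))
            # Store each unordered pair once; (p, p) is the self-interaction compute.
            computes.add((min(p, q), max(p, q)))
    return computes


dims = build_patches((64.0, 64.0, 64.0), 16.0)   # hypothetical box and cutoff
computes = pair_computes(dims)
print(dims, len(computes), len(computes) // (dims[0] * dims[1] * dims[2]))
```

Under these assumptions a 4 x 4 x 4 grid yields 14 compute objects per patch, far more objects than patches, which is what gives the run-time system room to balance load.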

  4. BW Challenges and Opportunities
     • Support systems >= 100 million atoms
     • Performance requirements for 100 million atoms
     • Scale to over 300,000 cores
     • Power 7 hardware
       − PPC architecture
       − Wide node: at least 32 cores with 128 hardware threads
     • BlueWaters Torrent interconnect
     • Doing research under NDA

  5. BlueWaters Architecture
     • IBM Power7
     • Peak Perf ~10 PF
     • Sustained ~1 PF
     • 300,000+ cores
     • 1.2+ PB Memory
     • 18+ PB Disk
     • 8 cores/chip
     • 4 chips/MCM
     • 8 MCMs/Drawer
     • 4 Drawers/SuperNode
     • 1024 cores/SuperNode
     • Linux OS
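The packaging figures on this slide are mutually consistent, which a short calculation makes explicit. The sketch below uses only numbers from the slide; because "300,000+" and "1.2+ PB" are lower bounds, the derived SuperNode count and memory per core are rough estimates, and decimal units (1 PB = 10^15 bytes) are an assumption.

```python
import math

# Packaging hierarchy as listed on the slide.
CORES_PER_CHIP = 8
CHIPS_PER_MCM = 4
MCMS_PER_DRAWER = 8
DRAWERS_PER_SUPERNODE = 4

cores_per_mcm = CORES_PER_CHIP * CHIPS_PER_MCM
cores_per_drawer = cores_per_mcm * MCMS_PER_DRAWER
cores_per_supernode = cores_per_drawer * DRAWERS_PER_SUPERNODE

# Lower bounds from the slide; the real machine totals are not given.
total_cores = 300_000
total_memory_bytes = 1_200 * 10**12   # 1.2 PB, assuming decimal units

supernodes_needed = math.ceil(total_cores / cores_per_supernode)
memory_per_core_gb = total_memory_bytes // total_cores // 10**9

print(cores_per_mcm, cores_per_drawer, cores_per_supernode)
print(supernodes_needed, memory_per_core_gb)
```

The product reproduces the 1024 cores/SuperNode stated on the slide, and the lower-bound totals work out to roughly 4 GB of memory per core.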

  6. Power 7
     • 64-bit PowerPC
     • 3.7-4 GHz
     • Up to 8 FLOPs/cycle
     • 4-way SMT
     • 128 byte cache lines
     • 32 KB L1
     • 256 KB L2
     • 4 MB local in shared 32 MB L3 cache
     • 2 fixed point, 2 load store
     • 1 VMX
     • 1 decimal FP
     • 2 VSX
       − 4 FLOPs/cycle
     • 6-wide in-order
     • 8-wide out-of-order
     • 12 data streams prefetch
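As a sanity check, the per-core figures here can be combined with the "300,000+ cores" quoted for the full system. This is a back-of-the-envelope sketch; `peak_flops` is a hypothetical helper, and the core count is the lower bound from the slides rather than the final machine size.

```python
def peak_flops(cores, clock_hz, flops_per_cycle=8):
    """Theoretical peak: cores x clock rate x FLOPs issued per cycle."""
    return cores * clock_hz * flops_per_cycle


CORES = 300_000                                    # lower bound from the slides
low = peak_flops(CORES, 3_700_000_000)             # 3.7 GHz
high = peak_flops(CORES, 4_000_000_000)            # 4.0 GHz
per_core_gflops = peak_flops(1, 4_000_000_000) // 10**9

print(f"{low / 1e15:.2f} to {high / 1e15:.2f} PF peak, "
      f"{per_core_gflops} GFLOP/s per core at 4 GHz")
```

The range of about 8.9 to 9.6 PF agrees with the "Peak Perf ~10 PF" figure on the architecture slide.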

  7. Hub Chip Module
     • Connects 8 QCMs via L-local (copper)
       − 24 GB/s
     • Connects 4 P7-IH drawers via L-remote (optical)
       − 6 GB/s
     • Connects up to 512 SuperNodes via D (optical)
       − 10 GB/s

  8. Availability
     • NCSA has BlueDrop machine
       − Linux
       − IBM 780 (MR) POWER7 3.8 GHz
       − Login node: 2x8 core processors
       − Compute node: 4x8 core in 2 enclosures
     • BlueBioU
       − Linux
       − 18 IBM 750 (HV32) nodes, 3.55 GHz
       − Infiniband 4x DDR (Galaxy)

  9. NAMD on BW
     • Use SMT=4 effectively
     • Use Power7 effectively
       − Shared memory topology
       − Prefetch
       − Loop unrolling
       − SIMD VSX
     • Use Torrent effectively
       − LAPI/XMI

  10. Petascale Scalability Concerns
      • Centralized load balancer: solved
      • IO
        − Unscalable file formats: solved
        − Input read at startup: solved
        − Sequential output: in progress
      • Fine grain overhead: in progress
      • Non-bonded multicasts: being studied
      • Particle Mesh Ewald
        − Largest grid target <= 1024
        − Communication overhead primary issue
        − Considering Multilevel Summation alternative

  11. NAMD and SMT=4
      • P7 hardware threads are prioritized
        − 0,1 highest
        − 2,3 lowest
      • Charm runtime measures processor performance
        − Load balancer operates accordingly
      • NAMD on SMT=4 is 35% faster than SMT=1
        − No new code required!
      • At the limit it requires 4x more decomposition
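To illustrate why measured processor speed lets the load balancer cope with unequal hardware threads, here is a minimal speed-aware greedy assignment. It is a generic sketch, not the Charm++ load balancer: `greedy_balance` is a hypothetical name, and the relative speeds (1.0 for threads 0 and 1, 0.5 for threads 2 and 3) are made-up values, not measurements.

```python
def greedy_balance(loads, speeds):
    """Assign objects, heaviest first, to the processor that would finish them earliest.

    loads  : work per object (arbitrary units)
    speeds : measured relative speed of each processor (hypothetical values here)
    Returns (objects per processor, finish time per processor).
    """
    work = [0.0] * len(speeds)
    assignment = [[] for _ in speeds]
    for obj, load in sorted(enumerate(loads), key=lambda item: -item[1]):
        # Pick the processor with the earliest completion time for this object.
        pe = min(range(len(speeds)), key=lambda p: (work[p] + load) / speeds[p])
        work[pe] += load
        assignment[pe].append(obj)
    finish = [w / s for w, s in zip(work, speeds)]
    return assignment, finish


# Four SMT threads on one core: two high-priority, two low-priority (assumed speeds).
assignment, finish = greedy_balance([1.0] * 12, [1.0, 1.0, 0.5, 0.5])
print([len(objs) for objs in assignment], finish)
```

With these assumed speeds the slower threads receive half as many objects and all four threads finish together, which is the behaviour the slide attributes to measurement-based balancing.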

  12. NAMD on Power7 HV 32 AIX
      [Figure: Relative Parallel Efficiency, NAMD ApoA1 on Power 7 HV32 (AIX). Y-axis: efficiency, 0 to 1.8. Series: Core 1; Core 1, SMT=2; Core 1, SMT=4; Core 2; Core 4; Core 8; Core 8, SMT=2; Core 8, SMT=4; Core 16; Core 32.]

  13. SIMD -> VSX
      • VSX adds double precision support to VMX
      • SSE2 already in use in 2 NAMD functions
      • MD-SIMD implementation of nonbonded MD benchmark available from Kunzman
      • Translate SSE to VSX
      • Add VSX support to MD-SIMD

  14. MD-SIMD performance

  15. Support for Large Molecular Systems
      • New Compressed PSF file format
        − Supports >100 million atoms
        − Supports parallel startup
        − Supports MEM_OPT molecule representation
      • MEM_OPT molecule format reduces data replication through signatures
      • Parallelize reading of input at startup
        − Cannot support legacy PDB format
        − Use binary coordinates format
      • Changes in VMD courtesy John Stone
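The slide does not describe the signature format, so the following is only a sketch of the general idea under an assumption: each atom's bonded partners are stored as offsets relative to its own index, and atoms whose offset tuples match share a single stored signature. The function name `build_signatures` and the water example are hypothetical.

```python
def build_signatures(bonds_per_atom):
    """Deduplicate per-atom bond lists by expressing partners as relative offsets.

    bonds_per_atom[i] lists the absolute indices of atoms bonded to atom i.
    Returns (unique signatures, per-atom signature id).
    """
    table = {}      # signature tuple -> signature id (insertion ordered)
    atom_sig = []   # one small integer per atom instead of a full bond list
    for i, partners in enumerate(bonds_per_atom):
        signature = tuple(sorted(p - i for p in partners))
        atom_sig.append(table.setdefault(signature, len(table)))
    return list(table), atom_sig


# Hypothetical input: 1000 water molecules stored as O, H, H.
bonds = []
for w in range(1000):
    o = 3 * w
    bonds += [[o + 1, o + 2], [o], [o]]

signatures, atom_sig = build_signatures(bonds)
print(signatures, len(atom_sig))
```

In this toy case 3000 atoms need only three distinct signatures, which shows how a signature table can shrink replicated structure data regardless of system size.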

  16. Parallel Startup

      Table 1: Parallel Startup for 10 Million water on BlueGene/P

      Nodes   Start (sec)   Memory (MB)
          1            NA      4484.55*
          8       446.499       865.117
         16       424.765       456.487
         32       420.492       258.023
         64       435.366       235.949
        128       227.018       222.219
        256       122.296       218.285
        512       73.2571       218.449
       1024       76.1005       214.758

      Table: Parallel Startup 116 Million BAR domain on Abe

      Nodes   Start (sec)   Memory (MB)
          1       3075.6*      75457.7*
         50       340.361          1008
         80       322.165           908
        120       323.561           710

  17. Fine grain overhead
      • End user targets are all fixed size problems
      • Strong scaling performance dominates
        − Maximize number of nanoseconds/day of simulation
      • Non-bonded cutoff distance determines patch size
        − Patch can be subdivided along x, y, z dimensions
      • 2-away X, 2-away XY, 2-away XYZ
        − Theoretically K-away...
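The cost of finer decomposition can be estimated by counting compute objects. The sketch below assumes that "k-away" along an axis splits each patch into k pieces on that axis, so interacting neighbours lie up to k patches away, and that the grid is large and periodic so no neighbour is counted twice. The helper names are hypothetical.

```python
def computes_per_patch(away_x, away_y, away_z):
    """Self compute plus half of the neighbour pairs (each pair is shared by two patches)."""
    neighbourhood = (2 * away_x + 1) * (2 * away_y + 1) * (2 * away_z + 1)
    return 1 + (neighbourhood - 1) // 2


def total_computes(base_patches, away_x, away_y, away_z):
    """Total compute objects when each base patch is split away_x * away_y * away_z ways."""
    patches = base_patches * away_x * away_y * away_z
    return patches * computes_per_patch(away_x, away_y, away_z)


BASE = 64  # hypothetical 4 x 4 x 4 grid of cutoff-sized patches
for label, away in [("1-away", (1, 1, 1)), ("2-away X", (2, 1, 1)),
                    ("2-away XY", (2, 2, 1)), ("2-away XYZ", (2, 2, 2))]:
    print(label, computes_per_patch(*away), total_computes(BASE, *away))
```

Going from 1-away to 2-away XYZ multiplies the object count by 36 under these assumptions, which is the extra parallelism strong scaling needs and also the source of the fine-grain overhead.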

  18. 1-away vs 2-away X

  19. Fine-grain overhead reduction
      • Distant computes have little or no interaction
        − Long diagonal opposites of 2-away XYZ mostly outside of cutoff
      • Optimizations
        − Don't migrate tiny computes
        − Sort pairlists to truncate computation
        − Increase margin and do not create redundant compute objects
      • Slight (<5%) reduction in step time
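A quick geometric estimate shows why diagonal-opposite computes do little work. The sketch assumes 2-away XYZ patches with an edge of exactly half the cutoff (margin ignored), uniformly distributed atoms, and a hypothetical 12 Å cutoff; the function names are illustrative only.

```python
import math
import random


def min_patch_distance(offset, edge):
    """Closest approach between two cubic patches separated by `offset` patches per axis."""
    return edge * math.sqrt(sum(max(abs(d) - 1, 0) ** 2 for d in offset))


def fraction_within_cutoff(offset, cutoff, samples=20000, seed=0):
    """Estimate the share of atom pairs within `cutoff` by sampling random positions."""
    rng = random.Random(seed)
    edge = cutoff / 2.0          # 2-away patch edge, ignoring margin (assumption)
    hits = 0
    for _ in range(samples):
        a = [rng.random() * edge for _ in range(3)]
        b = [(d + rng.random()) * edge for d in offset]
        if math.dist(a, b) < cutoff:
            hits += 1
    return hits / samples


CUTOFF = 12.0  # hypothetical cutoff in angstroms
for offset in [(1, 0, 0), (2, 0, 0), (2, 2, 0), (2, 2, 2)]:
    print(offset,
          round(min_patch_distance(offset, CUTOFF / 2.0), 2),
          fraction_within_cutoff(offset, CUTOFF))
```

For the (2, 2, 2) offset the closest corners are about 0.87 of a cutoff apart, so only a tiny share of atom pairs can interact. That is why sorting pairlists and skipping redundant compute objects pays off for such pairs.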

  20. Future work
      • Integrate parallel output into CVS NAMD
      • Consolidate small compute objects
      • Leverage native communication API
      • Particle Mesh Ewald improve/replace
      • Parallel I/O optimization study on multiple platforms
      • High (>16k) scaling study on multiple platforms
