Towards Process-Level Charm++ Programming in NAMD James Phillips Beckman Institute, University of Illinois http://www.ks.uiuc.edu/Research/namd/ Biomedical Technology Research Center for Macromolecular Modeling and Bioinformatics Charm++ 2015


SLIDE 1

Biomedical Technology Research Center for Macromolecular Modeling and Bioinformatics Beckman Institute, University of Illinois at Urbana-Champaign - www.ks.uiuc.edu

Charm++ 2015

Towards Process-Level Charm++ Programming in NAMD

James Phillips

Beckman Institute, University of Illinois http://www.ks.uiuc.edu/Research/namd/

SLIDE 2

Achievements Built on People

NIH Biomedical Technology Research Center for Macromolecular Modeling and Bioinformatics, renewed 2012-2017 with a 10.0 score (NIH).

Developers of the widely used computational biology software VMD and NAMD:

  • 250,000 registered VMD users
  • 72,000 registered NAMD users
  • 600 publications (since 1972), over 54,000 citations

People: 5 faculty members, 8 developers, 1 systems administrator, 17 postdocs, 46 graduate students, 3 administrative staff.

Research projects include: virus capsids, ribosome, photosynthesis, protein folding, membrane reshaping, animal magnetoreception.

Pictured: Tajkhorshid, Luthey-Schulten, Stone, Schulten, Phillips, Kale, Mallon.

SLIDE 3

NAMD Serves NIH Users and Goals

Practical Supercomputing for Biomedical Research

  • 72,000 users can’t all be computer experts.

– 18% are NIH-funded; many in other countries.
– 21,000 have downloaded more than one version.
– 5,000 citations of NAMD reference papers.

  • One program available on all platforms.

– Desktops and laptops – setup and testing
– Linux clusters – affordable local workhorses
– Supercomputers – free allocations on XSEDE
– Blue Waters – sustained petaflop/s performance
– GPUs – next-generation supercomputing

  • User knowledge is preserved across platforms.

– No change in input or output files.
– Run any simulation on any number of cores.

  • Available free of charge to all.

[Images: Oak Ridge TITAN; hands-on workshops]

SLIDE 4

NAMD Benefits from Charm++ Collaboration

  • Illinois Parallel Programming Lab

– Prof. Laxmikant Kale
– charm.cs.illinois.edu

  • Long standing collaboration

– Since start of Center in 1992
– Gordon Bell award at SC2002
– Joint Fernbach award at SC12

  • Synergistic research

– NAMD requirements drive and validate CS work
– Charm++ software provides unique capabilities
– Enhances NAMD performance in many ways

SLIDE 5

Structural data drives simulations

[Figure: simulation size over time, 1986-2014. Number of atoms grows from 10^4 to 10^8: Lysozyme, ApoA1, ATP Synthase, STMV, Ribosome, HIV capsid.]

SLIDE 6

Charm++ Used by NAMD

  • Parallel C++ with data driven objects.
  • Asynchronous method invocation.
  • Prioritized scheduling of messages/execution.
  • Measurement-based load balancing.
  • Portable messaging layer.
SLIDE 7

NAMD Hybrid Decomposition

  • Spatially decompose data and communication.
  • Separate but related work decomposition.
  • “Compute objects” facilitate an iterative, measurement-based load balancing system.

Kale et al., J. Comp. Phys. 151:283-312, 1999.

SLIDE 8

NAMD Overlapping Execution

Objects are assigned to processors and queued as data arrives.

[Diagram label: Offload to GPU]

Phillips et al., SC2002.

SLIDE 9

Overlapping GPU and CPU with Communication

[Figure: timeline of one timestep. The GPU computes remote then local forces (f) while the CPU overlaps position updates (x) and communication with other nodes/processes.]

Phillips et al., SC2008

SLIDE 10

NAMD on Petascale Platforms Today

[Figure: performance (ns per day) vs. number of nodes, 1 to 16384, for 21M-atom and 224M-atom systems: Blue Waters XK7 (GTC15), Titan XK7 (GTC15), Edison XC30 (SC14), Blue Waters XE6 (SC14); 2 fs timestep.]

14 ns/day, 79% parallel efficiency on 224M atoms; 25 ns/day, 7 ms/step.

SLIDE 11

Future NAMD Platforms

  • NERSC Cori / Argonne Theta (2016)
    – Knights Landing (KNL) Xeon Phi
    – Single-socket nodes, Cray Aries network
  • Oak Ridge Summit (2018)
    – IBM Power9 CPUs + NVIDIA Volta GPUs
    – 3,400 fat nodes, dual-rail InfiniBand network
  • Argonne Aurora (2018)
    – Knights Hill (KNH) Xeon Phi

SLIDE 12

Charm++ Programming Model

  • Programmer:

– Reasons about (arrays of) chares
– Writes entry methods for chares
– Entry methods send messages

  • Runtime:

– Manages (re)mapping of chares to PEs

SLIDE 13

What if PEs share a host?

  • Communication can bypass network
  • Opportunity for optimization!

– Multicast and reduction trees (easy)
– Communication-aware load balancer (hard)

  • May share a GPU (inefficiently)

– Likely need CUDA Multi-Process Service

SLIDE 14

What if PEs share a host?

  • Charm++ detects “physical nodes”.
  • NAMD optimizations:

– Place patches based on physical nodes.
– Place computes on same physical nodes.
– Optimize trees for patch positions, forces.
– Optimize global broadcast and reductions.

SLIDE 15

What if PEs share a host?

  • Non-SMP NAMD runs are common.

– Avoid bottlenecks in Linux malloc(), free(), etc.
– Don’t waste cores on communication threads.
– Best performance for small simulations.

  • This will likely be changing:

– SMP builds are now faster on Cray for all sizes.
– Fixing communication thread lock contention.

SLIDE 16

What if PEs share a process?

  • Also share a host (see preceding slides).
  • Share one copy of static data.
  • Communicate by passing pointers.
  • Share one CUDA context.

– Use CUDA streams to overlap on GPU.
– Avoid using shared default stream.

SLIDE 17

What if PEs share a process?

  • Each process is a Charm++ “node”.
  • No-pack messages to same-node PEs.
  • Node-level locks and PE-private variables.
  • Messages to “nodegroup” run on any PE.
  • Communication thread handles network.
  • CkLoop for OpenMP-style parallelism.
SLIDE 18

What if PEs share a socket?

  • Shared memory controller and L3 cache.
    – Duplicate data reduces cache efficiency.
    – Work with same data at same time if possible.
  • OpenMP and CkLoop do this naturally.
  • Possible to run one PE/socket and use OpenMP or CkLoop to parallelize across cores on socket.

SLIDE 19

What is most relevant for NAMD?

  • One process per node

– Single-node (multicore)
– Largest simulations, memory limited
– At most one process per GPU/MIC (offload)

  • One or two processes per socket

– Cray XE/XC or 64-core Opteron cluster

  • Manually set CPU affinity:

– E.g., +pemap 0-6,8-14 +commap 7,15

SLIDE 20

Process-level NAMD Today

  • Patch position/force trees

– Use nodegroup to avoid delaying messages

  • GPU/MIC offload

– Aggregate work, serialize control

  • PME electrostatics on petascale

– Single pencil per node to reduce message count

  • PME offload to GPU

– Aggregate work, serialize control

SLIDE 21

Process-level NAMD Idioms

  • Paired group and nodegroup for trees
  • CkLoop or nodegroup for bottlenecks

– PME FFT and transpose messages

  • Access chare on other PE via pointer

– GPU data and result processing

  • GPU control-serialization queue

– Want one PME charge grid CUDA stream per PE
– But there are locks hidden in CUDA calls

SLIDE 22

Serialization Queue Idiom

  • When a PE is ready to submit work to the GPU:
    – Lock the queue.
    – Queue marked as busy? Add work and unlock.
    – Not marked as busy?
      • Mark as busy, unlock queue, submit work, lock queue.
      • While work in queue: unlock, submit work, lock.
      • Mark queue as not busy and unlock queue.

SLIDE 23

Serialization Queue Trace

[Trace panels: a single PE’s work only; a second PE added work; many PEs added work.]

SLIDE 24

Suggested Charm++ Features

  • nodearray chare entries run on any PE in node

– Serialized per element unless [reentrant]

  • [mobile] entries for groups and arrays

– Execute on any PE in node similar to nodegroup

  • Set-exclusive entry points

– Serialize calls on same chare to entries in set

  • Set-reentrant entry points

– Serialize calls outside set on same chare
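In an interface file, the proposed features might look something like this. The syntax below is entirely hypothetical, sketched from the bullets above; `nodearray`, `[mobile]`, and the set-exclusive notation do not exist in Charm++ today, and the chare and message names are made up for illustration.

```cpp
// Hypothetical .ci sketch; not valid Charm++ syntax today.
nodearray [1D] Pencil {
  entry Pencil();
  entry [reentrant] void recvGrid(GridMsg *m);  // may run concurrently per element
};

array [1D] Patch {
  entry [mobile] void recvForces(ForceMsg *m);  // may run on any PE in the node
  // Set-exclusive: calls to entries in the same set on the same chare
  // are serialized with respect to each other.
  entry [exclusive(fsum)] void addForce(ForceMsg *m);
  entry [exclusive(fsum)] void addEnergy(EnergyMsg *m);
};
```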

SLIDE 25

Charm++ Programming Model

  • Programmer:

– Reasons about (arrays of) chares
– Writes entry methods for chares
– Entry methods send messages

  • Runtime:

– Manages (re)mapping of chares to PEs

SLIDE 26

Charm++ Programming Model

  • Programmer:

– Reasons about (arrays of) chares
– Writes entry methods for chares
– Labels entry methods as [reentrant], etc.
– Entry methods send messages

  • Runtime:

– Manages (re)mapping of chares to nodes/PEs

SLIDE 27

What if PEs share a core?

  • Hardware threads up to SMT8 on Power8.
  • Shared L1/L2 cache – same as previous.
  • Shared execution units.

– Busy-waiting can slow PEs on same core.
– Load balancer measurements more variable.

SLIDE 28

Hybrid Charm++/OpenMP?

  • Leverage vendor-optimized OpenMP 4.X
  • One thread team per PE

– Team master thread runs Charm++ scheduler
– Use pthreads, atomics, etc. as now
– Likely one team (PE) per core

  • OpenMP directives distribute loops to threads

– Also SIMD directives for vector instructions

SLIDE 29

Charm++ Programming Model

  • Programmer:

– Reasons about (arrays of) chares
– Writes entry methods for chares
– Labels entry methods as [reentrant], etc.
– Exposes loop-level parallelism via OpenMP
– Entry methods send messages

  • Runtime:

– Manages (re)mapping of chares to nodes/PEs

SLIDE 30

Conclusions

  • Process-level programming is needed.
  • Current Charm++ support is inelegant.
  • Small Charm++ changes expose:

– Additional scheduling flexibility
– Entry point concurrency internal to chares
– Loop-level parallelism internal to chares

SLIDE 31

Thanks to: NIH, NSF, DOE, NCSA, NVIDIA (Sarah Tariq, Patric Zhao, Sky Wu, Justin Luitjens, Nikolai Sakharnykh), Cray (Sarah Anderson, Ryan Olson), NCSA (Robert Brunner), PPL (Eric Bohm, Yanhua Sun, Gengbin Zheng, Nikhil Jain), and 20 years of NAMD and Charm++ developers and users.

James Phillips

Beckman Institute, University of Illinois http://www.ks.uiuc.edu/Research/namd/