James Phillips Beckman Institute, University of Illinois - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Biomedical Technology Research Center for Macromolecular Modeling and Bioinformatics Beckman Institute, University of Illinois at Urbana-Champaign - www.ks.uiuc.edu

GTC 2015

Attacking HIV with Petascale Molecular Dynamics Simulations on Titan and Blue Waters

James Phillips

Beckman Institute, University of Illinois http://www.ks.uiuc.edu/Research/namd/

slide-2
SLIDE 2


HIV Infective Cycle

[Diagram labels: Virion; Binding; Fusion; Capsid uncoating; Nuclear Import; Integration into the host’s chromatin; Budding; Host Cell]

slide-3
SLIDE 3

HIV Treatment

  • Fusion/Entry inhibitors
  • Protease inhibitors
  • Reverse Transcription (RNA to DNA) inhibitors
  • Integrase inhibitors

Currently no drug targets capsid uncoating or nuclear import!

Host Cell

slide-4
SLIDE 4

HIV Capsid is Much Larger than Previously Simulated Systems

Scale bar: 10 nm

Collaborators: Peijun Zhang, Angela Gronenborn - U. Pittsburgh; Christopher Aiken - Vanderbilt U.

  • G. Zhao, et al. Nature 497 (2013); exp + comp

lysozyme ribosome

HIV virion

All five referees demanded: Only coarse-grained, not all-atom!

Coarse-grained only!
slide-5
SLIDE 5

HIV-1 capsid, HIV-1 virion

186 hexamers, 12 pentamers

slide-6
SLIDE 6
  • G. Zhao, et al. Nature 497 (2013); exp + comp

Key person: Juan Perilla (UIUC)

Modeling of the Hexameric Lattice using Molecular Dynamics Flexible Fitting

slide-7
SLIDE 7

Modeling of the Hexameric Lattice using Molecular Dynamics Flexible Fitting

  • G. Zhao, et al. Nature 497 (2013); exp + comp

Key person: Juan Perilla (UIUC)

slide-8
SLIDE 8

MD Simulation Furnishes Atom-Level Structure of Pentamer-of-Hexamers

1.5 µs (1.3 M atoms) simulation of pentameric center

Closed capsid is made of hexamers-of-hexamers and pentamers-of-hexamers

slide-9
SLIDE 9

HIV capsid contains 186 1300+ proteins,

slide-10
SLIDE 10

One-Microsecond Simulation Includes 64 Million Atoms

Stable!

Key person: Juan Perilla (UIUC)

slide-11
SLIDE 11


2013 HPCwire Editors’ Choice Award for Best Use of HPC in Life Sciences

slide-12
SLIDE 12

Malleability of HIV-1 CA

Hexamer of hexamers: bite angles along chiral axis

Native capsid bite angle distribution (pentamers, hexamers)

1300 proteins in different conformations

  • G. Zhao, et al. Nature 497 (2013)
slide-13
SLIDE 13

A204 E213 K203 I201

Nature 497, 643-646

A204C mutant in vitro

Peijun Zhang - U. Pittsburgh

Curvature is regulated by the trimer interface

  • G. Zhao, et al. Nature 497 (2013)

HIV-CA wild-type in vitro

slide-14
SLIDE 14

Capsid acts as an osmotic regulator

Results from 64 M atom, 1 µs molecular dynamics simulation! Chloride ions permeate through the hexameric center

slide-15
SLIDE 15

CypA TNPO3 CPSF6 NUP153 NUP358 rhTRIM5α TRIMCyp Inhibitor MX2

Cytoplasm Nucleus RNA Premature uncoating

HIV-1 infection

  • Z. Ambrose, C. Aiken, Virology 454-455 (2014) 371–379

HIV-1 uncoating: regulation by host factors

Host cell prevents infection by inducing premature uncoating

nuclear pore Cell factors interacting with HIV capsid!

slide-16
SLIDE 16

MD Simulations Identify a Novel Catalytic Site

CypA Bridge Model: interaction confirmed by NMR

  • Only polarizable force field yields stable bridge interaction

slide-17
SLIDE 17

Competitive binding between CypA and TRIM

  • F. Diaz-Griffero, Viruses (2011)

Binding of cypA: infection

Binding of E2: premature uncoating, no infection

cypA binding pattern prevents TRIM binding, but leaves Nup interactions intact

Key person: Juan Perilla (UIUC)

TRIM lattice

slide-18
SLIDE 18

A204 E213 K203 I201

  • Curvature regulated by trimeric interface
  • CypA bridges adjacent capsid subunits and thereby binds in particular pattern on capsid surface.
  • Ions permeate through the capsid

Chemical Detail (Every Atom) is Essential for Capsid Role

Do not always listen to referees!

Don’t simplify before you understand!

slide-19
SLIDE 19


HIV Acknowledgments

Peijun Zhang, Angela M. Gronenborn
Department of Structural Biology, Center for HIV Protein Interactions
University of Pittsburgh School of Medicine

Christopher Aiken
Department of Pathology and Immunology
Vanderbilt University School of Medicine

Juan R. Perilla, Klaus Schulten
Theoretical and Computational Biophysics Group

Laxmikant Kale
Parallel Programming Lab, Dept. of Computer Science
University of Illinois at Urbana-Champaign

slide-20
SLIDE 20


Developers of the widely used computational biology software VMD and NAMD

  • 250,000 registered VMD users
  • 72,000 registered NAMD users
  • 600 publications (since 1972)
  • Over 54,000 citations
  • 5 faculty members, 8 developers, 1 systems administrator
  • 17 postdocs, 46 graduate students, 3 administrative staff

research projects include: virus capsids, ribosome, photosynthesis, protein folding, membrane reshaping, animal magnetoreception

Tajkorshid, Luthey-Schulten, Stone, Schulten, Phillips, Kale, Mallon

NIH Biomedical Technology Research Center for Macromolecular Modeling and Bioinformatics

Achievements Built on People

Renewed 2012-2017 with 10.0 score (NIH)

slide-21
SLIDE 21


NAMD Serves NIH Users and Goals


Practical Supercomputing for Biomedical Research

  • 72,000 users can’t all be computer experts.
    – 18% are NIH-funded; many in other countries.
    – 21,000 have downloaded more than one version.
    – 5000 citations of NAMD reference papers.
  • One program available on all platforms.
    – Desktops and laptops – setup and testing
    – Linux clusters – affordable local workhorses
    – Supercomputers – free allocations on XSEDE
    – Blue Waters – sustained petaflop/s performance
    – GPUs – next-generation supercomputing
  • User knowledge is preserved across platforms.
    – No change in input or output files.
    – Run any simulation on any number of cores.
  • Available free of charge to all.

Oak Ridge TITAN Hands-On Workshops

slide-22
SLIDE 22


100 Million Atom Simulations Are Not Routine

  • Simulation setup is a black art
    – Tools for adding solvent and ions don’t scale
    – Need to move tools and users towards new “js” file format
  • Still some rough edges
    – Not all NAMD features usable at scale
  • Trajectory and restart output performance
    – New Charm++ I/O library will help address this
  • Simulations require leadership machines
    – Available resources are limited, allocation process is slow
  • Lack of setup/visualization/analysis facilities
slide-23
SLIDE 23


NIH Center Facilities Enable Petascale Biology

Over the past six years the Center has assembled all necessary hardware and infrastructure to prepare and analyze petascale molecular dynamics simulations, and makes these facilities available to visiting researchers.

[Diagram: Petascale Gateway Facility. External Resources (90% of our Computer Power); Simulation Output; 10 Gigabit Network; High-End Workstations Accessible to Visitors.]

slide-24
SLIDE 24


Virtual Facilities Enable Petascale Anywhere

High-end visualization and analysis workstations currently available only in person at the Beckman Institute must be virtualized and embedded at supercomputer centers.

1 Gigabit Network Compressed Video

slide-25
SLIDE 25


Remote Visualization Now

  • TACC Stampede supports this today
    – Includes nodes with 1TB memory
    – Not virtualized, allocate full dedicated node
    – New Maverick cluster added

  • Blue Waters – no visualization resource
  • Titan – new Rhea “viz” cluster drops GPUs
  • NIH Center - using NICE DCV for remote access
slide-26
SLIDE 26


Jim Phillips monitors NAMD performance of thousands of cores on group’s 4K graphics system

slide-27
SLIDE 27

NAMD 2.10 Release (December 2014)

  • Focus on enabling petascale simulations
  • Type 1: Large systems of ~100 million atoms
    – Scalable to all of Blue Waters or Titan (Phillips et al., SC14)
    – In regular production use for multiple biomedical driving projects
    – Amaro (UCSD) allocation on Blue Waters for 210M-atom influenza virus
  • Type 2: Replica exchange simulations of smaller systems
    – Improved performance over NAMD 2.9, especially with GPUs
    – Scalable multiple copy algorithms, Comp. Phys. Comm. 185:908-16
    – Multiple-walker adaptive biasing force, J. Chem. Theo. Comp. 10:5276-85
    – Adaptive multilevel splitting, ESAIM Proc. (in press)
  • Various other improvements
    – Xeon Phi port, GPU improvements including PME offload
    – Semi- and non-periodic long-range electrostatics (multilevel summation)

slide-28
SLIDE 28

NAMD Replica Exchange Example Application: Complete Description of Transport Cycle

[Figure: transport cycle among four states (IF apo, IF bound, OF apo, OF bound), with "in"/"out" labels on each state and the apo and +substrate branches marked]

Law, et al., Biochemistry 46, 12190 (2007).

slide-29
SLIDE 29

Advanced Replica Exchange Simulation Protocol Requiring a Combination of Multiple Collective Variables

[Figure: replica exchange stages connecting the IF apo, IF bound, OF apo, and OF bound states. Stage labels: 12 replicas x 40 ns (H1/H7); 50 replicas x 20 ns (10 Hs); 12 replicas x 40 ns (H1/H7); 24 replicas x 20 ns (H1/H7); 200 replicas (2D) x 5 ns; 50 replicas x 20 ns; 30 r x 20 ns (five times); 150 replicas]

Mahmoud Moradi

slide-30
SLIDE 30

[Figure: pore radius (Å) versus z (Å) for the OF, IF, and TS states (panels a and b); pore radius at periplasmic gate (Å) versus pore radius at cytoplasmic gate (Å) for OFa, IFa, TSa, OFb, IFb, TSb, with the Occluded Region, Outward-Facing, and Inward-Facing areas marked; Periplasmic Gate panels]

slide-31
SLIDE 31

Computational Structural Biology and Molecular Biophysics Group (CSBMB)

csbmb.beckman.illinois.edu

Collaborating Labs

  • H. Mchaourab (Vanderbilt)
  • R. Nakamoto (U. Virginia)
  • D.-N. Wang (NYU)

R01-GM086749, U54-GM087519, R01-GM101048, P41-GM104601

Mahmoud Moradi, Giray Enkavi, Jing Li, Po-Chao Wen, Sundar Thangapandian, Noah Trebesch

Anton, BLUE WATERS

slide-32
SLIDE 32


NAMD is based on Charm++

Complete info at charmplusplus.org

slide-33
SLIDE 33


Charm++ Used by NAMD

  • Parallel C++ with data driven objects.
  • Asynchronous method invocation.
  • Prioritized scheduling of messages/execution.
  • Measurement-based load balancing.
  • Portable messaging layer.
slide-34
SLIDE 34


NAMD Hybrid Decomposition

  • Spatially decompose data and communication.
  • Separate but related work decomposition.
  • “Compute objects” facilitate iterative, measurement-based load balancing system.

Kale et al., J. Comp. Phys. 151:283-312, 1999.

slide-35
SLIDE 35


Phillips et al., SC2002.

Offload to GPU

Objects are assigned to processors and queued as data arrives.

NAMD Overlapping Execution

slide-36
SLIDE 36


Overlapping GPU and CPU with Communication

[Figure: timeline of one timestep. GPU row: Remote Force, then Local Force. CPU row: Remote and Local work, then Update. Positions (x) and forces (f) are exchanged with Other Nodes/Processes.]

Phillips et al., SC2008

slide-37
SLIDE 37


Actual Timelines from NAMD


Generated using Charm++ tool “Projections” http://charm.cs.uiuc.edu/

[Figure: measured GPU and CPU timelines showing the Remote Force and Local Force kernels with position (x) and force (f) transfers]

slide-38
SLIDE 38

[Timeline legend: Non-bonded local kernel; Non-bonded remote kernel; Results from GPU; Incoming positions; Integration; Bonded (on CPU); Inter-node communication]

slide-39
SLIDE 39


Enabling Remote/Local Overlap

  • Asking for priorities since 2008
    – Critical for Charm++ performance on CPU
    – With Kepler we get 1 bit on Tesla/Quadro
  • Doesn’t order grid launches:
  • Workaround is small memset in low-priority stream
    – Doesn’t need priorities, so works on GeForce cards too!

[Figure: stream timelines with memcpy, high prio, and low prio work; one ordering is marked "Good!" and the other "Less good"]

slide-40
SLIDE 40


Kepler Shuffle Instructions

  • Reductions for energy and pressure tensor
  • Old implementation limits synchronization:
    – Reduce multiple fields at same time
    – Warp-synchronous for final stages
  • Shuffle implementation is simpler and faster!
    – Except now preprocessor code for older devices
    – “diff -D KEPLER_SHUFFLE” is very helpful
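The shuffle-based reduction mentioned above can be illustrated on the host. The following is a minimal sketch in plain C++, not NAMD's kernel code: it emulates a 32-lane warp and a shuffle-down step (lane i reads lane i+delta, keeping its own value when the source lane is out of range), then sums the warp in five halving steps. The names `shuffle_down` and `warp_reduce_sum` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;

// Emulates one shuffle-down step for a whole warp: lane i reads the value
// held by lane i + delta; a lane whose source is out of range keeps its own.
inline std::array<float, kWarpSize> shuffle_down(
    const std::array<float, kWarpSize>& lanes, std::size_t delta) {
  assert(delta > 0 && delta < kWarpSize);
  std::array<float, kWarpSize> out{};
  for (std::size_t i = 0; i < kWarpSize; ++i) {
    out[i] = (i + delta < kWarpSize) ? lanes[i + delta] : lanes[i];
  }
  return out;
}

// Tree reduction in log2(32) = 5 steps; lane 0 ends up holding the warp sum.
// Upper lanes hold partial garbage afterwards, as in the usual device idiom.
inline float warp_reduce_sum(std::array<float, kWarpSize> lanes) {
  for (std::size_t delta = kWarpSize / 2; delta > 0; delta /= 2) {
    const std::array<float, kWarpSize> shifted = shuffle_down(lanes, delta);
    for (std::size_t i = 0; i < kWarpSize; ++i) lanes[i] += shifted[i];
  }
  return lanes[0];
}
```

On a device the inner loops disappear because every lane runs the same instruction at once; no shared memory or explicit synchronization is needed within the warp, which is presumably why the slide calls the shuffle version simpler.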

slide-41
SLIDE 41


Maxwell Performance

[Bar chart: ms/step (axis ticks 4.5, 9, 13.5, 18) for K80 (1/2), K80, Titan Black, GTX 980, M6000, Titan X]

NAMD ApoA1 benchmark on 14 cores 2.6 GHz E5-2650 v2 or E5-2660 v3

Titan X is 60% faster than Titan Black, 30% faster than GTX 980.

slide-42
SLIDE 42


CUDA 7

  • We’ve heard of it.
  • Looking forward to C++11.
  • Runtime compilation might be awesome.
  • We’ve also heard of CUDA 6.5.
  • It will be available on Cray XK7 “soon”.
  • Until then we’re stuck with CUDA 5.5.


slide-43
SLIDE 43


Trends Affecting Performance

  • GPU performance increasing
    – Performance limit will be code on CPU
    – Most highly tuned CPU code moved to GPU
    – Remaining CPU code is also less efficient
    – Therefore CPU must run serial code well

  • CPU serial performance static
  • CPU core counts increasing
slide-44
SLIDE 44


Suggested Strategy

  • Focus on CPU-side code
    – Port to GPU or optimize/parallelize on CPU
    – Stream results off GPU to increase overlap
    – Use CPUs with best single-thread performance
  • Focus on communication
    – Reduce communication overhead on CPU
    – General parallel scalability improvements
    – Map decomposition to machine torus topology

  • Also applies to replica exchange partitions
slide-45
SLIDE 45


Torus Adaptation

  • Job partitioning for multiple copy sampling algorithms
  • Mapping NAMD spatial decomposition domains onto machine torus
  • Mapping particle-mesh Ewald (PME) electrostatics onto spatial decomposition

Additional Techniques

  • Coarsening of PME grid to reduce long-range communication
  • Offloading of PME interpolation onto GPUs
  • Removal of implicit synchronization in pressure control algorithm

Phillips et al., SC14

slide-46
SLIDE 46


Irregular Torus Topologies

  • IBM Blue Gene L/P/Q provide jobs with complete, regular, power-of-two torus.
  • Cray XE/XK job topology is unpredictable.
    – Scheduler works around already running jobs.
    – May not be compact or contiguous.
  • New Blue Waters scheduler addresses this.
    – Even full-machine jobs skip over I/O nodes.

slide-47
SLIDE 47


Convert Torus to Optimized Mesh

  • Start with Charm++ TopoManager API
    – Provides node coordinates and torus dimensions.
  • Extend with TopoManagerWrapper class
    – Ensure same torus coordinates for entire physical node.
    – Shift torus coordinates to eliminate largest gap in node list.
    – Re-order dimensions from longest to shortest occupied span.
    – Provide functions for sorting list of ranks along ordered list of dimensions by “snake scanning” curve (seen on next slide).
  • Recursive bisection on these “snake scanning” curves is the basis of all torus-mapping algorithms to follow.
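Two of the steps listed above, shifting coordinates to eliminate the largest gap and ordering nodes along a snake-scanning curve, can be sketched in a few lines. This is an illustrative C++ sketch under simplifying assumptions (one periodic dimension for the shift, a 2D mesh for the scan); it does not use the Charm++ TopoManager API, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Given the occupied coordinates along one periodic (torus) dimension of
// length dim_size, return the coordinate that should become the new origin:
// the first occupied coordinate after the largest cyclic run of empty ones.
inline int origin_after_largest_gap(std::vector<int> occupied, int dim_size) {
  assert(!occupied.empty() && dim_size > 0);
  std::sort(occupied.begin(), occupied.end());
  occupied.erase(std::unique(occupied.begin(), occupied.end()), occupied.end());
  int best_gap = 0;
  int origin = occupied.front();
  for (std::size_t i = 0; i < occupied.size(); ++i) {
    const int here = occupied[i];
    const int next = occupied[(i + 1) % occupied.size()];
    // Number of empty coordinates between here and next, wrapping around.
    const int gap = (next - here - 1 + dim_size) % dim_size;
    if (gap > best_gap) {
      best_gap = gap;
      origin = next;
    }
  }
  return origin;
}

// Shift a coordinate so that origin maps to 0 on the periodic dimension.
inline int shift_coord(int c, int origin, int dim_size) {
  return (c - origin + dim_size) % dim_size;
}

// Snake scan over an nx-by-ny mesh: even rows run left to right, odd rows
// right to left, so consecutive entries are always mesh neighbors.
inline std::vector<std::pair<int, int>> snake_scan(int nx, int ny) {
  assert(nx > 0 && ny > 0);
  std::vector<std::pair<int, int>> order;
  for (int y = 0; y < ny; ++y) {
    for (int i = 0; i < nx; ++i) {
      const int x = (y % 2 == 0) ? i : nx - 1 - i;
      order.emplace_back(x, y);
    }
  }
  return order;
}
```

For example, a job occupying coordinates 6, 7, 0, 1 on a ring of 8 wraps around the torus seam; choosing 6 as the origin maps it to the compact span 0 to 3, after which the dimension can be treated as an ordinary mesh.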

slide-48
SLIDE 48


Mapping Charm++ Partitions

Bisection 1 Bisection 2 Gaps in torus

slide-49
SLIDE 49


Mapping NAMD Spatial Domains

  • Priorities are:
    1. Evenly distributed patch load across available PEs
    2. Compact patch set within physical node to minimize communication
    3. Torus topology adaptation – only impacts largest runs
  • Simultaneous recursive bisection of patch mesh and PE mesh:
    – Re-order patch and PE mesh dimensions longest to shortest.
    – When dividing PEs, divide patches along corresponding dimension, if possible, before falling back to next-longest dimension.
    – Divide PEs on physical node boundaries.
    – Divide patches to balance load with at least one patch per PE.
  • Within physical node, sort patches along PME slabs/pencils.
slide-50
SLIDE 50


Mapping NAMD Spatial Domains

Processors Patches Gaps in torus

slide-51
SLIDE 51


Mapping NAMD Spatial Domains

Processors Patches Gaps in torus

slide-52
SLIDE 52


Mapping PME Electrostatics

  • Want to align X-Y grid of Z pencils to patches.
    – Needs to work even on non-torus machines.
  • Assign X-Y coordinates to PEs.
    – Average coordinate of patches on PE (or node, etc.)
  • Recursively bisect…
    – Z pencils on longer dimension boundary (5x5=2x5+3x5).
    – PEs proportionately (25=10+15) on same coordinate.
  • Optimize Y-X-Y FFT transposes by placing X and Y pencils with same Z coordinate on contiguous ranks.
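The recursive bisection described above (split the pencil grid on its longer dimension, split the PEs proportionately along the same coordinate) can be sketched as follows. This is a simplified C++ illustration, not NAMD's implementation: the `Pe` record, the function names, the requirement that there be at least as many PEs as pencils, and the choice of the first PE in each final range are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical PE record: x, y is the average X-Y coordinate of its patches.
struct Pe { int rank; double x; double y; };

// Assign the pencil block [x0,x1) x [y0,y1) to pes[first,last).
// owner[x + y*nx] receives the rank chosen for each Z pencil.
inline void bisect(std::vector<Pe>& pes, std::size_t first, std::size_t last,
                   int x0, int x1, int y0, int y1, int nx,
                   std::vector<int>& owner) {
  const int w = x1 - x0, h = y1 - y0;
  assert(w > 0 && h > 0);
  const std::size_t pencils =
      static_cast<std::size_t>(w) * static_cast<std::size_t>(h);
  const std::size_t npes = last - first;
  assert(npes >= pencils);  // sketch assumes at least one PE per pencil
  if (pencils == 1) {
    owner[static_cast<std::size_t>(x0 + y0 * nx)] = pes[first].rank;
    return;
  }
  const bool split_x = (w >= h);  // cut across the longer dimension
  const auto lo = pes.begin() + static_cast<std::ptrdiff_t>(first);
  const auto hi = pes.begin() + static_cast<std::ptrdiff_t>(last);
  std::sort(lo, hi, [split_x](const Pe& a, const Pe& b) {
    return split_x ? a.x < b.x : a.y < b.y;
  });
  const int cut = (split_x ? w : h) / 2;  // e.g. 5 -> 2 + 3
  const std::size_t low_pencils =
      static_cast<std::size_t>(cut) * static_cast<std::size_t>(split_x ? h : w);
  const std::size_t low_pes = npes * low_pencils / pencils;  // 25 -> 10 + 15
  const std::size_t mid = first + low_pes;
  if (split_x) {
    bisect(pes, first, mid, x0, x0 + cut, y0, y1, nx, owner);
    bisect(pes, mid, last, x0 + cut, x1, y0, y1, nx, owner);
  } else {
    bisect(pes, first, mid, x0, x1, y0, y0 + cut, nx, owner);
    bisect(pes, mid, last, x0, x1, y0 + cut, y1, nx, owner);
  }
}

// Convenience wrapper: returns the owning rank of every pencil, row-major.
inline std::vector<int> assign_pencils(std::vector<Pe> pes, int nx, int ny) {
  std::vector<int> owner(
      static_cast<std::size_t>(nx) * static_cast<std::size_t>(ny), -1);
  bisect(pes, 0, pes.size(), 0, nx, 0, ny, nx, owner);
  return owner;
}
```

With a 5x5 pencil grid and 25 PEs whose coordinates already lie on that grid, the first cut reproduces the slide's 2x5 + 3x5 and 10 + 15 split, and each pencil ends up on the PE at the same X-Y position.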

slide-53
SLIDE 53


NAMD PME CUDA Kernel

  • Bottleneck for 100M atoms is PME FFT communication
    – Switch from 4th-order to 8th-order interpolation on coarser grid
  • Doing 8th-order PME on GPU improves critical path
  • CPU may be bottleneck for 8th-order PME
    – Especially as GPU non-bonded gets faster…
  • Simplest design that might possibly work:
    – One stream per host PE (preserve control flow)
    – One atom per warp with warp-synchronous programming
    – Atomics to accumulate charge grid in global memory
      • One per thread so accesses coalesce
      • Also build “used” flags arrays for x-y pencils and z plane
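The trade-off behind the move to 8th-order interpolation on a coarser grid can be made concrete with a little arithmetic. The sketch below is illustrative only: order-n interpolation spreads each atom's charge over n points per dimension, and the coarsening factor of 2 per dimension used in the usage note is an assumption, since the slide does not state the actual grid spacings.

```cpp
#include <cassert>

// Grid points touched per atom when spreading charge with order-n
// interpolation (n points along each of the three dimensions).
inline long stencil_points(int order) {
  assert(order > 0);
  return static_cast<long>(order) * order * order;
}

// Total grid points of a cubic PME grid with points_per_dim points per
// dimension after coarsening by an integer factor (assumed, see lead-in).
inline long coarsened_grid_points(int points_per_dim, int factor) {
  assert(points_per_dim > 0 && factor > 0 && points_per_dim % factor == 0);
  const long n = points_per_dim / factor;
  return n * n * n;
}
```

Going from 4th to 8th order raises the per-atom work from 64 to 512 grid points, while an assumed factor-of-2 coarsening shrinks the grid that the FFT transposes must communicate by a factor of 8. That shifts cost from the network to local arithmetic, which fits the slide's point that the interpolation then belongs on the GPU.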
slide-54
SLIDE 54


PME Kernel Aggregation

  • Initial version slower than PME on CPU
  • First, one launch per PE, not per patch
  • Second, one charge array per node
    – First version to beat PME on CPU
    – Node-level coordination a challenge in Charm++
    – Reduces number of messages sent per node!

  • Need to backport to PME on CPU version
  • May help CPU-only version, but not as much
slide-55
SLIDE 55

[Timeline legend: Non-bonded local kernel; Non-bonded remote kernel; Results from GPU; Incoming positions; Integration; Bonded (on CPU)]
slide-56
SLIDE 56

[Timeline legend: Non-bonded kernels; PME kernel submissions; PME kernel polling]

slide-57
SLIDE 57

PME kernel priority streams steal slots from non-bonded

Non-bonded kernel runs longer to calculate pair-lists

slide-58
SLIDE 58


Performance Results

  • Petascale simulation preparation is not easy.
    – Benchmarks based on 1.06M-atom STMV
    – 5x2x2 grid = 21M atoms ~ “small petascale”
    – 7x6x5 grid = 224M atoms ~ “Influenza virus”
  • Experiment by disabling optimizations
    – Only disable one at a time, not cumulatively.
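The benchmark sizes above follow from tiling copies of the STMV system on a grid. A small C++ check, assuming the commonly quoted STMV benchmark size of 1,066,628 atoms (the slide gives only "1.06M", so the exact count is an assumption), with a hypothetical helper name:

```cpp
#include <cassert>

// Assumed atom count of one STMV benchmark cell; the slide says only "1.06M".
constexpr long long kStmvAtoms = 1066628;

// Atoms in an nx x ny x nz grid of replicated STMV cells.
inline long long replicated_atoms(int nx, int ny, int nz) {
  assert(nx > 0 && ny > 0 && nz > 0);
  return kStmvAtoms * nx * ny * nz;
}
```

Under that assumption, the 5x2x2 grid holds 20 copies (21,332,560 atoms, the "21M" case) and the 7x6x5 grid holds 210 copies (223,991,880 atoms, the "224M" case).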

Huge system in 2006

slide-59
SLIDE 59


Benchmarking Caution

  • Cray XE/XK performance varies due to:
    – Compactness of nodes assigned to job
    – Other jobs running on machine (cross-traffic)
    – I/O activity (more Blue Waters than Titan)
  • To test performance impact of changes, run old and new back-to-back in same job.

slide-60
SLIDE 60

SC14

NAMD Topology Mapping on Titan Cray XK7

(2fs timestep)

50% increase

Phillips et al., SC14

slide-61
SLIDE 61

SC14

Other NAMD Optimizations on Titan Cray XK7

(2fs timestep)

[Chart annotations: +10%, +20%, +100%]

Phillips et al., SC14

slide-62
SLIDE 62

SC14

NAMD on Torus and Non-torus Networks

(2fs timestep)

Phillips et al., SC14

slide-63
SLIDE 63


Streaming GPU Results to CPU

  • Allows incremental results from a single grid to be processed on CPU before grid finishes on GPU
  • Allows merging and prioritizing of remote and local work
  • GPU side:
    – Write results to host-mapped memory (also without streaming)
    – __threadfence_system() and __syncthreads()
    – Atomic increment for next output queue location
    – Write result index to output queue
  • CPU side:
    – Poll end of output queue (int array) in host memory

slide-64
SLIDE 64


Streaming on GPU:

    if ( force_ready_queue ) {
      // make this block's force writes visible to the host, then wait for all threads
      __threadfence_system();
      __syncthreads();
      if (threadIdx.x == 0) {
        // claim the next queue slot; the counter wraps at force_lists_size-1
        int old = atomicInc(force_list_counters,force_lists_size-1);
        force_ready_queue[old] = myPatchPair.patch1_force_list_index;
        __threadfence_system();
      }
    }

Polling on host:

    // -1 marks a slot that has not been published yet
    while ( -1 != (flindex = force_ready_queue[force_ready_queue_next]) ) {
      force_ready_queue[force_ready_queue_next] = -1;
      ++force_ready_queue_next;
      …process output flindex…
    }
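The queue protocol in the two snippets above can be exercised without a GPU. The following is a single-threaded C++ emulation, offered as a sketch rather than as NAMD code: `publish` stands in for thread 0 of a finished block (including the wrap-around rule of CUDA's `atomicInc`), and `poll` mirrors the host loop. The struct and method names are hypothetical, and a real implementation needs the memory fences and host-mapped memory shown in the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ForceReadyQueue {
  std::vector<int> slots;  // -1 marks "nothing published here yet"
  unsigned counter = 0;    // next slot to publish into (GPU side)
  std::size_t next = 0;    // next slot to poll (host side)

  explicit ForceReadyQueue(std::size_t n) : slots(n, -1) { assert(n > 0); }

  // GPU side (emulated): a finished block claims a slot and writes its index.
  void publish(int force_list_index) {
    assert(force_list_index >= 0);
    const unsigned old = counter;
    // Same wrap rule as atomicInc(&counter, size - 1).
    counter = (old >= slots.size() - 1) ? 0u : old + 1u;
    slots[old] = force_list_index;
  }

  // Host side: drain every entry published so far, in publication order.
  std::vector<int> poll() {
    std::vector<int> ready;
    while (next < slots.size() && slots[next] != -1) {
      ready.push_back(slots[next]);
      slots[next] = -1;  // reset the slot for reuse
      ++next;
    }
    return ready;
  }
};
```

Calling `poll` between `publish` calls returns only the indices that have arrived since the previous poll, which is the property that lets the CPU start processing forces before the whole grid has finished.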

slide-65
SLIDE 65


Controlling Output Order

  • Blocks have widely varying runtimes
  • Input order is not output order
  • Non-streaming was simple, just sort large to small

[Figure: two block schedules for blocks 1 to 9. Output: 2,3,4,5,1,7,8,6,9. The difference between the schedules is marked "Savings".]

slide-66
SLIDE 66


Controlling Output Order

  • First use reversed priorities as input order
  • Then reverse output order to use as input
  • Provides good ordering and near-ideal compactness

[Figure: two block schedules for blocks 9 to 1. Output: 8,7,6,5,9,3,2,4,1.]
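The two-pass ordering trick above (launch in reversed priority order, record the completion order, then reverse that as the next launch order) can be tried on a toy scheduler. This C++ sketch replaces the GPU with a simple model in which each block goes to whichever of a fixed number of execution slots frees up first; the model, the runtimes, and the function names are assumptions for illustration. On real hardware the completion order is recorded by the kernel itself rather than simulated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy GPU model: blocks are dispatched in input order to the slot that frees
// up first; returns block ids in order of completion (ties: dispatch order).
inline std::vector<int> completion_order(const std::vector<int>& input_order,
                                         const std::vector<int>& runtime,
                                         std::size_t slots) {
  assert(slots > 0);
  struct Done { long finish; std::size_t dispatch; int block; };
  std::vector<long> free_at(slots, 0);
  std::vector<Done> done;
  for (std::size_t i = 0; i < input_order.size(); ++i) {
    auto slot = std::min_element(free_at.begin(), free_at.end());
    *slot += runtime[static_cast<std::size_t>(input_order[i])];
    done.push_back({*slot, i, input_order[i]});
  }
  std::sort(done.begin(), done.end(), [](const Done& a, const Done& b) {
    return a.finish != b.finish ? a.finish < b.finish : a.dispatch < b.dispatch;
  });
  std::vector<int> out;
  for (const Done& d : done) out.push_back(d.block);
  return out;
}

// Pass 1: run with reversed priorities. Pass 2: reverse the observed output
// order and use it as the input order for subsequent launches.
inline std::vector<int> tuned_input_order(std::vector<int> priority_order,
                                          const std::vector<int>& runtime,
                                          std::size_t slots) {
  std::reverse(priority_order.begin(), priority_order.end());
  std::vector<int> out = completion_order(priority_order, runtime, slots);
  std::reverse(out.begin(), out.end());
  return out;
}
```

With four blocks of runtimes 1, 1, 4, 2 on two slots and desired output order 0, 1, 2, 3, launching in priority order completes as 0, 1, 3, 2, whereas the tuned launch order 0, 2, 1, 3 completes as 0, 1, 2, 3 and finishes one time unit sooner in this toy model.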

slide-67
SLIDE 67


Controlling Output Order

  • Requires very little code to save order
  • Does not require measuring block runtimes
  • Better than old heuristic ordering
  • Streaming wins even on single node!

    if (threadIdx.x == 0 && block_order) {
      int old = atomicInc(force_list_counters+1,total_block_count-1);
      block_order[old] = block_begin + blockIdx.x;
    }

slide-68
SLIDE 68


Controlling Output Order

  • But what is optimal output order?
  • Remote before local (same as before)
  • Distribute local across threads
  • Slight preference for GPU host thread
  • Local without remote proxies last
  • Not yet implemented
slide-69
SLIDE 69


Non-Streaming Kernel

slide-70
SLIDE 70


Streaming Kernel

slide-71
SLIDE 71

[Timeline legend: Non-bonded local kernel; Non-bonded remote kernel; Results from GPU; Incoming positions; Integration; Bonded (on CPU)]
slide-72
SLIDE 72


slide-73
SLIDE 73


slide-74
SLIDE 74

[Timeline legend: Non-bonded kernels; PME kernel submissions; PME kernel polling]

slide-75
SLIDE 75


slide-76
SLIDE 76


slide-77
SLIDE 77


Non-Streaming Kernel Pairlist Step

slide-78
SLIDE 78


Streaming Kernel Pairlist Step

slide-79
SLIDE 79


New Streaming Kernel Performance

[Chart: Performance (ns per day, 2fs timestep) versus Number of Nodes (256 to 4096) for 21M atoms. Series: Blue Waters XK7 (new streaming); Blue Waters XK7 (no streaming). Annotations: +10%, +30%, +10-30%.]

slide-80
SLIDE 80


Parallelize PME Within Node

[Chart: Performance (ns per day, 2fs timestep) versus Number of Nodes (256 to 4096) for 21M atoms. Series: Blue Waters XK7 (new streaming, tuned PME); Blue Waters XK7 (new streaming); Blue Waters XK7 (no streaming). Annotations: 2.9 ms/step, 60 ns/day; +10-30%; +5%.]
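The paired figures on these charts (ms/step and ns/day) are related by the 2 fs timestep. A short C++ sketch of the conversion, with a hypothetical function name:

```cpp
#include <cassert>

// Simulated nanoseconds per wall-clock day, given the wall-clock cost of one
// step in milliseconds and the MD timestep in femtoseconds.
inline double ns_per_day(double ms_per_step, double timestep_fs) {
  assert(ms_per_step > 0.0 && timestep_fs > 0.0);
  const double steps_per_day = 86400.0 * 1000.0 / ms_per_step;
  return steps_per_day * timestep_fs * 1e-6;  // 1 fs = 1e-6 ns
}
```

At a 2 fs timestep this reduces to 172.8 divided by the ms/step value, so 2.9 ms/step gives about 59.6 ns/day and 7 ms/step gives about 24.7 ns/day, consistent with the rounded "60 ns/day" and "25 ns/day" labels in the deck.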

slide-81
SLIDE 81


Blue Waters vs Titan

[Chart: Performance (ns per day, 2fs timestep) versus Number of Nodes (256 to 4096) for 21M atoms. Series: Blue Waters XK7 (new streaming, tuned PME); Titan XK7 (new streaming, tuned PME); Titan XK7 (SC14, old streaming); Blue Waters XK7 (no streaming). Annotations: 5%; 25%; 5-25%; 2.9 ms/step, 60 ns/day.]

slide-82
SLIDE 82


Topology-Aware Scheduling on Blue Waters

  • Map jobs to convex sets to avoid network interference
  • NCSA, Cray, Adaptive
  • Just enabled January 13
  • Most likely explanation for Blue Waters performance advantage over Titan
  • See Enos et al., CUG 2014
slide-83
SLIDE 83


Blue Waters vs Titan

[Chart: Performance (ns per day, 2fs timestep) versus Number of Nodes (256 to 16384) for 21M atoms and 224M atoms. Series: Blue Waters XK7 (GTC15); Titan XK7 (GTC15); Titan XK7 (SC14). Annotation: 7 ms/step, 25 ns/day.]

slide-84
SLIDE 84


Comparison with CPU-only Machines

[Chart: Performance (ns per day, 2fs timestep) versus Number of Nodes (256 to 16384) for 21M atoms and 224M atoms. Series: Blue Waters XK7 (GTC15); Titan XK7 (GTC15); Edison XC30 (SC14); Blue Waters XE6 (SC14). Annotations: +50-200%, +100-200%, +70%.]

slide-85
SLIDE 85


Conclusions

  • In biology chemical detail is critical.
  • Remote visualization will be necessary.
  • Replica exchange enables long timescales.
  • Map decomposition to network topology.
  • Stream results from GPU in priority order.
  • Bad scheduling harms performance.
slide-86
SLIDE 86


Thanks to: NIH, NSF, DOE, NCSA,
 NVIDIA (Sarah Tariq, Patric Zhao, Sky Wu, Justin Luitjens, Nikolai Sakharnykh),
 Cray (Sarah Anderson, Ryan Olson), NCSA (Robert Brunner),
 PPL (Eric Bohm, Yanhua Sun, Gengbin Zheng, Nikhil Jain)
 and 19 years of NAMD and Charm++ developers and users.

James Phillips

Beckman Institute, University of Illinois http://www.ks.uiuc.edu/Research/namd/