SLIDE 1

Refactoring NAMD for Petascale Machines and Graphics Processors

James Phillips
http://www.ks.uiuc.edu/Research/namd/

NIH Resource for Macromolecular Modeling and Bioinformatics, Beckman Institute, UIUC
http://www.ks.uiuc.edu/

SLIDE 2

NAMD Design

  • Designed from the beginning as a parallel program
  • Uses the Charm++ idea:
    – Decompose the computation into a large number of objects
    – Have an intelligent run-time system (Charm++) assign objects to processors for dynamic load balancing
  • Hybrid of spatial and force decomposition:
    – Spatial decomposition of atoms into cubes (called patches)
    – For every pair of interacting patches, create one object for calculating electrostatic interactions (a simplified sketch follows below)
    – Recent: Blue Matter, Desmond, etc. use this idea in some form
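To make the patch/compute decomposition concrete, here is a minimal, self-contained C++ sketch (illustrative only, not actual NAMD or Charm++ code, with a made-up grid size): it builds cube-shaped patches and creates one compute object per pair of neighboring patches.

#include <cstdio>
#include <cstdlib>
#include <vector>

// Illustrative stand-ins for NAMD's patches and pair-compute objects.
struct Patch {              // a cube of atoms ("patch")
    int ix, iy, iz;         // position in the patch grid
};

struct PairCompute {        // computes nonbonded forces for one patch pair
    int patchA, patchB;     // indices into the patch array
};

int main() {
    const int nx = 4, ny = 4, nz = 4;   // made-up patch grid dimensions
    std::vector<Patch> patches;
    for (int x = 0; x < nx; ++x)
        for (int y = 0; y < ny; ++y)
            for (int z = 0; z < nz; ++z)
                patches.push_back({x, y, z});

    // A patch edge is at least the cutoff distance, so only patches within
    // one step of each other can hold interacting atoms; create one compute
    // object per such pair (including a patch with itself).
    std::vector<PairCompute> computes;
    for (int a = 0; a < (int)patches.size(); ++a)
        for (int b = a; b < (int)patches.size(); ++b) {
            int dx = std::abs(patches[a].ix - patches[b].ix);
            int dy = std::abs(patches[a].iy - patches[b].iy);
            int dz = std::abs(patches[a].iz - patches[b].iz);
            if (dx <= 1 && dy <= 1 && dz <= 1)
                computes.push_back({a, b});
        }

    std::printf("%zu patches, %zu pair-compute objects\n",
                patches.size(), computes.size());
    return 0;
}

In NAMD both kinds of objects are Charm++ chares; the runtime system, not a loop like this, decides which processor owns each one, which is what produces the roughly 100,000 assignable objects shown on the next slide.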

SLIDE 3

NAMD Parallelization using Charm++

[Diagram: an example configuration with 847 VPs, 100,000 VPs, and 108 VPs. These 100,000 objects (virtual processors, or VPs) are assigned to real processors by the Charm++ runtime system.]

SLIDE 4

Load Balancing Steps

[Timeline over simulation time: regular timesteps are interleaved with instrumented timesteps; a detailed, aggressive load balancing step comes first, followed later by periodic refinement load balancing.]
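The cycle above can be illustrated with a small, hypothetical C++ sketch (not the actual Charm++ load balancers): per-object loads measured during the instrumented timesteps drive a greedy "aggressive" pass that sends the heaviest objects to the least-loaded processors, and a later "refinement" pass only moves work off the most overloaded processor.

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical per-object loads gathered during instrumented timesteps.
    std::vector<double> load = {4.0, 3.5, 3.0, 2.0, 1.5, 1.0, 0.5, 0.5};
    const int nprocs = 3;
    std::vector<double> procLoad(nprocs, 0.0);
    std::vector<int> owner(load.size(), 0);

    // Aggressive (greedy) phase: heaviest object goes to the least-loaded processor.
    std::vector<int> order(load.size());
    for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return load[a] > load[b]; });
    for (int obj : order) {
        int dest = (int)(std::min_element(procLoad.begin(), procLoad.end())
                         - procLoad.begin());
        owner[obj] = dest;
        procLoad[dest] += load[obj];
    }

    // Refinement phase: move a single object off the most loaded processor
    // if that lowers the maximum load; real refinement repeats such moves
    // cheaply without disturbing the rest of the assignment.
    int maxP = (int)(std::max_element(procLoad.begin(), procLoad.end()) - procLoad.begin());
    int minP = (int)(std::min_element(procLoad.begin(), procLoad.end()) - procLoad.begin());
    for (int i = 0; i < (int)load.size(); ++i) {
        if (owner[i] == maxP && procLoad[minP] + load[i] < procLoad[maxP]) {
            owner[i] = minP;
            procLoad[maxP] -= load[i];
            procLoad[minP] += load[i];
            break;
        }
    }

    for (int p = 0; p < nprocs; ++p)
        std::printf("processor %d load: %.1f\n", p, procLoad[p]);
    return 0;
}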

SLIDE 5

Parallelization on BlueGene/L

  • Sequential Optimizations
  • Messaging Layer Optimizations
  • NAMD parallel tuning
  • Illustrates porting effort

“Inside” help by:

  – Sameer Kumar, former CS/TCB student, now at IBM BlueGene group, tasked by IBM to support NAMD
  – Chao Huang, spent summer at IBM on messaging layer

Performance optimization progression on BlueGene/L (time per step as successive optimizations were applied):

  NAMD v2.5 (blocking)               40 ms
  NAMD v2.6                          25.2 ms
  Fine Grained                       24.3 ms
  Congestion Control                 20.5 ms
  Topology Loadbalancer              14 ms
  Chessboard Dynamic FIFO Mapping    13.5 ms
  Fast Memcpy                        13.3 ms
  Non Blocking                       11.9 ms
  2AwayXY + Spanning tree            8.6 ms (10 ns/day)

SLIDE 6

Fine Grained Decomposition on BlueGene

Decomposing atoms into smaller bricks gives finer-grained parallelism.

[Diagram: smaller patch "bricks" feeding separate force-evaluation and integration objects.]
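A rough illustration of what the splitting buys (loosely modeled on the "2AwayXY" option from the previous slide; the cell dimensions and cutoff are made up): halving the patch edge along X and Y roughly quadruples the number of patches, giving the runtime far more objects to spread over a large machine.

#include <cmath>
#include <cstdio>

int main() {
    const double boxX = 108.9, boxY = 108.9, boxZ = 77.6;  // made-up periodic cell (Angstroms)
    const double cutoff = 12.0;                            // made-up nonbonded cutoff

    // split = 1 is the standard one-patch-per-cutoff decomposition;
    // split = 2 along X and Y mimics a "2AwayXY"-style finer decomposition.
    for (int split = 1; split <= 2; ++split) {
        int nx = (int)std::floor(boxX / (cutoff / split));
        int ny = (int)std::floor(boxY / (cutoff / split));
        int nz = (int)std::floor(boxZ / cutoff);           // Z left unsplit here
        std::printf("splitXY = %d -> patch grid %d x %d x %d = %d patches\n",
                    split, nx, ny, nz, nx * ny * nz);
    }
    return 0;
}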

SLIDE 7

Recent Large-Scale Parallelization (since the proposal was submitted)

  • PME parallelization: needs to be fine grained
    – We recently did a 2-D (pencil-based) parallelization; it will be tuned further (see the sketch after this list)
    – Efficient data exchange between atoms and grid
    – Improvement with pencils on the fibrinogen system (1 million atoms on 1024 processors of the PSC XT3): 0.65 ns/day to 1.2 ns/day
  • Memory issues:
    – New machines will stress memory per node (256 MB per processor on BlueGene/L)
    – NSF's selection of NAMD, and the BAR domain benchmark
    – Plan: partition all static data
    – Preliminary work done: we can now simulate the ribosome on BlueGene/L, and much larger systems on the Cray XT3
  • Interconnection topology:
    – Is becoming a strong factor: bandwidth
    – Topology-aware load balancers in Charm++, some specialized to NAMD

[Diagram: PME pencils mapped onto an X x Y x Z processor grid.]
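For context on why pencil PME is finer grained, here is a minimal C++ sketch (not NAMD's PME code; the grid and pencil sizes are made up): a 1-D "slab" decomposition of the 3-D FFT allows at most K1 concurrent objects, while a 2-D pencil decomposition allows up to K1 x K2 (here grouped into a coarser pencil grid), with the 1-D FFTs done along Z, then Y, then X, and a transpose between each stage.

#include <cstdio>

int main() {
    const int K1 = 108, K2 = 108, K3 = 80;   // made-up PME grid dimensions
    const int pencilsPerDim = 12;            // pencil grid is 12 x 12 here

    long slabObjects   = K1;                                   // 1-D (slab) decomposition
    long pencilObjects = (long)pencilsPerDim * pencilsPerDim;  // 2-D (pencil) decomposition
    std::printf("slab decomposition:   %ld parallel objects\n", slabObjects);
    std::printf("pencil decomposition: %ld parallel objects\n", pencilObjects);

    // Ownership of one grid column (x, y): which Z pencil holds its K3 points
    // for the first round of 1-D FFTs.
    int x = 50, y = 7;
    int px = x * pencilsPerDim / K1;          // pencil index along X
    int py = y * pencilsPerDim / K2;          // pencil index along Y
    std::printf("grid column (%d,%d) -> Z pencil (%d,%d), %d points each\n",
                x, y, px, py, K3);
    return 0;
}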

SLIDE 8

[Charm++ "Projections" analysis tool: Apo-A1 on BlueGene/L, 1024 processors. Time intervals on the x axis; activity added across processors on the y axis. Green: communication; red: integration; blue/purple: electrostatics; turquoise: angle/dihedral; orange: PME.]

Shallow valleys, high peaks, nicely overlapped PME.

94% efficiency

SLIDE 9

Cray XT3, 512 processors: Initial runs

Clearly, further tuning was needed, especially for PME; but there was more potential (much faster processors).

76% efficiency

SLIDE 10

On Cray XT3, 512 processors: after optimizations

96% efficiency

SLIDE 11

Performance on BlueGene/L

[Log-log plot: simulation rate in nanoseconds per day vs. number of processors (up to 100,000) for IAPP (5.5K atoms), Lysozyme (40K atoms), ApoA1 (92K atoms), ATPase (327K atoms), STMV (1M atoms), and BAR domain (1.3M atoms).]

  • STMV simulation at 6.65 ns per day on 20,000 processors
  • IAPP simulation (Rivera, Straub, BU) at 20 ns per day on 256 processors: 1 µs in 50 days (20 ns/day x 50 days)

SLIDE 12

Comparison with Blue Matter

Time per step (ms) for ApoLipoprotein A1 (92K atoms):

  Nodes                  512      1024     2048     4096     8192     16384
  Blue Matter (SC'06)    38.42    18.95    9.97     5.39     3.14     2.09
  NAMD                   18.6     10.5     6.85     4.67     3.2      2.33
  NAMD (Virtual Node)    11.3     7.6      5.1      3.7      3.0      2.33 (CP)

NAMD is about 1.8 times faster than Blue Matter on 1024 processors (and 3.4 times faster with VN mode, where NAMD can use both processors on a node effectively). However, note that NAMD does PME every 4 steps.

SLIDE 13

Performance on Cray XT3

[Log-log plot: simulation rate in nanoseconds per day vs. number of processors (up to 10,000) for IAPP (5.5K atoms), Lysozyme (40K atoms), ApoA1 (92K atoms), ATPase (327K atoms), STMV (1M atoms), BAR domain (1.3M atoms), and Ribosome (2.8M atoms).]

SLIDE 14

NAMD: Practical Supercomputing

  • 20,000 users can't all be computer experts.
    – 18% are NIH-funded; many in other countries.
    – 4200 have downloaded more than one version.
  • User experience is the same on all platforms.
    – No change in input, output, or configuration files.
    – Run any simulation on any number of processors.
    – Automatically split patches and enable pencil PME.
    – Precompiled binaries available when possible.
  • Desktops and laptops – setup and testing
    – x86 and x86-64 Windows, PowerPC and x86 Macintosh
    – Allow both shared-memory and network-based parallelism.
  • Linux clusters – affordable workhorses
    – x86, x86-64, and Itanium processors
    – Gigabit Ethernet, Myrinet, InfiniBand, Quadrics, Altix, etc.

SLIDE 15

NAMD Shines on InfiniBand

[Log-log plot on TACC Lonestar: ns per day vs. cores (4 to 1024) for JAC/DHFR (24k atoms), ApoA1 (92k atoms), and STMV (1M atoms). Annotations: 15 ns/day (5.6 ms/step); auto-switch to pencil PME; 32 ns/day (2.7 ms/step).]

TACC Lonestar is based on Dell servers and InfiniBand: a commodity cluster with 5200 cores! (Everything's bigger in Texas.)

SLIDE 16

Hardware Acceleration for NAMD

Can NAMD offload work to a special-purpose processor?

  • The Resource studied all the options in 2005-2006:
    – FPGA reconfigurable computing (with NCSA): difficult to program, slow floating point, expensive
    – Cell processor (NCSA hardware): relatively easy to program, expensive
    – ClearSpeed (direct contact with the company): limited memory and memory bandwidth, expensive
    – MDGRAPE: inflexible and expensive
    – Graphics processor (GPU): program must be expressed as graphics operations

SLIDE 17

GPU Performance Far Exceeds CPU

  • A quiet revolution – in the games world so far
    – Calculation: 450 GFLOPS vs. 32 GFLOPS
    – Memory bandwidth: 80 GB/s vs. 8.4 GB/s

[Chart: GFLOPS over time for GPUs vs. CPUs. G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800.]

SLIDE 18

CUDA: Practical Performance

November 2006: NVIDIA announces CUDA for the G80 GPU.

  • CUDA makes GPU acceleration usable:
    – Developed and supported by NVIDIA.
    – No masquerading as graphics rendering.
    – New shared memory and synchronization (a minimal example follows below).
    – No OpenGL or display device hassles.
    – Multiple processes per card (or vice versa).
  • The Resource and collaborators make it useful:
    – Experience from VMD development
    – David Kirk (Chief Scientist, NVIDIA)
    – Wen-mei Hwu (ECE Professor, UIUC)

Fun to program (and drive).
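A minimal CUDA example (illustrative only, not NAMD code) of the shared memory and synchronization that CUDA exposed: each thread block stages data in __shared__ memory, waits at a __syncthreads() barrier, and then every thread reads its neighbors' staged values.

#include <cstdio>
#include <cuda_runtime.h>

// Each block loads a tile of the input into on-chip shared memory,
// synchronizes, then computes a simple 3-point average from the tile.
__global__ void smooth(const float *in, float *out, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                       // make all loads visible to the block

    if (i < n) {
        float mid   = tile[threadIdx.x];
        float left  = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : mid;
        float right = (threadIdx.x + 1 < blockDim.x && i + 1 < n) ? tile[threadIdx.x + 1] : mid;
        out[i] = (left + mid + right) / 3.0f;
    }
}

int main() {
    const int n = 1024;
    float h[1024];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);

    smooth<<<n / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("out[10] = %f\n", h[10]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}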

SLIDE 19

GeForce 8800 Graphics Mode

  • New GPUs are built around threaded cores

[Block diagram of the GeForce 8800 in graphics mode: the host and input assembler feed vertex, geometry, and pixel thread issue units plus setup/raster/ZCull; a thread processor schedules work onto groups of streaming processors (SP), each group with an L1 cache and texture fetch (TF) unit, backed by L2 caches and frame buffer (FB) partitions.]

SLIDE 20

GeForce 8800 General Computing

  • Up to 65,535 threads, 128 cores, 450 GFLOPS, 768 MB DRAM, 4 GB/s bandwidth to the CPU

[Block diagram of the same chip in compute mode: the host and input assembler feed a thread execution manager; each processor group pairs a parallel data cache (shared memory) with texture units, and reaches global memory through load/store units.]

SLIDE 21

NVIDIA G80 GPU Hardware

[Block diagram: the streaming processor array is built from texture processor clusters (TPCs); each TPC contains streaming multiprocessors (SMs) and a texture unit. An SM has instruction fetch/dispatch, instruction and data L1 caches, shared memory, streaming processors (SP: ADD, SUB, MAD, etc.), and super function units (SFU: SIN, RSQRT, EXP, etc.).]

SLIDE 22

Nonbonded Forces on CUDA GPU

  • Start with most expensive calculation: direct nonbonded interactions.
  • Decompose work into pairs of patches, identical to NAMD structure.
  • GPU hardware assigns patch-pairs to multiprocessors dynamically.

[Force computation on a single multiprocessor (a GeForce 8800 GTX has 16): a 32-way SIMD multiprocessor runs 32-256 multiplexed threads; patch A coordinates and parameters sit in the 16 kB shared memory; patch B coordinates, parameters, and forces sit in the 32 kB register file; the texture unit (8 kB cache) performs force table interpolation; constant memory (8 kB cache) holds the exclusion data; the 768 MB main memory has no cache and 300+ cycle latency.]

SLIDE 23

texture<float4> force_table;
__constant__ unsigned int exclusions[];
__shared__ atom jatom[];
atom iatom;    // per-thread atom, stored in registers
float4 iforce; // per-thread force, stored in registers

for ( int j = 0; j < jatom_count; ++j ) {
  float dx = jatom[j].x - iatom.x;
  float dy = jatom[j].y - iatom.y;
  float dz = jatom[j].z - iatom.z;
  float r2 = dx*dx + dy*dy + dz*dz;
  if ( r2 < cutoff2 ) {
    float4 ft = texfetch(force_table, 1.f/sqrt(r2));
    bool excluded = false;
    int indexdiff = iatom.index - jatom[j].index;
    if ( abs(indexdiff) <= (int) jatom[j].excl_maxdiff ) {
      indexdiff += jatom[j].excl_index;
      excluded = ((exclusions[indexdiff>>5] & (1<<(indexdiff&31))) != 0);
    }
    float f = iatom.half_sigma + jatom[j].half_sigma; // sigma
    f *= f*f;                                         // sigma^3
    f *= f;                                           // sigma^6
    f *= ( f * ft.x + ft.y );                         // sigma^12 * fi.x - sigma^6 * fi.y
    f *= iatom.sqrt_epsilon * jatom[j].sqrt_epsilon;
    float qq = iatom.charge * jatom[j].charge;
    if ( excluded ) { f = qq * ft.w; }                // PME correction
    else { f += qq * ft.z; }                          // Coulomb
    iforce.x += dx * f;
    iforce.y += dy * f;
    iforce.z += dz * f;
    iforce.w += 1.f;                                  // interaction count or energy
  }
}
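For orientation, here is a hedged, simplified host-side sketch (not the actual NAMD code; all sizes and names are made up) of how an inner loop like the one above is typically driven in CUDA: atom data and a patch-pair list are copied to the GPU, and one thread block is launched per patch pair, matching the previous slide's point that the hardware assigns patch pairs to multiprocessors dynamically.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

struct atom4 { float x, y, z, charge; };   // minimal stand-in for per-atom data

// Placeholder kernel: one block per patch pair, one thread per atom of patch A.
// A real kernel would stage patch B in shared memory and run the inner loop above.
__global__ void patch_pair_forces(const atom4 *atoms, const int2 *pairs,
                                  float4 *forces, int atomsPerPatch) {
    int pair = blockIdx.x;                                   // which patch pair this block handles
    int a = pairs[pair].x * atomsPerPatch + threadIdx.x;     // my atom in patch A
    forces[a] = make_float4(0.f, 0.f, 0.f, 0.f);             // placeholder result
}

int main() {
    const int npatches = 64, atomsPerPatch = 128, npairs = 512;  // made-up sizes

    // Host-side pair list: pair each patch with a neighbor (illustrative only).
    std::vector<int2> pairs(npairs);
    for (int i = 0; i < npairs; ++i)
        pairs[i] = make_int2(i % npatches, (i + 1) % npatches);

    atom4 *d_atoms; int2 *d_pairs; float4 *d_forces;
    cudaMalloc((void **)&d_atoms,  npatches * atomsPerPatch * sizeof(atom4));
    cudaMalloc((void **)&d_pairs,  npairs * sizeof(int2));
    cudaMalloc((void **)&d_forces, npatches * atomsPerPatch * sizeof(float4));
    cudaMemcpy(d_pairs, pairs.data(), npairs * sizeof(int2), cudaMemcpyHostToDevice);
    // (copies of coordinates, parameters, exclusions, and the force table omitted)

    patch_pair_forces<<<npairs, atomsPerPatch>>>(d_atoms, d_pairs, d_forces, atomsPerPatch);
    cudaDeviceSynchronize();
    std::printf("launched %d patch-pair blocks\n", npairs);

    cudaFree(d_atoms); cudaFree(d_pairs); cudaFree(d_forces);
    return 0;
}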

SLIDE 24

Initial GPU Performance

  • Full NAMD, not test harness
  • Useful performance boost:
    – 8x speedup for nonbonded
    – 5x speedup overall without PME
    – 3.5x speedup overall with PME
    – GPU = quad-core CPU
  • Plans for better performance (see the overlap sketch below):
    – Overlap GPU and CPU work.
    – Tune or port remaining work (PME, bonded, integration, etc.).

ApoA1 Performance: [bar chart of seconds per step, broken into nonbonded, PME, and other, for the CPU alone vs. CPU + GPU (lower is faster); hardware: 2.67 GHz Core 2 Quad Extreme + GeForce 8800 GTX.]
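The "overlap GPU and CPU work" plan can be sketched with a small, generic CUDA example (not NAMD code): the nonbonded kernel and its result copy are issued asynchronously on a stream so the CPU can work on other force terms while the GPU runs.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void nonbonded_stub(float *forces, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) forces[i] = 0.f;             // stand-in for the real force kernel
}

void cpu_bonded_and_pme_work() {            // stand-in for CPU-side force terms
    std::printf("CPU working while the GPU computes...\n");
}

int main() {
    const int n = 1 << 20;
    float *h_forces, *d_forces;
    cudaMallocHost((void **)&h_forces, n * sizeof(float));  // pinned memory for async copies
    cudaMalloc((void **)&d_forces, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Queue the GPU work asynchronously on the stream...
    nonbonded_stub<<<(n + 255) / 256, 256, 0, stream>>>(d_forces, n);
    cudaMemcpyAsync(h_forces, d_forces, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    // ...and overlap it with CPU work (bonded forces, PME, integration setup).
    cpu_bonded_and_pme_work();

    cudaStreamSynchronize(stream);          // wait for GPU results before integrating
    std::printf("forces[0] = %f\n", h_forces[0]);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_forces);
    cudaFree(d_forces);
    return 0;
}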

SLIDE 25

Initial GPU Cluster Performance

  • Poor scaling unsurprising:
    – 2x speedup on 4 GPUs
    – Gigabit Ethernet
    – Load balancer disabled
  • Plans for better scaling:
    – InfiniBand network
    – Tune parallel overhead
    – Load balancer changes:
      • Balance GPU load.
      • Minimize communication.

ApoA1 Performance: [bar chart of seconds per step (nonbonded, PME, other) for the CPU alone and for 1-4 GPUs (lower is faster); hardware: 2.2 GHz Opteron + GeForce 8800 GTX.]

SLIDE 26

Next Goal: Interactive MD on GPU

  • Definite need for faster serial IMD:
    – Useful method for tweaking structures.
    – 10x performance yields 100x sensitivity.
    – The on-demand clusters this needs are rare.
  • AutoIMD available in VMD already:
    – Isolates a small subsystem.
    – Specify molten and fixed atoms.
    – Fixed atoms reduce GPU work.
    – Pairlist-based algorithms start to win.
  • Limited variety of simulations:
    – Few users have multiple GPUs.
    – Move entire MD algorithm to GPU.

[Photo: a user (former HHS Secretary Thompson) steering an interactive NAMD simulation through VMD.]