

SLIDE 1

5th Annual Workshop on Charm++ and Applications Welcome and Introduction

“State of Charm++”

Laxmikant Kale

http://charm.cs.uiuc.edu

Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana-Champaign

SLIDE 2

A Glance at History

  • 1987: Chare Kernel arose from parallel Prolog work

– Dynamic load balancing for state-space search, Prolog, ..

  • 1992: Charm++
  • 1994: Position Paper:

– Application Oriented yet CS Centered Research
– NAMD: 1994, 1996

  • Charm++ in almost current form: 1996-1998

– Chare arrays
– Measurement Based Dynamic Load balancing

  • 1997 : Rocket Center: a trigger for AMPI
  • 2001: Era of ITRs:

– Quantum Chemistry collaboration
– Computational Astronomy collaboration: ChaNGa

SLIDE 3

Outline

  • What is Charm++

– and why is it good

  • Overview of recent results

– Language work: raising the level of abstraction
– Domain Specific Frameworks: ParFUM

  • Geubelle: crack propagation
  • Haber: space-time meshing

– Applications

  • NAMD (picked by NSF, new scaling results to 32k procs.)
  • ChaNGa: released, gravity performance
  • LeanCP

– Use at National centers
– BigSim
– Scalable Performance tools
– Scalable Load Balancers
– Fault tolerance
– Cell, GPGPUs, ..
– Upcoming Challenges and Opportunities:

  • Multicore
  • Funding
SLIDE 4

PPL Mission and Approach

  • To enhance Performance and Productivity in programming complex parallel applications

– Performance: scalable to thousands of processors
– Productivity: of human programmers
– Complex: irregular structure, dynamic variations

  • Approach: Application Oriented yet CS centered research

– Develop enabling technology for a wide collection of apps.
– Develop, use and test it in the context of real applications

  • How?

– Develop novel parallel programming techniques
– Embody them into easy-to-use abstractions
– So application scientists can use advanced techniques with ease
– Enabling technology: reused across many apps

SLIDE 5

Migratable Objects (aka Processor Virtualization)

[Figure: user view vs. system implementation]

Programmer: [over]decomposition into virtual processors
Runtime: assigns VPs to processors; enables adaptive runtime strategies
Implementations: Charm++, AMPI

Benefits

  • Software engineering

– Number of virtual processors can be independently controlled
– Separate VPs for different modules

  • Message driven execution

– Adaptive overlap of communication
– Predictability:
  • Automatic out-of-core
– Asynchronous reductions

  • Dynamic mapping

– Heterogeneous clusters
  • Vacate, adjust to speed, share
– Automatic checkpointing
– Change set of processors used
– Automatic dynamic load balancing
– Communication optimization

SLIDE 6

Adaptive overlap and modules

SPMD and Message-Driven Modules

(From A. Gursoy, Simplified expression of message-driven programs and quantification of their impact on performance, Ph.D. thesis, Apr 1994.)

Modularity, Reuse, and Efficiency with Message-Driven Libraries: Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995

SLIDE 7

Realization: Charm++’s Object Arrays

  • A collection of data-driven objects

– With a single global name for the collection – Each member addressed by an index

  • [sparse] 1D, 2D, 3D, tree, string, ...

– Mapping of element objects to processors handled by the system

A[0] A[1] A[2] A[3] A[..]

User’s view
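To make the object-array idea concrete, here is a minimal sketch in the style of the standard Charm++ "array hello" example. The module, chare, and entry-method names (hello, Main, Hello, sayHi) and the element count of 64 are illustrative choices, not anything prescribed by these slides.

    // hello.ci (interface file): declares the main chare and a 1D chare array.
    //   mainmodule hello {
    //     readonly CProxy_Main mainProxy;
    //     mainchare Main {
    //       entry Main(CkArgMsg* m);
    //       entry void done();
    //     };
    //     array [1D] Hello {
    //       entry Hello();
    //       entry void sayHi(int from);
    //     };
    //   };

    // hello.C
    #include "hello.decl.h"

    /*readonly*/ CProxy_Main mainProxy;
    const int nElems = 64;            // over-decomposition: independent of processor count

    class Main : public CBase_Main {
    public:
      Main(CkArgMsg* m) {
        delete m;
        mainProxy = thisProxy;
        CProxy_Hello arr = CProxy_Hello::ckNew(nElems);  // single global name for the collection
        arr[0].sayHi(-1);                                // address element 0 by its index
      }
      void done() { CkExit(); }
    };

    class Hello : public CBase_Hello {
    public:
      Hello() {}
      Hello(CkMigrateMessage*) {}
      void sayHi(int from) {
        // thisIndex is this element's index; the runtime decides where it lives.
        if (thisIndex + 1 < nElems) thisProxy[thisIndex + 1].sayHi(thisIndex);
        else mainProxy.done();
      }
    };

    #include "hello.def.h"

The mapping of elements to processors (and any later remapping) is entirely the runtime's business; the code above never mentions a processor.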

SLIDE 8

Realization: Charm++’s Object Arrays

  • A collection of data-driven objects

– With a single global name for the collection – Each member addressed by an index

  • [sparse] 1D, 2D, 3D, tree, string, ...

– Mapping of element objects to processors handled by the system

A[0] A[1] A[2] A[3] A[..] A[3] A[0]

User’s view System view

SLIDE 9

Charm++: Object Arrays

  • A collection of data-driven objects

– With a single global name for the collection – Each member addressed by an index

  • [sparse] 1D, 2D, 3D, tree, string, ...

– Mapping of element objects to processors handled by the system

A[0] A[1] A[2] A[3] A[..] A[3] A[0]

User’s view System view

SLIDE 10

AMPI: Adaptive MPI

7 MPI processes

SLIDE 11

AMPI: Adaptive MPI

Real Processors 7 MPI “processes”

Implemented as virtual processors (user-level migratable threads)
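For illustration, an ordinary MPI program such as the sketch below is exactly the kind of code AMPI runs on user-level migratable threads; nothing AMPI-specific appears in the source, and the number of ranks becomes the number of virtual processors chosen at launch (this example is mine, not from the slides).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      // Each "process" contributes its rank. Under AMPI these ranks are
      // user-level threads, so size can exceed the physical processor count.
      int local = rank, sum = 0;
      MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
      if (rank == 0)
        printf("sum of ranks over %d virtual processes = %d\n", size, sum);

      MPI_Finalize();
      return 0;
    }

Compiled with AMPI's wrappers, this could be launched with more virtual processors than physical ones (e.g. 7 ranks on 2 processors); the exact compiler wrapper and runtime flags depend on the AMPI version, so check the AMPI manual.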

SLIDE 12

Load Balancing

[Figure: processor utilization against time on 128 and 1024 processors, showing aggressive load balancing followed by refinement load balancing]

On 128 processors, a single load balancing step suffices, but on 1024 processors we need a “refinement” step.

SLIDE 13

Shrink/Expand

  • Problem: Availability of computing platform may change
  • Fitting applications on the platform by object migration
SLIDE 14

So, What’s new?

SLIDE 15

New Higher Level Abstractions

  • Previously: Multiphase Shared Arrays

– Provides a disciplined use of global address space
– Each array can be accessed only in one of the following modes:

  • ReadOnly, Write-by-One-Thread, Accumulate-only

– Access mode can change from phase to phase
– Phases delineated by per-array “sync” (see the sketch below)

  • Charisma++: Global view of control

– Allows expressing global control flow in a Charm++ program
– Separate expression of parallel and sequential code
– Functional implementation (Chao Huang PhD thesis)
– LCR’04, HPDC’07
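The phase discipline is easiest to see in code. The sketch below is a self-contained, sequential toy that only enforces the one-mode-per-phase rule; it is not the MSA library or its API, just an illustration of the discipline described above.

    // Toy illustration of the Multiphase Shared Arrays discipline (sequential,
    // not distributed): each phase uses the array in exactly one mode, and
    // sync() marks the phase boundary where the mode may change.
    #include <cassert>
    #include <cstdio>
    #include <vector>

    enum class Mode { ReadOnly, WriteByOne, AccumulateOnly };

    class PhasedArray {
      std::vector<double> a_;
      Mode mode_;
    public:
      PhasedArray(size_t n, Mode m) : a_(n, 0.0), mode_(m) {}
      void sync(Mode next) { mode_ = next; }   // phase boundary
      double get(size_t i) const { assert(mode_ == Mode::ReadOnly); return a_[i]; }
      void set(size_t i, double v) { assert(mode_ == Mode::WriteByOne); a_[i] = v; }
      void accumulate(size_t i, double v) { assert(mode_ == Mode::AccumulateOnly); a_[i] += v; }
    };

    int main() {
      PhasedArray hist(4, Mode::AccumulateOnly);
      for (int x : {0, 1, 1, 3, 3, 3}) hist.accumulate(x, 1.0);  // accumulate-only phase
      hist.sync(Mode::ReadOnly);                                 // phase boundary
      for (size_t b = 0; b < 4; ++b) std::printf("bin %zu: %g\n", b, hist.get(b));
      return 0;
    }

In the real library the array is distributed and the sync is a collective over all participating threads; the toy only captures the access-mode rule.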

SLIDE 16

Multiparadigm Interoperability

  • Charm++ supports concurrent composition
  • Allows multiple modules written in multiple paradigms to cooperate in a single application

  • Some recent paradigms implemented:

– ARMCI (for Global Arrays)

  • Use of Multiparadigm programming

– You heard yesterday how ParFUM made use of multiple paradigms effectively

SLIDE 17

Blue Gene Provided a Showcase.

  • Co-operation with Blue Gene team

– Sameer Kumar joins BlueGene team

  • BGW days competition

– 2006: Computer Science day – 2007: Computational cosmology: ChaNGa

  • LeanCP collaboration

– with Glenn Martyna, IBM

SLIDE 18

Cray and PSC Warm Up

  • 4000 fast processors at PSC
  • 12,500 processors at ORNL
  • Cray support via a gift grant
SLIDE 19

IBM Power7 Team

  • Collaborations begun with NSF Track 1 proposal
SLIDE 20

Our Applications Achieved Unprecedented Speedups

SLIDE 21

Applications and Charm++

[Diagram: Application and Charm++ exchange issues and techniques & libraries, which then benefit other applications]

Synergy between Computer Science Research and Biophysics has been beneficial to both

SLIDE 22

Charm++ and Applications

[Diagram: NAMD and Charm++ exchange issues and techniques & libraries, which then benefit other applications: ChaNGa, LeanCP, Space-time meshing, Rocket Simulation]

Synergy between Computer Science Research and Biophysics has been beneficial to both

SLIDE 23

Parallel Objects, Adaptive Runtime System Libraries and Tools

The enabling CS technology of parallel objects and intelligent Runtime systems has led to several collaborative applications in CSE

Develop abstractions in context of full-scale applications

[Diagram: collaborative applications around the parallel-objects/runtime core: Crack Propagation, Space-time Meshes, Computational Cosmology, Rocket Simulation, Protein Folding, Dendritic Growth, Quantum Chemistry (LeanCP), NAMD: Molecular Dynamics, STM virus simulation]

SLIDE 24

Molecular Dynamics in NAMD

  • Collection of [charged] atoms, with bonds

– Newtonian mechanics
– Thousands of atoms (10,000 to 5,000,000)
– 1 femtosecond time-step, millions needed!

  • At each time-step

– Calculate forces on each atom

  • Bonds:
  • Non-bonded: electrostatic and van der Waals

– Short-distance: every timestep
– Long-distance: every 4 timesteps using PME (3D FFT)
– Multiple Time Stepping

– Calculate velocities and advance positions

Collaboration with K. Schulten, R. Skeel, and coworkers
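A schematic of the multiple-time-stepping loop just described, with stub force kernels; this is an assumed simplification for illustration, not NAMD's actual integrator or data layout.

    #include <vector>

    struct Vec3 { double x = 0, y = 0, z = 0; };
    using Forces = std::vector<Vec3>;

    // Stubs standing in for the real kernels: bonded terms, cutoff non-bonded
    // terms (electrostatic + van der Waals), and PME long-range electrostatics.
    Forces bondedForces(const std::vector<Vec3>& p)        { return Forces(p.size()); }
    Forces shortRangeNonbonded(const std::vector<Vec3>& p) { return Forces(p.size()); }
    Forces longRangePME(const std::vector<Vec3>& p)        { return Forces(p.size()); }
    void integrate(std::vector<Vec3>&, std::vector<Vec3>&, const Forces&, double) {}

    void simulate(std::vector<Vec3>& pos, std::vector<Vec3>& vel,
                  long nSteps, double dt /* ~1 femtosecond */) {
      Forces fLong(pos.size());                        // reused between PME evaluations
      for (long step = 0; step < nSteps; ++step) {
        Forces f = bondedForces(pos);                  // every timestep
        Forces fShort = shortRangeNonbonded(pos);      // every timestep
        if (step % 4 == 0) fLong = longRangePME(pos);  // long-range only every 4 steps
        for (size_t i = 0; i < pos.size(); ++i) {
          f[i].x += fShort[i].x + fLong[i].x;
          f[i].y += fShort[i].y + fLong[i].y;
          f[i].z += fShort[i].z + fLong[i].z;
        }
        integrate(pos, vel, f, dt);                    // advance velocities and positions
      }
    }

    int main() {
      std::vector<Vec3> pos(1000), vel(1000);          // a toy 1000-atom system
      simulate(pos, vel, 100, 1.0e-15);
      return 0;
    }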

SLIDE 25

NAMD: A Production MD program

NAMD

  • Fully featured program
  • NIH-funded development
  • Distributed free of charge

(~20,000 registered users)

  • Binaries and source code
  • Installed at NSF centers
  • User training and support
  • Large published simulations
SLIDE 26

NAMD: A Production MD program

NAMD

  • Fully featured program
  • NIH-funded development
  • Distributed free of charge

(~20,000 registered users)

  • Binaries and source code
  • Installed at NSF centers
  • User training and support
  • Large published simulations
SLIDE 27

Hybrid of spatial and force decomposition:

  • Spatial decomposition of atoms into cubes (called patches)
  • For every pair of interacting patches, create one object for calculating electrostatic interactions
  • Recent: Blue Matter, Desmond, etc. use this idea in some form

NAMD Design

  • Designed from the beginning as a parallel program
  • Uses the Charm++ idea:

– Decompose the computation into a large number of objects
– Have an intelligent run-time system (of Charm++) assign objects to processors for dynamic load balancing
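A toy enumeration of that hybrid decomposition: bin space into cutoff-sized cubes (patches) and create one compute object per pair of neighboring patches, plus one per patch for its self-interactions. The 4x4x4 grid is an arbitrary example; the point is that the object count is set by the problem, not by the processor count, and the runtime maps those objects afterwards.

    #include <cstdio>
    #include <vector>

    struct Compute { int patchA, patchB; };   // one pairwise force-computation object

    int main() {
      const int nx = 4, ny = 4, nz = 4;       // patches per dimension (illustrative)
      auto id = [&](int x, int y, int z) { return (x * ny + y) * nz + z; };

      std::vector<Compute> computes;
      for (int x = 0; x < nx; ++x)
        for (int y = 0; y < ny; ++y)
          for (int z = 0; z < nz; ++z)
            // A patch interacts with itself and its (up to) 26 neighbors;
            // count each pair once by only keeping "later" partners.
            for (int dx = -1; dx <= 1; ++dx)
              for (int dy = -1; dy <= 1; ++dy)
                for (int dz = -1; dz <= 1; ++dz) {
                  int X = x + dx, Y = y + dy, Z = z + dz;
                  if (X < 0 || Y < 0 || Z < 0 || X >= nx || Y >= ny || Z >= nz) continue;
                  if (id(X, Y, Z) < id(x, y, z)) continue;
                  computes.push_back({id(x, y, z), id(X, Y, Z)});
                }
      std::printf("%d patches -> %zu compute objects\n", nx * ny * nz, computes.size());
      return 0;
    }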

SLIDE 28

NAMD Parallelization using Charm++

[Figure: example configuration with 847 VPs, 108 VPs, and 100,000 VPs]

These 100,000 objects (virtual processors, or VPs) are assigned to real processors by the Charm++ runtime system

SLIDE 29

Performance on BlueGene/L

[Chart: simulation rate in nanoseconds per day vs. number of processors for IAPP (5.5K atoms), Lysozyme (40K atoms), ApoA1 (92K atoms), ATPase (327K atoms), STMV (1M atoms), and BAR domain (1.3M atoms)]

STMV simulation at 6.65 ns per day on 20,000 processors

IAPP simulation (Rivera, Straub, BU) at 20 ns per day on 256 processors: 1 μs in 50 days

SLIDE 30

Comparison with Blue Matter

Nodes                          16384   8192    4096    2048    1024    512
Blue Matter (SC’06), ms/step   2.09    3.14    5.39    9.97    18.95   38.42
NAMD, ms/step                  2.33    3.2     4.67    6.85    10.5    18.6
NAMD (Virtual Node), ms/step   -       3.0     3.7     5.1     7.6     11.3

NAMD is about 1.8 times faster than Blue Matter on 1024 nodes (and 2.4 times faster in VN mode, where NAMD can use both processors on a node effectively). However, note that NAMD does PME every 4 steps.

ApoLipoprotein-A1 (92K atoms)

SLIDE 31

Performance on Cray XT3

[Chart: simulation rate in nanoseconds per day vs. number of processors for IAPP (5.5K atoms), Lysozyme (40K atoms), ApoA1 (92K atoms), ATPase (327K atoms), STMV (1M atoms), BAR domain (1.3M atoms), and Ribosome (2.8M atoms)]

SLIDE 32

Computational Cosmology

  • N body Simulation (NSF)

– N particles (1 million to 1 billion), in a periodic box
– Move under gravitation
– Organized in a tree (oct, binary (k-d), ..); see the sketch below

  • Output data Analysis: in parallel (NASA)

– Particles are read in parallel
– Interactive analysis

  • Issues:

– Load balancing, fine-grained communication, tolerating communication latencies
– Multiple-time stepping

  • New Code Released: ChaNGa

Collaboration with T. Quinn (Univ. of Washington)

UofI Team: Filippo Gioachin, Pritish Jetley, Celso Mendes
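As a rough illustration of the tree organization, here is a Barnes-Hut-style opening criterion; ChaNGa's actual acceptance test, parameters, and tree type may differ, so treat this purely as a sketch of the idea.

    #include <cmath>
    #include <cstdio>

    struct Node {
      double cmx, cmy, cmz;   // center of mass of the particles under this node
      double mass;
      double size;            // side length of the node's bounding box
      // children / particle lists omitted
    };

    // Accept the node as a single pseudo-particle if it subtends a small enough
    // angle from the target position; otherwise the walk must "open" the node
    // and recurse into its children.
    bool acceptNode(const Node& n, double px, double py, double pz, double theta = 0.7) {
      double dx = n.cmx - px, dy = n.cmy - py, dz = n.cmz - pz;
      double dist = std::sqrt(dx * dx + dy * dy + dz * dz);
      return n.size < theta * dist;
    }

    int main() {
      Node n{10.0, 10.0, 10.0, 1.0e6, 2.0};
      std::printf("far particle: %s, near particle: %s\n",
                  acceptNode(n, 0.0, 0.0, 0.0) ? "accept" : "open",
                  acceptNode(n, 9.0, 9.0, 9.0) ? "accept" : "open");
      return 0;
    }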

SLIDE 33

ChaNGa Load Balancing Challenge:

Trade-off between communication and balance

SLIDE 34

Recent Successes in Scaling ChaNGa

[Chart: execution time vs. number of processors (16 to 16,384), showing scaling for the drgas, lambb, dwf1.6144, hrwh_LCDMs, and dwarf datasets]

SLIDE 35

Quantum Chemistry: LeanCP

  • Car-Parrinello MD
  • Illustrates utility of separating decomposition and mapping

  • Very complex set of objects and interactions
  • Excellent scaling achieved

Collaboration with Glenn Martyna (IBM), Mark Tuckerman (NYU)
UofI team: Eric Bohm, Abhinav Bhatele

SLIDE 36

LeanCP Decomposition

SLIDE 37

LeanCP Scaling

SLIDE 38

Space-time meshing

  • Discontinuous Galerkin method
  • Tent-pitcher algorithm

Collaboration with Bob Haber, Jeff Erickson, Michael Garland
PPL team: Aaron Baker, Sayantan Chakravorty, Terry Wilmarth

SLIDE 39

SLIDE 40

Rocket Simulation

  • Dynamic, coupled physics simulation in 3D
  • Finite-element solids on unstructured tet mesh
  • Finite-volume fluids on structured hex mesh
  • Coupling every timestep via a least-squares data transfer

  • Challenges:

– Multiple modules
– Dynamic behavior: burning surface, mesh adaptation

Robert Fiedler, Center for Simulation of Advanced Rockets
Collaboration with M. Heath, P. Geubelle, others
SLIDE 41

Dynamic load balancing in Crack Propagation

SLIDE 42

Colony: FAST-OS Project

  • DOE funded collaboration
  • Terry Jones: LLNL
  • Jose Moreira et al., IBM
  • At Illinois: supports

– Scalable Dynamic Load Balancing
– Fault tolerance

SLIDE 43

Colony Project Overview

Title: Services and Interfaces to Support Systems with Very Large Numbers of Processors

Collaborators:
– Lawrence Livermore National Laboratory: Terry Jones
– University of Illinois at Urbana-Champaign: Laxmikant Kale, Celso Mendes, Sayantan Chakravorty
– International Business Machines: Jose Moreira, Andrew Tauferner, Todd Inglett

Topics:
  • Parallel Resource Instrumentation Framework
  • Scalable Load Balancing
  • OS mechanisms for Migration
  • Processor Virtualization for Fault Tolerance
  • Single system management space
  • Parallel Awareness and Coordinated Scheduling of Services
  • Linux OS for cellular architecture

SLIDE 44

Load Balancing on Very Large Machines

  • Existing load balancing strategies don’t scale on extremely large machines

– Consider an application with 1M objects on 64K processors

Centralized:
– Object load data are sent to processor 0
– Integrate to a complete object graph
– Migration decision is broadcast from processor 0
– Global barrier

Distributed:
– Load balancing among neighboring processors
– Build partial object graph
– Migration decision is sent to its neighbors
– No global barrier

SLIDE 45

A Hybrid Load Balancing Strategy

  • Dividing processors into independent sets of groups, and groups are organized in hierarchies (decentralized)
  • Each group has a leader (the central node) which performs centralized load balancing

  • A particular hybrid strategy that works well
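A toy version of the centralized step a group leader could perform: greedily place the heaviest objects on the least-loaded processors in its group. The real Charm++ strategies also weigh communication and migration cost; this only shows the flavor of the centralized decision, with made-up loads.

    #include <algorithm>
    #include <cstdio>
    #include <queue>
    #include <utility>
    #include <vector>

    int main() {
      std::vector<double> objLoad = {5, 3, 3, 2, 2, 1, 1, 1};  // measured object loads
      const int nProcs = 3;                                    // processors in this group

      // Min-heap of (current load, processor id).
      using P = std::pair<double, int>;
      std::priority_queue<P, std::vector<P>, std::greater<P>> procs;
      for (int p = 0; p < nProcs; ++p) procs.push({0.0, p});

      std::sort(objLoad.rbegin(), objLoad.rend());             // heaviest objects first
      std::vector<int> assignment(objLoad.size());
      for (size_t i = 0; i < objLoad.size(); ++i) {
        auto [load, p] = procs.top();                          // least-loaded processor
        procs.pop();
        assignment[i] = p;
        procs.push({load + objLoad[i], p});
      }
      for (size_t i = 0; i < objLoad.size(); ++i)
        std::printf("object %zu (load %g) -> proc %d\n", i, objLoad[i], assignment[i]);
      return 0;
    }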
SLIDE 46

Fault Tolerance

  • Automatic Checkpointing

– Migrate objects to disk
– In-memory checkpointing as an option
– Automatic fault detection and restart

  • Proactive Fault Tolerance

– “Impending Fault” Response
– Migrate objects to other processors
– Adjust processor-level parallel data structures

  • Scalable fault tolerance

– When a processor out of 100,000 fails, all 99,999 shouldn’t have to run back to their checkpoints!
– Sender-side message logging
– Latency tolerance helps mitigate costs
– Restart can be sped up by spreading out objects from the failed processor

SLIDE 47

BigSim

  • Simulating very large parallel machines

– Using smaller parallel machines

  • Reasons

– Predict performance on future machines
– Predict performance obstacles for future machines
– Do performance tuning on existing machines that are difficult to get allocations on

  • Idea:

– Emulation run using virtual processors (AMPI)

  • Get traces

– Detailed machine simulation using traces

SLIDE 48

Objectives and Simulation Model

  • Objectives:

– Develop techniques to facilitate the development of efficient peta-scale applications
– Based on performance prediction of applications on large simulated parallel machines

  • Simulation-based Performance Prediction:

– Focus on Charm++ and AMPI programming models
– Performance prediction based on PDES (parallel discrete event simulation)
– Supports varying levels of fidelity

  • processor prediction, network prediction.

– Modes of execution :

  • online and post-mortem mode
SLIDE 49

Big Network Simulation

  • Simulate network behavior: packetization, routing, contention, etc.
  • Incorporate with post-mortem simulation
  • Switches are connected in a torus network

[Diagram: the BigSim emulator produces BG log files (tasks & dependencies); POSE timestamp correction yields timestamp-corrected tasks, which feed BigNetSim]

SLIDE 50

Projections: Performance visualization

SLIDE 51

Architecture of BigNetSim

SLIDE 52

Performance Prediction (contd.)

  • Predicting time of sequential code:

– User-supplied time for every code block
– Wall-clock measurements on the simulating machine can be used via a suitable multiplier
– Hardware performance counters to count floating point, integer, branch instructions, etc.

  • Cache performance and memory footprint are approximated by percentage of memory accesses and cache hit/miss ratio

– Instruction-level simulation (not implemented)

  • Predicting Network performance:

– No contention: time based on topology & other network parameters
– Back-patching: modifies communication time using amount of communication activity
– Network simulation: modeling the network entirely
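As an illustration of the no-contention mode, a cost model of this kind might charge a fixed software overhead, a per-hop latency on the torus, and a bandwidth term. The formula and the parameter values below are assumptions for illustration, not BigNetSim's actual model or numbers.

    #include <cstdio>
    #include <cstdlib>

    // time = software overhead + per-hop latency * hops + bytes / bandwidth
    double msgTimeTorus3D(int sx, int sy, int sz, int dx, int dy, int dz,
                          int X, int Y, int Z, double bytes) {
      auto torusHops = [](int a, int b, int dim) {
        int d = std::abs(a - b);
        return d < dim - d ? d : dim - d;      // shorter way around the ring
      };
      int hops = torusHops(sx, dx, X) + torusHops(sy, dy, Y) + torusHops(sz, dz, Z);
      const double alpha = 2.0e-6;             // software overhead (s), assumed
      const double perHop = 50.0e-9;           // per-hop latency (s), assumed
      const double bandwidth = 1.0e9;          // link bandwidth (bytes/s), assumed
      return alpha + perHop * hops + bytes / bandwidth;
    }

    int main() {
      std::printf("1 KB from (0,0,0) to (4,4,4) on an 8x8x8 torus: %g s\n",
                  msgTimeTorus3D(0, 0, 0, 4, 4, 4, 8, 8, 8, 1024.0));
      return 0;
    }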

SLIDE 53

Multi-Cluster Co-Scheduling

  • Job co-scheduled to run across two clusters to provide access to large numbers of processors
  • But cross-cluster latencies are large!
  • Virtualization within Charm++ masks high inter-cluster latency by allowing overlap of communication with computation

[Figure: Cluster A and Cluster B; intra-cluster latency in microseconds, inter-cluster latency in milliseconds]

SLIDE 54

Multi-Cluster Co-Scheduling

SLIDE 55

Faucets: Optimizing Utilization Within/across Clusters

[Diagram: job submission and job monitor interacting with multiple clusters; labels include Job Specs, File Upload, Bids, and Job Id]

SLIDE 56

Other Ongoing Projects

  • Parallel Debugger
  • Automatic out-of-core execution
  • Parallel algorithms

– Current: Prim’s spanning tree algorithm, sorting, ..

  • New collaborations being explored

– Prof. Paulino, Prof. Pantano, ..

SLIDE 57

Domain Specific Frameworks

Motivation

  • Reduce tedium of parallel programming for commonly used paradigms and parallel data structures
  • Encapsulate parallel data structures and algorithms
  • Provide easy to use interface
  • Used to build concurrently composable parallel modules

Frameworks

  • Unstructured Meshes: ParFUM

– Generalized ghost regions
– Used in Rocfrac, Rocflu at rocket center, and outside CSAR
– Fast collision detection

  • Multiblock framework

– Structured Meshes
– Automates communication

  • AMR

– Common for both above

  • Particles

– Multiphase flows
– MD, tree codes

SLIDE 58

Summary and Messages

  • We at PPL have advanced migratable objects technology

– We are committed to supporting applications
– We grow our base of reusable techniques via such collaborations

  • Try using our technology:

– AMPI, Charm++, Faucets, ParFUM, ..
– Available via the web: http://charm.cs.uiuc.edu

SLIDE 59

Parallel Programming Laboratory

GRANTS
– NSF ITR: Chemistry, Car-Parrinello MD, QM/MM
– IBM: PERCS High Productivity
– NSF ITR / NASA: Computational Cosmology and Visualization
– DOE HPC-Colony: Services and Interfaces for Large Computers
– DOE CSAR: Rocket Simulation
– NCSA: Faculty Fellows Program
– NSF ITR: CPSD Space/Time Meshing
– NIH: Biophysics, NAMD
– NSF: Next Generation Software, BlueGene

Sr. STAFF

ENABLING PROJECTS
– Charm++ and Converse
– AMPI: Adaptive MPI
– Fault-Tolerance: Checkpointing, Fault-Recovery, Proc. Evacuation
– Load-Balance: Centralized, Distributed, Hybrid
– Faucets: Dynamic Resource Management for Grids
– ParFUM: Supporting Unstructured Meshes (Comp. Geometry)
– Projections: Performance Analysis
– Orchestration and Parallel Languages
– BigSim: Simulating Big Machines and Networks

SLIDE 60

Over the next two days

System progress talks

  • Adaptive MPI
  • BigSim: Performance prediction
  • Scalable Performance Analysis
  • Fault Tolerance
  • Cell Processor
  • Grid Multi-cluster applications

Applications

  • Molecular Dynamics
  • Quantum Chemistry (LeanCP)
  • Computational Cosmology
  • Rocket Simulation

Tutorials

  • Charm++
  • AMPI
  • Projections
  • BigSim

Keynote: Kathy Yelick, “PGAS Languages and Beyond”