Charm++: Migratable Objects + Asynchronous Methods + Adaptive Runtime = Performance + Productivity
SLIDE 1

Charm++

Migratable Objects + Asynchronous Methods + Adaptive Runtime = Performance + Productivity

Laxmikant V. Kale∗, Anshu Arya, Nikhil Jain, Akhil Langer, Jonathan Lifflander, Harshitha Menon, Xiang Ni, Yanhua Sun, Ehsan Totoni, Ramprasad Venkataraman∗, Lukasz Wesolowski

Parallel Programming Laboratory

Department of Computer Science University of Illinois at Urbana-Champaign

∗{kale, ramv}@illinois.edu

SC12: November 13, 2012

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 1 / 37

SLIDE 2

Benchmarks

Required

1D FFT
Random Access
Dense LU Factorization

Optional

Molecular Dynamics
Adaptive Mesh Refinement
Sparse Triangular Solver

SLIDE 3

Metrics: Performance and Productivity

Our Implementations in Charm++

Code (SLOC)                                                  Performance
Benchmark            C++    CI    Subtotal  Driver  Total    Machine    Max Cores  Performance Highlight
1D FFT               54     29    83        102     185      IBM BG/P   64K        2.71 TFlop/s
                                                             IBM BG/Q   16K        2.31 TFlop/s
Random Access        76     15    91        47      138      IBM BG/P   128K       43.10 GUPS
                                                             IBM BG/Q   16K        15.00 GUPS
Dense LU             1001   316   1317      453     1770     Cray XT5   8K         55.1 TFlop/s (65.7% peak)
Molecular Dynamics   571    122   693       n/a     693      IBM BG/P   128K       24 ms/step (2.8M atoms)
                                                             IBM BG/Q   16K        44 ms/step (2.8M atoms)
Triangular Solver    642    50    692       56      748      IBM BG/P   512        48x speedup on 64 cores (helm2d03 matrix)
AMR                  1126   118   1244      n/a     1244     IBM BG/Q   32K        22 timesteps/sec, 2D mesh, max 15 refinement levels

C++: regular C++ code. CI: parallel interface descriptions and control-flow DAG.

SLIDE 4

Capabilities

Demonstrated Productivity Benefits

Automatic load balancing
Automatic checkpoints
Tolerating process failures
Asynchronous, non-blocking collective communication
Interoperating with MPI

For more info

http://charm.cs.illinois.edu/

SLIDE 5

Capabilities: Automated Dynamic Load Balancing

Measurement based fine-grained load balancing

◮ Principle of persistence: the recent past indicates the near future.
◮ Charm++ provides a suite of load balancers.

SLIDE 6

Capabilities: Automated Dynamic Load Balancing


How to use?

◮ Periodic calls in the application: AtSync().
◮ Command line argument: +balancer Strategy.

SLIDE 7

Capabilities: Automated Dynamic Load Balancing


MetaBalancer - When and how to load balance?

◮ Monitors the application continuously and predicts its behavior.
◮ Decides when to invoke which load balancer.
◮ Command line argument: +MetaLB
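To illustrate how a measurement-based strategy can work, here is a minimal sketch in plain C++ (not the Charm++ load-balancer API; `greedyAssign` is a hypothetical name) in the spirit of a greedy strategy such as Charm++'s GreedyLB: objects are ordered by their measured recent load (the principle of persistence) and placed, heaviest first, on the currently least-loaded processor.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Illustrative sketch: assign each object to a PE, heaviest objects first,
// always choosing the PE with the smallest accumulated load so far.
std::vector<int> greedyAssign(const std::vector<double>& objLoad, int numPEs) {
    // Object indices, sorted heaviest first.
    std::vector<int> order(objLoad.size());
    for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return objLoad[a] > objLoad[b]; });

    // Min-heap of (current PE load, PE id).
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pes;
    for (int p = 0; p < numPEs; ++p) pes.push({0.0, p});

    std::vector<int> assignment(objLoad.size());
    for (int obj : order) {
        auto [load, pe] = pes.top();  // least-loaded PE
        pes.pop();
        assignment[obj] = pe;
        pes.push({load + objLoad[obj], pe});
    }
    return assignment;
}
```

In the runtime the loads come from instrumentation of the previous interval; here they are simply passed in.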

SLIDE 8

Capabilities: Checkpointing Application State

Checkpointing to disk for split execution: CkStartCheckpoint(callback)

◮ Designed for applications that need to run for a long period but cannot obtain all of the needed allocation at one time.

Restart applications from checkpoint on any number of processors
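Why object-level checkpoints allow restart on a different processor count can be sketched as follows (hypothetical helpers, not Charm++'s CkStartCheckpoint): state is saved per migratable object rather than per process, so a restart simply re-maps the same objects onto however many workers are available.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

// Illustrative sketch: save the state of N objects, then restart by
// redistributing the same N objects round-robin over a new worker count.
void saveCheckpoint(const std::string& path, const std::vector<double>& objState) {
    std::ofstream out(path, std::ios::binary);
    size_t n = objState.size();
    out.write((const char*)&n, sizeof n);
    out.write((const char*)objState.data(), n * sizeof(double));
}

std::vector<std::vector<double>> restartOn(const std::string& path, int workers) {
    std::ifstream in(path, std::ios::binary);
    size_t n = 0;
    in.read((char*)&n, sizeof n);
    std::vector<double> state(n);
    in.read((char*)state.data(), n * sizeof(double));

    std::vector<std::vector<double>> perWorker(workers);
    for (size_t i = 0; i < n; ++i)        // round-robin over the new worker set
        perWorker[i % workers].push_back(state[i]);
    return perWorker;
}
```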

SLIDE 9

Capabilities: Tolerating Process Failures

Double in-memory checkpointing for online recovery: CkStartMemCheckpoint(callback)

◮ Tolerates the increasingly frequent failures in HPC systems.

Failure injection and automatic failure detection: CkDieNow()

SLIDE 10

Capabilities: Interoperability

Invoke Charm++ from MPI

Callable like other external MPI libraries
Use MPI communicators to enable the following modes

[Diagram: MPI and Charm++ dividing processors P(1)...P(N) over time: (a) time sharing, (b) space sharing, (c) combined]

SLIDE 11

Capabilities: Interoperability

Trivial Changes to Existing Codes

Initialize and destroy Charm++ instances Use interface functions to transfer control

//MPI_Init and other basic initialization
{ optional pure MPI code blocks }
//create a communicator for initializing Charm++
MPI_Comm_split(MPI_COMM_WORLD, peid % 2, peid, &newComm);
CharmLibInit(newComm, argc, argv);
{ optional pure MPI code blocks }
//Charm++ library invocation
if (myrank % 2) fft1d(inputData, outputData, data_size);
//more pure MPI code blocks
//more Charm++ library calls
CharmLibExit();
//MPI cleanup and MPI_Finalize

SLIDE 12

Capabilities: Asynchronous, Non-blocking Collective Communication

Overlap collective communication with other work
Topological Routing and Aggregation Module (TRAM)

◮ Transforms point-to-point communication into collectives
◮ Minimal topology-aware software routing
◮ Aggregation of fine-grained communication
◮ Recombining at intermediate destinations

Intuitive expression of collectives through overloading constructs for point-to-point sends (e.g. broadcast)
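The aggregation idea can be sketched in a few lines (a simplified stand-in for TRAM, which additionally routes over the network topology; `Aggregator` is an illustrative name): fine-grained sends are buffered per destination and delivered as combined messages.

```cpp
#include <cassert>
#include <functional>
#include <unordered_map>
#include <vector>

// Illustrative sketch of TRAM-style aggregation: buffer small items per
// destination and deliver one combined "message" when a buffer fills.
class Aggregator {
public:
    Aggregator(size_t bufSize, std::function<void(int, std::vector<int>&)> deliver)
        : bufSize_(bufSize), deliver_(std::move(deliver)) {}

    void send(int dest, int item) {
        auto& buf = buffers_[dest];
        buf.push_back(item);
        if (buf.size() >= bufSize_) flush(dest);   // full buffer: one delivery
    }

    void flushAll() {
        for (auto& [dest, buf] : buffers_)
            if (!buf.empty()) flush(dest);         // drain the remainder
    }

private:
    void flush(int dest) {
        deliver_(dest, buffers_[dest]);
        buffers_[dest].clear();
    }
    size_t bufSize_;
    std::function<void(int, std::vector<int>&)> deliver_;
    std::unordered_map<int, std::vector<int>> buffers_;
};
```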

SLIDE 13

FFT: Parallel Coordination Code

doFFT()

for (phase = 0; phase < 3; ++phase) {
  atomic { sendTranspose(); }
  for (count = 0; count < P; ++count)
    when recvTranspose[phase](fftMsg *msg)
      atomic { applyTranspose(msg); }
  if (phase < 2)
    atomic {
      fftw_execute(plan);
      if (phase == 0) twiddle();
    }
}
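The exchange that sendTranspose() and applyTranspose() coordinate is, logically, a block transpose across chares; a serial sketch with illustrative names (not the benchmark code), where all "messages" are delivered in place:

```cpp
#include <cassert>
#include <vector>

// Chare i owns row i of a P x P grid of blocks; after the exchange,
// chare i owns column i. Sending block j of chare i means delivering it
// as block i of chare j.
using Grid = std::vector<std::vector<int>>;   // blocks[owner][blockIndex]

Grid transposeBlocks(const Grid& blocks) {
    int P = (int)blocks.size();
    Grid out(P, std::vector<int>(P));
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < P; ++j)
            out[j][i] = blocks[i][j];   // the "sendTranspose" step
    return out;
}
```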

SLIDE 14

FFT: Performance

IBM Blue Gene/P (Intrepid), 25% memory, ESSL with FFTW wrappers

[Plot: GFlop/s vs. number of cores (256 to 65536); series: P2P all-to-all, Mesh all-to-all, Serial FFT limit]

SLIDE 15

FFT: Performance

IBM Blue Gene/P (Intrepid), 25% memory, ESSL with FFTW wrappers

[Plot: GFlop/s vs. number of cores (256 to 65536); series: P2P all-to-all, Mesh all-to-all, Serial FFT limit]

Charm++ all-to-all using TRAM

Asynchronous, Non-blocking, Topology-aware, Combining, Streaming

SLIDE 16

Random Access

Productivity

Use point-to-point sends and let Charm++ optimize communication
Automatically detect and adapt to the network topology of the partition

Performance

Automatic communication optimization using TRAM

◮ Aggregation of fine-grained communication
◮ Minimal topology-aware software routing
◮ Recombining at intermediate destinations
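The update stream itself is tiny; a sketch of the per-process kernel using the HPCC-style generator (illustrative, not the benchmark source) shows why aggregation matters: each iteration produces one 8-byte update to a pseudo-random table location.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the Random Access (GUPS) update loop on one process. In the
// parallel benchmark each update targets a remote table entry, and TRAM
// aggregates these tiny messages automatically.
void gupsUpdates(std::vector<uint64_t>& table, uint64_t numUpdates) {
    uint64_t ran = 1;                            // HPCC-style LFSR stream
    const uint64_t mask = table.size() - 1;      // table size: power of two
    for (uint64_t i = 0; i < numUpdates; ++i) {
        ran = (ran << 1) ^ ((int64_t)ran < 0 ? 0x7 : 0);
        table[ran & mask] ^= ran;                // XOR update, as in HPCC
    }
}
```

Because the updates are XORs, replaying the same stream twice restores the table, which makes the kernel easy to verify.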

SLIDE 17

Random Access: Performance

IBM Blue Gene/P (Intrepid), BlueGene/Q (Vesta)

[Plot: GUPS vs. number of cores (128 to 128K) with perfect-scaling reference; BG/P reaches 43.10 GUPS; series: BG/P, BG/Q]

SLIDE 18

LU: Capabilities

Composable library

◮ Modular program structure
◮ Seamless execution structure (interleaved modules)

SLIDE 19

LU: Capabilities


Block-centric

◮ Algorithm from a block's perspective
◮ Agnostic of processor-level considerations

SLIDE 20

LU: Capabilities


Separation of concerns

◮ Domain specialist codes the algorithm
◮ Systems specialist codes tuning, resource management, etc.

Lines of Code
Module              CI    C++   Total   Module-specific Commits
Factorization       517   419   936     472/572 (83%)
Mem. Aware Sched.   9     492   501     86/125 (69%)
Mapping             10    72    82      29/42 (69%)
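To make "algorithm from a block's perspective" concrete, here is a minimal serial in-place LU sketch without pivoting (illustrative only; the library partitions the matrix into blocks, pivots, and drives each block's updates by arriving messages, but with 1x1 blocks the block algorithm reduces to this textbook loop):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// In-place LU without pivoting: after the call, the strict lower triangle
// of A holds L (unit diagonal implied) and the upper triangle holds U.
void luFactorInPlace(std::vector<std::vector<double>>& A) {
    int n = (int)A.size();
    for (int k = 0; k < n; ++k) {
        for (int i = k + 1; i < n; ++i) {
            A[i][k] /= A[k][k];                 // column update (L factor)
            for (int j = k + 1; j < n; ++j)
                A[i][j] -= A[i][k] * A[k][j];   // trailing (Schur) update
        }
    }
}
```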

SLIDE 21

LU: Capabilities

Flexible data placement

◮ Experiment with data layout

Memory-constrained adaptive lookahead

SLIDE 22

LU: Performance

Weak scaling (N chosen so the matrix fills 75% of memory)

[Plot: total TFlop/s vs. number of cores (128 to 8192), weak scaling on XT5 vs. theoretical peak; efficiencies 67%, 67.4%, 67.4%, 67.1%, 66.2%, 65.7%]

SLIDE 23

LU: Performance

... and strong scaling too! (N=96,000)

[Plot: total TFlop/s vs. number of cores; strong scaling on BG/P vs. theoretical peak, with the XT5 weak-scaling results for reference; BG/P efficiencies 60.3%, 45%, 40.8%, 31.6%]

SLIDE 24

Optional Benchmarks

Why MD, AMR and Sparse Triangular Solver

Relevant scientific computing kernels
Challenge the parallelization paradigm

◮ Load imbalances
◮ Dynamic communication structure

Express non-trivial parallel control flow

SLIDE 25

LeanMD

SLOC: 693

1 Mimics short-range force calculation in NAMD
2 Resembles miniMD of the Mantevo project (SLOC ≈ 3000)
3 Advanced features:

Meta-Balancer: automated dynamic load balancing
Fault tolerance: in-memory checkpoint-based restart
Split execution: checkpoint on x cores, restart on y cores

[Figures: 1-Away decomposition; 2-AwayX decomposition]

SLIDE 26

Code for FT and LB

if (stepCount % ldbPeriod == 0) {
  serial { AtSync(); }
  when ResumeFromSync() { }
}

SLIDE 27

Code for FT and LB

if (stepCount % ldbPeriod == 0) {
  serial { AtSync(); }
  when ResumeFromSync() { }
}
if (stepCount % checkptFreq == 0) {
  serial {
    //coordinate to start checkpointing
    contribute(CkCallback(CkReductionTarget(Cell, startCheckpoint), thisProxy(0,0,0)));
  }
  if (thisIndex.x == 0 && thisIndex.y == 0 && thisIndex.z == 0) {
    when startCheckpoint() serial {
      CkCallback cb(CkReductionTarget(Cell, recvCheckPointDone), thisProxy);
      if (checkptStrategy == 0) CkStartCheckpoint(logs.c_str(), cb);
      else CkStartMemCheckpoint(cb);
    }
  }
  when recvCheckPointDone() { }
}

SLIDE 28

Code for FT and LB

if (stepCount % ldbPeriod == 0) {
  serial { AtSync(); }
  when ResumeFromSync() { }
}
if (stepCount % checkptFreq == 0) {
  serial {
    //coordinate to start checkpointing
    contribute(CkCallback(CkReductionTarget(Cell, startCheckpoint), thisProxy(0,0,0)));
  }
  if (thisIndex.x == 0 && thisIndex.y == 0 && thisIndex.z == 0) {
    when startCheckpoint() serial {
      CkCallback cb(CkReductionTarget(Cell, recvCheckPointDone), thisProxy);
      if (checkptStrategy == 0) CkStartCheckpoint(logs.c_str(), cb);
      else CkStartMemCheckpoint(cb);
    }
  }
  when recvCheckPointDone() { }
}
//kill one of the processes to demonstrate fault tolerance
if (stepCount == 30 && thisIndex.x == 1 && thisIndex.y == 1 && thisIndex.z == 0)
  serial {
    if (CkHasCheckpoints()) { CkDieNow(); }
  }

SLIDE 29

MD: Performance

2.8 million atoms. IBM Blue Gene/P (Intrepid)

[Plot: time per step (ms) vs. number of cores (2k to 128k) on Intrepid, 2.8 million atoms; series: No LB, Hybrid LB]

SLIDE 30

Fault Tolerance Support for MD

MD: Checkpoint Time

[Plot: LeanMD checkpoint time (ms) vs. number of processes (2048 to 32768) on Blue Gene/Q; series: 2.8 million atoms, 1.6 million atoms]

SLIDE 31

Fault Tolerance Support for MD

MD: Restart Time

[Plot: LeanMD restart time (ms) vs. number of processes (2048 to 32768) on Blue Gene/Q; series: 2.8 million atoms, 1.6 million atoms]

SLIDE 32

Meta-Balancer vs Periodic Load Balancing

[Plot: elapsed time (s) vs. LB period on Blue Gene/P for 8k to 128k cores]

Cores   No LB (s)   Periodic LB (s)   Meta-Balancer (s)
8k      666         504               413
16k     336         260               277
32k     171         131               131
64k     122         104               100
128k    73          54                52

Frequent load balancing increases total execution time.
Infrequent load balancing leaves load imbalance and yields no gains.
Meta-Balancer adaptively performs load balancing to obtain the best total execution time.

SLIDE 33

Sparse Triangular Solver- Matrix Decomposition

[Figures: column decomposition; dense-parts decomposition]

SLIDE 34

Sparse Triangular Solver- Parallel Algorithm
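A serial sketch of the underlying operation, forward substitution with the matrix stored by columns to match the column decomposition above (illustrative helper, not the benchmark code): the parallel version assigns columns (or dense sub-blocks) to chares and forwards partial updates as messages.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Forward substitution L x = b with L in compressed sparse column (CSC)
// form; the diagonal entry is stored first in each column. Returns x.
std::vector<double> cscLowerSolve(int n,
                                  const std::vector<int>& colPtr,
                                  const std::vector<int>& rowIdx,
                                  const std::vector<double>& val,
                                  std::vector<double> b) {
    for (int j = 0; j < n; ++j) {
        b[j] /= val[colPtr[j]];                 // diagonal entry leads column j
        for (int p = colPtr[j] + 1; p < colPtr[j + 1]; ++p)
            b[rowIdx[p]] -= val[p] * b[j];      // scatter update to later rows
    }
    return b;                                    // b now holds x
}
```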

SLIDE 35

Sparse Triangular Solver- Performance vs. SuperLU DIST

[Plot: solution time (s) vs. number of cores (1 to 512); Charm++ solver (slu_*) vs. SuperLU_DIST on the webbase-1M, helm2d03, largebasis, and hood matrices]

SLIDE 36

Sparse Triangular Solver- Productivity in Charm++

A more complicated (higher-performance) algorithm in 692 total SLOC

◮ vs. 897 SLOC for the SuperLU_DIST triangular solver

Overdecomposition (with round-robin mapping) is essential

◮ Communication-computation overlap, load balance

Dynamic creation of parallel units

◮ Distributing dense regions

Message-driven nature and priorities

◮ No need for something like MPI_Iprobe

SLIDE 37

Adaptive Mesh Refinement

[Figures: sample simulation; propagation of refinement decision messages]

SLIDE 38

Adaptive Mesh Refinement


[Figure: finite state machine over the states Coarsen (depth d-1), Stay (d), and Refine (d+1). The initial state comes from the local error condition; received messages carry a neighbor's required depth, with a required depth of d+1 forcing at least Stay and d+2 forcing Refine; sibling messages also feed the update; termination detection ends the round.]

Finite state machine for each block's decision update
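A simplified version of this decision update can be written as follows (an assumed 2:1 refinement rule with illustrative names; the real FSM also tracks sibling messages and the local error condition):

```cpp
#include <cassert>

// A block at depth `myDepth` holds a tentative decision and revises it when
// a neighbor announces its required depth. Decisions only escalate:
// a neighbor two levels deeper forces Refine; one level deeper rules out
// Coarsen; a shallower neighbor imposes no constraint.
enum Decision { Coarsen, Stay, Refine };

Decision updateDecision(int myDepth, Decision current, int nbrRequiredDepth) {
    if (nbrRequiredDepth >= myDepth + 2) return Refine;   // must split (2:1 rule)
    if (nbrRequiredDepth >= myDepth + 1 && current == Coarsen)
        return Stay;                                      // may not coarsen away
    return current;
}
```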

SLIDE 39

Adaptive Mesh Refinement

Charm++ Implementation

Blocks as virtual processors, instead of each process containing many blocks

◮ Simplifies implementation

Blocks addressed with bit-vector indices

◮ The Charm++ RTS handles physical locations

Dynamic distributed load balancing

Algorithmic Improvements [1]

O(#blocks/P) vs. O(#blocks) memory per process
2 system quiescence states vs. #levels reductions for mesh restructuring
O(1) vs. O(log P) time neighbor lookup

[1] Langer et al. Scalable Algorithms for Distributed Memory Adaptive Mesh Refinement. In 24th International Conference on Computer Architecture and High Performance Computing (SBAC-PAD) 2012, NY, USA.
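Bit-vector addressing can be sketched for a 2-D quadtree (an assumed encoding for illustration, not necessarily the exact Charm++ one): each refinement level appends two bits to the index, so parent and child identifiers are pure shift operations, independent of which process currently owns a block.

```cpp
#include <cassert>
#include <cstdint>

// A block's identity is the bit path from the root plus its depth; the
// runtime maps this logical index to a physical location.
struct BlockId {
    uint64_t path;   // two bits appended per refinement level
    int depth;
};

BlockId parentOf(BlockId b) {
    return { b.path >> 2, b.depth - 1 };                  // drop last level
}

BlockId childOf(BlockId b, int quadrant) {                // quadrant in [0,3]
    return { (b.path << 2) | (uint64_t)quadrant, b.depth + 1 };
}
```

Because these relations are arithmetic on the index itself, neighbor and parent lookups need no distributed directory traversal.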

SLIDE 40

Adaptive Mesh Refinement

[Plot: timesteps per second, strong scaling on IBM BG/Q (256 to 32k ranks) with a max depth of 15; series: min-depth 4, min-depth 5]

[Plot: non-overlapped remeshing latency (ms) on IBM BG/Q (16 to 32k ranks) for depth ranges 4-9, 4-10, 4-11; candlesticks show the minimum and maximum, the 5th and 95th percentiles, and the median]

SLIDE 41

Charm++ at SC12

Temperature-aware load balancing - Tue @ 2:00 pm
NAMD at 200K+ cores - Thu @ 11:00 am

For more info

http://charm.cs.illinois.edu/

SLIDE 42

Charm++

Programming Model

Object-based: express logic via indexed collections of interacting objects (both data and tasks)
Over-decomposed: expose more parallelism than available processors

SLIDE 43

Charm++

Programming Model

Runtime-assisted: scheduling, observation-based adaptivity, load balancing, composition, etc.
Message-driven: trigger computation by invoking remote entry methods
Non-blocking, asynchronous: implicitly overlapped data transfer
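A toy sketch of message-driven execution (vastly simplified relative to the Charm++ scheduler; `Scheduler` is an illustrative name): work happens only when a queued message is delivered to an object's entry method, never via a blocking receive, so communication latency overlaps with whatever is next in the queue.

```cpp
#include <cassert>
#include <deque>
#include <functional>

// Each queued closure stands in for a (object, entry method, arguments)
// message; handlers may themselves send further messages.
struct Scheduler {
    std::deque<std::function<void()>> queue;

    void send(std::function<void()> entryMethod) {
        queue.push_back(std::move(entryMethod));
    }

    void run() {
        while (!queue.empty()) {
            auto msg = std::move(queue.front());
            queue.pop_front();
            msg();                   // deliver: invoke the entry method
        }
    }
};
```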

SLIDE 44

Charm++

Program Structure

Regular C++ code

◮ No special compilers

SLIDE 45

Charm++

Program Structure


Small parallel interface description file

◮ Can contain control-flow DAG
◮ Parsed to generate more C++ code

SLIDE 46

Charm++

Program Structure


Inherit from framework classes to

◮ Communicate with remote objects
◮ Serialize objects for transmission

Exploit modern C++ program design techniques (OO, generics etc)
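For illustration, a small interface (.ci) file might look like this (a generic example in standard Charm++ interface syntax, not taken from the benchmarks); the generated code provides the proxy classes that the regular C++ code inherits from and communicates through:

```
mainmodule hello {
  mainchare Main {
    entry Main(CkArgMsg *m);
    entry [reductiontarget] void done();
  };
  array [1D] Hello {
    entry Hello();
    entry void greet(int data);
  };
};
```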

SLIDE 47

TRAM: Message Routing and Aggregation
