

SLIDE 1

Charm++ for Productivity and Performance

A Submission to the 2011 HPC Class II Challenge

Laxmikant V. Kale∗, Anshu Arya, Abhinav Bhatele, Abhishek Gupta, Nikhil Jain, Pritish Jetley, Jonathan Lifflander, Phil Miller, Yanhua Sun, Ramprasad Venkataraman∗, Lukasz Wesolowski, Gengbin Zheng

Parallel Programming Laboratory

Department of Computer Science University of Illinois at Urbana-Champaign

∗{kale, ramv}@illinois.edu
LLNL-PRES-513271

SC11: November 15, 2011

Kale et al. (PPL, Illinois) Charm++ for Productivity and Performance SC11: November 15, 2011 1 / 25

SLIDE 2

Benchmarks

Required

Dense LU Factorization
1D FFT
Random Access

Optional

Molecular Dynamics
Barnes-Hut


SLIDE 3

Charm++

Programming Model

Object-based: express logic via indexed collections of interacting objects (both data and tasks)
Over-decomposed: expose more parallelism than there are available processors


SLIDE 4

Charm++

Programming Model

Runtime-assisted: scheduling, observation-based adaptivity, load balancing, composition, etc.
Message-driven: computation is triggered by invoking remote entry methods
Non-blocking, asynchronous: implicitly overlapped data transfer


SLIDE 7

Charm++

Program Structure

Regular C++ code

◮ No special compilers

Small parallel interface description file

◮ Can contain a control flow DAG
◮ Parsed to generate more C++ code

Inherit from framework classes to

◮ Communicate with remote objects
◮ Serialize objects for transmission

Exploit modern C++ program design techniques (OO, generics, etc.)



SLIDE 10

Charm++

Capabilities

Promotes natural expression of parallelism
Supports modularity
Overlaps communication and computation
Automatically balances load
Automatically handles heterogeneous systems
Adapts to reduce energy consumption
Tolerates component failures

For more info

http://charm.cs.illinois.edu/why/


SLIDE 11

Metrics: Performance

Our Implementations in Charm++

Code          Machine    Max Cores   Best Performance
LU            Cray XT5   8K          67.4% of peak
FFT           IBM BG/P   64K         2.512 TFlop/s
RandomAccess  IBM BG/P   64K         22.19 GUPS
MD            Cray XE6   16K         1.9 ms/step (125K atoms)
MD            IBM BG/P   64K         11.6 ms/step (1M atoms)
Barnes-Hut    IBM BG/P   16K         27 × 10⁹ interactions/s



SLIDE 13

Metrics: Code Size

Our Implementations in Charm++

Code          C++    CI    Total¹   Libraries
LU            1231   418   1649     BLAS
FFT           112    47    159      FFTW, Mesh
RandomAccess  155    23    178      Mesh
MD            645    128   773
Barnes-Hut    2871   56    2927     TIPSY

C++: regular C++ code. CI: parallel interface descriptions and control flow DAG.

Remember: Lots of freebies!

automatic load balancing, fault tolerance, overlap, composition, portability

¹ Required logic, excluding test harness, input generation, verification, etc.


SLIDE 16

LU: Capabilities

Composable library

◮ Modular program structure
◮ Seamless execution structure (interleaved modules)

Block-centric

◮ Algorithm from a block's perspective
◮ Agnostic of processor-level considerations

Separation of concerns

◮ Domain specialist codes the algorithm
◮ Systems specialist codes tuning, resource management, etc.

Module             CI    C++   Total   Module-specific Commits
Factorization      517   419   936     472/572 (83%)
Mem. Aware Sched.  9     492   501     86/125 (69%)
Mapping            10    72    82      29/42 (69%)

(CI, C++, and Total are lines of code.)


SLIDE 17

LU: Capabilities

Flexible data placement

◮ Experiment with data layout

Memory-constrained adaptive lookahead


SLIDE 18

LU: Performance

Weak scaling (N chosen so the matrix fills 75% of memory)

[Plot: total TFlop/s vs. number of cores (128 to 8192) on Cray XT5 against theoretical peak; efficiency holds between 65.7% and 67.4% of peak]


SLIDE 19

LU: Performance

... and strong scaling too! (N=96,000)

[Plot: strong scaling on IBM BG/P added to the XT5 weak-scaling plot; BG/P efficiency declines from 60.3% through 45% and 40.8% to 31.6% of peak as core counts grow]


SLIDE 20

FFT: Parallel Coordination Code

doFFT()

  for (phase = 0; phase < 3; ++phase) {
    atomic { sendTranspose(); }
    for (count = 0; count < P; ++count)
      when recvTranspose[phase](fftMsg *msg)
        atomic { applyTranspose(msg); }
    if (phase < 2)
      atomic {
        fftw_execute(plan);
        if (phase == 0) twiddle();
      }
  }



SLIDE 22

FFT: Performance

IBM Blue Gene/P (Intrepid), 25% of memory, ESSL with FFTW wrappers

[Plot: GFlop/s vs. cores (256 to 65,536), comparing P2P all-to-all, mesh all-to-all, and the serial FFT limit]

Charm++ all-to-all

Asynchronous, Non-blocking, Topology-aware, Combining, Streaming


SLIDE 23

Random Access

What Charm++ brings to the table

Productivity

Automatically detect completion by sensing quiescence
Automatically detect the network topology of the partition

Performance

Uses same Charm++ all-to-all


SLIDE 24

Random Access: Performance

IBM Blue Gene/P (Intrepid), 2 GB of memory per node

[Plot: GUPS vs. number of cores (128 to 64K); Charm++ tracks perfect scaling, reaching 22.19 GUPS]


SLIDE 25

Optional Benchmarks

Why MD and Barnes-Hut?

Relevant scientific computing kernels
Challenge the parallelization paradigm

◮ Load imbalances
◮ Dynamic communication structure

Express non-trivial parallel control flow


SLIDE 26

Molecular Dynamics

Overview

1. Mimics the force calculation in NAMD
2. Resembles the miniMD application in the Mantevo benchmark suite
3. SLOC is 773, compared to just under 3000 lines for miniMD

[Figures: (a) 1-Away decomposition, (b) 2-AwayX decomposition]


SLIDE 27

MD: Performance

125,000 atoms, Cray XE6 (Hopper)

[Plot: time per step (ms) vs. number of cores (264 to 16,392), with no load balancing vs. refinement load balancing; best is 1.91 ms/step]


SLIDE 28

MD: Performance

1 million atoms, IBM Blue Gene/P (Intrepid)

[Plot: speedup vs. number of cores (256 to 65,536) against ideal scaling; 11.6 ms/step at full scale]


SLIDE 29

MD: Performance

The number of cores does not have to be a power of 2

[Plot: time per step (ms) vs. number of cores (64 to 128) on Intrepid]


SLIDE 30

Barnes-Hut: Productivity

1. Adaptive overlap of computation and communication allows the latency of requests for remote data to be hidden by useful local computation on PEs.

2. Automatic measurement-based load balancing allows dissociation of data decomposition from task assignment: communication is balanced through Oct-decomposition and computation through a separate load balancing strategy.



SLIDE 32

Barnes-Hut: Performance

Non-uniform (Plummer) distribution. IBM Blue Gene/P (Intrepid)

[Plot: time per step (seconds) vs. cores (2k to 16k) for the 10-million and 50-million particle datasets]

[Inset: histogram of particle frequency vs. distance from the center of mass for the Plummer 100k distribution]


SLIDE 33

Charm++ at SC11

Temperature-aware load balancing: Tue @ 2:00 pm
Fault tolerance protocol: PhD Forum, Tue @ 3:45 pm
NAMD at 200K+ cores: Thu @ 11:00 am
Topology-aware mapping for PERCS: Thu @ 4:00 pm
Parallel stochastic optimization: Poster
All-to-all simulations on PERCS: Poster

For more info

http://charm.cs.illinois.edu/why/


SLIDE 34

MeshStreamer: Message Routing and Aggregation
