Charm++ for Productivity and Performance: A Submission to the 2011 HPC Class II Challenge

SLIDE 1

Charm++ for Productivity and Performance

A Submission to the 2011 HPC Class II Challenge

Laxmikant V. Kale, Anshu Arya, Abhinav Bhatele, Abhishek Gupta, Nikhil Jain, Pritish Jetley, Jonathan Lifflander, Phil Miller, Yanhua Sun, Ramprasad Venkataraman, Lukasz Wesolowski, Gengbin Zheng

Parallel Programming Laboratory
Department of Computer Science, University of Illinois at Urbana-Champaign

May 7, 2012

Kale et al. (PPL, Illinois) Charm++ for Productivity and Performance May 7, 2012 1 / 24

SLIDE 2

Benchmarks

Required

Dense LU Factorization
1D FFT
Random Access

Optional

Molecular Dynamics
Barnes-Hut

SLIDE 3

Metrics: Performance

Our Implementations in Charm++

Code          Machine    Max Cores  Best Performance
LU            Cray XT5   8K         67.4% of peak
FFT           IBM BG/P   64K        2.512 TFlop/s
RandomAccess  IBM BG/P   64K        22.19 GUPS
MD            Cray XE6   16K        1.9 ms/step (125K atoms)
MD            IBM BG/P   64K        11.6 ms/step (1M atoms)
Barnes-Hut    IBM BG/P   16K        27 × 10^9 interactions/s

SLIDE 5

Metrics: Code Size

Our Implementations in Charm++

Code          C++   CI   Total [1]  Libraries
LU            1231  418  1649       BLAS
FFT           112   47   159        FFTW, Mesh
RandomAccess  155   23   178        Mesh
MD            645   128  773
Barnes-Hut    2871  56   2927       TIPSY

C++: Regular C++ code
CI: Parallel interface descriptions and control flow DAG

Remember: Lots of freebies!

Automatic load balancing, fault tolerance, overlap, composition, portability

[1] Required logic, excluding test harness, input generation, verification, etc.

SLIDE 8

LU: Capabilities

Composable library

◮ Modular program structure
◮ Seamless execution structure (interleaved modules)

Block-centric

◮ Algorithm from a block's perspective
◮ Agnostic of processor-level considerations

Separation of concerns

◮ Domain specialist codes algorithm
◮ Systems specialist codes tuning, resource management, etc.

Lines of code and module-specific commits:

Module             CI   C++  Total  Module-specific commits
Factorization      517  419  936    472/572 (83%)
Mem. Aware Sched.  9    492  501    86/125 (69%)
Mapping            10   72   82     29/42 (69%)

SLIDE 9

LU: Decomposition

[Figure: matrix block decomposition. Legend: column being factored; A = active panel block; U = U block; T = trailing submatrix block; unlabeled = previously factored block.]
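
The block roles in the legend can be illustrated with a serial, right-looking blocked LU factorization. This is a minimal sketch and not the Charm++ implementation: the function name `blockedLU`, the row-major layout, and the omission of pivoting are all assumptions made to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: in-place LU (no pivoting) of an n x n row-major matrix,
// processed in b x b blocks. On return, the strict lower triangle holds L
// (unit diagonal implied) and the upper triangle holds U.
void blockedLU(std::vector<double>& a, std::size_t n, std::size_t b) {
    assert(a.size() == n * n && b > 0);
    for (std::size_t k = 0; k < n; k += b) {
        const std::size_t end = std::min(k + b, n);

        // Active panel (blocks "A"): factor columns k..end-1 over rows k..n-1.
        for (std::size_t j = k; j < end; ++j)
            for (std::size_t i = j + 1; i < n; ++i) {
                a[i * n + j] /= a[j * n + j];
                for (std::size_t c = j + 1; c < end; ++c)
                    a[i * n + c] -= a[i * n + j] * a[j * n + c];
            }

        // U blocks ("U"): forward substitution on rows k..end-1, columns end..n-1.
        for (std::size_t j = k; j < end; ++j)
            for (std::size_t i = j + 1; i < end; ++i)
                for (std::size_t c = end; c < n; ++c)
                    a[i * n + c] -= a[i * n + j] * a[j * n + c];

        // Trailing submatrix (blocks "T"): T -= L_panel * U_row.
        for (std::size_t i = end; i < n; ++i)
            for (std::size_t j = k; j < end; ++j)
                for (std::size_t c = end; c < n; ++c)
                    a[i * n + c] -= a[i * n + j] * a[j * n + c];
    }
}
```

In this serial form the three loop nests run one after another for each step; in a block-centric parallel version each block would carry out only the part that applies to it.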

SLIDE 10

LU: Pseudo-Synchronous Scheduling

[Figure: timeline over processors Proc 1 through Proc n, showing active panel and trailing update work over time. Annotations: contribute to reduction; reduction up tree; reduction root; rank 1 update.]

SLIDE 11

LU: Capabilities

Flexible data placement

◮ Cf. Jonathan’s talk

Memory-constrained adaptive lookahead

SLIDE 12

LU: Capabilities

Memory-constrained adaptive lookahead

[Figure: matrix block decomposition. Legend: column being factored; A = active panel block; U = U block; T = trailing submatrix block; unlabeled = previously factored block.]

SLIDE 13

LU: Performance

Weak Scaling: (N such that matrix fills 75% memory)

[Plot: total TFlop/s (log scale, 0.1 to 100) vs. number of cores (128 to 8192); series: theoretical peak on XT5, weak scaling on XT5. Point labels: 67%, 67.4%, 67.4%, 67.1%, 66.2%, 65.7%.]
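
The problem-size rule and the percent-of-peak labels can be worked through with a little arithmetic. The sketch below assumes 8-byte double-precision entries and the leading-order LU operation count of (2/3)N^3 flops; the function names are hypothetical and do not come from the benchmark code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Largest matrix order N such that N x N doubles fit in `fraction` of memory.
std::int64_t matrixOrderForMemory(double totalBytes, double fraction) {
    assert(fraction > 0.0 && fraction <= 1.0);
    return static_cast<std::int64_t>(
        std::floor(std::sqrt(fraction * totalBytes / sizeof(double))));
}

// Fraction of machine peak achieved by an LU run of order n,
// using the leading-order flop count (2/3) n^3.
double fractionOfPeak(double n, double seconds, double peakFlopsPerSec) {
    const double flops = (2.0 / 3.0) * n * n * n;
    return flops / seconds / peakFlopsPerSec;
}
```

Under weak scaling, `totalBytes` grows with the core count, so N grows roughly with the square root of the number of cores.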

SLIDE 14

LU: Performance

... and strong scaling too! (N=96,000)

[Plot: total TFlop/s (log scale, 0.1 to 100) vs. number of cores (128 to 8192); series: theoretical peak on XT5, weak scaling on XT5, theoretical peak on BG/P, strong scaling on BG/P. Point labels: 60.3%, 45%, 40.8%, 31.6%.]

SLIDE 15

FFT: Parallel Coordination Code

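doFFT()

  for(phase = 0; phase < 3; ++phase) {
    // Send this object's data for the transpose of the current phase
    atomic { sendTranspose(); }
    // Receive P transpose messages for this phase
    for(count = 0; count < P; ++count)
      when recvTranspose[phase] (fftMsg *msg)
        atomic { applyTranspose(msg); }
    // Local FFTs follow the first two transposes; twiddle only after the first
    if (phase < 2) atomic {
      fftw_execute(plan);
      if(phase == 0) twiddle();
    }
  }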
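
The three-phase structure above (transpose, FFT, twiddle; transpose, FFT; transpose) matches the usual transpose-based formulation of a 1D FFT. The following serial sketch is an illustration under assumptions, not the submitted code: the names `dftRow`, `transpose`, and `transposeFFT` are hypothetical, and a naive DFT stands in for the library FFT call.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cd = std::complex<double>;

// Naive O(n^2) DFT of one row; a stand-in for the library FFT call.
std::vector<cd> dftRow(const std::vector<cd>& in) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cd sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j)
            sum += in[j] * std::polar(1.0, -2.0 * pi * double((j * k) % n) / double(n));
        out[k] = sum;
    }
    return out;
}

// Transpose a rows x cols row-major matrix into a cols x rows one.
std::vector<cd> transpose(const std::vector<cd>& m, std::size_t rows, std::size_t cols) {
    std::vector<cd> t(m.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            t[c * rows + r] = m[r * cols + c];
    return t;
}

// 1D DFT of length R*C computed with three transposes and two sets of row FFTs.
std::vector<cd> transposeFFT(std::vector<cd> data, std::size_t R, std::size_t C) {
    assert(data.size() == R * C);
    const double pi = std::acos(-1.0);
    const double n = double(R * C);
    std::size_t rows = R, cols = C;  // current matrix shape
    for (int phase = 0; phase < 3; ++phase) {
        data = transpose(data, rows, cols);
        std::swap(rows, cols);
        if (phase < 2) {
            for (std::size_t i = 0; i < rows; ++i) {
                std::vector<cd> row(data.begin() + i * cols, data.begin() + (i + 1) * cols);
                row = dftRow(row);
                for (std::size_t k = 0; k < cols; ++k) {
                    // Twiddle factors are applied only after the first set of row FFTs.
                    if (phase == 0) row[k] *= std::polar(1.0, -2.0 * pi * double(i * k) / n);
                    data[i * cols + k] = row[k];
                }
            }
        }
    }
    return data;
}
```

In the parallel setting each transpose becomes an all-to-all exchange, which is why the communication code dominates the coordination logic.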

SLIDE 16

MeshStreamer: Message Routing and Aggregation

Charm++ all-to-all

Asynchronous, Non-blocking, Topology-aware, Combining, Streaming
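
The "combining" idea can be sketched as per-destination buffering: small items are collected and handed to the network as one batch once a buffer fills. This is a generic illustration and not the MeshStreamer API; the class `Aggregator`, its methods, and the capacity parameter are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical aggregator: combines small items bound for the same destination
// into one batch and hands full batches to a user-supplied send function.
template <typename Item>
class Aggregator {
public:
    using SendFn = std::function<void(std::size_t dest, std::vector<Item> batch)>;

    Aggregator(std::size_t numDests, std::size_t capacity, SendFn send)
        : buffers_(numDests), capacity_(capacity), send_(std::move(send)) {}

    // Buffer one item; send the batch once `capacity` items have accumulated.
    void submit(std::size_t dest, const Item& item) {
        assert(dest < buffers_.size());
        buffers_[dest].push_back(item);
        if (buffers_[dest].size() >= capacity_) flush(dest);
    }

    // Send any partially filled buffers, for example at the end of a phase.
    void flushAll() {
        for (std::size_t d = 0; d < buffers_.size(); ++d)
            if (!buffers_[d].empty()) flush(d);
    }

private:
    void flush(std::size_t dest) {
        std::vector<Item> batch;
        batch.swap(buffers_[dest]);
        send_(dest, std::move(batch));
    }

    std::vector<std::vector<Item>> buffers_;
    std::size_t capacity_;
    SendFn send_;
};
```

Larger capacities amortize per-message overhead over more items at the cost of added latency for the first item in each batch.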

SLIDE 17

FFT: Performance

IBM Blue Gene/P (Intrepid), 25% memory, ESSL w/ FFTW wrappers

[Plot: GFlop/s (log scale) vs. cores (256 to 65536); series: P2P all-to-all, Mesh all-to-all, serial FFT limit.]

SLIDE 18

Random Access

What Charm++ brings to the table

Productivity

Automatically detect completion by sensing quiescence
Automatically detect network topology of partition

Performance

Uses same Charm++ all-to-all
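
For reference, the core of the RandomAccess benchmark is a stream of XOR updates to pseudo-random table locations. The sketch below is a serial version under assumptions: it uses the shift-and-XOR stream with polynomial 0x7 from the HPC Challenge definition, and the function names and seed handling are hypothetical rather than taken from the Charm++ submission.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Next value of the pseudo-random stream: shift left by one and
// XOR in the polynomial 0x7 when the top bit was set.
std::uint64_t nextRandom(std::uint64_t ran) {
    const std::uint64_t poly = 0x7ULL;
    return (ran << 1) ^ ((ran >> 63) ? poly : 0ULL);
}

// Apply `numUpdates` XOR updates to a table whose size is a power of two.
void applyUpdates(std::vector<std::uint64_t>& table, std::uint64_t numUpdates,
                  std::uint64_t seed) {
    assert(!table.empty() && (table.size() & (table.size() - 1)) == 0);
    const std::uint64_t mask = table.size() - 1;
    std::uint64_t ran = seed;
    for (std::uint64_t i = 0; i < numUpdates; ++i) {
        ran = nextRandom(ran);
        table[ran & mask] ^= ran;
    }
}

// GUPS: giga-updates per second.
double gups(std::uint64_t numUpdates, double seconds) {
    return static_cast<double>(numUpdates) / seconds / 1e9;
}
```

Because XOR is its own inverse, applying the same update stream twice restores the table, which gives a simple check. In a distributed run each update would be routed to the processor that owns the target index, so the all-to-all path carries nearly all of the traffic.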

SLIDE 19

Random Access: Performance

IBM Blue Gene/P (Intrepid), 2 GB of memory per node

[Plot: GUPS (log scale, 0.125 to 32) vs. number of cores (128 to 64K); series: Perfect Scaling, Charm++. Labeled point: 22.19.]

SLIDE 20

Optional Benchmarks

Why MD and Barnes-Hut?

Relevant scientific computing kernels

Challenge the parallelization paradigm

◮ Load imbalances
◮ Dynamic communication structure

Express non-trivial parallel control flow

SLIDE 21

Molecular Dynamics

Overview

1 Mimics force calculation in NAMD
2 Resembles the miniMD application in the Mantevo benchmark suite
3 SLOC is 773, compared with just under 3000 lines for miniMD

[Figure: (a) 1 Away Decomposition (b) 2 AwayX Decomposition]
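
One way to read the two decompositions is in terms of which neighboring cells a cell must interact with. The sketch below assumes that "1 Away" means interactions reach one cell in every direction, and that "2 AwayX" means cells are half as wide along X so interactions reach two cells along that axis. That interpretation, and the function name `neighborOffsets`, are assumptions for illustration.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Offsets of the cells (including the cell itself) that a cell interacts with
// when interactions reach ax, ay, az cells away along each axis.
std::vector<std::array<int, 3>> neighborOffsets(int ax, int ay, int az) {
    assert(ax >= 0 && ay >= 0 && az >= 0);
    std::vector<std::array<int, 3>> offsets;
    for (int dx = -ax; dx <= ax; ++dx)
        for (int dy = -ay; dy <= ay; ++dy)
            for (int dz = -az; dz <= az; ++dz)
                offsets.push_back(std::array<int, 3>{{dx, dy, dz}});
    return offsets;
}
```

Under these assumptions `neighborOffsets(1, 1, 1)` yields 27 cells and `neighborOffsets(2, 1, 1)` yields 45. The finer decomposition gives more, smaller work units to distribute.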

SLIDE 22

MD: Performance

125,000 atoms. Cray XE6 (Hopper)

[Plot: "Performance on Hopper (125,000 atoms)"; time per step (ms, log scale, 1 to 100) vs. number of cores (264 to 16392); series: No LB, Refine LB.]

SLIDE 23

MD: Performance

1 million atoms. IBM Blue Gene/P (Intrepid)

[Plot: "Speedup on Intrepid (1 million atoms)"; speedup vs. number of cores (both axes 256 to 65536); series: Ideal, Charm++. Labeled point: 11.6 ms/step.]

SLIDE 24

MD: Performance

Number of cores does not have to be a power-of-2

[Plot: "MD on non power-of-2 cores"; time per step (ms, 300 to 650) vs. number of cores (64 to 128 in steps of 8); series: Intrepid.]

SLIDE 25

Barnes-Hut: Productivity

1 Adaptive overlap of computation and communication allows the latency of requests for remote data to be hidden by useful local computation on PEs.

2 Automatic measurement-based load balancing allows dissociation of data decomposition from task assignment: balance communication through Oct-decomposition and computation through a separate load balancing strategy.
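
The idea of measurement-based load balancing can be illustrated with a simple refinement pass: given measured per-object loads and a current assignment, move objects off the most loaded PE while doing so lowers the maximum load. This is a simplified sketch under assumptions and not the Charm++ load balancing framework or any of its strategies; the function `refine` and its greedy selection rule are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical refinement step. objLoad[i] is the measured load of object i and
// assignment[i] its current PE. Returns the number of objects moved.
std::size_t refine(const std::vector<double>& objLoad,
                   std::vector<std::size_t>& assignment, std::size_t numPEs) {
    assert(assignment.size() == objLoad.size() && numPEs > 0);
    std::vector<double> peLoad(numPEs, 0.0);
    for (std::size_t i = 0; i < objLoad.size(); ++i)
        peLoad[assignment[i]] += objLoad[i];

    std::size_t moves = 0;
    for (;;) {
        const std::size_t hi = static_cast<std::size_t>(
            std::distance(peLoad.begin(), std::max_element(peLoad.begin(), peLoad.end())));
        const std::size_t lo = static_cast<std::size_t>(
            std::distance(peLoad.begin(), std::min_element(peLoad.begin(), peLoad.end())));

        // Pick the largest object on `hi` whose move keeps `lo` below hi's current load.
        std::size_t best = objLoad.size();
        for (std::size_t i = 0; i < objLoad.size(); ++i)
            if (assignment[i] == hi && objLoad[i] > 0.0 &&
                peLoad[lo] + objLoad[i] < peLoad[hi] &&
                (best == objLoad.size() || objLoad[i] > objLoad[best]))
                best = i;
        if (best == objLoad.size()) break;  // no move improves the balance

        peLoad[hi] -= objLoad[best];
        peLoad[lo] += objLoad[best];
        assignment[best] = lo;
        ++moves;
    }
    return moves;
}
```

Because the pass only looks at measured loads and the assignment, it is independent of how the data was decomposed, which is the dissociation described in item 2.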

SLIDE 27

Barnes-Hut: Performance

Non-uniform (Plummer) distribution. IBM Blue Gene/P (Intrepid)

[Plot: "Barnes-Hut scaling on BG/P"; time/step (seconds, 0.50 to 16.00) vs. cores (2k to 16k); series: 50m, 10m.]

[Plot: "Plummer 100k Distribution"; frequency (log scale, 1 to 100000) vs. distance from COM (0.02 to 22.8).]
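
A radial distribution like the one in the second plot can be generated by inverse-transform sampling of the Plummer model, whose enclosed mass fraction at radius r is r^3 / (r^2 + a^2)^(3/2) for scale length a. The sketch below is an illustration only; the function names, the scale length parameter, and the choice of random number generator are assumptions and say nothing about how the benchmark input was produced.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Radius at which a Plummer sphere with scale length a encloses
// mass fraction u, for 0 < u < 1.
double plummerRadius(double u, double a) {
    assert(u > 0.0 && u < 1.0);
    return a / std::sqrt(std::pow(u, -2.0 / 3.0) - 1.0);
}

// Draw `count` radii by inverse-transform sampling.
std::vector<double> samplePlummerRadii(std::size_t count, double a, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<double> radii;
    radii.reserve(count);
    while (radii.size() < count) {
        const double u = uniform(gen);
        if (u > 0.0 && u < 1.0) radii.push_back(plummerRadius(u, a));
    }
    return radii;
}
```

Most samples land within a few scale lengths of the center while a long tail extends far out, which is the kind of non-uniformity that stresses load balance in a tree code.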

SLIDE 28

Thank You!

Benchmarks

Dense LU Factorization
1D FFT
Random Access
Molecular Dynamics
Barnes-Hut