

SLIDE 1

Charm++ for Productivity and Performance

A Submission to the 2011 HPC Class II Challenge

Laxmikant V. Kale∗, Anshu Arya, Abhinav Bhatele, Abhishek Gupta, Nikhil Jain, Pritish Jetley, Jonathan Lifflander, Phil Miller, Yanhua Sun, Ramprasad Venkataraman∗, Lukasz Wesolowski, Gengbin Zheng

Parallel Programming Laboratory

Department of Computer Science University of Illinois at Urbana-Champaign

∗{kale, ramv}@illinois.edu
LLNL-PRES-513271

SC11: November 15, 2011

Kale et al. (PPL, Illinois) Charm++ for Productivity and Performance SC11: November 15, 2011 1 / 25

SLIDE 2

Benchmarks

Required

Dense LU Factorization
1D FFT
Random Access

Optional

Molecular Dynamics
Barnes-Hut


SLIDE 3

Charm++

Programming Model

Object-based: express logic via indexed collections of interacting objects (both data and tasks)
Over-decomposed: expose more parallelism than there are available processors


SLIDE 4

Charm++

Programming Model

Runtime-assisted: scheduling, observation-based adaptivity, load balancing, composition, etc.
Message-driven: computation is triggered by invoking remote entry methods
Non-blocking, asynchronous: implicitly overlapped data transfer


SLIDE 7

Charm++

Program Structure

Regular C++ code

◮ No special compilers

Small parallel interface description file

◮ Can contain a control flow DAG
◮ Parsed to generate more C++ code

Inherit from framework classes to

◮ Communicate with remote objects
◮ Serialize objects for transmission

Exploit modern C++ program design techniques (OO, generics, etc.)



SLIDE 10

Charm++

Capabilities

Promotes natural expression of parallelism
Supports modularity
Overlaps communication and computation
Automatically balances load
Automatically handles heterogeneous systems
Adapts to reduce energy consumption
Tolerates component failures

For more info

http://charm.cs.illinois.edu/why/


SLIDE 11

Metrics: Performance

Our Implementations in Charm++

Code          Machine    Max Cores   Best Performance
LU            Cray XT5   8K          67.4% of peak
FFT           IBM BG/P   64K         2.512 TFlop/s
RandomAccess  IBM BG/P   64K         22.19 GUPS
MD            Cray XE6   16K         1.9 ms/step (125K atoms)
MD            IBM BG/P   64K         11.6 ms/step (1M atoms)
Barnes-Hut    IBM BG/P   16K         27 × 10⁹ interactions/s



SLIDE 13

Metrics: Code Size

Our Implementations in Charm++

Code          C++    CI    Total¹   Libraries
LU            1231   418   1649     BLAS
FFT           112    47    159      FFTW, Mesh
RandomAccess  155    23    178      Mesh
MD            645    128   773
Barnes-Hut    2871   56    2927     TIPSY

C++: regular C++ code. CI: parallel interface descriptions and control flow DAG.

Remember: Lots of freebies!

automatic load balancing, fault tolerance, overlap, composition, portability

¹ Required logic, excluding test harness, input generation, verification, etc.


SLIDE 16

LU: Capabilities

Composable library

◮ Modular program structure
◮ Seamless execution structure (interleaved modules)

Block-centric

◮ Algorithm from a block's perspective
◮ Agnostic of processor-level considerations

Separation of concerns

◮ Domain specialist codes the algorithm
◮ Systems specialist codes tuning, resource management, etc.

Module             CI    C++   Total   Module-specific Commits
Factorization      517   419   936     472/572 (83%)
Mem. Aware Sched.  9     492   501     86/125 (69%)
Mapping            10    72    82      29/42 (69%)

(CI, C++, and Total are lines of code.)


SLIDE 17

LU: Capabilities

Flexible data placement

◮ Experiment with data layout

Memory-constrained adaptive lookahead


SLIDE 18

LU: Performance

Weak scaling (N chosen so the matrix fills 75% of memory)

[Plot: total TFlop/s vs. number of cores (128 to 8192) on Cray XT5 against theoretical peak; efficiency holds between 65.7% and 67.4% of peak]


SLIDE 19

LU: Performance

... and strong scaling too! (N=96,000)

[Plot: strong scaling on IBM BG/P added to the XT5 weak-scaling plot; BG/P efficiency declines from 60.3% through 45% and 40.8% to 31.6% of peak as core counts grow]


SLIDE 20

FFT: Parallel Coordination Code

doFFT()

  for (phase = 0; phase < 3; ++phase) {
    atomic { sendTranspose(); }
    for (count = 0; count < P; ++count)
      when recvTranspose[phase](fftMsg *msg)
        atomic { applyTranspose(msg); }
    if (phase < 2)
      atomic {
        fftw_execute(plan);
        if (phase == 0) twiddle();
      }
  }



SLIDE 22

FFT: Performance

IBM Blue Gene/P (Intrepid), 25% of memory, ESSL with FFTW wrappers

[Plot: GFlop/s vs. cores (256 to 65,536), comparing P2P all-to-all, mesh all-to-all, and the serial FFT limit]

Charm++ all-to-all

Asynchronous, Non-blocking, Topology-aware, Combining, Streaming


SLIDE 23

Random Access

What Charm++ brings to the table

Productivity

Automatically detect completion by sensing quiescence
Automatically detect the network topology of the partition

Performance

Uses same Charm++ all-to-all


SLIDE 24

Random Access: Performance

IBM Blue Gene/P (Intrepid), 2 GB of memory per node

[Plot: GUPS vs. number of cores (128 to 64K); Charm++ tracks perfect scaling, reaching 22.19 GUPS]


SLIDE 25

Optional Benchmarks

Why MD and Barnes-Hut?

Relevant scientific computing kernels
Challenge the parallelization paradigm

◮ Load imbalances
◮ Dynamic communication structure

Express non-trivial parallel control flow


SLIDE 26

Molecular Dynamics

Overview

1. Mimics the force calculation in NAMD
2. Resembles the miniMD application in the Mantevo benchmark suite
3. SLOC is 773, compared to just under 3000 lines for miniMD

[Figures: (a) 1-Away decomposition, (b) 2-AwayX decomposition]


SLIDE 27

MD: Performance

125,000 atoms, Cray XE6 (Hopper)

[Plot: time per step (ms) vs. number of cores (264 to 16,392), with no load balancing vs. refinement load balancing; best is 1.91 ms/step]


SLIDE 28

MD: Performance

1 million atoms, IBM Blue Gene/P (Intrepid)

[Plot: speedup vs. number of cores (256 to 65,536) against ideal scaling; 11.6 ms/step at full scale]


SLIDE 29

MD: Performance

The number of cores does not have to be a power of 2

[Plot: time per step (ms) vs. number of cores (64 to 128) on Intrepid]


SLIDE 30

Barnes-Hut: Productivity

1. Adaptive overlap of computation and communication allows the latency of requests for remote data to be hidden by useful local computation on PEs.

2. Automatic measurement-based load balancing allows dissociation of data decomposition from task assignment: communication is balanced through Oct-decomposition and computation through a separate load balancing strategy.



SLIDE 32

Barnes-Hut: Performance

Non-uniform (Plummer) distribution. IBM Blue Gene/P (Intrepid)

[Plot: time per step (seconds) vs. cores (2k to 16k) for the 10-million and 50-million particle datasets]

[Inset: histogram of particle frequency vs. distance from the center of mass for the Plummer 100k distribution]


SLIDE 33

Charm++ at SC11

Temperature-aware load balancing: Tue @ 2:00 pm
Fault tolerance protocol: PhD Forum, Tue @ 3:45 pm
NAMD at 200K+ cores: Thu @ 11:00 am
Topology-aware mapping for PERCS: Thu @ 4:00 pm
Parallel stochastic optimization: Poster
All-to-all simulations on PERCS: Poster

For more info

http://charm.cs.illinois.edu/why/


SLIDE 34

MeshStreamer: Message Routing and Aggregation
