SLIDE 1
Progress on Path to Exascale Computational Challenges in Fusion Energy Sciences
William Tang, Princeton University and Princeton Plasma Physics Laboratory, Princeton, NJ, USA
GPU Technology Theater, 2012 International Supercomputing Conference
SLIDE 2
SLIDE 3
Fusion: an Attractive Energy Source
- Abundant fuel, available to all nations
– Deuterium and lithium easily available for millions of years
- Environmental advantages
– No carbon emissions, short-lived radioactivity
- Cannot “blow up or melt down,” resistant to terrorist attack
– Less than a minute’s worth of fuel in the chamber
- Low risk of nuclear materials proliferation
– No fissile materials required
- Compact relative to solar, wind and biomass
– Modest land usage
- Not subject to daily, seasonal or regional weather variation & no requirement for local CO2 sequestration
- Not limited by the need for large-scale energy storage nor for long-distance energy transmission
- Fusion is complementary to other attractive energy sources
Fusion Energy: Burning plasmas are self-heated and self-organized systems
SLIDE 4
Progress in Magnetic Fusion Energy (MFE) Research
[Figure: fusion power from tokamak experiments worldwide vs. year (1975–2020), log scale from milliwatts to megawatts; TFTR (U.S.) reached 10 MW, JET (Europe) 16 MW, and ITER is projected at 500 MW.]
SLIDE 5
ITER Goal: Demonstration of the Scientific and Technological Feasibility of Fusion Power
- ITER is an ~$20B facility located in France & involving 7 governments representing over half of the world’s population
– a dramatic next step for Magnetic Fusion Energy (MFE): producing a sustained burning plasma
– Today: 10 MW(th) for 1 second with gain ~1
– ITER: 500 MW(th) for >400 seconds with gain >10
- “DEMO” will be the demonstration fusion reactor after ITER
– 2500 MW(th) continuous with gain >25, in a device of similar size and field as ITER
- Ongoing R&D programs worldwide [experiments, theory, computation, and technology] are essential to provide the growing knowledge base for ITER operation, targeted for ~2020
- Realistic HPC-enabled simulations are required to cost-effectively plan, “steer,” & harvest key information from expensive (~$1M per long-pulse) ITER shots
SLIDE 6
Vector Era (USA, Japan) → Massively Parallel Era (USA, Japan, Europe) → Multi-core Era: a new paradigm in computing
FES Needs to Be Prepared to Exploit Local Concurrency to Take Advantage of the Most Powerful Supercomputing Systems of the 21st Century
(e.g., the U.S.’s Titan & Blue Gene/Q, Japan’s Fujitsu K computer, China’s Tianhe-1A, ....)
SLIDE 7
Extreme Scale Programming Models for Applications --- continue to follow interdisciplinary paradigm established by U.S. SciDAC Program
Theory (Mathematical Model) → Applied Mathematics (Basic Algorithms) → Computational Physics (Scientific Codes) → Computer Science (System Software) → Computational Predictions
- Do the computational predictions agree* with experiments?
– No: identify whether the problem lies with the mathematical model or with the computational method (“V&V + UQ” loop)
– Yes: is the speed/efficiency adequate?
– Inadequate: optimize (“Performance” loop; co-design challenges: low memory/core, locality, latency, ...)
– Adequate: use the new tool for scientific discovery (repeat the cycle as new phenomena are encountered)
*Comparisons: empirical trends; sensitivity studies; detailed structure (spectra, correlation functions, ...)
SLIDE 8
Integrated Plasma Edge-Core Petascale Studies on Jaguar
- XGC1 scales efficiently all the way to the full Jaguar petaflop capability (223,488 cores, with MPI+OpenMP) & routinely uses >70% of that capability; a minimal sketch of the MPI+OpenMP pattern follows below
- New SciDAC-3 “EPSi” project: to address XGC1 conversion to the GPU architecture of Titan
- C.S. Chang, et al., SciDAC-2 “CPES” project: petascale-level production runs with XGC1 require 24M CPU-hours (100,000 cores × 240 hours)
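A minimal, self-contained sketch of the MPI+OpenMP hybrid pattern referred to above (hypothetical names and parameters; not the actual XGC1 source): one MPI rank per domain partition, with OpenMP threads sharing the particle work inside each rank.

    /* Sketch of the MPI+OpenMP hybrid pattern (illustrative, not XGC1). */
    #include <mpi.h>
    #include <stdio.h>

    #define N_LOCAL 1000000              /* particles per rank (assumed) */
    static double x[N_LOCAL], v[N_LOCAL];

    int main(int argc, char **argv) {
        int provided, rank, nprocs;
        /* request thread support so OpenMP can coexist with MPI */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const double dt = 1.0e-3;        /* placeholder time step */
        #pragma omp parallel for         /* threads split the particle push */
        for (int i = 0; i < N_LOCAL; ++i) {
            v[i] += 0.0;                 /* field/force term omitted */
            x[i] += v[i] * dt;
        }

        if (rank == 0)
            printf("pushed %d particles/rank on %d ranks\n", N_LOCAL, nprocs);
        MPI_Finalize();
        return 0;
    }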
SLIDE 9
Microturbulence in Fusion Plasmas – Mission Importance: fusion reactor size & cost are determined by the balance between loss processes & self-heating rates
- “Scientific Discovery”: transition to favorable scaling of confinement produced in simulations for ITER-size plasmas
– from a/ρi = 400 (JET, the largest present laboratory experiment) through a/ρi = 1000 (ITER, the ignition experiment)
- Multi-TF simulations using the GTC global PIC code [e.g., Z. Lin, et al., Science 281, 1835 (1998); PRL (2002)] deployed a billion particles and 125M spatial grid points over 7,000 time steps at NERSC: the 1st ITER-scale simulation with ion-gyroradius resolution
- Understanding the physics of the favorable plasma-size scaling trend demands much greater computational resources + improved algorithms [radial domain decomposition (sketched below), hybrid (MPI+OpenMP) programming, ...] & modern diagnostics
[Figure: ion transport vs. plasma size – good news for ITER!]
Excellent scalability of global PIC codes on LCFs enables advanced physics simulations to improve understanding
- major advances in global PIC code development for both advanced-CPU & GPU low-memory-per-core systems [e.g., GTC-P on Blue Gene/Q at ALCF & GTC-GPU on Titan/Titan-Dev at OLCF]
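As a concrete illustration of the radial domain decomposition mentioned above, a minimal sketch (a generic 1-D block distribution, not GTC’s actual scheme): each MPI rank owns a contiguous band of the mpsi radial grid points.

    /* Sketch of 1-D radial domain decomposition (generic, not GTC's). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int mpsi = 768;            /* radial grid points (ITER-size D) */
        int base = mpsi / nprocs, rem = mpsi % nprocs;
        /* the first 'rem' ranks each take one extra point */
        int lo = rank * base + (rank < rem ? rank : rem);
        int hi = lo + base + (rank < rem ? 1 : 0);   /* [lo, hi) owned here */

        printf("rank %d owns radial indices [%d, %d)\n", rank, lo, hi);
        MPI_Finalize();
        return 0;
    }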
SLIDE 10
Recent GTC-P Weak Scaling Results on “Mira” @ALCF [Bei Wang, G8 Post-Doc]
Excellent scalability demonstrated [both grid size and # of particles increased proportionally with # of cores] & a dramatic (10x) gain in “time to solution”; similar weak-scaling studies are being planned for Sequoia (the world’s largest BG/Q system) at LLNL. A small setup sketch follows below.
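For reference, a sketch of how such a weak-scaling series is set up (the per-core grid and particle counts are assumptions, not the actual Mira study parameters): grid points and total particles both grow in proportion to the core count, so work per core stays fixed, and efficiency can be reported as T(base)/T(N).

    /* Weak-scaling setup sketch: problem grows with the machine. */
    #include <stdio.h>

    int main(void) {
        const long long grid_per_core = 4096;       /* assumed */
        const long long particles_per_core = 409600;/* assumed */
        for (long long cores = 1024; cores <= 524288; cores *= 8) {
            printf("%7lld cores: %12lld grid points, %15lld particles\n",
                   cores, cores * grid_per_core, cores * particles_per_core);
        }
        return 0;
    }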
SLIDE 11
GTC-GPU GPU/CPU Hybrid Code, 1st version introduced at SC2011:
- K. Madduri, K. Ibrahim, S. Williams, E.-J. Im, S. Ethier, J. Shalf, L. Oliker, “Gyrokinetic Toroidal Simulations on Leading Multi- and Many-core HPC Systems”
- Physics content in the GPU/CPU hybrid version of GTC is the same as in the “GTC-P” code
- Challenge: massive fine-grained parallelism and explicit memory transfers between multiple memory spaces within a compute node
- Approach: consider the 3 main computational phases: charge deposition, particle push, and particle shift
– integrate three programming models: NVIDIA CUDA & OpenMP within a node, and MPI between nodes
– explored speedup by parallelizing the charge-deposition phase (a minimal kernel sketch follows this list); memory locality improves the performance of most routines, but performance can degrade because of access conflicts
- New results [Bei Wang, et al., Princeton U./PPPL; K. Ibrahim, et al., LBNL, ...]: the GTC-GPU code demonstrated excellent scaling behavior on NERSC’s 48-node Dirac test bed and recently on OLCF’s 960-node Titan-Dev system (readiness for Titan and Tianhe-1A)
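To make the charge-deposition discussion concrete, a minimal CUDA kernel sketch (not the GTC-GPU source; the 1-D grid and names are simplifications): each thread scatters its particle’s weight onto the grid, with atomicAdd resolving the fine-grained data hazards noted above.

    /* 1-D charge-deposition scatter sketch (illustrative, not GTC-GPU). */
    __global__ void deposit_charge(const float *px, const float *w,
                                   float *rho, int nparticles, int ngrid) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nparticles) return;
        int cell = (int)(px[i] * ngrid);       /* px assumed in [0, 1) */
        if (cell >= ngrid) cell = ngrid - 1;   /* clamp boundary case */
        atomicAdd(&rho[cell], w[i]);           /* hazard-safe scatter */
    }

    /* typical launch:
       deposit_charge<<<(n + 255) / 256, 256>>>(px, w, rho, n, ngrid); */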
SLIDE 12
GTC-GPU Optimization Considerations
- Gather and scatter operations are key computational components of a PIC method and account for ~80% of the total computational time in GTC
- Challenge: achieving highly efficient parallelism while dealing with (i) random access that makes poor use of caches; and (ii) potential fine-grained data hazards that serialize the computation
- Approach: improve locality by sorting the particles by deposition grid points
– for the gyrokinetic PIC method (where each finite-sized particle is represented by four points on a ring), this requires sorting these points instead of the particles
- Sorting is an important pre-processing step in the PIC method when dealing with GPU architectures (see the Thrust sketch below)
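A hedged sketch of that points-sorting step using Thrust (the key/value layout is an assumption; GTC-GPU’s actual implementation may differ): each of the four gyro-ring points carries the index of the grid cell it deposits to, and sorting by that key makes threads that run together touch grid memory that sits together.

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    /* Sort deposition points by target grid cell so the deposition
       kernel sees (nearly) contiguous memory accesses. */
    void sort_points_by_cell(thrust::device_vector<int>& cell_key,
                             thrust::device_vector<int>& point_id) {
        thrust::sort_by_key(cell_key.begin(), cell_key.end(),
                            point_id.begin());
    }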
SLIDE 13
GTC-GPU Charge & Push Subroutine
- “GPU multi”: the original GPU version
- “GPU cooperative”: an improved GPU version that uses shared memory to achieve coalesced global memory access (for improved memory bandwidth; sketched below)
- “GPU points-sorting”: the up-to-date GPU version that uses shared memory to (i) achieve coalesced global memory access and (ii) reduce global memory access through points sorting
- “CPU 16 threads”: the best-optimized CPU version, with 16 OpenMP threads
Note: all tests illustrated here were carried out on Dirac (NERSC) with 10 particles per cell (ppc) and 192 grid points in the radial dimension, for 100 time steps
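One way to realize the shared-memory/coalescing idea behind “GPU cooperative” (a sketch under assumptions: grid cells pre-binned into disjoint tiles, one tile per block; not the actual GTC-GPU kernels): accumulate charge in fast shared memory, then flush each tile to global memory with coalesced writes.

    #define TILE 256   /* grid cells per block-owned tile (assumption) */

    __global__ void deposit_shared(const int *cell, const float *w,
                                   float *rho, int nparticles, int ngrid) {
        __shared__ float tile[TILE];
        int cell0 = blockIdx.x * TILE;   /* first cell this block owns */
        for (int j = threadIdx.x; j < TILE; j += blockDim.x)
            tile[j] = 0.0f;
        __syncthreads();

        /* each block scans all particles and keeps those in its tile;
           a production code would pre-sort so each block reads one slice */
        for (int i = threadIdx.x; i < nparticles; i += blockDim.x) {
            int c = cell[i] - cell0;
            if (c >= 0 && c < TILE)
                atomicAdd(&tile[c], w[i]);   /* cheap shared-memory atomic */
        }
        __syncthreads();

        /* tiles are disjoint across blocks: plain, coalesced global writes */
        for (int j = threadIdx.x; j < TILE && cell0 + j < ngrid;
             j += blockDim.x)
            rho[cell0 + j] += tile[j];
    }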
SLIDE 14
Problem Size Studies
Problem settings:
- mpsi: number of grid points in the radial dimension
- mthetamax: number of grid points in the poloidal dimension at the largest ring
- mgrid: total number of grid points in each plane

Grid size     B        C        D
mpsi          192      384      768
mthetamax     1408     2816     5632
mgrid         151161   602695   2406883

Problem size C corresponds to a JET-size tokamak; problem size D corresponds to an ITER-size tokamak. The largest problem we can run on Dirac (a single 3 GB Fermi GPU on each node) is C20 (JET-size tokamak with 20 ppc).
- We expect to be able to run D20 (ITER-size tokamak with 20 ppc) on Titan (with a single 6 GB Kepler GPU on each node); a back-of-envelope memory check follows below.
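A back-of-envelope sketch consistent with the table (assumptions: one poloidal plane of particles per GPU and ~12 doubles per particle; not GTC-GPU’s actual layout, and grid/field arrays are ignored):

    #include <stdio.h>

    int main(void) {
        const long long mgrid_C = 602695, mgrid_D = 2406883; /* from table */
        const long long ppc = 20;                 /* particles per cell */
        const long long bytes_per_particle = 12 * 8; /* ~12 doubles, assumed */
        printf("C20 particles/GPU: ~%.2f GB\n",      /* ~1.2 GB: fits 3 GB */
               mgrid_C * ppc * bytes_per_particle / 1e9);
        printf("D20 particles/GPU: ~%.2f GB\n",      /* ~4.6 GB: needs 6 GB */
               mgrid_D * ppc * bytes_per_particle / 1e9);
        return 0;
    }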
SLIDE 15
Lessons Learned: Some Programming Model Challenges in Moving toward Exascale
- Locality: need to improve data locality (e.g., by sorting particles according to their positions on the grid)
– due to physical limitations, moving data between, and even within, modern microchips is more time-consuming than performing computations!
– scientific codes often use data structures that are easy to implement quickly but limit flexibility and scalability in the long run
- Latency: need to improve highly multi-threaded algorithms to address memory-latency & load-balance challenges
- Flops vs. Memory: need to utilize Flops (cheap) to better utilize Memory (limited & expensive to access)
- Advanced Architectures: need to deploy innovative algorithms within modern science codes on low-memory-per-node architectures (e.g., BG/Q, Fujitsu K, Tianhe-1A, & Titan)
– multi-threading within nodes, maximizing locality while minimizing communications
– large future simulations (e.g., PIC codes will likely need to work with >10 billion grid points and over 100 trillion particles!)
- Example: significant progress achieved with “GTC-P” [BG/Q at ALCF] & “GTC-GPU” on hybrid CPU-GPU systems [Titan-Dev at OLCF]
SLIDE 16
Future Challenges and Opportunities
(1) Energy Goal in the FES application domain is to increase the availability of clean, abundant energy by first moving (in MFE) to a burning-plasma experiment on the ITER facility, located in France & involving the collaboration of 7 governments representing over half of the world’s population, and (in IFE) to a laser-fusion ignition experiment on the NIF facility at LLNL
– ITER targets 500 MW for 400 seconds with gain > 10 to demonstrate technical