www.anl.gov
Argonne Leadership Computing Facility
IWOCL, Apr. 28, 2020
Preparing to Program Aurora at Exascale
Hal Finkel, et al.
Scientific Supercomputing: What is (traditional) supercomputing? Computing for large, tightly-coupled problems.
Lots of computational capability paired with lots of high-performance memory. High computational density paired with a high-throughput low-latency network.
https://www.alcf.anl.gov/files/alcfscibro2015.pdf
http://crd.lbl.gov/assets/pubs_presos/CDS/ATG/WassermanSOTON.pdf
http://www.nextplatform.com/2015/11/30/inside-future-knights-landing-xeon-phi-systems/ https://forum.beyond3d.com/threads/nvidia-pascal-speculation-thread.55552/page-4
“Many Core” CPUs and GPUs. All of our upcoming systems use GPUs!
(https://science.osti.gov/-/media/ascr/ascac/pdf/meetings/201909/20190923_ASCAC-Helland-Barbara-Helland.pdf)
Intel-Cray machine arriving at Argonne in 2021
Sustained Performance > 1 Exaflops
Intel Xeon processors and Intel Xe GPUs
2 Xeons (Sapphire Rapids), 6 GPUs (Ponte Vecchio [PVC])
Greater than 10 PB of total memory
Cray Slingshot fabric and Shasta platform
Filesystem:
Distributed Asynchronous Object Store (DAOS)
≥ 230 PB of storage capacity, bandwidth of > 25 TB/s
Lustre
150 PB of storage capacity, bandwidth of ~1 TB/s
(Ponte Vecchio)
All-to-all connection: low latency and high bandwidth across CPUs and GPUs
Unified Memory and GPU ↔ GPU connectivity… Important implications for the programming model!
Software stack spanning Simulation, Data, and Learning (figure):
Simulation: Directives, Parallel Runtimes, Solver Libraries, HPC Languages
Data: Big Data Stack, Statistical Libraries, Databases, Productivity Languages
Learning: DL Frameworks, Linear Algebra Libraries, Statistical Libraries, Productivity Languages
Common layers: Math Libraries, C++ Standard Library, libc; I/O, Messaging; Scheduler; Linux Kernel, POSIX
Cross-cutting: Compilers, Performance Tools, Debuggers; Containers, Visualization
MPI software stack (figure): MPICH → CH4 → OFI (libfabric) → Slingshot hardware
Fortran 2008, OpenMP 5
A significant amount of the code run on present-day machines is written in Fortran.
Most new code development seems to have shifted to other languages (primarily C++).
Industry specification from Intel (https://www.oneapi.com/spec/):
Language and libraries to target programming across diverse architectures (DPC++, APIs, low-level interface)
Intel oneAPI products and toolkits (https://software.intel.com/ONEAPI):
Implementations of the oneAPI specification and analysis and debug tools to help programming
Highly tuned algorithms:
FFT, Linear algebra (BLAS, LAPACK)
Sparse solvers
Statistical functions, Vector math, Random number generators
Optimized for every Intel platform
oneAPI MKL (oneMKL): https://software.intel.com/en-us/oneapi/mkl
DPC++ support
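As a rough illustration of that DPC++ support, here is a minimal sketch of calling a oneMKL GEMM on USM pointers. The header and namespace layout follow the oneMKL specification and may differ between releases, so treat the exact names as assumptions rather than the library's confirmed API.

// Sketch only: assumes the oneMKL DPC++ BLAS interface described in the oneAPI spec.
#include <CL/sycl.hpp>
#include <oneapi/mkl.hpp>   // header name may vary by oneMKL release
#include <cstdint>

int main() {
  sycl::queue q;                       // default device
  const std::int64_t n = 256;

  // Unified Shared Memory: allocations visible to both host and device.
  double *A = sycl::malloc_shared<double>(n * n, q);
  double *B = sycl::malloc_shared<double>(n * n, q);
  double *C = sycl::malloc_shared<double>(n * n, q);
  for (std::int64_t i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

  // C = 1.0 * A * B + 0.0 * C, enqueued asynchronously on the device.
  oneapi::mkl::blas::column_major::gemm(
      q, oneapi::mkl::transpose::nontrans, oneapi::mkl::transpose::nontrans,
      n, n, n, 1.0, A, n, B, n, 0.0, C, n);
  q.wait();

  sycl::free(A, q); sycl::free(B, q); sycl::free(C, q);
  return 0;
}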
Libraries to support AI and Analytics:
oneAPI Deep Neural Network Library (oneDNN)
High-performance primitives to accelerate deep learning frameworks; powers TensorFlow, PyTorch
Running on Gen9 today (via OpenCL)
oneAPI Data Analytics Library (oneDAL)
Classical machine learning algorithms; easy-to-use one-line daal4py Python interfaces
Powers Scikit-Learn, Apache Spark MLlib
Applications will be using a variety of programming models for Exascale:
CUDA, OpenCL, HIP, OpenACC, OpenMP, DPC++/SYCL, Kokkos, RAJA
Not all systems will support all models. Libraries may help you abstract some programming models.
OpenMP 5 constructs will provide a directives-based programming model for Intel GPUs
Available for C, C++, and Fortran
A portable model expected to be supported on a variety of platforms (Aurora, Frontier, Perlmutter, …)
Optimized for Aurora
For Aurora, OpenACC codes could be converted into OpenMP
ALCF staff will assist with conversion, training, and best practices
Automated translation possible through the clacc conversion tool (for C/C++)
https://www.openmp.org/
The OpenMP 4.5/5 specification has significant updates to allow for improved support of accelerator devices:
Distributing iterations of the loop to threads
Offloading code to run on an accelerator
Controlling data transfer between devices

#pragma omp target [clause[[,] clause],…]  structured-block
#pragma omp declare target  declarations-definition-seq
#pragma omp declare variant*(variant-func-id) clause new-line  function definition or declaration
#pragma omp teams [clause[[,] clause],…]  structured-block
#pragma omp distribute [clause[[,] clause],…]  for-loops
#pragma omp loop* [clause[[,] clause],…]  for-loops
map([map-type:] list)  with map-type := alloc | tofrom | from | to | …
#pragma omp target data [clause[[,] clause],…]  structured-block
#pragma omp target update [clause[[,] clause],…]
* denotes OpenMP 5

Environment variables:
OMP_DEFAULT_DEVICE
OMP_TARGET_OFFLOAD
Runtime support routines:
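For example, a minimal sketch of the device-related runtime routines (the routines shown are standard OpenMP API; the surrounding program is illustrative only):

#include <omp.h>
#include <stdio.h>

int main(void) {
  /* Query how many accelerator devices the runtime can see. */
  int num_devices = omp_get_num_devices();
  printf("Number of target devices: %d\n", num_devices);

  /* Select the device used by subsequent target regions
     (the same thing OMP_DEFAULT_DEVICE controls from the environment). */
  if (num_devices > 0)
    omp_set_default_device(0);

  /* Check whether a target region actually ran on the device:
     omp_is_initial_device() returns false when executing on an accelerator. */
  int on_host = 1;
  #pragma omp target map(from: on_host)
  {
    on_host = omp_is_initial_device();
  }
  printf("Target region ran on the %s\n", on_host ? "host" : "device");
  return 0;
}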
SYCL
Khronos standard specification
SYCL is a C++-based abstraction layer (standard C++11)
Builds on OpenCL concepts (but single-source)
SYCL is designed to be as close to standard C++ as possible
Current implementations of SYCL:
ComputeCpp™ (www.codeplay.com), Intel SYCL (github.com/intel/llvm), triSYCL (github.com/triSYCL/triSYCL), hipSYCL (github.com/illuhad/hipSYCL)
Runs on today's CPUs and NVIDIA, AMD, Intel GPUs
SYCL 1.2.1 or later, C++11 or later
DPC++
Part of the Intel oneAPI specification
Intel extension of SYCL to support new innovative features
Incorporates the SYCL 1.2.1 specification and Unified Shared Memory
Adds language or runtime extensions as needed to meet user needs
Intel DPC++ = SYCL 1.2.1 or later, C++11 or later, plus extensions
Extensions and descriptions:
Unified Shared Memory (USM): defines pointer-based memory accesses and management interfaces.
In-order queues: defines simple in-order semantics for queues, to simplify common coding patterns.
Reduction: provides a reduction abstraction to the ND-range form of parallel_for.
Optional lambda name: removes the requirement to manually name lambdas that define kernels.
Subgroups: defines a grouping of work-items within a work-group.
Data flow pipes: enables efficient First-In, First-Out (FIFO) communication (FPGA-only).
https://spec.oneapi.com/oneAPI/Elements/dpcpp/dpcpp_root.html#extensions-table
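To make the USM extension above concrete, here is a minimal sketch using malloc_shared instead of buffers and accessors. The unnamed kernel lambda relies on the optional-lambda-name extension; the names and sizes are illustrative, not a confirmed example from the specification.

#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
  queue q;                 // default device
  const size_t N = 1024;

  // Unified Shared Memory: a single pointer usable on both host and device.
  int *data = malloc_shared<int>(N, q);

  q.parallel_for(range<1>{N}, [=](id<1> idx) {
    data[idx[0]] = static_cast<int>(idx[0]);
  }).wait();

  // The host reads results directly: no explicit copy, buffer, or accessor.
  std::cout << "data[42] = " << data[42] << "\n";

  sycl::free(data, q);
  return 0;
}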
Host ↔ Device: transfer data and execution control
extern void init(float*, float*, int);
extern void output(float*, int);

void vec_mult(float *p, float *v1, float *v2, int N)
{
  int i;
  init(v1, v2, N);
  #pragma omp target teams distribute parallel for simd \
          map(to: v1[0:N], v2[0:N]) map(from: p[0:N])
  for (i = 0; i < N; i++) {
    p[i] = v1[i] * v2[i];
  }
}
Creates teams of threads on the target device. Distributes iterations to the threads, where each thread uses SIMD parallelism. The map clauses control data transfer.
void foo(int *A) {
  default_selector selector; // Selectors determine which device to dispatch to.
  {
    queue myQueue(selector); // Create queue to submit work to, based on selector

    // Wrap data in a sycl::buffer
    buffer<cl_int, 1> bufferA(A, 1024);

    myQueue.submit([&](handler& cgh) {
      // Create an accessor for the sycl buffer.
      auto writeResult = bufferA.get_access<access::mode::discard_write>(cgh);

      // Kernel
      cgh.parallel_for<class hello_world>(range<1>{1024}, [=](id<1> idx) {
        writeResult[idx] = idx[0];
      }); // End of the kernel function
    }); // End of the queue commands
  } // End of scope, wait for the queued work to stop.
  ...
}
Callouts: get a device; SYCL buffer using host pointer; data accessor; kernel; queue out of scope.
Host ↔ Device: transfer data and execution control
Science Problem → Choose Algorithms (for the target architectures) → Implement and Test Algorithms → Optimize Algorithms → Run high-performance code! Knowledge of the system architecture and tools informs the algorithm choices and optimization.
Trade-offs between:
Cannot be made by a compiler!
http://llvm-hpc2-workshop.github.io/slides/Tian.pdf
In 2015, many codes used OpenMP directly to express parallelism; a minority of applications used abstraction libraries (TBB and Thrust on this chart).
Use of C++ lambdas. These libraries can use OpenMP and/or other compiler directives under the hood, but will probably use DPC++/HIP/CUDA.
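For instance, a Kokkos-style loop written with a C++ lambda (a sketch with hypothetical names; the backend chosen when Kokkos is built decides whether this runs via OpenMP, CUDA, HIP, or another backend):

#include <Kokkos_Core.hpp>

// Hypothetical helper: a(i) = 2*b(i), written once and run by whatever backend Kokkos uses.
void scale(Kokkos::View<double*> a, Kokkos::View<const double*> b) {
  Kokkos::parallel_for("scale", a.extent(0), KOKKOS_LAMBDA(const int i) {
    a(i) = 2.0 * b(i);
  });
  Kokkos::fence();  // dispatch is asynchronous on device backends
}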
And starting with C++17, the standard library has parallel algorithms too...
// For example: std::sort(std::execution::par_unseq, vec.begin(), vec.end()); // parallel and vectorized
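A self-contained sketch of the same idea (standard C++17; whether it actually runs in parallel depends on the standard library and its backend, e.g. TBB for libstdc++):

#include <algorithm>
#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
  std::vector<double> vec(1 << 20);
  std::iota(vec.begin(), vec.end(), 0.0);

  // Parallel and vectorized sort.
  std::sort(std::execution::par_unseq, vec.begin(), vec.end());

  // Parallel reduction.
  double sum = std::reduce(std::execution::par_unseq, vec.begin(), vec.end(), 0.0);
  std::cout << "sum = " << sum << "\n";
  return 0;
}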
Why can't programmers just write the code optimally?
directly:
in library1:
  void foo() {
    std::for_each(std::execution::par_unseq, vec1.begin(), vec1.end(), ...);
  }
in library2:
  void bar() {
    std::for_each(std::execution::par_unseq, vec2.begin(), vec2.end(), ...);
  }
foo();
bar();
void foo(double * restrict a, double * restrict b, etc.) {
  #pragma omp parallel for
  for (i = 0; i < n; ++i) {
    a[i] = e[i]*(b[i]*c[i] + d[i]) + f[i];
    m[i] = q[i]*(n[i]*o[i] + p[i]) + r[i];
  }
}

void foo(double * restrict a, double * restrict b, etc.) {
  #pragma omp parallel for
  for (i = 0; i < n; ++i) {
    a[i] = e[i]*(b[i]*c[i] + d[i]) + f[i];
  }
  #pragma omp parallel for
  for (i = 0; i < n; ++i) {
    m[i] = q[i]*(n[i]*o[i] + p[i]) + r[i];
  }
}
Split the loop? Or should we fuse instead?
void foo(double * restrict a, double * restrict b, etc.) {
  #pragma omp parallel for
  for (i = 0; i < n; ++i) {
    a[i] = e[i]*(b[i]*c[i] + d[i]) + f[i];
  }
  #pragma omp parallel for
  for (i = 0; i < n; ++i) {
    m[i] = q[i]*(n[i]*o[i] + p[i]) + r[i];
  }
}

void foo(double * restrict a, double * restrict b, etc.) {
  #pragma omp parallel
  {
    #pragma omp for
    for (i = 0; i < n; ++i) {
      a[i] = e[i]*(b[i]*c[i] + d[i]) + f[i];
    }
    #pragma omp for
    for (i = 0; i < n; ++i) {
      m[i] = q[i]*(n[i]*o[i] + p[i]) + r[i];
    }
  }
}
(we might want to fuse the parallel regions)
./3D 512 8 100 ../data/hotspot3D/power_512x8 ../data/hotspot3D/temp_512x8
Intel Core i9, 10 cores, 20 threads, 51 runs, with and without the optimization.
(Work by Johannes Doerfert; see our IWOMP 2018 paper.)
Base version vs. a version where the compiler understands the parallelism well enough to get better pointer-aliasing results.
It is really hard for compilers to change memory layouts and generally determine what memory is needed where. The Kokkos C++ library has memory placement and layout policies: View<const double ***, Layout, Space, MemoryTraits<RandomAccess>> name(...);
https://trilinos.org/oldsite/events/trilinos_user_group_2013/presentations/2013-11-TUG-Kokkos-Tutorial.pdf
Constant random-access data might be put into texture memory on a GPU, for example. Using the right memory layout and placement helps a lot!
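A sketch of how those policies are used in practice (the sizes and names are hypothetical; the View template arguments mirror the declaration above):

#include <Kokkos_Core.hpp>

int main(int argc, char *argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int nx = 512, ny = 512, nz = 8;

    // Writable field; the layout defaults to whatever suits the default execution space.
    Kokkos::View<double ***> temp("temp", nx, ny, nz);

    // Read-only alias with the RandomAccess trait: on CUDA-like backends this can be
    // routed through the texture / read-only data cache.
    Kokkos::View<const double ***, Kokkos::MemoryTraits<Kokkos::RandomAccess>>
        temp_ro = temp;

    Kokkos::parallel_for("sweep", nx, KOKKOS_LAMBDA(const int i) {
      temp(i, 0, 0) = 2.0 * temp_ro(i, 0, 0);  // reads may use the read-only cache
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}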
As you might imagine, nothing is perfect yet...
Criterion | OpenMP | DPC++ | Kokkos / RAJA
Language | Simple directives have yielded to complicated directives | Modern C++; simple cases will become simpler over time | Modern C++
Default Execution Model | Fork-join | Work queue (probably better for expressing scalable parallelism) | Fork-join
Compiler Optimization Potential | High | Low (dynamic work queue …) | Medium (greatly depends on underlying backend)
Integrate With Highly-Parameterized Code | Low / Medium | High | High
Helps With Data Layout | No | No (not yet) | Yes
Good Accelerator-to-Accelerator Transfer / Dispatch | No (not yet) | No (not yet) | No (not yet)
There is a trade-off between the development of portably-performant applications vs. the ability to explicitly parameterize and dynamically compose the implementations of algorithms.
centric models will be explored.
Argonne Leadership Computing Facility and Computational Science Division Staff
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation's exascale computing imperative.
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.