System Software for Armv8-A with SVE

System Software for Armv8-A with SVE. Yutaka Ishikawa, Leader of FLAGSHIP2020 Project, RIKEN Center for Computational Science. 9:00–9:25, 14th of January, 2019. Open Source HPC Collaboration on Arm Architecture, Linaro workshop, Guangzhou, China


  1. System Software for Armv8-A with SVE
     Yutaka Ishikawa, Leader of FLAGSHIP2020 Project, RIKEN Center for Computational Science
     9:00–9:25, 14th of January, 2019
     Open Source HPC Collaboration on Arm Architecture, Linaro workshop, Guangzhou, China

  2. Background: Flagship2020
     • Missions
       • Building the Japanese national flagship supercomputer, post K, and
       • Developing a wide range of HPC applications, running on post K, in order to solve social and science issues in Japan
     • Project organization
       • Post K Computer development
         • RIKEN AICS is in charge of development
         • Fujitsu is the vendor partner
         • International collaborations: DOE, CEA, JLESC (NCSA, ANL, UTK, JSC, BSC, INRIA, RIKEN)
       • Applications
         • The government selected 9 social & scientific priority issues and 4 exploratory issues, together with their R&D organizations
     2019/1/14 RIKEN Center for Computational Science

  3. Background: Flagship2020
     Target Applications (Program: Brief description)
     ① GENESIS: MD for proteins
     ② Genomon: Genome processing (Genome alignment)
     ③ GAMERA: Earthquake simulator (FEM in unstructured & structured grid)
     ④ NICAM+LETK: Weather prediction system using Big data (structured grid stencil & ensemble Kalman filter)
     ⑤ NTChem: molecular electronic (structure calculation)
     ⑥ FFB: Large Eddy Simulation (unstructured grid)
     ⑦ RSDFT: an ab-initio program (density functional theory)
     ⑧ Adventure: Computational Mechanics System for Large Scale Analysis and Design (unstructured grid)
     ⑨ CCS-QCD: Lattice QCD simulation (structured grid Monte Carlo)

  4. Background: Post-K CPU A64FX (Courtesy of FUJITSU LIMITED)
     Architecture: Armv8.2-A SVE (512 bit SIMD)
     Core: 48 cores for compute and 2/4 for OS activities; DP: 2.7+ TF, SP: 5.4+ TF, HP: 10.8 TF
     Cache: L1D: 64 KiB, 4 way, 230 GB/s (load), 115 GB/s (store); L2: 8 MiB, 16 way, 115 GB/s (load), 57 GB/s (store)
     Memory: HBM2 32 GiB, 1024 GB/s
     Interconnect: TofuD (28 Gbps x 2 lane x 10 port)
     I/O: PCIe Gen3 x 16 lane
     Technology: 7nm FinFET
     Performance: Stream triad: 830+ GB/s; Dgemm: 2.5+ TF (90+% efficiency)
     Figure legend: CMG: CPU Memory Group; NOC: Network On Chip
     ref. Toshio Yoshida, “Fujitsu High Performance CPU for the Post-K Computer,” IEEE Hot Chips: A Symposium on High Performance Chips, San Jose, August 21, 2018.
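The figures on the A64FX slide can be cross-checked with a little arithmetic. The sketch below only uses numbers quoted on the slide; because several of them are lower bounds ("2.7+", "830+"), the derived ratios are approximate, and the variable names are illustrative rather than taken from the talk. The Tofu value is simply the raw product of the three factors on the slide, not a measured bandwidth.

```python
# Figures copied from the A64FX slide; everything derived is plain arithmetic.
PEAK_DP_TFLOPS = 2.7        # "DP: 2.7+ TF" (lower bound)
DGEMM_TFLOPS = 2.5          # "Dgemm: 2.5+ TF" (lower bound)
HBM2_GB_PER_S = 1024.0      # "HBM2 32 GiB, 1024 GB/s"
STREAM_TRIAD_GB_PER_S = 830.0  # "Stream triad: 830+ GB/s" (lower bound)
SVE_BITS = 512              # "SVE (512 bit SIMD)"

# Elements per 512-bit vector for each floating-point width.
dp_lanes = SVE_BITS // 64
sp_lanes = SVE_BITS // 32
hp_lanes = SVE_BITS // 16

# Fraction of peak reached by the two quoted benchmarks.
dgemm_efficiency = DGEMM_TFLOPS / PEAK_DP_TFLOPS
stream_efficiency = STREAM_TRIAD_GB_PER_S / HBM2_GB_PER_S

# Memory bytes available per double-precision flop at peak.
bytes_per_flop = HBM2_GB_PER_S / (PEAK_DP_TFLOPS * 1000.0)

# Raw product of "28 Gbps x 2 lane x 10 port".
tofu_raw_gbps = 28 * 2 * 10

print(f"lanes per vector: DP={dp_lanes}, SP={sp_lanes}, HP={hp_lanes}")
print(f"Dgemm efficiency: {dgemm_efficiency:.1%}")
print(f"Stream triad / HBM2 peak: {stream_efficiency:.1%}")
print(f"bytes per DP flop: {bytes_per_flop:.2f}")
print(f"TofuD raw aggregate: {tofu_raw_gbps} Gbps")
```

The lane counts double from DP to SP to HP, which matches the doubling of the quoted peak rates (2.7, 5.4, 10.8 TF), and the Dgemm ratio is consistent with the "90+% efficiency" stated on the slide.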

  5. Background: An Overview of Post-K Hardware
     ● Compute Node and Compute + I/O Node, connected by 6D mesh/torus Interconnect
     ● 3-level hierarchical storage system
       ● 1st Layer
         ● Cache for global file system
         ● Temporary file systems
           - Local file system for compute node
           - Shared file system for a job
       ● 2nd Layer
         ● Lustre-based global file system
       ● 3rd Layer
         ● Storage for archive

  6. An Overview of System Software Stack
     Ease of use is one of our KPIs (Key Performance Indicators): providing a wide range of applications/tools/libraries/compilers
     Stack components:
     • Linux Distribution Eco-System
     • Fortran, C/C++, OpenMP, Java, …; Math libraries; Tuning and Debugging Tools
     • Batch Job System; Hierarchical File System; Parallel File System
     • Parallel Programming Environments: XMP, FDPS, …
     • Communication: MPI, Low Level Communication
     • File I/O: Application-oriented File I/O, File I/O for Hierarchical Storage (LLIO)
     • Process/Thread: PIP
     • Multi-Kernel System: Linux and light-weight kernel (McKernel)
     • Armv8 + SVE

  7. Post-K Programming Environment
     ● Programming Languages and Compilers provided by Fujitsu
       ● Fortran2008 & Fortran2018 subset
       ● C11 & GNU and Clang extensions
       ● C++14 & C++17 subset and GNU and Clang extensions
       ● OpenMP 4.5 & OpenMP 5.0 subset
       ● Java
     ● GCC, LLVM, and Arm compiler will also be available
     ● Parallel Programming Language & Domain Specific Library provided by RIKEN
       ● XcalableMP
       ● FDPS (Framework for Developing Particle Simulator)
     ● Process/Thread Library provided by RIKEN
       ● PiP (Process in Process)
     ● Script Languages provided by Linux distributor
       ● E.g., Python+NumPy, SciPy
     ● Communication Libraries
       ● MPI 3.1 & MPI4.0 subset
         ● Open MPI base (Fujitsu), MPICH (RIKEN)
       ● Low-level Communication Libraries
         ● uTofu (Fujitsu), LLC (RIKEN)
     ● File I/O Libraries provided by RIKEN
       ● pnetCDF, DTF, FTAR
     ● Math Libraries
       ● BLAS, LAPACK, ScaLAPACK, SSL II (Fujitsu)
       ● EigenEXA, Batched BLAS (RIKEN)
     ● Programming Tools provided by Fujitsu
       ● Profiler, Debugger, GUI
     Annotation on the slide (translated from Japanese; its subject is truncated in the source to "Scalable"): also running on Oakforest-PACS, which is operated by the University of Tsukuba and the University of Tokyo.

  8. Open Source Management Tools
     ● EasyBuild
       ● Used at CEA
       ● RIKEN is evaluating it. As an example, CAFFE, a deep learning tool, is ported to an Arm machine using EasyBuild
       ● CAFFE consists of several open source packages: boost, blas, cmake, gflags, google (glog, googletest, snappy, leveldb, protobuf), lmdb, opencv
     ● Spack
       ● Used in the ECP project
       ● RIKEN is also evaluating Spack

  9. IHK/McKernel developed at RIKEN
     ● Overview
       ● Partition resources (CPU cores, memory)
       ● Full Linux kernel on some cores
         ● System daemons and in-situ non HPC applications
         ● Device drivers
       ● Light-weight kernel (LWK), McKernel, on other cores
         ● HPC applications
     ● IHK (Interface for Heterogeneous Kernels): Linux kernel module
       ● Allows dynamic partitioning of node resources: CPU cores, physical memory, …
       ● Enables management of LWKs (assign resources, load, boot, destroy, etc.)
       ● Provides inter-kernel communication, messaging and notification
     ● McKernel: Light-weight kernel
       ● Is designed for HPC: noiseless, simple
       ● Implements only performance sensitive system calls, e.g., process and memory management; the rest are offloaded to Linux
       ● Executes the same binary as Linux without any recompilation
     ● Figure: cores, memory, and interrupts are split into two partitions
       ● Linux partition: system daemons and in-situ non HPC application; complex Linux API (glibc, /sys/, /proc/); TCP stack, VFS, file system drivers, device drivers, memory management, general scheduler
       ● McKernel partition: HPC applications on a thin LWK with very simple memory management and process/thread management
     • IHK/McKernel runs on
       • Intel Xeon and Xeon phi
       • Fujitsu FX10 and FX100 (Experiments)
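The split between locally implemented and offloaded system calls can be pictured with a toy dispatcher. This is a minimal sketch, not McKernel code: the class names, the `LOCAL_SYSCALLS` set, and the string return values are all assumptions made for illustration. The slide only says that performance sensitive calls such as process and memory management stay in the LWK and the rest go to Linux.

```python
# Toy model of "implement a few system calls locally, offload the rest".
# The membership of LOCAL_SYSCALLS is an assumption chosen for illustration.
LOCAL_SYSCALLS = {"mmap", "munmap", "brk", "clone", "futex", "exit"}


class LinuxProxy:
    """Stands in for the Linux side that serves offloaded calls."""

    def __init__(self):
        self.offloaded = []  # names of calls forwarded to Linux, in order

    def handle(self, name, *args):
        self.offloaded.append(name)
        return f"linux:{name}"


class LightweightKernel:
    """Stands in for the LWK side running the HPC application."""

    def __init__(self, proxy):
        self.proxy = proxy
        self.local_calls = 0

    def syscall(self, name, *args):
        if name in LOCAL_SYSCALLS:
            # Performance sensitive path: served without leaving the LWK.
            self.local_calls += 1
            return f"lwk:{name}"
        # Everything else is forwarded to the full Linux kernel.
        return self.proxy.handle(name, *args)


proxy = LinuxProxy()
lwk = LightweightKernel(proxy)
for call in ["mmap", "open", "futex", "ioctl"]:
    print(call, "->", lwk.syscall(call))
```

The point of the design, as the slide presents it, is that the LWK stays small while unmodified Linux binaries keep working, because any call the LWK does not implement still reaches Linux.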

  10. How to deploy IHK/McKernel
      • Linux Kernel with IHK kernel module is resident
        – daemons for the job scheduler, etc. run on Linux
      • McKernel is dynamically reloaded (rebooted) by IHK for each application
      • No hardware reboot
      • Figure (sequence of jobs on one node):
        • App A, requiring LWK-without-scheduler, is invoked, then finishes
        • App B, requiring LWK-with-scheduler, is invoked, then finishes
        • App C, using full Linux capability, is invoked, then finishes
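The per-application reload can be sketched as a small state model. This is only an illustration under stated assumptions: the `Node` class, its counters, and the flavour labels are hypothetical names, and the real IHK interface is not shown in the slides.

```python
# Toy model: Linux stays resident, the LWK is reloaded per application,
# and the hardware is never rebooted. All names here are hypothetical.
class Node:
    def __init__(self):
        self.hardware_boots = 1  # the node itself is booted once
        self.lwk_boots = 0       # how many times an LWK was (re)loaded
        self.current_lwk = None  # flavour of the LWK currently loaded

    def run(self, app, kernel):
        """Run one application on 'linux' or on a named LWK flavour."""
        if kernel != "linux":
            # IHK loads and boots the requested LWK flavour for this job.
            self.current_lwk = kernel
            self.lwk_boots += 1
        result = f"{app} finished on {kernel}"
        # The LWK is torn down when the job ends; Linux keeps running.
        self.current_lwk = None
        return result


node = Node()
jobs = [
    ("App A", "lwk-without-scheduler"),
    ("App B", "lwk-with-scheduler"),
    ("App C", "linux"),
]
for app, kernel in jobs:
    print(node.run(app, kernel))
print("LWK boots:", node.lwk_boots, "hardware boots:", node.hardware_boots)
```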

  11. miniFE (CORAL benchmark suite)
      ● Conjugate gradient, strong scaling
      ● Up to 3.5X improvement (Linux falls over)
      ● Results using the same binary
      ● Platform: Oakforest-PACS supercomputer, 25 PF in peak, at JCAHPC organized by U. of Tsukuba and U. of Tokyo
      Balazs Gerofi, Rolf Riesen, Robert W. Wisniewski and Yutaka Ishikawa: “Toward Full Specialization of the HPC System Software Stack: Reconciling Application Containers and Lightweight Multi-kernels”, International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2017
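For readers unfamiliar with the solver named on the miniFE slide, the following is a minimal serial conjugate gradient sketch for a small symmetric positive definite system. It is not miniFE code and says nothing about the parallel implementation that was benchmarked; the function names and the 2x2 example matrix are chosen here purely for illustration.

```python
# Textbook conjugate gradient for a dense symmetric positive definite matrix.
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    x = [0.0] * len(b)
    r = [float(bi) for bi in b]  # residual b - A x, with x = 0 initially
    p = list(r)                  # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:  # converged: residual norm is small enough
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)  # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old       # makes the next direction A-conjugate
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
print(conjugate_gradient(A, b))
```

In a strong scaling run the problem size is fixed while the node count grows, so the reductions inside `dot` become a larger share of each iteration, which is one reason operating system noise can matter for this kind of solver.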

  12. Support of Software Development/Porting for Post-K
      Contribution to Arm HPC (Armv8-A SVE) Ecosystem
      Timeline chart, CY2017 to CY2021, with a "NOW" marker:
      • Project phases: Design and Implementation; Manufacturing, Installation, and Tuning; Operation
      • Specification: Armv8-A + SVE Overview; Detailed hardware info.
      • Optimization Guidebook: Publishing Incrementally
      • Performance Evaluation Environment: RIKEN Performance estimation tool using FX100; RIKEN Simulator
      • Early Access Program
      Milestones:
      • CY2018 Q2: Optimization guidebook is incrementally published
      • CY2020 Q2: Early access program starts
      • CY2021 Q1/Q2: General operation starts

  13. Concluding Remarks https://postk-web.r-ccs.riken.jp/faq.html

  14. BACKUP

  15. MPI Communication implemented using Tofu2 and TofuD
      ● Tofu2 and TofuD offloading mechanism
        ● Send commands (PUT, GET, NOP) are posted to a command queue, and the Tofu network interface processes the posted commands.
      ● Tofu2 has two packet processing modes: Normal Mode and Session Mode. In Session Mode, a special register called the Scheduling Pointer plays an important role.
      ● Scheduling Pointer: commands enqueued in the command queue are processed until an entry pointed to by the Scheduling Pointer is reached. The Scheduling Pointer is updated by a packet sent by a remote node.
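The Scheduling Pointer behaviour described on the last slide can be illustrated with a toy queue. This is a sketch under several assumptions: the `CommandQueue` class and its methods are hypothetical, the pointer is modelled as an index that is exclusive (the entry it points to is not processed), and a remote packet is modelled as a plain method call. The actual Tofu hardware interface is not described in the slides.

```python
# Toy model of Session Mode: posted commands run only up to the entry
# that the Scheduling Pointer points at. All names are hypothetical.
class CommandQueue:
    VALID_OPS = ("PUT", "GET", "NOP")

    def __init__(self):
        self.commands = []   # posted commands, in posting order
        self.head = 0        # index of the next command to process
        self.sched_ptr = 0   # processing stops when head reaches this index
        self.executed = []   # commands the interface has processed

    def post(self, op):
        if op not in self.VALID_OPS:
            raise ValueError(f"unknown command: {op}")
        self.commands.append(op)

    def on_remote_packet(self, new_ptr):
        """A packet from a remote node moves the Scheduling Pointer."""
        self.sched_ptr = new_ptr

    def process(self):
        """Process posted commands until the Scheduling Pointer is reached."""
        ran = []
        while self.head < len(self.commands) and self.head < self.sched_ptr:
            ran.append(self.commands[self.head])
            self.head += 1
        self.executed.extend(ran)
        return ran


q = CommandQueue()
for op in ["PUT", "GET", "NOP", "PUT"]:
    q.post(op)
print(q.process())       # nothing may run before the pointer moves
q.on_remote_packet(2)    # remote node releases the first two entries
print(q.process())
q.on_remote_packet(4)
print(q.process())
```

The useful property is that the sender can post its commands ahead of time and the network interface releases them when the remote side signals, without the host CPU having to intervene at that moment.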
