SLIDE 1

System Software for Armv8-A with SVE

Yutaka Ishikawa, Leader of FLAGSHIP2020 Project RIKEN Center for Computational Science

9:00–9:25, 14th of January, 2019, Open Source HPC Collaboration on Arm Architecture, Linaro workshop, Guangzhou, China

SLIDE 2

Background: Flagship2020

2019/1/14

  • Missions
  • Building the Japanese national flagship supercomputer, post-K, and
  • Developing a wide range of HPC applications, running on post-K, in order to solve social and scientific issues in Japan
  • Project organization
  • Post-K computer development
  • RIKEN AICS is in charge of development
  • Fujitsu is the vendor partner
  • International collaborations: DOE, CEA, JLESC (NCSA, ANL, UTK, JSC, BSC, INRIA, RIKEN)
  • Applications
  • The government selected 9 social & scientific priority issues and 4 exploratory issues, and their R&D organizations

RIKEN Center for Computational Science

SLIDE 3

Background: Flagship2020: Target Applications

Program: brief description
① GENESIS: MD for proteins
② Genomon: Genome processing (genome alignment)
③ GAMERA: Earthquake simulator (FEM on unstructured & structured grids)
④ NICAM+LETKF: Weather prediction system using big data (structured-grid stencil & ensemble Kalman filter)
⑤ NTChem: Molecular electronic structure calculation
⑥ FFB: Large eddy simulation (unstructured grid)
⑦ RSDFT: Ab initio program (density functional theory)
⑧ Adventure: Computational mechanics system for large-scale analysis and design (unstructured grid)
⑨ CCS-QCD: Lattice QCD simulation (structured grid, Monte Carlo)

SLIDE 4

Courtesy of FUJITSU LIMITED

Background: Post-K CPU A64FX

Architecture: Armv8.2-A SVE (512-bit SIMD)
Core: 48 compute cores + 2/4 cores for OS activities; DP: 2.7+ TF, SP: 5.4+ TF, HP: 10.8 TF
Cache: L1D 64 KiB, 4-way, 230 GB/s (load), 115 GB/s (store); L2 8 MiB, 16-way, 115 GB/s (load), 57 GB/s (store)
Memory: HBM2 32 GiB, 1024 GB/s
Interconnect: TofuD (28 Gbps × 2 lanes × 10 ports)
I/O: PCIe Gen3 × 16 lanes
Technology: 7 nm FinFET

Performance: Stream triad 830+ GB/s; DGEMM 2.5+ TF (90+% efficiency)

  • ref. Toshio Yoshida, "Fujitsu High Performance CPU for the Post-K Computer," IEEE Hot Chips: A Symposium on High Performance Chips, San Jose, August 21, 2018.

CMG: CPU Memory Group; NOC: Network on Chip

SLIDE 5

Background: An Overview of Post-K Hardware

  • Compute nodes and compute + I/O nodes connected by a 6D mesh/torus interconnect
  • 3-level hierarchical storage system
  • 1st layer
  • Cache for the global file system
  • Temporary file systems
  • Local file system for a compute node
  • Shared file system for a job
  • 2nd layer
  • Lustre-based global file system
  • 3rd layer
  • Storage for archive

SLIDE 6

An Overview of System Software Stack

Ease of use is one of our KPIs (Key Performance Indicators): providing a wide range of applications, tools, libraries, and compilers.

[Stack diagram: Linux distribution eco-system; parallel programming environments (XMP, FDPS, …); Armv8 + SVE; multi-kernel system: Linux and light-weight kernel (McKernel); batch job system; application-oriented file I/O and communication; MPI; parallel file system; tuning and debugging tools; hierarchical file system; low-level communication; file I/O for hierarchical storage (LLIO); Fortran, C/C++, OpenMP, Java, …; math libraries; process/thread library (PiP).]

SLIDE 7
Post-K Programming Environment

  • Programming languages and compilers provided by Fujitsu
  • Fortran 2008 & Fortran 2018 subset
  • C11 & GNU and Clang extensions
  • C++14 & C++17 subset and GNU and Clang extensions
  • OpenMP 4.5 & OpenMP 5.0 subset
  • Java

GCC, LLVM, and the Arm compiler will also be available.

  • Parallel programming language & domain-specific library provided by RIKEN
  • XcalableMP
  • FDPS (Framework for Developing Particle Simulator)
  • Process/thread library provided by RIKEN
  • PiP (Process in Process)
  • Script languages provided by the Linux distributor
  • E.g., Python+NumPy, SciPy
  • Communication libraries
  • MPI 3.1 & MPI 4.0 subset
  • Open MPI base (Fujitsu), MPICH (RIKEN)
  • Low-level communication libraries
  • uTofu (Fujitsu), LLC (RIKEN)
  • File I/O libraries provided by RIKEN
  • pnetCDF, DTF, FTAR
  • Math libraries
  • BLAS, LAPACK, ScaLAPACK, SSL II (Fujitsu)
  • EigenEXA, Batched BLAS (RIKEN)
  • Programming tools provided by Fujitsu
  • Profiler, debugger, GUI

XcalableMP also runs on Oakforest-PACS, operated by the University of Tsukuba and the University of Tokyo.

SLIDE 8

Open Source Management Tools

  • EasyBuild
  • Used at CEA
  • RIKEN is evaluating it. As an example, Caffe, a deep-learning tool, was ported to an Arm machine using EasyBuild
  • Caffe consists of several open-source packages: boost, blas, cmake, gflags, google (glog, googletest, snappy, leveldb, protobuf), lmdb, opencv
  • Spack
  • Used in the ECP project
  • RIKEN is evaluating Spack as well.
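For context, EasyBuild drives each build from an "easyconfig" file. The sketch below is a minimal, hypothetical easyconfig; the package name, version, toolchain, and dependency are placeholders for illustration, not the actual Caffe-on-Arm recipe:

```python
# Hypothetical minimal EasyBuild easyconfig (illustrative placeholders only).
easyblock = 'ConfigureMake'   # generic configure / make / make install build
name = 'example-lib'          # placeholder package name
version = '1.0.0'
homepage = 'https://example.org'
description = "Placeholder package illustrating the easyconfig format"
toolchain = {'name': 'GCC', 'version': '8.2.0'}
sources = ['example-lib-1.0.0.tar.gz']
# Dependencies such as Caffe's boost/glog/protobuf would be listed here;
# EasyBuild resolves and builds them first.
dependencies = [('CMake', '3.12.1')]
```

EasyBuild composes such files into full dependency graphs, which is what makes porting a multi-package tool like Caffe to a new architecture tractable.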


SLIDE 9
  • Partition resources (CPU cores, memory)
  • Full Linux kernel on some cores
  • System daemons and in-situ non-HPC applications
  • Device drivers
  • Light-weight kernel (LWK), McKernel, on other cores
  • HPC applications

IHK/McKernel, developed at RIKEN

  • IHK: Linux kernel module
  • Allows dynamic partitioning of node resources: CPU cores, physical memory, …
  • Enables management of LWKs (assign resources, load, boot, destroy, etc.)
  • Provides inter-kernel communication, messaging, and notification
  • McKernel: light-weight kernel
  • Is designed for HPC: noiseless, simple
  • Implements only performance-sensitive system calls, e.g., process and memory management; the rest are offloaded to Linux
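As a toy illustration of this split, here is a small Python model of the design idea, not McKernel's actual dispatch code; the set of locally handled calls is an assumption chosen for the example:

```python
# Toy model: an LWK serves performance-sensitive syscalls locally and
# offloads everything else to the Linux side via inter-kernel messaging.
LWK_LOCAL = {"mmap", "munmap", "brk", "clone", "futex"}  # assumed set

def handle_syscall(name, offload):
    """Return which kernel served the call, mimicking the McKernel policy."""
    if name in LWK_LOCAL:
        return ("lwk", name)   # fast path, no OS noise on the compute cores
    return offload(name)       # forwarded to the full Linux kernel

def linux_side(name):
    return ("linux", name)     # stands in for the Linux-resident handler
```

In this model, `handle_syscall("mmap", linux_side)` stays on the LWK, while `handle_syscall("open", linux_side)` is offloaded: the compute cores only ever run the small, predictable fast path.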

[Figure: a node's cores partitioned between Linux and the thin LWK. Linux keeps the general scheduler, complex memory management, TCP stack, device drivers, VFS, and file-system drivers, serving system daemons and in-situ non-HPC applications; McKernel keeps very simple memory management and process/thread management for HPC applications, with the Linux API (glibc, /sys/, /proc/) preserved.]

  • IHK/McKernel runs on
  • Intel Xeon and Xeon Phi
  • Fujitsu FX10 and FX100 (experiments)

  • IHK: Interface for Heterogeneous Kernels
  • Executes the same Linux binaries without any recompilation

SLIDE 10

How to deploy IHK/McKernel

  • The Linux kernel with the IHK kernel module is resident
  • Daemons for the job scheduler, etc., run on Linux
  • McKernel is dynamically reloaded (rebooted) by IHK for each application
  • No hardware reboot

[Timeline figure: App A, requiring an LWK without a scheduler, is invoked and finishes; App B, requiring an LWK with a scheduler, is invoked and finishes; App C, using full Linux capability, is invoked and finishes.]

SLIDE 11

miniFE (CORAL benchmark suite)

  • Conjugate gradient, strong scaling
  • Up to 3.5X improvement (Linux falls over)

Oakforest-PACS supercomputer (25 PF peak) at JCAHPC, organized by the University of Tsukuba and the University of Tokyo

Results using the same binary

Balazs Gerofi, Rolf Riesen, Robert W. Wisniewski and Yutaka Ishikawa, "Toward Full Specialization of the HPC System Software Stack: Reconciling Application Containers and Lightweight Multi-kernels," International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2017.

SLIDE 12

Support of Software Development/Porting for Post-K

[Roadmap figure, CY2017–CY2021: specification (Armv8-A + SVE overview, then detailed hardware information); design and implementation, manufacturing, installation and tuning, then operation; performance estimation tool using FX100; RIKEN simulator; RIKEN performance evaluation environment; optimization guidebook; early access program.]

  • CY2018 Q2: the optimization guidebook is published incrementally
  • CY2020 Q2: the early access program starts
  • CY2021 Q1/Q2: general operation starts

Contribution to the Arm HPC (Armv8-A SVE) ecosystem

SLIDE 13

Concluding Remarks


https://postk-web.r-ccs.riken.jp/faq.html

SLIDE 14

BACKUP


SLIDE 15

MPI Communication Implemented Using Tofu2 and TofuD

  • Tofu2 and TofuD offloading mechanism
  • Send commands (PUT, GET, NOP) are posted to a command queue; the Tofu network interface processes the posted commands.
  • Tofu2 has two packet-processing modes: Normal Mode and Session Mode. In Session Mode, a special register called the Scheduling Pointer plays an important role.
  • Scheduling Pointer: commands enqueued in the command queue are processed until reaching the entry pointed to by the Scheduling Pointer. The Scheduling Pointer is updated by a packet sent by a remote node.
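The Session Mode behavior above can be sketched as a toy model in plain Python; this is an illustration of the queue-and-pointer idea, not the Tofu hardware interface or the uTofu API:

```python
from collections import deque

class SessionQueue:
    """Toy model of a Session Mode command queue: entries are processed
    only up to the Scheduling Pointer, which remote packets advance."""

    def __init__(self):
        self.pending = deque()   # (index, command) entries, in post order
        self.sched_ptr = 0       # processing may proceed below this index
        self.next_index = 0
        self.processed = []

    def post(self, cmd):
        """Enqueue a send command (e.g. PUT, GET, NOP)."""
        self.pending.append((self.next_index, cmd))
        self.next_index += 1
        self._drain()

    def on_remote_packet(self, new_ptr):
        """A packet from the remote node advances the Scheduling Pointer."""
        self.sched_ptr = max(self.sched_ptr, new_ptr)
        self._drain()

    def _drain(self):
        while self.pending and self.pending[0][0] < self.sched_ptr:
            self.processed.append(self.pending.popleft()[1])

q = SessionQueue()
for cmd in ["PUT", "GET", "NOP"]:
    q.post(cmd)
q.on_remote_packet(2)   # remote node releases the first two entries
```

After the remote update, `q.processed` is `["PUT", "GET"]` while `"NOP"` stays pending until a further packet advances the pointer; the remote node thus paces the local engine without any local CPU involvement.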


SLIDE 16

Evaluation: Latency

    MPI_Neighbor_alltoall_init(sbuf, count, MPI_DOUBLE,
                               rbuf, count, MPI_DOUBLE,
                               comm, MPI_INFO_NULL, &req[1]);
    for (i = 0; ...; ...) {
        /* Computation */
        MPI_Start(&req[1]);
        /* Computation */
        MPI_Wait(&req[1], &stat);
    }

Tofu2 offload: direct transfers between user buffers; completely asynchronous progression.

Persistent pt2pt (≒ non-blocking pt2pt)

[Graph: latency (µs) vs. message size (bytes).]

  • The offload version is faster.
  • Unlike the point-to-point version, the offload version does not need CPU cycles for communication progress. Thus computation/communication overlap is realized by the offload version.
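The start/wait usage pattern from the code above can be mimicked in a toy Python analogue; no MPI is involved, and a background thread stands in for the offload engine that progresses the transfer while the host computes:

```python
from concurrent.futures import ThreadPoolExecutor

class PersistentOp:
    """Toy analogue of a persistent operation: set up once, then
    repeatedly start it, overlap it with computation, and wait."""

    def __init__(self, fn):
        self._fn = fn
        self._engine = ThreadPoolExecutor(max_workers=1)  # "offload engine"
        self._future = None

    def start(self, *args):          # analogous to MPI_Start
        self._future = self._engine.submit(self._fn, *args)

    def wait(self):                  # analogous to MPI_Wait
        return self._future.result()

op = PersistentOp(lambda xs: [x * 2 for x in xs])  # stand-in "transfer"
results = []
for step in range(3):
    op.start([step, step + 1])
    partial = step * step            # computation overlapped with the op
    results.append((partial, op.wait()))
```

`results` ends up as `[(0, [0, 2]), (1, [2, 4]), (4, [4, 6])]`: each iteration's computation runs while the background engine handles the "transfer", which is exactly the overlap the offloaded persistent collectives provide without spending host CPU cycles.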

  • Masayuki Hatanaka, Masamichi Takagi, Atsushi Hori, Yutaka Ishikawa, "Offloaded MPI persistent collectives using persistent generalized request interface," Proceedings of the 24th European MPI Users' Group Meeting (EuroMPI 2017), ACM, 2017.
  • Yoshiyuki Morie, Masayuki Hatanaka, Masamichi Takagi, Atsushi Hori, Yutaka Ishikawa, "Prototyping of Offloaded Persistent Broadcast on Tofu2 Interconnect," SC17, 2017 (poster).
  • Yoshiyuki Morie, Masayuki Hatanaka, Masamichi Takagi, Atsushi Hori, Yutaka Ishikawa, "Evaluation of Intra-Node Persistent Collective Communication using NIC Offloading," SWOPP'18, HPC165, 2018. (In Japanese)

SLIDE 17

OSS Survey (developers of the 9 priority issues)

  • Application
  • MODYLAS, USQCD, OpenFOAM
  • Library
  • NumPy, SciPy, pysam, FFTW, LAPACK95, lapack, blas, Metis, ParMetis, HDF5, NetCDF, NetCDF-Fortran, PnetCDF, scalasca, SCOTCH, Zoltan, openmpi 1.8, openmpi 1.10, mpich2-1.4.1, boost, FFTE, PETSc/SLEPc, Elemental, BWA, Star, Blat, TopHat, TopHat2, MapSplice2, MPDyn2, ELPA, Trilinos, Eigen3, mesa, MesaGLUT, libxml2, C-LIME, EigenExa
  • Tool/Visualization tool
  • git, git-flow, gnuplot, ParaView, VisIt, ImageMagick, svn, Samtools, bedtools, Biobambam, Picard, GMT, GrADS, HDF-EOS, wgrib, GRIB API, Climate Data Operators
  • Build tool
  • cmake, GNU Autotools, automake, autoconf, gcc, gfortran, C++, libtool
  • Shell script / programming language / script language
  • python2, python3, perl5, R, Ruby2, zsh, ksh, NCAR Command Language

SLIDE 18

OSS Survey (K computer users)

  • Application
  • ABINIT-MP, AkaiKKR, bedtools, Biobambam, BWA, CUBE, ERmod, FDPS, FFV-C, FrontFlow/Red, FrontISTR, GAMESS, GENESIS, GROMACS, HIVE, LAMMPS, MapSplice2, MODYLAS, NEURON, OCTA, OpenFOAM, PBVR, Picard, PIMD, Quantum ESPRESSO, rDock, Samtools, SCALE, Star, TopHat, TopHat2, WHEEL, xTAPP
  • Library
  • FFTW, matplotlib (Python), Beautiful Soup (Python), METIS, ParMETIS, NetCDF4, HDF5, NuSDAS 1.3, OCTA, FDPS, Zoltan, CGNS, Polylib, libsim
  • Visualization tool
  • gnuplot, PBVR, VTK, OSMesa
  • Tool
  • GNU utils, zlib, anaconda (Python), ITK, PAPI, PMlib, Szip, zip, TextParser, fpzip
  • Build tool
  • make, autoconf, cmake
  • Shell script / programming language / script language
  • bash, curl, python, ruby
  • ISV
  • ABAQUS, Advance, AMBER, ANSYS Fluent, Gaussian, FLUENT, Scryu/Tetra, LS-DYNA, VPS solver (PAM-CRASH), Helyx, HEETAH, iconCFD, LaBS, JMAG, MIZUHO, NuFD, VASP, VSOP