SLIDE 1

George Tsouloupas (Nicosia 2015)

HPC Operations at the Cyprus Institute

George Tsouloupas, PhD, Head of HPC Facility

SLIDE 2

Overview

  • Organization
  • Hardware resources (clusters, storage, networking)
  • Software (OS deployment, services and cloud infrastructure)
  • Libraries and scientific software deployment using EasyBuild
  • Tools
SLIDE 3

Short History

  • The Cyprus Institute est. 2007
  • CaSToRC (Director: Dina Alexandrou)
    ○ Central goal: to develop world-class research and education in computational science serving the Eastern Mediterranean, in collaboration with other regional institutions
    ○ Development of a national High Performance Computing centre
  • Cy-Tera commissioned in Dec 2011
SLIDE 4

Cy-Tera

  • Cy-Tera is the first large cluster of the Cypriot National HPC Facility
  • Cy-Tera Strategic Infrastructure Project
    ○ A new research unit to host an HPC infrastructure
    ○ RPF-funded project (i.e. nationally funded)
  • LinkSCEEM leverages Cy-Tera
  • Contributes resources to PRACE
SLIDE 5

LinkSCEEM

SLIDE 6

Projects and Resource Allocation

  • Cyprus Meteorology Service
SLIDE 7

Projects and Resource Allocation

  • Semi-annual allocation process
    ○ Internal technical reviews
    ○ External scientific reviews
  • 43 production projects to date
  • 75 preparatory projects to date
SLIDE 8

HPC Ops

SLIDE 9

Organization: Responsibilities

  • Stelios Erotokritou
    ○ Project liaison, networking, system administration
  • Thekla Loizou
    ○ User support, scientific software, system administration
  • Andreas Panteli
    ○ PRACE services, system administration, scientific software, networking
  • George Tsouloupas
    ○ System administration, user support, scientific software, PRACE services, networking, NCSA liaison, head of HPC Ops

SLIDE 10

Maintenance and downtimes

  • Scheduled downtime: monthly maintenance
    ○ 0.7% downtime
  • Unscheduled downtime
    ○ <0.1% due to operator blunders
    ○ UPS issues: an additional 15-20 hours of downtime
  • Downtime for rebuilding Cy-Tera
    ○ Estimated <5%
  • Still well within the promised 80% uptime
SLIDE 11

Hardware Resources

SLIDE 12

Resources -- Cy-Tera

  • Hybrid CPU/GPU Linux cluster
  • Computational power
    ○ 98 x 2 x 6-core compute nodes
    ○ Each compute node = 128 GFlops
    ○ 18 x 2 x 6-core + 2 x NVIDIA M2070 GPU nodes
    ○ Each GPU node = 1 TFlops
    ○ Theoretical Peak Performance (TPP) = 30.5 TFlops
  • 48 GB memory per node
  • MPI messaging & storage access
    ○ 40 Gbps QDR InfiniBand
  • Storage: 360 TB raw disk
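(A quick sanity check of the quoted peak from the per-node figures above: 98 compute nodes x 128 GFlops ≈ 12.5 TFlops, plus 18 GPU nodes x 1 TFlops = 18 TFlops, giving roughly 30.5 TFlops in total.)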
SLIDE 13

Resources -- Prometheus

  • Ex-PRACE prototype
  • Hybrid CPU/GPU Linux cluster
  • Computational power
    ○ 8 x 2 x 6-core + 2 x NVIDIA M2070 GPU nodes
  • 24 GB memory per node
  • MPI messaging & storage access
    ○ 40 Gbps QDR InfiniBand
  • Storage: 40 TB raw disk
SLIDE 14

Euclid -- Training Cluster

  • Hybrid CPU/GPU Linux cluster
  • Training cluster of the LinkSCEEM project
  • Computational power
    ○ 6 eight-core compute nodes + 2 NVIDIA Tesla T10 processors
  • 16 GB memory per node
  • MPI messaging & storage access
    ○ InfiniBand network
  • Storage: 40 TB raw disk
  • Used in-house and by universities in Cyprus and Jordan; workshops...

SLIDE 15

Prototype Clusters

  • Dell C8000 chassis
    ○ 2 nodes x 2 Xeon Phi + 2 nodes x 2 NVIDIA K20m
  • MEGWARE MIC cluster
    ○ 12 Xeon Phi accelerators in 4 nodes

SLIDE 16

Post-Processing

  • post01, post02
    ○ 128 GB RAM
    ○ Access to all filesystems
  • Same software and modules as the clusters
    ○ Compiled specifically for each node

SLIDE 17

Storage

SLIDE 18

Storage

  • Cy-Tera storage (IBM)
    ○ 360 TB (raw)
      ■ 100 TB scratch
    ○ 4.7 GB/s
    ○ GPFS
    ○ Project storage
  • DDN7700 (LTS)
    ○ GPFS
    ○ 1 GB/s
    ○ 180 TB
    ○ Room for another 400 TB
  • DDN9900 (LTS)
    ○ GPFS
    ○ 200 TB
    ○ Being phased out
  • “ONYX”
    ○ Commodity hardware
    ○ FhGFS/BeeGFS
    ○ 360 TB
  • Backup
    ○ 80 TB
  • DDN9550 (auxiliary)
    ○ NFS, Lustre
    ○ 40 TB

SLIDE 19

“ONYX” storage, integrated from scratch

[Architecture diagram: 90 x 4 TB disks + 4 SSDs (metadata), multipath SAS JBODs, served over IB RDMA / IB TCP / Ethernet TCP]

  • BeeGFS over ZFS over JBODs = very good value for money (<100 euro/TB including the servers!)

SLIDE 20
  • Up to 3 GB/s writes (iozone)
  • Around 14,000 directory creates/second
SLIDE 21

Software (System) -- Filesystems

  • Four GPFS filesystems
    ○ On three storage systems
    ○ GPFS multiclustering
    ○ Project storage + LTS
  • FhGFS/BeeGFS
    ○ Home directories on Euclid
    ○ Home directories on Prometheus
    ○ New 360 TB system

SLIDE 22

Software (System) -- Deployment

  • xCAT
    ○ Two deployment servers (separate VLANs)
      ■ Cy-Tera
      ■ Everything else
    ○ “Thin” deployment
  • Ansible
    ○ Infrastructure as code, maintained in git
    ○ Manual configuration prohibited

SLIDE 23

Software (System) -- Services

  • Cy-Tera
    ○ RHEL 6 x86_64
    ○ Torque/Moab → SLURM
  • Prometheus
    ○ CentOS 6.5
    ○ SLURM
  • Euclid
    ○ CentOS 6.5
    ○ Torque/Maui → SLURM
  • Planck (testing cluster)
    ○ SLURM

SLIDE 24

Software -- Workload Management

  • First SLURM test on a prototype cluster in 2012
    ○ Basic configuration (single queue, etc.)
  • Decision to move to SLURM
    ○ Save on Moab licensing: ~50K over three years
      ■ That's half an engineer in terms of cost
    ○ Uniform scheduler across systems
    ○ It's much easier to set up a test environment if you don't have to worry about licensing...
  • Transition from Moab to SLURM
    ○ Gave users a 4-month head start with access to SLURM
    ○ 80% of users only made the transition after they could no longer run on Moab...

SLIDE 25

SLURM Migration

  • GOAL: implement the exact functionality we had in Moab
    ○ Routing queues for GPU vs. CPU and for job size
    ○ Low-priority queues
    ○ Standing reservations + triggers

SLIDE 26

SLURM Migration

  • Requested memory on GPU nodes (GRES): when a user asked for a mem-per-cpu value above 4 megabytes, the nodes were allocated but remained idle. To solve this, we always set the requested memory per CPU to "0" for GPU jobs in the job_submit plugin.
  • No triggers to start a job in a reservation; we used cron.
  • No routing queues as in Torque; we implemented the functionality in the job_submit plugin.
  • Bug in Intel MPI with SLURM concerning hostlist parsing; solved after Intel MPI version 4.1 Update 3.
  • Standing reservations were locked to specific nodes.
SLIDE 27

Software

Available on all systems...

  • Intel Compiler Suite (optimised for Intel architecture)
  • PGI Compiler Suite (including OpenACC for GPUs; WIP for all systems)
  • CUDA
  • Optimised math libraries
SLIDE 28

Scientific software and Libraries

How I Learned to Stop Worrying and Love EasyBuild

Facts:

  • Modules provided to users: 641

a2ps Bonnie++ CUDA GDB guile LAPACK MCL numpy Qt TiCCutils ABINIT Boost cURL Geant4 gzip libctl MEME NWChem QuantumESPRESSO TiMBL ABySS Bowtie DL_POLY_Classic GEOS Harminv libffi MetaVelvet Oases R TinySVM AMOS Bowtie2 Doxygen gettext HDF libgtextutils METIS OpenBLAS RAxML Tk ant BWA EasyBuild GHC HDF5 libharu Mothur OpenFOAM RNAz Trinity aria2 byacc Eigen git HH-suite Libint MPFR OpenMPI SAMtools UDUNITS arpack-ng bzip2 ELinks GLib HMMER libmatheval mpiBLAST OpenPGM ScaLAPACK util-linux ATLAS cairo EMBOSS GLIMMER HPL libpng MrBayes OpenSSL ScientificPython Valgrind Autoconf ccache ESMF glproto hwloc libpthread- stubs MUMmer PAML SCons Velvet bam2fastq CD-HIT ETSF_IO GMP Hypre libreadline MUSCLE PAPI SCOTCH ViennaRNA BamTools CDO expat gmvapich2 icc libsmm MVAPICH2 parallel SHRiMP VTK Bash cflow FASTA gmvolf iccifort libtool NAMD ParFlow Silo WPS bbFTP cgdb FASTX-Toolkit gnuplot ictce libunistring nano ParMETIS SOAPdenovo WRF bbftpPRO Chapel FFTW goalf ifort libxc NASM PCRE Stacks xorg-macros beagle-lib Clang FIAT gompi imkl libxml2 NCL Perl Stow xproto BFAST ClangGCC flex google-sparsehash impi libxslt nco PETSc SuiteSparse YamCha binutils CLHEP fontconfig goolf Infernal libyaml ncurses pixman Szip Yasm biodeps ClustalW2 freeglut goolfc iomkl likwid netCDF pkg-config Tar ZeroMQ Biopython CMake freetype gperf Iperf LZO netCDF-Fortran PLINK tbb zlib Bison Corkscrew g2clib grib_api JasPer M4 nettle Primer3 Tcl zsync BLACS CP2K g2lib GROMACS Java makedepend NEURON problog tcsh BLAT CRF++ GCC GSL JUnit mc numactl Python Theano

  • Modules that can be provided within hours: 2238
SLIDE 29

Software

  • Automated, reproducible build processes
  • Maintain multiple compilers/versions
  • 1000's of software packages
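A minimal sketch of what a reproducible, non-interactive build looks like in practice; wrapping it in an Ansible task and the easyconfig name are illustrative assumptions, while eb's --robot option (automatic dependency resolution) is standard EasyBuild usage:

    # Hypothetical task; the easyconfig name is an example only.
    - name: Build GROMACS with the goolf toolchain via EasyBuild
      command: eb GROMACS-4.6.5-goolf-1.4.10.eb --robot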

SLIDE 30

Targeting communities

e.g. bioinformatics

  • The local team has contributed tens of bioinformatics-related packages to EasyBuild (posters at BBC13 and CSC2013)
  • Galaxy server
    ○ Tested last summer
    ○ To be deployed

SLIDE 31

Scientific software and Libraries

How I Learned to Stop Worrying and Love EasyBuild

Change management and software provision:

  • Rollback of individual users

    eb/  eb120/  eb130/  eb130909p2/  eb140/  eb150/  eb20130520/
    eb20130603/  eb20130619/  eb130607/  eb131021/  eb131127/  eb140108/
    sw6 -> eb131021/
    sw6p2 -> eb130909p2/

    As per the buildsets concept of HPCBIOS (by Fotis Georgatos):
    https://fosdem.org/2014/schedule/event/hpc_devroom_hpcbios

SLIDE 32

Tools of the trade

Ticketing system -- Jira

SLIDE 33

User Support

Created vs. resolved tickets

SLIDE 34

Tools of the trade -- Confluence wiki

SLIDE 35

Icinga(/Nagios)

SLIDE 36

Tools of the trade -- Monitoring: Cluvis (in-house developed tool)

SLIDE 37

Infrastructure as code and why you should care.

Treat the configuration of systems the same way that software source code is treated.

  • Use a configuration management tool!
    ○ Puppet, CFEngine, Chef, Ansible, ...
  • Documented installation/configuration
    ○ Testing!
  • Versioned, source-controlled
  • Idempotence (see the sketch after this list)
    ○ Ability to “re-run” the configuration
    ○ Operations can be applied multiple times without changing the result beyond the initial application
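To make the idempotence point concrete, here is a minimal sketch of a play (hypothetical host group and packages, not one of our actual playbooks) that can be re-run safely because each task describes a desired state rather than a one-off action:

    - hosts: nodes
      tasks:
        # "state: present" converges to the same result on every run.
        - name: Ensure the ntp package is installed
          yum:
            name: ntp
            state: present
        # started/enabled likewise describe a state, not an action.
        - name: Ensure ntpd is running and enabled at boot
          service:
            name: ntpd
            state: started
            enabled: yes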

SLIDE 38

Infrastructure as code and why you should care.

We picked Ansible for several reasons:

  • Python-based
    (even though we have not written a line of Python code yet)
  • SSH-based (agentless)
  • Looked simple enough
  • Really easy to get started
SLIDE 39

Short example: euclid Inventory file

[headnodes]
euclid mgmt_addr=172.30.150.99 vlan110_addr=10.20.110.11 ib_addr=172.31.150.99

[nodes]
e01 mgmt_addr=172.30.150.1 ib_addr=172.31.150.1
e02 mgmt_addr=172.30.150.2 ib_addr=172.31.150.2
e03 mgmt_addr=172.30.150.3 ib_addr=172.31.150.3
e04 mgmt_addr=172.30.150.4 ib_addr=172.31.150.4
e05 mgmt_addr=172.30.150.5 ib_addr=172.31.150.5
e06 mgmt_addr=172.30.150.6 ib_addr=172.31.150.6

[accounting]
euclid

[monitoring_servers]
euclid

[all:vars]
ldap_resource_name=euclid
home_dir_base=/fhgfs/euclid/home
shared_fs=/fhgfs/euclid
module_buildset=eb141014
ldap_server=euclid
ldap_server_secondary=ldap.cyi.ac.cy
eb_sources_path=/fhgfs/sources
cluster=euclid
mgmt_if=eth0
external_if=eth1
gateway=172.30.205.1
fed_version=3.12
SLIDE 40

Short example: euclid playbook euclid.yml

- include: network.yml            # Set up networking
- include: common.yml             # Set up basic system aspects (e.g. EPEL)
- include: gpu.yml                # Install the NVIDIA driver
- include: hosts.yml
- include: ib_network.yml         # Set up OFED and the InfiniBand network
- include: ldap_server.yml        # Set up the LDAP server (mirror)
- include: ldap_client.yml        # ... and client authentication
- include: slurm.yml              # Set up SLURM
- include: fhgfs_client_blue.yml  # Install BeeGFS clients for the specific filesystem “blue”
- include: icinga_server.yml      # Set up monitoring
- include: icinga_client.yml
- include: lmod.yml               # Set up Lmod
SLIDE 41

Example: gpu.yml
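(The original slide shows gpu.yml as a screenshot. A rough sketch of what such a play might contain; the host group matches the inventory conventions above, but the package name is an assumption and driver packaging varies by site:)

    - hosts: gpunodes
      tasks:
        # Package name is illustrative; the driver may come from a local repo.
        - name: Install the NVIDIA driver packages
          yum:
            name: nvidia-kmod
            state: present
        # Make sure the kernel module is loaded so jobs can see the GPUs.
        - name: Load the nvidia kernel module
          modprobe:
            name: nvidia
            state: present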

SLIDE 42

Example: slurm.yml

SLIDE 43

Example: slurm.yml

SLIDE 44

Example: slurm.yml

SLIDE 45

Example: slurm.yml
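(slurm.yml is likewise shown as screenshots over several slides. A minimal sketch of the kind of tasks it would contain, assuming SLURM comes from site packages and slurm.conf is rendered from the per-cluster Jinja2 template on the next slide; package and service names are assumptions:)

    - hosts: nodes:headnodes
      tasks:
        - name: Install the SLURM packages
          yum:
            name: slurm
            state: present
        # Render slurm.conf from the per-cluster template (cf. slurm.conf_euclid.j2).
        - name: Deploy slurm.conf
          template:
            src: slurm.conf_{{ cluster }}.j2
            dest: /etc/slurm/slurm.conf
          notify: restart slurmd
      handlers:
        # On the head node slurmctld would need a restart as well.
        - name: restart slurmd
          service:
            name: slurmd
            state: restarted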

SLIDE 46

slurm.conf_euclid.j2

...
AccountingStorageHost={{ groups['accounting'][0] }}
...
ClusterName="{{ cluster }}"
...
ControlMachine={{ groups['headnodes'][0] }}
...
StateSaveLocation={{ home_dir_base }}/slurm/state
...
UsePAM=1

{% for host in groups['gpunodes'] %}
NodeName={{ host }} CPUs={{ hostvars[host]['ansible_processor_vcpus'] }} Gres=gpu:2 Sockets={{ hostvars[host]['ansible_processor_count'] }} CoresPerSocket={{ hostvars[host]['ansible_processor_cores'] }} ThreadsPerCore={{ hostvars[host]['ansible_processor_threads_per_core'] }} State=UNKNOWN RealMemory={{ hostvars[host]['ansible_memtotal_mb'] }}
{% endfor %}

PartitionName=batch Nodes={% for host in groups['nodes'] %}{{ host }},{% endfor %} MaxTime=INFINITE State=UP Default=YES RootOnly=YES
PartitionName=cpu Nodes={% for host in groups['gpunodes'] %}{{ host }},{% endfor %} MaxTime=24:00:00 State=UP
PartitionName=gpu Nodes={% for host in groups['gpunodes'] %}{{ host }},{% endfor %} MaxTime=24:00:00 State=UP

SLIDE 47

Ansible roles developed by local Ops:

common, easybuild, fed, fhgfs_client, fhgfs_server, ganglia317, ganglia_client, ganglia_nvidia_gpu, ganglia_server, gluster_client, gluster_server, gpfs, gpu, ldap_client, ldap_server, lmod, mpss, slurm, slurm_acct, zfs

From the community: apache, icinga2-ansible-classic-ui, icinga2-ansible-no-ui, icinga2-ansible-web-ui, icinga2-nrpe-agent, mysql, network, php, samba

SLIDE 48

THANKS!

Your valued feedback is appreciated!