HPC Operations at the Cyprus Institute
George Tsouloupas, PhD, Head of HPC Facility
George Tsouloupas (Nicosia 2015)
George Tsouloupas (@JSC 2014)
Overview
- Organization
- Hardware resources (clusters, storage, networking)
- Software (OS deployment, services and cloud infrastructure)
- Libraries and scientific software deployment using EasyBuild
- Tools
Short History
- The Cyprus Institute est. 2007
- CaSToRC (Director: Dina Alexandrou)
○ Central goal: to develop world-class research and education in computational science serving the Eastern Mediterranean, in collaboration with other regional institutions
○ Development of a national High Performance Computing centre
- Cy-Tera commissioned in Dec 2011
Cy-Tera
- Cy-Tera is the first large cluster of the Cypriot National HPC Facility
- Cy-Tera Strategic Infrastructure Project
○ A new research unit to host an HPC infrastructure
○ RPF-funded project (i.e. nationally funded)
- LinkSCEEM leverages Cy-Tera
- Contributes resources to PRACE
LinkSCEEM
Projects and Resource Allocation
- Cyprus Meteorology Service
Projects and Resource Allocation
- Semi-annual Allocation process
○ Internal technical reviews
○ External scientific reviews
- 43 Production projects to date
- 75 Preparatory projects to date
HPC Ops
Organization: Responsibilities
- Stelios Erotokritou
○ Project Liaison, Networking, System Administration
- Thekla Loizou
○ User Support, Scientific Software, System Administration.
- Andreas Panteli
○ PRACE services, System Administration, Scientific Software, Networking.
- George Tsouloupas
○ System Administration, User Support, Scientific Software, PRACE services, Networking, NCSA Liaising, HPC Ops head.
Maintenance and downtimes
- Scheduled downtime: monthly maintenance
○ 0.7% downtime
- Unscheduled downtime
○ <0.1% due to operator error
○ UPS issues: an additional 15-20 hours of downtime
- Downtime for rebuilding Cy-Tera:
○ Estimated <5%
- Still well within the promised 80% uptime
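A quick sanity check of these figures (a sketch assuming ~730 hours per month and 8,760 hours per year, neither of which is stated on the slide):

```python
# Rough sanity check of the downtime figures quoted above.
# Assumed, not from the slides: ~730 h/month, 8,760 h/year.

HOURS_PER_MONTH = 730
HOURS_PER_YEAR = 8760

scheduled = 0.007        # 0.7% scheduled (monthly maintenance)
operator = 0.001         # <0.1% unscheduled (upper bound)
ups_hours = 20           # UPS issues, upper end of the 15-20 h range
rebuild = 0.05           # <5% for the Cy-Tera rebuild (upper bound)

total_fraction = scheduled + operator + rebuild + ups_hours / HOURS_PER_YEAR
uptime = 1 - total_fraction

print(f"Monthly maintenance ~{scheduled * HOURS_PER_MONTH:.1f} h/month")
print(f"Worst-case uptime   ~{uptime:.1%}")  # comfortably above the promised 80%
```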
Hardware Resources
Resources -- Cy-Tera
- Hybrid CPU/GPU Linux cluster
- Computational power
○ 98 × 2 × 6-core compute nodes
○ Each compute node = 128 GFlops
○ 18 × 2 × 6-core + 2 × NVIDIA M2070 GPU nodes
○ Each GPU node = 1 TFlop
○ Theoretical Peak Performance (TPP) = 30.5 TFlops
○ 48 GB memory per node
- MPI messaging & storage access
○ 40 Gbps QDR InfiniBand
- Storage: 360 TB raw disk
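The 30.5 TFlops figure follows from the node counts above; a quick check using only the per-node peak numbers quoted on the slide:

```python
# Sanity check: Cy-Tera theoretical peak from the per-node figures above.

cpu_nodes = 98
gpu_nodes = 18
cpu_node_tflops = 0.128   # 128 GFlops per 12-core compute node
gpu_node_tflops = 1.0     # ~1 TFlop per GPU node (2x NVIDIA M2070)

tpp = cpu_nodes * cpu_node_tflops + gpu_nodes * gpu_node_tflops
print(f"TPP ~= {tpp:.1f} TFlops")  # → TPP ~= 30.5 TFlops
```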
Resources -- Prometheus
- Ex-PRACE prototype
- Hybrid CPU/GPU Linux Cluster
- Computational Power
- 8 x 2 x 6-core + 2 x NVIDIA M2070 GPU nodes
- 24 GB memory per node
- MPI Messaging & Storage Access
- 40 Gbps QDR InfiniBand
- Storage: 40TB raw disk
Euclid -- Training Cluster
- Hybrid CPU/GPU Linux Cluster
- Training Cluster of the LinkSCEEM project
- Computational Power
○ 6 eight-core compute nodes + 2 NVIDIA Tesla T10 processors
- 16 GB memory per node
- MPI Messaging & Storage Access
- InfiniBand network
- Storage: 40TB raw disk
- Used in-house, by universities in Cyprus and Jordan, and for workshops
Prototype Clusters
- Dell C8000 chassis
○ 2 nodes × 2 Xeon Phi + 2 nodes × 2 NVIDIA K20m
- MIC MEGWARE
○ 12 Xeon Phi Accelerators in 4 nodes.
Post-Processing
- post01 , post02
○ 128 GB RAM
○ Access to all filesystems
- Same software and modules as the clusters
○ compiled specifically for each node
Storage
- Cy-Tera storage (IBM)
○ 360 TB (raw)
■ 100 TB scratch
○ 4.7 GB/s
○ GPFS
○ Project storage
- DDN7700 (LTS)
○ GPFS
○ 1 GB/s
○ 180 TB
○ Room for another 400 TB
- DDN9900 (LTS)
○ GPFS
○ 200 TB
○ Being phased out
- “ONYX”
○ Commodity hardware
○ FhGFS/BeeGFS
○ 360 TB
- Backup
○ 80 TB
- DDN9550 (auxiliary)
○ NFS, Lustre
○ 40 TB
[Diagram: ONYX storage architecture -- 90 × 4 TB disks + 4 SSDs (metadata), SAS multipath to the servers, clients over IB RDMA / IB TCP / Ethernet TCP]
“ONYX” storage integrated from scratch
- BeeGFS over ZFS over JBODs = very good value for money (<€100/TB, including the servers!)
- Up to 3 GB/s writes (iozone)
- Around 14,000 directory creates per second
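The value-for-money claim is easy to bound; at the quoted ceiling of €100/TB (slide figure) the full 360 TB system comes in under €36,000:

```python
# Rough cost ceiling for ONYX from the quoted <100 EUR/TB figure.
capacity_tb = 360
eur_per_tb = 100          # upper bound quoted on the slide
total_eur = capacity_tb * eur_per_tb
print(f"Total hardware cost < {total_eur} EUR")  # → Total hardware cost < 36000 EUR
```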
Software (System) -- Filesystems
- Four GPFS filesystems
○ On three storage systems
○ GPFS multiclustering
○ Project storage + LTS
- FhGFS/BeeGFS
○ Home directories on Euclid
○ Home directories on Prometheus
○ New 360 TB system
Software (System) -- Deployment
- XCAT
○ Two deployment servers (separate VLANs)
■ Cy-Tera
■ Everything else
○ “Thin” deployment
- Ansible
○ Infrastructure as code, maintained in git
○ Manual configuration prohibited
Software (System) -- Services
- Cy-Tera
○ RHEL 6 x86_64
○ Torque/Moab → SLURM
- Prometheus
○ CentOS 6.5
○ SLURM
- Euclid
○ CentOS 6.5
○ Torque/Maui → SLURM
- Planck (Testing Cluster)
○ SLURM
Software -- Workload Management
- 1st SLURM test on prototype cluster in 2012
○ Basic configuration (single queue, etc.)
- Decision to move to SLURM
○ Saves ~50K in Moab licensing over three years
■ That’s about half an engineer in terms of cost
○ Uniform scheduler across systems
○ It’s much easier to set up a test environment if you don’t have to worry about licensing...
- Transition from Moab to SLURM
○ Gave users a four-month head start with access to SLURM
○ 80% of users only made the transition after they could no longer run on Moab...
SLURM Migration
- GOAL: implement the exact functionality we had in Moab
○ Routing queues for GPU/CPU and job size
○ Low-priority queues
○ Standing reservations + triggers
SLURM Migration
- Requested memory on GPU nodes (gres): when a user asked for a mem-per-cpu of more than 4 GB, the nodes were allocated but remained idle. To solve this we always set the requested memory per CPU to "0" for GPU jobs, in the job_submit plugin.
- No triggers to start job in a reservation. We used cron.
- No routing queues as in Torque; we implemented the functionality in the job_submit plugin.
- Bug in Intel MPI with SLURM concerning hostlist parsing; solved after Intel MPI version 4.1 update 3.
- Standing reservations locked into specific nodes.
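The routing-queue replacement above lives in the job_submit plugin; the decision logic is roughly the following. This is a sketch in Python for clarity only: the real plugin is a SLURM job_submit plugin (typically Lua or C), and the partition names and size threshold below are illustrative assumptions, not the production values.

```python
# Illustrative sketch of routing-queue logic of the kind implemented in a
# SLURM job_submit plugin. Partition names and thresholds are hypothetical.

def route_partition(gres: str, num_nodes: int) -> str:
    """Pick a partition based on requested GRES and job size."""
    if gres and "gpu" in gres:
        return "gpu"                # GPU jobs go to the GPU partition
    if num_nodes > 16:              # hypothetical "large job" threshold
        return "batch-large"
    return "batch"

# A GPU job is routed to the gpu partition regardless of size:
print(route_partition("gpu:2", 4))   # → gpu
# A large CPU-only job goes to the wide partition:
print(route_partition("", 32))       # → batch-large
```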
Software
Available on all systems...
- Intel Compiler Suite (optimised for Intel architecture)
- PGI Compiler Suite (including OpenACC for GPUs) (WIP for all systems)
- CUDA
- Optimised math libraries
Scientific software and Libraries
How I Learned to Stop Worrying and Love EasyBuild
Facts:
- Modules provided to users: 641
a2ps Bonnie++ CUDA GDB guile LAPACK MCL numpy Qt TiCCutils ABINIT Boost cURL Geant4 gzip libctl MEME NWChem QuantumESPRESSO TiMBL ABySS Bowtie DL_POLY_Classic GEOS Harminv libffi MetaVelvet Oases R TinySVM AMOS Bowtie2 Doxygen gettext HDF libgtextutils METIS OpenBLAS RAxML Tk ant BWA EasyBuild GHC HDF5 libharu Mothur OpenFOAM RNAz Trinity aria2 byacc Eigen git HH-suite Libint MPFR OpenMPI SAMtools UDUNITS arpack-ng bzip2 ELinks GLib HMMER libmatheval mpiBLAST OpenPGM ScaLAPACK util-linux ATLAS cairo EMBOSS GLIMMER HPL libpng MrBayes OpenSSL ScientificPython Valgrind Autoconf ccache ESMF glproto hwloc libpthread- stubs MUMmer PAML SCons Velvet bam2fastq CD-HIT ETSF_IO GMP Hypre libreadline MUSCLE PAPI SCOTCH ViennaRNA BamTools CDO expat gmvapich2 icc libsmm MVAPICH2 parallel SHRiMP VTK Bash cflow FASTA gmvolf iccifort libtool NAMD ParFlow Silo WPS bbFTP cgdb FASTX-Toolkit gnuplot ictce libunistring nano ParMETIS SOAPdenovo WRF bbftpPRO Chapel FFTW goalf ifort libxc NASM PCRE Stacks xorg-macros beagle-lib Clang FIAT gompi imkl libxml2 NCL Perl Stow xproto BFAST ClangGCC flex google-sparsehash impi libxslt nco PETSc SuiteSparse YamCha binutils CLHEP fontconfig goolf Infernal libyaml ncurses pixman Szip Yasm biodeps ClustalW2 freeglut goolfc iomkl likwid netCDF pkg-config Tar ZeroMQ Biopython CMake freetype gperf Iperf LZO netCDF-Fortran PLINK tbb zlib Bison Corkscrew g2clib grib_api JasPer M4 nettle Primer3 Tcl zsync BLACS CP2K g2lib GROMACS Java makedepend NEURON problog tcsh BLAT CRF++ GCC GSL JUnit mc numactl Python Theano
- Modules that can be provided within hours: 2238
Software
- Automated, reproducible build processes
- Maintain multiple compilers/versions
- Thousands of software packages
Targeting communities, e.g. bioinformatics
- The local team has contributed tens of bioinformatics-related packages to EasyBuild (posters at BBC13 and CSC2013)
- Galaxy server
○ Tested last summer
○ To be deployed
Scientific software and Libraries
How I Learned to Stop Worrying and Love EasyBuild
Change management and software provision:
- Rollback of individual users
eb/  eb120/  eb130/  eb130909p2/  eb140/  eb150/
eb20130520/  eb20130603/  eb20130619/  eb130607/
eb131021/  eb131127/  eb140108/
sw6 -> eb131021/
sw6p2 -> eb130909p2/
As per the buildsets concept of HPCBIOS (by Fotis Georgatos):
https://fosdem.org/2014/schedule/event/hpc_devroom_hpcbios
Tools of the trade
Ticketing system -- Jira
User Support
Created vs. resolved tickets
Tools of the trade -- Confluence wiki
Icinga(/Nagios)
Tools of the trade -- Monitoring: Cluvis (in-house developed tool)
Infrastructure as code and why you should care.
Treat the configuration of systems the same way that software source code is treated.
- Use a configuration management tool!
○ Puppet, CFengine, Chef, Ansible ...
- Documented installation/configuration
○ Testing!
- Versioned, source-controlled
- Idempotence
○ Ability to “re-run” configuration
○ Operations can be applied multiple times without changing the result beyond the initial application
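To illustrate the idempotence point: an Ansible task declares a desired state rather than an action, so applying it repeatedly changes nothing after the first run. A minimal hypothetical play (not one of the site's actual roles; the package and service names are illustrative):

```yaml
# Hypothetical example: these tasks are idempotent -- running the play
# twice leaves the system unchanged after the first run.
- name: Ensure NTP is installed and running
  hosts: nodes
  tasks:
    - name: Install ntp package
      yum:
        name: ntp
        state: present      # a state ("present"), not an action ("install")

    - name: Ensure ntpd is enabled and started
      service:
        name: ntpd
        enabled: yes
        state: started
```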
Infrastructure as code and why you should care.
We picked Ansible for several reasons:
- Python-based (even though we have not written a line of Python code yet)
- SSH-based (agentless)
- Looked simple enough
- Really easy to get started
Short example: euclid Inventory file
[headnodes]
euclid mgmt_addr=172.30.150.99 vlan110_addr=10.20.110.11 ib_addr=172.31.150.99

[nodes]
e01 mgmt_addr=172.30.150.1 ib_addr=172.31.150.1
e02 mgmt_addr=172.30.150.2 ib_addr=172.31.150.2
e03 mgmt_addr=172.30.150.3 ib_addr=172.31.150.3
e04 mgmt_addr=172.30.150.4 ib_addr=172.31.150.4
e05 mgmt_addr=172.30.150.5 ib_addr=172.31.150.5
e06 mgmt_addr=172.30.150.6 ib_addr=172.31.150.6

[accounting]
euclid

[monitoring_servers]
euclid

[all:vars]
ldap_resource_name=euclid
home_dir_base=/fhgfs/euclid/home
shared_fs=/fhgfs/euclid
module_buildset=eb141014
ldap_server=euclid
ldap_server_secondary=ldap.cyi.ac.cy
eb_sources_path=/fhgfs/sources
cluster=euclid
mgmt_if=eth0
external_if=eth1
gateway=172.30.205.1
fed_version=3.12
Short example: euclid Playbook euclid.yml
- include: network.yml
- include: common.yml
- include: gpu.yml
- include: hosts.yml
- include: ib_network.yml
- include: ldap_server.yml
- include: ldap_client.yml
- include: slurm.yml
- include: fhgfs_client_blue.yml
- include: icinga_server.yml
- include: icinga_client.yml
- include: lmod.yml
- Set up networking
- Set up basic system aspects (e.g. EPEL)
- Install the NVIDIA driver
- Set up OFED and the InfiniBand network
- Set up the LDAP server (mirror) and client authentication
- Set up SLURM
- Install BeeGFS clients for the specific filesystem “blue”
- Set up monitoring
- Set up Lmod
Example: gpu.yml
Example: slurm.yml
slurm.conf_euclid.j2
...
AccountingStorageHost={{ groups['accounting'][0] }}
...
ClusterName="{{ cluster }}"
...
ControlMachine={{ groups['headnodes'][0] }}
...
StateSaveLocation={{ home_dir_base }}/slurm/state
...
UsePAM=1

{% for host in groups['gpunodes'] %}
NodeName={{ host }} CPUs={{ hostvars[host]['ansible_processor_vcpus'] }} Gres=gpu:2 Sockets={{ hostvars[host]['ansible_processor_count'] }} CoresPerSocket={{ hostvars[host]['ansible_processor_cores'] }} ThreadsPerCore={{ hostvars[host]['ansible_processor_threads_per_core'] }} RealMemory={{ hostvars[host]['ansible_memtotal_mb'] }} State=UNKNOWN
{% endfor %}

PartitionName=batch Nodes={% for host in groups['nodes'] %}{{ host }},{% endfor %} MaxTime=INFINITE State=UP Default=YES RootOnly=YES
PartitionName=cpu Nodes={% for host in groups['gpunodes'] %}{{ host }},{% endfor %} MaxTime=24:00:00 State=UP
PartitionName=gpu Nodes={% for host in groups['gpunodes'] %}{{ host }},{% endfor %} MaxTime=24:00:00 State=UP
Ansible roles developed by local Ops:
common easybuild fed fhgfs_client fhgfs_server ganglia317 ganglia_client ganglia_nvidia_gpu ganglia_server gluster_client gluster_server gpfs gpu ldap_client ldap_server lmod mpss slurm slurm_acct zfs
From the community: apache icinga2-ansible-classic-ui icinga2-ansible-no-ui icinga2-ansible-web-ui icinga2-nrpe-agent mysql network php samba