Bright Cluster Manager - Advanced HPC cluster management made easy - PowerPoint PPT Presentation
SLIDE 1

Bright Cluster Manager

Advanced HPC cluster management made easy

Martijn de Vries

CTO Bright Computing

SLIDE 2

About Bright Computing

Bright Computing

  • Develops and supports Bright Cluster Manager for HPC systems and server farms
  • Incorporated in the USA (HQ in San Jose, California)
  • Development office in Amsterdam, NL
  • Backed by ING Bank as shareholder and investor
  • Sells through a rapidly growing network of resellers and OEMs world-wide
  • Customers and resellers in the US, Canada, Brazil, Europe, the Middle East, India, Singapore, Japan
  • Installations in academia, government, and industry, ranging from 4-node to TOP500 systems

SLIDE 3

Customers

Industry, Government, Academia

SLIDE 4

The Commonly Used “Toolkit” Approach

  • Most HPC cluster management solutions use the “toolkit” approach (Linux distro + tools)
  • Examples: Rocks, PCM, OSCAR, UniCluster, CMU, bullx, etc.
  • Tools typically used: Ganglia, Cacti, Nagios, Cfengine, SystemImager, Puppet, Cobbler, Hobbit, Big Brother, Zabbix, GroundWork, etc.
  • Issues with the “toolkit” approach:
  • Tools rarely designed to work together
  • Tools rarely designed for HPC
  • Tools rarely designed to scale
  • Each tool has its own command line interface and GUI
  • Each tool has its own daemon and database
  • Roadmap dependent on developers of the tools
  • Making a collection of unrelated tools work together:
  • Requires a lot of expertise and scripting
  • Rarely leads to a really easy-to-use and scalable solution
SLIDE 5

About Bright Cluster Manager

  • Bright Cluster Manager takes a much more fundamental & integrated approach

  • Designed and written from the ground up
  • Single cluster management daemon provides all functionality
  • Single, central database for configuration and monitoring data
  • Single CLI and GUI for ALL cluster management functionality
  • Which makes Bright Cluster Manager …
  • Extremely easy to use
  • Extremely scalable
  • Secure & reliable
  • Complete
  • Flexible
SLIDE 6

Architecture

[Diagram: CMDaemon]

SLIDE 7

Bright Cluster Manager — Elements

[Diagram: Cluster Management Shell and Cluster Management GUI connect over SSL / SOAP / X509 / IPtables to the Cluster Management Daemon, which manages disk, Ethernet, interconnect, IPMI/iLO, PDU, CPU, GPU, and memory; integrates workload managers (PBS Pro, Torque, Maui/MOAB, Grid Engine, SLURM, LSF*); provides monitoring, automation, health checks, management, and provisioning; ships compilers, libraries, debuggers, and profilers; runs on SLES / RHEL / CentOS / SL / Oracle EL, with ScaleMP vSMP]

SLIDE 8

HPC User Environment

  • Let users focus on performing computations
  • Rich collection of HPC software
  • Compilers (GNU, Intel*, Portland*, Open64, etc.)
  • Parallel middleware (MPI libraries, threading libraries, OpenMP, Global Arrays, etc.)
  • Mathematical libraries (ACML, MKL*, LAPACK, BLAS, etc.)
  • Development tools (debuggers, profilers, etc.)
  • Environment modules
  • Intel Cluster Ready compliant: compliant applications run out of the box

SLIDE 9

Management Interface

Graphical User Interface (GUI)

  • Offers administrator full cluster control
  • Standalone desktop application
  • Manages multiple clusters simultaneously
  • Runs on Linux, Windows, MacOS X*
  • Built on top of Mozilla XUL engine

Cluster Management Shell (CMSH)

  • All GUI functionality also available through the Cluster Management Shell

  • Interactive and scriptable in batch mode
SLIDE 10

SLIDE 11

SLIDE 12

Cluster Management Shell (CMSH)

Features:

  • Modular interface
  • Command completion using tab key
  • Command line history
  • Output redirection to file or shell command
  • Scriptable in batch mode
  • Support for looping over objects

Example

[demo]% device
[demo->device]% status
demo ................ [ UP ]
node001 ............. [ UP ]
node002 ............. [ UP ]

SLIDE 13

SLIDE 14

SLIDE 15

SLIDE 16

SLIDE 17

Node Provisioning

Image based

  • Slave node image is a directory on the head node
  • Unlimited number of images can be created
  • Software changes for the slave nodes are made inside the image(s) on the head node
  • Provisioning system ensures that changes are propagated to the slave nodes

Nodes always boot over the network

  • Slave nodes PXE boot into the Node Installer, which:
  • Identifies the node (switch port or MAC based)
  • Configures the BMC
  • Partitions disks (if any) and creates file systems
  • Installs or updates the software image
  • Pivots the root from NFS to the local file system
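The identification step above can be sketched as a lookup from MAC address to hostname. This is illustrative only: the MAC table below is hypothetical, and the real mapping lives in the cluster management database, not in a shell script.

```shell
#!/bin/sh
# Illustrative sketch of MAC-based node identification, as performed
# by the Node Installer during PXE boot. MAC addresses and hostnames
# below are made up for the example.
identify_node() {
    case "$1" in
        00:25:90:aa:bb:01) echo "node001" ;;
        00:25:90:aa:bb:02) echo "node002" ;;
        *)                 echo "unknown" ;;
    esac
}

identify_node 00:25:90:aa:bb:02   # prints "node002"
```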
SLIDE 18

SLIDE 19

SLIDE 20

Architecture — Monitoring

[Diagram: CMDaemon collecting monitoring data from each node's BMC]

SLIDE 21

SLIDE 22

SLIDE 23

SLIDE 24

SLIDE 25

Bright Cluster Manager for GPGPU

SLIDE 26

GPU Development Environment

  • CUDA & OpenCL redistribution rights
  • Current and previous versions of CUDA & OpenCL
  • Easy switching between CUDA & OpenCL versions
  • CUDA driver automatically compiled at boot time
  • Support for new Fermi architecture
  • Native 64-bit GPU support
  • Multiple copy engine support
  • ECC reporting
  • Concurrent kernel execution
  • Fermi HW debugging support in cuda-gdb
SLIDE 27

GPU Monitoring

SLIDE 28

SLIDE 29

SLIDE 30

SLIDE 31

Cluster Health Checking

  • Goal: provide a problem-free environment for running jobs
  • Hardware & software health
  • Three types of health check:
  • Health checks before jobs are run
    – Halt workload manager a few (milli)seconds before the job is executed
    – Check health of each reserved node
    – If unhealthy, take the node offline and inform the system administrator
    – Hand the job back to the workload manager
  • Frequently scheduled health checks
    – Run health check when the node is not used
    – Run health check through the queuing system
  • Hardware burn-in environment
    – Most thorough health check
    – Requires reboot
  • All types are extensible
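Because all three types are extensible, a custom health check is essentially a script whose exit status reports node health. A minimal sketch, assuming a hypothetical check on root-filesystem usage; the 95% threshold and the PASS/FAIL output format are made up for illustration and are not Bright's built-in conventions.

```shell
#!/bin/sh
# Sketch of a custom health check: exit status 0 means healthy,
# non-zero means unhealthy. Metric and threshold are illustrative.
THRESHOLD=95

check_rootfs() {
    # Percentage of space used on /, digits only (strip the '%').
    usage=$(df -P / | awk 'NR==2 { gsub(/%/, ""); print $5 }')
    if [ "$usage" -ge "$THRESHOLD" ]; then
        echo "FAIL: root filesystem ${usage}% full"
        return 1
    fi
    echo "PASS: root filesystem ${usage}% full"
}

check_rootfs
status=$?    # the health-check framework would act on this status
```

A non-zero status from such a script is what would let the management daemon take the node offline and notify the administrator, as described above.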
SLIDE 32

SLIDE 33

SLIDE 34

SLIDE 35

SLIDE 36

Scalability

Cluster management software should not be the limiting factor for cluster size. Philosophy used for Bright Cluster Manager:

  • All tasks performed by the master node should be off-loadable to dedicated nodes.
  • If the master node cannot handle a task as a result of cluster size, the task can be placed on one or more dedicated nodes.
  • For example: multiple dedicated load-balanced provisioning nodes may be assigned in a cluster.
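The load-balancing idea can be illustrated with a simple round-robin dispatch of provisioning requests across dedicated provisioning nodes. Node names are hypothetical, and this is only a sketch: the real scheduler also tracks per-node provisioning slots and queueing.

```shell
#!/bin/sh
# Illustrative round-robin assignment of provisioning requests to
# provisioning nodes. Host names below are made up for the example.
PROVISIONERS="prov01 prov02 prov03"
set -- $PROVISIONERS
count=$#

i=0
for request in node001 node002 node003 node004; do
    idx=$(( i % count + 1 ))      # cycle through the provisioners
    eval "prov=\${$idx}"
    echo "$request -> $prov"
    i=$(( i + 1 ))
done
```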

SLIDE 37

Image Based Provisioning

  • Software image (or “image”) is a directory on the head node
  • Image contains a full Linux file-tree (/bin, /usr, …)
  • Software is not installed on nodes directly, but rather into the image
  • After the image has been changed, changes can be propagated to the compute nodes
  • Propagating image changes to nodes can be done in two ways:
  1. Rebooting nodes
  2. Using “device imageupdate” in CMSH, or “Update Node” in the GUI
  • The latter allows nodes to be updated without a reboot
  • Some changes do require a reboot (e.g. a kernel update)
SLIDE 38

Provisioning Process

  • Node Installer submits a provisioning request to the head node
  • Head node will queue the request until a provisioning slot becomes available on one of the provisioning nodes (possibly just the head node itself)
  • Provisioning node will connect to the compute node to provision the software image to the local file system
  • Two install modes:
  • FULL: re-partition hard drives, transfer image from scratch
  • SYNC: only transfer differences between image and local disk
  • Default install mode is SYNC
  • Disk setup mismatch triggers FULL install mode
SLIDE 39

SLIDE 40

Changing Software Images

  • Installing/updating RPMs

rpm --root=/cm/images/default-image -i myapp.rpm
yum --installroot=/cm/images/default-image install myapp
yum --installroot=/cm/images/default-image update

  • Installing software from source

make DESTDIR=/cm/images/default-image install

Note that not all Makefiles support $DESTDIR. Usage example from a Makefile:

install -m644 file-example $(DESTDIR)/etc/file

  • Making changes manually

chroot /cm/images/default-image
cd /usr/src/myapp; make install

emacs /cm/images/default-image/etc/file
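The DESTDIR pattern can be tried safely against a scratch directory standing in for /cm/images/default-image; file and path names mirror the Makefile example above and are purely illustrative.

```shell
#!/bin/sh
# Stage an install into a scratch "image" root, the same way
# `make DESTDIR=/cm/images/default-image install` would, assuming
# the Makefile honours $(DESTDIR).
DESTDIR=$(mktemp -d)             # stand-in for /cm/images/default-image

printf 'example=1\n' > file-example
mkdir -p "$DESTDIR/etc"
install -m644 file-example "$DESTDIR/etc/file"   # same rule as in the Makefile

ls -l "$DESTDIR/etc/file"
```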
SLIDE 41

Cloud Bursting (in development)

  • Allow clusters to be extended with cloud resources
  • Cluster can grow or shrink based on workload and policies
  • Integrated interface to public cloud providers
  • Unsolved problem: how to deal with local storage?
SLIDE 42

Looking for challenging and exciting jobs in HPC?

www.brightcomputing.com www.clustervision.com