SLIDE 1

Comet Virtual Clusters – What’s underneath?

Philip Papadopoulos, San Diego Supercomputer Center, ppapadopoulos@ucsd.edu

SLIDE 2

Overview

NSF Award #1341698, Gateways to Discovery: Cyberinfrastructure for the Long Tail of Science
PI: Michael Norman
Co-PIs: Shawn Strande, Philip Papadopoulos, Robert Sinkovits, Nancy Wilkins-Diehr
SDSC project in collaboration with Indiana University (led by Geoffrey Fox)

SLIDE 3

Comet: System Characteristics

  • Total peak flops ~2.1 PF
  • Dell primary integrator
  • Intel Haswell processors w/ AVX2
  • Mellanox FDR InfiniBand
  • 1,944 standard compute nodes (46,656 cores)
    • Dual CPUs, each 12-core, 2.5 GHz
    • 128 GB DDR4 2133 MHz DRAM
    • 2 x 160 GB SSDs (local disk)
  • 108 GPU nodes
    • Same as standard nodes plus
    • Two NVIDIA K80 cards, each with dual Kepler GPUs (36 nodes)
    • Two NVIDIA P100 GPUs (72 nodes)
  • 4 large-memory nodes
    • 1.5 TB DDR4 1866 MHz DRAM
    • Four Haswell processors/node
    • 64 cores/node
  • Hybrid fat-tree topology
    • FDR (56 Gbps) InfiniBand
    • Rack-level (72 nodes, 1,728 cores) full bisection bandwidth
    • 4:1 oversubscription cross-rack
  • Performance Storage (Aeon)
    • 7.6 PB, 200 GB/s; Lustre
    • Scratch & Persistent Storage segments
  • Durable Storage (Aeon)
    • 6 PB, 100 GB/s; Lustre
    • Automatic backups of critical data
  • Home directory storage
  • Gateway hosting nodes
  • Virtual image repository
  • 100 Gbps external connectivity to Internet2 &

SLIDE 4

Comet Network Architecture

InfiniBand compute, Ethernet Storage

(Architecture diagram. Key elements: 27 racks of 72 Haswell nodes (320 GB local SSD each), plus 36 GPU and 4 large-memory nodes; 7 x 36-port FDR switches in each rack, wired as a full fat-tree, with 4:1 oversubscription between racks; a mid-tier of 18 FDR switches feeding a core InfiniBand layer of 2 x 108-port switches; IB-Ethernet bridges (4 x 18-port each) connecting to two Arista 40GbE switches; Performance Storage of 7.7 PB at 200 GB/s on 32 storage servers and Durable Storage of 6 PB at 100 GB/s on 64 storage servers; data mover nodes and a Juniper 100 Gbps router providing Research and Education network access to Internet2; home file systems, VM image repository, login, data mover, management, and gateway hosting nodes. Additional support components, not shown for clarity: 10 GbE Ethernet management network and node-local storage.)

SLIDE 5

Fun with IB Ethernet Bridging

  • Comet has four (4) Ethernet-InfiniBand bridge switches
    • 18 FDR links, 18 40GbE links (72 total of each)
    • 4 x 16-port + 4 x 2-port LAGs on the Ethernet side
  • Issue #1
    • Significant bandwidth limitation between cluster and storage
    • Why? (IB routing)
      1. Each LAG group has a single IB local ID (LID)
      2. IB switches are destination-routed – the default is that all sources for the same destination LID take the same route (port)
  • Solution: change the LID mask count (LMC) from 0 to 2. Every LID becomes 2^LMC addresses. At each switch level, there are now 2^LMC routes to a destination LID (better route dispersion). A configuration sketch follows below.
  • Drawbacks: IB can have about 48K endpoints. When you increase LMC for better route balancing, you reduce the size of your network: 12K nodes at LMC=2, 6K at LMC=3.

(Diagram: IB nodes → IB switch → LID of the LAG)
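A minimal sketch of applying the LMC change with OpenSM (assuming OpenSM is the fabric's subnet manager; the file location and option name should be checked against the installed version):

# /etc/opensm/opensm.conf – give every endpoint 2^2 = 4 LIDs, i.e. 4 candidate routes
lmc 2

$ systemctl restart opensm        # or run the subnet manager directly: opensm --lmc 2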

SLIDE 6

More IB to Ethernet Issues

PROBLEM: losing Ethernet paths from nodes to storage

  • Mellanox bridges use proxy ARP
    • When an IPoIB interface on a compute node ARPs for IP address XX.YY, the bridge "answers" with its own MAC address. When it receives a packet destined for IP XX.YY, it forwards it (Layer 2) to the appropriate MAC.
  • The vendor advertised that it could handle 3K proxy ARP entries per bridge. Our network config worked for 18+ months.
  • Then, a change in opensm (the subnet manager): whenever a subnet change occurred, an ARP flood ensued (2K nodes each asking for O(64) Ethernet MAC addresses).
  • The bridge CPUs were woefully underpowered, taking minutes to respond to all ARP requests. Lustre wasn't happy.
  • Fix: redesigned the network from Layer 2 to Layer 3 (using routers inside our Arista fabric).

(Diagram: an IPoIB node asks "Who has XX.YY?"; the IB/Ether bridge (MAC bb) answers "I do, at bb" (proxy ARP) and forwards traffic for IP XX.YY (MAC aa) toward the Lustre storage through the Arista switch/router.)

SLIDE 7

Virtualized Clusters on Comet

Goal:

Provide a near-bare-metal HPC performance and management experience

Target use: projects that could manage their own cluster, and:

  • can’t fit OUR software environment, and
  • don’t want to buy hardware or
  • have bursty or intermittent need
SLIDE 8

Nucleus – persistent virtual front end; API:

  • Request nodes
  • Console & power
  • Scheduling
  • Storage management
  • Coordinating network changes
  • VM launch & shutdown

(Diagram: idle disk images, active virtual compute nodes, and their disk images – attached and synchronized.)

User Perspective

The user is a system administrator – we give them their own HPC cluster


SLIDE 9

User-Customized HPC

1:1 physical-to-virtual compute node

(Diagram: the physical cluster – frontend, virtual frontend hosting, disk image vault, and compute nodes – alongside virtual clusters, each with its own virtual frontend and virtual compute nodes on a private network; the frontends attach to the public network.)

SLIDE 10

High Performance Virtual Cluster Characteristics

(Diagram: a virtual frontend and virtual compute nodes connected by private Ethernet and InfiniBand.)

All nodes have
  • Private Ethernet
  • InfiniBand
  • Local disk storage

Virtual compute nodes
  • can network boot (PXE) from their virtual frontend

All disks retain state
  • keep user configuration between boots

InfiniBand virtualization
  • 8% latency overhead
  • Nominal bandwidth overhead

Comet: Providing Virtualized HPC for XSEDE

SLIDE 11

Bare Metal “Experience”

  • Can install virtual frontend from a bootable ISO image
  • Subordinate nodes can PXE boot
  • Compute nodes retain disk state (turning off a compute node is equivalent to turning off power on a physical node)
  • Don't want cluster owners to learn an entirely "new way" of doing things
    • Side comment: you don't always have to run the way "Google does it" to do good science
  • If you have tools to manage physical nodes today, you can use those same tools to manage your virtual cluster

SLIDE 12

Benchmark Results

SLIDE 13

Single Root I/O Virtualization in HPC

  • Problem: virtualization has generally resulted in significant I/O performance degradation (e.g., excessive DMA interrupts)
  • Solution: SR-IOV and Mellanox ConnectX-3 InfiniBand host channel adapters (see the sketch after this list)
    • One physical function → multiple virtual functions, each lightweight but with its own DMA streams, memory space, and interrupts
    • Allows DMA to bypass the hypervisor and go directly to VMs
  • SR-IOV enables a virtual HPC cluster w/ near-native InfiniBand latency/bandwidth and minimal overhead
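As background, a minimal sketch of how SR-IOV virtual functions are typically enabled for a ConnectX-3 (mlx4) adapter on Linux – illustrative values only, not Comet's actual configuration, and the adapter firmware must also have SR-IOV enabled:

# /etc/modprobe.d/mlx4_core.conf – create 4 VFs per HCA; host probes none (all handed to guests)
options mlx4_core num_vfs=4 probe_vf=0

$ lspci | grep -i mellanox    # after a driver reload, VFs appear as extra "Virtual Function" PCI devices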

SLIDE 14

MPI bandwidth slowdown from SR-IOV is at most 1.21 for medium-sized messages & negligible for small & large ones

SLIDE 15

MPI latency slowdown from SR-IOV is at most 1.32 for small messages & negligible for large ones
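The deck does not name the benchmark behind these curves; as an illustration, point-to-point MPI latency and bandwidth of this kind are commonly measured with the OSU Micro-Benchmarks, e.g. (host names illustrative, using MVAPICH2's mpirun):

$ mpirun -np 2 -hosts vm-compute-0,vm-compute-1 ./osu_latency
$ mpirun -np 2 -hosts vm-compute-0,vm-compute-1 ./osu_bw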

SLIDE 16

WRF Weather Modeling

  • 96-core (4-node) calculation
  • Nearest-neighbor communication
  • Test case: 3-hr forecast, 2.5 km resolution of the Continental US (CONUS)
  • Scalable algorithms
  • 2% slower w/ SR-IOV vs native IB

WRF 3.4.1 – 3-hr forecast

SLIDE 17

MrBayes: Software for Bayesian inference of phylogeny.

  • Widely used, including by the CIPRES gateway
  • 32-core (2-node) calculation
  • Hybrid MPI/OpenMP code
  • 8 MPI tasks, 4 OpenMP threads per task
  • Compilers: gcc + mvapich2 v2.2, AVX options
  • Test case: 218 taxa, 10,000 generations
  • 3% slower with SR-IOV vs native IB
SLIDE 18

Quantum ESPRESSO

  • 48-core (3-node) calculation
  • CG matrix inversion – irregular communication
  • 3D FFT matrix transposes (all-to-all communication)
  • Test case: DEISA AUSURF 112 benchmark
  • 8% slower w/ SR-IOV vs native IB
SLIDE 19

RAxML: Code for Maximum Likelihood-based inference of large phylogenetic trees.

  • Widely used, including by the CIPRES gateway
  • 48-core (2-node) calculation
  • Hybrid MPI/Pthreads code
  • 12 MPI tasks, 4 threads per task
  • Compilers: gcc + mvapich2 v2.2, AVX options
  • Test case: comprehensive analysis, 218 taxa, 2,294 characters, 1,846 patterns, 100 bootstraps specified
  • 19% slower w/ SR-IOV vs native IB
SLIDE 20

NAMD: Molecular Dynamics, ApoA1 Benchmark

  • 48-core (2-node) calculation
  • Test case: ApoA1 benchmark
  • 92,224 atoms, periodic, PME
  • Binary used: NAMD 2.11, ibverbs, SMP
  • Directly used the prebuilt binary, which uses ibverbs for multi-node runs
  • 23% slower w/ SR-IOV vs native IB
SLIDE 21

Accessing Virtual Cluster Capabilities – much smaller API than OpenStack/EC2/GCE

  • REST API
  • Command line interface
  • Command shell for scripting
  • Console Access
  • (Portal)

User does NOT see: Rocks, Slurm, etc.

SLIDE 22

Cloudmesh – Command line interface

Developed by IU collaborators

  • The Cloudmesh client enables access to multiple cloud environments from a command shell and command line.
  • We leverage this easy-to-use CLI, allowing the use of Comet as infrastructure for virtual cluster management.
  • Cloudmesh has more functionality, with the ability to access hybrid clouds (OpenStack, EC2, AWS, Azure); it is possible to extend it to other systems like Jetstream, Bridges, etc.
  • Plans for customizable launchers available through the command line or a browser – these can target specific application user communities.

Reference: https://github.com/cloudmesh/client
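A minimal sketch of getting the client onto a workstation (the install method shown is an assumption – follow the repository README and the tutorial on the Getting Started slide for the supported procedure):

$ git clone https://github.com/cloudmesh/client.git
$ cd client && pip install .     # assumes a standard setuptools-based install
$ cm comet cluster vc4           # then authenticate and inspect your cluster ("vc4" as in the console example)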

SLIDE 23

Comet Cloudmesh Client (selected commands)

  • cm comet cluster ID
  • Show the cluster details
  • cm comet power on ID vm-ID-[0-3] --walltime=6h
  • Power 3 nodes on for 6 hours
  • cm comet image attach image.iso ID vm-ID-0
  • Attach an image
  • cm comet boot ID vm-ID-0
  • Boot node 0
  • cm comet console vc4
  • Console
SLIDE 24

Getting Started

  • http://cloudmesh.github.io/client/tutorials/comet_cloudmesh.html
  • List of ISO images that a user can use to install a frontend

$ cm comet iso list
1: CentOS-7-x86_64-NetInstall-1511.iso
2: ubuntu-16.04.2-server-amd64.iso
3: ipxe.iso
...<snip>...
19: Fedora-Server-netinst-x86_64-25-1.3.iso
20: ubuntu-14.04.4-server-amd64.iso

  • Attach ISO (Ubuntu), boot the frontend, connect to the console

$ cm comet iso attach 2 vctNN
$ cm comet power on vctNN
$ cm comet console vctNN

SLIDE 25

Cluster owner has access to console at BIOS boot (any node in the cluster)

SLIDE 26

SDSC Policy

  • Virtual frontends (VFEs) can be up 7 x 24 x 365
  • Typical config is 8 GB memory, 36 GB disk, 4 cores
  • Multiple VFEs on a single physical host
  • Compute nodes are treated as (parallel) jobs in our batch system
    • Users request nodes to be turned on/off
    • The Cloudmesh client hides that a request to turn on a node is actually a batch job submission to SLURM (see the sketch below)
  • A compute node retains its disk state, the MAC address of its Ethernet interface, and the GUID of its virtualized IB. Powering off a virtual compute node is just like powering off physical hardware.
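A rough, hypothetical illustration of what that hidden submission could look like (the script name, node count, and partition are invented for the example; the deck does not show the actual job template Nucleus generates):

$ sbatch --nodes=3 --time=06:00:00 --partition=virt start-virtual-nodes.sh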

SLIDE 27

“Fun” with KVM and SRIOV

  • Issue: virtual compute nodes are allocated 120/128 GB of memory. Sometimes it would take a very long time (20 minutes) for a KVM virtual container to start.
  • Root cause: KVM wants to allocate a contiguous block of physical memory. When a node has been running for a while, this isn't likely.
    • Hammer: reboot the physical node
    • More subtle (works mostly): release all caches/buffers – see the sketch below
  • When a cluster node is allocated, we assign its virtual IB adapter a fixed GUID
    • Some handstands with virtual function assignment within the physical node
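A minimal sketch of the "release all caches/buffers" step using standard Linux kernel interfaces (whether this is exactly what Comet's tooling runs is an assumption):

# On the physical host: flush dirty pages, then drop the page cache, dentries, and inodes
$ sync
$ echo 3 > /proc/sys/vm/drop_caches
# Optionally ask the kernel to compact memory into larger contiguous blocks
$ echo 1 > /proc/sys/vm/compact_memory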
SLIDE 28

VM Disk Management

  • Each VM gets a 36 GB disk (Small SCSI) – This is adjustable
  • Disk images are persistent through reboots
  • Two central NASes (ZFS-based) store all disk images
  • VM can be allocated on any physical compute node in Comet
  • Two solutions:
  • iSCSI (Network mounted disk)
  • Disk replication on nodes
SLIDE 29

Non-performant approach: VM disk management via iSCSI only

(Diagram: the NAS exports one iSCSI target per virtual node, e.g. iqn.2001-04.com.nas-0-0-vm-compute-x, which the compute node hosting virtual compute-x mounts over the network.)

This is what OpenStack supports. Big issue: bandwidth bottleneck at the NAS.
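For reference, mounting such a target from a compute node would look roughly like this with the standard open-iscsi tools (the portal name "nas-0-0" is taken from the IQN above; exact addresses are illustrative):

$ iscsiadm -m discovery -t sendtargets -p nas-0-0
$ iscsiadm -m node -T iqn.2001-04.com.nas-0-0-vm-compute-x -p nas-0-0 --login
# The LUN then appears as a local block device that is handed to the VM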

SLIDE 30

A hybrid solution via replication

  • Initial boot of any cluster node uses an iSCSI disk (call this a node disk) on the NAS
  • During normal operation, Comet moves the node disk to the physical host that is running the node VM, and then disconnects from the NAS
  • All node disk operation is local to the physical host
    • Fundamentally enables scale-out w/o a $1M NAS
  • At shutdown, any changes made to the node disk (now on the physical host) are migrated back to the NAS, ready for the next boot

SLIDE 31

VM Disk Management – Replication

Replication states:
1. Unused – unmapped
2. Init disk – NAS → VM
   a. Move disk image
   b. Merge temporary modification
3. Steady state – mapped
4. Release disk – VM → NAS
5. Unused – unmapped

SLIDE 32

1.a Init Disk

(Diagram: NAS with iSCSI target iqn.2001-04.com.nas-0-0-vm-compute-x; compute node running virtual compute-x; the disk is being replicated to the compute node.)

The iSCSI mount on the NAS enables the virtual compute node to boot immediately.

  • Read operations go to the NAS
  • Write operations go to local disk
SLIDE 33

1.b Init Disk

(Diagram: NAS, compute node running virtual compute-x, iSCSI targets.)

During boot, the disk image on the NAS is migrated to the physical host.

  • Read-only and read/write are then merged into one local disk
  • The iSCSI mount is disconnected

SLIDE 34
2. Steady State

(Diagram: NAS, compute node running virtual compute-x, iSCSI targets.)

During normal operation:

  • The node disk is snapshotted
  • Incremental snapshots are sent to the NAS (replicated back to the NAS)
  • Timing/load/experiment will tell us how often we can do this

SLIDE 35
3. Release Disk

(Diagram: NAS, compute node running virtual compute-x, iSCSI targets; power off.)

At shutdown, any unsynced changes are sent back to the NAS.

  • When the last snapshot is sent, the virtual compute node can be rebooted on another system

SLIDE 36

Current implementation

https://github.com/rocksclusters/img-storage-roll

SLIDE 37

Some Technical Details

  • The NAS and physical nodes use ZFS as the native file system
  • A node disk is defined inside ZFS as a ZVOL (a raw disk volume)
  • ZVOLs
    • Can be snapshotted using native ZFS utilities
    • Full and incremental snapshots can be sent over the network using ZFS send/recv + ssh (or another protocol) – see the sketch below
  • VMs simply see a raw disk
    • The ZVOL is the disk image
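A minimal sketch of that ZVOL snapshot-and-replicate flow (pool, volume, and host names are illustrative, not Comet's actual naming):

# On the physical host: snapshot the node's ZVOL and ship it to the NAS over ssh
$ zfs snapshot tank/vm-compute-x@t1
$ zfs send tank/vm-compute-x@t1 | ssh nas-0-0 zfs recv backup/vm-compute-x
# Later, send only the blocks changed since the previous snapshot (incremental)
$ zfs snapshot tank/vm-compute-x@t2
$ zfs send -i @t1 tank/vm-compute-x@t2 | ssh nas-0-0 zfs recv backup/vm-compute-x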
SLIDE 38

Virtual Cluster projects

  • Open Science Grid: University of California, San Diego, Frank Wuerthwein (in production)
  • Virtual cluster for the PRAGMA/GLEON lake expedition: University of Florida, Renato Figueiredo
  • Deploying the Lifemapper species modeling platform with virtual clusters on Comet: University of Kansas, James Beach
  • Adolescent Brain Cognitive Development Study: NIH funded, 19 institutions
  • Comet's goal was O(20) virtual clusters (not 1000s)