SLIDE 1

Comet Virtual Clusters – What’s underneath?

Philip Papadopoulos, San Diego Supercomputer Center, ppapadopoulos@ucsd.edu

SLIDE 2

Overview

NSF Award #1341698, Gateways to Discovery: Cyberinfrastructure for the Long Tail of Science
PI: Michael Norman
Co-PIs: Shawn Strande, Philip Papadopoulos, Robert Sinkovits, Nancy Wilkins-Diehr
SDSC project in collaboration with Indiana University (led by Geoffrey Fox)

SLIDE 3

Comet: System Characteristics

  • Total peak flops ~2.1 PF
  • Dell primary integrator
  • Intel Haswell processors w/ AVX2
  • Mellanox FDR InfiniBand
  • 1,944 standard compute nodes (46,656 cores)
    • Dual CPUs, each 12-core, 2.5 GHz
    • 128 GB DDR4 2133 MHz DRAM
    • 2 x 160 GB SSDs (local disk)
  • 108 GPU nodes
    • Same as standard nodes plus
    • Two NVIDIA K80 cards, each with dual Kepler GPUs (36 nodes)
    • Two NVIDIA P100 GPUs (72 nodes)
  • 4 large-memory nodes
    • 1.5 TB DDR4 1866 MHz DRAM
    • Four Haswell processors/node
    • 64 cores/node
  • Hybrid fat-tree topology
    • FDR (56 Gbps) InfiniBand
    • Rack-level (72 nodes, 1,728 cores) full bisection bandwidth
    • 4:1 oversubscription cross-rack
  • Performance Storage (Aeon)
    • 7.6 PB, 200 GB/s; Lustre
    • Scratch & Persistent Storage segments
  • Durable Storage (Aeon)
    • 6 PB, 100 GB/s; Lustre
    • Automatic backups of critical data
  • Home directory storage
  • Gateway hosting nodes
  • Virtual image repository
  • 100 Gbps external connectivity to Internet2 &

SLIDE 4

Comet Network Architecture

InfiniBand compute, Ethernet Storage

(Architecture diagram. Key elements: 27 racks of 72 Haswell nodes (320 GB local SSD each), plus 36 GPU and 4 large-memory nodes; 7 x 36-port FDR switches in each rack, wired as a full fat-tree, with 4:1 oversubscription between racks; a mid-tier of 18 FDR switches feeding a core InfiniBand layer of 2 x 108-port switches; IB-Ethernet bridges (4 x 18-port each) connecting to two Arista 40GbE switches; Performance Storage of 7.7 PB at 200 GB/s on 32 storage servers and Durable Storage of 6 PB at 100 GB/s on 64 storage servers; data mover nodes and a Juniper 100 Gbps router providing Research and Education network access to Internet2; home file systems, VM image repository, login, data mover, management, and gateway hosting nodes. Additional support components, not shown for clarity: 10 GbE Ethernet management network and node-local storage.)

SLIDE 5

Fun with IB Ethernet Bridging

  • Comet has four (4) Ethernet-InfiniBand bridge switches
    • 18 FDR links, 18 40GbE links (72 total of each)
    • 4 x 16-port + 4 x 2-port LAGs on the Ethernet side
  • Issue #1
    • Significant bandwidth limitation between cluster and storage
    • Why? (IB routing)
      1. Each LAG group has a single IB local ID (LID)
      2. IB switches are destination-routed – the default is that all sources for the same destination LID take the same route (port)
  • Solution: change the LID mask count (LMC) from 0 to 2. Every LID becomes 2^LMC addresses. At each switch level, there are now 2^LMC routes to a destination LID (better route dispersion). A configuration sketch follows below.
  • Drawbacks: IB can have about 48K endpoints. When you increase LMC for better route balancing, you reduce the size of your network: 12K nodes at LMC=2, 6K at LMC=3.

(Diagram: IB nodes → IB switch → LID of the LAG)
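A minimal sketch of applying the LMC change with OpenSM (assuming OpenSM is the fabric's subnet manager; the file location and option name should be checked against the installed version):

# /etc/opensm/opensm.conf – give every endpoint 2^2 = 4 LIDs, i.e. 4 candidate routes
lmc 2

$ systemctl restart opensm        # or run the subnet manager directly: opensm --lmc 2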

SLIDE 6

More IB to Ethernet Issues

PROBLEM: losing Ethernet paths from nodes to storage

  • Mellanox bridges use proxy ARP
    • When an IPoIB interface on a compute node ARPs for IP address XX.YY, the bridge "answers" with its own MAC address. When it receives a packet destined for IP XX.YY, it forwards it (Layer 2) to the appropriate MAC.
  • The vendor advertised that it could handle 3K proxy ARP entries per bridge. Our network config worked for 18+ months.
  • Then, a change in opensm (the subnet manager): whenever a subnet change occurred, an ARP flood ensued (2K nodes each asking for O(64) Ethernet MAC addresses).
  • The bridge CPUs were woefully underpowered, taking minutes to respond to all ARP requests. Lustre wasn't happy.
  • Fix: redesigned the network from Layer 2 to Layer 3 (using routers inside our Arista fabric).

(Diagram: an IPoIB node asks "Who has XX.YY?"; the IB/Ether bridge (MAC bb) answers "I do, at bb" (proxy ARP) and forwards traffic for IP XX.YY (MAC aa) toward the Lustre storage through the Arista switch/router.)

SLIDE 7

Virtualized Clusters on Comet

Goal:

Provide a near-bare-metal HPC performance and management experience

Target use: projects that could manage their own cluster, and:

  • can’t fit OUR software environment, and
  • don’t want to buy hardware or
  • have bursty or intermittent need
SLIDE 8

Nucleus – persistent virtual front end; API:

  • Request nodes
  • Console & power
  • Scheduling
  • Storage management
  • Coordinating network changes
  • VM launch & shutdown

(Diagram: idle disk images, active virtual compute nodes, and their disk images – attached and synchronized.)

User Perspective

The user is a system administrator – we give them their own HPC cluster


SLIDE 9

User-Customized HPC

1:1 physical-to-virtual compute node

(Diagram: the physical cluster – frontend, virtual frontend hosting, disk image vault, and compute nodes – alongside virtual clusters, each with its own virtual frontend and virtual compute nodes on a private network; the frontends attach to the public network.)

SLIDE 10

High Performance Virtual Cluster Characteristics

(Diagram: a virtual frontend and virtual compute nodes connected by private Ethernet and InfiniBand.)

All nodes have
  • Private Ethernet
  • InfiniBand
  • Local disk storage

Virtual compute nodes
  • can network boot (PXE) from their virtual frontend

All disks retain state
  • keep user configuration between boots

InfiniBand virtualization
  • 8% latency overhead
  • Nominal bandwidth overhead

Comet: Providing Virtualized HPC for XSEDE

SLIDE 11

Bare Metal “Experience”

  • Can install virtual frontend from a bootable ISO image
  • Subordinate nodes can PXE boot
  • Compute nodes retain disk state (turning off a compute node is equivalent to turning off power on a physical node)
  • Don't want cluster owners to learn an entirely "new way" of doing things
    • Side comment: you don't always have to run the way "Google does it" to do good science
  • If you have tools to manage physical nodes today, you can use those same tools to manage your virtual cluster

SLIDE 12

Benchmark Results

SLIDE 13

Single Root I/O Virtualization in HPC

  • Problem: virtualization has generally resulted in significant I/O performance degradation (e.g., excessive DMA interrupts)
  • Solution: SR-IOV and Mellanox ConnectX-3 InfiniBand host channel adapters (see the sketch after this list)
    • One physical function → multiple virtual functions, each lightweight but with its own DMA streams, memory space, and interrupts
    • Allows DMA to bypass the hypervisor and go directly to VMs
  • SR-IOV enables a virtual HPC cluster w/ near-native InfiniBand latency/bandwidth and minimal overhead
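As background, a minimal sketch of how SR-IOV virtual functions are typically enabled for a ConnectX-3 (mlx4) adapter on Linux – illustrative values only, not Comet's actual configuration, and the adapter firmware must also have SR-IOV enabled:

# /etc/modprobe.d/mlx4_core.conf – create 4 VFs per HCA; host probes none (all handed to guests)
options mlx4_core num_vfs=4 probe_vf=0

$ lspci | grep -i mellanox    # after a driver reload, VFs appear as extra "Virtual Function" PCI devices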

SLIDE 14

MPI bandwidth slowdown from SR-IOV is at most 1.21 for medium-sized messages & negligible for small & large ones

SLIDE 15

MPI latency slowdown from SR-IOV is at most 1.32 for small messages & negligible for large ones
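The deck does not name the benchmark behind these curves; as an illustration, point-to-point MPI latency and bandwidth of this kind are commonly measured with the OSU Micro-Benchmarks, e.g. (host names illustrative, using MVAPICH2's mpirun):

$ mpirun -np 2 -hosts vm-compute-0,vm-compute-1 ./osu_latency
$ mpirun -np 2 -hosts vm-compute-0,vm-compute-1 ./osu_bw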

SLIDE 16

WRF Weather Modeling

  • 96-core (4-node) calculation
  • Nearest-neighbor communication
  • Test case: 3-hr forecast, 2.5 km resolution of the Continental US (CONUS)
  • Scalable algorithms
  • 2% slower w/ SR-IOV vs native IB

WRF 3.4.1 – 3-hr forecast

SLIDE 17

MrBayes: Software for Bayesian inference of phylogeny.

  • Widely used, including by the CIPRES gateway
  • 32-core (2-node) calculation
  • Hybrid MPI/OpenMP code
  • 8 MPI tasks, 4 OpenMP threads per task
  • Compilers: gcc + mvapich2 v2.2, AVX options
  • Test case: 218 taxa, 10,000 generations
  • 3% slower with SR-IOV vs native IB
SLIDE 18

Quantum ESPRESSO

  • 48-core (3-node) calculation
  • CG matrix inversion – irregular communication
  • 3D FFT matrix transposes (all-to-all communication)
  • Test case: DEISA AUSURF 112 benchmark
  • 8% slower w/ SR-IOV vs native IB
SLIDE 19

RAxML: Code for Maximum Likelihood-based inference of large phylogenetic trees.

  • Widely used, including by the CIPRES gateway
  • 48-core (2-node) calculation
  • Hybrid MPI/Pthreads code
  • 12 MPI tasks, 4 threads per task
  • Compilers: gcc + mvapich2 v2.2, AVX options
  • Test case: comprehensive analysis, 218 taxa, 2,294 characters, 1,846 patterns, 100 bootstraps specified
  • 19% slower w/ SR-IOV vs native IB
SLIDE 20

NAMD: Molecular Dynamics, ApoA1 Benchmark

  • 48-core (2-node) calculation
  • Test case: ApoA1 benchmark
  • 92,224 atoms, periodic, PME
  • Binary used: NAMD 2.11, ibverbs, SMP
  • Directly used the prebuilt binary, which uses ibverbs for multi-node runs
  • 23% slower w/ SR-IOV vs native IB
SLIDE 21

Accessing Virtual Cluster Capabilities – much smaller API than OpenStack/EC2/GCE

  • REST API
  • Command line interface
  • Command shell for scripting
  • Console Access
  • (Portal)

User does NOT see: Rocks, Slurm, etc.

SLIDE 22

Cloudmesh – Command line interface

Developed by IU collaborators

  • The Cloudmesh client enables access to multiple cloud environments from a command shell and command line.
  • We leverage this easy-to-use CLI, allowing the use of Comet as infrastructure for virtual cluster management.
  • Cloudmesh has more functionality, with the ability to access hybrid clouds (OpenStack, EC2, AWS, Azure); it is possible to extend it to other systems like Jetstream, Bridges, etc.
  • Plans for customizable launchers available through the command line or a browser – these can target specific application user communities.

Reference: https://github.com/cloudmesh/client
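A minimal sketch of getting the client onto a workstation (the install method shown is an assumption – follow the repository README and the tutorial on the Getting Started slide for the supported procedure):

$ git clone https://github.com/cloudmesh/client.git
$ cd client && pip install .     # assumes a standard setuptools-based install
$ cm comet cluster vc4           # then authenticate and inspect your cluster ("vc4" as in the console example)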

SLIDE 23

Comet Cloudmesh Client (selected commands)

  • cm comet cluster ID
  • Show the cluster details
  • cm comet power on ID vm-ID-[0-3] --walltime=6h
  • Power 3 nodes on for 6 hours
  • cm comet image attach image.iso ID vm-ID-0
  • Attach an image
  • cm comet boot ID vm-ID-0
  • Boot node 0
  • cm comet console vc4
  • Console
SLIDE 24

Getting Started

  • http://cloudmesh.github.io/client/tutorials/comet_cloudmesh.html
  • List of ISO images that a user can use to install a frontend

$ cm comet iso list
1: CentOS-7-x86_64-NetInstall-1511.iso
2: ubuntu-16.04.2-server-amd64.iso
3: ipxe.iso
...<snip>...
19: Fedora-Server-netinst-x86_64-25-1.3.iso
20: ubuntu-14.04.4-server-amd64.iso

  • Attach ISO (Ubuntu), boot the frontend, connect to the console

$ cm comet iso attach 2 vctNN
$ cm comet power on vctNN
$ cm comet console vctNN

SLIDE 25

Cluster owner has access to console at BIOS boot (any node in the cluster)

SLIDE 26

SDSC Policy

  • Virtual frontends (VFEs) can be up 7 x 24 x 365
  • Typical config is 8 GB memory, 36 GB disk, 4 cores
  • Multiple VFEs on a single physical host
  • Compute nodes are treated as (parallel) jobs in our batch system
    • Users request nodes to be turned on/off
    • The Cloudmesh client hides that a request to turn on a node is actually a batch job submission to SLURM (see the sketch below)
  • A compute node retains its disk state, the MAC address of its Ethernet interface, and the GUID of its virtualized IB. Powering off a virtual compute node is just like powering off physical hardware.
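A rough, hypothetical illustration of what that hidden submission could look like (the script name, node count, and partition are invented for the example; the deck does not show the actual job template Nucleus generates):

$ sbatch --nodes=3 --time=06:00:00 --partition=virt start-virtual-nodes.sh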

SLIDE 27

“Fun” with KVM and SRIOV

  • Issue: virtual compute nodes are allocated 120/128 GB of memory. Sometimes it would take a very long time (20 minutes) for a KVM virtual container to start.
  • Root cause: KVM wants to allocate a contiguous block of physical memory. When a node has been running for a while, this isn't likely.
    • Hammer: reboot the physical node
    • More subtle (works mostly): release all caches/buffers – see the sketch below
  • When a cluster node is allocated, we assign its virtual IB adapter a fixed GUID
    • Some handstands with virtual function assignment within the physical node
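A minimal sketch of the "release all caches/buffers" step using standard Linux kernel interfaces (whether this is exactly what Comet's tooling runs is an assumption):

# On the physical host: flush dirty pages, then drop the page cache, dentries, and inodes
$ sync
$ echo 3 > /proc/sys/vm/drop_caches
# Optionally ask the kernel to compact memory into larger contiguous blocks
$ echo 1 > /proc/sys/vm/compact_memory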
SLIDE 28

VM Disk Management

  • Each VM gets a 36 GB disk (Small SCSI) – This is adjustable
  • Disk images are persistent through reboots
  • Two central NASes (ZFS-based) store all disk images
  • VM can be allocated on any physical compute node in Comet
  • Two solutions:
  • iSCSI (Network mounted disk)
  • Disk replication on nodes
SLIDE 29

Non-performant approach: VM disk management via iSCSI only

(Diagram: the NAS exports one iSCSI target per virtual node, e.g. iqn.2001-04.com.nas-0-0-vm-compute-x, which the compute node hosting virtual compute-x mounts over the network.)

This is what OpenStack supports. Big issue: bandwidth bottleneck at the NAS.
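For reference, mounting such a target from a compute node would look roughly like this with the standard open-iscsi tools (the portal name "nas-0-0" is taken from the IQN above; exact addresses are illustrative):

$ iscsiadm -m discovery -t sendtargets -p nas-0-0
$ iscsiadm -m node -T iqn.2001-04.com.nas-0-0-vm-compute-x -p nas-0-0 --login
# The LUN then appears as a local block device that is handed to the VM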

SLIDE 30

A hybrid solution via replication

  • Initial boot of any cluster node uses an iSCSI disk (call this a node disk) on the NAS
  • During normal operation, Comet moves the node disk to the physical host that is running the node VM, and then disconnects from the NAS
  • All node disk operation is local to the physical host
    • Fundamentally enables scale-out w/o a $1M NAS
  • At shutdown, any changes made to the node disk (now on the physical host) are migrated back to the NAS, ready for the next boot

SLIDE 31

VM Disk Management – Replication

Replication states:
1. Unused – unmapped
2. Init disk – NAS → VM
   a. Move disk image
   b. Merge temporary modification
3. Steady state – mapped
4. Release disk – VM → NAS
5. Unused – unmapped

SLIDE 32

1.a Init Disk

(Diagram: NAS with iSCSI target iqn.2001-04.com.nas-0-0-vm-compute-x; compute node running virtual compute-x; the disk is being replicated to the compute node.)

The iSCSI mount on the NAS enables the virtual compute node to boot immediately.

  • Read operations go to the NAS
  • Write operations go to local disk
SLIDE 33

1.b Init Disk

(Diagram: NAS, compute node running virtual compute-x, iSCSI targets.)

During boot, the disk image on the NAS is migrated to the physical host.

  • Read-only and read/write are then merged into one local disk
  • The iSCSI mount is disconnected

SLIDE 34
2. Steady State

(Diagram: NAS, compute node running virtual compute-x, iSCSI targets.)

During normal operation:

  • The node disk is snapshotted
  • Incremental snapshots are sent to the NAS (replicated back to the NAS)
  • Timing/load/experiment will tell us how often we can do this

SLIDE 35
3. Release Disk

(Diagram: NAS, compute node running virtual compute-x, iSCSI targets; power off.)

At shutdown, any unsynced changes are sent back to the NAS.

  • When the last snapshot is sent, the virtual compute node can be rebooted on another system

SLIDE 36

Current implementation

https://github.com/rocksclusters/img-storage-roll

SLIDE 37

Some Technical Details

  • The NAS and physical nodes use ZFS as the native file system
  • A node disk is defined inside ZFS as a ZVOL (a raw disk volume)
  • ZVOLs
    • Can be snapshotted using native ZFS utilities
    • Full and incremental snapshots can be sent over the network using ZFS send/recv + ssh (or another protocol) – see the sketch below
  • VMs simply see a raw disk
    • The ZVOL is the disk image
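A minimal sketch of that ZVOL snapshot-and-replicate flow (pool, volume, and host names are illustrative, not Comet's actual naming):

# On the physical host: snapshot the node's ZVOL and ship it to the NAS over ssh
$ zfs snapshot tank/vm-compute-x@t1
$ zfs send tank/vm-compute-x@t1 | ssh nas-0-0 zfs recv backup/vm-compute-x
# Later, send only the blocks changed since the previous snapshot (incremental)
$ zfs snapshot tank/vm-compute-x@t2
$ zfs send -i @t1 tank/vm-compute-x@t2 | ssh nas-0-0 zfs recv backup/vm-compute-x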
SLIDE 38

Virtual Cluster projects

  • Open Science Grid: University of California, San Diego, Frank Wuerthwein (in production)
  • Virtual cluster for the PRAGMA/GLEON lake expedition: University of Florida, Renato Figueiredo
  • Deploying the Lifemapper species modeling platform with virtual clusters on Comet: University of Kansas, James Beach
  • Adolescent Brain Cognitive Development Study: NIH funded, 19 institutions
  • Comet's goal was O(20) virtual clusters (not 1000s)