Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters
Jie Zhang
Dr. Dhabaleswar K. Panda (Advisor)
Department of Computer Science & Engineering The Ohio State University
SC 2017 Doctoral Showcase 2 Network Based Computing Laboratory
(Courtesy: http://www.idc.com/getdoc.jsp?containerId=prUS41669516)
– High Performance Interconnects: InfiniBand (with SR-IOV), <1 usec latency, 200 Gbps bandwidth
– Multi-/Many-core Processors
– Accelerators (GPUs/Co-processors)
– Large memory nodes (up to 2 TB)
– Cloud
– SDSC Comet, TACC Stampede
– Single Root I/O Virtualization (SR-IOV) provides new opportunities to design HPC clouds with very low overhead by bypassing the hypervisor
– Allows a single physical device, or Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
– VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
– Each VF can be dedicated to a single VM through PCI pass-through
[Figure: SR-IOV architecture. Guests 1 to 3 each run a guest OS with a VF driver; the hypervisor hosts the PF driver; an I/O MMU and PCI Express connect them to SR-IOV hardware exposing three Virtual Functions and one Physical Function.]
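The PF/VF relationship above can be inspected on a Linux host through sysfs. The following is a minimal sketch, not part of the original slides: it assumes the standard `sriov_totalvfs` and `sriov_numvfs` attributes and the `virtfnN` links that the Linux PCI subsystem exposes under a physical function's device directory. The function name `sriov_summary` and the directory passed to it are illustrative.

```python
from pathlib import Path


def sriov_summary(pci_dev_dir):
    """Summarize SR-IOV state for one PCI device directory in sysfs.

    pci_dev_dir is a path such as /sys/bus/pci/devices/<address> for a
    physical function (the caller supplies the address).
    """
    dev = Path(pci_dev_dir)

    def read_int(name):
        # Attributes are absent when the device does not support SR-IOV.
        f = dev / name
        return int(f.read_text().strip()) if f.exists() else 0

    # Each enabled VF shows up as a "virtfnN" entry under the PF directory.
    prefix = "virtfn"
    indices = sorted(
        int(p.name[len(prefix):])
        for p in dev.glob(prefix + "*")
        if p.name[len(prefix):].isdigit()
    )
    return {
        "total_vfs": read_int("sriov_totalvfs"),  # maximum VFs the PF supports
        "num_vfs": read_int("sriov_numvfs"),      # VFs currently enabled
        "virtual_functions": [f"{prefix}{i}" for i in indices],
    }
```

Calling `sriov_summary` on the PF directory of an SR-IOV capable adapter returns how many VFs are enabled and which `virtfnN` entries exist; each of those can then be handed to a guest through PCI pass-through.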
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for Virtualization (MVAPICH2-Virt), available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,825 organizations in 85 countries
– More than 432,000 (> 0.4 million) downloads directly from the OSU site
– Empowering many TOP500 clusters (Jul '17 ranking)
– Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
[Figure: Two software stacks side by side. Native hardware: Application, MPI Layer, ADI3 Layer, SMP Channel (Shared Memory) and Network Channel (InfiniBand API), Communication Device APIs. Virtualized hardware: Application, MPI Layer, ADI3 Layer, a Virtual Machine Aware layer with a Communication Coordinator and Locality Detector, SMP Channel, IVShmem Channel (Shared Memory) and SR-IOV Channel (InfiniBand API), Communication Device APIs.]
Performance Computing (HiPC’14), Dec 2014
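The roles of the Locality Detector and Communication Coordinator in the architecture above can be illustrated with a short sketch. This is not the library's code: the `Rank` type, the locality labels, and the mapping from locality to channel are assumptions made for illustration. Only the channel names (SMP, IVShmem, SR-IOV) come from the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rank:
    """Hypothetical description of where one MPI process runs."""
    rank: int
    host: str  # physical node
    vm: str    # guest VM on that node


def detect_locality(a, b):
    """Classify a pair of processes by how close they are."""
    if a.host != b.host:
        return "remote"      # different physical nodes
    if a.vm == b.vm:
        return "same_vm"     # inside one guest
    return "same_host"       # co-resident VMs on one node


# Assumed policy: shared memory inside a VM, IVShmem between
# co-resident VMs, and the SR-IOV virtual function otherwise.
CHANNEL_FOR = {
    "same_vm": "SMP",
    "same_host": "IVShmem",
    "remote": "SR-IOV",
}


def select_channel(a, b):
    """Pick a communication channel from the detected locality."""
    return CHANNEL_FOR[detect_locality(a, b)]
```

The point of the split is that locality detection runs once per peer, while channel selection stays a cheap table lookup on the communication path.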
SINE and SPEC
[Figure: Execution time (s), SR-IOV vs. Proposed. NAS Class B (FT, LU, CG, MG, IS) on 32 VMs (8 VMs per node); P3DFFT 512*512*512 (SPEC, INVERSE, RAND, SINE) on 32 VMs (8 VMs per node). Annotated percentages: 43%, 20%, 29%, 33%, 29%.]
[Figure: VM migration framework. Two hosts, each with guest VMs (VM1, VM2) running MPI, VF / SR-IOV devices, IVShmem, a hypervisor, an IB adapter, and an Ethernet adapter. Controller components: Network Suspend Trigger, Ready-to-Migrate Detector, Network Reactive Notifier, Migration Trigger, Migration Done Detector.]
IPDPS, 2017
– Coordinate with the VM during migration (detach/reattach SR-IOV/IVShmem, migrate VMs, report migration status)
– Connection suspending and reactivating
– Progress Engine based (PE) and Migration Thread based (MT) designs to optimize VM migration and MPI application performance
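The coordination described above (suspend the channel, detach devices, migrate, reattach, reactivate) can be sketched as a simple ordered sequence. This is an illustrative model only: the `Channel` class, the `run_migration` function, and the `move_vm` callback are hypothetical names. The phase names follow the slides' timeline labels (pre-migration, ready-to-migrate, migration, post-migration).

```python
from enum import Enum


class Phase(Enum):
    PRE_MIGRATION = "pre-migration"
    READY_TO_MIGRATE = "ready-to-migrate"
    MIGRATION = "migration"
    POST_MIGRATION = "post-migration"


class Channel:
    """Stand-in for an MPI communication channel that can be paused."""

    def __init__(self):
        self.active = True

    def suspend(self):
        self.active = False

    def reactivate(self):
        self.active = True


def run_migration(channel, move_vm):
    """Walk one VM through the four phases and return the event trace.

    move_vm is a caller-supplied callable that performs the actual
    live migration; here it is only invoked at the right moment.
    """
    trace = [Phase.PRE_MIGRATION.value]
    channel.suspend()                       # no traffic while devices move
    trace.append("suspend channel")
    trace.append("detach SR-IOV/IVShmem")

    trace.append(Phase.READY_TO_MIGRATE.value)

    trace.append(Phase.MIGRATION.value)
    move_vm()

    trace.append(Phase.POST_MIGRATION.value)
    trace.append("reattach SR-IOV/IVShmem")
    channel.reactivate()                    # resume communication
    trace.append("reactivate channel")
    return trace
```

The ordering is the essential property: the channel must already be suspended when `move_vm` runs, and it is reactivated only after the devices are reattached.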
[Figure: Timelines of process P0 and the controller through the pre-migration, ready-to-migrate, migration, and post-migration phases for four cases: no migration, Progress Engine based, Migration-thread based (typical scenario), and Migration-thread based (worst scenario). Legend: computation (Comp), MPI call, suspend channel (S), reactivate channel (R), control message, migration, lock/unlock communication, migration signal detection, down-time for VM live migration.]
[Figure: Execution time (s) for PE, MT-worst, MT-typical, and NM. NAS: LU.B, EP.B, MG.B, CG.B, FT.B. Graph500: problem sizes 20,10; 20,16; 20,20; 22,10.]
[Figure: Two NUMA nodes (NUMA 0 and NUMA 1), each with a memory controller, connected by QPI, with cores 0 to 15. VM 0 and VM 1 host Containers 0 to 3. The Two-Layer Locality Detector (Container Locality Detector, VM Locality Detector, Nested Locality Combiner) and the Two-Layer NUMA Aware Communication Coordinator sit above the CMA Channel, Shared Memory (SHM) Channel, and Network (HCA) Channel.]
– Two-Layer Locality Detector: dynamically detects MPI processes in co-resident containers inside a VM as well as in different co-resident VMs
– Two-Layer NUMA Aware Communication Coordinator: leverages nested locality info, NUMA architecture info, and message information to select the appropriate communication channel
InfiniBand, The 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE ’17), April 2017
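A channel choice driven by nested locality, NUMA placement, and message size can be sketched as follows. The channel names (CMA, SHM, HCA) come from the slides; everything else is an assumption for illustration, including the `Proc` type, the `LARGE_MSG_BYTES` threshold, and the specific policy, none of which is taken from the original design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    """Hypothetical placement record for one MPI process."""
    host: str
    vm: str
    container: str
    numa_node: int


# Assumed cut-over point between "small" and "large" messages.
LARGE_MSG_BYTES = 16 * 1024


def nested_locality(a, b):
    """Combine the VM-level and container-level views of locality."""
    if a.host != b.host:
        return "remote"
    if a.vm != b.vm:
        return "co-resident-vms"
    return "same-vm"  # same container or co-resident containers in one VM


def select_channel(a, b, msg_bytes):
    """Illustrative policy only; the real coordinator may differ."""
    loc = nested_locality(a, b)
    if loc == "remote":
        return "HCA"
    if loc == "same-vm":
        # Assumed: single-copy CMA pays off for large messages.
        return "CMA" if msg_bytes >= LARGE_MSG_BYTES else "SHM"
    # Co-resident VMs: assumed to fall back to the adapter only for
    # large messages that would cross NUMA nodes.
    if a.numa_node == b.numa_node or msg_bytes < LARGE_MSG_BYTES:
        return "SHM"
    return "HCA"
```

Keeping the locality classification separate from the policy makes it easy to tune thresholds per system without touching the detection logic.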
[Figure: The Two-Layer Locality Detector feeds the Two-Layer NUMA Aware Communication Coordinator (NUMA Loader, Nested Locality Loader, Message Parser, Communication Coordinator), which sits above the CMA Channel, Shared Memory (SHM) Channel, and Network (HCA) Channel.]
Detector
destination process is pinning
execution time for Graph 500 and NAS, respectively
performance benefit.
[Figure: Default vs. 1Layer vs. 2Layer-Enhanced-Hybrid. NAS Class D: execution time (s) for IS, MG, EP, FT, CG, LU. Graph500: BFS execution time (s) for problem sizes 22,20; 24,16; 24,20; 24,24; 26,16; 26,20; 26,24; 28,16.]
[Figure: Placement of VMs and MPI jobs on compute nodes under three scenarios: Exclusive Allocations Sequential Jobs (EASJ), Exclusive Allocations Concurrent Jobs (EACJ), and Shared-host Allocations Concurrent Jobs (SACJ).]
[Figure: Job submission and VM launch workflow. Labels: submit job, sbatch file, VM configuration file, SLURMctld, physical resource request, physical node list, SLURMd on each physical node, launch VMs, VM launching/reclaiming, libvirtd, VM1 and VM2 (each with a VF and an IVSHMEM device, running MPI), Lustre, image pool, check availability.]
– Utilize SPANK plugin to read VM configuration, launch/reclaim VM
– File based lock to detect occupied VF and exclusively allocate free VF
– Assign a unique ID to each IVSHMEM device and dynamically attach to each VM
– Inherit advantages from Slurm: coordination, scalability, security

– Offload VM launch/reclaim to underlying OpenStack framework
– PCI Whitelist to passthrough free VF to VM
– Extend Nova to enable IVSHMEM when launching VM
– Inherit advantage from both OpenStack and Slurm: component optimization, performance
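The file-based lock mentioned in the first group of bullets can be sketched with an atomic exclusive file creation. This is a minimal illustration, not the plugin's implementation: the lock directory, the `vf<N>.lock` naming scheme, and the function names are assumptions.

```python
import os


def allocate_vf(lock_dir, vf_ids):
    """Claim the first free VF by creating its lock file exclusively.

    Returns the VF id that was claimed, or None if all are occupied.
    """
    os.makedirs(lock_dir, exist_ok=True)
    for vf in vf_ids:
        path = os.path.join(lock_dir, f"vf{vf}.lock")
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only
            # one process can succeed in claiming a given VF.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # VF is occupied, try the next one
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))  # record the owner for debugging
        return vf
    return None


def release_vf(lock_dir, vf):
    """Free a VF when its VM is reclaimed."""
    os.remove(os.path.join(lock_dir, f"vf{vf}.lock"))
```

A launcher would call `allocate_vf` before attaching a VF to a new VM and `release_vf` when the VM is reclaimed, so concurrent jobs on a shared host never receive the same VF.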
[Figure: Graph500 with 64 processes across 8 nodes on Chameleon, VM vs. Native. BFS execution time (ms) against problem size (Scale, Edgefactor) for the EASJ, SACJ, and EACJ scenarios; problem sizes shown include 22,10; 22,16; 22,20 and 24,16; 24,20; 26,10. Annotated percentages: 6%, 4%, 9%.]
state.edu/download/mvapich/virt/mvapich2-virt-2.2-1.el7.centos.x86_64.rpm
– MPI bare-metal cluster: https://www.chameleoncloud.org/appliances/29/
– MPI + SR-IOV KVM cluster: https://www.chameleoncloud.org/appliances/28/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/