Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters



SLIDE 1

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters

Jie Zhang

Dr. Dhabaleswar K. Panda (Advisor)

Department of Computer Science & Engineering, The Ohio State University

SLIDE 2

SC 2017 Doctoral Showcase, Network Based Computing Laboratory

Outline

  • Introduction
  • Problem Statement
  • Detailed Designs and Results
  • Impact on HPC Community
  • Conclusion
SLIDE 3

Cloud Computing and Virtualization

  • Cloud computing focuses on maximizing the effectiveness of shared resources
  • Virtualization is the key technology behind cloud computing
  • Widely adopted in industry computing environments
  • IDC forecasts that worldwide spending on public IT cloud services will reach $195 billion by 2020

(Courtesy: http://www.idc.com/getdoc.jsp?containerId=prUS41669516)

[Figure labels: Virtualization, Cloud Computing]

SLIDE 4

Drivers of Modern HPC Cluster and Cloud Architecture

  • Multi-/many-core technologies
  • Accelerators (GPUs/co-processors)
  • Large memory nodes (up to 2 TB)
  • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
  • Single Root I/O Virtualization (SR-IOV)
  • High-performance interconnects: InfiniBand with SR-IOV (<1 usec latency, 200 Gbps bandwidth)

[Figure: these building blocks feeding into cloud systems; examples shown: SDSC Comet, TACC Stampede]

SLIDE 5

Single Root I/O Virtualization (SR-IOV)

SR-IOV provides new opportunities to design HPC clouds with very low overhead by bypassing the hypervisor.

  • Allows a single physical device, or Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
  • VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
  • Each VF can be dedicated to a single VM through PCI pass-through

[Figure: SR-IOV architecture. Guests 1-3 each run a guest OS with a VF driver; the hypervisor holds the PF driver; an I/O MMU sits above the SR-IOV hardware on PCI Express, which exposes three Virtual Functions and one Physical Function]
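The PF-to-VF relationship described on this slide can be modeled in a few lines. This is a minimal sketch for illustration, not a driver interface: the class names `PhysicalFunction` and `VirtualFunction`, the PCI address and the guest names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VirtualFunction:
    index: int
    assigned_vm: Optional[str] = None  # a VF is dedicated to at most one VM


@dataclass
class PhysicalFunction:
    pci_address: str  # hypothetical address, for illustration only
    num_vfs: int
    vfs: List[VirtualFunction] = field(default_factory=list)

    def __post_init__(self) -> None:
        # One physical device presents itself as several virtual devices.
        self.vfs = [VirtualFunction(i) for i in range(self.num_vfs)]

    def attach(self, vm_name: str) -> VirtualFunction:
        """Dedicate the first free VF to vm_name (models PCI pass-through)."""
        for vf in self.vfs:
            if vf.assigned_vm is None:
                vf.assigned_vm = vm_name
                return vf
        raise RuntimeError("no free virtual function left")

    def detach(self, vm_name: str) -> None:
        """Release every VF held by vm_name."""
        for vf in self.vfs:
            if vf.assigned_vm == vm_name:
                vf.assigned_vm = None


pf = PhysicalFunction(pci_address="0000:03:00.0", num_vfs=3)
vf_a = pf.attach("guest1")
vf_b = pf.attach("guest2")
```

The point of the model is the exclusivity rule: once a VF is handed to a guest, no other guest can receive it until it is detached.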

SLIDE 6

Does it suffice to build an efficient HPC cloud with only SR-IOV? No.

  • No support for locality-aware communication: co-located VMs still have to use the SR-IOV channel
  • No support for VM migration, because of device passthrough
  • No proper management and isolation of critical virtualized resources
SLIDE 7

Problem Statements

  • Can the MPI runtime be redesigned to provide virtualization support for VMs/containers when building HPC clouds?
  • How much benefit can be achieved on HPC clouds with a redesigned MPI runtime for scientific kernels and applications?
  • Can fault tolerance/resilience (live migration) be supported on SR-IOV-enabled HPC clouds?
  • Can we co-design with resource management and scheduling systems to enable HPC clouds on modern HPC systems?

SLIDE 8

Research Framework

SLIDE 9

MVAPICH2 Project

  • High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for virtualization (MVAPICH2-Virt), available since 2015
– Support for energy-awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand network analysis and monitoring (OSU INAM), since 2015
– Used by more than 2,825 organizations in 85 countries
– More than 432,000 (> 0.4 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Jul '17 ranking)

  • 1st ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
  • 15th ranked 241,108-core cluster (Pleiades) at NASA
  • 20th ranked 522,080-core cluster (Stampede) at TACC
  • 44th ranked 74,520-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others

– Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu

SLIDE 10

Locality-aware MPI Communication with SR-IOV and IVShmem

[Figure: two MPI library stacks. Native: Application, MPI Layer, ADI3 Layer, SMP Channel and Network Channel, Communication Device APIs (Shared Memory, InfiniBand API), Native Hardware. Virtual machine aware: Application, MPI Layer, ADI3 Layer, Communication Coordinator and Locality Detector, SMP Channel, IVShmem Channel and SR-IOV Channel, Communication Device APIs (Shared Memory, InfiniBand API), Virtualized Hardware]

  • MPI library running in native and virtualization environments
  • In virtualized environment

– Support shared-memory channels (SMP, IVShmem) and SR-IOV channel
– Locality detection
– Communication coordination
– Communication optimizations on different channels (SMP, IVShmem, SR-IOV; RC, UD)

  • J. Zhang, X. Lu, J. Jose and D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, The International Conference on High Performance Computing (HiPC'14), Dec 2014
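The idea of locality detection followed by communication coordination can be sketched as a simple lookup. The mapping below (same VM to SMP, co-located VMs to IVShmem, everything else to SR-IOV) is an assumption drawn from the channel names on this slide; `Endpoint` and `select_channel` are hypothetical names, not part of the MVAPICH2-Virt API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    rank: int
    host: str  # physical node name (hypothetical)
    vm: str    # VM name on that host (hypothetical)


def select_channel(src: Endpoint, dst: Endpoint) -> str:
    """Pick a channel name based on where the two MPI processes run."""
    if src.host == dst.host and src.vm == dst.vm:
        return "SMP"       # same VM: ordinary shared memory
    if src.host == dst.host:
        return "IVShmem"   # co-located VMs: inter-VM shared memory
    return "SR-IOV"        # different hosts: go through the virtual function


p0 = Endpoint(0, "node1", "vm1")
p1 = Endpoint(1, "node1", "vm1")
p2 = Endpoint(2, "node1", "vm2")
p3 = Endpoint(3, "node2", "vm1")
```

Without the locality check, the pair `p0`/`p2` would fall through to the SR-IOV channel even though both VMs share a host, which is the limitation the design targets.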

SLIDE 11

Application Performance (NAS & P3DFFT)

  • Proposed design delivers up to 43% (IS) improvement for NAS
  • Proposed design brings 29%, 33%, 29% and 20% improvement for INVERSE, RAND, SINE and SPEC

[Figure: two bar charts comparing SR-IOV and Proposed on 32 VMs (8 VMs per node). NAS Class B (FT, LU, CG, MG, IS): execution time (s). P3DFFT 512*512*512 (SPEC, INVERSE, RAND, SINE): execution time (s)]

SLIDE 12

SR-IOV-enabled VM Migration Support on HPC Clouds

SLIDE 13

High Performance SR-IOV enabled VM Migration Framework for MPI Applications

[Figure: migration framework. Two hosts, each running guest VMs (VM1, VM2) with MPI, VF / SR-IOV and IVShmem devices on top of a hypervisor with an IB adapter and an Ethernet adapter. Components shown: Controller, Network Suspend Trigger, Ready-to-Migrate Detector, Network Reactive Notifier, Migration Trigger, Migration Done Detector]

  • Two challenges

– 1. Detach/re-attach virtualized devices
– 2. Maintain IB connection

  • Challenge 1: multiple parallel libraries have to coordinate with the VM during migration (detach/re-attach SR-IOV/IVShmem, migrate VMs, migration status)
  • Challenge 2: the MPI runtime handles IB connection suspending and reactivating
  • Propose Progress Engine (PE) based and Migration Thread (MT) based designs to optimize VM migration and MPI application performance
  • J. Zhang, X. Lu, D. K. Panda. High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters. IPDPS, 2017
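The ordering constraints behind the two challenges can be made concrete with a small simulation. This is a sketch under stated assumptions: the step order follows the component names on this slide (suspend, ready-to-migrate, detach, migrate, re-attach, reactivate), while `MpiProcess`, `migrate_vm` and the host and VM names are hypothetical, and no real hypervisor or InfiniBand call is made.

```python
from typing import List


class MpiProcess:
    """Stand-in for an MPI rank whose IB channel can be suspended."""

    def __init__(self, rank: int) -> None:
        self.rank = rank
        self.channel_active = True

    def suspend_channel(self) -> None:
        self.channel_active = False

    def reactivate_channel(self) -> None:
        self.channel_active = True


def migrate_vm(vm: str, src: str, dst: str, procs: List[MpiProcess]) -> List[str]:
    """Return the ordered log of steps for one migration (no real I/O)."""
    log = []
    # Pre-migration: ask every rank to suspend its IB connection.
    for p in procs:
        p.suspend_channel()
    log.append("suspend channels")
    # Ready-to-migrate: proceed only once all channels are quiet.
    if any(p.channel_active for p in procs):
        raise RuntimeError("a channel is still active")
    log.append("ready to migrate")
    # Migration: passthrough devices must be detached before the VM moves.
    log.append(f"detach SR-IOV VF and IVShmem from {vm} on {src}")
    log.append(f"migrate {vm} from {src} to {dst}")
    log.append(f"re-attach SR-IOV VF and IVShmem to {vm} on {dst}")
    # Post-migration: notify ranks so they re-establish IB connections.
    for p in procs:
        p.reactivate_channel()
    log.append("reactivate channels")
    return log


ranks = [MpiProcess(r) for r in range(4)]
steps = migrate_vm("vm1", "hostA", "hostB", ranks)
```

The sketch is sequential on purpose; the PE and MT designs named above differ in where the suspend and reactivate steps run relative to the application's computation, which this model does not capture.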

SLIDE 14

Proposed Design of MPI Runtime

[Figure: timelines for process P0 and the Controller under four scenarios: No-migration, Progress Engine based, Migration-thread based (typical scenario) and Migration-thread based (worst scenario). Phases: Pre-migration, Ready-to-Migrate, Migration, Post-migration. Legend: Computation (Comp), MPI Call, Suspend Channel (S), Reactivate Channel (R), Control Msg, Lock/Unlock Communication, Migration Signal Detection, Down-Time for VM Live Migration]

SLIDE 15

Application Performance

  • 8 VMs in total; 1 VM is migrated while the application is running
  • Compared with NM (no migration), MT-worst and PE incur some overhead
  • MT-typical allows migration to be completely overlapped with computation

[Figure: execution time (s) for PE, MT-worst, MT-typical and NM. NAS: LU.B, EP.B, MG.B, CG.B, FT.B. Graph500: (20,10), (20,16), (20,20), (22,10)]

SLIDE 16

High Performance MPI Communication for Nested Virtualization

[Figure: a node with two NUMA domains (NUMA 0, NUMA 1) connected by QPI, each with a memory controller; cores 0-15; VM 0 and VM 1 hosting Containers 0-3. The Two-Layer Locality Detector (Container Locality Detector, VM Locality Detector, Nested Locality Combiner) feeds the Two-Layer NUMA Aware Communication Coordinator, which uses the CMA channel, the Shared Memory (SHM) channel and the Network (HCA) channel]

  • Two-Layer Locality Detector: dynamically detects MPI processes in the co-resident containers inside one VM as well as the ones in co-resident VMs
  • Two-Layer NUMA Aware Communication Coordinator: leverages nested locality info, NUMA architecture info and message attributes to select the appropriate communication channel
  • J. Zhang, X. Lu and D. K. Panda, Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, The 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17), April 2017
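Two-layer detection amounts to comparing placements at the host, VM and container levels and combining the results into one label. The sketch below assumes each process can report such a placement; `Placement`, `nested_locality` and the label strings are hypothetical and only illustrate what a nested locality combiner might produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    host: str
    vm: str
    container: str


def nested_locality(a: Placement, b: Placement) -> str:
    """Combine VM-level and container-level locality into one label."""
    if a.host != b.host:
        return "remote"                 # different physical nodes
    if a.vm != b.vm:
        return "co-resident-vm"         # same host, different VMs
    if a.container != b.container:
        return "co-resident-container"  # same VM, different containers
    return "same-container"


a = Placement("node1", "vm0", "c0")
b = Placement("node1", "vm0", "c1")
c = Placement("node1", "vm1", "c2")
d = Placement("node2", "vm0", "c0")
```

A single-layer detector would stop at the container comparison and miss that `a` and `c` still share a host.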

SLIDE 17

Two-Layer NUMA Aware Communication Coordinator

[Figure: the coordinator consists of the NUMA Loader, Nested Locality Loader, Message Parser and Communication Coordinator. It takes input from the Two-Layer Locality Detector and drives the CMA channel, the Shared Memory (SHM) channel and the Network (HCA) channel]

  • Nested Locality Loader reads locality info of the destination process from the Two-Layer Locality Detector
  • NUMA Loader reads VM/container placement info to determine the NUMA node on which the destination process is pinned
  • Message Parser obtains message attributes, e.g., message type and message size
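How the three inputs might combine into a channel choice is sketched below. The slides do not state the actual policy, so everything here is an assumption made for illustration: the 16 KB cutoff, the halving across NUMA nodes, the locality labels and the function name `coordinate` are all hypothetical.

```python
# Hypothetical switch point; the slides do not give real values.
LARGE_MESSAGE_BYTES = 16 * 1024


def coordinate(locality: str, same_numa: bool, msg_size: int) -> str:
    """Return "CMA", "SHM" or "HCA" for one message (illustrative policy)."""
    if locality == "remote":
        return "HCA"  # different hosts: only the network channel applies
    # Assumed: crossing NUMA nodes makes copies costlier, so halve the cutoff.
    cutoff = LARGE_MESSAGE_BYTES if same_numa else LARGE_MESSAGE_BYTES // 2
    large = msg_size >= cutoff
    if locality == "same-vm":
        # Assumed: containers in one VM share a kernel, so CMA is usable.
        return "CMA" if large else "SHM"
    # Co-resident VMs (assumed): shared memory for small messages,
    # network channel for large ones.
    return "HCA" if large else "SHM"
```

For example, `coordinate("same-vm", True, 64 * 1024)` selects CMA under these assumptions, while the same message between co-resident VMs goes to the HCA channel.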
SLIDE 18

Applications Performance

  • 256 processes across 64 containers on 16 nodes
  • Compared with Default, the enhanced-hybrid design reduces execution time by up to 16% (28,16) for Graph500 and 10% (LU) for NAS
  • Compared with the 1Layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) performance benefit

[Figure: Default, 1Layer and 2Layer-Enhanced-Hybrid compared. NAS Class D (IS, MG, EP, FT, CG, LU): execution time (s). Graph500 ((22,20), (24,16), (24,20), (24,24), (26,16), (26,20), (26,24), (28,16)): BFS execution time (s)]

SLIDE 19

Typical Usage Scenarios

[Figure: compute nodes hosting VMs that run MPI jobs under three scenarios]

  • Exclusive Allocations Sequential Jobs (EASJ)
  • Exclusive Allocations Concurrent Jobs (EACJ)
  • Shared-host Allocations Concurrent Jobs (SACJ)

SLIDE 20

Slurm-V Architecture Overview

[Figure: a job is submitted to SLURMctld with an sbatch file and a VM configuration file. The physical resource request yields a physical node list; SLURMd on each physical node launches VMs through libvirtd. Each VM (VM1, VM2) has a VF and an IVSHMEM device and runs MPI. An image pool is kept on Lustre]

VM Launching/Reclaiming

  • 1. SR-IOV virtual function
  • 2. IVSHMEM device
  • 3. Network setting
  • 4. Image management
  • 5. Launching VMs and checking availability
  • 6. Mounting global storage, etc.

SLIDE 21

Alternative Designs of Slurm-V

  • Slurm SPANK plugin based design

– Utilize SPANK plugin to read VM configuration and launch/reclaim VMs
– File-based lock to detect occupied VFs and exclusively allocate free VFs
– Assign a unique ID to each IVSHMEM device and dynamically attach it to each VM
– Inherit advantages from Slurm: coordination, scalability, security

  • Slurm SPANK plugin over OpenStack based design

– Offload VM launch/reclaim to the underlying OpenStack framework
– PCI whitelist to pass through free VFs to VMs
– Extend Nova to enable IVSHMEM when launching VMs
– Inherit advantages from both OpenStack and Slurm: component optimization, performance
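The file-based lock mentioned for the SPANK plugin design can be illustrated with exclusive file creation. This is a minimal sketch assuming one lock file per VF in a directory on a local filesystem; the file naming scheme, `allocate_free_vf`, `release_vf` and the job names are hypothetical and not taken from Slurm-V.

```python
import os
import tempfile
from typing import Optional


def allocate_free_vf(lock_dir: str, num_vfs: int, owner: str) -> Optional[int]:
    """Claim the first VF whose lock file does not exist yet."""
    for vf in range(num_vfs):
        path = os.path.join(lock_dir, f"vf{vf}.lock")
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so on a
            # local filesystem two launchers cannot claim the same VF.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # VF is occupied, try the next one
        with os.fdopen(fd, "w") as f:
            f.write(owner)  # record who holds the VF
        return vf
    return None  # every VF is in use


def release_vf(lock_dir: str, vf: int) -> None:
    """Free a VF by removing its lock file (e.g. when the VM is reclaimed)."""
    os.remove(os.path.join(lock_dir, f"vf{vf}.lock"))


lock_dir = tempfile.mkdtemp()
first = allocate_free_vf(lock_dir, 2, "job-101")
second = allocate_free_vf(lock_dir, 2, "job-102")
third = allocate_free_vf(lock_dir, 2, "job-103")  # None: both VFs are taken
```

Exclusive creation is used here instead of a check-then-create sequence because the latter would leave a window in which two jobs see the same VF as free.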

SLIDE 22

Applications Performance

Graph500 with 64 procs across 8 nodes on Chameleon

  • 32 VMs across 8 nodes, 6 cores/VM
  • EASJ: compared to Native, less than 4% overhead
  • SACJ, EACJ: less than 9% overhead when running NAS as a concurrent job with 64 procs

[Figure: three bar charts (EASJ, SACJ, EACJ) comparing VM and Native BFS execution time (ms) by problem size (Scale, Edgefactor). Problem sizes shown: (22,10), (22,16), (22,20) and (24,16), (24,20), (26,10). Annotated overheads: 6%, 4%, 9%]

SLIDE 23

Impact on HPC and Cloud Communities

  • Designs available through the MVAPICH2-Virt library: http://mvapich.cse.ohio-state.edu/download/mvapich/virt/mvapich2-virt-2.2-1.el7.centos.x86_64.rpm
  • Complex appliances available on Chameleon Cloud

– MPI bare-metal cluster: https://www.chameleoncloud.org/appliances/29/
– MPI + SR-IOV KVM cluster: https://www.chameleoncloud.org/appliances/28/

  • Enables users to easily and quickly deploy HPC clouds and run jobs with high performance
  • Enables administrators to efficiently manage and schedule cluster resources
SLIDE 24

Conclusion

  • Addresses key issues in building efficient HPC clouds
  • Optimizes MPI communication on various HPC clouds
  • Presents designs of live migration to provide fault tolerance on HPC clouds
  • Presents co-designs with resource management and scheduling systems
  • Demonstrates the corresponding benefits on modern HPC clusters
  • Broader outreach through MVAPICH2-Virt public releases and complex appliances on the Chameleon Cloud testbed

SLIDE 25

Thank You! & Questions?

zhang.2794@osu.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/