

  1. MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds
  Jie Zhang, Xiaoyi Lu, Mark Arnold and Dhabaleswar K. Panda

  2. Outline
  • Introduction
  • Problem Statement
  • Proposed Design
  • Performance Evaluation

  3. Single Root I/O Virtualization (SR-IOV)
  • Single Root I/O Virtualization (SR-IOV) provides new opportunities to design HPC clouds with very low overhead
    – Allows a single physical device, or Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
    – Each VF can be dedicated to a single VM through PCI passthrough
    – VFs are designed based on the existing non-virtualized PFs, so no driver change is needed (see the sketch below for how VFs are typically created on the host)
  [Figure: SR-IOV architecture - guests with VF drivers, hypervisor with PF driver, I/O MMU and PCI Express, and SR-IOV hardware exposing one Physical Function and multiple Virtual Functions]
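  As a concrete illustration of how a PF is split into VFs, the minimal sketch below uses the standard Linux sysfs SR-IOV interface (sriov_totalvfs / sriov_numvfs) available on recent kernels; the PCI address of the PF and the VF count are hypothetical placeholders, and this is not part of the MVAPICH2 design itself.

    /* Minimal sketch (not part of MVAPICH2): creating SR-IOV Virtual Functions
     * on a Physical Function from the host, via the standard Linux sysfs
     * interface on recent kernels.  The PCI address of the PF and the VF
     * count are hypothetical placeholders. */
    #include <stdio.h>

    #define PF_SYSFS "/sys/bus/pci/devices/0000:03:00.0"   /* hypothetical HCA PF */

    static long read_attr(const char *name)
    {
        char path[256];
        long val = -1;
        snprintf(path, sizeof(path), "%s/%s", PF_SYSFS, name);
        FILE *f = fopen(path, "r");
        if (f) { fscanf(f, "%ld", &val); fclose(f); }
        return val;
    }

    int main(void)
    {
        long total = read_attr("sriov_totalvfs");   /* VFs the hardware can expose */
        printf("PF supports up to %ld VFs\n", total);

        /* Writing to sriov_numvfs asks the PF driver to create that many VFs;
         * each VF can then be handed to one VM through PCI passthrough. */
        char path[256];
        snprintf(path, sizeof(path), "%s/sriov_numvfs", PF_SYSFS);
        FILE *f = fopen(path, "w");
        if (!f) { perror("open sriov_numvfs (needs root)"); return 1; }
        fprintf(f, "%ld\n", total > 8 ? 8L : total);  /* create at most 8 VFs */
        fclose(f);
        return 0;
    }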

  4. Inter-VM Shared Memory (IVShmem)
  • SR-IOV shows near-native performance for inter-node point-to-point communication
  • However, it is NOT VM locality aware
  • IVShmem offers zero-copy access to data in the shared memory of co-resident VMs (see the sketch below)
  [Figure: two guests, each running an MPI process with a PCI VF device driver - communication goes either over the SR-IOV channel through the InfiniBand adapter or over the IVShmem channel through a shared /dev/shm region on the host]
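  To make the zero-copy idea concrete, the following minimal sketch shows how a guest process could map an ivshmem region exposed as a PCI BAR and write into it; the sysfs path, BAR index and region size are hypothetical placeholders, and this illustrates the mechanism rather than MVAPICH2 code.

    /* Minimal sketch (assumed setup, not MVAPICH2 code): zero-copy access to an
     * ivshmem region from inside a guest.  The PCI device path, BAR index and
     * region size below are hypothetical; on a real system they come from the
     * ivshmem device enumeration. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define IVSHMEM_BAR  "/sys/bus/pci/devices/0000:00:05.0/resource2"  /* hypothetical */
    #define IVSHMEM_SIZE (16UL * 1024 * 1024)                           /* assumed size */

    int main(void)
    {
        int fd = open(IVSHMEM_BAR, O_RDWR);
        if (fd < 0) { perror("open ivshmem BAR"); return 1; }

        /* Both co-resident VMs map the same host-side /dev/shm backing file, so
         * a store here is visible to the peer VM with no extra copies and no
         * trip through the hypervisor or the network. */
        void *shm = mmap(NULL, IVSHMEM_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (shm == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        strcpy((char *)shm, "hello from a co-resident VM");  /* peer reads this directly */

        munmap(shm, IVSHMEM_SIZE);
        close(fd);
        return 0;
    }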

  5. Outline
  • Introduction
  • Problem Statement
  • Proposed Design
  • Performance Evaluation

  6. Problem Statement
  • How to design a high-performance MPI library that efficiently takes advantage of SR-IOV and IVShmem to deliver VM-locality-aware communication and optimal performance?
  • How to build an HPC cloud with near-native performance for MPI applications over SR-IOV enabled InfiniBand clusters?
  • How much performance improvement can be achieved by our proposed design on MPI point-to-point operations, collective operations and applications in HPC clouds?
  • How much benefit can the proposed approach with InfiniBand provide compared to Amazon EC2?

  7. Outline
  • Introduction
  • Problem Statement
  • Proposed Design
  • Performance Evaluation

  8. VM Locality Aware MVAPICH2 Design Overview
  [Figure: the MVAPICH2 library running in native and virtualized environments - the MPI and ADI3 layers sit above a Communication Coordinator and a Virtual Machine Aware Locality Detector, which drive the SMP, Network, IVShmem and SR-IOV channels over the shared memory and InfiniBand device APIs on native and virtualized hardware]
  • In the virtualized environment:
    – Support shared-memory channels (SMP, IVShmem) and the SR-IOV channel
    – Locality detection
    – Communication coordination

  9. Virtual Machine Locality Detection
  • Create a VM List structure on the IVShmem region of each host
  • Each MPI process writes its own membership information into the shared VM List structure according to its global rank
  • One byte per process, lock-free, O(N) (a sketch follows below)
  [Figure: four MPI processes (ranks 0, 1, 4, 5) in co-resident VMs on one host, each marking its entry in the VM List (1 1 0 0 1 1 0 0 0 0 0 0) kept in the host's IVShmem /dev/shm region]
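  The following minimal sketch illustrates the lock-free, one-byte-per-rank idea; a plain POSIX shared-memory object named "/vm_list" (a hypothetical stand-in for the IVShmem region that actually spans co-resident VMs) is used, and this is an illustration rather than the MVAPICH2 implementation.

    /* Minimal sketch of the lock-free VM list (illustration only, not the
     * MVAPICH2 implementation).  A POSIX shared-memory object named
     * "/vm_list" (hypothetical) stands in for the host-wide IVShmem region. */
    #include <fcntl.h>
    #include <mpi.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* One byte per global rank, shared by every MPI process on this host. */
        int fd = shm_open("/vm_list", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, size);
        volatile unsigned char *vm_list =
            mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        vm_list[rank] = 1;              /* lock-free: each rank owns its own byte */
        MPI_Barrier(MPI_COMM_WORLD);    /* wait until every rank has written      */

        for (int r = 0; r < size; r++)  /* O(N) scan for co-resident ranks        */
            if (vm_list[r] && r != rank)
                printf("rank %d: rank %d is on the same host\n", rank, r);

        MPI_Barrier(MPI_COMM_WORLD);
        munmap((void *)vm_list, size);
        close(fd);
        MPI_Finalize();
        return 0;
    }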

  10. Communication Coordination
  • Retrieve the VM locality detection information
  • Schedule communication channels based on the VM locality information (see the sketch below)
  • Fast index, light-weight
  [Figure: MPI ranks 1 and 4 in two co-resident guests - each Communication Coordinator consults the shared VM List (1 1 0 0 1 1 0 0) and selects the IVShmem channel for a co-resident peer instead of the SR-IOV channel through the InfiniBand adapter]
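  A minimal sketch of the per-peer channel decision follows; the channel_t enum, the select_channel() helper and the same_vm array are illustrative names introduced here, not MVAPICH2 internals.

    /* Minimal sketch of per-peer channel scheduling from the VM list
     * (illustrative names, not the actual MVAPICH2 internals). */
    #include <stdio.h>

    typedef enum { CHANNEL_SMP, CHANNEL_IVSHMEM, CHANNEL_SRIOV } channel_t;

    /* vm_list[r] == 1 if global rank r runs in a VM on the same physical host;
     * same_vm[r] == 1 if it runs inside the same VM as the caller. */
    static channel_t select_channel(int peer, const unsigned char *vm_list,
                                    const unsigned char *same_vm)
    {
        if (same_vm[peer])
            return CHANNEL_SMP;      /* intra-VM: regular shared-memory channel */
        if (vm_list[peer])
            return CHANNEL_IVSHMEM;  /* co-resident VM: zero-copy via IVShmem   */
        return CHANNEL_SRIOV;        /* remote host: SR-IOV VF over InfiniBand  */
    }

    int main(void)
    {
        /* Example: ranks 0, 1, 4, 5 share our host; ranks 0 and 1 share our VM. */
        const unsigned char vm_list[8] = {1, 1, 0, 0, 1, 1, 0, 0};
        const unsigned char same_vm[8] = {1, 1, 0, 0, 0, 0, 0, 0};
        const char *names[] = {"SMP", "IVShmem", "SR-IOV"};

        for (int peer = 0; peer < 8; peer++)
            printf("peer %d -> %s channel\n", peer,
                   names[select_channel(peer, vm_list, same_vm)]);
        return 0;
    }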

  11. MVAPICH2 with SR-IOV over OpenStack
  • OpenStack is one of the most popular open-source solutions for building a cloud and managing large numbers of virtual machines
  • Deployment with OpenStack
    – Supporting SR-IOV configuration
    – Extending Nova in OpenStack to support IVShmem
    – Virtual Machine Aware design of MVAPICH2 with SR-IOV
  • An efficient approach to build HPC Clouds
  [Figure: OpenStack services - Heat orchestrates the cloud, Horizon provides the UI, Nova provisions VMs, Neutron provides networking, Glance provides VM images, Swift stores images, Cinder provides volumes, Ceilometer monitors, Keystone provides authentication]

  12. Experimental HPC Cloud

  13. Outline
  • Introduction
  • Problem Statement
  • Proposed Design
  • Performance Evaluation

  14. Cloud Testbeds
  • Nowlab Cloud (4 Core/VM and 8 Core/VM instances)
    – Platform: RHEL 6.5 with Qemu+KVM
    – CPU: SandyBridge Intel(R) Xeon E5-2670 (2.6 GHz)
    – RAM: 6 GB (4 Core/VM), 12 GB (8 Core/VM)
    – Interconnect: FDR (56 Gbps) InfiniBand, Mellanox ConnectX-3 with SR-IOV
  • Amazon EC2 (C3.xlarge 4 Core/VM and C3.2xlarge 8 Core/VM instances)
    – Platform: Amazon Linux (EL6) with Xen HVM
    – CPU: IvyBridge Intel(R) Xeon E5-2680 v2 (2.8 GHz)
    – RAM: 7.5 GB (C3.xlarge), 15 GB (C3.2xlarge)
    – Interconnect: 10 GigE with Intel ixgbevf SR-IOV driver

  15. Performance Evaluation
  • Performance of MPI Level Point-to-Point Operations
    – Inter-node MPI Level Two-sided Operations
    – Intra-node MPI Level Two-sided Operations
    – Intra-node MPI Level One-sided Operations
  • Performance of MPI Level Collective Operations
    – Broadcast, Allreduce, Allgather and Alltoall
  • Performance of Typical MPI Benchmarks and Applications
    – NAS and Graph500
  * Amazon EC2 does not currently allow users to explicitly allocate VMs on one physical node. We allocate multiple VMs in one logical group and compare the point-to-point performance for each pair of VMs; the VM pairs with the lowest latency are treated as located within one physical node (Intra-node), the others as Inter-node.

  16. Inter-node MPI Level Two-sided Point-to-Point Performance
  • EC2 C3.xlarge instances
  • Similar performance to SR-IOV-Def
  • Compared to Native, similar overhead as at the basic IB level
  • Compared to EC2, up to 29X and 16X performance speedup on Lat & BW

  17. Intra-node MPI Level Two-sided Point-to-Point Performance
  • EC2 C3.xlarge instances
  • Compared to SR-IOV-Def, up to 84% and 158% performance improvement on Lat & BW
  • Compared to Native, 3%-7% overhead for Lat, 3%-8% overhead for BW
  • Compared to EC2, up to 160X and 28X performance speedup on Lat & BW

  18. Intra-node MPI Level One-sided Put Performance
  • EC2 C3.xlarge instances
  • Compared to SR-IOV-Def, up to 63% and 42% improvement on Lat & BW
  • Compared to EC2, up to 134X and 33X performance speedup on Lat & BW

  19. Intra-node MPI Level One-sided Get Performance
  • EC2 C3.xlarge instances
  • Compared to SR-IOV-Def, up to 70% improvement on both Lat & BW
  • Compared to EC2, up to 121X and 24X performance speedup on Lat & BW

  20. MPI Level Collective Operations Performance (4 cores/VM * 4 VMs)
  • EC2 C3.xlarge instances
  • Compared to SR-IOV-Def, up to 74% and 60% performance improvement on Broadcast & Allreduce
  • Compared to EC2, up to 65X and 22X performance speedup on Broadcast & Allreduce

  21. MPI Level Collective Operations Performance (4 cores/VM * 4 VMs)
  • EC2 C3.xlarge instances
  • Compared to SR-IOV-Def, up to 74% and 81% performance improvement on Allgather & Alltoall
  • Compared to EC2, up to 28X and 45X performance speedup on Allgather & Alltoall

  22. MPI Level Collective Operations Performance (4 cores/VM * 16 VMs)
  • Compared to SR-IOV-Def, up to 41% and 45% performance improvement on Broadcast & Allreduce

  23. MPI Level Collective Operations Performance (4 cores/VM * 16 VMs)
  • Compared to SR-IOV-Def, up to 40% and 39% performance improvement on Allgather & Alltoall

  24. Performance of Typical MPI Benchmarks and Applications (8 cores/VM * 4 VMs)
  • EC2 C3.2xlarge instances
  • Compared to Native, 2%-9% overhead for NAS, around 6% overhead for Graph500
  • Compared to EC2, up to 4.4X (FT) speedup for NAS, up to 12X (20,10) speedup for Graph500

  25. Performance of Typical MPI Benchmarks and Applications (8 cores/VM * 8 VMs)
  • EC2 C3.2xlarge instances
  • Compared to Native, 6%-9% overhead for NAS, around 8% overhead for Graph500
