Slide 1

Achieving Near-Native GPU Performance in the Cloud

John Paul Walters
Project Leader, USC Information Sciences Institute
jwalters@isi.edu

Slide 2

Outline

  • Motivation
  • ISI’s HPC Cloud Effort
  • Background: PCI Passthrough, SR-IOV
  • Results
  • Conclusion

Slide 3

Motivation

  • Scientific workloads demand increasing performance with greater power efficiency
    – Architectures have been driven towards specialization and heterogeneity
  • Infrastructure-as-a-Service (IaaS) clouds can democratize access to the latest, most powerful accelerators
    – If performance goals are met
  • Can we provide HPC-class performance in the cloud?

Slide 4

ISI’s HPC Cloud Work

  • Cloud computing is traditionally seen as a resource for IT
    – Web servers, databases
  • More recently, researchers have begun to leverage the public cloud as an HPC resource
    – An AWS virtual cluster is ranked 101 on the Top500 list
  • Major difference between HPC and IT in the cloud:
    – Types of resources, heterogeneity
  • Our contribution: we’re developing the heterogeneous HPC extensions for the OpenStack cloud computing platform

Slide 5

OpenStack Background

  • OpenStack founded by Rackspace and NASA
  • In use by Rackspace, HP, and others for their public clouds
  • Open source with hundreds of participating companies
  • In use for both public and private clouds
  • Current stable release: OpenStack Juno
    – OpenStack Kilo to be released in April

[Chart: Google Trends searches for common open source IaaS projects (OpenStack, CloudStack, OpenNebula, Eucalyptus)]

Slide 6

Accessing GPUs from Virtual Hosts Using API Remoting

[Chart: Host-to-device bandwidth, pageable memory (MB/sec vs. transfer size in bytes) for Host, LXC, and gVirtus]
[Chart: Matrix multiply for increasing NxM, single-precision real (GFlops/sec vs. size NxM) for Host, gVirtus, and LXC]

I/O performance is low for gVirtus/KVM, while LXC is much closer to native performance. A larger matrix multiply amortizes the I/O transfer cost, making LXC and native performance indistinguishable.
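
The amortization effect can be seen with a simple back-of-envelope model (not from the slides): for a square N x N multiply, transfer volume grows as N^2 while compute grows as N^3, so the I/O share of total time shrinks as N grows. A minimal sketch, assuming hypothetical bandwidth and throughput figures:

```python
"""Back-of-envelope model of I/O amortization for an N x N matrix multiply.

The bandwidth and GPU throughput below are hypothetical placeholders, not
measurements from the slides; the point is only the scaling behaviour.
"""
BYTES_PER_FLOAT = 4
PCIE_BW = 4e9          # bytes/sec over the (virtualized) transfer path, assumed
GPU_FLOPS = 1e12       # sustained single-precision flop/s, assumed

for n in (256, 1024, 4096, 16384):
    transfer_s = 3 * n * n * BYTES_PER_FLOAT / PCIE_BW   # copy A and B in, C out
    compute_s = 2 * n ** 3 / GPU_FLOPS                   # ~2*N^3 flops for GEMM
    io_fraction = transfer_s / (transfer_s + compute_s)
    print(f"N={n:6d}  I/O share of total time: {io_fraction:5.1%}")
```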

Slide 7

Accelerators and Virtualization

  • Combine non-virtualized accelerators with virtual hosts
  • Results in > 99% efficiency

[Chart: SHOC performance for common signal processing kernels; relative performance of KVM, Xen, LXC, and VMware for single- and double-precision FFT/IFFT and SGEMM/DGEMM kernels, with and without PCIe transfers]

Slide 8

PCI Passthrough Background

  • 1:1 mapping of a physical device to a virtual machine (see the sketch below)
  • Device remains non-virtualized
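
On a Linux/KVM host, passthrough is granted at the granularity of IOMMU groups, so a common first step is to see which devices share a group with the GPU. A minimal sketch (not from the slides), assuming a host booted with the IOMMU enabled:

```python
#!/usr/bin/env python3
"""Minimal sketch: list IOMMU groups and the PCI devices in each.

PCI passthrough assigns an entire IOMMU group to a guest, so a GPU that
shares a group with other devices may drag them along. Assumes a Linux
host booted with the IOMMU enabled (intel_iommu=on or amd_iommu=on).
"""
import os

IOMMU_ROOT = "/sys/kernel/iommu_groups"

for group in sorted(os.listdir(IOMMU_ROOT), key=int):
    devices = sorted(os.listdir(os.path.join(IOMMU_ROOT, group, "devices")))
    print(f"IOMMU group {group}: {' '.join(devices)}")
```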

Slide 9

SR-IOV Background

Image from: http://docs.oracle.com/cd/E23824_01/html/819-3196/figures/sriov-intro.png

  • SR-IOV partitions a single physical device into multiple virtual functions (see the sketch below)
  • Virtual functions are almost indistinguishable from physical functions
  • Virtual functions are passed to virtual machines using PCI passthrough
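
Virtual functions are typically created through the standard sriov_numvfs sysfs attribute and then handed to guests like any other PCI device. A minimal sketch (not from the slides); the PCI address and VF count are hypothetical placeholders:

```python
#!/usr/bin/env python3
"""Minimal sketch: enable SR-IOV virtual functions on a PCI device.

Uses the standard sriov_numvfs sysfs attribute. The PCI address below is a
hypothetical placeholder, not taken from the slides. Requires root and an
SR-IOV-capable device/driver (e.g. a ConnectX-3 HCA with SR-IOV enabled
in firmware).
"""
from pathlib import Path

PF = Path("/sys/bus/pci/devices/0000:82:00.0")   # hypothetical physical function

total = int((PF / "sriov_totalvfs").read_text())
print(f"device supports up to {total} virtual functions")

# Reset to 0 first (required before changing a nonzero VF count), then enable 4 VFs.
(PF / "sriov_numvfs").write_text("0")
(PF / "sriov_numvfs").write_text("4")

# Each VF appears as a 'virtfnN' symlink and can be assigned to a VM
# with ordinary PCI passthrough.
for vf in sorted(PF.glob("virtfn*")):
    print(vf.name, "->", vf.resolve().name)
```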

Slide 10

Multi-GPU with SR-IOV and GPUDirect

  • Many real applications extend beyond a single node’s capabilities
  • Test multi-node performance with InfiniBand SR-IOV and GPUDirect
  • 4 Sandy Bridge nodes equipped with K20/K40 GPUs
    – ConnectX-3 IB with SR-IOV enabled
    – Ported Mellanox OFED 2.1-1 to the 3.13 kernel
    – KVM hypervisor
  • Test with LAMMPS, OSU Microbenchmarks, and HOOMD

Slide 11

LAMMPS Rhodopsin with SR-IOV Performance

[Chart: LAMMPS Rhodopsin performance (millions of atom-timesteps per second vs. problem size, 32k to 512k) for VM 32c/4g, VM 4c/4g, Base 32c/4g, and Base 4c/4g configurations]
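
The throughput metric on the y-axis, millions of atom-timesteps per second, is the number of atoms multiplied by the timesteps completed, divided by wall-clock time. A small helper makes the unit concrete; the run parameters below are hypothetical, not data from the slides:

```python
def atom_timesteps_per_sec(num_atoms: int, num_timesteps: int, wall_seconds: float) -> float:
    """LAMMPS throughput: atoms simulated times timesteps advanced, per second of wall time."""
    return num_atoms * num_timesteps / wall_seconds

# Hypothetical example, not data from the slides: 512k atoms, 1000 steps, 180 s wall time.
rate = atom_timesteps_per_sec(512_000, 1_000, 180.0)
print(f"{rate / 1e6:.2f} million atom-timesteps/sec")
```

The virtualized efficiency quoted later (Slide 13) is this rate for the VM divided by the corresponding bare-metal (Base) rate.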

Slide 12

LAMMPS Lennard-Jones with SR-IOV Performance

[Chart: LAMMPS Lennard-Jones performance (millions of atom-timesteps per second vs. problem size, 2k to 2048k) for VM 32c/4g, VM 4c/4g, Base 32c/4g, and Base 4c/4g configurations]

Slide 13

LAMMPS Virtualized Performance

  • Achieve 96% to 99% efficiency relative to native execution
    – Performance gap decreases with increasing problem size
  • Future work is needed to validate results across much larger systems
    – This work is in the early stages

Slide 14

GPUDirect Advantage

Image source: http://old.mellanox.com/content/pages.php?pg=products_dyn &product_family=116

  • Validate GPUDirect over SR-IOV (see the sketch below)
    – Uses the nvidia_peer_memory-1.0-0 kernel module
  • OSU GDR Microbenchmarks
  • HOOMD MD
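
A quick way to confirm the GPUDirect RDMA prerequisite inside the guest is to check that the peer-memory kernel module is loaded. A minimal sketch (not from the slides); the module name nv_peer_mem is the one typically registered by the nvidia_peer_memory package and is an assumption here:

```python
#!/usr/bin/env python3
"""Minimal check: is the GPUDirect RDMA peer-memory module loaded?

Assumes the nvidia_peer_memory package registers its module as 'nv_peer_mem';
adjust MODULE if your build names it differently.
"""
MODULE = "nv_peer_mem"   # assumed module name from the nvidia_peer_memory package

with open("/proc/modules") as f:
    loaded = {line.split()[0] for line in f}

if MODULE in loaded:
    print(f"{MODULE} is loaded: GPUDirect RDMA transfers should be possible")
else:
    print(f"{MODULE} is not loaded: transfers will be staged through host memory")
```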

Slide 15

OSU GDR Microbenchmarks: Latency

[Chart: Average latency (us) vs. message size (1 byte to 1 MB), native vs. virtualized, with an inset for small messages]

Slide 16

OSU GDR Microbenchmarks: Bandwidth

[Chart: Bandwidth (MB/s) vs. message size (1 byte to 4 MB), native vs. virtualized]

Slide 17

GPUDirect-enabled VM Performance

[Chart: HOOMD GPUDirect performance, 256K-particle Lennard-Jones simulation; average timesteps per second vs. number of nodes (1 to 4) for VM GPUDirect, VM No GPUDirect, Base GPUDirect, and Base No GPUDirect]

Slide 18

Discussion

  • Take-away: GDR (GPUDirect RDMA) provides nearly a 10% improvement
  • SR-IOV interconnect results in < 2% overhead
  • Further work is needed to validate these results in larger systems
    – Small-scale results are promising

Slide 19

Future Work

  • For full results see:
    – J.P. Walters, et al. "GPU Passthrough Performance: A Comparison of KVM, Xen, VMWare ESXi, and LXC for CUDA and OpenCL Applications," IEEE Cloud 2014
    – A.J. Younge, et al. "Supporting High Performance Molecular Dynamics in Virtualized Clusters using IOMMU, SR-IOV, and GPUDirect," to appear in VEE 2015
  • Next steps:
    – Extend scalability results
    – OpenStack integration
  • Code: https://github.com/usc-isi/nova

Slide 20

Questions and Comments

  • Contact me:
    – jwalters@isi.edu
    – www.isi.edu/people/jwalters/