Achieving Near-Native GPU Performance in the Cloud

John Paul Walters, Project Leader, USC Information Sciences Institute
jwalters@isi.edu
Outline
- Motivation
- ISI’s HPC Cloud Effort
- Background: PCI Passthrough, SR-IOV
- Results
- Conclusion
Motivation
- Scientific workloads demand increasing performance with greater power efficiency
– Architectures have been driven towards specialization and heterogeneity
- Infrastructure-as-a-Service (IaaS) clouds can democratize access to the latest, most powerful accelerators
– If performance goals are met
- Can we provide HPC-class performance in the cloud?
ISI’s HPC Cloud Work
- Cloud computing is traditionally seen as a resource for IT
– Web servers, databases
- More recently, researchers have begun to leverage the public cloud as an HPC resource
– An AWS virtual cluster ranked 101 on the Top500 list
- Major difference between HPC and IT in the cloud:
– Types of resources, heterogeneity
- Our contribution: we're developing heterogeneous HPC extensions for the OpenStack cloud computing platform
OpenStack Background
- OpenStack founded by Rackspace and NASA
- In use by Rackspace, HP, and others for their public clouds
- Open source with hundreds of participating companies
- In use for both public and private clouds
- Current stable release: OpenStack Juno
– OpenStack Kilo to be released in April
[Figure: Google Trends search interest for common open source IaaS projects: OpenStack, CloudStack, OpenNebula, and Eucalyptus]
Accessing GPUs from Virtual Hosts Using API Remoting
[Figure: host-to-device bandwidth (pageable, MB/sec) vs. transfer size, and single-precision real matrix multiply performance (GFlops/sec) for increasing NxM, comparing host, LXC, and gVirtus]

I/O performance is low for gVirtus/KVM, while LXC is much closer to native performance. A larger matrix multiply amortizes the I/O transfer cost, making LXC and native performance indistinguishable.
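To make the trade-off concrete, here is a minimal sketch of the forwarding idea behind API remoting. It illustrates the technique only; it is not gVirtuS's actual wire protocol, and the daemon address is a hypothetical placeholder.

    # Guest-side stub for API remoting: serialize each accelerator API call
    # and forward it to a host-side daemon that replays it on the real GPU.
    # Illustration only -- not gVirtuS's actual protocol.
    import pickle
    import socket

    HOST, PORT = "192.168.122.1", 9999   # hypothetical host-side daemon

    def remote_call(func_name, *args):
        """Forward one API call to the host and return the unpickled result."""
        with socket.create_connection((HOST, PORT)) as sock:
            sock.sendall(pickle.dumps((func_name, args)))
            sock.shutdown(socket.SHUT_WR)        # signal end of request
            data = sock.makefile("rb").read()    # wait for the reply
        return pickle.loads(data)

    # Every call pays a round trip, so small transfer-heavy operations
    # (e.g., host-to-device copies) suffer, while large compute-bound
    # kernels amortize the overhead -- matching the measurements above.
    result = remote_call("cudaMemcpy", b"payload", "HostToDevice")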
Accelerators and Virtualization
- Combine non-virtualized accelerators with virtual hosts
- Results in > 99% efficiency

[Figure: SHOC relative performance for common signal processing kernels (single- and double-precision FFT/IFFT, SGEMM, and DGEMM variants, with and without PCIe transfer) under KVM, Xen, LXC, and VMware]
PCI Passthrough Background
- 1:1 mapping of physical device to virtual machine (see the libvirt sketch below)
- Device remains non-virtualized
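A minimal sketch of how such a device is handed to a guest in practice, assuming KVM managed through the libvirt Python bindings; the PCI address and guest name are hypothetical placeholders.

    # Hedged sketch: attach a physical GPU to a KVM guest via PCI passthrough
    # using the libvirt Python bindings. The PCI address (0000:82:00.0) and
    # guest name ("hpc-vm") are hypothetical.
    import libvirt

    HOSTDEV_XML = """
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x82' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    """

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("hpc-vm")
    # Persist the device in the guest's configuration; the GPU is now owned
    # 1:1 by this VM and is unavailable to the host and to other guests.
    dom.attachDeviceFlags(HOSTDEV_XML, libvirt.VIR_DOMAIN_AFFECT_CONFIG)
    conn.close()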
SR-IOV Background
Image from: http://docs.oracle.com/cd/E23824_01/html/819-3196/figures/sriov-intro.png
- SR-IOV partitions a single physical device into multiple virtual functions
- Virtual functions are almost indistinguishable from physical functions
- Virtual functions are passed to virtual machines using PCI passthrough (see the sysfs sketch below)
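A minimal sketch of creating virtual functions on a Linux host; the PCI address and VF count are hypothetical, and sriov_numvfs is the kernel's generic sysfs knob for this (Mellanox drivers of this era also exposed module parameters for the same purpose).

    # Hedged sketch: create SR-IOV virtual functions through the standard
    # Linux sysfs interface (requires root). The device address and VF
    # count below are hypothetical.
    NUM_VFS = 4
    PCI_ADDR = "0000:06:00.0"   # e.g., a ConnectX-3 HCA (assumed address)

    with open(f"/sys/bus/pci/devices/{PCI_ADDR}/sriov_numvfs", "w") as f:
        f.write(str(NUM_VFS))

    # Each VF now appears as its own PCI function and can be handed to a
    # guest with the same <hostdev> passthrough mechanism shown earlier.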
Multi-GPU with SR-IOV and GPUDirect
- Many real applications extend beyond a single node's capabilities
- Test multi-node performance with InfiniBand SR-IOV and GPUDirect
- 4 Sandy Bridge nodes equipped with K20/K40 GPUs
– ConnectX-3 IB with SR-IOV enabled
– Ported Mellanox OFED 2.1-1 to the 3.13 kernel
– KVM hypervisor
- Test with LAMMPS, OSU Microbenchmarks, and HOOMD (see the launch sketch below)
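As one example of the test workflow, a hedged sketch of launching a GPU-accelerated LAMMPS run across the virtual cluster follows. The hostfile, rank count, binary name, and input deck are assumptions; -sf gpu and -pk gpu are LAMMPS's standard GPU-package flags.

    # Hedged sketch: launch LAMMPS with its GPU package across the 4-node
    # virtual cluster. The hostfile, rank count, binary name ("lmp"), and
    # input deck are hypothetical.
    import subprocess

    subprocess.run([
        "mpirun", "-np", "16", "-hostfile", "vm_hosts",  # 4 VMs x 4 ranks (assumed)
        "lmp", "-sf", "gpu",          # switch styles to their GPU variants
        "-pk", "gpu", "1",            # one GPU per node (assumed)
        "-in", "in.rhodo",            # e.g., the Rhodopsin benchmark input
    ], check=True)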
LAMMPS Rhodopsin with SR-IOV Performance

[Figure: LAMMPS Rhodopsin performance, millions of atom-timesteps per second vs. problem size (32k-512k atoms), for VM and bare-metal (Base) runs in 32c/4g and 4c/4g configurations]
LAMMPS Lennard-Jones with SR-IOV Performance

[Figure: LAMMPS Lennard-Jones performance, millions of atom-timesteps per second vs. problem size (2k-2048k atoms), for VM and bare-metal (Base) runs in 32c/4g and 4c/4g configurations]
LAMMPS Virtualized Performance
- Achieve 96%-99% efficiency (see the efficiency calculation below)
– Performance gap decreases with increasing problem size
- Future work is needed to validate these results across much larger systems
– This work is in the early stages
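For clarity, the efficiency figures are the ratio of virtualized to bare-metal throughput; the sample numbers below are illustrative, not measured values from this work.

    # Hedged sketch: virtualization efficiency as VM throughput divided by
    # bare-metal throughput. The sample rates are illustrative only.
    vm_rate   = 3.32   # millions of atom-timesteps/sec in the VM (assumed)
    base_rate = 3.41   # millions of atom-timesteps/sec on bare metal (assumed)

    efficiency = vm_rate / base_rate
    print(f"virtualization efficiency: {efficiency:.1%}")   # ~97.4%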
GPUDirect Advantage
Image source: http://old.mellanox.com/content/pages.php?pg=products_dyn&product_family=116

- Validate GPUDirect over SR-IOV (see the benchmark sketch below)
– Uses the nvidia_peer_memory-1.0-0 kernel module
- OSU GDR Microbenchmarks
- HOOMD MD
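A hedged sketch of how such a GPUDirect-over-SR-IOV measurement is typically launched; the host names are hypothetical, and the "D D" arguments are the CUDA-aware OSU suite's convention for placing both send and receive buffers in GPU device memory.

    # Hedged sketch: run the CUDA-aware OSU latency benchmark between two
    # GPUDirect-enabled VMs. Host names are hypothetical.
    import subprocess

    subprocess.run([
        "mpirun", "-np", "2", "-host", "vm1,vm2",
        "osu_latency", "D", "D",   # both buffers in GPU device memory
    ], check=True)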
OSU GDR Microbenchmarks: Latency
[Figure: OSU GDR average latency (µs) vs. message size (1 B to 1 MB), native vs. virtualized, with an inset detailing small message sizes]
OSU GDR Microbenchmarks: Bandwidth
[Figure: OSU GDR bandwidth (MB/s) vs. message size (1 B to 4 MB), native vs. virtualized]
GPUDirect-enabled VM Performance
[Figure: HOOMD GPUDirect performance, 256K-particle Lennard-Jones simulation: average timesteps per second vs. number of nodes (1-4), for VM and bare-metal (Base) runs with and without GPUDirect]
Discussion
- Take-away: GDR (GPUDirect RDMA) provides nearly a 10% improvement
- SR-IOV interconnect results in < 2% overhead
- Further work is needed to validate these results on larger systems
– Small-scale results are promising
Future Work
- For full results see:
– J.P. Walters et al., "GPU Passthrough Performance: A Comparison of KVM, Xen, VMWare ESXi, and LXC for CUDA and OpenCL Applications," IEEE Cloud 2014
– A.J. Younge et al., "Supporting High Performance Molecular Dynamics in Virtualized Clusters using IOMMU, SR-IOV, and GPUDirect," to appear in VEE 2015
- Next steps:
– Extend scalability results
– OpenStack integration
- Code: https://github.com/usc-isi/nova
Questions and Comments
- Contact me: jwalters@isi.edu