 
GaaS Workload Characterization under NUMA Architecture for Virtualized GPU. Huixiang Chen, Meng Wang, Yang Hu, Mingcong Song, Tao Li. Presented by Huixiang Chen. ISPASS 2017, April 24, 2017, Santa Rosa, California. IDEAL (Intelligent Design of Efficient Architectures Laboratory), Department of Electrical and Computer Engineering, University of Florida.
Talk Overview 1. Background and Motivation 2. Experiment Setup 3. Characterizations and Analysis 4. DVFS 2 / 27
Graphics-as-a-Service (GaaS) • Virtual Desktop (VDI) • Cloud Gaming • Video Streaming 3 / 27
Graphics-as-a-Service (GaaS) GPU Virtualization! 4 / 27
GPU Virtualization • 1. API Intercept • 2. GPU pass-through • 3. Shared virtualized GPU 5 / 27
GPU Virtualization • 1. API intercept: vCUDA, Intel GVT-s • 2. GPU pass-through: Intel GVT-d, NVIDIA GPU pass-through • 3. Virtualized GPU: Intel GVT-g, AMD FirePro, NVIDIA GRID 6 / 27
NVIDIA GRID GPU Virtualization [Figure: NVIDIA GRID vGPU architecture on the XenServer hypervisor. Guest VMs run applications on top of an NVIDIA driver; the NVIDIA GRID vGPU Manager and NVIDIA kernel driver handle management-interface requests from the VMs, while guest drivers have direct GPU access through channels. On the NVIDIA GPU: timeshared scheduling of the 3D graphics engine, copy engine, video encoder, and video decoder; a GPU MMU holding per-VM page tables (VM1, VM2); GPU memory holding per-VM framebuffers (VM1 FB, VM2 FB).] 7 / 27
GPU NUMA Issue • Unified Architecture • Discrete Architecture [Figure: dual-socket block diagrams of the two architectures. Each socket has CPU cores, caches, a last-level cache, and memory controllers with attached memory; Socket 0 and Socket 1 are linked by QPI. In the discrete architecture, GPU0 and GPU1 attach over PCI Express.] 8 / 27
GPU NUMA Issue [Figure: two-socket system linked by QPI. Each socket has cores with L1/L2 caches, an uncore with core interconnect, LL cache, memory controller (MC), and PCIe, plus its own memory and one GPU (GPU A, GPU B). An app and its I/O thread are shown in two cases: ideal (local memory access) and real case (remote memory access).] 9 / 27
Talk Overview 1. Background and Motivation 2. Experiment Setup 3. Characterizations and Analysis 4. DVFS 10 / 27
Experiment Setup • Platform Configuration – 4U Supermicro Server – XenServer 7.0 – Intel QPI, 6.4 GT/s – NVIDIA GRID K2, 8GB GDDR5, 225W, PCIe 3.0 x16
Card      Physical GPUs   vGPU type   Frame buffer (MB)   Maximum vGPUs per GPU
GRID K2   2               K280        4096                1
                          K260        2048                2
                          K240        1024                4
                          K220        512                 8
11 / 27
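The vGPU table above implies that every profile splits a physical GPU's frame buffer evenly. The following is a minimal sketch of that arithmetic, using only the values from the table; the dictionary and function names are illustrative and not part of any NVIDIA tool.

```python
# vGPU profiles from the table: name -> (frame buffer in MB, max vGPUs per physical GPU)
VGPU_PROFILES = {
    "K280": (4096, 1),
    "K260": (2048, 2),
    "K240": (1024, 4),
    "K220": (512, 8),
}

# The table lists two physical GPUs for the GRID K2.
PHYSICAL_GPUS_PER_BOARD = 2


def board_capacity(profile: str) -> dict:
    """Return how many vGPUs of one profile fit on a board and the
    frame buffer they consume on each physical GPU."""
    fb_mb, per_gpu = VGPU_PROFILES[profile]
    return {
        "max_vgpus_per_board": per_gpu * PHYSICAL_GPUS_PER_BOARD,
        "fb_used_per_gpu_mb": fb_mb * per_gpu,
    }


if __name__ == "__main__":
    for name in VGPU_PROFILES:
        print(name, board_capacity(name))
```

Each profile accounts for the same 4096 MB per physical GPU, which matches the 8GB board total divided across two GPUs.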
Workload Selection • Workloads and Metrics – GaaS workloads: Unigine-Heaven, Unigine-Valley, 3DMark (Return to Proxycon, Firefly Forest, Canyon Flight, Deep Freeze) – Performance metric: frames per second (FPS) – GPGPU workloads: Rodinia benchmark – Performance metric: execution time • Local mapping: – the guest VM's vCPUs are statically pinned to the local socket, the one closest to the GPU. • Remote mapping: – the vCPUs are statically pinned to the remote socket. (XenServer automatically places the VM's memory close to its CPU affinity.) 12 / 27
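As a rough illustration of the two mappings, the sketch below picks the set of physical CPUs to pin a VM's vCPUs to, given a socket-to-CPU map. The topology, the GPU's home socket, and all names are hypothetical; a real deployment would read them from the hypervisor rather than hard-code them.

```python
from typing import Dict, List

# Hypothetical two-socket topology: socket id -> physical CPU ids.
TOPOLOGY: Dict[int, List[int]] = {
    0: [0, 1, 2, 3, 4, 5, 6, 7],
    1: [8, 9, 10, 11, 12, 13, 14, 15],
}


def pin_set(topology: Dict[int, List[int]], gpu_socket: int,
            mapping: str, n_vcpus: int) -> List[int]:
    """Return the physical CPUs to pin n_vcpus to.

    mapping == "local":  CPUs on the socket the GPU is attached to.
    mapping == "remote": CPUs on another socket.
    """
    if mapping == "local":
        socket = gpu_socket
    elif mapping == "remote":
        others = [s for s in sorted(topology) if s != gpu_socket]
        if not others:
            raise ValueError("no remote socket available")
        socket = others[0]
    else:
        raise ValueError(f"unknown mapping: {mapping!r}")

    cpus = topology[socket]
    if n_vcpus > len(cpus):
        raise ValueError("not enough CPUs on the chosen socket")
    return cpus[:n_vcpus]


if __name__ == "__main__":
    print("local :", pin_set(TOPOLOGY, gpu_socket=0, mapping="local", n_vcpus=4))
    print("remote:", pin_set(TOPOLOGY, gpu_socket=0, mapping="remote", n_vcpus=4))
```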
Talk Overview 1. Background and Motivation 2. Experiment Setup 3. Characterizations and Analysis 4. DVFS 13 / 27
NUMA Transfer Bandwidth [Figure: four panels plotting bandwidth (MB/s) against transfer size (1KB to 64MB) for local and remote mappings. Panels: CPU → GPU with pageable memory; CPU → GPU with pinned memory; GPU → CPU with pageable memory; GPU → CPU with pinned memory. Series: LocalHtoD and RemoteHtoD for CPU → GPU, LocalDtoH and RemoteDtoH for GPU → CPU.] 14 / 27
NUMA Transfer Bandwidth • Pinned memory: – 10% NUMA overhead when writing data to the GPU, 20% when reading data back from the GPU • Pageable memory: – close to zero NUMA overhead when writing, 50% when reading data back from the GPU 15 / 27
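The percentages above can be read as the relative bandwidth loss of the remote mapping. The sketch below shows that calculation, assuming overhead is defined as (local - remote) / local. The talk does not state the formula, and the sample bandwidths are placeholders rather than measured values.

```python
def numa_overhead(local_bw: float, remote_bw: float) -> float:
    """Fraction of bandwidth lost when the transfer crosses sockets.

    Assumed definition: (local - remote) / local. Units cancel, so any
    consistent bandwidth unit works.
    """
    if local_bw <= 0:
        raise ValueError("local bandwidth must be positive")
    return (local_bw - remote_bw) / local_bw


if __name__ == "__main__":
    # Placeholder bandwidths, chosen only to illustrate the arithmetic.
    samples = {
        "pinned HtoD": (10.0, 9.0),
        "pinned DtoH": (10.0, 8.0),
    }
    for label, (local_bw, remote_bw) in samples.items():
        print(f"{label}: {numa_overhead(local_bw, remote_bw):.0%} overhead")
```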
NUMA Performance Difference: GPGPU Workloads • Note: only 1 VM can be configured using K2 for CUDA programs. • Remarks – Among the GPGPU workloads, streamcluster, srad_v2, and backprop stand out. – Further breakdown shows that for GPGPU workloads, the more time is spent on CPU-GPU communication, the higher the NUMA overhead. [Figure: top, normalized execution time under local and remote mapping for each benchmark; bottom, execution-time breakdown (Memory, Kernel, CPU+Other, 0% to 100%) for streamcluster, srad_v2, backprop, bfs, b+tree, gaussian, heartwall, pathfinder, mummergpu, nn, dwt2d.] 16 / 27
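One way to see why communication-heavy benchmarks suffer most is a first-order model in which only the CPU-GPU transfer portion of the runtime slows down under remote mapping. This model is an assumption added for illustration, not something stated in the talk, and the fractions and slowdown factor below are hypothetical.

```python
def normalized_remote_time(memory_fraction: float, transfer_slowdown: float) -> float:
    """Remote execution time divided by local execution time.

    memory_fraction:   share of local runtime spent in CPU-GPU transfers (0..1).
    transfer_slowdown: factor by which transfers take longer when remote (>= 1).
    Kernel and CPU time are assumed unaffected by the mapping.
    """
    if not 0.0 <= memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be within [0, 1]")
    if transfer_slowdown < 1.0:
        raise ValueError("transfer_slowdown must be >= 1")
    return (1.0 - memory_fraction) + memory_fraction * transfer_slowdown


if __name__ == "__main__":
    # Hypothetical transfer shares; the slowdown factor is also illustrative.
    for frac in (0.05, 0.25, 0.50):
        ratio = normalized_remote_time(frac, transfer_slowdown=1.25)
        print(f"transfer share {frac:.0%}: remote/local = {ratio:.3f}")
```

Under this model the overhead grows linearly with the transfer share, which is consistent with the qualitative remark on the slide.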
NUMA Performance Difference: GaaS Workloads [Figure: FPS under local and remote mapping for 3DMark (Return to Proxycon, Firefly Forest, Canyon Flight, Deep Freeze), Unigine-Heaven, and Unigine-Valley, across vGPU configurations K240, K260, and K280 with 1 to 4 VMs.] • GaaS workloads – Little NUMA overhead exists 17 / 27
GaaS Overhead Analysis Cont. (1) [Figure: GPU timelines with two rows per workload, 1. GPU compute and 2. Copy queue (memory copy between CPU and GPU). • GPGPU workloads: streamcluster, backprop, srad_v2, heartwall (GPU compute) • GaaS workloads: 3DMark, Unigine-Heaven, Unigine-Valley (3D graphics processing)] 18 / 27
GaaS Overhead Analysis Cont. (1) • 1. For GaaS workloads, most memory copy operations between CPU and GPU overlap with graphics processing operations. GPGPU workloads differ: little overlap happens. • 2. The communication time is trivial compared to GPU computing in the graphics queue, which clearly shows that GaaS workloads are GPU-computation intensive. 19 / 27
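The effect of overlap on NUMA sensitivity can be sketched with an idealized timing model: when copies overlap fully with compute, a step takes the longer of the two; when they serialize, it takes their sum. Perfect overlap and all the millisecond values are simplifying assumptions for illustration, not measurements from the talk.

```python
def step_time(compute_ms: float, copy_ms: float, overlapped: bool) -> float:
    """Time for one step (a frame or a kernel launch plus its transfers)."""
    if overlapped:
        return max(compute_ms, copy_ms)  # copy hidden behind compute
    return compute_ms + copy_ms          # copy and compute serialize


def numa_slowdown(compute_ms: float, copy_ms: float,
                  copy_slowdown: float, overlapped: bool) -> float:
    """Remote/local step-time ratio when only the copy gets slower."""
    local = step_time(compute_ms, copy_ms, overlapped)
    remote = step_time(compute_ms, copy_ms * copy_slowdown, overlapped)
    return remote / local


if __name__ == "__main__":
    # Hypothetical step: 20 ms of GPU work, 2 ms of copies, copies 1.5x slower remotely.
    print("overlapped :", numa_slowdown(20.0, 2.0, 1.5, overlapped=True))
    print("serialized :", numa_slowdown(20.0, 2.0, 1.5, overlapped=False))
```

In the overlapped case the slower remote copy still fits behind the compute time, so the ratio stays at 1; in the serialized case the extra copy time adds directly to the step.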
GaaS Overhead Analysis Cont. (2) [Figure: GPU timelines (GPU compute and copy queue) for 3DMark, Unigine-Heaven, Unigine-Valley, and heartwall, with cudaMemcpy (HtoD) and cudaMemcpy (DtoH) operations marked on the copy queue.] • GaaS workloads 20 / 27
GaaS Overhead Analysis Cont. (2) • GaaS workloads involve more real-time processing than GPGPU workloads. This behavior makes it easier for memory transfers to overlap with GPU computing. 21 / 27
Influence of CPU uncore [Figure: normalized L3 miss rate for VM1 to VM4 running 3DMark, Unigine-Heaven, and Unigine-Valley, comparing 4 VMs on the same socket with 4 VMs on separate sockets.] • CPU uncore has little performance influence on GPU NUMA for GaaS 22 / 27
Talk Overview 1. Background and Motivation 2. Experiment Setup 3. Characterizations and Analysis 4. DVFS 23 / 27