GaaS Workload Characterization under NUMA Architecture for Virtualized GPU
Huixiang Chen, Meng Wang, Yang Hu, Mingcong Song, Tao Li
Presented by Huixiang Chen
ISPASS 2017, April 24, 2017, Santa Rosa, California
IDEAL (Intelligent Design of Efficient Architectures Lab)
2 / 27
Talk Overview
- 1. Background and Motivation
- 2. Experiment Setup
- 3. Characterizations and Analysis
- 4. DVFS
3 / 27
Graphics-as-a-Service (GaaS)
Cloud gaming, video streaming, virtual desktop infrastructure (VDI)
4 / 27
Graphics-as-a-Service (GaaS)
GPU Virtualization!
5 / 27
GPU Virtualization
- 1. API Intercept
- 2. GPU pass-through
- 3. Shared virtualized GPU
6 / 27
GPU Virtualization
- 1. API intercept: Intel GVT-s, vCUDA
- 2. GPU pass-through: Intel GVT-d, NVIDIA GPU pass-through
- 3. Shared virtualized GPU: AMD FirePro, NVIDIA GRID, Intel GVT-g
7 / 27
NVIDIA GRID GPU Virtualization
[Architecture figure: guest VMs with NVIDIA GRID drivers send requests through a paravirtualized interface to the NVIDIA GRID vGPU Manager, which runs with the NVIDIA kernel driver in the XenServer hypervisor; the physical GPU time-shares its 3D graphics engine, copy engine, and video encoder/decoder among VMs, with per-VM framebuffers (VM1 FB, VM2 FB) and per-VM page tables managed by the GPU MMU; CPU and GPU memory accesses share the memory channel]
8 / 27
GPU NUMA Issue
- Unified Architecture: each CPU socket integrates a GPU (GPU0 / GPU1) next to its cores and last-level cache, with its own memory controller and memory; the two sockets are connected by QPI.
- Discrete Architecture: each CPU socket has its own cores, last-level cache, memory controller, and memory; GPU0 and GPU1 attach to their sockets over PCI Express, and the sockets are connected by QPI.
9 / 27
GPU NUMA Issue
[Figure: two sockets, each with cores and L1/L2 caches, a core interconnect, LL cache, memory controller (uncore), local memory, and a PCIe-attached GPU (GPU A / GPU B), connected by the QPI interconnect]
- Ideal case: the application and its I/O thread run on the socket local to the GPU, so GPU traffic uses local memory accesses.
- Real case: they may be scheduled on the other socket, so GPU traffic becomes remote memory accesses over QPI.
10 / 27
Talk Overview
- 1. Background and Motivation
- 2. Experiment Setup
- 3. Characterizations and Analysis
- 4. DVFS
11 / 27
Experiment Setup
- Platform Configuration
– 4U Supermicro server
– XenServer 7.0
– Intel QPI, 6.4 GT/s
– NVIDIA GRID K2, 8 GB GDDR5, 225 W, PCIe 3.0 x16
GRID K2 (2 physical GPUs) vGPU profiles:
vGPU type   Frame buffer (MB)   Max vGPUs per GPU
K280        4096                1
K260        2048                2
K240        1024                4
K220         512                8
12 / 27
Workload Selection
- Workloads and Metrics
– GaaS workloads: Unigine-Heaven, Unigine-Valley, 3DMark (Return to Proxycon, Firefly Forest, Canyon Flight, Deep Freeze); performance metric: frames per second (FPS)
– GPGPU workloads: Rodinia benchmark suite; performance metric: execution time
- Local mapping:
– The guest VM's vCPUs are statically pinned to the socket local to the GPU.
- Remote mapping:
– The vCPUs are statically pinned to the remote socket. (XenServer automatically sets the memory affinity to follow the CPU affinity.) A bare-metal sketch of this distinction follows below.
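To make the local/remote distinction concrete outside the hypervisor, here is a minimal bare-metal sketch (not the paper's setup) that uses libnuma to bind the issuing thread and the host buffer to a chosen socket before a host-to-device copy; the node numbering (node 0 local to the GPU) and the -lnuma link flag are assumptions about the test machine.

#include <cuda_runtime.h>
#include <numa.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    const int node = (argc > 1) ? atoi(argv[1]) : 0;   // 0 = socket local to the GPU (assumed), 1 = remote
    const size_t bytes = 64ul << 20;                   // 64 MB transfer

    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    numa_run_on_node(node);                            // pin this thread to the chosen socket
    char* host = static_cast<char*>(numa_alloc_onnode(bytes, node));  // host buffer on that socket
    cudaHostRegister(host, bytes, cudaHostRegisterDefault);           // pin the pages for DMA

    void* dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // crosses QPI only when node is remote

    cudaFree(dev);
    cudaHostUnregister(host);
    numa_free(host, bytes);
    return 0;
}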
13 / 27
Talk Overview
- 1. Background and Motivation
- 2. Experiment Setup
- 3. Characterizations and Analysis
- 4. DVFS
14 / 27
NUMA Transfer Bandwidth
[Figures: transfer bandwidth vs. transfer size (1 KB to 64 MB), local vs. remote, for four cases: CPU-to-GPU with pinned memory, GPU-to-CPU with pinned memory, CPU-to-GPU with pageable memory, and GPU-to-CPU with pageable memory]
15 / 27
NUMA Transfer Bandwidth
- Pinned memory:
– About 10% NUMA overhead when writing data to the GPU, and about 20% when reading data back.
- Pageable memory:
– Close to zero NUMA overhead for writing, but about 50% for reading data back from the GPU. (A minimal bandwidth-sweep sketch follows.)
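A minimal sketch of the kind of host-to-device bandwidth sweep behind these curves; run it once pinned to the GPU's socket and once pinned to the remote socket (for example with numactl, assumed available) to obtain the local and remote lines. The buffer sizes, the 100-iteration count, and the pinned-vs-pageable switch are illustrative assumptions, not the paper's exact harness.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (size_t bytes = 1ul << 10; bytes <= (64ul << 20); bytes <<= 1) {   // 1 KB .. 64 MB
        void *host = nullptr, *dev = nullptr;
        cudaMallocHost(&host, bytes);          // pinned host memory; swap for malloc() to test pageable
        cudaMalloc(&dev, bytes);

        cudaEventRecord(start);
        for (int i = 0; i < 100; ++i)
            cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   // HtoD; reverse args for DtoH
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%10zu bytes  %.2f GB/s\n", bytes, 100.0 * bytes / (ms / 1000.0) / 1e9);

        cudaFree(dev);
        cudaFreeHost(host);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}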
16 / 27
NUMA Performance Difference: GPGPU Workloads
- Remarks
– Among the GPGPU workloads, streamcluster, srad_v2, and backprop stand out with the largest NUMA overhead.
– A further breakdown shows that the more time a GPGPU workload spends on CPU-GPU communication, the higher its NUMA overhead.
[Figures: execution time under remote mapping, normalized to local mapping, for the Rodinia workloads (dwt2d, mummergpu, pathfinder, nn, heartwall, gaussian, b+tree, bfs, backprop, srad_v2, streamcluster); and a per-workload breakdown of execution time into Memory, Kernel, and CPU+Other]
- Note: only one VM can be configured on the K2 for CUDA programs. (A minimal sketch of the copy/kernel time breakdown follows.)
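For reference, a hedged sketch of how the Memory/Kernel breakdown can be obtained with CUDA events; the dummy kernel and problem size are placeholders rather than Rodinia code.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummyKernel(float* d, int n) {          // stand-in for a Rodinia kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 24;
    float* h = new float[n]();
    float* d = nullptr;
    cudaMalloc((void**)&d, n * sizeof(float));

    cudaEvent_t e0, e1, e2, e3;
    cudaEventCreate(&e0); cudaEventCreate(&e1); cudaEventCreate(&e2); cudaEventCreate(&e3);

    cudaEventRecord(e0);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   // "Memory" phase (HtoD)
    cudaEventRecord(e1);
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);                   // "Kernel" phase
    cudaEventRecord(e2);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);   // "Memory" phase (DtoH)
    cudaEventRecord(e3);
    cudaEventSynchronize(e3);

    float h2d = 0, kern = 0, d2h = 0;
    cudaEventElapsedTime(&h2d, e0, e1);
    cudaEventElapsedTime(&kern, e1, e2);
    cudaEventElapsedTime(&d2h, e2, e3);
    printf("Memory: %.2f ms  Kernel: %.2f ms\n", h2d + d2h, kern);

    cudaFree(d);
    delete[] h;
    return 0;
}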
17 / 27
NUMA Performance Difference: GaaS Workloads
- GaaS workloads
– Little NUMA overhead exists
[Figures: FPS under local vs. remote mapping for the 3DMark scenes (Firefly Forest, Return to Proxycon, Deep Freeze, Canyon Flight) and for Unigine-Valley and Unigine-Heaven, across the K280, K260, and K240 vGPU profiles with 1, 2, or 4 VMs]
18 / 27
GaaS Overhead Analysis Cont. (1)
- GaaS workloads (3DMark, Unigine-Heaven, Unigine-Valley):
[Timeline figures: (1) the GPU compute queue, showing 3D graphics processing, and (2) the copy queue, showing memory copies between CPU and GPU]
- GPGPU workloads (streamcluster, backprop, srad_v2, heartwall):
[Timeline figures: (1) the GPU compute queue and (2) the copy queue, showing memory copies between CPU and GPU]
19 / 27
GaaS Overhead Analysis Cont. (1)
- 1. For GaaS workloads, most memory copy operations between the CPU and GPU overlap with graphics processing. GPGPU workloads are different: little overlap happens.
- 2. The communication time is trivial compared with the GPU compute time in the graphics queue, which shows how GPU-computation intensive these workloads are. (A minimal stream-overlap sketch follows.)
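A minimal sketch (assumptions: pinned host buffer, two CUDA streams, a stand-in kernel) of the overlap pattern described above: an asynchronous upload in a copy stream proceeds while a kernel runs in another stream, so a slower remote-socket transfer is hidden as long as the compute phase dominates.

#include <cuda_runtime.h>

__global__ void renderLikeKernel(float* d, int n) {     // stand-in for the graphics/compute work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 64; ++k)
            d[i] = d[i] * 1.0001f + k;
}

int main() {
    const int n = 1 << 22;
    float *hostBuf = nullptr, *devCompute = nullptr, *devUpload = nullptr;
    cudaMallocHost((void**)&hostBuf, n * sizeof(float));   // pinned: required for truly async copies
    cudaMalloc((void**)&devCompute, n * sizeof(float));
    cudaMalloc((void**)&devUpload, n * sizeof(float));

    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    for (int frame = 0; frame < 100; ++frame) {
        // "Copy queue": upload the next frame's data while the GPU works on the current one.
        cudaMemcpyAsync(devUpload, hostBuf, n * sizeof(float),
                        cudaMemcpyHostToDevice, copyStream);
        // "GPU compute": dominates the frame time, hiding the (possibly remote) transfer.
        renderLikeKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(devCompute, n);
        cudaDeviceSynchronize();                            // end of frame
    }

    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
    cudaFree(devCompute);
    cudaFree(devUpload);
    cudaFreeHost(hostBuf);
    return 0;
}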
20 / 27
GaaS Overhead Analysis Cont. (2)
- GaaS workloads (3DMark, Unigine-Heaven, Unigine-Valley) and a GPGPU workload (heartwall):
[Zoomed timeline figures of the copy queue and the GPU compute queue, with individual cudaMemcpy(HtoD) and cudaMemcpy(DtoH) operations marked]
21 / 27
GaaS Overhead Analysis Cont. (2)
- GaaS workloads involve more real-time processing than GPGPU workloads. This behavior makes it easier for memory transfers to overlap with GPU computing.
22 / 27
Influence of CPU uncore
- The CPU uncore has little influence on GPU NUMA performance for GaaS workloads.
[Figure: normalized L3 miss rate for VM1-VM4 running Unigine-Valley, Unigine-Heaven, and 3DMark, comparing 4 VMs on the same socket vs. 4 VMs on separate sockets]
23 / 27
Talk Overview
- 1. Background and Motivation
- 2. Experiment Setup
- 3. Characterizations and Analysis
- 4. DVFS
24 / 27
DVFS-CPU
- Remarks:
The ondemand CPU frequency governor achieves the best tradeoff between performance and energy for GaaS. (A minimal governor-switching sketch follows the figure.)
[Figures: FPS for RP, FF, CF, DF, UH, and UV under the performance, ondemand, and powersave governors; and system power (watts) over time for 3DMark, Unigine-Heaven, and Unigine-Valley under each governor]
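For context, a hedged sketch of switching between the three policies on a plain Linux host through the cpufreq sysfs interface; under XenServer the hypervisor owns frequency scaling, so the path and permissions below are assumptions about a stock Linux setup, not the paper's exact mechanism.

#include <filesystem>
#include <fstream>
#include <string>

int main(int argc, char** argv) {
    namespace fs = std::filesystem;
    const std::string governor = (argc > 1) ? argv[1] : "ondemand";  // performance | ondemand | powersave
    for (const auto& entry : fs::directory_iterator("/sys/devices/system/cpu")) {
        fs::path gov = entry.path() / "cpufreq" / "scaling_governor";
        if (!fs::exists(gov))
            continue;                       // skip entries that are not cpuN or lack cpufreq
        std::ofstream out(gov);
        out << governor << "\n";            // needs root privileges
    }
    return 0;
}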
25 / 27
DVFS-GPU
- Remarks:
The GPU memory frequency can be tuned lower within a certain range to save energy with little performance degradation for GaaS. (A minimal NVML clock-setting sketch follows the figure.)
[Figures: FPS for RP, FF, CF, DF, UH, and UV at high vs. low GPU core clock (745 MHz vs. 575 MHz) and at high vs. low GPU memory clock (1250 MHz vs. 750 MHz)]
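A hedged sketch of lowering the GPU memory clock programmatically through NVML (the counterpart of nvidia-smi application clocks); the device index and the 750 MHz memory / 745 MHz core pair mirror the slide's settings, but whether the board accepts exactly this pair is an assumption, so real use should first query nvmlDeviceGetSupportedMemoryClocks. Link with -lnvidia-ml.

#include <nvml.h>
#include <cstdio>

int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "failed to initialize NVML\n");
        return 1;
    }
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);              // first GPU of the K2 board (assumed index)

    // Lower the memory clock while keeping the core clock near its default.
    nvmlReturn_t r = nvmlDeviceSetApplicationsClocks(dev, 750 /* mem MHz */, 745 /* core MHz */);
    if (r != NVML_SUCCESS)
        fprintf(stderr, "set application clocks failed: %s\n", nvmlErrorString(r));

    nvmlShutdown();
    return 0;
}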
26 / 27
Conclusions
- In this work, we characterize GaaS and GPGPU workloads on XenServer with virtualized GPUs. We found no NUMA overhead for GaaS workloads, because most memory copy operations are overlapped with GPU computation.
- GaaS workloads exhibit different behavior from GPGPU workloads.
- The ondemand CPU frequency governor achieves the best tradeoff between performance and energy for GaaS.
- The GPU memory clock can be tuned lower within a certain range to save energy with little performance degradation.