GaaS Workload Characterization under NUMA Architecture for Virtualized GPU (PowerPoint PPT Presentation)


SLIDE 1

GaaS Workload Characterization under NUMA Architecture for Virtualized GPU

IDEAL (Intelligent Design of Efficient Architectures Laboratory) Department of Electrical and Computer Engineering

University of Florida Presented by Huixiang Chen ISPASS 2017

April 24, 2017, Santa Rosa, California

Huixiang Chen, Meng Wang, Yang Hu, Mingcong Song, Tao Li

SLIDE 2

Talk Overview

  • 1. Background and Motivation
  • 2. Experiment Setup
  • 3. Characterizations and Analysis
  • 4. DVFS

[Figure: dual-socket NUMA system. Each socket holds cores with private L1/L2 caches, a shared LL cache, a memory controller (MC) in the uncore, local memory, and a PCIe-attached GPU (GPU A on one socket, GPU B on the other); the sockets are linked by QPI.]
SLIDE 3

Graphics-as-a-Service (GaaS)

  • Cloud Gaming
  • Video Streaming
  • Virtual Desktop Infrastructure (VDI)

SLIDE 4

Graphics-as-a-Service (GaaS)

GPU Virtualization!

SLIDE 5

GPU Virtualization

  • 1. API Intercept
  • 2. GPU pass-through
  • 3. Shared virtualized GPU
SLIDE 6

GPU Virtualization

  • 1. API intercept: Intel GVT-s, vCUDA
  • 2. GPU pass-through: Intel GVT-d, NVIDIA GPU pass-through
  • 3. Shared virtualized GPU: Intel GVT-g, AMD FirePro, NVIDIA GRID

SLIDE 7

NVIDIA GRID GPU Virtualization

[Figure: NVIDIA GRID vGPU architecture. Guest VMs run applications on top of NVIDIA guest drivers and a paravirtualized interface; the NVIDIA GRID vGPU Manager in the XenServer hypervisor time-shares the GPU's 3D graphics engine, copy engine, and video encoder/decoder among VMs. Each VM has its own framebuffer region (VM1 FB, VM2 FB) and page tables in the GPU MMU, and issues requests through direct GPU access; the streaming engine, NVIDIA kernel driver, and management interface sit in the hypervisor.]

SLIDE 8

GPU NUMA issue

  • Unified Architecture
  • Discrete Architecture

[Figure: Unified Architecture: each CPU socket integrates a GPU on-die, sharing the last-level cache and memory controller; the two sockets are linked by QPI. Discrete Architecture: each CPU socket attaches a discrete GPU over PCI Express through its local memory controller; the sockets are again linked by QPI.]

SLIDE 9

GPU NUMA Issue

[Figure: two-socket system with GPU A attached to socket 0 and GPU B to socket 1, linked by the QPI interconnect. Ideal case: the I/O thread and application run on the socket local to the GPU, so all accesses are local memory accesses. Real case: they may be scheduled on the remote socket, turning GPU transfers into remote memory accesses across QPI.]

SLIDE 10

Talk Overview

  • 1. Background and Motivation
  • 2. Experiment Setup
  • 3. Characterizations and Analysis
  • 4. DVFS

[Figure: dual-socket NUMA diagram, repeated from Slide 2.]
SLIDE 11

Experiment Setup

  • Platform Configuration

– 4U Supermicro server
– XenServer 7.0
– Intel QPI, 6.4 GT/s
– NVIDIA GRID K2, 8 GB GDDR5, 225 W, PCIe 3.0 x16

GRID K2 (2 physical GPUs):

vGPU type | Frame buffer (MB) | Max vGPUs per GPU
K280      | 4096              | 1
K260      | 2048              | 2
K240      | 1024              | 4
K220      | 512               | 8

SLIDE 12

Workload Selection

  • Workloads and Metrics
    – GaaS workloads: Unigine-Heaven, Unigine-Valley, 3DMark (Return to Proxycon, Firefly Forest, Canyon Flight, Deep Freeze); performance metric: frames per second (FPS)
    – GPGPU workloads: Rodinia benchmark suite; performance metric: execution time
  • Local mapping:
    – the guest VM's vCPUs are statically pinned to the socket local to the GPU.
  • Remote mapping:
    – the vCPUs are statically pinned to the remote socket. (XenServer automatically sets memory affinity to match the vCPU affinity.)
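One way the static pinning above might be configured is through the XenServer CLI's vCPU affinity mask. This is a sketch, not the authors' exact commands: the VM UUID is a placeholder, and the core numbering assumes a two-socket box where cores 0-3 sit on the socket local to the GPU and cores 8-11 on the remote socket.

```shell
# Sketch only: pin a guest VM's vCPUs via the XenServer CLI.
# <vm-uuid> is a placeholder; the core numbering is an assumption.

# Local mapping: restrict vCPUs to cores on the GPU-local socket.
xe vm-param-set uuid=<vm-uuid> VCPUs-params:mask=0,1,2,3

# Remote mapping: restrict vCPUs to cores on the other socket.
xe vm-param-set uuid=<vm-uuid> VCPUs-params:mask=8,9,10,11
```

The mask takes effect on the next VM boot; memory affinity then follows the vCPU placement, as noted above.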

SLIDE 13

Talk Overview

  • 1. Background and Motivation
  • 2. Experiment Setup
  • 3. Characterizations and Analysis
  • 4. DVFS

[Figure: dual-socket NUMA diagram, repeated from Slide 2.]
SLIDE 14

NUMA Transfer Bandwidth

CPUGPU, pinned memory

1KB 2KB 4KB 8KB 16KB 32KB 64KB 128KB 256KB 512KB 1MB 2MB 4MB 8MB 16MB 32MB 64MB

2 4 6 8 10 12

LocalHtoD RemoteHtoD

Transfer size Bandwidth (MB/s)

1KB 2KB 4KB 8KB 16KB 32KB 64KB 128KB 256KB 512KB 1MB 2MB 4MB 8MB 16MB 32MB 64MB

2 4 6 8 10 12

LocalDtoH RemoteDtoH

Transfer size Bandwidth (MB/s)

1 K B 2 K B 4 K B 8 K B 1 6 K B 3 2 K B 6 4 K B 1 2 8 K B 2 5 6 K B 5 1 2 K B 1 M B 2 M B 4 M B 8 M B 1 6 M B 3 2 M B 6 4 M B

2 4 6 8 LocalDtoD RemoteHtoD Transfer size Bandwidth (MB/s)

1KB 2KB 4KB 8KB 16KB 32KB 64KB 128KB 256KB 512KB 1MB 2MB 4MB 8MB 16MB 32MB 64MB

2 4 6 8 LocalDtoH RemoteDtoH Transfer size Bandwidth (MB/s)

GPUCPU, pinned memory CPUGPU, pageable memory GPUCPU, pageable memory

SLIDE 15

NUMA Transfer Bandwidth

  • Pinned memory:
    – 10% NUMA overhead for writing data to the GPU; 20% for reading data back from the GPU.
  • Pageable memory:
    – close to zero NUMA overhead for writing; 50% for reading data back from the GPU.
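The overhead percentages are simply the relative drop from local to remote bandwidth. With illustrative numbers (not the measured values, which are in the figure), the computation is:

```shell
# NUMA overhead = (local - remote) / local.
# The bandwidth values below are illustrative, not measured.
awk 'BEGIN { loc = 6.0; rem = 5.4; printf "%.0f%%\n", 100 * (loc - rem) / loc }'
# prints 10%
```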

SLIDE 16

NUMA Performance Difference-GPGPU Workloads

  • Remarks
    – Among the GPGPU workloads, streamcluster, srad_v2, and backprop stand out.
    – A further breakdown shows that the more time a GPGPU workload spends on CPU-GPU communication, the higher its NUMA overhead.

[Figure: normalized execution time (local vs. remote) for dwt2d, mummergpu, pathfinder, nn, heartwall, gaussian, b+tree, bfs, backprop, srad_v2, and streamcluster, ranging roughly from 0.8 to 1.2; plus a per-workload breakdown of execution time into Memory, Kernel, and CPU+Other.]

  • Note: only 1 VM can be configured on the K2 for CUDA programs.
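A breakdown like the one above can be gathered with the CUDA command-line profiler of that era. A sketch only: the binary path and input size for the Rodinia backprop run are illustrative.

```shell
# Sketch: per-workload kernel vs. memcpy time for a Rodinia benchmark.
# The "./backprop 65536" invocation is illustrative.
nvprof --print-gpu-summary ./backprop 65536

# Full timeline, useful for the overlap analysis on the later slides:
nvprof --print-gpu-trace ./backprop 65536
```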

SLIDE 17

NUMA Performance Difference-GaaS Workloads

  • GaaS workloads
    – Little NUMA overhead exists.

[Figure: FPS under local vs. remote mapping for Unigine-Heaven, Unigine-Valley, and the four 3DMark scenes (Return to Proxycon, Firefly Forest, Canyon Flight, Deep Freeze), across the K280 (1 VM), K260 (1-2 VMs), and K240 (1-4 VMs) configurations.]

SLIDE 18

GaaS Overhead Analysis Cont. (1)

  • GaaS workloads (3DMark, Unigine-Heaven, Unigine-Valley):

[Figure: profiler timelines showing, for each GaaS workload, the GPU compute queue (3D graphics processing) and the copy queue (memory copies between CPU and GPU).]

  • GPGPU workloads (streamcluster, backprop, srad_v2, heartwall):

[Figure: profiler timelines showing, for each GPGPU workload, the GPU compute queue and the copy queue.]
SLIDE 19

GaaS Overhead Analysis Cont. (1)

  • 1. For GaaS workloads, most memory copy operations between CPU and GPU overlap with graphics processing operations. GPGPU workloads are different: little overlap happens.
  • 2. The communication time is trivial compared to the GPU computation in the graphics queue, which clearly shows that GaaS workloads are GPU-computation intensive.

SLIDE 20

GaaS Overhead Analysis Cont. (2)

  • GaaS workloads (3DMark, Unigine-Heaven, Unigine-Valley) vs. a GPGPU workload (heartwall):

[Figure: zoomed profiler timelines of the copy queue and GPU compute queue for each workload, with cudaMemcpy(HtoD) and cudaMemcpy(DtoH) operations marked.]
SLIDE 21

GaaS Overhead Analysis Cont. (2)

  • GaaS workloads involve more real-time processing than GPGPU workloads. This behavior makes it easier to overlap memory transfers with GPU computation.
SLIDE 22

Influence of CPU uncore

  • The CPU uncore has little performance influence on GPU NUMA for GaaS.

[Figure: normalized L3 miss rate per VM (VM1-VM4) for Unigine-Valley, Unigine-Heaven, and 3DMark, comparing 4 VMs on the same socket vs. 4 VMs on separate sockets.]

SLIDE 23

Talk Overview

  • 1. Background and Motivation
  • 2. Experiment Setup
  • 3. Characterizations and Analysis
  • 4. DVFS

[Figure: dual-socket NUMA diagram, repeated from Slide 2.]
SLIDE 24

DVFS-CPU

  • Remarks:

Ondemand CPU frequency scaling achieves the best tradeoff between performance and energy for GaaS.

[Figure: FPS for RP, FF, CF, DF, UH, and UV under the Performance, Ondemand, and Powersave governors; plus system power traces over time (roughly 540 to 660 W) for 3DMark, Unigine-Heaven, and Unigine-Valley under each governor.]
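On Linux, the governors compared above can be selected through the cpufreq sysfs interface. A sketch, assuming root access and the acpi-cpufreq driver (the intel_pstate driver does not expose an "ondemand" governor):

```shell
# Sketch: select the CPU frequency governor on every core.
# Assumes root and the acpi-cpufreq driver.
for gov_file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo ondemand > "$gov_file"    # or "performance" / "powersave"
done

# Verify the active governor on core 0:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```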

SLIDE 25

DVFS-GPU

  • Remarks:

The GPU memory frequency can be lowered within a certain range to save energy with little performance degradation for GaaS.

[Figure: FPS for RP, FF, CF, DF, UH, and UV at high vs. low GPU core clock (745 MHz vs. 575 MHz) and at high vs. low GPU memory clock (1250 MHz vs. 750 MHz).]

SLIDE 26

Conclusions

  • In this work, we conduct a characterization study of virtualized GPUs on XenServer. We found no NUMA overhead for GaaS workloads, because most memory copy operations are overlapped with GPU computation.
  • GaaS workloads exhibit different behavior from GPGPU workloads.
  • Ondemand CPU frequency scaling achieves the best tradeoff between performance and energy for GaaS.
  • The GPU memory clock can be lowered within a certain range to save energy for GaaS.

SLIDE 27

Thanks For Your Attention!