GaaS Workload Characterization under NUMA Architecture for Virtualized GPU
Huixiang Chen, Meng Wang, Yang Hu, Mingcong Song, Tao Li
Presented by Huixiang Chen
ISPASS 2017, April 24, 2017, Santa Rosa, California
IDEAL (Intelligent Design of Efficient Architectures Lab)
2 / 27
Talk Overview
- 1. Background and Motivation
- 2. Experiment Setup
- 3. Characterizations and Analysis
- 4. DVFS
3 / 27
Graphics-as-a-Service (GaaS)
Cloud gaming, video streaming, virtual desktop infrastructure (VDI)
4 / 27
Graphics-as-a-Service (GaaS)
GPU Virtualization!
5 / 27
GPU Virtualization
- 1. API Intercept
- 2. GPU pass-through
- 3. Shared virtualized GPU
6 / 27
GPU Virtualization
- 1. API intercept: Intel GVT-s, vCUDA
- 2. GPU pass-through: Intel GVT-d, NVIDIA GPU pass-through
- 3. Shared virtualized GPU: AMD FirePro, NVIDIA GRID, Intel GVT-g
7 / 27
NVIDIA GRID GPU Virtualization
[Architecture figure: guest VMs with NVIDIA GRID drivers send requests through a paravirtualized interface to the NVIDIA GRID vGPU Manager, which runs with the NVIDIA kernel driver in the XenServer hypervisor; the physical GPU time-shares its 3D graphics engine, copy engine, and video encoder/decoder among VMs, with per-VM framebuffers (VM1 FB, VM2 FB) and per-VM page tables managed by the GPU MMU; CPU and GPU memory accesses share the memory channel]
8 / 27
GPU NUMA Issue
- Unified Architecture: each CPU socket integrates a GPU (GPU0 / GPU1) next to its cores and last-level cache, with its own memory controller and memory; the two sockets are connected by QPI.
- Discrete Architecture: each CPU socket has its own cores, last-level cache, memory controller, and memory; GPU0 and GPU1 attach to their sockets over PCI Express, and the sockets are connected by QPI.
9 / 27
GPU NUMA Issue
[Figure: two sockets, each with cores and L1/L2 caches, a core interconnect, LL cache, memory controller (uncore), local memory, and a PCIe-attached GPU (GPU A / GPU B), connected by the QPI interconnect]
- Ideal case: the application and its I/O thread run on the socket local to the GPU, so GPU traffic uses local memory accesses.
- Real case: they may be scheduled on the other socket, so GPU traffic becomes remote memory accesses over QPI.
10 / 27
Talk Overview
- 1. Background and Motivation
- 2. Experiment Setup
- 3. Characterizations and Analysis
- 4. DVFS
11 / 27
Experiment Setup
- Platform Configuration
– 4U Supermicro server
– XenServer 7.0
– Intel QPI, 6.4 GT/s
– NVIDIA GRID K2, 8 GB GDDR5, 225 W, PCIe 3.0 x16
GRID K2 (2 physical GPUs) vGPU profiles:
vGPU type   Frame buffer (MB)   Max vGPUs per GPU
K280        4096                1
K260        2048                2
K240        1024                4
K220         512                8
12 / 27
Workload Selection
- Workloads and Metrics
– GaaS workloads: Unigine-Heaven, Unigine-Valley, 3DMark (Return to Proxycon, Firefly Forest, Canyon Flight, Deep Freeze); performance metric: frames per second (FPS)
– GPGPU workloads: Rodinia benchmark suite; performance metric: execution time
- Local mapping:
– The guest VM's vCPUs are statically pinned to the socket local to the GPU.
- Remote mapping:
– The vCPUs are statically pinned to the remote socket. (XenServer automatically sets the memory affinity to follow the CPU affinity.) A bare-metal sketch of this distinction follows below.
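To make the local/remote distinction concrete outside the hypervisor, here is a minimal bare-metal sketch (not the paper's setup) that uses libnuma to bind the issuing thread and the host buffer to a chosen socket before a host-to-device copy; the node numbering (node 0 local to the GPU) and the -lnuma link flag are assumptions about the test machine.

#include <cuda_runtime.h>
#include <numa.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    const int node = (argc > 1) ? atoi(argv[1]) : 0;   // 0 = socket local to the GPU (assumed), 1 = remote
    const size_t bytes = 64ul << 20;                   // 64 MB transfer

    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    numa_run_on_node(node);                            // pin this thread to the chosen socket
    char* host = static_cast<char*>(numa_alloc_onnode(bytes, node));  // host buffer on that socket
    cudaHostRegister(host, bytes, cudaHostRegisterDefault);           // pin the pages for DMA

    void* dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // crosses QPI only when node is remote

    cudaFree(dev);
    cudaHostUnregister(host);
    numa_free(host, bytes);
    return 0;
}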
13 / 27
Talk Overview
- 1. Background and Motivation
- 2. Experiment Setup
- 3. Characterizations and Analysis
- 4. DVFS
14 / 27
NUMA Transfer Bandwidth
[Figures: transfer bandwidth vs. transfer size (1 KB to 64 MB), local vs. remote, for four cases: CPU-to-GPU with pinned memory, GPU-to-CPU with pinned memory, CPU-to-GPU with pageable memory, and GPU-to-CPU with pageable memory]
15 / 27
NUMA Transfer Bandwidth
- Pinned memory:
– About 10% NUMA overhead when writing data to the GPU, and about 20% when reading data back.
- Pageable memory:
– Close to zero NUMA overhead for writing, but about 50% for reading data back from the GPU. (A minimal bandwidth-sweep sketch follows.)
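A minimal sketch of the kind of host-to-device bandwidth sweep behind these curves; run it once pinned to the GPU's socket and once pinned to the remote socket (for example with numactl, assumed available) to obtain the local and remote lines. The buffer sizes, the 100-iteration count, and the pinned-vs-pageable switch are illustrative assumptions, not the paper's exact harness.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (size_t bytes = 1ul << 10; bytes <= (64ul << 20); bytes <<= 1) {   // 1 KB .. 64 MB
        void *host = nullptr, *dev = nullptr;
        cudaMallocHost(&host, bytes);          // pinned host memory; swap for malloc() to test pageable
        cudaMalloc(&dev, bytes);

        cudaEventRecord(start);
        for (int i = 0; i < 100; ++i)
            cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   // HtoD; reverse args for DtoH
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%10zu bytes  %.2f GB/s\n", bytes, 100.0 * bytes / (ms / 1000.0) / 1e9);

        cudaFree(dev);
        cudaFreeHost(host);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}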
16 / 27
NUMA Performance Difference: GPGPU Workloads
- Remarks
– Among the GPGPU workloads, streamcluster, srad_v2, and backprop stand out with the largest NUMA overhead.
– A further breakdown shows that the more time a GPGPU workload spends on CPU-GPU communication, the higher its NUMA overhead.
[Figures: execution time under remote mapping, normalized to local mapping, for the Rodinia workloads (dwt2d, mummergpu, pathfinder, nn, heartwall, gaussian, b+tree, bfs, backprop, srad_v2, streamcluster); and a per-workload breakdown of execution time into Memory, Kernel, and CPU+Other]
- Note: only one VM can be configured on the K2 for CUDA programs. (A minimal sketch of the copy/kernel time breakdown follows.)
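For reference, a hedged sketch of how the Memory/Kernel breakdown can be obtained with CUDA events; the dummy kernel and problem size are placeholders rather than Rodinia code.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummyKernel(float* d, int n) {          // stand-in for a Rodinia kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 24;
    float* h = new float[n]();
    float* d = nullptr;
    cudaMalloc((void**)&d, n * sizeof(float));

    cudaEvent_t e0, e1, e2, e3;
    cudaEventCreate(&e0); cudaEventCreate(&e1); cudaEventCreate(&e2); cudaEventCreate(&e3);

    cudaEventRecord(e0);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   // "Memory" phase (HtoD)
    cudaEventRecord(e1);
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);                   // "Kernel" phase
    cudaEventRecord(e2);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);   // "Memory" phase (DtoH)
    cudaEventRecord(e3);
    cudaEventSynchronize(e3);

    float h2d = 0, kern = 0, d2h = 0;
    cudaEventElapsedTime(&h2d, e0, e1);
    cudaEventElapsedTime(&kern, e1, e2);
    cudaEventElapsedTime(&d2h, e2, e3);
    printf("Memory: %.2f ms  Kernel: %.2f ms\n", h2d + d2h, kern);

    cudaFree(d);
    delete[] h;
    return 0;
}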
17 / 27
NUMA Performance Difference: GaaS Workloads
- GaaS workloads
– Little NUMA overhead exists
[Figures: FPS under local vs. remote mapping for the 3DMark scenes (Firefly Forest, Return to Proxycon, Deep Freeze, Canyon Flight) and for Unigine-Valley and Unigine-Heaven, across the K280, K260, and K240 vGPU profiles with 1, 2, or 4 VMs]
18 / 27
GaaS Overhead Analysis Cont. (1)
- GaaS workloads (3DMark, Unigine-Heaven, Unigine-Valley):
[Timeline figures: (1) the GPU compute queue, showing 3D graphics processing, and (2) the copy queue, showing memory copies between CPU and GPU]
- GPGPU workloads (streamcluster, backprop, srad_v2, heartwall):
[Timeline figures: (1) the GPU compute queue and (2) the copy queue, showing memory copies between CPU and GPU]
19 / 27
GaaS Overhead Analysis Cont. (1)
- 1. For GaaS workloads, most memory copy operations between the CPU and GPU overlap with graphics processing. GPGPU workloads are different: little overlap happens.
- 2. The communication time is trivial compared with the GPU compute time in the graphics queue, which shows how GPU-computation intensive these workloads are. (A minimal stream-overlap sketch follows.)
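A minimal sketch (assumptions: pinned host buffer, two CUDA streams, a stand-in kernel) of the overlap pattern described above: an asynchronous upload in a copy stream proceeds while a kernel runs in another stream, so a slower remote-socket transfer is hidden as long as the compute phase dominates.

#include <cuda_runtime.h>

__global__ void renderLikeKernel(float* d, int n) {     // stand-in for the graphics/compute work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 64; ++k)
            d[i] = d[i] * 1.0001f + k;
}

int main() {
    const int n = 1 << 22;
    float *hostBuf = nullptr, *devCompute = nullptr, *devUpload = nullptr;
    cudaMallocHost((void**)&hostBuf, n * sizeof(float));   // pinned: required for truly async copies
    cudaMalloc((void**)&devCompute, n * sizeof(float));
    cudaMalloc((void**)&devUpload, n * sizeof(float));

    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    for (int frame = 0; frame < 100; ++frame) {
        // "Copy queue": upload the next frame's data while the GPU works on the current one.
        cudaMemcpyAsync(devUpload, hostBuf, n * sizeof(float),
                        cudaMemcpyHostToDevice, copyStream);
        // "GPU compute": dominates the frame time, hiding the (possibly remote) transfer.
        renderLikeKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(devCompute, n);
        cudaDeviceSynchronize();                            // end of frame
    }

    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
    cudaFree(devCompute);
    cudaFree(devUpload);
    cudaFreeHost(hostBuf);
    return 0;
}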
20 / 27
GaaS Overhead Analysis Cont. (2)
- GaaS workloads (3DMark, Unigine-Heaven, Unigine-Valley) and a GPGPU workload (heartwall):
[Zoomed timeline figures of the copy queue and the GPU compute queue, with individual cudaMemcpy(HtoD) and cudaMemcpy(DtoH) operations marked]
21 / 27
GaaS Overhead Analysis Cont. (2)
- GaaS workloads involve more real-time processing than GPGPU workloads. This behavior makes it easier for memory transfers to overlap with GPU computing.
22 / 27
Influence of CPU uncore
- The CPU uncore has little influence on GPU NUMA performance for GaaS workloads.
[Figure: normalized L3 miss rate for VM1-VM4 running Unigine-Valley, Unigine-Heaven, and 3DMark, comparing 4 VMs on the same socket vs. 4 VMs on separate sockets]
23 / 27
Talk Overview
- 1. Background and Motivation
- 2. Experiment Setup
- 3. Characterizations and Analysis
- 4. DVFS
24 / 27
DVFS-CPU
- Remarks:
The ondemand CPU frequency governor achieves the best tradeoff between performance and energy for GaaS. (A minimal governor-switching sketch follows the figure.)
[Figures: FPS for RP, FF, CF, DF, UH, and UV under the performance, ondemand, and powersave governors; and system power (watts) over time for 3DMark, Unigine-Heaven, and Unigine-Valley under each governor]
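For context, a hedged sketch of switching between the three policies on a plain Linux host through the cpufreq sysfs interface; under XenServer the hypervisor owns frequency scaling, so the path and permissions below are assumptions about a stock Linux setup, not the paper's exact mechanism.

#include <filesystem>
#include <fstream>
#include <string>

int main(int argc, char** argv) {
    namespace fs = std::filesystem;
    const std::string governor = (argc > 1) ? argv[1] : "ondemand";  // performance | ondemand | powersave
    for (const auto& entry : fs::directory_iterator("/sys/devices/system/cpu")) {
        fs::path gov = entry.path() / "cpufreq" / "scaling_governor";
        if (!fs::exists(gov))
            continue;                       // skip entries that are not cpuN or lack cpufreq
        std::ofstream out(gov);
        out << governor << "\n";            // needs root privileges
    }
    return 0;
}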
25 / 27
DVFS-GPU
- Remarks:
The GPU memory frequency can be tuned lower within a certain range to save energy with little performance degradation for GaaS. (A minimal NVML clock-setting sketch follows the figure.)
[Figures: FPS for RP, FF, CF, DF, UH, and UV at high vs. low GPU core clock (745 MHz vs. 575 MHz) and at high vs. low GPU memory clock (1250 MHz vs. 750 MHz)]
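A hedged sketch of lowering the GPU memory clock programmatically through NVML (the counterpart of nvidia-smi application clocks); the device index and the 750 MHz memory / 745 MHz core pair mirror the slide's settings, but whether the board accepts exactly this pair is an assumption, so real use should first query nvmlDeviceGetSupportedMemoryClocks. Link with -lnvidia-ml.

#include <nvml.h>
#include <cstdio>

int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "failed to initialize NVML\n");
        return 1;
    }
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);              // first GPU of the K2 board (assumed index)

    // Lower the memory clock while keeping the core clock near its default.
    nvmlReturn_t r = nvmlDeviceSetApplicationsClocks(dev, 750 /* mem MHz */, 745 /* core MHz */);
    if (r != NVML_SUCCESS)
        fprintf(stderr, "set application clocks failed: %s\n", nvmlErrorString(r));

    nvmlShutdown();
    return 0;
}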
26 / 27
Conclusions
- In this work, we characterize GaaS and GPGPU workloads on XenServer with virtualized GPUs. We found no NUMA overhead for GaaS workloads, because most memory copy operations are overlapped with GPU computation.
- GaaS workloads exhibit different behavior from GPGPU workloads.
- The ondemand CPU frequency governor achieves the best tradeoff between performance and energy for GaaS.
- The GPU memory clock can be tuned lower within a certain range to save energy with little performance degradation.