SLIDE 1

DATA CENTER GPU MANAGER
Brent Stolle and David Beer
March 2018

SLIDE 2

TOOLS FOR MANAGING GPUs

Out-of-Band

  • GPU metrics and monitoring via the BMC (SMBPBI)
  • Provides metrics (thermals, power, etc.) without the NVIDIA driver
  • Typically used at public CSPs (i.e., multi-tenant environments)

In-Band

  • Tools that use the NVIDIA driver to provide GPU and NVSwitch metrics
  • DCGM and NVML (nvidia-smi) are in-band tools (see the sketch below)
  • Typically used in single-tenant environments
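To make "in-band" concrete, here is a minimal sketch of polling metrics through NVML's Python bindings. This example is not from the deck; it assumes the nvidia-ml-py package (which provides the pynvml module) is installed and a driver is loaded, and the fields queried are illustrative.

    # Minimal in-band metric poll via NVML's Python bindings
    # (assumes the nvidia-ml-py package, i.e. the pynvml module).
    import pynvml

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
            print("GPU %d (%s): %d C, %.1f W" % (i, name, temp, watts))
    finally:
        pynvml.nvmlShutdown()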

SLIDE 3

NVIDIA IN-BAND TOOLS ECOSYSTEM

NVML: Customers building their own GPU metrics/monitoring stack using NVML

DCGM: Customers integrating DCGM; CSPs using it for system validation

3rd Party Tools: Cluster managers, job schedulers, TSDBs, visualization tools

SLIDE 4

HOW SHOULD I MANAGE MY GPUS?

NVML
  • Stateless queries; can only query current data
  • Low overhead while running, high overhead to develop
  • Low-level control of GPUs
  • Management app must run on the same box as the GPUs

DCGM
  • Can query a few hours of metrics
  • Provides health checks and diagnostics
  • Can batch queries/operations to groups of GPUs
  • Can be remote or local

3RD PARTY TOOLS
  • Provide a database, graphs, and a nice UI
  • Need management node(s)
  • Development already done; you just have to configure the tools

SLIDE 5

DATA CENTER GPU MANAGER (DCGM)

POLICY AND ALERTING
  • Pre-configured Policies
  • Job Level Statistics
  • Stateful Configuration

GPU DIAGNOSTICS
  • Software Deployment Tests
  • Stress Tests
  • Hardware Issues and Interface Tests (PCIe, NVLink)

CONFIGURATION MANAGEMENT
  • Dynamic Power Capping
  • Synchronous Clock Boost
  • Fixed Clocks

ACTIVE HEALTH MONITORING
  • Runtime Health Checks
  • Prologue Checks
  • Epilogue Checks

SLIDE 6

DCGM OVERVIEW

GPU Management in the Accelerated Data Center
https://developer.nvidia.com/data-center-gpu-manager-dcgm

Supported NVIDIA Hardware

  • Fully supported on Tesla GPUs (Kepler+)
  • Supported on Quadro, GeForce, and Titan GPUs (Maxwell+)
  • Supports NVSwitch and DGX-2
  • Requires driver R384 or later (Linux only)

SDK Installer Packages

  • .deb and .rpm Packages
  • Includes Binaries – CLI (dcgmi) and daemon (nv-hostengine)
  • Libraries and Headers (includes NVML)
  • C and Python Bindings and Code samples
  • Documentation - User Guides and API docs

Latest Release: v1.3.3 (Jan 2018)
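A typical install-and-verify sequence looks roughly like the following; the package file name here is hypothetical and varies by version and distribution.

    sudo dpkg -i datacenter-gpu-manager_1.3.3_amd64.deb   # hypothetical file name
    nv-hostengine                                         # start the DCGM daemon
    dcgmi discovery -l                                    # verify DCGM sees the GPUs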

SLIDE 7

AVAILABLE NVIDIA MANAGEMENT TOOLS

Software Stack

Data Center GPU Manager (DCGM)
  • Additional diagnostics (aka NVVS) and active health monitoring
  • Policy management and more

NVIDIA Management Library (NVML)
  • Low-level control of GPUs
  • Included as part of the driver
  • Header is part of the CUDA Toolkit / DCGM

[Stack diagram: CUDA and NVML sit on the NVIDIA driver; the DCGM daemon and GPU Diagnostics (NVVS) build on NVML; DCGMI and DCGM-based 3rd party tools connect through client libraries.]

SLIDE 8

ACTIVE HEALTH MONITORING & ANALYSIS

NON-INVASIVE CHECKS

  • Real-time monitoring & aggregated health indicator
  • Checks health of all GPU and NVSwitch subsystems: PCIe, ECC, InfoROM, power, thermal, NVLink

dcgmi health --check -g 1

Health Monitor Report
+------------------+---------------------------------------------------------+
| Overall Health: Healthy                                                     |
+==================+=========================================================+

Run Health Check: Healthy System

dcgmi health -g 1 -c

Health Monitor Report
+----------------------------------------------------------------------------+
| Group 1          | Overall Health: Warning                                 |
+==================+=========================================================+
| GPU ID: 0        | Warning                                                 |
|                  | PCIe system: Warning - Detected more than 8 PCIe        |
|                  | replays per minute for GPU 0: 13                        |
+------------------+---------------------------------------------------------+
| GPU ID: 1        | Warning                                                 |
|                  | InfoROM system: Warning - A corrupt InfoROM has been    |
|                  | detected in GPU 1.                                      |
+------------------+---------------------------------------------------------+

Run Health Check: System with Problems
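The same checks can be driven programmatically. Below is a rough sketch using the Python bindings shipped in the SDK; the class and constant names follow those bindings but should be treated as assumptions and verified against your DCGM version.

    # Hedged sketch: watch and poll group health via DCGM's Python bindings.
    # Identifier names are assumptions based on the SDK's pydcgm/dcgm_structs.
    import pydcgm
    import dcgm_structs

    handle = pydcgm.DcgmHandle(ipAddress="127.0.0.1")      # standalone daemon
    group = pydcgm.DcgmGroup(handle, groupName="jobgroup",
                             groupType=dcgm_structs.DCGM_GROUP_DEFAULT)

    group.health.Set(dcgm_structs.DCGM_HEALTH_WATCH_ALL)   # PCIe, ECC, power, thermal, NVLink...
    response = group.health.Check()                        # aggregated health for the group
    print(response.overallHealth)                          # pass / warn / fail constant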

SLIDE 9

Demo: Health Checks

SLIDE 10

GPU DIAGNOSTICS (NVVS) – COVERAGE AREAS

STRESS CHECKS
  • Power and thermal stress
  • Throughput stress
  • Constant relative system performance
  • Maximum relative system performance

HARDWARE ISSUES AND DIAGNOSTICS
  • PCIe and NVLink interface checks
  • Framebuffer and memory checks
  • Compute engine checks

INTEGRATION ISSUES
  • PCIe and NVLink replay counter checks
  • Topological limitations
  • Permissions, driver and cgroups checks
  • Basic power and thermal constraint checks

DEPLOYMENT AND SOFTWARE ISSUES
  • NVML library access and versioning
  • CUDA library access and versioning
  • Software conflicts

SLIDE 11

COMPREHENSIVE DIAGNOSTICS

ACTIVE HEALTH CHECKS

  • Identification, recovery & isolation of failed GPUs and NVSwitches
  • Diagnostics to root-cause failures
  • Pre & post job GPU health checks
  • Checks from system sanity through stress: performance, bandwidth, power and thermal characteristics
  • Multi-level diagnostic options, from a few seconds to minutes

dcgmi diag -r 3

+---------------------------+-------------+
| Diagnostic                | Result      |
+===========================+=============+
|----- Deployment ----------+-------------|
| Blacklist                 | Pass        |
| NVML Library              | Pass        |
| CUDA Main Library         | Pass        |
| CUDA Toolkit Library      | Pass        |
| Permissions and OS Blocks | Pass        |
| Persistence Mode          | Pass        |
| Environment Variables     | Pass        |
| Page Retirement           | Pass        |
| Graphics Processes        | Pass        |
| Inforom                   | Pass        |
+----- Hardware ------------+-------------+
| GPU Memory                | Pass - All  |
| Diagnostic                | Pass - All  |
+----- Integration ---------+-------------+
| PCIe                      | Pass - All  |
+----- Stress --------------+-------------+
| SM Stress                 | Pass - All  |
| Targeted Stress           | Pass - All  |
| Targeted Power            | Warn - All  |
| Memory Bandwidth          | Pass - All  |
+---------------------------+-------------+
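The -r flag selects how deep (and how long) the diagnostic runs; as a rough guide, durations vary by GPU:

    dcgmi diag -r 1    # quick deployment/software sanity checks (seconds)
    dcgmi diag -r 2    # adds medium-length hardware tests (minutes)
    dcgmi diag -r 3    # full stress suite, as shown above (longest)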

SLIDE 12

FLEXIBLE GPU GOVERNANCE POLICIES

Using existing tools
  • Continuous monitoring by the user
  • Identify GPUs with double-bit errors
  • Manually perform a GPU reset to correct problems

Using DCGM
  • Auto-detects double-bit errors, performs page retirement, and notifies the user

Condition / Action / Notification (see the sketch below)
  • Condition: watch for DBEs
  • Action: page retirement
  • Notification: callback
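A hedged sketch of registering the DBE policy above through the Python bindings; the identifiers follow the SDK's pydcgm/dcgm_structs modules but should be treated as approximate and checked against your version.

    # Hedged sketch: register a callback for double-bit ECC errors.
    # Names are assumptions based on DCGM's Python bindings.
    import pydcgm
    import dcgm_structs

    def on_dbe(callback_data):
        # Notification: DCGM observed the condition (page retirement is the
        # configured action); alert the operator or the scheduler here.
        print("DBE policy violation reported:", callback_data)

    handle = pydcgm.DcgmHandle(ipAddress="127.0.0.1")
    group = pydcgm.DcgmGroup(handle, groupName="policygroup",
                             groupType=dcgm_structs.DCGM_GROUP_DEFAULT)
    group.policy.Register(dcgm_structs.DCGM_POLICY_COND_DBE,
                          beginCallback=on_dbe, finishCallback=None)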

SLIDE 13

Demo: Policy Alerting

SLIDE 14

MANAGING JOB LIFECYCLE

  • Which GPUs did my job run on?
  • How much of the GPUs did my job use?
  • Were there any error or warning conditions during my job (ECC errors, clock throttling, etc.)?
  • Are the GPUs healthy and ready for the next job?

Workflow: create GPU group and check health → start job stats → run job → stop job stats → display job stats (see the command sequence below)
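In dcgmi terms, that workflow maps roughly onto the stats subcommand; the flag names below follow the 1.3-era user guide, but verify them with dcgmi stats --help.

    dcgmi stats -g 2 --enable          # start recording process/job stats for group 2
    dcgmi stats -g 2 --jstart demojob  # mark the start of job "demojob"
    # ... run the job ...
    dcgmi stats -g 2 --jstop demojob   # mark the end of the job
    dcgmi stats --job demojob -v       # display the collected job statistics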

SLIDE 15

JOB STATISTICS

dcgmi stats --job demojob -v -g 2

Successfully retrieved statistics for job: demojob.
+----------------------------------------------------------------------------+
| GPU ID: 0                                                                   |
+==================================+=========================================+
|----- Execution Stats ------------+-----------------------------------------|
| Start Time                       | Wed Mar  7 10:02:34 2018                |
| End Time                         | Wed Mar  7 10:10:00 2018                |
| Total Execution Time (sec)       | 445.48                                  |
| No. of Processes                 | 1                                       |
| Compute PID                      | 23112                                   |
+----- Performance Stats ----------+-----------------------------------------+
| Energy Consumed (Joules)         | 1437                                    |
| Max GPU Memory Used (bytes)      | 120324096                               |
| SM Clock (MHz)                   | Avg: 998, Max: 1177, Min: 405           |
| Memory Clock (MHz)               | Avg: 2068, Max: 2505, Min: 324          |
| SM Utilization (%)               | Avg: 76, Max: 100, Min: 0               |
| Memory Utilization (%)           | Avg: 0, Max: 1, Min: 0                  |
| PCIe Rx Bandwidth (megabytes)    | Avg: 0, Max: 0, Min: 0                  |
| PCIe Tx Bandwidth (megabytes)    | Avg: 0, Max: 0, Min: 0                  |
+----- Event Stats ----------------+-----------------------------------------+
| Single Bit ECC Errors            | 5                                       |
| Double Bit ECC Errors            | 0                                       |
| PCIe Replay Warnings             | 0                                       |
| Critical XID Errors              | 0                                       |
+----- Slowdown Stats -------------+-----------------------------------------+
| Due to - Power (%)               | 0                                       |
|        - Thermal (%)             | Not Supported                           |
|        - Reliability (%)         | Not Supported                           |
|        - Board Limit (%)         | Not Supported                           |
|        - Low Utilization (%)     | Not Supported                           |
|        - Sync Boost (%)          | 0                                       |
+----------------------------------+-----------------------------------------+

Detailed stats show utilization, performance and more…

SLIDE 16

WHY A DAEMON? STATEFULNESS

Example: 5 new single-bit errors appear at 10:04. A stateless query only sees the cumulative counter; a stateful daemon records when the errors occurred and can attribute them to the job running at that time.

SLIDE 17

DCGM DAEMON INTERNALS

[Diagram: DCGM daemon internals. A cache thread polls the NVIDIA driver and procfs/sysfs into a metric cache, driven by a watch table; GPU config management, job/process stats, health checks, policy actions, and telemetry APIs are served from that cache.]

SLIDE 18

GPU CONFIGURATION MANAGEMENT

MAINTAINS CONFIGURATION
  • Initialization: configure all GPUs (global group)
  • Per-job basis: individual partitioned group settings
  • Maintains settings across driver restarts, GPU resets, or at job start
  • Supports SET, GET and ENFORCE (an ENFORCE example follows the output below)

SUPPORTED SETTINGS: sync boost, application clocks, ECC mode, power limit, compute mode (see the table below)

dcgmi config -g 1 --set -P 200
Configuration successfully set.

Set group power limit (ECC mode was disabled by an earlier --set; both appear in the target configuration below)

dcgmi config -g 1 --get

+--------------------------+------------------------+------------------------+
| all_gpu_group            |                        |                        |
| Group of 2 GPUs          | TARGET CONFIGURATION   | CURRENT CONFIGURATION  |
+==========================+========================+========================+
| Sync Boost               | Not Specified          | Disabled               |
| SM Application Clock     | Not Specified          | 705                    |
| Memory Application Clock | Not Specified          | 2600                   |
| ECC Mode                 | Disabled               | Disabled               |
| Power Limit              | 200                    | 225                    |
| Compute Mode             | Not Specified          | E. Process             |
+--------------------------+------------------------+------------------------+

Get group config [Note: DCGM performed a reset]
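The ENFORCE operation re-applies the stored target configuration, e.g., after a driver restart or GPU reset; roughly:

    dcgmi config -g 1 --enforce   # push the target configuration back onto the GPUs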

SLIDE 19

ENHANCED POWER & CLOCK MGMT.

Dynamic Power Capping

Drive better power density through dynamic power capping

Apply power capping to a single GPU or a group of GPUs

Fixed Clocks

Target a conservative clock rate for fixed performance

Useful for profiling

Synchronous Clock Boost

Predictable performance through group GPU clock boost in lockstep

Dynamically modulate multi-GPU clocks across multiple boards in unison based on target workload, power budgets or other criteria

SLIDE 20

DCGM MODES OF OPERATION

Standalone
  • Runs as a daemon
  • Client libraries connect via TCP/IP
  • 1 DCGM for several clients

Embedded
  • Runs within the client process (even within Python)
  • 1 DCGM per client process
  • No TCP/IP necessary

[Diagram: standalone mode shows user processes connecting through a client lib over TCP/IP to the DCGM daemon; embedded mode shows DCGM plus the client lib loaded inside the user process.]
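In the Python bindings the two modes differ only in how the handle is opened. A hedged sketch; the parameter names follow the SDK's pydcgm module but are approximate.

    # Hedged sketch: standalone vs. embedded DCGM from Python.
    import pydcgm

    # Standalone: connect over TCP/IP to a running nv-hostengine daemon.
    remote = pydcgm.DcgmHandle(ipAddress="127.0.0.1")

    # Embedded: no ipAddress, so DCGM runs inside this process; no daemon,
    # no TCP/IP, one DCGM instance per client process.
    local = pydcgm.DcgmHandle(ipAddress=None)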

SLIDE 21

THIRD-PARTY INTEGRATIONS

  • Provides the DcgmReader base Python class for GPU / NVSwitch telemetry monitoring
  • Provides working examples for popular monitoring tools based on DcgmReader (see the sketch below)

Examples shipped with DCGM: dcgm_prometheus and dcgm_collectd, both built on DcgmReader.
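A minimal DcgmReader subclass might look like the following; the field IDs, hook name, and data shape are based on the SDK's Python examples and should be treated as approximate.

    # Hedged sketch: push DCGM telemetry into your own stack by subclassing
    # DcgmReader (class and hook names per the SDK's Python examples).
    from DcgmReader import DcgmReader
    import dcgm_fields

    class MyReader(DcgmReader):
        def CustomDataHandler(self, fvs):
            # fvs: {gpuId: {fieldTag: [field values]}} (shape approximate)
            for gpu_id, fields in fvs.items():
                for tag, values in fields.items():
                    print("gpu=%s %s=%s" % (gpu_id, tag, values[-1].value))

    reader = MyReader(fieldIds=[dcgm_fields.DCGM_FI_DEV_GPU_TEMP,
                                dcgm_fields.DCGM_FI_DEV_POWER_USAGE])
    reader.Process()  # one poll cycle; call periodically from your collector loop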

SLIDE 22

EXAMPLE DEPLOYMENT: PROMETHEUS

[Diagram: each compute node runs DCGM with the dcgm_prometheus exporter; a management node runs the Prometheus server, which scrapes the compute nodes; a second management node runs the Grafana server for dashboards.]
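On the Prometheus side, the server only needs one scrape target per compute node. A hedged prometheus.yml fragment; the node names are hypothetical and the exporter port is an assumption (use whatever dcgm_prometheus is configured to listen on).

    # prometheus.yml fragment -- node names hypothetical, port an assumption
    scrape_configs:
      - job_name: dcgm
        scrape_interval: 15s
        static_configs:
          - targets: ['compute-node-1:8000', 'compute-node-2:8000']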

SLIDE 23

Demo: DCGM + Prometheus + Grafana

SLIDE 24

Example Deployments

SLIDE 25

NVIDIA SATURNV CLUSTER

[Diagram: 660 compute nodes (DGX-1V), each running DCGM with a CollectD plugin; management nodes run the Elastic Stack and a time-series database that aggregate the per-node metrics.]

SLIDE 26


DCGM ROADMAP*

Jan 2018 (v1.3.3)
  • Container ecosystem enablement
  • DCGM enablement for non-Tesla GPUs (Maxwell+)
  • Interactive device monitoring with 'dmon'
  • New diagnostics to stress GPUs
  • Deprecation of standalone NVVS

Apr 2018 (v1.4): Improved User Experience
  • Integration with 3rd party monitoring/metrics stacks (Prometheus, Grafana)
  • Container orchestration (Kubernetes) support (cAdvisor metrics, health checks)
  • Go bindings
  • Job scheduler hints
  • Packages on the compute/cuda repo

Summer 2018 (vNext): Next Generation Systems
  • DGX-2 and NVSwitch monitoring and diagnostics
  • Container orchestration continued

* Roadmap subject to change
