DATA CENTER GPU MANAGER (DCGM), Brent Stolle and Rajat Phull, 4/5/2016



  1. April 4-7, 2016 | Silicon Valley DATA CENTER GPU MANAGER (DCGM) Brent Stolle and Rajat Phull, 4/5/2016

  2. DATA CENTER INFRASTRUCTURE CHALLENGES
     • Resource Availability & Uptime
     • Under-utilized Resources & Efficiency
     • Administrative Overhead

  3. DATA CENTER GPU MANAGER

     Existing Tools (All GPUs Supported)
     Device Management: Per GPU Configuration & Monitoring
     • Device Identification
     • Configuration & Monitoring
     • Clock Management

     DCGM (Tesla GPUs Only)
     • Active Diagnostics and Health Checks: Increases Reliability
     • Policy & Configuration Management: Lower Admin overhead
     • Enhanced Clock & Power management: Increases Efficiency
     • Stateful, Group Operations, Maintains historical info: Ease of Use

     NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.

  4. NVIDIA DATA CENTER GPU MANAGER (DCGM)
     Comprehensive GPU Management for Accelerated Data Center

     • Maximize GPU Reliability & Uptime
     • Streamline GPU Administration & TCO
     • Boost Performance & Resource Efficiency

     • Health Monitoring
     • Active Diagnostics
     • Policy Governance
     • Power & Clock Mgmt.

  5. Maximize GPU Reliability & Availability
     • Active Health Monitoring & Analysis
     • Comprehensive Diagnostics

  6. Maximize GPU Reliability & Availability
     Active Health Monitoring & Analysis

     NON INVASIVE
     • Performed during job execution
     • Overall health for the GPU subsystems (PCIe, SM, MCU, PMU, Inforom, Power and thermal system)

     Create Group

         dcgmi group --create all_gpus_grp --default
         Successfully created group "all_gpus_grp" group id: 1

     Set Watches

         dcgmi health -g 1 --set pmi
         Health monitor systems set successfully

     Get Watches

         dcgmi health -g 1 -f
         +----------------------------------------------------------------------------+
         | Group Health Watches                                                       |
         +=========+==================================================================+
         | PCIe    | On                                                               |
         | NVLINK  | Off                                                              |
         | PMU     | Off                                                              |
         | MCU     | Off                                                              |
         | Memory  | On                                                               |
         | SM      | Off                                                              |
         | InfoROM | On                                                               |
         | Thermal | Off                                                              |
         | Power   | Off                                                              |
         | Driver  | Off                                                              |
         +---------+------------------------------------------------------------------+

  7. Maximize GPU Reliability & Availability
     Active Health Monitoring & Analysis

     NON INVASIVE
     • Performed during job execution
     • Overall health for the GPU subsystems (PCIe, SM, MCU, PMU, Inforom, Power and thermal system)

     Run Health Check: Healthy System

         dcgmi health --check -g 1
         Health Monitor Report
         +------------------+---------------------------------------------------------+
         | Overall Health: Healthy                                                    |
         +==================+=========================================================+

     Run Health Check: System with problems

         dcgmi health --check -g 1
         Health Monitor Report
         +----------------------------------------------------------------------------+
         | Group 1          | Overall Health: Warning                                 |
         +==================+=========================================================+
         | GPU ID: 0        | Warning                                                 |
         |                  | PCIe system: Warning - Detected more than 8 PCIe        |
         |                  | replays per minute for GPU 0: 13                        |
         +------------------+---------------------------------------------------------+
         | GPU ID: 1        | Warning                                                 |
         |                  | InfoROM system: Warning - A corrupt InfoROM has been    |
         |                  | detected in GPU 1.                                      |
         +------------------+---------------------------------------------------------+
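A job scheduler that runs the health check on slide 7 usually needs the result in machine-readable form. The sketch below is one possible way to pull the overall status and the per-GPU messages out of the report text. It assumes the report looks exactly like the slide; the function name `parse_health_report` and the sample string are illustrative and not part of DCGM, and a real deployment may prefer a structured output mode if the installed `dcgmi` offers one.

```python
import re

# Rows copied from the "System with problems" report on slide 7.
SAMPLE_REPORT = """\
| Group 1          | Overall Health: Warning                                 |
| GPU ID: 0        | Warning                                                 |
|                  | PCIe system: Warning - Detected more than 8 PCIe        |
|                  | replays per minute for GPU 0: 13                        |
| GPU ID: 1        | Warning                                                 |
|                  | InfoROM system: Warning - A corrupt InfoROM has been    |
|                  | detected in GPU 1.                                      |
"""


def parse_health_report(text):
    """Return (overall, gpus) where gpus maps GPU id -> status and details.

    Hypothetical helper: it relies only on the table layout shown on the slide.
    """
    overall = None
    gpus = {}
    current = None
    for line in text.splitlines():
        match = re.search(r"Overall Health:\s*(\w+)", line)
        if match:
            overall = match.group(1)
            continue
        if not line.startswith("|"):
            continue  # border lines start with "+"
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue
        left, right = cells
        gpu = re.fullmatch(r"GPU ID:\s*(\d+)", left)
        if gpu:
            current = int(gpu.group(1))
            gpus[current] = {"status": right, "details": ""}
        elif left == "" and current is not None and right:
            # Continuation row: the message wraps across several table rows.
            joined = gpus[current]["details"] + " " + right
            gpus[current]["details"] = joined.strip()
    return overall, gpus


overall, gpus = parse_health_report(SAMPLE_REPORT)
for gpu_id, info in sorted(gpus.items()):
    print(gpu_id, info["status"], info["details"])
```

Joining the wrapped rows back into one message keeps the PCIe replay count and the InfoROM warning intact for logging or alerting.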

  8. Maximize GPU Reliability & Availability
     Comprehensive Diagnostics

     INVASIVE
     • Performed at job epilogue/prologue or when job fails
     • Validates device sub-components, interlink bandwidth, memory/ecc state and deployment software integrity
     • Several levels of diagnostic are available. Level 1-3 with -r.

     Quick Diagnostics (~secs)

         dcgmi diag -g 1 -r 1
         +---------------------------+-------------+
         | Diagnostic                | Result      |
         +===========================+=============+
         |-----  Deployment  --------+-------------|
         | Blacklist                 | Pass        |
         | NVML Library              | Pass        |
         | CUDA Main Library         | Pass        |
         | CUDA Toolkit Libraries    | Pass        |
         | Permissions and OS Blocks | Pass        |
         | Persistence Mode          | Pass        |
         | Environment Variables     | Pass        |
         | Page Retirement           | Pass        |
         | Graphics Processes        | Pass        |
         +---------------------------+-------------+

  9. Maximize GPU Reliability & Availability
     Comprehensive Diagnostics

     INVASIVE
     • Performed at job epilogue/prologue or when job fails
     • Validates device sub-components, interlink bandwidth, memory/ecc state and deployment software integrity
     • Several levels of diagnostic are available. Level 1-3 with -r.

     Extended Diagnostics (~mins)

         dcgmi diag -g 1 -r 2
         +---------------------------+-------------+
         | Diagnostic                | Result      |
         +===========================+=============+
         |-----  Deployment  --------+-------------|
         | Blacklist                 | Pass        |
         | NVML Library              | Pass        |
         | CUDA Main Library         | Pass        |
         | CUDA Toolkit Libraries    | Pass        |
         | Permissions and OS Blocks | Pass        |
         | Persistence Mode          | Pass        |
         | Environment Variables     | Pass        |
         | Page Retirement           | Pass        |
         | Graphics Processes        | Pass        |
         +-----  Performance  -------+-------------+
         | SM Performance            | Pass - All  |
         | Targeted Performance      | Pass - All  |
         | Targeted Power            | Warn - All  |
         +---------------------------+-------------+

  10. Maximize GPU Reliability & Availability
      Comprehensive Diagnostics

      INVASIVE
      • Performed at job epilogue/prologue or when job fails
      • Validates device sub-components, interlink bandwidth, memory/ecc state and deployment software integrity
      • Several levels of diagnostic are available. Level 1-3 with -r.

      Hardware Diagnostics

          dcgmi diag -r 3
          +---------------------------+-------------+
          | Diagnostic                | Result      |
          +===========================+=============+
          |-----  Deployment  --------+-------------|
          | Blacklist                 | Pass        |
          | NVML Library              | Pass        |
          | CUDA Main Library         | Pass        |
          | CUDA Toolkit Libraries    | Pass        |
          | Permissions and OS Blocks | Pass        |
          | Persistence Mode          | Pass        |
          | Environment Variables     | Pass        |
          | Page Retirement           | Pass        |
          | Graphics Processes        | Pass        |
          +-----  Hardware  ----------+-------------+
          | GPU Memory                | Pass - All  |
          | Diagnostic                | Pass - All  |
          +-----  Integration  -------+-------------+
          | PCIe                      | Pass - All  |
          +-----  Performance  -------+-------------+
          | SM Performance            | Pass - All  |
          | Targeted Performance      | Pass - All  |
          | Targeted Power            | Warn - All  |
          +---------------------------+-------------+
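Because the diagnostics on slides 8 to 10 are meant to run in a job prologue or epilogue, a wrapper script typically has to decide whether a node may accept work. The following is a rough sketch of that decision, assuming the table layout shown on the slides. `parse_diag_table` and `non_passing` are hypothetical helper names, and the sample text is a shortened copy of the slide 9 output.

```python
# Shortened copy of the level 2 output on slide 9.
SAMPLE_DIAG = """\
+---------------------------+-------------+
| Diagnostic                | Result      |
+===========================+=============+
|-----  Deployment  --------+-------------|
| Blacklist                 | Pass        |
| Persistence Mode          | Pass        |
+-----  Performance  -------+-------------+
| SM Performance            | Pass - All  |
| Targeted Power            | Warn - All  |
+---------------------------+-------------+
"""


def parse_diag_table(text):
    """Map each test name to its result string (hypothetical helper)."""
    results = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Data rows start with "| "; borders and section rows start with
        # "+" or "|-" and are skipped.
        if not line.startswith("| "):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2 or cells[1] == "Result":
            continue  # skip the column header row
        results[cells[0]] = cells[1]
    return results


def non_passing(results):
    """Return the tests whose result does not begin with "Pass"."""
    return {name: res for name, res in results.items()
            if not res.startswith("Pass")}


results = parse_diag_table(SAMPLE_DIAG)
problems = non_passing(results)
print(problems)
```

Whether a "Warn" result should block a job is a site policy choice; the sketch only surfaces it.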

  11. Streamline GPU Administration & TCO
      • Flexible GPU Governance Policies
      • Manage GPU group Configuration
      • Job Statistics

  12. Streamline GPU Administration & TCO
      Flexible GPU Governance Policies

      With Existing Tools
      • Continuous monitoring by the user
      • Identify GPUs with double bit errors
      • Perform GPU reset to correct problems

      Using DCGM (Condition, Notification, Action)
      • Condition: Watch for DBE
      • Action: Page retirement
      • Notification: Callback
      Auto-detects double bit errors, performs page retirement, and notifies the user
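The condition, action, notification flow on slide 12 can be pictured with a small toy model. This is not the DCGM policy API: the `Policy` class, the `evaluate` function, and the `dbe_count` counter name are all invented for illustration, and the "action" only returns a string instead of touching hardware.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Policy:
    """Toy policy: a condition to watch, an action, and a callback."""
    condition: Callable[[Dict[str, int]], bool]
    action: Callable[[int], str]
    notify: Callable[[int, str], None]


def evaluate(policy: Policy, samples: Dict[int, Dict[str, int]]) -> List[int]:
    """Apply the policy to per-GPU counters; return the GPU ids that triggered."""
    triggered = []
    for gpu_id, counters in sorted(samples.items()):
        if policy.condition(counters):
            outcome = policy.action(gpu_id)   # e.g. request page retirement
            policy.notify(gpu_id, outcome)    # callback to the user
            triggered.append(gpu_id)
    return triggered


events = []
dbe_policy = Policy(
    condition=lambda counters: counters.get("dbe_count", 0) > 0,
    action=lambda gpu_id: "page retirement requested",
    notify=lambda gpu_id, outcome: events.append((gpu_id, outcome)),
)

# Hypothetical counter readings: GPU 1 has seen two double bit errors.
triggered = evaluate(dbe_policy, {0: {"dbe_count": 0}, 1: {"dbe_count": 2}})
print(triggered, events)
```

The point of the model is the division of labor: the user registers the policy once, and detection, response, and notification then happen without continuous manual monitoring.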

  13. Streamline GPU Administration & TCO
      Manage GPU group Configuration

      MAINTAINS CONFIGURATION
      • Initialization: Configure all GPUs (global group)
      • Per-job basis: Individual partitioned group settings
      • Maintains settings across driver restarts, GPU resets or at job start
      • Supports SET, GET and ENFORCE

      SUPPORTED SETTINGS
      Get Config for the group of GPUs

          dcgmi config -g 1 --get
          +--------------------------+------------------------+------------------------+
          | all_gpu_group            |                        |                        |
          | Group of 2 GPUs          | TARGET CONFIGURATION   | CURRENT CONFIGURATION  |
          +==========================+========================+========================+
          | Sync Boost               | Disabled               | Disabled               |
          | SM Application Clock     | 705                    | 705                    |
          | Memory Application Clock | 2600                   | 2600                   |
          | ECC Mode                 | Enabled                | Enabled                |
          | Power Limit              | 225                    | 225                    |
          | Compute Mode             | E. Process             | E. Process             |
          +--------------------------+------------------------+------------------------+

      DCGM maintains the target configuration across resets
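The two columns in the slide 13 output, target and current configuration, invite a simple drift check: any setting whose current value differs from its target is a candidate for re-enforcement. The sketch below assumes the table layout from the slide. `parse_config_table` and `find_drift` are hypothetical names, and the mismatching power limit at the end is made-up data used only to show what a drift report would look like.

```python
# Rows copied from the "dcgmi config -g 1 --get" output on slide 13.
SAMPLE_CONFIG = """\
| all_gpu_group            |                        |                        |
| Group of 2 GPUs          | TARGET CONFIGURATION   | CURRENT CONFIGURATION  |
| Sync Boost               | Disabled               | Disabled               |
| SM Application Clock     | 705                    | 705                    |
| Memory Application Clock | 2600                   | 2600                   |
| ECC Mode                 | Enabled                | Enabled                |
| Power Limit              | 225                    | 225                    |
| Compute Mode             | E. Process             | E. Process             |
"""


def parse_config_table(text):
    """Map each setting to a (target, current) pair (hypothetical helper)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # border lines start with "+"
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 3:
            continue
        name, target, current = cells
        if not target or target == "TARGET CONFIGURATION":
            continue  # group name row and column header row
        settings[name] = (target, current)
    return settings


def find_drift(settings):
    """Return the settings whose current value differs from the target."""
    return {name: pair for name, pair in settings.items() if pair[0] != pair[1]}


settings = parse_config_table(SAMPLE_CONFIG)
drift = find_drift(settings)

# Made-up example: pretend the power limit changed after a GPU reset.
hypothetical = {**settings, "Power Limit": ("225", "250")}
print(drift, find_drift(hypothetical))
```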
