PerfMon redux: analyzing a CUDA application with the Windows Performance Monitor

SLIDE 1

S6287

PerfMon redux: analyzing a CUDA application with the Windows Performance Monitor

Richard Wilton
Department of Physics and Astronomy
Johns Hopkins University

SLIDE 2

S6287: Analyzing a CUDA application with PerfMon

What to monitor and why

  • What is there to monitor?
      • Speed (duration)
      • Resource utilization
      • Interactions between resources

  • Why bother?
      • Prove that things are operating as expected
      • Make things run faster
      • Find performance bottlenecks
      • Identify resource contention

SLIDE 3

Setup for performance monitoring

  • Tools you need
      • Microsoft Windows
      • NVIDIA GPU and CUDA toolkit (NVML)
      • Microsoft Visual Studio (PerfLib v2)

  • Monitoring setup
      • Target machine with target hardware
      • Application “release” build
      • Choose your performance counters

SLIDE 4

Choosing performance counters

Counters in the GPU group:

  • Clock speed (MHz): memory
  • Clock speed (MHz): SM
  • Fan speed (% maximum)
  • Global memory allocated (bytes)
  • Global memory allocated (percent)
  • Global memory free (bytes)
  • Global memory read/write activity (%)
  • GPU compute activity (%)
  • GPU temperature (°C)
  • GPU total power draw (watts)
  • PCIe receive throughput (KB/s)
  • PCIe transmit throughput (KB/s)
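
Slide 3 names NVML as the source of the GPU data, so the counters above can plausibly be read with standard NVML queries. The following is a minimal sketch, not the talk's own counter provider (which the slides do not show). It assumes the function and constant names of the `pynvml` Python binding; the `nvml` parameter, the function name `read_gpu_counters`, and the mapping of each counter to a query are illustrative assumptions.

```python
def read_gpu_counters(nvml, device_index=0):
    """Read one snapshot of the GPU-group counters through an NVML binding.

    `nvml` is any module or object exposing the pynvml names used below
    (an assumption for illustration). nvmlInit() must already have been called.
    """
    h = nvml.nvmlDeviceGetHandleByIndex(device_index)
    util = nvml.nvmlDeviceGetUtilizationRates(h)  # .gpu and .memory, in percent
    mem = nvml.nvmlDeviceGetMemoryInfo(h)         # .total, .free, .used, in bytes
    return {
        "Clock speed (MHz): memory": nvml.nvmlDeviceGetClockInfo(h, nvml.NVML_CLOCK_MEM),
        "Clock speed (MHz): SM": nvml.nvmlDeviceGetClockInfo(h, nvml.NVML_CLOCK_SM),
        "Fan speed (% maximum)": nvml.nvmlDeviceGetFanSpeed(h),
        "Global memory allocated (bytes)": mem.used,
        # Assumed definition: used bytes as a share of total bytes.
        "Global memory allocated (percent)": 100.0 * mem.used / mem.total,
        "Global memory free (bytes)": mem.free,
        "Global memory read/write activity (%)": util.memory,
        "GPU compute activity (%)": util.gpu,
        "GPU temperature (°C)": nvml.nvmlDeviceGetTemperature(h, nvml.NVML_TEMPERATURE_GPU),
        # NVML reports power in milliwatts; convert to watts.
        "GPU total power draw (watts)": nvml.nvmlDeviceGetPowerUsage(h) / 1000.0,
        "PCIe receive throughput (KB/s)": nvml.nvmlDeviceGetPcieThroughput(h, nvml.NVML_PCIE_UTIL_RX_BYTES),
        "PCIe transmit throughput (KB/s)": nvml.nvmlDeviceGetPcieThroughput(h, nvml.NVML_PCIE_UTIL_TX_BYTES),
    }
```

With the `pynvml` package installed, calling `pynvml.nvmlInit()` and then `read_gpu_counters(pynvml)` should return one snapshot for device 0. Passing the binding as a parameter keeps the sketch testable on a machine without a GPU.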

SLIDE 5

Choosing performance counters

Monitoring everything at once is probably not a good idea.

SLIDE 6

Application pipeline (circa 2013)

  • CPU compute activity
  • GPU (CUDA) compute activity

SLIDE 7

GPU activity

Device-related counters (devices 0, 1, 2):

  • GPU compute activity %
  • Global memory read/write activity %

Host-related counters:

  • CPU activity %
  • Host memory allocation

SLIDE 8

GPU activity

Device-related counters (device 0):

  • GPU compute activity %
  • Global memory read/write activity %

Host-related counters:

  • CPU activity %
  • Host memory allocation

SLIDE 9

Sampling Jaggedness

Device-related counters (device 0):

  • GPU compute activity %
  • Global memory read/write activity %

  • Sampled at 1-second intervals
  • Samples are “snapshots” (not averaged)
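
Why snapshots look jagged can be shown with a toy model. The sketch below compares taking the last instantaneous value in each 1-second window with averaging over the window. The trace, the 10-ticks-per-second resolution, and both function names are hypothetical, and the model does not claim to reproduce how NVML or PerfMon compute their values.

```python
def snapshot_samples(trace, period):
    """Keep only the instantaneous value at the end of each period."""
    return [trace[i + period - 1]
            for i in range(0, len(trace) - period + 1, period)]


def averaged_samples(trace, period):
    """Average every value that falls inside each period."""
    return [sum(trace[i:i + period]) / period
            for i in range(0, len(trace) - period + 1, period)]


# Hypothetical utilization trace, 10 ticks per second for 3 seconds.
# Each tick is either fully busy (100) or idle (0).
trace = ([100] * 7 + [0] * 3) + ([100] * 3 + [0] * 7) + ([0] * 4 + [100] * 6)

print(snapshot_samples(trace, 10))   # [0, 0, 100]
print(averaged_samples(trace, 10))   # [70.0, 30.0, 60.0]
```

In this toy trace the device is busy 70%, 30%, and 60% of each second, but the snapshots report 0, 0, and 100. Bursty activity sampled this way swings between extremes even when the underlying load is moderate.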

SLIDE 10

Concurrency among multiple GPUs

Device-related counters (devices 0, 1, 2):

  • GPU compute activity %
  • Global memory read/write activity %

Host-related counters:

  • CPU activity %
  • Host memory allocation

SLIDE 11

Concurrency among multiple GPUs

Device-related counters (device 0):

  • GPU compute activity %
  • Global memory read/write activity %

Host-related counters:

  • CPU activity %
  • Host memory allocation

SLIDE 12

Concurrency among multiple GPUs

Device-related counters (device 1):

  • GPU compute activity %
  • Global memory read/write activity %

Host-related counters:

  • CPU activity %
  • Host memory allocation

SLIDE 13

Concurrency among multiple GPUs

Device-related counters (device 2):

  • GPU compute activity %
  • Global memory read/write activity %

Host-related counters:

  • CPU activity %
  • Host memory allocation
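
One possible way to put a number on concurrency from logged samples is to count, at each sample time, how many devices report activity. This is a sketch only: the sample values, the 10% "busy" threshold, and the function name `busy_device_counts` are all hypothetical and are not taken from the talk.

```python
def busy_device_counts(samples_by_device, busy_threshold=10.0):
    """For each sample time, count the devices whose activity exceeds the threshold.

    `samples_by_device` maps a device index to its list of
    "GPU compute activity %" samples; all lists share the same sample times.
    """
    return [sum(1 for value in values if value > busy_threshold)
            for values in zip(*samples_by_device.values())]


# Hypothetical samples for devices 0, 1, and 2 at four sample times.
samples = {
    0: [90, 85, 0, 95],
    1: [88, 0, 0, 92],
    2: [91, 80, 5, 0],
}

print(busy_device_counts(samples))  # [3, 2, 0, 2]
```

A count that stays at the number of installed devices suggests the GPUs are working concurrently. Frequent lower counts suggest the devices are taking turns or waiting.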

SLIDE 14

Starving for CPU cycles

Device-related counters (devices 0, 1, 2):

  • GPU compute activity %
  • Global memory read/write activity %

Host-related counters:

  • CPU activity %
  • Host memory allocation

SLIDE 15

Starving for CPU cycles

Device-related counters (device 0):

  • GPU compute activity %
  • Global memory read/write activity %

Host-related counters:

  • CPU activity %
  • Host memory allocation

SLIDE 16

Starving for CPU cycles

Device-related counters (devices 0, 1, 2):

  • GPU compute activity %
  • Global memory read/write activity %

Host-related counters:

  • CPU activity %
  • Host memory allocation

SLIDE 17

Starving for CPU cycles

Device-related counters (device 0):

  • GPU compute activity %
  • Global memory read/write activity %

Host-related counters:

  • CPU activity %
  • Host memory allocation
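
The slides do not state how the starvation was identified in the plots. One plausible reading, offered here as an assumption, is that the GPU sits underused while the CPU is saturated. The sketch below flags such samples; the thresholds, the sample data, and the function name `cpu_starved_samples` are hypothetical.

```python
def cpu_starved_samples(cpu_pct, gpu_pct, cpu_high=95.0, gpu_low=50.0):
    """Return indices of samples where the CPU is saturated and the GPU is underused.

    `cpu_pct` and `gpu_pct` are equal-length lists of "CPU activity %" and
    "GPU compute activity %" samples taken at the same times.
    The default thresholds are illustrative, not recommendations.
    """
    return [i for i, (cpu, gpu) in enumerate(zip(cpu_pct, gpu_pct))
            if cpu >= cpu_high and gpu <= gpu_low]


# Hypothetical samples at four sample times.
cpu = [50, 99, 100, 97]
gpu = [90, 20, 95, 40]

print(cpu_starved_samples(cpu, gpu))  # [1, 3]
```

Because the samples are snapshots, a single flagged index means little. A sustained run of flagged samples is a stronger hint that host-side work is limiting the GPU.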

SLIDE 18

Consuming a resource

Device-related counters (device 2):

  • GPU compute activity %
  • Global memory allocated (bytes)

Host-related counters:

  • CPU activity %

(image TBD)

SLIDE 19

GPU mystery

Device-related counters (devices 0, 1):

  • GPU compute activity %
  • Global memory read/write activity %
  • GPU temperature (°C)
  • GPU total power draw (watts)

Host-related counters:

  • CPU activity %
  • Host memory allocation

SLIDE 20

GPU mystery

Device-related counters (devices 0, 1):

  • GPU compute activity %
  • Global memory read/write activity %
  • GPU temperature (°C)
  • GPU total power draw (watts)

Host-related counters:

  • CPU activity %
  • Host memory allocation

SLIDE 21

GPU mystery

Device-related counters (devices 0, 1):

  • GPU compute activity %
  • Global memory read/write activity %
  • GPU temperature (°C)
  • GPU total power draw (watts)

Host-related counters:

  • CPU activity %
  • Host memory allocation

SLIDE 22

GPU mystery

Device-related counters (devices 0, 1):

  • GPU compute activity %
  • Global memory read/write activity %
  • GPU temperature (°C)
  • GPU total power draw (watts)

Host-related counters:

  • CPU activity %
  • Host memory allocation

SLIDE 23

GPU mystery

Device-related counters (devices 0, 1):

  • GPU compute activity %
  • Global memory read/write activity %
  • GPU temperature (°C)
  • GPU total power draw (watts)

Host-related counters:

  • CPU activity %
  • Host memory allocation

SLIDE 24

PerfMon and CUDA

  • What is there to monitor?
      • Speed (duration)
      • Resource utilization
      • Interactions between resources

  • Why bother?
      • Prove that things are operating as expected
      • Make things run faster
      • Find performance bottlenecks
      • Identify resource contention

SLIDE 25

Questions / Comments