ROBUST SOFTWARE DEVELOPMENT: BUG PREVENTION AND ISOLATION – Erika Dignam and Ross Cunniff



SLIDE 1

April 4-7, 2016 | Silicon Valley

Erika Dignam and Ross Cunniff 04 April 2016

ROBUST SOFTWARE DEVELOPMENT

BUG PREVENTION AND ISOLATION

SLIDE 2

ABOUT US

Ross Cunniff – Senior Software Engineer and NVIDIA SPEC representative. 15-year NVIDIA employee with over 30 years of computer engineering experience.
Erika Dignam – Technical Program Manager and bug triager. Studied computer arts; at NVIDIA for 9 years.

4/25/2016

SLIDE 3

STRUCTURE

Bug types | Triage and Tools | Recap | Process Details | Bookkeeping | Prevention and Benchmarking

SLIDE 4

BUG TYPES

Crash or TDR | Corruption | Performance | SLI Scaling


SLIDE 5

TOOLS AND TRIAGE

Traces – All bug types

What is a trace? Intercepts calls between application and driver | Records to a file

  • Apitrace (DX and OpenGL) - http://apitrace.github.io/
  • Pass along .trace file – Replay, performance info, and dump API stream
  • Simple to use - copy <API>.dll to executable location
  • Caveats - Long reproductions mean large files | Tracing tools don’t always capture | Some apps are not tracing-friendly out of the box
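The interception idea above can be illustrated with a toy sketch (Python here for brevity; real tracers such as apitrace do this at the DLL level, and `TracingProxy`/`FakeDriver` are invented names for illustration): every call is recorded to the "trace file" before being forwarded to the real implementation.

```python
import io
import json

class TracingProxy:
    """Toy interceptor: log every call on `target` before forwarding it."""
    def __init__(self, target, logfile):
        self._target = target
        self._log = logfile

    def __getattr__(self, name):
        real = getattr(self._target, name)
        def wrapper(*args, **kwargs):
            # Record the call to the "trace file" first, then forward it.
            self._log.write(json.dumps({"call": name, "args": args}) + "\n")
            return real(*args, **kwargs)
        return wrapper

class FakeDriver:
    """Stand-in for the real driver-side API implementation."""
    def draw(self, n):
        return n * 2

log = io.StringIO()            # stands in for file.trace
drv = TracingProxy(FakeDriver(), log)
result = drv.draw(21)          # forwarded to the driver *and* recorded
```

Replaying the log later is what lets a third party reproduce the exact API stream without the original application.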


[Diagram: APP ↔ apitrace ↔ NV Driver, with apitrace writing file.trace]

SLIDE 6

TOOLS AND TRIAGE

Traces

More Tracing tools

  • GLIntercept (OpenGL) - https://github.com/dtrebilco/glintercept
  • Useful for error states and other tracing, a little older than apitrace
  • Copy opengl32.dll and gliConfig.ini to the executable’s folder
  • Swapping in the DebugContext.ini config file can give very helpful information, for example on issues with SLI scaling

EXAMPLE:

  OpenGL: Performance(Medium) 131234: SLI performance warning: SLI AFR copy and synchronization for texture mipmaps (42)


SLIDE 7

TOOLS AND TRIAGE

Crashes/TDR

Dump files

  • Mini dump - Always helpful; you can simply right-click the process in Task Manager or Process Explorer and select “Dump to File”

  • Full dump - Better, but larger
  • https://msdn.microsoft.com/en-us/library/windows/desktop/bb787181(v=vs.85).aspx

TDR – Timeout Detection and Recovery

  • Increase the TDR delay – what are the results then?
  • https://msdn.microsoft.com/en-us/library/windows/hardware/ff569918(v=vs.85).aspx
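Per the MSDN TDR registry keys page linked above, the delay is controlled by the `TdrDelay` value under the GraphicsDrivers key. A sketch of a .reg file raising it (the 10-second value is only an example, not a recommendation; a reboot is required for it to take effect):

```
Windows Registry Editor Version 5.00

; Raise the GPU timeout from the default 2 seconds to 10 seconds
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000000a
```

If the hang disappears with a longer delay, the work is likely legitimate but long-running; if it still TDRs, suspect a genuine GPU hang.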


SLIDE 8

TOOLS AND TRIAGE

CPU Profilers - Performance

Intel VTune

  • In-depth perf analysis, finer-tuned control, filters noise | Needs a license, not free

  • https://software.intel.com/en-us/intel-vtune-amplifier-xe

AMD CodeAnalyst

  • Simple, free, runs on both Intel and AMD CPUs | Less robust than VTune; no longer supported
  • http://developer.amd.com/tools-and-sdks/archive/amd-codeanalyst-performance-analyzer/

App bound? Driver bound? GPU bound? Which performance paths are taken?
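The same "where does the time go?" question can be asked of any CPU profiler. As a stand-in for VTune or CodeAnalyst, here is a minimal sketch using Python's built-in cProfile (the workload function is invented for illustration):

```python
import cProfile
import io
import pstats

def simulate_app_work():
    # Hypothetical hot spot: the profiler should attribute most time here.
    return sum(i * i for i in range(200_000))

def main():
    for _ in range(5):
        simulate_app_work()

prof = cProfile.Profile()
prof.runcall(main)

# Sort by cumulative time so callers and their hot callees both surface.
buf = io.StringIO()
stats = pstats.Stats(prof, stream=buf).sort_stats("cumulative")
stats.print_stats(10)
report = buf.getvalue()
```

If the hot entries are all in your own code, the app is CPU-bound in the application; if time pools in API calls, suspect the driver or GPU side.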


SLIDE 9

TOOLS AND TRIAGE

Performance/Resources

Process Explorer

  • Free quick-overview tool - Check loaded .dlls; see load on resources, memory leaks, GPU or CPU bound

  • https://technet.microsoft.com/en-us/sysinternals/processexplorer.aspx

GPUview

  • Free Windows tool included with the Windows Performance Toolkit (WPT)
  • https://graphics.stanford.edu/~mdfisher/GPUView.html
  • https://developer.nvidia.com/content/are-you-running-out-video-memory-detecting-video-memory-overcommitment-using-gpuview


SLIDE 10

PROCESS EXPLORER

SLIDE 11

TOOLS AND TRIAGE

Tools

gDEBugger

  • http://www.gremedy.com/
  • Free OpenGL debugging tool
  • Useful for data gathering, good for tracking state changes; dynamically look at the stream
  • EXAMPLE: Polygon count information from models – a performance bug was root-caused to one mode of the model sending significantly more polygons into the OpenGL pipeline.


SLIDE 12

NVIDIA TOOLS AND LOGS

NVIDIA OpenGL driver error codes

ExternSwak (Swak = Swiss Army Knife)

  • NVIDIA tool used to capture detailed system information
  • Only available under NDA, on the partners site

WSAppNotifier.exe – Profiles

  • For application profile problems, tells you which profiles are running/applied
  • You may have to launch the app twice
  • NDA only, on partner site


SLIDE 13

WSAPPNOTIFIER.EXE

SLIDE 14

TRIAGE/DEBUGGING

Profiles – Things to Try

Changing Global Profiles

  • Workstation App - Dynamic Streaming | Turns off some optimized driver paths
  • 3D App – Game Development | Simulates a GeForce
  • SLI Aware Application | SLI performance testing
  • Threaded optimization = OFF | In Profile settings

Notebooks

  • Try setting NVIDIA GPU to default | In profiles or SBIOS if available


SLIDE 15

RECAP

What tools for what bugs

Crash or TDR

  • TDR Delay RegKeys | Collect dump files | Trace | GPUView

Corruption

  • Trace | Changing profiles

Performance

  • Changing profiles | apitrace | VTune/CodeAnalyst

SLI Scaling

  • Debug Context from GLIntercept


SLIDE 16

TRIAGE/DEBUGGING

Vulkan

https://www.khronos.org/vulkan/

  • New API that puts the application developer in control – the app developer manages GPU memory and resources

Built-in Validation Layer – catches API violations
SDK - https://vulkan.lunarg.com/signin | Needs an account
Demos

  • https://github.com/SaschaWillems/Vulkan | https://github.com/McNopper/Vulkan

RenderDoc | Graphics debugger - https://github.com/baldurk/renderdoc


SLIDE 17

TRIAGE/DEBUGGING

Vulkan

Vulkan Talks

  • S6818 – Vulkan and NVIDIA: The Essentials
  • S6138 – GPU Driven Rendering in Vulkan and OpenGL
  • S6133 – VKCPP: A C++ Layer on Top of Vulkan
  • Three Hangouts, Monday and Tuesday afternoons

Resources

  • https://github.com/KhronosGroup/Khronosdotorg/blob/master/api/vulkan/resources.md


SLIDE 18

BUG PROCESS

Normal External Bug Flow

  • External Bug -> QA -> Triage -> Engineering

Accounts to file bugs

  • partners.nvidia.com – Needs NDA
  • developer.nvidia.com/join
  • Access to early release drivers and NVIDIA tools, report bugs!


SLIDE 19

BUG PROCESS

Overview

NVBUGS – start by filing as a software issue. It is important to have basic reproduction steps:

  • OS, driver, card, application and version if applicable, system information, frequency
  • Severity and impact for you
  • Type - Performance, Crash, Corruption, TDR

Regression information is very helpful if it can be provided


SLIDE 20

SLIDE 21

TOOLS AND TRIAGE

Overview

Simple app/license

  • A trace would be great, no license/app/model needed
  • Avoids delays, very useful when a third party has a repro others can’t get
  • If not possible, then models/scenes/app/license/demo will be needed – Time sink

What to attach to bugs

  • Logs, traces, performance snapshots, dump files, videos, event logs
  • System information via externSwak (NVTOOL)


SLIDE 22

WHAT HAPPENS TO YOUR BUG

ODE = Optimized Driver for Enterprise

  • Long-lived branch
  • Multiple releases or dot versions per branch
  • For production use and certification

QNF = Quadro New Feature

  • Short-lived branch
  • One release per branch
  • Release driver for testing new features and fixes

Fixes -> Driver | Branches

WHQL = Windows Hardware Quality Labs Testing and Signed

SLIDE 23

PREVENTION

What NVIDIA does

ATP and QA

  • We have QA teams with application experts around the world testing applications, GPUs, OSs, and drivers

  • ATP is our automated test harness for further testing to cover more configurations

DVS

  • Driver Validation System. Automated and run with every single code change – 10 million images/tests per day

German Test Lab and Global Test Lab

  • 24/7 automated testing of professional applications and features


SLIDE 24

PREVENTION

Best Process

We want benchmarks and test suites!

  • Early detection of bugs and issues
  • Early detection of performance regressions
  • Get involved in industry-standard benchmarks, for example SPEC

Over to Ross to discuss Performance Benchmark creation!


SLIDE 25

PERFORMANCE BENCHMARKING

A key to high-quality user experience

SLIDE 26

“WHEN YOU CANNOT MEASURE IT…

…your knowledge is of a meagre and unsatisfactory kind” – Lord Kelvin

Anything a computer can do, a human can do. Given enough time… Computers are accelerators. Without good performance, user experience is bad. Benchmarking is the technique to ensure repeatable performance

SLIDE 27

WHAT MAKES A BENCHMARK?

Originally a surveying mark which provided a repeatable reference for placing a leveling rod. Key attributes: #1: repeatable #2: accurate #3: reportable

SLIDE 28

UNITS ARE NOT BENCHMARKS

Many common units exist: MIPS, FLOPS, FPS, LPM, … Just because you can run a test and get units out, does not make your test a benchmark Quiz: if your test returns a result 60 FPS, what might you be measuring? What about 30, 20, 15, … FPS?
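One likely answer to the quiz is sync-to-vblank: on a 60 Hz display with vsync enabled, frame times are rounded up to whole refresh intervals, so the only observable rates are 60/N FPS. A sketch under that assumption (`vsync_fps` is an invented helper, not a real API):

```python
import math

def vsync_fps(raw_frame_ms, refresh_hz=60.0):
    """Observable FPS when every frame must wait for the next vblank."""
    interval_ms = 1000.0 / refresh_hz
    # A frame taking any fraction of N refresh intervals is displayed
    # only after N of them have passed.
    intervals = max(1, math.ceil(raw_frame_ms / interval_ms))
    return refresh_hz / intervals

# A 17 ms frame (just missing one vblank) collapses straight to 30 FPS:
print(vsync_fps(16.0), vsync_fps(17.0), vsync_fps(40.0))  # 60.0 30.0 20.0
```

Seeing exactly 60, 30, 20, or 15 FPS is therefore a hint that you are measuring the display, not the workload.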

SLIDE 29

REPEATABILITY

First principle: make sure the same operations are benchmarked on all configs. Most benchmarks exhibit some randomness in performance. The causes are many; some examples:

  • Non-deterministic operating system process/thread scheduler
  • Disk I/O – variable times to reach a sector with rotational media; variable wear leveling for solid-state media
  • Build-to-build variation due to cache layout changes
  • Virus scan cycles

Rule of thumb: a variation of up to 5% is generally acceptable (if higher, use multiple runs and rely on regression toward the mean)
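That rule of thumb is easy to automate. A sketch (the run data and helper names are invented for illustration):

```python
import statistics

def variation(samples):
    """Relative spread of benchmark scores: (max - min) / mean."""
    return (max(samples) - min(samples)) / statistics.mean(samples)

def stable_score(samples, threshold=0.05):
    """Accept the mean if runs agree within the threshold; otherwise
    fall back to the median, which resists outlier runs."""
    if variation(samples) <= threshold:
        return statistics.mean(samples)
    return statistics.median(samples)

# Example: five runs of a hypothetical benchmark, in frames per second.
runs = [118.2, 119.0, 118.6, 118.9, 118.4]
score = stable_score(runs)
```

A run set that repeatedly exceeds the threshold is a signal to hunt for one of the noise sources listed above before trusting any number.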

SLIDE 30

ACCURACY

“Do these numbers reflect reality?” Always verify assumptions.

  • Do you expect your benchmark to be GPU-limited? Then verify on GPUs with different performance levels.
  • Faster is not always better – ensure work is actually being done that reflects end-user experience.
  • A good benchmark has a means to verify correct operation.
  • Make sure the key portion of your benchmark runs long enough that you can actually measure its performance, not virtual memory subsystem latency or other irrelevant metrics.
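The "verify correct operation" point can be sketched as a benchmark that checks a golden checksum, so a run that skipped the work (or an over-aggressive optimization) fails instead of reporting a bogus speedup. All names here are illustrative:

```python
import time

def run_workload():
    """The work under test; returns a checksum so correctness is verifiable."""
    acc = 0
    for i in range(1, 100_001):
        acc = (acc + i * i) % 1_000_003
    return acc

# Golden value, captured once on a known-good build.
EXPECTED = run_workload()

def benchmark(repeats=3):
    start = time.perf_counter()
    for _ in range(repeats):
        # A "faster" run that produced the wrong checksum would be a
        # measurement error, not an improvement.
        assert run_workload() == EXPECTED, "benchmark did no valid work"
    return (time.perf_counter() - start) / repeats   # seconds per iteration

per_iter = benchmark()
```

Repeating the workload several times also keeps the measured region long enough to dominate timer resolution and startup noise.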

SLIDE 31

NOTES ON TUNING

If you are not measuring properly, you might not be able to make improvements.

  • 60Hz example – sync-to-vblank (default on NVIDIA)
  • Bottleneck shift: a graphics benchmark may start CPU/API-limited, then after tuning move to being limited by GPU vertex processing – or even change to being limited by pixel processing as window sizes change or as workloads shift.
  • Constantly re-evaluate benchmark assumptions when tuning.

SLIDE 32

REPORTABILITY

Your benchmark should yield a metric – FPS, LPM, etc. – that is easily collected for further processing.

  • Output in standard formats – CSV, JSON, XML – many tools exist to format and compare.
  • If your benchmark is repeatable, accurate, and has good reports, you should be able to track performance over multiple builds/revisions of your application.
  • You will also be able to track performance over other changing variables: OS, CPU, GPU driver, memory size, …
  • Important: select a reference score, and keep it constant if at all possible – avoid normalization of deviance.
  • If weighting multiple subtests, consider the relative importance of each subtest to your user community. Use the geometric mean where appropriate.
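The weighted-subtest advice can be sketched as follows (subtest names and weights are invented; the formula is the standard weighted geometric mean):

```python
import json
import math

def composite_score(subtests, weights):
    """Weighted geometric mean: exp(sum(w_i * ln(s_i)) / sum(w_i)).
    Unlike an arithmetic mean, no single subtest can dominate the total."""
    total_w = sum(weights.values())
    log_sum = sum(weights[name] * math.log(score)
                  for name, score in subtests.items())
    return math.exp(log_sum / total_w)

# Hypothetical subtest scores (higher is better) and importance weights.
subtests = {"viewport": 40.0, "render": 10.0, "simulate": 20.0}
weights  = {"viewport": 0.5,  "render": 0.2,  "simulate": 0.3}

score = composite_score(subtests, weights)
# Emit a machine-readable report for tracking across builds.
report = json.dumps({"composite": round(score, 2), "subtests": subtests})
```

Because the geometric mean is bounded by the best and worst subtests, a regression in any one of them always shows up in the composite.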
SLIDE 33

EXAMPLE BENCHMARKS – SPEC APC

Clockwise from right:

  • 3dsmax 2015
  • PTC Creo 3
  • Maya 2012
  • SNX 8.5
  • Solidworks 2015
SLIDE 34

EXAMPLE BENCHMARKS - SPEC VIEWPERF 12

Clockwise from right: Catia-04 Creo-01 Maya-04 Medical-01 Showcase-01 SNX-02 SW-03

SLIDE 35

SNX-02 RESULTS SNAPSHOT

Generated automatically from XML produced by viewset

SLIDE 36

SNX-02 DETAILS

[Charts: results for Test 1, Test 2, Test 5, Test 6, Test 8, and Test 10]

SLIDE 37

MORE SNX-02 DETAILS

Note the varying weights – the sum of all is 100%
SLIDE 38

MORE INFORMATION

SPEC benchmarking group – http://www.spec.org SPEC Graphics and Workstation Performance Group (GWPG): http://www.spec.org/gwpg/publish/gpcfaqs.html Contribute to SPEC GWPG: http://www.spec.org/gwpg/publish/develop_bench.html

“SPEC's Graphics and Workstation Performance Group (SPEC/GWPG) is seeking ISVs, software user groups, publication editors and testing lab directors to help develop and maintain standardized benchmarks based on professional graphics and workstation applications. Organizations or individuals can submit existing benchmarks for consideration by a project group or help the group develop an entirely new benchmark.”

SLIDE 39

BENCHMARK CALL TO ACTION

Benchmark what matters to your users! Create good benchmarks Help us help you! – share your benchmark with NVIDIA and we will put it in our driver regression automation suite to prevent performance bugs Consider sharing application benchmarks with SPEC

SLIDE 40

CALL TO ACTION

Help us help you! Create good unit tests and benchmarks Make use of available software development and analysis tools Be systematic in development and testing Share your unit tests and benchmarks with us (especially if there is a problem) Be clear and concise in your bug reports

SLIDE 41

CONTACT US

Ross Cunniff – rcunniff@nvidia.com Erika Dignam – edignam@nvidia.com

SLIDE 42

April 4-7, 2016 | Silicon Valley

QUESTIONS?

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join

SLIDE 43

SLIDE 44

TRIAGE/DEBUGGING

Things to Look At

Looking at dumps

  • Use WinDbg (in the Windows SDK) and load the dump file
  • Check the call stack; see who’s on it

Performance

  • Check where time is spent in your perf logs
