A low latency GPU engine based reset mechanism for a more robust UI experience


  1. A low latency GPU engine based reset mechanism for a more robust UI experience. Carlos Santa

  2. Agenda:
     - Problem statement
     - What's the limitation in the GPU driver?
     - Proposed solution: what is Timeout Detection and Recovery (TDR)?
     - How low can the latency be?
     - A word about preemption
     - Status of TDR in upstream
     - Q/A

  3. Problem statement: stability and robustness
     - We looked at a specific stability problem affecting the UI experience on Intel Architecture when running GFX/video playback use cases (video-streaming type of app).
     - The behavior was a frozen UI, followed by a black screen, followed by a system reboot, after some random time interval (from hours to many hours).
     - We spent some time understanding the GFX architecture in Chrome OS, as well as a possible solution that could help here.

  4. Current limitation
     1. If a 3D client app "hangs" the GPU, the GPU process may get killed, followed by a full GPU reset.
     2. For a complex use case such as video decode, many frames/objects are in flight, so killing the GPU process and resetting the GPU causes undesirable effects.
     We then realized…
     [Diagram labels: Renderer Process (Client), Shared Memory, GPU Process (Server), Compositor, GL / D3D Context, Video App, Video App Context, GPU Driver, GPU H/W (Video Codec Engine, 3D Render Engine, Media Engine). Callouts: (1) crash/hang, (2) full GPU reset.]

  5. Proposed solution: Timeout Detection & Recovery
     - A new feature for Intel GPUs (upstreaming is WIP) that can increase both stability and robustness by allowing applications to enable hang detection on individual batch buffers.
     - Timeout Detection and Recovery (TDR) allows the different engines in the GPU to be reset independently (as opposed to a full GPU reset).
     - Generally speaking, the implementation introduces a new IRQ handler in the i915 driver, as well as two new GPU watchdog command instructions placed before and after the emitted batch buffer's start instruction in the GPU's ring buffer.

  6. TDR: Step by step
     1. The media driver sets the WD ∆t for the BB.
     2. It flushes the BB.
     3. In the kernel, the ring buffer holds WD_TIMER_START, BB START and WD_TIMER_CANCEL, in that order.
     4. The WD runs until a given time threshold ∆t or the WD_TIMER_CANCEL is reached.
     5. If the timer reaches ∆t, an interrupt is fired and handled by the IRQ handler. A GPU hang is detected!
     6. If the BB completes before ∆t and execution reaches WD_TIMER_CANCEL, the WD is canceled and nothing happens.
     Legend: ∆t = threshold, WD = GPU watchdog, t = time interval.
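
The sequence on this slide can be sketched as a small user-space simulation. This is only a toy model, not the i915 implementation: the `BatchBuffer` class, the `run_with_watchdog` function, the `WATCHDOG_IRQ` event name and the run times are all hypothetical; only the WD_TIMER_START, BB START and WD_TIMER_CANCEL ordering comes from the slide.

```python
import math
from dataclasses import dataclass


@dataclass
class BatchBuffer:
    name: str
    run_time_ms: float  # hypothetical execution time; math.inf models a hang


def run_with_watchdog(bb: BatchBuffer, threshold_ms: float):
    """Replay the ring-buffer sequence for one batch buffer.

    Returns (trace, hang_detected). The trace lists the events in the order
    they would occur; the names are illustrative, not real driver symbols.
    """
    trace = ["WD_TIMER_START", "BB_START"]
    if bb.run_time_ms < threshold_ms:
        # The batch finished in time, so execution reaches the cancel command.
        trace.append("WD_TIMER_CANCEL")
        return trace, False
    # The timer reached the threshold first: the interrupt fires instead.
    trace.append("WATCHDOG_IRQ")
    return trace, True


good = BatchBuffer("decode_frame", run_time_ms=8.0)
hung = BatchBuffer("bad_frame", run_time_ms=math.inf)
print(run_with_watchdog(good, threshold_ms=50.0))
print(run_with_watchdog(hung, threshold_ms=50.0))
```

In this model a batch that runs exactly as long as the threshold counts as hung; where the real hardware draws that line is not stated in the slides.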

  7. Proposed solution:
     1. The UMD media driver starts the watchdog timer after sending batch buffers.
     2. Some time later, the media engine is detected to be in a hung state after the watchdog timer has expired.
     3. The GPU driver resets only the affected media engine.
     4. Because the UMD media driver knows when the faulty batch was submitted, it could take action during the time it takes the media driver to come back from the reset.
     [Diagram labels: GPU Process (Server), Compositor, GL / D3D Context, Video App Context, UMD Media Driver, GPU Driver, GPU H/W (Video Codec Engine, 3D Render Engine, Media Engine). Callouts: (1), (2), (3) media engine GPU reset.]
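
To make the difference in scope concrete, here is a toy comparison of a full GPU reset against a per-engine reset. It assumes, purely for illustration, that a reset discards all work in flight on every engine it touches; the function names and the in-flight counts are hypothetical, and the engine names are taken from the diagram.

```python
def full_gpu_reset(in_flight):
    """Toy model: a full reset discards in-flight work on every engine."""
    return {engine: 0 for engine in in_flight}


def engine_reset(in_flight, hung_engine):
    """Toy model: an engine reset discards work only on the hung engine."""
    after = dict(in_flight)
    after[hung_engine] = 0
    return after


# Hypothetical number of batches in flight per engine when the hang occurs.
in_flight = {"video_codec": 4, "render_3d": 6, "media": 3}

print(full_gpu_reset(in_flight))
print(engine_reset(in_flight, "media"))
```

With these made-up numbers, the per-engine reset leaves the work queued on the video codec and 3D render engines untouched, which is the property the slide relies on.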

  8. How low can the latency be?
     - The whole mechanism works off an arbitrary threshold value that the application can set through an ioctl.
     - However, the threshold can't be too low, or else it can generate too many false positives.
     - Right now we set the threshold value according to the screen resolution (1080p = 50 ms, 4K = 100 ms, 8K = 500 ms and 16K = 2000 ms); however, we are still evaluating all these values.
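
A minimal sketch of how a client might pick one of these thresholds before passing it to the driver. The millisecond values are the ones quoted on the slide, which are still under evaluation. The pixel dimensions assigned to each resolution label, the rule of rounding up to the next tier, and the `watchdog_threshold_ms` function name are assumptions made for illustration; the actual ioctl is not shown.

```python
# (label, assumed maximum pixel count, threshold in ms from the slide)
TIERS = [
    ("1080p", 1920 * 1080, 50),
    ("4K", 3840 * 2160, 100),
    ("8K", 7680 * 4320, 500),
    ("16K", 15360 * 8640, 2000),
]


def watchdog_threshold_ms(width: int, height: int) -> int:
    """Return the threshold of the smallest tier that covers the frame size."""
    pixels = width * height
    for _label, max_pixels, threshold_ms in TIERS:
        if pixels <= max_pixels:
            return threshold_ms
    # Larger than every known tier: fall back to the largest threshold.
    return TIERS[-1][2]


for size in [(1920, 1080), (2560, 1440), (3840, 2160), (7680, 4320)]:
    print(size, watchdog_threshold_ms(*size), "ms")
```

Rounding up to the next tier errs toward a longer timeout, which trades detection latency for fewer false positives.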

  9. A word about preemption
     1. The media driver sets the WD ∆t.
     2. It flushes the BB.
     3. In the kernel, the ring buffer holds WD_TIMER_START, BB START and WD_TIMER_CANCEL.
     - What happens if the BB sequence gets preempted before the WD timer gets canceled? During preemption, the driver must cancel the WD_TIMER_CANCEL command as part of the preemption sequence.
     - What happens to the timer that was already ticking?
     Legend: ∆t = threshold, WD = GPU watchdog, t = time interval.

  10. How could a compositor benefit?
     1. A compositor is fundamentally tasked to produce frames.
     2. In the past, by the time we detected that the GPU was hung, it was too late for the compositor to recover (screen freeze, green or black screen, or a system reboot).
     3. A video client app can now determine early on whether a "task" has caused the Media Engine to crash and, if so, flag to the compositor to show the current frame while the Media Engine comes back from the reset.
     [Diagram labels: Video client, libVA API, VAAPI driver, Compositor, EGL/OGL Client, Mesa 3D, libDRM, KMS, DRM, Kernel, Video Codec Engine, 3D Render Engine, Media Engine. Callouts: (1), (2).]
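
Point 3 can be sketched as a compositor that keeps presenting the last good frame while the engine is being reset. This is an illustrative model only: the `Compositor` class and its method names are hypothetical and do not correspond to any real compositor or libVA API.

```python
class Compositor:
    """Toy compositor that can hold the current frame during an engine reset."""

    def __init__(self):
        self.last_good_frame = None
        self.holding = False

    def on_engine_hang(self):
        # The video client flags that its task hung the media engine.
        self.holding = True

    def on_engine_recovered(self):
        # The engine came back from the reset; accept new frames again.
        self.holding = False

    def present(self, new_frame):
        """Return the frame that would be shown on screen."""
        if self.holding or new_frame is None:
            return self.last_good_frame
        self.last_good_frame = new_frame
        return new_frame


comp = Compositor()
shown = [comp.present("frame-1"), comp.present("frame-2")]
comp.on_engine_hang()
shown.append(comp.present(None))  # no new frame while the engine resets
comp.on_engine_recovered()
shown.append(comp.present("frame-3"))
print(shown)
```

The point of the design is that the screen never goes blank: the viewer sees a briefly repeated frame instead of a freeze followed by a black screen.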

  11. Status of TDR in upstream
     Accepted in upstream:
     - TDR – Reset Engine: Yes
     - TDR – with GuC: WIP
     - TDR – Watchdog: WIP
     - IGT – TDR Watchdog: WIP
     Prototype (TDR – Watchdog):
     - Ubuntu OS w/ drm-tip, iHD and i965 media stacks
     - ffmpeg media decode, Ubuntu OS w/ drm-tip: validated
     - Chromium OS – cros-4.14, video APK on ARC++: validated

  12. How to get involved? All of this work is happening in upstream.
     - TDR kernel patches, code review: https://lists.freedesktop.org/archives/intel-gfx/2019-January/185543.html
     - i965 Media Driver in user space, code review: https://github.com/intel/intel-vaapi-driver/pull/429
     - I can be reached on IRC as csanta
     - Work email: carlos.santa AT intel.com

  13. Questions or feedback?
