An Optimized Diffusion Depth Of Field Solver (DDOF), Holger Gruen


  1. An Optimized Diffusion Depth Of Field Solver (DDOF) • Holger Gruen, AMD • AMD's Favorite Effects, 28th February 2011

  2. Agenda • Motivation • Recap of a high-level explanation of DDOF • Recap of earlier DDOF solvers • A vanilla Cyclic Reduction (CR) DDOF solver • A DX11-optimized CR solver for DDOF • Results

  3. Motivation • The solver presented at GDC 2010 [RS2010] has some weaknesses • It is a great implementation, but its memory requirements and runtime are too high for many game developers • Goal: a faster and more memory-efficient solver

  4. Diffusion DOF recap 1 • DDOF is an enhanced way of blurring a picture that takes an arbitrary circle of confusion (CoC) at each pixel into account • It interprets the input image as a heat distribution • It uses the CoC at a pixel to derive a per-pixel heat conductivity

  5. Diffusion DOF recap 2 • Blurring is done by time-stepping a differential equation that models the diffusion of heat • The ADI method is used to arrive at a separable solution for each step • This requires solving a tri-diagonal linear system for each row and then for each column of the input

  6. DDOF tri-diagonal system

     $$
     \begin{pmatrix}
     b_1 & c_1 &        &        & 0 \\
     a_2 & b_2 & c_2    &        &   \\
         & a_3 & b_3    & c_3    &   \\
         &     & \ddots & \ddots & \ddots \\
     0   &     &        & a_n    & b_n
     \end{pmatrix}
     \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
     =
     \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
     $$

     • $x$: row/col of the input image • $a_i, b_i, c_i$: derived from the CoC at each pixel of an input row/col • $y$: resulting blurred row/col
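To make the system above concrete, here is a minimal Python sketch that builds the three diagonals for one row and solves it with a serial sweep (the Thomas algorithm). The slides do not give the coefficient formula, so `build_system` assumes a common implicit heat-diffusion discretization; the function names and the sample conductivities `beta` are hypothetical.

```python
def build_system(beta):
    """Build the diagonals a, b, c for n = len(beta) + 1 pixels.

    Assumption (not from the slides): beta[i] is the heat conductivity
    between pixel i and pixel i + 1, and one implicit diffusion step gives
    a_i = -beta[i-1], c_i = -beta[i], b_i = 1 - a_i - c_i.
    """
    n = len(beta) + 1
    a = [0.0] + [-v for v in beta]   # a[0] is unused: no left neighbour
    c = [-v for v in beta] + [0.0]   # c[n-1] is unused: no right neighbour
    b = [1.0 - a[i] - c[i] for i in range(n)]
    return a, b, c


def sweep_solve(a, b, c, x):
    """Serial sweep: forward elimination, then backward substitution."""
    n = len(x)
    cp = [0.0] * n   # modified upper diagonal
    xp = [0.0] * n   # modified right-hand side
    cp[0] = c[0] / b[0]
    xp[0] = x[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        xp[i] = (x[i] - a[i] * xp[i - 1]) / denom
    y = [0.0] * n
    y[-1] = xp[-1]
    for i in range(n - 2, -1, -1):
        y[i] = xp[i] - cp[i] * y[i + 1]
    return y


beta = [0.0, 2.0, 2.0, 0.5, 0.0]     # hypothetical conductivities
x = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0]   # one row of one colour channel
a, b, c = build_system(beta)
y = sweep_solve(a, b, c, x)
print([round(v, 4) for v in y])
```

Under this assumed discretization a conductivity of zero isolates a pixel (it stays sharp), and the total amount of "heat" in the row is preserved by the blur.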

  7. Solver recap 1 • The GDC2010 solver [RS2010] is a 'hybrid' solver • It performs three PCR (parallel cyclic reduction) steps upfront • It then runs a serial 'sweep' algorithm to solve the small resulting systems • See [ZCO2010] for details on other hybrid solvers
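The PCR steps mentioned here can be sketched in Python as follows. This is a generic parallel cyclic reduction step, not the [RS2010] implementation; `pcr_step`, `coupled_product` and the sample system are hypothetical. Each step rewrites every equation so that unknown `i` is coupled to `i - 2*stride` and `i + 2*stride` instead of `i - stride` and `i + stride`, so three steps split one system into eight independent interleaved systems.

```python
def pcr_step(a, b, c, x, stride=1):
    """One PCR step: every equation eliminates its two current neighbours.

    Expects a[i] == 0 for i < stride and c[i] == 0 for i >= n - stride.
    All iterations of the loop are independent (one GPU thread each).
    """
    n = len(x)
    na, nb, nc, nx = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
    for i in range(n):
        lo, hi = i - stride, i + stride
        has_lo, has_hi = lo >= 0, hi < n
        k1 = a[i] / b[lo] if has_lo else 0.0
        k2 = c[i] / b[hi] if has_hi else 0.0
        na[i] = -a[lo] * k1 if has_lo else 0.0
        nc[i] = -c[hi] * k2 if has_hi else 0.0
        nb[i] = b[i] - (c[lo] * k1 if has_lo else 0.0) - (a[hi] * k2 if has_hi else 0.0)
        nx[i] = x[i] - (x[lo] * k1 if has_lo else 0.0) - (x[hi] * k2 if has_hi else 0.0)
    return na, nb, nc, nx


def coupled_product(a, b, c, y, stride):
    """Evaluate a_i*y[i-stride] + b_i*y[i] + c_i*y[i+stride] for every i."""
    n = len(y)
    out = []
    for i in range(n):
        v = b[i] * y[i]
        if i - stride >= 0:
            v += a[i] * y[i - stride]
        if i + stride < n:
            v += c[i] * y[i + stride]
        out.append(v)
    return out


n = 16
a = [0.0] + [-1.0] * (n - 1)
b = [4.0] * n
c = [-1.0] * (n - 1) + [0.0]
y_true = [float(i % 5) for i in range(n)]
x = coupled_product(a, b, c, y_true, 1)

stride = 1
for _ in range(3):                     # three PCR steps upfront
    a, b, c, x = pcr_step(a, b, c, x, stride)
    stride *= 2

# The original solution still satisfies the transformed equations.
err = max(abs(l - r) for l, r in zip(coupled_product(a, b, c, y_true, stride), x))
print(stride, err)
```

After the loop, equations whose indices agree modulo 8 form one small tri-diagonal system each, which is what a serial sweep would then solve.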

  8. Solver recap 2 • The GDC2010 solver [RS2010] has drawbacks • It uses a large UAV as a read/write scratch-pad to store the modified coefficients of the sweep algorithm, so GPUs without a read/write cache will suffer • At high resolutions, three PCR steps still leave tri-diagonal systems of substantial size, which means a serial (sweep) algorithm is run on a 'big' system

  9. Solver recap 3 • Cyclic Reduction (CR) solver • Used by [Kass2006] in the original DDOF paper • Runs in two phases: (1) a reduction phase, (2) a backward substitution phase
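The two phases can be sketched with a short recursive Python function. This is a textbook cyclic reduction written for clarity on the CPU, not the solver from the talk; the name `cyclic_reduction` and the sample system are hypothetical. It assumes `a[0] == 0` and `c[n-1] == 0`.

```python
def cyclic_reduction(a, b, c, x):
    """Solve a tri-diagonal system by cyclic reduction (0-based indices)."""
    n = len(x)
    if n == 1:                       # coarsest level: one unknown left
        return [x[0] / b[0]]

    # Reduction phase: keep the odd-indexed unknowns and eliminate their
    # even-indexed neighbours. Every iteration is independent.
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):
        has_right = i + 1 < n
        k1 = a[i] / b[i - 1]
        k2 = c[i] / b[i + 1] if has_right else 0.0
        ra.append(-a[i - 1] * k1)
        rb.append(b[i] - c[i - 1] * k1 - (a[i + 1] * k2 if has_right else 0.0))
        rc.append(-c[i + 1] * k2 if has_right else 0.0)
        rx.append(x[i] - x[i - 1] * k1 - (x[i + 1] * k2 if has_right else 0.0))

    y = [0.0] * n
    y[1::2] = cyclic_reduction(ra, rb, rc, rx)

    # Backward substitution phase: each even-indexed unknown follows from
    # its already solved odd-indexed neighbours.
    for i in range(0, n, 2):
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (x[i] - left - right) / b[i]
    return y


n = 7
a = [0.0] + [-1.0] * (n - 1)
b = [3.0] * n
c = [-1.0] * (n - 1) + [0.0]
y_true = [1.0, 2.0, 0.0, -1.0, 4.0, 0.5, 3.0]
x = [b[i] * y_true[i]
     + (a[i] * y_true[i - 1] if i > 0 else 0.0)
     + (c[i] * y_true[i + 1] if i < n - 1 else 0.0)
     for i in range(n)]
y = cyclic_reduction(a, b, c, x)
print(max(abs(u - v) for u, v in zip(y, y_true)))
```

On a GPU each recursion level would be one pass over a half-size surface, which is why the levels near the bottom of the recursion offer very little parallel work.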

  10. Solver recap 4 • According to [ZCO2010], the CR solver has the lowest computational complexity of all solvers • It suffers from a lack of parallelism, though: at the end of the reduction phase and at the start of the backward substitution phase

  11. Passes of a Vanilla CR Solver • Pass 1: construct abc from the CoC of the input image X, giving one tri-diagonal system per row/col:

      $$
      \begin{pmatrix}
      b_1 & c_1 &        &        & 0 \\
      a_2 & b_2 & c_2    &        &   \\
          & a_3 & b_3    & c_3    &   \\
          &     & \ddots & \ddots & \ddots \\
      0   &     &        & a_n    & b_n
      \end{pmatrix}
      \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
      =
      \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
      $$

  12. Passes of a Vanilla CR Solver • Pass 1: construct abc from the CoC of the input image X • Reduction passes: reduce X and abc repeatedly, stopping at size 1 • Solve for the first y • Substitution passes: substitute repeatedly to obtain the blurred image Y

  13. Vanilla Solver Results • Higher performance than reported in [Bavoil2010] (~6 ms vs. ~8 ms at 1600x1200) • Memory footprint prohibitively high: >200 MB at 1600x1200 • The lack of parallelism still needs to be tackled; an answer is given in [ZCO2010]

  14. Vanilla CR Solver • Pass 1: construct abc from the CoC of the input image X • Reduce X and abc repeatedly, stopping at size 1 • Solve for the first y • Substitute repeatedly to obtain the blurred image Y • Stopping at size 1 and solving for the first y is what kills parallelism

  15. Keeping the parallelism high • Pass 1: construct abc from the CoC of the input image X • Reduce X and abc, but stop at a reasonable size • Solve for Y at that resolution to have a big enough parallel workload (e.g. using PCR, see [ZCO2010]) • Substitute repeatedly to obtain the blurred image Y
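A few lines of Python illustrate why the stopping size matters. The sketch lists how many unknowns per row remain after each reduction pass, assuming each pass halves the system (rounding down); `pass_sizes`, the stop size of 64 and the choice of 1600 unknowns per row with 1200 rows are illustrative assumptions, not values from the slides.

```python
def pass_sizes(n, stop_size=1):
    """Unknowns per row before each reduction pass, down to stop_size."""
    sizes = [n]
    while sizes[-1] > stop_size:
        sizes.append(sizes[-1] // 2)   # each pass keeps every second unknown
    return sizes


rows = 1200   # rows solved in parallel for a 1600x1200 image
for stop in (1, 64):
    sizes = pass_sizes(1600, stop)
    print("stop at", stop, "->", sizes)
    print("  independent unknowns at the coarsest level:", rows * sizes[-1])
```

Stopping early trades a few cheap reduction passes for a coarsest-level solve that still has tens of thousands of independent unknowns to hand to the GPU.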

  16. Memory Optimizations 1 • (Diagram: the solver pipeline of the previous slide. Pass 1 constructs abc from the CoC of the input image X; X and abc are reduced down to a reasonable size; Y is solved at that resolution and substituted back up to the blurred image.)

  17. Memory Optimizations 1 • (Diagram: the same pipeline with surface formats annotated. The X, abc and Y surfaces at every level are all rgba32f.)

  18. Memory Optimizations 1 • (Diagram: the X and Y surfaces are now rgba16f; the abc surfaces stay rgba32f.) • This saves a significant amount of memory • We found no artifacts when going from rgba32f to rgba16f
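As a rough illustration of the saving, the following Python sketch computes the size of a single full-resolution surface in both formats. It only does the per-surface arithmetic (16 bytes per rgba32f texel, 8 bytes per rgba16f texel); how many surfaces the solver allocates at each level is not stated here, so no total is derived. The helper name `surface_bytes` is hypothetical.

```python
def surface_bytes(width, height, bytes_per_texel):
    """Size of one 2D surface in bytes, ignoring padding and alignment."""
    return width * height * bytes_per_texel


RGBA32F = 16   # 4 channels x 4 bytes
RGBA16F = 8    # 4 channels x 2 bytes

full32 = surface_bytes(1600, 1200, RGBA32F)
full16 = surface_bytes(1600, 1200, RGBA16F)
print(full32 / 2**20, "MiB as rgba32f")
print(full16 / 2**20, "MiB as rgba16f")
```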

  19. Memory Optimizations 2 • (Diagram: the same pipeline, with X and Y in rgba16f and abc in rgba32f.) • The saving is again significant, because the surface in question is the biggest one used by the solver

  20. Memory Optimizations 2 • Skip the abc construction pass and compute abc on the fly during the first reduction pass • (Diagram: X and Y in rgba16f; abc in rgba32f exists only from the first reduced level onwards.)

  21. Intermediate Results at 1600x1200

      | Solver | Time on HD5870 (ms) | Time on GTX480 (ms) | Memory (MB) |
      |---|---|---|---|
      | GDC2010 hybrid solver on GTX480 [Bavoil2010] | ~8.5 | 8.00 | ~117 (guesstimate) |
      | Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132 |

  22. Memory Optimizations 3 • Yet again this saves a significant amount of memory • (Diagram: the pipeline of slide 20. The abc construction pass is skipped and abc is computed during the first reduction pass.)

  23. Memory Optimizations 3 • Reduce 4-to-1 in a special first reduction pass (reduce4) • Substitute 1-to-4 in a special substitution pass (substitute4) • Skip the abc construction pass and compute abc during the first reduction pass • Stop at a reasonable size and solve for Y at that resolution

  24. Intermediate Results at 1600x1200

      | Solver | Time on HD5870 (ms) | Time on GTX480 (ms) | Memory (MB) |
      |---|---|---|---|
      | GDC2010 hybrid solver on GTX480 [Bavoil2010] | ~8.5 | 8.00 | ~117 (guesstimate) |
      | Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132 |
      | 4-to-1 reduction | 2.87 | 3.32 | ~73 |

  25. DX11 Memory Optimizations 1 • (Diagram: the pipeline of slide 23, with reduce4 as the special first reduction pass, substitute4 as the special last substitution pass, X and Y in rgba16f, and abc computed during the first reduction pass.)

  26. DX11 Memory Optimizations 1 • Pack abc and X into one rgba_uint surface • (Diagram: the pipeline of slide 23.)

  27. Using SM5 for data packing • Source surfaces: X (rgba16f) and abc (rgba32f); target: one surface with four uint channels • First uint: pack the x and y channels of X: f32tof16(X.x) + (f32tof16(X.y) << 16)
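The packing expression on this slide can be emulated in Python to check that it round-trips. This is a sketch, not shader code: `f32tof16` and `f16tof32` below are stand-ins for the HLSL intrinsics built on the `struct` module's half-precision format, whose rounding may differ from a GPU's, and `pack_xy` / `unpack_xy` are hypothetical names.

```python
import struct


def f32tof16(v):
    """Stand-in for HLSL f32tof16: half-float bit pattern in the low 16 bits."""
    return struct.unpack("<H", struct.pack("<e", v))[0]


def f16tof32(bits):
    """Stand-in for HLSL f16tof32: decode the low 16 bits as a half float."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def pack_xy(x, y):
    """Two halves in one 32-bit uint, as in the expression on the slide."""
    return (f32tof16(x) + (f32tof16(y) << 16)) & 0xFFFFFFFF


def unpack_xy(packed):
    """Inverse of pack_xy: x from the low half, y from the high half."""
    return f16tof32(packed & 0xFFFF), f16tof32(packed >> 16)


packed = pack_xy(1.0, 0.5)
print(hex(packed), unpack_xy(packed))
```

Values that are exactly representable as half floats survive the round trip unchanged; anything else is rounded to the nearest half.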

  28. Using SM5 for data packing • Second uint: keep the upper 26 bits of abc.x and steal its 6 lowest mantissa bits to store some bits of X.z: (asuint(abc.x) & 0xFFFFFFC0) | (f32tof16(X.z) & 0x3F)

  29. Using SM5 for data packing • Third uint: keep the upper 26 bits of abc.y and steal its 6 lowest mantissa bits to store further bits of X.z: (asuint(abc.y) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 6) & 0x3F)
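The bit stealing on the last two slides can be checked with a small Python emulation. The sketch below covers only the two expressions shown (the low 12 bits of the half-float X.z); where the remaining 4 bits go is not part of this excerpt and is not guessed here. `asuint`, `asfloat` and `f32tof16` are stand-ins for the HLSL intrinsics, and `embed` / `extract` are hypothetical names.

```python
import struct


def asuint(v):
    """Stand-in for HLSL asuint: raw bits of a 32-bit float."""
    return struct.unpack("<I", struct.pack("<f", v))[0]


def asfloat(bits):
    """Stand-in for HLSL asfloat: reinterpret 32 bits as a float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def f32tof16(v):
    """Stand-in for HLSL f32tof16 (rounding may differ from a GPU)."""
    return struct.unpack("<H", struct.pack("<e", v))[0]


def embed(abc_x, abc_y, X_z):
    """Hide the low 12 bits of half(X_z) in the mantissas of abc.x and abc.y."""
    h = f32tof16(X_z)
    px = (asuint(abc_x) & 0xFFFFFFC0) | (h & 0x3F)
    py = (asuint(abc_y) & 0xFFFFFFC0) | ((h >> 6) & 0x3F)
    return px, py


def extract(px, py):
    """Recover the slightly truncated coefficients and the 12 hidden bits."""
    low12 = (px & 0x3F) | ((py & 0x3F) << 6)
    return asfloat(px & 0xFFFFFFC0), asfloat(py & 0xFFFFFFC0), low12


px, py = embed(-0.3, 1.7, 0.75)
ax, ay, low12 = extract(px, py)
print(ax, ay, hex(low12))
```

Clearing 6 of the 23 mantissa bits changes each coefficient by a relative error below 2^-17, which is the precision the slides trade for the extra storage.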
