10× Faster Transparency from Low Level Shader Optimisation


  1. 10× Faster Transparency from Low Level Shader Optimisation
     Pyarelal Knowles, Geoff Leach, Fabio Zambetta
     RMIT University, Australia

  2. Opaque | Transparent
     460 (2010): 6 FPS
     Titan X (Now): 142 FPS

  3. 12 Million Polygons
     460 (base): 1 FPS
     Titan X (base): 3 FPS
     Titan X (ours): 30 FPS

  4. 10× Faster Transparency (Software)
     ● Gained with two techniques
       ○ Backwards Memory Allocation - better occupancy
       ○ Register Based Block Sort - faster sorting
     ● Involves low level optimizations (OpenGL+GLSL)
     ● Interesting technical details
     ● Insight from CUDA, similarities to OpenGL
     ● Important to know hardware and language
     ● Now within 10× opaque rendering

  5. Transparency
     ● Objects, glass, visualization
     ● Antialiasing
     ● Particles
     ● Shadow Maps
     Images from: Creo Parametric 2, Portal 2, Shadow Of Mordor, marmoset.co, GRID 2, Unity3D, Chen et al.’s Real-Time Volumetric Shadows using 1D Min-Max Mipmaps, Lokovic and Veach’s Deep Shadow Maps.

  6. Transparency
     ● Transparency uses alpha blending
     ● Weighted average
     ● Based on surface order
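The dependence on surface order can be seen in a small CPU-side sketch. It is written in C++ rather than GLSL so that it runs standalone. `Colour`, `mix` and `blendInOrder` are illustrative names, not code from the talk; `mix` follows the definition of the GLSL built-in, and the white background is an assumption borrowed from the `colour = vec4(1.0)` line on slide 11.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// RGBA colour with components in [0, 1]; stands in for GLSL's vec4.
struct Colour { float r, g, b, a; };

// Weighted average, equivalent to GLSL's mix(x, y, t) = x*(1-t) + y*t.
Colour mix(const Colour& x, const Colour& y, float t) {
    return { x.r + (y.r - x.r) * t,
             x.g + (y.g - x.g) * t,
             x.b + (y.b - x.b) * t,
             x.a + (y.a - x.a) * t };
}

// Blend surfaces over a white background (assumed) in the order given:
// the first element is treated as the farthest surface.
template <std::size_t N>
Colour blendInOrder(const std::array<Colour, N>& surfaces) {
    Colour result{1.0f, 1.0f, 1.0f, 1.0f};
    for (const Colour& c : surfaces) {
        assert(c.a >= 0.0f && c.a <= 1.0f);  // alpha is the blend weight
        result = mix(result, c, c.a);
    }
    return result;
}
```

Blending a half-transparent red surface and then a half-transparent blue one gives a different colour from blending them in the opposite order, which is why the fragments have to be sorted before compositing.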

  7. Not Sorted | Sorted

  8. Sorting for Transparency
     Sort triangles
     ● Geometry dependent
     Sort fragments (potential pixel colors)
     ● Rasterize and store
     ● Geometry independent
     ● Order Independent Transparency (OIT) ...

  9. Order Independent Transparency (OIT)
     ● Two passes
       ○ Build a deep image
       ○ Sort and blend fragments
     ● Exact OIT: sort all fragments
     ● Code snippets
       ○ On my poster
       ○ https://github.com/pknowles/oit

  10. 1. Deep Image
      ● Many fragments per pixel
      ● Construct in fragment shader
        ○ Race conditions
        ○ Different data structures
      Knowles, P.: Real-Time deep image rendering and order independent transparency. PhD Thesis, RMIT University, 2015.
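The slide does not say which data structure is used. As an illustration only, one widely used option is a per-pixel linked list, where the race between fragments landing on the same pixel is resolved with atomic operations. The C++ sketch below is a CPU-side analogue and is not taken from the thesis; `DeepImage`, `Node`, `addFragment` and `countFragments` are hypothetical names, and `std::atomic` stands in for a GPU atomic counter and an atomic exchange on a head-pointer image.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// One stored fragment plus the index of the next fragment of the same pixel.
struct Node { float depth; std::uint32_t rgba; std::int32_t next; };

class DeepImage {
public:
    DeepImage(int pixels, int capacity)
        : heads_(pixels), nodes_(capacity), allocated_(0) {
        for (auto& head : heads_) head.store(-1);  // -1 marks an empty list
    }

    // Append a fragment to a pixel's list. Safe to call concurrently:
    // the counter hands out unique node slots, and the exchange links the
    // new node in front of whatever the list head was at that moment.
    bool addFragment(int pixel, float depth, std::uint32_t rgba) {
        const std::int32_t index = allocated_.fetch_add(1);
        if (index >= static_cast<std::int32_t>(nodes_.size())) return false;  // out of space
        nodes_[index].depth = depth;
        nodes_[index].rgba = rgba;
        nodes_[index].next = heads_[pixel].exchange(index);
        return true;
    }

    // Walk a pixel's list. Intended for a later pass, after all writes finish.
    int countFragments(int pixel) const {
        int count = 0;
        for (std::int32_t i = heads_[pixel].load(); i != -1; i = nodes_[i].next) ++count;
        return count;
    }

private:
    std::vector<std::atomic<std::int32_t>> heads_;
    std::vector<Node> nodes_;
    std::atomic<std::int32_t> allocated_;
};
```

The fragments of a pixel come out in arrival order, not depth order, which is what makes the second pass (sort and composite) necessary.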

  11. 2. Sort and composite
      Full screen pass:
      1. Read all fragments
      2. Sort
      3. Blend
      Bottleneck for large scenes

          vec2 frags[MAX_FRAGS];
          void main() {
              // 1. Read all fragments stored for this pixel
              int count = loadFragments(gl_FragCoord.xy);
              // 2. Sort
              sortFragments(count); //insertion sort
              // 3. Blend in sorted order, starting from a white background
              colour = vec4(1.0);
              for (int i = 0; i < count; ++i) {
                  vec4 c = unpackColour(frags[i].y);
                  colour = mix(colour, c, c.a);
              }
          }
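For readers who want to step through the pass on the CPU, here is a hedged C++ analogue of the three steps. Several details are assumptions, since the slide leaves them out: fragments are stored as a depth plus a colour packed as RGBA8 with red in the low byte, larger depth means farther away, and `MAX_FRAGS` is set to 8. The function names mirror the slide, but the bodies are illustrative, not the author's code.

```cpp
#include <array>
#include <cstdint>

constexpr int MAX_FRAGS = 8;  // hypothetical; the talk sets this from the application

struct Fragment { float depth; std::uint32_t rgba; };
struct Colour { float r, g, b, a; };

// Assumed packing: 8 bits per channel, red in the lowest byte.
Colour unpackColour(std::uint32_t p) {
    return { (p & 0xFFu) / 255.0f,
             ((p >> 8) & 0xFFu) / 255.0f,
             ((p >> 16) & 0xFFu) / 255.0f,
             ((p >> 24) & 0xFFu) / 255.0f };
}

// Insertion sort, farthest fragment first (assumes larger depth = farther).
void sortFragments(std::array<Fragment, MAX_FRAGS>& frags, int count) {
    for (int i = 1; i < count; ++i) {
        const Fragment key = frags[i];
        int j = i - 1;
        while (j >= 0 && frags[j].depth < key.depth) {
            frags[j + 1] = frags[j];
            --j;
        }
        frags[j + 1] = key;
    }
}

// Sort, then blend back to front over a white background.
Colour composite(std::array<Fragment, MAX_FRAGS>& frags, int count) {
    sortFragments(frags, count);
    Colour colour{1.0f, 1.0f, 1.0f, 1.0f};
    for (int i = 0; i < count; ++i) {
        const Colour c = unpackColour(frags[i].rgba);
        colour.r += (c.r - colour.r) * c.a;  // same weighted average as mix()
        colour.g += (c.g - colour.g) * c.a;
        colour.b += (c.b - colour.b) * c.a;
        colour.a += (c.a - colour.a) * c.a;
    }
    return colour;
}
```

Insertion sort does a number of moves that grows quadratically with the fragment count in the worst case, which is consistent with the slide calling this pass the bottleneck for large scenes.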

  12. OpenGL+GLSL vs CUDA

      OpenGL                            | CUDA
      Same hardware (both)
      Graphics                          | Compute
      GLSL Shaders                      | Kernels
      Specification, not implementation | Implementation well documented
      Improving Nsight support          | Good Nsight support

      ● CUDA gives insight into GLSL execution
      ● Some significant architectural differences...

  13. GPU Architecture (diagram, top to bottom)
      ● Global Memory (slow)
      ● L2 Cache
      ● SM/SMX, SM/SMX, SM/SMX
        ○ L1 Cache / Shared / “Local” (faster)
        ○ Registers (fastest)
        ○ SP SP SP SP …

  14. OpenGL vs CUDA - An Interesting Example

          #define SIZE set_by_application
          vec4 myArray[SIZE];
          uniform int zero;
          out vec4 fragColour;
          void main() {
              fragColour = myArray[zero];
          }

      ● Why would allocating more memory make a shader slower?

  15. OpenGL vs CUDA - An Interesting Example
      ● In GLSL, local memory is reserved
      ● The more local memory each thread requires, the fewer threads can be active
      ● Low occupancy
      Diagram: GPU, Global Memory, L2 Cache, SM/SMX ×3, L1 Cache / Shared / “Local”, Registers, SP SP SP SP …, with threads shown occupying local memory
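A deliberately simplified model makes the relationship concrete. It assumes a fixed pool of local memory shared by all resident threads and a hardware cap on thread count; both figures are placeholders passed in by the caller, not properties of any real GPU, and the actual driver policy is not described on the slide. `localBytesPerThread` and `activeThreads` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Bytes of local memory one thread needs for `vec4 myArray[size]`
// (a vec4 is four 32-bit floats).
std::int64_t localBytesPerThread(std::int64_t size) {
    return size * 4 * static_cast<std::int64_t>(sizeof(float));
}

// Simplified model: threads that fit = budget / bytes per thread, capped by
// the hardware limit. Both limits are caller-supplied placeholders.
std::int64_t activeThreads(std::int64_t size,
                           std::int64_t localBudgetBytes,
                           std::int64_t hardwareMaxThreads) {
    assert(size > 0 && localBudgetBytes > 0 && hardwareMaxThreads > 0);
    return std::min(hardwareMaxThreads,
                    localBudgetBytes / localBytesPerThread(size));
}
```

Under this model, with a placeholder budget of 1 MiB and a cap of 2048 threads, `SIZE = 8` leaves the thread count at the cap, while `SIZE = 1024` drops it to 64, even though the shader on slide 14 reads only one element.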

  16. Sorting in OIT
      ● Local memory is fixed
      ● Use conservative maximum
      ● Want dynamic size

          #define MAX_FRAGS set_by_application
          vec2 frags[ MAX_FRAGS ]; //conservative max
          void main() {
              int count = loadFragments(gl_FragCoord.xy);
              sortFragments(count); //insertion sort
              ...
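One way to approximate a dynamic size with fixed-size arrays is to compile several variants of the shader, each with a different `MAX_FRAGS`, and route every pixel to the smallest variant that can hold its fragment count. The C++ sketch below shows only that selection step. It is an illustration of the general idea, not a description of the method in the paper cited on the next slide; the list of sizes and the names `kVariantSizes` and `chooseVariant` are hypothetical.

```cpp
#include <array>
#include <cstddef>

// Hypothetical array sizes, one compiled shader variant per entry.
constexpr std::array<int, 5> kVariantSizes{8, 16, 32, 64, 128};

// Index of the smallest variant whose array can hold `count` fragments,
// or -1 if even the largest variant is too small.
int chooseVariant(int count) {
    for (std::size_t i = 0; i < kVariantSizes.size(); ++i) {
        if (count <= kVariantSizes[i]) return static_cast<int>(i);
    }
    return -1;
}
```

With this routing, a pixel with 9 fragments runs the 16-element variant and no longer pays the occupancy cost of the 128-element one.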

  17. Backwards Memory Allocation
      Knowles, P., Leach, G., Zambetta, F.: Backwards Memory Allocation and Improved OIT. In Proceedings of Pacific Graphics 2013, pages 59–64, October 2013.

  18. Register-Based Block Sort
      ● Local memory still slow
      ● External sort in registers
        ○ From local memory
        ○ Copy blocks to registers
        ○ Sort
        ○ Copy back
        ○ k-way merge
      Knowles, P., Leach, G., Zambetta, F.: Fast Sorting for Exact OIT of Complex Scenes. The Visual Computer (TVCJ), vol. 30, no. 6-8, pages 603–613, June 2014.
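The steps listed on the slide can be sketched on the CPU as follows. This is an analogue under stated assumptions, not the paper's implementation: the block size of 4 is hypothetical, the named locals `r0` to `r3` only stand in for GPU registers, the in-block sort is a fixed five-comparator sorting network chosen for illustration, and the merge writes to a separate output vector.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

constexpr std::size_t BLOCK = 4;  // hypothetical block size

inline void compareSwap(float& a, float& b) { if (b < a) std::swap(a, b); }

// Sort each BLOCK-sized run in place: copy to locals, sort, copy back.
void sortBlocks(std::vector<float>& data) {
    const float pad = std::numeric_limits<float>::infinity();  // fills a short last block
    for (std::size_t base = 0; base < data.size(); base += BLOCK) {
        const std::size_t n = std::min(BLOCK, data.size() - base);
        float r0 = data[base];
        float r1 = n > 1 ? data[base + 1] : pad;
        float r2 = n > 2 ? data[base + 2] : pad;
        float r3 = n > 3 ? data[base + 3] : pad;
        // Sorting network for four values (ascending).
        compareSwap(r0, r1); compareSwap(r2, r3);
        compareSwap(r0, r2); compareSwap(r1, r3);
        compareSwap(r1, r2);
        data[base] = r0;
        if (n > 1) data[base + 1] = r1;
        if (n > 2) data[base + 2] = r2;
        if (n > 3) data[base + 3] = r3;
    }
}

// k-way merge of the sorted blocks: repeatedly take the smallest block head.
std::vector<float> mergeBlocks(const std::vector<float>& data) {
    const std::size_t blocks = (data.size() + BLOCK - 1) / BLOCK;
    std::vector<std::size_t> cursor(blocks);
    for (std::size_t b = 0; b < blocks; ++b) cursor[b] = b * BLOCK;
    std::vector<float> out;
    out.reserve(data.size());
    while (out.size() < data.size()) {
        std::size_t best = blocks;  // "none chosen yet"
        for (std::size_t b = 0; b < blocks; ++b) {
            const std::size_t end = std::min(data.size(), (b + 1) * BLOCK);
            if (cursor[b] < end &&
                (best == blocks || data[cursor[b]] < data[cursor[best]])) {
                best = b;
            }
        }
        out.push_back(data[cursor[best]++]);
    }
    return out;
}

std::vector<float> blockSort(std::vector<float> data) {
    sortBlocks(data);
    return mergeBlocks(data);
}
```

The design point is that the comparisons and swaps inside a block touch only local variables with fixed names, with no indexed array access, so on a GPU they can stay in registers; the slower array is read and written once per element during the block phase.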

  19. Intermediate Compiler Output
      ● glGetProgramBinary
      ● Provided by Nvidia driver
      ● Poor man’s --keep (CUDA)
      Declarations called out on the slide: TEMP R0, R1; ... TEMP lmem[8];

          ##########################################
          OPTION NV_bindless_texture;
          PARAM c[9] = { program.local[0..8] };
          TEMP R0, R1;
          TEMP RC, HC;
          TEMP lmem[8];
          MOV.F lmem[0].x, c[0];
          MOV.F lmem[1].x, c[1];
          MOV.F lmem[2].x, c[2];
          MOV.F lmem[3].x, c[3];
          MOV.S R0.y, {1, 0, 0, 0}.x;
          REP.S ;
          SGE.S.CC HC.x, R0.y, c[8];
          BRK (NE.x);
          MOV.S R0.z, R0.y;
          REP.S ;
          SLE.S.CC HC.x, R0.y, {0, 0, 0, 0};
          BRK (NE.x);
          ADD.S R0.w, R0.z, -{1, 0, 0, 0}.x;
          MOV.U R0.x, R0.z;
          MOV.U R0.w, R0;
          MOV.F R0.x, lmem[R0.x].x;
          MOV.F R1.x, lmem[R0.w].x;
          SGT.F R0.x, R1, R0;
          TRUNC.U.CC HC.x, R0;
          ADD.S R0.z, R0, -{1, 0, 0, 0}.x;
          ENDREP;
          ADD.S R0.y, R0, {1, 0, 0, 0}.x;
          ENDREP;
          ADD.F result.position, R0.y, R0.x;
          END
          ##########################################
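On the host side, the blob is obtained by querying `GL_PROGRAM_BINARY_LENGTH` with `glGetProgramiv` and then calling `glGetProgramBinary`. Assuming the driver embeds the intermediate output as plain ASCII text inside that blob (the slide implies this for Nvidia but does not spell out the layout), a `strings`-like filter is enough to pull the listing out. The sketch below covers only that filtering step, so it needs no OpenGL context; `printableRuns` and its `minLength` parameter are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return every run of printable ASCII at least `minLength` bytes long.
// Newlines end a run, so each text line in the blob becomes one entry.
std::vector<std::string> printableRuns(const std::vector<unsigned char>& blob,
                                       std::size_t minLength = 4) {
    std::vector<std::string> runs;
    std::string current;
    auto flush = [&] {
        if (current.size() >= minLength) runs.push_back(current);
        current.clear();
    };
    for (unsigned char byte : blob) {
        const bool printable = (byte >= 0x20 && byte < 0x7F) || byte == '\t';
        if (printable) current.push_back(static_cast<char>(byte));
        else flush();
    }
    flush();  // keep a run that reaches the end of the blob
    return runs;
}
```

Searching the result for `TEMP` declarations such as `lmem[8]` then shows whether an array ended up in local memory or in registers.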

  20. Results - Milliseconds per frame (Titan X, 1920x1080)

      Scene           | Atrium | Hairball (front / back) | Power Plant
      Baseline        | 7      | 170 / 652               | 374
      BMA+RBS         | 6      | 195 / 212               | 30
      Opaque (no OIT) | 1      | 5 / 3                   | 9

      ● Up to 10× improvement, at worst minor overhead

  21. GPU Progression
      Power plant scene (milliseconds per frame)

      GPU (year) | 460 (2010) | 670 (2012) | Titan (2013) | Titan X (2015)
      Baseline   | 1004       | 670        | 476          | 374
      BMA+RBS    | 258        | 94         | 56           | 30
      Speedup    | 3.9        | 7.1        | 8.5          | 12.3

      ● Speedup improves with each new GPU

  22. Conclusion
      ● Low level optimizations are necessary despite the trend towards higher level languages
      ● Need to be exposed to hardware architecture via language and tools
      ● Perhaps increasingly necessary with newer GPUs
      ● 10× faster OIT with BMA+RBS
      ● Much bigger scenes possible (also bigger displays, e.g. 4K/8K)
      ● Better sorting and deep image rendering
      ● Much closer to opaque rendering speeds
      ● Sorting is no longer the bottleneck in many scenes

  23. Questions? pyarelal.knowles@gmail.com
