10× Faster Transparency from Low Level Shader Optimisation
Pyarelal Knowles, Geoff Leach, Fabio Zambetta
RMIT University, Australia
12 Million Polygons
Opaque: 460 (2010): 6 FPS, Titan X (Now): 142 FPS
Transparent: 460 (base): 1 FPS, Titan X (base): 3 FPS, Titan X (ours): 30 FPS
10× Faster Transparency (Software)
- Gained with two techniques
○ Backwards Memory Allocation - better occupancy
○ Register Based Block Sort - faster sorting
- Involves low level optimizations (OpenGL+GLSL)
- Interesting technical details
- Insight from CUDA, similarities to OpenGL
- Important to know hardware and language
- Now within 10× opaque rendering
Transparency
Antialiasing | Particles | Shadow Maps | Objects, glass, visualization
Images from: Creo Parametric 2, Portal 2, Shadow Of Mordor, marmoset.co, GRID 2, Unity3D, Chen et al.’s Real-Time Volumetric Shadows using 1D Min-Max Mipmaps, Lokovic and Veach’s Deep Shadow Maps.
Transparency
- Transparency uses alpha blending
- Weighted average
- Based on surface order
Not Sorted Sorted
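Why surface order matters can be seen in a small CPU-side sketch of alpha blending. This is not shader code; the function name `blend_over` and the two example surfaces are hypothetical, chosen only to show that the same fragments give different colours when blended in a different order.

```python
def blend_over(dst, src):
    """Blend src (r, g, b, a) over dst (r, g, b): a weighted average by src alpha."""
    r, g, b, a = src
    return tuple(a * s + (1.0 - a) * d for s, d in zip((r, g, b), dst))


background = (1.0, 1.0, 1.0)      # white background
red_far = (1.0, 0.0, 0.0, 0.5)    # hypothetical surface, farther from the camera
blue_near = (0.0, 0.0, 1.0, 0.5)  # hypothetical surface, nearer to the camera

# Sorted: farthest surface is blended first, nearest last.
sorted_result = blend_over(blend_over(background, red_far), blue_near)

# Not sorted: the same two surfaces, blended nearest first.
unsorted_result = blend_over(blend_over(background, blue_near), red_far)

print(sorted_result)    # (0.5, 0.25, 0.75)
print(unsorted_result)  # (0.75, 0.25, 0.5)
```

The two results differ only because of the blend order, which is why the fragments have to be sorted before compositing.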
Sorting for Transparency
Sort triangles
- Geometry dependent
Sort fragments (potential pixel colors)
- Rasterize and store
- Geometry independent
- Order Independent Transparency (OIT) ...
Order Independent Transparency (OIT)
- Two passes
○ Build a deep image
○ Sort and blend fragments
- Exact OIT: sort all fragments
- Code snippets
○ On my poster
○ https://github.com/pknowles/oit
1. Deep Image
- Many fragments per pixel
- Construct in fragment shader
○ Race conditions
○ Different data structures
Knowles, P.: Real-Time deep image rendering and order independent transparency. PhD Thesis, RMIT University, 2015.
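As an illustration of one possible deep image data structure, the sketch below models per-pixel linked lists on the CPU. The slides do not say which structure is used, so this choice is an assumption, and the class `DeepImage` and its methods are hypothetical names. The comments mark where a fragment shader would need atomic operations to avoid the race conditions mentioned above.

```python
class DeepImage:
    """Per-pixel linked lists in flat arrays (a sequential CPU model, not shader code)."""

    def __init__(self, width, height, capacity):
        self.width = width
        self.heads = [-1] * (width * height)  # newest fragment index per pixel, -1 = empty
        self.next = [-1] * capacity           # link to the previously stored fragment
        self.data = [None] * capacity         # (depth, colour) payload per fragment
        self.counter = 0                      # on a GPU this would be an atomic counter

    def add_fragment(self, x, y, depth, colour):
        if self.counter >= len(self.data):
            return False                      # fragment storage is full
        node = self.counter                   # GPU: atomic increment, one slot per thread
        self.counter += 1
        pixel = y * self.width + x
        self.data[node] = (depth, colour)
        self.next[node] = self.heads[pixel]   # GPU: atomic exchange on the head pointer
        self.heads[pixel] = node
        return True

    def fragments(self, x, y):
        """Walk the list for one pixel, newest fragment first."""
        node = self.heads[y * self.width + x]
        out = []
        while node != -1:
            out.append(self.data[node])
            node = self.next[node]
        return out


image = DeepImage(width=2, height=2, capacity=4)
image.add_fragment(0, 0, 0.5, "a")
image.add_fragment(1, 1, 0.2, "b")
image.add_fragment(0, 0, 0.1, "c")
print(image.fragments(0, 0))  # [(0.1, 'c'), (0.5, 'a')]
```

Fragments come back in arrival order, not depth order, which is why the second pass has to sort them.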
2. Sort and composite
vec2 frags[MAX_FRAGS];
void main()
{
    int count = loadFragments(gl_FragCoord.xy);
    sortFragments(count); //insertion sort
    colour = vec4(1.0);
    for (int i = 0; i < count; ++i)
    {
        vec4 c = unpackColour(frags[i].y);
        colour = mix(colour, c, c.a);
    }
}
Full screen pass:
1. Read all fragments
2. Sort
3. Blend
Bottleneck for large scenes
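The sort and blend steps of the full screen pass can be modelled for a single pixel on the CPU. This sketch assumes each fragment is a `(depth, colour)` pair, that a larger depth means farther from the camera, and that blending runs from far to near; those conventions and the function names are assumptions, not taken from the slides. `mix` follows the GLSL definition `x * (1 - a) + y * a`.

```python
def insertion_sort_back_to_front(frags):
    """In-place insertion sort by depth, largest (farthest) depth first."""
    for i in range(1, len(frags)):
        key = frags[i]
        j = i - 1
        while j >= 0 and frags[j][0] < key[0]:
            frags[j + 1] = frags[j]  # shift nearer fragments towards the end
            j -= 1
        frags[j + 1] = key


def mix(x, y, a):
    """Component-wise linear interpolation, as GLSL mix(x, y, a)."""
    return tuple(xi * (1.0 - a) + yi * a for xi, yi in zip(x, y))


def composite(frags):
    """Sort one pixel's fragments, then blend them over a white background."""
    insertion_sort_back_to_front(frags)
    colour = (1.0, 1.0, 1.0, 1.0)
    for _depth, c in frags:
        colour = mix(colour, c, c[3])  # c[3] is the fragment's alpha
    return colour


pixel = [(0.2, (0.0, 0.0, 1.0, 0.5)),  # near, blue
         (0.8, (1.0, 0.0, 0.0, 0.5))]  # far, red
result = composite(pixel)
print(result)  # (0.5, 0.25, 0.75, 0.625)
```

Insertion sort is cheap for a handful of fragments but grows quadratically with the fragment count, which is consistent with this pass being the bottleneck for large scenes.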
OpenGL+GLSL vs CUDA
OpenGL | CUDA
Same hardware (both)
Graphics | Compute
GLSL Shaders | Kernels
Specification, not implementation | Implementation well documented
Improving Nsight support | Good Nsight support
- CUDA gives insight into GLSL execution
- Some significant architectural differences...
GPU Architecture
[Diagram: a GPU containing several SM/SMX units, each with many SPs. Memory hierarchy: Global Memory and L2 Cache (slow), L1 Cache / Shared / “Local” (faster), Registers (fastest).]
OpenGL vs CUDA - An Interesting Example
- Why would allocating more memory make a shader slower?
#define SIZE set_by_application
vec4 myArray[SIZE];
uniform int zero;
out vec4 fragColour;
void main()
{
    fragColour = myArray[zero];
}
OpenGL vs CUDA - An Interesting Example
[Diagram: a GPU containing several SM/SMX units with Global Memory, L2 Cache, Registers and SPs.]
- In GLSL, local memory is reserved per thread
- The more each thread requires
- The fewer threads can be active
- Low occupancy
[Diagram: the L1 Cache / Shared / “Local” memory of an SM/SMX divided between its threads.]
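The occupancy argument is simple arithmetic, shown here as a toy model. The memory budget and thread limit below are made-up numbers, not the specification of any real GPU, and `active_threads` is a hypothetical helper. The point is only that a larger per-thread array leaves room for fewer resident threads.

```python
def active_threads(bytes_per_thread, local_budget_bytes, hw_max_threads):
    """Threads that fit when each one reserves bytes_per_thread of local memory."""
    if bytes_per_thread <= 0:
        return hw_max_threads
    return min(hw_max_threads, local_budget_bytes // bytes_per_thread)


LOCAL_BUDGET = 48 * 1024  # hypothetical local memory budget per SM, in bytes
HW_MAX_THREADS = 2048     # hypothetical resident thread limit per SM
VEC4_BYTES = 16           # four 32-bit floats

# Occupancy (fraction of the thread limit in use) for several values of SIZE.
occupancy = {
    size: active_threads(size * VEC4_BYTES, LOCAL_BUDGET, HW_MAX_THREADS) / HW_MAX_THREADS
    for size in (1, 8, 64, 256)
}
print(occupancy)  # {1: 1.0, 8: 0.1875, 64: 0.0234375, 256: 0.005859375}
```

Under these assumed numbers the shader reads a single element in every case, yet the declared array size alone decides how many threads can run.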
Sorting in OIT
- Local memory is fixed
- Use conservative maximum
- Want dynamic size
#define MAX_FRAGS set_by_application
vec2 frags[MAX_FRAGS]; //conservative max
void main()
{
    int count = loadFragments(gl_FragCoord.xy);
    sortFragments(count); //insertion sort
    ...
Backwards Memory Allocation
Knowles, P., Leach, G., Zambetta, F.: Backwards Memory Allocation and Improved OIT. In Proceedings of Pacific Graphics 2013, pages 59–64, October 2013.
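The slide gives no detail on the technique, so the following is only a sketch of one way to approximate a dynamic array size: group pixels by fragment count into power-of-two size classes, so that each group could be processed by a shader variant compiled with a matching `MAX_FRAGS`. The grouping scheme, the limits of 8 and 256, and the function names are assumptions here; see the cited paper for the actual method.

```python
def size_class(count, min_size=8, max_size=256):
    """Smallest power-of-two array size (within the limits) that fits count fragments."""
    size = min_size
    while size < count and size < max_size:
        size *= 2
    return size


def group_pixels(counts):
    """Map each size class to the list of pixel indices that need it."""
    groups = {}
    for pixel, count in enumerate(counts):
        if count == 0:
            continue  # nothing to sort or blend for empty pixels
        groups.setdefault(size_class(count), []).append(pixel)
    return groups


fragment_counts = [0, 3, 9, 200, 16]  # hypothetical per-pixel fragment counts
print(group_pixels(fragment_counts))  # {8: [1], 16: [2, 4], 256: [3]}
```

With such a grouping only the few deep pixels pay for the largest array, while shallow pixels run with a small array and higher occupancy.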
Register-Based Block Sort
- Local memory still slow
- External sort in registers
○ From local memory
○ Copy blocks to registers
○ Sort
○ Copy back
○ k-way merge
Knowles, P., Leach, G., Zambetta, F.: Fast Sorting for Exact OIT of Complex Scenes. The Visual Computer (TVCJ), vol. 30, no. 6-8, pages 603–613 June 2014.
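The steps listed above can be modelled on the CPU as follows. Python has no registers, so a small local list stands in for them; the block size of 4 and all function names are assumptions for illustration, and a real shader would use unrolled code on scalar variables instead.

```python
BLOCK_SIZE = 4  # hypothetical number of values that fit in registers


def sort_block(block):
    """Copy a block into 'registers' (a local list) and insertion sort it."""
    regs = list(block)
    for i in range(1, len(regs)):
        key = regs[i]
        j = i - 1
        while j >= 0 and regs[j] > key:
            regs[j + 1] = regs[j]
            j -= 1
        regs[j + 1] = key
    return regs


def k_way_merge(blocks):
    """Merge sorted blocks by repeatedly taking the smallest head element."""
    heads = [0] * len(blocks)
    total = sum(len(b) for b in blocks)
    out = []
    while len(out) < total:
        best = None
        for k, block in enumerate(blocks):
            if heads[k] < len(block) and (
                best is None or block[heads[k]] < blocks[best][heads[best]]
            ):
                best = k
        out.append(blocks[best][heads[best]])
        heads[best] += 1
    return out


def block_sort(values, block_size=BLOCK_SIZE):
    """Sort fixed-size blocks independently, then k-way merge them."""
    blocks = [sort_block(values[i:i + block_size])
              for i in range(0, len(values), block_size)]
    return k_way_merge(blocks)


print(block_sort([5, 1, 4, 2, 9, 3, 8, 7, 6, 0]))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Most comparisons and swaps happen inside a block, so in the shader they would touch registers and leave slow local memory for the block copies and the merge.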
Intermediate Compiler Output
##########################################
OPTION NV_bindless_texture;
PARAM c[9] = { program.local[0..8] };
TEMP R0, R1;
TEMP RC, HC;
TEMP lmem[8];
MOV.F lmem[0].x, c[0];
MOV.F lmem[1].x, c[1];
MOV.F lmem[2].x, c[2];
MOV.F lmem[3].x, c[3];
MOV.S R0.y, {1, 0, 0, 0}.x;
REP.S ;
    SGE.S.CC HC.x, R0.y, c[8];
    BRK (NE.x);
    MOV.S R0.z, R0.y;
    REP.S ;
        SLE.S.CC HC.x, R0.y, {0, 0, 0, 0};
        BRK (NE.x);
        ADD.S R0.w, R0.z, -{1, 0, 0, 0}.x;
        MOV.U R0.x, R0.z;
        MOV.U R0.w, R0;
        MOV.F R0.x, lmem[R0.x].x;
        MOV.F R1.x, lmem[R0.w].x;
        SGT.F R0.x, R1, R0;
        TRUNC.U.CC HC.x, R0;
        ADD.S R0.z, R0, -{1, 0, 0, 0}.x;
    ENDREP;
    ADD.S R0.y, R0, {1, 0, 0, 0}.x;
ENDREP;
ADD.F result.position, R0.y, R0.x;
END
##########################################
- glGetProgramBinary
- Provided by Nvidia driver
- Poor man’s --keep (CUDA)
TEMP R0, R1; ... TEMP lmem[8];
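Once the assembly text is available, checking whether an array ended up in registers or in local memory can be automated. The sketch below assumes the text has already been extracted from the blob returned by `glGetProgramBinary` (that step is not shown) and simply looks for `TEMP` declarations with an array size, such as `lmem[8]`. The helper name `local_array_sizes` is hypothetical.

```python
import re


def local_array_sizes(asm_text):
    """Return {name: length} for every array declared in a TEMP statement."""
    sizes = {}
    for declaration in re.findall(r"\bTEMP\s+([^;]+);", asm_text):
        for name, length in re.findall(r"(\w+)\[(\d+)\]", declaration):
            sizes[name] = int(length)
    return sizes


# A few statements in the style of the listing above.
sample = "PARAM c[9] = { program.local[0..8] }; TEMP R0, R1; TEMP RC, HC; TEMP lmem[8];"
print(local_array_sizes(sample))  # {'lmem': 8}
```

An empty result would suggest that every temporary was declared as a plain register, which is the outcome the register-based sort aims for.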
Results - Milliseconds per frame
Titan X, 1920x1080
Scene | Atrium | Hairball (front / back) | Power Plant
Baseline | 7 | 170 / 652 | 374
BMA+RBS | 6 | 195 / 212 | 30
Opaque (no OIT) | 1 | 5 / 3 | 9
- Up to 10x improvement, at worst minor overhead
GPU Progression
GPU (year) | 460 (2010) | 670 (2012) | Titan (2013) | Titan X (2015)
Baseline | 1004 | 670 | 476 | 374
BMA+RBS | 258 | 94 | 56 | 30
Speedup | 3.9 | 7.1 | 8.5 | 12.3
Power plant scene (milliseconds per frame)
- Speedup improves with each new GPU
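The speedup row is the baseline time divided by the BMA+RBS time. The short calculation below recomputes it from the millisecond values in the table; the variable names are arbitrary.

```python
# Power plant scene, milliseconds per frame, as listed in the table.
baseline_ms = {"460 (2010)": 1004, "670 (2012)": 670,
               "Titan (2013)": 476, "Titan X (2015)": 374}
bma_rbs_ms = {"460 (2010)": 258, "670 (2012)": 94,
              "Titan (2013)": 56, "Titan X (2015)": 30}

# Speedup = baseline time / optimised time, rounded to one decimal place.
speedup = {gpu: round(baseline_ms[gpu] / bma_rbs_ms[gpu], 1) for gpu in baseline_ms}

for gpu, factor in speedup.items():
    print(f"{gpu}: {factor}x")
```

From these rounded timings the Titan X works out to about 12.5 rather than the 12.3 shown, presumably because the slide's figure was computed from unrounded measurements. The other three columns match the table.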
Conclusion
- Low level optimizations necessary despite trend for higher level languages
- Need to be exposed to hardware architecture via language and tools
- Perhaps increasingly necessary with newer GPUs
- 10× faster OIT with BMA+RBS
- Much bigger scenes possible (and higher resolution displays, e.g. 4K/8K)
- Better sorting and deep image rendering
- Much closer to opaque rendering speeds
- Sorting is no longer the bottleneck in many scenes