10× Faster Transparency from Low Level Shader Optimisation

Jul 07, 2023


SLIDE 1

10× Faster Transparency from Low Level Shader Optimisation

Pyarelal Knowles Geoff Leach Fabio Zambetta

RMIT University, Australia

SLIDE 2

Opaque / Transparent

460 (2010): 6 FPS
Titan X (Now): 142 FPS

SLIDE 3

12 Million Polygons

460 (base): 1 FPS
Titan X (base): 3 FPS
Titan X (ours): 30 FPS

SLIDE 4

10× Faster Transparency (Software)

Gained with two techniques

○ Backwards Memory Allocation - better occupancy
○ Register-Based Block Sort - faster sorting

Involves low level optimizations (OpenGL+GLSL)
Interesting technical details
Insight from CUDA, similarities to OpenGL
Important to know hardware and language
Now within 10× opaque rendering

SLIDE 5

Transparency

Antialiasing
Particles
Shadow Maps
Objects, glass, visualization

Images from: Creo Parametric 2, Portal 2, Shadow Of Mordor, marmoset.co, GRID 2, Unity3D, Chen et al.’s Real-Time Volumetric Shadows using 1D Min-Max Mipmaps, Lokovic and Veach’s Deep Shadow Maps.

SLIDE 6

Transparency

Transparency uses alpha blending
Weighted average
Based on surface order
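
As a rough illustration of why surface order matters, the sketch below blends two semi-transparent fragments over a white background in both orders. It uses the same weighted average as GLSL's `mix`; the colours and alpha values are made up for the example, and Python is used only as a CPU-side stand-in for shader code.

```python
def mix(dst, src, alpha):
    """Weighted average of two RGB colours, like GLSL mix(dst, src, alpha)."""
    return tuple(d * (1.0 - alpha) + s * alpha for d, s in zip(dst, src))


def blend(background, fragments):
    """Blend (rgb, alpha) fragments over the background, in the order given."""
    colour = background
    for rgb, alpha in fragments:
        colour = mix(colour, rgb, alpha)
    return colour


white = (1.0, 1.0, 1.0)
red = ((1.0, 0.0, 0.0), 0.5)   # hypothetical fragment: red, 50% opaque
blue = ((0.0, 0.0, 1.0), 0.5)  # hypothetical fragment: blue, 50% opaque

red_then_blue = blend(white, [red, blue])
blue_then_red = blend(white, [blue, red])
```

The two results differ, which is why the blend has to follow surface order.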

SLIDE 7

Not Sorted / Sorted

SLIDE 8

Sorting for Transparency

Sort triangles

Geometry dependent

Sort fragments (potential pixel colors)

Rasterize and store
Geometry independent
Order Independent Transparency (OIT) ...

SLIDE 9

Order Independent Transparency (OIT)

Two passes

○ Build a deep image
○ Sort and blend fragments

Exact OIT: sort all fragments
Code snippets

○ On my poster
○ https://github.com/pknowles/oit

SLIDE 10

1. Deep Image
Many fragments per pixel
Construct in fragment shader

○ Race conditions
○ Different data structures

Knowles, P.: Real-Time deep image rendering and order independent transparency. PhD Thesis, RMIT University, 2015.
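
To make "many fragments per pixel" concrete, here is a small CPU-side model of a deep image: a per-pixel list of (depth, colour) pairs. In a real fragment shader the fragments arrive in no guaranteed order and concurrent appends race with each other; this model only imitates the arbitrary order by shuffling. `DeepImage`, `add_fragment`, and `rasterise` are hypothetical names for the sketch, not taken from the authors' code.

```python
import random
from collections import defaultdict


class DeepImage:
    """Per-pixel lists of (depth, colour) fragments."""

    def __init__(self):
        self.pixels = defaultdict(list)

    def add_fragment(self, x, y, depth, colour):
        # On the GPU this append would need atomics or similar to avoid races.
        self.pixels[(x, y)].append((depth, colour))

    def count(self, x, y):
        return len(self.pixels.get((x, y), []))


def rasterise(image, fragments, seed=0):
    """Store fragments in an arbitrary order, as rasterisation would."""
    shuffled = list(fragments)
    random.Random(seed).shuffle(shuffled)
    for x, y, depth, colour in shuffled:
        image.add_fragment(x, y, depth, colour)


image = DeepImage()
rasterise(image, [(0, 0, 0.9, "far"), (0, 0, 0.1, "near"), (1, 0, 0.5, "mid")])
```

Because the stored order is arbitrary, the second pass has to sort each pixel's list before blending.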

SLIDE 11

2. Sort and composite

vec2 frags[MAX_FRAGS];

void main()
{
    int count = loadFragments(gl_FragCoord.xy);
    sortFragments(count); // insertion sort
    colour = vec4(1.0); // start from a white background
    for (int i = 0; i < count; ++i)
    {
        vec4 c = unpackColour(frags[i].y);
        colour = mix(colour, c, c.a); // blend in sorted order
    }
}

Full screen pass:
1. Read all fragments
2. Sort
3. Blend

Bottleneck for large scenes
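
The three steps of the full screen pass can be modelled on the CPU for a single pixel. This is only a sketch: it assumes each fragment is a (depth, rgba) pair, that a larger depth means farther from the camera, and that blending runs back to front over a white background. The function names are hypothetical.

```python
def insertion_sort_far_to_near(frags):
    """In-place insertion sort on (depth, rgba) pairs, largest depth first."""
    for i in range(1, len(frags)):
        key = frags[i]
        j = i - 1
        while j >= 0 and frags[j][0] < key[0]:
            frags[j + 1] = frags[j]
            j -= 1
        frags[j + 1] = key


def composite(frags, background=(1.0, 1.0, 1.0)):
    """Sort one pixel's fragments, then blend them back to front."""
    frags = list(frags)  # work on a copy, like the shader's local array
    insertion_sort_far_to_near(frags)
    colour = background
    for _depth, (r, g, b, a) in frags:
        colour = tuple(d * (1.0 - a) + s * a for d, s in zip(colour, (r, g, b)))
    return colour


# Hypothetical pixel: an opaque green surface in front of a half-transparent red one.
pixel = [(0.2, (0.0, 1.0, 0.0, 1.0)), (0.8, (1.0, 0.0, 0.0, 0.5))]
result = composite(pixel)
```

Since the fragments are sorted first, the result does not depend on the order in which they were stored.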

SLIDE 12

OpenGL+GLSL vs CUDA

Same hardware

OpenGL                               CUDA
Graphics                             Compute
GLSL Shaders                         Kernels
Specification, not implementation    Implementation well documented
Improving Nsight support             Good Nsight support

CUDA gives insight into GLSL execution
Some significant architectural differences...

SLIDE 13

GPU Architecture

[Diagram: memory hierarchy of a GPU. Several SM/SMX units, each containing SPs, share global memory and an L2 cache. Speed, from slow to fastest: global memory, L2 cache, L1 cache / shared / “local” memory, registers.]

SLIDE 14

OpenGL vs CUDA - An Interesting Example

Why would allocating more memory make a shader slower?

#define SIZE set_by_application

vec4 myArray[SIZE];
uniform int zero;

out vec4 fragColour;

void main()
{
    fragColour = myArray[zero];
}

SLIDE 15

OpenGL vs CUDA - An Interesting Example

[Diagram: GPU with SM/SMX units, global memory and L2 cache. Within each SM/SMX, the L1 cache / shared / “local” memory is divided among threads, alongside registers and SPs.]

In GLSL, local memory is reserved per thread
The more each thread requires
The fewer threads can be active
Low occupancy
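
A back-of-envelope model of the effect: if every thread reserves room for its whole array, the number of threads that fit is bounded by the available space divided by the per-thread reservation. The budget and thread limit below are illustrative placeholders, not real hardware figures, and the model ignores everything else that limits occupancy.

```python
def max_active_threads(array_len, bytes_per_element=16,
                       local_budget_bytes=48 * 1024, hw_thread_limit=2048):
    """Threads that fit when each reserves array_len elements of local memory.

    All default numbers are illustrative placeholders, not real GPU limits.
    bytes_per_element=16 corresponds to one vec4 of 32-bit floats.
    """
    per_thread = array_len * bytes_per_element
    return min(hw_thread_limit, local_budget_bytes // per_thread)


def occupancy(array_len, hw_thread_limit=2048):
    """Fraction of the (assumed) thread limit that can be resident."""
    return max_active_threads(array_len, hw_thread_limit=hw_thread_limit) / hw_thread_limit


for size in (1, 8, 64):
    print(size, max_active_threads(size), occupancy(size))
```

Under these assumed numbers, growing the array from 1 to 64 elements cuts the resident thread count sharply even though the shader only ever reads one element.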

SLIDE 16

Sorting in OIT

Local memory is fixed
Use conservative maximum
Want dynamic size

#define MAX_FRAGS set_by_application

vec2 frags[MAX_FRAGS]; // conservative max

void main()
{
    int count = loadFragments(gl_FragCoord.xy);
    sortFragments(count); // insertion sort
    ...

SLIDE 17

Backwards Memory Allocation

Knowles, P., Leach, G., Zambetta, F.: Backwards Memory Allocation and Improved OIT. In Proceedings of Pacific Graphics 2013, pages 59–64, October 2013.
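
This slide gives only the citation, so the following is a loose sketch of one way to get closer to the "dynamic size" asked for on the previous slide: bin pixels by fragment count so that each bin can run a shader variant compiled with a smaller MAX_FRAGS. The power-of-two bin sizes and the function names are assumptions for illustration; the cited paper is the reference for the actual method.

```python
def interval_for(count, sizes=(8, 16, 32, 64, 128, 256)):
    """Smallest array size (hypothetical shader variant) that fits count."""
    for size in sizes:
        if count <= size:
            return size
    return sizes[-1]  # clamp: deeper pixels would be truncated in this sketch


def group_pixels(counts, sizes=(8, 16, 32, 64, 128, 256)):
    """Map each array size to the pixels whose fragment count fits it."""
    groups = {size: [] for size in sizes}
    for pixel, count in counts.items():
        if count > 0:  # empty pixels need no sort pass at all
            groups[interval_for(count, sizes)].append(pixel)
    return groups


# Hypothetical per-pixel fragment counts from the deep image pass.
counts = {(0, 0): 3, (1, 0): 0, (2, 0): 20, (3, 0): 200}
groups = group_pixels(counts)
```

With this grouping, only the pixels that really hold hundreds of fragments pay for the largest array.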

SLIDE 18

Register-Based Block Sort

Local memory still slow
External sort in registers

○ From local memory
○ Copy blocks to registers
○ Sort
○ Copy back
○ k-way merge

Knowles, P., Leach, G., Zambetta, F.: Fast Sorting for Exact OIT of Complex Scenes. The Visual Computer (TVCJ), vol. 30, no. 6-8, pages 603–613, June 2014.
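
The block-then-merge structure listed above can be sketched on the CPU. Here `block_size` stands in for the number of values that fit in registers (an assumed figure), the list method `sort` stands in for whatever register-level sort the shader uses, and `heapq.merge` from the Python standard library performs the k-way merge. It illustrates the shape of an external sort, not the authors' GLSL implementation.

```python
import heapq


def block_sort(values, block_size=4):
    """Sort fixed-size blocks independently, then k-way merge them.

    block_size stands in for how many values fit in registers (assumed).
    """
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]  # "copy block to registers"
        block.sort()                              # sort the small block
        blocks.append(block)                      # "copy back"
    return list(heapq.merge(*blocks))             # k-way merge of sorted blocks


depths = [0.7, 0.1, 0.9, 0.3, 0.5, 0.2, 0.8, 0.4, 0.6]
sorted_depths = block_sort(depths)
```

Each block is small enough to be sorted without touching slower memory, and the merge reads every sorted block once.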

SLIDE 19

Intermediate Compiler Output

##########################################
OPTION NV_bindless_texture;
PARAM c[9] = { program.local[0..8] };
TEMP R0, R1;
TEMP RC, HC;
TEMP lmem[8];
MOV.F lmem[0].x, c[0];
MOV.F lmem[1].x, c[1];
MOV.F lmem[2].x, c[2];
MOV.F lmem[3].x, c[3];
MOV.S R0.y, {1, 0, 0, 0}.x;
REP.S ;
  SGE.S.CC HC.x, R0.y, c[8];
  BRK (NE.x);
  MOV.S R0.z, R0.y;
  REP.S ;
    SLE.S.CC HC.x, R0.y, {0, 0, 0, 0};
    BRK (NE.x);
    ADD.S R0.w, R0.z, -{1, 0, 0, 0}.x;
    MOV.U R0.x, R0.z;
    MOV.U R0.w, R0;
    MOV.F R0.x, lmem[R0.x].x;
    MOV.F R1.x, lmem[R0.w].x;
    SGT.F R0.x, R1, R0;
    TRUNC.U.CC HC.x, R0;
    ADD.S R0.z, R0, -{1, 0, 0, 0}.x;
  ENDREP;
  ADD.S R0.y, R0, {1, 0, 0, 0}.x;
ENDREP;
ADD.F result.position, R0.y, R0.x;
END
##########################################

glGetProgramBinary
Provided by Nvidia driver
Poor man’s --keep (CUDA)

TEMP R0, R1; ... TEMP lmem[8];

SLIDE 20

Results - Milliseconds per frame

Titan X, 1920x1080

Scene              Atrium    Hairball (front / back)    Power Plant
Baseline           7         170 / 652                  374
BMA+RBS            6         195 / 212                  30
Opaque (no OIT)    1         5 / 3                      9

Up to 10× improvement, at worst minor overhead

SLIDE 21

GPU Progression

GPU (year)    460 (2010)    670 (2012)    Titan (2013)    Titan X (2015)
Baseline      1004          670           476             374
BMA+RBS       258           94            56              30
Speedup       3.9           7.1           8.5             12.3

Power plant scene (milliseconds per frame)

Speedup improves with each new GPU
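
The speedup row is the baseline time divided by the BMA+RBS time. The sketch below recomputes it from the millisecond values in the table; the slide's own figures remain authoritative, and recomputing from the rounded timings can differ slightly from them (about 12.5 rather than 12.3 for the Titan X).

```python
# Power plant scene, milliseconds per frame: (baseline, BMA+RBS), from the table.
timings_ms = {
    "460 (2010)": (1004, 258),
    "670 (2012)": (670, 94),
    "Titan (2013)": (476, 56),
    "Titan X (2015)": (374, 30),
}


def speedup(baseline_ms, optimised_ms):
    """How many times faster the optimised version is."""
    return baseline_ms / optimised_ms


speedups = {gpu: speedup(*ms) for gpu, ms in timings_ms.items()}

for gpu, value in speedups.items():
    print(f"{gpu}: {value:.1f}x")
```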

SLIDE 22

Conclusion

Low level optimizations necessary despite trend for higher level languages
Need to be exposed to hardware architecture via language and tools
Perhaps increasingly necessary with newer GPUs
10× faster OIT with BMA+RBS
Much bigger scenes possible (and larger displays, e.g. 4K/8K)
Better sorting and deep image rendering
Much closer to opaque rendering speeds
Sorting is no longer the bottleneck in many scenes

SLIDE 23

Questions?

pyarelal.knowles@gmail.com