
slide-1
SLIDE 1

10× Faster Transparency from Low Level Shader Optimisation

Pyarelal Knowles Geoff Leach Fabio Zambetta

RMIT University, Australia

slide-2
SLIDE 2

Opaque | Transparent

  • 460 (2010): 6 FPS
  • Titan X (Now): 142 FPS

slide-3
SLIDE 3

12 Million Polygons

  • 460 (base): 1 FPS
  • Titan X (base): 3 FPS
  • Titan X (ours): 30 FPS

slide-4
SLIDE 4

10× Faster Transparency (Software)

  • Gained with two techniques
    ○ Backwards Memory Allocation - better occupancy
    ○ Register Based Block Sort - faster sorting
  • Involves low level optimizations (OpenGL+GLSL)
  • Interesting technical details
  • Insight from CUDA, similarities to OpenGL
  • Important to know hardware and language
  • Now within 10× of opaque rendering
slide-5
SLIDE 5

Transparency

  • Antialiasing
  • Particles
  • Shadow Maps
  • Objects, glass, visualization

Images from: Creo Parametric 2, Portal 2, Shadow Of Mordor, marmoset.co, GRID 2, Unity3D, Chen et al.’s Real-Time Volumetric Shadows using 1D Min-Max Mipmaps, Lokovic and Veach’s Deep Shadow Maps.

slide-6
SLIDE 6

Transparency

  • Transparency uses alpha blending
  • Weighted average
  • Based on surface order
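
Why the surface order matters can be seen from the blend itself. The following is a minimal CPU-side sketch in C++, not the talk's shader code; the `Rgba` type and the function names are hypothetical, and a white background is assumed. Each surface is mixed into the colour so far as a weighted average, with the surface's alpha as the weight.

```cpp
#include <cassert>

// RGBA colour with straight (non-premultiplied) alpha; hypothetical type for illustration.
struct Rgba { float r, g, b, a; };

// Weighted average of the colour so far (dst) and a new surface (src),
// weighted by the surface's alpha. Per channel this matches GLSL mix(dst, src, src.a).
Rgba blendOver(const Rgba& dst, const Rgba& src) {
    assert(src.a >= 0.0f && src.a <= 1.0f);
    auto mix = [&](float d, float s) { return d * (1.0f - src.a) + s * src.a; };
    return { mix(dst.r, src.r), mix(dst.g, src.g), mix(dst.b, src.b), mix(dst.a, src.a) };
}

// Blend two surfaces over an assumed white background, farthest surface first.
Rgba compositeTwo(const Rgba& farSurface, const Rgba& nearSurface) {
    Rgba colour{1.0f, 1.0f, 1.0f, 1.0f};
    colour = blendOver(colour, farSurface);
    colour = blendOver(colour, nearSurface);
    return colour;
}
```

Swapping the two arguments of `compositeTwo` changes the result for any pair of differently coloured, partially transparent surfaces, which is why the fragments have to be sorted before blending.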
slide-7
SLIDE 7

Not Sorted | Sorted

slide-8
SLIDE 8

Sorting for Transparency

Sort triangles

  • Geometry dependent

Sort fragments (potential pixel colors)

  • Rasterize and store
  • Geometry independent
  • Order Independent Transparency (OIT) ...
slide-9
SLIDE 9

Order Independent Transparency (OIT)

  • Two passes
    ○ Build a deep image
    ○ Sort and blend fragments
  • Exact OIT: sort all fragments
  • Code snippets
    ○ On my poster
    ○ https://github.com/pknowles/oit

slide-10
SLIDE 10

1. Deep Image

  • Many fragments per pixel
  • Construct in fragment shader
    ○ Race conditions
    ○ Different data structures

Knowles, P.: Real-Time deep image rendering and order independent transparency. PhD Thesis, RMIT University, 2015.
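
To make the race condition concrete, here is a CPU-side C++ sketch of one possible deep image layout: per-pixel linked lists. This layout is an assumption chosen for illustration (the slide only says different data structures exist), and `DeepImage`, `FragmentNode` and the method names are hypothetical. An atomic counter hands each fragment a unique node, and an atomic exchange on the pixel's list head stands in for the atomic operations a fragment shader would use.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One stored fragment: depth, packed colour, and the index of the next node in the same pixel.
struct FragmentNode { float depth; std::uint32_t colour; std::int32_t next; };

// Deep image stored as per-pixel linked lists over one shared node pool.
class DeepImage {
public:
    DeepImage(int width, int height, int capacity)
        : width_(width),
          heads_(static_cast<std::size_t>(width) * static_cast<std::size_t>(height)),
          nodes_(static_cast<std::size_t>(capacity)),
          count_(0) {
        assert(width > 0 && height > 0 && capacity > 0);
        for (auto& head : heads_) head.store(-1);  // -1 marks an empty list
    }

    // What each fragment invocation would do. Returns false once the pool is full.
    bool insert(int x, int y, float depth, std::uint32_t colour) {
        int index = count_.fetch_add(1);  // atomically reserve a unique node
        if (index >= static_cast<int>(nodes_.size())) return false;
        auto& head = heads_[static_cast<std::size_t>(y) * width_ + x];
        int previous = head.exchange(index);  // atomic swap avoids racing on the list head
        nodes_[static_cast<std::size_t>(index)] = FragmentNode{depth, colour, previous};
        return true;
    }

    // Walk one pixel's list; the most recently inserted fragment comes first.
    std::vector<FragmentNode> fragmentsAt(int x, int y) const {
        std::vector<FragmentNode> out;
        int i = heads_[static_cast<std::size_t>(y) * width_ + x].load();
        for (; i != -1; i = nodes_[static_cast<std::size_t>(i)].next)
            out.push_back(nodes_[static_cast<std::size_t>(i)]);
        return out;
    }

private:
    std::size_t width_;
    std::vector<std::atomic<int>> heads_;
    std::vector<FragmentNode> nodes_;
    std::atomic<int> count_;
};
```

The lists come out in arrival order, not depth order, so a second pass still has to sort each pixel's fragments.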

slide-11
SLIDE 11

2. Sort and composite

vec2 frags[MAX_FRAGS];

void main() {
    int count = loadFragments(gl_FragCoord.xy);
    sortFragments(count); // insertion sort
    colour = vec4(1.0);
    for (int i = 0; i < count; ++i) {
        vec4 c = unpackColour(frags[i].y);
        colour = mix(colour, c, c.a);
    }
}

Full screen pass:

  1. Read all fragments
  2. Sort
  3. Blend

Bottleneck for large scenes
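
The per-pixel work of this pass can be emulated on the CPU. The sketch below is a C++ approximation, not the shader itself. It assumes that a larger depth means farther from the camera and that the background is white (as with `vec4(1.0)` in the slide's snippet); the `Fragment` and `Colour` types and `resolvePixel` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One fragment of a pixel: depth plus unpacked colour (hypothetical layout).
struct Fragment { float depth; float r, g, b, a; };
struct Colour { float r, g, b; };

// Insertion sort by depth, farthest first, so blending runs back to front.
void sortFragments(std::vector<Fragment>& frags) {
    for (std::size_t i = 1; i < frags.size(); ++i) {
        Fragment key = frags[i];
        std::size_t j = i;
        while (j > 0 && frags[j - 1].depth < key.depth) {
            frags[j] = frags[j - 1];  // shift nearer fragments towards the end
            --j;
        }
        frags[j] = key;
    }
}

// Steps 2 and 3 of the full screen pass for a single pixel: sort, then blend.
Colour resolvePixel(std::vector<Fragment> frags) {
    sortFragments(frags);
    Colour colour{1.0f, 1.0f, 1.0f};  // assumed white background
    for (const Fragment& f : frags) {
        assert(f.a >= 0.0f && f.a <= 1.0f);
        colour.r = colour.r * (1.0f - f.a) + f.r * f.a;
        colour.g = colour.g * (1.0f - f.a) + f.g * f.a;
        colour.b = colour.b * (1.0f - f.a) + f.b * f.a;
    }
    return colour;
}
```

Insertion sort is quadratic in the fragment count, which fits the slide's point that this pass becomes the bottleneck once pixels hold many fragments.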

slide-12
SLIDE 12

OpenGL+GLSL vs CUDA

OpenGL                            | CUDA
Same hardware (both)
Graphics                          | Compute
GLSL shaders                      | Kernels
Specification, not implementation | Implementation well documented
Improving Nsight support          | Good Nsight support

  • CUDA gives insight into GLSL execution
  • Some significant architectural differences...
slide-13
SLIDE 13

GPU Architecture

Diagram: the GPU contains several SM/SMX units, each with many SPs. Memory hierarchy, from slow to fastest:

  • Global Memory, L2 Cache: slow
  • L1 Cache / Shared / “Local”: faster
  • Registers: fastest

slide-14
SLIDE 14

OpenGL vs CUDA - An Interesting Example

  • Why would allocating more memory make a shader slower?

#define SIZE set_by_application
vec4 myArray[SIZE];
uniform int zero;
out vec4 fragColour;

void main() {
    fragColour = myArray[zero];
}

slide-15
SLIDE 15

OpenGL vs CUDA - An Interesting Example

  • In GLSL, local memory is reserved per thread
  • The more each thread requires
  • The fewer threads can be active
  • Result: low occupancy

Diagram: GPU with SM/SMX units, Global Memory, L2 Cache, Registers and SPs; the L1 Cache / Shared / “Local” memory is divided among the threads.
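
A back-of-the-envelope model shows how quickly a larger array eats into occupancy. The C++ sketch below is deliberately simplistic: the `SmLimits` numbers used with it are made up for illustration and do not describe any particular GPU, and real drivers may spill or cache local memory differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical per-SM limits, chosen only to make the arithmetic concrete.
struct SmLimits {
    std::size_t localBytes;  // on-chip memory available to back per-thread "local" arrays
    std::size_t maxThreads;  // hardware cap on resident threads
};

// Threads that fit when every thread reserves arrayElements * bytesPerElement bytes.
std::size_t activeThreads(const SmLimits& sm, std::size_t arrayElements,
                          std::size_t bytesPerElement) {
    assert(arrayElements > 0 && bytesPerElement > 0);
    std::size_t perThread = arrayElements * bytesPerElement;
    return std::min(sm.maxThreads, sm.localBytes / perThread);
}

// Fraction of the hardware thread limit that can actually be resident.
double occupancy(const SmLimits& sm, std::size_t arrayElements,
                 std::size_t bytesPerElement) {
    return static_cast<double>(activeThreads(sm, arrayElements, bytesPerElement)) /
           static_cast<double>(sm.maxThreads);
}
```

With an assumed 48 KiB budget and a 2048-thread cap, an 8-element array of 8-byte `vec2` values allows 768 threads (occupancy 0.375), while a 256-element array allows only 24.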

slide-16
SLIDE 16

Sorting in OIT

  • Local memory size is fixed at compile time
  • Must use a conservative maximum
  • Want a dynamic size

#define MAX_FRAGS set_by_application
vec2 frags[MAX_FRAGS]; // conservative max

void main() {
    int count = loadFragments(gl_FragCoord.xy);
    sortFragments(count); // insertion sort
    ...

slide-17
SLIDE 17

Backwards Memory Allocation

Knowles, P., Leach, G., Zambetta, F.: Backwards Memory Allocation and Improved OIT. In Proceedings of Pacific Graphics 2013, pages 59–64, October 2013.
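
One way to approximate a dynamic array size is to compile several shader variants with different `MAX_FRAGS` values and route each pixel to the smallest variant that fits its fragment count. The C++ sketch below illustrates only that grouping step. It assumes power-of-two capacities, and the function names are hypothetical; see the cited paper for the actual method.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Smallest capacity, doubling from minCapacity, that holds `count` fragments.
// Counts above maxCapacity are clamped, which would mean dropping fragments.
std::size_t intervalCapacity(std::size_t count, std::size_t minCapacity,
                             std::size_t maxCapacity) {
    assert(minCapacity > 0 && minCapacity <= maxCapacity);
    std::size_t capacity = minCapacity;
    while (capacity < count && capacity < maxCapacity) capacity *= 2;
    return std::min(capacity, maxCapacity);
}

// Group pixel indices by the capacity of the shader variant that would process them.
std::map<std::size_t, std::vector<std::size_t>> groupPixelsByCapacity(
        const std::vector<std::size_t>& fragmentCounts,
        std::size_t minCapacity, std::size_t maxCapacity) {
    std::map<std::size_t, std::vector<std::size_t>> groups;
    for (std::size_t pixel = 0; pixel < fragmentCounts.size(); ++pixel) {
        std::size_t count = fragmentCounts[pixel];
        if (count == 0) continue;  // empty pixels need no sorting or blending
        groups[intervalCapacity(count, minCapacity, maxCapacity)].push_back(pixel);
    }
    return groups;
}
```

Pixels with few fragments then run in a variant with a small array and high occupancy, and only the deepest pixels pay for the conservative maximum.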

slide-18
SLIDE 18

Register-Based Block Sort

  • Local memory still slow
  • External sort in registers
    ○ Copy blocks from local memory to registers
    ○ Sort
    ○ Copy back
    ○ k-way merge

Knowles, P., Leach, G., Zambetta, F.: Fast Sorting for Exact OIT of Complex Scenes. The Visual Computer (TVCJ), vol. 30, no. 6-8, pages 603–613 June 2014.
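
The steps on this slide can be emulated on the CPU. In the C++ sketch below, a small fixed-size `std::array` stands in for registers and a `std::vector` stands in for local memory. The block size of 4 and all names are assumptions for illustration, not values from the paper.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBlockSize = 4;  // hypothetical; small enough to stay in registers

// Copy each block out to a fixed-size array, insertion sort it there, and copy it back.
void sortBlocks(std::vector<float>& depths) {
    for (std::size_t start = 0; start < depths.size(); start += kBlockSize) {
        std::size_t n = std::min(kBlockSize, depths.size() - start);
        auto first = depths.begin() + static_cast<std::ptrdiff_t>(start);
        std::array<float, kBlockSize> regs{};
        std::copy_n(first, n, regs.begin());
        for (std::size_t i = 1; i < n; ++i) {
            float key = regs[i];
            std::size_t j = i;
            while (j > 0 && regs[j - 1] > key) { regs[j] = regs[j - 1]; --j; }
            regs[j] = key;
        }
        std::copy_n(regs.begin(), n, first);
    }
}

// k-way merge: repeatedly take the smallest head element among the sorted blocks.
std::vector<float> mergeBlocks(const std::vector<float>& depths) {
    std::size_t blocks = (depths.size() + kBlockSize - 1) / kBlockSize;
    std::vector<std::size_t> head(blocks);
    for (std::size_t b = 0; b < blocks; ++b) head[b] = b * kBlockSize;
    std::vector<float> out;
    out.reserve(depths.size());
    while (out.size() < depths.size()) {
        std::size_t best = blocks;  // sentinel: no block chosen yet
        for (std::size_t b = 0; b < blocks; ++b) {
            std::size_t end = std::min((b + 1) * kBlockSize, depths.size());
            if (head[b] < end && (best == blocks || depths[head[b]] < depths[head[best]]))
                best = b;
        }
        assert(best != blocks);
        out.push_back(depths[head[best]++]);
    }
    return out;
}

// Full sort: sort the blocks, then merge them into ascending order.
std::vector<float> blockSort(std::vector<float> depths) {
    sortBlocks(depths);
    return mergeBlocks(depths);
}
```

On a GPU the benefit would come from the inner sort touching only registers; this CPU version shows the data movement, not the speedup.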

slide-19
SLIDE 19

Intermediate Compiler Output

##########################################
OPTION NV_bindless_texture;
PARAM c[9] = { program.local[0..8] };
TEMP R0, R1;
TEMP RC, HC;
TEMP lmem[8];
MOV.F lmem[0].x, c[0];
MOV.F lmem[1].x, c[1];
MOV.F lmem[2].x, c[2];
MOV.F lmem[3].x, c[3];
MOV.S R0.y, {1, 0, 0, 0}.x;
REP.S ;
    SGE.S.CC HC.x, R0.y, c[8];
    BRK (NE.x);
    MOV.S R0.z, R0.y;
    REP.S ;
        SLE.S.CC HC.x, R0.y, {0, 0, 0, 0};
        BRK (NE.x);
        ADD.S R0.w, R0.z, -{1, 0, 0, 0}.x;
        MOV.U R0.x, R0.z;
        MOV.U R0.w, R0;
        MOV.F R0.x, lmem[R0.x].x;
        MOV.F R1.x, lmem[R0.w].x;
        SGT.F R0.x, R1, R0;
        TRUNC.U.CC HC.x, R0;
        ADD.S R0.z, R0, -{1, 0, 0, 0}.x;
    ENDREP;
    ADD.S R0.y, R0, {1, 0, 0, 0}.x;
ENDREP;
ADD.F result.position, R0.y, R0.x;
END
##########################################

  • glGetProgramBinary
  • Provided by Nvidia driver
  • Poor man’s version of nvcc’s --keep option (CUDA)

TEMP R0, R1; ... TEMP lmem[8];
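
A small helper can pull readable text such as the listing above out of a program binary. The C++ sketch below works like the Unix `strings` tool. It assumes, as the slide suggests for the Nvidia driver, that the blob contains the assembly as plain text; other drivers may not. `extractText` is a hypothetical name.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// The blob would come from OpenGL, roughly like this (needs a GL context, so not run here):
//   GLint length = 0;
//   glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);
//   std::vector<unsigned char> blob(length);
//   GLenum format = 0;
//   glGetProgramBinary(program, length, nullptr, &format, blob.data());

// Collect runs of printable ASCII (plus newlines and tabs) that are at least minLength long.
std::vector<std::string> extractText(const std::vector<unsigned char>& blob,
                                     std::size_t minLength) {
    assert(minLength > 0);
    std::vector<std::string> runs;
    std::string current;
    auto flush = [&] {
        if (current.size() >= minLength) runs.push_back(current);
        current.clear();
    };
    for (unsigned char byte : blob) {
        if (std::isprint(byte) || byte == '\n' || byte == '\t')
            current.push_back(static_cast<char>(byte));
        else
            flush();  // a non-text byte ends the current run
    }
    flush();
    return runs;
}
```

Searching the extracted text for declarations such as `TEMP lmem[8];` then shows whether an array ended up in local memory.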

slide-20
SLIDE 20

Results - Milliseconds per frame

Titan X, 1920x1080

Scene           | Atrium | Hairball (front / back) | Power Plant
Baseline        | 7      | 170 / 652               | 374
BMA+RBS         | 6      | 195 / 212               | 30
Opaque (no OIT) | 1      | 5 / 3                   | 9

  • Up to 10× improvement, at worst minor overhead
slide-21
SLIDE 21

GPU Progression

GPU (year) | 460 (2010) | 670 (2012) | Titan (2013) | Titan X (2015)
Baseline   | 1004       | 670        | 476          | 374
BMA+RBS    | 258        | 94         | 56           | 30
Speedup    | 3.9        | 7.1        | 8.5          | 12.3

Power plant scene (milliseconds per frame)

  • Speedup improves with each new GPU
slide-22
SLIDE 22

Conclusion

  • Low level optimizations necessary despite trend for higher level languages
  • Developers need exposure to the hardware architecture via language and tools
  • Perhaps increasingly necessary with newer GPUs
  • 10× faster OIT with BMA+RBS
  • Much bigger scenes possible (and bigger displays, e.g. 4K/8K)
  • Better sorting and deep image rendering
  • Much closer to opaque rendering speeds
  • Sorting is no longer the bottleneck in many scenes
slide-23
SLIDE 23

Questions?

pyarelal.knowles@gmail.com