Practical Real-Time Video Rendering with Modern OpenGL and GStreamer
Heinrich Fink (R&D Engineer)
ToolsOnAir is known for its Mac-based software solutions for broadcasting.
http://www.toolsonair.com
But OpenGL has become a very large set of APIs. Even for our simple use case, there are many different code paths for:
Multi vs. single contexts/threads
Persistent buffers
Asynchronous texture transfers
Avoiding implicit synchronisations
GL timer queries
Debug output
GL sync
Application-side concurrent pipelines
Gamma decoding
Chroma scaling filters
10-bit 4:2:2 Y’CbCr to RGBA conversion
GLSL compute shaders
GLSL image load/store operations
Floating-point render formats
GLSL integer ops
Incoherent buffer updates
Well, how do we know which works best?
gl-frame-bender
Input data: user settings, input sequence
frame-bender-lib: C++11 pipeline infrastructure, OpenGL 4.x video pipeline
Output data: performance metrics, rendered output sequence
CopyHostToPBO → UnmapPBO → UnpackPBO → ConvertFormat → Render → ConvertFormat → PackPBO → MapPBO → CopyPBOToHost
Host time per stage:
CopyHostToPBO: 851 μs
UnmapPBO: 5 μs
UnpackPBO: 17 μs
ConvertFormat: 35 μs
Render: 1359 μs
ConvertFormat: 31 μs
PackPBO: 14 μs
MapPBO: 958 μs
CopyPBOToHost: 761 μs
different color.
achieve.
Throughput:
HD 1080p60: 332 MB/s
UHD-1 2160p60: 1327 MB/s
K5200 serial execution: 1550 MB/s
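The two reference values for HD 1080p60 and UHD-1 2160p60 are consistent with 10-bit 4:2:2 video in V210 packing. The following is a minimal sketch of that arithmetic, assuming V210 stores 6 pixels in 16 bytes and pads each row to a multiple of 128 bytes (a common convention that the slides do not state), and that MB means 10^6 bytes. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one V210 row. V210 packs 6 pixels into 16 bytes; rows are
// commonly padded to a multiple of 128 bytes (48 pixels). The padding rule
// is an assumption of this sketch, it is not stated on the slides.
std::uint64_t v210_row_bytes(std::uint32_t width) {
    assert(width > 0);
    const std::uint64_t blocks_of_48_pixels =
        (static_cast<std::uint64_t>(width) + 47) / 48;
    return blocks_of_48_pixels * 128;
}

// Bytes in one full V210 frame.
std::uint64_t v210_frame_bytes(std::uint32_t width, std::uint32_t height) {
    return v210_row_bytes(width) * height;
}

// Sustained one-way throughput in bytes per second for a given frame rate.
std::uint64_t v210_bytes_per_second(std::uint32_t width, std::uint32_t height,
                                    std::uint32_t fps) {
    return v210_frame_bytes(width, height) * fps;
}
```

Under these assumptions, `v210_bytes_per_second(1920, 1080, 60)` gives 331,776,000 bytes/s and `v210_bytes_per_second(3840, 2160, 60)` gives 1,327,104,000 bytes/s, which round to the 332 MB/s and 1327 MB/s values above.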
Single GL context & async host copies
Renderer: Quadro K5200/PCIe/SSE2
Host time per stage:
CopyHostToPBO: 979 μs
UnmapPBO: 2 μs
UnpackPBO: 15 μs
ConvertFormat: 37 μs
Render: 1064 μs
ConvertFormat: 33 μs
PackPBO: 14 μs
MapPBO: 720 μs
CopyPBOToHost: 894 μs
the performance as some CPU-side execution is now done concurrently
actually spent by the GPU
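To illustrate why running the host copies concurrently can help, here is a back-of-the-envelope model using the host times from the "Single GL context & async host copies" measurement. It assumes three threads (upload copy, GL thread, download copy) that each work on a different frame at the same time. That split, and all names below, are assumptions made for this sketch and not a description of how gl-frame-bender schedules or measures its stages.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Each inner vector holds the host times (in microseconds) of stages that
// are assumed to run back to back on one thread.
using StageGroups = std::vector<std::vector<int>>;

// Host times from the "Single GL context & async host copies" slide.
// The split into three threads is an assumption of this sketch.
StageGroups slide_stage_groups() {
    return {
        {979},                           // upload thread: CopyHostToPBO
        {2, 15, 37, 1064, 33, 14, 720},  // GL thread: UnmapPBO ... MapPBO
        {894},                           // download thread: CopyPBOToHost
    };
}

// Time one thread needs to run all of its stages for a single frame.
int group_time_us(const std::vector<int>& stages) {
    return std::accumulate(stages.begin(), stages.end(), 0);
}

// Fully serial execution: every stage of a frame runs one after another.
int serial_frame_time_us(const StageGroups& groups) {
    int total = 0;
    for (const auto& group : groups) total += group_time_us(group);
    return total;
}

// Idealised overlap: the groups work on different frames concurrently,
// so the slowest group bounds the time per frame.
int overlapped_frame_time_us(const StageGroups& groups) {
    assert(!groups.empty());
    int slowest = 0;
    for (const auto& group : groups) {
        slowest = std::max(slowest, group_time_us(group));
    }
    return slowest;
}
```

With the numbers above, the serial sum is 3758 μs per frame, while the idealised overlapped bound is 1885 μs, set by the GL thread. Real throughput also depends on synchronisation and memory bandwidth, which this model ignores.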
Throughput:
HD 1080p60: 332 MB/s
UHD-1 2160p60: 1327 MB/s
K5200 serial execution: 1550 MB/s
K5200 host-parallel: 2890 MB/s
Single GL context & async host copies
GPU time per stage:
UnmapPBO: 1 μs
UnpackPBO: 476 μs
ConvertFormat: 373 μs
Render: 343 μs
ConvertFormat: 217 μs
PackPBO: 475 μs
MapPBO: 1 μs
where the GPU is actually spending time.
already using async PBO transfers
Multiple GL contexts & async host copies
GPU time per stage:
UnmapPBO: 1 μs
UnpackPBO: 1163 μs
ConvertFormat: 382 μs
Render: 537 μs
ConvertFormat: 228 μs
PackPBO: 1153 μs
MapPBO: 2 μs
render and download
engines (aka dual-copy engines)
Throughput:
HD 1080p60: 332 MB/s
UHD-1 2160p60: 1327 MB/s
K5200 serial execution: 1550 MB/s
K5200 host-parallel: 2890 MB/s
K5200 host & GPU-parallel: 5060 MB/s
in real time
V210 input texture to RGB texture (figure):
word 0: Cr0 Y'0 Cb0
word 1: Y'2 Cb2 Y'1
word 2: Cb4 Y'3 Cr2
word 3: Y'5 Cr4 Y'4
word 0 (next group): Cr6 Y'6 Cb6
RGB texture: pixel 0 (R0 G0 B0), pixel 1 (R1 G1 B1), pixel 2 (R2 G2 B2), pixel 3 (R3 G3 B3), pixel 4 (R4 G4 B4), pixel 5 (R5 G5 B5)
invocation 1, invocation 2
V210 input texture to RGB texture with ARB_shader_image_load_store (figure):
words 0-3: Cr0 Y'0 Cb0 | Y'2 Cb2 Y'1 | Cb4 Y'3 Cr2 | Y'5 Cr4 Y'4 → pixels 0-5 (R0 G0 B0 to R5 G5 B5)
words 0-3: Cr6 Y'6 Cb6 | Y'8 Cb8 Y'7 | Cb10 Y'9 Cr8 | Y'11 Cr10 Y'10 → pixels 0-5 (R6 G6 B6 to R11 G11 B11)
word 0 (next group): Cr12 Y'12 Cb12
invocation 1, invocation 2
ARB_shader_image_load_store
texture images.
GLSL 4.0+ bitfield ops with GL_R32UI instead of GL_RGB10_A2UI
GL_RGBA16F to accommodate Y’CbCr out-of-bounds values
Intel IPP ippiCopyManaged with IPP_NONTEMPORAL_STORE
ARB_buffer_storage with persistent memory
using the native pixel format RGB10_A2UI. The difference here is about 1.6 GB/sec.
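The bitfield approach can be sketched on the CPU. The code below unpacks one V210 group (four 32-bit words, six pixels) with shifts and masks, in the spirit of GLSL's `bitfieldExtract` applied to a GL_R32UI texel. It is a hedged illustration and not the shader used in gl-frame-bender; the struct and function names are hypothetical, and the bit positions follow the standard V210 layout (three 10-bit components per little-endian word, lowest component in bits 0-9).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Components of one V210 group: 6 luma samples and 3 Cb/Cr pairs (4:2:2).
struct V210Group {
    std::array<std::uint16_t, 6> y;
    std::array<std::uint16_t, 3> cb;
    std::array<std::uint16_t, 3> cr;
};

// CPU analogue of GLSL bitfieldExtract(uint value, int offset, int bits).
inline std::uint32_t bitfield_extract(std::uint32_t value, int offset, int bits) {
    assert(offset >= 0 && bits > 0 && bits < 32 && offset + bits <= 32);
    return (value >> offset) & ((1u << bits) - 1u);
}

// Helper for building test data: three 10-bit components in one word.
inline std::uint32_t pack_word(std::uint32_t low, std::uint32_t mid, std::uint32_t high) {
    assert(low < 1024 && mid < 1024 && high < 1024);
    return low | (mid << 10) | (high << 20);
}

// Unpack four V210 words. Per word, from bits 0-9 upwards:
// word 0: Cb0 Y0 Cr0, word 1: Y1 Cb2 Y2, word 2: Cr2 Y3 Cb4, word 3: Y4 Cr4 Y5.
V210Group unpack_v210_group(const std::array<std::uint32_t, 4>& w) {
    V210Group g;
    g.cb[0] = static_cast<std::uint16_t>(bitfield_extract(w[0], 0, 10));
    g.y[0]  = static_cast<std::uint16_t>(bitfield_extract(w[0], 10, 10));
    g.cr[0] = static_cast<std::uint16_t>(bitfield_extract(w[0], 20, 10));
    g.y[1]  = static_cast<std::uint16_t>(bitfield_extract(w[1], 0, 10));
    g.cb[1] = static_cast<std::uint16_t>(bitfield_extract(w[1], 10, 10));
    g.y[2]  = static_cast<std::uint16_t>(bitfield_extract(w[1], 20, 10));
    g.cr[1] = static_cast<std::uint16_t>(bitfield_extract(w[2], 0, 10));
    g.y[3]  = static_cast<std::uint16_t>(bitfield_extract(w[2], 10, 10));
    g.cb[2] = static_cast<std::uint16_t>(bitfield_extract(w[2], 20, 10));
    g.y[4]  = static_cast<std::uint16_t>(bitfield_extract(w[3], 0, 10));
    g.cr[2] = static_cast<std::uint16_t>(bitfield_extract(w[3], 10, 10));
    g.y[5]  = static_cast<std::uint16_t>(bitfield_extract(w[3], 20, 10));
    return g;
}
```

Each chroma pair is shared by two neighbouring pixels, so pixels 0 and 1 use `cb[0]`/`cr[0]`, pixels 2 and 3 use `cb[1]`/`cr[1]`, and pixels 4 and 5 use `cb[2]`/`cr[2]` unless a chroma scaling filter interpolates between them.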
texture formats. Using floating-point render formats allows us to keep those negative values and reduces the precision loss converting back and forth between luma/chroma and RGB.
advantage of persistently mapped PBO memory (e.g. keeping PBO-mapped pointers in memory pools for decoder, etc).
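The point about floating-point render formats can be made concrete with a small conversion sketch. It assumes BT.709 luma coefficients and 10-bit narrow-range code values (luma 64 to 940, chroma centred on 512); the slides do not say which matrix or range the project uses, so treat both as assumptions. The function names are hypothetical.

```cpp
#include <cassert>

struct Rgb {
    float r, g, b;
};

// Convert 10-bit narrow-range Y'CbCr code values to non-linear R'G'B'
// WITHOUT clamping. BT.709 coefficients are an assumption of this sketch.
Rgb ycbcr10_to_rgb_unclamped(int y, int cb, int cr) {
    assert(y >= 0 && y <= 1023 && cb >= 0 && cb <= 1023 && cr >= 0 && cr <= 1023);
    const float kr = 0.2126f;
    const float kb = 0.0722f;
    const float kg = 1.0f - kr - kb;
    const float luma = (y - 64) / 876.0f;   // 0.0 at code 64, 1.0 at code 940
    const float pb = (cb - 512) / 896.0f;   // roughly -0.5 .. 0.5
    const float pr = (cr - 512) / 896.0f;
    Rgb out;
    out.r = luma + 2.0f * (1.0f - kr) * pr;
    out.b = luma + 2.0f * (1.0f - kb) * pb;
    out.g = (luma - kr * out.r - kb * out.b) / kg;
    return out;
}

// True if the colour would survive storage in an unsigned normalized
// format, which only represents values in [0, 1].
bool fits_unsigned_normalized(const Rgb& c) {
    return c.r >= 0.0f && c.r <= 1.0f &&
           c.g >= 0.0f && c.g <= 1.0f &&
           c.b >= 0.0f && c.b <= 1.0f;
}
```

For example, a legal code triple such as Y'=64, Cb=512, Cr=960 yields a negative green value. A floating-point target such as GL_RGBA16F can hold that value, whereas an unsigned normalized format would clamp it to zero and the round trip back to Y'CbCr would no longer reproduce the input.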
for visualisation and some other handy tools, is now online on GitHub as of this afternoon.
https://github.com/ToolsOnAir/gl-frame-bender
processing
GStreamer
source [src] → [sink] filter [src] → [sink] sink
libav codecs, Blackmagic Decklink SDI, OpenGL processing, streaming protocols, … much more
This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 610370.
Immersive Coverage of Spatially Outspread Live Events
large-scale outdoor live production
glvideomixer
Inputs: input 0 … input N
Outputs: mix out 0 … mix out M, monitor out 0 … monitor out N
File / SDI device / Network stream
Dröge of Centricular for these matters
VDPAU
NVENC
GPUDirect for Video
OpenGL 4.x support
Async texture transfers
Sebastian Dröge sebastian@centricular.com
Centricular
#1 gl-frame-bender
OpenGL 4.x recipes for high-performance real-time processing of professional video
#2 GStreamer live mixing
Build powerful real-world live mixing using state-of-the-art open-source tools
hfink@toolsonair.com Twitter: @heinrichfink
This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 610370.
http://www.icosole.eu