where information lives
Seamless Audio Splicing for ISO/IEC 13818 Transport Streams
A New Framework for Audio Elementary Stream Tailoring and Modeling
Seyfullah Halit Oguz, Ph.D.
—EMC Engineering will spend well over $1 Billion
—Over 300 engineers concentrating on Celerra Server
—The Media Solutions Group Lab has in excess of
—Deploying EMC Products into the Rich Media
—Develop products and solutions which meet the
—Developing partnerships with key companies to
—Provide Professional Services in the Rich Media
* “generic”: No constraining assumptions are made about signal formats (e.g. the video frame rate (PAL, NTSC), the audio sampling frequency), or various encoding parameters (e.g. the audio bit rate, the layer of audio encoding algorithm employed).
* “audio elementary streams”: Current focus will be on the audio elementary
* “transport streams”: Achieve audio elementary stream splicing directly on transport streams with lowest possible complexity.
*(ISO/IEC 11172-3 Layer II audio coding with a sampling frequency of 48 kHz and an audio bit rate of 192 kbit/s is assumed only for illustrative purposes.)
**(The NTSC frame rate is assumed only for illustrative purposes.)
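The illustrative parameters fix the size and duration of the units used on the following slides. As a quick check, the Python sketch below derives the APU duration, the AAU size and the APU length in 90 kHz PTS ticks. The Layer II frame length of 1152 samples and the 90 kHz time stamp resolution are not stated on the slide and are taken here as assumptions from the MPEG standards; the variable names are illustrative only.

```python
from fractions import Fraction

SAMPLES_PER_AAU = 1152      # Layer II frame length (assumption, not on the slide)
SAMPLING_HZ = 48_000        # illustrative sampling frequency
AUDIO_BITRATE = 192_000     # illustrative audio bit rate, bits per second
PTS_CLOCK_HZ = 90_000       # resolution of presentation time stamps (assumption)

apu_seconds = Fraction(SAMPLES_PER_AAU, SAMPLING_HZ)   # duration of one APU
aau_bytes = AUDIO_BITRATE * apu_seconds / 8            # size of one AAU
apu_pts_ticks = apu_seconds * PTS_CLOCK_HZ             # APU duration in PTS ticks
vpu_seconds = 1 / Fraction("29.97")                    # frame period as 1 / 29.97

print(float(apu_seconds), int(aau_bytes), int(apu_pts_ticks),
      round(float(vpu_seconds), 5))
```

Under these assumptions each AAU is 576 bytes long and decodes to an APU of 0.024 seconds (2160 PTS ticks).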
[Figure: presentation timelines of video presentation units (VPUs) and audio presentation units (APUs), with S and E marking the start and end of each unit. Each VPU lasts (1 / 29.97) seconds ~= 0.03337 seconds; each APU lasts 0.024 seconds. Panel "AT THE BEGINNING": VPU 0, VPU 1, VPU 2 and APU 0, APU 1, APU 2. Panel "LATER": VPU (k-2) ... VPU (k+2) and APU (j-2) ... APU (j+3).]
The start of a VPU will be aligned with the start of an APU possibly at the beginning of a stream and then only at multiples of 5-minute increments in time. This implies that, for all practical purposes, they will not be aligned again later.
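The 5-minute figure can be checked with exact rational arithmetic. The sketch below is a minimal illustration, assuming the frame period is exactly 1 / 29.97 seconds as written on the slide and the APU duration is 0.024 seconds; the function name `coincidence_period` is hypothetical.

```python
from fractions import Fraction
from math import gcd


def coincidence_period(a: Fraction, b: Fraction) -> Fraction:
    """Smallest positive time that is a whole multiple of both durations."""
    # For reduced fractions p/q and r/s the least common multiple is
    # lcm(p, r) / gcd(q, s).
    numerator = a.numerator * b.numerator // gcd(a.numerator, b.numerator)
    return Fraction(numerator, gcd(a.denominator, b.denominator))


APU = Fraction(1152, 48_000)     # 0.024 s
VPU = 1 / Fraction("29.97")      # frame period as written on the slide

period = coincidence_period(APU, VPU)
print(period, "s =", period / VPU, "VPUs =", period / APU, "APUs")
```

With these values the period is 300 seconds (8991 VPUs, 12500 APUs). The result is sensitive to the exact frame rate assumed: passing `Fraction(1001, 30000)`, the exact NTSC frame period, gives 24.024 seconds instead. In either case a splice point chosen on an arbitrary VPU boundary will generally not fall on an APU boundary.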
[Figure: ending stream (time base #1) with VPU (k-2) ... VPU (k+2) and APU (j-2) ... APU (j+3), and beginning stream (time base #2) with VPU (n-2) ... VPU (n+2) and APU (m-2) ... APU (m+3); S and E mark the start and end of each unit.]
Splicing point is naturally defined with respect to VPUs.
APUs are available only through the decoding of their corresponding AAUs. Fractional (i.e. truncated) AAUs in the encoded data domain are useless.
[Figure: VPU (k-2) ... VPU (k+2) shown against the ending stream's APU (j-2) ... APU (j+3) and the beginning stream's APU (m-3) ... APU (m+2); S and E mark the start and end of each unit.]
Time base of the beginning stream is shifted to achieve video presentation continuity.
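A minimal sketch of such a time base shift follows. It assumes 33-bit PTS values counted at 90 kHz and a VPU duration of 3003 ticks (NTSC); the function names are hypothetical and the slides do not prescribe this particular formulation.

```python
PTS_MODULUS = 1 << 33   # PTS values are 33-bit counters and wrap around


def time_base_offset(last_vpu_pts_ending: int,
                     first_vpu_pts_beginning: int,
                     vpu_ticks: int = 3003) -> int:
    """Offset to add to every time stamp of the beginning stream so that its
    first VPU is presented right after the last VPU of the ending stream."""
    target = (last_vpu_pts_ending + vpu_ticks) % PTS_MODULUS
    return (target - first_vpu_pts_beginning) % PTS_MODULUS


def shift(pts: int, offset: int) -> int:
    """Apply the offset to one time stamp, honoring the 33-bit wrap-around."""
    return (pts + offset) % PTS_MODULUS


offset = time_base_offset(last_vpu_pts_ending=900_000,
                          first_vpu_pts_beginning=123_456)
print(shift(123_456, offset))
```

The same offset is applied to the audio time stamps of the beginning stream, which is why the audio of the two streams generally ends up with a gap or an overlap at the splice point.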
[Figure: VPU (k-1), VPU k, VPU (k+1), VPU (k+2) with four 12 msec. intervals marked; APU (j+1) of the ending stream is labeled "Best aligned final APU" and APU m of the beginning stream is labeled "Best aligned initial APU", with the labels “short” and “long”.]
The APU of the ending stream whose presentation interval ends within the identified 24 ms interval is called the “best aligned final APU”.
The APU of the beginning stream whose presentation interval starts within the identified 24 ms interval is called the “best aligned initial APU”.
There is a comprehensive list of 8 possible cases that can be identified regarding the alignment of ending and beginning audio streams based on the above definitions.
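The definitions above can be sketched in a few lines of Python. Several points are assumptions made only for illustration: the 24 ms interval is taken to be centered on the splice point, "short" is read as a final APU that ends before the splice point or an initial APU that starts after it, and all times are in 90 kHz ticks on the shifted (common) time base. The function names are hypothetical.

```python
APU_TICKS = 2160             # 0.024 s at 90 kHz
HALF = APU_TICKS // 2        # 12 ms on either side of the splice point


def best_aligned_final(ending_starts, splice):
    """Index of the ending-stream APU whose end falls in the 24 ms window."""
    for i, start in enumerate(ending_starts):
        if splice - HALF <= start + APU_TICKS < splice + HALF:
            return i
    raise ValueError("no APU ends inside the window")


def best_aligned_initial(beginning_starts, splice):
    """Index of the beginning-stream APU whose start falls in the window."""
    for i, start in enumerate(beginning_starts):
        if splice - HALF <= start < splice + HALF:
            return i
    raise ValueError("no APU starts inside the window")


def classify(final_end, initial_start, splice):
    """Return (final kind, initial kind, gap); a negative gap is an overlap."""
    final_kind = "short" if final_end < splice else "long"
    initial_kind = "short" if initial_start > splice else "long"
    return final_kind, initial_kind, initial_start - final_end


ending = [0, 2160, 4320]
beginning = [4000, 6160, 8320]
splice = 6000
j = best_aligned_final(ending, splice)
m = best_aligned_initial(beginning, splice)
print(classify(ending[j] + APU_TICKS, beginning[m], splice))
```

In this example the final APU is long, the initial APU is short and the two overlap by 320 ticks (about 3.6 ms), which is the situation labeled Case 6 on the later slides.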
REQUIRED PROCESSING AT ELEMENTARY STREAM LEVEL
* Truncate the ending audio [...] aligned final APU [...] the best aligned final APU, [...] accordingly.
* Start the beginning audio [...] aligned initial APU [...] associated with the best aligned initial APU, [...] accordingly.
* Re-stamp the audio PTSs [...] the first and all subsequent audio PES packets accordingly.
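The three processing steps listed above can be sketched as follows. This is a simplified illustration, not the presenter's implementation: it assumes one AAU per list entry (real PES packets may carry several), and it assumes that re-stamping makes the audio contiguous in 2160-tick steps. The function name `splice_audio` is hypothetical.

```python
APU_TICKS = 2160          # 0.024 s at 90 kHz
PTS_MODULUS = 1 << 33     # 33-bit PTS wrap-around


def splice_audio(ending, beginning, final_idx, initial_idx):
    """ending / beginning: lists of (pts, aau_bytes) in presentation order."""
    # 1. Truncate the ending audio after the best aligned final APU.
    spliced = list(ending[: final_idx + 1])
    # 2. Start the beginning audio at the best aligned initial APU.
    tail = beginning[initial_idx:]
    # 3. Re-stamp the first and all subsequent audio PTSs so that the
    #    audio continues without a gap or an overlap (assumption).
    next_pts = (spliced[-1][0] + APU_TICKS) % PTS_MODULUS
    for _, aau in tail:
        spliced.append((next_pts, aau))
        next_pts = (next_pts + APU_TICKS) % PTS_MODULUS
    return spliced


ending = [(0, b"a"), (2160, b"b"), (4320, b"c")]
beginning = [(100_000, b"x"), (102_160, b"y"), (104_320, b"z")]
print(splice_audio(ending, beginning, final_idx=1, initial_idx=1))
```

Only whole AAUs are kept or dropped, consistent with the observation that fractional AAUs are useless in the encoded domain.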
Case 6) Best aligned final APU long, best aligned initial APU short, and 0 msec. < audio overlap < 12 msec.
[Figure: VPU (k-1), VPU k, VPU (k+1), VPU (k+2) with four 12 msec. intervals marked; ending stream APU (j+1), APU (j+2) and beginning stream APU (m-1), APU m, with the best aligned final and initial APUs marked.]
SOLUTION: [Figure: APU (j+1), APU (j+2), APU (j+3) with the labels (m), (m+1).] A/V skew of at most 12 msec.
Case 1) Both best aligned APUs are short and 12 msec. < audio gap < 24 msec.
[Figure: VPU (k-1), VPU k, VPU (k+1), VPU (k+2) with four 12 msec. intervals marked; ending stream APU (j+1), APU (j+2) and beginning stream APU (m-1), APU m, with the best aligned final and initial APUs marked.]
[Figure: SOLUTION (a) and SOLUTION (b), each showing APU (j+1), APU (j+2), APU (j+3) against VPU (k-1) ... VPU (k+2), with the labels (m), (m), (m-1); each solution is annotated "A/V skew of at most 12 msec."]
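The skew bounds quoted for these cases follow from simple arithmetic on the gap or overlap. The sketch below is one hedged way to reproduce them, assuming the spliced audio is played contiguously and that a solution differs only in the net number of whole APUs added (positive) or dropped (negative) around the splice. The function names and the sign convention (positive means the beginning stream's audio is presented later than originally stamped) are assumptions.

```python
APU_MS = 24.0   # duration of one APU in milliseconds


def audio_displacement_ms(gap_ms: float, extra_apus: int) -> float:
    """Displacement of the beginning stream's audio relative to its video.

    gap_ms > 0 is an audio gap, gap_ms < 0 an audio overlap;
    extra_apus is the net number of whole APUs added around the splice."""
    return extra_apus * APU_MS - gap_ms


def least_skew_choice(gap_ms: float, candidates=(-1, 0, 1)) -> int:
    """Net APU count among the candidates that minimizes the absolute skew."""
    return min(candidates, key=lambda n: abs(audio_displacement_ms(gap_ms, n)))


# Case 1 style input: an 18 ms gap. Keeping one extra APU leaves 6 ms of skew.
print(least_skew_choice(18.0), audio_displacement_ms(18.0, 1))
# Case 6 style input: a 5 ms overlap. Adding or dropping nothing leaves 5 ms.
print(least_skew_choice(-5.0), audio_displacement_ms(-5.0, 0))
```

For a gap between 12 and 24 ms, one extra APU keeps the skew below 12 ms, while closing the gap without an extra APU costs up to 24 ms. For an overlap below 12 ms, dropping one APU instead of none raises the bound from 12 ms to 24 ms.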
[Figure: timeline of AAU arrival times aj, a(j+1), a(j+2), a(j+3) and presentation times pj, p(j+1), p(j+2), p(j+3); service time 0.024 sec.]
aj : arrival time of jth AAU
pj : presentation time of jth AAU
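With arrival and presentation times defined this way, the audio buffer level at any instant can be estimated by counting the AAUs that have arrived but have not yet been presented. The sketch below is a simplified illustration: it treats each AAU as arriving all at once at its arrival time and leaving instantly at its presentation time, and it assumes 4608-bit AAUs (192 kbit/s, 0.024 s). The function name is hypothetical.

```python
def buffer_level_bits(t, arrivals, presentations, aau_bits=4608):
    """Bits held in the audio buffer at time t (seconds).

    An AAU occupies the buffer from its arrival time up to, but not
    including, its presentation time."""
    return sum(aau_bits
               for a, p in zip(arrivals, presentations)
               if a <= t < p)


arrivals = [0.000, 0.020, 0.040]        # a_j for three consecutive AAUs
presentations = [0.050, 0.074, 0.098]   # p_j, spaced by the 0.024 s service time

print(buffer_level_bits(0.045, arrivals, presentations))
print(buffer_level_bits(0.060, arrivals, presentations))
```

Sampling this function over time gives a buffer level trace from which a mean and a standard deviation can be computed.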
[Figure: TS PACKET TYPE VISUALIZATION PLOT. Horizontal axis: TS packet indices. TS packet types are color coded.]
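$$
a_j = \sum_{\substack{\text{all TS packets whose payloads}\\ \text{have some of the } j^{\text{th}} \text{ AAU data}}} (\text{TS packet arrival time}) \times (\text{fraction of the } j^{\text{th}} \text{ AAU in the payload data})
$$
[...] packet’s 94th byte through a PCR extrapolation process.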
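The weighted sum defining the AAU arrival time can be sketched directly. The example assumes the per-packet arrival times have already been obtained (for instance by PCR extrapolation, which is not reproduced here) and that the fraction of the AAU in each payload is measured in bytes. The function name is hypothetical.

```python
def aau_arrival_time(packets):
    """Arrival time of one AAU as a weighted mean of TS packet arrival times.

    packets: list of (ts_packet_arrival_time_s, aau_bytes_in_payload) for all
    TS packets whose payloads carry some of this AAU's data."""
    total_bytes = sum(n for _, n in packets)
    return sum(t * n / total_bytes for t, n in packets)


# A 576-byte AAU spread over four TS packets (illustrative numbers only).
packets = [(1.000, 100), (1.002, 184), (1.004, 184), (1.006, 108)]
print(aau_arrival_time(packets))
```

Weighting by the fraction of the AAU carried in each packet means that a packet holding only a few bytes of the AAU has little influence on the resulting arrival time.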
E[.]: expectation operation. σ: standard deviation.
[Figure: TS PACKET TYPE VISUALIZATION PLOT. Horizontal axis: TS packet indices. TS packet types are color coded.]
[Figure: INTERLEAVED AUDIO-VIDEO ALIGNMENT IN THE BITSTREAM WRT PTS VALUES. Horizontal axis: TS packet indices. Vertical axis: interpolated play-out times for individual TS packets, [seconds].]
Example 1
[Figure: AUDIO BUFFER ANALYSIS. Horizontal axis: time [seconds]. Vertical axis: audio buffer level [% of max].]
Predicted mean audio buffer fullness: 15585.8 bits = 54.36 %. 2 STD interval around the mean: 4504.2 bits = 15.71 %.
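The quoted percentages are consistent with a maximum audio buffer size of 3584 bytes (28672 bits), the main audio buffer size of the MPEG-2 system target decoder. The slide does not state the buffer size, so it is treated as an assumption in the short check below.

```python
AUDIO_BUFFER_BITS = 3584 * 8   # assumed maximum audio buffer size in bits


def percent_of_buffer(bits: float) -> float:
    """Express a buffer level in bits as a percentage of the maximum."""
    return 100.0 * bits / AUDIO_BUFFER_BITS


print(round(percent_of_buffer(15585.8), 2))   # predicted mean fullness
print(round(percent_of_buffer(4504.2), 2))    # 2 STD interval
```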
Example 1
Case 1) Both best aligned APUs are short and 12 msec. < audio gap < 24 msec.
[Figure: VPU (k-1), VPU k, VPU (k+1), VPU (k+2) with four 12 msec. intervals marked; ending stream APU (j+1), APU (j+2) and beginning stream APU (m-1), APU m, with the best aligned final and initial APUs marked.]
SOLUTION I: A/V skew of at most 12 msec.
SOLUTION II: A/V skew of at most 12 msec.
SOLUTION III: A/V skew of at most 24 msec.
[Figure: each solution shows VPU (k-1), VPU k, VPU (k+1), VPU (k+2) above APU (j+1), APU (j+2), APU (j+3), with the beginning-stream APU labels (m) and (m+1).]
Minimal achievable skew implementation for clips with originally moderate audio buffer levels. Note: An alternative implementation that omits APU (j+2) and includes APU (m-1) is also possible and achieves exactly the same resulting A/V skew. This latter solution is preferable in terms of ease of implementation.
Minimal achievable safe (forward) skew implementation for clips with originally low audio buffer levels. Note: An alternative implementation that omits APU (j+2) and includes APU (m-1) is also possible and achieves exactly the same resulting A/V skew. This latter solution is preferable in terms of ease of implementation.
Minimal achievable safe (backward) skew implementation for clips with originally high audio buffer levels.
Case 6) Best aligned final APU long, best aligned initial APU short, and 0 msec. < audio overlap < 12 msec.
[Figure: VPU (k-1), VPU k, VPU (k+1), VPU (k+2) with four 12 msec. intervals marked; ending stream APU (j+1), APU (j+2) and beginning stream APU (m-1), APU m, with the best aligned final and initial APUs marked.]
SOLUTION I: [Figure: VPU (k-1) ... VPU (k+2); APU (j+1), APU (j+2), APU (j+3) with the labels (m), (m+1).] A/V skew of at most 12 msec.
SOLUTION II: [Figure: VPU (k-1) ... VPU (k+2); APU (j+1), APU (j+2), APU (j+3) with the labels (m), (m+1).] A/V skew of at most 12 msec.
SOLUTION III: [Figure: VPU (k-1) ... VPU (k+2); APU (j+1), APU (j+2), APU (j+3) with the labels (m+1), (m+2).] A/V skew of at most 24 msec.
Minimal achievable skew implementation for clips with originally moderate audio buffer levels.
Minimal achievable safe (forward) skew implementation for clips with originally low audio buffer levels.
Minimal achievable safe (backward) skew implementation for clips with originally high audio buffer levels.
Note: An alternative implementation that omits APU (j+1) and includes APU m is also possible and achieves exactly the same resulting A/V skew. This latter solution is preferable in terms of ease of implementation.
the audio bit rate is a constant
the VAUs are of varying sizes
[Figure: TS PACKET TYPE VISUALIZATION PLOT. Horizontal axis: TS packet indices. TS packet types are color coded.]
[Figure: INTERLEAVED AUDIO-VIDEO ALIGNMENT IN THE BITSTREAM WRT PTS VALUES. Horizontal axis: TS packet indices. Vertical axis: interpolated play-out times for individual TS packets, [seconds]. Marked points: TS packet index 9812 and TS packet index 11005.]
Example 1. Encoder B.