SLIDE 1

where information lives


Seamless Audio Splicing for ISO/IEC 13818 Transport Streams

A New Framework for Audio Elementary Stream Tailoring and Modeling

Seyfullah Halit Oguz, Ph.D. and Sorin Faibish EMC Corporation Media Solutions Group Engineering

SLIDE 2

EMC Media Solutions Group Profile

The EMC Media Solutions Group is part of EMC Engineering

• EMC Engineering will spend well over $1 billion on research and development in FY 2001.

• Over 300 engineers concentrate on Celerra Server development and Rich Media products.

• The Media Solutions Group lab has in excess of $300 million of hardware for the development of Rich Media solutions.

SLIDE 3

Mission Statement

The EMC Media Solutions Group is tasked with:

• Deploying EMC products into the Rich Media market.

• Developing products and solutions that meet the requirements of Rich Media customers.

• Developing partnerships with key companies to develop and deploy customer solutions.

• Providing Professional Services in the Rich Media environment.

SLIDE 4

Outline

• Splicing in brief
• Objective
• Problem description
• Basic algorithm
• A model for the audio elementary streams
• Enhanced algorithm
• Additional implementation details
• Conclusions

SLIDE 5

Splicing

Splicing is the act of switching from one MPEG-2 program (embedded in a transport stream) to another MPEG-2 program (again embedded in a transport stream).

Commercial insertion, camera or content switching, and content editing all require splicing to be performed on compressed bit-streams.

The structure of the compressed data makes a seamless splicing algorithm far from trivial.

SLIDE 6

Objective

A generic method to process the audio elementary streams during the splicing of ITU-T Rec. H.222.0 | ISO/IEC 13818-1 transport streams (TS) to achieve a seamless audio splice.

• "generic": no constraining assumptions are made about signal formats (e.g. the video frame rate (PAL, NTSC) or the audio sampling frequency) or about encoding parameters (e.g. the audio bit rate, or the layer of the audio encoding algorithm employed).

• "audio elementary streams": the current focus is on the audio elementary streams; ultimately, audio and video splicing should be considered jointly.

• "transport streams": achieve audio elementary stream splicing directly on transport streams with the lowest possible complexity.

SLIDE 7

Definitions and Notation

Encoded data domain / decoded data domain:

• Audio Access Unit, AAU (an encoded audio frame, 576 bytes*), converted by the audio decoder into an
• Audio Presentation Unit, APU (a block of contiguous audio samples, 24 ms*).
• Video Access Unit, VAU (variable size), converted by the video decoder into a
• Video Presentation Unit, VPU (a video frame, 1/29.97 s**).

*(ISO/IEC 11172-3 Layer-II audio coding with 48 kHz sampling frequency and 192 kbit/s audio bit rate, assumed only for illustrative purposes.)

**(NTSC frame rate, assumed only for illustrative purposes.)
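The starred figures can be checked with a short calculation. The constants below (1152 PCM samples per Layer-II frame, hence the 144 = 1152/8 factor in the frame-size formula) come from ISO/IEC 11172-3; the snippet itself is only illustrative:

```python
# Check of the illustrative numbers on this slide: ISO/IEC 11172-3 Layer-II
# audio at 48 kHz sampling and 192 kbit/s. A Layer-II frame carries 1152
# PCM samples per channel.
SAMPLES_PER_FRAME = 1152
FS = 48_000          # sampling frequency [Hz]
BITRATE = 192_000    # audio bit rate [bits/s]

# One APU covers 1152 samples -> 24 ms of audio.
apu_duration = SAMPLES_PER_FRAME / FS

# One AAU holds the bits produced during that interval: 144 * bitrate / fs bytes.
aau_bytes = (SAMPLES_PER_FRAME // 8) * BITRATE // FS

print(apu_duration, aau_bytes)  # 0.024 576
```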

SLIDE 8

VPU and APU alignment

[Figure: timing diagram of start (S) and end (E) boundaries for VPUs (k-2)…(k+2) against APUs (j-2)…(j+3); VPU duration ~= 0.03337 seconds (1/29.97 s), APU duration 0.024 seconds.]

The start of a VPU will be aligned with the start of an APU possibly at the beginning of a stream, and then only at multiples of 5-minute increments in time. This implies that, for all practical purposes, they will not be aligned again later.
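The 5-minute re-alignment figure can be reproduced with exact rational arithmetic, taking the VPU duration as the nominal 1/29.97 s used on this slide (a sketch; `frac_lcm` is a hypothetical helper, not from the presentation):

```python
from fractions import Fraction
from math import gcd, lcm

def frac_lcm(a: Fraction, b: Fraction) -> Fraction:
    # LCM of two positive rationals: lcm of the numerators over gcd of the
    # denominators (both fractions are kept in lowest terms by Fraction).
    return Fraction(lcm(a.numerator, b.numerator),
                    gcd(a.denominator, b.denominator))

apu = Fraction(24, 1000)    # APU duration: 24 ms
vpu = Fraction(100, 2997)   # VPU duration: 1/29.97 s (nominal NTSC figure)

# Smallest time span that is both a whole number of APUs and of VPUs:
print(float(frac_lcm(apu, vpu)))  # 300.0 -> boundaries re-align every 5 minutes
```

Note that with the exact NTSC rate of 30000/1001 frames/s the common period comes out much shorter, so the 5-minute figure is tied to the nominal 29.97 value.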

SLIDE 9

The setting for splicing

[Figure: the ending stream (time base #1) with VPUs (k-2)…(k+2) and APUs (j-2)…(j+3), above the beginning stream (time base #2) with VPUs (n-2)…(n+2) and APUs (m-2)…(m+3).]

The splicing point is naturally defined with respect to VPUs.

SLIDE 10

Audio processing at splicing

APUs are available only through the decoding of their corresponding AAUs. Fractional (i.e. truncated) AAUs in the encoded data domain are useless.

[Figure: the ending stream's VPUs (k-2)…(k+2) and APUs (j-2)…(j+3) against the beginning stream's APUs (m-3)…(m+2) after the splice.]

The time base of the beginning stream is shifted to achieve video presentation continuity.

SLIDE 11

So far…

• Decoding, time-domain editing and re-encoding: high computational complexity.

• Gaps in the audio stream: audio mutes, uncontrolled audio-visual skew.

• Overlaps in the scopes of APUs: uncontrolled audio-visual skew, inconsistent ES structure.

SLIDE 12

Observations

• Audio truncation should always be done at AAU boundaries, i.e. no fractional AAUs!

• Audio truncation for the ending stream should be done with respect to the end of its last VPU's presentation interval.

• Audio truncation for the beginning stream should be done relative to the beginning of its first VPU's presentation interval.

"BEST ALIGNED APUs"

SLIDE 13

Best aligned APUs

[Figure: VPUs (k-1)…(k+2) with the 24 ms interval around a VPU boundary split into two 12 ms halves ("short"/"long"); APU (j+1) of the ending stream and APU m of the beginning stream fall within it.]

The APU of the ending stream whose presentation interval ends within the identified 24 ms interval is called the "best aligned final APU". The APU of the beginning stream whose presentation interval starts within the identified 24 ms interval is called the "best aligned initial APU".

Based on these definitions, a comprehensive list of 8 possible cases can be identified regarding the alignment of the ending and beginning audio streams.
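Locating the best aligned APUs amounts to rounding against the 24 ms APU grid; the function names and arguments below are illustrative, not from the presentation:

```python
APU_DUR = 0.024  # APU duration [seconds]

def best_aligned_final_apu(t_splice, apu_grid_start, apu_dur=APU_DUR):
    """Index j of the ending stream's APU whose presentation END lies
    closest to t_splice (the end of the last VPU): |end_j - t| <= 12 ms."""
    # APU j ends at apu_grid_start + (j + 1) * apu_dur
    return round((t_splice - apu_grid_start) / apu_dur) - 1

def best_aligned_initial_apu(t_splice, apu_grid_start, apu_dur=APU_DUR):
    """Index m of the beginning stream's APU whose presentation START lies
    closest to t_splice (the start of its first VPU), on its own time base."""
    return round((t_splice - apu_grid_start) / apu_dur)

j = best_aligned_final_apu(1.000, 0.0)      # APU 41 ends at 1.008 s ("long")
m = best_aligned_initial_apu(1.000, 0.005)  # APU 41 starts at 0.989 s ("short")
```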

SLIDE 14

How to make use of best aligned APUs

ACTION:

• Truncate the ending audio stream at the end of the best aligned final APU.

• Start the beginning audio stream at the beginning of the best aligned initial APU.

• Re-stamp the audio PTSs of the beginning stream to generate an immediate continuation of the ending audio stream.

REQUIRED PROCESSING AT ELEMENTARY STREAM LEVEL:

• In the audio PES packet carrying the best aligned final APU: truncate after the AAU associated with the best aligned final APU, and modify the PES packet size information accordingly.

• In the audio PES packet carrying the best aligned initial APU: delete the AAU data preceding the AAU associated with the best aligned initial APU, and modify the PES packet size information accordingly.

• Modify the PTS values associated with the first and all subsequent audio PES packets accordingly.
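The PES-size bookkeeping amounts to rewriting the 16-bit PES_packet_length field (bytes 4-5 of the packet, which counts every byte after that field). A minimal sketch, assuming a whole PES packet held in memory with the usual optional header, and with `keep_payload_bytes` already computed from the AAU boundaries:

```python
import struct

def truncate_pes_payload(pes: bytearray, keep_payload_bytes: int) -> bytearray:
    """Keep only the first keep_payload_bytes of PES payload (ending exactly
    on an AAU boundary) and patch PES_packet_length accordingly."""
    assert pes[0:3] == b"\x00\x00\x01"            # packet_start_code_prefix
    header_len = 9 + pes[8]                       # 6 fixed + 3 + PES_header_data_length
    out = bytearray(pes[:header_len + keep_payload_bytes])
    struct.pack_into(">H", out, 4, len(out) - 6)  # bytes after the length field
    return out
```

Deleting leading AAU data from the beginning stream's first PES packet is the symmetric operation: the payload shrinks from the front and the length field is reduced by the number of deleted bytes.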

SLIDE 15

Case 6) Best aligned final APU long, best aligned initial APU short, and 0 msec. < audio overlap < 12 msec.

[Figure: VPUs (k-1)…(k+2) with 12 ms sub-intervals; the best aligned final APU m overlaps the best aligned initial APU (j+1).]

SOLUTION: [Figure: the beginning stream's APUs re-stamped to continue immediately after the best aligned final APU.] A/V skew of at most 12 msec.

SLIDE 16

Minimal Achievable Skew Algorithm

• Immediately applicable to 6 out of the 8 possible best aligned APU relative position classes.

• In the remaining 2 classes of relative position, a slight modification to the proposed algorithm is needed to achieve an A/V skew bounded by half the APU duration.

SLIDE 17

Case 1) Both best aligned APUs are short and 12 msec. < audio gap < 24 msec.

[Figure: SOLUTION (a) and SOLUTION (b), two alternative APU arrangements around VPUs (k-1)…(k+2); each yields an A/V skew of at most 12 msec.]

SLIDE 18

Facts - I

• An audio elementary stream construction with no holes and no audio PTS discontinuity is possible.

• As a consequence, an A/V skew of magnitude at most half the APU duration will be induced in the beginning stream. This is below the sensitivity limits of human perception.

• The proposed algorithm can be repeatedly applied an arbitrary number of times with neither a failure to meet its structural assumptions nor a degradation in its promised A/V skew performance.

SLIDE 19

Facts - II

• The A/V skews induced as a result of the proposed processing do not accumulate, i.e. irrespective of the number of consecutive splices, the worst A/V skew at any point in time will be half of the APU duration.

• At each splice point, at the termination of the PUs of the ending stream, the total audio and video presentation durations up to that point always almost match each other, i.e.

|video_duration - audio_duration| <= (1/2) APU_duration

i.e. the correct amount of audio data is always provided.
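The non-accumulation claim can be sanity-checked with a toy simulation: each splice re-stamps the incoming audio so that it continues the outgoing APU grid from the boundary nearest the video splice time, so the residual skew is bounded per splice and independent of history. (A sketch under the deck's assumptions, not production code.)

```python
import random

APU = 0.024  # APU duration [seconds]

random.seed(1)
video_t = 0.0   # running video presentation time
grid = 0.0      # phase of the continuous audio APU grid (no PTS discontinuity)
worst = 0.0
for _ in range(10_000):                  # many consecutive splices
    video_t += random.uniform(0.5, 3.0)  # arbitrary clip durations
    # truncate at the APU boundary nearest the video splice time
    cut = grid + round((video_t - grid) / APU) * APU
    worst = max(worst, abs(video_t - cut))
    grid = cut                           # spliced-in audio continues the grid

assert worst <= APU / 2 + 1e-9           # never exceeds half an APU
```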

SLIDE 20

Facts - III

The resulting audio stream is error-free and fully ISO/IEC 11172-3 and ISO/IEC 13818-3 compliant.

SLIDE 21

Implementation at TS level

TS level implementation necessitates further considerations:

1) Editing of transport packets
2) Transport packet buffering and re-multiplexing
3) Audio buffer management
4) Metadata generation

SLIDE 22

Audio buffer management

We have to control the dynamic behavior of the audio buffer during the transient induced by the splicing, as well as after the splicing.

We need a simple model to characterize the audio elementary stream within a TS, such that:

1) the model parameters are easy to estimate,
2) the model accurately provides the following desired information:
   a) the mean audio buffer fullness,
   b) the extent of buffer fullness variation around its mean value.

SLIDE 23

Audio elementary stream model

The audio elementary stream will be modeled on the basis of AAUs, which determine its granularity with respect to audio decoder actions.

The proposed framework for the model is an arrival process to a FIFO queue with a deterministic service time of 0.024 sec.

[Figure: arrival times a_j, a_(j+1), a_(j+2), a_(j+3) feeding the queue, with presentation times p_j, p_(j+1), p_(j+2), p_(j+3); a_j is the arrival time of the jth AAU and p_j is its presentation time.]

SLIDE 24

Audio elementary stream model

The presentation times p_j can be easily computed for each AAU.

The arrival times a_j, however, are not uniquely defined, since each AAU arrives in a distributed fashion owing to transport packet encapsulation. Solution: use "weighted-average arrival times".

SLIDE 25

Audio elementary stream model

• The weighted-average arrival time is defined for each AAU as:

a_j = sum, over all TS packets whose payloads carry some of the jth AAU's data, of (TS packet arrival time) * (fraction of the jth AAU in the payload data)

• Note: the TS packet arrival time is defined with respect to the packet's 94th byte through a PCR extrapolation process.
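The definition transcribes directly into code; the data layout below (a list of per-packet pairs) is an assumption for illustration, not from the presentation:

```python
def weighted_average_arrival(pieces):
    """pieces: list of (ts_packet_arrival_time, aau_bytes_in_that_payload).
    Returns a_j, the byte-weighted average of the carrying packets' arrival
    times, with the weights normalized by the AAU's total size."""
    total = sum(n for _, n in pieces)
    return sum(t * (n / total) for t, n in pieces)

# A 576-byte AAU spread over four TS packets (hypothetical arrival times):
a_j = weighted_average_arrival([(0.1000, 120), (0.1004, 184),
                                (0.1008, 184), (0.1012, 88)])
```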

SLIDE 26

Audio elementary stream model

Based on these definitions we let w_j = p_j - a_j, where w_j is the waiting time of the jth AAU in the audio buffer.

The mean and variance of w_j, viewed as a random variable, provide very important information about the audio elementary stream within the TS multiplex, and hence about the audio buffer.

SLIDE 27

Audio elementary stream model

Specifically, with E[.] denoting the expectation operation and σ the standard deviation:

E[w] · (audio bitrate) = mean audio buffer fullness

σ_w · (audio bitrate) = a measure of the variation in the audio buffer fullness around its mean value
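Turning waiting-time statistics into buffer-occupancy figures is then a single multiplication; the waiting times below are made up for illustration, not measured from the streams in the deck:

```python
from statistics import mean, stdev

def buffer_stats(waiting_times, audio_bitrate):
    """E[w] * bitrate -> mean audio buffer fullness [bits];
    sigma_w * bitrate -> spread of the fullness around that mean [bits]."""
    return (mean(waiting_times) * audio_bitrate,
            stdev(waiting_times) * audio_bitrate)

# Hypothetical per-AAU waiting times (seconds) at 192 kbit/s:
fullness, spread = buffer_stats([0.078, 0.080, 0.083, 0.079], 192_000)
```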

SLIDE 28

Audio elementary stream model

Example 1. Encoder B characteristics:

• Long mean waiting time
• Highly irregular, bursty arrival regime
SLIDE 29

Example 1

[Figures: "TS PACKET TYPE VISUALIZATION PLOT" (TS packet types, color coded, vs. TS packet indices) and "INTERLEAVED AUDIO-VIDEO ALIGNMENT IN THE BITSTREAM WRT PTS VALUES" (interpolated play-out times for individual TS packets, in seconds, vs. TS packet indices).]

SLIDE 30

Example 1

[Figure: "AUDIO BUFFER ANALYSIS", audio buffer level (% of max) vs. time (seconds).]

Predicted mean audio buffer fullness: 15585.8 bits = 54.36 %. 2 STD interval around the mean: 4504.2 bits = 15.71 %.

SLIDE 31

An important observation

PTS re-stamping in the beginning audio stream:

• The waiting times of AAUs will be modified in the beginning stream.

• The mean waiting time of AAUs in the beginning stream will change (decrease or increase) by at most half the APU duration.

• A corresponding change in the mean audio buffer fullness level for the beginning stream will be induced.

SLIDE 32

Improvement to the proposed processing

For audio elementary streams structured with a mean audio buffer fullness level bounded away from both underflow and overflow, the already proposed methodology with the minimal achievable A/V skew is the best choice.

For audio elementary streams with a mean audio buffer fullness level close to either the underflow or the overflow state, the requirement of minimal resultant A/V skew should be relaxed; instead of the "best aligned APUs", those APUs introducing the minimal safe A/V skew should be employed.

SLIDE 33

Outline of the improvement - I

Let the type of A/V skew in which the audio signal component is delayed with respect to its associated video signal be referred to as the forward (in time) skew.

Forward skew is the result of a uniform increment in the beginning stream's AAU presentation time stamps and is therefore associated with an increase in the mean audio buffer fullness level.

Hence, for beginning audio elementary streams with mean audio buffer fullness levels close to underflow, the minimal achievable forward skew is the minimal safe skew and should be the one employed.

SLIDE 34

Outline of the improvement - II

Let the type of A/V skew in which the audio signal component moves earlier in time with respect to its associated video signal be referred to as the backward (in time) skew.

Backward skew is the result of a uniform decrement in the beginning stream's AAU presentation time stamps and is therefore associated with a decrease in the mean audio buffer fullness level.

Hence, for beginning audio elementary streams with mean audio buffer fullness levels close to overflow, the minimal achievable backward skew is the minimal safe skew and should be the one employed.

SLIDE 35

Minimal Safe Skew Algorithm Summary

• For all 8 possible relative position classes of best aligned APUs, three solutions are defined:

1. the minimal achievable skew solution
2. the minimal safe forward skew solution
3. the minimal safe backward skew solution

• Minimal achievable skew is upper-bounded by half the APU duration.

• Minimal safe skew (forward or backward) is upper-bounded by one APU duration. (Still below the sensitivity limits of human perception.)
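The selection among the three solutions reduces to a threshold test on the beginning stream's model-predicted mean buffer fullness; the percentage thresholds below are illustrative placeholders, not values from the presentation:

```python
def pick_solution(mean_fullness_pct, low=20.0, high=80.0):
    """Route a splice to one of the three defined solutions using the
    model-predicted mean audio buffer fullness (% of buffer capacity)."""
    if mean_fullness_pct < low:
        return "minimal safe forward skew"   # raises fullness, away from underflow
    if mean_fullness_pct > high:
        return "minimal safe backward skew"  # lowers fullness, away from overflow
    return "minimal achievable skew"         # buffer safely mid-range

# The 54.36 % fullness predicted for Example 1 would take the default branch:
assert pick_solution(54.36) == "minimal achievable skew"
```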

SLIDE 36

Case 1) Both best aligned APUs are short and 12 msec. < audio gap < 24 msec.

[Figure: VPUs (k-1)…(k+2) with 12 ms sub-intervals; best aligned final APU (j+1) of the ending stream and best aligned initial APU m of the beginning stream.]

SLIDE 37

SOLUTION I: A/V skew of at most 12 msec. Minimal achievable skew implementation, for clips with originally moderate audio buffer levels. Note: an alternate implementation omitting APU (j+2) and including APU (m-1) is also possible and achieves exactly the same resultant A/V skew; this latter solution is preferable in terms of ease of implementation.

SOLUTION II: A/V skew of at most 12 msec. Minimal safe (forward) skew implementation, for clips with originally low audio buffer levels. Note: the same alternate implementation (omitting APU (j+2), including APU (m-1)) applies here as well.

SOLUTION III: A/V skew of at most 24 msec. Minimal safe (backward) skew implementation, for clips with originally high audio buffer levels.

SLIDE 38

Case 6) Best aligned final APU long, best aligned initial APU short, and 0 msec. < audio overlap < 12 msec.

[Figure: VPUs (k-1)…(k+2) with 12 ms sub-intervals; best aligned final APU m of the ending stream overlaps best aligned initial APU (j+1) of the beginning stream.]

SLIDE 39

SOLUTION I: A/V skew of at most 12 msec. Minimal achievable skew implementation, for clips with originally moderate audio buffer levels.

SOLUTION II: A/V skew of at most 12 msec. Minimal safe (forward) skew implementation, for clips with originally low audio buffer levels.

SOLUTION III: A/V skew of at most 24 msec. Minimal safe (backward) skew implementation, for clips with originally high audio buffer levels. Note: an alternate implementation omitting APU (j+1) and including APU m is also possible and achieves exactly the same resultant A/V skew; this latter solution is preferable in terms of ease of implementation.

SLIDE 40

Implementation at TS level

TS level implementation necessitates further considerations:

1) Editing of transport packets
2) Transport packet buffering and re-multiplexing
3) Audio buffer management
4) Metadata generation

SLIDE 41

Editing of transport packets

• The truncation of the final PES packet of the ending audio stream will typically necessitate the insertion of some adaptation field padding into its last transport packet.

• The deletion of some AAU data from the front end of the beginning audio stream's first PES packet will typically necessitate the editing of at most two audio transport packets.

• The possible use of a causal bit-reservoir in Layer-III encoding typically dictates a structural constraint on the beginning audio stream.
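Adaptation-field padding works because the adaptation field sits between the 4-byte TS header and the payload, and may be extended with 0xFF stuffing bytes. A minimal sketch that rebuilds a 188-byte packet around a shortened payload (it ignores any pre-existing adaptation field):

```python
TS_SIZE = 188  # fixed MPEG-2 transport packet size

def repack_with_stuffing(header4: bytes, payload: bytes) -> bytes:
    """Place a short payload at the end of a TS packet, filling the gap with
    an adaptation field of 0xFF stuffing (ISO/IEC 13818-1 packet layout)."""
    assert header4[0] == 0x47 and len(payload) <= TS_SIZE - 5
    hdr = bytearray(header4)
    hdr[3] |= 0x30                           # adaptation field + payload present
    af_len = TS_SIZE - 4 - 1 - len(payload)  # bytes after adaptation_field_length
    af = bytes([af_len])
    if af_len > 0:
        af += b"\x00" + b"\xff" * (af_len - 1)  # flags byte, then stuffing
    return bytes(hdr) + af + payload
```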
SLIDE 42

Transport packet buffering and re-multiplexing

In the transport stream, the audio bit rate is constant while the VAUs are of varying sizes; therefore the relative positions of VAUs and AAUs associated with VPUs and APUs almost aligned in time cannot be kept constant. Almost always, the AAUs are significantly delayed with respect to the VAUs whose decoded representations are almost synchronous with them.

SLIDE 43

Legend

• Black: TS video packets initiating I type pictures.
• Blue: TS video packets initiating P or B type pictures.
• Cyan: TS video packets.
• Red: TS audio packets initiating PES packets.
• Magenta: TS audio packets.
• Yellow: Null TS packets.
• Green: TS packets carrying various system information.
SLIDE 44

Example 1. Encoder B.

[Figures: "TS PACKET TYPE VISUALIZATION PLOT" (TS packet types, color coded, vs. TS packet indices) and "INTERLEAVED AUDIO-VIDEO ALIGNMENT IN THE BITSTREAM WRT PTS VALUES" (interpolated play-out times for individual TS packets, in seconds, vs. TS packet indices); TS packet indices 9812 and 11005 are marked.]

SLIDE 45

Transport packet buffering and re-multiplexing

This TS multiplex structure necessitates:

1) locating and temporarily storing (buffering) the delayed audio packets when the ending stream is truncated based on the last VAU,
2) TS re-multiplexing, in the form of:
   a) deletion of some obsolete audio packets in the beginning stream,
   b) insertion of the audio packets buffered in step 1 into the beginning stream.

SLIDE 46

Metadata Generation

• During the ingest of MPEG-2 transport streams to storage systems:

1. estimate the parameters of the proposed audio elementary stream model,
2. record this descriptive information within the metadata associated with the asset.

• In splicing scenarios with live streams, known characteristics and/or settings of the encoder employed can be used.

SLIDE 47

Conclusions

• Audio elementary stream tailoring based on the minimal achievable or minimal safe A/V skew concepts.

• Audio splicing without any artifacts made possible.

• A simple and efficient model to characterize the audio elementary stream structure embedded in the transport stream. (Prediction and possible control of audio buffer behavior are within reach.)

• Audio splicing cannot be considered independently from video splicing.

SLIDE 48