Audio Indexing and Retrieval IT6902; Semester B, 2004/2005; Leung


SLIDE 1

Audio Indexing and Retrieval

IT6902; Semester B, 2004/2005; Leung

SLIDE 2

Audio Indexing and Retrieval

  • Motivation
  • Main Audio Features
  • Audio Classification
  • Speech Recognition
  • Music Retrieval
  • Using Audio Features for Video Indexing and Retrieval

SLIDE 3

Scenarios

  • If we have an audio file of a pop singer’s concert, how can we find out when the singer is singing and when he/she is talking to the audience?
  • If we have recorded the phone conversations during many sessions of the conference meetings, how can we find out when a particular project XYZ was discussed and what was said about it?
  • If we have many songs in digital format, how can we search for a particular song whose title we have forgotten, when we only know how to sing a few words or hum a few notes?
  • If we want to skim a horror movie file, how can we find out where the horror scenes are?

SLIDE 4

Main Audio Features

  • Time-Domain Features

– Average Energy
– Zero Crossing Rate
– Silence Ratio

  • Frequency-Domain Features

– Sound Spectrum
– Bandwidth
– Energy Distribution
– Harmonicity
– Pitch

  • Spectrogram

SLIDE 5

Time-Domain Features

  • Amplitude-time representation of an audio signal

SLIDE 6

Time-Domain Features (2)

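  • Average Energy

– Indicates the loudness of the audio signal

$$E = \frac{1}{N} \sum_{n=0}^{N-1} x(n)^2$$

  • Zero Crossing Rate

– Indicates the frequency of signal amplitude sign change

$$ZC = \frac{1}{2N} \sum_{n=1}^{N-1} \left| \operatorname{sgn}[x(n)] - \operatorname{sgn}[x(n-1)] \right|$$

$$\operatorname{sgn}(a) = \begin{cases} 1 & a > 0 \\ 0 & a = 0 \\ -1 & a < 0 \end{cases}$$

where x(n) is the value of sample n and N is the number of samples.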
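The average energy and zero crossing rate definitions on this slide can be computed directly from a list of samples. The following is a minimal sketch in plain Python; the function names and the four-sample test signal are made up for illustration.

```python
def sgn(a):
    """Sign function: 1 for positive, 0 for zero, -1 for negative."""
    return (a > 0) - (a < 0)


def average_energy(x):
    """E = (1/N) * sum of x(n)^2 over all N samples."""
    return sum(v * v for v in x) / len(x)


def zero_crossing_rate(x):
    """ZC = sum over n = 1..N-1 of |sgn x(n) - sgn x(n-1)|, divided by 2N."""
    n_samples = len(x)
    changes = sum(abs(sgn(x[n]) - sgn(x[n - 1])) for n in range(1, n_samples))
    return changes / (2 * n_samples)


samples = [1.0, -1.0, 1.0, -1.0]    # made-up signal that flips sign at every step
print(average_energy(samples))      # 1.0
print(zero_crossing_rate(samples))  # 0.75
```

A signal that changes sign at every sample gives the highest rate for its length, and a signal that never changes sign gives 0.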

SLIDE 7

Time-Domain Features (3)

  • Silence Ratio

– Indicates the proportion of the sound piece that is silent
– Silence is a period within which the absolute amplitude values of a certain number of samples are below a certain threshold
– Silence ratio is calculated as the ratio between the sum of silent periods and the total length of the audio piece

Approaches:

  • 1. Fixed Threshold
  • 2. Select Reference Silence Value
  • 3. Adaptive Silence Thresholds

[Figure: waveform with the silent periods labelled]
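The fixed-threshold approach can be sketched as follows. This is only an illustration: the parameter names `threshold` and `min_run` (the "certain number of samples") and the sample values are assumptions, not values given in the slides.

```python
def silence_ratio(x, threshold, min_run):
    """Fraction of samples that lie in silent periods.

    A silent period is a run of at least `min_run` consecutive samples whose
    absolute amplitude is below `threshold` (approach 1, fixed threshold).
    """
    silent = 0
    run = 0
    for v in x:
        if abs(v) < threshold:
            run += 1
        else:
            if run >= min_run:
                silent += run
            run = 0
    if run >= min_run:  # a silent period may extend to the end of the signal
        silent += run
    return silent / len(x)


# Made-up amplitudes: two silent periods of three samples each, plus one
# isolated quiet sample that is too short to count as silence.
signal = [0.0, 0.01, 0.0, 0.9, 0.8, 0.0, 0.7, 0.0, 0.0, 0.0]
print(silence_ratio(signal, threshold=0.05, min_run=3))  # 0.6
```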

SLIDE 8

Frequency-Domain Features

  • Sound Spectrum

– For large values of N, the signal is often broken into blocks called frames and the DFT is applied to each of the frames. This is known as the Short Time Fourier Transform (STFT)

Discrete Fourier Transform (DFT):

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n k / N}$$

Inverse Discrete Fourier Transform (IDFT):

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{j 2\pi n k / N}$$
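As a rough illustration of the DFT, the IDFT and the frame-by-frame idea behind the STFT, the sketch below evaluates the sums directly in plain Python. It assumes non-overlapping frames with no window function, which is a simplification; practical systems usually use overlapping, windowed frames and an FFT. The test tone is made up.

```python
import cmath
import math


def dft(x):
    """X(k) = sum over n of x(n) * exp(-j*2*pi*n*k/N), for k = 0..N-1."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def idft(spectrum):
    """x(n) = (1/N) * sum over k of X(k) * exp(j*2*pi*n*k/N), for n = 0..N-1."""
    n_samples = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * n * k / n_samples)
            for k in range(n_samples)) / n_samples
        for n in range(n_samples)
    ]


def stft(x, frame_len):
    """Split x into non-overlapping frames and apply the DFT to each frame."""
    return [dft(x[s:s + frame_len])
            for s in range(0, len(x) - frame_len + 1, frame_len)]


# Made-up tone: one cosine cycle per 8-sample frame, two frames long.
tone = [math.cos(2 * math.pi * n / 8) for n in range(16)]
frames = stft(tone, frame_len=8)
print([round(abs(c), 3) for c in frames[0]])
# [0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0]
```

The two non-zero bins (k = 1 and its mirror k = 7) correspond to the single cosine in each frame.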

SLIDE 9

Frequency-Domain Features (2)

  • Bandwidth

– Indicates the frequency range of a sound
– Can be taken as the difference between the highest and lowest frequencies of the non-zero spectrum components
– “Non-zero” may be defined as at least 3 dB above the silence level

  • Energy distribution

– Signal distribution across frequency components
– One important feature derived from the energy distribution is the centroid, which is the mid-point of the spectral energy distribution of a sound. The centroid is also called brightness
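A small sketch of both features, given a power spectrum as parallel lists of bin frequencies and powers. The centroid is computed here as the power-weighted mean frequency, which is one common way to define the mid-point and is an assumption rather than the slide's stated formula. The spectrum values and the `silence_power` level are made up.

```python
def bandwidth(freqs, power, silence_power):
    """Highest minus lowest frequency among components >= 3 dB above silence."""
    floor = silence_power * 10 ** (3 / 10)  # 3 dB above silence, in power terms
    active = [f for f, p in zip(freqs, power) if p >= floor]
    return max(active) - min(active) if active else 0.0


def centroid(freqs, power):
    """Power-weighted mean frequency (assumed definition of brightness)."""
    total = sum(power)
    return sum(f * p for f, p in zip(freqs, power)) / total if total else 0.0


freqs = [0.0, 100.0, 200.0, 300.0, 400.0]  # made-up bin frequencies in Hz
power = [0.0, 1.0, 0.0, 3.0, 0.0]          # made-up power per bin
print(bandwidth(freqs, power, silence_power=0.01))  # 200.0
print(centroid(freqs, power))                       # 250.0
```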

SLIDE 10

Frequency-Domain Features (3)

  • Harmonicity

– In a harmonic sound, the spectral components are mostly whole-number multiples of the lowest, and most often loudest, frequency
– The lowest frequency is called the fundamental frequency
– Music is normally more harmonic than other sounds

  • Pitch

– The distinctive quality of a sound, dependent primarily on the frequency of the sound waves produced by its source
– Only periodic sounds, such as those produced by musical instruments and the voice, give rise to a sensation of pitch
– In practice, we use the fundamental frequency as an approximation of the pitch
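The slides do not say how the fundamental frequency is estimated. One widely used option is autocorrelation: a periodic signal correlates most strongly with itself when shifted by one period. The sketch below assumes that technique; the lag range, sample rate and test tone are made up.

```python
import math


def fundamental_frequency(x, sample_rate, min_lag, max_lag):
    """Return sample_rate / lag for the lag with the highest autocorrelation."""
    def autocorr(lag):
        return sum(x[n] * x[n - lag] for n in range(lag, len(x)))

    best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
    return sample_rate / best_lag


rate = 8000  # samples per second (assumed)
tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]  # 200 Hz tone
print(fundamental_frequency(tone, rate, min_lag=20, max_lag=60))     # 200.0
```

The lag range limits the search to plausible pitch periods; here it covers roughly 133 Hz to 400 Hz at the assumed sample rate.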

SLIDE 11

Spectrogram

  • Time and frequency components are shown in the same representation

[Figure: spectrogram. Axes: time and frequency. Intensity: power of a frequency component at a particular time interval]

Source: http://www.visualizationsoftware.com/gram.html

SLIDE 12

Audio Classification

  • Goal

– To classify the audio into speech, music and possibly other categories/subcategories

  • Motivation

1. Different audio types require different processing, indexing and retrieval techniques
2. Different audio types have different significance to different applications
3. The audio type or class information is itself very useful to some applications
4. The search space after classification is reduced to a particular audio class during the retrieval process

SLIDE 13

Speech vs. Music

SLIDE 14

Audio Classification Framework

  • Step by Step Classification

– Each feature is used individually in different classification steps
– The order in which the features are used is important; it is normally decided based on the computational complexity and the differentiating power of each feature

  • Feature Vector Based Audio Classification

– A set of features is used together as a vector to calculate the closeness of the input to the training sets
– Theoretically more effective, because multiple features are considered in the classification decision, but more computationally demanding because of the multi-dimensional feature vectors
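To make "closeness of the input to the training sets" concrete, the sketch below assigns an input feature vector to the class whose training vectors are closest on average, using Euclidean distance. The distance measure, the choice of features and all numeric values are assumptions for illustration; the slides do not fix any of them.

```python
import math


def nearest_class(features, training_sets):
    """Return the label whose training vectors lie closest on average."""
    def mean_distance(vectors):
        return sum(math.dist(features, v) for v in vectors) / len(vectors)

    return min(training_sets, key=lambda label: mean_distance(training_sets[label]))


# Made-up (zero crossing rate, silence ratio) vectors; real values would be
# measured from labelled speech and music recordings.
training_sets = {
    "speech": [(0.30, 0.40), (0.35, 0.35)],
    "music": [(0.10, 0.05), (0.12, 0.10)],
}
print(nearest_class((0.28, 0.30), training_sets))  # speech
```

A step-by-step classifier would instead test one feature at a time against a threshold, cheapest feature first.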

SLIDE 15

Step by Step Classification

  • Lu and Hankinson 1998

SLIDE 16

Feature Vector Based Audio Classification

  • Scheirer and Slaney 1997

[Figure labels: speech, music]

SLIDE 17

Example Audio Classes

  • Liu and Wan 2001

SLIDE 18

Audio Segmentation

  • A long sound track normally consists of a mixture of speech, music and other sound types
  • The audio piece can be segmented into speech and music intervals based on the classification scheme discussed earlier
  • Approach:

– Divide the audio piece into a number of small windows and then apply the audio classification method to determine whether each window is speech or music
– Consecutive windows of the same type are then grouped into a speech or music interval

M … M M M M S S S M M … M S M
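The grouping step can be sketched with `itertools.groupby`, which collects runs of equal consecutive items. The label sequence and window length below are made up, and the per-window classifier itself is assumed to exist elsewhere.

```python
from itertools import groupby


def group_windows(labels, window_seconds):
    """Merge consecutive equal labels into (label, start, end) intervals."""
    intervals = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        intervals.append(
            (label, start * window_seconds, (start + length) * window_seconds)
        )
        start += length
    return intervals


labels = list("MMMMSSSMM")  # per-window classifier output: M = music, S = speech
print(group_windows(labels, window_seconds=1))
# [('M', 0, 4), ('S', 4, 7), ('M', 7, 9)]
```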

SLIDE 19

Speech Recognition and Retrieval

  • Apply speech recognition techniques to convert speech signals into text and then apply IR techniques for indexing and retrieval

– Speech Recognition

  • Basic concepts of Automatic Speech Recognition (ASR)
  • Variations
  • Techniques based on Hidden Markov Model (HMM)

– Speaker Identification

SLIDE 20

Basic Concepts of ASR

  • General ASR System:

There are two stages of ASR:

  • 1. Training
  • Features of each speech unit are extracted and stored in the system
  • 2. Recognition
  • Features of an input speech unit are extracted and compared with each of the stored features, and the speech unit with the best matching features is taken as the recognized unit

SLIDE 21

Challenges of ASR

  • Variations in different dimensions

1. Subject
2. Time
3. Background or environmental noise
4. Isolated words vs. continuous speech
5. Read vs. spontaneous speech
6. Size of the vocabulary

SLIDE 22

Speaker Identification

  • Goal

– find the identity of the speaker

  • Can be used to determine the number of speakers in a particular setting, whether the speaker is male or female, adult or child, a person’s mood, emotional state and attitude, etc.

SLIDE 23

Music Indexing and Retrieval

  • Two types of music
  • 1. Structured music and sound effects
  • 2. Sample-based music
  • A common query input form is humming, hence the term query-by-humming

i. Retrieval based on a set of features
ii. Retrieval based on pitch

SLIDE 24

Structured Music

  • Represented by a set of commands or algorithms.
  • The most common structured music format is MIDI

– MIDI is a scripting language. It codes “events” that stand for the production of sounds. E.g., a MIDI event might include values for the pitch of a single note, its duration, and its volume.

  • MPEG-4 Structured Audio is a new standard for structured audio (music and sound effects)

SLIDE 25

Structured Music (2)

  • Developed for sound transmission, synthesis and production, not for indexing and retrieval. However, the explicit structure and note descriptions in these formats make the retrieval process easy, since there is no need to extract features from the audio signal.
  • Suitable for queries requiring an exact match between the queries and database sound files
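Because structured formats already list the notes, exact-match retrieval reduces to searching note sequences. The sketch below assumes each database entry has been reduced to a list of MIDI note numbers; the song titles, note values and function name are hypothetical.

```python
def contains_sequence(notes, query):
    """True if `query` occurs as a contiguous run inside `notes`."""
    m = len(query)
    return any(notes[i:i + m] == query for i in range(len(notes) - m + 1))


# Made-up database: song title -> MIDI note numbers taken from its note events.
database = {
    "song_a": [60, 62, 64, 65, 67],
    "song_b": [67, 65, 64, 62, 60],
}
query = [64, 65, 67]
print([title for title, notes in database.items() if contains_sequence(notes, query)])
# ['song_a']
```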

SLIDE 26

Sample-Based Music

1. Retrieval based on a set of features

– Build a model for each class based on a set of features and then compute the similarity between the features of the query and the models.

[Figure: Class Model for Laughter (Wold et al., 1996)]

SLIDE 27

Sample-Based Music (2)

2. Retrieval based on pitch

– Extract the pitch from the audio → pitch tracking
– Represent an audio piece as a 1-D sequence (a string) of either pitch directions (U-up, D-down, S-similar) or pitch values (more symbols)
– Use approximate string matching to determine similarity: find all instances of a query string Q = q1q2q3…qm in a reference string R = r1r2r3…rn such that there are at most k mismatches
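A minimal sketch of the pitch-direction encoding and the k-mismatch search. It counts substitutions only, as the "at most k mismatches" wording suggests, and does not handle inserted or dropped notes. The `tolerance` for "similar", the melody and the query are made up.

```python
def pitch_directions(pitches, tolerance=0.5):
    """Encode a pitch sequence as U (up), D (down) or S (similar) per step."""
    out = []
    for prev, cur in zip(pitches, pitches[1:]):
        if cur - prev > tolerance:
            out.append("U")
        elif prev - cur > tolerance:
            out.append("D")
        else:
            out.append("S")
    return "".join(out)


def k_mismatch_positions(query, reference, k):
    """Start indices where query aligns with reference with <= k mismatches."""
    m = len(query)
    hits = []
    for i in range(len(reference) - m + 1):
        mismatches = sum(q != r for q, r in zip(query, reference[i:i + m]))
        if mismatches <= k:
            hits.append(i)
    return hits


reference = pitch_directions([60, 62, 64, 64, 62, 60, 65, 67])  # made-up melody
query = "UDSD"  # hummed contour with one wrong direction
print(reference)                                  # UUSDDUU
print(k_mismatch_positions(query, reference, 1))  # [0]
```

Checking every alignment costs on the order of m × n comparisons, which is acceptable for short hummed queries.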

SLIDE 28

Using Audio Features for Video Indexing and Retrieval

  • Can use audio classification and speech understanding to help with the indexing and retrieval of video
  • This is important because in general it is difficult to extract video content even with complicated image processing techniques
  • Example: Informedia Project at CMU: http://www.informedia.cs.cmu.edu/

SLIDE 29

Summary

  • Audio Indexing and Retrieval

– Main Audio Features
– Audio Classification
– Speech Recognition
– Music Retrieval
– Using Audio Features for Video Indexing and Retrieval

SLIDE 30

References

Survey papers on Audio Indexing and Retrieval:

  • Guojun Lu. “Indexing and Retrieval of Audio: A Survey”. Multimedia Tools and Applications, 15, pp. 269–290, 2001.
  • Jonathan Foote. “An overview of audio information retrieval”. Multimedia Systems, Volume 7, Issue 1, January 1999.
  • Hualu Wang, Ajay Divakaran, Anthony Vetro, Shih-Fu Chang and Huifang Sun. “Survey of compressed-domain features used in audio-visual indexing and analysis”. Journal of Visual Communication and Image Representation, 14, pp. 150–183, 2003.