Audio Indexing and Retrieval
IT6902; Semester B, 2004/2005; Leung
– Motivation
– Main Audio Features
– Audio Classification
– Speech Recognition
– Music Retrieval
– Using Audio Features for Video
audience?
have been discussed for a particular project XYZ?
particular song for which we forget the title but we only know how to sing a few words or hum a few notes?
the horror scenes?
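Average energy of a frame of N samples, where x(n) is the value of sample n:

$$E = \frac{1}{N} \sum_{n=0}^{N-1} x(n)^2$$

Zero crossing rate:

$$ZC = \frac{1}{2N} \sum_{n=1}^{N-1} \left| \operatorname{sgn}[x(n)] - \operatorname{sgn}[x(n-1)] \right|$$

$$\operatorname{sgn}(a) = \begin{cases} 1 & a > 0 \\ 0 & a = 0 \\ -1 & a < 0 \end{cases}$$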
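The two time-domain features above, average energy E and zero crossing rate ZC, can be sketched in a few lines of Python. This is a minimal illustration, assuming a frame is a plain list of sample values; the function names are chosen here and do not come from the slides.

```python
def sgn(a):
    """Sign function: 1 for a > 0, -1 for a < 0, 0 for a == 0."""
    return (a > 0) - (a < 0)


def average_energy(x):
    """E = (1/N) * sum of x(n)^2 over the N samples of the frame."""
    return sum(v * v for v in x) / len(x)


def zero_crossing_rate(x):
    """ZC = sum over n = 1..N-1 of |sgn x(n) - sgn x(n-1)|, divided by 2N."""
    n_samples = len(x)
    changes = sum(abs(sgn(x[i]) - sgn(x[i - 1])) for i in range(1, n_samples))
    return changes / (2 * n_samples)


# A frame that changes sign at every sample.
frame = [1.0, -1.0, 1.0, -1.0]
print(average_energy(frame))      # 1.0
print(zero_crossing_rate(frame))  # 0.75
```

Each sign change contributes 2 to the sum, so dividing by 2N gives the fraction of samples at which the signal crosses zero.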
[Figure: audio signal with four intervals labelled as silence]
Discrete Fourier Transform (DFT):

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi nk/N}$$

Inverse Discrete Fourier Transform (IDFT):

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{j 2\pi nk/N}$$
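As an illustration, the DFT and IDFT definitions can be evaluated directly from their sums. The sketch below is a naive O(N²) version written for readability; practical systems normally use a fast Fourier transform instead. The function names are chosen here for illustration.

```python
import cmath


def dft(x):
    """X(k) = sum over n of x(n) * exp(-j*2*pi*n*k/N), for k = 0..N-1."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def idft(X):
    """x(n) = (1/N) * sum over k of X(k) * exp(+j*2*pi*n*k/N), for n = 0..N-1."""
    n_samples = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * n * k / n_samples)
            for k in range(n_samples)) / n_samples
        for n in range(n_samples)
    ]


signal = [1.0, 2.0, 3.0, 4.0]
spectrum = dft(signal)
recovered = idft(spectrum)
# The k = 0 component is the sum of the samples; the IDFT restores the input
# up to floating-point rounding.
print(abs(spectrum[0]))
print([round(v.real, 6) for v in recovered])
```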
– Indicates the frequency range of a sound
– Can be taken as the difference between the highest and lowest frequencies of the non-zero spectrum components
– “Non-zero” may be defined as at least 3 dB above the silence level
– Signal distribution across frequency components
– One important feature derived from the energy distribution is the centroid, which is the mid-point of the spectral energy distribution of a sound
– The centroid is also called brightness
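The two frequency-domain features just described can be sketched as follows. This is a rough illustration under stated assumptions: the spectrum is given as parallel lists of bin frequencies and powers, `silence_power` is a hypothetical reference level supplied by the caller, and the centroid is computed as the power-weighted mean frequency, which is one common way to locate the "mid-point" of the energy distribution.

```python
def bandwidth(freqs, power, silence_power, margin_db=3.0):
    """Highest minus lowest frequency whose power is at least
    margin_db decibels above the silence level."""
    threshold = silence_power * 10 ** (margin_db / 10)  # dB to power ratio
    active = [f for f, p in zip(freqs, power) if p >= threshold]
    return max(active) - min(active) if active else 0.0


def centroid(freqs, power):
    """Power-weighted mean frequency (assumed definition of the mid-point)."""
    total = sum(power)
    if total == 0:
        return 0.0
    return sum(f * p for f, p in zip(freqs, power)) / total


# Hypothetical five-bin power spectrum, frequencies in Hz.
freqs = [0.0, 100.0, 200.0, 300.0, 400.0]
power = [0.0, 4.0, 1.0, 4.0, 0.0]
print(bandwidth(freqs, power, silence_power=1.0))
print(centroid(freqs, power))
```

With a silence level of 1.0, the 3 dB threshold is roughly 2.0, so only the 100 Hz and 300 Hz bins count as non-zero components.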
– In a harmonic sound, the spectral components are mostly whole-number multiples of the lowest, and most often loudest, frequency
– The lowest frequency is called the fundamental frequency
– Music is normally more harmonic than other sounds
– The distinctive quality of a sound, dependent primarily on the frequency of the sound waves produced by its source
– … instruments and the voice, give rise to a sensation of pitch
– In practice, we use the fundamental frequency as an approximation of the pitch
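Since the fundamental frequency stands in for pitch in practice, a small estimator helps make the idea concrete. The slides do not say how the fundamental frequency is obtained; the sketch below uses time-domain autocorrelation, a technique swapped in here for illustration. The search range `f_min`..`f_max` and the function name are assumptions.

```python
import math


def estimate_f0(x, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency (Hz) as sample_rate / lag, where
    lag maximises the autocorrelation of x within the search range."""
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(x) - 1)
    best_lag, best_score = 0, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalised autocorrelation at this lag.
        score = sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


# 0.1 s of a pure 200 Hz tone sampled at 8 kHz (period = 40 samples).
rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]
print(estimate_f0(tone, rate))
```

The estimate is quantised to whole-sample lags, so real signals would need interpolation or a longer analysis window for finer resolution.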
[Figure: spectrogram. Axes: time and frequency. Intensity: power of a frequency component at a particular time interval]
Source: http://www.visualizationsoftware.com/gram.html
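A spectrogram of this kind can be approximated by cutting the signal into frames and taking the power of each DFT component per frame. The sketch below is a simplified illustration: it applies no window function, uses a naive DFT, and the `frame_size` and `hop` parameters are names chosen here.

```python
import cmath
import math


def spectrogram(x, frame_size, hop):
    """Return a list of frames; each frame is a list of powers |X(k)|^2
    for k = 0..frame_size/2 (the non-redundant half of the spectrum)."""
    rows = []
    for start in range(0, len(x) - frame_size + 1, hop):
        frame = x[start:start + frame_size]
        row = []
        for k in range(frame_size // 2 + 1):
            component = sum(
                frame[n] * cmath.exp(-2j * cmath.pi * n * k / frame_size)
                for n in range(frame_size)
            )
            row.append(abs(component) ** 2)
        rows.append(row)
    return rows


# A tone with exactly 2 cycles per 16-sample frame: energy lands in bin 2.
signal = [math.sin(2 * math.pi * 2 * n / 16) for n in range(32)]
for row in spectrogram(signal, frame_size=16, hop=16):
    print(max(range(len(row)), key=row.__getitem__))
```

Each row corresponds to one time interval and each column to one frequency component, matching the time and frequency axes of the figure.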
– Each feature is used individually in different classification steps
– The order in which the features are used is important; it is normally decided based on the computational complexity and the differentiating power of each feature

– A set of features is used together as a vector to calculate the closeness of the input to the training sets
– Theoretically more effective, because multiple features are considered in the classification decision, but more computationally demanding because of the multi-dimensional feature vectors
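The contrast between the two approaches can be sketched in code. Everything specific below is an assumption made for illustration: the choice of centroid and silence ratio as features, the threshold values, the training vectors, and the use of nearest-neighbour Euclidean distance as the closeness measure.

```python
import math


def classify_step_by_step(centroid_hz, silence_ratio,
                          centroid_threshold=2000.0, silence_threshold=0.2):
    """Use one feature per step, most decisive feature first.
    Thresholds are hypothetical placeholders, not values from the slides."""
    if centroid_hz > centroid_threshold:
        return "music"
    if silence_ratio > silence_threshold:
        return "speech"
    return "music"


def classify_by_vector(vector, training):
    """Return the label of the training vector closest to the input.
    `training` is a list of (label, feature_vector) pairs. Features are
    used unscaled here; real use would normalise each dimension."""
    label, _ = min(training, key=lambda item: math.dist(vector, item[1]))
    return label


# Hypothetical training data: (centroid in Hz, silence ratio).
training = [("speech", (800.0, 0.4)), ("music", (3000.0, 0.05))]
print(classify_step_by_step(800.0, 0.4))
print(classify_by_vector((900.0, 0.3), training))
```

The step-by-step version can stop after the first decisive feature, while the vector version always compares the whole feature vector against every training example.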
[Figure: speech and music]
and other sound types
– Divide the audio piece into a number of small windows, then apply the audio classification method to determine whether each window is speech or music
– Consecutive windows of the same type are then grouped into a speech or music interval
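The windowing and grouping procedure can be sketched as below. The per-window classifier is passed in as a function; the one used in the example is a toy placeholder based on peak amplitude, not a real speech/music classifier.

```python
from itertools import groupby


def segment(samples, window_size, classify):
    """Classify each window, then merge consecutive windows with the same
    label. Returns (label, start_sample, end_sample) tuples."""
    labels = [
        classify(samples[i:i + window_size])
        for i in range(0, len(samples), window_size)
    ]
    intervals = []
    window_index = 0
    for label, run in groupby(labels):
        count = len(list(run))
        start = window_index * window_size
        end = min((window_index + count) * window_size, len(samples))
        intervals.append((label, start, end))
        window_index += count
    return intervals


def toy_classifier(window):
    """Placeholder rule for illustration only."""
    return "music" if max(abs(v) for v in window) > 0.5 else "speech"


audio = [0.1] * 4 + [0.9] * 4 + [0.1] * 2
print(segment(audio, window_size=2, classify=toy_classifier))
# [('speech', 0, 4), ('music', 4, 8), ('speech', 8, 10)]
```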
There are two stages of ASR:
extracted and stored in the system
are extracted and compared with each of the stored features and the speech unit with the best matching features is taken as the recognized unit
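The two stages can be illustrated with a toy template matcher: one method stores features for each speech unit, the other compares the features of an input with every stored set and returns the best match. This is only a sketch of the structure; the feature extractor (average energy and zero crossing rate) and the Euclidean comparison are assumptions chosen for brevity, and practical recognisers rely on much richer features and matching methods.

```python
import math


def toy_features(x):
    """Hypothetical feature vector: (average energy, zero crossing rate)."""
    n = len(x)
    energy = sum(v * v for v in x) / n
    sign = lambda a: (a > 0) - (a < 0)
    zcr = sum(abs(sign(x[i]) - sign(x[i - 1])) for i in range(1, n)) / (2 * n)
    return (energy, zcr)


class TemplateRecognizer:
    def __init__(self, extract_features):
        self.extract_features = extract_features
        self.templates = {}  # speech unit -> stored feature vector

    def train(self, unit, samples):
        """Stage 1: extract and store the features of a known speech unit."""
        self.templates[unit] = self.extract_features(samples)

    def recognize(self, samples):
        """Stage 2: compare input features with each stored template and
        return the unit with the closest features."""
        query = self.extract_features(samples)
        return min(self.templates,
                   key=lambda unit: math.dist(query, self.templates[unit]))


recognizer = TemplateRecognizer(toy_features)
recognizer.train("yes", [0.9, -0.9, 0.9, -0.9])
recognizer.train("no", [0.1, 0.1, 0.1, 0.1])
print(recognizer.recognize([0.8, -0.8, 0.8, -0.8]))  # yes
```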
– Build a model for each class based on a set of features, then compute the similarity between the features of the query and each model.
Class Model for Laughter (Wold et al., 1996)
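One simple way to realise a per-class model is to store the mean and spread of each feature over the training examples, then score a query by its normalised distance to each class. The sketch below is an assumption made for illustration and is not claimed to be the model used by Wold et al.; the feature values and class names in the example are hypothetical.

```python
import math
import statistics


def build_model(examples):
    """Model a class by the per-feature mean and standard deviation of its
    training feature vectors."""
    columns = list(zip(*examples))
    means = [statistics.mean(c) for c in columns]
    # Fall back to 1.0 when a feature does not vary, to avoid dividing by 0.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds


def distance(model, query):
    """Euclidean distance after scaling each feature by the class spread.
    A smaller distance means a higher similarity."""
    means, stds = model
    return math.sqrt(sum(((q - m) / s) ** 2
                         for q, m, s in zip(query, means, stds)))


def classify(models, query):
    """Return the name of the class whose model is closest to the query."""
    return min(models, key=lambda name: distance(models[name], query))


# Hypothetical two-feature training vectors for two sound classes.
models = {
    "laughter": build_model([(0.8, 0.30), (0.9, 0.34), (1.0, 0.32)]),
    "speech": build_model([(0.2, 0.10), (0.3, 0.12), (0.25, 0.14)]),
}
print(classify(models, (0.85, 0.31)))  # laughter
```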
Tools and Applications, 15, pp. 269–290, 2001.
Multimedia Systems, Volume 7, Issue 1, January 1999.
Huifang Sun, “Survey of compressed-domain features used in audio-visual indexing and analysis”, Journal of Visual Communication and Image Representation, 14, pp. 150–183, 2003.