Audio Indexing and Retrieval IT6902; Semester B, 2004/2005; Leung

Audio Indexing and Retrieval
• Motivation
• Main Audio Features
• Audio Classification
• Speech Recognition
• Music Retrieval
• Using Audio Features for Video Indexing and Retrieval

Scenarios
• If we have an audio file of a pop singer’s concert, how can we find out when the singer is singing and when he/she is talking to the audience?
• If we have recorded the phone conversations from many conference-meeting sessions, how can we find out when a particular project XYZ was discussed and what was said about it?
• If we have many songs in digital format, how can we search for a particular song whose title we have forgotten, when we only know how to sing a few words or hum a few notes?
• If we want to skim a horror movie file, how can we find out where the horror scenes are?

Main Audio Features
• Time-Domain Features
  – Average Energy
  – Zero Crossing Rate
  – Silence Ratio
• Frequency-Domain Features
  – Sound Spectrum
  – Bandwidth
  – Energy Distribution
  – Harmonicity
  – Pitch
• Spectrogram

Time-Domain Features
• Amplitude-time representation of an audio signal

Time-Domain Features (2)
• Average Energy
  – Indicates the loudness of the audio signal
    $$E = \frac{1}{N} \sum_{n=0}^{N-1} x(n)^2$$
• Zero Crossing Rate
  – Indicates the frequency of signal amplitude sign change
    $$ZC = \frac{1}{2N} \sum_{n=1}^{N-1} \left| \operatorname{sgn}[x(n)] - \operatorname{sgn}[x(n-1)] \right|$$
    $$\operatorname{sgn}(a) = \begin{cases} 1 & a > 0 \\ 0 & a = 0 \\ -1 & a < 0 \end{cases}$$
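
The two time-domain features on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes the signal is a list of amplitude samples, that average energy is the mean of the squared samples, and that the zero crossing rate sums the sign changes between neighbouring samples and divides by 2N. The function names are chosen here for illustration.

```python
def sgn(a):
    """Sign function: 1 for a > 0, 0 for a == 0, -1 for a < 0."""
    return (a > 0) - (a < 0)


def average_energy(x):
    """Mean of the squared sample values over the N samples."""
    return sum(v * v for v in x) / len(x)


def zero_crossing_rate(x):
    """Sum of |sgn x(n) - sgn x(n-1)| for n = 1..N-1, divided by 2N."""
    n_samples = len(x)
    changes = sum(abs(sgn(x[i]) - sgn(x[i - 1])) for i in range(1, n_samples))
    return changes / (2 * n_samples)


samples = [1.0, -1.0, 1.0, -1.0]      # toy signal that flips sign at every sample
print(average_energy(samples))        # 1.0
print(zero_crossing_rate(samples))    # 0.75
```

A signal that never changes sign gives a zero crossing rate of 0, while the alternating toy signal above approaches the maximum.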

Time-Domain Features (3)
• Silence Ratio
  – Indicates the proportion of the sound piece that is silent
  – Silence is a period within which the absolute amplitude values of a certain number of samples are below a certain threshold
  – Silence ratio is calculated as the ratio between the sum of silent periods and the total length of the audio piece
[Figure: audio waveform with four silent periods marked]
Approaches:
1. Fixed Threshold
2. Select Reference Silence Value
3. Adaptive Silence Thresholds
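
As a rough sketch of the first approach (a fixed threshold), the silence ratio can be computed by counting the samples that fall inside sufficiently long quiet runs. The parameter names `threshold` and `min_run` are assumptions made for this example; the slide does not give concrete values for either.

```python
def silence_ratio(x, threshold, min_run):
    """Fraction of samples that lie inside silent periods.

    A silent period is taken to be a run of at least `min_run` consecutive
    samples whose absolute amplitude is below `threshold` (both values are
    illustrative assumptions).
    """
    silent = 0   # samples counted as silence so far
    run = 0      # length of the current below-threshold run
    for v in x:
        if abs(v) < threshold:
            run += 1
        else:
            if run >= min_run:
                silent += run
            run = 0
    if run >= min_run:   # close a silent run that reaches the end of the signal
        silent += run
    return silent / len(x)


signal = [0.0, 0.0, 0.0, 0.9, 0.8, 0.01, 0.7, 0.0, 0.0, 0.0]
print(silence_ratio(signal, threshold=0.05, min_run=2))   # 0.6
```

The single quiet sample between 0.8 and 0.7 is not counted, because it is shorter than the minimum run length.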

Frequency-Domain Features
• Sound Spectrum
  Discrete Fourier Transform (DFT)
    $$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n k / N}$$
  Inverse Discrete Fourier Transform (IDFT)
    $$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{j 2\pi n k / N}$$
  – For large values of N, the signal is often broken into blocks called frames, and the DFT is applied to each frame. This is known as the Short-Time Fourier Transform (STFT)
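
The transform pair and the frame-by-frame idea can be illustrated with a direct, deliberately naive implementation. This sketch assumes non-overlapping or hopped rectangular frames and applies no window function; practical systems normally use an FFT and a tapered window. The names `frame_len` and `hop` are assumptions for the example.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT: X(k) = sum over n of x(n) * exp(-j*2*pi*n*k/N)."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def idft(spectrum):
    """Inverse DFT: x(n) = (1/N) * sum over k of X(k) * exp(+j*2*pi*n*k/N)."""
    n_samples = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * n * k / n_samples)
            for k in range(n_samples)) / n_samples
        for n in range(n_samples)
    ]


def stft(x, frame_len, hop):
    """Apply the DFT to successive frames of `frame_len` samples, `hop` apart."""
    return [dft(x[start:start + frame_len])
            for start in range(0, len(x) - frame_len + 1, hop)]


spectrum = dft([1, 0, -1, 0])
print([round(abs(c), 6) for c in spectrum])   # magnitude of each frequency bin
```

Applying `idft` to the output of `dft` recovers the original samples up to floating-point rounding.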

Frequency-Domain Features (2)
• Bandwidth
  – indicates the frequency range of a sound
  – can be taken as the difference between the highest frequency and lowest frequency of non-zero spectrum components
  – “non-zero” may be defined as at least 3 dB above the silence level
• Energy distribution
  – Signal distribution across frequency components
  – One important feature derived from the energy distribution is the centroid, which is the mid-point of the spectral energy distribution of a sound. Centroid is also called brightness
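
Both features can be sketched from a power spectrum given as parallel lists of bin frequencies and bin powers. Two assumptions are made here that the slide does not spell out: the 3 dB margin is applied to power values (a factor of 10^(3/10)), and the centroid is computed as the power-weighted mean frequency, which is one common way to define the "mid-point" of the distribution.

```python
def bandwidth(freqs, power, silence_power):
    """Highest minus lowest frequency whose power is at least 3 dB above silence."""
    floor = silence_power * 10 ** (3 / 10)   # 3 dB above the silence level
    active = [f for f, p in zip(freqs, power) if p >= floor]
    return max(active) - min(active) if active else 0.0


def centroid(freqs, power):
    """Power-weighted mean frequency (assumed definition of the centroid)."""
    total = sum(power)
    if total == 0:
        return 0.0
    return sum(f * p for f, p in zip(freqs, power)) / total


freqs = [100, 200, 300, 400, 500]           # bin centre frequencies in Hz (toy values)
power = [0.001, 1.0, 4.0, 1.0, 0.001]       # toy power spectrum
print(bandwidth(freqs, power, silence_power=0.01))   # 200
print(centroid(freqs, power))
```

Because the toy spectrum is symmetric around 300 Hz, the centroid lands at about 300 Hz; a sound with more high-frequency energy would have a higher centroid, which is why the feature is also called brightness.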

Frequency-Domain Features (3)
• Harmonicity
  – In a harmonic sound, the spectral components are mostly whole-number multiples of the lowest, and most often loudest, frequency
  – The lowest frequency is called the fundamental frequency
  – Music is normally more harmonic than other sounds
• Pitch
  – the distinctive quality of a sound, dependent primarily on the frequency of the sound waves produced by its source
  – only periodic sounds, such as those produced by musical instruments and the voice, give rise to a sensation of pitch
  – In practice, we use the fundamental frequency as the approximation of the pitch
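
The slide does not say how the fundamental frequency is obtained. One common technique, used here purely as an illustration, is to pick the lag at which the signal's autocorrelation is largest and convert that lag to a frequency. The search range `f_min` to `f_max` is an assumption for this sketch.

```python
import math


def estimate_f0(x, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency in Hz from the autocorrelation peak.

    Only lags corresponding to frequencies between f_min and f_max are
    searched (an assumed range). Returns None if no positive peak is found.
    """
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(x) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalised autocorrelation of the signal at this lag
        score = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(800)]
print(estimate_f0(tone, sample_rate))
```

For a clean periodic tone the strongest peak sits at one period, so the estimate matches the tone's frequency; noisy or non-periodic sounds give unreliable peaks, in line with the remark that only periodic sounds have a pitch.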

Spectrogram
• Time and frequency components are shown in the same representation
[Figure: spectrogram; axes: frequency, time; intensity: power of a frequency component at a particular time interval. Source: http://www.visualizationsoftware.com/gram.html]

Audio Classification
• Goal
  – To classify the audio into speech, music and possibly into other categories/subcategories
• Motivation
  1. Different audio types require different processing, indexing and retrieval techniques
  2. Different audio types have different significance to different applications
  3. The audio type or class information is itself very useful to some applications
  4. The search space after classification is reduced to a particular audio class during the retrieval process

Speech vs. Music

Audio Classification Framework
• Step by Step Classification
  – each feature is used individually in different classification steps
  – the order in which different features are used for classification is important, normally decided based on computational complexity and the differentiating power of the different features
• Feature Vector Based Audio Classification
  – a set of features is used together as a vector to calculate the closeness of the input to the training sets
  – in theory more effective, because the classification decision considers multiple features at once, but more computationally demanding because the feature vectors are multidimensional
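
The two frameworks can be contrasted in a small sketch. Everything specific in it is an assumption made for illustration: the feature names `silence_ratio` and `zcr_variability`, the threshold values, the direction of each test, the toy training vectors, and the use of a nearest-centroid rule with Euclidean distance as the measure of closeness.

```python
import math


def classify_step_by_step(features, silence_threshold=0.3, zcr_var_threshold=0.05):
    """Decision cascade: one feature per step, cheapest test first.

    The thresholds are placeholders, not values from any published system.
    """
    if features["silence_ratio"] > silence_threshold:
        return "speech"
    if features["zcr_variability"] > zcr_var_threshold:
        return "speech"
    return "music"


def classify_by_vector(vector, training):
    """Nearest-centroid rule: pick the class whose mean training vector is closest."""
    def mean(vectors):
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    centroids = {label: mean(vectors) for label, vectors in training.items()}
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Toy training vectors: [silence_ratio, zcr_variability]; illustrative only
training = {
    "speech": [[0.5, 0.08], [0.4, 0.06]],
    "music": [[0.1, 0.01], [0.05, 0.02]],
}
print(classify_step_by_step({"silence_ratio": 0.5, "zcr_variability": 0.01}))
print(classify_by_vector([0.08, 0.015], training))
```

The cascade can stop after the first decisive feature, which keeps it cheap; the vector-based rule always uses every feature, which is where its extra cost comes from.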

Step by Step Classification
• Lu and Hankinson 1998

Feature Vector Based Audio Classification
• Scheirer and Slaney 1997
[Figure: labels: music, speech]

Example Audio Classes
• Liu and Wan 2001

Audio Segmentation
• a long sound track normally consists of a mixture of speech, music and other sound types
• can segment the audio piece into speech and music intervals based on the classification scheme discussed earlier
• Approach:
  – divide the audio piece into a number of small windows and then apply the audio classification method to determine if each window is speech or music
  – consecutive windows are then grouped into a speech or music interval if they are of the same type
[Figure: window labels … M M S S S M M M M M … grouped into intervals M S M]
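
The grouping step can be sketched as follows, assuming each window has already been labelled "M" (music) or "S" (speech) by some classifier and that all windows have the same duration. The function name and the `window_seconds` parameter are assumptions for this example.

```python
from itertools import groupby


def segment(window_labels, window_seconds):
    """Merge consecutive windows with the same label into intervals.

    Returns a list of (label, start_time, end_time) tuples in seconds.
    """
    intervals = []
    start = 0   # index of the first window in the current run
    for label, run in groupby(window_labels):
        count = len(list(run))
        intervals.append((label, start * window_seconds, (start + count) * window_seconds))
        start += count
    return intervals


labels = list("MMSSSMMMMM")
print(segment(labels, window_seconds=1.0))
# [('M', 0.0, 2.0), ('S', 2.0, 5.0), ('M', 5.0, 10.0)]
```

Smaller windows give finer interval boundaries but leave each classification decision with less audio to work from.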

Speech Recognition and Retrieval
• Apply speech recognition techniques to convert speech signals into text and then apply IR techniques for indexing and retrieval
  – Speech Recognition
    • Basic concepts of Automatic Speech Recognition (ASR)
    • Variations
    • Techniques based on Hidden Markov Model (HMM)
  – Speaker Identification

Basic Concepts of ASR
• General ASR System: There are two stages of ASR:
  1. Training
     • Features of each speech unit are extracted and stored in the system
  2. Recognition
     • Features of an input speech unit are extracted and compared with each of the stored features, and the speech unit with the best matching features is taken as the recognized unit
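
The two stages can be mimicked with a toy template matcher. This is only a sketch of the train-then-match idea: it assumes each speech unit is summarised by one fixed-length feature vector and that the best match is the smallest Euclidean distance. The class name and the feature values are hypothetical, and real recognizers use far richer models.

```python
import math


class TemplateRecognizer:
    """Toy two-stage recognizer: store features per unit, then match by distance."""

    def __init__(self):
        self.templates = {}   # speech unit -> stored feature vector

    def train(self, unit, features):
        """Training stage: store the extracted features of a speech unit."""
        self.templates[unit] = list(features)

    def recognize(self, features):
        """Recognition stage: return the unit whose stored features match best."""
        return min(self.templates,
                   key=lambda unit: math.dist(features, self.templates[unit]))


recognizer = TemplateRecognizer()
recognizer.train("yes", [0.9, 0.1, 0.3])   # made-up feature vectors
recognizer.train("no", [0.2, 0.8, 0.5])
print(recognizer.recognize([0.85, 0.15, 0.35]))   # yes
```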

Challenges of ASR
• Variations in different dimensions
  1. Subject
  2. Time
  3. Background or environmental noise
  4. Isolated words vs. continuous speech
  5. Read vs. spontaneous speech
  6. Size of the vocabulary

Speaker Identification
• Goal
  – find the identity of the speaker
• can be used to determine the number of speakers in a particular setting, whether the speaker is male or female, adult or child, a person’s mood, emotional state and attitude, etc.

Music Indexing and Retrieval
• Two types of music
  1. Structured music and sound effects
  2. Sample-based music
• A common query input form is humming, hence the term query-by-humming
  i. Retrieval based on a set of features
  ii. Retrieval based on pitch

Structured Music
• Represented by a set of commands or algorithms
• Most common structured music is MIDI
  – MIDI is a scripting language. It codes “events” that stand for the production of sounds. E.g., a MIDI event might include values for the pitch of a single note, its duration, and its volume.
• MPEG-4 Structured Audio is a new standard for structured audio (music and sound effects)
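
To make the idea of an "event" concrete, the sketch below models a note as a small record with the three values mentioned above. It is a simplified, hypothetical data structure, not the actual MIDI message format. The conversion from note number to frequency assumes the usual equal-tempered convention in which note 69 is A4 at 440 Hz.

```python
from dataclasses import dataclass


@dataclass
class NoteEvent:
    """Simplified note event (illustrative; not the MIDI wire format)."""
    pitch: int         # note number, assuming 69 = A4
    duration: float    # length of the note in seconds
    volume: int        # loudness on a 0-127 scale

    def frequency_hz(self):
        """Equal-tempered frequency of the note, assuming A4 = 440 Hz."""
        return 440.0 * 2 ** ((self.pitch - 69) / 12)


melody = [NoteEvent(69, 0.5, 100), NoteEvent(81, 0.5, 90)]
print([note.frequency_hz() for note in melody])   # [440.0, 880.0]
```

Because pitch is stored explicitly in such a representation, retrieval based on pitch does not need the signal analysis that sample-based music requires.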
