Speech segment classification on music radio shows using machine learning algorithms
Tim Scarfe, Yuri Kalnishkan
Introduction
- This was my bachelor thesis, but recently we have been reproducing the results and submitted a paper on it to DS2010. The paper, data, samples etc. are on my web site @ http://www.developer-x.com/papers/asot/DS2010_svm_voice_segment_r10.pdf
- We were interested in predicting intervals of speech in
electronic dance music radio shows
- We ended up working on one show “A State of Trance”
with current world #1 DJ Armin van Buuren
- We were specifically interested in Armin’s voice, not other people’s voices, singing etc.
Why?
- It’s very useful to have temporal metadata for
audio streams i.e. When are the adverts? When is the traffic information?
- Audio is slower to index/label than video. On
videos you can scrub through and ascertain the structure quickly.
- Most ad-hoc audio streams don’t have
associated temporal metadata.
Methodology
- These radio shows are 2 hours long
- We took an approach typical of machine learning
i.e. discretising the show into feature vectors (representing 1 second each) and training a learning machine model on historical examples
- For simplicity we worked with 5 minute segments
(299 seconds from each) from 9 different shows.
- We labelled them and used one for training data,
the remaining 8 were concatenated together and used for testing.
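The discretisation step described above can be sketched in NumPy. This is an illustrative outline only; the sample rate and the fake signal are assumptions for the example, not the paper's actual pipeline.

```python
import numpy as np

# Sketch: slice a mono signal into non-overlapping 1-second windows,
# one (future) feature vector per window.
sample_rate = 44100
signal = np.random.randn(sample_rate * 10)  # 10 seconds of fake audio

n_windows = len(signal) // sample_rate
windows = signal[: n_windows * sample_rate].reshape(n_windows, sample_rate)

# Each row would then be turned into a feature vector (STFT stats etc.)
print(windows.shape)  # (10, 44100)
```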
Data (post-feature extraction)
- Training Set
– 299 examples
- 28 speech
- 271 non-speech
- Test Set
– 2392 examples
- 291 speech
- 2101 non-speech
- 1 second : 1 example
Audio Analysis (1)
- Audio is in the time domain...
- You can’t do much useful analysis in the time
domain unless it’s just a sine wave!
Audio Analysis (2)
- The problem is, in the time domain everything
gets mixed together. Here are 2 simple sine waves mixed up:
Audio Analysis (3)
- Now let’s look at some typical audio from one of these radio shows.
- Here are about 230 samples of stereo audio.
What a mess!
Audio Analysis (4)
- What if there was a way to transform the
signal into the frequency domain, and discard all time information?
- Enter Fourier analysis
Fourier Analysis
- Fourier analysis represents a function as a sum of sine and cosine oscillations at integer multiples of a fundamental frequency.
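The point about mixed sine waves can be made concrete with NumPy's FFT. This is a toy illustration (the frequencies 50 Hz and 120 Hz and the 1 kHz sample rate are chosen for the example): the two components are indistinguishable in the time domain but appear as two clean peaks in the frequency domain.

```python
import numpy as np

fs = 1000                      # sample rate (Hz), chosen for the example
t = np.arange(fs) / fs         # 1 second of samples
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / fs)

# The two largest peaks sit exactly at the component frequencies.
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks.tolist()))  # [50.0, 120.0]
```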
Temporal Feature Extraction Strategy
- We could just window the audio at 44100 sample
intervals and run a DFT on each.
- However this is too coarse; we want to capture
the temporal “fabric” or “texture” of the audio.
- What we want to do is capture lots of small
features (DFTs and others) (say 100 in a second) and then merge them together using means and variances back into one feature vector representing one second.
- Enter the STFT or Short Time Fourier Transform
Short Time Fourier Transform (STFT)
- Because we want a high number of DFT
windows per second, the number of samples for each might get low. Say 44100/100 == 441 samples per window. With so few samples we actually want to use overlapping windows and apply a windowing function to reduce spectral “leakage”.
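The windowed, overlapping DFT described above can be sketched directly. This is a minimal STFT, assuming the ~441-samples-per-window figure from the slide and roughly 50% overlap; the real implementation may differ in hop size and normalisation.

```python
import numpy as np

def stft(signal, win_len=441, hop=220):
    """Short Time Fourier Transform sketch: overlapping Hann windows
    with a DFT magnitude spectrum computed on each frame."""
    window = np.hanning(win_len)   # reduces spectral "leakage"
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

x = np.random.randn(44100)         # 1 second of fake audio at 44.1 kHz
S = stft(x)
print(S.shape)                     # (frames, 221) -- rfft of 441 samples
```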
Short Time Fourier Transform (STFT) (2)
Hann window function Rectangle window function Main STFT function (DFT with windowing added)
Visualising the STFT – the Spectrogram!
Speech Music
Spectrogram of a violin playing
- Human Hearing
- Critical Bands...
- Timbre and Musical Instruments
Information Overload!
- Due to the Nyquist theorem, we still have samplerate/2
== 22050 attributes on our feature vectors. This is way too much.
- The first thing we do is “discretise” these STFT vectors into x “bins”. This will be 64, 32 or 8 in our experiments. In each bin we simply take the mean value.
- So now we have the rich frequency information broken
up into a manageable amount of bins.
- Another thing we do on some models (discussed later) is downsample the audio, i.e. to 22050 Hz, before processing it.
Richer Feature Extraction
- The binned STFT in itself would work as a
feature (and we do use it), but we can extract even more from it!
- We can write feature detectors that operate in
the frequency domain.
Frequency Domain Features (1)
Frequency Domain Features (2)
- Spectral Centroid
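The spectral centroid is the magnitude-weighted mean frequency, often described as the spectrum's "centre of mass". A minimal sketch, using the standard textbook formula:

```python
import numpy as np

def spectral_centroid(mag, freqs):
    """Magnitude-weighted mean frequency of a spectrum."""
    return np.sum(freqs * mag) / np.sum(mag)

freqs = np.array([100.0, 200.0, 300.0])
mag = np.array([1.0, 1.0, 1.0])
print(spectral_centroid(mag, freqs))  # 200.0 for a flat spectrum
```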
Frequency Domain Features (3)
Bandwidth Energy Flatness/tonality Entropy Rolloff
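The remaining features listed above can be sketched with their standard textbook definitions. These may differ in detail (normalisation, epsilon handling) from the exact formulas used in the paper.

```python
import numpy as np

def bandwidth(mag, freqs):
    """Magnitude-weighted spread of frequencies around the centroid."""
    centroid = np.sum(freqs * mag) / np.sum(mag)
    return np.sqrt(np.sum(((freqs - centroid) ** 2) * mag) / np.sum(mag))

def energy(mag):
    """Total spectral energy."""
    return np.sum(mag ** 2)

def flatness(mag):
    """Geometric mean over arithmetic mean: near 1.0 for white noise,
    near 0 for a pure tone."""
    return np.exp(np.mean(np.log(mag + 1e-12))) / (np.mean(mag) + 1e-12)

def entropy(mag):
    """Shannon entropy of the normalised spectrum."""
    p = mag / np.sum(mag)
    return -np.sum(p * np.log2(p + 1e-12))

def rolloff(mag, freqs, fraction=0.85):
    """Frequency below which `fraction` of the spectral energy lies."""
    cumulative = np.cumsum(mag ** 2)
    idx = np.searchsorted(cumulative, fraction * cumulative[-1])
    return freqs[idx]

mag = np.ones(100)
freqs = np.linspace(0, 22050, 100)
print(round(flatness(mag), 3))  # ~1.0 for a perfectly flat spectrum
```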
Means and Variances
- We take the means and variances to combine the features
back into “textural” feature vectors representing 1 second of underlying audio. Here is an image plot of our “ModelA” which produced 221 features.
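The merge described above can be sketched as concatenating per-dimension means and variances. The 100 frames and 64 features per frame are illustrative numbers, not the exact ModelA configuration.

```python
import numpy as np

# ~100 per-frame feature vectors from one second of audio are collapsed
# into a single "textural" vector of per-dimension means and variances.
frame_features = np.random.randn(100, 64)   # 100 frames x 64 features

second_vector = np.concatenate([
    frame_features.mean(axis=0),
    frame_features.var(axis=0),
])
print(second_vector.shape)  # (128,)
```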
Final Feature Vectors
Class distribution histograms
- None of the features on their own provide good
class separation – this is why we need powerful learning machines
Models Descriptions
- For comparative purposes we have created 3 models A, B and C with different parameters on the features.
Learning Machines
- We are going to try out 3 different classifiers on the 3 models to see which one does best.
Support Vector Machines w/RBF
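An RBF-kernel SVM of the kind used here can be sketched with scikit-learn's `SVC`. This is not the toolkit or the parameters from the paper; the toy data and settings below are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian clusters as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.score(X, y))  # well-separated clusters classify near-perfectly
```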
Bayesian Logistical Regression
C4.5
- Basically just the ID3 algorithm with pruning added
- Decision trees were popular in the 80’s
- At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalised information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalised information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists.
- The algorithm is based on Occam's razor i.e. it prefers
smaller decision trees (simple rules) over larger ones.
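A decision tree in this spirit can be sketched with scikit-learn. Note that scikit-learn implements CART rather than C4.5, but with `criterion="entropy"` it splits on information gain much as described above; the toy data and depth limit are assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data where the label follows a single-attribute threshold rule,
# the kind of "simple rule" a small tree should prefer.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 4))
y = (X[:, 0] > 0.5).astype(int)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X, y)
print(tree.score(X, y))  # a shallow tree recovers the simple rule
```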
Learning machine parameters
Verbose Results
Results Interpretation
- F-measure on speech class shown above
- SVM clearly out-performed the other two learning machines
- BLR performed strongly on the verbose feature set, but SVM performed
well regardless
- Interestingly, C4.5 and BLR did better on C than on B! Possibly a smaller feature set translated to better accuracy
“Precision can be seen as a measure of exactness or fidelity, whereas Recall is a measure of completeness.”
a Precision score of 1.0 for a class C means that every item labelled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labelled correctly) whereas a Recall of 1.0 means that every item from class C was labelled as belonging to class C (but says nothing about how many other items were incorrectly also labelled as belonging to class C).
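The definitions quoted above, together with the F-measure reported in the results, can be worked through in a few lines. This is a generic illustration, not the evaluation code from the paper.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F-measure for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 2 true positives, 1 false positive, 1 false negative:
print(precision_recall_f1([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]))
# precision = recall = F1 = 2/3
```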
Where did it go wrong, improvements
- Many classification errors were border cases
- Heuristics could be used to improve
performance i.e. Assumptions about no gaps in speech
- 2-second feature vectors...
Any questions?!
- tim@cs.rhul.ac.uk
- www.developer-x.com/papers/asot