Speech segment classification on music radio shows using machine learning algorithms (PowerPoint PPT Presentation)



SLIDE 1

Speech segment classification on music radio shows using machine learning algorithms

Tim Scarfe, Yuri Kalnishkan

SLIDE 2

Introduction

  • This was my bachelor thesis, but recently we have been reproducing the results and submitted a paper on it to DS2010. The paper, data, samples etc. are on my web site @ http://www.developer-x.com/papers/asot/DS2010_svm_voice_segment_r10.pdf

  • We were interested in predicting intervals of speech in electronic dance music radio shows
  • We ended up working on one show, “A State of Trance”, with current world #1 DJ Armin van Buuren

  • We were specifically interested in Armin’s voice, not other people’s voices, singing etc.
SLIDE 3

Why?

  • It’s very useful to have temporal metadata for audio streams, i.e. when are the adverts? When is the traffic information?
  • Audio is slower to index/label than video. On videos you can scrub through and ascertain the structure quickly.
  • Most ad-hoc audio streams don’t have associated temporal metadata.

SLIDE 4

Methodology

  • These radio shows are 2 hours long
  • We took an approach typical of machine learning, i.e. discretising the show into feature vectors (representing 1 second each) and training a learning machine model on historical examples
  • For simplicity we worked with 5-minute segments (299 seconds from each) from 9 different shows
  • We labelled them and used one for training data; the remaining 8 were concatenated together and used for testing.

SLIDE 5

Data (post-feature extraction)

  • Training Set
    – 299 examples
      • 28 speech
      • 271 non-speech
  • Test Set
    – 2392 examples
      • 291 speech
      • 2101 non-speech
  • 1 second : 1 example
SLIDE 6

Audio Analysis (1)

  • Audio is in the time domain...
  • You can’t do much useful analysis in the time domain unless it’s just a sine wave!

SLIDE 7

Audio Analysis (2)

  • The problem is, in the time domain everything gets mixed together. Here are 2 simple sine waves mixed up:

SLIDE 8

Audio Analysis (3)

  • Now let’s look at some typical audio from one of these radio shows.
  • Here is about 230 samples of stereo audio. What a mess!

SLIDE 9

Audio Analysis (4)

  • What if there was a way to transform the signal into the frequency domain, and discard all time information?
  • Enter Fourier analysis
SLIDE 10

Fourier Analysis

  • Fourier analysis represents a function as a sum of trigonometric functions oscillating at integer multiples of a fundamental frequency.
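To make this concrete, here is a minimal NumPy sketch (ours, not from the slides) showing a pure sine being recovered from the frequency domain; the sample rate and tone frequency are illustrative:

```python
import numpy as np

# A 50 Hz sine sampled at 1000 Hz for 1 second (illustrative values).
fs = 1000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 50 * t)

# The DFT recovers the oscillation: the magnitude spectrum peaks at the 50 Hz bin.
spectrum = np.abs(np.fft.rfft(signal))
peak_hz = np.argmax(spectrum) * fs / len(signal)
print(peak_hz)  # 50.0
```

With a 1-second window the bin spacing is exactly 1 Hz, so the peak bin index equals the frequency directly.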

SLIDE 11

Temporal Feature Extraction Strategy

  • We could just window the audio at 44100-sample intervals and run a DFT on each.
  • However this is too coarse; we want to capture the temporal “fabric” or “texture” of the audio.
  • What we want to do is capture lots of small features (DFTs and others) (say 100 in a second) and then merge them together using means and variances back into one feature vector representing one second.
  • Enter the STFT, or Short Time Fourier Transform
SLIDE 12

Short Time Fourier Transform (STFT)

  • Because we want a high number of DFT windows per second, the number of samples for each might get low, say 44100/100 == 441 samples per window. With so few samples we actually want to use overlapping windows and apply a windowing function to reduce spectral “leakage”.

SLIDE 13

Short Time Fourier Transform (STFT) (2)

[Equations shown on slide: Hann window function; rectangle window function; main STFT function (DFT with windowing added)]
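A minimal NumPy sketch of this overlapping, Hann-windowed DFT, using the slides’ 44100/100 == 441 samples per window; the 50% hop size is our assumption, not stated on the slide:

```python
import numpy as np

def stft(signal, win_len=441, hop=220):
    """Short Time Fourier Transform: overlapping, Hann-windowed DFT frames.
    win_len=441 mirrors the slides' example; hop=220 (~50% overlap) is an
    illustrative assumption."""
    window = np.hanning(win_len)  # Hann window reduces spectral leakage
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))  # magnitude spectrum per frame
    return np.array(frames)

# One second of a 440 Hz tone at 44.1 kHz
fs = 44100
t = np.arange(fs) / fs
spec = stft(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (number of frames, win_len // 2 + 1 frequency bins)
```

Each row of the result is one short-time magnitude spectrum; plotting the rows over time gives the spectrogram shown on the next slide.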

SLIDE 14

Visualising the STFT – the Spectrogram!

[Spectrograms shown on slide: speech vs. music]

SLIDE 15

Spectrogram of a violin playing

  • Human Hearing
  • Critical Bands...
  • Timbre and Musical Instruments
SLIDE 16

Information Overload!

  • Due to the Nyquist theorem, we still have samplerate/2 == 22050 attributes on our feature vectors. This is way too much.
  • The first thing we do is “discretise” these STFT vectors into x “bins”. This will be 64, 32 or 8 in our experiments. In each bin we simply take the mean average value.
  • So now we have the rich frequency information broken up into a manageable number of bins.
  • Another thing we do on some models (discussed later) is downsample the audio, i.e. to 22050 Hz, before processing it
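The binning step can be sketched as follows; the slide only says a mean is taken per bin, so the contiguous-grouping scheme and the `bin_spectrum` helper are our illustrative assumptions:

```python
import numpy as np

def bin_spectrum(spectrum, n_bins=64):
    """Reduce a high-dimensional magnitude spectrum to n_bins values by
    averaging each contiguous group of frequency bins (64, 32 or 8 bins,
    as in the slides)."""
    groups = np.array_split(spectrum, n_bins)
    return np.array([g.mean() for g in groups])

# Stand-in for a 22050-attribute STFT vector
spectrum = np.arange(22050, dtype=float)
reduced = bin_spectrum(spectrum, 64)
print(reduced.shape)  # (64,)
```

`np.array_split` handles the fact that 22050 is not an exact multiple of the bin count, so some groups hold one extra attribute.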

SLIDE 17

Richer Feature Extraction

  • The binned STFT in itself would work as a feature (and we do use it), but we can extract even more from it!
  • We can write feature detectors that operate in the frequency domain.

SLIDE 18

Frequency Domain Features (1)

SLIDE 19

Frequency Domain Features (2)

  • Spectral Centroid
SLIDE 20

Frequency Domain Features (3)

  • Bandwidth
  • Energy
  • Flatness/tonality
  • Entropy
  • Rolloff
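These detectors (including the spectral centroid from the previous slide) can be sketched with standard textbook definitions; the slides do not give the authors’ exact formulas, so this is an approximation rather than their implementation:

```python
import numpy as np

def spectral_features(mag, freqs):
    """Common frequency-domain feature detectors over one magnitude
    spectrum (textbook definitions, not necessarily the authors')."""
    power = mag ** 2
    p = power / power.sum()                         # normalised power spectrum
    centroid = (freqs * p).sum()                    # "centre of mass" of the spectrum
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * p).sum())
    energy = power.sum()
    # Flatness/tonality: geometric mean over arithmetic mean (1 = noise-like)
    flatness = np.exp(np.log(mag + 1e-12).mean()) / (mag.mean() + 1e-12)
    entropy = -(p * np.log2(p + 1e-12)).sum()
    cumulative = np.cumsum(power)
    # Rolloff: frequency below which 85% of the spectral power lies
    rolloff = freqs[np.searchsorted(cumulative, 0.85 * cumulative[-1])]
    return centroid, bandwidth, energy, flatness, entropy, rolloff

freqs = np.linspace(0, 22050, 221)
mag = np.exp(-freqs / 5000)          # toy spectrum concentrated at low frequencies
c, bw, e, fl, h, ro = spectral_features(mag, freqs)
print(c < 22050 / 2)  # True: the centroid sits in the lower half here
```

The 85% rolloff threshold is a common convention, again an assumption on our part.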

SLIDE 21

Means and Variances

  • We take the means and variances to combine the features back into “textural” feature vectors representing 1 second of underlying audio. Here is an image plot of our “ModelA”, which produced 221 features.
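A minimal sketch of this aggregation step; concatenating the per-column means followed by the variances is our assumed ordering, since the slide does not specify one:

```python
import numpy as np

def textural_vector(frame_features):
    """Merge many per-frame feature rows (e.g. ~100 frames per second)
    into one 1-second "textural" vector: per-column means, then
    per-column variances."""
    frame_features = np.asarray(frame_features)
    return np.concatenate([frame_features.mean(axis=0),
                           frame_features.var(axis=0)])

frames = np.random.rand(100, 64)   # 100 frames x 64 binned STFT values
vec = textural_vector(frames)
print(vec.shape)  # (128,)
```

Each column of per-frame features contributes two numbers, so a 64-column frame matrix yields a 128-dimensional second-level vector.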

SLIDE 22

SLIDE 23

Final Feature Vectors

SLIDE 24

Class distribution histograms

  • None of the features on their own provides good class separation; this is why we need powerful learning machines

SLIDE 25

Models Descriptions

  • For comparative purposes we have created 3 models, A, B and C, with different parameters on the features.
SLIDE 26

Learning Machines

  • We are going to try out 3 different classifiers on the 3 models to see which one does best.
SLIDE 27

Support Vector Machines w/RBF

SLIDE 28

Bayesian Logistical Regression

SLIDE 29

C4.5

  • Basically just the ID3 algorithm with pruning added
  • Decision trees were popular in the ’80s
  • At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalised information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalised information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists.
  • The algorithm is based on Occam’s razor, i.e. it prefers smaller decision trees (simple rules) over larger ones.
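The entropy-difference criterion above can be sketched in plain Python; for brevity this shows unnormalised information gain, whereas C4.5 proper divides it by the split’s own entropy (the gain ratio):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, splits):
    """Entropy reduction achieved by splitting `parent` into `splits`."""
    total = len(parent)
    remainder = sum(len(s) / total * entropy(s) for s in splits)
    return entropy(parent) - remainder

# Toy example: an attribute that separates speech from non-speech perfectly
parent = ['speech'] * 4 + ['music'] * 4
gain = information_gain(parent, [['speech'] * 4, ['music'] * 4])
print(gain)  # 1.0: a perfect binary split removes one full bit of uncertainty
```

C4.5 evaluates this quantity for every candidate attribute at each node and splits on the best one.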

SLIDE 30

Learning machine parameters

SLIDE 31

Verbose Results

SLIDE 32

Results Interpretation

  • F-measure on speech class shown above
  • SVM clearly outperformed the other two learning machines
  • BLR performed strongly on the verbose feature set, but SVM performed well regardless
  • Interestingly, C4.5 + BLR did better on C than on B! Possibly a smaller feature set translated to better accuracy

“Precision can be seen as a measure of exactness or fidelity, whereas recall is a measure of completeness.”

A precision score of 1.0 for a class C means that every item labelled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labelled correctly), whereas a recall of 1.0 means that every item from class C was labelled as belonging to class C (but says nothing about how many other items were incorrectly also labelled as belonging to class C).
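The definitions above, plus the F-measure reported in the results, in a small sketch (the confusion counts are made up for illustration, not taken from the paper):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F-measure for one class, from
    true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)                       # exactness
    recall = tp / (tp + fn)                          # completeness
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for the speech class
p, r, f = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.89 0.84
```

The harmonic mean penalises a model that trades one measure heavily for the other, which is why F-measure is the headline number for the imbalanced speech class.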

SLIDE 33

Where did it go wrong, improvements

  • Many classification errors were border cases
  • Heuristics could be used to improve performance, i.e. assumptions about no gaps in speech
  • 2-second feature vectors...
SLIDE 34

Any questions?!

  • tim@cs.rhul.ac.uk
  • www.developer-x.com/papers/asot

All data, audio samples, these slides, associated paper etc can be downloaded there!