Speech segment classification on music radio shows using machine learning algorithms
Tim Scarfe, Yuri Kalnishkan
Introduction
- This was my bachelor thesis, but recently we have been reproducing the results and submitted a paper on it to DS2010. The paper, data, samples etc. are on my web site @ http://www.developer-x.com/papers/asot/DS2010_svm_voice_segment_r10.pdf
- We were interested in predicting intervals of speech in
electronic dance music radio shows
- We ended up working on one show “A State of Trance”
with current world #1 DJ Armin van Buuren
- We were specifically interested in Armin’s voice, not other people’s voices, singing etc.
Why?
- It’s very useful to have temporal metadata for
audio streams i.e. When are the adverts? When is the traffic information?
- Audio is slower to index/label than video. On
videos you can scrub through and ascertain the structure quickly.
- Most ad-hoc audio streams don’t have
associated temporal metadata.
Methodology
- These radio shows are 2 hours long
- We took an approach typical of machine learning
i.e. discretising the show into feature vectors (representing 1 second each) and training a learning machine model on historical examples
- For simplicity we worked with 5 minute segments
(299 seconds from each) from 9 different shows.
- We labelled them and used one for training data,
the remaining 8 were concatenated together and used for testing.
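The discretisation step described above can be sketched in NumPy. This is an illustrative outline only; the sample rate and the fake signal are assumptions for the example, not the paper's actual pipeline.

```python
import numpy as np

# Sketch: slice a mono signal into non-overlapping 1-second windows,
# one (future) feature vector per window.
sample_rate = 44100
signal = np.random.randn(sample_rate * 10)  # 10 seconds of fake audio

n_windows = len(signal) // sample_rate
windows = signal[: n_windows * sample_rate].reshape(n_windows, sample_rate)

# Each row would then be turned into a feature vector (STFT stats etc.)
print(windows.shape)  # (10, 44100)
```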
Data (post-feature extraction)
- Training Set
– 299 examples
- 28 speech
- 271 non-speech
- Test Set
– 2392 examples
- 291 speech
- 2101 non-speech
- 1 second : 1 example
Audio Analysis (1)
- Audio is in the time domain...
- You can’t do much useful analysis in the time
domain unless it’s just a sine wave!
Audio Analysis (2)
- The problem is, in the time domain everything
gets mixed together. Here are 2 simple sine waves mixed up:
Audio Analysis (3)
- Now let’s look at some typical audio from one of these radio shows.
- Here are about 230 samples of stereo audio.
What a mess!
Audio Analysis (4)
- What if there was a way to transform the
signal into the frequency domain, and discard all time information?
- Enter Fourier analysis
Fourier Analysis
- Fourier analysis represents a function as a sum of sine and cosine oscillations at integer multiples of a fundamental frequency.
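The point about mixed sine waves can be made concrete with NumPy's FFT. This is a toy illustration (the frequencies 50 Hz and 120 Hz and the 1 kHz sample rate are chosen for the example): the two components are indistinguishable in the time domain but appear as two clean peaks in the frequency domain.

```python
import numpy as np

fs = 1000                      # sample rate (Hz), chosen for the example
t = np.arange(fs) / fs         # 1 second of samples
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / fs)

# The two largest peaks sit exactly at the component frequencies.
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks.tolist()))  # [50.0, 120.0]
```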
Temporal Feature Extraction Strategy
- We could just window the audio at 44100 sample
intervals and run a DFT on each.
- However this is too coarse; we want to capture
the temporal “fabric” or “texture” of the audio.
- What we want to do is capture lots of small
features (DFTs and others) (say 100 in a second) and then merge them together using means and variances back into one feature vector representing one second.
- Enter the STFT or Short Time Fourier Transform
Short Time Fourier Transform (STFT)
- Because we want a high number of DFT
windows per second, the number of samples for each might get low. Say 44100/100 == 441 samples per window. With so few samples we actually want to use overlapping windows and apply a windowing function to reduce spectral “leakage”.
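The windowed, overlapping DFT described above can be sketched directly. This is a minimal STFT, assuming the ~441-samples-per-window figure from the slide and roughly 50% overlap; the real implementation may differ in hop size and normalisation.

```python
import numpy as np

def stft(signal, win_len=441, hop=220):
    """Short Time Fourier Transform sketch: overlapping Hann windows
    with a DFT magnitude spectrum computed on each frame."""
    window = np.hanning(win_len)   # reduces spectral "leakage"
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

x = np.random.randn(44100)         # 1 second of fake audio at 44.1 kHz
S = stft(x)
print(S.shape)                     # (frames, 221) -- rfft of 441 samples
```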
Short Time Fourier Transform (STFT) (2)
Hann window function Rectangle window function Main STFT function (DFT with windowing added)
Visualising the STFT – the Spectrogram!
Speech Music
Spectrogram of a violin playing
- Human Hearing
- Critical Bands...
- Timbre and Musical Instruments
Information Overload!
- Due to the Nyquist theorem, we still have samplerate/2
== 22050 attributes on our feature vectors. This is way too much.
- The first thing we do is “discretise” these STFT vectors into x “bins”. This will be 64, 32 or 8 in our experiments. In each bin we simply take the mean value.
- So now we have the rich frequency information broken
up into a manageable amount of bins.
- Another thing we do on some models (discussed later) is downsample the audio, i.e. to 22050 Hz, before processing it.
Richer Feature Extraction
- The binned STFT in itself would work as a
feature (and we do use it), but we can extract even more from it!
- We can write feature detectors that operate in
the frequency domain.
Frequency Domain Features (1)
Frequency Domain Features (2)
- Spectral Centroid
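The spectral centroid is the magnitude-weighted mean frequency, often described as the spectrum's "centre of mass". A minimal sketch, using the standard textbook formula:

```python
import numpy as np

def spectral_centroid(mag, freqs):
    """Magnitude-weighted mean frequency of a spectrum."""
    return np.sum(freqs * mag) / np.sum(mag)

freqs = np.array([100.0, 200.0, 300.0])
mag = np.array([1.0, 1.0, 1.0])
print(spectral_centroid(mag, freqs))  # 200.0 for a flat spectrum
```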
Frequency Domain Features (3)
Bandwidth Energy Flatness/tonality Entropy Rolloff
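The remaining features listed above can be sketched with their standard textbook definitions. These may differ in detail (normalisation, epsilon handling) from the exact formulas used in the paper.

```python
import numpy as np

def bandwidth(mag, freqs):
    """Magnitude-weighted spread of frequencies around the centroid."""
    centroid = np.sum(freqs * mag) / np.sum(mag)
    return np.sqrt(np.sum(((freqs - centroid) ** 2) * mag) / np.sum(mag))

def energy(mag):
    """Total spectral energy."""
    return np.sum(mag ** 2)

def flatness(mag):
    """Geometric mean over arithmetic mean: near 1.0 for white noise,
    near 0 for a pure tone."""
    return np.exp(np.mean(np.log(mag + 1e-12))) / (np.mean(mag) + 1e-12)

def entropy(mag):
    """Shannon entropy of the normalised spectrum."""
    p = mag / np.sum(mag)
    return -np.sum(p * np.log2(p + 1e-12))

def rolloff(mag, freqs, fraction=0.85):
    """Frequency below which `fraction` of the spectral energy lies."""
    cumulative = np.cumsum(mag ** 2)
    idx = np.searchsorted(cumulative, fraction * cumulative[-1])
    return freqs[idx]

mag = np.ones(100)
freqs = np.linspace(0, 22050, 100)
print(round(flatness(mag), 3))  # ~1.0 for a perfectly flat spectrum
```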
Means and Variances
- We take the means and variances to combine the features
back into “textural” feature vectors representing 1 second of underlying audio. Here is an image plot of our “ModelA” which produced 221 features.
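The merge described above can be sketched as concatenating per-dimension means and variances. The 100 frames and 64 features per frame are illustrative numbers, not the exact ModelA configuration.

```python
import numpy as np

# ~100 per-frame feature vectors from one second of audio are collapsed
# into a single "textural" vector of per-dimension means and variances.
frame_features = np.random.randn(100, 64)   # 100 frames x 64 features

second_vector = np.concatenate([
    frame_features.mean(axis=0),
    frame_features.var(axis=0),
])
print(second_vector.shape)  # (128,)
```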
Final Feature Vectors
Class distribution histograms
- None of the features on their own provide good
class separation – this is why we need powerful learning machines
Models Descriptions
- For comparative purposes we have created 3 models A, B and C with different parameters on the features.
Learning Machines
- We are going to try out 3 different classifiers on the 3 models to see which one does best.
Support Vector Machines w/RBF
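An RBF-kernel SVM of the kind used here can be sketched with scikit-learn's `SVC`. This is not the toolkit or the parameters from the paper; the toy data and settings below are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian clusters as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.score(X, y))  # well-separated clusters classify near-perfectly
```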
Bayesian Logistical Regression
C4.5
- Basically just the ID3 algorithm with pruning added
- Decision trees were popular in the 80’s
- At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalised information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalised information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists.
- The algorithm is based on Occam's razor i.e. it prefers
smaller decision trees (simple rules) over larger ones.
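A decision tree in this spirit can be sketched with scikit-learn. Note that scikit-learn implements CART rather than C4.5, but with `criterion="entropy"` it splits on information gain much as described above; the toy data and depth limit are assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data where the label follows a single-attribute threshold rule,
# the kind of "simple rule" a small tree should prefer.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 4))
y = (X[:, 0] > 0.5).astype(int)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X, y)
print(tree.score(X, y))  # a shallow tree recovers the simple rule
```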
Learning machine parameters
Verbose Results
Results Interpretation
- F-measure on speech class shown above
- SVM clearly out-performed the other two learning machines
- BLR performed strongly on the verbose feature set, but SVM performed
well regardless
- Interestingly, C4.5 and BLR did better on C than on B! Possibly a smaller feature set translated to better accuracy
“Precision can be seen as a measure of exactness or fidelity, whereas Recall is a measure of completeness.”
a Precision score of 1.0 for a class C means that every item labelled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labelled correctly) whereas a Recall of 1.0 means that every item from class C was labelled as belonging to class C (but says nothing about how many other items were incorrectly also labelled as belonging to class C).
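The definitions quoted above, together with the F-measure reported in the results, can be worked through in a few lines. This is a generic illustration, not the evaluation code from the paper.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F-measure for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 2 true positives, 1 false positive, 1 false negative:
print(precision_recall_f1([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]))
# precision = recall = F1 = 2/3
```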
Where did it go wrong, improvements
- Many classification errors were border cases
- Heuristics could be used to improve
performance i.e. Assumptions about no gaps in speech
- 2-second feature vectors...
Any questions?!
- tim@cs.rhul.ac.uk
- www.developer-x.com/papers/asot