SLIDE 1

CompMusic Workshop II: Machine Learning Techniques

Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s course material) Department of Computer Science and Engineering, IIT Madras, Chennai - 600036, India e-mail: hema@cse.iitm.ac.in

9-13 July 2012, Istanbul, Turkey

SLIDE 2

Outline

  • Pattern Analysis
  • Machine Learning
  • Statistical Models (Bayes Classifier), GMMs, UBM-GMM
  • String Matching – Dynamic Programming
  • Hidden Markov Models
  • Kernel Methods (Support Vector Machine)
  • Non-negative Matrix Factorisation for Source Separation
  • Bayesian Information Criterion
  • Classification and Regression Trees
  • Score Normalisation

SLIDE 3
  • 1. Introduction

SLIDE 4

Patterns

  • 1, 2, 3, 4, 5, ? · · · , 24, 25, 26, 27, ?
  • 1, 3, 5, 7, 9, ?, · · · , 25, 27, 29, 31, ?
  • 2, 3, 5, 7, 11, ?, · · · , 29, 31, 37, 41, ?
  • 1, 1, 2, 3, 5, 8, ?, · · · , 55, 89, 144, 233, ?
  • 3, 5, 11, 21, 35, ?, · · · , 131, 165, 203, 245, ?
  • 1, 6, 19, 42, 59, ?, · · · , 95, 117, 156, 191, ?

SLIDE 5

Discovering Regularity or Structure in Data

  • 1, 2, 3, 4, 5, 6 · · · , 24, 25, 26, 27, 28
  • 1, 3, 5, 7, 9, 11, · · · , 25, 27, 29, 31, 33
  • 2, 3, 5, 7, 11, 13, · · · , 29, 31, 37, 41, 43
  • 1, 1, 2, 3, 5, 8, 13, · · · , 55, 89, 144, 233, 377
  • 3, 5, 11, 21, 35, 53, · · · , 131, 165, 203, 245, 291
    (successive differences: 2, 6, 10, 14, 18, · · · , 34, 38, 42, 46)
  • 1, 6, 19, 42, 59, ?, · · · , 95, 117, 156, 191, ?

Pattern Analysis: Automatic discovery of patterns in data.

SLIDE 6

Machine Learning I

  • Learning: acquiring new knowledge or processing existing knowledge
  • Knowledge: familiarity with the information present in data
  • Learning by machines for pattern analysis: acquisition of knowledge from data to discover patterns
  • Data-driven techniques for learning by machines: learning from examples (training of models)
  • Generalisation ability of learning machines: performance of the trained models on new (test) data

SLIDE 7

Focus of Statistical Machine Learning

  • Goal of learning techniques: Good generalisation ability
  • Learning techniques: Estimation of parameters of models
  • Learning machines and Learning techniques for pattern analysis:
  • Statistical Models (Maximum likelihood)
  • Artificial Neural Networks (Error correction learning)
  • Kernel Methods (Learning optimal linear relationships)

SLIDE 8
  • 2. Music Processing Tasks

SLIDE 9

Music Processing Tasks I

  • Annotation of waveforms.
  • Identification of the type of music:
    • Western classical
    • Hindustani classical
    • Carnatic classical
    • Turkish music
    • Solution: unimodal Bayesian inferencing
  • Identifying the genre – jazz, classical, country, pop, ...
  • Identification of the melody (rAga identification):
    • based on notes: sampUrna rAgas – the same seven notes in ascent and descent (Vector Quantisation, GMMs, Neural Networks)
    • based on transcription: Longest Common Subsequence
    • based on phraseology: janya rAgas (DTW, HMMs)
  • Identifying the composer – syntactic pattern recognition based on the lyrics:
    • Dikshitar – a large number of consonants
    • Tyagaraja, Syama Shastri – fewer consonants, more vowels
    • Entropic measures

SLIDE 10

Music Processing Tasks II

  • Kullback-Leibler divergence
  • Identification of the musician (phraseology and voice ID):
    • transcription-based analysis (LCS)
    • the UBM-GMM framework
  • Segmentation of a piece into different parts:
    • segmenting a concert into different pieces (BIC), using applauses or using characteristics of the pieces
    • segmenting a piece into its constituent parts: AlAp, kriti, neraval, swaraprasthara, thani, ...
  • Music separation (NMF)

SLIDE 11

Pattern Classification Problems

Figure: a pattern classification system – data undergoes feature extraction; classification against a database of models produces a class label.

Figure: example two-class problems in the (x1, x2) plane – linearly separable classes, nonlinearly separable classes, and overlapping classes.

SLIDE 12
  • 3. Statistical Models for Pattern Classification

SLIDE 13

Probability Distribution

  • Data of a class is represented by a probability distribution
  • Single cluster classes
  • Represent using unimodal distribution

Univariate Gaussian distribution:

    p(x) = N(x | µ, σ²) = (1 / (√(2π) σ)) exp(−(x − µ)² / (2σ²))

where µ is the mean and σ² is the variance.

Figure: the density p(x) as a function of x, centred at µ, with its width indicated by 2σ.

SLIDE 14

Multivariate Gaussian Distribution I

  • Data in d-dimensional space:

    p(x) = N(x | µ, Σ) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))

    where µ is the mean vector and Σ is the covariance matrix.
  • Example for d = 2: x = [x1, x2]ᵀ, µ = [µ1, µ2]ᵀ = [E(x1), E(x2)]ᵀ

    Σ = | E((x1 − µ1)²)            E((x1 − µ1)(x2 − µ2)) |   =   | σ1²   σ12 |
        | E((x2 − µ2)(x1 − µ1))    E((x2 − µ2)²)         |       | σ12   σ2² |

    where σ12 denotes the covariance E((x1 − µ1)(x2 − µ2)).

SLIDE 15

Multivariate Gaussian Distribution II

A multivariate Gaussian distribution was used to represent 8 rAgas in Carnatic music (including both sampUrna and janya rAgas).

Objective: to perform rAga identification on 30-second pieces.

SLIDE 16

Multivariate Gaussian Distribution III

Table: Confusion matrix – MVN. Columns in the order Ham, SuDha, SuSav, Kal, Kara, SriRan, Hari, Sank; each row sums to 1. The Ham row has only six surviving entries (its two zero cells were lost in extraction), so its column positions are not fully recoverable:

  Ham:    0.65    0.1125  0.0375  0.075   0.0125  0.1125
  SuDha:  0.00769 0.8307  0.0307  0.00769 0.02307 0.02307 0.0538  0.02307
  SuSav:  0.03    0.09    0.61    0.02    0.05    0.05    0.09    0.06
  Kal:    0.0428  0.1667  0.038   0.4142  0.0476  0.0142  0.1238  0.1523
  Khara:  0.0714  0.0142  0.0380  0.0238  0.5428  0.1238  0.1238  0.0619
  SriRan: 0.1083  0.1083  0.0667  0.0083  0.15    0.45    0.0917  0.0167
  Hari:   0.0142  0.0285  0.0214  0.0214  0.0714  0.0142  0.6428  0.1857
  Sank:   0.125   0.085   0.005   0.04    0.025   0.01    0.14    0.57

(ham – hamsadhwani, kal – kalyani, khara – kharaharapriya, SriRan – sriranjani, Hari – harikAmbOji, Sudh – suddhAdhanyAsi, Sank – sankarAbharana, Susa – suddhAsAvEri)

Note: sriranjani’s notes are a subset of kharaharapriya’s – the reason for the confusion ⇒ we need the phonotactics of the swaras.

SLIDE 17
  • 4. Methods of Parameter Estimation

SLIDE 18

Maximum Likelihood (ML) Parameter Estimation

Given training data D = {x1, x2, · · · , xn}, estimate the parameter vector θ = (θ1, θ2, · · · , θK)ᵀ.

  • Likelihood of the training data for a given θ:

    P(D|θ) = ∏_{i=1}^{n} p(xᵢ|θ),  with log-likelihood L(θ) = log P(D|θ)

  • Choose the θ that maximises the log-likelihood:

    θ_ML = argmax_θ log P(D|θ)
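For a univariate Gaussian this maximisation has a closed form: the ML estimates are the sample mean and the (1/n) sample variance. A minimal Python sketch, illustrative rather than taken from the slides:

```python
import numpy as np

def gaussian_mle(x):
    """ML estimates for a univariate Gaussian: the sample mean and the
    biased (1/n) sample variance maximise log P(D | mu, sigma^2)."""
    mu = np.mean(x)
    var = np.mean((x - mu) ** 2)   # divides by n, not n - 1
    return mu, var

# toy usage: recover the parameters of a known Gaussian
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10000)
print(gaussian_mle(data))          # close to (2.0, 1.5**2 = 2.25)
```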

SLIDE 19

Illustration of parameter estimation using MLE

SLIDE 20

Multimodal Distributions

Data is made up of multiple clusters.

Figure: a bivariate multimodal distribution.

SLIDE 21

Clusters in Pattern Space (Vector Quantisation)

Figure: a partitioned vector space (waist vs. length); each X marks the centroid of a region.

SLIDE 22

An Algorithm for Vector Quantisation I

Figure: flowchart of the binary-split VQ algorithm:
  1. Start with m = 1; find the centroid of all training vectors.
  2. Split each centroid: m = 2·m.
  3. Classify the vectors against the current centroids.
  4. Recompute the centroids and the distortion D.
  5. If D − D′ ≥ δ (D′ = distortion of the previous iteration), set D′ = D and go to step 3.
  6. If m < M, go to step 2; otherwise stop.

SLIDE 23

An Algorithm for Vector Quantisation II

The average distortion Dᵢ in cell Cᵢ is given by

    Dᵢ = (1/N) Σ_{x ∈ Cᵢ} d(x, zᵢ)

where

  • zᵢ is the centroid of cell Cᵢ,
  • d(x, zᵢ) = (x − zᵢ)ᵀ(x − zᵢ), and
  • N is the number of vectors.

The centroids obtained at convergence are stored in a codebook, called the VQ codebook.
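A compact sketch of the binary-split (LBG-style) quantiser described on these two slides, in Python; the distortion threshold δ and the perturbation size are illustrative choices, not values from the slides:

```python
import numpy as np

def lbg_codebook(X, M, delta=1e-4, seed=0):
    """Grow a codebook 1 -> 2 -> 4 ... -> M centroids by splitting, refining
    each size with Lloyd iterations until the distortion change < delta."""
    rng = np.random.default_rng(seed)
    codebook = X.mean(axis=0, keepdims=True)            # m = 1: global centroid
    while len(codebook) < M:
        eps = 0.01 * rng.standard_normal(codebook.shape)
        codebook = np.vstack([codebook + eps, codebook - eps])  # m = 2m
        d_prev = np.inf
        while True:
            # classify every vector to its nearest centroid
            dists = ((X[:, None, :] - codebook[None]) ** 2).sum(-1)
            labels = dists.argmin(axis=1)
            # recompute centroids (keep the old one if a cell is empty)
            codebook = np.array([X[labels == k].mean(axis=0)
                                 if np.any(labels == k) else codebook[k]
                                 for k in range(len(codebook))])
            d_cur = dists.min(axis=1).mean()            # average distortion D
            if d_prev - d_cur < delta:                  # D' - D < delta: done
                break
            d_prev = d_cur
    return codebook
```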

SLIDE 24

An Algorithm for Vector Quantisation III

Figure: system based on VQ – a training set of vectors is clustered into a codebook; a quantizer maps input vectors to codebook indices using the distance d(·,·).

SLIDE 25

An Algorithm for Vector Quantisation IV

Figure: a two-cluster VQ – the initialisation and the cluster assignments at iterations 1, 3, 5 and 8.

SLIDE 26

Gaussian Mixture Model

  • Representation of multimodal distributions: the Gaussian mixture model (GMM).
  • A GMM is a linear superposition of multiple Gaussians:

    p(x) = Σ_{k=1}^{K} πₖ N(x | µₖ, Σₖ)

  • For a d-dimensional feature vector representation of the data, the parameters of each component of a GMM are:
    • the mixture coefficient πₖ
    • a d-dimensional mean vector µₖ
    • a d × d covariance matrix Σₖ

Maximum likelihood method for training a GMM: the Expectation-Maximisation (EM) method.

SLIDE 27

Gaussian Mixture Densities and its relation to VQ

  • The VQ codebook is modelled as a family of Gaussian probability density functions (PDFs).
  • Each cell is represented by a multi-dimensional PDF.
  • These PDFs can overlap, rather than partition the acoustic space into disjoint subspaces.
  • Correlations between different elements of the feature vector can also be accounted for in this representation.
  • The Expectation-Maximisation (EM) algorithm is used to estimate the density functions.

SLIDE 28

Gaussian Mixture Models – Estimation of Parameters

Interpretation of Gaussian mixtures:

  • Use a discrete latent variable z: a K-dimensional binary random variable in 1-of-K form.
  • p(x, z) = p(x|z) p(z)
  • p(zₖ = 1) = πₖ for 1 ≤ k ≤ K, with Σ_{k=1}^{K} πₖ = 1
  • Meaning: every point x is described completely by zₖ.
  • The weight πₖ indicates the responsibility of the k-th mixture component in explaining the point x.

SLIDE 29

Expectation Maximisation Algorithm

Given a Gaussian mixture model, maximise the likelihood function with respect to the parameters:

  • 1. Initialise the means µₖ, covariances Σₖ and mixing coefficients πₖ.
  • 2. Evaluate the initial value of the log-likelihood.
  • 3. E step: evaluate the responsibilities p(zₖ = 1|xₙ) = γ(zₙₖ) using the current parameter values.
  • 4. M step: re-estimate the parameters µₖ, Σₖ and πₖ using the current responsibilities.
  • 5. Evaluate the log-likelihood.
  • 6. Check for convergence of either the parameters or the log-likelihood.
  • 7. If the convergence criterion is not satisfied, return to step 3.
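A minimal EM loop for a GMM along the lines of the steps above (Python; it runs a fixed number of iterations instead of testing convergence, and adds a small diagonal term for numerical stability — both are assumptions of this sketch):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """EM for a full-covariance GMM: alternate the E step (responsibilities)
    and the M step (re-estimate pi_k, mu_k, Sigma_k)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, K, replace=False)]              # step 1: initialise
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step: gamma(z_nk) = pi_k N(x_n | mu_k, Sigma_k) / sum_j ...
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                         for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate the parameters from the responsibilities
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] \
                       + 1e-6 * np.eye(d)
    return pi, mu, sigma
```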

SLIDE 30

Example (Adapted from Bishop) I

Figure: a two-mixture GMM – the initialisation and the fit at iterations 1, 17, 30 and 45.

SLIDE 31

Example (Adapted from Bishop) II

Figure: the same data clustered by a three-cluster k-means and a three-mixture GMM.

SLIDE 32

GMM based Classifier

Figure: GMM-based classifier – a bank of GMMs (one per class) produces p(x|λ1), . . . , p(x|λM); decision logic outputs the class label.

x is a feature vector obtained from a test sample, and

    p(x|λₘ) = Σ_{k=1}^{K} π_{mk} N(x | µ_{mk}, Σ_{mk})

SLIDE 33

UBM-GMM Framework

The Universal Background Model based GMM (UBM-GMM) is another popular framework in the literature:

  • Works well when there is an imbalance in the data across classes.
  • Normalises scores across different classes.
  • Useful in the context of a verification paradigm for classification.

In the context of music:

  • Reduces the search space.
  • Verifies whether a given song belongs to a specific melody.

Philosophy of UBM-GMM:

  • Pool data from all classes.
  • Build a single UBM-GMM to represent the pooled data.
  • Adapt the UBM-GMM using the data from each specific class.
  • Test against the adapted model.
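The adaptation step is typically a MAP update of the UBM parameters; below is a mean-only sketch in the style of the standard UBM-GMM recipe (the relevance factor r and the mean-only choice are assumptions of this illustration, not details given on the slide):

```python
import numpy as np

def map_adapt_means(ubm_mu, gamma, X, r=16.0):
    """Shift each UBM mixture mean towards the class data in proportion to
    its soft count n_k. gamma holds the UBM responsibilities for the class
    data X (one row per frame, one column per mixture)."""
    nk = gamma.sum(axis=0)                                  # soft count per mixture
    ex = (gamma.T @ X) / np.maximum(nk[:, None], 1e-10)     # E_k[x] under the data
    alpha = nk / (nk + r)                                   # data-dependent weight
    return alpha[:, None] * ex + (1.0 - alpha)[:, None] * ubm_mu
```

Mixtures that see little class data keep the UBM mean (alpha near 0), which is what makes the adapted model robust to data imbalance.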

SLIDE 34

UBM-GMM framework – An example

Figure: UBM-GMM example – a UBM fit to pooled data (left), new class data against the UBM (middle), and the adapted model on the new data (right).

SLIDE 35

GMM based system for rAga identification

  • During training, a Gaussian Mixture Model (GMM) is built for each rAga using frame-level pitch features.
  • During testing, the GMMs produce class-conditional acoustic likelihood scores for the test utterance.
  • The GMM that gives the maximum acoustic score is chosen, and the decision about the rAga is made (since each GMM corresponds to a rAga).

SLIDE 36

GMM based Melody (rAga) identification I

  • Objective: to identify a rAga from 30-second pieces.
  • Pitch was extracted from these pieces.
  • The pitch was normalised on the cent scale and collapsed onto a single octave.
  • Both sampUrna and janya rAgas were experimented with.
  • The number of mixtures was chosen based on the number of notes in the rAga.

The figure (next slide) shows the GMM for the rAga hamsadhwani.

SLIDE 37

GMM based Melody (rAga) identification II

Figure: the GMM for the rAga hamsadhwani.

SLIDE 38

GMM based Melody (rAga) identification III

Table: Confusion matrix – GMM. Columns in the order Ham, SuDha, SuSav, Kal, Kara, SriRan, Hari, Shank; zero entries were lost in extraction, so within each row the surviving values are listed in column order and their exact column positions are not recoverable:

  Ham:    0.3333 0.1111 0.4444 0.1111
  SuDha:  0.3846 0.3846 0.0769 0.1538
  SuSav:  0.2000 0.4000 0.2000 0.1000 0.1000
  Kal:    0.2000 0.1333 0.3333 0.2667 0.0667
  Kara:   0.2500 0.1000 0.1000 0.4000 0.0500 0.1000
  SriRan: 0.0952 0.0952 0.4762 0.3333
  Hari:   0.1000 0.1000 0.0500 0.0500 0.6500 0.0500
  Shank:  0.2000 0.3000 0.2000 0.1000 0.2000

SLIDE 39

GMM Tokenisation I

GMM tokenisation:

  • A parallel set of GMM tokenisers.
  • A bank of tokeniser-dependent interpolated motif models (unigram, bigram).
  • Each tokeniser produces frame-by-frame indices of the highest-scoring GMM component.
  • The likelihood of the stream of symbols from each tokeniser is evaluated by the language models for each rAga.
  • A backend classifier determines the rAga.

SLIDE 40

GMM Tokenisation II

Figure: GMM tokenisation – feature extraction feeds GMM tokenisers; average log-likelihoods under per-rAga models (e.g. nilAmbari, bhairavi, sankarAbharana) feed a classifier that outputs the hypothesised rAga.

SLIDE 41

GMM Tokenisation III

The rationale: cognitively, do we try to identify rAgas using the models that we know?

  • A listener may state that s/he is not able to place a rAga, but that in parts it sounds like nilAmbari (bhairavi).
  • It might be a sankarAbharanam or yadukulakAmbOji (mAnji).

The question is: what is a meaningful token? Our initial results using pitch as a feature were miserable!

SLIDE 42
  • 5. String Matching

SLIDE 43

String Matching and Music Analysis

Music has a lot of structure:

  • Notes correspond to specific frequencies.
  • A sequence of notes that makes a phrase identifies a melody.

Analysis of music:

  • Transcription into notes.
  • Identification of phrases.

Solution: Longest Common Subsequence (LCS) match, which is robust when:

  • notes occur for a longer duration,
  • a particular note is missed but the fragment still belongs to the same melody, or
  • different notes have different durations – for example, the same song sung in a different metre.

Training:

  • Transcribe the training examples to sequences of notes.
  • Make a set of templates in terms of symbols.

Testing:

  • Transcribe the test fragment to a sequence of notes.
  • Compare with the trained templates using dynamic programming.
  • The longest match identifies the melody.

SLIDE 44

String Matching Example using Dynamic Programming

Figure: dynamic-programming table for an LCS match between the two symbol strings shown on the slide; the LCS recovered in this example is N K B U.
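The table is filled by the classic LCS recurrence; a Python sketch (the note strings in the usage line are made-up examples, not from the slide):

```python
def lcs(a, b):
    """O(len(a)*len(b)) dynamic programming:
    dp[i][j] = length of the LCS of a[:i] and b[:j]."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # backtrack to recover one LCS
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

# e.g. matching two note-symbol strings
print(lcs("SRGMPDNS", "SRGPDS"))   # -> "SRGPDS"
```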

SLIDE 45

Dynamic Time Warping

Time-normalisation constraints:

  • endpoint alignment
  • monotonicity constraints
  • local continuity constraints
  • global path constraints
  • slope weighting

Figure: a warping path between a test pattern (length Tx) and a reference pattern (length Ty).
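A bare-bones DTW matcher in Python, using a squared-difference local cost and the basic step set; slope weighting and the global path constraints from the list above are omitted in this sketch:

```python
import numpy as np

def dtw(x, y):
    """Accumulated-cost DTW between two 1-D contours with steps
    {(1,0), (0,1), (1,1)} and endpoint alignment at both ends."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])    # distance along the best warping path
```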

SLIDE 46

Dynamic Programming based Matching for Music I

Feature used: tonic-normalised pitch contour.
Distance measure: Euclidean distance between pitch values.

Figure: the optimal DTW path – the reference and query pitch contours, the mapping between them, and the warped query overlaid on the reference.

SLIDE 47

Dynamic Programming based Matching for Music II

Figure: the template (reference) and query pitch contours used in the match.

SLIDE 48
  • 6. Overview of Hidden Markov Models (HMMs)

SLIDE 49

Drawbacks of the DTW-based Approach

  • Endpoint constraints.
  • Monotonicity and global path constraints.
  • When the templates vary significantly, a large number of different templates is required.
  • An alternative: Hidden Markov Models.

SLIDE 50

The HMM Framework

  • Provides a statistical model for characterising the properties of sequential signals.
  • Is currently used in industry.
  • Needs large amounts of training data to get reliable models.
  • Choosing the best structure is a difficult task.

SLIDE 51

Observable Markov Model

An observable Markov model is a finite automaton with:

  • a finite number of states
  • transitions

SLIDE 52

Hidden Markov Model

  • Suppose now that:
    • the output associated with each state is probabilistic, and
    • the state in which the system is remains hidden – only the observation is revealed.
  • Such a system is called a Hidden Markov Model.

SLIDE 53

HMM Example

Coin-toss experiment:

  • One or more coins are tossed behind an opaque curtain.
  • How many coins there are and which coin is chosen are unknown.
  • The results of the experiments are known.

A typical observation sequence could be: O = (o1, o2, o3, ..., oT) = (H H T T T H T T H ... H)

SLIDE 54

Two Coin Model

Figure: two-coin hidden Markov model – states 1 (Coin 1) and 2 (Coin 2), with self-transition probabilities a11, a22 and cross-transition probabilities 1 − a11, 1 − a22.

O              = H H T T H T H H T T H ...
State sequence = 2 1 1 2 2 2 1 2 2 1 2 ...
or             = 1 1 1 2 1 1 2 1 1 2 1 ...

SLIDE 55

Three-coin Model

Figure: three-coin hidden Markov model – states 1, 2, 3 (Coins 1–3), with transition probabilities aij between all pairs of states.

O              = H H T T H T H H T T H ...
State sequence = 3 1 2 3 3 1 1 2 3 1 3 ...
or             = 1 2 3 1 1 2 2 3 1 2 3 ...

SLIDE 56

HMM (cont’d)

  • The system can be in one of N distinct states.
  • A change from one state to another occurs at discrete time instants.
  • This change is:
    • probabilistic
    • dependent only on the R preceding states (usually R = 1)
  • aij represents the probability of transition from state i at time t to state j at time t + 1.
  • Each state has M distinct observations associated with it; bj(k) is the probability of observing the k-th symbol in state j.
  • The probability that the system is initially in the i-th state is πi.

SLIDE 57

Three-coin Model

Figure: the three-coin hidden Markov model again – states 1, 2, 3 (Coins 1–3), with transition probabilities aij between all pairs of states.

O              = H H T T H T H H T T H ...
State sequence = 3 1 2 3 3 1 1 2 3 1 3 ...
or             = 1 2 3 1 1 2 2 3 1 2 3 ...

SLIDE 58

HMM (cont’d)

  • Observation sequence: O = (o1o2 . . . oT)
  • State sequence: q = (q1q2 . . . qT)
  • Model: λ = (A, B, π)

SLIDE 59

The Three Basic Problems

  • 1. Testing: Given O = (o1 o2 . . . oT) and λ, how do we efficiently compute P(O|λ)?
    • Given a sequence of speech frames that are known to represent a digit, how do we recognise which digit it is? Since we have models for all the digits, choose the digit for which P(O|λ) is maximum.
    • Efficiency is crucial because the straightforward approach is computationally infeasible.
    • Solution: the Forward-Backward procedure.

SLIDE 60

Evaluation: An Example I

Consider the two-coin model and the observation sequence O = {o1, o2, o3} = {H, T, H}.
Corresponding to the three observations we have three states Q = {q1, q2, q3}.
The state sequences can be any of:

  q1 q2 q3
  1  1  1
  1  1  2
  1  2  1
  1  2  2
  2  1  1
  2  1  2
  2  2  1
  2  2  2

SLIDE 61

Evaluation: An Example II

Consider the state sequence Q = {1, 2, 2}:

    P(O|Q, λ) = P(HTH | Q = {1, 2, 2}) = b1(H) b2(T) b2(H)
    P(Q|λ) = π1 a12 a22
    P(O|λ) = Σ_Q P(O, Q|λ) = Σ_Q P(O|Q, λ) P(Q|λ)

To evaluate P(O|λ), we marginalise over all the state sequences listed above.

SLIDE 62

HMM Training

  • 2. Training: How do we adjust the parameters of λ to maximise P(O|λ)?
    • Given a number of utterances of the word “two”, adjust the parameters of λ until P(O | “two” model) converges.
    • This procedure is called training.
    • Solution: the Expectation-Maximisation (Baum-Welch) algorithm.

SLIDE 63

Best State Sequence I

  • 3. Best State Sequence: Given O = (o1 o2 . . . oT) and λ, how do we choose a corresponding q that best “explains” the observations?
    • For the hypothesised digit model, which state sequence best “explains” the observations?
    • The answer is strongly influenced by the optimality criterion.
    • The single best state sequence is the commonly chosen criterion – the Viterbi algorithm.
    • Example: “The eight frames of speech from t = 120 ms to 260 ms are best explained by the ‘two’ model’s state sequence s2-s3-s3-s4-s4-s4-s5-s5.”

SLIDE 64

Solution to Problem 1: Testing

  • P(O|λ) = Σ_{all q} P(O|q, λ) P(q|λ)
  • The direct method is computationally prohibitive – “all q” means all possible state sequences.
  • Define variables for the probability of partial observation sequences:

    αt(i) = P(o1 o2 . . . ot, qt = i | λ)
    βt(i) = P(ot+1 ot+2 . . . oT | qt = i, λ)

  • A very efficient inductive procedure exists for computing P(O|λ): the Forward-Backward algorithm.
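The forward recursion in Python, following the inductive definition of αt(i) (a sketch without the rescaling needed for long sequences):

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward pass: alpha[t, i] = P(o_1..o_t, q_t = i | lambda).
    A: (N,N) transitions, B: (N,M) emissions, pi: (N,) initial
    probabilities, obs: list of integer observation indices."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # induction basis
    for t in range(1, T):
        # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum()                             # P(O | lambda)
```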

SLIDE 65

Computation of αt(i)

Figure: trellis for the forward variable – αt+1(j) is obtained by summing αt(i) aij over the states S1 . . . SN at time t, for each state Sj at time t + 1.

SLIDE 66

Solution to Problem 2: Training I

  • For the given model λ we need to adjust A, B and π to satisfy the chosen optimisation criterion.
  • A closed-form solution does not exist.
  • The training data is used in an iterative manner and the model parameters are re-estimated until convergence.
  • The probabilities are estimated using relative frequencies of occurrence:

    π̄i = expected number of times in state i at time t = 1

    āij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)

    b̄j(k) = (expected number of times in state j observing symbol k) / (expected number of times in state j)
Hema A Murthy, Ashwin Bellur and P Sarala (Slides adapted from Prof C Chandra Sekhar’s

Slide 65/133

slide-67
SLIDE 67

Solution to Problem 3: Best State Sequence

  • Suppose δt(i) be the most probable single path at time t, which accounts for the

first t observations and ends in state i

  • By induction

δt+1(j) =

  • max

i

δt(i) aij

  • · bj(ot+1)
  • To actually retrieve the state sequence we need to keeptrack of the argument that

maximized the above for each t and j
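A direct transcription of the Viterbi recursion, with the back-pointers ψ needed to retrieve the state sequence (Python sketch, no log-domain scaling):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Single best state sequence: delta[t, j] = max-prob path ending in j,
    psi[t, j] = the state at t-1 that achieved that maximum."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A           # [i, j] = delta_t(i) a_ij
        psi[t] = scores.argmax(axis=0)               # keep the maximiser
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```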

SLIDE 68

Dynamic Time Warping

Time-normalisation constraints:

  • endpoint alignment
  • monotonicity constraints
  • local continuity constraints
  • global path constraints
  • slope weighting

Figure: a warping path between a test pattern (length Tx) and a reference pattern (length Ty).

SLIDE 69

Markov Model for Music

Figure: left-to-right HMMs (4-state and 6-state examples) with self-transitions aii, forward transitions aij and emission densities bi(o); an example observation sequence is aligned to states 1 1 2 3 3 4.

Left-to-right constraints:

    aij = 0 for j < i
    πi = 0 for i ≠ 1,  π1 = 1

SLIDE 70

HMM Training

Figure: HMM training loop – model initialisation, state-sequence segmentation of the training data (symbol sequences), model re-estimation, and a convergence check on the model parameters.

SLIDE 71

Isolated Motif Recognition System

Figure: the music signal S undergoes feature analysis (vector quantisation) to give an observation sequence O; HMMs λ1 . . . λv (one per motif) compute P(O|λ1), . . . , P(O|λv), and the index of the maximum probability selects the recognised motif.

SLIDE 72

HMM for Phrase Identification

The figure shows a specific motif of the rAga kAmbOji as sung by three different musicians, with the alignment in terms of HMM states given below each contour.

Note: the HMM used here includes skip states to highlight the state transitions. Details can be found in Vignesh’s paper.

Figure: state occupancy from the Viterbi alignment of the three renditions of the motif.

SLIDE 73

Key Phrase Spotting I

An e-HMM framework can be used for spotting key phrases in a musical piece:

  • The garbage model: an HMM that represents everything in a piece that is not a typical phrase.
  • The motif HMMs correspond to the different motifs that can be used to identify the piece.
  • Transitions across motifs can go via the garbage model or directly.
  • The transition probabilities need to be determined from training data.
  • Initially all transitions can be made equiprobable.
  • The parameters can then be learned using an EM framework.

SLIDE 74

Key Phrase Spotting II

Figure: key-phrase spotting network – HMMs for Motifs 1–4 interconnected with a garbage model.

SLIDE 75
  • 7. Support Vector Machines for Pattern Classification

SLIDE 76

Key Aspects of Kernel Methods

Kernel methods involve:

  • A nonlinear transformation of the data to a higher-dimensional feature space induced by a Mercer kernel.
  • Detection of optimal linear solutions in the kernel feature space.
  • Transformation to a higher-dimensional space converts nonlinear relations into linear relations (Cover’s theorem):
    • nonlinearly separable patterns to linearly separable patterns
    • nonlinear regression to linear regression
    • nonlinear separation of clusters to linear separation of clusters

Key feature: “Pattern analysis methods are implemented such that the kernel feature space representation is not explicitly required. They involve computation of pair-wise inner products only.” The pair-wise inner products are computed efficiently, directly from the original representation of the data, using a kernel function (the kernel trick).

SLIDE 77

Illustration of Transformation

Transformation: F(x, y) = {x², y², √2 xy}, mapping to Z = {z1, z2, z3}.

SLIDE 78

Optimal Separating Hyperplane for linearly Separable Classes I

Figure: the optimal separating hyperplane wᵀx + b = 0 with margin hyperplanes wᵀx + b = +1 and wᵀx + b = −1; the support vectors lie on the margin, at distance 1/||w|| from the separating hyperplane.

SLIDE 79

Optimal Separating Hyperplane for linearly Separable Classes II

  • The training data set consists of L examples {(xi, yi)}, i = 1, . . . , L, with xi ∈ R^d and yi ∈ {+1, −1}, where xi is the i-th training example and yi is the corresponding class label.
  • The hyperplane is given by wᵀx + b = 0, where w is the parameter vector and b is the bias.
  • A separating hyperplane satisfies the constraints:

    yi(wᵀxi + b) > 0 for i = 1, 2, ..., L (1)

  • Canonical form (reduces the search space):

    yi(wᵀxi + b) ≥ 1 for i = 1, 2, ..., L (2)

  • The distance between the nearest example and the separating hyperplane (the margin) is 1/||w||.
  • The solution maximises 1/||w||, i.e. it minimises ||w||.

Learning problem:

SLIDE 80

Optimal Separating Hyperplane for linearly Separable Classes III

  • Constrained optimisation: minimise

    J(w) = (1/2)||w||² (3)

  • The Lagrangian objective function:

    Lp(w, b, α) = (1/2)||w||² − Σ_{i=1}^{L} αi [yi(wᵀxi + b) − 1] (4)

    where the αi are nonnegative and are called Lagrange multipliers.
  • A saddle point of the Lagrangian objective function is a solution.
  • The Lagrangian objective function is minimised with respect to w and b, and then maximised with respect to α.
  • The conditions of optimality due to the minimisation are:

    ∂Lp(w, b, α)/∂w = 0 (5)
    ∂Lp(w, b, α)/∂b = 0 (6)

SLIDE 81

Optimal Separating Hyperplane for linearly Separable Classes IV

  • Applying the optimality conditions gives:

    w = Σ_{i=1}^{L} αi yi xi (7)
    Σ_{i=1}^{L} αi yi = 0 (8)

  • Substituting w from (7) in (4) and using the condition in (8) gives the dual form:

    Ld(α) = Σ_{i=1}^{L} αi − (1/2) Σ_{i=1}^{L} Σ_{j=1}^{L} αi αj yi yj xiᵀxj (9)

  • Maximising the objective function Ld(α) subject to the constraints:

    Σ_{i=1}^{L} αi yi = 0 (10)
    αi ≥ 0 for i = 1, 2, ..., L (11)

    gives the optimum values of the Lagrange multipliers {αj*}, j = 1, . . . , Ls.

SLIDE 82

Optimal Separating Hyperplane for linearly Separable Classes V

  • The optimum parameter vector w* is given by:

    w* = Σ_{j=1}^{Ls} αj* yj xj (12)

    where Ls is the number of support vectors.
  • The discriminant function of the optimal hyperplane (the xj are the support vectors):

    D(x) = w*ᵀx + b* = Σ_{j=1}^{Ls} αj* yj xᵀxj + b* (13)

    where b* is the optimum bias.

SLIDE 83

Maximum Margin Hyperplane I

For linearly non-separable classes, the constraints in (2) are modified to include non-negative slack variables ξi:

    yi(wᵀxi + b) ≥ 1 − ξi for i = 1, 2, ..., L (14)

  • The slack variable ξi measures the deviation of a data point xi from the ideal condition of separability.
  • For 0 ≤ ξi ≤ 1, the data point falls on the correct side of the separating hyperplane.
  • For ξi > 1, the data point falls on the wrong side of the separating hyperplane.
  • Support vectors are the data points that satisfy the constraint in (14) with equality.
  • The discriminant function of the optimal hyperplane for an input vector x is given by:

    D(x) = w*ᵀx + b* = Σ_{j=1}^{Ls} αj* yj xᵀxj + b* (15)

    where b* is the optimum bias.

SLIDE 84

Maximum Margin Hyperplane II

Figure: the maximum-margin hyperplane wᵀx + b = 0 with margin hyperplanes wᵀx + b = ±1 and the support vectors on the margin (margin 1/||w||).

SLIDE 85

Support Vector Machines

Figure: SVM as a network – the input vector x feeds a hidden layer of Ls kernel nodes K(x, x1), . . . , K(x, xLs) weighted by α1* y1, . . . , αLs* yLs; a linear output node with bias b* produces D(x).

SLIDE 86

Kernel Functions

  • Mercer kernels: kernel functions that satisfy Mercer’s theorem.
  • Kernel gram matrix: contains the value of the kernel function on all pairs of data points in the training data set.
  • Kernel gram matrix property: positive semi-definiteness, needed for convergence of the iterative optimisation method. A quick numerical check is sketched below.
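A minimal sketch of building a Gaussian-kernel gram matrix and verifying positive semi-definiteness via its eigenvalues (the kernel width δ is an arbitrary illustrative value):

```python
import numpy as np

def gaussian_gram(X, delta=0.5):
    """Gram matrix of K(xi, xj) = exp(-delta * ||xi - xj||^2), with a
    numerical PSD check (all eigenvalues should be >= 0)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise ||xi-xj||^2
    K = np.exp(-delta * sq)
    eigvals = np.linalg.eigvalsh(K)
    assert eigvals.min() > -1e-8, "gram matrix should be positive semi-definite"
    return K
```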

SLIDE 87

Nonlinearly Separable Problems I

  • Find the optimal hyperplane in the high-dimensional feature space Φ(x).
  • The Lagrangian objective function takes the following form:

    Ld(α) = Σ_{i=1}^{L} αi − (1/2) Σ_{i=1}^{L} Σ_{j=1}^{L} αi αj yi yj Φ(xi)ᵀΦ(xj) (16)

    subject to the constraints:

    Σ_{i=1}^{L} αi yi = 0 (17)
    0 ≤ αi ≤ C for i = 1, 2, ..., L (18)

  • The discriminant function for an input vector x is given by:

    D(x) = w*ᵀΦ(x) + b* = Σ_{j=1}^{Ls} αj* yj Φ(x)ᵀΦ(xj) + b* (19)

SLIDE 88

Nonlinearly Separable Problems II

Some commonly used inner-product kernel functions:

    Polynomial kernel: K(xi, xj) = (a xiᵀxj + c)^p
    Sigmoidal kernel:  K(xi, xj) = tanh(a xiᵀxj + c)
    Gaussian kernel:   K(xi, xj) = exp(−δ ||xi − xj||²)

The table shows the performance of the SVM vs the MVN classifier on rAga recognition, where:

  • PCDET – 12-bin equal temperament
  • PCDET2 – 24-bin equal temperament
  • PCDJI – 22-bin just intonation

             SVM        MVN
    PCDet    56.8755%   44.54%
    PCDet2   63.9318%   58.58%
    PCDji    58.9097%   50.62%

Table: Accuracy in rAga recognition

SLIDE 89

Summary of rAga recognition techniques

  • Discriminative training helps.
  • Multivariate unimodal distributions show comparable performance.
  • A GMM with the number of mixtures set to the number of notes in the rAga is a poor model.

SLIDE 90

Support Vector Machines and Music Analysis

Advantages:

  • Discriminative learning.
  • Testing is a simple inner product.
  • Identifies the patterns (support vectors) that discriminate between a pair of classes (example: two allied rAgas).

Disadvantages:

  • Fixed-length patterns.
  • Requires transformation of the data to fixed-length patterns – or an intermediate matching kernel.
  • Finding an appropriate kernel is a hard task.

SLIDE 91
  • 8. Non-Negative Matrix Factorisation

SLIDE 92

Non-negative Matrix Factorisation (NMF) I

Templates of sounds as building blocks:

  • Notes form music.
  • Phoneme-like structures combine to form speech.

Sounds correspond to such building blocks:

  • Building blocks are structures that appear repeatedly.
  • Basic building blocks do not distort each other – they add constructively.

Goal: to learn these building blocks from data.

SLIDE 93

An Example: A Vocal Carnatic Music Performance

  • We see and hear 4 or 5 distinct voices (or sources) – lead vocalist, drone, violin, mridanga, ghatam.
  • We should therefore discover 4 or 5 building blocks.
  • A solution: Non-negative Matrix Factorisation (NMF).
  • The spectrogram V can be represented as a linear combination of a dictionary of spectral vectors:

    V ≈ WH

    where W is the dictionary and H is the activation of the spectral vectors.
  • Optimisation problem: estimate W and H such that the divergence between V and WH is minimised:

    argmin_{W,H} D(V | WH)  subject to  W ≥ 0, H ≥ 0

SLIDE 94

Popular Divergence Measures I

  • Euclidean:

    D_EUC(V | WH) = Σ_f Σ_t (V_{f,t} − [WH]_{f,t})²

  • Generalised Kullback-Leibler (KL):

    D_KL(V | WH) = Σ_f Σ_t ( V_{f,t} log( V_{f,t} / [WH]_{f,t} ) − V_{f,t} + [WH]_{f,t} )

  • Itakura-Saito (IS):

    D_IS(V | WH) = Σ_f Σ_t ( V_{f,t} / [WH]_{f,t} − log( V_{f,t} / [WH]_{f,t} ) − 1 )

SLIDE 95

Update Rules

Solution to the optimisation problem:

  • Randomly initialise W and H.
  • Iterate between the following update rules (for the generalised KL divergence):

    H_{k,t} ← H_{k,t} · ( Σ_f W_{f,k} V_{f,t} / [WH]_{f,t} ) / ( Σ_{f′} W_{f′,k} )

    W_{f,k} ← W_{f,k} · ( Σ_t H_{k,t} V_{f,t} / [WH]_{f,t} ) / ( Σ_{t′} H_{k,t′} )

Other divergence measures have similar multiplicative update rules.
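The two update rules translate directly into a few lines of NumPy; the small eps guard and the random initialisation are assumptions of this sketch:

```python
import numpy as np

def nmf_kl(V, K, n_iter=200, seed=0):
    """Multiplicative updates for NMF under the generalised KL divergence -
    exactly the H and W rules above, with eps for numerical safety."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + 1e-3
    H = rng.random((K, T)) + 1e-3
    eps = 1e-10
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```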

SLIDE 96

Example of Tanpura Extraction from a Vocal Concert

SLIDE 97

Pitch Extraction from Synthesised Tanpura

Figure: pitch contour of the synthesised tanpura mixture before NMF (top) and after NMF-based separation (bottom).

SLIDE 98

NMF and its Issues

  • Learning the dictionary requires good training examples.
  • The piece has to be segmented first to get separation of voices – refer to Ashwin Bellur’s presentation on this.
  • The approach looks promising for source separation.

SLIDE 99
  • 9. Bayesian Information Criterion

SLIDE 100

Bayesian Information Criterion for Change Point Detection I

A maximum likelihood approach to change point detection:

  • Let x = {xi ∈ R^d, i = 1, ..., N} be a set of feature vectors extracted from the audio, with xi ∼ N(µi, Σi), where µi is the mean vector and Σi the full covariance matrix.
  • Change point detection can be posed as a hypothesis testing problem at time i:

    H0: x1, · · · , xN ∼ N(µ, Σ)
    versus
    H1: x1, · · · , xi ∼ N(µ1, Σ1); xi+1, · · · , xN ∼ N(µ2, Σ2)

  • The maximum likelihood ratio statistic is given by:

    R(i) = N log |Σ| − N1 log |Σ1| − N2 log |Σ2|

    where N, N1 and N2 are the numbers of samples in the entire segment, the segment from 1 to i, and the segment from i + 1 to N, respectively.
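A sketch of searching for the maximiser of R(i) using ML (1/N) covariances; the margin that keeps a few samples on each side and the diagonal regulariser are assumptions of this illustration, and a full BIC test would additionally subtract a model-complexity penalty before accepting the change:

```python
import numpy as np

def change_point(X, margin=5):
    """Return argmax_i R(i) for R(i) = N log|Sigma| - N1 log|Sigma1|
    - N2 log|Sigma2|, with X of shape (N, d)."""
    N, d = X.shape
    logdet = lambda Z: np.linalg.slogdet(
        np.cov(Z.T, bias=True) + 1e-9 * np.eye(d))[1]
    full = N * logdet(X)
    best_i, best_r = None, -np.inf
    for i in range(margin, N - margin):     # keep enough samples on each side
        r = full - i * logdet(X[:i]) - (N - i) * logdet(X[i:])
        if r > best_r:
            best_i, best_r = i, r
    return best_i, best_r
```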

SLIDE 101

Bayesian Information Criterion for Change Point Detection II

  • The estimated change point is the maximiser of the likelihood ratio statistic:

    î = argmaxᵢ R(i)

SLIDE 102

An Example: Applause Analysis

The features described in P Sarala et al.’s paper are used.

Figure: BIC values over time for the spectral entropy feature and the spectral flux feature; the detected change points separate applause segments from music segments.

SLIDE 103
  • 10. Classification and Regression Trees

SLIDE 104

Classification and Regression Trees

Decision tree learning [3]:

  • Applications in data mining and machine learning.
  • A decision tree is used as a predictive model.
  • It maps observations about an item to conclusions about the item’s target value.
  • Such trees are commonly referred to as classification trees or regression trees (CART).
  • Leaves represent classifications; branches represent conjunctions of features that lead to those classifications.

SLIDE 105

Classification and Regression Trees: An Example I

A simple decision tree for height class:

  • The goal of this example is to assign a person to one of five classes: tall (T), medium-tall (t), medium (M), medium-short (s) and short (S).
  • Similar to a rule-based classification system.
  • But here the rules are generated from the data.

SLIDE 106

Classification and Regression Trees: An Example II

Figure: a decision tree for height class – questions such as “Is age greater than 12?”, “Professional basketball player?”, “Is milk consumption greater than 4 litres a week?” and “Gender?” route a person to one of the classes S, T, M, s, t.

Issues in building CART:

SLIDE 107

Classification and Regression Trees: An Example III

  • Defining a question set: this is crucial for the design of the decision tree.
  • Recursively choose the best question at each level to split a node:
    • The split function generally uses a probabilistic measure.
    • Reduce the uncertainty in the event being decided upon.
    • The mutual information between the question and the classification decision for the given data sample is used.
    • Choose the question that gives the greatest reduction in entropy.
  • A greedy algorithm is used to generate the CART tree.
  • The splitting is done until a tree of appropriate size is obtained.

Some rigour:

  • Let Y be the random variable of the classification decision for data sample X.
  • The weighted entropy for any node t is given by:

    H̄t(Y) = Ht(Y) P(t)
    Ht(Y) = − Σᵢ P(ωᵢ|t) log P(ωᵢ|t)

SLIDE 108

Classification and Regression Trees: An Example IV

where P(ωᵢ|t) is the proportion of data samples of class i in node t, and P(t) is the prior probability of visiting node t (P(t) is essentially the ratio of the number of samples in node t to the total number of training samples).

The splitting criterion at node t (split into nodes l and r) is equivalent to finding the question q such that:

    ΔH̄t(q) = H̄t(Y) − (H̄l(Y) + H̄r(Y)) = H̄t(Y) − H̄t(Y|q)
    q* = argmax_q ΔH̄t(q)
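A sketch of this greedy step: treat each candidate question as a boolean predicate on a sample and keep the one with the largest weighted entropy reduction ΔH̄t(q) (the predicate representation is an assumption of this illustration):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_i P(w_i) log P(w_i) over the class labels at a node."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_question(X, y, questions):
    """Pick q* = argmax_q delta-H(q); y must be a NumPy array of labels
    and each question a callable x -> bool."""
    h_node = entropy(y)
    def gain(q):
        mask = np.array([q(x) for x in X])
        if mask.all() or (~mask).all():
            return -np.inf                       # split separates nothing
        hl = entropy(y[mask]) * mask.mean()      # weighted child entropies
        hr = entropy(y[~mask]) * (~mask).mean()
        return h_node - (hl + hr)
    return max(questions, key=gain)

# usage: questions as predicates, e.g. lambda x: x[0] > 12  ("age > 12?")
```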

SLIDE 109

CART and Music Processing

  • Clustering of music based on type/genre.
  • Hierarchical clustering of rAgas using GMMs/HMMs.
  • Each node in the tree could correspond to a divergence measure (e.g. the KL divergence between a pair of GMMs/HMMs).

SLIDE 110
  • 11. Score Normalisation

SLIDE 111

Definition of Misses and False Alarms (source: Wikipedia)

Detection tasks involve a tradeoff between two error types: missed detections and false alarms.

  • Missed detection: failure of the system to detect a target.
  • False alarm: declaring a detection when the target is not present.

⇒ a single operating point cannot represent the capabilities of a system.
⇒ a system has many operating points – hence it is represented by a curve.

SLIDE 112

ROC Curve (source: Wikipedia)

ROC – Receiver Operating Characteristic (or Relative Operating Characteristic).

Figure: the confusion matrix – predicted outcome (p′, n′) versus actual condition (p, n), with cells True Positive, False Positive, False Negative, True Negative and totals P, N (actual) and P′, N′ (predicted).

SLIDE 113

Some Definitions (source: Wikipedia)

  • true positive (TP): eqv. with hit
  • true negative (TN): eqv. with correct rejection
  • false positive (FP): eqv. with false alarm, Type I error
  • false negative (FN): eqv. with miss, Type II error
  • true positive rate (TPR): eqv. with hit rate, recall, sensitivity
  • false positive rate (FPR): eqv. with false alarm rate, fall-out
  • accuracy (ACC)
  • specificity (SPC), or true negative rate
  • positive predictive value (PPV): eqv. with precision
  • negative predictive value (NPV)
  • false discovery rate (FDR)
  • Matthews correlation coefficient (MCC)

SLIDE 114

Some Relevant Measures

    TPR = TP / P = TP / (TP + FN)
    FPR = FP / N = FP / (FP + TN)
    ACC = (TP + TN) / (P + N)
    SPC = TN / N = TN / (FP + TN) = 1 − FPR
    PPV = TP / (TP + FP)
    NPV = TN / (TN + FN)
    FDR = FP / (FP + TP)
    MCC = (TP·TN − FP·FN) / √(P N P′ N′)

Source: Fawcett (2004)
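The measures translate directly from confusion-matrix counts; a small Python helper (illustrative only):

```python
def detection_measures(tp, fp, fn, tn):
    """Compute the rates above from raw confusion-matrix counts."""
    p, n = tp + fn, fp + tn
    return {
        "TPR": tp / p,                  # hit rate / recall / sensitivity
        "FPR": fp / n,                  # false alarm rate / fall-out
        "ACC": (tp + tn) / (p + n),
        "SPC": tn / n,                  # specificity = 1 - FPR
        "PPV": tp / (tp + fp),          # precision
        "NPV": tn / (tn + fn),
        "FDR": fp / (fp + tp),
        "MCC": (tp * tn - fp * fn)
               / ((p * n * (tp + fp) * (tn + fn)) ** 0.5),
    }
```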

SLIDE 115

Plot of False Alarm Rate vs. Correct Detection Rate

Figure: ROC space – TPR (sensitivity) vs. FPR (1 − specificity); the diagonal is the line of no discrimination (random guessing), the top-left corner is perfect, and points above the diagonal are better than chance while points below are worse.

The distance from the random-guess line is a measure of the goodness of the system.

SLIDE 116

ROC Curve (Contd.)

Another measure commonly used in the ML community is the AUC: pick a positive and a negative example at random; the AUC measures the probability that the system assigns a higher score to the positive example than to the negative one.

SLIDE 117

Detection Error Tradeoff (DET) Curve

  • Error rates are plotted on both axes
  • Gives uniform treatment to both types of errors
  • A scale is used for both axes which spreads out the plot
  • Enables better distinction for well performing systems
  • Produces plots that are close to linear

SLIDE 118

A Pictorial view of misses and false alarms

Figure: overlapping target and non-target score distributions, with a threshold partitioning the area under them into TN, FN, FP and TP regions.

SLIDE 119

An example system with ROC and DET curves (from [4])

Figure: DET and ROC curves for the same example systems – the DET plot shows miss probability vs. false alarm probability (both in %, on a normal-deviate scale); the ROC plots show true positive rate vs. false positive rate.

SLIDE 120

Advantages of the DET curve

  • In the ROC good systems get bunched together at the top corner.
  • In the DET the systems are well-separated.
  • The DET curve thus enables better evaluation of systems.

SLIDE 121

Normalise Target and Non-target Scores (from [4])

Figure: score densities p(scores) – the non-target score distribution with mean µ0 and the target score distribution with mean µ1.

The variances of both distributions are normalised to 1. The question: how does the plot become linear? (µ0 – mean of the non-target scores; µ1 – mean of the target scores.)

SLIDE 122

The DET Plot (from [4])

X-axis: false alarm probabilities. Y-axis: miss probabilities.
Top scale: normal deviate w.r.t. µ0 for non-targets. Right scale: normal deviate w.r.t. µ1 for targets.
d corresponds to the distance between the means: d = µ1 − µ0.

SLIDE 123

Normalisation of Scores (from [5])

Warping of the axes uses the normal deviate function φ⁻¹, where

    φ(P) = ∫_{−∞}^{P} (1/√(2π)) e^{−x²/2} dx

Non-target distribution: N(µ0, σ0). Target distribution: N(µ1, σ1).

    PM(t)  = ∫_{−∞}^{t} (1/(√(2π) σ1)) e^{−(x−µ1)²/(2σ1²)} dx = φ((t − µ1)/σ1)
    PFA(t) = ∫_{t}^{+∞} (1/(√(2π) σ0)) e^{−(x−µ0)²/(2σ0²)} dx = φ((µ0 − t)/σ0)

where PM(t) and PFA(t) are the probabilities of miss and false alarm for a particular threshold t.

SLIDE 124

Linearity of DET Curves

Taking the inverse:

    (t − µ1)/σ1 = φ⁻¹(PM)  and similarly  (µ0 − t)/σ0 = φ⁻¹(PFA)

Eliminating t yields:

    φ⁻¹(PM) = −(σ0/σ1) φ⁻¹(PFA) + (µ0 − µ1)/σ1

This gives a linear relationship between φ⁻¹(PM) and φ⁻¹(PFA). It is further shown in [5] that the distributions need not be Gaussian to obtain linear DETs.
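Empirically, a DET curve is obtained by sweeping a threshold and warping the error probabilities with φ⁻¹ (scipy’s norm.ppf); a sketch (clip the probabilities away from 0 and 1 before warping, since φ⁻¹ diverges there):

```python
import numpy as np
from scipy.stats import norm

def det_points(target_scores, nontarget_scores, thresholds):
    """Return (phi^-1(P_FA), phi^-1(P_M)) for each threshold t; plotting
    y against x should be close to linear for roughly Gaussian scores."""
    t = np.asarray(thresholds)[:, None]
    pm = (np.asarray(target_scores)[None, :] < t).mean(axis=1)       # P_M(t)
    pfa = (np.asarray(nontarget_scores)[None, :] >= t).mean(axis=1)  # P_FA(t)
    pm = np.clip(pm, 1e-6, 1 - 1e-6)
    pfa = np.clip(pfa, 1e-6, 1 - 1e-6)
    return norm.ppf(pfa), norm.ppf(pm)
```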

SLIDE 125

Score Normalisation (from [6])

  • Different feature streams:
    • How do we choose the best feature?
    • How do we combine the scores from different feature streams?
  • Scaling of likelihood scores.
  • Handset normalisation.

Scaling of likelihood scores: determining a global, target-independent threshold to make a decision. The normalisation depends on the approach used for verification.

SLIDE 126

Theory of Score Normalisation I

Let the target model be m. For a given test utterance O, P(m|O) needs to be computed. Assume that ALL targets m are equally likely, i.e. prior(m) = prior(n) for m ≠ n. Here m and n are assumed to be targets; in general, they can be any two classes, target and non-target.

    log P(m|O) = log P(O|m) − log P(O|mw)

where mw is the world model.

SLIDE 127

Distribution Scaling

  • An utterance comes from either a non-target or a target.
  • A threshold is required to distinguish between non-targets and targets.
  • The threshold depends on the bimodal distribution of the log-likelihood score log P(m|O).
  • The bimodal distribution parameters depend both on the target model and on the observation.
  • Distribution scaling is required to find a single scale for all targets – a single threshold across all targets.

SLIDE 128

A Perspective on the DET Curve

    σT y + µT = −σI x + µI

where x relates to false alarms and y to missed recognitions:

    PMiss = (1/√(2π)) ∫_{−∞}^{y} e^{−t²/2} dt
    PFA   = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt

There are four different parameters: µT (target mean), σT (target variance), µI (non-target mean), σI (non-target variance). These can be different for different targets ⇒ impossible to find a single threshold!

SLIDE 129

Distribution Scaling (Contd.)

Distribution scaling: normalise (µ, σ) to a single distribution – zero mean, unit variance. Two types of scaling are possible:

  • Target centric: unify the target distributions.
  • Non-target centric: unify the non-target distributions.

There is too little data for target-centric normalisation ⇒ most systems perform non-target-centric normalisation. In the non-target-centric case the horizontal axis of the DET curve has zero mean and unit variance ⇒ a threshold can be set for a specific false alarm rate.

SLIDE 130

Non-target Centric Scaling

Simplifying the relationship between x and y (with µI = 0, σI = 1 after non-target-centric scaling):

    y = −(1/σT) x − µT/σT

  • The first term leads to a tilt of the DET curve.
  • The second term leads to a shift.
  • EER (equal error rate): setting x = y gives

    x = y = −µT/(1 + σT),   P_EER = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt

SLIDE 131

Cohort Normalisation

  • Zero normalisation (Z-norm): uses the mean and variance estimated during training.
  • Test normalisation (T-norm): uses the mean and variance estimated during testing.

SLIDE 132

Z-Norm, T-Norm

Zero normalisation (Z-norm):

  • For each target model m, obtain log-likelihood scores for non-target utterances during training.
  • Determine the non-target distribution (µI, σI) specific to target m.
  • Normalise:

    S = ( log P(m|O) − µI ) / σI

Test normalisation (T-norm):

  • Using a set of non-target models, calculate the log-likelihood of the test utterance.
  • Compute the mean and variance of these scores.
  • Normalise as in Z-norm.

Advantage of T-norm over Z-norm: it indirectly incorporates the acoustic mismatch during testing. A minimal sketch of both follows.
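Both normalisations are one-liners once the cohort scores are available (Python sketch; the score sources named in the arguments are assumptions of this illustration):

```python
import numpy as np

def z_norm(score, impostor_scores_train):
    """Z-norm: scale a test score by the non-target score statistics
    (mu_I, sigma_I) estimated for this target model during training."""
    mu, sigma = np.mean(impostor_scores_train), np.std(impostor_scores_train)
    return (score - mu) / sigma

def t_norm(score, cohort_scores_test):
    """T-norm: same scaling, but mu and sigma come from scoring the *test*
    utterance against a cohort of non-target models, so the acoustic
    conditions of the test are reflected in the normalisation."""
    mu, sigma = np.mean(cohort_scores_test), np.std(cohort_scores_test)
    return (score - mu) / sigma
```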

SLIDE 133

References I

[1] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall, New Jersey, 1993.
[2] P. A. Torres-Carrasquillo, E. Singer, M. A. Kohler, R. J. Greene, D. A. Reynolds and J. R. Deller, Jr., “Approaches to Language Identification using Gaussian Mixture Models and Shifted Delta Cepstral Features,” Proc. ICSLP 2002, pp. 89–92.
[3] Xuedong Huang, Alex Acero and Hsiao-Wuen Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice Hall PTR, 2001.
[4] A. Martin, G. Doddington, T. Kamm, M. Ordowski and M. Przybocki, “The DET Curve in Assessment of Detection Task Performance,” Proc. EUROSPEECH, Rhodes, Greece, Sept. 1997, pp. 1895–1898.
[5] Jiri Navratil and David Klusacek, “On Linear DETs,” Proc. ICASSP 2007, Honolulu, April 2007, pp. 229–232.

SLIDE 134

References II

[6] Roland Auckenthaler, Michael Carey and Harvey Lloyd-Thomas, “Score Normalisation for Text-Independent Speaker Verification Systems,” Digital Signal Processing, Vol. 10, pp. 42–54, 2000.
