SLIDE 1

Phoneme state posteriorgram features for speech based automatic classification of speakers in cold and healthy condition

Akshay Kalkunte Suresh, Srinivasa Raghavan K M, Dr. Prasanta Kumar Ghosh

SPIRE LAB Electrical Engineering, Indian Institute of Science (IISc), Bangalore, India

1 January 2017

SPIRE LAB, IISc, Bangalore 1

SLIDE 2

Overview

1 Introduction
2 Our hypothesis
3 Phoneme state posteriorgram features
4 Observation
5 Experiments
  • Effect of feature selection
  • e2e model
  • Effect of corpora
  • Decision fusion
6 Conclusion

SLIDE 4

Introduction

Introduction

Problem statement: Given a database of speech samples recorded from speakers who are either healthy or suffering from a common cold, automatically classify each speech sample as cold-affected or healthy speech.

Why do we need to do this? Detecting the presence of a common cold in speech could find applications in healthcare. It could also help improve the accuracy of automatic speech and speaker recognition systems.

SLIDE 5

Introduction

Illustration

Identifying whether speaker has cold from speech signal

[Figure: four audio examples. Examples 1 and 3 are non-cold, examples 2 and 4 are cold; the sentences spoken are "They wandered away" and "Design your own wardrobe".]

SLIDE 6

Introduction

Frequency domain perspective

[Figure: frequency-domain views of (a) example 1 - non-cold and (b) example 2 - cold, and of (a) example 3 - non-cold and (b) example 4 - cold; the sentences spoken are "They wandered away" and "Design your own wardrobe".]

SLIDE 7

Introduction

Speech signal production perspective

Congestion in the nasal and vocal cavities in cold condition could affect speech.

SLIDE 8

Introduction

Previous works

  • Studies by Tull et al. [1] reveal differences in formant patterns, nasality parameters and mel-cepstral coefficients between normal and cold speech.
  • Shan et al. [2] observed variations in the energy levels in the lower and higher frequency bands and, using mel-frequency cepstral coefficients (MFCCs), found improvements in speaker recognition systems.
  • P. Rose [3] pointed out that a cold is often accompanied by inflammation and swelling of the nasal cavity, which changes the volume and shape of the cavity, in turn affects the nasal modulation of the sound source excitation signal, and causes the speaker's voice to change.

[1] Tull, “Investigating The Common Cold To Improve Speech Technology”
[2] Shan and Zhu, “Speaker Identification Under The Changed Sound Environment”
[3] Rose, Forensic Speaker Identification

SLIDE 10

Our hypothesis

Our hypothesis

We hypothesize that the change in voice quality in speech affected by a common cold could result in lower likelihoods under a model built from normal, healthy speech. We also hypothesize that some phonemes are affected to a greater extent than others.

SLIDE 12

Phoneme state posteriorgram features

Steps to compute phoneme state posteriorgram features

Computing the PSP features involves the following stages:

  • Acoustic feature (MFCC) extraction.
  • Gaussian Mixture Models from non-cold speech.
  • Likelihoods of features from Gaussian Mixture Models.
  • Computing functionals.

SLIDE 14

Phoneme state posteriorgram features

Feature Extraction

Each speech utterance l is divided into N_l frames using a window size of 25 ms shifted by 10 ms.

A 13-dim MFCC vector is obtained for each frame. Velocity and acceleration features are appended to obtain a 39-dim feature vector.
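
The framing and the 13-dim to 39-dim expansion can be sketched as follows. This is a minimal illustration, not the authors' front end: the 16 kHz sample rate, the simple central-difference deltas and the function names are all assumptions, and the MFCC computation itself is not shown (the toy frames hold placeholder values).

```python
def num_frames(num_samples, sample_rate=16000, win_ms=25, shift_ms=10):
    """Number of full frames N_l for a 25 ms window shifted by 10 ms."""
    win = sample_rate * win_ms // 1000      # 400 samples at the assumed 16 kHz
    shift = sample_rate * shift_ms // 1000  # 160 samples at the assumed 16 kHz
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // shift


def deltas(frames):
    """Central-difference deltas over time, replicating the edge frames."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out


def add_velocity_acceleration(static):
    """Append velocity and acceleration to each 13-dim frame (39-dim result)."""
    velocity = deltas(static)
    acceleration = deltas(velocity)
    return [s + v + a for s, v, a in zip(static, velocity, acceleration)]


# Toy input: 5 frames of 13 placeholder values standing in for MFCCs.
static = [[float(t)] * 13 for t in range(5)]
full = add_velocity_acceleration(static)
```

Production front ends usually compute deltas with a regression window over several frames; the central difference here only keeps the sketch short.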

SLIDE 16

Phoneme state posteriorgram features

Gaussian Mixture Models from non-cold speech

We train a phonetic three-state hidden Markov model (HMM) on the non-cold speech data. The GMMs for the HMM states are denoted by G1, G2, ..., Gn.

SLIDE 18

Phoneme state posteriorgram features

Likelihoods of features from Gaussian Mixture Models

The parameters of the i-th GMM ($G_i$) are given by $\lambda_i = \{w_j^i, \mu_j^i, \Sigma_j^i,\ j = 1, \dots, 256\}$, where $w_j^i$ is the weight of the j-th component, and $\mu_j^i$ and $\Sigma_j^i$ are the mean vector and diagonal covariance matrix of the j-th component. Given a 39-dim acoustic feature vector $x_k$, the log likelihoods under $G_1, G_2, \dots, G_n$ are computed as follows:

$$L_i(k) = \log P(x_k \mid G_i) = \log\left(\sum_{j=1}^{256} w_j^i \, \mathcal{N}(x_k; \mu_j^i, \Sigma_j^i)\right), \quad 1 \le i \le n.$$
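
The log likelihood L_i(k) of one frame under one diagonal-covariance GMM can be sketched as below. The function names are hypothetical, the toy mixture has 2 components rather than 256, and the log-sum-exp step is a standard numerical safeguard that the slides do not mention.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (var = diagonal)."""
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (len(x) * math.log(2.0 * math.pi) + log_det + maha)


def gmm_loglik(x, weights, means, variances):
    """log sum_j w_j N(x; mu_j, Sigma_j), computed with log-sum-exp."""
    terms = [
        math.log(w) + diag_gaussian_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)  # factor out the largest term to avoid underflow
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Toy 2-component, 1-dim mixture (illustrative values only).
example = gmm_loglik([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
```

Evaluating `gmm_loglik` for every frame against each of the n state GMMs gives the n-dim log likelihood vector per frame.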

SLIDE 20

Phoneme state posteriorgram features

Computing functionals

The n-dim log likelihood vectors computed for all frames of utterance l are passed through the functionals block to obtain a single (n × 43)-dim vector. The functionals block computes 43 openSMILE [4] functionals over all the frames for each of the n dimensions.

[4] Eyben, Wöllmer, and Schuller, “openSMILE: the Munich versatile and fast open-source audio feature extractor”
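
A sketch of the functionals step, under stated assumptions: the slides do not list the 43 openSMILE functionals, so the four statistics below (mean, standard deviation, minimum, maximum) are stand-ins chosen for illustration, and `utterance_features` is a hypothetical name. With all 43 functionals and n = 120 the output would have 120 × 43 = 5160 dimensions.

```python
import statistics

# Stand-in functionals; the actual set of 43 is not given in the slides.
FUNCTIONALS = {
    "mean": statistics.fmean,
    "std": statistics.pstdev,
    "min": min,
    "max": max,
}


def utterance_features(loglik_frames):
    """Collapse a (frames x n) log likelihood matrix to one fixed-length vector.

    Each of the n columns is summarised over all frames by every functional,
    so the output length is n * len(FUNCTIONALS) regardless of utterance length.
    """
    columns = list(zip(*loglik_frames))
    return [func(col) for col in columns for func in FUNCTIONALS.values()]


# Toy utterance: 2 frames, n = 2 phonetic classes.
feats = utterance_features([[1.0, 10.0], [3.0, 10.0]])
```

The point of this stage is that utterances of different lengths map to vectors of the same size, which a standard classifier can consume.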

SLIDE 22

Observation

Average Likelihood Plot

We plot the average log likelihoods of all cold and non-cold utterances in the training set of the URTIC speech corpus under the GMMs of 120 phonetic classes, taken from an acoustic model trained on TIMIT + Boston University Radio News (BN). The number of phonetic classes is n = 3 states × m phonemes = 3 × 40 = 120. On average, cold speech features result in lower likelihoods under the GMM of each phoneme state than non-cold speech features.

SLIDE 23

Observation

We mark the top 10 phonetic classes in the average likelihood plot. The phonemes with the ten largest differences in likelihood are AA, EH, V, DH, IY, AX, JH, W, T, NG. The nasal sound NG appears among the ten most discriminating phonemes, plausibly because of the change in the nasal cavity caused by a cold.
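
One way to rank phonetic classes by the gap between non-cold and cold average log likelihoods is sketched below. The function names, class labels and numbers are made up for illustration; the slides do not say exactly how the top ten were selected.

```python
def mean_per_class(utterance_logliks):
    """Average each phonetic class's log likelihood over all utterances."""
    count = len(utterance_logliks)
    width = len(utterance_logliks[0])
    return [sum(u[i] for u in utterance_logliks) / count for i in range(width)]


def top_discriminating(noncold, cold, labels, k=10):
    """Return the k classes with the largest (non-cold minus cold) gap."""
    gaps = [h - c for h, c in zip(mean_per_class(noncold), mean_per_class(cold))]
    order = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return [(labels[i], gaps[i]) for i in order[:k]]


# Toy data: one utterance per condition, three hypothetical classes.
ranked = top_discriminating(
    noncold=[[-10.0, -20.0, -30.0]],
    cold=[[-11.0, -25.0, -30.5]],
    labels=["A", "B", "C"],
    k=2,
)
```

In the setting described here each row would hold 120 values, one per phoneme state, averaged over the frames of an utterance.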

SLIDE 25

Experiments

Overview of Experimental Results

We report results obtained using the proposed 5160-dim PSP features (120 phonetic classes × 43 functionals) and an end-to-end (e2e) model, and discuss the effect of feature selection, the effect of corpora, and decision fusion. We use unweighted average recall (UAR) as the metric to compare models because it is robust to class imbalance. We also report the 2017 InterSpeech Cold Sub-Challenge baseline results.
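
UAR is the mean of the per-class recalls, so each class counts equally however many samples it has. A minimal sketch follows; the function name and the labels "C" (cold) and "NC" (non-cold) are chosen for this example only.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, label in enumerate(y_true) if label == cls]
        hits = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


# Imbalanced toy set: 2 cold ("C") and 8 non-cold ("NC") utterances.
y_true = ["C"] * 2 + ["NC"] * 8
always_nc = ["NC"] * 10  # a degenerate classifier that never predicts cold

uar = unweighted_average_recall(y_true, always_nc)
accuracy = sum(t == p for t, p in zip(y_true, always_nc)) / len(y_true)
```

On this toy set the degenerate classifier reaches 80% accuracy but only 50% UAR, which is why UAR is the safer metric when one class dominates.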

SLIDE 28

Experiments Effect of feature selection

Scores for cold speech classification (UAR%)

Model                            Dev     Test
ComParE functionals (baseline)   64.00   70.20
PSP (5160-dim)                   64.00   61.09

SLIDE 29

Experiments Effect of feature selection

Scores for cold speech classification (UAR%)

Model                            Dev
ComParE functionals (baseline)   64.00
PSP (473-dim)                    63.60

SLIDE 30

Experiments Effect of feature selection

Scores for cold speech classification (UAR%)

Model                            Dev
ComParE functionals (baseline)   64.00
PSP (500-dim)                    63.50

SLIDE 31

Experiments Effect of feature selection

We divide the ComParE features into 27 categories, C1 to C27. Among the 27 categories, we observe that pcm_fftMag_mfcc performs the best. The remaining categories perform similarly to one another and worse than pcm_fftMag_mfcc.

SLIDE 33

Experiments e2e model

A baseline e2e model with 8 convolutional and 2 LSTM layers is trained on raw audio files. We hypothesize that the e2e classification approach could learn unique time-frequency representations using the convolutional and LSTM layers, with the potential to uncover new representations in the data.

Scores for cold speech classification (UAR%)

Model                            Dev
ComParE functionals (baseline)   64.00
e2e                              66.50

SLIDE 35

Experiments Effect of corpora

Table: UAR% on the development set for PSP features computed from different non-cold corpora.

Corpus                              Dev
TIMIT                               65.10
TIMIT+BN                            64.00
BN (Boston University Radio News)   60.50

PSP features are computed using HMMs trained on different speech corpora. The poor UAR with BN could be due to the noisy recordings present in BN, unlike those in TIMIT.

SLIDE 37

Experiments Decision fusion

Scores for cold speech classification (UAR%)

Model                                Dev     Test
ComParE+BoAW (baseline)              64.20   67.30
PSP+ComParE+BoAW (unweighted maj.)   65.30   68.52

SLIDE 38

Experiments Decision fusion

Scores for cold speech classification (UAR%)

Model                                Dev     Test
ComParE+BoAW (baseline)              64.20   67.30
PSP+BoAW+e2e (unweighted maj.)       69.00   66.70

SLIDE 39

Experiments Decision fusion

Scores for cold speech classification (UAR%)

Model                                Dev     Test
ComParE+BoAW (baseline)              64.20   67.30
PSP+ComParE+BoAW (weighted maj.)     66.70   65.09

SLIDE 40

Experiments Decision fusion

Scores for cold speech classification (UAR%)

Model                                Dev     Test
2017 InterSpeech Cold Sub-Challenge baseline results
  ComParE functionals                64.00   70.20
  ComParE BoAW                       64.20   67.30
PSP and fusion (PSP+ComParE+BoAW)
  PSP                                64.00   61.09
  fusion (unweighted maj.)           65.30   68.52
  fusion (weighted maj.)             66.70   65.09
fusion (PSP+BoAW+e2e)
  fusion (unweighted maj.)           69.00   66.70
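
Unweighted and weighted majority voting over the per-utterance decisions of several systems can be sketched as follows. The function name, the example predictions and the weights are hypothetical; the slides do not say how the weights for the weighted fusion were chosen.

```python
from collections import defaultdict


def majority_vote(predictions, weights=None):
    """Fuse per-system label sequences by (optionally weighted) majority vote.

    predictions: one label sequence per system, all of equal length.
    weights: one non-negative weight per system; None means equal weights.
    Ties go to the label seen first among the systems' votes.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    fused = []
    for votes in zip(*predictions):
        tally = defaultdict(float)
        for label, weight in zip(votes, weights):
            tally[label] += weight
        fused.append(max(tally, key=tally.get))
    return fused


# Hypothetical decisions of three systems on three utterances.
psp = ["C", "NC", "C"]
compare = ["C", "C", "NC"]
boaw = ["NC", "C", "NC"]

unweighted = majority_vote([psp, compare, boaw])
weighted = majority_vote([psp, compare, boaw], weights=[0.6, 0.2, 0.2])
```

With three systems and two classes the unweighted vote can never tie; giving one system a weight above one half, as in the toy weights here, lets it override the other two.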

SLIDE 42

Conclusion

Conclusion and future work

We have proposed phoneme state posteriorgram based features to capture the acoustic variability due to speaking in cold condition compared to speaking in healthy condition.

Conclusions from the experimentation:

1 We obtain a UAR on the development set comparable to that of the baseline scheme.
2 When combined with the baseline scheme using a weighted decision fusion approach, we obtain a 2.9% (absolute) improvement in the UAR on the development set.

SLIDE 43

Conclusion

Proposed future work

1 Computing PSP features using HMMs trained on a language-specific speech corpus.
2 Using DNN models to classify the cold and non-cold utterances using PSP features.
3 Computing PSP features from an HMM trained with a deep neural network.
