Phoneme state posteriorgram features for speech based automatic classification of speakers in cold and healthy condition
Akshay Kalkunte Suresh, Srinivasa Raghavan K M, Dr. Prasanta Kumar Ghosh
SPIRE LAB, Electrical Engineering, Indian Institute of Science (IISc), Bangalore, India
SPIRE LAB, IISc, Bangalore 1 1 January 2017

Overview
1 Introduction
2 Our hypothesis
3 Phoneme state posteriorgram features
4 Observation
5 Experiments
   Effect of feature selection
   e2e model
   Effect of corpora
   Decision fusion
6 Conclusion

1 Introduction

Introduction
Problem statement: Given a database of speech samples recorded from speakers who are either healthy or suffering from a common cold, the task is to automatically classify each sample as cold-affected or healthy speech.
Why do we need to do this? Detecting the presence of a common cold from speech could find applications in healthcare. It could also help improve the accuracy of automatic speech and speaker recognition systems.

Illustration
Identifying whether a speaker has a cold from the speech signal (audio examples):
"Design your own wardrobe": example 1 (non-cold), example 2 (cold)
"They wandered away": example 3 (non-cold), example 4 (cold)

Frequency domain perspective
[Figure: frequency-domain plots of "Design your own wardrobe" for (a) example 1, non-cold, and (b) example 2, cold; and of "They wandered away" for (a) example 3, non-cold, and (b) example 4, cold.]

Speech signal production perspective
Congestion in the nasal and vocal cavities during a cold could affect speech.

Previous works
Tull et al. [1] report differences in formant patterns, nasality parameters and mel-cepstral coefficients between normal and cold speech.
Shan et al. [2] observed variations in the energy levels at lower and higher frequency bands and, using mel-frequency cepstral coefficients (MFCCs), found improvement in speaker recognition systems.
P. Rose [3] pointed out that a cold is often accompanied by inflammation and swelling of the nasal cavity, which changes its volume and shape, affects the nasal modulation of the sound source excitation signal, and causes the speaker's voice to change.
[1] Tull, "Investigating The Common Cold To Improve Speech Technology"
[2] Shan and Zhu, "Speaker Identification Under The Changed Sound Environment"
[3] Rose, Forensic Speaker Identification

2 Our hypothesis

Our hypothesis
We hypothesize that the change in voice quality caused by a common cold could result in lower likelihoods under a model built from normal, healthy speech. We also hypothesize that some phonemes are affected to a greater extent than others.

3 Phoneme state posteriorgram features

Steps to compute phoneme state posteriorgram features
Computing the PSP features involves the following stages:
1. Acoustic feature (MFCC) extraction.
2. Training Gaussian mixture models (GMMs) from non-cold speech.
3. Computing likelihoods of the features under the GMMs.
4. Computing functionals.

Feature Extraction
Each speech utterance $l$ is divided into $N_l$ frames using a 25 ms window shifted by 10 ms. A 13-dim MFCC vector is obtained for each frame. Velocity and acceleration features are appended to obtain a 39-dim feature vector.
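The framing arithmetic and the 13 to 39 dimension expansion can be sketched as follows. This is a minimal illustration, not the authors' front end: the 16 kHz sampling rate, the absence of padding, and the simple central-difference formula for velocity and acceleration are all assumptions, and the static vectors are toy numbers standing in for real MFCCs.

```python
def frame_count(num_samples, sample_rate, win_ms=25, shift_ms=10):
    """Number of full analysis frames in a signal (no padding assumed)."""
    win = sample_rate * win_ms // 1000
    shift = sample_rate * shift_ms // 1000
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // shift


def deltas(frames):
    """Central differences between neighbouring frames (edge frames repeated)."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev, nxt = frames[max(t - 1, 0)], frames[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_dynamics(static):
    """Append velocity and acceleration to every static feature vector."""
    vel = deltas(static)
    acc = deltas(vel)
    return [s + v + a for s, v, a in zip(static, vel, acc)]


# Toy stand-in for 13-dim MFCCs: one second of audio at an assumed 16 kHz.
n_frames = frame_count(16000, 16000)
static = [[float(t + d) for d in range(13)] for t in range(n_frames)]
features = add_dynamics(static)
print(n_frames, len(features[0]))
```

Real systems usually compute the dynamic features with a regression over several neighbouring frames; the two-point difference here only shows where the extra 26 dimensions come from.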

Gaussian Mixture Models from non-cold speech
We train a three-state phonetic hidden Markov model (HMM) on the non-cold speech data. The GMMs of the HMM states are denoted by $G_1, G_2, \ldots, G_n$.

Likelihoods of features from Gaussian Mixture Models
The parameters of the $i$-th GMM ($G_i$) are given by $\lambda_i = \{ w_{ij}, \mu_{ij}, \Sigma_{ij},\ j = 1, \ldots, 256 \}$, where $w_{ij}$ is the weight of the $j$-th component, and $\mu_{ij}$ and $\Sigma_{ij}$ are its mean vector and diagonal covariance matrix. Given a 39-dim acoustic feature vector $x_k$, the log likelihoods under $G_1, G_2, \cdots, G_n$ are computed as follows:
$$L_i(k) = \log P(x_k \mid G_i) = \log\left( \sum_{j=1}^{256} w_{ij}\, \mathcal{N}(x_k; \mu_{ij}, \Sigma_{ij}) \right), \quad 1 \le i \le n$$
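As a rough sketch of this computation, the following evaluates the log likelihood of one frame under each state GMM, using diagonal covariances and the log-sum-exp trick for numerical stability. The function names and the toy GMMs (two components, two dimensions) are illustrative assumptions; the slides use 256 components and 39-dim vectors, and all weights are assumed to be strictly positive.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian whose covariance is diagonal (var = diagonal)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """log sum_j w_j N(x; mu_j, Sigma_j), computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def state_log_likelihoods(x, gmms):
    """One log likelihood per state GMM: [L_1(k), ..., L_n(k)]."""
    return [gmm_log_likelihood(x, *g) for g in gmms]


# Toy example: n = 2 state GMMs, each given as (weights, means, variances).
gmms = [
    ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    ([0.3, 0.7], [[4.0, 4.0], [5.0, 5.0]], [[1.0, 1.0], [2.0, 2.0]]),
]
scores = state_log_likelihoods([0.2, -0.1], gmms)
print(scores)
```

The frame lies close to the means of the first GMM, so its first score is the larger of the two.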

Computing functionals
The $n$-dim log likelihood vectors computed for all frames of utterance $l$ are passed through the functionals block to obtain a single vector of dimension $n \times 43$. The functionals block computes 43 openSMILE [4] functionals over all the frames, separately for each of the $n$ dimensions.
[4] Eyben, Wöllmer, and Schuller, "Opensmile: the munich versatile and fast open-source audio feature extractor"
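The shape of this step can be sketched as below. The four statistics used here (mean, standard deviation, minimum, maximum) are stand-ins chosen for illustration only; they are not the 43 openSMILE functionals, and `utterance_vector` is a hypothetical helper name.

```python
import statistics

# Stand-in functionals; the slides use 43 openSMILE functionals instead.
FUNCTIONALS = {
    "mean": statistics.fmean,
    "stddev": statistics.pstdev,
    "min": min,
    "max": max,
}


def utterance_vector(loglik_frames):
    """Collapse a frames x n log-likelihood matrix into n * F numbers."""
    per_class = list(zip(*loglik_frames))  # one track over time per class
    return [f(track) for track in per_class for f in FUNCTIONALS.values()]


# Three frames, n = 2 phonetic classes.
frames = [[-10.0, -20.0], [-12.0, -18.0], [-11.0, -22.0]]
vec = utterance_vector(frames)
print(len(vec), vec[:4])

# With the numbers on the slides: 120 classes x 43 functionals.
print(120 * 43)
```

Whatever the number of frames, every utterance maps to a vector of the same length, which is what lets a fixed-input classifier be trained on utterances of different durations.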

4 Observation

Average Likelihood Plot
We plot the average log likelihoods of all cold and non-cold utterances from the training set of the URTIC speech corpus across the GMMs of 120 phonetic classes, taken from an acoustic model trained on TIMIT + Boston University Radio News (BN). The number of phonetic classes is $n = 3 \times m = 3 \times 40 = 120$, i.e. three HMM states for each of the $m = 40$ phonemes.
Cold speech features, on average, result in lower likelihoods against the GMMs of each phoneme state than non-cold speech features.

We mark the top 10 phonetic classes in the average likelihood plot. The phonemes with the ten largest differences in likelihood are AA, EH, V, DH, IY, AX, JH, W, T and NG. The nasal sound NG appears among the ten most discriminating phonemes, which we attribute to the change in the nasal cavity caused by a cold.
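One way such a ranking could be produced is sketched below: average the log likelihood of each phonetic class over all frames of each group, then sort the classes by the non-cold minus cold gap. The class labels, the toy numbers, and the choice to rank individual state classes (rather than pooling the three states of a phoneme) are assumptions; the slides do not specify these details.

```python
def mean_per_class(utterances):
    """Average each class's log likelihood over all frames of all utterances."""
    frames = [frame for utt in utterances for frame in utt]
    return [sum(col) / len(col) for col in zip(*frames)]


def top_gaps(non_cold, cold, labels, k):
    """Return the k classes with the largest non-cold minus cold gap."""
    gaps = [
        (h - c, name)
        for h, c, name in zip(mean_per_class(non_cold), mean_per_class(cold), labels)
    ]
    return [name for _, name in sorted(gaps, reverse=True)[:k]]


# Hypothetical labels: 3 states for each of 2 phonemes.
labels = [f"{p}_s{s}" for p in ("NG", "T") for s in (1, 2, 3)]
# Each utterance is a list of frames; each frame holds one value per class.
non_cold = [[[-50, -51, -52, -60, -61, -62]], [[-52, -51, -50, -60, -61, -62]]]
cold = [[[-58, -57, -59, -61, -62, -63]]]
print(top_gaps(non_cold, cold, labels, k=3))
```

In this toy data the three NG states show the largest drop in likelihood under cold speech, so they fill the top three positions.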

5 Experiments

Overview of Experimental Results

We report results obtained using the proposed 5160-dim PSP features and an end-to-end (e2e) model, and discuss the effect of feature selection, the effect of corpora, and decision fusion. We use unweighted average recall (UAR) to compare the models, as it weights each class equally regardless of class imbalance. We also consider the Interspeech 2017 Cold Sub-Challenge baseline results.
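For reference, UAR is the mean of the per-class recalls. The sketch below computes it on a toy imbalanced set; the labels "NC" (non-cold) and "C" (cold) and the toy predictions are illustrative assumptions, not results from the slides.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls, each class counted equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Imbalanced toy set: 8 non-cold (NC) and 2 cold (C) utterances.
y_true = ["NC"] * 8 + ["C"] * 2

# A classifier that always answers "NC" is right on 8 of 10 utterances,
# yet it never detects a cold.
always_nc = ["NC"] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_nc)) / len(y_true)
print(accuracy, unweighted_average_recall(y_true, always_nc))
```

The majority-class predictor reaches 80% accuracy but only chance-level UAR, which is why UAR is the more informative metric when one class dominates.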

Effect of feature selection
Scores for cold speech classification (UAR %)
Model                            Dev     Test
ComParE functionals (baseline)   64.00   70.20
PSP (5160-dim)                   64.00   61.09

Effect of feature selection
Scores for cold speech classification (UAR %)
Model                            Dev
ComParE functionals (baseline)   64.00
PSP (473-dim)                    63.60

Effect of feature selection
Scores for cold speech classification (UAR %)
Model                            Dev
ComParE functionals (baseline)   64.00
PSP (500-dim)                    63.50

Effect of feature selection
We divide the ComParE features into 27 categories, C1 to C27. Among the 27 categories, we observe that pcm_fftMag_mfcc performs the best. The remaining categories perform similarly to one another, and all of them worse than pcm_fftMag_mfcc.

e2e model
A baseline e2e model with 8 convolutional and 2 LSTM layers is trained on the raw audio files. We hypothesize that the e2e classification approach could learn distinctive time-frequency representations through its convolutional and LSTM layers, with the potential to uncover new representations in the data.
Scores for cold speech classification (UAR %)
Model                            Dev
ComParE functionals (baseline)   64.00
e2e                              66.50

Effect of corpora
