  1. Speech Processing 15-492/18-492: Emotional Speech (some slides taken from the JHU Workshop 2011 final presentation on New Parameterizations for Emotional Speech Synthesis)

  2. Processing Emotional Speech
     What is it?
     - Emotion/expressiveness/style
     - Things beyond the textual content
     Why?
     - Detect frustrated users
     - Detect confusion/confidence in speakers
     - Detect truth/lies
     - Detect engagement in a task
     How?
     - Combination of words, spectrum, F0, etc.

  3. What is emotional speech?
     The standard 4 emotions:
     - Neutral, Happy, Sad, and Angry
     But there are many more:
     - Cold-anger, dominant, passive, shame
     - Confident, non-confident, etc.

  4. Can machines recognize emotions?

  5. Where to get data
     - Record actors
       - For synthesis this is probably best
       - People hear more acted emotions than real ones
     - Mine TV/movies
       - But there is usually background music
     - Mine call-center logs
       - Lots of angry examples
     - Mine YouTube videos
       - Probably all emotions, but hard to search

  6. Can machines recognize emotions?
     LDC Emotional Prosody Speech and Transcripts
     - English; dates and numbers; 7 actors
     - 2418 utterances, average 3 sec, total ~2 h
     - 4-class problem: happy, hot-anger, sadness, neutral
     - 6-class problem: [...], interest, panic
     - 15-class problem: [...], anxiety, boredom, cold-anger, contempt, despair, disgust, elation, pride, shame
     Berlin Emotional Database (emoDB)
     - German; semantically neutral utterances; 10 actors
     - 535 utterances, average 2.8 sec, total ~25 min
     - 6 emotions: anger, boredom, disgust, anxiety/fear, happiness, sadness

  7. Acoustic Features
     Feature extraction: 1582 features (openSMILE)
     - 124 prosodic features: 72 F0, 38 energy, 154 dur./pos.
     - 140 voice-quality features: 68 jitter (JT), 34 shimmer (SH), 38 voicing (VC)
     - 1178 spectral features: 570 MFCC, 304 MEL, 304 LSP
     Built from 38 low-level descriptors x 21 functionals:
     - Low-level descriptors: PCM loudness, MFCC [0-14], log Mel freq. band [0-7], LSP frequency [0-7], F0, voicing, jitter/shimmer (local/DDP)
     - Functionals: position of max./min., arith. mean, standard deviation, skewness, kurtosis, lin. regression coefficients 1/2, lin. regression error Q/A, percentile 1/99, ...
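The descriptor-times-functional scheme above turns a variable-length contour into a fixed-length feature vector. A minimal numpy sketch of a few of the listed functionals, applied to a synthetic F0 contour standing in for real openSMILE output (this is illustrative, not openSMILE's actual implementation):

```python
import numpy as np

def functionals(lld: np.ndarray) -> dict:
    """Apply a few openSMILE-style functionals to one low-level descriptor
    (LLD) contour, e.g. per-frame F0, yielding fixed-length features
    regardless of utterance length."""
    n = len(lld)
    idx = np.arange(n)
    mu, sd = lld.mean(), lld.std()
    z = (lld - mu) / sd
    slope, intercept = np.polyfit(idx, lld, 1)   # lin. regression coeff. 1/2
    pred = slope * idx + intercept
    return {
        "arith_mean": mu,
        "std_dev": sd,
        "skewness": np.mean(z ** 3),
        "kurtosis": np.mean(z ** 4) - 3.0,
        "pos_max": lld.argmax() / n,             # relative position of max
        "pos_min": lld.argmin() / n,             # relative position of min
        "reg_coeff1": slope,
        "reg_coeff2": intercept,
        "reg_error_q": np.mean((lld - pred) ** 2),  # quadratic regr. error
        "pctl_1": np.percentile(lld, 1),
        "pctl_99": np.percentile(lld, 99),
    }

# Synthetic per-frame F0 contour standing in for real extracted descriptors.
f0 = 120.0 + 30.0 * np.sin(np.linspace(0.0, 3.0, 200))
feats = functionals(f0)
```

Applying all 21 functionals to all 38 descriptors (plus deltas) is how openSMILE reaches feature counts in the ~1600 range.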

  8. Classification and Evaluation
     Classification
     - Discriminative training: multi-class SVM (one-vs-one), WEKA
     - Linear kernel; complexity parameter set by cross-validation
     - Standardized feature sets
     Evaluation
     - 10-fold cross-validation or LOSO (leave-one-speaker/sentence-out)
     - Also testing on a held-out test set
     - Evaluation criterion: accuracy or unweighted average recall (UAR)
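The pipeline on this slide (standardized features, linear one-vs-one SVM, leave-one-speaker-out evaluation, UAR) maps directly onto scikit-learn; the slide's experiments used WEKA, so this is a sketch with random stand-in features, not the original setup:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
n, d = 120, 20
X = rng.normal(size=(n, d))             # stand-in acoustic feature vectors
y = rng.integers(0, 4, size=n)          # 4 emotion classes
speakers = rng.integers(0, 7, size=n)   # 7 speakers, as in the LDC set

# Standardized features + linear-kernel SVM (SVC is one-vs-one for
# multi-class), evaluated leave-one-speaker-out (LOSO).
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
pred = cross_val_predict(clf, X, y, groups=speakers, cv=LeaveOneGroupOut())

# UAR = unweighted average recall = macro-averaged recall over classes,
# so rare classes count as much as frequent ones.
uar = recall_score(y, pred, average="macro")
```

With random labels UAR hovers near the 25% chance level, which is exactly why the slides report chance alongside each result.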

  9. Results
     LDC Emotional Prosody Speech and Transcripts (LOSO)

     UAR [%]          4 classes   6 classes   15 classes
     whole data set   70.4        53.5        23.6
     test set         68.3        43.3        23.5
     chance level     25.0        16.6         6.7

     Berlin Emotional Speech Database

     UAR [%]          7 classes
     whole data set   77.0
     test set         80.2
     chance level     14.3

  10. Results (normalized)
      Speaker/sentence normalization, z-score per speaker s:
      z(x_s) = (x_s - mean(x_s)) / std(x_s)

      LDC Emotional Prosody Speech and Transcripts (LOSO)

      UAR [%]          4 classes     6 classes     15 classes
      whole data set   75.0 (+4.6)   54.4 (+0.9)   27.2 (+3.6)

      Berlin Emotional Speech Database (LOSO)

      UAR [%]          7 classes
      whole data set   84.2 (+7.2)
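The per-speaker z-score on this slide removes each speaker's own offset and scale (e.g. a naturally high or flat F0) so the classifier sees mostly emotion-related variation. A minimal sketch, with hypothetical feature values:

```python
import numpy as np

def speaker_zscore(X: np.ndarray, speakers: np.ndarray) -> np.ndarray:
    """Per-speaker z-score: within each speaker s, map each feature
    x -> (x - mean_s) / std_s, so every speaker's features end up with
    zero mean and unit variance."""
    Z = np.empty_like(X, dtype=float)
    for s in np.unique(speakers):
        m = speakers == s
        mu = X[m].mean(axis=0)
        sd = X[m].std(axis=0)
        sd[sd == 0] = 1.0          # guard against constant features
        Z[m] = (X[m] - mu) / sd
    return Z

# Two speakers, two features (say, mean F0 and an energy statistic).
X = np.array([[200.0, 1.0], [220.0, 2.0], [110.0, 3.0], [130.0, 4.0]])
speakers = np.array([0, 0, 1, 1])
Z = speaker_zscore(X, speakers)
```

Note that in a LOSO setup the test speaker's statistics must come from that speaker's own data, not from the training set, which is what makes this normalization practical per speaker.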

  11. Classification Analysis: LDC Emotion

  12. Classification Analysis: emoDB

  13. Can people recognize emotions?

  14. Mechanical Turk
      - Anonymous workers (Worker-ID)
      - Simple tasks for small amounts of money
      - Minimal time, effort, and cost required for significant amounts of crowd-sourced data

  15. English LDC Emotion (4 Emotions)
      - Short, 1-2 second wav files
      - English speech: dates such as "November 3rd"
      - 4 fundamental, distinct emotions
      - 74 unique workers and 169 total HITs completed

      Results (confusions were uni-directional):

      Emotion     % Correct
      Anger       69%
      Sadness     67%
      Neutral     66%
      Happiness   46%
      Total       60%

  16. English LDC Emotion (15 Emotions)
      - Same parameters as the previous experiment
      - Includes less well-defined emotions (pride, shame, etc.)
      - 68 unique workers and 218 total HITs completed

      Results (confusions were uni-directional):

      Emotion     % Correct    Emotion      % Correct
      Neutral     29%          Happiness     9%
      Hot-Anger   26%          Pride         9%
      Sadness     25%          Despair       8%
      Boredom     17%          Cold-Anger    7%
      Panic       14%          Anxiety       5%
      Interest    12%          Disgust       5%
      Elation     10%          Shame         4%
      Contempt    10%          Total        12%

  17. German Berlin Emotion (7 Emotions)
      - Short sentences with no emotional connotation
        ("The tablecloth is lying on the fridge.")
      - 37 unique workers and 245 total HITs completed

      Results:

      Emotion     % Correct
      Neutral     68%
      Anger       62%
      Sadness     53%
      Anxiety     45%
      Happiness   35%
      Boredom     27%
      Disgust     11%
      Total       41%

      Most common confusion pair: 41.8%

  18. Subjective Evaluation Takeaways
      - Humans are significantly more accurate than chance for smaller numbers of emotions
        - This includes cross-lingual recognition
      - Certain emotions are consistently identified accurately
        - Sadness, Neutral, Hot-Anger

  19. Emotional TTS
      - Record lots of data
        - 1 hour or more in each domain
        - (Easy to get boredom and anger)
      - Or do voice conversion/parametric synthesis
        - Better, but overall the results aren't encouraging
        - Hard to make it sound very natural

  20. Synthesis using AF13s: Types of Synthesis
      tts      Text-to-speech with no emotion/personality content;
               predicts durations, F0, and spectrum (through AFs)
      ttsE/P   Text-to-speech with an emotion/personality flag;
               predicts durations, F0, and spectrum (through AFs)
      cgp      No explicit emotion/personality flag, but natural durations;
               predicts F0 and spectrum (through AFs)
      cgpE/P   Emotion/personality flag and natural durations;
               predicts F0 and spectrum (through AFs)
      resynth  Pure re-synthesis from natural durations, F0, and spectrum:
               "the best we can do"

  21. LDC (English) Emotion Synthesis
      Objective evaluation (classifier trained on human speech); chance = 25%

      UAR [%]                 tts   ttsE   cgp   cgpE   resynth   human
      Without speaker norm.   32    36     38    40     56        70
      With speaker norm.      33    54     57    61     61        75

  22. LDC (English) Emotion Synthesis
      Mechanical Turk human evaluation; chance = 25%
      Average workers: 59; average HITs completed: 308; files per HIT: 12

      Speech type   tts   ttsE   cgp   cgpE   resynth   natural
      % Correct     28    28     35    37     41        60

      We see the same trend and ordering in the human evaluation as in the objective classification.

  23. German Synthesis
      Berlin Emotion, objective evaluation (trained on human speech); chance = 14%

      UAR [%]   tts   ttsE   cgp   cgpE   resynth   human
                14    29     65    72     82        84

      ADFS Personality, objective evaluation; chance = 10%

      UAR [%]   tts   ttsP   cgp   cgpP   resynth   human
                10    60     60    78     89        92

  24. What we actually need
      Expressive styles:
      - Frustrated (angry, annoyed, etc.)
      - Interested/uninterested
      - Pleased/unhappy
      - Cooperative/non-cooperative

  25. How can we use it?
      - Detect frustrated customers
        - Be frustrated back at them (or not)
        - What techniques can defuse frustration?
      - Detect (non-)confidence
        - Better aid in tutoring systems
      - Speech-to-speech (S2S) translation
        - Copy emotion across languages
