Speech Processing 15-492/18-492: Emotional Speech


SLIDE 1

Speech Processing 15-492/18-492

Emotional Speech

(Some slides taken from the JHU Workshop 2011 final presentation on New Parameterizations for Emotional Speech Synthesis)

SLIDE 2

Processing Emotional Speech

 What is it?
  • Emotion/Expressive/Style: things beyond the textual content

 Why?
  • Detect frustrated users
  • Detect confusion/confidence in speakers
  • Detect truth/lies
  • Detect engagement in a task

 How?
  • A combination of words, spectrum, F0, etc.

SLIDE 3

What is emotional speech?

The standard 4 emotions
 Neutral, Happy, Sad, and Angry

But there are many more
 Cold-anger, dominant, passive, shame
 Confident, non-confident, etc.

SLIDE 4

Can machines recognize emotions?

SLIDE 5

Where to get data

 Record actors
  • For synthesis this is probably best
  • People hear more acted emotions than real ones

 Mine TV/movies
  • But usually has background music

 Mine call-center logs
  • Lots of angry examples

 Mine YouTube videos
  • Probably all emotions, but hard to search

SLIDE 6

Can machines recognize emotions?

 LDC Emotional Prosody Speech and Transcripts
  • English, dates and numbers, 7 actors
  • 2418 utterances, average 3 sec., total: ~2 h
  • 4-class problem: happy, hot-anger, sadness, neutral
  • 6-class problem: […], interest, panic
  • 15-class problem: […], anxiety, boredom, cold-anger, contempt, despair, disgust, elation, pride, shame

 Berlin Emotional Database (emoDB)
  • German, semantically neutral utterances, 10 actors
  • 535 utterances, average 2.8 sec., total: ~25 min
  • 6 emotions: anger, boredom, disgust, anxiety/fear, happiness, sadness
SLIDE 7

Acoustic Features

 Feature extraction: 1582 features (openSMILE)
  • 264 prosodic features: 72 F0, 38 energy, 154 duration/position
  • 140 voice quality: 68 jitter (JT), 34 shimmer (SH), 38 voicing (VC)
  • 1178 spectral features: 570 MFCC, 304 MEL, 304 LSP

 38 low-level descriptors × 21 functionals
  • Low-level descriptors: PCM loudness, MFCC [0-14], Log Mel Freq. Band [0-7], LSP Frequency [0-7], F0, voicing, jitter/shimmer (local/DDP)
  • Functionals: position max./min., arith. mean, standard deviation, skewness, kurtosis, lin. regression coefficient 1/2, lin. regression error Q/A, percentile 1/99
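As a rough, pure-Python illustration of the descriptor-times-functionals scheme above (this is not openSMILE's implementation, and the F0 contour below is made up):

```python
import math

def functionals(contour):
    """Apply a few of the listed functionals (mean, std, skewness,
    kurtosis, 1st/99th percentile) to one low-level descriptor contour."""
    n = len(contour)
    mean = sum(contour) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in contour) / n)
    # Standardized 3rd and 4th central moments (population form).
    skew = sum((x - mean) ** 3 for x in contour) / n / std ** 3
    kurt = sum((x - mean) ** 4 for x in contour) / n / std ** 4
    ranked = sorted(contour)
    pct = lambda p: ranked[min(n - 1, int(p / 100 * n))]
    return {"mean": mean, "std": std, "skewness": skew,
            "kurtosis": kurt, "pctl1": pct(1), "pctl99": pct(99)}

# A toy (hypothetical) F0 contour in Hz for one utterance.
f0 = [110, 115, 120, 130, 145, 160, 150, 135, 125, 118]
feats = functionals(f0)
print(feats["mean"])  # 130.8
```

Running every functional over every descriptor contour of an utterance yields one fixed-length feature vector per utterance, regardless of its duration.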

SLIDE 8

Classification and Evaluation

 Classification
  • Discriminative training, multi-class SVM (1:1), WEKA
  • Linear kernel, complexity parameter set by cross-validation
  • Standardized feature sets

 Evaluation
  • 10-fold cross-validation or LOSO (leave-one-speaker/sentence-out)
  • Also testing on a held-out set (test set)
  • Evaluation criterion: accuracy or unweighted average recall (UAR)
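The UAR criterion above is just the mean of per-class recalls, which keeps rare emotion classes from being drowned out by frequent ones. A minimal sketch (the labels below are hypothetical):

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """UAR: mean of per-class recalls, so each emotion class counts
    equally regardless of how many utterances it has."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy imbalanced 2-class case: a classifier that never predicts "sad"
# gets 80% accuracy but only 50% UAR (= chance for 2 classes).
y_true = ["happy"] * 8 + ["sad"] * 2
y_pred = ["happy"] * 10
print(unweighted_average_recall(y_true, y_pred))  # 0.5
```

This is why the chance levels quoted in the result tables are simply 1/(number of classes).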

SLIDE 9

Results

 LDC Emotional Prosody Speech and Transcripts (LOSO)

   UAR [%]          4 classes   6 classes   15 classes
   whole data set   70.4        53.5        23.6
   test set         68.3        43.3        23.5
   chance level     25.0        16.6        6.7

 Berlin Emotional Speech Database

   UAR [%]          7 classes
   whole data set   77.0
   test set         80.2
   chance level     14.3

SLIDE 10

Results (normalized)

 LDC Emotional Prosody Speech and Transcripts (LOSO)

   UAR [%]          4 classes     6 classes     15 classes
   whole data set   75.0 (+4.6)   54.4 (+0.9)   27.2 (+3.6)

 Berlin Emotional Speech Database (LOSO)

   UAR [%]          7 classes
   whole data set   84.2 (+7.2)

 Speaker / sentence normalization
  • z-score(Xs) = (Xs - mean(Xs)) / std(Xs)
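The per-speaker z-score above can be sketched in pure Python (the feature values below are made up):

```python
import math

def speaker_zscore(features_by_speaker):
    """Per-speaker z-score: for each speaker s, x -> (x - mean(Xs)) / std(Xs),
    removing speaker-specific offsets (e.g. overall pitch level) before
    emotion classification."""
    normalized = {}
    for speaker, xs in features_by_speaker.items():
        mean = sum(xs) / len(xs)
        std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
        normalized[speaker] = [(x - mean) / std for x in xs]
    return normalized

# Hypothetical mean-F0 values: speaker B speaks ~100 Hz higher overall,
# but after per-speaker normalization both land on the same scale.
raw = {"A": [100.0, 120.0, 140.0], "B": [200.0, 220.0, 240.0]}
norm = speaker_zscore(raw)
print(norm["A"] == norm["B"])  # True
```

Removing these speaker-level offsets is what buys the few points of UAR shown in the normalized results above.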
SLIDE 11

Classification Analysis: LDC Emotion

SLIDE 12

Classification Analysis: emoDB

SLIDE 13

Can people recognize emotions?

SLIDE 14

Mechanical Turk

 Anonymous workers (Worker-ID)
 Simple tasks for small amounts of money
 Minimal time, effort, and cost required for significant amounts of crowd-sourced data
SLIDE 15

English LDC Emotion (4 Emotions)

 Short, 1-2 second wav files
 English speech: dates such as "November 3rd"
 4 fundamental, distinct emotions
 74 unique workers and 169 total HITs completed

 Results (uni-directional confusion)

   Emotion     % Correct
   Anger       69%
   Sadness     67%
   Neutral     66%
   Happiness   46%
   Total       60%

SLIDE 16

English LDC Emotion (15 Emotions)

 Same parameters as the previous experiment
 Including less well-defined emotions (pride, shame, etc.)
 68 unique workers and 218 total HITs completed

 Results (uni-directional confusion)

   Emotion      % Correct
   Neutral      29%
   Hot-Anger    26%
   Sadness      25%
   Boredom      17%
   Panic        14%
   Interest     12%
   Elation      10%
   Contempt     10%
   Happiness    9%
   Pride        9%
   Despair      8%
   Cold-Anger   7%
   Anxiety      5%
   Disgust      5%
   Shame        4%
   Total        12%

SLIDE 17

German Berlin Emotion (7 Emotions)

 Short sentences with no emotional connotation
  • "The tablecloth is lying on the fridge."
 37 unique workers and 245 total HITs completed

 Results (common confusion pairs)

   Emotion     % Correct
   Neutral     68%
   Anger       62%
   Sadness     53%
   Anxiety     45%
   Happiness   35%
   Boredom     27%
   Disgust     11%
   Total       41.8%

SLIDE 18

Subjective Evaluation Takeaways

  • Humans are significantly more accurate than chance for smaller numbers of emotions
    – This includes cross-lingual recognition
  • Certain emotions are consistently identified accurately
    – Sadness, Neutral, Hot-Anger

SLIDE 19

Emotional TTS

Record lots of data
 1 hour plus in each domain
 (Easy to get boredom and anger)

Do voice conversion/parametric synthesis
 Better

In all, the results aren't encouraging
 Hard to make it sound very natural

SLIDE 20

Synthesis using AF13s

Types of synthesis

 tts       Text-to-speech with no emotion/personality content. Predicts durations, F0, and spectrum (through AFs).
 ttsE/P    Text-to-speech with an emotion/personality flag. Predicts durations, F0, and spectrum (through AFs).
 cgp       No explicit emotion/personality flag, but natural durations. Predicts F0 and spectrum (through AFs).
 cgpE/P    Emotion/personality flag, and natural durations. Predicts F0 and spectrum (through AFs).
 resynth   Pure re-synthesis from natural durations, F0, and spectrum ("the best we can do").

SLIDE 21

LDC (English) Emotion Synthesis

Objective evaluation: chance = 25%

 Without speaker normalization:

   Synthesis type   tts   ttsE   cgp   cgpE   resynth   train human / test human
   UAR              32%   36%    38%   40%    56%       70%

 With speaker normalization:

   Synthesis type   tts   ttsE   cgp   cgpE   resynth   train human / test human
   UAR              33%   54%    57%   61%    61%       75%

SLIDE 22

LDC (English) Emotion Synthesis

Mechanical Turk human evaluation: chance = 25%

   Synthesis type    tts   ttsE   cgp   cgpE   resynth   natural speech
   Percent correct   28%   28%    35%   37%    41%       60%

We see the same trend and ordering in the human evaluation as in the objective classification.

 Average workers: 59; average HITs completed: 308; files per HIT: 12

SLIDE 23

German Synthesis

 Berlin Emotion (objective evaluation, chance = 14%)

   Synthesis type   tts   ttsE   cgp   cgpE   resynth   train human / test human
   UAR              14%   29%    65%   72%    82%       84%

 ADFS Personality (objective evaluation, chance = 10%)

   Synthesis type   tts   ttsP   cgp   cgpP   resynth   train human / test human
   UAR              10%   60%    60%   78%    89%       92%

SLIDE 24

What we actually need

Expressive styles

 Frustrated (angry, annoyed, etc.)
 Interested/Uninterested
 Pleased/Unhappy
 Cooperative/non-cooperative

SLIDE 25

How can we use it?

Detect frustrated customers
 Be frustrated back at them (or not)
 What techniques can deflate frustration?

Detect (non)confidence
 Better aid in tutoring systems

S2S translation
 Copy emotion across languages
