Speech Processing 15-492/18-492
Emotional Speech
(Some slides taken from JHU Workshop 2011 final presentation on New Parameterizations for Emotional Speech Synthesis)
What is it?
Emotion/Expressive/Style: things beyond the textual content
Why?
Detect frustrated users
Detect confusion/confidence in speakers
Detect truth/lies
Detect engagement in a task
How?
Combination of words, spectrum, F0, etc.
Neutral, Happy, Sad and Angry
Cold-anger, dominant, passive, shame
Confident, non-confident, etc.
Record actors
For synthesis this is probably best
People hear more acted emotions than real ones
Mine TV/movies
But usually has background music
Mine call-center logs
Lots of angry examples
Mine YouTube videos
Probably all emotions, but hard to search
LDC Emotional Prosody Speech and Transcripts
disgust, elation, pride, shame
Berlin Emotional Database (emoDB)
Feature extraction: 1582 features (openSMILE)
38 low-level descriptors x 21 functionals
Low-level descriptors: PCM loudness, MFCC [0-14], log Mel freq. band [0-7], LSP frequency [0-7], F0, voicing, jitter/shimmer (local/DDP)
Functionals: position of max./min., arith. mean, standard deviation, skewness, kurtosis, percentile 1/99
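A minimal sketch of the functionals idea: each low-level descriptor is a frame-level contour (an F0 track, a loudness track, ...), and a fixed set of statistical functionals collapses that variable-length contour into a fixed-length vector. The function and variable names below are illustrative, not openSMILE's actual API; openSMILE computes its 1582-feature set from its own config files.

```python
import statistics

def functionals(contour):
    """Summarize a frame-level LLD contour (e.g. an F0 track) with a few
    of the functionals listed above: mean, std dev, skewness, kurtosis,
    1st/99th percentiles, and relative positions of max/min."""
    n = len(contour)
    mean = statistics.fmean(contour)
    std = statistics.pstdev(contour)
    # Third and fourth central moments give skewness and (excess) kurtosis
    m3 = sum((x - mean) ** 3 for x in contour) / n
    m4 = sum((x - mean) ** 4 for x in contour) / n
    skew = m3 / std ** 3 if std else 0.0
    kurt = m4 / std ** 4 - 3.0 if std else 0.0
    s = sorted(contour)
    pct = lambda p: s[min(n - 1, int(p / 100 * n))]
    return {
        "mean": mean, "stddev": std,
        "skewness": skew, "kurtosis": kurt,
        "pctl1": pct(1), "pctl99": pct(99),
        "pos_max": contour.index(max(contour)) / n,  # relative position in [0,1)
        "pos_min": contour.index(min(contour)) / n,
    }

# A toy F0 contour in Hz; applying every functional to every LLD contour
# (and its deltas) is how a fixed-length utterance vector is built.
f0_track = [110.0, 112.5, 118.0, 125.0, 121.0, 115.0, 108.0, 104.0]
feats = functionals(f0_track)
```

The point of the functionals is that a classifier then sees one vector per utterance regardless of how long the utterance is.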
Classification
Evaluation
LDC Emotional Prosody Speech and Transcripts (LOSO)

UAR [%]          4 classes   6 classes   15 classes
whole data set      70.4        53.5        23.6
test set            68.3        43.3        23.5
chance level        25.0        16.6         6.7
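The UAR metric in these tables is the unweighted average recall: the mean of the per-class recalls, so every emotion counts equally even when class sizes are skewed. A small self-contained sketch (names are illustrative):

```python
from collections import defaultdict

def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls, so a
    classifier cannot score well by only getting the majority class right."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += (t == p)
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# With K classes, random guessing scores UAR = 1/K, which is where the
# chance rows above come from: 25.0 / 16.6 / 6.7 for 4 / 6 / 15 classes.
truth = ["angry", "angry", "sad", "sad", "sad", "neutral"]
pred  = ["angry", "sad",   "sad", "sad", "sad", "angry"]
score = uar(truth, pred)  # recalls: 1/2, 3/3, 0/1
```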
Berlin Emotional Database (emoDB)

UAR [%]          7 classes
whole data set      77.0
test set            80.2
chance level        14.3
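"LOSO" in the LDC results means leave-one-speaker-out cross-validation: each fold holds out every utterance from one speaker, so the classifier is always scored on a voice it never trained on. A hedged sketch of generating such folds (the data layout here is invented for illustration):

```python
def loso_folds(utterances):
    """Leave-one-speaker-out folds over (speaker, item) pairs: one fold
    per speaker, testing on that speaker and training on all the rest."""
    speakers = sorted({spk for spk, _ in utterances})
    for held_out in speakers:
        train = [u for u in utterances if u[0] != held_out]
        test = [u for u in utterances if u[0] == held_out]
        yield held_out, train, test

# Toy (speaker, clip) pairs; a real setup would carry features and labels
data = [("s1", "a.wav"), ("s1", "b.wav"), ("s2", "c.wav"), ("s3", "d.wav")]
folds = list(loso_folds(data))
```

Speaker-independent evaluation like this is stricter than a random split, which is why the test-set numbers can sit well below the whole-data-set numbers.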
LDC Emotional Prosody Speech and Transcripts (LOSO)

UAR [%]          4 classes     6 classes    15 classes
whole data set   75.0 (+4.6)   54.4 (+0.9)  27.2 (+3.6)

Berlin Emotional Speech (emoDB)

UAR [%]          7 classes
whole data set   84.2 (+7.2)
Speaker / sentence normalization
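One common way to do speaker normalization, sketched below, is a per-speaker z-score: rescale each speaker's feature values to zero mean and unit variance so that speaker-specific offsets (e.g. a naturally high F0) are removed and only within-speaker variation, which carries the emotion, remains. This is a generic sketch, not necessarily the exact normalization used in the workshop.

```python
import statistics
from collections import defaultdict

def speaker_znorm(features):
    """Per-speaker z-normalization of (speaker, value) pairs:
    subtract that speaker's mean and divide by their std deviation."""
    by_spk = defaultdict(list)
    for spk, value in features:
        by_spk[spk].append(value)
    stats = {spk: (statistics.fmean(v), statistics.pstdev(v) or 1.0)
             for spk, v in by_spk.items()}  # 'or 1.0' guards constant features
    return [(spk, (value - stats[spk][0]) / stats[spk][1])
            for spk, value in features]

# Two speakers with very different baseline F0 end up on a common scale
raw = [("low", 100.0), ("low", 120.0), ("high", 200.0), ("high", 240.0)]
normed = speaker_znorm(raw)
```

After normalization the "low" and "high" speakers are directly comparable, which is consistent with the UAR gains shown in the tables above.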
Anonymous workers (Worker-ID)
Simple tasks for small amounts
Minimal time, effort, and cost
Short, 1-2 second, wav files
English speech – dates such as “November 3rd”
4 fundamental, distinct emotions
74 unique workers and 169 total HITs completed
Emotion      % Correct
Anger           69%
Sadness         67%
Neutral         66%
Happiness       46%
Total           60%
Emotion      % Correct
Neutral         29%
Hot-Anger       26%
Sadness         25%
Boredom         17%
Panic           14%
Interest        12%
Elation         10%
Contempt        10%
Same parameters as the previous experiment, but including less well-defined emotions
Pride, shame, etc.
68 unique workers and 218 total HITs completed
Emotion      % Correct
Happiness        9%
Pride            9%
Despair          8%
Cold-Anger       7%
Anxiety          5%
Disgust          5%
Shame            4%
Total           12%
Short sentences with no emotional connotation
“The tablecloth is lying on the fridge.”
37 unique workers and 245 total HITs completed
Emotion      % Correct
Neutral         68%
Anger           62%
Sadness         53%
Anxiety         45%
Happiness       35%
Boredom         27%
Disgust         11%
Total           41%
41.8%
1 hour plus in each domain (Easy to get boredom and anger)
Better
Hard to make it sound very natural
Text-to-speech with no emotion/personality content: predicts durations, F0, and spectrum (through AFs)
Text-to-speech with an emotion/personality flag: predicts durations, F0, and spectrum (through AFs)
No explicit emotion/personality flag, but natural durations: predicts F0 and spectrum (through AFs)
Natural durations: predicts F0 and spectrum (through AFs); “the best we can do.”
Synthesis Type UAR
Frustrated (angry, annoyed, etc.)
Interested/Uninterested
Pleased/Unhappy
Cooperative/Non-cooperative
Be frustrated back at them (or not)
What techniques can deflate frustration?
Better aid in tutorial systems
Copy emotion across languages