Speech Processing 15-492/18-492: Emotional Speech


SLIDE 1

Speech Processing 15-492/18-492

Emotional Speech

(Some slides taken from the JHU Workshop 2011 final presentation on New Parameterizations for Emotional Speech Synthesis)

SLIDE 2

Processing Emotional Speech

 What is it?
  • Emotion/Expressive/Style: things beyond the textual content

 Why?
  • Detect frustrated users
  • Detect confusion/confidence in speakers
  • Detect truth/lies
  • Detect engagement in a task

 How?
  • A combination of words, spectrum, F0, etc.

SLIDE 3

What is emotional speech?

The standard 4 emotions
 Neutral, Happy, Sad, and Angry

But there are many more
 Cold-anger, dominant, passive, shame
 Confident, non-confident, etc.

SLIDE 4

Can machines recognize emotions?

SLIDE 5

Where to get data

 Record actors
  • For synthesis this is probably best
  • People hear more acted emotions than real ones

 Mine TV/movies
  • But usually has background music

 Mine call-center logs
  • Lots of angry examples

 Mine YouTube videos
  • Probably all emotions, but hard to search

SLIDE 6

Can machines recognize emotions?

 LDC Emotional Prosody Speech and Transcripts
  • English, dates and numbers, 7 actors
  • 2418 utterances, average 3 sec., total: ~2 h
  • 4-class problem: happy, hot-anger, sadness, neutral
  • 6-class problem: […], interest, panic
  • 15-class problem: […], anxiety, boredom, cold-anger, contempt, despair, disgust, elation, pride, shame

 Berlin Emotional Database (emoDB)
  • German, semantically neutral utterances, 10 actors
  • 535 utterances, average 2.8 sec., total: ~25 min
  • 6 emotions: anger, boredom, disgust, anxiety/fear, happiness, sadness
SLIDE 7

Acoustic Features

 Feature extraction: 1582 features (openSMILE)
  • 264 prosodic features: 72 F0, 38 energy, 154 duration/position
  • 140 voice quality: 68 jitter (JT), 34 shimmer (SH), 38 voicing (VC)
  • 1178 spectral features: 570 MFCC, 304 MEL, 304 LSP

 38 low-level descriptors × 21 functionals
  • Low-level descriptors: PCM loudness, MFCC [0-14], Log Mel Freq. Band [0-7], LSP Frequency [0-7], F0, voicing, jitter/shimmer (local/DDP)
  • Functionals: position max./min., arith. mean, standard deviation, skewness, kurtosis, lin. regression coefficient 1/2, lin. regression error Q/A, percentile 1/99
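As a rough, pure-Python illustration of the descriptor-times-functionals scheme above (this is not openSMILE's implementation, and the F0 contour below is made up):

```python
import math

def functionals(contour):
    """Apply a few of the listed functionals (mean, std, skewness,
    kurtosis, 1st/99th percentile) to one low-level descriptor contour."""
    n = len(contour)
    mean = sum(contour) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in contour) / n)
    # Standardized 3rd and 4th central moments (population form).
    skew = sum((x - mean) ** 3 for x in contour) / n / std ** 3
    kurt = sum((x - mean) ** 4 for x in contour) / n / std ** 4
    ranked = sorted(contour)
    pct = lambda p: ranked[min(n - 1, int(p / 100 * n))]
    return {"mean": mean, "std": std, "skewness": skew,
            "kurtosis": kurt, "pctl1": pct(1), "pctl99": pct(99)}

# A toy (hypothetical) F0 contour in Hz for one utterance.
f0 = [110, 115, 120, 130, 145, 160, 150, 135, 125, 118]
feats = functionals(f0)
print(feats["mean"])  # 130.8
```

Running every functional over every descriptor contour of an utterance yields one fixed-length feature vector per utterance, regardless of its duration.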

SLIDE 8

Classification and Evaluation

 Classification
  • Discriminative training, multi-class SVM (1:1), WEKA
  • Linear kernel, complexity parameter set by cross-validation
  • Standardized feature sets

 Evaluation
  • 10-fold cross-validation or LOSO (leave-one-speaker/sentence-out)
  • Also testing on a held-out set (test set)
  • Evaluation criterion: accuracy or unweighted average recall (UAR)
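The UAR criterion above is just the mean of per-class recalls, which keeps rare emotion classes from being drowned out by frequent ones. A minimal sketch (the labels below are hypothetical):

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """UAR: mean of per-class recalls, so each emotion class counts
    equally regardless of how many utterances it has."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy imbalanced 2-class case: a classifier that never predicts "sad"
# gets 80% accuracy but only 50% UAR (= chance for 2 classes).
y_true = ["happy"] * 8 + ["sad"] * 2
y_pred = ["happy"] * 10
print(unweighted_average_recall(y_true, y_pred))  # 0.5
```

This is why the chance levels quoted in the result tables are simply 1/(number of classes).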

SLIDE 9

Results

 LDC Emotional Prosody Speech and Transcripts (LOSO)

   UAR [%]          4 classes   6 classes   15 classes
   whole data set   70.4        53.5        23.6
   test set         68.3        43.3        23.5
   chance level     25.0        16.6        6.7

 Berlin Emotional Speech Database

   UAR [%]          7 classes
   whole data set   77.0
   test set         80.2
   chance level     14.3

SLIDE 10

Results (normalized)

 LDC Emotional Prosody Speech and Transcripts (LOSO)

   UAR [%]          4 classes     6 classes     15 classes
   whole data set   75.0 (+4.6)   54.4 (+0.9)   27.2 (+3.6)

 Berlin Emotional Speech Database (LOSO)

   UAR [%]          7 classes
   whole data set   84.2 (+7.2)

 Speaker / sentence normalization
  • z-score(Xs) = (Xs - mean(Xs)) / std(Xs)
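The per-speaker z-score above can be sketched in pure Python (the feature values below are made up):

```python
import math

def speaker_zscore(features_by_speaker):
    """Per-speaker z-score: for each speaker s, x -> (x - mean(Xs)) / std(Xs),
    removing speaker-specific offsets (e.g. overall pitch level) before
    emotion classification."""
    normalized = {}
    for speaker, xs in features_by_speaker.items():
        mean = sum(xs) / len(xs)
        std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
        normalized[speaker] = [(x - mean) / std for x in xs]
    return normalized

# Hypothetical mean-F0 values: speaker B speaks ~100 Hz higher overall,
# but after per-speaker normalization both land on the same scale.
raw = {"A": [100.0, 120.0, 140.0], "B": [200.0, 220.0, 240.0]}
norm = speaker_zscore(raw)
print(norm["A"] == norm["B"])  # True
```

Removing these speaker-level offsets is what buys the few points of UAR shown in the normalized results above.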
SLIDE 11

Classification Analysis: LDC Emotion

SLIDE 12

Classification Analysis: emoDB

SLIDE 13

Can people recognize emotions?

SLIDE 14

Mechanical Turk

 Anonymous workers (Worker-ID)
 Simple tasks for small amounts of money
 Minimal time, effort, and cost required for significant amounts of crowd-sourced data
SLIDE 15

English LDC Emotion (4 Emotions)

 Short, 1-2 second wav files
 English speech: dates such as "November 3rd"
 4 fundamental, distinct emotions
 74 unique workers and 169 total HITs completed

 Results (uni-directional confusion)

   Emotion     % Correct
   Anger       69%
   Sadness     67%
   Neutral     66%
   Happiness   46%
   Total       60%

SLIDE 16

English LDC Emotion (15 Emotions)

 Same parameters as the previous experiment
 Including less well-defined emotions (pride, shame, etc.)
 68 unique workers and 218 total HITs completed

 Results (uni-directional confusion)

   Emotion      % Correct
   Neutral      29%
   Hot-Anger    26%
   Sadness      25%
   Boredom      17%
   Panic        14%
   Interest     12%
   Elation      10%
   Contempt     10%
   Happiness    9%
   Pride        9%
   Despair      8%
   Cold-Anger   7%
   Anxiety      5%
   Disgust      5%
   Shame        4%
   Total        12%

SLIDE 17

German Berlin Emotion (7 Emotions)

 Short sentences with no emotional connotation
  • "The tablecloth is lying on the fridge."
 37 unique workers and 245 total HITs completed

 Results (common confusion pairs)

   Emotion     % Correct
   Neutral     68%
   Anger       62%
   Sadness     53%
   Anxiety     45%
   Happiness   35%
   Boredom     27%
   Disgust     11%
   Total       41.8%

SLIDE 18

Subjective Evaluation Takeaways

  • Humans are significantly more accurate than chance for smaller numbers of emotions
    – This includes cross-lingual recognition
  • Certain emotions are consistently identified accurately
    – Sadness, Neutral, Hot-Anger

SLIDE 19

Emotional TTS

Record lots of data
 1 hour plus in each domain
 (Easy to get boredom and anger)

Do voice conversion/parametric synthesis
 Better

In all, the results aren't encouraging
 Hard to make it sound very natural

SLIDE 20

Synthesis using AF13s

Types of synthesis

 tts       Text-to-speech with no emotion/personality content. Predicts durations, F0, and spectrum (through AFs).
 ttsE/P    Text-to-speech with an emotion/personality flag. Predicts durations, F0, and spectrum (through AFs).
 cgp       No explicit emotion/personality flag, but natural durations. Predicts F0 and spectrum (through AFs).
 cgpE/P    Emotion/personality flag, and natural durations. Predicts F0 and spectrum (through AFs).
 resynth   Pure re-synthesis from natural durations, F0, and spectrum ("the best we can do").

SLIDE 21

LDC (English) Emotion Synthesis

Objective evaluation: chance = 25%

 Without speaker normalization:

   Synthesis type   tts   ttsE   cgp   cgpE   resynth   train human / test human
   UAR              32%   36%    38%   40%    56%       70%

 With speaker normalization:

   Synthesis type   tts   ttsE   cgp   cgpE   resynth   train human / test human
   UAR              33%   54%    57%   61%    61%       75%

SLIDE 22

LDC (English) Emotion Synthesis

Mechanical Turk human evaluation: chance = 25%

   Synthesis type    tts   ttsE   cgp   cgpE   resynth   natural speech
   Percent correct   28%   28%    35%   37%    41%       60%

We see the same trend and ordering in the human evaluation as in the objective classification.

 Average workers: 59; average HITs completed: 308; files per HIT: 12

SLIDE 23

German Synthesis

 Berlin Emotion (objective evaluation, chance = 14%)

   Synthesis type   tts   ttsE   cgp   cgpE   resynth   train human / test human
   UAR              14%   29%    65%   72%    82%       84%

 ADFS Personality (objective evaluation, chance = 10%)

   Synthesis type   tts   ttsP   cgp   cgpP   resynth   train human / test human
   UAR              10%   60%    60%   78%    89%       92%

SLIDE 24

What we actually need

Expressive styles

 Frustrated (angry, annoyed, etc.)
 Interested/Uninterested
 Pleased/Unhappy
 Cooperative/non-cooperative

SLIDE 25

How can we use it?

Detect frustrated customers
 Be frustrated back at them (or not)
 What techniques can deflate frustration?

Detect (non)confidence
 Better aid in tutoring systems

S2S translation
 Copy emotion across languages
