

SLIDE 1

Wavelet-Based Time-Frequency Representations for Automatic Recognition of Emotions from Speech

J. C. Vásquez-Correa¹,²*, T. Arias-Vergara¹, J. R. Orozco-Arroyave¹,², J. F. Vargas-Bonilla¹, E. Nöth²

¹Department of Electronics and Telecommunication Engineering, University of Antioquia UdeA.
²Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg.

*jcamilo.vasquez@udea.edu.co

SLIDE 2

Outline

◮ Introduction
◮ Methodology
◮ Data
◮ Experiments and Results
◮ Conclusion

SLIDE 3

Introduction: Emotions

SLIDE 4

Introduction: Emotion recognition

Recognition of emotion from speech:

◮ Call centers
◮ Emergency services
◮ Depression treatment
◮ Intelligent vehicles
◮ Public surveillance

SLIDE 5

Introduction: Non-stationary analysis

SLIDE 6

Introduction: Non-stationary analysis

◮ Time–Frequency Analysis

Wavelet Transform
Wigner–Ville distribution
Modulation Spectra

SLIDE 7

Introduction: Proposal

Features based on the energy content of three Wavelet–based TF representations for the classification of emotions from speech.

◮ Continuous Wavelet transform (CWT)
◮ Bionic Wavelet transform (BWT)
◮ Synchro–squeezing Wavelet transform (SSWT)
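As a rough illustration of how such a representation is obtained, the sketch below computes a CWT with PyWavelets. The Morlet mother wavelet and the scale grid are assumptions; the slides do not specify either choice. The BWT and SSWT need dedicated implementations (the ssqueezepy package, for instance, provides a synchrosqueezed CWT).

```python
# Minimal CWT sketch with PyWavelets. Wavelet ('morl') and scale grid
# are assumptions, not the authors' stated configuration.
import numpy as np
import pywt

fs = 16000                               # sampling rate of e.g. the Berlin corpus
t = np.arange(0, 0.120, 1.0 / fs)        # a 120 ms analysis segment
x = np.sin(2 * np.pi * 200 * t)          # toy stand-in for a voiced speech segment

scales = np.arange(1, 128)
coeffs, freqs = pywt.cwt(x, scales, 'morl', sampling_period=1.0 / fs)
# coeffs: (n_scales, n_samples) TF matrix; freqs: frequency (Hz) of each scale.
print(coeffs.shape, freqs.min(), freqs.max())
```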

SLIDE 8

Methodology

SLIDE 9

Methodology: segmentation

Two types of sounds:

◮ Voiced
◮ Unvoiced
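A minimal voiced/unvoiced split can be sketched as below. The slides do not name the detector used, so the pyin pitch tracker from librosa is an assumption standing in for it, and "speech.wav" is a hypothetical input file.

```python
# Hedged sketch: voiced/unvoiced segmentation via the pyin pitch tracker.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)      # hypothetical input file
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=60, fmax=400, sr=sr)                  # typical speech F0 range

hop = 512                                         # pyin default hop (frame_length // 4)
idx = np.flatnonzero(voiced_flag)                 # indices of voiced frames
runs = np.split(idx, np.flatnonzero(np.diff(idx) > 1) + 1)
voiced_spans = [(r[0] * hop, (r[-1] + 1) * hop) for r in runs if r.size]
# Samples outside voiced_spans are treated as unvoiced (or silence).
print(f"{len(voiced_spans)} voiced segments")
```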

SLIDE 10

Methodology: Wavelet Transforms

[Figure: a 120 ms speech segment (top) and its three TF representations: CWT (scale vs. time), SSWT (frequency, 7.6 Hz–8 kHz, vs. time), and BWT (frequency, 7.6 Hz–8 kHz, vs. time).]

CWT: continuous wavelet transform. BWT: bionic wavelet transform. SSWT: synchro-squeezed wavelet transform.

SLIDE 11

Methodology: feature extraction

[Figure: a speech frame and its wavelet-based TF representation, split into frequency bands with edges at 100, 200, 300, 510, 920, 1720, 3150, and 8000 Hz; the log-energy of each band is taken as a feature.]

The log-energy of the i-th frequency band $f_i$ is

$$E[i] = \log\left(\frac{1}{N_{f_i}} \sum_{u_k=1}^{N} \left|\mathrm{WT}(u_k, f_i)\right|^2\right) \qquad (1)$$

where the sum runs over the $N$ time samples $u_k$ of the frame and $N_{f_i}$ is the number of TF bins in band $f_i$.
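Eq. (1) is straightforward to state in code. The sketch below assumes the TF representation is already available as a (frequency bins × samples) array; the band edges mirror the axis labels on the slide.

```python
# Eq. (1) in numpy: log-energy per frequency band of a TF representation WT
# with shape (n_freq_bins, n_samples) and bin frequencies in `freqs` (Hz).
import numpy as np

def band_log_energies(WT, freqs, band_edges):
    """E[i] = log of the mean of |WT(u_k, f_i)|^2 over the bins of band i."""
    E = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = np.abs(WT[(freqs >= lo) & (freqs < hi)]) ** 2  # |WT(u_k, f_i)|^2
        E.append(np.log(band.mean() + 1e-12))                 # (1/N) sum, then log
    return np.array(E)

band_edges = [100, 200, 300, 510, 920, 1720, 3150, 8000]      # edges from the slide
```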

SLIDE 12

Methodology: feature extraction

Descriptors (16 × 2)      Statistic functions (12)
ZCR                       mean
RMS Energy                standard deviation
F0                        kurtosis, skewness
HNR                       max, min, relative position, range
MFCC 1–12                 slope, offset, MSE of linear regression
∆ of all of the above

Table: Features implemented using openEAR¹

¹Florian Eyben, Martin Wöllmer, and Björn Schuller. "OpenSmile: the Munich versatile and fast open-source audio feature extractor". In: 18th ACM International Conference on Multimedia. ACM, 2010, pp. 1459–1462.
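The table amounts to: extract frame-level contours (ZCR, RMS energy, F0, HNR, MFCC 1–12, plus their deltas) and summarize each contour with 12 functionals. A hedged numpy/scipy re-implementation of the functionals is sketched below; the authors used openEAR, and counting both the relative position of the max and of the min to reach 12 values is an assumption.

```python
# Hedged re-implementation of the 12 statistic functions from the table.
import numpy as np
from scipy.stats import kurtosis, skew

def functionals(contour):
    """Summarize one frame-level descriptor contour with 12 statistics."""
    c = np.asarray(contour, dtype=float)
    t = np.arange(c.size)
    slope, offset = np.polyfit(t, c, 1)            # linear regression over time
    mse = np.mean((c - (slope * t + offset)) ** 2) # MSE of the linear fit
    return np.array([
        c.mean(), c.std(),                         # mean, standard deviation
        kurtosis(c), skew(c),                      # kurtosis, skewness
        c.max(), c.min(),                          # max, min
        c.argmax() / c.size, c.argmin() / c.size,  # relative positions (assumed)
        c.max() - c.min(),                         # range
        slope, offset, mse,                        # linear-regression terms
    ])
```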

SLIDE 13

Methodology: classification

[Diagram: features are extracted from the train set, modeled with a GMM-UBM, and classified with an SVM to predict the emotion.]

SLIDE 14

Methodology: classification

◮ The scores of the voiced and unvoiced SVMs are fused and used as new features for a second SVM.
◮ Leave-one-speaker-out cross-validation is performed.
◮ UAR is the performance measure.

[Diagram: voiced and unvoiced features are turned into GMM supervectors and classified by separate SVMs; their distances to the hyperplane feed a fusion SVM that outputs the emotion.]
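The evaluation protocol from the bullets above can be sketched with scikit-learn. The GMM-UBM supervector extraction and the second-stage fusion SVM are omitted, so this is only a skeleton of the cross-validation and scoring; X, y, and speakers are placeholders.

```python
# Leave-one-speaker-out cross-validation scored with UAR
# (unweighted average recall = macro-averaged recall).
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 384))           # placeholder feature vectors
y = rng.integers(0, 2, 200)               # e.g. high vs. low arousal labels
speakers = rng.integers(0, 10, 200)       # speaker id per recording

uars = []
for tr, te in LeaveOneGroupOut().split(X, y, groups=speakers):
    clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
    clf.fit(X[tr], y[tr])
    uars.append(recall_score(y[te], clf.predict(X[te]), average='macro'))

print(f"UAR: {100 * np.mean(uars):.0f} +/- {100 * np.std(uars):.0f}")
```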

SLIDE 15

Data

Table: Databases used in this study

Database     # Rec.  # Speak.  Fs (Hz)  Type    Emotions
Berlin       534     10        16000    Acted   Fear, Disgust, Happiness, Neutral, Boredom, Sadness, Anger
IEMOCAP      10039   10        16000    Acted   Fear, Disgust, Happiness, Anger, Surprise, Excitation, Frustration, Sadness, Neutral
SAVEE        480     4         44100    Acted   Anger, Happiness, Disgust, Fear, Neutral, Sadness, Surprise
enterface05  1317    44        44100    Evoked  Fear, Disgust, Happiness, Anger, Surprise, Sadness

SLIDE 16

Experiments and Results: high vs. low arousal

[Diagram: emotions laid out on the arousal-valence plane (high/low arousal × positive/negative valence): anger, fear, disgust, stress, sadness, boredom, calm, relaxed, interest, surprise, happiness, neutral.]

SLIDE 17

Experiments and Results: high vs. low arousal

Table: Detection of high vs. low arousal emotions. V: voiced, U: unvoiced.

Features  Segm.   Berlin   SAVEE   enterface05  IEMOCAP
CWT       V       96 ± 6   83 ± 9  81 ± 2       74 ± 4
          U       89 ± 9   80 ± 8  80 ± 1       75 ± 3
          Fusion  93 ± 8   87 ± 7  81 ± 3       76 ± 3
BWT       V       96 ± 6   82 ± 8  82 ± 2       74 ± 4
          U       90 ± 9   80 ± 7  80 ± 2       75 ± 3
          Fusion  94 ± 7   85 ± 7  82 ± 2       76 ± 4
SSWT      V       96 ± 6   84 ± 8  81 ± 2       76 ± 5
          U       89 ± 8   80 ± 7  80 ± 1       76 ± 3
          Fusion  95 ± 6   82 ± 6  80 ± 3       77 ± 4
OpenEAR   –       97 ± 3   83 ± 9  81 ± 2       76 ± 4

SLIDE 20

Experiments and Results: positive vs. negative

[Diagram: emotions on the arousal-valence plane, as in Slide 16.]

SLIDE 21

Experiments and Results: positive vs. negative

Table: Detection of positive vs. negative valence emotions. V: voiced, U: unvoiced.

Features  Segm.   Berlin   SAVEE   enterface05  IEMOCAP
CWT       V       80 ± 4   64 ± 5  75 ± 2       55 ± 4
          U       76 ± 5   64 ± 3  73 ± 3       58 ± 2
          Fusion  78 ± 4   67 ± 4  74 ± 2       58 ± 5
BWT       V       80 ± 4   64 ± 6  74 ± 2       55 ± 4
          U       76 ± 7   64 ± 5  74 ± 3       58 ± 2
          Fusion  78 ± 6   65 ± 6  74 ± 4       58 ± 3
SSWT      V       82 ± 5   64 ± 5  76 ± 3       56 ± 4
          U       77 ± 6   63 ± 3  74 ± 3       58 ± 2
          Fusion  79 ± 4   65 ± 5  74 ± 4       60 ± 3
OpenEAR   –       87 ± 2   72 ± 6  81 ± 4       59 ± 3

SLIDE 24

Experiments and Results: multiple emotions

[Diagram: arousal-valence plane with the emotions considered in the multi-class task: anger, fear, disgust, sadness, boredom, relaxed, surprise, happiness, neutral.]

SLIDE 25

Experiments and Results: multiple emotions

Table: Classification of multiple emotions. V: voiced, U: unvoiced.

Features  Segm.   Berlin   SAVEE    enterface05  IEMOCAP
CWT       V       61 ± 8   41 ± 13  48 ± 5       47 ± 6
          U       55 ± 7   39 ± 6   46 ± 4       51 ± 4
          Fusion  67 ± 7   44 ± 9   51 ± 6       56 ± 5
BWT       V       64 ± 9   41 ± 15  48 ± 4       47 ± 5
          U       56 ± 7   40 ± 4   45 ± 4       51 ± 4
          Fusion  66 ± 7   47 ± 10  50 ± 4       55 ± 6
SSWT      V       64 ± 8   43 ± 11  48 ± 4       49 ± 5
          U       55 ± 8   40 ± 6   46 ± 4       52 ± 3
          Fusion  69 ± 8   45 ± 12  49 ± 6       58 ± 4
OpenEAR   –       80 ± 8   49 ± 17  63 ± 7       57 ± 3

SLIDE 27

Conclusion

◮ This study evaluates different wavelet-based TF representations (CWT, BWT, SSWT) to model emotional speech.
◮ Among the three TF-based transformations, the SSWT provides the best results.
◮ In most cases the highest UARs are obtained with the features extracted from voiced segments.
◮ The fusion scheme proves useful to combine the information provided by both kinds of segments.
◮ The results with the proposed approach are better than those obtained with openEAR when classifying high vs. low arousal emotions.
◮ Further experiments will consider other descriptors extracted from the TF representations to improve the results in the other classification tasks.

SLIDE 33

Questions

Thanks!

jcamilo.vasquez@udea.edu.co

SLIDE 34

Wavelet-Based Time-Frequency Representations for Automatic Recognition of Emotions from Speech

J. C. Vásquez-Correa¹,²*, T. Arias-Vergara¹, J. R. Orozco-Arroyave¹,², J. F. Vargas-Bonilla¹, E. Nöth²

¹Department of Electronics and Telecommunication Engineering, University of Antioquia UdeA.
²Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg.

*jcamilo.vasquez@udea.edu.co
