Odyssey 2016, Bilbao, Spain: Short- and Long-Term Speech Features for Hybrid HMM-i-Vector based Speaker Diarization System



slide-1
SLIDE 1

Short- and Long-Term Speech Features for Hybrid HMM-i-Vector based Speaker Diarization System

Abraham Woubie (1), Jordi Luque (2), and Javier Hernando (1)

(1) TALP Research Center, Dept. of Signal Theory and Communications, Universitat Politècnica de Catalunya, Barcelona, Spain

(2) Telefonica Research, Edificio Telefonica-Diagonal, Barcelona, Spain

June 24, 2016

Odyssey 2016, Bilbao, Spain

slide-2
SLIDE 2

Outline

❑ Introduction
❑ Objectives
❑ Voice-quality and Prosodic Features
❑ Speaker Diarization Architecture
❑ Fusion Techniques
❑ Experimental Setup and Results
❑ Conclusions


slide-3
SLIDE 3

Introduction

❑ Speaker diarization = speaker segmentation + speaker clustering

❑ Motivation

  • MFCC and GMM are, respectively, the most widely used short-term speech features and speaker clustering technique in speaker diarization.
  • Jitter and shimmer voice-quality measurements (JS) and prosodic features have been used successfully together with MFCC in GMM-based speaker diarization.
  • We propose fusing the scores of i-vectors extracted from MFCC and from long-term speech features for the speaker clustering task.

[Figure: speaker segmentation followed by speaker clustering; segment labels: Speaker 1, Speaker 2, Speaker 3.]


slide-5
SLIDE 5

Objectives

❑ Feature selection of long-term voice-quality and prosodic features.

  • The voice-quality features are absolute jitter, absolute shimmer and shimmer apq3.
  • The prosodic features are pitch, intensity and the first four formant frequencies.

❑ Stacking these voice-quality and prosodic features in the same feature vector.

❑ Extraction of i-vectors from the short-term spectral features and from these long-term feature sets.

❑ Fusion of the scores of i-vectors extracted from these features for speaker clustering.
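The stacking objective can be pictured with a minimal sketch. It assumes the voice-quality and prosodic values are already available per frame and aligned in time. The function name, the input layout and the numeric values are hypothetical and only serve the illustration: three voice-quality values plus six prosodic values give a 9-dimensional long-term feature vector.

```python
def stack_long_term_features(voice_quality, prosody):
    """Concatenate per-frame voice-quality and prosodic values into one vector per frame.

    voice_quality: list of frames, each [abs_jitter, abs_shimmer, shimmer_apq3]
    prosody:       list of frames, each [pitch, intensity, F1, F2, F3, F4]
    Both lists are assumed to be aligned frame by frame.
    """
    if len(voice_quality) != len(prosody):
        raise ValueError("both feature streams must have the same number of frames")
    return [list(vq) + list(pr) for vq, pr in zip(voice_quality, prosody)]


# Made-up values for two frames, for illustration only.
voice_quality = [[0.00004, 0.31, 0.021], [0.00005, 0.28, 0.019]]
prosody = [[121.0, 62.5, 510.0, 1480.0, 2500.0, 3600.0],
           [119.5, 61.8, 505.0, 1495.0, 2490.0, 3580.0]]

stacked = stack_long_term_features(voice_quality, prosody)
print(len(stacked), "frames of dimension", len(stacked[0]))
```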



slide-7
SLIDE 7

Speech Features

❑ Mel Frequency Cepstral Coefficients (MFCC): they are the most widely used short-term features in speaker diarization.

❑ Voice-quality features: they measure variations in the fundamental frequency and amplitude of a speaker's voice.

  • We have extracted absolute jitter, absolute shimmer and shimmer apq3.

❑ Prosodic features: they capture the evolution over time of fundamental frequency, acoustic intensity and formant frequencies.

  • We have extracted pitch, intensity and the first four formant frequencies.


slide-8
SLIDE 8

Voice-quality

❑ Jitter (absolute): It is the average absolute difference between consecutive periods.

❑ Shimmer (absolute): It is the average absolute logarithm of the ratio between amplitudes of consecutive periods.

❑ Shimmer (apq3): It is the three-point Amplitude Perturbation Quotient.
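As a rough illustration of these three definitions, the sketch below computes them from a list of consecutive period lengths and a list of per-period peak amplitudes. The function names and input layout are assumptions made for this example. The decibel scaling (20·log10) in the absolute shimmer and the normalisation of apq3 by the mean amplitude follow a commonly used formulation and are not stated on the slide. The slides do not say which tool performed the extraction.

```python
import math


def absolute_jitter(periods):
    """Mean absolute difference between consecutive period lengths (seconds)."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return sum(diffs) / len(diffs)


def absolute_shimmer_db(amplitudes):
    """Mean absolute log ratio of consecutive amplitudes, in dB (assumed 20*log10 scaling)."""
    ratios = [abs(20.0 * math.log10(b / a)) for a, b in zip(amplitudes, amplitudes[1:])]
    return sum(ratios) / len(ratios)


def shimmer_apq3(amplitudes):
    """Three-point Amplitude Perturbation Quotient.

    Mean absolute difference between each amplitude and the average of itself
    and its two neighbours, divided by the mean amplitude.
    """
    deviations = [
        abs(cur - (prev + cur + nxt) / 3.0)
        for prev, cur, nxt in zip(amplitudes, amplitudes[1:], amplitudes[2:])
    ]
    mean_amplitude = sum(amplitudes) / len(amplitudes)
    return (sum(deviations) / len(deviations)) / mean_amplitude


# Made-up measurements for four consecutive glottal periods.
periods = [0.0100, 0.0110, 0.0100, 0.0120]
amplitudes = [0.80, 0.85, 0.78, 0.82]
print(absolute_jitter(periods), absolute_shimmer_db(amplitudes), shimmer_apq3(amplitudes))
```

Each function needs at least two periods (three for apq3); a real extractor would also have to handle unvoiced frames, which this sketch ignores.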


slide-9
SLIDE 9

Prosody

❑ Prosody is estimated by capturing the evolution over time of fundamental frequency, acoustic intensity and formant frequencies.

  • Pitch: the perceived fundamental frequency.
  • Intensity: the energy of a speech signal.
  • Formant frequencies: concentrations of acoustic energy around particular frequencies.

[Figure: plots of intensity, formant frequency and pitch for a speech signal.]


slide-11
SLIDE 11

Speaker Diarization Architecture

[Figure: block diagram of the speaker diarization system with three stages: initialization, speaker segmentation and speaker clustering. Block labels: speech reference, speech-only frames, MFCC extraction, JS extraction, prosody extraction, stack features, initialize clusters, GMM complexity, HMM training, MFCC score, JS and prosody score, score fusion, Viterbi segmentation, BIC computation, merge clusters (yes: merged clusters; no: final hypothesis).]

slide-12
SLIDE 12

Proposed Speaker Clustering Architecture

[Figure: block diagram of the proposed speaker clustering stage. Block labels: clusters, UBM, i-vector extraction from MFCC, i-vector extraction from JS and prosody, cosine score fusion, merge clusters (yes: merged clusters; no: final hypothesis).]

  • Cluster merging is based on a threshold value.
  • The i-vectors are extracted using the Alize toolkit.
  • Two gender-independent UBMs of 512 GMM components are trained using 100 AMI shows (not included in the testing set).
  • The T-matrices are trained on the same dataset.
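The threshold-based merging can be sketched as a simple agglomerative loop: score every pair of clusters, merge the best-scoring pair, and stop once the best score falls below the threshold. This is only an illustration under stated assumptions. The function names and the toy vectors are hypothetical, and averaging the two i-vectors after a merge is a stand-in for re-extracting an i-vector from the merged cluster's data, which the slides do not detail.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine score between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def merge_until_threshold(ivectors, threshold):
    """Merge the closest pair of clusters until the best score drops below threshold."""
    # Each cluster is (member indices, representative vector).
    clusters = [([idx], list(vec)) for idx, vec in enumerate(ivectors)]
    while len(clusters) > 1:
        best_pair, best_score = None, -math.inf
        for a, b in combinations(range(len(clusters)), 2):
            score = cosine(clusters[a][1], clusters[b][1])
            if score > best_score:
                best_pair, best_score = (a, b), score
        if best_score < threshold:
            break  # no pair is similar enough: this is the final hypothesis
        a, b = best_pair
        members = clusters[a][0] + clusters[b][0]
        # Assumption: average the vectors instead of re-extracting an i-vector.
        mean_vec = [(p + q) / 2.0 for p, q in zip(clusters[a][1], clusters[b][1])]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append((members, mean_vec))
    return [sorted(members) for members, _ in clusters]


# Toy 2-dimensional "i-vectors" and an arbitrary threshold.
print(merge_until_threshold([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], threshold=0.8))
```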
slide-13
SLIDE 13

Stopping Criterion Selection

[Figure: Diarization Error Rate (DER) and cosine-distance score per iteration for five selected shows from the development set; panels: show one to show five.]

slide-14
SLIDE 14

Outline

❑ Introduction
❑ Objectives
❑ Voice-quality and Prosodic Features
❑ Speaker Diarization Architecture
❑ Fusion Techniques
  • Segmentation
  • Clustering
❑ Experimental Setup and Results
❑ Conclusions


slide-15
SLIDE 15

Fusion Techniques: Segmentation

❑ The fusion of voice-quality features with the prosodic ones is carried out at the feature level.

❑ The fusion of short- and long-term speech features is carried out at the score-likelihood level for speaker segmentation as follows:

$$\log P(x, y \mid \theta_{ix}, \theta_{iy}) = \alpha \log P(x \mid \theta_{ix}) + (1 - \alpha) \log P(y \mid \theta_{iy})$$

  • log P(x, y) is the fused GMM score
  • θix is the model of the spectral features
  • θiy is the model of JS and prosody
  • α is the weight of MFCC (a tuned value)
  • 1 - α is the weight of JS and prosody
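The weighting described by these bullets (α on the score of the spectral model, 1 - α on the score of the JS and prosody model) can be sketched as follows. This is a minimal example that assumes diagonal-covariance GMMs; the function names, the tuple layout of a model and the value α = 0.9 are hypothetical and not taken from the slides.

```python
import math


def gmm_log_likelihood(frame, weights, means, variances):
    """Log-likelihood of one frame under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (f - m) ** 2 / v)
            for f, m, v in zip(frame, mu, var)
        )
        component_logs.append(math.log(w) + log_gauss)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def fused_log_likelihood(x, y, model_x, model_y, alpha):
    """alpha * log P(x | model_x) + (1 - alpha) * log P(y | model_y)."""
    ll_x = gmm_log_likelihood(x, *model_x)
    ll_y = gmm_log_likelihood(y, *model_y)
    return alpha * ll_x + (1.0 - alpha) * ll_y


# Toy one-component models: (weights, means, variances).
model_x = ([1.0], [[0.0, 0.0]], [[1.0, 1.0]])  # stands in for the MFCC model
model_y = ([1.0], [[0.0]], [[1.0]])            # stands in for the JS + prosody model
print(fused_log_likelihood([0.1, -0.2], [0.5], model_x, model_y, alpha=0.9))
```

Because the two streams have different dimensionality, their raw log-likelihoods live on different scales, which is one reason a tuned weight is used rather than a plain sum.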
slide-16
SLIDE 16

Fusion Techniques: Clustering

❑ The fusion of the cosine-distance scores of i-vectors from the short- and long-term speech features is carried out at the score level for speaker clustering as follows:

$$\text{Fused score} = \beta \cos(x_i, x_j) + (1 - \beta) \cos(y_i, y_j)$$

where cos(u, v) denotes the cosine-distance score between two i-vectors.

  • xi and xj are the i-vectors extracted from MFCC for clusters i and j
  • yi and yj are the i-vectors extracted from voice-quality and prosody for clusters i and j
  • β is the weight of the cosine distance of the i-vectors extracted from MFCC
  • (1 - β) is the weight of the cosine distance of the i-vectors extracted from voice-quality and prosody

[Figure: MFCC extraction feeds one i-vector extraction block; voice-quality extraction and prosody extraction are stacked and feed a second i-vector extraction block; both outputs go to cosine score fusion.]
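A minimal sketch of this score-level fusion, assuming the i-vectors for a pair of clusters are already available as plain lists of floats. The function names and the value β = 0.7 are hypothetical; the slides give the i-vector sizes (100 and 50) but not the weight.

```python
import math


def cosine_score(u, v):
    """Cosine score between two i-vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def fused_cosine_score(x_i, x_j, y_i, y_j, beta):
    """beta * cos(x_i, x_j) + (1 - beta) * cos(y_i, y_j).

    x_*: i-vectors from MFCC; y_*: i-vectors from voice-quality and prosody.
    """
    return beta * cosine_score(x_i, x_j) + (1.0 - beta) * cosine_score(y_i, y_j)


# Toy low-dimensional vectors for clusters i and j.
x_i, x_j = [0.2, 0.9, -0.1], [0.25, 0.8, 0.0]
y_i, y_j = [1.0, 0.1], [0.7, 0.6]
print(fused_cosine_score(x_i, x_j, y_i, y_j, beta=0.7))
```

The two i-vector streams may have different dimensions, since each cosine score is computed within its own stream before the weighted sum.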

slide-17
SLIDE 17

Outline

❑ Introduction
❑ Objectives
❑ Voice-quality and Prosodic Features
❑ Speaker Diarization Architecture
❑ Fusion Techniques
❑ Experimental Setup and Results
  • Experimental Setup
  • Experimental Results
❑ Conclusions


slide-18
SLIDE 18

Experimental setup

❑ The experiments have been developed and tested on the AMI corpus, a set of multi-party, spontaneous speech recordings.

❑ The number of speakers is in the range [3, 5].

❑ We have selected 10 shows from the AMI corpus as a development set.

❑ Two experimental scenarios have been defined for the test sets:

  • Single-site: 10 shows from the Idiap site (total duration = 307 minutes)
  • Multiple-site: 10 shows from the Idiap, Edinburgh and TNO sites (294 minutes)

❑ The i-vector sizes for the short- and long-term speech features are 100 and 50, respectively.

❑ Oracle SAD has been used for speech activity detection.


slide-19
SLIDE 19

Experimental Results

Table 1: DER using GMM based BIC and i-vector based cosine distance (CD) clustering technique

| Features             | Clustering technique | Development set DER (%) | Single-site scenario DER (%) | Multiple-site scenario DER (%) |
|----------------------|----------------------|-------------------------|------------------------------|--------------------------------|
| MFCC (Baseline)      | GMM/BIC              | 30.09                   | 15.87                        | 24.66                          |
| MFCC + JS + Prosody  | GMM/BIC              | 25.98                   | 15.02                        | 22.96                          |
| MFCC                 | i-Vector/CD          | 27.03                   | 15.01                        | 22.79                          |
| MFCC + JS + Prosody  | i-Vector/CD          | 25.42                   | 13.37                        | 20.06                          |

Conclusions

  • JS and Prosody improve the DER both in GMM based BIC and i-Vector based CD clustering techniques.
  • i-vector based CD clustering technique provides better results than GMM based BIC one.
slide-20
SLIDE 20

Experimental Results

Table 2: DER using GMM and i-vector based clustering techniques

| Features             | Clustering technique | Development set DER (%) | Single-site scenario DER (%) | Multiple-site scenario DER (%) |
|----------------------|----------------------|-------------------------|------------------------------|--------------------------------|
| MFCC                 | GMM/BIC              | 30.09                   | 15.87                        | 24.66                          |
| MFCC + JS + Prosody  | GMM/BIC              | 25.98                   | 15.02                        | 22.96                          |
| MFCC                 | i-Vector/CD          | 27.03                   | 15.01                        | 22.79                          |
| MFCC + JS + Prosody  | i-Vector/CD          | 25.42                   | 13.37                        | 20.06                          |
| MFCC                 | i-Vector/PLDA        | 25.96                   | 13.64                        | 19.77                          |
| MFCC + JS + Prosody  | i-Vector/PLDA        | 25.28                   | 13.07                        | 19.23                          |

Conclusions

  • The i-vector based PLDA clustering technique provides better DER results than the GMM based BIC and i-vector based CD clustering techniques.
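To put the gains in Table 2 in proportion, the short script below computes the relative DER reduction of each system with respect to the MFCC + GMM/BIC row. The DER values are copied from the table; the helper name and the dictionary layout are illustrative choices, and relative reduction is only one possible way to summarise the results.

```python
# DER (%) per system: (development set, single-site, multiple-site), from Table 2.
DER = {
    ("MFCC", "GMM/BIC"): (30.09, 15.87, 24.66),
    ("MFCC + JS + Prosody", "GMM/BIC"): (25.98, 15.02, 22.96),
    ("MFCC", "i-Vector/CD"): (27.03, 15.01, 22.79),
    ("MFCC + JS + Prosody", "i-Vector/CD"): (25.42, 13.37, 20.06),
    ("MFCC", "i-Vector/PLDA"): (25.96, 13.64, 19.77),
    ("MFCC + JS + Prosody", "i-Vector/PLDA"): (25.28, 13.07, 19.23),
}
SETS = ("development", "single-site", "multiple-site")
BASELINE = ("MFCC", "GMM/BIC")


def relative_reduction(baseline, system):
    """Relative DER reduction in percent: 100 * (baseline - system) / baseline."""
    return 100.0 * (baseline - system) / baseline


for system, ders in DER.items():
    if system == BASELINE:
        continue
    gains = ", ".join(
        f"{name}: {relative_reduction(base, der):.2f}%"
        for name, base, der in zip(SETS, DER[BASELINE], ders)
    )
    print(f"{system[0]} with {system[1]} -> {gains}")
```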

slide-21
SLIDE 21

Boxplot of Single- and Multiple-site scenarios

[Figure: boxplots for the single- and multiple-site scenarios; x-axis: clustering technique (feature set).]

slide-22
SLIDE 22

Conclusions

❑ We have proposed the fusion of cosine distances of i-vectors extracted from short- and long-term speech features for the speaker clustering task.

❑ Experimental results show that the i-vector clustering technique provides better DER than the GMM clustering one.

❑ Experimental results also show that the i-vector clustering technique using short- and long-term features provides better DER than the same clustering technique using only short-term spectral features.

❑ The results of our work show the usefulness of the i-vector clustering technique based on short- and long-term speech features within the framework of speaker diarization.


slide-23
SLIDE 23


Q & A

Thank you!