SLIDE 1

Short- and Long-Term Speech Features for Hybrid HMM-i-Vector based Speaker Diarization System

Abraham Woubie 1 , Jordi Luque 2 , and Javier Hernando 1

1 TALP Research Center, Dept. of Signal Theory and Communications,

Universitat Politècnica de Catalunya, Barcelona, Spain

2 Telefonica Research, Edificio Telefonica-Diagonal, Barcelona, Spain

June 24, 2016

Odyssey 2016, Bilbao, Spain

SLIDE 2

Outline

❑ Introduction
❑ Objectives
❑ Voice-quality and Prosodic Features
❑ Speaker Diarization Architecture
❑ Fusion Techniques
❑ Experimental Setup and Results
❑ Conclusions

SLIDE 3

Introduction

❑ Speaker diarization = speaker segmentation + speaker clustering

❑ Motivation

MFCC and GMM are the most widely used short-term speech features and speaker clustering techniques in speaker diarization, respectively.

Jitter and shimmer voice-quality measurements (JS) and prosodic features have been successfully used together with MFCC in GMM based speaker diarization.

We propose the fusion of scores of i-vectors extracted from MFCC and long-term speech features for the speaker clustering task.

[Figure: a recording divided into Speaker 1, Speaker 2 and Speaker 3 turns, illustrating speaker segmentation and speaker clustering]

SLIDE 5

Objectives

❑ Feature selection of long-term voice-quality and prosodic features.

The voice-quality features are Absolute Jitter, Absolute Shimmer and Shimmer apq3.

The prosodic ones are pitch, intensity and the first four formant frequencies.

❑ Stacking these voice-quality and prosodic features in the same feature vector.

❑ Extraction of i-vectors from short-term spectral and these long-term feature sets.

❑ Fusion of scores of i-vectors extracted from these features for speaker clustering.
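A minimal sketch of the stacking objective, assuming the three voice-quality values and the six prosodic values are already available for each analysis frame. The function name and the toy numbers are hypothetical and are not taken from the slides.

```python
from typing import List, Sequence


def stack_long_term_features(
    voice_quality: Sequence[Sequence[float]],
    prosody: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate per-frame voice-quality and prosodic features."""
    if len(voice_quality) != len(prosody):
        raise ValueError("both streams must have the same number of frames")
    return [list(vq) + list(pr) for vq, pr in zip(voice_quality, prosody)]


# Toy values for two frames (made up for illustration only).
# Voice quality: absolute jitter, absolute shimmer, shimmer apq3.
vq = [[0.0010, 0.45, 0.020], [0.0012, 0.50, 0.030]]
# Prosody: pitch (Hz), intensity (dB), formants F1..F4 (Hz).
pr = [
    [120.0, 65.0, 500.0, 1500.0, 2500.0, 3500.0],
    [118.0, 66.0, 520.0, 1480.0, 2550.0, 3450.0],
]

stacked = stack_long_term_features(vq, pr)
print(len(stacked), "frames of dimension", len(stacked[0]))
```

With the feature sets listed above, each stacked frame has 3 + 6 = 9 dimensions.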

SLIDE 7

Speech Features

❑ Mel Frequency Cepstral Coefficients (MFCC): They are the most widely used short-term features in speaker diarization.

❑ Voice-quality features: They measure variations of the fundamental frequency and amplitude of a speaker's voice.

We have extracted Absolute Jitter, Absolute Shimmer and Shimmer apq3.

❑ Prosodic features: They are estimated by capturing the evolution in time of fundamental frequency, acoustic intensity and formant frequencies.

We have extracted pitch, intensity and the first four formant frequencies.

SLIDE 8

Voice-quality

❑ Jitter (absolute): It is the average absolute difference between consecutive periods.

❑ Shimmer (absolute): It is the average absolute logarithm of the ratio between amplitudes of consecutive periods.

❑ Shimmer (apq3): It is the three-point Amplitude Perturbation Quotient.
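The three voice-quality definitions can be sketched as follows, assuming the pitch periods and per-period peak amplitudes have already been estimated. Two conventions here are assumptions and are not stated in the slides: the 20·log10 scaling that expresses absolute shimmer in dB, and an apq3 local average that includes the period itself and its two neighbours. The function names are hypothetical.

```python
import math
from typing import Sequence


def absolute_jitter(periods: Sequence[float]) -> float:
    """Mean absolute difference between consecutive pitch periods."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return sum(diffs) / len(diffs)


def absolute_shimmer_db(amplitudes: Sequence[float]) -> float:
    """Mean absolute log-ratio of consecutive amplitudes (assumed dB scaling)."""
    ratios = [
        abs(20.0 * math.log10(b / a))
        for a, b in zip(amplitudes, amplitudes[1:])
    ]
    return sum(ratios) / len(ratios)


def shimmer_apq3(amplitudes: Sequence[float]) -> float:
    """Three-point amplitude perturbation quotient (assumed convention)."""
    a = list(amplitudes)
    deviations = [
        abs(a[i] - (a[i - 1] + a[i] + a[i + 1]) / 3.0)
        for i in range(1, len(a) - 1)
    ]
    mean_deviation = sum(deviations) / len(deviations)
    mean_amplitude = sum(a) / len(a)
    return mean_deviation / mean_amplitude


# Made-up measurements for four consecutive glottal periods.
periods = [0.0100, 0.0102, 0.0099, 0.0101]   # seconds
amplitudes = [0.80, 0.82, 0.79, 0.81]        # linear peak amplitude

print(absolute_jitter(periods))
print(absolute_shimmer_db(amplitudes))
print(shimmer_apq3(amplitudes))
```

A perfectly periodic signal with constant amplitude gives zero for all three measures, which is a quick sanity check for any implementation.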

SLIDE 9

Prosody

❑ Prosody is estimated by capturing the evolution in time of fundamental frequency, acoustic intensity and formant frequencies.

Pitch: It is the perceived fundamental frequency.
Intensity: It is the energy of a speech signal.
Formant frequencies: They are concentrations of acoustic energy around particular frequencies.

[Figure: intensity, formant frequency and pitch contours of a speech signal; frequency axis from 0 Hz to 5000 Hz, intensity axes in dB]

SLIDE 11

Speaker Diarization Architecture

[Figure: block diagram of the speaker diarization architecture, with three stages: initialization, speaker segmentation and speaker clustering. Blocks: speech reference, speech-only frames, MFCC extraction, JS extraction, prosody extraction, stack features, initialize clusters, GMM complexity, HMM training, MFCC score, JS and Prosody score, score fusion, Viterbi segmentation, BIC computation, merge clusters (yes/no decision), merged clusters, final hypothesis]

SLIDE 12

Proposed Speaker Clustering Architecture

[Figure: block diagram of the proposed speaker clustering stage. Blocks: clusters, UBM, i-vector extraction from MFCC, i-vector extraction from JS and Prosody, cosine score fusion, merge clusters (yes/no decision), merged clusters, final hypothesis]

Cluster merging is based on a threshold value.
The i-vectors are extracted using the ALIZE toolkit.
Two gender-independent UBMs of 512 GMM components are trained using 100 AMI shows (not included in the testing set).
The T-matrices are trained on the same dataset.
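A minimal sketch of threshold-based cluster merging, under several assumptions that the slides do not confirm: the score is a cosine similarity (higher means more similar), the closest pair is merged while its score stays at or above the threshold, and the merged cluster's i-vector is approximated by averaging. A real system would presumably re-estimate the i-vector from the merged cluster's data. All names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def merge_until_threshold(
    ivectors: Sequence[Sequence[float]], threshold: float
) -> List[List[float]]:
    """Greedily merge the most similar pair until no pair reaches threshold."""
    clusters = [list(v) for v in ivectors]
    while len(clusters) > 1:
        best_score, best_pair = -2.0, (0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = cosine_score(clusters[i], clusters[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break  # stopping criterion: closest pair is not similar enough
        i, j = best_pair
        # Placeholder for re-extraction: average the two cluster i-vectors.
        merged = [(x + y) / 2.0 for x, y in zip(clusters[i], clusters[j])]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Toy two-dimensional "i-vectors" (made up for illustration).
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(len(merge_until_threshold(toy, threshold=0.8)))
```

In this toy case the first two vectors are merged and the third stays separate, because its score against the merged cluster falls below the threshold.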

SLIDE 13

Stopping Criterion Selection

[Figure: Diarization Error Rate (DER) and cosine-distance score per iteration for five selected shows from the development set; panels: Show one, Show two, Show three, Show four, Show five]

SLIDE 14

Outline

❑ Introduction
❑ Objectives
❑ Voice-quality and Prosodic Features
❑ Speaker Diarization Architecture
❑ Fusion Techniques

Segmentation
Clustering

❑ Experimental Setup and Results
❑ Conclusions

SLIDE 15

Fusion Techniques: Segmentation

❑ The fusion of voice-quality features with the prosodic ones is carried out at the feature level.

❑ The fusion of short- and long-term speech features is carried out at the score likelihood level as follows for speaker segmentation:

$$\log P(x, y \mid \theta_{ix}, \theta_{iy}) = \alpha \log P(x \mid \theta_{ix}) + (1 - \alpha) \log P(y \mid \theta_{iy})$$

log P(x, y | θix, θiy) is the fused GMM score
θix is the model of the spectral features
θiy is the model of JS and Prosody
α is the weight of MFCC (a tuned value)
1 − α is the weight of JS and Prosody
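A minimal sketch of the weighted log-likelihood fusion, assuming the fused score is α times the MFCC log-likelihood plus (1 − α) times the JS-and-prosody log-likelihood, as the weight definitions above suggest. Picking the best cluster for a single frame is a simplification: the system in the slides uses Viterbi segmentation. The function names and numbers are hypothetical.

```python
from typing import Sequence


def fused_log_likelihood(
    ll_mfcc: float, ll_long_term: float, alpha: float
) -> float:
    """Weighted sum of the two per-stream log-likelihoods."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * ll_mfcc + (1.0 - alpha) * ll_long_term


def best_cluster(
    ll_mfcc: Sequence[float], ll_long_term: Sequence[float], alpha: float
) -> int:
    """Index of the cluster model with the highest fused log-likelihood."""
    fused = [
        fused_log_likelihood(a, b, alpha)
        for a, b in zip(ll_mfcc, ll_long_term)
    ]
    return max(range(len(fused)), key=fused.__getitem__)


# Made-up log-likelihoods of one frame under two cluster models.
ll_mfcc = [-42.0, -40.0]        # spectral (MFCC) stream
ll_long_term = [-10.0, -30.0]   # JS and prosody stream

print(best_cluster(ll_mfcc, ll_long_term, alpha=0.9))
print(best_cluster(ll_mfcc, ll_long_term, alpha=1.0))
```

With these toy numbers the long-term stream changes the decision: MFCC alone (α = 1) prefers the second cluster, while the fused score with α = 0.9 prefers the first.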

SLIDE 16

Fusion Techniques: Clustering

❑ The fusion of cosine distance scores of i-vectors from the short- and long-term speech features is carried out at the score level for speaker clustering as follows:

$$\text{Fused score}(i, j) = \beta \cos(x_i, x_j) + (1 - \beta) \cos(y_i, y_j), \qquad \cos(a, b) = \frac{a^\top b}{\lVert a \rVert \, \lVert b \rVert}$$

xi and xj are the i-vectors extracted from MFCC for clusters i and j
yi and yj are the i-vectors extracted from voice-quality and Prosody for clusters i and j
β is the weight of the cosine distance of i-vectors extracted from MFCC
(1 − β) is the weight of the cosine distance of i-vectors extracted from voice-quality and Prosody

[Figure: block diagram with MFCC extraction, voice-quality extraction, Prosody extraction, stack features, two i-vector extraction blocks and cosine score fusion]
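A minimal sketch of the score-level fusion for clustering, assuming the fused score is β times the cosine score of the MFCC i-vectors plus (1 − β) times the cosine score of the voice-quality and prosody i-vectors, in line with the weight definitions above. The function names, the toy vectors and the value of β are hypothetical.

```python
import math
from typing import Sequence


def cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two i-vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fused_cosine_score(
    x_i: Sequence[float],
    x_j: Sequence[float],
    y_i: Sequence[float],
    y_j: Sequence[float],
    beta: float,
) -> float:
    """Weighted sum of the short-term and long-term cosine scores."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    short_term = cosine_score(x_i, x_j)   # MFCC i-vectors
    long_term = cosine_score(y_i, y_j)    # voice-quality and prosody i-vectors
    return beta * short_term + (1.0 - beta) * long_term


# Toy low-dimensional vectors; the two streams may have different sizes.
x_i, x_j = [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]
y_i, y_j = [1.0, 0.0], [0.0, 1.0]

print(fused_cosine_score(x_i, x_j, y_i, y_j, beta=0.75))
```

Because each cosine score is computed within its own stream, the two i-vector types do not need to share a dimension.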

SLIDE 17

Outline

❑ Introduction
❑ Objectives
❑ Voice-quality and Prosodic Features
❑ Speaker Diarization Architecture
❑ Fusion Techniques
❑ Experimental Setup and Results

Experimental Setup
Experimental Results

❑ Conclusions

SLIDE 18

Experimental setup

❑ The experiments have been developed and tested on the AMI corpus, a set of multi-party, spontaneous speech recordings.

❑ The number of speakers is in the range [3, 5].

❑ We have selected 10 shows from the AMI corpus as a development set.

❑ Two experimental scenarios have been defined for the test sets:

Single-site: 10 shows from the Idiap site (total duration = 307 minutes)
Multiple-site: 10 shows from the Idiap, Edinburgh and TNO sites (294 minutes)

❑ The i-vector sizes for the short- and long-term speech features are 100 and 50, respectively.

❑ Oracle SAD has been used for speech activity detection.

SLIDE 19

Experimental Results

Table 1: DER using GMM based BIC and i-vector based cosine distance (CD) clustering techniques

| Features | Clustering technique | Development set DER (%) | Single-site scenario DER (%) | Multiple-site scenario DER (%) |
| --- | --- | --- | --- | --- |
| MFCC (Baseline) | GMM/BIC | 30.09 | 15.87 | 24.66 |
| MFCC + JS + Prosody | GMM/BIC | 25.98 | 15.02 | 22.96 |
| MFCC | i-Vector/CD | 27.03 | 15.01 | 22.79 |
| MFCC + JS + Prosody | i-Vector/CD | 25.42 | 13.37 | 20.06 |

Conclusions

JS and Prosody improve the DER in both the GMM based BIC and the i-vector based CD clustering techniques.

The i-vector based CD clustering technique provides better results than the GMM based BIC one.

SLIDE 20

Experimental Results

Table 2: DER using GMM and i-vector based clustering techniques

| Features | Clustering technique | Development set DER (%) | Single-site scenario DER (%) | Multiple-site scenario DER (%) |
| --- | --- | --- | --- | --- |
| MFCC | GMM/BIC | 30.09 | 15.87 | 24.66 |
| MFCC + JS + Prosody | GMM/BIC | 25.98 | 15.02 | 22.96 |
| MFCC | i-Vector/CD | 27.03 | 15.01 | 22.79 |
| MFCC + JS + Prosody | i-Vector/CD | 25.42 | 13.37 | 20.06 |
| MFCC | i-Vector/PLDA | 25.96 | 13.64 | 19.77 |
| MFCC + JS + Prosody | i-Vector/PLDA | 25.28 | 13.07 | 19.23 |

Conclusions

The i-vector based PLDA clustering technique provides better DER than the GMM based BIC and i-vector based CD clustering techniques.
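As a worked example of how to read these DER figures, the sketch below computes the relative DER reduction of the MFCC + JS + Prosody i-Vector/PLDA system over the MFCC GMM/BIC baseline. The DER values are copied from the reported results; the relative-reduction measure and the function name are additions for illustration and do not appear in the slides.

```python
def relative_der_reduction(baseline_der: float, system_der: float) -> float:
    """Relative DER reduction over the baseline, in percent."""
    return 100.0 * (baseline_der - system_der) / baseline_der


# DER (%) of the MFCC GMM/BIC baseline, as reported above.
baseline = {"development": 30.09, "single-site": 15.87, "multiple-site": 24.66}
# DER (%) of MFCC + JS + Prosody with i-Vector/PLDA, as reported above.
best = {"development": 25.28, "single-site": 13.07, "multiple-site": 19.23}

for scenario in baseline:
    reduction = relative_der_reduction(baseline[scenario], best[scenario])
    print(f"{scenario}: {reduction:.2f}% relative DER reduction")
```

The reductions come out at roughly 16%, 18% and 22% for the development, single-site and multiple-site sets.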

SLIDE 21

Boxplot of Single- and Multiple-site scenarios

[Figure: boxplots for the single- and multiple-site scenarios; x-axis: clustering technique (feature set)]

SLIDE 22

Conclusions

❑ We have proposed the fusion of cosine distances of i-vectors extracted from short- and long-term speech features for the speaker clustering task.

❑ Experimental results show that the i-vector clustering technique provides better DER than the GMM clustering one.

❑ Experimental results also show that the i-vector clustering technique using short- and long-term features provides better DER than the same clustering technique using only short-term spectral features.

❑ The results of our work show the usefulness of the i-vector clustering technique based on short- and long-term speech features within the framework of speaker diarization.

SLIDE 23

Q & A Thank You !!!