On autoencoders in the i-vector space for speaker recognition (PowerPoint presentation)



SLIDE 1

On autoencoders in the i-vector space for speaker recognition

Timur Pekhovsky Sergey Novoselov Aleksey Sholokhov Oleg Kudashev

Speech Technology Center Ltd., Russia

This work was financially supported by the Ministry of Education and Science of the Russian Federation (14.578.21.0126 (RFMEFI57815X0126)

SLIDE 2

OUTLINE

 Motivation and goals
 Detailed study of the DAE system
   Datasets and experimental setup
   Front-End and i-vector extractor
   DAE system description & DAE training procedure
   Back-End and scoring. Replacing back-end
   Analysis of the DAE system performance
 An improved DAE system
   Dropout regularization
   Deep architectures
 DAE system in the domain mismatch scenario
   Dataset. Back-Ends
   Results
 Conclusions

SLIDE 3

Motivation and goals

The denoising autoencoder (DAE) based speaker verification system achieved a performance improvement over the commonly used baseline (i.e. PLDA on raw i-vectors) [1]. This motivated a detailed investigation:

  • to study the properties of the DAE in the i-vector space
  • to analyze different strategies for initialization and training of the back-end parameters
  • to investigate dropout regularization
  • to explore different deep DAE architectures
  • to investigate the DAE based system under domain mismatch conditions

[1] Sergey Novoselov, Timur Pekhovsky, Oleg Kudashev, Valentin Mendelev, and Alexey Prudnikov, “Non-linear PLDA for i-vector speaker verification,” in INTERSPEECH 2015, Dresden, Germany, September 6-10, 2015, pp. 214–218.

SLIDE 4

Detailed study of the DAE system

SLIDE 5

Datasets and experimental setup

Training data:

  • telephone channel recordings from the NIST SRE 1998-2008 corpora
  • 16618 sessions of 1763 male speakers (English language only)

Evaluation data:

  • the NIST 2010 SRE protocol (condition 5 extended, males, English language)

Operating points:

  • equal error rate (EER)
  • minimum detection cost function (minDCF 2010)
SLIDE 6

Front-End and i-vector extractor

  • 20 MFCCs (including C0) with their first- and second-order derivatives (Kaldi version)
  • DNN-based posterior extraction with 11-frame splicing at the DNN input
  • DNN with 2700 triphone states and 20 non-speech states (trained on the Switchboard corpus using Kaldi)
  • “SoftVAD” solution using DNN outputs:

$\tilde{G}_d = \dfrac{G_d - n\,O_d}{\tau}, \quad d \in J_{us}$, where $n = \dfrac{\sum_{d \in J_{us}} G_d}{\sum_{d \in J_{us}} O_d}$ and $\tau^2 = \dfrac{\sum_{d \in J_{us}} T_d}{\sum_{d \in J_{us}} O_d} - n^2$

$J_{us}$ — DNN output indexes corresponding to triphone states; $O_d$, $G_d$, $T_d$ are the 0th-, 1st- and 2nd-order statistics.

  • 400-dimensional i-vectors
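A minimal numpy sketch of the statistics normalization above. The random stand-in statistics and the scalar-per-output simplification are assumptions for illustration; only the normalization itself follows the formula on this slide.

```python
import numpy as np

def softvad_normalize(G, O, T, speech_idx):
    """Normalize 1st-order stats G over the speech (triphone) outputs using
    the weighted mean n and std tau derived from the 0th/2nd-order stats."""
    Gs, Os, Ts = G[speech_idx], O[speech_idx], T[speech_idx]
    n = Gs.sum() / Os.sum()                       # weighted mean
    tau = np.sqrt(Ts.sum() / Os.sum() - n ** 2)   # weighted std
    return (Gs - n * Os) / tau

rng = np.random.default_rng(0)
n_out = 2720                                  # 2700 triphone + 20 non-speech states
O = rng.uniform(0.1, 1.0, size=n_out)         # 0th-order stats (stand-ins)
G = rng.normal(size=n_out) * O                # 1st-order stats (stand-ins)
T = (rng.normal(size=n_out) ** 2 + 1.0) * O   # 2nd-order stats (stand-ins)
speech_idx = np.arange(2700)                  # indexes J_us of triphone states
G_norm = softvad_normalize(G, O, T, speech_idx)
```

By construction the normalized statistics sum to zero over the speech outputs, which is a quick sanity check on the implementation.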
SLIDE 7

DAE system description & DAE training procedure

Learning the denoising transform:

  • $i(s, h)$ is the i-vector representing the $h$-th session of the $s$-th speaker
  • $\bar{i}(s)$ is the mean i-vector for speaker $s$
  • RBM parameters are used to initialize the denoising neural network
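The denoising targets can be illustrated with a toy numpy sketch: every session i-vector is paired with its speaker's mean i-vector, and a single linear layer (standing in for the RBM-initialized network, a simplifying assumption) is trained to map one to the other.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_spk, n_sess = 400, 20, 5
labels = np.repeat(np.arange(n_spk), n_sess)
X = rng.normal(size=(n_spk * n_sess, dim))        # session i-vectors i(s, h)

# Denoising targets: the speaker-mean i-vector for every session of speaker s
spk_mean = np.stack([X[labels == s].mean(axis=0) for s in range(n_spk)])
Y = spk_mean[labels]

# One linear layer trained on MSE; in the real system RBM parameters would
# initialize the network (random init here, an assumption for brevity).
W = 0.01 * rng.normal(size=(dim, dim))
loss0 = float(((X @ W.T - Y) ** 2).mean())
for _ in range(200):
    err = X @ W.T - Y                             # prediction error
    W -= 0.01 * (err.T @ X) / len(X)              # gradient step on the MSE
loss = float(((X @ W.T - Y) ** 2).mean())
```

The loss drops substantially over the 200 gradient steps, which is all this sketch is meant to demonstrate; the actual system uses a nonlinear network with RBM pretraining.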
SLIDE 8

DAE system description & DAE training procedure

Block diagram of speaker recognition systems compared in our experiments

SLIDE 9

Back-End and scoring

Two-covariance model:

$\mathrm{Score} = \mathbf{i}_1^T Q\, \mathbf{i}_1 + \mathbf{i}_2^T Q\, \mathbf{i}_2 + 2\, \mathbf{i}_1^T P\, \mathbf{i}_2$   (1)

$\Sigma_B = \dfrac{1}{S} \sum_{s=1}^{S} \bar{\mathbf{i}}_s \bar{\mathbf{i}}_s^T$   (2)

$\Sigma_W = \dfrac{1}{\sum_s H_s} \sum_{s=1}^{S} \sum_{h=1}^{H_s} (\mathbf{i}_{s,h} - \bar{\mathbf{i}}_s)(\mathbf{i}_{s,h} - \bar{\mathbf{i}}_s)^T$   (3)

where the square matrices $Q$ and $P$ can be expressed in terms of (2) and (3).
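A numpy sketch of this back-end on synthetic stand-in i-vectors. The closed-form expressions for Q and P below follow the standard two-covariance scoring derivation and are our assumption; the slides do not spell them out.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, n_spk, n_sess = 4, 30, 10
labels = np.repeat(np.arange(n_spk), n_sess)
spk = rng.normal(scale=3.0, size=(n_spk, dim))          # true speaker offsets
X = spk[labels] + rng.normal(scale=0.5, size=(n_spk * n_sess, dim))
X = X - X.mean(axis=0)                                  # center the i-vectors

# Between- and within-speaker covariance estimates, as in (2) and (3)
means = np.stack([X[labels == s].mean(axis=0) for s in range(n_spk)])
Sigma_B = means.T @ means / n_spk
resid = X - means[labels]
Sigma_W = resid.T @ resid / len(X)

# Q and P for score (1): standard two-covariance closed form (an assumption)
tot = Sigma_B + Sigma_W
tot_inv = np.linalg.inv(tot)
inner = np.linalg.inv(tot - Sigma_B @ tot_inv @ Sigma_B)
Q = tot_inv - inner
P = tot_inv @ Sigma_B @ inner

def score(i1, i2):
    """Verification score (1): large for same-speaker trials."""
    return float(i1 @ Q @ i1 + i2 @ Q @ i2 + 2.0 * i1 @ P @ i2)

same = np.mean([score(X[s * n_sess], X[s * n_sess + 1]) for s in range(n_spk)])
diff = np.mean([score(X[s * n_sess], X[((s + 1) % n_spk) * n_sess])
                for s in range(n_spk)])
```

On this synthetic data, same-speaker trials score higher on average than impostor trials, and P is symmetric, so the score does not depend on the trial order.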

SLIDE 10

Back-End and scoring. Replacing back-end

SLIDE 11

Analysis of the DAE system performance

Table 1: NIST SRE 2010 test
System   | EER(%) | minDCF
Baseline | 1.67   | 0.347
RBM      | 1.55   | 0.332
DAE      | 1.43   | 0.284

Table 2: “Rus-Telecom”* test
System   | EER(%) | minDCF
Baseline | 1.63   | 0.64
RBM      | 1.65   | 0.63
DAE      | 1.43   | 0.55

* Rus-Telecom is a Russian-language corpus of telephone recordings. The training set consists of 6508 male speakers and 33678 speech cuts; the evaluation part consists of 235 male speakers and 4210 speech cuts. The evaluation protocol (single-session enrollments) contains 37184 target trials and 111660 impostor trials.

SLIDE 12

Analysis of the DAE system performance

Assessing the denoising transform with the class-separability criterion:

$J = \mathrm{Tr}(\Sigma_W^{-1} \Sigma_B)$

where $\Sigma_W$ and $\Sigma_B$ are the within-speaker and between-speaker covariance matrices.

Table 3: NIST SRE 2010 test. Cosine scoring
System   | EER(%) | minDCF | J
Baseline | 5.34   | 0.603  | 501.45
RBM      | 5.27   | 0.611  | 525.65
DAE      | 3.19   | 0.427  | 537.76
AE       | 5.42   | 0.583  | 494.13

Figure 1: Eigenvalues of the matrix G.

No normalization was applied to the outputs of RBM and DAE!
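The criterion is easy to compute directly. In the sketch below, the synthetic i-vectors are stand-ins, generated with a tighter and a looser within-speaker spread to show that J tracks separability.

```python
import numpy as np

def class_separability(X, labels):
    """J = Tr(Sigma_W^{-1} Sigma_B); larger J means better speaker separation."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sigma_B, Sigma_W = np.zeros((d, d)), np.zeros((d, d))
    for s in np.unique(labels):
        Xs = X[labels == s]
        m = (Xs.mean(axis=0) - mu)[:, None]
        Sigma_B += len(Xs) * (m @ m.T)        # between-speaker scatter
        C = Xs - Xs.mean(axis=0)
        Sigma_W += C.T @ C                    # within-speaker scatter
    Sigma_B /= len(X)
    Sigma_W /= len(X)
    return float(np.trace(np.linalg.inv(Sigma_W) @ Sigma_B))

rng = np.random.default_rng(3)
means = rng.normal(scale=2.0, size=(10, 6))
labels = np.repeat(np.arange(10), 20)
tight = means[labels] + rng.normal(scale=0.3, size=(200, 6))  # small within-spread
loose = means[labels] + rng.normal(scale=1.5, size=(200, 6))  # large within-spread
J_tight = class_separability(tight, labels)
J_loose = class_separability(loose, labels)
```

As expected, the tighter clusters produce a much larger J, mirroring how the criterion ranks the systems in Table 3.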

SLIDE 13

Analysis of the DAE system performance

Effect of normalization:

$J = \mathrm{Tr}(\Sigma_W^{-1} \Sigma_B)$

Whitening & LN were applied to the outputs of RBM and DAE!

Table 4: NIST SRE 2010 test. Cosine scoring
System   | EER(%) | minDCF | J
Baseline | 5.34   | 0.603  | 501.45
RBM      | 4.96   | 0.565  | 525.35
DAE      | 4.95   | 0.558  | 533.37

Figure 2: Eigenvalues of the matrix G.
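Whitening followed by length normalization (LN) can be sketched as follows. ZCA-style whitening from an eigendecomposition is one common choice and an assumption here; the slides do not specify how the whitening parameters B and ν are estimated.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 8)) * np.arange(1.0, 9.0)  # unequal variances
X[:, 0] += 0.8 * X[:, 1]                             # add correlation

# Whitening parameters (nu, B) estimated on a training set
nu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
B = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T     # ZCA whitening matrix

Xw = (X - nu) @ B.T                                  # whitened vectors
Xn = Xw / np.linalg.norm(Xw, axis=1, keepdims=True)  # length normalization
```

After the transform the training-set covariance is the identity and every vector lies on the unit sphere, which is exactly the input regime cosine scoring and Gaussian PLDA assume.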

SLIDE 14

Analysis of the DAE system performance

Effect of replacing the whitening parameters:

Whitening & LN were applied to the outputs of RBM and DAE; the whitening parameters of the DAE system are replaced by the RBM ones.

Table 5: NIST SRE 2010 test. Cosine scoring
System   | EER(%) | minDCF | J
Baseline | 5.34   | 0.603  | 501.45
RBM      | 4.96   | 0.565  | 525.35
DAE      | 2.83   | 0.393  | 537.32

Figure 3: Eigenvalues of the matrix G.

SLIDE 15

Analysis of the DAE system performance

Effect of replacing the whitening parameters:

Table 6: NIST SRE 2010 test. Cosine scoring
System   | Whitening: B, ν | EER(%) | minDCF | J
Baseline | raw | 5.34 | 0.603 | 501.45
RBM      | no  | 5.27 | 0.611 | 525.65
DAE      | no  | 3.19 | 0.427 | 537.76
RBM      | RBM | 4.96 | 0.565 | 525.35
DAE      | DAE | 4.95 | 0.558 | 533.37
DAE      | RBM | 2.83 | 0.393 | 537.32

SLIDE 16

Analysis of the DAE system performance

Effect of replacing the back-end parameters:

Table 7: Performance comparison for different configurations of the DAE system. NIST SRE 2010 test
System   | PLDA: {Q, P} | Whitening: B/ν | EER(%) | minDCF
Baseline | raw | raw/raw | 1.67 | 0.347
RBM      | RBM | RBM/RBM | 1.55 | 0.332
DAE      | DAE | DAE/DAE | 1.58 | 0.336
DAE      | DAE | DAE/RBM | 1.55 | 0.338
DAE      | RBM | DAE/DAE | 1.56 | 0.330
DAE      | DAE | RBM/DAE | 1.43 | 0.291
DAE      | DAE | RBM/RBM | 1.44 | 0.287
DAE      | RBM | RBM/RBM | 1.43 | 0.284

SLIDE 17

An improved DAE system

SLIDE 18

Dropout regularization

Dropout for RBM training:

Table 8: Effect of dropout for RBM training. RBM is used to initialize the DAE. NIST SRE 2010 test
System      | EER(%) | minDCF
DAE         | 1.43   | 0.284
DAE+dropout | 1.41   | 0.270

Applying dropout at the stage of discriminative fine-tuning was not helpful!
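As an illustration, inverted dropout (one common variant; the slides do not specify the exact scheme used) masks hidden units with keep probability p and rescales the survivors.

```python
import numpy as np

def dropout(h, p_keep, rng):
    """Zero each unit with probability 1 - p_keep; rescale survivors so the
    expected activation is unchanged (inverted dropout)."""
    mask = rng.random(h.shape) < p_keep
    return h * mask / p_keep

rng = np.random.default_rng(5)
hidden = np.ones((10000, 64))                  # stand-in hidden activations
dropped = dropout(hidden, p_keep=0.8, rng=rng)
keep_rate = float((dropped != 0).mean())       # fraction of surviving units
mean_act = float(dropped.mean())               # stays close to the original 1.0
```

The rescaling is what lets the same network be used without any mask at test time.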

SLIDE 19

An improved DAE system

Deep denoising autoencoders: stacking RBMs

Table 9: NIST SRE 2010 test. PLDA scoring
System   | EER(%) | minDCF
Baseline | 1.67   | 0.347
DAE      | 1.43   | 0.284
DAE5     | 1.43   | 0.297

SLIDE 20

An improved DAE system

Deep denoising autoencoders: stacking DAEs

Table 10: NIST SRE 2010 test. PLDA scoring
System   | EER(%) | minDCF
Baseline | 1.67   | 0.347
RBM1     | 1.55   | 0.332
DAE1     | 1.43   | 0.284
RBM2     | 1.58   | 0.329
DAE2     | 1.30   | 0.282

SLIDE 21

DAE system in the domain mismatch scenario

SLIDE 22

Domain Adaptation Challenge

DAC setup:

  • GMM-UBM based i-vector extractor (600-dimensional i-vectors)
  • in-domain SRE set (SRE 04, 05, 06, and 08)
  • out-of-domain Switchboard set

Evaluation data:

  • the NIST 2010 SRE protocol (condition 5 extended, males, English language)

Operating points:

  • equal error rate (EER)
  • minimum detection cost function (minDCF 2010)
SLIDE 23

Back-Ends

The results are presented for the following scoring types:

  • cosine scoring
  • two-covariance model (referred to as PLDA)
  • simplified PLDA with a 400-dimensional speaker subspace (referred to as SPLDA)

In our experiments we ignore the labels of the in-domain data; the in-domain SRE set is used only to estimate the whitening parameters of our systems.
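Of these back-ends, cosine scoring is the simplest; a minimal sketch:

```python
import numpy as np

def cosine_score(i1, i2):
    """Cosine similarity between two i-vectors; after whitening and length
    normalization this reduces to a dot product of unit vectors."""
    return float(i1 @ i2 / (np.linalg.norm(i1) * np.linalg.norm(i2)))

s_same = cosine_score(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
s_orth = cosine_score(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
```

Colinear vectors score 1 and orthogonal vectors score 0, so a single threshold on the score separates target and impostor trials.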

SLIDE 24

Results

Table 11: Performance summary of speaker verification systems with PLDA and cosine back-ends
System   | Whitening/Training | Cos EER(%) | Cos minDCF | PLDA EER(%) | PLDA minDCF
Baseline | SRE/SRE | 5.45 | 0.621 | 2.18 | 0.360
RBM      | SRE/SRE | 5.47 | 0.634 | 2.16 | 0.348
DAE      | SRE/SRE | 3.67 | 0.467 | 1.67 | 0.307
Baseline | SWB/SWB | 9.13 | 0.788 | 6.45 | 0.660
RBM      | SWB/SWB | 8.97 | 0.778 | 6.28 | 0.667
DAE      | SWB/SWB | 8.97 | 0.764 | 6.01 | 0.644
Baseline | SRE/SWB | 5.45 | 0.621 | 4.23 | 0.554
RBM      | SRE/SWB | 5.35 | 0.631 | 2.97 | 0.447
DAE      | SRE/SWB | 4.62 | 0.560 | 2.63 | 0.401

SLIDE 25

Results

Table 12: Performance summary of speaker verification systems with SPLDA
System   | Whitening/Training | EER(%) | minDCF
Baseline | SRE/SRE | 2.23 | 0.312
RBM      | SRE/SRE | 2.07 | 0.317
DAE      | SRE/SRE | 1.61 | 0.292
Baseline | SRE/SWB | 4.21 | 0.531
RBM      | SRE/SWB | 2.66 | 0.410
DAE      | SRE/SWB | 2.36 | 0.400

SLIDE 26

CONCLUSIONS

 A study of denoising autoencoders in the i-vector space was presented
 We found that the observed performance gain of the DAE based system is due to employing back-end parameters (whitening & PLDA) derived from the RBM outputs
 The question of why the RBM transform provides better back-end parameters for a test set remains open
 Dropout helps when applied at the RBM training stage and does not help when applied at the fine-tuning stage
 A deep architecture in the form of stacked DAEs provides further improvements
 All our findings regarding speaker verification systems in matched conditions hold true in the mismatched-conditions case
 Using whitening parameters from the target domain along with a DAE trained on the out-of-domain set avoids the significant performance gap caused by domain mismatch

SLIDE 27

ABOUT COMPANY

Speech Technology Center (STC) is an international leader in speech technology and multimodal biometrics, with over 25 years of research, development and implementation experience in Russia and internationally. STC is a leading global provider of innovative systems in high-quality recording, audio and video processing and analysis, speech synthesis and recognition, and real-time, high-accuracy voice and facial biometrics solutions. STC innovations are used in both the public and commercial sectors, from small expert laboratories to large, distributed contact centers to nation-wide security systems. STC is ISO 9001:2008 certified.

CONTACTS

Russia: 4 Krasutskogo street, St. Petersburg, 196084. Tel.: +7 812 331 0665. Fax: +7 812 327 9297. Email: info@speechpro.com
USA: Suite 316, 369 Lexington Ave, New York, NY 10017. Tel.: +1 646 237 7895. Email: sales-usa@speechpro.com