I–vector transformation and scaling for PLDA based speaker recognition – PowerPoint PPT Presentation




SLIDE 1

I–vectors and PLDA I–vector Transformation Dataset mismatch compensation Conclusions

I–vector transformation and scaling for PLDA based speaker recognition

Sandro Cumani, Pietro Laface

Politecnico di Torino, Italy

Sandro Cumani, Pietro Laface I–vector transformation for PLDA based speaker recognition

SLIDE 2

Outline

  • I–vectors and PLDA
  • I–vector transformation
  • Dataset mismatch compensation
  • Results and conclusions


SLIDE 3

PLDA assumptions

  • I–vectors are sampled from a Gaussian distribution
  • Similar development and evaluation i–vector distributions

[Figure: histogram of squared i–vector norms (Dev and Eval) vs. the χ2 distribution]
[Figure: histograms of the two (whitened) i–vector components with highest skewness, φ1 and φ2, vs. N(0, 1)]

SLIDE 4

HT–PLDA / length norm

  • HT–PLDA tries to deal with non–Gaussianity
  • Length normalization (LN): mainly deals with dataset mismatch

[Figure: histograms of the two (whitened) i–vector components with highest skewness, φ1 and φ2, vs. N(0, 1), before LN]
[Figure: the same histograms after LN]

SLIDE 5

HT–PLDA / length norm

Length–normalized i–vectors are still far from Gaussian. Can we transform i–vectors so as to better fit the Gaussian PLDA assumptions and, at the same time, perform a similar dataset mismatch compensation?


SLIDE 6

I–vector Transformation

Assume that i–vectors are sampled from a R.V. Φ. Represent Φ as a function of a standard normal R.V. Y:

Φ = f⁻¹(Y)

The (log) p.d.f. of Φ is given by

log PΦ(φ) = log PY(f(φ)) + log |Jf(φ)| = −½ f(φ)ᵀ f(φ) + log |Jf(φ)| + c
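The change-of-variables formula above can be sketched as follows (a minimal NumPy sketch; the function f and its Jacobian log-determinant are assumed to be supplied by the transformation model):

```python
import numpy as np

def log_pdf_phi(phi, f, log_abs_det_jacobian):
    """Change-of-variables log-density:
    log P_Phi(phi) = -0.5 * f(phi)^T f(phi) + log |J_f(phi)| + const,
    where f maps Phi-distributed samples to (approximately) N(0, I)."""
    y = f(phi)
    const = -0.5 * y.shape[-1] * np.log(2.0 * np.pi)
    return -0.5 * np.dot(y, y) + log_abs_det_jacobian(phi) + const

# With f = identity the formula reduces to the standard normal log-density:
phi = np.zeros(2)
lp = log_pdf_phi(phi, lambda x: x, lambda x: 0.0)  # = -log(2 * pi)
```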


SLIDE 7

I–vector Transformation

How do we model the (unknown) function f?

Neural-network-style approach: composition of (invertible) layers

f(φ; θ1, · · · , θn) = f1(·; θ1) ◦ · · · ◦ fn(·; θn)

f is estimated so as to maximize the likelihood of the samples of Φ (in our case the i–vectors)

The function f allows transforming Φ–distributed samples into (almost) Gaussian–distributed samples
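A minimal sketch of the layer composition and its log-determinant bookkeeping (the `Scale` layer is a hypothetical stand-in for the affine and non-linear layers of the model, not the authors' implementation):

```python
import numpy as np

class Scale:
    """Hypothetical invertible layer: y = s * x, with log|det J| = D * log|s|."""
    def __init__(self, s):
        self.s = s
    def forward(self, x):
        return self.s * x, x.size * np.log(abs(self.s))

def compose_forward(layers, phi):
    """Apply the cascade of layers and accumulate the Jacobian
    log-determinants needed by the likelihood criterion."""
    y, log_det = phi, 0.0
    for layer in layers:
        y, ld = layer.forward(y)
        log_det += ld
    return y, log_det

y, ld = compose_forward([Scale(2.0), Scale(2.0)], np.ones(3))
# y = [4, 4, 4], ld = 6 * log(2)
```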


SLIDE 8

I–vector Transformation

[Figure, three panels:
(a) Histogram of φ1 and estimated p.d.f. of Φ
(b) Estimated transformation function f
(c) Histogram of f(φ1) and p.d.f. of Y ∼ N(0, 1)]

SLIDE 9

Transformation Model

Simple structure (cascade of two types of transformations):

  • Affine layer: fA(φ; A, b) = Aφ + b
  • SAS layer (acting as non–linear units): cascade of
    • inverse sinh layer: fS1(φi) = sinh⁻¹(φi)
    • diagonal affine layer: fS2(φi; δi, εi) = δiφi + εi
    • sinh layer: fS3(φi) = sinh(φi)

The SAS layer can be summarized as fSAS(φi; δi, εi) = sinh(δi sinh−1(φi) + εi)
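An element-wise NumPy sketch of the SAS layer, including the Jacobian log-determinant needed for the likelihood (a sketch under the definitions above, not the authors' code):

```python
import numpy as np

def sas_forward(phi, delta, eps):
    """SAS layer: f_SAS(phi_i; delta_i, eps_i) = sinh(delta_i * asinh(phi_i) + eps_i),
    applied element-wise. Returns the output and log |Jacobian|.
    Since d y_i / d phi_i = cosh(v_i) * delta_i / sqrt(1 + phi_i^2),
    the log-det is a sum of per-component terms."""
    u = np.arcsinh(phi)          # inverse sinh layer
    v = delta * u + eps          # diagonal affine layer
    y = np.sinh(v)               # sinh layer
    log_det = np.sum(np.log(np.cosh(v)) + np.log(np.abs(delta))
                     - 0.5 * np.log1p(phi ** 2))
    return y, log_det

# With delta = 1 and eps = 0 the SAS layer is the identity (log-det = 0):
phi = np.array([0.5, -1.0])
y, ld = sas_forward(phi, np.ones(2), np.zeros(2))
```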


SLIDE 10

Transformation Model

The parameters of the transformation function f are estimated using a Maximum Likelihood criterion. Gradients can be computed with an algorithm similar to back–propagation with an MSE loss, but the Jacobian log–determinant terms must also be taken into account.
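Under the change-of-variables density, the per-sample training objective is (up to a constant) the following negative log-likelihood; this sketches the criterion only, not the gradient computation:

```python
import numpy as np

def nll(y, log_det):
    """Negative log-likelihood of one transformed sample (constant dropped):
    0.5 * f(phi)^T f(phi) - log |J_f(phi)|.
    The first term is the MSE-like pull of f(phi) toward 0; the second term
    is the extra Jacobian contribution mentioned above."""
    return 0.5 * np.dot(y, y) - log_det

val = nll(np.array([1.0, 1.0]), 0.5)  # = 0.5
```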


SLIDE 11

SAS Transformation on SRE ’10 data (female)

[Figure: histogram of squared i–vector norms vs. the χ2 distribution (Dev and Eval), before and after the 1–layer–SAS transformation]

System           Cond 1       Cond 2       Cond 3       Cond 4       Cond 5
                 EER  DCF10   EER  DCF10   EER  DCF10   EER  DCF10   EER  DCF10
PLDA (w/o LN)    2.06 0.288   3.60 0.541   3.27 0.481   1.71 0.335   3.91 0.417
1–layer AS       2.15 0.221   3.36 0.462   2.96 0.414   1.61 0.290   3.19 0.391
PLDA (with LN)   1.81 0.255   2.83 0.476   1.95 0.367   1.21 0.295   2.19 0.347

SLIDE 12

Dataset mismatch compensation

Length–norm can be interpreted as the ML solution for the estimate of a scaling parameter of the i–vector distribution. An i–vector φi is sampled from the R.V. Φi:

Φi ∼ N(0, αi⁻² Σ)

Given Σ, the ML estimate for αi is

αi⁻¹ = √(φiᵀ Σ⁻¹ φi / D)

SLIDE 13

Dataset mismatch compensation

Φi can be represented as

Φi = αi⁻¹ Σ^(1/2) Y,   Y ∼ N(0, I)

The transformation function f is given by

f(φi; A, αi) = αi A φi,   A = Σ^(−1/2)

Applying whitening followed by length–norm is equivalent to applying the transformation f using the ML estimates of A and αi
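A NumPy sketch of this equivalence (Σ is assumed to be estimated elsewhere; the √D factor is what distinguishes this ML scaling from plain unit-length normalization):

```python
import numpy as np

def transform(phi, Sigma):
    """f(phi; A, alpha) = alpha * A @ phi with A = Sigma^(-1/2) and the ML
    estimate alpha = sqrt(D / (phi^T Sigma^{-1} phi)): whitening followed
    by a length-norm-style rescaling."""
    D = phi.shape[-1]
    w, V = np.linalg.eigh(Sigma)          # Sigma = V diag(w) V^T
    A = V @ np.diag(w ** -0.5) @ V.T      # symmetric inverse square root
    y = A @ phi                           # whitened i-vector
    alpha = np.sqrt(D / (y @ y))          # ML scaling (y^T y = phi^T Sigma^{-1} phi)
    return alpha * y

out = transform(np.array([3.0, 4.0]), np.eye(2))
# the output always has squared norm D (here D = 2)
```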


SLIDE 14

Dataset mismatch compensation

  • We introduce an α–layer (scaling layer), whose single parameter is i–vector dependent
  • The function f is a cascade of the α–layer and the original SAS layers
  • For efficiency reasons we perform alternate estimation of the SAS parameters and the αi's
  • At testing time, for each test i–vector we need to estimate the corresponding αi


SLIDE 15

SRE ’10 Results (female)

  • 400–dimensional i–vectors reduced to 150 dimensions through LDA
  • i–vectors are whitened (this allows initializing the transformation as the identity function)

Results of α–scaled SAS transformation on the female set of NIST SRE 2010 dataset

System                Cond 1       Cond 2       Cond 3       Cond 4       Cond 5
                      EER  DCF10   EER  DCF10   EER  DCF10   EER  DCF10   EER  DCF10
PLDA (w/o LN)         2.06 0.288   3.60 0.541   3.27 0.481   1.71 0.335   3.91 0.417
1–layer AS            2.15 0.221   3.36 0.462   2.96 0.414   1.61 0.290   3.19 0.391
PLDA (with LN)        1.81 0.255   2.83 0.476   1.95 0.367   1.21 0.295   2.19 0.347
1–layer α–AS iter. 1  1.80 0.204   2.83 0.424   2.15 0.373   1.20 0.280   2.03 0.333
1–layer α–AS iter. 3  1.38 0.192   2.58 0.406   2.30 0.361   1.20 0.237   2.16 0.322

SLIDE 16

Conclusions

  • We investigated an approach to estimate a density transform which allows modifying i–vectors to better fit the PLDA assumptions while compensating for dataset mismatch
  • The transformation is learned using an ML criterion, which allows expressing the i–vector p.d.f. as a transformation of a standard normal p.d.f.
  • The transformation is implemented using a framework similar to neural networks (with some constraints)


SLIDE 17

Conclusions

  • The proposed approach noticeably improves results, mainly in terms of DCF 2010, on both telephone and microphone conditions
  • Care has to be taken in designing the network structure to avoid overfitting: our experiments with multiple SAS layers were not satisfactory until constraints were imposed on the transformation parameters (work in progress)
