Augmented Data Training of Joint Acoustic/Phonotactic DNN i-vectors for NIST LRE 2015




SLIDE 1

Augmented Data Training of Joint Acoustic/Phonotactic DNN i-vectors for NIST LRE 2015

Alan McCree, Greg Sell, and Daniel Garcia-Romero, JHU HLTCOE, Odyssey 2016

SLIDE 2

Overview

  • JHU HLTCOE submission to LRE15

– One of top performers

  • Approach

– DNN features and state labels
– Acoustic, phonotactic, and joint i-vectors
– Simple fusion: scale factor and duration
– Data augmentation

SLIDE 3

Outline

  • System design

– i-vector LID system
– Improvement with DNNs
– Alternative i-vectors

  • LRE15 task

– Data usage/augmentation

  • Results and analysis
SLIDE 4

LID System

  • Two-covariance model in i-vector space
  • Discriminative refinement of Gaussians

– Within-class covariance scale factor
– Language class means
– Multiclass MMI

[Diagram: MFCC → compute stats → i-vector extractor (raw i-vector) → LDA → Gaussian scoring → scores. The stats and i-vector stages are unsupervised; LDA and Gaussian scoring are supervised.]
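The scoring stage can be sketched as follows: a minimal NumPy illustration of Gaussian scoring in i-vector space, where each language has its own mean but all share one within-class covariance (the two-covariance back-end). All names and dimensions here are made up for illustration, not taken from the paper.

```python
import numpy as np

def gaussian_scores(ivec, lang_means, within_cov):
    """Two-covariance-style scoring sketch: each language is a Gaussian with
    its own mean and a shared within-class covariance, so the per-language
    log-likelihood reduces (up to a constant) to a Mahalanobis distance."""
    w_inv = np.linalg.inv(within_cov)
    diffs = lang_means - ivec                       # (L, D) mean-minus-ivec
    return -0.5 * np.einsum('ld,de,le->l', diffs, w_inv, diffs)
```

The discriminative refinement on this slide would then adjust `lang_means` and a scale factor on `within_cov` via multiclass MMI, rather than re-deriving them generatively.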

SLIDE 5

LID System

[Diagram: the same pipeline, with a GMM computing frame posteriors from the MFCCs to drive the stats computation.]

SLIDE 6

LID System

[Diagram: the same pipeline, but a DNN operating on ASR features computes the frame posteriors used for the stats.]

SLIDE 7

DNN Architecture

  • Input: 9 spliced frames of 40-dim vectors from LDA+MLLT
  • 5 hidden layers with p-norm pooling (p=2)

– Input/output ratio of 10:1

  • Output targets are clustered phone states (senones)
  • Trained on SWB-1 using Kaldi

[Diagram: DNN architecture with a 9,184-dim senone output layer; an annotation marks where the bottleneck layer goes.]
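The p-norm nonlinearity with its 10:1 input/output ratio can be sketched like this (in the style of Kaldi's p-norm component; the group size and input dimension below are illustrative):

```python
import numpy as np

def pnorm(x, group_size=10, p=2):
    """p-norm pooling: split the layer's pre-activations into groups of
    `group_size` units and output each group's p-norm, giving a 10:1
    dimension reduction for group_size=10."""
    groups = x.reshape(-1, group_size)
    return np.linalg.norm(groups, ord=p, axis=1)
```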

SLIDE 8

Types of i-vectors

  • Acoustic:

– Gaussian probability model (given alignments)
– i-vector: analytical solution for MAP estimate
– EM estimation of T

  • Phonotactic:

– Multinomial (categorical) probability model
– No closed form MAP solution: Newton's method
– Iterative approximate estimation of T

Acoustic i-vector: m_i = m + T w_i

Phonotactic i-vector: p_i = softmax(log p + T w_i)
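The analytical MAP solution for the acoustic i-vector can be sketched from the standard sufficient-statistics formulation. This is a textbook implementation under common assumptions (diagonal UBM covariances, standard-normal prior on w), not the paper's code; all variable names are illustrative.

```python
import numpy as np

def acoustic_ivector(N, F, T, means, inv_vars):
    """Closed-form MAP (posterior-mean) acoustic i-vector.
    N: (C,) zeroth-order stats; F: (C, d) first-order stats;
    T: (C, d, D) per-component blocks of the total-variability matrix;
    means/inv_vars: (C, d) UBM means and inverse diagonal covariances.
    Solves (I + sum_c N_c T_c' S_c^-1 T_c) w = sum_c T_c' S_c^-1 (F_c - N_c m_c)."""
    D = T.shape[2]
    A = np.eye(D)                                   # prior precision I
    b = np.zeros(D)
    for c in range(len(N)):
        TS = T[c].T * inv_vars[c]                   # (D, d): T_c' S_c^-1
        A += N[c] * TS @ T[c]
        b += TS @ (F[c] - N[c] * means[c])
    return np.linalg.solve(A, b)
```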

SLIDE 9

How to Combine?

  • Score fusion
  • I-vector stacking
  • Joint i-vector:

Joint i-vector (one shared w_i):

m_i = m + T_m w_i

p_i = softmax(log p + T_w w_i)

SLIDE 10

Joint I-vector Details

  • Based on subspace GMM approach
  • Differences

– MAP instead of ML i-vector estimate
– Initialize i-vector with acoustic only (closed-form)
– Diagonal Hessian in Newton update
– Computation: now similar to acoustic (was 10x slower)

SLIDE 11

LRE15 Task

  • Conversational speech

– Telephone and broadcast narrowband

  • 20 languages, 6 confusable clusters

– Metric: average Bayes cost (Cavg)

  • Limited training condition

– Use distributed material only
– Switchboard (English) + transcriptions
– Variable amount per language

SLIDE 12

LRE15 Systems

  • All use DNNs, i-vectors, and an MMI-trained Gaussian classifier

  • Variations:

– DNNs

  • Bottleneck
  • Clustered phone state (senones)

– i-vectors

  • Acoustic
  • Phonotactic
  • Joint
SLIDE 13

Back-end and Fusion

  • Systems are already calibrated

– MMI training of covariance scaling and means

  • Duration modeling/scaling
  • Fusion by averaging calibrated scores

– Learn overall scaling after average

LL_m = c_{t_n} + t_n S_m

(score for language m on cut n: per-frame score S_m scaled by duration t_n, plus a duration-dependent offset)
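Reading the slide's duration model as LL_m = c_t + t * S_m, the back-end can be sketched as: per-language scores grow linearly with cut duration, calibrated per-system scores are averaged, and a single overall scale is learned afterwards. The offset and scale values below are placeholders, not learned parameters.

```python
import numpy as np

def duration_calibrate(per_frame_scores, t, offset):
    """LL_m = c_t + t * S_m: per-language score S_m scaled by duration t
    plus a duration-dependent offset c_t (placeholder value here)."""
    return offset + t * per_frame_scores

def fuse(calibrated_scores, overall_scale=1.0):
    """Fusion by averaging already-calibrated system scores, then one
    overall scale learned after the average (placeholder value here)."""
    return overall_scale * np.mean(np.stack(calibrated_scores), axis=0)
```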

SLIDE 14

LRE15 Cluster Scoring

  • Task: closed set detection per cluster

– Use Bayes' rule for each
– ID posteriors sum to 1 per cluster (sum to 6 total)
– Convert to detection LLRs

  • No cluster-specific systems or fusion

– This is a generic 20-language LID system!
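The per-cluster closed-set detection described above can be sketched as follows; the cluster definitions and a flat within-cluster prior are illustrative assumptions:

```python
import numpy as np

def cluster_detection_llrs(lang_ll, clusters):
    """For each cluster, apply Bayes' rule with a flat prior over its
    member languages (posteriors sum to 1 within the cluster), then
    convert each posterior p to a detection LLR log(p / (1 - p))."""
    out = {}
    for name, idx in clusters.items():
        x = lang_ll[idx] - lang_ll[idx].max()       # stable softmax
        post = np.exp(x) / np.exp(x).sum()
        post = np.clip(post, 1e-12, 1 - 1e-12)      # avoid infinite LLRs
        out[name] = np.log(post / (1 - post))
    return out
```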

SLIDE 15

Training Data and Augmentation

[Diagram: LRE training data → segment → augment → UBM/T and classifier training. Segments are 3-30 seconds of speech (uniform); augmentation applies reverb, noise, resampling, and encoding.]

Simple: all provided data, full cuts (up to 120 sec), no augmentation
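The segmentation step can be sketched as drawing uniform 3-30 second cuts; the sample rate and the RNG seed below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cut(n_samples, sr=8000, lo_s=3.0, hi_s=30.0):
    """Pick one training cut: a uniformly distributed 3-30 s span of
    the recording, clipped to the recording length."""
    dur = rng.uniform(lo_s, hi_s)
    length = min(int(dur * sr), n_samples)
    start = int(rng.integers(0, n_samples - length + 1))
    return start, start + length
```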

SLIDE 16

Augmentation types

  • Sample rate perturbation

– Distorts pitch and speaking rate

  • Additive Noise

– Multiband modulated Gaussian noise

  • Reverberation

– Long or short synthetic impulse response

  • Multiband compression

– Dynamic range compression

  • Cellular speech coding

– GSM-AMR at 4.75 or 6.7 kb/s
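The first augmentation above, sample rate perturbation, can be sketched with simple linear interpolation (a stand-in for a proper resampler):

```python
import numpy as np

def rate_perturb(x, factor):
    """Resample the waveform by `factor` and play it back at the original
    rate: pitch and speaking rate shift together by the same factor."""
    n_out = int(round(len(x) / factor))
    t = np.linspace(0, len(x) - 1, n_out)
    return np.interp(t, np.arange(len(x)), x)
```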

SLIDE 17

Submission Performance

System                     Avg Cavg
[0] Bottleneck, joint      19.9
[1] Senone, acoustic       20.2
[2] Senone, phonotactic    20.9
[3] Senone, joint          19.7
[0,3] Fusion               18.8
[0,1,2] Fusion             18.5

SLIDE 18

Post-eval Improvements

  • Classifier tuning

– ML initialization to MMI instead of single-cut Bayesian (PLDA)

  • List usage

– No Switchboard for UBM/T

  • No segmentation/augmentation
  • Smallest possible number of cuts!
SLIDE 19

Post-eval Results

System                     Submission   Classifier   Class+lists
Acoustic baseline          23.8         22.2
[0] Bottleneck, joint      19.9         19.2         18.5
[1] Senone, acoustic       20.2         19.2
[2] Senone, phonotactic    20.9         20.3
[3] Senone, joint          19.7         19.2         18.4
[0,3] Fusion               18.8         18.1         17.3
[0,1,2] Fusion             18.5         18.0         17.3

SLIDE 20

Conclusion

  • JHU HLTCOE was a strong performer in LRE15
  • Key components

– DNN features and state labels
– Acoustic, phonotactic, and joint i-vectors
– Simple fusion: scale factor and duration
– Data augmentation