  1. Augmented Data Training of Joint Acoustic/Phonotactic DNN i-vectors for NIST LRE 2015
     Alan McCree, Greg Sell, and Daniel Garcia-Romero
     JHU HLTCOE, Odyssey 2016

  2. Overview
     • JHU HLTCOE submission to LRE15
       – One of the top performers
     • Approach
       – DNN features and state labels
       – Acoustic, phonotactic, and joint i-vectors
       – Simple fusion: scale factor and duration
       – Data augmentation

  3. Outline
     • System design
       – i-vector LID system
       – Improvement with DNNs
       – Alternative i-vectors
     • LRE15 task
       – Data usage/augmentation
     • Results and analysis

  4. LID System
     [Diagram: raw MFCC → compute stats → i-vector extractor → LDA → Gaussian scoring; stats and i-vector extraction are unsupervised, scoring is supervised]
     • Two-covariance model in i-vector space
     • Discriminative refinement of Gaussians
       – Within-class covariance scale factor
       – Language class means
       – Multiclass MMI
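The Gaussian scoring step on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: each language is a Gaussian with its own mean and a shared within-class covariance, and the within-class scale factor (`alpha` below, fixed by hand here) is the kind of parameter the slide says is refined discriminatively with MMI.

```python
import numpy as np

def gaussian_llk(w, means, W_inv, log_det):
    """Log-likelihood of i-vector w under each language Gaussian
    (shared within-class covariance, constants dropped)."""
    diffs = means - w                               # (n_lang, dim)
    mahala = np.einsum('ld,de,le->l', diffs, W_inv, diffs)
    return -0.5 * (mahala + log_det)

dim, n_lang = 4, 3
rng = np.random.default_rng(0)
means = rng.normal(size=(n_lang, dim))              # toy language means
alpha = 0.8                                         # within-class covariance scale factor
W = alpha * np.eye(dim)                             # shared within-class covariance
W_inv = np.linalg.inv(W)
log_det = np.log(np.linalg.det(W))

scores = gaussian_llk(means[1], means, W_inv, log_det)
best = int(np.argmax(scores))                       # an i-vector at a class mean scores highest there
```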

  5. LID System
     [Diagram: same pipeline, with the frame posteriors used to compute the stats supplied by a GMM on the MFCC features]
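The "frame posteriors → compute stats" step shown here is the standard sufficient-statistics computation. A sketch of the assumed computation (not from the slides): the zeroth-order stats are soft frame counts per component and the first-order stats are posterior-weighted feature sums; with a DNN (next slide), the same formulas apply with senone posteriors.

```python
import numpy as np

def sufficient_stats(feats, post):
    """feats: (T, D) frames; post: (T, C) per-frame component posteriors."""
    N = post.sum(axis=0)      # zeroth order: soft frame count per component
    F = post.T @ feats        # first order: posterior-weighted feature sums
    return N, F

T, D, C = 5, 2, 3
feats = np.arange(T * D, dtype=float).reshape(T, D)
post = np.full((T, C), 1.0 / C)    # uniform posteriors for a deterministic check
N, F = sufficient_stats(feats, post)
```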

  6. LID System
     [Diagram: same pipeline, with a DNN on the MFCC input supplying both ASR features and the frame posteriors used to compute the stats]

  7. DNN Architecture
     [Diagram: feed-forward network with a 9184-dim senone output layer; annotation marks where the bottleneck layer goes]
     • Input: 9 spliced frames of 40-dim vectors from LDA+MLLT
     • 5 hidden layers with p-norm pooling (p = 2)
       – Input/output ratio of 10:1
     • Output targets are clustered phone states (senones)
     • Trained on SWB-1 using Kaldi
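The input splicing on this slide (9 consecutive 40-dim frames stacked into one 360-dim DNN input) can be sketched as below. Padding edge frames by repetition is an assumption; the slide does not specify the edge handling.

```python
import numpy as np

def splice(feats, context=4):
    """Stack each frame with +/- context neighbours: (T, D) -> (T, (2c+1)*D)."""
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

feats = np.zeros((100, 40))       # toy utterance: 100 frames of 40-dim features
spliced = splice(feats)           # 9 spliced frames of 40 dims -> 360 dims
```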

  8. Types of i-vectors
     • Acoustic: m_i = m_0 + T w_i
       – Gaussian probability model (given alignments)
       – i-vector: analytical solution for MAP estimate
       – EM estimation of T
     • Phonotactic: p_i = softmax(log p_0 + T w_i)
       – Multinomial (categorical) probability model
       – No closed-form MAP solution: Newton's method
       – Iterative approximate estimation of T
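The two generative mappings on this slide can be made concrete with toy shapes. A minimal sketch (all dimensions and values are illustrative assumptions): the acoustic model shifts a mean supervector linearly in w, while the phonotactic model shifts log-probabilities and renormalizes with a softmax.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

dim_sup, dim_w = 6, 2
rng = np.random.default_rng(1)
T_mat = rng.normal(size=(dim_sup, dim_w))   # toy total-variability subspace
w = rng.normal(size=dim_w)                  # toy i-vector

m0 = np.zeros(dim_sup)
m_i = m0 + T_mat @ w                        # acoustic: shifted mean supervector

p0 = np.full(dim_sup, 1.0 / dim_sup)
p_i = softmax(np.log(p0) + T_mat @ w)       # phonotactic: shifted log-probs, renormalized
```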

  9. How to Combine?
     • Score fusion
     • i-vector stacking
     • Joint i-vector:
       m_i = m_0 + T_m w_i
       p_i = softmax(log p_0 + T_p w_i)
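The joint option differs from stacking in that a single latent w drives both views. A sketch under assumed toy dimensions: one shared i-vector w generates the acoustic means through T_m and the phonotactic probabilities through T_p.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

dim_m, dim_p, dim_w = 4, 5, 2
rng = np.random.default_rng(2)
T_m = rng.normal(size=(dim_m, dim_w))       # acoustic subspace
T_p = rng.normal(size=(dim_p, dim_w))       # phonotactic subspace
m0, p0 = np.zeros(dim_m), np.full(dim_p, 0.2)

w = rng.normal(size=dim_w)                  # one shared i-vector for both views
m_i = m0 + T_m @ w                          # acoustic view
p_i = softmax(np.log(p0) + T_p @ w)         # phonotactic view
```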

  10. Joint i-vector Details
      • Based on subspace GMM approach
      • Differences
        – MAP instead of ML i-vector estimate
        – Initialize i-vector with acoustic only (closed-form)
        – Diagonal Hessian in Newton update
        – Computation: similar to acoustic (was 10x)
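The Newton update with a diagonal Hessian can be sketched for the multinomial part alone. This is an assumed illustration, not the authors' code: MAP estimation of w for phone-state counts c under a standard-normal prior, with the off-diagonal Hessian terms dropped (a `numpy` toy; in the real system w would be initialized from the closed-form acoustic estimate).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def map_objective(w, c, T, logp0):
    p = softmax(logp0 + T @ w)
    return c @ np.log(p) - 0.5 * w @ w          # log-likelihood + log-prior

T = np.array([[0.5, -0.2], [0.1, 0.4], [-0.6, 0.3]])   # toy subspace
c = np.array([10.0, 5.0, 1.0])                         # toy phone-state counts
logp0 = np.log(np.full(3, 1.0 / 3.0))
n = c.sum()

w = np.zeros(2)                       # acoustic closed-form init would go here
for _ in range(10):
    p = softmax(logp0 + T @ w)
    grad = T.T @ (c - n * p) - w                       # gradient of MAP objective
    hdiag = -(T ** 2).T @ (n * p * (1 - p)) - 1.0      # diagonal Hessian approx (< 0)
    w = w - grad / hdiag                               # damped Newton step
```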

  11. LRE15 Task
      • Conversational speech
        – Telephone and broadcast narrowband
      • 20 languages, 6 confusable clusters
        – Metric: average Bayes cost (C_avg)
      • Limited training condition
        – Use distributed material only
        – Switchboard (English) + transcriptions
        – Variable amount per language

  12. LRE15 Systems
      • All use DNNs, i-vectors, and MMI-trained Gaussian classifier
      • Variations:
        – DNNs
          • Bottleneck
          • Clustered phone state (senones)
        – i-vectors
          • Acoustic
          • Phonotactic
          • Joint

  13. Back-end and Fusion
      • Systems are already calibrated
        – MMI training of covariance scaling and means
      • Duration modeling/scaling: log-likelihoods scaled by a duration-dependent factor of the form n_t / (n_t + n_0)
      • Fusion by averaging calibrated scores
        – Learn overall scaling after average
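The back-end above can be sketched as follows. Both the functional form of the duration scaling (n_t / (n_t + n_0), shrinking scores toward zero for short cuts) and the value of n_0 are assumptions for illustration; the fusion step is the simple average of calibrated scores the slide describes.

```python
import numpy as np

def duration_scale(ll, n_t, n_0=300.0):
    """Assumed duration scaling: shrink scores for short cuts;
    the factor approaches 1.0 as the cut gets long."""
    return (n_t / (n_t + n_0)) * ll

ll_sys_a = np.array([2.0, -1.0, 0.5])      # calibrated scores, system A
ll_sys_b = np.array([1.0, -2.0, 1.5])      # calibrated scores, system B

short = duration_scale(ll_sys_a, n_t=100.0)     # heavily shrunk
long = duration_scale(ll_sys_a, n_t=10000.0)    # nearly unchanged
fused = 0.5 * (ll_sys_a + ll_sys_b)             # fusion by averaging
```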

  14. LRE15 Cluster Scoring
      • Task: closed-set detection per cluster
        – Use Bayes' rule for each
        – ID posteriors sum to 1 per cluster (sum to 6 total)
        – Convert to detection LLRs
      • No cluster-specific systems or fusion
        – This is a generic 20-language LID system!
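The per-cluster scoring step can be sketched directly: within one closed-set cluster, Bayes' rule turns language log-likelihoods into posteriors that sum to one, and each posterior p becomes a detection LLR via log(p / (1 - p)). The uniform within-cluster prior below is an assumption.

```python
import numpy as np

def cluster_llrs(loglik, prior=None):
    """loglik: per-language log-likelihoods for one closed-set cluster."""
    loglik = np.asarray(loglik, dtype=float)
    if prior is None:
        prior = np.full(len(loglik), 1.0 / len(loglik))
    joint = np.exp(loglik - loglik.max()) * prior
    post = joint / joint.sum()                 # posteriors sum to 1 per cluster
    return np.log(post) - np.log1p(-post)      # detection LLR per language

llrs = cluster_llrs([1.0, 0.0, -1.0])          # toy 3-language cluster
post_check = 1.0 / (1.0 + np.exp(-llrs))       # sigmoid recovers the posteriors
```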

  15. Training Data and Augmentation
      [Diagram: two data paths]
      • UBM/T: simple: all provided data, full cuts (up to 120 sec), no augmentation
      • Classifier: LRE training data, segmented to 3-30 seconds of speech (uniform), then augmented (reverb, noise, resample, encode)
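The classifier-data segmentation can be sketched as below. The cutting policy (draw uniform durations until the recording is used up) is an assumption for illustration; only the 3-30 s uniform range comes from the slide.

```python
import numpy as np

def segment_lengths(total_sec, rng, lo=3.0, hi=30.0):
    """Draw uniform segment durations until the recording is used up."""
    lengths, used = [], 0.0
    while total_sec - used >= lo:
        d = min(rng.uniform(lo, hi), total_sec - used)
        lengths.append(d)
        used += d
    return lengths

rng = np.random.default_rng(3)
lengths = segment_lengths(120.0, rng)   # segment one full 120 s cut
```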

  16. Augmentation Types
      • Sample rate perturbation
        – Distorts pitch and speaking rate
      • Additive noise
        – Multiband modulated Gaussian noise
      • Reverberation
        – Long or short synthetic impulse response
      • Multiband compression
        – Dynamic range compression
      • Cellular speech coding
        – GSM-AMR at 4.75 or 6.7 kb/s
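The first augmentation type, sample-rate perturbation, can be sketched with plain resampling: reading the waveform at a rate slightly off from 1.0 jointly distorts pitch and speaking rate. Linear interpolation stands in for a proper resampler here; the factors and toy signal are assumptions.

```python
import numpy as np

def speed_perturb(wav, factor):
    """Resample by `factor`: > 1 speeds up (shorter signal), < 1 slows down."""
    n_out = int(round(len(wav) / factor))
    t_out = np.arange(n_out) * factor           # fractional read positions
    return np.interp(t_out, np.arange(len(wav)), wav)

wav = np.sin(2 * np.pi * 5 * np.arange(8000) / 8000.0)  # 1 s toy tone at 8 kHz
fast = speed_perturb(wav, 1.1)    # ~10% faster and higher-pitched
slow = speed_perturb(wav, 0.9)    # ~10% slower and lower-pitched
```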

  17. Submission Performance

      System                     Avg Cavg
      [0] Bottleneck, joint      19.9
      [1] Senone, acoustic       -
      [2] Senone, phonotactic    -
      [3] Senone, joint          19.7
      [0,3] Fusion               18.8
      [0,1,2] Fusion             18.5

  18. Post-eval Improvements
      • Classifier tuning
        – ML initialization to MMI instead of single-cut Bayesian (PLDA)
      • List usage
        – No Switchboard for UBM/T
      • No segmentation/augmentation
      • Smallest possible number of cuts!

  19. Post-eval Results

      System                     Submission   Classifier   Class+lists
      Acoustic baseline          -            23.8         22.2
      [0] Bottleneck, joint      19.9         19.2         18.5
      [1] Senone, acoustic       -            20.2         19.2
      [2] Senone, phonotactic    -            20.9         20.3
      [3] Senone, joint          19.7         19.2         18.4
      [0,3] Fusion               18.8         18.1         17.3
      [0,1,2] Fusion             18.5         18.0         17.3

  20. Conclusion
      • JHU HLTCOE: a strong performer in LRE15
      • Key components
        – DNN features and state labels
        – Acoustic, phonotactic, and joint i-vectors
        – Simple fusion: scale factor and duration
        – Data augmentation
