  1. LID-senone Extraction via Deep Neural Networks for End-to-End Language Identification Ma Jin 1 Yan Song 1 Ian McLoughlin 2 Li-Rong Dai 1 Zhong-Fu Ye 1,3 1 National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China, China 2 School of Computing, University of Kent, Medway, UK 3 State Key Laboratory of Mathematical Engineering and Advanced Computing, China Presented by Professor Ian McLoughlin, 2016.06.22

  2. Outline • Introduction • Proposed Method • Experiments and Analysis • Conclusion and Future Work

  3. Introduction – background • What is Language Identification? • Extracting an utterance-level language representation from a given speech signal • State-of-the-art method • GMM/i-vector • Trained in an unsupervised fashion • Deep learning methods • Natural advantage of supervised training

  4. Introduction – existing methods • Improved i-vector Methods via Deep Learning • Deep bottleneck network based i-vector representation for language identification (Song et al.) • Study of senone-based deep neural network approaches for spoken language recognition (Ferrer et al.) • End-to-End Neural Networks • Automatic language identification using deep neural networks (Lopez-Moreno et al.) • Automatic language identification using long short-term memory recurrent neural networks (Gonzalez-Dominguez et al.) • An end-to-end approach to language identification in short utterances using convolutional neural networks (Lozano-Diez et al.)

  5. Outline • Introduction • Proposed Method • Experiments and Analysis • Conclusion and Future Work

  6. Proposed Method – motivation and structure • Convolutional Neural Network • convolutional layers: feature extractor at the frame level • pooling layers: map frame-level features to an utterance representation • Structure • DNN layers: transform acoustic features into a compact representation, frame by frame • convolutional layer: transform BN features into units discriminative of languages
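The two stages above can be sketched numerically. This is a toy numpy sketch, not the authors' implementation: layer sizes are hypothetical (39-dim acoustic input, a single 50-unit bottleneck layer standing in for the full DNN), and only the 50x21 convolutional filter size, selected later in the talk, is taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 39-dim acoustic frames -> 50-dim bottleneck
# features; the multi-layer DNN is reduced to one layer for brevity.
acoustic = rng.standard_normal((100, 39))      # 100 input frames
W_bn = rng.standard_normal((39, 50)) * 0.1     # stand-in bottleneck layer
bottleneck = np.maximum(acoustic @ W_bn, 0.0)  # ReLU BN features, frame by frame

# One convolutional filter spanning the whole 50-dim feature axis and
# 21 consecutive frames (50x21): each valid time position yields one
# LID-feature activation.
kernel = rng.standard_normal((21, 50)) * 0.1
out_len = bottleneck.shape[0] - 21 + 1
lid_feature = np.array([np.sum(bottleneck[t:t + 21] * kernel)
                        for t in range(out_len)])
```

Because the filter covers the full feature axis, the convolution only slides along time, so each filter produces a one-dimensional activation sequence per utterance.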

  7. Proposed Method – structure details • LID-features • general acoustic features contain too much irrelevant information, which may degrade performance • deep bottleneck features (DBF) are discriminative of phones, not of languages • LID-features are discriminative of languages, and largely independent across dimensions (hence the large conv kernel) • Spatial Pyramid Pooling • spans features from frame level to utterance level • handles arbitrary input sizes • obtains statistical information at different time scales
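A minimal numpy sketch of spatial pyramid pooling over a variable-length sequence of frame-level features. The pyramid levels (1, 2, 4) and the use of average pooling are assumptions for illustration; the slide only states that pooling spans multiple time scales and handles arbitrary input sizes.

```python
import numpy as np

def spatial_pyramid_pool(frames, levels=(1, 2, 4)):
    """Pool a (num_frames, dim) matrix into a fixed-length vector by
    averaging over 1, 2, 4, ... roughly equal temporal segments."""
    pooled = []
    for n_bins in levels:
        # split the time axis into n_bins segments, pool each one
        for segment in np.array_split(frames, n_bins, axis=0):
            pooled.append(segment.mean(axis=0))
    return np.concatenate(pooled)

# Utterances of different lengths map to the same output size.
short = spatial_pyramid_pool(np.random.randn(30, 8))
long_ = spatial_pyramid_pool(np.random.randn(300, 8))
```

Both calls return a (1 + 2 + 4) * 8 = 56-dimensional vector, which is what lets a single network score utterances of arbitrary duration.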

  8. Proposed Method – incremental training strategy • LID-features cannot be extracted directly from general acoustic features • lack of training data • features must be tied to phones at the frame level, so the training target cannot simply be languages • Incremental Training Strategy • transfer learning from a large-scale corpus • incremental training on the language corpus

  9. Proposed Method – LID-senones and their statistics • LID-senones are discriminative at the frame level • their statistics are discriminative at the utterance level • only a few LID-senones are activated at any time

  10. Proposed Method – hybrid temporal evaluation • 30s/10s/3s neural networks are trained independently • 30s speech can be segmented into 10s/3s pieces and scored with the corresponding networks • 10s speech can be segmented into 3s pieces and scored with the corresponding network
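The segmentation scheme above can be sketched as follows. The per-duration scorers here are dummy linear stand-ins, and fusing by simple averaging is an assumption; the slide only specifies that each duration's segments are scored with the matching network.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in scorers: one hypothetical "network" per duration, each
# mapping 8-dim frames to scores over 6 languages.
W = {dur: rng.standard_normal((8, 6)) * 0.1 for dur in (30, 10, 3)}

def score(utt, dur):
    """Dummy per-network scorer: mean frame score per language."""
    return (utt @ W[dur]).mean(axis=0)

def hybrid_score(utt, frames_per_sec=1):
    """Score a 30s utterance with the 30s net, plus its 10s and 3s
    segments with the matching nets, then fuse by averaging (assumed)."""
    fused = [score(utt, 30)]
    for dur in (10, 3):
        seg = dur * frames_per_sec
        n = len(utt) // seg
        fused += [score(utt[i * seg:(i + 1) * seg], dur) for i in range(n)]
    return np.mean(fused, axis=0)

utt30 = rng.standard_normal((30, 8))  # toy rate: 1 frame per second
scores = hybrid_score(utt30)
```

With a 30-frame utterance this fuses 1 + 3 + 10 = 14 score vectors, one per network-segment pairing.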

  11. Outline • Introduction • Proposed Method • Experiments and Analysis • Conclusion and Future Work

  12. Experiments and Analysis • Dataset • six of the most confusable languages from NIST LRE 09 (Dari, Farsi, Russian, Ukrainian, Hindi and Urdu) • about 150 hours of training data • evaluation on 30s/10s/3s conditions • Performance indicator: Equal Error Rate (EER) • Systems • baseline 1: BN-GMM/i-vector • baseline 2: BN-DNN/i-vector • proposed network 1: LID-net • proposed network 2: LID-HT-net, LID-net with hybrid temporal evaluation

  13. Experiments and Analysis • Evaluation of Different Convolutional Filter Sizes (varying n_c) • As a consequence, a filter size of 50x21 is selected for all of the following experiments.

  14. Experiments and Analysis • Evaluation of Convolutional Layer Complexity (varying the complexity of the conv. layer) • Performance improves as the complexity increases

  15. Experiments and Analysis • Hybrid Temporal Evaluation • the final LID-net performs well compared with the two baseline systems • the i-vector approach uses both zeroth-order and first-order Baum-Welch statistics; in LID-net, the SPP layer uses only zeroth-order Baum-Welch statistics
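The distinction between the two kinds of statistics can be sketched from frame posteriors. The sizes (100 frames, 16 senones, 8-dim features) are hypothetical; the point is that zeroth-order statistics are per-senone occupation counts (what an SPP-style pooling of posteriors captures), while first-order statistics additionally weight the frame features.

```python
import numpy as np

rng = np.random.default_rng(2)
post = rng.random((100, 16))             # frame posteriors over 16 senones
post /= post.sum(axis=1, keepdims=True)  # each frame's posteriors sum to 1
feats = rng.standard_normal((100, 8))    # frame-level features

# Zeroth-order statistics: expected frame count per senone.
zeroth = post.sum(axis=0)                # shape (16,)
# First-order statistics: posterior-weighted sums of the features,
# the extra information the i-vector uses but the SPP layer discards.
first = post.T @ feats                   # shape (16, 8)
```

Since every frame's posteriors sum to one, the zeroth-order counts always sum to the number of frames.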

  16. Outline • Introduction • Proposed Method • Experiments and Analysis • Conclusion and Future Work

  17. Conclusion and Future Work • Conclusion • we have proposed a comprehensive task-aware network spanning frame to utterance level • an incremental training strategy has been introduced to address over-fitting issues in the deep structure • hybrid temporal evaluation is proposed for the various time scales in the same test dataset • Future Work • consider a single, more comprehensive network rather than relying on three independent networks • Can we incorporate first-order B-W statistics?
