  1. Bengali Speech Recognition: A Double Layered LSTM-RNN Approach
Md Mahadi Hasan Nahid, Bishwajit Purkaystha, Md Saiful Islam
Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet-3114

  4. Speech Recognition Is Important
➢ Provides efficient communication between humans and machines
➢ Increases throughput
➢ Is the most natural way of communication
Image sources: [1] www.glasbergen.com [2] www.playbuzz.com

  7. Speech Recognition Is Difficult
➢ Intuitive for humans, not for machines: we say, "Let there be light"; the machine hears only a raw signal (waveform figure omitted)
➢ Different pitches, accents, durations, and noise levels add variability

  10. Our Contribution
➢ A recurrent neural network (RNN) architecture is proposed
  • double layered bidirectional long short-term memory (LSTM) used
➢ Individual phonemes detected to some extent
➢ Results compared with other methods on a Bengali automatic speech recognizer

  11. Methodology
➢ Preprocessing
  • noise reduction, mel-frequency cepstral coefficient (MFCC) extraction
➢ Training
➢ Postprocessing

  16. Methodology: Preprocessing (1/2)
➢ Speech signal is divided into a number of frames
  • each frame has 13 features (i.e., MFCCs)
➢ An example word (pronounced SOTERO):
  • S-O-T-E-R-O
  • 5 phones
  • 17 frames
  • SSS-OO-TTT-EEEE-RRR-OO: roughly 3 frames for each phone
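The framing step above can be sketched as follows. The frame and hop lengths below are illustrative assumptions (the slides do not state them), and each resulting frame would then be reduced to its 13 MFCCs, e.g. with a library such as librosa:

```python
import numpy as np

def frame_signal(signal, frame_len, hop_len):
    """Split a 1-D speech signal into overlapping frames (one per row).

    Each frame would then be reduced to 13 MFCC features,
    e.g. via librosa.feature.mfcc with n_mfcc=13.
    """
    n_frames = 1 + max(0, len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# Example: a 10-sample signal, 4-sample frames, 2-sample hop -> 4 frames
frames = frame_signal(np.arange(10), frame_len=4, hop_len=2)
```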

  17. Methodology: Preprocessing (2/2) SSS-OO-TTT-EEEE-RRR-OO

  21. Methodology: Training with LSTM (1/3)
➢ Each timestamp carries a frame
➢ Input layer contains 13 units
  • one for each coefficient
➢ Next two layers are recurrent
  • each contains 100 LSTM cells
➢ The final layer is a softmax layer with 30 units
  • there are 30 phonemes in total
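A minimal sketch of this architecture in Keras (an assumed framework; the slides do not name the toolkit): 13 MFCC inputs per frame, two bidirectional LSTM layers of 100 cells each, and a per-frame 30-way softmax over the phoneme set.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_features=13, n_phonemes=30, lstm_units=100):
    """Two bidirectional LSTM layers with a per-frame softmax output."""
    return models.Sequential([
        layers.Input(shape=(None, n_features)),  # variable-length utterances
        layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True)),
        layers.TimeDistributed(layers.Dense(n_phonemes, activation="softmax")),
    ])

model = build_model()
```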

  22. Methodology: Training with LSTM (2/3)
➢ An LSTM cell
  • exploits context using gates
  • has recurrent connections for temporal modeling
  • has nonlinear squashing activation functions for capturing complex relations
➢ Together, the cells are responsible for associating cepstral coefficients to phones
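The gates and squashing functions mentioned above follow the standard LSTM formulation (not reproduced in the slides themselves):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden output)}
\end{aligned}
```

Here $x_t$ is the 13-dimensional MFCC frame at timestamp $t$, and the recurrent connection through $h_{t-1}$ and $c_{t-1}$ provides the temporal modeling the slide refers to.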

  23. Methodology: Training with LSTM (3/3)

  24. Methodology: Postprocessing
➢ A phone found in enough consecutive frames is strong evidence that it is present. How many frames are enough?
➢ Depends on a threshold T

  25. Methodology: Postprocessing

  28. Methodology: Postprocessing
➢ An example, labeled: Sh,Sh,Sh,O,O,O,L,L,O,O,O
  • 12 frames
➢ Network output: S,Sh,Sh,O,O,O,L,L,O,O,O
➢ Initial noise elimination: _,Sh,Sh,O,O,O,L,L,O,O,O

  29. Methodology: Postprocessing

  30. Methodology: Postprocessing
Final output: "ShOLO"
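The postprocessing steps above can be sketched as a run-length filter over the per-frame labels. The function name and the default threshold are illustrative; the '_' noise marker follows the slides' example:

```python
from itertools import groupby

def collapse_phones(frame_labels, threshold=2):
    """Collapse per-frame phone labels into a phone sequence.

    A phone is kept only if it appears in at least `threshold`
    consecutive frames (the threshold T of the slides); '_' marks
    frames discarded by the initial noise elimination step.
    """
    return [label
            for label, run in groupby(frame_labels)
            if label != "_" and len(list(run)) >= threshold]

# The slides' example: network output after initial noise elimination
labels = ["_", "Sh", "Sh", "O", "O", "O", "L", "L", "O", "O", "O"]
word = "".join(collapse_phones(labels))  # -> "ShOLO"
```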

  33. Analysis: Evaluation Metrics
➢ Phone detection error rate (P_err)
➢ Word detection error rate (W_err)
➢ Bengali Real Number Dataset (Nahid et al., 2016)
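The slides do not define P_err and W_err precisely; a common reading in speech recognition is an edit-distance-based error rate, sketched here as an assumption:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1,  # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[len(ref)][len(hyp)]

def error_rate(reference, hypothesis):
    """Edits needed to turn the hypothesis into the reference, divided
    by reference length: phone sequences for P_err, words for W_err."""
    return edit_distance(reference, hypothesis) / len(reference)
```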

  34. Analysis: How the Model Learns
➢ 28.7% P_err, 13.2% W_err
➢ CMU Sphinx4 incurred 15% W_err (Nahid et al., 2016)

  36. Analysis: Phone Labeling
➢ Some phones are almost always detected with higher confidence (e.g., 'A', 'O', 'L')
  • they occurred more frequently
  • no other phone is pronounced similarly
➢ Some phones are frequently garbled

  37. Analysis: Threshold Affects Learning
➢ It seems, on average, there are 5 phones per word in the dataset
➢ Robustness not checked

  38. Limitations
➢ Phone alignment is very difficult
  • the same example is introduced thrice, but with different alignments
➢ Only in-vocabulary words are recognized
  • 114 unique words
➢ Robustness in threshold selection not tested

  39. Conclusion
➢ An RNN architecture is proposed based on phone detection
➢ Bengali real numbers are recognized with higher accuracy
➢ The vocabulary can be increased with larger datasets
➢ Phone alignment could also be learned

  40. References
• Nahid, M. M. H., Islam, M. A., & Islam, M. S. (2016). A noble approach for recognizing Bangla real number automatically using CMU Sphinx4. In Informatics, Electronics and Vision (ICIEV), 2016. IEEE.
• Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013. IEEE.
• Ravuri, S., & Wegmann, S. (2016). How neural network features and depth modify statistical properties of HMM acoustic models. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016. IEEE.

  41. Thank you

  42. Phoneme List
