

1. Bengali Speech Recognition: A Double Layered LSTM-RNN Approach
Md Mahadi Hasan Nahid, Bishwajit Purkaystha, Md Saiful Islam
Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet-3114

2. Speech Recognition Is Important
➢ Provides efficient communication between humans and machines
➢ Increases throughput
➢ Is the most natural way of communication
Image sources: [1] www.glasbergen.com [2] www.playbuzz.com

3. Speech Recognition Is Difficult
➢ Intuitive for humans, not for machines
We say, "Let there be light". The machine hears [waveform figure]
➢ Different pitches, accents, durations, noise levels

4. Our Contribution
➢ A recurrent neural network (RNN) architecture is proposed
• double layered bidirectional long short-term memory (LSTM) used
➢ Individual phonemes detected to some extent
➢ Results compared with other methods on a Bengali automatic speech recognizer

5. Methodology
➢ Preprocessing
• noise reduction, mel frequency cepstral coefficient (MFCC) extraction
➢ Training
➢ Postprocessing

6. Methodology: Preprocessing (1/2)
➢ Speech signal is divided into a number of frames
• each frame has 13 features (i.e., MFCCs)
➢ An example word (pronounced: SOTERO, Bengali for seventeen)
• S-O-T-E-R-O
• 5 distinct phones
• 17 frames
• SSS-OO-TTT-EEEE-RRR-OO
• roughly 3 frames for each phone
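A minimal sketch of the frame-wise MFCC extraction described on this slide, using the librosa library. The library choice, file name, and frame/hop sizes are assumptions; only the 13 coefficients per frame come from the slides.

```python
import librosa

# load the recording (file name is hypothetical)
signal, sr = librosa.load("sotero.wav", sr=16000)

# 13 MFCCs per frame, as on the slide; window/hop sizes are assumed values
# (400 samples = 25 ms window, 160 samples = 10 ms hop at 16 kHz)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

print(mfcc.shape)  # (13, n_frames): one 13-dimensional feature vector per frame
```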

7. Methodology: Preprocessing (2/2)
SSS-OO-TTT-EEEE-RRR-OO [frame-to-phone alignment figure]

8. Methodology: Training with LSTM (1/3)
➢ Each timestamp carries a frame
➢ Input layer contains 13 units
• one for each coefficient
➢ Next two layers are recurrent
• each contains 100 LSTM cells
➢ The final layer is a softmax layer with 30 units
• there are 30 phonemes in total
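A sketch of this topology in Keras. The slides specify only the layer sizes (13 inputs, two bidirectional LSTM layers of 100 cells each, a 30-way softmax); the framework, optimizer, and loss below are assumptions, not the paper's stated setup.

```python
from tensorflow import keras
from tensorflow.keras import layers

# variable-length sequence of frames, 13 MFCCs each
inputs = keras.Input(shape=(None, 13))
# two recurrent layers of 100 LSTM cells each, bidirectional as on the slide
x = layers.Bidirectional(layers.LSTM(100, return_sequences=True))(inputs)
x = layers.Bidirectional(layers.LSTM(100, return_sequences=True))(x)
# per-frame softmax over the 30 phonemes
outputs = layers.Dense(30, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",                       # assumed optimizer
              loss="sparse_categorical_crossentropy") # assumed loss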

9. Methodology: Training with LSTM (2/3)
➢ An LSTM cell exploits context using gates
• has recurrent connections for temporal modeling
• has nonlinear squashing activation functions for capturing complex relations
➢ Together, responsible for associating cepstral coefficients to phones
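For reference, one LSTM timestep written out in NumPy. This is the standard LSTM cell (gates plus tanh squashing), not code from the paper; the weight layout is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep. Shapes: x (D,), h/c (H,), W (4H, D), U (4H, H), b (4H,)."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b     # recurrent connection: h_prev feeds back in
    i = sigmoid(z[0*H:1*H])        # input gate
    f = sigmoid(z[1*H:2*H])        # forget gate
    o = sigmoid(z[2*H:3*H])        # output gate
    g = np.tanh(z[3*H:4*H])        # candidate update (squashing nonlinearity)
    c = f * c_prev + i * g         # cell state carries long-term context
    h = o * np.tanh(c)             # hidden state passed to the next timestep
    return h, c
```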

10. Methodology: Training with LSTM (3/3) [figure]

11. Methodology: Postprocessing
➢ A phone found in consecutive frames is strong evidence that it is present. In how many frames?
➢ Depends on a threshold T


12. Methodology: Postprocessing
➢ An example (labeled: Sh,Sh,Sh,O,O,O,L,L,O,O,O)
• 12 frames
➢ Network output: S,Sh,Sh,O,O,O,L,L,O,O,O
➢ Initial noise elimination: _,Sh,Sh,O,O,O,L,L,O,O,O
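A minimal sketch of this decoding step, assuming the rule is "keep a phone only if it appears in at least T consecutive frames, then collapse repeated runs". The slides give only the threshold T; the exact rule below is an assumption.

```python
def decode_frames(frame_labels, T=2):
    # group consecutive identical frame labels into runs of (label, length)
    runs = []
    for lab in frame_labels:
        if runs and runs[-1][0] == lab:
            runs[-1][1] += 1
        else:
            runs.append([lab, 1])
    # keep only phones seen in at least T consecutive frames, collapse repeats
    return "".join(lab for lab, n in runs if n >= T)

# the example above: the lone 'S' is noise and is eliminated
print(decode_frames(["S","Sh","Sh","O","O","O","L","L","O","O","O"]))  # "ShOLO"
```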


13. Methodology: Postprocessing
Final output: "ShOLO"

14. Analysis: Evaluation Metrics
➢ Phone Detection Error Rate (P_err)
➢ Word Detection Error Rate (W_err)
➢ Bengali Real Number Dataset (Nahid et al., 2016)
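A plausible reading of these metrics as simple error fractions. The slides do not define them, so the formula below is an assumption; the paper may instead use an edit-distance-based rate.

```python
def detection_error_rate(predicted, reference):
    # fraction of reference items (phones or words) detected incorrectly;
    # assumes equal-length aligned sequences, which is a simplification
    wrong = sum(p != r for p, r in zip(predicted, reference))
    return wrong / len(reference)

# word-level check over a tiny hypothetical test set
print(detection_error_rate(["SOTERO", "ShOLO"], ["SOTERO", "ShOLO"]))  # 0.0
```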

15. Analysis: How Model Learns
➢ 28.7% P_err, 13.2% W_err
➢ CMU Sphinx4 incurred 15% W_err (Nahid et al., 2016)

16. Analysis: Phone Labeling
➢ Some phones are almost always detected with higher confidence (e.g., 'A', 'O', 'L')
• they occurred more frequently
• no other phone is pronounced similarly
➢ Some phones are frequently garbled

17. Analysis: Threshold Affects Learning
➢ It seems, on average, there are 5 phones per word in the dataset
➢ Robustness not checked

18. Limitations
➢ Phone alignment is very difficult
• the same example is introduced thrice, but with different alignments
➢ Only vocabulary words recognized
• 114 unique words
➢ Robustness in threshold selection not tested

19. Conclusion
➢ An RNN architecture is proposed based on phone detection
➢ Bengali real numbers are recognized with higher accuracy
➢ The vocabulary can be increased on larger datasets
➢ How to align phones can also be learned

20. References
• Nahid, M. M. H., Islam, M. A., & Islam, M. S. (2016). A noble approach for recognizing Bangla real number automatically using CMU Sphinx4. In 2016 International Conference on Informatics, Electronics and Vision (ICIEV). IEEE.
• Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
• Ravuri, S., & Wegmann, S. (2016). How neural network features and depth modify statistical properties of HMM acoustic models. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.

21. Thank you

22. Phoneme List [table of the 30 phonemes]
