Bengali Speech Recognition: A Double Layered LSTM-RNN Approach
Md Mahadi Hasan Nahid, Bishwajit Purkaystha, Md Saiful Islam
Department of Computer Science and Engineering
Shahjalal University of Science and Technology, Sylhet-3114.
Speech Recognition Is Important
➢ Provides efficient communication between humans and machines
➢ Increases throughput
➢ Is the most natural way of communication
Image sources: [1] www.glasbergen.com [2] www.playbuzz.com
Speech Recognition Is Difficult
➢ Intuitive for humans, not for machines
    • We say, "Let there be light".
    • Machine hears: [image]
➢ Different pitches, accents, durations, noise levels
Our Contribution
➢ A recurrent neural network (RNN) architecture is proposed
    • A double-layered bidirectional long short-term memory (LSTM) network is used
➢ Individual phonemes are detected to some extent
➢ Results are compared with other methods for Bengali automatic speech recognition
Methodology
➢ Preprocessing
    • Noise reduction, mel-frequency cepstral coefficient (MFCC) extraction
➢ Training
➢ Postprocessing
Methodology: Preprocessing (1/2)
➢ The speech signal is divided into a number of frames
    • Each frame has 13 features (i.e., MFCCs)
➢ An example word: (pronounced as: SOTERO)
    • S-O-T-E-R-O
    • 5 distinct phones in 6 segments (O occurs twice)
    • 17 frames
    • SSS-OO-TTT-EEEE-RRR-OO
    • Roughly 3 frames for each phone
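The framing step can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `frame_signal`, the 16 kHz sampling rate, and the 25 ms / 10 ms frame and hop lengths are assumptions that the slides do not state. Computing the 13 MFCCs for each frame is left to a separate feature extractor and is not shown.

```python
def frame_signal(samples, frame_len, hop_len):
    """Split a 1-D sequence of samples into overlapping frames.

    Only complete frames are returned; trailing samples that do not
    fill a whole frame are dropped.
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# Hypothetical settings (not from the slides): 16 kHz audio,
# 25 ms frames (400 samples) taken every 10 ms (160 samples).
signal = [0.0] * 3000
frames = frame_signal(signal, frame_len=400, hop_len=160)
print(len(frames), len(frames[0]))  # 17 400
```

With these assumed settings a 3000-sample signal yields 17 frames, the same count as in the SOTERO example.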
Methodology: Preprocessing (2/2)
SSS-OO-TTT-EEEE-RRR-OO
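To make the alignment notation concrete, the short sketch below expands the string into one label per frame and counts the frames in each segment. The helper name `expand_alignment` is hypothetical, and the code assumes each hyphen-separated segment repeats a single-character phone symbol once per frame.

```python
def expand_alignment(alignment):
    """Turn 'SSS-OO-TTT' into per-frame labels and (phone, run length) pairs.

    Assumes each hyphen-separated segment repeats a single-character
    phone symbol once per frame.
    """
    labels, runs = [], []
    for segment in alignment.split("-"):
        if not segment:
            raise ValueError("empty segment in alignment string")
        phone = segment[0]
        if any(ch != phone for ch in segment):
            raise ValueError(f"mixed symbols in segment {segment!r}")
        labels.extend([phone] * len(segment))
        runs.append((phone, len(segment)))
    return labels, runs


labels, runs = expand_alignment("SSS-OO-TTT-EEEE-RRR-OO")
print(len(labels))                           # 17 frames
print(runs)                                  # six (phone, frame count) pairs
print(sum(n for _, n in runs) / len(runs))   # mean frames per segment
```

The mean comes out just under 3 frames per segment, which matches the "roughly 3 frames for each phone" remark.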
Methodology: Training with LSTM (1/3)
➢ Each time step carries one frame
➢ The input layer contains 13 units
    • One for each coefficient
➢ The next two layers are recurrent
    • Each contains 100 LSTM cells
➢ The final layer is a softmax layer with 30 units
    • There are 30 phonemes in total
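As a rough sanity check on the size of this network, the sketch below counts trainable weights for the stated layer sizes. It rests on assumptions the slides do not confirm: each recurrent layer is bidirectional with 100 cells per direction, the two directions are concatenated, each gate has a single bias vector, and there are no peephole connections. The function name `lstm_params` is hypothetical.

```python
def lstm_params(input_size, hidden_size):
    """Weight count of one unidirectional LSTM layer.

    Four gate blocks (input, forget, cell candidate, output), each with
    input weights, recurrent weights, and one bias vector. Peephole
    connections are not counted.
    """
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


N_MFCC, N_CELLS, N_PHONES = 13, 100, 30  # sizes stated on the slide

# Assumption: each recurrent layer is bidirectional with 100 cells per
# direction, and the two directions are concatenated (200 outputs).
layer1 = 2 * lstm_params(N_MFCC, N_CELLS)
layer2 = 2 * lstm_params(2 * N_CELLS, N_CELLS)
softmax = 2 * N_CELLS * N_PHONES + N_PHONES

print(layer1, layer2, softmax, layer1 + layer2 + softmax)
# 91200 240800 6030 338030
```

Under these assumptions the second recurrent layer dominates, because its input is the 200-wide concatenated output of the first layer rather than the 13 MFCCs.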
Methodology: Training with LSTM (2/3)
➢ An LSTM cell
    • exploits context using gates
    • has a recurrent connection for temporal modeling
    • has nonlinear squashing activation functions for capturing complex relations
➢ Together, the cells are responsible for associating cepstral coefficients with phones
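The three ingredients listed here (gates, a recurrent connection, squashing nonlinearities) can be seen in one time step of a standard LSTM cell. The sketch below uses a single cell with scalar input and state to keep it short. It follows the common textbook formulation without peepholes, which may differ in detail from the variant used in this work, and the weights are hypothetical values chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single LSTM cell with scalar input and state.

    `w` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    i = sigmoid(pre("input"))     # how much new information to write
    f = sigmoid(pre("forget"))    # how much old cell state to keep
    o = sigmoid(pre("output"))    # how much of the state to expose
    g = math.tanh(pre("cell"))    # squashed candidate value
    c = f * c_prev + i * g        # recurrent cell-state update
    h = o * math.tanh(c)          # squashed output
    return h, c


# Hypothetical weights chosen only for illustration.
weights = {name: (0.5, 0.5, 0.0)
           for name in ("input", "forget", "output", "cell")}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c, weights)
    print(round(h, 4), round(c, 4))
```

The previous output `h_prev` feeds back into every gate, which is the recurrent connection, and the cell state `c` carries context across frames.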
Methodology: Training with LSTM (3/3)
Methodology: Postprocessing
➢ A phone found in consecutive frames is strong evidence that it is present
➢ How many consecutive frames? That depends on a threshold T
Methodology: Postprocessing
➢ An example: (Labeled: Sh,Sh,Sh,O,O,O,L,L,O,O,O)
    • 11 frames
➢ Network output: S,Sh,Sh,O,O,O,L,L,O,O,O
➢ Initial noise elimination: _,Sh,Sh,O,O,O,L,L,O,O,O
Methodology: Postprocessing
Final output: "ShOLO"
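One way to get from the per-frame network output to this final string is sketched below. It is an assumed reading of the postprocessing step, not the authors' code: runs of identical labels shorter than the threshold T are dropped as noise, each surviving run contributes one phone, and identical phones left adjacent after noise removal are merged. The function name `decode_frames` and the value T = 2 are assumptions.

```python
from itertools import groupby


def decode_frames(frame_labels, threshold):
    """Collapse per-frame phone labels into a phone sequence.

    Runs of identical labels shorter than `threshold` frames are treated
    as noise and dropped; surviving runs contribute one phone each.
    Identical phones left adjacent after noise removal are merged.
    """
    phones = []
    for phone, run in groupby(frame_labels):
        if len(list(run)) < threshold:
            continue  # too short to count as evidence
        if not phones or phones[-1] != phone:
            phones.append(phone)
    return phones


network_output = ["S", "Sh", "Sh", "O", "O", "O", "L", "L", "O", "O", "O"]
print("".join(decode_frames(network_output, threshold=2)))  # ShOLO
```

With T = 2 the single leading "S" frame is discarded, as in the noise elimination step of the example. A larger T would also discard the two-frame "Sh" and "L" runs, which is why the choice of threshold matters.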
Analysis: Evaluation Metrics
➢ Phone detection error rate (P_err)
➢ Word detection error rate (W_err)
➢ Bengali Real Number Dataset (Nahid et al., 2016)
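The slides do not define the two error rates, so the sketch below is only one plausible reading: the phone detection error rate is taken as the fraction of frames whose predicted phone differs from the label, and the word detection error rate as the fraction of words whose decoded phone string differs from the reference. Both function names and both definitions are assumptions.

```python
def phone_error_rate(predicted, labeled):
    """Fraction of frames whose predicted phone differs from the label."""
    if len(predicted) != len(labeled):
        raise ValueError("sequences must have the same number of frames")
    wrong = sum(p != l for p, l in zip(predicted, labeled))
    return wrong / len(labeled)


def word_error_rate(decoded_words, reference_words):
    """Fraction of words whose decoded string differs from the reference."""
    if len(decoded_words) != len(reference_words):
        raise ValueError("word lists must have the same length")
    wrong = sum(d != r for d, r in zip(decoded_words, reference_words))
    return wrong / len(reference_words)


labeled = ["Sh", "Sh", "Sh", "O", "O", "O", "L", "L", "O", "O", "O"]
predicted = ["S", "Sh", "Sh", "O", "O", "O", "L", "L", "O", "O", "O"]
print(phone_error_rate(predicted, labeled))   # 1 of 11 frames differs
print(word_error_rate(["ShOLO"], ["ShOLO"]))  # 0.0
```

Under this reading a word can be recognized correctly even when some of its frames are mislabeled, since postprocessing may remove the mislabeled frames.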
Analysis: How the Model Learns
➢ Proposed model: 28.7% P_err, 13.2% W_err
➢ CMU Sphinx4 incurred 15% W_err (Nahid et al., 2016)
Analysis: Phone Labeling
➢ Some phones are almost always detected with higher confidence (e.g., 'A', 'O', 'L')
    • They occurred more frequently
    • No other phone is pronounced similarly
➢ Some phones are frequently garbled
Analysis: Threshold Affects Learning
➢ It seems that, on average, there are 5 phones in a word in the dataset
➢ Robustness not checked
Limitations
➢ Phone alignment is very difficult
    • The same example was introduced thrice, but with different alignments
➢ Only in-vocabulary words are recognized
    • 114 unique words
➢ Robustness in threshold selection was not tested
Conclusion
➢ An RNN architecture based on phone detection is proposed
➢ Bengali real numbers are recognized with higher accuracy
➢ The vocabulary can be increased with larger datasets
➢ How to align phones can be learned too
References
• Nahid, M. M. H., Islam, M. A., & Islam, M. S. (2016, May). A noble approach for recognizing Bangla real number automatically using CMU Sphinx4. In Informatics, Electronics and Vision (ICIEV), 2016.
• Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE.
• Ravuri, S., & Wegmann, S. (2016). How neural network features and depth modify statistical properties of HMM acoustic models. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE.
Thank you
Phoneme List