SLIDE 1 Bengali Speech Recognition: A Double Layered LSTM-RNN Approach
Md Mahadi Hasan Nahid, Bishwajit Purkaystha, Md Saiful Islam Department of Computer Science and Engineering Shahjalal University of Science and Technology, Sylhet-3114.
SLIDE 2
➢ Provides efficient communication between humans and machine
Speech Recognition Is Important
SLIDE 3
➢ Provides efficient communication between humans and machine ➢ Increases throughput
Speech Recognition Is Important
SLIDE 4 Speech Recognition Is Important
➢ Provides efficient communication between humans and machines
➢ Increases throughput
➢ Is the most natural way of communicating
Image sources: [1] www.glasbergen.com [2] www.playbuzz.com
SLIDE 7 Speech Recognition Is Difficult
➢ Intuitive for humans, not for machines
We say, "Let there be light". Machine hears: [raw audio waveform]
➢ Different pitches, accents, durations, noise levels
SLIDE 10 Our Contribution
➢ A recurrent neural network (RNN) architecture is proposed
- double-layered bidirectional long short-term memory (LSTM) used
➢ Individual phonemes detected
➢ Results compared with other Bengali automatic speech recognition methods
SLIDE 11 Methodology
➢ Preprocessing
- noise reduction, mel-frequency cepstral coefficient (MFCC) extraction
➢ Training
➢ Postprocessing
SLIDE 16 Methodology: Preprocessing (1/2)
➢ Speech signal is divided into a number of frames
- each frame has 13 features (i.e., MFCCs)
➢ An example word (pronounced: SOTERO)
- S-O-T-E-R-O
- 5 distinct phones (S, O, T, E, R)
- 17 frames
- SSS-OO-TTT-EEEE-RRR-OO
- roughly 3 frames for each phone
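The framing step described above can be sketched in a few lines of NumPy. The 25 ms frame and 10 ms hop sizes are common defaults, not stated on the slides; a real pipeline would then compute 13 MFCCs per frame with a feature-extraction library.

```python
import numpy as np

# Hypothetical frame sizes (the slides only state 13 MFCCs per frame).
FRAME_LEN = 400   # 25 ms at 16 kHz -- an assumption
HOP_LEN = 160     # 10 ms hop -- also an assumption

def split_into_frames(signal, frame_len=FRAME_LEN, hop_len=HOP_LEN):
    """Split a 1-D speech signal into overlapping frames."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# A real pipeline would now compute 13 MFCCs per frame;
# here we only show the framing shapes.
signal = np.random.randn(16000)   # 1 s of fake audio at 16 kHz
frames = split_into_frames(signal)
print(frames.shape)               # (98, 400)
```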
SLIDE 17 Methodology: Preprocessing (2/2)
SSS-OO-TTT-EEEE-RRR-OO
SLIDE 21 Methodology: Training with LSTM (1/3)
➢ Each time step carries one frame
➢ Input layer contains 13 units
➢ Next two layers are recurrent
- each contains 100 LSTM cells
➢ The final layer is a softmax layer with 30 units
- there are 30 phonemes in total
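Given these layer sizes, a rough parameter count can be worked out. This is a sketch assuming standard unidirectional LSTM layers; the bidirectional variant mentioned earlier would roughly double the recurrent weight counts.

```python
def lstm_params(n_in, n_hidden):
    """Weights for the 4 gates of a standard LSTM layer:
    each gate sees the input, the previous hidden state, and a bias."""
    return 4 * (n_in + n_hidden + 1) * n_hidden

# Layer sizes from the slide: 13 MFCC inputs, two LSTM layers of
# 100 cells each, 30-unit softmax output (one unit per phoneme).
layer1 = lstm_params(13, 100)    # 45,600
layer2 = lstm_params(100, 100)   # 80,400
softmax = (100 + 1) * 30         # 3,030
print(layer1 + layer2 + softmax)  # 129,030 (unidirectional estimate)
```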
SLIDE 22 Methodology: Training with LSTM (2/3)
➢ An LSTM cell
- exploits context using gates
- has recurrent connections for temporal modeling
- has nonlinear squashing activation functions for capturing complex relations
➢ Together, these are responsible for associating cepstral coefficients with phones
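A single LSTM time step, as described above, can be sketched in NumPy. This is a generic textbook LSTM cell, not the authors' exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W: (4H, D+H), b: (4H,).
    The gates 'exploit context'; tanh is the squashing nonlinearity."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0*H:1*H])        # input gate
    f = sigmoid(z[1*H:2*H])        # forget gate
    o = sigmoid(z[2*H:3*H])        # output gate
    g = np.tanh(z[3*H:4*H])        # candidate cell state
    c = f * c_prev + i * g         # recurrent cell update (temporal modeling)
    h = o * np.tanh(c)             # new hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 13, 100                     # 13 MFCCs in, 100 cells (as on the slide)
W = rng.standard_normal((4 * H, D + H)) * 0.1
b = np.zeros(4 * H)
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
print(h.shape)                     # (100,)
```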
SLIDE 23
Methodology: Training with LSTM (3/3)
SLIDE 24 Methodology: Postprocessing
➢ A phone found in several consecutive frames is strong evidence that it is present. How many frames?
➢ Depends on a threshold T
SLIDE 28 Methodology: Postprocessing
➢ An example (labeled): Sh,Sh,Sh,O,O,O,L,L,O,O,O
➢ Network output: S,Sh,Sh,O,O,O,L,L,O,O,O
➢ Initial noise elimination: _,Sh,Sh,O,O,O,L,L,O,O,O
SLIDE 30 Methodology: Postprocessing
Final output: "ShOLO"
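The postprocessing rule (keep a phone only if it appears in at least T consecutive frames) can be sketched as follows; T = 2 here is illustrative, since the slides leave T as a tunable threshold.

```python
def postprocess(frame_labels, T=2):
    """Collapse per-frame phone labels into a phone sequence: a phone
    is accepted only if its run spans at least T consecutive frames."""
    phones, run_label, run_len = [], None, 0
    for lab in frame_labels + [None]:      # sentinel flushes the last run
        if lab == run_label:
            run_len += 1
        else:
            if run_label is not None and run_len >= T:
                phones.append(run_label)
            run_label, run_len = lab, 1
    return "".join(phones)

# The deck's example: the stray leading 'S' is a one-frame run,
# so the threshold filters it out as initial noise.
out = postprocess(["S","Sh","Sh","O","O","O","L","L","O","O","O"])
print(out)  # ShOLO
```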
SLIDE 33 Analysis: Evaluation Metrics
➢ Phone detection error rate (Perr)
➢ Word detection error rate (Werr)
➢ Bengali Real Number Dataset (Nahid et al., 2016)
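The slides do not give formulas for Perr and Werr; one common choice is edit (Levenshtein) distance normalized by reference length, sketched here under that assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution
    return d[-1]

def error_rate(ref, hyp):
    """Edit distance normalized by reference length -- one common way
    to define a phone/word error rate (not stated on the slides)."""
    return edit_distance(ref, hyp) / len(ref)

# Illustrative phone-level example: one substitution out of four phones.
print(error_rate(["Sh","O","L","O"], ["S","O","L","O"]))  # 0.25
```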
SLIDE 34
➢ 28.7% Perr 13.2% Werr ➢ CMU Sphinx4 incurred 15% Werr (Nahid et al., 2016)
Analysis: How Model Learns
SLIDE 36 Analysis: Phone Labeling
➢ Some phones almost always detected with higher confidence (e.g., 'A', 'O', 'L')
- they occurred more frequently
- no other phone is pronounced similarly
➢ Some phones frequently garbled
SLIDE 37 Analysis: Threshold Affects Learning
➢ It seems that, on average, there are 5 phones per word in the dataset
➢ Robustness not checked
SLIDE 38 Limitations
➢ Phone alignment is very difficult
- the same example was introduced three times, but with different alignments
➢ Only vocabulary words recognized
➢ Robustness in threshold selection not tested
SLIDE 39 Conclusion
➢ An RNN architecture is proposed based on phone detection
➢ Bengali real numbers are recognized with higher accuracy than CMU Sphinx4
➢ The vocabulary can be increased with larger datasets
➢ Phone alignment could also be learned
SLIDE 40 References
- Nahid, M. M. H., Islam, M. A., & Islam, M. S. (2016). A noble approach for recognizing Bangla real number automatically using CMU Sphinx4. In Informatics, Electronics and Vision (ICIEV), 2016.
- Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE.
- Ravuri, S., & Wegmann, S. (2016). How neural network features and depth modify statistical properties of HMM acoustic models. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE.
References
SLIDE 41
Thank you
SLIDE 42
Phoneme List