

SLIDE 1

Bengali Speech Recognition: A Double Layered LSTM-RNN Approach

Md Mahadi Hasan Nahid, Bishwajit Purkaystha, Md Saiful Islam Department of Computer Science and Engineering Shahjalal University of Science and Technology, Sylhet-3114.

SLIDE 4

➢ Provides efficient communication between humans and machines
➢ Increases throughput
➢ Is the most natural way of communication

Image sources: [1] www.glasbergen.com [2] www.playbuzz.com

Speech Recognition Is Important

SLIDE 7

➢ Intuitive for humans, not for machines

We say, "Let there be light." The machine hears only a raw signal,

with different pitches, accents, durations, and noise levels.

Speech Recognition Is Difficult

SLIDE 10

➢ A recurrent neural network (RNN) architecture is proposed

  • double-layered bidirectional long short-term memory (LSTM) is used

➢ Individual phonemes are detected

  • to some extent

➢ Results are compared with other methods for Bengali automatic speech recognition

Our Contribution

SLIDE 11

➢ Preprocessing

  • noise reduction, mel-frequency cepstral coefficient (MFCC) extraction

➢ Training

➢ Postprocessing

Methodology

SLIDE 16

➢ Speech signal is divided into a number of frames

  • each frame has 13 features (i.e., MFCCs)

➢ An example word: (pronounced as: SOTERO)

  • S-O-T-E-R-O
  • 5 distinct phones
  • 17 frames
  • SSS-OO-TTT-EEEE-RRR-OO
  • roughly 3 frames for each phone

Methodology: Preprocessing (1/2)
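The framing step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code; the `frame_signal` helper and the 25 ms / 10 ms frame settings are common MFCC-pipeline defaults assumed here.

```python
import numpy as np

def frame_signal(signal, frame_len, hop_len):
    # Slice a 1-D signal into (possibly overlapping) fixed-length frames;
    # each frame would then yield 13 MFCCs in the pipeline above.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# 16 kHz audio with 25 ms frames and a 10 ms hop (typical MFCC settings)
audio = np.zeros(16000)                      # one second of silence as a stand-in
frames = frame_signal(audio, frame_len=400, hop_len=160)
print(frames.shape)                          # → (98, 400)
```

Each of the 98 frames would be reduced to a 13-dimensional MFCC vector before entering the network.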

SLIDE 17

SSS-OO-TTT-EEEE-RRR-OO

Methodology: Preprocessing (2/2)

SLIDE 21

➢ Each time step carries a frame
➢ Input layer contains 13 units

  • one for each coefficient

➢ Next two layers are recurrent

  • each contains 100 LSTM cells

➢ The final layer is a softmax layer with 30 units

  • there are 30 phonemes in total

Methodology: Training with LSTM (1/3)

SLIDE 22

➢ An LSTM cell

  • exploits context using gates
  • has recurrent connections for temporal modeling
  • has nonlinear squashing activation functions for capturing complex relations

➢ Together, these are responsible for associating cepstral coefficients with phones

Methodology: Training with LSTM (2/3)
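The gating behaviour described above can be sketched as a single LSTM time step in NumPy. This is a minimal illustration with randomly initialized weights, not the trained network; the stacked-gate layout of `W`, `U`, `b` is one common convention, assumed here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    # One LSTM time step: gates decide what to forget, write, and expose.
    z = W @ x + U @ h + b              # all four gate pre-activations, stacked
    H = h.size
    i = sigmoid(z[0:H])                # input gate
    f = sigmoid(z[H:2*H])              # forget gate
    o = sigmoid(z[2*H:3*H])            # output gate
    g = np.tanh(z[3*H:4*H])            # candidate cell update
    c_new = f * c + i * g              # cell state: gated memory
    h_new = o * np.tanh(c_new)         # hidden state: squashed, gated output
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hid = 13, 100                  # 13 MFCCs in, 100 LSTM cells (as in the slides)
W = rng.normal(0, 0.1, (4 * n_hid, n_in))
U = rng.normal(0, 0.1, (4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```

Because the output is `o * tanh(c)`, every hidden activation stays strictly inside (-1, 1), which is the "squashing" the slide refers to.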

SLIDE 23

Methodology: Training with LSTM (3/3)

SLIDE 24

➢ A phone found in several consecutive frames is STRONG EVIDENCE that it is present
➢ How many frames? That depends on a threshold T

Methodology: Postprocessing


SLIDE 28

➢ An example: (Labeled: Sh,Sh,Sh,O,O,O,L,L,O,O,O)

  • 12 frames

➢ Network output: S,Sh,Sh,O,O,O,L,L,O,O,O
➢ Initial Noise Elimination: _,Sh,Sh,O,O,O,L,L,O,O,O

Methodology: Postprocessing


SLIDE 30

Final Output: "ShOLO"

Methodology: Postprocessing
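The whole postprocessing pass above can be sketched as a run-length filter. This is a minimal sketch; the threshold value (T = 2) and the exact noise-elimination rule are assumptions for illustration, not the paper's tuned settings.

```python
from itertools import groupby

def collapse_frames(frame_labels, threshold=2):
    # A phone is kept only if it spans at least `threshold` consecutive
    # frames (strong evidence); each surviving run collapses to one phone.
    runs = [(label, sum(1 for _ in group))
            for label, group in groupby(frame_labels)]
    return "".join(label for label, count in runs if count >= threshold)

# Network output from the slides: the leading lone 'S' is frame noise
output = ["S", "Sh", "Sh", "O", "O", "O", "L", "L", "O", "O", "O"]
print(collapse_frames(output))   # → ShOLO
```

The lone `S` run has length 1, falls below the threshold, and is dropped, reproducing the noise-elimination step on the slide.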

SLIDE 33

➢ Phone Detection Error rate (Perr)
➢ Word Detection Error rate (Werr)
➢ Bengali Real Number Dataset (Nahid et al., 2016)

Analysis: Evaluation Metrics
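As a rough sketch of how such error rates can be computed: the slides do not give the exact Perr/Werr formulas, so a simple per-item mismatch rate is assumed here, whereas the paper may use an edit-distance-based rate. The example words are hypothetical.

```python
def error_rate(predicted, reference):
    # Fraction of reference items (phones or words) the recognizer got
    # wrong; assumes equal-length, position-aligned sequences.
    wrong = sum(p != r for p, r in zip(predicted, reference))
    return wrong / len(reference)

# Hypothetical word-level example: one of four words is misrecognized
ref  = ["SOTERO", "ShOLO", "EK", "DUI"]
pred = ["SOTERO", "SOLO",  "EK", "DUI"]
print(error_rate(pred, ref))     # → 0.25
```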

SLIDE 34

➢ 28.7% Perr, 13.2% Werr
➢ CMU Sphinx4 incurred 15% Werr (Nahid et al., 2016)

Analysis: How Model Learns

SLIDE 36

➢ Some phones are almost always detected with higher confidence (e.g., 'A', 'O', 'L')

  • they occurred more frequently
  • no other phone is pronounced similarly

➢ Some phones are frequently garbled

Analysis: Phone Labeling

SLIDE 37

➢ On average, there seem to be 5 phones per word in the dataset
➢ Robustness not checked

Analysis: Threshold Affects Learning

SLIDE 38

➢ Phone alignment is very difficult

  • the same example is introduced thrice, but with different alignments

➢ Only vocabulary words are recognized

  • 114 unique words

➢ Robustness in threshold selection not tested

Limitations

SLIDE 39

➢ An RNN architecture is proposed based on phone detection
➢ Bengali real numbers are recognized with higher accuracy
➢ The vocabulary can be increased with larger datasets
➢ How to align phones can be learned too

Conclusion

SLIDE 40
  • Nahid, M. M. H., Islam, M. A., & Islam, M. S. (2016). A noble approach for recognizing Bangla real number automatically using CMU Sphinx4. In Informatics, Electronics and Vision (ICIEV), 2016. IEEE.
  • Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013. IEEE.
  • Ravuri, S., & Wegmann, S. (2016). How neural network features and depth modify statistical properties of HMM acoustic models. In Acoustics, Speech and Signal Processing (ICASSP), 2016. IEEE.

References

SLIDE 41

Thank you

SLIDE 42

Phoneme List