Bengali Speech Recognition: A Double Layered LSTM-RNN Approach
Md Mahadi Hasan Nahid, Bishwajit Purkaystha, Md Saiful Islam
Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet-3114.

Speech Recognition Is Important
➢ Provides efficient communication between humans and machines
➢ Increases throughput
➢ Is the most natural way of communication
Image sources: [1] www.glasbergen.com [2] www.playbuzz.com

Speech Recognition Is Difficult
➢ Intuitive for humans, not for machines
  We say, "Let there be light". Machine hears,
➢ Different pitches, accents, durations, noise levels

Our Contribution
➢ A recurrent neural network (RNN) architecture proposed
  • Double layered bidirectional long short-term memory (LSTM) used
➢ Individual phonemes detected to some extent
➢ Compared results with other methods on Bengali automatic speech recognizer

Methodology
➢ Preprocessing
  • noise reduction, mel frequency cepstral coefficients (MFCC) extraction
➢ Training
➢ Postprocessing

Methodology: Preprocessing (1/2)
➢ Speech signal is divided into a number of frames
  • each frame has 13 features (i.e., MFCCs)
➢ An example word: (pronounced as: SOTERO)
  • S-O-T-E-R-O
  • 6 phones
  • 17 frames
  • SSS-OO-TTT-EEEE-RRR-OO
  • roughly 3 frames for each phone
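The framing step can be illustrated with a minimal sketch. The slides do not state the sample rate, frame length, or hop size, so the 16 kHz, 25 ms, and 10 ms values below are assumptions chosen only for illustration, and `split_into_frames` is a hypothetical helper. MFCC extraction itself is not shown; each frame would subsequently be reduced to 13 coefficients.

```python
def split_into_frames(samples, frame_len, hop):
    """Return overlapping frames of `frame_len` samples, advancing by `hop`."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


# Assumed values, not taken from the slides.
SAMPLE_RATE = 16000                     # samples per second
FRAME_LEN = SAMPLE_RATE * 25 // 1000    # 25 ms -> 400 samples
HOP = SAMPLE_RATE * 10 // 1000          # 10 ms -> 160 samples

# A silent dummy signal, sized so that it yields 17 frames like the example word.
signal = [0.0] * (FRAME_LEN + 16 * HOP)
frames = split_into_frames(signal, FRAME_LEN, HOP)
print(len(frames), len(frames[0]))
```

With overlapping frames, the number of frames grows with the hop size rather than the frame length, which is why a short word can still span many frames.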

Methodology: Preprocessing (2/2)
SSS-OO-TTT-EEEE-RRR-OO
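The alignment string above can be read as one training label per frame. The sketch below shows one way to expand it, assuming that every phone is written as a single character and that hyphens separate phones; `expand_alignment` is a hypothetical helper, not a function from the authors' code.

```python
def expand_alignment(alignment):
    """Turn 'SSS-OO-...' into (per-frame labels, phone sequence)."""
    frame_labels = []
    phones = []
    for segment in alignment.split("-"):
        # Each segment repeats one phone symbol once per frame.
        phones.append(segment[0])
        frame_labels.extend(segment)
    return frame_labels, phones


frame_labels, phones = expand_alignment("SSS-OO-TTT-EEEE-RRR-OO")
print(len(frame_labels), "frames:", frame_labels)
print("phones:", phones)
```

Each entry of `frame_labels` would be paired with the 13 MFCCs of the corresponding frame during training.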

Methodology: Training with LSTM (1/3)
➢ Each time step carries a frame
➢ Input layer contains 13 units
  • one for each coefficient
➢ Next two layers are recurrent
  • each contains 100 LSTM cells
➢ The final layer is a softmax layer with 30 units
  • there are 30 phonemes in total
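To get a feel for the size of this network, the layer dimensions can be turned into a rough parameter count. This is a sketch under assumptions that the slides do not confirm: standard LSTM cells with four gate blocks and one bias vector each (no peephole connections), 100 cells per direction in each bidirectional layer, and forward and backward outputs concatenated before the next layer.

```python
def lstm_params(input_size, hidden_size):
    """Weights of one unidirectional LSTM layer (assumed: 4 gate blocks, one bias each)."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


N_MFCC, N_CELLS, N_PHONEMES = 13, 100, 30

# Assumed: each recurrent layer is bidirectional, so it has two directions.
layer1 = 2 * lstm_params(N_MFCC, N_CELLS)
layer2 = 2 * lstm_params(2 * N_CELLS, N_CELLS)   # input is the concatenated output of layer 1
softmax = 2 * N_CELLS * N_PHONEMES + N_PHONEMES  # weights plus biases

total = layer1 + layer2 + softmax
print(layer1, layer2, softmax, total)
```

Under these assumptions most of the parameters sit in the second recurrent layer, because its input is 200-dimensional rather than 13-dimensional.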

Methodology: Training with LSTM (2/3)
➢ An LSTM cell
  • exploits context using gates
  • has recurrent connection for temporal modeling
  • has nonlinear squashing activation functions for capturing complex relations
➢ Together, responsible for associating cepstral coefficients to phones
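The three properties listed above (gates, a recurrent connection, squashing nonlinearities) can be seen in a single step of a textbook LSTM cell. The sketch below uses the common formulation without peephole connections and a scalar cell state; the slides do not say which LSTM variant was used, so this is illustrative only, and the weight containers `W`, `U`, `b` are hypothetical names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a single LSTM cell.

    x: input frame (list of floats); h_prev, c_prev: previous output and cell state.
    W, U, b: dicts keyed by gate name 'i', 'f', 'o', 'g' holding input weights,
    a recurrent weight, and a bias for each gate.
    """
    pre = {
        k: sum(w * xi for w, xi in zip(W[k], x)) + U[k] * h_prev + b[k]
        for k in "ifog"
    }
    i = sigmoid(pre["i"])     # input gate: how much new information to admit
    f = sigmoid(pre["f"])     # forget gate: how much old state to keep
    o = sigmoid(pre["o"])     # output gate: how much state to expose
    g = math.tanh(pre["g"])   # squashed candidate value
    c = f * c_prev + i * g    # recurrent cell state carries context forward
    h = o * math.tanh(c)      # squashed output
    return h, c


# With all-zero weights every gate sits at 0.5, so half the old state survives.
zero_W = {k: [0.0] * 13 for k in "ifog"}
zero_U = {k: 0.0 for k in "ifog"}
zero_b = {k: 0.0 for k in "ifog"}
h, c = lstm_step([0.0] * 13, h_prev=0.0, c_prev=1.0, W=zero_W, U=zero_U, b=zero_b)
print(h, c)
```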

Methodology: Training with LSTM (3/3)

Methodology: Postprocessing
➢ A phone found in consecutive frames is STRONG EVIDENCE that it is present. How many?
➢ Depends on threshold T

Methodology: Postprocessing
➢ An example: (Labeled: Sh,Sh,Sh,O,O,O,L,L,O,O,O)
  • 12 frames
➢ Network output: S,Sh,Sh,O,O,O,L,L,O,O,O
➢ Initial noise elimination: _,Sh,Sh,O,O,O,L,L,O,O,O
Final Output: "ShOLO"
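The postprocessing idea can be sketched as a run-length filter: a phone is accepted only if it appears in at least T consecutive frames, and shorter runs are treated as noise. The exact rule used by the authors is not given in the slides, so the function below (`decode_frames`, a hypothetical name) is one plausible reading, with T = 2 assumed for the example.

```python
from itertools import groupby


def decode_frames(frame_labels, threshold):
    """Keep each phone whose run of consecutive frames is at least `threshold` long."""
    phones = []
    for phone, run in groupby(frame_labels):
        if len(list(run)) >= threshold:
            phones.append(phone)
        # Shorter runs are discarded as noise (the "_" in the slide).
    return "".join(phones)


network_output = ["S", "Sh", "Sh", "O", "O", "O", "L", "L", "O", "O", "O"]
word = decode_frames(network_output, threshold=2)
print(word)
```

The choice of T matters: in this sketch, raising the threshold to 3 would also discard the two-frame runs of Sh and L, so the word would no longer be recovered.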

Analysis: Evaluation Metrics
➢ Phone Detection Error rate (P_err)
➢ Word Detection Error rate (W_err)
➢ Bengali Real Number Dataset (Nahid et al., 2016)
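The slides name the two metrics but do not define them. As a hedged illustration only, the sketch below assumes that P_err is the edit distance between the detected and reference phone sequences divided by the reference length, and that W_err is the fraction of words not recognized exactly; the authors' actual definitions may differ, and all function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phone_error_rate(ref_phones, hyp_phones):
    # Assumed definition: phone-level edits per reference phone.
    return edit_distance(ref_phones, hyp_phones) / len(ref_phones)


def word_error_rate(ref_words, hyp_words):
    # Assumed definition: share of isolated words that were not recognized exactly.
    wrong = sum(1 for r, h in zip(ref_words, hyp_words) if r != h)
    return wrong / len(ref_words)


p_err = phone_error_rate(["Sh", "O", "L", "O"], ["S", "O", "L", "O"])
w_err = word_error_rate(["ShOLO", "SOTERO"], ["ShOLO", "SOTER"])
print(p_err, w_err)
```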

Analysis: How Model Learns
➢ 28.7% P_err, 13.2% W_err
➢ CMU Sphinx4 incurred 15% W_err (Nahid et al., 2016)

Analysis: Phone labeling
➢ Some phones almost always detected with higher confidence (e.g., 'A', 'O', 'L')
  • they occurred more frequently
  • no other phone pronounced similarly
➢ Some phones frequently garbled

Analysis: Threshold Affects Learning
➢ It seems, on average, there are 5 phones in a word in the dataset
➢ Robustness not checked

Limitations
➢ Phone alignment very difficult
  • Same example introduced thrice, but with different alignment
➢ Only vocabulary words recognized
  • 114 unique words
➢ Robustness in threshold selection not tested

Conclusion
➢ An RNN architecture is proposed based on phone detection
➢ Bengali Real Numbers are recognized with higher accuracy
➢ The vocabulary can be increased on larger datasets
➢ How to align phones can be learned too

References
• Nahid, M. M. H., Islam, M. A., & Islam, M. S. (2016, May). A noble approach for recognizing Bangla real number automatically using CMU Sphinx4. In Informatics, Electronics and Vision (ICIEV), 2016.
• Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE.
• Ravuri, S., & Wegmann, S. (2016). How neural network features and depth modify statistical properties of HMM acoustic models. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE.

Thank you

Phoneme List
