SLIDE 1

Aligning Audiovisual Features for Audiovisual Speech Recognition

Fei Tao and Carlos Busso

Multimodal Signal Processing (MSP) Laboratory Department of Electrical Engineering, The University of Texas at Dallas, Richardson TX-75080, USA

SLIDE 2

  • Audiovisual approach for robust ASR
  • DNNs have emerged for AV-ASR:
  • Neti et al. [2000]: GMM-HMM
  • Ngiam et al. [2011]: multimodal deep learning
  • Petridis et al. [2018]: end-to-end AV-ASR

SLIDE 3

  • Fusing audiovisual features has followed a static fashion
  • Linear interpolation (or extrapolation) is used to align the modalities
  • The audiovisual modalities are fused at the decision, model, or feature level

Introduction

How to align audiovisual modalities?

[Figure: audio and video streams over time, showing a phase difference between the modalities]

SLIDE 4

  • There is a phase difference between lip motion and speech [Tao et al., 2016]
  • Bregler and Konig [1994] showed that the best alignment was obtained with a shift of 120 milliseconds
  • However, the phase is time-variant, so a fixed shift may not be the optimal approach

Motivation

[Figure: phase difference between lip movement and acoustic activity]

SLIDE 5

  • Audiovisual features are concatenated frame-by-frame:
  • For some phonemes, lip movements precede speech production
  • For other phonemes, speech production precedes lip movements
  • In some cases, the audiovisual modalities are well aligned [Hazen, 2006]
  • For example, the burst release of /b/
  • Co-articulation effects and articulator inertia may cause phase differences
  • Lip movement can precede the audio for phoneme /m/ in the transition from /g/ to /m/ (e.g., the word "segment")

Motivation

SLIDE 6

  • Deep learning for audiovisual ASR:
  • Ninomiya et al. (2015) extracted bottleneck features for audiovisual fusion
  • Ngiam et al. (2011) proposed a bimodal DNN for fusing the audiovisual modalities
  • Tao et al. (2017) extended this to a bimodal RNN for audiovisual speech activity detection (AV-SAD), modeling audiovisual temporal information
  • These approaches rely on linear interpolation to align the audiovisual features

Deep Learning for Audiovisual ASR

Proposed Approach: Learn the alignment automatically from data using an attention model

SLIDE 7

Outline

  • 1. Introduction
  • 2. Proposed Approach
  • 3. Corpus Description
  • 4. Experiments and Results
  • 5. Conclusion
SLIDE 8

  • The proposed approach relies on an attention model
  • The attention model learns the alignment in sequence-to-sequence learning
  • The output is represented as a linear combination of the input at all time points
  • The weights of the linear combination are learned following a data-driven framework (see the sketch below)

Proposed Framework

[Diagram: outputs Y(t), Y(t+1), ... computed from hidden states h(1), ..., h(T) weighted by attention weights a(1), ..., a(T); input and output sequences can have different lengths]
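The slide describes the attention mechanism only at this level of detail. The snippet below is a minimal NumPy sketch of the core idea (each output frame is a learned linear combination of all input time steps), using a simple dot-product score for illustration; the function names and the scoring function are assumptions, not the exact AliNN formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_align(source, queries):
    """Soft attention: each output frame is a weighted sum of all source frames.

    source:  (T_src, D) hidden states of the input sequence (e.g., visual frames)
    queries: (T_out, D) states defining the target time axis (e.g., audio frames)
    Returns the aligned sequence (T_out, D) and the attention matrix (T_out, T_src).
    """
    T_out, T_src = queries.shape[0], source.shape[0]
    aligned = np.zeros((T_out, source.shape[1]))
    weights = np.zeros((T_out, T_src))
    for t in range(T_out):
        scores = source @ queries[t]        # dot-product score against every source frame
        a = softmax(scores)                 # attention weights a(1), ..., a(T_src)
        aligned[t] = a @ source             # output as a linear combination of the inputs
        weights[t] = a
    return aligned, weights
```

With visual hidden states as the source and audio-rate states as the queries, the output is a visual sequence resampled to the audio time axis, which is the alignment behavior the attention weights are trained to capture.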

SLIDE 9

Alignment Neural Network (AliNN)

[Diagram: temporal alignment module followed by a feature space transform]

SLIDE 10

Alignment Neural Network (AliNN)

[Diagram: temporal alignment module followed by a feature space transform]

SLIDE 11

Alignment Neural Network (AliNN)

[Diagram: temporal alignment module]

SLIDE 12

Alignment Neural Network (AliNN)

[Diagram: temporal alignment module followed by a feature space transform]

SLIDE 13

Alignment Neural Network (AliNN)

[Diagram: training stage with a regression objective]

SLIDE 14

Alignment Neural Network (AliNN)

[Diagram: extraction stage; the aligned visual features are combined with the audio features to form the audiovisual features]
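The diagram only labels the extraction stage; the following is a minimal sketch, assuming the front-end returns visual features already resampled to the audio frame rate, of how the aligned visual stream would be concatenated frame-by-frame with the audio features (the function name is hypothetical).

```python
import numpy as np

def build_audiovisual_features(audio_feats, aligned_visual_feats):
    """Concatenate audio and aligned visual features frame-by-frame.

    audio_feats:          (T, D_a) e.g., 13D MFCCs at 100 fps
    aligned_visual_feats: (T, D_v) visual features aligned to the audio frames
    Returns a (T, D_a + D_v) array used as input to the ASR back-end.
    """
    T = min(len(audio_feats), len(aligned_visual_feats))  # guard against off-by-one lengths
    return np.concatenate([audio_feats[:T], aligned_visual_feats[:T]], axis=1)
```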

SLIDE 15

  • Training AliNN on the whole utterance is computationally expensive
  • We segment each utterance into small sections
  • Each segment is 1 second long, shifted by 0.5 seconds
  • Sequences are padded with zeros if needed (see the sketch below)

Training AliNN

[Diagram: 1-second segments with a 0.5-second shift; the last segment is zero-padded]
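A minimal sketch of the segmentation described above (1-second windows, 0.5-second shift, zero padding for the final window); the helper name and frame-rate handling are assumptions for illustration.

```python
import numpy as np

def segment_utterance(feats, fps, seg_len_sec=1.0, shift_sec=0.5):
    """Cut a (T, D) feature matrix into fixed-length, overlapping segments.

    feats: (T, D) frame-level features of one utterance
    fps:   frame rate of the features (frames per second)
    Returns a list of (seg_len, D) arrays; the last segment is zero-padded if needed.
    """
    seg_len = int(seg_len_sec * fps)
    shift = int(shift_sec * fps)
    segments = []
    for start in range(0, max(len(feats), 1), shift):
        chunk = feats[start:start + seg_len]
        if len(chunk) == 0:
            break
        if len(chunk) < seg_len:
            # zero padding for the last, shorter segment
            pad = np.zeros((seg_len - len(chunk), feats.shape[1]))
            chunk = np.vstack([chunk, pad])
        segments.append(chunk)
        if start + seg_len >= len(feats):
            break
    return segments
```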

SLIDE 16

  • CRSS-4ENGLISH-14 corpus:
  • 55 females and 50 males (60 hrs and 48 mins)
  • Ideal condition: high-definition camera and close-talk microphone
  • Challenge condition: tablet camera and tablet microphone
  • Clean section (read and spontaneous speech) and noisy section (subset of read speech)

Corpus Description

SLIDE 17

  • Audio features: 13D MFCCs (100 fps)
  • Visual features: 25D DCT + 5D geometric distance features (a feature-extraction sketch follows below)
  • 30 fps for the high-definition camera
  • 24 fps for the tablet camera

Audiovisual Features
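The slide lists the feature types without extraction details. Below is a rough, illustrative sketch using librosa for the MFCCs and a 2-D DCT of a grayscale mouth region of interest; the ROI preprocessing, the ordering of the retained DCT coefficients, and the 5D geometric distances are not specified on the slide, so the helper functions and parameter choices here are assumptions.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def extract_mfcc(wav_path, n_mfcc=13, target_fps=100):
    """13D MFCCs at roughly 100 frames per second."""
    y, sr = librosa.load(wav_path, sr=None)
    hop = sr // target_fps                      # hop length that yields ~100 fps
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    return mfcc.T                               # (T, 13)

def extract_dct_visual(mouth_roi, n_coeff=25):
    """DCT features from one grayscale mouth region-of-interest image."""
    # 2-D DCT of the ROI; keep low-frequency coefficients from the top-left block
    d = dct(dct(mouth_roi, axis=0, norm='ortho'), axis=1, norm='ortho')
    k = int(np.ceil(np.sqrt(n_coeff)))
    return d[:k, :k].flatten()[:n_coeff]        # (25,)
```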

SLIDE 18

  • 70 speakers for training, 10 for validation, 25 for testing
  • Gender balanced
  • Training on the ideal condition under the clean environment
  • Testing on different conditions under different environments
  • Two back-ends:
  • GMM-HMM: features augmented with delta and delta-delta information
  • DNN-HMM: 15 context frames
  • Tablet data (24 fps) is linearly interpolated to 30 fps
  • Linear interpolation used as the pre-processing baseline (see the sketch below)
  • Evaluation metric: word error rate (WER)

Experiment Setting
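A minimal sketch of the linear-interpolation baseline used for pre-processing, e.g., resampling 24 fps tablet video features to 30 fps; the function name and the exact resampling grid are assumptions.

```python
import numpy as np

def resample_visual_features(feats, src_fps, dst_fps):
    """Linearly interpolate frame-level features from src_fps to dst_fps.

    feats: (T, D) visual features (e.g., tablet video features at 24 fps)
    Returns the features resampled to dst_fps (e.g., 30 fps).
    """
    T, D = feats.shape
    src_times = np.arange(T) / src_fps
    dst_times = np.arange(int(round(T * dst_fps / src_fps))) / dst_fps
    out = np.empty((len(dst_times), D))
    for d in range(D):
        # per-dimension 1-D linear interpolation onto the target time grid
        out[:, d] = np.interp(dst_times, src_times, feats[:, d])
    return out
```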

SLIDE 19

  • Under the ideal condition, the proposed front-end always achieves the best performance
  • Under the tablet condition, the proposed front-end achieves the best performance, except with the GMM-HMM back-end
  • Linearly interpolating the tablet data to 30 fps may reduce the advantage of AliNN

Experiment Results

Front-end   Model     Ideal, Clean [WER]   Ideal, Noise [WER]   Tablet, Clean [WER]   Tablet, Noise [WER]
LInterp     GMM-HMM   23.3                 24.2                 24.7                  30.7
AliNN       GMM-HMM   17.5                 19.2                 22.7                  35.6
LInterp     DNN-HMM   4.2                  4.9                  15.5                  15.9
AliNN       DNN-HMM   4.1                  4.5                  4.6                   10.0

SLIDE 20

Results Analysis

Front-end   Model     Ideal, Clean [WER]   Ideal, Noise [WER]   Tablet, Clean [WER]   Tablet, Noise [WER]
LInterp     GMM-HMM   23.3                 24.2                 24.7                  30.7
AliNN       GMM-HMM   17.5                 19.2                 22.7                  35.6
LInterp     DNN-HMM   4.2                  4.9                  15.5                  15.9
AliNN       DNN-HMM   4.1                  4.5                  4.6                   10.0

[Figure: per-condition comparison for the ideal and tablet recordings]

SLIDE 21

  • This study proposed the alignment neural network (AliNN)
  • It learns the alignment between the audio and visual modalities from data
  • It does not need alignment or task labels
  • The proposed front-end is evaluated on the CRSS-4ENGLISH-14 corpus
  • A large corpus for AV-LVASR (over 60 hours)
  • The proposed front-end outperforms simple linear interpolation under various conditions
  • Future work will extend the approach to an end-to-end framework

Conclusions

SLIDE 22

References:

  • C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, and J. Zhou, “Audio-visual speech recognition,” Workshop 2000 Final Report, Technical Report 764, October 2000.
  • J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng, “Multimodal deep learning,” in International Conference on Machine Learning (ICML 2011), Bellevue, WA, USA, June-July 2011, pp. 689–696.
  • S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, “End-to-end audiovisual speech recognition,” arXiv preprint arXiv:1802.06424, 2018.
  • T.J. Hazen, “Visual model structures and synchrony constraints for audio-visual speech recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 3, pp. 1082–1089, May 2006.
  • C. Bregler and Y. Konig, ““Eigenlips” for robust speech recognition,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1994), Adelaide, Australia, April 1994, vol. 2, pp. 669–672.
  • F. Tao, J.H.L. Hansen, and C. Busso, “Improving boundary estimation in audiovisual speech activity detection using Bayesian information criterion,” in Interspeech 2016, San Francisco, CA, USA, September 2016, pp. 2130–2134.