Aligning Audiovisual Features for Audiovisual Speech Recognition
Fei Tao and Carlos Busso
Multimodal Signal Processing (MSP) Laboratory, Department of Electrical Engineering, The University of Texas at Dallas, Richardson TX-75080, USA
Neti et al. [2000]: GMM-HMM
Ngiam et al. [2011]: Multimodal deep learning
Petridis et al. [2018]: End-to-end AV-ASR
[Figure: audio and video streams over time, showing the phase difference between lip movement and acoustic activity]
[Figure: diagram with labels Y(t), Y(t+1); h(1), h(2), h(3), h(T); a(1), a(2), a(3), A(T)]
[Figure: labels Aligned, Visual, Audio, Audiovisual]
References:
“Audio-visual speech recognition,” Workshop 2000 Final Report, Technical Report 764, October 2000.
conference on machine learning (ICML2011), Bellevue, WA, USA, June-July 2011, pp. 689–696.
speech recognition. arXiv preprint arXiv:1802.06424
Transactions on Audio, Speech and Language Processing, vol. 14, no. 3, pp. 1082–1089, May 2006.
Acoustics, Speech, and Signal Processing (ICASSP 1994), Adelaide, Australia, April 1994, vol. 2, pp. 669–672.
detection using Bayesian information criterion,” in Interspeech 2016, San Francisco, CA, USA, September 2016, pp. 2130–2134.