First experiments in audio/video features for phoneme recognition
Petr Motlíček, FIT VUT Brno, motlicek@fit.vutbr.cz
M4 meeting in Prague, January 22nd - 23rd 2004
Introduction
- Data: M4 – IDIAP, 41 min., audio-video data (training, testing).
- Labels: 47 phoneme categories, obtained by forced alignment (models trained
on ICSI data, adapted to M4 data).
- Audio: Beam-formed recordings, 16 kHz.
- Video: Frames cropped to the head positions.
Audio preprocessing
- Sampling frequency Fs = 16 kHz, frame rate 100 Hz, 20 ms frames of mel filter bank (MFB) log energies.
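The framing above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the Hamming window, the 512-point FFT, the mel-scale formula and the triangular filter shape are all assumptions, and the function names are hypothetical. The 23 bands are taken from the 23-dimensional acoustic features shown in the system diagram.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale formula (assumed; the slides do not state one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_bands, n_fft, fs):
    """Triangular filters with edges equally spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_bands + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    bank = np.zeros((n_bands, freqs.size))
    for b in range(n_bands):
        lo, mid, hi = edges_hz[b], edges_hz[b + 1], edges_hz[b + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[b] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mfb_energies(signal, fs=16000, frame_ms=20, rate_hz=100,
                     n_bands=23, n_fft=512):
    """Return an array of shape (n_frames, n_bands) of log filter bank energies."""
    frame_len = fs * frame_ms // 1000   # 320 samples per 20 ms frame
    hop = fs // rate_hz                 # 160 samples between frame starts (100 Hz)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    energies = power @ mel_filter_bank(n_bands, n_fft, fs).T
    return np.log(energies + 1e-10)     # small floor avoids log(0)
```

With a 20 ms frame and a 10 ms step, consecutive frames overlap by half, which is what a 100 Hz frame rate implies at these settings.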
Video preprocessing
- Frame rate 25 Hz, RGB frames of 70x70 pixels.
Bimodal speech recognition system
[Figure: block diagram of the bimodal system, with example image plots over time. Recoverable labels:]
- Acoustic branch: Audio signal (16 kHz), Acoustic parameterization, Acoustic features (23 dim., 100 Hz).
- Visual branch: Visual signal (25 Hz), Visual parameterization, Visual features (16 dim., 25 Hz), Interpolation, Visual features (16 dim., 25 Hz).
- Visual parameterization blocks: Gray scalization, Resize, Edge calculation, 2D cross-correlation, LPF, 2D-DCT, Maximum, Square cropping.
- Feature fusion: Acoustic-visual features (39 dim., 100 Hz), Neural Net, Recognition results.
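The interpolation and feature fusion steps of the diagram can be sketched as below. This is only an illustration under assumptions: the slides do not say which interpolation was used, so linear interpolation is assumed here, and the function names `upsample_visual` and `fuse` are hypothetical. The dimensions (23 acoustic, 16 visual, 39 fused) and rates (25 Hz, 100 Hz) are taken from the diagram labels.

```python
import numpy as np

def upsample_visual(visual, src_rate=25, dst_rate=100, n_out=None):
    """Linearly interpolate visual features (n_frames, dim) to dst_rate."""
    n_in, dim = visual.shape
    t_in = np.arange(n_in) / src_rate
    if n_out is None:
        n_out = n_in * dst_rate // src_rate
    t_out = np.arange(n_out) / dst_rate
    # np.interp holds the last value for times past the final visual frame.
    columns = [np.interp(t_out, t_in, visual[:, d]) for d in range(dim)]
    return np.stack(columns, axis=1)

def fuse(acoustic, visual):
    """Concatenate 100 Hz acoustic frames with upsampled visual frames."""
    visual_up = upsample_visual(visual, n_out=acoustic.shape[0])
    return np.hstack([acoustic, visual_up])
```

For one second of data, `fuse` maps a (100, 23) acoustic array and a (25, 16) visual array to a (100, 39) array, one fused vector per 10 ms frame.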
Recognition results - Accuracy
              Acoustic [%]   Visual [%]   Acoustic-Visual [%]
Phonemes         31.05         12.15           31.33
VAD              94.04         83.79           94.12
  - 0 (83%)      96.86         99.62           96.89
  - 1 (17%)      79.32          1.44           79.71
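A short worked check of how the VAD rows relate, assuming (this is not stated on the slide) that rows 0 and 1 are per-class accuracies and that the percentages in parentheses are the shares of frames in each class. Under that assumption the overall accuracy is the share-weighted mean of the per-class accuracies. The function name is hypothetical.

```python
def overall_accuracy(class_acc, class_share):
    """Share-weighted mean of per-class accuracies (all values in percent)."""
    total = sum(class_share)
    return sum(a * s for a, s in zip(class_acc, class_share)) / total

shares = [0.83, 0.17]                                   # rounded shares from the table
acoustic = overall_accuracy([96.86, 79.32], shares)     # about 93.9
visual = overall_accuracy([99.62, 1.44], shares)        # about 82.9
```

These values are close to the reported 94.04 and 83.79; the small gap would be consistent with the shares being rounded. Read this way, the visual-only VAD score sits near the share of class 0, since it recognizes almost no class 1 frames.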
Problems & Current focus
- More data for acoustic-visual experiments.
- Incorporation of a robust mouth detection algorithm.
- Compensation algorithms to reduce the effects of lighting variations, rotation, ...
- LDA - to reduce dimensionality and improve discrimination among the speech