Audio-Visual Automatic Speech Recognition: Theory, Applications, and Challenges
Gerasimos Potamianos
IBM T. J. Watson Research Center, Yorktown Heights, NY
Dec 1, 2005
[Diagram: audio and visual (labial) speech information are combined for improved ASR.]
Introduction and motivation – Cont.
The popular vision of machines with audio-visual speech processing capability (HAL, from the film 2001: A Space Odyssey):
HAL: "I know that you and Frank were planning to disconnect me, and I'm afraid that's something I cannot allow to happen."
Dave: "Where the hell did you get that idea, HAL?"
HAL: "Dave, although you took very thorough precautions in the pod against my hearing you, I could see your lips move."
(From HAL's Legacy, David G. Stork, ed., MIT Press: Cambridge, MA, 1997.)
I.A. Why audio-visual speech?
Schematic representation of speech production (J.L. Flanagan, Speech Analysis, Synthesis, and Perception, 2nd ed., Springer-Verlag, New York, 1972).
Several of the face muscles play a part in speech production and are visible.
The visible articulators play different roles in the production of the basic speech units, e.g., the lips for the bilabial phone set B = {/p/, /b/, /m/}.
Why audio-visual speech – Cont.
Human speech perception: visible speech helps improve intelligibility.
Classic experiment by Summerfield (1979): noisy word recognition at low SNR improves as more of the face becomes visible.
[Figure: word recognition (%) for audio only (A), A + 4 mouth points, A + lip region, A + full face (Summerfield, 1979).]
Visual speech can even alter the perception of conflicting audio, as demonstrated by the McGurk effect (McGurk and MacDonald, 1976).
Audio stimulus in a McGurk-style demonstration: "My bab pope me pu brive."
Why audio-visual speech – Cont.
Audio and visual speech features are correlated: thus, for example, one can partially recover one channel using information from the other.
[Figure: correlation between audio and visual features (Goecke et al., 2002).]
[Figure: correlation between original and estimated features; upper: visual estimated from audio (Au2Vi), lower: audio estimated from visual (Vi2Au), 4 speakers (Jiang et al., 2003).]
I.B. Audio-visual speech used in HCI
Audio-visual ASR: audio and visual inputs are processed into acoustic and visual features, which are combined by audio-visual integration to produce the spoken text, in contrast to audio-only ASR.
[Diagram: audio-visual ASR pipeline vs. audio-only ASR.]
Audio-visual speech used in HCI – Cont.
Audio-visual text-to-speech (TTS): text is rendered as combined audio and visual (talking-head) output; more intelligible than audio-only TTS.
Audio-visual speaker recognition: audio, visual (labial), and face cues serve to authenticate a speaker, and to determine who is talking.
[Diagrams: audio-visual TTS and audio-visual speaker recognition pipelines.]
Talk outline:
I. Introduction / motivation for AV speech.
II. Visual feature extraction for AV speech applications.
III. Audio-visual combination (fusion) for AV-ASR.
IV. Other AV speech applications.
V. Summary.
Experiments will be presented along the way.
II. Visual feature extraction. Key questions:
A. Where is the talking face in the video?
B. How to extract the speech-informative section of it?
C. What visual features to extract?
D. How valuable are they for recognizing human speech?
E. How do video degradations affect them?
[Diagram: face and facial feature tracking, region-of-interest extraction, visual features.]
II.A. Face and facial feature tracking.
Face detection: is there a face in the video, and if so, where?
Facial feature detection: locate the facial features (eyes, nostrils, lip corners). See for example the MPEG-4 facial animation parameters (FAPs).
Robust face and facial feature tracking is a prerequisite for incorporating audio-visual speech in HCI.
II.A.1. Appearance-based face detection.
Two approaches: feature-based (this slide) and classification-based (next).
Feature-based: detect the presence of typical face characteristics (mouth edges, nostrils, eyes, nose), e.g., by low-pass filtering, edge detection, and morphological filtering, then verify the detected features' relative location and orientation.
Candidate search can be constrained using skin-tone and motion information in color videos.
(From: Graf, Cosatto, and Potamianos, 1998.)
Appearance-based face detection – Cont.
Classification-based: cast face detection as a two-class classification problem (faces / non-faces).
Scan the image with candidate regions (e.g., an 11x11-pixel rectangle) over locations and scales.
Use a trained classifier to assign the candidates into the two classes.
Prune candidates based on color information (in the R,G,B or a transformed space), or on location/scale consistency (in the case of a video sequence), using thresholds or statistics.
Appearance-based face detection – Cont.
Statistical face models (for a face "vector" $\mathbf{x}$):
Principal components analysis (PCA): project onto a low-dimensional "face space", $\mathbf{y} = \mathbf{P}\,\mathbf{x}$; the projection $\mathbf{P}$ is trainable on the basis of annotated (face / non-face) data vectors.
Score each candidate vector by its distance from face space (DFFS):
$\mathrm{DFFS} = \|\, \mathbf{x} - \mathbf{P}^{T}\mathbf{y} \,\|^{2} \;\lessgtr\; \theta_{\mathrm{DFFS}} \;\Rightarrow\; \text{Face / Non-face}.$
[Image: example PCA eigenvectors.]
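As a concrete illustration, here is a minimal numpy sketch of PCA face-space training and DFFS scoring; the mean-centering step, patch size, dimensionality d, and threshold are illustrative assumptions, not the presenter's settings:

import numpy as np

def train_face_space(face_vectors, d=20):
    """PCA face space: returns the training mean and a d x D projection P."""
    mu = face_vectors.mean(axis=0)
    # Rows of Vt are orthonormal principal directions of the centered data.
    _, _, Vt = np.linalg.svd(face_vectors - mu, full_matrices=False)
    return mu, Vt[:d]

def dffs(x, mu, P):
    """Distance from face space: squared residual after PCA reconstruction."""
    y = P @ (x - mu)                      # face-space coordinates, y = P x
    residual = (x - mu) - P.T @ y         # component lying outside the face space
    return float(residual @ residual)

# Toy usage: random stand-ins for 500 vectorized 11x11 patches, one candidate.
rng = np.random.default_rng(0)
faces = rng.normal(size=(500, 121))
mu, P = train_face_space(faces, d=20)
theta = 100.0                             # threshold tuned on held-out data
label = "Face" if dffs(rng.normal(size=121), mu, P) < theta else "Non-face"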
Appearance-based face detection – Cont.
Additional statistical face models:
Gaussian mixture models (GMMs) on a compressed representation of the candidate (e.g., via the DCT image compression transform), $\mathbf{y} = \mathbf{P}\,\mathbf{x}$:
$\Pr(\mathbf{y} \mid c) = \sum_{k=1}^{K} w_{c,k}\, \mathcal{N}(\mathbf{y};\, \mathbf{m}_{c,k}, \mathbf{s}_{c,k}), \quad c \in \{f, \bar{f}\}$ (face / non-face),
classifying $\mathbf{x}$ (or $\mathbf{y}$) by the likelihood ratio $\Pr(\mathbf{y} \mid f)\, /\, \Pr(\mathbf{y} \mid \bar{f})$.
Artificial neural network classifier (ANN – Rowley et al., 1998).
Support vector machine classifier (SVM – Osuna et al., 1997).
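For the mixture-model alternative, a small sketch of the face / non-face log-likelihood-ratio test with diagonal-covariance GMMs; the parameters below are random placeholders, where in practice they come from EM training on labeled patches:

import numpy as np

def gmm_loglik(y, weights, means, variances):
    """log Pr(y|c) for a diagonal-covariance Gaussian mixture."""
    logdet = np.sum(np.log(2 * np.pi * variances), axis=1)   # per component
    mahal = np.sum((y - means) ** 2 / variances, axis=1)
    comp = np.log(weights) - 0.5 * (logdet + mahal)
    return float(np.logaddexp.reduce(comp))                  # log-sum-exp

rng = np.random.default_rng(1)
D, K = 24, 4                                  # e.g., 24 DCT features, 4 mixtures
face = (np.full(K, 1 / K), rng.normal(size=(K, D)), np.ones((K, D)))
nonface = (np.full(K, 1 / K), rng.normal(size=(K, D)), np.ones((K, D)))

y = rng.normal(size=D)
llr = gmm_loglik(y, *face) - gmm_loglik(y, *nonface)
decision = "face" if llr > 0.0 else "non-face"   # 0 is a placeholder threshold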
Appearance-based face detection – Cont.
Face detection experiments:
Accuracy degrades under head-pose change and under varying background and lighting.
[Figure: face detection accuracy (%) on STUDIO, OFFICE, and AUTO data, for LDA/PCA and DCT/GMM based detectors; SI: speaker-independent, MS: multi-speaker, SA: speaker-adapted.]
Appearance-based face detection – Cont.
From faces to facial features: facial feature candidates within a detected face can be scored using trained Fisher discriminants, DFFS, GMMs, ANNs, etc.
Feature localization remains challenging under varying lighting and head-pose conditions.
[Images: tracking examples on STUDIO and AUTOMOBILE data.]
II.A.2. Face shape and lip contour extraction.
Four popular methods for lip contour extraction:
Snakes (active contours): an elastic curve is iteratively optimized to fit the lip boundary.
Deformable templates: a parametric lip model, fit by minimizing an energy that matches the template to the lips.
Face shape and lip contour extraction – Cont.
Active shape models (ASMs): a trained statistical model of the landmark-point shape is iteratively fit to the image.
(From: Luettin, Thacker, and Beet, 1996.)
Face shape and lip contour extraction – Cont.
Active appearance models (AAMs): joint statistical models of face shape and appearance, parameterized by the coefficients of both.
Given a new image, the shape/appearance parameters are iteratively obtained, by minimizing a residual error.
[Video: AAM tracking on IBM "studio" data (credit: I. Matthews); AAM modes trained on IBM data.]
II.B. Region-of-interest for visual speech.
Define the region-of-interest (ROI) from the tracked features: mouth center and size, bounding circle, lip profiles, etc.
Smooth the ROI track over time, e.g., using a median or Kalman filter.
Normalize (e.g., by size and grayscale/histogram equalization) the mouth ROI.
A normalized rectangular mouth ROI works best for ASR.
II.C. Visual speech features.
II.C.1. Shape-based visual features.
Assume that lip contour points are available (extracted as discussed in II.A), and are properly normalized using an affine transform (to compensate for head pose and speaker specifics).
Define the lip-region indicator function over the contour $C$:
$f(x,y) = 1$ if $(x,y) \in C \cup \mathrm{interior}(C)$, and $0$ otherwise.
Geometric features then follow, e.g.:
height $h = \max_{x} \sum_{y} f(x,y)$; width $w = \max_{y} \sum_{x} f(x,y)$; area $a = \sum_{x,y} f(x,y)$; perimeter $p$, accumulated over consecutive contour points $(x_i, y_i)$, $(x_{i+1}, y_{i+1})$.
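A possible implementation of these geometric features, assuming a binary lip-region mask f(x,y) and an ordered contour point list are already available; the array shapes and the toy elliptical mouth are illustrative:

import numpy as np

def geometric_features(mask, contour):
    """mask: 2-D binary array f(x,y); contour: (N,2) ordered points (x, y)."""
    h = mask.sum(axis=0).max()           # height: tallest column of the region
    w = mask.sum(axis=1).max()           # width: widest row of the region
    a = mask.sum()                       # area: number of lip-region pixels
    d = np.diff(np.vstack([contour, contour[:1]]), axis=0)
    p = np.linalg.norm(d, axis=1).sum()  # perimeter: closed polygon length
    return h, w, a, p

# Toy usage: an elliptical "mouth" mask and its sampled outline.
yy, xx = np.mgrid[0:64, 0:64]
mask = (((xx - 32) / 20.0) ** 2 + ((yy - 32) / 8.0) ** 2) <= 1.0
t = np.linspace(0, 2 * np.pi, 50, endpoint=False)
contour = np.stack([32 + 20 * np.cos(t), 32 + 8 * np.sin(t)], axis=1)
print(geometric_features(mask, contour))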
Shape-based visual features – Cont.
Alternatively, obtain the lip shape by statistical (ASM/AAM-style) contour tracking (as discussed in II.A). The resulting lip contour points can be used to derive geometric features; alternatively, the statistical model coefficients themselves can serve directly as shape features.
II.C.2. Appearance-based visual features.
Shape features ignore the speech-informative appearance of the oral cavity (tongue, teeth visibility, etc.). Instead, use a compressed representation of the mouth region-of-interest pixels, obtained by an image dimensionality-reduction transform, e.g.:
DCT: discrete cosine transform.
DWT: discrete wavelet transform.
PCA: principal components analysis.
LDA: linear discriminant analysis.
ROI extraction around the tracked mouth center $(M_t, N_t, K_t)$ of the video $V$:
$\mathbf{x}_t \leftarrow \{\, V(m,n,k) :\ M_t - M/2 \le m < M_t + M/2,\ N_t - N/2 \le n < N_t + N/2,\ K_t - K/2 \le k < K_t + K/2 \,\}.$
Feature projection: $\mathbf{y}_t = \mathbf{P}\,\mathbf{x}_t$, with $\mathbf{P} \in \mathbb{R}^{d \times D}$, $d \ll D$.
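A minimal sketch of DCT appearance features from a normalized ROI; the low-frequency coefficient selection here is a simple square crop rather than the zig-zag scan typically used, and the ROI size and d are assumptions:

import numpy as np
from scipy.fft import dctn

def dct_features(roi, d=24):
    """roi: 2-D grayscale mouth ROI; returns d low-order DCT coefficients."""
    coeffs = dctn(roi, norm="ortho")      # 2-D type-II DCT of the ROI
    k = int(np.ceil(np.sqrt(d)))          # small low-frequency square
    return coeffs[:k, :k].ravel()[:d]

rng = np.random.default_rng(2)
roi = rng.random((32, 32))                # stand-in for a normalized mouth ROI
y = dct_features(roi, d=24)               # compact visual feature vector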
II.D. Visual feature comparison experiments.
Visual-only word accuracy with shape-based features:

Outer lip features   Word accuracy, %
h, w                 55.8
+ a                  61.9
+ p                  64.7
+ FD 2-5             73.4

Lip contour features   Word accuracy, %
Outer-only             73.4
Inner-only             64.0
2 contours             83.9

Feature type          Word accuracy, %
Lip-contour based     83.9
Appearance (LDA)      97.0

Appearance-based (LDA) features clearly outperform lip-contour (shape) based features.
Visual feature comparisons – Cont.
[Figure: word accuracy (%) vs. number of static features J, for appearance-based features (LDA, DWT, PCA) (Potamianos et al., 1998).]
II.E. Video degradation effects.
Frame-rate reduction: the limit of acceptable video field rate for automatic speechreading is about 15 Hz.
Video noise: robustness to noise holds only in a matched training/testing scenario.
[Figures: word accuracy (%) vs. video field rate (Hz) at SNR = 10, 30, 60 dB, and vs. video SNR (dB) for matched and mismatched training-testing.]
Both cases: DWT visual features, connected-digits recognition (Potamianos et al., 1998).
Video degradation effects – Cont.
Visually challenging domains pose difficulties to robust visual feature extraction.
Experiments in office and automobile environments (multiple speakers, connected digits – Potamianos et al., 2003):
As face detection accuracy decreases, word error rate increases.
[Figure: face detection accuracy (%) and word error rate (%) on STUDIO, OFFICE, and AUTO data; SI: speaker-independent, MS: multi-speaker, SA: speaker-adapted.]
III. Audio-visual fusion for AV-ASR.
Fusion taxonomy:
Early integration (frame, HMM-state level).
Intermediate integration (phone level – coupled, product HMMs).
Late integration (sentence level – discriminative model combination).
Stream weighting: fixed (global) or adaptive (local).
Notation: audio and visual feature streams
$\mathbf{O}_A = [\, \mathbf{o}_{A,t} \in \mathbb{R}^{d_A},\ t \in T \,]$, $\quad \mathbf{O}_V = [\, \mathbf{o}_{V,t} \in \mathbb{R}^{d_V},\ t \in T \,]$, with $d_A = 60$, $d_V = 41$.
Fusion goal: $\mathrm{WER}(\mathbf{O}_A, \mathbf{O}_V) \le \min[\, \mathrm{WER}(\mathbf{O}_A),\ \mathrm{WER}(\mathbf{O}_V) \,]$.
III.A. Feature fusion in AV-ASR.
Use a single classifier (of the same form as the audio-only and visual-only classifiers – e.g., a single-stream HMM) to model the concatenated audio-visual features, or any transformation of them.
Plain feature concatenation: $\mathbf{o}_{AV,t} = [\, \mathbf{o}_{A,t},\ \mathbf{o}_{V,t} \,]$.
Discriminant feature fusion (hierarchical LDA, "HiLDA"): a discriminant projection of the concatenated features, exploiting audio-visual correlation and achieving dimensionality reduction.
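A rough sketch of the two fusion flavors, using scikit-learn's LDA as a stand-in for the discriminant projection; the actual HiLDA pipeline is hierarchical and includes details not reproduced here, and the dimensions and label granularity are illustrative:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
T = 2000
o_audio = rng.normal(size=(T, 60))     # 60-dim audio features per frame
o_visual = rng.normal(size=(T, 41))    # 41-dim visual features per frame
labels = rng.integers(0, 10, size=T)   # frame-level classes (e.g., HMM states)

# Early integration: plain concatenation of the two streams.
o_av = np.hstack([o_audio, o_visual])  # shape (T, 101)

# Discriminant fusion: LDA projects the concatenated vector to at most
# (n_classes - 1) dimensions, using the frame labels.
lda = LinearDiscriminantAnalysis(n_components=9)
o_fused = lda.fit_transform(o_av, labels)   # shape (T, 9)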
Feature fusion in AV-ASR – Cont.
Discriminant feature fusion is superior to plain concatenation, and results in an effective SNR gain of about 6 dB over audio-only ASR, when noise-degraded audio is considered at various SNRs.
[Figure: WER (%) vs. SNR (dB) for AUDIO-ONLY, AV-CONCAT, and AV-HILDA; connected-digits task, matched training; 6 dB effective SNR gain.]
III.B. Decision fusion in AV-ASR.
Combine the single-modality (audio-only and visual-only) classifier outputs to provide a joint audio-visual score. The typical example is the multi-stream HMM:
$\mathrm{Score}(\mathbf{o}_{AV,t} \mid c) \;=\; \prod_{s \in \{A,V\}} \Pr(\mathbf{o}_{s,t} \mid c)^{\lambda_{s,c,t}}$,
where each stream's class-conditional likelihood is a Gaussian mixture,
$\Pr(\mathbf{o}_{s,t} \mid c) = \sum_{k=1}^{K_{s,c}} w_{s,c,k}\, \mathcal{N}(\mathbf{o}_{s,t};\, \mathbf{m}_{s,c,k}, \mathbf{s}_{s,c,k}), \quad c \in C$,
and the stream exponents satisfy $0 \le \lambda_{s,c,t} \le 1$, $s \in \{A,V\}$ (typically $\lambda_{A,c,t} + \lambda_{V,c,t} = 1$).
Parameters: $\boldsymbol{\theta}_{AV} = [\, \boldsymbol{\theta}_A, \boldsymbol{\theta}_V, \boldsymbol{\lambda} \,]$, where $\boldsymbol{\theta}_s = [\, (w_{s,c,k}, \mathbf{m}_{s,c,k}, \mathbf{s}_{s,c,k}),\ k = 1,\ldots,K,\ c \in C \,]$ and $\boldsymbol{\lambda} = [\, \lambda_{s,c,t},\ c \in C,\ t \in T \,]$.
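In the log domain, the multi-stream score is just an exponent-weighted sum of the per-stream log-likelihoods, as in this sketch; the values and the 0.7 audio exponent are illustrative:

import numpy as np

def multistream_logscore(loglik_a, loglik_v, lam_a):
    """Joint log-score: lam_A * log Pr(o_A|c) + lam_V * log Pr(o_V|c)."""
    return lam_a * loglik_a + (1.0 - lam_a) * loglik_v   # lam_V = 1 - lam_A

# Toy usage: per-class stream log-likelihoods at one frame.
loglik_a = np.array([-12.0, -15.5, -14.2])   # audio stream, classes 0..2
loglik_v = np.array([-20.1, -18.0, -19.5])   # visual stream, classes 0..2
best_class = int(np.argmax(multistream_logscore(loglik_a, loglik_v, lam_a=0.7)))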
Decision fusion in AV-ASR – Cont.
Multi-stream HMM parameter estimation of $\boldsymbol{\theta}_{AV} = [\, \boldsymbol{\theta}_A, \boldsymbol{\theta}_V \,]$:
Joint stream training: EM iterations update both streams together on the audio-visual data, $\boldsymbol{\theta}_{AV}^{(k)} \rightarrow \boldsymbol{\theta}_{AV}^{(k+1)}$.
Separate stream training: each single-modality model is trained by its own EM iterations, $\boldsymbol{\theta}_{s}^{(k)} \rightarrow \boldsymbol{\theta}_{s}^{(k+1)}$, $s \in \{A, V\}$.
Decision fusion in AV-ASR – Cont.
AV-ASR results, on the connected-digits ASR paradigm:
Decision fusion is superior to feature fusion.
Joint stream training is superior to separate stream training.
Effective SNR gain of about 7.5 dB.
[Figure: WER (%) vs. SNR (dB) for AUDIO-ONLY and AV-MS (AU+VI), joint vs. separate stream training; connected-digits task, matched training; 7.5 dB gain.]
III.C. Asynchronous integration.
Force stream synchrony at coarser units than HMM states, such as phones. This allows state-asynchrony between the two streams, within each phone.
Composite states $\mathbf{c} = \{\, c_s,\ s \in S \,\} \in C_S$, with $S = \{A, V\}$, are scored as
$\mathrm{Score}(\mathbf{o}_t \mid \mathbf{c}) \;=\; \prod_{s \in S} \Pr(\mathbf{o}_{s,t} \mid c_s)^{\lambda_{s,t,c}}$.
Intermediate integration – Cont.
The resulting model is the product HMM: its composite states use the same emission form as the state-synchronous MS-HMM, while allowing within-phone asynchrony.
A related model, with dependencies across the streams, is known as the coupled HMM.
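A toy sketch of product-HMM emission scoring over composite states, where each stream may occupy a different state of a 3-state phone model; all numbers are illustrative:

import numpy as np
from itertools import product

# Per-stream, per-state log-likelihoods for one frame of a 3-state phone.
loglik_a = np.array([-10.0, -12.0, -11.0])   # audio states 0..2
loglik_v = np.array([-14.0, -13.5, -15.0])   # visual states 0..2
lam_a, lam_v = 0.7, 0.3

# Product-HMM emissions live on composite states (c_A, c_V); allowing all
# pairs within the phone lets the streams be in different states.
scores = {
    (ca, cv): lam_a * loglik_a[ca] + lam_v * loglik_v[cv]
    for ca, cv in product(range(3), range(3))
}
best = max(scores, key=scores.get)   # most likely composite state this frame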
Asynchrony; Intermediate integration – Cont.
AV-ASR results, on the connected-digits ASR paradigm:
Phone-asynchronous (product HMM) fusion is superior to state-synchronous fusion.
Effective SNR gain of about 10 dB.
[Figure: WER (%) vs. SNR (dB) for AUDIO-ONLY and AV-PRODUCT HMM; connected-digits task, matched training; 10 dB gain.]
III.D. Stream reliability modeling.
Global weighting: assumes that global stream weights properly model the reliability of each stream for all available data (stationary environment).
Adaptive weighting: the environment varies locally (more practical). Requires stream reliability estimation at a local level, and mapping of such reliabilities to exponents:
$\mathrm{Score}(\mathbf{o}_{AV,t} \mid c) = \Pr(\mathbf{o}_{A,t} \mid c)^{\lambda_{A,c,t}}\ \Pr(\mathbf{o}_{V,t} \mid c)^{\lambda_{V,c,t}}$, with $\lambda_{s,c,t} \rightarrow \lambda_{s,t_{\mathrm{win}}}$, i.e., exponents tied across classes $c$ and estimated over a local time window.
III.D.1. Global stream weighting.
A simple likelihood-based choice is the hard weight
$\lambda_{s,c} = 1$, if $s = \arg\max_{s' \in \{A,V\}} L_{s',c,F}$, and $0$ otherwise,
where $L_{s,c,F}$ denotes the training-set log-likelihood contribution due to the $s$-modality, $c$-state (obtained by forced alignment $F$).
Alternatively, estimate the weights discriminatively, by the generalized probabilistic descent (GPD) algorithm.
[Figure: GPD-based estimation of the audio exponent $\lambda^{*}_{A}(k)$ vs. iteration ($\log_{10} k$), for initializations $\lambda^{*}_{A}(0) =$ 0.90 (a), 0.01 (b), and 0.99 (c).]
III.D.2. Adaptive stream weighting.
Local stream reliability varies over time (e.g., due to acoustic noise bursts, face tracking failures, etc.).
Consider the $N$-best most likely classes $c_{s,t,n} \in C$, $n = 1, 2, \ldots, N$, for observing $\mathbf{o}_{s,t}$. Useful local reliability indicators are:
$N$-best log-likelihood difference:
$L_{s,t} = \frac{1}{N-1} \sum_{n=2}^{N} \log \frac{\Pr(\mathbf{o}_{s,t} \mid c_{s,t,1})}{\Pr(\mathbf{o}_{s,t} \mid c_{s,t,n})}$
$N$-best log-likelihood dispersion:
$D_{s,t} = \frac{2}{N(N-1)} \sum_{n=1}^{N-1} \sum_{n'=n+1}^{N} \log \frac{\Pr(\mathbf{o}_{s,t} \mid c_{s,t,n})}{\Pr(\mathbf{o}_{s,t} \mid c_{s,t,n'})}$
Adaptive stream weighting – Cont.
[Figure: reliability indicators and estimated exponents vs. SNR.]
Map the (four) reliability indicators $d_{i,t}$ to the audio exponent via a sigmoid:
$\lambda_{A,t} = \left[\, 1 + \exp\left( -\sum_{i=1}^{4} w_i\, d_{i,t} \right) \right]^{-1}$,
with the weights $w_i$ estimated using MCL or MCE on the basis of frame error (Garg et al., 2003).
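A sketch of the two indicators above and the sigmoid exponent mapping; only two of the four indicators are computed here, and the weights are placeholders for MCL/MCE-trained values:

import numpy as np

def nbest_indicators(logliks, n=4):
    """Reliability indicators from the N-best per-class log-likelihoods."""
    top = np.sort(logliks)[::-1][:n]       # N largest log Pr(o|c)
    diff = np.mean(top[0] - top[1:])       # N-best log-likelihood difference
    i, j = np.triu_indices(n, k=1)
    disp = np.mean(top[i] - top[j])        # N-best pairwise dispersion
    return diff, disp

def audio_exponent(indicators, weights):
    """Sigmoid mapping of reliability indicators to the audio exponent."""
    return 1.0 / (1.0 + np.exp(-np.dot(weights, indicators)))

# Toy usage: one frame's audio-stream log-likelihoods over 10 classes.
rng = np.random.default_rng(4)
logliks = rng.normal(-20.0, 3.0, size=10)
d = nbest_indicators(logliks, n=4)
lam_a = audio_exponent(d, weights=np.array([0.4, 0.1]))  # trained in practice
lam_v = 1.0 - lam_a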
III.E. Summary of AV-ASR experiments.
Summary of fusion results for connected-digit recognition:
Matched condition (noisy training and testing).
Mismatched condition (trained in clean, tested in noisy).
Audio-visual fusion using the product HMM yields an effective SNR gain of about 10 dB in both conditions.
[Figure: WER (%) vs. SNR (dB) for AU-matched, AU-mismatched, VISUAL-ONLY, AV-matched, AV-mismatched; connected-digits task; 10 dB gain in both matched and mismatched training/testing.]
Summary of AV-ASR experiments – Cont.
Fusion results for large-vocabulary continuous speech recognition (LVCSR):
Training on 239 subjects, testing on 25 subjects.
Noise-degraded audio considered at various SNRs.
Best performance is obtained using hybrid fusion, AV-MS (AU + AV-HiLDA): an effective SNR gain of about 8 dB.
[Figure: WER (%) vs. SNR (dB) for AUDIO-ONLY, AV-HILDA, AV-MS (AU+VI), and AV-MS (AU+AV-HILDA); LVCSR task, matched training; 8 dB gain.]
Summary of AV-ASR experiments – Cont.
[Figure: WER on STUDIO, OFFICE, and AUTO data for Audio, AV-HiLDA, and AV-MSHMM systems; original audio (left) and noisy audio (right).]
IV. Other audio-visual speech applications.
Human-computer interfaces require natural interaction and perceptual intelligence, i.e.:
A. Speech synthesis (AV text-to-speech).
B. Detection of who is speaking (speaker recognition).
C. What is being spoken (ASR / enhancement).
D. Where is the active speaker (speech event detection).
E. How can the audio-visual interaction be segmented, labeled, and retrieved? (mining).
IV.A. AV-TTS – Two approaches.
Model-based synthesis: geometric, articulatory, or muscular face models.
Sample-based (concatenative) synthesis: mouth-region video samples are acquired, processed, and concatenated, with a Viterbi search for the best mouth sequence (Cosatto and Graf, 2000).
IV.B. Audio-visual speaker recognition.
Two important problems are speaker verification (authentication) and identification within a closed set of known subjects $C$, based on audio-visual observations $\mathbf{O}$:
Verification of a claimed identity $c$: $\Pr(c \mid \mathbf{O})\, /\, \Pr(c_{\mathrm{all}} \mid \mathbf{O}) \;\gtrless\; \theta_{\mathrm{thresh}} \;\Rightarrow\;$ Accept / Reject the claim, where $c_{\mathrm{all}}$ denotes a general (background) speaker model.
Identification: $\hat{c} = \arg\max_{c \in C} \Pr(c \mid \mathbf{O})$.
Dec 1, 2005 51
IV.B.1. Single-modality speaker recognition
shape features, or appearance based features.
ID-error: TD: S: 27.1, I: 10.4, SI: 8.3 % TI: S: 16.7, I: 4.2, SI: 2.1 %
dynamic link architecture, elastic graphs, Gabor filter jets.
(e.g., single PCA representation of entire face)
vectors are classified (each representing local information, possibly organized in a hierarchy) and classification results are cumulated (e.g., embedded HMMs). 1-D HMM 1-D HMM
Dec 1, 2005 52
IV.B.2. Multi-modal speaker recognition (A: audio; V: visual speech; F: face).
Audio + visual speech: ID-error: A: 2.01, V: 10.95, AV: 0.40%; verification EER: A: 1.71, V: 1.52, AV: 1.04%.
Audio + face: ID-error: A: 28.4, F: 28.8, AF: 9.12%.
Audio + visual speech + face: ID-error: A: 10.4, V: 11.0, F: 18.7, AVF: 7.0%.
In all cases, multi-modal fusion reduces the error of the best single modality.
IV.C. Bimodal enhancement of audio.
Recall that audio and visual speech features are correlated, e.g., for 60-dim audio features ($\mathbf{o}_{A,t}$) and 41-dim visual features ($\mathbf{o}_{V,t}$).
Goal: to restore acoustic information from the video and the corrupted audio signal.
IV.C.1. Linear bimodal audio enhancement.
Concatenate the visual features with the corrupted ("C") audio features: $\mathbf{o}^{(C)}_{AV,t} = [\, \mathbf{o}_{V,t},\ \mathbf{o}^{(C)}_{A,t} \,]$, $t \in T$; the clean audio features $\mathbf{o}_{A,t}$ are the estimation target.
Estimate each clean-audio feature dimension $i$ by a linear projection of the joint vector:
$\hat{o}^{(E)}_{A,t,i} = E[\, o_{A,t,i} \mid \mathbf{o}^{(C)}_{AV,t} \,] \approx \langle\, \mathbf{k}_i,\ \mathbf{o}^{(C)}_{AV,t} \,\rangle, \quad i = 1, \ldots, d_A,\ t \in T$,
with the coefficients trained by least squares,
$\mathbf{k}_i = \arg\min_{\mathbf{k}} \sum_{t \in T} \left[\, o_{A,t,i} - \langle\, \mathbf{k},\ \mathbf{o}^{(C)}_{AV,t} \,\rangle \,\right]^2$.
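A compact least-squares version of this enhancement, with synthetic stand-in features; the coefficient matrix K plays the role of the k_i above, and the dimensions follow the 60/41 setup in the slides:

import numpy as np

rng = np.random.default_rng(6)
T, d_a, d_v = 5000, 60, 41

# Synthetic stand-ins: clean audio, correlated visual, and corrupted audio.
o_a = rng.normal(size=(T, d_a))                              # clean audio (targets)
o_v = o_a[:, :d_v] * 0.5 + rng.normal(size=(T, d_v)) * 0.5   # correlated visual
o_a_c = o_a + rng.normal(size=(T, d_a))                      # corrupted audio

# Joint visual + corrupted-audio observation, one row per frame t.
o_av = np.hstack([o_v, o_a_c])                               # (T, d_v + d_a)

# Train the linear map by least squares on frames with clean-audio targets.
K, *_ = np.linalg.lstsq(o_av, o_a, rcond=None)               # (d_v + d_a, d_a)

# Enhancement at test time: project the joint vector through K.
o_a_enh = o_av @ K                                           # estimated clean features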
Linear bimodal audio enhancement – Cont.
Example (speech corrupted by babble noise at 4 dB SNR): the enhanced features are not perfect, but better than the noisy features, and they help ASR!
[Figure: clean (AC), noisy (A), and enhanced (EN) audio feature values vs. time frame t, for feature dimensions i = 3 and i = 5.]
Linear bimodal audio enhancement – Cont.
[Figure: WER (%) vs. SNR (dB). Left, no HMM retraining: AU-only, AU-only (after HMM retraining), AU-only Enh., AV Enh., AV-HiLDA fusion. Right, after HMM retraining: AU-only, AU-Enh., AU-Enh.+MLLT, AV-Enh., AV-Enh.+MLLT, AV-HiLDA fusion.]
IV.D. Audio-visual speaker detection.
Applications / problems:
Smart rooms: locate the active speaker; signals are available from microphone arrays and video cameras.
Broadcast video: which face in the video corresponds to the audio track?
One approach is based on the mutual information of the audio and visual observations (Nock, Iyengar, and Neti, 2000):
$I(A;V) = \sum_{\mathbf{a} \in s_A} \sum_{\mathbf{v} \in s_V} P(\mathbf{a}, \mathbf{v}) \log \frac{P(\mathbf{a}, \mathbf{v})}{P(\mathbf{a})\, P(\mathbf{v})} \;=\; \frac{1}{2} \log \frac{|\boldsymbol{\Sigma}_{\mathbf{a}}|\ |\boldsymbol{\Sigma}_{\mathbf{v}}|}{|\boldsymbol{\Sigma}_{\mathbf{a},\mathbf{v}}|}$,
where the closed form holds for jointly Gaussian audio and visual features.
Audio-visual measures of speech activity indicate speaker intent for HCI; the visual channel improves robustness compared to an audio-only system (De Cuetos and Neti, 2000).
[Images: audio-visual synchrony and tracking (Nock, Iyengar, and Neti, 2000).]
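A sketch of the Gaussian mutual-information measure on synchronized feature streams; the synthetic shared-source data is illustrative, whereas in the cited work the features come from the audio track and candidate face regions:

import numpy as np

def gaussian_mi(a, v):
    """I(A;V) = 0.5*log(|Sa||Sv|/|Sav|) for jointly Gaussian rows-as-frames data."""
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
    return 0.5 * (logdet(a) + logdet(v) - logdet(np.hstack([a, v])))

# Toy usage: synchronized streams share a hidden source, so MI is positive;
# comparing MI across candidate faces would pick the one matching the audio.
rng = np.random.default_rng(7)
source = rng.normal(size=(2000, 1))
audio = source + 0.5 * rng.normal(size=(2000, 3))
video = source + 0.5 * rng.normal(size=(2000, 2))
print(gaussian_mi(audio, video))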
V. Summary.
Visual speech information benefits a wide range of HCI technologies: audio-visual speech recognition, speech synthesis, speaker authentication, identification, and localization, and speech enhancement.
Summary – Cont.
However, visual speech is not yet in widespread use in mainstream HCI, due to remaining challenges (robust visual processing in realistic conditions, sensing and computation cost, etc.).
Nevertheless, given recent progress, as well as the associated drastic cost reduction, we believe that audio-visual speech is becoming ready for targeted applications!
Significant development opportunities and challenges remain.
References:
M.E. Hennecke, K.V. Prasad, and D.G. Stork, "Visionary speech: Looking ahead to practical speechreading systems," in Speechreading by Humans and Machines, D.G. Stork and M.E. Hennecke, eds., Springer, Berlin, pp. 331-349, 1996.
G. Potamianos, C. Neti, G. Gravier, A. Garg, and A.W. Senior, "Recent advances in the automatic recognition of audio-visual speech," Proc. IEEE, 91(9): 1306-1326, 2003.
IEEE Trans. Multimedia, 2(3): 141-151, 2000.