The Geometry of the Articulatory Region That Produces a Speech Sound
Chao Qin EECS, School of Engineering, UC Merced, USA November 2009
eecs-seminar’09, UCMerced
– Recovering vocal tract shapes from acoustics
– Still an open research problem!
– Model-based approaches: Atal et al.’78, Boe et al.’92
– Data-driven approaches: Qin & Carreira-Perpiñán’07
– Is recovering a portion of the vocal tract (VT) simpler than recovering the entire VT?
– How to quantify the difficulty?
– Useful for facial animation (lips and anterior tongue) and for diagnosing speech disorders (velum height) in dysarthria
– Useful for separating linguistic information from speaker idiosyncrasies
– Parametric methods: model-based inversion
– Nonparametric methods: fewer assumptions
Nonuniqueness of any articulator ⇒ nonuniqueness of the entire VT
Nonuniqueness of the entire VT ⇏ nonuniqueness of every articulator
– MOCHA-TIMIT
– EMA after “mean-filtering”
– 12th-order line spectral frequencies (LSF)
– 7 MLPs for different portions of the front VT
– 6 MLPs for individual articulators
– 1 RBF for the entire vocal tract
– MLPs: single layer with 100 hidden units
– RBF: $M = 600$ basis functions, bandwidth $\sigma = 0.1$, regularization $\lambda = 0.1$
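A minimal sketch of how an RBF network with M Gaussian basis functions, bandwidth σ and regularization λ can be fit by regularized least squares. The function names, the random choice of centers and the absence of a bias term are assumptions made for illustration; the slides do not say how the centers were chosen or how the network was trained.

```python
import numpy as np

def rbf_design(Y, centers, sigma):
    """Gaussian basis activations: one row per input, one column per center."""
    d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / sigma**2)

def fit_rbf(Y, X, M, sigma, lam, seed=0):
    """Fit output weights W by ridge-regularized least squares.

    Y: (N, Dy) acoustic inputs, X: (N, Dx) articulatory targets.
    Centers are M training inputs picked at random (an assumption).
    """
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), size=M, replace=False)]
    Phi = rbf_design(Y, centers, sigma)
    # Solve (Phi^T Phi + lam I) W = Phi^T X
    W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ X)
    return centers, W

def predict_rbf(Y, centers, W, sigma):
    """Map acoustic inputs to predicted articulatory vectors."""
    return rbf_design(Y, centers, sigma) @ W
```

Usage would look like `centers, W = fit_rbf(Y_train, X_train, M, sigma, lam)` followed by `predict_rbf(Y_test, centers, W, sigma)`, where the training and test arrays are placeholders for the LSF and EMA data.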
         Portions of the VT by MLPs   Whole VT by RBF        Individual articulator by MLPs
Channel  RMSE    Correlation          RMSE    Correlation    RMSE    Correlation
ULx      1.00    0.51                 0.99    0.51           1.02    0.48
ULy      1.36    0.57                 1.33    0.60           1.36    0.58
LLx      1.32    0.49                 1.28    0.51           1.35    0.47
LLy      2.96    0.70                 2.93    0.71           2.95    0.71
LIx      0.94    0.48                 0.92    0.51           0.95    0.47
LIy      1.33    0.75                 1.32    0.75           1.35    0.74
TTx      2.74    0.72                 2.71    0.73           2.79    0.71
TTy      3.06    0.77                 3.01    0.78           3.05    0.77
TBx      2.37    0.77                 2.36    0.77           2.44    0.75
TBy      2.63    0.74                 2.60    0.74           2.65    0.74
TDx      2.21    0.74                 2.19    0.75           2.26    0.72
TDy      2.75    0.59                 2.72    0.59           2.78    0.59
Vx       0.51    0.69                 0.52    0.68           0.52    0.68
Vy       0.46    0.70                 0.46    0.70           0.46    0.70
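A small sketch of how per-channel RMSE and correlation figures like these can be computed, assuming the usual definitions (root-mean-square error and Pearson correlation between measured and predicted trajectories, one value per articulatory channel). The function name and array layout are illustrative, not taken from the slides.

```python
import numpy as np

def rmse_and_corr(X_true, X_pred):
    """Per-channel RMSE and Pearson correlation.

    X_true, X_pred: (N, D) arrays, one column per articulatory channel.
    Returns two length-D arrays.
    """
    err = X_true - X_pred
    rmse = np.sqrt((err ** 2).mean(axis=0))
    # Center each channel, then normalize the cross-product
    xt = X_true - X_true.mean(axis=0)
    xp = X_pred - X_pred.mean(axis=0)
    corr = (xt * xp).sum(axis=0) / np.sqrt(
        (xt ** 2).sum(axis=0) * (xp ** 2).sum(axis=0)
    )
    return rmse, corr
```

A constant offset in the prediction raises the RMSE but leaves the correlation at 1, which is why the two measures are reported side by side.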
[Figure: the entire dataset for speaker fsew0; estimation errors]
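Speaker jw11: dataset $\{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{43260}$, $\mathbf{x}_n \in \mathbb{R}^{16}$, $\mathbf{y}_n \in \mathbb{R}^{20}$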
– Search for multimodality in each individual 2D articulatory space (as in Qin & Carreira-Perpiñán’07)
– Analyze the geometry of the inverse set using shape statistics
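– Given an acoustic vector $\mathbf{y}$
– Find its inverse set $I(\mathbf{y})$
– Count the number of modes (of a kernel density estimate of bandwidth $\sigma = 6$ mm)
– Compute shape statistics
– Repeat for all acoustic vectors in the dataset

Inverse set of an acoustic vector $\mathbf{y} \in Y$ (acoustic space, AC), a subset of the articulatory space $X$ (ART):

$$I(\mathbf{y}) = \{\mathbf{x}_m \mid d(\mathbf{y}, \mathbf{y}_m) \le r\} \subset X$$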
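The steps above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the talk: the distance d is taken to be Euclidean, modes are found by running Gaussian mean-shift from every point of the inverse set, and `merge_frac`, `tol` and `max_iter` are hypothetical tuning parameters.

```python
import numpy as np

def inverse_set(y, Y, X, r):
    """Articulatory vectors x_m whose acoustic vectors y_m satisfy d(y, y_m) <= r.

    Y: (N, Dy) acoustic vectors, X: (N, Dx) articulatory vectors.
    Euclidean distance is an assumption.
    """
    dist = np.linalg.norm(Y - y, axis=1)
    return X[dist <= r]

def count_modes(P, sigma, tol=1e-6, max_iter=500, merge_frac=0.5):
    """Count modes of a Gaussian kernel density estimate built on points P.

    Mean-shift is started from every point; end points closer than
    merge_frac * sigma are treated as the same mode.
    """
    ends = []
    for z in P.astype(float):
        for _ in range(max_iter):
            w = np.exp(-0.5 * ((P - z) ** 2).sum(axis=1) / sigma**2)
            z_new = w @ P / w.sum()  # kernel-weighted mean of the points
            if np.linalg.norm(z_new - z) < tol:
                break
            z = z_new
        ends.append(z_new)
    modes = []
    for e in ends:
        if all(np.linalg.norm(e - m) > merge_frac * sigma for m in modes):
            modes.append(e)
    return len(modes)
```

For one acoustic vector `y`, `count_modes(inverse_set(y, Y, X, r)[:, :2], sigma=6.0)` would count the modes in the 2D space of the first articulator, assuming its two coordinates occupy the first two columns of `X`.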
– Eigenvalues of the covariance matrix – measure the spread of the inverse set along its principal axes
$\lambda_1 \ge \lambda_2$

1. $\lambda_1$ and $\lambda_2$ are small ⇒ tightly concentrated and 0D manifold
2. $\lambda_2 \ll \lambda_1$ ⇒ elongated shape and 1D manifold
3. Otherwise ⇒ complex shape
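A minimal sketch of these shape statistics for a 2D inverse set. The eigenvalue computation follows the description above; the thresholds `small` and `ratio` that turn "small" and "much smaller than" into decisions are hypothetical values chosen for illustration, as the slides do not give any.

```python
import numpy as np

def shape_eigenvalues(P):
    """Eigenvalues of the covariance of points P (N, 2), sorted so lam1 >= lam2."""
    C = np.cov(P, rowvar=False)
    return np.linalg.eigvalsh(C)[::-1]  # eigvalsh returns ascending order

def classify_shape(lam, small=1.0, ratio=0.1):
    """Apply the three cases to eigenvalues lam = (lam1, lam2), lam1 >= lam2.

    small and ratio are illustrative thresholds (assumptions).
    """
    lam1, lam2 = lam[0], lam[1]
    if lam1 < small:           # both eigenvalues small: tight blob
        return "0D"
    if lam2 < ratio * lam1:    # one dominant axis: elongated set
        return "1D"
    return "complex"
```

Because the eigenvalues are variances along the principal axes, their units are squared length (mm² for EMA coordinates in mm), so `small` has to be chosen on that scale.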
Extremely infrequent Quite infrequent
same over all articulators
sound is relatively constrained around a roughly spherical region in articulatory space (0D manifold, e.g. vowels)
straight or curved path (1D manifold, e.g. glides /l/ and /w/) or multimodality (≥2D manifold, e.g. /r/) or even more complex (e.g. /m/)