The Geometry of the Articulatory Region That Produces a Speech Sound


SLIDE 1

The Geometry of the Articulatory Region That Produces a Speech Sound

Chao Qin EECS, School of Engineering, UC Merced, USA November 2009

eecs-seminar’09, UCMerced

SLIDE 2

Outline

  • Introduction and motivation
  • Nonuniqueness of the inverse mapping
  • Prediction error of individual articulators
  • Nonuniqueness of individual articulators
  • Conclusions

SLIDE 3

Introduction

  • Articulatory inversion

– Recovering vocal tract shapes from acoustics
– Still an open research problem!

  • Nonuniqueness of the inverse mapping

– Model-based approaches: Atal et al’78, Boe et al’92
– Data-driven approaches: Qin&Carreira-Perpiñán’07

SLIDE 4

Introduction

  • Questions

– Is recovering a portion of the vocal tract simpler than recovering the entire VT?
– How to quantify the difficulty?

  • Why recover portions of the vocal tract?

– Useful for facial animation (lips and anterior tongue) and diagnosis of speech disorders (velum height) in dysarthria
– Useful for separating linguistic information from speakers’ idiosyncrasies

  • Approaches

– Parametric methods: model-based inversion
– Nonparametric methods: fewer assumptions

Nonuniqueness of any articulator ⇒ Nonuniqueness of the entire VT
Nonuniqueness of the entire VT ⇏ Nonuniqueness of every articulator

SLIDE 5

PART I: Prediction Error of Individual Articulators in Inverse Models

SLIDE 6

Articulatory databases

SLIDE 7

Prediction error of individual articulators

  • Dataset

– MOCHA-TIMIT

  • Train: 10000 frames
  • Valid: 4000 frames
  • Test: 15 utterances

– EMA after “mean-filtering”
– 12th-order line spectral frequencies (LSF)

  • Inversion by neural networks

– 7 MLPs for different portions of the front VT
– 6 MLPs for individual articulators
– 1 RBF network for the entire vocal tract

  • Model parameters

– MLPs: single hidden layer with 100 units
– RBF: $M = 600$ basis functions, bandwidth $\sigma = 0.1$, regularization $\lambda = 0.1$

SLIDE 8

Experimental results: vocal tract inversion

              Portions of the VT by MLPs  Whole VT by RBF        Individual articulator by MLPs
Articulator   RMSE    Correlation         RMSE    Correlation    RMSE    Correlation
ULx           1.00    0.51                0.99    0.51           1.02    0.48
ULy           1.36    0.57                1.33    0.60           1.36    0.58
LLx           1.32    0.49                1.28    0.51           1.35    0.47
LLy           2.96    0.70                2.93    0.71           2.95    0.71
LIx           0.94    0.48                0.92    0.51           0.95    0.47
LIy           1.33    0.75                1.32    0.75           1.35    0.74
TTx           2.74    0.72                2.71    0.73           2.79    0.71
TTy           3.06    0.77                3.01    0.78           3.05    0.77
TBx           2.37    0.77                2.36    0.77           2.44    0.75
TBy           2.63    0.74                2.60    0.74           2.65    0.74
TDx           2.21    0.74                2.19    0.75           2.26    0.72
TDy           2.75    0.59                2.72    0.59           2.78    0.59
Vx            0.51    0.69                0.52    0.68           0.52    0.68
Vy            0.46    0.70                0.46    0.70           0.46    0.70
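A minimal sketch of how the two figures of merit reported on this slide (RMSE and correlation) could be computed for a single articulator channel. It assumes the measured and predicted trajectories are plain Python lists; the function names and the toy values are illustrative assumptions, not taken from the slides.

```python
import math

def rmse(measured, predicted):
    """Root-mean-square error between two equal-length trajectories."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)

def pearson(measured, predicted):
    """Pearson correlation coefficient between two equal-length trajectories."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_m * var_p)

# Toy trajectory for one channel (hypothetical values)
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 2.5, 3.5]

channel_rmse = rmse(measured, predicted)
channel_corr = pearson(measured, predicted)
print(channel_rmse, channel_corr)
```

RMSE is in the units of the articulator coordinate, so it cannot be compared directly across articulators with different ranges of motion; correlation is scale-free.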

SLIDE 9

Normalized estimation error

The entire dataset for speaker fsew0

Estimation errors: $e_j^i = a_j^i - \hat{a}_j^i$

SLIDE 10

Relative estimation error for each articulator

$\Sigma_r$: covariance of each articulator's position
$\Sigma_e$: covariance of each articulator's error

$\left(\tfrac{1}{2}\,\mathrm{tr}\!\left(\Sigma_r^{-1/2}\,\Sigma_e\,\Sigma_r^{-1/2}\right)\right)^{1/2} \;\Rightarrow\; \sim \left(\tfrac{1}{2}\sum_{i=1}^{2} \lambda_i^{e} / \lambda_i^{r}\right)^{1/2}$
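A minimal sketch of one way to compute a relative (normalized) error for a single 2D articulator: the error covariance is whitened by the position covariance and the trace is averaged over the two dimensions. It relies on the identity tr(Σ_r^{-1/2} Σ_e Σ_r^{-1/2}) = tr(Σ_r^{-1} Σ_e), which avoids a matrix square root. The function names and toy data are assumptions for illustration.

```python
import math

def covariance_2d(points):
    """Covariance (dividing by n) of a list of (x, y) pairs: returns (sxx, sxy, syy)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return sxx, sxy, syy

def normalized_error(positions, errors):
    """sqrt( 1/2 * tr(Sigma_r^{-1} Sigma_e) ) for one 2D articulator."""
    a, b, d = covariance_2d(positions)  # Sigma_r = [[a, b], [b, d]]
    e, f, h = covariance_2d(errors)     # Sigma_e = [[e, f], [f, h]]
    det = a * d - b * b
    # Sigma_r^{-1} = [[d, -b], [-b, a]] / det, so the trace of the product is:
    trace = (d * e - 2.0 * b * f + a * h) / det
    return math.sqrt(0.5 * trace)

# Toy data (hypothetical): errors have half the standard deviation of the
# positions along each axis, so the normalized error should be 0.5.
positions = [(0.0, 0.0), (2.0, 0.0), (0.0, 4.0), (2.0, 4.0)]
errors = [(0.5, 1.0), (-0.5, 1.0), (0.5, -1.0), (-0.5, -1.0)]

rel_err = normalized_error(positions, errors)
print(rel_err)
```

Because the error is measured relative to each articulator's own spread, articulators with very different ranges of motion become comparable on one scale.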

SLIDE 11

PART II: Nonuniqueness of Individual Articulators

SLIDE 12

Wisconsin X-ray microbeam database

Speaker jw11

$\{\mathbf{x}_n, \mathbf{y}_n\}_{n=1}^{43260}$
$\mathbf{x}_n \in \mathbb{R}^{16}$: articulatory positions
$\mathbf{y}_n \in \mathbb{R}^{20}$: 20th-order LPC

SLIDE 13

Multimodality of the inverse set

  • Nonparametric algorithm

– Search multimodality in individual 2D articulatory space (like Qin&Carreira-Perpiñán’07)
– Analyze the geometry of the inverse set by shape statistics

– Given an acoustic vector $\mathbf{y}$
– Find its inverse set $I(\mathbf{y}) = \{\mathbf{x}_m \mid d(\mathbf{y}_m, \mathbf{y}) \le r\} \subset X_{\mathrm{ART}}$
– Count number of modes (of kernel density estimate of bandwidth $\sigma = 6$ mm)
– Compute shape statistics
– Repeat for all acoustic vectors in the dataset

[Figure: an acoustic vector $\mathbf{y}$ in $Y_{\mathrm{AC}}$ and its inverse set $I(\mathbf{y})$ in articulatory space $X_{\mathrm{ART}}$ (axes $x_1$, $x_2$)]
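The steps on this slide can be sketched roughly as follows. This is an illustrative assumption about one possible implementation, not the authors' code: the inverse set is collected by thresholding the acoustic distance, and modes of a Gaussian kernel density estimate are counted by running mean-shift from every point and merging coincident end points. The function names, the Euclidean distance, the merge tolerance and the toy data are all hypothetical.

```python
import math

def inverse_set(y, acoustic, articulatory, r):
    """Articulatory points whose paired acoustic vector lies within distance r of y."""
    return [x for x, y_m in zip(articulatory, acoustic) if math.dist(y_m, y) <= r]

def mean_shift_mode(start, points, sigma, iters=200, tol=1e-9):
    """Follow Gaussian mean-shift updates from `start` to a mode of the KDE of `points`."""
    z = start
    for _ in range(iters):
        weights = [math.exp(-math.dist(z, p) ** 2 / (2.0 * sigma ** 2)) for p in points]
        total = sum(weights)
        new = tuple(
            sum(w * p[k] for w, p in zip(weights, points)) / total
            for k in range(len(z))
        )
        if math.dist(new, z) < tol:
            return new
        z = new
    return z

def count_modes(points, sigma, merge_tol=1e-3):
    """Number of distinct KDE modes reached by mean-shift started at every point."""
    modes = []
    for p in points:
        m = mean_shift_mode(p, points, sigma)
        if all(math.dist(m, q) > merge_tol for q in modes):
            modes.append(m)
    return len(modes)

# Toy data (hypothetical): two articulatory clusters share nearly the same acoustics.
articulatory = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
                (30.0, 30.0), (31.0, 30.0), (30.0, 31.0),
                (100.0, 100.0)]
acoustic = [(0.0,), (0.05,), (0.1,), (0.0,), (0.05,), (0.1,), (5.0,)]

inv = inverse_set((0.05,), acoustic, articulatory, r=0.2)
print(len(inv), count_modes(inv, sigma=6.0))
```

The mode count depends on the bandwidth: a much larger `sigma` smooths the two clusters into a single mode, which is why the bandwidth has to be fixed before frames are labelled unique or nonunique.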

SLIDE 14

Shape statistics of the inverse set

  • Characterizing the geometry by the shape statistics

– Eigenvalues $\lambda_1 \ge \lambda_2$ of the covariance matrix measure the spread of the inverse set along its principal axes

1. $\lambda_1$ and $\lambda_2$ are small ⇒ tightly concentrated and 0D manifold
2. $\lambda_2 \ll \lambda_1$ ⇒ elongated shape and 1D manifold
3. Otherwise ⇒ complex shape

  • These shape statistics only depend on the acoustic distance $r = 0.2$
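A minimal sketch of the three-way classification above for a 2D inverse set, using the closed-form eigenvalues of a 2×2 covariance matrix. The slides do not give numeric cut-offs, so the thresholds `small` and `ratio` below are hypothetical, as are the function names and toy point sets.

```python
import math

def covariance_eigenvalues(points):
    """Eigenvalues (lam1 >= lam2) of the 2x2 covariance of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    half_trace = (sxx + syy) / 2.0
    spread = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    return half_trace + spread, half_trace - spread

def classify_shape(points, small=1.0, ratio=0.05):
    """Label an inverse set; `small` and `ratio` are hypothetical thresholds."""
    lam1, lam2 = covariance_eigenvalues(points)
    if lam1 < small:          # both eigenvalues small, since lam2 <= lam1
        return "0D"
    if lam2 < ratio * lam1:   # one dominant principal axis
        return "1D"
    return "complex"

# Toy inverse sets (hypothetical)
blob = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
line = [(float(t), 2.0 * t) for t in range(10)]
square = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]

print(classify_shape(blob), classify_shape(line), classify_shape(square))
```

The covariance alone cannot tell a curved 1D path or a multimodal set from a broad blob, so the "complex" label is only a catch-all here; the mode count from the kernel density estimate is needed to separate those cases.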

SLIDE 15

Eigenvalue plots for some articulators

SLIDE 16

Percentage of nonuniqueness in the dataset

Extremely infrequent Quite infrequent

SLIDE 17

Histogram plots for each articulator

SLIDE 18

Histogram plot for the entire vocal tract

SLIDE 19

Unique frames in T1 space

SLIDE 20

Nonunique frames in T1 space

SLIDE 21

Conclusion

  • Nonuniqueness affects all the articulators of the vocal tract
  • Some or even all articulators may be strongly constrained
  • The normalized inversion error by neural nets is approximately the same over all articulators
  • Generally, the set of articulatory shapes that correspond to a given sound is relatively constrained around a roughly spherical region in articulatory space (0D manifold, e.g. vowels)
  • Many frames do show more complex shapes: very elongated in a straight or curved path (1D manifold, e.g. glides /l/ and /w/) or multimodality (≥2D manifold, e.g. /r/) or even more complex (e.g. /m/)

SLIDE 22

Acknowledgments

  • Work funded by NSF awards IIS-0754089 and IIS-0711186