Articulatory inversion of American English /ɹ/ by conditional density modes. Chao Qin and Miguel Á. Carreira-Perpiñán, Electrical Engineering and Computer Science, University of California, Merced. http://eecs.ucmerced.edu


SLIDE 1

Articulatory inversion of American English /ɹ/ by conditional density modes

Chao Qin and Miguel Á. Carreira-Perpiñán
Electrical Engineering and Computer Science
University of California, Merced
http://eecs.ucmerced.edu

SLIDE 2

Articulatory inversion

The problem of recovering the sequence of vocal-tract shapes that produced a given acoustic utterance (we consider the instantaneous inverse mapping). It is difficult because the inverse mapping is multivalued (and nonlinear).

  x: articulatory vector        y: acoustic vector
  f: x → y (forward mapping)    g: y → x (inverse mapping)

[Figure: several articulatory configurations producing the same acoustic signal.]

Applications: speech coding, ASR, visualisation, speech therapy, etc.
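A toy numeric illustration of why a multivalued inverse defeats least-squares regression (the forward map below is a made-up stand-in, not the real articulatory-to-acoustic mapping):

```python
import numpy as np

# Toy illustration: f(x) = sin(x) on [0, pi] is many-to-one,
# so its inverse is multivalued, like the articulatory problem.
def f(x):
    return np.sin(x)

y = 0.5                                        # an observed "acoustic" value
x1, x2 = np.arcsin(y), np.pi - np.arcsin(y)    # the two valid inverses
assert np.isclose(f(x1), y) and np.isclose(f(x2), y)

# A least-squares regressor trained on both branches learns their average,
# which is NOT a valid inverse:
x_mean = (x1 + x2) / 2                         # = pi/2
print(abs(f(x_mean) - y))                      # error of 0.5, far from zero
```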

SLIDE 3

The evidence for nonuniqueness

Indirect evidence from articulatory models (Atal et al. 1978), but not realistic, and from bite-block experiments (Lindblom et al. 1979), but not natural speech. Direct evidence based on real data for normal speech:
❖ At least one well-documented case (e.g. Espy-Wilson et al. 2000): the American English /ɹ/, produced with both retroflex and bunched tongue shapes.
❖ A large-scale statistical study of vocal-tract shapes for normal speech (American English) in the Wisconsin X-ray microbeam database (Qin and Carreira-Perpiñán 2007, 2009): ∼15% of the acoustic frames show multiple articulations, including /ɹ/ and other sounds.
⇒ The nonuniqueness is definitely there, but it is not very frequent. Since previous work on articulatory inversion has used data containing little nonuniqueness, the differences between algorithms are small.

SLIDE 4

The evidence for nonuniqueness (cont.)

[Figure: for each of /ɹ/, /æ/, /l/, /u:/, /w/ and /y/, the spectral envelope (dB/Hz, up to 8000 Hz) and the corresponding vocal-tract shapes (mm), showing multiple articulations of the same sound. From XRMB: /ɹ/, tp009 "row"; /æ/, tp001 "has"; /l/, tp037 "long"; /u:/, tp001 "school"; /w/, tp044 "work"; /y/, tp040 "you".]
SLIDE 5

Representative approaches to articulatory inversion

❖ Analysis-by-synthesis inverts a nonlinear forward mapping f (assumed univalued) from articulators x to acoustics y by minimising the acoustic error ‖y − f(x)‖. It is slow and returns invalid shapes unless the initial x is close to the solution.
❖ Other methods directly learn a nonlinear inverse mapping g: y → x: neural net (Shirai and Kobayashi 1991), MDN (Richmond et al. 2003), etc. With multiple inverses, this results in their average, which can be an invalid shape.
❖ The distal-teacher approach (Jordan and Rumelhart 1992) learns a valid inverse, but only one of them.
❖ Other methods incorporate some sequential information, e.g. a time-delay neural net or trajectory smoothing (Toda et al. 2008).
❖ Few methods have directly attempted to represent the multivalued inverse explicitly: the codebook method (Schroeter and Sondhi 1994) and the conditional-modes method (Carreira-Perpiñán 2000 and this work).

SLIDE 6

The problem with univalued mappings

❖ The best univalued mapping g: y → x in the least-squares sense (min_g E_{p(x,y)} ‖x − g(y)‖²) is the conditional mean g(y) = E{x|y}. But the mean of two valid inverses may not be a valid inverse.
❖ A univalued mapping will learn the conditional mean or possibly a single inverse, but cannot learn more than one inverse.
❖ We can instead define a multivalued mapping by the modes of the conditional distribution p(x|y). The number of inverses (modes) g(y) = f⁻¹(y) naturally varies as a function of y.

[Figure: the joint density p(x, y) over articulatory (x) and acoustic (y) variables, and the conditional density p(x|y = /ɹ/), whose modes are the retroflex and bunched shapes; the conditional mean lies between them.]
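The /ɹ/ case can be mimicked with a made-up two-component mixture standing in for p(x|y = /ɹ/): the two modes are the valid articulations, while the conditional mean lands between them in a low-density region. All numbers below are illustrative, not from the data:

```python
import numpy as np

# Sketch: a two-component 1-D Gaussian mixture standing in for p(x|y = /r/);
# the component means, weights and width are made up for illustration.
mu = np.array([-1.5, 1.5])          # "retroflex" and "bunched" articulations
w, s = np.array([0.5, 0.5]), 0.4

def p(x):                           # mixture density at points x
    x = np.atleast_1d(x)[:, None]
    return (w * np.exp(-0.5 * ((x - mu) / s) ** 2)
            / (s * np.sqrt(2 * np.pi))).sum(1)

grid = np.linspace(-4.0, 4.0, 8001)
d = p(grid)
modes = grid[1:-1][(d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])]   # local maxima
cond_mean = (w * mu).sum()
print(modes)       # two modes, near -1.5 and 1.5: both valid articulations
print(cond_mean)   # 0.0, between the modes, a low-density (invalid) shape
```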

SLIDE 7

Inversion by conditional density modes (Carreira-Perpiñán 2000)

❖ Offline, given a training set of articulatory-acoustic vector pairs {(x_n, y_n)}, learn a conditional density p(x|y): a Gaussian mixture, kernel density estimate, mixture of experts (e.g. MDN), etc.
❖ At runtime, given a test acoustic sequence y_1, …, y_T:
✦ For each frame y_t, find all possible inverses x_t^1, …, x_t^{K_t}: all modes of the conditional density p(x|y_t), found with e.g. the mean-shift algorithm (Carreira-Perpiñán 2000). This is the computational bottleneck.
✦ Find a unique vocal-tract shape sequence x_1, …, x_T by minimising, over the set of modes at all frames, the objective

  E(x_1, …, x_T) = Σ_{t=1}^{T−1} ‖x_{t+1} − x_t‖²  (continuity)  +  λ Σ_{t=1}^{T} ‖y_t − f(x_t)‖²  (validity).

Exact solution by dynamic programming in O(Tν²) (ν = average number of modes per frame).
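For a Gaussian kernel density estimate of the joint p(x, y), the conditional p(x|y) is itself a Gaussian mixture: component n sits at x_n with weight proportional to the acoustic kernel evaluated at y − y_n, so its modes can be found by mean-shift hill-climbs started from every component. A minimal sketch, assuming a single shared bandwidth and naive mode merging (the accelerations mentioned later are not shown):

```python
import numpy as np

def conditional_modes(X, Y, y, sigma=1.0, tol=1e-8, max_iter=1000):
    """Modes of p(x|y) for a Gaussian KDE p(x,y) built on pairs (X[n], Y[n]).
    Mean-shift iterations started from every reweighted component."""
    # acoustic reweighting: component n gets weight K(y - y_n)
    w = np.exp(-0.5 * np.sum((Y - y) ** 2, axis=1) / sigma ** 2)
    modes = []
    for x in X.copy():                        # one hill-climb per component
        for _ in range(max_iter):
            q = w * np.exp(-0.5 * np.sum((X - x) ** 2, axis=1) / sigma ** 2)
            x_new = q @ X / q.sum()           # Gaussian mean-shift update
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        if not any(np.linalg.norm(x - m) < 1e-3 for m in modes):
            modes.append(x)                   # keep only distinct fixed points
    return np.array(modes)
```

Two articulatory clusters that share the same acoustics then yield two modes, as in the /ɹ/ example.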

SLIDE 8

Inversion by conditional density modes (cont.)

In the codebook method (Schroeter and Sondhi 1994):
❖ A large (10⁵+ entries) table of articulatory-acoustic pairs (x_n, y_n) is constructed by sampling an articulatory model and possibly doing vector quantisation.
❖ It is difficult to remove unrealistic/atypical shapes and to achieve a uniform, comprehensive sampling (the data lie on a nonlinear manifold).
❖ The dynamic-programming search is very slow.
❖ The resulting trajectory shows discretisation artifacts.
In the conditional-density-modes method:
❖ The joint density p(x, y) replaces the codebook and takes far less memory.
❖ By learning it from real articulatory data of normal speech we ensure it represents feasible and typical shapes.
❖ The mode finding is slow (but it can be accelerated); the dynamic programming is very fast.
❖ The resulting trajectory is smooth.
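The dynamic-programming search over per-frame modes is a Viterbi-style shortest path; a sketch using only the continuity term, as in the experiments (the validity term would add λ‖y_t − f(x_t)‖² per node):

```python
import numpy as np

def dp_mode_path(modes_per_frame):
    """Choose one mode per frame minimising sum_t ||x_{t+1} - x_t||^2
    (continuity term only) by dynamic programming, O(T nu^2).
    modes_per_frame: list of (K_t, d) arrays of candidate inverses."""
    T = len(modes_per_frame)
    cost = np.zeros(len(modes_per_frame[0]))   # best cost ending at each mode
    back = []
    for t in range(1, T):
        # (K_{t-1}, K_t) squared distances between consecutive frames' modes
        d2 = np.sum((modes_per_frame[t - 1][:, None, :] -
                     modes_per_frame[t][None, :, :]) ** 2, axis=2)
        total = cost[:, None] + d2
        back.append(np.argmin(total, axis=0))  # best predecessor per mode
        cost = total.min(axis=0)
    path = [int(np.argmin(cost))]              # backtrack from the best end
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return [modes_per_frame[t][k] for t, k in enumerate(path)]
```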

SLIDE 9

A dataset of American English /ɹ/ sequences

❖ To test the ability of inversion methods to deal with nonuniqueness, we create a dataset with several sequences of American English /ɹ/, extracted by hand from the Wisconsin X-ray microbeam database (XRMB; Westbury 1994).
❖ The XRMB simultaneously records the audio and the 2D midsagittal positions of several pellets on the tongue, lips, etc.
❖ This gives an incomplete representation of the vocal tract (no data beyond the velum).

[Figure: midsagittal pellet positions (X, Y in mm) for UL, LL, T1-T4, MNI and MNM, with the palate outline and pharyngeal wall.]

SLIDE 10

A dataset of American English /ɹ/ sequences (cont.)

❖ Speaker jw11.
❖ American English /ɹ/: 402 frames (training) and 6 test trajectories (3 retroflex, e.g. "right", "roll", + 3 bunched, e.g. "rag", "row"), chosen manually.
❖ Acoustic features: 20th-order LPC.

[Figure: pellet positions (mm) of the training and testing data for UL, LL, MNI, MNM and T1-T4, and the corresponding spectral envelopes (amplitude in dB vs. frequency in Hz).]
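One standard way to compute 20th-order LPC features such as those above is the autocorrelation method solved by the Levinson-Durbin recursion; a minimal generic sketch (framing, windowing and pre-emphasis omitted; not necessarily the exact front end used in the experiments):

```python
import numpy as np

def lpc(x, order=20):
    """LPC coefficients a (a[0] = 1) by the autocorrelation method,
    solved with the Levinson-Durbin recursion."""
    n = len(x)
    r = np.array([x[:n - i] @ x[i:] for i in range(order + 1)])  # autocorr
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[1:i][::-1]) / err   # reflection coefficient
        a[1:i + 1] += k * a[:i][::-1]               # update predictor
        err *= 1.0 - k * k                          # residual energy
    return a

# e.g. coeffs = lpc(frame * np.hamming(len(frame)), 20) for a windowed frame
```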

SLIDE 11

Experimental setup: methods

❖ Joint articulatory-acoustic density p(x, y): Gaussian kernel density estimate with σ = 11 mm on the training set, from which we derive the conditional density p(x|y_t) of articulators given acoustics.
✦ mean: the conditional mean.
✦ dpmode: the conditional modes picked by the dynamic programming (we used only the continuity constraint).
✦ mode: the modes closest to the ground-truth articulator sequence x_1, …, x_T (an oracle).
❖ rbf: a radial basis function network that directly learns the inverse mapping g (asymptotically equivalent to mean).
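A minimal sketch of an rbf-style baseline: a Gaussian radial basis function network g(y) = φ(y)W fit by regularised least squares. The centres, width and regulariser below are assumptions; the slide does not specify them.

```python
import numpy as np

def fit_rbf(Y, X, centers, width, reg=1e-8):
    """Fit g(y) = Phi(y) @ W to pairs (Y[n], X[n]) by ridge-regularised
    least squares; centers/width are design choices, not from the slide."""
    def phi(Q):  # (m, M) Gaussian basis activations for a batch Q of inputs
        return np.exp(-0.5 * np.sum((Q[:, None, :] - centers[None]) ** 2, -1)
                      / width ** 2)
    P = phi(Y)
    W = np.linalg.solve(P.T @ P + reg * np.eye(len(centers)), P.T @ X)
    return lambda Q: phi(Q) @ W        # g: acoustic batch -> articulatory batch
```

Like any least-squares regressor, this learns one output per input, so on multivalued data it averages the branches (the failure mode discussed on slide 6).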
SLIDE 12

Reconstructed sequence for retroflex /ɹ/ “roll”

[Figure, top: conditional density modes, conditional mean and ground truth for frames F1-F9 (nmodes = 5, 5, 5, 5, 4, 4, 3, 3, 3); axes in mm. Bottom: reconstructions by dpmode, mean, rbf and the ground truth for the same frames.]

SLIDE 13

Reconstructed sequence for bunched /ɹ/ “rag”

[Figure, top: conditional density modes, conditional mean and ground truth for frames F1-F3, F6, F10, F12-F14, F16 and F17 (nmodes = 1, 3, 3, 5, 4, 4, 3, 4, 2, 1); axes in mm. Bottom: reconstructions by dpmode, mean, rbf and the ground truth for the same frames.]

SLIDE 14

Temporal trajectory reconstruction

Bunched /ɹ/ in utterance tp050 “row” (coils T1 to T4).

[Figure: x and y coordinates of coils T1-T4 (mm) vs. time (6.08-6.16 s), comparing the ground truth, cmean and dpmode reconstructions.]
SLIDE 15

Inversion results

RMSE (mm) and correlation ([-1,1]) per articulatory channel:

[Figure: RMSE (mm) and correlation per articulatory channel (ULx, ULy, LLx, LLy, T1x, T1y, …, MNMx, MNMy) for rbf, cmean, cmode and dpmode.]

Average RMSE in mm (correlation in parentheses) for the tongue and for all coils:

              rbf           mean          mode          dpmode
  Tongue      2.66 (0.67)   3.08 (0.55)   1.19 (0.94)   1.20 (0.94)
  All coils   2.07 (0.52)   2.26 (0.48)   1.27 (0.72)   1.30 (0.71)
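The two figures of merit in the table, per-channel RMSE in mm and per-channel correlation, can be computed as follows (a generic sketch of the standard definitions):

```python
import numpy as np

def channel_metrics(x_pred, x_true):
    """Per-channel RMSE and Pearson correlation between a reconstructed
    articulatory trajectory x_pred and the ground truth x_true, both (T, d)."""
    rmse = np.sqrt(np.mean((x_pred - x_true) ** 2, axis=0))
    p = x_pred - x_pred.mean(axis=0)       # center each channel
    t = x_true - x_true.mean(axis=0)
    corr = (p * t).sum(0) / np.sqrt((p ** 2).sum(0) * (t ** 2).sum(0))
    return rmse, corr
```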

SLIDE 16

Conclusions

We have proposed:
❖ A subset of the XRMB dataset consisting of sequences of American English /ɹ/ with multiple articulations, to be used as a benchmark for articulatory inversion.
❖ An articulatory inversion method that explicitly addresses the multivalued nature of the inverse mapping:
✦ The inverses are the modes of a conditional density.
✦ The recovered inverse trajectory minimises a continuity constraint.
The method:
❖ recovers retroflex or bunched shapes for /ɹ/, unlike other methods that recover an average, straight shape;
❖ is perfectly general (not specific to /ɹ/).
Work funded by NSF award IIS-0711186.

http://eecs.ucmerced.edu