[PPT] - University of Southern California IEEE Odyssey June 2016 PowerPoint Presentation

SLIDE 1

Understanding individual-level speech variability:
From novel speech production data to robust speaker recognition

Shrikanth (Shri) Narayanan
Signal Analysis and Interpretation Laboratory (SAIL)
http://sail.usc.edu

University of Southern California

IEEE Odyssey June 2016

SLIDE 2

Different individuals... each with a uniquely shaped vocal instrument

SLIDE 3

Different individuals... each with a uniquely shaped vocal instrument

nose tongue velum

SLIDE 4

And with differing articulatory strategies during speech ...
Fifteen different individuals producing vowel /i/

SLIDE 5

What role can speech science play in understanding and supporting speech technology development?

Theme

SLIDE 6

Talk focus: Vocal tract structure and function

Characterize the interplay between vocal-tract structure and function

Structure: Physical characteristics of the vocal-tract apparatus
  e.g., hard palate geometry, tongue volume, velum mass
Function: Behavioral characteristics of speech articulation
  e.g., dynamic formation of constrictions in the vocal tract

SLIDE 7

Overarching Questions

How are individual vocal-tract structural differences reflected in the speech acoustics?
Can structural differences be predicted from acoustics?
How do individuals adapt to structural differences to achieve phonetic equivalence?
What contributes to distinguishing speakers from one another from the speech signal?

Not only try to differentiate individuals from their speech signal but understand what makes them different from a structure-function perspective

SLIDE 8

Summary of specific goals of this talk

Quantify individual variability in vocal-tract morphology
Predict morphological details from acoustics
Characterize individual articulatory strategy
Explore applications to automatic speaker recognition
Interpret speaker recognition as variability in morphology and strategy (including speaking style differences)

SLIDE 9

Speech Production and Articulation kNowledge Group

http://sail.usc.edu/span

Diverse Stimuli: vowels, continuants, read sentences, spontaneous, non-speech gestures

Multimodal Data Acquisition: RT-MRI, 3D MRI, audio, EMA

Multimodal Analysis & Modeling
direct image analysis
forced alignment
articulator tracking
acoustic feature extraction
cross-modal registration
airway segmentation
morphological characterization
task-dynamic modeling
dynamic 3D vocal tract modeling
joint factor analysis, manifold learning, multiview learning

Scientific Insights, Models, Theory
dynamics of production
3D vocal tract shaping
articulatory coordination
source-filter interaction
realization of prosody
speaker-specific phonetics

Technology Applications

SLIDE 10

Rest of the talk

Measuring speech production: getting data
  focus on magnetic resonance imaging
Analysis of speech production data
Some modeling & application results
  Characterizing vocal tract morphology
  Understanding speaker-specific articulatory strategy
  Inferring vocal tract structure/strategy from speech signal
  Enriching speaker verification with production information

SLIDE 11

Methods for vocal tract imaging

getting speech production data….

SLIDE 12

Speech Production Studies: Data Is Integral

Observe, measure, visualize articulatory details during speech
Long history of instrumentation and imaging applications
Number of techniques, each with its own strengths and limitations
– Spatial and temporal resolution
– Subject safety
– Flexibility, ease of use, portability
– Data interpretability
– Specific research and application needs

SLIDE 13

Commonly used speech production data types

X-ray
+ high temporal and spatial resolution
− radiation; limited resolution for soft tissue
Electromagnetometry (EMA)
+ safe; high temporal resolution; flesh point tracking
− invasive; spatially sparse data; not for pharyngeal structures
Ultrasound
+ safe; high temporal resolution; portable
− provides incomplete view of vocal tract
Palatography
+ safe; high temporal resolution; portable
− invasive; provides indirect information on oral cavity

SLIDE 14

Classic Speech Production Data Examples

X-ray (Stevens, 1962)
http://psyc.queensu.ca/~munhallk/05_database.htm

Electropalatography (courtesy: UCLA Phonetics Lab)

Ultrasound (Stone, 1980)
http://www.speech.umaryland.edu

[Figure labels: lower lip, upper lip, teeth, tongue, velum]

Electromagnetometry

SLIDE 15

Newer Possibilities: MRI for structural vocal tract imaging

Capable of 3D imaging of the hydrogen concentration in the human body

Number of advantages:
– Non-invasive, no ionizing radiation
– Arbitrary scan plane: Information on complete vocal tract geometry
– Excellent, flexible structural differentiation: Good soft tissue contrast, SNR
– Amenable to computerized 3D modeling: reconstruction and visualization
– Quantitative information: area function and acoustic relations
– Variability analyses

Limitations/Challenges
– Slow: Spatial & temporal resolution tradeoffs, optimizing to a given application
– Noisy images: Susceptibility, blurring artifacts
– Imaging teeth
– Interaction with other physiological activities: respiration, swallowing, other movement
– Clean, synchronized audio (and other modalities, as needed)
– Ease of experimentation, including cost and portability

SLIDE 16

 

MRI: Toward real-time acquisition for speech (circa 2004)

TONGUE VELUM

First to use real-time MRI and synchronous noise-cancelled audio to understand vocal tract movements during natural speech production.

Improving MRI temporal resolution
– A non-2D-FFT acquisition strategy (spiral k-space trajectory) on a GE Signa 1.5T CV/i scanner with a low-flip-angle spiral gradient echo, 9-10 images/second
– Adapted pulse sequence originally developed for cardiac imaging
– Effective reconstruction rates of 24-35 frames/second using a sliding window reconstruction technique

Narayanan, S., Nayak, K., Lee, S., Sethy, A., and Byrd, D. An approach to real-time magnetic resonance imaging for speech production. J. Acoust. Soc. Am., 115:1771-1776, 2004.
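
A rough illustration of why a sliding window raises the reconstructed frame rate without speeding up the acquisition itself: a full image needs a complete set of interleaves, but a new frame can be formed each time a few fresh interleaves arrive, reusing the most recent full set. The sketch below is only an assumption-laden toy: the function names and the values (13 interleaves, 8 ms per interleaf, step of 4) are hypothetical and are not taken from the slides.

```python
def frame_rates(n_interleaves, tr_ms, step):
    """Return (native_fps, sliding_fps) for an interleaved acquisition.

    n_interleaves: interleaves needed for one full image (hypothetical value)
    tr_ms:         time per interleaf in milliseconds (hypothetical value)
    step:          number of new interleaves acquired between output frames
    """
    if not 1 <= step <= n_interleaves:
        raise ValueError("step must be between 1 and n_interleaves")
    native_fps = 1000.0 / (n_interleaves * tr_ms)   # one frame per full set
    sliding_fps = 1000.0 / (step * tr_ms)           # one frame per `step` interleaves
    return native_fps, sliding_fps


def sliding_windows(n_acquired, n_interleaves, step):
    """Index ranges (start, stop) of the interleaves used for each frame."""
    return [(start, start + n_interleaves)
            for start in range(0, n_acquired - n_interleaves + 1, step)]


native, sliding = frame_rates(n_interleaves=13, tr_ms=8.0, step=4)
print(f"native: {native:.1f} fps, sliding window: {sliding:.1f} fps")
print(sliding_windows(n_acquired=26, n_interleaves=13, step=4))
```

Each frame still spans a full set of interleaves in time, so the sliding window smooths the display rather than shortening the temporal footprint of any single frame.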

SLIDE 17

Can we speed up MRI to even better rates?

SLIDE 18

Spatial vs. Time resolution: speech MRI

[Figure: spatial resolution (mm²) vs. time resolution (msec) for Cartesian (R=2.4, 1 slice), Spiral (R=6.5, 1 slice), and the proposed system; marked speech tasks: velo-pharyngeal closure, velic movements, closures of alveolar trills, consonant constrictions, tongue movements (vowel to consonant transitions), co-articulation events, sustained sounds]

Our new system (circa 2015) enables visualization of all speech tasks
Single slice: 12 ms/frame (83 fps)

Sajan Lingala, Yinghua Zhu, Yoon-Chul Kim, Asterios Toutios, Shrikanth Narayanan, Krishna Nayak. A fast and flexible MRI system for the study of dynamic vocal tract shaping. Magnetic Resonance in Medicine. 2016

SLIDE 19

Spatial vs. Time resolution: speech MRI

[Figure: spatial resolution (mm²) vs. time resolution (msec) for Cartesian (R=2.4, 1 slice), Spiral (R=6.5, 1 slice), and Spiral (R=6.5, 3 slices); marked speech tasks: velo-pharyngeal closure, velic movements, closures of alveolar trills, consonant constrictions, tongue movements (vowel to consonant transitions), co-articulation events, sustained sounds]

Our system enables visualization of all speech tasks
Single slice: 12 ms/frame
Three-slice: 36 ms/frame

Sajan Lingala, Yinghua Zhu, Yoon-Chul Kim, Asterios Toutios, Shrikanth Narayanan, Krishna Nayak. A fast and flexible MRI system for the study of dynamic vocal tract shaping. Magnetic Resonance in Medicine. 2016

SLIDE 20

Methods

Highly accelerated RT-MRI of speech is achieved by a synergistic combination of
Novel custom upper-airway coil design
Fast spiral readouts
Constrained reconstruction/TT-GRAPPA

Sajan Lingala, Yinghua Zhu, Yoon-Chul Kim, Asterios Toutios, Shrikanth Narayanan, Krishna Nayak. A fast and flexible MRI system for the study of dynamic vocal tract shaping. Magnetic Resonance in Medicine. 2016

SLIDE 21

Experimental setup

1.5 T GE Signa Scanner (Clinical magnet, LA county)
Real-time (RTHawk) interactive system
Simultaneous audio acquisition, and noise cancellation

[Figure labels: Scanner hardware; Interactive control station (RTHawk); Simultaneous audio acquisition]

SLIDE 22

Real-time MRI at 83 fps, 2.4 mm/pixel

SLIDE 23

Accelerated (7 seconds) Volumetric Protocol

Kim, Y. C., Narayanan, S. S., & Nayak, K. S. (2009). Accelerated three-dimensional upper airway MRI using compressed sensing. Magnetic Resonance in Medicine, 61(6), 1434-1440.
Lingala, S., Toutios, A., Toger, J., Lim, Y., Zhu, Y., Kim, Y.-C., Vaz, C., Narayanan, S., & Nayak, K. (2016). State-of-the-art MRI protocol for comprehensive assessment of vocal tract structure and function. Interspeech, San Francisco, CA.

SLIDE 24

T2-weighted MRI for detailed anatomical profiles

Coronal sweep (3.5 min)
Sagittal sweep (3.5 min)
Axial sweep (3.5 min)

SLIDE 25

USC-TIMIT: A MULTIMODAL ARTICULATORY DATA CORPUS FOR SPEECH RESEARCH

10 American English talkers (5M, 5F).
Real-time MRI (5 speakers also with EMA) and synchronized audio.
460 sentences each (>20 minutes)
Freely available for speech research.

Narayanan et al. (2011). A Multimodal Real-Time MRI Articulatory Corpus for Speech Research. InterSpeech.
Narayanan et al. (2014). Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research. J. Acoust. Soc. Am.

WEB-LINK (with download info): http://sail.usc.edu/span/usc-timit/

SAIL homepage: http://sail.usc.edu

SLIDE 26

Some USC-TIMIT examples

M1 M2 F1 F2

Narayanan et al. (2014). Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research. J. Acoust. Soc. Am.

SLIDE 27

Rest of the talk

Measuring speech production: getting data
  focus on magnetic resonance imaging
Analysis of speech production data
Some modeling & applications
  Characterizing vocal tract morphology
  Understanding speaker-specific articulatory strategy
  Inferring vocal tract structure/strategy from speech signal
  Enriching speaker verification with production information

SLIDE 28

Analysis preliminaries

Image analysis
Deriving morphological (structural) details, and linguistically meaningful articulatory features

SLIDE 29

Measurement of Structural Details

SLIDE 30

Vocal Tract Contours

Model-Based Image Segmentation in the Fourier Domain

First define a contour model segmentation manually: each articulator in a different color
Now hierarchically optimize the model fit to the image in the Fourier domain using gradient descent!

Erik Bresch and Shrikanth Narayanan. Region segmentation in the frequency domain applied to upper airway real-time magnetic resonance images. IEEE Transactions on Medical Imaging. 28(3): 323-338, March 2009.

SLIDE 31

Articulator Tracking

Erik Bresch and Shrikanth Narayanan. Region segmentation in the frequency domain applied to upper airway real-time magnetic resonance images. IEEE Transactions on Medical Imaging. 28(3): 323-338, March 2009.

SLIDE 32

Articulatory Posture & Constriction Task Variables

These feature sets are useful for modeling speech production dynamics

Adam Lammert, Louis Goldstein, Shrikanth Narayanan and Khalil Iskarous. Statistical Methods for Estimation of Direct and Differential Kinematics of the Vocal Tract. Speech Communication. 55: 147–161, 2013.
Vikram Ramanarayanan, Adam Lammert, Louis Goldstein, Shrikanth Narayanan, "Are Articulatory Settings Mechanically Advantageous for Speech Motor Control?", PLoS ONE, Public Library of Science, vol. 9, no. 8, pp. e104168, 2014.

SLIDE 33

Tracking Constriction Variables

Sorensen, T., Toutios, A., Goldstein, L., & Narayanan, S. (2016), Characterizing vocal tract dynamics with real-time MRI, Conference on Laboratory Phonology, Ithaca, NY.
Vikram Ramanarayanan, Louis Goldstein, Dani Byrd and Shrikanth S. Narayanan, An investigation of articulatory setting using real-time magnetic resonance imaging (2013), in: J. Acoust. Soc. Am., 134:1(510-519)

SLIDE 34

Speaker-Specific Articulatory Models

Toutios, A., & Narayanan, S. S. (2015). Factor analysis of vocal-tract outlines derived from real-time magnetic resonance imaging data. International Congress of Phonetic Sciences, Glasgow, UK.

SLIDE 35

Analysis and Modeling

Anatomical (morphometric) details
  Palatal shapes
  Vocal tract length
  Tongue volume
Articulatory strategies
Preliminary speaker verification experiments
Moving forward

SLIDE 36

Data analysis: anatomical details

From T2-weighted images obtained for high soft-tissue contrast

Mid-sagittal measures, based on CT work on anatomical development (1)
Landmarks A-I
Length measures, e.g. vocal tract length, horizontal: distance D-H
Note also gnathion (Gn) landmark used for jaw dimensions

1: Vorperian et al. JASA 125(3):1666, 2009

SLIDE 37

Data analysis: anatomical details

Mandible size and shape based on previous CT study (2)
Multi-planar reconstruction (MPR) of slice through jaw from sagittal image stack
Landmarks: Gonion (GoLt and GoRt), superior aspect of condylar process (CdSuLt, CdSuRt)

2: Whyms et al. Oral Surg, Oral Med, Oral Path and Oral Rad, 115(5):682, 2013

SLIDE 38

Data analysis: anatomical details summary

Midsagittal measures (1)
Vocal tract, vertical (VT-V)
Posterior cavity length (PCL)
Nasopharyngeal length (NPhL)
Vocal tract, horizontal (VT-H)
Lip thickness (LTh)
Anterior cavity length (ACL)
Oro-pharyngeal width (OPhW)
Vocal tract, oral (VT-O)

Mandible measures (2)
Vocal tract, vertical (VT-V)
Posterior cavity length (PCL)
Nasopharyngeal length (NPhL)
Vocal tract, horizontal (VT-H)
Lip thickness (LTh)
Anterior cavity length (ACL)
Oro-pharyngeal width (OPhW)
Vocal tract, oral (VT-O)

1: Vorperian et al. JASA 125(3):1666, 2009
2: Whyms et al. Oral Surg, Oral Med, Oral Path and Oral Rad, 115(5):682, 2013

SLIDE 39

Morphometric Analysis: “test-retest” Results

J. Töger, Y. Lim, S. Lingala, S. Narayanan, K. Nayak. Optimization of real-time vocal tract MRI image reconstruction based on reproducibility of derived quantitative measures. Proc. Interspeech 2016.

SLIDE 40

Rest of the talk

Measuring speech production: getting data
  focus on magnetic resonance imaging
Analysis of speech production data
Some modeling & applications
  Characterizing vocal tract morphology
  Understanding speaker-specific articulatory strategy
  Inferring vocal tract structure/strategy from speech signal
  Enriching speaker verification with production information

SLIDE 41

Interspeaker Variability: Vocal Tract Morphology (Structure)

Confined articulatory environment
Variability across speakers
Highly articulated, layered system
Reflected in acoustical properties

SLIDE 42

42

SLIDE 43

VOCAL TRACT MORPHOLOGY

Why is morphological structure relevant?
Palate Shape – Principal Components

[Figure: first three principal components of palate shape, each shown at ±1.5 std. dev., anterior to posterior]
concavity: 46% of variance
anteriority: 30% of variance
sharpness: 10% of variance

Adam Lammert, Michael Proctor and Shrikanth Narayanan. Morphological Variation in the Adult Hard Palate and Posterior Pharyngeal Wall. Journal of Speech, Language, and Hearing Research. 2013a

SLIDE 44

THEORETICAL IMPACT OF PALATE SHAPE

Concavity: impact on F1 and F2; Anteriority: impact on F2 only; Sharpness: marginal

[Figure: δF1 and δF2 plotted against distance from teeth (cm), one panel each for Concavity, Anteriority, and Sharpness]

Adam Lammert, Michael Proctor and Shrikanth Narayanan. Interspeaker Variability in Hard Palate Morphology and Vowel Production. Journal of Speech, Language, and Hearing Research. 2013b

SLIDE 45

THEORETICAL IMPACT: ACOUSTIC MODELING

Formant Sensitivity to Palatal Concavity

[Figure: synthesized vowels for palate shapes A-D; axes F1 (ticks 300-700) and F2-F1 (ticks 400-2400)]

Adam Lammert, Michael Proctor and Shrikanth Narayanan. Interspeaker Variability in Hard Palate Morphology and Vowel Production. Journal of Speech, Language, and Hearing Research. 2013b

SLIDE 46

Experiments on estimating some of these vocal tract shape details from acoustics...

SLIDE 47

Inversion: Palatal Concavity

Binary classification: concave or flat palate?

[Diagram labels: Motor Controller, Speech Articulation, Vocal Tract Shape]

Speaker ID Features
MFCC
Open-smile
GMM UWPP
Inv. artic. features

Inversion Accuracy: 63% - 71%

Li, Lammert, et al. (2013). Automatic Classification of Palatal and Pharyngeal Wall Shape Categories from Speech Acoustics and Inverted Articulatory Signals. ISCA Workshop on Speech Production in Automatic Speech Recognition.

SLIDE 48

Inversion: Vocal Tract Length

invert vocal tract length?

[Diagram labels: Motor Controller, Speech Articulation, Vocal Tract Structure, Vocal Tract Shape, Speech Signal; compensation limited … speaker-specific acoustics]

SLIDE 49

Variation in Vocal Tract Length, Acoustics

[Figure: vocal tract length (cm) vs. age in years, Vorperian (2009); frequency (Hz) vs. vocal tract length (cm)]

SLIDE 50

Vowel Variation & Vocal Tract Length

[Figure: F1 (Hz) vs. F2 (Hz) vowel spaces (/i/, /a/, /u/) for a 13.2 cm vocal tract and a 16.4 cm vocal tract]

SLIDE 51

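VT Length Estimation: Model

Model:
$$\Phi = \frac{F_n}{2n - 1}, \quad n = 1, 2, 3, \ldots \qquad L = \frac{c}{4\Phi}$$
$$\Phi = \frac{\beta_1 F_1}{1} + \frac{\beta_2 F_2}{3} + \frac{\beta_3 F_3}{5} + \cdots + \frac{\beta_m F_m}{2m - 1}$$

Symbols: $L$ = VT length, $\Phi$ = lowest resonance, $c$ = speed of sound, $n$ = formant number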

SLIDE 52

VT Length Estimation: Design

$$\Phi = \frac{\beta_1 F_1}{1} + \frac{\beta_2 F_2}{3} + \frac{\beta_3 F_3}{5} + \cdots + \frac{\beta_m F_m}{2m - 1}, \qquad L = \frac{c}{4\Phi}$$

Wakita (1977): β1 = 0, β2 = 0, β3 = 0.5, β4 = 0.5
Fitch (1997): β1 = -0.167, β2 = 0, β3 = 0, β4 = 1.167
Proposed: determine β via linear regression
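
As a minimal sketch of the estimator on this slide, the weighted formant sum can be computed directly and converted to a length. The function names and the speed-of-sound constant (35000 cm/s) are assumptions made for illustration; the two weight sets are the ones listed on the slide, and the regression-fitted weights of the proposed method are not reproduced here.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # assumed value, not given on the slide

# Weight sets as listed on the slide (beta_1 .. beta_4)
WAKITA_1977 = [0.0, 0.0, 0.5, 0.5]
FITCH_1997 = [-0.167, 0.0, 0.0, 1.167]


def lowest_resonance(formants_hz, betas):
    """Phi = sum_i beta_i * F_i / (2i - 1), with i starting at 1."""
    return sum(beta * f / (2 * i - 1)
               for i, (beta, f) in enumerate(zip(betas, formants_hz), start=1))


def vt_length_cm(formants_hz, betas, c=SPEED_OF_SOUND_CM_S):
    """L = c / (4 * Phi)."""
    return c / (4.0 * lowest_resonance(formants_hz, betas))


# Sanity check on an ideal uniform tube of 17.5 cm, whose resonances are
# F_n = (2n - 1) * c / (4L) = 500, 1500, 2500, 3500 Hz.
formants = [500.0, 1500.0, 2500.0, 3500.0]
for name, betas in [("Wakita (1977)", WAKITA_1977), ("Fitch (1997)", FITCH_1997)]:
    print(name, round(vt_length_cm(formants, betas), 2), "cm")
```

For an ideal uniform tube both weight sets recover the same length; they differ on real speech, where the formants deviate from the odd-multiple pattern.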

SLIDE 53

Estimating VT Length From Acoustics

5 SPAN-TIMIT subjects: real-time MRI data
Median estimated value (30 sec. read speech)
r = 0.98

Could be used to account for speaker differences: new strategies for Vocal Tract Length Normalization

SLIDE 54

VT Length Estimation: Results

Simulated Data
Accuracy: 0.631 RMSE (cm), 3.8% of total length
Improvement: 8% over previous best

Human Speech Data
Accuracy: 1.159 RMSE (cm), 7.6% of total length
Improvement: 43% over previous best

* further improvements possible by refining model

Lammert, A. & Narayanan, S. On Short-Time Estimation of Vocal Tract Length from Formant Frequencies. PLOS ONE. 2015

SLIDE 55

Summary: Vocal Tract Length Inversion

Computational methodologies allow for optimal estimation of vocal tract length from acoustics
Inversion provides:
  Complementary insights into the interplay of morphology and vowel production
  General result regarding stability of higher formant frequencies with empirical and theoretical support

SLIDE 56

Interspeaker Variability in Relative Tongue Size and Vowel Production

Structure:
  Relative Tongue Size
Control and Behavior:
  Interspeaker Vowel Production Variability

SLIDE 57

Vocal Tract Length & Formant Frequencies

[Figure: F1 vs. F2 vowel spaces (/i/, /a/, /u/) for vocal tract lengths of 132 mm and 158 mm]

$$F_{norm} = \frac{L_{obs} F_{obs}}{14.5}$$
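
The normalization formula on this slide rescales each observed formant as if the speaker had a 14.5 cm vocal tract. A small sketch follows; the function name and the example formant value are hypothetical, and only the formula and the two tract lengths come from the slide.

```python
REFERENCE_LENGTH_CM = 14.5  # reference length in the slide's formula


def normalize_formant(f_obs_hz, l_obs_cm, l_ref_cm=REFERENCE_LENGTH_CM):
    """F_norm = L_obs * F_obs / L_ref."""
    return l_obs_cm * f_obs_hz / l_ref_cm


# Hypothetical observed formant of 2000 Hz for the two tract lengths on the slide
for length_cm in (13.2, 15.8):
    print(length_cm, "cm ->", round(normalize_formant(2000.0, length_cm), 1), "Hz")
```

Formants from shorter tracts are scaled down and those from longer tracts are scaled up, so that both are expressed relative to the common reference length.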

SLIDE 58

Length Normalization Leaves Residual Error

[Figure: normalized F1 vs. normalized F2 vowel spaces (/i/, /a/, /u/) with both speakers normalized to 145 mm]

SLIDE 59

Central Hypothesis

Differences in relative tongue size can partially explain residual differences in vowel space shape and size

[Figure labels: Small Tongue, Large Tongue]

SLIDE 60

Primary Questions

1) How does one define and measure tongue size?
2) How are speakers distributed by tongue size?
3) What is the effect of tongue size on articulation?
4) What is the effect of tongue size on acoustics?
5) Can tongue size be predicted and normalized?

SLIDE 61

What is Known? (Very Little)

Coordinated growth and size of vocal tract structures (Siebert, 1985; Vorperian, 1999)
Extreme macroglossia in developmental disorders
  e.g., Beckwith-Wiedemann and Down Syndrome
Macroglossia results in atypical consonants
  Laminalization of coronals (Van Borsel, 2000)
  Labio-lingual production of /b,p/
Compensatory articulation
  Slowed speech rate (Mekonnen, 2012)
Atypical vowel production?
  Noted, but not studied in detail

SLIDE 62

How Does One Define Tongue Size?

Mean posture over continuous speech (5 sentences)
Outline tongue in midsagittal plane
Calculate area of tongue polygon

[Figure labels: hyoid/epiglottis, musculature (incl. geniohyoid), mandible]
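
The last step, computing the area of the outlined tongue polygon, can be sketched with the shoelace formula. This is an illustrative assumption about how such an area might be computed, not necessarily the method used in the study; the function name and the toy outline are hypothetical.

```python
def polygon_area(points):
    """Area of a simple polygon given ordered (x, y) vertices (shoelace formula)."""
    n = len(points)
    if n < 3:
        raise ValueError("a polygon needs at least three vertices")
    twice_area = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the outline
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


# Toy outline in cm (a 4 x 3 rectangle), standing in for a traced tongue contour
outline = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
print(polygon_area(outline))
```

The vertices must be listed in order around the outline (either direction); the absolute value makes the result independent of orientation.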

SLIDE 63

Distribution of Absolute Tongue Size

* Significant sex differences (p < 0.05, Wilcoxon rank-sum)

SLIDE 64

Relative Tongue Size: Normalization

"There appear to be no well established criteria for the assessment of the size of the tongue relative to the mouth cavity." – Ardran (1972)

[Figure labels: anatomical landmarks (upper dentition, mandible, larynx, pharyngeal tonsil); CH, MF; positions: rest, mean]

SLIDE 65

Which Normalization Factor?

[Figure: relative tongue size for normalization factors CH and MF, at rest position and mean position]

* No significant sex differences (p > 0.05, Wilcoxon rank-sum)
** Highly correlated across normalization factors (r² = 0.83)

SLIDE 66

Simulation Setup: Vowel Spaces

[Figure: tube model from GLOTTIS to LIPS]
Constriction lengths: 0.28L to 0.39L
Constriction locations: 0.9L, 0.7L, 0.1L ( /i/ , /u/ , /a/ )

SLIDE 67

Vowel Space Simulations: Consistent with Data

[Figure: F1 vs. F2 vowel spaces in two panels, Human Speech and Simulation]

SLIDE 68

Conclusions

Tongue size varies substantially across speakers (range ~ 15 – 30%)
Relatively large tongues result in longer vocal tract constrictions (but, it depends on the vowel)
Long constrictions lead to “stretching” and “twisting” of vowel spaces (in speech and sim.)
Interplay among constrictions, formants and tongue size is complex (demands further study)

SLIDE 69

Articulatory strategies
how talkers move their vocal tracts

Vocal tract is a redundant system
Articulators have overlapping functions (e.g., both jaw and lips contribute to bilabial constrictions)
Speakers have several ways to change airway shape to make a constriction
We call these articulatory strategies

SLIDE 70

Airway shape and constrictions

Using rt-MRI, we can observe and measure salient articulatory details such as vocal tract shape and constrictions at the phonetic places of articulation
How to make a constriction depends on speaker characteristics
  vocal tract morphology
  speaking style
We use rt-MRI to build speaker-specific forward maps from vocal tract shape to constrictions
We then use analysis by synthesis simulations to investigate speaker-specific articulatory strategies

SLIDE 71

Locally linear forward map

from factors of vocal tract shape (Toutios & Narayanan, 2015)
to constriction degrees (Ramanarayanan et al., 2013)

Articulator positions to constriction degrees using a statistical estimation technique (Lammert et al. 2013)
Contours first automatically extracted (Bresch & Narayanan 2009)

SLIDE 72

Articulatory strategies across speakers
initial insights using data from 18 speakers in task dynamic simulation (analysis by synthesis)

A model-based approach quantifies articulatory strategies by
1. approximating the speaker-specific forward map with rt-MRI data
2. simulating vocal tract constrictions with a dynamical system, and
3. interpreting the results.

Data: rtMRI of read passages from 18 speakers (9M, 9F)

[Figure: ratio of lips/tongue use to jaw use (1 = more tongue/lips, 0 = more jaw), by speaker (dots) and constriction type (x-axis)]

Tanner Sorensen, Asterios Toutios, Louis Goldstein, Shrikanth Narayanan. Characterizing vocal tract dynamics across speakers using real-time MRI. Proc. Interspeech, 2016

SLIDE 73

Articulatory strategies across speakers: Results
initial insights using data from 18 speakers in task dynamic simulation

Lips/tongue contributed more than jaw in:
81% of the simulations (bilabial closure)
57% of the simulations (alveolar closure)
83% of the simulations (palatal approx.)
97% of the simulations (velar closure)
94% of the simulations (pharyngeal closure)

Lips/tongue contributed more than jaw for:
14 of the 18 speakers (bilabial closure)
11 of the 18 speakers (alveolar closure)
16 of the 18 speakers (palatal approx.)
all 18 speakers (velar closure)
17 of the 18 speakers (pharyngeal closure)

How much each speaker used their tongue or lips compared to their jaw differed by constriction type (functional specificity).

Tanner Sorensen, Asterios Toutios, Louis Goldstein, Shrikanth Narayanan. Characterizing vocal tract dynamics across speakers using real-time MRI. Proc. Interspeech, 2016

SLIDE 74

Speaker recognition using articulatory information?
Initial experiments on speaker verification

Acoustic level methods
  Joint factor analysis (JFA) (Kenny, 2007); i-vector (Dehak, 2011)
  Simplified supervised i-vector (Li, 2013)
  Probabilistic linear discriminant analysis (PLDA) (Prince, 2007; Matejka, 2011)
Feature level or score level fusion based on multiple features
  Short-term spectral features (MFCC, LPCC, PLP, ...)
  Spectral-temporal features (FDLP, Gabor, ...)
  Prosodic features (pitch, energy, duration, rhythm, ...)
  Voice source features (glottal features)
  High level features (phoneme, semantics, accent, ...)
  Attribute-based features
Apply features from speech production data?
  Difficult to obtain data in operational conditions

SLIDE 75

Role of speech production in speech/speaker recognition: use talker-independent acoustic-articulatory inversion

Articulatory features need to be estimated for any arbitrary talker

[Diagram labels: Talker; Acoustic Feature Extraction; Acoustic Features; MFCC; Exemplary Subject (Exemplar); Acoustic-to-articulatory mapping; Exemplar-specific, talker-independent acoustic-to-articulatory inversion; Estimated Articulatory Features; EMA or TV; Speech Production Knowledge; Acoustic-Articulatory Model; Recognized speech/speaker]

P. Ghosh and S. Narayanan. A generalized smoothness criterion for acoustic-to-articulatory inversion. J. Acoust. Soc. Am. 128(4):2162-2172, 2010.

SLIDE 76

System Overview
Feature level and score level fusion

[Diagram labels: Reference speaker (acoustics, articulation); acoustic-to-articulatory inversion model training; inversion model; UBM, enrollment, and test speakers (acoustics); MFCC feature extraction; MFCC features; GMM baseline; inverted articulation; feature level fusion (MFCC + inverted articulation); GMM baseline; score level fusion; output]

Ming Li, Jangwon Kim, Adam Lammert, Prasanta Ghosh, Vikram Ramanarayanan and Shrikanth Narayanan. Speaker verification based on the fusion of speech acoustics and inverted articulatory signals. Computer, Speech, and Language. 36: 196-211, March 2016

SLIDE 77

Front end processing and GMM baseline

Front end processing
  Wiener filter applied on the XRMB speech production data
  Real or inverted articulatory data sampled at 100 Hz
  25 ms window size with 10 ms shifts for MFCC extraction
  36-dim MFCC (18-dim + delta) with MVN
  MVN on the real articulatory data, not on the inverted one
    Real articulation (mean/var) has encoded vocal tract shape information
    Remove mean/var of the real articulation for fair comparison
    Inverted articulation (mean/var) has rich speaker information
  Concatenating MFCC and articulation together as feature level fusion
GMM baseline
  Conventional GMM-UBM-MAP approach (limited data)
  GMM size 256, relevance factor 16, AT-norm
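
A minimal sketch of the two front-end steps named on this slide, mean and variance normalization (MVN) and feature level fusion by concatenation. It assumes frame-synchronous streams (the 10 ms MFCC shift matches the 100 Hz articulatory sampling rate); the function names and the toy values are hypothetical, and real MFCC frames would have 36 dimensions.

```python
from statistics import fmean, pstdev


def mvn(frames):
    """Mean and variance normalization of each feature dimension over an utterance."""
    dims = list(zip(*frames))                 # one tuple of values per dimension
    means = [fmean(d) for d in dims]
    stds = [pstdev(d) or 1.0 for d in dims]   # guard against constant dimensions
    return [[(x - m) / s for x, m, s in zip(frame, means, stds)]
            for frame in frames]


def fuse(mfcc_frames, artic_frames):
    """Feature level fusion: concatenate the two streams frame by frame."""
    if len(mfcc_frames) != len(artic_frames):
        raise ValueError("streams must have the same number of frames")
    return [list(m) + list(a) for m, a in zip(mfcc_frames, artic_frames)]


# Toy data: 3 frames, 2-dim "MFCC" and 1-dim "articulatory" stream
mfcc = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
artic = [[0.5], [0.7], [0.6]]

# MVN on the MFCC stream; the articulatory stream is left unnormalized here,
# as the slide describes for the inverted articulation.
fused = fuse(mvn(mfcc), artic)
print(len(fused), "frames of dimension", len(fused[0]))
```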

SLIDE 78

Experimental Results

Performance of MFCC-real-articulation system, “all-small” protocol
“all-small” protocol: same as “all”, but a subset of the real articulatory data was removed from the data sets because of missing data in some channels
Feature level fusion with real articulation data helps (mean/var normalized)
Score level fusion achieved big EER reduction

“All-small” protocol
ID | System | OptDCF | EER
1 | MFCC-only | 0.38 | 7.56%
2 | MFCC+real-articulation | 0.1 | 3.96%

SLIDE 79

Experimental Results

Performance of MFCC-estimated-articulation system, “all” protocol

“All” protocol
ID | System | OptDCF | EER
1 | MFCC-only | 0.38 | 7.56%
2 | MFCC+estimated-articulation | 0.37 | 7.10%
3 | Score level fusion 1+2 | 0.34 | 6.46%
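
For readers unfamiliar with the quantities in this table, the sketch below shows one simple way to fuse two systems' scores and to compute an equal error rate (EER) from target and impostor trial scores. The weighted-sum fusion, the weight of 0.5, the function names, and the toy scores are all assumptions for illustration; the slides do not specify the fusion rule used.

```python
def fuse_scores(scores_a, scores_b, weight=0.5):
    """Score level fusion as a weighted sum (an assumed, common choice)."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(scores_a, scores_b)]


def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: the operating point where the false accept and
    false reject rates are closest, searched over observed score values."""
    best_gap, best_eer = None, None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


# Toy trial scores (hypothetical): higher means "same speaker"
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
print("EER:", equal_error_rate(targets, impostors))
print("fused:", fuse_scores([1.0, 0.0], [0.0, 1.0]))
```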

SLIDE 80

Modestly Improved Speaker Verification Using Estimated Articulatory Features

Using MFCC + estimated articulatory features (speaker-independent inversion) helps speaker recognition performance!
XRMB database

Ming Li, Jangwon Kim, Adam Lammert, Prasanta Ghosh, Vikram Ramanarayanan and Shrikanth Narayanan. Speaker verification based on the fusion of speech acoustics and inverted articulatory signals. Computer, Speech, and Language. 36: 196-211, March 2016

SLIDE 81

Summary: Speaker recognition with production information

An initial practical fusion approach for speaker verification using both acoustic and articulatory information

Significant performance enhancement (40% relative) by concatenating articulation features from measured articulatory movement data with MFCC

Moderate gains (9%-14% relative) using estimated articulatory features obtained through acoustic-to-articulatory inversion

Future work should focus on better inversion methods and on evaluating the proposed methods on the larger NIST SRE database.
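
The relative figures in this summary can be checked against the tabulated EERs using the usual definition (baseline minus system, divided by baseline). This is a sketch under that assumed definition. The helper name is hypothetical and the numbers are copied from the two results tables. The slide's rounded percentages may rest on a different metric or operating point, so the printed values need not match them exactly.

```python
def relative_reduction(baseline, system):
    """Relative reduction of an error metric with respect to a baseline."""
    return (baseline - system) / baseline

# EER values (in percent) copied from the results tables in the slides.
BASELINE_EER = 7.56  # MFCC-only
SYSTEM_EER = {
    "MFCC+real-articulation": 3.96,
    "MFCC+estimated-articulation": 7.10,
    "Score-level fusion 1+2": 6.46,
}

for name, eer in SYSTEM_EER.items():
    print(f"{name}: {100.0 * relative_reduction(BASELINE_EER, eer):.1f}% relative EER reduction")
```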


SLIDE 82

Moving forward

Characterize the interplay between articulatory structure and function
Speaker recognition experiments

We aim to collect and share data from 200 subjects:

1. Real-time MRI at 83 fps (scripted and spontaneous speech)
2. Accelerated volumetric MRI (continuant English phonemes)
3. T2-weighted volumes

New project underway supported by the NSF

USC in collaboration with MIT Lincoln Laboratory (Tom Quatieri, Nick Malyska, Adam Lammert)

S. Lingala, A. Toutios, J. Töger, Y. Lim, Y. Zhu, Y. Kim, Colin Vaz, S. Narayanan, K. Nayak. State-of-the-art MRI Protocol for Comprehensive Assessment of Vocal Tract Structure and Function. Proc. Interspeech 2016.

SLIDE 83

Comprehensive protocol to study structural-functional aspects of vocal tract

Lingala, et al. Interspeech, 2016

Index           Task                                      Length

Rapid real-time 2D MRI (scripted speech)
R1-R3           Consonants in symmetric VCV (3 scans)     30 sec (x3)
R4              Vowels in bVt                             30 sec
R5              Shibboleth and K-mart sentences           30 sec
R6              Rainbow passage                           30 sec
R7              Grandfather passage                       30 sec
R8              North wind and the sun passage            30 sec
R9              Gestures                                  30 sec
R10-R18         Repetition of Index R1-R9                 30 sec (x9)

Rapid real-time spontaneous speech
S1-S5           Description of pictures                   30 sec (x5)
S6-S10          Questions/Discussion topics               30 sec (x5)

Accelerated volumetric 3D
V1              Vowels, continuant consonants, postures   7 sec (x33)
V2              Repetition of V1                          7 sec (x33)

T2 weighted
Sagittal sweep  Resting posture                           ~4 mins
Axial sweep     Resting posture                           ~4 mins
Coronal sweep   Resting posture                           ~4 mins

Target: 200 individuals
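
As a rough consistency check on the protocol above, the scan lengths can be tallied per subject. This is only a sketch: the "~4 mins" entries are taken as exactly 240 seconds, which is an assumption, and setup or pause time between scans is ignored. The 83 fps rate is the real-time MRI frame rate quoted earlier in the talk. The variable names are hypothetical.

```python
# (group, index range, seconds per scan, number of scans), read off the protocol table.
PROTOCOL = [
    ("real-time scripted", "R1-R18", 30, 18),
    ("real-time spontaneous", "S1-S10", 30, 10),
    ("accelerated volumetric 3D", "V1-V2", 7, 66),               # 33 scans each for V1 and V2
    ("T2-weighted sweeps", "sagittal/axial/coronal", 240, 3),    # "~4 mins" assumed to be 240 s
]
FRAMES_PER_SECOND = 83  # real-time MRI frame rate quoted in the talk

total_seconds = sum(seconds * count for _, _, seconds, count in PROTOCOL)
realtime_seconds = sum(
    seconds * count for group, _, seconds, count in PROTOCOL if group.startswith("real-time")
)
realtime_frames = realtime_seconds * FRAMES_PER_SECOND

print(f"scan time per subject: {total_seconds} s ({total_seconds / 60:.1f} min)")
print(f"real-time frames per subject: {realtime_frames}")
```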

SLIDE 84

Individual variability: Insights from special populations

clinical cases
developing vocal tract

SLIDE 85

Tongue size effects: Glossectomy Patients

[Figure: Normalized Tongue Size for subjects P1 and G3; panel labels: Tongue Tip, Tongue Base/Tip; annotated values: 39%, 16%; asterisks mark two items]

Glossectomy: surgical removal of all or part of the tongue (such as for treating oral cancer)

SLIDE 86

Normal tongue vs. congenital aglossia

/ata/

Normal volunteer

partial closure of lips

full closure of tongue tip and alveolar ridge

Aglossia subject

full closure of lips

no closure of tongue tip and alveolar ridge

SLIDE 87

Concluding Remarks

Data is integral to advancing speech communication research
Vocal tract information provides a crucial piece of the puzzle
Need to gather and integrate multiple, disparate sources of information toward getting a more complete picture of speech production

The problem is highly challenging
Technological, computational as well as conceptual/theoretical challenges
Potential for applications including machine speech recognition, speaker modeling and synthesis

Acquiring, interpreting and utilizing speech production information is an ongoing interdisciplinary scientific endeavor

SLIDE 88

USC SPAN Team/Alums
sail.usc.edu/span

SLIDE 89

Special thanks to

Krishna Nayak, Louis Goldstein, Asterios Toutios, Adam Lammert, Vikram Ramanarayanan, Jangwon Kim, Tanner Sorensen, Sajan Lingala, Ming Li

And to collaborators:

Tom Quatieri, Nick Malyska, Adam Lammert (MIT Lincoln Laboratory)
Hiro Nakasone for his support and encouragement of speech science-driven speech technology research

SLIDE 90

ACKNOWLEDGEMENTS

NIH Grants DC007124 & DC03172
NSF, ONR, DoJ
USC Imaging Sciences Center
LAC-USC Hospital
USC Center for High Performance Computing

Papers, Videos, Teaching resources

http://sail.usc.edu/span

SLIDE 91

References-1

ALL REFERENCES BELOW ARE AVAILABLE AT sail.usc.edu/publications.php

Shrikanth S. Narayanan, Krishna S. Nayak, Sungbok Lee, Abhinav Sethy, Dani Byrd. An approach to real-time magnetic resonance imaging for speech production. Journal of the Acoustical Society of America, 115(4): 1771-1776, 2004.

Erik Bresch, Yoon-Chul Kim, Krishna S. Nayak, Dani Byrd, Shrikanth S. Narayanan. Seeing speech: Capturing vocal tract shaping using real-time magnetic resonance imaging. IEEE Signal Processing Magazine, 25(3): 123-132, 2008.

Vikram Ramanarayanan, Adam Lammert, Louis Goldstein, Shrikanth Narayanan. Are Articulatory Settings Mechanically Advantageous for Speech Motor Control? PLoS ONE, 9(8): e104168, 2014.

Vikram Ramanarayanan, Louis Goldstein, Dani Byrd, Shrikanth S. Narayanan. An investigation of articulatory setting using real-time magnetic resonance imaging. J. Acoust. Soc. Am., 134(1): 510-519, 2013.

Adam Lammert, Louis Goldstein, Shrikanth S. Narayanan, Khalil Iskarous. Statistical Methods for Estimation of Direct and Differential Kinematics of the Vocal Tract. Speech Communication, 55: 147-161, 2013.

Vikram Ramanarayanan, Louis Goldstein, Shrikanth S. Narayanan. Spatio-temporal articulatory movement primitives during speech production: extraction, interpretation and validation. Journal of the Acoustical Society of America, 134: 1378-1394, 2013.

Adam Lammert, Michael I. Proctor, Shrikanth S. Narayanan. Morphological Variation in the Adult Hard Palate and Posterior Pharyngeal Wall. Journal of Speech, Language, and Hearing Research, 56(2): 521-530, 2013.

Adam Lammert, Michael Proctor, Shrikanth Narayanan. Interspeaker Variability in Hard Palate Morphology and Vowel Production. Journal of Speech, Language, and Hearing Research, 56(6): 1924-1933, December 2013.

Ming Li, Adam Lammert, Jangwon Kim, Prasanta Kumar Ghosh, Shrikanth S. Narayanan. Automatic Classification of Palatal and Pharyngeal Wall Shape Categories from Speech Acoustics and Inverted Articulatory Signals. ISCA Workshop on Speech Production in Automatic Speech Recognition (SPASR), Lyon, France, 2013.

Prasanta Ghosh, Shrikanth Narayanan. A generalized smoothness criterion for acoustic-to-articulatory inversion. J. Acoust. Soc. Am., 128(4): 2162-2172, 2010.

Prasanta Kumar Ghosh, Shrikanth S. Narayanan. Automatic speech recognition using articulatory features from subject-independent acoustic-to-articulatory inversion. J. Acoust. Soc. Am. Express Letters, 130(4): EL251-EL257, 2011.

Jangwon Kim, Sungbok Lee, Shrikanth Narayanan. Estimation of the movement trajectories of non-crucial articulators based on the detection of crucial moments and physiological constraints. Interspeech, 2014.

Vikram Ramanarayanan, Louis Goldstein, Shrikanth Narayanan. Motor control primitives arising from a learned dynamical systems model of speech articulation. Interspeech, 2014.

Christina Hagedorn, Adam Lammert, Mary Bassily, Yihe Zu, Uttam Sinha, Louis Goldstein, Shrikanth S. Narayanan. Characterizing post-glossectomy speech using real-time MRI. International Seminar on Speech Production (ISSP), Cologne, Germany, 2014.

SLIDE 92

References-2

Sajan Lingala, Yinghua Zhu, Yoon-Chul Kim, Asterios Toutios, Shrikanth Narayanan, Krishna Nayak. A fast and flexible MRI system for the study of dynamic vocal tract shaping. Magnetic Resonance in Medicine, 2016.

Ming Li, Jangwon Kim, Adam Lammert, Prasanta Ghosh, Vikram Ramanarayanan, Shrikanth Narayanan. Speaker verification based on the fusion of speech acoustics and inverted articulatory signals. Computer Speech and Language, 36: 196-211, March 2016.

Y. C. Kim, S. S. Narayanan, K. S. Nayak. Accelerated three-dimensional upper airway MRI using compressed sensing. Magnetic Resonance in Medicine, 61(6): 1434-1440, 2009.

S. Lingala, A. Toutios, J. Töger, Y. Lim, Y. Zhu, Y.-C. Kim, C. Vaz, S. Narayanan, K. Nayak. State-of-the-art MRI protocol for comprehensive assessment of vocal tract structure and function. Interspeech, San Francisco, CA, 2016.

T. Sorensen, A. Toutios, L. Goldstein, S. Narayanan. Characterizing vocal tract dynamics with real-time MRI. Conference on Laboratory Phonology, Ithaca, NY, 2016.

A. Toutios, S. S. Narayanan. Factor analysis of vocal-tract outlines derived from real-time magnetic resonance imaging data. International Congress of Phonetic Sciences, Glasgow, UK, 2015.

C. Vaz, A. Toutios, S. Narayanan. Convex hull convolutive non-negative matrix factorization for uncovering temporal patterns in multivariate time-series data. Interspeech, San Francisco, CA, 2016.

Tanner Sorensen, Asterios Toutios, Louis Goldstein, Shrikanth Narayanan. Characterizing vocal tract dynamics across speakers using real-time MRI. Proc. Interspeech, 2016.

SLIDE 93

References-3

DATABASES/WEBSITES with MULTIMEDIA RESOURCES

USC TIMIT CORPUS

Shrikanth Narayanan, Asterios Toutios, Vikram Ramanarayanan, Adam Lammert, Jangwon Kim, Sungbok Lee, Krishna Nayak, Yoon-Chul Kim, Yinghua Zhu, Louis Goldstein, Dani Byrd, Erik Bresch, Prasanta Ghosh, Athanasios Katsamanis, Michael Proctor. Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (TC). The Journal of the Acoustical Society of America, 136(3): 1307-1311, 2014.
http://sail.usc.edu/span/usc-timit/

USC rtMRI IPA Chart illustration

Asterios Toutios, Sajan Goud Lingala, Colin Vaz, Jangwon Kim, John Esling, Patricia Keating, Matthew Gordon, Dani Byrd, Louis Goldstein, Krishna Nayak, Shrikanth Narayanan. Illustrating the Production of the International Phonetic Alphabet Sounds using Fast Real-Time Magnetic Resonance Imaging. Proc. Interspeech, San Francisco, 2016.
http://sail.usc.edu/span/rtmri_ipa/index.html

USC EMO MRI CORPUS

Jangwon Kim, Asterios Toutios, Yoon-Chul Kim, Yinghua Zhu, Sungbok Lee, Shrikanth S. Narayanan. USC-EMO-MRI corpus: An emotional speech production database recorded by real-time magnetic resonance imaging. International Seminar on Speech Production (ISSP), Cologne, Germany, 2014.
http://sail.usc.edu/span/usc-emo-mri/

USC EMA CORPUS

Sungbok Lee, Serdar Yildirim, Abe Kazemzadeh, Shrikanth S. Narayanan. An articulatory study of emotional speech production. Proceedings of InterSpeech, pages 497-500, 2005.
http://sail.usc.edu/ema_web/index.html

MUSIC

Michael I. Proctor, Erik Bresch, Dani Byrd, Krishna S. Nayak, Shrikanth S. Narayanan. Paralinguistic Mechanisms of Production in Human 'Beatboxing:' a Real-time Magnetic Resonance Imaging Study. Journal of the Acoustical Society of America, 133(2): 1043-1054, 2013.
http://sail.usc.edu/span/beatboxing/

Erik Bresch, Shrikanth S. Narayanan. Real-time MRI investigation of resonance tuning in soprano singing. Journal of the Acoustical Society of America Express Letters, 128(5): EL335-EL341, 2010.
http://sail.usc.edu/span/videos/USC-Soprano-AveMaria.mov
