Speaker verification based on fusion of acoustic and articulatory - PowerPoint PPT Presentation

Speaker verification based on fusion of acoustic and articulatory information
Ming Li (1, 2), Jangwon Kim (1), Prasanta Ghosh (3), Vikram Ramanarayanan (1), Shrikanth Narayanan (1)
1. Signal Analysis and Interpretation Lab (SAIL), University of Southern California, USA
2. Sun Yat-Sen University - Carnegie Mellon University Joint Institute of Engineering, Sun Yat-Sen University, China
3. Department of Electrical Engineering, Indian Institute of Science (IISc), India
This work was supported in part by NSF, NIH and the Department of Justice of the USA

Outline
• Introduction
  • Motivation of using articulatory information for speaker ID
  • Speaker ID based on acoustic-to-articulatory inversion features
• Methods
  • System overview
  • Front end processing and GMM baseline
• Database
  • Wisconsin X-Ray Microbeam data (XRMB)
• Experimental results
• Conclusions

Speaker Verification
• Speaker verification

Speech production for Speaker ID
• Speaker verification
  • Acoustic level methods
    • Joint factor analysis (JFA) (Kenny, 2007)
    • I-vector (Dehak, 2011)
    • Simplified supervised i-vector (Li, 2013)
    • Within class covariance normalization (WCCN) (Hatch, 2006)
    • Probabilistic linear discriminant analysis (PLDA) (Prince, 2007; Matejka, 2011)
  • Feature level or score level fusion based on multiple features
    • Short-term spectral features (MFCC, LPCC, PLP, etc.)
    • Spectral-temporal features (FDLP, Gabor, etc.)
    • Prosodic features (pitch, energy, duration, rhythm, etc.)
    • Voice source features (glottal features)
    • High level features (phoneme, semantics, accent, etc.)
• Apply features from the speech production system?

Morphological variability
• Vocal tract morphological variability
  • Vocal tract length (Peterson, 1952; Fant, 1960; Lee, 1999; Stevens, 1998)
  • Shapes of hard palate and posterior pharyngeal wall (Lammert, 2011)
  • “Automatic Classification of Palatal and Pharyngeal Wall Shape Categories from Speech Acoustics and Inverted Articulatory Signals”, Li et al., Interspeech satellite workshop SPASR, 2013.
• We believe that, in order to produce standard pronunciation, articulation needs to compensate for the vocal tract morphology.

Articulatory variability
• Articulation also contains speaker-specific information
  • Flat palates exhibit less articulatory variability than highly domed palates during vowel production (Perkell, 1997; Mooshammer, 2004; Brunner, 2005; Brunner, 2009)
  • Articulation of coronal fricatives is influenced by palate shape
    • apical vs. laminal articulation of sibilants (Dart, 1991)
    • jaw height and tongue body positioning (Honda, 2002; Thibeault, 2011)
• An example in singing
  • Different speakers articulate even the same words differently

Acoustic-to-articulatory inversion
• For speaker ID
  • Real articulation measurement is impossible
  • Speaker-independent acoustic-to-articulatory inversion (Ghosh, 2010)
  • Map inter-speaker acoustic variability to the reference speaker’s intra-speaker articulatory variability
P. K. Ghosh and S. S. Narayanan, “A generalized smoothness criterion for acoustic-to-articulatory inversion,” JASA, 2010.

System Overview
• Feature level fusion and score level fusion
[Figure: system block diagram. Blocks: reference speaker acoustics and articulation, inversion model training, acoustic-to-articulatory inversion model; enrollment and test speakers' acoustics, MFCC feature extraction, inversion, inverted articulation; UBM and GMM baseline on MFCC features; feature level fusion (MFCC + inverted articulation) with its own GMM baseline; score level fusion; output.]
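Score level fusion combines the trial scores of the MFCC-only system and the MFCC-plus-articulation system. The slides do not say how the scores are combined, so the following is only a minimal sketch assuming a weighted sum of z-normalized scores. The function names, the weight of 0.5, and the example scores are all hypothetical.

```python
from statistics import mean, pstdev


def z_normalize(scores):
    """Scale one system's scores to zero mean and unit variance."""
    mu = mean(scores)
    sigma = pstdev(scores) or 1.0  # guard against constant scores
    return [(s - mu) / sigma for s in scores]


def fuse_scores(scores_a, scores_b, weight_a=0.5):
    """Weighted sum of two systems' normalized scores (assumed fusion rule)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must score the same trials")
    norm_a = z_normalize(scores_a)
    norm_b = z_normalize(scores_b)
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(norm_a, norm_b)]


# Hypothetical scores for four trials from the two subsystems.
mfcc_only_scores = [2.1, -0.3, 1.7, -1.2]
mfcc_artic_scores = [1.8, 0.1, 2.2, -0.9]
fused = fuse_scores(mfcc_only_scores, mfcc_artic_scores)
```

In practice the weight would be tuned on held-out data rather than fixed at 0.5.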

Front end processing and GMM baseline
• Front end processing
  • Wiener filter applied to the XRMB data
  • Real or inverted articulatory data sampled at 100 Hz
  • 25 ms window size with 10 ms shifts for MFCC extraction
  • 36-dim MFCC (18 dim + delta) with MVN
  • MVN on the real articulatory data, not on the inverted data
    • Real articulation (mean/var) has encoded vocal tract shape information
    • Remove mean/var of the real articulation for fair comparison
    • Inverted articulation (mean/var) has rich speaker information
  • Concatenating MFCC and articulation together as feature level fusion
• GMM baseline
  • Conventional GMM-UBM-MAP approach (limited data)
  • GMM size 256, relevance factor 16, AT-norm
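The MVN and feature level fusion steps can be sketched as follows. This is an illustrative sketch in plain Python, not the authors' implementation: it assumes both streams are already frame-synchronous at 100 Hz (10 ms shift), applies MVN per dimension over one utterance, and leaves the inverted articulation unnormalized as the slide describes. The function names and the toy dimensions are hypothetical.

```python
from statistics import mean, pstdev


def mvn(frames):
    """Mean and variance normalization, per feature dimension, over one utterance."""
    dims = list(zip(*frames))                   # transpose to per-dimension tracks
    mus = [mean(d) for d in dims]
    sigmas = [pstdev(d) or 1.0 for d in dims]   # avoid division by zero
    return [[(x - m) / s for x, m, s in zip(frame, mus, sigmas)]
            for frame in frames]


def fuse_features(mfcc_frames, artic_frames):
    """Concatenate frame-synchronous MFCC and articulatory vectors."""
    n = min(len(mfcc_frames), len(artic_frames))  # tolerate off-by-one frame counts
    return [list(mfcc_frames[t]) + list(artic_frames[t]) for t in range(n)]


# Toy data: 3 frames, 2-dim "MFCC" and 1-dim "inverted articulation".
mfcc = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
inverted_artic = [[0.5], [0.7], [0.9]]
fused_frames = fuse_features(mvn(mfcc), inverted_artic)
```

With the real front end, each fused frame would hold the 36 MFCC dimensions followed by the articulatory dimensions.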

Database and experiment design
• Wisconsin X-Ray Microbeam data (XRMB) (Westbury, 1990)
  • Both articulatory measurement and simultaneously recorded speech signal are available from multiple speakers
  • Clean speech (okay for inversion, MRI data too noisy)
  • Sessions 1-101, speakers JW11-63, 46 speakers, 4034 utterances
  • Average duration of 5.72 seconds per utterance
• Two protocols: ALL and L5S (longer than 5 s data for testing)

Data sets & Protocol                           ALL   L5S
Background: all sessions of JW11-40             √     √
Target: session 11 of JW41-63                   √     √
Test: other sessions of JW41-63                 √
Test: other L5S sessions of JW41-63                   √
Tnorm: sessions 11,12,79,80,81 of JW11-40       √     √
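The Tnorm cohort in the protocol is used to normalize each test score against scores from a set of non-target models. A minimal sketch of plain T-norm follows, assuming the cohort scores for the test utterance are already available; the adaptive cohort selection implied by AT-norm is not shown, and the function name and numbers are hypothetical.

```python
from statistics import mean, pstdev


def t_norm(raw_score, cohort_scores):
    """Normalize a trial score by the mean and spread of cohort-model scores."""
    mu = mean(cohort_scores)
    sigma = pstdev(cohort_scores) or 1.0  # guard against identical cohort scores
    return (raw_score - mu) / sigma


# Hypothetical scores of one test utterance against three cohort models.
cohort = [-1.0, 0.0, 1.0]
normalized = t_norm(2.0, cohort)
```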

Experimental Results (1)
• Error bars of pairwise correlation coefficients between session 1 estimated articulatory signals (after DTW) from all 46 speakers
• Lip aperture (LA), lip protrusion (PRO), jaw opening (JAW OPEN), the constriction degree (CD) and constriction location (CL) of tongue tip (TT), tongue blade (TB), and tongue dorsum (TD).
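The pairwise comparison above can be illustrated with a small sketch: align two speakers' trajectories of the same articulatory variable with dynamic time warping (DTW), then compute the Pearson correlation over the aligned samples. This assumes 1-D trajectories, an absolute-difference local cost, and the basic three-step DTW recursion; the slides do not specify these details, and the function names and toy signals are hypothetical.

```python
import math


def dtw_path(a, b):
    """Return the optimal DTW alignment as a list of (index_a, index_b) pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1],
                                     cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y))
    vx = sum((u - mx) ** 2 for u in x)
    vy = sum((v - my) ** 2 for v in y)
    return cov / math.sqrt(vx * vy)


# Toy trajectories: the second is a time-stretched copy of the first.
speaker_a = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
speaker_b = [0.0, 0.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0, 0.0]
path = dtw_path(speaker_a, speaker_b)
aligned_a = [speaker_a[i] for i, _ in path]
aligned_b = [speaker_b[j] for _, j in path]
correlation = pearson(aligned_a, aligned_b)
```

Repeating this for every speaker pair and every articulatory variable would give the distribution summarized by the error bars.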

Experimental Results (2)
• Estimated articulatory signals of lip aperture (LA) and tongue body constriction location (TBCL) from two speaker pairs (session 1).
• JW48 and JW33 are both female and have the highest correlation in the previous figure
• JW48 and JW59 are of different genders; the mean and variance carry speaker information

Experimental Results (3)
• Performance of 26-speaker (closed-set) identification systems based on different utterance-level features derived from estimated articulatory data
• Multi-class SVM
• Train: sessions 12, 79, 80 and 81 of all 26 speakers in the background data set
• Test: session 11 of all 26 speakers in the background data set

Features & Systems      1     2     3
mean                    √     √     √
variance                      √     √
mean crossing rate                  √
Accuracy               32%   48%   52%
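The three utterance-level features can be computed per articulatory trajectory as in the sketch below. The slides do not define the mean crossing rate, so this assumes it is the fraction of consecutive sample pairs whose mean-removed values change sign; the function names are hypothetical, and the SVM classifier itself is not shown.

```python
from statistics import mean, pvariance


def mean_crossing_rate(signal):
    """Fraction of consecutive sample pairs that cross the signal's mean (assumed definition)."""
    mu = mean(signal)
    centered = [x - mu for x in signal]
    crossings = sum(1 for p, q in zip(centered, centered[1:]) if p * q < 0)
    return crossings / (len(signal) - 1)


def utterance_features(trajectory):
    """Mean, variance, and mean crossing rate of one articulatory trajectory."""
    return [mean(trajectory), pvariance(trajectory),
            mean_crossing_rate(trajectory)]


# Toy trajectory that alternates around its mean of 2.0.
features = utterance_features([1.0, 3.0, 1.0, 3.0])
```

Concatenating these values across all estimated articulatory variables would give one fixed-length vector per utterance for the classifier.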

Experimental Results (4)
• Performance of MFCC-real-articulation system, “all-small” protocol
• “all-small” protocol: same as “all”, but a subset of the real articulatory data was removed from the data sets due to missing data in some channels
• Feature level fusion with real articulation data helps (mean/var normalized)
• Score level fusion achieved a big EER reduction

“All-small” protocol
ID   Systems                     EER      OptDCF
1    MFCC-only                   11.04%   11.95%
2    MFCC-real-articulation      9.98%    10.15%
3    Score level fusion 1+2      6.42%    6.77%

Experimental Results (5)
• Performance of MFCC-estimated-articulation system, “all” protocol

“All” protocol
ID   Systems                          EER     OptDCF   Accuracy
1    MFCC-only                        8.68%   8.73%    89.65%
2    MFCC-estimated-articulation      8.40%   8.44%    90.92%
3    Score level fusion 1+2           7.83%   7.91%    91.74%

• Performance of MFCC-estimated-articulation system, “L5S” protocol

“Longer than 5s” protocol
ID   Systems                          EER     OptDCF   Accuracy
1    MFCC-only                        4.84%   4.88%    95.95%
2    MFCC-estimated-articulation      9.34%   4.52%    97.14%
3    Score level fusion 1+2           4.05%   4.17%    97.02%

Experimental Results (6)
• DET curves of the MFCC-only system and the score level fusion system for “ALL” and “L5S” protocols.
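A DET curve plots the false rejection rate against the false acceptance rate as the decision threshold sweeps over the scores, and the EER is the point where the two rates are equal. The sketch below shows one simple way to estimate the EER from target and impostor scores, assuming that a trial is accepted when its score is at or above the threshold and that the EER is read off at the threshold where the two rates are closest. The function name and the scores are hypothetical.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping the threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for th in thresholds:
        # False rejection: target trials scored below the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        # False acceptance: impostor trials scored at or above the threshold.
        far = sum(s >= th for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


# Hypothetical trial scores.
targets = [2.0, 1.5, 0.2, 1.1]
impostors = [-1.0, 0.5, -0.3, 0.1]
eer = equal_error_rate(targets, impostors)
```

Collecting the (far, frr) pair at every threshold instead of only the closest one gives the points of the DET curve.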

Conclusions
• We propose a practical fusion approach for speaker verification using both acoustic and articulatory information
• Significant performance improvement (about 40% relative) by concatenating articulation features from measured articulatory movement data with MFCC
• Moderate gains (9%-14% relative) using estimated articulatory features obtained through acoustic-to-articulatory inversion
• Future work includes investigating better inversion methods and evaluating the proposed methods on the NIST SRE database.
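The relative gains can be checked against the EER columns of the results tables. The sketch below computes the relative EER reduction of the score level fusion system over the MFCC-only system; the EER values are taken from the tables, while the function and dictionary names are hypothetical.

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent, e.g. of EER, from baseline to improved."""
    return 100.0 * (baseline - improved) / baseline


# (MFCC-only EER, score level fusion EER) in percent, from the results tables.
eer_pairs = {
    "all-small, real articulation": (11.04, 6.42),
    "ALL, estimated articulation": (8.68, 7.83),
}
gains = {name: relative_reduction(base, fused)
         for name, (base, fused) in eer_pairs.items()}
```

This gives roughly 41.8% for the real articulation case and roughly 9.8% for the estimated articulation case under the ALL protocol.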
