A New Adaptation Method for Speaker-Model Creation in High-Level Speaker Verification

Shi-Xiong Zhang and Man-Wai Mak


  1. A New Adaptation Method for Speaker-Model Creation in High-Level Speaker Verification. Shi-Xiong Zhang and Man-Wai Mak, Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University. PCM 2007.

  2. Outline: Introduction to Speaker Verification; GMM Systems and MAP Adaptation; New Adaptation Method for Speaker Modeling; Experiments and Results.

  3. What is Speaker Verification? To verify the identity of a claimant based on his/her voice, i.e., to determine whether a person is who he/she claims to be (claimant: "I am Mary"; system: "Is this Mary's voice?").

  4. Two Phases of Speaker Verification. Enrollment phase: enrollment speech from a target speaker (e.g., Bob or Sally) goes through feature extraction and model training to create a model for that speaker. Verification phase: the verification utterance goes through feature extraction and is scored against the model of the claimed identity (e.g., Sally), and the claim is accepted or rejected.

  5. Low-level Features for Speaker Verification: Traditional Speaker Modeling. The mixture density function of a Gaussian mixture model (GMM) is a linear combination of M Gaussian densities: p(x | \Lambda_s) = \sum_{i=1}^{M} w_i^s p_i^s(x), where \Lambda_s = \{w_i^s, \mu_i^s, \Sigma_i^s\}_{i=1}^{M} and p_i^s(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i^s|^{1/2}} \exp\{-\frac{1}{2}(x - \mu_i^s)' (\Sigma_i^s)^{-1} (x - \mu_i^s)\}. (The slide shows a spectrogram, frequency versus time, of the speech from which the feature vectors x are extracted.)
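
A minimal sketch (not the authors' code) of how the GMM density above can be evaluated in Python; the diagonal-covariance assumption and all names are illustrative:

    import numpy as np

    def gmm_log_likelihood(x, weights, means, variances):
        """Log of p(x | Lambda_s) = sum_i w_i * p_i(x), assuming
        diagonal covariances (an illustrative simplification)."""
        D = x.shape[0]
        # Per-component log Gaussian densities
        log_norm = -0.5 * (D * np.log(2 * np.pi)
                           + np.sum(np.log(variances), axis=1))
        log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
        log_components = np.log(weights) + log_norm + log_exp
        # Log-sum-exp over the M mixture components
        return np.logaddexp.reduce(log_components)

    # Example: a 3-component GMM over 2-dimensional feature vectors
    weights = np.array([0.5, 0.3, 0.2])
    means = np.random.randn(3, 2)
    variances = np.ones((3, 2))
    print(gmm_log_likelihood(np.zeros(2), weights, means, variances))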

  6. Low-level Features for Speaker Verification: Verification based on Speaker and Background GMMs. The Universal Background Model (UBM) is a large GMM trained to represent the distribution of speaker-independent features, p(x | \Lambda_b) = \sum_{i=1}^{M} w_i^b p_i^b(x), while a speaker GMM, derived from the UBM by MAP adaptation, represents a specific user, p(x | \Lambda_s). During verification, the features extracted from the test utterance are scored against both models; the difference between the speaker-model and UBM scores is compared with a threshold to reach the accept/reject decision.
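
A hedged sketch of the decision rule in the diagram: score the test utterance by the average frame-level log-likelihood ratio between the speaker GMM and the UBM and compare it with a threshold. It reuses gmm_log_likelihood() from the previous sketch; the default threshold value is an assumption:

    import numpy as np

    def verify_utterance(frames, speaker_gmm, ubm, threshold=0.0):
        """Accept the claimed identity if the average of
        log p(x_t | Lambda_s) - log p(x_t | Lambda_b) over all frames
        exceeds the threshold. speaker_gmm and ubm are
        (weights, means, variances) tuples."""
        score = np.mean([gmm_log_likelihood(x, *speaker_gmm)
                         - gmm_log_likelihood(x, *ubm) for x in frames])
        return score > threshold, score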

  7. High-Level Features. Humans use several levels of perceptual cues for speaker recognition. High-level cues (learned traits), which are difficult to extract automatically: pronunciation and idiolect (word usage), which depend on socio-economic status, education, and place of birth; prosody (rhythm, speed of speech, intonation), which depends on personality type and parental influence. Low-level cues (physical traits), which are easy to extract automatically: acoustic aspects of speech, which depend on the physical structure of the vocal apparatus.

  8. What is an Articulatory Feature? Articulatory features (AFs) are abstract classes that describe the movements and positions of different articulators during speech production (the slide illustrates how Speaker 1 and Speaker 2 articulate the same vowel /u/ differently). Two AFs, manner and place of articulation, were adopted for Articulatory Feature Conditional Pronunciation Modeling (AFCPM).

  9. AFCPM Training (Articulatory Feature Conditional Pronunciation Modeling). For each utterance, a null-grammar phoneme recognizer converts the MFCC sequence X = {x_1, ..., x_T} into a phoneme sequence {q_1, ..., q_T}, while two AF-MLPs (one for manner, one for place of articulation) convert the same MFCCs into the AF label sequences {l_1^M, ..., l_T^M} and {l_1^P, ..., l_T^P}. From these aligned sequences, speaker models P_s(m, p | q) and background models P_b(m, p | q) are created for each of the 46 phonemes q. Because every speaker needs 46 phoneme-dependent distributions, short enrollment utterances lead to a data-sparseness problem.
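
A sketch of how the phoneme-dependent AFCPMs can be estimated from the aligned (q_t, l_t^M, l_t^P) sequences by simple counting; the function and variable names are illustrative, not the authors' implementation:

    from collections import Counter, defaultdict

    def estimate_afcpm(phonemes, manner_labels, place_labels):
        """Estimate P(m, p | q) as the relative frequency of the
        (manner, place) pair among frames whose phoneme label is q."""
        counts = defaultdict(Counter)          # counts[q][(m, p)]
        for q, m, p in zip(phonemes, manner_labels, place_labels):
            counts[q][(m, p)] += 1
        afcpm = {}
        for q, pair_counts in counts.items():
            total = sum(pair_counts.values())
            afcpm[q] = {mp: n / total for mp, n in pair_counts.items()}
        return afcpm                           # afcpm[q][(m, p)] ~ P(m, p | q)

With only a few enrollment utterances, many (m, p, q) combinations never occur, which is exactly the data-sparseness problem mentioned above.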

  10. Contribution of This Paper. The enrollment data of a target speaker s, together with the AFCPMs of the background model P_b(m, p | q) (q = 1, ..., 46), are used to create the adapted AFCPMs \hat{P}_s(m, p | q) of the speaker (adaptation and model creation).

  11. AFCPM Verification: Verification based on Speaker and Background AFCPMs. At each frame t, feature extraction produces the triplet (q_t, l_t^M, l_t^P). These triplets are scored against the adapted AFCPM speaker model \hat{P}_s(m, p | q) and the AFCPM UBM P_b(m, p | q); the speaker-model score minus the UBM score is accumulated over the utterance and compared with a threshold to reach the accept/reject decision.
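
The block diagram implies a frame-level log-likelihood-ratio score between the speaker model and the AFCPM UBM. A sketch under that assumption, using the dictionary layout of the counting sketch above; the probability floor is an assumed safeguard against zero counts:

    import math

    def afcpm_score(frames, speaker_afcpm, background_afcpm, floor=1e-4):
        """Average over frames (q_t, m_t, p_t) of
        log P_hat_s(m, p | q) - log P_b(m, p | q)."""
        total = 0.0
        for q, m, p in frames:
            p_s = speaker_afcpm.get(q, {}).get((m, p), floor)
            p_b = background_afcpm.get(q, {}).get((m, p), floor)
            total += math.log(max(p_s, floor)) - math.log(max(p_b, floor))
        return total / len(frames)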

  12. Traditional MAP Adaptation. \hat{P}_s(m, p | q) = \beta^{MAP} P_s(m, p | q) + (1 - \beta^{MAP}) P_b(m, p | q), where \beta^{MAP} = \#((*, *, q) \text{ in the utterances of speaker } s) / [\#((*, *, q) \text{ in the utterances of speaker } s) + r]. Here P_s(m, p | q) is the unadapted phoneme-dependent AFCPM of speaker s, P_b(m, p | q) is the phoneme-dependent AFCPM of the background model, \hat{P}_s(m, p | q) is the adapted AFCPM of speaker s, and r is a relevance factor.
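
A sketch of the MAP interpolation above for the discrete AFCPM case; the relevance factor r = 16 is an illustrative choice, not a value from the paper:

    def map_adapt(speaker_afcpm, background_afcpm, speaker_counts, r=16.0):
        """Traditional MAP adaptation:
        P_hat_s(m, p | q) = beta_q * P_s(m, p | q) + (1 - beta_q) * P_b(m, p | q),
        with beta_q = n_q / (n_q + r) and n_q the number of speaker frames
        aligned to phoneme q."""
        adapted = {}
        for q, p_b in background_afcpm.items():
            n_q = speaker_counts.get(q, 0)
            beta = n_q / (n_q + r)
            p_s = speaker_afcpm.get(q, {})
            adapted[q] = {mp: beta * p_s.get(mp, 0.0) + (1 - beta) * prob
                          for mp, prob in p_b.items()}
        return adapted

When n_q is small, beta_q is close to zero and the adapted model stays close to the background model; this is the behaviour whose limitations the next two slides discuss.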

  13. Limitation of Traditional MAP Adaptation.

  14. Limitation of Traditional MAP Adaptation (continued).

  15. Proposed New Adaptation Method. \hat{P}_s(m, p | q) = \beta_s^q P_s(m, p | q) + (1 - \beta_s^q) [ \alpha_b^q P_b(m, p | q) + (1 - \alpha_b^q) \frac{P_b(m, p | q)}{P_b(m, p | *)} P_s(m, p | *) ], where \alpha_b^q = \#((*, *, q) \text{ in the utterances of all background speakers}) / [\#((*, *, q) \text{ in the utterances of all background speakers}) + r] and \beta_s^q is the speaker-dependent weight computed from the speaker's phoneme counts as on slide 12. The new adaptation thus combines four models: the unadapted phoneme-dependent AFCPM of speaker s, P_s(m, p | q); the unadapted phoneme-independent AFCPM of speaker s, P_s(m, p | *); the phoneme-dependent AFCPM of the background model, P_b(m, p | q); and the phoneme-independent AFCPM of the background model, P_b(m, p | *), to produce the adapted AFCPM \hat{P}_s(m, p | q).
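
A sketch of the proposed adaptation formula as reconstructed above, again using the dictionary layout of the earlier sketches; the relevance factor and the probability floor are assumptions:

    def new_adapt(speaker_afcpm, speaker_pi, background_afcpm, background_pi,
                  speaker_counts, background_counts, r=16.0, floor=1e-8):
        """Proposed adaptation:
        P_hat_s(m, p | q) = beta_q * P_s(m, p | q)
            + (1 - beta_q) * [ alpha_q * P_b(m, p | q)
                + (1 - alpha_q) * P_b(m, p | q) / P_b(m, p | *) * P_s(m, p | *) ].
        speaker_pi and background_pi hold the phoneme-independent models
        P_s(m, p | *) and P_b(m, p | *)."""
        adapted = {}
        for q, p_b in background_afcpm.items():
            beta = speaker_counts.get(q, 0) / (speaker_counts.get(q, 0) + r)
            alpha = background_counts.get(q, 0) / (background_counts.get(q, 0) + r)
            adapted[q] = {}
            for mp, pb_q in p_b.items():
                ps_q = speaker_afcpm.get(q, {}).get(mp, 0.0)
                ratio = pb_q / max(background_pi.get(mp, 0.0), floor)
                inner = alpha * pb_q + (1 - alpha) * ratio * speaker_pi.get(mp, 0.0)
                adapted[q][mp] = beta * ps_q + (1 - beta) * inner
        return adapted

The key difference from traditional MAP is the bracketed term: when the speaker's phoneme-dependent counts are sparse, the adapted model can still borrow speaker-specific information from the phoneme-independent model P_s(m, p | *), reweighted by the background ratio P_b(m, p | q) / P_b(m, p | *).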

  16. Traditional MAP Adaptation (repeated for comparison with the proposed method): \hat{P}_s(m, p | q) = \beta^{MAP} P_s(m, p | q) + (1 - \beta^{MAP}) P_b(m, p | q), with \beta^{MAP} = \#((*, *, q) \text{ in the utterances of speaker } s) / [\#((*, *, q) \text{ in the utterances of speaker } s) + r].
