Modeling speech using pole-zero models (Christian H. Kasess)

Modeling speech using pole-zero models Christian H. Kasess Acoustics Research Institute 25.10.2012 Kasess (ARI) Vocal tract modeling SPL 2012 1 / 31

The vocal tract
- Roughly divided into three cavities: pharyngeal, oral, nasal
- Oral vowel production: nasal section closed off by the velum
- Nasals and nasalized vowels: nasal section coupled
- Laterals (e.g. /l/): airflow on one (or both) sides of the tongue
  - Generates side branches
- Figure source: http://pegasus.cc.ucf.edu/ cnye/vocal tract pic.htm

Source-filter model
- Glottis acts as source (pulse train)
- Vocal tract acts as 'slowly' varying linear filter
- Figure source: http://health.tau.ac.il/Communication Disorders/noam

Source-filter model ctd.
- Source and filter are often assumed independent, but
  - Glottal opening and closing changes the VT filter
  - The glottal pulse is not an ideal pulse
  - The effect of the glottis is not linear
- Still, the source-filter model is useful
  - Commonly used in phonetics
  - Model parameters can be used for speaker recognition
  - Useful for formant tracking
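The source-filter idea can be sketched in a few lines: an idealised pulse train is passed through a fixed recursive filter. This is only an illustration under stated assumptions: the filter is a single two-pole resonance, the recursion is written as y(n) = sum_i a_i y(n-i) + x(n), and the function names, the 100 Hz pitch, and the 500 Hz / 80 Hz resonance are hypothetical choices, not values from the talk.

```python
import numpy as np

def pulse_train(n_samples, f0, fs):
    """Idealised glottal source: one unit pulse per pitch period."""
    period = int(round(fs / f0))          # pitch period in samples
    x = np.zeros(n_samples)
    x[::period] = 1.0
    return x

def resonator_coefficients(freq, bandwidth, fs):
    """Coefficients (a_1, a_2) of a single resonance (assumed one-formant filter)."""
    r = np.exp(-np.pi * bandwidth / fs)   # pole radius set by the bandwidth
    theta = 2.0 * np.pi * freq / fs       # pole angle set by the centre frequency
    return np.array([2.0 * r * np.cos(theta), -r * r])

def all_pole_filter(x, a):
    """Recursive filter y(n) = sum_i a_i y(n-i) + x(n), with a = (a_1, ..., a_p)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for i in range(1, min(len(a), n) + 1):
            acc += a[i - 1] * y[n - i]
        y[n] = acc
    return y

# Hypothetical example values: 100 Hz pitch, one resonance at 500 Hz.
fs = 8000.0
source = pulse_train(800, f0=100.0, fs=fs)
speech_like = all_pole_filter(source, resonator_coefficients(500.0, 80.0, fs))
```

A time-varying vocal tract would replace the fixed coefficients by frame-wise ones; the sketch keeps them constant for brevity.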

All-pole model
- All-pole model captures resonances or formants
- Autoregressive model (AR), linear predictive coding (LPC)
$$y(n) = \sum_{i=1}^{p} a_i\, y(n-i) + x(n)$$
- Works well with vowels
- Easy to estimate
  - Solve the Yule-Walker equations (Toeplitz) with the Levinson-Durbin algorithm
$$\gamma(n) = \sum_{i=1}^{p} a_i\, \gamma(n-i) + \sigma_x^2\, \delta_{n,0}$$
  - Correlation function: $\gamma(i) = E[y(n)\, y(n-i)]$
- Direct link to simple physical model
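As an illustration of the estimation step, the following sketch solves the Yule-Walker equations with the Levinson-Durbin recursion, using the sign convention y(n) = sum_i a_i y(n-i) + x(n) from this slide. The function names and the biased autocorrelation estimate are assumptions made for the example.

```python
import numpy as np

def autocorrelation(y, p):
    """Biased sample estimate of gamma(0), ..., gamma(p) for a real signal y."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    return np.array([y[i:] @ y[:n - i] / n for i in range(p + 1)])

def levinson_durbin(gamma, p):
    """Solve gamma(n) = sum_i a_i gamma(n-i), n = 1..p, for (a_1, ..., a_p).

    Returns the coefficients and the driving-noise variance sigma_x^2.
    """
    gamma = np.asarray(gamma, dtype=float)
    a = np.zeros(p)
    err = gamma[0]                        # prediction error of the order-0 model
    for m in range(p):
        # Reflection coefficient for the step from order m to order m + 1.
        k = (gamma[m + 1] - a[:m] @ gamma[m:0:-1]) / err
        a_prev = a[:m].copy()
        a[:m] = a_prev - k * a_prev[::-1]
        a[m] = k
        err *= 1.0 - k * k
    return a, err
```

The recursion needs O(p^2) operations, compared with O(p^3) for a general linear solver, because it exploits the Toeplitz structure of the correlation matrix.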

Pole-zero models
- Nasal spectra show spectral dips
  - Oral cavities and paranasal cavities act as resonators
  - Side branches cause decrease in energy
  - Pole-zero model more efficient
- Problems with pole-zero models
  - Trickier to estimate
  - Requires in general non-linear methods
  - Correspondence to physical model more difficult

All-pole vs. pole-zero model ctd.
[Figure: spectral envelope and model fits, level [dB] (about -50 to -10) versus f [Hz] (0 to 4000). Legend: Envelope; (15,0), RMS = 0.56; (10,5), RMS = 0.46; (15,5), RMS = 0.45; (20,20), RMS = 0.2]

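Pole-zero models
- Auto Regressive Moving Average (ARMA)
$$y(n) - \sum_{k=1}^{p} a_k\, y(n-k) = \sum_{j=0}^{q} b_j\, x(n-j) \qquad (1)$$
- Pole-zero model
$$\hat{y}(\omega) = \frac{B(e^{-i\omega}, \theta)}{A(e^{-i\omega}, \theta)}\, \hat{x}(\omega) = \frac{\sum_{j=0}^{q} b_j\, e^{-i\omega j}}{\sum_{k=0}^{p} a_k\, e^{-i\omega k}}\, \hat{x}(\omega) \qquad (2)$$
- Note: (2) writes the denominator with $a_0 = 1$, so the coefficients $a_1, \dots, a_p$ in (2) carry the opposite sign of those in (1)
- Estimation is in general a non-linear problem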
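To make (2) concrete, the frequency response of a pole-zero model can be evaluated directly from the two coefficient vectors. This is a minimal sketch; the function name is hypothetical and the denominator vector is assumed to include a_0 as in (2).

```python
import numpy as np

def pole_zero_response(b, a, omega):
    """Evaluate sum_j b_j e^{-i w j} / sum_k a_k e^{-i w k} on the grid omega.

    b = (b_0, ..., b_q), a = (a_0, ..., a_p) with the sign convention of (2).
    """
    omega = np.atleast_1d(np.asarray(omega, dtype=float))
    b = np.asarray(b, dtype=float)
    a = np.asarray(a, dtype=float)
    numerator = np.exp(-1j * np.outer(omega, np.arange(len(b)))) @ b
    denominator = np.exp(-1j * np.outer(omega, np.arange(len(a)))) @ a
    return numerator / denominator

# One pole at z = 0.5: gain 2 at omega = 0 and 2/3 at omega = pi.
H = pole_zero_response([1.0], [1.0, -0.5], [0.0, np.pi])
```

A zero on the unit circle, for example b = (1, 1) with a = (1), produces an exact spectral null at omega = pi, which an all-pole model can only approximate.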

Time or frequency?
- Time domain
  - Not suitable for perceptual frequency scales
- Spectral domain
  - Perceptual frequency scales can be included
  - Logarithmic spectrum can be used
  - Spectral envelope needs to be extracted
    - Harmonics for voiced segments due to glottis
    - Envelope represents VT transfer function (+ glottal pulse)

Spectral error measures
- Linear spectrum
  - Assumptions about phase are necessary (minimum phase)
  - Speech signal is not minimum phase (glottis)
- Log spectrum
$$\theta = \operatorname*{argmin}_{\theta'} \sum_{k=0}^{K-1} \left| \log \hat{y}(\omega_k) - \log \frac{B(e^{i\omega_k}, \theta')}{A(e^{i\omega_k}, \theta')} \right|^2$$
  - Perceptually relevant
- Log amplitude spectrum
$$\theta = \operatorname*{argmin}_{\theta'} \sum_{k=0}^{K-1} \left| \log \left| \hat{y}(\omega_k) \right| - \log \left| \frac{B(e^{i\omega_k}, \theta')}{A(e^{i\omega_k}, \theta')} \right| \right|^2$$
  - Phase ignored, minimum phase system easy to obtain
- Cepstral domain
  - Computationally efficient (only for linear frequency)
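The log amplitude criterion can be written as a short function. This is a sketch with hypothetical names; the polynomials are evaluated as sums of c_m e^{-i w m}, which for real coefficients gives the same amplitude as evaluation at e^{i w}.

```python
import numpy as np

def log_amplitude_error(b, a, y_hat, omega):
    """Sum over k of (log|y_hat(w_k)| - log|B(w_k)/A(w_k)|)^2.

    b = (b_0, ..., b_q), a = (a_0, ..., a_p); y_hat is the complex target spectrum.
    """
    omega = np.asarray(omega, dtype=float)
    B = np.exp(-1j * np.outer(omega, np.arange(len(b)))) @ np.asarray(b, dtype=float)
    A = np.exp(-1j * np.outer(omega, np.arange(len(a)))) @ np.asarray(a, dtype=float)
    residual = np.log(np.abs(y_hat)) - np.log(np.abs(B / A))
    return float(np.sum(residual ** 2))
```

Because only amplitudes enter, multiplying the target spectrum by an arbitrary phase factor leaves the error unchanged, which is the sense in which the phase is ignored.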

Optimization Methods
- Estimate numerator and denominator separately
- Recursive methods
  - Do not necessarily converge to local minimum
- Non-linear optimization
  - Newton method
    - Calculation of Hessian necessary
    - Numerically expensive and potentially unstable
  - Gauss-Newton method
    - Hessian approximated through first derivatives
    - Convergence issues
  - Quasi-Newton
    - Approximate Hessian (or its inverse) using iterative scheme
    - Numerically stable and inexpensive

PZ representation
- Positions of poles and zeros
  - Number of complex and real poles/zeros needs to be specified
  - Multiplicity
- Quadratic factors
  - Multiplicity
- Polynomial coefficients
  - Only number of poles and zeros
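The relation between two of these representations can be illustrated with NumPy's polynomial routines. This is a sketch with hypothetical helper names; it shows that the root representation requires counting real and complex roots, whereas the coefficient representation only needs the polynomial degree.

```python
import numpy as np

def coefficients_to_zeros(b):
    """Zeros of B(z) = sum_j b_j z^-j, i.e. roots of b_0 z^q + ... + b_q."""
    return np.roots(b)

def zeros_to_coefficients(zeros, gain):
    """Rebuild real polynomial coefficients; gain is the leading coefficient b_0."""
    return np.real(gain * np.poly(zeros))

def count_real_complex(roots, tol=1e-9):
    """Number of real roots and number of complex (non-real) roots."""
    is_real = np.abs(np.imag(roots)) < tol
    return int(is_real.sum()), int((~is_real).sum())

b = np.array([2.0, -2.0, 1.0])            # zeros at 0.5 +/- 0.5i
zeros = coefficients_to_zeros(b)
b_rebuilt = zeros_to_coefficients(zeros, gain=b[0])
```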

Recursive estimation
- Substitute non-linear problem with a linear one
- Steiglitz-McBride (1965, 1977)
$$\theta_i = \operatorname*{argmin}_{\theta'} \sum_{k=0}^{K-1} \left| \hat{y}(\omega_k)\, \frac{A(e^{i\omega_k}, \theta')}{A(e^{i\omega_k}, \theta_{i-1})} - \frac{B(e^{i\omega_k}, \theta')}{A(e^{i\omega_k}, \theta_{i-1})} \right|^2$$
$$\phantom{\theta_i} = \operatorname*{argmin}_{\theta'} \sum_{k=0}^{K-1} \left| \hat{y}(\omega_k) - \frac{B(e^{i\omega_k}, \theta')}{A(e^{i\omega_k}, \theta')} \right|^2 \left| \frac{A(e^{i\omega_k}, \theta')}{A(e^{i\omega_k}, \theta_{i-1})} \right|^2$$
- More general: Weighted linear least squares (WLLS)
$$\theta_i = \operatorname*{argmin}_{\theta'} \sum_{k=0}^{K-1} W(\omega_k, \theta_{i-1}) \left| \hat{y}(\omega_k)\, A(e^{i\omega_k}, \theta') - B(e^{i\omega_k}, \theta') \right|^2$$
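A frequency-domain version of this iteration might look as follows. It is a sketch under assumptions, not the cited authors' code: the denominator is normalised to a_0 = 1, polynomials are evaluated as sums of c_m e^{-i w m}, the weight is W = 1/|A(theta_{i-1})|^2, and all names are hypothetical.

```python
import numpy as np

def evaluate(coeffs, omega):
    """Evaluate sum_m coeffs[m] * exp(-1j * omega * m) on the grid omega."""
    return np.exp(-1j * np.outer(omega, np.arange(len(coeffs)))) @ coeffs

def wlls_step(y_hat, omega, p, q, weight):
    """Minimise sum_k W_k |y_hat_k A_k - B_k|^2 over a_1..a_p, b_0..b_q (a_0 = 1)."""
    E = np.exp(-1j * np.outer(omega, np.arange(max(p, q) + 1)))
    # The residual is linear in the unknowns: M @ theta - rhs.
    M = np.hstack([y_hat[:, None] * E[:, 1:p + 1], -E[:, :q + 1]])
    rhs = -y_hat
    sw = np.sqrt(weight)
    M_w, rhs_w = sw[:, None] * M, sw * rhs
    # Stack real and imaginary parts so the unknowns stay real.
    M_ri = np.vstack([M_w.real, M_w.imag])
    rhs_ri = np.concatenate([rhs_w.real, rhs_w.imag])
    theta, *_ = np.linalg.lstsq(M_ri, rhs_ri, rcond=None)
    return theta[p:], np.concatenate([[1.0], theta[:p]])   # b, a

def steiglitz_mcbride(y_hat, omega, p, q, n_iter=10):
    """Repeat the weighted linear fit with W = 1 / |A(theta_{i-1})|^2."""
    a = np.concatenate([[1.0], np.zeros(p)])
    b = np.zeros(q + 1)
    for _ in range(n_iter):
        weight = 1.0 / np.abs(evaluate(a, omega)) ** 2
        b, a = wlls_step(y_hat, omega, p, q, weight)
    return b, a
```

With the initial denominator A = 1 the first pass is an unweighted linear fit; later passes reweight it so that the linear error approaches the output error of the pole-zero model.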

Marelli and Balazs 2010
- Logarithmic amplitude spectrum
- Estimation of polynomial coefficients
- Quasi-Newton with line search
  - Gradient calculated analytically
  - Broyden-Fletcher-Goldfarb-Shanno (BFGS) method
  - Iterative approximation of the inverse Hessian (low-rank updates; the standard BFGS update has rank two)
  - Line search along gradient
- Initialized using the WLLS method
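The ingredients listed on this slide can be combined in a compact sketch: a log amplitude cost with analytic gradient, minimised by a textbook BFGS iteration with backtracking line search. This is not the implementation of Marelli and Balazs. The parameter layout theta = (a_1..a_p, b_0..b_q) with a_0 = 1, the Armijo constants, the search along the quasi-Newton direction, and the hand-picked starting point (instead of a WLLS initialisation) are all assumptions of the example.

```python
import numpy as np

def log_amp_cost(theta, p, log_amp, E):
    """Squared log-amplitude error and its analytic gradient.

    theta = (a_1..a_p, b_0..b_q), a_0 = 1; E[k, m] = exp(-1j * omega_k * m).
    """
    a = np.concatenate([[1.0], theta[:p]])
    b = theta[p:]
    A = E[:, :p + 1] @ a
    B = E[:, :len(b)] @ b
    r = log_amp - np.log(np.abs(B)) + np.log(np.abs(A))
    grad_a = 2.0 * (r[:, None] * (E[:, 1:p + 1] / A[:, None]).real).sum(axis=0)
    grad_b = -2.0 * (r[:, None] * (E[:, :len(b)] / B[:, None]).real).sum(axis=0)
    return np.sum(r ** 2), np.concatenate([grad_a, grad_b])

def bfgs(cost, x0, n_iter=200, tol=1e-9):
    """Minimise cost(x) -> (value, gradient) with inverse-Hessian BFGS updates."""
    x = np.array(x0, dtype=float)
    f, g = cost(x)
    H = np.eye(len(x))                          # inverse-Hessian approximation
    for _ in range(n_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g                              # quasi-Newton search direction
        t = 1.0
        f_new, g_new = cost(x + t * d)
        while not f_new <= f + 1e-4 * t * (g @ d):   # Armijo backtracking
            t *= 0.5
            if t < 1e-12:
                return x, f                     # line search failed, keep last iterate
            f_new, g_new = cost(x + t * d)
        s, y = t * d, g_new - g
        if s @ y > 1e-12:                       # update only if curvature is positive
            rho = 1.0 / (s @ y)
            V = np.eye(len(x)) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x, f, g = x + s, f_new, g_new
    return x, f

# Hypothetical example: fit a (p, q) = (2, 1) model to a synthetic envelope.
omega = np.linspace(0.0, np.pi, 64)
p, q = 2, 1
E = np.exp(-1j * np.outer(omega, np.arange(max(p, q) + 1)))
a_true, b_true = np.array([1.0, -0.9, 0.5]), np.array([1.0, 0.5])
log_amp = np.log(np.abs(E[:, :2] @ b_true)) - np.log(np.abs(E @ a_true))
theta0 = np.array([-0.5, 0.2, 0.8, 0.2])
f0, _ = log_amp_cost(theta0, p, log_amp, E)
theta_hat, f_hat = bfgs(lambda th: log_amp_cost(th, p, log_amp, E), theta0)
```

The cost is invariant to a sign flip of the numerator and to reflecting roots across the unit circle, so the minimiser is not unique; the sketch only guarantees a decrease of the error from the starting point.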

Marelli and Balazs 2010 ctd.
- New method shows lowest error
- Fewer iterations for polynomial representation

Summary
- Pole-zero
  - Efficient representation for laterals, nasals, ...
  - Different estimation schemes
  - Newton-like method gives good results
  - Speaker verification improved as compared to LPC only (Enzinger et al. 2011)
- Important questions
  - What is an appropriate degree for the polynomials?
  - Should the glottal source be corrected?
  - What about physiological constraints?

Segmented tube model
- Vocal tract as a segmented tube (Wakita 1973, Fant 1960)
- [Figure: segmented tube along x with cross-sectional areas $A_{N+1}, A_N, \dots, A_1, A_0$; ends labeled Glottis and Lips]
- Two equations per segment $m$ ($u$: volume velocity)
$$p_m(x) = \frac{\rho c}{A_m} \left( u_m^{+} e^{-ikx} + u_m^{-} e^{ikx} \right), \qquad u_m(x) = u_m^{+} e^{-ikx} - u_m^{-} e^{ikx} \qquad (3)$$
- Volume velocity and pressure are matched at boundaries
- Lossless model (no friction or viscosity, below 4000 Hz ...)

One-tube Model
- Transfer function $u_{\text{lips}} / u_{\text{glottis}} = u_0 / u_N$
$$\hat{A}(\mu, z) = z^{N/2} \begin{pmatrix} 1 & 0 \end{pmatrix} \left[ \prod_{m=N}^{0} \frac{1}{1 - \mu_m} \begin{pmatrix} 1 & \mu_m \\ \mu_m z^{-1} & z^{-1} \end{pmatrix} \right] \begin{pmatrix} 1 \\ 0 \end{pmatrix} \qquad (4)$$
- Correspondence requires
  - fixed segment length (related to $f_s$)
  - specific boundary conditions
- Example (N = 2):
$$\hat{A}(\mu, z) \propto 1 + (\mu_0 \mu_1 + \mu_1 \mu_2)\, z^{-1} + \mu_0 \mu_2\, z^{-2}$$
- For $\mu_0$ or $\mu_N = \pm 1$, reflection coefficients are calculated by a recursive algorithm (Markel and Gray, 1976)
- $m$-th reflection coefficient and definition of $z$:
$$\mu_m := \frac{A_m - A_{m+1}}{A_m + A_{m+1}}, \qquad z := \exp\left( \frac{i 2 \pi f}{f_s} \right) = \exp\left( \frac{i 2 \pi f}{c / (2l)} \right)$$
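The matrix product in (4) can be multiplied out numerically. The sketch below drops the scalar factors z^{N/2} and 1/(1 - mu_m), so it returns the polynomial only up to proportionality; the function names and the array layout (areas[m] = A_m) are assumptions for the example.

```python
import numpy as np

def reflection_coefficients(areas):
    """mu_m = (A_m - A_{m+1}) / (A_m + A_{m+1}) for areas[m] = A_m, m = 0..N+1."""
    areas = np.asarray(areas, dtype=float)
    return (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])

def tube_polynomial(mu):
    """Coefficients of A_hat in ascending powers of z^-1, up to scale and delay.

    Evaluates (1 0) * prod_{m=N..0} [[1, mu_m], [mu_m z^-1, z^-1]] * (1 0)^T.
    """
    r1 = np.array([1.0])                  # first entry of the running row vector
    r2 = np.array([0.0])                  # second entry of the running row vector
    for m in range(len(mu) - 1, -1, -1):
        r1_pad = np.concatenate([r1, [0.0]])
        r2_shift = np.concatenate([[0.0], r2])    # multiplication by z^-1
        r1, r2 = r1_pad + mu[m] * r2_shift, mu[m] * r1_pad + r2_shift
    return r1[:len(mu)]                   # drop the structurally zero top coefficient

# N = 2 with hypothetical reflection coefficients mu_0, mu_1, mu_2.
mu = np.array([0.3, -0.2, 0.5])
coeffs = tube_polynomial(mu)
```

For these values the result is (1, mu_0 mu_1 + mu_1 mu_2, mu_0 mu_2) = (1, -0.16, 0.15), matching the N = 2 example.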

Branching Tubes
- [Figure: branched tube with labels glottis, pharynx, velum, oral cavity, nasal cavity]
- Nasal tract is added
- Each tract is modeled as segmented tube
- For nasals: nasal tract open, oral tract closed
- Vocal tract model has pole-zero characteristic
- Transfer function given as
$$f(\mu, z) = \frac{\hat{B}(\mu, z)}{\hat{A}(\mu, z)}$$

Pole-zero Model
- No direct way from pole-zero to branched-tube model
  - Numerator polynomial appears also in denominator
  - Pole-zero model has 2N + M + L coefficients
  - Two-tube model has N + M + L + 1 parameters
  - Numerator can be calculated precisely
- Current estimation methods
  - Estimate pole-zero model
  - Apply step-down to numerator and
  - Minimize error with respect to either denominator polynomial (Lim and Lee 1996) or signal filtered with numerator (Schnell 2003)
  - Gives precedence to zeros
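The step-down operation mentioned above converts polynomial coefficients back into reflection coefficients. The following sketch uses the common step-up convention A_m(z) = A_{m-1}(z) + k_m z^{-m} A_{m-1}(1/z); whether this matches the sign convention of the cited methods is an assumption, and the function names are hypothetical. With mu_0 = 1, the N = 2 example of the one-tube slide coincides with `step_up([mu_1, mu_2])`.

```python
import numpy as np

def step_up(k):
    """Reflection coefficients k_1..k_p -> polynomial (1, a_1, ..., a_p)."""
    a = np.array([1.0])
    for km in k:
        a_ext = np.concatenate([a, [0.0]])
        a = a_ext + km * a_ext[::-1]      # a_m[i] = a_{m-1}[i] + k_m a_{m-1}[m-i]
    return a

def step_down(a):
    """Polynomial (a_0, ..., a_p) -> reflection coefficients k_1..k_p."""
    a = np.asarray(a, dtype=float)
    a = a / a[0]                          # normalise to a_0 = 1
    k = []
    while len(a) > 1:
        km = a[-1]                        # the last coefficient is k_m
        if abs(km) >= 1.0:
            raise ValueError("reflection coefficient with magnitude >= 1")
        k.append(km)
        a = (a - km * a[::-1])[:-1] / (1.0 - km * km)
    return np.array(k[::-1])

k = np.array([-0.2, 0.5])
a = step_up(k)                            # (1, k_1 + k_1 k_2, k_2)
k_back = step_down(a)
```

The recursion fails when a reflection coefficient reaches magnitude one, which corresponds to a completely closed or infinitely wide tube section.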
