Lecture 14: LPC speech synthesis and autocorrelation-based pitch tracking


  1. Lecture 14: LPC speech synthesis and autocorrelation-based pitch tracking. ECE 417, Multimedia Signal Processing, October 10, 2019

  2. Outline • The LPC-10 speech synthesis model • Autocorrelation-based pitch tracking • Inter-frame interpolation of pitch and energy contours • The LPC-10 excitation model: white noise, pulse train • Linear predictive coding: how to find the coefficients • Linear predictive coding: how to make sure the coefficients are stable

  3. The LPC-10 speech synthesis model

  4. The LPC-10 Speech Coder: Transmitted Parameters. Each frame is 54 bits, and is used to synthesize 22.5 ms of speech: (54 bits/frame)/(0.0225 seconds/frame) = 2400 bits/second. • Pitch: 7 bits/frame (127 distinguishable non-zero pitch periods) • Energy: 5 bits/frame (32 levels, on a log RMS scale) • 10 linear predictive coefficients (LPC): 41 bits/frame • Synchronization: 1 bit/frame
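A quick arithmetic check of this bit budget (a minimal Python sketch; the dictionary labels are illustrative, not part of the standard):

```python
# Minimal sketch: verify the LPC-10 frame bit budget and the resulting bit rate.
FRAME_BITS = {"pitch": 7, "energy": 5, "lpc": 41, "sync": 1}
FRAME_SEC = 0.0225                          # each frame covers 22.5 ms of speech

total_bits = sum(FRAME_BITS.values())       # 54 bits/frame
print(total_bits, total_bits / FRAME_SEC)   # 54 bits/frame, 2400.0 bits/second
```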

  5. The LPC-10 speech synthesis model. (Block diagram.) A binary switch, controlled by the voiced (P>0) vs. unvoiced (P=0) decision, selects the excitation e[n]: for voiced speech with pitch period P, a pulse train $e[n] = \sum_{m=-\infty}^{\infty} \delta[n-mP]$; for unvoiced speech, white noise $e[n] \sim \mathcal{N}(0,1)$. The excitation is scaled by the gain G and filtered by the vocal tract, modeled by an LPC synthesis filter $H(e^{j\omega})$, to produce the output speech $s[n]$.

  6. Outline • The LPC-10 speech synthesis model • Autocorrelation-based pitch tracking • Inter-frame interpolation of pitch and energy contours • The LPC-10 excitation model: white noise, pulse train • Linear predictive coding: how to find the coefficients • Linear predictive coding: how to make sure the coefficients are stable

  7. Autocorrelation is maximum at n=0

$$r_{xx}[n] = \sum_{m=-\infty}^{\infty} x[m]\,x[m-n]$$

  8. Autocorrelation is maximum at n=0

$$r_{xx}[n] = \sum_{m=-\infty}^{\infty} x[m]\,x[m-n] = x[n] * x[-n] = \mathcal{F}^{-1}\left\{|X(\omega)|^2\right\} = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2 e^{j\omega n}\,d\omega$$

Notice that, for n=0, this becomes just Parseval's theorem:

$$r_{xx}[0] = \sum_{m=-\infty}^{\infty} x^2[m] = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2\,d\omega$$

But since $|X(\omega)|^2$ is positive and real, any value of $e^{j\omega n}$ that is NOT positive and real will reduce the value of the integral:

$$r_{xx}[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2 e^{j\omega n}\,d\omega \le \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2\,d\omega = r_{xx}[0]$$
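A small numerical illustration of this property (a sketch assuming a numpy workflow; `autocorr` is a hypothetical helper, not code from the lecture):

```python
import numpy as np

def autocorr(x):
    """r_xx[n] = sum_m x[m] x[m-n], returned for lags n = 0 ... len(x)-1."""
    r = np.correlate(x, x, mode="full")   # all lags -(L-1) ... (L-1)
    return r[len(x) - 1:]                 # keep the non-negative lags

x = np.random.randn(400)                  # any frame of samples
r = autocorr(x)
print(np.argmax(r))                       # 0: the autocorrelation peaks at lag 0
```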

  9. Example of an autocorrelation function computed from file0.wav, “Four score and seven years ago…”

  10. Autocorrelation of a periodic signal. Suppose x[n] is periodic, $x[n] = x[n-P]$. Then the autocorrelation is also periodic:

$$r_{xx}[P] = \sum_{m=-\infty}^{\infty} x[m]\,x[m-P] = \sum_{m=-\infty}^{\infty} x^2[m] = r_{xx}[0]$$

  11. Autocorrelation of a periodic signal is periodic. (Figure; pitch period = 9 ms = 99 samples.)

  12. Autocorrelation pitch tracking
• Compute the autocorrelation.
• Find the pitch period: $P = \arg\max_{P_{min} \le n \le P_{max}} r_{xx}[n]$
• The search limits, $P_{min}$ and $P_{max}$, are important for good performance:
  • $P_{min}$ corresponds to a high woman's pitch, about $F_s/P_{min} \approx 250$ Hz
  • $P_{max}$ corresponds to a low man's pitch, about $F_s/P_{max} \approx 80$ Hz
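A minimal sketch of this search (the 8 kHz sampling rate and the exact 80-250 Hz limits are illustrative assumptions):

```python
import numpy as np

def pitch_period(frame, fs=8000, f_min=80.0, f_max=250.0):
    """Return the lag P that maximizes r_xx[n] over P_min <= n <= P_max."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    P_min = int(fs / f_max)               # shortest period: high (female) pitch
    P_max = int(fs / f_min)               # longest period: low (male) pitch
    return P_min + int(np.argmax(r[P_min:P_max + 1]))
```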

  13. The LPC-10 speech synthesis model. (Same block diagram as slide 5: voiced pulse-train excitation $e[n]=\sum_{m=-\infty}^{\infty}\delta[n-mP]$ or unvoiced white noise $e[n]\sim\mathcal{N}(0,1)$, selected by the voiced (P>0) vs. unvoiced (P=0) switch, scaled by gain G, and filtered by the LPC synthesis filter $H(e^{j\omega})$ to produce $s[n]$.)

  14. The voiced/unvoiced decision
• x[n] voiced: $x[n+P] \approx x[n]$, so $r_{xx}[P] \approx r_{xx}[0]$
• x[n] unvoiced (white noise): $E[x[m]\,x[m-n]] \approx \delta[n]$, which means that $r_{xx}[P] \ll r_{xx}[0]$
So a reasonable V/UV decision is:
• $r_{xx}[P]/r_{xx}[0] \ge$ threshold: say the frame is voiced.
• $r_{xx}[P]/r_{xx}[0] <$ threshold: say the frame is unvoiced.
Setting threshold ≈ 0.25 works reasonably well.
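Continuing the sketch above, the V/UV rule can be written as follows (the 0.25 threshold is the value suggested on the slide; the function name is illustrative):

```python
import numpy as np

def voiced_or_unvoiced(frame, P, threshold=0.25):
    """Return P if the frame looks voiced, or 0 (the 'unvoiced' code) otherwise."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return P if r[P] / r[0] >= threshold else 0
```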

  15. Outline • The LPC-10 speech synthesis model • Autocorrelation-based pitch tracking • Inter-frame interpolation of pitch and energy contours • The LPC-10 excitation model: white noise, pulse train • Linear predictive coding: how to find the coefficients • Linear predictive coding: how to make sure the coefficients are stable

  16. Inter-frame interpolation of pitch contours. We don't want the pitch period to change suddenly at frame boundaries; it sounds weird. (Plot: pitch period vs. sample number, n, with frame boundaries marked.)

  17. Inter-frame interpolation of pitch contours. Linear interpolation sounds much better. We can accomplish linear interpolation using a formula like

$$P[n] = (1-f)P_t + f\,P_{t+1}$$

where
• $P_t$ is the pitch period in frame t
• $f = \frac{n - tS}{S}$ is how far sample n is from the beginning of frame t
• S is the frame skip.
(Plot: pitch period vs. sample number, n, with frame boundaries marked.)
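As a sketch, the formula can be applied sample by sample like this (assuming the per-frame pitch periods are stored in an array and S is the frame skip in samples; the names are illustrative):

```python
import numpy as np

def interpolate_pitch(P_frames, S):
    """P[n] = (1 - f) P_t + f P_{t+1}, with t = n // S and f = (n - t*S) / S."""
    P = np.zeros((len(P_frames) - 1) * S)
    for n in range(len(P)):
        t, rem = divmod(n, S)
        f = rem / S
        P[n] = (1 - f) * P_frames[t] + f * P_frames[t + 1]
    return P
```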

  18. Inter-frame interpolation of energy. Linear interpolation is also useful for energy, EXCEPT: it sounds better if we interpolate log energy, not energy.

$$\log \mathrm{RMS}_t = \log\sqrt{\frac{1}{L}\sum_{n=tS}^{tS+L-1} x^2[n]}$$

where L is the frame length.
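A sketch of the log-domain version (frame length L and skip S are assumptions; a small epsilon guards against log(0) on silent frames):

```python
import numpy as np

def interpolated_gain(x, L, S, eps=1e-12):
    """Per-frame log RMS, linearly interpolated, then exponentiated to a per-sample gain."""
    x = np.asarray(x, dtype=float)
    n_frames = (len(x) - L) // S + 1
    log_rms = [np.log(np.sqrt(np.mean(x[t*S:t*S+L] ** 2)) + eps) for t in range(n_frames)]
    gain = np.zeros((n_frames - 1) * S)
    for n in range(len(gain)):
        t, rem = divmod(n, S)
        f = rem / S
        gain[n] = np.exp((1 - f) * log_rms[t] + f * log_rms[t + 1])
    return gain
```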

  19. Outline • The LPC-10 speech synthesis model • Autocorrelation-based pitch tracking • Inter-frame interpolation of pitch and energy contours • The LPC-10 excitation model: white noise, pulse train • Linear predictive coding: how to find the coefficients • Linear predictive coding: how to make sure the coefficients are stable

  20. The LPC-10 speech synthesis model. (Same block diagram as slide 5: the voiced vs. unvoiced switch selects either the pulse-train excitation or white noise, scaled by gain G and filtered by the LPC synthesis filter $H(e^{j\omega})$ to produce $s[n]$.)

  21. Unvoiced speech: e[n] = white noise • Use zero-mean, unit-variance Gaussian white noise • The choice to use "unvoiced speech" is communicated by the special code word P=0. (Image: By Morn – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24084756)

  22. Voiced speech: e[n] = pulse train
• The basic idea: $e[n] = \sum_{m=-\infty}^{\infty} \delta[n - mP]$
• Modification #1: in order for the RMS to equal 1.0, we need to scale each pulse by $\sqrt{P}$: $e[n] = \sqrt{P}\sum_{m=-\infty}^{\infty} \delta[n - mP]$
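A quick numerical check of Modification #1 (P = 80 is just an example value):

```python
import numpy as np

P = 80
e = np.zeros(100 * P)
e[::P] = np.sqrt(P)              # e[n] = sqrt(P) * sum_m delta[n - m*P]
print(np.sqrt(np.mean(e ** 2)))  # RMS = 1.0
```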

  23. Modification #2: the first pulse is not at n=0. (Figure: pulses in frames 30 and 31.) Example: if the last pulse in frame 30 falls 10 samples before the frame boundary and the pitch period = 80 samples, then the first pulse in frame 31 can't occur until the 70th sample of the frame.

  24. A mechanism for keeping track of pitch phase from one frame to the next
• Start out, at the beginning of the speech, with a pitch phase equal to zero, $\phi[0] = 0$
• For every sample thereafter:
  • If the sample is unvoiced (P[n]=0), don't increment the pitch phase
  • If the sample is voiced (P[n]>0), then increment the pitch phase: $\phi[n] = \phi[n-1] + \frac{2\pi}{P[n]}$
• Every time the phase passes a multiple of $2\pi$, output a pitch pulse:

$$e[n] = \begin{cases} \sqrt{P[n]} & \left\lfloor \frac{\phi[n]}{2\pi} \right\rfloor - \left\lfloor \frac{\phi[n-1]}{2\pi} \right\rfloor > 0 \\ 0 & \text{else} \end{cases}$$
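A sketch of this bookkeeping, assuming P[n] is the per-sample (already interpolated) pitch period, with P[n] = 0 on unvoiced samples, where the excitation falls back to the white noise of slide 21:

```python
import numpy as np

def make_excitation(P):
    """Pulse train from the pitch phase on voiced samples, white noise on unvoiced ones."""
    phase = 0.0
    e = np.zeros(len(P))
    for n in range(len(P)):
        if P[n] > 0:                                       # voiced: advance the phase
            new_phase = phase + 2 * np.pi / P[n]
            if np.floor(new_phase / (2 * np.pi)) > np.floor(phase / (2 * np.pi)):
                e[n] = np.sqrt(P[n])                       # pulse at each 2*pi crossing
            phase = new_phase
        else:                                              # unvoiced: phase is frozen
            e[n] = np.random.randn()
    return e
```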

  25. The pitch phase method: generate an excitation pulse whenever the pitch phase crosses a $2\pi$ level. (Figure: phase $\phi[n]$ vs. sample number, n, with levels $2\pi, 4\pi, 6\pi, 8\pi$ marked; below it, the resulting excitation pulses $e[n]$.)

  26. Outline • The LPC-10 speech synthesis model • Autocorrelation-based pitch tracking • Inter-frame interpolation of pitch and energy contours • The LPC-10 excitation model: white noise, pulse train • Linear predictive coding: how to find the coefficients • Linear predictive coding: how to make sure the coefficients are stable

  27. Speech is predictable
• Speech is not just white noise and a pulse train. In fact, each sample is highly predictable from previous samples:

$$x[n] \approx \sum_{m=1}^{10} a_m x[n-m]$$

• In fact, the pitch pulses are the only major exception to this predictability!

  28. Linear predictive coding (LPC). The LPC idea:
1. Model the excitation as error:

$$e[n] = x[n] - \sum_{m=1}^{10} a_m x[n-m]$$

2. Force the coefficients $a_m$ to explain as much as they can, so that e[n] is as close to zero as possible.
(Figure: a speech waveform x[n] and the corresponding prediction error e[n].)

  29. Linear predictive coding (LPC)

$$\mathcal{E} = E\left[e^2[n]\right] = E\left[\left(x[n] - \sum_{j=1}^{10} a_j x[n-j]\right)^2\right]$$

$$\frac{\partial\mathcal{E}}{\partial a_k} = -2\,E\left[x[n-k]\left(x[n] - \sum_{j=1}^{10} a_j x[n-j]\right)\right]$$

Setting $\frac{\partial\mathcal{E}}{\partial a_k} = 0$ gives

$$E\left[x[n-k]\,x[n]\right] = \sum_{j=1}^{10} a_j\,E\left[x[n-k]\,x[n-j]\right]$$

i.e., $R_{xx}[k] = \sum_{j=1}^{10} a_j R_{xx}[|j-k|]$.

  30. Linear predictive coding (LPC). So we have a set of linked equations, for $1 \le k \le 10$:

$$R_{xx}[k] = \sum_{j=1}^{10} a_j R_{xx}[|j-k|]$$

• We can write these 10 equations as a 10x10 matrix equation: $\vec{\gamma} = R\,\vec{a}$
• ...which immediately gives the solution: $\vec{a} = R^{-1}\vec{\gamma}$
• ...where

$$\vec{\gamma} = \begin{bmatrix} R_{xx}[1] \\ \vdots \\ R_{xx}[10] \end{bmatrix},\quad R = \begin{bmatrix} R_{xx}[0] & R_{xx}[1] & \cdots \\ R_{xx}[1] & R_{xx}[0] & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix},\quad \vec{a} = \begin{bmatrix} a_1 \\ \vdots \\ a_{10} \end{bmatrix}$$
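A minimal numpy sketch of this solution (a direct 10x10 solve; it does not include the stability safeguards discussed next):

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Solve R a = gamma, where R[j,k] = R_xx[|j-k|] and gamma[k-1] = R_xx[k]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    idx = np.arange(order)
    R = r[np.abs(idx[:, None] - idx[None, :])]   # 10x10 Toeplitz autocorrelation matrix
    gamma = r[1:order + 1]                       # [R_xx[1], ..., R_xx[10]]
    return np.linalg.solve(R, gamma)             # a = R^{-1} gamma
```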

  31. Outline • The LPC-10 speech synthesis model • Autocorrelation-based pitch tracking • Inter-frame interpolation of pitch and energy contours • The LPC-10 excitation model: white noise, pulse train • Linear predictive coding: how to find the coefficients • Linear predictive coding: how to make sure the coefficients are stable
