

SLIDE 1

Lecture 17: LPC speech synthesis and autocorrelation-based pitch tracking

ECE 401, Signal and Image Analysis November 5, 2020

SLIDE 2

Outline

  • The LPC-10 speech synthesis model
  • The LPC-10 excitation model: white noise, pulse train
  • Linear predictive coding: how to find the coefficients
  • Linear predictive coding: how to make sure the coefficients are stable
  • Autocorrelation-based pitch tracking
  • Inter-frame interpolation of pitch and energy contours
SLIDE 3

The LPC-10 speech synthesis model

SLIDE 4

The LPC-10 Speech Coder: Transmitted Parameters

Each frame is 54 bits, and is used to synthesize 22.5 ms of speech: (54 bits/frame)/(0.0225 seconds/frame) = 2400 bits/second.

  • Pitch: 7 bits/frame (127 distinguishable non-zero pitch periods)
  • Energy: 5 bits/frame (32 levels, on a log-energy scale)
  • 10 linear predictive coefficients (LPC): 41 bits/frame
  • Synchronization: 1 bit/frame
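The bit budget above can be checked directly (a trivial sketch; the variable names are illustrative):

```python
# Bit budget of one LPC-10 frame, as listed above:
bits_per_frame = 7 + 5 + 41 + 1      # pitch + energy + LPC + sync
print(bits_per_frame)                # → 54
# Each frame synthesizes 22.5 ms of speech:
print(bits_per_frame * 1000 / 22.5)  # → 2400.0 bits/second
```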
SLIDE 5

The LPC-10 speech synthesis model

๐ผ(๐‘“!")

Vocal Tract: Modeled by an LPC synthesis Filter.

๐‘ก[๐‘œ]

๐‘“ ๐‘œ = $

!"#$ $

๐œ€ ๐‘œ โˆ’ ๐‘ž๐‘„ ๐‘“ ๐‘œ ~๐’ช 0,1 Voiced Speech, pitch period P Unvoiced Speech Binary Control Switch: Voiced (P>0) vs. Unvoiced (P=0)

G

๐ป

Gain= ๐‘“%&'()*

slide-6
SLIDE 6

Outline

  • The LPC-10 speech synthesis model
  • The LPC-10 excitation model: white noise, pulse train
  • Linear predictive coding: how to find the coefficients
  • Linear predictive coding: how to make sure the coefficients are stable
  • Autocorrelation-based pitch tracking
  • Inter-frame interpolation of pitch and energy contours
SLIDE 7

The LPC-10 speech synthesis model

๐ผ(๐‘“!")

Vocal Tract: Modeled by an LPC synthesis Filter.

๐‘ก[๐‘œ]

๐‘“ ๐‘œ = $

!"#$ $

๐œ€ ๐‘œ โˆ’ ๐‘ž๐‘„ ๐‘“ ๐‘œ ~๐’ช 0,1 Voiced Speech, pitch period P Unvoiced Speech Binary Control Switch: Voiced vs. Unvoiced

G

๐ป

Gain= ๐‘“%&'()*

slide-8
SLIDE 8

Unvoiced speech: e[n]=white noise

  • Use zero-mean, unit-variance Gaussian white noise, e[n] ~ N(0, 1)
  • The choice to use unvoiced speech is communicated by the special code word P = 0

By Morn - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index. php?curid=24084756

SLIDE 9

Voiced speech: e[n]=pulse train

  • The basic idea: an impulse train with pitch period P,

e[n] = Σ_{k=−∞}^{∞} δ[n − kP]

  • Modification #1: in order for the average energy to equal 1.0, we need to scale each pulse by √P (one pulse of amplitude √P every P samples gives an average power of P/P = 1):

e[n] = √P Σ_{k=−∞}^{∞} δ[n − kP]

SLIDE 10

Modification #2: the first pulse is not at n=0

Pitch period = 80 samples ⇒ first pulse in frame 31 can't occur until the 70th sample of the frame


SLIDE 11

A mechanism for keeping track of pitch phase from one frame to the next

  • Start out, at the beginning of the speech, with a pitch phase equal to zero:

φ[0] = 0

  • For every sample thereafter:
  • If the sample is unvoiced (P[n] = 0), don't increment the pitch phase
  • If the sample is voiced (P[n] > 0), then increment the pitch phase:

φ[n] = φ[n−1] + 2π/P[n]

  • Every time the phase passes a multiple of 2π, output a pitch pulse:

e[n] = √(P[n])  if ⌊φ[n]/2π⌋ − ⌊φ[n−1]/2π⌋ > 0,  and  e[n] = 0  otherwise
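A minimal sketch of this phase-accumulation mechanism, assuming a per-sample pitch-period array P[n] (the function name is illustrative; unvoiced noise filling is omitted):

```python
import numpy as np

def pitch_phase_excitation(P):
    """Pulse-train excitation from a per-sample pitch-period array P[n].

    A pulse of height sqrt(P[n]) is emitted whenever the accumulated
    phase crosses a multiple of 2*pi; unvoiced samples (P[n] = 0) leave
    the phase unchanged.
    """
    phase = 0.0
    e = np.zeros(len(P))
    for n, Pn in enumerate(P):
        if Pn > 0:  # voiced: advance the phase by 2*pi/P[n]
            new_phase = phase + 2 * np.pi / Pn
            if np.floor(new_phase / (2 * np.pi)) > np.floor(phase / (2 * np.pi)):
                e[n] = np.sqrt(Pn)  # the phase just crossed a 2*pi level
            phase = new_phase
    return e

# Constant pitch period of 80 samples: pulses land 80 samples apart
e = pitch_phase_excitation(np.full(250, 80))
print(np.nonzero(e)[0])
```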

SLIDE 12

The pitch phase method: generate an excitation pulse whenever pitch phase crosses a 2๐œŒ-level

[Figure: the phase φ[n] increases linearly with sample number n; each time it crosses one of the levels 2π, 4π, 6π, 8π, a pulse appears in e[n].]

SLIDE 13

Outline

  • The LPC-10 speech synthesis model
  • The LPC-10 excitation model: white noise, pulse train
  • Linear predictive coding: how to find the coefficients
  • Linear predictive coding: how to make sure the coefficients are stable
  • Autocorrelation-based pitch tracking
  • Inter-frame interpolation of pitch and energy contours
SLIDE 14

Speech is predictable

  • Speech is not just white noise and pulse train. In fact, each sample is highly predictable from the previous samples:

x[n] ≈ Σ_{m=1}^{10} a_m x[n−m]

  • In fact, the pitch pulses are the only major exception to this predictability!

SLIDE 15

Linear predictive coding (LPC)

The LPC idea:

  • 1. Model the excitation as error:

e[n] = x[n] − Σ_{m=1}^{10} a_m x[n−m]

  • 2. Force the coefficients a_m to explain as much as they can, so that e[n] is as close to zero as possible.

[Figure: the error signal e[n] and the speech signal x[n].]

SLIDE 16

Linear predictive coding (LPC)

๐œ = ๐น ๐‘“![๐‘œ] = ๐น ๐‘ฆ ๐‘œ โˆ’ *

"#$ $%

๐›ฝ"๐‘ฆ[๐‘œ โˆ’ ๐‘—]

!

๐œ–๐œ ๐œ–๐›ฝ& = โˆ’2๐น ๐‘ฆ ๐‘œ โˆ’ ๐‘˜ ๐‘ฆ ๐‘œ โˆ’ *

"#$ $%

๐›ฝ"๐‘ฆ ๐‘œ โˆ’ ๐‘— Setting '(

')+ = 0 gives

๐น ๐‘ฆ ๐‘œ โˆ’ ๐‘˜ ๐‘ฆ[๐‘œ] = *

"#$ $%

๐›ฝ"๐น ๐‘ฆ ๐‘œ โˆ’ ๐‘˜ ๐‘ฆ[๐‘œ โˆ’ ๐‘—]

๐‘†,, ๐‘˜ ๐‘†,, |๐‘— โˆ’ ๐‘˜|

SLIDE 17

Linear predictive coding (LPC)

So we have a set of linked equations, for 1 ≤ k ≤ 10:

R_xx(k) = Σ_{j=1}^{10} a_j R_xx(|j−k|)

  • We can write these 10 equations as a 10×10 matrix equation: γ⃗ = R a⃗
  • …which immediately gives the solution: a⃗ = R⁻¹ γ⃗
  • …where

γ⃗ = [R_xx(1), …, R_xx(10)]ᵀ,
R = [ R_xx(0)  R_xx(1)  ⋯ ;  R_xx(1)  R_xx(0)  ⋯ ;  ⋮  ⋮  ⋱  R_xx(0) ]  (a Toeplitz matrix),
a⃗ = [a_1, …, a_10]ᵀ

SLIDE 18

Outline

  • The LPC-10 speech synthesis model
  • The LPC-10 excitation model: white noise, pulse train
  • Linear predictive coding: how to find the coefficients
  • Linear predictive coding: how to make sure the coefficients are stable
  • Autocorrelation-based pitch tracking
  • Inter-frame interpolation of pitch and energy contours
SLIDE 19

Speech -> Excitation -> Speech

Now that we know how to find the LPC coefficients, we can imagine an end-to-end LPC analysis-by-synthesis:

LPC synthesis ๐‘ก[๐‘œ] ๐‘“[๐‘œ] Model excitation using pulse train and white noise LPC analysis ๐‘“[๐‘œ] ๐‘ฆ[๐‘œ]

๐‘“ ๐‘œ = ๐‘ฆ ๐‘œ โˆ’ *

,#$ $%

๐›ฝ,๐‘ฆ[๐‘œ โˆ’ ๐‘›] ๐‘ก ๐‘œ = ๐‘“ ๐‘œ + *

,#$ $%

๐›ฝ,๐‘ก[๐‘œ โˆ’ ๐‘›]

SLIDE 20

The LPC Analysis Filter

The LPC analysis filter is an all-zero filter (FIR = finite impulse response):

e[n] = x[n] − Σ_{m=1}^{10} a_m x[n−m]  ↔  E(z) = A(z) X(z)

…where…

A(z) = 1 − Σ_{m=1}^{10} a_m z^{−m}

SLIDE 21

The LPC Synthesis Filter

The LPC synthesis filter is an all-pole filter (IIR = infinite impulse response):

s[n] = e[n] + Σ_{m=1}^{10} a_m s[n−m]  ↔  S(z) = H(z) E(z)

…where…

H(z) = 1/A(z) = 1 / (1 − Σ_{m=1}^{10} a_m z^{−m})

SLIDE 22

Speech -> Excitation -> Speech

1 ๐ต(๐‘จ) ๐‘ก[๐‘œ] ๐‘“[๐‘œ] Excitation Model ๐ต ๐‘จ ๐‘“[๐‘œ] ๐‘ฆ[๐‘œ]

SLIDE 23

The Stability Problem

  • The analysis filter is guaranteed to be stable, as long as the coefficients are finite. Suppose you know that |x[n]| ≤ x_max and |a_m| ≤ a_max. Then, even in the worst possible case, |e[n]| ≤ 11 a_max x_max.
  • The synthesis filter has no such guarantee. For example, suppose e[n] is just a delta function (e[n] = δ[n]), and suppose all of the a_m = 0 except the first one, a_1 = 1.1. Then

s[n] = δ[n] + 1.1 s[n−1] = 1.1ⁿ,

which overflows your 16-bit sample buffer after only 110 samples (1.1¹¹⁰ ≈ 36,000 > 32,767). Your output will be full of NaNs, and you'll be saying "What happened…?"
SLIDE 24

How to Guarantee Stability

Fortunately, the LPC synthesis filter is causal, so it's easy to guarantee stability. We just need to make sure that all of the poles have magnitude less than 1:

|p_k| < 1

We find the poles like this:

H(z) = 1/A(z) = 1 / (1 − Σ_{m=1}^{10} a_m z^{−m}) = 1 / Π_{k=1}^{10} (1 − p_k z^{−1})

in other words, p_k = roots(A(z)), which you can compute using np.roots, if you define the polynomial correctly. Then you just truncate the magnitude,

p_k ← min(|p_k|, 0.999) e^{j∠p_k}

…and then use np.poly to convert back from roots to polynomial.

SLIDE 25

Outline

  • The LPC-10 speech synthesis model
  • The LPC-10 excitation model: white noise, pulse train
  • Linear predictive coding: how to find the coefficients
  • Linear predictive coding: how to make sure the coefficients are stable
  • Autocorrelation-based pitch tracking
  • Inter-frame interpolation of pitch and energy contours
SLIDE 26

Autocorrelation is maximum at n=0

๐‘ 

** ๐‘œ =

*

,#+-

  • ๐‘ฆ ๐‘› ๐‘ฆ[๐‘› โˆ’ ๐‘œ]
SLIDE 27

Autocorrelation of a periodic signal

Suppose x[n] is periodic, x[n] = x[n−P]. Then the autocorrelation is also periodic:

r_xx[P] = Σ_{m=−∞}^{∞} x[m] x[m−P] = Σ_{m=−∞}^{∞} x²[m] = r_xx[0]

SLIDE 28

Autocorrelation of a periodic signal is periodic

[Figure: a voiced waveform and its autocorrelation; in both panels the pitch period is 9 ms = 99 samples.]

SLIDE 29

Autocorrelation pitch tracking

  • Compute the autocorrelation, r_xx[n]
  • Find the pitch period:

P = argmax_{P_min ≤ n ≤ P_max} r_xx[n]

  • The search limits, P_min and P_max, are important for good performance:
  • P_min corresponds to a high woman's pitch, about F_s/P_min ≈ 250 Hz
  • P_max corresponds to a low man's pitch, about F_s/P_max ≈ 80 Hz

[Figure: an autocorrelation, with the search range from P_min to P_max marked.]
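A minimal sketch of the tracker for a single frame, assuming illustrative search limits of 80-250 Hz (the function name and test signal are not from the lecture):

```python
import numpy as np

def track_pitch(frame, fs=8000, f_min=80.0, f_max=250.0):
    """Autocorrelation pitch tracker for one frame.

    Searches r_xx[n] for its maximum over pitch periods between
    P_min = fs/f_max and P_max = fs/f_min, and returns the period.
    """
    # r_xx[0...] : non-negative lags of the autocorrelation
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    P_min = int(fs / f_max)            # shortest period = highest pitch
    P_max = int(fs / f_min)            # longest period = lowest pitch
    return P_min + np.argmax(r[P_min : P_max + 1])

# A 125 Hz sawtooth at fs = 8000 Hz has a period of 64 samples:
fs, P_true = 8000, 64
n = np.arange(4 * P_true)
frame = (n % P_true) / P_true - 0.5
print(track_pitch(frame, fs))  # → 64
```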

SLIDE 30

The LPC-10 speech synthesis model

๐ผ(๐‘“!")

Vocal Tract: Modeled by an LPC synthesis Filter.

๐‘ก[๐‘œ]

๐‘“ ๐‘œ = $

!"#$ $

๐œ€ ๐‘œ โˆ’ ๐‘ž๐‘„ ๐‘“ ๐‘œ ~๐’ช 0,1 Voiced Speech, pitch period P Unvoiced Speech Binary Control Switch: Voiced (P>0) vs. Unvoiced (P=0)

G

๐ป

Gain= ๐‘“%&'()*

SLIDE 31

The voiced/unvoiced decision

  • ๐‘ฆ[๐‘œ] voiced: ๐‘ 

** ๐‘„ โ‰ˆ ๐‘  ** 0

  • ๐‘ฆ[๐‘œ] unvoiced (white noise):

๐‘ 

,, ๐‘œ โ‰ˆ ๐œ€[๐‘œ]

which means that ๐‘ 

,, ๐‘„ โ‰ช ๐‘  ,, 0

So a reasonable V/UV decision is:

  • .-- /

.-- % โ‰ฅ ๐‘ขโ„Ž๐‘ ๐‘“๐‘กโ„Ž๐‘๐‘š๐‘’: say the frame is voiced.

  • .-- /

.-- % < ๐‘ขโ„Ž๐‘ ๐‘“๐‘กโ„Ž๐‘๐‘š๐‘’: say the frame is

unvoiced. Setting threshold~0.25 works reasonably well.

voiced: ๐‘ฆ ๐‘œ + ๐‘„ โ‰ˆ ๐‘ฆ ๐‘œ unvoiced: E[๐‘ฆ ๐‘› ๐‘ฆ ๐‘› โˆ’ ๐‘œ ] โ‰ˆ ๐œ€[๐‘œ]

SLIDE 32

Outline

  • The LPC-10 speech synthesis model
  • The LPC-10 excitation model: white noise, pulse train
  • Linear predictive coding: how to find the coefficients
  • Linear predictive coding: how to make sure the coefficients are stable
  • Autocorrelation-based pitch tracking
  • Inter-frame interpolation of pitch and energy contours
SLIDE 33

Inter-frame interpolation of pitch contours

We don't want the pitch period to change suddenly at frame boundaries; it sounds weird.

[Figure: piecewise-constant pitch period vs. sample number n, with discontinuities at the frame boundaries.]

SLIDE 34

Inter-frame interpolation of pitch contours

  • Linear interpolation sounds much better. We can accomplish linear interpolation using a formula like

P[n] = (1 − c) P_t + c P_{t+1}

where

  • P_t is the pitch period in frame t
  • c = (n − tS)/S is how far sample n is from the beginning of frame t
  • S is the frame skip.

[Figure: the interpolated pitch period now changes smoothly across the frame boundaries.]
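This formula is exactly what np.interp computes; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def interpolate_contour(values, S):
    """Linearly interpolate a per-frame contour to a per-sample contour.

    values[t] is the parameter (e.g. pitch period) in frame t; S is the
    frame skip in samples. Sample n in frame t gets
    (1 - c) * values[t] + c * values[t + 1], with c = (n - t*S) / S.
    """
    t = np.arange(len(values)) * S            # sample index of each frame start
    n = np.arange((len(values) - 1) * S + 1)  # all samples to fill in
    return np.interp(n, t, values)

# Pitch period 80 in frame 0, 90 in frame 1, frame skip S = 4:
print(interpolate_contour([80.0, 90.0], S=4))  # → [80, 82.5, 85, 87.5, 90]
```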

SLIDE 35

Inter-frame interpolation of energy

Linear interpolation is also useful for energy, EXCEPT: it sounds better if we interpolate log energy, not energy:

log E_t = log( (1/M) Σ_{m=tS}^{tS+M−1} x²[m] )

SLIDE 36

Outline

  • The LPC-10 speech synthesis model
  • The LPC-10 excitation model: white noise, pulse train
  • Linear predictive coding: how to find the coefficients
  • Linear predictive coding: how to make sure the coefficients are stable
  • Autocorrelation-based pitch tracking
  • Inter-frame interpolation of pitch and energy contours