

  1. EE E6820: Speech & Audio Processing & Recognition
     Lecture 5: Speech modeling and synthesis
     1 Modeling speech signals
     2 Spectral and cepstral models
     3 Linear predictive models (LPC)
     4 Other signal models
     5 Speech synthesis
     Dan Ellis <dpwe@ee.columbia.edu>
     http://www.ee.columbia.edu/~dpwe/e6820/

  2. The speech signal
     • Speech sounds in the spectrogram
       [Figure: spectrogram of the phrase "has a watch thin as a dime", with phonetic segment labels]
     • Elements of the speech signal:
       - spectral resonances (formants, moving)
       - periodic excitation (voicing, pitched) + pitch contour
       - noise excitation (fricatives, unvoiced, no pitch)
       - transients (stop-release bursts)
       - amplitude modulation (nasals, approximants)
       - timing!

  3. The source-filter model
     • Notional separation of:
       source: excitation, fine time-frequency structure
       & filter: resonance, broad spectral structure
       [Figure: block diagram - a glottal pulse train (voiced, with pitch) or frication noise (unvoiced) source is passed through vocal tract resonances (formants) and a radiation characteristic to produce speech]
     • More a modeling approach than a model (a minimal synthesis sketch follows)
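The block diagram translates directly into code. Below is a minimal sketch of the idea, not anything from the slides: a glottal pulse train drives a cascade of second-order resonators. The formant frequencies and bandwidths are illustrative values (roughly an /a/-like vowel), chosen only to make the example concrete.

```python
import numpy as np
from scipy.signal import lfilter

sr = 8000                       # sample rate / Hz
f0 = 110                        # pitch of the glottal pulse train / Hz
dur = 0.5                       # duration / s

# Source: periodic impulse train (voiced excitation)
n = np.arange(int(dur * sr))
source = np.zeros(len(n))
source[::sr // f0] = 1.0

# Filter: cascade of second-order resonators, one per formant.
speech = source
for fc, bw in [(700, 130), (1220, 70), (2600, 160)]:   # illustrative formants
    r = np.exp(-np.pi * bw / sr)            # pole radius from bandwidth
    theta = 2 * np.pi * fc / sr             # pole angle from center frequency
    a = [1, -2 * r * np.cos(theta), r**2]   # conjugate pole pair
    speech = lfilter([1], a, speech)
```

Swapping the impulse train for white noise gives the unvoiced branch of the diagram with no other changes, which is the point of the separation.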

  4. Signal modeling
     • Signal models are a kind of representation
       - to make some aspect explicit
       - for efficiency
       - for flexibility
     • Nature of the model depends on the goal
       - classification: remove irrelevant details
       - coding/transmission: remove perceptual irrelevance
       - modification: isolate control parameters
     • But commonalities emerge
       - perceptually irrelevant detail (coding) will also be irrelevant for classification
       - the modification domain will usually reflect 'independent' perceptual attributes
       - getting at the abstract information in the signal

  5. Different influences for signal models
     • Receiver:
       - see how the signal is treated by listeners
       → cochlea-style filterbank models
     • Transmitter (source):
       - the physical apparatus can generate only a limited range of signals...
       → LPC models of vocal tract resonances
     • Making particular aspects explicit:
       - compact, separable resonance correlates → cepstrum
       - modeling prominent features of the narrowband spectrogram → sinusoid models
       - addressing unnaturalness in synthesis → Harmonic+Noise model

  6. Applications of (speech) signal models
     • Classification / matching - goal: highlight important information
       - speech recognition (lexical content)
       - speaker recognition (identity or class)
       - other signal classification
       - content-based retrieval
     • Coding / transmission / storage - goal: represent just enough information
       - real-time transmission, e.g. mobile phones
       - archive storage, e.g. voicemail
     • Modification / synthesis - goal: change certain parts independently
       - speech synthesis / text-to-speech (change the words)
       - speech transformation / disguise (change the speaker)

  7. Outline
     1 Modeling speech signals
     2 Spectral and cepstral models
       - Auditorily-inspired spectra
       - The cepstrum
       - Feature correlation
     3 Linear predictive models (LPC)
     4 Other models
     5 Speech synthesis

  8. Spectral and cepstral models
     • The spectrogram seems like a good representation
       - long history
       - satisfying in use
       - experts can 'read' the speech
     • What is the information?
       - intensity in time-frequency cells; typically 5 ms x 200 Hz x 50 dB
       → discarded information: phase, fine-scale timing
     • The starting point for other representations

  9. The filterbank interpretation of the short-time Fourier transform (STFT)
     • Can regard spectrogram rows as coming from separate bandpass filters applied to the sound
     • Mathematically:

       $$X[k, n_0] = \sum_n x[n]\, w[n - n_0]\, \exp\!\left(-j\,\frac{2\pi k (n - n_0)}{N}\right) = \sum_n x[n]\, h_k[n_0 - n]$$

       where

       $$h_k[n] = w[-n]\, \exp\!\left(j\,\frac{2\pi k n}{N}\right), \qquad H_k(e^{j\omega}) = W(e^{j(\omega - 2\pi k / N)})$$

       i.e. each channel is the window's lowpass response shifted to the bin center frequency $2\pi k / N$.
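A quick numerical check of this identity may help; this is a sketch, with the window choice and indexing conventions as my assumptions. One spectrogram cell (the DFT of a windowed frame) matches the appropriately delayed output of the corresponding complex bandpass filter:

```python
import numpy as np
from scipy.signal import lfilter

N, k, n0 = 128, 10, 500           # DFT size, bin index, frame position
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)     # arbitrary test signal
w = np.hanning(N)                 # analysis window, support 0..N-1
m = np.arange(N)

# Direct view: one spectrogram cell = DFT of the windowed frame at n0
X_dft = np.sum(x[n0:n0 + N] * w * np.exp(-2j * np.pi * k * m / N))

# Filterbank view: h_k[n] = w[-n] exp(j 2 pi k n / N) is anti-causal,
# so delay it by N-1 samples to get a realizable FIR filter g.
g = w[::-1] * np.exp(2j * np.pi * k * (m - N + 1) / N)
X_filt = lfilter(g, 1, x)[n0 + N - 1]

print(np.allclose(X_dft, X_filt))   # True: the two views agree
```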

  10. Spectral models: which bandpass filters?
     • Constant bandwidth? (analog / FFT)
     • But: cochlea physiology & critical bandwidths
       → use actual bandpass filters in ear models
       & choose bandwidths by e.g. critical-band estimates
     • Auditory frequency scales
       - constant 'Q' (center frequency/bandwidth), mel, Bark...
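For concreteness, one widely used auditory scale is the mel mapping below; the slide names the scale but not the formula, so this uses the standard 2595/700 fit (O'Shaughnessy), roughly linear below 1 kHz and logarithmic above:

```python
import numpy as np

def hz_to_mel(f):
    """Standard mel-scale mapping: equal mel steps approximate
    equal perceptual pitch steps, mirroring critical-band widening."""
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, used to place filterbank center frequencies."""
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)

# e.g. center frequencies for 20 filters equally spaced on the mel scale:
centers = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(8000), 20))
```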

  11. Gammatone filterbank
     • Given bandwidths, which filter shapes?
       - match the inferred temporal integration window
       - match the inferred spectral shape (sharp high-frequency slope)
       - keep it simple (since it's only approximate)
     → Gammatone filters:

       $$h[n] = n^{N-1}\, \exp(-bn)\, \cos(\omega_i n)$$

       [Figure: z-plane pole-zero plot; magnitude response in dB on a log frequency axis, 50 Hz - 5 kHz]
       - 2N poles, 2 zeros, low complexity
       - reasonable linear match to the cochlea
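A direct implementation of this impulse response is short. In the sketch below the decay rate b is set from Glasberg & Moore's ERB formula with Patterson's 1.019 factor, which is one common choice rather than anything specified on the slide:

```python
import numpy as np

def gammatone_ir(fc, sr, order=4, dur=0.05):
    """Gammatone impulse response: gamma-function envelope times a
    tone at the center frequency fc, i.e. h(t) = t^(N-1) e^(-bt) cos(w_i t).
    The bandwidth choice here (ERB with the 1.019 factor) is an assumption."""
    t = np.arange(int(dur * sr)) / sr
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)   # equivalent rectangular bandwidth / Hz
    b = 2 * np.pi * 1.019 * erb               # envelope decay rate / rad s^-1
    h = t ** (order - 1) * np.exp(-b * t) * np.cos(2 * np.pi * fc * t)
    return h / np.max(np.abs(h))              # peak-normalize for comparison

h_1k = gammatone_ir(1000.0, sr=16000)         # one channel of the filterbank
```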

  12. Constant-BW vs. cochlea model
     • Spectrograms:
       [Figure: FFT-based wideband spectrogram (N = 128, linear frequency axis) vs. gammatone filterbank output downsampled @ 64 (log frequency axis), 0-3 s]
     • Frequency responses:
       [Figure: effective FFT filterbank vs. Q = 4, 4-pole 2-zero cochlea model; gain in dB over 0-8 kHz]
     • Magnitude smoothed over a 5-20 ms time window

  13. Limitations of spectral models
     • Not much data thrown away
       - just fine phase/time structure (smoothing)
       - little actual 'modeling'
       - still a large representation!
     • Little separation of features
       - e.g. formants and pitch
     • Highly correlated features
       - modifications affect multiple parameters
     • But, quite easy to reconstruct
       - iterative reconstruction of the lost phase (see the sketch below)
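The last point refers to alternating between the time and frequency domains until a consistent phase emerges; the classic version is the Griffin-Lim algorithm. The sketch below uses scipy's STFT pair; the algorithm name, iteration count, and window parameters are my additions, not the slide's:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=50, nperseg=256):
    """Iterative phase reconstruction from an STFT magnitude
    (Griffin & Lim, 1984): alternately impose the known magnitude
    and re-analyze, keeping only the resulting phase."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))   # random initial phase
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg)       # to the time domain
        _, _, Z = stft(x, nperseg=nperseg)               # re-analyze
        Z = Z[:, :mag.shape[1]]                          # guard frame-count drift
        if Z.shape[1] < mag.shape[1]:
            Z = np.pad(Z, ((0, 0), (0, mag.shape[1] - Z.shape[1])))
        phase = np.exp(1j * np.angle(Z))                 # keep only the phase
    _, x = istft(mag * phase, nperseg=nperseg)
    return x

# Usage: _, _, S = stft(y, nperseg=256); y_hat = griffin_lim(np.abs(S))
```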

  14. The cepstrum
     • Original motivation: assume a source-filter model:
       excitation source g[n] → resonance filter H(e^{jω})
     • 'Homomorphic deconvolution':
       - source-filter convolution: $g[n] * h[n]$
       - FT → product: $G(e^{j\omega}) \cdot H(e^{j\omega})$
       - log → sum: $\log G(e^{j\omega}) + \log H(e^{j\omega})$
       - IFT → separates fine structure: $c_g[n] + c_h[n]$ = deconvolution
     • Definition (real cepstrum):

       $$c[n] = \mathrm{idft}\big(\log\,\lvert \mathrm{dft}(x[n]) \rvert\big)$$
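In code the definition is essentially one line; this minimal numpy sketch follows the formula above (the small epsilon floor inside the log is my addition to avoid log(0)):

```python
import numpy as np

def real_cepstrum(x):
    """c[n] = IDFT(log |DFT(x[n])|). Low-quefrency bins describe the
    smooth spectral envelope (filter); a sharp peak at higher quefrency
    marks the pitch period (source)."""
    log_mag = np.log(np.abs(np.fft.rfft(x)) + 1e-10)
    return np.fft.irfft(log_mag)
```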

  15. Stages in cepstral deconvolution
     • The original waveform has excitation fine structure convolved with resonances
     • The DFT shows harmonics modulated by resonances
     • The log DFT is the sum of a harmonic 'comb' and resonant bumps
     • The IDFT separates out the resonant bumps (low quefrency) and the regular fine structure (the 'pitch pulse')
     • Selecting the low-n cepstrum separates the resonance information (deconvolution / 'liftering')
       [Figure: waveform and minimum-phase impulse response; |DFT| and liftered version; log |DFT| and liftered version; real cepstrum with lifter window and pitch pulse marked]
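The final liftering stage is easy to make concrete. This self-contained sketch keeps only the low-quefrency cepstral bins and returns to the log-spectral domain to get the smoothed envelope; the cutoff n_c = 30 is illustrative and must simply sit below the quefrency of the pitch pulse:

```python
import numpy as np

def spectral_envelope(x, n_c=30):
    """Lifter the real cepstrum: zero everything except the first n_c
    quefrency bins (and their mirror, to keep the result real), then
    transform back to a smoothed log magnitude spectrum."""
    c = np.fft.irfft(np.log(np.abs(np.fft.rfft(x)) + 1e-10))  # real cepstrum
    lifter = np.zeros_like(c)
    lifter[:n_c] = 1.0
    lifter[-(n_c - 1):] = 1.0            # mirror bins of the symmetric cepstrum
    return np.fft.rfft(c * lifter).real  # smoothed log magnitude / resonances
```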

  16. Properties of the cepstrum
     • Separates source (fine structure) & filter (broad structure)
       - smooth the log magnitude spectrum to get the resonances
     • Smoothing the spectrum is filtering along frequency
       - i.e. convolution applied in the Fourier domain
       → multiplication in the IFT domain ('liftering')
     • Periodicity in time → harmonics in the spectrum → 'pitch pulse' in the high-n cepstrum
     • Low-n cepstral coefficients are the DCT of the broad filter/resonance shape:

       $$c[n] = \int \log\,\lvert X(e^{j\omega}) \rvert\,\big(\cos n\omega + j \sin n\omega\big)\, d\omega$$

       - since the log magnitude spectrum is real and even, the sine terms integrate to zero, leaving a cosine (DCT-like) expansion
       [Figure: cepstral coefficients 0..5 and the corresponding 5th-order cepstral reconstruction of the spectrum]
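The 'pitch pulse' property gives a classic pitch estimator: find the largest cepstral peak in the quefrency band of plausible pitch periods. The sketch below assumes a 50-400 Hz search range, which is my choice, not the slide's:

```python
import numpy as np

def cepstral_pitch(x, sr, fmin=50.0, fmax=400.0):
    """Estimate pitch as sr divided by the quefrency (in samples) of the
    largest cepstral peak within the assumed pitch-period range."""
    c = np.fft.irfft(np.log(np.abs(np.fft.rfft(x)) + 1e-10))  # real cepstrum
    q_lo, q_hi = int(sr / fmax), int(sr / fmin)   # period range / samples
    peak = q_lo + np.argmax(c[q_lo:q_hi])         # quefrency of the pitch pulse
    return sr / peak                              # fundamental frequency / Hz
```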

  17. Aside: correlation of elements
     • The cepstrum is popular in speech recognition because its feature vector elements are decorrelated
       [Figure: auditory-spectrum vs. cepstral features over ~150 frames, their covariance matrices, and an example joint distribution of coefficients (10, 15)]
       - c_0 'normalizes out' the average log energy
     • Decorrelated pdfs fit diagonal Gaussians
       - simple correlation is a waste of parameters
     • DCT is close to PCA for spectra?
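A small experiment illustrates the decorrelation claim. Here the 'auditory spectrum' is synthetic (noise smoothed along the frequency axis, so that neighboring bands covary the way broad spectral envelopes do); that stand-in data is an assumption of the sketch, not anything from the slide:

```python
import numpy as np
from scipy.fftpack import dct
from scipy.ndimage import gaussian_filter1d

# Synthetic log filterbank energies: smoothing along frequency makes
# neighboring bands correlated, roughly like real spectral envelopes.
rng = np.random.default_rng(0)
log_spec = gaussian_filter1d(rng.standard_normal((500, 20)), sigma=2.0, axis=1)

ceps = dct(log_spec, type=2, norm='ortho', axis=1)   # cepstrum-like features

def offdiag_ratio(C):
    """Off-diagonal covariance magnitude relative to the diagonal:
    lower means more decorrelated features."""
    return (np.abs(C).sum() - np.trace(np.abs(C))) / np.trace(np.abs(C))

print(offdiag_ratio(np.cov(log_spec, rowvar=False)))  # large: bands covary
print(offdiag_ratio(np.cov(ceps, rowvar=False)))      # much smaller after DCT
```

This is also the intuition behind the slide's closing question: for near-stationary spectra the covariance is close to Toeplitz, and the DCT approximates its eigenbasis (PCA).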

  18. Outline
     1 Modeling speech signals
     2 Spectral and cepstral models
     3 Linear predictive models (LPC)
       - The LPC model
       - Interpretation & application
       - Formant tracking
     4 Other models
     5 Speech synthesis
