Sound Media Engineering part II
Speech Information Processing
Akinori Ito Graduate School of Engineering, Tohoku Univ. aito@fw.ipsj.or.jp
Overview of the lecture
– Speech production, features of speech sound
– Basic codecs: PCM, DPCM, ADPCM
– Linear prediction of speech: Linear Prediction Coefficients, PARCOR coefficients and LSP
– CELP coding
– Audio coding
– Spectral subtraction
– Microphone array
Figure: organs of speech production (vocal cords, larynx, pharynx, tongue, gums, teeth, lips, nasal cavity); the passage from the vocal cords to the lips is the vocal tract.
Figure: the speech organs as instruments. The vocal cords, vocal tract, larynx, lips and nasal cavity determine the pitch of the voice, the linguistic content and the speaker's personality.
Figure (vocal cords, vocal tract, larynx, lips, nasal cavity): a speaker can control the shape of this part.
Figure (vocal cords, vocal tract, larynx, lips, nasal cavity): a speaker cannot control the shape of this part, nor the total length of the vocal tract.
Figure: the vowels /a/ /i/ /u/ /o/ /e/.
Figure: periodic speech waveform. Fundamental period T [s]; fundamental frequency F0 [Hz] = 1/T.
– Same phone = same vocal tract shape
– Completely different waveforms
– What is the same between these waveforms?
– Spectral shapes are similar → shape of the vocal tract
– "Jaggies" of the spectrum differ → fundamental frequency
Figure: speech spectrum with the fundamental frequency F0 and the formant frequencies F1, F2, F3, F4 marked.
– Handle with computer
– Transmission over digital lines
– Goals
– Methodology
– Observe the temporally continuous signal at discrete times
– Period of the discrete observation: sampling frequency fs
– The original signal can be restored from the sampled data when the original signal only contains frequency components under fs/2 (sampling theorem)
– Round off the magnitude of the signal into discrete levels
– The distance between discrete levels: quantization step
– Difference between the original signal and the quantized signal: quantization error
– The sampling frequency is determined by the highest frequency in the sound
– Telephone: 8 kHz (up to 4 kHz sound)
– High-quality speech: 16 kHz (up to 8 kHz sound)
– CD: 44.1 kHz (up to 22.05 kHz sound)
– The quantization is determined by the range of the sound
– To code speech is to quantize speech
– Represent the quantized values as binary numbers
– How many bits to use for one sample
– How to determine the levels of quantization
– CD: 16-bit linear quantization
– VoIP (G.711): 8-bit nonlinear quantization
Figure: linear quantization. CD: quantize in 16 bits (−32768 to +32767).
→ Total error can be reduced by finely quantizing values around zero (figure: nonlinear quantization levels).
– 8 kHz sampling, 8-bit nonlinear quantization
– μ-law (Japan, US), A-law (Europe)
– μ-law: 14-bit linear quantization → 8-bit nonlinear quantization

Y = 128 · sign(X) · log(1 + 255|X|/8192) / log(256)

Figure: the μ-law curve (8-bit output versus 14-bit linear input).
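As a sketch, the compression curve above can be implemented directly (function and parameter names are mine; the real G.711 codec uses a piecewise-linear approximation of this curve):

```python
import math

def mu_law_encode(x, mu=255, x_max=8192):
    """Compress a 14-bit linear sample X to an 8-bit value with the
    slide's curve: Y = 128 * sign(X) * log(1 + 255|X|/8192) / log(256)."""
    y = 128 * math.copysign(1.0, x) * math.log(1 + mu * abs(x) / x_max) / math.log(1 + mu)
    return int(round(y))

def mu_law_decode(y, mu=255, x_max=8192):
    """Invert the compression curve (up to quantization error)."""
    x = math.copysign(1.0, y) * (x_max / mu) * (math.exp(abs(y) / 128 * math.log(1 + mu)) - 1)
    return int(round(x))
```

Small amplitudes get fine steps and large amplitudes coarse ones, which is exactly the nonlinear quantization motivated above.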
Contiguous samples do not differ very much → reduce the bit-rate by transmitting the differences of samples.
Figure: original waveform and difference waveform.
– Use more sophisticated prediction rather than the simple difference
– Adaptively change the quantization step: when the current difference is large, the difference to the next sample is likely to be large too; when it is small, the next difference is likely to be small too
Figure: ADPCM encoder and decoder. x(k): input signal; xe(k): prediction signal; d(k): differential signal; I(k): output of the adaptive quantizer (the ADPCM output); dq(k): quantized differential signal from the adaptive de-quantizer; xr(k): reconstructed signal; the predictor closes the loop.
1. Compute the prediction signal xe(k)
2. Compute the difference d(k) = x(k) − xe(k)
3. Quantize: I(k) = Q[d(k)] (ADPCM output)
4. De-quantize: dq(k) = Q⁻¹[I(k)]
5. Reconstruct the signal: xr(k) = xe(k) + dq(k)
6. Compute the next prediction: xe(k+1) = pred(xr(k), dq(k), …)
Prediction of the next sample from the reconstructed signal and the quantized difference:
– DPCM: xe(k) = xr(k−1)
– A little better way: xe(k) = 2 xr(k−1) − xr(k−2)
– G.726: xe(k) = Σ_{i=1}^{2} a_i xr(k−i) + Σ_{i=1}^{6} b_i dq(k−i)
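The six steps can be sketched with the simplest predictor xe(k) = xr(k−1) and a uniform quantizer (the step size and names are my choices):

```python
def dpcm_encode(x, step=4):
    """DPCM with predictor xe(k) = xr(k-1) and a uniform quantizer.
    Returns the quantizer indices I(k)."""
    codes, xe = [], 0          # prediction starts at 0
    for xk in x:
        d = xk - xe            # 2. difference
        i = round(d / step)    # 3. quantize
        dq = i * step          # 4. de-quantize
        xe = xe + dq           # 5./6. reconstruction = next prediction
        codes.append(i)
    return codes

def dpcm_decode(codes, step=4):
    out, xr = [], 0
    for i in codes:
        xr = xr + i * step     # mirror of steps 4-5 in the encoder
        out.append(xr)
    return out
```

Because the encoder predicts from the reconstructed signal (not the original), encoder and decoder stay in sync and quantization errors do not accumulate.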
Figure: adaptive quantization step. When the quantized difference is small ("blue"), the next scale is half the size; when it is large ("red"), the next scale is double the size.
– DPCM and ADPCM only partly exploit the properties of the input signal
→ We can enhance the efficiency of coding by considering the properties of human speech
Figure: levels of speech representation (digital data, speech features, phones, words/sentences, semantics) and the corresponding coders: AD/DA and PCM coder (public phone), CELP coder (mobile phone), vocoder, speech synthesis / Text-to-Speech; coding at the word or semantic level (a "summarizing telephone") is under research.
Source-filter model of speech:

X(ω) = S(ω) T(ω) R(ω)

where S(ω) is the source (vocal cords), T(ω) the transfer characteristic of the vocal tract (larynx, nasal cavity), and R(ω) the radiation characteristic from the lips.
– Spectral shape: parameters of the linear prediction filter
– Vocal cord vibration: residual
– In the spectral domain:

x(k) = −Σ_{i=1}^{p} a_i x(k−i) + e(k)

X(ω) = E(ω) / (1 + Σ_{n=1}^{p} a_n e^{−jnω}) = E(ω) H(ω), corresponding to S(ω) T(ω) R(ω)

Estimate the coefficients so as to minimize the residual.
– Parameters: LP coefficients a_i and residual e(k)
– Estimate a_i for a fixed number of samples (a block)
– Calculate e(k) using the estimated a_i and the LPC formula x(k) = −Σ_{i=1}^{p} a_i x(k−i) + e(k)
– Transmit a_i and e(k) as the parameters of the block
– Solve a simultaneous equation (Yule-Walker equation) → the LPC are calculated as the least-error solution
– A faster algorithm exists (Levinson-Durbin algorithm)

    −[ x(k−1)  x(k−2)  ⋯  x(k−p)   ] [a_1]   [x(k)  ]   [e(k)  ]
     [ x(k−2)  x(k−3)  ⋯  x(k−p−1) ] [a_2] = [x(k−1)] + [e(k−1)]
     [   ⋮       ⋮     ⋱    ⋮      ] [ ⋮ ]   [  ⋮   ]   [  ⋮   ]
     [ x(p−1)  x(p−2)  ⋯  x(1)     ] [a_p]   [x(p)  ]   [e(p)  ]
Writing the prediction equations as FA = −V + E, minimize |FA + V|²:

FᵀF A = −FᵀV, with Φ_ij = (FᵀF)_ij and Φ_0j = (FᵀV)_j

    [ Φ_11  Φ_12  ⋯  Φ_1p ] [a_1]     [Φ_01]
    [ Φ_21  Φ_22  ⋯  Φ_2p ] [a_2] = − [Φ_02]
    [  ⋮     ⋮    ⋱   ⋮   ] [ ⋮ ]     [ ⋮  ]
    [ Φ_p1  Φ_p2  ⋯  Φ_pp ] [a_p]     [Φ_0p]

(Yule-Walker equation)
– The matrix is in a special form (symmetric Toeplitz matrix)
– A quick solution algorithm exists (Levinson-Durbin algorithm)

Φ_ij = Σ_{n=p}^{N−1} y(n−i) y(n−j), approximated by the autocorrelation

Φ_ij = r(|i−j|) = Σ_{n=0}^{N−|i−j|−1} y(n) y(n+|i−j|)
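A minimal sketch of the Levinson-Durbin recursion under the autocorrelation form above, in the slides' sign convention x(k) = −Σ a_i x(k−i) + e(k) (variable names are mine):

```python
def levinson_durbin(r, p):
    """Solve the Yule-Walker system for the symmetric Toeplitz matrix built
    from autocorrelations r[0..p] in O(p^2) instead of O(p^3)."""
    a = [0.0] * (p + 1)   # a[1..p]: LP coefficients (a[0] unused)
    ks = []               # PARCOR (reflection) coefficients
    err = r[0]            # prediction error power
    for i in range(1, p + 1):
        k = (r[i] + sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = -k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= (1.0 - k * k)   # error power shrinks at every order
        ks.append(k)
    return a[1:], ks
```

For an AR(1)-like autocorrelation such as r = [1, 0.9, 0.81], the recursion returns a_1 = −0.9 and a_2 = 0: the second-order term adds nothing, and the first PARCOR coefficient equals the lag-1 correlation 0.9.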
– Re-synthesis by the LPC formula could be unstable → the output signal eventually oscillates when the a_i have quantization errors
– Transmit parameters that are equivalent to LPC but stable against quantization errors
PARCOR coefficients:

k_i = Σ_{n=−∞}^{∞} ε^(i−1)(n) η^(i−1)(n) / √( Σ_{n=−∞}^{∞} [ε^(i−1)(n)]² · Σ_{n=−∞}^{∞} [η^(i−1)(n)]² )

Forward prediction error: ε^(i−1)(n) = x(n) + Σ_{j=1}^{i−1} a_j^(i−1) x(n−j)
Backward prediction error: η^(i−1)(n) = x(n−i) + Σ_{j=1}^{i−1} b_j^(i−1) x(n−j)

The PARCOR coefficient is the correlation of the forward prediction errors and the backward prediction errors.
Figure: lattice interpretation. The forward error at x(n) and the backward error at x(n−i) are formed with coefficients a_j and b_j from the samples x(n−1), …, x(n−i+1) in between; k_i is the correlation of the two errors.
First PARCOR coefficient:

k_1 = Σ_n ε^(0)(n) η^(0)(n) / √( Σ_n [ε^(0)(n)]² · Σ_n [η^(0)(n)]² ), with ε^(0)(n) = x(n), η^(0)(n) = x(n−1)

k_1 is the correlation coefficient between x(n−1) and x(n). As x(n−1) and x(n) have the same variance and zero mean,

x̂^(1)(n) = k_1 x(n−1), so a_1^(1) = −k_1
x̂^(1)(n−1) = k_1 x(n), so b_1^(1) = −k_1
Second PARCOR coefficient:

k_2 = Σ_n ε^(1)(n) η^(1)(n) / √( Σ_n [ε^(1)(n)]² · Σ_n [η^(1)(n)]² )

ε^(1)(n) = x(n) − k_1 x(n−1), η^(1)(n) = x(n−2) − k_1 x(n−1)

Here, as k_2 η^(1)(n) = −k_1 k_2 x(n−1) + k_2 x(n−2), the second-order prediction is

x̂^(2)(n) = k_1(1 − k_2) x(n−1) + k_2 x(n−2)

so a_1^(2) = −k_1(1 − k_2) = a_1^(1) − k_2 b_1^(1) and a_2^(2) = −k_2.
Similarly, b_2^(2) = b_1^(1) − k_2 a_1^(1) and b_1^(2) = −k_2.
In general,

a_j^(i) = a_j^(i−1) − k_i b_j^(i−1)
b_j^(i) = b_{j−1}^(i−1) − k_i a_{j−1}^(i−1)

(with the conventions a_0^(i−1) = 1, b_0^(i−1) = 0 for the constant terms and a_i^(i−1) = 0, b_i^(i−1) = 1 for the leading terms, so that a_i^(i) = −k_i and b_1^(i) = −k_i). We can calculate the LP coefficients from the PARCOR coefficients using this recurrence relation.
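Since the backward coefficients are just the forward ones in reverse order (b_j^(i) = a_{i+1−j}^(i)), the recurrence collapses to a few lines; a sketch converting PARCOR coefficients to LP coefficients:

```python
def parcor_to_lpc(ks):
    """Grow the LP coefficient vector order by order using
    a_j^(i) = a_j^(i-1) - k_i * b_j^(i-1), with a_i^(i) = -k_i and
    the backward coefficients b being the reversed forward ones."""
    a = []
    for k in ks:
        b = a[::-1]                                    # backward predictor
        a = [aj - k * bj for aj, bj in zip(a, b)] + [-k]
    return a
```

For ks = [k_1, k_2] this reproduces the order-2 results of the previous slide: a_1 = −k_1(1 − k_2) and a_2 = −k_2.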
– Representation of the LPC equation in the z-domain:

x(k) + Σ_{i=1}^{p} a_i x(k−i) = e(k)  ⇔  X(z) (1 + Σ_{i=1}^{p} a_i z^{−i}) = E(z), with A_p(z) = 1 + Σ_{i=1}^{p} a_i z^{−i}

– Decompose A_p(z) into P(z) and Q(z):

P(z) = A_p(z) − z^{−(p+1)} A_p(z^{−1})
Q(z) = A_p(z) + z^{−(p+1)} A_p(z^{−1})
A_p(z) = (P(z) + Q(z)) / 2
The roots of P(z) and Q(z) lie on the frequency axis (on the unit circle in the z-domain):

P(z) = (1 − z^{−1}) ∏_{i=2,4,…,p} (1 − 2 z^{−1} cos ω_i + z^{−2})
Q(z) = (1 + z^{−1}) ∏_{i=1,3,…,p−1} (1 − 2 z^{−1} cos ω_i + z^{−2})

z = cos ω_i ± j sin ω_i, i = 1, 2, …, p
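Because the roots of P(z) and Q(z) sit on the unit circle, the LSP frequencies ω_i can be read off as root angles; a sketch with numpy (p assumed even, function name mine):

```python
import numpy as np

def lsp_from_lpc(a):
    """LSP frequencies from LP coefficients a[1..p], A_p(z) = 1 + sum a_i z^-i."""
    c = np.concatenate(([1.0], a))            # coefficients of A_p
    c_rev = np.concatenate(([0.0], c[::-1]))  # coefficients of z^-(p+1) A_p(z^-1)
    c_pad = np.concatenate((c, [0.0]))
    P = c_pad - c_rev
    Q = c_pad + c_rev
    # roots lie on the unit circle; their angles are the LSP frequencies
    w = np.concatenate((np.angle(np.roots(P)), np.angle(np.roots(Q))))
    return np.sort(w[(w > 1e-9) & (w < np.pi - 1e-9)])  # drop z = 1, z = -1
```

For example, a stable filter with a = [−0.9, 0.2] factors into P(z) with cos ω = 0.05 and Q(z) with cos ω = 0.85, giving two interleaved frequencies.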
The LSP frequencies are interleaved on the frequency axis: 0 < ω_1 < ω_2 < ⋯ < ω_p < π.
– Basic coding scheme for mobile phones
– Analysis and synthesis based on LPC
– Transmit LSP coefficients and the residual
Figure: CELP encoder. LPC analysis produces quantized LSP coefficients; a code vector from the residue codebook, scaled by the gain codebook, excites the LPC synthesis filter; the code vector with the minimum weighted distance to the input is selected, and the bitstream is generated.
– LD-CELP (Low-Delay CELP)
– CS-ACELP (Conjugate Structure Algebraic CELP)
– RPE-LTP (Regular Pulse Excitation with Long Term Prediction)
– VSELP (Vector Sum Excited Linear Prediction)
– PSI-CELP (Pitch Synchronous Innovation CELP)
– ACELP (Algebraic CELP)
– We cannot make assumptions (as with speech) on the input signal
– Model-based coding (as for speech) cannot be used
– Split the input signal into low-frequency to high-frequency bands
– Change the quantization step frequency by frequency
Figure: generic transform audio coder. Frequency analysis (QMF, MDCT or wavelet), per-band quantization guided by auditory properties (psychoacoustic analysis), and bitstream generation with entropy coding (Huffman, arithmetic); the decoder restores the bitstream, de-quantizes, and converts back into the time domain.
– Sub-band ADPCM (48 to 64 kbit/s)
– Split the input signal into high and low bands with a QMF and encode each using ADPCM individually; the decoder applies ADPCM decoding and QMF synthesis
Most information concentrates in the low-frequency signal, and the total data amount is identical; the original is obtained by combining the low-frequency and high-frequency signals.

QMF split: y(i) = (x(2i) + x(2i+1)) / 2 (low band), z(i) = (x(2i) − x(2i+1)) / 2 (high band)
QMF synthesis: x(2i) = y(i) + z(i), x(2i+1) = y(i) − z(i)
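The split/synthesis pair above is only a couple of lines; a sketch:

```python
def qmf_split(x):
    """One-stage QMF analysis: averages (low band) and differences
    (high band) of sample pairs."""
    y = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]  # low
    z = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]  # high
    return y, z

def qmf_synth(y, z):
    """Perfect reconstruction: x(2i) = y(i)+z(i), x(2i+1) = y(i)-z(i)."""
    x = []
    for yi, zi in zip(y, z):
        x += [yi + zi, yi - zi]
    return x
```

Note the total data amount really is identical: n input samples become n/2 low-band plus n/2 high-band samples.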
MPEG audio encoding:
– Layer 1 (MP1), layer 2 (MP2), layer 3 (MP3)
– Frequency analysis, psychoacoustic model

Figure: MPEG audio coder. A polyphase filter bank/MDCT and an FFT-based psychoacoustic model drive the quantizers and the bitstream generation; the decoder restores the bitstream, de-quantizes, and converts back to the time domain.
– Frequency analysis by a polyphase filter bank (32 frequency bands)
– Normalization and nonlinear scalar quantization in every 12 samples; the block average power is also scalar-quantized
Layer 3 (MP3): frequency analysis by the polyphase filter bank (32 bands) followed by MDCT, nonlinear scalar quantization, and Huffman coding into the bitstream.
The MDCT converts n points of time-domain signal into n/2 points of frequency-domain signal:

X(m) = Σ_{k=0}^{n−1} f(k) x(k) cos{ (2π/n) · ((2k + 1 + n/2)/2) · ((2m + 1)/2) }

x(k) = (4 f(k)/n) Σ_{m=0}^{n/2−1} X(m) cos{ (2π/n) · ((2k + 1 + n/2)/2) · ((2m + 1)/2) }
The original signal is restored by overlap-add of the temporally overlapping data: MDCT → IMDCT → overlap-add.
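A direct (O(n²)) sketch of the transform pair and the overlap-add reconstruction, using the sine window f(k) = sin(π(k + 0.5)/n) (my choice; it makes the overlapped halves add back exactly):

```python
import math

def mdct(x, f):
    """n time samples -> n/2 frequency samples (the formula above)."""
    n = len(x); N = n // 2
    return [sum(f[k] * x[k] * math.cos(math.pi / (2 * n) * (2 * k + 1 + N) * (2 * m + 1))
                for k in range(n)) for m in range(N)]

def imdct(X, f):
    """n/2 frequency samples -> n time samples (with time-domain aliasing)."""
    N = len(X); n = 2 * N
    return [(4 * f[k] / n) * sum(X[m] * math.cos(math.pi / (2 * n) * (2 * k + 1 + N) * (2 * m + 1))
                                 for m in range(N)) for k in range(n)]

# overlap-add: blocks of n samples advance by n/2
n = 8; N = n // 2
f = [math.sin(math.pi * (k + 0.5) / n) for k in range(n)]  # analysis/synthesis window
x = [0.3, -1.0, 0.8, 0.5, -0.2, 0.9, -0.7, 0.1, 0.4, -0.6, 0.2, 0.0]
out = [0.0] * len(x)
for start in range(0, len(x) - N, N):
    y = imdct(mdct(x[start:start + n], f), f)
    for k in range(n):
        out[start + k] += y[k]
# samples covered by two overlapping blocks are restored exactly
```

Each IMDCT block alone is aliased; only the sum of two overlapping blocks cancels the aliasing, which is why the middle samples of `out` match `x`.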
– Extract specific speech from an input signal that contains the target speech and other noise
– It is generally difficult: some kind of assumption is needed
– Single-channel case
– Multiple-channel case
Single-channel case: only the mixed signal y is observed (x is unknown), and the spectrum of n is known:

y(t) = x(t) + n(t), Y(ω) = X(ω) + N(ω)
Estimate X by a linear filter W:

X̂(ω) = W(ω) Y(ω) = W(ω) (X(ω) + N(ω))

Minimize the error |X − X̂|²:

Σ_t Σ_{i=0}^{N−1} |X_i(t) − X̂_i(t)|² = Σ_t Σ_{i=0}^{N−1} |X_i(t) − W_i (X_i(t) + N_i(t))|² → min
If X(t) and N(t) have no correlation:

∂/∂W_i Σ_t |X_i(t) − W_i (X_i(t) + N_i(t))|² = 0
⇒ Σ_t [ −2 X_i(t) (X_i(t) + N_i(t)) + 2 W_i |X_i(t) + N_i(t)|² ] = 0
⇒ W_i = Σ_t |X_i(t)|² / ( Σ_t |X_i(t)|² + Σ_t |N_i(t)|² ), using Σ_t |X_i(t) N_i(t)| ≈ 0

The Wiener filter.
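The resulting per-band gain is easy to compute when the average powers are known (array names are mine):

```python
import numpy as np

def wiener_gain(sig_pow, noise_pow):
    """W_i = S_i / (S_i + N_i): bands dominated by noise are suppressed,
    bands dominated by signal pass almost unchanged."""
    return sig_pow / (sig_pow + noise_pow)

S = np.array([10.0, 1.0, 0.1])   # average signal power per band
N = np.array([1.0, 1.0, 1.0])    # average noise power per band
W = wiener_gain(S, N)            # high SNR band -> gain near 1, low SNR -> near 0
```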
Properties of the Wiener filter:
– The average spectra of the signal and the noise are assumed known
– The signal and the noise have no correlation
– Suppresses frequency bands with large noise power; minimizes E[|X − X̂|²] on average:

W_i (X_i(t) + N_i(t)) = [ Σ_t |X_i(t)|² / ( Σ_t |X_i(t)|² + Σ_t |N_i(t)|² ) ] · (X_i(t) + N_i(t))
Figure: example speech and noise power spectra (0 to 9 kHz) and the resulting Wiener filter gain between 0 and 1.
Spectral subtraction works on the short-time spectrum:
– Signal: X_i(t)
– Noise: N_i(t)
– Observed signal: Y_i(t) = X_i(t) + N_i(t)
Power spectrum of the observed signal:

|Y_i(t)|² = |X_i(t) + N_i(t)|² ≤ |X_i(t)|² + 2 |X_i(t) N_i(t)| + |N_i(t)|²

The signal is assumed to have no correlation with the noise:

|X_i(t) N_i(t)| ≪ |X_i(t)|² + |N_i(t)|²  ⇒  |X_i(t)|² ≈ |Y_i(t)|² − |N_i(t)|²

The noise is assumed to be stationary, |N_i(t)|² = N_i², hence

|X_i(t)|² ≈ |Y_i(t)|² − N_i²
– The noise spectrum must be prepared beforehand, e.g. estimated from a silent part before the voice
– The enhanced spectrum keeps the phase of the observation:

X̂_i(t) ≈ √( (|Y_i(t)|² − |N_i(t)|²) / |Y_i(t)|² ) · Y_i(t)
The subtracted power can become negative. Solution by flooring:

|X̂_i(t)|² ≈ |Y_i(t)|² − N_i²   if |Y_i(t)|² > N_i²
             β |Y_i(t)|²        otherwise, with 0 < β ≪ 1

With overestimation of the noise:

|X̂_i(t)|² ≈ |Y_i(t)|² − α N_i²   if |Y_i(t)|² > α N_i²
             β |Y_i(t)|²          otherwise, with α > 1
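The rule above, per frequency bin and frame, with overestimation factor α and flooring factor β (variable names are mine):

```python
import numpy as np

def spectral_subtract(Y, N_pow, alpha=2.0, beta=0.01):
    """Power spectral subtraction with overestimation (alpha > 1) and
    flooring (0 < beta << 1). Y: complex spectrum of one observed frame;
    N_pow: noise power estimate per bin."""
    Y_pow = np.abs(Y) ** 2
    X_pow = Y_pow - alpha * N_pow
    X_pow = np.where(X_pow > 0, X_pow, beta * Y_pow)   # floor negative bins
    # scale the magnitude, keep the observed phase
    return np.sqrt(X_pow / np.maximum(Y_pow, 1e-12)) * Y
```

Bins where the observation exceeds the (overestimated) noise are subtracted; the rest are floored to a small fraction of the observed power instead of going negative.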
Figures: spectrograms and sound examples of the speech, the noise, the speech with noise, and the enhanced speech.
Microphone array (multiple-channel case):
– Spatial information can be used
– Linear processing
– Nonlinear processing
Emphasize the sound from a specific angle using multiple microphones:
– The sound is assumed to be a plane wave

Figure: microphones spaced d apart; a wave arriving from angle θ reaches adjacent microphones with path difference d sin θ.
Figure: with microphone spacing d and incidence angle θ, the four microphones observe

sin ωt, sin ω(t − d sin θ/c), sin ω(t − 2d sin θ/c), sin ω(t − 3d sin θ/c)
Delay the channels (by 3d sin θ/c, 2d sin θ/c, d sin θ/c and 0, plus a common delay τ) so that every channel becomes

sin ω(t − 3d sin θ/c − τ)
Adding the four delayed signals gives

4 sin ω(t − 3d sin θ/c − τ)

so the sound from the steered direction is amplified by the number of microphones.
For a sound arriving from a different angle φ, the delayed sum becomes

Σ_{n=0}^{3} sin ω(t − (n d sin φ + (3−n) d sin θ)/c − τ)

and the terms no longer align, so the sound is attenuated.
Figure: directivity gain versus incidence angle (rad) for n = 4, d = 1, plotted for ω = 5, 10 and 50; the main lobe narrows as ω grows.
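The directivity curve can be reproduced by summing the phasors of the delayed channels (c is normalized to 1 to match the figure's units; parameter names are mine):

```python
import cmath, math

def array_gain(phi, theta=0.0, n=4, d=1.0, w=10.0, c=1.0):
    """Normalized gain of an n-microphone delay-and-sum array steered to
    angle theta, for a plane wave arriving from angle phi."""
    s = sum(cmath.exp(1j * w * k * d * (math.sin(phi) - math.sin(theta)) / c)
            for k in range(n))
    return abs(s) / n
```

array_gain(0.0) is 1.0 (the steered direction adds coherently), off-axis angles give smaller gains, and sweeping phi over [−π/2, π/2] for w = 5, 10, 50 reproduces the narrowing main lobe of the figure.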
– Fast calculation
– Easy hardware realization
– The main lobe width becomes narrower as the frequency becomes higher
– Spatial aliasing
Noise suppression with a reference microphone: one microphone observes speech + noise, another observes the noise only; signal processing outputs the enhanced speech.
– The reference n(k) is not the noise signal actually mixed into x(k), so we can't subtract n(k) from x(k) directly → use a filter W(z)

Figure: the noise reaches the speech microphone through an unknown path G(z), giving the observation x(k); the reference n(k) is filtered by W(z) into y(k), and e(k) = x(k) − y(k). W(z) is updated so that the power of the output signal becomes minimum.
y(k) = Σ_{i=1}^{p} w_i(k) n(k−i)
w_i(k+1) = w_i(k) + 2μ e(k) n(k−i)

where w_i(k) is the i-th filter coefficient at time k and μ is the step size. (Many other algorithms have been developed.)
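A sketch of the whole canceller; here the observation is simulated as the reference noise passed through a one-tap path G(z) = 0.8 z⁻¹, which the adaptive weights should learn (the simulation setup and names are mine):

```python
import random

def lms_cancel(x_obs, n_ref, p=4, mu=0.05):
    """y(k) = sum_{i=1..p} w_i(k) n(k-i); e(k) = x(k) - y(k);
    update w_i(k+1) = w_i(k) + 2*mu*e(k)*n(k-i). Returns the enhanced e."""
    w = [0.0] * p
    out = []
    for k in range(len(x_obs)):
        taps = [n_ref[k - i] if k - i >= 0 else 0.0 for i in range(1, p + 1)]
        y = sum(wi * ni for wi, ni in zip(w, taps))   # noise estimate
        e = x_obs[k] - y                              # enhanced output
        w = [wi + 2 * mu * e * ni for wi, ni in zip(w, taps)]
        out.append(e)
    return out

random.seed(0)
n_ref = [random.uniform(-1, 1) for _ in range(2000)]
x_obs = [0.8 * n_ref[k - 1] if k >= 1 else 0.0 for k in range(2000)]
e = lms_cancel(x_obs, n_ref)
# after convergence the residual is (almost) zero: the filter found G(z)
```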
Using a fixed delay instead of an adaptive filter (when the direction of the noise is known): Figure: the noise arrives from angle θ_N; the reference channel is delayed by d sin θ_N / c and subtracted from the speech channel.
– Noise suppression using adaptive filters

Figure: adaptive filters on the microphone channels, whose outputs are summed and subtracted from the observation.
– Without any constraints, the output becomes zero: the filters H_1, H_2 cancel the speech as well as the noise (figure: noise and speech paths through H_1 and H_2).
– Griffith-Jim array: constrain the filters so that the response to the target direction is preserved,

F(ω) = Σ_i G_i(ω) H_i(ω) = 1
Figure: Griffith-Jim array. The delayed-sum output contains the target; pairwise differences of the delayed channels contain noise only and drive the adaptive filters H_1, H_2, whose outputs are subtracted from the delayed-sum output.
– Estimate the noise spectrum by suppressing the target signal
– Subtract the noise spectrum from the observed spectrum
– Nonlinear processing
– No need to prepare the noise spectrum beforehand
– Effective for non-stationary noise
Figure: spectral subtraction with a microphone array. The |DFT|² of the delayed-sum array output, minus the |DFT|² of the noise-only path (delayed differences), with nonlinear processing (overestimation, flooring).