Sound Media Engineering part II
Speech Information Processing
Akinori Ito Graduate School of Engineering, Tohoku Univ. aito@fw.ipsj.or.jp
Overview of the lecture
– Speech production, features of speech sound
– Basic codecs: PCM, DPCM, ADPCM
– Linear prediction of speech: Linear Prediction Coefficients, PARCOR coefficients and LSP
– CELP coding
– Audio coding
– Spectral subtraction
– Microphone array
Figure: organs of speech production (vocal cords, larynx, pharynx, tongue, gums, teeth, lips, nasal cavity); the passage from the vocal cords to the lips is the vocal tract.
Figure: the speech organs as instruments. The vocal cords, vocal tract, larynx, lips and nasal cavity determine the pitch of the voice, the linguistic content and the speaker's personality.
Figure (vocal cords, vocal tract, larynx, lips, nasal cavity): a speaker can control the shape of this part.
Figure (vocal cords, vocal tract, larynx, lips, nasal cavity): a speaker cannot control the shape of this part, nor the total length of the vocal tract.
Figure: the vowels /a/ /i/ /u/ /o/ /e/.
Figure: periodic speech waveform. Fundamental period T [s]; fundamental frequency F0 [Hz] = 1/T.
– Same phone = same vocal tract shape
– Completely different waveforms
– What is the same between these waveforms?
– Spectral shapes are similar → shape of the vocal tract
– "Jaggies" of the spectrum differ → fundamental frequency
Figure: speech spectrum with the fundamental frequency F0 and the formant frequencies F1, F2, F3, F4 marked.
– Handle with computer
– Transmission over digital lines
– Goals
– Methodology
– Observe the temporally continuous signal at discrete times
– Period of the discrete observation: sampling frequency fs
– The original signal can be restored from the sampled data when the original signal only contains frequency components under fs/2 (sampling theorem)
– Round off the magnitude of the signal into discrete levels
– The distance between discrete levels: quantization step
– Difference between the original signal and the quantized signal: quantization error
– The sampling frequency is determined by the highest frequency in the sound
– Telephone: 8 kHz (up to 4 kHz sound)
– High-quality speech: 16 kHz (up to 8 kHz sound)
– CD: 44.1 kHz (up to 22.05 kHz sound)
– The quantization is determined by the range of the sound
– To code speech is to quantize speech
– Represent the quantized values as binary numbers
– How many bits to use for one sample
– How to determine the levels of quantization
– CD: 16-bit linear quantization
– VoIP (G.711): 8-bit nonlinear quantization
Figure: linear quantization. CD: quantize in 16 bits (−32768 to +32767).
→ Total error can be reduced by finely quantizing values around zero (figure: nonlinear quantization levels).
– 8 kHz sampling, 8-bit nonlinear quantization
– μ-law (Japan, US), A-law (Europe)
– μ-law: 14-bit linear quantization → 8-bit nonlinear quantization

Y = 128 · sign(X) · log(1 + 255|X|/8192) / log(256)

Figure: the μ-law curve (8-bit output versus 14-bit linear input).
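As a sketch, the compression curve above can be implemented directly (function and parameter names are mine; the real G.711 codec uses a piecewise-linear approximation of this curve):

```python
import math

def mu_law_encode(x, mu=255, x_max=8192):
    """Compress a 14-bit linear sample X to an 8-bit value with the
    slide's curve: Y = 128 * sign(X) * log(1 + 255|X|/8192) / log(256)."""
    y = 128 * math.copysign(1.0, x) * math.log(1 + mu * abs(x) / x_max) / math.log(1 + mu)
    return int(round(y))

def mu_law_decode(y, mu=255, x_max=8192):
    """Invert the compression curve (up to quantization error)."""
    x = math.copysign(1.0, y) * (x_max / mu) * (math.exp(abs(y) / 128 * math.log(1 + mu)) - 1)
    return int(round(x))
```

Small amplitudes get fine steps and large amplitudes coarse ones, which is exactly the nonlinear quantization motivated above.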
Contiguous samples do not differ very much → reduce the bit-rate by transmitting the differences of samples.
Figure: original waveform and difference waveform.
– Use more sophisticated prediction rather than the simple difference
– Adaptively change the quantization step: when the current difference is large, the difference to the next sample is likely to be large too; when it is small, the next difference is likely to be small too
Figure: ADPCM encoder and decoder. x(k): input signal; xe(k): prediction signal; d(k): differential signal; I(k): output of the adaptive quantizer (the ADPCM output); dq(k): quantized differential signal from the adaptive de-quantizer; xr(k): reconstructed signal; the predictor closes the loop.
1. Compute the prediction signal xe(k)
2. Compute the difference d(k) = x(k) − xe(k)
3. Quantize: I(k) = Q[d(k)] (ADPCM output)
4. De-quantize: dq(k) = Q⁻¹[I(k)]
5. Reconstruct the signal: xr(k) = xe(k) + dq(k)
6. Compute the next prediction: xe(k+1) = pred(xr(k), dq(k), …)
Prediction of the next sample from the reconstructed signal and the quantized difference:
– DPCM: xe(k) = xr(k−1)
– A little better way: xe(k) = 2 xr(k−1) − xr(k−2)
– G.726: xe(k) = Σ_{i=1}^{2} a_i xr(k−i) + Σ_{i=1}^{6} b_i dq(k−i)
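The six steps can be sketched with the simplest predictor xe(k) = xr(k−1) and a uniform quantizer (the step size and names are my choices):

```python
def dpcm_encode(x, step=4):
    """DPCM with predictor xe(k) = xr(k-1) and a uniform quantizer.
    Returns the quantizer indices I(k)."""
    codes, xe = [], 0          # prediction starts at 0
    for xk in x:
        d = xk - xe            # 2. difference
        i = round(d / step)    # 3. quantize
        dq = i * step          # 4. de-quantize
        xe = xe + dq           # 5./6. reconstruction = next prediction
        codes.append(i)
    return codes

def dpcm_decode(codes, step=4):
    out, xr = [], 0
    for i in codes:
        xr = xr + i * step     # mirror of steps 4-5 in the encoder
        out.append(xr)
    return out
```

Because the encoder predicts from the reconstructed signal (not the original), encoder and decoder stay in sync and quantization errors do not accumulate.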
Figure: adaptive quantization step. When the quantized difference is small ("blue"), the next scale is half the size; when it is large ("red"), the next scale is double the size.
– DPCM and ADPCM only partly exploit the properties of the input signal
→ We can enhance the efficiency of coding by considering the properties of human speech
Figure: levels of speech representation (digital data, speech features, phones, words/sentences, semantics) and the corresponding coders: AD/DA and PCM coder (public phone), CELP coder (mobile phone), vocoder, speech synthesis / Text-to-Speech; coding at the word or semantic level (a "summarizing telephone") is under research.
Source-filter model of speech:

X(ω) = S(ω) T(ω) R(ω)

where S(ω) is the source (vocal cords), T(ω) the transfer characteristic of the vocal tract (larynx, nasal cavity), and R(ω) the radiation characteristic from the lips.
– Spectral shape: parameters of the linear prediction filter
– Vocal cord vibration: residual
– In the spectral domain:

x(k) = −Σ_{i=1}^{p} a_i x(k−i) + e(k)

X(ω) = E(ω) / (1 + Σ_{n=1}^{p} a_n e^{−jnω}) = E(ω) H(ω), corresponding to S(ω) T(ω) R(ω)

Estimate the coefficients so as to minimize the residual.
– Parameters: LP coefficients a_i and residual e(k)
– Estimate a_i for a fixed number of samples (a block)
– Calculate e(k) using the estimated a_i and the LPC formula x(k) = −Σ_{i=1}^{p} a_i x(k−i) + e(k)
– Transmit a_i and e(k) as the parameters of the block
– Solve a simultaneous equation (Yule-Walker equation) → the LPC are calculated as the least-error solution
– A faster algorithm exists (Levinson-Durbin algorithm)

    −[ x(k−1)  x(k−2)  ⋯  x(k−p)   ] [a_1]   [x(k)  ]   [e(k)  ]
     [ x(k−2)  x(k−3)  ⋯  x(k−p−1) ] [a_2] = [x(k−1)] + [e(k−1)]
     [   ⋮       ⋮     ⋱    ⋮      ] [ ⋮ ]   [  ⋮   ]   [  ⋮   ]
     [ x(p−1)  x(p−2)  ⋯  x(1)     ] [a_p]   [x(p)  ]   [e(p)  ]
Writing the prediction equations as FA = −V + E, minimize |FA + V|²:

FᵀF A = −FᵀV, with Φ_ij = (FᵀF)_ij and Φ_0j = (FᵀV)_j

    [ Φ_11  Φ_12  ⋯  Φ_1p ] [a_1]     [Φ_01]
    [ Φ_21  Φ_22  ⋯  Φ_2p ] [a_2] = − [Φ_02]
    [  ⋮     ⋮    ⋱   ⋮   ] [ ⋮ ]     [ ⋮  ]
    [ Φ_p1  Φ_p2  ⋯  Φ_pp ] [a_p]     [Φ_0p]

(Yule-Walker equation)
– The matrix is in a special form (symmetric Toeplitz matrix)
– A quick solution algorithm exists (Levinson-Durbin algorithm)

Φ_ij = Σ_{n=p}^{N−1} y(n−i) y(n−j), approximated by the autocorrelation

Φ_ij = r(|i−j|) = Σ_{n=0}^{N−|i−j|−1} y(n) y(n+|i−j|)
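A minimal sketch of the Levinson-Durbin recursion under the autocorrelation form above, in the slides' sign convention x(k) = −Σ a_i x(k−i) + e(k) (variable names are mine):

```python
def levinson_durbin(r, p):
    """Solve the Yule-Walker system for the symmetric Toeplitz matrix built
    from autocorrelations r[0..p] in O(p^2) instead of O(p^3)."""
    a = [0.0] * (p + 1)   # a[1..p]: LP coefficients (a[0] unused)
    ks = []               # PARCOR (reflection) coefficients
    err = r[0]            # prediction error power
    for i in range(1, p + 1):
        k = (r[i] + sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = -k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= (1.0 - k * k)   # error power shrinks at every order
        ks.append(k)
    return a[1:], ks
```

For an AR(1)-like autocorrelation such as r = [1, 0.9, 0.81], the recursion returns a_1 = −0.9 and a_2 = 0: the second-order term adds nothing, and the first PARCOR coefficient equals the lag-1 correlation 0.9.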
– Re-synthesis by the LPC formula could be unstable → the output signal eventually oscillates when the a_i have quantization errors
– Transmit parameters that are equivalent to LPC but stable against quantization errors
PARCOR coefficients:

k_i = Σ_{n=−∞}^{∞} ε^(i−1)(n) η^(i−1)(n) / √( Σ_{n=−∞}^{∞} [ε^(i−1)(n)]² · Σ_{n=−∞}^{∞} [η^(i−1)(n)]² )

Forward prediction error: ε^(i−1)(n) = x(n) + Σ_{j=1}^{i−1} a_j^(i−1) x(n−j)
Backward prediction error: η^(i−1)(n) = x(n−i) + Σ_{j=1}^{i−1} b_j^(i−1) x(n−j)

The PARCOR coefficient is the correlation of the forward prediction errors and the backward prediction errors.
Figure: lattice interpretation. The forward error at x(n) and the backward error at x(n−i) are formed with coefficients a_j and b_j from the samples x(n−1), …, x(n−i+1) in between; k_i is the correlation of the two errors.
First PARCOR coefficient:

k_1 = Σ_n ε^(0)(n) η^(0)(n) / √( Σ_n [ε^(0)(n)]² · Σ_n [η^(0)(n)]² ), with ε^(0)(n) = x(n), η^(0)(n) = x(n−1)

k_1 is the correlation coefficient between x(n−1) and x(n). As x(n−1) and x(n) have the same variance and zero mean,

x̂^(1)(n) = k_1 x(n−1), so a_1^(1) = −k_1
x̂^(1)(n−1) = k_1 x(n), so b_1^(1) = −k_1
Second PARCOR coefficient:

k_2 = Σ_n ε^(1)(n) η^(1)(n) / √( Σ_n [ε^(1)(n)]² · Σ_n [η^(1)(n)]² )

ε^(1)(n) = x(n) − k_1 x(n−1), η^(1)(n) = x(n−2) − k_1 x(n−1)

Here, as k_2 η^(1)(n) = −k_1 k_2 x(n−1) + k_2 x(n−2), the second-order prediction is

x̂^(2)(n) = k_1(1 − k_2) x(n−1) + k_2 x(n−2)

so a_1^(2) = −k_1(1 − k_2) = a_1^(1) − k_2 b_1^(1) and a_2^(2) = −k_2.
Similarly, b_2^(2) = b_1^(1) − k_2 a_1^(1) and b_1^(2) = −k_2.
In general,

a_j^(i) = a_j^(i−1) − k_i b_j^(i−1)
b_j^(i) = b_{j−1}^(i−1) − k_i a_{j−1}^(i−1)

(with the conventions a_0^(i−1) = 1, b_0^(i−1) = 0 for the constant terms and a_i^(i−1) = 0, b_i^(i−1) = 1 for the leading terms, so that a_i^(i) = −k_i and b_1^(i) = −k_i). We can calculate the LP coefficients from the PARCOR coefficients using this recurrence relation.
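Since the backward coefficients are just the forward ones in reverse order (b_j^(i) = a_{i+1−j}^(i)), the recurrence collapses to a few lines; a sketch converting PARCOR coefficients to LP coefficients:

```python
def parcor_to_lpc(ks):
    """Grow the LP coefficient vector order by order using
    a_j^(i) = a_j^(i-1) - k_i * b_j^(i-1), with a_i^(i) = -k_i and
    the backward coefficients b being the reversed forward ones."""
    a = []
    for k in ks:
        b = a[::-1]                                    # backward predictor
        a = [aj - k * bj for aj, bj in zip(a, b)] + [-k]
    return a
```

For ks = [k_1, k_2] this reproduces the order-2 results of the previous slide: a_1 = −k_1(1 − k_2) and a_2 = −k_2.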
– Representation of the LPC equation in the z-domain:

x(k) + Σ_{i=1}^{p} a_i x(k−i) = e(k)  ⇔  X(z) (1 + Σ_{i=1}^{p} a_i z^{−i}) = E(z), with A_p(z) = 1 + Σ_{i=1}^{p} a_i z^{−i}

– Decompose A_p(z) into P(z) and Q(z):

P(z) = A_p(z) − z^{−(p+1)} A_p(z^{−1})
Q(z) = A_p(z) + z^{−(p+1)} A_p(z^{−1})
A_p(z) = (P(z) + Q(z)) / 2
The roots of P(z) and Q(z) lie on the frequency axis (on the unit circle in the z-domain):

P(z) = (1 − z^{−1}) ∏_{i=2,4,…,p} (1 − 2 z^{−1} cos ω_i + z^{−2})
Q(z) = (1 + z^{−1}) ∏_{i=1,3,…,p−1} (1 − 2 z^{−1} cos ω_i + z^{−2})

z = cos ω_i ± j sin ω_i, i = 1, 2, …, p
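Because the roots of P(z) and Q(z) sit on the unit circle, the LSP frequencies ω_i can be read off as root angles; a sketch with numpy (p assumed even, function name mine):

```python
import numpy as np

def lsp_from_lpc(a):
    """LSP frequencies from LP coefficients a[1..p], A_p(z) = 1 + sum a_i z^-i."""
    c = np.concatenate(([1.0], a))            # coefficients of A_p
    c_rev = np.concatenate(([0.0], c[::-1]))  # coefficients of z^-(p+1) A_p(z^-1)
    c_pad = np.concatenate((c, [0.0]))
    P = c_pad - c_rev
    Q = c_pad + c_rev
    # roots lie on the unit circle; their angles are the LSP frequencies
    w = np.concatenate((np.angle(np.roots(P)), np.angle(np.roots(Q))))
    return np.sort(w[(w > 1e-9) & (w < np.pi - 1e-9)])  # drop z = 1, z = -1
```

For example, a stable filter with a = [−0.9, 0.2] factors into P(z) with cos ω = 0.05 and Q(z) with cos ω = 0.85, giving two interleaved frequencies.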
The LSP frequencies are interleaved on the frequency axis: 0 < ω_1 < ω_2 < ⋯ < ω_p < π.
– Basic coding scheme for mobile phones
– Analysis and synthesis based on LPC
– Transmit LSP coefficients and the residual
Figure: CELP encoder. LPC analysis produces quantized LSP coefficients; a code vector from the residue codebook, scaled by the gain codebook, excites the LPC synthesis filter; the code vector with the minimum weighted distance to the input is selected, and the bitstream is generated.
– LD-CELP (Low-Delay CELP)
– CS-ACELP (Conjugate Structure Algebraic CELP)
– RPE-LTP (Regular Pulse Excitation with Long Term Prediction)
– VSELP (Vector Sum Excited Linear Prediction)
– PSI-CELP (Pitch Synchronous Innovation CELP)
– ACELP (Algebraic CELP)
– We cannot make assumptions (as with speech) on the input signal
– Model-based coding (as for speech) cannot be used
– Split the input signal into low-frequency to high-frequency bands
– Change the quantization step frequency by frequency
Figure: generic transform audio coder. Frequency analysis (QMF, MDCT or wavelet), per-band quantization guided by auditory properties (psychoacoustic analysis), and bitstream generation with entropy coding (Huffman, arithmetic); the decoder restores the bitstream, de-quantizes, and converts back into the time domain.
– Sub-band ADPCM (48 to 64 kbit/s)
– Split the input signal into high and low bands with a QMF and encode each using ADPCM individually; the decoder applies ADPCM decoding and QMF synthesis
Most information concentrates in the low-frequency signal, and the total data amount is identical; the original is obtained by combining the low-frequency and high-frequency signals.

QMF split: y(i) = (x(2i) + x(2i+1)) / 2 (low band), z(i) = (x(2i) − x(2i+1)) / 2 (high band)
QMF synthesis: x(2i) = y(i) + z(i), x(2i+1) = y(i) − z(i)
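The split/synthesis pair above is only a couple of lines; a sketch:

```python
def qmf_split(x):
    """One-stage QMF analysis: averages (low band) and differences
    (high band) of sample pairs."""
    y = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]  # low
    z = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]  # high
    return y, z

def qmf_synth(y, z):
    """Perfect reconstruction: x(2i) = y(i)+z(i), x(2i+1) = y(i)-z(i)."""
    x = []
    for yi, zi in zip(y, z):
        x += [yi + zi, yi - zi]
    return x
```

Note the total data amount really is identical: n input samples become n/2 low-band plus n/2 high-band samples.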
MPEG audio encoding:
– Layer 1 (MP1), layer 2 (MP2), layer 3 (MP3)
– Frequency analysis, psychoacoustic model

Figure: MPEG audio coder. A polyphase filter bank/MDCT and an FFT-based psychoacoustic model drive the quantizers and the bitstream generation; the decoder restores the bitstream, de-quantizes, and converts back to the time domain.
– Frequency analysis by a polyphase filter bank (32 frequency bands)
– Normalization and nonlinear scalar quantization in every 12 samples; the block average power is also scalar-quantized
Layer 3 (MP3): frequency analysis by the polyphase filter bank (32 bands) followed by MDCT, nonlinear scalar quantization, and Huffman coding into the bitstream.
The MDCT converts n points of time-domain signal into n/2 points of frequency-domain signal:

X(m) = Σ_{k=0}^{n−1} f(k) x(k) cos{ (2π/n) · ((2k + 1 + n/2)/2) · ((2m + 1)/2) }

x(k) = (4 f(k)/n) Σ_{m=0}^{n/2−1} X(m) cos{ (2π/n) · ((2k + 1 + n/2)/2) · ((2m + 1)/2) }
The original signal is restored by overlap-add of the temporally overlapping data: MDCT → IMDCT → overlap-add.
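A direct (O(n²)) sketch of the transform pair and the overlap-add reconstruction, using the sine window f(k) = sin(π(k + 0.5)/n) (my choice; it makes the overlapped halves add back exactly):

```python
import math

def mdct(x, f):
    """n time samples -> n/2 frequency samples (the formula above)."""
    n = len(x); N = n // 2
    return [sum(f[k] * x[k] * math.cos(math.pi / (2 * n) * (2 * k + 1 + N) * (2 * m + 1))
                for k in range(n)) for m in range(N)]

def imdct(X, f):
    """n/2 frequency samples -> n time samples (with time-domain aliasing)."""
    N = len(X); n = 2 * N
    return [(4 * f[k] / n) * sum(X[m] * math.cos(math.pi / (2 * n) * (2 * k + 1 + N) * (2 * m + 1))
                                 for m in range(N)) for k in range(n)]

# overlap-add: blocks of n samples advance by n/2
n = 8; N = n // 2
f = [math.sin(math.pi * (k + 0.5) / n) for k in range(n)]  # analysis/synthesis window
x = [0.3, -1.0, 0.8, 0.5, -0.2, 0.9, -0.7, 0.1, 0.4, -0.6, 0.2, 0.0]
out = [0.0] * len(x)
for start in range(0, len(x) - N, N):
    y = imdct(mdct(x[start:start + n], f), f)
    for k in range(n):
        out[start + k] += y[k]
# samples covered by two overlapping blocks are restored exactly
```

Each IMDCT block alone is aliased; only the sum of two overlapping blocks cancels the aliasing, which is why the middle samples of `out` match `x`.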
– Extract specific speech from an input signal that contains the target speech and other noise
– It is generally difficult: some kind of assumption is needed
– Single-channel case
– Multiple-channel case
Single-channel case: only the mixed signal y is observed (x is unknown), and the spectrum of n is known:

y(t) = x(t) + n(t), Y(ω) = X(ω) + N(ω)
Estimate X by a linear filter W:

X̂(ω) = W(ω) Y(ω) = W(ω) (X(ω) + N(ω))

Minimize the error |X − X̂|²:

Σ_t Σ_{i=0}^{N−1} |X_i(t) − X̂_i(t)|² = Σ_t Σ_{i=0}^{N−1} |X_i(t) − W_i (X_i(t) + N_i(t))|² → min
If X(t) and N(t) have no correlation:

∂/∂W_i Σ_t |X_i(t) − W_i (X_i(t) + N_i(t))|² = 0
⇒ Σ_t [ −2 X_i(t) (X_i(t) + N_i(t)) + 2 W_i |X_i(t) + N_i(t)|² ] = 0
⇒ W_i = Σ_t |X_i(t)|² / ( Σ_t |X_i(t)|² + Σ_t |N_i(t)|² ), using Σ_t |X_i(t) N_i(t)| ≈ 0

The Wiener filter.
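The resulting per-band gain is easy to compute when the average powers are known (array names are mine):

```python
import numpy as np

def wiener_gain(sig_pow, noise_pow):
    """W_i = S_i / (S_i + N_i): bands dominated by noise are suppressed,
    bands dominated by signal pass almost unchanged."""
    return sig_pow / (sig_pow + noise_pow)

S = np.array([10.0, 1.0, 0.1])   # average signal power per band
N = np.array([1.0, 1.0, 1.0])    # average noise power per band
W = wiener_gain(S, N)            # high SNR band -> gain near 1, low SNR -> near 0
```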
Properties of the Wiener filter:
– The average spectra of the signal and the noise are assumed known
– The signal and the noise have no correlation
– Suppresses frequency bands with large noise power; minimizes E[|X − X̂|²] on average:

W_i (X_i(t) + N_i(t)) = [ Σ_t |X_i(t)|² / ( Σ_t |X_i(t)|² + Σ_t |N_i(t)|² ) ] · (X_i(t) + N_i(t))
Figure: example speech and noise power spectra (0 to 9 kHz) and the resulting Wiener filter gain between 0 and 1.
Spectral subtraction works on the short-time spectrum:
– Signal: X_i(t)
– Noise: N_i(t)
– Observed signal: Y_i(t) = X_i(t) + N_i(t)
Power spectrum of the observed signal:

|Y_i(t)|² = |X_i(t) + N_i(t)|² ≤ |X_i(t)|² + 2 |X_i(t) N_i(t)| + |N_i(t)|²

The signal is assumed to have no correlation with the noise:

|X_i(t) N_i(t)| ≪ |X_i(t)|² + |N_i(t)|²  ⇒  |X_i(t)|² ≈ |Y_i(t)|² − |N_i(t)|²

The noise is assumed to be stationary, |N_i(t)|² = N_i², hence

|X_i(t)|² ≈ |Y_i(t)|² − N_i²
– The noise spectrum must be prepared beforehand, e.g. estimated from a silent part before the voice
– The enhanced spectrum keeps the phase of the observation:

X̂_i(t) ≈ √( (|Y_i(t)|² − |N_i(t)|²) / |Y_i(t)|² ) · Y_i(t)
The subtracted power can become negative. Solution by flooring:

|X̂_i(t)|² ≈ |Y_i(t)|² − N_i²   if |Y_i(t)|² > N_i²
             β |Y_i(t)|²        otherwise, with 0 < β ≪ 1

With overestimation of the noise:

|X̂_i(t)|² ≈ |Y_i(t)|² − α N_i²   if |Y_i(t)|² > α N_i²
             β |Y_i(t)|²          otherwise, with α > 1
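The rule above, per frequency bin and frame, with overestimation factor α and flooring factor β (variable names are mine):

```python
import numpy as np

def spectral_subtract(Y, N_pow, alpha=2.0, beta=0.01):
    """Power spectral subtraction with overestimation (alpha > 1) and
    flooring (0 < beta << 1). Y: complex spectrum of one observed frame;
    N_pow: noise power estimate per bin."""
    Y_pow = np.abs(Y) ** 2
    X_pow = Y_pow - alpha * N_pow
    X_pow = np.where(X_pow > 0, X_pow, beta * Y_pow)   # floor negative bins
    # scale the magnitude, keep the observed phase
    return np.sqrt(X_pow / np.maximum(Y_pow, 1e-12)) * Y
```

Bins where the observation exceeds the (overestimated) noise are subtracted; the rest are floored to a small fraction of the observed power instead of going negative.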
Figures: spectrograms and sound examples of the speech, the noise, the speech with noise, and the enhanced speech.
Microphone array (multiple-channel case):
– Spatial information can be used
– Linear processing
– Nonlinear processing
Emphasize the sound from a specific angle using multiple microphones:
– The sound is assumed to be a plane wave

Figure: microphones spaced d apart; a wave arriving from angle θ reaches adjacent microphones with path difference d sin θ.
Figure: with microphone spacing d and incidence angle θ, the four microphones observe

sin ωt, sin ω(t − d sin θ/c), sin ω(t − 2d sin θ/c), sin ω(t − 3d sin θ/c)
Delay the channels (by 3d sin θ/c, 2d sin θ/c, d sin θ/c and 0, plus a common delay τ) so that every channel becomes

sin ω(t − 3d sin θ/c − τ)
Adding the four delayed signals gives

4 sin ω(t − 3d sin θ/c − τ)

so the sound from the steered direction is amplified by the number of microphones.
For a sound arriving from a different angle φ, the delayed sum becomes

Σ_{n=0}^{3} sin ω(t − (n d sin φ + (3−n) d sin θ)/c − τ)

and the terms no longer align, so the sound is attenuated.
Figure: directivity gain versus incidence angle (rad) for n = 4, d = 1, plotted for ω = 5, 10 and 50; the main lobe narrows as ω grows.
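The directivity curve can be reproduced by summing the phasors of the delayed channels (c is normalized to 1 to match the figure's units; parameter names are mine):

```python
import cmath, math

def array_gain(phi, theta=0.0, n=4, d=1.0, w=10.0, c=1.0):
    """Normalized gain of an n-microphone delay-and-sum array steered to
    angle theta, for a plane wave arriving from angle phi."""
    s = sum(cmath.exp(1j * w * k * d * (math.sin(phi) - math.sin(theta)) / c)
            for k in range(n))
    return abs(s) / n
```

array_gain(0.0) is 1.0 (the steered direction adds coherently), off-axis angles give smaller gains, and sweeping phi over [−π/2, π/2] for w = 5, 10, 50 reproduces the narrowing main lobe of the figure.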
– Fast calculation
– Easy hardware realization
– The main lobe width becomes narrower as the frequency becomes higher
– Spatial aliasing
Noise suppression with a reference microphone: one microphone observes speech + noise, another observes the noise only; signal processing outputs the enhanced speech.
– The reference n(k) is not the noise signal actually mixed into x(k), so we can't subtract n(k) from x(k) directly → use a filter W(z)

Figure: the noise reaches the speech microphone through an unknown path G(z), giving the observation x(k); the reference n(k) is filtered by W(z) into y(k), and e(k) = x(k) − y(k). W(z) is updated so that the power of the output signal becomes minimum.
y(k) = Σ_{i=1}^{p} w_i(k) n(k−i)
w_i(k+1) = w_i(k) + 2μ e(k) n(k−i)

where w_i(k) is the i-th filter coefficient at time k and μ is the step size. (Many other algorithms have been developed.)
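A sketch of the whole canceller; here the observation is simulated as the reference noise passed through a one-tap path G(z) = 0.8 z⁻¹, which the adaptive weights should learn (the simulation setup and names are mine):

```python
import random

def lms_cancel(x_obs, n_ref, p=4, mu=0.05):
    """y(k) = sum_{i=1..p} w_i(k) n(k-i); e(k) = x(k) - y(k);
    update w_i(k+1) = w_i(k) + 2*mu*e(k)*n(k-i). Returns the enhanced e."""
    w = [0.0] * p
    out = []
    for k in range(len(x_obs)):
        taps = [n_ref[k - i] if k - i >= 0 else 0.0 for i in range(1, p + 1)]
        y = sum(wi * ni for wi, ni in zip(w, taps))   # noise estimate
        e = x_obs[k] - y                              # enhanced output
        w = [wi + 2 * mu * e * ni for wi, ni in zip(w, taps)]
        out.append(e)
    return out

random.seed(0)
n_ref = [random.uniform(-1, 1) for _ in range(2000)]
x_obs = [0.8 * n_ref[k - 1] if k >= 1 else 0.0 for k in range(2000)]
e = lms_cancel(x_obs, n_ref)
# after convergence the residual is (almost) zero: the filter found G(z)
```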
Using a fixed delay instead of an adaptive filter (when the direction of the noise is known): Figure: the noise arrives from angle θ_N; the reference channel is delayed by d sin θ_N / c and subtracted from the speech channel.
– Noise suppression using adaptive filters

Figure: adaptive filters on the microphone channels, whose outputs are summed and subtracted from the observation.
– Without any constraints, the output becomes zero: the filters H_1, H_2 cancel the speech as well as the noise (figure: noise and speech paths through H_1 and H_2).
– Griffith-Jim array: constrain the filters so that the response to the target direction is preserved,

F(ω) = Σ_i G_i(ω) H_i(ω) = 1
Figure: Griffith-Jim array. The delayed-sum output contains the target; pairwise differences of the delayed channels contain noise only and drive the adaptive filters H_1, H_2, whose outputs are subtracted from the delayed-sum output.
– Estimate the noise spectrum by suppressing the target signal
– Subtract the noise spectrum from the observed spectrum
– Nonlinear processing
– No need to prepare the noise spectrum beforehand
– Effective for non-stationary noise
Figure: spectral subtraction with a microphone array. The |DFT|² of the delayed-sum array output, minus the |DFT|² of the noise-only path (delayed differences), with nonlinear processing (overestimation, flooring).