SLIDE 18 Streaming synthesis using LSTMs
[Figure: streaming synthesis pipeline. Text analysis extracts phoneme-level linguistic features x(i) from the input text. The duration LSTM-RNN Λd predicts phoneme durations d̂(i); these are combined with x(i) to form frame-level linguistic features, which the acoustic LSTM-RNN Λa (recurrent layer + output layer) maps to acoustic features ŷ. A vocoder converts the acoustic features to the output waveform.]
Acoustic feature prediction and vocoding are done in a streaming fashion.
1: Perform text analysis over input text
2: Extract {x(i)}_{i=1}^{N}
3: for i = 1, ..., N do                      ▷ Loop over phonemes
4:     Predict d̂(i) given x(i) by Λd
5:     for τ = 1, ..., d̂(i) do               ▷ Loop over frames
6:         Compose x_τ(i) from x(i), τ, and d̂(i)
7:         Predict ŷ_τ(i) given x_τ(i) by Λa
8:         Synthesize waveform given ŷ_τ(i), then stream result
9:     end for
10: end for
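The algorithm above can be sketched as a generator that yields one waveform chunk per frame, so audio can be played back before the whole utterance is synthesized. This is a minimal illustration only: the duration model, acoustic model, and vocoder below are hypothetical stubs standing in for the paper's trained LSTM-RNNs (Λd, Λa) and real vocoder.

```python
def predict_duration(x_phoneme):
    # Stub for the duration LSTM-RNN (Lambda_d): returns a frame count d_hat.
    return 3

def predict_acoustics(x_frame):
    # Stub for the acoustic LSTM-RNN (Lambda_a): maps frame-level
    # linguistic features to acoustic features y_hat.
    return [float(v) for v in x_frame]

def vocode(y_frame):
    # Stub vocoder: produces one waveform chunk per frame of acoustic features.
    return [s * 0.1 for s in y_frame]

def synthesize_streaming(phoneme_features):
    """Yield waveform chunks frame by frame (streaming synthesis)."""
    for x_i in phoneme_features:              # loop over phonemes
        d_hat = predict_duration(x_i)         # predicted duration in frames
        for tau in range(1, d_hat + 1):       # loop over frames
            # Compose frame-level features from the phoneme-level features,
            # the frame position tau, and the predicted duration d_hat.
            x_tau = x_i + [tau / d_hat]
            y_tau = predict_acoustics(x_tau)
            yield vocode(y_tau)               # stream the chunk immediately

chunks = list(synthesize_streaming([[1.0, 2.0], [3.0, 4.0]]))
print(len(chunks))  # 2 phonemes x 3 frames each = 6 chunks
```

Because each chunk is yielded as soon as its frame is vocoded, playback latency is bounded by one frame of computation rather than the full utterance length.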
Image from Zen & Sak, Unidirectional LSTM RNNs for low-latency speech synthesis, 2015