Unifying Speech Recognition and Generation with Machine Speech Chain
Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
Nara Institute of Science & Technology, Nara, Japan; RIKEN AIP, Japan
26-Mar-19 Andros Tjandra @ AHCLab, NAIST, Japan 1
Property | ASR | TTS
Speech features | MFCC, Mel-fbank | MGC, log F0, Voice/Unvoice, BAP
Text features | Phoneme, Character | Phoneme + POS + LEX (full context label)
Model | GMM-HMM, Hybrid DNN/HMM, End-to-end ASR | GMM-HSMM, DNN-HSMM, End-to-end TTS
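Notation: $x$ = speech, $y$ = text, $\hat{x}$ = predicted speech, $\hat{y}$ = predicted text
ASR: $x \rightarrow \hat{y}$ (seq2seq model that transforms speech to text)
TTS: $y \rightarrow \hat{x}$ (seq2seq model that transforms text to speech)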
[Figure: ASR maps speech $x$ to predicted text $\hat{y}$ and is trained with $Loss_{ASR}(y, \hat{y})$; TTS maps text $y$ to predicted speech $\hat{x}$ and is trained with $Loss_{TTS}(x, \hat{x})$.]
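Speech only: ASR transcribes $x$ into $\hat{y}$, and TTS conditioned on $\hat{y}$ tries to reconstruct the speech features $\hat{x}$.
Calculate $Loss_{TTS}(x, \hat{x})$ between the original $x$ and the predicted $\hat{x}$.
[Figure: $x$ → ASR → $\hat{y}$ = "text" → TTS → $\hat{x}$, with $Loss_{TTS}(x, \hat{x})$.]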
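Text only: TTS generates $\hat{x}$ from $y$, and ASR conditioned on $\hat{x}$ tries to reconstruct the text $\hat{y}$.
Calculate $Loss_{ASR}(y, \hat{y})$ between the original $y$ and the predicted $\hat{y}$.
[Figure: $y$ = "text" → TTS → $\hat{x}$ → ASR → $\hat{y}$ = "text", with $Loss_{ASR}(y, p_y)$.]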
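The training cases above (paired data, speech only, text only) can be sketched as one routine that picks the applicable losses. This is a minimal, framework-free sketch, not the authors' implementation: `asr`, `tts`, `loss_asr`, and `loss_tts` are hypothetical callables standing in for the two seq2seq models and their losses, and the toy stand-ins at the bottom exist only to make the example runnable.

```python
from typing import Callable, Dict, List, Optional

Speech = List[float]  # hypothetical stand-in for a sequence of speech features
Text = str


def speech_chain_losses(
    asr: Callable[[Speech], Text],
    tts: Callable[[Text], Speech],
    loss_asr: Callable[[Text, Text], float],
    loss_tts: Callable[[Speech, Speech], float],
    x: Optional[Speech] = None,
    y: Optional[Text] = None,
) -> Dict[str, float]:
    """Return the losses that apply to one example, given which of x, y exist."""
    losses: Dict[str, float] = {}
    if x is not None and y is not None:
        # Paired speech and text: supervise both models directly.
        losses["asr"] = loss_asr(y, asr(x))
        losses["tts"] = loss_tts(x, tts(y))
    elif x is not None:
        # Speech only: ASR transcribes x, TTS reconstructs x from the transcript.
        y_hat = asr(x)
        losses["tts"] = loss_tts(x, tts(y_hat))
    elif y is not None:
        # Text only: TTS synthesizes y, ASR reconstructs y from the synthesis.
        x_hat = tts(y)
        losses["asr"] = loss_asr(y, asr(x_hat))
    return losses


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_tts(y: Text) -> Speech:
    return [float(ord(ch)) for ch in y]


def toy_asr(x: Speech) -> Text:
    return "".join(chr(int(round(v))) for v in x)


def toy_loss_asr(y: Text, y_hat: Text) -> float:
    mismatches = sum(a != b for a, b in zip(y, y_hat)) + abs(len(y) - len(y_hat))
    return mismatches / max(len(y), len(y_hat), 1)


def toy_loss_tts(x: Speech, x_hat: Speech) -> float:
    n = max(min(len(x), len(x_hat)), 1)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / n


text_only = speech_chain_losses(toy_asr, toy_tts, toy_loss_asr, toy_loss_tts, y="text")
speech_only = speech_chain_losses(
    toy_asr, toy_tts, toy_loss_asr, toy_loss_tts, x=toy_tts("text")
)
```

In a real system the returned losses would be backpropagated into the model being reconstructed against (TTS for speech-only input, ASR for text-only input); how the paired and unpaired terms are weighted is a design choice that this sketch leaves open.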
Input & output
Model states
$h_{[1..S]}^e$ = encoder states
$h_t^d$ = decoder state at time $t$
$a_t(s) = \frac{\exp(\mathrm{Score}(h_s^e, h_t^d))}{\sum_{s'=1}^{S} \exp(\mathrm{Score}(h_{s'}^e, h_t^d))}$
$c_t = \sum_{s=1}^{S} a_t(s) \cdot h_s^e$ (expected context)
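The attention weights and expected context above can be sketched in a few lines of plain Python. The slide does not say which Score function is used, so the dot product here (`dot_score`) is an assumption, and the function names are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def dot_score(h_enc: Vector, h_dec: Vector) -> float:
    # Assumed scoring function; the slides only name it "Score".
    return sum(a * b for a, b in zip(h_enc, h_dec))


def attention(
    enc_states: List[Vector],
    dec_state: Vector,
    score: Callable[[Vector, Vector], float] = dot_score,
) -> Tuple[List[float], Vector]:
    """Return (a_t, c_t): softmax weights over encoder states and their weighted sum."""
    scores = [score(h, dec_state) for h in enc_states]
    top = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(enc_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, enc_states)) for i in range(dim)
    ]
    return weights, context


weights, context = attention([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With a zero decoder state every score is equal, so the weights are uniform and the context is the mean of the encoder states.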
Loss function
$\mathcal{L}_{ASR}(y, p_y) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{c \in [1..C]} \mathbb{1}(y_t = c) \cdot \log p_{y_t}[c]$
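As a worked illustration of the ASR loss, the indicator picks out the log-probability of the reference class at each step, so the double sum reduces to an average negative log-likelihood. The sketch below assumes class indices start at 0 (the slide uses 1..C) and that the sequence is non-empty; `asr_loss` is a hypothetical name.

```python
import math
from typing import Sequence


def asr_loss(y: Sequence[int], p_y: Sequence[Sequence[float]]) -> float:
    """Average negative log-likelihood of reference classes y under p_y.

    y[t] is the reference class index at step t (0-based here).
    p_y[t][c] is the predicted probability of class c at step t.
    """
    if len(y) != len(p_y):
        raise ValueError("y and p_y must have the same length")
    num_steps = len(y)
    # The indicator 1(y_t = c) keeps only the term where c equals y_t.
    return -sum(math.log(p_y[t][y[t]]) for t in range(num_steps)) / num_steps


example_loss = asr_loss([0, 1], [[0.5, 0.5], [0.25, 0.75]])
```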
Input & output
Model states
$h_{[1..T]}^e$ = encoder states
$h_s^d$ = decoder state at time $s$
$c_s = \sum_{t=1}^{T} a_s(t) \cdot h_t^e$ (expected context)
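Loss function
$\mathcal{L}_{TTS1}(x, \hat{x}) = \frac{1}{S} \sum_{s=1}^{S} \left[ (x_s^M - \hat{x}_s^M)^2 + (x_s^R - \hat{x}_s^R)^2 \right]$
$\mathcal{L}_{TTS2}(b, \hat{b}) = -\frac{1}{S} \sum_{s=1}^{S} \left[ b_s \log(\hat{b}_s) + (1 - b_s) \log(1 - \hat{b}_s) \right]$
$\mathcal{L}_{TTS}(x, \hat{x}, b, \hat{b}) = \mathcal{L}_{TTS1}(x, \hat{x}) + \mathcal{L}_{TTS2}(b, \hat{b})$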
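The TTS loss can be sketched as a squared-error term over two spectrogram targets plus a binary cross-entropy term on $b$. Several points are assumptions rather than facts from the slides: the two targets are taken to be the mel and raw spectrograms (matching the Mel and Raw columns of the results tables), the squared difference is summed over feature dimensions within each frame, and `eps` clipping is added only to keep `log` finite.

```python
import math
from typing import Sequence

Frame = Sequence[float]


def squared_error(u: Frame, v: Frame) -> float:
    # Assumption: (x_s - x_hat_s)^2 is summed over the feature dimensions.
    return sum((a - b) ** 2 for a, b in zip(u, v))


def tts_loss(
    x_mel: Sequence[Frame],
    x_mel_hat: Sequence[Frame],
    x_raw: Sequence[Frame],
    x_raw_hat: Sequence[Frame],
    b: Sequence[float],
    b_hat: Sequence[float],
    eps: float = 1e-7,
) -> float:
    """L_TTS = L_TTS1 (squared error on both targets) + L_TTS2 (BCE on b)."""
    num_frames = len(b)
    l_tts1 = sum(
        squared_error(x_mel[s], x_mel_hat[s]) + squared_error(x_raw[s], x_raw_hat[s])
        for s in range(num_frames)
    ) / num_frames

    def clip(p: float) -> float:
        # Not in the slides: keeps log() away from 0 and 1.
        return min(max(p, eps), 1.0 - eps)

    l_tts2 = -sum(
        b[s] * math.log(clip(b_hat[s])) + (1.0 - b[s]) * math.log(1.0 - clip(b_hat[s]))
        for s in range(num_frames)
    ) / num_frames
    return l_tts1 + l_tts2


example_tts_loss = tts_loss(
    x_mel=[[1.0, 2.0]], x_mel_hat=[[1.0, 1.0]],
    x_raw=[[0.0]], x_raw_hat=[[2.0]],
    b=[1.0], b_hat=[0.5],
)
```

In this example the squared-error term is 1 + 4 = 5 and the cross-entropy term is $-\log 0.5 = \log 2$.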
Waveform reconstruction: predict the phase, then apply the inverse STFT.
gTTS library)
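Data | $\alpha$ | $\beta$ | gen. mode | ASR CER (%) | TTS Mel | TTS Raw | TTS Acc (%)
Paired (10k) | - | - | - | [missing] | 7.07 | 9.38 | 97.7
+Unpaired (40k) | 0.25 | 1 | greedy | 5.83 | 6.21 | 8.49 | 98.4
+Unpaired (40k) | 0.5 | 1 | greedy | 5.75 | 6.25 | 8.42 | 98.4
+Unpaired (40k) | 0.25 | 1 | beam 5 | 5.44 | 6.24 | 8.44 | 98.3
+Unpaired (40k) | 0.5 | 1 | beam 5 | 5.77 | 6.20 | 8.44 | 98.3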
unpaired)
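Data | $\alpha$ | $\beta$ | gen. mode | ASR CER (%) | TTS Mel | TTS Raw | TTS Acc (%)
Paired (80 utt/spk) | - | - | - | [missing] | 10.21 | 13.18 | 98.6
+Unpaired (remaining) | 0.25 | 1 | greedy | 23.03 | 9.14 | 12.86 | 98.7
+Unpaired (remaining) | 0.5 | 1 | greedy | 20.91 | 9.31 | 12.88 | 98.6
+Unpaired (remaining) | 0.25 | 1 | beam 5 | 22.55 | 9.36 | 12.77 | 98.6
+Unpaired (remaining) | 0.5 | 1 | beam 5 | 19.99 | 9.20 | 12.84 | 98.6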
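The ASR column in these tables reports character error rate (CER). The slides do not say how it was computed, so the sketch below uses the common definition as an assumption: total character-level edit distance divided by total reference length, as a percentage. `edit_distance` and `cer` are hypothetical helper names.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # delete a reference character
                    curr[j - 1] + 1,         # insert a hypothesis character
                    prev[j - 1] + (r != h),  # substitute, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level character error rate in percent."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return 100.0 * errors / total_chars


example_cer = cer(["texts"], ["tex"])
```

Here the hypothesis "tex" is missing two of the five reference characters in "texts", so the CER is 40%.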