AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
Kaizhi Qian* (1), Yang Zhang* (2, 3), Shiyu Chang (2, 3), Xuesong Yang (1), Mark Hasegawa-Johnson (1)
(1) University of Illinois at Urbana-Champaign; (2) MIT-IBM Watson AI Lab; (3) IBM Research Cambridge
Training
[Diagram: Content Encoder, Speaker Encoder, Decoder. The content encoder maps speech X_1 to a content code C_1. The speaker encoder maps speech X_1 from the same speaker to a speaker embedding S_1. The decoder combines C_1 and S_1 into the self-reconstruction \hat{X}_{1\to1}, which is passed through the content encoder again to give \hat{C}_{1\to1}.]
Training loss:
\min \; \mathbb{E}\left[ \left\| \hat{X}_{1\to1} - X_1 \right\|_2^2 + \lambda \left\| \hat{C}_{1\to1} - C_1 \right\|_1 \right]

Conversion
[Diagram: the same three modules. The content encoder maps source speech X_1 to C_1. The speaker encoder maps target speech X_2 to S_2. The decoder combines C_1 and S_2 into the converted speech \hat{X}_{1\to2}.]
Source and target speakers can be seen OR unseen during training.
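The training and conversion paths on these slides can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the three `toy_*` modules are hypothetical stand-ins (the speaker "embedding" is just the mean of the frame, the content code is the mean-removed frame), the function names are invented for this example, and the default weight `lam=1.0` is an arbitrary choice. Only the structure of the loss and of the conversion step follows the slides.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared L2 distance, used for the reconstruction term."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance, used for the content-code term."""
    return sum(abs(x - y) for x, y in zip(a, b))


def autovc_loss(
    batch: Sequence[Tuple[Vector, Vector]],
    content_enc: Callable[[Vector], Vector],
    speaker_enc: Callable[[Vector], Vector],
    decoder: Callable[[Vector, Vector], Vector],
    lam: float = 1.0,  # assumption: weight of the content term
) -> float:
    """Batch mean of ||X_hat - X1||_2^2 + lam * ||C_hat - C1||_1.

    Each batch item is (x1, x1_ref): two utterances of the SAME speaker.
    """
    total = 0.0
    for x1, x1_ref in batch:
        c1 = content_enc(x1)          # content code C_1
        s1 = speaker_enc(x1_ref)      # speaker embedding S_1
        x_hat = decoder(c1, s1)       # self-reconstruction X_hat_{1->1}
        c_hat = content_enc(x_hat)    # re-encoded content C_hat_{1->1}
        total += squared_l2(x_hat, x1) + lam * l1(c_hat, c1)
    return total / len(batch)


def convert(x_src, x_tgt, content_enc, speaker_enc, decoder) -> Vector:
    """Conversion: content of the source, speaker embedding of the target."""
    return decoder(content_enc(x_src), speaker_enc(x_tgt))


# Hypothetical toy modules, for illustration only.
def toy_speaker_encoder(x: Vector) -> Vector:
    return [sum(x) / len(x)]          # "speaker" = mean level


def toy_content_encoder(x: Vector) -> Vector:
    mean = sum(x) / len(x)
    return [v - mean for v in x]      # "content" = mean-removed frame


def toy_decoder(c: Vector, s: Vector) -> Vector:
    return [v + s[0] for v in c]      # add the speaker level back


x1 = [1.0, 2.0, 3.0]                  # toy "source" frame
x2 = [10.0, 12.0, 14.0]               # toy "target" frame
loss = autovc_loss([(x1, x1)], toy_content_encoder,
                   toy_speaker_encoder, toy_decoder)
converted = convert(x1, x2, toy_content_encoder,
                    toy_speaker_encoder, toy_decoder)
```

With these toy modules the self-reconstruction is exact, so the loss is zero, and the converted frame keeps the shape of `x1` while taking the mean level of `x2`. In a real system the three callables would be neural networks operating on spectrogram frames, and the loss would be minimized with respect to their parameters.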
[Figure: two bar charts, MOS and Similarity, for M2M, F2F, M2F and F2M conversions (male/female source to male/female target). Systems compared: AutoVC, AutoVC-one-hot, StarGAN (ref. 1), Chou et al. (ref. 2).]
1. StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
2. Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
[Audio samples: Source, Target, Converted]
[Figure: two bar charts, MOS and Similarity, for M2M, F2F, M2F and F2M conversions, grouped by setting: seen to seen, seen to unseen, unseen to seen, unseen to unseen.]
[Audio samples: Source, Target, Converted]